

Handwritten Digit Classification Practice

Attached file: digit_train.zip (8.91MB)

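Goals

- Build a model that classifies handwritten digits (0~9).

- Examine the prediction uncertainty of linear classification models

- Understand how image data is structured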

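# Core libraries for arrays, tables, and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the training CSV: column 0 is the label, the remaining columns are pixel values
digit_data = pd.read_csv('data/digit_train.csv')
digit_data.head()

# (number of rows, number of columns)
digit_data.shape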

 

EDA

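# First image: column 0 is the label, so take only the pixel columns
img0 = digit_data.iloc[0, 1:]

# Pixel intensity range of this image
print(img0.max())
print(img0.min())

# Distribution of pixel intensities
plt.hist(img0, bins = 255)
plt.show()

# Reshape the 784 pixel values into a 28x28 image and display it
plt.imshow(img0.values.reshape(28, 28))
# plt.imshow(img0.values.reshape(28, 28), cmap='gray')
plt.show()

# Another sample: the image stored in row 100
img100 = digit_data.iloc[100, 1:]
plt.imshow(img100.values.reshape(28, 28))
plt.show()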
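
Each row of the CSV stores one image as a flat run of 784 pixel values, which is why `reshape(28, 28)` is needed before plotting. A minimal sketch of that mapping, using a synthetic array in place of a real row (the `flat` values are made up for illustration) and assuming the usual row-major pixel order:

```python
import numpy as np

# Synthetic stand-in for one CSV row: 784 values, each equal to its own position.
flat = np.arange(28 * 28)

# reshape fills the 28x28 grid row by row (row-major / C order).
img = flat.reshape(28, 28)

# Pixel (row r, column c) of the image is element r * 28 + c of the flat row.
r, c = 3, 5
print(img.shape)
print(img[r, c], flat[r * 28 + c])

# Flattening again recovers the original row unchanged.
restored = img.reshape(-1)
print(np.array_equal(restored, flat))
```

So the first 28 values of a row form the top line of the image, the next 28 the second line, and so on.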

 

Extracting 5,000 images

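# Use the first 5,000 rows: pixel columns as features, column 0 as the label
X = digit_data.iloc[:5000, 1:]
y = digit_data.iloc[:5000, 0]
print(X.shape)
print(y.shape)

# Hold out 30% of the rows as a test set; random_state makes the split reproducible
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)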
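
With 5,000 rows and `test_size=0.3`, the split should leave 3,500 training rows and 1,500 test rows. The sketch below checks that arithmetic on a zero-filled placeholder array (`X_demo` and `y_demo` are dummies, not the real data) and shows that a fixed `random_state` reproduces the same split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the 5,000-image subset (values are dummies).
X_demo = np.zeros((5000, 784), dtype=np.uint8)
y_demo = np.arange(5000) % 10   # dummy labels 0-9

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7
)
print(X_tr.shape, X_te.shape)

# The same random_state selects the same rows in the same order.
_, _, _, y_te_again = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7
)
print(np.array_equal(y_te, y_te_again))
```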

 

Modeling (cross-validation)

- KNN

- Decision tree

- Logistic regression

- Linear SVM

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# All four models start with default hyperparameters.
# LogisticRegression and LinearSVC may emit a ConvergenceWarning on unscaled pixel values.
knn = KNeighborsClassifier()
tree = DecisionTreeClassifier()
logi = LogisticRegression()
svm = LinearSVC()

# 5-fold cross-validation on the training split; each result holds five accuracy scores
knn_result = cross_val_score(knn, X_train, y_train, cv=5)
tree_result = cross_val_score(tree, X_train, y_train, cv=5)
logi_result = cross_val_score(logi, X_train, y_train, cv=5)
svm_result = cross_val_score(svm, X_train, y_train, cv=5)

# Mean accuracy across the five folds
print('knn : ', knn_result.mean())
print('tree : ', tree_result.mean())
print('logi : ', logi_result.mean())
print('svm : ', svm_result.mean())

# A depth-limited tree for comparison with the unrestricted one
tree2 = DecisionTreeClassifier(max_depth=5)
tree_result2 = cross_val_score(tree2, X_train, y_train, cv=5)
print('tree(max_depth:5)', tree_result2.mean())
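
For intuition, `cross_val_score` with `cv=5` on a classifier can be reproduced by hand: split the data into five stratified folds, train on four, score on the fifth, and repeat. The sketch below does this on scikit-learn's bundled `load_digits` data (8x8 images, used here only so the example runs without the CSV). It illustrates the procedure and is not the library's internal code; names ending in `_demo` are placeholders.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_digits(return_X_y=True)
knn_demo = KNeighborsClassifier()

# Manual 5-fold loop: fit a fresh copy of the model on each training part
# and measure accuracy on the held-out fold.
manual_scores = []
for train_idx, valid_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    model = clone(knn_demo)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(X_demo[valid_idx], y_demo[valid_idx]))
manual_scores = np.array(manual_scores)

# The helper should give the same five accuracies.
auto_scores = cross_val_score(knn_demo, X_demo, y_demo, cv=5)
print(manual_scores.mean(), auto_scores.mean())
```

Averaging the five fold scores gives a steadier estimate than a single train/validation split, which is why the means are compared above.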

 

Scaling

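from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training split only, then apply it to both splits
min_max_scaler = MinMaxScaler()
min_max_scaler.fit(X_train)

X_train_scaled = min_max_scaler.transform(X_train)
X_test_scaled = min_max_scaler.transform(X_test)

# Pixel distribution of the first training image after scaling (values now lie in [0, 1])
plt.hist(X_train_scaled[0])
plt.show()

# Cross-validate the linear SVM again, this time on the scaled features
svm_result2 = cross_val_score(svm, X_train_scaled, y_train, cv=5)

print(svm_result2.mean())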
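
`MinMaxScaler` rescales each column with x' = (x - min) / (max - min), where min and max come from the data it was fitted on. A small sketch on a made-up 3x3 matrix (the values are invented for illustration), including a constant column such as a border pixel that is always 0:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy "pixel" matrix: 3 samples x 3 features; the first column never changes.
X_demo = np.array([[0.0,   0.0, 255.0],
                   [0.0, 128.0,  51.0],
                   [0.0, 255.0,   0.0]])

col_min = X_demo.min(axis=0)
col_range = X_demo.max(axis=0) - col_min
# Avoid dividing by zero for constant columns; MinMaxScaler treats a zero range as 1.
col_range[col_range == 0] = 1.0
manual = (X_demo - col_min) / col_range

scaled = MinMaxScaler().fit_transform(X_demo)
print(manual)
print(np.allclose(manual, scaled))
```

Fitting on the training split and reusing the same min and max on the test split keeps test information out of the preprocessing step.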

 

Checking the test data

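# Fit each model on the full training split; the SVM uses the scaled features
knn.fit(X_train, y_train)
tree.fit(X_train, y_train)
logi.fit(X_train, y_train)
svm.fit(X_train_scaled, y_train)

# Test accuracy; the SVM was trained on scaled features, so it is scored on X_test_scaled
print('knn : ', knn.score(X_test, y_test))
print('tree : ', tree.score(X_test, y_test))
print('logi : ', logi.score(X_test, y_test))
print('svm : ', svm.score(X_test_scaled, y_test))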

 

Prediction uncertainty

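# Class probabilities from KNN for test rows 50-79 (.iloc makes the positional slice explicit)
knn.predict_proba(X_test.iloc[50:80])

# Class probabilities from the unrestricted tree for test rows 100-199
tree.predict_proba(X_test.iloc[100:200])

# A depth-limited tree for comparison
tree3 = DecisionTreeClassifier(max_depth=5)
tree3.fit(X_train, y_train)

tree3.predict_proba(X_test.iloc[100:200])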
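
The probabilities these models return are coarse by construction. KNN with 5 uniformly weighted neighbours can only output multiples of 1/5, and a fully grown tree usually ends in pure leaves, so it tends to output only 0 or 1. Limiting the depth leaves mixed leaves and therefore fractional probabilities. A sketch on scikit-learn's bundled `load_digits` data (a stand-in for the CSV; names ending in `_demo` are placeholders):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7
)

knn_demo = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow_tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)

# Distinct probability values each model ever produces on the test rows
knn_values = np.unique(knn_demo.predict_proba(X_te))
full_values = np.unique(full_tree.predict_proba(X_te))
shallow_values = np.unique(shallow_tree.predict_proba(X_te))

print('knn          :', knn_values)
print('full tree    :', full_values)
print('shallow tree :', len(shallow_values), 'distinct values')
```

A probability of exactly 1.0 from the unrestricted tree therefore says little about how confident the model should be; it mostly reflects that the leaf was pure on the training data.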

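# Class probabilities from logistic regression for test rows 50-79
logi.predict_proba(X_test.iloc[50:80])

# Display the test image at position 52 to compare it with its predicted probabilities
img2 = X_test.iloc[52]
plt.imshow(img2.values.reshape(28, 28), cmap="gray")
plt.show()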
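
Among the linear models, `LogisticRegression` exposes `predict_proba`, but `LinearSVC` does not. For the SVM, the closest equivalent is `decision_function`, which returns one signed score per class; the predicted class is the one with the largest score, and a small gap between the top two scores suggests an uncertain prediction. A sketch on scikit-learn's bundled `load_digits` data (a stand-in for the CSV; the `max_iter` values are assumptions chosen to help convergence):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X_demo, y_demo = load_digits(return_X_y=True)
X_demo = X_demo / 16.0   # load_digits pixels range from 0 to 16

logi_demo = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
svm_demo = LinearSVC(max_iter=5000).fit(X_demo, y_demo)

# Logistic regression: one probability per class, each row sums to 1
proba = logi_demo.predict_proba(X_demo[:5])
print(proba.sum(axis=1))

# LinearSVC: no predict_proba, only per-class scores
print(hasattr(svm_demo, "predict_proba"))
scores = svm_demo.decision_function(X_demo[:5])
print(scores.shape)

# The class with the highest score is the prediction
pred_from_scores = svm_demo.classes_[scores.argmax(axis=1)]
print(pred_from_scores)
```

Unlike probabilities, these scores are not bounded and do not sum to 1, so they are best compared within a row rather than across models.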

 

Classification metrics

from sklearn.metrics import classification_report

# classification_report expects the true labels first, then the predictions
pre = knn.predict(X_test)
print(classification_report(y_test, pre))

pre2 = tree.predict(X_test)
print(classification_report(y_test, pre2))
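
`classification_report` takes the true labels first and the predictions second. Swapping them raises no error; it silently exchanges precision and recall, and the support column then counts predictions instead of true labels. A small hand-made binary example (labels invented for illustration) makes the swap visible:

```python
from sklearn.metrics import precision_score, recall_score

# 3 true positives in the data; the model predicts 5 positives, 2 of them correct.
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 1, 1, 1, 0, 0]

# Correct order: (y_true, y_pred)
p_ok = precision_score(y_true_demo, y_pred_demo)   # TP / (TP + FP) = 2 / 5
r_ok = recall_score(y_true_demo, y_pred_demo)      # TP / (TP + FN) = 2 / 3

# Swapped order: precision and recall trade places
p_swapped = precision_score(y_pred_demo, y_true_demo)
r_swapped = recall_score(y_pred_demo, y_true_demo)

print(p_ok, r_ok)
print(p_swapped, r_swapped)
```

Accuracy is unaffected by the order, so a swapped report can look plausible at first glance.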