💡 WIDA/DACON Classification-Regression

[DACON/김규리] Taking apart a classification model, and how classification models are evaluated

kyuree 2023. 3. 31. 00:41

Last time we looked at the naive Bayes classification model, so this time let's try it out hands-on~

 

*Source*

https://m.blog.naver.com/baek2sm/221786426960

 

ํŒŒ์ด์ฌ ๋‚˜์ด๋ธŒ ๋ฒ ์ด์ฆˆ ๋ถ„๋ฅ˜ ๋จธ์‹ ๋Ÿฌ๋‹ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์˜ˆ์ œ

๋จธ์‹ ๋Ÿฌ๋‹&๋”ฅ๋Ÿฌ๋‹ ์ฟก๋ถ(MLCook) ์‚ฌ์ดํ‚ท๋Ÿฐ ๋‚˜์ด๋ธŒ ๋ฒ ์ด์ฆˆ ๋ถ„๋ฅ˜ ์˜ˆ์ œ ์•ˆ๋…•ํ•˜์„ธ์š”. ๋™๋„ค์ฝ”๋”์ž…๋‹ˆ๋‹ค. ์ด...

blog.naver.com

(I followed this author's post almost exactly for the hands-on part, so I am citing the source up front.)


1.  Overview of the naive Bayes classification model

- A machine learning technique, traditionally used for text classification

- A supervised learning algorithm

- Its basic principle is an application of Bayes' theorem

- Commonly explained with spam mail classification

Spam mail classification

Based on the frequencies of the words that appear in the text, the model computes the probability that the mail is normal and the probability that it is spam, and outputs whichever side has the higher probability.

 

Written as formulas (for a text made of two words):

P(normal mail | text) ∝ P(word1 | normal mail) * P(word2 | normal mail) * P(normal mail)
P(spam mail | text) ∝ P(word1 | spam mail) * P(word2 | spam mail) * P(spam mail)

The shared denominator P(text) is left out because it does not change which side is larger.

Here P(word1 | normal mail) is computed from how often word1 appears in the normal-mail part of the training data set.

In other words, the naive Bayes algorithm ignores word order and classifies a text by the frequencies of the words that appear in it.
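The two products above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the four training mails, the labels, and the helper names `word_likelihoods`, `score`, and `classify` are hypothetical and do not come from the referenced post.

```python
from collections import Counter

# Hypothetical toy training data: (label, text) pairs.
train = [
    ("spam", "win money now"),
    ("spam", "win prize money"),
    ("normal", "meeting schedule now"),
    ("normal", "project meeting notes"),
]

def word_likelihoods(label):
    """Estimate P(word | label) as the word's relative frequency within that class."""
    words = [w for lbl, text in train if lbl == label for w in text.split()]
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def score(text, label):
    """Unnormalized posterior: P(label) times the product of P(word | label)."""
    prior = sum(1 for lbl, _ in train if lbl == label) / len(train)
    likelihood = word_likelihoods(label)
    result = prior
    for w in text.split():
        result *= likelihood.get(w, 0.0)  # a word never seen in this class contributes 0
    return result

def classify(text):
    """Return the label with the larger unnormalized posterior."""
    return max(["normal", "spam"], key=lambda label: score(text, label))

print(classify("win money"))
print(classify("money win"))  # same words in a different order
```

Because the score is a plain product, swapping the word order leaves it unchanged, which is the order-independence described above.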

 

 

2. Naive Bayes classification model: hands-on

About the data set

  • Uses the news classification data set that scikit-learn provides out of the box
  • Stored in a dictionary-like form
  • English data on 20 topics, with labels (target) 0 ~ 19 for the 20 topics in total
  • The full data set has 18846 data items (documents), each with a different length; note that fetch_20newsgroups() called with default arguments loads only the 11314-document training subset
  • Because this data set has 20 news categories in total, the accuracy expected from classifying at random is 5%: a uniformly random guess among 20 categories is right with probability 1/20 = 0.05

Loading the model and the data

 

# -*- coding:utf-8 -*-
from sklearn.datasets import fetch_20newsgroups  # built-in data set
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

import numpy as np
# Load the sample data for classification
news = fetch_20newsgroups()
X, y, labels = news.data, news.target, news.target_names

# Split into training and test sets (70/30, stratified by label)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

 

 

๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ

 

# ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ(๋ฒกํ„ฐํ™”)
vectorizer = CountVectorizer()
tfid = TfidfTransformer()

X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

X_train_tfid = tfid.fit_transform(X_train_vec)
X_test_tfid = tfid.transform(X_test_vec)

Because this is text data, it cannot be used for training as-is and needs a little preprocessing.

scikit-learn also provides utilities for preprocessing text data.

The important piece here is CountVectorizer, which converts text into a word-count matrix.

  • Example (source: the official scikit-learn documentation)

Let's look at the result of transforming the list of texts below

Transforming the text above with CountVectorizer outputs data like the following
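To make the word-count matrix concrete, here is a minimal pure-Python sketch of the same idea. It is not scikit-learn's implementation: the four example sentences and the helpers `tokenize` and `count_vector` are assumptions for illustration, although the lowercasing and the "two or more word characters" token rule follow CountVectorizer's documented defaults.

```python
import re

# Hypothetical example corpus; each string is one document.
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

# The vocabulary is the sorted set of all tokens; each token gets a column index.
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})
index = {tok: i for i, tok in enumerate(vocabulary)}

def count_vector(text):
    """Return one row of the matrix: how often each vocabulary word occurs in text."""
    row = [0] * len(vocabulary)
    for tok in tokenize(text):
        row[index[tok]] += 1
    return row

matrix = [count_vector(doc) for doc in corpus]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column one vocabulary word, so the second row holds a 2 in the "document" column because that word appears twice in the second sentence.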

๋‹ค์Œ์œผ๋กœ TfidfTransformer๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํŠน์ • ์ƒ˜ํ”Œ(ํ…์ŠคํŠธ)์— ํฌํ•จ๋œ ๋‹จ์–ด ๋นˆ๋„์ˆ˜์™€ ํŠน์ • ๋‹จ์–ด๊ฐ€ ํฌํ•จ๋œ ์ƒ˜ํ”Œ์˜ ์ˆ˜๊นŒ์ง€ ๊ณ ๋ คํ•˜์—ฌ ๊ฐ€์ค‘์น˜๋ฅผ ๋ถ€์—ฌํ•œ ํ–‰๋ ฌ๋กœ ๋ณ€ํ™˜

 

This completes the preprocessing needed for training.

English sentences can be handled this simply, but Korean is a bit more complicated.

 

๋ชจ๋ธ ํ•™์Šต ๋ฐ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •

# ๋‹ค์ค‘๋ถ„๋ฅ˜ ๋‚˜์ด๋ธŒ ๋ฒ ์ด๋ธŒ + ๊ทธ๋ฆฌ๋“œ์„œ์น˜๋กœ ๋ชจ๋ธ ํ•™์Šต
nb = MultinomialNB()
param_grid = [{'alpha': np.linspace(0.01, 1, 100)}]
gs = GridSearchCV(estimator=nb, param_grid=param_grid, scoring='accuracy', cv=5, n_jobs=-1)
gs.fit(X_train_tfid, y_train)

 

 

What is grid search?
Grid search is a search method that feeds the candidate values for a model's hyperparameters in one after another
and finds the hyperparameters that give the best performance.

 

The important hyperparameter in the naive Bayes model is alpha.

The reason is as follows.

 

The formula mentioned earlier,

P(normal mail | text) ∝ P(word1 | normal mail) * P(word2 | normal mail) * P(normal mail)

has a slight problem.

 

If even one of the word occurrence probabilities is 0, the whole product becomes 0.

To solve this, a method called Laplace smoothing is used.

Laplace smoothing
Adding a fixed value to every word count before the probabilities are computed keeps any probability from becoming 0.

This value can be set through the alpha hyperparameter.
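A small sketch of the smoothing step, assuming the usual additive form (count + alpha) / (total + alpha * vocabulary size). The vocabulary, the counts, and the function name `smoothed_likelihoods` are made up for illustration.

```python
def smoothed_likelihoods(word_counts, vocabulary, alpha=1.0):
    """Estimate P(word | class) with additive (Laplace) smoothing."""
    total = sum(word_counts.get(w, 0) for w in vocabulary)
    denominator = total + alpha * len(vocabulary)
    return {w: (word_counts.get(w, 0) + alpha) / denominator for w in vocabulary}

# Hypothetical counts for the spam class; "meeting" never appeared in spam mails.
vocabulary = ["win", "money", "meeting"]
spam_counts = {"win": 2, "money": 2}

unsmoothed = smoothed_likelihoods(spam_counts, vocabulary, alpha=0.0)
smoothed = smoothed_likelihoods(spam_counts, vocabulary, alpha=1.0)

print(unsmoothed["meeting"])  # zero, which would wipe out the whole product
print(smoothed["meeting"])    # small but non-zero
```

A larger alpha pulls all word probabilities toward a uniform distribution, which is why its value is worth tuning.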

 

# Print the grid search results
print('Best hyperparameters: {0}'.format(gs.best_params_))
print('Accuracy with the best hyperparameters: {0:.2f}'.format(gs.best_score_))

Training with grid search, the best hyperparameter turned out to be alpha = 0.01,

and the cross-validated accuracy at that setting was 89%.

Output screen

Actually making predictions

 

# Extract the best model found by the search
model = gs.best_estimator_

# Print the accuracy on the test set
score = model.score(X_test_tfid, y_test)
print('Accuracy on the test set: {0:.2f}'.format(score))

Classification accuracy on the test set is 90%, which shows that although the naive Bayes algorithm classifies purely by frequency, it performs quite well.

 

# Print a sample of the test-set predictions
predicted_y = model.predict(X_test_tfid)
for i in range(10):
    print('Actual: {0}, predicted: {1}'.format(labels[y_test[i]], labels[predicted_y[i]]))

 

 

This is the result of running classification with the trained model.

 

 

3.  Model evaluation methods

log loss

Overview

- A metric for assessing model performance, used when evaluating classification models

 

Concept

- Evaluates the model by directly using the probability values it predicts

- The probability is passed through the negative log function and the transformed value is used for evaluation -> the aim is to penalize the model more heavily the more wrong its prediction is (*intuitive interpretation)

 

-log(x) (the negative log function, the sky-blue curve in the graph) is used.

As the probability drops in equal steps of 0.2, the negative log value rises by a larger amount each time (on the graph, lower probabilities lie further to the left).

With an ordinary straight line (orange), the penalty would have increased by the same amount at each step, just like the probability.

In short, the lower the log loss, the better.

 +) When there are several observations (rows), take the negative log of the probability assigned to each observation's actual class and average the results.
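Both points, the growing penalty and the averaging over rows, can be checked with a short sketch. The probability values and the function name `log_loss` are illustrative assumptions; the input is the probability the model assigned to each observation's actual class.

```python
import math

def log_loss(true_class_probs):
    """Average negative log of the probability assigned to each row's actual class."""
    return sum(-math.log(p) for p in true_class_probs) / len(true_class_probs)

# The probability drops in equal steps of 0.2; the penalty grows faster each step.
probs = [1.0, 0.8, 0.6, 0.4, 0.2]
penalties = [-math.log(p) for p in probs]
for p, penalty in zip(probs, penalties):
    print(p, round(penalty, 3))

# Hypothetical predictions for three rows from two models.
confident_model = log_loss([0.9, 0.8, 0.95])
unsure_model = log_loss([0.6, 0.5, 0.55])
print(confident_model, unsure_model)
```

The model that puts more probability on the correct classes ends up with the lower log loss.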

 

ํ˜ผ๋ˆ ๋งคํŠธ๋ฆญ์Šค((Confusion Matrix)

์‹ค์ œ ๊ฐ’๊ณผ ๋ถ„๋ฅ˜ ๋ชจ๋ธ๋กœ ๋ถ„๋ฅ˜ํ•œ ์˜ˆ์ธก๊ฐ’์„ ํ‰๊ฐ€ํ•˜์—ฌ ๊ฐ์ฒด์˜ ์ˆ˜๋ฅผ ์„ผ๋‹ค

 

์˜๋ฏธ

- positice -> 1/ 'Success'/ 'Pass'/ 'Live

- negative -> 0'/ 'Fail'/ 'Non Pass'/ 'Dead'

 

- TP (True Positive)

- FP (False Positive)

- FN (False Negative)

- TN (True Negative)

 (P and N refer to the predicted value; when it is compared with the actual value, a correct prediction is marked True and an incorrect one False.)

 

In addition, the two kinds of error have their own names:

FP: predicted positive but actually negative, a Type I error
FN: actually positive but predicted negative, a Type II error

ํ˜ผ๋ˆ ๋งคํŠธ๋ฆญ์Šค ๊ธฐ๋ฐ˜ ๋ถ„๋ฅ˜ ๋ชจ๋ธ ์„ฑ๋Šฅ ํ‰๊ฐ€ ์ง€ํ‘œ

 

 

If N (= TP + TN + FP + FN) is the total number of observations, then

 

    - Accuracy = (TP + TN) / N

    - Recall rate, sensitivity = TP / (TP + FN)

    - Specificity = TN / (FP + TN)

    - Precision = TP / (TP + FP)

each metric can be computed with the formula shown.
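The four formulas translate directly into code. The counts passed in below are made-up numbers chosen only to show the arithmetic, and `metrics` is a hypothetical helper name.

```python
def metrics(tp, fp, fn, tn):
    """Compute accuracy, recall, specificity and precision from confusion-matrix counts."""
    n = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / n,
        "recall": tp / (tp + fn),        # also called sensitivity
        "specificity": tn / (fp + tn),
        "precision": tp / (tp + fp),
    }

# Hypothetical counts for 100 observations.
m = metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in m.items():
    print(name, round(value, 3))
```

This sketch does not guard against empty denominators, for example a model that never predicts positive, so real code would need to handle that case.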

 

 

์ •ํ™•๋„์— ๋Œ€ํ•ด ๋” ์ž์„ธํžˆ ์•Œ์•„๋ณด์ž๋ฉด

์šฐ์„  ์˜๋ฏธ๋Š” ์ „์ฒด ๋ฐ์ดํ„ฐ ์ค‘์—์„œ ์˜ˆ์ธกํ•œ ๊ฐ’์ด ์‹ค์ œ ๊ฐ’๊ณผ ๋ถ€ํ•ฉํ•˜๋Š” ๋น„์œจ์ด๋ผ๊ณ ๋„ ํ•  ์ˆ˜ ์žˆ๊ณ 

์ด๋Ÿฌํ•œ ์ˆ˜์‹์œผ๋กœ ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค

 

์ •๋ฐ€๋„, ์žฌํ˜„์œจ, ํŠน์ด๋„ ๊ฐ™์€ ๊ฒฝ์šฐ

์ •๋ฐ€๋„(precision): ์˜ˆ์ธก์ด ์ฐธ์ธ ๊ฐ’ ์ค‘ ์‹ค์ œ ์ฐธ์ธ ๊ฐ’
์žฌํ˜„์œจ(recall,Sensitivity,TPR): ์‹ค์ œ ์ฐธ์ธ ๊ฐ’ ์ค‘ ์˜ˆ์ธก๋„ ์ฐธ์ธ ๊ฐ’
ํŠน์ด๋„(specificity): ์˜ˆ์ธก์ด ๊ฑฐ์ง“์ธ ๊ฐ’ ์ค‘ ์‹ค์ œ ๊ฑฐ์ง“์ธ ๊ฐ’

๋ผ๊ณ  ์ •์˜ํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ

 

์œ„์™€ ๊ฐ™์€ ์ˆ˜์‹์œผ๋กœ ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค

 

๊ทธ๋ฆฌ๊ณ  ์ •๋ฐ€๋„์—์„œ๋Š” FP(๊ฑฐ์ง“์œผ๋กœ ์˜ˆ์ธกํ–ˆ์œผ๋‚˜, ์‹ค์ œ๋กœ ์ฐธ์ธ ๊ฒฝ์šฐ)๋ฅผ ์ค„์ด๋Š” ๊ฒƒ, ์žฌํ˜„์œจ์—์„œ๋Š” FN(๊ฑฐ์ง“์œผ๋กœ ์˜ˆ์ธกํ–ˆ๊ณ , ์‹ค์ œ๋กœ๋„ ๊ฑฐ์ง“์ธ ๊ฒฝ์šฐ)์„ ์ค„์ด๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•˜๋‹ค.
์ฆ‰ FP, FN์ด ์ปค์ง€๋ฉด ์ •๋ฐ€๋„, ์žฌํ˜„์œจ ๊ฐ๊ฐ ์ž‘์•„์ง„๋‹ค

 

Sources

- https://github.com/baek2sm/MLCook/blob/master/sklearn-cook/classification/NaiveBayes_c.py

- https://velog.io/@hhhs101/confusionmatrix

- https://rfriend.tistory.com/771

- https://youtu.be/i5U2inxzXx4