💡 WIDA/DACON Classification-Regression

[DACON/김경은] Dissecting a classification model, and how classification models are evaluated

경은 2023. 3. 31. 01:06

Support Vector Machine model

# Import the required packages
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
from sklearn import svm

# Load the data
train_data = pd.read_csv("C:/Users/twink/Documents/카카오톡 받은 파일/train.csv")

# Separate the type column (the target) from the remaining features
X = train_data.iloc[:, 2:]
y = train_data.iloc[:, 1]

# Split into a training dataset and a test dataset
# The test set takes 0.3 of the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=11)


# SVM model
# (note: gamma is only used by the 'rbf', 'poly', and 'sigmoid' kernels, so it has no effect with kernel='linear')
clf = svm.SVC(kernel='linear', C=1, gamma=0.1)
# Train the model
clf.fit(X_train, y_train)
# Print the accuracy
print("Support Vector Machine accuracy:", clf.score(X_test, y_test))
  • I chose the linear kernel somewhat arbitrarily, and perhaps because I specified the wrong ranges for the matching C and gamma values, the accuracy did not come out well, so I will need to study this more... (a small grid-search sketch over these hyperparameters follows below)
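
As a hedged sketch of how the C/gamma/kernel settings could be explored rather than fixed by hand, the snippet below runs a cross-validated grid search on the same X_train/y_train split created above; the candidate values in param_grid are illustrative assumptions, not tuned recommendations.

from sklearn.model_selection import GridSearchCV
from sklearn import svm

# Candidate hyperparameter values (illustrative assumptions, not tuned choices)
param_grid = {
    'kernel': ['linear', 'rbf'],
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 0.01, 0.1, 1],  # only used by the non-linear kernels
}

# 5-fold cross-validated search over every combination in param_grid
grid = GridSearchCV(svm.SVC(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best CV accuracy:", grid.best_score_)
print("Test accuracy:", grid.score(X_test, y_test))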

Hyperparameter analysis

  • C : regulates how much classification error is allowed
  • kernel : chosen according to the shape of the dataset (linear dataset: 'linear', non-linear dataset: 'poly', etc.)
  • degree : sets the degree of the polynomial kernel
  • gamma : controls how flexibly the decision boundary is drawn; the larger it is, the more likely overfitting becomes
  • coef0 : the constant term r in the polynomial kernel (a sketch that uses all of these parameters follows this list)
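
To make the list above concrete, here is a minimal sketch, assuming the same X_train/y_train/X_test/y_test split from the code above, that builds a polynomial-kernel SVC in which each listed parameter appears explicitly; the specific values are illustrative, not recommendations.

from sklearn import svm

# Polynomial kernel: K(x, z) = (gamma * <x, z> + coef0) ** degree
poly_clf = svm.SVC(
    kernel='poly',  # non-linear dataset -> polynomial kernel
    degree=3,       # degree of the polynomial kernel
    gamma=0.1,      # how flexibly the decision boundary is drawn
    coef0=1,        # the constant term r in the polynomial kernel
    C=1,            # how much classification error is tolerated
)
poly_clf.fit(X_train, y_train)
print("Polynomial-kernel SVM accuracy:", poly_clf.score(X_test, y_test))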

Evaluation methods

Log loss

A metric that can be used to evaluate model performance.

It is used when evaluating classification models.

 

The concept of log loss

1. The evaluation directly reflects the probability values predicted by the model.

2. The probability value is transformed by a negative log function and the transformed value is used for the evaluation -> the point is to give a larger penalty the more wrong the prediction is.

  • Log loss uses the negative log function: the output of the log function is multiplied by -1, which mirrors the curve about the x-axis.
  • The negative log function is 0 when the input is 1, and its value grows sharply as the input approaches 0.
  • The smaller the log loss value, the better the model.

If the answer was given with 100% probability (complete confidence): -log(1.0) = 0

With 80% probability -> -log(0.8) = 0.22314

With 60% probability -> -log(0.6) = 0.51082
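
The three values above can be reproduced directly; note that log loss uses the natural logarithm. A minimal check with numpy:

import numpy as np

# Negative natural log of the probability assigned to the correct answer
for p in [1.0, 0.8, 0.6]:
    print(f"p = {p}: -log(p) = {-np.log(p):.5f}")
# The printed values match the three examples above (up to rounding in the last digit).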

 

As the probability gets lower, the log loss value grows very steeply.

In this way, log loss uses the negative log function in order to give a larger penalty when the probability is low.

 

When there are multiple observations (rows)

For each observation, take the negative log of the probability assigned to the true answer, then average these values.

 

Average log loss

= (negative log of the probability assigned to question 1's correct answer + ... + negative log of the probability assigned to question n's correct answer) / n (the number of questions)
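
As a small sketch of that average (the labels and probabilities below are made-up values), sklearn's log_loss gives the same number as taking the negative log of each true-class probability by hand and averaging:

import numpy as np
from sklearn.metrics import log_loss

# Made-up example: 4 observations, 3 classes; each row holds the predicted probability of each class
y_true = [0, 2, 1, 2]
y_proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
])

# Manual average of the negative log of the probability assigned to each true class
manual = -np.mean(np.log(y_proba[np.arange(len(y_true)), y_true]))
print("manual average log loss:", manual)
print("sklearn log_loss       :", log_loss(y_true, y_proba))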

 

Why use it?

If performance is evaluated using only the final right-or-wrong results, there is no way to tell with what probability the model arrived at each answer. If the answer was correct but the model assigned it only a 20% probability, essentially a lucky guess, we could not call it a good model.
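
A minimal sketch of this point, with made-up probabilities: both models below give the same final answers, so their accuracy is identical, but the confident model gets a much lower log loss than the one that was barely sure.

from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1]

# Probability of the positive class: confident, correct predictions
confident = [0.95, 0.05, 0.90, 0.85]
# Same final answers (all on the correct side of 0.5), but with much less certainty
hesitant = [0.55, 0.45, 0.52, 0.51]

# Same accuracy, very different log loss
print("confident model log loss:", log_loss(y_true, confident))
print("hesitant model log loss :", log_loss(y_true, hesitant))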

 

AUC - ROC Curve

A graph that measures a model's classification performance across various thresholds.

 - ROC (Receiver Operating Characteristic) = a graph showing a classification model's performance at every threshold

 - AUC (Area Under the Curve) = the area under the ROC curve

A high AUC means the model does a good job of distinguishing between the classes.

Terminology

1. TP : a positive case predicted as positive

2. TN : a negative case predicted as negative

3. FP : a negative case predicted as positive

4. FN : a positive case predicted as negative

 

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

Accuracy = (TP + TN) / (TP + FP + TN + FN)
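
These three formulas can be checked against sklearn on a made-up set of labels (the values below are purely illustrative):

from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score

# Made-up binary labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 1, 0]

# For labels ordered [0, 1], confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("Precision:", tp / (tp + fp), "==", precision_score(y_true, y_pred))
print("Recall   :", tp / (tp + fn), "==", recall_score(y_true, y_pred))
print("Accuracy :", (tp + tn) / (tp + fp + tn + fn), "==", accuracy_score(y_true, y_pred))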

 

  • The FN/FP region is where the decision is ambiguous, so the smaller this region, the better.
  • The ROC curve is therefore what you get by minimizing this ambiguous-decision region.
  • The ROC curve is a plot of the true positive rate (TPR) against the false positive rate (FPR).
  • The higher the TPR and the lower the FPR, the better.
  • An AUC of 0.5 is the baseline of random guessing; in that case the model has almost no ability to separate the classes.
  • In other words, when the AUC is 1 the two class distributions do not overlap at all and the classes are separated perfectly, and when it is 0.7 there is a 70% chance that the model can tell the classes apart.

 

ROC curve

y-axis : sensitivity (true positive rate)

x-axis : 1 - specificity (false positive rate)

The area under this curve is called the AUC, and the larger this area, the better the curve, i.e. the better the model.
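
A minimal sketch of computing the ROC curve and its area with sklearn, using made-up labels and scores; note that roc_curve returns the false positive rate (1 - specificity) for the x-axis and the true positive rate (sensitivity) for the y-axis:

from sklearn.metrics import roc_curve, roc_auc_score

# Made-up binary labels and predicted scores (e.g. probability of the positive class)
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3]

# One (FPR, TPR) point per threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("FPR:", fpr)
print("TPR:", tpr)

# Area under the ROC curve: the closer to 1, the better the class separation
print("AUC:", roc_auc_score(y_true, y_score))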


Sources

https://for-my-wealthy-life.tistory.com/37

https://seoyoungh.github.io/machine-learning/ml-logloss/

https://velog.io/@skyepodium/logloss-%EC%95%8C%EC%95%84%EB%B3%B4%EA%B8%B0

https://bioinformaticsandme.tistory.com/328

log loss에 대해 알아보자 - [데이콘 평가산식] - YouTube