💡 WIDA/DACON Classification-Regression

[DACON/최다예] Dissecting Classification Models, and How Classification Models Are Evaluated

๋‹ค์˜ˆ๋ป 2023. 3. 30. 23:55

Implementing the KNN Algorithm

# Import the required packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the data
train_data = pd.read_csv("C:/Users/allye/Desktop/DSOB/WIDA Dacon/DCSTree/train.csv")

# Separate the type label from the remaining feature columns
X = train_data.iloc[:, 2:]
y = train_data.iloc[:, 1]

# Split the dataset into a training set and a validation set
# 30% of the samples are held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create the KNN model
k = 3             # set the value of k
model = KNeighborsClassifier(n_neighbors=k)

# Train the model
model.fit(X_train, y_train)

# Predict on the validation data
y_pred = model.predict(X_test)

# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Accuracy: 0.7735924530817694

KNN์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ๋ถ„์„

ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ : ๋จธ์‹ ๋Ÿฌ๋‹์—์„œ ์–ด๋– ํ•œ ์ž„์˜์˜ ๋ชจ๋ธ์„ ํ•™์Šต์‹œํ‚ฌ ๋•Œ, ์‚ฌ๋žŒ์ด ์ง์ ‘ ์„ค์ •ํ•ด์•ผํ•˜๋Š” ๋ณ€์ˆ˜

  1. Distance

The class assigned to a new data point changes depending on the distance metric used.

  • Euclidean Distance
    • The most commonly used metric
    • Computed via the Pythagorean theorem as the straight-line distance between two points in n-dimensional space
    • However, it becomes less reliable as the dimensionality grows

  • Manhattan Distance
    • The second most widely used metric
    • The distance covered along a fixed, grid-constrained route (axis-aligned steps only)

Other options include the standardized distance, Mahalanobis distance, Chebyshev distance, Minkowski distance, and Canberra distance.

  2. K

The most important parameter in the KNN algorithm.

The result changes depending on how K is set.

K = the number of neighbors

  • When K is small
    • Advantages
      • The model follows local structure closely, so accuracy on the train set is high
    • Disadvantages
      • The decision boundary becomes jagged and the model grows complex
      • Sensitive to noise; error on the test set rises and accuracy falls → overfitting
  • When K is large
    • Advantages
      • The decision boundary becomes smoother and the influence of noise shrinks, so the model is less complex
    • Disadvantages
      • Accuracy on both the train and test sets can drop → underfitting

์ด์›ƒ์˜ ์ˆ˜๋ฅผ ์„ค์ •ํ•  ๋•Œ๋Š” **๊ต์ฐจ ๊ฒ€์ฆ(Cross-validation)**์„ ์‚ฌ์šฉํ•˜์—ฌ ์ตœ์ ์˜ k๊ฐ’์„ ์ฐพ๋Š” ๊ฒƒ์ด ์ข‹์Šต๋‹ˆ๋‹ค.


๋ถ„๋ฅ˜ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ ํ‰๊ฐ€ ๋ฐฉ๋ฒ•(์ง€ํ‘œ)

  1. logloss (Logarithmic Loss)

๋ชจ๋ธ์ด ์˜ˆ์ธกํ•œ ํ™•๋ฅ ๊ณผ ์‹ค์ œ ํด๋ž˜์Šค ๋ ˆ์ด๋ธ” ๊ฐ„์˜ ์ฐจ์ด๋ฅผ ์ง์ ‘์ ์œผ๋กœ ์ธก์ •

 

๐Ÿ’ก LogLoss = -1/N * Σ(yi*log(pi) + (1-yi)log(1-pi))

 

yi : i๋ฒˆ์งธ ์ƒ˜ํ”Œ์˜ ์‹ค์ œ ํด๋ž˜์Šค ๋ ˆ์ด๋ธ”

pi : ๋ชจ๋ธ์ด ์˜ˆ์ธกํ•œ i๋ฒˆ์งธ ์ƒ˜ํ”Œ์ด ํด๋ž˜์Šค 1์— ์†ํ•  ํ™•๋ฅ 

N : ์ „์ฒด ์ƒ˜ํ”Œ ์ˆ˜

  • ์˜ˆ์ธก ํ™•๋ฅ ์ด ์‹ค์ œ ๋ ˆ์ด๋ธ”๊ณผ ์ผ์น˜ํ• ์ˆ˜๋ก : ์ตœ์†Œํ™”
  • ์˜ˆ์ธก ํ™•๋ฅ ์ด ์‹ค์ œ ๋ ˆ์ด๋ธ”๊ณผ ๋‹ค๋ฅผ์ˆ˜๋ก : ์ตœ๋Œ€ํ™”
  • 0๊ณผ 1 ์‚ฌ์ด์˜ ๊ฐ’
  • ๊ฐ’์ด ์ž‘์„์ˆ˜๋ก ๋ชจ๋ธ์˜ ์˜ˆ์ธก์ด ๋” ์ข‹์Œ
  2. Recall

In a binary classification model, the proportion of actually positive samples that the model predicted as positive.

A measure of the model's ability to identify all of the positive samples.

 

💡 Recall = TP / (TP + FN)

 

TP : number of samples predicted positive that are actually positive

FN : number of samples predicted negative that are actually positive

A very useful metric when finding the positive samples is what matters most

e.g.) a cancer-diagnosis model
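A small check of the formula with scikit-learn (the label vectors below are made up; for binary labels, `confusion_matrix(...).ravel()` returns the counts in the order TN, FP, FN, TP):

```python
from sklearn.metrics import confusion_matrix, recall_score

y_true = [1, 1, 1, 0, 0, 1]  # actual labels (4 positives)
y_pred = [1, 0, 1, 0, 1, 1]  # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fn))                # TP=3, FN=1 → 0.75
print(recall_score(y_true, y_pred))  # 0.75
```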

 

 

Sources:

https://needjarvis.tistory.com/715

https://leonard92.tistory.com/12

https://hleecaster.com/ml-accuracy-recall-precision-f1/