💡 WIDA/DACON Classification-Regression

[DACON/김민혜] Dissecting a Classification Model, and How Classification Models Are Evaluated

Unknown user · 2023. 3. 30. 13:59

โ˜๐Ÿป WIDA_4์ฃผ์ฐจ

  1. ํŒŒ์ด์ฌ์œผ๋กœ ๋ชจ๋ธ ๋ถˆ๋Ÿฌ์™€์„œ ๊ฐ„๋‹จํ•˜๊ฒŒ ๋ชจ๋ธ ๋Œ๋ ค๋ณด๊ณ  accuracy ์ธก์ •ํ•ด๋ณด๊ธฐ (decision tree ์ œ์™ธ ํ•œ๊ฐ€์ง€ ๊ณจ๋ผ์„œ)
  2. ๊ทธ ๋ชจ๋ธ์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ๋ถ„์„ํ•˜๊ธฐ
  3. ํ‰๊ฐ€ ๋ฐฉ๋ฒ• ์•Œ์•„๋ณด๊ธฐ(log loss ํฌํ•จ 2๊ฐ€์ง€)

 

๋žœ๋คํฌ๋ ˆ์ŠคํŠธ ๋ชจ๋ธ ์‹คํ–‰


Code

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Load the training dataset
train_data = pd.read_csv("drive/MyDrive/Colab Notebooks/train.csv")

# The target to predict is type, so separate the type column from the remaining features (slicing)
X = train_data.iloc[:, 2:]
y = train_data.iloc[:, 1]  # the type column

# Split into a training set and a test set
# test_size=0.3 gives a 7:3 train:test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=11)

# Build and train the random forest, then evaluate its predictions on the held-out test set
rf_clf = RandomForestClassifier(criterion='entropy', bootstrap=True, random_state=42, max_depth=5)
rf_clf.fit(X_train, y_train)
pred = rf_clf.predict(X_test)
accuracy = accuracy_score(y_test, pred)
print('Random forest accuracy: {0:.4f}'.format(accuracy))

>>> Random forest accuracy: 0.8767

์ •ํ™•๋„๋Š” 87.7%์˜€๋‹ค.

? ๊ถ๊ธˆํ–ˆ๋˜ ๊ฑด random_state๊ฐ€ ๋‚˜ํƒ€๋‚ด๋Š” ๋ฐ”์™€ ์ˆซ์ž ์ง€์ •์„ ์–ด๋–ป๊ฒŒ ํ•ด์•ผํ• ์ง€์˜€๋‹ค.

 

Below is code to find the optimal max_depth.

## Find the optimal max_depth
from sklearn.model_selection import KFold

cv = KFold(n_splits=5)  # the more splits, the longer it takes
accuracies = list()
max_attributes = X.shape[1]
depth_range = range(1, max_attributes)

# Try max_depth from 1 up to the number of attributes
for depth in depth_range:
    fold_accuracy = []
    rand_clf = RandomForestClassifier(max_depth=depth)
    # print("Current max depth: ", depth, "\n")
    for train_fold, valid_fold in cv.split(X):
        # Extract the train/validation data for this fold with the cv indices
        X_fold_train, y_fold_train = X.iloc[train_fold], y.iloc[train_fold]
        X_fold_valid, y_fold_valid = X.iloc[valid_fold], y.iloc[valid_fold]

        model = rand_clf.fit(X_fold_train, y_fold_train)
        valid_acc = model.score(X_fold_valid, y_fold_valid)
        fold_accuracy.append(valid_acc)

    avg = sum(fold_accuracy) / len(fold_accuracy)
    accuracies.append(avg)
    print("Accuracy per fold: ", fold_accuracy, "\n")

df = pd.DataFrame({"Max Depth": depth_range, "Average Accuracy": accuracies})
print(df.to_string(index=False))
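
Once the loop finishes, the best depth can be read straight off the results DataFrame; a minimal sketch, assuming the df built above:

# Row of df with the highest average accuracy
best = df.loc[df["Average Accuracy"].idxmax()]
print("Best max_depth:", int(best["Max Depth"]),
      "avg accuracy:", best["Average Accuracy"])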

 

ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ๋ถ„์„

Source: https://sevillabk.github.io/RandomForest/


  • n_estimators: the number of decision trees in the random forest, i.e., how many trees to build (int; default=10 in older scikit-learn, 100 since version 0.22). The more trees, the longer training takes.
  • max_features: the number of features to consider when looking for a split; usually left at the default (default='auto'). Same as the max_features parameter of a single decision tree, except that the default is 'auto' (='sqrt') rather than 'None'.
  • criterion: choose between gini and entropy.
  • min_samples_leaf: the minimum number of samples required to be a leaf node. With imbalanced data a particular class can be extremely rare, so this may need to be set to a small value. (default: 1)
  • min_samples_split: the minimum number of samples required to split a node → used to control overfitting. The smaller the value, the more nodes get split and the higher the chance of overfitting. (default: 2)
  • max_depth: the depth of the tree (int; default=None → keep splitting until every leaf is perfectly pure). The same parameter used to curb overfitting in a single decision tree applies equally to a random forest.
  • bootstrap: if True, each tree is grown on a bootstrap sample drawn with replacement from the training data (default=True). A tuning sketch using several of these parameters follows below.
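
To see how these hyperparameters get chosen in practice, here is a minimal GridSearchCV sketch over a few of them, reusing X_train and y_train from above; the grid values are illustrative guesses, not tuned values for this competition:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the values are assumptions, not tuned results
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, None],
    'min_samples_leaf': [1, 4],
    'criterion': ['gini', 'entropy'],
}

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)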

 

Evaluation Methods


๋ถ„๋ฅ˜์˜ ํ‰๊ฐ€๋ฐฉ๋ฒ•์€ ์ผ๋ฐ˜์ ์œผ๋กœ๋Š” ์‹ค์ œ ๊ฒฐ๊ณผ ๋ฐ์ดํ„ฐ์™€ ์˜ˆ์ธก ๊ฒฐ๊ณผ ๋ฐ์ดํ„ฐ๊ฐ€ ์–ผ๋งˆ๋‚˜ ์ •ํ™•ํ•˜๊ณ  ์˜ค๋ฅ˜๊ฐ€ ์ ๊ฒŒ ๋ฐœ์ƒํ•˜๋Š”๊ฐ€์— ๊ธฐ๋ฐ˜ํ•˜์ง€๋งŒ, ๋‹จ์ˆœํžˆ ์ •ํ™•๋„๋ฅผ ๊ฐ€์ง€๊ณ  ํŒ๋‹จํ•˜๋Š” ๊ฑธ ๋„˜์–ด ๋‹ค๋ฅธ ์„ฑ๋Šฅ ํ‰๊ฐ€ ์ง€ํ‘œ๋„ ๋ณตํ•ฉ์ ์œผ๋กœ ๋‹ค๋ฃจ์–ด์•ผ ํ•œ๋‹ค.

1. log loss

(Source: https://seoyoungh.github.io/machine-learning/ml-logloss/)

 

DEF) a metric that can be used to evaluate the performance of a classification model

  • Loss function (cross-entropy): a function that defines the error between the model's output and the correct answer
    • the lower the log loss, the better

Calculation formula (source: https://leehah0908.tistory.com/23):

  log loss = -(1/N) Σ_i Σ_m y_im · log(p_im)

  • N: the number of rows of data
  • M: the number of classes
  • y_im: 1 if row data i belongs to class m, 0 otherwise
  • p_im: the predicted probability that row data i belongs to class m

 

  • Evaluation works by passing the probability value through a negative log transform (the worse the prediction, the larger the penalty).
    • e.g., 100% probability = -log(1.0) = 0 / 80% probability = -log(0.8) = 0.22314 / 60% probability = -log(0.6) = 0.51082
    • The lower the probability, the more sharply the log loss value increases.
  • Why it is used
    • If performance is judged only by whether the final answer was correct, there is no way to tell with how much confidence the model arrived at that answer.
    • To compensate, the probability values themselves are used as the evaluation metric. Log loss directly reflects the probabilities the model predicted.
  • Take the negative log of the probability assigned to the true answer for every row, sum the results, and multiply by 1/n to get the average.
from sklearn.metrics import log_loss

# Get the validation labels from the data split and the probability
# predictions from the model, then compute the log loss; here the test
# split and the rf_clf trained above are reused
y_pred_proba = rf_clf.predict_proba(X_test)
multi_logloss = log_loss(y_test, y_pred_proba)
print(multi_logloss)
  • Computing log loss requires probability predictions (predict_proba).
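
To connect the formula to the sklearn call, here is a minimal sketch that computes the same value by hand on a tiny made-up binary example:

import numpy as np
from sklearn.metrics import log_loss

y_true = [1, 0, 1]          # actual classes (made-up example)
y_proba = [0.8, 0.3, 0.6]   # predicted probability of class 1

# By hand: average of -log(probability assigned to the true class)
manual = -np.mean([np.log(0.8), np.log(1 - 0.3), np.log(0.6)])
print(manual)                     # 0.3635...
print(log_loss(y_true, y_proba))  # same value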

 

2. Precision

DEF) Of the cases predicted positive, the proportion where the prediction and the actual value agree as positive.

True Positives / ( True Positives + False Positives )

TP / (TP + FP)

⇒ TP + FP: the number of all cases predicted as positive

⇒ TP: the number of cases where the prediction and the actual value agree as positive

  • Mainly focused on keeping FP low.
  • A metric complementary to recall.
  • Checking it in code:
from sklearn.metrics import precision_score

labels = [1, 0, 0, 1, 1, 1, 0, 1, 1, 1]
guesses = [0, 1, 1, 1, 1, 0, 1, 0, 1, 0]

print(precision_score(labels, guesses))  # 0.5 (TP=3, FP=3)
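
To verify the TP / (TP + FP) formula against precision_score, here is a minimal sketch that counts the terms by hand on the same labels and guesses; recall_score is included as well, since the two metrics are complementary:

from sklearn.metrics import precision_score, recall_score

labels = [1, 0, 0, 1, 1, 1, 0, 1, 1, 1]
guesses = [0, 1, 1, 1, 1, 0, 1, 0, 1, 0]

tp = sum(1 for y, p in zip(labels, guesses) if y == 1 and p == 1)  # 3
fp = sum(1 for y, p in zip(labels, guesses) if y == 0 and p == 1)  # 3
print(tp / (tp + fp))                    # 0.5
print(precision_score(labels, guesses))  # 0.5, matches the hand count
print(recall_score(labels, guesses))     # TP / (TP + FN) = 3/7 ≈ 0.4286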