
[DACON/조아영] Dissecting classification models and how they are evaluated (an SVM model will be added later)

려우 2023. 3. 30. 23:10

Decision Tree

Hyper parameter

  1. criterion : default="gini", the measure used to evaluate the quality of a split
    • "gini"
      • Splits the data based on Gini impurity
    • "entropy"
      • Splits the data based on entropy (information gain)
    • "log_loss"
      • Splits the data based on log loss, which for tree splitting is the same Shannon information gain as "entropy"
  2. splitter : default="best"
    • "best"
      • Chooses the best split
    • "random"
      • Chooses the best among randomly drawn candidate splits
  3. max_depth : default=None
    • Sets the maximum depth of the tree; with None, nodes are expanded until all leaves are pure or hold fewer than min_samples_split samples
    • Limiting the depth helps prevent overfitting
  4. min_samples_split : default=2
    • Minimum number of samples an internal node must hold before it can be split
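  5. min_samples_leaf : default=1
    • Minimum number of samples that must remain in each leaf node (not the number of leaf nodes); a split is only considered if both branches keep at least this many samples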
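  6. min_weight_fraction_leaf : default=0.0
    • Minimum fraction of the total sample weight (summed over all input samples) that a leaf node must hold; when sample_weight is not given, all samples carry equal weight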
  7. max_features : default=None
    • Number of features to consider when looking for the best split
    • Accepts an int or a float (a fraction of n_features), as well as "auto", "sqrt", or "log2"; None uses all features
    • "auto" → deprecated, scheduled for removal in version 1.3
      • max_features=sqrt(n_features)
    • "sqrt"
      • max_features=sqrt(n_features)
    • "log2"
      • max_features=log2(n_features)
  8. random_state : default=None
    • Seed for the randomness used while building the tree. Features are randomly permuted at each split, even with splitter="best", so two fits on the same data can differ when several splits are equally good. As an analogy, imagine the possible random outcomes as n numbered sets; random_state picks which numbered one is used, so passing the same integer reproduces the same tree.
  9. max_leaf_nodes : default=None
    • Sets the maximum number of leaf nodes; the tree is grown best-first, and None means no limit
  10. min_impurity_decrease : default=0.0
    • A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
  11. class_weight : default=None
    • "balanced"
      • Weights each class inversely proportional to its frequency: n_samples / (n_classes * np.bincount(y))
    • dict, list of dict
      • Assigns a weight to each class in the form {class_label: weight}
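  12. ccp_alpha : default=0.0
    • Complexity parameter for minimal cost-complexity pruning (pruning means cutting branches off the tree); the subtree with the largest cost complexity that is smaller than ccp_alpha is chosen, and the default 0.0 performs no pruning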
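
To see how the hyperparameters above fit together, here is a minimal sketch that passes most of them to DecisionTreeClassifier. The iris dataset and every value chosen here are assumptions for illustration only; none of them come from the original post.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Illustrative values only; tune them for your own data.
params = dict(
    criterion="entropy",       # impurity measure: "gini" (default), "entropy", "log_loss"
    splitter="best",           # or "random"
    max_depth=3,               # cap the depth to limit overfitting
    min_samples_split=4,       # an internal node needs >= 4 samples to be split
    min_samples_leaf=2,        # every leaf must keep >= 2 samples
    max_features="sqrt",       # consider sqrt(n_features) features per split
    max_leaf_nodes=8,          # at most 8 leaves
    class_weight="balanced",   # weights inversely proportional to class frequency
    ccp_alpha=0.0,             # 0.0 disables cost-complexity pruning
    random_state=0,            # fixed seed -> reproducible tree
)

clf_a = DecisionTreeClassifier(**params).fit(X, y)
clf_b = DecisionTreeClassifier(**params).fit(X, y)

print("depth:", clf_a.get_depth(), "leaves:", clf_a.get_n_leaves())
print("same predictions:", (clf_a.predict(X) == clf_b.predict(X)).all())
```

Fitting twice with the same random_state should give identical trees, which is the practical reason to fix the seed while comparing other settings.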

Evaluation methods

  1. log loss
    • Rather than measuring accuracy from whether the predicted y value was right, it evaluates the probability the model assigned to the correct answer
    • The more wrong a prediction is, the larger the penalty ← it computes a numeric loss for misclassification
    • The higher the probability given to the correct class, the smaller the loss → a smaller log loss means a better model
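    from sklearn.metrics import log_loss
    
    # dt_clf is assumed to be an already fitted classifier (e.g. DecisionTreeClassifier)
    # and X_test / y_test a held-out split; neither is defined in this snippet.
    # predict_proba returns the predicted class probabilities for X_test
    logloss_dt = log_loss(y_test, dt_clf.predict_proba(X_test))
    print('log loss {0:.4f}'.format(logloss_dt))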

  • logloss = -(1/N) * Σ_{i=1..N} Σ_{m=1..M} y_i,m * log(p_i,m), where N is the number of rows
  • M is the number of classes
  • y_i,m is 1 if row i actually belongs to class m, and 0 otherwise
  • p_i,m is the predicted probability that row i belongs to class m
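
Because y_i,m is 1 only for the true class, log loss reduces to the average of -log(probability given to the true class). The sketch below computes it by hand; the function name multiclass_log_loss and the toy probabilities are made up for illustration. Both sets of predictions pick the right class every time, so accuracy cannot tell them apart, but log loss can.

```python
import math

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log of the probability assigned to each row's true class.

    y_true: list of class indices; proba: list of per-class probability lists.
    """
    total = 0.0
    for label, row in zip(y_true, proba):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)

y_true = [0, 1, 2]
# Both models rank the true class first, but with different confidence.
confident = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.05, 0.05, 0.9]]
hesitant = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]

loss_confident = multiclass_log_loss(y_true, confident)
loss_hesitant = multiclass_log_loss(y_true, hesitant)
print(f"confident: {loss_confident:.4f}, hesitant: {loss_hesitant:.4f}")
```

The confident model gets the lower loss, which matches the "smaller is better" reading above.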

2. Confusion Matrix

  • Also called an error matrix; a performance measure widely used in binary classification

 

                            Predicted class (Negative)   Predicted class (Positive)
  Actual class (Negative)   True Negative                False Positive
  Actual class (Positive)   False Negative               True Positive

  • True Negative : a Negative class predicted correctly
  • False Positive : a Negative class predicted as Positive → the answer is wrong
  • False Negative : a Positive class predicted as Negative → the answer is wrong
  • True Positive : a Positive class predicted correctly
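
The four cells can be counted directly from paired labels. This is a plain-Python sketch with made-up labels and a hypothetical helper name, confusion_counts; for 0/1 labels, sklearn.metrics.confusion_matrix(y_true, y_pred).ravel() returns the counts in the same order (TN, FP, FN, TP).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TN, FP, FN, TP) for a binary problem."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1  # positive, predicted positive
        elif actual == positive:
            fn += 1  # positive, predicted negative
        elif predicted == positive:
            fp += 1  # negative, predicted positive
        else:
            tn += 1  # negative, predicted negative
    return tn, fp, fn, tp

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```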

 

1) Accuracy

(TN + TP) / (TN + FP + FN + TP)

  • Cases where the predicted value equals the actual value / all cases
  • When the classes are heavily skewed toward one side, accuracy becomes less trustworthy
    • The model may predict the majority class well while failing to predict the other class properly
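
A small made-up example of the imbalance problem: with 95 negatives and 5 positives, a "model" that always answers Negative looks accurate while finding none of the positives.

```python
# 5 positives, 95 negatives (illustrative data)
y_true = [1] * 5 + [0] * 95
# A degenerate classifier that always predicts the majority class
y_pred = [0] * 100

correct = sum(a == p for a, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# How many of the actual positives were found at all?
positives_found = sum(a == 1 and p == 1 for a, p in zip(y_true, y_pred))

print(f"accuracy={accuracy:.2f}, positives found={positives_found} of 5")
```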

2) Recall

TP / (FN + TP)

  • Among samples whose actual value is positive, the proportion where the prediction is also positive
  • FN + TP is every case whose actual value is Positive (regardless of the prediction)
  • Recall matters most when wrongly judging an actually Positive sample as Negative has a large impact on the task
    • This mistake is a false negative, i.e. a Type II error

3) Precision

TP / (FP + TP)

  • Among samples predicted as positive, the proportion that were predicted correctly
  • FP + TP is the total number of samples predicted as positive
  • Precision matters most when judging an actually negative sample as positive has a large impact on the task
    • This mistake is a false positive, i.e. a Type I error
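
Recall and precision usually pull against each other. The sketch below, using made-up scores and a hypothetical helper precision_recall, turns probability scores into labels at two thresholds: a low threshold catches every positive (high recall, more false positives), while a high threshold only flags sure cases (high precision, more false negatives).

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall after labelling score >= threshold as positive."""
    tp = fp = fn = 0
    for actual, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.65, 0.8, 0.4, 0.9]

low = precision_recall(y_true, scores, threshold=0.3)
high = precision_recall(y_true, scores, threshold=0.7)
print("threshold 0.3 -> precision %.2f, recall %.2f" % low)
print("threshold 0.7 -> precision %.2f, recall %.2f" % high)
```

Which threshold is better depends on which error costs more for the task: false negatives (favor recall) or false positives (favor precision).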

Sources