💡 WIDA/DACON Classification-Regression

[DACON/조아영] A Look at Classification Models

려우 2023. 3. 24. 00:49

This post was written with reference to 파이썬 머신러닝 완벽 가이드 (Python Machine Learning Perfect Guide), 1st edition, by 권철민.


Classification

  • A model is trained on data whose correct answers (labels) are given, so that it can then classify data on its own

Source: https://velog.io/@uvictoli/Codeit-머신-러닝-결정-트리와-앙상블-기법-01.-결정-트리

  • Decision Tree
    • Finds and learns the rules in the data on its own and builds tree-based classification rules from them
    • The simplest way to express the rules is as if/else conditions
    • → The criteria used to split the data must produce the most efficient classification possible for the algorithm to perform well
    • Internal nodes are rule nodes, and terminal nodes (= leaf nodes) hold the decided class (label) value
    • The more rule nodes there are, the more complex the path to a class value becomes → the deeper the tree, the more prediction performance can drop
    • → The risk of overfitting grows
    • To limit overfitting, the splits should produce datasets that are as homogeneous as possible
      • What is homogeneous data?
        • Data that can be told apart clearly without needing much information
    • A decision tree builds its rules so as to find the most homogeneous datasets
    • The two representative measures of homogeneity are the information gain index, which is based on entropy, and the Gini coefficient
      • Information gain index
        • Information gain is based on the concept of entropy
          • Entropy describes how mixed (disordered) a given set of data is
          • The larger the entropy, the lower the homogeneity of the data; the smaller the entropy, the higher the homogeneity
        • Information gain index = 1 - entropy index (more generally, information gain is the parent node's entropy minus the weighted average entropy of its child nodes)
        • The split criterion is chosen by information gain: the data is split on the condition that gives a high information gain
      • Gini coefficient
        • An index used in economics; the closer to 0 the more equal, the closer to 1 the more unequal
        • The lower the Gini coefficient, the higher the homogeneity of the data; the higher the Gini coefficient, the lower the homogeneity
        • The data is split on the condition that gives a low Gini coefficient
    • The DecisionTreeClassifier provided by scikit-learn splits the dataset using the Gini coefficient by default (criterion='gini')
      • It looks for conditions with a low Gini coefficient, then splits and classifies the data according to those conditions
    • Advantages
      • The dataset is classified by a single criterion, homogeneity → the algorithm is simple and intuitive
      • Apart from the homogeneity of the data, steps such as feature scaling and other preprocessing generally need little attention
    • Disadvantages
      • High likelihood of overfitting
        • Accepting that the tree cannot satisfy every case and tuning settings such as the tree depth can give better performance
    • Code
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

# Create the Decision Tree model
dt_clf = DecisionTreeClassifier()

# Load the training dataset
train_data = pd.read_csv("C:/Users/1ayou/PycharmProjects/dacon_astronomy/dataset/train.csv")

# Column 1 is used as the target; columns 2 onward are the features
X = train_data.iloc[:, 2:]
y = train_data.iloc[:, 1]

# Hold out 30% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=11)

dt_clf.fit(X_train, y_train)

pred = dt_clf.predict(X_test)
accuracy = accuracy_score(y_test, pred)

print('Accuracy: {0:.4f}'.format(accuracy))
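
The homogeneity measures described in the Decision Tree notes can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: the function names and the "star"/"galaxy" labels are made up for the example, entropy is taken in base 2, and the Gini value is computed as the impurity 1 - sum(p^2) that tree libraries commonly use.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n) for n in Counter(labels).values())


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical labels: one evenly mixed node and one perfectly homogeneous node.
mixed = ["star", "star", "galaxy", "galaxy"]
pure = ["star", "star", "star", "star"]

print(entropy(mixed), gini(mixed))  # 1.0 0.5
print(entropy(pure), gini(pure))    # 0.0 0.0

# A split that separates the two classes completely recovers all of the entropy.
print(information_gain(mixed, [mixed[:2], mixed[2:]]))  # 1.0
```

Both measures are at their lowest for the homogeneous node, which is why a split is preferred when it lowers Gini or raises information gain.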
  • Ensemble
    • Instead of relying on a single classifier, the predictions of several classifiers are combined to produce the final result
    • For classifying structured (tabular) data, ensemble models generally outperform a single model
    • Learning types
      • Voting
        • Several classifiers decide the final prediction by voting
        • The classifiers all use different algorithms
        • Voting types
          • Hard voting
            • Majority rule
            • The label predicted by the majority of the classifiers is used as the final result
          • Soft voting
            • The class probabilities from the classifiers are summed and averaged, and the label with the highest average probability is used as the final result
      • Bagging
        • Several classifiers decide the final prediction by voting
        • The classifiers all use the same algorithm
        • Each one is trained on a differently sampled set of data
      • Boosting
        • Several classifiers are trained sequentially
        • When an earlier classifier mispredicts a sample, that sample is weighted more heavily for the next classifier so that it can be predicted correctly
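
The difference between hard and soft voting can be shown with a small sketch in plain Python. The helper names, the labels "A" and "B", and the probabilities are hypothetical values chosen so that the two schemes disagree; this is an illustration of the idea, not the code of any particular library.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the label predicted by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]


def soft_vote(probabilities):
    """Average the per-class probabilities and return the most probable label."""
    labels = probabilities[0].keys()
    averaged = {
        label: sum(p[label] for p in probabilities) / len(probabilities)
        for label in labels
    }
    return max(averaged, key=averaged.get)


# Hypothetical class probabilities from three classifiers for one sample.
probs = [
    {"A": 0.9, "B": 0.1},
    {"A": 0.4, "B": 0.6},
    {"A": 0.45, "B": 0.55},
]

# Each classifier's hard prediction is its own most probable label.
hard_labels = [max(p, key=p.get) for p in probs]  # ['A', 'B', 'B']

print(hard_vote(hard_labels))  # B (two of the three classifiers chose B)
print(soft_vote(probs))        # A (average probability of A is about 0.58)
```

In scikit-learn the same choice is exposed through the `voting` parameter of `VotingClassifier` (`'hard'` or `'soft'`).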
  • Random Forest
    • bagging
      • Several classifiers are built with the same algorithm, and the final decision is made by voting
    • code
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

# Create the Decision Tree model
dt_clf = DecisionTreeClassifier()
# Bagging ensemble that uses the decision tree as its base estimator
bag_clf = BaggingClassifier(dt_clf)

# Load the training dataset
train_data = pd.read_csv("./dataset/train.csv")

# Column 1 is used as the target; columns 2 onward are the features
X = train_data.iloc[:, 2:]
y = train_data.iloc[:, 1]

# Hold out 30% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=11)

dt_clf.fit(X_train, y_train)
bag_clf.fit(X_train, y_train)

pred_dt = dt_clf.predict(X_test)
pred_bg = bag_clf.predict(X_test)

accuracy_dt = accuracy_score(y_test, pred_dt)
accuracy_bg = accuracy_score(y_test, pred_bg)

# Print both scores with distinct labels so they can be compared
print('Decision tree accuracy: {0:.4f}'.format(accuracy_dt))
print('Bagging accuracy: {0:.4f}'.format(accuracy_bg))
  • RandomForestClassifier
    • A random forest classifier is an ensemble model of decision trees trained on randomized data
      • 임의화: Randomization
      • Random Forest: combines randomized Tree models → presumably called a Forest because it contains many Tree models
    • Several decision trees are trained on the data by bagging, and the final prediction is produced by voting
    • Each tree's training set is randomly sampled so that the sets partly overlap → this is the bootstrapping method (sampling with replacement)
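
Bootstrapping can be sketched with the standard library alone. In this illustration `bootstrap_sample` is a hypothetical helper, the ten integers stand in for training-row indices, and the seed is fixed only to make the run repeatable; it shows the sampling idea, not how any library implements it internally.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement."""
    return [rng.choice(rows) for _ in rows]


rng = random.Random(11)  # fixed seed so the run is repeatable
rows = list(range(10))   # stand-in for 10 training-row indices

# One bootstrap sample per tree; three trees are enough to show the overlap.
samples = [bootstrap_sample(rows, rng) for _ in range(3)]

for i, sample in enumerate(samples, start=1):
    distinct = len(set(sample))
    print(f"tree {i}: {sorted(sample)} ({distinct} distinct rows)")
```

Because rows are drawn with replacement, a sample usually repeats some rows and omits others, so the samples for different trees overlap without being identical. scikit-learn's `RandomForestClassifier` enables this behaviour by default through `bootstrap=True`.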