
[DACON/김규리] Project Essay

kyuree 2023. 5. 5. 10:41

1. EDA & Preprocessing

Introduction

Celestial Object Type Classification Competition


Background
Hello everyone! Welcome to the Celestial Object Type Classification Competition.
The term "big data", which has only recently reached the general public, is nothing new to space and astronomy. Even in the briefest instant the universe produces astronomical amounts of data, and astronomers have long observed the universe and collected and analyzed data in proportion to its vastness.
The Sloan Digital Sky Survey (SDSS) is a world-class sky survey project that collects data about the universe on an astronomical scale. Data collected there has been used in about 6,000 papers and cited more than 250,000 times, a major contribution to astronomy. As the scale keeps growing, machine learning and deep learning techniques have begun to be used for data processing.
The universe still holds many unknown stories, and humanity has advanced far enough to obtain a great deal of data from the sky. By analyzing this data, rules that have not yet come to light may be revealed at your fingertips. Please uncover the secrets of the universe with new algorithms!
 
Purpose
Develop an algorithm for classifying celestial object types
 

EDA

#Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#Data for training and testing
train = pd.read_csv('./dataset_dsob/train.csv')
test = pd.read_csv('./dataset_dsob/test.csv')
sample_submission = pd.read_csv('./dataset_dsob/sample_submission.csv')

 
data info

#Check the size of the data
print('Size of train data', train.shape)
print('Size of test data', test.shape)

 

 

#Check the summary information of the data
train.info()
train.describe()

 

 

 
Exploring individual variables
Type

#Exploring individual variables - type
#Number of rows per type - overall distribution
type_counts = train.groupby('type')['fiberID'].count()

plt.figure(figsize=(15,6))

plt.title('Type count')
plt.ylabel('count')
plt.xlabel('Type')
plt.bar(type_counts.index, type_counts.values)
plt.xticks(rotation=90)
plt.show()

 

#Number of rows per type - actual counts
train['type'].value_counts()

-> The classes are quite imbalanced => some countermeasure is needed
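One possible countermeasure, which the post itself does not apply, is to weight rows inversely to their class frequency. The sketch below is only an illustration under that assumption: the label values are toy stand-ins for train['type'], and the "balanced" formula n_samples / (n_classes * class_count) is one common choice among several.

```python
import pandas as pd

# Toy labels standing in for train['type'] (hypothetical values).
types = pd.Series(['QSO'] * 6 + ['STAR'] * 3 + ['GALAXY'] * 1, name='type')

counts = types.value_counts()
# "Balanced" weight per class: n_samples / (n_classes * class_count).
class_weight = len(types) / (len(counts) * counts)
# One weight per row, aligned with the labels.
sample_weight = types.map(class_weight)

print(class_weight.round(3).to_dict())
```

Weights like these could be passed as the sample_weight argument of XGBClassifier.fit, so that rare types count for more during training. Whether this helps on this dataset would need to be checked on a validation split.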
 
fiberID

#fiberID
fiber_counts = train['fiberID'].value_counts()
sns.barplot(x=fiber_counts.index, y=fiber_counts.values)
plt.xticks(rotation=90);

-> To see the overall distribution I tried to plot the number of rows for each fiberID value, but it did not work.. => let's draw a scatter plot instead of a bar chart!

sns.set_style(style='whitegrid')

# Number of rows per fiberID value, ordered by fiberID
fiber_counts = train['fiberID'].value_counts().sort_index()

sns.scatterplot(x=fiber_counts.index, y=fiber_counts.values)

plt.title('fiberID\'s counts')
plt.xlabel('fiberID')
plt.ylabel('counts')
plt.show()

-> Most of the values are concentrated below the 600s
Just as the type analysis showed the classes concentrated in QSO, I wanted to look at fiberID in more detail.
 
type-fiberID
*Code reference (https://github.com/Leewongi0731/DACON_CelestialClassification/blob/master/celestial_classifier.ipynb)

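# Create the type_num column

# Map each type name to an integer, following the column order of sample_submission.
# Columns that are not type names (such as an id column) are skipped so the labels start at 0.
type_names = set(train['type'])
column_number = {}
for column in sample_submission.columns:
    if column in type_names:
        column_number[column] = len(column_number)

train['type_num'] = train['type'].map(column_number)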


#Relationship between fiberID and type
#Visualization for looking at how type and fiberID are related

# One panel per type, ordered by type_num
types = sorted(train['type'].unique(), key=column_number.get)
fig, ax = plt.subplots(nrows=len(types), ncols=1, figsize=(15, 40))
for i, type_name in enumerate(types):
    fiber = train.loc[train['type'] == type_name, 'fiberID']
    ax[i].scatter(fiber, range(len(fiber)))
    ax[i].set_xlim(0, 1000)
    ax[i].set_ylabel('count')
    ax[i].set_xlabel('fiberID')
    ax[i].set_title(type_name)

-> Except for QSO, the fiberID of every other celestial object lies between 0 and the 600s
Let's find the exact fiberID cut-off

#Find the fiberID cut-off point using the maximum per type
train.groupby('type')['fiberID'].max()

-> Using the maximum => QSO: 1000, all other types: 640
=> So the split point is at fiberID = 640
 
Checking once more, the only celestial objects with fiberID > 640 are QSO

train[ train['fiberID'] > 640 ]['type'].value_counts()

 
-> This looks like something to keep in mind when classifying later
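A minimal sketch of how this observation could be checked and carried forward as a feature. The data below is a toy stand-in for train[['type', 'fiberID']], and the column name fiberID_gt_640 is a hypothetical helper that does not appear in the original notebook.

```python
import pandas as pd

# Toy rows standing in for train[['type', 'fiberID']] (hypothetical values).
toy = pd.DataFrame({
    'type': ['QSO', 'QSO', 'GALAXY', 'STAR', 'QSO', 'GALAXY'],
    'fiberID': [900, 120, 300, 640, 641, 15],
})

# Hypothetical helper feature: True when fiberID is above the observed cut-off.
toy['fiberID_gt_640'] = toy['fiberID'] > 640

# Rows = flag, columns = type; a non-QSO count in the True row would break the rule.
table = pd.crosstab(toy['fiberID_gt_640'], toy['type'])
print(table)

# True only if every row above the cut-off is a QSO.
only_qso_above = set(toy.loc[toy['fiberID_gt_640'], 'type']) == {'QSO'}
print(only_qso_above)
```

A tree model can learn the same split from fiberID directly, so the explicit flag is mainly useful as a sanity check or for rule-based post-processing. The rule also only reflects the training data and may not hold for the test set.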
 

Preprocessing

Outlier exploration

# Magnitude columns selected by name (assumes they all contain 'Mag_', e.g. psfMag_u)
mag_cols = train.filter(like='Mag_').columns

fig, ax = plt.subplots(nrows=len(mag_cols), ncols=1, figsize=(20, 70))

for i, col in enumerate(mag_cols):
    ax[i].scatter(x=train.index, y=train[col].values)
    ax[i].set_title(col)

plt.show()

-> Most values cluster around a single level, but there are quite a few outliers lying extremely far away => some of them need to be removed
 
Outlier exploration by type - using boxplots

# One boxplot per magnitude column (columns whose names contain 'Mag_'), grouped by type
for s_type in train.filter(like='Mag_').columns:
    plt.figure(figsize=(18,3))
    sns.boxplot(x='type', y=s_type, data=train)
    plt.title(s_type)
    plt.xticks(rotation=-90)
    plt.show()

-> Columns with only one or two outliers, as in this plot, are not a big problem,

but when there are many of them, as in this plot, I judged that they should be removed
 
Check the maximum and minimum of each column to choose an outlier threshold

# Maximum and minimum of each magnitude column (columns whose names contain 'Mag_')
for col in train.filter(like='Mag_').columns:
    print(col, train[col].max())
    print(col, train[col].min())

However, the maximum and minimum alone made it hard to choose an outlier threshold, so I tried 100, 200, 300, 400, 500, ... up to 1000 one at a time and counted the outliers each time

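# For each magnitude column, the types of the rows whose absolute value exceeds the candidate threshold
for col in train.filter(like='Mag_').columns:
    print(train.loc[train[col].abs() > 500, 'type'].value_counts())
    print('===================================================')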

With 100, 200 and 300 the allowed range was too narrow, so too many values counted as outliers; with 1000 or 2000 the range was too wide, so too few did
Thresholds from 500 to 800 gave similar counts, so 500 was chosen as the outlier threshold
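The threshold sweep described above can be sketched as a single loop instead of re-running the cell by hand. This is a minimal illustration on toy data: the column names mimic the magnitude columns, and the injected extreme values are made up, so the counts say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Toy magnitude table (hypothetical values): ordinary magnitudes plus a few extremes.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(20, 2, size=(1000, 4)),
                   columns=['psfMag_u', 'psfMag_g', 'psfMag_r', 'psfMag_i'])
toy.iloc[:5, 0] = [150, -250, 600, 900, 5000]
toy.iloc[5:8, 1] = [-700, 1200, 450]

thresholds = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
abs_vals = toy.abs()
# For each threshold, count the rows in which at least one column exceeds it.
rows_over = pd.Series(
    {t: int((abs_vals > t).any(axis=1).sum()) for t in thresholds},
    name='rows_over_threshold',
)
print(rows_over)
```

A flat stretch in the resulting counts (here the analogue of the 500 to 800 range in the post) suggests that the threshold separates ordinary values from the extreme ones, which is the reasoning used to settle on 500.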
 
Then remove them

# Flag rows in which more than 5 magnitude columns exceed 500 in absolute value
# (magnitude columns are selected by name: those containing 'Mag_')
mag_cols = train.filter(like='Mag_').columns
n_extreme = (train[mag_cols].abs() > 500).sum(axis=1)
remove_outlier_list = train.index[n_extreme > 5].tolist()

print(remove_outlier_list)

-> Rows in which more than five columns hold a value above 500 in absolute value are added to the removal list

#Create a new DataFrame with the outlier rows removed
train_c = train.drop(remove_outlier_list)

#Visualization for comparison after removing the outliers
mag_cols = train_c.filter(like='Mag_').columns  # magnitude columns, selected by name
fig, ax = plt.subplots(nrows=len(mag_cols), ncols=1, figsize=(20, 70))

for i, col in enumerate(mag_cols):
    ax[i].scatter(x=train_c.index, y=train_c[col].values)
    ax[i].set_title(col)

plt.show()

-> Looking at the y-axis ranges, most of the outliers have been removed, but in columns that had few outliers to begin with, such as 'psfMag_u', they are hardly removed at all. Still, the ones likely to have a large effect later have been removed, so I moved on
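The visual check can be backed by a count of extreme values per column before and after removal. The sketch below uses toy stand-ins for train and train_c (all values hypothetical) and shows why a column such as psfMag_u can keep its outliers: a row with a single extreme value is never flagged by a rule that requires several extreme columns at once.

```python
import pandas as pd

# Toy stand-ins for train and the cleaned frame (hypothetical values).
toy = pd.DataFrame({
    'psfMag_u': [20.0, 900.0, 21.0, 19.5],
    'psfMag_g': [19.0, 18.5, -9999.0, 20.0],
    'psfMag_r': [18.0, 18.2, -9999.0, 19.0],
})
toy_clean = toy.drop([2])  # pretend only row 2 was flagged as an outlier row

# Count values beyond the threshold in each column, before and after removal.
threshold = 500
summary = pd.DataFrame({
    'before': (toy.abs() > threshold).sum(),
    'after': (toy_clean.abs() > threshold).sum(),
})
print(summary)
```

If the leftover values turn out to matter, a per-column treatment such as clipping could be considered, though the post does not go that far.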
 

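Modeling

Convert the type of the train_c data to numeric form for training

# Map each type name to an integer, following the column order of sample_submission.
# Columns that are not type names (such as an id column) are skipped so the labels start at 0.
type_names = set(train_c['type'])
column_number = {}
for column in sample_submission.columns:
    if column in type_names:
        column_number[column] = len(column_number)

train_c['type_num'] = train_c['type'].map(column_number)
train_c['type_num']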

Import xgboost, the model to be used

#Modeling
from xgboost import XGBClassifier
import xgboost as xgb
from xgboost import plot_importance
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, roc_auc_score, f1_score

 
Training data

# Features: drop the target columns, plus the row identifier 'id' if it is present
x = train_c.drop(columns=['type_num', 'type', 'id'], errors='ignore')
y = train_c['type_num'].values
test_x = test[x.columns]  # same feature columns, in the same order, as x
RANDOM_SEED = 2000 #used as random_state below
x_train, x_valid, y_train, y_valid = train_test_split(x, y,
                                                  test_size=0.2, random_state=RANDOM_SEED, stratify=y)
xgb_clf = XGBClassifier(booster='gbtree',
                 verbosity=0, #replaces the deprecated silent=True
                 max_depth=5,
                 min_child_weight=8,
                 gamma=1, n_estimators=100,
                 colsample_bytree=0.6,
                 colsample_bylevel=0.6,
                 objective='multi:softprob',
                 random_state=RANDOM_SEED)

xgb_clf.fit(x_train, y_train, eval_set=[(x_train, y_train), (x_valid, y_valid)])

Setting the parameters
- gbtree -> chosen because trees seemed a better fit than gblinear

Checking the predictions

xgb_pred = xgb_clf.predict_proba(test_x)

print(np.round(xgb_pred[:5], 3))

For now I only looked at the first five rows, and I can't tell whether the predictions are any good
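One way to judge the predictions is to score the held-out validation split rather than eyeball test rows. The sketch below assumes multi-class log loss as the yardstick, a common metric for probability submissions, though the post does not state the competition metric. It is written in plain NumPy on toy values; in the notebook the inputs would be y_valid and xgb_clf.predict_proba(x_valid).

```python
import numpy as np

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    proba = proba / proba.sum(axis=1, keepdims=True)  # renormalize after clipping
    rows = np.arange(len(y_true))
    return float(-np.mean(np.log(proba[rows, y_true])))

# Toy validation labels and predicted probabilities (hypothetical values).
y_valid_toy = np.array([0, 2, 1, 2])
proba_toy = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.4, 0.3],
    [0.5, 0.3, 0.2],
])

# Each row of predict_proba output should be a probability distribution.
rows_sum_to_one = bool(np.allclose(proba_toy.sum(axis=1), 1.0))
accuracy = float(np.mean(proba_toy.argmax(axis=1) == y_valid_toy))
loss = multiclass_log_loss(y_valid_toy, proba_toy)
print(f"rows sum to 1: {rows_sum_to_one}, accuracy={accuracy:.3f}, log loss={loss:.3f}")
```

A useful reference point is the uniform guess, whose log loss is ln(K) for K classes; a model that does not beat that value has learned nothing useful. scikit-learn's log_loss and accuracy_score compute the same quantities.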
 
Feature importance visualization

%matplotlib inline

fig, ax = plt.subplots(figsize=(10, 12))
plot_importance(xgb_clf, ax=ax)

Sources
https://dacon.io/competitions/official/235573/codeshare/694?page=1&dtype=recent
https://dacon.io/competitions/official/235573/talkboard/400609
https://velog.io/@leo_kim/부스팅-계열-앙상블-알고리즘
https://youtu.be/8b1JEDvenQU
https://youtu.be/GrJP9FLV3FE
https://teddylee777.github.io/scikit-learn/scikit-learn-ensemble/
https://hwi-doc.tistory.com/entry/이해하고-사용하자-XGBoost

https://github.com/Leewongi0731/DACON_CelestialClassification/blob/master/celestial_classifier.ipynb