💡 WIDA/DACON Classification-Regression

[DACON/김경은] Project Essay

경은 2023. 5. 5. 21:57

EDA (Exploratory Data Analysis)

Loading the libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Loading the data

train = pd.read_csv("C:/Users/twink/Documents/카카오톡 받은 파일/train.csv")
test = pd.read_csv("C:/Users/twink/Desktop/test.csv")
sub = pd.read_csv("C:/Users/twink/Desktop/sample_submission.csv")

 

Load the data and look at what kind of data it contains

  • Checking the shape
# number of rows and columns
print(train.shape)
print(test.shape)

train.columns

test.columns

Description of the columns in train and test

psfMag, fiberMag, petroMag, modelMag

  • psfMag : brightness measured by treating a distant celestial object as a single point
  • fiberMag : brightness of the light that passes through the optical fiber used to measure the object's spectrum (SDSS fibers are 3 arcseconds in diameter)
  • petroMag : for objects without a sharp edge, such as galaxies, brightness is hard to measure; this is a value for comparing brightness regardless of the object's position and distance
  • modelMag : brightness derived from a model of how the light falls off with distance from the center of the object

Mag - magnitudes (luminosity : a quantity related to the brightness of the light coming from a source, obtained by weighting each wavelength according to the human eye's sensitivity to the colors of visible light)

 

Wavelength bands

u : Ultraviolet

g : Green, the green part of visible light

r : Red, the red part of visible light

i : Near Infrared (part of the infrared)

z : Infrared

 

Ultraviolet : electromagnetic radiation (EMR) with wavelengths in the 10-400 nm range, lying between visible light and X-rays

Near infrared : infrared radiation adjacent to visible light in the electromagnetic spectrum, with wavelengths of 750-2500 nm

Infrared : electromagnetic radiation with wavelengths longer than visible light and shorter than radio waves

 

Why the u, g, r, i, z bands?

The ugriz photometric system was developed for use with the SDSS (Sloan Digital Sky Survey) and is also called the SDSS photometric system.

 

A quick look using pandas methods

  • .head() : shows the first 5 rows; pass a number in the parentheses to change the count
train.head()

Two ways to get summary information about each feature

  • .info() 
  • .describe()
train.info()

info() shows how many non-null values each feature has (and therefore how many nulls) and the type of each column.

object can be understood as a non-numeric data type; in most cases it is a string.

Features of type object cannot be used directly as input to most machine learning models, so they need preprocessing.
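As a minimal sketch of that preprocessing step, the snippet below finds the object columns and turns one of them into integer codes. The toy values and the type_code column name are made up for illustration, and category codes are only one of several possible encodings.

```python
import pandas as pd

# Toy frame; column names mimic the competition data, values are made up.
df = pd.DataFrame({
    "type": pd.Series(["QSO", "STAR_RED_DWARF", "QSO", "GALAXY"], dtype=object),
    "fiberID": [601, 12, 788, 305],
    "psfMag_u": [19.4, 22.1, 18.7, 20.3],
})

# Columns stored as object (non-numeric, usually strings).
object_cols = df.select_dtypes(include="object").columns.tolist()

# One simple encoding: each distinct string becomes an integer code
# (categories are sorted alphabetically before codes are assigned).
df["type_code"] = df["type"].astype("category").cat.codes

print(object_cols)
print(df[["type", "type_code"]])
```

Tree models such as LightGBM can work with integer codes like these, while linear models usually need one-hot encoding instead.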

train.describe()

describe returns summary statistics for the numeric columns, which also helps to spot outliers.

  • count : number of values
  • mean : mean
  • std : standard deviation
  • min : minimum
  • 25% : first quartile
  • 50% : median
  • 75% : third quartile
  • max : maximum
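The quartiles from describe can be turned into a rough outlier check. The sketch below applies the common 1.5 x IQR rule to a made-up column; the rule and the toy values are assumptions for illustration, not the criterion used in this essay.

```python
import pandas as pd

# Made-up magnitudes with one obviously extreme value.
s = pd.Series([18.2, 19.1, 19.5, 20.0, 20.4, 21.0, 95.0], name="psfMag_u")

stats = s.describe()
iqr = stats["75%"] - stats["25%"]

# Values more than 1.5 * IQR outside the quartiles are flagged.
lower = stats["25%"] - 1.5 * iqr
upper = stats["75%"] + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]

print(outliers)
```

Comparing max against the 75% value in the describe output gives the same hint at a glance: a max far above the third quartile is worth a closer look.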

Checking for missing values

train.isna().sum()

  • There are no missing values.

 

Visualizing type

  • Using seaborn
  • seaborn is a visualization library built on matplotlib that provides a high-level interface for drawing statistical graphics.
# import the library
import seaborn as sns
plt.figure(figsize=(10,8))
sns.countplot(y=train['type'], order=train['type'].value_counts().index)
plt.show()

The values of the type column are counted and shown as a bar chart sorted in descending order.

-> QSO has far more samples than the other classes, so the classes are severely imbalanced.

 

 

 

fiberID visualization

  • The count drops sharply once fiberID reaches about 640.

Finding out why the count drops from fiberID 640

train.groupby('type')['fiberID'].max()

I looked at the maximum fiberID for each class.

  • For every class except QSO, fiberID never exceeds 640.
  • Excluding rows with fiberID of 640 or above, in the spirit of undersampling, might ease the class imbalance caused by QSO having far more samples than the other classes.
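The idea in the last bullet can be checked by comparing class counts before and after the filter. This is only a sketch on made-up rows; the threshold of 640 comes from the observation above, and whether to drop rows at exactly 640 is an assumption (here they are kept).

```python
import pandas as pd

# Made-up rows: QSO dominates and is the only class above fiberID 640.
df = pd.DataFrame({
    "type": ["QSO"] * 6 + ["GALAXY"] * 2 + ["STAR_RED_DWARF"] * 2,
    "fiberID": [100, 300, 650, 700, 800, 900, 50, 400, 200, 640],
})

THRESHOLD = 640  # assumption: keep rows up to and including 640

before = df["type"].value_counts()
kept = df[df["fiberID"] <= THRESHOLD]
after = kept["type"].value_counts()

print(before)
print(after)
```

On the real data the counts after filtering would show how much of the imbalance this actually removes, which is worth checking before deleting anything.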

 

 

Finding outliers

Box plots were used to visualize the outliers in each column.

for col in train.columns[3:]:
  plt.figure(figsize=(15,2))
  sns.boxplot(x='type',y=col,data=train)
  plt.title(col)
  plt.xticks(rotation=-30)
  plt.show()

Code source : "[2등][JY!] 코드 공유" ([2nd place][JY!] code share)

Some columns appear to need their extreme values or outliers removed.

There are not many outliers, but some columns, such as psfMag_u and petroMag_u, contain conspicuous extreme values.

However, at this point it cannot yet be confirmed whether these values are really outliers.

 

 

For scaling or feature engineering, the train and test datasets are often combined, processed in one go, and then split again; data that does not look like an outlier when viewed in this larger dataset is better left undeleted and unmodified.

So I combined the two datasets and then checked the outliers.

  • Checking outliers after combining the train and test datasets
total = pd.concat([train,test], axis=0)

for col in total.columns[3:]:
  plt.figure(figsize=(25,4))
  sns.boxplot(x='type',y=col,data=total)
  plt.title(col)
  plt.xticks(rotation=-30)
  plt.show()

Looking at the box plots of the total dataset, the outlier distributions of the train and test datasets hardly differ. In other words, the train and test distributions can be expected to match closely.
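One caveat with this comparison: rows that come from test have no type value, so a box plot grouped by type may leave them out. A hedged alternative is to tag each row with its origin and compare the two groups directly. The dataset column name and the toy values below are assumptions for illustration.

```python
import pandas as pd

# Made-up data: train has labels, test does not.
train = pd.DataFrame({"type": ["QSO", "GALAXY", "QSO"], "psfMag_u": [19.0, 21.0, 60.0]})
test = pd.DataFrame({"psfMag_u": [18.5, 20.0, 22.5]})

# Tag every row with the dataset it came from before concatenating.
total = pd.concat(
    [train.assign(dataset="train"), test.assign(dataset="test")],
    axis=0,
    ignore_index=True,
)

# Rows from test carry a missing type.
n_missing_type = int(total["type"].isna().sum())

# Compare the value range of a column between the two datasets.
summary = total.groupby("dataset")["psfMag_u"].agg(["min", "max"])
print(n_missing_type)
print(summary)
```

With the dataset column in place, the same frame could also be passed to a box plot with x set to dataset, which puts train and test side by side.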

 

 

Therefore I decided to remove extreme values by deleting from the train dataset any data that falls outside the range of the test dataset.

for col in train.columns[3:]:
    train = train[(train[col] > test[col].min()) & (train[col] < test[col].max())]

The train variable is reassigned so that only rows whose column values lie strictly between the minimum and maximum of the test dataset are kept, and all other rows are removed.
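The same filter can be written without reassigning train inside a loop. The sketch below, on made-up values, builds one boolean mask for all columns and checks that it keeps the same rows as the loop; the column list and variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up values; row 1 and row 2 each fall outside the test range in one column.
train = pd.DataFrame({"psfMag_u": [19.0, 21.0, 60.0, 20.0],
                      "psfMag_g": [18.0, -5.0, 19.0, 18.5]})
test = pd.DataFrame({"psfMag_u": [18.5, 20.5, 22.5],
                     "psfMag_g": [17.0, 18.2, 19.5]})

cols = ["psfMag_u", "psfMag_g"]
lo, hi = test[cols].min(), test[cols].max()

# True where a value lies strictly inside the test range of its column.
inside = train[cols].gt(lo, axis="columns") & train[cols].lt(hi, axis="columns")
filtered = train[inside.all(axis=1)]

# Loop version, as in the essay, for comparison.
looped = train.copy()
for col in cols:
    looped = looped[(looped[col] > test[col].min()) & (looped[col] < test[col].max())]

print(filtered)
```

Because the comparisons are strict, a train row that equals the test minimum or maximum is also dropped; using ge and le instead would keep it.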

 

 

Box plots after preprocessing

 

 

 

Modeling with LightGBM

Definition of LightGBM

LightGBM is a gradient boosting framework that uses tree-based learning algorithms.

LightGBM grows trees vertically, whereas many other algorithms grow trees horizontally; that is, LightGBM grows leaf-wise while the others grow level-wise.

Its advantages are that it can handle large datasets, trains quickly, and uses memory efficiently.

Its disadvantage is that it is prone to overfitting, which makes it hard to use on small datasets.

 

 

 

Data

train2 = train.copy()

The preprocessed train dataset is copied into a new variable, train2.

Importing the LGBM model

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import lightgbm as lgb

Training data and validation data

# separate type (the target) from the remaining columns
X = train2.iloc[:, 2:]
y = train2.iloc[:, 1]

# split the dataset into training data and validation data
# 30% of the rows are held out for validation (test_size=0.3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
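Because the classes are imbalanced, a plain random split can give the validation set a different class mix from the training set. scikit-learn's train_test_split accepts a stratify argument for this; the pandas-only sketch below shows the same idea on made-up data, sampling 30% within each class. The variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up data with a 2:1 class imbalance.
df = pd.DataFrame({
    "type": ["QSO"] * 20 + ["GALAXY"] * 10,
    "psfMag_u": [float(i) for i in range(30)],
})

# Sample 30% of the rows inside each class for validation.
valid = df.groupby("type").sample(frac=0.3, random_state=42)

# Everything not sampled stays in the training part.
train_part = df.drop(valid.index)

print(valid["type"].value_counts())
print(train_part["type"].value_counts())
```

Both parts keep the original 2:1 ratio, so validation accuracy is measured on the same class mix the model was trained on.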

 

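Converting the labels to integers for the LGBM model

# map each column name of the sample submission to its position
column_number = {}
for i, column in enumerate(sub.columns):
    column_number[column] = i

def to_number(x, dic):
    return dic[x]

# note: X and y were built before this column is added, so the model below still trains on the string labels in y
train2['type_num'] = train2['type'].apply(lambda x: to_number(x, column_number))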
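A shorter way to build the same kind of mapping is a dict comprehension plus Series.map. The sketch below assumes the sample submission has an id column followed by one column per class; the class names and the class_to_index name are made up for illustration. Dropping id first makes the codes start at 0.

```python
import pandas as pd

# Hypothetical sample submission header: id, then one column per class.
sub = pd.DataFrame(columns=["id", "GALAXY", "QSO", "STAR_RED_DWARF"])
train2 = pd.DataFrame({"type": ["QSO", "GALAXY", "QSO", "STAR_RED_DWARF"]})

# Class name -> zero-based position among the class columns.
class_to_index = {name: i for i, name in enumerate(sub.columns.drop("id"))}

# map looks every label up in the dict in one vectorized call.
train2["type_num"] = train2["type"].map(class_to_index)

print(class_to_index)
print(train2)
```

Keeping the codes in submission-column order is convenient later, because predicted class probabilities can then be written to the submission file column by column.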

 

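Building the model

# note: max_depth=1 allows a single split per tree (at most 2 leaves), so num_leaves=34 cannot take effect with this setting
lgb_clf = lgb.LGBMClassifier(num_leaves=34, objective='multiclass', max_depth=1)
lgb_clf.fit(X_train, y_train)
# X_test and y_test are the validation split created above, not the competition test set
y_pred = lgb_clf.predict(X_test)

LightGBM hyperparameter settings : since there are more than 100 parameters, I looked up the most representative ones and set those.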

 

num_leaves: the maximum number of leaves in one tree; the default is 31. Larger values can raise accuracy but may cause overfitting.

min_data_in_leaf: the minimum number of records a leaf must contain; larger values help prevent overfitting.

max_depth: the maximum depth of a tree; lowering it can help with overfitting.

objective: chosen according to the form of the target in the dataset: regression for numeric prediction, binary for binary classification, and multiclass for a multi-class target such as type here.

metric: the evaluation metric used to measure performance; it should be chosen to fit the task.
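num_leaves and max_depth interact: a binary tree of depth d has at most 2^d leaves, so a small max_depth caps the number of leaves no matter what num_leaves says. The helper functions below are hypothetical and only illustrate that arithmetic; they are not part of LightGBM.

```python
def max_leaves_for_depth(max_depth: int) -> int:
    """Upper bound on the leaves of a binary tree with the given depth."""
    return 2 ** max_depth


def effective_num_leaves(num_leaves: int, max_depth: int) -> int:
    """Leaves a tree can actually reach under both limits."""
    # In LightGBM, max_depth <= 0 means there is no depth limit.
    if max_depth <= 0:
        return num_leaves
    return min(num_leaves, max_leaves_for_depth(max_depth))


# How num_leaves=34 behaves under different depth limits.
for depth in (1, 3, 5, 6, -1):
    print(depth, effective_num_leaves(34, depth))
```

With max_depth=1 each tree is a single split with 2 leaves, so changing num_leaves between 31 and 34 would not be expected to change the model; leaving max_depth at its default of -1 lets num_leaves take effect.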

 

 

Printing the model accuracy

print("Accuracy:", accuracy_score(y_test, y_pred))
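With imbalanced classes, a single accuracy number can hide poor results on the small classes. As a sketch on made-up labels, the snippet below puts accuracy next to a confusion table and per-class recall; the label values are assumptions for illustration, not results from this model.

```python
import pandas as pd

# Made-up labels: a model that always predicts the majority class.
y_true = pd.Series(["QSO"] * 8 + ["GALAXY"] * 2, name="actual")
y_pred = pd.Series(["QSO"] * 10, name="predicted")

# Overall accuracy looks fine...
accuracy = (y_true == y_pred).mean()

# ...but the confusion table shows GALAXY is never predicted.
confusion = pd.crosstab(y_true, y_pred)

# Per-class recall: share of each actual class that was predicted correctly.
recall = (y_true == y_pred).groupby(y_true).mean()

print(accuracy)
print(confusion)
print(recall)
```

Here the accuracy is 0.8 even though no GALAXY row is classified correctly, which is why a per-class view is useful alongside accuracy_score.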

 

Insights

Visualizing type showed that the classes are severely imbalanced. In addition, beyond fiberID 640 only QSO is present. From this I wondered whether undersampling, for example deleting QSO rows with fiberID above 640, could ease the imbalance. However, because this means deleting data, it seems important that the undersampled data still reflects the characteristics of the original data.

 

Reflections and takeaways

I used LightGBM for modeling in order to handle a large amount of data, but because I did not know much about setting parameters or avoiding overfitting, I only adjusted the num_leaves parameter little by little. The default is 31, and since larger values are said to improve accuracy, I settled on 34, the point at which the accuracy did not drop again. Also, the objective parameter should have been chosen between regression and binary according to the learning goal, but I think I was not aware of this at the time. I would like to improve on my inexperience with parameter settings like this.

 

 

 

 

References

https://dacon.io/competitions/official/235573/codeshare/686?page=1&dtype=recent

https://dacon.io/competitions/official/235573/codeshare/694?page=1&dtype=recent

https://dacon.io/competitions/official/235573/codeshare/693?page=1&dtype=recent

https://www.codeit.kr/community/questions/UXVlc3Rpb246NjBlZTc0YzMyOGRjMDY2Y2ZlYWYwZGE0

https://dacon.io/competitions/official/235573/codeshare/721?page=1&dtype=recent

https://nurilee.com/2020/04/03/lightgbm-definition-parameter-tuning/

https://for-my-wealthy-life.tistory.com/24

https://lucian-blog.tistory.com/101