💡 WIDA/DACON Classification-Regression

[DACON/김민혜] EDA with Python

Unknown user 2023. 4. 7. 01:25

Before I start writing... honestly, I'm not sure this is the right way to carry out EDA, but I studied it as hard as I could.. haha

 

EDA(Exploratory Data Analysis, ํƒ์ƒ‰์  ๋ฐ์ดํ„ฐ ๋ถ„์„)

์ฐธ๊ณ : https://jalynne-kim.medium.com/๋ฐ์ดํ„ฐ๋ถ„์„-๊ธฐ์ดˆ-eda์˜-๊ฐœ๋…๊ณผ-๋ฐ์ดํ„ฐ๋ถ„์„-์ž˜-ํ•˜๋Š”-๋ฒ•-a3cac2cc5ebc


Concept

A concept of the data analysis process developed by John Tukey, a mathematician at Bell Labs. It means that, throughout the process of analyzing data and producing results, you should keep 'exploring and understanding' the data as the foundation.

From the moment you first meet the raw data, you get to understand it well, and then, to decide what result to produce, you try things like 'filter by this feature (column), then try that feature..', splitting the data along several aspects and printing it out to gain insight. That is EDA-style data analysis.

How?

1. Understand the meaning of each column and row through the raw data's description and data dictionary

Sites that are widely used for data analysis practice reportedly provide good descriptions of each column and row of the data.

2. Skills for handling missing values and filtering data

Before starting the analysis in earnest, always check whether the data contains missing values and, if it does, remove them (or otherwise handle them).

If the data needed for the analysis should be numerical but is stored as categorical data (its data type shows up as 'object'), convert it to a numeric type (e.g. with astype).

3. The skill of making visualizations that anyone can easily understand

Being able to properly convey what you want to say through the data analysis is the skill a data analyst needs most.
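
Step 2 above can be sketched roughly as follows. This is only a minimal illustration on a made-up toy DataFrame (the column names `mag` and `label` and all values are hypothetical, not taken from train.csv), and dropping rows is just one possible way of handling missing values.

```python
import pandas as pd

# Toy data (hypothetical): 'mag' holds numbers stored as strings
df = pd.DataFrame({
    "mag": ["19.5", "20.1", None, "18.7"],
    "label": ["QSO", "GALAXY", "QSO", None],
})

# 1) Count the missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# 2) Drop the rows that contain any missing value (imputing is another option)
clean = df.dropna().copy()

# 3) Convert the string column to a numeric dtype with astype
clean["mag"] = clean["mag"].astype(float)
print(clean.dtypes)
```

On the real data, the same check would presumably be `train.isnull().sum()`; the `train.info()` output later in this post already shows that every column is fully non-null.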

Visualization

With the code below, I trained a random forest on the train.csv dataset and visualized its feature importances.

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Create the Random Forest model
rf_clf = RandomForestClassifier()
bag_clf = BaggingClassifier(rf_clf)  # defined here but not used below

# Load the training dataset
train_data = pd.read_csv("drive/MyDrive/Colab Notebooks/train.csv")

# The target to predict is type, so separate type from the remaining data (slicing)
X = train_data.iloc[:, 2:]
y = train_data.iloc[:, 1]  # type column

# Split into a training dataset and a test dataset
# test_size=0.3 gives a 7:3 train:test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=11)

# Train the random forest and evaluate its prediction performance on the held-out test set
rf_clf = RandomForestClassifier(criterion='entropy', bootstrap=True, random_state=42, max_depth=5)
rf_clf.fit(X_train, y_train)
pred = rf_clf.predict(X_test)
accuracy = accuracy_score(y_test, pred)
print('Random forest accuracy: {0:.4f}'.format(accuracy))

# Bar chart visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

ftr_importances_values = rf_clf.feature_importances_
ftr_importances = pd.Series(ftr_importances_values, index=X_train.columns)
ftr_top20 = ftr_importances.sort_values(ascending=False)[:20]

plt.figure(figsize=(8,6))
plt.title('Feature importances Top 20')
sns.barplot(x=ftr_top20, y=ftr_top20.index)
plt.show()

 

DACON Celestial Object Type Classification Model


Data characteristics

Blog reference → 🔗LINK: uses RandomForest

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
  • Imports the pandas library under the name pd.
  • Loads the random forest classification model from the ensemble module provided by scikit-learn.
# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
import matplotlib.patches as patches
  • Imports pyplot, a collection of functions from the matplotlib library, under the name plt.
  • matplotlib.pyplot
    : a collection of functions that work in a command style
    • Each function in the module makes it easy to create a graph and modify it
      • e.g. creating a plot area, drawing lines, decorating labels, etc.
  • Adds the seaborn package
  • seaborn
    : a visualization package built on Matplotlib that adds features such as various color themes and statistical charts
  • Adds patches from the matplotlib library
  • matplotlib.patches
    : provides Patch, a subclass of Artist (the base class for all visual elements); a patch is a 2D shape with a face color and an edge color.

data_path = '/Users/hansung.dev/Github/kaggle_Data/천체유형분류-월간데이콘2/'

train = pd.read_csv(data_path + 'data/train.csv', index_col=0)
test = pd.read_csv(data_path + 'data/test.csv', index_col=0)
sample_submission = pd.read_csv(data_path + 'data/sample_submission.csv', index_col=0)
  • Writes down the path of the data to load.
  • pd.read_csv - reads the data files to analyze (train.csv & test.csv & sample_submission.csv) with the pandas library and assigns each to a variable. index_col=0 uses the first column (position 0) as the row labels.
    • index_col : int, str, sequence of int / str, or False, optional, default None
    • The column(s) to use as the row labels of the DataFrame, given either as a string name or as a column index.
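
To see what `index_col=0` changes, here is a small sketch that reads a miniature CSV from memory instead of the real files. The column name `id` and the three rows are assumptions made up for illustration.

```python
import io
import pandas as pd

# Hypothetical miniature CSV; the real train.csv is not reproduced here
csv_text = "id,type,fiberID\n0,QSO,601\n1,GALAXY,788\n2,QSO,427\n"

# Without index_col: pandas creates a default 0..n-1 index and 'id' stays a column
default_df = pd.read_csv(io.StringIO(csv_text))

# With index_col=0: the first column ('id') becomes the row labels
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_df.shape)       # (3, 3)
print(indexed_df.shape)       # (3, 2)
print(indexed_df.index.name)  # id
```

The column used as the index is no longer counted among the data columns, so the reported shape has one column fewer.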

EDA

print('Size of train data', train.shape)
print('Size of test data', test.shape)

# Output
Size of train data (199991, 22)
Size of test data (10009, 21)
  • train.shape and test.shape print the shape of the data loaded from train.csv and test.csv as (number of rows, number of columns)
  • pandas.DataFrame.shape
    : Return a tuple representing the dimensionality of the DataFrame.
train.info()

# Output

<class 'pandas.core.frame.DataFrame'>
Int64Index: 199991 entries, 0 to 199990  # index labels run from 0 to 199990
Data columns (total 22 columns):         # 22 data columns in total
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   type        199991 non-null  object 
 1   fiberID     199991 non-null  int64  
 2   psfMag_u    199991 non-null  float64
 3   psfMag_g    199991 non-null  float64
 4   psfMag_r    199991 non-null  float64
 5   psfMag_i    199991 non-null  float64
 6   psfMag_z    199991 non-null  float64
 7   fiberMag_u  199991 non-null  float64
 8   fiberMag_g  199991 non-null  float64
 9   fiberMag_r  199991 non-null  float64
 10  fiberMag_i  199991 non-null  float64
 11  fiberMag_z  199991 non-null  float64
 12  petroMag_u  199991 non-null  float64
 13  petroMag_g  199991 non-null  float64
 14  petroMag_r  199991 non-null  float64
 15  petroMag_i  199991 non-null  float64
 16  petroMag_z  199991 non-null  float64
 17  modelMag_u  199991 non-null  float64
 18  modelMag_g  199991 non-null  float64
 19  modelMag_r  199991 non-null  float64
 20  modelMag_i  199991 non-null  float64
 21  modelMag_z  199991 non-null  float64
dtypes: float64(20), int64(1), object(1)  # the data types are float64, int64, and object
memory usage: 35.1+ MB

train.describe()
# Result (figure): 8 rows × 21 columns

train_desc_df = train.describe()
train_desc_df

train.head()
# Result (figure): 5 rows × 22 columns
  • DataFrame.describe()
    : Generate descriptive statistics.
    • Summarizes the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values.
    • Returns the summary statistics.
    DataFrame.describe(percentiles=None, include=None, exclude=None)
  • DataFrame.head()
    : Return the first n rows.
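
A small sketch of how describe() and head() behave, using a toy frame whose values are made up (only the column names are borrowed from train.csv). It also hints at why train.describe() above shows 21 columns while train has 22: by default, non-numeric columns such as type are presumably left out.

```python
import pandas as pd

# Toy frame (hypothetical values) with one string column and two numeric columns
df = pd.DataFrame({
    "type": ["QSO", "GALAXY", "QSO", "STAR_BHB"],
    "fiberID": [601, 788, 427, 864],
    "psfMag_u": [23.2, 21.4, None, 19.9],
})

# By default describe() summarizes only the numeric columns and skips NaN values
numeric_summary = df.describe()
print(numeric_summary)

# include='all' also summarizes non-numeric columns (count, unique, top, freq)
all_summary = df.describe(include="all")
print(all_summary)

# head(n) returns the first n rows; n defaults to 5
first_two = df.head(2)
print(first_two)
```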

☝🏻 Shouldn't the data types be made consistent?

 

 

  • Other

Corr

 

# Visualize the correlations across the whole dataset as a heatmap
corr = train.select_dtypes('number').corr()  # correlation needs numeric columns, so type is left out
cmap = sns.color_palette("Blues")
f, ax = plt.subplots(figsize=(10, 7))
sns.heatmap(corr, cmap=cmap)

<matplotlib.axes._subplots.AxesSubplot at 0x1a31854390>

 

 

  • DataFrame.corr()
  • : computes the pairwise correlation between the columns of the dataset
  • Chooses a blue color palette
  • I'm not sure how subplots should be configured…
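
On the subplots question, here is a rough sketch of what plt.subplots() returns and which arguments are usually set. The random 4x4 matrix is a hypothetical stand-in for the correlation matrix, and the two-panel layout is only for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical 4x4 matrix standing in for the correlation matrix
values = np.random.default_rng(0).uniform(-1, 1, size=(4, 4))

# plt.subplots() creates a Figure and its Axes in one call.
# nrows/ncols set the grid of plots; figsize is (width, height) in inches.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

# With a 1x2 grid, axes is an array holding two Axes objects
image = axes[0].imshow(values, cmap="Blues")
axes[0].set_title("heatmap")
fig.colorbar(image, ax=axes[0])

axes[1].hist(values.ravel(), bins=8)
axes[1].set_title("histogram")

fig.tight_layout()
```

In the heatmap code above, `f, ax = plt.subplots(figsize=(10, 7))` creates a single Axes, and seaborn draws on the current Axes by default. Passing it explicitly as `sns.heatmap(corr, cmap=cmap, ax=ax)` should give the same picture while making the target Axes visible in the code.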

 

type

# Distribution of the type column
fig = plt.figure(figsize=(18,9))
plt.subplots_adjust(hspace=.5)

plt.subplot2grid((3,3), (0,0), colspan = 3)
train['type'].value_counts()[:100].plot(kind='bar', alpha=0.7)
plt.title('type Values in the Training Set - train ()')

Text(0.5, 1.0, 'type Values in the Training Set - train ()')

train['type'].value_counts()
QSO                    49680
GALAXY                 37347
SERENDIPITY_BLUE       21760
SPECTROPHOTO_STD       14630
REDDEN_STD             14618
STAR_RED_DWARF         13750
STAR_BHB               13500
SERENDIPITY_FIRST       7132
ROSAT_D                 6580
STAR_CATY_VAR           6506
SERENDIPITY_DISTANT     4654
STAR_CARBON             3257
SERENDIPITY_RED         2562
STAR_WHITE_DWARF        2160
STAR_SUB_DWARF          1154
STAR_BROWN_DWARF         500
SKY                      127
SERENDIPITY_MANUAL        61
STAR_PN                   13
Name: type, dtype: int64
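  • Characteristics
    • QSO is the most frequent class (49,680 of 199,991 rows, about 25%), and the classes are heavily imbalanced: STAR_PN has only 13 rows.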
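
To put numbers on the class distribution, the counts can be turned into shares. The sketch below copies only five of the counts from the value_counts() output above (the three largest and the two smallest classes), so it is an illustration rather than the full table.

```python
import pandas as pd

# Counts copied from the value_counts() output above (three largest, two smallest)
counts = pd.Series({
    "QSO": 49680,
    "GALAXY": 37347,
    "SERENDIPITY_BLUE": 21760,
    "SERENDIPITY_MANUAL": 61,
    "STAR_PN": 13,
})
total_rows = 199991  # number of rows in train, from train.shape

# Share of each class in the whole training set, in percent
share_percent = (counts / total_rows * 100).round(2)
print(share_percent)

# Ratio between the largest and the smallest class
imbalance_ratio = counts.max() / counts.min()
print(round(imbalance_ratio))
```

On the real DataFrame, `train['type'].value_counts(normalize=True)` returns the shares of all 19 classes directly.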

 

fiber ID

# Distribution of the fiberID column
fig = plt.figure(figsize=(18,9))
plt.subplots_adjust(hspace=.5)

plt.subplot2grid((3,3), (0,0), colspan = 3)
train['fiberID'].value_counts()[:100].plot(kind='bar', alpha=0.7)
plt.title('fiberID Values in the Training Set - train ()')

Text(0.5, 1.0, 'fiberID Values in the Training Set - train ()')

 

☝🏻 I went through the code as a whole, but I couldn't easily see where it analyzes problems or solutions.. I'd like to hear what others think too..!