💡 WIDA/DACON Classification-Regression

[DACON/Kim Kyuree] EDA with Python

kyuree 2023. 4. 7. 17:31

I worked in a Jupyter Notebook environment.

Importing the required libraries

# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
# Sharper (high-resolution) figures
%config InlineBackend.figure_format = 'retina'
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
  • What %matplotlib inline means
    • It renders figures directly in the browser running the notebook, right below the cell
  • Libraries used
    • pandas
    • seaborn
    • matplotlib
  • Why import warnings?
    • To hide warning messages in the Jupyter Notebook output
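Note that warnings.filterwarnings('ignore') hides every warning, including ones that may be worth reading. As a rough sketch of a narrower alternative, the filter can be limited to one category. Here noisy() is a made-up function standing in for library code, and the choice of FutureWarning is only an example:

```python
import warnings


def noisy():
    """Hypothetical stand-in for library code that emits two kinds of warnings."""
    warnings.warn("this API will change", FutureWarning)
    warnings.warn("something looks off", UserWarning)
    return "done"


# Record warnings instead of printing them so the effect of the filter is visible.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")                              # start from "show everything"
    warnings.filterwarnings("ignore", category=FutureWarning)    # hide only this category
    result = noisy()

# Only the UserWarning survives the filter.
print([w.category.__name__ for w in caught])
```

In a notebook, the single filterwarnings("ignore", category=...) line placed after the imports would be enough; catch_warnings is used here only to make the result inspectable.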

Looking at the data

# Load the data
train = pd.read_csv("C:/Users/jenny/train.csv")
# Preview the first rows
train.head()

## Check the number of rows/columns
print(train.shape)
(199991, 23)

 

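# Check the summary information
# (info() prints the summary itself and returns None, so wrapping it in print() adds the trailing "None")
print(train.info())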
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199991 entries, 0 to 199990
Data columns (total 23 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   id          199991 non-null  int64  
 1   type        199991 non-null  object 
 2   fiberID     199991 non-null  int64  
 3   psfMag_u    199991 non-null  float64
 4   psfMag_g    199991 non-null  float64
 5   psfMag_r    199991 non-null  float64
 6   psfMag_i    199991 non-null  float64
 7   psfMag_z    199991 non-null  float64
 8   fiberMag_u  199991 non-null  float64
 9   fiberMag_g  199991 non-null  float64
 10  fiberMag_r  199991 non-null  float64
 11  fiberMag_i  199991 non-null  float64
 12  fiberMag_z  199991 non-null  float64
 13  petroMag_u  199991 non-null  float64
 14  petroMag_g  199991 non-null  float64
 15  petroMag_r  199991 non-null  float64
 16  petroMag_i  199991 non-null  float64
 17  petroMag_z  199991 non-null  float64
 18  modelMag_u  199991 non-null  float64
 19  modelMag_g  199991 non-null  float64
 20  modelMag_r  199991 non-null  float64
 21  modelMag_i  199991 non-null  float64
 22  modelMag_z  199991 non-null  float64
dtypes: float64(20), int64(2), object(1)
memory usage: 35.1+ MB
None
  • object type
    • 'type'
  • Numeric types
    • int64
      • 'id', 'fiberID'
    • float64
      • everything else

-> So the data calls for a numerical approach!

# Drop the id column
train = train.drop(['id'], axis=1)
train.info()
  • Dropping data in pandas: .drop()
    • axis=1: drop the given column(s); axis=0: drop the given row(s) by index label
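To make the axis argument concrete, here is a small sketch on a toy DataFrame. The values are invented and only the column names mirror the real data:

```python
import pandas as pd

# Toy frame with made-up values; only the column names follow train.csv.
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "type": ["QSO", "GALAXY", "QSO"],
    "psfMag_u": [18.2, 19.5, 20.1],
})

no_id = toy.drop(["id"], axis=1)       # axis=1: remove the column named "id"
no_first_row = toy.drop([0], axis=0)   # axis=0: remove the row whose index label is 0

print(no_id.columns.tolist())          # ['type', 'psfMag_u']
print(no_first_row.index.tolist())     # [1, 2]

# The keyword form says the same thing without having to remember the axis number.
same_as_no_id = toy.drop(columns=["id"])
```

drop() returns a new DataFrame and leaves the original untouched, which is why the notebook assigns the result back to train.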
# Check whether there are missing values -> none
print(train.isnull().sum())
type          0
fiberID       0
psfMag_u      0
psfMag_g      0
psfMag_r      0
psfMag_i      0
psfMag_z      0
fiberMag_u    0
fiberMag_g    0
fiberMag_r    0
fiberMag_i    0
fiberMag_z    0
petroMag_u    0
petroMag_g    0
petroMag_r    0
petroMag_i    0
petroMag_z    0
modelMag_u    0
modelMag_g    0
modelMag_r    0
modelMag_i    0
modelMag_z    0
dtype: int64
  • pandas missing-value check: isnull().sum()
    • isnull() returns True or False for each value
      • True: the value is missing -> counted as 1
      • False: the value is present -> counted as 0
    • .sum()
      • Because isnull() checks every single value, chaining sum() adds up the missing values per column
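Since the real data has no missing values, the effect of isnull().sum() is easier to see on a toy DataFrame with a few gaps. The values below are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing entries (made-up values).
toy = pd.DataFrame({
    "psfMag_u": [18.2, np.nan, 20.1],
    "psfMag_g": [17.9, 18.8, 19.4],
    "type": ["QSO", None, None],
})

mask = toy.isnull()              # same shape as toy, True where a value is missing
per_column = mask.sum()          # True counts as 1, so this is the missing count per column
total_missing = int(per_column.sum())

print(per_column.to_dict())      # {'psfMag_u': 1, 'psfMag_g': 0, 'type': 2}
print(total_missing)             # 3
```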
# Check the basic statistics
# Summary of the float-type variables
train.describe(include='float64')

  • mean
    • the average
  • std
    • the standard deviation, i.e. how spread out the values are
  • What this shows
    • Means
      • Only 'psfMag_u' has a negative mean
      • The other columns mostly fall between 17 and 22
      • 'fiberMag_u' has a mean in the 1.x range
      • => I wondered whether this was characteristic of the ultraviolet band, but the other xxx_u columns did not show it
    • These numbers alone are not yet enough to pick out clear patterns in the data
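One possible reason for an odd mean such as a negative value in a magnitude column is a handful of extreme values pulling the average. This is only a guess and has not been checked on the real data. A quick way to test the idea is to put the mean next to the median, minimum, and maximum. The sketch below uses invented numbers, including a made-up -9999.0 entry:

```python
import pandas as pd

# Invented values; the -9999.0 entry is a hypothetical extreme value.
toy = pd.DataFrame({
    "psfMag_u": [18.0, 19.0, 20.0, 21.0, -9999.0],
    "psfMag_g": [17.5, 18.5, 19.5, 20.5, 21.5],
})

# One row per column, with the statistics side by side.
summary = toy.agg(["mean", "median", "min", "max"]).T
# A large gap between mean and median hints at extreme values.
summary["mean_minus_median"] = summary["mean"] - summary["median"]

print(summary)
```

If the median of a column sits in the usual range while the mean does not, the minimum and maximum usually show which side the extreme values are on.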

Looking at each column

 

# Find columns that contain only a single value --> none

one_value_columns = []

for i in train.columns[2:]:
    if len(train[i].value_counts()) < 2:
        one_value_columns.append(i)

print(len(one_value_columns))
print(one_value_columns)
0
[]
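The same check can be written without an explicit loop by using nunique(), which counts the distinct values per column. A minimal sketch on a toy DataFrame, where the "const" column is a made-up example of a single-value column:

```python
import pandas as pd

# Toy frame; "const" is a hypothetical column holding one repeated value.
toy = pd.DataFrame({
    "type": ["QSO", "GALAXY", "QSO"],
    "const": [1, 1, 1],
    "psfMag_u": [18.2, 19.5, 20.1],
})

n_unique = toy.nunique()                                   # distinct values per column
one_value_columns = n_unique[n_unique < 2].index.tolist()  # columns with at most one distinct value

print(one_value_columns)   # ['const']
```

Like value_counts(), nunique() ignores missing values by default, so the two approaches agree on what counts as a single value.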
# Look at the relationships between columns
## Draw a heatmap, excluding the 'type' column
plt.figure(figsize=(15, 8))
sns.heatmap(train.drop(['type'], axis=1).corr(), annot=True)

  • figsize
    • sets the figure width and height (in inches)
  • corr()
    • the pairwise correlation between columns

No particularly striking patterns stand out.
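With more than twenty columns, an annotated heatmap is hard to read, so it can help to list the most strongly correlated pairs as a table. The sketch below is one way to do that. The toy data is randomly generated, and the relationship between the two magnitude columns is built in on purpose:

```python
import numpy as np
import pandas as pd

# Synthetic data: fiberMag_g is constructed as a noisy copy of psfMag_g.
rng = np.random.default_rng(0)
base = rng.normal(20, 1, size=200)
toy = pd.DataFrame({
    "psfMag_g": base,
    "fiberMag_g": base + rng.normal(0, 0.05, size=200),
    "fiberID": rng.integers(1, 1000, size=200),   # unrelated to the magnitudes
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(upper)
    .stack()
    .dropna()
    .sort_values(key=lambda s: s.abs(), ascending=False)
)

print(pairs.head())
```

On the real data, the same lines could be applied to train.drop(['type'], axis=1).corr() to see which columns move together most closely.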

fiberID

 

# fiberID
# Count the occurrences of each value
train['fiberID'].value_counts()

 

fiberID
624    373
122    371
618    370
350    369
14     365
      ... 
809     11
807     11
737     10
991     10
853      9
Name: count, Length: 1000, dtype: int64
  • Left column: the distinct values
  • Right column: how many times each value occurs

-> There are many distinct values, and they seem unlikely to be meaningful, so I decided a visualization was unnecessary

 

type

 

# Count the objects of each celestial type
train['type'].value_counts()
type
QSO                    49680
GALAXY                 37347
SERENDIPITY_BLUE       21760
SPECTROPHOTO_STD       14630
REDDEN_STD             14618
STAR_RED_DWARF         13750
STAR_BHB               13500
SERENDIPITY_FIRST       7132
ROSAT_D                 6580
STAR_CATY_VAR           6506
SERENDIPITY_DISTANT     4654
STAR_CARBON             3257
SERENDIPITY_RED         2562
STAR_WHITE_DWARF        2160
STAR_SUB_DWARF          1154
STAR_BROWN_DWARF         500
SKY                      127
SERENDIPITY_MANUAL        61
STAR_PN                   13
Name: count, dtype: int64
fig = plt.figure(figsize=(18,9))
plt.subplots_adjust(hspace=.5)

plt.subplot2grid((3,3), (0,0), colspan = 3)
train['type'].value_counts()[:100].plot(kind='bar', alpha=0.7)
plt.title('train_data : type values')
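The counts show that the classes are far from balanced: QSO has 49680 rows while STAR_PN has only 13. Expressing the counts as a share of all rows makes this easier to judge. The sketch below reuses three of the counts and the total row count (199991) printed earlier; on the real data, train['type'].value_counts(normalize=True) gives the shares directly:

```python
import pandas as pd

# Three of the class counts printed above, plus the total number of rows from train.shape.
counts = pd.Series({"QSO": 49680, "GALAXY": 37347, "STAR_PN": 13}, name="count")
total_rows = 199991

share_percent = (counts / total_rows * 100).round(2)   # share of all rows, in percent
imbalance_ratio = counts.max() / counts.min()          # largest class vs. smallest class

print(share_percent)
print(round(imbalance_ratio, 1))
```

An imbalance of this size is worth keeping in mind later, for example when choosing an evaluation metric or splitting the data.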

  • What is a QSO (quasi-stellar object, i.e. a quasar), and why is it the most frequently observed type?
    • A highly active supermassive black hole found at great distances
    • One of the brightest single objects in the universe; its brightness can reach up to 700 trillion times that of the Sun
    • It emits enormous amounts of energy not only in visible light but at all wavelengths...

psfMag

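fig, ax = plt.subplots(nrows=2)

# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable histogram with a density curve
sns.histplot(train['psfMag_g'], kde=True, ax=ax[0])
sns.histplot(train['psfMag_r'], kde=True, ax=ax[1])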

-> The line plot did not seem meaningful, so I tried a scatter plot to look at the overall distribution of the values

sns.set_style('whitegrid')
y = train['psfMag_u'].value_counts()
sns.scatterplot(data= train, x='psfMag_u', y=y)

Nothing showed up..
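The empty plot is most likely an index mismatch: value_counts() returns a Series indexed by the magnitude values themselves, while train is indexed by row number, so almost no x and y values line up. One possible fix, sketched here on invented numbers, is to turn the counts into their own two-column table and plot from that. The helper name value_count_table is made up for this example:

```python
import pandas as pd


def value_count_table(series):
    """Return a frame with one row per distinct value and how often it occurs."""
    counts = series.value_counts().reset_index()
    # Set the names explicitly, since reset_index() names them differently across pandas versions.
    counts.columns = [series.name, "count"]
    return counts


# Toy stand-in for train['psfMag_u'] (made-up numbers).
toy = pd.Series([18.2, 18.2, 19.5, 20.1, 20.1, 20.1], name="psfMag_u")
counts = value_count_table(toy)
print(counts)

# In the notebook, both axes can then come from the same table:
# sns.scatterplot(data=counts, x="psfMag_u", y="count")
```

For continuous values like magnitudes, most values occur only once or twice, so a histogram is usually the more informative view of the distribution.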

That is as far as I got... T.T

 

Sources

- [Dacon/python] Jeju Island Road Traffic Volume Prediction (velog.io)

 


- My first EDA (Celestial Object Type Classification) - DACON

 


- Plus various other posts I consulted to look up pandas, matplotlib, and seaborn functions and arguments..