💡 WIDA 91

[DACON/김세연] EDA with Python

import pandas as pd  #1
import numpy as np  #2
import matplotlib.pyplot as plt  #3
import seaborn as sns  #4
df = pd.read_csv("C:/Users/lucy8/PycharmProjects/test2/DSOB/train.csv")  #5
print(df.head(3))  #6
print(df.shape)  #7
print(df.isnull().sum())  #8
df.info()  #9 (info() prints its report itself)

df.head(3) pulls three rows from the data assigned to df to get a sense of the data's form. The left output (lines #7 and #8) reports the size of the data as rows × columns and whether there are any null values (if there are, they must be handled, e.g. by mean imputation or deletion). The right output (line #9) reports the data types..
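The remark above that null values must be handled (mean, deletion, etc.) can be sketched roughly as follows. This is only an illustration on a made-up DataFrame, not the competition's train.csv; the column names `u`, `g` and `class` are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv; column names are made up
toy = pd.DataFrame({
    "u": [1.0, np.nan, 3.0, 4.0],
    "g": [10.0, 20.0, np.nan, 40.0],
    "class": [0, 1, 1, 2],
})

# Count the nulls in each column, as in line #8 of the excerpt
null_counts = toy.isnull().sum()
print(null_counts)

# Option 1: replace each null with the mean of its column
filled = toy.fillna(toy.mean(numeric_only=True))

# Option 2: drop every row that contains at least one null
dropped = toy.dropna()

print(filled)
print(dropped.shape)
```

Mean imputation keeps every row but flattens the variance of the filled column, while deletion keeps only complete rows; which one is appropriate depends on how many values are missing.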

[DACON/조아영] EDA with Python

I also consulted the code shared on Dacon's code-sharing board, but I set aside the code I could not follow and worked through only the parts I understood.

# Load the libraries to use
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
train_df = pd.read_csv("./dataset/train.csv")
train_df

# Check the total number of rows and columns
train_df.shape

# Check whether null values exist
train_df.isnull().sum()

# Get basic information about the data
# The data type of each column is also available
train_df.info()

# Check basic statistics for each column
train_df..

[DACON/๊น€๋ฏผํ˜œ] ํŒŒ์ด์ฌ์„ ์ด์šฉํ•œ EDA

๊ธ€์„ ์ž‘์„ฑํ•˜๊ธฐ์— ์•ž์„œ... ์‚ฌ์‹ค EDA๋ฅผ ์ด๋ ‡๊ฒŒ ์ง„ํ–‰ํ•˜๋Š” ๊ฒƒ์ด ๋งž๋Š”์ง€ ํ™•์‹คํ•˜์ง€๋Š” ์•Š์œผ๋‚˜ ์ตœ๋Œ€ํ•œ ์—ด์‹ฌํžˆ ๊ณต๋ถ€ํ•ด๋ณด์•˜์Šต๋‹ˆ๋‹ค.. ํ•˜ํ•˜ EDA(Exploratory Data Analysis, ํƒ์ƒ‰์  ๋ฐ์ดํ„ฐ ๋ถ„์„) ์ฐธ๊ณ : https://jalynne-kim.medium.com/๋ฐ์ดํ„ฐ๋ถ„์„-๊ธฐ์ดˆ-eda์˜-๊ฐœ๋…๊ณผ-๋ฐ์ดํ„ฐ๋ถ„์„-์ž˜-ํ•˜๋Š”-๋ฒ•-a3cac2cc5ebc ๊ฐœ๋… ๋ฒจ์—ฐ๊ตฌ์†Œ์˜ ์ˆ˜ํ•™์ž ‘์กด ํŠœํ‚ค’๊ฐ€ ๊ฐœ๋ฐœํ•œ ๋ฐ์ดํ„ฐ๋ถ„์„ ๊ณผ์ •์— ๋Œ€ํ•œ ๊ฐœ๋…์œผ๋กœ, ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„์„ํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ ๋‚ด๋Š” ๊ณผ์ •์— ์žˆ์–ด์„œ ์ง€์†์ ์œผ๋กœ ํ•ด๋‹น ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ‘ํƒ์ƒ‰๊ณผ ์ดํ•ด’๋ฅผ ๊ธฐ๋ณธ์œผ๋กœ ๊ฐ€์ ธ์•ผ ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธ ์ฒ˜์Œ์— ๋กœ์šฐ๋ฐ์ดํ„ฐ(raw data)๋ฅผ ์ ‘ํ•  ๋•Œ๋ถ€ํ„ฐ ๋ฐ์ดํ„ฐ๋ฅผ ์ž˜ ์ดํ•ดํ•˜๊ณ  ํŒŒ์•…ํ•œ ๋‹ค์Œ, ์–ด๋–ค ๊ฒฐ๊ณผ๋ฅผ ๋งŒ๋“ค์–ด๋‚ผ ์ง€ ‘์ด feature(column)๋กœ ํ•„ํ„ฐํ•ด๋ณด๊ณ , ์ €..

[DACON/김세연] Taking Apart Classification Models, How Classification Models Are Evaluated

# Try running a new classification model
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the training dataset
data = pd.read_csv("C:/Users/lucy8/OneDrive/바탕 화면/train.csv", encoding='utf-8')

# The value to predict is the type, so separate it from the rest of the data
X = data.drop(columns=["class"])  # every column except the target
y = data["class"]  # target column as a 1-D Series

# Split into a training dataset and a test dataset
# The ratio of training to test is ..
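The excerpt is cut off before the split ratio is given, so the following is only a sketch of what a train/test split looks like. The synthetic data, the 80/20 ratio (`test_size=0.2`), `random_state=42` and the use of `stratify` are all assumptions for illustration, not the author's settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and target (made up for this sketch)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))        # 100 rows, 4 hypothetical features
y = np.repeat([0, 1, 2, 3], 25)      # 4 balanced classes

# test_size=0.2 is an assumed ratio; stratify=y keeps the class
# proportions the same in the training and test parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_test))
```

Stratifying matters most when classes are imbalanced, since a purely random split can leave a rare class almost absent from the test part.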

[DACON/김경은] Taking Apart Classification Models, How Classification Models Are Evaluated

Support vector machine model

# Load the required packages
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
from sklearn import svm

# Load the data
train_data = pd.read_csv("C:/Users/twink/Documents/카카오톡 받은 파일/train.csv")

# Separate the type from the rest of the data
X = train_data.iloc[:, 2:]
y = train_data.iloc[:, 1]

# training dataset and test d..
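The excerpt stops before the model is fitted. A minimal sketch of how an SVM run of this kind usually continues is shown below; it uses scikit-learn's bundled iris data in place of train.csv, and the kernel, `C`, split ratio and random seed are assumptions rather than the author's choices.

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Iris stands in for the competition's train.csv (an assumption for this sketch)
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# kernel and C are illustrative values, not the author's settings
model = svm.SVC(kernel="rbf", C=1.0)
model.fit(X_train, y_train)

# Accuracy is the share of test rows whose predicted label is correct
pred = model.predict(X_test)
acc = accuracy_score(y_test, pred)
print(f"accuracy: {acc:.3f}")
```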

[DACON/김규리] Taking Apart Classification Models, How Classification Models Are Evaluated

This time I will practice hands-on with the naive Bayes classification model we looked at last time~

*Source* https://m.blog.naver.com/baek2sm/221786426960 ("Python naive Bayes classification machine learning algorithm example", Machine Learning & Deep Learning Cookbook (MLCook), a scikit-learn naive Bayes classification example by 동네코더)
(I followed this author's post and practiced almost identically, so I am stating the source first.)

1. Overview of the naive Bayes classification model
- One of the machine learning techniques; an algorithm traditionally used for text classification
- A supervised learning algorithm
- Its basic principle is an application of Bayes' theorem
- Commonly explained through spam mail classification

Spam mail classification
The frequency of the words that appear in the text..
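The spam-mail explanation above (word frequencies combined through Bayes' theorem) can be sketched with scikit-learn as follows. The messages and labels are invented for illustration, and this is not the code from the cited post.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up messages and labels, purely for illustration
messages = [
    "free prize money now",
    "win free money",
    "free prize claim now",
    "meeting at noon today",
    "lunch at noon",
    "project meeting today",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Turn each message into a vector of word counts (word frequencies)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

# Multinomial naive Bayes applies Bayes' theorem to those counts,
# treating words as conditionally independent given the class
model = MultinomialNB()
model.fit(X, labels)

new_messages = ["free money", "meeting today"]
pred = model.predict(vectorizer.transform(new_messages))
print(pred.tolist())  # ['spam', 'ham']
```

The "naive" part is the independence assumption: each word contributes its own likelihood, and the class with the largest product of prior and word likelihoods wins.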

[DACON/최다예] Taking Apart Classification Models, How Classification Models Are Evaluated

Implementing the KNN algorithm

# Import the required packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the data
train_data = pd.read_csv("C:/Users/allye/Desktop/DSOB/WIDA Dacon/DCSTree/train.csv")

# Separate the type from the rest of the data
X = train_data.iloc[:, 2:]
y = train_data.iloc[:, 1]

# Split the dataset (training data and validation data)
#..
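Since the excerpt ends at the split step, here is a rough sketch of the remaining KNN steps. It runs on two synthetic, well-separated clusters rather than train.csv, and `n_neighbors=5`, the split ratio and the seeds are assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated synthetic clusters stand in for train.csv (an assumption)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
    rng.normal(loc=10.0, scale=1.0, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# n_neighbors=5 is scikit-learn's default; each point takes the majority
# label among its 5 nearest training points
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

acc = accuracy_score(y_valid, knn.predict(X_valid))
print(f"validation accuracy: {acc:.3f}")
```

Because KNN relies on distances, features on very different scales are usually standardized first; that step is left out here to keep the sketch short.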

[DACON/조아영] Taking Apart Classification Models, How Classification Models Are Evaluated (an SVM model will be added as well)

Decision Tree
The code is not attached separately as a block: https://github.com/cAhyoung/dacon_stars_type_clf/blob/main/practive_code/dt_rf_practice.py

Hyperparameters
criterion : default="gini", the measure used to split the data
  "gini" splits the data based on Gini impurity
  "entropy" splits the data based on entropy
  "log_loss" splits the data based on log loss
splitter : default="best"
  "best" finds the best split
  "random" finds the best random split
max_depth : default=None
  the maximum depth of the tree structure..
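To make the hyperparameters listed above concrete, the sketch below passes `criterion`, `splitter` and `max_depth` to scikit-learn's `DecisionTreeClassifier`. It uses the bundled iris data rather than the competition data, and `max_depth=3` and `random_state=0` are arbitrary values picked for the example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Iris is a stand-in dataset (an assumption for this sketch)
X, y = load_iris(return_X_y=True)

depths = {}
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(
        criterion=criterion,   # impurity measure used to score candidate splits
        splitter="best",       # keep the best split among the candidates
        max_depth=3,           # cap on tree depth; None lets the tree grow fully
        random_state=0,
    )
    tree.fit(X, y)
    depths[criterion] = tree.get_depth()
    # Training accuracy only; it says nothing about generalization
    print(criterion, tree.get_depth(), round(tree.score(X, y), 3))
```

Lowering `max_depth` is one of the simplest ways to limit overfitting, since a fully grown tree can memorize the training rows.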

[DACON/김민혜] Taking Apart Classification Models, How Classification Models Are Evaluated

☝🏻 WIDA_Week 4
Load a model in Python, run it in a simple way, and measure its accuracy (pick one model other than decision tree)
Analyze that model's hyperparameters
Look into evaluation methods (two of them, including log loss)

Code for running the random forest model

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
import warnings
warnings.filterwarnings..
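The week-4 task above (run a model, measure accuracy, and look at log loss) might look roughly like the sketch below. It uses the bundled iris data instead of train.csv, and `n_estimators=100`, the split ratio and the seeds are assumed values, not taken from the post.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

# Iris is a stand-in dataset (an assumption for this sketch)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Accuracy scores the hard predictions (predicted label == true label)
acc = accuracy_score(y_test, rf.predict(X_test))

# Log loss scores the predicted probabilities; confident wrong answers
# are penalized heavily, and lower is better
loss = log_loss(y_test, rf.predict_proba(X_test), labels=rf.classes_)

print(f"accuracy: {acc:.3f}, log loss: {loss:.3f}")
```

Two models with the same accuracy can have very different log loss, which is why competitions that score on log loss reward well-calibrated probabilities rather than just correct labels.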

[DACON/김규리] Getting to Know Classification Models

1. Overview

What is classification?
- A kind of supervised learning: the process of learning the category relationships in existing data and then determining the category of newly observed data
- Examples include binary classification, such as deciding whether a text message is spam or not, and multiclass classification, such as determining which grade a CSAT (수능) score falls into
* cf) clustering in unsupervised learning: similar to multiclass classification, but the major difference is that in multiclass classification the set of categories is defined in advance

Classification algorithms
- They learn the existing categories that a set of data belongs to, and on that basis the computer learns to tell the categories apart and draw the boundaries between them

2. Naive Bayes overview ..