💡 WIDA/DACON Classification-Regression 43

[DACON/김규리] EDA with Python

I did this work in a Jupyter Notebook environment.
Loading the required libraries
    # Load the required libraries
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline
    import seaborn as sns
    color = sns.color_palette()
    sns.set_style('darkgrid')
    # Increase the plot resolution
    %config InlineBackend.figure_format = 'retina'
    # Ignore warnings
    import warnings
    warnings.filterwarnings('ignore')
What %matplotlib inline means: right in the browser running the notebook ..

[DACON/김경은] EDA with Python

Carrying out EDA
Loading libraries
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
Loading the data
    train = pd.read_csv("C:/Users/twink/Documents/카카오톡 받은 파일/train.csv")
    test = pd.read_csv("C:/Users/twink/Desktop/test.csv")
    sub = pd.read_csv("C:/Users/twink/Desktop/sample_submission.csv")
Load the data and look at what kind of data it contains
Simple exploration with pandas methods
.head(): the first 5 rows; the number of rows can be changed with a number in the parentheses..
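A minimal sketch of the `.head()` exploration described in the excerpt above. The DataFrame is a hypothetical stand-in for train.csv, and its column names are made up for illustration, since the real columns are not shown.

```python
import pandas as pd

# Hypothetical stand-in for train.csv; the real columns are not shown in the excerpt.
train = pd.DataFrame({
    "id": range(1, 9),
    "feature_a": [0.5, 1.2, 3.3, 2.1, 0.9, 4.0, 2.8, 1.7],
    "label": [0, 1, 1, 0, 0, 1, 1, 0],
})

top5 = train.head()    # returns the first 5 rows by default
top3 = train.head(3)   # pass a number to change how many rows are returned

print(top5)
print(top3)
```

Calling `.head()` right after `pd.read_csv` is a quick way to confirm that the file was parsed with the expected columns before going further.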

[DACON/김세연] EDA with Python

    import pandas as pd  #1
    import numpy as np  #2
    import matplotlib.pyplot as plt  #3
    import seaborn as sns  #4
    df = pd.read_csv("C:/Users/lucy8/PycharmProjects/test2/DSOB/train.csv")  #5
    print(df.head(3))  #6
    print(df.shape)  #7
    print(df.isnull().sum())  #8
    df.info()  #9
Line 6 pulls 3 rows from the data assigned to df to get a sense of what the data looks like.
The left output (lines 7 and 8) gives the size of the data as (rows, columns) and shows whether there are null values (if there are, they have to be handled, for example by filling with the mean or deleting them).
The right output (line 9) shows the data types..
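The two ways of handling null values mentioned above (mean, deletion) could look roughly like this. It is a sketch on a hypothetical DataFrame with one missing value, not the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a single missing value in "height".
df = pd.DataFrame({
    "height": [1.0, 2.0, np.nan, 4.0],
    "weight": [10.0, 20.0, 30.0, 40.0],
})

print(df.shape)                  # (rows, columns)
null_counts = df.isnull().sum()  # number of missing values per column
print(null_counts)

# Option 1: fill each missing value with the mean of its column.
filled = df.fillna(df.mean())

# Option 2: drop every row that contains a missing value.
dropped = df.dropna()

print(filled)
print(dropped)
```

Filling keeps all rows but invents a value; dropping keeps only observed values but shrinks the data. Which is appropriate depends on how many values are missing.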

[DACON/조아영] EDA with Python

I also consulted the code available in Dacon's code sharing section, but I boldly set aside the code I couldn't understand and worked through only the parts I could follow.
    # Load the libraries to use
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    # Load the data
    train_df = pd.read_csv("./dataset/train.csv")
    train_df
    # Check the total number of rows and columns
    train_df.shape
    # Check whether null values exist
    train_df.isnull().sum()
    # Get basic information about the data
    # The data type of each column is also available
    train_df.info()
    # Check the basic statistics of each column
    train_df..

[DACON/김민혜] EDA with Python

๊ธ€์„ ์ž‘์„ฑํ•˜๊ธฐ์— ์•ž์„œ... ์‚ฌ์‹ค EDA๋ฅผ ์ด๋ ‡๊ฒŒ ์ง„ํ–‰ํ•˜๋Š” ๊ฒƒ์ด ๋งž๋Š”์ง€ ํ™•์‹คํ•˜์ง€๋Š” ์•Š์œผ๋‚˜ ์ตœ๋Œ€ํ•œ ์—ด์‹ฌํžˆ ๊ณต๋ถ€ํ•ด๋ณด์•˜์Šต๋‹ˆ๋‹ค.. ํ•˜ํ•˜ EDA(Exploratory Data Analysis, ํƒ์ƒ‰์  ๋ฐ์ดํ„ฐ ๋ถ„์„) ์ฐธ๊ณ : https://jalynne-kim.medium.com/๋ฐ์ดํ„ฐ๋ถ„์„-๊ธฐ์ดˆ-eda์˜-๊ฐœ๋…๊ณผ-๋ฐ์ดํ„ฐ๋ถ„์„-์ž˜-ํ•˜๋Š”-๋ฒ•-a3cac2cc5ebc ๊ฐœ๋… ๋ฒจ์—ฐ๊ตฌ์†Œ์˜ ์ˆ˜ํ•™์ž ‘์กด ํŠœํ‚ค’๊ฐ€ ๊ฐœ๋ฐœํ•œ ๋ฐ์ดํ„ฐ๋ถ„์„ ๊ณผ์ •์— ๋Œ€ํ•œ ๊ฐœ๋…์œผ๋กœ, ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„์„ํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ ๋‚ด๋Š” ๊ณผ์ •์— ์žˆ์–ด์„œ ์ง€์†์ ์œผ๋กœ ํ•ด๋‹น ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ‘ํƒ์ƒ‰๊ณผ ์ดํ•ด’๋ฅผ ๊ธฐ๋ณธ์œผ๋กœ ๊ฐ€์ ธ์•ผ ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธ ์ฒ˜์Œ์— ๋กœ์šฐ๋ฐ์ดํ„ฐ(raw data)๋ฅผ ์ ‘ํ•  ๋•Œ๋ถ€ํ„ฐ ๋ฐ์ดํ„ฐ๋ฅผ ์ž˜ ์ดํ•ดํ•˜๊ณ  ํŒŒ์•…ํ•œ ๋‹ค์Œ, ์–ด๋–ค ๊ฒฐ๊ณผ๋ฅผ ๋งŒ๋“ค์–ด๋‚ผ ์ง€ ‘์ด feature(column)๋กœ ํ•„ํ„ฐํ•ด๋ณด๊ณ , ์ €..

[DACON/김세연] Dissecting classification models, how classification models are evaluated

    # Try out a new classification model
    import warnings
    warnings.filterwarnings('ignore')
    from sklearn.model_selection import train_test_split
    import pandas as pd
    # Load the training dataset
    data = pd.read_csv("C:/Users/lucy8/OneDrive/바탕 화면/train.csv", encoding = 'utf-8')
    # The target to predict is the type, so separate it from the rest of the data
    X = data.drop(columns=["class"])
    y = data[["class"]]
    # Split into a training dataset and a test dataset
    # The training-to-test ratio is ..
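The split step the excerpt is leading up to can be sketched as follows. The data is hypothetical, and the 80/20 ratio, `random_state`, and `stratify` are assumptions, because the excerpt is cut off before the ratio is stated.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data; "class" is used as the target column, as in the excerpt.
data = pd.DataFrame({
    "f1": range(10),
    "f2": [v * 2 for v in range(10)],
    "class": [0, 1] * 5,
})

X = data.drop(columns=["class"])  # every column except the target
y = data["class"]                 # the target column

# test_size=0.2 is an assumed ratio, not the one from the original post.
# stratify=y keeps the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible, which helps when comparing several models on the same held-out rows.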

[DACON/김경은] Dissecting classification models, how classification models are evaluated

์„œํฌํŠธ๋ฒกํ„ฐ๋จธ์‹  ๋ชจ๋ธ #ํ•„์š” ํŒจํ‚ค์ง€ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score import pandas as pd import warnings warnings.filterwarnings('ignore') from sklearn import svm #๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ train_data = pd.read_csv("C:/Users/twink/Documents/์นด์นด์˜คํ†ก ๋ฐ›์€ ํŒŒ์ผ/train.csv") # type๊ณผ ๋‚˜๋จธ์ง€ ๋ฐ์ดํ„ฐ๋“ค์„ ๋ถ„๋ฆฌ X = train_data.iloc[:, 2:] y = train_data.iloc[:, 1] # training dataset๊ณผ test d..

[DACON/김규리] Dissecting classification models, how classification models are evaluated

Let's put the Naive Bayes classification model we looked at last time into practice.
*Source* https://m.blog.naver.com/baek2sm/221786426960 (Python Naive Bayes classification machine learning algorithm example, blog.naver.com)
(I followed this author's post almost exactly for this exercise, so I'm crediting the source up front.)
1. Overview of the Naive Bayes classification model
- A machine learning technique, traditionally used for text classification
- A supervised learning algorithm
- Its basic principle is an application of Bayes' theorem
- Often explained using spam mail classification
Spam mail classification: the frequency of the words appearing in the text..
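To make the spam example concrete, here is a small from-scratch sketch of word-frequency Naive Bayes in plain Python. The four training messages are invented, Laplace (add-one) smoothing is an added assumption, and this is not the code from the referenced post, which uses scikit-learn.

```python
import math
from collections import Counter

# Tiny invented training set of (text, label) pairs.
train_docs = [
    ("win free prize now", "spam"),
    ("free money win", "spam"),
    ("meeting schedule tomorrow", "ham"),
    ("project meeting notes", "ham"),
]

# Count how often each word appears per class, and how many documents per class.
word_counts = {"spam": Counter(), "ham": Counter()}
doc_counts = Counter()
for text, label in train_docs:
    doc_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {word for counts in word_counts.values() for word in counts}


def log_posterior(text, label):
    """Unnormalized log P(label | text), treating words as independent."""
    total_words = sum(word_counts[label].values())
    # Prior: share of training documents with this label.
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    for word in text.split():
        # Likelihood with add-one smoothing so unseen words do not give log(0).
        score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
    return score


def classify(text):
    """Return the label with the highest posterior score."""
    return max(word_counts, key=lambda label: log_posterior(text, label))


print(classify("free prize"))
print(classify("meeting notes"))
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow on longer messages.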

[DACON/최다예] Dissecting classification models, how classification models are evaluated

Implementing the KNN algorithm
    # Import the required packages
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score
    # Load the data
    train_data = pd.read_csv("C:/Users/allye/Desktop/DSOB/WIDA Dacon/DCSTree/train.csv")
    # Separate type from the rest of the data
    X = train_data.iloc[:, 2:]
    y = train_data.iloc[:, 1]
    # Split the dataset (training data and validation data)
    #..
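A minimal sketch of how `KNeighborsClassifier` and `accuracy_score` fit together, using a handful of hypothetical 2-D points instead of the competition data. The choice of `n_neighbors=3` is an assumption.

```python
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated hypothetical clusters: class 0 near (0, 0), class 1 near (10, 10).
X_train = [[0, 0], [0, 1], [1, 0], [1, 1], [9, 9], [9, 10], [10, 9], [10, 10]]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]

# Held-out points with known labels, used only for evaluation.
X_val = [[0.5, 0.5], [9.5, 9.5], [2, 1]]
y_val = [0, 1, 0]

# Each prediction is a majority vote among the 3 nearest training points.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

pred = knn.predict(X_val)
acc = accuracy_score(y_val, pred)
print(pred, acc)
```

Because KNN relies on distances, features on very different scales usually need to be standardized first; the toy points here share one scale, so that step is omitted.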

[DACON/조아영] Dissecting classification models, how classification models are evaluated (an SVM model will be added too)

Decision Tree
The code is not attached separately as a block: https://github.com/cAhyoung/dacon_stars_type_clf/blob/main/practive_code/dt_rf_practice.py
Hyperparameters
- criterion: default="gini", the measure used to evaluate how to split the data
    - "gini": splits based on Gini impurity
    - "entropy": splits based on entropy
    - "log_loss": splits based on log loss
- splitter: default="best"
    - "best": chooses the best split
    - "random": chooses the best random split
- max_depth: default=None, the maximum depth of the tree structure..
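To show what the "gini" and "entropy" criteria actually measure, here is a small plain-Python sketch that computes both impurity scores for a list of labels. The helper names are hypothetical, and this illustrates the formulas only, not scikit-learn's internal implementation.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Share of each distinct class among the labels."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2). 0 means the node is pure."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def entropy(labels):
    """Shannon entropy in bits: sum(-p_k * log2(p_k)). 0 means the node is pure."""
    return sum(-p * math.log2(p) for p in class_probabilities(labels))


mixed = ["A", "A", "B", "B"]  # evenly mixed node: the most impure two-class case
pure = ["A", "A", "A", "A"]   # single-class node

print(gini(mixed), entropy(mixed))
print(gini(pure), entropy(pure))
```

A decision tree picks the split that lowers the chosen impurity the most. In scikit-learn the criterion is selected with, for example, `DecisionTreeClassifier(criterion="entropy", max_depth=3)`.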