[WIDA] First-Year Group: 2nd Report

Department of Data Science, 2024 entering class, Lee Hyeon-jin (이현진) 2024. 10. 4. 18:30

Topic: Analyzing the relationship between Titanic passengers' survival rate and their attributes

 

Current progress

 


*All work was done in Colab.

[Data Preprocessing]

Loading the required modules and data

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Install and register a Korean font (NanumGothic) so Korean text renders in Colab plots
import matplotlib.font_manager as fm  

!apt-get update -qq         
!apt-get install fonts-nanum* -qq 

fe = fm.FontEntry(fname=r'/usr/share/fonts/truetype/nanum/NanumGothic.ttf', name='NanumGothic')
fm.fontManager.ttflist.insert(0, fe)  
plt.rcParams.update({'font.size': 10, 'font.family': 'NanumGothic'})

 

1. Checking the data

df = pd.read_csv('train.csv')
df

  • Survived - survival status, the target value (0 = died, 1 = survived)
  • Name - passenger name
  • Pclass - ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  • Sex - sex
  • Age - age in years
  • SibSp - total number of siblings and spouses aboard
  • Parch - total number of parents and children aboard
  • Embarked - port of embarkation
  • Fare - passenger fare
  • Ticket - ticket number
  • Cabin - cabin number

PassengerId is only a running passenger number, so it does not appear to be related to survival.

 

df.info()

# Check the shape of the data
df.shape

# Count missing values per column
df.isnull().sum()

# Fraction of missing values per column
df.isnull().mean()



Age, Cabin, and Embarked contain missing values.

Cabin in particular is missing for so many rows (roughly 77%) that it is unlikely to be a meaningful column, so we drop it.
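
The decision above can be sketched as a rule: compute the missing fraction per column and drop the columns above a cutoff. This is only an illustration on a tiny made-up frame; the 0.5 cutoff and the name `threshold` are assumptions, not values used in the report.

```python
import numpy as np
import pandas as pd

# Toy data standing in for train.csv (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# Fraction of missing values per column.
missing_ratio = toy.isnull().mean()

# Hypothetical cutoff: drop columns that are more than half empty.
threshold = 0.5
to_drop = missing_ratio[missing_ratio > threshold].index.tolist()
cleaned = toy.drop(columns=to_drop)
```

With this toy data only Cabin exceeds the cutoff, so Age and Embarked remain and can be imputed instead.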

 

2. Handling missing Cabin values

# Drop the Cabin column
new_df = df.copy()
new_df = new_df.drop(['Cabin'], axis=1)  # axis=1 drops a column



Next, we fill in the missing values in Age and Embarked.

 

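3. Handling missing Age values

# Age visualization (1): box plot (missing values are ignored)
sns.boxplot(y='Age', data=new_df)

# Age visualization (2): kernel density estimate (missing values are ignored)
sns.displot(data=new_df, x='Age', kind='kde')



The Age values are widely spread, and there are too many outliers for the mean to be a reliable fill value.

We therefore replace the missing values with the median, which is less sensitive to outliers.

# Replace missing Age values with the median
age_median = np.nanmedian(new_df["Age"])
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to new_df
new_df['Age'] = new_df['Age'].fillna(age_median)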
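
As a small illustration of why the median is preferred here, the sketch below compares the mean and the median on a made-up series with one outlier. The numbers are invented for the example and are not taken from train.csv.

```python
import numpy as np
import pandas as pd

# Made-up ages: four values close together, one outlier, one missing.
ages = pd.Series([20.0, 22.0, 24.0, 26.0, 80.0, np.nan])

mean_age = ages.mean()      # pulled upward by the outlier
median_age = ages.median()  # stays with the bulk of the data

# Assigning the result of fillna avoids chained in-place modification.
filled = ages.fillna(median_age)
```

Here the single outlier lifts the mean well above every typical value, while the median stays at 24.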

 

 

4. Handling missing Embarked values

# Passenger count per embarkation port
new_df['Embarked'].value_counts()



Only 2 values are missing, and 'S' is the port with the most passengers, so we simply fill the missing values with the mode, "S".

# Fill missing Embarked values with the mode, S
new_df['Embarked'] = new_df['Embarked'].fillna('S')
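
Instead of hard-coding "S", the mode can also be looked up from the data. The following is a minimal sketch on a made-up series, not the report's actual column.

```python
import numpy as np
import pandas as pd

# Made-up port codes with two missing entries.
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

# mode() ignores missing values by default; [0] takes the most frequent value.
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)
```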

 

# Confirm that no missing values remain
new_df.isnull().sum()

 

[Data Visualization]

1. Survived

# Survival ratio (pie chart) and number of deaths and survivors (bar chart)
fig, ax = plt.subplots(1, 2, figsize=(14,5))
labels = ['Died', 'Survived']
colors=['gray','red']

# sort_index() fixes the order to 0 (died), 1 (survived) so it matches labels and colors
survived_counts = new_df['Survived'].value_counts().sort_index()

survived_counts.plot.pie(ax=ax[0], explode=[0,0.1], shadow=True, autopct='%1.1f%%', labels=labels, colors=colors, title='Survival ratio')
ax[0].set(ylabel='')

# Draw the bar chart explicitly on the second subplot
ax[1].bar(survived_counts.index, survived_counts.values, color=colors)
ax[1].set_title('Number of deaths and survivors')
ax[1].set_xticks(survived_counts.index)
ax[1].set_xticklabels(labels)
ax[1].set_ylabel('Count')

plt.show()

 

 

61.6% of passengers died and 38.4% survived, so deaths clearly outnumber survivors.
In absolute terms, more than 300 passengers survived and roughly 550 died.

 

2. Relationship between Survived and the other attributes

 

2-1 Pclass

# Pclass visualization 01: passengers per class and survived/died counts per class
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

axes[0].set_title("Passengers per class\n", size=15)
# A single color is passed with color=; palette= is meant for use together with hue
sns.countplot(data=new_df, x='Pclass', ax=axes[0], alpha=0.7, color='blue')

axes[1].set_title("Survived/died counts per class\n", size=15)
sns.countplot(x="Pclass", hue="Survived", data=new_df, ax=axes[1], alpha=0.7, palette=['gray','red'])
axes[1].legend(labels=['Died', 'Survived'])



Third class has the most passengers, and it also has the most deaths.
First-class passengers, by comparison, appear to have survived in relatively large numbers.

 

# Pclass visualization 02: share of each class among deaths and among survivors

fig, ax = plt.subplots(1,2,figsize=(10,4), constrained_layout=True)
colors=['royalblue','tomato','limegreen']
explode=[0.01,0.01,0.01]
labels = ['1st class', '2nd class', '3rd class']


# Pie plot using the rows where Survived is 0
new_df[new_df['Survived'] == 0]['Pclass'].value_counts().sort_index().plot.pie(ax=ax[0], autopct='%1.1f%%',colors=colors,explode=explode,labels=labels)
ax[0].set(ylabel='', title='Share of deaths - Pclass')

# Pie plot using the rows where Survived is 1
new_df[new_df['Survived'] == 1]['Pclass'].value_counts().sort_index().plot.pie(ax=ax[1], autopct='%1.1f%%',colors=colors,startangle=45,explode=explode, labels=labels)
ax[1].set(ylabel='', title='Share of survivors - Pclass')

plt.show()



 

In the chart of passengers who died, third class accounts for 67.8%, by far the largest share.
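
The pie charts show how the deaths are divided among the classes, which is a different quantity from the death rate within each class. A minimal sketch of both calculations on a made-up frame (the values are invented, only the column names match the dataset):

```python
import pandas as pd

# Made-up passengers; only the column names follow train.csv.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

# Composition of the deaths: which class do the passengers who died come from?
share_of_deaths = (
    toy.loc[toy["Survived"] == 0, "Pclass"]
    .value_counts(normalize=True)
    .sort_index()
)

# Death rate within each class: what fraction of each class died?
death_rate = 1 - toy.groupby("Pclass")["Survived"].mean()
```

The first quantity depends on how many passengers each class had in total, so the second is usually the fairer comparison between classes.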

 

2-2 Name

# Name: extract only the title (the text between the comma and the first period) into a new column
new_df['Title'] = new_df['Name'].str.extract(r',\s*([^.]+)\.', expand=False)

new_df['Title'].value_counts()

 

  • Mr (Mister): the general title for adult men, used regardless of marital status.
  • Mrs (Missus): a title for married women, traditionally used with the husband's surname.
    "Missus" reflects how the title is pronounced.
  • Ms: a title for women that does not indicate marital status. Its pronunciation falls between "Miss" and "Mrs",
    and it is mainly used when a woman does not want to emphasize her marital status.
  • Master: a title used mainly for boys, until they reach adulthood.
  • Dr: doctor
  • Rev: Reverend, a member of the clergy (=Mr)
  • Mlle: Mademoiselle, French for Miss (=Miss)
  • Major: a military officer rank (=Mr)
  • Col: Colonel (=Mr)
  • the Countess: a countess (=Mrs)
  • Capt: Captain (=Mr)
  • Ms: treated as Miss (=Miss)
  • Sir: an honorific for men (=Mr)
  • Lady: an honorific for women (=Miss)
  • Mme: Madame, French for Mrs (=Mrs)
  • Don: a Spanish honorific for men of high standing (=Mr)

We collapse all of these titles into four groups: Mr, Mrs, Miss, and Master.

"Dr" is the only title that does not make the passenger's sex clear, so we inspect those rows separately and correct them by hand.

 

new_df[new_df['Title']=='Dr']

# Among the "Dr" rows, manually change the one female passenger to "Miss"
new_df.loc[796,"Title"] = "Miss"
# Apply the title mapping
title_map = {"Mr":"Mr", "Miss":"Miss", "Mrs":"Mrs", "Master":"Master", "Dr":"Mr", "Rev":"Mr", "Mlle":"Miss",
             "Major":"Mr", "Col":"Mr", "the Countess":"Mrs", "Capt":"Mr", "Ms":"Miss", "Sir":"Mr",
             "sir":"Mr", "Lady":"Miss", "Mme":"Mrs", "Don":"Mr", "Jonkheer":"Mr"}

new_df['Title'] = new_df['Title'].map(title_map)

# Name visualization 01: survived/died counts by title
import matplotlib.patches as mpatches

survived = new_df[new_df['Survived'] == 1].groupby('Title').size()
dead = new_df[new_df['Survived'] == 0].groupby('Title').size()


labels = ['Mr', 'Master', 'Mrs', 'Miss']
survived_colors = ['blue', 'blue', 'red', 'red']

survived_ordered = survived.reindex(labels, fill_value=0)
dead_ordered = dead.reindex(labels, fill_value=0)

x = np.arange(len(labels))
width = 0.35
fig, ax = plt.subplots()

rects1 = ax.bar(x - width/2, survived_ordered, width, label='Survived', alpha=0.6, color=survived_colors)
rects2 = ax.bar(x + width/2, dead_ordered, width, label='Dead', color='gray')

handles = [mpatches.Patch(color='gray', label='Dead'),
           mpatches.Patch(color='blue', label='Male_Survived'),
           mpatches.Patch(color='red', label='Female_Survived')]
ax.legend(handles=handles)
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.set_ylabel('Count')
ax.set_title('Survived/died counts by title\n')

fig.tight_layout()
plt.show()

 


For both female titles, Mrs and Miss, survivors outnumber deaths.
For Mr, deaths overwhelmingly outnumber survivors.
Master shows a survival rate of roughly one half.
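
The title extraction and grouping used in this section can be sketched on a few names in the dataset's "Surname, Title. Given names" format. The regular expression and the shortened `title_map` below are illustrative assumptions, not the full mapping used above.

```python
import pandas as pd

# A few names in the "Surname, Title. Given names" format.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture everything between the comma and the first period that follows it.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)

# Shortened, illustrative mapping; rare titles are folded into the four main groups.
title_map = {"Mr": "Mr", "Mrs": "Mrs", "Miss": "Miss", "Master": "Master", "the Countess": "Mrs"}
grouped = titles.map(title_map)
```

Any title missing from the mapping becomes NaN after `map`, so checking `grouped.isnull().sum()` is a quick way to spot titles that still need a rule.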

 

2-3 Sex

# Sex visualization 01: bar chart of passenger counts
plt.figure(figsize=(8, 6))
# order= pins the bar order so that blue is male and red is female
sns.countplot(data=new_df, x='Sex', order=['male', 'female'], alpha=0.7, palette=['blue','red'])
plt.title('Passengers by sex')
plt.show()

There are about twice as many male passengers as female passengers.

 

# Sex visualization 02: survived/died counts by sex
order = ['male', 'female']
survived = new_df[new_df['Survived'] == 1].groupby('Sex').size().reindex(order)
dead = new_df[new_df['Survived'] == 0].groupby('Sex').size().reindex(order)

labels = ['Male', 'Female']

x = np.arange(len(labels))
width = 0.35

fig, ax = plt.subplots()

# Male survivors are drawn in blue, female survivors in red
rects1 = ax.bar(x - width/2, survived, width, label='Survived', alpha=0.6, color=['blue', 'red'])
rects2 = ax.bar(x + width/2, dead, width, label='Dead', color='gray')

handles = [mpatches.Patch(color='gray', label='Dead'),
           mpatches.Patch(color='blue', label='Male_Survived'),
           mpatches.Patch(color='red', label='Female_Survived')]

ax.legend(handles=handles)
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.set_ylabel('Count')
ax.set_title('Survived/died counts by sex\n')

fig.tight_layout()
plt.show()

 

Female passengers had a higher survival rate than male passengers.

 

# Sex visualization 03: share of each sex among deaths and among survivors

fig, ax = plt.subplots(1,2,figsize=(10,6), constrained_layout=True)
labels = ['Female', 'Male']  # sort_index() below orders the counts as female, male
colors=['tomato', 'royalblue']
explode=[0.01,0.01]

# Pie plot using the rows where Survived is 0
new_df[new_df['Survived'] == 0]['Sex'].value_counts().sort_index().plot.pie(ax=ax[0], autopct='%1.1f%%', labels=labels,colors=colors, explode=explode)
ax[0].set(ylabel='', title='Died - Sex')

# Pie plot using the rows where Survived is 1
new_df[new_df['Survived'] == 1]['Sex'].value_counts().sort_index().plot.pie(ax=ax[1], autopct='%1.1f%%', labels=labels,colors=colors, explode=explode)
ax[1].set(ylabel='', title='Survived - Sex')

plt.show()

 

 

Viewed as proportions in the pie charts, the difference between the sexes is even clearer: men make up most of the deaths, while women make up most of the survivors.

 

2-3-2 Pclass and Sex

# Pclass & Sex: cross-tabulation of counts
pd.crosstab([new_df.Sex,new_df.Survived],new_df.Pclass,margins=True).style.background_gradient(cmap='summer_r')



# Pclass & Sex visualization 2: mean survival rate per class, split by sex

sns.catplot(x='Pclass',y="Survived",hue='Sex',data=new_df,kind='point', palette={'female': 'red', 'male': 'blue'})
plt.title('Survival rate by Sex and Pclass\n', fontsize=16)
plt.xlabel('Pclass', fontsize=12)
plt.ylabel('Survived', fontsize=12)
plt.show()

 


Women in first class have the largest number of survivors.
Conversely, men in third class have the largest number of deaths.

First-class passengers survived at a relatively higher rate than the other classes, regardless of sex.
In every class, women had a higher survival rate than men.
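
The rates read off the point plot can also be tabulated directly. The sketch below uses `pivot_table` on a small made-up frame; the values are invented and only the column names follow the dataset.

```python
import pandas as pd

# Made-up passengers; only the column names follow train.csv.
toy = pd.DataFrame({
    "Sex":      ["female", "female", "male", "male", "female", "male", "male", "female"],
    "Pclass":   [1, 3, 1, 3, 1, 3, 1, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean of the 0/1 Survived column = survival rate for each Sex x Pclass cell.
rate = toy.pivot_table(index="Sex", columns="Pclass", values="Survived", aggfunc="mean")
```

Because Survived is coded 0/1, its mean in each cell is the survival rate, which is the same quantity the point plot draws.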

 

2-4 Age

# Age visualization: stacked histogram
plt.figure(figsize=(8, 5))
sns.histplot(data=new_df, x='Age', hue='Survived', multiple='stack', kde=False, bins=20, palette={0: 'red', 1: 'blue'})
plt.title('Age vs Survived', fontsize=16, weight='bold')
plt.xlabel('Age', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.grid(True, which='major', linestyle='--', linewidth=0.5)
plt.show()

 

 

Passengers in their 20s and 30s are the most numerous (the missing ages that were filled with the median also fall in this range).

  • High survival among young passengers (ages 0-10): as the left side of the plot shows, blue (survivors) exceeds red (deaths) in the 0-10 range.
  • Lower survival after the 30s: from the 30s onward, the blue portion becomes relatively smaller and red dominates.
  • High mortality among elderly passengers: there are very few passengers aged 60 or older, but most of them died.

 

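# Age visualization 2: survival rate of passengers younger than each age threshold
thresholds = range(1, 80)
age_range_survival_ratio = []

# For each threshold from 1 to 79, compute the survival rate of everyone younger than it
for i in thresholds:
    age_range_survival_ratio.append(new_df[new_df['Age'] < i]['Survived'].mean())

plt.figure(figsize=(7,7))
# Pass the thresholds as x values so the axis shows ages rather than list positions
plt.plot(thresholds, age_range_survival_ratio)
plt.title('Age vs Survival Rate')
plt.ylabel('Survival rate of passengers younger than the given age')
plt.xlabel('Age')

plt.show()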

 

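The curve is cumulative: each point is the survival rate of all passengers younger than that age. It is high at the youngest ages and declines as older passengers are included, which indicates that younger passengers, and children in particular, were more likely to survive the Titanic disaster.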
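
A cumulative rate mixes every younger passenger into each point, so a per-band rate can be a useful complement. The following sketch contrasts the two views on made-up data; the band edges are an assumption chosen for the example.

```python
import pandas as pd

# Made-up passengers; only the column names follow train.csv.
toy = pd.DataFrame({
    "Age":      [2, 8, 15, 25, 28, 33, 45, 62],
    "Survived": [1, 1, 0, 0, 1, 0, 1, 0],
})

# Cumulative view: survival rate of everyone younger than a threshold.
def rate_younger_than(threshold):
    return toy.loc[toy["Age"] < threshold, "Survived"].mean()

# Binned view: each passenger is counted in exactly one age band.
# Hypothetical band edges; right=False makes each band [low, high).
bands = pd.cut(toy["Age"], bins=[0, 10, 30, 60, 100], right=False)
rate_by_band = toy.groupby(bands, observed=False)["Survived"].mean()
```

In the binned view a low rate in one band cannot be hidden by a high rate among younger passengers, which makes differences between age groups easier to read.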

 

2-5 SibSp

pd.crosstab(new_df.SibSp, new_df.Survived, margins=True).style.background_gradient(cmap='summer_r')

 

sns.barplot(x='SibSp', y='Survived', data=new_df,palette='Blues_d').set_title('SibSp vs Survived')
plt.show()

 

The plot shows that the survival rate is highest when SibSp is 1, at about 53.6%, and that it tends to fall as the number of siblings and spouses aboard grows.

 

2-6 Parch

pd.crosstab(new_df.Parch, new_df.Survived).style.background_gradient(cmap='summer_r')

new_df[['Parch', 'Survived']].groupby(['Parch']).mean()

sns.barplot(x='Parch', y='Survived', data=new_df,palette='Purples_d').set_title('Parch vs Survived')
plt.show()

 

The survival rate is highest when Parch is between 1 and 3, and relatively low, at about 34%, when Parch is 0. When Parch is 4 or more, the survival rate drops sharply.

 

2-6-2 SibSp and Parch

# Survival rate per SibSp / Parch value. These two tables are not defined anywhere above,
# so they are assumed here to be the group-wise mean of Survived.
sibsp_survived = new_df.groupby('SibSp')['Survived'].mean().to_frame('Survival Rate')
parch_survived = new_df.groupby('Parch')['Survived'].mean().to_frame('Survival Rate')

# Visualization: SibSp and Parch vs. survival rate
plt.figure(figsize=(14, 6))

# SibSp bar chart
plt.subplot(1, 2, 1)
sns.barplot(data=sibsp_survived.reset_index(), x='SibSp', y='Survival Rate', palette='coolwarm')
plt.title('SibSp vs Survival Rate', fontsize=16)
plt.xlabel('SibSp (Number of Siblings/Spouses)', fontsize=12)
plt.ylabel('Survival Rate', fontsize=12)
plt.xticks(rotation=0)
plt.grid(axis='y')

# Parch bar chart
plt.subplot(1, 2, 2)
sns.barplot(data=parch_survived.reset_index(), x='Parch', y='Survival Rate', palette='coolwarm')
plt.title('Parch vs Survival Rate', fontsize=16)
plt.xlabel('Parch (Number of Parents/Children)', fontsize=12)
plt.ylabel('Survival Rate', fontsize=12)
plt.xticks(rotation=0)
plt.grid(axis='y')

plt.tight_layout()
plt.show()

 

SibSp (number of siblings/spouses) and Parch (number of parents/children) show similar trends in relation to survival.

Both variables are associated with the survival rate: passengers who boarded with a few family members were more likely to be rescued than those traveling alone.

The survival rate is higher when SibSp is 1 or 2 and lower when it is 0. Parch behaves similarly: the survival rate is high for values from 1 to 3, falls to about 34% at 0, and drops sharply at 4 or more.
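
When comparing rates across SibSp or Parch values, it can help to show the group size next to each rate, since the larger values may rest on only a few passengers. A minimal sketch on made-up data (the column names `rate` and `n` are chosen here for illustration):

```python
import pandas as pd

# Made-up passengers; only the column names follow train.csv.
toy = pd.DataFrame({
    "SibSp":    [0, 0, 0, 0, 1, 1, 1, 2, 5],
    "Survived": [0, 0, 1, 0, 1, 1, 0, 1, 0],
})

# Survival rate and number of passengers for each SibSp value.
summary = toy.groupby("SibSp")["Survived"].agg(rate="mean", n="count")
```

In this toy table the rates for SibSp 2 and 5 each come from a single passenger, which is the kind of case where a bar height alone can mislead.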

 

2-7 Ticket

# Ticket visualization: mean survival rate per ticket number
ticket_survival = new_df.groupby('Ticket')['Survived'].mean().to_frame('Survival Rate')
plt.figure(figsize=(12, 8))
# One heatmap row per distinct ticket; there are far too many rows to annotate or read
sns.heatmap(ticket_survival, cmap="YlGnBu")
plt.title('Survival Rate by Ticket')
plt.xlabel('')
plt.ylabel('Ticket')
plt.show()

There are too many distinct ticket numbers, so Ticket does not seem to be a suitable variable for analyzing the relationship with survival through visualization.

 

2-8 Fare

# Fare visualization 01: fare density for survivors vs. deaths
plt.figure(figsize=(10, 6))
# sns.distplot is deprecated; sns.kdeplot draws the density curves directly
sns.kdeplot(x=new_df.loc[new_df['Survived'] == 1, 'Fare'], label='Survived', color='blue')
sns.kdeplot(x=new_df.loc[new_df['Survived'] == 0, 'Fare'], label='Died', color='red')
plt.title('Survival by Fare')
plt.xlabel('Fare')
plt.ylabel('Density')  # compare the densities of survivors and deaths
plt.legend()
plt.show()

Fare between 0 and 50:
Both deaths and survivors have their highest density in this range.
Deaths are especially concentrated at the very lowest fares.
This suggests that third-class passengers, who paid low fares, account for most of the deaths.

Fare between 50 and 150:
The density of survivors becomes higher than that of deaths.
This means that passengers who paid relatively high fares were more likely to survive.

Fare of 150 or more:
The density of survivors is very small but still present, which suggests that first-class passengers were more likely to survive.
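
The three fare ranges discussed above can be turned into explicit bands and compared by survival rate. This is a sketch on made-up fares; the band labels are chosen here for illustration.

```python
import pandas as pd

# Made-up passengers; only the column names follow train.csv.
toy = pd.DataFrame({
    "Fare":     [7.25, 8.05, 13.0, 26.0, 71.28, 120.0, 151.55, 263.0],
    "Survived": [0, 0, 0, 1, 1, 1, 0, 1],
})

# Bands matching the ranges in the text: [0, 50), [50, 150), [150, inf).
fare_band = pd.cut(
    toy["Fare"],
    bins=[0, 50, 150, float("inf")],
    labels=["0-50", "50-150", "150+"],
    right=False,
)
rate_by_fare = toy.groupby(fare_band, observed=False)["Survived"].mean()
```

A per-band rate answers "how likely was survival at this fare level" directly, whereas the density curves only show where each group's fares are concentrated.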

 

2-9 Embarked

# Embarked visualization 01: survival rate by port of embarkation
plt.figure(figsize=(10, 6))
sns.barplot(x='Embarked', y='Survived', data=new_df)
plt.title('Survival Rate by Embarked')
plt.ylabel('Survival Rate')
plt.xlabel('Embarked')
plt.show()

 

The survival rate increases in the order S, Q, C.

 

2-9-2 Pclass & Embarked

# Pclass & Embarked visualization: embarkation ports within each class
Pclass1 = new_df[new_df['Pclass']==1]['Embarked'].value_counts()
Pclass2 = new_df[new_df['Pclass']==2]['Embarked'].value_counts()
Pclass3 = new_df[new_df['Pclass']==3]['Embarked'].value_counts()

fig, ax = plt.subplots(figsize=(10,6))
# Use a new name here so the original df loaded from train.csv is not overwritten
class_by_port = pd.DataFrame([Pclass1, Pclass2, Pclass3])
class_by_port.index = ['1st class','2nd class','3rd class']
class_by_port.plot(kind='bar', stacked=True, ax=ax)
plt.xticks(rotation=45)
plt.xlabel('Class')
plt.ylabel('Count')
plt.title('Embarkation Port Distribution by Class')
plt.show()

At port S, which has the lowest survival rate, third class is the largest group of passengers.
Conversely, at port C, which has the highest survival rate, first class is the largest group.
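
The class mix within each port can be checked with a row-normalized cross-tabulation. The sketch below uses a made-up frame, so the shares it produces are illustrative only.

```python
import pandas as pd

# Made-up passengers; only the column names follow train.csv.
toy = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "C", "Q"],
    "Pclass":   [3, 3, 1, 2, 1, 1, 3, 3],
})

# normalize="index" makes each row (port) sum to 1, giving the class mix per port.
share = pd.crosstab(toy["Embarked"], toy["Pclass"], normalize="index")

# The class with the largest share at each port.
largest_class = share.idxmax(axis=1)
```

Normalizing by row separates the question "which class dominates at this port" from the question "which port has the most passengers of this class", which raw counts tend to blur.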

 

3. Correlations among attributes

train_heat = pd.read_csv("/content/train.csv")

# Encode Sex as 0 (male) / 1 (female)
train_heat['Sex'] = train_heat['Sex'].map({'male': 0, 'female': 1})

# Fill missing Embarked values with the mode "S", then encode the ports as integers
train_heat['Embarked'] = train_heat['Embarked'].fillna("S").map({'C': 0, 'Q': 1, 'S': 2}).astype('int64')

# Fill missing Age values with the median (assign back rather than using inplace=True)
age_median = np.nanmedian(train_heat["Age"])
train_heat['Age'] = train_heat['Age'].fillna(age_median)

# Extract the title from Name, as in section 2-2
train_heat['Title'] = train_heat['Name'].str.extract(r',\s*([^.]+)\.', expand=False)

train_heat['Title'].value_counts()

# The one female "Dr" (row 796) is set to "Miss" by hand
train_heat.loc[796,"Title"] = "Miss"

title_map = {"Mr":"Mr", "Miss":"Miss", "Mrs":"Mrs", "Master":"Master", "Dr":"Mr", "Rev":"Mr", "Mlle":"Miss",
             "Major":"Mr", "Col":"Mr", "the Countess":"Mrs", "Capt":"Mr", "Ms":"Miss", "Sir":"Mr",
             "sir":"Mr", "Lady":"Miss", "Mme":"Mrs", "Don":"Mr", "Jonkheer":"Mr"}

# Collapse the titles into four groups and encode them as integers
train_heat['Title'] = train_heat['Title'].map(title_map).map({'Mr': 0, 'Miss': 1, 'Mrs': 2, 'Master': 3}).astype('int64')

train_heat = train_heat.drop('PassengerId', axis=1)

def category_age(x):
    # 0 for ages under 10, 1 for 10-19, ..., capped at 7 for 70 and over
    return min(int(x // 10), 7)

train_heat['Age'] = train_heat['Age'].apply(category_age)

# Pearson correlation over the numeric columns
corr = train_heat.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlations among attributes\n')

plt.show()

Pairs with high correlation

Fare, Pclass
Title, Sex
Sex, Survived 
Pclass, Survived 
Title, Survived
Age, Pclass
Parch, SibSp
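
A list like the one above can also be read off the correlation matrix programmatically instead of by eye. The following is a minimal sketch on a made-up frame; it keeps each pair once and sorts by absolute correlation.

```python
import numpy as np
import pandas as pd

# Made-up numeric data; only the column names follow train.csv.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3],
    "Fare":     [80.0, 70.0, 30.0, 25.0, 8.0, 7.0],
    "Survived": [1, 1, 1, 0, 0, 0],
})
corr = toy.corr()

# Keep each pair once: upper triangle only, diagonal excluded.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Strongest relationships first, whether positive or negative.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
```

Sorting by absolute value matters because a strong negative correlation, such as higher class numbers going with lower fares, is just as informative as a strong positive one.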