[WIDA] Team 2, 3rd Report

8635 2025. 11. 28. 13:35

Preprocessing: switching from one-hot encoding to label encoding

We received feedback that the existing one-hot encoding inflated the number of columns so much that it was hindering training.

For the features previously handled with one-hot encoding, we adopted label encoding, which maps each value to an integer from 0 to n.
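The label encoding described above can be sketched as follows. This is a minimal illustration, not the team's actual preprocessing code: the helper name `label_encode` and the sample values are hypothetical, and it assumes missing values should become -1, as in several rows of the table that follows.

```python
import pandas as pd

def label_encode(series: pd.Series) -> pd.Series:
    """Map each distinct value to an integer 0..n; missing values become -1."""
    # pd.factorize uses -1 as the sentinel for missing values by default;
    # sort=True assigns the codes in sorted order of the distinct values.
    codes, _ = pd.factorize(series, sort=True)
    return pd.Series(codes, index=series.index, name=series.name)

# Hypothetical sample column (not taken from the project data).
hope = pd.Series(["offline", "online", None, "offline"], name="hope_for_group")
encoded = label_encode(hope)
print(encoded.tolist())
```

Where the report fixes a specific code for each value, an explicit dictionary passed to `Series.map` would make the mapping reproducible instead of relying on sort order.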

school1: school ID. Mean withdrawal rate. Fewer than 29 occurrences → Other
doublem: double major or not. 0/1
itmajor: IT major or not (either the first or the second major is IT). 0/1
ismajor: (data) major or not. 0/1
job: current status. 0~3
class1_1: class section. Mean withdrawal rate. Fewer than 100 occurrences → Other
re_registration: Yes=1, No=0, NaN=-1
nationality: domestic=0, foreign=1, NaN=-1
inflow_route: 0~4 for the main routes, Other=5
whyBDA: categories under 5% merged into Other, then 0~5
what_to_gain: categories under 5% merged into Other, then 0~4, NaN=-1
hope_for_group: offline=0, online=1, individual activity=2, NaN=-1
major_field: dropped
desired_career_path: text normalization → label encoding (0~3)
interested_company: dropped
expected_domain: categories with a share under 5% grouped into Other, then label encoded
project_type: individual=0, team=1
desired_job: rare categories (<10): merged into Other
certificate_acquisition: Multi-Label Encoding. Rare categories (<10): merged into Other
  Existing columns: ADsp, AWS, SQLD, Google Analytics, Big Data Analysis Engineer, Information Processing Engineer, Tableau, Computer Proficiency, In preparation, Other
desired_certificate: Multi-Label Encoding. Rare categories (<10): merged into Other
certificate_study_period:
  3 weeks before the exam date = 0
  4 weeks before the exam date = 1
  5 weeks before the exam date = 2
  3 weeks before the registration date = 3
  4 weeks before the registration date = 4
  5 weeks before the registration date = 5
desired_job_except_data: Multi-Label Encoding. Rare categories (<10): merge into Other?
  Existing columns: R&D/Research, Development, Management/Administration, Finance/Accounting, Planning/Strategy, Marketing/Advertising/MD, Sales, Undecided
incumbents_level: junior (years 0~3)=0, senior (year 10)=1
incumbents_company_level: rare categories (<10): merged into a related column or Other
  Domestic large IT company
  Domestic big tech
  Financial sector (general)
  Other
  Culture/Entertainment/Games
  Undecided
  Bio
  Semiconductors
  Startup C-level
  Sports
  Overseas big tech
incumbents_lecture_type: online=0, online and offline together=1, offline=2
incumbents_lecture_scale: rare categories (<10): merge into Other?
  100 or more listeners with 1-2 working professionals
  100 or more listeners with 10 or more working professionals
  100 or more listeners with 3 working professionals
  About 10 lecture listeners with 1 working professional
  About 3~50 lecture listeners with 1 working professional
  No preference
onedayclass_topic: Multi-Label Encoding. Rare categories (<10): merged into Other or a related column
  ML/DL/AI/NLP/CV/LLM
  Python / R (programming languages merged)
  SQL/DB
  Competitions/Projects/Portfolio/Career
  Data engineering/Big data/Cloud
  Visualization/BI
  Other
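Three operations recur in the table above: merging rare categories into Other, replacing an ID with its mean withdrawal rate, and multi-label encoding of multiple-choice answers. The sketch below shows one possible way to write them with pandas. The helper names, the comma separator, and the sample rows are assumptions for illustration, and it assumes "mean withdrawal rate" means the per-category mean of the `withdrawal` target.

```python
import pandas as pd

def merge_rare(series: pd.Series, min_count: int, other: str = "Other") -> pd.Series:
    """Replace categories seen fewer than min_count times with a single label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other)

def mean_rate_encode(df: pd.DataFrame, col: str, target: str) -> pd.Series:
    """Replace each category with the mean of the target within that category."""
    return df[col].map(df.groupby(col)[target].mean())

def multi_label_encode(series: pd.Series, sep: str = ",") -> pd.DataFrame:
    """Turn a delimited multi-answer column into one 0/1 column per answer."""
    return series.fillna("").str.get_dummies(sep=sep)

# Hypothetical rows (not taken from the project data).
df = pd.DataFrame({
    "school1": ["A", "A", "A", "B", "C"],
    "certificate_acquisition": ["SQLD,ADsP", "SQLD", None, "AWS", "ADsP"],
    "withdrawal": [1, 0, 1, 1, 0],
})
df["school1"] = merge_rare(df["school1"], min_count=2)            # B and C become Other
df["school1_rate"] = mean_rate_encode(df, "school1", "withdrawal")
certs = multi_label_encode(df["certificate_acquisition"])
print(df[["school1", "school1_rate"]])
print(certs)
```

Because the rate is computed from the target, it would normally be fitted on the training split only and then applied to validation and test rows, to avoid leaking labels.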

 

We retrained the models on a CSV file of 1056 rows × 53 columns.

 

CatBoost performance

Accuracy: 0.5047

F1-score: 0.4976

AUC: 0.6889 

 

Undersampling results:

Class distribution after undersampling: withdrawal 1 = 261 / 0 = 261

Accuracy: 0.5613 F1-score: 0.5792 AUC: 0.7363

 

Oversampling results:

Class distribution after oversampling: withdrawal 1 = 583 / 0 = 583

Accuracy: 0.6934 F1-score: 0.8189 AUC: 0.6734
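The report does not say which resampling method produced the 261/261 and 583/583 distributions. As a minimal sketch, assuming plain random resampling, balancing could look like this. The function name `resample_balanced` and the toy data are hypothetical.

```python
import pandas as pd

def resample_balanced(df: pd.DataFrame, target: str, mode: str, seed: int = 42) -> pd.DataFrame:
    """Randomly under- or over-sample so that every class has the same row count."""
    if mode not in ("under", "over"):
        raise ValueError("mode must be 'under' or 'over'")
    counts = df[target].value_counts()
    n = counts.min() if mode == "under" else counts.max()
    parts = []
    for _, group in df.groupby(target):
        if len(group) > n:
            # Under-sampling: keep a random subset of the larger class.
            group = group.sample(n=n, random_state=seed)
        elif len(group) < n:
            # Over-sampling: keep every row and add random duplicates.
            extra = group.sample(n=n - len(group), replace=True, random_state=seed)
            group = pd.concat([group, extra])
        parts.append(group)
    # Shuffle so the classes are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy training split: class 1 (withdrawal) is the majority, as in the report.
toy = pd.DataFrame({"x": range(9), "withdrawal": [1] * 6 + [0] * 3})
under = resample_balanced(toy, "withdrawal", "under")
over = resample_balanced(toy, "withdrawal", "over")
print(under["withdrawal"].value_counts().to_dict())
print(over["withdrawal"].value_counts().to_dict())
```

Resampling is usually applied to the training split only, so that validation metrics are measured on the original class distribution.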

 

XGBoost performance

Accuracy: 0.6604

Compared to the previous run:

  • Accuracy rose slightly (+1.4%)
  • Previous weighted F1: 0.65
  • Current weighted F1: 0.64 ← about the same or slightly worse
  • Class 1 is predicted correctly, but the rare class (0) is ignored even more than before.

Taking the model's characteristics into account, we also trained with missing values left empty instead of filling them with -1:

Accuracy: 0.6368

The gap between the classes widened and accuracy dropped.

 

RandomForest performance

With 500 trees: accuracy 0.7216981132075472, F1 0.8259587020648967

With 800 trees: accuracy 0.7311320754716981

 

In conclusion, we decided to build on RandomForest to improve performance.

 

Leaderboard (round 1)

Result of submitting withdrawal predictions from the random forest model:

public 0.64604, private 0.6471600688 (rank 151)

 

Improving random forest performance

  • Baseline RF: plain random forest.
  • RF class_weight: random forest with class weights.
  • SMOTE + RF: balancing the classes through oversampling.
  • UnderSample + RF: balancing the classes through under-sampling.
  • Balanced RF: an ensemble technique specialized for imbalanced data.

Characteristics by model

  1. Baseline RF (ID X)
    • Overall performance (f1_macro) is decent
    • Minority class 0 performance is low (f1_0 0.33, recall 0.21) → class 0 is still under-predicted
  2. RF class_weight
    • Performance is actually lower despite class_weight (f1_macro 0.54)
    • recall_0 improved slightly while precision_0 stayed the same → possible over-penalization
  3. SMOTE + RF
    • Trained with the minority class augmented
    • recall_0 is similar but precision_0 dropped to 0.5 → more false positives
    • Overall f1_macro is similar
  4. UnderSample + RF
    • Majority class reduced → minority-class F1 improved markedly (f1_0 0.51, recall_0 0.74)
    • precision_0 is low (0.39) → class 0 is over-predicted
    • Highest roc_auc (0.614) → the decision boundary is somewhat more balanced
  5. Balanced RF
    • Most balanced overall (highest f1_macro, 0.594)
    • Minority-class performance (f1_0 0.46, recall_0 0.50) → lower than UnderSample, but overall performance is good
    • roc_auc is also decent (0.60)
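The metric names used in this comparison (f1_macro, f1_0, precision_0, recall_0, roc_auc) can be computed with scikit-learn as sketched below. The helper name `report` and the toy labels are hypothetical, and it assumes the model outputs a probability for class 1.

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

def report(y_true, proba_1, threshold=0.5):
    """Per-class and overall metrics; a row is predicted as class 1 when proba_1 >= threshold."""
    y_pred = [int(p >= threshold) for p in proba_1]
    return {
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
        # pos_label=0 scores the minority class (0) instead of the default class 1.
        "f1_0": f1_score(y_true, y_pred, pos_label=0),
        "precision_0": precision_score(y_true, y_pred, pos_label=0),
        "recall_0": recall_score(y_true, y_pred, pos_label=0),
        # ROC AUC is threshold-free, so it uses the probabilities directly.
        "roc_auc": roc_auc_score(y_true, proba_1),
    }

# Hypothetical validation labels and class-1 probabilities.
y_valid = [0, 0, 1, 1, 1, 1]
proba_valid = [0.2, 0.6, 0.7, 0.8, 0.4, 0.9]
metrics = report(y_valid, proba_valid)
print(metrics)
```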

Threshold test

🥇 Maximize F1-macro: threshold 0.4700, F1-macro 0.6507, F1-0 0.5041, F1-1 0.7973. Still favors class 1 (withdrawal), but performance is balanced.
🥈 Maximize F1-0: threshold 0.5700, F1-macro 0.5457, F1-0 0.5200, F1-1 0.5714. Maximizes prediction performance for class 0 (no withdrawal).
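A threshold test like the one above can be sketched as a sweep over candidate cut-offs. This is an illustration under assumptions: the threshold is taken to apply to the class-1 probability, and the function name, the 0.30 to 0.70 grid, and the toy data are hypothetical.

```python
import numpy as np
from sklearn.metrics import f1_score

def sweep_thresholds(y_true, proba_1, thresholds):
    """Score each cut-off; a row is predicted as class 1 when proba_1 >= threshold."""
    proba_1 = np.asarray(proba_1)
    rows = []
    for t in thresholds:
        y_pred = (proba_1 >= t).astype(int)
        f1_0 = f1_score(y_true, y_pred, pos_label=0, zero_division=0)
        f1_1 = f1_score(y_true, y_pred, pos_label=1, zero_division=0)
        # Macro F1 is the unweighted mean of the two per-class F1 scores.
        rows.append({"threshold": float(t), "f1_macro": (f1_0 + f1_1) / 2,
                     "f1_0": f1_0, "f1_1": f1_1})
    return rows

# Hypothetical validation labels and class-1 probabilities.
y_valid = [0, 0, 1, 1, 1, 1, 0, 1]
proba_valid = [0.20, 0.55, 0.70, 0.80, 0.45, 0.90, 0.35, 0.60]

grid = np.round(np.linspace(0.30, 0.70, 41), 2)
rows = sweep_thresholds(y_valid, proba_valid, grid)
best_macro = max(rows, key=lambda r: r["f1_macro"])   # criterion of the first row above
best_f1_0 = max(rows, key=lambda r: r["f1_0"])        # criterion of the second row above
print(best_macro["threshold"], best_f1_0["threshold"])
```

Raising the cut-off on the class-1 probability moves more rows to class 0, which is consistent with the F1-0 optimum sitting at a higher threshold than the F1-macro optimum.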

 

  • Used Balanced Random Forest → corrects for the minority-class imbalance
  • Applied threshold tuning → BEST_THRESHOLD = 0.47 sets the cutoff between class 0 and class 1 predictions

Result - public 0.6247689464, private 0.6273062731

 

Code - threshold 0.57 version

Result - public 0.4186046512, private 0.4702842377

 

  • NaN handling
    • Numeric missing values are imputed with SimpleImputer(strategy="median") → resolves the NaN problem even without SMOTE.
  • Balanced handling
    • Used RandomForestClassifier + class_weight="balanced" instead of the previous BalancedRandomForestClassifier
    • Weighting the minority class mitigates the imbalance problem.
  • Threshold tuning
    • Applied the best threshold, BEST_THRESHOLD = 0.47 → improved class 0 performance (F1).
  • Retraining on the full training data
    • For the final submission, the model was retrained on the full X, y data rather than the train/valid split → improved generalization.

Result - public 0.64604811, private 0.64604811
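The four steps listed above (median imputation, class_weight="balanced", a fixed threshold, and refitting on all labelled data) can be combined as in the sketch below. It is not the team's submission code: the function name, the synthetic data, and `n_estimators=100` are assumptions chosen to keep the example small, and the threshold is assumed to apply to the class-1 probability.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

BEST_THRESHOLD = 0.47  # value reported in the threshold test

def fit_and_predict(X, y, X_test, threshold=BEST_THRESHOLD):
    """Impute, fit a class-weighted forest on all labelled rows, then apply the cut-off."""
    model = make_pipeline(
        SimpleImputer(strategy="median"),       # fills NaN with each column's median
        RandomForestClassifier(
            n_estimators=100,                   # assumption: kept small for the example
            class_weight="balanced",            # reweights classes inversely to their frequency
            random_state=42,
        ),
    )
    model.fit(X, y)                             # no train/valid split: all labelled rows are used
    proba_1 = model.predict_proba(X_test)[:, 1]
    return (proba_1 >= threshold).astype(int)

# Synthetic data with some missing values (not the project data).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] > 0).astype(int)
X[::7, 1] = np.nan
X_test = rng.normal(size=(10, 4))
X_test[0, 2] = np.nan

preds = fit_and_predict(X, y, X_test)
print(preds)
```

Putting the imputer inside the pipeline means the medians learned from the training rows are reused for the test rows.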

 

  • Increased the number of class 0 samples with SMOTE to improve recall.
  • Lowered the threshold from 0.47 to 0.4 so that more class 0 cases are caught.
  • Kept BalancedRandomForest + class_weight="balanced".

⇒ Result: public 0.64716

 

Threshold adjustment (chosen by accuracy): public score 0.6483704974, private score 0.6471600688

 

 

We extracted only the high-importance features, built a new dataset from them, and retrained, together with threshold adjustment.

top 20: public score 0.6483704974, private score 0.6459412781

top 30: public score 0.6483704974, private score 0.64604811 → the highest among those tried
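Selecting the top-k features by importance, as in the top 20 and top 30 runs above, might look like the following sketch. It assumes the ranking comes from the random forest's impurity-based `feature_importances_`, which the report does not confirm. The function name and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def top_k_features(X: pd.DataFrame, y, k: int, seed: int = 42) -> list:
    """Rank columns by random-forest importance and return the k highest."""
    rf = RandomForestClassifier(n_estimators=200, random_state=seed).fit(X, y)
    importances = pd.Series(rf.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False).head(k).index.tolist()

# Synthetic data in which only f0 and f1 carry signal (not the project data).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 6)), columns=[f"f{i}" for i in range(6)])
y = (X["f0"] + 0.5 * X["f1"] > 0).astype(int)

selected = top_k_features(X, y, k=3)
X_reduced = X[selected]   # dataset rebuilt from the selected columns only
print(selected)
```

The reduced dataset would then be used to retrain the model before the threshold is adjusted again.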

 

Tuned hyperparameter values

from sklearn.ensemble import RandomForestClassifier

# Random forest configured with the tuned hyperparameter values.
model = RandomForestClassifier(
    n_estimators=800,
    max_depth=30,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features="sqrt",
    bootstrap=True,
    random_state=42,
    class_weight="balanced",   # keep if tuning was done with balanced; otherwise remove
    n_jobs=-1
)

 

Final algorithm based on the random forest model

DACON leaderboard submission result

0.6483704974, 0.6471600688