โ— ๋ถ„๋ฅ˜๊ธฐ์˜ ํŠœ๋‹


> Introduction

1. This was a huge struggle.

2. To run parallel computation on Windows, the main loop has to be wrapped in "if __name__ == '__main__'". The error message told me to look at the documentation for sklearn's joblib module, but there was nothing much there. Now even the error messages are bluffing me -_- Anyway, once I wrapped the code, that error message went away.

3. I'm told it just works on a Mac.


> Execution

1. I searched for the optimal parameters of MultinomialNB() and SGDClassifier().

2. This was a huge struggle too.

3. The key is to prepare the pipeline and the parameters up front as a set and a dictionary.

4. Choose which classifier to tune, either MultinomialNB() or SGDClassifier(). (It turns out you can also cross-validate with KFold.)

5. Each classifier takes different parameters, so the arguments you feed into the optimization have to differ as well.

6. At line 664 of naive_bayes.py, in np.log(smoothed_fc), zeros end up in smoothed_fc and the error keeps coming up.

I dug through stackoverflow but found nothing that clearly fit, so as a stopgap I changed it to np.log(smoothed_fc +0.000000001).

Patching it without properly understanding what smoothed_fc is leaves me deeply uneasy. It feels like having soiled my underwear and then sitting in it for two hours.

7. Anyway, I found the optimal parameters for each classifier. The results look roughly like the output below. I don't know what weighted is, but the warning appears because I'm calling something that is deprecated. After running into so many errors earlier, a mere warning doesn't even bother me anymore.

Fitting 3 folds for each of 288 candidates, totalling 864 fits

[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   11.8s

[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  1.0min

[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  3.0min

[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  6.9min

[Parallel(n_jobs=-1)]: Done 864 out of 864 | elapsed:  8.2min finished

Best score: 0.7813333333333333

Best parameter set:

        clf__alpha: 1.0

        vect__max_features: None

        vect__ngram_range: (1, 2)

        vect__norm: None

        vect__smooth_idf: True

        vect__sublinear_tf: True

        vect__use_idf: True

Accurary: 0.806

C:\Users\Alice\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1203: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".

  sample_weight=sample_weight)

Precision: 0.8136573785950023

C:\Users\Alice\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1304: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".

  sample_weight=sample_weight)

Recall: 0.806
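
The workflow in items 3 to 7 can be sketched roughly as follows. This is a reconstruction under assumptions, not the post's actual script: the step names vect and clf and the parameter names come from the log above, but the toy corpus, the candidate values, and the helper names make_search and evaluate are hypothetical, and the imports assume a recent scikit-learn where GridSearchCV lives in sklearn.model_selection. Passing average="weighted" explicitly is what the DeprecationWarning asks for.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; the dataset used in the post is not shown.
TOY_DOCS = [
    "cheap pills buy now", "win money now", "buy cheap watches", "free money offer",
    "meeting at noon", "project status report", "lunch with the team", "report due friday",
]
TOY_LABELS = ["spam"] * 4 + ["ham"] * 4

# Vectorizer settings shared by both searches (illustrative values only).
VECT_PARAMS = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "vect__use_idf": [True, False],
}

# Each classifier exposes different parameters, so each gets its own grid.
CLF_CHOICES = {
    "nb": (MultinomialNB(), {"clf__alpha": [0.1, 0.5, 1.0]}),
    "sgd": (SGDClassifier(random_state=0),
            {"clf__alpha": [1e-4, 1e-3], "clf__penalty": ["l2", "l1"]}),
}


def make_search(clf_name, cv=3, n_jobs=1):
    """Build a grid search over a vectorizer + classifier pipeline."""
    clf, clf_params = CLF_CHOICES[clf_name]
    pipeline = Pipeline([("vect", TfidfVectorizer()), ("clf", clf)])
    parameters = {**VECT_PARAMS, **clf_params}
    return GridSearchCV(pipeline, parameters, cv=cv, n_jobs=n_jobs)


def evaluate(search, docs, labels):
    """Score a fitted search, naming the averaging mode explicitly."""
    predicted = search.predict(docs)
    return {
        "accuracy": accuracy_score(labels, predicted),
        "precision": precision_score(labels, predicted, average="weighted"),
        "recall": recall_score(labels, predicted, average="weighted"),
    }


if __name__ == "__main__":
    search = make_search("nb", cv=2)
    search.fit(TOY_DOCS, TOY_LABELS)
    print("Best score:", search.best_score_)
    print("Best parameter set:")
    for name, value in sorted(search.best_params_.items()):
        print(f"\t{name}: {value!r}")
    # Scored on the training docs only because the toy corpus has no held-out split.
    for name, value in evaluate(search, TOY_DOCS, TOY_LABELS).items():
        print(f"{name}: {value}")
```

Weighted recall weights each class's recall by its share of the samples, which works out to plain accuracy; that is why Recall and Accuracy are both 0.806 in the log. To use KFold as item 4 mentions, an instance such as KFold(n_splits=3, shuffle=True, random_state=0) from sklearn.model_selection can be passed as cv. After fit, search.best_estimator_ is the pipeline refit with the best parameters.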

์œ„์—์„œ ๊ตฌํ•œ ์ตœ์ ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ด์šฉํ•ด ์•ž์œผ๋กœ ๋ถ„๋ฅ˜๊ธฐ๋ฅผ ์‚ฌ์šฉํ• ๋•Œ ๊ทธ์™€๊ฐ™์ด ์„ธํŒ…ํ•ด์ฃผ๋ฉด ๋˜๋Š”๊ฑฐ๋‹ค. ๋ญ ๊ทน์ ์œผ๋กœ ์ •ํ™•๋„๊ฐ€ ์˜ฌ๋ผ๊ฐ€๊ณ  ๊ทธ๋Ÿฐ๊ฑด ์ž˜ ๋ชจ๋ฅด๊ฒ ๋‹ค. ์ข€ ๋” (๋งŽ์€) ๊ณต๋ถ€๊ฐ€ ํ•„์š”ํ•ด๋ณด์ธ๋‹ค.



> Reflection

1. Something I've been feeling lately: I wonder whether it's right to just dig in from the ground up like this, without any background in data analysis, data science, or artificial intelligence. Honestly, it isn't hard. And it's honest work: the deeper I dig, the more I understand. For now, grabbing the 'rope' the instructor has lowered and following along diligently seems efficient and good. When I next have some spare time, I should read the relevant books and build up the fundamentals.


2. Even though I work at it like this whenever I have time, I don't think I'm keeping up with even half of what the other people in the class manage. The questions I ask in class are embarrassing even to me once I hear the instructor's answer, while with other people's questions I can't even tell what is being asked, so I have nothing to say -_-


3. Every time that happens, I fall back on the 'cowardly' excuse that I'm only learning this for fun, so it's fine.

4. But how many people in the world learn text mining out of desperate need? Those people are probably learning it for fun too.

5. Still, the comforting part is the fact that 'whenever I learn something for the first time, I have always been several times worse than everyone else.' This time is simply no different.

6. Why do I keep tearing up T_T
