
[NLP] Classifying Restaurant Review Sentiment (1) (feat. Natural Language Processing with PyTorch)

๊ฐ์ž ๐Ÿฅ” 2021. 7. 20. 18:50
๋ฐ˜์‘ํ˜•

-- ๋ณธ ํฌ์ŠคํŒ…์€ ํŒŒ์ดํ† ์น˜๋กœ ๋ฐฐ์šฐ๋Š” ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ (ํ•œ๋น›๋ฏธ๋””์–ด) ์ฑ…์„ ์ฐธ๊ณ ํ•ด์„œ ์ž‘์„ฑ๋œ ๊ธ€์ž…๋‹ˆ๋‹ค.
-- ์†Œ์Šค์ฝ”๋“œ ) https://github.com/rickiepark/nlp-with-pytorch

 


 

โ–ถ Classifying Restaurant Review Sentiment

์ด์ „ ํฌ์ŠคํŒ…์—์„œ ๋ฐฐ์šด ํผ์…‰ํŠธ๋ก ๊ณผ ์ง€๋„ ํ•™์Šต ํ›ˆ๋ จ ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ์˜ํ”„(Yelp)์˜ ๋ ˆ์Šคํ† ๋ž‘ ๋ฆฌ๋ทฐ๊ฐ€ ๊ธ์ •์ ์ธ์ง€๋ถ€์ •์ ์ธ์ง€ ๋ถ„๋ฅ˜ํ•˜๋Š” ์ž‘์—…์„ ์ง„ํ–‰ํ•ด ๋ณด์ž. ํ•ด๋‹น ํ”„๋กœ์ ํŠธ๋Š” ๋ฆฌ๋ทฐ์™€ ๊ฐ์„ฑ๋ ˆ์ด๋ธ”(๊ธ์ •or๋ถ€์ •)์ด ์Œ์„ ์ด๋ฃจ๋Š” ์˜ํ”„ ๋ฐ์ดํ„ฐ์…‹์„ ์‚ฌ์šฉํ•œ๋‹ค. ๋ฐ์ดํ„ฐ๋ฅผ ์ •์ œํ•˜๊ณ , ํ›ˆ๋ จ, ๊ฒ€์ฆ, ํ…Œ์ŠคํŠธ์„ธํŠธ๋กœ ๋‚˜๋ˆ„๋Š” ์ „์ฒ˜๋ฆฌ ๋‹จ๊ณ„๋ฐ ๋ช‡ ๊ฐ€์ง€๋ฅผ ์ถ”๊ฐ€๋กœ ์„ค๋ช…ํ•˜๋ฉด์„œ ํ”„๋กœ์ ํŠธ๋ฅผ ์ง„ํ–‰ํ•ด๋ณด์ž.

Briefly, the three helper classes we will use throughout:
Vocabulary = performs the integer-to-token mapping described in the section on encoding samples and targets.
Vectorizer = encapsulates the vocabulary and converts string data, such as review text, into numerical vectors used during training.
DataLoader = groups individual vectorized data points into minibatches.
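To make the Vocabulary's role concrete, here is a rough, illustrative sketch of the token-to-integer mapping it performs (the method names mirror the book's class, but this is a simplification; the real class also handles an unknown token, serialization, and more):

```python
# A simplified sketch of the Vocabulary idea: a bidirectional
# mapping between tokens and integer indices.
class Vocabulary:
    def __init__(self):
        self._token_to_idx = {}
        self._idx_to_token = {}

    def add_token(self, token):
        # Assign the next free index to an unseen token.
        if token not in self._token_to_idx:
            idx = len(self._token_to_idx)
            self._token_to_idx[token] = idx
            self._idx_to_token[idx] = token
        return self._token_to_idx[token]

    def lookup_token(self, token):
        return self._token_to_idx[token]

    def lookup_index(self, idx):
        return self._idx_to_token[idx]


vocab = Vocabulary()
vocab.add_token("great")
vocab.add_token("food")
print(vocab.lookup_token("food"))  # 1
print(vocab.lookup_index(0))       # great
```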

 

1. ๋ฐ์ดํ„ฐ์…‹ (๊ฐ„๋‹จํ•œ ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ ๊ณผ์ •)

์•ž์„œ ์–ธ๊ธ‰ํ–ˆ๋“ฏ์ด, ์˜ํ”„ ๋ฐ์ดํ„ฐ์…‹์„ ํ™œ์šฉํ•œ๋‹ค. ์˜ํ”„ ๋ฐ์ดํ„ฐ์…‹์€ ํ›ˆ๋ จ ์ƒ˜ํ”Œ 560000๊ฐœ์™€ ํ…Œ์ŠคํŠธ์ƒ˜ํ”Œ 38000๊ฐœ๋กœ ๋‚˜๋ˆ„์–ด์ ธ ์žˆ์ง€๋งŒ, ํ•ด๋‹น ํ”„๋กœ์ ํŠธ์—์„œ๋Š” ๋ฐ์ดํ„ฐ์…‹์˜ ํ›ˆ๋ จ ์ƒ˜ํ”Œ์˜ 10%๋งŒ ์‚ฌ์šฉํ•  ๊ฒƒ์ด๋‹ค. (ํ•ด๋‹น ์ฑ…์—์„œ๋Š” ์ด ๋ฐ์ดํ„ฐ์…‹์„ '๋ผ์ดํŠธ'๋ฒ„์ „ ์ด๋ผ๊ณ  ํ‘œํ˜„ํ–ˆ๋‹ค.) ์ด๋ ‡๊ฒŒ ์ž‘์€ ๋ฐ์ดํ„ฐ์…‹์„ ์‚ฌ์šฉํ•˜๋ฉด ํ›ˆ๋ จ๊ณผ ํ…Œ์ŠคํŠธ๊ฐ€ ๋นจ๋ผ์„œ ์ด์šฉํ•˜๋Š” ๊ฒƒ๋„ ์žˆ์ง€๋งŒ, ์ „์ฒด ๋ฐ์ดํ„ฐ์…‹์„ ์‚ฌ์šฉํ• ๋•Œ๋ณด๋‹ค ๋‚ฎ์€ ์ •ํ™•๋„๋ฅผ ๊ฐ€์ง„๋‹ค๋Š” ์ ์„ ๊ผญ ๊ธฐ์–ตํ•˜์ž. 

๋ฐ์ดํ„ฐ์…‹์„ ํ›ˆ๋ จ, ๊ฒ€์ฆ, ํ…Œ์ŠคํŠธ์šฉ์œผ๋กœ ๋‚˜๋ˆŒ ๊ฒƒ์ด๋‹ค.
ํ›ˆ๋ จ์„ธํŠธ๋กœ ๋ชจ๋ธ์„ ํ›ˆ๋ จํ•˜๊ณ , ๊ฒ€์ฆ ์„ธํŠธ๋กœ ๋ชจ๋ธ์ด ์–ผ๋งˆ๋‚˜ ์ž˜ ์ž‘๋™ํ•˜๋Š”์ง€ ํ‰๊ฐ€ํ•œ๋‹ค. ๊ฒ€์ฆ ์„ธํŠธ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋ชจ๋ธ์„ ์„ ํƒํ•˜๊ฒŒ ๋˜๋ฉด ๋ถ€๊ฐ€ํ”ผํ•˜๊ฒŒ ๋ชจ๋ธ์ด ๊ฒ€์ฆ์„ธํŠธ์— ๋” ์ž˜ ์ˆ˜ํ–‰๋˜๋„๋ก ํŽธํ–ฅ๋˜๊ธฐ ๋•Œ๋ฌธ์— ๋ชจ๋ธ์ด ์ ์ฐจ ๊ฐœ์„ ๋˜๋Š”์ง€ ์žฌํ‰๊ฐ€ํ•ด๋ณด๊ธฐ์œ„ํ•ด ์„ธ๋ฒˆ์งธ ์„ธํŠธ์ธ ํ‰๊ฐ€์„ธํŠธ๋ฅผ ํ™œ์šฉํ•ด์„œ ์ด ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•ด๋ณด๋„๋ก ํ•œ๋‹ค.

์šฐ์„ , ์ด ์ฑ…์˜ ๊นƒํ—ˆ๋ธŒ์—์„œ ์ œ๊ณตํ•˜๋Š” ์›๋ณธ์ฝ”๋“œ์—๋Š” '์ „์ฒ˜๋ฆฌ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค์šด๋ฐ›๋Š” ์ฝ”๋“œ'๊ฐ€ ๋‚ด์žฅ๋˜์–ด ์žˆ์–ด ๋”ฐ๋กœ ์ „์ฒ˜๋ฆฌ ๊ณผ์ • ์—†์ด ์ฝ”๋“œ ์‹คํ–‰์ด ๊ฐ€๋Šฅํ•˜์ง€๋งŒ, ์–ด๋–ป๊ฒŒ ๋ฐ์ดํ„ฐ๊ฐ€ ์ „์ฒ˜๋ฆฌ ๋˜์—ˆ๋Š”์ง€ ์‚ดํŽด๋ณด๋„๋ก ํ•˜์ž.

1.1 Imports

import collections
import numpy as np
import pandas as pd
import re

from argparse import Namespace

 

1.2 Splitting into training, validation, and test sets

args = Namespace(
    raw_train_dataset_csv="data/yelp/raw_train.csv",
    raw_test_dataset_csv="data/yelp/raw_test.csv",
    proportion_subset_of_train=0.1,
    train_proportion=0.7,
    val_proportion=0.15,
    test_proportion=0.15,
    output_munged_csv="data/yelp/reviews_with_splits_lite.csv",
    seed=1337
)
  • ์ธ์ž์™€ ๋ถ€์†๋ช…๋ น์„ ์œ„ํ•œ ๋ช…๋ นํ–‰ ์˜ต์…˜์ธ, argparse ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ํŒŒ์ด์ฌ ๋‚ด๋ถ€์—์„œ ํŒŒ์ผ์„ ๋‹ค์šด๋ฐ›๊ณ , ๋ฐ์ดํ„ฐ๋ฅผ ๋‚˜๋ˆ„์–ด์ค„ ์ˆ˜ ์žˆ๋‹ค. 
  • train 70% / validation 15% / test 15% ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„ํ• ํ•  ์˜ˆ์ •์ด๋‹ค.
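As an aside, nothing is actually parsed from the command line here; Namespace is just a lightweight attribute container. A minimal illustration:

```python
from argparse import Namespace

# Keyword arguments become dot-accessible attributes, so settings
# read like args.train_proportion instead of dict lookups.
args = Namespace(train_proportion=0.7, val_proportion=0.15, test_proportion=0.15)

total = args.train_proportion + args.val_proportion + args.test_proportion
print(abs(total - 1.0) < 1e-9)  # True: the three splits cover everything
```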
# Read the raw data
train_reviews = pd.read_csv(args.raw_train_dataset_csv, header=None, names=['rating', 'review'])
  • ๋‹ค์šด๋ฐ›์€ ์›๋ณธ๋ฐ์ดํ„ฐ๋ฅผ ํŒŒ์ด์ฌ์—์„œ ํ™œ์šฉํ•˜๊ธฐ ์œ„ํ•ด train_reviews ๋ณ€์ˆ˜์— ๋ฐ์ดํ„ฐ๋ฅผ ์ฝ์–ด์ฃผ๊ณ , ์ปฌ๋Ÿผ๋ช…์„ rating๊ณผ review๋กœ ์ง€์ •ํ•ด์ฃผ์—ˆ๋‹ค.
# ๋ฆฌ๋ทฐ ํด๋ž˜์Šค ๋น„์œจ์ด ๋™์ผํ•˜๋„๋ก ๋งŒ๋“ญ๋‹ˆ๋‹ค
by_rating = collections.defaultdict(list)
for _, row in train_reviews.iterrows():
    by_rating[row.rating].append(row.to_dict())
    
review_subset = []

for _, item_list in sorted(by_rating.items()):

    n_total = len(item_list)
    n_subset = int(args.proportion_subset_of_train * n_total)
    review_subset.extend(item_list[:n_subset])

review_subset = pd.DataFrame(review_subset)
  • ์•ž์„œ ์–ธ๊ธ‰ํ–ˆ๋“ฏ์ด ํ•ด๋‹น ์ฑ…์—์„œ ์‚ฌ์šฉ๋  ๋ผ์ดํŠธ๋ฒ„์ „์˜ ๋ฐ์ดํ„ฐ์…‹์€ ์ „์ฒด ๋ฐ์ดํ„ฐ์…‹์˜ 10%๊ฐ€ ์‚ฌ์šฉ๋œ๋‹ค.
  • ๋”ฐ๋ผ์„œ review_subset์— ์กด์žฌํ•˜๋Š” ๋ฐ์ดํ„ฐ์—์„œ 10%์— ํ•ด๋‹นํ•˜๋Š” ๋ฐ์ดํ„ฐ๋งŒ ๋”ฐ๋กœ ์ €์žฅํ•ด์ค€๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  defaultdic๋ฉ”์„œ๋“œ๋ฅผ ํ™œ์šฉํ•ด์„œ ํด๋ž˜์Šค๋ณ„๋กœ ๋น„์œจ์ด ๋™์ผํ•˜๋„๋ก ๋งŒ๋“ค์–ด์ค„ ๊ฒƒ์ด๋‹ค.
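The grouping step relies on a small stdlib convenience; a toy version of the same pattern (with made-up rows):

```python
import collections

# defaultdict(list) creates an empty list for a missing key on first
# access, so rows can be grouped by label without key-existence checks.
by_rating = collections.defaultdict(list)
rows = [{'rating': 1, 'review': 'terrible'},
        {'rating': 2, 'review': 'great'},
        {'rating': 1, 'review': 'slow service'}]
for row in rows:
    by_rating[row['rating']].append(row)

counts = {rating: len(items) for rating, items in by_rating.items()}
print(counts)  # {1: 2, 2: 1}
```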
review_subset.head()

  • The resulting data looks like this:
train_reviews.rating.value_counts()

  • ์•ž์—์„œ ํด๋ž˜์Šค๋ณ„๋กœ ๋น„์œจ์ด ๋„์ž‰ใ„นํ•˜๋„๋ก ๋งŒ๋“ค์–ด ์ค€ ๊ฒฐ๊ณผ๊ฐ’์ด๋‹ค. (ํด๋ž˜์Šค๋ณ„๋กœ ๊ฐ๊ฐ 280000๊ฐœ)
# Unique classes
set(review_subset.rating)

  • We can see there are two classes, 1 and 2.
  • (As will come up later,) 1 is the negative class and 2 is the positive class.
# Split by rating to create the train, validation, and test sets
by_rating = collections.defaultdict(list)
for _, row in review_subset.iterrows():
    by_rating[row.rating].append(row.to_dict())

# Create the split data
final_list = []
np.random.seed(args.seed)

for _, item_list in sorted(by_rating.items()):

    np.random.shuffle(item_list)
    
    n_total = len(item_list)
    n_train = int(args.train_proportion * n_total)
    n_val = int(args.val_proportion * n_total)
    n_test = int(args.test_proportion * n_total)
    
    # ๋ฐ์ดํ„ฐ ํฌ์ธํ„ฐ์— ๋ถ„ํ•  ์†์„ฑ์„ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค
    for item in item_list[:n_train]:
        item['split'] = 'train'
    
    for item in item_list[n_train:n_train+n_val]:
        item['split'] = 'val'
        
    for item in item_list[n_train+n_val:n_train+n_val+n_test]:
        item['split'] = 'test'

    # Add to the final list
    final_list.extend(item_list)
  • ์•ž์„œ ๋‚˜์™”๋˜ args์—์„œ ์ง€์ •ํ•œ 7:1.5:1.5 ๋น„์œจ๋กœ ๊ฐ๊ฐ ๋ฐ์ดํ„ฐ๋ฅผ train / val / test ๋กœ ์ง€์ •ํ–ˆ๊ณ , ์ตœ์ข… ๋ฆฌ์ŠคํŠธ์— ์ถ”๊ฐ€ํ•ด์ฃผ์—ˆ๋‹ค.
# ๋ถ„ํ•  ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์œผ๋กœ ๋งŒ๋“ญ๋‹ˆ๋‹ค
final_reviews = pd.DataFrame(final_list)
  • Convert the list of dicts into a pandas DataFrame, which is easier to analyze.
final_reviews.split.value_counts()

  • ๋ฐ์ดํ„ฐ์˜ ์ˆ˜๋ฅผ ํ™•์ธํ•จ์œผ๋กœ์จ 7: 1.5 : 1.5๋กœ ์ž˜ ๋‚˜๋‰˜์—ˆ๋Š”์ง€ ํ™•์ธํ•ด๋ณธ๋‹ค.

 

1.3 ๋ฐ์ดํ„ฐ ์ •์ œ

# Preprocess the reviews
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r"([.,!?])", r" \1 ", text)
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
    return text
    
final_reviews.review = final_reviews.review.apply(preprocess_text)
  • ์ตœ์†Œํ•œ์˜ ๋ฐ์ดํ„ฐ ์ •์ œ์ž‘์—…์„ ๊ฑฐ์นœ๋‹ค.
  • ์ •๊ทœ์‹์„ ํ™œ์šฉํ•˜์—ฌ ๊ธฐํ˜ธ ์•ž๋’ค์— ๊ณต๋ฐฑ์„ ๋„ฃ๊ณ , ๊ตฌ๋‘์ ์ด ์•„๋‹Œ ๊ธฐํ˜ธ๋ฅผ ์ œ๊ฑฐํ•˜๋Š” ์ •์ œ์ž‘์—…์„ ์ง„ํ–‰ํ•ด์ฃผ์—ˆ๋‹ค.
final_reviews['rating'] = final_reviews.rating.apply({1: 'negative', 2: 'positive'}.get)
final_reviews.head()

  • The numeric ratings are mapped to the string labels negative and positive; this cleaned data is what the rest of the project will use.
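A side note on the mapping line: `{1: 'negative', 2: 'positive'}.get` is a bound method, i.e. a plain callable, which is why it can be handed to pandas' apply() directly. A plain-Python illustration (no pandas needed):

```python
# dict.get works as a lookup function: called with a rating, it
# returns the matching label (or None for an unknown rating).
to_label = {1: 'negative', 2: 'positive'}.get
print([to_label(r) for r in [1, 2, 2, 1]])
# ['negative', 'positive', 'positive', 'negative']
```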
final_reviews.to_csv(args.output_munged_csv, index=False)

 

 

<NEXT> A look at the classes for data processing

https://didu-story.tistory.com/86

 


 

๋ฐ˜์‘ํ˜•