[NLP] Implementing a Translator with seq2seq (feat. 딥러닝을 이용한 자연어 처리 입문)

감자 🥔 2021. 8. 6. 08:38

-- This post was written with reference to the books 파이토치로 배우는 자연어 처리 (Natural Language Processing with PyTorch, 한빛미디어) and 딥러닝을 이용한 자연어 처리 입문 (Introduction to Natural Language Processing Using Deep Learning, 위키독스).

 

1. Sequence-to-Sequence (seq2seq)

  • A model commonly used for machine translation
  • A kind of encoder-decoder model, made up of an encoder and a decoder
  • Also a kind of conditioned generation model
    • What is a conditioned generation model? A model in which the decoder produces its output from a general conditioning context rather than directly from an input representation

1.1 Structure of the seq2seq model

(Figure source: https://wikidocs.net/24996)

  1. The encoder receives every word of the input sentence in order and, at the end, compresses the information from all the words into a single vector (that is, it produces the context vector)
  2. Once the whole sentence has been compressed into the context vector, that vector is sent to the decoder
  3. The decoder receives the context vector and outputs the translated words one at a time, in order

(Figure source: https://wikidocs.net/24996)

  • Looking more closely, the encoder and the decoder are each built from an RNN architecture.
    • Because of the performance problems of the vanilla RNN, its more advanced forms, LSTM and GRU cells, are usually used instead.
  • Machines handle numbers better than words, so each word goes through an embedding step before it enters a cell.
  • A single RNN (LSTM) cell takes the hidden state from time step t-1 and the input vector at time step t, and produces the hidden-state vector for time step t. (See the previous post.)
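The recurrence in the last bullet can be sketched in a few lines of plain Python. This is only a minimal illustration of a vanilla RNN step, h_t = tanh(W_h·h_{t-1} + W_x·x_t + b), not the LSTM used later in the post; the sizes and weight values are made-up assumptions.

```python
import math

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One vanilla RNN step: h_t = tanh(W_h @ h_prev + W_x @ x_t + b)."""
    h_t = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        total += sum(W_x[i][k] * x_t[k] for k in range(len(x_t)))
        h_t.append(math.tanh(total))
    return h_t

# Hypothetical toy sizes: hidden size 2, input size 3 (arbitrary weights).
W_h = [[0.5, -0.1], [0.2, 0.3]]
W_x = [[0.1, 0.0, -0.2], [0.0, 0.4, 0.1]]
b = [0.0, 0.0]

h = [0.0, 0.0]                               # initial hidden state
inputs = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # three one-hot input vectors
for x_t in inputs:
    h = rnn_step(h, x_t, W_h, W_x, b)        # hidden state carries over to the next step

print(h)  # hidden state at the last time step
```

In an encoder, the value of `h` after the loop is what plays the role of the context vector.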

▶ Encoder

  • The input sentence is split into words by word tokenization, and every word is embedded.
  • Each word token then becomes the input to the RNN cell at its own time step.
  • The hidden state of the encoder RNN at the last time step is used as the context vector and handed to the decoder.

▶ Decoder

  • The first input is <sos>, the symbol that marks the start of a sentence
  • Given <sos>, the decoder predicts the word most likely to come next
  • In this example, the word predicted at the first time step is Je
  • The decoder keeps predicting the next word and feeding that predicted word into the RNN cell at the next time step
  • This repeats until <eos>, the end-of-sentence symbol, is predicted.
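The decoder loop described above can be sketched with a toy stand-in for the trained model. The lookup table `NEXT_TOKEN` is purely hypothetical: a real decoder also conditions on its hidden state, whereas this sketch only looks at the previous token, which is enough to show the feed-the-prediction-back-in loop.

```python
# Hypothetical stand-in for a trained decoder:
# maps the previous token to the "most likely" next token.
NEXT_TOKEN = {
    "<sos>": "Je",
    "Je": "suis",
    "suis": "étudiant",
    "étudiant": "<eos>",
}

def greedy_decode(next_token, max_len=10):
    """Start from <sos> and feed each prediction back in until <eos>."""
    token = "<sos>"
    output = []
    while len(output) < max_len:       # safety limit in case <eos> never appears
        token = next_token[token]      # predict the next word from the previous one
        if token == "<eos>":
            break
        output.append(token)
    return output

translation = greedy_decode(NEXT_TOKEN)
print(translation)
```

The `max_len` guard mirrors the length check that the real decoding function later in this post uses to avoid looping forever.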

 

2. Implementing a machine translator with seq2seq (reference)

The code here was written with reference to https://wikidocs.net/24996. We start with a character-level translator, meaning that the token unit is a single character (a letter of the alphabet) rather than a word.

2.1 Dataset

▶ Data source

Training a machine translation model requires a parallel corpus as training data. We use fra-eng.zip, a French-English parallel corpus downloaded from http://www.manythings.org/anki. Unzip it and use the file fra.txt for this exercise.

▶ Parallel corpus

A parallel corpus is slightly different from a tagging task. In tagging, the two sequences in every pair have the same length, but in parallel data they need not. For example, '나는 학생이다', a sentence of two tokens, translates to 'I am a student', a sentence of four tokens. seq2seq is therefore built on the assumption that the input sequence and the output sequence can differ in length, and the exercise proceeds on that basis.

▶ Structure of fra.txt

Watch me.           Regardez-moi !

One sample consists of an English sentence on the left and a French sentence on the right, separated by a tab.
The file contains roughly 160,000 to 190,000 such parallel sentence samples depending on the version downloaded (the copy loaded below has about 190,000).
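As a quick illustration of that layout, the sketch below splits one raw line on tabs. The third field is a made-up placeholder standing in for the extra column that the loading code later names lic and deletes; its content here is an assumption, not text from the real file.

```python
# One raw line in the tab-separated layout described above.
# The third field is a hypothetical placeholder for the 'lic' column.
raw_line = "Watch me.\tRegardez-moi !\t(attribution text)\n"

fields = raw_line.rstrip("\n").split("\t")
src, tar = fields[0], fields[1]   # English source, French target

print(repr(src))
print(repr(tar))
```

pandas does the same split for every line when `read_csv` is called with `sep='\t'`.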

 

2.2 import

import pandas as pd
import urllib3
import zipfile
import shutil
import os
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

 

2.3 Loading the data

from google.colab import drive
drive.mount('/content/gdrive/')
PATH = '/content/gdrive/MyDrive/Colab Notebooks/NLPstudy/'
  • Since Google Colab is used, mount Google Drive and store the path in PATH.
lines = pd.read_csv(PATH+'fra.txt', names = ['src', 'tar', 'lic'], sep='\t')
# src is short for source, the input sentence / tar is short for target, the sentence to translate into
del lines['lic']
len(lines) # number of samples, roughly 190,000

  • The input sentence is named src and the sentence to translate into is named tar.
  • There are about 190,000 samples in total.

 

2.4 Data preprocessing

▶ Selecting data

lines = lines.loc[:, 'src':'tar']
lines = lines[0:60000] # keep only the first 60,000 samples
lines.sample(10) # 10 randomly drawn samples

  • Only 60,000 samples are used in this exercise.

▶ Shaping the data

lines.tar = lines.tar.apply(lambda x : '\t '+ x + ' \n')
lines.sample(10)

  • As the seq2seq description above showed, the sentence to translate into needs a start symbol <sos> and an end symbol <eos>.
  • The sentences in the data have no start or end symbols, so we add them ourselves.
  • \t is used as <sos> and \n as <eos>; check that the symbols were added correctly to the tar sentences.

▶ Building the character sets

  • The English side has 79 characters and the French side 105. The code builds the two sets and prints part of each.
# Build the character sets (at the character level, not the word-token level)
src_vocab=set()
for line in lines.src: # read one line at a time
    for char in line: # read one character at a time
        src_vocab.add(char)

tar_vocab=set()
for line in lines.tar:
    for char in line:
        tar_vocab.add(char)
        

src_vocab_size = len(src_vocab)+1
tar_vocab_size = len(tar_vocab)+1
print(src_vocab_size)
print(tar_vocab_size)

src_vocab = sorted(list(src_vocab))
tar_vocab = sorted(list(tar_vocab))
print(src_vocab[45:75]) # print only a portion
print(tar_vocab[45:75])

  • Assign an index to each character and store the mapping in a dictionary.
# Assign an index to each character and represent the result as a dict
src_to_index = dict([(word, i+1) for i, word in enumerate(src_vocab)])
tar_to_index = dict([(word, i+1) for i, word in enumerate(tar_vocab)])
print(src_to_index)
print(tar_to_index)
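To make the result concrete, here is a toy run of the same steps on two made-up sentences (the sentences and variable names are assumptions for illustration only). Indices start at 1, which leaves index 0 free for the padding value added later, and that is also why the vocabulary size is the number of characters plus 1.

```python
# Two hypothetical sentences standing in for the src column.
sentences = ["Go.", "Hi."]

# Collect every distinct character, then sort for a stable order.
vocab = sorted(set(ch for s in sentences for ch in s))

# Index 0 is deliberately left unused so it can serve as the padding value.
char_to_index = {ch: i + 1 for i, ch in enumerate(vocab)}
vocab_size = len(vocab) + 1

print(vocab)
print(char_to_index)
print(vocab_size)
```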

 

▶ Integer encoding

# Integer-encode the training data using the indexed character sets
# Encode the English sentences that will be the encoder input
encoder_input = []
for line in lines.src: # read one sentence at a time from the input data
    temp_X = []
    for w in line: # read one character at a time from each line
      temp_X.append(src_to_index[w]) # convert the character to its integer index
    encoder_input.append(temp_X)
# print only 5 examples
print(encoder_input[:5])

# Integer-encode the French data that will be the decoder input
decoder_input = []
for line in lines.tar:
    temp_X = []
    for w in line:
      temp_X.append(tar_to_index[w])
    decoder_input.append(temp_X)
# print only 5 examples
print(decoder_input[:5])

# Ground-truth values are needed to compare against the decoder's predictions
# The ground truth does not need the start symbol <sos>
# so remove the start symbol \t
decoder_target = []
for line in lines.tar:
    t=0 # skip the first character (t == 0) and append everything after it to temp_X
    temp_X = []
    for w in line:
      if t>0:
        temp_X.append(tar_to_index[w])
      t=t+1
    decoder_target.append(temp_X)
print(decoder_target[:5])

  • In decoder_input, every integer-encoded sequence starts with 1 (because of the start symbol).
  • In decoder_target the leading 1 is gone, which shows that the start symbol was removed correctly.
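The relationship between the two sequences is easy to see on a single toy sentence. The sketch below uses a made-up target string and builds its own small index, so the concrete numbers are assumptions; the slice `[1:]` does the same job as the counter loop above.

```python
# A hypothetical target sentence, already wrapped in '\t ' and ' \n'.
target_sentence = "\t Va ! \n"

# Build a small index just for this example (index 0 is reserved for padding).
vocab = sorted(set(target_sentence))
tar_to_index = {ch: i + 1 for i, ch in enumerate(vocab)}

decoder_input = [tar_to_index[ch] for ch in target_sentence]
decoder_target = decoder_input[1:]   # same sequence without the start symbol

print(decoder_input)
print(decoder_target)
```

Because '\t' sorts before every other character, it always receives index 1, which is why each decoder_input sequence begins with 1.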

 

▶ Padding

# Find the longest English and French sentences (length in characters)
max_src_len = max([len(line) for line in lines.src])
max_tar_len = max([len(line) for line in lines.tar])
print(max_src_len)
print(max_tar_len)

  • English is padded to length 24 and French to length 76
  • In this parallel data, even the two sentences of one pair can differ in length (as noted at the start),
  • so the English and French data do not need to be padded to the same length
# Apply padding
encoder_input = pad_sequences(encoder_input, maxlen=max_src_len, padding='post')
decoder_input = pad_sequences(decoder_input, maxlen=max_tar_len, padding='post')
decoder_target = pad_sequences(decoder_target, maxlen=max_tar_len, padding='post')
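What padding='post' does can be sketched in plain Python. `pad_post` is a hypothetical helper written for illustration, not part of Keras, and it assumes no sequence is longer than `maxlen` (the real pad_sequences would also truncate longer ones).

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` up to `maxlen`.

    Assumes every sequence is at most `maxlen` long.
    """
    return [seq + [value] * (maxlen - len(seq)) for seq in sequences]

padded = pad_post([[2, 5, 1], [3, 4]], maxlen=4)
print(padded)
```

The zeros go at the end of each sequence, so the real characters stay aligned with the first time steps.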

 

▶ One-hot encoding

# This is a character-level translator, so no separate word embedding is used
# One-hot vectors are used for the inputs as well as for the ground-truth values used in prediction and loss computation
encoder_input = to_categorical(encoder_input)
decoder_input = to_categorical(decoder_input)
decoder_target = to_categorical(decoder_target)
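As a rough picture of what this step produces, the sketch below one-hot encodes a single padded sequence by hand. `one_hot` is a hypothetical helper, not the Keras function; the point is only that each integer becomes a vector with a single 1, so a 2-D array of indices turns into a 3-D array of shape (samples, time steps, vocabulary size).

```python
def one_hot(sequence, num_classes):
    """Turn a list of integer indices into a list of one-hot vectors."""
    return [[1.0 if i == idx else 0.0 for i in range(num_classes)]
            for idx in sequence]

encoded = one_hot([2, 1, 0], num_classes=3)   # the trailing 0 is a padding index
print(encoded)
```

Note that the padding index 0 also becomes a one-hot vector, with its 1 in position 0.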

 

2.5 Teacher forcing

  • In the seq2seq description above, the input to the decoder cell at the current time step was the decoder's output from the previous time step. So why is decoder_input needed?
  • During training, the ground-truth value of the previous time step is used as the input to the decoder cell at the current time step.
  • The reason: if the decoder cell's prediction at the previous time step is wrong and is fed in as the current input, the current prediction is likely to be wrong as well, and the errors cascade and make the whole decoder hard to train.
  • Feeding the ground truth instead of the previous prediction at every RNN time step in this way is called teacher forcing.
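The cascading effect can be shown with a toy predictor. Everything here is a made-up assumption: `PREDICT` is a lookup table that guesses the next character from the previous one only, with one deliberate mistake ('J' is followed by 'a' instead of 'e').

```python
# Hypothetical, deliberately imperfect predictor: previous char -> next char.
PREDICT = {"\t": "J", "J": "a", "e": "\n", "a": "a"}

def teacher_forced(predict, truth):
    """Input at each step is the ground-truth character of the previous step."""
    return [predict[prev] for prev in truth[:-1]]

def free_running(predict, start, steps):
    """Input at each step is the model's own previous prediction."""
    out, prev = [], start
    for _ in range(steps):
        prev = predict[prev]
        out.append(prev)
    return out

truth = ["\t", "J", "e", "\n"]               # '\t' = <sos>, '\n' = <eos>
tf_preds = teacher_forced(PREDICT, truth)    # compared against truth[1:]
fr_preds = free_running(PREDICT, "\t", 3)

print(tf_preds)
print(fr_preds)
```

With teacher forcing the single mistake stays isolated, because the next step still receives the correct 'e'. In the free-running case the wrong 'a' is fed back in and the rest of the output goes wrong too.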

 

2.6 Training the seq2seq machine translator

from tensorflow.keras.layers import Input, LSTM, Embedding, Dense
from tensorflow.keras.models import Model
import numpy as np

encoder_inputs = Input(shape=(None, src_vocab_size))
encoder_lstm = LSTM(units=256, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)
# encoder_outputs is returned as well, but it is not needed here, so it is discarded.
encoder_states = [state_h, state_c]
# Unlike a vanilla RNN, an LSTM has two states: the hidden state and the cell state.
  • The LSTM hidden-state size is set to 256
  • return_state=True is set because the encoder's internal state has to be passed to the decoder
  • The LSTM returns state_h and state_c; state_h is the hidden state and state_c is the cell state
  • In other words, both the hidden state and the cell state are passed on.
  • The two are stored in encoder_states, and handing that to the decoder passes both states along.
  • The context vector described earlier corresponds to encoder_states!
decoder_inputs = Input(shape=(None, tar_vocab_size))
decoder_lstm = LSTM(units=256, return_sequences=True, return_state=True)
decoder_outputs, _, _= decoder_lstm(decoder_inputs, initial_state=encoder_states)
# Use the encoder's hidden state and cell state as the decoder's initial state.
decoder_softmax_layer = Dense(tar_vocab_size, activation='softmax')
decoder_outputs = decoder_softmax_layer(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
  • The decoder uses the encoder's final state as its initial state; passing encoder_states as the initial_state argument is what does this
  • The decoder's hidden-state size is also 256
  • The decoder also returns a hidden state and a cell state, but they are not used during training
  • The output layer then has as many neurons as the size of the French character set, and softmax is applied so that the error against the ground truth can be computed
model.fit(x=[encoder_input, decoder_input], y=decoder_target, batch_size=64, epochs=3, validation_split=0.2)

  • To save time, training was run for only 3 epochs here. (The original code uses 50.)
  • The inputs are encoder_input and decoder_input, the ground-truth target sequence fed to the decoder; the labels are decoder_target.

 

2.7 Running the seq2seq machine translator

Training and inference work differently. For inference, a separate encoder model and decoder model are built and wired so that they translate an input sentence step by step. These models reuse the layers that were already trained (encoder_lstm, decoder_lstm, decoder_softmax_layer), so the trained weights are still used; only the way the layers are connected changes.
The overall translation procedure is as follows.

  1. The input sentence to translate goes into the encoder, which yields a hidden state and a cell state.
  2. The states and \t, which stands for <sos>, are sent to the decoder.
  3. The decoder keeps predicting the next character until \n, which stands for <eos>, comes out.

▶ Defining the encoder model

# encoder_inputs was defined earlier as Input(shape=(None, src_vocab_size))
# outputs = encoder_states: the hidden and cell state values [state_h, state_c] returned by encoder_lstm
encoder_model = Model(inputs=encoder_inputs, outputs=encoder_states)
  • First, define the encoder as encoder_model.

 

▶ Defining the decoder model

# Tensors that hold the states from the previous time step
decoder_state_input_h = Input(shape=(256,))
decoder_state_input_c = Input(shape=(256,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
# To predict the next character, the states from the previous time step are used as the initial state (initial_state). This is implemented in decode_sequence() below.
decoder_states = [state_h, state_c]
# Unlike in training, the hidden state and cell state returned by the LSTM (state_h and state_c) are kept.
decoder_outputs = decoder_softmax_layer(decoder_outputs)
decoder_model = Model(inputs=[decoder_inputs] + decoder_states_inputs, outputs=[decoder_outputs] + decoder_states)
  • Create tensors that hold the states from the previous time step, and keep the hidden and cell states returned by decoder_lstm so that they can be fed back in at the next step.
index_to_src = dict((i, char) for char, i in src_to_index.items())
index_to_tar = dict((i, char) for char, i in tar_to_index.items())
  • Build index_to_src / index_to_tar so that a character can be looked up from its index.
def decode_sequence(input_seq):
    # Get the encoder states from the input
    states_value = encoder_model.predict(input_seq)

    # Create the one-hot vector for <sos>
    target_seq = np.zeros((1, 1, tar_vocab_size))
    target_seq[0, 0, tar_to_index['\t']] = 1.

    stop_condition = False
    decoded_sentence = ""

    # Loop until stop_condition becomes True
    while not stop_condition:
        # Use the previous step's states (states_value) as the initial state of the current step
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Convert the prediction into a character
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = index_to_tar[sampled_token_index]

        # Append the character predicted at the current step to the output sentence
        decoded_sentence += sampled_char

        # Stop when <eos> is reached or the maximum length is exceeded.
        if (sampled_char == '\n' or
           len(decoded_sentence) > max_tar_len):
            stop_condition = True

        # Store the current prediction as the input for the next time step
        target_seq = np.zeros((1, 1, tar_vocab_size))
        target_seq[0, 0, sampled_token_index] = 1.

        # Store the current states as the states for the next time step
        states_value = [h, c]

    return decoded_sentence
for seq_index in [3,50,100,300,1001]: # indices of the input sentences
    input_seq = encoder_input[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print(35 * "-")
    print('Input sentence:', lines.src[seq_index])
    print('Reference sentence:', lines.tar[seq_index][1:len(lines.tar[seq_index])-1]) # print without '\t' and '\n'
    print('Translated sentence:', decoded_sentence[:len(decoded_sentence)-1]) # print without '\n'

 

This completes a character-level machine translator. A word-level translator is a separate topic that is worth studying on its own.
