

[NLP] Word Embedding

๊ฐ์ž ๐Ÿฅ” 2021. 7. 27. 16:43
๋ฐ˜์‘ํ˜•

-- ๋ณธ ํฌ์ŠคํŒ…์€ ํŒŒ์ดํ† ์น˜๋กœ ๋ฐฐ์šฐ๋Š” ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ (ํ•œ๋น›๋ฏธ๋””์–ด) ์ฑ…์„ ์ฐธ๊ณ ํ•ด์„œ ์ž‘์„ฑ๋œ ๊ธ€์ž…๋‹ˆ๋‹ค.
-- ์†Œ์Šค์ฝ”๋“œ๋Š” ์—ฌ๊ธฐ

 

1. What Is an Embedding?

  • ์ด์‚ฐ ํƒ€์ž…์˜ Word ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ€์ง‘ ๋ฒกํ„ฐ ํ‘œํ˜„์œผ๋กœ ๋ฐ”๊ฟ”์ฃผ๋Š” ๋ฐฉ๋ฒ•์„ ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ์ด๋ผ๊ณ  ํ•จ
    • ์ด์‚ฐํƒ€์ž…?
      ๋ฌธ์ž, ๊ฐœ์ฒด๋ช…, ์–ดํœ˜์‚ฌ์ „ ๋“ฑ ์œ ํ•œํ•œ ์ง‘ํ•ฉ์—์„œ ์–ป์€ ๋ชจ๋“  ์ž…๋ ฅ ํŠน์„ฑ์„ ์ด์‚ฐํƒ€์ž…์ด๋ผ๊ณ  ํ•œ๋‹ค.
  • ์ž์—ฐ์–ด๋ฅผ ๊ธฐ๊ณ„๊ฐ€ ์ดํ•ดํ•  ์ˆ˜ ์žˆ๋Š” ์ˆซ์žํ˜•ํƒœ์ธ ๋ฒกํ„ฐ๋กœ ๋ฐ”๊พผ ๊ฒฐ๊ณผ ํ˜น์€ ๊ทธ ์ผ๋ จ์˜ ๊ณผ์ • ์ „์ฒด๋ฅผ ์ž„๋ฒ ๋”ฉ์ด๋ผ๊ณ  ํ•จ
  • ๋Œ€ํ‘œ์ ์œผ๋กœ ์›ํ•ซ์ธ์ฝ”๋”ฉ, TF-IDF ๋ฐฉ๋ฒ•์ด ์žˆ๋‹ค.

 

2. ์ž„๋ฒ ๋”ฉ ๊ธฐ๋ฒ•์˜ ํ๋ฆ„์™€ ์ข…๋ฅ˜

  1. ์ ์ฐจ ํ†ต๊ณ„๊ธฐ๋ฐ˜์˜ ๊ธฐ๋ฒ•์—์„œ Neural Network ๊ธฐ๋ฒ•์œผ๋กœ
    1. ํ†ต๊ณ„๊ธฐ๋ฐ˜์˜ ๊ธฐ๋ฒ•
      • ์ž ์žฌ์˜๋ฏธ๋ถ„์„ : ๋‹จ์–ด์˜ ์‚ฌ์šฉ ๋นˆ๋„ ๋“ฑ Corpus์˜ ํ†ต๊ณ„๋Ÿ‰ ์ •๋ณด๊ฐ€ ๋“ค์–ด์žˆ๋Š” ํ–‰๋ ฌ์— ํŠน์ด๊ฐ’ ๋ถ„ํ•ด ๋“ฑ ์ˆ˜ํ•™์  ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•˜์š” ํ–‰๋ ฌ์— ์†ํ•œ ๋ฒกํ„ฐ๋“ค์˜ ์ฐจ์›์„ ์ถ•์†Œํ•˜๋Š” ๋ฐฉ๋ฒ• (TF-IDF, Word_Context Matrix, PMI Matrix ๋“ฑ์ด ์žˆ๋‹ค.)
    2. Neural Network ๊ธฐ๋ฒ•
      • Nerual Probabilistic Language Model์ด ๋ฐœํ‘œ๋œ ์ดํ›„๋ถ€ํ„ฐ ์ฃผ๋ชฉ๋ฐ›๊ธฐ ์‹œ์ž‘
      • ๊ตฌ์กฐ๊ฐ€ ์œ ์—ฐํ•˜๊ณ  ํ‘œํ˜„๋ ฅ์ด ํ’๋ถ€ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ž์—ฐ์–ด์˜ ๋ฌดํ•œํ•œ ๋ฌธ๋งฅ์„ ์ƒ๋‹น ๋ถ€๋ถ„ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์–ด์„œ ์ฃผ๋ชฉ๋ฐ›๋Š”๋‹ค.
  2. ๋‹จ์–ด์ˆ˜์ค€์—์„œ ๋ฌธ์žฅ์ˆ˜์ค€์˜ ์ž„๋ฒ ๋”ฉ ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•œ๋‹ค.
    1. ๋‹จ์–ด์ˆ˜์ค€ ์ž„๋ฒ ๋”ฉ ๊ธฐ๋ฒ•
      • ๊ฐ๊ฐ ๋ฒกํ„ฐ์— ํ•ด๋‹น ๋‹จ์–ด์˜ ๋ฌธ๋งฅ์  ์˜๋ฏธ๋ฅผ ํ•จ์ถ•ํ•ฎ๋งŒ, ๋‹จ์–ด๊ฐ€ ๋™์ผํ•˜๋ฉด ๋™์ผํ•œ ๋‹จ์–ด๋กœ ์ธ์‹ํ•˜๊ณ , ๋ชจ๋“  ๋ฌธ๋งฅ์˜ ์ •๋ณด๋ฅผ ์ดํ•ดํ• ์ˆ˜ ์—†๋‹ค. (๋™์Œ์ด์˜์–ด๋ฅผ ๊ตฌ๋ฒผํ•˜๊ธฐ ์–ด๋ ต๋‹ค.)
      • NPLM, Word2Vec, GloVe, FastText ๋“ฑ 
    2. ๋ฌธ์žฅ์ˆ˜์ค€ ์ž„๋ฒ ๋”ฉ ๊ธฐ๋ฒ•
      • ELMo๊ฐ€ ๋ฐœํ‘œ๋œ ์ดํ›„ ์ฃผ๋ชฉ๋ฐ›๊ธฐ ์‹œ์ž‘
      • ๊ฐœ๋ณ„๋‹จ์–ด๊ฐ€ ์•„๋‹Œ Sequence ์ „์ฒด์˜ ๋ฌธ๋งฅ์  ์˜๋ฏธ๋ฅผ ํ•จ์ถ•ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ ๊ธฐ๋ฒ•๋ณด๋‹ค Transfer Learning ํšจ๊ณผ๊ฐ€ ์ข‹๋‹ค.
      • ๋™์Œ์ด์˜์–ด๋„ ๋ฌธ์žฅ์ˆ˜์ค€ ์ž„๋ฒ ๋”ฉ ๊ธฐ๋ฒ•์„ ํ™œ์šฉํ•˜๋ฉด ๋ถ„๋ฆฌํ•ด์„œ ์ดํ•ดํ•  ์ˆ˜ ์žˆ์–ด์„œ ์š”์ฆ˜ ์ฃผ๋ชฉ๋ฐ›๋Š”๋‹ค. (๋ฌธ๋งฅํŒŒ์•… ์šฉ์ด)
      • BERT, GPT ๋“ฑ
  3. Rule Based ์—์„œ End to End๋กœ, ๊ทธ๋ฆฌ๊ณ  ์ตœ๊ทผ์—๋Š” Pre-Training / fine Tuning ์œผ๋กœ
    1. 1990๋…„๋Œ€ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ๋ชจ๋ธ์˜ ์ดˆ๊ธฐ ๋ชจ์Šต์€ ํ”ผ์ณ๋ฅผ ์‚ฌ๋žŒ์ด ๋ฝ‘์•„์„œ ์‚ฌ์šฉ
    2. 2000๋…„๋Œ€์—๋Š” ํ”ผ์ณ๋ฅผ ์ž๋™์œผ๋กœ ์ถ”์ถœํ•ด์ฃผ๊ณ , ๋ชจ๋ธ์— ๋ฐ์ดํ„ฐ๋ฅผ ๋„ฃ์œผ๋ฉด ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜์Œ๋ถ€ํ„ฐ ๋๊นŒ์ง€ ๊ธฐ๊ณ„๊ฐ€ ์ดํ•ดํ•˜๋Š” end to end ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉ (๋Œ€ํ‘œ์ ์œผ๋กœ Sequence to Sequence ๋ชจ๋ธ์ด ์žˆ๋‹ค)
    3. 2018๋…„ ELMo ๋ชจ๋ธ์ด ์ œ์•ˆ๋œ ์ดํ›„ NLP๋ชจ๋ธ์€ pre-training ๊ณผ fine tuning ๋ฐฉ์‹์œผ๋กœ ๋ฐœ์ „ํ•˜๊ณ  ์žˆ๋‹ค
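
To see how a sentence-level model separates homonyms, here is a minimal sketch (my own, not from the book; it assumes the Hugging Face transformers package and the bert-base-uncased checkpoint) that compares the contextual vectors of "bank" in two different sentences:

import torch
from transformers import AutoTokenizer, AutoModel

# Not the book's code: any BERT-style checkpoint would do here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Return the contextual vector of the token "bank" in this sentence.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    position = inputs["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids("bank"))
    return hidden[position]

v1 = bank_vector("I deposited cash at the bank.")
v2 = bank_vector("We sat on the bank of the river.")
# Same surface word, clearly different vectors (similarity well below 1.0).
print(torch.cosine_similarity(v1, v2, dim=0))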

 

3. ์ž„๋ฒ ๋”ฉ์˜ ์ข…๋ฅ˜

  1. Matrix-factorization methods
    • Embed by applying a decomposition to the original matrix built from the corpus.
    • After the decomposition, the embedding is formed by using only one of the resulting matrices, or by summing or concatenating the two (see the sketch after this list).
    • GloVe, Swivel, etc.
  2. Prediction-based methods
    • Predict which words will appear around a given word,
    • or predict the next word given the previous words,
    • or blank out some words in a sentence and train the model to guess what they were.
    • Neural-network-based methods belong to this family.
    • Word2Vec, FastText, BERT, ELMo, GPT, etc.
  3. Topic-based methods
    • Perform the embedding by inferring the latent topics of a given document.
    • LDA belongs to this family.
    • Once training is complete, each document is converted into a probability vector over topics, so this can be understood as a kind of embedding technique.

 

4. How Word Embeddings Are Trained

๋‹จ์–ด ์ž„๋ฒ ๋”ฉ์€ ๋ ˆ์ด๋ธ”์ด ์—†๋Š” ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต๋œ๋‹ค. ํ•˜์ง€๋งŒ ์ง€๋„ํ•™์Šต๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•œ๋‹ค. ์ด๊ฒŒ ๋ฌด์Šจ๋ง์ธ๊ฐ€ ์•Œ์•„๋ณด๋ฉด, ์•„๋ž˜์™€ ๊ฐ™๋‹ค. 

  • ๋ฐ์ดํ„ฐ๊ฐ€ ์•”๋ฌต์ ์œผ๋กœ ๋ ˆ์ด๋ธ” ๋˜์–ด ์žˆ๋Š” ๋ณด์กฐ ์ž‘์—…์„ ๊ตฌ์„ฑํ•œ๋‹ค.
    • ๋‹จ์–ด ์‹œํ€€์Šค๊ฐ€ ์ฃผ์–ด์ง€๋ฉด ๋‹ค์Œ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ณด์กฐ์ž‘์—… (์–ธ์–ด ๋ชจ๋ธ๋ง ์ž‘์—… ์ˆ˜ํ–‰)
    • ์•ž๊ณผ ๋’ค์˜ ๋‹จ์–ด ์‹œํ€€์Šค๊ฐ€ ์ฃผ์–ด์ง€๋ฉด ๋ˆ„๋ฝ๋œ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ณด์กฐ์ž‘์—…
    • ๋‹จ์–ด๊ฐ€ ์ฃผ์–ด์ง€๋ฉด ์œ„์น˜์— ๊ด€๊ณ„ ์—†์ด window์•ˆ์— ๋“ฑ์žฅํ•  ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ณด์กฐ์ž‘์—…
  • ๋ณด์กฐ์ž‘์—…์˜ ์„ ํƒ์€ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ค๊ณ„์ž์˜ ์ง๊ด€๊ณผ ๊ณ„์‚ฐ ๋น„์šฉ์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง„๋‹ค.
    • ๋ณด์กฐ์ž‘์—…์˜ ์˜ˆ)
      GloVe, CBOW, skip-gram ๋“ฑ
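
To make the window task concrete, here is a minimal sketch (my own, not from the book) that builds the (center word, context word) pairs a skip-gram style model would train on; these pairs are the implicit labels:

# Build (center, context) training pairs from raw text with a fixed window.
tokens = "the quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []
for i, center in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((center, tokens[j]))

print(pairs[:5])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox')]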

 

5. ์‚ฌ์ „ ํ›ˆ๋ จ๋œ ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ (pre-trained word embedding)

  • Large corpora and pre-trained word embeddings built from them (by Google, Wikipedia, and others) can be downloaded and used directly.
  • Let's look at a few properties of word embeddings and at an example of using pre-trained word embeddings in an NLP task.

5.1 Loading the Embeddings

์ž„๋ฒ ๋”ฉ์„ ํšจ์œจ์ ์œผ๋กœ ๋กœ๋“œํ•˜๊ณ  ์ฒ˜๋ฆฌํ•˜๋Š” PreTrainedEmbeddings ์œ ํ‹ธ๋ฆฌํ‹ฐ ํด๋ž˜์Šค๋ฅผ ์‚ดํŽด๋ณด์ž. (ํ•ด๋‹น ์ฝ”๋“œ๋Š” ์ด ๋งํฌ์—์„œ ํ™•์ธํ•  ์ˆ˜์žˆ๋‹ค) ์ด ํด๋ž˜์Šค๋Š” ๋น ๋ฅธ ์กฐํšŒ๋ฅผ ์œ„ํ•ด ๋ฉ”๋ชจ๋ฆฌ ๋‚ด์— ๋ชจ๋“  ๋‹จ์–ด์˜ ์ธ๋ฑ์Šค๋ฅผ ๊ตฌ์ถ•ํ•˜๊ณ  K-Nearest Neighbor ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•œ annoy ํŒจํ‚ค์ง€๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค.

5.1.1 Installing and Importing the Packages

# Install the annoy package.
!pip install annoy

import torch
import torch.nn as nn
from tqdm import tqdm
from annoy import AnnoyIndex
import numpy as np

 

5.1.2 A Look at the Class

    • __init__
      • word_to_index: a mapping from word to integer index
      • word_vectors: a list of vector arrays
      • the constructor of the PreTrainedEmbeddings class; it builds the index
    • from_embeddings_file(cls, embedding_file)
      • use this method when loading from a file
      • builds the object from a pre-trained vector file
      • the return value is an instance of PreTrainedEmbeddings
    • get_embedding(self, word)
      • returns the embedding for the given word
      • the return value is a numpy.ndarray
    • get_closest_to_vector(self, vector, n=1)
      • given a vector, returns its n nearest neighbors
      • returned in the form [str, str, ...]
    • compute_and_print_analogy(self, word1, word2, word3)
      • returns the word-analogy result in the form print("{} : {} :: {} : {}".format(word1, word2, word3, word4))
class PreTrainedEmbeddings(object):
    """ A wrapper class for using pre-trained word vectors """
    def __init__(self, word_to_index, word_vectors):
        """
        Args:
            word_to_index (dict): mapping from word to integer index
            word_vectors (list of numpy arrays)
        """
        self.word_to_index = word_to_index
        self.word_vectors = word_vectors
        self.index_to_word = {v: k for k, v in self.word_to_index.items()}

        # Build an approximate nearest-neighbor index over all the vectors.
        self.index = AnnoyIndex(len(word_vectors[0]), metric='euclidean')
        print("Building the index!")
        for _, i in self.word_to_index.items():
            self.index.add_item(i, self.word_vectors[i])
        self.index.build(50)
        print("Done!")
        
    @classmethod
    def from_embeddings_file(cls, embedding_file):
        # Use this method when loading vectors from a file, e.g.:
        # embeddings = PreTrainedEmbeddings.from_embeddings_file(filename)
        """Instantiate from a pre-trained vector file.
        
        The vector file has the following format:
            word0 x0_0 x0_1 x0_2 x0_3 ... x0_N
            word1 x1_0 x1_1 x1_2 x1_3 ... x1_N
        
        Args:
            embedding_file (str): location of the file
        Returns:
            an instance of PreTrainedEmbeddings
        """
        word_to_index = {}
        word_vectors = []

        with open(embedding_file) as fp:
            for line in fp.readlines():
                line = line.split(" ")
                word = line[0]
                vec = np.array([float(x) for x in line[1:]])
                
                word_to_index[word] = len(word_to_index)
                word_vectors.append(vec)       
        return cls(word_to_index, word_vectors)
    
    def get_embedding(self, word):
        """
        ๋งค๊ฐœ๋ณ€์ˆ˜:
            word (str)
        ๋ฐ˜ํ™˜๊ฐ’
            ์ž„๋ฒ ๋”ฉ (numpy.ndarray)
        """
        return self.word_vectors[self.word_to_index[word]]

    def get_closest_to_vector(self, vector, n=1):
        """๋ฒกํ„ฐ๊ฐ€ ์ฃผ์–ด์ง€๋ฉด n ๊ฐœ์˜ ์ตœ๊ทผ์ ‘ ์ด์›ƒ์„ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค
        ๋งค๊ฐœ๋ณ€์ˆ˜:
            vector (np.ndarray): Annoy ์ธ๋ฑ์Šค์— ์žˆ๋Š” ๋ฒกํ„ฐ์˜ ํฌ๊ธฐ์™€ ๊ฐ™์•„์•ผ ํ•ฉ๋‹ˆ๋‹ค
            n (int): ๋ฐ˜ํ™˜๋  ์ด์›ƒ์˜ ๊ฐœ์ˆ˜
        ๋ฐ˜ํ™˜๊ฐ’:
            [str, str, ...]: ์ฃผ์–ด์ง„ ๋ฒกํ„ฐ์™€ ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด ๋‹จ์–ด
                ๋‹จ์–ด๋Š” ๊ฑฐ๋ฆฌ์ˆœ์œผ๋กœ ์ •๋ ฌ๋˜์–ด ์žˆ์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
        """
        nn_indices = self.index.get_nns_by_vector(vector, n)
        return [self.index_to_word[neighbor] for neighbor in nn_indices]
    
    def compute_and_print_analogy(self, word1, word2, word3):
        """Print the result of a word analogy computed with the embeddings

        word1 is to word2 as word3 is to __.
        This method prints: word1 : word2 :: word3 : word4
        
        Args:
            word1 (str)
            word2 (str)
            word3 (str)
        """
        # Fetch the vectors with get_embedding
        vec1 = self.get_embedding(word1)
        vec2 = self.get_embedding(word2)
        vec3 = self.get_embedding(word3)

        # Compute the fourth embedding: vec4 = vec3 + (vec2 - vec1)
        spatial_relationship = vec2 - vec1
        vec4 = vec3 + spatial_relationship

        # Fetch the nearest-neighbor words with get_closest_to_vector (n=4)
        closest_words = self.get_closest_to_vector(vec4, n=4)
        existing_words = set([word1, word2, word3])
        # Drop the query words themselves from the candidates
        closest_words = [word for word in closest_words 
                             if word not in existing_words] 

        if len(closest_words) == 0:
            print("Could not find any neighbors close to the computed vector!")
            return
        
        for word4 in closest_words:
            print("{} : {} :: {} : {}".format(word1, word2, word3, word4))

 

5.1.3 Downloading the GloVe Data

# GloVe ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค์šด๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค.
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip
!mkdir -p data/glove
!mv glove.6B.100d.txt data/glove

 

5.1.4 ์ž„๋ฒ ๋”ฉ ์ƒ์„ฑ

embeddings = PreTrainedEmbeddings.from_embeddings_file('data/glove/glove.6B.100d.txt')
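
As a quick sanity check (my own addition, assuming the 100-dimensional file downloaded above), you can confirm the vectors loaded correctly:

# Not in the book: each vector from glove.6B.100d.txt should have 100 dimensions.
vec = embeddings.get_embedding('potato')
print(vec.shape)  # (100,)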

 

5.1.5 Checking the Results

  1. Relationship between a gendered noun and its pronoun
    1. embeddings.compute_and_print_analogy('man', 'he', 'woman')
  2. Verb-noun relationship
    1. embeddings.compute_and_print_analogy('fly', 'plane', 'sail')
  3. Noun-noun relationship
    1. embeddings.compute_and_print_analogy('cat', 'kitten', 'dog')
  4. Hypernymy (a broader category)
    1. embeddings.compute_and_print_analogy('blue', 'color', 'dog')
  5. Part-to-whole relationship
    1. embeddings.compute_and_print_analogy('leg', 'legs', 'hand')
  6. Difference in manner
    1. embeddings.compute_and_print_analogy('talk', 'communicate', 'read')
  7. Metonymy (a word standing for a whole concept)
    1. embeddings.compute_and_print_analogy('blue', 'democrat', 'red')
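
Each call prints lines in the word1 : word2 :: word3 : word4 format produced by compute_and_print_analogy. Exact outputs depend on the vectors, but with glove.6B.100d the first analogy typically resolves like this (an illustrative result, not captured output):

embeddings.compute_and_print_analogy('man', 'he', 'woman')
# Typically prints, among the candidates:
#   man : he :: woman : she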

 

5.1.6 ์ž„๋ฒ ๋”ฉ์˜ ๋ฌธ์ œ

Because word vectors are built on co-occurrence information, they sometimes encode incorrect relationships.

  1. ์ž˜๋ชป๋œ ์ •๋ณด๋ฅผ ๋‚ด๋ฑ‰๋Š” ๊ฒฝ์šฐ. (๋™์‹œ์— ๋“ฑ์žฅํ•˜๋Š” ์ •๋ณด๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์ธ์ฝ”๋”ฉํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ด๋Ÿฐ ์˜ค๋ฅ˜๋Š” ๋‹น์—ฐํ•˜๋‹ค.)
    1. embeddings.compute_and_print_analogy('fast', 'fastest', 'small')

      โ–ถ smallest๊ฐ€ ๋‚˜์™€์•ผ ํ•˜์ง€๋งŒ, largest among ๋“ฑ ๋‹ค๋ฅธ๋‹จ์–ด๊ฐ€ ๋“ฑ์žฅํ–ˆ๋‹ค.
  2. ๋„๋ฆฌ ์•Œ๋ ค์ง„ ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ์˜ ์Œ“์ธ ์„ฑ๋ณ„ ์ธ์ฝ”๋”ฉ (์ด๋ ‡๊ฒŒ ์„ฑ๋ณ„๊ณผ ๊ฐ™์€ ๋ณดํ˜ธ ์†์„ฑ์— ์ฃผ์˜ํ•ด์•ผํ•œ๋‹ค. ์™œ๋ƒํ•˜๋ฉด ์ถ”ํ›„ ํ•˜์œ„ ๋ชจ๋ธ์—์„œ ์›์น˜ ์•Š๋Š” ํŽธํ–ฅ์„ ๋ฐœ์ƒ์‹œํ‚ฌ ์ˆ˜ ๋„ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.)
    1. embiddeings.compute_and_pring_analogy('man', 'king', 'woman')
  3. ๋–„๋กœ๋Š” ๋ฒกํ„ฐ์— ์ธ์ฝ”๋”ฉ๋œ ๋ฌธํ™”์  ์„ฑ๋ณ„ ํŽธ๊ฒฌ์„ ๋ฐœ์ƒ์‹œํ‚ค๊ธฐ๋„ ํ•œ๋‹ค.
    1. embeddings.compute_and_print_analogy('man', 'doctor', 'woman')

      โ–ถ ๋‚จ์ž๋Š” ์˜์‚ฌ, ์—ฌ์ž๋Š” ๊ฐ„ํ˜ธ์‚ฌ๋ผ๋Š” ๋ฌธํ™”์  ์„ฑ๋ณ„ ํŽธ๊ฒฌ์„ ๋ฐœ์ƒ์‹œ์ผฐ๋‹ค.

 

 

So far we have looked at some basic properties and characteristics of embeddings. In the next post we will walk through an example of training CBOW embeddings.

 

 

References: https://heung-bae-lee.github.io/2020/01/16/NLP_01/

 

์ž„๋ฒ ๋”ฉ์ด๋ž€?

์ปดํ“จํ„ฐ๊ฐ€ ๋ฐ”๋ผ๋ณด๋Š” ๋ฌธ์ž ์•„๋ž˜์™€ ๊ฐ™์ด ๋ฌธ์ž๋Š” ์ปดํ“จํ„ฐ๊ฐ€ ํ•ด์„ํ•  ๋•Œ ๊ทธ๋ƒฅ ๊ธฐํ˜ธ์ผ ๋ฟ์ด๋‹ค. ์ด๋ ‡๊ฒŒ encoding๋œ ์ƒํƒœ๋กœ ๋ณด๊ฒŒ ๋˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™์€ ๋ฌธ์ œ์ ์ด ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด ๊ธ€์ž๊ฐ€ ์–ด๋–ค ๊ธ€์ž์ธ์ง€๋ฅผ ํ‘œ์‹œ

heung-bae-lee.github.io

 

๋ฐ˜์‘ํ˜•