Potato

[NLP] Lemmatization and Stemming

Potato 🥔 2021. 7. 19. 08:03

-- This post was written with reference to the book Natural Language Processing with PyTorch (Hanbit Media) and to WikiDocs.
-- The reference document is attached as a link.

 

▶ Lemma and stem

  • Lemma = the base (dictionary) form of a word
    • Inflected forms of the verb 'fly' → flies, flew, flown, flying... a single word takes several surface forms as its ending changes
    • All of these forms share one lemma: fly

1. Lemmatization

Lemmatization traces words back to their root word even when they appear in different forms, and checks whether this reduces the number of distinct words. For example, am, are, and is are different words, but all three can be reduced to the single verb be.
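To make the reduction concrete, here is a minimal sketch in plain Python. The LEMMA_OF table is a hypothetical, hand-written mapping used only for illustration; a real lemmatizer looks lemmas up in a lexical resource rather than a table like this.

```python
# Hypothetical hand-written table (an assumption for illustration only).
LEMMA_OF = {
    "am": "be", "are": "be", "is": "be",
    "flies": "fly", "flew": "fly", "flown": "fly",
}

tokens = ["am", "are", "is", "fly", "flies", "flew", "flown"]

# Map each token to its lemma; tokens missing from the table are kept unchanged.
lemmas = [LEMMA_OF.get(t, t) for t in tokens]

# Prints: 7 distinct words -> 2 distinct lemmas
print(len(set(tokens)), "distinct words ->", len(set(lemmas)), "distinct lemmas")
```

Seven surface forms collapse into two lemmas (be and fly), which is the vocabulary reduction that lemmatization aims for.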

The most careful way to extract a lemma is to perform morphological parsing first, that is, to analyze the word's morphemes. A morpheme is the smallest unit of a word that carries meaning.

  • Types of morphemes
    • Stem: the core part of the word, which carries its meaning
    • Affix: a part that adds extra meaning to the word

Both nltk and spaCy provide tools for lemmatization.

1.1 Lemmatization with nltk

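import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # one-time download of the WordNet data the lemmatizer uses
n = WordNetLemmatizer()
words = ['policy', 'doing', 'organization', 'have', 'going', 'love', 'lives', 'fly', 'dies', 'watched', 'has', 'starting']
# With no part-of-speech argument, lemmatize() treats every word as a noun
print([n.lemmatize(w) for w in words])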

lives is correctly lemmatized to life, but some results are not real words (dy, ha) and some words are left unchanged (doing). This shows that the lemmatizer has to know the part of speech of the original word in order to return the correct lemma.

So if we tell the lemmatizer the word's part of speech, it returns the correct result.

n.lemmatize('dies', 'v')

Once the lemmatizer is told that dies is a verb, it correctly returns die.
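Why the part of speech matters can be sketched with a toy lookup table keyed by (word, part of speech). The TOY_LEMMAS table and the toy_lemmatize function are hypothetical and do not reflect how WordNetLemmatizer is implemented; they only show that one surface form can map to different lemmas. The default of 'n' mirrors the noun default of nltk's lemmatize().

```python
# Hypothetical table keyed by (word, part of speech); 'n' = noun, 'v' = verb.
TOY_LEMMAS = {
    ("lives", "n"): "life",
    ("lives", "v"): "live",
    ("dies", "v"): "die",
    ("has", "v"): "have",
    ("doing", "v"): "do",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if there is no entry."""
    return TOY_LEMMAS.get((word, pos), word)


print(toy_lemmatize("lives", "n"))  # life
print(toy_lemmatize("lives", "v"))  # live
print(toy_lemmatize("doing"))       # doing (no noun entry, so it is left unchanged)
print(toy_lemmatize("doing", "v"))  # do
```

Without the right tag the lookup simply misses, which is the same failure mode seen above for doing.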

1.2 Lemmatization with spaCy

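import spacy

# spaCy v3 removed the 'en' shortcut, so the model is loaded by its full name
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')
doc = nlp("he was running late")
for token in doc:
    # token.lemma_ holds the lemma of the token as a string
    print('{} -> {}'.format(token, token.lemma_))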

2. Stemming

Stemming extracts the stem by cutting a word according to fixed rules alone. As a result, the output may not be a real word.
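The idea of cutting by fixed rules can be sketched with a toy suffix stripper. The rule list and the toy_stem function below are assumptions made up for illustration; this is not the Porter or Lancaster algorithm, both of which use far more elaborate rule sets.

```python
# Hypothetical (suffix, replacement) rules, tried in order; the first match wins.
SUFFIX_RULES = [
    ("ization", "ize"),
    ("ing", ""),
    ("ies", "i"),
    ("ed", ""),
    ("es", ""),
    ("s", ""),
    ("y", "i"),
]


def toy_stem(word):
    """Apply the first matching rule, keeping at least two characters of the word."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word


print(toy_stem("watched"))   # watch
print(toy_stem("starting"))  # start
print(toy_stem("policy"))    # polici (not a real word)
print(toy_stem("has"))       # ha (not a real word)
```

Because the rules look only at the spelling, they produce sensible stems for some words and non-words for others.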

nltk provides two widely used stemming algorithms.

2.1 PorterStemmer (Porter algorithm)

from nltk.stem import PorterStemmer
s=PorterStemmer()
words=['policy', 'doing', 'organization', 'have', 'going', 'love', 'lives', 'fly', 'dies', 'watched', 'has', 'starting']
print([s.stem(w) for w in words])

2.2 LancasterStemmer

from nltk.stem import LancasterStemmer
l=LancasterStemmer()
words=['policy', 'doing', 'organization', 'have', 'going', 'love', 'lives', 'fly', 'dies', 'watched', 'has', 'starting']
print([l.stem(w) for w in words])

Comparing the two results, doing and organization are cut differently by the two stemmers. The stemming result therefore depends on which algorithm's rules are applied.
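One way to see the differences at a glance is to run both stemmers over the same list and print the stems side by side. This is a small sketch that assumes nltk is installed; the rows list and the "differs" marker are illustrative additions, and neither stemmer needs an extra data download.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
words = ['policy', 'doing', 'organization', 'have', 'going', 'love',
         'lives', 'fly', 'dies', 'watched', 'has', 'starting']

# One row per word: (original word, Porter stem, Lancaster stem)
rows = [(w, porter.stem(w), lancaster.stem(w)) for w in words]

for word, p_stem, l_stem in rows:
    # Flag the words on which the two algorithms disagree
    marker = '' if p_stem == l_stem else '  <- differs'
    print('{:<14}{:<10}{:<10}{}'.format(word, p_stem, l_stem, marker))
```

Rows flagged as differing are the words where the choice of algorithm changes the stem.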

Reference: https://wikidocs.net/21707

 
