

[NLP] POS Tagging / Chunking / Named Entity Recognition (NER)

๊ฐ์ž ๐Ÿฅ” 2021. 7. 19. 09:00
๋ฐ˜์‘ํ˜•

-- ๋ณธ ํฌ์ŠคํŒ…์€ ํŒŒ์ดํ† ์น˜๋กœ ๋ฐฐ์šฐ๋Š” ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ (ํ•œ๋น›๋ฏธ๋””์–ด) ์ฑ…๊ณผ ์œ„ํ‚ค๋…์Šค๋ฅผ ์ฐธ๊ณ ํ•ด์„œ ์ž‘์„ฑ๋œ ๊ธ€์ž…๋‹ˆ๋‹ค.

 

1. NLP์˜ ๋ถ„๋ฅ˜ ๋ฌธ์ œ

  • Document classification was one of the earliest applications of NLP.
  • TF-IDF is useful for classifying longer stretches of text such as documents and sentences.
  • Assigning topic labels, predicting the sentiment of reviews, filtering spam mail, identifying languages, and triaging e-mail are all supervised document-classification problems (see the sketch after this list).
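
As a minimal sketch of that idea (not from the original post; it assumes scikit-learn and a tiny made-up spam/ham corpus), TF-IDF vectors can feed any standard classifier:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy labeled corpus (assumed example data): 1 = spam, 0 = not spam.
docs = ["win a free prize now", "meeting at noon tomorrow",
        "free prize claim now", "project update attached"]
labels = [1, 0, 1, 0]

# Turn each document into a TF-IDF weighted bag-of-words vector.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Train a supervised classifier on the TF-IDF vectors.
clf = LogisticRegression().fit(X, labels)
print(clf.predict(vectorizer.transform(["claim your free prize"])))  # likely [1] (spam)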

 

2. Classifying Words

โ–ถ POS tagging (part-of-speech tagging)

  • POS tagging is one method of morphological analysis.
  • It is the task of attaching a part-of-speech label to each morpheme.
    • The set of part-of-speech categories can differ from person to person, language to language, scholar to scholar, and algorithm to algorithm.

1. POS tagging with spaCy

import spacy

# The 'en' shortcut was removed in spaCy v3; load the small English pipeline
# instead (install it first with: python -m spacy download en_core_web_sm).
nlp = spacy.load('en_core_web_sm')
doc = nlp("Mary slapped the green witch.")
for token in doc:
    print('{} -> {}'.format(token, token.pos_))
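
With en_core_web_sm this prints each token with its coarse POS tag, roughly: Mary -> PROPN, slapped -> VERB, the -> DET, green -> ADJ, witch -> NOUN, . -> PUNCT (exact tags may vary by model version).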

2. POS tagging with nltk

import nltk
nltk.download('averaged_perceptron_tagger')  # pre-trained tagger used by pos_tag
nltk.download('tagsets')                     # optional: tag documentation for nltk.help.upenn_tagset()

from nltk.tag import pos_tag

words = ['Mary', 'slapped', 'the', 'green', 'witch']
pos_tag(words)  # returns (word, tag) pairs, e.g. [('Mary', 'NNP'), ('slapped', 'VBD'), ...]

Reference) https://ysyblog.tistory.com/88

 

3. Shallow Parsing (Chunking) and Named Entity Recognition

3.1 Shallow parsing = chunking

  • Chunking derives higher-order units (phrases) from grammatical elements such as nouns, verbs, and adjectives.
  • Pre-trained models built on POS tagging are available for this task.

1. spaCy์„ ํ™œ์šฉํ•œ chunking

import spacy

# Same pipeline as above; doc.noun_chunks yields the noun phrases directly.
nlp = spacy.load('en_core_web_sm')
doc = nlp("Mary slapped the green witch.")
for chunk in doc.noun_chunks:
    print('{} -> {}'.format(chunk, chunk.label_))

This extracts the noun phrases: Mary -> NP and the green witch -> NP.

2. Chunking with regular expressions

๋ถ€๋ถ„๊ตฌ๋ฌธ๋ถ„์„(chunking)๋ชจ๋ธ ํ›ˆ๋ จ์— ์‚ฌ์šฉ๋  ๋ฐ์ดํ„ฐ๊ฐ€ ์—†๋‹ค๋ฉด, ์ •๊ทœ์‹์„ ํ™œ์šฉํ•˜์—ฌ ๋ถ€๋ถ„๊ตฌ๋ฌธ๋ถ„์„์„ ์ง„ํ–‰ํ•  ์ˆ˜ ์žˆ๋‹ค. nltk์—์„œ RegexpParser๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค๊ณ  ํ•œ๋‹ค.

Reference) https://jynee.github.io/NLP%EA%B8%B0%EC%B4%88_3/

 

3.2 Named Entity Recognition (NER)

  • A named entity is a string that refers to a real-world concept, such as a person, location, company, or drug name.
  • NER requires a preprocessing step (recognizers usually expect POS-tagged input).

โ–ถ NER with nltk
(nltk ships a pre-trained NER chunker, so there is no need to implement a separate named-entity recognizer.)
(The code below runs POS tagging first and then performs NER.)

import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

nltk.download('punkt')                       # tokenizer model used by word_tokenize
nltk.download('averaged_perceptron_tagger')  # POS tagger used by pos_tag
nltk.download('maxent_ne_chunker')           # pre-trained NER chunker
nltk.download('words')                       # word list the chunker depends on

sentence = "James is working at Disney in London"
tagged = pos_tag(word_tokenize(sentence))
print(tagged)    # tokenization and POS tagging in one step

entities = ne_chunk(tagged)
print(entities)  # named entity recognition

We can confirm that the named entities are recognized correctly: James as PERSON, Disney as ORGANIZATION, and London as GPE (location).
Reference) https://wikidocs.net/30682
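
For comparison, spaCy also exposes named entities directly on a parsed doc; a minimal sketch, assuming the same en_core_web_sm pipeline used above:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("James is working at Disney in London")

# Each recognized entity span carries a label such as PERSON, ORG, or GPE.
for ent in doc.ents:
    print('{} -> {}'.format(ent.text, ent.label_))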

๋ฐ˜์‘ํ˜•