

[NLP] Training CBOW Embeddings with PyTorch (1): Building the Dataset

๊ฐ์ž ๐Ÿฅ” 2021. 7. 29. 18:54
๋ฐ˜์‘ํ˜•

-- ๋ณธ ํฌ์ŠคํŒ…์€ ํŒŒ์ดํ† ์น˜๋กœ ๋ฐฐ์šฐ๋Š” ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ (ํ•œ๋น›๋ฏธ๋””์–ด) ์ฑ…์„ ์ฐธ๊ณ ํ•ด์„œ ์ž‘์„ฑ๋œ ๊ธ€์ž…๋‹ˆ๋‹ค.
-- ์†Œ์Šค์ฝ”๋“œ๋Š” ์—ฌ๊ธฐ

 

1. What is CBOW?

  • The Word2Vec CBOW model
  • A multiclass classification task
  • Scans over the words of a text to build context windows, drops the center word from each window, and then uses the remaining context window to predict the missing word
  • In other words, it plays the role of figuring out which word is missing from a sentence (see the toy sketch below)

 

2. ํ™œ์šฉ ๋ฐ์ดํ„ฐ

  • http://bit.ly/2T5iU8J ์—์„œ ๋ฉ”๋ฆฌ ์…ธ๋ฆฌ์˜ ์†Œ์„ค [ํ”„๋ž‘์ผ„์Šˆํƒ€์ธ]์˜ ๋””์ง€ํ„ธ ๋ฒ„์ „์„ ๋ฐ›์•„ ๊ตฌ์ถ•ํ•œ ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹ ํ™œ์šฉ
  • ํŒŒ์ดํ† ์น˜์˜ Dataset ํด๋ž˜์Šค๋ฅผ ๋งŒ๋“ค๊ณ  ๋งˆ์ง€๋ง‰์— ๋ฐ์ดํ„ฐ๋ฅผ ํ›ˆ๋ จ, ๊ฒ€์ฆ, ํ…Œ์ŠคํŠธ ์„ธํŠธ๋กœ ๋ถ„ํ• ํ•˜์—ฌ ์‚ฌ์šฉ

 

3. ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ

3.1 ํ…์ŠคํŠธ๋ฅผ ๊ฐœ๋ณ„ ๋ฌธ์žฅ์œผ๋กœ ๋ถ„ํ• 

# Split the raw text of the book into sentences
import nltk  # the punkt models must be available: nltk.download('punkt')

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
with open(args.raw_dataset_txt) as fp:
    book = fp.read()
sentences = tokenizer.tokenize(book)

print(len(sentences), "sentences")
print("Sample:", sentences[100])

 

3.2 ๋‹ค์Œ ๊ฐ ๋ฌธ์žฅ์„ ์†Œ๋ฌธ์ž๋กœ ๋ณ€ํ™˜ํ•˜๊ณ  ๊ตฌ๋‘์ ์„ ์™„์ „ํžˆ ์ œ๊ฑฐ

  • ์ •๊ทœ์‹์„ ํ™œ์šฉํ•˜์—ฌ ์ œ๊ฑฐํ•ด์ฃผ๊ณ , ๊ณต๋ฐฑ์œผ๋กœ ๋ฌธ์ž์—ด์„๋ถ„ํ• ํ•˜์—ฌ ํ† ํฐ ๋ฆฌ์ŠคํŠธ๋ฅผ ์ถ”์ถœ
# Clean sentences
import re

def preprocess_text(text):
    # lowercase every word
    text = ' '.join(word.lower() for word in text.split(" "))
    # put spaces around punctuation so it becomes its own token
    text = re.sub(r"([.,!?])", r" \1 ", text)
    # collapse anything that is not a letter or .,!? into a single space
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
    return text

cleaned_sentences = [preprocess_text(sentence) for sentence in sentences]
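
For instance (a made-up sentence, with the output shown as a comment), punctuation becomes its own token and the other symbols collapse into spaces:

print(preprocess_text('He said: "I am Frankenstein!"'))
# -> 'he said i am frankenstein ! '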

 

3.3 ๋ฐ์ดํ„ฐ๋ฅผ ์œˆ๋„์šฐ ์‹œํ€€์Šค๋กœ ํ‘œํ˜„

  • CBOW๋ชจ๋ธ์„ ์ตœ์ ํ™” ํ•  ์ˆ˜ ์žˆ๋„๋ก ๋ฐ์ดํ„ฐ์…‹์„ window์˜ ์‹œํ€€์Šค๋กœ ํ‘œํ˜„
# The parameters set earlier in the notebook (shown here for reference)
from argparse import Namespace

args = Namespace(
    raw_dataset_txt="data/books/frankenstein.txt",
    window_size=5,
    train_proportion=0.7,
    val_proportion=0.15,
    test_proportion=0.15,
    output_munged_csv="data/books/frankenstein_with_splits.csv",
    seed=1337
)

# Global vars
MASK_TOKEN = "<MASK>"

from tqdm import tqdm_notebook
import pandas as pd

# Build the windows (window size = 5):
# walk over each sentence's token list and group it into windows of the given size
flatten = lambda outer_list: [item for inner_list in outer_list for item in inner_list]
windows = flatten([list(nltk.ngrams([MASK_TOKEN] * args.window_size + sentence.split(' ') + \
    [MASK_TOKEN] * args.window_size, args.window_size * 2 + 1)) \
    for sentence in tqdm_notebook(cleaned_sentences)])

# Turn each window into a CBOW (context, target) pair
data = []
for window in tqdm_notebook(windows):
    target_token = window[args.window_size]
    context = []
    for i, token in enumerate(window):
        if token == MASK_TOKEN or i == args.window_size:
            continue
        else:
            context.append(token)
    data.append([' '.join(token for token in context), target_token])

# Convert to a DataFrame
cbow_data = pd.DataFrame(data, columns=["context", "target"])

โ–ถ ์ง€์ •๋œ ํฌ๊ธฐ์˜ Window๋กœ ๋ฌถ์–ด์ค€๋‹ค๋Š” ๋ง์„ ๊ทธ๋ฆผ์œผ๋กœ ํ‘œํ˜„ํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

ํŒŒ์ดํ† ์น˜๋กœ ๋ฐฐ์šฐ๋Š” ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ 

  • CBOW ์ž‘์—…์€ ์œ„ ๊ทธ๋ฆผ๊ณผ ๊ฐ™๋‹ค.
  • ์™ผ์ชฝ๋ฌธ๋งฅ๊ณผ ์˜ค๋ฅธ์ชฝ ๋ฌธ๋งฅ์„ ์‚ฌ์šฉํ•˜์—ฌ (๋นจ๊ฐ„๋‹จ์–ด)๋ฅผ ์˜ˆ์ธกํ•œ๋‹ค. 
  • ๋ฌธ๋งฅ์˜ ์œˆ๋„์šฐ ๊ธธ์ด๋Š” ์–‘์ชฝ์œผ๋กœ 2์ž„์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.
  • ํ…์ŠคํŠธ ์œ„๋ฅผ ์Šฌ๋ผ์ด๋”ฉํ•˜๋Š” ์œˆ๋„์šฐ๊ฐ€ ์ง€๋„ํ•™์Šต ์ƒ˜ํ”Œ์„ ์ƒ์„ฑํ•œ๋‹ค. 
  • ๊ฐ ์ƒ˜ํ”Œ์˜ ํƒ€๊นƒ์€ (๋นจ๊ฐ„๋‹จ์–ด) ์ด๊ณ , ์ƒ˜ํ”Œ์€ ์œˆ๋„์šฐ์— ์žˆ๋Š” ๋‚˜๋จธ์ง€ ๋‹จ์–ด์ด๋‹ค.
  • ๊ธธ์ด๊ฐ€ 2๊ฐ€์•ˆ๋˜๋Š” window๋Š” ์ ์ ˆํ•˜๊ฒŒ ํŒจ๋”ฉ๋œ๋‹ค.

 

3.4 ๋ฐ์ดํ„ฐ ๋ถ„ํ• 

  • ๋ฐ์ดํ„ฐ๋ฅผ train 70%, validation 15%, test 15%๋กœ ๋ถ„ํ• ํ•  ๊ฒƒ์ด๋‹ค. (์ด์ „์— args์—์„œ ์ง€์ •)
  • ํ›ˆ๋ จtrain ์„ธํŠธ๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์—…๋ฐ์ดํŠธํ•˜๋Š”๋ฐ ์‚ฌ์šฉํ•˜๊ณ 
  • ๊ฒ€์ฆval ์„ธํŠธ๋Š” ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ์ธก์ •ํ•˜๋Š”๋ฐ ์‚ฌ์šฉ๋œ๋‹ค.
  • ํ…Œ์ŠคํŠธtest ์„ธํŠธ๋Š” ๋งˆ์ง€๋ง‰์— ๊ฐ€์žฅ ์ข‹์€ ๋ชจ๋ธ์— ๋”ฑ ํ•œ๋ฒˆ๋งŒ ์‚ฌ์šฉ๋œ๋‹ค.
# Create split data
n = len(cbow_data)

def get_split(row_num):
    if row_num <= n * args.train_proportion:
        return 'train'
    elif (row_num > n * args.train_proportion) and (row_num <= n * args.train_proportion + n * args.val_proportion):
        return 'val'
    else:
        return 'test'

cbow_data['split'] = cbow_data.apply(lambda row: get_split(row.name), axis=1)


cbow_data.head()
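
The split DataFrame is then presumably written out to the output_munged_csv path defined in args, so that the Dataset class below can read it back from disk:

# Save the munged data to disk (path set in args earlier)
cbow_data.to_csv(args.output_munged_csv, index=False)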

 

4. Dataset class

Full CBOWDataset code:
# Imports assumed by this code (matching the book's notebook)
import json
import pandas as pd
from torch.utils.data import Dataset, DataLoader

class CBOWDataset(Dataset):
    # Constructor for CBOWDataset:
    # a CBOWDataset ultimately wraps the cbow dataset and its vectorizer!
    def __init__(self, cbow_df, vectorizer):
        """
        Args:
            cbow_df (pandas.DataFrame): the dataset
            vectorizer (CBOWVectorizer): a CBOWVectorizer object built from the dataset
        """
        self.cbow_df = cbow_df
        self._vectorizer = vectorizer
        
        measure_len = lambda context: len(context.split(" "))
        # the longest context in the cbow df becomes the sequence length
        self._max_seq_length = max(map(measure_len, cbow_df.context))
        
        self.train_df = self.cbow_df[self.cbow_df.split=='train']
        self.train_size = len(self.train_df)

        self.val_df = self.cbow_df[self.cbow_df.split=='val']
        self.validation_size = len(self.val_df)

        self.test_df = self.cbow_df[self.cbow_df.split=='test']
        self.test_size = len(self.test_df)

        self._lookup_dict = {'train': (self.train_df, self.train_size),
                             'val': (self.val_df, self.validation_size),
                             'test': (self.test_df, self.test_size)}

        self.set_split('train')

    @classmethod
    def load_dataset_and_make_vectorizer(cls, cbow_csv):
        """Load the dataset and build a new Vectorizer from scratch
        
        Args:
            cbow_csv (str): location of the dataset
        Returns:
            an instance of CBOWDataset
        """
        cbow_df = pd.read_csv(cbow_csv)
        train_cbow_df = cbow_df[cbow_df.split=='train']
        # from_dataframe : builds a vocab via the Vocabulary class and uses that
        #                  vocab to map each word to an integer.
        #                : the return value is cbow_vocab with integers mapped to words
        #                  --> i.e., this is the vectorizer, passed as the argument to cls.
        return cls(cbow_df, CBOWVectorizer.from_dataframe(train_cbow_df))

    @classmethod
    def load_dataset_and_load_vectorizer(cls, cbow_csv, vectorizer_filepath):
        """Loads the dataset and the corresponding CBOWVectorizer object.
        Used when the CBOWVectorizer object has been cached for reuse.
        
        Args:
            cbow_csv (str): location of the dataset
            vectorizer_filepath (str): location of the saved CBOWVectorizer object
        Returns:
            an instance of CBOWDataset
        """
        cbow_df = pd.read_csv(cbow_csv)
        # load_vectorizer_only : reads the vectorizer file and returns cbow_vocab
        #                        (words mapped to integers)
        #                          --> i.e., this is the vectorizer, passed as the argument to cls.
        vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
        return cls(cbow_df, vectorizer)

    @staticmethod
    def load_vectorizer_only(vectorizer_filepath):
        """Static method that loads a CBOWVectorizer object from a file
        
        Args:
            vectorizer_filepath (str): location of the serialized CBOWVectorizer object
        Returns:
            an instance of CBOWVectorizer
        """
        with open(vectorizer_filepath) as fp:
            # from_serializable() : takes the saved vocab and recovers the integer-mapped vocab.
            #                        ---> i.e., this is the vectorizer
            return CBOWVectorizer.from_serializable(json.load(fp))

    def save_vectorizer(self, vectorizer_filepath):
        """Saves the CBOWVectorizer object to disk as json
        Args:
            vectorizer_filepath (str): location to save the CBOWVectorizer object
        """
        with open(vectorizer_filepath, "w") as fp:
            json.dump(self._vectorizer.to_serializable(), fp)

    def get_vectorizer(self):
        """ Returns the vectorizer object """
        return self._vectorizer
        
    def set_split(self, split="train"):
        """ Selects a split using the split column of the dataframe """
        self._target_split = split
        self._target_df, self._target_size = self._lookup_dict[split]

    def __len__(self):
        return self._target_size

    def __getitem__(self, index):
        """The main entry method of a PyTorch dataset
        
        Args:
            index (int): index of the data point
        Returns:
            a dictionary holding the data point's features (x_data) and label (y_target)
        """
        row = self._target_df.iloc[index]

        context_vector = \
            self._vectorizer.vectorize(row.context, self._max_seq_length)
        target_index = self._vectorizer.cbow_vocab.lookup_token(row.target)

        return {'x_data': context_vector,
                'y_target': target_index}

    def get_num_batches(self, batch_size):
        """Given a batch size, returns the number of batches that can be made from the dataset
        
        Args:
            batch_size (int)
        Returns:
            number of batches
        """
        return len(self) // batch_size
    
def generate_batches(dataset, batch_size, shuffle=True,
                     drop_last=True, device="cpu"): 
    """
    A generator function wrapping the PyTorch DataLoader.
    Ensures each tensor is moved to the specified device.
    """
    dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
                            shuffle=shuffle, drop_last=drop_last)

    for data_dict in dataloader:
        out_data_dict = {}
        for name, tensor in data_dict.items():
            out_data_dict[name] = data_dict[name].to(device)
        yield out_data_dict

The dataset and targets built above are loaded as a pandas DataFrame and indexed through the CBOWDataset class.

4.1 A closer look at the Dataset class methods

โ–ถ ์ƒ์„ฑ์ž ๋ฉ”์„œ๋“œ __init__(self, cbow_df, vectorizer)

  • cbow ๋ฐ์ดํ„ฐ์…‹(train, val, test df)๊ณผ ๋ฐ์ดํ„ฐ์…‹์—์„œ ๋งŒ๋“   vectorizer ๊ฐ์ฒด๋ฅผ ์ƒ์„ฑํ•œ๋‹ค.
    # CBOWDataset์˜ ์ƒ์„ฑ์ž ๋ฉ”์„œ๋“œ
    # CBOWDataset์€ ๊ฒฐ๊ตญ cbow๋ฐ์ดํ„ฐ์…‹๊ณผ vectorizer๋ฅผ ์ƒ์„ฑํ•œ๋‹ค!
    def __init__(self, cbow_df, vectorizer):
        """
        ๋งค๊ฐœ๋ณ€์ˆ˜:
            cbow_df (pandas.DataFrame): ๋ฐ์ดํ„ฐ์…‹
            vectorizer (CBOWVectorizer): ๋ฐ์ดํ„ฐ์…‹์—์„œ ๋งŒ๋“  CBOWVectorizer ๊ฐ์ฒด
        """
        self.cbow_df = cbow_df
        self._vectorizer = vectorizer
        
        measure_len = lambda context: len(context.split(" "))
        # cbow df ๋ฐ์ดํ„ฐ์ค‘ ๊ฐ€์žฅ ๊ธด ๊ฒƒ์„ ์‹œํ€€์Šค์˜ ๊ธธ์ด๋กœ ์„ค์ •ํ•œ๋‹ค
        self._max_seq_length = max(map(measure_len, cbow_df.context))
        
        self.train_df = self.cbow_df[self.cbow_df.split=='train']
        self.train_size = len(self.train_df)

        self.val_df = self.cbow_df[self.cbow_df.split=='val']
        self.validation_size = len(self.val_df)

        self.test_df = self.cbow_df[self.cbow_df.split=='test']
        self.test_size = len(self.test_df)

        self._lookup_dict = {'train': (self.train_df, self.train_size),
                             'val': (self.val_df, self.validation_size),
                             'test': (self.test_df, self.test_size)}

        self.set_split('train')

 

โ–ถ load_dataset_and_make_vectorizer(cls, cbow_csv)
โ–ถ load_dataset_and_load_vectorizer(cls, cbow_csv, vectorizer_filepath)

  • make
    • Used when no vectorizer exists yet.
    • Takes the cbow csv file and builds a vectorizer object from it.
      • Looking at the return value, the vectorizer object is passed back into cls, so the return value is in fact an instance of the Dataset.
    • The cbow_csv parameter takes the location of the file.
    • The dataset is created like this: dataset = CBOWDataset.load_dataset_and_make_vectorizer(args.cbow_csv)
  • load
    • Used when a vectorizer object already exists.
    • Takes both the cbow csv file and the vectorizer file.
    • As with make, everything ends up in cls, so you can think of the return value as an instance of the dataset.
    • The two files are loaded like this: dataset = CBOWDataset.load_dataset_and_load_vectorizer(args.cbow_csv, args.vectorizer_file) (note the second path argument; the args name here is illustrative)
    @classmethod
    def load_dataset_and_make_vectorizer(cls, cbow_csv):
        """Load the dataset and build a new Vectorizer from scratch
        
        Args:
            cbow_csv (str): location of the dataset
        Returns:
            an instance of CBOWDataset
        """
        cbow_df = pd.read_csv(cbow_csv)
        train_cbow_df = cbow_df[cbow_df.split=='train']
        # from_dataframe : builds a vocab via the Vocabulary class and uses that
        #                  vocab to map each word to an integer.
        #                : the return value is cbow_vocab with integers mapped to words
        #                  --> i.e., this is the vectorizer, passed as the argument to cls.
        return cls(cbow_df, CBOWVectorizer.from_dataframe(train_cbow_df))

    @classmethod
    def load_dataset_and_load_vectorizer(cls, cbow_csv, vectorizer_filepath):
        """Loads the dataset and the corresponding CBOWVectorizer object.
        Used when the CBOWVectorizer object has been cached for reuse.
        
        Args:
            cbow_csv (str): location of the dataset
            vectorizer_filepath (str): location of the saved CBOWVectorizer object
        Returns:
            an instance of CBOWDataset
        """
        cbow_df = pd.read_csv(cbow_csv)
        # load_vectorizer_only : reads the vectorizer file and returns cbow_vocab
        #                        (words mapped to integers)
        #                          --> i.e., this is the vectorizer, passed as the argument to cls.
        vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
        return cls(cbow_df, vectorizer)

 

โ–ถ load_vectorizer_only(vectorizer_filepath)
โ–ถ save_vectorizer(self, vectorizer_filepath)
โ–ถ get_vectorizer(self)
โ–ถ set_split(self, split='train')
โ–ถ get_num_batches(self, batch_size)

    @staticmethod
    def load_vectorizer_only(vectorizer_filepath):
        """Static method that loads a CBOWVectorizer object from a file
        
        Args:
            vectorizer_filepath (str): location of the serialized CBOWVectorizer object
        Returns:
            an instance of CBOWVectorizer
        """
        with open(vectorizer_filepath) as fp:
            # from_serializable() : takes the saved vocab and recovers the integer-mapped vocab.
            #                        ---> i.e., this is the vectorizer
            return CBOWVectorizer.from_serializable(json.load(fp))

    def save_vectorizer(self, vectorizer_filepath):
        """Saves the CBOWVectorizer object to disk as json
        Args:
            vectorizer_filepath (str): location to save the CBOWVectorizer object
        """
        with open(vectorizer_filepath, "w") as fp:
            json.dump(self._vectorizer.to_serializable(), fp)

    def get_vectorizer(self):
        """ Returns the vectorizer object """
        return self._vectorizer
        
    def set_split(self, split="train"):
        """ Selects a split using the split column of the dataframe """
        self._target_split = split
        self._target_df, self._target_size = self._lookup_dict[split]
        
    def get_num_batches(self, batch_size):
        """Given a batch size, returns the number of batches that can be made from the dataset
        
        Args:
            batch_size (int)
        Returns:
            number of batches
        """
        return len(self) // batch_size

 

โ–ถ __len__ / __getitem__ methods

    def __len__(self):
        return self._target_size

    def __getitem__(self, index):
        """The main entry method of a PyTorch dataset
        
        Args:
            index (int): index of the data point
        Returns:
            a dictionary holding the data point's features (x_data) and label (y_target)
        """
        row = self._target_df.iloc[index]

        context_vector = \
            self._vectorizer.vectorize(row.context, self._max_seq_length)
        target_index = self._vectorizer.cbow_vocab.lookup_token(row.target)

        return {'x_data': context_vector,
                'y_target': target_index}
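
As a quick usage sketch (args.cbow_csv points at the CSV produced in section 3, as in the usage line above; the printed values are illustrative), indexing the dataset returns the vectorized context and the target's vocabulary index:

dataset = CBOWDataset.load_dataset_and_make_vectorizer(args.cbow_csv)
sample = dataset[0]
print(sample['x_data'])    # int64 array of context indices, mask-padded to max_seq_length
print(sample['y_target'])  # integer index of the target word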

 

2.2 ๋ณธ ์ฝ”๋“œ์—๋Š” DataLoader๋ฅผ ๊ฐ์‹ธ๊ณ  ์žˆ๋Š” ์ œ๋„ˆ๋ ˆ์ดํ„ฐ ํ•จ์ˆ˜๊ฐ€ ์ถ”๊ฐ€๋˜์–ด์žˆ๋‹ค.

def generate_batches(dataset, batch_size, shuffle=True,
                     drop_last=True, device="cpu"): 
    """
    A generator function wrapping the PyTorch DataLoader.
    Ensures each tensor is moved to the specified device.
    """
    dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
                            shuffle=shuffle, drop_last=drop_last)

    for data_dict in dataloader:
        out_data_dict = {}
        for name, tensor in data_dict.items():
            out_data_dict[name] = data_dict[name].to(device)
        yield out_data_dict
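
A minimal sketch of how a training loop would consume this generator (a batch_size of 32 is just an example; dataset is the one from the sketch above):

dataset.set_split('train')
for batch_dict in generate_batches(dataset, batch_size=32, device="cpu"):
    x = batch_dict['x_data']      # shape (32, max_seq_length)
    y = batch_dict['y_target']    # shape (32,)
    break  # inspect just the first batch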

 

5. Vectorizer class

  • A class that creates and manages the vocabulary
  • Builds and returns an integer vector of the context's token indices
  • When the context has fewer tokens than the maximum length, it performs zero padding: the remainder is filled with the mask index (which happens to be 0)

Full CBOWVectorizer code:
import numpy as np

class CBOWVectorizer(object):
    """ Creates and manages the vocabulary """
    def __init__(self, cbow_vocab):
        """
        Args:
            cbow_vocab (Vocabulary): maps words to integers
        """
        self.cbow_vocab = cbow_vocab

    def vectorize(self, context, vector_length=-1):
        """
        Args:
            context (str): a string of space-separated words
            vector_length (int): length of the index vector
        """

        indices = [self.cbow_vocab.lookup_token(token) for token in context.split(' ')]
        if vector_length < 0:
            vector_length = len(indices)

        out_vector = np.zeros(vector_length, dtype=np.int64)
        out_vector[:len(indices)] = indices
        out_vector[len(indices):] = self.cbow_vocab.mask_index

        return out_vector
    
    @classmethod
    def from_dataframe(cls, cbow_df):
        """Builds a Vectorizer object from the dataset dataframe
        
        Args:
            cbow_df (pandas.DataFrame): the target dataset
        Returns:
            a CBOWVectorizer object
        """
        cbow_vocab = Vocabulary()
        for index, row in cbow_df.iterrows():
            for token in row.context.split(' '):
                cbow_vocab.add_token(token)
            cbow_vocab.add_token(row.target)
            
        return cls(cbow_vocab)

    @classmethod
    def from_serializable(cls, contents):
        cbow_vocab = \
            Vocabulary.from_serializable(contents['cbow_vocab'])
        return cls(cbow_vocab=cbow_vocab)

    def to_serializable(self):
        return {'cbow_vocab': self.cbow_vocab.to_serializable()}

5.1 A closer look at the Vectorizer methods

โ–ถ Vectorizer constructor __init__(self, cbow_vocab)

  • Stores the cbow vocab, which maps words to integers.
    def __init__(self, cbow_vocab):
        """
        Args:
            cbow_vocab (Vocabulary): maps words to integers
        """
        self.cbow_vocab = cbow_vocab

 

โ–ถ vectorize(self, context, vector_length=-1)

  • Takes the context string and the desired index-vector length as parameters
    def vectorize(self, context, vector_length=-1):
        """
        Args:
            context (str): a string of space-separated words
            vector_length (int): length of the index vector
        """

        indices = [self.cbow_vocab.lookup_token(token) for token in context.split(' ')]
        if vector_length < 0:
            vector_length = len(indices)

        # first create a zero array of vector_length (this is the zero padding)
        out_vector = np.zeros(vector_length, dtype=np.int64)
        out_vector[:len(indices)] = indices
        out_vector[len(indices):] = self.cbow_vocab.mask_index

        return out_vector
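
To see the mask padding in action, a small sketch (using the dataset from the earlier sketch; the concrete index values are illustrative):

vectorizer = dataset.get_vectorizer()
print(vectorizer.vectorize("i am", vector_length=5))
# -> e.g. [12 47  0  0  0] : two token indices, then mask_index (0) padding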

 

โ–ถ from_dataframe(cls, cbow_df)

  • Takes the cbow df, a pandas DataFrame, as a parameter.
  • Creates a Vocabulary object and uses that vocab to map the words to integers.
    @classmethod
    def from_dataframe(cls, cbow_df):
        """Builds a Vectorizer object from the dataset dataframe
        
        Args:
            cbow_df (pandas.DataFrame): the target dataset
        Returns:
            a CBOWVectorizer object
        """
        cbow_vocab = Vocabulary()
        for index, row in cbow_df.iterrows():
            for token in row.context.split(' '):
                # add_token(token) : returns the index corresponding to the token
                cbow_vocab.add_token(token)
            cbow_vocab.add_token(row.target)
          
        return cls(cbow_vocab)

 

โ–ถ from_serializable(cls, contents)
โ–ถ to_serializable(self)

  • from
    • Takes serialized contents and recovers the integer-mapped vocab.
    • The return value is cls, so in the end it is a vectorizer object
  • to
    • Turns the vocab into a serializable dictionary
    @classmethod
    def from_serializable(cls, contents):
        cbow_vocab = \
            Vocabulary.from_serializable(contents['cbow_vocab'])
        return cls(cbow_vocab=cbow_vocab)

    def to_serializable(self):
        return {'cbow_vocab': self.cbow_vocab.to_serializable()}

 

6. Vocabulary class

A class that processes text and builds the vocabulary used for the mapping. Its structure and methods are identical to the Vocabulary class from the previous post, so I'll just attach the full code below and add a brief summary.

Full Vocabulary code:
class Vocabulary(object):
    """ A class that processes text and builds the vocabulary for mapping """

    def __init__(self, token_to_idx=None, mask_token="<MASK>", add_unk=True, unk_token="<UNK>"):
        """
        Args:
            token_to_idx (dict): a pre-existing token-to-index mapping dictionary
            mask_token (str): the MASK token to add to the Vocabulary.
                Marks positions that are not used for updating the model parameters.
            add_unk (bool): flag indicating whether to add the UNK token
            unk_token (str): the UNK token to add to the Vocabulary
        """

        if token_to_idx is None:
            token_to_idx = {}
        self._token_to_idx = token_to_idx

        self._idx_to_token = {idx: token 
                              for token, idx in self._token_to_idx.items()}
        
        self._add_unk = add_unk
        self._unk_token = unk_token
        self._mask_token = mask_token
        
        self.mask_index = self.add_token(self._mask_token)
        self.unk_index = -1
        if add_unk:
            self.unk_index = self.add_token(unk_token) 
        
    def to_serializable(self):
        """ Returns a dictionary that can be serialized """
        return {'token_to_idx': self._token_to_idx, 
                'add_unk': self._add_unk, 
                'unk_token': self._unk_token, 
                'mask_token': self._mask_token}

    @classmethod
    def from_serializable(cls, contents):
        """ Creates a Vocabulary object from a serialized dictionary """
        return cls(**contents)

    def add_token(self, token):
        """ Updates the mapping dictionaries based on the token

        Args:
            token (str): the token to add to the Vocabulary
        Returns:
            index (int): the integer corresponding to the token
        """
        if token in self._token_to_idx:
            index = self._token_to_idx[token]
        else:
            index = len(self._token_to_idx)
            self._token_to_idx[token] = index
            self._idx_to_token[index] = token
        return index
            
    def add_many(self, tokens):
        """ Adds a list of tokens to the Vocabulary.
        
        Args:
            tokens (list): a list of string tokens
        Returns:
            indices (list): the list of indices corresponding to the tokens
        """
        return [self.add_token(token) for token in tokens]

    def lookup_token(self, token):
        """ Retrieves the index corresponding to a token,
        or the UNK index if the token is not present.
        
        Args:
            token (str): the token to look up 
        Returns:
            index (int): the index corresponding to the token
        Note:
            `unk_index` must be >= 0 (i.e., the UNK token was added
            to the Vocabulary) for the UNK functionality to work.
        """
        if self.unk_index >= 0:
            return self._token_to_idx.get(token, self.unk_index)
        else:
            return self._token_to_idx[token]

    def lookup_index(self, index):
        """ Returns the token corresponding to an index.
        
        Args: 
            index (int): the index to look up
        Returns:
            token (str): the token corresponding to the index
        Raises:
            KeyError: if the index is not in the Vocabulary.
        """
        if index not in self._idx_to_token:
            raise KeyError("the index (%d) is not in the Vocabulary" % index)
        return self._idx_to_token[index]

    def __str__(self):
        return "<Vocabulary(size=%d)>" % len(self)

    def __len__(self):
        return len(self._token_to_idx)

โ–ท Summary of the methods in the Vocabulary class (see the round-trip sketch after this list)

  1. __init__(self, token_to_idx=None, mask_token="<MASK>", add_unk=True, unk_token="<UNK>")
    1. Parameters
      • token_to_idx (dict): a pre-existing token-to-index mapping dictionary
      • mask_token (str): the MASK token to add to the Vocabulary (marks positions not used for updating the model parameters)
      • add_unk (bool): flag indicating whether to add the UNK token
      • unk_token (str): the UNK token to add to the Vocabulary
  2. to_serializable(self)
    1. Returns a dictionary that can be serialized
      • of the form {'token_to_idx': self._token_to_idx, 'add_unk': self._add_unk, 'unk_token': self._unk_token, 'mask_token': self._mask_token}
  3. @classmethod from_serializable(cls, contents)
    1. Creates a Vocabulary object from a serialized dictionary
  4. add_token(self, token)
    1. A method for adding a new token
    2. Updates the mapping dictionaries based on the token
    3. The token parameter is the token to add to the Vocabulary
    4. Returns index: the integer corresponding to the token
  5. add_many(self, tokens)
    1. Adds a list of tokens to the Vocabulary
    2. tokens is a list of strings
    3. The return value is the list of indices corresponding to tokens
  6. lookup_token(self, token)
    1. A method that retrieves the index corresponding to a token
    2. The token parameter is the token to look up
    3. The return value is that token's index
  7. lookup_index(self, index)
    1. A method that finds the token corresponding to an index
    2. The index parameter is the index to look up
    3. The return value is the token at that index

 

 

<NEXT> In the next post, I'll use these classes to build and train the model.

https://didu-story.tistory.com/102

 

[NLP] Training CBOW Embeddings with PyTorch (2): Model Training

-- ๋ณธ ํฌ์ŠคํŒ…์€ ํŒŒ์ดํ† ์น˜๋กœ ๋ฐฐ์šฐ๋Š” ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ (ํ•œ๋น›๋ฏธ๋””์–ด) ์ฑ…์„ ์ฐธ๊ณ ํ•ด์„œ ์ž‘์„ฑ๋œ ๊ธ€์ž…๋‹ˆ๋‹ค. -- ์†Œ์Šค์ฝ”๋“œ๋Š” ์—ฌ๊ธฐ <์ด์ „๊ธ€> [NLP] Pytorch๋ฅผ ํ™œ์šฉํ•˜์—ฌ CBOW ์ž„๋ฒ ๋”ฉ ํ•™์Šตํ•˜๊ธฐ (1) -- ๋ณธ ํฌ์ŠคํŒ…์€ ํŒŒ์ดํ† 

didu-story.tistory.com

 

๋ฐ˜์‘ํ˜•