

[NLP] Training CBOW Embeddings with PyTorch (2): Model Training

๊ฐ์ž ๐Ÿฅ” 2021. 7. 29. 19:41
๋ฐ˜์‘ํ˜•

-- ๋ณธ ํฌ์ŠคํŒ…์€ ํŒŒ์ดํ† ์น˜๋กœ ๋ฐฐ์šฐ๋Š” ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ (ํ•œ๋น›๋ฏธ๋””์–ด) ์ฑ…์„ ์ฐธ๊ณ ํ•ด์„œ ์ž‘์„ฑ๋œ ๊ธ€์ž…๋‹ˆ๋‹ค.
-- ์†Œ์Šค์ฝ”๋“œ๋Š” ์—ฌ๊ธฐ

 

<Previous post>

[NLP] Training CBOW Embeddings with PyTorch (1)

didu-story.tistory.com

 

1. ๋ชจ๋ธ ์ƒ์„ฑ

  1. embedding์ธต์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ฌธ๋งฅ์˜ ๋‹จ์–ด๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ์ธ๋ฑ์Šค๋ฅผ ๊ฐ ๋‹จ์–ด์— ๋Œ€ํ•œ ๋ฒกํ„ฐ๋กœ ๋งŒ๋“ ๋‹ค.
  2. ์ „๋ฐ˜์ ์ธ ๋ฌธ๋งฅ์„ ๊ฐ์ง€ํ•˜๋„๋ก ๋ฒกํ„ฐ๋ฅผ ๊ฒฐํ•ฉํ•œ๋‹ค.
  3. Linear ์ธต์—์„œ ๋ฌธ๋งฅ ๋ฒกํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์˜ˆ์ธก ๋ฒกํ„ฐ๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค.
    • ์ด ์˜ˆ์ธก๋ฒกํ„ฐ๋Š” ์ „์ฒด ์–ดํœ˜ ์‚ฌ์ „์— ๋Œ€ํ•œ ํ™•๋ฅ  ๋ถ„ํฌ์ด๋‹ค.
    • ์˜ˆ์ธก๋ฒกํ„ฐ์—์„œ ๊ฐ€์žฅ ํฐ ๊ฐ’์ด ํƒ€๊ฒŸ ๋‹จ์–ด์— ๋Œ€ํ•œ ์˜ˆ์ธก์„ ๋‚˜ํƒ€๋‚ธ๋‹ค.
class CBOWClassifier(nn.Module): # Simplified CBOW model
    def __init__(self, vocabulary_size, embedding_size, padding_idx=0):
        """
        Controlled by three parameters:
            vocabulary_size (int): vocabulary size; determines the number of
                embeddings and the size of the prediction vector
            embedding_size (int): size of the embeddings
            padding_idx (int): defaults to 0; the embedding never uses this index
        """
        super(CBOWClassifier, self).__init__()
        
        # padding_idx defaults to 0. It is the parameter used for padding in
        # the embedding layer when the data points do not all have the same length.
        self.embedding = nn.Embedding(num_embeddings=vocabulary_size, 
                                      embedding_dim=embedding_size,
                                      padding_idx=padding_idx)
        # The Linear layer uses the context vector to compute the prediction vector.
        # The prediction vector is a probability distribution over the vocabulary.
        self.fc1 = nn.Linear(in_features=embedding_size,
                             out_features=vocabulary_size)

    def forward(self, x_in, apply_softmax=False):
        """The forward pass of the classifier
        
        Args:
            x_in (torch.Tensor): an input data tensor;
                x_in.shape should be (batch, input_dim)
            apply_softmax (bool): a flag for the softmax activation;
                should be False if used with the cross-entropy loss
        Returns:
            the resulting tensor. tensor.shape is (batch, output_dim).
        """
        # Sum the context-word embeddings, then apply dropout.
        # Note: F.dropout defaults to training=True, so we pass
        # training=self.training to disable dropout during evaluation.
        x_embedded_sum = F.dropout(self.embedding(x_in).sum(dim=1),
                                   p=0.3, training=self.training)
        y_out = self.fc1(x_embedded_sum)
        
        # The softmax is computed only optionally: it wastes computation and
        # can be numerically unstable when combined with the loss.
        if apply_softmax:
            y_out = F.softmax(y_out, dim=1)
            
        return y_out
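
To make the shapes concrete, here is a small standalone sketch (toy numbers of my own, not from the book) that runs a dummy batch through the classifier:

# Toy example: a vocabulary of 10 words, 4-dimensional embeddings, and a
# batch of 2 contexts, each a row of 6 token indices into the vocabulary.
import torch

model = CBOWClassifier(vocabulary_size=10, embedding_size=4)
model.eval()  # disable dropout so the output is deterministic
x_in = torch.tensor([[1, 2, 0, 0, 3, 4],
                     [5, 6, 7, 8, 9, 1]])   # shape (batch=2, context_len=6)
probs = model(x_in, apply_softmax=True)
print(probs.shape)        # torch.Size([2, 10]): one distribution per example
print(probs.sum(dim=1))   # each row sums to 1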

 

2. ๋ชจ๋ธ ํ›ˆ๋ จ

2.1 ํ—ฌํผํ•จ์ˆ˜ ์ •์˜

  • make_train_state : a function that represents the current training progress. In the training loop later, you can see the state being updated continuously as the model trains.
  • update_train_state : used to find the best model. Every time the model completes an epoch, the state is updated, and the updated values decide when to save the model. It is used to track improving loss and keep the best model.
  • compute_accuracy : a helper function used to compute accuracy.
# ํ˜„์žฌ ํŠธ๋ ˆ์ธ ์ƒํƒœ๋ฅผ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ๋Š” ํ•จ์ˆ˜
# ๋’ค์— ๊ฐ€๋ฉด ๋ชจ๋ธ์„ ํ›ˆ๋ จ ๋„์ค‘ ์—…๋ฐ์ดํŠธ ํ•˜๊ณ  ์ตœ์ƒ์˜ ๋ชจ๋ธ์„ ์ฐพ์•„๋‚ด๋Š”๋ฐ ์‚ฌ์šฉ๋œ๋‹ค.
# ์†์‹ค ์ €์žฅ, ์—ํฌํฌ ์ €์žฅ ๋“ฑ๋“ฑ 
def make_train_state(args):
    return {'stop_early': False,
            'early_stopping_step': 0,
            'early_stopping_best_val': 1e8,
            'learning_rate': args.learning_rate,
            'epoch_index': 0,
            'train_loss': [],
            'train_acc': [],
            'val_loss': [],
            'val_acc': [],
            'test_loss': -1,
            'test_acc': -1,
            'model_filename': args.model_state_file}


# ๋’ค์— ๊ฐ€๋ฉด ์†์‹ค์ด ์ ๊ฒŒ๋‚˜์˜ค๋Š” ์ตœ์ ์˜ ๋ชจ๋ธ์„ ์ฐพ์„ ๊ฑด๋ฐ ํ•ด๋‹น ๋ชจ๋ธ์„ ์ฐพ๊ธฐ์œ„ํ•ด
# ์ง€์†์ ์œผ๋กœ ์ƒํƒœ๋ฅผ ์—…๋ฐ์ดํŠธ ํ•ด์ฃผ๊ธฐ ์œ„ํ•ด ์ด ํ•จ์ˆ˜ ์‚ฌ์šฉ
# ๋ชจ๋ธ์ด ํ•œ๋ฒˆ ๋Œ์•„๊ฐ€๋ฉด, ์—…๋ฐ์ดํŠธ ํ•ด์ฃผ๋Š” ์ž‘์—…์„ ํ•  ๊ฑด๋ฐ, ๊ทธ๋•Œ 
# ์ด์ „์— ์ €์žฅ๋˜์–ด ์žˆ๋Š” train state์™€ ์„ฑ๋Šฅ์„ ๋น„๊ตํ•˜์—ฌ ์ตœ์ ์˜ ๋ชจ๋ธ์„ ์ฐพ์„ ๊ฒƒ
def update_train_state(args, model, train_state):
    """ ํ›ˆ๋ จ ์ƒํƒœ๋ฅผ ์—…๋ฐ์ดํŠธํ•ฉ๋‹ˆ๋‹ค.

    Components:
     - ์กฐ๊ธฐ ์ข…๋ฃŒ: ๊ณผ๋Œ€ ์ ํ•ฉ ๋ฐฉ์ง€
     - ๋ชจ๋ธ ์ฒดํฌํฌ์ธํŠธ: ๋” ๋‚˜์€ ๋ชจ๋ธ์„ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค

    :param args: ๋ฉ”์ธ ๋งค๊ฐœ๋ณ€์ˆ˜
    :param model: ํ›ˆ๋ จํ•  ๋ชจ๋ธ
    :param train_state: ํ›ˆ๋ จ ์ƒํƒœ๋ฅผ ๋‹ด์€ ๋”•์…”๋„ˆ๋ฆฌ
    :returns:
        ์ƒˆ๋กœ์šด ํ›ˆ๋ จ ์ƒํƒœ
    """

    # ์ ์–ด๋„ ํ•œ ๋ฒˆ ๋ชจ๋ธ์„ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค
    if train_state['epoch_index'] == 0:
        torch.save(model.state_dict(), train_state['model_filename'])
        train_state['stop_early'] = False

    # ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋˜๋ฉด ๋ชจ๋ธ์„ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค
    elif train_state['epoch_index'] >= 1:
        loss_tm1, loss_t = train_state['val_loss'][-2:]

        # ์†์‹ค์ด ๋‚˜๋น ์ง€๋ฉด
        if loss_t >= train_state['early_stopping_best_val']:
            # ์กฐ๊ธฐ ์ข…๋ฃŒ ๋‹จ๊ณ„ ์—…๋ฐ์ดํŠธ
            train_state['early_stopping_step'] += 1
        # ์†์‹ค์ด ๊ฐ์†Œํ•˜๋ฉด
        else:
            # ์ตœ์ƒ์˜ ๋ชจ๋ธ ์ €์žฅ
            if loss_t < train_state['early_stopping_best_val']:
                torch.save(model.state_dict(), train_state['model_filename'])

            # ์กฐ๊ธฐ ์ข…๋ฃŒ ๋‹จ๊ณ„ ์žฌ์„ค์ •
            train_state['early_stopping_step'] = 0

        # ์กฐ๊ธฐ ์ข…๋ฃŒ ์—ฌ๋ถ€ ํ™•์ธ
        train_state['stop_early'] = \
            train_state['early_stopping_step'] >= args.early_stopping_criteria

    return train_state
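
To see the early-stopping mechanics concretely, here is a small simulation with made-up validation losses and a stand-in nn.Linear model (so torch.save has weights to write); none of these values come from the actual training run:

import torch
import torch.nn as nn
from argparse import Namespace

demo_args = Namespace(learning_rate=0.0001,
                      model_state_file="demo_model.pth",
                      early_stopping_criteria=3)
demo_model = nn.Linear(4, 2)  # stand-in model
state = make_train_state(demo_args)

# The loss improves twice, then keeps worsening; with criteria=3, stop_early
# should flip to True on the third consecutive non-improving epoch.
for epoch, val_loss in enumerate([5.0, 4.0, 4.5, 4.6, 4.7]):
    state['epoch_index'] = epoch
    state['val_loss'].append(val_loss)
    state = update_train_state(demo_args, demo_model, state)
    print(epoch, state['early_stopping_step'], state['stop_early'])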


# ์ •ํ™•๋„๋ฅผ ๊ณ„์‚ฐํ•ด์ฃผ๋Š” ํ•จ์ˆ˜๋ฅผ ์ •์˜ํ•œ๋‹ค.
def compute_accuracy(y_pred, y_target):
    _, y_pred_indices = y_pred.max(dim=1)
    n_correct = torch.eq(y_pred_indices, y_target).sum().item()
    return n_correct / len(y_pred_indices) * 100
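
A quick check with toy logits (hypothetical values of my own) shows how it works: the predicted classes are the argmax over dim=1.

import torch

y_pred = torch.tensor([[0.1, 0.9],    # predicts class 1
                       [0.8, 0.2],    # predicts class 0
                       [0.3, 0.7]])   # predicts class 1
y_target = torch.tensor([1, 0, 0])
print(compute_accuracy(y_pred, y_target))  # 2 of 3 correct -> 66.66...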

 

2.2 Defining general utility functions

# Used to reset the random seed for reproducibility.
# Repeating an experiment with the same seed lets you check whether the results
# reported in the original run reappear (almost) exactly.
def set_seed_everywhere(seed, cuda):
    np.random.seed(seed)
    torch.manual_seed(seed)
    if cuda:
        torch.cuda.manual_seed_all(seed)

# A function for handling directories
def handle_dirs(dirpath):
    if not os.path.exists(dirpath):
        os.makedirs(dirpath)

 

2.3 Settings and preprocessing (environment setup)

args = Namespace(
    # Data and path information
    cbow_csv="data/books/frankenstein_with_splits.csv",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir="model_storage/ch5/cbow",
    # ๋ชจ๋ธ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ
    embedding_size=50,
    # ํ›ˆ๋ จ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ
    seed=1337,
    num_epochs=100,
    learning_rate=0.0001,
    batch_size=32,
    early_stopping_criteria=5,
    # Runtime options
    cuda=True,
    catch_keyboard_interrupt=True,
    reload_from_files=False,
    expand_filepaths_to_save_dir=True
)

if args.expand_filepaths_to_save_dir:
    args.vectorizer_file = os.path.join(args.save_dir,
                                        args.vectorizer_file)

    args.model_state_file = os.path.join(args.save_dir,
                                         args.model_state_file)
    
    print("ํŒŒ์ผ ๊ฒฝ๋กœ: ")
    print("\t{}".format(args.vectorizer_file))
    print("\t{}".format(args.model_state_file))
    

# CUDA ์ฒดํฌ
if not torch.cuda.is_available():
    args.cuda = False

args.device = torch.device("cuda" if args.cuda else "cpu")
    
print("CUDA ์‚ฌ์šฉ์—ฌ๋ถ€: {}".format(args.cuda))

# ์žฌํ˜„์„ฑ์„ ์œ„ํ•ด ์‹œ๋“œ ์„ค์ •
set_seed_everywhere(args.seed, args.cuda)

# Handle directories
handle_dirs(args.save_dir)

 

2.4 ๋ฐ์ดํ„ฐ ์ดˆ๊ธฐํ™”

  • ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ›์•„์ค„ ๊ฒƒ์ด๋‹ค.
  • ์ด๋ฏธ ์ „์ฒ˜๋ฆฌ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค.
  • ์ด ์ฑ…์—์„œ ๋ผ์ดํŠธ ๋ฒ„์ „์˜ ์ „์ฒ˜๋ฆฌ๋ž€, ํฌ๊ธฐ๋ฅผ ์ข€ ์ค„์—ฌ์„œ ์ „์ฒ˜๋ฆฌ๋˜์–ด์ง„ ๋ฐ์ดํ„ฐ์…‹์„ ์˜๋ฏธํ•œ๋‹ค.
# If you are running on Colab, run the code below to download the preprocessed lite version of the data.
!mkdir data
!wget https://git.io/JtX5A -O data/download.py
!wget https://git.io/JtX5F -O data/get-all-data.sh
!chmod 755 data/get-all-data.sh
%cd data
!./get-all-data.sh
%cd ..

 

2.5 ๋ฐ์ดํ„ฐ์…‹, vectorizer ์ƒ์„ฑ

args์— ์ €์žฅ๋œ ์ •๋ณด๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ๋ฐ์ดํ„ฐ์…‹๊ณผ vectorizer๋ฅผ ๋งŒ๋“ค์–ด์ค€๋‹ค. ์šฐ๋ฆฌ๋Š” ์ด๋ฏธ ์‚ฌ์ „์— ์ƒ์„ฑ๋œ vectorizer ๊ฐ์ฒด๊ฐ€ ์—†๊ธฐ ๋•Œ๋ฌธ์— else๋ฌธ์ด ์‹คํ–‰๋œ๋‹ค.

# ์ด๋ฏธ ์ƒ์„ฑ๋˜์–ด์žˆ๋Š” ๋ฐ์ดํ„ฐ์…‹๊ณผ vectorizer๊ฐ์ฒด๊ฐ€ ์กด์žฌํ•œ๋‹ค๋ฉด ๊ทธ ๋ถ€๋ถ„์—์„œ ๋ฐ์ดํ„ฐ์…‹์„ ๋ฐ›์•„์™€์„œ ๋‹ค์‹œ ํ›ˆ๋ จ ์‹œ์ž‘ 
# args.reload_from_files = False ๋กœ ์ง€์ •ํ–ˆ์Œ
# ์•„๋‹ˆ๋ฉด ๋ฐ์ดํ„ฐ์…‹๊ณผ ๋ฒกํ† ๋ผ์ด์ € ๋งŒ๋“ค๊ธฐ

''' A quick look at the args settings!!
    # Model hyperparameters
    embedding_size=50,
    # Training hyperparameters
    seed=1337,
    num_epochs=100,
    learning_rate=0.0001,
    batch_size=32,
    early_stopping_criteria=5,
'''
if args.reload_from_files:
    print("๋ฐ์ดํ„ฐ์…‹๊ณผ Vectorizer๋ฅผ ๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค")
    dataset = CBOWDataset.load_dataset_and_load_vectorizer(args.cbow_csv,
                                                           args.vectorizer_file)
else:
    print("๋ฐ์ดํ„ฐ์…‹์„ ๋กœ๋“œํ•˜๊ณ  Vectorizer๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค")
    dataset = CBOWDataset.load_dataset_and_make_vectorizer(args.cbow_csv)
    # save_vectorizer: saves the vectorizer to disk in JSON form
    dataset.save_vectorizer(args.vectorizer_file)
     
# get_vectorizer: returns the vectorizer object
vectorizer = dataset.get_vectorizer()

# ๋ชจ๋ธ ์„ค์ • (embedingsize = 50)
classifier = CBOWClassifier(vocabulary_size=len(vectorizer.cbow_vocab), 
                            embedding_size=args.embedding_size)
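
As a quick sanity check (my own snippet, not from the book), the model's parameter count can be verified by hand: the embedding holds vocabulary_size × embedding_size weights, and the linear layer holds embedding_size × vocabulary_size weights plus vocabulary_size biases.

vocab_size = len(vectorizer.cbow_vocab)
expected = (vocab_size * args.embedding_size       # embedding weights
            + args.embedding_size * vocab_size     # fc1 weights
            + vocab_size)                          # fc1 biases
actual = sum(p.numel() for p in classifier.parameters())
print(expected, actual)  # the two numbers should match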

 

2.6 The training loop, updating the training state each epoch

classifier = classifier.to(args.device)

# ๋‹ค์ค‘ํด๋ž˜์Šค๋ถ„๋ฅ˜์— ์ƒ์š”๋˜๋Š”CrossEntropyLoss ์‚ฌ์šฉ
# ์˜ตํ‹ฐ๋งˆ์ด์ € adam    
loss_func = nn.CrossEntropyLoss()
optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer,
                                                 mode='min', factor=0.5,
                                                 patience=1)

# Create the training state (helper function)
train_state = make_train_state(args)

# ์—ํฌํฌ๋Š” 100
epoch_bar = tqdm.notebook.tqdm(desc='training routine', 
                               total=args.num_epochs,
                               position=0)

# set_split method: selects a split using the corresponding column in the dataframe
dataset.set_split('train')

# get_num_batches method: given a batch size, returns the number of batches that can be made from the dataset
train_bar = tqdm.notebook.tqdm(desc='split=train',
                               total=dataset.get_num_batches(args.batch_size), 
                               position=1, 
                               leave=True)
dataset.set_split('val')
val_bar = tqdm.notebook.tqdm(desc='split=val',
                             total=dataset.get_num_batches(args.batch_size), 
                             position=1, 
                             leave=True)
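
Note that generate_batches, used in the loops below, is not defined in this post; it is one of the book's helper functions. A minimal sketch under that assumption: it wraps a PyTorch DataLoader and moves every tensor in each batch dictionary to the target device.

from torch.utils.data import DataLoader

def generate_batches(dataset, batch_size, shuffle=True,
                     drop_last=True, device="cpu"):
    """Wrap a DataLoader and move each tensor in the batch dict to device."""
    dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
                            shuffle=shuffle, drop_last=drop_last)
    for data_dict in dataloader:
        yield {name: tensor.to(device) for name, tensor in data_dict.items()}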

# Outer for loop: runs once per epoch (100 epochs).
# Inner for loop: iterates over the batches.
try:
    for epoch_index in range(args.num_epochs):
        train_state['epoch_index'] = epoch_index

        # Iterate over the training set

        # Set up the training set and batch generator; set loss and accuracy to 0
        dataset.set_split('train')
        batch_generator = generate_batches(dataset, 
                                           batch_size=args.batch_size, 
                                           device=args.device)
        running_loss = 0.0
        running_acc = 0.0
        classifier.train()

        for batch_index, batch_dict in enumerate(batch_generator):
            # The training routine consists of these 5 steps:

            # --------------------------------------
            # Step 1. Zero the gradients
            optimizer.zero_grad()

            # Step 2. Compute the output
            y_pred = classifier(x_in=batch_dict['x_data'])

            # Step 3. Compute the loss
            loss = loss_func(y_pred, batch_dict['y_target'])
            loss_t = loss.item()
            running_loss += (loss_t - running_loss) / (batch_index + 1)

            # Step 4. Use the loss to compute gradients
            loss.backward()

            # Step 5. Use the optimizer to update the weights
            optimizer.step()
            # -----------------------------------------
            
            # Compute the accuracy
            acc_t = compute_accuracy(y_pred, batch_dict['y_target'])
            running_acc += (acc_t - running_acc) / (batch_index + 1)

            # Update the progress bar
            train_bar.set_postfix(loss=running_loss, acc=running_acc, 
                            epoch=epoch_index)
            train_bar.update()

        train_state['train_loss'].append(running_loss)
        train_state['train_acc'].append(running_acc)

        # ๊ฒ€์ฆ ์„ธํŠธ์— ๋Œ€ํ•œ ์ˆœํšŒ

        # ๊ฒ€์ฆ ์„ธํŠธ์™€ ๋ฐฐ์น˜ ์ œ๋„ˆ๋ ˆ์ดํ„ฐ ์ค€๋น„, ์†์‹ค๊ณผ ์ •ํ™•๋„๋ฅผ 0์œผ๋กœ ์„ค์ •
        dataset.set_split('val')
        batch_generator = generate_batches(dataset, 
                                           batch_size=args.batch_size, 
                                           device=args.device)
        running_loss = 0.
        running_acc = 0.
        classifier.eval()

        for batch_index, batch_dict in enumerate(batch_generator):

            # ๋‹จ๊ณ„ 1. ์ถœ๋ ฅ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
            y_pred =  classifier(x_in=batch_dict['x_data'])

            # ๋‹จ๊ณ„ 2. ์†์‹ค์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
            loss = loss_func(y_pred, batch_dict['y_target'])
            loss_t = loss.item()
            running_loss += (loss_t - running_loss) / (batch_index + 1)

            # ๋‹จ๊ณ„ 3. ์ •ํ™•๋„๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
            acc_t = compute_accuracy(y_pred, batch_dict['y_target'])
            running_acc += (acc_t - running_acc) / (batch_index + 1)
            val_bar.set_postfix(loss=running_loss, acc=running_acc, 
                            epoch=epoch_index)
            val_bar.update()

        train_state['val_loss'].append(running_loss)
        train_state['val_acc'].append(running_acc)

        train_state = update_train_state(args=args, model=classifier,
                                         train_state=train_state)

        scheduler.step(train_state['val_loss'][-1])

        if train_state['stop_early']:
            break

        train_bar.n = 0
        val_bar.n = 0
        epoch_bar.update()
except KeyboardInterrupt:
    print("Exiting loop")

 

2.7 Computing accuracy with the best model

  • test ๋ฐ์ดํ„ฐ์…‹์„ ํ™œ์šฉํ•ด์„œ ์ •ํ™•๋„๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค.
# ๊ฐ€์žฅ ์ข‹์€ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•ด ํ…Œ์ŠคํŠธ ์„ธํŠธ์˜ ์†์‹ค๊ณผ ์ •ํ™•๋„๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
# train_state model filename์— ๊ฐ€์žฅ ์ข‹์€ ๋ชจ๋ธ์ด ์ €์žฅ๋˜์–ด ์žˆ๋‹ค.
classifier.load_state_dict(torch.load(train_state['model_filename']))
classifier = classifier.to(args.device)
loss_func = nn.CrossEntropyLoss()

# test ๋ฐ์ดํ„ฐ์…‹์„ ์„ ํƒ
dataset.set_split('test')
batch_generator = generate_batches(dataset, 
                                   batch_size=args.batch_size, 
                                   device=args.device)
running_loss = 0.
running_acc = 0.
classifier.eval()

for batch_index, batch_dict in enumerate(batch_generator):
    # Compute the output
    y_pred = classifier(x_in=batch_dict['x_data'])
    
    # Compute the loss
    loss = loss_func(y_pred, batch_dict['y_target'])
    loss_t = loss.item()
    running_loss += (loss_t - running_loss) / (batch_index + 1)

    # Compute the accuracy
    acc_t = compute_accuracy(y_pred, batch_dict['y_target'])
    running_acc += (acc_t - running_acc) / (batch_index + 1)

train_state['test_loss'] = running_loss
train_state['test_acc'] = running_acc

print("ํ…Œ์ŠคํŠธ ์†์‹ค: {};".format(train_state['test_loss']))
print("ํ…Œ์ŠคํŠธ ์ •ํ™•๋„: {}".format(train_state['test_acc']))

(์ด๊ฒŒ ๋ง์ด ๋˜๋Š” ์ •ํ™•๋„์ธ๊ฐ€..?!)

  • ๊ฒฐ๊ณผ๊ฐ’์ด ๋†’์ง€ ์•Š์€ ์ด์œ 
    1. ์ด ์˜ˆ์ œ์—์„œ๋Š” ๋ฒ”์šฉ์ ์ธ ์ž„๋ฒ ๋”ฉ์„ ๊ตฌ์„ฑํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์„ค๋ช…ํ•˜๊ธฐ ํŽธํ•˜๋„๋ก ์—ฌ๋Ÿฌ๊ฐ€์ง€ ๊ธฐ๋Šฅ์„ ์ƒ๋žตํ•˜์—ฌ CBOW ์‹ค์Šต์„ ์ง„ํ–‰ํ–ˆ๋‹ค. (์„ฑ๋Šฅ ์ตœ์ ํ™”์— ํ•„์ˆ˜์ ์ธ ์—ฌ๋Ÿฌ๊ฐ€์ง€ ๊ธฐ๋Šฅ์„ ์ƒ๋žตํ•˜๊ณ  ๋‹จ์ˆœ ๊ตฌํ˜„ํ•œ ๊ฒƒ์ด๋‹ค.)
    2. ์—ฌ๊ธฐ์„œ ์‚ฌ์šฉํ•˜๋Š” ๋ฐ์ดํ„ฐ์…‹์˜ ํฌ๊ธฐ๊ฐ€ ์•„์ฃผ ์ž‘๋‹ค. (๋‹จ์–ด์˜ ์ˆ˜๋Š” ์•ฝ 70000๊ฐœ ๋ฟ์ด๋ผ์„œ, zero base์—์„œ ๊ทœ์น™์„ฑ์„ ๊ฐ์ง€ํ•˜๊ธฐ์— ์ถฉ๋ถ„ํ•˜์ง€ ๋ชปํ•˜๋‹ค. 

 

3. Inspecting the embedding results

We write helper functions for printing the embedding results; they exist purely to display the learned embeddings nicely.
We compute distances and print the words closest to the input word.

def pretty_print(results):
    """
    Prints the embedding results.
    """
    for item in results:
        print ("...[%.2f] - %s"%(item[1], item[0]))
        
# Pass the desired number of neighbors as n
def get_closest(target_word, word_to_idx, embeddings, n=5):
    """
    Finds the n closest words.
    """

    # Compute the distance to every other word
    word_embedding = embeddings[word_to_idx[target_word.lower()]]
    distances = []
    for word, index in word_to_idx.items():
        if word == "<MASK>" or word == target_word:
            continue
        distances.append((word, torch.dist(word_embedding, embeddings[index])))
    
    # The target word and <MASK> are already skipped above, so the first n
    # entries are the n nearest neighbors. (The original slice [1:n+2] would
    # skip the closest word and return n+1 results.)
    results = sorted(distances, key=lambda x: x[1])[:n]
    return results

word = input('Enter a word: ')
embeddings = classifier.embedding.weight.data
word_to_idx = vectorizer.cbow_vocab._token_to_idx
pretty_print(get_closest(word, word_to_idx, embeddings, n=5))

target_words = ['frankenstein', 'monster', 'science', 'sickness', 'lonely', 'happy']

embeddings = classifier.embedding.weight.data
word_to_idx = vectorizer.cbow_vocab._token_to_idx

for target_word in target_words: 
    print(f"======={target_word}=======")
    if target_word not in word_to_idx:
        print("Not in vocabulary")
        continue
    pretty_print(get_closest(target_word, word_to_idx, embeddings, n=5))
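
get_closest ranks neighbors by Euclidean distance (torch.dist). Cosine similarity is another common choice for comparing word vectors; a variant of my own (hypothetical name get_closest_cosine, same inputs) could look like this:

# Hypothetical alternative: rank neighbors by cosine similarity instead of
# L2 distance. Higher similarity means closer, so sort in descending order.
def get_closest_cosine(target_word, word_to_idx, embeddings, n=5):
    word_embedding = embeddings[word_to_idx[target_word.lower()]]
    sims = []
    for word, index in word_to_idx.items():
        if word == "<MASK>" or word == target_word:
            continue
        sim = F.cosine_similarity(word_embedding, embeddings[index], dim=0)
        sims.append((word, sim.item()))
    return sorted(sims, key=lambda x: x[1], reverse=True)[:n]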

 

 

์ด๋ฒˆ ์˜ˆ์ œ์—์„œ๋Š” nn.Embedding ์ธต์„ ์‚ฌ์šฉํ•˜์—ฌ CBOW ๋ถ„๋ฅ˜ ์ง€๋„ํ•™์Šต ์ž‘์—…์œผ๋กœ ๋ฐ‘๋ฐ”๋‹ฅ๋ถ€ํ„ฐ ์ž„๋ฒ ๋”ฉ์„ ํ›ˆ๋ จ์‹œ์ผœ ๋ณด์•˜๋‹ค. ์•ž์„œ ๋ณด์•˜๋“ฏ์ด ๋ฐ‘๋ฐ”๋‹ฅ๋ถ€ํ„ฐ ํ›ˆ๋ จ์‹œํ‚ค๊ธฐ์— ์ถฉ๋ถ„ํ•˜์ง€ ๋ชปํ•œ ๋ฐ์ดํ„ฐ์˜ ํฌ๊ธฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์–ด์„œ ์ •ํ™•๋„๊ฐ€ 15%๋กœ ๋งค์šฐ ๋‚ฎ๋‹ค. ๋‹ค์Œ ๊ณต๋ถ€์—์„œ๋Š” ๋ง๋ญ‰์น˜์— ์ฃผ์–ด์ง„ '์‚ฌ์ „ ํ›ˆ๋ จ๋œ ์ž„๋ฒ ๋”ฉ'์„ ๋ฏธ์„ธ ์กฐ์ •ํ•ด์„œ ์‚ฌ์šฉํ•˜๋Š” '์ „์ดํ•™์Šต'๋ฐฉ๋ฒ•์„ ์‚ดํŽด๋ณผ ๊ฒƒ์ด๋‹ค. ๋ฐ‘๋ฐ”๋‹ฅ ๋ถ€ํ„ฐ ํ›ˆ๋ จ์‹œํ‚ค๋Š” ๊ฒƒ์€ ๋งค์šฐ ๋งŽ์€ ๋ฐ์ดํ„ฐ์…‹์ด ํ•„์š”ํ•˜๊ธฐ์— ์š”์ฆ˜์—๋Š” '์ „์ดํ•™์Šต'๊ณผ ๊ฐ™์ด ํ›ˆ๋ จ๋œ ๋ชจ๋ธ์„ ๋‹ค๋ฅธ ์ž‘์—…์˜ ์ดˆ๊ธฐ ๋ชจ๋ธ๋กœ ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ์‹์ด ๋งŽ์ด ์‚ฌ์šฉ๋œ๋‹ค.

๋ฐ˜์‘ํ˜•