
[NLP] ๋ ˆ์Šคํ† ๋ž‘ ๋ฆฌ๋ทฐ ๊ฐ์„ฑ ๋ถ„๋ฅ˜ํ•˜๊ธฐ (2) (feat.ํŒŒ์ดํ† ์น˜๋กœ ๋ฐฐ์šฐ๋Š” ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ) - ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ ํด๋ž˜์Šค ์‚ดํŽด๋ณด๊ธฐ

๊ฐ์ž ๐Ÿฅ” 2021. 7. 22. 18:48
๋ฐ˜์‘ํ˜•

-- ๋ณธ ํฌ์ŠคํŒ…์€ ํŒŒ์ดํ† ์น˜๋กœ ๋ฐฐ์šฐ๋Š” ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ (ํ•œ๋น›๋ฏธ๋””์–ด) ์ฑ…์„ ์ฐธ๊ณ ํ•ด์„œ ์ž‘์„ฑ๋œ ๊ธ€์ž…๋‹ˆ๋‹ค.
-- ์†Œ์Šค์ฝ”๋“œ ) https://github.com/rickiepark/nlp-with-pytorch

 


 

โ–ถ ๋ ˆ์Šคํ† ๋ž‘ ๋ฆฌ๋ทฐ ๊ฐ์„ฑ ๋ถ„๋ฅ˜ํ•˜๊ธฐ

์•„๋ž˜ ํฌ์ŠคํŒ…์—์„œ ์–ด๋–ป๊ฒŒ ์ „์ฒ˜๋ฆฌ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜๊ฒŒ ๋๋Š”์ง€ ๋ฏธ๋ฆฌ ํ•™์Šตํ•˜๋ฉด ์ข‹์„ ๊ฒƒ์ด๋‹ค. ํ•˜์ง€๋งŒ ํ•ด๋‹น ์ฝ”๋“œ์—์„œ๋Š” '์ „์ฒ˜๋ฆฌ ๊ณผ์ •'์ด ์ƒ๋žต๋œ '์ „์ฒ˜๋ฆฌ ๋˜์–ด์ง„ ๋ฐ์ดํ„ฐ'๋ฅผ ํ™œ์šฉํ•˜๊ณ  ์žˆ๋‹ค๋Š” ๊ฒƒ๋งŒ ์•Œ์•„๋‘์ž.

https://didu-story.tistory.com/83

 

[NLP] ๋ ˆ์Šคํ† ๋ž‘ ๋ฆฌ๋ทฐ ๊ฐ์„ฑ ๋ถ„๋ฅ˜ํ•˜๊ธฐ (1) (feat.ํŒŒ์ดํ† ์น˜๋กœ ๋ฐฐ์šฐ๋Š” ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ)

-- ๋ณธ ํฌ์ŠคํŒ…์€ ํŒŒ์ดํ† ์น˜๋กœ ๋ฐฐ์šฐ๋Š” ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ (ํ•œ๋น›๋ฏธ๋””์–ด) ์ฑ…์„ ์ฐธ๊ณ ํ•ด์„œ ์ž‘์„ฑ๋œ ๊ธ€์ž…๋‹ˆ๋‹ค. -- ์†Œ์Šค์ฝ”๋“œ ) https://github.com/rickiepark/nlp-with-pytorch (ํ•œ๋น›๋ฏธ๋””์–ด, 2021)์˜ ์†Œ์Šค ์ฝ”๋“œ๋ฅผ ์œ„ํ•œ ์ €์žฅ์†Œ์ž…..

didu-story.tistory.com

 

 

1. ํŒŒ์ดํ† ์น˜ ๋ฐ์ดํ„ฐ์…‹ ์ดํ•ดํ•˜๊ธฐ

ํ•ด๋‹น ํ”„๋กœ์ ํŠธ๋Š” pytorch ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์ง„ํ–‰๋œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ํด๋ž˜์Šค ๊ฐ์ฒด๋ฅผ ํ™œ์šฉํ•ด์„œ ์ฃผ์š”ํ•œ ํŒŒ์ดํ”„๋ผ์ธ์„ ์ˆ˜ํ–‰ํ•˜๊ฒŒ ๋œ๋‹ค. ํŒŒ์ด์ฌ์˜ ํด๋ž˜์Šค๊ฐ์ฒด์— ๋Œ€ํ•œ ์ดํ•ด๊ฐ€ ๋ถ€์กฑํ•˜๋‹ค๋ฉด ์•„๋ž˜ ํฌ์ŠคํŒ…์„ ์ฐธ๊ณ ํ•˜๊ณ  ์ดํ•ดํ•˜๊ณ  ์˜ค์ž. ๋˜ํ•œ ํŒŒ์ดํ† ์น˜๋Š” Dataset, DataLoader ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋ฐ์ดํ„ฐ์˜ ํ˜•ํƒœ๋ฅผ ๋ณ€ํ™˜์‹œ์ผœ์ฃผ๋Š”๋ฐ, ์ด์—๋Œ€ํ•œ ๊ฒƒ๋„ ์•„๋ž˜ ํฌ์ŠคํŒ…์„ ์ฐธ๊ณ ํ•˜๋ฉด ์ดํ•ด๊ฐ€ ๋” ์‰ฌ์šธ ๊ฒƒ์ด๋‹ค.

https://didu-story.tistory.com/85

 

[Pytorch] ํŒŒ์ดํ† ์น˜์˜ Custom dataset๊ณผ DataLoader ์ดํ•ดํ•˜๊ธฐ

1. ํŒŒ์ดํ† ์น˜์˜ Custom dataset / DataLoader 1.1 Custom Dataset ์„ ์‚ฌ์šฉํ•˜๋Š” ์ด์œ  ๋ฐฉ๋Œ€ํ•œ ๋ฐ์ดํ„ฐ์˜ ์–‘ --> ๋ฐ์ดํ„ฐ๋ฅผ ํ•œ ๋ฒˆ์— ๋ถˆ๋Ÿฌ์˜ค๊ธฐ ์‰ฝ์ง€ ์•Š์Œ ๋ฐ์ดํ„ฐ๋ฅผ ํ•œ ๋ฒˆ์— ๋ถ€๋ฅด์ง€ ์•Š๊ณ  ํ•˜๋‚˜์”ฉ๋งŒ ๋ถˆ๋Ÿฌ์„œ ์“ฐ๋Š” ๋ฐฉ์‹์„ ํƒ

didu-story.tistory.com

https://didu-story.tistory.com/84

 

[ํŒŒ์ด์ฌ] ํด๋ž˜์Šค ๊ฐ์ฒด

1. ํด๋ž˜์Šค(class)๋ž€ ํด๋ž˜์Šค๋Š” ๊ฐ์ฒด์˜ ๊ตฌ์กฐ์™€ ํ–‰๋™์„ ์ •์˜ํ•œ๋‹ค ๊ฐ์ฒด์˜ ํด๋ž˜์Šค๋Š” ์ดˆ๊ธฐํ™”๋ฅผ ํ†ตํ•ด ์ œ์–ดํ•œ๋‹ค. (__init__) ๊ฐ์ฒด์ง€ํ–ฅ ํ”„๋กœ๊ทธ๋ž˜๋ฐ, ๋ณต์žกํ•œ ๋ฌธ์ œ๋ฅผ ๊ฐ„๋‹จํ•˜๊ฒŒ ํ•ด๊ฒฐํ•  ์žˆ๋‹ค๋Š” ์žฅ์  ์กด์žฌ 2. ํด๋ž˜์Šค

didu-story.tistory.com

 

1.1 Review Dataset

  • ReviewDataset ํด๋ž˜์Šค๋Š” ๋ฐ์ดํ„ฐ์…‹์ด ์ตœ์†Œํ•œ์œผ๋กœ ์ •์ œ๋˜๊ณ  3๊ฐœ๋กœ ๋‚˜๋‰˜์—ˆ๋‹ค๊ณ  ๊ฐ€์ •ํ•œ๋‹ค.
  • ํŠนํžˆ ํ•ด๋‹น ๋ฐ์ดํ„ฐ์…‹์€ ๊ณต๋ฐฑ์„ ๊ธฐ์ค€์œผ๋กœ ๋‚˜๋ˆ ์„œ ํ† ํฐ ๋ฆฌ์ŠคํŠธ๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ๋‹ค๊ณ  ๊ฐ€์ •ํ•œ๋‹ค.
  • ์ƒ˜ํ”Œ์ด ํ›ˆ๋ จ, ๊ฒ€์ฆ, ํ…Œ์ŠคํŠธ ์ค‘ ์–ด๋Š ์„ธํŠธ์— ์žˆ๋Š”์ง€ ํ‘œ์‹œ๋˜์—ˆ๋‹ค๊ณ  ๊ฐ€์ •ํ•œ๋‹ค.
    • ํ•ด๋‹น ๋ฐ์ดํ„ฐ์…‹์˜ ์ „์ฒ˜๋ฆฌ ๊ณผ์ •์€ ์•„๋ž˜ ๋งํฌ ์ฐธ๊ณ 
    • https://didu-story.tistory.com/83
import json

import pandas as pd
from torch.utils.data import Dataset


class ReviewDataset(Dataset):
    
    # Runs when a class instance is created (during initialization).
    # What is self? It refers to the instance (the object created from
    # the class) itself; attributes prefixed with self are instance variables.
    def __init__(self, review_df, vectorizer):
        """
        Args:
            review_df (pandas.DataFrame): the dataset
            vectorizer (ReviewVectorizer): a ReviewVectorizer object
        """
        self.review_df = review_df
        self._vectorizer = vectorizer

        self.train_df = self.review_df[self.review_df.split=='train']
        self.train_size = len(self.train_df)

        self.val_df = self.review_df[self.review_df.split=='val']
        self.validation_size = len(self.val_df)

        self.test_df = self.review_df[self.review_df.split=='test']
        self.test_size = len(self.test_df)

        self._lookup_dict = {'train': (self.train_df, self.train_size),
                             'val': (self.val_df, self.validation_size),
                             'test': (self.test_df, self.test_size)}

        self.set_split('train')

    # A class method (note the decorator) -- used to work with class-level state.
    # It receives a cls argument; cls refers to the ReviewDataset class itself.
    @classmethod
    def load_dataset_and_make_vectorizer(cls, review_csv):
        # ReviewVectorizer (defined below) creates and manages the vocabularies.
        """ Loads the dataset and makes a new ReviewVectorizer object
        
        Args:
            review_csv (str): location of the dataset
        Returns:
            an instance of ReviewDataset
        """
        review_df = pd.read_csv(review_csv)  # review_csv points to the already-cleaned data
        train_review_df = review_df[review_df.split=='train']
        return cls(review_df, ReviewVectorizer.from_dataframe(train_review_df))
    
    @classmethod
    def load_dataset_and_load_vectorizer(cls, review_csv, vectorizer_filepath):
        """ Loads the dataset and the corresponding ReviewVectorizer object.
        Used when the ReviewVectorizer object has been cached for reuse.
        
        Args:
            review_csv (str): location of the dataset
            vectorizer_filepath (str): location of the saved ReviewVectorizer object
        Returns:
            an instance of ReviewDataset
        """
        review_df = pd.read_csv(review_csv)
        vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
        return cls(review_df, vectorizer)

    @staticmethod
    def load_vectorizer_only(vectorizer_filepath):
        """ A static method that loads a ReviewVectorizer object from a file
        
        Args:
            vectorizer_filepath (str): location of the serialized ReviewVectorizer object
        Returns:
            an instance of ReviewVectorizer
        """
        with open(vectorizer_filepath) as fp:
            return ReviewVectorizer.from_serializable(json.load(fp))

    def save_vectorizer(self, vectorizer_filepath):
        """ Saves the ReviewVectorizer object to disk as JSON
        
        Args:
            vectorizer_filepath (str): location to save the ReviewVectorizer object
        """
        with open(vectorizer_filepath, "w") as fp:
            json.dump(self._vectorizer.to_serializable(), fp)

    def get_vectorizer(self):
        """ Returns the vectorizer object """
        return self._vectorizer

    def set_split(self, split="train"):
        """ Selects the dataset split using a column in the DataFrame 
        
        Args:
            split (str): one of "train", "val", or "test"
        """
        self._target_split = split
        self._target_df, self._target_size = self._lookup_dict[split]

    def __len__(self):
        return self._target_size

    def __getitem__(self, index):
        """ The main entry-point method for PyTorch datasets
        
        Args:
            index (int): index of the data point
        Returns:
            a dictionary holding the data point's features (x_data) and label (y_target)
        """
        row = self._target_df.iloc[index]

        review_vector = \
            self._vectorizer.vectorize(row.review)

        rating_index = \
            self._vectorizer.rating_vocab.lookup_token(row.rating)

        return {'x_data': review_vector,
                'y_target': rating_index}

    def get_num_batches(self, batch_size):
        """ Given a batch size, returns the number of batches that can be made from the dataset
        
        Args:
            batch_size (int)
        Returns:
            number of batches
        """
        return len(self) // batch_size
  • ํŒŒ์ดํ† ์น˜์—์„œ ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ์…‹์„ ์‚ฌ์šฉํ•˜๋ ค๋ฉด ๋จผ์ € Dataset ํด๋ž˜์Šค๋ฅผ ์ƒ์†ํ•˜์—ฌ __getitem()__, __len__() ๋ฉ”์„œ๋“œ๋ฅผ ๊ตฌํ˜„ํ•ด์•ผํ•œ๋‹ค.
    • from torch.utils.data import Dataset, DataLoader
  • ํ•ด๋‹น ํด๋ž˜์Šค ์•ˆ์—์„œ๋Š” ๋‹ค์–‘ํ•œ ํŒŒ์ดํ† ์น˜ ์œ ํ‹ธ๋ฆฌํ‹ฐ๋ฅผ ์‚ฌ์šฉํ•˜๋Š”๋ฐ, DataLoader / ReviewVectorizer ๊ณผ ๊ฐ™์€ ํด๋ž˜์Šค๋Š” ์•„๋ž˜์— ๋‚˜์˜จ๋‹ค. (ํด๋ž˜์Šค๋“ค์€ ์„œ๋กœ ํฌ๊ฒŒ ์˜์กดํ•œ๋‹ค.)
    • ReviewVectorizer => ๋ฆฌ๋ทฐํ…์ŠคํŠธ๋ฅผ ์ˆ˜์น˜๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ํด๋ž˜์Šค
    • DataLoader => ๋ฐ์ดํ„ฐ์…‹์—์„œ ์ƒ˜ํ”Œ๋งํ•˜๊ณ  ๋ชจ์•„์„œ ๋ฏธ๋‹ˆ๋ฐฐ์น˜๋ฅผ ๋งŒ๋“œ๋Š” ํด๋ž˜์Šค

โ–ท ์—ฌ๊ธฐ์„œ ์ž ๊น, Dataset ํด๋ž˜์Šค๋ฅผ ์ƒ์†ํ•œ๋‹ค๋Š” ๋œป์ด ๋ฌด์—‡์ผ๊นŒ?

  • torch.utils.data.Dataset ์€ ๋ฐ์ดํ„ฐ์…‹์„ ๋‚˜ํƒ€๋‚ด๋Š” '์ถ”์ƒ ํด๋ž˜์Šค'์ด๋‹ค.
  • ์šฐ๋ฆฌ๋งŒ์˜ ๋ฐ์ดํ„ฐ์…‹์€ Dataset์— ์ƒ์†ํ•˜๊ณ , ์•„๋ž˜์™€ ๊ฐ™์ด ์˜ค๋ฒ„๋ผ์ด๋“œ ํ•ด์•ผํ•œ๋‹ค.
    • len(dataset)์—์„œ ํ˜ธ์ถœ๋˜๋Š” __len__์€ ๋ฐ์ดํ„ฐ์…‹์˜ ํฌ๊ธฐ๋ฅผ ๋ฆฌํ„ดํ•œ๋‹ค.
    • dataset[i]์—์„œ ํ˜ธ์ถœ๋˜๋Š” __getitem__์€ i๋ฒˆ์งธ ์ƒ˜ํ”Œ์„ ์ฐพ๋Š”๋ฐ ์‚ฌ์šฉ๋œ๋‹ค.
  • ์ฆ‰, __init__์„ ํ†ตํ•ด ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•ด์„œ ๊ฐ€์ ธ์˜ค์ง€๋งŒ, __getitem__์„ ์ด์šฉํ•ด์„œ ๋ฐ์ดํ„ฐ๋ฅผ ํ•˜๋‚˜์”ฉ ํŒ๋…ํ•œ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  ์ตœ์ข… ๋ฆฌํ„ด๊ฐ’์€{'x_data': review_vector, 'y_target': rating_index} ํ˜•ํƒœ์˜ ์‚ฌ์ „ ํ˜•ํƒœ๋ฅผ ๊ฐ€์ง„๋‹ค.

 

โ–ท ReviewDataset ์ปค์Šคํ…€ ๋ฐ์ดํ„ฐ์…‹์— ๋“ฑ์žฅํ•˜๋Š” ํ•จ์ˆ˜ ์ •๋ฆฌ

  1. __init__(self, review_df, Vectorizer)
    1. ๋ฐ์ดํ„ฐ์…‹ (review_df)
    2. ๋ฒกํ„ฐํ™”ํ•ด์ฃผ๋Š” ๊ฐ์ฒด(vectorizer)
    3. train_df / train_size / val_df / val_size / test_df / test_size ๋ณ€์ˆ˜๋ฅผ ์ •์˜
  2. @classmethod load_dataset_and_make_vectorizer(cls, review_csv) 
    1. ๋ฐ์ดํ„ฐ์…‹์„ ๋กœ๋“œํ•˜๊ณ  ์ƒˆ๋กœ์šด ReviewVectorizer ๊ฐ์ฒด๋ฅผ ๋งŒ๋“ค์–ด์ฃผ๋Š” ํ•จ์ˆ˜
    2. review_csv๋ผ๋Š” ๋ฐ์ดํ„ฐ์…‹์˜ ์œ„์น˜๋ฅผ ๋งค๊ฐœ๋ณ€์ˆ˜๋กœ ๋ฐ›๋Š”๋‹ค
    3. ReviewDataset(reveiw_df, ReviewVectorizer.from_dataset(train_review_df) ํ•œ ๊ฒฐ๊ณผ๊ฐ’์„ ๋ฐ˜ํ™˜
      • ReviewVectorizer.from_dataset ๋Š” ๋ฐ์ดํ„ฐ์…‹ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์—์„œ Vectorizer ๊ฐ์ฒด๋ฅผ ๋งŒ๋“œ๋Š” ๋ฉ”์„œ๋“œ
        ๋ฐ˜ํ™˜๊ฐ’์ด ReviewVectorizer ๊ฐ์ฒด์ด๋‹ค.
  3. @staticmethod load_vectorizer_only(vetorizer_filepath)
    1. ํŒŒ์ผ์—์„œ ReviewVectorizer ๊ฐ์ฒด๋ฅผ ๋กœ๋“œํ•˜๋Š” ์ •์  ๋ฉ”์„œ๋“œ
    2. ReviewVectorizer.from_serializable(json.load(fp)) ๋ฅผ ๋ฐ˜ํ™˜ํ•œ๋‹ค. 
      • ReviewVectorizer.from_serializable ๋Š” ์ง๋ ฌํ™”๋œ ๋”•์…”๋„ˆ๋ฆฌ(json.load(fp))์—์„œ ReviewVectorizer ๊ฐ์ฒด๋ฅผ ๋งŒ๋“œ๋Š” ๋ฉ”์„œ๋“œ
  4. save_vectorizer(self, vectorizer_filepath)
    1. ReviewVectorizer ๊ฐ์ฒด๋ฅผ jsonํ˜•ํƒœ๋กœ ๋””์Šคํฌ์— ์ €์žฅํ•˜๋Š” ๋ฉ”์„œ๋“œ
  5. get_vectorizer(self)
    1. ๋ฒกํ„ฐ ๋ณ€ํ™˜ ๊ฐ์ฒด๋ฅผ ๋ฐ˜ํ™˜ํ•˜๋Š” ๋ฉ”์„œ๋“œself._vectorizer ๊ฐ’์„ ๋ฐ˜ํ™˜
  6. set_split(self, split="train")
    1. ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์— ์žˆ๋Š” ์—ด์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ถ„ํ•  ์„ธํŠธ๋ฅผ ์„ ํƒ 
      "train", "test", "val" ์ค‘ ํ•˜๋‚˜
  7. __len__(self)
  8. __getitem(self, index)
    1. {"x_data" : review_vector, "y_target" : rating_index} ๋กœ ์ด๋ฃจ์–ด์ง„ dicํ˜•ํƒœ๋กœ ๋ฐ˜ํ™˜
  9. get_num_batches(self, batch_size)
    1. ๋ฐฐ์น˜ ํฌ๊ธฐ๊ฐ€ ์ฃผ์–ด์ง€๋ฉด ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ๋งŒ๋“ค ์ˆ˜์žˆ๋Š” ๋ฐฐ์น˜ ๊ฐœ์ˆ˜๋ฅผ ๋ฐ˜ํ™˜

 

1.2 The Vocabulary, Vectorizer, and DataLoader Classes

  • ํ•ด๋‹น ์˜ˆ์ œ๋Š” ์œ„ ์„ธ๊ฐœ์˜ ํด๋ž˜์Šค๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์ค‘์š”ํ•œ ํŒŒ์ดํ”„๋ผ์ธ์„ ์ˆ˜ํ–‰ํ•œ๋‹ค.
  • ํ…์ŠคํŠธ์˜ ์ž…๋ ฅ์„ ๋ฒกํ„ฐ์˜ ๋ฏธ๋‹ˆ๋ฐฐ์น˜๋กœ ๋ฐ”๊ฟ”์ฃผ๊ธฐ ์œ„ํ•ด ํ•ด๋‹น ํด๋ž˜์Šค๋“ค์„ ์‚ฌ์šฉํ•œ๋‹ค.
    • ์ด ์„ธ๊ฐœ์˜ ํด๋ž˜์Šค๋Š” ๊ฐ ํ† ํฐ์„ ์ •์ˆ˜์— ๋งคํ•‘ํ•˜๊ณ , ์ด ๋งคํ•‘์„ ๊ฐ ๋ฐ์ดํ„ฐํฌ์ธํŠธ์— ์ ์šฉํ•˜์—ฌ ๋ฒกํ„ฐ ํ˜•ํƒœ๋กœ ๋ณ€ํ™˜ํ•ด์ค€๋‹ค. ๊ทธ ๋‹ค์Œ ๋ฒกํ„ฐ๋กœ ๋ณ€ํ™˜ํ•œ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ๋ฅผ ๋ชจ๋ธ์„ ์œ„ํ•ด ๋ฏธ๋‹ˆ๋ฐฐ์น˜๋กœ ๋ชจ์€๋‹ค.
  • ์ „์ฒ˜๋ฆฌ ๋œ ๋ฐ์ดํ„ฐ(ํ…์ŠคํŠธ)๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. (์ฆ‰ ๋ฐ์ดํ„ฐํฌ์ธํŠธ๋Š” ํ† ํฐํ™”๋œ ์ง‘ํ•ฉ์ด๋‹ค.

 

1.2.1 The Vocabulary Class

  • The class that manages the token-to-integer mapping needed by the machine-learning pipeline.
  • Maps tokens to integers (the first step in turning a batch of text into minibatches).
  • A 1:1 mapping between tokens and integers (plus the reverse mapping, so two dictionaries are needed).
  • The two dictionaries are encapsulated in the Vocabulary class.
  • Words not in the training set are handled as UNK (unknown).
    • Rarely occurring tokens are filtered out => those tokens end up mapped to UNK.
class Vocabulary(object):
    """ A class that processes text and builds a vocabulary for mapping """

    def __init__(self, token_to_idx=None, add_unk=True, unk_token="<UNK>"):
        """
        Args:
            token_to_idx (dict): a pre-existing token-to-index mapping dictionary
            add_unk (bool): flag indicating whether to add the UNK token
            unk_token (str): the UNK token to add to the Vocabulary
        """

        if token_to_idx is None:
            token_to_idx = {}
        self._token_to_idx = token_to_idx

        self._idx_to_token = {idx: token 
                              for token, idx in self._token_to_idx.items()}
        
        self._add_unk = add_unk
        self._unk_token = unk_token
        
        self.unk_index = -1
        if add_unk:
            self.unk_index = self.add_token(unk_token) 
        
        
    def to_serializable(self):
        """ Returns a dictionary that can be serialized """
        return {'token_to_idx': self._token_to_idx, 
                'add_unk': self._add_unk, 
                'unk_token': self._unk_token}

    @classmethod
    def from_serializable(cls, contents):
        """ Makes a Vocabulary object from a serialized dictionary """
        return cls(**contents)

    # Adds a new token to the vocabulary
    def add_token(self, token):
        """ Updates the mapping dictionaries based on the token

        Args:
            token (str): the token to add to the Vocabulary
        Returns:
            index (int): the integer corresponding to the token
        """
        if token in self._token_to_idx:
            index = self._token_to_idx[token]
        else:
            index = len(self._token_to_idx)
            self._token_to_idx[token] = index
            self._idx_to_token[index] = token
        return index
    
    def add_many(self, tokens):
        """ Adds a list of tokens to the Vocabulary.
        
        Args:
            tokens (list): a list of string tokens
        Returns:
            indices (list): the indices corresponding to the tokens
        """
        return [self.add_token(token) for token in tokens]

    # Retrieves the index associated with a token
    def lookup_token(self, token):
        """ Retrieves the index associated with the token,
        or the UNK index if the token is not present.
        
        Args:
            token (str): the token to look up 
        Returns:
            index (int): the index corresponding to the token
        Note:
            `unk_index` must be >= 0 (having been added to
            the Vocabulary) for the UNK functionality to work.
        """
        if self.unk_index >= 0:
            return self._token_to_idx.get(token, self.unk_index)
        else:
            return self._token_to_idx[token]

    # Retrieves the token associated with an index
    def lookup_index(self, index):
        """ Returns the token associated with the index.
        
        Args: 
            index (int): the index to look up
        Returns:
            token (str): the token corresponding to the index
        Raises:
            KeyError: if the index is not in the Vocabulary
        """
        if index not in self._idx_to_token:
            raise KeyError("the index (%d) is not in the Vocabulary" % index)
        return self._idx_to_token[index]

    def __str__(self):
        return "<Vocabulary(size=%d)>" % len(self)

    def __len__(self):
        return len(self._token_to_idx)

โ–ท Vocabulary ํด๋ž˜์Šค์— ๋“ฑ์žฅํ•˜๋Š” ํ•จ์ˆ˜ ์ •๋ฆฌ

  1. __init__(self, token_to_idx=None, add_unk=True, unk_token="<UNK>"
    1. ๋งค๊ฐœ๋ณ€์ˆ˜ ์„ค๋ช…
      • token_to_idx (dict) : ๊ธฐ์กด ํ† ํฐ-์ธ๋ฑ์Šค ๋งคํ•‘ ๋”•์…”๋„ˆ๋ฆฌ
      • add_unk (bool) : UNK ํ† ํฐ์„ ์ถ”๊ฐ€ํ• ์ง€ ์ง€์ •ํ•˜๋Š” ํ”Œ๋ž˜๊ทธ
      • unk_token (str) : Vocabulary์— ์ถ”๊ฐ€ํ•   UNK ํ† ํฐ
  2. to_serializable(self)
    1. ์ง๋ ฌํ™” ํ•  ์ˆ˜ ์žˆ๋Š” ๋”•์…”๋„ˆ๋ฆฌ๋ฅผ ๋ฐ˜ํ™˜
      • {'token_to_idx'self._token_to_idx, 'add_unk'self._add_unk, 'unk_token'self._unk_token} ํ˜•ํƒœ
  3. @classmethod from_serializable(cls, contents)
    1. ์ง๋ ฌํ™”๋œ ๋”•์…”๋„ˆ๋ฆฌ์—์„œ vocabulary ๊ฐ์ฒด๋ฅผ ์ƒ์„ฑ
  4. add_token(self, token)
    1. ์ƒˆ๋กœ์šด ํ† ํฐ์„ ์ถ”๊ฐ€ํ•˜๊ธฐ ์œ„ํ•œ ํ•จ์ˆ˜
    2. ํ† ํฐ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๋งคํ•‘ ๋”•์…”๋„ˆ๋ฆฌ๋ฅผ ์—…๋ฐ์ดํŠธํ•ด์ค€๋‹ค
    3. ๋งค๊ฐœ๋ณ€์ˆ˜ token ์ด Vocabulary์— ์ถ”๊ฐ€ํ•  ํ† ํฐ์ด ๋œ๋‹ค.
    4. return index : ํ† ํฐ์— ์ƒ์‘ํ•˜๋Š” ์ธ๋ฑ์Šค๊ฐ€ ๋ฐ˜ํ™˜๊ฐ’์œผ๋กœ ์ถœ๋ ฅ
  5.  add_many(self, tokens)
    1. ํ† ํฐ ๋ฆฌ์ŠคํŠธ๋ฅผ vocabulary์— ์ถ”๊ฐ€
    2. tokens ๋Š” ๋ฌธ์ž์—ด list
    3. ๋ฐ˜ํ™˜๊ฐ’๋„ tokens ์— ์ƒ์‘ํ•˜๋Š” ์ธ๋ฑ์Šค๊ฐ’
  6. lookup_token(self, token)
    1. ํ† ํฐ์— ๋Œ€์‘ํ•˜๋Š” ์ธ๋ฑ์Šค๋ฅผ ์ถ”์ถœํ•˜๋Š” ๋ฉ”์„œ๋“œ
    2. ๋งค๊ฐœ๋ณ€์ˆ˜ token ์ด ์ฐพ์„ ํ† ํฐ์ด๊ณ 
    3. ๋ฐ˜ํ™˜๊ฐ’์€ token์ด ๊ฐ–๋Š” ์ธ๋ฑ์Šค๊ฐ’
  7. lookup_index(self, index)
    1. ์ธ๋ฑ์Šค์— ๋Œ€์‘ํ•˜๋Š” ํ† ํฐ์„ ์ฐพ๋Š” ๋ฉ”์„œ๋“œ
    2. ๋งค๊ฐœ๋ณ€์ˆ˜ index๊ฐ€ ์ฐพ์„ ์ธ๋ฑ์Šค
    3. ๋ฐ˜ํ™˜๊ฐ’์€ index ์— ํ•ด๋‹นํ•˜๋Š” ํ† ํฐ

 

1.2.2 The Vectorizer Class

  • The class that converts text into numeric vectors, and that creates and manages the vocabularies.
  • The second step in turning text data into minibatches of vectors.
  • Iterates over the tokens of an input data point and converts each token to an integer.
  • The output of this class is a vector.
  • Always outputs a vector of the same fixed length.
import string
from collections import Counter

import numpy as np


class ReviewVectorizer(object):
    """ Creates and manages the vocabularies """
    def __init__(self, review_vocab, rating_vocab):
        """
        Args:
            review_vocab (Vocabulary): maps words to integers
            rating_vocab (Vocabulary): maps class labels to integers
        """
        self.review_vocab = review_vocab
        self.rating_vocab = rating_vocab

    # The vectorize method encapsulates the core functionality of this class.
    # It takes a review string (review) as an argument and returns the vector
    # representation of that review, with a 1 at the position of every word
    # that appears in the review.
    def vectorize(self, review):
        """ Creates a collapsed one-hot vector for the review
        
        Args:
            review (str): the review
        Returns:
            one_hot (np.ndarray): the one-hot vector
        """
        one_hot = np.zeros(len(self.review_vocab), dtype=np.float32)
        
        for token in review.split(" "):
            if token not in string.punctuation:
                one_hot[self.review_vocab.lookup_token(token)] = 1

        return one_hot

    # The decorator marks from_dataframe() as an entry point for
    # instantiating the Vectorizer class.
    # It iterates over the pandas DataFrame and performs two tasks:
    # 1. counts the frequency of every token in the dataset
    # 2. creates a Vocabulary object that keeps only tokens more frequent
    #    than the keyword argument cutoff --
    #    every word appearing more than cutoff times is added to the Vocabulary object

    @classmethod
    def from_dataframe(cls, review_df, cutoff=25):
        """ Makes a Vectorizer object from the dataset DataFrame
        
        Args:
            review_df (pandas.DataFrame): the review dataset
            cutoff (int): threshold for frequency-based filtering
        Returns:
            a ReviewVectorizer object
        """
        review_vocab = Vocabulary(add_unk=True)
        rating_vocab = Vocabulary(add_unk=False)
        
        # Add the ratings
        for rating in sorted(set(review_df.rating)):
            rating_vocab.add_token(rating)

        # Add words with count > cutoff
        word_counts = Counter()
        for review in review_df.review:
            for word in review.split(" "):
                if word not in string.punctuation:
                    word_counts[word] += 1
               
        for word, count in word_counts.items():
            if count > cutoff:
                review_vocab.add_token(word)

        return cls(review_vocab, rating_vocab)

    @classmethod
    def from_serializable(cls, contents):
        """ Makes a ReviewVectorizer object from a serialized dictionary
        
        Args:
            contents (dict): the serialized dictionary
        Returns:
            an instance of the ReviewVectorizer class
        """
        review_vocab = Vocabulary.from_serializable(contents['review_vocab'])
        rating_vocab = Vocabulary.from_serializable(contents['rating_vocab'])

        return cls(review_vocab=review_vocab, rating_vocab=rating_vocab)

    def to_serializable(self):
        """ Makes a serialized dictionary for caching
        
        Returns:
            contents (dict): the serialized dictionary
        """
        return {'review_vocab': self.review_vocab.to_serializable(),
                'rating_vocab': self.rating_vocab.to_serializable()}
  • ํ•ด๋‹น vectorize ๋ฉ”์„œ๋“œ๋ฅผ ๋ณด๋ฉด ์›ํ•ซ๋ฐฑํ„ฐ๋กœ ํ•ด๋‹น ๋‹จ์–ด๊ฐ€ ํ•ด๋‹นํ•˜๋Š” ์œ„์น˜์— 1์ด ๋˜๋Š” ๋ฒกํ„ฐ๋ฅผ ๋ฐ˜ํ™˜ํ•œ๋‹ค.
  • ํ•˜์ง€๋งŒ ์ด๋Ÿฐ ํ‘œํ˜„ ๋ฐฉ๋ฒ•์€ ๋ช‡๊ฐ€์ง€ ์ œ์•ฝ์ด ์กด์žฌํ•œ๋‹ค.
    • ํฌ์†Œํ•œ ๋ฐฐ์—ด์ด๋ผ๋Š” ์  ( ํ•œ ๋ฆฌ๋ทฐ์˜ ๊ณ ์œ  ๋‹จ์–ด ์ˆ˜๋Š” ํ•ญ์ƒ vocabulary ์ „์ฒด ๋‹จ์–ด์ˆ˜ ๋ณด๋‹ค ํ›จ์”ฌ ์ž‘๋‹ค.)
    • ๋ฆฌ๋ทฐ์— ๋“ฑ์žฅํ•˜๋Š” ๋‹จ์–ด ์ˆœ์„œ๋ฅผ ๋ฌด์‹œํ•œ ๋ฐฉ๋ฒ•์ด๋‹ค ( BoW ๋ฐฉ์‹)

 

1.2.3 DataLoader

  • ๋ฒกํ„ฐ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๋ฏธ๋‹ˆ๋ฐฐ์น˜ ํŒŒ์ดํ”„๋ผ์ธ์˜ ๋งˆ์ง€๋ง‰ ๋‹จ๊ณ„๋Š” ๋ฒกํ„ฐ๋กœ ๋ณ€ํ™˜ํ•œ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ ๋ชจ์œผ๊ธฐ 
  • ํŒŒ์ดํ† ์น˜ ๋‚ด์žฅ ํด๋ž˜์Šค์ธ DataLoader๋Š” ์‹ ๊ฒฝ๋ง์— ํ•„์ˆ˜์ ์ธ ๋ฏธ๋‹ˆ๋ฐฐ์น˜๋กœ ๋ชจ์œผ๋Š” ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•ด์คŒ
  • DataLoader ํด๋ž˜์Šค๋Š” ํŒŒ์ดํ† ์น˜ DataSet(์ด ์˜ˆ์ œ์—์„œ๋Š” ReviewDataset, batch_size ๋“ฑ ๊ด€๋ จํ•œ ํ‚ค์›Œ๋“œ๋ฅผ ๋งค๊ฐœ๋ณ€์ˆ˜๋กœ ๋ฐ›์Œ
from torch.utils.data import DataLoader


def generate_batches(dataset, batch_size, shuffle=True,
                     drop_last=True, device="cpu"):
    """
    A generator function that wraps the PyTorch DataLoader.
    Moves each tensor to the specified device.
    """
    dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
                            shuffle=shuffle, drop_last=drop_last)

    for data_dict in dataloader:
        out_data_dict = {}
        for name, tensor in data_dict.items():
            out_data_dict[name] = data_dict[name].to(device)
        yield out_data_dict
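What DataLoader contributes inside generate_batches, minus tensors and devices, can be sketched without the framework; this simplified stand-in (not the real torch.utils.data.DataLoader, which also shuffles and collates) shows the fixed-size batching and the drop_last behavior:

```python
# A framework-free sketch of minibatch grouping with drop_last=True.
def generate_batches(samples, batch_size, drop_last=True):
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        if drop_last and len(batch) < batch_size:
            break  # mirrors DataLoader(drop_last=True): skip the partial batch
        yield batch

samples = [{'x_data': i, 'y_target': i % 2} for i in range(5)]
batches = list(generate_batches(samples, batch_size=2))
print(len(batches))  # 2 -- the fifth sample is dropped
print(batches[0])    # [{'x_data': 0, 'y_target': 0}, {'x_data': 1, 'y_target': 1}]
```

This also explains why ReviewDataset.get_num_batches uses floor division: len(dataset) // batch_size is exactly the number of full batches.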

 

 

<NEXT> https://didu-story.tistory.com/87 

 

[NLP] ๋ ˆ์Šคํ† ๋ž‘ ๋ฆฌ๋ทฐ ๊ฐ์„ฑ ๋ถ„๋ฅ˜ํ•˜๊ธฐ (3) (feat.ํŒŒ์ดํ† ์น˜๋กœ ๋ฐฐ์šฐ๋Š” ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ) - ํ›ˆ๋ จ ๋ฐ ํ‰๊ฐ€, ์ถ”๋ก ,

-- ๋ณธ ํฌ์ŠคํŒ…์€ ํŒŒ์ดํ† ์น˜๋กœ ๋ฐฐ์šฐ๋Š” ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ (ํ•œ๋น›๋ฏธ๋””์–ด) ์ฑ…์„ ์ฐธ๊ณ ํ•ด์„œ ์ž‘์„ฑ๋œ ๊ธ€์ž…๋‹ˆ๋‹ค. -- ์†Œ์Šค์ฝ”๋“œ ) https://github.com/rickiepark/nlp-with-pytorch (ํ•œ๋น›๋ฏธ๋””์–ด, 2021)์˜ ์†Œ์Šค ์ฝ”๋“œ๋ฅผ " data-og-des..

didu-story.tistory.com

 

๋ฐ˜์‘ํ˜•