BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, arXiv 2018
PDF, NLP By SeonghoonYu July 22nd, 2021
Summary
BERT is a multi-layer bidirectional Transformer encoder that learns word representations from unlabeled data. The learned representations are then fine-tuned with labeled data from downstream tasks for transfer learning.
The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.
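Below is a minimal PyTorch sketch of this sum of three embeddings. The layer sizes and the toy token/segment ids are placeholder assumptions for illustration, not the actual BERT configuration or tokenizer output.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only (not the official BERT config)
vocab_size, max_len, n_segments, hidden = 30000, 512, 2, 768

token_emb = nn.Embedding(vocab_size, hidden)
segment_emb = nn.Embedding(n_segments, hidden)
position_emb = nn.Embedding(max_len, hidden)

def bert_input_embedding(token_ids, segment_ids):
    # token_ids, segment_ids: (batch, seq_len) integer tensors
    seq_len = token_ids.size(1)
    positions = torch.arange(seq_len, device=token_ids.device).unsqueeze(0)
    # Input embedding = token + segment + position embeddings
    return token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)

# Toy usage: one 6-token sequence split into two segments
ids = torch.tensor([[101, 7592, 102, 2088, 999, 102]])   # made-up ids
segs = torch.tensor([[0, 0, 0, 1, 1, 1]])
print(bert_input_embedding(ids, segs).shape)  # torch.Size([1, 6, 768])
```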
Previous NLP models traditionally use a left-to-right language model for pre-training, but this paper uses two unsupervised tasks to train a bidirectional model.
(1) Masked LM
In order to train a deep bidirectional representation, they simply mask some percentage of the input tokens at random, and then predict those masked tokens.
The training data generator chooses 15% of the token positions at random for prediction. If the i-th token is chosen, they replace it with the [MASK] token 80% of the time, a random token 10% of the time, and the unchanged i-th token 10% of the time. Then, Ti is used to predict the original token with a cross-entropy loss.
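Here is a toy sketch of that 80/10/10 masking scheme in plain Python. The vocabulary and the `mask_tokens` helper are my own illustrative assumptions, not the paper's data generator.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog"]  # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    """Sketch of the masked LM corruption: pick 15% of positions, then 80/10/10."""
    inputs, labels = list(tokens), [None] * len(tokens)
    n_pred = max(1, int(round(len(tokens) * mask_prob)))
    for i in random.sample(range(len(tokens)), n_pred):
        labels[i] = tokens[i]          # the model must recover the original token here
        r = random.random()
        if r < 0.8:                    # 80%: replace with [MASK]
            inputs[i] = MASK
        elif r < 0.9:                  # 10%: replace with a random token
            inputs[i] = random.choice(VOCAB)
        # remaining 10%: keep the original token unchanged
    return inputs, labels

print(mask_tokens(["the", "cat", "sat", "on", "the", "mat"]))
```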
(2) Next Sentence Prediction (NSP)
Directly capturing the relationship between two sentences is difficult. In order to train a model that understands sentence relationships, they pre-train on a binarized next sentence prediction task.
When choosing the sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext).
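A minimal sketch of building one NSP example, assuming a corpus represented as a list of documents, each a list of sentences. The `make_nsp_example` helper is hypothetical and, for simplicity, does not guard against the random sentence accidentally being the true next one.

```python
import random

def make_nsp_example(doc, corpus):
    """Pick sentence A from `doc`; B is the true next sentence (IsNext)
    or a random sentence from the corpus (NotNext), each with probability 0.5."""
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        sent_b, label = doc[i + 1], "IsNext"
    else:
        sent_b, label = random.choice(random.choice(corpus)), "NotNext"
    return sent_a, sent_b, label

# Toy corpus of two "documents"
corpus = [["He went to the store.", "He bought a gallon of milk."],
          ["The man went to the park.", "Penguins are flightless birds."]]
print(make_nsp_example(corpus[0], corpus))
```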
Experiment
GLUE Test results
The effects of the masked LM and NSP (next sentence prediction)
Ablation over BERT model size
What I like about the paper
- Trains a model in an unsupervised fashion in order to learn bidirectional representations
- Achieves SOTA performance on 11 different downstream tasks by fine-tuning the pre-trained BERT model
My GitHub about what I read