BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Summary
BERT pre-trains deep bidirectional Transformer representations using a masked language model (MLM) objective together with next sentence prediction (NSP). The key innovation is conditioning on both left and right context in every layer, rather than the left-to-right (or shallowly concatenated) context of earlier language models.
Citation
Devlin et al., 2018 – https://arxiv.org/abs/1810.04805
Key Ideas
- Masked Language Modeling (MLM): randomly select ~15% of input tokens and predict them from bidirectional context (see the sketch after this list)
- Next Sentence Prediction (NSP): binary classification of whether sentence B actually follows sentence A in the corpus
- Fine-tuning on downstream tasks: add a small task-specific head and train all parameters end to end
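A minimal sketch of the MLM masking step described above, assuming the paper's 15% selection rate and 80/10/10 replacement rule. The constants (MASK_ID, VOCAB_SIZE) and the example token ids are illustrative placeholders, not tied to any particular tokenizer release.

```python
# Hypothetical sketch of BERT-style MLM masking over a list of token ids.
import random

MASK_ID = 103        # assumed id for the [MASK] token
VOCAB_SIZE = 30522   # assumed vocabulary size (BERT-base uncased)
MASK_PROB = 0.15     # fraction of tokens selected for prediction (per the paper)

def mask_tokens(token_ids, rng=random.Random(0)):
    """Return (masked_input, labels); labels are -100 for unselected positions."""
    masked, labels = [], []
    for tid in token_ids:
        if rng.random() < MASK_PROB:
            labels.append(tid)          # model must predict the original token
            r = rng.random()
            if r < 0.8:
                masked.append(MASK_ID)  # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(rng.randrange(VOCAB_SIZE))  # 10%: random token
            else:
                masked.append(tid)      # 10%: keep the original token
        else:
            masked.append(tid)
            labels.append(-100)         # position ignored by the loss
    return masked, labels

if __name__ == "__main__":
    example_ids = [2023, 2003, 1037, 7099, 6251, 1012]  # illustrative ids only
    print(mask_tokens(example_ids))
```

The 10% random-replacement and 10% keep-original cases prevent the model from learning that a prediction is only ever needed at [MASK] positions, since [MASK] never appears at fine-tuning time.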
What's New?
- Bidirectionality in pre-training (every layer attends to both left and right context)
- Set new state-of-the-art results on GLUE and SQuAD at publication
Discussion
- Why use MLM instead of a traditional left-to-right LM objective?
- Is NSP still necessary, given that later work (e.g. RoBERTa) drops it?