BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

📌 Summary

BERT pre-trains a deep Transformer encoder with two objectives: masked language modeling (MLM) and next sentence prediction (NSP). The key innovation is that MLM lets every layer condition on both left and right context, which unidirectional language models cannot do.

📃 Citation

Devlin et al., 2018 – https://arxiv.org/abs/1810.04805

🧠 Key Ideas

  • Masked Language Modeling (MLM): mask a random 15% of input tokens and predict them from both left and right context (see the sketch after this list)
  • Next Sentence Prediction (NSP): given a sentence pair, classify whether the second sentence actually follows the first in the corpus
  • Fine-tuning: add a small task-specific output layer and update all pre-trained parameters on each downstream task
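A minimal sketch (in Python, not from the paper's codebase) of how the two pre-training objectives build their training examples: the 15% selection with the 80/10/10 corruption rule for MLM, and the 50/50 IsNext/NotNext pairing for NSP. The token strings, TOY_VOCAB, and the helper names mask_tokens / make_nsp_pair are illustrative placeholders.

```python
import random

MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"
TOY_VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran", "fast"]

def mask_tokens(tokens, mask_prob=0.15):
    """MLM corruption: select ~15% of tokens as prediction targets, then
    replace 80% of them with [MASK], 10% with a random token, 10% unchanged."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                              # original token is the target
            r = random.random()
            if r < 0.8:
                corrupted[i] = MASK                      # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = random.choice(TOY_VOCAB)  # 10%: replace with random token
            # else: 10% keep the original token unchanged
    return corrupted, labels

def make_nsp_pair(sent_a, next_sent, corpus):
    """NSP example: 50% of the time B is the true next sentence (IsNext),
    otherwise B is a random sentence from the corpus (NotNext)."""
    if random.random() < 0.5:
        sent_b, label = next_sent, "IsNext"
    else:
        sent_b, label = random.choice(corpus), "NotNext"
    return [CLS] + sent_a + [SEP] + sent_b + [SEP], label

corpus = [["the", "dog", "ran", "fast"], ["the", "cat", "sat"]]
print(mask_tokens(["the", "cat", "sat", "on", "the", "mat"]))
print(make_nsp_pair(["the", "cat", "sat"], ["on", "the", "mat"], corpus))
```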

🔍 What's New?

  • Deeply bidirectional pre-training: every layer attends to both left and right context, unlike left-to-right LMs or a shallow concatenation of two unidirectional models
  • New state of the art at publication time on GLUE and SQuAD via simple fine-tuning (sketched below)
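A minimal fine-tuning sketch, assuming the Hugging Face transformers and PyTorch packages rather than the paper's original TensorFlow release; the toy texts, labels, and hyperparameters are placeholders, not the paper's setup.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load the pre-trained encoder plus a randomly initialised classification head.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Toy labelled batch (placeholder data).
texts = ["a great movie", "a dull movie"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

# One fine-tuning step: all BERT parameters and the new head are updated together.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
```

The same pattern applies across the GLUE-style tasks reported in the paper: only the small output layer is new per task, while everything else starts from the pre-trained checkpoint.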

💬 Discussion

  • Why MLM instead of a standard left-to-right LM objective?
  • Is NSP still necessary today?

👥 Notes by Group