BERT (language model)

From HandWiki
Short description: Automated natural language processing software

Bidirectional Encoder Representations from Transformers (BERT) is a masked-language model published in 2018 by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.[1][2] A 2020 literature survey concluded that "in a little over a year, BERT has become a ubiquitous baseline in NLP experiments", counting over 150 research publications analyzing and improving the model.[3]

BERT was originally implemented in the English language at two model sizes:[1] (1) BERTBASE: 12 encoders with 12 bidirectional self-attention heads totaling 110 million parameters, and (2) BERTLARGE: 24 encoders with 16 bidirectional self-attention heads totaling 340 million parameters. Both models were pre-trained on the Toronto BooksCorpus[4] (800M words) and English Wikipedia (2,500M words).

Architecture

BERT is based on the transformer architecture. Specifically, BERT is composed of Transformer encoder layers.

BERT was pre-trained simultaneously on two tasks: language modeling (15% of tokens were masked, and the training objective was to predict the original token given its context) and next sentence prediction (the training objective was to classify if two spans of text appeared sequentially in the training corpus).[5] As a result of this training process, BERT learns latent representations of words and sentences in context. After pre-training, BERT can be finetuned with fewer resources on smaller datasets to optimize its performance on specific tasks such as NLP tasks (language inference, text classification) and sequence-to-sequence based language generation tasks (question-answering, conversational response generation).[1][6] The pre-training stage is significantly more computationally expensive than finetuning.

Performance

When BERT was published, it achieved state-of-the-art performance on a number of natural language understanding tasks:[1]

  • GLUE (General Language Understanding Evaluation) task set (consisting of 9 tasks)
  • SQuAD (Stanford Question Answering Dataset[7]) v1.1 and v2.0
  • SWAG (Situations With Adversarial Generations[8])

Analysis

The reasons for BERT's state-of-the-art performance on these natural language understanding tasks are not yet well understood.[9][10] Current research has focused on investigating the relationship behind BERT's output as a result of carefully chosen input sequences,[11][12] analysis of internal vector representations through probing classifiers,[13][14] and the relationships represented by attention weights.[9][10] The high performance of the BERT model could also be attributed to the fact that it is bidirectionally trained. This means that BERT, based on the Transformer model architecture, applies its self-attention mechanism to learn information from a text from the left and right side during training, and consequently gains a deep understanding of the context. For example, the word fine can have two different meanings depending on the context (I feel fine today, She has fine blond hair). BERT considers the words surrounding the target word fine from the left and right side.

In contrast to deep learning neural networks which require very large amounts of data, BERT has already been pre-trained which means that it has learnt the representations of the words and sentences as well as the underlying semantic relations that they are connected with. BERT can then be fine-tuned on smaller datasets for specific tasks such as sentiment classification. The pre-trained models are chosen according to the content of the given dataset one uses but also the goal of the task. For example, if the task is a sentiment classification task on financial data, a pre-trained model for the analysis of sentiment of financial text should be chosen. The pre-trained models can be found in the Hugging Face library.

History

BERT has its origins from pre-training contextual representations, including semi-supervised sequence learning,[15] generative pre-training, ELMo,[16] and ULMFit.[17] Unlike previous models, BERT is a deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus. Context-free models such as word2vec or GloVe generate a single word embedding representation for each word in the vocabulary, where BERT takes into account the context for each occurrence of a given word. For instance, whereas the vector for "running" will have the same word2vec vector representation for both of its occurrences in the sentences "He is running a company" and "He is running a marathon", BERT will provide a contextualized embedding that will be different according to the sentence.

On October 25, 2019, Google announced that they had started applying BERT models for English language search queries within the US.[18] On December 9, 2019, it was reported that BERT had been adopted by Google Search for over 70 languages.[19] In October 2020, almost every single English-based query was processed by a BERT model.[20]

Recognition

The research paper describing BERT won the Best Long Paper Award at the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).[21]

See also


References

  1. 1.0 1.1 1.2 1.3 Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805v2 [cs.CL].
  2. "Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing" (in en). http://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html. 
  3. Rogers, Anna; Kovaleva, Olga; Rumshisky, Anna (2020). "A Primer in BERTology: What We Know About How BERT Works". Transactions of the Association for Computational Linguistics 8: 842–866. doi:10.1162/tacl_a_00349. https://aclanthology.org/2020.tacl-1.54. 
  4. Zhu, Yukun; Kiros, Ryan; Zemel, Rich; Salakhutdinov, Ruslan; Urtasun, Raquel; Torralba, Antonio; Fidler, Sanja (2015). "Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books". pp. 19–27. arXiv:1506.06724 [cs.CV].
  5. "Summary of the models — transformers 3.4.0 documentation". https://huggingface.co/transformers/v3.4.0/model_summary.html. 
  6. Horev, Rani (2018). "BERT Explained: State of the art language model for NLP". https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270. 
  7. Rajpurkar, Pranav; Zhang, Jian; Lopyrev, Konstantin; Liang, Percy (2016-10-10). "SQuAD: 100,000+ Questions for Machine Comprehension of Text". arXiv:1606.05250 [cs.CL].
  8. Zellers, Rowan; Bisk, Yonatan; Schwartz, Roy; Choi, Yejin (2018-08-15). "SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference". arXiv:1808.05326 [cs.CL].
  9. 9.0 9.1 Kovaleva, Olga; Romanov, Alexey; Rogers, Anna; Rumshisky, Anna (November 2019). "Revealing the Dark Secrets of BERT" (in en-us). Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 4364–4373. doi:10.18653/v1/D19-1445. https://www.aclweb.org/anthology/D19-1445. 
  10. 10.0 10.1 Clark, Kevin; Khandelwal, Urvashi; Levy, Omer; Manning, Christopher D. (2019). "What Does BERT Look at? An Analysis of BERT's Attention". Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (Stroudsburg, PA, USA: Association for Computational Linguistics): 276–286. doi:10.18653/v1/w19-4828. 
  11. Khandelwal, Urvashi; He, He; Qi, Peng; Jurafsky, Dan (2018). "Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context". Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Stroudsburg, PA, USA: Association for Computational Linguistics): 284–294. doi:10.18653/v1/p18-1027. 
  12. Gulordava, Kristina; Bojanowski, Piotr; Grave, Edouard; Linzen, Tal; Baroni, Marco (2018). "Colorless Green Recurrent Networks Dream Hierarchically". Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (Stroudsburg, PA, USA: Association for Computational Linguistics): 1195–1205. doi:10.18653/v1/n18-1108. 
  13. Giulianelli, Mario; Harding, Jack; Mohnert, Florian; Hupkes, Dieuwke; Zuidema, Willem (2018). "Under the Hood: Using Diagnostic Classifiers to Investigate and Improve how Language Models Track Agreement Information". Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (Stroudsburg, PA, USA: Association for Computational Linguistics): 240–248. doi:10.18653/v1/w18-5426. 
  14. Zhang, Kelly; Bowman, Samuel (2018). "Language Modeling Teaches You More than Translation Does: Lessons Learned Through Auxiliary Syntactic Task Analysis". Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (Stroudsburg, PA, USA: Association for Computational Linguistics): 359–361. doi:10.18653/v1/w18-5448. 
  15. Dai, Andrew; Le, Quoc (4 November 2015). "Semi-supervised Sequence Learning". arXiv:1511.01432 [cs.LG].
  16. Peters, Matthew; Neumann, Mark; Iyyer, Mohit; Gardner, Matt; Clark, Christopher; Lee, Kenton; Luke, Zettlemoyer (15 February 2018). "Deep contextualized word representations". arXiv:1802.05365v2 [cs.CL].
  17. Howard, Jeremy; Ruder, Sebastian (18 January 2018). "Universal Language Model Fine-tuning for Text Classification". arXiv:1801.06146v5 [cs.CL].
  18. Nayak, Pandu (25 October 2019). "Understanding searches better than ever before". https://www.blog.google/products/search/search-language-understanding-bert/. 
  19. Montti, Roger (10 December 2019). "Google's BERT Rolls Out Worldwide". Search Engine Journal. https://www.searchenginejournal.com/google-bert-rolls-out-worldwide/339359/. 
  20. "Google: BERT now used on almost every English query". 2020-10-15. https://searchengineland.com/google-bert-used-on-almost-every-english-query-342193. 
  21. "Best Paper Awards". 2019. https://naacl2019.org/blog/best-papers/. 

Further reading

  • Rogers, Anna; Kovaleva, Olga; Rumshisky, Anna (2020). "A Primer in BERTology: What we know about how BERT works". arXiv:2002.12327 [cs.CL].

External links