Description
Since the birth of BERT, Transformers have dominated NLP in nearly every language-related task, whether it is Question Answering, Sentiment Analysis, Text Classification, or Text…
Summary
- 233x faster Transformer inference on CPU. Yes, 233x on CPU, with the multi-head self-attentive Transformer architecture (see the attention sketch after this list).
- Transformers achieve much better accuracy on all these tasks; unlike RNNs and LSTMs, they do not suffer from vanishing gradients, a problem that hampers learning over long data sequences.
- As smaller models are less expensive to evaluate, they can be deployed on less powerful hardware such as a smartphone.
- In task-specific distillation, the authors distill fine-tuned teacher models into smaller student architectures, following the procedure proposed by TinyBERT. In the task-agnostic approach, they instead fine-tune a general distilled model directly on the target task (see the distillation sketch after this list).
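To make the architecture claim concrete, here is a minimal sketch of multi-head self-attention, assuming PyTorch; the model width, head count, and class name are illustrative choices, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # One projection produces queries, keys, and values in a single matmul.
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split the model dimension into heads: (batch, heads, seq_len, d_head).
        split = lambda z: z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        # Scaled dot-product attention, computed for all heads at once.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        attn = scores.softmax(dim=-1)
        # Merge the heads back into the model dimension.
        ctx = (attn @ v).transpose(1, 2).reshape(b, t, d)
        return self.out(ctx)

x = torch.randn(2, 10, 256)               # (batch, seq_len, d_model)
print(MultiHeadSelfAttention()(x).shape)  # torch.Size([2, 10, 256])
```

Because every position attends to every other in a single matrix multiply, this layer parallelizes well and avoids the step-by-step recurrence that makes RNNs and LSTMs prone to vanishing gradients.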
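And here is a minimal sketch of the knowledge-distillation objective behind both approaches, again assuming PyTorch; the temperature, weighting, and variable names are hypothetical placeholders, not the exact recipe from TinyBERT or this paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,  # assumed value, not from the paper
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend the hard-label loss with a soft-target loss against the teacher."""
    # Soft targets: match the student's temperature-smoothed distribution to
    # the teacher's via KL divergence, scaled by T^2 as in Hinton et al.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard targets: ordinary cross-entropy against the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: a 3-class batch of 4 examples with random logits.
student = torch.randn(4, 3, requires_grad=True)
teacher = torch.randn(4, 3)
labels = torch.randint(0, 3, (4,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
print(loss.item())
```

In the task-specific setting this loss is applied with an already fine-tuned teacher; in the task-agnostic setting the student is distilled once from a general teacher and then fine-tuned on the target task with the hard-label term alone.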