Description
Posted by Krzysztof Choromanski and Lucy Colwell, Research Scientists, Google Research
Transformer models have achieved state-of-the-art results across a diverse range of domains, including natural language, conversation, images, and even music.
Summary
- Transformer models have achieved state-of-the-art results across a diverse range of domains, including natural language, conversation, images, and even music.
- To the best of our knowledge, we are the first to show that any attention matrix can be effectively approximated in downstream Transformer applications using random features (see the sketch after this list).
- We first benchmark the space- and time-complexity of the Performer and show that the attention speedups and memory reductions are empirically nearly optimal, i.e., very close to simply not using an attention mechanism at all in the model.
- The Performer model nearly reaches this optimal performance in the attention component.
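To make the random-feature idea concrete, below is a minimal NumPy sketch in the spirit of the Performer's FAVOR+ mechanism, not the authors' implementation. It uses positive random features phi(x) = exp(w^T x - ||x||^2 / 2) / sqrt(m), with w drawn from a standard Gaussian, whose inner products are unbiased estimates of the softmax kernel exp(q^T k / sqrt(d)). The function and parameter names (`performer_attention`, `num_features`) are illustrative, and the sketch omits refinements from the paper such as orthogonal random features and causal masking.

```python
import numpy as np

def positive_random_features(X, W):
    """phi(x) = exp(w^T x - ||x||^2 / 2) / sqrt(m): positive features whose
    inner products are unbiased estimates of the softmax kernel exp(x^T y)."""
    m = W.shape[0]
    return np.exp(X @ W.T - 0.5 * np.sum(X**2, axis=-1, keepdims=True)) / np.sqrt(m)

def performer_attention(Q, K, V, num_features=256, seed=0):
    """Illustrative linear-complexity approximation of softmax attention."""
    d = Q.shape[-1]
    # exp(q^T k / sqrt(d)) = exp((q / d**0.25)^T (k / d**0.25)), so rescale inputs.
    W = np.random.default_rng(seed).standard_normal((num_features, d))
    Qp = positive_random_features(Q / d**0.25, W)  # (L, m)
    Kp = positive_random_features(K / d**0.25, W)  # (L, m)
    # Reordered matmuls: phi(Q) @ (phi(K)^T V) costs O(L * m * d),
    # never materializing the L x L attention matrix.
    numerator = Qp @ (Kp.T @ V)        # (L, d)
    normalizer = Qp @ Kp.sum(axis=0)   # (L,) row sums of the implicit matrix
    return numerator / normalizer[:, None]

def exact_attention(Q, K, V):
    """Quadratic-cost reference: softmax(Q K^T / sqrt(d)) V."""
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return (A / A.sum(axis=-1, keepdims=True)) @ V

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    L, d = 512, 64
    Q, K, V = rng.standard_normal((3, L, d)) * 0.1
    approx, exact = performer_attention(Q, K, V), exact_attention(Q, K, V)
    print("max abs error:", np.abs(approx - exact).max())
```

The reordering is what underlies the near-optimal speed and memory behavior summarized above: cost grows linearly in the sequence length L rather than quadratically, at the price of a stochastic approximation whose variance shrinks as `num_features` grows.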