Description
Summary
- We show for the first time that learning powerful representations from speech audio alone, followed by fine-tuning on transcribed speech, can outperform the best semi-supervised methods while being conceptually simpler.
- wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations, which is learned jointly (see the sketch after this list).
- Experiments using all the labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets.
- With only one hour of labeled data, wav2vec 2.0 outperforms the previous state of the art on the 100-hour subset while using 100 times less labeled data, and with just ten minutes of labeled data it still reaches 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.
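To make the second bullet more concrete, below is a minimal, illustrative PyTorch sketch of the masked contrastive objective: for each masked time step, the model has to identify the true quantized latent among sampled distractors. All names, shapes, and hyperparameters (number of distractors, temperature) are assumptions for illustration, not the released training code; the actual objective also includes a diversity penalty on the quantizer, which is omitted here.

```python
import torch
import torch.nn.functional as F

def masked_contrastive_loss(context, quantized, mask,
                            num_distractors=100, temperature=0.1):
    """Contrastive loss over masked time steps (illustrative sketch).

    context:   (B, T, D) Transformer outputs at every latent time step
    quantized: (B, T, D) quantized latent targets from the quantizer
    mask:      (B, T)    True where the latent input was masked
    """
    losses = []
    for b in range(context.size(0)):
        masked_steps = mask[b].nonzero(as_tuple=False).squeeze(-1)
        for t in masked_steps.tolist():
            # Positive: the quantized latent at the masked position.
            positive = quantized[b, t].unsqueeze(0)                       # (1, D)
            # Distractors: quantized latents at other masked positions.
            others = masked_steps[masked_steps != t]
            pick = others[torch.randperm(len(others))[:num_distractors]]
            negatives = quantized[b, pick]                                # (K, D)
            candidates = torch.cat([positive, negatives], dim=0)          # (K+1, D)
            # Cosine similarity between the context vector and all candidates;
            # the model must pick out the true target (index 0).
            sims = F.cosine_similarity(context[b, t].unsqueeze(0), candidates, dim=-1)
            logits = (sims / temperature).unsqueeze(0)                    # (1, K+1)
            losses.append(F.cross_entropy(logits, torch.zeros(1, dtype=torch.long)))
    return torch.stack(losses).mean()


# Toy usage with random tensors standing in for real latents.
B, T, D = 2, 50, 256
context = torch.randn(B, T, D)
quantized = torch.randn(B, T, D)
mask = torch.rand(B, T) < 0.5
print(masked_contrastive_loss(context, quantized, mask).item())
```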
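For the fine-tuned models referenced above, a typical way to run inference is through the Hugging Face transformers library. The checkpoint name below (facebook/wav2vec2-base-960h) and the placeholder 16 kHz waveform are assumptions for illustration; any wav2vec 2.0 checkpoint fine-tuned with CTC can be used the same way.

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Assumed checkpoint fine-tuned with CTC on 960h of Librispeech.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# `speech` is a 1-D float waveform sampled at 16 kHz (e.g. loaded with soundfile).
speech = torch.randn(16000)  # placeholder: one second of random audio
inputs = processor(speech.numpy(), sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: take the most likely token per frame, then collapse repeats.
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)
```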