Description
Learn how to use the Transformer architecture to create a simple Speech Recognition / Speech to Text pipeline with Python. Includes examples.
Summary
- Transformer architectures have gained a lot of attention in the field of Natural Language Processing.
- Combined with the benefits of their design (self-attention replaces recurrence, so no sequential processing is necessary), very large models like BERT and the GPT series have been trained that achieve state-of-the-art performance on a variety of language tasks.
- A feature encoder, a seven-layer 1D (temporal) ConvNet, takes the raw waveform and converts it into T time steps.
- The pipeline that we will be creating today requires .wav files, and more specifically .wav files with a sampling rate of 16000 Hz (16 kHz); an input such as an .mp3 file must first be converted into .wav at that rate.
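To make the 16 kHz requirement concrete, here is a minimal sketch of resampling audio to 16 kHz using only NumPy and linear interpolation; the function name `resample_to_16k` and the synthetic test tone are my own illustration, not part of the original pipeline. In practice a library call such as `librosa.load(path, sr=16000)` handles both decoding (including .mp3) and resampling in one step.

```python
import numpy as np

def resample_to_16k(samples, orig_sr, target_sr=16000):
    """Resample a 1D signal to target_sr via linear interpolation (illustrative only)."""
    duration = len(samples) / orig_sr
    n_target = int(round(duration * target_sr))
    t_orig = np.linspace(0.0, duration, num=len(samples), endpoint=False)
    t_new = np.linspace(0.0, duration, num=n_target, endpoint=False)
    return np.interp(t_new, t_orig, samples)

# Demo: one second of a 440 Hz tone at 44.1 kHz, resampled down to 16 kHz.
orig_sr = 44100
t = np.arange(orig_sr) / orig_sr
tone = np.sin(2 * np.pi * 440 * t)
resampled = resample_to_16k(tone, orig_sr)
print(len(resampled))  # → 16000 samples for one second of audio
```

Linear interpolation is enough to show the idea; a real pipeline should use a proper band-limited resampler (as librosa does) to avoid aliasing artifacts.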
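The number of time steps T produced by the seven-layer feature encoder can be computed from the layers' kernel sizes and strides. The sketch below assumes the kernel/stride configuration reported in the wav2vec 2.0 paper (kernels 10,3,3,3,3,2,2 and strides 5,2,2,2,2,2,2); the helper names are my own.

```python
def conv_out_len(n, kernel, stride):
    """Output length of a 1D convolution with no padding."""
    return (n - kernel) // stride + 1

# Feature encoder configuration from the wav2vec 2.0 paper (assumed here).
KERNELS = [10, 3, 3, 3, 3, 2, 2]
STRIDES = [5, 2, 2, 2, 2, 2, 2]

def num_time_steps(n_samples):
    """Number of time steps T the feature encoder emits for n_samples of audio."""
    for k, s in zip(KERNELS, STRIDES):
        n_samples = conv_out_len(n_samples, k, s)
    return n_samples

print(num_time_steps(16000))  # → 49 time steps for one second of 16 kHz audio
```

With a total stride of 320 samples, each time step covers roughly 20 ms of audio, which is why the pipeline insists on a 16 kHz sampling rate.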