Description
Deep learning (DL) model size is growing exponentially — from the fewer than 100 million parameters in 2017’s largest language model to the whopping 175 billion parameters in 2020’s GPT-3. Training…
Summary
- Training Multi-Billion Parameter Models on a Single GPU: Deep learning (DL) model size is growing exponentially — from the fewer than 100 million parameters in 2017’s largest language model to the whopping 175 billion parameters in 2020’s GPT-3.
- Many studies have used heterogeneous DL training to reduce GPU memory requirements by exploiting CPU memory, but these approaches target activation memory on smaller CNN-based models.
- This work instead offloads gradients, optimizer states, and the optimizer computation to CPU memory, keeping parameters and forward and backward computation on GPU (see the sketch after this list).
- Traditional data parallelism is the community standard for scaling DL training to multiple GPUs, but it replicates model states and computation on every GPU, which makes it unsuitable for heterogeneous training.
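A minimal sketch of the offloading idea above, assuming a PyTorch-style training loop: parameters and the forward/backward pass stay on the GPU, while a plain Adam optimizer and its state live in CPU memory. The tiny linear model, batch size, and hyperparameters are illustrative placeholders, not the system described in this article.

```python
# Hedged sketch: forward/backward on GPU; gradients are copied to CPU,
# where the optimizer state lives and the update runs. Illustrative only.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Linear(1024, 1024).to(device)            # parameters stay on GPU
cpu_params = [p.detach().to("cpu", copy=True) for p in model.parameters()]
optimizer = torch.optim.Adam(cpu_params, lr=1e-3)          # Adam moments live in CPU memory

for _ in range(3):
    x = torch.randn(32, 1024, device=device)
    loss = model(x).pow(2).mean()
    loss.backward()                                         # forward and backward on GPU

    # Offload gradients to CPU, run the optimizer step there,
    # then copy the updated values back into the GPU parameters.
    for gpu_p, cpu_p in zip(model.parameters(), cpu_params):
        cpu_p.grad = gpu_p.grad.to("cpu")
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    model.zero_grad(set_to_none=True)
    with torch.no_grad():
        for gpu_p, cpu_p in zip(model.parameters(), cpu_params):
            gpu_p.copy_(cpu_p)
```

The point of the sketch is only the division of work: during forward and backward the GPU holds parameters, activations, and gradients, while the optimizer state — typically the largest consumer of model-state memory with Adam — never occupies GPU memory.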