ZeRO-Offload: Training Multi-Billion Parameter Models on a Single GPU

By Medium - 2021-01-27

Description

Deep learning (DL) model size is growing exponentially — from the fewer than 100 million parameters in 2017’s largest language model to the whopping 175 billion parameters in 2020’s GPT-3. Training…

Summary

  • Deep learning (DL) model size is growing exponentially — from fewer than 100 million parameters in 2017’s largest language model to the whopping 175 billion parameters in 2020’s GPT-3.
  • Many studies have used heterogeneous DL training to reduce GPU memory requirements by exploiting CPU memory, but these target activation memory on smaller-sized CNN-based models.
  • ZeRO-Offload instead enables heterogeneous training at this scale by offloading gradients and optimizer states to CPU memory while keeping parameters and the forward and backward computation on the GPU (see the sketch after this list).
  • Traditional data parallelism is the community standard for scaling DL training to multiple GPUs, but it requires replicating the model states and computation on every GPU, making it unsuitable for heterogeneous training.
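
A minimal, illustrative sketch of the offloading pattern described above — not the DeepSpeed/ZeRO-Offload implementation itself. Forward and backward passes run on the GPU, while gradients, the parameter update, and the Adam optimizer states live in CPU memory. The layer sizes and learning rate are placeholders chosen only for the example.

```python
import torch

device = "cuda"
model = torch.nn.Linear(4096, 4096).to(device)   # parameters stay on GPU

# CPU-resident copies of the parameters; the optimizer (and its Adam
# momentum/variance buffers) is built over these, so optimizer states
# never occupy GPU memory.
cpu_params = [p.detach().to("cpu").requires_grad_(True) for p in model.parameters()]
optimizer = torch.optim.Adam(cpu_params, lr=1e-4)

def train_step(batch, target):
    # 1) Forward and backward on the GPU.
    loss = torch.nn.functional.mse_loss(model(batch), target)
    loss.backward()

    # 2) Offload gradients to CPU and run the optimizer step there.
    for gpu_p, cpu_p in zip(model.parameters(), cpu_params):
        cpu_p.grad = gpu_p.grad.detach().to("cpu")
    optimizer.step()
    optimizer.zero_grad()

    # 3) Copy updated parameters back to the GPU for the next iteration.
    with torch.no_grad():
        for gpu_p, cpu_p in zip(model.parameters(), cpu_params):
            gpu_p.copy_(cpu_p.to(device))
    model.zero_grad(set_to_none=True)
    return loss.item()
```

ZeRO-Offload itself implements this idea with an optimized CPU Adam and overlapped GPU–CPU transfers rather than the naive synchronous copies shown here; the sketch only conveys the division of memory and compute between the two devices.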


Topics

  1. Backend (0.19)
  2. NLP (0.12)
  3. Machine_Learning (0.1)
