Description
Deep learning (DL) model size is growing exponentially — from the fewer than 100 million parameters in 2017’s largest language model to the whopping 175 billion parameters in 2020’s GPT-3. Training…
Summary
- Training Multi-Billion Parameter Models on a Single GPU: Deep learning (DL) model size is growing exponentially — from the fewer than 100 million parameters in 2017’s largest language model to the whopping 175 billion parameters in 2020’s GPT-3.
- Many studies have used heterogeneous DL training to reduce GPU memory requirements by exploiting CPU memory, but these approaches target activation memory on smaller CNN-based models.
- This work instead offloads gradients, optimizer states, and the optimizer computation to CPU memory, keeping parameters and forward and backward computation on GPU (see the sketch after this list).
- Traditional data parallelism is the community standard for scaling DL training to multiple GPUs, but it replicates model states and computation on every GPU, which makes it unsuitable for heterogeneous training.
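A minimal sketch of the offloading idea above, assuming a PyTorch-style training loop: parameters and the forward/backward pass stay on the GPU, while a plain Adam optimizer and its state live in CPU memory. The tiny linear model, batch size, and hyperparameters are illustrative placeholders, not the system described in this article.

```python
# Hedged sketch: forward/backward on GPU; gradients are copied to CPU,
# where the optimizer state lives and the update runs. Illustrative only.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Linear(1024, 1024).to(device)            # parameters stay on GPU
cpu_params = [p.detach().to("cpu", copy=True) for p in model.parameters()]
optimizer = torch.optim.Adam(cpu_params, lr=1e-3)          # Adam moments live in CPU memory

for _ in range(3):
    x = torch.randn(32, 1024, device=device)
    loss = model(x).pow(2).mean()
    loss.backward()                                         # forward and backward on GPU

    # Offload gradients to CPU, run the optimizer step there,
    # then copy the updated values back into the GPU parameters.
    for gpu_p, cpu_p in zip(model.parameters(), cpu_params):
        cpu_p.grad = gpu_p.grad.to("cpu")
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    model.zero_grad(set_to_none=True)
    with torch.no_grad():
        for gpu_p, cpu_p in zip(model.parameters(), cpu_params):
            gpu_p.copy_(cpu_p)
```

The point of the sketch is only the division of work: during forward and backward the GPU holds parameters, activations, and gradients, while the optimizer state — typically the largest consumer of model-state memory with Adam — never occupies GPU memory.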