Description
Microsoft and CMU researchers begin to unravel three mysteries in deep learning related to ensemble, knowledge distillation, and self-distillation. Discover how their work leads to the first theoretical proof ...
Summary
- Under now-standard techniques, such as over-parameterization, batch-normalization, and adding residual links, “modern age” neural network training—at least for image classification tasks and many others—is usually quite stable.
- Does ensemble/knowledge distillation work the same way in deep learning as it does for random feature mappings (namely, the NTK feature mappings)?
- In other words, during knowledge distillation, the individual model is forced to learn every possible view feature, matching the performance of the ensemble.
- At a high level, we view self-distillation as combining ensemble and knowledge distillation in a more compact manner (see the sketch after this list).
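To make the setup concrete, here is a minimal sketch of knowledge distillation from an ensemble, assuming PyTorch. It is not the authors' exact training recipe; the names `teachers`, `student`, and `train_loader`, and the `temperature` and `alpha` hyperparameters, are illustrative placeholders.

```python
# Minimal knowledge-distillation sketch (assumed setup, not the paper's code):
# a student is trained to match the averaged soft labels of an ensemble.
import torch
import torch.nn.functional as F

def distill_from_ensemble(teachers, student, train_loader, epochs=10,
                          temperature=4.0, alpha=0.9, lr=0.1):
    """Train `student` to match the ensemble's softened outputs.

    teachers: list of already-trained models (the ensemble); kept frozen.
    student:  a single model, typically of the same architecture.
    """
    for t in teachers:
        t.eval()
    optimizer = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)

    for _ in range(epochs):
        for inputs, labels in train_loader:
            with torch.no_grad():
                # Ensemble output: average the teachers' softened probabilities.
                soft_targets = torch.stack(
                    [F.softmax(t(inputs) / temperature, dim=1) for t in teachers]
                ).mean(dim=0)

            logits = student(inputs)
            # Distillation loss: KL divergence to the ensemble's soft labels,
            # scaled by T^2 as in standard distillation practice.
            kd_loss = F.kl_div(
                F.log_softmax(logits / temperature, dim=1),
                soft_targets,
                reduction="batchmean",
            ) * (temperature ** 2)
            # Standard cross-entropy on the hard labels.
            ce_loss = F.cross_entropy(logits, labels)

            loss = alpha * kd_loss + (1 - alpha) * ce_loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```

In this sketch, self-distillation corresponds to the special case where `teachers` contains a single trained copy of the same architecture as `student`, which is one way to read it as a compact combination of ensemble and knowledge distillation.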