Training data-efficient image transformers & distillation through attention
Hugo Touvron, Matthieu Cord, Matthijs Douze, et al., arXiv 2020
PDF, Classification. By Seonghoon Yu, August 4th, 2021
Summary
DeiT is a model that applies knowledge distillation to ViT by adding a distillation token.
The probability obtained by applying the head to the class token is used for the cross-entropy loss, while the probability obtained by applying dist_head to the distillation token is used for the KD loss.
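Below is a minimal PyTorch sketch of this two-token design, not the official DeiT implementation: the class name SimpleDeiT, the hyperparameters, and the use of nn.TransformerEncoder as the backbone are assumptions for illustration; only the two tokens and the two classifiers (head, dist_head) follow the description above.

```python
# Illustrative sketch of a DeiT-style model with a class token and a distillation token.
# SimpleDeiT and its hyperparameters are assumptions, not the official DeiT code.
import torch
import torch.nn as nn

class SimpleDeiT(nn.Module):
    def __init__(self, embed_dim=192, num_classes=1000, depth=12, num_heads=3):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # extra distillation token
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)       # classifier on the class token
        self.dist_head = nn.Linear(embed_dim, num_classes)  # classifier on the distillation token

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, embed_dim) patch embeddings (positional embeddings assumed added)
        B = patch_tokens.shape[0]
        cls = self.cls_token.expand(B, -1, -1)
        dist = self.dist_token.expand(B, -1, -1)
        x = torch.cat([cls, dist, patch_tokens], dim=1)
        x = self.encoder(x)
        cls_logits = self.head(x[:, 0])        # fed to the cross-entropy loss
        dist_logits = self.dist_head(x[:, 1])  # fed to the KD loss against the teacher
        return cls_logits, dist_logits
```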
There are two types of KD loss, and hard-label distillation outperforms soft distillation.
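A minimal sketch of the hard-label variant is shown below: the class-token logits are trained against the ground-truth labels and the distillation-token logits against the teacher's argmax predictions, with the equal 1/2 weighting used in the paper. The function name and the assumption that the teacher is a separate pretrained model are illustrative.

```python
# Sketch of the hard-label distillation objective (not the official DeiT training code).
import torch
import torch.nn.functional as F

def hard_label_distillation_loss(cls_logits, dist_logits, teacher_logits, targets):
    # Class-token logits are supervised by the ground-truth labels ...
    ce_loss = F.cross_entropy(cls_logits, targets)
    # ... while distillation-token logits are supervised by the teacher's hard predictions.
    teacher_labels = teacher_logits.argmax(dim=1)
    kd_loss = F.cross_entropy(dist_logits, teacher_labels)
    # The paper weights the two terms equally.
    return 0.5 * ce_loss + 0.5 * kd_loss

# Usage sketch: `student`, `teacher`, `patches`, `images`, `labels` are assumed to exist.
# with torch.no_grad():
#     teacher_logits = teacher(images)
# cls_logits, dist_logits = student(patches)
# loss = hard_label_distillation_loss(cls_logits, dist_logits, teacher_logits, labels)
```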
Experiment
What I like about the paper
- Why does the distillation token used for the KD loss perform so well? It is surprising.
My GitHub repository of papers I have read