Incorporating Convolution Designs into Visual Transformers
Kun Yuan, Shaopeng Guo, Ziwei Liu, Aojun Zhou, Fengwei Yu, Wei Wu, arXiv 2021
PDF, Transformer | By Seonghoon Yu, August 5th, 2021
Summary
CeiT is an architecture that combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.
ViT has two problems.
First, ViT performs direct tokenization of patches from the raw input image with a size of 16 x 16 or 32 x 32, which makes it difficult to extract the low-level features that form fundamental structures in images. Second, the self-attention modules concentrate on building long-range dependencies among tokens while ignoring locality in the spatial dimension.
To address these problems, CeiT presents an Image-to-Tokens (I2T) module, a Locally-enhanced Feed-Forward (LeFF) layer, and a Layer-wise Class-token Attention (LCA).
(1) I2T (Image-to-Tokens)
I2T extracts patches from feature maps obtained by a convolutional stem, instead of from the raw input image.
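A minimal PyTorch sketch of this idea follows: a small Conv + BatchNorm + MaxPool stem produces feature maps, which are then tokenized with a small patch size. The channel counts, kernel sizes, patch size, and embedding dimension here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ImageToTokens(nn.Module):
    """Sketch of I2T: tokenize stem feature maps instead of the raw image."""
    def __init__(self, in_chans=3, stem_chans=32, embed_dim=192, patch_size=4):
        super().__init__()
        # Lightweight conv stem extracts low-level features first
        self.stem = nn.Sequential(
            nn.Conv2d(in_chans, stem_chans, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(stem_chans),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        # Patches are then extracted from the feature map (smaller patch size)
        self.proj = nn.Conv2d(stem_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                        # x: (B, 3, H, W)
        x = self.stem(x)                         # (B, C, H/4, W/4)
        x = self.proj(x)                         # (B, D, H/16, W/16)
        return x.flatten(2).transpose(1, 2)      # (B, N, D) token sequence

tokens = ImageToTokens()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 192])
```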
(2) LeFF (Locally-enhanced Feed-Forward)
LeFF combines the advantage of CNNs in extracting local information with the ability of Transformers to establish long-range dependencies.
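A minimal sketch of the LeFF idea is below, assuming the flow: expand the patch tokens with a linear layer, restore them to a 2D grid, apply a depth-wise convolution for local interaction, then project back down, while the class token bypasses the spatial branch. The hidden dimension and kernel size are assumptions.

```python
import torch
import torch.nn as nn

class LeFF(nn.Module):
    """Sketch of a locally-enhanced feed-forward layer."""
    def __init__(self, dim=192, hidden_dim=768, kernel_size=3):
        super().__init__()
        self.expand = nn.Linear(dim, hidden_dim)
        # Depth-wise conv (groups == channels) injects spatial locality
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size,
                                padding=kernel_size // 2, groups=hidden_dim)
        self.reduce = nn.Linear(hidden_dim, dim)
        self.act = nn.GELU()

    def forward(self, x):                            # x: (B, 1 + N, D), N = H*W
        cls, patches = x[:, :1], x[:, 1:]            # split off the class token
        B, N, _ = patches.shape
        h = w = int(N ** 0.5)
        p = self.act(self.expand(patches))           # (B, N, hidden)
        p = p.transpose(1, 2).reshape(B, -1, h, w)   # tokens back to a 2D grid
        p = self.act(self.dwconv(p))                 # local interaction
        p = p.flatten(2).transpose(1, 2)             # (B, N, hidden)
        p = self.reduce(p)                           # (B, N, D)
        return torch.cat([cls, p], dim=1)            # class token passes through
```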
(3) LCA (Layer-wise Class-token Attention)
LCA integrates information across different layers, because feature representations differ at different depths.
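Below is a minimal sketch of this aggregation, assuming the class token from the last layer attends over the class tokens collected from all layers. The head count, dimension, and MLP ratio are assumptions.

```python
import torch
import torch.nn as nn

class LCA(nn.Module):
    """Sketch of layer-wise class-token attention."""
    def __init__(self, dim=192, num_heads=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4),
                                 nn.GELU(),
                                 nn.Linear(dim * 4, dim))

    def forward(self, cls_tokens):            # (B, L, D): one class token per layer
        q = cls_tokens[:, -1:]                # only the last layer's token queries
        out, _ = self.attn(q, cls_tokens, cls_tokens)
        out = out + q                         # residual on the query token
        return out + self.mlp(self.norm(out))  # final aggregated class token
```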
Experiment
What I like about the paper
- The method of combining the advantages of CNNs with the long-range dependencies of Transformers is interesting.
- CeiT not only achieves SOTA performance, but also converges about 3x faster than DeiT.
My GitHub about what I read