[Paper Reading] X3D (2020), Expanding Architectures for Efficient Video Recognition

AI 꿈나무 2021. 9. 20. 13:19

X3D: Expanding Architectures for Efficient Video Recognition

PDF, Video, Christoph Feichtenhofer, CVPR2020

Summary

Starting from a tiny model, X3D progressively expands one of several axes at a time. Whereas EfficientNet scales depth, width, and image resolution uniformly, X3D controls six axes: bottleneck width, temporal duration, frame rate, depth, spatial resolution, and width.
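To make the contrast concrete, here is a minimal sketch of the six axes as a small config object. The class name `ExpansionFactors`, the helper names, and the factor values (2.0 and 1.2) are assumptions made for illustration and do not come from the paper. `scale_all_axes` is also a simplification of EfficientNet-style scaling, which uses a separate constant per axis.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ExpansionFactors:
    """Hypothetical container: one multiplicative factor per X3D axis."""
    frame_rate: float = 1.0
    duration: float = 1.0          # temporal duration
    resolution: float = 1.0        # spatial resolution
    width: float = 1.0
    bottleneck_width: float = 1.0
    depth: float = 1.0


def expand_one_axis(f: ExpansionFactors, axis: str, factor: float) -> ExpansionFactors:
    """X3D-style step: grow a single axis and leave the others untouched."""
    return replace(f, **{axis: getattr(f, axis) * factor})


def scale_all_axes(f: ExpansionFactors, factor: float) -> ExpansionFactors:
    """Simplified joint scaling: grow every axis by the same factor."""
    return ExpansionFactors(**{fl.name: getattr(f, fl.name) * factor for fl in fields(f)})


base = ExpansionFactors()
one_axis = expand_one_axis(base, "bottleneck_width", 2.0)  # illustrative factor
joint = scale_all_axes(base, 1.2)                          # illustrative factor
print(one_axis)
print(joint)
```

The single-axis version changes only `bottleneck_width`, which is what lets each expansion step be evaluated separately.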

By expanding one axis at a time, it searches for the best accuracy-complexity trade-off.

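The backbone uses the depthwise-separable structure of MobileNet. When a 2D model is extended to 3D, usually only expansion along the temporal axis is considered. This paper shows that expanding other axes also improves performance, and argues that several factors beyond the temporal axis should be taken into account.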
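As a rough illustration of why a depthwise-separable structure suits an efficiency-oriented backbone, the sketch below compares the weight count of a dense 3D convolution with a depthwise 3D convolution followed by a 1x1x1 pointwise convolution. The channel count (64) and kernel size (3x3x3) are assumptions picked for the example, not values from the paper, and biases are ignored.

```python
def conv3d_params(c_in: int, c_out: int, k_t: int, k_h: int, k_w: int) -> int:
    """Weights in a dense 3D convolution (bias ignored)."""
    return c_in * c_out * k_t * k_h * k_w


def separable_conv3d_params(c_in: int, c_out: int, k_t: int, k_h: int, k_w: int) -> int:
    """Weights in a depthwise 3D conv plus a 1x1x1 pointwise conv (bias ignored)."""
    depthwise = c_in * k_t * k_h * k_w   # one k_t x k_h x k_w filter per input channel
    pointwise = c_in * c_out             # 1x1x1 conv that mixes channels
    return depthwise + pointwise


dense = conv3d_params(64, 64, 3, 3, 3)
separable = separable_conv3d_params(64, 64, 3, 3, 3)
print(dense, separable, round(dense / separable, 1))
```

With these example sizes the separable version needs 5,824 weights against 110,592 for the dense one, roughly a 19x reduction.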

If expanding other axes also improves performance, then the axis to expand should be chosen by weighing which axis affects performance the most against the complexity it adds.

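In the manner of feature selection, the model is expanded along the axis that gives the highest performance within a limited compute budget. A forward scheme is used: the first step expands bottleneck width, the second step the temporal axis, the third step resolution, and so on. The figure above shows which axis is expanded at each step.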
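The forward scheme described above can be sketched as a greedy loop: at each step, try expanding every axis on its own, keep the candidate with the best score, and repeat. In the paper each candidate is an actually trained model; here `toy_accuracy`, `toy_cost`, the `GAIN_WEIGHT` and `COST_EXPONENT` tables, and the per-step `cost_multiplier` of 2 are all made-up stand-ins so that the loop runs, and they say nothing about real X3D results.

```python
import math

AXES = ("frame_rate", "duration", "resolution", "width", "bottleneck_width", "depth")

# Toy assumption: cost grows as factor ** exponent along each axis.
COST_EXPONENT = {"frame_rate": 1, "duration": 1, "resolution": 2,
                 "width": 2, "bottleneck_width": 1, "depth": 1}
# Toy assumption: relative benefit of each axis (not measured values).
GAIN_WEIGHT = {"frame_rate": 0.9, "duration": 0.7, "resolution": 0.8,
               "width": 0.5, "bottleneck_width": 1.0, "depth": 0.6}


def toy_cost(factors):
    """Stand-in for model complexity relative to the base model."""
    return math.prod(factors[a] ** COST_EXPONENT[a] for a in AXES)


def toy_accuracy(factors):
    """Stand-in for trained accuracy, with diminishing returns per axis."""
    return sum(GAIN_WEIGHT[a] * math.sqrt(math.log2(factors[a])) for a in AXES)


def forward_expand(steps, cost_multiplier=2.0):
    """Greedy forward expansion: grow exactly one axis per step."""
    factors = {a: 1.0 for a in AXES}
    history = []
    for _ in range(steps):
        candidates = []
        for axis in AXES:
            trial = dict(factors)
            # Grow this axis just enough to multiply the toy cost by cost_multiplier.
            trial[axis] *= cost_multiplier ** (1.0 / COST_EXPONENT[axis])
            candidates.append((toy_accuracy(trial), axis, trial))
        best_acc, best_axis, factors = max(candidates, key=lambda c: c[0])
        history.append((best_axis, round(best_acc, 3), round(toy_cost(factors), 3)))
    return factors, history


factors, history = forward_expand(steps=4)
for step, (axis, acc, cost) in enumerate(history, start=1):
    print(f"step {step}: expand {axis:<16} toy_acc={acc} toy_cost={cost}")
```

Because every candidate in a step has the same cost, comparing their scores directly is a fair way to pick the axis with the best accuracy-complexity trade-off for that step.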

The experiments are run on 128 GPUs, a scale I can hardly imagine attempting.

my github

Seonghoon-Yu/Paper_Review_and_Implementation_in_PyTorch

I review papers and reimplement them in PyTorch for study purposes.

github.com
