

[Paper Reading] VLF(2021), VideoLightFormer: Lightweight Action Recognition using Transformers

VideoLightFormer: Lightweight Action Recognition using Transformers PDF, Video Recognition, Raivo Koot, Haiping Lu, arXiv 2021 Summary An efficient video recognition model. It extends VTN: a CNN backbone compresses each frame into a low-dimensional embedding, a spatial transformer is applied to each embedding, and then a spatio-temporal transformer is applied. The model's biggest advantage is its low latency. Because the authors' goal is an efficient model, every component is designed to be as efficient as possible. Experiments ..

[Paper Reading] X-ViT(2021), Space-time Mixing Attention for Video Transformer

Space-time Mixing Attention for Video Transformer PDF, Video, Adrian Bulat, Juan-Manuel Perez-Rua, Swathikiran Sudhakaran, Brais Martinez, Georgios Tzimiropoulos, arXiv 2021 Summary This paper applies ViT to video. It reduces the computational complexity of self-attention to O(TS^2), and it is worth looking at how it achieves this. Performance is strong, and the model has a huge advantage in terms of FLOPs. The Method section is hard to follow; I think I will need to dig into the code to understand it. my github Seonghoon-Yu/Paper_Review_and_Im..
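To get a feel for what an O(TS^2) cost means, here is a rough counting sketch. It assumes T is the number of frames and S the number of spatial tokens per frame, and compares against full joint space-time attention over all T*S tokens, which scales as O(T^2 S^2). The function names and the example values (8 frames, 196 patches) are hypothetical; constants and the feature dimension are ignored, and this is not the paper's actual attention scheme.

```python
def full_attention_pairs(num_frames, tokens_per_frame):
    """Query-key pairs when every token attends to every token in the clip."""
    total_tokens = num_frames * tokens_per_frame
    return total_tokens ** 2  # O(T^2 * S^2)


def reduced_attention_pairs(num_frames, tokens_per_frame):
    """Query-key pairs for a scheme whose cost grows as O(T * S^2)."""
    return num_frames * tokens_per_frame ** 2


# Hypothetical clip: 8 frames, 14 x 14 = 196 patches per frame.
T, S = 8, 196
full = full_attention_pairs(T, S)
reduced = reduced_attention_pairs(T, S)
print(f"full: {full}, reduced: {reduced}, ratio: {full // reduced}")
```

Under these assumptions the gap between the two counts is exactly a factor of T, so the saving grows linearly with the number of frames.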

Vision Transformer Papers by Field

These are papers I've collected to read later, haha. I'm sharing them in case anyone finds them useful. Feel free to recommend vision transformer papers in the comments, haha. Papers from various fields that exploit the Transformer's advantages: it can be applied across different kinds of data, and its architecture does not need to change with the data type. DPT, depth estimation, https://arxiv.org/abs/2103.13413 Point Transformer, point cloud, https://arxiv.org/abs/2012.09164 Perceiver, audio, video, point clouds, image, https://arxiv.org/abs/2103.03206 UniT, Multimodal, https:..

[Paper Review] Rotation(2018), Unsupervised Representation Learning by Predicting Image Rotations

Unsupervised Representation Learning by Predicting Image Rotations Spyros Gidaris, Praveer Singh, Nikos Komodakis, arXiv 2018 PDF, SSL By SeonghoonYu August 4th, 2021 Summary The ConvNet is trained on a 4-way image classification task: recognizing which of four rotations (0, 90, 180, or 270 degrees) was applied to the image. The task of predicting rotation transformations provides a powerful surrogate supervision signal..
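As a rough illustration of the 4-way rotation pretext task described in this summary, the sketch below builds the four rotated copies of an image together with their class labels using NumPy. The function name `make_rotation_batch`, the toy input, and the counter-clockwise rotation direction are assumptions made for illustration, not taken from the paper's code.

```python
import numpy as np


def make_rotation_batch(image):
    """Return four rotated copies of an H x W (x C) image and labels 0..3.

    Label k means the image was rotated by k * 90 degrees
    (counter-clockwise here, which is an assumption).
    """
    rotated = [np.rot90(image, k=k, axes=(0, 1)) for k in range(4)]
    labels = np.arange(4)
    return rotated, labels


# Toy 3 x 4 "image"; a real pipeline would use actual image tensors.
image = np.arange(12).reshape(3, 4)
rotated, labels = make_rotation_batch(image)
for view, label in zip(rotated, labels):
    print(label, view.shape)
```

A classifier trained on such pairs only ever sees labels that come for free from the transformation, which is what makes the task self-supervised.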

[Paper Review] STM(2019), Spatio Temporal and Motion Encoding for Action Recognition

STM: Spatio Temporal and Motion Encoding for Action Recognition Boyuan Jiang, MengMeng Wang, Weihao Gan, arXiv 2019 PDF, Video By SeonghoonYu August 3rd, 2021 Summary STM consists of the Channel-wise SpatioTemporal Module (CSTM) and the Channel-wise Motion Module (CMM). CSTM encodes the spatiotemporal features from different timestamps, and CMM encodes the motion features between neighboring frames. ..

[Paper Review] Invariant Information Clustering for Unsupervised Image Classification and Segmentation(2018)

Invariant Information Clustering for Unsupervised Image Classification and Segmentation Xu Ji, Joao F. Henriques, Andrea Vedaldi, arXiv 2018 PDF, Clustering By SeonghoonYu July 30th, 2021 Summary This paper presents the IIC model, which achieves SOTA performance on image clustering and image segmentation by maximizing the mutual information between the original image and the transformed image from orig..

[Paper Review] SlowFast Networks for Video Recognition(2018)

SlowFast Networks for Video Recognition Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He, arXiv 2018 PDF, Video By SeonghoonYu July 20th, 2021 Summary They present a two-pathway SlowFast model for video recognition. The two pathways operate separately at low and high temporal resolutions. (1) One is the Slow pathway, designed to capture semantic information that can be given by a few sparse f..
