PolyViT: Co-training Vision Transformers on Images, Videos and Audio #2107

Open
icoxfog417 opened this issue Dec 4, 2021 · 0 comments

Comments

@icoxfog417
Member

In a nutshell

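Proposes a Transformer that is trained jointly on images, video, and audio. The core idea is to split a 2D image into patches and apply a learned linear projection that turns each patch into a fixed-length vector. Video is likewise cut into non-overlapping pieces and processed the same way, and audio is handled by treating its spectrogram as an image. Achieves SOTA on video and audio classification.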
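The tokenization described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the single-channel inputs, the patch and tubelet sizes, and the random projection weights are all assumptions made for the example. Images and spectrograms go through the same 2D patch path, and video groups the same spatial patch across a few consecutive frames into one non-overlapping piece.

```python
import random


def split_into_patches(grid, patch):
    """Split an H x W grid (list of rows) into non-overlapping patch x patch
    blocks, each flattened into a list of patch * patch values."""
    height, width = len(grid), len(grid[0])
    if height % patch or width % patch:
        raise ValueError("grid size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([grid[top + dy][left + dx]
                            for dy in range(patch) for dx in range(patch)])
    return patches


def project(vectors, weights):
    """Linear projection: multiply each length-P vector by a P x D matrix."""
    columns = list(zip(*weights))
    return [[sum(v * w for v, w in zip(vec, col)) for col in columns]
            for vec in vectors]


def tokenize_image(grid, patch, weights):
    """Image (or spectrogram) -> list of fixed-length token vectors."""
    return project(split_into_patches(grid, patch), weights)


def tokenize_video(frames, patch, tube, weights):
    """Video -> tokens from non-overlapping pieces spanning `tube` frames."""
    if len(frames) % tube:
        raise ValueError("frame count must be divisible by the tubelet length")
    tokens = []
    for start in range(0, len(frames), tube):
        per_frame = [split_into_patches(f, patch)
                     for f in frames[start:start + tube]]
        # Concatenate the same spatial patch across the frames of the piece.
        pieces = [sum(parts, []) for parts in zip(*per_frame)]
        tokens.extend(project(pieces, weights))
    return tokens


# Hypothetical toy inputs; all sizes are illustrative assumptions.
rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


image = random_matrix(8, 8)
spectrogram = random_matrix(8, 12)               # time x frequency, used as an image
video = [random_matrix(8, 8) for _ in range(4)]  # 4 frames

image_tokens = tokenize_image(image, patch=4, weights=random_matrix(16, 6))
audio_tokens = tokenize_image(spectrogram, patch=4, weights=random_matrix(16, 6))
video_tokens = tokenize_video(video, patch=4, tube=2, weights=random_matrix(32, 6))

print(len(image_tokens), len(audio_tokens), len(video_tokens))
```

Each modality ends up as a sequence of vectors of the same width (6 in this toy setup), which is what allows a single Transformer to consume all three.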

Screenshot

Paper link

https://arxiv.org/abs/2111.12993

Authors / Affiliations

Valerii Likhosherstov, Anurag Arnab, Krzysztof Choromanski, Mario Lucic, Yi Tay, Adrian Weller, Mostafa Dehghani

  • Google Research
  • University of Cambridge
  • Alan Turing Institute

Submission date (yyyy/MM/dd)

2021/11/25

Overview

Novelty / Differences

Method

Results

Comments
