
Awesome Vision-and-Language Pre-Training

A curated list of vision-and-language pre-training (VLP) resources. :-)

Contributing

Please feel free to send me pull requests or email (chihung.chan@outlook.com) to add links.

Table of Contents

- Papers
  - Survey
  - Research Papers
    - Fusion Encoders
    - Dual Encoders
    - Unified Models
- Datasets
- Evaluation
- Tutorials
- Licenses
- Acknowledgement

Papers

Survey

| Survey | Authors |
| --- | --- |
| A Survey of Vision-Language Pre-Trained Models | Yifan Du, Zikang Liu, Junyi Li, Wayne Xin Zhao |
| VLP: A Survey on Vision-Language Pre-training | Feilong Chen, Duzhen Zhang, Minglun Han, Xiuyi Chen, Jing Shi, Shuang Xu, Bo Xu |
| Vision-and-Language Pretrained Models: A Survey | Siqu Long, Feiqi Cao, Soyeon Caren Han, Haiqin Yang |
| Vision-and-Language Pretraining | Thong Nguyen, Cong-Duy Nguyen, Xiaobao Wu, Anh Tuan Luu |

Research Papers

Fusion Encoders

| Method | Venue | Reference | Authors |
| --- | --- | --- | --- |
| **2019** | | | |
| VisualBERT | Arxiv-2019 | VisualBERT: A Simple and Performant Baseline for Vision and Language | Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang |
| ViLBERT | NeurIPS-2019 | ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks | Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee |
| LXMERT | EMNLP-2019 | LXMERT: Learning Cross-Modality Encoder Representations from Transformers | Hao Tan, Mohit Bansal |
| **2020** | | | |
| ImageBERT | Arxiv-2020 | ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data | Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, Arun Sacheti |
| InterBERT | Arxiv-2020 | InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining | Junyang Lin, An Yang, Yichang Zhang, Jie Liu, Jingren Zhou, Hongxia Yang |
| PixelBERT | Arxiv-2020 | Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers | Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, Jianlong Fu |
| VALUE | ECCV-2020 | Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models | Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, Jingjing Liu |
| UNITER | ECCV-2020 | UNITER: UNiversal Image-TExt Representation Learning | Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu |
| VisDial-BERT | ECCV-2020 | Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline | Vishvak Murahari, Dhruv Batra, Devi Parikh, Abhishek Das |
| OSCAR | ECCV-2020 | Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks | Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, Jianfeng Gao |
| X-LXMERT | EMNLP-2020 | X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers | Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, Aniruddha Kembhavi |
| Unicoder-VL | AAAI-2020 | Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training | Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, Ming Zhou |
| VLP | AAAI-2020 | Unified Vision-Language Pre-Training for Image Captioning and VQA | Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, Jianfeng Gao |
| ERNIE-ViL | AAAI-2021 | ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graphs | Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang |
| VL-BERT | ICLR-2020 | VL-BERT: Pre-training of Generic Visual-Linguistic Representations | Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, Jifeng Dai |
| 12-IN-1 | CVPR-2020 | 12-in-1: Multi-Task Vision and Language Representation Learning | Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, Stefan Lee |
| VILLA | NeurIPS-2020 | Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, Jingjing Liu |
| **2021** | | | |
| X-VLM | Arxiv-2021 | Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | Yan Zeng, Xinsong Zhang, Hang Li |
| KD-VLP | Arxiv-2021 | KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation | Yongfei Liu, Chenfei Wu, Shao-yen Tseng, Vasudev Lal, Xuming He, Nan Duan |
| VLMO | Arxiv-2021 | VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts | Wenhui Wang, Hangbo Bao, Li Dong, Furu Wei |
| UNICORN | Arxiv-2021 | Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling | Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, Lijuan Wang |
| MANGO | Arxiv-2021 | A Closer Look at the Robustness of Vision-and-Language Pre-trained Models | Linjie Li, Zhe Gan, Jingjing Liu |
| XGPT | NLPCC-2021 | XGPT: Cross-modal Generative Pre-Training for Image Captioning | Qiaolin Xia, Haoyang Huang, Nan Duan, Dongdong Zhang, Lei Ji, Zhifang Sui, Edward Cui, Taroon Bharti, Xin Liu, Ming Zhou |
| ROSITA | ACMMM-2021 | ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration | Yuhao Cui, Zhou Yu, Chunqi Wang, Zhongzhou Zhao, Ji Zhang, Meng Wang, Jun Yu |
| Analysis | Findings-2021 | Does Vision-and-Language Pretraining Improve Lexical Grounding? | Tian Yun, Chen Sun, Ellie Pavlick |
| Analysis | TACL-2021 | Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers | Lisa Anne Hendricks, John Mellor, Rosalia Schneider, Jean-Baptiste Alayrac, Aida Nematzadeh |
| Volta | TACL-2021 | Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs | Emanuele Bugliarello, Ryan Cotterell, Naoaki Okazaki, Desmond Elliott |
| VL-T5 | ICML-2021 | Unifying Vision-and-Language Tasks via Text Generation | Jaemin Cho, Jie Lei, Hao Tan, Mohit Bansal |
| ViLT | ICML-2021 | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | Wonjae Kim, Bokyung Son, Ildoo Kim |
| Visual Parsing | NeurIPS-2021 | Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training | Hongwei Xue, Yupan Huang, Bei Liu, Houwen Peng, Jianlong Fu, Houqiang Li, Jiebo Luo |
| ALBEF | NeurIPS-2021 | Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, Steven Hoi |
| E2E-VLP | ACL-2021 | E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning | Haiyang Xu, Ming Yan, Chenliang Li, Bin Bi, Songfang Huang, Wenming Xiao, Fei Huang |
| SOHO | CVPR-2021 | Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, Jianlong Fu |
| VLN-BERT | CVPR-2021 | A Recurrent Vision-and-Language BERT for Navigation | Yicong Hong, Qi Wu, Yuankai Qi, Cristian Rodriguez-Opazo, Stephen Gould |
| VinVL | CVPR-2021 | VinVL: Revisiting Visual Representations in Vision-Language Models | Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, Jianfeng Gao |
| SimVLM | ICLR-2022 | SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, Yuan Cao |
| **2022** | | | |
| mPLUG | Arxiv-2022 | mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng Cao, Ji Zhang, Songfang Huang, Fei Huang, Jingren Zhou |
| CoCa | Arxiv-2022 | Contrastive Captioners are Image-Text Foundation Models | Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu |
| Flamingo | Arxiv-2022 | Flamingo: a Visual Language Model for Few-Shot Learning | Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan |
| BLIP | Arxiv-2022 | BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi |
| Bridge-Tower | Arxiv-2022 | Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning | Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Nan Duan |
| VLMbench | Arxiv-2022 | VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation | Kaizhi Zheng, Xiaotong Chen, Odest Chadwicke Jenkins, Xin Eric Wang |
| MixGen | Arxiv-2022 | MixGen: A New Multi-Modal Data Augmentation | Xiaoshuai Hao, Yi Zhu, Srikar Appalaraju, Aston Zhang, Wanqian Zhang, Bo Li, Mu Li |
| DaVinci | Arxiv-2022 | Prefix Language Models are Unified Modal Learners | Shizhe Diao, Wangchunshu Zhou, Xinsong Zhang, Jiawei Wang |
| MetaLM | Arxiv-2022 | Language Models are General-Purpose Interfaces | Yaru Hao, Haoyu Song, Li Dong, Shaohan Huang, Zewen Chi, Wenhui Wang, Shuming Ma, Furu Wei |
| VL-BEIT | Arxiv-2022 | VL-BEiT: Generative Vision-Language Pretraining | Hangbo Bao, Wenhui Wang, Li Dong, Furu Wei |
| VLUE | Arxiv-2022 | VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models | Wangchunshu Zhou, Yan Zeng, Shizhe Diao, Xinsong Zhang |
| VL-CheckList | Arxiv-2022 | VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations | Tiancheng Zhao, Tianqi Zhang, Mingwei Zhu, Haozhan Shen, Kyusong Lee, Xiaopeng Lu, Jianwei Yin |
| Analysis | AAAI-2022 | Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective | Emmanuelle Salin, Badreddine Farah, Stéphane Ayache, Benoit Favre |
| CLIP-ViL | ICLR-2022 | How Much Can CLIP Benefit Vision-and-Language Tasks? | Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer |
| METER | CVPR-2022 | An Empirical Study of Training End-to-End Vision-and-Language Transformers | Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, Zicheng Liu, Michael Zeng |
| UVLP | CVPR-2022 | Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment | Mingyang Zhou, Licheng Yu, Amanpreet Singh, Mengjiao Wang, Zhou Yu, Ning Zhang |
| TCL | CVPR-2022 | Vision-Language Pre-Training with Triple Contrastive Learning | Jinyu Yang, Jiali Duan, Son Tran, Yi Xu, Sampath Chanda, Liqun Chen, Belinda Zeng, Trishul Chilimbi, Junzhou Huang |
| OFA | ICML-2022 | Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, Hongxia Yang |
| VLMixer | ICML-2022 | VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix | Teng Wang, Wenhao Jiang, Zhichao Lu, Feng Zheng, Ran Cheng, Chengguo Yin, Ping Luo |
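
Architecturally, the fusion-encoder models above share one idea: text tokens and visual features (detector regions or image patches) are concatenated and processed by a single transformer with joint self-attention, topped with heads such as masked language modeling or image-text matching. The sketch below illustrates that pattern in PyTorch. It is a minimal, hypothetical single-stream encoder in the spirit of ViLT/VisualBERT; the dimensions, the linear patch embedding, and the `itm_head` are assumptions for illustration, not any paper's actual implementation.

```python
import torch
import torch.nn as nn

class TinyFusionEncoder(nn.Module):
    """Minimal single-stream fusion encoder (illustrative only): text tokens and
    image patches share one transformer with joint self-attention."""

    def __init__(self, vocab_size=30522, dim=256, patch=32, img_size=224, layers=4):
        super().__init__()
        self.txt_emb = nn.Embedding(vocab_size, dim)
        # Linear patch projection instead of a region detector (ViLT-style assumption).
        self.patch_emb = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.type_emb = nn.Embedding(2, dim)                  # 0 = text, 1 = image
        n_patches = (img_size // patch) ** 2
        self.pos_txt = nn.Parameter(torch.zeros(1, 64, dim))  # up to 64 text tokens
        self.pos_img = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.itm_head = nn.Linear(dim, 2)                     # image-text matching head

    def forward(self, token_ids, images):
        t = self.txt_emb(token_ids) + self.pos_txt[:, : token_ids.size(1)]
        t = t + self.type_emb(torch.zeros_like(token_ids))
        v = self.patch_emb(images).flatten(2).transpose(1, 2) + self.pos_img
        v = v + self.type_emb(torch.ones(v.shape[:2], dtype=torch.long, device=v.device))
        x = self.encoder(torch.cat([t, v], dim=1))            # joint attention over both modalities
        return self.itm_head(x[:, 0])                         # matched vs. mismatched pair

model = TinyFusionEncoder()
logits = model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 3, 224, 224))  # -> (2, 2)
```

Region-based models (e.g., UNITER, OSCAR) would replace the patch projection with detector features, and most papers add masked language modeling and other objectives on top of the same joint encoder.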

Dual Encoders

| Method | Venue | Reference | Authors |
| --- | --- | --- | --- |
| **2021** | | | |
| ALIGN | Arxiv-2021 | Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig |
| FILIP | Arxiv-2021 | FILIP: Fine-grained Interactive Language-Image Pre-Training | Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, Chunjing Xu |
| SLIP | Arxiv-2021 | SLIP: Self-supervision meets Language-Image Pre-training | Norman Mu, Alexander Kirillov, David Wagner, Saining Xie |
| CLIP | ICML-2021 | Learning Transferable Visual Models From Natural Language Supervision | Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever |
| **2022** | | | |
| Analysis | Arxiv-2022 | Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP) | Alex Fang, Gabriel Ilharco, Mitchell Wortsman, Yuhao Wan, Vaishaal Shankar, Achal Dave, Ludwig Schmidt |
| ProtoCLIP | Arxiv-2022 | Prototypical Contrastive Language Image Pretraining | Delong Chen, Zhao Wu, Fan Liu, Zaiquan Yang, Yixiang Huang, Yiping Bao, Erjin Zhou |
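
What the dual-encoder models above (CLIP, ALIGN, FILIP, SLIP) have in common is the training signal: images and texts are encoded independently, and matched pairs are pulled together with a symmetric contrastive loss over the in-batch similarity matrix. Below is a minimal sketch of that loss; the fixed `temperature` and the random stand-in embeddings are simplifications (CLIP, for instance, learns the temperature), not a reproduction of any specific codebase.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric in-batch contrastive loss of CLIP/ALIGN-style dual encoders.
    img_emb, txt_emb: (batch, dim) outputs of independent image and text towers."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature               # (batch, batch) similarities
    targets = torch.arange(len(logits), device=logits.device)  # diagonal = matched pairs
    loss_i2t = F.cross_entropy(logits, targets)                # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)            # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Random stand-in embeddings in place of real encoder outputs:
loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))
```

Because the two towers never attend to each other, retrieval over large galleries reduces to a matrix multiplication over precomputed embeddings, which is the main efficiency argument for this family.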

Unified Models

| Method | Venue | Reference | Authors |
| --- | --- | --- | --- |
| **2021** | | | |
| ViT-BERT | Arxiv-2021 | Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text | Qing Li, Boqing Gong, Yin Cui, Dan Kondratyuk, Xianzhi Du, Ming-Hsuan Yang, Matthew Brown |
| UNIMO | ACL-2021 | UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning | Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, Haifeng Wang |
| **2022** | | | |
| SkillNet | Arxiv-2022 | One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code | Yong Dai, Duyu Tang, Liangxin Liu, Minghuan Tan, Cong Zhou, Jingquan Wang, Zhangyin Feng, Fan Zhang, Xueyu Hu, Shuming Shi |
| data2vec | Arxiv-2022 | data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language | Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli |
| Unified-IO | Arxiv-2022 | Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks | Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, Aniruddha Kembhavi |
| Uni-Perceiver | CVPR-2022 | Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks | Xizhou Zhu, Jinguo Zhu, Hao Li, Xiaoshi Wu, Xiaogang Wang, Hongsheng Li, Xiaohua Wang, Jifeng Dai |
| FLAVA | CVPR-2022 | FLAVA: A Foundational Language And Vision Alignment Model | Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela |

Datasets

| Dataset | Images | Image-Text Pairs | Duration (hrs) | Note |
| --- | --- | --- | --- | --- |
| SBU | 875k | 875k | - | reference, website |
| Flickr30k | 29k | 145k | - | reference, website |
| COCO | 113k | 567k | - | reference, website |
| COCO/OI Narratives | 849k | 873k | - | reference, website |
| VG | 108k | 5.4m | - | reference, website |
| VGQA | 108k | 1.8m | - | reference, website |
| VQA | 83k | 444k | - | reference, website |
| GQA | 82k | 1m | - | reference, website |
| CC3M | 3m | 3m | - | reference, website |
| CC12M | 12m | 12m | - | reference, website |
| YFCC-15M | 15m | 15m | - | reference, website |
| WebImageText | 400m | 400m | - | reference |
| LAION-400M | 400m | 400m | - | website |
| LAION-2B | 2b | 2b | - | website |
| RedCaps | 12m | 12m | - | reference, website |
| AltText | 1.8b | 1.8b | - | reference |
| ImageNet-Captions | 464k | 464k | - | reference, website |
| Kinetics | - | - | 1.4k | reference, website |
| TVQA | - | - | 0.4k | reference, website |
| HT100M | - | - | 134k | reference, website |
| WebVid2M | - | - | 13k | reference, website |
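
Most image-text corpora in this table ship as image files (or URLs) plus a caption index, so a loader usually reduces to reading an index file and opening images on demand. The snippet below is a hypothetical example of such a loader; the `captions.tsv` name and its `image_path<TAB>caption` layout are assumptions for illustration and do not correspond to any dataset's official release format.

```python
import csv
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset

class ImageTextPairs(Dataset):
    """Hypothetical image-caption dataset: an index TSV with `image_path<TAB>caption` rows."""

    def __init__(self, root, tsv_name="captions.tsv", transform=None):
        self.root = Path(root)
        with open(self.root / tsv_name, newline="") as f:
            self.rows = [(path, caption) for path, caption in csv.reader(f, delimiter="\t")]
        self.transform = transform

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        path, caption = self.rows[idx]
        image = Image.open(self.root / path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)   # e.g., torchvision transforms
        return image, caption
```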

Evaluation

The following content is adapted from this survey.

| Task | Description |
| --- | --- |
| **1. Classification** | |
| Visual Question Answering (VQA) | Given a visual input (an image or video), VQA is the task of correctly answering a question about it. |
| Visual Reasoning and Compositional Question Answering (GQA) | GQA is an upgraded version of VQA that aims to advance research on visual reasoning over natural scenes. |
| Natural Language for Visual Reasoning (NLVR) | The input to NLVR is two images and a text description, and the output is whether the description is consistent with the image pair (two labels: true or false). |
| Visual Entailment (VE) | In VE, the image is the premise and the text is the hypothesis; the goal is to predict whether the image entails the text. There are three labels: Entailment, Neutral, and Contradiction. |
| Visual Commonsense Reasoning (VCR) | VCR is posed as multiple-choice questions: for each question, the model must choose the correct answer from several candidates and then select, from several candidate rationales, the reason for choosing that answer. |
| Grounding Referring Expressions (GRE) | GRE is the task of localizing an image region given a textual reference. The model outputs a score for each candidate region, and the highest-scoring region is taken as the prediction. |
| Visual Spatial Reasoning (VSR) | The VSR corpus is a collection of caption-image pairs with true/false labels. Each caption describes the spatial relation between two objects in the image, and a vision-language model (VLM) must judge whether the caption describes the image correctly (True) or not (False). |
| **2. Regression** | |
| Multi-modal Sentiment Analysis (MSA) | MSA aims to detect sentiment in videos by leveraging multi-modal signals (e.g., vision and language), predicting the affective orientation of an utterance as a continuous intensity value. |
| **3. Retrieval** | |
| Vision-Language Retrieval (VLR) | VLR requires understanding both the vision (image or video) and language domains with appropriate matching strategies. It includes two subtasks: vision-to-text retrieval, which fetches the most relevant text descriptions from a large pool given a visual query, and text-to-vision retrieval, which does the reverse (see the Recall@K sketch after this table). |
| **4. Generation** | |
| Visual Captioning (VC) | VC aims to generate semantically and syntactically appropriate text descriptions for a given visual (image or video) input. |
| Novel Object Captioning at Scale (NoCaps) | NoCaps extends VC to test a model's ability to describe novel objects from the Open Images dataset that are unseen in the training corpus. |
| Visual Dialogue (VD) | Given an image (or video), a dialogue history, and a question, VD requires the model to generate an answer to the question. |
| **5. Others** | |
| Multi-modal Machine Translation (MMT) | MMT is a two-fold task of translation and text generation: translating text from one language to another with additional information from other modalities, e.g., images. |
| Vision-Language Navigation (VLN) | VLN is a language-grounding task in which an agent navigates a real-world environment, seeing and exploring its dynamics while following linguistic instructions. |
| Optical Character Recognition (OCR) | OCR refers to detecting and recognizing text in images; it comprises text detection (similar to regression) and text recognition (similar to classification). |
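
For the retrieval task above (VLR), results are conventionally reported as Recall@K: the fraction of queries whose matched item appears among the top-K ranked candidates. Given image and text embeddings from a dual encoder (or pairwise scores from a fusion encoder), the metric is a simple ranking computation. The sketch below assumes matched image/text pairs share the same row index and uses random embeddings as stand-ins for real encoder outputs.

```python
import torch
import torch.nn.functional as F

def recall_at_k(img_emb, txt_emb, ks=(1, 5, 10)):
    """Image-to-text Recall@K; row i of img_emb and txt_emb form a matched pair."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sims = img_emb @ txt_emb.t()                    # (N, N) similarity matrix
    ranks = sims.argsort(dim=1, descending=True)    # candidate captions, best first
    gt = torch.arange(len(sims)).unsqueeze(1)
    hit_rank = (ranks == gt).float().argmax(dim=1)  # rank position of the true caption
    return {f"R@{k}": (hit_rank < k).float().mean().item() for k in ks}

# Random stand-in embeddings; real evaluations use encoder outputs on COCO/Flickr30k test sets.
print(recall_at_k(torch.randn(100, 256), torch.randn(100, 256)))
```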

Tutorials

Licenses

CC0

To the extent possible under law, Zhihong Chen has waived all copyright and related or neighboring rights to this work.

Acknowledgement

This repo started from this survey. We thank the authors for their comprehensive review of existing studies.
