Bridging the Gap: A Comprehensive Guide to Vision-Language Models for the Modern Researcher

Oğuzhan KOÇAKLI
5 min read · May 29, 2024

Introduction

In recent years, language modeling has advanced significantly, particularly with Large Language Models (LLMs) such as Llama and ChatGPT. These models can solve a wide array of tasks, which has driven their popularity and adoption. Initially limited to text inputs, they have since been extended to accept visual inputs as well. Integrating vision with language unlocks numerous applications crucial to the ongoing AI-driven technological revolution. However, connecting language to vision is not without its challenges. Many models struggle to understand spatial relationships, count accurately, or comprehend attributes and ordering without complex engineering solutions involving additional data annotation. Furthermore, some Vision-Language Models (VLMs) hallucinate, producing irrelevant or inaccurate content, and require significant prompt engineering to achieve the desired outcomes. Consequently, developing reliable models remains an active and essential area of research.

This blog post aims to provide an accessible introduction to Vision-Language Models (VLMs): what they are, how they work, and how they are trained. We also discuss effective evaluation approaches for different research goals. Although this post is not a complete guide or survey of VLMs, it offers a clear overview, especially for students and researchers new to the field.

The Families of VLMs

Given the substantial progress in deep learning within the fields of computer vision and natural language processing, several initiatives have emerged to bridge these domains. This post will focus on the most recent techniques based on transformers, categorizing these initiatives into four primary training paradigms: contrastive training, masking, pretrained backbones, and generative models.

Contrastive-Based VLMs

Contrastive-based training is a commonly used strategy leveraging pairs of positive and negative examples. The VLM is trained to predict similar representations for positive pairs while predicting different representations for negative pairs. One prominent example of this approach is the CLIP (Contrastive Language-Image Pre-training) model, which trains vision and text encoders to map images and their corresponding captions to similar embedding vectors using a contrastive loss. This method has shown remarkable zero-shot classification capabilities, making it a popular choice for vision-language tasks.
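
To make the contrastive objective concrete, here is a minimal sketch of a CLIP-style symmetric InfoNCE loss in PyTorch. The embedding dimension, batch size, and temperature below are illustrative placeholders, and the random tensors at the end stand in for real encoder outputs.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) tensors from the vision and text encoders.
    Matching rows are positive pairs; all other rows in the batch act as negatives.
    """
    # L2-normalize so that dot products are cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image matches the i-th caption
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings standing in for encoder outputs
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt).item())
```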

VLMs with Masking Objectives

Masking is a widely used technique in deep learning, particularly effective in the transformer architecture. This method involves reconstructing masked image patches or text tokens, allowing VLMs to predict missing parts. Models like FLAVA and MaskVLM leverage this approach, combining multiple training strategies to learn robust text and image representations.
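
The sketch below illustrates the masking idea on image patches only: a fraction of patch embeddings is replaced with a learnable mask token, and a small transformer is trained to reconstruct the originals. The architecture, dimensions, and mask ratio are toy choices for illustration, not the actual FLAVA or MaskVLM setups (which also mask text tokens and use far larger encoders).

```python
import torch
import torch.nn as nn

class MaskedPatchReconstructor(nn.Module):
    """Toy masked-image-modeling head: predict masked patch embeddings.

    A simplified stand-in for the masking objectives used by models such as
    FLAVA or MaskVLM; shapes and depths here are illustrative only.
    """
    def __init__(self, dim=256, num_patches=196, mask_ratio=0.4):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(dim))
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(dim, dim)

    def forward(self, patches):
        # patches: (batch, num_patches, dim)
        b, n, d = patches.shape
        mask = torch.rand(b, n, device=patches.device) < self.mask_ratio
        # Replace masked positions with the learnable mask token
        corrupted = torch.where(mask.unsqueeze(-1), self.mask_token.expand(b, n, d), patches)
        pred = self.head(self.encoder(corrupted))
        # Reconstruction loss only on the masked positions
        return ((pred - patches) ** 2)[mask].mean()

model = MaskedPatchReconstructor()
loss = model(torch.randn(2, 196, 256))
loss.backward()
```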

Generative-Based VLMs

Generative models take a different approach, focusing on generating text and/or images. These models can produce entire images or long captions, making them useful for a variety of vision-language tasks. Examples include CoCa (Contrastive Captioner), which uses both contrastive and generative losses, and CM3Leon, a foundation model capable of text-to-image and image-to-text generation.
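
The following sketch shows how a CoCa-style objective can combine the two losses: a contrastive term over paired embeddings plus a standard next-token captioning term from an image-conditioned decoder. The loss weights and shapes are illustrative, not the values used in the CoCa paper.

```python
import torch
import torch.nn.functional as F

def coca_style_loss(image_emb, text_emb, caption_logits, caption_targets,
                    contrastive_weight=1.0, caption_weight=2.0, temperature=0.07):
    """Schematic CoCa-style objective: contrastive loss + captioning loss.

    caption_logits: (batch, seq_len, vocab) outputs of an image-conditioned decoder.
    caption_targets: (batch, seq_len) ground-truth caption token ids.
    The weights here are illustrative, not the paper's hyperparameters.
    """
    # Contrastive part (same form as the CLIP-style loss above)
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets)) / 2

    # Generative part: next-token cross-entropy over the caption
    captioning = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_targets.reshape(-1),
    )
    return contrastive_weight * contrastive + caption_weight * captioning
```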

VLMs from Pretrained Backbones

Given the high computational cost of training VLMs from scratch, many recent models instead build on pretrained large language models (LLMs) and pretrained vision feature extractors. These models, such as Frozen and MiniGPT-4, learn a mapping between the text and image modalities, enabling the LLM to answer visual questions with minimal additional training.
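
A common recipe is to keep both backbones frozen and train only a small bridge that projects visual features into the LLM’s embedding space, as sketched below. The module name, dimensions, and number of visual tokens are hypothetical; MiniGPT-4, for instance, trains a single linear projection on top of a Q-Former.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Trainable bridge from a frozen vision encoder to a frozen LLM.

    Only this projection is updated; the vision encoder and the LLM stay
    frozen, which is the key idea behind models like Frozen and MiniGPT-4.
    Dimensions here are illustrative.
    """
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_features):
        # vision_features: (batch, num_visual_tokens, vision_dim) from a frozen encoder.
        # Returns pseudo "token embeddings" the LLM can consume as a prefix.
        return self.proj(vision_features)

# Sketch: prepend projected visual tokens to the text embeddings and optimize
# only the projector with the usual language-modeling loss.
projector = VisionToLLMProjector()
visual_tokens = projector(torch.randn(2, 32, 1024))   # (2, 32, 4096)
# llm_inputs = torch.cat([visual_tokens, text_token_embeddings], dim=1)  # hypothetical
```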

Training Vision-Language Models

Training VLMs involves several critical steps and considerations, including selecting appropriate datasets, leveraging existing software, and optimizing compute resources.

Training Data

Data quality is paramount for training VLMs. Effective strategies include using diverse and balanced datasets, improving training data with synthetic samples, and applying data augmentation. Filtering methods, such as CLIP-Score and ranking based on pretrained VLMs, help in selecting high-quality image-text pairs. Additionally, techniques like interleaved data curation and assessing multimodal data quality are essential for creating robust training datasets.
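
As an example of CLIP-Score filtering, the sketch below scores each image-caption pair with a pretrained CLIP checkpoint from Hugging Face Transformers and keeps only pairs above a threshold. The checkpoint choice and the 0.28 cutoff are illustrative; in practice the threshold is tuned per dataset.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Pretrained CLIP used purely as a scorer for image-text pairs
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

@torch.no_grad()
def clip_scores(images, captions):
    """Cosine similarity between each image and its own caption."""
    inputs = processor(text=captions, images=images, return_tensors="pt",
                       padding=True, truncation=True)
    img_feats = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_feats = model.get_text_features(input_ids=inputs["input_ids"],
                                        attention_mask=inputs["attention_mask"])
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    return (img_feats * txt_feats).sum(dim=-1)  # per-pair cosine similarity

# images: list of PIL.Image, captions: list of str -- placeholders here
# scores = clip_scores(images, captions)
# keep = [pair for pair, s in zip(zip(images, captions), scores) if s > 0.28]
```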

Software and Compute Resources

Several public software repositories, such as OpenCLIP and Hugging Face Transformers, facilitate the training and evaluation of VLMs. Understanding the required compute resources and optimizing training speed with tools like PyTorch’s torch.compile and data-loading libraries such as FFCV are crucial for efficient model training.
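
For instance, wrapping a model with torch.compile (available since PyTorch 2.0) is often a one-line change, as in the toy training step below; the model and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

# Any trainable component (here a toy encoder) can be wrapped with torch.compile
# to fuse kernels and speed up training steps.
model = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512))
compiled_model = torch.compile(model)

optimizer = torch.optim.AdamW(compiled_model.parameters(), lr=1e-4)
x, y = torch.randn(32, 512), torch.randn(32, 512)

# The first call triggers compilation; subsequent steps reuse the compiled graph.
loss = nn.functional.mse_loss(compiled_model(x), y)
loss.backward()
optimizer.step()
```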

Model Selection

Choosing the right model depends on the specific task and research goals. Contrastive models like CLIP are effective for associating text with visual concepts, while masking-based models excel in reconstructing missing data. Generative models are ideal for tasks requiring the generation of new content from text or images. Leveraging pretrained backbones can significantly reduce training costs and time.

Improving Grounding and Alignment

Ensuring VLMs correctly associate words with specific concepts and produce expected answers from a human perspective involves several techniques:

  1. Improving Grounding: Using bounding box annotations or negative captioning enhances the model’s ability to tie text to visual cues (a code sketch follows this list).
  2. Improving Alignment: Techniques like in-context learning and multimodal fine-tuning help align the model’s outputs with human expectations.
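
As a concrete example of negative captioning, the sketch below ranks the true caption above a hard negative (e.g., a caption with swapped objects or attributes) using a margin loss. The loss form and margin value are illustrative rather than taken from any specific paper.

```python
import torch
import torch.nn.functional as F

def negative_caption_loss(image_emb, pos_text_emb, neg_text_emb, margin=0.2):
    """Margin loss that ranks the true caption above a hard negative caption.

    The negative caption is typically a perturbed version of the true one
    (e.g., "a dog on a sofa" vs. "a sofa on a dog"), which pushes the model
    to attend to word order, attributes, and relations.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    pos_sim = (image_emb * F.normalize(pos_text_emb, dim=-1)).sum(-1)
    neg_sim = (image_emb * F.normalize(neg_text_emb, dim=-1)).sum(-1)
    # Hinge: positive similarity should exceed the negative one by `margin`
    return F.relu(margin - (pos_sim - neg_sim)).mean()
```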

Evaluating Vision-Language Models

Robust evaluation of VLMs is essential for understanding their performance and limitations. Key evaluation areas include benchmarking visio-linguistic abilities, assessing biases and disparities, and measuring hallucinations and memorization.

Benchmarking Visio-Linguistic Abilities

Evaluating VLMs on tasks like image captioning, text-to-image consistency, visual question answering, and zero-shot image classification provides insights into their visio-linguistic capabilities. Understanding how well a model performs on these tasks helps identify areas for improvement and guides future research.
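
For example, zero-shot image classification can be run in a few lines with a pretrained CLIP checkpoint through the Hugging Face zero-shot-image-classification pipeline; the image path and candidate labels below are placeholders.

```python
from transformers import pipeline

# Zero-shot classification with a pretrained CLIP checkpoint: class names are
# turned into captions and the highest-scoring one wins.
classifier = pipeline("zero-shot-image-classification",
                      model="openai/clip-vit-base-patch32")

# "cat.jpg" is a placeholder path; any local image or PIL.Image works.
predictions = classifier("cat.jpg",
                         candidate_labels=["a photo of a cat",
                                           "a photo of a dog",
                                           "a photo of a car"])
print(predictions[0])  # e.g. {'score': ..., 'label': 'a photo of a cat'}
```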

Benchmarking Bias and Disparities

Assessing biases in VLMs is crucial for ensuring fair and equitable performance across different demographics and contexts. Techniques for benchmarking bias include evaluating classifications and embeddings for disparities and understanding how specific concepts in the training data impact downstream performance.

Benchmarking Hallucinations and Memorization

Ensuring VLMs do not produce irrelevant or inaccurate content involves measuring hallucinations and memorization. Techniques for assessing these aspects help identify and mitigate potential issues, improving the reliability and accuracy of the models.

Extending VLMs to Videos

The next generation of VLMs will be capable of understanding videos by mapping video content to language. This extension poses additional challenges, such as higher computational costs and the need to effectively map the temporal dimension of video data through text. Current methods and research focus on addressing these challenges to enable robust video-to-language modeling.

Conclusion

By lowering the barrier to entry into VLM research, this introduction aims to provide the foundations for more responsible development of VLMs while pushing the boundaries of vision understanding. Vision-Language Models have the potential to significantly impact our interaction with technology, and understanding their mechanics is crucial for advancing this field. Whether you are a student or a researcher, the insights provided in this blog post will help you navigate the complexities of VLMs and contribute to the ongoing advancements in AI technology.
