What Makes Molmo and PixMo Game-Changers in VLMs?

The most powerful VLMs available today remain proprietary, limiting open research exploration. Open models often lag due to dependency on synthetic data generated by proprietary models, restricting true openness. Molmo, a sophisticated vision-language model, seeks to bridge this gap by creating high-quality multimodal capabilities built from open datasets and independent training methods.

PixMo, the accompanying dataset, was designed to overcome the traditional limitations of data accessibility in VLM development. The team collected extensive image-caption pairs using human speech annotations, which resulted in high-density captions free from the constraints of synthetic datasets.

Molmo’s architecture follows a standard multimodal design: it combines a vision encoder and a language model to create a vision-language model capable of processing both images and text.

Overview

  • PixMo Datasets (the success factor for Molmo)
  • Key Components of the Molmo Architecture
    • Image Pre-processor: Converts input images into a set of multi-scale, multi-crop sections.
    • Vision Encoder (CLIP ViT-L/14 336px)
    • Connector (MLP-based projection): Projection of image embeddings to language model’s dimension.
    • Decoder-Only Transformer LLM.
  • Training Pipeline: Two Stages
    • Multimodal Pre-Training for Caption Generation
    • Supervised Fine-Tuning on Diverse Tasks
  • Evaluation of Molmo on 11 benchmark datasets
  • Hands-on experimentation with Molmo (code)

PixMo Datasets – The Main Component of Molmo’s Success

  • PixMo-Cap: Annotators were asked to describe images in speech for 60-90 seconds, providing detailed and dense image captions. The speech was further transcribed and passed through a language model to clean the text (remove spoken artifacts, normalize style). The data contains detailed, dense captions for over 712k images.
  • PixMo-AskModelAnything: Annotators generate diverse question-answer pairs with images.
  • PixMo-Points: This dataset includes point-based annotations, enabling Molmo to point, answer location-based questions, and count objects directly by pointing, adding a spatial dimension to visual understanding.
  • Other datasets: These include synthetic clock datasets (question answering on analog clocks) (PixMo-Clocks) and document-heavy datasets (PixMo-Docs, PixMo-CapQA).
PixMo datasets (Source: Author)

The Architecture of Molmo and Its Design Decisions

Molmo architecture (Source: Author)

Input Processing: Multi-Scale, Multi-Crop Images

The input to Molmo is generated by applying multi-scale and multi-crop transformations to the original image. In multi-crop training, multiple crops (sections) of the same image are taken from different regions, often at various scales and resolutions. Each crop provides a different perspective or focus area of the image.

  • Purpose: Multi-crop training is designed to give the model a richer, more diverse understanding of the entire image by exposing it to more details and perspectives. This helps it generalize better, especially on high-resolution images with complex scenes.
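
To make the idea concrete, here is a minimal sketch of producing one low-resolution global view plus a grid of higher-resolution local crops with Pillow. The crop size and the 2×2 grid are illustrative assumptions, not Molmo’s exact tiling scheme.

from PIL import Image

def build_multiscale_crops(image, crop_size=336, grid=(2, 2)):
    # One resized view of the whole image (global context)
    views = [image.resize((crop_size, crop_size))]
    width, height = image.size
    tile_w, tile_h = width // grid[0], height // grid[1]
    # Plus a grid of local crops, each resized to the encoder's input resolution
    for row in range(grid[1]):
        for col in range(grid[0]):
            box = (col * tile_w, row * tile_h, (col + 1) * tile_w, (row + 1) * tile_h)
            views.append(image.crop(box).resize((crop_size, crop_size)))
    return views

# image = Image.open("your_image.png").convert("RGB")
# crops = build_multiscale_crops(image)  # 1 global view + 4 local crops

Each view is then encoded by the vision encoder, and the resulting tokens are combined downstream.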

Vision Encoder: OpenAI’s ViT-L/14 336px CLIP Model

The core of Molmo’s visual processing is OpenAI’s CLIP (Contrastive Language-Image Pre-training) model, a powerful Vision Transformer (ViT) optimized for high-resolution inputs.

  • Why did Molmo choose OpenAI’s CLIP instead of SigLIP? Through experimentation, CLIP proved superior to alternatives like SigLIP in handling multi-scale, multi-crop, and high-resolution data. SigLIP, on the other hand, performs better in single-crop scenarios but struggles with the demands of multi-crop training, potentially missing out on the richer contextual understanding that Molmo requires.
  • Mathematical and Conceptual Intuition: CLIP’s architecture uses attention layers that weigh the importance of image patches based on spatial and feature-related relevance. Each patch effectively attends to others, forming a comprehensive image representation. This aligns well with multi-scale processing because CLIP can leverage both local patch details and the broader context in its final tokenized representation. SigLIP’s simpler processing pipeline likely restricted its ability to generalize as effectively under similar conditions.
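
As a quick, hedged illustration of what the encoder hands to the rest of the pipeline, the snippet below pulls patch-level tokens from the public openai/clip-vit-large-patch14-336 checkpoint with Hugging Face transformers. Molmo’s internal wiring differs; the point here is the shape of the output.

import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

clip_name = "openai/clip-vit-large-patch14-336"
vision_encoder = CLIPVisionModel.from_pretrained(clip_name)
image_processor = CLIPImageProcessor.from_pretrained(clip_name)

image = Image.open("your_image.png").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    outputs = vision_encoder(pixel_values)

# 336 / 14 = 24 patches per side -> 24 * 24 = 576 patch tokens (plus 1 CLS token)
patch_tokens = outputs.last_hidden_state[:, 1:, :]
print(patch_tokens.shape)  # torch.Size([1, 576, 1024])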

Connector: Multi-Layer Perceptron (MLP) and Pooling

The connector is a carefully constructed MLP that projects the high-dimensional tokens from CLIP to match the input space (dimensions) the language model requires. Following this projection, a pooling layer performs dimensionality reduction, ensuring the visual tokens are condensed to a manageable size for the language model without sacrificing key visual details.

Dimensionality Reduction Through Pooling: Pooling selects and averages key features across the visual tokens. Conceptually, this can be thought of as a summary of visual information—just enough detail to inform the language model without overwhelming it.
Example: Imagine a cityscape image divided into 100 tokens by the vision encoder. Pooling condenses these tokens by summarizing key features, prioritizing prominent structures (like buildings), and reducing redundancy in repetitive areas (like the sky). This results in a smaller, focused set of around 20 tokens, capturing only the most essential details for efficient processing by the language model.
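
A minimal sketch of such a connector is shown below. The average pooling over pairs of tokens and the dimensions (1024-d CLIP tokens projected to a 4096-d LLM width) are illustrative assumptions, not Molmo’s exact configuration.

import torch
import torch.nn as nn

class Connector(nn.Module):
    """Pool the visual tokens, then project them to the language model's width."""
    def __init__(self, vision_dim=1024, llm_dim=4096, pool=2):
        super().__init__()
        self.pool = pool
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens):  # (batch, num_tokens, vision_dim)
        b, n, d = visual_tokens.shape
        # Average every `pool` consecutive tokens: n tokens -> n // pool tokens
        pooled = visual_tokens.reshape(b, n // self.pool, self.pool, d).mean(dim=2)
        return self.mlp(pooled)        # (batch, n // pool, llm_dim)

connector = Connector()
dummy_tokens = torch.randn(1, 576, 1024)   # e.g. CLIP patch tokens
print(connector(dummy_tokens).shape)       # torch.Size([1, 288, 4096])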

Language Model (LLM): Decoder-Only Transformer

Molmo’s vision encoder remains consistent across variants, employing CLIP’s ViT-L/14 model for all versions. However, Molmo’s LLM component varies based on requirements for capacity, openness, and compute efficiency:

  • Model Variants for Language Processing: Molmo provides flexibility by allowing various LLMs, including OLMo (7B-1024), OLMoE-1B-7B, and larger models like Qwen2 and Mistral. These LLMs differ in their parameter scales and openness, from efficient smaller models to high-capacity variants capable of handling complex language and image interactions.
  • Reasoning Behind Multiple LLMs: By offering a variety of LLMs, Molmo can cater to diverse needs. Smaller models are faster and less compute-intensive, while larger models are suited for tasks that require more nuanced language processing and deeper contextual understanding.

In transformers, decoder-only architecture is particularly suited for tasks requiring context-based generation, such as captioning or question-answering. The model “decodes” tokens in a self-referential manner, with each token attending to all previous tokens to build a coherent output, guided by both visual and textual cues from previous stages.
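
A common way to wire this up (an assumption about the interface, not Molmo’s exact code) is to prepend the projected visual tokens to the embedded text tokens and let the decoder attend causally over the combined sequence:

import torch

visual_tokens = torch.randn(1, 288, 4096)    # output of the connector
text_embeddings = torch.randn(1, 32, 4096)   # embedded prompt tokens

# The decoder-only LLM sees one sequence: image tokens first, then text tokens.
# Each generated token can attend to everything before it, but never to future tokens.
sequence = torch.cat([visual_tokens, text_embeddings], dim=1)
print(sequence.shape)  # torch.Size([1, 320, 4096])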

Training Pipeline: Two Simple Stages

Molmo’s training is divided into two major stages that contribute to the model’s high performance and versatility:

Stage 1: Multimodal Pre-Training for Caption Generation

Goal: Train the model to generate detailed, accurate captions for images. PixMo-Cap dataset is used in this step.

Molmo uses a simpler, single-stage pre-training method for caption generation, which avoids the complexity and potential inefficiencies of multi-stage pre-training (e.g., freezing parts of the model/network at different stages).

Mathematical perspective (Source: Author)
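
In symbols, the Stage 1 objective is the usual next-token cross-entropy over caption tokens, conditioned on the image (a generic formulation for illustration; the figure above reflects the authors’ own presentation):

\mathcal{L}_{\text{caption}}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\left(y_t \mid y_{<t}, \mathbf{I}\right)

Here y_1, ..., y_T are the caption tokens, I is the encoded multi-crop image, and θ covers all trainable parameters (vision encoder, connector, and LLM), which are updated jointly against this single objective.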

Why Does Molmo Avoid Multi-Stage Pre-Training?

Molmo’s simpler, single-stage pre-training works well in its context because:

  • It uses high-quality human-annotated data from the start, which avoids the need for progressive fine-tuning across stages. This is one of the key differentiators between Molmo and other models that rely on weakly labeled or synthetic data.
  • Molmo’s vision encoder (e.g., CLIP) and language model are both off-the-shelf and are fine-tuned together in one go, avoiding the inefficiency of multi-stage fine-tuning.
  • Efficiency: Training all components together (single-stage pre-training) allows the model to converge faster and simplifies the training pipeline.
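
As a rough sketch of what that looks like in code, every component sits in a single optimizer from step one, with nothing frozen. The modules and dimensions below are toy stand-ins, not the real components:

import torch
import torch.nn as nn

# Toy stand-ins for the real components (illustrative shapes only)
vision_encoder = nn.Linear(1024, 1024)   # stands in for CLIP ViT-L/14
connector = nn.Linear(1024, 4096)        # stands in for the MLP connector
llm_head = nn.Linear(4096, 32000)        # stands in for the decoder-only LLM

# Single-stage training: one optimizer over all parameters, no freezing schedule
params = (list(vision_encoder.parameters())
          + list(connector.parameters())
          + list(llm_head.parameters()))
optimizer = torch.optim.AdamW(params, lr=1e-5)

patch_features = torch.randn(1, 576, 1024)        # dummy image features
caption_ids = torch.randint(0, 32000, (1, 576))   # dummy caption targets

logits = llm_head(connector(vision_encoder(patch_features)))
loss = nn.functional.cross_entropy(logits.reshape(-1, 32000), caption_ids.reshape(-1))
loss.backward()
optimizer.step()
optimizer.zero_grad()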

Stage 2: Supervised Fine-Tuning on Diverse Tasks

After pre-training for caption generation, Molmo is fine-tuned on a mixture of datasets, including standard academic datasets and additional PixMo datasets like PixMo-AskModelAnything, PixMo-Points, PixMo-Clocks, and PixMo-Docs. The fine-tuning includes supervised training data for tasks like question answering, counting, and point-based referencing.

  • Why No RLHF (Reinforcement Learning with Human Feedback)? Molmo does not use RLHF, which is commonly employed in models like GPT-4, to refine performance through human interaction. Instead, Molmo relies on high-quality labelled data for fine-tuning. The idea here is that Molmo’s comprehensive dataset already encompasses a broad set of real-world tasks, obviating the need for further human feedback during training.

Evaluation: Academic Benchmarks and Human Preference

Evaluating multimodal models can be challenging due to the complexity of visual and linguistic tasks. The Molmo team gauged performance using a combination of academic benchmarks and extensive human evaluations.

  1. Academic Benchmarks: Molmo was tested against 11 widely used datasets, including VQA, DocVQA, and a new counting-focused benchmark, Flickr Count. The models to be compared are categorized into 4 groups: proprietary models that can only be accessed through API calls, models with released weights but closed data, models with released weights and released training data, and the Molmo family of models. The results positioned Molmo models alongside or even above proprietary systems like GPT-4V, especially the 72B variant.
  2. Human Preference Testing: To supplement quantitative scores, Molmo’s human preference testing involved collecting over 325,000 pairwise comparisons, and ranking models on user satisfaction. Molmo-72B achieved one of the highest rankings, trailing only proprietary models like GPT-4o in direct user preference.

Comparison with Other Models (LLaVA, Qwen2-VL, PaliGemma)

  • LLaVA and Qwen2-VL: These models rely on multi-stage pre-training, often involving frozen parts of the model during different stages. They use large-scale, synthetic data, which helps with scale but introduces noise and reliance on proprietary VLMs.
  • PaliGemma: Similar to Qwen2-VL, it uses closed data and depends on synthetic data generated by proprietary models. Molmo avoids these dependencies, ensuring transparency and reproducibility.

Also read: Hands-On Multimodal Retrieval and Interpretability (ColQwen + Vespa)

A Hands-On Guide to Running Molmo on Our Use Case

Now that we are clear on Molmo’s architecture, let’s get hands-on and try out some examples. In this section, we’ll walk through using Molmo on example images to extract structured information. This hands-on session will help you understand how to load the model, process images, generate outputs, and customize it for your own data.

Colab notebook: Molmo-VLM-handson.ipynb (I used an A100 High-RAM GPU to run these experiments)

1. Setting Up the Environment

First, we need to install some essential packages. These include transformers for model processing, torch for handling tensors, Pillow for image manipulation, einops (required by Molmo’s model code), and pytesseract for OCR (Optical Character Recognition).

!pip install -q transformers torch Pillow einops
!pip install -q pytesseract
!apt-get install -y tesseract-ocr

2. Loading the Molmo Model and Processor

Here, we specify the Molmo model we want to use (in this case, MolmoE-1B-0924) and load it along with its processor.

from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image
import torch

model_name="allenai/MolmoE-1B-0924"
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True, torch_dtype="auto", device_map='auto')
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype="auto", device_map='auto')

model.to("cuda")  # optional if device_map='auto' has already placed the model on the GPU

AutoProcessor prepares the inputs for Molmo, handling both images and text prompts. AutoModelForCausalLM loads the language model. Setting device_map=’auto’ ensures the model is loaded onto the best available device (like GPU) for faster performance.

3. Processing and Displaying an Image

To work with an image, we load it using Pillow and display it to confirm we have the correct input.

image_path="your_image.png"  # provide the image path here
image = Image.open(image_path).convert('RGB')
image

This code loads an image from the specified path and converts it to RGB format, ensuring compatibility with the model.

Resizing the Image for Consistency

If an image is too large, you can resize it for consistent processing and then display the image. This function resizes images with a height greater than 800 pixels. Reducing image size can optimize processing without significantly affecting the model’s ability to interpret content.

def resize_image(image, max_height=800):
    width, height = image.size
    if height > max_height:
        ratio = max_height / height
        new_width = int(width * ratio)
        new_height = int(height * ratio)
        return image.resize((new_width, new_height))
    return image
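
For example, assuming `image` is the Pillow image loaded above:

image = resize_image(image)   # shrink tall images to at most 800 px in height
image                         # display the (possibly resized) image in the notebook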

4. Processing Image and Text for Model Input

We define a text prompt and process both the image and text together using the processor.

inputs = processor.process(
    images=[image],
    text="Extract all the information from the page in JSON format, especially the account summary and all contact details in proper format."
)

inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

The processor combines the image and text into a format the model can interpret. Each input is moved to the model’s device (usually GPU) and reshaped for batch processing.

5. Generating the Output Text

Using the model’s generate_from_batch function, we generate an output based on the image and prompt.

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=500, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer
)

generated_tokens = output[0, inputs['input_ids'].size(1):]
generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)

print(generated_text)

Here, we set a maximum limit of 500 tokens for the response (you can increase or decrease this according to your use case) and define a stop condition (<|endoftext|>). The slice output[0, inputs['input_ids'].size(1):] skips the input prompt tokens, isolating only the newly generated tokens and avoiding redundancy in the response.

The model processes the inputs and generates tokens representing the text output, which we then decode to human-readable text. This allows us to see Molmo’s extracted information based on our prompt.

Below is an overall function that takes an image_path and a prompt and generates text as instructed:

def generate_text(image_path, prompt, max_tokens=500):
    image = Image.open(image_path).convert('RGB')
    inputs = processor.process(
        images=[image],
        text=prompt
    )
    inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}
    output = model.generate_from_batch(
        inputs,
        GenerationConfig(max_new_tokens=max_tokens, stop_strings="<|endoftext|>"),
        tokenizer=processor.tokenizer
    )
    generated_tokens = output[0, inputs['input_ids'].size(1):]
    generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
    return image, generated_text

You can pass custom prompts to refine the model’s focus. In this case, we’re asking for detailed information, specifying a JSON format for structured data extraction. This helps Molmo return data that’s ready for further processing or analysis.

The image from which we are extracting data:

Binary Quantization (Source: Author)
input_path = "/content/Visualization - Binary Quantization.png"

prompt = '''You are an expert mathematician. You need to understand what has been mentioned in this page and outline the topics along with explanations.
The output should be in json format with keys "topics mentioned", "explanation": {"exp_topic1", "exp_topic2", ...}
'''

image, generated_text = generate_text(input_path, prompt)
resize_image(image)
print(generated_text)

Output:

{
"topics mentioned": [
"Query and token",
"Binary quantization",
"Hamming distance",
"Minimum Hamming distance",
"Query and token embeddings",
"Final hamming similarity"
],
"explanation": {
"query and token": "The image discusses the process of converting each
value in a query or token into either 1 or 0, depending on whether it
represents a positive or negative value respectively. This technique is used
in binary quantization.",
"binary quantization": "This is a method for representing real numbers in
binary format with a fixed number of bits. The image explains how to convert
floating-point numbers to binary and then calculate the Hamming distance
between two binary vectors.",
"Hamming distance": "This is a measure of how many bit positions differ
between two binary vectors. The image shows how to calculate this distance
between two binary vectors of different lengths.",
"minimum Hamming distance": "This refers to the shortest distance between
two vectors of the same length, excluding the vector itself. The image
provides formulas for calculating this distance for different token sizes
and query lengths.",
"query and token embeddings": "The image describes how to represent query
and token data in a 4-dimensional space using multi-vector embeddings. It
explains the process of tokenization and the use of binary quantization for
this representation.",
"final hamming similarity": "The image concludes by discussing the
calculation of overall hamming similarity between two query vectors and
their embeddings"
}
}

We can also take a complex example where there are many tables and see how much data the model can extract in one go:

input_path = "/content/0fa82bab-e131-43dd-86da-7153b2ecc76d.png"

prompt = '''Extract all the information from the page in JSON, each and every data needs to be present. Don't miss out on contact details, name, address, account bill summary, billing history and ways to pay.
The output should be in json format with keys being all the data found in the page. Information is crucial.
'''

image, generated_text = generate_text(input_path, prompt, max_tokens=1000)
print(generated_text)
resize_image(image, max_height=600)  # display the image, resized to 600 px in height

Output:

{
"energyStatement": {
"accountNumber": "5553220335-0",
"statementDate": "01/30/2024",
"dueDate": "02/20/2024",
"website": "www.pge.com/myenergy",
"serviceInfo": {
"meterNumber": "10098180854",
"totalUsage": "518.53 MWh",
" rotatingOutageBlock": "10F",
"serviceID": "5534591016"
},
"billingHistory": {
"billingcycles": "33 billing cycles",
"billingcyclesToDate": "12/31/2023",
"currentBillingcycle": "12/22/2023"
},
"serviceSchedule": {
"serviceID": "5534591016",
"schedule": "EVA Home Charging"
},
"electricDeliveryCharges": {
"total": "$139.29",
"2018VintagePowerChargeInferenceAdjustment": "1.00"
},
"contactInfo": {
"phoneNumber": "555-123-4567",
"email": "[email protected]"
}
}
}

As we can see, most of the details are extracted in one go. But what if we don’t want to miss a single piece of information and the page is dense with it? In that case, we can split the image into multiple patches, pass each patch to the model separately, and eventually combine the extracted data.

Splitting the Image into Patches

To handle complex images with diverse regions, we split them into smaller patches and process each patch individually. Here, we follow a straightforward approach of splitting the image into 4 equal sections. This is useful for large documents where different regions contain distinct information and the layout divides naturally into equal sections (like research papers).

def split_image_into_patches(image):
    width, height = image.size
    patches = {
        "top_left": image.crop((0, 0, width // 2, height // 2)),
        "top_right": image.crop((width // 2, 0, width, height // 2)),
        "bottom_left": image.crop((0, height // 2, width // 2, height)),
        "bottom_right": image.crop((width // 2, height // 2, width, height))
    }
    return patches

Processing Each Patch and Extracting Information

Each patch is processed separately with a prompt to extract relevant details. We store each patch’s result in a dictionary.

extracted_data = {}
image_patches = split_image_into_patches(image)  # the 4 patches defined above

for patch_name, patch_image in image_patches.items():
    inputs = processor.process(
        images=[patch_image],
        text="Extract all the information from page in JSON, each and every data needs to be present."
    )
    inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}
    output = model.generate_from_batch(
        inputs,
        GenerationConfig(max_new_tokens=500, stop_strings="<|endoftext|>"),
        tokenizer=processor.tokenizer
    )
    generated_tokens = output[0, inputs['input_ids'].size(1):]
    generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
    extracted_data[patch_name] = generated_text

The above approach of splitting an image into equal parts is similar to splitting a long text document into fixed-length chunks. However, if a chunk boundary cuts through continuous text, we lose context. The same concept applies to images. So, instead of splitting the image into equal parts, what if we split it into visually semantic chunks?

We will try a simple approach here: combine OCR with the vertical gap between bounding boxes to group the text into patches, and then pass those patches to the Molmo model.

We can apply OCR to identify text regions in the image and return the text along with bounding boxes.

import pytesseract

def extract_text_regions(image):
    ocr_data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    text_regions = []
    for i, word in enumerate(ocr_data['text']):
        if word.strip():  # Ignore empty strings
            x, y, w, h = ocr_data['left'][i], ocr_data['top'][i], ocr_data['width'][i], ocr_data['height'][i]
            text_regions.append({
                "text": word,
                "bbox": (x, y, x + w, y + h)
            })
    return text_regions

Grouping and Processing Semantic Chunks

We can group text regions into logical chunks (like paragraphs or tables) for more coherent extraction. The function below groups words into larger chunks, such as lines or paragraphs, based on the vertical gap between their bounding boxes, which helps extract contextually coherent information from documents.

def group_text_regions(text_regions, line_threshold=10):
    grouped_regions = []
    current_group = []
    last_bottom = -1

    for region in text_regions:
        _, top, _, bottom = region['bbox']
        if last_bottom != -1 and (top - last_bottom > line_threshold):
            grouped_regions.append(current_group)
            current_group = []
        current_group.append(region)
        last_bottom = bottom

    if current_group:
        grouped_regions.append(current_group)
    
    return grouped_regions

Now, we will apply this approach to a page to create groups and pass each patch to the model for extraction. Once all the JSON data is extracted, we can pass it to an LLM to combine everything together.

# Apply OCR to identify text regions
text_regions = extract_text_regions(image)

# Group text regions into semantic chunks
semantic_chunks = group_text_regions(text_regions)

# Initialize a dictionary to store extracted data from each chunk
extracted_data = {}

# Loop through each semantic chunk, process, and store the output
for idx, chunk in enumerate(semantic_chunks):
   # Create a bounding box for the chunk
   x_min = min([r['bbox'][0] for r in chunk])
   y_min = min([r['bbox'][1] for r in chunk])
   x_max = max([r['bbox'][2] for r in chunk])
   y_max = max([r['bbox'][3] for r in chunk])

   # Crop the image to the bounding box of the chunk
   chunk_image = image.crop((x_min, y_min, x_max, y_max))

   # Prepare text prompt for Molmo
   chunk_text = " ".join([r['text'] for r in chunk])
   prompt_text = f"Extract information from this section: {chunk_text} in JSON format."

   # Process the chunk image and prompt with Molmo
   inputs = processor.process(
       images=[chunk_image],
       text=prompt_text
   )
   inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

   output = model.generate_from_batch(
       inputs,
       GenerationConfig(max_new_tokens=500, stop_strings="<|endoftext|>"),
       tokenizer=processor.tokenizer
   )

   generated_tokens = output[0, inputs['input_ids'].size(1):]
   generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
   print(generated_text, "\n\n")

   # Store the extracted data for the current chunk
   extracted_data[f"chunk_{idx}"] = generated_text

# Combine all extracted data
combined_data = { "page_summary": extracted_data }
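
The final merge mentioned above is not shown in the notebook; a minimal sketch of preparing that step might look like this (which LLM you send the prompt to is up to you):

import json

merge_prompt = (
    "Combine the following JSON fragments extracted from different regions of the same page "
    "into one consistent JSON object, removing duplicates:\n"
    + json.dumps(extracted_data, indent=2)
)
# merge_prompt can now be sent to any instruction-following LLM to produce the final JSON.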

This was a fun experiment, but it is not yet the best-optimized approach. We can improve it further by using segmentation to create logical chunks. If we plan to use OCR, then grouping needs to be more strict and heuristic-based (considering both vertical and horizontal line gaps and some checks on the amount of text or data available).

Conclusion

In this deep dive into Molmo and PixMo, we explored the motivations behind developing open and robust vision-language models, the detailed architecture of Molmo, and the unique datasets powering its capabilities. We walked through key design decisions, including why Molmo opted for a simpler, single-stage training pipeline and chose CLIP as the vision encoder for its superior performance on multi-crop, high-resolution images. The hands-on section showcased Molmo’s flexibility in extracting complex structured data, with practical examples and code for you to try yourself.

By embracing transparency, high-quality data, and efficient training strategies, Molmo sets a new standard in open multimodal research, offering a versatile tool for tackling diverse vision-language tasks. I hope this blog gives you a comprehensive understanding of Molmo and inspires you to experiment with its capabilities.

Also, if you are looking for a generative AI course online, then explore: GenAI Pinnacle Program

Frequently Asked Questions

Q1. Why does Molmo use CLIP instead of other vision encoders like SigLIP?

Ans. Molmo uses CLIP because it demonstrated superior performance in handling multi-crop and high-resolution images. CLIP’s robust attention mechanisms and ability to capture spatial relationships across image patches make it more effective for complex visual tasks. In contrast, SigLIP struggled with multi-crop settings and was better suited for simpler, single-crop scenarios.

Q2. What datasets power Molmo’s training, and how do they differ from synthetic datasets?

Ans. Molmo leverages the PixMo dataset, which includes high-quality, human-annotated image-caption pairs and specialized datasets like PixMo-AskModelAnything and PixMo-Points. These datasets provide diverse, real-world data that enhance Molmo’s generalization capabilities. Unlike synthetic datasets, PixMo’s human annotations ensure a richer and more natural understanding of visual content.

Q3. Can I use Molmo for custom tasks, and how flexible is it with different input types?

Ans. Yes, Molmo is designed to be highly flexible. You can customize prompts based on your specific task needs, such as extracting structured data in JSON format or answering specific queries about an image. The hands-on examples in the blog demonstrate how to adapt Molmo to various use cases, making it suitable for tasks ranging from document understanding to image captioning.

Hi, I’m Antaripa Saha, Machine Learning Engineer II at a US-based startup. I am passionate about math, generative AI, and the latest advancements in VLMs and LLMs. I like deep-diving into research papers and breaking them down in my blogs.
My Twitter profile: https://twitter.com/doesdatmaksense
