Hybrid Mamba-Transformer Model for Advanced NLP

November 2, 2024

Jamba 1.5 is a family of instruction-tuned large language models developed by AI21 Labs. It comes in two versions: Jamba 1.5 Large, with 94 billion active parameters, and Jamba 1.5 Mini, with 12 billion active parameters. Both combine Mamba, a structured state space model (SSM), with the traditional Transformer architecture. They offer an effective context window of 256K tokens, which AI21 Labs described at release as the longest among openly available models.

Overview

Jamba 1.5 is a hybrid Mamba-Transformer model for efficient NLP, capable of processing context windows of up to 256K tokens.
Its Large and Mini versions (94B and 12B active parameters) cover diverse language tasks, while ExpertsInt8 quantization reduces memory use and improves speed.
AI21’s Jamba 1.5 combines scalability and accessibility, supporting tasks from summarization to question answering across nine languages.
Its architecture handles long contexts efficiently, making it well suited to memory-heavy NLP applications.
Its hybrid, high-throughput design offers versatile NLP capabilities and is available through API access and on Hugging Face.

What are Jamba 1.5 Models?

The Jamba 1.5 models, in Mini and Large variants, are designed to handle a range of natural language processing (NLP) tasks such as question answering, summarization, text generation, and classification. Trained on an extensive corpus, the models support nine languages: English, Spanish, French, Portuguese, Italian, Dutch, German, Arabic, and Hebrew. With its joint SSM-Transformer structure, Jamba 1.5 addresses two major limitations of conventional Transformer models: high memory requirements for long context windows and slow processing of long inputs.
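To make the memory argument concrete, here is a rough back-of-the-envelope sketch of the key-value (KV) cache size at full context. It uses the figures from the specification table in this section (9 blocks of 8 layers, a 1:7 attention-to-Mamba ratio, 8 key-value heads, 8192 hidden size, 64 query heads). Several things are assumptions rather than stated facts: the head dimension is taken as hidden size divided by query heads, the cache is stored in 16-bit values, "256K" is read as 262,144 tokens, and the all-attention baseline is a hypothetical model with the same shape. Only attention layers are counted, since Mamba layers keep a fixed-size state instead of a cache that grows with the sequence.

```python
def kv_cache_bytes(attention_layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    keys_and_values = 2  # one tensor for keys, one for values
    return keys_and_values * attention_layers * kv_heads * head_dim * seq_len * bytes_per_value


SEQ_LEN = 256 * 1024                  # assumption: "256K" means 262,144 tokens
KV_HEADS = 8                          # from the specification table
HEAD_DIM = 8192 // 64                 # assumption: hidden size / query heads = 128
TOTAL_LAYERS = 9 * 8                  # 9 blocks of 8 layers
ATTENTION_LAYERS = TOTAL_LAYERS // 8  # 1 attention layer per 8 layers (1:7 ratio)

# Hybrid stack: only the attention layers hold a KV cache.
hybrid = kv_cache_bytes(ATTENTION_LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)

# Hypothetical baseline: same shape, but every layer is an attention layer.
all_attention = kv_cache_bytes(TOTAL_LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)

GIB = 2 ** 30
print(f"hybrid KV cache:        {hybrid / GIB:.0f} GiB")
print(f"all-attention KV cache: {all_attention / GIB:.0f} GiB")
```

Under these assumptions the hybrid layout needs about 9 GiB of cache per 256K-token sequence, against about 72 GiB for the all-attention baseline. The estimate ignores model weights, activations, and the Mamba state, so it is a lower bound on real memory use, not a deployment figure.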

Aspect	Details
Base Architecture	Hybrid Transformer-Mamba architecture with a Mixture-of-Experts (MoE) module
Model Variants	Jamba-1.5-Large (94B active parameters, 398B total) and Jamba-1.5-Mini (12B active parameters, 52B total)
Layer Composition	9 blocks of 8 layers each (72 layers in Jamba-1.5-Large), with a 1:7 ratio of Transformer attention layers to Mamba layers
Mixture of Experts (MoE)	16 experts, selecting the top 2 per token for dynamic specialization
Hidden Dimensions	8192 hidden state size (Jamba-1.5-Large)
Attention Heads	64 query heads, 8 key-value heads (Jamba-1.5-Large)
Context Length	Supports up to 256K tokens, with a significantly smaller KV cache than a pure Transformer of similar size
Quantization Technique	ExpertsInt8 for MoE and MLP layers, allowing efficient use of INT8 while maintaining high throughput
Activation Function	Integration of Transformer and Mamba activations, with an auxiliary loss to stabilize activation magnitudes
Efficiency	Designed for high throughput and low latency, optimized to run on 8x80GB GPUs with 256K context support
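The layer composition and expert selection rows above can be illustrated with a small sketch. It builds the layer layout implied by 9 blocks of 8 layers with one attention layer per block, then routes a single token to the top 2 of 16 experts. This is an illustration under stated assumptions, not AI21's implementation: the position of the attention layer inside each block, the softmax renormalization over the selected experts, and the helper names `layer_layout` and `top_k_route` are all hypothetical.

```python
import math

NUM_BLOCKS = 9
LAYERS_PER_BLOCK = 8
NUM_EXPERTS = 16
TOP_K = 2


def layer_layout(attention_index=0):
    """Layer types for the full stack: one attention layer per block, the rest Mamba.

    attention_index is an assumption; the table only gives the 1:7 ratio.
    """
    block = ["mamba"] * LAYERS_PER_BLOCK
    block[attention_index] = "attention"
    return block * NUM_BLOCKS


def top_k_route(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


layout = layer_layout()
print(len(layout), layout.count("attention"), layout.count("mamba"))  # 72 9 63

# Made-up router scores for one token: experts 3 and 11 score highest.
logits = [0.0] * NUM_EXPERTS
logits[3], logits[11] = 2.0, 1.0
for expert, weight in top_k_route(logits):
    print(f"expert {expert}: weight {weight:.3f}")
```

Only the two selected experts run for a given token, which is why the active parameter count (94B or 12B) is much smaller than the total (398B or 52B).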
