Hybrid Mamba-Transformer Model for Advanced NLP

Jamba 1.5 is a family of instruction-tuned large language models developed by AI21 Labs. It comes in two versions: Jamba 1.5 Large, with 94 billion active parameters, and Jamba 1.5 Mini, with 12 billion active parameters. Both combine Mamba, a structured state space model (SSM), with the traditional Transformer architecture. The models support an effective context window of 256K tokens, which was the largest among open-source models at the time of release.

Overview

  • Jamba 1.5 is a hybrid Mamba-Transformer model for efficient NLP, capable of processing context windows of up to 256K tokens.
  • Its two versions, with 94B and 12B active parameters, cover diverse language tasks while reducing memory use and improving speed through ExpertsInt8 quantization.
  • AI21’s Jamba 1.5 combines scalability and accessibility, supporting tasks from summarization to question answering across nine languages.
  • Its architecture allows for long-context handling and high efficiency, making it well suited to memory-heavy NLP applications.
  • Its high-throughput design is available through API access and on Hugging Face.

What are Jamba 1.5 Models?

The Jamba 1.5 models, in both Mini and Large variants, are designed to handle natural language processing (NLP) tasks such as question answering, summarization, text generation, and classification. Trained on an extensive corpus, they support nine languages: English, Spanish, French, Portuguese, Italian, Dutch, German, Arabic, and Hebrew. With its joint SSM-Transformer structure, Jamba 1.5 addresses two major limitations of conventional Transformer models: high memory requirements for long context windows and slow processing of long sequences.

The Architecture of Jamba 1.5

| Aspect | Details |
|---|---|
| Base Architecture | Hybrid Transformer-Mamba architecture with a Mixture-of-Experts (MoE) module |
| Model Variants | Jamba-1.5-Large (94B active parameters, 398B total) and Jamba-1.5-Mini (12B active parameters, 52B total) |
| Layer Composition | 9 blocks, each with 8 layers; 1:7 ratio of Transformer attention layers to Mamba layers |
| Mixture of Experts (MoE) | 16 experts, selecting the top 2 per token for dynamic specialization |
| Hidden Dimensions | 8192 hidden state size |
| Attention Heads | 64 query heads, 8 key-value heads |
| Context Length | Supports up to 256K tokens, with significantly reduced KV cache memory |
| Quantization Technique | ExpertsInt8 for MoE and MLP layers, allowing efficient use of INT8 while maintaining high throughput |
| Activation Function | Integration of Transformer and Mamba activations, with an auxiliary loss to stabilize activation magnitudes |
| Efficiency | Designed for high throughput and low latency, optimized to run on 8x80GB GPUs with 256K context support |
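
To see why the 1:7 attention-to-Mamba ratio matters for long contexts, the figures in the table above can be turned into a rough KV cache estimate. This is a back-of-the-envelope sketch, not AI21's own calculation. It assumes a head dimension of hidden size divided by query heads (128), 16-bit cache entries, and 256K meaning 256 × 1024 tokens; none of these are stated in the article. The "all attention" baseline is a hypothetical model with the same shape, used only for comparison.

```python
# Rough KV cache sizing from the layer layout in the architecture table.
# Assumptions (not from the article): head_dim = hidden_size / query_heads,
# 16-bit cache entries, and 256K = 256 * 1024 tokens.

N_BLOCKS = 9
LAYERS_PER_BLOCK = 8
ATTN_LAYERS_PER_BLOCK = 1      # 1:7 attention-to-Mamba ratio
HIDDEN_SIZE = 8192
QUERY_HEADS = 64
KV_HEADS = 8
BYTES_PER_VALUE = 2            # assumed 16-bit cache entries
CONTEXT_TOKENS = 256 * 1024


def kv_cache_bytes(n_attention_layers: int, context_tokens: int) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    head_dim = HIDDEN_SIZE // QUERY_HEADS                              # assumed: 128
    per_token_per_layer = 2 * KV_HEADS * head_dim * BYTES_PER_VALUE   # keys + values
    return n_attention_layers * context_tokens * per_token_per_layer


total_layers = N_BLOCKS * LAYERS_PER_BLOCK              # 72 layers overall
attention_layers = N_BLOCKS * ATTN_LAYERS_PER_BLOCK     # only 9 use attention

hybrid = kv_cache_bytes(attention_layers, CONTEXT_TOKENS)
all_attention = kv_cache_bytes(total_layers, CONTEXT_TOKENS)  # hypothetical baseline

GIB = 1024 ** 3
print(f"hybrid, 9 attention layers:        {hybrid / GIB:.1f} GiB")
print(f"hypothetical, 72 attention layers: {all_attention / GIB:.1f} GiB")
print(f"reduction: {all_attention // hybrid}x")
```

Under these assumptions the hybrid layout needs 9 GiB of KV cache at full context, against 72 GiB for the all-attention baseline. Only attention layers store per-token keys and values; Mamba layers keep a fixed-size state that does not grow with sequence length.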

Explanation

  • KV cache memory is memory allocated for storing the key-value pairs of previous tokens so they are not recomputed at every generation step. It grows with sequence length, which is why it dominates memory use on long sequences.
  • ExpertsInt8 quantization is a compression method using INT8 precision in MoE and MLP layers to save memory and improve processing speed.
  • Attention heads are separate mechanisms within the attention layer that focus on different parts of the input sequence, improving model understanding.
  • Mixture-of-Experts (MoE) is a modular approach where only selected expert sub-models process each input, boosting efficiency and specialization.
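
The top-2-of-16 expert selection described above can be sketched in a few lines of plain Python. This is a toy illustration of top-k routing in general, not Jamba's actual router. The toy experts, the random router logits, and the choice to renormalize the two selected weights so they sum to 1 are all assumptions made for the example.

```python
import math
import random

N_EXPERTS = 16
TOP_K = 2


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight), ...] for the top_k experts.

    Weights are the softmax probabilities of the chosen experts,
    renormalized to sum to 1 (one common convention, assumed here).
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(x, experts, router_logits):
    """Weighted sum of the selected experts' outputs; the rest are never run."""
    return sum(w * experts[i](x) for i, w in route(router_logits))


# Toy experts: expert i simply scales its input by (i + 1).
experts = [lambda x, s=i + 1: s * x for i in range(N_EXPERTS)]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(N_EXPERTS)]  # stand-in router scores
selected = route(logits)
print("selected experts:", [i for i, _ in selected])
print("weights sum to:", round(sum(w for _, w in selected), 6))
print("output for x = 1.0:", moe_forward(1.0, experts, logits))
```

Because only 2 of the 16 experts run for each token, the compute per token tracks the active parameter count rather than the total, which is the distinction behind the "94B active, 398B total" figures.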

Intended Use and Accessibility

Jamba 1.5 is designed for tasks such as sentiment analysis, summarization, and paraphrasing. It is accessible via AI21’s Studio API, Hugging Face, or cloud partners, so it can be deployed in various environments. The model weights can be downloaded from Hugging Face, and the model can also be fine-tuned on domain-specific data for better results.


One way to access the models is through AI21’s chat interface:

Chat Interface

Here’s the link: Chat Interface

Jamba 1.5 Chat Interface

This is just a small sample of the model’s question-answering capabilities.

Jamba 1.5 using Python

You can send requests to Jamba 1.5 and receive responses in Python using an API key.

To get your API key, click Settings in the left bar of the AI21 Studio homepage, then click API Key.

Note: You’ll get $10 in free credits, and you can track the credits you use by clicking ‘Usage’ in the settings.


Installation

!pip install ai21

Python Code 

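from ai21 import AI21Client
from ai21.models.chat import ChatMessage

# Paste your API key from the AI21 Studio settings page.
client = AI21Client(api_key='')

messages = [ChatMessage(content="What's a tokenizer in 2-3 lines?", role="user")]

# stream=True returns the reply incrementally, chunk by chunk.
response = client.chat.completions.create(
    messages=messages,
    model="jamba-1.5-mini",
    stream=True,
)

for chunk in response:
    # Some chunks may carry no text, so fall back to an empty string.
    print(chunk.choices[0].delta.content or "", end="")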

A tokenizer is a tool that breaks down text into smaller units called tokens, words, subwords, or characters. It is essential for natural language processing tasks, as it prepares text for analysis by models.

It’s straightforward: we send the message to the desired model, authenticated with our API key, and get the response back.

Note: You can also use the jamba-1.5-large model instead of jamba-1.5-mini.

Conclusion

Jamba 1.5 blends the strengths of the Mamba and Transformer architectures. With its scalable design, high throughput, and extensive context handling, it is well-suited for diverse applications ranging from summarization to sentiment analysis. By offering accessible integration options and optimized efficiency, it enables users to work effectively with its modelling capabilities across various environments. It can also be finetuned on domain-specific data for better results. 

Frequently Asked Questions

Q1. What is Jamba 1.5?  

Ans. Jamba 1.5 is a family of large language models designed with a hybrid architecture combining Transformer and Mamba elements. It includes two versions, Jamba-1.5-Large (94B active parameters) and Jamba-1.5-Mini (12B active parameters), optimized for instruction-following and conversational tasks.

Q2. What makes Jamba 1.5 efficient for long-context processing?  

Ans. Jamba 1.5 models support an effective context length of 256K tokens. The hybrid architecture keeps the KV cache small because only a fraction of the layers use attention, and the ExpertsInt8 quantization technique reduces the memory taken by the model weights. Together, these let the models handle long-context data with reduced memory usage.

Q3. What is the ExpertsInt8 quantization technique in Jamba 1.5?  

Ans. ExpertsInt8 is a custom quantization method that compresses model weights in the MoE and MLP layers to INT8 format. This technique reduces memory usage while maintaining model quality and is compatible with A100 GPUs, enhancing serving efficiency.
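
As a rough illustration of what storing weights in INT8 means, here is a generic symmetric quantization sketch in plain Python. It is not AI21's ExpertsInt8 implementation, whose details are not covered in this article. The per-row scale, the [-127, 127] range, and the example weights are all assumptions chosen for the demonstration.

```python
def quantize_row(row):
    """Map floats to integers in [-127, 127] using one scale for the whole row."""
    max_abs = max(abs(v) for v in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in row]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]


weights = [0.02, -1.27, 0.64, 0.0]   # made-up example weights
q, scale = quantize_row(weights)
restored = dequantize_row(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print("scale:", scale)
print("largest round-trip error:", max_err)
```

Each weight then occupies one byte instead of the two a 16-bit format needs, at the cost of a rounding error of at most half the scale per value.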

Q4. Is Jamba 1.5 available for public use?  

Ans. Yes, both Large and Mini are publicly available under the Jamba Open Model License. The models can be accessed on Hugging Face.

I’m a tech enthusiast, graduated from Vellore Institute of Technology. I’m working as a Data Science Trainee right now. I am very much interested in Deep Learning and Generative AI.
