Introduction
Artificial intelligence (AI) is rapidly transforming industries around the world, including healthcare, autonomous vehicles, banking, and customer service. While building AI models attracts most of the attention, AI inference—the process of applying a trained model to fresh data to make predictions—is where the real-world impact occurs. As enterprises become more reliant on AI-powered applications, the demand for efficient, scalable, and low-latency inferencing solutions has never been higher.
This is where NVIDIA NIM comes into the picture. NVIDIA NIM is designed to help developers deploy AI models as microservices, simplifying the process of delivering inference solutions at scale. In this blog, we’ll dive into the capabilities of NIM, try out a few models through the NIM API, and see how it is reshaping AI inferencing.
Learning Outcomes
- Understand the significance of AI inference and its impact on various industries.
- Gain insights into the functionalities and benefits of NVIDIA NIM for deploying AI models.
- Learn how to access and utilize pretrained models through the NVIDIA NIM API.
- Discover the steps to measure inferencing speed for different AI models.
- Explore practical examples of using NVIDIA NIM for both text generation and image creation.
- Recognize the modular architecture of NVIDIA NIM and its advantages for scalable AI solutions.
This article was published as a part of the Data Science Blogathon.
What is NVIDIA NIM?
NVIDIA NIM is a platform that uses microservices to make AI inference easier in real-life applications. Microservices are small services that can work on their own but also come together to create larger systems that can grow. By putting ready-to-use AI models into microservices, NIM helps developers use these models quickly and easily, without needing to think about the infrastructure or how to scale it.
Key Characteristics of NVIDIA NIM
- Pretrained AI Models: NIM comes with a library of pretrained models for various tasks like speech recognition, natural language processing (NLP), computer vision, and more.
- Optimized for Performance: NIM leverages NVIDIA’s powerful GPUs and software optimizations (like TensorRT) to deliver low-latency, high-throughput inference.
- Modular Design: Developers can mix and match microservices depending on the specific inference task they need to perform.
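Because NIM exposes these pretrained models behind an OpenAI-compatible API, you can list what is available programmatically. The snippet below is a minimal sketch, assuming your key is stored in the NVIDIA_API_KEY environment variable and that the endpoint exposes the standard /models route:
from openai import OpenAI
import os

# Minimal sketch: list the models served by the OpenAI-compatible NIM endpoint.
# Assumes NVIDIA_API_KEY is already set in the environment.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.getenv("NVIDIA_API_KEY"),
)
for model in client.models.list():
    print(model.id)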
Understanding Key Features of NVIDIA NIM
Let us look at the key features of NVIDIA NIM in detail:
Pretrained Models for Fast Deployment
NVIDIA NIM provides a wide range of pretrained models that are ready for immediate deployment, covering AI tasks such as speech recognition, natural language processing (NLP), computer vision, and image generation.
Low-Latency Inference
NIM is built for low-latency responses, which makes it a strong fit for applications that need real-time processing. In a self-driving car, for example, decisions must be made from live sensor and camera data; NIM helps ensure the underlying AI models respond quickly enough to meet those real-time demands.
How to Access Models from NVIDIA NIM
Below we will see how we can access models from NVIDIA NIM:
- Log in with your e-mail on the NVIDIA NIM website.
- Choose any model and generate your API key (one way to store the key is sketched below).
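Once you have a key, store it outside your source code. Below is a minimal sketch of a .env file; the variable names NVIDIA_API_KEY and STABLE_DIFFUSION_API are simply the names the scripts later in this article read, and the values shown are placeholders:
# .env (keep this file out of version control)
NVIDIA_API_KEY=your-nvidia-api-key
STABLE_DIFFUSION_API=your-stable-diffusion-api-key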
Checking Inferencing Speed using Different Models
In this section, we will explore how to evaluate the inferencing speed of various AI models. Understanding the response time of these models is crucial for applications that require real-time processing. We will begin with the Reasoning Model, specifically focusing on the Llama-3.2-3b-instruct Preview.
Reasoning Model
The Llama-3.2-3b-instruct model performs natural language processing tasks, effectively comprehending and responding to user queries. Below, we provide the necessary requirements and a step-by-step guide for setting up the environment to run this model.
Requirements
Before we begin, ensure that you have the following libraries installed:
- openai: This library provides the client used to call the OpenAI-compatible NIM endpoint.
- python-dotenv: This library helps manage environment variables, such as your API key, from a .env file.
Install both with pip:
pip install openai python-dotenv
Create Virtual Environment and Activate it
To ensure a clean setup, we will create a virtual environment. This helps manage dependencies without affecting the global Python environment. Run the commands below (the activation command shown is for Windows; on macOS or Linux, use source env/bin/activate instead):
python -m venv env
.\env\Scripts\activate
Code Implementation
Now, we will implement the code to interact with the Llama-3.2-3b-instruct model. The following script creates an API client, accepts user input, streams the model’s response, and measures the response time:
from openai import OpenAI
from dotenv import load_dotenv
import os
import time

# Load the API key from the .env file
load_dotenv()
llama_api_key = os.getenv('NVIDIA_API_KEY')

# NIM exposes an OpenAI-compatible endpoint, so the OpenAI client works directly
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=llama_api_key,
)

user_input = input("What you want to ask: ")

# Time how long it takes for the streaming response to start
start_time = time.time()
completion = client.chat.completions.create(
    model="meta/llama-3.2-3b-instruct",
    messages=[{"role": "user", "content": user_input}],
    temperature=0.2,
    top_p=0.7,
    max_tokens=1024,
    stream=True,
)
end_time = time.time()

# Print the streamed tokens as they arrive
for chunk in completion:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

response_time = end_time - start_time
print(f"\nResponse time: {response_time} seconds")
Response time
The output will include the response time (here, the time until the streamed response begins), allowing you to evaluate the responsiveness of the model: 0.8189256191253662 seconds
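Because the timer above stops before the stream is consumed, it captures how quickly the response starts rather than the full generation time. If you also want the end-to-end figure, here is a minimal sketch of the same call with the timer stopped after the last token arrives; it assumes the same client and user_input as the script above:
# Sketch: measure total generation time by stopping the timer
# only after the stream has been fully consumed.
start_time = time.time()
completion = client.chat.completions.create(
    model="meta/llama-3.2-3b-instruct",
    messages=[{"role": "user", "content": user_input}],
    temperature=0.2,
    top_p=0.7,
    max_tokens=1024,
    stream=True,
)
chunks = []
for chunk in completion:
    if chunk.choices[0].delta.content is not None:
        chunks.append(chunk.choices[0].delta.content)
end_time = time.time()  # the last token has arrived
print("".join(chunks))
print(f"\nTotal generation time: {end_time - start_time:.2f} seconds")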
Stable Diffusion 3 Medium
Stable Diffusion 3 Medium is a generative AI model that transforms text prompts into images, giving creators and developers a way to explore artistic and applied use cases. Below is code that calls this model through the NIM API to generate an image and measures the response time.
Code Implementation
import requests
import base64
from dotenv import load_dotenv
import os
import time

# Load the API key from the .env file
load_dotenv()

invoke_url = "https://ai.api.nvidia.com/v1/genai/stabilityai/stable-diffusion-3-medium"
api_key = os.getenv('STABLE_DIFFUSION_API')

headers = {
    "Authorization": f"Bearer {api_key}",
    "Accept": "application/json",
}

payload = {
    "prompt": input("Enter Your Image Prompt Here: "),
    "cfg_scale": 5,
    "aspect_ratio": "16:9",
    "seed": 0,
    "steps": 50,
    "negative_prompt": ""
}

# Time the request to the image-generation endpoint
start_time = time.time()
response = requests.post(invoke_url, headers=headers, json=payload)
end_time = time.time()

response.raise_for_status()
response_body = response.json()

# The generated image is returned as a base64-encoded string
image_data = response_body.get('image')
if image_data:
    image_bytes = base64.b64decode(image_data)
    with open('generated_image.png', 'wb') as image_file:
        image_file.write(image_bytes)
    print("Image saved as 'generated_image.png'")
else:
    print("No image data found in the response")

response_time = end_time - start_time
print(f"Response time: {response_time} seconds")
Output:
Response time: 3.790468692779541 seconds
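Single measurements can vary from run to run. The sketch below is one way to average the response time over a few prompts; it reuses the invoke_url and headers defined above, and the example prompts are purely illustrative:
# Sketch: average the response time over several prompts
# (reuses invoke_url and headers from the script above).
def time_request(prompt):
    payload = {
        "prompt": prompt,
        "cfg_scale": 5,
        "aspect_ratio": "16:9",
        "seed": 0,
        "steps": 50,
        "negative_prompt": ""
    }
    start = time.time()
    response = requests.post(invoke_url, headers=headers, json=payload)
    response.raise_for_status()
    return time.time() - start

prompts = ["a mountain lake at sunrise", "a futuristic city skyline at night"]
times = [time_request(p) for p in prompts]
print(f"Average response time: {sum(times) / len(times):.2f} seconds")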
Conclusion
As AI applications multiply and their speed requirements grow, businesses need solutions that can run many inference tasks efficiently. NVIDIA NIM addresses this by combining pretrained AI models, GPU-accelerated processing, and a microservices architecture, letting businesses and developers adopt AI in a scalable way. Teams can quickly deploy real-time applications in both cloud and edge settings, which makes NIM-based solutions flexible and robust in the field.
Key Takeaways
- NVIDIA NIM leverages microservices architecture to efficiently scale AI inference by deploying models in modular components.
- NIM is designed to fully exploit NVIDIA GPUs, using tools like TensorRT to accelerate inference for faster performance.
- Ideal for industries like healthcare, autonomous vehicles, and industrial automation where low-latency inference is critical.
Frequently Asked Questions
Q1. What are the primary components of NVIDIA NIM?
A. The primary components include the inference server, pretrained models, TensorRT optimizations, and a microservices architecture for handling AI inference tasks efficiently.
Q2. How does NVIDIA NIM integrate with existing AI models and applications?
A. NVIDIA NIM is built to work easily with current AI models. It lets developers add pretrained models from different sources into their applications by offering containerized microservices with standard APIs, so the models can be included in existing systems without major changes. It essentially acts as a bridge between AI models and applications.
Q3. How does NVIDIA NIM simplify building AI applications?
A. NVIDIA NIM removes hurdles in building AI applications by providing industry-standard APIs for developers, enabling them to build robust copilots, chatbots, and AI assistants. It also makes it easier for IT and DevOps teams to deploy AI models within their own controlled environments.
Q4. How many API credits do you get with NVIDIA NIM?
A. A personal e-mail account receives 1,000 API credits, while a business e-mail account receives 5,000 API credits.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.