Gen AI Powered Data Insight Generation using LIDA 

Introduction

This article introduces LIDA, an open-source Python library for generating data visualizations and infographics. We will first look at how LIDA works and what its core capabilities are, and then see it in action by building a Streamlit application that lets users explore a CSV dataset and surface insights automatically through generated visualizations.

Learning Objectives

  • Understand the challenges associated with manual data exploration and analysis.
  • Explore LIDA architecture and core building blocks.
  • Learn to build a fully functional Streamlit application for automated data exploration and insight generation.

This article was published as a part of the Data Science Blogathon.

Challenges with Manual Data Exploration

Manual data exploration is a labor-intensive process that demands significant time and effort to clean, analyze, and visualize data. Analysts often have to sift through large datasets, which increases the likelihood of human error and of overlooked patterns or insights. The manual approach can also be inconsistent: it relies heavily on the skills and expertise of the individual analyst, which makes it difficult to reproduce results or scale the process to larger datasets.

Automating data exploration speeds up analysis and can make the resulting insights more consistent and comprehensive. Automation tools like LIDA streamline data visualization and insight generation, allowing users to focus on decision-making and strategic planning.

What is LIDA and How Does it Work?

LIDA is an open-source Python library for generating data visualizations and infographics. It is grammar-agnostic, meaning it can work with any programming language and with multiple visualization libraries, such as Matplotlib and Seaborn.

LIDA Architecture

LIDA consists of four key modules that work in sequence to generate data visualizations and infographics automatically:

Summarizer

  • Function: Converts datasets into a rich but compact natural language representation (context).
  • Process: Uses rules and large language models (LLMs) to analyze the dataset.
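
To make the rule-based half of this step concrete, here is a minimal sketch that condenses a CSV into a compact per-column description using only the standard library. It is an illustration under stated assumptions, not LIDA's actual implementation: the function name `summarize_csv`, the output keys, and the tiny sample dataset are all made up for this example.

```python
import csv
import io
import json


def _as_numbers(values):
    """Return the values as floats, or None if any value is not numeric."""
    try:
        return [float(v) for v in values]
    except ValueError:
        return None


def summarize_csv(text, n_samples=3):
    """Build a compact, rule-based summary of CSV text (hypothetical format)."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    fields = []
    for name in reader.fieldnames or []:
        # Ignore empty or missing cells when profiling a column
        values = [row[name] for row in rows if row[name] not in ("", None)]
        numbers = _as_numbers(values)
        if numbers:
            props = {"dtype": "number", "min": min(numbers), "max": max(numbers)}
        else:
            props = {"dtype": "string"}
        props["num_unique_values"] = len(set(values))
        props["samples"] = values[:n_samples]
        fields.append({"column": name, "properties": props})
    return {"n_rows": len(rows), "fields": fields}


if __name__ == "__main__":
    # Made-up sample data, only for demonstration
    sample = "name,mpg\nfiat,30.5\nford,18\n"
    print(json.dumps(summarize_csv(sample), indent=2))
```

A summary of this kind is small enough to place in an LLM prompt, which is where the second half of the Summarizer (LLM-based enrichment) would take over.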

Goal Explorer

  • Function: Generates a set of potential “goals” based on the dataset context.
  • Process: Utilizes LLMs to interpret the context and suggest relevant visualization goals.

Viz Generator

  • Function: Generates, evaluates, repairs, filters, and executes visualization code to meet specified goals.
  • Process: Leverages LLMs to create visualization code in the appropriate programming language or grammar.

Infographer

  • Function: Generates stylized infographics based on the visualization and style prompts.
  • Process: Applies image generation models (IGMs) to transform visualizations into styled infographics.

Now that we are familiar with LIDA’s building blocks and their functions, let’s see how they fit together in a single workflow:

  • Dataset Input: The user provides a CSV dataset (e.g., Cars.csv).
  • Summarization: The Summarizer processes the dataset and generates a natural language context.
  • Goal Exploration: The Goal Explorer uses the context to suggest possible visualization goals.
  • Visualization Generation: The Viz Generator creates and executes code to produce visualizations based on the selected goals.
  • Infographic Creation: The Infographer transforms these visualizations into styled infographics according to user-defined prompts.
  • Output Delivery: The system outputs a natural language summary, suggested goals, visualization code, and the final stylized infographics.

This integrated approach streamlines the process of data exploration, visualization, and infographic creation, making it efficient and user-friendly.
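
The first stages of this workflow can be sketched as a single function. The sketch assumes a manager object that exposes `summarize`, `goals`, and `visualize` with the same keyword arguments used in the application code later in this article; the function name `run_lida_pipeline` and its return shape are hypothetical, and the Infographer stage is left out.

```python
def run_lida_pipeline(manager, data, textgen_config, n_goals=5, library="seaborn"):
    """Run summarize -> goals -> visualize and return the intermediate results.

    `manager` is assumed to behave like lida.Manager; nothing here is
    LIDA-specific beyond the three method names and their keyword arguments.
    """
    # Step 1: compact natural language context for the dataset
    summary = manager.summarize(
        data, summary_method="default", textgen_config=textgen_config
    )
    # Step 2: candidate visualization goals derived from that context
    goals = manager.goals(summary, n=n_goals, textgen_config=textgen_config)
    # Step 3: one visualization attempt per goal
    results = []
    for goal in goals:
        charts = manager.visualize(
            summary=summary,
            goal=goal,
            textgen_config=textgen_config,
            library=library,
        )
        results.append((goal, charts))
    return summary, results
```

Keeping the manager as a parameter makes the data flow easy to follow and lets the function be exercised with a stand-in object before any LLM calls are made.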

Building Application for Automatic Insight Generation

Now that we have a fair idea of LIDA and how it works, let’s roll up our sleeves and build a Streamlit application that accepts a CSV dataset as input and uses LIDA to generate data visualizations automatically.

Step 1: Install Python Libraries

First things first, let’s install the Python libraries required by our application. We will create a requirements.txt file listing the following libraries:

Python Library   Description/Use case
uvicorn          A lightning-fast ASGI server for running Python web applications
streamlit        An open-source app framework for creating and sharing custom web apps
pandas           A data manipulation and analysis library providing data structures like DataFrames
lida             A toolkit for generating data visualizations and data-faithful infographics
python-dotenv    Loads environment variables, such as API keys, from a .env file

Then install all the libraries by running the command “pip install -r requirements.txt”.
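
After installation, a quick check like the following can confirm that every library is importable. This is a small sketch, not part of the original application: the `REQUIRED` mapping and the function name `missing_packages` are assumptions made for this example. Note that python-dotenv is imported under the module name `dotenv`.

```python
from importlib.util import find_spec

# Package name on PyPI -> module name used in import statements
REQUIRED = {
    "uvicorn": "uvicorn",
    "streamlit": "streamlit",
    "pandas": "pandas",
    "lida": "lida",
    "python-dotenv": "dotenv",
}


def missing_packages(required=REQUIRED):
    """Return the package names whose import module cannot be found."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```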

Step 2: Integrating LIDA with an LLM

Next, we need to connect LIDA to an LLM, which is used to summarize the dataset, generate goals, and finally generate and execute visualization code. LIDA is flexible here and works with multiple large language model providers, including OpenAI, Azure OpenAI, PaLM, Cohere, and Hugging Face. For our application we will use OpenAI’s GPT-3.5 Turbo model, which requires an OpenAI API key.

To generate an API key, first create an OpenAI account or sign in. Next, navigate to the API keys page and click “Create new secret key”, optionally naming the key. Save the key somewhere safe and do not share it with anyone.

Once we have the API key, create a .env file in the project folder and save the key there as OPENAI_API_KEY, the variable name that the application code reads.
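
The .env file holds a single line of the form `OPENAI_API_KEY=your-key-here`. The sketch below shows one way to load it and fail early with a clear message if the key is missing. The helper name `load_openai_key` is hypothetical, and the import of python-dotenv is guarded so the function still works when the variable is already set in the environment.

```python
import os


def load_openai_key(env_path=".env"):
    """Return the OpenAI API key, loading it from a .env file if available."""
    try:
        # python-dotenv is optional here; without it, only the process
        # environment is consulted.
        from dotenv import load_dotenv

        load_dotenv(dotenv_path=env_path)
    except ImportError:
        pass

    key = os.getenv("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Add it to your .env file or environment."
        )
    return key
```

Failing at startup is usually friendlier than letting the first LLM call fail with an authentication error halfway through a user session.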

Step 3: Streamlit Application Logic

Finally, we will create the app.py file containing the Streamlit application logic and LIDA API call. 

import streamlit as st
import pandas as pd
from lida import Manager, TextGenerationConfig , llm  
from PIL import Image
from io import BytesIO
import base64
from dotenv import load_dotenv
import os
import openai

# Configuring the OpenAI API Key
load_dotenv()
openai.api_key = os.getenv('OPENAI_API_KEY')

# Convert base64-encoded charts into images so they can be displayed on the Streamlit front-end
def base64_to_image(base64_string):
    # Decode the base64 string
    byte_data = base64.b64decode(base64_string)
    # Use BytesIO to convert the byte data to image
    return Image.open(BytesIO(byte_data))

# Streamlit App Code
st.set_page_config(
    page_title="Automatic Insights and Visualization App",
    page_icon="🤖",
    layout="centered",
    initial_sidebar_state="expanded",
)

st.header("Automatic Insights and Visualization 🤖")


menu = st.sidebar.selectbox("Choose an Option", ["Automatic Insights"])

if menu == "Automatic Insights":
    st.subheader("Generate Automatic Insights")
    # Upload CSV dataset as input
    uploaded_file = st.file_uploader("Choose a csv file")
    if uploaded_file is not None:
        dataframe = pd.read_csv(uploaded_file)
        st.write(dataframe)
        btn = st.button("Generate Suggestions", type = "primary")

        if btn: 
            # Generate goals using LIDA
            lida = Manager(text_gen = llm("openai"))
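            # The dated "gpt-3.5-turbo-0301" snapshot has been retired by OpenAI,
            # so the undated model alias is used here instead
            textgen_config = TextGenerationConfig(n=1, 
                                                  temperature=0.5, 
                                                  model="gpt-3.5-turbo", 
                                                  use_cache=True)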
            summary = lida.summarize(dataframe, 
                      summary_method="default", 
                      textgen_config=textgen_config)  
            goals = lida.goals(summary, n=5, textgen_config=textgen_config)

            library = "seaborn"
            textgen_config = TextGenerationConfig(n=1, temperature=0.2, use_cache=True)
            # Create the corresponding data visualization for each goal
            imgs = []
            for goal in goals:
                charts = lida.visualize(summary=summary, 
                                        goal=goal, 
                                        textgen_config=textgen_config, 
                                        library=library)
                # visualize() may return an empty list if the generated code fails to run
                imgs.append(base64_to_image(charts[0].raster) if charts else None)

            # One tab per generated goal, instead of five hard-coded tabs
            tabs = st.tabs([f"Goal {n}" for n in range(1, len(goals) + 1)])
            for n, (tab, goal, img) in enumerate(zip(tabs, goals, imgs), start=1):
                with tab:
                    st.header(f"Goal {n}")
                    st.write(goal.question)
                    if img is not None:
                        st.image(img)
                    else:
                        st.warning("No chart could be generated for this goal.")

Once all the files are ready, you can run the Streamlit application using the command “streamlit run app.py”.

[Figure: Generate Automatic Insights (LIDA)]

Conclusion

We explored the challenges associated with manual data exploration and how tools like LIDA streamline the process by providing a flexible, automated solution for data exploration and insight generation. We also looked at the LIDA system architecture and its core capabilities. Lastly, we saw LIDA in action by building an automatic insight generation application using Streamlit.

Here is the link for the video depicting the final application and its working.

Key Takeaways

  • Whether you’re working with Matplotlib or Seaborn, Python or any other programming language, LIDA fits right into your workflow.
  • Leverage the latest language models to generate intelligent insights and recommendations for your data.
  • No steep learning curves here. LIDA is designed to be intuitive and easy to use, so you can focus on what matters to you and the business: making data-driven decisions.
  • Automating data exploration accelerates the analysis process and helps produce more consistent and comprehensive insights.

Frequently Asked Questions

Q1. Which LLM providers does LIDA support?

A. LIDA supports multiple large language model providers, including OpenAI, Azure OpenAI, PaLM, Cohere, and Hugging Face.

Q2. Is an API key required to work with LIDA?

A. LIDA itself is an open-source library that you install and run locally, so it does not require an API key. However, you may need an API key for the LLM you use with LIDA. For example, you will need an OpenAI API key if you are using a model like GPT-3.5 Turbo.

Q3. Does LIDA support query-based visualization generation?

A. Yes. Instead of relying on LIDA for goal generation, a user can explicitly provide the query or goal and generate the desired chart. LIDA also supports multilingual input.
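
As a rough sketch of what this could look like, the helper below passes a plain-text query to `visualize` in place of a generated goal. It assumes that the manager accepts a string for the `goal` argument, which LIDA's `Manager.visualize` appears to support; the helper name `chart_for_query` and the example question in the comment are hypothetical.

```python
def chart_for_query(manager, summary, query, textgen_config, library="seaborn"):
    """Generate a chart for a user-written query; return None if none was produced.

    `manager` is assumed to behave like lida.Manager, and `summary` is the
    output of an earlier summarize() call.
    """
    charts = manager.visualize(
        summary=summary,
        goal=query,  # a plain string instead of a generated goal (assumption)
        textgen_config=textgen_config,
        library=library,
    )
    return charts[0] if charts else None


# Example usage inside the app (not executed here; the column name is made up):
# chart = chart_for_query(lida, summary, "Show the distribution of price", textgen_config)
# if chart is not None:
#     st.image(base64_to_image(chart.raster))
```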

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
