Finetuning Phi-Medium to Generate Cypher Query from Text

Introduction

The rise of Retrieval-Augmented Generation (RAG) and Knowledge Graphs has revolutionized how we interact with complex data sets by providing a structured, interconnected representation of information. Knowledge Graphs, such as those used in Neo4j, facilitate the querying and visualization of relationships within data. However, translating natural language into structured query languages like Cypher remains a challenging task. This guide aims to bridge this gap by detailing the fine-tuning of the Phi-3 Medium model to generate Cypher queries from natural language inputs. By leveraging the compact yet powerful capabilities of the Phi-3 Medium model, even small-scale developers can efficiently convert text to Cypher queries, enhancing the accessibility and usability of Knowledge Graphs.

Learning Objectives

Understand the importance of Cypher Query generation from natural language for developer efficiency.
Learn about Microsoft’s Phi 3 Medium and its role in transforming English queries into code.
Explore Unsloth’s efficiency improvements and memory management for Large Language Models.
Set up the environment for fine-tuning Phi 3 Medium with Unsloth efficiently.
Prepare datasets compatible with Phi 3 Medium and Unsloth for effective fine-tuning.
Master fine-tuning Phi 3 Medium with specific training arguments using SFTTrainer.

This article was published as a part of the Data Science Blogathon.

What is Phi 3 Medium?

Microsoft introduced the Phi family of Large Language Models to show that even small language models can perform well and may be on par with bigger models. Microsoft has trained this family of small models on different types of datasets, making them good at different tasks including entity extraction, summarization, chatbots, roleplay, and more.

Microsoft released these models with the idea that their small size lets even small developers work with them and train them on their own datasets, opening up many different applications. Recently, Microsoft announced the third generation of the Phi family, called the Phi 3 series of Large Language Models.

In the Phi 3 series, the context length was brought up from 4k tokens to 128k tokens, thus allowing more context to fit in. The Phi 3 family comes in different sizes, starting with the smallest, the 3.8 billion parameter Phi 3 Mini, followed by Phi 3 Small, a 7 billion parameter model, and finally Phi 3 Medium, a 14 billion parameter model, which is the one we will train in this guide. All of these models have a long-context version extending the context length to 128k tokens.

Who is Unsloth?

Developed by Daniel and Michael Han, Unsloth has emerged as one of the best optimized frameworks for improving the fine-tuning process for large language models (LLMs). Known for its speed and memory efficiency, Unsloth is reported to increase training speeds by up to 30 times while reducing memory usage by 60%. These capabilities make it a good fit for developers aiming to fine-tune LLMs with accuracy and speed.

Unsloth supports different types of hardware configurations, from NVIDIA GPUs like the Tesla T4 and H100 to AMD and Intel GPUs. It also employs techniques like intelligent weight upcasting, which minimizes the need for weight upscaling during QLoRA, thereby optimizing memory use.

As an open-source tool under the Apache 2.0 license, Unsloth integrates seamlessly into the fine-tuning of prominent LLMs like Mistral 7B, Llama, and Gemma, achieving up to a 5x increase in fine-tuning speed while simultaneously reducing memory usage by 60%. Furthermore, it is compatible with optimizations like Flash-Attention 2, which speeds up not only inference but also the fine-tuning process.

Environment Creation

We will first create our environment. For this we will download Unsloth for Google Colab.

!pip install "unsloth[colab] @ git+https://github.com/unslothai/unsloth.git"

Then we will create some default Unsloth values for training. These are:

from unsloth import FastLanguageModel
import torch

sequence_length_maximum = 2048
weights_data_type = None
quantize_to_4bit = True

We start by importing the FastLanguageModel class from the Unsloth library. Then we define some variables to be worked with throughout the guide:

sequence_length_maximum: It is the max sequence length that the model will handle. We give it a value of 2048.
weights_data_type: Here we tell what data type the model weights should be. We gave it None, which will auto-select the data type.
quantize_to_4bit: Here, we give it a value of True. This then tells the model to load in 4 bits, so that it can easily fit in the Colab GPU.
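
A rough back-of-the-envelope calculation shows why 4-bit loading matters here. The sketch below counts weight storage only, assuming exactly 14 billion parameters and ignoring quantization metadata, activations, and optimizer state, so real memory usage will be higher than these figures.

```python
# Rough weight-memory estimate; all figures are approximations.
PARAMS = 14e9  # assumed parameter count, taken from the "14 billion" figure above


def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Return the storage needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


fp16_gb = weight_memory_gb(PARAMS, 16)  # 16-bit weights
int4_gb = weight_memory_gb(PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {int4_gb:.1f} GB")
```

The free Colab T4 has about 16 GB of GPU memory, so under these assumptions only the 4-bit weights leave room for the LoRA adapters and activations.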

Downloading Model and Creating LoRA Adaptors

Here, we will start by downloading the Phi 3 Medium model. We will do this with Unsloth’s FastLanguageModel class.

model, tokenizer = FastLanguageModel.from_pretrained(
  model_name = "unsloth/Phi-3-medium-4k-instruct",
  max_seq_length = sequence_length_maximum,
  dtype = weights_data_type,
  load_in_4bit = quantize_to_4bit,
  token = "YOUR_HF_TOKEN"
)

When we run the code, the output generated can be seen in the pic above. Both the Phi 3 Medium model and its tokenizer will be downloaded to the Colab environment from the HuggingFace repository.

Fully fine-tuning all 14 billion parameters of Phi 3 Medium is not feasible on a Colab GPU, so we train only a small number of additional weights. For this, we work with LoRA (Low-Rank Adaptation), which keeps the original weights frozen and trains small low-rank adapter matrices attached to selected layers. To do so, we need to create a LoRA config and get the Parameter-Efficient Fine-Tuned model (peft model) from it. The code for this will be:

model = FastLanguageModel.get_peft_model(
  model,
  r = 16,
  target_modules = ["q_proj", "k_proj", "down_proj", "v_proj", "o_proj", 
  "up_proj", "gate_proj"],
  lora_alpha = 16,
  bias = "none",
  lora_dropout = 0,
  random_state = 3407,
  use_gradient_checkpointing = True,  # boolean, not the string "True"
)

Here “r” is the rank of the LoRA matrices. A higher rank means more trainable parameters, and a lower rank means fewer. We set this to a value of 16.
The lora_alpha is the scaling factor applied to the LoRA update. It is often set equal to the rank, as we do here.
Dropout randomly zeroes some of the values passing through the LoRA layers during training. We have kept it at 0, which speeds up training and has little impact on performance.
We can have a bias parameter for the LoRA layers, but setting bias to “none” further improves memory efficiency and decreases training time.
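
To make the effect of the rank concrete, here is a minimal sketch of the parameter arithmetic behind LoRA. For a frozen weight matrix of shape d_out x d_in, LoRA trains two small matrices of shapes d_out x r and r x d_in, and scales their product by lora_alpha / r. The 4096 x 4096 layer size is a hypothetical value chosen for illustration, not the actual dimension of any Phi 3 Medium layer.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank matrices: (d_out x r) and (r x d_in)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: int, r: int) -> float:
    """Factor applied to the low-rank update before adding it to the frozen weight."""
    return lora_alpha / r


# Hypothetical square projection layer, for illustration only.
d_in = d_out = 4096
full_params = d_in * d_out
adapter_params = lora_trainable_params(d_in, d_out, r=16)

print(f"frozen weights:   {full_params:,}")
print(f"LoRA weights:     {adapter_params:,}")
print(f"trainable share:  {100 * adapter_params / full_params:.2f}%")
print(f"scaling factor:   {lora_scaling(lora_alpha=16, r=16)}")
```

With r and lora_alpha both at 16, the scaling factor is 1.0, so the low-rank update is added to the frozen weights without extra amplification.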

After running this code, the LoRA Adapters for the Phi 3 Medium will be created. Now we can work with this peft model and finetune it with a dataset of our choice.

Preparing the Dataset for Fine-tuning

Here, we will train the Phi 3 Medium Large Language Model on a dataset that teaches it to generate the Cypher Queries needed for querying Knowledge Graph databases like Neo4j. For this, we will download the dataset from a GitHub repository. The command for this will be:

!wget https://raw.githubusercontent.com/neo4j-labs/text2cypher\
/main/datasets/synthetic_gpt4turbo_demodbs/text2cypher_gpt4turbo.csv

The above command downloads a CSV file containing the dataset that we will use to train the Phi 3 Medium LLM. Before that, we need to do some preprocessing, because we only take a subset of the dataset. The code for this will be:

import pandas as pd
df = pd.read_csv('/content/text2cypher_gpt4turbo.csv')
df = df[(df['database'] == 'recommendations') & 
(df['syntax_error'] == False) & (df['timeout'] == False)]
df = df[['question','cypher']]
df.rename(columns={'question': 'input','cypher':'output'}, inplace=True)
df.reset_index(drop=True, inplace=True)

Here, we filter the data. We keep only the rows that come from the recommendations database, and of those, only the rows that have no syntax error and no timeout. This is necessary because we want Phi 3 to give us syntax-error-free Cypher Queries when asked.

The dataset contains many columns, but the question and cypher columns are the only ones we need. We also rename these columns to input and output, where the question column is the input and the cypher column is the output that needs to be generated by the Large Language Model.

In the output pic, we can see the first 5 rows of the dataset. It contains only two columns, input and output. The database we are working with, for the training data, has a schema to it.

Schema for this Database

graph_schema = """
Node properties:
- **Movie**
  - `url`: STRING Example: "https://themoviedb.org/movie/862"
  - `runtime`: INTEGER Min:1, Max: 915
  - `revenue`: INTEGER Min: 1, Max: 2787965087
  - `budget`: INTEGER Min: 1, Max: 380000000
  - `imdbRating`: FLOAT Min: 1.6, Max: 9.6
  - `released`: STRING Example: "1995-11-22"
  - `countries`: LIST Min Size: 1, Max Size: 16
  - `languages`: LIST Min Size: 1, Max Size: 19
  - `imdbVotes`: INTEGER Min: 13, Max: 1626900
  - `imdbId`: STRING Example: "0114709"
  - `year`: INTEGER Min: 1902, Max: 2016
  - `poster`: STRING Example: "https://image.tmdb.org/t/p/w440_and_h660_face/uXDf"
  - `movieId`: STRING Example: "1"
  - `tmdbId`: STRING Example: "862"
  - `title`: STRING Example: "Toy Story"
- **Genre**
  - `name`: STRING Example: "Adventure"
- **User**
  - `userId`: STRING Example: "1"
  - `name`: STRING Example: "Omar Huffman"
- **Actor**
  - `url`: STRING Example: "https://themoviedb.org/person/1271225"
  - `bornIn`: STRING Example: "France"
  - `bio`: STRING Example: "From Wikipedia, the free encyclopedia  Lillian Di"
  - `died`: DATE Example: "1954-01-01"
  - `born`: DATE Example: "1877-02-04"
  - `imdbId`: STRING Example: "2083046"
  - `name`: STRING Example: "François Lallement"
  - `poster`: STRING Example: "https://image.tmdb.org/t/p/w440_and_h660_face/6DCW"
  - `tmdbId`: STRING Example: "1271225"
- **Director**
  - `url`: STRING Example: "https://themoviedb.org/person/88953"
  - `bornIn`: STRING Example: "Burchard, Nebraska, USA"
  - `bio`: STRING Example: "Harold Lloyd has been called the cinema’s “first m"
  - `died`: DATE Min: 1930-08-26, Max: 2976-09-29
  - `born`: DATE Min: 1861-12-08, Max: 2018-05-01
  - `imdbId`: STRING Example: "0516001"
  - `name`: STRING Example: "Harold Lloyd"
  - `poster`: STRING Example: "https://image.tmdb.org/t/p/w440_and_h660_face/er4Z"
  - `tmdbId`: STRING Example: "88953"
- **Person**
  - `url`: STRING Example: "https://themoviedb.org/person/1271225"
  - `bornIn`: STRING Example: "France"
  - `bio`: STRING Example: "From Wikipedia, the free encyclopedia  Lillian Di"
  - `died`: DATE Example: "1954-01-01"
  - `born`: DATE Example: "1877-02-04"
  - `imdbId`: STRING Example: "2083046"
  - `name`: STRING Example: "François Lallement"
  - `poster`: STRING Example: "https://image.tmdb.org/t/p/w440_and_h660_face/6DCW"
  - `tmdbId`: STRING Example: "1271225"
Relationship properties:
- **RATED**
  - `rating: FLOAT` Example: "2.0"
  - `timestamp: INTEGER` Example: "1260759108"
- **ACTED_IN**
  - `role: STRING` Example: "Officer of the Marines (uncredited)"
- **DIRECTED**
  - `role: STRING`
The relationships:
(:Movie)-[:IN_GENRE]->(:Genre)
(:User)-[:RATED]->(:Movie)
(:Actor)-[:ACTED_IN]->(:Movie)
(:Actor)-[:DIRECTED]->(:Movie)
(:Director)-[:DIRECTED]->(:Movie)
(:Director)-[:ACTED_IN]->(:Movie)
(:Person)-[:ACTED_IN]->(:Movie)
(:Person)-[:DIRECTED]->(:Movie)
"""

The schema contains all the Node properties and the Relationships between the nodes that are present in the recommendations graph database. Now, we will convert the data to an instruction format, so that the model learns to output a Cypher query when it is instructed to do so. The code for this will be:

prompt = """Given are the instruction below, having an input \
that provides further context.
### Instruction:
{}
### Input:
{}
### Response:
{}"""
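token_eos = tokenizer.eos_token  # end-of-sequence token, so the model learns where to stop

def format_prompt(columns):
    # The instruction (including the full schema) is the same for every row.
    instructions = (
        "Use the below text to generate a cypher query. "
        f"The schema is given below:\n{graph_schema}"
    )
    inps = columns["input"]
    outs = columns["output"]
    text_list = []
    # Descriptive loop names avoid shadowing the built-in input().
    for question, cypher in zip(inps, outs):
        text = prompt.format(instructions, question, cypher) + token_eos
        text_list.append(text)
    return {"text": text_list}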

Here we first define our Prompt Template. In this template, we start with the instruction, followed by the input, and finally the output.
Then we create a function called format_prompt(). This takes in a batch of data and extracts the input and output columns from it.
Then we iterate through each row in the input and output columns and fit them into the Prompt Template.
Along with that, we append the end-of-sequence token, stored in token_eos, to each Prompt, which tells the model that generation needs to stop there.
We finally return the list containing all these Prompts in a dictionary under the “text” key.
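
To see what a single training example looks like, the steps above can be sketched without loading the model. This is a minimal, self-contained version: the "<eos>" string is a placeholder for tokenizer.eos_token, toy_schema is a shortened stand-in for graph_schema, and the question and query in the batch are made up for illustration.

```python
# Same template as in the article, reproduced so the sketch runs on its own.
prompt = """Given are the instruction below, having an input \
that provides further context.
### Instruction:
{}
### Input:
{}
### Response:
{}"""

EOS = "<eos>"  # placeholder; the real value comes from tokenizer.eos_token
toy_schema = "(:User)-[:RATED]->(:Movie)"  # shortened stand-in for graph_schema


def format_prompt(columns):
    """Turn a batch of input/output columns into full training prompts."""
    instructions = (
        "Use the below text to generate a cypher query. "
        f"The schema is given below:\n{toy_schema}"
    )
    texts = [
        prompt.format(instructions, question, cypher) + EOS
        for question, cypher in zip(columns["input"], columns["output"])
    ]
    return {"text": texts}


# A made-up one-row batch in the column-oriented shape that a batched map provides.
batch = {
    "input": ["How many movies are there?"],
    "output": ["MATCH (m:Movie) RETURN count(m)"],
}
result = format_prompt(batch)
print(result["text"][0])
```

Each prompt ends with the expected query followed directly by the end-of-sequence marker, which is what teaches the model to stop after the query.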

This function above will be passed to our dataset to create the final column. The code for this will be:

from datasets import Dataset
dataset = Dataset.from_pandas(df)
dataset = dataset.map(format_prompt, batched = True)

Here, we start by importing the Dataset class from the datasets library.
Then we convert our data from a pandas DataFrame to the Dataset type by calling the .from_pandas() method and passing it the DataFrame.
Now, we map the function that we have created over the dataset, in batches, to create our final dataset for training.

Running the code will create a new column called “text”, which will contain the prompts built by the format_prompt() function. From the pic above, we can see that there are a total of 700+ rows of data in our dataset and three columns: text, input, and output. With this, we have our data ready for fine-tuning.
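
Because every prompt embeds the full schema, it can be worth checking that the prompts fit within the maximum sequence length before training. The helper below is a hypothetical sketch, not part of the original walkthrough: it takes any tokenize function, and whitespace splitting stands in for the real tokenizer so the example runs on its own.

```python
def count_too_long(texts, tokenize, max_tokens):
    """Count the texts whose token count exceeds max_tokens."""
    return sum(1 for text in texts if len(tokenize(text)) > max_tokens)


# Whitespace splitting is only a stand-in for the real tokenizer in this sketch.
sample_texts = [
    "MATCH (m:Movie) RETURN m.title LIMIT 5",  # 6 whitespace-separated tokens
    "word " * 3000,                            # 3000 whitespace-separated tokens
]
too_long = count_too_long(sample_texts, str.split, max_tokens=2048)
print(f"{too_long} of {len(sample_texts)} texts exceed the limit")
```

With the real tokenizer, a function such as `lambda t: tokenizer(t)["input_ids"]` could be passed in place of `str.split`, and `dataset["text"]` in place of the sample list.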

Fine-tuning Phi 3 Medium for Text2Cypher Query

We are now ready to fine-tune the Phi 3 Medium on the Cypher Query dataset. In this section, we start by creating our Trainer and the corresponding Training Arguments that we need to train our model on this dataset that we have prepared. The code for this will be:

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = sequence_length_maximum,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs=1,
        learning_rate = 2e-4,
        fp16 = True,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.02,
        lr_scheduler_type = "linear",
        output_dir = "outputs",
    ),
)

We start by importing SFTTrainer from the trl library, which we will use to perform Supervised Fine-Tuning.
We also import the TrainingArguments class from the transformers library to set the training config for the model.
Then we create an instance of SFTTrainer with various parameters and store it in the trainer variable:
model = model: Specifies the pre-trained model to be fine-tuned.
tokenizer = tokenizer: Specifies the tokenizer associated with the model.
train_dataset = dataset: Sets the dataset that we have prepared for training the model.
dataset_text_field = “text”: Indicates the field in the dataset that contains the text data.
max_seq_length = sequence_length_maximum: Here, we provide the maximum sequence length for the model.
dataset_num_proc = 2: Number of processes to use for processing the dataset.
packing = False: Disables packing of sequences. Packing combines several short examples into one sequence and can speed up training on short sequences, but we leave it off here.

While training a Large Language Model or any Deep Learning model, we must set many different hyperparameters, which help bring out the best-performing model. The ones we set here are described below.

Different Parameters

We send two examples to the GPU at a time, so we select a per-device batch size of 2.
We accumulate gradients over 4 such batches before each weight update, so we set the gradient accumulation steps to 4. This gives an effective batch size of 8.
We have set the warmup steps to 5, so the learning rate ramps up gradually over the first five steps before reaching its full value of 2e-4.
We want to run the training over the whole dataset once, so we set the number of epochs to 1.
We want to print the metrics after every step, so we set the logging steps to 1, which logs the training loss at each step.
The optimizer uses the gradients to update the weights so that the training loss decreases. Here we go with the 8-bit AdamW optimizer (“adamw_8bit”), which reduces the memory taken up by the optimizer states.
Weight decay keeps the weights from growing to extreme values. We gave it a value of 0.02.
The learning rate scheduler changes the learning rate while the training is happening. Here we want it to change linearly, so we chose the option called “linear”.
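
The interaction between the warmup steps and the linear scheduler is easier to see with numbers. The function below is a simplified sketch of a linear warmup followed by a linear decay to zero; it approximates the shape of the schedule rather than reproducing the library's code, and the total of 95 steps is an assumption used only for illustration.

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay from base_lr at the end of warmup down to 0 at the last step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


BASE_LR = 2e-4     # learning_rate from the training arguments
WARMUP_STEPS = 5   # warmup_steps from the training arguments
TOTAL_STEPS = 95   # assumed total number of optimizer steps

for step in (0, 1, 5, 50, 95):
    lr = linear_schedule_lr(step, BASE_LR, WARMUP_STEPS, TOTAL_STEPS)
    print(f"step {step:>2}: lr = {lr:.2e}")
```

Under these assumptions the learning rate peaks at 2e-4 right after warmup and has halved by step 50.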

We are now done with defining our Trainer and the TrainingArguments for training our quantized, 14 billion parameter Phi 3 Medium Large Language Model. Running trainer.train() will start the training.

trainer_stats = trainer.train()

Running the above will start the training process. In Google Colab, working with the free T4 GPU, it takes around 1 hour and 40 minutes to go through 1 epoch of the training data. It takes around 95 steps to complete one epoch. Finally, the training is completed.
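
The number of training steps follows from the dataset size and the batch settings. Here is a small sketch of that arithmetic; the row count of 760 is an assumed value consistent with the "700+ rows" mentioned earlier, and the exact rounding may differ between library versions.

```python
import math


def optimizer_steps_per_epoch(num_rows, per_device_batch, grad_accum, num_devices=1):
    """Approximate number of weight updates needed to see every row once."""
    effective_batch = per_device_batch * grad_accum * num_devices
    return math.ceil(num_rows / effective_batch)


NUM_ROWS = 760  # assumed dataset size; the article only says "700+"
steps = optimizer_steps_per_epoch(NUM_ROWS, per_device_batch=2, grad_accum=4)

# 1 hour 40 minutes of training spread over those steps.
seconds_per_step = (100 * 60) / steps

print(f"steps per epoch:  {steps}")
print(f"seconds per step: {seconds_per_step:.0f}")
```

So each optimizer step, which processes 8 examples, takes roughly a minute on the T4 under these assumptions.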

Generating Cypher Query with Phi 3 Medium

We have now finished training the model. Now we will test the model to check how well it generates cypher queries given a text.

FastLanguageModel.for_inference(model)  # switch the model to inference mode
inputs = tokenizer(
[
   prompt.format(
       f"Convert text to cypher query based on this schema: \n{graph_schema}",
       "What are the top 5 movies with a runtime greater than 120 minutes",
       "",  # response slot is left empty for the model to fill in
   )
], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 128)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))

We start by preparing the trained model for inference by passing it to the for_inference() method of the FastLanguageModel class.
Then we call the tokenizer and give it the input Prompt. We work with the same Prompt Template that we have defined and give it the question “What are the top 5 movies with a runtime greater than 120 minutes”, leaving the response empty.
These are then given to the model to generate the output tokens. We have set the max new tokens to 128 and store the generated result in the outputs variable.
Finally, we decode the output tokens and print them.
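
The decoded text contains the whole prompt, including the schema, followed by the generated query. One way to keep only the query is to split on the response marker from the Prompt Template. The helper below is a hypothetical sketch, and the decoded string in it is invented to show the shape of the output rather than copied from the model.

```python
RESPONSE_MARKER = "### Response:"


def extract_cypher(decoded: str) -> str:
    """Return the text after the last response marker, stripped of whitespace.

    If the marker is missing, the whole string is returned unchanged
    (apart from stripping), so the caller can still inspect it.
    """
    _, _, tail = decoded.rpartition(RESPONSE_MARKER)
    return tail.strip()


# Invented example of a decoded generation, for illustration only.
decoded_example = (
    "### Instruction:\nConvert text to cypher query based on this schema: ...\n"
    "### Input:\nWhat are the top 5 movies with a runtime greater than 120 minutes\n"
    "### Response:\nMATCH (m:Movie) WHERE m.runtime > 120 RETURN m.title LIMIT 5"
)
print(extract_cypher(decoded_example))
```

In the walkthrough this would be applied to the string returned by tokenizer.decode(), so that only the Cypher is passed on to the database.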

We can see the results of running this code in the above pic. We see that the Cypher Query generated by the model matches the ground-truth Cypher Query. Let us test with some more examples to see the performance of the fine-tuned Phi 3 Medium for Cypher Query generation.

inputs = tokenizer(
[
   prompt.format(
       f"Convert text to cypher query based on this schema: \n{graph_schema}",
       "Which 3 directors have the longest bios in the database?",
       "",  # response slot is left empty for the model to fill in
   )
], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 128)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))

inputs = tokenizer(
[
   prompt.format(
       f"Convert text to cypher query based on this schema: \n{graph_schema}",
       "List the genres that have movies with an imdbRating less than 4.0.",
       "",
   )
], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 128)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))

We can see that in both the examples above, the fine-tuned Phi 3 Medium model has generated the correct Cypher Query for the provided question. In the first example, Phi 3 Medium did provide the right answer but took a slightly different approach. With this, we can say that fine-tuning Phi 3 Medium on the Cypher dataset has made it slightly more accurate at generating Cypher Queries from text.

Conclusion

This guide has detailed the fine-tuning process of the Phi 3 Medium model for generating Cypher queries from natural language inputs, aimed at enhancing accessibility to Knowledge Graphs like Neo4j. Through leveraging tools like Unsloth for efficient model training and deploying techniques such as LoRA adapters to optimize parameter usage, developers can effectively translate complex data queries into structured Cypher commands.

Key Takeaways

The Phi 3 family of models developed by Microsoft allows small developers to train these models on their own datasets for different scenarios.
Unsloth, a Python library, is a great tool for fine-tuning small language models, improving training speed and memory efficiency.
Creating the environment involves installing necessary libraries and configuring parameters like the sequence length and data type.
LoRA is a method that trains only a small number of parameters compared with the whole Large Language Model, thus allowing us to train on consumer hardware.
Text to Cypher query generation will allow developers to let Large Language Models access Graph Databases to provide more accurate responses.

Frequently Asked Questions

Q1. What are the benefits of using Phi-3 Medium for this task?

A. Phi-3 Medium is a compact and powerful LLM, making it suitable for developers with limited resources. Fine-tuning allows it to specialize in Cypher query generation, improving accuracy and efficiency.

Q2. What is Unsloth and how does it help?

A. Unsloth is a framework specifically designed to optimize the fine-tuning process for large language models. It offers significant speed and memory usage improvements compared to traditional methods.

Q3. What fine-tuning dataset is required?

A. The guide uses a dataset containing pairs of natural language questions and their corresponding Cypher queries. This dataset helps the model learn the relationship between text and the structured query language.

Q4. How does the fine-tuning process work?

A. The guide outlines steps for setting up the training environment, downloading the pre-trained model, and preparing the dataset. It then details how to fine-tune the model using Unsloth and a specific training configuration.

Q5. How do I generate Cypher queries with the fine-tuned model?

A. Once trained, the model can be used to generate Cypher Query from Text. The guide provides an example of how to structure the input and decode the generated query.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
