Introduction
The field of medical AI has witnessed remarkable advancements in recent years, with the development of powerful language models and datasets driving progress. In this article, we will explore the journey of MedMCQA, a groundbreaking medical question-answering dataset, and its role in shaping the landscape of medical AI. We will examine the challenges faced during its publication, its impact on the research community, and how it paved the way for the development of OpenBioLLM-70B, a state-of-the-art biomedical language model that has surpassed industry giants such as GPT-4, Gemini, Med-PaLM-1, Med-PaLM-2, and Meditron in performance.
The Genesis of MedMCQA
Our idea for developing medical language models originated in 2020, drawing inspiration from the widely-used models BlueBERT and BioBERT.
Upon examining the datasets used for training and fine-tuning in those papers, I noticed that they lacked diversity: they consisted mostly of PubMed articles and corpora annotated for relation extraction. This observation made clear the need for a comprehensive and diverse dataset for the medical AI community.
Motivated by this goal, I started working on a dataset that would later be published under the name MedMCQA. MedMCQA is a collection of questions and answers from the Indian medical domain, sourced from the NEET and AIIMS entrance exams as well as mock tests. By curating this dataset, we aimed to provide a valuable resource for researchers and developers working on medical AI applications, enabling them to train and evaluate models on a wide range of challenging medical questions. The development of MedMCQA marked the beginning of our journey towards creating medical language models.
Challenges and Perseverance: The Journey to Publication
Interestingly, the journey of MedMCQA was not without its challenges. Although the paper was carefully written back in 2021, it faced numerous rejections from top NLP conferences during the peer review process. As almost a year passed without the paper being accepted for publication, I began to feel nervous and doubtful about the quality of our work. At one point, I even considered abandoning the idea of publishing the paper altogether. However, one of my co-authors suggested giving it a final attempt by submitting it to an ACM conference. With renewed determination, we decided to take this last shot and submit our work.
After the paper’s acceptance, it started gaining significant recognition within the medical AI community, and MedMCQA gradually became the largest medical question-answering dataset available. Researchers and developers from various organizations began incorporating it into their language model work. Notable examples include Meta, which used MedMCQA for pre-training and evaluating their Galactica model, and Google, which utilized the dataset in the pre-training and evaluation of their state-of-the-art medical language models, Med-PaLM-1 and Med-PaLM-2. Furthermore, the official OpenAI and Microsoft paper on GPT-4 also employed MedMCQA to evaluate the model’s performance on medical applications.
In the Med-PaLM paper, which showcases Google’s best medical model, a closer look at the datasets used reveals that our Indian dataset, MedMCQA, made the largest contribution among the medical datasets. This highlights the significant impact of Indian research labs in the field of large language models (LLMs) and underscores the importance of our work in advancing medical AI research on a global scale.
The Birth of an Idea: Specialized BERT Models for Medical Domains
In the MedMCQA paper, we presented subject-wise accuracy for the first time in the medical AI field, providing a comprehensive evaluation across approximately 20 medical subjects taught during the preparation for NEET and AIIMS exams in India. This approach ensured that the dataset was diverse and representative of the various disciplines within the medical domain. Additionally, we tested numerous open-ended medical question-answering models and published the results in the paper, establishing a benchmark for future research.
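For readers who want to reproduce this kind of analysis, here is a minimal sketch of computing subject-wise accuracy on the Hugging Face copy of MedMCQA. The `model_predict` function is a placeholder for whatever model you want to evaluate, and the field names are those I believe the public Hugging Face dataset exposes.

```python
from collections import defaultdict
from datasets import load_dataset

# Validation split of MedMCQA from the Hugging Face Hub.
dataset = load_dataset("openlifescienceai/medmcqa", split="validation")

def model_predict(example) -> int:
    """Placeholder: plug any QA model in here. Returns an option index 0-3."""
    return 0  # trivial baseline that always picks option A

correct, total = defaultdict(int), defaultdict(int)
for example in dataset:
    subject = example["subject_name"]   # e.g. "Radiology", "Biochemistry"
    total[subject] += 1
    if model_predict(example) == example["cop"]:  # "cop" = correct option index
        correct[subject] += 1

# Per-subject accuracy, the breakdown reported in the MedMCQA paper.
for subject in sorted(total):
    print(f"{subject}: {correct[subject] / total[subject]:.3f}")
```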
While analyzing the subject-wise accuracy, I had an intriguing thought: since no single model could achieve the highest accuracy across all medical subjects, why not build separate models and embeddings for each subject? At that time, I was working with BERT, as large language models were not yet widely popular. This idea led me to consider developing specialized BERT models for different medical domains, such as BERT-Radiology, BERT-Biochemistry, BERT-Medicine, BERT-Surgery, and so on.
Data Collection and the Evolution from BERT to OpenBioLLM-70B
To pursue this idea, I needed datasets specific to each medical subject, which marked the beginning of my data collection journey. The data collection efforts commenced in 2021, when the plan was still to create specialized BERT models for each domain. As the project evolved and LLMs gained prominence, however, the collected data was ultimately used to fine-tune the Llama-3 model, which later became the foundation for OpenBioLLM-70B. In developing OpenBioLLM-70B, we utilized two types of datasets: instruct data and DPO (Direct Preference Optimization) data.
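To make the distinction concrete, here is an illustrative sketch of what the two record types look like; the content is invented for illustration and is not drawn from our actual data.

```python
# Instruct (supervised fine-tuning) record: an instruction paired with a reference answer.
instruct_example = {
    "instruction": "A 45-year-old patient presents with polyuria and polydipsia. "
                   "Which initial laboratory tests would you order?",
    "output": "Fasting plasma glucose, HbA1c, urinalysis, and serum electrolytes.",
}

# DPO record: the same kind of prompt paired with a preferred and a rejected response.
dpo_example = {
    "prompt": "What is a typical first-line treatment for uncomplicated hypertension?",
    "chosen": "Guidelines commonly recommend a thiazide diuretic, an ACE inhibitor, "
              "or a calcium channel blocker, tailored to the individual patient.",
    "rejected": "Beta-blockers are always the first choice for every patient.",
}
```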
To generate a portion of the instruct dataset, we collaborated with medical students, who provided valuable insights and contributions. We then used this initial dataset to generate additional synthetic data for fine-tuning, which expanded the training set and improved the model’s performance.
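As a rough sketch of how a seed set can be expanded with synthetic variants (the model name and prompt here are illustrative assumptions, not our actual pipeline):

```python
from transformers import pipeline

# Any instruction-tuned model could play the generator role; this one is illustrative.
generator = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

def make_variants(seed_question: str, n: int = 3) -> list[str]:
    """Paraphrase a seed question n times to grow the instruct dataset."""
    prompt = (
        "Rewrite the following medical exam question in a different way, "
        f"keeping the clinical meaning identical:\n{seed_question}\nRewrite:"
    )
    outputs = generator(prompt, num_return_sequences=n, do_sample=True,
                        max_new_tokens=128)
    # The pipeline returns the prompt plus the continuation; keep only the continuation.
    return [o["generated_text"][len(prompt):].strip() for o in outputs]
```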
For the DPO dataset, we employed a unique approach to ensure the quality and relevance of the model’s responses. We generated four responses from the model for each input and presented them to the medical students for evaluation. The students were asked to select the best response, and the final labels were decided by inter-annotator agreement. This helped us identify the most accurate and appropriate answers.
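A sketch of that selection step, assuming simple data structures (four candidate responses and one vote per annotator); this illustrates the idea rather than our exact tooling:

```python
from collections import Counter

def build_dpo_pair(prompt: str, responses: list[str], votes: list[int]):
    """Turn annotator votes over candidate responses into a preference pair.

    votes holds, per annotator, the index of the response they preferred.
    Prompts without a clear majority winner are skipped.
    """
    tally = Counter(votes)
    best_idx, best_count = tally.most_common(1)[0]
    if best_count <= len(votes) / 2:      # no majority agreement: discard
        return None
    # Pair the winner against the least-preferred candidate.
    worst_idx = min(range(len(responses)), key=lambda i: tally.get(i, 0))
    return {"prompt": prompt,
            "chosen": responses[best_idx],
            "rejected": responses[worst_idx]}
```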
To mitigate potential biases in the selection process, we introduced a randomness factor: we randomly sampled approximately 20 examples and swapped their labels from chosen to rejected and vice versa. This technique helped balance the dataset and prevent the experts’ initial choices from being over-weighted.
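A minimal sketch of that swap, with the sample count as a parameter:

```python
import random

def inject_label_noise(pairs: list[dict], n_swaps: int = 20, seed: int = 0) -> list[dict]:
    """Swap chosen/rejected on a small random subset of preference pairs."""
    rng = random.Random(seed)
    for i in rng.sample(range(len(pairs)), k=min(n_swaps, len(pairs))):
        pairs[i]["chosen"], pairs[i]["rejected"] = (
            pairs[i]["rejected"], pairs[i]["chosen"])
    return pairs
```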
As we continue to refine OpenBioLLM-70B, we are actively exploring additional techniques to further align the model with human preferences and improve its performance. Ongoing experiments include multi-turn dialogue DPO settings.
Fine-tuning Llama-3: The Making of OpenBioLLM-70B
Before the release of Llama-3, I had already started working on fine-tuning other models, such as Mistral-7B and Starling. Surprisingly, the fine-tuned Starling model showed the best accuracy of the lot, even outperforming GPT-3.5. We were thrilled with the results and planned to release the models to the public.
However, just as we were about to release the Starling model, we learned that Llama-3 was scheduled to be released on the same day. Given the potential impact of Llama-3, we decided to postpone our release and wait for the model to become available. As soon as Llama-3 was released, I wasted no time in evaluating its performance in the medical domain; within just 15 minutes of its release, I had already begun testing the model. Drawing from our previous experience and the datasets we had prepared, I quickly moved on to fine-tuning Llama-3. For this, we used the same data and hyperparameters we had used for the Starling model.
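We never published the exact hyperparameters, so the following is only a minimal TRL-style sketch of fine-tuning Llama-3 on an instruct dataset; the file name, LoRA settings, and training values are placeholders, not our real configuration. It assumes each record carries a single formatted `text` field.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, TrainingArguments
from trl import SFTTrainer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16, device_map="auto"
)

# Placeholder path: each JSON line holds a formatted prompt+answer in a "text" field.
train_data = load_dataset("json", data_files="medical_instruct.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    dataset_text_field="text",
    peft_config=LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                           task_type="CAUSAL_LM"),
    args=TrainingArguments(
        output_dir="llama3-8b-medical-sft",
        per_device_train_batch_size=4,    # illustrative values only
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        num_train_epochs=3,
        bf16=True,
    ),
)
trainer.train()
```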
Surpassing Industry Giants: OpenBioLLM-70B’s Groundbreaking Performance
The results were astounding. The fine-tuned Llama-3 8B model delivered remarkable performance, surpassing our expectations. The combination of the powerful Llama-3 architecture and our carefully curated medical datasets proved to be a winning formula. It set the stage for the development of OpenBioLLM-70B.
Excited by the impressive performance of the 8B model, I convinced my manager to push the limits and work on the 70B model. Although it was not initially part of our planned experiments, the exceptional accuracy we observed motivated us to explore the potential of a larger model. We quickly prepared the environment to fine-tune the 70B model, which required 8 x 80 GB H100 GPUs. The fine-tuning process was computationally intensive, but once it was completed, we eagerly evaluated the model’s performance. To our astonishment, the results were beyond our wildest expectations. At first, we couldn’t believe what we were seeing! Our fine-tuned Llama-3 70B model was outperforming GPT-4 on various biomedical benchmarks.
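To put that hardware requirement in perspective, some deliberately rough arithmetic (illustrative, not measured numbers) shows why a 70B fine-tune does not fit on one GPU:

```python
params = 70e9  # Llama-3 70B parameter count

weights_gb   = params * 2 / 1e9   # bf16 weights: ~140 GB before anything else
grads_gb     = params * 2 / 1e9   # bf16 gradients for full fine-tuning: ~140 GB
optimizer_gb = params * 8 / 1e9   # Adam moments in fp32: ~560 GB more

print(f"weights:   {weights_gb:.0f} GB")
print(f"gradients: {grads_gb:.0f} GB")
print(f"optimizer: {optimizer_gb:.0f} GB")
# Full fine-tuning would want ~840 GB before activations, far beyond a single
# 80 GB card; hence sharding across 8 x 80 GB H100s and parameter-efficient
# methods such as LoRA/QLoRA to cut the trainable state down.
```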
This groundbreaking achievement marked a significant milestone in our journey to develop OpenBioLLM-70B.
Reassuring Our Trust
I remember the excitement of sharing updates with my manager as our models continued to surpass the performance of industry giants. First, we had the Starling model beating GPT-3.5, then we outperformed Med-PaLM, and finally, we surpassed Gemini. The moment of truth arrived when I sent a message to my manager, announcing that our model had beaten GPT-4. It was a claim so bold that none of us could believe it at first.
We quickly arranged a meeting in the middle of the night, as I often worked late hours. My manager congratulated me and urged me to verify the results multiple times to ensure their accuracy. Despite the audacity of the claim, we rigorously evaluated the model’s performance several times. The results confirmed that we had indeed surpassed GPT-4, Gemini, Med-PaLM-1, Med-PaLM-2, Meditron, and any other model available worldwide at that time.
OpenBioLLM-70B had established itself as the best-performing biomedical language model in existence.
We shared the news on Twitter, and the post went viral. It marked a series of firsts: OpenBioLLM-70B was the first open healthcare model to outperform GPT-4 on biomedical benchmarks and the first healthcare model to gain such widespread popularity. Most importantly, it was the first Indian model to trend in the top 10 on Hugging Face, a list that included industry giants like Apple, Microsoft, and Meta.
A Serendipitous Encounter: Validating OpenBioLLM with Neurologists
On the same day that we achieved this milestone, I had an interesting encounter while traveling from Chennai to Dehradun. During the flight, I met two ladies who asked for help with their iPhone camera, a topic I wasn’t particularly familiar with. However, seeing their need for assistance, I decided to try something unique. Since we were on the plane with no internet, I took out my MacBook, loaded the OpenBioLLM model locally, and handed it over to them. The ladies were unfamiliar with chatbots like ChatGPT, so the experience was entirely new for them. They started by asking questions related to the iPhone, and to their surprise, the model provided quite satisfactory answers. Curious about the technology, they inquired about what it was, and I explained that it was a chatbot specifically designed for healthcare.
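For the curious: running a model fully offline on a laptop is straightforward with a quantized build. Here is a minimal sketch using llama-cpp-python; the GGUF file name is a placeholder, not the exact setup from that flight.

```python
from llama_cpp import Llama

# Placeholder path: any quantized GGUF build of OpenBioLLM would work here.
llm = Llama(model_path="openbiollm-70b.Q4_K_M.gguf", n_ctx=2048)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful medical assistant."},
        {"role": "user",
         "content": "What are common early symptoms of Parkinson's disease?"},
    ],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
```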
Intrigued, they expressed their desire to test the model further and began asking in-depth questions, such as medication suggestions and symptom-related scenarios, all within a proper medical context. Surprised by the complexity of their questions, I politely asked about their background. They revealed that they were both practicing neurologists. I was shocked, and I realized that they were the perfect individuals to evaluate the model’s performance.
They proceeded to test the model more thoroughly, and I could see the astonishment on their faces as they interacted with OpenBioLLM. When I asked them to rate the model on a scale of 0-5, they responded that it was a good model and gave it a rating of 4. Furthermore, they expressed their willingness to assist with data collection and other aspects of the model’s development. I learned that they were from a well-known hospital in Nellore called Narayan Medical College.
The Viral Success of OpenBioLLM and Its Impact on the Research Community
The news of OpenBioLLM’s success spread like wildfire, with numerous blogs, videos, and articles covering the breakthrough. The viral attention was overwhelming at times, but it also opened up incredible opportunities for collaboration and knowledge sharing. I was honored to receive an invitation to present my work at a prestigious lab at Harvard University, and I had the privilege of giving a talk on the same topic at the Edinburgh Core NLP Group. Throughout this journey, I formed friendships with many talented researchers working on exciting projects, such as genomics LLMs and multimodal LLMs.
Working on the OpenBioLLM project was a true honor, but it’s important to note that this is just the beginning. We have ignited a spark that is now growing into a blazing fire, inspiring researchers worldwide to believe in the possibility of achieving meaningful results through techniques like QLoRA and LoRA for fine-tuning large language models. I have been deeply moved by the countless messages of thanks and appreciation I have received from researchers and enthusiasts around the globe. It fills me with immense happiness to know that our work has made a significant contribution to the research community and has the potential to drive further advancements in the field.
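For readers new to these techniques, this is roughly what a QLoRA setup looks like with the bitsandbytes and peft libraries; the values shown are typical defaults, not our exact configuration.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA step 1: load the frozen base model in 4-bit NF4 precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",
    quantization_config=bnb_config,
    device_map="auto",
)

# QLoRA step 2: train only small low-rank adapters on top of the frozen weights.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```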
Future Directions and Collaboration Opportunities
Looking ahead, I am committed to continuing my research journey and working on even more robust and innovative models. Projects in the pipeline include vision-based models for medical applications, genomics and multimodal models, and many more exciting developments.
I am currently exploring several research topics and would be thrilled to collaborate with anyone interested in joining forces. I firmly believe that by working together and leveraging our collective expertise, we can push the boundaries of what is possible in biomedical AI and create solutions that have a lasting impact on healthcare and research. If any of these research areas resonate with you or if you have ideas for collaboration, please don’t hesitate to reach out. I am excited about the future of biomedical AI and the role we can play in shaping it.
The Importance of Developing Foundational Models in India
It’s incredibly gratifying to know that many individuals and companies are using OpenBioLLM-70B in production and finding it useful. I have received numerous queries and appreciation messages from users who have benefited from the model’s capabilities. With OpenBioLLM-70B being the first Indian LLM to gain such widespread adoption, it feels great to have contributed something of value to the AI community.
Looking to the future, I hope that our country will produce more foundational models that can be applied across various domains. I believe that Indian researchers and entrepreneurs should focus on developing robust and innovative models from the ground up, rather than solely relying on APIs. While using APIs is not inherently bad, it’s important to push our limits and work on creating better and more advanced models.
A Call to Action: Leveraging India’s Potential in AI Innovation
There have been instances where people claimed to release impressive models from India, but under the hood, they were merely using existing APIs. Instead, we should strive to develop our own state-of-the-art models that can compete on a global level. In recent times, we have seen the emergence of remarkable language models for Indian languages, such as Tamil-Llama and Odia-Llama. These initiatives showcase the potential and talent within our country. Now, it’s time for us to take the next step and work on models that can make a significant impact on a global scale. India has a wealth of diverse and unique datasets that can be leveraged to train powerful AI models.
By collecting and utilizing these datasets effectively, we can contribute something truly meaningful to the research community. Our country has the potential to become a hub for AI innovation, and it’s up to us to seize this opportunity and drive progress in the field. I strongly encourage my fellow researchers and entrepreneurs to collaborate, share knowledge, and work toward building foundational models that can revolutionize various industries. By pooling our expertise and resources, we can create AI solutions that not only benefit our nation but also have a lasting impact on the global stage.
Conclusion
The story of MedMCQA and OpenBioLLM-70B is a testament to the power of perseverance, innovation, and collaboration in the field of medical AI. From the initial challenges faced during the publication of MedMCQA to the groundbreaking success of OpenBioLLM-70B, this journey highlights the immense potential of Indian researchers and the importance of developing foundational models within our country.
As we look to the future, it is crucial for Indian researchers and entrepreneurs to leverage our nation’s diverse datasets and expertise to create AI solutions that can make a global impact. By collaborating, sharing knowledge, and pushing the boundaries of what is possible, we can establish India as a hub for AI innovation and contribute meaningfully to the advancement of various industries, including healthcare.
The success of OpenBioLLM-70B is just the beginning. We are very excited about the future possibilities and collaborations that lie ahead. Together, let us embrace the challenge of building robust and innovative models that can revolutionize the field of AI and make a lasting difference in the world.