Recent multimodal foundation models such as GPT-4V perform strongly on general vision-and-language tasks. However, adapting these models to specialized domains like biomedicine requires large, domain-specific instruction datasets. Automatic dataset generation has been explored, but the resulting datasets are often poorly aligned with expert knowledge, which limits their real-world applicability. Instruction tuning, which fine-tunes models on task-specific prompts, has been effective but relies on extensive, costly datasets. Further challenges include the lack of publicly available data generators and the scarcity of clinician-annotated data, both of which hinder the development of expert-aligned models for specialized applications.
Researchers from Stanford University and Harvard Medical School have developed a framework called Biomedical Visual Instruction Tuning with Clinician Preference Alignment (BioMed-VITAL). This data-centric approach integrates clinician preferences in generating and selecting instruction data for biomedical multimodal foundation models. Initially, clinician-selected demonstrations guide the generation of relevant data using GPT-4V. Subsequently, a selection model, informed by clinician-annotated and model-annotated data, ranks the generated samples based on quality. The framework significantly enhances model performance, achieving an 18.5% improvement in open visual chat and an 81.73% win rate in biomedical visual question answering.
Instruction tuning has become a powerful technique for adapting pre-trained language models to various natural language tasks by providing task-specific instructions and examples. Notable studies like FLAN-T5, LLaMA, and LLaMA2 have demonstrated its effectiveness without extensive fine-tuning. Recent approaches use strong language models to automatically generate high-quality instruction data, enabling cost-effective training, as seen with Stanford Alpaca's use of text-davinci-003 to instruction-tune LLaMA. Adapting vision-language models to the biomedical field remains challenging because training data is limited. This work aims to create a data-centric method that aligns instruction data with clinician expertise for improved instruction tuning.
The BioMed-VITAL framework for clinician-aligned biomedical visual instruction tuning consists of three stages: data generation, data selection, and instruction tuning. In the first stage, diverse expert-selected demonstrations are used with the GPT-4V model to create an instructional dataset. The second stage involves training a data selection model that distills clinician preferences from human annotations and model-based evaluations to filter out low-quality samples. Finally, in the instruction tuning phase, the curated dataset adapts a general multimodal model for biomedical tasks, enhancing its performance through targeted learning on clinician-relevant data.
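The selection stage can be pictured as a rank-and-filter step over the generated samples. The following is a minimal sketch under stated assumptions, not the authors' implementation: `Sample`, `select_top_fraction`, `keep_fraction`, and the toy scoring function are hypothetical names introduced here for illustration, and the real scoring would come from the trained selection model.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    """One generated instruction sample (hypothetical structure)."""
    image_id: str
    conversation: str


def select_top_fraction(
    samples: List[Sample],
    score_fn: Callable[[Sample], float],
    keep_fraction: float = 0.5,
) -> List[Sample]:
    """Rank samples by score (highest first) and keep the top fraction.

    score_fn stands in for the selection model; keep_fraction is an
    assumed knob, not a value taken from the article.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(samples, key=score_fn, reverse=True)
    keep = math.ceil(len(ranked) * keep_fraction)
    return ranked[:keep]


def toy_score(sample: Sample) -> float:
    """Placeholder score: conversation length. Illustration only."""
    return float(len(sample.conversation))


generated = [
    Sample("img-1", "Q: What modality? A: CT."),
    Sample("img-2", "Q: What is shown? A: A chest X-ray with a right lower lobe opacity."),
    Sample("img-3", "Q: ? A: ."),
    Sample("img-4", "Q: Which organ? A: The liver, seen on axial MRI."),
]
curated = select_top_fraction(generated, toy_score, keep_fraction=0.5)
```

Here `curated` holds the two highest-scoring samples, which would then feed the instruction tuning stage.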
The study generated multi-round QA instructional data from image-text pairs in the PMC-15M dataset using the GPT-4 vision API and BiomedCLIP. Instruction tuning used the llava-v1.5-13b model to improve alignment with clinician preferences. For the selection model's training data, the best mixture of human and model preferences was a ratio of 1:400, with performance peaking at a weight of 400. BioMed-VITAL outperformed the LLaVA-Med baseline in open-ended medical visual chat evaluations and achieved higher accuracy and recall on benchmarks such as VQA-RAD, SLAKE, and PathVQA, demonstrating the effectiveness of incorporating clinician preferences in data generation and selection.
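One way to picture mixing human and model preferences with a weight is a weighted pairwise ranking loss. The sketch below is an assumption for illustration only: the article does not describe the loss used, and it does not say which source the weight multiplies, so both weights are left as parameters. The function names and the logistic (Bradley-Terry style) loss are choices made here, not details from the study.

```python
import math
from typing import List, Tuple

# A pair is (score of preferred sample, score of rejected sample).
Pair = Tuple[float, float]


def pairwise_loss(score_preferred: float, score_rejected: float) -> float:
    """Logistic ranking loss: -log(sigmoid(preferred - rejected)).

    Written in a numerically stable form for large margins.
    """
    margin = score_preferred - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mixed_preference_loss(
    human_pairs: List[Pair],
    model_pairs: List[Pair],
    human_weight: float = 1.0,
    model_weight: float = 1.0,
) -> float:
    """Weighted average of pairwise losses over both annotation sources.

    The weights are hypothetical knobs; which source receives the larger
    weight is an assumption the caller must make.
    """
    total = 0.0
    norm = 0.0
    for preferred, rejected in human_pairs:
        total += human_weight * pairwise_loss(preferred, rejected)
        norm += human_weight
    for preferred, rejected in model_pairs:
        total += model_weight * pairwise_loss(preferred, rejected)
        norm += model_weight
    if norm == 0.0:
        raise ValueError("at least one pair with a positive weight is required")
    return total / norm


# Example: one human-annotated pair, one model-annotated pair.
loss = mixed_preference_loss(
    human_pairs=[(0.0, 0.0)],
    model_pairs=[(100.0, 0.0)],
    human_weight=400.0,
    model_weight=1.0,
)
```

Sweeping the weight and comparing downstream results is how a setting such as 400 would be found to work best.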
In conclusion, the study presents BioMed-VITAL, a data-centric framework for biomedical visual instruction tuning that aligns closely with clinician preferences. By integrating clinician expertise into both data generation and data selection, BioMed-VITAL creates high-quality datasets that improve the performance of visual instruction tuning models in biomedicine. The generation phase uses a variety of clinician-selected demonstrations to guide the GPT-4V generator. The selection phase then uses a dedicated model that distills clinician preferences to identify the highest-quality data. This approach leads to notable improvements in downstream tasks, including an 18.5% improvement in open visual chat and an 81.73% win rate in biomedical visual question answering.
Check out the Paper and Project Page. All credit for this research goes to the researchers of this project.
The post BioMed-VITAL: A Clinician-Aligned AI Framework for Biomedical Visual Instruction Tuning appeared first on MarkTechPost.