Visual Med-Alpaca: Bridging Modalities in Biomedical Language Models
Chang Shu1* Baian Chen2* Fangyu Liu1 Zihao Fu1 Ehsan Shareghi3 Nigel Collier1
1University of Cambridge      2Ruiping Health      3Monash University

Visual Med-Alpaca is an open-source, multi-modal foundation model designed specifically for the biomedical domain, built on LLaMA-7B. With a few hours of instruct-tuning and plug-and-play visual modules, it can perform a range of tasks, from reading radiological images to answering complex clinical questions, while being easy to deploy and replicate with a single gaming GPU.

Demo



Please fill out this form to access the online demo. Warning: the demo is for academic use only and must not be applied to real clinical scenarios!
Overview

Domain-specific foundation models are extremely useful in the biomedical domain, as biomedical text is highly specialized and contains many domain-specific terms and concepts that are not present in general-domain corpora such as Wikipedia and Books. Pre-training on large volumes of biomedical text has been shown to improve the performance of language models on several biomedical text-mining tasks when compared to existing publicly available biomedical PLMs. However, to the best of our knowledge, there is no existing multi-modal foundation model designed for the biomedical domain. To bridge this gap, we develop Visual Med-Alpaca.


Resources:

    We apologize for the inconvenience, but this project is currently undergoing internal ethical screening at Cambridge University. We anticipate releasing the following assets within the next 1-2 weeks. You are more than welcome to Join Our Waitlist, and we'll notify you as soon as they become available.

  • Data: Github
  • Data Generation: Github
  • Visual Adaptation: Github
  • Training Code: Github
  • Demo: Huggingface Space
Model Architecture and Training Recipe

Overview of the model architecture and training procedure.
Domain Adaptation: Self-Instruct in the Biomedical Domain

We collect inquiries from several medical question-answering datasets (MEDIQA RQE, MedQA, MedDialog, MEDIQA QA, PubMedQA) to increase the diversity and coverage of the instruction data and to improve the accuracy and comprehensiveness of the resulting answers.

We synthesize answers to these questions with gpt-3.5-turbo. Its advanced natural language processing capabilities enable it to understand and generate human-like responses to a wide range of questions, making it a reliable tool for producing structured and informative answers.

The question-answer pairs were then filtered and edited manually. A total of 54k turns were carefully selected, taking balance and diversity into account.
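As a concrete illustration, the sketch below shows how such an answer-synthesis step could look with the openai Python client (v1 API). The system prompt, file names, and sampling settings are illustrative assumptions, not the exact recipe used for Visual Med-Alpaca.

```python
# Hypothetical sketch of synthesizing draft answers with gpt-3.5-turbo.
# Assumes the openai Python package (>= 1.0) and OPENAI_API_KEY in the
# environment; prompt and file names are placeholders, not the project's own.
import json
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a medical expert. Answer the question accurately, concisely, "
    "and in a well-structured way."
)

def synthesize_answer(question: str) -> str:
    """Ask gpt-3.5-turbo to draft an answer for one collected inquiry."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content

# Collected inquiries (MEDIQA RQE, MedQA, MedDialog, MEDIQA QA, PubMedQA)
# become draft question-answer pairs, which are then filtered and edited
# manually before the final turns are selected.
with open("medical_questions.jsonl") as f_in, open("qa_pairs.jsonl", "w") as f_out:
    for line in f_in:
        question = json.loads(line)["question"]
        pair = {"instruction": question, "output": synthesize_answer(question)}
        f_out.write(json.dumps(pair) + "\n")
```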

Visual Adaptation: Medical Image Captioning and DEPLOT

Visual input is a critical element of the medical domain, contributing essential information in healthcare settings. Healthcare practitioners rely heavily on visual cues to diagnose, monitor, and treat patients. Medical imaging technologies such as X-ray, CT, and MRI provide an unparalleled means of examining internal organs and identifying diseases and abnormalities that may not be visible to the naked eye.

Our study builds on our previous work on visual language reasoning over charts and plots, showcased in DEPLOT: One-shot visual language reasoning by plot-to-table translation. Here, we extend the approach by incorporating a visual foundation model capable of accepting radiology images as inputs.

Within this framework, the task of visual language reasoning is decomposed into two key phases: (1) translating the image to text, followed by (2) reasoning over the text thereby derived.

Visual foundation models first convert medical images into an intermediate textual state. The converted text is then used to prompt a pre-trained large language model (LLM), relying on the few-shot reasoning abilities inherent in LLMs.
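To make the two phases concrete, here is a minimal sketch of the pipeline using the Hugging Face transformers library. The base GIT checkpoint stands in for Med-GIT, and the instruction-tuned LLM checkpoint is a placeholder; both names are assumptions rather than the released Visual Med-Alpaca weights.

```python
# Minimal sketch of the two-phase pipeline: (1) image -> intermediate text via
# a captioning model, (2) prompt an instruction-tuned LLM with that text plus
# the user's question. Checkpoint names are placeholders, not released weights.
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM, pipeline

# Phase 1: translate the medical image into text (Med-GIT in the real system;
# the generic "microsoft/git-base" checkpoint is used here as a stand-in).
caption_processor = AutoProcessor.from_pretrained("microsoft/git-base")
caption_model = AutoModelForCausalLM.from_pretrained("microsoft/git-base")

def image_to_text(image_path: str) -> str:
    image = Image.open(image_path).convert("RGB")
    pixel_values = caption_processor(images=image, return_tensors="pt").pixel_values
    generated_ids = caption_model.generate(pixel_values=pixel_values, max_length=64)
    return caption_processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Phase 2: reason over the derived text with an instruction-tuned LLM.
LLM_CHECKPOINT = "path/to/instruction-tuned-llama-7b"  # placeholder path
generator = pipeline("text-generation", model=LLM_CHECKPOINT)

def answer(image_path: str, question: str) -> str:
    caption = image_to_text(image_path)
    prompt = (
        "Below is a description of a medical image, followed by a question.\n"
        f"Image description: {caption}\n"
        f"Question: {question}\n"
        "Answer:"
    )
    return generator(prompt, max_new_tokens=256, do_sample=False)[0]["generated_text"]
```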

At present, the platform supports two visual foundation models, DEPLOT and Med-GIT, reflecting the prevalence of plots and radiology imagery in the medical field. The system's architecture is also designed to facilitate the seamless integration of alternative medical visual foundation models.
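One way to realize this plug-and-play design is a small registry that maps an image type to its image-to-text module, leaving the LLM prompting stage unchanged regardless of which module produced the text. This is only a design sketch; the function names and stubs below are hypothetical, not the project's actual interfaces.

```python
# Hypothetical registry for pluggable visual foundation models: each module
# turns an image into intermediate text, and new modules can be registered
# without touching the LLM prompting stage.
from typing import Callable, Dict
from PIL import Image

VISUAL_MODULES: Dict[str, Callable[[Image.Image], str]] = {}

def register_module(name: str):
    """Decorator that registers a new image-to-text module under a name."""
    def wrapper(fn: Callable[[Image.Image], str]):
        VISUAL_MODULES[name] = fn
        return fn
    return wrapper

@register_module("deplot")
def plot_to_table(image: Image.Image) -> str:
    # Would call DEPLOT to linearize a chart or plot into a textual table.
    raise NotImplementedError("plug in DEPLOT inference here")

@register_module("med-git")
def radiology_caption(image: Image.Image) -> str:
    # Would call Med-GIT to caption a radiology image.
    raise NotImplementedError("plug in Med-GIT inference here")

def image_to_text(image: Image.Image, module: str = "med-git") -> str:
    """Dispatch to the selected visual module; alternate models slot in here."""
    return VISUAL_MODULES[module](image)
```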

Med-GIT is a GIT (Generative Image-to-text Transformer for Vision and Language) model fine-tuned on the ROCO dataset for specialized radiology image captioning. The training procedure is documented in detail in our publicly accessible Github repository.
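For reference, a fine-tuning sketch along these lines is shown below: a GIT checkpoint trained on ROCO-style (image, caption) pairs with the transformers Trainer. The data loading, hyper-parameters, and paths are assumptions; the authoritative recipe is the one documented in the repository.

```python
# Illustrative sketch of fine-tuning GIT for radiology captioning on ROCO-style
# (image, caption) pairs. Hyper-parameters and data loading are placeholders;
# the project's actual recipe is documented in its Github repository.
from torch.utils.data import Dataset
from transformers import (AutoModelForCausalLM, AutoProcessor, Trainer,
                          TrainingArguments)

processor = AutoProcessor.from_pretrained("microsoft/git-base")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base")

class RocoCaptionDataset(Dataset):
    """Wraps a list of (PIL image, caption string) pairs from ROCO."""
    def __init__(self, pairs):
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        image, caption = self.pairs[idx]
        enc = processor(images=image, text=caption, padding="max_length",
                        truncation=True, return_tensors="pt")
        enc = {k: v.squeeze(0) for k, v in enc.items()}
        enc["labels"] = enc["input_ids"].clone()  # caption tokens are the targets
        return enc

train_pairs = ...  # load ROCO radiology images and their captions here

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="med-git", per_device_train_batch_size=8,
                           num_train_epochs=3, learning_rate=5e-5, fp16=True),
    train_dataset=RocoCaptionDataset(train_pairs),
)
trainer.train()
```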

Implementation Details

Hyper-parameters and training time: details to follow with the code release.
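Until those details are released, the sketch below shows what an instruct-tuning run in the spirit of Alpaca-LoRA (acknowledged below) could look like: LLaMA-7B with LoRA adapters trained on the 54k self-instruct turns. All checkpoints, paths, and hyper-parameters shown are placeholders, not the project's reported values.

```python
# Hypothetical instruct-tuning sketch: LoRA adapters on LLaMA-7B over the 54k
# self-instruct turns, in the spirit of Alpaca-LoRA. Every path and
# hyper-parameter here is a placeholder, not a reported setting.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "path/to/llama-7b-hf"  # placeholder LLaMA-7B checkpoint
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE)

# Wrap the base model with LoRA adapters so only a small set of weights trains.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

def format_example(example):
    """Render one self-instruct turn in an Alpaca-style prompt format."""
    text = (f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['output']}")
    return tokenizer(text, truncation=True, max_length=512)

data = load_dataset("json", data_files="visual_med_alpaca_54k.json")["train"]
data = data.map(format_example, remove_columns=data.column_names)

Trainer(
    model=model,
    train_dataset=data,
    args=TrainingArguments(output_dir="med-alpaca-lora", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=3e-4,
                           fp16=True),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```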
Comparison with Other Methods

Comparison with ChatGPT, Alpaca, and Galactica.
Future Work

One of the most crucial directions for future work is the systematic evaluation of Visual Med-Alpaca, as well as other NLP models, within the biomedical field. Given the varying structures and types of medical data, it is essential to assess the efficacy of NLP models and their generalizability across different datasets.

We also expect that pretraining on medical data can enhance the performance of NLP models in the biomedical field. It should help with the identification of, and reasoning about, disease phenotypes and drug mechanisms, as well as the representation of clinical concepts.

Adding genomic and protein modalities may also lead to better reasoning in NLP models. Given that genetic and protein information is critical for understanding disease processes, NLP can aid in the analysis of large volumes of genomic data, making it possible to identify novel mutations involved in various disease processes. Incorporating genomic information into NLP models will therefore enable a wider range of applications within the biomedical field.

Limitations

Visual Med-Alpaca is intended for academic research purposes only. Any commercial or clinical use of the model is strictly prohibited. This decision is based on the non-commercial license inherited from LLaMA, on which the model is built. Additionally, Visual Med-Alpaca is not legally approved for medical use in any country. Users should be aware of the model's limitations in terms of medical knowledge and the possibility of misinformation. Therefore, any reliance on Visual Med-Alpaca for medical decision-making is at the user's own risk.

Note: The developers and owners of the model, the Language Technology Lab at Cambridge University, do not assume any liability for the accuracy or completeness of the information provided by Visual Med-Alpaca, nor will they be responsible for any potential harm caused by the misuse of the model.

Acknowledgement

We are deeply grateful for the contributions made by open-source projects: LLaMA, Stanford Alpaca, Alpaca-LoRA, Deplot, BigBio, ROCO, Visual-ChatGPT, GenerativeImage2Text.