<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Visual Med-Alpaca</title>
<link rel="shortcut icon" href="favicon.ico">
<link rel="stylesheet" href="files/style.css">
<link rel="stylesheet" href="files/font.css">
</head>
<style type="text/css">
#myvalignContainer1O { position:relative }
#myvalignContainer1I { position:absolute; top:50%; height:10em; margin-top:-5em }
</style>
<style type="text/css">
#myvalignContainer2 { line-height:4em }
</style>
<body>
<script src="files/analytics.js" async=""></script><script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-40457306-2', 'mit.edu');
ga('send', 'pageview');
</script>
<!-- Title -->
<div class="container">
<!-- <span class="title">Task-Oriented Flow Utilization</span> -->
<!-- <span class="venue">Conference name</span> -->
<center><img src="files/ltl_logo.png" width="1000"></center>
<br><br>
<span class="title">Visual Med-Alpaca: Bridging Modalities in Biomedical Language Models</span>
<table align="center" border="0" width="1000" class="authors">
<tbody><tr>
<td class="author"> <a href="https://ciaranshu.github.io">Chang Shu</a><sup>1*</sup></td>
<td class="author"> Baian Chen<sup>2*</sup></td>
<td class="author"> <a href="http://fangyuliu.mezihao">Fangyu Liu</a><sup>1</sup></td>
<td class="author"> <a href="https://fuzihaofzh.github.io">Zihao Fu</a><sup>1</sup></td>
<td class="author"> <a href="https://eehsan.github.io">Ehsan Shareghi </a><sup>3</sup></td>
<td class="author"> <a href="https://sites.google.com/site/nhcollier/home/">Nigel Collier</a><sup>1</sup></td>
</tr></tbody>
</table>
<table align="center" border="0" width="1000" class="affiliations">
<tbody>
<tr>
<td class="affliation" align="center">
<sup>1</sup><a href="https://ltl.mmll.cam.ac.uk">University of Cambridge</a>
&emsp;&emsp;&emsp;&emsp;
<sup>2</sup>Ruiping Health
&emsp;&emsp;&emsp;&emsp;
<sup>3</sup><a href="https://www.monash.edu/it/dsai">Monash University</a>
</td>
</tr>
</tbody>
</table>
<br>
<table align="center"><tbody><tr>
</tr>
<tr><td>
<table border="0">
</tbody>
<tr><td class="caption">Visual Med-Alpaca is an open-source, multi-modal foundation model designed specifically for the biomedical domain, built on the <a href="https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/">LLaMa-7B</a>. With a few hours of instruct-tuning and plug-and-play visual modules, it can perform a range of tasks from reading radiological images and answering complex clinical questions, while being easily deployable and replicable with a single gaming GPU. </td></tr>
</tbody></table>
<br>
<!-- Result -->
<div class="section">
<span class="section-title"> Demo</span>
</br></br>
<table align="center"><tbody>
<tr><td><center>
<img src="files/demo.gif" width="900" >
</td></tr>
<tr><td><center>
</br></br>
Please fill out <a href="https://forms.gle/X4A8sib7qpU499dY8"><u>this form</u></a> to access the online demo. <b>Warning: the demo is for academic use only; do not apply it to real clinical scenarios!</b>
</center></td></tr>
</table>
</div>
<!-- Abstract -->
<div class="section">
<span class="section-title">Overview </span>
<p>
Domain-specific foundation models are extremely useful in the biomedical domain because biomedical text is highly specialized and contains many domain-specific terms and concepts that are absent from general-domain corpora such as Wikipedia and Books. Pre-training on large volumes of biomedical text has been shown to improve the performance of language models on a range of biomedical text-mining tasks, as demonstrated by existing publicly available biomedical PLMs.
However, to the best of our knowledge, there is no existing open-source multimodal foundation model tailored to the biomedical domain.
Therefore, we develop Visual Med-Alpaca, an open-source biomedical foundation model built on LLaMA-7B that combines instruct-tuning on curated medical question-answer pairs with plug-and-play visual modules for interpreting medical images.
</p>
<!-- <p class="bibtex">
@article{xue2019video,
title={Video Enhancement with Task-Oriented Flow},
author={Xue, Tianfan and Chen, Baian and Wu, Jiajun and Wei, Donglai and Freeman, William T},
journal={International Journal of Computer Vision (IJCV)},
volume={127},
number={8},
pages={1106--1125},
year={2019},
publisher={Springer}
}
</p> -->
<br>
<b>Resources:</b></br>
<p>
We apologize for the inconvenience, but this project is currently undergoing internal ethical screening at the University of Cambridge. We anticipate releasing the following assets within the next 1-2 weeks. You are more than welcome to <a href="https://forms.gle/X4A8sib7qpU499dY8"><u>join our waitlist</u></a>, and we will notify you as soon as they become available.
</p>
<ul>
<li> Data: <a href="https://forms.gle/X4A8sib7qpU499dY8">GitHub</a>
</li>
<li> Data Generation: <a href="https://forms.gle/X4A8sib7qpU499dY8">GitHub</a>
</li>
<li> Visual Adaptation: <a href="https://forms.gle/X4A8sib7qpU499dY8">GitHub</a>
</li>
<li> Training Code: <a href="https://forms.gle/X4A8sib7qpU499dY8">GitHub</a>
</li>
<li> Demo: <a href="https://forms.gle/X4A8sib7qpU499dY8">Hugging Face Space</a>
</li>
</ul>
</div>
<div class="section">
<span class="section-title"> Model Architecture and Training Recipe </span>
</br></br>
Overview of the model architecture and training procedure.
</div>
<div class="section">
<span class="section-title"> Domain Adaptation: Self-Instruct in Biomedical Domain</span>
</br></br>
We collect inquiries from various medical question-and-answer datasets (<a href='https://huggingface.co/datasets/bigbio/mediqa_rqe'>MEDIQA RQE</a>, <a href='https://huggingface.co/datasets/bigbio/med_qa'>MedQA</a>, <a href='https://huggingface.co/datasets/bigbio/meddialog'>MedDialog</a>, <a href='https://huggingface.co/datasets/bigbio/mediqa_qa'>MEDIQA QA</a>, <a href='https://huggingface.co/datasets/bigbio/pubmed_qa'>PubMedQA</a>). This increases the diversity and coverage of the dataset and improves the accuracy and comprehensiveness of the resulting model. </br></br>
We synthesize answers to these questions with gpt-3.5-turbo, whose natural language generation capabilities make it a reliable tool for producing structured and informative answers to a wide range of medical questions.</br></br>
We then manually filter and edit the question-answer pairs, selecting a total of 54k turns with balance and diversity in mind.</br></br>
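<p>
As a rough illustration of the answer-synthesis step, the sketch below queries gpt-3.5-turbo for each collected question using the OpenAI chat completion API; the system prompt, file paths, and sampling settings are illustrative placeholders rather than the exact ones used in our pipeline.
</p>
<pre>
# Hedged sketch: synthesize answers for collected medical questions with gpt-3.5-turbo.
# The system prompt and file paths are placeholders, not the exact pipeline settings.
import json
import openai  # openai 0.x-style API

SYSTEM_PROMPT = "You are a medical expert. Answer the question accurately and concisely."

def synthesize_answer(question):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        temperature=0.7,
    )
    return response["choices"][0]["message"]["content"]

# Questions collected from MEDIQA RQE, MedQA, MedDialog, MEDIQA QA and PubMedQA.
with open("collected_questions.json") as f:       # placeholder path
    questions = json.load(f)

pairs = [{"instruction": q, "output": synthesize_answer(q)} for q in questions]

# Manual filtering and editing of the pairs happens after this step.
with open("synthetic_qa_pairs.json", "w") as f:   # placeholder path
    json.dump(pairs, f, indent=2)
</pre>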
</div>
<div class="section">
<span class="section-title"> Visual Adaptation: Medical Image Captioning and Deplot</span>
</br></br>
Visual input is a critical element of the medical domain, contributing essential information in healthcare settings. Healthcare practitioners rely heavily on visual cues to diagnose, monitor, and treat patients. Medical imaging technologies such as X-ray, CT, and MRI provide an unparalleled means of examining internal organs and identifying diseases and abnormalities that may not be visible to the naked eye. </br></br>
Our study builds on our previous work on visual language reasoning over charts and plots, <a href="https://huggingface.co/docs/transformers/main/model_doc/deplot">DEPLOT</a>: One-shot visual language reasoning by plot-to-table translation. Here we extend the approach with a visual foundation model that accepts radiology images as inputs. </br></br>
Within this framework, the task of visual language reasoning is divided into two key phases: (1) translating the image into text, followed by (2) reasoning over the derived text.</br></br>
Visual foundation models first convert medical images into an intermediate textual state; this text is then used to prompt a pre-trained large language model (LLM), relying on the few-shot reasoning abilities inherent in LLMs.</br></br>
At present, our platform supports two visual foundation models, <a href="https://huggingface.co/docs/transformers/main/model_doc/deplot">DEPLOT</a> and Med-GIT, reflecting the prevalence of plots and radiology imagery within the medical field. The system's architecture is also designed to facilitate the seamless integration of alternative medical visual foundation models.</br></br>
The Med-GIT model is a <a href="https://github.com/microsoft/GenerativeImage2Text">GIT</a> (Generative Image-to-text Transformer for Vision and Language) fine-tuned on the <a href="https://github.com/razorx89/roco-dataset">ROCO</a> dataset for specialized radiology image captioning. The training procedure is described in detail in our publicly accessible GitHub repository.</br></br>
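<p>
To make the two-phase pipeline concrete, a minimal sketch using the Hugging Face transformers library is shown below. The Med-GIT and Visual Med-Alpaca checkpoint paths and the prompt template are hypothetical placeholders, not the released assets; only the DEPLOT checkpoint name follows the public model hub.
</p>
<pre>
# Hedged sketch: (1) translate a medical image into intermediate text with a visual
# foundation model, then (2) prompt the instruct-tuned LLM to reason over that text.
from PIL import Image
from transformers import (AutoModelForCausalLM, AutoProcessor,
                          Pix2StructForConditionalGeneration, Pix2StructProcessor,
                          pipeline)

def image_to_text(image, image_type):
    if image_type == "plot":
        # DEPLOT: one-shot plot-to-table translation.
        processor = Pix2StructProcessor.from_pretrained("google/deplot")
        model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")
        inputs = processor(images=image,
                           text="Generate underlying data table of the figure below:",
                           return_tensors="pt")
        ids = model.generate(**inputs, max_new_tokens=512)
        return processor.decode(ids[0], skip_special_tokens=True)
    # Med-GIT: a GIT-style radiology captioner; the checkpoint path is a placeholder.
    processor = AutoProcessor.from_pretrained("path/to/med-git-roco")
    model = AutoModelForCausalLM.from_pretrained("path/to/med-git-roco")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    ids = model.generate(pixel_values=pixel_values, max_length=64)
    return processor.batch_decode(ids, skip_special_tokens=True)[0]

def answer(image_path, question, image_type="radiology"):
    # Phase 1: image to intermediate text.
    context = image_to_text(Image.open(image_path), image_type)
    # Phase 2: reason over the text with the instruct-tuned model (placeholder path).
    llm = pipeline("text-generation", model="path/to/visual-med-alpaca")
    prompt = ("Image context: " + context + "\n"
              "Instruction: " + question + "\nResponse:")
    return llm(prompt, max_new_tokens=256)[0]["generated_text"]
</pre>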
</div>
<!-- div class="section">
<span class="section-title">Results </span></br>
<p class="subsection">Interpolation</p -->
<!-- Result start -->
<!-- Result end -->
<!-- p class="subsection">Visualization</p -->
<!-- Visualization start -->
<!-- Visualization end -->
<!-- /div -->
<div class="section">
<span class="section-title"> Implementation Details </span>
</br></br>
Hyper-parameter
Training time
</div>
<div class="section">
<span class="section-title"> Comparison with Other Methods </span>
</br></br>
Compare with ChatGPT / Alpaca / Galactica
</div>
<div class="section">
<span class="section-title"> Future Work </span>
</br></br>
One of the most crucial directions for future work is the systematic evaluation of Visual Med-Alpaca, as well as other NLP models, within the biomedical field. Given the varying structure and types of medical data, it is essential to assess the efficacy of NLP models and their generalizability across different datasets. </br></br>
We also expect that pretraining on medical data can enhance the performance of NLP models in the biomedical field. It should help with the identification of and reasoning about disease phenotypes and drug mechanisms, and with the representation of clinical concepts.</br></br>
Adding genomic and protein modalities may also help NLP models achieve better reasoning. Given that genetic and protein information is critical for understanding disease processes, NLP can aid in the analysis of large volumes of genomic data, making it possible to identify novel mutations involved in various disease processes. Incorporating genomic information into NLP models will therefore enable a wider range of applications within the biomedical field.</br></br>
</div>
<div class="section">
<span class="section-title"> Limitations </span>
</br></br>
Visual Med-Alpaca is intended for academic research purposes only. Any commercial or clinical use of the model is strictly prohibited. This decision is based on the non-commercial license inherited from LLaMA, on which the model is built. Additionally, Visual Med-Alpaca is not legally approved for medical use in any country. Users should be aware of the model's limitations in terms of medical knowledge and the possibility of misinformation. Therefore, any reliance on Visual Med-Alpaca for medical decision-making is at the user's own risk.
</br></br>
<b>Note: The developers and owners of the model, the Language Technology Lab at Cambridge University, do not assume any liability for the accuracy or completeness of the information provided by Visual Med-Alpaca, nor will they be responsible for any potential harm caused by the misuse of the model.</b>
</br></br>
</div>
<div class="section">
<span class="section-title"> Acknowledgement </span>
</br></br>
We are deeply grateful for the contributions made by open-source projects:
<a href="https://github.com/facebookresearch/llama">LLaMA</a>,
<a href="https://github.com/tatsu-lab/stanford_alpaca">Stanford Alpaca</a>,
<a href="https://github.com/tloen/alpaca-lora">Alpaca-LoRA</a>,
<a href="https://huggingface.co/docs/transformers/main/model_doc/deplot">Deplot</a>,
<a href="https://huggingface.co/bigbio">BigBio</a>,
<a href="https://github.com/razorx89/roco-dataset">ROCO</a>,
<a href="https://github.com/microsoft/visual-chatgpt">Visual-ChatGPT</a>,
<a href="https://github.com/microsoft/GenerativeImage2Text">GenerativeImage2Text</a>.
</br></br>
</div>
<p>&nbsp;</p>
<!-- end .container --></div>
</body></html>