## Abstract
Introducing [**Visual Med-Alpaca**](https://github.com/cambridgeltl/visual-med-alpaca), an open-source, parameter-efficient biomedical foundation model that can be integrated with medical "visual experts" for multimodal biomedical tasks. Built upon the [LLaMa-7B](https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/) architecture ([Touvron et al., 2023](https://arxiv.org/abs/2302.13971)), the model is trained on an instruction set curated collaboratively by GPT-3.5-Turbo and human experts. With only a few hours of instruction-tuning and plug-and-play visual modules, Visual Med-Alpaca can perform a diverse range of tasks, from interpreting radiological images to addressing complex clinical inquiries. The model can also be replicated with ease, requiring only a single consumer GPU.
## Demo
Please fill out [this form](https://forms.gle/X4A8sib7qpU499dY8) to access the online demo.
Domain-specific foundation models play a critical role in the biomedical field, as the language used in biomedical texts is highly specialized, often encompassing domain-specific concepts and relationships not found in general domain text corpora such as Wikipedia and Books. Empirical evidence demonstrates that pretraining on substantial amounts of biomedical text significantly improves language models' performance on various biomedical text mining tasks, as compared to existing publicly available pretrained language models (PLMs) ([Lee et al., 2019](https://arxiv.org/abs/1901.08746); [Gururangan et al., 2020](https://arxiv.org/abs/2004.10964), [Gu et al., 2021](https://arxiv.org/pdf/2007.15779.pdf)).
Modern large language models (LLMs) necessitate an unprecedented level of computational resources for full-model fine-tuning. The cost of fine-tuning even a 7-billion-parameter LLM exclusively on PubMed is prohibitively expensive for the majority of academic institutions. Pretraining models on extensive medical image datasets to attain multimodal capabilities incurs even higher costs. Consequently, researchers are exploring more cost-effective techniques such as adapters, instruction tuning, and prompt augmentation to develop models that can be trained and deployed on consumer-level graphics cards while maintaining adequate performance. In the context of bridging text and vision for multimodal applications, training can be similarly expensive ([Alayrac et al., 2022](https://arxiv.org/abs/2204.14198)). Moreover, to the best of our knowledge, there is no publicly available multimodal generative foundation model specifically designed for biomedical applications.
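As a concrete illustration of the adapter-style route, the sketch below wraps a 7B LLaMA-style checkpoint with LoRA adapters using the Hugging Face `transformers` and `peft` libraries. The checkpoint path and hyperparameters are illustrative assumptions, not the exact configuration used for Med-Alpaca.

```python
# Minimal LoRA sketch: only a small set of adapter weights is trained, keeping
# fine-tuning within the memory budget of a single consumer GPU.
# The checkpoint path and hyperparameters below are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "path/to/llama-7b-hf"  # placeholder for a local LLaMA-7B checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.float16, device_map="auto"
)

lora_config = LoraConfig(
    r=8,                                  # low-rank adapter dimension
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the 7B parameters
```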
In response to these challenges, we introduce [**Visual Med-Alpaca**](https://github.com/cambridgeltl/visual-med-alpaca), an open-source, parameter-efficient biomedical foundation model that features a plug-and-play visual extension framework. To develop the Visual Med-Alpaca model, we first create a biomedical instruction set by extracting medical questions from various medical datasets within the [BigBIO](https://github.com/bigscience-workshop/biomedical) repository ([Fries et al., 2022](https://arxiv.org/abs/2206.15076)). We then prompt GPT-3.5-Turbo to synthesize answers for these questions. Multiple rounds of human filtering and editing are performed to refine the question-answer pairs, resulting in a high-quality instruction set comprising 54k data points. Next, we expand Med-Alpaca into Visual Med-Alpaca by connecting the textual model with "visual medical experts," which are specialized medical computer vision models. For instance, in radiology-domain applications, we train an in-house radiology image captioning model called Med-GIT (see below for details). Given an input image, a classifier determines which medical visual expert, if any, is responsible for it. The designated expert then converts the image into a text prompt, and the prompt manager merges the converted visual information with the textual query, enabling Med-Alpaca to generate an appropriate response.
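The routing logic described above can be summarized in a short sketch. All of the names below (`classify_image`-style classifier, the expert callables, `prompt_manager`, and the `med_alpaca` generator) are hypothetical and used purely for illustration; the actual implementation lives in the project repository.

```python
# Hypothetical end-to-end sketch of the Visual Med-Alpaca pipeline:
# 1) a classifier decides which visual expert (if any) should handle the image,
# 2) the chosen expert converts the image into text,
# 3) the prompt manager merges that text with the user's question,
# 4) the instruction-tuned Med-Alpaca model generates the answer.
def answer_query(image, question, experts, classifier, prompt_manager, med_alpaca):
    expert_name = classifier(image) if image is not None else None

    visual_context = ""
    if expert_name is not None:
        # e.g. a radiology captioner for X-rays, a plot-to-table model for charts
        visual_context = experts[expert_name](image)

    prompt = prompt_manager(question=question, visual_context=visual_context)
    return med_alpaca.generate(prompt)
```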
**Ongoing work.** A paramount objective for the future is to thoroughly assess the medical proficiency and potential shortcomings of Visual Med-Alpaca, encompassing issues such as misleading medical advice and incorrect medical information. Moving beyond traditional benchmarking and manual evaluation methods, we aim to focus on different user groups, including doctors and patients, and evaluate all facets of the model through a user-centered approach. This comprehensive assessment will enable us to ensure the reliability and effectiveness of Visual Med-Alpaca in addressing various biomedical tasks and catering to the diverse needs of its users.
We collect inquiries from various medical question-and-answer datasets ([MEDIQA RQE](https://huggingface.co/datasets/bigbio/mediqa_rqe), [MedQA](https://huggingface.co/datasets/bigbio/med_qa), [MedDialog](https://huggingface.co/datasets/bigbio/meddialog), [MEDIQA QA](https://huggingface.co/datasets/bigbio/mediqa_qa), [PubMedQA](https://huggingface.co/datasets/bigbio/pubmed_qa)). This approach increases the diversity and thoroughness of the dataset and improves the accuracy and comprehensiveness of the results.
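A minimal sketch of how questions might be pulled from one of the BigBIO datasets with the Hugging Face `datasets` library is shown below; the configuration, split, and field names are assumptions and vary across BigBIO datasets.

```python
# Sketch: collect question strings from a BigBIO dataset hosted on the
# Hugging Face Hub. Config, split, and field names differ between datasets,
# so this only illustrates the extraction step, not the exact code used.
from datasets import load_dataset

dataset = load_dataset("bigbio/med_qa", split="train")  # assumed config/split

questions = []
for example in dataset:
    question = example.get("question")  # field name is dataset-specific
    if question:
        questions.append(question.strip())

print(f"Collected {len(questions)} candidate questions")
```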
We synthesize answers to these questions with GPT-3.5-Turbo in the [self-instruct](https://github.com/yizhongw/self-instruct) fashion. The GPT-3.5-Turbo model is equipped with advanced natural language processing capabilities that enable it to understand and generate human-like responses to a wide range of questions, making it a reliable tool for producing structured and informative answers.
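The answer-synthesis step looks roughly like the sketch below, assuming the `openai` Python package (pre-1.0 `ChatCompletion` interface). The system prompt is an illustrative placeholder, not the exact prompt used for the dataset.

```python
# Sketch: synthesize an answer for a collected question with GPT-3.5-Turbo in a
# self-instruct style. Uses the openai<1.0 ChatCompletion interface; the system
# prompt is a placeholder, not the authors' exact prompt.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def synthesize_answer(question: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You are a knowledgeable medical assistant. "
                        "Answer the question accurately and concisely."},
            {"role": "user", "content": question},
        ],
        temperature=0.7,
    )
    return response["choices"][0]["message"]["content"]

example_question = "What are the common symptoms of iron-deficiency anemia?"
print(synthesize_answer(example_question))
```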
Filtering and editing of the question-answer pairs is performed manually; a total of 54,000 turns are carefully selected with balance and diversity in mind.
We propose linking visual experts with Med-Alpaca, as foundation model chaining presents a modular and highly adaptable framework for incorporating a diverse array of visual modules. Within this framework, any multimodal task can be divided into two essential stages: (1) the conversion of images to text, and (2) cognitive reasoning based on the derived text. In our context, visual experts (i.e., visual foundation models) transform medical images into an intermediate text representation. This converted data is then used to prompt a pretrained LLM, leveraging the inherent few-shot reasoning capabilities of LLMs to generate appropriate responses.
Currently, our platform supports two distinct visual foundation models: Med-GIT and [DePlot](https://huggingface.co/docs/transformers/main/model_doc/deplot), chosen due to the widespread presence of radiology images and plots within the medical domain. The system's architecture is also designed to enable seamless integration of alternative medical visual foundation models, and we plan to incorporate additional visual experts in the near future.
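For the plot expert, DePlot can be run off the shelf through the `transformers` library. The sketch below follows the standard plot-to-table usage from the model documentation, with the image path as a placeholder.

```python
# Sketch: use DePlot as a "visual expert" that converts a plot image into a
# linearized data table, which can then be passed to the language model as text.
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

processor = Pix2StructProcessor.from_pretrained("google/deplot")
model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")

image = Image.open("example_plot.png")  # placeholder path to a chart image
inputs = processor(
    images=image,
    text="Generate underlying data table of the figure below:",
    return_tensors="pt",
)
outputs = model.generate(**inputs, max_new_tokens=512)
table_text = processor.decode(outputs[0], skip_special_tokens=True)
print(table_text)  # intermediate text representation handed to the LLM
```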
The Med-GIT model is a [GIT](https://github.com/microsoft/GenerativeImage2Text) (Generative Image-to-text Transformer for Vision and Language) fine-tuned on the [ROCO](https://github.com/razorx89/roco-dataset) dataset for specialized radiology image captioning. The training procedure is described in detail in our [publicly accessible GitHub repository](https://github.com/cambridgeltl/visual-med-alpaca/tree/main/code).
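Captioning a radiology image with a GIT-style checkpoint looks roughly as follows; the checkpoint path stands in for a locally fine-tuned Med-GIT model and is not a published Hub identifier, while the surrounding calls mirror the standard GIT captioning recipe in `transformers`.

```python
# Sketch: radiology image captioning with a GIT-style model. The checkpoint
# path is a placeholder for a locally fine-tuned Med-GIT model; the usage
# follows the standard GIT image-captioning recipe in transformers.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/med-git-checkpoint"  # hypothetical local path
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

image = Image.open("chest_xray.png")  # placeholder radiology image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

generated_ids = model.generate(pixel_values=pixel_values, max_length=64)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(caption)  # short radiology-style description used as the text prompt
```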
## Case Study
## Future Work
One of the most crucial lines of ongoing work is the systematic evaluation of Visual Med-Alpaca, as well as of other NLP models within the biomedical field. Given the varying structure and types of medical data, it is essential to assess the efficacy of NLP models and their generalizability across different datasets.
We also expect that pretraining on medical data can enhance the performance of NLP models in the biomedical field. It should help with the identification of and reasoning about disease phenotypes and drug mechanisms, as well as the representation of clinical concepts.
Adding genome and protein modalities may also help LLMs reason more effectively. Given that genetic and protein information is critical for understanding disease processes, LLMs can aid in the analysis of large volumes of genomic data, making it possible to identify novel mutations involved in various disease processes. Incorporating genomic information into LLMs will therefore enable a wider range of applications within the biomedical field.
## Implementation Details