diff --git a/docs/files/demo.gif b/docs/files/demo.gif new file mode 100644 index 0000000..de252c3 Binary files /dev/null and b/docs/files/demo.gif differ diff --git a/docs/index.html b/docs/index.html index eb0e449..c2ff3a0 100644 --- a/docs/index.html +++ b/docs/index.html @@ -77,14 +77,15 @@
- Demo (insert GIF here) (Baian) + Demo

- +
-Please register for Hugging Face and fill out this form [link] to access the online demo of Visual Med-Alpaca. Warning: Only for academic usage and do not apply it to real clinical scenarios! +Please register for Hugging Face and fill out this form [link] to access the online demo of Visual Med-Alpaca. +Warning: For academic use only; do not apply it to real clinical scenarios!
@@ -158,167 +159,32 @@ Overview of the model architecture and training procedure.
- Domain Adaptation: Self-Instruct in Biomedical Domain (Baian) + Domain Adaptation: Self-Instruct in Biomedical Domain

-How to generate the instruct-tuning set +We collect inquiries from several medical question-and-answer datasets (MEDIQA RQE, MedQA, MedDialog, MEDIQA QA, PubMedQA) to broaden the diversity and coverage of the instruction-tuning set and to improve the accuracy and comprehensiveness of the resulting model.

+ +We synthesize answers to these questions with gpt-3.5-turbo, whose natural language processing capabilities enable it to generate human-like responses to a wide range of questions, making it a reliable tool for producing structured and informative answers.

+ +We then manually filter and edit the question-answer pairs, selecting a total of 54k turns with balance and diversity in mind.
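The synthesis and pre-filtering steps above can be sketched roughly as follows. The helper names (`build_messages`, `keep_pair`) and the filtering heuristic are illustrative assumptions, not the project's actual pipeline; the final 54k-turn selection was done manually.

```python
# Hypothetical sketch of the answer-synthesis step with gpt-3.5-turbo.
def build_messages(question):
    """Chat messages for prompting gpt-3.5-turbo with a collected inquiry."""
    return [
        {"role": "system",
         "content": "You are a helpful medical assistant. Answer concisely."},
        {"role": "user", "content": question},
    ]

def keep_pair(question, answer, min_len=20, max_len=2000):
    """Crude automatic pre-filter applied before the manual review pass."""
    return min_len <= len(answer) <= max_len and "i'm sorry" not in answer.lower()

# The actual API call (requires an API key; shown with the openai package):
#   import openai
#   resp = openai.ChatCompletion.create(model="gpt-3.5-turbo",
#                                       messages=build_messages(question))
#   answer = resp["choices"][0]["message"]["content"]
```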

- Visual Adaptation: Deplot and Medical VQA (Baian) + Visual Adaptation: Medical Image Captioning and Deplot

-We also build a large-scale, high-quality video dataset, Vimeo90K. This dataset consists of 89,800 video clips downloaded from vimeo.com, which covers large variaty of scenes and actions. It is designed for the following four video processing tasks: temporal frame interpolation, video denoising, video deblocking, and video super-resolution. +Visual input is a critical element of the medical domain, providing essential information in healthcare settings. Healthcare practitioners rely heavily on visual cues to diagnose, monitor, and treat patients. Medical imaging technologies such as X-ray, CT, and MRI offer an unparalleled means of examining internal organs and identifying diseases and abnormalities that may not be visible to the naked eye.

-

+Our study builds on our previous work on visual language reasoning over charts and plots, showcased in DEPLOT: One-shot visual language reasoning by plot-to-table translation. Here, we extend that approach with a visual foundation model capable of accepting radiology images as input.

-Sampled Frames (Full-resolution samples are here):
-

+Within this framework, visual language reasoning consists of two key phases: (1) translating the image into text, followed by (2) reasoning over the derived text.

- -The list of original videos

+Visual foundation models first convert medical images into an intermediate textual representation, which is then used to prompt a pre-trained large language model (LLM), relying on the few-shot reasoning abilities inherent in LLMs.
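The two-phase pipeline can be sketched as a simple composition. The few-shot prefix, function names, and toy stand-ins below are illustrative assumptions; in the real system, the image-to-text step would be DEPLOT or Med-GIT and the reasoning step the underlying LLM.

```python
# Minimal sketch of the image-to-text -> LLM pipeline. The captioner and the
# LLM are passed in as plain callables so only the composition is shown here.
FEW_SHOT_PREFIX = (
    "The image shows: a clear chest X-ray.\n"
    "Q: Is any abnormality visible?\n"
    "A: No abnormality is visible.\n\n"
)

def visual_qa(image, question, image_to_text, llm):
    # Phase 1: a visual foundation model converts the image into text.
    intermediate = image_to_text(image)
    # Phase 2: prompt the LLM with the intermediate text plus the question,
    # relying on few-shot reasoning over the prepended example.
    prompt = (f"{FEW_SHOT_PREFIX}The image shows: {intermediate}\n"
              f"Q: {question}\nA:")
    return llm(prompt)

# Toy stand-ins to exercise the composition:
caption = lambda img: "a plot of accuracy versus training epochs"
identity_llm = lambda p: p  # a real LLM would return a completion instead
print(visual_qa(None, "What is plotted?", caption, identity_llm))
```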

-The list of all full-length original videos can be found here, and youtube-dl can be used to batch download them. We reused some of utilities by AoT Dataset for scene detection/camera stabilization to generate these video clips and please refer to this repository for more details.

+At present, our platform supports two visual foundation models, DEPLOT and Med-GIT, chosen for the prevalence of plots and radiology imagery in the medical field. The system's architecture is also designed so that other medical visual foundation models can be integrated seamlessly.

-We further process these 89,800 video clips to generate the following two subsets.

+The Med-GIT model is a GIT (Generative Image-to-text Transformer for Vision and Language) fine-tuned on the ROCO dataset for specialized radiology image captioning. The training procedure is documented in detail in our publicly accessible GitHub repository.
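As a rough illustration, a GIT-style checkpoint can be used for captioning through the Hugging Face transformers API as below. The default checkpoint name is a generic placeholder standing in for the fine-tuned Med-GIT weights, and this is a sketch rather than the project's inference code.

```python
def caption_radiology_image(image_path, checkpoint="microsoft/git-base"):
    """Caption an image with a GIT checkpoint. The default here is the
    generic base model, standing in for the fine-tuned Med-GIT weights."""
    # Heavy dependencies are imported inside the function so the sketch can
    # be defined without transformers/PIL installed.
    from transformers import AutoProcessor, AutoModelForCausalLM
    from PIL import Image

    processor = AutoProcessor.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```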

- -Triplet dataset (for temporal frame interpolation):

-The triplet dataset consists of 73,171 3-frame sequences with a fixed resolution of 448 x 256, extracted from 15K selected video clips from Vimeo-90K. This dataset is designed for temporal frame interpolation. Download links are - - - -Septuplet dataset (for video denoising, deblocking, and super-resoluttion):

- Notice: we have recently updated our testing denoising dataset to fix a bug in denoising test data generation. The new quantitative result of our algorithm is reported in our updated paper

-The septuplet dataset consists of 91,701 7-frame sequences with fixed resolution 448 x 256, extracted from 39K selected video clips from Vimeo-90K. This dataset is designed to video denoising, deblocking, and super-resolution. - -