diff --git a/docs/files/demo.gif b/docs/files/demo.gif
new file mode 100644
index 0000000..de252c3
Binary files /dev/null and b/docs/files/demo.gif differ
diff --git a/docs/index.html b/docs/index.html
index 2ccf8f3..3c3a8c5 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -71,20 +71,21 @@
Visual Med-Alpaca is an open-source, multi-modal foundation model designed specifically for the biomedical domain, built on LLaMa-7B. With a few hours of instruct-tuning and plug-and-play visual modules, it can perform a range of tasks, from reading radiological images to answering complex clinical questions, while being easily deployable and replicable with a single gaming GPU.

- Demo (insert GIF here) (Baian)
+ Demo

-Please register for Hugging Face and fill out this form [link] to access the online demo of Visual Med-Alpaca. Warning: Only for academic usage and do not apply it to real clinical scenarios!

+Please fill out this form to access the online demo. Warning: For academic use only; do not apply to real clinical scenarios!
@@ -125,13 +126,14 @@ We apologize for the inconvenience, but this project is currently undergoing int

-  • Data: Github, HuggingFace
+  • Data: Github
-  • Code: Github
+  • Code: Github
-  • Model: HuggingFace
+  • Model: HuggingFace Models
-  • Demo: HuggingFace
+  • Demo: Huggingface Spaces

+Within this framework, the task of visual language reasoning is divided into two key phases: (1) translating the image into text, and (2) reasoning over the derived text.

-The list of original videos

+Visual foundation models first convert medical images into an intermediate textual representation, which is then used to prompt a pre-trained large language model (LLM), relying on the few-shot reasoning abilities inherent in LLMs.
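As a rough illustration of this image-to-text-then-reason pipeline, the sketch below chains a generic Hugging Face image-to-text pipeline with a text-generation LLM. The checkpoint names, image path, and prompt template are placeholders, not the project's actual code.

```python
# Minimal sketch of the two-phase pipeline: (1) a visual foundation model turns
# the image into intermediate text, (2) that text is folded into a prompt for the LLM.
# All model names and file paths below are illustrative placeholders.
from transformers import pipeline

# Phase 1: image -> intermediate text (e.g. a radiology caption or a linearized plot table).
captioner = pipeline("image-to-text", model="microsoft/git-base")  # placeholder checkpoint
intermediate_text = captioner("example_scan.png")[0]["generated_text"]

# Phase 2: prompt a pre-trained LLM with the intermediate text plus the user's question.
llm = pipeline("text-generation", model="path/to/llama-7b")  # placeholder checkpoint
prompt = (
    "Below is a description of a medical image, followed by a question.\n"
    f"Image description: {intermediate_text}\n"
    "Question: What abnormality, if any, is visible?\n"
    "Answer:"
)
answer = llm(prompt, max_new_tokens=128)[0]["generated_text"]
print(answer)
```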

    -The list of all full-length original videos can be found here, and youtube-dl can be used to batch download them. We reused some of utilities by AoT Dataset for scene detection/camera stabilization to generate these video clips and please refer to this repository for more details.

+At present, our platform supports two visual foundation models, DEPLOT and Med-GIT, given the prevalence of plot and radiology imagery in the medical domain. The system architecture is also designed so that alternative medical visual foundation models can be integrated seamlessly.
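To make the "seamless integration" point concrete, here is a hypothetical registry sketch showing how visual modules could be routed by image type. The function and registry names are illustrative and not the project's actual API; real modules would wrap DEPLOT and Med-GIT inference.

```python
# Hypothetical plug-and-play registry for visual foundation models.
# The placeholder bodies below only mark where real DEPLOT / Med-GIT
# inference would happen.
from typing import Callable, Dict

def deplot_module(image_path: str) -> str:
    # Placeholder: run DEPLOT and return a linearized data table for a plot/chart.
    return f"<data table extracted from {image_path}>"

def med_git_module(image_path: str) -> str:
    # Placeholder: run Med-GIT and return a radiology caption.
    return f"<radiology caption for {image_path}>"

# Integrating another medical visual foundation model amounts to adding an entry here.
VISUAL_MODULES: Dict[str, Callable[[str], str]] = {
    "plot": deplot_module,
    "radiology": med_git_module,
}

def image_to_text(image_path: str, image_type: str) -> str:
    """Route an image to its registered visual module and return the intermediate text."""
    if image_type not in VISUAL_MODULES:
        raise ValueError(f"No visual module registered for image type '{image_type}'")
    return VISUAL_MODULES[image_type](image_path)
```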

    -We further process these 89,800 video clips to generate the following two subsets.

+Med-GIT is a GIT (Generative Image-to-text Transformer for Vision and Language) fine-tuned on the ROCO dataset for specialized radiology image captioning. The training procedure is described in detail in our publicly accessible Github repository.
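Since Med-GIT follows the standard GIT architecture, inference can presumably be run through the usual Hugging Face GIT interface, as in the sketch below. The checkpoint path and image file are placeholders for the actual fine-tuned weights released in the repository.

```python
# Sketch of GIT-style captioning with Hugging Face transformers.
# "path/to/med-git-roco" and the image file are placeholders; substitute the
# actual Med-GIT checkpoint fine-tuned on ROCO.
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

checkpoint = "path/to/med-git-roco"  # placeholder for the released weights
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

image = Image.open("radiology_example.png").convert("RGB")  # hypothetical input
pixel_values = processor(images=image, return_tensors="pt").pixel_values

generated_ids = model.generate(pixel_values=pixel_values, max_new_tokens=64)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(caption)  # intermediate text handed on to the language model
```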

-Triplet dataset (for temporal frame interpolation):

-The triplet dataset consists of 73,171 3-frame sequences with a fixed resolution of 448 x 256, extracted from 15K selected video clips from Vimeo-90K. This dataset is designed for temporal frame interpolation. Download links are
-Septuplet dataset (for video denoising, deblocking, and super-resolution):

    - Notice: we have recently updated our testing denoising dataset to fix a bug in denoising test data generation. The new quantitative result of our algorithm is reported in our updated paper

-The septuplet dataset consists of 91,701 7-frame sequences with fixed resolution 448 x 256, extracted from 39K selected video clips from Vimeo-90K. This dataset is designed for video denoising, deblocking, and super-resolution.