diff --git a/docs/files/demo.gif b/docs/files/demo.gif
new file mode 100644
index 0000000..de252c3
Binary files /dev/null and b/docs/files/demo.gif differ
diff --git a/docs/index.html b/docs/index.html
index eb0e449..c2ff3a0 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -77,14 +77,15 @@
-Visual Adaptation: Deplot and Medical VQA (Baian)
+Visual Adaptation: Medical Image Captioning and Deplot
-We also build a large-scale, high-quality video dataset, Vimeo90K. This dataset consists of 89,800 video clips downloaded from
vimeo.com, which covers a large variety of scenes and actions. It is designed for the following four video processing tasks: temporal frame interpolation, video denoising, video deblocking, and video super-resolution.
+Visual input is a critical element of the medical domain, contributing essential information in healthcare settings. Healthcare practitioners rely heavily on visual cues to diagnose, monitor, and treat patients. Medical imaging technologies such as X-ray, CT, and MRI provide an unparalleled means of examining internal organs and identifying diseases and abnormalities that may not be visible to the naked eye.
-
+Our study further develops our previous work on visual language reasoning over charts and plots, presented in
DEPLOT: One-shot visual language reasoning by plot-to-table translation. Here, we extend the approach by incorporating a visual foundation model that can accept radiology images as input.
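+As a concrete illustration of the plot-to-table step, the sketch below runs the publicly released DEPLOT checkpoint through Hugging Face Transformers; the checkpoint name and prompt come from that release, and the image path is a placeholder, so treat this as an example invocation rather than our platform's exact wiring.
```python
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# Load the publicly released DEPLOT checkpoint (plot -> linearized data table).
processor = Pix2StructProcessor.from_pretrained("google/deplot")
model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")

# "chart.png" is a placeholder path for any chart or plot image.
image = Image.open("chart.png").convert("RGB")
inputs = processor(
    images=image,
    text="Generate underlying data table of the figure below:",
    return_tensors="pt",
)

# The decoded string is a linearized table that can later be handed to an LLM.
table_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(table_ids[0], skip_special_tokens=True))
```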
-Sampled Frames (Full-resolution samples are here):
-[grid of sampled frame thumbnails]
+Within this framework, the task of visual language reasoning can be decomposed into two key phases: (1) translating the image into text, followed by (2) reasoning over the derived text.
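+A minimal sketch of this two-phase decomposition is shown below; the interface names are hypothetical, and the concrete image-to-text models and LLM are described in the following paragraphs.
```python
from typing import Protocol

class ImageToText(Protocol):
    """Phase 1: a visual foundation model that turns an image into text."""
    def __call__(self, image_path: str) -> str: ...

class TextReasoner(Protocol):
    """Phase 2: a language model that reasons over the derived text."""
    def __call__(self, context: str, question: str) -> str: ...

def visual_language_reasoning(
    image_path: str, question: str, to_text: ImageToText, reason: TextReasoner
) -> str:
    # Phase 1: image -> intermediate text (e.g. a caption or a linearized table).
    context = to_text(image_path)
    # Phase 2: reason over that text together with the question.
    return reason(context, question)
```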
-
-
The list of original videos
+The pipeline uses visual foundation models to convert medical images into an intermediate text representation. This text is then used to prompt a pre-trained large language model (LLM), relying on the few-shot reasoning abilities inherent in LLMs.
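+The prompt construction itself can be as simple as the sketch below; the exemplars are invented placeholders and the final LLM call is abstracted away, since the platform can sit on top of any pre-trained LLM.
```python
# Hypothetical few-shot exemplars: (intermediate text, question, answer) triples.
EXEMPLARS = [
    (
        "Chest X-ray showing a small right-sided pleural effusion.",
        "Is there fluid in the pleural space?",
        "Yes, a small right-sided pleural effusion is described.",
    ),
]

def build_prompt(context: str, question: str) -> str:
    """Assemble a few-shot prompt over the intermediate text from phase 1."""
    parts = []
    for ex_context, ex_question, ex_answer in EXEMPLARS:
        parts.append(f"Report: {ex_context}\nQuestion: {ex_question}\nAnswer: {ex_answer}\n")
    parts.append(f"Report: {context}\nQuestion: {question}\nAnswer:")
    return "\n".join(parts)

# The assembled string is sent to any pre-trained LLM completion endpoint,
# which supplies the few-shot reasoning that produces the final answer.
prompt = build_prompt(
    "Chest X-ray with patchy opacities in the left lower lobe.",
    "Which lung region appears abnormal?",
)
```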
-The list of all full-length original videos can be found
here, and
youtube-dl can be used to batch download them. We reused some of the utilities from the AoT Dataset for scene detection/camera stabilization to generate these video clips; please refer to this
repository for more details.
+At present, our platform supports two visual foundation models, the
DEPLOT and Med-GIT models, reflecting the prevalence of plot and radiology imagery within the medical field. The system's architecture is also designed so that additional medical visual foundation models can be integrated with little effort.
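+One way to picture this extensibility is as a registry that maps a model name to an image-to-text callable; the names and functions below are illustrative and do not reflect the actual interface of our codebase.
```python
from typing import Callable, Dict

# Illustrative registry: model name -> function mapping an image path to text.
VISUAL_FOUNDATION_MODELS: Dict[str, Callable[[str], str]] = {}

def register_model(name: str, fn: Callable[[str], str]) -> None:
    """Register an additional visual foundation model under a given name."""
    VISUAL_FOUNDATION_MODELS[name] = fn

def image_to_text(name: str, image_path: str) -> str:
    """Dispatch an image to the requested visual foundation model."""
    return VISUAL_FOUNDATION_MODELS[name](image_path)

# e.g. register_model("deplot", deplot_to_table)
#      register_model("med-git", med_git_caption)
```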
-We further process these 89,800 video clips to generate the following two subsets.
+The Med-GIT model is a
GIT: Generative Image-to-text Transformer for Vision and Language fine-tuned on the
ROCO dataset for specialized radiology image captioning. The training procedure for the model is described in detail in our publicly accessible GitHub repository.
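+For reference, the snippet below shows how a GIT checkpoint is typically run for captioning with Hugging Face Transformers; it loads the public microsoft/git-base weights, and the path where a ROCO fine-tuned Med-GIT checkpoint would be substituted is a placeholder.
```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# "microsoft/git-base" is the public base checkpoint; a ROCO fine-tuned
# Med-GIT checkpoint would be loaded here instead (path is a placeholder).
processor = AutoProcessor.from_pretrained("microsoft/git-base")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base")

image = Image.open("chest_xray.png").convert("RGB")  # placeholder radiology image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Generate a radiology-style caption for the image.
caption_ids = model.generate(pixel_values=pixel_values, max_length=64)
caption = processor.batch_decode(caption_ids, skip_special_tokens=True)[0]
print(caption)
```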
-
-
Triplet dataset (for temporal frame interpolation):
-The triplet dataset consists of 73,171 3-frame sequences with a fixed resolution of 448 x 256, extracted from 15K selected video clips from Vimeo-90K. This dataset is designed for temporal frame interpolation. Download links are
-
-- Testing set only (17GB): zip
-- Both training and test set (33GB): zip
-
-
-
-
Septuplet dataset (for video denoising, deblocking, and super-resolution):
-
Notice: we have recently updated our denoising test dataset to fix a bug in denoising test data generation. The new quantitative result of our algorithm is reported in our updated paper.
-The septuplet dataset consists of 91,701 7-frame sequences with a fixed resolution of 448 x 256, extracted from 39K selected video clips from Vimeo-90K. This dataset is designed for video denoising, deblocking, and super-resolution.
-
-- The test set for video denoising (16GB): zip
-- The test set for video deblocking (11GB): zip
-- The test set for video super-resolution (6GB): zip
-- The original test set (not downsampled or downgraded by noise) (15GB): zip
-- The original training + test set (82GB): zip
-
-