diff --git a/README-en.md b/README-en.md
index 6a618c5..68b37bc 100644
--- a/README-en.md
+++ b/README-en.md
@@ -1,257 +1 @@
-
-
- MiniCPM
-
-
-
-
-Hugging Face |
-ModelScope |
-WiseModel |
-技术报告
-
-
-
-
-
-XXXXXX
-XXXXXX
-
-Experience models with larger scale at [Luca](https://luca.cn/).
-
-
-
- 中文 |
- English
-
-
-
-
-
-## Quick Links
-
-- [Introduction](#1)
-- [Downloading](#2)
-- [Benchmark](#3)
- - [Chinese](#3.1)
- - [English](#3.2)
- - [Code](#3.3)
- - [Logic](#3.4)
- - [Multi-modal](#3.5)
-- [Deployment on mobile phones](#4)
-- [Demo & API](#5)
-- [Parameter-efficient Fine-tuning](#6)
-- [LICENSE](#7)
-- [Citation](#8)
-- [Show Cases](#9)
-
-
-
-# Introduction
-
-
-
-# Downloading
-- [HuggingFace Repo]()
-- [ModelScope Repo]()
-- [XX Repo]()
-
-
-
-# Benchmark
-
- | HuggingFace | ModelScope | WiseModel |
- |-------------|------------|-----------|
- |[sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16)|[sft-bf16](https://modelscope.cn/models/OpenBMB/miniCPM-bf16)|[sft-bf16](https://wisemodel.cn/models/OpenBMB/miniCPM-bf16)
- |[sft-fp32](https://huggingface.co/openbmb/MiniCPM-2B-sft-fp32)|[sft-fp32](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-sft-fp32)|[sft-fp32](https://wisemodel.cn/models/OpenBMB/miniCPM-dpo-fp32)
- |[dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16)|[dpo-bf16](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-dpo-bf16/summary)|[dpo-bf16](https://wisemodel.cn/models/OpenBMB/MiniCPM-2B-dpo-bf16)
- |[dpo-fp16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-fp16)|[dpo-fp16](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-dpo-fp16/)|[dpo-fp16](https://wisemodel.cn/models/OpenBMB/MiniCPM-2B-dpo-fp16)
- |[dpo-fp32](https://huggingface.co/openbmb/MiniCPM-2B-dpo-fp32)|[dpo-fp32](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-dpo-fp32)|[dpo-fp32](https://wisemodel.cn/models/OpenBMB/miniCPM-dpo-fp32)
-
-## Multi-modal
-
-|Models|MME(P)|MMB-dev(en)|MMB-dev(zh)|MMMU-val|CMMMU-val|
-|-|-|-|-|-|-|
-|LLaVA-Phi|1335.1|59.8|/|/|/|
-|MobileVLM|1288.9|59.6|/|/|/|
-|Imp-v1|1434.0|66.5|/|/|/|
-|Qwen-VL-Chat|**1487**|60.6|56.7|**35.9**|30.7
-|**MiniCPM-V**|1446|**67.3**|**61.9**|34.7|**32.1**|
-
-## DPO
-
-|Models|MT-bench|
-|---|---|
-|GPT-4-turbo|9.32|
-|GPT-3.5-turbo|8.39|
-|Mistral-8*7b-Instruct-v0.1|8.30|
-|Claude-2.1|8.18|
-|Zephyr-7B-beta|7.34|
-|**MiniCPM-2B**|**7.25**|
-|Vicuna-33B|7.12|
-|Zephyr-7B-alpha|6.88|
-|LLaMA-2-70B-chat|6.86|
-|Mistral-7B-Instruct-v0.1|6.84|
-|LLaMA-2-13B-chat|6.65|
-|Vicuna-13B|6.57|
-|MPT-34B-instruct|6.39|
-|LLaMA-2-7B-chat|6.27|
-|Vicuna-7B|6.17|
-|MPT-7B-chat|5.42|
-
-
-## Deployment on mobile phones
-
-
-After INT4 quantization, MiniCPM only occupies 2GB of space, meeting the requirements of inference on edge devices.
-
-We utilize the open-source framework [MLC-LLM](https://github.com/mlc-ai/mlc-llm) for deployment on Android and Harmony OS. For deployment on IOS, we adapt MiniCPM using [LLMFarm](https://github.com/guinmoon/LLMFarm). We select some mobile phones for testing respectively.
-
-### Tutorial
-
- #### Android
-
-[Compilation and installation on Android](https://github.com/OpenBMB/mlc-MiniCPM/blob/main/README.md)
-
- #### IOS
-
-[Compilation and installation on IOS](https://github.com/OpenBMB/LLMFarm)
-
- #### Multimodal
-
-### Performance
-
-
-Instead of conducting in-depth optimization for deployment on mobile phones, we only verify the feasibility of MiniCPM using mobile chips for inference.
-
-**We welcome more developers to continuously improve the inference performance of LLMs on mobile phones and update the test results below.**
-
-|Mobile Phones|OS|Processor|Memory(GB)|Inference Throughput(token/s)|
-|-|-|-|-|-|
-|OPPO Find N3|Android 13|snapdragon 8 Gen2|12|6.5|
-|Samsung S23 Ultra|Android 14|snapdragon 8 Gen2|12|6.4|
-|Meizu M182Q|Android 11|snapdragon 888Plus|8|3.7|
-|Xiaomi 12 Pro|Android 13|snapdragon 8 Gen1|8+3|3.7|
-|Xiaomi Redmi K40|Android 11|snapdragon 870|8|3.5|
-|Oneplus LE 2100|Android 13|snapdragon 870|12|3.5|
-|Oneplus HD1900|Android 11|snapdragon 865|8|3.2|
-|Oneplus HD1900|Android 11|snapdragon 855|8|3.0|
-|Oneplus HD1905|Android 10|snapdragon 855|8|3.0|
-|Oneplus HD1900|Android 11|snapdragon 855|8|3.0|
-|Xiaomi MI 8|Android 9|snapdragon 845|6|2.3|
-|Huawei Nova 11SE|Harmony 4.0.0|snapdragon 778|12|1.9|
-|Xiaomi MIX 2|Android 9|snapdragon 835|6|1.3|
-|iPhone 15 Pro|iOS 17.2.1|A16|8|18.0|
-|iPhone 15|iOS 17.2.1|A16|6|15.0|
-|iPhone 12 Pro|iOS 16.5.1|A14|6|5.8|
-|iPhone 12|iOS 17.2.1|A14|4|5.8|
-|iPhone 11|iOS 16.6|A13|4|4.6|
-
-## Demo & API
-
-#### Web-demo based on Gradio
-Launch gradio-based demo using the following command:
-```shell
-python demo/gradio_based_demo.py
-```
-
-#### Inference with vLLM (Recommended!)
-
-* Install vLLM supporting MiniCPM
- - vLLM 0.2.2 is adapted to MiniCPM in `inference/vllm`. More vLLM versions will be supported in the future
-```shell
-pip install inference/vllm
-```
-
-* Transfer Huggingface Transformers repo to vLLM-MiniCPM repo, where ``, `` are local paths.
-```shell
-python inference/convert_hf_to_vllmcpm.py --load --save
-```
-
-* Examples
-```shell
-cd inference/vllm/examples/infer_cpm
-python inference.py --model_path --prompt_path prompts/prompt_final.txt
-
-##
-
-## LICENSE
-
-#### Model LICENSE
-
-
-This repository is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License. The usage of MiniCPM's models and weights must strictly follow [“通用模型许可协议-来源说明-宣传限制-商业授权”](https://github.com/OpenBMB/General-Model-License/blob/main/%E9%80%9A%E7%94%A8%E6%A8%A1%E5%9E%8B%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE-%E6%9D%A5%E6%BA%90%E8%AF%B4%E6%98%8E-%E5%AE%A3%E4%BC%A0%E9%99%90%E5%88%B6-%E5%95%86%E4%B8%9A%E6%8E%88%E6%9D%83.md).
-
-The models and weights of MiniCPM are completely free for academic research. If you need to use MiniCPM for commercial purposes, feel free to contact cpm@modelbest.cn for obtaining written authorization. After registration, free commercial usage is also allowed.
-
-
-
-#### Disclaimer
-
-
-
-As a language model, MiniCPM generates contents by learning from huge amount of internet corpus. It doesn't have personal opinions or value judgments. All the generated content of MiniCPM doesn't represent views or standpoints of model developers.
-
-Users are responsible for the evaluation and verification of all generated contents.
-
-## Citation
-
-Please cite our [techinical report]() if you find our work valuable.
-
-```
-@inproceedings{han2022bminf,
- title={MiniCPM: todo},
- booktitle={OpenBMB Blog},
- year={2024}
-}
-```
-
-## Show Cases
-
-#### Code
-Case 1:
-
-
-Case 2:
-
-
-#### Reasoning
-Case 1:
-
-
-Case 2:
-
-
-
-#### World-Knowledge
-Case 1:
-
-
-#### Content Creation
-Case 1:
-
-
-#### Translation
-Case 1:
-
-
-Case 2:
-
-
-#### Instruction Following
-Case 1:
-
-
-Case 2:
-
-
-#### Special characters
-Case 1:
-
-
-Case 2:
-
+Update soon.
\ No newline at end of file
diff --git a/README.md b/README.md
index 44035da..b668f76 100644
--- a/README.md
+++ b/README.md
@@ -4,21 +4,16 @@
-
-
- 中文 |
- English
-
-
-MiniCPM 技术报告 |
-多模态模型 OmniLMM |
-千亿模型 Luca
+MiniCPM 技术报告 |
+OmniLMM 多模态模型 |
+CPM-C 千亿模型试用 |
+加入我们的社区
+
-
-MiniCPM 是面壁与清华大学自然语言处理实验室共同开源的系列端侧语言大模型,主体语言模型 MiniCPM-2B 仅有 24亿(2.4B)的非词嵌入参数量。
+MiniCPM 是面壁智能与清华大学自然语言处理实验室共同开源的系列端侧语言大模型,主体语言模型 MiniCPM-2B 仅有 24亿(2.4B)的非词嵌入参数量。
- 经过 SFT 后,MiniCPM 在公开综合性评测集上,MiniCPM 与 Mistral-7B相近(中文、数学、代码能力更优),整体性能超越 Llama2-13B、MPT-30B、Falcon-40B 等模型。
- 经过 DPO 后,MiniCPM 在当前最接近用户体感的评测集 MTBench上,MiniCPM-2B 也超越了 Llama2-70B-Chat、Vicuna-33B、Mistral-7B-Instruct-v0.1、Zephyr-7B-alpha 等众多代表性开源大模型。
- 以 MiniCPM-2B 为基础构建端侧多模态大模型 MiniCPM-V,整体性能在同规模模型中实现最佳,超越基于 Phi-2 构建的现有多模态大模型,在部分评测集上达到与 9.6B Qwen-VL-Chat 相当甚至更好的性能。
@@ -27,47 +22,136 @@ MiniCPM 是面壁与清华大学自然语言处理实验室共同开源的系列
我们将完全开源MiniCPM-2B的模型参数供学术研究和有限商用,以及训练过程中的所有Checkpoint和大部分非专有数据供模型机理研究。
-- 基于MiniCPM-2B的指令微调与人类偏好对**MiniCPM-2B-SFT/DPO。**
-- 基于MiniCPM-2B的多模态模型**MiniCPM-V**,能力超越基于Phi-2的同参数级别多模态模型**。**
-- MiniCPM-2B-SFT/DPO的Int4量化版**MiniCPM-2B-SFT/DPO-Int4。**
-- 基于MLC-LLM、LLMFarm开发的MiniCPM手机端程序,**文本及多模态模型均可在手机端进行推理。**
+- 基于MiniCPM-2B的指令微调与人类偏好对**MiniCPM-2B-SFT/DPO**。
+- 基于MiniCPM-2B的多模态模型**MiniCPM-V**,能力超越基于Phi-2的同参数级别多模态模型。
+- MiniCPM-2B-SFT/DPO的Int4量化版**MiniCPM-2B-SFT/DPO-Int4**。
+- 基于MLC-LLM、LLMFarm开发的MiniCPM手机端程序,**文本及多模态模型均可在手机端进行推理**。
### 局限性:
-- 受限于模型规模,模型可能出现幻觉性问题。其中由于DPO模型生成的回复内容更长,更容易出现幻觉。我们也将持续进行MiniCPM模型的迭代改进;
-- 为了保证在学术研究用途上模型的通用性,我们未对模型进行任何身份认同训练。同时由于我们用ShareGPT开源语料作为部分训练数据,模型可能会输出类似GPT系列模型的身份认同信息;
-- 受限于模型规模,模型的输出受到提示词(prompt)的影响较大,可能多次尝试产生不一致的结果;
+- 受限于模型规模,模型可能出现幻觉性问题。其中由于DPO模型生成的回复内容更长,更容易出现幻觉。我们也将持续进行MiniCPM模型的迭代改进。
+- 为了保证在学术研究用途上模型的通用性,我们未对模型进行任何身份认同训练。同时由于我们用ShareGPT开源语料作为部分训练数据,模型可能会输出类似GPT系列模型的身份认同信息。
+- 受限于模型规模,模型的输出受到提示词(prompt)的影响较大,可能多次尝试产生不一致的结果。
- 受限于模型容量,模型的知识记忆较不准确,后续我们将结合RAG方法来增强模型的知识记忆能力。
-# 目录
+## 目录
- [模型下载](#1)
- [快速上手](#2)
- [评测结果](#3)
- [手机部署](#4)
- [Demo & API 部署](#5)
-- [高效参数微调](#6)
+- [二次开发](#6)
- [开源协议](#7)
- [工作引用](#8)
- [典型示例](#9)
-# 模型下载
+
+
+## 模型下载
+
+* 语言模型
| HuggingFace | ModelScope | WiseModel |
|-------------|------------|-----------|
- |[sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16)|[sft-bf16](https://modelscope.cn/models/OpenBMB/miniCPM-bf16)|[sft-bf16](https://wisemodel.cn/models/OpenBMB/miniCPM-bf16)
- |[sft-fp32](https://huggingface.co/openbmb/MiniCPM-2B-sft-fp32)|[sft-fp32](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-sft-fp32)|[sft-fp32](https://wisemodel.cn/models/OpenBMB/miniCPM-dpo-fp32)
- |[dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16)|[dpo-bf16](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-dpo-bf16/summary)|[dpo-bf16](https://wisemodel.cn/models/OpenBMB/MiniCPM-2B-dpo-bf16)
- |[dpo-fp16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-fp16)|[dpo-fp16](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-dpo-fp16/)|[dpo-fp16](https://wisemodel.cn/models/OpenBMB/MiniCPM-2B-dpo-fp16)
- |[dpo-fp32](https://huggingface.co/openbmb/MiniCPM-2B-dpo-fp32)|[dpo-fp32](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-dpo-fp32)|[dpo-fp32](https://wisemodel.cn/models/OpenBMB/miniCPM-dpo-fp32)
+ |[MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16)|[MiniCPM-2B-sft-bf16](https://modelscope.cn/models/OpenBMB/miniCPM-bf16)|[MiniCPM-2B-sft-bf16](https://wisemodel.cn/models/OpenBMB/miniCPM-bf16)
+ |[MiniCPM-2B-sft-fp32](https://huggingface.co/openbmb/MiniCPM-2B-sft-fp32)|[MiniCPM-2B-sft-fp32](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-sft-fp32)|[MiniCPM-2B-sft-fp32](https://wisemodel.cn/models/OpenBMB/miniCPM-dpo-fp32)
+ |[MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16)|[MiniCPM-2B-dpo-bf16](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-dpo-bf16/summary)|[MiniCPM-2B-dpo-bf16](https://wisemodel.cn/models/OpenBMB/MiniCPM-2B-dpo-bf16)
+ |[MiniCPM-2B-dpo-fp16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-fp16)|[MiniCPM-2B-dpo-fp16](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-dpo-fp16/)|[MiniCPM-2B-dpo-fp16](https://wisemodel.cn/models/OpenBMB/MiniCPM-2B-dpo-fp16)
+ |[MiniCPM-2B-dpo-fp32](https://huggingface.co/openbmb/MiniCPM-2B-dpo-fp32)|[MiniCPM-2B-dpo-fp32](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-dpo-fp32)|[MiniCPM-2B-dpo-fp32](https://wisemodel.cn/models/OpenBMB/miniCPM-dpo-fp32)
+
+
+* 多模态模型
+ | HuggingFace | ModelScope | WiseModel |
+ |-------------|------------|-----------|
+ | [MiniCPM-V](https://huggingface.co/openbmb/MiniCPM-V) | [MiniCPM-V](https://modelscope.cn/models/OpenBMB/MiniCPM-V/) | [MiniCPM-V](https://wisemodel.cn/models/OpenBMB/MiniCPM-V) |
+ | [OmniLMM](https://huggingface.co/openbmb/OmniLMM-12B) | [OmniLMM](https://modelscope.cn/models/OpenBMB/OmniLMM-12B) | [OmniLMM](https://wisemodel.cn/models/OpenBMB/OmniLMM-12B) |
+
+
-# 快速上手
+## 快速上手
+#### vLLM 推理
+
+* 安装支持 MiniCPM 的 vLLM
+ - 因为 MiniCPM 采用 MUP 结构,在矩阵乘法中存在一定的放缩计算,与Llama类模型结构有细微差别。
+ - 我们基于版本为 0.2.2 的 vLLM 实现了 MiniCPM 的推理,代码位于仓库[inference](https://github.com/OpenBMB/MiniCPM/tree/main/inference)文件夹下,未来将会支持更新的vLLM 版本。
+
+* 安装支持 MiniCPM 的 vLLM 版本
+```shell
+pip install inference/vllm
+```
+
+* 将Huggingface Transformers仓库转为vLLM-MiniCPM支持的格式,其中`<hf_repo_path>`, `<vllmcpm_repo_path>`均为本地路径
+```shell
+python inference/convert_hf_to_vllmcpm.py --load <hf_repo_path> --save <vllmcpm_repo_path>
+```
+
+* 测试样例
+```shell
+cd inference/vllm/examples/infer_cpm
+python inference.py --model_path <vllmcpm_repo_path> --prompt_path prompts/prompt_demo.txt
+```
+
+* 期望输出
+```shell
+<用户>: Which city is the capital of China?
+<AI>:
+ The capital city of China is Beijing. Beijing is a major political, cultural, and economic center in China, and it is known for its rich history, beautiful architecture, and vibrant nightlife. It is also home to many of China's most important cultural and historical sites, including the Forbidden City, the Great Wall of China, and the Temple of Heaven. Beijing is a popular destination for tourists from around the world, and it is an important hub for international business and trade.
+```
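+
+* 在 Python 中直接调用(示意)
+  - 除命令行样例外,安装上述适配版 vLLM 后也可以在 Python 中直接加载转换后的模型,写法与仓库中的 `demo/vllm_based_demo.py` 一致。下面给出一个最小示意,其中 `<vllmcpm_repo_path>` 为上一步转换得到的本地模型路径。
+```python
+from vllm import LLM, SamplingParams
+
+# 加载转换后的 MiniCPM 模型(路径仅为示意)
+llm = LLM(model="<vllmcpm_repo_path>", tensor_parallel_size=1, dtype="bfloat16")
+
+# MiniCPM 的对话格式为 <用户>{query}<AI>
+prompt = "<用户>Which city is the capital of China?<AI>"
+sampling_params = SamplingParams(temperature=0.8, top_p=0.8, max_tokens=256)
+
+outputs = llm.generate(prompts=prompt, sampling_params=sampling_params)
+print(outputs[0].outputs[0].text)
+```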
+
+#### Huggingface 模型
+##### MiniCPM-2B
+* 安装`transformers>=4.36.0`以及`accelerate`后,运行以下代码
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+torch.manual_seed(0)
+
+path = 'openbmb/MiniCPM-2B-dpo-bf16'
+tokenizer = AutoTokenizer.from_pretrained(path)
+model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True)
+
+responds, history = model.chat(tokenizer, "山东省最高的山是哪座山, 它比黄山高还是矮?差距多少?", temperature=0.8, top_p=0.8)
+print(responds)
+```
+
+* 期望输出
+```shell
+山东省最高的山是泰山,海拔1545米。
+
+相对于黄山(海拔1864米),泰山海拔较低,相差约319米。
+```
+
+##### MiniCPM-V
+```python
+import torch
+from PIL import Image
+from transformers import AutoModel, AutoTokenizer
+
+model_path='openbmb/MiniCPM-V'
+model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(dtype=torch.bfloat16)
+tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+model.eval().cuda()
+
+image = Image.open('./assets/COCO_test2015_000000262144.jpg').convert('RGB')
+
+question = '请描述一下该图像'
+res, context, _ = model.chat(
+ image=image,
+ question=question,
+ context=None,
+ tokenizer=tokenizer,
+ sampling=True,
+ temperature=0.7
+)
+print(res)
+```
@@ -145,13 +229,79 @@ MiniCPM 是面壁与清华大学自然语言处理实验室共同开源的系列
#### 多模态评测
-|模型|MME(P)|MMB-dev(en)|MMB-dev(zh)|MMMU-val|CMMMU-val|
-|-|-|-|-|-|-|
-|LLaVA-Phi|1335.1|59.8|/|/|/|
-|MobileVLM|1288.9|59.6|/|/|/|
-|Imp-v1|1434.0|66.5|/|/|/|
-|Qwen-VL-Chat|**1487**|60.6|56.7|**35.9**|30.7
-|**MiniCPM-V**|1446|**67.3**|**61.9**|34.7|**32.1**|
+
+| Model | Size | MME | MMB dev (en) | MMB dev (zh) | MMMU val | CMMMU val |
+|-|-|-|-|-|-|-|
+| LLaVA-Phi | 3B | 1335 | 59.8 | - | - | - |
+| MobileVLM | 3B | 1289 | 59.6 | - | - | - |
+| Imp-v1 | 3B | 1434 | 66.5 | - | - | - |
+| Qwen-VL-Chat | 9.6B | 1487 | 60.6 | 56.7 | 35.9 | 30.7 |
+| CogVLM | 17.4B | 1438 | 63.7 | 53.8 | 32.1 | - |
+| OmniLMM-3B | 3B | 1452 | 67.3 | 61.9 | 34.7 | 32.1 |
+
#### DPO评测
@@ -182,18 +332,18 @@ MiniCPM 是面壁与清华大学自然语言处理实验室共同开源的系列
* 使用开源框架MLC-LLM进行模型适配。
* 支持文本模型、多模态模型。
* 适用于MiniCPM-2B-SFT-INT4、MiniCPM-2B-DPO-INT4、MiniCPM-V。
- * [编译安装MiniCPM指南](https://github.com/OpenBMB/mlc-MiniCPM/blob/main/README.md)
+ * [编译安装MiniCPM指南](https://github.com/OpenBMB/mlc-MiniCPM)
* iOS
* 使用开源框架LLMFarm进行模型适配。
* 支持文本模型。
- * 适用于MiniCPM-2B-SFT-INT4、MiniCPM-2B-DPO-INT4
+ * 适用于MiniCPM-2B-SFT-INT4、MiniCPM-2B-DPO-INT4。
* [编译安装MiniCPM指南](https://github.com/OpenBMB/LLMFarm)
#### 部署性能
* 我们未针对手机推理模型进行深度优化和系统测试,仅验证MiniCPM使用手机芯片进行推理的可行性。
* 此前尚未有工作尝试在手机上部署多模态大模型。我们此次在MLC-LLM上验证了手机部署MiniCPM-V的可行性,能够正常输入输出,但也存在图片处理时间较长的问题,需要进一步优化 :)。
-* **我们也欢迎更多开发者进一步调优并更新下面的测试列表,不断提升端侧大模型在手机上的推理性能。**
+* **我们也欢迎更多开发者进一步调优并更新下面的测试列表,不断提升端侧大模型在手机上的推理性能**。
|手机型号|操作系统|处理器|Memory(GB)|文本吞吐(token/s)|
|-|-|-|-|-|
@@ -228,14 +378,68 @@ MiniCPM 是面壁与清华大学自然语言处理实验室共同开源的系列
* 使用如下命令启动基于Gradio的网页版demo:
```shell
-python demo/gradio_based_demo.py
+# generation powered by vllm
+python demo/vllm_based_demo.py --model_path <vllmcpm_repo_path>
+# generation powered by huggingface
+python demo/hf_based_demo.py --model_path <hf_repo_path>
```
-## 高效参数微调
+## 二次开发
+* 高效参数微调
+ * 一张1080/2080可实现高效参数微调
+ * [高效参数微调代码](https://github.com/OpenBMB/MiniCPM/tree/main/finetune)
+
+* 全参数微调 or 持续训练
+ * 使用[BMTrain](https://github.com/OpenBMB/BMTrain),借助重计算和ZeRO-3,一张3090/4090可实现全参数微调,一台机器可实现持续训练
+ * 相关代码也将陆续推出
+
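+上述高效参数微调可以用 HuggingFace PEFT 的 LoRA 来实现。下面给出一个最小示意(仅为示意,`target_modules` 等超参数为假设值,具体配置请以 [finetune](https://github.com/OpenBMB/MiniCPM/tree/main/finetune) 目录中的脚本为准):
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from peft import LoraConfig, TaskType, get_peft_model
+
+path = "openbmb/MiniCPM-2B-sft-bf16"
+tokenizer = AutoTokenizer.from_pretrained(path)
+model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, trust_remote_code=True)
+
+# LoRA 配置:仅训练注入的低秩矩阵,显存开销远小于全参数微调
+lora_config = LoraConfig(
+    task_type=TaskType.CAUSAL_LM,
+    r=8,
+    lora_alpha=32,
+    lora_dropout=0.1,
+    target_modules=["q_proj", "v_proj"],  # 假设值,请以 finetune 目录中的配置为准
+)
+model = get_peft_model(model, lora_config)
+model.print_trainable_parameters()
+```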
+
+
+
+
+## 典型示例
+
+#### 文本生成
+
+
+
+
+
+
+
+#### 代码生成
+
+
+
+
+
+#### 数理逻辑
+
+
+
+
+
+#### 文本翻译
+
+
+
+
+
+#### 指令跟随
+
+
+
+
+
+#### 特殊字符
+
+
+
+
@@ -259,52 +463,12 @@ python demo/gradio_based_demo.py
## 工作引用
-* 如果觉得MiniCPM有助于您的工作,请考虑引用下列[技术报告](todo)
+* 如果觉得MiniCPM有助于您的工作,请引用我们的[技术报告](todo)
```
@inproceedings{minicpm2024,
- title={MiniCPM: todo},
+ title={MiniCPM: Unveiling the Potential of End-side Large Language Models},
booktitle={OpenBMB Blog},
year={2024}
}
```
-
-
-
-## 典型示例
-
-#### 文本生成
-
-
-
-
-
-#### 代码生成
-
-
-
-
-
-#### 数理逻辑
-
-
-
-
-
-#### 文本翻译
-
-
-
-
-
-#### 指令跟随
-
-
-
-
-
-#### 特殊字符
-
-
-
-
diff --git a/assets/COCO_test2015_000000262144.jpg b/assets/COCO_test2015_000000262144.jpg
new file mode 100644
index 0000000..012f88d
Binary files /dev/null and b/assets/COCO_test2015_000000262144.jpg differ
diff --git a/assets/creation.case1.png b/assets/creation.case1.png
index e4a7b1c..3f6d1aa 100644
Binary files a/assets/creation.case1.png and b/assets/creation.case1.png differ
diff --git a/assets/creation.case2.png b/assets/creation.case2.png
new file mode 100644
index 0000000..e4a7b1c
Binary files /dev/null and b/assets/creation.case2.png differ
diff --git a/assets/creation.case3.png b/assets/creation.case3.png
new file mode 100644
index 0000000..08e1eba
Binary files /dev/null and b/assets/creation.case3.png differ
diff --git a/assets/instruction_following.case3.png b/assets/instruction_following.case3.png
deleted file mode 100644
index e8ddd65..0000000
Binary files a/assets/instruction_following.case3.png and /dev/null differ
diff --git a/assets/instruction_following.case4.png b/assets/instruction_following.case4.png
deleted file mode 100644
index 91495ca..0000000
Binary files a/assets/instruction_following.case4.png and /dev/null differ
diff --git a/assets/math.case1.png b/assets/math.case1.png
index 61bb4e9..617f0b4 100644
Binary files a/assets/math.case1.png and b/assets/math.case1.png differ
diff --git a/assets/math.case2.png b/assets/math.case2.png
index 4b4c219..fab8b54 100644
Binary files a/assets/math.case2.png and b/assets/math.case2.png differ
diff --git a/assets/special_char.case1.png b/assets/special_char.case1.png
new file mode 100644
index 0000000..7f51de1
Binary files /dev/null and b/assets/special_char.case1.png differ
diff --git a/assets/special_char.case2.png b/assets/special_char.case2.png
new file mode 100644
index 0000000..43a5d6c
Binary files /dev/null and b/assets/special_char.case2.png differ
diff --git a/assets/translation.case1.png b/assets/translation.case1.png
index abfe181..c2a23af 100644
Binary files a/assets/translation.case1.png and b/assets/translation.case1.png differ
diff --git a/assets/translation.case2.png b/assets/translation.case2.png
index b81d414..95dee81 100644
Binary files a/assets/translation.case2.png and b/assets/translation.case2.png differ
diff --git a/demo/gradio_based_demo.py b/demo/hf_based_demo.py
similarity index 96%
rename from demo/gradio_based_demo.py
rename to demo/hf_based_demo.py
index b6da175..1cd9289 100644
--- a/demo/gradio_based_demo.py
+++ b/demo/hf_based_demo.py
@@ -2,6 +2,7 @@ from typing import Dict
from typing import List
from typing import Tuple
+import argparse
import gradio as gr
import torch
from threading import Thread
@@ -11,14 +12,17 @@ from transformers import (
TextIteratorStreamer
)
+parser = argparse.ArgumentParser()
+parser.add_argument("--model_path", type=str, default="")
+args = parser.parse_args()
# init model and tokenizer
-path = "openbmb/miniCPM-dpo-fp32"
+path = args.model_path
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)
-def hf_gen(dialog: str, top_p: float, temperature: float, max_dec_len: int):
+def hf_gen(dialog: List, top_p: float, temperature: float, max_dec_len: int):
"""generate model output with huggingface api
Args:
@@ -122,6 +126,7 @@ def reverse_last_round(chat_history):
assert len(chat_history) >= 1, "History is empty. Nothing to reverse!!"
return chat_history[:-1]
+
# launch gradio demo
with gr.Blocks(theme="soft") as demo:
gr.Markdown("""# MiniCPM Gradio Demo""")
diff --git a/demo/vllm_based_demo.py b/demo/vllm_based_demo.py
new file mode 100644
index 0000000..3789380
--- /dev/null
+++ b/demo/vllm_based_demo.py
@@ -0,0 +1,161 @@
+from typing import Dict
+from typing import List
+from typing import Tuple
+
+import argparse
+import gradio as gr
+from vllm import LLM, SamplingParams
+
+
+parser = argparse.ArgumentParser()
+parser.add_argument("--model_path", type=str, default="")
+args = parser.parse_args()
+
+# init model and tokenizer
+path = args.model_path
+llm = LLM(model=path, tensor_parallel_size=1, dtype="bfloat16")
+
+
+def vllm_gen(dialog: List, top_p: float, temperature: float, max_dec_len: int):
+    """generate model output with the vLLM api
+
+    Args:
+        dialog (List): all previous messages as role/content dicts, ending with the current user query.
+        top_p (float): only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
+        temperature (float): strictly positive float value used to modulate the logits distribution.
+        max_dec_len (int): the maximum number of tokens to generate.
+
+    Returns:
+        str: generation result of the vLLM model
+    """
+ prompt = ""
+ assert len(dialog) % 2 == 1
+ for info in dialog:
+ if info["role"] == "user":
+ prompt += "<用户>" + info["content"]
+ else:
+            prompt += "<AI>" + info["content"]
+    prompt += "<AI>"
+ params_dict = {
+ "n": 1,
+ "best_of": 1,
+ "presence_penalty": 1.0,
+ "frequency_penalty": 0.0,
+ "temperature": temperature,
+ "top_p": top_p,
+ "top_k": -1,
+ "use_beam_search": False,
+ "length_penalty": 1,
+ "early_stopping": False,
+ "stop": None,
+ "stop_token_ids": None,
+ "ignore_eos": False,
+ "max_tokens": max_dec_len,
+ "logprobs": None,
+ "prompt_logprobs": None,
+ "skip_special_tokens": True,
+ }
+ sampling_params = SamplingParams(**params_dict)
+ outputs = llm.generate(prompts=prompt, sampling_params=sampling_params)[0]
+ generated_text = outputs.outputs[0].text
+ return generated_text
+
+
+def generate(chat_history: List, query: str, top_p: float, temperature: float, max_dec_len: int):
+ """generate after hitting "submit" button
+
+ Args:
+ chat_history (List): [[q_1, a_1], [q_2, a_2], ..., [q_n, a_n]]. list that stores all QA records
+ query (str): query of current round
+ top_p (float): only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
+ temperature (float): strictly positive float value used to modulate the logits distribution.
+        max_dec_len (int): the maximum number of tokens to generate.
+
+    Returns:
+        Tuple: an update that clears the input box, and chat_history extended with the current round's QA.
+ """
+ assert query != "", "Input must not be empty!!!"
+ # apply chat template
+ model_input = []
+ for q, a in chat_history:
+ model_input.append({"role": "user", "content": q})
+ model_input.append({"role": "assistant", "content": a})
+ model_input.append({"role": "user", "content": query})
+ # yield model generation
+ model_output = vllm_gen(model_input, top_p, temperature, max_dec_len)
+ chat_history.append([query, model_output])
+ return gr.update(value=""), chat_history
+
+
+def regenerate(chat_history: List, top_p: float, temperature: float, max_dec_len: int):
+ """re-generate the answer of last round's query
+
+ Args:
+ chat_history (List): [[q_1, a_1], [q_2, a_2], ..., [q_n, a_n]]. list that stores all QA records
+ top_p (float): only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
+ temperature (float): strictly positive float value used to modulate the logits distribution.
+        max_dec_len (int): the maximum number of tokens to generate.
+
+    Returns:
+        Tuple: an update that clears the input box, and chat_history with the last round's answer regenerated.
+ """
+ assert len(chat_history) >= 1, "History is empty. Nothing to regenerate!!"
+ # apply chat template
+ model_input = []
+ for q, a in chat_history[:-1]:
+ model_input.append({"role": "user", "content": q})
+ model_input.append({"role": "assistant", "content": a})
+ model_input.append({"role": "user", "content": chat_history[-1][0]})
+ # yield model generation
+ model_output = vllm_gen(model_input, top_p, temperature, max_dec_len)
+ chat_history[-1][1] = model_output
+ return gr.update(value=""), chat_history
+
+
+def clear_history():
+ """clear all chat history
+
+ Returns:
+ List: empty chat history
+ """
+ return []
+
+
+def reverse_last_round(chat_history):
+ """reverse last round QA and keep the chat history before
+
+ Args:
+ chat_history (List): [[q_1, a_1], [q_2, a_2], ..., [q_n, a_n]]. list that stores all QA records
+
+ Returns:
+ List: [[q_1, a_1], [q_2, a_2], ..., [q_n-1, a_n-1]]. chat_history without last round.
+ """
+ assert len(chat_history) >= 1, "History is empty. Nothing to reverse!!"
+ return chat_history[:-1]
+
+
+# launch gradio demo
+with gr.Blocks(theme="soft") as demo:
+ gr.Markdown("""# MiniCPM Gradio Demo""")
+
+ with gr.Row():
+ with gr.Column(scale=1):
+ top_p = gr.Slider(0, 1, value=0.8, step=0.1, label="top_p")
+ temperature = gr.Slider(0.1, 2.0, value=0.5, step=0.1, label="temperature")
+ max_dec_len = gr.Slider(1, 1024, value=1024, step=1, label="max_dec_len")
+ with gr.Column(scale=5):
+ chatbot = gr.Chatbot(bubble_full_width=False, height=400)
+ user_input = gr.Textbox(label="User", placeholder="Input your query here!", lines=8)
+ with gr.Row():
+ submit = gr.Button("Submit")
+ clear = gr.Button("Clear")
+ regen = gr.Button("Regenerate")
+ reverse = gr.Button("Reverse")
+
+ submit.click(generate, inputs=[chatbot, user_input, top_p, temperature, max_dec_len], outputs=[user_input, chatbot])
+ regen.click(regenerate, inputs=[chatbot, top_p, temperature, max_dec_len], outputs=[user_input, chatbot])
+ clear.click(clear_history, inputs=[], outputs=[chatbot])
+ reverse.click(reverse_last_round, inputs=[chatbot], outputs=[chatbot])
+
+demo.queue()
+demo.launch(server_name="127.0.0.1", show_error=True)
diff --git a/finetune/README.md b/finetune/README.md
index 4c92758..1b1c182 100644
--- a/finetune/README.md
+++ b/finetune/README.md
@@ -1,5 +1,7 @@
# MiniCPM 微调
+[English Version](https://github.com/OpenBMB/MiniCPM/blob/main/finetune/README_en.md)
+
本目录提供 MiniCPM-2B 模型的微调示例,包括全量微调和 PEFT。格式上,提供多轮对话微调样例和输入输出格式微调样例。
如果将模型下载到了本地,本文和代码中的 `OpenBMB/MiniCPM-2B` 字段均应替换为相应地址以从本地加载模型。
diff --git a/finetune/README_en.md b/finetune/README_en.md
new file mode 100644
index 0000000..4ce39f4
--- /dev/null
+++ b/finetune/README_en.md
@@ -0,0 +1,92 @@
+# MiniCPM Fine-tuning
+
+[中文版](https://github.com/OpenBMB/MiniCPM/blob/main/finetune/README.md)
+
+This directory provides examples of fine-tuning the MiniCPM-2B model, including full model fine-tuning and PEFT. In terms of format, we offer examples for multi-turn dialogue fine-tuning and input-output format fine-tuning.
+
+If you have downloaded the model to your local machine, replace the `OpenBMB/MiniCPM-2B` field in this document and in the code with the corresponding local path so that the model is loaded locally.
+
+Running the example requires `python>=3.10`. Besides the basic `torch` dependency, additional dependencies are needed to run the example code.
+
+
+
+**We have provided an [example notebook](lora_finetune.ipynb) to demonstrate how to process data and use the fine-tuning script with AdvertiseGen as an example.**
+
+```bash
+pip install -r requirements.txt
+```
+
+## Testing Hardware Standard
+
+We only provide examples for single-node multi-GPU/multi-node multi-GPU setups, so you will need at least one machine with multiple GPUs. In the **default configuration file** in this repository, we have documented the memory usage:
+
++ Full-parameter SFT fine-tuning: distributed evenly across 4 GPUs, each GPU consumes `30245MiB` of memory.
++ LoRA fine-tuning: a single GPU, consuming `10619MiB` of memory.
+
+> Please note that these results are for reference only, and memory consumption may vary with different parameters. Please adjust according to your hardware situation.
+
+## Multi-Turn Dialogue Format
+
+The multi-turn dialogue fine-tuning example follows the ChatGLM3 dialogue format, applying a different `loss_mask` to each role so that the `loss` over multiple assistant replies is computed in a single pass.
+
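+Concretely, only assistant tokens receive real labels; user and role-marker tokens are masked with `-100` so that they do not contribute to the loss. A minimal sketch of the idea (simplified; see `finetune.py` for the exact label construction, including BOS/EOS handling):
+
+```python
+IGNORE_INDEX = -100  # positions labeled -100 are ignored by the cross-entropy loss
+
+def build_labels(input_ids, assistant_spans):
+    """assistant_spans: list of (start, end) index pairs covering assistant replies."""
+    labels = [IGNORE_INDEX] * len(input_ids)
+    for start, end in assistant_spans:
+        labels[start:end] = input_ids[start:end]  # compute loss only on assistant tokens
+    return labels
+```
+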
+For the data file, the example uses the following format
+
+```json
+[
+ {
+ "conversations": [
+ {
+ "role": "system",
+ "content": ""
+ },
+ {
+ "role": "user",
+ "content": ""
+ },
+ {
+ "role": "assistant",
+ "content": ""
+ },
+            // ... Multi Turn
+ {
+ "role": "user",
+ "content": ""
+ },
+ {
+ "role": "assistant",
+ "content": ""
+ }
+ ]
+ }
+ // ...
+]
+```
+
+## Dataset Format Example
+
+Taking the AdvertiseGen dataset as an example, download it from [Google Drive](https://drive.google.com/file/d/13_vf0xRTQsyneRKdD1bZIr93vBGOczrk/view?usp=sharing) or [Tsinghua Cloud](https://cloud.tsinghua.edu.cn/f/b3f119a008264b1cabd1/?dl=1), extract the AdvertiseGen directory, place it in the `data` directory, and convert it into a dataset in the following format.
+
+
+> Please note that the fine-tuning code now includes a validation set, so a complete fine-tuning dataset must contain a training set and a validation set; the test set is optional, or the validation set can be reused in its place.
+
+```
+{"conversations": [{"role": "user", "content": "类型#裙*裙长#半身裙"}, {"role": "assistant", "content": "这款百搭时尚的仙女半身裙,整体设计非常的飘逸随性,穿上之后每个女孩子都能瞬间变成小仙女啦。料子非常的轻盈,透气性也很好,穿到夏天也很舒适。"}]}
+```
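+
+A minimal conversion sketch is shown below (the field names `content`/`summary` and the file paths are assumptions based on the public AdvertiseGen release; see the example notebook for the exact preprocessing):
+
+```python
+import json
+
+def convert(src_path: str, dst_path: str) -> None:
+    # Turn raw AdvertiseGen records ({"content": ..., "summary": ...}) into the
+    # conversations format expected by the fine-tuning scripts.
+    with open(src_path, encoding="utf-8") as fin, open(dst_path, "w", encoding="utf-8") as fout:
+        for line in fin:
+            item = json.loads(line)
+            sample = {"conversations": [
+                {"role": "user", "content": item["content"]},
+                {"role": "assistant", "content": item["summary"]},
+            ]}
+            fout.write(json.dumps(sample, ensure_ascii=False) + "\n")
+
+convert("data/AdvertiseGen/train.json", "data/AdvertiseGen/train_conv.json")
+convert("data/AdvertiseGen/dev.json", "data/AdvertiseGen/dev_conv.json")
+```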
+
+## Start Fine-tuning
+
+Execute **single-node multi-GPU/multi-node multi-GPU** runs with the following code.
+
+```bash
+cd finetune
+bash sft_finetune.sh
+```
+
+Execute **single-node single-GPU** runs with the following code.
+
+```bash
+cd finetune
+bash lora_finetune.sh
+```
diff --git a/finetune/finetune.py b/finetune/finetune.py
index 3d050db..2cbd6cc 100644
--- a/finetune/finetune.py
+++ b/finetune/finetune.py
@@ -73,7 +73,7 @@ class SupervisedDataset(Dataset):
def preprocessing(self, example):
input_ids = [self.tokenizer.bos_token_id]
- label_ids = [self.tokenizer.bos_token_id]
+ label_ids = []
for message in example["messages"]:
role = message["role"]
@@ -82,18 +82,17 @@ class SupervisedDataset(Dataset):
if role == "user":
input_ids += self.user_tokens + content_ids
- label_ids += (
- [self.ignore_index] * len(self.user_tokens)
- + [self.tokenizer.eos_token_id]
- + [self.ignore_index] * len(content_ids)
- )
+ label_ids += [self.ignore_index] * len(self.user_tokens) + [
+ self.ignore_index
+ ] * len(content_ids)
else:
input_ids += self.assistant_tokens + content_ids
- label_ids += [self.ignore_index] * len(
- self.assistant_tokens
- ) + content_ids
- input_ids.append(self.tokenizer.eos_token_id)
- label_ids.append(self.tokenizer.eos_token_id)
+ label_ids += (
+ [self.ignore_index] * len(self.assistant_tokens)
+ + content_ids
+ + [self.tokenizer.eos_token_id]
+ )
+
input_ids = input_ids[: self.model_max_length]
label_ids = label_ids[: self.model_max_length]
# input_ids += [self.tokenizer.eos_token_id] * (len(label_ids) - len(input_ids))
diff --git a/inference/README.md b/inference/README.md
new file mode 100644
index 0000000..903c6b3
--- /dev/null
+++ b/inference/README.md
@@ -0,0 +1,60 @@
+# vLLM 推理 MiniCPM | MiniCPM inference with vLLM
+
+### 中文
+
+* 安装支持 MiniCPM 的 vLLM
+ - 因为 MiniCPM 采用 MUP 结构,在矩阵乘法中存在一定的放缩计算,与Llama类模型结构有细微差别。
+ - 我们基于版本为 0.2.2 的 vLLM 实现了 MiniCPM 的推理,代码位于仓库[inference](https://github.com/OpenBMB/MiniCPM/tree/main/inference)文件夹下,未来将会支持更新的vLLM 版本。
+
+* 安装支持 MiniCPM 的 vLLM 版本
+```shell
+pip install inference/vllm
+```
+
+* 将Huggingface Transformers仓库转为vLLM-MiniCPM支持的格式,其中`<hf_repo_path>`, `<vllmcpm_repo_path>`均为本地路径
+```shell
+python inference/convert_hf_to_vllmcpm.py --load <hf_repo_path> --save <vllmcpm_repo_path>
+```
+
+* 测试样例
+```shell
+cd inference/vllm/examples/infer_cpm
+python inference.py --model_path <vllmcpm_repo_path> --prompt_path prompts/prompt_demo.txt
+```
+
+* 期望输出
+```shell
+<用户>: Which city is the capital of China?
+<AI>:
+ The capital city of China is Beijing. Beijing is a major political, cultural, and economic center in China, and it is known for its rich history, beautiful architecture, and vibrant nightlife. It is also home to many of China's most important cultural and historical sites, including the Forbidden City, the Great Wall of China, and the Temple of Heaven. Beijing is a popular destination for tourists from around the world, and it is an important hub for international business and trade.
+```
+
+### English
+
+
+* Install vLLM which supports MiniCPM
+  - The structure of MiniCPM is not exactly the same as Llama's, since MiniCPM uses the MUP structure and applies extra scaling in its matrix multiplications.
+ - We implemented the inference of MiniCPM in vLLM 0.2.2, and the code is located at [inference](https://github.com/OpenBMB/MiniCPM/tree/main/inference). Newer vLLM versions will be supported in the future.
+
+* Install vLLM which supports MiniCPM
+```shell
+pip install inference/vllm
+```
+
+* Convert the Huggingface Transformers repo to the vLLM-MiniCPM format, where `<hf_repo_path>`, `<vllmcpm_repo_path>` are local paths
+```shell
+python inference/convert_hf_to_vllmcpm.py --load <hf_repo_path> --save <vllmcpm_repo_path>
+```
+
+* Test cases
+```shell
+cd inference/vllm/examples/infer_cpm
+python inference.py --model_path <vllmcpm_repo_path> --prompt_path prompts/prompt_demo.txt
+```
+
+* Expected Output
+```shell
+<用户>: Which city is the capital of China?
+<AI>:
+ The capital city of China is Beijing. Beijing is a major political, cultural, and economic center in China, and it is known for its rich history, beautiful architecture, and vibrant nightlife. It is also home to many of China's most important cultural and historical sites, including the Forbidden City, the Great Wall of China, and the Temple of Heaven. Beijing is a popular destination for tourists from around the world, and it is an important hub for international business and trade.
+```
\ No newline at end of file
diff --git a/inference/vllm/examples/infer_cpm/prompts/prompt_demo.txt b/inference/vllm/examples/infer_cpm/prompts/prompt_demo.txt
index b678a4b..e7aacb9 100644
--- a/inference/vllm/examples/infer_cpm/prompts/prompt_demo.txt
+++ b/inference/vllm/examples/infer_cpm/prompts/prompt_demo.txt
@@ -1 +1 @@
-Where is my car?怎么翻译
+Which city is the capital of China?