Merge branch 'main' into add_cpmV_hfdemo

cxcz 2024-06-28 18:13:19 +08:00, committed by GitHub
commit 8c6d6f8615
298 changed files with 278913 additions and 93273 deletions

.gitignore

@ -2,3 +2,7 @@
*.pyc
finetune/output/*
wip.*
.idea
venv
.venv
.env

MiniCPM Model License.md (new file)

@ -0,0 +1,41 @@
Version 1.0, June 5, 2024
© 2024 OpenBMB. All rights reserved.
## Part One: Preamble
We are opening the entire series of the globally leading MiniCPM edge-side large language models, including the flagship edge-side models MiniCPM-2.4B and MiniCPM-1.2B, as well as the world's most powerful edge multimodal models MiniCPM-V series. The aforementioned weights are completely open for all academic research. Commercial use is also allowed after filling out a registration questionnaire. Community use of the MiniCPM series models must comply with Apache 2.0 and the "MiniCPM Model Community License Agreement."
Therefore, you and the MiniCPM development team agree to the following "MiniCPM Model Community License Agreement":
## Part Two: Licensing and Redistribution
#### 1. Grant of Rights
You are granted a non-exclusive, worldwide, non-transferable, royalty-free, limited license to use, copy, distribute, reproduce, create derivative works from, and modify MiniCPM materials in accordance with OpenBMB's intellectual property rights or other rights in the MiniCPM materials.
#### 2. Distribution and Redistribution
- If you distribute or provide MiniCPM series model materials (or any derivative works thereof), or any product or service that uses any of them, you must (A) provide a copy of this agreement; and (B) prominently display "Built with 面壁MiniCPM" on the relevant website, user interface, blog post, about page, or product documentation. If you create, train, fine-tune, or improve an AI model using the MiniCPM series models, the model must include "MiniCPM" in its name.
- You must retain the following attribution statement in all distributed MiniCPM-related materials: "MiniCPM is licensed under the MiniCPM Model Community License, © OpenBMB Platforms, Inc. All rights reserved."
- Your use of MiniCPM materials must comply with applicable laws and regulations and the "MiniCPM Model Community License Agreement," which is incorporated into this agreement by reference.
- You may not use MiniCPM series models or their outputs and results to improve any other large language models (other than MiniCPM or its derivatives).
#### 3. Additional Commercial Terms
If you or your affiliates' services or products deploy the model on edge-side devices not exceeding 5,000 units, or provide applications with a daily active user count (DAU) below 1 million, you may apply to OpenBMB directly and, after filling out the registration questionnaire, may be permitted to use it commercially free of charge. Otherwise, please email cpm@modelbest.cn to apply to OpenBMB for authorization; OpenBMB may decide at its sole discretion whether to grant authorization and determine its duration and scope, and before written authorization is granted you have no right to exercise any commercial rights under this agreement.
#### 4. Usage-based Restrictions
The restrictions set forth in Appendix A are considered usage-based restrictions. Therefore, you may not use the model or its derivatives for the designated restricted uses. You may use the model under this license only for lawful purposes and in compliance with its terms. Usage includes creating any content, fine-tuning, updating, running, training, evaluating, and/or re-parameterizing the model. You should require all users of the model or its derivatives to comply with the terms of this section.
## Part Three: Other Terms
#### 5. Trademarks and Related
This license does not grant you the right to use the OpenBMB, ModelBest (面壁智能), or MiniCPM trademarks, trade names, or logos, or to otherwise imply a relationship between the parties; any rights not expressly granted herein are reserved by OpenBMB.
#### 6. Disclaimer
Unless required by applicable law or agreed to in writing, OpenBMB provides the model and supplemental materials "as is," without any warranty or condition, express or implied, including but not limited to all express and implied warranties or conditions of title, non-infringement, merchantability, or fitness for a particular purpose. You are solely responsible for determining the appropriateness of using or redistributing the model, its derivatives, and supplemental materials, and assume any risks associated with exercising the permissions under this license.
## Appendix A: Usage Restrictions
You agree not to use the model or its derivatives for:
- Any use that violates applicable national or international laws or regulations or infringes upon the legal rights and interests of any third party;
- Any military purposes;
- Exploiting, harming, or attempting to exploit or harm minors in any way;
- Generating or disseminating verifiable false information and/or content with the intent to harm others;
- Generating or disseminating inappropriate content that is subject to applicable regulatory requirements;
- Unauthorized generation or dissemination of personally identifiable information, or unreasonable use thereof;
- Defamation, demeaning, or otherwise harassing others;
- Fully automated decision-making that adversely affects individuals' legal rights or creates or modifies binding, enforceable obligations;
- Any use intended to or having the effect of discriminating or harming individuals or groups based on online or offline social behaviors or known or predicted personal characteristics;
- Exploiting the vulnerabilities of specific groups due to their age, social, physical, or psychological characteristics, in a manner that materially distorts the behavior of group members, leading to or likely leading to physical or psychological harm to the person or others;
- Any use intended to or having the effect of discriminating against individuals or groups based on legally protected characteristics or categories.


@ -0,0 +1,43 @@
Version 1.0, June 5, 2024
Copyright © 2024 OpenBMB
## Part One: Preamble
We are open-sourcing the entire series of the globally leading MiniCPM end-side models, including the flagship end-side models MiniCPM-2.4B and MiniCPM-1.2B, as well as the globally leading end-side multimodal MiniCPM-V series. The above weights are completely open for all academic research. Commercial use is also permitted after filling out a registration questionnaire. Community use of the MiniCPM series models must follow Apache 2.0 and the "MiniCPM Model Community License Agreement".
Accordingly, you and the MiniCPM development team agree to the following "MiniCPM Model Commercial License Agreement":
## Part Two: Licensing and Redistribution
#### 1. Grant of Rights
You are granted a non-exclusive, worldwide, non-transferable, royalty-free, limited license, under OpenBMB's intellectual property or other rights in the MiniCPM materials, to use, copy, distribute, reproduce, create derivative works from, and modify the MiniCPM materials.
#### 2. Distribution and Redistribution
- If you distribute or provide the MiniCPM series model materials (or any derivative works thereof), or any product or service that uses any of them, you must (A) provide a copy of this agreement; and (B) prominently display "Built with 面壁MiniCPM" on the relevant website, user interface, blog post, about page, or product documentation. If you use the MiniCPM series models to create, train, fine-tune, or improve an AI model, that model must include "MiniCPM" in its name.
- You must retain the following attribution statement in all MiniCPM-related materials you distribute: "面壁MiniCPM is licensed under the MiniCPM Model Community License, Copyright © 面壁智能 (ModelBest) Platforms, Inc. All rights reserved."
- Your use of the MiniCPM materials must comply with applicable laws and regulations and with the "MiniCPM Model Community License Agreement", which is incorporated into this agreement by reference.
- You may not use the MiniCPM series models or their outputs and results to improve any other large language model (other than MiniCPM or its derivatives).
#### 3. Additional Commercial Terms
If you or your affiliates' services or products deploy the model on edge-side devices not exceeding 5,000 units, or provide applications with a daily active user count (DAU) below 1 million, you may apply to ModelBest directly and, after filling out the registration questionnaire, may be permitted to use it commercially free of charge. Otherwise, please email cpm@modelbest.cn to apply to ModelBest for authorization; we may decide at our sole discretion whether to grant authorization and determine its duration and scope. Before we grant written authorization, you have no right to exercise any commercial rights and may not use the model for any commercial purpose.
#### 4. Usage-based Restrictions
The restrictions set forth in Appendix A are considered usage-based restrictions. Therefore, you may not use the model or its derivative works for the designated restricted uses. You may use the model under this license only for lawful purposes and in compliance with the license. Usage includes creating any content, fine-tuning, updating, running, training, evaluating, and/or re-parameterizing the model. You shall require all users of the model or its derivative works to comply with the terms of this section.
## Part Three: Other Terms
#### 5. Trademarks and Related
This license does not grant you the right to use the OpenBMB, ModelBest (面壁智能), or MiniCPM trademarks, trade names, or logos, or to otherwise imply a relationship between the parties; any rights not expressly granted herein are reserved by OpenBMB.
#### 6. Disclaimer
Unless required by applicable law or agreed to in writing, OpenBMB provides the model and supplemental materials "as is", without warranties or conditions of any kind, express or implied, including but not limited to all express and implied warranties or conditions of title, non-infringement, merchantability, or fitness for a particular purpose. You are solely responsible for determining the appropriateness of using or redistributing the model, its derivative works, and supplemental materials, and you assume any risks arising from exercising the rights under this license.
## Appendix A: Usage Restrictions
You agree not to use the model or its derivative works for:
- Any manner that violates applicable national or international laws or regulations, or infringes the lawful rights and interests of any third party;
- Any military purpose;
- Exploiting, harming, or attempting to exploit or harm minors in any way;
- Generating or disseminating verifiably false information and/or content with the intent of harming others;
- Generating or disseminating inappropriate content that is subject to applicable regulatory requirements;
- Generating or disseminating personally identifiable information without authorization, or making unreasonable use of it;
- Defaming, demeaning, or otherwise harassing others;
- Fully automated decision-making that adversely affects an individual's legal rights or creates or modifies binding, enforceable obligations;
- Any use intended to or having the effect of discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal characteristics;
- Exploiting the vulnerabilities of a specific group of people due to their age, social, physical, or psychological characteristics in a manner that materially distorts the behavior of members of that group, causing or likely to cause physical or psychological harm to them or others;
- Any use intended to or having the effect of discriminating against individuals or groups based on legally protected characteristics or categories.


@ -11,27 +11,30 @@
</h4>
<p align="center">
<a href="https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20?pvs=4" target="_blank">Technical Blog</a> |
<a href="https://github.com/OpenBMB/OmniLMM/" target="_blank">Multi-modal Model OmniLMM</a> |
<a href="https://luca.cn/" target="_blank">CPM-C 100B Model Trial</a> |
Join our <a href="https://discord.gg/3cGQn9b3YM" target="_blank">discord</a> and <a href="https://github.com/OpenBMB/MiniCPM/blob/main/assets/wechat.jpg" target="_blank">wechat</a>
<a href="https://openbmb.vercel.app/" target="_blank">Technical Blog</a> |
<a href="https://arxiv.org/abs/2404.06395" target="_blank">MiniCPM Paper</a> |
<a href="https://github.com/OpenBMB/MiniCPM-V/" target="_blank">MiniCPM-V Repo</a> |
Join our <a href="https://discord.gg/3cGQn9b3YM" target="_blank">discord</a> and <a href="https://github.com/OpenBMB/MiniCPM/blob/main/assets/wechat.jpg" target="_blank">WeChat</a>
</p>
MiniCPM is an End-Side LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings (2.7B in total).
- After SFT, MiniCPM performs comparably to Mistral-7B on open-source general benchmarks, with stronger Chinese, mathematics, and coding abilities. Its overall performance exceeds Llama2-13B, MPT-30B, Falcon-40B, etc.
- After DPO, MiniCPM outperforms Llama2-70B-Chat, Vicuna-33B, Mistral-7B-Instruct-v0.1, Zephyr-7B-alpha, etc. on MTBench.
- MiniCPM-V, based on MiniCPM-2B, achieves the best overall performance among multimodal models of the same scale, surpassing existing multimodal large models built on Phi-2 and matching or even exceeding 9.6B Qwen-VL-Chat on some tasks.
- MiniCPM-V 2.0, based on MiniCPM-2B, achieves state-of-the-art performance on multiple benchmarks among models under 7B parameters. It even outperforms the strong Qwen-VL-Chat 9.6B, CogVLM-Chat 17.4B, and Yi-VL 34B on OpenCompass. MiniCPM-V 2.0 also shows strong OCR capability, achieving performance comparable to Gemini Pro in scene-text understanding.
- MiniCPM can be deployed for inference on smartphones, with streaming output faster than human speaking speed. MiniCPM-V has also been successfully deployed on smartphones.
- The cost of developing on top of MiniCPM is low: parameter-efficient finetuning can be done with a single 1080/2080 GPU, and full-parameter finetuning with a 3090/4090 GPU.
We release all model parameters for research and limited commercial use.
- SFT and DPO version based on MiniCPM-2B and human preference: **MiniCPM-2B-SFT/DPO**
- The multi-modal model **MiniCPM-V** based on MiniCPM-2B, which outperforms models with similar size, i.e., Phi-2
- SFT and DPO version based on MiniCPM-2B: **MiniCPM-2B-SFT/DPO**
- The multi-modal model **MiniCPM-V 2.0** based on MiniCPM-2B.
- The INT4 quantized version **MiniCPM-2B-SFT/DPO-Int4** based on MiniCPM-2B-SFT/DPO
- The 128k long context version of MiniCPM-2B: **MiniCPM-2B-128k**.
- The MoE version of MiniCPM-2B: **MiniCPM-MoE-8x2B**.
- SFT version of MiniCPM-1B, a lighter-weight model: **MiniCPM-1B-SFT**.
- Mobile phone applications based on MLC-LLM and LLMFarm. Both the language model and the multimodal model can run inference on smartphones.
- 30 intermediate [checkpoints](https://huggingface.co/openbmb/MiniCPM-2B-history) for academic purposes.
- 30 intermediate [checkpoints](https://huggingface.co/openbmb/MiniCPM-2B-history) of MiniCPM-2B for academic purposes.
### Limitations
@ -57,6 +60,7 @@ We release all model parameters for research and limited commercial use.
<p id="0"></p>
## Update Log
- **2024/04/11 We release [MiniCPM-V 2.0](https://huggingface.co/openbmb/MiniCPM-V-2.0), [MiniCPM-2B-128k](https://huggingface.co/openbmb/MiniCPM-2B-128k), [MiniCPM-MoE-8x2B](https://huggingface.co/openbmb/MiniCPM-MoE-8x2B) and [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16)! Click [here](https://openbmb.vercel.app/) to read our technical blog.**
- 2024/03/16 Intermediate checkpoints were released [here](https://huggingface.co/openbmb/MiniCPM-2B-history)!
- 2024/02/13 We support llama.cpp
- 2024/02/09 We have included a [Community](#community) section in the README to encourage support for MiniCPM from the open-source community.
@ -69,30 +73,23 @@ We release all model parameters for research and limited commercial use.
* Language Model
| HuggingFace | ModelScope | WiseModel | Replicate |
|-------------|------------|-----------|-----------|
|[MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16)|[MiniCPM-2B-sft-bf16](https://modelscope.cn/models/OpenBMB/miniCPM-bf16)|[MiniCPM-2B-sft-bf16](https://wisemodel.cn/models/OpenBMB/miniCPM-bf16)
|[MiniCPM-2B-sft-fp32](https://huggingface.co/openbmb/MiniCPM-2B-sft-fp32)|[MiniCPM-2B-sft-fp32](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-sft-fp32)|[MiniCPM-2B-sft-fp32](https://wisemodel.cn/models/OpenBMB/miniCPM-dpo-fp32)
|[MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16)|[MiniCPM-2B-dpo-bf16](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-dpo-bf16/summary)|[MiniCPM-2B-dpo-bf16](https://wisemodel.cn/models/OpenBMB/MiniCPM-2B-dpo-bf16)|[MiniCPM-2B-dpo-bf16](https://replicate.com/tuantuanzhang/minicpm)
|[MiniCPM-2B-dpo-fp16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-fp16)|[MiniCPM-2B-dpo-fp16](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-dpo-fp16/)|[MiniCPM-2B-dpo-fp16](https://wisemodel.cn/models/OpenBMB/MiniCPM-2B-dpo-fp16)
|[MiniCPM-2B-dpo-fp32](https://huggingface.co/openbmb/MiniCPM-2B-dpo-fp32)|[MiniCPM-2B-dpo-fp32](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-dpo-fp32)|[MiniCPM-2B-dpo-fp32](https://wisemodel.cn/models/OpenBMB/miniCPM-dpo-fp32)
|[MiniCPM-2B-sft-fp32-llama-format](https://huggingface.co/openbmb/MiniCPM-2B-sft-fp32-llama-format)|
|[MiniCPM-2B-sft-bf16-llama-format](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16-llama-format)|
|[MiniCPM-2B-dpo-bf16-llama-format](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16-llama-format)|
|[MiniCPM-2B-dpo-fp16-gguf](https://huggingface.co/runfuture/MiniCPM-2B-dpo-fp16-gguf) |
|[MiniCPM-2B-dpo-q4km-gguf](https://huggingface.co/runfuture/MiniCPM-2B-dpo-q4km-gguf) |
| HuggingFace | ModelScope | WiseModel |
|-------------|------------|-----------|
|[MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16)|[MiniCPM-2B-sft-bf16](https://modelscope.cn/models/OpenBMB/miniCPM-bf16)|[MiniCPM-2B-sft-bf16](https://wisemodel.cn/models/OpenBMB/miniCPM-bf16)|
|[MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16)|[MiniCPM-2B-dpo-bf16](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-dpo-bf16/summary)|[MiniCPM-2B-dpo-bf16](https://wisemodel.cn/models/OpenBMB/MiniCPM-2B-dpo-bf16)|
|[MiniCPM-2B-128k](https://huggingface.co/openbmb/MiniCPM-2B-128k) |[MiniCPM-2B-128k](https://modelscope.cn/models/openbmb/MiniCPM-2B-128k/summary)|
|[MiniCPM-MoE-8x2B](https://huggingface.co/openbmb/MiniCPM-MoE-8x2B) |[MiniCPM-MoE-8x2B](https://modelscope.cn/models/OpenBMB/MiniCPM-MoE-8x2B)|
|[MiniCPM-1B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16) | [MiniCPM-1B-sft-bf16](https://modelscope.cn/models/OpenBMB/MiniCPM-1B-sft-bf16) |
Note:
1. The model training was conducted in bf16 format, so inference using bf16 will yield the best results. Other formats might experience a slight performance decline due to precision issues.
2. The models with a '-llama-format' suffix are those where we have transformed the MiniCPM structure into the Llama structure (primarily integrating the parameterization scheme of mup into the model's own parameters). This enables users of the Llama model to try out MiniCPM at no extra cost. [See details](#llamaformat)
3. Thanks to [the contributor](https://github.com/runfuture) for adapting MiniCPM to [llama.cpp](https://github.com/ggerganov/llama.cpp) and [ollama](https://github.com/ollama/ollama).
Note: More model versions can be found [here](https://huggingface.co/collections/openbmb/minicpm-2b-65d48bf958302b9fd25b698f).
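As the bf16 note above suggests, passing the dtype explicitly when loading keeps inference in bf16. A minimal sketch with Transformers (the model id is one of the repos listed above and the prompt is illustrative):
```python
# minimal sketch: load MiniCPM in bf16, the precision it was trained in
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "openbmb/MiniCPM-2B-dpo-bf16"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.bfloat16, device_map="cuda", trust_remote_code=True
)
# the MiniCPM remote code exposes a chat() helper that returns (response, history)
responds, history = model.chat(tokenizer, "Which is the highest mountain in Shandong Province?", temperature=0.5, top_p=0.8)
print(responds)
```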
* Multimodal Model
| HuggingFace | ModelScope | WiseModel |
|-------------|------------|-----------|
| [MiniCPM-V 2.0](https://huggingface.co/openbmb/MiniCPM-V-2) | [MiniCPM-V 2.0](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2) |
| [MiniCPM-V](https://huggingface.co/openbmb/MiniCPM-V) | [MiniCPM-V](https://modelscope.cn/models/OpenBMB/MiniCPM-V/) | [MiniCPM-V](https://wisemodel.cn/models/OpenBMB/MiniCPM-V) |
| [OmniLMM](https://huggingface.co/openbmb/OmniLMM-12B) | [OmniLMM](https://modelscope.cn/models/OpenBMB/OmniLMM-12B) | [OmniLMM](https://wisemodel.cn/models/OpenBMB/OmniLMM-12B) |
| [OmniLMM-12B](https://huggingface.co/openbmb/OmniLMM-12B) | [OmniLMM-12B](https://modelscope.cn/models/OpenBMB/OmniLMM-12B) | [OmniLMM-12B](https://wisemodel.cn/models/OpenBMB/OmniLMM-12B) |
@ -131,7 +128,7 @@ The capital city of China is Beijing. Beijing is not only the political center o
<p id="llamaformat"></p>
##### MiniCPM-2B (Llama Format)
We have converted the model weights of MiniCPM into a format that can be directly called by Llama code, for everyone to try:
To facilitate ease of use, we have converted the model weights of MiniCPM to adapt to the structure of the LLaMA model:
```python
import torch
from transformers import LlamaTokenizerFast, LlamaForCausalLM
@ -175,30 +172,19 @@ print(res)
#### vLLM
* Install vLLM supporting MiniCPM.
- MiniCPM adopts the MUP scheme, which introduces some extra scaling operations to keep training stable, so its structure differs slightly from Llama and other LLMs.
- vLLM 0.2.2 is adapted to MiniCPM in the folder [inference](https://github.com/OpenBMB/MiniCPM/tree/main/inference). More vLLM versions will be supported in the future.
```shell
pip install inference/vllm
```
* Convert a Huggingface Transformers repo to a vLLM-MiniCPM repo, where `<hf_repo_path>` and `<vllmcpm_repo_path>` are local paths.
```shell
python inference/convert_hf_to_vllmcpm.py --load <hf_repo_path> --save <vllmcpm_repo_path>
```
* Install [vLLM](https://github.com/vllm-project/vllm)
```shell
pip install "vllm>=0.4.1"
```
* Examples
```shell
cd inference/vllm/examples/infer_cpm
python inference.py --model_path <vllmcpm_repo_path> --prompt_path prompts/prompt_final.txt
```
```shell
python inference/inference_vllm.py --model_path <hf_repo_path> --prompt_path prompts/prompt_demo.txt
```
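For quick experiments, the model can also be called through vLLM's offline Python API instead of the example script (a minimal sketch, assuming vLLM >= 0.4.1 as installed above; the model id and prompt are illustrative):
```python
# minimal sketch: offline generation with vLLM's Python API
from vllm import LLM, SamplingParams

llm = LLM(model="openbmb/MiniCPM-2B-dpo-bf16", trust_remote_code=True)
params = SamplingParams(temperature=0.5, top_p=0.8, max_tokens=256)
outputs = llm.generate(["<用户>Write an acrostic poem with the word MINICPM (One line per letter)<AI>"], params)
print(outputs[0].outputs[0].text)
```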
#### llama.cpp, Ollama, fastllm Inference
We have supported inference with [llama.cpp](https://github.com/ggerganov/llama.cpp/), [ollama](https://github.com/ollama/ollama), and [fastllm](https://github.com/ztxz16/fastllm).
#### llama.cpp, Ollama, fastllm, mlx_lm Inference
We have supported inference with [llama.cpp](https://github.com/ggerganov/llama.cpp/), [ollama](https://github.com/ollama/ollama), [fastllm](https://github.com/ztxz16/fastllm), [mlx_lm](https://github.com/ml-explore/mlx-examples). Thanks to [@runfuture](https://github.com/runfuture) for the adaptation of llama.cpp and ollama.
**llama.cpp**
@ -211,19 +197,16 @@ We have supported inference with [llama.cpp](https://github.com/ggerganov/llama.
For more parameter adjustments, [see this](https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md)
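For reference, a typical run with llama.cpp's `main` example looks roughly like this (a sketch; the gguf path is a placeholder for whichever quantization you downloaded, e.g. the q4km file linked above):
```shell
# run a downloaded MiniCPM gguf with llama.cpp's main example
./main -m path/to/minicpm-2b-dpo-q4km.gguf -p "<用户>Write an acrostic poem with the word MINICPM (One line per letter)<AI>" -n 128
```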
**ollama**
Solving [this issue](https://github.com/ollama/ollama/issues/2383)
<p id="Community"></p>
## Community
- [ChatLLM](https://github.com/foldl/chatllm.cpp): [Run MiniCPM on CPU](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16/discussions/2#65c59c4f27b8c11e43fc8796)
1. [install ollama](https://github.com/ollama/ollama)
2. In command line:
```
ollama run modelbest/minicpm-2b-dpo
```
**fastllm**
1. [install fastllm]([fastllm](https://github.com/ztxz16/fastllm)
1. install [fastllm](https://github.com/ztxz16/fastllm)
2. inference
```
```python
import torch
from transformers import AutoTokenizer, LlamaTokenizerFast, AutoModelForCausalLM
path = 'openbmb/MiniCPM-2B-dpo-fp16'
@ -235,6 +218,22 @@ model = llm.from_hf(model, tokenizer, dtype = "float16") # dtype supports "float16", "int8", "int4"
print(model.response("<用户>Write an acrostic poem with the word MINICPM (One line per letter)<AI>", top_p=0.8, temperature=0.5, repeat_penalty=1.02))
```
**mlx_lm**
1. install mlx_lm
```shell
pip install mlx_lm
```
2. download model weights [MiniCPM-2B-sft-bf16-llama-format-mlx](https://huggingface.co/mlx-community/MiniCPM-2B-sft-bf16-llama-format-mlx)
3. inference
```shell
python -m mlx_lm.generate --model mlx-community/MiniCPM-2B-sft-bf16-llama-format-mlx --prompt "hello, tell me a joke." --trust-remote-code
```
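The same converted weights can also be used from Python through mlx_lm's API (a minimal sketch, assuming an Apple Silicon machine with mlx_lm installed as above):
```python
# minimal sketch: generate with mlx_lm's Python API instead of the CLI
from mlx_lm import load, generate

model, tokenizer = load(
    "mlx-community/MiniCPM-2B-sft-bf16-llama-format-mlx",
    tokenizer_config={"trust_remote_code": True},
)
print(generate(model, tokenizer, prompt="hello, tell me a joke.", max_tokens=128))
```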
<p id="Community"></p>
## Community
- [ChatLLM](https://github.com/foldl/chatllm.cpp): [Run MiniCPM on CPU](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16/discussions/2#65c59c4f27b8c11e43fc8796)
<p id="3"></p>
@ -311,89 +310,355 @@ print(model.response("<用户>Write an acrostic poem with the word MINICPM (One
|Llama2-7B-Chat|38.16|39.17|33.59|34.54|32.64|47.64|14.02|27.4|21.15|2.08|35.54|74.28|54.78|75.65*|
|MiniCPM-2B|52.33|52.6|51.1|51.13|51.07|53.46|50.00|47.31|53.83|10.24|36.87|85.44|68.00|68.25|
#### Multimodal evaluation
| Model | Size | Visual Tokens | MME | MMB dev (en) | MMB dev (zh) | MMMU val | CMMMU val |
|-------|------|---------------|-----|--------------|--------------|----------|-----------|
| LLaVA-Phi | 3B | 576 | 1335 | 59.8 | - | - | - |
| MobileVLM | 3B | 144 | 1289 | 59.6 | - | - | - |
| Imp-v1 | 3B | 576 | 1434 | 66.5 | - | - | - |
| Qwen-VL-Chat | 9.6B | 256 | 1487 | 60.6 | 56.7 | 35.9 | 30.7 |
| CogVLM | 17.4B | 1225 | 1438 | 63.7 | 53.8 | 32.1 | - |
| **MiniCPM-V(3B)** | 3B | 64 | 1452 | 67.3 | 61.9 | 34.7 | 32.1 |
#### MiniCPM-2B-128k Evaluation
| Model | avg | avg w/o code&math | passkey | number_string | kv_retrieval | longbook_choice_eng | longbook_qa_chn | longbook_qa_eng | longbook_sum_eng | longdialogue_qa_eng | math_calc | math_find | code_debug | code_run |
|-------------------------------------|-------|-------------------|---------|---------------|--------------|---------------------|-----------------|-----------------|------------------|---------------------|-----------|-----------|------------|----------|
| LWM-Text-128k | 24.45 | 33.62 | 100 | 97.8 | 0.6 | 28.82 | 15.93 | 14.31 | 9.99 | 1.5 | 0 | 3.43 | 20.05 | 1 |
| Yarn-Mistral-7b-128k | 19.84 | 27.36 | 92.71 | | 0 | 27.95 | 15.49 | 9.55 | 9.06 | 7.5 | 0 | 17.14 | 0.76 | 1.25 |
| Mistral-7B-Instruct-v0.2(ABF 1000w) | 27.75 | 36.9 | 100 | 78.98 | 3.6 | 37.12 | 11.74 | 17.37 | 21.12 | 9.5 | 0 | 29.43 | 17.51 | 0 |
| Yi-6B-200k | 22.15 | 32.54 | 100 | 94.92 | 0 | 36.68 | 15.07 | 9.2 | 0.92 | 3.5 | 0 | 4.29 | 0.51 | 0.75 |
| chatglm3-6b-128k | 25.58 | 36.57 | 89.93 | 99.66 | 5.2 | 46.29 | 10.7 | 8.38 | 25.91 | 6.5 | 0 | 8 | 5.33 | 1 |
| MiniCPM-2.4B-128k | 27.32 | 37.68 | 98.31 | 99.83 | 9 | 29.69 | 23.06 | 16.33 | 15.73 | 9.5 | 0 | 4.29 | 22.08 | 0 |
#### MiniCPM-MoE-8x2B Evaluation
| Model | BBH | MMLU | CEval | CMMLU | HumanEval | MBPP&dagger; | GSM8K | MATH |
|-------|-----|------|-------|-------|-----------|--------------|-------|------|
| Llama2-34B* | 44.1 | 62.6 | - | - | 22.6 | 33.0 | 42.2 | 6.24 |
| Mistral-7B-Instruct-v0.2 | 39.81 | 60.51 | 42.55 | 41.92 | 36.59 | 39.63 | 40.49 | 4.95 |
| Gemma-7B* | 55.1 | 64.3 | - | - | 32.3 | 44.4 | 46.4 | 24.3 |
| Qwen1.5-7B* | 40.2 | 61 | 74.1 | 73.1 | 36 | 37.4 | 62.5 | 20.3 |
| Deepseek-MoE(16B)* | - | 45.0 | 40.6 | 42.5 | 26.8 | 39.2 | 18.8 | 4.3 |
| **MiniCPM-2.4B** | 36.87 | 53.46 | 51.13 | 51.07 | 50.00 | 35.93 | 53.83 | 10.24 |
| **MiniCPM-MoE-8x2B** | 39.22 | 58.90 | 58.11 | 58.80 | 55.49 | 41.68 | 61.56 | 10.52 |
<p id="4"></p>
Note: * means evaluation results are directly taken from their technical reports. &dagger; means evaluation results on the full set of MBPP, instead of the hand-verified set.
#### Multimodal evaluation
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>TextVQA val</th>
<th>DocVQA test</th>
<th>OCRBench</th>
<th>OpenCompass</th>
<th nowrap="nowrap" >MME</th>
<th>MMB dev(en)</th>
<th>MMB dev(zh)</th>
<th>MMMU val</th>
<th>MathVista</th>
<th>LLaVA Bench</th>
<th nowrap="nowrap">Object HalBench</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="12" align="left"><strong>Proprietary models</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini Pro Vision</td>
<td>- </td>
<td>74.6</td>
<td>88.1</td>
<td>680</td>
<td>63.8</td>
<td>2148.9</td>
<td>75.2</td>
<td>74.0</td>
<td>48.9</td>
<td>45.8</td>
<td>79.9</td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>- </td>
<td>78.0</td>
<td>88.4</td>
<td>645</td>
<td>63.2</td>
<td>1771.5</td>
<td>75.1</td>
<td>75.0</td>
<td>53.8</td>
<td>47.8</td>
<td>93.1</td>
<td>86.4 / 92.7</td>
</tr>
<tr>
<td colspan="12" align="left"><strong>Open-source models 6B~34B</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Yi-VL-6B</td>
<td align="right" >6.7B</td>
<td>45.5*</td>
<td>17.1*</td>
<td>290</td>
<td>49.3</td>
<td>1915.1 </td>
<td>68.6 </td>
<td>68.3 </td>
<td>40.3 </td>
<td>28.8 </td>
<td>51.9 </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Qwen-VL-Chat</td>
<td align="right" >9.6B</td>
<td>61.5</td>
<td>62.6</td>
<td>488 </td>
<td>52.1 </td>
<td>1860.0 </td>
<td>60.6 </td>
<td>56.7 </td>
<td>37.0 </td>
<td>33.8 </td>
<td>67.7 </td>
<td>56.2 / 80.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Yi-VL-34B</td>
<td align="right" >34B</td>
<td>43.4*</td>
<td>16.9*</td>
<td>290</td>
<td>52.6 </td>
<td>2050.2</td>
<td>71.1</td>
<td>71.4</td>
<td>45.1</td>
<td>30.7</td>
<td>62.3</td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >DeepSeek-VL-7B</td>
<td align="right" >7.3B</td>
<td>64.7*</td>
<td>47.0* </td>
<td>435</td>
<td>55.6 </td>
<td>1765.4 </td>
<td>74.1 </td>
<td>72.8 </td>
<td>38.3 </td>
<td>36.8</td>
<td>77.8 </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >TextMonkey</td>
<td align="right" >9.7B</td>
<td>64.3</td>
<td>66.7 </td>
<td>558</td>
<td>- </td>
<td>- </td>
<td>- </td>
<td>- </td>
<td>- </td>
<td>-</td>
<td>- </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >CogVLM-Chat</td>
<td align="right" >17.4B</td>
<td>70.4</td>
<td>33.3*</td>
<td>590 </td>
<td>52.5 </td>
<td>1736.6 </td>
<td>63.7 </td>
<td>53.8 </td>
<td>37.3 </td>
<td>34.7 </td>
<td>73.9 </td>
<td>73.6 / 87.4 </td>
</tr>
<tr>
<td colspan="12" align="left"><strong>Open-source models 1B~3B </strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >DeepSeek-VL-1.3B</td>
<td align="right" >1.7B</td>
<td>58.4*</td>
<td>37.9*</td>
<td>413</td>
<td>46.0 </td>
<td>1531.6 </td>
<td>64.0 </td>
<td>61.2 </td>
<td>33.8 </td>
<td>29.4 </td>
<td>51.1 </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >MobileVLM V2</td>
<td align="right" >3.1B</td>
<td>57.5</td>
<td>19.4*</td>
<td>-</td>
<td>-</td>
<td>1440.5(P) </td>
<td>63.2 </td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Mini-Gemini</td>
<td align="right" >2.2B</td>
<td>56.2</td>
<td>34.2*</td>
<td>-</td>
<td>-</td>
<td>1653.0 </td>
<td>59.8 </td>
<td>- </td>
<td>31.7 </td>
<td>-</td>
<td>- </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >MiniCPM-V</td>
<td align="right" >2.8B </td>
<td>60.6</td>
<td>38.2 </td>
<td>366</td>
<td>47.6</td>
<td>1650.2 </td>
<td>67.9 </td>
<td>65.3 </td>
<td><strong>38.3</strong></td>
<td>28.9</td>
<td>51.3 </td>
<td>78.4 / 88.5 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" ><strong>MiniCPM-V 2.0</strong></td>
<td align="right" >2.8B </td>
<td><strong>74.1</strong></td>
<td><strong>71.9</strong> </td>
<td><strong>605</strong></td>
<td><strong>55.0</strong></td>
<td><strong>1808.6</strong> </td>
<td><strong>69.6</strong> </td>
<td><strong>68.1</strong> </td>
<td>38.2 </td>
<td><strong>38.7</strong></td>
<td><strong>69.2</strong> </td>
<td><strong>85.5 / 92.2 </strong></td>
</tr>
</tbody>
</table>
</div>
* We evaluated the officially released checkpoints ourselves.
#### DPO evaluation
|Model|MT-bench|
@ -422,19 +687,17 @@ print(model.response("<用户>Write an acrostic poem with the word MINICPM (One
* Android, HarmonyOS
* Adapt based on open-source framework MLC-LLM.
* Adapted for the text model MiniCPM and the multimodal model MiniCPM-V.
* Support MiniCPM-2B-SFT-INT4、MiniCPM-2B-DPO-INT4、MiniCPM-V.
* Support MiniCPM-2B-SFT-INT4, MiniCPM-2B-DPO-INT4, and MiniCPM-V.
* [Compile and Installation Guide](https://github.com/OpenBMB/mlc-MiniCPM/blob/main/README.md)
* iOS
* Adapt based on open-source framework LLMFarm.
* Adapted for text model MiniCPM.
* Support MiniCPM-2B-SFT-INT4MiniCPM-2B-DPO-INT4.
* Support MiniCPM-2B-SFT-INT4, MiniCPM-2B-DPO-INT4.
* [Compile and Installation Guide](https://github.com/OpenBMB/LLMFarm)
#### Performance
* We did not conduct in-depth optimization or systematic testing of the mobile inference models; we only verified the feasibility of running MiniCPM inference on mobile phone chips.
* Besides us, there are also other [efforts](https://github.com/ggerganov/llama.cpp/blob/master/examples/llava/MobileVLM-README.md) to deploy multimodal models on mobile phones based on llama.cpp. This time we verified the feasibility of deploying MiniCPM-V on mobile phones based on MLC-LLM; it can input and output normally, but image processing time is still long and needs further optimization :)
* **We welcome more developers to continuously improve the inference performance of LLMs on mobile phones and update the test results below.**
* We did not conduct in-depth optimization or systematic testing of the mobile inference models; we only verified the feasibility of running MiniCPM inference on mobile phone chips. **We welcome more developers to continuously improve the inference performance of LLMs on mobile phones and update the test results below.**
| Mobile Phones | OS | Processor | Memory (GB) | Inference Throughput (token/s) |
| ----------------- | ------------- | ------------------ | ------------ | ------------------------------- |
@ -458,7 +721,14 @@ print(model.response("<用户>Write an acrostic poem with the word MINICPM (One
| iPhone 11 | iOS 16.6 | A13 | 4 | 4.6 |
|Xiaomi Redmi K50 | HyperOS 1.0.2 | MediaTek Dimensity 8100 |12 |3.5|
![multimodal demo](https://github.com/OpenBMB/OmniLMM/blob/main/assets/gif_cases/Snake_en.gif)
* We have also verified the feasibility of deploying the MiniCPM-V series models on mobile phones based on MLC-LLM, and they can input and output normally. However, image processing time is still long and needs further optimization. The demo video below is a raw, unedited screen recording on a Xiaomi 14 Pro.
<table align="center">
<p align="center">
<img src="https://github.com/OpenBMB/MiniCPM-V/blob/main/assets/gif_cases/station.gif" width=36%/>
<img src="https://github.com/OpenBMB/MiniCPM-V/blob/main/assets/gif_cases/english_menu.gif" width=36%/>
</p>
</table>
<p id="5"></p>
@ -488,6 +758,18 @@ python demo/hf_based_demo.py --model_path <hf_repo_path>
* Using [BMTrain](https://github.com/OpenBMB/BMTrain), together with checkpointing and ZeRO-3 (zero redundancy optimizer), we can tune all parameters of MiniCPM with a single NVIDIA GeForce RTX 3090/4090.
* This code will be available soon.
* mlx Parameter-efficient Tuning
* environment preparation
```shell
pip install -r finetune/requirements_mlx.txt
```
* finetune
```shell
# train
python mlx_finetune.py --model MiniCPM-2B-sft-bf16-llama-format-mlx --data data/AdvertiseGen --train --seed 2024 --iters 500
# test
python mlx_finetune.py --model MiniCPM-2B-sft-bf16-llama-format-mlx --data data/AdvertiseGen --test --seed 2024
```
<p id="9"></p>
@ -530,10 +812,9 @@ python demo/hf_based_demo.py --model_path <hf_repo_path>
#### Model LICENSE
* This repository is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
* The usage of MiniCPM model weights must strictly follow [the General Model License (GML)](https://github.com/OpenBMB/General-Model-License/blob/main/%E9%80%9A%E7%94%A8%E6%A8%A1%E5%9E%8B%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE-%E6%9D%A5%E6%BA%90%E8%AF%B4%E6%98%8E-%E5%AE%A3%E4%BC%A0%E9%99%90%E5%88%B6-%E5%95%86%E4%B8%9A%E6%8E%88%E6%9D%83.md).
* The models and weights of MiniCPM are completely free for academic research.
* If you intend to utilize the model for commercial purposes, please reach out to cpm@modelbest.cn to obtain the certificate of authorization.
* The usage of MiniCPM model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
* The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, they are also available for free commercial use.
#### Statement
* As a language model, MiniCPM generates content by learning from a vast amount of text.
@ -545,12 +826,15 @@ python demo/hf_based_demo.py --model_path <hf_repo_path>
## Citation
* Please cite our [technical report](https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20) if you find our work valuable.
* Please cite our [paper](https://arxiv.org/abs/2404.06395) if you find our work valuable.
```
@misc{minicpm2024,
title={MiniCPM: Unveiling the Potential of End-side Large Language Models},
booktitle={OpenBMB Blog},
year={2024}
@misc{hu2024minicpm,
title={MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies},
author={Shengding Hu and Yuge Tu and Xu Han and Chaoqun He and Ganqu Cui and Xiang Long and Zhi Zheng and Yewei Fang and Yuxiang Huang and Weilin Zhao and Xinrong Zhang and Zheng Leng Thai and Kaihuo Zhang and Chongyi Wang and Yuan Yao and Chenyang Zhao and Jie Zhou and Jie Cai and Zhongwu Zhai and Ning Ding and Chao Jia and Guoyang Zeng and Dahai Li and Zhiyuan Liu and Maosong Sun},
year={2024},
eprint={2404.06395},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```

README.md

@ -12,27 +12,30 @@
<p align="center">
<a href="https://shengdinghu.notion.site/MiniCPM-c805a17c5c8046398914e47f0542095a" target="_blank">MiniCPM 技术博客</a> |
<a href="https://github.com/OpenBMB/OmniLMM/" target="_blank">OmniLMM 多模态模型</a> |
<a href="https://luca.cn/" target="_blank">CPM-C 千亿模型试用</a> |
加入我们的 <a href="https://discord.gg/3cGQn9b3YM" target="_blank">discord</a><a href="https://github.com/OpenBMB/MiniCPM/blob/main/assets/wechat.jpg" target="_blank">wechat</a>
<a href="https://openbmb.vercel.app/?category=Chinese+Blog" target="_blank">MiniCPM 技术博客</a> |
<a href="https://arxiv.org/abs/2404.06395" target="_blank">MiniCPM 论文</a> |
<a href="https://github.com/OpenBMB/MiniCPM-V/" target="_blank">MiniCPM-V 仓库</a> |
加入我们的 <a href="https://discord.gg/3cGQn9b3YM" target="_blank">discord</a><a href="https://github.com/OpenBMB/MiniCPM/blob/main/assets/wechat.jpg" target="_blank">微信群</a>
</p>
MiniCPM is a series of end-side LLMs jointly open-sourced by ModelBest (面壁智能) and the Tsinghua University Natural Language Processing Lab (TsinghuaNLP). The main language model, MiniCPM-2B, has only 2.4B non-embedding parameters (2.7B in total).
- After SFT, MiniCPM performs comparably to Mistral-7B on public comprehensive benchmarks (with stronger Chinese, mathematics, and coding abilities), and its overall performance exceeds models such as Llama2-13B, MPT-30B, and Falcon-40B.
- After DPO, MiniCPM-2B also surpasses many representative open-source models such as Llama2-70B-Chat, Vicuna-33B, Mistral-7B-Instruct-v0.1, and Zephyr-7B-alpha on MTBench, the benchmark currently closest to real user experience.
- Built on MiniCPM-2B, the end-side multimodal model MiniCPM-V achieves the best overall performance among models of the same scale, surpassing existing multimodal models built on Phi-2 and matching or even exceeding 9.6B Qwen-VL-Chat on some benchmarks.
- After SFT, MiniCPM-2B performs comparably to Mistral-7B on public comprehensive benchmarks (with stronger Chinese, mathematics, and coding abilities), and its overall performance exceeds models such as Llama2-13B, MPT-30B, and Falcon-40B.
- After DPO, MiniCPM-2B also surpasses many representative open-source models such as Llama2-70B-Chat, Vicuna-33B, Mistral-7B-Instruct-v0.1, and Zephyr-7B-alpha on MTBench, the benchmark currently closest to real user experience.
- Built on MiniCPM-2B, the end-side multimodal model MiniCPM-V 2.0 achieves the best performance among models under 7B parameters on multiple benchmarks, surpassing larger models such as Qwen-VL-Chat 9.6B, CogVLM-Chat 17.4B, and Yi-VL 34B on the OpenCompass leaderboard. MiniCPM-V 2.0 also shows leading OCR capability, approaching Gemini Pro in scene-text recognition.
- After Int4 quantization, MiniCPM can be deployed for inference on mobile phones, with streaming output slightly faster than human speaking speed. MiniCPM-V also runs end to end as a multimodal model deployed on mobile phones.
- A single 1080/2080 GPU suffices for parameter-efficient finetuning, a single 3090/4090 for full-parameter finetuning, and a single machine can train MiniCPM continuously, so the cost of secondary development is low.
We fully open-source the model parameters of MiniCPM-2B for academic research and limited commercial use.
We fully open-source the model parameters of the MiniCPM series for academic research and limited commercial use.
Specifically, we have currently released the following models; see the [Model Download](#1) section for links:
- Instruction-tuned and human-preference-aligned versions of MiniCPM-2B: **MiniCPM-2B-SFT/DPO**.
- The multimodal model **MiniCPM-V** based on MiniCPM-2B, which outperforms multimodal models of the same parameter scale built on Phi-2.
- Instruction-tuned and human-preference-aligned versions of MiniCPM-2B: **MiniCPM-2B-SFT/DPO**.
- The multimodal model **MiniCPM-V 2.0** based on MiniCPM-2B.
- The Int4 quantized versions of MiniCPM-2B-SFT/DPO: **MiniCPM-2B-SFT/DPO-Int4**.
- The 128k long-context version of MiniCPM-2B: **MiniCPM-2B-128k**.
- The MoE version of MiniCPM-2B: **MiniCPM-MoE-8x2B**.
- A lighter-weight instruction-tuned version, MiniCPM-1B: **MiniCPM-1B-SFT**.
- Mobile phone apps built on MLC-LLM and LLMFarm: **both the text model and the multimodal model can run inference on mobile phones**.
- [30 checkpoints](https://huggingface.co/openbmb/MiniCPM-2B-history) from the training process, for research on model mechanisms.
- [30 checkpoints](https://huggingface.co/openbmb/MiniCPM-2B-history) of MiniCPM-2B from the training process, for research on model mechanisms.
### Limitations:
@ -43,24 +46,35 @@ MiniCPM 是面壁智能与清华大学自然语言处理实验室共同开源的
## Table of Contents
- [Update Log](#0)
- [Model Download](#1)
- [Quick Start](#2)
- [Community](#community)
- [Evaluation Results](#3)
- [Mobile Deployment](#4)
- [Demo & API Deployment](#5)
- [Secondary Development](#6)
- [License](#7)
- [Citation](#8)
- [Typical Examples](#9)
- [Update Log](#0)
- [Model Download](#1)
- [Quick Start](#2)
- [Model Quantization](#quantize)
- [Community](#community)
- [Evaluation Results](#3)
- [Mobile Deployment](#4)
- [Demo & API Deployment](#5)
- [Secondary Development](#6)
- [License](#7)
- [Citation](#8)
- [Typical Examples](#9)
## Quick Module Navigation
| [Inference](#2) | [Fine-tuning](#6) | [Mobile Deployment](#4) | [Quantization](#quantize)
|-------------|------------|-----------|-----------|
|[Transformers](#Huggingface模型)|[Transformers](#transformer_finetune)|[MLC deployment](#MLC)|[GPTQ](#gptq)|
|[vLLM](#vllm-推理)|[mlx_finetune](#mlx)|[llama.cpp](#llama.cpp)|[AWQ](#awq)|
|[llama.cpp](#llama.cpp)|[llama_factory](./finetune/llama_factory_example/README.md)||[Perplexity Test](#quantize_test)|
|[ollama](#ollama)||||
|[fastllm](#fastllm)||||
|[mlx_lm](#mlx_lm)||||
<p id="0"></p>
## Update Log
- 2024/03/16 30+ intermediate checkpoints of MiniCPM-2B were released on [Hugging Face](https://huggingface.co/openbmb/MiniCPM-2B-history)!
- **2024/04/11 We open-sourced [MiniCPM-V-2.0](https://huggingface.co/openbmb/MiniCPM-V-2.0), [MiniCPM-2B-128k](https://huggingface.co/openbmb/MiniCPM-2B-128k), [MiniCPM-MoE-8x2B](https://huggingface.co/openbmb/MiniCPM-MoE-8x2B), and [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16)! Click [here](https://openbmb.vercel.app/?category=Chinese+Blog) to read the technical blog.**
- 2024/03/16 30+ intermediate checkpoints of MiniCPM-2B were released on [Hugging Face](https://huggingface.co/openbmb/MiniCPM-2B-history)!
- 2024/02/13 Added support for llama.cpp
- 2024/02/09 We added a [Community](#community) section to the README to collect open-source community support for MiniCPM.
- 2024/02/09 We added a [Community](#community) section to the README to collect open-source community support for MiniCPM.
- 2024/02/08 We updated the [llama-format model weights](#llamaformat) to make our models easier to use.
- 2024/02/01 Initial release.
@ -70,30 +84,23 @@ MiniCPM 是面壁智能与清华大学自然语言处理实验室共同开源的
* Language Model
| HuggingFace | ModelScope | WiseModel | Replicate |
|-------------|------------|-----------|-----------|
| HuggingFace | ModelScope | WiseModel |
|-------------|------------|-----------|
|[MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16)|[MiniCPM-2B-sft-bf16](https://modelscope.cn/models/OpenBMB/miniCPM-bf16)|[MiniCPM-2B-sft-bf16](https://wisemodel.cn/models/OpenBMB/miniCPM-bf16)|
|[MiniCPM-2B-sft-fp32](https://huggingface.co/openbmb/MiniCPM-2B-sft-fp32)|[MiniCPM-2B-sft-fp32](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-sft-fp32)|[MiniCPM-2B-sft-fp32](https://wisemodel.cn/models/OpenBMB/miniCPM-dpo-fp32)|
|[MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16)|[MiniCPM-2B-dpo-bf16](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-dpo-bf16/summary)|[MiniCPM-2B-dpo-bf16](https://wisemodel.cn/models/OpenBMB/MiniCPM-2B-dpo-bf16)|[MiniCPM-2B-dpo-bf16](https://replicate.com/tuantuanzhang/minicpm)
|[MiniCPM-2B-dpo-fp16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-fp16)|[MiniCPM-2B-dpo-fp16](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-dpo-fp16/)|[MiniCPM-2B-dpo-fp16](https://wisemodel.cn/models/OpenBMB/MiniCPM-2B-dpo-fp16)|
|[MiniCPM-2B-dpo-fp32](https://huggingface.co/openbmb/MiniCPM-2B-dpo-fp32)|[MiniCPM-2B-dpo-fp32](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-dpo-fp32)|[MiniCPM-2B-dpo-fp32](https://wisemodel.cn/models/OpenBMB/miniCPM-dpo-fp32)|
|[MiniCPM-2B-sft-fp32-llama-format](https://huggingface.co/openbmb/MiniCPM-2B-sft-fp32-llama-format)|
|[MiniCPM-2B-sft-bf16-llama-format](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16-llama-format)|
|[MiniCPM-2B-dpo-bf16-llama-format](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16-llama-format)|
|[MiniCPM-2B-dpo-fp16-gguf](https://huggingface.co/runfuture/MiniCPM-2B-dpo-fp16-gguf) |
|[MiniCPM-2B-dpo-q4km-gguf](https://huggingface.co/runfuture/MiniCPM-2B-dpo-q4km-gguf) |
|[MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16)|[MiniCPM-2B-dpo-bf16](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-dpo-bf16/summary)|[MiniCPM-2B-dpo-bf16](https://wisemodel.cn/models/OpenBMB/MiniCPM-2B-dpo-bf16)|
|[MiniCPM-2B-128k](https://huggingface.co/openbmb/MiniCPM-2B-128k) |[MiniCPM-2B-128k](https://modelscope.cn/models/openbmb/MiniCPM-2B-128k/summary)|
|[MiniCPM-MoE-8x2B](https://huggingface.co/openbmb/MiniCPM-MoE-8x2B) |[MiniCPM-MoE-8x2B](https://modelscope.cn/models/OpenBMB/MiniCPM-MoE-8x2B)|
|[MiniCPM-1B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16) | [MiniCPM-1B-sft-bf16](https://modelscope.cn/models/OpenBMB/MiniCPM-1B-sft-bf16) |
Notes:
1. The model was trained in bf16, so inference in bf16 gives the best results; other formats may see a slight performance drop due to precision issues.
2. Models with the "-llama-format" suffix convert the MiniCPM structure into the Llama structure (mainly folding the mup parameterization scheme into the model's own parameters), so Llama users can try MiniCPM at zero cost. [See details](#llamaformat)
3. Thanks to [@runfuture](https://github.com/runfuture) for adapting MiniCPM to [llama.cpp](https://github.com/ggerganov/llama.cpp) and [ollama](https://github.com/ollama/ollama).
Note: more model versions are available [here](https://huggingface.co/collections/openbmb/minicpm-2b-65d48bf958302b9fd25b698f).
* Multimodal Model
| HuggingFace | ModelScope | WiseModel |
|-------------|------------|-----------|
| [MiniCPM-V 2.0](https://huggingface.co/openbmb/MiniCPM-V-2) | [MiniCPM-V 2.0](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2) |
| [MiniCPM-V](https://huggingface.co/openbmb/MiniCPM-V) | [MiniCPM-V](https://modelscope.cn/models/OpenBMB/MiniCPM-V/) | [MiniCPM-V](https://wisemodel.cn/models/OpenBMB/MiniCPM-V) |
| [OmniLMM](https://huggingface.co/openbmb/OmniLMM-12B) | [OmniLMM](https://modelscope.cn/models/OpenBMB/OmniLMM-12B) | [OmniLMM](https://wisemodel.cn/models/OpenBMB/OmniLMM-12B) |
| [OmniLMM-12B](https://huggingface.co/openbmb/OmniLMM-12B) | [OmniLMM-12B](https://modelscope.cn/models/OpenBMB/OmniLMM-12B) | [OmniLMM-12B](https://wisemodel.cn/models/OpenBMB/OmniLMM-12B) |
@ -106,6 +113,8 @@ MiniCPM 是面壁智能与清华大学自然语言处理实验室共同开源的
- [Colab](https://colab.research.google.com/drive/1tJcfPyWGWA5HezO7GKLeyeIso0HyOc0l?usp=sharing)
<p id="Huggingface模型"></p>
#### Huggingface Models
##### MiniCPM-2B
@ -133,7 +142,7 @@ print(responds)
<p id="llamaformat"></p>
##### MiniCPM-2B Llama Format
We have converted the MiniCPM model weights into a form that can be loaded directly by Llama code, for everyone to try:
We have converted the MiniCPM model weights into a [format](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16-llama-format) that can be loaded directly by Llama code, for everyone to try:
```python
import torch
from transformers import LlamaTokenizerFast, LlamaForCausalLM
@ -177,24 +186,14 @@ print(res)
#### vLLM Inference
* Install a vLLM that supports MiniCPM
- Because MiniCPM uses the MUP structure, there are some extra scaling computations in its matrix multiplications, which differ slightly from Llama-style model structures.
- We implemented MiniCPM inference on top of vLLM 0.2.2; the code lives in the [inference](https://github.com/OpenBMB/MiniCPM/tree/main/inference) folder of this repo. Newer vLLM versions will be supported in the future.
* Install a vLLM version that supports MiniCPM
* Install [vLLM](https://github.com/vllm-project/vllm)
```shell
pip install inference/vllm
```
* Convert a Huggingface Transformers repo into the format supported by vLLM-MiniCPM, where `<hf_repo_path>` and `<vllmcpm_repo_path>` are both local paths
```shell
python inference/convert_hf_to_vllmcpm.py --load <hf_repo_path> --save <vllmcpm_repo_path>
pip install "vllm>=0.4.1"
```
* Test examples
```shell
cd inference/vllm/examples/infer_cpm
python inference.py --model_path <vllmcpm_repo_path> --prompt_path prompts/prompt_demo.txt
python inference/inference_vllm.py --model_path <hf_repo_path> --prompt_path prompts/prompt_demo.txt
```
* Expected output
@ -204,10 +203,12 @@ python inference.py --model_path <vllmcpm_repo_path> --prompt_path prompts/promp
The capital city of China is Beijing. Beijing is a major political, cultural, and economic center in China, and it is known for its rich history, beautiful architecture, and vibrant nightlife. It is also home to many of China's most important cultural and historical sites, including the Forbidden City, the Great Wall of China, and the Temple of Heaven. Beijing is a popular destination for tourists from around the world, and it is an important hub for international business and trade.
```
#### llama.cpp, Ollama, fastllm Inference
We support inference with [llama.cpp](https://github.com/ggerganov/llama.cpp/), [ollama](https://github.com/ollama/ollama), and [fastllm](https://github.com/ztxz16/fastllm).
#### llama.cpp, Ollama, fastllm, mlx_lm Inference
MiniCPM supports inference with [llama.cpp](https://github.com/ggerganov/llama.cpp/), [ollama](https://github.com/ollama/ollama), [fastllm](https://github.com/ztxz16/fastllm), and [mlx_lm](https://github.com/ml-explore/mlx-examples). Thanks to [@runfuture](https://github.com/runfuture) for adapting llama.cpp and ollama.
**llama.cpp**
<p id="llama.cpp"></p>
#### llama.cpp
1. [Install llama.cpp](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#build)
2. Download the model in gguf format: [download link (fp16)](https://huggingface.co/runfuture/MiniCPM-2B-dpo-fp16-gguf), [download link (q4km)](https://huggingface.co/runfuture/MiniCPM-2B-dpo-q4km-gguf)
3. Run the example command in the command line:
@ -216,13 +217,42 @@ python inference.py --model_path <vllmcpm_repo_path> --prompt_path prompts/promp
```
For more parameter adjustments, [see here](https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md)
**ollama**
Being resolved in [this issue](https://github.com/ollama/ollama/issues/2383)
<p id="ollama"></p>
**fastllm**
#### ollama
***Automatic model installation with ollama***
1. [Install ollama](https://github.com/ollama/ollama)
2. Run in the command line:
```
ollama run modelbest/minicpm-2b-dpo
```
***Manual model installation with ollama***
1. [Install ollama](https://github.com/ollama/ollama)
2. Download the model in gguf format: [2B fp16](https://huggingface.co/runfuture/MiniCPM-2B-dpo-fp16-gguf), [2B q4km](https://huggingface.co/runfuture/MiniCPM-2B-dpo-q4km-gguf), [1B fp16](https://huggingface.co/linglingdan/MiniCPM-1b-fp16-gguf), [1B q4_1](https://huggingface.co/linglingdan/MiniCPM-1b-q4-1)
3. Run the following command in the command line (model_name can be customized):
```
touch model_name.Modelfile
```
4. Edit model_name.Modelfile as follows, writing the path of the gguf model after FROM:
```
FROM model_path/model_name.gguf
TEMPLATE """<s><USER>{{ .Prompt }}<AI>{{ .Response }}"""
PARAMETER stop "<\s>"
```
5. Run the following command to create the ollama model (ollama_model_name can be customized; model_name.Modelfile is the file named in step 3):
```
ollama create ollama_model_name -f model_name.Modelfile
```
6. Run the ollama model:
```
ollama run ollama_model_name
```
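Once the model is created, it can also be queried through ollama's local REST API instead of the interactive CLI (a minimal sketch; the model name must match the one created above, and the default server port is 11434):
```shell
# send a single non-streaming generation request to the local ollama server
curl http://localhost:11434/api/generate -d '{
  "model": "ollama_model_name",
  "prompt": "Write an acrostic poem with the word MINICPM (One line per letter)",
  "stream": false
}'
```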
<p id="fastllm"></p>
#### fastllm
1. [Build and install fastllm](https://github.com/ztxz16/fastllm)
2. Model inference
```
```python
import torch
from transformers import AutoTokenizer, LlamaTokenizerFast, AutoModelForCausalLM
path = 'openbmb/MiniCPM-2B-dpo-fp16'
@ -233,14 +263,86 @@ llm.set_device_map("cpu")
model = llm.from_hf(model, tokenizer, dtype = "float16") # dtype supports "float16", "int8", "int4"
print(model.response("<用户>山东省最高的山是哪座山, 它比黄山高还是矮?差距多少?<AI>", top_p=0.8, temperature=0.5, repeat_penalty=1.02))
```
<p id="mlx_lm"></p>
#### mlx_lm
1. Install the mlx_lm library
```shell
pip install mlx_lm
```
2. Download the converted model weights: [MiniCPM-2B-sft-bf16-llama-format-mlx](https://huggingface.co/mlx-community/MiniCPM-2B-sft-bf16-llama-format-mlx)
3. Model inference
```shell
python -m mlx_lm.generate --model mlx-community/MiniCPM-2B-sft-bf16-llama-format-mlx --prompt "hello, tell me a joke." --trust-remote-code
```
<p id="quantize"></p>
## Model Quantization
<p id="gptq"></p>
**GPTQ quantization**
1. First, get the [minicpm_gptqd code](https://github.com/LDLINGLINGLING/AutoGPTQ/tree/minicpm_gptq) via git
2. Enter the minicpm_gptqd root directory ./AutoGPTQ and run in the command line:
```
pip install -e .
```
3. Go to [Model Download](#1) and download an unquantized MiniCPM model, placing all files of the repo in one local folder. Both the 1B and 2B models work, as do models after training.
4. Under ./AutoGPTQ/examples/quantization, run the following command, where no_quantized_path is the model download path from step 3, save_path is the save path of the quantized model, and --bits is the quantization bit width (4 or 8):
```
python quant_with_alpaca.py --pretrained_model_dir no_quantized_path --quantized_model_dir save_path --bits 4
```
5. You can run inference with ./AutoGPTQ/examples/quantization/inference.py, or use vLLM as described above with the quantized model; with the minicpm-1b-int4 model on a single 4090, vLLM inference reaches about 2000 token/s.
<p id="awq"></p>
**AWQ quantization**
1. In quantize/awq_quantize.py, modify the configuration parameters according to the comments:
```python
model_path = '/root/ld/ld_model_pretrained/MiniCPM-1B-sft-bf16' # model_path or model_id
quant_path = '/root/ld/ld_project/pull_request/MiniCPM/quantize/awq_cpm_1b_4bit' # quant_save_path
quant_data_path='/root/ld/ld_project/pull_request/MiniCPM/quantize/quantize_data/wikitext' # point to alpaca or wikitext under the bundled quantize_data directory
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" } # "w_bit":4 or 8
quant_samples=512 # how many samples to use for calibration
custom_data=[{'question':'你叫什么名字。','answer':'我是openmbmb开源的小钢炮minicpm。'}, # a custom dataset can also be used
{'question':'你有什么特色。','answer':'我很小,但是我很强。'}]
```
2. Two calibration datasets, alpaca and wikitext, are already provided under quantize/quantize_data; set quant_data_path above to the path of one of these folders
3. If you need a custom dataset, modify the custom_data variable in quantize/awq_quantize.py, for example:
```python
custom_data=[{'question':'过敏性鼻炎有什么症状?','answer':'过敏性鼻炎可能鼻塞,流鼻涕,头痛等症状反复发作,严重时建议及时就医。'},
{'question':'1+1等于多少','answer':'等于2'}]
```
4. Depending on the chosen dataset, pick one of the following lines to replace line 38 of quantize/awq_quantize.py:
```python
# quantize with wikitext
model.quantize(tokenizer, quant_config=quant_config, calib_data=load_wikitext(quant_data_path=quant_data_path))
# quantize with alpaca
model.quantize(tokenizer, quant_config=quant_config, calib_data=load_alpaca(quant_data_path=quant_data_path))
# quantize with a custom dataset
model.quantize(tokenizer, quant_config=quant_config, calib_data=load_cust_data(quant_data_path=quant_data_path))
```
5. Run quantize/awq_quantize.py; the AWQ-quantized model will be written to the configured quant_path directory.
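For a quick smoke test of the result, the quantized checkpoint can be loaded back with the AutoAWQ API (a minimal sketch, assuming the same autoawq package used by awq_quantize.py is installed; the path and prompt are placeholders):
```python
# minimal sketch: load the AWQ-quantized MiniCPM checkpoint and generate a reply
from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM

quant_path = "quantize/awq_cpm_1b_4bit"  # the quant_path configured above
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True, trust_remote_code=True)

inputs = tokenizer("<用户>你好,请介绍一下你自己。<AI>", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```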
<p id="quantize_test"></p>
**Quantization test**
1. In the command line, enter the MiniCPM/quantize directory
2. In quantize_eval.sh, modify awq_path, gptq_path, and model_path; leave the types you do not want to test as empty strings. The following example tests only the awq model:
```
awq_path="/root/ld/ld_project/AutoAWQ/examples/awq_cpm_1b_4bit"
gptq_path=""
model_path=""
```
3. In the MiniCPM/quantize directory, run in the command line:
```
bash quantize_eval.sh
```
4. The console will print the model's memory usage and perplexity.
<p id="community"></p>
## Community
- [ChatLLM framework](https://github.com/foldl/chatllm.cpp): [Run MiniCPM on CPU](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16/discussions/2#65c59c4f27b8c11e43fc8796)
- [ChatLLM framework](https://github.com/foldl/chatllm.cpp): [Run MiniCPM on CPU](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16/discussions/2#65c59c4f27b8c11e43fc8796)
@ -337,94 +439,359 @@ print(model.response("<用户>山东省最高的山是哪座山, 它比黄山高
|Mistral-7B-Instruct-v0.1|6.84|
|MPT-34B-instruct|6.39|
#### Multimodal evaluation
| Model | Size | Visual Tokens | MME | MMB dev (en) | MMB dev (zh) | MMMU val | CMMMU val |
|-------|------|---------------|-----|--------------|--------------|----------|-----------|
| LLaVA-Phi | 3B | 576 | 1335 | 59.8 | - | - | - |
| MobileVLM | 3B | 144 | 1289 | 59.6 | - | - | - |
| Imp-v1 | 3B | 576 | 1434 | 66.5 | - | - | - |
| Qwen-VL-Chat | 9.6B | 256 | 1487 | 60.6 | 56.7 | 35.9 | 30.7 |
| CogVLM | 17.4B | 1225 | 1438 | 63.7 | 53.8 | 32.1 | - |
| **MiniCPM-V(3B)** | 3B | 64 | 1452 | 67.3 | 61.9 | 34.7 | 32.1 |
#### MiniCPM-2B-128k Evaluation
| Model | avg | avg w/o code&math | passkey | number_string | kv_retrieval | longbook_choice_eng | longbook_qa_chn | longbook_qa_eng | longbook_sum_eng | longdialogue_qa_eng | math_calc | math_find | code_debug | code_run |
|-------------------------------------|-------|-------------------|---------|---------------|--------------|---------------------|-----------------|-----------------|------------------|---------------------|-----------|-----------|------------|----------|
| LWM-Text-128k | 24.45 | 33.62 | 100 | 97.8 | 0.6 | 28.82 | 15.93 | 14.31 | 9.99 | 1.5 | 0 | 3.43 | 20.05 | 1 |
| Yarn-Mistral-7b-128k | 19.84 | 27.36 | 92.71 | | 0 | 27.95 | 15.49 | 9.55 | 9.06 | 7.5 | 0 | 17.14 | 0.76 | 1.25 |
| Mistral-7B-Instruct-v0.2(ABF 1000w) | 27.75 | 36.9 | 100 | 78.98 | 3.6 | 37.12 | 11.74 | 17.37 | 21.12 | 9.5 | 0 | 29.43 | 17.51 | 0 |
| Yi-6B-200k | 22.15 | 32.54 | 100 | 94.92 | 0 | 36.68 | 15.07 | 9.2 | 0.92 | 3.5 | 0 | 4.29 | 0.51 | 0.75 |
| chatglm3-6b-128k | 25.58 | 36.57 | 89.93 | 99.66 | 5.2 | 46.29 | 10.7 | 8.38 | 25.91 | 6.5 | 0 | 8 | 5.33 | 1 |
| MiniCPM-2.4B-128k | 27.32 | 37.68 | 98.31 | 99.83 | 9 | 29.69 | 23.06 | 16.33 | 15.73 | 9.5 | 0 | 4.29 | 22.08 | 0 |
#### MiniCPM-MoE-8x2B Evaluation
| Model | BBH | MMLU | CEval | CMMLU | HumanEval | MBPP&dagger; | GSM8K | MATH |
|-------|-----|------|-------|-------|-----------|--------------|-------|------|
| Llama2-34B* | 44.1 | 62.6 | - | - | 22.6 | 33.0 | 42.2 | 6.24 |
| Mistral-7B-Instruct-v0.2 | 39.81 | 60.51 | 42.55 | 41.92 | 36.59 | 39.63 | 40.49 | 4.95 |
| Gemma-7B* | 55.1 | 64.3 | - | - | 32.3 | 44.4 | 46.4 | 24.3 |
| Qwen1.5-7B* | 40.2 | 61 | 74.1 | 73.1 | 36 | 37.4 | 62.5 | 20.3 |
| Deepseek-MoE(16B)* | - | 45.0 | 40.6 | 42.5 | 26.8 | 39.2 | 18.8 | 4.3 |
| **MiniCPM-2.4B** | 36.87 | 53.46 | 51.13 | 51.07 | 50.00 | 35.93 | 53.83 | 10.24 |
| **MiniCPM-MoE-8x2B** | 39.22 | 58.90 | 58.11 | 58.80 | 55.49 | 41.68 | 61.56 | 10.52 |
<p id="4"></p>
Note: * means results are taken directly from the corresponding technical reports. &dagger; means evaluation on the full MBPP set.
#### Multimodal evaluation
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>TextVQA val</th>
<th>DocVQA test</th>
<th>OCRBench</th>
<th>OpenCompass</th>
<th nowrap="nowrap" >MME</th>
<th>MMB dev(en)</th>
<th>MMB dev(zh)</th>
<th>MMMU val</th>
<th>MathVista</th>
<th>LLaVA Bench</th>
<th nowrap="nowrap">Object HalBench</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="12" align="left"><strong>Proprietary models</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini Pro Vision</td>
<td>- </td>
<td>74.6</td>
<td>88.1</td>
<td>680</td>
<td>63.8</td>
<td>2148.9</td>
<td>75.2</td>
<td>74.0</td>
<td>48.9</td>
<td>45.8</td>
<td>79.9</td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>- </td>
<td>78.0</td>
<td>88.4</td>
<td>645</td>
<td>63.2</td>
<td>1771.5</td>
<td>75.1</td>
<td>75.0</td>
<td>53.8</td>
<td>47.8</td>
<td>93.1</td>
<td>86.4 / 92.7</td>
</tr>
<tr>
<td colspan="12" align="left"><strong>Open-source models 6B~34B</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Yi-VL-6B</td>
<td align="right" >6.7B</td>
<td>45.5*</td>
<td>17.1*</td>
<td>290</td>
<td>49.3</td>
<td>1915.1 </td>
<td>68.6 </td>
<td>68.3 </td>
<td>40.3 </td>
<td>28.8 </td>
<td>51.9 </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Qwen-VL-Chat</td>
<td align="right" >9.6B</td>
<td>61.5</td>
<td>62.6</td>
<td>488 </td>
<td>52.1 </td>
<td>1860.0 </td>
<td>60.6 </td>
<td>56.7 </td>
<td>37.0 </td>
<td>33.8 </td>
<td>67.7 </td>
<td>56.2 / 80.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Yi-VL-34B</td>
<td align="right" >34B</td>
<td>43.4*</td>
<td>16.9*</td>
<td>290</td>
<td>52.6 </td>
<td>2050.2</td>
<td>71.1</td>
<td>71.4</td>
<td>45.1</td>
<td>30.7</td>
<td>62.3</td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >DeepSeek-VL-7B</td>
<td align="right" >7.3B</td>
<td>64.7*</td>
<td>47.0* </td>
<td>435</td>
<td>55.6 </td>
<td>1765.4 </td>
<td>74.1 </td>
<td>72.8 </td>
<td>38.3 </td>
<td>36.8</td>
<td>77.8 </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >TextMonkey</td>
<td align="right" >9.7B</td>
<td>64.3</td>
<td>66.7 </td>
<td>558</td>
<td>- </td>
<td>- </td>
<td>- </td>
<td>- </td>
<td>- </td>
<td>-</td>
<td>- </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >CogVLM-Chat</td>
<td align="right" >17.4B</td>
<td>70.4</td>
<td>33.3*</td>
<td>590 </td>
<td>52.5 </td>
<td>1736.6 </td>
<td>63.7 </td>
<td>53.8 </td>
<td>37.3 </td>
<td>34.7 </td>
<td>73.9 </td>
<td>73.6 / 87.4 </td>
</tr>
<tr>
<td colspan="12" align="left"><strong>Open-source models 1B~3B </strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >DeepSeek-VL-1.3B</td>
<td align="right" >1.7B</td>
<td>58.4*</td>
<td>37.9*</td>
<td>413</td>
<td>46.0 </td>
<td>1531.6 </td>
<td>64.0 </td>
<td>61.2 </td>
<td>33.8 </td>
<td>29.4 </td>
<td>51.1 </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >MobileVLM V2</td>
<td align="right" >3.1B</td>
<td>57.5</td>
<td>19.4*</td>
<td>-</td>
<td>-</td>
<td>1440.5(P) </td>
<td>63.2 </td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Mini-Gemini</td>
<td align="right" >2.2B</td>
<td>56.2</td>
<td>34.2*</td>
<td>-</td>
<td>-</td>
<td>1653.0 </td>
<td>59.8 </td>
<td>- </td>
<td>31.7 </td>
<td>-</td>
<td>- </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >MiniCPM-V</td>
<td align="right" >2.8B </td>
<td>60.6</td>
<td>38.2 </td>
<td>366</td>
<td>47.6</td>
<td>1650.2 </td>
<td>67.9 </td>
<td>65.3 </td>
<td><strong>38.3</strong></td>
<td>28.9</td>
<td>51.3 </td>
<td>78.4 / 88.5 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" ><strong>MiniCPM-V 2.0</strong></td>
<td align="right" >2.8B </td>
<td><strong>74.1</strong></td>
<td><strong>71.9</strong> </td>
<td><strong>605</strong></td>
<td><strong>55.0</strong></td>
<td><strong>1808.6</strong> </td>
<td><strong>69.6</strong> </td>
<td><strong>68.1</strong> </td>
<td>38.2 </td>
<td><strong>38.7</strong></td>
<td><strong>69.2</strong> </td>
<td><strong>85.5 / 92.2 </strong></td>
</tr>
</tbody>
</table>
</div>
* We evaluated the officially open-sourced model weights ourselves.
<p id="4"></p>
## Mobile Deployment
<p id="MLC"></p>
#### Deployment steps
@ -444,9 +811,7 @@ print(model.response("<用户>山东省最高的山是哪座山, 它比黄山高
#### Deployment performance
* We have not deeply optimized or systematically tested mobile inference for these models; we have only verified that MiniCPM can feasibly run inference on mobile chipsets.
* [Correction] Before this work there was already an early [effort](https://github.com/ggerganov/llama.cpp/blob/master/examples/llava/MobileVLM-README.md) to deploy multimodal LLMs on phones with llama.cpp. Here we verified on MLC-LLM that MiniCPM-V can be deployed on phones: input and output work normally, but image processing is still slow and needs further optimization, and compatibility issues also remain to be resolved :).
* **We also welcome more developers to keep tuning and updating the test list below, continually improving on-device LLM inference performance on phones.**
|Phone model|Operating system|Chipset|Memory (GB)|Text throughput (tokens/s)|
|-|-|-|-|-|
@ -470,7 +835,14 @@ print(model.response("<用户>山东省最高的山是哪座山, 它比黄山高
|iPhone 11|iOS 16.6|A13|4|4.6|
|Xiaomi Redmi K50|HyperOS 1.0.2|MediaTek Dimensity 8100|12|3.5|
![Multimodal example](https://github.com/OpenBMB/OmniLMM/blob/main/assets/Snake_cn_Mushroom_en.gif)
* We also used MLC-LLM to verify the feasibility of deploying the MiniCPM-V series on phones: input and output work normally, but image processing is still slow and needs further optimization, and compatibility issues also remain to be resolved. The animation below is an unedited screen recording of MiniCPM-V 2.0 running on a Xiaomi 14 Pro.
<table align="center">
<p align="center">
<img src="https://github.com/OpenBMB/MiniCPM-V/blob/main/assets/gif_cases/station.gif" width=36%/>
<img src="https://github.com/OpenBMB/MiniCPM-V/blob/main/assets/gif_cases/english_menu.gif" width=36%/>
</p>
</table>
<p id="5"></p>
@ -488,20 +860,34 @@ python demo/vllm_based_demo.py --model_path <vllmcpm_repo_path>
python demo/hf_based_demo.py --model_path <hf_repo_path>
```
<p id="6"></p>
## Further Development
<p id="transformer_finetune"></p>
* Parameter-efficient fine-tuning
* A single 1080/2080 GPU is enough for parameter-efficient fine-tuning
* [Parameter-efficient fine-tuning code](https://github.com/OpenBMB/MiniCPM/tree/main/finetune) (an illustrative LoRA sketch is shown below)
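
The following is only an illustrative sketch of what parameter-efficient fine-tuning of MiniCPM-2B could look like with Hugging Face PEFT (LoRA). It is not the official finetune script linked above: the dataset handling and training loop are omitted, and the `q_proj`/`v_proj` target-module names are an assumption based on a Llama-style layer layout.

```python
# Illustrative LoRA sketch (not the official MiniCPM finetune code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_path = "openbmb/MiniCPM-2B-sft-bf16"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, trust_remote_code=True
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed projection names (Llama-style naming)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters are trained
# ...plug `model` into your usual Trainer / training loop here.
```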
<p id="BMTrain"></p>
* Full-parameter fine-tuning or continued training
* With [BMTrain](https://github.com/OpenBMB/BMTrain), plus re-computation and ZeRO-3, a single 3090/4090 can do full-parameter fine-tuning, and one machine is enough for continued training
* The corresponding code will be released gradually (an illustrative sketch under stated assumptions follows below)
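
Until the official BMTrain recipes are published, the memory-saving ingredients (re-computation plus ZeRO-3) can be sketched with the Hugging Face Trainer instead. This is only an illustration under stated assumptions: the `ds_z3_config.json` path and the `train_dataset` argument are placeholders, not files or objects provided by this repository.

```python
# Illustrative sketch: full-parameter fine-tuning with gradient checkpointing
# (re-computation) and DeepSpeed ZeRO-3 via the Hugging Face Trainer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments


def full_finetune(train_dataset, model_path="openbmb/MiniCPM-2B-sft-bf16"):
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.bfloat16, trust_remote_code=True
    )
    model.gradient_checkpointing_enable()  # trade extra compute for activation memory

    args = TrainingArguments(
        output_dir="output/full_sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,
        num_train_epochs=1,
        bf16=True,
        deepspeed="ds_z3_config.json",  # placeholder: ZeRO-3 shards params, grads and optimizer states
    )
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset, tokenizer=tokenizer)
    trainer.train()
```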
<p id="mlx"></p>
* Parameter-efficient fine-tuning with mlx
* Environment setup
```shell
pip install -r finetune/requirements_mlx.txt
```
* Fine-tuning commands
```shell
# train
python mlx_finetune.py --model MiniCPM-2B-sft-bf16-llama-format-mlx --data data/AdvertiseGen --train --seed 2024 --iters 500
# test
python mlx_finetune.py --model MiniCPM-2B-sft-bf16-llama-format-mlx --data data/AdvertiseGen --test --seed 2024
```
* [Fine-tuning with LLaMA-Factory](https://github.com/OpenBMB/MiniCPM/tree/main/finetune/llama_factory_example/README.md)
<p id="9"></p>
@ -553,9 +939,8 @@ python demo/hf_based_demo.py --model_path <hf_repo_path>
#### Model License
* The code in this repository is open-sourced under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) license.
* Use of the MiniCPM model weights must follow the ["General Model License - Source Attribution - Publicity Restrictions - Commercial Authorization"](https://github.com/OpenBMB/General-Model-License/blob/main/%E9%80%9A%E7%94%A8%E6%A8%A1%E5%9E%8B%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE-%E6%9D%A5%E6%BA%90%E8%AF%B4%E6%98%8E-%E5%AE%A3%E4%BC%A0%E9%99%90%E5%88%B6-%E5%95%86%E4%B8%9A%E6%8E%88%E6%9D%83.md).
* The MiniCPM model weights are fully open for academic research.
* For commercial use, please contact cpm@modelbest.cn to obtain written authorization; free commercial use is also permitted after registration.
* Use of the MiniCPM model weights must follow the ["MiniCPM Model Commercial License Agreement"](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%E6%A8%A1%E5%9E%8B%E5%95%86%E7%94%A8%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE.md).
* The MiniCPM model weights are fully open for academic research; free commercial use is also permitted after completing the registration ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g).
#### Statement
@ -567,12 +952,13 @@ python demo/hf_based_demo.py --model_path <hf_repo_path>
## Citation
* If you find MiniCPM helpful for your work, please cite our [technical report](https://shengdinghu.notion.site/MiniCPM-c805a17c5c8046398914e47f0542095a?pvs=4)
* If you find MiniCPM helpful for your work, please cite our [paper](https://arxiv.org/abs/2404.06395)
```
@misc{minicpm2024,
title={MiniCPM: Unveiling the Potential of End-side Large Language Models},
booktitle={OpenBMB Blog},
year={2024}
@article{hu2024minicpm,
title={MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies},
author={Hu, Shengding and Tu, Yuge and Han, Xu and He, Chaoqun and Cui, Ganqu and Long, Xiang and Zheng, Zhi and Fang, Yewei and Huang, Yuxiang and Zhao, Weilin and others},
journal={arXiv preprint arXiv:2404.06395},
year={2024}
}
```

View File

@ -1,10 +1,8 @@
from typing import List
import argparse
import gradio as gr
import torch
from threading import Thread
from PIL import Image
from transformers import (
AutoModelForCausalLM,
@ -12,11 +10,12 @@ from transformers import (
TextIteratorStreamer
)
import warnings
warnings.filterwarnings('ignore', category=UserWarning, message='TypedStorage is deprecated')
parser = argparse.ArgumentParser()
parser.add_argument("--model_path", type=str, default="")
parser.add_argument("--torch_dtype", type=str, default="bfloat16", choices=["float32", "bfloat16"])
parser.add_argument("--model_path", type=str, default="openbmb/MiniCPM-2B-dpo-fp16")
parser.add_argument("--torch_dtype", type=str, default="bfloat16", choices=["float32", "bfloat16", "float16"])
parser.add_argument("--server_name", type=str, default="127.0.0.1")
parser.add_argument("--server_port", type=int, default=7860)
args = parser.parse_args()
@ -27,6 +26,8 @@ if torch_dtype == "" or torch_dtype == "bfloat16":
torch_dtype = torch.bfloat16
elif torch_dtype == "float32":
torch_dtype = torch.float32
elif torch_dtype == "float16":
torch_dtype = torch.float16
else:
raise ValueError(f"Invalid torch dtype: {torch_dtype}")
@ -47,7 +48,7 @@ def check_model_v(img_file_path: str = None):
Returns:
True if model is MiniCPMV else False
'''
if model_architectures == "MiniCPMV":
if "MiniCPMV" in model_architectures:
return True
if isinstance(img_file_path, str):
gr.Warning('Only MiniCPMV model can support Image')
@ -62,7 +63,6 @@ if check_model_v():
server_name = args.server_name
server_port = args.server_port
def hf_gen(dialog: List, top_p: float, temperature: float, repetition_penalty: float, max_dec_len: int):
"""generate model output with huggingface api
@ -74,9 +74,9 @@ def hf_gen(dialog: List, top_p: float, temperature: float, repetition_penalty: f
Yields:
str: real-time generation results of hf model
"""
"""
inputs = tokenizer.apply_chat_template(dialog, tokenize=False, add_generation_prompt=False)
enc = tokenizer(inputs, return_tensors="pt").to("cuda")
enc = tokenizer(inputs, return_tensors="pt").to(next(model.parameters()).device)
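# Reading the device from the model itself (instead of hard-coding "cuda") also
# works when device_map="auto" shards the weights across devices.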
streamer = TextIteratorStreamer(tokenizer)
generation_kwargs = dict(
enc,
@ -200,7 +200,7 @@ def clear_history():
Returns:
List: empty chat history
"""
"""
return []
@ -212,7 +212,7 @@ def reverse_last_round(chat_history):
Returns:
List: [[q_1, a_1], [q_2, a_2], ..., [q_n-1, a_n-1]]. chat_history without last round.
"""
"""
assert len(chat_history) >= 1, "History is empty. Nothing to reverse!!"
return chat_history[:-1]

42
demo/mlx_based_demo.py Normal file
View File

@ -0,0 +1,42 @@
"""
Fast MiniCPM inference with MLX.
If you are running inference on a Mac, you can use MLX directly.
Since MiniCPM does not yet support conversion to the mlx format, you can download the model already converted by the MLX community: [MiniCPM-2B-sft-bf16-llama-format-mlx](https://huggingface.co/mlx-community/MiniCPM-2B-sft-bf16-llama-format-mlx)
and install the corresponding dependency:
```bash
pip install mlx-lm
```
A simple command to run MiniCPM-2B inference on a Mac:
```bash
python -m mlx_lm.generate --model mlx-community/MiniCPM-2B-sft-bf16-llama-format-mlx --prompt "hello, tell me a joke." --trust-remote-code
```
"""
from mlx_lm import load, generate
from jinja2 import Template
def chat_with_model():
model, tokenizer = load("mlx-community/MiniCPM-2B-sft-bf16-llama-format-mlx")
print("Model loaded. Start chatting! (Type 'quit' to stop)")
messages = []
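# MiniCPM's plain-text chat format: each user turn is wrapped as <用户>...<AI>,
# and assistant turns are appended verbatim.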
chat_template = Template(
"{% for message in messages %}{% if message['role'] == 'user' %}{{'<用户>' + message['content'].strip() + '<AI>'}}{% else %}{{message['content'].strip()}}{% endif %}{% endfor %}")
while True:
user_input = input("You: ")
if user_input.lower() == 'quit':
break
messages.append({"role": "user", "content": user_input})
response = generate(model, tokenizer, prompt=chat_template.render(messages=messages), verbose=True)
print("Model:", response)
messages.append({"role": "ai", "content": response})
chat_with_model()

View File

@ -0,0 +1,55 @@
"""
This is a simple OpenAI-compatible API client. Due to the limitations of MiniCPM-2B, this script:
1. Has no tool-calling support
2. Has no system prompt
3. Supports at most 4096 tokens of text
To run this code you need to:
1. Start the local server first (it loads the model with AutoModelForCausalLM.from_pretrained and is not further optimized; modify it as needed)
2. Send requests with this client
"""
from openai import OpenAI
base_url = "http://127.0.0.1:8000/v1/"
client = OpenAI(api_key="MiniCPM-2B", base_url=base_url)
def chat(use_stream=True):
messages = [
{
"role": "user",
"content": "tell me a story"
}
]
response = client.chat.completions.create(
model="MiniCPM-2B",
messages=messages,
stream=use_stream,
max_tokens=4096, # need less than 4096 tokens
temperature=0.8,
top_p=0.8
)
if response:
if use_stream:
for chunk in response:
print(chunk.choices[0].delta.content)
else:
content = response.choices[0].message.content
print(content)
else:
print("Error: no response returned")
def embedding():
response = client.embeddings.create(
model="bge-m3",
input=["hello, I am MiniCPM-2B"],
)
embeddings = response.data[0].embedding
print("Embedding_Success", len(embeddings))
if __name__ == "__main__":
chat(use_stream=True)

View File

@ -0,0 +1,296 @@
import gc
import json
import os
import time
from threading import Thread
import tiktoken
import torch
import uvicorn
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from contextlib import asynccontextmanager
from typing import List, Literal, Optional, Union
from pydantic import BaseModel, Field
from transformers import AutoTokenizer, TextIteratorStreamer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer
from loguru import logger
from sse_starlette.sse import EventSourceResponse
EventSourceResponse.DEFAULT_PING_INTERVAL = 1000
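# Raise the SSE keep-alive ping interval (seconds) so pings rarely interleave with streamed tokens.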
MODEL_PATH = os.environ.get('MODEL_PATH', 'openbmb/MiniCPM-2B-dpo-fp16')
TOKENIZER_PATH = os.environ.get("TOKENIZER_PATH", MODEL_PATH)
EMBEDDING_PATH = os.environ.get('EMBEDDING_PATH', 'BAAI/bge-m3')
@asynccontextmanager
async def lifespan(app: FastAPI):
yield
# clean cache
if torch.cuda.is_available():
torch.cuda.empty_cache()
torch.cuda.ipc_collect()
app = FastAPI(lifespan=lifespan)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
class ModelCard(BaseModel):
id: str
object: str = "model"
created: int = Field(default_factory=lambda: int(time.time()))
owned_by: str = "owner"
root: Optional[str] = None
parent: Optional[str] = None
permission: Optional[list] = None
class ModelList(BaseModel):
object: str = "list"
data: List[ModelCard] = []
class FunctionCallResponse(BaseModel):
name: Optional[str] = None
arguments: Optional[str] = None
class ChatMessage(BaseModel):
role: Literal["user", "assistant", "system", "function"]
content: str = None
name: Optional[str] = None
class DeltaMessage(BaseModel):
role: Optional[Literal["user", "assistant", "system"]] = None
content: Optional[str] = None
class EmbeddingRequest(BaseModel):
input: List[str]
model: str
class CompletionUsage(BaseModel):
prompt_tokens: int
completion_tokens: int
total_tokens: int
class EmbeddingResponse(BaseModel):
data: list
model: str
object: str
usage: CompletionUsage
class UsageInfo(BaseModel):
prompt_tokens: int = 0
total_tokens: int = 0
completion_tokens: Optional[int] = 0
class ChatCompletionRequest(BaseModel):
model: str
messages: List[ChatMessage]
temperature: Optional[float] = 0.8
top_p: Optional[float] = 0.8
max_tokens: Optional[int] = None
stream: Optional[bool] = False
tools: Optional[Union[dict, List[dict]]] = None
repetition_penalty: Optional[float] = 1.1
class ChatCompletionResponseChoice(BaseModel):
index: int
message: ChatMessage
finish_reason: Literal["stop", "length"]
class ChatCompletionResponseStreamChoice(BaseModel):
delta: DeltaMessage
finish_reason: Optional[Literal["stop", "length"]]
index: int
class ChatCompletionResponse(BaseModel):
model: str
id: str
object: Literal["chat.completion", "chat.completion.chunk"]
choices: List[Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice]]
created: Optional[int] = Field(default_factory=lambda: int(time.time()))
usage: Optional[UsageInfo] = None
@app.get("/v1/models", response_model=ModelList)
async def list_models():
model_card = ModelCard(
id="MiniCPM-2B"
)
return ModelList(
data=[model_card]
)
def generate_minicpm(model: AutoModelForCausalLM, tokenizer: AutoTokenizer, params: dict):
messages = params["messages"]
temperature = float(params.get("temperature", 1.0))
repetition_penalty = float(params.get("repetition_penalty", 1.0))
top_p = float(params.get("top_p", 1.0))
max_new_tokens = int(params.get("max_tokens", 256))
inputs = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
enc = tokenizer(inputs, return_tensors="pt").to(model.device)
input_echo_len = len(enc["input_ids"][0])
if input_echo_len >= model.config.max_length:
logger.error(f"Input length larger than {model.config.max_length}")
return
streamer = TextIteratorStreamer(tokenizer)
generation_kwargs = {
**enc,
"do_sample": True if temperature > 1e-5 else False,
"top_k": 0,
"top_p": top_p,
"temperature": temperature,
"repetition_penalty": repetition_penalty,
"max_new_tokens": max_new_tokens,
"pad_token_id": tokenizer.eos_token_id,
"streamer": streamer,
}
eos_token = tokenizer.eos_token
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
response = ""
for new_text in streamer:
new_text = new_text.split(eos_token)[0] if eos_token in new_text else new_text
response += new_text
current_length = len(new_text)
yield {
"text": response[5 + len(inputs):],
"usage": {
"prompt_tokens": input_echo_len,
"completion_tokens": current_length - input_echo_len,
"total_tokens": len(response),
},
"finish_reason": "",
}
thread.join()
gc.collect()
torch.cuda.empty_cache()
@app.post("/v1/embeddings", response_model=EmbeddingResponse)
async def get_embeddings(request: EmbeddingRequest):
embeddings = [embedding_model.encode(text) for text in request.input]
embeddings = [embedding.tolist() for embedding in embeddings]
def num_tokens_from_string(string: str) -> int:
encoding = tiktoken.get_encoding('cl100k_base')
num_tokens = len(encoding.encode(string))
return num_tokens
response = {
"data": [
{
"object": "embedding",
"embedding": embedding,
"index": index
}
for index, embedding in enumerate(embeddings)
],
"model": request.model,
"object": "list",
"usage": CompletionUsage(
prompt_tokens=sum(len(text.split()) for text in request.input),
completion_tokens=0,
total_tokens=sum(num_tokens_from_string(text) for text in request.input),
)
}
return response
@app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
async def create_chat_completion(request: ChatCompletionRequest):
global model, tokenizer
if len(request.messages) < 1 or request.messages[-1].role == "assistant":
raise HTTPException(status_code=400, detail="Invalid request")
gen_params = dict(
messages=request.messages,
temperature=request.temperature,
top_p=request.top_p,
max_tokens=request.max_tokens or 2048,
echo=False,
repetition_penalty=request.repetition_penalty,
tools=request.tools,
)
logger.debug(f"==== request ====\n{gen_params}")
input_tokens = sum(len(tokenizer.encode(msg.content)) for msg in request.messages)
if request.stream:
async def stream_response():
previous_text = ""
for new_response in generate_minicpm(model, tokenizer, gen_params):
delta_text = new_response["text"][len(previous_text):]
previous_text = new_response["text"]
delta = DeltaMessage(content=delta_text, role="assistant")
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=delta,
finish_reason=None
)
chunk = {
"model": request.model,
"id": "",
"choices": [choice_data.dict(exclude_none=True)],
"object": "chat.completion.chunk"
}
yield json.dumps(chunk) + "\n"
return EventSourceResponse(stream_response(), media_type="text/event-stream")
else:
generated_text = ""
for response in generate_minicpm(model, tokenizer, gen_params):
generated_text = response["text"]
generated_text = generated_text.strip()
output_tokens = len(tokenizer.encode(generated_text))
usage = UsageInfo(
prompt_tokens=input_tokens,
completion_tokens=output_tokens,
total_tokens=output_tokens + input_tokens
)
message = ChatMessage(role="assistant", content=generated_text)
logger.debug(f"==== message ====\n{message}")
choice_data = ChatCompletionResponseChoice(
index=0,
message=message,
finish_reason="stop",
)
return ChatCompletionResponse(
model=request.model,
id="",
choices=[choice_data],
object="chat.completion",
usage=usage
)
if __name__ == "__main__":
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.bfloat16, device_map="auto",
trust_remote_code=True)
embedding_model = SentenceTransformer(EMBEDDING_PATH, device="cuda")
uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)

View File

@ -1,36 +1,46 @@
from typing import Dict
from typing import List
from typing import Tuple
import argparse
import gradio as gr
from vllm import LLM, SamplingParams
import torch
from transformers import AutoTokenizer
parser = argparse.ArgumentParser()
parser.add_argument("--model_path", type=str, default="")
parser.add_argument("--model_path", type=str, default="openbmb/MiniCPM-1B-sft-bf16")
parser.add_argument("--torch_dtype", type=str, default="bfloat16", choices=["float32", "bfloat16"])
parser.add_argument("--server_name", type=str, default="127.0.0.1")
parser.add_argument("--server_port", type=int, default=7860)
args = parser.parse_args()
parser.add_argument("--max_tokens", type=int, default=2048)
# for MiniCPM-1B and MiniCPM-2B model, max_tokens should be set to 2048
args = parser.parse_args()
# init model torch dtype
torch_dtype = args.torch_dtype
if torch_dtype =="" or torch_dtype == "bfloat16":
torch_dtype = "bfloat16"
if torch_dtype == "" or torch_dtype == "bfloat16":
torch_dtype = torch.bfloat16
elif torch_dtype == "float32":
torch_dtype = "float32"
torch_dtype = torch.float32
elif torch_dtype == "float16":
torch_dtype = torch.float16
else:
raise ValueError(f"Invalid torch dtype: {torch_dtype}")
# init model and tokenizer
path = args.model_path
llm = LLM(model=path, tensor_parallel_size=1, dtype=torch_dtype)
llm = LLM(
model=path,
tensor_parallel_size=1,
dtype=torch_dtype,
trust_remote_code=True,
gpu_memory_utilization=0.9,
max_model_len=args.max_tokens
)
tokenizer = AutoTokenizer.from_pretrained(args.model_path, trust_remote_code=True)
server_name = args.server_name
server_port = args.server_port
# init gradio demo host and port
server_name=args.server_name
server_port=args.server_port
def vllm_gen(dialog: List, top_p: float, temperature: float, max_dec_len: int):
"""generate model output with huggingface api
@ -43,19 +53,14 @@ def vllm_gen(dialog: List, top_p: float, temperature: float, max_dec_len: int):
Yields:
str: real-time generation results of the vLLM model
"""
prompt = ""
"""
assert len(dialog) % 2 == 1
for info in dialog:
if info["role"] == "user":
prompt += "<用户>" + info["content"]
else:
prompt += "<AI>" + info["content"]
prompt += "<AI>"
prompt = tokenizer.apply_chat_template(dialog, tokenize=False, add_generation_prompt=False)
token_ids = tokenizer.convert_tokens_to_ids(["<|im_end|>"])
params_dict = {
"n": 1,
"best_of": 1,
"presence_penalty": 1.0,
"presence_penalty": 1.0,
"frequency_penalty": 0.0,
"temperature": temperature,
"top_p": top_p,
@ -63,8 +68,8 @@ def vllm_gen(dialog: List, top_p: float, temperature: float, max_dec_len: int):
"use_beam_search": False,
"length_penalty": 1,
"early_stopping": False,
"stop": None,
"stop_token_ids": None,
"stop": "<|im_end|>",
"stop_token_ids": token_ids,
"ignore_eos": False,
"max_tokens": max_dec_len,
"logprobs": None,
@ -89,7 +94,7 @@ def generate(chat_history: List, query: str, top_p: float, temperature: float, m
Yields:
List: [[q_1, a_1], [q_2, a_2], ..., [q_n, a_n], [q_n+1, a_n+1]]. chat_history + QA of current round.
"""
"""
assert query != "", "Input must not be empty!!!"
# apply chat template
model_input = []
@ -114,7 +119,7 @@ def regenerate(chat_history: List, top_p: float, temperature: float, max_dec_len
Yields:
List: [[q_1, a_1], [q_2, a_2], ..., [q_n, a_n]]. chat_history
"""
"""
assert len(chat_history) >= 1, "History is empty. Nothing to regenerate!!"
# apply chat template
model_input = []
@ -133,7 +138,7 @@ def clear_history():
Returns:
List: empty chat history
"""
"""
return []
@ -145,7 +150,7 @@ def reverse_last_round(chat_history):
Returns:
List: [[q_1, a_1], [q_2, a_2], ..., [q_n-1, a_n-1]]. chat_history without last round.
"""
"""
assert len(chat_history) >= 1, "History is empty. Nothing to reverse!!"
return chat_history[:-1]
@ -158,7 +163,7 @@ with gr.Blocks(theme="soft") as demo:
with gr.Column(scale=1):
top_p = gr.Slider(0, 1, value=0.8, step=0.1, label="top_p")
temperature = gr.Slider(0.1, 2.0, value=0.5, step=0.1, label="temperature")
max_dec_len = gr.Slider(1, 1024, value=1024, step=1, label="max_dec_len")
max_dec_len = gr.Slider(1, args.max_tokens, value=args.max_tokens, step=1, label="max_tokens")
with gr.Column(scale=5):
chatbot = gr.Chatbot(bubble_full_width=False, height=400)
user_input = gr.Textbox(label="User", placeholder="Input your query here!", lines=8)

View File

@ -0,0 +1,66 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. 准备数据集\n",
"\n",
"将数据集转换为更通用的格式\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# 转换为 ChatML 格式\n",
"import os\n",
"import shutil\n",
"import json\n",
"\n",
"input_dir = \"data/AdvertiseGen\"\n",
"output_dir = \"data/mlx_AdvertiseGen\"\n",
"if os.path.exists(output_dir):\n",
" shutil.rmtree(output_dir)\n",
"os.makedirs(output_dir, exist_ok=True)\n",
"\n",
"for fn in [\"train.json\", \"dev.json\"]:\n",
" data_out_list = []\n",
" with open(os.path.join(input_dir, fn), \"r\") as f, open(os.path.join(output_dir, fn), \"w\") as fo:\n",
" for line in f:\n",
" if len(line.strip()) > 0:\n",
" data = json.loads(line)\n",
" data_out = {\"input\":data['content'],'prompt':\"/n请为以下关键词生成一条广告语。\",'output':data['summary']}\n",
" data_out_list.append(data_out)\n",
"\n",
" for d in data_out_list:\n",
" json_str = json.dumps(d,ensure_ascii=False) # 将字典转换为JSON字符串\n",
" fo.write(json_str + '\\n') # 写入字符串并添加换行符\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@ -12,7 +12,7 @@ from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
@dataclass
class ModelArguments:
model_name_or_path: Optional[str] = field(default="baichuan-inc/Baichuan2-7B-Base")
model_name_or_path: Optional[str] = field(default="openbmb/MiniCPM-2B-sft-bf16")
@dataclass
@ -42,21 +42,21 @@ class TrainingArguments(transformers.TrainingArguments):
class SupervisedDataset(Dataset):
"""Dataset for supervised fine-tuning."""
def __init__(
self,
data_path,
tokenizer,
model_max_length=4096,
user_tokens=[1786, 4194, 95388],
assistant_tokens=[1786, 10850, 95388],
user_tokens='<用户>',
assistant_tokens='<AI>',
):
super(SupervisedDataset, self).__init__()
self.data = json.load(open(data_path))
self.tokenizer = tokenizer
self.model_max_length = model_max_length
self.user_tokens = user_tokens
self.assistant_tokens = assistant_tokens
self.user_tokens = self.tokenizer.encode(user_tokens)  # for any tokenizer, this resolves to the token ids of <用户>
self.assistant_tokens = self.tokenizer.encode(assistant_tokens)  # for any tokenizer, this resolves to the token ids of <AI>
self.ignore_index = -100
item = self.preprocessing(self.data[0])
print("input:", self.tokenizer.decode(item["input_ids"]))
@ -64,7 +64,6 @@ class SupervisedDataset(Dataset):
for id_ in item["label_ids"]:
if id_ == -100:
continue
labels.append(id_)
print("label:", self.tokenizer.decode(labels))
@ -171,6 +170,8 @@ if __name__ == "__main__":
model_path=model_args.model_name_or_path,
max_length=training_args.model_max_length,
use_lora=training_args.use_lora,
bf16=training_args.bf16,
fp16=training_args.fp16
)
train_dataset = SupervisedDataset(
@ -194,4 +195,4 @@ if __name__ == "__main__":
trainer.train()
# save the incremental PEFT weights, more details can be found in https://huggingface.co/blog/peft
# model.save_pretrained("output_dir")
# model.save_pretrained("output_dir")

View File

@ -0,0 +1,101 @@
# Fine-tuning MiniCPM with LLaMA-Factory
MiniCPM supports fine-tuning with LLaMA-Factory, which covers continued pre-training, SFT, PPO, DPO, KTO, ORPO, and more.
Since LLaMA-Factory is powerful but can be hard for beginners to pick up, we have recorded a fine-tuning tutorial.
**We provide the llama_factory_example folder for fine-tuning the MiniCPM-1B and MiniCPM-2B models.**
1. First, install the LLaMA-Factory dependencies.
```bash
git clone https://github.com/hiyouga/LLaMA-Factory
cd LLaMA-Factory
pip install -r requirements.txt
```
2. Convert your dataset into the format used in the MiniCPM/finetune/llama_factory_example/llama_factory_data folder (examples cover the DPO, KTO, and SFT fine-tuning methods) and place it under the llama_factory/data directory. Taking DPO as an example:
```json
[
{
"conversations": [
{
"from": "human",
"value": "Hi! I'd like to create a new language game simulating the first person perspective of a character named Angela."
}
],
"chosen": {
"from": "gpt",
"value": "That sounds like a fun and engaging idea! Here are some tips to help you create the game:\n1. ......"
},
"rejected": {
"from": "gpt",
"value": "Hello! I'd be happy to help you create a language game simulating the first-person perspective ....."
}
}
]
```
3. Add your dataset's information to llama_factory/data/dataset_info.json so that your dataset can be found there, as in the following example:
``` json
{"identity": {
"file_name": "identity.json"
},
"sft_zh_demo": {
"file_name": "alpaca_zh_demo.json"
},
"kto_en_demo": {
"file_name": "kto_en_demo.json",
"formatting": "sharegpt",
"columns": {
"messages": "messages",
"kto_tag": "label"
},
"tags": {
"role_tag": "role",
"content_tag": "content",
"user_tag": "user",
"assistant_tag": "assistant"
}
},
"dpo_en_demo": {
"file_name": "dpo_en_demo.json",
"ranking": true,
"formatting": "sharegpt",
"columns": {
"messages": "conversations",
"chosen": "chosen",
"rejected": "rejected"
}
}
}
```
4. Copy the files from MiniCPM/finetune/llama_factory_example into the LLaMA-Factory/examples directory.
```bash
cd LLaMA-Factory/examples
mkdir minicpm
# replace /your/path below with your local MiniCPM and LLaMA-Factory paths
cp -r /your/path/MiniCPM/finetune/llama_factory_example/* /your/path/LLaMA-Factory/examples/minicpm
```
5. Taking DPO as an example, first edit minicpm_dpo.yaml; the fields that need to be changed are:
```yaml
model_name_or_path: openbmb/MiniCPM-2B-sft-bf16 # or the path where you saved the model locally
dataset: dpo_en_demo # the key name registered in dataset_info.json
output_dir: your/finetune_minicpm/save/path
bf16: true # set to true if your device supports bf16, otherwise false
deepspeed: examples/deepspeed/ds_z2_config.json # switch to ds_z3_config.json if GPU memory is insufficient
```
6. Edit single_node.sh:
- 1. If you are running on A100-class or higher-end servers, delete the following two lines:
```bash
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
```
- 2. Set the GPUs that should take part in fine-tuning; the example below uses all eight GPUs (indices 0-7):
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
```
- 3. In the line below, change the argument after src/train.py to the absolute path of minicpm_dpo.yaml inside LLaMA-Factory:
```bash
src/train.py /root/ld/ld_project/LLaMA-Factory/examples/minicpm/minicpm_sft.yaml
```
7. Run:
```bash
cd LLaMA-Factory
bash single_node.sh
```

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,42 @@
### model
model_name_or_path: /root/ld/ld_project/LLaMA-Factory/saves/minicpm/full/sft/
### method
stage: dpo
do_train: true
finetuning_type: full
### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z2_config.json
### dataset
dataset: dpo_en_demo
template: cpm
cutoff_len: 1200
max_samples: 50000000
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: saves/minicpm/dpo
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_strategy: epoch
### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 0.00001
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_steps: 0.1
bf16: true
### eval
val_size: 0.1
per_device_eval_batch_size: 4
evaluation_strategy: steps
eval_steps: 500

View File

@ -0,0 +1,42 @@
### model
model_name_or_path: /root/ld/ld_model_pretrain/MiniCPM-1B-sft-bf16/
### method
stage: kto
do_train: true
finetuning_type: full
kto_ftx: 0.1
### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z2_config.json
### dataset
dataset: kto_harmless
template: cpm
cutoff_len: 1200
max_samples: 500000
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: saves/minicpm/kto
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
### train
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
learning_rate: 0.000005
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_steps: 0.1
bf16: true
### eval
val_size: 0.1
per_device_eval_batch_size: 16
evaluation_strategy: steps
eval_steps: 500

View File

@ -0,0 +1,41 @@
### model
model_name_or_path: /root/ld/ld_model_pretrained/miniCPM-bf16/
### method
stage: sft
do_train: true
finetuning_type: full
### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z2_config.json
### dataset
dataset: glaive_toolcall_en,glaive_toolcall_zh
template: cpm
cutoff_len: 1800
max_samples: 500000
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: saves/minicpm/fuction_call
logging_steps: 10
save_strategy: epoch
plot_loss: true
overwrite_output_dir: true
### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_steps: 0.1
bf16: true
### eval
val_size: 0.1
per_device_eval_batch_size: 4
evaluation_strategy: steps
eval_steps: 500

View File

@ -0,0 +1,16 @@
#!/bin/bash
NPROC_PER_NODE=8
NNODES=1
RANK=0
MASTER_ADDR=127.0.0.1
MASTER_PORT=29500
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun \
--nproc_per_node $NPROC_PER_NODE \
--nnodes $NNODES \
--node_rank $RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT \
src/train.py /root/ld/ld_project/LLaMA-Factory/examples/minicpm/minicpm_sft.yaml

742
finetune/mlx_finetune.py Normal file
View File

@ -0,0 +1,742 @@
# Copyright © 2023-2024 Apple Inc.
"""
This script demonstrates how to LoRA fine-tune MiniCPM on the AdvertiseGen dataset in mlx.
The code is adapted from https://github.com/ml-explore/mlx-examples.
The model used is https://huggingface.co/mlx-community/MiniCPM-2B-sft-bf16-llama-format-mlx
Run this script as follows:
train:
First process the data by running data_processing.ipynb, then:
python mlx_finetune.py --model MiniCPM-2B-sft-bf16-llama-format-mlx --data data/mlx_AdvertiseGen --train --seed 2024 --iters 500
The output looks like:
Training
Iter 1: Val loss 4.015, Val took 1067.669s
Iter 2: Val loss 4.001, Val took 1061.649s
...
After training, an adapters.npz file is saved in the folder for later testing; then run the test command:
test:
python mlx_finetune.py --model MiniCPM-2B-sft-bf16-llama-format-mlx --data data/mlx_AdvertiseGen --test --seed 2024
The output looks like:
Testing
Test loss 3.977, Test ppl 53.350.
"""
import argparse
import json
import time
from pathlib import Path
from typing import Generator
import transformers
import numpy as np
from huggingface_hub import snapshot_download
import glob
import inspect
import math
from dataclasses import dataclass
from typing import Dict, Optional, Tuple, Union
from mlx.utils import tree_flatten, tree_unflatten
import mlx.optimizers as optim
import mlx.core as mx
import mlx.nn as nn
@dataclass
class ModelArgs:
hidden_size: int
num_hidden_layers: int
intermediate_size: int
num_attention_heads: int
rms_norm_eps: float
vocab_size: int
num_key_value_heads: int = None
rope_theta: float = 10000
rope_traditional: bool = False
model_type: str = None
rope_scaling: Optional[Dict[str, Union[float, str]]] = None
def __post_init__(self):
if self.num_key_value_heads is None:
self.num_key_value_heads = self.num_attention_heads
if self.rope_scaling:
required_keys = {"factor", "type"}
if not all(key in self.rope_scaling for key in required_keys):
raise ValueError(f"rope_scaling must contain keys {required_keys}")
if self.rope_scaling["type"] != "linear":
raise ValueError("rope_scaling 'type' currently only supports 'linear'")
@classmethod
def from_dict(cls, params):
return cls(
**{
k: v
for k, v in params.items()
if k in inspect.signature(cls).parameters
}
)
class LoRALinear(nn.Module):
@staticmethod
def from_linear(linear: nn.Linear, rank: int = 8):
# TODO remove when input_dims and output_dims are attributes
# on linear and quantized linear
output_dims, input_dims = linear.weight.shape
if isinstance(linear, nn.QuantizedLinear):
input_dims *= 32 // linear.bits
lora_lin = LoRALinear(input_dims, output_dims, rank)
lora_lin.linear = linear
return lora_lin
def to_linear(self):
linear = self.linear
bias = "bias" in linear
weight = linear.weight
is_quantized = isinstance(linear, nn.QuantizedLinear)
# Use the same type as the linear weight if not quantized
dtype = weight.dtype
if is_quantized:
dtype = mx.float16
weight = mx.dequantize(
weight,
linear.scales,
linear.biases,
linear.group_size,
linear.bits,
)
output_dims, input_dims = weight.shape
fused_linear = nn.Linear(input_dims, output_dims, bias=bias)
lora_b = (self.scale * self.lora_b.T).astype(dtype)
lora_a = self.lora_a.T.astype(dtype)
fused_linear.weight = weight + lora_b @ lora_a
if bias:
fused_linear.bias = linear.bias
if is_quantized:
fused_linear = nn.QuantizedLinear.from_linear(
fused_linear,
linear.group_size,
linear.bits,
)
return fused_linear
def __init__(
self,
input_dims: int,
output_dims: int,
lora_rank: int = 8,
bias: bool = False,
scale: float = 20.0,
):
super().__init__()
# Regular linear layer weights
self.linear = nn.Linear(input_dims, output_dims, bias=bias)
# Scale for low-rank update
self.scale = scale
# Low rank lora weights
scale = 1 / math.sqrt(input_dims)
self.lora_a = mx.random.uniform(
low=-scale,
high=scale,
shape=(input_dims, lora_rank),
)
self.lora_b = mx.zeros(shape=(lora_rank, output_dims))
def __call__(self, x):
dtype = self.linear.weight.dtype
if isinstance(self.linear, nn.QuantizedLinear):
dtype = self.linear.scales.dtype
y = self.linear(x.astype(dtype))
z = (x @ self.lora_a) @ self.lora_b
return y + self.scale * z
class Attention(nn.Module):
def __init__(self, args: ModelArgs):
super().__init__()
dim = args.hidden_size
self.n_heads = n_heads = args.num_attention_heads
self.n_kv_heads = n_kv_heads = args.num_key_value_heads
self.repeats = n_heads // n_kv_heads
head_dim = args.hidden_size // n_heads
self.scale = head_dim ** -0.5
self.q_proj = nn.Linear(dim, n_heads * head_dim, bias=False)
self.k_proj = nn.Linear(dim, n_kv_heads * head_dim, bias=False)
self.v_proj = nn.Linear(dim, n_kv_heads * head_dim, bias=False)
self.o_proj = nn.Linear(n_heads * head_dim, dim, bias=False)
rope_scale = (
1 / args.rope_scaling["factor"]
if args.rope_scaling is not None and args.rope_scaling["type"] == "linear"
else 1
)
self.rope = nn.RoPE(
head_dim,
traditional=args.rope_traditional,
base=args.rope_theta,
scale=rope_scale,
)
def __call__(
self,
x: mx.array,
mask: Optional[mx.array] = None,
cache: Optional[Tuple[mx.array, mx.array]] = None,
) -> mx.array:
B, L, D = x.shape
queries, keys, values = self.q_proj(x), self.k_proj(x), self.v_proj(x)
# Prepare the queries, keys and values for the attention computation
queries = queries.reshape(B, L, self.n_heads, -1).transpose(0, 2, 1, 3)
keys = keys.reshape(B, L, self.n_kv_heads, -1).transpose(0, 2, 1, 3)
values = values.reshape(B, L, self.n_kv_heads, -1).transpose(0, 2, 1, 3)
if cache is not None:
key_cache, value_cache = cache
queries = self.rope(queries, offset=key_cache.shape[2])
keys = self.rope(keys, offset=key_cache.shape[2])
keys = mx.concatenate([key_cache, keys], axis=2)
values = mx.concatenate([value_cache, values], axis=2)
else:
queries = self.rope(queries)
keys = self.rope(keys)
output = mx.fast.scaled_dot_product_attention(
queries, keys, values, scale=self.scale, mask=mask
)
output = output.transpose(0, 2, 1, 3).reshape(B, L, -1)
return self.o_proj(output), (keys, values)
class MLP(nn.Module):
def __init__(self, dim, hidden_dim):
super().__init__()
self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
self.down_proj = nn.Linear(hidden_dim, dim, bias=False)
self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
def __call__(self, x) -> mx.array:
return self.down_proj(nn.silu(self.gate_proj(x)) * self.up_proj(x))
class TransformerBlock(nn.Module):
def __init__(self, args: ModelArgs):
super().__init__()
self.num_attention_heads = args.num_attention_heads
self.hidden_size = args.hidden_size
self.self_attn = Attention(args)
self.mlp = MLP(args.hidden_size, args.intermediate_size)
self.input_layernorm = nn.RMSNorm(args.hidden_size, eps=args.rms_norm_eps)
self.post_attention_layernorm = nn.RMSNorm(
args.hidden_size, eps=args.rms_norm_eps
)
self.args = args
def __call__(
self,
x: mx.array,
mask: Optional[mx.array] = None,
cache: Optional[Tuple[mx.array, mx.array]] = None,
) -> mx.array:
r, cache = self.self_attn(self.input_layernorm(x), mask, cache)
h = x + r
r = self.mlp(self.post_attention_layernorm(h))
out = h + r
return out, cache
class LlamaModel(nn.Module):
def __init__(self, args: ModelArgs):
super().__init__()
self.args = args
self.vocab_size = args.vocab_size
self.num_hidden_layers = args.num_hidden_layers
assert self.vocab_size > 0
self.embed_tokens = nn.Embedding(args.vocab_size, args.hidden_size)
self.layers = [
TransformerBlock(args=args) for _ in range(args.num_hidden_layers)
]
self.norm = nn.RMSNorm(args.hidden_size, eps=args.rms_norm_eps)
def __call__(
self,
inputs: mx.array,
cache=None,
):
h = self.embed_tokens(inputs)
mask = None
if h.shape[1] > 1:
mask = nn.MultiHeadAttention.create_additive_causal_mask(h.shape[1])
mask = mask.astype(h.dtype)
if cache is None:
cache = [None] * len(self.layers)
for e, layer in enumerate(self.layers):
h, cache[e] = layer(h, mask, cache[e])
return self.norm(h), cache
class Model(nn.Module):
def __init__(self, args: ModelArgs):
super().__init__()
self.model = LlamaModel(args)
self.lm_head = nn.Linear(args.hidden_size, args.vocab_size, bias=False)
def __call__(
self,
inputs: mx.array,
cache=None,
):
out, cache = self.model(inputs, cache)
return self.lm_head(out), cache
def build_parser():
parser = argparse.ArgumentParser(description="LoRA or QLoRA finetuning.")
parser.add_argument(
"--model",
default="/Users/liudan/Downloads/模型/llamaformat_minicpm",
help="The path to the local model directory or Hugging Face repo.",
)
# Generation args
parser.add_argument(
"--max-tokens",
"-m",
type=int,
default=100,
help="The maximum number of tokens to generate",
)
parser.add_argument(
"--temp", type=float, default=0.8, help="The sampling temperature"
)
parser.add_argument(
"--prompt",
"-p",
type=str,
help="The prompt for generation"
)
# Training args
parser.add_argument(
"--train",
action="store_true",
help="Do training",
)
parser.add_argument(
"--data",
type=str,
default="data/mlx_AdvertiseGen",
help="Directory with {train, valid, test}.json files",
)
parser.add_argument(
"--lora-layers",
type=int,
default=16,
help="Number of layers to fine-tune",
)
parser.add_argument("--batch-size", type=int, default=4, help="Minibatch size.")
parser.add_argument(
"--iters", type=int, default=1000, help="Iterations to train for."
)
parser.add_argument(
"--val-batches",
type=int,
default=25,
help="Number of validation batches, -1 uses the entire validation set.",
)
parser.add_argument(
"--learning-rate", type=float, default=1e-5, help="Adam learning rate."
)
parser.add_argument(
"--steps-per-report",
type=int,
default=10,
help="Number of training steps between loss reporting.",
)
parser.add_argument(
"--steps-per-eval",
type=int,
default=200,
help="Number of training steps between validations.",
)
parser.add_argument(
"--resume-adapter-file",
type=str,
default=None,
help="Load path to resume training with the given adapter weights.",
)
parser.add_argument(
"--adapter-file",
type=str,
default="adapters.npz",
help="Save/load path for the trained adapter weights.",
)
parser.add_argument(
"--save-every",
type=int,
default=100,
help="Save the model every N iterations.",
)
parser.add_argument(
"--test",
action="store_true",
help="Evaluate on the test set after training",
)
parser.add_argument(
"--test-batches",
type=int,
default=500,
help="Number of test set batches, -1 uses the entire test set.",
)
parser.add_argument("--seed", type=int, default=0, help="The PRNG seed")
return parser
class ConversationDataset:
def __init__(self, path: Path):
with open(path, "r") as fid:
self._data = [json.loads(l) for l in fid]
def __getitem__(self, idx: int):
entry = self._data[idx]
content = entry.get("input", "")
summary = entry.get("output", "")
prompt = entry.get("prompt", "")
return prompt, content, summary
def __len__(self):
return len(self._data)
def load(args):
def load_and_check(name):
dataset_path = Path(args.data) / f"{name}.json"
try:
return ConversationDataset(dataset_path)
except Exception as e:
print(f"Unable to build dataset {dataset_path} ({e})")
raise
names = ("train", "dev", "dev")
train, valid, test = (load_and_check(n) for n in names)
if args.train and len(train) == 0:
raise ValueError(
"Training set not found or empty. Must provide training set for fine-tuning."
)
if args.train and len(valid) == 0:
raise ValueError(
"Validation set not found or empty. Must provide validation set for fine-tuning."
)
if args.test and len(test) == 0:
raise ValueError(
"Test set not found or empty. Must provide test set for evaluation."
)
return train, valid, test
def loss(model, inputs, targets, lengths):
logits, _ = model(inputs)
logits = logits.astype(mx.float32)
length_mask = mx.arange(inputs.shape[1])[None, :] < lengths[:, None]
ce = nn.losses.cross_entropy(logits, targets) * length_mask
ntoks = length_mask.sum()
ce = ce.sum() / ntoks
return ce, ntoks
def iterate_batches(dset, tokenizer, batch_size, train=False):
# Shuffle indices
while True:
indices = np.arange(len(dset))
if train:
indices = np.random.permutation(indices)
# Collect batches from dataset
for i in range(0, len(indices) - batch_size + 1, batch_size):
# Encode batch
batch_samples=[dset[indices[i + j]] for j in range(batch_size)]
batch_format_text=['<用户>{}<AI>{}'.format(i[1]+i[0],i[2]) for i in batch_samples]
batch = [tokenizer.encode(i)+[tokenizer.eos_token_id] for i in batch_format_text]
lengths = [len(x) for x in batch]
# Check if any sequence is longer than 2048 tokens
if max(lengths) > 2048:
print(
"[WARNING] Some sequences are longer than 2048 tokens. "
"Consider pre-splitting your data to save memory."
)
# Pad to the max length
batch_arr = np.zeros((batch_size, max(lengths)), np.int32)
for j in range(batch_size):
batch_arr[j, : lengths[j]] = batch[j]
batch = mx.array(batch_arr)
yield batch[:, :-1], batch[:, 1:], mx.array(lengths)
if not train:
break
def load_model(path_or_hf_repo: str):
# If the path exists, the model is loaded from it;
# otherwise it is downloaded from the Hugging Face repo and cached locally.
model_path = Path(path_or_hf_repo)
if not model_path.exists():
model_path = Path(
snapshot_download(
repo_id=path_or_hf_repo,
allow_patterns=["*.json", "*.safetensors", "tokenizer.model"],
)
)
with open(model_path / "config.json", "r") as f:
config = json.loads(f.read())
quantization = config.get("quantization", None)
weight_files = glob.glob(str(model_path / "*.safetensors"))
if len(weight_files) == 0:
raise FileNotFoundError("No safetensors found in {}".format(model_path))
weights = {}
for wf in weight_files:
weights.update(mx.load(wf).items())
model_args = ModelArgs.from_dict(config)
model = Model(model_args)
if quantization is not None:
nn.QuantizedLinear.quantize_module(
model,
**quantization,
linear_class_predicate=lambda m: isinstance(m, nn.Linear)
and m.weight.shape[0] != 8,
)
model.load_weights(list(weights.items()))
mx.eval(model.parameters())
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
return model, tokenizer, config
def generate(
prompt: mx.array, model: nn.Module, temp: float = 0.0
) -> Generator[mx.array, None, None]:
"""
Generate text based on the given prompt and model.
Args:
prompt (mx.array): The input prompt.
model (nn.Module): The model to use for generation.
temp (float): The temperature for sampling. If temp is 0, use max sampling.
Yields:
mx.array: The generated text.
"""
def sample(logits: mx.array) -> mx.array:
return (
mx.argmax(logits, axis=-1)
if temp == 0
else mx.random.categorical(logits * (1 / temp))
)
y = prompt
cache = None
while True:
logits, cache = model(y[None], cache=cache)
logits = logits[:, -1, :]
y = sample(logits)
yield y
def evaluate(model, dataset, loss, tokenizer, batch_size, num_batches):
all_losses = []
ntokens = 0
for it, batch in zip(
range(num_batches),
iterate_batches(dataset, tokenizer, batch_size),
):
losses, toks = loss(model, *batch)
all_losses.append((losses * toks).item())
ntokens += toks.item()
return np.sum(all_losses) / ntokens
def train(model, train_set, val_set, optimizer, loss, tokenizer, args):
# Create value and grad function for loss
loss_value_and_grad = nn.value_and_grad(model, loss)
losses = []
n_tokens = 0
# Main training loop
start = time.perf_counter()
for it, batch in zip(
range(args.iters),
iterate_batches(train_set, tokenizer, args.batch_size, train=True),
):
# Forward and backward pass
(lvalue, toks), grad = loss_value_and_grad(model, *batch)
# Model update
optimizer.update(model, grad)
mx.eval(model.parameters(), optimizer.state, lvalue)
# Record loss
losses.append(lvalue.item())
n_tokens += toks.item()
if (it + 1) % args.steps_per_report == 0:
train_loss = np.mean(losses)
stop = time.perf_counter()
print(
f"Iter {it + 1}: Train loss {train_loss:.3f}, "
f"It/sec {args.steps_per_report / (stop - start):.3f}, "
f"Tokens/sec {float(n_tokens) / (stop - start):.3f}"
)
losses = []
n_tokens = 0
start = time.perf_counter()
# Report validation loss if needed
if it == 0 or (it + 1) % args.steps_per_eval == 0:
stop = time.perf_counter()
val_loss = evaluate(
model, val_set, loss, tokenizer, args.batch_size, args.val_batches
)
print(
f"Iter {it + 1}: "
f"Val loss {val_loss:.3f}, "
f"Val took {(time.perf_counter() - stop):.3f}s"
)
start = time.perf_counter()
# Save adapter weights if needed
if (it + 1) % args.save_every == 0:
mx.savez(
args.adapter_file, **dict(tree_flatten(model.trainable_parameters()))
)
print(f"Iter {it + 1}: Saved adapter weights to {args.adapter_file}.")
def generate_string(model, prompt, tokenizer, args):
print(prompt, end="", flush=True)
prompt = mx.array(tokenizer.encode(prompt))
tokens = []
skip = 0
for token, n in zip(
generate(prompt, model, args.temp),
range(args.max_tokens),
):
if token == tokenizer.eos_token_id:
break
tokens.append(token.item())
s = tokenizer.decode(tokens)
if len(s) - skip > 1:
print(s[skip:-1], end="", flush=True)
skip = len(s) - 1
print(tokenizer.decode(tokens)[skip:], flush=True)
print("=" * 10)
if len(tokens) == 0:
print("No tokens generated for this prompt")
return
if __name__ == "__main__":
parser = build_parser()
args = parser.parse_args()
np.random.seed(args.seed)
print("Loading pretrained model")
model, tokenizer, _ = load_model(args.model)
# Freeze all layers other than LORA linears
model.freeze()
for l in model.model.layers[len(model.model.layers) - args.lora_layers:]:
l.self_attn.q_proj = LoRALinear.from_linear(l.self_attn.q_proj)
l.self_attn.v_proj = LoRALinear.from_linear(l.self_attn.v_proj)
if hasattr(l, "block_sparse_moe"):
l.block_sparse_moe.gate = LoRALinear.from_linear(l.block_sparse_moe.gate)
p = sum(v.size for _, v in tree_flatten(model.parameters())) / 10 ** 6
print(f"Total parameters {p:.3f}M")
p = sum(v.size for _, v in tree_flatten(model.trainable_parameters())) / 10 ** 6
print(f"Trainable parameters {p:.3f}M")
print("Loading datasets")
train_set, valid_set, test_set = load(args)
# Resume training the given adapters.
if args.resume_adapter_file is not None:
print(f"Loading pretrained adapters from {args.resume_adapter_file}")
model.load_weights(args.resume_adapter_file, strict=False)
if args.train:
print("Training")
opt = optim.Adam(learning_rate=args.learning_rate)
# Train model
train(model, train_set, valid_set, opt, loss, tokenizer, args)
# Save adapter weights
mx.savez(args.adapter_file, **dict(tree_flatten(model.trainable_parameters())))
# Load the LoRA adapter weights which we assume should exist by this point
if not Path(args.adapter_file).is_file():
raise ValueError(
f"Adapter file {args.adapter_file} missing. "
"Use --train to learn and save the adapters.npz."
)
model.load_weights(args.adapter_file, strict=False)
if args.test:
print("Testing")
model.eval()
test_loss = evaluate(
model,
test_set,
loss,
tokenizer,
args.batch_size,
num_batches=args.test_batches,
)
test_ppl = math.exp(test_loss)
print(f"Test loss {test_loss:.3f}, Test ppl {test_ppl:.3f}.")
if args.prompt is not None:
print("Generating")
generate_string(model, args.prompt, tokenizer, args)

View File

@ -0,0 +1,12 @@
transformers>=4.39.1
torch>=2.2.0
triton>=2.2.0
httpx>=0.27.0
gradio>=4.26.0
flash_attn>=2.4.1
accelerate>=0.29.2
sentence_transformers>=2.6.1
sse_starlette>=2.1.0
tiktoken>=0.6.0
mlx_lm>=0.8.0
openai>=0.16.2

View File

@ -1,60 +0,0 @@
# VLLM 推理 MiniCPM | MiniCPM inference on VLLM
### 中文
* 安装支持 MiniCPM 的 vLLM
- 因为 MiniCPM 采用 MUP 结构在矩阵乘法中存在一定的放缩计算与Llama类模型结构有细微差别。
- 我们基于版本为 0.2.2 的 vLLM 实现了 MiniCPM 的推理,代码位于仓库[inference](https://github.com/OpenBMB/MiniCPM/tree/main/inference)文件夹下未来将会支持更新的vLLM 版本。
* 安装支持 MiniCPM 的 vLLM 版本
```shell
pip install inference/vllm
```
* 将Huggingface Transformers仓库转为vLLM-MiniCPM支持的格式其中`<hf_repo_path>`, `<vllmcpm_repo_path>`均为本地路径
```shell
python inference/convert_hf_to_vllmcpm.py --load <hf_repo_path> --save <vllmcpm_repo_path>
```
* 测试样例
```shell
cd inference/vllm/examples/infer_cpm
python inference.py --model_path <vllmcpm_repo_path> --prompt_path prompts/prompt_demo.txt
```
* 期望输出
```shell
<用户>: Which city is the capital of China?
<AI>:
The capital city of China is Beijing. Beijing is a major political, cultural, and economic center in China, and it is known for its rich history, beautiful architecture, and vibrant nightlife. It is also home to many of China's most important cultural and historical sites, including the Forbidden City, the Great Wall of China, and the Temple of Heaven. Beijing is a popular destination for tourists from around the world, and it is an important hub for international business and trade.
```
### English
* Install vLLM which supports MiniCPM
- The structure of MiniCPM is not completely same as Llama, since MiniCPM uses the structure of MUP and scaling is applied in matrix multiplications.
- We implemented the inference of MiniCPM in vLLM 0.2.2, and the code is located at [inference](https://github.com/OpenBMB/MiniCPM/tree/main/inference). Newer vLLM versions will be supported in the future.
* Install vLLM which supports MiniCPM
```shell
pip install inference/vllm
```
* Convert Huggingface repo to vllm-cpm repowhere `<hf_repo_path>`, `<vllmcpm_repo_path>` are local paths
```shell
python inference/convert_hf_to_vllmcpm.py --load <hf_repo_path> --save <vllmcpm_repo_path>
```
* Test cases
```shell
cd inference/vllm/examples/infer_cpm
python inference.py --model_path <vllmcpm_repo_path> --prompt_path prompts/prompt_demo.txt
```
* Expected Output
```shell
<用户>: Which city is the capital of China?
<AI>:
The capital city of China is Beijing. Beijing is a major political, cultural, and economic center in China, and it is known for its rich history, beautiful architecture, and vibrant nightlife. It is also home to many of China's most important cultural and historical sites, including the Forbidden City, the Great Wall of China, and the Temple of Heaven. Beijing is a popular destination for tourists from around the world, and it is an important hub for international business and trade.
```

View File

@ -1,91 +0,0 @@
import argparse
import json
import os
import shutil
from tqdm import tqdm
from collections import OrderedDict

import torch


def convert_model(config, ckpt):
    # config: translate Hugging Face config fields into the CPM-Dragonfly (BMT) schema
    config_bmt = OrderedDict(
        {
            "_dtype": "bf16",
            "activate_fn": "silu",
            "architectures": [
                "CPMDragonflyForCausalLM"
            ],
            "model_type": "cpm_dragonfly",
            "base": 10000,
            "dim_ff": config['intermediate_size'],
            "dim_head": config['hidden_size'] // config['num_attention_heads'],
            "dim_model": config['hidden_size'],
            "dim_model_base": 256,
            "dropout_p": 0.0,
            "eps": config['rms_norm_eps'],
            "init_std": config['initializer_range'],
            "num_heads": config['num_attention_heads'],
            "num_kv_heads": config['num_key_value_heads'],
            "num_layers": config['num_hidden_layers'],
            "orig_max_length": 4096,
            "pose_prob": 0.0,
            "pose_scaling_factor": 1.0,
            "qk_norm": False,
            "rope_scaling_factor": 1,
            "rope_scaling_type": "",
            "scale": True,
            "scale_depth": config['scale_depth'],
            "scale_emb": config['scale_emb'],
            "tie_lm_head": True,
            "tp": 0,
            "transformers_version": "4.35.0",
            "vocab_size": config['vocab_size']
        }
    )
    # weights: rename Hugging Face parameter keys to the BMT naming scheme
    model_bmt = OrderedDict()
    model_bmt["input_embedding.weight"] = ckpt['model.embed_tokens.weight'].contiguous()
    model_bmt["encoder.output_layernorm.weight"] = ckpt['model.norm.weight'].contiguous()
    for lnum in tqdm(range(config_bmt['num_layers'])):
        hf_pfx = f"model.layers.{lnum}"
        bmt_pfx = f"encoder.layers.{lnum}"
        model_bmt[f"{bmt_pfx}.self_att.layernorm_before_attention.weight"] = ckpt[f"{hf_pfx}.input_layernorm.weight"].contiguous()
        model_bmt[f"{bmt_pfx}.self_att.self_attention.project_q.weight"] = ckpt[f"{hf_pfx}.self_attn.q_proj.weight"].contiguous()
        model_bmt[f"{bmt_pfx}.self_att.self_attention.project_k.weight"] = ckpt[f"{hf_pfx}.self_attn.k_proj.weight"].contiguous()
        model_bmt[f"{bmt_pfx}.self_att.self_attention.project_v.weight"] = ckpt[f"{hf_pfx}.self_attn.v_proj.weight"].contiguous()
        model_bmt[f"{bmt_pfx}.self_att.self_attention.attention_out.weight"] = ckpt[f"{hf_pfx}.self_attn.o_proj.weight"].contiguous()
        model_bmt[f"{bmt_pfx}.ffn.layernorm_before_ffn.weight"] = ckpt[f"{hf_pfx}.post_attention_layernorm.weight"].contiguous()
        model_bmt[f"{bmt_pfx}.ffn.ffn.w_in.w_0.weight"] = ckpt[f"{hf_pfx}.mlp.gate_proj.weight"].contiguous()
        model_bmt[f"{bmt_pfx}.ffn.ffn.w_in.w_1.weight"] = ckpt[f"{hf_pfx}.mlp.up_proj.weight"].contiguous()
        model_bmt[f"{bmt_pfx}.ffn.ffn.w_out.weight"] = ckpt[f"{hf_pfx}.mlp.down_proj.weight"].contiguous()
    return config_bmt, model_bmt


def load_model_ckpt(args):
    # load the Hugging Face config and checkpoint
    with open(os.path.join(args.load, "config.json"), 'r') as fin:
        config = json.load(fin)
    ckpt = torch.load(os.path.join(args.load, "pytorch_model.bin"))
    os.makedirs(f"{args.save}", exist_ok=True)
    # model and config
    hf_config, hf_ckpt = convert_model(config, ckpt)
    with open(os.path.join(args.save, "config.json"), 'w') as fout:
        json.dump(hf_config, fout, indent=4)
    torch.save(hf_ckpt, f"{args.save}/pytorch_model.pt")
    # tokenizer files are copied over unchanged
    shutil.copyfile(f"{args.load}/tokenizer.json", f"{args.save}/tokenizer.json")
    shutil.copyfile(f"{args.load}/tokenizer.model", f"{args.save}/tokenizer.model")
    shutil.copyfile(f"{args.load}/special_tokens_map.json", f"{args.save}/special_tokens_map.json")
    shutil.copyfile(f"{args.load}/tokenizer_config.json", f"{args.save}/tokenizer_config.json")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--load", type=str, default="")
    parser.add_argument("--save", type=str, default="")
    args = parser.parse_args()
    load_model_ckpt(args)

View File

@ -1,434 +0,0 @@
# This Pylint rcfile contains a best-effort configuration to uphold the
# best-practices and style described in the Google Python style guide:
# https://google.github.io/styleguide/pyguide.html
#
# Its canonical open-source location is:
# https://google.github.io/styleguide/pylintrc
[MASTER]
# Files or directories to be skipped. They should be base names, not paths.
ignore=docs
# Files or directories matching the regex patterns are skipped. The regex
# matches against base names, not paths.
ignore-patterns=
# Pickle collected data for later comparisons.
persistent=no
# List of plugins (as comma separated values of python modules names) to load,
# usually to register additional checkers.
load-plugins=
# Use multiple processes to speed up Pylint.
jobs=4
# Allow loading of arbitrary C extensions. Extensions are imported into the
# active Python interpreter and may run arbitrary code.
unsafe-load-any-extension=no
[MESSAGES CONTROL]
# Only show warnings with the listed confidence levels. Leave empty to show
# all. Valid levels: HIGH, INFERENCE, INFERENCE_FAILURE, UNDEFINED
confidence=
# Enable the message, report, category or checker with the given id(s). You can
# either give multiple identifier separated by comma (,) or put this option
# multiple time (only on the command line, not in the configuration file where
# it should appear only once). See also the "--disable" option for examples.
#enable=
# Disable the message, report, category or checker with the given id(s). You
# can either give multiple identifiers separated by comma (,) or put this
# option multiple times (only on the command line, not in the configuration
# file where it should appear only once). You can also use "--disable=all" to
# disable everything first and then reenable specific checks. For example, if
# you want to run only the similarities checker, you can use "--disable=all
# --enable=similarities". If you want to run only the classes checker, but have
# no Warning level messages displayed, use "--disable=all --enable=classes
# --disable=W"
disable=abstract-method,
apply-builtin,
arguments-differ,
attribute-defined-outside-init,
backtick,
bad-option-value,
basestring-builtin,
buffer-builtin,
c-extension-no-member,
consider-using-enumerate,
cmp-builtin,
cmp-method,
coerce-builtin,
coerce-method,
delslice-method,
div-method,
duplicate-code,
eq-without-hash,
execfile-builtin,
file-builtin,
filter-builtin-not-iterating,
fixme,
getslice-method,
global-statement,
hex-method,
idiv-method,
implicit-str-concat-in-sequence,
import-error,
import-self,
import-star-module-level,
inconsistent-return-statements,
input-builtin,
intern-builtin,
invalid-str-codec,
locally-disabled,
logging-fstring-interpolation, # added by vLLM
logging-not-lazy, # added by vLLM
long-builtin,
long-suffix,
map-builtin-not-iterating,
misplaced-comparison-constant,
missing-class-docstring, # TODO (vLLM): enable
missing-function-docstring,
missing-module-docstring, # TODO (vLLM): enable
metaclass-assignment,
next-method-called,
next-method-defined,
no-absolute-import,
no-else-break,
no-else-continue,
no-else-raise,
no-else-return,
no-init, # added
no-member,
no-name-in-module,
no-self-use,
nonzero-method,
oct-method,
old-division,
old-ne-operator,
old-octal-literal,
old-raise-syntax,
parameter-unpacking,
print-statement,
raising-string,
range-builtin-not-iterating,
raw_input-builtin,
rdiv-method,
reduce-builtin,
relative-import,
reload-builtin,
round-builtin,
setslice-method,
signature-differs,
standarderror-builtin,
suppressed-message,
sys-max-int,
too-few-public-methods,
too-many-ancestors,
too-many-arguments,
too-many-boolean-expressions,
too-many-branches,
too-many-instance-attributes,
too-many-locals,
too-many-nested-blocks,
too-many-public-methods,
too-many-return-statements,
too-many-statements,
trailing-newlines,
unichr-builtin,
unicode-builtin,
unnecessary-pass,
unpacking-in-except,
unspecified-encoding,
useless-else-on-loop,
useless-object-inheritance,
useless-suppression,
using-cmp-argument,
wrong-import-order,
xrange-builtin,
zip-builtin-not-iterating,
[REPORTS]
# Set the output format. Available formats are text, parseable, colorized, msvs
# (visual studio) and html. You can also give a reporter class, eg
# mypackage.mymodule.MyReporterClass.
output-format=text
# Tells whether to display a full report or only the messages
reports=no
# Python expression which should return a note less than 10 (10 is the highest
# note). You have access to the variables errors warning, statement which
# respectively contain the number of errors / warnings messages and the total
# number of statements analyzed. This is used by the global evaluation report
# (RP0004).
evaluation=10.0 - ((float(5 * error + warning + refactor + convention) / statement) * 10)
# Template used to display messages. This is a python new-style format string
# used to format the message information. See doc for all details
#msg-template=
[BASIC]
# Good variable names which should always be accepted, separated by a comma
good-names=main,_
# Bad variable names which should always be refused, separated by a comma
bad-names=
# Colon-delimited sets of names that determine each other's naming style when
# the name regexes allow several styles.
name-group=
# Include a hint for the correct naming format with invalid-name
include-naming-hint=no
# List of decorators that produce properties, such as abc.abstractproperty. Add
# to this list to register other decorators that produce valid properties.
property-classes=abc.abstractproperty,cached_property.cached_property,cached_property.threaded_cached_property,cached_property.cached_property_with_ttl,cached_property.threaded_cached_property_with_ttl
# Regular expression matching correct function names
function-rgx=^(?:(?P<exempt>setUp|tearDown|setUpModule|tearDownModule)|(?P<camel_case>_?[A-Z][a-zA-Z0-9]*)|(?P<snake_case>_?[a-z][a-z0-9_]*))$
# Regular expression matching correct variable names
variable-rgx=^[a-z][a-z0-9_]*$
# Regular expression matching correct constant names
const-rgx=^(_?[A-Z][A-Z0-9_]*|__[a-z0-9_]+__|_?[a-z][a-z0-9_]*)$
# Regular expression matching correct attribute names
attr-rgx=^_{0,2}[a-z][a-z0-9_]*$
# Regular expression matching correct argument names
argument-rgx=^[a-z][a-z0-9_]*$
# Regular expression matching correct class attribute names
class-attribute-rgx=^(_?[A-Z][A-Z0-9_]*|__[a-z0-9_]+__|_?[a-z][a-z0-9_]*)$
# Regular expression matching correct inline iteration names
inlinevar-rgx=^[a-z][a-z0-9_]*$
# Regular expression matching correct class names
class-rgx=^_?[A-Z][a-zA-Z0-9]*$
# Regular expression matching correct module names
module-rgx=^(_?[a-z][a-z0-9_]*|__init__)$
# Regular expression matching correct method names
method-rgx=(?x)^(?:(?P<exempt>_[a-z0-9_]+__|runTest|setUp|tearDown|setUpTestCase|tearDownTestCase|setupSelf|tearDownClass|setUpClass|(test|assert)_*[A-Z0-9][a-zA-Z0-9_]*|next)|(?P<camel_case>_{0,2}[A-Z][a-zA-Z0-9_]*)|(?P<snake_case>_{0,2}[a-z][a-z0-9_]*))$
# Regular expression which should only match function or class names that do
# not require a docstring.
no-docstring-rgx=(__.*__|main|test.*|.*test|.*Test)$
# Minimum line length for functions/classes that require docstrings, shorter
# ones are exempt.
docstring-min-length=10
[TYPECHECK]
# List of decorators that produce context managers, such as
# contextlib.contextmanager. Add to this list to register other decorators that
# produce valid context managers.
contextmanager-decorators=contextlib.contextmanager,contextlib2.contextmanager
# Tells whether missing members accessed in mixin class should be ignored. A
# mixin class is detected if its name ends with "mixin" (case insensitive).
ignore-mixin-members=yes
# List of module names for which member attributes should not be checked
# (useful for modules/projects where namespaces are manipulated during runtime
# and thus existing member attributes cannot be deduced by static analysis. It
# supports qualified module names, as well as Unix pattern matching.
ignored-modules=
# List of class names for which member attributes should not be checked (useful
# for classes with dynamically set attributes). This supports the use of
# qualified names.
ignored-classes=optparse.Values,thread._local,_thread._local
# List of members which are set dynamically and missed by pylint inference
# system, and so shouldn't trigger E1101 when accessed. Python regular
# expressions are accepted.
generated-members=
[FORMAT]
# Maximum number of characters on a single line.
max-line-length=80
# TODO(https://github.com/PyCQA/pylint/issues/3352): Direct pylint to exempt
# lines made too long by directives to pytype.
# Regexp for a line that is allowed to be longer than the limit.
ignore-long-lines=(?x)(
^\s*(\#\ )?<?https?://\S+>?$|
^\s*(from\s+\S+\s+)?import\s+.+$)
# Allow the body of an if to be on the same line as the test if there is no
# else.
single-line-if-stmt=yes
# Maximum number of lines in a module
max-module-lines=99999
# String used as indentation unit. The internal Google style guide mandates 2
# spaces. Google's externally-published style guide says 4, consistent with
# PEP 8. Here, we use 2 spaces, for conformity with many open-sourced Google
# projects (like TensorFlow).
indent-string=' '
# Number of spaces of indent required inside a hanging or continued line.
indent-after-paren=4
# Expected format of line ending, e.g. empty (any line ending), LF or CRLF.
expected-line-ending-format=
[MISCELLANEOUS]
# List of note tags to take in consideration, separated by a comma.
notes=TODO
[STRING]
# This flag controls whether inconsistent-quotes generates a warning when the
# character used as a quote delimiter is used inconsistently within a module.
check-quote-consistency=yes
[VARIABLES]
# Tells whether we should check for unused import in __init__ files.
init-import=no
# A regular expression matching the name of dummy variables (i.e. expectedly
# not used).
dummy-variables-rgx=^\*{0,2}(_$|unused_|dummy_)
# List of additional names supposed to be defined in builtins. Remember that
# you should avoid to define new builtins when possible.
additional-builtins=
# List of strings which can identify a callback function by name. A callback
# name must start or end with one of those strings.
callbacks=cb_,_cb
# List of qualified module names which can have objects that can redefine
# builtins.
redefining-builtins-modules=six,six.moves,past.builtins,future.builtins,functools
[LOGGING]
# Logging modules to check that the string format arguments are in logging
# function parameter format
logging-modules=logging,absl.logging,tensorflow.io.logging
[SIMILARITIES]
# Minimum lines number of a similarity.
min-similarity-lines=4
# Ignore comments when computing similarities.
ignore-comments=yes
# Ignore docstrings when computing similarities.
ignore-docstrings=yes
# Ignore imports when computing similarities.
ignore-imports=no
[SPELLING]
# Spelling dictionary name. Available dictionaries: none. To make it work,
# install the python-enchant package.
spelling-dict=
# List of comma separated words that should not be checked.
spelling-ignore-words=
# A path to a file that contains private dictionary; one word per line.
spelling-private-dict-file=
# Tells whether to store unknown words to indicated private dictionary in
# --spelling-private-dict-file option instead of raising a message.
spelling-store-unknown-words=no
[IMPORTS]
# Deprecated modules which should not be used, separated by a comma
deprecated-modules=regsub,
TERMIOS,
Bastion,
rexec,
sets
# Create a graph of every (i.e. internal and external) dependencies in the
# given file (report RP0402 must not be disabled)
import-graph=
# Create a graph of external dependencies in the given file (report RP0402 must
# not be disabled)
ext-import-graph=
# Create a graph of internal dependencies in the given file (report RP0402 must
# not be disabled)
int-import-graph=
# Force import order to recognize a module as part of the standard
# compatibility libraries.
known-standard-library=
# Force import order to recognize a module as part of a third party library.
known-third-party=enchant, absl
# Analyse import fallback blocks. This can be used to support both Python 2 and
# 3 compatible code, which means that the block might have code that exists
# only in one or another interpreter, leading to false positives when analysed.
analyse-fallback-blocks=no
[CLASSES]
# List of method names used to declare (i.e. assign) instance attributes.
defining-attr-methods=__init__,
__new__,
setUp
# List of member names, which should be excluded from the protected access
# warning.
exclude-protected=_asdict,
_fields,
_replace,
_source,
_make
# List of valid names for the first argument in a class method.
valid-classmethod-first-arg=cls,
class_
# List of valid names for the first argument in a metaclass class method.
valid-metaclass-classmethod-first-arg=mcs
[EXCEPTIONS]
# Exceptions that will emit a warning when being caught. Defaults to
# "Exception"
overgeneral-exceptions=StandardError,
Exception,
BaseException

View File

@ -1,21 +0,0 @@
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
version: 2
build:
os: ubuntu-22.04
tools:
python: "3.8"
sphinx:
configuration: docs/source/conf.py
# If using Sphinx, optionally build your docs in additional formats such as PDF
formats:
- pdf
# Optionally declare the Python requirements required to build your docs
python:
install:
- requirements: docs/requirements-docs.txt

View File

@ -1,77 +0,0 @@
# Contributing to vLLM
Thank you for your interest in contributing to vLLM!
Our community is open to everyone and welcomes all kinds of contributions, no matter how small or large.
There are several ways you can contribute to the project:
- Identify and report any issues or bugs.
- Request or add a new model.
- Suggest or implement new features.
However, remember that contributions aren't just about code.
We believe in the power of community support; thus, answering queries, assisting others, and enhancing the documentation are highly regarded and beneficial contributions.
Finally, one of the most impactful ways to support us is by raising awareness about vLLM.
Talk about it in your blog posts, highlighting how it's driving your incredible projects.
Express your support on Twitter if vLLM aids you, or simply offer your appreciation by starring our repository.
## Setup for development
### Build from source
```bash
pip install -r requirements.txt
pip install -e . # This may take several minutes.
```
### Testing
```bash
pip install -r requirements-dev.txt
# Static type checking
mypy
# Unit tests
pytest tests/
```
**Note:** Currently, the repository does not pass the mypy tests.
## Contributing Guidelines
### Issue Reporting
If you encounter a bug or have a feature request, please check our issues page first to see if someone else has already reported it.
If not, please file a new issue, providing as much relevant information as possible.
### Coding Style Guide
In general, we adhere to [Google Python style guide](https://google.github.io/styleguide/pyguide.html) and [Google C++ style guide](https://google.github.io/styleguide/cppguide.html).
We include a formatting script [`format.sh`](./format.sh) to format the code.
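For example, a typical pre-commit run looks like this (a sketch, assuming the dev requirements are installed and the script is run from the repository root):
```bash
# Auto-format the codebase in place before committing.
bash format.sh
```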
### Pull Requests
When submitting a pull request:
1. Make sure your code has been rebased on top of the latest commit on the main branch.
2. Ensure code is properly formatted by running [`format.sh`](./format.sh).
3. Include a detailed description of the changes in the pull request.
Explain why you made the changes you did.
If your pull request fixes an open issue, please include a reference to it in the description.
### Code Reviews
All submissions, including submissions by project members, require a code review.
To make the review process as smooth as possible, please:
1. Keep your changes as concise as possible.
If your pull request involves multiple unrelated changes, consider splitting it into separate pull requests.
2. Respond to all comments within a reasonable time frame.
If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.
### Thank You
Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM.
Your contributions make vLLM a great tool for everyone!

View File

@ -1,72 +0,0 @@
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04 AS dev
RUN apt-get update -y \
&& apt-get install -y python3-pip
WORKDIR /workspace
# install build and runtime dependencies
COPY requirements.txt requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip \
pip install -r requirements.txt
# install development dependencies
COPY requirements-dev.txt requirements-dev.txt
RUN --mount=type=cache,target=/root/.cache/pip \
pip install -r requirements-dev.txt
# image to build pytorch extensions
FROM dev AS build
# copy input files
COPY csrc csrc
COPY setup.py setup.py
COPY requirements.txt requirements.txt
COPY pyproject.toml pyproject.toml
COPY vllm/__init__.py vllm/__init__.py
# max jobs used by Ninja to build extensions
ENV MAX_JOBS=$max_jobs
RUN python3 setup.py build_ext --inplace
# image to run unit testing suite
FROM dev AS test
# copy pytorch extensions separately to avoid having to rebuild
# when python code changes
COPY --from=build /workspace/vllm/*.so /workspace/vllm/
COPY tests tests
COPY vllm vllm
ENTRYPOINT ["python3", "-m", "pytest", "tests"]
# use CUDA base as CUDA runtime dependencies are already installed via pip
FROM nvidia/cuda:12.1.0-base-ubuntu22.04 AS vllm-base
# libnccl required for ray
RUN apt-get update -y \
&& apt-get install -y python3-pip
WORKDIR /workspace
COPY requirements.txt requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip \
pip install -r requirements.txt
FROM vllm-base AS vllm
COPY --from=build /workspace/vllm/*.so /workspace/vllm/
COPY vllm vllm
EXPOSE 8000
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.api_server"]
# openai api server alternative
FROM vllm-base AS vllm-openai
# install additional dependencies for openai api server
RUN --mount=type=cache,target=/root/.cache/pip \
pip install accelerate fschat
COPY --from=build /workspace/vllm/*.so /workspace/vllm/
COPY vllm vllm
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]

View File

@ -1,201 +0,0 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

View File

@ -1,4 +0,0 @@
include LICENSE
include requirements.txt
recursive-include csrc *

View File

@ -1,95 +0,0 @@
<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/logos/vllm-logo-text-dark.png">
<img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/logos/vllm-logo-text-light.png" width=55%>
</picture>
</p>
<h3 align="center">
Easy, fast, and cheap LLM serving for everyone
</h3>
<p align="center">
| <a href="https://vllm.readthedocs.io/en/latest/"><b>Documentation</b></a> | <a href="https://vllm.ai"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://discord.gg/jz7wjKhh6g"><b>Discord</b></a> |
</p>
---
*Latest News* 🔥
- [2023/10] We hosted [the first vLLM meetup](https://lu.ma/first-vllm-meetup) in SF! Please find the meetup slides [here](https://docs.google.com/presentation/d/1QL-XPFXiFpDBh86DbEegFXBXFXjix4v032GhShbKf3s/edit?usp=sharing).
- [2023/09] We created our [Discord server](https://discord.gg/jz7wjKhh6g)! Join us to discuss vLLM and LLM serving! We will also post the latest announcements and updates there.
- [2023/09] We released our [PagedAttention paper](https://arxiv.org/abs/2309.06180) on arXiv!
- [2023/08] We would like to express our sincere gratitude to [Andreessen Horowitz](https://a16z.com/2023/08/30/supporting-the-open-source-ai-community/) (a16z) for providing a generous grant to support the open-source development and research of vLLM.
- [2023/07] Added support for LLaMA-2! You can run and serve 7B/13B/70B LLaMA-2s on vLLM with a single command!
- [2023/06] Serving vLLM On any Cloud with SkyPilot. Check out a 1-click [example](https://github.com/skypilot-org/skypilot/blob/master/llm/vllm) to start the vLLM demo, and the [blog post](https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/) for the story behind vLLM development on the clouds.
- [2023/06] We officially released vLLM! FastChat-vLLM integration has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid-April. Check out our [blog post](https://vllm.ai).
---
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with **PagedAttention**
- Continuous batching of incoming requests
- Optimized CUDA kernels
vLLM is flexible and easy to use with:
- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
vLLM seamlessly supports many Hugging Face models, including the following architectures:
- Aquila & Aquila2 (`BAAI/AquilaChat2-7B`, `BAAI/AquilaChat2-34B`, `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc.)
- Baichuan (`baichuan-inc/Baichuan-7B`, `baichuan-inc/Baichuan-13B-Chat`, etc.)
- BLOOM (`bigscience/bloom`, `bigscience/bloomz`, etc.)
- ChatGLM (`THUDM/chatglm2-6b`, `THUDM/chatglm3-6b`, etc.)
- Falcon (`tiiuae/falcon-7b`, `tiiuae/falcon-40b`, `tiiuae/falcon-rw-7b`, etc.)
- GPT-2 (`gpt2`, `gpt2-xl`, etc.)
- GPT BigCode (`bigcode/starcoder`, `bigcode/gpt_bigcode-santacoder`, etc.)
- GPT-J (`EleutherAI/gpt-j-6b`, `nomic-ai/gpt4all-j`, etc.)
- GPT-NeoX (`EleutherAI/gpt-neox-20b`, `databricks/dolly-v2-12b`, `stabilityai/stablelm-tuned-alpha-7b`, etc.)
- InternLM (`internlm/internlm-7b`, `internlm/internlm-chat-7b`, etc.)
- LLaMA & LLaMA-2 (`meta-llama/Llama-2-70b-hf`, `lmsys/vicuna-13b-v1.3`, `young-geng/koala`, `openlm-research/open_llama_13b`, etc.)
- Mistral (`mistralai/Mistral-7B-v0.1`, `mistralai/Mistral-7B-Instruct-v0.1`, etc.)
- MPT (`mosaicml/mpt-7b`, `mosaicml/mpt-30b`, etc.)
- OPT (`facebook/opt-66b`, `facebook/opt-iml-max-30b`, etc.)
- Phi-1.5 (`microsoft/phi-1_5`, etc.)
- Qwen (`Qwen/Qwen-7B`, `Qwen/Qwen-7B-Chat`, etc.)
- Yi (`01-ai/Yi-6B`, `01-ai/Yi-34B`, etc.)
Install vLLM with pip or [from source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):
```bash
pip install vllm
```
## Getting Started
Visit our [documentation](https://vllm.readthedocs.io/en/latest/) to get started.
- [Installation](https://vllm.readthedocs.io/en/latest/getting_started/installation.html)
- [Quickstart](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html)
- [Supported Models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)
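For a quick smoke test, the snippet below starts the demo API server and sends one request to its `/generate` endpoint (a minimal sketch: the model name and prompt are illustrative placeholders, and the server listens on port 8000 by default):
```bash
# Launch the demo API server with a small model (any supported model name works).
python -m vllm.entrypoints.api_server --model facebook/opt-125m

# From another shell, send a generation request.
curl http://localhost:8000/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "San Francisco is a", "max_tokens": 16, "temperature": 0.0}'
```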
## Contributing
We welcome and value any contributions and collaborations.
Please check out [CONTRIBUTING.md](./CONTRIBUTING.md) for how to get involved.
## Citation
If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):
```bibtex
@inproceedings{kwon2023efficient,
title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
year={2023}
}
```

View File

@ -1,8 +0,0 @@
# Benchmarking vLLM
## Downloading the ShareGPT dataset
You can download the dataset by running:
```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
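With the dataset in place, a throughput run can be launched roughly as follows (a sketch: the script path `benchmarks/benchmark_throughput.py` and the model name are assumptions you may need to adapt):
```bash
# Offline throughput benchmark over sampled ShareGPT conversations.
python benchmarks/benchmark_throughput.py \
    --backend vllm \
    --dataset ShareGPT_V3_unfiltered_cleaned_split.json \
    --model facebook/opt-125m \
    --num-prompts 1000
```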

View File

@ -1,101 +0,0 @@
"""Benchmark the latency of processing a single batch of requests."""
import argparse
import time
import numpy as np
import torch
from tqdm import tqdm
from vllm import LLM, SamplingParams
def main(args: argparse.Namespace):
print(args)
# Process all the requests in a single batch if possible.
# NOTE(woosuk): If the request cannot be processed in a single batch,
# the engine will automatically process the request in multiple batches.
llm = LLM(
model=args.model,
tokenizer=args.tokenizer,
quantization=args.quantization,
tensor_parallel_size=args.tensor_parallel_size,
max_num_seqs=args.batch_size,
max_num_batched_tokens=args.batch_size * args.input_len,
trust_remote_code=args.trust_remote_code,
dtype=args.dtype,
)
sampling_params = SamplingParams(
n=args.n,
temperature=0.0 if args.use_beam_search else 1.0,
top_p=1.0,
use_beam_search=args.use_beam_search,
ignore_eos=True,
max_tokens=args.output_len,
)
print(sampling_params)
dummy_prompt_token_ids = [[0] * args.input_len] * args.batch_size
def run_to_completion(profile: bool = False):
if profile:
torch.cuda.cudart().cudaProfilerStart()
start_time = time.perf_counter()
llm.generate(prompt_token_ids=dummy_prompt_token_ids,
sampling_params=sampling_params,
use_tqdm=False)
end_time = time.perf_counter()
latency = end_time - start_time
if profile:
torch.cuda.cudart().cudaProfilerStop()
return latency
print("Warming up...")
run_to_completion(profile=False)
# Benchmark.
latencies = []
for _ in tqdm(range(args.num_iters), desc="Profiling iterations"):
latencies.append(run_to_completion(profile=False))
print(f'Avg latency: {np.mean(latencies)} seconds')
if __name__ == '__main__':
parser = argparse.ArgumentParser(
description='Benchmark the latency of processing a single batch of '
'requests till completion.')
parser.add_argument('--model', type=str, default='facebook/opt-125m')
parser.add_argument('--tokenizer', type=str, default=None)
parser.add_argument('--quantization',
'-q',
choices=['awq', 'squeezellm', None],
default=None)
parser.add_argument('--tensor-parallel-size', '-tp', type=int, default=1)
parser.add_argument('--input-len', type=int, default=32)
parser.add_argument('--output-len', type=int, default=128)
parser.add_argument('--batch-size', type=int, default=8)
parser.add_argument('--n',
type=int,
default=1,
help='Number of generated sequences per prompt.')
parser.add_argument('--use-beam-search', action='store_true')
parser.add_argument('--num-iters',
type=int,
default=3,
help='Number of iterations to run.')
parser.add_argument('--trust-remote-code',
action='store_true',
help='trust remote code from huggingface')
parser.add_argument(
'--dtype',
type=str,
default='auto',
choices=['auto', 'half', 'float16', 'bfloat16', 'float', 'float32'],
help='data type for model weights and activations. '
'The "auto" option will use FP16 precision '
'for FP32 and FP16 models, and BF16 precision '
'for BF16 models.')
args = parser.parse_args()
main(args)
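A typical invocation of this latency benchmark might look like the following (a sketch: the script path is an assumption, and the values simply restate the argparse defaults above):
```bash
# Measure end-to-end latency for a single batch of dummy prompts.
python benchmarks/benchmark_latency.py \
    --model facebook/opt-125m \
    --input-len 32 --output-len 128 --batch-size 8 --num-iters 3
```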

View File

@ -1,233 +0,0 @@
"""Benchmark online serving throughput.
On the server side, run one of the following commands:
(vLLM backend)
python -m vllm.entrypoints.api_server \
--model <your_model> --swap-space 16 \
--disable-log-requests
(TGI backend)
./launch_hf_server.sh <your_model>
On the client side, run:
python benchmarks/benchmark_serving.py \
--backend <backend> \
--tokenizer <your_model> --dataset <target_dataset> \
--request-rate <request_rate>
"""
import argparse
import asyncio
import json
import random
import time
from typing import AsyncGenerator, List, Tuple
import aiohttp
import numpy as np
from transformers import PreTrainedTokenizerBase
from vllm.transformers_utils.tokenizer import get_tokenizer
# (prompt len, output len, latency)
REQUEST_LATENCY: List[Tuple[int, int, float]] = []
def sample_requests(
dataset_path: str,
num_requests: int,
tokenizer: PreTrainedTokenizerBase,
) -> List[Tuple[str, int, int]]:
# Load the dataset.
with open(dataset_path) as f:
dataset = json.load(f)
# Filter out the conversations with less than 2 turns.
dataset = [
data for data in dataset
if len(data["conversations"]) >= 2
]
# Only keep the first two turns of each conversation.
dataset = [
(data["conversations"][0]["value"], data["conversations"][1]["value"])
for data in dataset
]
# Tokenize the prompts and completions.
prompts = [prompt for prompt, _ in dataset]
prompt_token_ids = tokenizer(prompts).input_ids
completions = [completion for _, completion in dataset]
completion_token_ids = tokenizer(completions).input_ids
tokenized_dataset = []
for i in range(len(dataset)):
output_len = len(completion_token_ids[i])
tokenized_dataset.append((prompts[i], prompt_token_ids[i], output_len))
# Filter out too long sequences.
filtered_dataset: List[Tuple[str, int, int]] = []
for prompt, prompt_token_ids, output_len in tokenized_dataset:
prompt_len = len(prompt_token_ids)
if prompt_len < 4 or output_len < 4:
# Prune too short sequences.
# This is because TGI causes errors when the input or output length
# is too short.
continue
if prompt_len > 1024 or prompt_len + output_len > 2048:
# Prune too long sequences.
continue
filtered_dataset.append((prompt, prompt_len, output_len))
# Sample the requests.
sampled_requests = random.sample(filtered_dataset, num_requests)
return sampled_requests
async def get_request(
input_requests: List[Tuple[str, int, int]],
request_rate: float,
) -> AsyncGenerator[Tuple[str, int, int], None]:
input_requests = iter(input_requests)
for request in input_requests:
yield request
if request_rate == float("inf"):
# If the request rate is infinity, then we don't need to wait.
continue
# Sample the request interval from the exponential distribution.
interval = np.random.exponential(1.0 / request_rate)
# The next request will be sent after the interval.
await asyncio.sleep(interval)
async def send_request(
backend: str,
api_url: str,
prompt: str,
prompt_len: int,
output_len: int,
best_of: int,
use_beam_search: bool,
) -> None:
request_start_time = time.perf_counter()
headers = {"User-Agent": "Benchmark Client"}
if backend == "vllm":
pload = {
"prompt": prompt,
"n": 1,
"best_of": best_of,
"use_beam_search": use_beam_search,
"temperature": 0.0 if use_beam_search else 1.0,
"top_p": 1.0,
"max_tokens": output_len,
"ignore_eos": True,
"stream": False,
}
elif backend == "tgi":
assert not use_beam_search
params = {
"best_of": best_of,
"max_new_tokens": output_len,
"do_sample": True,
}
pload = {
"inputs": prompt,
"parameters": params,
}
else:
raise ValueError(f"Unknown backend: {backend}")
timeout = aiohttp.ClientTimeout(total=3 * 3600)
async with aiohttp.ClientSession(timeout=timeout) as session:
while True:
async with session.post(api_url, headers=headers, json=pload) as response:
chunks = []
async for chunk, _ in response.content.iter_chunks():
chunks.append(chunk)
output = b"".join(chunks).decode("utf-8")
output = json.loads(output)
# Re-send the request if it failed.
if "error" not in output:
break
request_end_time = time.perf_counter()
request_latency = request_end_time - request_start_time
REQUEST_LATENCY.append((prompt_len, output_len, request_latency))
async def benchmark(
backend: str,
api_url: str,
input_requests: List[Tuple[str, int, int]],
best_of: int,
use_beam_search: bool,
request_rate: float,
) -> None:
tasks: List[asyncio.Task] = []
async for request in get_request(input_requests, request_rate):
prompt, prompt_len, output_len = request
task = asyncio.create_task(send_request(backend, api_url, prompt,
prompt_len, output_len,
best_of, use_beam_search))
tasks.append(task)
await asyncio.gather(*tasks)
def main(args: argparse.Namespace):
print(args)
random.seed(args.seed)
np.random.seed(args.seed)
api_url = f"http://{args.host}:{args.port}/generate"
tokenizer = get_tokenizer(args.tokenizer, trust_remote_code=args.trust_remote_code)
input_requests = sample_requests(args.dataset, args.num_prompts, tokenizer)
benchmark_start_time = time.perf_counter()
asyncio.run(benchmark(args.backend, api_url, input_requests, args.best_of,
args.use_beam_search, args.request_rate))
benchmark_end_time = time.perf_counter()
benchmark_time = benchmark_end_time - benchmark_start_time
print(f"Total time: {benchmark_time:.2f} s")
print(f"Throughput: {args.num_prompts / benchmark_time:.2f} requests/s")
# Compute the latency statistics.
avg_latency = np.mean([latency for _, _, latency in REQUEST_LATENCY])
print(f"Average latency: {avg_latency:.2f} s")
avg_per_token_latency = np.mean([
latency / (prompt_len + output_len)
for prompt_len, output_len, latency in REQUEST_LATENCY
])
print(f"Average latency per token: {avg_per_token_latency:.2f} s")
avg_per_output_token_latency = np.mean([
latency / output_len
for _, output_len, latency in REQUEST_LATENCY
])
print("Average latency per output token: "
f"{avg_per_output_token_latency:.2f} s")
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Benchmark the online serving throughput.")
parser.add_argument("--backend", type=str, default="vllm",
choices=["vllm", "tgi"])
parser.add_argument("--host", type=str, default="localhost")
parser.add_argument("--port", type=int, default=8000)
parser.add_argument("--dataset", type=str, required=True,
help="Path to the dataset.")
parser.add_argument("--tokenizer", type=str, required=True,
help="Name or path of the tokenizer.")
parser.add_argument("--best-of", type=int, default=1,
help="Generates `best_of` sequences per prompt and "
"returns the best one.")
parser.add_argument("--use-beam-search", action="store_true")
parser.add_argument("--num-prompts", type=int, default=1000,
help="Number of prompts to process.")
parser.add_argument("--request-rate", type=float, default=float("inf"),
help="Number of requests per second. If this is inf, "
"then all the requests are sent at time 0. "
"Otherwise, we use Poisson process to synthesize "
"the request arrival times.")
parser.add_argument("--seed", type=int, default=0)
parser.add_argument('--trust-remote-code', action='store_true',
help='trust remote code from huggingface')
args = parser.parse_args()
main(args)

View File

@ -1,305 +0,0 @@
"""Benchmark offline inference throughput."""
import argparse
import json
import random
import time
from typing import List, Optional, Tuple
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
PreTrainedTokenizerBase)
from tqdm import tqdm
def sample_requests(
dataset_path: str,
num_requests: int,
tokenizer: PreTrainedTokenizerBase,
fixed_output_len: Optional[int],
) -> List[Tuple[str, int, int]]:
if fixed_output_len is not None:
if fixed_output_len < 4:
raise ValueError("output_len too small")
# Load the dataset.
with open(dataset_path) as f:
dataset = json.load(f)
# Filter out the conversations with less than 2 turns.
dataset = [data for data in dataset if len(data["conversations"]) >= 2]
# Only keep the first two turns of each conversation.
dataset = [(data["conversations"][0]["value"],
data["conversations"][1]["value"]) for data in dataset]
# Tokenize the prompts and completions.
prompts = [prompt for prompt, _ in dataset]
prompt_token_ids = tokenizer(prompts).input_ids
completions = [completion for _, completion in dataset]
completion_token_ids = tokenizer(completions).input_ids
tokenized_dataset = []
for i in range(len(dataset)):
output_len = len(completion_token_ids[i])
if fixed_output_len is not None:
output_len = fixed_output_len
tokenized_dataset.append((prompts[i], prompt_token_ids[i], output_len))
# Filter out too long sequences.
filtered_dataset: List[Tuple[str, int, int]] = []
for prompt, prompt_token_ids, output_len in tokenized_dataset:
prompt_len = len(prompt_token_ids)
if prompt_len < 4 or output_len < 4:
# Prune too short sequences.
continue
if prompt_len > 1024 or prompt_len + output_len > 2048:
# Prune too long sequences.
continue
filtered_dataset.append((prompt, prompt_len, output_len))
# Sample the requests.
sampled_requests = random.sample(filtered_dataset, num_requests)
return sampled_requests
def run_vllm(
requests: List[Tuple[str, int, int]],
model: str,
tokenizer: str,
quantization: Optional[str],
tensor_parallel_size: int,
seed: int,
n: int,
use_beam_search: bool,
trust_remote_code: bool,
dtype: str,
) -> float:
from vllm import LLM, SamplingParams
llm = LLM(
model=model,
tokenizer=tokenizer,
quantization=quantization,
tensor_parallel_size=tensor_parallel_size,
seed=seed,
trust_remote_code=trust_remote_code,
dtype=dtype,
)
# Add the requests to the engine.
for prompt, _, output_len in requests:
sampling_params = SamplingParams(
n=n,
temperature=0.0 if use_beam_search else 1.0,
top_p=1.0,
use_beam_search=use_beam_search,
ignore_eos=True,
max_tokens=output_len,
)
# FIXME(woosuk): Do not use internal method.
llm._add_request(
prompt=prompt,
prompt_token_ids=None,
sampling_params=sampling_params,
)
start = time.perf_counter()
# FIXME(woosuk): Do not use internal method.
llm._run_engine(use_tqdm=True)
end = time.perf_counter()
return end - start
def run_hf(
requests: List[Tuple[str, int, int]],
model: str,
tokenizer: PreTrainedTokenizerBase,
n: int,
use_beam_search: bool,
max_batch_size: int,
trust_remote_code: bool,
) -> float:
assert not use_beam_search
llm = AutoModelForCausalLM.from_pretrained(
model, torch_dtype=torch.float16, trust_remote_code=trust_remote_code)
if llm.config.model_type == "llama":
# To enable padding in the HF backend.
tokenizer.pad_token = tokenizer.eos_token
llm = llm.cuda()
pbar = tqdm(total=len(requests))
start = time.perf_counter()
batch: List[str] = []
max_prompt_len = 0
max_output_len = 0
for i in range(len(requests)):
prompt, prompt_len, output_len = requests[i]
# Add the prompt to the batch.
batch.append(prompt)
max_prompt_len = max(max_prompt_len, prompt_len)
max_output_len = max(max_output_len, output_len)
if len(batch) < max_batch_size and i != len(requests) - 1:
# Check if we can add more requests to the batch.
_, next_prompt_len, next_output_len = requests[i + 1]
if (max(max_prompt_len, next_prompt_len) +
max(max_output_len, next_output_len)) <= 2048:
# We can add more requests to the batch.
continue
# Generate the sequences.
input_ids = tokenizer(batch, return_tensors="pt",
padding=True).input_ids
llm_outputs = llm.generate(
input_ids=input_ids.cuda(),
do_sample=not use_beam_search,
num_return_sequences=n,
temperature=1.0,
top_p=1.0,
use_cache=True,
max_new_tokens=max_output_len,
)
# Include the decoding time.
tokenizer.batch_decode(llm_outputs, skip_special_tokens=True)
pbar.update(len(batch))
# Clear the batch.
batch = []
max_prompt_len = 0
max_output_len = 0
end = time.perf_counter()
return end - start
def run_mii(
requests: List[Tuple[str, int, int]],
model: str,
tensor_parallel_size: int,
output_len: int,
) -> float:
from mii import pipeline
llm = pipeline(model, tensor_parallel=tensor_parallel_size)
prompts = [prompt for prompt, _, _ in requests]
start = time.perf_counter()
llm(prompts, max_new_tokens=output_len)
end = time.perf_counter()
return end - start
def main(args: argparse.Namespace):
print(args)
random.seed(args.seed)
# Sample the requests.
tokenizer = AutoTokenizer.from_pretrained(
args.tokenizer, trust_remote_code=args.trust_remote_code)
if args.dataset is None:
# Synthesize a prompt with the given input length.
prompt = "hi" * (args.input_len - 1)
requests = [(prompt, args.input_len, args.output_len)
for _ in range(args.num_prompts)]
else:
requests = sample_requests(args.dataset, args.num_prompts, tokenizer,
args.output_len)
if args.backend == "vllm":
elapsed_time = run_vllm(requests, args.model, args.tokenizer,
args.quantization, args.tensor_parallel_size,
args.seed, args.n, args.use_beam_search,
args.trust_remote_code, args.dtype)
elif args.backend == "hf":
assert args.tensor_parallel_size == 1
elapsed_time = run_hf(requests, args.model, tokenizer, args.n,
args.use_beam_search, args.hf_max_batch_size,
args.trust_remote_code)
elif args.backend == "mii":
elapsed_time = run_mii(requests, args.model, args.tensor_parallel_size,
args.output_len)
else:
raise ValueError(f"Unknown backend: {args.backend}")
total_num_tokens = sum(prompt_len + output_len
for _, prompt_len, output_len in requests)
print(f"Throughput: {len(requests) / elapsed_time:.2f} requests/s, "
f"{total_num_tokens / elapsed_time:.2f} tokens/s")
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Benchmark the throughput.")
parser.add_argument("--backend",
type=str,
choices=["vllm", "hf", "mii"],
default="vllm")
parser.add_argument("--dataset",
type=str,
default=None,
help="Path to the dataset.")
parser.add_argument("--input-len",
type=int,
default=None,
help="Input prompt length for each request")
parser.add_argument("--output-len",
type=int,
default=None,
help="Output length for each request. Overrides the "
"output length from the dataset.")
parser.add_argument("--model", type=str, default="facebook/opt-125m")
parser.add_argument("--tokenizer", type=str, default=None)
parser.add_argument('--quantization',
'-q',
choices=['awq', 'squeezellm', None],
default=None)
parser.add_argument("--tensor-parallel-size", "-tp", type=int, default=1)
parser.add_argument("--n",
type=int,
default=1,
help="Number of generated sequences per prompt.")
parser.add_argument("--use-beam-search", action="store_true")
parser.add_argument("--num-prompts",
type=int,
default=1000,
help="Number of prompts to process.")
parser.add_argument("--seed", type=int, default=0)
parser.add_argument("--hf-max-batch-size",
type=int,
default=None,
help="Maximum batch size for HF backend.")
parser.add_argument('--trust-remote-code',
action='store_true',
help='trust remote code from huggingface')
parser.add_argument(
'--dtype',
type=str,
default='auto',
choices=['auto', 'half', 'float16', 'bfloat16', 'float', 'float32'],
help='data type for model weights and activations. '
'The "auto" option will use FP16 precision '
'for FP32 and FP16 models, and BF16 precision '
'for BF16 models.')
args = parser.parse_args()
if args.tokenizer is None:
args.tokenizer = args.model
if args.dataset is None:
assert args.input_len is not None
assert args.output_len is not None
else:
assert args.input_len is None
if args.backend == "vllm":
if args.hf_max_batch_size is not None:
raise ValueError("HF max batch size is only for HF backend.")
elif args.backend == "hf":
if args.hf_max_batch_size is None:
raise ValueError("HF max batch size is required for HF backend.")
if args.quantization is not None:
raise ValueError("Quantization is only for vLLM backend.")
elif args.backend == "mii":
if args.dtype != "auto":
raise ValueError("dtype must be auto for MII backend.")
if args.n != 1:
raise ValueError("n must be 1 for MII backend.")
if args.use_beam_search:
raise ValueError("Beam search is not supported for MII backend.")
if args.quantization is not None:
raise ValueError("Quantization is only for vLLM backend.")
if args.hf_max_batch_size is not None:
raise ValueError("HF max batch size is only for HF backend.")
if args.tokenizer != args.model:
raise ValueError("Tokenizer must be the same as the model for MII "
"backend.")
main(args)

View File

@ -1,197 +0,0 @@
import argparse
import random
import time
import torch
from vllm import attention_ops
NUM_BLOCKS = 1024
PARTITION_SIZE = 512
@torch.inference_mode()
def main(
version: str,
num_seqs: int,
context_len: int,
num_query_heads: int,
num_kv_heads: int,
head_size: int,
use_alibi: bool,
block_size: int,
dtype: torch.dtype,
seed: int,
do_profile: bool,
) -> None:
random.seed(seed)
torch.random.manual_seed(seed)
torch.cuda.manual_seed(seed)
scale = float(1.0 / (head_size**0.5))
query = torch.empty(num_seqs,
num_query_heads,
head_size,
dtype=dtype,
device="cuda")
query.uniform_(-scale, scale)
assert num_query_heads % num_kv_heads == 0
num_queries_per_kv = num_query_heads // num_kv_heads
head_mapping = torch.repeat_interleave(
torch.arange(num_kv_heads, dtype=torch.int32, device="cuda"),
num_queries_per_kv)
alibi_slopes = None
if use_alibi:
alibi_slopes = torch.randn(num_query_heads,
dtype=torch.float,
device="cuda")
context_lens = [context_len for _ in range(num_seqs)]
max_context_len = max(context_lens)
context_lens = torch.tensor(context_lens, dtype=torch.int, device="cuda")
# Create the block tables.
max_num_blocks_per_seq = (max_context_len + block_size - 1) // block_size
block_tables = []
for _ in range(num_seqs):
block_table = [
random.randint(0, NUM_BLOCKS - 1)
for _ in range(max_num_blocks_per_seq)
]
block_tables.append(block_table)
block_tables = torch.tensor(block_tables, dtype=torch.int, device="cuda")
# Create the KV cache.
x = 16 // torch.tensor([], dtype=dtype).element_size()
key_cache_shape = (NUM_BLOCKS, num_kv_heads, head_size // x, block_size, x)
key_cache = torch.empty(size=key_cache_shape, dtype=dtype, device="cuda")
key_cache.uniform_(-scale, scale)
value_cache_shape = (NUM_BLOCKS, num_kv_heads, head_size, block_size)
value_cache = torch.empty(size=value_cache_shape,
dtype=dtype,
device="cuda")
value_cache.uniform_(-scale, scale)
# Prepare for the paged attention kernel.
output = torch.empty_like(query)
if version == "v2":
num_partitions = ((max_context_len + PARTITION_SIZE - 1) //
PARTITION_SIZE)
tmp_output = torch.empty(
size=(num_seqs, num_query_heads, num_partitions, head_size),
dtype=output.dtype,
device=output.device,
)
exp_sums = torch.empty(
size=(num_seqs, num_query_heads, num_partitions),
dtype=torch.float32,
device=output.device,
)
max_logits = torch.empty_like(exp_sums)
def run_benchmark(num_iters: int, profile: bool = False) -> float:
torch.cuda.synchronize()
if profile:
torch.cuda.cudart().cudaProfilerStart()
start_time = time.perf_counter()
for _ in range(num_iters):
if version == "v1":
attention_ops.paged_attention_v1(
output,
query,
key_cache,
value_cache,
head_mapping,
scale,
block_tables,
context_lens,
block_size,
max_context_len,
alibi_slopes,
)
elif version == "v2":
attention_ops.paged_attention_v2(
output,
exp_sums,
max_logits,
tmp_output,
query,
key_cache,
value_cache,
head_mapping,
scale,
block_tables,
context_lens,
block_size,
max_context_len,
alibi_slopes,
)
else:
raise ValueError(f"Invalid version: {version}")
torch.cuda.synchronize()
end_time = time.perf_counter()
if profile:
torch.cuda.cudart().cudaProfilerStop()  # stop the capture started before the timed loop
return (end_time - start_time) / num_iters
# Warmup.
print("Warming up...")
run_benchmark(num_iters=3, profile=False)
# Benchmark.
if do_profile:
latency = run_benchmark(num_iters=1, profile=True)
else:
latency = run_benchmark(num_iters=100, profile=False)
print(f"Kernel running time: {latency * 1000000:.3f} us")
if __name__ == '__main__':
parser = argparse.ArgumentParser(
description="Benchmark the paged attention kernel.")
parser.add_argument("--version",
type=str,
choices=["v1", "v2"],
default="v2")
parser.add_argument("--batch-size", type=int, default=8)
parser.add_argument("--context-len", type=int, default=4096)
parser.add_argument("--num-query-heads", type=int, default=64)
parser.add_argument("--num-kv-heads", type=int, default=8)
parser.add_argument("--head-size",
type=int,
choices=[64, 80, 96, 112, 128, 256],
default=128)
parser.add_argument("--block-size", type=int, choices=[16, 32], default=16)
parser.add_argument("--use-alibi", action="store_true")
parser.add_argument("--dtype",
type=str,
choices=["half", "bfloat16", "float"],
default="half")
parser.add_argument("--seed", type=int, default=0)
parser.add_argument("--profile", action="store_true")
args = parser.parse_args()
print(args)
if args.num_query_heads % args.num_kv_heads != 0:
raise ValueError("num_query_heads must be divisible by num_kv_heads")
dtype_to_torch_dtype = {
"half": torch.half,
"bfloat16": torch.bfloat16,
"float": torch.float,
}
main(
version=args.version,
num_seqs=args.batch_size,
context_len=args.context_len,
num_query_heads=args.num_query_heads,
num_kv_heads=args.num_kv_heads,
head_size=args.head_size,
block_size=args.block_size,
use_alibi=args.use_alibi,
dtype=dtype_to_torch_dtype[args.dtype],
seed=args.seed,
do_profile=args.profile,
)
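A minimal sketch of driving main() programmatically instead of through the CLI, e.g. to sweep context lengths (assumes a CUDA-capable GPU and a vLLM build that still ships the attention_ops extension imported at the top of the script):
for ctx in (512, 1024, 2048, 4096):
    main(version="v2", num_seqs=8, context_len=ctx, num_query_heads=64,
         num_kv_heads=8, head_size=128, use_alibi=False, block_size=16,
         dtype=torch.half, seed=0, do_profile=False)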

View File

@ -1,16 +0,0 @@
#!/bin/bash
PORT=8000
MODEL=$1
TOKENS=$2
docker run --gpus all --shm-size 1g -p $PORT:80 \
-v $PWD/data:/data \
ghcr.io/huggingface/text-generation-inference:0.8 \
--model-id $MODEL \
--sharded false \
--max-input-length 1024 \
--max-total-tokens 2048 \
--max-best-of 5 \
--max-concurrent-requests 5000 \
--max-batch-total-tokens $TOKENS

View File

@ -1,118 +0,0 @@
# coding=utf-8
# Copyright 2022 The OpenBMB team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import List
from typing import Optional
from typing import Tuple
import torch
import torch.nn.functional as F
from typing_extensions import TypedDict
from transformers.configuration_utils import PretrainedConfig
class CPMDragonflyConfig(PretrainedConfig):
model_type = "cpmdragonfly"
keys_to_ignore_at_inference = ["past_key_values"]
attribute_map = {
"num_key_value_heads": "num_kv_heads",
"hidden_act": "activate_fn",
"hidden_size": "dim_model",
"num_attention_heads": "num_heads",
"intermediate_size": "dim_ff",
"num_hidden_layers": "num_layers",
"vocab_size": "vocab_size",
"rms_norm_eps": "eps",
"scale_emb": "scale_emb",
"scale_depth": "scale_depth",
"scale": "scale",
"attention_scale": "attention_scale"
}
def __init__(
self,
vocab_size=32000,
dim_model=4096,
num_heads=32,
num_kv_heads=32,
dim_head=128,
dim_ff=11008,
num_layers=32,
dropout_p=0.0,
activate_fn="silu",
scale=True,
scale_emb: float=1.,
scale_depth: float=-1,
dim_model_base:int=None,
eps=1e-5,
init_std=0.02,
half: bool = True,
half_type = 'bf16',
mask_modules: Optional[List[Tuple[bool, bool]]] = None,
use_flash_attn: bool = True,
flash_attn_mask_shape="1d",
flash_impl="cuda",
base=10000,
non_checkpointing_layers_num:int = 0,
attention_scale=1,
max_position_embeddings=8192,
rope_scaling=None,
**kwargs,
):
self.vocab_size = vocab_size
self.dim_model = dim_model
self.num_heads = num_heads
self.num_kv_heads = num_kv_heads
self.dim_head = dim_head
self.dim_ff = dim_ff
self.num_layers = num_layers
self.dropout_p = dropout_p
self.activate_fn = activate_fn
self.scale = scale
self.scale_emb = scale_emb
self.half = half
self.half_type = half_type
self.dim_model_base = dim_model_base
self.scale_depth = scale_depth
self.eps = eps
self.init_std = init_std
self.flash_impl = flash_impl
self.mask_modules = mask_modules
self.use_flash_attn = use_flash_attn
self.flash_attn_mask_shape = flash_attn_mask_shape
self.base = base
self.attention_scale=attention_scale
self.max_position_embeddings = max_position_embeddings
self.non_checkpointing_layers_num = non_checkpointing_layers_num
self.rope_scaling = rope_scaling
super().__init__(architectures=["CPMDragonflyForCausalLM"])
@property
def scale_width(self,):
if self.scale:
return self.dim_model / self.dim_model_base
else:
return 1.
@property
def dtype(self, ):
if self.half:
if self.half_type == 'bf16':
return torch.bfloat16
else:
return torch.half
else:
return torch.float
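A minimal usage sketch of the aliases and derived properties above (the numbers are arbitrary and not taken from any released checkpoint; note that scale_width divides by dim_model_base, so configs built with scale=True should set it explicitly):
config = CPMDragonflyConfig(dim_model=2304, dim_model_base=256, num_heads=36, num_kv_heads=36)
print(config.hidden_size)   # 2304 -- attribute_map routes hidden_size to dim_model
print(config.scale_width)   # 9.0  -- dim_model / dim_model_base when scale=True
print(config.dtype)         # torch.bfloat16, since half=True and half_type='bf16' by default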

View File

@ -1,28 +0,0 @@
#include <torch/extension.h>
void silu_and_mul(
torch::Tensor& out,
torch::Tensor& input);
void gelu_new(
torch::Tensor& out,
torch::Tensor& input);
void gelu_fast(
torch::Tensor& out,
torch::Tensor& input);
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
m.def(
"silu_and_mul",
&silu_and_mul,
"Activation function used in SwiGLU.");
m.def(
"gelu_new",
&gelu_new,
"GELU implementation used in GPT-2.");
m.def(
"gelu_fast",
&gelu_fast,
"Approximate GELU implementation.");
}

View File

@ -1,114 +0,0 @@
#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>
#include "dispatch_utils.h"
namespace vllm {
template<typename T>
__device__ __forceinline__ T silu(const T& x) {
// x * sigmoid(x)
return (T) (((float) x) / (1.0f + expf((float) -x)));
}
template<typename scalar_t>
__global__ void silu_and_mul_kernel(
scalar_t* __restrict__ out, // [..., d]
const scalar_t* __restrict__ input, // [..., 2, d]
const int d) {
const int64_t token_idx = blockIdx.x;
for (int64_t idx = threadIdx.x; idx < d; idx += blockDim.x) {
const scalar_t x = __ldg(&input[token_idx * 2 * d + idx]);
const scalar_t y = __ldg(&input[token_idx * 2 * d + d + idx]);
out[token_idx * d + idx] = silu(x) * y;
}
}
} // namespace vllm
void silu_and_mul(
torch::Tensor& out, // [..., d]
torch::Tensor& input) // [..., 2 * d]
{
int64_t num_tokens = input.numel() / input.size(-1);
int d = input.size(-1) / 2;
dim3 grid(num_tokens);
dim3 block(std::min(d, 1024));
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
VLLM_DISPATCH_FLOATING_TYPES(
input.scalar_type(),
"silu_and_mul_kernel",
[&] {
vllm::silu_and_mul_kernel<scalar_t><<<grid, block, 0, stream>>>(
out.data_ptr<scalar_t>(),
input.data_ptr<scalar_t>(),
d);
});
}
namespace vllm {
// Element-wise activation kernel template.
template<typename scalar_t, scalar_t (*ACT_FN)(const scalar_t&)>
__global__ void activation_kernel(
scalar_t* __restrict__ out, // [..., d]
const scalar_t* __restrict__ input, // [..., d]
const int d) {
const int64_t token_idx = blockIdx.x;
for (int64_t idx = threadIdx.x; idx < d; idx += blockDim.x) {
const scalar_t x = __ldg(&input[token_idx * d + idx]);
out[token_idx * d + idx] = ACT_FN(x);
}
}
} // namespace vllm
// Launch element-wise activation kernel.
#define LAUNCH_ACTIVATION_KERNEL(KERNEL) \
int d = input.size(-1); \
int64_t num_tokens = input.numel() / d; \
dim3 grid(num_tokens); \
dim3 block(std::min(d, 1024)); \
const cudaStream_t stream = at::cuda::getCurrentCUDAStream(); \
VLLM_DISPATCH_FLOATING_TYPES( \
input.scalar_type(), \
"activation_kernel", \
[&] { \
vllm::activation_kernel<scalar_t, KERNEL<scalar_t>><<<grid, block, 0, stream>>>( \
out.data_ptr<scalar_t>(), \
input.data_ptr<scalar_t>(), \
d); \
});
namespace vllm {
template<typename T>
__device__ __forceinline__ T gelu_new_kernel(const T& x) {
const float x3 = (float) (x * x * x);
const T t = (T) tanhf((T) (0.79788456f * (float) (x + (T) (0.044715f * x3))));
return ((T) 0.5) * x * (((T) 1.0) + t);
}
template<typename T>
__device__ __forceinline__ T gelu_fast_kernel(const T& x) {
const float f = (float) x;
const T t = (T) tanhf(((T) (f * 0.79788456f)) * (((T) 1.0) + (T) (0.044715f * f) * x));
return ((T) 0.5) * x * (((T) 1.0) + t);
}
} // namespace vllm
void gelu_new(
torch::Tensor& out, // [..., d]
torch::Tensor& input) // [..., d]
{
LAUNCH_ACTIVATION_KERNEL(vllm::gelu_new_kernel);
}
void gelu_fast(
torch::Tensor& out, // [..., d]
torch::Tensor& input) // [..., d]
{
LAUNCH_ACTIVATION_KERNEL(vllm::gelu_fast_kernel);
}
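A plain PyTorch reference sketch of what the three kernels compute, useful for sanity-checking the CUDA path on small tensors (results should match up to floating-point rounding):
import torch

def silu_and_mul_ref(x: torch.Tensor) -> torch.Tensor:
    # x: [..., 2 * d]; silu_and_mul_kernel computes silu(first half) * second half
    d = x.shape[-1] // 2
    return torch.nn.functional.silu(x[..., :d]) * x[..., d:]

def gelu_new_ref(x: torch.Tensor) -> torch.Tensor:
    # tanh approximation used by gelu_new_kernel (GPT-2 style)
    return 0.5 * x * (1.0 + torch.tanh(0.79788456 * (x + 0.044715 * x.pow(3))))

def gelu_fast_ref(x: torch.Tensor) -> torch.Tensor:
    # gelu_fast_kernel evaluates tanh(0.79788456 * x * (1 + 0.044715 * x * x))
    return 0.5 * x * (1.0 + torch.tanh(0.79788456 * x * (1.0 + 0.044715 * x * x)))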

View File

@ -1,42 +0,0 @@
#include <torch/extension.h>
#include <c10/util/Optional.h>
void paged_attention_v1(
torch::Tensor& out,
torch::Tensor& query,
torch::Tensor& key_cache,
torch::Tensor& value_cache,
torch::Tensor& head_mapping,
float scale,
torch::Tensor& block_tables,
torch::Tensor& context_lens,
int block_size,
int max_context_len,
const c10::optional<torch::Tensor>& alibi_slopes);
void paged_attention_v2(
torch::Tensor& out,
torch::Tensor& exp_sums,
torch::Tensor& max_logits,
torch::Tensor& tmp_out,
torch::Tensor& query,
torch::Tensor& key_cache,
torch::Tensor& value_cache,
torch::Tensor& head_mapping,
float scale,
torch::Tensor& block_tables,
torch::Tensor& context_lens,
int block_size,
int max_context_len,
const c10::optional<torch::Tensor>& alibi_slopes);
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
m.def(
"paged_attention_v1",
&paged_attention_v1,
"Compute the attention between an input query and the cached keys/values using PagedAttention.");
m.def(
"paged_attention_v2",
&paged_attention_v2,
"PagedAttention V2.");
}

View File

@ -1,6 +0,0 @@
#pragma once
#include "attention_generic.cuh"
#include "dtype_float16.cuh"
#include "dtype_float32.cuh"
#include "dtype_bfloat16.cuh"

View File

@ -1,64 +0,0 @@
/*
* Adapted from https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention_utils.h
* Copyright (c) 2023, The vLLM team.
* Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once
#include <stdint.h>
namespace vllm {
// A vector type to store Q, K, V elements.
template<typename T, int VEC_SIZE>
struct Vec {};
// A vector type to store FP32 accumulators.
template<typename T>
struct FloatVec {};
// Template vector operations.
template<typename Acc, typename A, typename B>
inline __device__ Acc mul(A a, B b);
template<typename T>
inline __device__ float sum(T v);
template<typename T>
inline __device__ float dot(T a, T b) {
return sum(mul<T, T, T>(a, b));
}
template<typename A, typename T>
inline __device__ float dot(T a, T b) {
return sum(mul<A, T, T>(a, b));
}
template<typename T>
inline __device__ void zero(T& dst) {
constexpr int WORDS = sizeof(T) / 4;
union {
T raw;
uint32_t words[WORDS];
} tmp;
#pragma unroll
for (int ii = 0; ii < WORDS; ++ii) {
tmp.words[ii] = 0u;
}
dst = tmp.raw;
}
} // namespace vllm

View File

@ -1,872 +0,0 @@
/*
* Adapted from https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention/decoder_masked_multihead_attention_template.hpp
* Copyright (c) 2023, The vLLM team.
* Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>
#include "attention_dtypes.h"
#include "attention_utils.cuh"
#include <algorithm>
#define WARP_SIZE 32
#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define DIVIDE_ROUND_UP(a, b) (((a) + (b) - 1) / (b))
namespace vllm {
// Utility function for attention softmax.
template<int NUM_WARPS>
inline __device__ float block_sum(float* red_smem, float sum) {
// Decompose the thread index into warp / lane.
int warp = threadIdx.x / WARP_SIZE;
int lane = threadIdx.x % WARP_SIZE;
// Compute the sum per warp.
#pragma unroll
for (int mask = WARP_SIZE / 2; mask >= 1; mask /= 2) {
sum += __shfl_xor_sync(uint32_t(-1), sum, mask);
}
// Warp leaders store the data to shared memory.
if (lane == 0) {
red_smem[warp] = sum;
}
// Make sure the data is in shared memory.
__syncthreads();
// The warps compute the final sums.
if (lane < NUM_WARPS) {
sum = red_smem[lane];
}
// Parallel reduction inside the warp.
#pragma unroll
for (int mask = NUM_WARPS / 2; mask >= 1; mask /= 2) {
sum += __shfl_xor_sync(uint32_t(-1), sum, mask);
}
// Broadcast to other threads.
return __shfl_sync(uint32_t(-1), sum, 0);
}
// TODO(woosuk): Merge the last two dimensions of the grid.
// Grid: (num_heads, num_seqs, max_num_partitions).
template<
typename scalar_t,
int HEAD_SIZE,
int BLOCK_SIZE,
int NUM_THREADS,
int PARTITION_SIZE = 0> // Zero means no partitioning.
__device__ void paged_attention_kernel(
float* __restrict__ exp_sums, // [num_seqs, num_heads, max_num_partitions]
float* __restrict__ max_logits, // [num_seqs, num_heads, max_num_partitions]
scalar_t* __restrict__ out, // [num_seqs, num_heads, max_num_partitions, head_size]
const scalar_t* __restrict__ q, // [num_seqs, num_heads, head_size]
const scalar_t* __restrict__ k_cache, // [num_blocks, num_kv_heads, head_size/x, block_size, x]
const scalar_t* __restrict__ v_cache, // [num_blocks, num_kv_heads, head_size, block_size]
const int* __restrict__ head_mapping, // [num_heads]
const float scale,
const int* __restrict__ block_tables, // [num_seqs, max_num_blocks_per_seq]
const int* __restrict__ context_lens, // [num_seqs]
const int max_num_blocks_per_seq,
const float* __restrict__ alibi_slopes, // [num_heads]
const int q_stride,
const int kv_block_stride,
const int kv_head_stride) {
const int seq_idx = blockIdx.y;
const int partition_idx = blockIdx.z;
const int max_num_partitions = gridDim.z;
constexpr bool USE_PARTITIONING = PARTITION_SIZE > 0;
const int context_len = context_lens[seq_idx];
if (USE_PARTITIONING && partition_idx * PARTITION_SIZE >= context_len) {
// No work to do. Terminate the thread block.
return;
}
const int num_context_blocks = DIVIDE_ROUND_UP(context_len, BLOCK_SIZE);
const int num_blocks_per_partition = USE_PARTITIONING ? PARTITION_SIZE / BLOCK_SIZE : num_context_blocks;
// [start_block_idx, end_block_idx) is the range of blocks to process.
const int start_block_idx = USE_PARTITIONING ? partition_idx * num_blocks_per_partition : 0;
const int end_block_idx = MIN(start_block_idx + num_blocks_per_partition, num_context_blocks);
const int num_blocks = end_block_idx - start_block_idx;
// [start_token_idx, end_token_idx) is the range of tokens to process.
const int start_token_idx = start_block_idx * BLOCK_SIZE;
const int end_token_idx = MIN(start_token_idx + num_blocks * BLOCK_SIZE, context_len);
const int num_tokens = end_token_idx - start_token_idx;
constexpr int THREAD_GROUP_SIZE = MAX(WARP_SIZE / BLOCK_SIZE, 1);
constexpr int NUM_THREAD_GROUPS = NUM_THREADS / THREAD_GROUP_SIZE; // Note: This assumes THREAD_GROUP_SIZE divides NUM_THREADS
assert(NUM_THREADS % THREAD_GROUP_SIZE == 0);
constexpr int NUM_TOKENS_PER_THREAD_GROUP = DIVIDE_ROUND_UP(BLOCK_SIZE, WARP_SIZE);
constexpr int NUM_WARPS = NUM_THREADS / WARP_SIZE;
const int thread_idx = threadIdx.x;
const int warp_idx = thread_idx / WARP_SIZE;
const int lane = thread_idx % WARP_SIZE;
const int head_idx = blockIdx.x;
const int num_heads = gridDim.x;
const int kv_head_idx = head_mapping[head_idx];
const float alibi_slope = alibi_slopes == nullptr ? 0.f : alibi_slopes[head_idx];
// A vector type to store a part of a key or a query.
// The vector size is configured in such a way that the threads in a thread group
// fetch or compute 16 bytes at a time.
// For example, if the size of a thread group is 4 and the data type is half,
// then the vector size is 16 / (4 * sizeof(half)) == 2.
constexpr int VEC_SIZE = MAX(16 / (THREAD_GROUP_SIZE * sizeof(scalar_t)), 1);
using K_vec = typename Vec<scalar_t, VEC_SIZE>::Type;
using Q_vec = typename Vec<scalar_t, VEC_SIZE>::Type;
constexpr int NUM_ELEMS_PER_THREAD = HEAD_SIZE / THREAD_GROUP_SIZE;
constexpr int NUM_VECS_PER_THREAD = NUM_ELEMS_PER_THREAD / VEC_SIZE;
const int thread_group_idx = thread_idx / THREAD_GROUP_SIZE;
const int thread_group_offset = thread_idx % THREAD_GROUP_SIZE;
// Load the query to registers.
// Each thread in a thread group has a different part of the query.
// For example, if the thread group size is 4, then the first thread in the group
// has 0, 4, 8, ... th vectors of the query, and the second thread has 1, 5, 9, ...
// th vectors of the query, and so on.
// NOTE(woosuk): Because q is split from a qkv tensor, it may not be contiguous.
const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;
__shared__ Q_vec q_vecs[THREAD_GROUP_SIZE][NUM_VECS_PER_THREAD];
#pragma unroll
for (int i = thread_group_idx; i < NUM_VECS_PER_THREAD; i += NUM_THREAD_GROUPS) {
const int vec_idx = thread_group_offset + i * THREAD_GROUP_SIZE;
q_vecs[thread_group_offset][i] = *reinterpret_cast<const Q_vec*>(q_ptr + vec_idx * VEC_SIZE);
}
__syncthreads(); // TODO(naed90): possible speedup if this is replaced with a memory wall right before we use q_vecs
// Memory planning.
extern __shared__ char shared_mem[];
// NOTE(woosuk): We use FP32 for the softmax logits for better accuracy.
float* logits = reinterpret_cast<float*>(shared_mem);
// Workspace for reduction.
__shared__ float red_smem[2 * NUM_WARPS];
// x == THREAD_GROUP_SIZE * VEC_SIZE
// Each thread group fetches x elements from the key at a time.
constexpr int x = 16 / sizeof(scalar_t);
float qk_max = -FLT_MAX;
// Iterate over the key blocks.
// Each warp fetches a block of keys for each iteration.
// Each thread group in a warp fetches a key from the block, and computes
// dot product with the query.
const int* block_table = block_tables + seq_idx * max_num_blocks_per_seq;
for (int block_idx = start_block_idx + warp_idx; block_idx < end_block_idx; block_idx += NUM_WARPS) {
// NOTE(woosuk): The block number is stored in int32. However, we cast it to int64
// because int32 can lead to overflow when this variable is multiplied by large numbers
// (e.g., kv_block_stride).
const int64_t physical_block_number = static_cast<int64_t>(block_table[block_idx]);
// Load a key to registers.
// Each thread in a thread group has a different part of the key.
// For example, if the thread group size is 4, then the first thread in the group
// has 0, 4, 8, ... th vectors of the key, and the second thread has 1, 5, 9, ... th
// vectors of the key, and so on.
for (int i = 0; i < NUM_TOKENS_PER_THREAD_GROUP; i++) {
const int physical_block_offset = (thread_group_idx + i * WARP_SIZE) % BLOCK_SIZE;
const int token_idx = block_idx * BLOCK_SIZE + physical_block_offset;
K_vec k_vecs[NUM_VECS_PER_THREAD];
#pragma unroll
for (int j = 0; j < NUM_VECS_PER_THREAD; j++) {
const scalar_t* k_ptr = k_cache + physical_block_number * kv_block_stride
+ kv_head_idx * kv_head_stride
+ physical_block_offset * x;
const int vec_idx = thread_group_offset + j * THREAD_GROUP_SIZE;
const int offset1 = (vec_idx * VEC_SIZE) / x;
const int offset2 = (vec_idx * VEC_SIZE) % x;
k_vecs[j] = *reinterpret_cast<const K_vec*>(k_ptr + offset1 * BLOCK_SIZE * x + offset2);
}
// Compute dot product.
// This includes a reduction across the threads in the same thread group.
float qk = scale * Qk_dot<scalar_t, THREAD_GROUP_SIZE>::dot(q_vecs[thread_group_offset], k_vecs);
// Add the ALiBi bias if slopes are given.
qk += (alibi_slope != 0) ? alibi_slope * (token_idx - context_len + 1) : 0;
if (thread_group_offset == 0) {
// Store the partial reductions to shared memory.
// NOTE(woosuk): It is required to zero out the masked logits.
const bool mask = token_idx >= context_len;
logits[token_idx - start_token_idx] = mask ? 0.f : qk;
// Update the max value.
qk_max = mask ? qk_max : fmaxf(qk_max, qk);
}
}
}
// Perform reduction across the threads in the same warp to get the
// max qk value for each "warp" (not across the thread block yet).
// The 0-th thread of each thread group already has its max qk value.
#pragma unroll
for (int mask = WARP_SIZE / 2; mask >= THREAD_GROUP_SIZE; mask /= 2) {
qk_max = fmaxf(qk_max, __shfl_xor_sync(uint32_t(-1), qk_max, mask));
}
if (lane == 0) {
red_smem[warp_idx] = qk_max;
}
__syncthreads();
// TODO(woosuk): Refactor this part.
// Get the max qk value for the sequence.
qk_max = lane < NUM_WARPS ? red_smem[lane] : -FLT_MAX;
#pragma unroll
for (int mask = NUM_WARPS / 2; mask >= 1; mask /= 2) {
qk_max = fmaxf(qk_max, __shfl_xor_sync(uint32_t(-1), qk_max, mask));
}
// Broadcast the max qk value to all threads.
qk_max = __shfl_sync(uint32_t(-1), qk_max, 0);
// Get the sum of the exp values.
float exp_sum = 0.f;
for (int i = thread_idx; i < num_tokens; i += NUM_THREADS) {
float val = __expf(logits[i] - qk_max);
logits[i] = val;
exp_sum += val;
}
exp_sum = block_sum<NUM_WARPS>(&red_smem[NUM_WARPS], exp_sum);
// Compute softmax.
const float inv_sum = __fdividef(1.f, exp_sum + 1e-6f);
for (int i = thread_idx; i < num_tokens; i += NUM_THREADS) {
logits[i] *= inv_sum;
}
__syncthreads();
// If partitioning is enabled, store the max logit and exp_sum.
if (USE_PARTITIONING && thread_idx == 0) {
float* max_logits_ptr = max_logits + seq_idx * num_heads * max_num_partitions
+ head_idx * max_num_partitions
+ partition_idx;
*max_logits_ptr = qk_max;
float* exp_sums_ptr = exp_sums + seq_idx * num_heads * max_num_partitions
+ head_idx * max_num_partitions
+ partition_idx;
*exp_sums_ptr = exp_sum;
}
// Each thread will fetch 16 bytes from the value cache at a time.
constexpr int V_VEC_SIZE = MIN(16 / sizeof(scalar_t), BLOCK_SIZE);
using V_vec = typename Vec<scalar_t, V_VEC_SIZE>::Type;
using L_vec = typename Vec<scalar_t, V_VEC_SIZE>::Type;
using Float_L_vec = typename FloatVec<L_vec>::Type;
constexpr int NUM_V_VECS_PER_ROW = BLOCK_SIZE / V_VEC_SIZE;
constexpr int NUM_ROWS_PER_ITER = WARP_SIZE / NUM_V_VECS_PER_ROW;
constexpr int NUM_ROWS_PER_THREAD = DIVIDE_ROUND_UP(HEAD_SIZE, NUM_ROWS_PER_ITER);
// NOTE(woosuk): We use FP32 for the accumulator for better accuracy.
float accs[NUM_ROWS_PER_THREAD];
#pragma unroll
for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
accs[i] = 0.f;
}
scalar_t zero_value;
zero(zero_value);
for (int block_idx = start_block_idx + warp_idx; block_idx < end_block_idx; block_idx += NUM_WARPS) {
// NOTE(woosuk): The block number is stored in int32. However, we cast it to int64
// because int32 can lead to overflow when this variable is multiplied by large numbers
// (e.g., kv_block_stride).
const int64_t physical_block_number = static_cast<int64_t>(block_table[block_idx]);
const int physical_block_offset = (lane % NUM_V_VECS_PER_ROW) * V_VEC_SIZE;
const int token_idx = block_idx * BLOCK_SIZE + physical_block_offset;
L_vec logits_vec;
from_float(logits_vec, *reinterpret_cast<Float_L_vec*>(logits + token_idx - start_token_idx));
const scalar_t* v_ptr = v_cache + physical_block_number * kv_block_stride
+ kv_head_idx * kv_head_stride;
#pragma unroll
for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
const int row_idx = lane / NUM_V_VECS_PER_ROW + i * NUM_ROWS_PER_ITER;
if (row_idx < HEAD_SIZE) {
const int offset = row_idx * BLOCK_SIZE + physical_block_offset;
V_vec v_vec = *reinterpret_cast<const V_vec*>(v_ptr + offset);
if (block_idx == num_context_blocks - 1) {
// NOTE(woosuk): When v_vec contains the tokens that are out of the context,
// we should explicitly zero out the values since they may contain NaNs.
// See https://github.com/vllm-project/vllm/issues/641#issuecomment-1682544472
scalar_t* v_vec_ptr = reinterpret_cast<scalar_t*>(&v_vec);
#pragma unroll
for (int j = 0; j < V_VEC_SIZE; j++) {
v_vec_ptr[j] = token_idx + j < context_len ? v_vec_ptr[j] : zero_value;
}
}
accs[i] += dot(logits_vec, v_vec);
}
}
}
// Perform reduction within each warp.
#pragma unroll
for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
float acc = accs[i];
#pragma unroll
for (int mask = NUM_V_VECS_PER_ROW / 2; mask >= 1; mask /= 2) {
acc += __shfl_xor_sync(uint32_t(-1), acc, mask);
}
accs[i] = acc;
}
// NOTE(woosuk): A barrier is required because the shared memory space for logits
// is reused for the output.
__syncthreads();
// Perform reduction across warps.
float* out_smem = reinterpret_cast<float*>(shared_mem);
#pragma unroll
for (int i = NUM_WARPS; i > 1; i /= 2) {
int mid = i / 2;
// Upper warps write to shared memory.
if (warp_idx >= mid && warp_idx < i) {
float* dst = &out_smem[(warp_idx - mid) * HEAD_SIZE];
#pragma unroll
for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
const int row_idx = lane / NUM_V_VECS_PER_ROW + i * NUM_ROWS_PER_ITER;
if (row_idx < HEAD_SIZE && lane % NUM_V_VECS_PER_ROW == 0) {
dst[row_idx] = accs[i];
}
}
}
__syncthreads();
// Lower warps update the output.
if (warp_idx < mid) {
const float* src = &out_smem[warp_idx * HEAD_SIZE];
#pragma unroll
for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
const int row_idx = lane / NUM_V_VECS_PER_ROW + i * NUM_ROWS_PER_ITER;
if (row_idx < HEAD_SIZE && lane % NUM_V_VECS_PER_ROW == 0) {
accs[i] += src[row_idx];
}
}
}
__syncthreads();
}
// Write the final output.
if (warp_idx == 0) {
scalar_t* out_ptr = out + seq_idx * num_heads * max_num_partitions * HEAD_SIZE
+ head_idx * max_num_partitions * HEAD_SIZE
+ partition_idx * HEAD_SIZE;
#pragma unroll
for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
const int row_idx = lane / NUM_V_VECS_PER_ROW + i * NUM_ROWS_PER_ITER;
if (row_idx < HEAD_SIZE && lane % NUM_V_VECS_PER_ROW == 0) {
from_float(*(out_ptr + row_idx), accs[i]);
}
}
}
}
// Grid: (num_heads, num_seqs, 1).
template<
typename scalar_t,
int HEAD_SIZE,
int BLOCK_SIZE,
int NUM_THREADS>
__global__ void paged_attention_v1_kernel(
scalar_t* __restrict__ out, // [num_seqs, num_heads, head_size]
const scalar_t* __restrict__ q, // [num_seqs, num_heads, head_size]
const scalar_t* __restrict__ k_cache, // [num_blocks, num_kv_heads, head_size/x, block_size, x]
const scalar_t* __restrict__ v_cache, // [num_blocks, num_kv_heads, head_size, block_size]
const int* __restrict__ head_mapping, // [num_heads]
const float scale,
const int* __restrict__ block_tables, // [num_seqs, max_num_blocks_per_seq]
const int* __restrict__ context_lens, // [num_seqs]
const int max_num_blocks_per_seq,
const float* __restrict__ alibi_slopes, // [num_heads]
const int q_stride,
const int kv_block_stride,
const int kv_head_stride) {
paged_attention_kernel<scalar_t, HEAD_SIZE, BLOCK_SIZE, NUM_THREADS>(
/* exp_sums */ nullptr, /* max_logits */ nullptr,
out, q, k_cache, v_cache, head_mapping, scale, block_tables, context_lens,
max_num_blocks_per_seq, alibi_slopes, q_stride, kv_block_stride, kv_head_stride);
}
// Grid: (num_heads, num_seqs, max_num_partitions).
template<
typename scalar_t,
int HEAD_SIZE,
int BLOCK_SIZE,
int NUM_THREADS,
int PARTITION_SIZE>
__global__ void paged_attention_v2_kernel(
float* __restrict__ exp_sums, // [num_seqs, num_heads, max_num_partitions]
float* __restrict__ max_logits, // [num_seqs, num_heads, max_num_partitions]
scalar_t* __restrict__ tmp_out, // [num_seqs, num_heads, max_num_partitions, head_size]
const scalar_t* __restrict__ q, // [num_seqs, num_heads, head_size]
const scalar_t* __restrict__ k_cache, // [num_blocks, num_kv_heads, head_size/x, block_size, x]
const scalar_t* __restrict__ v_cache, // [num_blocks, num_kv_heads, head_size, block_size]
const int* __restrict__ head_mapping, // [num_heads]
const float scale,
const int* __restrict__ block_tables, // [num_seqs, max_num_blocks_per_seq]
const int* __restrict__ context_lens, // [num_seqs]
const int max_num_blocks_per_seq,
const float* __restrict__ alibi_slopes, // [num_heads]
const int q_stride,
const int kv_block_stride,
const int kv_head_stride) {
paged_attention_kernel<scalar_t, HEAD_SIZE, BLOCK_SIZE, NUM_THREADS, PARTITION_SIZE>(
exp_sums, max_logits, tmp_out, q, k_cache, v_cache, head_mapping, scale,
block_tables, context_lens, max_num_blocks_per_seq, alibi_slopes,
q_stride, kv_block_stride, kv_head_stride);
}
// Grid: (num_heads, num_seqs).
template<
typename scalar_t,
int HEAD_SIZE,
int NUM_THREADS,
int PARTITION_SIZE>
__global__ void paged_attention_v2_reduce_kernel(
scalar_t* __restrict__ out, // [num_seqs, num_heads, head_size]
const float* __restrict__ exp_sums, // [num_seqs, num_heads, max_num_partitions]
const float* __restrict__ max_logits, // [num_seqs, num_heads, max_num_partitions]
const scalar_t* __restrict__ tmp_out, // [num_seqs, num_heads, max_num_partitions, head_size]
const int* __restrict__ context_lens, // [num_seqs]
const int max_num_partitions) {
const int num_heads = gridDim.x;
const int head_idx = blockIdx.x;
const int seq_idx = blockIdx.y;
const int context_len = context_lens[seq_idx];
const int num_partitions = DIVIDE_ROUND_UP(context_len, PARTITION_SIZE);
if (num_partitions == 1) {
// No need to reduce. Only copy tmp_out to out.
scalar_t* out_ptr = out + seq_idx * num_heads * HEAD_SIZE + head_idx * HEAD_SIZE;
const scalar_t* tmp_out_ptr = tmp_out + seq_idx * num_heads * max_num_partitions * HEAD_SIZE
+ head_idx * max_num_partitions * HEAD_SIZE;
for (int i = threadIdx.x; i < HEAD_SIZE; i += blockDim.x) {
out_ptr[i] = tmp_out_ptr[i];
}
// Terminate the thread block.
return;
}
constexpr int NUM_WARPS = NUM_THREADS / WARP_SIZE;
const int warp_idx = threadIdx.x / WARP_SIZE;
const int lane = threadIdx.x % WARP_SIZE;
// Size: 2 * num_partitions.
extern __shared__ char shared_mem[];
// Workspace for reduction.
__shared__ float red_smem[2 * NUM_WARPS];
// Load max logits to shared memory.
float* shared_max_logits = reinterpret_cast<float*>(shared_mem);
const float* max_logits_ptr = max_logits + seq_idx * num_heads * max_num_partitions
+ head_idx * max_num_partitions;
float max_logit = -FLT_MAX;
for (int i = threadIdx.x; i < num_partitions; i += blockDim.x) {
const float l = max_logits_ptr[i];
shared_max_logits[i] = l;
max_logit = fmaxf(max_logit, l);
}
__syncthreads();
// Get the global max logit.
// Reduce within the warp.
#pragma unroll
for (int mask = WARP_SIZE / 2; mask >= 1; mask /= 2) {
max_logit = fmaxf(max_logit, __shfl_xor_sync(uint32_t(-1), max_logit, mask));
}
if (lane == 0) {
red_smem[warp_idx] = max_logit;
}
__syncthreads();
// Reduce across warps.
max_logit = lane < NUM_WARPS ? red_smem[lane] : -FLT_MAX;
#pragma unroll
for (int mask = NUM_WARPS / 2; mask >= 1; mask /= 2) {
max_logit = fmaxf(max_logit, __shfl_xor_sync(uint32_t(-1), max_logit, mask));
}
// Broadcast the max value to all threads.
max_logit = __shfl_sync(uint32_t(-1), max_logit, 0);
// Load rescaled exp sums to shared memory.
float* shared_exp_sums = reinterpret_cast<float*>(shared_mem + sizeof(float) * num_partitions);
const float* exp_sums_ptr = exp_sums + seq_idx * num_heads * max_num_partitions
+ head_idx * max_num_partitions;
float global_exp_sum = 0.0f;
for (int i = threadIdx.x; i < num_partitions; i += blockDim.x) {
float l = shared_max_logits[i];
float rescaled_exp_sum = exp_sums_ptr[i] * expf(l - max_logit);
global_exp_sum += rescaled_exp_sum;
shared_exp_sums[i] = rescaled_exp_sum;
}
__syncthreads();
global_exp_sum = block_sum<NUM_WARPS>(&red_smem[NUM_WARPS], global_exp_sum);
const float inv_global_exp_sum = __fdividef(1.0f, global_exp_sum + 1e-6f);
// Aggregate tmp_out to out.
const scalar_t* tmp_out_ptr = tmp_out + seq_idx * num_heads * max_num_partitions * HEAD_SIZE
+ head_idx * max_num_partitions * HEAD_SIZE;
scalar_t* out_ptr = out + seq_idx * num_heads * HEAD_SIZE + head_idx * HEAD_SIZE;
#pragma unroll
for (int i = threadIdx.x; i < HEAD_SIZE; i += NUM_THREADS) {
float acc = 0.0f;
for (int j = 0; j < num_partitions; ++j) {
acc += to_float(tmp_out_ptr[j * HEAD_SIZE + i]) * shared_exp_sums[j] * inv_global_exp_sum;
}
from_float(out_ptr[i], acc);
}
}
} // namespace vllm
#define LAUNCH_PAGED_ATTENTION_V1(HEAD_SIZE) \
cudaFuncSetAttribute( \
vllm::paged_attention_v1_kernel<T, HEAD_SIZE, BLOCK_SIZE, NUM_THREADS>, \
cudaFuncAttributeMaxDynamicSharedMemorySize, shared_mem_size); \
vllm::paged_attention_v1_kernel<T, HEAD_SIZE, BLOCK_SIZE, NUM_THREADS> \
<<<grid, block, shared_mem_size, stream>>>( \
out_ptr, \
query_ptr, \
key_cache_ptr, \
value_cache_ptr, \
head_mapping_ptr, \
scale, \
block_tables_ptr, \
context_lens_ptr, \
max_num_blocks_per_seq, \
alibi_slopes_ptr, \
q_stride, \
kv_block_stride, \
kv_head_stride);
// TODO(woosuk): Tune NUM_THREADS.
template<
typename T,
int BLOCK_SIZE,
int NUM_THREADS = 128>
void paged_attention_v1_launcher(
torch::Tensor& out,
torch::Tensor& query,
torch::Tensor& key_cache,
torch::Tensor& value_cache,
torch::Tensor& head_mapping,
float scale,
torch::Tensor& block_tables,
torch::Tensor& context_lens,
int max_context_len,
const c10::optional<torch::Tensor>& alibi_slopes) {
int num_seqs = query.size(0);
int num_heads = query.size(1);
int head_size = query.size(2);
int max_num_blocks_per_seq = block_tables.size(1);
int q_stride = query.stride(0);
int kv_block_stride = key_cache.stride(0);
int kv_head_stride = key_cache.stride(1);
int thread_group_size = MAX(WARP_SIZE / BLOCK_SIZE, 1);
assert(head_size % thread_group_size == 0);
// NOTE: alibi_slopes is optional.
const float* alibi_slopes_ptr = alibi_slopes ?
reinterpret_cast<const float*>(alibi_slopes.value().data_ptr())
: nullptr;
T* out_ptr = reinterpret_cast<T*>(out.data_ptr());
T* query_ptr = reinterpret_cast<T*>(query.data_ptr());
T* key_cache_ptr = reinterpret_cast<T*>(key_cache.data_ptr());
T* value_cache_ptr = reinterpret_cast<T*>(value_cache.data_ptr());
int* head_mapping_ptr = reinterpret_cast<int*>(head_mapping.data_ptr());
int* block_tables_ptr = block_tables.data_ptr<int>();
int* context_lens_ptr = context_lens.data_ptr<int>();
constexpr int NUM_WARPS = NUM_THREADS / WARP_SIZE;
int padded_max_context_len = DIVIDE_ROUND_UP(max_context_len, BLOCK_SIZE) * BLOCK_SIZE;
int logits_size = padded_max_context_len * sizeof(float);
int outputs_size = (NUM_WARPS / 2) * head_size * sizeof(float);
// Python-side check in vllm.worker.worker._check_if_can_support_max_seq_len
// Keep that in sync with the logic here!
int shared_mem_size = std::max(logits_size, outputs_size);
dim3 grid(num_heads, num_seqs, 1);
dim3 block(NUM_THREADS);
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
switch (head_size) {
// NOTE(woosuk): To reduce the compilation time, we only compile for the
// head sizes that we use in the model. However, we can easily extend this
// to support any head size which is a multiple of 16.
case 64:
LAUNCH_PAGED_ATTENTION_V1(64);
break;
case 80:
LAUNCH_PAGED_ATTENTION_V1(80);
break;
case 96:
LAUNCH_PAGED_ATTENTION_V1(96);
break;
case 112:
LAUNCH_PAGED_ATTENTION_V1(112);
break;
case 128:
LAUNCH_PAGED_ATTENTION_V1(128);
break;
case 256:
LAUNCH_PAGED_ATTENTION_V1(256);
break;
default:
TORCH_CHECK(false, "Unsupported head size: ", head_size);
break;
}
}
#define CALL_V1_LAUNCHER(T, BLOCK_SIZE) \
paged_attention_v1_launcher<T, BLOCK_SIZE>( \
out, \
query, \
key_cache, \
value_cache, \
head_mapping, \
scale, \
block_tables, \
context_lens, \
max_context_len, \
alibi_slopes);
// NOTE(woosuk): To reduce the compilation time, we omitted block sizes
// 1, 2, 4, 64, 128, 256.
#define CALL_V1_LAUNCHER_BLOCK_SIZE(T) \
switch (block_size) { \
case 8: \
CALL_V1_LAUNCHER(T, 8); \
break; \
case 16: \
CALL_V1_LAUNCHER(T, 16); \
break; \
case 32: \
CALL_V1_LAUNCHER(T, 32); \
break; \
default: \
TORCH_CHECK(false, "Unsupported block size: ", block_size); \
break; \
}
void paged_attention_v1(
torch::Tensor& out, // [num_seqs, num_heads, head_size]
torch::Tensor& query, // [num_seqs, num_heads, head_size]
torch::Tensor& key_cache, // [num_blocks, num_heads, head_size/x, block_size, x]
torch::Tensor& value_cache, // [num_blocks, num_heads, head_size, block_size]
torch::Tensor& head_mapping, // [num_heads]
float scale,
torch::Tensor& block_tables, // [num_seqs, max_num_blocks_per_seq]
torch::Tensor& context_lens, // [num_seqs]
int block_size,
int max_context_len,
const c10::optional<torch::Tensor>& alibi_slopes) {
if (query.dtype() == at::ScalarType::Float) {
CALL_V1_LAUNCHER_BLOCK_SIZE(float);
} else if (query.dtype() == at::ScalarType::Half) {
CALL_V1_LAUNCHER_BLOCK_SIZE(uint16_t);
} else if (query.dtype() == at::ScalarType::BFloat16) {
CALL_V1_LAUNCHER_BLOCK_SIZE(__nv_bfloat16);
} else {
TORCH_CHECK(false, "Unsupported data type: ", query.dtype());
}
}
#define LAUNCH_PAGED_ATTENTION_V2(HEAD_SIZE) \
vllm::paged_attention_v2_kernel<T, HEAD_SIZE, BLOCK_SIZE, NUM_THREADS, PARTITION_SIZE> \
<<<grid, block, shared_mem_size, stream>>>( \
exp_sums_ptr, \
max_logits_ptr, \
tmp_out_ptr, \
query_ptr, \
key_cache_ptr, \
value_cache_ptr, \
head_mapping_ptr, \
scale, \
block_tables_ptr, \
context_lens_ptr, \
max_num_blocks_per_seq, \
alibi_slopes_ptr, \
q_stride, \
kv_block_stride, \
kv_head_stride); \
vllm::paged_attention_v2_reduce_kernel<T, HEAD_SIZE, NUM_THREADS, PARTITION_SIZE> \
<<<reduce_grid, block, reduce_shared_mem_size, stream>>>( \
out_ptr, \
exp_sums_ptr, \
max_logits_ptr, \
tmp_out_ptr, \
context_lens_ptr, \
max_num_partitions);
template<
typename T,
int BLOCK_SIZE,
int NUM_THREADS = 128,
int PARTITION_SIZE = 512>
void paged_attention_v2_launcher(
torch::Tensor& out,
torch::Tensor& exp_sums,
torch::Tensor& max_logits,
torch::Tensor& tmp_out,
torch::Tensor& query,
torch::Tensor& key_cache,
torch::Tensor& value_cache,
torch::Tensor& head_mapping,
float scale,
torch::Tensor& block_tables,
torch::Tensor& context_lens,
int max_context_len,
const c10::optional<torch::Tensor>& alibi_slopes) {
int num_seqs = query.size(0);
int num_heads = query.size(1);
int head_size = query.size(2);
int max_num_blocks_per_seq = block_tables.size(1);
int q_stride = query.stride(0);
int kv_block_stride = key_cache.stride(0);
int kv_head_stride = key_cache.stride(1);
int thread_group_size = MAX(WARP_SIZE / BLOCK_SIZE, 1);
assert(head_size % thread_group_size == 0);
// NOTE: alibi_slopes is optional.
const float* alibi_slopes_ptr = alibi_slopes ?
reinterpret_cast<const float*>(alibi_slopes.value().data_ptr())
: nullptr;
T* out_ptr = reinterpret_cast<T*>(out.data_ptr());
float* exp_sums_ptr = reinterpret_cast<float*>(exp_sums.data_ptr());
float* max_logits_ptr = reinterpret_cast<float*>(max_logits.data_ptr());
T* tmp_out_ptr = reinterpret_cast<T*>(tmp_out.data_ptr());
T* query_ptr = reinterpret_cast<T*>(query.data_ptr());
T* key_cache_ptr = reinterpret_cast<T*>(key_cache.data_ptr());
T* value_cache_ptr = reinterpret_cast<T*>(value_cache.data_ptr());
int* head_mapping_ptr = reinterpret_cast<int*>(head_mapping.data_ptr());
int* block_tables_ptr = block_tables.data_ptr<int>();
int* context_lens_ptr = context_lens.data_ptr<int>();
constexpr int NUM_WARPS = NUM_THREADS / WARP_SIZE;
int max_num_partitions = DIVIDE_ROUND_UP(max_context_len, PARTITION_SIZE);
int logits_size = PARTITION_SIZE * sizeof(float);
int outputs_size = (NUM_WARPS / 2) * head_size * sizeof(float);
// For paged attention v2 kernel.
dim3 grid(num_heads, num_seqs, max_num_partitions);
int shared_mem_size = std::max(logits_size, outputs_size);
// For paged attention v2 reduce kernel.
dim3 reduce_grid(num_heads, num_seqs);
int reduce_shared_mem_size = 2 * max_num_partitions * sizeof(float);
dim3 block(NUM_THREADS);
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
switch (head_size) {
// NOTE(woosuk): To reduce the compilation time, we only compile for the
// head sizes that we use in the model. However, we can easily extend this
// to support any head size which is a multiple of 16.
case 64:
LAUNCH_PAGED_ATTENTION_V2(64);
break;
case 80:
LAUNCH_PAGED_ATTENTION_V2(80);
break;
case 96:
LAUNCH_PAGED_ATTENTION_V2(96);
break;
case 112:
LAUNCH_PAGED_ATTENTION_V2(112);
break;
case 128:
LAUNCH_PAGED_ATTENTION_V2(128);
break;
case 256:
LAUNCH_PAGED_ATTENTION_V2(256);
break;
default:
TORCH_CHECK(false, "Unsupported head size: ", head_size);
break;
}
}
#define CALL_V2_LAUNCHER(T, BLOCK_SIZE) \
paged_attention_v2_launcher<T, BLOCK_SIZE>( \
out, \
exp_sums, \
max_logits, \
tmp_out, \
query, \
key_cache, \
value_cache, \
head_mapping, \
scale, \
block_tables, \
context_lens, \
max_context_len, \
alibi_slopes);
// NOTE(woosuk): To reduce the compilation time, we omitted block sizes
// 1, 2, 4, 64, 128, 256.
#define CALL_V2_LAUNCHER_BLOCK_SIZE(T) \
switch (block_size) { \
case 8: \
CALL_V2_LAUNCHER(T, 8); \
break; \
case 16: \
CALL_V2_LAUNCHER(T, 16); \
break; \
case 32: \
CALL_V2_LAUNCHER(T, 32); \
break; \
default: \
TORCH_CHECK(false, "Unsupported block size: ", block_size); \
break; \
}
void paged_attention_v2(
torch::Tensor& out, // [num_seqs, num_heads, head_size]
torch::Tensor& exp_sums, // [num_seqs, num_heads, max_num_partitions]
torch::Tensor& max_logits, // [num_seqs, num_heads, max_num_partitions]
torch::Tensor& tmp_out, // [num_seqs, num_heads, max_num_partitions, head_size]
torch::Tensor& query, // [num_seqs, num_heads, head_size]
torch::Tensor& key_cache, // [num_blocks, num_heads, head_size/x, block_size, x]
torch::Tensor& value_cache, // [num_blocks, num_heads, head_size, block_size]
torch::Tensor& head_mapping, // [num_heads]
float scale,
torch::Tensor& block_tables, // [num_seqs, max_num_blocks_per_seq]
torch::Tensor& context_lens, // [num_seqs]
int block_size,
int max_context_len,
const c10::optional<torch::Tensor>& alibi_slopes) {
if (query.dtype() == at::ScalarType::Float) {
CALL_V2_LAUNCHER_BLOCK_SIZE(float);
} else if (query.dtype() == at::ScalarType::Half) {
CALL_V2_LAUNCHER_BLOCK_SIZE(uint16_t);
} else if (query.dtype() == at::ScalarType::BFloat16) {
CALL_V2_LAUNCHER_BLOCK_SIZE(__nv_bfloat16);
} else {
TORCH_CHECK(false, "Unsupported data type: ", query.dtype());
}
}
#undef WARP_SIZE
#undef MAX
#undef MIN
#undef DIVIDE_ROUND_UP
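The v2 path splits a long context into PARTITION_SIZE-token chunks and merges the per-partition softmax statistics in paged_attention_v2_reduce_kernel. A NumPy sketch of that merge for one (seq, head) pair (illustrative only; shapes follow the kernel comments above):
import numpy as np

def reduce_partitions(tmp_out, max_logits, exp_sums):
    # tmp_out: [num_partitions, head_size]; max_logits, exp_sums: [num_partitions]
    global_max = max_logits.max()
    rescaled = exp_sums * np.exp(max_logits - global_max)   # shift to the global max logit
    weights = rescaled / (rescaled.sum() + 1e-6)            # same 1e-6 guard as the kernel
    return (weights[:, None] * tmp_out).sum(axis=0)         # weighted sum of partial outputs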

View File

@ -1,55 +0,0 @@
/*
* Adapted from https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention/decoder_masked_multihead_attention_template.hpp
* Copyright (c) 2023, The vLLM team.
* Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once
#include "attention_dtypes.h"
#include <float.h>
#include <type_traits>
namespace vllm {
// Q*K^T operation.
template<int THREAD_GROUP_SIZE, typename Vec, int N>
inline __device__ float qk_dot_(const Vec (&q)[N], const Vec (&k)[N]) {
using A_vec = typename FloatVec<Vec>::Type;
// Compute the parallel products for Q*K^T (treat vector lanes separately).
A_vec qk_vec = mul<A_vec, Vec, Vec>(q[0], k[0]);
#pragma unroll
for (int ii = 1; ii < N; ++ii) {
qk_vec = fma(q[ii], k[ii], qk_vec);
}
// Finalize the reduction across lanes.
float qk = sum(qk_vec);
#pragma unroll
for (int mask = THREAD_GROUP_SIZE / 2; mask >= 1; mask /= 2) {
qk += __shfl_xor_sync(uint32_t(-1), qk, mask);
}
return qk;
}
template<typename T, int THREAD_GROUP_SIZE>
struct Qk_dot {
template<typename Vec, int N>
static inline __device__ float dot(const Vec (&q)[N], const Vec (&k)[N]) {
return qk_dot_<THREAD_GROUP_SIZE>(q, k);
}
};
} // namespace vllm
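Functionally, qk_dot_ computes one query-key dot product sharded across a thread group: each lane accumulates its slice of the head dimension, and the __shfl_xor_sync loop folds the per-lane partials together. A NumPy equivalent (a sketch that ignores the warp-shuffle mechanics):
import numpy as np

def qk_dot_ref(q_lanes, k_lanes):
    # q_lanes, k_lanes: [THREAD_GROUP_SIZE, N, VEC_SIZE] slices held by the lanes of one group
    per_lane = np.einsum("gnv,gnv->g", q_lanes, k_lanes)  # the per-lane mul/fma + sum
    return per_lane.sum()                                 # the butterfly shuffle reduction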

View File

@ -1,438 +0,0 @@
/*
* Adapted from https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention/decoder_masked_multihead_attention_template.hpp
* and https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention_utils.h
* Copyright (c) 2023, The vLLM team.
* Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once
#include "attention_generic.cuh"
#include "dtype_float32.cuh"
#include <cuda_bf16.h>
#include <cuda_fp16.h>
#include <stdint.h>
namespace vllm {
// Define custom BF16 vector data types.
struct bf16_4_t {
__nv_bfloat162 x;
__nv_bfloat162 y;
};
struct bf16_8_t {
__nv_bfloat162 x;
__nv_bfloat162 y;
__nv_bfloat162 z;
__nv_bfloat162 w;
};
// BF16 vector types for Q, K, V.
template<>
struct Vec<__nv_bfloat16, 1> {
using Type = __nv_bfloat16;
};
template<>
struct Vec<__nv_bfloat16, 2> {
using Type = __nv_bfloat162;
};
template<>
struct Vec<__nv_bfloat16, 4> {
using Type = bf16_4_t;
};
template<>
struct Vec<__nv_bfloat16, 8> {
using Type = bf16_8_t;
};
// FP32 accumulator vector types corresponding to Vec.
template<>
struct FloatVec<__nv_bfloat16> {
using Type = float;
};
template<>
struct FloatVec<__nv_bfloat162> {
using Type = float2;
};
template<>
struct FloatVec<bf16_4_t> {
using Type = Float4_;
};
template<>
struct FloatVec<bf16_8_t> {
using Type = Float8_;
};
// Utility functions for type conversions.
inline __device__ float2 bf1622float2(const __nv_bfloat162 val) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 800
assert(false);
#else
return __bfloat1622float2(val);
#endif
}
inline __device__ __nv_bfloat162 bf162bf162(const __nv_bfloat16 val) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 800
assert(false);
#else
return __bfloat162bfloat162(val);
#endif
}
// Vector addition.
inline __device__ __nv_bfloat16 add(__nv_bfloat16 a, __nv_bfloat16 b) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 800
assert(false);
#else
return a + b;
#endif
}
inline __device__ __nv_bfloat162 add(__nv_bfloat162 a, __nv_bfloat162 b) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 800
assert(false);
#else
return __hadd2(a, b);
#endif
}
inline __device__ bf16_4_t add(bf16_4_t a, bf16_4_t b) {
bf16_4_t c;
c.x = add(a.x, b.x);
c.y = add(a.y, b.y);
return c;
}
inline __device__ bf16_8_t add(bf16_8_t a, bf16_8_t b) {
bf16_8_t c;
c.x = add(a.x, b.x);
c.y = add(a.y, b.y);
c.z = add(a.z, b.z);
c.w = add(a.w, b.w);
return c;
}
inline __device__ float2 add(__nv_bfloat162 a, float2 fb) {
float2 fa = bf1622float2(a);
return add(fa, fb);
}
inline __device__ Float4_ add(bf16_4_t a, Float4_ fb) {
Float4_ fc;
fc.x = add(a.x, fb.x);
fc.y = add(a.y, fb.y);
return fc;
}
inline __device__ Float8_ add(bf16_8_t a, Float8_ fb) {
Float8_ fc;
fc.x = add(a.x, fb.x);
fc.y = add(a.y, fb.y);
fc.z = add(a.z, fb.z);
fc.w = add(a.w, fb.w);
return fc;
}
// Vector multiplication.
template<>
inline __device__ __nv_bfloat16 mul(__nv_bfloat16 a, __nv_bfloat16 b) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 800
assert(false);
#else
return __hmul(a, b);
#endif
}
template<>
inline __device__ __nv_bfloat162 mul(__nv_bfloat162 a, __nv_bfloat162 b) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 800
assert(false);
#else
return __hmul2(a, b);
#endif
}
template<>
inline __device__ __nv_bfloat162 mul(__nv_bfloat16 a, __nv_bfloat162 b) {
return mul<__nv_bfloat162, __nv_bfloat162, __nv_bfloat162>(bf162bf162(a), b);
}
template<>
inline __device__ bf16_4_t mul(bf16_4_t a, bf16_4_t b) {
bf16_4_t c;
c.x = mul<__nv_bfloat162, __nv_bfloat162, __nv_bfloat162>(a.x, b.x);
c.y = mul<__nv_bfloat162, __nv_bfloat162, __nv_bfloat162>(a.y, b.y);
return c;
}
template<>
inline __device__ bf16_4_t mul(__nv_bfloat16 a, bf16_4_t b) {
__nv_bfloat162 s = bf162bf162(a);
bf16_4_t c;
c.x = mul<__nv_bfloat162, __nv_bfloat162, __nv_bfloat162>(s, b.x);
c.y = mul<__nv_bfloat162, __nv_bfloat162, __nv_bfloat162>(s, b.y);
return c;
}
template<>
inline __device__ bf16_8_t mul(bf16_8_t a, bf16_8_t b) {
bf16_8_t c;
c.x = mul<__nv_bfloat162, __nv_bfloat162, __nv_bfloat162>(a.x, b.x);
c.y = mul<__nv_bfloat162, __nv_bfloat162, __nv_bfloat162>(a.y, b.y);
c.z = mul<__nv_bfloat162, __nv_bfloat162, __nv_bfloat162>(a.z, b.z);
c.w = mul<__nv_bfloat162, __nv_bfloat162, __nv_bfloat162>(a.w, b.w);
return c;
}
template<>
inline __device__ bf16_8_t mul(__nv_bfloat16 a, bf16_8_t b) {
__nv_bfloat162 s = bf162bf162(a);
bf16_8_t c;
c.x = mul<__nv_bfloat162, __nv_bfloat162, __nv_bfloat162>(s, b.x);
c.y = mul<__nv_bfloat162, __nv_bfloat162, __nv_bfloat162>(s, b.y);
c.z = mul<__nv_bfloat162, __nv_bfloat162, __nv_bfloat162>(s, b.z);
c.w = mul<__nv_bfloat162, __nv_bfloat162, __nv_bfloat162>(s, b.w);
return c;
}
template<>
inline __device__ float mul(__nv_bfloat16 a, __nv_bfloat16 b) {
float fa = __bfloat162float(a);
float fb = __bfloat162float(b);
return fa * fb;
}
template<>
inline __device__ float2 mul(__nv_bfloat162 a, __nv_bfloat162 b) {
float2 fa = bf1622float2(a);
float2 fb = bf1622float2(b);
return mul<float2, float2, float2>(fa, fb);
}
template<>
inline __device__ float2 mul(__nv_bfloat16 a, __nv_bfloat162 b) {
return mul<float2, __nv_bfloat162, __nv_bfloat162>(bf162bf162(a), b);
}
template<>
inline __device__ Float4_ mul(bf16_4_t a, bf16_4_t b) {
Float4_ fc;
fc.x = mul<float2, __nv_bfloat162, __nv_bfloat162>(a.x, b.x);
fc.y = mul<float2, __nv_bfloat162, __nv_bfloat162>(a.y, b.y);
return fc;
}
template<>
inline __device__ Float4_ mul(__nv_bfloat16 a, bf16_4_t b) {
__nv_bfloat162 s = bf162bf162(a);
Float4_ fc;
fc.x = mul<float2, __nv_bfloat162, __nv_bfloat162>(s, b.x);
fc.y = mul<float2, __nv_bfloat162, __nv_bfloat162>(s, b.y);
return fc;
}
template<>
inline __device__ Float8_ mul(bf16_8_t a, bf16_8_t b) {
Float8_ fc;
fc.x = mul<float2, __nv_bfloat162, __nv_bfloat162>(a.x, b.x);
fc.y = mul<float2, __nv_bfloat162, __nv_bfloat162>(a.y, b.y);
fc.z = mul<float2, __nv_bfloat162, __nv_bfloat162>(a.z, b.z);
fc.w = mul<float2, __nv_bfloat162, __nv_bfloat162>(a.w, b.w);
return fc;
}
template<>
inline __device__ Float8_ mul(__nv_bfloat16 a, bf16_8_t b) {
__nv_bfloat162 s = bf162bf162(a);
Float8_ fc;
fc.x = mul<float2, __nv_bfloat162, __nv_bfloat162>(s, b.x);
fc.y = mul<float2, __nv_bfloat162, __nv_bfloat162>(s, b.y);
fc.z = mul<float2, __nv_bfloat162, __nv_bfloat162>(s, b.z);
fc.w = mul<float2, __nv_bfloat162, __nv_bfloat162>(s, b.w);
return fc;
}
// Vector fused multiply-add.
inline __device__ __nv_bfloat162 fma(__nv_bfloat162 a, __nv_bfloat162 b, __nv_bfloat162 c) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 800
assert(false);
#else
return __hfma2(a, b, c);
#endif
}
inline __device__ __nv_bfloat162 fma(__nv_bfloat16 a, __nv_bfloat162 b, __nv_bfloat162 c) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 800
assert(false);
#else
return __hfma2(bf162bf162(a), b, c);
#endif
}
inline __device__ bf16_4_t fma(bf16_4_t a, bf16_4_t b, bf16_4_t c) {
bf16_4_t d;
d.x = fma(a.x, b.x, c.x);
d.y = fma(a.y, b.y, c.y);
return d;
}
inline __device__ bf16_4_t fma(__nv_bfloat16 a, bf16_4_t b, bf16_4_t c) {
__nv_bfloat162 s = bf162bf162(a);
bf16_4_t d;
d.x = fma(s, b.x, c.x);
d.y = fma(s, b.y, c.y);
return d;
}
inline __device__ bf16_8_t fma(bf16_8_t a, bf16_8_t b, bf16_8_t c) {
bf16_8_t d;
d.x = fma(a.x, b.x, c.x);
d.y = fma(a.y, b.y, c.y);
d.z = fma(a.z, b.z, c.z);
d.w = fma(a.w, b.w, c.w);
return d;
}
inline __device__ bf16_8_t fma(__nv_bfloat16 a, bf16_8_t b, bf16_8_t c) {
__nv_bfloat162 s = bf162bf162(a);
bf16_8_t d;
d.x = fma(s, b.x, c.x);
d.y = fma(s, b.y, c.y);
d.z = fma(s, b.z, c.z);
d.w = fma(s, b.w, c.w);
return d;
}
inline __device__ float fma(__nv_bfloat16 a, __nv_bfloat16 b, float fc) {
return __bfloat162float(a) * __bfloat162float(b) + fc;
}
inline __device__ float2 fma(__nv_bfloat162 a, __nv_bfloat162 b, float2 fc) {
float2 fa = bf1622float2(a);
float2 fb = bf1622float2(b);
return fma(fa, fb, fc);
}
inline __device__ float2 fma(__nv_bfloat16 a, __nv_bfloat162 b, float2 fc) {
return fma(bf162bf162(a), b, fc);
}
inline __device__ Float4_ fma(bf16_4_t a, bf16_4_t b, Float4_ fc) {
Float4_ fd;
fd.x = fma(a.x, b.x, fc.x);
fd.y = fma(a.y, b.y, fc.y);
return fd;
}
inline __device__ Float4_ fma(__nv_bfloat16 a, bf16_4_t b, Float4_ fc) {
__nv_bfloat162 s = bf162bf162(a);
Float4_ fd;
fd.x = fma(s, b.x, fc.x);
fd.y = fma(s, b.y, fc.y);
return fd;
}
inline __device__ Float8_ fma(bf16_8_t a, bf16_8_t b, Float8_ fc) {
Float8_ fd;
fd.x = fma(a.x, b.x, fc.x);
fd.y = fma(a.y, b.y, fc.y);
fd.z = fma(a.z, b.z, fc.z);
fd.w = fma(a.w, b.w, fc.w);
return fd;
}
inline __device__ Float8_ fma(__nv_bfloat16 a, bf16_8_t b, Float8_ fc) {
__nv_bfloat162 s = bf162bf162(a);
Float8_ fd;
fd.x = fma(s, b.x, fc.x);
fd.y = fma(s, b.y, fc.y);
fd.z = fma(s, b.z, fc.z);
fd.w = fma(s, b.w, fc.w);
return fd;
}
// Vector sum.
template<>
inline __device__ float sum(__nv_bfloat16 v) {
return __bfloat162float(v);
}
template<>
inline __device__ float sum(__nv_bfloat162 v) {
float2 vf = bf1622float2(v);
return vf.x + vf.y;
}
template<>
inline __device__ float sum(bf16_4_t v) {
return sum(v.x) + sum(v.y);
}
template<>
inline __device__ float sum(bf16_8_t v) {
return sum(v.x) + sum(v.y) + sum(v.z) + sum(v.w);
}
// From float32 to bfloat16.
inline __device__ void from_float(__nv_bfloat16& dst, float src) {
dst = __float2bfloat16(src);
}
inline __device__ void from_float(__nv_bfloat162& dst, float2 src) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 800
assert(false);
#else
dst = __float22bfloat162_rn(src);
#endif
}
inline __device__ void from_float(bf16_4_t& dst, Float4_ src) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 800
assert(false);
#else
dst.x = __float22bfloat162_rn(src.x);
dst.y = __float22bfloat162_rn(src.y);
#endif
}
inline __device__ void from_float(bf16_8_t& dst, Float8_ src) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 800
assert(false);
#else
dst.x = __float22bfloat162_rn(src.x);
dst.y = __float22bfloat162_rn(src.y);
dst.z = __float22bfloat162_rn(src.z);
dst.w = __float22bfloat162_rn(src.w);
#endif
}
// From bfloat16 to float32.
inline __device__ float to_float(__nv_bfloat16 u) {
return __bfloat162float(u);
}
// Zero-out a variable.
inline __device__ void zero(__nv_bfloat16& dst) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 800
assert(false);
#else
// Same as CUDART_ZERO_BF16 introduced in CUDA 12.2.
dst = __ushort_as_bfloat16((unsigned short)0x0000U);
#endif
}
} // namespace vllm

View File

@ -1,444 +0,0 @@
/*
* Adapted from https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention/decoder_masked_multihead_attention_template.hpp
* and https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention_utils.h
* Copyright (c) 2023, The vLLM team.
* Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once
#include "attention_generic.cuh"
#include "dtype_float32.cuh"
#include <stdint.h>
namespace vllm {
// FP16 vector types for Q, K, V.
template<>
struct Vec<uint16_t, 1> {
using Type = uint16_t;
};
template<>
struct Vec<uint16_t, 2> {
using Type = uint32_t;
};
template<>
struct Vec<uint16_t, 4> {
using Type = uint2;
};
template<>
struct Vec<uint16_t, 8> {
using Type = uint4;
};
// FP32 accumulator vector types corresponding to Vec.
template<>
struct FloatVec<uint16_t> {
using Type = float;
};
template<>
struct FloatVec<uint32_t> {
using Type = float2;
};
template<>
struct FloatVec<uint2> {
using Type = Float4_;
};
template<>
struct FloatVec<uint4> {
using Type = Float8_;
};
// Utility functions for type conversions.
inline __device__ uint32_t h0_h0(uint16_t a) {
uint32_t b;
asm volatile("mov.b32 %0, {%1, %1};" : "=r"(b) : "h"(a));
return b;
}
inline __device__ float half_to_float(uint16_t h) {
float f;
asm volatile("cvt.f32.f16 %0, %1;\n" : "=f"(f) : "h"(h));
return f;
}
inline __device__ float2 half2_to_float2(uint32_t v) {
uint16_t lo, hi;
asm volatile("mov.b32 {%0, %1}, %2;\n" : "=h"(lo), "=h"(hi) : "r"(v));
return make_float2(half_to_float(lo), half_to_float(hi));
}
inline __device__ uint16_t float_to_half(float f) {
union {
uint32_t u32;
uint16_t u16[2];
} tmp;
asm volatile("cvt.rn.f16.f32 %0, %1;\n" : "=h"(tmp.u16[0]) : "f"(f));
return tmp.u16[0];
}
inline __device__ uint32_t float2_to_half2(float2 f) {
union {
uint32_t u32;
uint16_t u16[2];
} tmp;
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
asm volatile("cvt.rn.f16x2.f32 %0, %1, %2;\n" : "=r"(tmp.u32) : "f"(f.y), "f"(f.x));
#else
asm volatile("cvt.rn.f16.f32 %0, %1;\n" : "=h"(tmp.u16[0]) : "f"(f.x));
asm volatile("cvt.rn.f16.f32 %0, %1;\n" : "=h"(tmp.u16[1]) : "f"(f.y));
#endif
return tmp.u32;
}
// Vector addition.
inline __device__ uint16_t add(uint16_t a, uint16_t b) {
uint16_t c;
asm volatile("add.f16 %0, %1, %2;\n" : "=h"(c) : "h"(a), "h"(b));
return c;
}
inline __device__ uint32_t add(uint32_t a, uint32_t b) {
uint32_t c;
asm volatile("add.f16x2 %0, %1, %2;\n" : "=r"(c) : "r"(a), "r"(b));
return c;
}
inline __device__ uint2 add(uint2 a, uint2 b) {
uint2 c;
c.x = add(a.x, b.x);
c.y = add(a.y, b.y);
return c;
}
inline __device__ uint4 add(uint4 a, uint4 b) {
uint4 c;
c.x = add(a.x, b.x);
c.y = add(a.y, b.y);
c.z = add(a.z, b.z);
c.w = add(a.w, b.w);
return c;
}
inline __device__ float2 add(uint32_t a, float2 fb) {
float2 fa = half2_to_float2(a);
return add(fa, fb);
}
inline __device__ Float4_ add(uint2 a, Float4_ fb) {
Float4_ fc;
fc.x = add(a.x, fb.x);
fc.y = add(a.y, fb.y);
return fc;
}
inline __device__ Float8_ add(uint4 a, Float8_ fb) {
Float8_ fc;
fc.x = add(a.x, fb.x);
fc.y = add(a.y, fb.y);
fc.z = add(a.z, fb.z);
fc.w = add(a.w, fb.w);
return fc;
}
// Vector multiplication.
template<>
inline __device__ uint16_t mul(uint16_t a, uint16_t b) {
uint16_t c;
asm volatile("mul.f16 %0, %1, %2;\n" : "=h"(c) : "h"(a), "h"(b));
return c;
}
template<>
inline __device__ uint32_t mul(uint32_t a, uint32_t b) {
uint32_t c;
asm volatile("mul.f16x2 %0, %1, %2;\n" : "=r"(c) : "r"(a), "r"(b));
return c;
}
template<>
inline __device__ uint32_t mul(uint16_t a, uint32_t b) {
return mul<uint32_t, uint32_t, uint32_t>(h0_h0(a), b);
}
template<>
inline __device__ uint2 mul(uint2 a, uint2 b) {
uint2 c;
c.x = mul<uint32_t, uint32_t, uint32_t>(a.x, b.x);
c.y = mul<uint32_t, uint32_t, uint32_t>(a.y, b.y);
return c;
}
template<>
inline __device__ uint2 mul(uint16_t a, uint2 b) {
uint32_t s = h0_h0(a);
uint2 c;
c.x = mul<uint32_t, uint32_t, uint32_t>(s, b.x);
c.y = mul<uint32_t, uint32_t, uint32_t>(s, b.y);
return c;
}
template<>
inline __device__ uint4 mul(uint4 a, uint4 b) {
uint4 c;
c.x = mul<uint32_t, uint32_t, uint32_t>(a.x, b.x);
c.y = mul<uint32_t, uint32_t, uint32_t>(a.y, b.y);
c.z = mul<uint32_t, uint32_t, uint32_t>(a.z, b.z);
c.w = mul<uint32_t, uint32_t, uint32_t>(a.w, b.w);
return c;
}
template<>
inline __device__ uint4 mul(uint16_t a, uint4 b) {
uint32_t s = h0_h0(a);
uint4 c;
c.x = mul<uint32_t, uint32_t, uint32_t>(s, b.x);
c.y = mul<uint32_t, uint32_t, uint32_t>(s, b.y);
c.z = mul<uint32_t, uint32_t, uint32_t>(s, b.z);
c.w = mul<uint32_t, uint32_t, uint32_t>(s, b.w);
return c;
}
template<>
inline __device__ float mul(uint16_t a, uint16_t b) {
float fa = half_to_float(a);
float fb = half_to_float(b);
return fa * fb;
}
template<>
inline __device__ float2 mul(uint32_t a, uint32_t b) {
float2 fa = half2_to_float2(a);
float2 fb = half2_to_float2(b);
return mul<float2, float2, float2>(fa, fb);
}
template<>
inline __device__ float2 mul(uint16_t a, uint32_t b) {
return mul<float2, uint32_t, uint32_t>(h0_h0(a), b);
}
template<>
inline __device__ Float4_ mul(uint2 a, uint2 b) {
Float4_ fc;
fc.x = mul<float2, uint32_t, uint32_t>(a.x, b.x);
fc.y = mul<float2, uint32_t, uint32_t>(a.y, b.y);
return fc;
}
template<>
inline __device__ Float4_ mul(uint16_t a, uint2 b) {
uint32_t s = h0_h0(a);
Float4_ fc;
fc.x = mul<float2, uint32_t, uint32_t>(s, b.x);
fc.y = mul<float2, uint32_t, uint32_t>(s, b.y);
return fc;
}
template<>
inline __device__ Float8_ mul(uint4 a, uint4 b) {
Float8_ fc;
fc.x = mul<float2, uint32_t, uint32_t>(a.x, b.x);
fc.y = mul<float2, uint32_t, uint32_t>(a.y, b.y);
fc.z = mul<float2, uint32_t, uint32_t>(a.z, b.z);
fc.w = mul<float2, uint32_t, uint32_t>(a.w, b.w);
return fc;
}
template<>
inline __device__ Float8_ mul(uint16_t a, uint4 b) {
uint32_t s = h0_h0(a);
Float8_ fc;
fc.x = mul<float2, uint32_t, uint32_t>(s, b.x);
fc.y = mul<float2, uint32_t, uint32_t>(s, b.y);
fc.z = mul<float2, uint32_t, uint32_t>(s, b.z);
fc.w = mul<float2, uint32_t, uint32_t>(s, b.w);
return fc;
}
// Vector fused multiply-add.
inline __device__ uint32_t fma(uint32_t a, uint32_t b, uint32_t c) {
uint32_t d;
asm volatile("fma.rn.f16x2 %0, %1, %2, %3;\n" : "=r"(d) : "r"(a), "r"(b), "r"(c));
return d;
}
inline __device__ uint32_t fma(uint16_t a, uint32_t b, uint32_t c) {
return fma(h0_h0(a), b, c);
}
inline __device__ uint2 fma(uint2 a, uint2 b, uint2 c) {
uint2 d;
d.x = fma(a.x, b.x, c.x);
d.y = fma(a.y, b.y, c.y);
return d;
}
inline __device__ uint2 fma(uint16_t a, uint2 b, uint2 c) {
uint32_t s = h0_h0(a);
uint2 d;
d.x = fma(s, b.x, c.x);
d.y = fma(s, b.y, c.y);
return d;
}
inline __device__ uint4 fma(uint4 a, uint4 b, uint4 c) {
uint4 d;
d.x = fma(a.x, b.x, c.x);
d.y = fma(a.y, b.y, c.y);
d.z = fma(a.z, b.z, c.z);
d.w = fma(a.w, b.w, c.w);
return d;
}
inline __device__ uint4 fma(uint16_t a, uint4 b, uint4 c) {
uint32_t s = h0_h0(a);
uint4 d;
d.x = fma(s, b.x, c.x);
d.y = fma(s, b.y, c.y);
d.z = fma(s, b.z, c.z);
d.w = fma(s, b.w, c.w);
return d;
}
inline __device__ float fma(uint16_t a, uint16_t b, float fc) {
float fa = half_to_float(a);
float fb = half_to_float(b);
return fa * fb + fc;
}
inline __device__ float2 fma(uint32_t a, uint32_t b, float2 fc) {
float2 fa = half2_to_float2(a);
float2 fb = half2_to_float2(b);
return fma(fa, fb, fc);
}
inline __device__ float2 fma(uint16_t a, uint32_t b, float2 fc) {
return fma(h0_h0(a), b, fc);
}
inline __device__ Float4_ fma(uint2 a, uint2 b, Float4_ fc) {
Float4_ fd;
fd.x = fma(a.x, b.x, fc.x);
fd.y = fma(a.y, b.y, fc.y);
return fd;
}
inline __device__ Float4_ fma(uint16_t a, uint2 b, Float4_ fc) {
uint32_t s = h0_h0(a);
Float4_ fd;
fd.x = fma(s, b.x, fc.x);
fd.y = fma(s, b.y, fc.y);
return fd;
}
inline __device__ Float8_ fma(uint4 a, uint4 b, Float8_ fc) {
Float8_ fd;
fd.x = fma(a.x, b.x, fc.x);
fd.y = fma(a.y, b.y, fc.y);
fd.z = fma(a.z, b.z, fc.z);
fd.w = fma(a.w, b.w, fc.w);
return fd;
}
inline __device__ Float8_ fma(uint16_t a, uint4 b, Float8_ fc) {
uint32_t s = h0_h0(a);
Float8_ fd;
fd.x = fma(s, b.x, fc.x);
fd.y = fma(s, b.y, fc.y);
fd.z = fma(s, b.z, fc.z);
fd.w = fma(s, b.w, fc.w);
return fd;
}
// Vector sum.
template<>
inline __device__ float sum(uint16_t v) {
return half_to_float(v);
}
template<>
inline __device__ float sum(uint32_t v) {
float2 tmp = half2_to_float2(v);
return tmp.x + tmp.y;
}
template<>
inline __device__ float sum(uint2 v) {
uint32_t c = add(v.x, v.y);
return sum(c);
}
template<>
inline __device__ float sum(uint4 v) {
uint32_t c = add(v.x, v.y);
c = add(c, v.z);
c = add(c, v.w);
return sum(c);
}
// From float32 to float16.
inline __device__ void from_float(uint16_t& dst, float src) {
dst = float_to_half(src);
}
inline __device__ void from_float(uint32_t& dst, float2 src) {
dst = float2_to_half2(src);
}
inline __device__ void from_float(uint2& dst, Float4_ src) {
dst.x = float2_to_half2(src.x);
dst.y = float2_to_half2(src.y);
}
inline __device__ void from_float(uint4& dst, Float8_ src) {
dst.x = float2_to_half2(src.x);
dst.y = float2_to_half2(src.y);
dst.z = float2_to_half2(src.z);
dst.w = float2_to_half2(src.w);
}
// From float16 to float32.
inline __device__ float to_float(uint16_t u) {
return half_to_float(u);
}
inline __device__ float2 to_float(uint32_t u) {
return half2_to_float2(u);
}
inline __device__ Float4_ to_float(uint2 u) {
Float4_ tmp;
tmp.x = half2_to_float2(u.x);
tmp.y = half2_to_float2(u.y);
return tmp;
}
inline __device__ Float8_ to_float(uint4 u) {
Float8_ tmp;
tmp.x = half2_to_float2(u.x);
tmp.y = half2_to_float2(u.y);
tmp.z = half2_to_float2(u.z);
tmp.w = half2_to_float2(u.w);
return tmp;
}
// Zero-out a variable.
inline __device__ void zero(uint16_t& dst) {
dst = uint16_t(0);
}
} // namespace vllm
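An illustrative aside (not part of the deleted file): h0_h0() is what makes the scalar-times-vector overloads above cheap, since one packed f16x2 instruction then handles two lanes. A minimal sketch, assuming it sits inside namespace vllm; the name axpy_half2_sketch is made up.

namespace vllm {
// Illustration only: y += a * x on two packed fp16 lanes at once. h0_h0()
// replicates the fp16 scalar into both halves of a 32-bit register so the
// packed fma() overload above can be reused.
inline __device__ uint32_t axpy_half2_sketch(uint16_t a, uint32_t x, uint32_t y) {
  return fma(h0_h0(a), x, y);
}
} // namespace vllm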


@ -1,273 +0,0 @@
/*
* Adapted from https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention/decoder_masked_multihead_attention_template.hpp
* and https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention_utils.h
* Copyright (c) 2023, The vLLM team.
* Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once
#include "attention_generic.cuh"
#include <stdint.h>
namespace vllm {
// Define custom FP32 vector data types.
struct Float4_ {
float2 x;
float2 y;
};
struct Float8_ {
float2 x;
float2 y;
float2 z;
float2 w;
};
// FP32 vector types for Q, K, V.
template<>
struct Vec<float, 1> {
using Type = float;
};
template<>
struct Vec<float, 2> {
using Type = float2;
};
template<>
struct Vec<float, 4> {
using Type = float4;
};
// FP32 accumulator vector types corresponding to Vec.
template<>
struct FloatVec<float> {
using Type = float;
};
template<>
struct FloatVec<float2> {
using Type = float2;
};
template<>
struct FloatVec<float4> {
using Type = float4;
};
// Vector addition.
inline __device__ float add(float a, float b) {
return a + b;
}
inline __device__ float2 add(float2 a, float2 b) {
float2 c;
c.x = add(a.x, b.x);
c.y = add(a.y, b.y);
return c;
}
inline __device__ float4 add(float4 a, float4 b) {
float4 c;
c.x = add(a.x, b.x);
c.y = add(a.y, b.y);
c.z = add(a.z, b.z);
c.w = add(a.w, b.w);
return c;
}
// Vector multiplication.
template<>
inline __device__ float mul<float, float>(float a, float b) {
return a * b;
}
template<>
inline __device__ float2 mul(float2 a, float2 b) {
float2 c;
c.x = a.x * b.x;
c.y = a.y * b.y;
return c;
}
template<>
inline __device__ float2 mul(float a, float2 b) {
float2 c;
c.x = a * b.x;
c.y = a * b.y;
return c;
}
template<>
inline __device__ float4 mul(float4 a, float4 b) {
float4 c;
c.x = a.x * b.x;
c.y = a.y * b.y;
c.z = a.z * b.z;
c.w = a.w * b.w;
return c;
}
template<>
inline __device__ float4 mul(float a, float4 b) {
float4 c;
c.x = a * b.x;
c.y = a * b.y;
c.z = a * b.z;
c.w = a * b.w;
return c;
}
// Vector fused multiply-add.
inline __device__ float fma(float a, float b, float c) {
return a * b + c;
}
inline __device__ float2 fma(float2 a, float2 b, float2 c) {
float2 d;
d.x = fma(a.x, b.x, c.x);
d.y = fma(a.y, b.y, c.y);
return d;
}
inline __device__ float2 fma(float a, float2 b, float2 c) {
float2 d;
d.x = fma(a, b.x, c.x);
d.y = fma(a, b.y, c.y);
return d;
}
inline __device__ float4 fma(float4 a, float4 b, float4 c) {
float4 d;
d.x = fma(a.x, b.x, c.x);
d.y = fma(a.y, b.y, c.y);
d.z = fma(a.z, b.z, c.z);
d.w = fma(a.w, b.w, c.w);
return d;
}
inline __device__ float4 fma(float a, float4 b, float4 c) {
float4 d;
d.x = fma(a, b.x, c.x);
d.y = fma(a, b.y, c.y);
d.z = fma(a, b.z, c.z);
d.w = fma(a, b.w, c.w);
return d;
}
inline __device__ Float4_ fma(float a, Float4_ b, Float4_ c) {
Float4_ d;
d.x = fma(a, b.x, c.x);
d.y = fma(a, b.y, c.y);
return d;
}
inline __device__ Float8_ fma(float a, Float8_ b, Float8_ c) {
Float8_ d;
d.x = fma(a, b.x, c.x);
d.y = fma(a, b.y, c.y);
d.z = fma(a, b.z, c.z);
d.w = fma(a, b.w, c.w);
return d;
}
// Vector sum.
template<>
inline __device__ float sum(float v) {
return v;
}
template<>
inline __device__ float sum(float2 v) {
return v.x + v.y;
}
template<>
inline __device__ float sum(float4 v) {
return v.x + v.y + v.z + v.w;
}
template<>
inline __device__ float sum(Float4_ v) {
return v.x.x + v.x.y + v.y.x + v.y.y;
}
template<>
inline __device__ float sum(Float8_ v) {
return v.x.x + v.x.y + v.y.x + v.y.y + v.z.x + v.z.y + v.w.x + v.w.y;
}
// Vector dot product.
inline __device__ float dot(float a, float b) {
return a * b;
}
inline __device__ float dot(float2 a, float2 b) {
float2 c = mul<float2, float2, float2>(a, b);
return c.x + c.y;
}
inline __device__ float dot(Float4_ a, Float4_ b) {
float2 acc = mul<float2, float2, float2>(a.x, b.x);
acc = fma(a.y, b.y, acc);
return acc.x + acc.y;
}
inline __device__ float dot(Float8_ a, Float8_ b) {
float2 acc = mul<float2, float2, float2>(a.x, b.x);
acc = fma(a.y, b.y, acc);
acc = fma(a.z, b.z, acc);
acc = fma(a.w, b.w, acc);
return acc.x + acc.y;
}
// From float to float.
inline __device__ void from_float(float& dst, float src) {
dst = src;
}
inline __device__ void from_float(float2& dst, float2 src) {
dst = src;
}
inline __device__ void from_float(float4& dst, float4 src) {
dst = src;
}
// From float to float.
inline __device__ float to_float(float u) {
return u;
}
inline __device__ float2 to_float(float2 u) {
return u;
}
inline __device__ float4 to_float(float4 u) {
return u;
}
inline __device__ Float4_ to_float(Float4_ u) {
return u;
}
inline __device__ Float8_ to_float(Float8_ u) {
return u;
}
// Zero-out a variable.
inline __device__ void zero(float& dst) {
dst = 0.f;
}
} // namespace vllm
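An illustrative aside (not part of the deleted file): for float the Vec and FloatVec traits coincide, so the generic overloads above can be used directly. A minimal sketch, assuming it sits inside namespace vllm; dot4_float_sketch is a made-up name.

namespace vllm {
// Illustration only: Vec<float, 4>::Type and FloatVec<float4>::Type are both
// float4, so an element-wise mul() followed by sum() yields a 4-wide dot product.
inline __device__ float dot4_float_sketch(float4 a, float4 b) {
  float4 prod = mul<float4, float4, float4>(a, b);
  return sum(prod);
}
} // namespace vllm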


@ -1,47 +0,0 @@
#include <torch/extension.h>
#include <map>
#include <vector>
void swap_blocks(
torch::Tensor& src,
torch::Tensor& dst,
const std::map<int64_t, int64_t>& block_mapping);
void copy_blocks(
std::vector<torch::Tensor>& key_caches,
std::vector<torch::Tensor>& value_caches,
const std::map<int64_t, std::vector<int64_t>>& block_mapping);
void reshape_and_cache(
torch::Tensor& key,
torch::Tensor& value,
torch::Tensor& key_cache,
torch::Tensor& value_cache,
torch::Tensor& slot_mapping);
void gather_cached_kv(
torch::Tensor& key,
torch::Tensor& value,
torch::Tensor& key_cache,
torch::Tensor& value_cache,
torch::Tensor& slot_mapping);
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
m.def(
"swap_blocks",
&swap_blocks,
"Swap in (out) the cache blocks from src to dst");
m.def(
"copy_blocks",
&copy_blocks,
"Copy the cache blocks from src to dst");
m.def(
"reshape_and_cache",
&reshape_and_cache,
"Reshape the key and value tensors and cache them");
m.def(
"gather_cached_kv",
&gather_cached_kv,
"Gather key and value from the cache into contiguous QKV tensors");
}


@ -1,387 +0,0 @@
#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>
#include "dispatch_utils.h"
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>
void swap_blocks(
torch::Tensor& src,
torch::Tensor& dst,
const std::map<int64_t, int64_t>& block_mapping) {
torch::Device src_device = src.device();
torch::Device dst_device = dst.device();
cudaMemcpyKind memcpy_type;
if (src_device.is_cuda() && dst_device.is_cuda()) {
TORCH_CHECK(
src_device.index() == dst_device.index(),
"src and dst must be on the same GPU");
memcpy_type = cudaMemcpyDeviceToDevice;
} else if (src_device.is_cuda() && dst_device.is_cpu()) {
memcpy_type = cudaMemcpyDeviceToHost;
} else if (src_device.is_cpu() && dst_device.is_cuda()) {
memcpy_type = cudaMemcpyHostToDevice;
} else {
TORCH_CHECK(false, "Invalid device combination");
}
char *src_ptr = static_cast<char*>(src.data_ptr());
char *dst_ptr = static_cast<char*>(dst.data_ptr());
const int64_t block_size_in_bytes = src.element_size() * src[0].numel();
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
// NOTE(woosuk): This can be slow if the number of blocks is large.
for (const auto& pair : block_mapping) {
int64_t src_block_number = pair.first;
int64_t dst_block_number = pair.second;
int64_t src_offset = src_block_number * block_size_in_bytes;
int64_t dst_offset = dst_block_number * block_size_in_bytes;
cudaMemcpyAsync(
dst_ptr + dst_offset,
src_ptr + src_offset,
block_size_in_bytes,
memcpy_type,
stream);
}
}
namespace vllm {
// Grid: (num_layers, num_pairs)
template<typename scalar_t>
__global__ void copy_blocks_kernel(
int64_t* key_cache_ptrs,
int64_t* value_cache_ptrs,
const int64_t* __restrict__ block_mapping,
const int numel_per_block) {
const int layer_idx = blockIdx.x;
const int pair_idx = blockIdx.y;
scalar_t* key_cache = reinterpret_cast<scalar_t*>(key_cache_ptrs[layer_idx]);
scalar_t* value_cache = reinterpret_cast<scalar_t*>(value_cache_ptrs[layer_idx]);
int64_t src_block_number = block_mapping[2 * pair_idx];
int64_t dst_block_number = block_mapping[2 * pair_idx + 1];
const int64_t src_block_offset = src_block_number * numel_per_block;
const int64_t dst_block_offset = dst_block_number * numel_per_block;
for (int i = threadIdx.x; i < numel_per_block; i += blockDim.x) {
int64_t src_offset = src_block_offset + i;
int64_t dst_offset = dst_block_offset + i;
key_cache[dst_offset] = key_cache[src_offset];
}
for (int i = threadIdx.x; i < numel_per_block; i += blockDim.x) {
int64_t src_offset = src_block_offset + i;
int64_t dst_offset = dst_block_offset + i;
value_cache[dst_offset] = value_cache[src_offset];
}
}
} // namespace vllm
void copy_blocks(
std::vector<torch::Tensor>& key_caches,
std::vector<torch::Tensor>& value_caches,
const std::map<int64_t, std::vector<int64_t>>& block_mapping) {
int num_layers = key_caches.size();
TORCH_CHECK(num_layers == value_caches.size());
if (num_layers == 0) {
return;
}
torch::Device cache_device = key_caches[0].device();
TORCH_CHECK(cache_device.is_cuda());
// Create data structures for the kernel.
// Create an array of pointers to the key and value caches.
int64_t key_cache_ptrs[num_layers];
int64_t value_cache_ptrs[num_layers];
for (int layer_idx = 0; layer_idx < num_layers; ++layer_idx) {
key_cache_ptrs[layer_idx] = reinterpret_cast<int64_t>(key_caches[layer_idx].data_ptr());
value_cache_ptrs[layer_idx] = reinterpret_cast<int64_t>(value_caches[layer_idx].data_ptr());
}
// Create block mapping array.
std::vector<int64_t> block_mapping_vec;
for (const auto& pair : block_mapping) {
int64_t src_block_number = pair.first;
for (int64_t dst_block_number : pair.second) {
block_mapping_vec.push_back(src_block_number);
block_mapping_vec.push_back(dst_block_number);
}
}
int64_t* block_mapping_array = block_mapping_vec.data();
int num_pairs = block_mapping_vec.size() / 2;
// Move the data structures to the GPU.
// NOTE: This synchronizes the CPU and GPU.
torch::Tensor key_cache_ptrs_tensor = torch::from_blob(
key_cache_ptrs, {num_layers}, torch::kInt64).to(cache_device);
torch::Tensor value_cache_ptrs_tensor = torch::from_blob(
value_cache_ptrs, {num_layers}, torch::kInt64).to(cache_device);
torch::Tensor block_mapping_tensor = torch::from_blob(
block_mapping_array, {2 * num_pairs}, torch::kInt64).to(cache_device);
// Launch the kernel.
const int numel_per_block = key_caches[0][0].numel();
dim3 grid(num_layers, num_pairs);
dim3 block(std::min(1024, numel_per_block));
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
VLLM_DISPATCH_FLOATING_TYPES(
key_caches[0].scalar_type(), "copy_blocks_kernel", ([&] {
vllm::copy_blocks_kernel<scalar_t><<<grid, block, 0, stream>>>(
key_cache_ptrs_tensor.data_ptr<int64_t>(),
value_cache_ptrs_tensor.data_ptr<int64_t>(),
block_mapping_tensor.data_ptr<int64_t>(),
numel_per_block);
}));
}
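// Example of the flattening above (explanatory aside, not original code):
//   block_mapping     = { 3: [7, 9], 5: [2] }
//   block_mapping_vec = [3, 7,  3, 9,  5, 2]   -> num_pairs = 3
// Each interleaved (src, dst) pair is handled by one blockIdx.y of
// copy_blocks_kernel above.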
namespace vllm {
template<typename scalar_t>
__global__ void reshape_and_cache_kernel(
const scalar_t* __restrict__ key, // [num_tokens, num_heads, head_size]
const scalar_t* __restrict__ value, // [num_tokens, num_heads, head_size]
scalar_t* __restrict__ key_cache, // [num_blocks, num_heads, head_size/x, block_size, x]
scalar_t* __restrict__ value_cache, // [num_blocks, num_heads, head_size, block_size]
const int64_t* __restrict__ slot_mapping, // [num_tokens]
const int key_stride,
const int value_stride,
const int num_heads,
const int head_size,
const int block_size,
const int x) {
const int64_t token_idx = blockIdx.x;
const int64_t slot_idx = slot_mapping[token_idx];
if (slot_idx < 0) {
// Padding token that should be ignored.
return;
}
const int64_t block_idx = slot_idx / block_size;
const int64_t block_offset = slot_idx % block_size;
const int n = num_heads * head_size;
for (int i = threadIdx.x; i < n; i += blockDim.x) {
const int64_t src_key_idx = token_idx * key_stride + i;
const int64_t src_value_idx = token_idx * value_stride + i;
const int head_idx = i / head_size;
const int head_offset = i % head_size;
const int x_idx = head_offset / x;
const int x_offset = head_offset % x;
const int64_t tgt_key_idx = block_idx * num_heads * (head_size / x) * block_size * x
+ head_idx * (head_size / x) * block_size * x
+ x_idx * block_size * x
+ block_offset * x
+ x_offset;
const int64_t tgt_value_idx = block_idx * num_heads * head_size * block_size
+ head_idx * head_size * block_size
+ head_offset * block_size
+ block_offset;
key_cache[tgt_key_idx] = key[src_key_idx];
value_cache[tgt_value_idx] = value[src_value_idx];
}
}
} // namespace vllm
void reshape_and_cache(
torch::Tensor& key, // [num_tokens, num_heads, head_size]
torch::Tensor& value, // [num_tokens, num_heads, head_size]
torch::Tensor& key_cache, // [num_blocks, num_heads, head_size/x, block_size, x]
torch::Tensor& value_cache, // [num_blocks, num_heads, head_size, block_size]
torch::Tensor& slot_mapping) // [num_tokens]
{
int num_tokens = key.size(0);
int num_heads = key.size(1);
int head_size = key.size(2);
int block_size = key_cache.size(3);
int x = key_cache.size(4);
int key_stride = key.stride(0);
int value_stride = value.stride(0);
dim3 grid(num_tokens);
dim3 block(std::min(num_heads * head_size, 512));
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
VLLM_DISPATCH_FLOATING_TYPES(
key.scalar_type(),
"reshape_and_cache_kernel",
[&] {
vllm::reshape_and_cache_kernel<scalar_t><<<grid, block, 0, stream>>>(
key.data_ptr<scalar_t>(),
value.data_ptr<scalar_t>(),
key_cache.data_ptr<scalar_t>(),
value_cache.data_ptr<scalar_t>(),
slot_mapping.data_ptr<int64_t>(),
key_stride,
value_stride,
num_heads,
head_size,
block_size,
x);
});
}
namespace vllm {
// Grid: (num_blocks, block_size).
template<typename scalar_t>
__global__ void gather_cached_kv_kernel(
scalar_t* __restrict__ key, // [num_tokens, [stride], num_heads, head_size]
scalar_t* __restrict__ value, // [num_tokens, [stride], num_heads, head_size]
const scalar_t* __restrict__ key_cache, // [num_blocks, num_heads, head_size/x, block_size, x]
const scalar_t* __restrict__ value_cache, // [num_blocks, num_heads, head_size, block_size]
const int* __restrict__ slot_mapping, // [num_tokens]
const int key_stride,
const int value_stride,
const int num_heads,
const int head_size,
const int block_size,
const int x) {
const int token_idx = blockIdx.x;
const int slot_idx = slot_mapping[token_idx];
const int block_idx = slot_idx / block_size;
const int block_offset = slot_idx % block_size;
const int n = num_heads * head_size;
for (int i = threadIdx.x; i < n; i += blockDim.x) {
const int tgt_key_idx = token_idx * key_stride + i;
const int tgt_value_idx = token_idx * value_stride + i;
const int head_idx = i / head_size;
const int head_offset = i % head_size;
const int x_idx = head_offset / x; // the offset of the [head_size/x] dimension
const int x_offset = head_offset % x;
const int src_key_idx = block_idx * num_heads * (head_size / x) * block_size * x
+ head_idx * (head_size / x) * block_size * x
+ x_idx * block_size * x
+ block_offset * x
+ x_offset;
const int src_value_idx = block_idx * num_heads * head_size * block_size
+ head_idx * head_size * block_size
+ head_offset * block_size
+ block_offset;
key[tgt_key_idx] = __ldg(&key_cache[src_key_idx]);
value[tgt_value_idx] = __ldg(&value_cache[src_value_idx]);
}
}
template <typename scalar_t>
__global__ void gather_cached_kv_kernel_optimized(
scalar_t *__restrict__ key, // [num_tokens, [stride], num_heads, head_size]
scalar_t *__restrict__ value, // [num_tokens, [stride], num_heads, head_size]
const scalar_t *__restrict__ key_cache, // [num_blocks, num_heads, head_size/x, block_size, x]
const scalar_t *__restrict__ value_cache, // [num_blocks, num_heads, head_size, block_size]
const int *__restrict__ slot_mapping, // [num_tokens]
const int key_stride,
const int value_stride,
const int num_heads,
const int head_size,
const int block_size,
const int x)
{
const int token_idx = blockIdx.x;
const int slot_idx = slot_mapping[token_idx];
const int block_idx = slot_idx / block_size;
const int block_offset = slot_idx % block_size;
const int dim = num_heads * head_size;
assert(dim % 4 == 0); // this is true for known use cases
const int unroll_factor = 4;
const int unrolled_dim = dim / unroll_factor;
for (int i = threadIdx.x; i < unrolled_dim; i += blockDim.x)
{
int tgt_key_indices[unroll_factor];
int tgt_value_indices[unroll_factor];
int src_key_indices[unroll_factor];
int src_value_indices[unroll_factor];
scalar_t keys_to_store[unroll_factor];
scalar_t values_to_store[unroll_factor];
#pragma unroll
for (int j = 0; j < unroll_factor; ++j)
{
int index = i + j * unrolled_dim;
const int tgt_key_idx = token_idx * key_stride + index;
const int tgt_value_idx = token_idx * value_stride + index;
const int head_idx = index / head_size;
const int head_offset = index % head_size;
const int x_idx = head_offset / x;
const int x_offset = head_offset % x;
const int src_key_idx = block_idx * num_heads * (head_size / x) * block_size * x
+ head_idx * (head_size / x) * block_size * x
+ x_idx * block_size * x
+ block_offset * x
+ x_offset;
const int src_value_idx = block_idx * num_heads * head_size * block_size
+ head_idx * head_size * block_size
+ head_offset * block_size
+ block_offset;
tgt_key_indices[j] = tgt_key_idx;
tgt_value_indices[j] = tgt_value_idx;
src_key_indices[j] = src_key_idx;
src_value_indices[j] = src_value_idx;
keys_to_store[j] = __ldg(&key_cache[src_key_idx]);
values_to_store[j] = __ldg(&value_cache[src_value_idx]);
}
#pragma unroll
for (int j = 0; j < unroll_factor; ++j)
{
key[tgt_key_indices[j]] = keys_to_store[j];
value[tgt_value_indices[j]] = values_to_store[j];
}
}
}
} // namespace vllm
void gather_cached_kv(
torch::Tensor& key, // [out] [num_tokens, num_heads, head_size]
torch::Tensor& value, // [out] [num_tokens, num_heads, head_size]
torch::Tensor& key_cache, // [in] [num_blocks, num_heads, head_size/x, block_size, x]
torch::Tensor& value_cache, // [in] [num_blocks, num_heads, head_size, block_size]
torch::Tensor& slot_mapping) // [in] [num_tokens]
{
int num_tokens = key.size(0);
int num_heads = key.size(1);
int head_size = key.size(2);
int block_size = key_cache.size(3);
int x = key_cache.size(4);
int key_stride = key.stride(0);
int value_stride = value.stride(0);
dim3 grid(num_tokens);
dim3 block(std::min(num_heads * head_size, 512));
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
VLLM_DISPATCH_FLOATING_TYPES(
key.scalar_type(),
"gather_cached_kv_kernel_optimized",
[&] {
vllm::gather_cached_kv_kernel_optimized<scalar_t><<<grid, block, 0, stream>>>(
key.data_ptr<scalar_t>(),
value.data_ptr<scalar_t>(),
key_cache.data_ptr<scalar_t>(),
value_cache.data_ptr<scalar_t>(),
slot_mapping.data_ptr<int>(),
key_stride,
value_stride,
num_heads,
head_size,
block_size,
x);
});
}
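As a reading aid (not part of the deleted file), the key-cache addressing shared by reshape_and_cache_kernel and gather_cached_kv_kernel above can be written as a plain index function; the name key_cache_index_sketch is made up.

// Illustration only: flattened offset into a key cache laid out as
// [num_blocks, num_heads, head_size/x, block_size, x].
inline int64_t key_cache_index_sketch(int64_t block_idx, int head_idx, int head_offset,
                                      int64_t block_offset, int num_heads, int head_size,
                                      int block_size, int x) {
  const int x_idx = head_offset / x;     // position along the head_size/x dimension
  const int x_offset = head_offset % x;  // position along the trailing x dimension
  return block_idx * num_heads * (head_size / x) * block_size * x
       + head_idx * (head_size / x) * block_size * x
       + x_idx * block_size * x
       + block_offset * x
       + x_offset;
}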


@ -1,13 +0,0 @@
#include <torch/extension.h>
int get_device_attribute(
int attribute,
int device_id);
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
m.def(
"get_device_attribute",
&get_device_attribute,
"Gets the specified device attribute.");
}


@ -1,14 +0,0 @@
int get_device_attribute(
int attribute,
int device_id)
{
int device, value;
if (device_id < 0) {
cudaGetDevice(&device);
}
else {
device = device_id;
}
cudaDeviceGetAttribute(&value, static_cast<cudaDeviceAttr>(attribute), device);
return value;
}
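A usage sketch (not part of the deleted file), relying only on the standard cudaDeviceAttr enum from the CUDA runtime:

#include <cuda_runtime.h>
// Illustration only: query the maximum threads per block through the helper
// above. A negative device_id falls back to the current device.
int max_threads_per_block_example() {
  return get_device_attribute(cudaDevAttrMaxThreadsPerBlock, /*device_id=*/-1);
}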


@ -1,14 +0,0 @@
/*
* Adapted from
* https://github.com/pytorch/pytorch/blob/v2.0.1/aten/src/ATen/Dispatch.h
*/
#include <torch/extension.h>
#define VLLM_DISPATCH_CASE_FLOATING_TYPES(...) \
AT_DISPATCH_CASE(at::ScalarType::Float, __VA_ARGS__) \
AT_DISPATCH_CASE(at::ScalarType::Half, __VA_ARGS__) \
AT_DISPATCH_CASE(at::ScalarType::BFloat16, __VA_ARGS__)
#define VLLM_DISPATCH_FLOATING_TYPES(TYPE, NAME, ...) \
AT_DISPATCH_SWITCH( \
TYPE, NAME, VLLM_DISPATCH_CASE_FLOATING_TYPES(__VA_ARGS__))


@ -1,24 +0,0 @@
#include <torch/extension.h>
void rms_norm(
torch::Tensor& out,
torch::Tensor& input,
torch::Tensor& weight,
float epsilon);
void fused_add_rms_norm(
torch::Tensor& input,
torch::Tensor& residual,
torch::Tensor& weight,
float epsilon);
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
m.def(
"rms_norm",
&rms_norm,
"Apply Root Mean Square (RMS) Normalization to the input tensor.");
m.def(
"fused_add_rms_norm",
&fused_add_rms_norm,
"In-place fused Add and RMS Normalization");
}


@ -1,117 +0,0 @@
#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>
#include "dispatch_utils.h"
#include "reduction_utils.cuh"
namespace vllm {
// TODO(woosuk): Further optimize this kernel.
template<typename scalar_t>
__global__ void rms_norm_kernel(
scalar_t* __restrict__ out, // [..., hidden_size]
const scalar_t* __restrict__ input, // [..., hidden_size]
const scalar_t* __restrict__ weight, // [hidden_size]
const float epsilon,
const int num_tokens,
const int hidden_size) {
__shared__ float s_variance;
float variance = 0.0f;
for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) {
const float x = (float) input[blockIdx.x * hidden_size + idx];
variance += x * x;
}
variance = blockReduceSum<float>(variance);
if (threadIdx.x == 0) {
s_variance = rsqrtf(variance / hidden_size + epsilon);
}
__syncthreads();
for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) {
float x = (float) input[blockIdx.x * hidden_size + idx];
out[blockIdx.x * hidden_size + idx] = ((scalar_t) (x * s_variance)) * weight[idx];
}
}
// TODO: Further optimize this kernel.
template<typename scalar_t>
__global__ void fused_add_rms_norm_kernel(
scalar_t* __restrict__ input, // [..., hidden_size]
scalar_t* __restrict__ residual, // [..., hidden_size]
const scalar_t* __restrict__ weight, // [hidden_size]
const float epsilon,
const int num_tokens,
const int hidden_size) {
__shared__ float s_variance;
float variance = 0.0f;
for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) {
float x = (float) input[blockIdx.x * hidden_size + idx];
x += (float) residual[blockIdx.x * hidden_size + idx];
variance += x * x;
residual[blockIdx.x * hidden_size + idx] = (scalar_t) x;
}
variance = blockReduceSum<float>(variance);
if (threadIdx.x == 0) {
s_variance = rsqrtf(variance / hidden_size + epsilon);
}
__syncthreads();
for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) {
float x = (float) residual[blockIdx.x * hidden_size + idx];
input[blockIdx.x * hidden_size + idx] = ((scalar_t) (x * s_variance)) * weight[idx];
}
}
} // namespace vllm
void rms_norm(
torch::Tensor& out, // [..., hidden_size]
torch::Tensor& input, // [..., hidden_size]
torch::Tensor& weight, // [hidden_size]
float epsilon) {
int hidden_size = input.size(-1);
int num_tokens = input.numel() / hidden_size;
dim3 grid(num_tokens);
dim3 block(std::min(hidden_size, 1024));
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
VLLM_DISPATCH_FLOATING_TYPES(
input.scalar_type(),
"rms_norm_kernel",
[&] {
vllm::rms_norm_kernel<scalar_t><<<grid, block, 0, stream>>>(
out.data_ptr<scalar_t>(),
input.data_ptr<scalar_t>(),
weight.data_ptr<scalar_t>(),
epsilon,
num_tokens,
hidden_size);
});
}
void fused_add_rms_norm(
torch::Tensor& input, // [..., hidden_size]
torch::Tensor& residual, // [..., hidden_size]
torch::Tensor& weight, // [hidden_size]
float epsilon) {
int hidden_size = input.size(-1);
int num_tokens = input.numel() / hidden_size;
dim3 grid(num_tokens);
dim3 block(std::min(hidden_size, 1024));
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
VLLM_DISPATCH_FLOATING_TYPES(
input.scalar_type(),
"fused_add_rms_norm_kernel",
[&] {
vllm::fused_add_rms_norm_kernel<scalar_t><<<grid, block, 0, stream>>>(
input.data_ptr<scalar_t>(),
residual.data_ptr<scalar_t>(),
weight.data_ptr<scalar_t>(),
epsilon,
num_tokens,
hidden_size);
});
}
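For reference (an explanatory sketch, not original code), rms_norm_kernel computes, per token, out[i] = weight[i] * x[i] / sqrt(mean(x^2) + epsilon); a scalar CPU equivalent for one token would be:

#include <cmath>
// Illustration only: scalar reference for what rms_norm_kernel does per token.
void rms_norm_reference_sketch(float* out, const float* input, const float* weight,
                               float epsilon, int hidden_size) {
  float sum_sq = 0.f;
  for (int i = 0; i < hidden_size; ++i) sum_sq += input[i] * input[i];
  const float inv_rms = 1.f / std::sqrt(sum_sq / hidden_size + epsilon);
  for (int i = 0; i < hidden_size; ++i) out[i] = input[i] * inv_rms * weight[i];
}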


@ -1,16 +0,0 @@
#include <torch/extension.h>
void rotary_embedding(
torch::Tensor& positions,
torch::Tensor& query,
torch::Tensor& key,
int head_size,
torch::Tensor& cos_sin_cache,
bool is_neox);
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
m.def(
"rotary_embedding",
&rotary_embedding,
"Apply GPT-NeoX or GPT-J style rotary embedding to query and key");
}


@ -1,127 +0,0 @@
#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>
#include "dispatch_utils.h"
namespace vllm {
template<typename scalar_t, bool IS_NEOX>
inline __device__ void apply_rotary_embedding(
scalar_t* __restrict__ arr,
const scalar_t* __restrict__ cos_ptr,
const scalar_t* __restrict__ sin_ptr,
int rot_offset,
int embed_dim)
{
int x_index, y_index;
scalar_t cos, sin;
if (IS_NEOX) {
// GPT-NeoX style rotary embedding.
x_index = rot_offset;
y_index = embed_dim + rot_offset;
cos = __ldg(cos_ptr + x_index);
sin = __ldg(sin_ptr + x_index);
} else {
// GPT-J style rotary embedding.
x_index = 2 * rot_offset;
y_index = 2 * rot_offset + 1;
cos = __ldg(cos_ptr + x_index / 2);
sin = __ldg(sin_ptr + x_index / 2);
}
const scalar_t x = arr[x_index];
const scalar_t y = arr[y_index];
arr[x_index] = x * cos - y * sin;
arr[y_index] = y * cos + x * sin;
}
template<typename scalar_t, bool IS_NEOX>
__global__ void rotary_embedding_kernel(
const int64_t* __restrict__ positions, // [batch_size, seq_len] or [num_tokens]
scalar_t* __restrict__ query, // [batch_size, seq_len, num_heads, head_size] or [num_tokens, num_heads, head_size]
scalar_t* __restrict__ key, // [batch_size, seq_len, num_kv_heads, head_size] or [num_tokens, num_kv_heads, head_size]
const scalar_t* __restrict__ cos_sin_cache, // [max_position, 2, rot_dim // 2]
const int rot_dim,
const int query_stride,
const int key_stride,
const int num_heads,
const int num_kv_heads,
const int head_size) {
// Each thread block is responsible for one token.
const int token_idx = blockIdx.x;
int64_t pos = positions[token_idx];
const scalar_t* cache_ptr = cos_sin_cache + pos * rot_dim;
const int embed_dim = rot_dim / 2;
const scalar_t* cos_ptr = cache_ptr;
const scalar_t* sin_ptr = cache_ptr + embed_dim;
const int nq = num_heads * embed_dim;
for (int i = threadIdx.x; i < nq; i += blockDim.x) {
const int head_idx = i / embed_dim;
const int token_head = token_idx * query_stride + head_idx * head_size;
const int rot_offset = i % embed_dim;
apply_rotary_embedding<scalar_t, IS_NEOX>(query + token_head, cos_ptr,
sin_ptr, rot_offset, embed_dim);
}
const int nk = num_kv_heads * embed_dim;
for (int i = threadIdx.x; i < nk; i += blockDim.x) {
const int head_idx = i / embed_dim;
const int token_head = token_idx * key_stride + head_idx * head_size;
const int rot_offset = i % embed_dim;
apply_rotary_embedding<scalar_t, IS_NEOX>(key + token_head, cos_ptr,
sin_ptr, rot_offset, embed_dim);
}
}
} // namespace vllm
void rotary_embedding(
torch::Tensor& positions, // [batch_size, seq_len] or [num_tokens]
torch::Tensor& query, // [batch_size, seq_len, num_heads * head_size] or [num_tokens, num_heads * head_size]
torch::Tensor& key, // [batch_size, seq_len, num_kv_heads * head_size] or [num_tokens, num_kv_heads * head_size]
int head_size,
torch::Tensor& cos_sin_cache, // [max_position, rot_dim]
bool is_neox) {
int64_t num_tokens = query.numel() / query.size(-1);
int rot_dim = cos_sin_cache.size(1);
int num_heads = query.size(-1) / head_size;
int num_kv_heads = key.size(-1) / head_size;
int query_stride = query.stride(-2);
int key_stride = key.stride(-2);
dim3 grid(num_tokens);
dim3 block(std::min(num_heads * rot_dim / 2, 512));
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
VLLM_DISPATCH_FLOATING_TYPES(
query.scalar_type(),
"rotary_embedding",
[&] {
if (is_neox) {
vllm::rotary_embedding_kernel<scalar_t, true><<<grid, block, 0, stream>>>(
positions.data_ptr<int64_t>(),
query.data_ptr<scalar_t>(),
key.data_ptr<scalar_t>(),
cos_sin_cache.data_ptr<scalar_t>(),
rot_dim,
query_stride,
key_stride,
num_heads,
num_kv_heads,
head_size);
} else {
vllm::rotary_embedding_kernel<scalar_t, false><<<grid, block, 0, stream>>>(
positions.data_ptr<int64_t>(),
query.data_ptr<scalar_t>(),
key.data_ptr<scalar_t>(),
cos_sin_cache.data_ptr<scalar_t>(),
rot_dim,
query_stride,
key_stride,
num_heads,
num_kv_heads,
head_size);
}
});
}
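As a reading aid (not part of the deleted file), the two rotary styles handled above differ only in how elements are paired:

// Illustration only: pairing used by apply_rotary_embedding for rotation
// offset r in [0, embed_dim), where embed_dim = rot_dim / 2.
//   GPT-NeoX (is_neox == true):  rotate (arr[r],     arr[r + embed_dim]) with cos[r], sin[r]
//   GPT-J    (is_neox == false): rotate (arr[2 * r], arr[2 * r + 1])     with cos[r], sin[r]
// Both apply: x' = x * cos - y * sin,  y' = y * cos + x * sin.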


@ -1,19 +0,0 @@
#include <torch/extension.h>
torch::Tensor awq_gemm(
torch::Tensor _in_feats,
torch::Tensor _kernel,
torch::Tensor _scaling_factors,
torch::Tensor _zeros,
int split_k_iters);
void squeezellm_gemm(
torch::Tensor vec,
torch::Tensor mat,
torch::Tensor mul,
torch::Tensor lookup_table);
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
m.def("awq_gemm", &awq_gemm, "Quantized GEMM for AWQ");
m.def("squeezellm_gemm", &squeezellm_gemm, "Quantized GEMM for SqueezeLLM");
}


@ -1,87 +0,0 @@
/*
Adapted from https://github.com/mit-han-lab/llm-awq
Modified from NVIDIA FasterTransformer: https://github.com/NVIDIA/FasterTransformer/blob/main/src/fastertransformer/cutlass_extensions/include/cutlass_extensions/interleaved_numeric_conversion.h
@article{lin2023awq,
title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
journal={arXiv},
year={2023}
}
*/
#pragma once
namespace vllm {
namespace awq {
__device__ uint4 dequantize_s4_to_fp16x2(uint32_t const& source)
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 750
assert(false);
#else
uint4 result;
uint32_t* h = reinterpret_cast<uint32_t*>(&result);
uint32_t const i4s = reinterpret_cast<uint32_t const&>(source);
// First, we extract the i4s and construct an intermediate fp16 number.
static constexpr uint32_t immLut = (0xf0 & 0xcc) | 0xaa;
static constexpr uint32_t BOTTOM_MASK = 0x000f000f;
static constexpr uint32_t TOP_MASK = 0x00f000f0;
static constexpr uint32_t I4s_TO_F16s_MAGIC_NUM = 0x64006400;
// Note that the entire sequence only requires 1 shift instruction. This is thanks to the register packing
// format and the fact that we force our integers to be unsigned, and account for this in the fp16 subtractions.
// In addition, I exploit the fact that sub and fma have the same throughput in order to convert elt_23 and
* elt_67 to fp16 without having to shift them to the bottom bits beforehand.
// Shift right by 8 to now consider elt_45 and elt_67. Issue first to hide RAW dependency if we issue
// immediately before required.
const uint32_t top_i4s = i4s >> 8;
// Extract elt_01 - (i4s & 0x000f000f) | 0x64006400
asm volatile("lop3.b32 %0, %1, %2, %3, %4;\n"
: "=r"(h[0])
: "r"(i4s), "n"(BOTTOM_MASK), "n"(I4s_TO_F16s_MAGIC_NUM), "n"(immLut));
// Extract elt_23 (i4s & 0x00f000f0) | 0x64006400
asm volatile("lop3.b32 %0, %1, %2, %3, %4;\n"
: "=r"(h[1])
: "r"(i4s), "n"(TOP_MASK), "n"(I4s_TO_F16s_MAGIC_NUM), "n"(immLut));
// Extract elt_45 (top_i4s & 0x000f000f) | 0x64006400
asm volatile("lop3.b32 %0, %1, %2, %3, %4;\n"
: "=r"(h[2])
: "r"(top_i4s), "n"(BOTTOM_MASK), "n"(I4s_TO_F16s_MAGIC_NUM), "n"(immLut));
// Extract elt_67 (top_i4s & 0x00f000f0) | 0x64006400
asm volatile("lop3.b32 %0, %1, %2, %3, %4;\n"
: "=r"(h[3])
: "r"(top_i4s), "n"(TOP_MASK), "n"(I4s_TO_F16s_MAGIC_NUM), "n"(immLut));
// I use inline PTX below because I am not sure if the compiler will emit float2half instructions if I use the
// half2 ctor. In this case, I chose performance reliability over code readability.
// This is the half2 {1032, 1032} represented as an integer.
// static constexpr uint32_t FP16_TOP_MAGIC_NUM = 0x64086408;
// Haotian: subtract {1024, 1024} instead, we do not need to map to [-8, 7]
static constexpr uint32_t FP16_TOP_MAGIC_NUM = 0x64006400;
// This is the half2 {1 / 16, 1 / 16} represented as an integer.
static constexpr uint32_t ONE_SIXTEENTH = 0x2c002c00;
// This is the half2 {-72, -72} represented as an integer.
// static constexpr uint32_t NEG_72 = 0xd480d480;
// Haotian: Let's use {-64, -64}.
static constexpr uint32_t NEG_64 = 0xd400d400;
// Finally, we construct the output numbers.
// Convert elt_01
asm volatile("sub.f16x2 %0, %1, %2;\n" : "=r"(h[0]) : "r"(h[0]), "r"(FP16_TOP_MAGIC_NUM));
// Convert elt_23
asm volatile("fma.rn.f16x2 %0, %1, %2, %3;\n" : "=r"(h[1]) : "r"(h[1]), "r"(ONE_SIXTEENTH), "r"(NEG_64));
// Convert elt_45
asm volatile("sub.f16x2 %0, %1, %2;\n" : "=r"(h[2]) : "r"(h[2]), "r"(FP16_TOP_MAGIC_NUM));
// Convert elt_67
asm volatile("fma.rn.f16x2 %0, %1, %2, %3;\n" : "=r"(h[3]) : "r"(h[3]), "r"(ONE_SIXTEENTH), "r"(NEG_64));
return result;
#endif
}
} // namespace awq
} // namespace vllm
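A short explanatory aside (not part of the deleted file) on the magic-number trick used above:

// Worked example (illustration only). Take a low-lane nibble q = 0x5:
//   (i4s & 0x000f) | 0x6400  ->  0x6405
// 0x6400 is the fp16 bit pattern of 1024.0, and or-ing a 4-bit integer into its
// mantissa yields exactly 1024.0 + q, so 0x6405 encodes 1029.0; subtracting
// FP16_TOP_MAGIC_NUM (1024.0) recovers q = 5.0.
// The elt_23 / elt_67 lanes sit 4 bits higher, so they come out as 1024 + 16*q;
// the fma with ONE_SIXTEENTH (1/16) and NEG_64 (-64 = -1024/16) rescales them:
//   (1024 + 16*q) / 16 - 64 = q.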


@ -1,560 +0,0 @@
/*
Adapted from https://github.com/mit-han-lab/llm-awq
@article{lin2023awq,
title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
journal={arXiv},
year={2023}
}
*/
#include <torch/extension.h>
#include <c10/cuda/CUDAGuard.h>
#include "dequantize.cuh"
#include <cuda_fp16.h>
namespace vllm {
namespace awq {
// Pack two half values.
static inline __device__ __host__ unsigned
__pack_half2(const half x, const half y) {
unsigned v0 = *((unsigned short *)&x);
unsigned v1 = *((unsigned short *)&y);
return (v1 << 16) | v0;
}
__global__ void __launch_bounds__(64) gemm_forward_4bit_cuda_m16n128k32(int G, int split_k_iters, half* __restrict__ A, int* __restrict__ B, half* __restrict__ scaling_factors, int* __restrict__ zeros, int M, int IC, int OC, half* __restrict__ C)
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 750
assert(false);
#else
static constexpr uint32_t ZERO = 0x0;
float C_warp[32];
__shared__ half A_shared[16 * (32 + 8)];
__shared__ half B_shared[32 * (128 + 8)];
__shared__ half scaling_factors_shared[128];
__shared__ half zeros_shared[128];
int j_factors1 = ((OC + 128 - 1) / 128);
int blockIdx_x = 0;
int blockIdx_y = blockIdx.x % ((M + 16 - 1) / 16 * j_factors1);
int blockIdx_z = blockIdx.x / ((M + 16 - 1) / 16 * j_factors1);
half A_shared_warp[8];
half B_shared_warp[32];
for (int j_0_4_init = 0; j_0_4_init < 4; ++j_0_4_init) {
for (int i = 0; i < 8; ++i) {
C_warp[(j_0_4_init * 8) + i] = 0.0;
}
}
static constexpr int row_stride_warp = 32 * 8 / 32;
static constexpr int row_stride = 2 * 32 * 8 / 128;
bool ld_zero_flag = (threadIdx.y * 32 + threadIdx.x) * 8 < 128;
// TODO: Haotian: blockIdx_y / j_factors1 in A loading to support bsz > 16
bool ld_A_flag = (blockIdx_y / j_factors1 * 16 + threadIdx.y * row_stride_warp + threadIdx.x * 8 / 32) < M; // threadIdx.y is warp_id
// bool wb_C_flag = (threadIdx.x / 4) < M;
half* A_ptr = A
+ (((int)blockIdx_y) / j_factors1 * 16 + (((int)threadIdx.y) * row_stride_warp) + ((int)threadIdx.x) / (32 / 8)) * IC
+ (((int)threadIdx.x) % (32 / 8)) * 8;
int* B_ptr = B
+ ((int)threadIdx.y) * (OC / 8) * 2
+ (((int)threadIdx.x) / (128 / 8)) * (OC / 8)
+ (((int)blockIdx_y) % j_factors1) * (128 / 8)
+ (((int)threadIdx.x) % (128 / 8)) * 1;
// Why * 1 in the above line?
half* A_shared_ptr = A_shared
+ ((int)threadIdx.y) * row_stride_warp * (32 + 8)
+ (((int)threadIdx.x) / (32 / 8)) * (32 + 8)
+ (((int)threadIdx.x) % (32 / 8) ) * 8;
half* B_shared_ptr = B_shared
+ ((int)threadIdx.y) * (row_stride / 2) * (128 + 8)
+ (((int)threadIdx.x) / (128 / 8)) * (128 + 8)
+ (((int)threadIdx.x) % (128 / 8)) * 8;
int* zeros_ptr = zeros
+ (((int)blockIdx_y) % j_factors1) * (128 / 8)
+ ((int)threadIdx.x) % (128 / 8);
half* scaling_factors_ptr = scaling_factors
+ (((int)blockIdx_y) % j_factors1) * (128)
+ (((int)threadIdx.x) % (128 / 8)) * 8;
half* C_ptr = C
+ static_cast<long long>(blockIdx_z) * M * OC // blockIdx_z -> split_k dim
+ (((int)blockIdx_y) % j_factors1) * 128
+ ((int)threadIdx.y) * 64
+ (((int)threadIdx.x) % 4) * 2;
// preload scaling factors and zeros
int k_bound = (IC / 32 + split_k_iters - 1) / split_k_iters;
if ((k_bound - 1) * split_k_iters * 32 + blockIdx_z * 32 >= IC) k_bound -= 1;
for (int _k_0_0 = 0; _k_0_0 < k_bound; ++_k_0_0) {
int k_0_0 = _k_0_0 * split_k_iters + blockIdx_z;
__syncthreads();
// TODO: Haotian: blockIdx_y / j_factors1 in A loading to support bsz > 16
if (ld_A_flag)
{
*(uint4*)(A_shared_ptr) = *(uint4*)(A_ptr + (k_0_0 * 32));
}
else
{
*(uint4*)(A_shared_ptr) = make_uint4(0, 0, 0, 0);
}
// for (int ax0_ax1_fused_0 = 0; ax0_ax1_fused_0 < 2; ++ax0_ax1_fused_0) {
uint32_t zeros_loaded = *(uint32_t*)(zeros_ptr + k_0_0 * 32 / G * (OC / 8));
uint4 B_loaded_zero = dequantize_s4_to_fp16x2(zeros_loaded);
uint4 B_loaded_scale = *(uint4*)(scaling_factors_ptr + k_0_0 * 32 / G * (OC));
/*
if (blockIdx_z == 0 && blockIdx_y == 0 && k_0_0 == 0 && threadIdx.x == 0 && threadIdx.y == 0){
printf("%x %x %x %x %x %x %x %x\n", B_loaded_scale.x, B_loaded_scale.y, B_loaded_scale.z, B_loaded_scale.w, B_loaded_zero.x, B_loaded_zero.y, B_loaded_zero.z, B_loaded_zero.w);
}
*/
// uint4 B_loaded_scale = make_uint4(0, 0, 0, 0);
int* B_ptr_local = B_ptr + k_0_0 * 32 * (OC / 8);
for (int ax0_ax1_fused_0 = 0; ax0_ax1_fused_0 < 8; ++ax0_ax1_fused_0) {
// B: 32 x 136 (128+8) float16
// each warp: 32 x 4
// each thr: read 32 bit -> convert to 8xFP16 (a UINT4) -> scale and minus zero -> WB UINT4
// *(uint4*)(B_shared + ((((ax0_ax1_fused_0 * 544) + (((int)threadIdx.y) * 272)) + ((((int)threadIdx.x) >> 4) * 136)) + ((((int)threadIdx.x) & 15) * 8))) = *(uint4*)(B + ((((((k_0_0 * 163840) + (ax0_ax1_fused_0 * 20480)) + (((int)threadIdx.y) * 10240)) + ((((int)threadIdx.x) >> 4) * 5120)) + (((int)blockIdx_y) * 128)) + ((((int)threadIdx.x) & 15) * 8)));
// row stride in shared memory: (NWARPS * 32 * 8 / cta_N)
uint32_t B_loaded = *(uint32_t*)(B_ptr_local + ax0_ax1_fused_0 * row_stride * (OC / 8));
uint4 B_loaded_fp16 = dequantize_s4_to_fp16x2(B_loaded);
//uint4 B_loaded_zero = *(uint4*)(zeros_shared + (threadIdx.x % (cta_N / 8)) * 8);
// uint4 B_loaded_scale = *(uint4*)(scaling_factors_shared + (threadIdx.x % (cta_N / 8)) * 8);
// - zero and * scale
// TODO (Haotian): can save 4 assembly instructions if formulated as deq = q * scale - zero * scale.
asm volatile("sub.f16x2 %0, %1, %2;\n" : "=r"(B_loaded_fp16.x) : "r"(B_loaded_fp16.x), "r"(B_loaded_zero.x));
asm volatile("fma.rn.f16x2 %0, %1, %2, %3;\n" : "=r"(B_loaded_fp16.x) : "r"(B_loaded_fp16.x), "r"(B_loaded_scale.x), "r"(ZERO));
asm volatile("sub.f16x2 %0, %1, %2;\n" : "=r"(B_loaded_fp16.y) : "r"(B_loaded_fp16.y), "r"(B_loaded_zero.y));
asm volatile("fma.rn.f16x2 %0, %1, %2, %3;\n" : "=r"(B_loaded_fp16.y) : "r"(B_loaded_fp16.y), "r"(B_loaded_scale.y), "r"(ZERO));
asm volatile("sub.f16x2 %0, %1, %2;\n" : "=r"(B_loaded_fp16.z) : "r"(B_loaded_fp16.z), "r"(B_loaded_zero.z));
asm volatile("fma.rn.f16x2 %0, %1, %2, %3;\n" : "=r"(B_loaded_fp16.z) : "r"(B_loaded_fp16.z), "r"(B_loaded_scale.z), "r"(ZERO));
asm volatile("sub.f16x2 %0, %1, %2;\n" : "=r"(B_loaded_fp16.w) : "r"(B_loaded_fp16.w), "r"(B_loaded_zero.w));
asm volatile("fma.rn.f16x2 %0, %1, %2, %3;\n" : "=r"(B_loaded_fp16.w) : "r"(B_loaded_fp16.w), "r"(B_loaded_scale.w), "r"(ZERO));
/*
if (ax0_ax1_fused_0 == 0 && blockIdx_z == 0 && blockIdx_y == 0 && k_0_0 == 0 && threadIdx.x == 17 && threadIdx.y == 0){
printf("[x] %X %X %X %X\n", B_loaded_fp16.x, B_loaded_fp16.y, B_loaded_fp16.z, B_loaded_fp16.w);
}
*/
// write back
*(uint4*)(B_shared_ptr + ax0_ax1_fused_0 * row_stride * (128 + 8)) = B_loaded_fp16;
}
__syncthreads();
for (int k_0_1 = 0; k_0_1 < 2; ++k_0_1) {
{
unsigned int addr;
__asm__ __volatile__(
"{ .reg .u64 addr; cvta.to.shared.u64 addr, %1; cvt.u32.u64 %0, addr; }\n"
: "=r"(addr)
: "l"((void *)((&(A_shared[(k_0_1 * 16)])) + (((((int)threadIdx.x) & 15) * 40) + ((((int)threadIdx.x) >> 4) * 8))))
);
__asm__ __volatile__(
"ldmatrix.sync.aligned.m8n8.x4.shared.b16"
"{%0, %1, %2, %3}, [%4];\n"
: "=r"(((unsigned *)(A_shared_warp + 0))[0]), "=r"(((unsigned *)(A_shared_warp + 0))[1]), "=r"(((unsigned *)(A_shared_warp + 0))[2]), "=r"(((unsigned *)(A_shared_warp + 0))[3])
: "r"(addr)
);
}
for (int ax1_0 = 0; ax1_0 < 4; ++ax1_0) {
{
unsigned int addr;
__asm__ __volatile__(
"{ .reg .u64 addr; cvta.to.shared.u64 addr, %1; cvt.u32.u64 %0, addr; }\n"
: "=r"(addr)
: "l"((void *)((&(B_shared[(((k_0_1 * 2176) + (((int)threadIdx.y) * 64)) + (ax1_0 * 16))])) + (((((int)threadIdx.x) & 15) * 136) + ((((int)threadIdx.x) >> 4) * 8))))
);
__asm__ __volatile__(
"ldmatrix.sync.aligned.m8n8.x4.trans.shared.b16"
"{%0, %1, %2, %3}, [%4];\n"
: "=r"(((unsigned *)(B_shared_warp + (ax1_0 * 8)))[0]), "=r"(((unsigned *)(B_shared_warp + (ax1_0 * 8)))[1]), "=r"(((unsigned *)(B_shared_warp + (ax1_0 * 8)))[2]), "=r"(((unsigned *)(B_shared_warp + (ax1_0 * 8)))[3])
: "r"(addr)
);
}
}
for (int j_0_4 = 0; j_0_4 < 4; ++j_0_4) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ == 750
{
__asm__ __volatile__(
"mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32"
"{%0, %1, %2, %3}, {%4, %5}, {%6}, {%7, %8, %9, %10};\n"
: "=f"(((float *)(C_warp + (j_0_4 * 8)))[0]), "=f"(((float *)(C_warp + (j_0_4 * 8)))[1]), "=f"(((float *)(C_warp + (j_0_4 * 8)))[2]), "=f"(((float *)(C_warp + (j_0_4 * 8)))[3])
: "r"(((unsigned *)(A_shared_warp + 0))[0]), "r"(((unsigned *)(A_shared_warp + 0))[1]), "r"(((unsigned *)(B_shared_warp + (j_0_4 * 8)))[0]), "f"(((float *)(C_warp + (j_0_4 * 8)))[0]), "f"(((float *)(C_warp + (j_0_4 * 8)))[1]), "f"(((float *)(C_warp + (j_0_4 * 8)))[2]), "f"(((float *)(C_warp + (j_0_4 * 8)))[3]));
}
{
__asm__ __volatile__(
"mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32"
"{%0, %1, %2, %3}, {%4, %5}, {%6}, {%7, %8, %9, %10};\n"
: "=f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[0]), "=f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[1]), "=f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[2]), "=f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[3])
: "r"(((unsigned *)(A_shared_warp + 0))[0]), "r"(((unsigned *)(A_shared_warp + 0))[1]), "r"(((unsigned *)(B_shared_warp + ((j_0_4 * 8) + 4)))[0]), "f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[0]), "f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[1]), "f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[2]), "f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[3]));
}
{
__asm__ __volatile__(
"mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32"
"{%0, %1, %2, %3}, {%4, %5}, {%6}, {%7, %8, %9, %10};\n"
: "=f"(((float *)(C_warp + (j_0_4 * 8)))[0]), "=f"(((float *)(C_warp + (j_0_4 * 8)))[1]), "=f"(((float *)(C_warp + (j_0_4 * 8)))[2]), "=f"(((float *)(C_warp + (j_0_4 * 8)))[3])
: "r"(((unsigned *)(A_shared_warp + 0))[2]), "r"(((unsigned *)(A_shared_warp + 0))[3]), "r"(((unsigned *)(B_shared_warp + (j_0_4 * 8)))[1]), "f"(((float *)(C_warp + (j_0_4 * 8)))[0]), "f"(((float *)(C_warp + (j_0_4 * 8)))[1]), "f"(((float *)(C_warp + (j_0_4 * 8)))[2]), "f"(((float *)(C_warp + (j_0_4 * 8)))[3]));
}
{
__asm__ __volatile__(
"mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32"
"{%0, %1, %2, %3}, {%4, %5}, {%6}, {%7, %8, %9, %10};\n"
: "=f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[0]), "=f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[1]), "=f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[2]), "=f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[3])
: "r"(((unsigned *)(A_shared_warp + 0))[2]), "r"(((unsigned *)(A_shared_warp + 0))[3]), "r"(((unsigned *)(B_shared_warp + ((j_0_4 * 8) + 4)))[1]), "f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[0]), "f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[1]), "f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[2]), "f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[3]));
}
#else
{
__asm__ __volatile__(
"mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32"
"{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%10, %11, %12, %13};\n"
: "=f"(((float *)(C_warp + (j_0_4 * 8)))[0]), "=f"(((float *)(C_warp + (j_0_4 * 8)))[1]), "=f"(((float *)(C_warp + (j_0_4 * 8)))[2]), "=f"(((float *)(C_warp + (j_0_4 * 8)))[3])
: "r"(((unsigned *)(A_shared_warp + 0))[0]), "r"(((unsigned *)(A_shared_warp + 0))[1]), "r"(((unsigned *)(A_shared_warp + 0))[2]), "r"(((unsigned *)(A_shared_warp + 0))[3]), "r"(((unsigned *)(B_shared_warp + (j_0_4 * 8)))[0]), "r"(((unsigned *)(B_shared_warp + (j_0_4 * 8)))[1]), "f"(((float *)(C_warp + (j_0_4 * 8)))[0]), "f"(((float *)(C_warp + (j_0_4 * 8)))[1]), "f"(((float *)(C_warp + (j_0_4 * 8)))[2]), "f"(((float *)(C_warp + (j_0_4 * 8)))[3]));
}
{
__asm__ __volatile__(
"mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32"
"{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%10, %11, %12, %13};\n"
: "=f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[0]), "=f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[1]), "=f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[2]), "=f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[3])
: "r"(((unsigned *)(A_shared_warp + 0))[0]), "r"(((unsigned *)(A_shared_warp + 0))[1]), "r"(((unsigned *)(A_shared_warp + 0))[2]), "r"(((unsigned *)(A_shared_warp + 0))[3]), "r"(((unsigned *)(B_shared_warp + ((j_0_4 * 8) + 4)))[0]), "r"(((unsigned *)(B_shared_warp + ((j_0_4 * 8) + 4)))[1]), "f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[0]), "f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[1]), "f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[2]), "f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[3]));
}
#endif
}
}
}
// TODO: Shang: Hoist loop invariance.
for (int ax1_0_1 = 0; ax1_0_1 < 4; ++ax1_0_1) {
for (int local_id = 0; local_id < 8; ++local_id) {
int row_offset = (((int)blockIdx_y) / j_factors1) * 16 + ((int)threadIdx.x) / 4 + (local_id % 4) / 2 * 8;
if (row_offset < M)
{
*(C_ptr + ax1_0_1 * 16 + row_offset * OC + (local_id / 4) * 8 + local_id % 2) = __float2half(C_warp[(ax1_0_1 * 8) + local_id]);
}
}
}
#endif
}
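// Note (explanatory aside, not original code): C_ptr above is offset by
// blockIdx_z * M * OC, so each split-k slice writes its partial result into its
// own [M, OC] slab of C; the caller is then expected to reduce C over that
// leading split_k_iters dimension to obtain the final GEMM output.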
__global__ void __launch_bounds__(64) gemm_forward_4bit_cuda_m16n64k32(int G, int split_k_iters, half* __restrict__ A, int* __restrict__ B, half* __restrict__ scaling_factors, int* __restrict__ zeros, int M, int IC, int OC, half* __restrict__ C)
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 750
assert(false);
#else
static constexpr uint32_t ZERO = 0x0;
float C_warp[32];
__shared__ half A_shared[16 * (32 + 8)];
__shared__ half B_shared[32 * (64 + 8)];
__shared__ half scaling_factors_shared[64];
__shared__ half zeros_shared[64];
int j_factors1 = ((OC + 64 - 1) / 64);
int blockIdx_x = 0;
int blockIdx_y = blockIdx.x % ((M + 16 - 1) / 16 * j_factors1);
int blockIdx_z = blockIdx.x / ((M + 16 - 1) / 16 * j_factors1);
half A_shared_warp[8];
half B_shared_warp[16];
for (int j_0_4_init = 0; j_0_4_init < 2; ++j_0_4_init) {
for (int i = 0; i < 8; ++i) {
C_warp[(j_0_4_init * 8) + i] = 0.0;
}
}
static constexpr int row_stride_warp = 32 * 8 / 32;
static constexpr int row_stride = 2 * 32 * 8 / 64;
bool ld_zero_flag = (threadIdx.y * 32 + threadIdx.x) * 8 < 64;
// TODO: Haotian: blockIdx_y / j_factors1 in A loading to support bsz > 16
bool ld_A_flag = (blockIdx_y / j_factors1 * 16 + threadIdx.y * row_stride_warp + threadIdx.x * 8 / 32) < M; // threadIdx.y is warp_id
// bool wb_C_flag = (threadIdx.x / 4) < M;
half* A_ptr = A
+ (((int)blockIdx_y) / j_factors1 * 16 + (((int)threadIdx.y) * row_stride_warp) + ((int)threadIdx.x) / (32 / 8)) * IC
+ (((int)threadIdx.x) % (32 / 8)) * 8;
int* B_ptr = B
+ ((int)threadIdx.y) * (OC / 8) * 4
+ (((int)threadIdx.x) / (64 / 8)) * (OC / 8)
+ (((int)blockIdx_y) % j_factors1) * (64 / 8)
+ (((int)threadIdx.x) % (64 / 8)) * 1;
// Why * 1 in the above line?
half* A_shared_ptr = A_shared
+ ((int)threadIdx.y) * row_stride_warp * (32 + 8)
+ (((int)threadIdx.x) / (32 / 8)) * (32 + 8)
+ (((int)threadIdx.x) % (32 / 8) ) * 8;
half* B_shared_ptr = B_shared
+ ((int)threadIdx.y) * (row_stride / 2) * (64 + 8)
+ (((int)threadIdx.x) / (64 / 8)) * (64 + 8)
+ (((int)threadIdx.x) % (64 / 8)) * 8;
int* zeros_ptr = zeros
+ (((int)blockIdx_y) % j_factors1) * (64 / 8)
+ ((int)threadIdx.x) % (64 / 8);
half* scaling_factors_ptr = scaling_factors
+ (((int)blockIdx_y) % j_factors1) * (64)
+ (((int)threadIdx.x) % (64 / 8)) * 8;
half* C_ptr = C
+ static_cast<long long>(blockIdx_z) * M * OC // blockIdx_z -> split_k dim
+ (((int)blockIdx_y) % j_factors1) * 64
+ ((int)threadIdx.y) * 32
+ (((int)threadIdx.x) % 4) * 2;
// preload scaling factors and zeros
int k_bound = (IC / 32 + split_k_iters - 1) / split_k_iters;
if ((k_bound - 1) * split_k_iters * 32 + blockIdx_z * 32 >= IC) k_bound -= 1;
for (int _k_0_0 = 0; _k_0_0 < k_bound; ++_k_0_0) {
int k_0_0 = _k_0_0 * split_k_iters + blockIdx_z;
__syncthreads();
// TODO: Haotian: blockIdx_y / j_factors1 in A loading to support bsz > 16
if (ld_A_flag)
{
*(uint4*)(A_shared_ptr) = *(uint4*)(A_ptr + (k_0_0 * 32));
}
else
{
*(uint4*)(A_shared_ptr) = make_uint4(0, 0, 0, 0);
}
// for (int ax0_ax1_fused_0 = 0; ax0_ax1_fused_0 < 2; ++ax0_ax1_fused_0) {
uint32_t zeros_loaded = *(uint32_t*)(zeros_ptr + k_0_0 * 32 / G * (OC / 8));
uint4 B_loaded_zero = dequantize_s4_to_fp16x2(zeros_loaded);
uint4 B_loaded_scale = *(uint4*)(scaling_factors_ptr + k_0_0 * 32 / G * (OC));
/*
if (blockIdx_z == 0 && blockIdx_y == 0 && k_0_0 == 0 && threadIdx.x == 0 && threadIdx.y == 0){
printf("%x %x %x %x %x %x %x %x\n", B_loaded_scale.x, B_loaded_scale.y, B_loaded_scale.z, B_loaded_scale.w, B_loaded_zero.x, B_loaded_zero.y, B_loaded_zero.z, B_loaded_zero.w);
}
*/
// uint4 B_loaded_scale = make_uint4(0, 0, 0, 0);
int* B_ptr_local = B_ptr + k_0_0 * 32 * (OC / 8);
for (int ax0_ax1_fused_0 = 0; ax0_ax1_fused_0 < 4; ++ax0_ax1_fused_0) {
// B: 32 x 72 (64+8) float16
// each warp: 32 x 4
// each thr: read 32 bit -> convert to 8xFP16 (a UINT4) -> scale and minus zero -> WB UINT4
// *(uint4*)(B_shared + ((((ax0_ax1_fused_0 * 544) + (((int)threadIdx.y) * 272)) + ((((int)threadIdx.x) >> 4) * 136)) + ((((int)threadIdx.x) & 15) * 8))) = *(uint4*)(B + ((((((k_0_0 * 163840) + (ax0_ax1_fused_0 * 20480)) + (((int)threadIdx.y) * 10240)) + ((((int)threadIdx.x) >> 4) * 5120)) + (((int)blockIdx_y) * 128)) + ((((int)threadIdx.x) & 15) * 8)));
// row stride in shared memory: (NWARPS * 32 * 8 / cta_N)
uint32_t B_loaded = *(uint32_t*)(B_ptr_local + ax0_ax1_fused_0 * row_stride * (OC / 8));
uint4 B_loaded_fp16 = dequantize_s4_to_fp16x2(B_loaded);
//uint4 B_loaded_zero = *(uint4*)(zeros_shared + (threadIdx.x % (cta_N / 8)) * 8);
// uint4 B_loaded_scale = *(uint4*)(scaling_factors_shared + (threadIdx.x % (cta_N / 8)) * 8);
// - zero and * scale
// TODO (Haotian): can save 4 assembly instructions if reformulated as deq = q * scale - zero * scale.
asm volatile("sub.f16x2 %0, %1, %2;\n" : "=r"(B_loaded_fp16.x) : "r"(B_loaded_fp16.x), "r"(B_loaded_zero.x));
asm volatile("fma.rn.f16x2 %0, %1, %2, %3;\n" : "=r"(B_loaded_fp16.x) : "r"(B_loaded_fp16.x), "r"(B_loaded_scale.x), "r"(ZERO));
asm volatile("sub.f16x2 %0, %1, %2;\n" : "=r"(B_loaded_fp16.y) : "r"(B_loaded_fp16.y), "r"(B_loaded_zero.y));
asm volatile("fma.rn.f16x2 %0, %1, %2, %3;\n" : "=r"(B_loaded_fp16.y) : "r"(B_loaded_fp16.y), "r"(B_loaded_scale.y), "r"(ZERO));
asm volatile("sub.f16x2 %0, %1, %2;\n" : "=r"(B_loaded_fp16.z) : "r"(B_loaded_fp16.z), "r"(B_loaded_zero.z));
asm volatile("fma.rn.f16x2 %0, %1, %2, %3;\n" : "=r"(B_loaded_fp16.z) : "r"(B_loaded_fp16.z), "r"(B_loaded_scale.z), "r"(ZERO));
asm volatile("sub.f16x2 %0, %1, %2;\n" : "=r"(B_loaded_fp16.w) : "r"(B_loaded_fp16.w), "r"(B_loaded_zero.w));
asm volatile("fma.rn.f16x2 %0, %1, %2, %3;\n" : "=r"(B_loaded_fp16.w) : "r"(B_loaded_fp16.w), "r"(B_loaded_scale.w), "r"(ZERO));
/*
if (ax0_ax1_fused_0 == 0 && blockIdx_z == 0 && blockIdx_y == 0 && k_0_0 == 0 && threadIdx.x == 17 && threadIdx.y == 0){
printf("[x] %X %X %X %X\n", B_loaded_fp16.x, B_loaded_fp16.y, B_loaded_fp16.z, B_loaded_fp16.w);
}
*/
// write back
*(uint4*)(B_shared_ptr + ax0_ax1_fused_0 * row_stride * (64 + 8)) = B_loaded_fp16;
}
__syncthreads();
for (int k_0_1 = 0; k_0_1 < 2; ++k_0_1)
{
{
unsigned int addr;
__asm__ __volatile__(
"{ .reg .u64 addr; cvta.to.shared.u64 addr, %1; cvt.u32.u64 %0, addr; }\n"
: "=r"(addr)
: "l"((void *)((&(A_shared[(k_0_1 * 16)])) + (((((int)threadIdx.x) & 15) * 40) + ((((int)threadIdx.x) >> 4) * 8))))
);
__asm__ __volatile__(
"ldmatrix.sync.aligned.m8n8.x4.shared.b16"
"{%0, %1, %2, %3}, [%4];\n"
: "=r"(((unsigned *)(A_shared_warp + 0))[0]), "=r"(((unsigned *)(A_shared_warp + 0))[1]), "=r"(((unsigned *)(A_shared_warp + 0))[2]), "=r"(((unsigned *)(A_shared_warp + 0))[3])
: "r"(addr)
);
}
for (int ax1_0 = 0; ax1_0 < 2; ++ax1_0)
{
{
unsigned int addr;
__asm__ __volatile__(
"{ .reg .u64 addr; cvta.to.shared.u64 addr, %1; cvt.u32.u64 %0, addr; }\n"
: "=r"(addr)
: "l"((void *)((&(B_shared[(((k_0_1 * 1152) + (((int)threadIdx.y) * 32)) + (ax1_0 * 16))])) + (((((int)threadIdx.x) & 15) * 72) + ((((int)threadIdx.x) >> 4) * 8))))
);
__asm__ __volatile__(
"ldmatrix.sync.aligned.m8n8.x4.trans.shared.b16"
"{%0, %1, %2, %3}, [%4];\n"
: "=r"(((unsigned *)(B_shared_warp + (ax1_0 * 8)))[0]), "=r"(((unsigned *)(B_shared_warp + (ax1_0 * 8)))[1]), "=r"(((unsigned *)(B_shared_warp + (ax1_0 * 8)))[2]), "=r"(((unsigned *)(B_shared_warp + (ax1_0 * 8)))[3])
: "r"(addr)
);
}
}
for (int j_0_4 = 0; j_0_4 < 2; ++j_0_4)
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ == 750
{
__asm__ __volatile__(
"mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32"
"{%0, %1, %2, %3}, {%4, %5}, {%6}, {%7, %8, %9, %10};\n"
: "=f"(((float *)(C_warp + (j_0_4 * 8)))[0]), "=f"(((float *)(C_warp + (j_0_4 * 8)))[1]), "=f"(((float *)(C_warp + (j_0_4 * 8)))[2]), "=f"(((float *)(C_warp + (j_0_4 * 8)))[3])
: "r"(((unsigned *)(A_shared_warp + 0))[0]), "r"(((unsigned *)(A_shared_warp + 0))[1]), "r"(((unsigned *)(B_shared_warp + (j_0_4 * 8)))[0]), "f"(((float *)(C_warp + (j_0_4 * 8)))[0]), "f"(((float *)(C_warp + (j_0_4 * 8)))[1]), "f"(((float *)(C_warp + (j_0_4 * 8)))[2]), "f"(((float *)(C_warp + (j_0_4 * 8)))[3]));
}
{
__asm__ __volatile__(
"mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32"
"{%0, %1, %2, %3}, {%4, %5}, {%6}, {%7, %8, %9, %10};\n"
: "=f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[0]), "=f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[1]), "=f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[2]), "=f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[3])
: "r"(((unsigned *)(A_shared_warp + 0))[0]), "r"(((unsigned *)(A_shared_warp + 0))[1]), "r"(((unsigned *)(B_shared_warp + ((j_0_4 * 8) + 4)))[0]), "f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[0]), "f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[1]), "f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[2]), "f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[3]));
}
{
__asm__ __volatile__(
"mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32"
"{%0, %1, %2, %3}, {%4, %5}, {%6}, {%7, %8, %9, %10};\n"
: "=f"(((float *)(C_warp + (j_0_4 * 8)))[0]), "=f"(((float *)(C_warp + (j_0_4 * 8)))[1]), "=f"(((float *)(C_warp + (j_0_4 * 8)))[2]), "=f"(((float *)(C_warp + (j_0_4 * 8)))[3])
: "r"(((unsigned *)(A_shared_warp + 0))[2]), "r"(((unsigned *)(A_shared_warp + 0))[3]), "r"(((unsigned *)(B_shared_warp + (j_0_4 * 8)))[1]), "f"(((float *)(C_warp + (j_0_4 * 8)))[0]), "f"(((float *)(C_warp + (j_0_4 * 8)))[1]), "f"(((float *)(C_warp + (j_0_4 * 8)))[2]), "f"(((float *)(C_warp + (j_0_4 * 8)))[3]));
}
{
__asm__ __volatile__(
"mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32"
"{%0, %1, %2, %3}, {%4, %5}, {%6}, {%7, %8, %9, %10};\n"
: "=f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[0]), "=f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[1]), "=f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[2]), "=f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[3])
: "r"(((unsigned *)(A_shared_warp + 0))[2]), "r"(((unsigned *)(A_shared_warp + 0))[3]), "r"(((unsigned *)(B_shared_warp + ((j_0_4 * 8) + 4)))[1]), "f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[0]), "f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[1]), "f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[2]), "f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[3]));
}
#else
{
__asm__ __volatile__(
"mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32"
"{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%10, %11, %12, %13};\n"
: "=f"(((float *)(C_warp + (j_0_4 * 8)))[0]), "=f"(((float *)(C_warp + (j_0_4 * 8)))[1]), "=f"(((float *)(C_warp + (j_0_4 * 8)))[2]), "=f"(((float *)(C_warp + (j_0_4 * 8)))[3])
: "r"(((unsigned *)(A_shared_warp + 0))[0]), "r"(((unsigned *)(A_shared_warp + 0))[1]), "r"(((unsigned *)(A_shared_warp + 0))[2]), "r"(((unsigned *)(A_shared_warp + 0))[3]), "r"(((unsigned *)(B_shared_warp + (j_0_4 * 8)))[0]), "r"(((unsigned *)(B_shared_warp + (j_0_4 * 8)))[1]), "f"(((float *)(C_warp + (j_0_4 * 8)))[0]), "f"(((float *)(C_warp + (j_0_4 * 8)))[1]), "f"(((float *)(C_warp + (j_0_4 * 8)))[2]), "f"(((float *)(C_warp + (j_0_4 * 8)))[3]));
}
{
__asm__ __volatile__(
"mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32"
"{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%10, %11, %12, %13};\n"
: "=f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[0]), "=f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[1]), "=f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[2]), "=f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[3])
: "r"(((unsigned *)(A_shared_warp + 0))[0]), "r"(((unsigned *)(A_shared_warp + 0))[1]), "r"(((unsigned *)(A_shared_warp + 0))[2]), "r"(((unsigned *)(A_shared_warp + 0))[3]), "r"(((unsigned *)(B_shared_warp + ((j_0_4 * 8) + 4)))[0]), "r"(((unsigned *)(B_shared_warp + ((j_0_4 * 8) + 4)))[1]), "f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[0]), "f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[1]), "f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[2]), "f"(((float *)(C_warp + ((j_0_4 * 8) + 4)))[3]));
}
#endif
}
}
}
// TODO: Shang: Hoist loop invariance.
for (int ax1_0_1 = 0; ax1_0_1 < 2; ++ax1_0_1) {
for (int local_id = 0; local_id < 8; ++local_id) {
int row_offset = (((int)blockIdx_y) / j_factors1) * 16 + ((int)threadIdx.x) / 4 + (local_id % 4) / 2 * 8;
if (row_offset < M)
{
*(C_ptr + ax1_0_1 * 16 + row_offset * OC + (local_id / 4) * 8 + local_id % 2) = __float2half(C_warp[(ax1_0_1 * 8) + local_id]);
}
}
}
#endif
}
} // namespace awq
} // namespace vllm
// in_feats: M, IC [float16]
// kernel: IC, OC // 8 [int32] -> cast to IC, OC [uint4b]
// scaling_factors: IC // G, OC [float16]
// zeros: IC // G, OC // 8 [int32] -> cast to IC // G, OC [uint4b]
// assume that batch_size < 16 for now
torch::Tensor awq_gemm(
torch::Tensor _in_feats,
torch::Tensor _kernel,
torch::Tensor _scaling_factors,
torch::Tensor _zeros,
int split_k_iters)
{
int num_in_feats = _in_feats.size(0);
int num_in_channels = _in_feats.size(1);
const at::cuda::OptionalCUDAGuard device_guard(device_of(_in_feats));
auto options = torch::TensorOptions().dtype(_in_feats.dtype()).device(_in_feats.device());
at::Tensor _out_feats = torch::empty({split_k_iters, num_in_feats, _kernel.size(1) * 8}, options);
int num_out_feats = _out_feats.size(-2);
int num_out_channels = _out_feats.size(-1);
auto in_feats = reinterpret_cast<half*>(_in_feats.data_ptr<at::Half>());
auto kernel = reinterpret_cast<int*>(_kernel.data_ptr<int>());
auto out_feats = reinterpret_cast<half*>(_out_feats.data_ptr<at::Half>());
auto scaling_factors = reinterpret_cast<half*>(_scaling_factors.data_ptr<at::Half>());
auto zeros = reinterpret_cast<int*>(_zeros.data_ptr<int>());
int group_size = num_in_channels / _scaling_factors.size(0);
if (num_out_channels % 64 != 0)
throw std::invalid_argument("OC is not multiple of cta_N = 64");
if (num_out_channels % 8 != 0)
throw std::invalid_argument("OC is not multiple of pack_num = 8");
if (group_size % 32 != 0)
throw std::invalid_argument("Group size should be a multiple of 32");
if (num_out_channels % group_size != 0)
throw std::invalid_argument("OC is not multiple of Group size");
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
if (num_out_channels % 128 == 0)
{
int j_factors1 = num_out_channels / 128 / 1;
dim3 num_blocks((num_out_feats + 16 - 1) / 16 * j_factors1 * split_k_iters);
// threadIdx.x: 32
// threadIdx.y: i_factors[2] * j_factors[2]
dim3 threads_per_block(32, 2);
vllm::awq::gemm_forward_4bit_cuda_m16n128k32<<<num_blocks, threads_per_block, 0, stream>>>(
group_size, split_k_iters, in_feats, kernel, scaling_factors, zeros, num_in_feats, num_in_channels, num_out_channels, out_feats);
}
else if (num_out_channels % 64 == 0)
{
int j_factors1 = num_out_channels / 64 / 1;
dim3 num_blocks(1 * (num_out_feats + 16 - 1) / 16 * j_factors1 * split_k_iters);
// threadIdx.x: 32
// threadIdx.y: i_factors[2] * j_factors[2]
dim3 threads_per_block(32, 2);
vllm::awq::gemm_forward_4bit_cuda_m16n64k32<<<num_blocks, threads_per_block, 0, stream>>>(
group_size, split_k_iters, in_feats, kernel, scaling_factors, zeros, num_in_feats, num_in_channels, num_out_channels, out_feats);
}
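// Each split-k slice wrote its partial result into _out_feats[slice]; reduce over dim 0
// to obtain the final (num_out_feats, num_out_channels) output.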
return _out_feats.sum(0);
}
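For orientation, the following is a minimal PyTorch sketch of the dequantize-then-matmul semantics that the kernels above implement. It assumes the 4-bit weights and zero points have already been unpacked from their int32 packing into per-element integers (the interleaved unpacking is handled by dequantize_s4_to_fp16x2 inside the kernels and is not reproduced here); fp32 math is used only so the sketch runs on CPU, and the function name is ours, not part of the extension.
```python
import torch

def awq_gemm_reference(x, q, scales, zeros, group_size):
    """x: (M, IC) fp16; q: (IC, OC) ints in [0, 15] (weights, already unpacked);
    scales: (IC // group_size, OC) fp16; zeros: (IC // group_size, OC) ints in [0, 15]."""
    # Broadcast the per-group scales / zero points over their group of input channels.
    s = scales.float().repeat_interleave(group_size, dim=0)   # (IC, OC)
    z = zeros.float().repeat_interleave(group_size, dim=0)    # (IC, OC)
    w = (q.float() - z) * s          # same subtract-then-scale the kernel does per fp16x2
    return (x.float() @ w).half()    # (M, OC)

# Tiny smoke test with random data; shapes follow the comments above awq_gemm.
M, IC, OC, G = 4, 64, 128, 32
x = torch.randn(M, IC).half()
q = torch.randint(0, 16, (IC, OC))
scales = (torch.rand(IC // G, OC) + 0.5).half()
zeros = torch.randint(0, 16, (IC // G, OC))
print(awq_gemm_reference(x, q, scales, zeros, G).shape)   # torch.Size([4, 128])
```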

View File

@ -1,148 +0,0 @@
#include <torch/all.h>
#include <torch/python.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_fp16.h>
// half-tensor
#include <c10/cuda/CUDAStream.h>
#include <ATen/cuda/CUDATensorMethods.cuh>
#define BLOCKWIDTH 128
#define BLOCKHEIGHT4 16
namespace vllm {
namespace squeezellm {
__device__ inline unsigned int as_unsigned(int i) {
return *reinterpret_cast<unsigned int*>(&i);
}
// 4-bit matvec kernel (LUT-based)
__global__ void NUQ4MatMulKernel(
const half2* __restrict__ vec,
const int* __restrict__ mat,
half2* __restrict__ mul,
const __half* __restrict__ lookup_table,
int height,
int width,
int batch,
int vec_height
) {
const int blockwidth2 = BLOCKWIDTH / 2;
int row = BLOCKHEIGHT4 * blockIdx.x;
int col = BLOCKWIDTH * blockIdx.y + threadIdx.x;
__shared__ half2 blockvec[blockwidth2];
__shared__ __half deq2[16][BLOCKWIDTH];
int off = threadIdx.x;
int column_offset = col * 16;
for (int val = 0; val < 16; val += 1) {
int lut_index = column_offset + val;
deq2[val][off] = lookup_table[lut_index];
}
__half res;
half2 res2;
half2 tmp2;
int i;
int k;
unsigned int tmp1;
unsigned int lut_index1, lut_index2;
for (int b = 0; b < batch; ++b){
i = width * row + col;
res = __int2half_rd(0);
k = 0;
__syncthreads();
if (threadIdx.x < blockwidth2)
blockvec[threadIdx.x] = vec[b * vec_height / 2 + (row / BLOCKHEIGHT4) * blockwidth2 + threadIdx.x];
__syncthreads();
while (k < blockwidth2) {
tmp1 = as_unsigned(mat[i]);
res2 = {};
tmp2 = {};
lut_index1 = tmp1 & 0xF;
lut_index2 = (tmp1 >> 4) & 0xF;
tmp2.x = deq2[lut_index1][off];
tmp2.y = deq2[lut_index2][off];
res2 = __hfma2(tmp2, blockvec[k + 0], res2);
lut_index1 = (tmp1 >> 8) & 0xF;
lut_index2 = (tmp1 >> 12) & 0xF;
tmp2.x = deq2[lut_index1][off];
tmp2.y = deq2[lut_index2][off];
res2 = __hfma2(tmp2, blockvec[k + 1], res2);
lut_index1 = (tmp1 >> 16) & 0xF;
lut_index2 = (tmp1 >> 20) & 0xF;
tmp2.x = deq2[lut_index1][off];
tmp2.y = deq2[lut_index2][off];
res2 = __hfma2(tmp2, blockvec[k + 2], res2);
lut_index1 = (tmp1 >> 24) & 0xF;
lut_index2 = (tmp1 >> 28) & 0xF;
tmp2.x = deq2[lut_index1][off];
tmp2.y = deq2[lut_index2][off];
res2 = __hfma2(tmp2, blockvec[k + 3], res2);
res = __hadd(__hadd(res2.x, res2.y), res);
i += width;
k += 4;
}
// col%2 -> only set one of the two values
half2 res3 = {};
if (col % 2 == 0) {
res3.x = res;
} else {
res3.y = res;
}
atomicAdd(&mul[b * width / 2 + col / 2], res3);
}
}
} // namespace squeezellm
} // namespace vllm
// 4-bit matvec kernel (LUT-based)
void squeezellm_gemm(
torch::Tensor vec,
torch::Tensor mat,
torch::Tensor mul,
torch::Tensor lookup_table
) {
int height = mat.size(0);
int width = mat.size(1);
int batch = vec.size(0);
int vec_height = vec.size(1);
dim3 blocks(
(height + BLOCKHEIGHT4 - 1) / BLOCKHEIGHT4,
(width + BLOCKWIDTH - 1) / BLOCKWIDTH
);
dim3 threads(BLOCKWIDTH);
vllm::squeezellm::NUQ4MatMulKernel<<<blocks, threads>>>(
(half2*) vec.data<at::Half>(),
mat.data_ptr<int>(),
(half2*) mul.data<at::Half>(),
(__half*) lookup_table.data<at::Half>(),
height, width, batch, vec_height
);
}
#undef BLOCKWIDTH
#undef BLOCKHEIGHT4
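For reference, a minimal PyTorch sketch of the LUT-based 4-bit matvec that NUQ4MatMulKernel implements, again assuming the codebook indices have already been unpacked from their int32 packing (the nibble packing, blocking, and half2 atomics are implementation details of the kernel above, and the function name here is ours):
```python
import torch

def nuq4_matvec_reference(vec, idx, lookup_table):
    """vec: (batch, IC); idx: (IC, OC) int64 codebook indices in [0, 15];
    lookup_table: (OC, 16), one 16-entry codebook per output channel."""
    # Dequantize: W[i, j] = lookup_table[j, idx[i, j]].
    w = torch.gather(lookup_table.t().contiguous().float(), 0, idx)   # (IC, OC)
    return vec.float() @ w                                            # (batch, OC)

batch, IC, OC = 2, 128, 256
vec = torch.randn(batch, IC)
idx = torch.randint(0, 16, (IC, OC))
lut = torch.randn(OC, 16)
print(nuq4_matvec_reference(vec, idx, lut).shape)   # torch.Size([2, 256])
```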

View File

@ -1,51 +0,0 @@
/*
* Adapted from https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/reduce_kernel_utils.cuh
* Copyright (c) 2023, The vLLM team.
* Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once
namespace vllm {
template<typename T>
__inline__ __device__ T warpReduceSum(T val) {
#pragma unroll
for (int mask = 16; mask > 0; mask >>= 1)
val += __shfl_xor_sync(0xffffffff, val, mask, 32);
return val;
}
/* Calculate the sum of all elements in a block */
template<typename T>
__inline__ __device__ T blockReduceSum(T val) {
static __shared__ T shared[32];
int lane = threadIdx.x & 0x1f;
int wid = threadIdx.x >> 5;
val = warpReduceSum<T>(val);
if (lane == 0)
shared[wid] = val;
__syncthreads();
// Use blockDim.x / 32.f (instead of an integer shift) so the partial warp's
// result is included when blockDim.x is not a multiple of 32.
val = (threadIdx.x < (blockDim.x / 32.f)) ? shared[lane] : (T)(0.0f);
val = warpReduceSum<T>(val);
return val;
}
} // namespace vllm

View File

@ -1,20 +0,0 @@
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

View File

@ -1,19 +0,0 @@
# vLLM documents
## Build the docs
```bash
# Install dependencies.
pip install -r requirements-docs.txt
# Build the docs.
make clean
make html
```
## Open the docs with your browser
```bash
python -m http.server -d build/html/
```
Launch your browser and open localhost:8000.

View File

@ -1,35 +0,0 @@
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.https://www.sphinx-doc.org/
exit /b 1
)
if "%1" == "" goto help
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd

View File

@ -1,3 +0,0 @@
sphinx == 6.2.1
sphinx-book-theme == 1.0.1
sphinx-copybutton == 0.5.2

Binary file not shown.

Before

Width:  |  Height:  |  Size: 53 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 86 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 88 KiB

View File

@ -1,67 +0,0 @@
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))
# -- Project information -----------------------------------------------------
project = 'vLLM'
copyright = '2023, vLLM Team'
author = 'the vLLM Team'
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
"sphinx.ext.napoleon",
"sphinx.ext.viewcode",
"sphinx.ext.intersphinx",
"sphinx_copybutton",
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = []
# Exclude the prompt "$" when copying code
copybutton_prompt_text = r"\$ "
copybutton_prompt_is_regexp = True
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_title = project
html_theme = 'sphinx_book_theme'
html_logo = 'assets/logos/vllm-logo-text-light.png'
html_theme_options = {
'logo_only': True,
'path_to_docs': 'docs/source',
'repository_url': 'https://github.com/vllm-project/vllm',
'use_repository_button': True,
}
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']

View File

@ -1,50 +0,0 @@
.. _installation:
Installation
============
vLLM is a Python library that also contains pre-compiled C++ and CUDA (11.8) binaries.
Requirements
------------
* OS: Linux
* Python: 3.8 -- 3.11
* GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, etc.)
Install with pip
----------------
You can install vLLM using pip:
.. code-block:: console
$ # (Optional) Create a new conda environment.
$ conda create -n myenv python=3.8 -y
$ conda activate myenv
$ # Install vLLM.
$ pip install vllm
.. _build_from_source:
Build from source
-----------------
You can also build and install vLLM from source:
.. code-block:: console
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -e . # This may take 5-10 minutes.
.. tip::
If you have trouble building vLLM, we recommend using the NVIDIA PyTorch Docker image.
.. code-block:: console
$ # Pull the Docker image with CUDA 11.8.
$ # Use `--ipc=host` to make sure the shared memory is large enough.
$ docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:22.12-py3

View File

@ -1,158 +0,0 @@
.. _quickstart:
Quickstart
==========
This guide shows how to use vLLM to:
* run offline batched inference on a dataset;
* build an API server for a large language model;
* start an OpenAI-compatible API server.
Be sure to complete the :ref:`installation instructions <installation>` before continuing with this guide.
Offline Batched Inference
-------------------------
We first show an example of using vLLM for offline batched inference on a dataset. In other words, we use vLLM to generate texts for a list of input prompts.
Import ``LLM`` and ``SamplingParams`` from vLLM. The ``LLM`` class is the main class for running offline inference with the vLLM engine. The ``SamplingParams`` class specifies the parameters for the sampling process.
.. code-block:: python
from vllm import LLM, SamplingParams
Define the list of input prompts and the sampling parameters for generation. The sampling temperature is set to 0.8 and the nucleus sampling probability is set to 0.95. For more information about the sampling parameters, refer to the `class definition <https://github.com/vllm-project/vllm/blob/main/vllm/sampling_params.py>`_.
.. code-block:: python
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
Initialize vLLM's engine for offline inference with the ``LLM`` class and the `OPT-125M model <https://arxiv.org/abs/2205.01068>`_. The list of supported models can be found at :ref:`supported models <supported_models>`.
.. code-block:: python
llm = LLM(model="facebook/opt-125m")
To use a model from www.modelscope.cn:
.. code-block:: shell
export VLLM_USE_MODELSCOPE=True
.. code-block:: python
llm = LLM(model="qwen/Qwen-7B-Chat", revision="v1.1.8", trust_remote_code=True)
Call ``llm.generate`` to generate the outputs. It adds the input prompts to the vLLM engine's waiting queue and executes the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of ``RequestOutput`` objects, which include all the output tokens.
.. code-block:: python
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
The code example can also be found in `examples/offline_inference.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference.py>`_.
API Server
----------
vLLM can be deployed as an LLM service. We provide an example `FastAPI <https://fastapi.tiangolo.com/>`_ server. Check `vllm/entrypoints/api_server.py <https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py>`_ for the server implementation. The server uses the ``AsyncLLMEngine`` class to support asynchronous processing of incoming requests.
Start the server:
.. code-block:: console
$ python -m vllm.entrypoints.api_server
To use a model from www.modelscope.cn:
.. code-block:: console
$ VLLM_USE_MODELSCOPE=True python -m vllm.entrypoints.api_server \
$ --model="qwen/Qwen-7B-Chat" \
$ --revision="v1.1.8" \
$ --trust-remote-code
By default, this command starts the server at ``http://localhost:8000`` with the OPT-125M model.
Query the model in shell:
.. code-block:: console
$ curl http://localhost:8000/generate \
$ -d '{
$ "prompt": "San Francisco is a",
$ "use_beam_search": true,
$ "n": 4,
$ "temperature": 0
$ }'
See `examples/api_client.py <https://github.com/vllm-project/vllm/blob/main/examples/api_client.py>`_ for a more detailed client example.
OpenAI-Compatible Server
------------------------
vLLM can be deployed as a server that mimics the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications that use the OpenAI API.
Start the server:
.. code-block:: console
$ python -m vllm.entrypoints.openai.api_server \
$ --model facebook/opt-125m
To use a model from www.modelscope.cn:
.. code-block:: console
$ VLLM_USE_MODELSCOPE=True python -m vllm.entrypoints.openai.api_server \
$ --model="qwen/Qwen-7B-Chat" --revision="v1.1.8" --trust-remote-code
By default, it starts the server at ``http://localhost:8000``. You can specify the address with ``--host`` and ``--port`` arguments. The server currently hosts one model at a time (OPT-125M in the above command) and implements `list models <https://platform.openai.com/docs/api-reference/models/list>`_ and `create completion <https://platform.openai.com/docs/api-reference/completions/create>`_ endpoints. We are actively adding support for more endpoints.
This server can be queried in the same format as the OpenAI API. For example, list the models:
.. code-block:: console
$ curl http://localhost:8000/v1/models
Query the model with input prompts:
.. code-block:: console
$ curl http://localhost:8000/v1/completions \
$ -H "Content-Type: application/json" \
$ -d '{
$ "model": "facebook/opt-125m",
$ "prompt": "San Francisco is a",
$ "max_tokens": 7,
$ "temperature": 0
$ }'
Since this server is compatible with the OpenAI API, you can use it as a drop-in replacement for any application that uses the OpenAI API. For example, another way to query the server is via the ``openai`` Python package:
.. code-block:: python
import openai
# Modify OpenAI's API key and API base to use vLLM's API server.
openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"
completion = openai.Completion.create(model="facebook/opt-125m",
prompt="San Francisco is a")
print("Completion result:", completion)
For a more detailed client example, refer to `examples/openai_completion_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py>`_.

View File

@ -1,81 +0,0 @@
Welcome to vLLM!
================
.. figure:: ./assets/logos/vllm-logo-text-light.png
:width: 60%
:align: center
:alt: vLLM
:class: no-scaled-link
.. raw:: html
<p style="text-align:center">
<strong>Easy, fast, and cheap LLM serving for everyone
</strong>
</p>
<p style="text-align:center">
<script async defer src="https://buttons.github.io/buttons.js"></script>
<a class="github-button" href="https://github.com/vllm-project/vllm" data-show-count="true" data-size="large" aria-label="Star">Star</a>
<a class="github-button" href="https://github.com/vllm-project/vllm/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
<a class="github-button" href="https://github.com/vllm-project/vllm/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
</p>
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
* State-of-the-art serving throughput
* Efficient management of attention key and value memory with **PagedAttention**
* Continuous batching of incoming requests
* Optimized CUDA kernels
vLLM is flexible and easy to use with:
* Seamless integration with popular HuggingFace models
* High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
* Tensor parallelism support for distributed inference
* Streaming outputs
* OpenAI-compatible API server
For more information, check out the following:
* `vLLM announcing blog post <https://vllm.ai>`_ (intro to PagedAttention)
* `vLLM paper <https://arxiv.org/abs/2309.06180>`_ (SOSP 2023)
* `How continuous batching enables 23x throughput in LLM inference while reducing p50 latency <https://www.anyscale.com/blog/continuous-batching-llm-inference>`_ by Cade Daniel et al.
Documentation
-------------
.. toctree::
:maxdepth: 1
:caption: Getting Started
getting_started/installation
getting_started/quickstart
.. toctree::
:maxdepth: 1
:caption: Serving
serving/distributed_serving
serving/run_on_sky
serving/deploying_with_triton
serving/deploying_with_docker
.. toctree::
:maxdepth: 1
:caption: Models
models/supported_models
models/adding_model
.. toctree::
:maxdepth: 1
:caption: Quantization
quantization/auto_awq

View File

@ -1,97 +0,0 @@
.. _adding_a_new_model:
Adding a New Model
==================
This document provides a high-level guide on integrating a `HuggingFace Transformers <https://github.com/huggingface/transformers>`_ model into vLLM.
.. note::
The complexity of adding a new model depends heavily on the model's architecture.
The process is considerably more straightforward if the model shares a similar architecture with an existing model in vLLM.
However, for models that include new operators (e.g., a new attention mechanism), the process can be a bit more complex.
.. tip::
If you are encountering issues while integrating your model into vLLM, feel free to open an issue on our `GitHub <https://github.com/vllm-project/vllm/issues>`_ repository.
We will be happy to help you out!
0. Fork the vLLM repository
--------------------------------
Start by forking our `GitHub <https://github.com/vllm-project/vllm/>`_ repository and then :ref:`build it from source <build_from_source>`.
This gives you the ability to modify the codebase and test your model.
1. Bring your model code
------------------------
Clone the PyTorch model code from the HuggingFace Transformers repository and put it into the `vllm/model_executor/models <https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models>`_ directory.
For instance, vLLM's `OPT model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/opt.py>`_ was adapted from HuggingFace's `modeling_opt.py <https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py>`_ file.
.. warning::
When copying the model code, make sure to review and adhere to the code's copyright and licensing terms.
2. Rewrite the :code:`forward` methods
--------------------------------------
Next, you need to rewrite the :code:`forward` methods of your model by following these steps:
1. Remove any unnecessary code, such as code that is only used for training.
2. Change the input parameters:
.. code-block:: diff
def forward(
self,
input_ids: torch.Tensor,
- attention_mask: Optional[torch.Tensor] = None,
- position_ids: Optional[torch.LongTensor] = None,
- past_key_values: Optional[List[torch.FloatTensor]] = None,
- inputs_embeds: Optional[torch.FloatTensor] = None,
- labels: Optional[torch.LongTensor] = None,
- use_cache: Optional[bool] = None,
- output_attentions: Optional[bool] = None,
- output_hidden_states: Optional[bool] = None,
- return_dict: Optional[bool] = None,
-) -> Union[Tuple, CausalLMOutputWithPast]:
+ positions: torch.Tensor,
+ kv_caches: List[KVCache],
+ input_metadata: InputMetadata,
+ cache_events: Optional[List[torch.cuda.Event]],
+) -> SamplerOutput:
3. Update the code by considering that :code:`input_ids` and :code:`positions` are now flattened tensors.
4. Replace the attention operation with either :code:`PagedAttention`, :code:`PagedAttentionWithRoPE`, or :code:`PagedAttentionWithALiBi` depending on the model's architecture.
.. note::
Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings.
If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM.
3. (Optional) Implement tensor parallelism and quantization support
-------------------------------------------------------------------
If your model is too large to fit into a single GPU, you can use tensor parallelism to manage it.
To do this, substitute your model's linear and embedding layers with their tensor-parallel versions.
For the embedding layer, you can simply replace :code:`nn.Embedding` with :code:`VocabParallelEmbedding`. For the output LM head, you can use :code:`ParallelLMHead`.
When it comes to the linear layers, we provide the following options to parallelize them:
* :code:`ReplicatedLinear`: Replicates the inputs and weights across multiple GPUs. No memory saving.
* :code:`RowParallelLinear`: The input tensor is partitioned along the hidden dimension. The weight matrix is partitioned along the rows (input dimension). An *all-reduce* operation is performed after the matrix multiplication to reduce the results. Typically used for the second FFN layer and the output linear transformation of the attention layer.
* :code:`ColumnParallelLinear`: The input tensor is replicated. The weight matrix is partitioned along the columns (output dimension). The result is partitioned along the column dimension. Typically used for the first FFN layer and the separated QKV transformation of the attention layer in the original Transformer.
* :code:`MergedColumnParallelLinear`: Column-parallel linear that merges multiple `ColumnParallelLinear` operators. Typically used for the first FFN layer with weighted activation functions (e.g., SiLU). This class handles the sharded weight loading logic of multiple weight matrices.
* :code:`QKVParallelLinear`: Parallel linear layer for the query, key, and value projections of the multi-head and grouped-query attention mechanisms. When the number of key/value heads is less than the world size, this class replicates the key/value heads properly. This class handles the weight loading and replication of the weight matrices.
Note that all of the linear layers above take `linear_method` as an input. vLLM sets this parameter according to the chosen quantization scheme to support weight quantization.
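To make the partitioning scheme concrete, the following is a small, self-contained sketch that uses plain PyTorch tensors (no distributed runtime, and not vLLM's actual classes). The first layer is sharded column-wise and the second row-wise; summing the per-shard partial outputs, which is the role played by the all-reduce inside :code:`RowParallelLinear`, reproduces the unsharded result.
.. code-block:: python

    import torch

    torch.manual_seed(0)
    tp = 2                                   # simulated tensor-parallel world size
    x = torch.randn(4, 16)                   # (batch, hidden)
    w1 = torch.randn(64, 16)                 # first FFN layer  -> column-parallel
    w2 = torch.randn(16, 64)                 # second FFN layer -> row-parallel

    # Unsharded reference forward pass.
    ref = torch.relu(x @ w1.t()) @ w2.t()

    partials = []
    for rank in range(tp):
        w1_shard = w1.chunk(tp, dim=0)[rank]      # shard along the output dimension
        w2_shard = w2.chunk(tp, dim=1)[rank]      # shard along the input dimension
        h = torch.relu(x @ w1_shard.t())          # stays sharded, no communication
        partials.append(h @ w2_shard.t())         # partial output per rank
    out = sum(partials)                           # stands in for the all-reduce

    print(torch.allclose(ref, out, atol=1e-4))    # True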
4. Implement the weight loading logic
-------------------------------------
You now need to implement the :code:`load_weights` method in your :code:`*ForCausalLM` class.
This method should load the weights from the HuggingFace checkpoint file and assign them to the corresponding layers in your model. Specifically, for `MergedColumnParallelLinear` and `QKVParallelLinear` layers, if the original model stores the weight matrices separately, you need to load the different parts separately.
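As a rough illustration only (the checkpoint keys and slicing below are hypothetical, not vLLM's actual weight-loading API), separate :code:`q_proj`/:code:`k_proj`/:code:`v_proj` checkpoint matrices can be copied into the single merged parameter owned by a fused QKV layer like this:
.. code-block:: python

    import torch

    hidden = 16
    # Hypothetical HuggingFace-style checkpoint with separate projections.
    checkpoint = {
        "q_proj.weight": torch.randn(hidden, hidden),
        "k_proj.weight": torch.randn(hidden, hidden),
        "v_proj.weight": torch.randn(hidden, hidden),
    }

    # The merged tensor that a fused QKV linear layer would own.
    qkv_weight = torch.empty(3 * hidden, hidden)

    # Copy each part into its slice of the merged tensor; a real load_weights
    # would also narrow each slice to the current tensor-parallel rank.
    slot = {"q_proj.weight": 0, "k_proj.weight": 1, "v_proj.weight": 2}
    for name, weight in checkpoint.items():
        i = slot[name]
        qkv_weight[i * hidden:(i + 1) * hidden].copy_(weight)

    print(qkv_weight.shape)   # torch.Size([48, 16])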
5. Register your model
----------------------
Finally, include your :code:`*ForCausalLM` class in `vllm/model_executor/models/__init__.py <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/__init__.py>`_ and register it to the :code:`_MODEL_REGISTRY` in `vllm/model_executor/model_loader.py <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/model_loader.py>`_.

View File

@ -1,98 +0,0 @@
.. _supported_models:
Supported Models
================
vLLM supports a variety of generative Transformer models in `HuggingFace Transformers <https://huggingface.co/models>`_.
The following is the list of model architectures that are currently supported by vLLM.
Alongside each architecture, we include some popular models that use it.
.. list-table::
:widths: 25 25 50
:header-rows: 1
* - Architecture
- Models
- Example HuggingFace Models
* - :code:`AquilaForCausalLM`
- Aquila
- :code:`BAAI/Aquila-7B`, :code:`BAAI/AquilaChat-7B`, etc.
* - :code:`BaiChuanForCausalLM`
- Baichuan
- :code:`baichuan-inc/Baichuan-7B`, :code:`baichuan-inc/Baichuan-13B-Chat`, etc.
* - :code:`ChatGLMModel`
- ChatGLM
- :code:`THUDM/chatglm2-6b`, :code:`THUDM/chatglm3-6b`, etc.
* - :code:`BloomForCausalLM`
- BLOOM, BLOOMZ, BLOOMChat
- :code:`bigscience/bloom`, :code:`bigscience/bloomz`, etc.
* - :code:`FalconForCausalLM`
- Falcon
- :code:`tiiuae/falcon-7b`, :code:`tiiuae/falcon-40b`, :code:`tiiuae/falcon-rw-7b`, etc.
* - :code:`GPT2LMHeadModel`
- GPT-2
- :code:`gpt2`, :code:`gpt2-xl`, etc.
* - :code:`GPTBigCodeForCausalLM`
- StarCoder, SantaCoder, WizardCoder
- :code:`bigcode/starcoder`, :code:`bigcode/gpt_bigcode-santacoder`, :code:`WizardLM/WizardCoder-15B-V1.0`, etc.
* - :code:`GPTJForCausalLM`
- GPT-J
- :code:`EleutherAI/gpt-j-6b`, :code:`nomic-ai/gpt4all-j`, etc.
* - :code:`GPTNeoXForCausalLM`
- GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM
- :code:`EleutherAI/gpt-neox-20b`, :code:`EleutherAI/pythia-12b`, :code:`OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5`, :code:`databricks/dolly-v2-12b`, :code:`stabilityai/stablelm-tuned-alpha-7b`, etc.
* - :code:`InternLMForCausalLM`
- InternLM
- :code:`internlm/internlm-7b`, :code:`internlm/internlm-chat-7b`, etc.
* - :code:`LlamaForCausalLM`
- LLaMA, LLaMA-2, Vicuna, Alpaca, Koala, Guanaco
- :code:`meta-llama/Llama-2-13b-hf`, :code:`meta-llama/Llama-2-70b-hf`, :code:`openlm-research/open_llama_13b`, :code:`lmsys/vicuna-13b-v1.3`, :code:`young-geng/koala`, etc.
* - :code:`MistralForCausalLM`
- Mistral, Mistral-Instruct
- :code:`mistralai/Mistral-7B-v0.1`, :code:`mistralai/Mistral-7B-Instruct-v0.1`, etc.
* - :code:`MPTForCausalLM`
- MPT, MPT-Instruct, MPT-Chat, MPT-StoryWriter
- :code:`mosaicml/mpt-7b`, :code:`mosaicml/mpt-7b-storywriter`, :code:`mosaicml/mpt-30b`, etc.
* - :code:`OPTForCausalLM`
- OPT, OPT-IML
- :code:`facebook/opt-66b`, :code:`facebook/opt-iml-max-30b`, etc.
* - :code:`PhiForCausalLM`
- Phi-1.5
- :code:`microsoft/phi-1_5`, etc.
* - :code:`QWenLMHeadModel`
- Qwen
- :code:`Qwen/Qwen-7B`, :code:`Qwen/Qwen-7B-Chat`, etc.
* - :code:`YiForCausalLM`
- Yi
- :code:`01-ai/Yi-6B`, :code:`01-ai/Yi-34B`, etc.
If your model uses one of the above model architectures, you can seamlessly run your model with vLLM.
Otherwise, please refer to :ref:`Adding a New Model <adding_a_new_model>` for instructions on how to implement support for your model.
Alternatively, you can raise an issue on our `GitHub <https://github.com/vllm-project/vllm/issues>`_ project.
.. tip::
The easiest way to check if your model is supported is to run the program below:
.. code-block:: python
from vllm import LLM
llm = LLM(model=...) # Name or path of your model
output = llm.generate("Hello, my name is")
print(output)
To use a model from www.modelscope.cn:
.. code-block:: shell
$ export VLLM_USE_MODELSCOPE=True
.. code-block:: python
from vllm import LLM
llm = LLM(model=..., revision=..., trust_remote_code=True) # Name or path of your model
output = llm.generate("Hello, my name is")
print(output)
If vLLM successfully generates text, it indicates that your model is supported.

View File

@ -1,69 +0,0 @@
.. _auto_awq:
AutoAWQ
==================
To create a new 4-bit quantized model, you can leverage `AutoAWQ <https://github.com/casper-hansen/AutoAWQ>`_.
Quantizing reduces the model's precision from FP16 to INT4, which effectively reduces the file size by ~70%.
The main benefits are lower latency and memory usage.
You can quantize your own models by installing AutoAWQ or picking one of the `400+ models on Huggingface <https://huggingface.co/models?sort=trending&search=awq>`_.
.. code-block:: console
$ pip install autoawq
After installing AutoAWQ, you are ready to quantize a model. Here is an example of how to quantize Vicuna 7B v1.5:
.. code-block:: python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path, **{"low_cpu_mem_usage": True})
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
To run an AWQ model with vLLM, you can use `TheBloke/Llama-2-7b-Chat-AWQ <https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ>`_ with the following command:
.. code-block:: console
$ python examples/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
AWQ models are also supported directly through the LLM entrypoint:
.. code-block:: python
from vllm import LLM, SamplingParams
# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="AWQ")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

View File

@ -1,21 +0,0 @@
.. _deploying_with_docker:
Deploying with Docker
============================
You can build and run vLLM from source via the provided Dockerfile. To build vLLM:
.. code-block:: console
$ DOCKER_BUILDKIT=1 docker build . --target vllm --tag vllm --build-arg max_jobs=8
To run vLLM:
.. code-block:: console
$ docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
vllm <args...>

View File

@ -1,6 +0,0 @@
.. _deploying_with_triton:
Deploying with NVIDIA Triton
============================
The `Triton Inference Server <https://github.com/triton-inference-server>`_ hosts a tutorial demonstrating how to quickly deploy a simple `facebook/opt-125m <https://huggingface.co/facebook/opt-125m>`_ model using vLLM. Please see `Deploying a vLLM model in Triton <https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md#deploying-a-vllm-model-in-triton>`_ for more details.

View File

@ -1,38 +0,0 @@
.. _distributed_serving:
Distributed Inference and Serving
=================================
vLLM supports distributed tensor-parallel inference and serving. Currently, we support `Megatron-LM's tensor parallel algorithm <https://arxiv.org/pdf/1909.08053.pdf>`_. We manage the distributed runtime with `Ray <https://github.com/ray-project/ray>`_. To run distributed inference, install Ray with:
.. code-block:: console
$ pip install ray
To run multi-GPU inference with the :code:`LLM` class, set the :code:`tensor_parallel_size` argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs:
.. code-block:: python
from vllm import LLM
llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
output = llm.generate("San Francisco is a")
To run multi-GPU serving, pass in the :code:`--tensor-parallel-size` argument when starting the server. For example, to run API server on 4 GPUs:
.. code-block:: console
$ python -m vllm.entrypoints.api_server \
$ --model facebook/opt-13b \
$ --tensor-parallel-size 4
To scale vLLM beyond a single machine, start a `Ray runtime <https://docs.ray.io/en/latest/ray-core/starting-ray.html>`_ via CLI before running vLLM:
.. code-block:: console
$ # On head node
$ ray start --head
$ # On worker nodes
$ ray start --address=<ray-head-address>
After that, you can run inference and serving across multiple machines by launching the vLLM process on the head node and setting :code:`tensor_parallel_size` to the total number of GPUs across all machines.
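For example, assuming two 8-GPU machines have joined the same Ray cluster, the head node could run:
.. code-block:: python

    from vllm import LLM

    # 16 = total number of GPUs across both machines in the Ray cluster.
    llm = LLM("facebook/opt-13b", tensor_parallel_size=16)
    output = llm.generate("San Francisco is a")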

View File

@ -1,69 +0,0 @@
.. _on_cloud:
Running on clouds with SkyPilot
===============================
.. raw:: html
<p align="center">
<img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/>
</p>
vLLM can be run on the cloud to scale to multiple GPUs with `SkyPilot <https://github.com/skypilot-org/skypilot>`__, an open-source framework for running LLMs on any cloud.
To install SkyPilot and set up your cloud credentials, run:
.. code-block:: console
$ pip install skypilot
$ sky check
See the vLLM SkyPilot YAML for serving, `serving.yaml <https://github.com/skypilot-org/skypilot/blob/master/llm/vllm/serve.yaml>`__.
.. code-block:: yaml
resources:
accelerators: A100
envs:
MODEL_NAME: decapoda-research/llama-13b-hf
TOKENIZER: hf-internal-testing/llama-tokenizer
setup: |
conda create -n vllm python=3.9 -y
conda activate vllm
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install .
pip install gradio
run: |
conda activate vllm
echo 'Starting vllm api server...'
python -u -m vllm.entrypoints.api_server \
--model $MODEL_NAME \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--tokenizer $TOKENIZER 2>&1 | tee api_server.log &
echo 'Waiting for vllm api server to start...'
while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
echo 'Starting gradio server...'
python vllm/examples/gradio_webserver.py
Start serving the LLaMA-13B model on an A100 GPU:
.. code-block:: console
$ sky launch serving.yaml
Check the output of the command. There will be a shareable Gradio link (like the last line of the following). Open it in your browser to use the LLaMA model for text completion.
.. code-block:: console
(task, pid=7431) Running on public URL: https://<gradio-hash>.gradio.live
**Optional**: Serve the 65B model instead of the default 13B and use more GPUs:
.. code-block:: console
sky launch -c vllm-serve-new -s serve.yaml --gpus A100:8 --env MODEL_NAME=decapoda-research/llama-65b-hf

View File

@ -1,77 +0,0 @@
"""Example Python client for vllm.entrypoints.api_server"""
import argparse
import json
from typing import Iterable, List
import requests
def clear_line(n: int = 1) -> None:
LINE_UP = '\033[1A'
LINE_CLEAR = '\x1b[2K'
for _ in range(n):
print(LINE_UP, end=LINE_CLEAR, flush=True)
def post_http_request(prompt: str,
api_url: str,
n: int = 1,
stream: bool = False) -> requests.Response:
headers = {"User-Agent": "Test Client"}
pload = {
"prompt": prompt,
"n": n,
"use_beam_search": True,
"temperature": 0.0,
"max_tokens": 16,
"stream": stream,
}
response = requests.post(api_url, headers=headers, json=pload, stream=True)
return response
def get_streaming_response(response: requests.Response) -> Iterable[List[str]]:
for chunk in response.iter_lines(chunk_size=8192,
decode_unicode=False,
delimiter=b"\0"):
if chunk:
data = json.loads(chunk.decode("utf-8"))
output = data["text"]
yield output
def get_response(response: requests.Response) -> List[str]:
data = json.loads(response.content)
output = data["text"]
return output
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--host", type=str, default="localhost")
parser.add_argument("--port", type=int, default=8000)
parser.add_argument("--n", type=int, default=4)
parser.add_argument("--prompt", type=str, default="San Francisco is a")
parser.add_argument("--stream", action="store_true")
args = parser.parse_args()
prompt = args.prompt
api_url = f"http://{args.host}:{args.port}/generate"
n = args.n
stream = args.stream
print(f"Prompt: {prompt!r}\n", flush=True)
response = post_http_request(prompt, api_url, n, stream)
if stream:
num_printed_lines = 0
for h in get_streaming_response(response):
clear_line(num_printed_lines)
num_printed_lines = 0
for i, line in enumerate(h):
num_printed_lines += 1
print(f"Beam candidate {i}: {line!r}", flush=True)
else:
output = get_response(response)
for i, line in enumerate(output):
print(f"Beam candidate {i}: {line!r}", flush=True)

View File

@ -1,52 +0,0 @@
import argparse
import json
import gradio as gr
import requests
def http_bot(prompt):
headers = {"User-Agent": "vLLM Client"}
pload = {
"prompt": prompt,
"stream": True,
"max_tokens": 128,
}
response = requests.post(args.model_url,
headers=headers,
json=pload,
stream=True)
for chunk in response.iter_lines(chunk_size=8192,
decode_unicode=False,
delimiter=b"\0"):
if chunk:
data = json.loads(chunk.decode("utf-8"))
output = data["text"][0]
yield output
def build_demo():
with gr.Blocks() as demo:
gr.Markdown("# vLLM text completion demo\n")
inputbox = gr.Textbox(label="Input",
placeholder="Enter text and press ENTER")
outputbox = gr.Textbox(label="Output",
placeholder="Generated result from the model")
inputbox.submit(http_bot, [inputbox], [outputbox])
return demo
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--host", type=str, default=None)
parser.add_argument("--port", type=int, default=8001)
parser.add_argument("--model-url",
type=str,
default="http://localhost:8000/generate")
args = parser.parse_args()
demo = build_demo()
demo.queue(concurrency_count=100).launch(server_name=args.host,
server_port=args.port,
share=True)

View File

@ -1,59 +0,0 @@
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
torch.set_default_device("cuda")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
prompts = [
# "input: 请告诉我把大象放进冰箱需要几步output:",
# "input:请给我写一首描绘极寒的冬天的古诗 output:",
# "input:周杰伦都写了哪些歌呢?尽可能多地列举一些 output:",
# "input:铁做氧化还原反应的化学方程式是什么 output:",
# "input:linux里 watch -n1 file.txt命令是什么意思 output:",
# "input:帮我计算一下 12 + 13 - 2 * 5 = ? 请一步一步来 output:",
# "input:这段python代码 ```def add_number_minus1(x, y):\n return x + y - 1\n``` 用C语言应该怎么写output:",
# "input:假唱是指什么?请剖析五月天假唱的正面影响和负面影响, 他们应该假唱吗? output:",
"Q:Which songs has Jay Chou written? Please list as many as possible. A:",
"Q:What is the chemical equation for the redox reaction of iron? A:",
"Q:Tell me a joke about a classmate who wanted to go to the bathroom during class. A:",
"Q:What does lip-syncing mean? Please analyze the positive and negative impacts of the band Mayday lip-syncing. Should they lip-sync? A:"
]
prompts = [
"<用户>Write five words that start with “en”, then write the result of “77+33”<AI>",
"<用户>Here we have a book, 9 eggs, a laptop, a bottle and a nail. Please tell me how to stack them onto each other in a stable manner.<AI>",
"<用户>A group of students are planning to go on a field trip to a museum. They need to decide how many buses to rent and how to divide the students among the buses. Each bus can hold up to 40 students, but the museum can only accommodate 120 students at a time. The group has a budget of $800 for the bus rental, and each bus costs $200 per day. How many buses should the group rent, and how many students should go on each bus? Explain your reasoning.<AI>",
"""<用户>Selena, Jennifer and Miley wear a blue dress, yellow dress, and green dress in an unknown order. It is known that:
1) If Selena wears blue, then Jennifer wears green.
2) If Selena wears yellow, then Miley wears green.
3) If Jennifer does not wear yellow, then Miley wears blue.
What is the color of the dress Selena is wearing?<AI>""",
"""<用户>Given the following premise:
It is given that Amy, Bernadette and Penny are good friends of Sheldon and Leonard.
Leslie is a good friend of Leonard.
We can deduce that Leslie is a friend of Sheldon as well.
Does this deduction follow the premise? Explain your reasoning.<AI>""",
"""<用户>A group of five friends are going to watch a movie together. They have to choose between three genres: comedy, horror, and action. Each friend has a different preference for the genre. Here are some clues to help you figure out their preferences: Alice likes comedy more than horror, but less than action. Bob likes horror more than comedy, but less than action. Carol likes action more than horror, but less than comedy. David likes comedy more than action, but less than horror. Eve likes horror more than action, but less than comedy. What is the order of preference for each friend from most to least liked genre? Write your answer using the following format: Friend: Genre > Genre > Genre <AI>""",
"<用户>If you were in a race and passed the person in second place, what place would you be in now?<AI>",
"<用户>Which one is more south? California or New York?<AI>",
"<用户>linux里 watch -n1 file.txt命令是什么意思<AI>",
"<用户>Translate this sentence into Russian: '如何用Python创建一个简单的网页爬虫'.<AI>",
"""<用户>Translate this sentence into French: "I am a fresh man on Chinese, do you know how this sentence is translated: 如何用Python创建一个简单的网页爬虫" <AI>""",
"<用户>Micro-expressions mean that people express their inner feelings to each other by making some expressions.Between different expressions made by people or in a certain expression, the face will \"leak\" other information.The shortest-lasting micro-expressions can last 1 / 25 seconds, although a subconscious expression may only last for a moment, it is easy to expose emotions.When the face is making an expression, these extremely short-term expressions suddenly flash by, and sometimes the opposite mood.\nAccording to the above definition, which of the following is a micro-expression?\nA.After Wang was frightened, his face continued to twitch\nB.The spy sends a message to associates in the blink of an eye\nC.The sales clerk may flash a contemptuous smirk when he smiles in front of a shabby customer.\nD.Walking against the biting cold wind, Xiao Li's upper and lower teeth kept shaking and colliding\nA:<AI>",
"<用户>As brother was half her age when she was 6 how old is her brother when shes 42?<AI>",
]
for sent in prompts:
inputs = tokenizer(sent, return_tensors="pt", return_attention_mask=False)
outputs = model.generate(**inputs, max_length=200)
text = tokenizer.batch_decode(outputs)[0]
print("-"*20)
print(text)
print("="*20)

View File

@ -1,59 +0,0 @@
from vllm import LLM, SamplingParams
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--model_path", type=str, default="")
parser.add_argument("--prompt_path", type=str, default="")
args = parser.parse_args()
with open(args.prompt_path, "r") as f:
prompts = f.readlines()
prompt_template = "<用户>{}<AI>"
prompts = [prompt_template.format(prompt.strip()) for prompt in prompts]
params_dict = {
"n": 1,
"best_of": None,
"presence_penalty": 0.0,
"frequency_penalty": 0.0,
"repetition_penalty": 1.0,
"temperature": 1.0,
"top_p": 0.5,
"top_k": -1,
"use_beam_search": False,
"length_penalty": 1.5,
"early_stopping": False,
"stop": None,
"stop_token_ids": None,
"ignore_eos": False,
"max_tokens": 1000,
"logprobs": None,
"prompt_logprobs": None,
"skip_special_tokens": True,
}
# Create a sampling params object.
sampling_params = SamplingParams(**params_dict)
# Create an LLM.
llm = LLM(model=args.model_path, tensor_parallel_size=1, dtype='bfloat16')
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
for prompt in prompts:
outputs = llm.generate(prompt, sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print("================")
# find the first <用户> and remove the text before it.
clean_prompt = prompt[prompt.find("<用户>")+len("<用户>"):]
print(f"""<用户>: {clean_prompt.replace("<AI>", "")}""")
print(f"<AI>:")
print(generated_text)

View File

@ -1,67 +0,0 @@
from vllm import LLM, SamplingParams
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--model_path", type=str, default="")
parser.add_argument("--prompt_path", type=str, default="")
parser.add_argument("--output_path", type=str, default="")
args = parser.parse_args()

with open(args.prompt_path, "r") as f:
    data_list = json.load(f)

prompts = [data["prompt"] for data in data_list]
prompt_template = "{}"
prompts = [prompt_template.format(prompt.strip()) for prompt in prompts]

params_dict = {
    "n": 1,
    "best_of": None,
    "presence_penalty": 1.0,
    "frequency_penalty": 0.0,
    "temperature": 0.3,
    "top_p": 0.8,
    "top_k": -1,
    "use_beam_search": False,
    "length_penalty": 1.0,
    "early_stopping": False,
    "stop": None,
    "stop_token_ids": None,
    "ignore_eos": False,
    "max_tokens": 1000,
    "logprobs": None,
    "prompt_logprobs": None,
    "skip_special_tokens": True,
}

# Create a sampling params object.
sampling_params = SamplingParams(**params_dict)

# Create an LLM.
llm = LLM(model=args.model_path, tensor_parallel_size=1, dtype='bfloat16')

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
for data in data_list:
    prompt = data["prompt"]
    outputs = llm.generate(prompt, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print("================")
        # Find the first <用户> marker and drop the text before it.
        clean_prompt = prompt[prompt.find("<用户>") + len("<用户>"):]
        print(f"""<用户>: {clean_prompt.replace("<AI>", "")}""")
        print("<AI>:")
        print(generated_text)
        data['cpm_new'] = generated_text
    with open(args.prompt_path + args.output_path, "a") as f:
        f.write(json.dumps(data, ensure_ascii=False, indent=4))
        f.write(",\n")

View File

@ -1,48 +0,0 @@
================
<用户>: 你是谁?
<AI>: ' 我是一个大型语言模型。'
================
<用户>: Who are you?
<AI>: ' I am an AI language model created by OpenAI. My purpose is to assist you in generating human-like text based on the prompts you provide. I do not have personal feelings or emotions, but I am designed to understand and respond to your questions and requests.'
================
<用户>: 给我写一份去东京旅游的游记,要求每句话都以“哈哈”开头。
<AI>: ' 哈哈,终于决定去东京旅游了!这次旅行让我充满了期待和兴奋。\n\n哈哈一到达东京我就被这座城市的繁华和活力所吸引。我立刻去了新宿站这里人山人海热闹非凡。\n\n哈哈我去了东京塔站在塔顶上俯瞰整个东京市感觉非常震撼。\n\n哈哈我还去了浅草寺这里有着悠久的历史和文化底蕴让我感受到了日本传统文化的魅力。\n\n哈哈在东京的街头巷尾我品尝了各种美食比如寿司、拉面、天妇罗等等每一口都让我回味无穷。\n\n哈哈我还去了迪士尼乐园这里的游乐设施和表演都非常精彩让我度过了一个难忘的假期。\n\n哈哈最后一天我在银座购物买了一些纪念品和礼物准备带回家和家人分享。\n\n哈哈这次东京之行让我收获了很多美好的回忆和体验我会一直珍藏它们。'
================
<用户>: 假唱是指什么?请剖析五月天假唱的负面影响, 他们应该假唱吗?
<AI>: ' 假唱是指歌手在演唱时使用录音或现场效果器代替真实的声音。五月天是台湾著名的摇滚乐队,他们曾多次被质疑假唱。\n\n假唱的负面影响主要有以下几点\n\n1. 影响观众体验:假唱会使观众感受到音乐的真实性和情感的缺失,影响他们的观赏体验。\n2. 损害歌手形象:假唱会降低歌手的专业素养和艺术水准,损害其形象和声誉。\n3. 影响音乐产业:假唱会破坏音乐产业的生态平衡,导致原创音乐受到排挤,对音乐产业的发展造成负面影响。\n\n五月天是否应该假唱是一个有争议的问题。一方面假唱可以保证演唱会的顺利进行避免因歌手嗓子不好而影响演出效果。另一方面假唱也会损害歌手的形象和声誉降低观众对音乐的信任度。因此五月天是否应该假唱需要权衡利弊综合考虑各种因素。'
================
<用户>: 请告诉我把大象放进冰箱需要几步?
<AI>: ' 把大象放进冰箱是一个不可能完成的任务,因为大象的尺寸远远超过冰箱的容量。因此,这个问题实际上是不存在的。'
================
<用户>: 请给我写一首描绘极寒的冬天的七言绝句。
<AI>: ' 雪花纷飞白茫茫,\n寒风凛冽刺骨寒。\n枯枝落叶随风舞\n冬日寒意入心间。'
================
<用户>: 周杰伦都写了哪些歌呢?尽可能多地列举一些
<AI>: ' 周杰伦是一位非常著名的华语流行歌手和音乐制作人,他的歌曲涵盖了多个风格和主题。以下是一些周杰伦的经典歌曲:\n\n1. 《青花瓷》\n2. 《千里之外》\n3. 《爱情转移》\n4. 《霍元甲》\n5. 《晴天》\n6. 《爱在西元前》\n7. 《夜曲》\n8. 《东风破》\n9. 《双截棍》\n10. 《倒带》\n11. 《不该》\n12. 《不能说的秘密》\n13. 《告白气球》\n14. 《发如雪》\n15. 《不该让你等》\n16. 《简单爱》\n17. 《等你下课》\n18. 《世界末日》\n19. 《听妈妈的话》\n20. 《稻香》\n\n以上只是周杰伦众多经典歌曲中的一部分他的音乐作品非常丰富多样涵盖了多种风格和主题包括摇滚、流行、民谣、电子等。'
================
<用户>: 铁做氧化还原反应的化学方程式是什么
<AI>: ' 铁Fe与氧气O2反应生成氧化铁Fe2O3的化学方程式如下\n\n4 Fe + 3 O2 → 2 Fe2O3\n\n这个方程式表示了铁和氧气在适当条件下发生氧化还原反应的过程。在这个反应中铁原子失去电子被氧化为三价铁离子Fe3+而氧气分子接受这些电子被还原为氧离子O2-)。'
================
<用户>: linux里 watch -n1 file.txt命令是什么意思
<AI>: ' `watch -n1 file.txt` 命令在 Linux 系统中用于监视文件的变化。其中,`watch` 命令是一个用于监视文件变化的工具,`-n` 表示每秒钟检查一次文件的变化,而 `1` 表示检查的时间间隔为 1 秒。\n因此`watch -n1 file.txt` 命令的含义是每秒钟检查一次 `file.txt` 文件的变化。如果文件发生了任何变化,命令会立即输出变化的内容。'
================
<用户>: Translate this sentence into Russian: '如何用Python创建一个简单的网页爬虫'.
<AI>: ' Как создать простый скрипт для сбора данных из веб-страниц, используя Python?'
================
<用户>: Translate this sentence into French: "I am a fresh man on Chinese, do you know how this sentence is translated: 如何用Python创建一个简单的网页爬虫"
<AI>: ' "Je suis un nouveau-bien dans le chinois, savez-vous comment traduire cette phrase en Python : Comment créer un simple web scraper ?"'
================
<用户>: Micro-expressions mean that people express their inner feelings to each other by making some expressions.Between different expressions made by people or in a certain expression, the face will "leak" other information.The shortest-lasting micro-expressions can last 1 / 25 seconds, although a subconscious expression may only last for a moment, it is easy to expose emotions.When the face is making an expression, these extremely short-term expressions suddenly flash by, and sometimes the opposite mood.
According to the above definition, which of the following is a micro-expression?
A.After Wang was frightened, his face continued to twitch
B.The spy sends a message to associates in the blink of an eye
C.The sales clerk may flash a contemptuous smirk when he smiles in front of a shabby customer.
D.Walking against the biting cold wind, Xiao Li's upper and lower teeth kept shaking and colliding
A:
<AI>: " Micro-expressions, as defined in the question, are brief, involuntary facial expressions that reveal an individual's inner feelings to others. These expressions can last from 1/25 seconds to a moment but are often easily detectable due to their brevity and the fact that they can be triggered by subconscious emotions.\n\nLet's analyze each option:\n\nA. After Wang was frightened, his face continued to twitch - This is not a micro-expression because it describes a continuous action (twitching) rather than a brief expression.\n\nB. The spy sends a message to associates in the blink of an eye - This is not a micro-expression either. It refers to a quick action or communication, not a facial expression.\n\nC. The sales clerk may flash a contemptuous smirk when he smiles in front of a shabby customer - This is a micro-expression because it describes a brief, involuntary change in facial expression that reveals the sales clerk's underlying emotion (contempt).\n\nD. Walking against the biting cold wind, Xiao Li's upper and lower teeth kept shaking and colliding - This is not a micro-expression either. It describes a physical reaction to cold weather rather than a facial expression revealing an emotional state.\n\nTherefore, the correct answer is C, as it is the only option that clearly describes a brief, involuntary change in facial expression that reveals an underlying emotion..\n\nThe answer is C."
================
<用户>: How many legs did a three-legged dog have **before** it lost one leg?
<AI>: ' A three-legged dog had four legs before it lost one leg.'
================
<用户>: As brother was half her age when she was 6 how old is her brother when shes 42?
<AI>: " Step 1: First, let's calculate the age difference between A and her brother. When A was 6, her brother was half her age, so her brother was 6 / 2 = 3 years old at that time.\nStep 2: Now, let's calculate the current age of A when she is 42. The age difference between A and her brother remains the same, so when A is 42, her brother would be 42 - 3 = 39 years old.\n\nAnswer:\n39"

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@ -1,9 +0,0 @@
{
"idx": 28,
"prompt": "<用户>晚上睡不着,可以写哪些简短高级的朋友圈文案?<AI>",
"response": "1. 月儿弯弯照九州,露珠滴答心悠游。 \n2. 星空璀璨,我在寻找一颗属于你的星星。 \n3. 月亮悄悄落下,悄悄落在你的梦里。 \n4. 夜空中最亮的星,只为你而闪烁。 \n5. 黑夜温柔,我在心头的星河里寻找你。 \n6. 睡不着,就让我在这银色的夜晚里,为你编织一个美丽的梦。 \n7. 黑暗中,我在寻找那一道属于你的光芒。 \n8. 静谧的夜晚,月光下的思绪,只为你一人。 \n9. 月儿弯弯挂天边,星河闪烁伴你眠。 \n10. 夜幕降临,我在这无尽的星空里,为你点亮一盏希望的灯。 \n11. 月儿轻轻弯,星光点点伴你眠。 \n12. 黑夜深邃,我在这辽阔的星海里,寻找你的踪迹。 \n13. 月儿弯弯挂夜空,思绪万千伴你眠。 \n14. 月儿弯弯照九州,星光闪烁伴你眠。 \n15. 夜色降临,我在这星空之下,为你编织一个浪漫的梦境。 \n16. 月儿弯弯照九州,星河璀璨伴你眠。 \n17. 月亮弯弯照九州,星空璀璨伴你眠。 \n18. 月儿弯弯挂夜空,思绪万千伴你眠。 \n19. 夜晚寂静,我在这星空之中,为你点亮一盏希望的灯。 \n20. 月儿弯弯挂夜空,星河闪烁伴你眠。",
"sample": "<用户>晚上睡不着,可以写哪些简短高级的朋友圈文案?<AI>1. 月儿弯弯照九州,露珠滴答心悠游。 \n2. 星空璀璨,我在寻找一颗属于你的星星。 \n3. 月亮悄悄落下,悄悄落在你的梦里。 \n4. 夜空中最亮的星,只为你而闪烁。 \n5. 黑夜温柔,我在心头的星河里寻找你。 \n6. 睡不着,就让我在这银色的夜晚里,为你编织一个美丽的梦。 \n7. 黑暗中,我在寻找那一道属于你的光芒。 \n8. 静谧的夜晚,月光下的思绪,只为你一人。 \n9. 月儿弯弯挂天边,星河闪烁伴你眠。 \n10. 夜幕降临,我在这无尽的星空里,为你点亮一盏希望的灯。 \n11. 月儿轻轻弯,星光点点伴你眠。 \n12. 黑夜深邃,我在这辽阔的星海里,寻找你的踪迹。 \n13. 月儿弯弯挂夜空,思绪万千伴你眠。 \n14. 月儿弯弯照九州,星光闪烁伴你眠。 \n15. 夜色降临,我在这星空之下,为你编织一个浪漫的梦境。 \n16. 月儿弯弯照九州,星河璀璨伴你眠。 \n17. 月亮弯弯照九州,星空璀璨伴你眠。 \n18. 月儿弯弯挂夜空,思绪万千伴你眠。 \n19. 夜晚寂静,我在这星空之中,为你点亮一盏希望的灯。 \n20. 月儿弯弯挂夜空,星河闪烁伴你眠。",
"cpm_input": "<用户>晚上睡不着,可以写哪些简短高级的朋友圈文案?<AI>",
"cpm_output": " 1. \"月黑风高夜,思绪万千涌。心若止水,独享宁静。愿此刻,你的梦境如星辰般璀璨。\"\n\n2. \"夜幕降临,繁星闪烁。在这个宁静的夜晚,愿我的文字能为你带来一丝慰藉。\"\n\n3. \"风轻轻吹过,月儿弯弯。思绪万千,思绪万千。愿今晚的宁静能陪伴你入梦。\"\n\n4. \"夜深人静,微风拂面。愿你的梦境如月光般温柔,如清风般宁静。\"\n\n5. \"此刻,我与你共度夜晚。愿你的梦境如繁星般闪耀,如月光般宁静。\"\n\n6. \"月圆之夜,愿你的梦境如星辰般璀璨,如月光般宁静。愿你的心灵在这宁静的夜晚得到宁静。\"\n\n7. \"夜深人静,月儿弯弯。愿你的梦境如繁星般闪耀,如月光般宁静。愿你的心灵在这宁静的夜晚得到宁静。\"\n\n8. \"月黑风高夜,愿你的梦境如星辰般璀璨,如月光般宁静。愿你的心灵在这宁静的夜晚得到宁静。\"\n\n9. \"月圆之夜,愿你的梦境如繁星般闪耀,如月光般宁静。愿你的心灵在这宁静的夜晚得到宁静。\"\n\n10. \"夜深人静,愿你的梦境如繁星般闪耀,如月光般宁静。愿你的心灵在这宁静的夜晚得到宁静。\"",
"cpm_new": " 1. \"夜深人静,思绪万千。愿这宁静的夜晚,能让我在文字中寻找心灵的慰藉。\"\n2. \"失眠的夜晚,我选择用文字来填补内心的空虚。愿每个字都能成为你心灵的一束光。\"\n3. \"在这个寂静的夜晚,我想用我的笔,为你编织一个梦境,让你在梦中感受到温暖与安宁。\"\n4. \"在这宁静的夜晚,我与你分享我的孤独与寂寞,愿我们的文字能成为彼此心灵的寄托。\"\n5. \"失眠的夜晚,我选择用文字来表达我对生活的热爱和执着。愿每个字都能成为你心中的一抹亮色。\"\n6. \"在这个安静的夜晚,我与你共享这份孤独,愿我们的文字能成为彼此心灵的慰藉。\"\n7. \"在这个寂静的夜晚,我选择用文字来诉说我的思念与渴望。愿每个字都能成为你心中的一丝温暖。\"\n8. \"在这个安静的夜晚,我与你分享我的梦想与期待。愿每个字都能成为你心灵的一束光。\"\n9. \"在这个寂静的夜晚,我选择用文字来记录生活的点滴,愿每个字都能成为你心中的一抹色彩。\"\n10. \"在这个安静的夜晚,我与你共享这份孤独,愿我们的文字能成为彼此心灵的寄托。\""
},

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

Some files were not shown because too many files have changed in this diff