diff --git a/README.md b/README.md
index 3810c96..9974a9c 100644
--- a/README.md
+++ b/README.md
@@ -23,14 +23,13 @@ Our vision for KTransformers is to serve as a flexible platform for experimentin
@@ -69,7 +68,7 @@ https://github.com/user-attachments/assets/4c6a8a38-05aa-497d-8eb1-3a5b3918429c
-
1M Context Local Inference on a Desktop with Only 24GB VRAM
+
More advanced features will be coming soon, so stay tuned!
🚀 Quick Start
-
Preparation
-Some preparation:
-- CUDA 12.1 and above, if you didn't have it yet, you may install from [here](https://developer.nvidia.com/cuda-downloads).
-
- ```sh
- # Adding CUDA to PATH
- export PATH=/usr/local/cuda/bin:$PATH
- export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
- export CUDA_PATH=/usr/local/cuda
- ```
+Getting started with KTransformers is simple! Follow the steps below to set up and start using it.
-- Linux-x86_64 with gcc, g++ and cmake
-
- ```sh
- sudo apt-get update
- sudo apt-get install gcc g++ cmake ninja-build
- ```
+### 📥 Installation
-- We recommend using [Conda](https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh) to create a virtual environment with Python=3.11 to run our program.
-
- ```sh
- conda create --name ktransformers python=3.11
- conda activate ktransformers # you may need to run ‘conda init’ and reopen shell first
- ```
+To install KTransformers, follow the official [Installation Guide](https://kvcache-ai.github.io/ktransformers/).
-- Make sure that PyTorch, packaging, ninja is installed
-
- ```
- pip install torch packaging ninja cpufeature numpy
- ```
-
-
Installation
-
-1. Use a Docker image, see [documentation for Docker](./doc/en/Docker.md)
-
-2. You can install using Pypi (for linux):
-
- ```
- pip install ktransformers --no-build-isolation
- ```
-
- for windows we prepare a pre compiled whl package on [ktransformers-0.2.0+cu125torch24avx2-cp312-cp312-win_amd64.whl](https://github.com/kvcache-ai/ktransformers/releases/download/v0.2.0/ktransformers-0.2.0+cu125torch24avx2-cp312-cp312-win_amd64.whl), which require cuda-12.5, torch-2.4, python-3.11, more pre compiled package are being produced.
-
-3. Or you can download source code and compile:
-
- - init source code
-
- ```sh
- git clone https://github.com/kvcache-ai/ktransformers.git
- cd ktransformers
- git submodule init
- git submodule update
- ```
-
- - [Optional] If you want to run with website, please [compile the website](./doc/en/api/server/website.md) before execute ```bash install.sh```
-
- - Compile and install (for Linux)
-
- ```
- bash install.sh
- ```
-
- - Compile and install(for Windows)
-
- ```
- install.bat
- ```
-4. If you are developer, you can make use of the makefile to compile and format the code.
the detailed usage of makefile is [here](./doc/en/makefile_usage.md)
-
Local Chat
-We provide a simple command-line local chat Python script that you can run for testing.
-
-> Note that this is a very simple test tool only support one round chat without any memory about last input, if you want to try full ability of the model, you may go to [RESTful API and Web UI](#id_666). We use the DeepSeek-V2-Lite-Chat-GGUF model as an example here. But we also support other models, you can replace it with any other model that you want to test.
-
-
Run Example
-
-```shell
-# Begin from root of your cloned repo!
-# Begin from root of your cloned repo!!
-# Begin from root of your cloned repo!!!
-
-# Download mzwing/DeepSeek-V2-Lite-Chat-GGUF from huggingface
-mkdir DeepSeek-V2-Lite-Chat-GGUF
-cd DeepSeek-V2-Lite-Chat-GGUF
-
-wget https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/resolve/main/DeepSeek-V2-Lite-Chat.Q4_K_M.gguf -O DeepSeek-V2-Lite-Chat.Q4_K_M.gguf
-
-cd .. # Move to repo's root dir
-
-# Start local chat
-python -m ktransformers.local_chat --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF
-
-# If you see “OSError: We couldn't connect to 'https://huggingface.co' to load this file”, try:
-# GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite
-# python ktransformers.local_chat --model_path ./DeepSeek-V2-Lite --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF
-```
-
-It features the following arguments:
-
-- `--model_path` (required): Name of the model (such as "deepseek-ai/DeepSeek-V2-Lite-Chat" which will automatically download configs from [Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite)). Or if you already got local files you may directly use that path to initialize the model.
-
- > Note:
.safetensors files are not required in the directory. We only need config files to build model and tokenizer.
-
-- `--gguf_path` (required): Path of a directory containing GGUF files which could that can be downloaded from [Hugging Face](https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/tree/main). Note that the directory should only contains GGUF of current model, which means you need one separate directory for each model.
-
-- `--optimize_rule_path` (required except for Qwen2Moe and DeepSeek-V2): Path of YAML file containing optimize rules. There are two rule files pre-written in the [ktransformers/optimize/optimize_rules](ktransformers/optimize/optimize_rules) directory for optimizing DeepSeek-V2 and Qwen2-57B-A14, two SOTA MoE models.
-
-- `--max_new_tokens`: Int (default=1000). Maximum number of new tokens to generate.
-
-- `--cpu_infer`: Int (default=10). The number of CPUs used for inference. Should ideally be set to the (total number of cores - 2).
-
-
Suggested Model
-
-| Model Name | Model Size | VRAM | Minimum DRAM | Recommended DRAM |
-| ------------------------------ | ---------- | ----- | --------------- | ----------------- |
-| DeepSeek-R1-q4_k_m | 377G | 14G | 382G | 512G |
-| DeepSeek-V3-q4_k_m | 377G | 14G | 382G | 512G |
-| DeepSeek-V2-q4_k_m | 133G | 11G | 136G | 192G |
-| DeepSeek-V2.5-q4_k_m | 133G | 11G | 136G | 192G |
-| DeepSeek-V2.5-IQ4_XS | 117G | 10G | 107G | 128G |
-| Qwen2-57B-A14B-Instruct-q4_k_m | 33G | 8G | 34G | 64G |
-| DeepSeek-V2-Lite-q4_k_m | 9.7G | 3G | 13G | 16G |
-| Mixtral-8x7B-q4_k_m | 25G | 1.6G | 51G | 64G |
-| Mixtral-8x22B-q4_k_m | 80G | 4G | 86.1G | 96G |
-| InternLM2.5-7B-Chat-1M | 15.5G | 15.5G | 8G(32K context) | 150G (1M context) |
-
-
-More will come soon. Please let us know which models you are most interested in.
-
-Be aware that you need to be subject to their corresponding model licenses when using [DeepSeek](https://huggingface.co/deepseek-ai/DeepSeek-V2/blob/main/LICENSE) and [QWen](https://huggingface.co/Qwen/Qwen2-72B-Instruct/blob/main/LICENSE).
-
-
- Click To Show how to run other examples
-
-* Qwen2-57B
-
- ```sh
- pip install flash_attn # For Qwen2
-
- mkdir Qwen2-57B-GGUF && cd Qwen2-57B-GGUF
-
- wget https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct-GGUF/resolve/main/qwen2-57b-a14b-instruct-q4_k_m.gguf?download=true -O qwen2-57b-a14b-instruct-q4_k_m.gguf
-
- cd ..
-
- python -m ktransformers.local_chat --model_name Qwen/Qwen2-57B-A14B-Instruct --gguf_path ./Qwen2-57B-GGUF
-
- # If you see “OSError: We couldn't connect to 'https://huggingface.co' to load this file”, try:
- # GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct
- # python ktransformers/local_chat.py --model_path ./Qwen2-57B-A14B-Instruct --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF
- ```
-
-* DeepseekV2
-
- ```sh
- mkdir DeepSeek-V2-Chat-0628-GGUF && cd DeepSeek-V2-Chat-0628-GGUF
- # Download weights
- wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00001-of-00004.gguf -o DeepSeek-V2-Chat-0628-Q4_K_M-00001-of-00004.gguf
- wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00002-of-00004.gguf -o DeepSeek-V2-Chat-0628-Q4_K_M-00002-of-00004.gguf
- wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00003-of-00004.gguf -o DeepSeek-V2-Chat-0628-Q4_K_M-00003-of-00004.gguf
- wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00004-of-00004.gguf -o DeepSeek-V2-Chat-0628-Q4_K_M-00004-of-00004.gguf
-
- cd ..
-
- python -m ktransformers.local_chat --model_name deepseek-ai/DeepSeek-V2-Chat-0628 --gguf_path ./DeepSeek-V2-Chat-0628-GGUF
-
- # If you see “OSError: We couldn't connect to 'https://huggingface.co' to load this file”, try:
-
- # GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat-0628
-
- # python -m ktransformers.local_chat --model_path ./DeepSeek-V2-Chat-0628 --gguf_path ./DeepSeek-V2-Chat-0628-GGUF
- ```
-
-| model name | weights download link |
-|----------|----------|
-| Qwen2-57B | [Qwen2-57B-A14B-gguf-Q4K-M](https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct-GGUF/tree/main) |
-| DeepseekV2-coder |[DeepSeek-Coder-V2-Instruct-gguf-Q4K-M](https://huggingface.co/LoneStriker/DeepSeek-Coder-V2-Instruct-GGUF/tree/main) |
-| DeepseekV2-chat |[DeepSeek-V2-Chat-gguf-Q4K-M](https://huggingface.co/bullerwins/DeepSeek-V2-Chat-0628-GGUF/tree/main) |
-| DeepseekV2-lite | [DeepSeek-V2-Lite-Chat-GGUF-Q4K-M](https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/tree/main) |
-
-
-
-
-
-
-RESTful API and Web UI
-
-
-Start without website:
-
-```sh
-ktransformers --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path /path/to/DeepSeek-V2-Lite-Chat-GGUF --port 10002
-```
-
-Start with website:
-
-```sh
-ktransformers --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path /path/to/DeepSeek-V2-Lite-Chat-GGUF --port 10002 --web True
-```
-
-Or you want to start server with transformers, the model_path should include safetensors
-
-```bash
-ktransformers --type transformers --model_path /mnt/data/model/Qwen2-0.5B-Instruct --port 10002 --web True
-```
-
-Access website with url [http://localhost:10002/web/index.html#/chat](http://localhost:10002/web/index.html#/chat) :
-
-
-
-
-
-
-
-More information about the RESTful API server can be found [here](doc/en/api/server/server.md). You can also find an example of integrating with Tabby [here](doc/en/api/server/tabby.md).
📃 Brief Injection Tutorial
At the heart of KTransformers is a user-friendly, template-based injection framework.
diff --git a/README_ZH.md b/README_ZH.md
index 6a5df33..e75d13b 100644
--- a/README_ZH.md
+++ b/README_ZH.md
@@ -94,222 +94,8 @@ https://github.com/user-attachments/assets/a865e5e4-bca3-401e-94b8-af3c080e6c12
🚀 快速入门
-准备工作
-一些准备工作:
-- 如果您还没有 CUDA 12.1 及以上版本,可以从 [这里](https://developer.nvidia.com/cuda-downloads) 安装。
-
- ```sh
- # Adding CUDA to PATH
- export PATH=/usr/local/cuda/bin:$PATH
- export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
- export CUDA_PATH=/usr/local/cuda
- ```
-
-- Linux-x86_64 系统,需要安装 gcc、g++ 和 cmake
-
- ```sh
- sudo apt-get update
- sudo apt-get install gcc g++ cmake ninja-build
- ```
-
-- 我们建议使用 Conda 创建一个 Python=3.11 的虚拟环境来运行我们的程序。
-
- ```sh
- conda create --name ktransformers python=3.11
- conda activate ktransformers # 您可能需要先运行 ‘conda init’ 并重新打开 shell
- ```
-
-- 确保安装了 PyTorch、packaging、ninja
-
- ```
- pip install torch packaging ninja cpufeature numpy
- ```
-
-安装
-
-1. 使用 Docker 镜像,详见 [Docker 文档](./doc/en/Docker.md)
-
-2. 您可以使用 Pypi 安装(适用于 Linux):
-
- ```
- pip install ktransformers --no-build-isolation
- ```
-
- 对于 Windows,我们提供了一个预编译的 whl 包 [ktransformers-0.2.0+cu125torch24avx2-cp312-cp312-win_amd64.whl](https://github.com/kvcache-ai/ktransformers/releases/download/v0.2.0/ktransformers-0.2.0+cu125torch24avx2-cp312-cp312-win_amd64.whl),需要 cuda-12.5、torch-2.4、python-3.11,更多预编译包正在制作中。
-
-3. 或者您可以下载源代码并编译:
-
- - init source code
-
- ```sh
- git clone https://github.com/kvcache-ai/ktransformers.git
- cd ktransformers
- git submodule init
- git submodule update
- ```
-
- - [可选] 如果您想运行网站,请在执行```bash install.sh```之前, 进行 [compile the website](./doc/en/api/server/website.md)
-
- - 编译并安装(适用于 Linux)
-
- ```
- bash install.sh
- ```
-
- - 编译并安装(适用于 Windows)
-
- ```
- install.bat
- ```
-4. 如果您是开发者,可以使用 makefile 来编译和格式化代码。makefile 的详细用法请参见 [这里](./doc/en/makefile_usage.md)
-
-本地聊天
-我们提供了一个简单的命令行本地聊天 Python 脚本,您可以运行它进行测试。
-
-> 请注意,这只是一个非常简单的测试工具,仅支持一轮聊天,不记忆上一次输入。如果您想体验模型的全部功能,可以前往 RESTful API 和 Web UI。这里以 DeepSeek-V2-Lite-Chat-GGUF 模型为例,但我们也支持其他模型,您可以替换为您想要测试的任何模型。
-
-运行示例
-
-```shell
-# 从克隆的仓库根目录开始!
-# 从克隆的仓库根目录开始!!
-# 从克隆的仓库根目录开始!!!
-
-# 从 Hugging Face 下载 mzwing/DeepSeek-V2-Lite-Chat-GGUF
-mkdir DeepSeek-V2-Lite-Chat-GGUF
-cd DeepSeek-V2-Lite-Chat-GGUF
-
-wget https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/resolve/main/DeepSeek-V2-Lite-Chat.Q4_K_M.gguf -O DeepSeek-V2-Lite-Chat.Q4_K_M.gguf
-
-cd .. # 返回仓库根目录
-
-# 启动本地聊天
-python -m ktransformers.local_chat --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF
-
-# 如果遇到报错 “OSError: We couldn't connect to 'https://huggingface.co' to load this file”, 请尝试:
-# GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite
-# python ktransformers.local_chat --model_path ./DeepSeek-V2-Lite --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF
-```
-
-它具有以下参数:
-
-- `--model_path` (required): 模型名称 (例如 "deepseek-ai/DeepSeek-V2-Lite-Chat" 将自动从 [Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite) 下载配置)。或者,如果您已经有本地文件,可以直接使用该路径来初始化模型。
-
- > Note: .safetensors 文件不是必需的。我们只需要配置文件来构建模型和分词器。
-
-- `--gguf_path` (required): 包含 GGUF 文件的目录路径,可以从 [Hugging Face](https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/tree/main) 下载。请注意,该目录应仅包含当前模型的 GGUF,这意味着您需要为每个模型使用一个单独的目录。
-
-- `--optimize_rule_path` (必需,Qwen2Moe 和 DeepSeek-V2 除外): 包含优化规则的 YAML 文件路径。在 [ktransformers/optimize/optimize_rules](ktransformers/optimize/optimize_rules) 目录中有两个预写的规则文件,用于优化 DeepSeek-V2 和 Qwen2-57B-A14,这两个是 SOTA MoE 模型。
-
-- `--max_new_tokens`: Int (default=1000). 要生成的最大 new tokens。
-
-- `--cpu_infer`: Int (default=10). 用于推理的 CPU 数量。理想情况下应设置为(总核心数 - 2)。
-
- 建议模型
-
-| Model Name | Model Size | VRAM | Minimum DRAM | Recommended DRAM |
-| ------------------------------ | ---------- | ----- | --------------- | ----------------- |
-| DeepSeek-R1-q4_k_m | 377G | 14G | 382G | 512G |
-| DeepSeek-V3-q4_k_m | 377G | 14G | 382G | 512G |
-| DeepSeek-V2-q4_k_m | 133G | 11G | 136G | 192G |
-| DeepSeek-V2.5-q4_k_m | 133G | 11G | 136G | 192G |
-| DeepSeek-V2.5-IQ4_XS | 117G | 10G | 107G | 128G |
-| Qwen2-57B-A14B-Instruct-q4_k_m | 33G | 8G | 34G | 64G |
-| DeepSeek-V2-Lite-q4_k_m | 9.7G | 3G | 13G | 16G |
-| Mixtral-8x7B-q4_k_m | 25G | 1.6G | 51G | 64G |
-| Mixtral-8x22B-q4_k_m | 80G | 4G | 86.1G | 96G |
-| InternLM2.5-7B-Chat-1M | 15.5G | 15.5G | 8G(32K context) | 150G (1M context) |
-
-
-更多即将推出。请告诉我们您最感兴趣的模型。
-
-请注意,在使用 [DeepSeek](https://huggingface.co/deepseek-ai/DeepSeek-V2/blob/main/LICENSE) 和 [QWen](https://huggingface.co/Qwen/Qwen2-72B-Instruct/blob/main/LICENSE) 时,需要遵守相应的模型许可证。
-
-
- 点击显示如何运行其他示例
-
-* Qwen2-57B
-
- ```sh
- pip install flash_attn # For Qwen2
-
- mkdir Qwen2-57B-GGUF && cd Qwen2-57B-GGUF
-
- wget https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct-GGUF/resolve/main/qwen2-57b-a14b-instruct-q4_k_m.gguf?download=true -O qwen2-57b-a14b-instruct-q4_k_m.gguf
-
- cd ..
-
- python -m ktransformers.local_chat --model_name Qwen/Qwen2-57B-A14B-Instruct --gguf_path ./Qwen2-57B-GGUF
-
- # 如果遇到报错 “OSError: We couldn't connect to 'https://huggingface.co' to load this file”, 请尝试:
- # GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct
- # python ktransformers/local_chat.py --model_path ./Qwen2-57B-A14B-Instruct --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF
- ```
-
-* DeepseekV2
-
- ```sh
- mkdir DeepSeek-V2-Chat-0628-GGUF && cd DeepSeek-V2-Chat-0628-GGUF
- # Download weights
- wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00001-of-00004.gguf -o DeepSeek-V2-Chat-0628-Q4_K_M-00001-of-00004.gguf
- wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00002-of-00004.gguf -o DeepSeek-V2-Chat-0628-Q4_K_M-00002-of-00004.gguf
- wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00003-of-00004.gguf -o DeepSeek-V2-Chat-0628-Q4_K_M-00003-of-00004.gguf
- wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00004-of-00004.gguf -o DeepSeek-V2-Chat-0628-Q4_K_M-00004-of-00004.gguf
-
- cd ..
-
- python -m ktransformers.local_chat --model_name deepseek-ai/DeepSeek-V2-Chat-0628 --gguf_path ./DeepSeek-V2-Chat-0628-GGUF
-
- # 如果遇到报错 “OSError: We couldn't connect to 'https://huggingface.co' to load this file”, 请尝试:
-
- # GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat-0628
-
- # python -m ktransformers.local_chat --model_path ./DeepSeek-V2-Chat-0628 --gguf_path ./DeepSeek-V2-Chat-0628-GGUF
- ```
-
-| model name | weights download link |
-|----------|----------|
-| Qwen2-57B | [Qwen2-57B-A14B-gguf-Q4K-M](https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct-GGUF/tree/main) |
-| DeepseekV2-coder |[DeepSeek-Coder-V2-Instruct-gguf-Q4K-M](https://huggingface.co/LoneStriker/DeepSeek-Coder-V2-Instruct-GGUF/tree/main) |
-| DeepseekV2-chat |[DeepSeek-V2-Chat-gguf-Q4K-M](https://huggingface.co/bullerwins/DeepSeek-V2-Chat-0628-GGUF/tree/main) |
-| DeepseekV2-lite | [DeepSeek-V2-Lite-Chat-GGUF-Q4K-M](https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/tree/main) |
-
-
-
-
-
-
-RESTful API and Web UI
-
-
-启动不带网站的服务:
-
-```sh
-ktransformers --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path /path/to/DeepSeek-V2-Lite-Chat-GGUF --port 10002
-```
-
-启动带网站的服务:
-
-```sh
-ktransformers --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path /path/to/DeepSeek-V2-Lite-Chat-GGUF --port 10002 --web True
-```
-
-或者,如果您想使用 transformers 启动服务,model_path 应该包含 safetensors 文件:
-
-```bash
-ktransformers --type transformers --model_path /mnt/data/model/Qwen2-0.5B-Instruct --port 10002 --web True
-```
-
-通过 [http://localhost:10002/web/index.html#/chat](http://localhost:10002/web/index.html#/chat) 访问:
-
-
-
-
-
-
-
-关于 RESTful API 服务器的更多信息可以在这里找到 [这里](doc/en/api/server/server.md)。您还可以在这里找到与 Tabby 集成的示例 [这里](doc/en/api/server/tabby.md)。
+KTransformers 的入门非常简单!请参考我们的[安装指南](https://kvcache-ai.github.io/ktransformers/)进行安装。
📃 简要注入教程
KTransformers 的核心是一个用户友好的、基于模板的注入框架。这使得研究人员可以轻松地将原始 torch 模块替换为优化的变体。它还简化了多种优化的组合过程,允许探索它们的协同效应。
@@ -320,7 +106,7 @@ KTransformers 的核心是一个用户友好的、基于模板的注入框架。
-鉴于 vLLM 已经是一个用于大规模部署优化的优秀框架,KTransformers 特别关注受资源限制的本地部署。我们特别关注异构计算时机,例如量化模型的 GPU/CPU 卸载。例如,我们支持高效的 Llamafile 和Marlin 内核,分别用于 CPU 和 GPU。 更多详细信息可以在这里找到 这里。
+鉴于 vLLM 已经是一个用于大规模部署优化的优秀框架,KTransformers 特别关注受资源限制的本地部署。我们特别关注异构计算时机,例如量化模型的 GPU/CPU 卸载。例如,我们支持高效的 Llamafile 和Marlin 内核,分别用于 CPU 和 GPU。 更多详细信息可以在 这里找到。
示例用法
@@ -340,7 +126,7 @@ generated = prefill_and_generate(model, tokenizer, input_tensor.cuda(), max_new_
如何自定义您的模型
-一个详细的使用 DeepSeek-V2 作为示例的注入和 multi-GPU 教程在这里给出 [这里](doc/en/injection_tutorial.md)。
+一个详细的使用 DeepSeek-V2 作为示例的注入和 multi-GPU 教程在 [这里](doc/en/injection_tutorial.md)。
以下是一个将所有原始 Linear 模块替换为 Marlin 的 YAML 模板示例,Marlin 是一个高级的 4 位量化内核。
diff --git a/doc/SUMMARY.md b/doc/SUMMARY.md
index bf5579f..c2461fc 100644
--- a/doc/SUMMARY.md
+++ b/doc/SUMMARY.md
@@ -1,12 +1,15 @@
# Ktransformer
[Introduction](./README.md)
-# DeepSeek
-- [Deepseek-R1/V3 Tutorial](en/DeepseekR1_V3_tutorial.md)
-- [Deepseek-V2 Injection](en/deepseek-v2-injection.md)
-- [Injection Tutorial](en/injection_tutorial.md)
+# Install
+- [Installation Guide](en/install.md)
-# Server
+# Tutorial
+- [Deepseek-R1/V3 Show Case](en/DeepseekR1_V3_tutorial.md)
+- [Why KTransformers Is So Fast](en/deepseek-v2-injection.md)
+- [Injection Tutorial](en/injection_tutorial.md)
+- [Multi-GPU Tutorial](en/multi-gpu-tutorial.md)
+# Server (Temporarily Deprecated)
- [Server](en/api/server/server.md)
- [Website](en/api/server/website.md)
- [Tabby](en/api/server/tabby.md)
diff --git a/doc/assets/DeepSeek-on-KTransformers.PNG b/doc/assets/DeepSeek-on-KTransformers.PNG
deleted file mode 100644
index 455f210..0000000
Binary files a/doc/assets/DeepSeek-on-KTransformers.PNG and /dev/null differ
diff --git a/doc/en/DeepseekR1_V3_tutorial.md b/doc/en/DeepseekR1_V3_tutorial.md
index f42a46d..9815693 100644
--- a/doc/en/DeepseekR1_V3_tutorial.md
+++ b/doc/en/DeepseekR1_V3_tutorial.md
@@ -1,7 +1,7 @@
# GPT-4/o1-level Local VSCode Copilot on a Desktop with only 24GB VRAM
- [SUMMARY](#summary)
- - [Prerequisites](#prerequisites)
+ - [Show Case Environment](#show-case-environment)
- [Bench Result](#bench-result)
- [V0.2](#v02)
- [Settings](#settings)
@@ -50,7 +50,7 @@ We also give our upcoming optimizations previews, including an Intel AMX-acceler
The binary distribution is available now and the source code will come ASAP! Check out the wheel package [here](https://github.com/kvcache-ai/ktransformers/releases/download/v0.1.4/ktransformers-0.3.0rc0+cu126torch26fancy-cp311-cp311-linux_x86_64.whl)
-## Prerequisites
+## Show Case Environment
We run our best performance tests (V0.2) on
CPU: Intel (R) Xeon (R) Gold 6454S 1T DRAM (2 NUMA nodes)
GPU: 4090D 24G VRAM
@@ -110,10 +110,6 @@ is speed up which is inspiring. So our showcase makes use of this finding*
#### Single socket version (32 cores)
Our local_chat test command is:
``` shell
-git clone https://github.com/kvcache-ai/ktransformers.git
-cd ktransformers
-git submodule init
-git submodule update
numactl -N 1 -m 1 python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path <your gguf path> --prompt_file <your prompt txt file> --cpu_infer 33 --max_new_tokens 1000
```
@@ -121,24 +117,28 @@ numactl -N 1 -m 1 python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path <your gguf path> --prompt_file <your prompt txt file> --cpu_infer 33 --max_new_tokens 1000
`<your gguf path>` can also be online, but as it's large we recommend you download it and quantize the model to what you want (notice it's the dir path)
`--max_new_tokens 1000` is the max output token length. If you find the answer is truncated, you
can increase the number for a longer answer (but be aware of OOM, and increasing it will slow down the generation rate).
-
-The command numactl -N 1 -m 1 aims to advoid data transfer between numa nodes
+
+The command `numactl -N 1 -m 1` aims to avoid data transfer between NUMA nodes.
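+
+If you are not sure how many NUMA nodes your machine has, or which CPUs and memory belong to each node, you can inspect the topology with standard tools before choosing the `-N`/`-m` arguments:
+
+``` shell
+numactl --hardware    # lists NUMA nodes with their CPUs and memory sizes
+lscpu | grep -i numa  # quick summary of node count and CPU ranges
+```
+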
Attention! If you are testing R1, it may skip thinking. In that case you can add the arg `--force_think true`. This is explained in the [FAQ](#faq) section.
#### Dual socket version (64 cores)
-Make suer before you install (use install.sh or `make dev_install`), setting the env var `USE_NUMA=1` by `export USE_NUMA=1` (if already installed, reinstall it with this env var set)
-Our local_chat test command is:
+
+Make sure that before you install (using install.sh or `make dev_install`), you set the env var `USE_NUMA=1`, e.g. by `export USE_NUMA=1` (if already installed, reinstall with this env var set). You may check the doc [here](./install.md) for install details.
+
+Test Command:
``` shell
-git clone https://github.com/kvcache-ai/ktransformers.git
-cd ktransformers
-git submodule init
-git submodule update
-export USE_NUMA=1
-make dev_install # or sh ./install.sh
+# ---For those who have not installed ktransformers---
+# git clone https://github.com/kvcache-ai/ktransformers.git
+# cd ktransformers
+# git submodule init
+# git submodule update
+# export USE_NUMA=1
+# make dev_install # or sh ./install.sh
+# ----------------------------------------------------
+python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path <your gguf path> --prompt_file <your prompt txt file> --cpu_infer 65 --max_new_tokens 1000
```
-The parameters' meaning is the same. But As we use dual socket, we set cpu_infer to 65
+The parameters' meanings are the same, but since we use dual sockets, we set cpu_infer to 65.
### V0.3 Showcase
#### Dual socket version (64 cores)
diff --git a/doc/en/deepseek-v2-injection.md b/doc/en/deepseek-v2-injection.md
index 4884b66..fcd5abe 100644
--- a/doc/en/deepseek-v2-injection.md
+++ b/doc/en/deepseek-v2-injection.md
@@ -1,6 +1,6 @@
-# Tutorial: Heterogeneous and Local DeepSeek-V2 Inference
+# Tutorial: Heterogeneous and Local MoE Inference
-DeepSeek-(Code)-V2 is a series of strong mixture-of-experts (MoE) models, featuring a total of 236 billion parameters, with 21 billion parameters activated per token. This model has demonstrated remarkable reasoning capabilities across various benchmarks, positioning it as one of the SOTA open models and nearly comparable in performance to GPT-4.
+DeepSeek-(Code)-V2 is a series of strong mixture-of-experts (MoE) models, featuring a total of 236 billion parameters, with 21 billion parameters activated per token. This model has demonstrated remarkable reasoning capabilities across various benchmarks, positioning it as one of the SOTA open models and nearly comparable in performance to GPT-4. DeepSeek-R1 uses an architecture similar to DeepSeek-V2's, but with more parameters.
diff --git a/doc/en/install.md b/doc/en/install.md
new file mode 100644
index 0000000..a191d22
--- /dev/null
+++ b/doc/en/install.md
@@ -0,0 +1,267 @@
+
+# How to Run DeepSeek-R1
+In this document, we will show you how to install and run KTransformers on your local machine. There are two versions:
+* V0.2 is the current main branch.
+* V0.3 is a preview version that only provides a binary distribution for now.
+* To reproduce our DeepSeek-R1/V3 results, please refer to [Deepseek-R1/V3 Tutorial](./DeepseekR1_V3_tutorial.md) for more detail settings after installation.
+## Preparation
+Some preparation:
+
+- CUDA 12.1 or above. If you don't have it yet, you can install it from [here](https://developer.nvidia.com/cuda-downloads).
+
+ ```sh
+ # Adding CUDA to PATH
+ export PATH=/usr/local/cuda/bin:$PATH
+ export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
+ export CUDA_PATH=/usr/local/cuda
+ ```
+
+- Linux-x86_64 with gcc, g++ and cmake
+
+ ```sh
+ sudo apt-get update
+ sudo apt-get install gcc g++ cmake ninja-build
+ ```
+
+- We recommend using [Conda](https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh) to create a virtual environment with Python=3.11 to run our program.
+
+ ```sh
+ conda create --name ktransformers python=3.11
+ conda activate ktransformers # you may need to run ‘conda init’ and reopen shell first
+ ```
+
+- Make sure that PyTorch, packaging and ninja are installed
+
+ ```
+ pip install torch packaging ninja cpufeature numpy
+ ```
+
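+Optionally, you can sanity-check the preparation steps above before building; every command here is standard tooling:
+
+```sh
+nvcc --version     # should report CUDA 12.1 or newer
+gcc --version && g++ --version && cmake --version
+python --version   # should be 3.11 inside the ktransformers conda env
+python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
+```
+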
+## Installation
+
+
+
+* Download source code and compile:
+
+ - init source code
+
+ ```sh
+ git clone https://github.com/kvcache-ai/ktransformers.git
+ cd ktransformers
+ git submodule init
+ git submodule update
+ ```
+
+  - [Optional] If you want to use the web UI, please [compile the website](./doc/en/api/server/website.md) before executing `bash install.sh`
+
+ - For Linux
+ - For simple install:
+
+ ```shell
+ bash install.sh
+ ```
+  - For those who have two CPU sockets and 1TB of RAM:
+
+ ```shell
+    # Make sure your system has dual sockets and at least twice as much RAM as the model's size (e.g. 1TB RAM for a 512GB model)
+ export USE_NUMA=1
+ bash install.sh # or `make dev_install`
+ ```
+
+ - For Windows
+
+ ```shell
+ install.bat
+ ```
+
+* If you are a developer, you can make use of the makefile to compile and format the code.
+The detailed usage of the makefile is [here](./doc/en/makefile_usage.md).
+
+## Local Chat
+We provide a simple command-line local chat Python script that you can run for testing.
+
+> Note: this is a very simple test tool that only supports single-round chat without any memory of the previous input. If you want to try the full ability of the model, you may go to [RESTful API and Web UI](#id_666).
+
+### Run Example
+
+```shell
+# Begin from root of your cloned repo!
+# Begin from root of your cloned repo!!
+# Begin from root of your cloned repo!!!
+
+# Download mzwing/DeepSeek-V2-Lite-Chat-GGUF from huggingface
+mkdir DeepSeek-V2-Lite-Chat-GGUF
+cd DeepSeek-V2-Lite-Chat-GGUF
+
+wget https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/resolve/main/DeepSeek-V2-Lite-Chat.Q4_K_M.gguf -O DeepSeek-V2-Lite-Chat.Q4_K_M.gguf
+
+cd .. # Move to repo's root dir
+
+# Start local chat
+python -m ktransformers.local_chat --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF
+
+# If you see “OSError: We couldn't connect to 'https://huggingface.co' to load this file”, try:
+# GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite
+# python ktransformers.local_chat --model_path ./DeepSeek-V2-Lite --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF
+```
+
+It features the following arguments:
+
+- `--model_path` (required): Name of the model (such as "deepseek-ai/DeepSeek-V2-Lite-Chat", which will automatically download configs from [Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite)). Or, if you already have local files, you may directly use that path to initialize the model.
+
+ > Note: .safetensors files are not required in the directory. We only need the config files to build the model and tokenizer.
+
+- `--gguf_path` (required): Path of a directory containing GGUF files, which can be downloaded from [Hugging Face](https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/tree/main). Note that the directory should only contain the GGUF files of the current model, which means you need one separate directory for each model.
+
+- `--optimize_rule_path` (required except for Qwen2Moe and DeepSeek-V2): Path of the YAML file containing the optimize rules. There are two pre-written rule files in the [ktransformers/optimize/optimize_rules](ktransformers/optimize/optimize_rules) directory for optimizing DeepSeek-V2 and Qwen2-57B-A14B, two SOTA MoE models.
+
+- `--max_new_tokens`: Int (default=1000). Maximum number of new tokens to generate.
+
+- `--cpu_infer`: Int (default=10). The number of CPU cores used for inference. Should ideally be set to (total number of cores - 2).
+
+
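+For example, a quick test run that combines these flags (using the DeepSeek-V2-Lite-Chat files downloaded above; the token and core counts are just sample values):
+
+```shell
+python -m ktransformers.local_chat \
+  --model_path deepseek-ai/DeepSeek-V2-Lite-Chat \
+  --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF \
+  --max_new_tokens 500 \
+  --cpu_infer 30
+```
+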
+## Supported Models and Quantization Formats
+
+### Supported models include:
+
+| ✅ **Supported Models** | ❌ **Deprecated Models** |
+|------------------------|------------------------|
+| DeepSeek-R1 | ~~InternLM2.5-7B-Chat-1M~~ |
+| DeepSeek-V3 | |
+| DeepSeek-V2 | |
+| DeepSeek-V2.5 | |
+| Qwen2-57B | |
+| DeepSeek-V2-Lite | |
+| Mixtral-8x7B | |
+| Mixtral-8x22B | |
+
+### Supported quantization formats:
+
+| ✅ **Supported Formats** | ❌ **Deprecated Formats** |
+|--------------------------|--------------------------|
+| Q2_K_L | ~~IQ2_XXS~~ |
+| Q2_K_XS | |
+| Q3_K_M | |
+| Q4_K_M | |
+| Q5_K_M | |
+| Q6_K | |
+| Q8_0 | |
+
+
+
+## Suggested Models
+
+| Model Name | Model Size | VRAM | Minimum DRAM | Recommended DRAM |
+| ------------------------------ | ---------- | ----- | --------------- | ----------------- |
+| DeepSeek-R1-q4_k_m | 377G | 14G | 382G | 512G |
+| DeepSeek-V3-q4_k_m | 377G | 14G | 382G | 512G |
+| DeepSeek-V2-q4_k_m | 133G | 11G | 136G | 192G |
+| DeepSeek-V2.5-q4_k_m | 133G | 11G | 136G | 192G |
+| DeepSeek-V2.5-IQ4_XS | 117G | 10G | 107G | 128G |
+| Qwen2-57B-A14B-Instruct-q4_k_m | 33G | 8G | 34G | 64G |
+| DeepSeek-V2-Lite-q4_k_m | 9.7G | 3G | 13G | 16G |
+| Mixtral-8x7B-q4_k_m | 25G | 1.6G | 51G | 64G |
+| Mixtral-8x22B-q4_k_m | 80G | 4G | 86.1G | 96G |
+| InternLM2.5-7B-Chat-1M | 15.5G | 15.5G | 8G(32K context) | 150G (1M context) |
+
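+You can check whether your machine meets the DRAM and VRAM numbers above with standard tools, e.g.:
+
+```shell
+free -h                                                 # total system DRAM
+nvidia-smi --query-gpu=name,memory.total --format=csv   # total VRAM per GPU
+```
+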
+
+More will come soon. Please let us know which models you are most interested in.
+
+Be aware that you are subject to the corresponding model licenses when using [DeepSeek](https://huggingface.co/deepseek-ai/DeepSeek-V2/blob/main/LICENSE) and [Qwen](https://huggingface.co/Qwen/Qwen2-72B-Instruct/blob/main/LICENSE).
+
+
+
+
+ Click To Show how to run other examples
+
+* Qwen2-57B
+
+ ```sh
+ pip install flash_attn # For Qwen2
+
+ mkdir Qwen2-57B-GGUF && cd Qwen2-57B-GGUF
+
+ wget https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct-GGUF/resolve/main/qwen2-57b-a14b-instruct-q4_k_m.gguf?download=true -O qwen2-57b-a14b-instruct-q4_k_m.gguf
+
+ cd ..
+
+ python -m ktransformers.local_chat --model_name Qwen/Qwen2-57B-A14B-Instruct --gguf_path ./Qwen2-57B-GGUF
+
+ # If you see “OSError: We couldn't connect to 'https://huggingface.co' to load this file”, try:
+ # GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct
+  # python ktransformers/local_chat.py --model_path ./Qwen2-57B-A14B-Instruct --gguf_path ./Qwen2-57B-GGUF
+ ```
+
+* Deepseek-V2
+
+ ```sh
+ mkdir DeepSeek-V2-Chat-0628-GGUF && cd DeepSeek-V2-Chat-0628-GGUF
+ # Download weights
+  wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00001-of-00004.gguf -O DeepSeek-V2-Chat-0628-Q4_K_M-00001-of-00004.gguf
+  wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00002-of-00004.gguf -O DeepSeek-V2-Chat-0628-Q4_K_M-00002-of-00004.gguf
+  wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00003-of-00004.gguf -O DeepSeek-V2-Chat-0628-Q4_K_M-00003-of-00004.gguf
+  wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00004-of-00004.gguf -O DeepSeek-V2-Chat-0628-Q4_K_M-00004-of-00004.gguf
+
+ cd ..
+
+ python -m ktransformers.local_chat --model_name deepseek-ai/DeepSeek-V2-Chat-0628 --gguf_path ./DeepSeek-V2-Chat-0628-GGUF
+
+ # If you see “OSError: We couldn't connect to 'https://huggingface.co' to load this file”, try:
+
+ # GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat-0628
+
+ # python -m ktransformers.local_chat --model_path ./DeepSeek-V2-Chat-0628 --gguf_path ./DeepSeek-V2-Chat-0628-GGUF
+ ```
+
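+The four split-file downloads in the Deepseek-V2 example above can also be written as a small loop (same URLs, just less typing):
+
+```shell
+for i in 1 2 3 4; do
+  f="DeepSeek-V2-Chat-0628-Q4_K_M-0000${i}-of-00004.gguf"
+  wget "https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/${f}" -O "${f}"
+done
+```
+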
+| model name | weights download link |
+|----------|----------|
+| Qwen2-57B | [Qwen2-57B-A14B-gguf-Q4K-M](https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct-GGUF/tree/main) |
+| DeepseekV2-coder |[DeepSeek-Coder-V2-Instruct-gguf-Q4K-M](https://huggingface.co/LoneStriker/DeepSeek-Coder-V2-Instruct-GGUF/tree/main) |
+| DeepseekV2-chat |[DeepSeek-V2-Chat-gguf-Q4K-M](https://huggingface.co/bullerwins/DeepSeek-V2-Chat-0628-GGUF/tree/main) |
+| DeepseekV2-lite | [DeepSeek-V2-Lite-Chat-GGUF-Q4K-M](https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/tree/main) |
+| DeepSeek-R1 | [DeepSeek-R1-gguf-Q4K-M](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-Q4_K_M) |
+
+
+
+
+
+
+## RESTful API and Web UI (deprecated)
+
+
+Start without website:
+
+```sh
+ktransformers --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path /path/to/DeepSeek-V2-Lite-Chat-GGUF --port 10002
+```
+
+Start with website:
+
+```sh
+ktransformers --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path /path/to/DeepSeek-V2-Lite-Chat-GGUF --port 10002 --web True
+```
+
+Or, if you want to start the server with transformers, the model_path should contain safetensors files:
+
+```bash
+ktransformers --type transformers --model_path /mnt/data/model/Qwen2-0.5B-Instruct --port 10002 --web True
+```
+
+Access the web UI at [http://localhost:10002/web/index.html#/chat](http://localhost:10002/web/index.html#/chat):
+
+
+
+
+
+
+
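+If you want to query the server from scripts, you can use `curl`. The exact routes are documented in the server documentation linked below; the chat request here assumes an OpenAI-compatible endpoint and is only a sketch:
+
+```shell
+# Check that the web UI is being served
+curl -I http://localhost:10002/web/index.html
+
+# Sketch of a chat request, assuming an OpenAI-compatible route (verify the actual path in the server docs)
+curl -X POST http://localhost:10002/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"messages": [{"role": "user", "content": "hello"}]}'
+```
+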
+More information about the RESTful API server can be found [here](doc/en/api/server/server.md). You can also find an example of integrating with Tabby [here](doc/en/api/server/tabby.md).
diff --git a/doc/en/multi-gpu-tutorial.md b/doc/en/multi-gpu-tutorial.md
new file mode 100644
index 0000000..29bd496
--- /dev/null
+++ b/doc/en/multi-gpu-tutorial.md
@@ -0,0 +1,118 @@
+
+# Multi-GPU
+
+We assume you have read the [Injection Tutorial](./injection_tutorial.md) and have a basic understanding of how to inject a model. In this tutorial, we will show you how to use KTransformers to run a model on multiple GPUs.
+
+If you have multiple GPUs, you can set the device of each module to a different GPU.
+DeepseekV2-Chat has 60 layers; if we have 2 GPUs, we can allocate 30 layers to each GPU. Complete multi-GPU rule examples are [here](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml).
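+
+You can check how many GPUs are visible before writing the rules below:
+
+```shell
+# Lists the GPUs; note that PyTorch's cuda:0/cuda:1 ordering may differ from nvidia-smi's
+# ordering unless CUDA_DEVICE_ORDER=PCI_BUS_ID is set.
+nvidia-smi -L
+```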
+
+
+
+
+
+
+
+
+First of all, for multi-GPU inference, we have to inject a new operator, `KDeepseekV2Model`, and set the division of the layers across the different GPUs. In our case, we have to set the `transfer_map` in the `KDeepseekV2Model` operator as follows:
+
+```yaml
+- match:
+ name: "^model$"
+ replace:
+ class: "ktransformers.operators.models.KDeepseekV2Model"
+ kwargs:
+ transfer_map:
+ 30: "cuda:1"
+```
+
+And we have to set the device for each module in the model.
+
+For example, for `routed experts`, the yaml for one GPU is:
+```yaml
+- match:
+ name: "^model\\.layers\\..*\\.mlp\\.experts$"
+ replace:
+ class: ktransformers.operators.experts.KTransformersExperts # Custom MoE kernel with expert parallelism
+ kwargs:
+ generate_device: "cuda:0"
+ generate_op: "MLPCUDAExperts"
+ out_device: "cuda:0"
+ recursive: False # Don't recursively inject submodules of this module
+```
+But for two GPUs, we need a separate rule for each layer range, each pointing to its own device:
+
+```yaml
+# allocate the out_device of layers 0-29 to cuda:0
+- match:
+ name: "^model\\.layers\\.(0|[1-9]|[12][0-9])\\.mlp\\.experts$"
+ replace:
+    class: ktransformers.operators.experts.KTransformersExperts # custom MoE Kernel with expert parallelism
+ kwargs:
+ generate_device: "cpu"
+ generate_op: "KExpertsCPU"
+ out_device: "cuda:0"
+ recursive: False # don't recursively inject submodules of this module
+
+# allocate the out_device of layers 30-59 to cuda:1
+- match:
+ name: "^model\\.layers\\.([345][0-9])\\.mlp\\.experts$"
+ replace:
+    class: ktransformers.operators.experts.KTransformersExperts # custom MoE Kernel with expert parallelism
+ kwargs:
+ generate_device: "cpu"
+ generate_op: "KExpertsCPU"
+ out_device: "cuda:1"
+ recursive: False # don't recursively inject submodules of this module
+```
+For other modules, we can set the device in the same way.
+
+# How to fully utilize the VRAM of multiple GPUs
+
+When you have multiple GPUs, you can fully utilize the VRAM of each GPU by moving more weights to the GPU.
+
+For example, for DeepSeekV2-Chat, we can move the weights of the experts to the GPU.
+
+For instance, the yaml rule for the experts on two GPUs is:
+```yaml
+- match:
+ name: "^model\\.layers\\.(0|[1-9]|[12][0-9])\\.mlp\\.experts$"
+ replace:
+ class: ktransformers.operators.experts.KTransformersExperts
+ kwargs:
+ generate_device: "cpu"
+ generate_op: "KExpertsCPU"
+ out_device: "cuda:0"
+ recursive: False
+```
+
+But if we have an extra 60GB of VRAM on cuda:0, we can move the experts in layers 4~8 to cuda:0.
+
+```yaml
+# Add new rule before old rule.
+- match:
+ name: "^model\\.layers\\.([4-8])\\.mlp\\.experts$" # inject experts in layer 4~8 as marlin expert
+ replace:
+ class: ktransformers.operators.experts.KTransformersExperts
+ kwargs:
+ generate_device: "cuda:0"
+ generate_op: "KExpertsMarlin"
+ recursive: False
+
+- match:
+ name: "^model\\.layers\\.(0|[1-9]|[12][0-9])\\.mlp\\.experts$"
+ replace:
+ class: ktransformers.operators.experts.KTransformersExperts
+ kwargs:
+ generate_device: "cpu"
+ generate_op: "KExpertsCPU"
+ out_device: "cuda:0"
+ recursive: False
+```
+
+Adjust the layer range as you want. Note that:
+* The loading speed will be significantly slower for each expert moved to the GPU.
+* You have to disable the CUDA graph if you want to move the experts to the GPU.
+* For DeepSeek-R1/V3, each expert moved to the GPU will consume approximately 6GB of VRAM.
+* The first matched rule in yaml will be applied. For example, if you have two rules that match the same layer, only the first rule's replacement will be valid.
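+
+As a rough sizing aid for DeepSeek-R1/V3 (using the ~6GB-per-expert figure above), you can check the free VRAM of each GPU and divide:
+
+```shell
+# Free VRAM per GPU; at roughly 6GB per expert, e.g. 60000 MiB free allows about 10 experts on that GPU
+nvidia-smi --query-gpu=index,memory.free --format=csv
+```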
+
+