⚡ fix typo
parent 107e4be417, commit 3d7dfd6151
@@ -6,8 +6,8 @@ GPU: 4090D 24G VRAM <br>
 ## Bench Result
 ### V0.2
 #### Settings
-- Model: DeepseekV3-q4km(int4)<br>
+- Model: DeepseekV3-q4km (int4)<br>
-- CPU: cpu_model_name:Intel (R) Xeon (R) Gold 6454S, 32 cores per socket, 2 socket, 2 numa nodes
+- CPU: cpu_model_name: Intel(R) Xeon(R) Gold 6454S, 32 cores per socket, 2 sockets, 2 NUMA nodes
 - GPU: 4090D 24G VRAM
 - We test after sufficient warm-up
 #### Memory consumption:
@@ -16,7 +16,7 @@ GPU: 4090D 24G VRAM <br>
 
 #### Benchmark Results
 
"6 experts" case is part of v0.3's preview
|
"6 experts" case is part of V0.3's preview
|
||||||
|
|
||||||
 | Prompt<br>(500 tokens) | Dual socket Ktrans (6 experts) | Dual socket Ktrans (8 experts) | Single socket Ktrans (6 experts) | Single socket Ktrans (8 experts) | llama.cpp (8 experts) |
 | --- | --- | --- | --- | --- | --- |
@@ -28,7 +28,7 @@ GPU: 4090D 24G VRAM <br>
 ### V0.3-Preview
 #### Settings
 - Model: DeepseekV3-BF16 (online quant into int8 for CPU and int4 for GPU)
-- CPU: cpu_model_name:Intel(R) Xeon(R) Gold 6454S, 32 cores per socket, 2 socket, 2 numa nodes
+- CPU: cpu_model_name: Intel(R) Xeon(R) Gold 6454S, 32 cores per socket, 2 sockets, 2 NUMA nodes
 - GPU: (1~4)x 4090D 24G VRAM (requires more VRAM for longer prompts)
 
 #### Memory consumption:
@@ -56,28 +56,28 @@ is speed up which is inspiring. So our showcase makes use of this finding*
 ## How to Run
 ### V0.2 Showcase
 #### Single socket version (32 cores)
-our local_chat test command is:
+Our local_chat test command is:
 ``` shell
 git clone https://github.com/kvcache-ai/ktransformers.git
 cd ktransformers
-numactl -N 1 -m 1 python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path <your gguf path> --prompt_file <your promt txt file> --cpu_infer 33 --cache_lens 1536
+numactl -N 1 -m 1 python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path <your gguf path> --prompt_file <your prompt txt file> --cpu_infer 33 --cache_lens 1536
 <when you see chat, then press enter to load the text prompt_file>
 ```
-\<your model path\> can be local or set from onlie hugging face like deepseek-ai/DeepSeek-V3. If onlie encounters connection problem, try use mirror(hf-mirror.com) <br>
+\<your model path\> can be local or a Hugging Face id such as deepseek-ai/DeepSeek-V3. If the online download hits connection problems, try the mirror (hf-mirror.com) <br>
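
For example, a minimal sketch of the mirror workaround, assuming your installed huggingface_hub honors the standard HF_ENDPOINT variable:

``` shell
# Route Hugging Face downloads through hf-mirror.com
# (assumption: the installed huggingface_hub honors HF_ENDPOINT)
export HF_ENDPOINT=https://hf-mirror.com
python ./ktransformers/local_chat.py --model_path deepseek-ai/DeepSeek-V3 --gguf_path <your gguf path> --prompt_file <your prompt txt file> --cpu_infer 33 --cache_lens 1536
```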
-\<your gguf path\> can also be onlie, but as its large we recommend you download it and quantize the model to what you want <br>
+\<your gguf path\> can also be online, but as it is large we recommend downloading it and quantizing the model to the precision you want <br>
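
As an illustration only, one way to produce such a quant with llama.cpp's quantize tool (a sketch; the binary name and the source GGUF filename are assumptions that vary across llama.cpp versions):

``` shell
# Sketch: re-quantize a downloaded GGUF to Q4_K_M
# (assumption: a llama.cpp build where the tool is named llama-quantize)
./llama-quantize DeepSeek-V3-BF16.gguf DeepSeek-V3-Q4_K_M.gguf Q4_K_M
```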
-The command numactl -N 1 -m 1 aims to adoid data transfer between numa nodes
+The command numactl -N 1 -m 1 aims to avoid data transfer between NUMA nodes
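
Before picking the node numbers for -N/-m, it helps to confirm the machine's NUMA layout; a quick check with the standard numactl and lscpu tools:

``` shell
# List NUMA nodes with their CPUs and per-node free memory
numactl --hardware
# Cross-check the CPU-to-node mapping
lscpu | grep -i numa
```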
 #### Dual socket version (64 cores)
 Make sure, before you install (using install.sh or `make dev_install`), that you set the env var `USE_NUMA=1` via `export USE_NUMA=1` (if already installed, reinstall with this env var set) <br>
-our local_chat test command is:
+Our local_chat test command is:
 ``` shell
 git clone https://github.com/kvcache-ai/ktransformers.git
 cd ktransformers
 export USE_NUMA=1
 make dev_install # or sh ./install.sh
-python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path <your gguf path> --prompt_file <your promt txt file> --cpu_infer 65 --cache_lens 1536
+python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path <your gguf path> --prompt_file <your prompt txt file> --cpu_infer 65 --cache_lens 1536
 <when you see chat, then press enter to load the text prompt_file>
 ```
-The parameters meaning is the same. But As we use dual socket, so we set cpu_infer to 65
+The parameters' meaning is the same, but as we use dual sockets, we set cpu_infer to 65
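
In both showcases cpu_infer is the physical core count plus one (33 for 32 cores, 65 for 64). A sketch of deriving it on the fly, assuming that plus-one pattern generalizes to your machine:

``` shell
# Physical core count, ignoring hyperthreads (a common lscpu idiom)
CORES=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)
# Assumption: cpu_infer = physical cores + 1, as in the 32- and 64-core examples
python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path <your gguf path> --prompt_file <your prompt txt file> --cpu_infer $((CORES + 1)) --cache_lens 1536
```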
 ## Some Explanations
 1. Also, we want to make further use of our two NUMA nodes on the Xeon Gold CPU.
 To avoid the cost of data transfer between nodes, we "copy" the critical matrix on