diff --git a/README.md b/README.md
index b8e83cc..b4ad06a 100644
--- a/README.md
+++ b/README.md
@@ -23,7 +23,7 @@ Our vision for KTransformers is to serve as a flexible platform for experimentin

🔥 Updates

-* **Fed 10, 2025**: Support DeepseekR1 and V3 on single (24GB VRAM)/multi gpu and 382G DRAM, up to 3~64X speedup. The Detailed tutorial is [here](./doc/en/DeepseekR1_V3_tutorial.md)
+* **Feb 10, 2025**: Support DeepseekR1 and V3 on single (24GB VRAM)/multi-GPU and 382G DRAM, up to 3~64x speedup. The detailed tutorial is [here](./doc/en/DeepseekR1_V3_tutorial.md).
 * **Aug 28, 2024**: Support 1M context under the InternLM2.5-7B-Chat-1M model, utilizing 24GB of VRAM and 150GB of DRAM. The detailed tutorial is [here](./doc/en/long_context_tutorial.md).
 * **Aug 28, 2024**: Decrease DeepseekV2's required VRAM from 21G to 11G.
 * **Aug 15, 2024**: Update detailed [TUTORIAL](doc/en/injection_tutorial.md) for injection and multi-GPU.
diff --git a/doc/en/DeepseekR1_V3_tutorial.md b/doc/en/DeepseekR1_V3_tutorial.md
index a56b689..45a5aab 100644
--- a/doc/en/DeepseekR1_V3_tutorial.md
+++ b/doc/en/DeepseekR1_V3_tutorial.md
@@ -47,6 +47,12 @@ The main acceleration comes from
 - Intel AMX instruction set and our specially designed cache friendly memory layout
 - Expert selection strategy that selects fewer experts based on offline profile results of out of domain data
 
+
+*From our research on DeepSeekV2, DeepSeekV3, and DeepSeekR1, we found that
+slightly decreasing the number of activated experts during inference does not
+change the output quality, while both decoding and prefill become noticeably
+faster. Our showcase makes use of this encouraging finding.*
+
 ## how to run
 ### v0.2 showcase
 #### single socket version(32 cores)
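
The expert-reduction idea in the tutorial hunk above comes down to lowering the number of routed experts each token is dispatched to. Below is a minimal sketch of that knob, assuming a DeepSeek-style Hugging Face checkpoint whose config exposes `num_experts_per_tok`; the model id is only an illustrative choice, and this is not the KTransformers injection mechanism itself.

```python
# Minimal sketch: slightly reduce the number of activated (routed) experts per token.
# Assumptions: a DeepSeek-style checkpoint whose remote-code config exposes
# `num_experts_per_tok`; this is NOT KTransformers' own injection mechanism.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "deepseek-ai/DeepSeek-V2-Lite"  # illustrative choice only

config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
print("default activated experts per token:", config.num_experts_per_tok)

# Lower the top-k used by the MoE gate by one. The tutorial's observation is
# that a small reduction keeps output quality while cutting per-token expert
# compute, which speeds up both prefill and decode.
config.num_experts_per_tok = max(1, config.num_experts_per_tok - 1)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    trust_remote_code=True,
    torch_dtype="auto",
)
```

Larger reductions should be checked the way the tutorial describes, against offline profiling on out-of-domain data, before trusting the output quality.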