From 03f8bc9f79d9b3915bce73ab13174f91e53a79d9 Mon Sep 17 00:00:00 2001
From: Atream <80757050+Atream@users.noreply.github.com>
Date: Tue, 25 Feb 2025 21:35:31 +0800
Subject: [PATCH] Update DeepseekR1_V3_tutorial.md

add long context
---
 doc/en/DeepseekR1_V3_tutorial.md | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/doc/en/DeepseekR1_V3_tutorial.md b/doc/en/DeepseekR1_V3_tutorial.md
index 52e9d32..22cfab7 100644
--- a/doc/en/DeepseekR1_V3_tutorial.md
+++ b/doc/en/DeepseekR1_V3_tutorial.md
@@ -154,6 +154,18 @@ the output quality doesn't change. But the speed of decoding and prefill
 is speed up which is inspiring. So our showcase makes use of this finding*
 
 ## How to Run
+### V0.2.2 Longer Context
+If you want to use a long context (longer than 20K tokens) for prefill, enable matrix absorption for MLA during the prefill phase, which significantly reduces the size of the KV cache. Modify the yaml file like this:
+```
+- match:
+    name: "^model\\.layers\\..*\\.self_attn$"
+  replace:
+    class: ktransformers.operators.attention.KDeepseekV2Attention # optimized MLA implementation
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      absorb_for_prefill: True # set this to True to enable long context (prefill may be slower).
+```
 ### V0.2 & V0.2.1 Showcase
 #### Single socket version (32 cores)
 Our local_chat test command is:
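
To get a feel for why `absorb_for_prefill` shrinks the cached state, here is a minimal back-of-the-envelope sketch. It assumes the published DeepSeek-V3/R1 attention dimensions (61 layers, 128 heads, kv_lora_rank 512, 64-dim RoPE keys, 128-dim per-head keys/values) and a prefill path that materializes per-head K/V when absorption is off; these constants and that behavior are assumptions for illustration, not values read from ktransformers.

```python
# Rough estimate of KV-cache size with and without MLA matrix absorption.
# All model dimensions below are assumed from the published DeepSeek-V3/R1
# config and may differ from your checkpoint.

NUM_LAYERS = 61          # assumed: number of transformer layers
NUM_HEADS = 128          # assumed: attention heads
KV_LORA_RANK = 512       # assumed: compressed KV latent dimension (MLA)
QK_ROPE_HEAD_DIM = 64    # assumed: decoupled RoPE key dimension
QK_NOPE_HEAD_DIM = 128   # assumed: non-RoPE key dimension per head
V_HEAD_DIM = 128         # assumed: value dimension per head
BYTES_PER_ELEM = 2       # fp16/bf16 storage

def kv_cache_bytes(tokens: int, absorbed: bool) -> int:
    """Approximate KV-cache size in bytes for `tokens` cached positions."""
    if absorbed:
        # Absorbed MLA keeps only the compressed latent plus the RoPE key.
        per_token = KV_LORA_RANK + QK_ROPE_HEAD_DIM
    else:
        # Without absorption, per-head K and V are materialized for the prompt.
        per_token = NUM_HEADS * (QK_NOPE_HEAD_DIM + QK_ROPE_HEAD_DIM + V_HEAD_DIM)
    return tokens * NUM_LAYERS * per_token * BYTES_PER_ELEM

if __name__ == "__main__":
    for tokens in (8_000, 20_000, 64_000):
        absorbed = kv_cache_bytes(tokens, absorbed=True)
        materialized = kv_cache_bytes(tokens, absorbed=False)
        print(f"{tokens:>6} tokens: absorbed ~ {absorbed / 2**30:5.1f} GiB, "
              f"materialized ~ {materialized / 2**30:6.1f} GiB "
              f"({materialized / absorbed:.0f}x smaller)")
```

Under these assumptions the absorbed layout stores roughly 70x fewer elements per cached token, which is why the tutorial suggests enabling it once the prompt grows past ~20K tokens, at the cost of a somewhat slower prefill.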