From 790ff6ba657acdb3933c4449a6e5fb370b76731a Mon Sep 17 00:00:00 2001
From: root <403644786@qq.com>
Date: Mon, 15 Jul 2024 15:57:09 +0800
Subject: [PATCH] Add quick navigation and the new tutorials to README-en
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 README-en.md | 189 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 188 insertions(+), 1 deletion(-)

diff --git a/README-en.md b/README-en.md
index ddc8c41..61461ce 100644
--- a/README-en.md
+++ b/README-en.md
@@ -59,6 +59,16 @@ We release all model parameters for research and limited commercial use.
-

+## Common Modules
+| [infer](#2) | [finetune](#6) | [deployment](#4) | [quantize](#quantize) |
+|-------------|------------|-----------|-----------|
+|[Transformers](#Huggingface)|[Transformers](#6)|[MLC](#MLC)|[GPTQ](#gptq)|
+|[vLLM](#vLLM)|[mlx_finetune](#mlx_finetune)|[llama.cpp](#llama.cpp)|[AWQ](#awq)|
+|[llama.cpp](#llama.cpp)|[llama_factory](./finetune/llama_factory_example/README.md)||[bnb](#bnb)|
+|[ollama](#ollama)|||[quantize_test](#quantize_test)|
+|[fastllm](#fastllm)||||
+|[mlx_lm](#mlx)||||
+|[powerinfer](#powerinfer)||||
 ## Update Log
 - **2024/04/11 We release [MiniCPM-V 2.0](https://huggingface.co/openbmb/MiniCPM-V-2.0), [MiniCPM-2B-128k](https://huggingface.co/openbmb/MiniCPM-2B-128k), [MiniCPM-MoE-8x2B](https://huggingface.co/openbmb/MiniCPM-MoE-8x2B) and [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16)! Click [here](https://openbmb.vercel.app/) to read our technical blog.**
 - 2024/03/16 Intermediate checkpoints were released [here](https://huggingface.co/openbmb/MiniCPM-2B-history)!
@@ -101,6 +111,8 @@ We release all model parameters for research and limited commercial use.
 - [Colab](https://colab.research.google.com/drive/1tJcfPyWGWA5HezO7GKLeyeIso0HyOc0l?usp=sharing)
+

+
 #### Huggingface
 ##### MiniCPM-2B
@@ -169,6 +181,7 @@ res, context, _ = model.chat(
 )
 print(res)
 ```
+

 #### vLLM
@@ -186,6 +199,7 @@ print(res)
 #### llama.cpp, Ollama, fastllm, mlx_lm Inference
 We support inference with [llama.cpp](https://github.com/ggerganov/llama.cpp/), [ollama](https://github.com/ollama/ollama), [fastllm](https://github.com/ztxz16/fastllm), and [mlx_lm](https://github.com/ml-explore/mlx-examples). Thanks to [@runfuture](https://github.com/runfuture) for adapting llama.cpp and ollama.
+

 **llama.cpp**
 1. [install llama.cpp](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#build)
@@ -196,13 +210,37 @@ We have supported inference with [llama.cpp](https://github.com/ggerganov/llama.
 ```
 For more parameter adjustments, [see this](https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md).
-**ollama**

+
+**ollama run MiniCPM-2B-dpo**
 1. [install ollama](https://github.com/ollama/ollama)
 2. In command line:
 ```
 ollama run modelbest/minicpm-2b-dpo
 ```
+**ollama other models**
+1. [Install ollama](https://github.com/ollama/ollama)
+2. Download a model in gguf format: [download link for the 2b-fp16 model](https://huggingface.co/runfuture/MiniCPM-2B-dpo-fp16-gguf), [download link for the 2b-q4km model](https://huggingface.co/runfuture/MiniCPM-2B-dpo-q4km-gguf), [download link for the 1b-fp16 model](https://huggingface.co/linglingdan/MiniCPM-1b-fp16-gguf), [download link for the 1b-q4_1 model](https://huggingface.co/linglingdan/MiniCPM-1b-q4-1)
+3. Run the following command in the command line; `model_name` can be customized:
+```
+touch model_name.Modelfile
+```
+4. Edit `model_name.Modelfile` as follows, putting the path of the gguf model after FROM:
+```shell
+FROM model_path/model_name.gguf
+TEMPLATE """{{ .Prompt }}{{ .Response }}"""
+PARAMETER stop "</s>"
+```
+5. Run the following command in the command line to create the ollama model; `ollama_model_name` can be customized, and `model_name.Modelfile` must match the name chosen in step 3:
+```shell
+ollama create ollama_model_name -f model_name.Modelfile
+```
+6. Run the ollama model (for calling it programmatically, see the sketch after the fastllm section below):
+```shell
+ollama run ollama_model_name
+```
+
 **fastllm**
 1. install [fastllm](https://github.com/ztxz16/fastllm)
 2. inference
@@ -217,8 +255,10 @@ llm.set_device_map("cpu")
 model = llm.from_hf(model, tokenizer, dtype = "float16") # dtype supports "float16", "int8", "int4"
 print(model.response("<用户>Write an acrostic poem with the word MINICPM (One line per letter)", top_p=0.8, temperature=0.5, repeat_penalty=1.02))
 ```
+
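The ollama model created in the steps above can also be called programmatically once `ollama serve` is running. A minimal sketch (assumptions: the `requests` package is installed, ollama listens on its default local port 11434, and `ollama_model_name` is the name used with `ollama create`):

```python
import requests  # assumes the requests package is installed

# Call the locally created ollama model through ollama's HTTP API.
# "ollama_model_name" must match the name passed to `ollama create`.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "ollama_model_name",
        "prompt": "Write an acrostic poem with the word MINICPM (One line per letter)",
        "stream": False,  # return the full answer at once instead of streaming
    },
)
print(resp.json()["response"])
```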

 **mlx_lm**
+
 1. install mlx_lm
 ```shell
 pip install mlx_lm
 ```
@@ -229,6 +269,150 @@ print(model.response("<用户>Write an acrostic poem with the word MINICPM (One
 python -m mlx_lm.generate --model mlx-community/MiniCPM-2B-sft-bf16-llama-format-mlx --prompt "hello, tell me a joke." --trust-remote-code
 ```
 
+#### powerinfer
+Currently, PowerInfer is tailored exclusively to the MiniCPM-S-1B model; support for other versions is not yet available, so stay tuned.
+1. Ensure your cmake version is 3.17 or above. If it is already installed, you can skip this step.
+```bash
+  # Download the installation package
+  sudo wget https://cmake.org/files/v3.23/cmake-3.23.0.tar.gz
+  # Extract the installation package
+  sudo tar -zxvf cmake-3.23.0.tar.gz
+  # Enter the extracted directory and configure the installation environment
+  cd cmake-3.23.0
+  sudo ./configure
+  sudo make -j8
+  # Compile and install
+  sudo make install
+  # Check the version after installation
+  cmake --version
+  # If the version number is returned, the installation was successful
+  # cmake version 3.23.0
+```
+2. Install PowerInfer:
+```bash
+  git clone https://github.com/SJTU-IPADS/PowerInfer
+  cd PowerInfer
+  pip install -r requirements.txt # install Python helpers' dependencies
+```
+3. Compile the CPU version of PowerInfer. If your machine has only a CPU, or if you want to run inference on the CPU, run the following commands:
+```bash
+  cmake -S . -B build
+  cmake --build build --config Release
+```
+4. Compile the GPU version of PowerInfer. If your machine has a GPU, you can run the following commands:
+```bash
+  cmake -S . -B build -DLLAMA_CUBLAS=ON
+  cmake --build build --config Release
+```
+5. Retrieve the sparse model:
+```bash
+git clone https://huggingface.co/openbmb/MiniCPM-S-1B-sft-gguf
+# or
+git clone https://modelscope.cn/models/OpenBMB/MiniCPM-S-1B-sft-gguf
+```
+6. Model Inference:
+```bash
+cd PowerInfer
+# Below is the command template. output_token_count is the maximum number of output tokens, thread_num is the number of threads, and prompt is the input prompt text.
+#./build/bin/main -m /PATH/TO/MODEL -n $output_token_count -t $thread_num -p $prompt
+# Below is an example
+./build/bin/main -m /root/ld/ld_model_pretrain/1b-s-minicpm/MiniCPM-S-1B-sft.gguf -n 2048 -t 8 -p 'hello, tell me a story please.'
+```
+

+
+## Quantize
+

+
+**gptq**
+1. First, obtain the [minicpm_gptqd code](https://github.com/LDLINGLINGLING/AutoGPTQ/tree/minicpm_gptq).
+2. Navigate to the minicpm_gptqd main directory ./AutoGPTQ, then run in the command line:
+  ```
+  pip install -e .
+  ```
+3. Proceed to [model download](#1) and download all files of the unquantized MiniCPM repository to a local folder; both the 1b and 2b models are acceptable, as well as post-trained models.
+4. Run the following command in the command line, where `no_quant_model_path` is the path to the model downloaded in step 3, `quant_save_path` is the path where the quantized model will be saved, and `--bits` is the quantization bit width (4 or 8):
+  ```
+  cd MiniCPM/quantize
+  python gptq_quantize.py --pretrained_model_dir no_quant_model_path --quantized_model_dir quant_save_path --bits 4
+  ```
+5. You can perform inference with ./AutoGPTQ/examples/quantization/inference.py, or use vLLM with the quantized model as in the previous section (a minimal sketch follows below). For the minicpm-1b-int4 model, vLLM inference on a single 4090 card runs at around 2000 tokens per second.
+
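For reference, a minimal sketch of vLLM inference with the GPTQ-quantized model, assuming a vLLM version with GPTQ support; the model path, prompt, and sampling values below are placeholders rather than values taken from this repository:

```python
from vllm import LLM, SamplingParams

# Load the GPTQ-quantized MiniCPM checkpoint produced in step 4.
# "quant_save_path" is the directory passed to --quantized_model_dir above.
llm = LLM(
    model="quant_save_path",
    quantization="gptq",      # tell vLLM the weights are GPTQ-quantized
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.5, top_p=0.8, max_tokens=512)
prompt = "<用户>Write an acrostic poem with the word MINICPM (One line per letter)<AI>"
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```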

+
+**awq**
+1. Modify the configuration parameters in the quantize/awq_quantize.py file according to the comments:
+```python
+  model_path = '/root/ld/ld_model_pretrained/MiniCPM-1B-sft-bf16' # model_path or model_id
+  quant_path = '/root/ld/ld_project/pull_request/MiniCPM/quantize/awq_cpm_1b_4bit' # quant_save_path
+  quant_data_path='/root/ld/ld_project/pull_request/MiniCPM/quantize/quantize_data/wikitext' # Path to the provided quantization calibration data: the alpaca or wikitext folder under quantize/quantize_data
+  quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" } # "w_bit": 4 or 8
+  quant_samples=512 # Number of samples to use for calibration
+  custom_data=[{'question':'What is your name.','answer':'I am the open-source mini cannon MiniCPM from OpenBMB.'}, # A custom dataset can also be used
+               {'question':'What are your features.','answer':'I am small, but I am strong.'}]
+```
+2. Under the quantize/quantize_data folder, two datasets, alpaca and wikitext, are provided as quantization calibration sets. Point the quant_data_path above at one of these folders.
+3. If you need a custom dataset, modify the custom_data variable in quantize/awq_quantize.py, for example:
+  ```python
+  custom_data=[{'question':'What symptoms does allergic rhinitis have?','answer':'Allergic rhinitis may cause nasal congestion, runny nose, headache, etc., which recur frequently. It is recommended to seek medical attention in severe cases.'},
+               {'question':'What is 1+1 equal to?','answer':'It equals 2'}]
+  ```
+4. Based on the selected dataset, choose one of the following lines of code to replace line 38 in quantize/awq_quantize.py:
+  ```python
+  # Quantize using wikitext
+  model.quantize(tokenizer, quant_config=quant_config, calib_data=load_wikitext(quant_data_path=quant_data_path))
+  # Quantize using alpaca
+  model.quantize(tokenizer, quant_config=quant_config, calib_data=load_alpaca(quant_data_path=quant_data_path))
+  # Quantize using a custom dataset
+  model.quantize(tokenizer, quant_config=quant_config, calib_data=load_cust_data(quant_data_path=quant_data_path))
+  ```
+5. Run the quantize/awq_quantize.py file; the AWQ-quantized model will be available in the specified quant_path directory.
+
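Once quantization finishes, the checkpoint can be smoke-tested with AutoAWQ. A minimal sketch, assuming a CUDA GPU; the path, prompt, and generation settings are placeholders, not values from awq_quantize.py:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "awq_cpm_1b_4bit"  # placeholder: directory written by awq_quantize.py

# Load the AWQ-quantized model and its tokenizer.
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

prompt = "<用户>Write an acrostic poem with the word MINICPM (One line per letter)<AI>"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")  # assumes a CUDA GPU
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```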

+
+**bnb**
+1. Modify the configuration parameters in the quantize/bnb_quantize.py file according to the comments:
+```python
+model_path = "/root/ld/ld_model_pretrain/MiniCPM-1B-sft-bf16" # Model path
+save_path = "/root/ld/ld_model_pretrain/MiniCPM-1B-sft-bf16_int4" # Path to save the quantized model
+```
+2. Additional quantization parameters can be modified based on the comments and the llm.int8() algorithm (optional):
+```python
+quantization_config = BitsAndBytesConfig(
+    load_in_4bit=True, # Whether to perform 4-bit quantization
+    load_in_8bit=False, # Whether to perform 8-bit quantization
+    bnb_4bit_compute_dtype=torch.float16, # Computation precision
+    bnb_4bit_quant_storage=torch.uint8, # Storage format for the quantized weights
+    bnb_4bit_quant_type="nf4", # Quantization data type; nf4 (4-bit NormalFloat) is used here
+    bnb_4bit_use_double_quant=True, # Whether to use double quantization, i.e., also quantizing the zero points and scaling factors
+    llm_int8_enable_fp32_cpu_offload=False, # Whether to offload part of the model to the CPU in fp32 when using llm.int8()
+    llm_int8_has_fp16_weight=False, # Whether to keep fp16 main weights (mixed precision) for llm.int8()
+    #llm_int8_skip_modules=["out_proj", "kv_proj", "lm_head"], # Modules that are not quantized
+    llm_int8_threshold=6.0, # Outlier threshold in the llm.int8() algorithm; activations above it are kept in higher precision
+)
+```
+3. Run the quantize/bnb_quantize.py script, and the bnb-quantized model will be available in the directory specified by save_path.
+```bash
+cd MiniCPM/quantize
+python bnb_quantize.py
+```
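The same BitsAndBytesConfig can also be applied on the fly at load time for a quick test with transformers; a minimal sketch (the prompt and generation settings are assumptions, not taken from bnb_quantize.py):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "/root/ld/ld_model_pretrain/MiniCPM-1B-sft-bf16"  # same model path as above

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize the weights with bitsandbytes while loading.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "<用户>Write an acrostic poem with the word MINICPM (One line per letter)<AI>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```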

+
+**quantize_test**
+1. In the command line, navigate to the MiniCPM/quantize directory.
+2. Modify awq_path, gptq_path, model_path, and bnb_path in the quantize_eval.sh file. Leave the model types you don't want to test as empty strings. The following example tests only the AWQ model:
+  ```
+  awq_path="/root/ld/ld_project/AutoAWQ/examples/awq_cpm_1b_4bit"
+  gptq_path=""
+  model_path=""
+  bnb_path=""
+  ```
+3. In the MiniCPM/quantize directory, run the following command in the command line:
+  ```
+  bash quantize_eval.sh
+  ```
+4. The terminal will display the memory usage and perplexity of the model.
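For context, the perplexity such a test reports is typically computed along the following lines. This is an illustrative stand-alone sketch, not the code inside quantize_eval.sh; the model path and text file are placeholders:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_path: str, text_path: str, max_length: int = 1024) -> float:
    """Compute perplexity of a causal LM over a plain-text file, chunk by chunk."""
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True)
    model.eval()

    with open(text_path, encoding="utf-8") as f:
        ids = tokenizer(f.read(), return_tensors="pt").input_ids[0]

    nll, n_tokens = 0.0, 0
    with torch.no_grad():
        for start in range(0, ids.size(0), max_length):
            chunk = ids[start:start + max_length].unsqueeze(0).to(model.device)
            if chunk.size(1) < 2:
                continue
            # With labels == input_ids, the model returns the mean next-token NLL of the chunk.
            out = model(chunk, labels=chunk)
            nll += out.loss.item() * (chunk.size(1) - 1)
            n_tokens += chunk.size(1) - 1
    return math.exp(nll / n_tokens)

print(perplexity("awq_cpm_1b_4bit", "wikitext_test.txt"))  # placeholder paths
```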

 ## Community
@@ -680,6 +864,7 @@ MBPP, instead of the hand-verified set.
 ## Deployment on mobile phones
 #### Tutorial
+

 * After INT4 quantization, MiniCPM only occupies 2GB of space, meeting the requirements of inference on end devices.
 * We have made different adaptations for different operating systems.
@@ -758,6 +943,8 @@ python demo/hf_based_demo.py --model_path
 * Using [BMTrain](https://github.com/OpenBMB/BMTrain), as well as checkpointing and ZeRO-3 (zero redundancy optimizer), we can tune all parameters of MiniCPM with a single NVIDIA GeForce RTX 3090/4090.
 * This code will be available soon.
+

 * mlx Parameter-efficient Tuning
   * environment preparation
 ```shell