Added quick navigation and new tutorials to README-en
This commit is contained in:
parent 3d18712792
commit 790ff6ba65

README-en.md (189 lines changed)
@@ -59,6 +59,16 @@ We release all model parameters for research and limited commercial use.

<p id="0"></p>

## Common Modules

| [infer](#2) | [finetune](#6) | [deployment](#4) | [quantize](#quantize) |
|-------------|------------|-----------|-----------|
|[Transformers](#Huggingface)|[Transformers](#6)|[MLC](#MLC)|[GPTQ](#gptq)|
|[vLLM](#vLLM)|[mlx_finetune](#mlx_finetune)|[llama.cpp](#llama.cpp)|[AWQ](#awq)|
|[llama.cpp](#llama.cpp)|[llama_factory](./finetune/llama_factory_example/README.md)||[bnb](#bnb)|
|[ollama](#ollama)|||[quantize_test](#quantize_test)|
|[fastllm](#fastllm)||||
|[mlx_lm](#mlx)||||
|[powerinfer](#powerinfer)||||
## Update Log

- **2024/04/11 We release [MiniCPM-V 2.0](https://huggingface.co/openbmb/MiniCPM-V-2.0), [MiniCPM-2B-128k](https://huggingface.co/openbmb/MiniCPM-2B-128k), [MiniCPM-MoE-8x2B](https://huggingface.co/openbmb/MiniCPM-MoE-8x2B) and [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16)! Click [here](https://openbmb.vercel.app/) to read our technical blog.**
- 2024/03/16 Intermediate checkpoints were released [here](https://huggingface.co/openbmb/MiniCPM-2B-history)!
@@ -101,6 +111,8 @@ We release all model parameters for research and limited commercial use.

- [Colab](https://colab.research.google.com/drive/1tJcfPyWGWA5HezO7GKLeyeIso0HyOc0l?usp=sharing)

<p id="Huggingface"></p>

#### Huggingface

##### MiniCPM-2B
@@ -169,6 +181,7 @@ res, context, _ = model.chat(
)
print(res)
```

<p id="vLLM"></p>

#### vLLM

@@ -186,6 +199,7 @@ print(res)
#### llama.cpp, Ollama, fastllm, mlx_lm Inference
We support inference with [llama.cpp](https://github.com/ggerganov/llama.cpp/), [ollama](https://github.com/ollama/ollama), [fastllm](https://github.com/ztxz16/fastllm), and [mlx_lm](https://github.com/ml-explore/mlx-examples). Thanks to [@runfuture](https://github.com/runfuture) for adapting llama.cpp and ollama.

<p id="llama.cpp"></p>

**llama.cpp**
1. [Install llama.cpp](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#build)
@@ -196,13 +210,37 @@ We have supported inference with [llama.cpp](https://github.com/ggerganov/llama.
```
For more parameter adjustments, [see this](https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md).
<p id="ollama"></p>

**ollama**

**ollama run MiniCPM-2B-dpo**
1. [Install ollama](https://github.com/ollama/ollama)
2. Run in the command line:
```
ollama run modelbest/minicpm-2b-dpo
```

**ollama: other models**
1. [Install ollama](https://github.com/ollama/ollama)
2. Download a model in gguf format: [2b-fp16](https://huggingface.co/runfuture/MiniCPM-2B-dpo-fp16-gguf), [2b-q4km](https://huggingface.co/runfuture/MiniCPM-2B-dpo-q4km-gguf), [1b-fp16](https://huggingface.co/linglingdan/MiniCPM-1b-fp16-gguf), or [1b-q4_1](https://huggingface.co/linglingdan/MiniCPM-1b-q4-1).
3. Run the following command in the command line; `model_name` can be customized:
```
touch model_name.Modelfile
```
4. Edit `model_name.Modelfile` as follows, writing the path of the gguf model after FROM:
```shell
FROM model_path/model_name.gguf
TEMPLATE """<s><USER>{{ .Prompt }}<AI>{{ .Response }}"""
PARAMETER stop "</s>"
```
5. Run the following command in the command line to create the ollama model; `ollama_model_name` can be customized, and `model_name.Modelfile` is the file created in step 3:
```shell
ollama create ollama_model_name -f model_name.Modelfile
```
6. Run the ollama model:
```shell
ollama run ollama_model_name
```
**fastllm**
1. Install [fastllm](https://github.com/ztxz16/fastllm)
2. Inference:
@@ -217,8 +255,10 @@ llm.set_device_map("cpu")
model = llm.from_hf(model, tokenizer, dtype = "float16") # dtype supports "float16", "int8", "int4"
print(model.response("<用户>Write an acrostic poem with the word MINICPM (One line per letter)<AI>", top_p=0.8, temperature=0.5, repeat_penalty=1.02))
```
<p id="mlx"></p>
|
||||
|
||||
**mlx_lm**
|
||||
|
||||
1. install mlx_lm
|
||||
```shell
|
||||
pip install mlx_lm
|
||||
@ -229,6 +269,150 @@ print(model.response("<用户>Write an acrostic poem with the word MINICPM (One
|
||||
python -m mlx_lm.generate --model mlx-community/MiniCPM-2B-sft-bf16-llama-format-mlx --prompt "hello, tell me a joke." --trust-remote-code
|
||||
```
|
||||
|
||||
#### PowerInfer
Currently, PowerInfer is tailored exclusively to the MiniCPM-S-1B model; support for other versions is not yet available. Stay tuned.
1. Make sure your CMake version is 3.17 or above. If it is already installed, you can skip this step.
```bash
# Download the installation package
sudo wget https://cmake.org/files/v3.23/cmake-3.23.0.tar.gz
# Extract the installation package
sudo tar -zxvf cmake-3.23.0.tar.gz
# Enter the extracted directory
cd cmake-3.23.0
# Configure the installation environment
sudo ./configure
sudo make -j8
# Compile and install
sudo make install
# Check the version after installation
cmake --version
# If the version number is returned, the installation was successful
# cmake version 3.23.0
```
2. Install PowerInfer:
```bash
git clone https://github.com/SJTU-IPADS/PowerInfer
cd PowerInfer
pip install -r requirements.txt # install dependencies for the Python helper scripts
```
3. Compile the CPU version of PowerInfer. If your machine only has a CPU, or if you want to run inference on the CPU, run the following commands:
```bash
cmake -S . -B build
cmake --build build --config Release
```
4. Compile the GPU version of PowerInfer. If your machine has a GPU, you can run the following commands:
```bash
cmake -S . -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release
```
5. Retrieve the sparse model:
```bash
git clone https://huggingface.co/openbmb/MiniCPM-S-1B-sft-gguf
# or
git clone https://modelscope.cn/models/OpenBMB/MiniCPM-S-1B-sft-gguf
```
6. Model inference:
```bash
cd PowerInfer
# Below is the command template. output_token_count is the maximum number of output tokens, thread_num is the number of threads, and prompt is the input prompt text.
# ./build/bin/main -m /PATH/TO/MODEL -n $output_token_count -t $thread_num -p $prompt
# Below is an example
./build/bin/main -m /root/ld/ld_model_pretrain/1b-s-minicpm/MiniCPM-S-1B-sft.gguf -n 2048 -t 8 -p '<User>hello,tell me a story please.<AI>'
```
<p id="quantize"></p>
|
||||
|
||||
## Quantize
|
||||
<p id="gptq"></p>
|
||||
|
||||
**gptq**
|
||||
1. Firstly, obtain the[minicpm_gptqd code](https://github.com/LDLINGLINGLING/AutoGPTQ/tree/minicpm_gptq)
|
||||
2. Navigate to the main directory of minicpm_gptqd ./AutoGPTQ, then in the command line, input:
|
||||
```
|
||||
pip install e .
|
||||
```
|
||||
3. Proceed to [model download] (#1) to download all files from the unquantized MiniCPM repository to a local folder; both 1b and 2b models are acceptable, as well as post-training models.
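A hedged sketch of such a download with `huggingface_hub`; the repo id and target folder below are placeholders, not values taken from this repository:
```python
# Hypothetical download helper; substitute the MiniCPM checkpoint you actually want to quantize.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="openbmb/MiniCPM-2B-sft-bf16",   # any unquantized 1b/2b MiniCPM repository
    local_dir="./MiniCPM-2B-sft-bf16",       # folder later passed as no_quantized_model_path
    local_dir_use_symlinks=False,            # copy real files rather than symlinks
)
```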
4. Enter the following command in the command line, where `no_quantized_model_path` is the path to the model downloaded in step 3, `save_path` is the path where the quantized model is saved, and `--bits` is the quantization bit width (4 or 8):
```
cd MiniCPM/quantize
python gptq_quantize.py --pretrained_model_dir no_quant_model_path --quantized_model_dir quant_save_path --bits 4
```
5. You can perform inference with ./AutoGPTQ/examples/quantization/inference.py, or refer to the previous section on using vLLM with the quantized model. With the minicpm-1b-int4 model, vLLM inference on a single 4090 runs at around 2000 tokens per second.
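A minimal vLLM sketch for the quantized checkpoint, assuming the `quant_save_path` produced above; the sampling values and the explicit `quantization="gptq"` flag are illustrative assumptions, not taken from the repository:
```python
from vllm import LLM, SamplingParams

# Load the GPTQ-quantized MiniCPM checkpoint (the path is a placeholder).
llm = LLM(model="quant_save_path", trust_remote_code=True, quantization="gptq")
params = SamplingParams(temperature=0.5, top_p=0.8, max_tokens=256)

prompt = "<用户>Write an acrostic poem with the word MINICPM (One line per letter)<AI>"
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```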
<p id="awq"></p>

**awq**
1. Modify the configuration parameters in the quantize/awq_quantize.py file according to the comments:
```python
model_path = '/root/ld/ld_model_pretrained/MiniCPM-1B-sft-bf16' # model_path or model_id
quant_path = '/root/ld/ld_project/pull_request/MiniCPM/quantize/awq_cpm_1b_4bit' # quant_save_path
quant_data_path='/root/ld/ld_project/pull_request/MiniCPM/quantize/quantize_data/wikitext' # Path to the provided quantization dataset, alpaca or wikitext under data
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" } # "w_bit": 4 or 8
quant_samples=512 # Number of samples to use for calibration
custom_data=[{'question':'What is your name?','answer':'I am the open-source mini cannon MiniCPM from OpenBMB.'}, # A custom dataset can be used
             {'question':'What are your features?','answer':'I am small, but I am strong.'}]
```
2. Under the quantize/quantize_data folder, two datasets, alpaca and wikitext, are provided as quantization calibration sets. Set quant_data_path above to the path of one of these folders.
3. If you need a custom dataset, modify the custom_data variable in quantize/awq_quantize.py, for example:
```python
custom_data=[{'question':'What symptoms does allergic rhinitis have?','answer':'Allergic rhinitis may cause nasal congestion, runny nose, headache, etc., which recur frequently. It is recommended to seek medical attention in severe cases.'},
             {'question':'What is 1+1 equal to?','answer':'It equals 2'}]
```
4. Based on the selected dataset, choose one of the following lines of code to replace line 38 in quantize/awq_quantize.py:
```python
# Quantize using wikitext
model.quantize(tokenizer, quant_config=quant_config, calib_data=load_wikitext(quant_data_path=quant_data_path))

# Quantize using alpaca
model.quantize(tokenizer, quant_config=quant_config, calib_data=load_alpaca(quant_data_path=quant_data_path))

# Quantize using a custom dataset
model.quantize(tokenizer, quant_config=quant_config, calib_data=load_cust_data(quant_data_path=quant_data_path))
```
5. Run the quantize/awq_quantize.py file; the AWQ-quantized model will be saved in the quant_path directory specified above. A hedged example of loading it back is sketched below.
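A sketch of loading the AWQ checkpoint back with AutoAWQ, not the repository's inference.py; the path, device placement, and generation settings are assumptions:
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "./awq_cpm_1b_4bit"  # the quant_path set in awq_quantize.py (placeholder)
# from_quantized places the model on GPU by default in AutoAWQ; adjust device_map if needed.
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

# MiniCPM chat-style prompt, mirroring the format used elsewhere in this README.
inputs = tokenizer("<用户>hello, tell me a joke.<AI>", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.8, temperature=0.5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```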
<p id="bnb"></p>

**bnb**
1. Modify the configuration parameters in the quantize/bnb_quantize.py file according to the comments:
```python
model_path = "/root/ld/ld_model_pretrain/MiniCPM-1B-sft-bf16" # Model path
save_path = "/root/ld/ld_model_pretrain/MiniCPM-1B-sft-bf16_int4" # Path to save the quantized model
```
2. Additional quantization parameters can be modified based on the comments and the llm.int8() algorithm (optional):
```python
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Whether to perform 4-bit quantization
    load_in_8bit=False,  # Whether to perform 8-bit quantization
    bnb_4bit_compute_dtype=torch.float16,  # Computation precision
    bnb_4bit_quant_storage=torch.uint8,  # Storage format for the quantized weights
    bnb_4bit_quant_type="nf4",  # Quantization format; nf4 (4-bit NormalFloat) is used here
    bnb_4bit_use_double_quant=True,  # Whether to use double quantization, i.e., also quantizing the zero point and scaling parameters
    llm_int8_enable_fp32_cpu_offload=False,  # Whether to allow offloading part of the model to the CPU in fp32 while the rest runs in int8
    llm_int8_has_fp16_weight=False,  # Whether mixed precision is enabled
    # llm_int8_skip_modules=["out_proj", "kv_proj", "lm_head"],  # Modules that are not quantized
    llm_int8_threshold=6.0,  # Outlier threshold in the llm.int8() algorithm; activation outliers above it are kept in higher precision
)
```
3. Run the quantize/bnb_quantize.py script; the BNB-quantized model will be saved in the directory specified by save_path:
```shell
cd MiniCPM/quantize
python bnb_quantize.py
```
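As a small follow-up sketch, assuming the save_path above, the int4 checkpoint can be loaded back with transformers; the prompt format and sampling values are illustrative:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

save_path = "/root/ld/ld_model_pretrain/MiniCPM-1B-sft-bf16_int4"  # same save_path as above
tokenizer = AutoTokenizer.from_pretrained(save_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(save_path, trust_remote_code=True, device_map="cuda:0")

# Prompt format mirrors the "<用户>...<AI>" convention used elsewhere in this README.
inputs = tokenizer("<用户>hello, tell me a joke.<AI>", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.8, temperature=0.5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```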
<p id="quantize_test"></p>

**quantize_test**
1. In the command line, navigate to the MiniCPM/quantize directory.
2. Modify awq_path, gptq_path, model_path, and bnb_path in the quantize_eval.sh file. Leave the types you do not want to test as empty strings. The following example tests only the AWQ model:
```
awq_path="/root/ld/ld_project/AutoAWQ/examples/awq_cpm_1b_4bit"
gptq_path=""
model_path=""
bnb_path=""
```
3. In the MiniCPM/quantize directory, enter the following command in the command line:
```
bash quantize_eval.sh
```
4. The terminal will display the memory usage and perplexity of the tested model.
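For context on the numbers the script reports, here is a hedged, stand-alone sketch of how perplexity and peak GPU memory can be measured for one model; it is not the logic of quantize_eval.sh, and the path and sample text are placeholders:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/any/minicpm/checkpoint"  # placeholder: fp16, AWQ, GPTQ, or bnb model
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, device_map="cuda:0")

text = "MiniCPM is an end-side large language model developed by OpenBMB."
ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    loss = model(ids, labels=ids).loss           # mean token-level cross-entropy
print("perplexity:", torch.exp(loss).item())      # ppl = exp(mean negative log-likelihood)
print("peak GPU memory (MB):", torch.cuda.max_memory_allocated() / 1024 ** 2)
```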
<p id="Community"></p>

## Community

@@ -680,6 +864,7 @@ MBPP, instead of the hand-verified set.

## Deployment on mobile phones

#### Tutorial
<p id="MLC"></p>

* After INT4 quantization, MiniCPM occupies only 2GB of space, meeting the requirements for on-device inference.
* We have made different adaptations for different operating systems.
@@ -758,6 +943,8 @@ python demo/hf_based_demo.py --model_path <hf_repo_path>
* Using [BMTrain](https://github.com/OpenBMB/BMTrain), together with checkpointing and ZeRO-3 (zero redundancy optimizer), we can tune all parameters of MiniCPM on a single NVIDIA GeForce RTX 3090/4090.
* This code will be available soon.

<p id="mlx_finetune"></p>

* mlx Parameter-efficient Tuning
* Environment preparation
```shell