diff --git a/README-en.md b/README-en.md index 2276636..d3b5959 100644 --- a/README-en.md +++ b/README-en.md @@ -1,7 +1,5 @@
-

- MiniCPM: Unveiling the Potential of End-side Large Language Models -

+

@@ -12,492 +10,439 @@

Technical Blog | +MiniCPM Wiki (in Chinese) | MiniCPM Paper | MiniCPM-V Repo | Join our discord and WeChat

-MiniCPM is an End-Side LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings (2.7B in total). +## Changelog🔥 -- MiniCPM has very close performance compared with Mistral-7B on open-sourced general benchmarks with better ability on Chinese, Mathematics and Coding after SFT. The overall performance exceeds Llama2-13B, MPT-30B, Falcon-40B, etc. -- After DPO, MiniCPM outperforms Llama2-70B-Chat, Vicuna-33B, Mistral-7B-Instruct-v0.1, Zephyr-7B-alpha, etc. on MTBench. -- MiniCPM-V 2.0, based on MiniCPM-2B, achieves state-of-the-art performance on multiple benchmarks among models under 7B parameters. It even outperforms strong Qwen-VL-Chat 9.6B, CogVLM-Chat 17.4B, and Yi-VL 34B on OpenCompass. MiniCPM-V 2.0 also shows strong OCR capability, achieving comparable performance to Gemini Pro in scene-text understanding. -- MiniCPM can be deployed and infer on smartphones, and the speed of streaming output is relatively higher than human verbal speed. MiniCPM-V has also successfully deployed multi-modal models on smartphones. -- The cost of developing based on MiniCPM is low. Parameter efficient finetuning can be conducted with a single 1080/2080 GPU and full parameter finetuning can be conducted with a 3090/4090 GPU. - -We release all model parameters for research and limited commercial use. - -- SFT and DPO version based on MiniCPM-2B: **MiniCPM-2B-SFT/DPO** -- The multi-modal model **MiniCPM-V 2.0** based on MiniCPM-2B. -- The INT4 quantized version **MiniCPM-2B-SFT/DPO-Int4** based on MiniCPM-2B-SFT/DPO -- The 128k long context version of MiniCPM-2B: **MiniCPM-2B-128k**. -- The MoE version of MiniCPM-2B: **MiniCPM-MoE-8x2B**. -- SFT version of MiniCPM-1B, a lighter-weight model: **MiniCPM-1B-SFT**. -- Mobile phone application based on MLC-LLM and LLMFarm. Both language model and multimodel model can conduct inference on smartphones. -- 30 Intermidiate [checkpoints](https://huggingface.co/openbmb/MiniCPM-2B-history) of MiniCPM-2B for academic purpose. - -### Limitations - -- Due to limitations in model size, the model may experience hallucinatory issues. As DPO model tend to generate longer response, hallucinations are more likely to occur. We will also continue to iterate and improve the MiniCPM model. -- To ensure the generality of the model for academic research purposes, we have not subject it to any identity-specific training. Meanwhile, as we use ShareGPT open-source corpus as part of the training data, the model may output identity-related information similar to the GPT series models. -- Due to the limitation of model size, the output of the model is greatly influenced by prompts, which may result in inconsistent results from multiple attempts. -- Due to limited model capacity, the model's knowledge recall may not be accurate. In the future, we will combine the RAG method to enhance the model's knowledge retention ability. +- [2024.09.05] We release [**MiniCPM3-4B**](https://huggingface.co/openbmb/MiniCPM3-4B)! This model outperforms Phi-3.5-mini-instruct and GPT-3.5-Turbo-0125 and is comparable to several models with 7B-9B parameters like Llama3.1-8B-Instruct, Qwen2-7B-Instruct, and GLM-4-9B-Chat. +- [2024.07.05] Released [MiniCPM-S-1B](https://huggingface.co/openbmb/MiniCPM-S-1B-sft)! This model achieves an average sparsity of 87.89% in the FFN layer, reducing FFN FLOPs by 84%, while maintaining downstream task performance. 
+- [2024.04.11] Released [MiniCPM-2B-128k](https://huggingface.co/openbmb/MiniCPM-2B-128k), [MiniCPM-MoE-8x2B](https://huggingface.co/openbmb/MiniCPM-MoE-8x2B) and [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16)! Click [here](https://openbmb.vercel.app/) to read our technical blog. +- [2024.03.16] Intermediate checkpoints of MiniCPM-2B were released [here](https://huggingface.co/openbmb/MiniCPM-2B-history)! +- [2024.02.01] Released [**MiniCPM-2B**](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16)! This model performs similarly to Mistral-7B on public benchmarks (with better performance in Chinese, math, and code abilities) and overall outperforms models like Llama2-13B, MPT-30B, and Falcon-40B. ## Quick Links -- [Updates](#0) -- [Downloading](#1) -- [Quick Start](#2) -- [Community](#community) -- [Benchmark](#3) -- [Deployment on Mobile Phones](#4) -- [Demo & API](#5) -- [Fine-tuning Models](#6) -- [LICENSE](#7) -- [Citation](#8) -- [Show Cases](#9) -- -

+- [Model Downloads](#model-downloads) +- [MiniCPM 3.0](#minicpm-30) + - [Evaluation Results](#evaluation-results) + - [Comprehensive Evaluation](#comprehensive-evaluation) + - [Function Calling](#function-calling) + - [Long Context](#long-context) + - [Inference](#inference) + - [HuggingFace](#huggingface) + - [vLLM](#vllm) + - [llama.cpp](#llamacpp) + - [Fine-Tuning](#fine-tuning) + - [LLaMA-Factory](#llama-factory) + - [Advanced Features](#advanced-features) + - [Function Calling](#function-calling-1) + - [Code Interpreter](#code-interpreter) +- [MiniCPM 2.0](#minicpm-20) +- [MiniCPM 1.0](#minicpm-10) -## Common Modules -The following table allows you quick access to commonly used engineering modules. If you need extensive and detailed tutorials, please click on [Tutorials]((https://modelbest.feishu.cn/wiki/D2tFw8Pcsi5CIzkaHNacLK64npg?from=from_copylink)). -| [infer](#2) | [finetune](#6) | [deployment](#4) | [quantize](#quantize) -|-------------|------------|-----------|-----------| -|[Transformers](#Huggingface)|[Transformers](#6)|[MLC](#MLC)|[GPTQ](#gptq)| -|[vLLM](#vLLM)|[mlx_finetune](#mlx_finetune)|[llama.cpp](#llama.cpp)|[AWQ](#awq)| -|[llama.cpp](#llama.cpp)|[LLaMA-Factory](./finetune/llama_factory_example/README.md)||[bnb](#bnb)| -|[ollama](#ollama)|||[quantize_test](#quantize_test)| -|[fastllm](#fastllm)|||| -|[mlx_lm](#mlx)|||| -|[powerinfer](#powerinfer)|||| -## Update Log -- **2024/04/11 We release [MiniCPM-V 2.0](https://huggingface.co/openbmb/MiniCPM-V-2.0), [MiniCPM-2B-128k](https://huggingface.co/openbmb/MiniCPM-2B-128k), [MiniCPM-MoE-8x2B](https://huggingface.co/openbmb/MiniCPM-MoE-8x2B) and [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16)! Click [here](https://openbmb.vercel.app/) to read our technical blog.** -- 2024/03/16 Intermediate checkpoints were released [here](https://huggingface.co/openbmb/MiniCPM-2B-history)! -- 2024/02/13 We support llama.cpp -- 2024/02/09 We have included a [Community](#community) section in the README to encourage support for MiniCPM from the open-source community. -- 2024/02/08 We updated the [llama-format model weights](#llamaformat), which can be loaded into LlamaModel directly, making it more convenient for everyone to use our model quickly. -- 2024/02/01 Initial release. +## Model Downloads -

- -## Downloading - -* Language Model - - | HuggingFace | ModelScope | WiseModel | - |-------------|------------|-----------| - |[MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16)|[MiniCPM-2B-sft-bf16](https://modelscope.cn/models/OpenBMB/miniCPM-bf16)|[MiniCPM-2B-sft-bf16](https://wisemodel.cn/models/OpenBMB/miniCPM-bf16)| - |[MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16)|[MiniCPM-2B-dpo-bf16](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-dpo-bf16/summary)|[MiniCPM-2B-dpo-bf16](https://wisemodel.cn/models/OpenBMB/MiniCPM-2B-dpo-bf16)| + | HuggingFace | ModelScope | + |-------------|------------| + |[MiniCPM3-4B](https://huggingface.co/openbmb/MiniCPM3-4B)| + |[MiniCPM-2B-sft](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16)|[MiniCPM-2B-sft](https://modelscope.cn/models/OpenBMB/miniCPM-bf16)| + |[MiniCPM-2B-dpo](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16)|[MiniCPM-2B-dpo](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-dpo-bf16/summary)| |[MiniCPM-2B-128k](https://huggingface.co/openbmb/MiniCPM-2B-128k) |[MiniCPM-2B-128k](https://modelscope.cn/models/openbmb/MiniCPM-2B-128k/summary)| |[MiniCPM-MoE-8x2B](https://huggingface.co/openbmb/MiniCPM-MoE-8x2B) |[MiniCPM-MoE-8x2B](https://modelscope.cn/models/OpenBMB/MiniCPM-MoE-8x2B)| - |[MiniCPM-1B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16) | [MiniCPM-1B-sft-bf16](https://modelscope.cn/models/OpenBMB/MiniCPM-1B-sft-bf16) | + |[MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16) | [MiniCPM-1B](https://modelscope.cn/models/OpenBMB/MiniCPM-1B-sft-bf16) | + |[MiniCPM-S-1B](https://huggingface.co/openbmb/MiniCPM-S-1B-sft)|[MiniCPM-S-1B](https://modelscope.cn/models/OpenBMB/MiniCPM-S-1B-sft)| - Note: More model versions can be found [here](https://huggingface.co/collections/openbmb/minicpm-2b-65d48bf958302b9fd25b698f). - -* Multimodel Model - - | HuggingFace | ModelScope | WiseModel | - |-------------|------------|-----------| - | [MiniCPM-V 2.0](https://huggingface.co/openbmb/MiniCPM-V-2) | [MiniCPM-V 2.0](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2) | - | [MiniCPM-V](https://huggingface.co/openbmb/MiniCPM-V) | [MiniCPM-V](https://modelscope.cn/models/OpenBMB/MiniCPM-V/) | [MiniCPM-V](https://wisemodel.cn/models/OpenBMB/MiniCPM-V) | - | [OmniLMM-12B](https://huggingface.co/openbmb/OmniLMM-12B) | [OmniLMM-12B](https://modelscope.cn/models/OpenBMB/OmniLMM-12B) | [OmniLMM-12B](https://wisemodel.cn/models/OpenBMB/OmniLMM-12B) | +Note: More model versions can be found [here](https://huggingface.co/collections/openbmb/minicpm-2b-65d48bf958302b9fd25b698f). +## MiniCPM 3.0 -

+MiniCPM 3.0 is a language model with 4 billion parameters. Compared to MiniCPM 1.0/2.0, it offers more comprehensive features and a significant improvement in overall capabilities. Its performance on most evaluation benchmarks rivals or even surpasses many models with 7B-9B parameters. -## Quick Start +* **Supports Function Call🛠️ and Code Interpreter💻**: Achieved SOTA among models with fewer than 9B parameters on the [Berkeley Function Calling Leaderboard (BFCL)](https://gorilla.cs.berkeley.edu/leaderboard.html), outperforming GLM-4-9B-Chat and Qwen2-7B-Instruct. +* **Exceptional Reasoning Ability🧮**: In terms of math abilities, it outperforms GPT-3.5-Turbo and several 7B-9B models on [MathBench](https://open-compass.github.io/MathBench/). On the highly challenging [LiveCodeBench](https://livecodebench.github.io/), it surpasses Llama3.1-8B-Instruct. +* **Outstanding Instruction-Following in English and Chinese🤖**: Exceeds GLM-4-9B-Chat and Qwen2-7B-Instruct on English instruction following with [IFEval](https://huggingface.co/datasets/google/IFEval) and on Chinese instruction following with [FollowBench-zh](https://huggingface.co/datasets/YuxinJiang/FollowBench). +* **Long Context Capability**: Natively supports 32k context length, with flawless performance. We introduce the **LLM x MapReduce** approach, theoretically enabling processing of context lengths up to infinity. +* **RAG Capability**:We release [MiniCPM RAG Suite](https://huggingface.co/collections/openbmb/minicpm-rag-suite-66d976b4204cd0a4f8beaabb). Based on the MiniCPM series models, [MiniCPM-Embedding](https://huggingface.co/openbmb/MiniCPM-Embedding) and [MiniCPM-Reranker](https://huggingface.co/openbmb/MiniCPM-Reranker) achieve SOTA performance on Chinese and Chinese-English cross-lingual retrieval tests. Specifically designed for the RAG scenario, [MiniCPM3-RAG-LoRA](https://huggingface.co/openbmb/MiniCPM3-RAG-LoRA) outperforms models like Llama3-8B and Baichuan2-13B on multiple tasks, such as open-domain question answering. -#### Online +## Evaluation Results -- [Colab](https://colab.research.google.com/drive/1tJcfPyWGWA5HezO7GKLeyeIso0HyOc0l?usp=sharing) +### Comprehensive Evaluation -

| Benchmarks | Qwen2-7B-Instruct | GLM-4-9B-Chat | Gemma2-9B-it | Llama3.1-8B-Instruct | GPT-3.5-Turbo-0125 | Phi-3.5-mini-Instruct(3.8B) | MiniCPM3-4B |
|---|---|---|---|---|---|---|---|
| **English** | | | | | | | |
| MMLU | 70.5 | 72.4 | 72.6 | 69.4 | 69.2 | 68.4 | 67.2 |
| BBH | 64.9 | 76.3 | 65.2 | 67.8 | 70.3 | 68.6 | 70.2 |
| MT-Bench | 8.41 | 8.35 | 7.88 | 8.28 | 8.17 | 8.60 | 8.41 |
| IFEVAL (Prompt Strict-Acc.) | 51.0 | 64.5 | 71.9 | 71.5 | 58.8 | 49.4 | 68.4 |
| **Chinese** | | | | | | | |
| CMMLU | 80.9 | 71.5 | 59.5 | 55.8 | 54.5 | 46.9 | 73.3 |
| CEVAL | 77.2 | 75.6 | 56.7 | 55.2 | 52.8 | 46.1 | 73.6 |
| AlignBench v1.1 | 7.10 | 6.61 | 7.10 | 5.68 | 5.82 | 5.73 | 6.74 |
| FollowBench-zh (SSR) | 63.0 | 56.4 | 57.0 | 50.6 | 64.6 | 58.1 | 66.8 |
| **Mathematics** | | | | | | | |
| MATH | 49.6 | 50.6 | 46.0 | 51.9 | 41.8 | 46.4 | 46.6 |
| GSM8K | 82.3 | 79.6 | 79.7 | 84.5 | 76.4 | 82.7 | 81.1 |
| MathBench | 63.4 | 59.4 | 45.8 | 54.3 | 48.9 | 54.9 | 65.6 |
| **Coding** | | | | | | | |
| HumanEval+ | 70.1 | 67.1 | 61.6 | 62.8 | 66.5 | 68.9 | 68.3 |
| MBPP+ | 57.1 | 62.2 | 64.3 | 55.3 | 71.4 | 55.8 | 63.2 |
| LiveCodeBench | 22.2 | 20.2 | 19.2 | 20.4 | 24.0 | 19.6 | 22.6 |
| **Tool Use** | | | | | | | |
| BFCL | 71.6 | 70.1 | 19.2 | 73.3 | 75.4 | 48.4 | 76.0 |
| **Overall** | | | | | | | |
| Average | 65.3 | 65.0 | 57.9 | 60.8 | 61.0 | 57.2 | 66.3 |

-#### Huggingface
+#### Function Calling

-##### MiniCPM-2B
+We evaluate the function calling capability of MiniCPM3 on [Berkeley Function Calling Leaderboard (BFCL)](https://gorilla.cs.berkeley.edu/leaderboard.html). MiniCPM3-4B outperforms several models with 7B-9B parameters on this leaderboard, surpassing GPT-3.5-Turbo-0125.

-* Install `transformers>=4.36.0` and `accelerate`,run the following python code.

| Model | Overall Accuracy | AST Summary | Exec Summary | Irrelevance Detection | Relevance Detection |
|---|---|---|---|---|---|
| MiniCPM3-4B | 76.03% | 68.55% | 85.54% | 53.71% | 90.24% |
| Llama3.1-8B-Instruct | 73.28% | 64.61% | 86.48% | 43.12% | 85.37% |
| Qwen2-7B-Instruct | 71.61% | 65.71% | 79.57% | 44.70% | 90.24% |
| GLM-4-9B-Chat | 70.08% | 60.69% | 80.02% | 55.02% | 82.93% |
| Phi-3.5-mini-instruct | 48.44% | 38.89% | 54.04% | 46.78% | 65.85% |
| Gemma2-9B-it | 19.18% | 5.41% | 18.50% | 88.88% | 7.32% |

+
+#### Long Context Capability
+
+In the [Needle in a Haystack](https://github.com/gkamradt/LLMTest_NeedleInAHaystack) test with a context length of 32k, the results are shown as follows:
+
+![needle](assets/eval_needle.jpeg)
+
+### Inference
+
+#### Huggingface

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(0)

-path = 'openbmb/MiniCPM-2B-dpo-bf16'
+path = 'openbmb/MiniCPM3-4B'
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True)

-responds, history = model.chat(tokenizer, "Which city is the capital of China?", temperature=0.8, top_p=0.8)
+responds, history = model.chat(tokenizer, "Write an article about Artificial Intelligence.", temperature=0.7, top_p=0.7)
print(responds)
```

-* Examples
-
-```shell
-The capital city of China is Beijing. Beijing is not only the political center of China but also a cultural and economic hub. It is known for its rich history and numerous landmarks, such as the Great Wall, the Forbidden City, and the Temple of Heaven. The city is also home to the National Stadium, also known as the "Bird's Nest," and the National Aquatics Center, or "Water Cube." Beijing is a significant city in China, with a population of over 21 million people.
-```

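The `model.chat` helper above comes from the model's remote code. For projects that prefer to stay on the generic `transformers` API, the following is only a minimal sketch of the same generation through `apply_chat_template` and `generate`; it assumes the tokenizer bundled with MiniCPM3-4B ships a chat template, and the sampling settings simply mirror the example above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

path = 'openbmb/MiniCPM3-4B'
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16,
                                             device_map='cuda', trust_remote_code=True)

# Build the prompt with the chat template bundled with the tokenizer.
messages = [{"role": "user", "content": "Write an article about Artificial Intelligence."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                          return_tensors="pt").to(model.device)

# Sample a completion; the settings mirror the chat() example above.
outputs = model.generate(input_ids, max_new_tokens=1024, do_sample=True,
                         temperature=0.7, top_p=0.7)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```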
- -##### MiniCPM-2B (Llama Format) -To facilitate ease of use, we have converted the model weights of MiniCPM to adapt to the structure of the LLaMA model: -```python -import torch -from transformers import LlamaTokenizerFast, LlamaForCausalLM -model_path = "openbmb/MiniCPM-2B-dpo-bf16-llama-format" -tokenizer = LlamaTokenizerFast.from_pretrained(model_path) -model = LlamaForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True) - -prompt="Now you act like a terminal situated within a beginner's C++ practice repository folder, please provide the output for the command: `ls -l`" -input_ids = tokenizer.encode("{}".format(prompt), return_tensors='pt', add_special_tokens=True).cuda() -responses = model.generate(input_ids, temperature=0.3, top_p=0.8, repetition_penalty=1.02, max_length=1024) -responses = tokenizer.decode(responses[0], skip_special_tokens=True) -print(responses) -``` - - -##### MiniCPM-V - -```python -import torch -from PIL import Image -from transformers import AutoModel, AutoTokenizer - -model = AutoModel.from_pretrained('openbmb/MiniCPM-V', trust_remote_code=True) -tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V', trust_remote_code=True) -model.eval().cuda() - -image = Image.open('xx.jpg').convert('RGB') -question = 'What is in the image?' -msgs = [{'role': 'user', 'content': question}] - -res, context, _ = model.chat( - image=image, - msgs=msgs, - context=None, - tokenizer=tokenizer, - sampling=True, - temperature=0.7 -) -print(res) -``` -

- -#### vLLM - -* Install [vLLM](https://github.com/vllm-project/vllm) +#### vLLM +* Install vllm ```shell - pip install "vllm>=0.4.1" + pip install git+https://github.com/OpenBMB/vllm.git@minicpm3 ``` - -* Examples - ```shell - python inference/inference_vllm.py --model_path --prompt_path prompts/prompt_demo.txt - ``` - - -#### llama.cpp, Ollama, fastllm, mlx_lm Inference -We have supported inference with [llama.cpp](https://github.com/ggerganov/llama.cpp/), [ollama](https://github.com/ollama/ollama), [fastllm](https://github.com/ztxz16/fastllm), [mlx_lm](https://github.com/ml-explore/mlx-examples). Thanks to [@runfuture](https://github.com/runfuture) for the adaptation of llama.cpp and ollama. - -

- -**llama.cpp** -1. [install llama.cpp](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#build) -2. download model in gguf format. [link-fp16](https://huggingface.co/runfuture/MiniCPM-2B-dpo-fp16-gguf) [link-q4km](https://huggingface.co/runfuture/MiniCPM-2B-dpo-q4km-gguf) -3. In command line: -``` -./main -m ../../model_ckpts/download_from_hf/MiniCPM-2B-dpo-fp16-gguf.gguf --prompt "<用户>Write an acrostic poem with the word MINICPM (One line per letter)" --temp 0.3 --top-p 0.8 --repeat-penalty 1.05 -``` -More parameters adjustment [see this](https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md) - -

- -**ollama run MiniCPM-2B-dpo** -1. [install ollama](https://github.com/ollama/ollama) -2. In command line: -``` -ollama run modelbest/minicpm-2b-dpo -``` - -**ollama other models** -1. [Install ollama](https://github.com/ollama/ollama) -2. Download models in gguf format. [Download link for 2b-fp16 format](https://huggingface.co/runfuture/MiniCPM-2B-dpo-fp16-gguf) [Download link for 2b-q4km format](https://huggingface.co/runfuture/MiniCPM-2B-dpo-q4km-gguf) [Download link for 1b-fp16 format](https://huggingface.co/linglingdan/MiniCPM-1b-fp16-gguf) [Download link for 1b-qr_1 format](https://huggingface.co/linglingdan/MiniCPM-1b-q4-1) -3. Run the following command in the command line, `model_name` can be customized:: -``` -touch model_name.Modelfile -``` -4. Edit the content of `model_name.Modelfile` as follows, write the path of the gguf model after the FROM space: -```shell -FROM model_path/model_name.gguf -TEMPLATE """{{ .Prompt }}{{ .Response }}""" -PARAMETER stop "<\s>" -``` -5. Run the following command in the command line to create an ollama model, `ollama_model_name` can be customized, `model_name.Modelfile` should follow the naming from step 3: -```shell -ollama create ollama_model_name -f model_name.Modelfile -``` -6. Run the ollama model: -```sehll -ollama run ollama_model_name -``` - -**fastllm** -1. install [fastllm](https://github.com/ztxz16/fastllm) -2. inference -```python -import torch -from transformers import AutoTokenizer, LlamaTokenizerFast, AutoModelForCausalLM -path = 'openbmb/MiniCPM-2B-dpo-fp16' -tokenizer = AutoTokenizer.from_pretrained(path) -model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float16, device_map='cuda', trust_remote_code=True) -from fastllm_pytools import llm -llm.set_device_map("cpu") -model = llm.from_hf(model, tokenizer, dtype = "float16") # dtype支持 "float16", "int8", "int4" -print(model.response("<用户>Write an acrostic poem with the word MINICPM (One line per letter)", top_p=0.8, temperature=0.5, repeat_penalty=1.02)) -``` -

- -**mlx_lm** - -1. install mlx_lm - ```shell - pip install mlx_lm - ``` -2. download model weights [MiniCPM-2B-sft-bf16-llama-format-mlx](https://huggingface.co/mlx-community/MiniCPM-2B-sft-bf16-llama-format-mlx) -3. inference - ```shell - python -m mlx_lm.generate --model mlx-community/MiniCPM-2B-sft-bf16-llama-format-mlx --prompt "hello, tell me a joke." --trust-remote-code - ``` - -#### powerinfer -Currently, PowerInfer is exclusively tailored for the MiniCPM-S-1B model; support for other versions is not yet available, stay tuned. -1. Ensure your cmake version is 3.17 or above. If you have already installed it, you can skip this step. -```bash - # Download the installation package - sudo wget https://cmake.org/files/v3.23/cmake-3.23.0.tar.gz - # Extract the installation package - sudo tar -zxvf cmake-3.23.0.tar.gz - # Configure the installation environment - sudo ./configure - sudo make -j8 - # Compile and install - sudo make install - # Check the version after installation - cmake --version - # If the version number is returned, the installation was successful - # cmake version 3.23.0 -``` -2. Install PowerInfer:: -```bash - git clone https://github.com/SJTU-IPADS/PowerInfer - cd PowerInfer - pip install -r requirements.txt # install Python helpers' dependencies -``` -3. Compile the CPU version of PowerInfer. If your machine only has a CPU, or if you want to perform inference using the CPU, run the following commands:: -```bash - cmake -S . -B build - cmake --build build --config Release -``` -4. Compile the GPU version of PowerInfer. If your machine has a GPU, you can run the following commands: -```bash - cmake -S . -B build -DLLAMA_CUBLAS=ON - cmake --build build --config Release -``` -5. Retrieve the sparse model: -```bash -git clone https://huggingface.co/openbmb/MiniCPM-S-1B-sft-gguf/tree/main -#or -git clone https://modelscope.cn/models/OpenBMB/MiniCPM-S-1B-sft-gguf -``` -6. Model Inference: -```bash -cd PowerInfer -# Below is the command template. output_token_count refers to the maximum output tokens, thread_num is the number of threads, and prompt is the input prompt text. -#./build/bin/main -m /PATH/TO/MODEL -n $output_token_count -t $thread_num -p $prompt -# Below is an example -./build/bin/main -m /root/ld/ld_model_pretrain/1b-s-minicpm/MiniCPM-S-1B-sft.gguf -n 2048 -t 8 -p 'hello,tell me a story please.' -``` - -

- -## Quantize -

- -**gptq** -1. Firstly, obtain the[minicpm_gptqd code](https://github.com/LDLINGLINGLING/AutoGPTQ/tree/minicpm_gptq) -2. Navigate to the main directory of minicpm_gptqd ./AutoGPTQ, then in the command line, input: - ``` - pip install e . - ``` -3. Proceed to [model download] (#1) to download all files from the unquantized MiniCPM repository to a local folder; both 1b and 2b models are acceptable, as well as post-training models. -4. Input the following command in the command line, where `no_quantized_model_path` is the path to the model downloaded in step 3, `save_path` is the path to save the quantized model, and `--bits` is the quantization bit width which can be set to either 4 or 8. - ``` - cd Minicpm/quantize - python gptq_quantize.py --pretrained_model_dir no_quant_model_path --quantized_model_dir quant_save_path --bits 4 - ``` -5. You can perform inference using ./AutoGPTQ/examples/quantization/inference.py, or refer to the previous section on using vllm with the quantized model. For the minicpm-1b-int4 model, vllm inference on a single 4090 card operates at around 2000 tokens per second - -

- -**awq** -1. Modify the configuration parameters in the quantize/awq_quantize.py file according to the comments: -```python - model_path = '/root/ld/ld_model_pretrained/MiniCPM-1B-sft-bf16' # model_path or model_id - quant_path = '/root/ld/ld_project/pull_request/MiniCPM/quantize/awq_cpm_1b_4bit' # quant_save_path - quant_data_path='/root/ld/ld_project/pull_request/MiniCPM/quantize/quantize_data/wikitext' # Input the provided quantization dataset, alpaca or wikitext under data - quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" } # "w_bit":4 or 8 - quant_samples=512 # Number of samples to use for calibration - custom_data=[{'question':'What is your name.','answer':'I am the open-source mini cannon MiniCPM from OpenMBMB.'}, # Custom dataset is available - {'question':'What are your features.','answer':'I am small, but I am strong.'}] -``` -2. Under the quantize/quantize_data folder, two datasets, alpaca and wiki_text, are provided as quantization calibration sets. Modify the aforementioned quant_data_path to the path of one of these folders. -3. If you need a custom dataset, modify the custom_data variable in quantize/awq_quantize.py, such as: - ```python - custom_data=[{'question':'What symptoms does allergic rhinitis have?','answer':'Allergic rhinitis may cause nasal congestion, runny nose, headache, etc., which recur frequently. It is recommended to seek medical attention in severe cases.'}, - {'question':'What is 1+1 equal to?','answer':'It equals 2'}] - ``` -4. Based on the selected dataset, choose one of the following lines of code to replace line 38 in quantize/awq_quantize.py:: +* Inference ```python - # Quantize using wikitext - model.quantize(tokenizer, quant_config=quant_config, calib_data=load_wikitext(quant_data_path=quant_data_path)) - # Quantize using alpaca - model.quantize(tokenizer, quant_config=quant_config, calib_data=load_alpaca(quant_data_path=quant_data_path)) - # Quantize using a custom dataset - model.quantize(tokenizer, quant_config=quant_config, calib_data=load_cust_data(quant_data_path=quant_data_path)) + from transformers import AutoTokenizer + from vllm import LLM, SamplingParams + + model_name = "openbmb/MiniCPM3-4B" + prompt = [{"role": "user", "content": "Write an article about Artificial Intelligence."}] + + tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) + input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True) + + llm = LLM(model=model_name, + trust_remote_code=True, + tensor_parallel_size=1 + ) + sampling_params = SamplingParams(top_p=0.7, temperature=0.7, max_tokens=1024) + + outputs = llm.generate(prompts=input_text, sampling_params=sampling_params) + + print(outputs[0].outputs[0].text) ``` -5. Run the quantize/awq_quantize.py file; the AWQ quantized model will be available in the specified quan_path directory. - -

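Beyond the offline `LLM` API shown in the vLLM example, vLLM can also expose an OpenAI-compatible HTTP endpoint. The snippet below is only a sketch: it assumes the MiniCPM3 branch keeps vLLM's standard server entry point (e.g. `python -m vllm.entrypoints.openai.api_server --model openbmb/MiniCPM3-4B --trust-remote-code`) and that the `openai` Python client is installed; the port and sampling values are illustrative.

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server
# (assumed to be listening on port 8000; any non-empty API key works).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openbmb/MiniCPM3-4B",
    messages=[{"role": "user", "content": "Write an article about Artificial Intelligence."}],
    temperature=0.7,
    top_p=0.7,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```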
- -**bnb** -1. Modify the configuration parameters in the quantize/bnb_quantize.py file according to the comments: -```python -model_path = "/root/ld/ld_model_pretrain/MiniCPM-1B-sft-bf16" # Model path -save_path = "/root/ld/ld_model_pretrain/MiniCPM-1B-sft-bf16_int4" # Path to save the quantized model -``` -2. Additional quantization parameters can be modified based on the comments and the llm.int8() algorithm (optional): -```python -quantization_config = BitsAndBytesConfig( - load_in_4bit=True, # Whether to perform 4-bit quantization - load_in_8bit=False, # Whether to perform 8-bit quantization - bnb_4bit_compute_dtype=torch.float16, # Computation precision setting - bnb_4bit_quant_storage=torch.uint8, # Storage format for quantized weights - bnb_4bit_quant_type="nf4", # Quantization format, Gaussian-distributed int4 used here - bnb_4bit_use_double_quant=True, # Whether to use double quantization, i.e., quantizing zeropoint and scaling parameters - llm_int8_enable_fp32_cpu_offload=False, # Whether LLM uses int8, parameters saved on the CPU use fp32 - llm_int8_has_fp16_weight=False, # Whether mixed precision is enabled - #llm_int8_skip_modules=["out_proj", "kv_proj", "lm_head"], # Modules that do not undergo quantization - llm_int8_threshold=6.0, # Outlier value in the llm.int8() algorithm, determines whether quantization is performed based on this value -) -``` -3. Run the quantize/bnb_quantize.py script, and the BNB quantized model will be available in the directory specified by save_path. -```python -cd MiniCPM/quantize -python bnb_quantize.py -``` -

- -**quantize_test** -1. In the command line, navigate to the MiniCPM/quantize directory. -2. Modify the awq_path, gptq_path, and awq_path in the quantize_eval.sh file. Keep the types you don't want to test as empty strings. The following example indicates testing only the AWQ model: +#### llama.cpp +* Install llama.cpp + ```shell + git clone https://github.com/OpenBMB/llama.cpp.git + git checkout minicpm3 + cd llama.cpp + make ``` - awq_path="/root/ld/ld_project/AutoAWQ/examples/awq_cpm_1b_4bit" - gptq_path="" - model_path="" - bnb_path="" +* Create model directory + ```shell + cd llama.cpp/models + mkdir Minicpm3 ``` -3. In the MiniCPM/quantize directory, enter the following command in the command line: +* Download MiniCPM3 into `llama.cpp/models/Minicpm3` + ```shell + cd llama.cpp/models/Minicpm3 + git clone https://huggingface.co/openbmb/MiniCPM3-4B ``` - bash quantize_eval.sh +* Convert the model to gguf format,and quantize it: + ```python + python3 -m pip install -r requirements.txt + python3 convert-hf-to-gguf.py models/Minicpm3/ --outfile /your/path/llama.cpp/models/Minicpm3/CPM-4B-F16.gguf + ./llama-quantize ./models/Minicpm3/CPM-4B-F16.gguf ./models/Minicpm3/ggml-model-Q4_K_M.gguf Q4_K_M + ``` +* Inference + ```shell + ./llama-cli -c 1024 -m ./models/Minicpm/ggml-model-Q4_K_M.gguf -n 1024 --top-p 0.7 --temp 0.7 --prompt "<|im_start|>user\nWrite an article about Artificial Intelligence.<|im_end|>\n<|im_start|>assistant\n" ``` -4. The window will display the memory usage and perplexity of the model. -

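Once a quantized GGUF file has been produced, it can also be driven from Python through the `llama-cpp-python` bindings. This is only a sketch: the model path is illustrative, and MiniCPM3 support may require bindings built against the OpenBMB `minicpm3` branch rather than upstream llama.cpp.

```python
from llama_cpp import Llama

# Load the quantized GGUF produced by llama-quantize (path is illustrative).
llm = Llama(model_path="./models/Minicpm3/ggml-model-Q4_K_M.gguf", n_ctx=1024)

# The chat template embedded in the GGUF metadata is applied automatically.
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write an article about Artificial Intelligence."}],
    temperature=0.7,
    top_p=0.7,
    max_tokens=1024,
)
print(result["choices"][0]["message"]["content"])
```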
+### Fine-Tuning -## Community -- [xtuner](https://github.com/InternLM/xtuner): [More efficient fine-tuning options of MiniCPM](https://modelbest.feishu.cn/wiki/AIU3wbREcirOm9kkvd7cxujFnMb#AMdXdzz8qoadZhxU4EucELWznzd) -- [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory.git): [MiniCPM fine-tuning one-click solution](https://modelbest.feishu.cn/wiki/AIU3wbREcirOm9kkvd7cxujFnMb#BAWrdSjXuoFvX4xuIuzc8Amln5E) -- [ChatLLM](https://github.com/foldl/chatllm.cpp): [Run MiniCPM on CPU](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16/discussions/2#65c59c4f27b8c11e43fc8796) +#### LLaMA-Factory + +We have supported fine-tuning MiniCPM3 using [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory). For usage instructions, refer to [LLaMA-Factory Fine-tuning](https://modelbest.feishu.cn/docx/Z7USdW4lloZzkZxQ14icJ3senjb?from=from_copylink)." + +### Advanced Features + +#### Function calling + +We provide example code for using function calls with MiniCPM3, see [`demo/function_call.py`](./demo/function_calling.py). -

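As a rough illustration of how a tool-call request can be assembled, the sketch below passes an OpenAI-style function schema through the Hugging Face chat template. It assumes the chat template bundled with MiniCPM3-4B accepts a `tools` argument; the `get_weather` schema is made up for the example, and the complete end-to-end flow (including parsing the returned call and feeding tool results back) is in `demo/function_calling.py`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

path = "openbmb/MiniCPM3-4B"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16,
                                             device_map="cuda", trust_remote_code=True)

# A made-up tool schema in the OpenAI-style format accepted by chat templates.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Query the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather like in Beijing?"}]
input_ids = tokenizer.apply_chat_template(messages, tools=tools,
                                          add_generation_prompt=True,
                                          return_tensors="pt").to(model.device)
outputs = model.generate(input_ids, max_new_tokens=256, do_sample=False)
# The raw response should contain the model's tool-call markup, to be parsed downstream.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```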
+#### Code Interpreter -## Evaluation results +We provide example code for using the code interpreter with MiniCPM3, see [`demo/code_interpreter.py`](./demo/code_interpreter.py). -#### Evaluation Settings +Below is a demo: -* Since it is difficult to standardize the evaluation of LLMs and there is no public prompt and test code for a large number of evaluations, we can only try our best to make it suitable for all types of models in terms of specific evaluation methods. -* Overall, we use a unified prompt input for testing, and adjust the input according to the corresponding template for each model. -* **The evaluation scripts and prompts have been open-sourced in our Github repository, and we welcome more developers to continuously improve our evaluation methods.** - * For the text evaluation part, we use our open source large model capability evaluation framework [UltraEval](https://github.com/OpenBMB/UltraEval). The following is the open source model reproduction process: - * install UltraEval - ```shell - git clone https://github.com/OpenBMB/UltraEval.git - cd UltraEval - pip install -e . - ``` - * Download the relevant data and unzip it for processing - ```shell - wget -O RawData.zip "https://cloud.tsinghua.edu.cn/f/71b5232264ae4833a4d0/?dl=1" - unzip RawData.zip - python data_process.py - ``` - * Execute evaluation scripts (templates are provided and can be customized) - ```shell - bash run_eval.sh - ``` +![code_interpreter](./assets/code_interpreter.gif) -#### Deployment mode -* Because MiniCPM uses the structure of Mup, which is slightly different from existing models in terms of specific computations, we have based the implementation of our model on the vllm=0.2.2 version. -* **For non-MiniCPM models, we directly sampled the latest version of vllm=0.2.7 for inference.** +## MiniCPM 2.0 +
+Click to view details about MiniCPM2.0 -#### Evaluation method +### Introdution +MiniCPM 2.0 series upgrade MiniCPM in multiple dimensions, including: +- [MiniCPM-2B-128k](https://huggingface.co/openbmb/MiniCPM-2B-128k):Extend the length of MiniCPM-2B context window to 128k, outperform larger models such as ChatGLM3-6B-128k、Yi-6B-200k on InfiniteBench. +- [MiniCPM-MoE-8x2B](https://huggingface.co/openbmb/MiniCPM-MoE-8x2B):Upcycling from MiniCPM-2B. Compared to MiniCPM-2B, the overall performance improves by an average of 4.5pp. +- [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16): 60% inference cost reduction compared with MiniCPM-2B, while still showing better overall performance than LLaMA2-13B. +- [MiniCPM-S-1B](https://huggingface.co/openbmb/MiniCPM-S-1B-sft): The FFN layer achieves an average sparsity of 87.89% and reduces FFN FLOPs by 84%, while maintaining no performance loss in downstream tasks. Combined with the PowerInfer, MiniCPM-S-1B inferece speed increase is approximately 2.8x. -* For the QA task (multiple-choice task), we chose to test in two ways: - * PPL: The options are used as a continuation of the question generation and the answer selection is based on the PPL of each option; - * The second is to generate the answer options directly. -* For different models, the results obtained by these two approaches vary widely. the results on both MiniCPM models are closer, while models such as Mistral-7B-v0.1 perform better on PPL and worse on direct generation. -* In the specific evaluation, we take the higher score of the two evaluation methods as the final result, so as to ensure the fairness of the comparison (* in the following table indicates the PPL). +### Evaluation Results -#### Text evaluation - -|Model|Average Score|Average Score in English|Average Score in Chinese|C-Eval|CMMLU|MMLU|HumanEval|MBPP|GSM8K|MATH|BBH|ARC-E|ARC-C|HellaSwag| -|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| -|Llama2-7B|35.40|36.21|31.765|32.42|31.11|44.32|12.2|27.17|13.57|1.8|33.23|75.25|42.75|75.62*| -|Qwen-7B|49.46|47.19|59.655|58.96|60.35|57.65|17.07|42.15|41.24|5.34|37.75|83.42|64.76|75.32*| -|Deepseek-7B|39.96|39.15|43.635|42.82|44.45|47.82|20.12|41.45|15.85|1.53|33.38|74.58*|42.15*|75.45*| -|Mistral-7B|48.97|49.96|44.54|46.12|42.96|62.69|27.44|45.2|33.13|5.0|41.06|83.92|70.73|80.43*| -|Llama2-13B|41.48|42.44|37.19|37.32|37.06|54.71|17.07|32.55|21.15|2.25|37.92|78.87*|58.19|79.23*| -|MPT-30B|38.17|39.82|30.715|29.34|32.09|46.56|21.95|35.36|10.31|1.56|38.22|78.66*|46.08*|79.72*| -|Falcon-40B|43.62|44.21|40.93|40.29|41.57|53.53|24.39|36.53|22.44|1.92|36.24|81.94*|57.68|83.26*| -|MiniCPM-2B|52.33|52.6|51.1|51.13|51.07|53.46|50.00|47.31|53.83|10.24|36.87|85.44|68.00|68.25| - -|Model|Average Score|Average Score in English|Average Score in Chinese|C-Eval|CMMLU|MMLU|HumanEval|MBPP|GSM8K|MATH|BBH|ARC-E|ARC-C|HellaSwag| -|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| -|TinyLlama-1.1B|25.36|25.55|24.525|25.02|24.03|24.3|6.71|19.91|2.27|0.74|28.78|60.77*|28.15*|58.33*|Qwen-1.8B|34.72|31.87|47.565|49.81|45.32|43.37|7.93|17.8|19.26|2.42|29.07|63.97*|43.69|59.28*| -|Qwen-1.8B|34.72|31.87|47.565|49.81|45.32|43.37|7.93|17.8|19.26|2.42|29.07|63.97*|43.69|59.28*| -|Gemini Nano-3B|-|-|-|-|-|-|-|27.2(report)|22.8(report)|-|42.4(report)|-|-|-| -|StableLM-Zephyr-3B|43.46|46.31|30.615|30.34|30.89|45.9|35.37|31.85|52.54|12.49|37.68|73.78|55.38|71.87*| -|Phi-2-2B|48.84|54.41|23.775|23.37|24.18|52.66|47.56|55.04|57.16|3.5|43.39|86.11|71.25|73.07*| 
-|MiniCPM-2B|52.33|52.6|51.1|51.13|51.07|53.46|50.00|47.31|53.83|10.24|36.87|85.44|68.00|68.25| - -|Model|Average Score|Average Score in English|Average Score in Chinese|C-Eval|CMMLU|MMLU|HumanEval|MBPP|GSM8K|MATH|BBH|ARC-E|ARC-C|HellaSwag| -|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| -|ChatGLM2-6B|37.98|35.17|50.63|52.05|49.21|45.77|10.37|9.38|22.74|5.96|32.6|74.45|56.82|58.48*| -|Mistral-7B-Instruct-v0.1|44.36|45.89|37.51|38.06|36.96|53.56|29.27|39.34|28.73|3.48|39.52|81.61|63.99|73.47*| -|Mistral-7B-Instruct-v0.2|50.91|52.83|42.235|42.55|41.92|60.51|36.59|48.95|40.49|4.95|39.81|86.28|73.38|84.55*| -|Qwen-7B-Chat|44.93|42.05|57.9|58.57|57.23|56.03|15.85|40.52|42.23|8.3|37.34|64.44*|39.25*|74.52*| -|Yi-6B-Chat|50.46|45.89|70.995|70.88|71.11|62.95|14.02|28.34|36.54|3.88|37.43|84.89|70.39|74.6*| -|Baichuan2-7B-Chat|44.68|42.74|53.39|53.28|53.5|53|21.34|32.32|25.25|6.32|37.46|79.63|60.15|69.23*| -|Deepseek-7B-chat|49.34|49.56|48.335|46.95|49.72|51.67|40.85|48.48|48.52|4.26|35.7|76.85|63.05|76.68*| -|Llama2-7B-Chat|38.16|39.17|33.59|34.54|32.64|47.64|14.02|27.4|21.15|2.08|35.54|74.28|54.78|75.65*| -|MiniCPM-2B|52.33|52.6|51.1|51.13|51.07|53.46|50.00|47.31|53.83|10.24|36.87|85.44|68.00|68.25| - -#### MiniCPM-2B-128k Evaluation +#### MiniCPM-2B-128k | Model | avg | avg w/o code&math | passkey | number_string | kv_retrieval | longbook_choice_eng | longbook_qa_chn | longbook_qa_eng | longbook_sum_eng | longdialogue_qa_eng | math_calc | math_find | code_debug | code_run | |-------------------------------------|-------|-------------------|---------|---------------|--------------|---------------------|-----------------|-----------------|------------------|---------------------|-----------|-----------|------------|----------| | LWM-Text-128k | 24.45 | 33.62 | 100 | 97.8 | 0.6 | 28.82 | 15.93 | 14.31 | 9.99 | 1.5 | 0 | 3.43 | 20.05 | 1 | @@ -507,7 +452,7 @@ python bnb_quantize.py | chatglm3-6b-128k | 25.58 | 36.57 | 89.93 | 99.66 | 5.2 | 46.29 | 10.7 | 8.38 | 25.91 | 6.5 | 0 | 8 | 5.33 | 1 | | MiniCPM-2.4B-128k | 27.32 | 37.68 | 98.31 | 99.83 | 9 | 29.69 | 23.06 | 16.33 | 15.73 | 9.5 | 0 | 4.29 | 22.08 | 0 | -#### MiniCPM-MoE-8x2B Evaluation +#### MiniCPM-MoE-8x2B
@@ -607,244 +552,175 @@ python bnb_quantize.py -

- Note:* means evaluation results are directly taken from their technical reports. † means evaluation results on the full set of MBPP, instead of the hand-verified set. -#### Multimodal evaluation +#### MiniCPM-S-1B -
+- Code Generation: Average pass@1 score of HumanEval (0-shot) and MBPP (3-shot).
+- Commonsense Reasoning: Average 0-shot accuracy of PIQA, SIQA, HellaSwag, WinoGrande and COPA.
+- Reading Comprehension: Average 0-shot accuracy of BoolQ, LAMBADA and TyDi-QA.
+- Other Benchmarks: We report average performance of GSM8K (8-shot), MMLU (5-shot), BBH (3-shot) and AGI-Eval (0-shot).
-
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
ModelSizeTextVQA valDocVQA testOCRBenchOpenCompassMMEMMB dev(en)MMB dev(zh)MMMU valMathVistaLLaVA BenchObject HalBench
Proprietary models
Gemini Pro Vision- 74.688.168063.82148.975.274.048.945.879.9-
GPT-4V- 78.088.464563.21771.575.175.053.847.893.186.4 / 92.7
Open-source models 6B~34B
Yi-VL-6B6.7B45.5*17.1*29049.31915.1 68.6 68.3 40.3 28.8 51.9 -
Qwen-VL-Chat9.6B61.562.6488 52.1 1860.0 60.6 56.7 37.0 33.8 67.7 56.2 / 80.0
Yi-VL-34B34B43.4*16.9*29052.6 2050.271.171.445.130.762.3-
DeepSeek-VL-7B7.3B64.7*47.0* 43555.6 1765.4 74.1 72.8 38.3 36.877.8 -
TextMonkey9.7B64.366.7 558- - - - - -- -
CogVLM-Chat17.4B70.433.3*590 52.5 1736.6 63.7 53.8 37.3 34.7 73.9 73.6 / 87.4
Open-source models 1B~3B
DeepSeek-VL-1.3B1.7B58.4*37.9*41346.0 1531.6 64.0 61.2 33.8 29.4 51.1 -
MobileVLM V23.1B57.519.4*--1440.5(P) 63.2 -----
Mini-Gemini2.2B56.234.2*--1653.0 59.8 - 31.7 -- -
MiniCPM-V2.8B 60.638.2 36647.61650.2 67.9 65.3 38.328.951.3 78.4 / 88.5
MiniCPM-V 2.02.8B 74.171.9 60555.01808.6 69.6 68.1 38.2 38.769.2 85.5 / 92.2
+| Setting | Average Sparsity | Average Performance | Code Generation | Commonsense Reasoning | Reading Comprehension | GSM8K | MMLU | BBH | AGI-Eval |
+| :-------------------: | :----------------: | :----------------------: | :----------------------: | :---: | :---: | :---: | :---------: | :-----: | :-----------------: |
+| LLaMA2-7B | - | 37.96 | 16.37 | 69.59 | 61.87 | 12.96 | 44.45 | 32.96 | 27.53 |
+| ReluLLaMA-7B | 66.98 | 37.62 | 15.85 | 69.64 | 70.54 | 5.84 | 38.64 | 35.07 | 27.73 |
+| **ProSparse-7B**\* | 88.11 | 38.31 | 19.47 | 66.29 | 63.33 | 12.74 | 45.21 | 33.59 | 27.55 |
+| **ProSparse-7B** | **89.32** | **38.46** | 19.42 | 66.27 | 63.50 | 12.13 | 45.48 | 34.99 | 27.46 |
+| LLaMA2-13B | - | 44.06 | 20.19 | 72.58 | 71.55 | 22.21 | 54.69 | 37.89 | 29.33 |
+| ReluLLaMA-13B | 71.56 | 42.74 | 20.19 | 70.44 | 73.29 | 18.50 | 50.58 | 37.97 | 28.22 |
+| **ProSparse-13B**\* | 87.97 | **45.07** | 29.03 | 69.75 | 67.54 | 25.40 | 54.78 | 40.20 | 28.76 |
+| **ProSparse-13B** | **88.80** | 44.90 | 28.42 | 69.76 | 66.91 | 26.31 | 54.35 | 39.90 | 28.67 |
+| MiniCPM-1B | - | 44.44 | 36.85 | 63.67 | 60.90 | 35.48 | 50.44 | 35.03 | 28.71 |
+| **MiniCPM-S-1B**\* | 86.25 | **44.72** | 41.38 | 64.55 | 60.69 | 34.72 | 49.36 | 34.04 | 28.27 |
+| **MiniCPM-S-1B** | **87.89** | **44.72** | 42.04 | 64.37 | 60.73 | 34.57 | 49.51 | 34.08 | 27.77 |
-* We evaluate the officially released checkpoint by ourselves. +Note: +1. [ReluLLaMA-7B](https://huggingface.co/SparseLLM/ReluLLaMA-7B) and [ReluLLaMA-13B](https://huggingface.co/SparseLLM/ReluLLaMA-13B). "ProSparse-7B\*"、"ProSparse-13B\*" and "MiniCPM-S-1B\*" represent ProSparse versions that don't have activation thresholds offset. +2. For PIQA, SIQA, HellaSwag, WinoGrande, COPA, BoolQ, LAMBADA, TyDi QA and AGI-Eval, we adopt ppl-based evalution. For GSM8K, MMLU and BBH, we perform generation-based evalution. + + +### Inference +#### HuggingFace, vLLM +Please refer to [Inference](#huggingface-inferene) section in MiniCPM1.0. + +#### PowerInfer +Currently, PowerInfer is exclusively tailored for the MiniCPM-S-1B model; support for other versions is not yet available, stay tuned. +1. Ensure your cmake version is 3.17 or above. If you have already installed it, you can skip this step. +```bash + # Download the installation package + sudo wget https://cmake.org/files/v3.23/cmake-3.23.0.tar.gz + # Extract the installation package + sudo tar -zxvf cmake-3.23.0.tar.gz + # Configure the installation environment + sudo ./configure + sudo make -j8 + # Compile and install + sudo make install + # Check the version after installation + cmake --version + # If the version number is returned, the installation was successful + # cmake version 3.23.0 +``` +2. Install PowerInfer:: +```bash + git clone https://github.com/SJTU-IPADS/PowerInfer + cd PowerInfer + pip install -r requirements.txt # install Python helpers' dependencies +``` +3. Compile the CPU version of PowerInfer. If your machine only has a CPU, or if you want to perform inference using the CPU, run the following commands:: +```bash + cmake -S . -B build + cmake --build build --config Release +``` +4. Compile the GPU version of PowerInfer. If your machine has a GPU, you can run the following commands: +```bash + cmake -S . -B build -DLLAMA_CUBLAS=ON + cmake --build build --config Release +``` +5. Retrieve the sparse model: +```bash +git clone https://huggingface.co/openbmb/MiniCPM-S-1B-sft-gguf/tree/main +#or +git clone https://modelscope.cn/models/OpenBMB/MiniCPM-S-1B-sft-gguf +``` +6. Model Inference: +```bash +cd PowerInfer +# Below is the command template. output_token_count refers to the maximum output tokens, thread_num is the number of threads, and prompt is the input prompt text. +#./build/bin/main -m /PATH/TO/MODEL -n $output_token_count -t $thread_num -p $prompt +# Below is an example +./build/bin/main -m /root/ld/ld_model_pretrain/1b-s-minicpm/MiniCPM-S-1B-sft.gguf -n 2048 -t 8 -p 'hello,tell me a story please.' +``` + +
+ + +## MiniCPM 1.0 +
+Click to view details about MiniCPM1.0 + +### Introduction +MiniCPM-2B is a dense language model with only 2.4B parameters excluding embeddings (2.7B in total). + +- MiniCPM has very close performance compared with Mistral-7B on open-sourced general benchmarks with better ability on Chinese, Mathematics and Coding after SFT. The overall performance exceeds Llama2-13B, MPT-30B, Falcon-40B, etc. + +- After DPO, MiniCPM outperforms Llama2-70B-Chat, Vicuna-33B, Mistral-7B-Instruct-v0.1, Zephyr-7B-alpha, etc. on MTBench. + +Note: To ensure the generality of the model for academic research purposes, **we have not subject it to any identity-specific training.** Meanwhile, as we use ShareGPT open-source corpus as part of the training data, the model may output identity-related information similar to the GPT series models. + +### Evaluation Results + +#### Evaluation Settings +* Since it is difficult to standardize the evaluation of LLMs and there is no public prompt and test code for a large number of evaluations, we can only try our best to make it suitable for all types of models in terms of specific evaluation methods. +* Overall, we use a unified prompt input for testing, and adjust the input according to the corresponding template for each model. +* **The evaluation scripts and prompts have been open-sourced in our Github repository, and we welcome more developers to continuously improve our evaluation methods.** + * For the text evaluation part, we use our open source large model capability evaluation framework [UltraEval](https://github.com/OpenBMB/UltraEval). The following is the open source model reproduction process: + * install UltraEval + ```shell + git clone https://github.com/OpenBMB/UltraEval.git + cd UltraEval + pip install -e . + ``` + * Download the relevant data and unzip it for processing + ```shell + wget -O RawData.zip "https://cloud.tsinghua.edu.cn/f/71b5232264ae4833a4d0/?dl=1" + unzip RawData.zip + python data_process.py + ``` + * Execute evaluation scripts (templates are provided and can be customized) + ```shell + bash run_eval.sh + ``` + +#### Deployment mode + +* Because MiniCPM uses the structure of Mup, which is slightly different from existing models in terms of specific computations, we have based the implementation of our model on the vllm=0.2.2 version. +* **For non-MiniCPM models, we directly sampled the latest version of vllm=0.2.7 for inference.** + +#### Evaluation method + +* For the QA task (multiple-choice task), we chose to test in two ways: + * PPL: The options are used as a continuation of the question generation and the answer selection is based on the PPL of each option; + * The second is to generate the answer options directly. +* For different models, the results obtained by these two approaches vary widely. the results on both MiniCPM models are closer, while models such as Mistral-7B-v0.1 perform better on PPL and worse on direct generation. +* In the specific evaluation, we take the higher score of the two evaluation methods as the final result, so as to ensure the fairness of the comparison (* in the following table indicates the PPL). 
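To make the PPL-based option scoring described above concrete, here is a schematic sketch (not the exact UltraEval implementation): each option is scored by the average log-probability its tokens receive as a continuation of the question, and the highest-scoring option is selected.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "openbmb/MiniCPM-2B-sft-bf16"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16,
                                             device_map="cuda", trust_remote_code=True)
model.eval()

def option_score(question: str, option: str) -> float:
    """Average log-probability of the option tokens as a continuation of the question."""
    prefix_len = tokenizer(question, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(question + option, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    # Next-token log-probabilities for every position except the last.
    log_probs = torch.log_softmax(logits[0, :-1].float(), dim=-1)
    targets = full_ids[0, 1:]
    token_scores = log_probs[torch.arange(targets.shape[0], device=targets.device), targets]
    option_len = full_ids.shape[1] - prefix_len
    return token_scores[-option_len:].mean().item()

question = "Which city is the capital of China? Answer: "
options = ["Beijing", "Shanghai", "Guangzhou", "Shenzhen"]
print(max(options, key=lambda o: option_score(question, o)))
```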
+ +#### Text evaluation + +|Model|Average Score|Average Score in English|Average Score in Chinese|C-Eval|CMMLU|MMLU|HumanEval|MBPP|GSM8K|MATH|BBH|ARC-E|ARC-C|HellaSwag| +|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| +|Llama2-7B|35.40|36.21|31.765|32.42|31.11|44.32|12.2|27.17|13.57|1.8|33.23|75.25|42.75|75.62*| +|Qwen-7B|49.46|47.19|59.655|58.96|60.35|57.65|17.07|42.15|41.24|5.34|37.75|83.42|64.76|75.32*| +|Deepseek-7B|39.96|39.15|43.635|42.82|44.45|47.82|20.12|41.45|15.85|1.53|33.38|74.58*|42.15*|75.45*| +|Mistral-7B|48.97|49.96|44.54|46.12|42.96|62.69|27.44|45.2|33.13|5.0|41.06|83.92|70.73|80.43*| +|Llama2-13B|41.48|42.44|37.19|37.32|37.06|54.71|17.07|32.55|21.15|2.25|37.92|78.87*|58.19|79.23*| +|MPT-30B|38.17|39.82|30.715|29.34|32.09|46.56|21.95|35.36|10.31|1.56|38.22|78.66*|46.08*|79.72*| +|Falcon-40B|43.62|44.21|40.93|40.29|41.57|53.53|24.39|36.53|22.44|1.92|36.24|81.94*|57.68|83.26*| +|MiniCPM-2B|52.33|52.6|51.1|51.13|51.07|53.46|50.00|47.31|53.83|10.24|36.87|85.44|68.00|68.25| + +|Model|Average Score|Average Score in English|Average Score in Chinese|C-Eval|CMMLU|MMLU|HumanEval|MBPP|GSM8K|MATH|BBH|ARC-E|ARC-C|HellaSwag| +|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| +|TinyLlama-1.1B|25.36|25.55|24.525|25.02|24.03|24.3|6.71|19.91|2.27|0.74|28.78|60.77*|28.15*|58.33*|Qwen-1.8B|34.72|31.87|47.565|49.81|45.32|43.37|7.93|17.8|19.26|2.42|29.07|63.97*|43.69|59.28*| +|Qwen-1.8B|34.72|31.87|47.565|49.81|45.32|43.37|7.93|17.8|19.26|2.42|29.07|63.97*|43.69|59.28*| +|Gemini Nano-3B|-|-|-|-|-|-|-|27.2(report)|22.8(report)|-|42.4(report)|-|-|-| +|StableLM-Zephyr-3B|43.46|46.31|30.615|30.34|30.89|45.9|35.37|31.85|52.54|12.49|37.68|73.78|55.38|71.87*| +|Phi-2-2B|48.84|54.41|23.775|23.37|24.18|52.66|47.56|55.04|57.16|3.5|43.39|86.11|71.25|73.07*| +|MiniCPM-2B|52.33|52.6|51.1|51.13|51.07|53.46|50.00|47.31|53.83|10.24|36.87|85.44|68.00|68.25| + +|Model|Average Score|Average Score in English|Average Score in Chinese|C-Eval|CMMLU|MMLU|HumanEval|MBPP|GSM8K|MATH|BBH|ARC-E|ARC-C|HellaSwag| +|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| +|ChatGLM2-6B|37.98|35.17|50.63|52.05|49.21|45.77|10.37|9.38|22.74|5.96|32.6|74.45|56.82|58.48*| +|Mistral-7B-Instruct-v0.1|44.36|45.89|37.51|38.06|36.96|53.56|29.27|39.34|28.73|3.48|39.52|81.61|63.99|73.47*| +|Mistral-7B-Instruct-v0.2|50.91|52.83|42.235|42.55|41.92|60.51|36.59|48.95|40.49|4.95|39.81|86.28|73.38|84.55*| +|Qwen-7B-Chat|44.93|42.05|57.9|58.57|57.23|56.03|15.85|40.52|42.23|8.3|37.34|64.44*|39.25*|74.52*| +|Yi-6B-Chat|50.46|45.89|70.995|70.88|71.11|62.95|14.02|28.34|36.54|3.88|37.43|84.89|70.39|74.6*| +|Baichuan2-7B-Chat|44.68|42.74|53.39|53.28|53.5|53|21.34|32.32|25.25|6.32|37.46|79.63|60.15|69.23*| +|Deepseek-7B-chat|49.34|49.56|48.335|46.95|49.72|51.67|40.85|48.48|48.52|4.26|35.7|76.85|63.05|76.68*| +|Llama2-7B-Chat|38.16|39.17|33.59|34.54|32.64|47.64|14.02|27.4|21.15|2.08|35.54|74.28|54.78|75.65*| +|MiniCPM-2B|52.33|52.6|51.1|51.13|51.07|53.46|50.00|47.31|53.83|10.24|36.87|85.44|68.00|68.25| #### DPO evaluation @@ -862,66 +738,11 @@ MBPP, instead of the hand-verified set. |Mistral-7B-Instruct-v0.1|6.84| |MPT-34B-instruct|6.39| -

+### Quick Start -## Deployment on mobile phones +#### Online -#### Tutorial -

- -* After INT4 quantization, MiniCPM only occupies 2GB of space, meeting the requirements of inference on end devices. -* We have made different adaptations for different operating systems. -* **Note: The current open-source framework is still improving its support for mobile phones, and not all chips and operating system versions can successfully run MLC-LLM or LLMFarm.** -* Android, HarmonyOS - * Adapt based on open-source framework MLC-LLM. - * Adapted for text model MiniCPM, and multimodel model MiniCPM-V. - * Support MiniCPM-2B-SFT-INT4, MiniCPM-2B-DPO-INT4, and MiniCPM-V. - * [Compile and Installation Guide](https://github.com/OpenBMB/mlc-MiniCPM/blob/main/README.md) -* iOS - * Adapt based on open-source framework LLMFarm. - * Adapted for text model MiniCPM. - * Support MiniCPM-2B-SFT-INT4, MiniCPM-2B-DPO-INT4. - * [Compile and Installation Guide](https://github.com/OpenBMB/LLMFarm) - -#### Performance - -* We did not conduct in-depth optimization and system testing on the mobile inference model, only verifying the feasibility of MiniCPM using mobile phone chips for inference. **We welcome more developers to continuously improve the inference performance of LLMs on mobile phones and update the test results below.** - -| Mobile Phones | OS | Processor | Memory(GB) | Inference Throughput(token/s) | -| ----------------- | ------------- | ------------------ | ------------ | ------------------------------- | -| OPPO Find N3 | Android 13 | snapdragon 8 Gen2 | 12 | 6.5 | -| Samsung S23 Ultra | Android 14 | snapdragon 8 Gen2 | 12 | 6.4 | -| Meizu M182Q | Android 11 | snapdragon 888Plus | 8 | 3.7 | -| Xiaomi 12 Pro | Android 13 | snapdragon 8 Gen1 | 8+3 | 3.7 | -| Xiaomi Redmi K40 | Android 11 | snapdragon 870 | 8 | 3.5 | -| Oneplus LE 2100 | Android 13 | snapdragon 870 | 12 | 3.5 | -| Oneplus HD1900 | Android 11 | snapdragon 865 | 8 | 3.2 | -| Oneplus HD1900 | Android 11 | snapdragon 855 | 8 | 3.0 | -| Oneplus HD1905 | Android 10 | snapdragon 855 | 8 | 3.0 | -| Oneplus HD1900 | Android 11 | snapdragon 855 | 8 | 3.0 | -| Xiaomi MI 8 | Android 9 | snapdragon 845 | 6 | 2.3 | -| Huawei Nova 11SE | HarmonyOS 4.0.0 | snapdragon 778 | 12 | 1.9 | -| Xiaomi MIX 2 | Android 9 | snapdragon 835 | 6 | 1.3 | -| iPhone 15 Pro | iOS 17.2.1 | A16 | 8 | 18.0 | -| iPhone 15 | iOS 17.2.1 | A16 | 6 | 15.0 | -| iPhone 12 Pro | iOS 16.5.1 | A14 | 6 | 5.8 | -| iPhone 12 | iOS 17.2.1 | A14 | 4 | 5.8 | -| iPhone 11 | iOS 16.6 | A13 | 4 | 4.6 | -|Xiaomi Redmi K50 | HyperOS 1.0.2 | MediaTek Dimensity 8100 |12 |3.5| - -* We have also verified the feasibility of deploying MiniCPM-V series models on mobile phones based on MLC-LLM, and it can input and output normally. However, there also exist a problem of long image processing time, which needs further optimization. The demo video below is the raw screen recording on a Xiaomi 14 Pro without edition. - - -

- - -

-
- - -

- -## Demo & API +- [Colab](https://colab.research.google.com/drive/1tJcfPyWGWA5HezO7GKLeyeIso0HyOc0l?usp=sharing) #### Web-demo based on Gradio @@ -934,77 +755,78 @@ python demo/vllm_based_demo.py --model_path python demo/hf_based_demo.py --model_path ``` -

+#### Huggingface Inferene +##### MiniCPM-2B -## Fine-tuning +* Install `transformers>=4.36.0` and `accelerate`,run the following python code. -* Parameter-efficient Tuning +```python +from transformers import AutoModelForCausalLM, AutoTokenizer +import torch +torch.manual_seed(0) + +path = 'openbmb/MiniCPM-2B-dpo-bf16' +tokenizer = AutoTokenizer.from_pretrained(path) +model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True) + +responds, history = model.chat(tokenizer, "Which city is the capital of China?", temperature=0.8, top_p=0.8) +print(responds) +``` +* Examples + +```shell +The capital city of China is Beijing. Beijing is not only the political center of China but also a cultural and economic hub. It is known for its rich history and numerous landmarks, such as the Great Wall, the Forbidden City, and the Temple of Heaven. The city is also home to the National Stadium, also known as the "Bird's Nest," and the National Aquatics Center, or "Water Cube." Beijing is a significant city in China, with a population of over 21 million people. +``` + +##### MiniCPM-2B (Llama Format) +To facilitate ease of use, we have converted the model weights of MiniCPM to adapt to the structure of the LLaMA model: +```python +import torch +from transformers import LlamaTokenizerFast, LlamaForCausalLM +model_path = "openbmb/MiniCPM-2B-dpo-bf16-llama-format" +tokenizer = LlamaTokenizerFast.from_pretrained(model_path) +model = LlamaForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True) + +prompt="Now you act like a terminal situated within a beginner's C++ practice repository folder, please provide the output for the command: `ls -l`" +input_ids = tokenizer.encode("{}".format(prompt), return_tensors='pt', add_special_tokens=True).cuda() +responses = model.generate(input_ids, temperature=0.3, top_p=0.8, repetition_penalty=1.02, max_length=1024) +responses = tokenizer.decode(responses[0], skip_special_tokens=True) +print(responses) +``` + +#### vLLM +* Install [vLLM](https://github.com/vllm-project/vllm) + ```shell + pip install "vllm>=0.4.1" + ``` + +* Examples + ```shell + python inference/inference_vllm.py --model_path --prompt_path prompts/prompt_demo.txt + ``` + +#### llama.cpp, Ollama, fastllm, mlx_lm Inference +We have supported inference with [llama.cpp](https://github.com/ggerganov/llama.cpp/), [ollama](https://github.com/ollama/ollama), [fastllm](https://github.com/ztxz16/fastllm), [mlx_lm](https://github.com/ml-explore/mlx-examples). Thanks to [@runfuture](https://github.com/runfuture) for the adaptation of llama.cpp and ollama. + +Please refer to [Quantization Tutorial](https://modelbest.feishu.cn/wiki/EatbwdLuvitbbMk2X5wcX6h5n7c) in "MiniCPM Knowbase". + +#### Parameter-efficient Tuning * With parameter-efficient tuning, we can tune MiniCPM using one piece of NVIDIA GeForce GTX 1080/2080. - * [Code for Parameter-efficient Tuning](https://github.com/OpenBMB/MiniCPM/tree/main/finetune) - -* Full-parameter Tuning - * Using [BMTrain](https://github.com/OpenBMB/BMTrain),as well as checkpointing and ZeRO-3 (zero redundancy optimizer),we can tune all parameters of MiniCPM using one piece of NVIDIA GeForce GTX 3090/4090. - * This code will be available soon. 
+ * mlx finetune:[Guideline](https://modelbest.feishu.cn/wiki/AIU3wbREcirOm9kkvd7cxujFnMb#share-ASrDdvFAloHtycxfy85cLNhAnd3) + - [xtuner](https://github.com/InternLM/xtuner): [The best choice to do parameter-efficient tuning on MiniCPM](https://modelbest.feishu.cn/wiki/AIU3wbREcirOm9kkvd7cxujFnMb#AMdXdzz8qoadZhxU4EucELWznzd) + - [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory.git):[One click solution of finetuning MiniCPM](https://modelbest.feishu.cn/wiki/AIU3wbREcirOm9kkvd7cxujFnMb#BAWrdSjXuoFvX4xuIuzc8Amln5E) -
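The linked recipes above are the supported fine-tuning paths. Purely as an illustration of the idea, a minimal LoRA sketch using the third-party `peft` library might look like the following (the `peft` dependency and the `q_proj`/`v_proj` module names are assumptions made for this sketch, not something this repository ships):

```python
# Illustrative LoRA sketch only; see finetune/ and the tools linked above for the maintained recipes.
from peft import LoraConfig, get_peft_model   # assumes `pip install peft`
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "openbmb/MiniCPM-2B-sft-bf16", trust_remote_code=True
)
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank adapters
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # assumed Llama-style attention projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only the small adapter matrices are trainable
```

Only the adapter parameters receive gradients and optimizer states, which is what keeps the memory footprint within the single 1080/2080-class GPU budget mentioned above.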

+
-* mlx Parameter-efficient Tuning - * environment preparation - ```shell - pip install -r finetune/requirements_mlx.txt - ``` - * finetune - ```shell - # train - python mlx_finetune.py --model MiniCPM-2B-sft-bf16-llama-format-mlx --data data/AdvertiseGen --train --seed 2024 --iters 500 - # test - python mlx_finetune.py --model MiniCPM-2B-sft-bf16-llama-format-mlx --data data/AdvertiseGen --test --seed 2024 - ``` - -

- -## Show Cases - -#### Text Generation - -![内容创作-case1](./assets/en.creation.case1.png) - -![内容创作-case2](./assets/en.creation.case2.png) - -#### Code Generation - -![代码生成-case1](./assets/en.code.case1.gif) - -#### Reasoning - -![数理逻辑-case1](./assets/en.math.case1.png) - -![数理逻辑-case2](./assets/en.math.case2.png) - -#### Translation - -![文本翻译-case1](./assets/en.translation.case1.png) - -#### Instruction Following - -![指令跟随-case1](./assets/en.instruction_following.case1.png) - -#### Special characters - -![指令跟随-case1](./assets/en.special_char.case1.png) - -![指令跟随-case2](./assets/en.special_char.case2.png) - -

## LICENSE
#### Model LICENSE
* This repository is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
-* The usage of MiniCPM model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
-* The models and weights of MiniCPM are completely free for academic research. after filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, are also available for free commercial use.
-
+* The usage of MiniCPM model weights must strictly follow the [MiniCPM Model License](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
+* The models and weights of MiniCPM are completely free for academic research. After filling out a [questionnaire](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, they are also available for free commercial use.
+
#### Statement
* As a language model, MiniCPM generates content by learning from a vast amount of text.
@@ -1012,7 +834,12 @@ python demo/hf_based_demo.py --model_path
* Any content generated by MiniCPM does not represent the viewpoints or positions of the model developers.
* Therefore, when using content generated by MiniCPM, users should take full responsibility for evaluating and verifying it on their own.
-

+## Institutions + +This project is developed by the following institutions: + +- [Modelbest Inc.](https://modelbest.cn/) +- [THUNLP](https://nlp.csai.tsinghua.edu.cn/) ## Citation diff --git a/README.md b/README.md index dc03df9..eb60413 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,5 @@
-

- MiniCPM: 揭示端侧大语言模型的无限潜力 -

+

@@ -13,517 +11,434 @@

MiniCPM 技术博客 | +MiniCPM 知识库 | MiniCPM 论文 | MiniCPM-V 仓库 | 加入我们的 discord微信群

-MiniCPM 是面壁智能与清华大学自然语言处理实验室共同开源的系列端侧大模型,主体语言模型 MiniCPM-2B 仅有 24亿(2.4B)的非词嵌入参数量, 总计2.7B参数量。 -- 经过 SFT 后,MiniCPM-2B 在公开综合性评测集上与 Mistral-7B 表现相近(中文、数学、代码能力更优),整体性能超越 Llama2-13B、MPT-30B、Falcon-40B 等模型。 -- 经过 DPO 后,MiniCPM-2B 在当前最接近用户体感的评测集 MTBench 上也超越了 Llama2-70B-Chat、Vicuna-33B、Mistral-7B-Instruct-v0.1、Zephyr-7B-alpha 等众多代表性开源大模型。 -- 以 MiniCPM-2B 为基础构建端侧多模态大模型 MiniCPM-V 2.0,在多个测试基准中实现了 7B 以下模型的最佳性能,在 OpenCompass 榜单上超过了 Qwen-VL-Chat 9.6B、CogVLM-Chat 17.4B 和 Yi-VL 34B 等更大参数规模的模型。MiniCPM-V 2.0 还展现出领先的 OCR 能力,在场景文字识别能力上接近 Gemini Pro。 -- 经过 Int4 量化后,MiniCPM 可在手机上进行部署推理,流式输出速度略高于人类说话速度。MiniCPM-V 也直接跑通了多模态大模型在手机上的部署。 -- 一张1080/2080可高效参数微调,一张3090/4090可全参数微调,一台机器可持续训练 MiniCPM,二次开发成本较低。 +## 更新日志🔥 -我们完全开源MiniCPM系列的模型参数供学术研究和有限商用。 -具体而言,我们目前已公开以下模型,地址详见 [模型下载](#1) 部分 -- 基于MiniCPM-2B的指令微调与人类偏好对齐版本**MiniCPM-2B-SFT/DPO**。 -- 基于MiniCPM-2B的多模态模型**MiniCPM-V 2.0**。 -- MiniCPM-2B-SFT/DPO的Int4量化版**MiniCPM-2B-SFT/DPO-Int4**。 -- MiniCPM-2B的128k长文本版本**MiniCPM-2B-128k**。 -- MiniCPM-2B的MoE版本**MiniCPM-MoE-8x2B**。 -- 更轻量级的MiniCPM-1B指令微调版本**MiniCPM-1B-SFT**。 -- 基于MLC-LLM、LLMFarm开发的MiniCPM手机端程序,**文本及多模态模型均可在手机端进行推理**。 -- MiniCPM-2B训练过程中的[30个Checkpoints](https://huggingface.co/openbmb/MiniCPM-2B-history)供模型机理研究。 - -### 局限性: - -- 受限于模型规模,模型可能出现**幻觉性问题**。其中由于DPO模型生成的回复内容更长,更容易出现幻觉。我们也将持续进行MiniCPM模型的迭代改进。 -- 为了保证在学术研究用途上模型的通用性,我们**未对模型进行任何身份认同训练**。同时由于我们用ShareGPT开源语料作为部分训练数据,模型可能会输出类似GPT系列模型的身份认同信息。 -- 受限于模型规模,模型的**输出受到提示词(prompt)的影响较大**,可能多次尝试产生不一致的结果。 -- 受限于模型容量,模型的**知识记忆较不准确**,后续我们将结合RAG方法来增强模型的知识记忆能力。 +- [2024.09.05] 发布 [**MiniCPM3-4B**](https://huggingface.co/openbmb/MiniCPM3-4B)!该模型的表现超越 Phi-3.5-mini-instruct 和 GPT-3.5-Turbo-0125,并且能够比肩 Llama3.1-8B-Instruct、Qwen2-7B-Instruct、GLM-4-9B-Chat 等多个 7B-9B 参数量的模型。 +- [2024.07.05] 发布 [MiniCPM-S-1B](https://huggingface.co/openbmb/MiniCPM-S-1B-sft)!该模型在保持下游任务性能无损的前提下,FFN 层实现了 87.89% 的平均稀疏度,将 FFN FLOPs 降低了 84%。 +- [2024.04.11] 发布 [MiniCPM-2B-128k](https://huggingface.co/openbmb/MiniCPM-2B-128k)、[MiniCPM-MoE-8x2B](https://huggingface.co/openbmb/MiniCPM-MoE-8x2B) 和 [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16)!点击[这里](https://openbmb.vercel.app/?category=Chinese+Blog)查看技术博客。 +- [2024.03.16] MiniCPM-2B 的 30 余个中间检查点开放了![HuggingFace链接](https://huggingface.co/openbmb/MiniCPM-2B-history) +- [2024.02.01] 发布 [**MiniCPM-2B**](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16)!该模型在公开评测集上与 Mistral-7B 表现相近(中文、数学、代码能力更优),整体性能超越 Llama2-13B、MPT-30B、Falcon-40B 等模型。 ## 目录 -- [更新日志](#0)| -- [模型下载](#1)| -- [快速上手](#2)| -- [模型量化](#quantize)| -- [开源社区](#community)| -- [评测结果](#3)| -- [手机部署](#4)| -- [Demo & API 部署](#5)| -- [二次开发](#6)| -- [开源协议](#7)| -- [工作引用](#8)| -- [典型示例](#9)| +- [模型下载](#模型下载) +- [MiniCPM 3.0](#minicpm-30) + - [评测结果](#评测结果) + - [综合评测](#综合评测) + - [工具调用能力](#工具调用能力) + - [长文本能力](#长文本能力) + - [模型推理](#模型推理) + - [HuggingFace](#huggingface) + - [vLLM](#vllm) + - [llama.cpp](#llamacpp) + - [模型微调](#模型微调) + - [LLaMA-Factory](#llama-factory) + - [进阶功能](#进阶功能) + - [工具调用](#工具调用) + - [代码解释器](#代码解释器) +- [MiniCPM 2.0](#minicpm-20) +- [MiniCPM 1.0](#minicpm-10) -## 常用模块导航 -以下表格可以让你快速访问常用的工程模块,如果你需要广泛而详细的教程请点击[教程](https://modelbest.feishu.cn/wiki/D2tFw8Pcsi5CIzkaHNacLK64npg?from=from_copylink) - -| [推理](#2) | [微调](#6) | [手机部署](#4) | [量化](#quantize) -|-------------|------------|-----------|-----------| -|[Transformers](#Huggingface模型)|[Transformers](#transformer_finetune)|[MLC部署](#MLC)|[GPTQ](#gptq)| -|[vLLM](#vllm-推理)|[mlx_finetune](#mlx)|[llama.cpp](#llama.cpp)|[AWQ](#awq)| -|[llama.cpp](#llama.cpp)|[LLaMA-Factory](./finetune/llama_factory_example/README.md)||[bnb](#bnb)| 
-|[ollama](#ollama)|||[量化测试](#quantize_test)| -|[fastllm](#fastllm)|||| -|[mlx_lm](#mlx_lm)|||| -|[powerinfer](#powerinfer)|||| -

- -## 更新日志 -- **2024/04/11 开源[MiniCPM-V-2.0](https://huggingface.co/openbmb/MiniCPM-V-2.0)、[MiniCPM-2B-128k](https://huggingface.co/openbmb/MiniCPM-2B-128k)、[MiniCPM-MoE-8x2B](https://huggingface.co/openbmb/MiniCPM-MoE-8x2B)和[MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16)!点击[这里](https://openbmb.vercel.app/?category=Chinese+Blog)查看技术博客。** -- 2024/03/16 MiniCPM-2B 的30余个中间检查点开放了![HuggingFace链接](https://huggingface.co/openbmb/MiniCPM-2B-history) -- 2024/02/13 支持了llama.cpp -- 2024/02/09 我们在README里加入了一个[开源社区](#community)章节,用来收集开源社区对MiniCPM的支持案例。 -- 2024/02/08 我们更新了[llama-format的模型权重](#llamaformat),方便大家更加快捷地使用我们的模型。 -- 2024/02/01 初始发布。 - -

## 模型下载 - -* 语言模型 - | HuggingFace | ModelScope | WiseModel | - |-------------|------------|-----------| - |[MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16)|[MiniCPM-2B-sft-bf16](https://modelscope.cn/models/OpenBMB/miniCPM-bf16)|[MiniCPM-2B-sft-bf16](https://wisemodel.cn/models/OpenBMB/miniCPM-bf16)| - |[MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16)|[MiniCPM-2B-dpo-bf16](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-dpo-bf16/summary)|[MiniCPM-2B-dpo-bf16](https://wisemodel.cn/models/OpenBMB/MiniCPM-2B-dpo-bf16)| + | HuggingFace | ModelScope | + |-------------|------------| + |[MiniCPM3-4B](https://huggingface.co/openbmb/MiniCPM3-4B)| + |[MiniCPM-2B-sft](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16)|[MiniCPM-2B-sft](https://modelscope.cn/models/OpenBMB/miniCPM-bf16)| + |[MiniCPM-2B-dpo](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16)|[MiniCPM-2B-dpo](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-dpo-bf16/summary)| |[MiniCPM-2B-128k](https://huggingface.co/openbmb/MiniCPM-2B-128k) |[MiniCPM-2B-128k](https://modelscope.cn/models/openbmb/MiniCPM-2B-128k/summary)| |[MiniCPM-MoE-8x2B](https://huggingface.co/openbmb/MiniCPM-MoE-8x2B) |[MiniCPM-MoE-8x2B](https://modelscope.cn/models/OpenBMB/MiniCPM-MoE-8x2B)| - |[MiniCPM-1B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16) | [MiniCPM-1B-sft-bf16](https://modelscope.cn/models/OpenBMB/MiniCPM-1B-sft-bf16) | + |[MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16) | [MiniCPM-1B](https://modelscope.cn/models/OpenBMB/MiniCPM-1B-sft-bf16) | + |[MiniCPM-S-1B](https://huggingface.co/openbmb/MiniCPM-S-1B-sft)|[MiniCPM-S-1B](https://modelscope.cn/models/OpenBMB/MiniCPM-S-1B-sft)| 注: 更多模型版本见[这里](https://huggingface.co/collections/openbmb/minicpm-2b-65d48bf958302b9fd25b698f)。 -* 多模态模型 - | HuggingFace | ModelScope | WiseModel | - |-------------|------------|-----------| - | [MiniCPM-V 2.0](https://huggingface.co/openbmb/MiniCPM-V-2) | [MiniCPM-V 2.0](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2) | - | [MiniCPM-V](https://huggingface.co/openbmb/MiniCPM-V) | [MiniCPM-V](https://modelscope.cn/models/OpenBMB/MiniCPM-V/) | [MiniCPM-V](https://wisemodel.cn/models/OpenBMB/MiniCPM-V) | - | [OmniLMM-12B](https://huggingface.co/openbmb/OmniLMM-12B) | [OmniLMM-12B](https://modelscope.cn/models/OpenBMB/OmniLMM-12B) | [OmniLMM-12B](https://wisemodel.cn/models/OpenBMB/OmniLMM-12B) | +## MiniCPM 3.0 - +MiniCPM 3.0 是一个 4B 参数量的语言模型,相比 MiniCPM1.0/2.0,功能更加全面,综合能力大幅提升,多数评测集上的效果比肩甚至超越众多 7B-9B 模型。 +* **支持工具调用🛠️(Function Calling)和代码解释器💻(Code Interpreter)**:[Berkeley Function Calling Leaderboard (BFCL)](https://gorilla.cs.berkeley.edu/leaderboard.html) 上取得 9B 规模以下 SOTA,超越 GLM-4-9B-Chat、Qwen2-7B-Instruct。 +* **超强的推理能力🧮**:数学能力方面,[MathBench](https://open-compass.github.io/MathBench/) 上的效果超越 GPT-3.5-Turbo 以及多个 7B-9B 模型。在非常具有挑战性的 [LiveCodeBench](https://livecodebench.github.io/) 上,效果超越 Llama3.1-8B-Instruct。 +* **出色的中英文指令遵循能力🤖**:英文指令遵循 [IFEval](https://huggingface.co/datasets/google/IFEval)、中文指令遵循 [FollowBench-zh](https://huggingface.co/datasets/YuxinJiang/FollowBench) 效果超越 GLM-4-9B-Chat、Qwen2-7B-Instruct。 +* **长文本能力**:原生支持 32k 上下文长度,32k 长度内大海捞针全绿。提出 **LLM x MapReduce** ,理论可处理的上下文长度达到 +∞。 +* **RAG能力**:我们发布了 [MiniCPM RAG 套件](https://huggingface.co/collections/openbmb/minicpm-rag-suite-66d976b4204cd0a4f8beaabb)。基于 MiniCPM 系列模型的 [MiniCPM-Embedding](https://huggingface.co/openbmb/MiniCPM-Embedding)、[MiniCPM-Reranker](https://huggingface.co/openbmb/MiniCPM-Reranker) 在中文、中英跨语言检索测试中取得 SOTA 表现;针对 RAG 场景的 
[MiniCPM3-RAG-LoRA](https://huggingface.co/openbmb/MiniCPM3-RAG-LoRA) 在开放域问答等多项任务上超越 Llama3-8B、Baichuan2-13B 等模型。 +### 评测结果 -

+#### 综合评测
-## 快速上手

| 评测集 | Qwen2-7B-Instruct | GLM-4-9B-Chat | Gemma2-9B-it | Llama3.1-8B-Instruct | GPT-3.5-Turbo-0125 | Phi-3.5-mini-Instruct(3.8B) | MiniCPM3-4B |
|---|---|---|---|---|---|---|---|
| **英文能力** | | | | | | | |
| MMLU | 70.5 | 72.4 | 72.6 | 69.4 | 69.2 | 68.4 | 67.2 |
| BBH | 64.9 | 76.3 | 65.2 | 67.8 | 70.3 | 68.6 | 70.2 |
| MT-Bench | 8.41 | 8.35 | 7.88 | 8.28 | 8.17 | 8.60 | 8.41 |
| IFEVAL (Prompt Strict-Acc.) | 51.0 | 64.5 | 71.9 | 71.5 | 58.8 | 49.4 | 68.4 |
| **中文能力** | | | | | | | |
| CMMLU | 80.9 | 71.5 | 59.5 | 55.8 | 54.5 | 46.9 | 73.3 |
| CEVAL | 77.2 | 75.6 | 56.7 | 55.2 | 52.8 | 46.1 | 73.6 |
| AlignBench v1.1 | 7.10 | 6.61 | 7.10 | 5.68 | 5.82 | 5.73 | 6.74 |
| FollowBench-zh (SSR) | 63.0 | 56.4 | 57.0 | 50.6 | 64.6 | 58.1 | 66.8 |
| **数学能力** | | | | | | | |
| MATH | 49.6 | 50.6 | 46.0 | 51.9 | 41.8 | 46.4 | 46.6 |
| GSM8K | 82.3 | 79.6 | 79.7 | 84.5 | 76.4 | 82.7 | 81.1 |
| MathBench | 63.4 | 59.4 | 45.8 | 54.3 | 48.9 | 54.9 | 65.6 |
| **代码能力** | | | | | | | |
| HumanEval+ | 70.1 | 67.1 | 61.6 | 62.8 | 66.5 | 68.9 | 68.3 |
| MBPP+ | 57.1 | 62.2 | 64.3 | 55.3 | 71.4 | 55.8 | 63.2 |
| LiveCodeBench | 22.2 | 20.2 | 19.2 | 20.4 | 24.0 | 19.6 | 22.6 |
| **工具调用能力** | | | | | | | |
| BFCL | 71.6 | 70.1 | 19.2 | 73.3 | 75.4 | 48.4 | 76.0 |
| **综合能力** | | | | | | | |
| 平均分 | 65.3 | 65.0 | 57.9 | 60.8 | 61.0 | 57.2 | 66.3 |
-#### 在线体验
+#### 工具调用能力
-- [Colab](https://colab.research.google.com/drive/1tJcfPyWGWA5HezO7GKLeyeIso0HyOc0l?usp=sharing)
+我们在 [Berkeley Function Calling Leaderboard (BFCL)](https://gorilla.cs.berkeley.edu/leaderboard.html) 上测试了模型的工具调用能力,MiniCPM3-4B 在该榜单上的表现超越了多个 7B-9B 参数量的模型,优于 GPT-3.5-Turbo-0125。
-

| 模型 | 总体准确率 | AST Summary | Exec Summary | Irrelevance Detection | Relevance Detection |
|---|---|---|---|---|---|
| MiniCPM3-4B | 76.03% | 68.55% | 85.54% | 53.71% | 90.24% |
| Llama3.1-8B-Instruct | 73.28% | 64.61% | 86.48% | 43.12% | 85.37% |
| Qwen2-7B-Instruct | 71.61% | 65.71% | 79.57% | 44.70% | 90.24% |
| GLM-4-9B-Chat | 70.08% | 60.69% | 80.02% | 55.02% | 82.93% |
| Phi-3.5-mini-instruct | 48.44% | 38.89% | 54.04% | 46.78% | 65.85% |
| Gemma2-9B-it | 19.18% | 5.41% | 18.50% | 88.88% | 7.32% |
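下面给出一个简化的调用示意(仅作说明;其中的工具定义和用户问题均为假设示例,完整、可运行的流程请以 [`demo/function_calling.py`](./demo/function_calling.py) 为准):

```python
# 简化示意:把工具描述通过 chat template 传给 MiniCPM3-4B,模型在需要时会生成结构化的工具调用
# 解析 tool_calls、回传工具结果等完整流程见 demo/function_calling.py
from transformers import AutoTokenizer

tools = [{
    "type": "function",
    "function": {
        "name": "get_delivery_date",  # 假设的示例工具
        "description": "Get the delivery date for a customer's order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]
messages = [{"role": "user", "content": "帮我查一下订单 1234 的送达时间"}]

tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM3-4B", trust_remote_code=True)
# 工具描述会被渲染进 prompt,之后可交给 vLLM 或 transformers 生成回复
prompt = tokenizer.apply_chat_template(
    messages, tools=tools, tokenize=False, add_generation_prompt=True
)
print(prompt)
```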
-#### Huggingface 模型 +#### 长文本能力 -##### MiniCPM-2B -* 安装`transformers>=4.36.0`以及`accelerate`后,运行以下代码 +在 32k 的上下文长度进行[大海捞针](https://github.com/gkamradt/LLMTest_NeedleInAHaystack)测试,结果如下图: + +![needle](assets/eval_needle.jpeg) + +### 模型推理 + +#### Huggingface ```python from transformers import AutoModelForCausalLM, AutoTokenizer import torch torch.manual_seed(0) -path = 'openbmb/MiniCPM-2B-dpo-bf16' +path = 'openbmb/MiniCPM3-4B' tokenizer = AutoTokenizer.from_pretrained(path) model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True) -responds, history = model.chat(tokenizer, "山东省最高的山是哪座山, 它比黄山高还是矮?差距多少?", temperature=0.5, top_p=0.8, repetition_penalty=1.02) +responds, history = model.chat(tokenizer, "请写一篇关于人工智能的文章,详细介绍人工智能的未来发展和隐患。", temperature=0.7, top_p=0.7) print(responds) ``` -* 期望输出 -```shell -山东省最高的山是泰山,海拔1545米。 +#### vLLM +* 安装 vllm + ```shell + pip install git+https://github.com/OpenBMB/vllm.git@minicpm3 + ``` +* 推理 + ```python + from transformers import AutoTokenizer + from vllm import LLM, SamplingParams -相对于黄山(海拔1864米),泰山海拔较低,相差约319米。 -``` + model_name = "openbmb/MiniCPM3-4B" + prompt = [{"role": "user", "content": "请写一篇关于人工智能的文章,详细介绍人工智能的未来发展和隐患。"}] -

+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) + input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True) -##### MiniCPM-2B (Llama Format) -我们将MiniCPM的模型权重转化成了Llama代码可以直接调用的[格式](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16-llama-format),以便大家尝试: -```python -import torch -from transformers import LlamaTokenizerFast, LlamaForCausalLM -model_path = "openbmb/MiniCPM-2B-dpo-bf16-llama-format" -tokenizer = LlamaTokenizerFast.from_pretrained(model_path) -model = LlamaForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True) + llm = LLM(model=model_name, + trust_remote_code=True, + tensor_parallel_size=1 + ) + sampling_params = SamplingParams(top_p=0.7, temperature=0.7, max_tokens=1024) -prompt="Now you act like a terminal situated within a beginner's C++ practice repository folder, please provide the output for the command: `ls -l`" -input_ids = tokenizer.encode("<用户>{}".format(prompt), return_tensors='pt', add_special_tokens=True).cuda() -responds = model.generate(input_ids, temperature=0.3, top_p=0.8, repetition_penalty=1.02, max_length=1024) -responds = tokenizer.decode(responds[0], skip_special_tokens=True) -print(responds) -``` + outputs = llm.generate(prompts=input_text, sampling_params=sampling_params) -##### MiniCPM-V - -```python -import torch -from PIL import Image -from transformers import AutoModel, AutoTokenizer - -model = AutoModel.from_pretrained('openbmb/MiniCPM-V', trust_remote_code=True) -tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V', trust_remote_code=True) -model.eval().cuda() - -image = Image.open('xx.jpg').convert('RGB') -question = 'What is in the image?' -msgs = [{'role': 'user', 'content': question}] - -res, context, _ = model.chat( - image=image, - msgs=msgs, - context=None, - tokenizer=tokenizer, - sampling=True, - temperature=0.7 -) -print(res) -``` - - -#### vLLM 推理 - -* 安装[vLLM](https://github.com/vllm-project/vllm) -```shell -pip install "vllm>=0.4.1" -``` - -* 测试样例 -```shell -python inference/inference_vllm.py --model_path --prompt_path prompts/prompt_demo.txt -``` - -* 期望输出 -```shell -<用户>: Which city is the capital of China? -: - The capital city of China is Beijing. Beijing is a major political, cultural, and economic center in China, and it is known for its rich history, beautiful architecture, and vibrant nightlife. It is also home to many of China's most important cultural and historical sites, including the Forbidden City, the Great Wall of China, and the Temple of Heaven. Beijing is a popular destination for tourists from around the world, and it is an important hub for international business and trade. -``` - -#### llama.cpp、Ollama、fastllm、mlx_lm推理 -MiniCPM支持[llama.cpp](https://github.com/ggerganov/llama.cpp/) 、[ollama](https://github.com/ollama/ollama)、[fastllm](https://github.com/ztxz16/fastllm)、[mlx_lm](https://github.com/ml-explore/mlx-examples)推理。感谢[@runfuture](https://github.com/runfuture)对llama.cpp和ollama的适配。 - -

+ print(outputs[0].outputs[0].text) + ``` #### llama.cpp -1. [安装llama.cpp](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#build) -2. 下载gguf形式的模型。[下载链接-fp16格式](https://huggingface.co/runfuture/MiniCPM-2B-dpo-fp16-gguf) [下载链接-q4km格式](https://huggingface.co/runfuture/MiniCPM-2B-dpo-q4km-gguf) -3. 在命令行运行示例代码: -``` -./main -m ../../model_ckpts/download_from_hf/MiniCPM-2B-dpo-fp16-gguf.gguf --prompt "<用户>写藏头诗,藏头是龙年大吉" --temp 0.3 --top-p 0.8 --repeat-penalty 1.05 -``` -更多参数调整[详见](https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md) - -

- -#### ollama -***ollama自动安装模型*** -1. [安装ollama](https://github.com/ollama/ollama) -2. 在命令行运行: -``` -ollama run modelbest/minicpm-2b-dpo -``` -***ollama手动安装模型*** -1. [安装ollama](https://github.com/ollama/ollama) -2. 下载gguf形式的模型。[下载链接2b-fp16格式](https://huggingface.co/runfuture/MiniCPM-2B-dpo-fp16-gguf) [下载链接2b-q4km格式](https://huggingface.co/runfuture/MiniCPM-2B-dpo-q4km-gguf) [下载链接1b-fp16格式](https://huggingface.co/linglingdan/MiniCPM-1b-fp16-gguf) [下载链接1b-qr_1格式](https://huggingface.co/linglingdan/MiniCPM-1b-q4-1) -3. 在命令行运行以下命令,model_name可自定义: -``` -touch model_name.Modelfile -``` -4. 将以上model_name.Modelfile的内容修改如下,FROM空格后写入gguf的模型路径 -``` -FROM model_path/model_name.gguf -TEMPLATE """{{ .Prompt }}{{ .Response }}""" -PARAMETER stop "<\s>" -``` -5. 在命令行运行以下命令,创建ollama模型,ollama_model_name可自定义,model_name.Modelfile参考第3步命名 -``` -ollama create ollama_model_name -f model_name.Modelfile -``` -6. 运行ollama模型: -``` -ollama run ollama_model_name -``` -

- -#### fastllm -1. [编译安装fastllm](https://github.com/ztxz16/fastllm) -2. 模型推理 -```python -import torch -from transformers import AutoTokenizer, LlamaTokenizerFast, AutoModelForCausalLM -path = 'openbmb/MiniCPM-2B-dpo-fp16' -tokenizer = AutoTokenizer.from_pretrained(path) -model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float16, device_map='cuda', trust_remote_code=True) -from fastllm_pytools import llm -llm.set_device_map("cpu") -model = llm.from_hf(model, tokenizer, dtype = "float16") # dtype支持 "float16", "int8", "int4" -print(model.response("<用户>山东省最高的山是哪座山, 它比黄山高还是矮?差距多少?", top_p=0.8, temperature=0.5, repeat_penalty=1.02)) -``` -

- -#### mlx_lm -1. 安装mlx_lm库 - ```shell - pip install mlx_lm - ``` -2. 下载转换后的模型权重[MiniCPM-2B-sft-bf16-llama-format-mlx](https://huggingface.co/mlx-community/MiniCPM-2B-sft-bf16-llama-format-mlx) -3. 模型推理 - ```shell - python -m mlx_lm.generate --model mlx-community/MiniCPM-2B-sft-bf16-llama-format-mlx --prompt "hello, tell me a joke." --trust-remote-code - ``` - -

- -#### powerinfer -powerinfer目前仅针对MiniCPM-S-1B模型,其他版本暂不支持,敬请期待。 -1. 保证cmake版本3.17以上,如果已经安装过,则跳过此步骤 - ```bash - # 下载安装包 - sudo wget https://cmake.org/files/v3.23/cmake-3.23.0.tar.gz - # 解压安装包 - sudo tar -zxvf cmake-3.23.0.tar.gz - # 配置安装环境 - sudo ./configure - sudo make -j8 - # 编译安装 - sudo make install - # 查看安装后版本 - cmake --version - # 返回版本号则安装成功 - #cmake version 3.23.0 +* 安装 llama.cpp + ```shell + git clone https://github.com/OpenBMB/llama.cpp.git + git checkout minicpm3 + cd llama.cpp + make ``` -2. 安装powerinfer: -```bash - git clone https://github.com/SJTU-IPADS/PowerInfer - cd PowerInfer - pip install -r requirements.txt # install Python helpers' dependencies -``` -3. cpu版本powerinfer编译,如果你的机器只有cpu,或者只想使用cpu进行推理,则运行以下命令: -```bash - cmake -S . -B build - cmake --build build --config Release -``` -4. gpu版本powerinfer编译,如果你的机器有gpu,则可以运行以下命令: -```bash - cmake -S . -B build -DLLAMA_CUBLAS=ON - cmake --build build --config Release -``` -5. 获取稀疏模型 -```bash -git clone https://huggingface.co/openbmb/MiniCPM-S-1B-sft-gguf/tree/main -#or -git clone https://modelscope.cn/models/OpenBMB/MiniCPM-S-1B-sft-gguf -``` -6. 模型推理: -```bash -cd PowerInfer -# 以下是命令模版,output_token_count为最大输出tokens,thread_num 为线程数,prompt为输入prompt字符 -#./build/bin/main -m /PATH/TO/MODEL -n $output_token_count -t $thread_num -p $prompt -# 以下是示例 -./build/bin/main -m /root/ld/ld_model_pretrain/1b-s-minicpm/MiniCPM-S-1B-sft.gguf -n 2048 -t 8 -p '<用户>hello,tell me a story please.' -``` - -

- -## 模型量化 -

- -**gptq量化** -1. 首先git获取[minicpm_gptqd代码](https://github.com/LDLINGLINGLING/AutoGPTQ/tree/minicpm_gptq) -2. 进入minicpm_gptqd主目录./AutoGPTQ,命令行输入: - ``` - pip install e . - ``` -3. 前往[模型下载](#1)下载未量化的MiniCPM仓库下所有文件放至本地同一文件夹下,1b、2b模型均可,训练后模型亦可。 -4. 命令行输入以下命令,其中no_quantized_model_path是第3步模型下载路径,save_path是量化模型保存路径,--bits 为量化位数可以选择输入4或者8 - ``` - cd Minicpm/quantize - python gptq_quantize.py --pretrained_model_dir no_quant_model_path --quantized_model_dir quant_save_path --bits 4 - ``` -5. 可以使用./AutoGPTQ/examples/quantization/inference.py进行推理,也可以参考前文使用vllm对量化后的模型,单卡4090下minicpm-1b-int4模型vllm推理在2000token/s左右。 - -

- -**awq量化** -1. 在quantize/awq_quantize.py 文件中修改根据注释修改配置参数: +* 创建模型目录 + ```shell + cd llama.cpp/models + mkdir Minicpm3 + ``` +* 下载 MiniCPM3 模型所有文件到 `llama.cpp/models/Minicpm3` + ```shell + cd llama.cpp/models/Minicpm3 + git clone https://huggingface.co/openbmb/MiniCPM3-4B + ``` +* 将模型转换为 gguf 格式,并且量化: ```python - model_path = '/root/ld/ld_model_pretrained/MiniCPM-1B-sft-bf16' # model_path or model_id - quant_path = '/root/ld/ld_project/pull_request/MiniCPM/quantize/awq_cpm_1b_4bit' # quant_save_path - quant_data_path='/root/ld/ld_project/pull_request/MiniCPM/quantize/quantize_data/wikitext'# 写入自带量化数据集,data下的alpaca或者wikitext - quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" } # "w_bit":4 or 8 - quant_samples=512 # how many samples to use for calibration - custom_data=[{'question':'你叫什么名字。','answer':'我是openmbmb开源的小钢炮minicpm。'}, # 自定义数据集可用 - {'question':'你有什么特色。','answer':'我很小,但是我很强。'}] + python3 -m pip install -r requirements.txt + # 将pytorch模型转化为fp16的gguf + python3 convert-hf-to-gguf.py models/Minicpm3/ --outfile /your/path/llama.cpp/models/Minicpm3/CPM-4B-F16.gguf + # 完成以上步骤,llama.cpp/models/Minicpm3目录下有一个CPM-4B-F16.gguf的模型文件 + ./llama-quantize ./models/Minicpm3/CPM-4B-F16.gguf ./models/Minicpm3/ggml-model-Q4_K_M.gguf Q4_K_M + # 使用本行代码执行成功后,./models/Minicpm3下将存在ggml-model-Q4_K_M.gguf的4bit量化文件 ``` -2. 在quantize/quantize_data文件下已经提供了alpaca和wiki_text两个数据集作为量化校准集,修改上述quant_data_path为其中一个文件夹的路径 -3. 如果需要自定义数据集,修改quantize/awq_quantize.py中的custom_data变量,如: - ```python - custom_data=[{'question':'过敏性鼻炎有什么症状?','answer':'过敏性鼻炎可能鼻塞,流鼻涕,头痛等症状反复发作,严重时建议及时就医。'}, - {'question':'1+1等于多少?','answer':'等于2'}] - ``` -4. 根据选择的数据集,选择以下某一行代码替换 quantize/awq_quantize.py 中第三十八行: - ```python - #使用wikitext进行量化 - model.quantize(tokenizer, quant_config=quant_config, calib_data=load_wikitext(quant_data_path=quant_data_path)) - #使用alpaca进行量化 - model.quantize(tokenizer, quant_config=quant_config, calib_data=load_alpaca(quant_data_path=quant_data_path)) - #使用自定义数据集进行量化 - model.quantize(tokenizer, quant_config=quant_config, calib_data=load_cust_data(quant_data_path=quant_data_path)) - +* 推理 + ```shell + ./llama-cli -c 1024 -m ./models/Minicpm/ggml-model-Q4_K_M.gguf -n 1024 --top-p 0.7 --temp 0.7 --prompt "<|im_start|>user\n请写一篇关于人工智能的文章,详细介绍人工智能的未来发展和隐患。<|im_end|>\n<|im_start|>assistant\n" ``` -5. 运行quantize/awq_quantize.py文件,在设置的quan_path目录下可得awq量化后的模型。 -

-

+### 模型微调 +#### LLaMA-Factory +目前模型微调支持 [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory),使用方法参考 [LLaMA-Factory 微调](https://modelbest.feishu.cn/docx/Z7USdW4lloZzkZxQ14icJ3senjb?from=from_copylink)。 -**bnb量化** -1. 在quantize/bnb_quantize.py 文件中修改根据注释修改配置参数: -```python -model_path = "/root/ld/ld_model_pretrain/MiniCPM-1B-sft-bf16" # 模型地址 -save_path = "/root/ld/ld_model_pretrain/MiniCPM-1B-sft-bf16_int4" # 量化模型保存地址 -``` -2. 更多量化参数可根据注释以及llm.int8()算法进行修改(optional): -```python -quantization_config = BitsAndBytesConfig( - load_in_4bit=True, # 是否进行4bit量化 - load_in_8bit=False, # 是否进行8bit量化 - bnb_4bit_compute_dtype=torch.float16, # 计算精度设置 - bnb_4bit_quant_storage=torch.uint8, # 量化权重的储存格式 - bnb_4bit_quant_type="nf4", # 量化格式,这里用的是正太分布的int4 - bnb_4bit_use_double_quant=True, # 是否采用双量化,即对zeropoint和scaling参数进行量化 - llm_int8_enable_fp32_cpu_offload=False, # 是否llm使用int8,cpu上保存的参数使用fp32 - llm_int8_has_fp16_weight=False, # 是否启用混合精度 - #llm_int8_skip_modules=["out_proj", "kv_proj", "lm_head"], # 不进行量化的模块 - llm_int8_threshold=6.0, # llm.int8()算法中的离群值,根据这个值区分是否进行量化 -) -``` -3. 运行quantize/bnb_quantize.py文件,在设置的save_path目录下可得bnb量化后的模型。 -```python -cd MiniCPM/quantize -python bnb_quantize.py -``` +### 进阶功能 +#### 工具调用 -**量化测试** -1. 命令行进入到 MiniCPM/quantize 目录下 -2. 修改quantize_eval.sh文件中awq_path,gptq_path,awq_path,如果不需要测试的类型保持为空字符串,如下示例表示仅测试awq模型: - ``` - awq_path="/root/ld/ld_project/AutoAWQ/examples/awq_cpm_1b_4bit" - gptq_path="" - model_path="" - ``` -3. 在MiniCPM/quantize路径下命令行输入: - ``` - bash quantize_eval.sh - ``` -4. 窗口将输出该模型的内存占用情况、困惑度。 -

+我们提供了使用 MiniCPM3 调用工具的示例代码,见[`demo/function_calling.py`](./demo/function_calling.py)。 -## 开源社区 -- [xtuner](https://github.com/InternLM/xtuner): [MiniCPM高效率微调的不二选择](https://modelbest.feishu.cn/wiki/AIU3wbREcirOm9kkvd7cxujFnMb#AMdXdzz8qoadZhxU4EucELWznzd) -- [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory.git):[MiniCPM微调一键式解决方案](https://modelbest.feishu.cn/wiki/AIU3wbREcirOm9kkvd7cxujFnMb#BAWrdSjXuoFvX4xuIuzc8Amln5E) -- [ChatLLM框架](https://github.com/foldl/chatllm.cpp):[在CPU上跑MiniCPM](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16/discussions/2#65c59c4f27b8c11e43fc8796) +#### 代码解释器 +我们提供了一个 MiniCPM3 使用代码解释器的示例代码,见[`demo/code_interpreter.py`](./demo/code_interpreter.py)。 +下面是一个 Demo: -

+![code_interpreter](./assets/code_interpreter.gif) -## 评测结果 +## MiniCPM 2.0 -#### 评测设置 +
+查看 MiniCPM 2.0 的详细信息 -* 由于大模型评测难以统一,且大量评测也没有公开的prompt和测试代码,对于具体评测方式,我们只能尽量做到适合各类模型。 -* 整体而言,我们测试时采用统一的prompt输入,并按照各模型对应的模板进行输入调整。 -* **评测脚本及prompt已开源在我们的Github仓库中,也欢迎更多开发者来不断改进我们的评测方式。** - * 文本评测部分,采用了我们的开源大模型能力评测框架[UltraEval](https://github.com/OpenBMB/UltraEval)。以下为开源模型复现流程: - * 安装UltraEval - ```shell - git clone https://github.com/OpenBMB/UltraEval.git - cd UltraEval - pip install -e . - ``` - * 下载相关数据并解压处理 - ```shell - wget -O RawData.zip "https://cloud.tsinghua.edu.cn/f/71b5232264ae4833a4d0/?dl=1" - unzip RawData.zip - python data_process.py - ``` - * 执行评测脚本(提供了模板,可自定义) - ```shell - bash run_eval.sh - ``` +MiniCPM 2.0 系列模型对 MiniCPM 进行了多个维度的升级,包括以下模型版本: +- MiniCPM-2B-128k:将 MiniCPM-2B 的上下文长度从 4k 扩展至 128k,在 InfiniteBench 测试集上优于 ChatGLM3-6B-128k、Yi-6B-200k 等更大参数量的模型。 +- MiniCPM-MoE-8x2B:基于 MiniCPM-2B 进行 MoE 扩展,综合表现相比于 MiniCPM-2B 平均提高 4.5 个百分点。 +- MiniCPM-1B:相比于 MiniCPM-2B 成本下降 60%,综合表现仍然优于 LLaMA2-13B。 +- MiniCPM-S-1B:在保持下游任务性能无损的前提下,FFN 层实现了 87.89% 的平均稀疏度,将 FFN FLOPs 降低了 84%。结合 PowerInfer 推理框架,解码速度提升约 2.8 倍。 -#### 部署模式 - -* 因为MiniCPM采用Mup的结构,与现有模型在具体计算上有细微差别,我们是基于vllm=0.2.2版本进行了我们模型的实现。 -* **对于非MiniCPM模型,我们采用了vllm=0.2.7的最新版本进行推理。** - -#### 评测度量 - -* 对于QA任务(选择题任务),我们选用两种方式进行测试: - * PPL:将选项作为题目生成的延续,并根据各个选项的PPL来进行答案选择; - * 第二种是直接生成答案选项。 -* 对于不同模型,这两种方式得到的结果差异较大。MiniCPM两种模式上的结果较为接近,而Mistral-7B-v0.1等模型在PPL上表现较好,直接生成上效果较差。 -* 在具体评测时,我们以两种评测方式得分的最高者为最终结果,以此保证对比的公平性(以下表格中*号表示采用PPL)。 - -#### 文本模型评测 - -**越级比较:** -|模型|平均分|英文均分|中文均分|C-Eval|CMMLU|MMLU|HumanEval|MBPP|GSM8K|MATH|BBH|ARC-E|ARC-C|HellaSwag| -|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| -|Llama2-7B|35.40|36.21|31.765|32.42|31.11|44.32|12.2|27.17|13.57|1.8|33.23|75.25|42.75|75.62*| -|Qwen-7B|49.46|47.19|59.655|58.96|60.35|57.65|17.07|42.15|41.24|5.34|37.75|83.42|64.76|75.32*| -|Deepseek-7B|39.96|39.15|43.64|42.82|44.45|47.82|20.12|41.45|15.85|1.53|33.38|74.58*|42.15*|75.45*| -|Mistral-7B|48.97|49.96|44.54|46.12|42.96|62.69|27.44|45.2|33.13|5.0|41.06|83.92|70.73|80.43*| -|Llama2-13B|41.48|42.44|37.19|37.32|37.06|54.71|17.07|32.55|21.15|2.25|37.92|78.87*|58.19|79.23*| -|MPT-30B|38.17|39.82|30.72|29.34|32.09|46.56|21.95|35.36|10.31|1.56|38.22|78.66*|46.08*|79.72*| -|Falcon-40B|43.62|44.21|40.93|40.29|41.57|53.53|24.39|36.53|22.44|1.92|36.24|81.94*|57.68|83.26*| -|MiniCPM-2B|52.33|52.6|51.1|51.13|51.07|53.46|50.00|47.31|53.83|10.24|36.87|85.44|68.00|68.25| - -**同级比较:** -|模型|平均分|英文均分|中文均分|C-Eval|CMMLU|MMLU|HumanEval|MBPP|GSM8K|MATH|BBH|ARC-E|ARC-C|HellaSwag| -|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| -|TinyLlama-1.1B|25.36|25.55|24.525|25.02|24.03|24.3|6.71|19.91|2.27|0.74|28.78|60.77*|28.15*|58.33*|Qwen-1.8B|34.72|31.87|47.565|49.81|45.32|43.37|7.93|17.8|19.26|2.42|29.07|63.97*|43.69|59.28*| -|Qwen-1.8B|34.72|31.87|47.57|49.81|45.32|43.37|7.93|17.80|19.26|2.42|29.07|63.97*|43.69|59.28*| -|Gemini Nano-3B|-|-|-|-|-|-|-|27.2(report)|22.8(report)|-|42.4(report)|-|-|-| -|StableLM-Zephyr-3B|43.46|46.31|30.62|30.34|30.89|45.9|35.37|31.85|52.54|12.49|37.68|73.78|55.38|71.87*| -|Phi-2-2B|48.84|54.41|23.78|23.37|24.18|52.66|47.56|55.04|57.16|3.5|43.39|86.11|71.25|73.07*| -|MiniCPM-2B|52.33|52.6|51.10|51.13|51.07|53.46|50.00|47.31|53.83|10.24|36.87|85.44|68.00|68.25| - -**Chat模型比较:** -|模型|平均分|英文均分|中文均分|C-Eval|CMMLU|MMLU|HumanEval|MBPP|GSM8K|MATH|BBH|ARC-E|ARC-C|HellaSwag| -|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| -|ChatGLM2-6B|37.98|35.17|50.63|52.05|49.21|45.77|10.37|9.38|22.74|5.96|32.6|74.45|56.82|58.48*| -|Mistral-7B-Instruct-v0.1|44.36|45.89|37.51|38.06|36.96|53.56|29.27|39.34|28.73|3.48|39.52|81.61|63.99|73.47*| 
-|Mistral-7B-Instruct-v0.2|50.91|52.83|42.235|42.55|41.92|60.51|36.59|48.95|40.49|4.95|39.81|86.28|73.38|84.55*| -|Qwen-7B-Chat|44.93|42.05|57.9|58.57|57.23|56.03|15.85|40.52|42.23|8.3|37.34|64.44*|39.25*|74.52*| -|Yi-6B-Chat|50.46|45.89|70.995|70.88|71.11|62.95|14.02|28.34|36.54|3.88|37.43|84.89|70.39|74.6*| -|Baichuan2-7B-Chat|44.68|42.74|53.39|53.28|53.5|53|21.34|32.32|25.25|6.32|37.46|79.63|60.15|69.23*| -|Deepseek-7B-chat|49.34|49.56|48.335|46.95|49.72|51.67|40.85|48.48|48.52|4.26|35.7|76.85|63.05|76.68*| -|Llama2-7B-Chat|38.16|39.17|33.59|34.54|32.64|47.64|14.02|27.4|21.15|2.08|35.54|74.28|54.78|75.65*| -|MiniCPM-2B|52.33|52.6|51.10|51.13|51.07|53.46|50.00|47.31|53.83|10.24|36.87|85.44|68.00|68.25| - -**DPO后模型比较:** - -|模型|MT-bench| -|---|---| -|GPT-4-turbo|9.32| -|GPT-3.5-turbo|8.39| -|Mistral-8*7b-Instruct-v0.1|8.30| -|Claude-2.1|8.18| -|Zephyr-7B-beta|7.34| -|**MiniCPM-2B**|**7.25**| -|Vicuna-33B|7.12| -|Zephyr-7B-alpha|6.88| -|LLaMA-2-70B-chat|6.86| -|Mistral-7B-Instruct-v0.1|6.84| -|MPT-34B-instruct|6.39| +### 评测结果 #### MiniCPM-2B-128k 模型评测 | Model | avg | avg w/o code&math | passkey | number_string | kv_retrieval | longbook_choice_eng | longbook_qa_chn | longbook_qa_eng | longbook_sum_eng | longdialogue_qa_eng | math_calc | math_find | code_debug | code_run | @@ -535,7 +450,7 @@ python bnb_quantize.py | chatglm3-6b-128k | 25.58 | 36.57 | 89.93 | 99.66 | 5.2 | 46.29 | 10.7 | 8.38 | 25.91 | 6.5 | 0 | 8 | 5.33 | 1 | | MiniCPM-2.4B-128k | 27.32 | 37.68 | 98.31 | 99.83 | 9 | 29.69 | 23.06 | 16.33 | 15.73 | 9.5 | 0 | 4.29 | 22.08 | 0 | -#### MiniCPM-MoE-8x2B模型评测 +#### MiniCPM-MoE-8x2B 模型评测
@@ -635,305 +550,200 @@ python bnb_quantize.py -

- 注:* 表示结果取自技术报告。† 表示评测集为MBPP全集。 -#### 多模态模型评测 +#### MiniCPM-S-1B 评测结果 -
+- 代码生成:在 HumanEval(0-shot)和 MBPP(3-shot)上的平均 pass@1 得分。 +- 常识推理:在 PIQA、SIQA、HellaSwag、WinoGrande 和 COPA 上的平均 0-shot 准确率。 +- 阅读理解:在 BoolQ、LAMBADA 和 TyDi QA 上的平均 0-shot 准确率。 -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
ModelSizeTextVQA valDocVQA testOCRBenchOpenCompassMMEMMB dev(en)MMB dev(zh)MMMU valMathVistaLLaVA BenchObject HalBench
Proprietary models
Gemini Pro Vision- 74.688.168063.82148.975.274.048.945.879.9-
GPT-4V- 78.088.464563.21771.575.175.053.847.893.186.4 / 92.7
Open-source models 6B~34B
Yi-VL-6B6.7B45.5*17.1*29049.31915.1 68.6 68.3 40.3 28.8 51.9 -
Qwen-VL-Chat9.6B61.562.6488 52.1 1860.0 60.6 56.7 37.0 33.8 67.7 56.2 / 80.0
Yi-VL-34B34B43.4*16.9*29052.6 2050.271.171.445.130.762.3-
DeepSeek-VL-7B7.3B64.7*47.0* 43555.6 1765.4 74.1 72.8 38.3 36.877.8 -
TextMonkey9.7B64.366.7 558- - - - - -- -
CogVLM-Chat17.4B70.433.3*590 52.5 1736.6 63.7 53.8 37.3 34.7 73.9 73.6 / 87.4
Open-source models 1B~3B
DeepSeek-VL-1.3B1.7B58.4*37.9*41346.0 1531.6 64.0 61.2 33.8 29.4 51.1 -
MobileVLM V23.1B57.519.4*--1440.5(P) 63.2 -----
Mini-Gemini2.2B56.234.2*--1653.0 59.8 - 31.7 -- -
MiniCPM-V2.8B 60.638.2 36647.61650.2 67.9 65.3 38.328.951.3 78.4 / 88.5
MiniCPM-V 2.02.8B 74.171.9 60555.01808.6 69.6 68.1 38.2 38.769.2 85.5 / 92.2
+其他测试集:我们报告在GSM8K(8-shot)、MMLU(5-shot)、BBH(3-shot)和 AGI-Eval(0-shot)上的平均准确率。 -
-* 我们自己评测了正式开源的模型权重。 +| Setting | Average
Sparsity | Average
Performance | Code
Generation | Commonsense
Reasoning | Reading
Comprehension | GSM8K | MMLU | BBH | AGI Eval | +| :-------------------: | :----------------: | :----------------------: | :----------------------: | :---: | :---: | :---: | :---------: | :-----: | :-----------------: | +| LLaMA2-7B | - | 37.96 | 16.37 | 69.59 | 61.87 | 12.96 | 44.45 | 32.96 | 27.53 | +| ReluLLaMA-7B | 66.98 | 37.62 | 15.85 | 69.64 | 70.54 | 5.84 | 38.64 | 35.07 | 27.73 | +| **ProSparse-7B**\* | 88.11 | 38.31 | 19.47 | 66.29 | 63.33 | 12.74 | 45.21 | 33.59 | 27.55 | +| **ProSparse-7B** | **89.32** | **38.46** | 19.42 | 66.27 | 63.50 | 12.13 | 45.48 | 34.99 | 27.46 | +| LLaMA2-13B | - | 44.06 | 20.19 | 72.58 | 71.55 | 22.21 | 54.69 | 37.89 | 29.33 | +| ReluLLaMA-13B | 71.56 | 42.74 | 20.19 | 70.44 | 73.29 | 18.50 | 50.58 | 37.97 | 28.22 | +| **ProSparse-13B**\* | 87.97 | **45.07** | 29.03 | 69.75 | 67.54 | 25.40 | 54.78 | 40.20 | 28.76 | +| **ProSparse-13B** | **88.80** | 44.90 | 28.42 | 69.76 | 66.91 | 26.31 | 54.35 | 39.90 | 28.67 | +| MiniCPM-1B | - | 44.44 | 36.85 | 63.67 | 60.90 | 35.48 | 50.44 | 35.03 | 28.71 | +| **MiniCPM-S-1B**\* | 86.25 | **44.72** | 41.38 | 64.55 | 60.69 | 34.72 | 49.36 | 34.04 | 28.27 | +| **MiniCPM-S-1B** | **87.89** | **44.72** | 42.04 | 64.37 | 60.73 | 34.57 | 49.51 | 34.08 | 27.77 | + +注: +1. ReluLLaMA-7B 和 ReluLLaMA-13B 的下载链接分别是 [7B](https://huggingface.co/SparseLLM/ReluLLaMA-7B) and [13B](https://huggingface.co/SparseLLM/ReluLLaMA-13B)。"ProSparse-7B\*"、"ProSparse-13B\*" 和 "MiniCPM-S-1B\*" 代表没有激活阈值偏移的 ProSparse 版本。 +2. 对于 PIQA、SIQA、HellaSwag、WinoGrande、COPA、BoolQ、LAMBADA、TyDi QA 和 AGI-Eval,我们根据各个选项的 PPL 来进行答案选择。对于 GSM8K、MMLU 和 BBH,我们直接生成答案。 + +### 模型推理 + +#### HuggingFace、vLLM推理 + +参考 MiniCPM 1.0 中的[模型推理](#huggingface-推理)部分。 + +#### Powerinfer 推理 + +针对 MiniCPM-S-1B 模型,我们可以使用 Powerinfer 进行推理加速,使用方法如下: + +1. 保证cmake版本3.17以上,如果已经安装过,则跳过此步骤 + ```bash + # 下载安装包 + sudo wget https://cmake.org/files/v3.23/cmake-3.23.0.tar.gz + # 解压安装包 + sudo tar -zxvf cmake-3.23.0.tar.gz + # 配置安装环境 + sudo ./configure + sudo make -j8 + # 编译安装 + sudo make install + # 查看安装后版本 + cmake --version + # 返回版本号则安装成功 + #cmake version 3.23.0 + ``` +2. 安装powerinfer: +```bash + git clone https://github.com/SJTU-IPADS/PowerInfer + cd PowerInfer + pip install -r requirements.txt # install Python helpers' dependencies +``` +3. cpu版本powerinfer编译,如果你的机器只有cpu,或者只想使用cpu进行推理,则运行以下命令: +```bash + cmake -S . -B build + cmake --build build --config Release +``` +4. gpu版本powerinfer编译,如果你的机器有gpu,则可以运行以下命令: +```bash + cmake -S . -B build -DLLAMA_CUBLAS=ON + cmake --build build --config Release +``` +5. 获取稀疏模型 +```bash +git clone https://huggingface.co/openbmb/MiniCPM-S-1B-sft-gguf/tree/main +#or +git clone https://modelscope.cn/models/OpenBMB/MiniCPM-S-1B-sft-gguf +``` +6. 模型推理: +```bash +cd PowerInfer +# 以下是命令模版,output_token_count为最大输出tokens,thread_num 为线程数,prompt为输入prompt字符 +#./build/bin/main -m /PATH/TO/MODEL -n $output_token_count -t $thread_num -p $prompt +# 以下是示例 +./build/bin/main -m /root/ld/ld_model_pretrain/1b-s-minicpm/MiniCPM-S-1B-sft.gguf -n 2048 -t 8 -p '<用户>hello,tell me a story please.' +``` +
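作为直观说明,下面用一个与本仓库无关的玩具示例解释上文"FFN 平均稀疏度"的含义(网络规模与输入均为随意假设,并非 ProSparse/MiniCPM-S 的实际统计脚本):

```python
import torch
import torch.nn as nn

# 玩具示例:在随机输入上统计一层 ReLU FFN 的激活稀疏度(零激活所占比例)
torch.manual_seed(0)
ffn = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

activations = []
ffn[1].register_forward_hook(lambda module, inp, out: activations.append(out.detach()))

x = torch.randn(8, 1024)
ffn(x)

sparsity = (activations[0] == 0).float().mean().item()
print(f"FFN 激活稀疏度:{sparsity:.2%}")  # 零激活越多,推理时可以跳过的乘加运算就越多
```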
+ +## MiniCPM 1.0 + +
+查看 MiniCPM 1.0 的详细信息 + +MiniCPM-2B 语言模型有 24亿(2.4B)的非词嵌入参数量, 总计 2.7B 参数量。 +- 经过 SFT 后,MiniCPM-2B 在公开评测集上与 Mistral-7B 表现相近(中文、数学、代码能力更优),整体性能超越 Llama2-13B、MPT-30B、Falcon-40B 等模型。 +- 经过 DPO 后,MiniCPM-2B 在 MTBench 上也超越了 Llama2-70B-Chat、Vicuna-33B、Mistral-7B-Instruct-v0.1、Zephyr-7B-alpha 等众多代表性开源大模型。 + +注意:为了保证在学术研究用途上模型的通用性,我们**未对 MiniCPM-2B 进行任何身份认同训练**。同时由于我们用 ShareGPT 开源语料作为部分训练数据,模型可能会输出类似 GPT 系列模型的身份认同信息。 + +### 评测结果 + +#### 评测设置 + +* 由于大模型评测难以统一,且大量评测也没有公开的prompt和测试代码,对于具体评测方式,我们只能尽量做到适合各类模型。 +* 整体而言,我们测试时采用统一的prompt输入,并按照各模型对应的模板进行输入调整。 +* **评测脚本及prompt已开源在我们的Github仓库中,也欢迎更多开发者来不断改进我们的评测方式。** + * 文本评测部分,采用了我们的开源大模型能力评测框架[UltraEval](https://github.com/OpenBMB/UltraEval)。以下为开源模型复现流程: + * 安装UltraEval + ```shell + git clone https://github.com/OpenBMB/UltraEval.git + cd UltraEval + pip install -e . + ``` + * 下载相关数据并解压处理 + ```shell + wget -O RawData.zip "https://cloud.tsinghua.edu.cn/f/71b5232264ae4833a4d0/?dl=1" + unzip RawData.zip + python data_process.py + ``` + * 执行评测脚本(提供了模板,可自定义) + ```shell + bash run_eval.sh + ``` + +#### 部署模式 + +* 因为MiniCPM采用Mup的结构,与现有模型在具体计算上有细微差别,我们是基于vllm=0.2.2版本进行了我们模型的实现。 +* **对于非MiniCPM模型,我们采用了vllm=0.2.7的最新版本进行推理。** + +#### 评测度量 + +* 对于QA任务(选择题任务),我们选用两种方式进行测试: + * PPL:将选项作为题目生成的延续,并根据各个选项的PPL来进行答案选择; + * 第二种是直接生成答案选项。 +* 对于不同模型,这两种方式得到的结果差异较大。MiniCPM两种模式上的结果较为接近,而Mistral-7B-v0.1等模型在PPL上表现较好,直接生成上效果较差。 +* 在具体评测时,我们以两种评测方式得分的最高者为最终结果,以此保证对比的公平性(以下表格中*号表示采用PPL)。 + +#### 文本模型评测 + +**越级比较:** +|模型|平均分|英文均分|中文均分|C-Eval|CMMLU|MMLU|HumanEval|MBPP|GSM8K|MATH|BBH|ARC-E|ARC-C|HellaSwag| +|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| +|Llama2-7B|35.40|36.21|31.765|32.42|31.11|44.32|12.2|27.17|13.57|1.8|33.23|75.25|42.75|75.62*| +|Qwen-7B|49.46|47.19|59.655|58.96|60.35|57.65|17.07|42.15|41.24|5.34|37.75|83.42|64.76|75.32*| +|Deepseek-7B|39.96|39.15|43.64|42.82|44.45|47.82|20.12|41.45|15.85|1.53|33.38|74.58*|42.15*|75.45*| +|Mistral-7B|48.97|49.96|44.54|46.12|42.96|62.69|27.44|45.2|33.13|5.0|41.06|83.92|70.73|80.43*| +|Llama2-13B|41.48|42.44|37.19|37.32|37.06|54.71|17.07|32.55|21.15|2.25|37.92|78.87*|58.19|79.23*| +|MPT-30B|38.17|39.82|30.72|29.34|32.09|46.56|21.95|35.36|10.31|1.56|38.22|78.66*|46.08*|79.72*| +|Falcon-40B|43.62|44.21|40.93|40.29|41.57|53.53|24.39|36.53|22.44|1.92|36.24|81.94*|57.68|83.26*| +|MiniCPM-2B|52.33|52.6|51.1|51.13|51.07|53.46|50.00|47.31|53.83|10.24|36.87|85.44|68.00|68.25| + +**同级比较:** +|模型|平均分|英文均分|中文均分|C-Eval|CMMLU|MMLU|HumanEval|MBPP|GSM8K|MATH|BBH|ARC-E|ARC-C|HellaSwag| +|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| +|TinyLlama-1.1B|25.36|25.55|24.525|25.02|24.03|24.3|6.71|19.91|2.27|0.74|28.78|60.77*|28.15*|58.33*|Qwen-1.8B|34.72|31.87|47.565|49.81|45.32|43.37|7.93|17.8|19.26|2.42|29.07|63.97*|43.69|59.28*| +|Qwen-1.8B|34.72|31.87|47.57|49.81|45.32|43.37|7.93|17.80|19.26|2.42|29.07|63.97*|43.69|59.28*| +|Gemini Nano-3B|-|-|-|-|-|-|-|27.2(report)|22.8(report)|-|42.4(report)|-|-|-| +|StableLM-Zephyr-3B|43.46|46.31|30.62|30.34|30.89|45.9|35.37|31.85|52.54|12.49|37.68|73.78|55.38|71.87*| +|Phi-2-2B|48.84|54.41|23.78|23.37|24.18|52.66|47.56|55.04|57.16|3.5|43.39|86.11|71.25|73.07*| +|MiniCPM-2B|52.33|52.6|51.10|51.13|51.07|53.46|50.00|47.31|53.83|10.24|36.87|85.44|68.00|68.25| + +**Chat模型比较:** +|模型|平均分|英文均分|中文均分|C-Eval|CMMLU|MMLU|HumanEval|MBPP|GSM8K|MATH|BBH|ARC-E|ARC-C|HellaSwag| +|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| +|ChatGLM2-6B|37.98|35.17|50.63|52.05|49.21|45.77|10.37|9.38|22.74|5.96|32.6|74.45|56.82|58.48*| +|Mistral-7B-Instruct-v0.1|44.36|45.89|37.51|38.06|36.96|53.56|29.27|39.34|28.73|3.48|39.52|81.61|63.99|73.47*| 
+|Mistral-7B-Instruct-v0.2|50.91|52.83|42.235|42.55|41.92|60.51|36.59|48.95|40.49|4.95|39.81|86.28|73.38|84.55*| +|Qwen-7B-Chat|44.93|42.05|57.9|58.57|57.23|56.03|15.85|40.52|42.23|8.3|37.34|64.44*|39.25*|74.52*| +|Yi-6B-Chat|50.46|45.89|70.995|70.88|71.11|62.95|14.02|28.34|36.54|3.88|37.43|84.89|70.39|74.6*| +|Baichuan2-7B-Chat|44.68|42.74|53.39|53.28|53.5|53|21.34|32.32|25.25|6.32|37.46|79.63|60.15|69.23*| +|Deepseek-7B-chat|49.34|49.56|48.335|46.95|49.72|51.67|40.85|48.48|48.52|4.26|35.7|76.85|63.05|76.68*| +|Llama2-7B-Chat|38.16|39.17|33.59|34.54|32.64|47.64|14.02|27.4|21.15|2.08|35.54|74.28|54.78|75.65*| +|MiniCPM-2B|52.33|52.6|51.10|51.13|51.07|53.46|50.00|47.31|53.83|10.24|36.87|85.44|68.00|68.25| + +**DPO后模型比较:** + +|模型|MT-bench| +|---|---| +|GPT-4-turbo|9.32| +|GPT-3.5-turbo|8.39| +|Mistral-8*7b-Instruct-v0.1|8.30| +|Claude-2.1|8.18| +|Zephyr-7B-beta|7.34| +|**MiniCPM-2B**|**7.25**| +|Vicuna-33B|7.12| +|Zephyr-7B-alpha|6.88| +|LLaMA-2-70B-chat|6.86| +|Mistral-7B-Instruct-v0.1|6.84| +|MPT-34B-instruct|6.39| +### 快速上手 -

+#### 在线体验 -## 手机部署 -

- -#### 部署步骤 - -* 进行Int4量化后,MiniCPM只占2GB空间,具备在端侧手机进行模型部署的条件。 -* 对于不同的操作系统,我们进行了不同的适配。 -* **注意:当前开源框架对手机支持还在完善,并非所有芯片与操作系统版本均能成功运行MLC-LLM或LLMFarm。** -* Android、HarmonyOS - * 使用开源框架MLC-LLM进行模型适配。 - * 支持文本模型、多模态模型。 - * 适用于MiniCPM-2B-SFT-INT4、MiniCPM-2B-DPO-INT4、MiniCPM-V。 - * [编译安装MiniCPM指南](https://github.com/OpenBMB/mlc-MiniCPM) -* iOS - * 使用开源框架LLMFarm进行模型适配。 - * 支持文本模型。 - * 适用于MiniCPM-2B-SFT-INT4、MiniCPM-2B-DPO-INT4。 - * [编译安装MiniCPM指南](https://github.com/OpenBMB/LLMFarm) - -#### 部署性能 - -* 我们未针对手机推理模型进行深度优化和系统测试,仅验证MiniCPM使用手机芯片进行推理的可行性。**我们也欢迎更多开发者进一步调优并更新下面的测试列表,不断提升端侧大模型在手机上的推理性能**。 - -|手机型号|操作系统|处理器|Memory(GB)|文本吞吐(token/s)| -|-|-|-|-|-| -|OPPO Find N3|Android 13|snapdragon 8 Gen2|12|6.5| -|Samsung S23 Ultra|Android 14|snapdragon 8 Gen2|12|6.4| -|Meizu M182Q|Android 11|snapdragon 888Plus|8|3.7| -|Xiaomi 12 Pro|Android 13|snapdragon 8 Gen1|8+3|3.7| -|Xiaomi Redmi K40|Android 11|snapdragon 870|8|3.5| -|Oneplus LE 2100|Android 13|snapdragon 870|12|3.5| -|Oneplus HD1900|Android 11|snapdragon 865|8|3.2| -|Oneplus HD1900|Android 11|snapdragon 855|8|3.0| -|Oneplus HD1905|Android 10|snapdragon 855|8|3.0| -|Oneplus HD1900|Android 11|snapdragon 855|8|3.0| -|Xiaomi MI 8|Android 9|snapdragon 845|6|2.3| -|Huawei Nova 11SE|HarmonyOS 4.0.0|snapdragon 778|12|1.9| -|Xiaomi MIX 2|Android 9|snapdragon 835|6|1.3| -|iPhone 15 Pro|iOS 17.2.1|A17 pro|8|18.0| -|iPhone 15|iOS 17.2.1|A16|6|15.0| -|iPhone 12 Pro|iOS 16.5.1|A14|6|5.8| -|iPhone 12|iOS 17.2.1|A14|4|5.8| -|iPhone 11|iOS 16.6|A13|4|4.6| -|Xiaomi Redmi K50|HyperOS 1.0.2|MediaTek Dimensity 8100|12|3.5 - -* 我们也使用MLC-LLM验证了在手机上部署MiniCPM-V系列模型的可行性,能够正常输入输出,但也存在图片处理时间较长的问题,需要进一步优化,兼容性问题也需要进一步解决。下面的动图是使用小米14 Pro运行MiniCPM-V 2.0的屏幕录像,没有进行任何编辑。 - - -

- - -

-
- - -

- -## Demo & API 部署 +- [Colab](https://colab.research.google.com/drive/1tJcfPyWGWA5HezO7GKLeyeIso0HyOc0l?usp=sharing) #### 基于Gradio的网页版Demo @@ -946,87 +756,87 @@ python demo/vllm_based_demo.py --model_path python demo/hf_based_demo.py --model_path ``` -

+#### HuggingFace 推理 -## 二次开发 -

+##### MiniCPM-2B +* 安装`transformers>=4.36.0`以及`accelerate`后,运行以下代码 +```python +from transformers import AutoModelForCausalLM, AutoTokenizer +import torch +torch.manual_seed(0) -* 高效参数微调 - * 一张1080/2080可实现高效参数微调 - * [高效参数微调代码](https://github.com/OpenBMB/MiniCPM/tree/main/finetune) -

+path = 'openbmb/MiniCPM-2B-dpo-bf16' +tokenizer = AutoTokenizer.from_pretrained(path) +model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True) -* 全参数微调 or 持续训练 - * 使用[BMTrain](https://github.com/OpenBMB/BMTrain),借助重计算和ZeRO-3,一张3090/4090可实现全参数微调,一台机器可实现持续训练 - * 相关代码也将陆续推出 -

+responds, history = model.chat(tokenizer, "山东省最高的山是哪座山, 它比黄山高还是矮?差距多少?", temperature=0.5, top_p=0.8, repetition_penalty=1.02) +print(responds) +``` -* mlx高效参数微调 - * 环境准备 - ```shell - pip install -r finetune/requirements_mlx.txt - ``` - * 微调命令 - ```shell - # train - python mlx_finetune.py --model MiniCPM-2B-sft-bf16-llama-format-mlx --data data/AdvertiseGen --train --seed 2024 --iters 500 - # test - python mlx_finetune.py --model MiniCPM-2B-sft-bf16-llama-format-mlx --data data/AdvertiseGen --test --seed 2024 - ``` -* [llama_factory微调](https://github.com/OpenBMB/MiniCPM/tree/main/finetune/llama_factory_example/README.md) +* 期望输出 +```shell +山东省最高的山是泰山,海拔1545米。 -

+相对于黄山(海拔1864米),泰山海拔较低,相差约319米。 +``` -## 典型示例 +##### MiniCPM-2B (Llama Format) +我们将MiniCPM的模型权重转化成了Llama代码可以直接调用的[格式](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16-llama-format),以便大家尝试: +```python +import torch +from transformers import LlamaTokenizerFast, LlamaForCausalLM +model_path = "openbmb/MiniCPM-2B-dpo-bf16-llama-format" +tokenizer = LlamaTokenizerFast.from_pretrained(model_path) +model = LlamaForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True) -#### 文本生成 +prompt="Now you act like a terminal situated within a beginner's C++ practice repository folder, please provide the output for the command: `ls -l`" +input_ids = tokenizer.encode("<用户>{}".format(prompt), return_tensors='pt', add_special_tokens=True).cuda() +responds = model.generate(input_ids, temperature=0.3, top_p=0.8, repetition_penalty=1.02, max_length=1024) +responds = tokenizer.decode(responds[0], skip_special_tokens=True) +print(responds) +``` -![内容创作-case1](./assets/creation.case1.png) +#### vLLM 推理 -![内容创作-case2](./assets/creation.case2.png) +* 安装[vLLM](https://github.com/vllm-project/vllm) +```shell +pip install "vllm>=0.4.1" +``` -![内容创作-case3](./assets/creation.case3.png) +* 测试样例 +```shell +python inference/inference_vllm.py --model_path --prompt_path prompts/prompt_demo.txt +``` -#### 代码生成 +* 期望输出 +```shell +<用户>: Which city is the capital of China? +: + The capital city of China is Beijing. Beijing is a major political, cultural, and economic center in China, and it is known for its rich history, beautiful architecture, and vibrant nightlife. It is also home to many of China's most important cultural and historical sites, including the Forbidden City, the Great Wall of China, and the Temple of Heaven. Beijing is a popular destination for tourists from around the world, and it is an important hub for international business and trade. +``` -![代码生成-case1](./assets/code.case1.gif) +#### llama.cpp、Ollama、fastllm、mlx_lm推理 +MiniCPM支持[llama.cpp](https://github.com/ggerganov/llama.cpp/) 、[ollama](https://github.com/ollama/ollama)、[fastllm](https://github.com/ztxz16/fastllm)、[mlx_lm](https://github.com/ml-explore/mlx-examples)推理。感谢[@runfuture](https://github.com/runfuture)对llama.cpp和ollama的适配。 -![代码生成-case2](./assets/code.case2.gif) +请参考 MiniCPM 知识库中的[量化指南](https://modelbest.feishu.cn/wiki/EatbwdLuvitbbMk2X5wcX6h5n7c)。 -#### 数理逻辑 +#### 模型微调 -![数理逻辑-case1](./assets/math.case1.png) +- 一张 1080/2080 可实现高效参数微调:[代码](https://github.com/OpenBMB/MiniCPM/tree/main/finetune) +- mlx 微调:[教程](https://modelbest.feishu.cn/wiki/AIU3wbREcirOm9kkvd7cxujFnMb#share-ASrDdvFAloHtycxfy85cLNhAnd3) +- [xtuner](https://github.com/InternLM/xtuner): [MiniCPM高效率微调的不二选择](https://modelbest.feishu.cn/wiki/AIU3wbREcirOm9kkvd7cxujFnMb#AMdXdzz8qoadZhxU4EucELWznzd) +- [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory.git):[MiniCPM微调一键式解决方案](https://modelbest.feishu.cn/wiki/AIU3wbREcirOm9kkvd7cxujFnMb#BAWrdSjXuoFvX4xuIuzc8Amln5E) -![数理逻辑-case1](./assets/math.case2.png) +
-#### 文本翻译 - -![文本翻译-case1](./assets/translation.case1.png) - -![文本翻译-case2](./assets/translation.case2.png) - -#### 指令跟随 - -![指令跟随-case1](./assets/instruction_following.case1.png) - -![指令跟随-case1](./assets/instruction_following.case2.png) - -#### 特殊字符 - -![特殊字符-case1](./assets/special_char.case1.png) - -![特殊字符-case2](./assets/special_char.case2.png) - - -

## 开源协议 #### 模型协议 * 本仓库中代码依照 [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) 协议开源 -* MiniCPM 模型权重的使用则需要遵循 [“MiniCPM模型商用许可协议.md”](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%E6%A8%A1%E5%9E%8B%E5%95%86%E7%94%A8%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE.md)。 -* MiniCPM 模型权重对学术研究完全开放,在填写[“问卷”](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g)进行登记后亦允许免费商业使用。 +* MiniCPM 模型权重的使用则需要遵循 [MiniCPM 模型商用许可协议](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%E6%A8%A1%E5%9E%8B%E5%95%86%E7%94%A8%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE.md)。 +* MiniCPM 模型权重对学术研究完全开放,在填写[问卷](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g)进行登记后亦允许免费商业使用。 #### 声明 @@ -1034,7 +844,12 @@ python demo/hf_based_demo.py --model_path * 因此用户在使用 MiniCPM 生成的内容时,应自行负责对其进行评估和验证。 * 如果由于使用 MiniCPM 开源模型而导致的任何问题,包括但不限于数据安全问题、公共舆论风险,或模型被误导、滥用、传播或不当利用所带来的任何风险和问题,我们将不承担任何责任。 -

+## 开发机构 + +本项目由以下机构共同开发: + +- [面壁智能](https://modelbest.cn/) +- [清华大学自然语言处理实验室](https://nlp.csai.tsinghua.edu.cn/) ## 工作引用 diff --git a/assets/COCO_test2015_000000262144.jpg b/assets/COCO_test2015_000000262144.jpg deleted file mode 100644 index 012f88d..0000000 Binary files a/assets/COCO_test2015_000000262144.jpg and /dev/null differ diff --git a/assets/code.case1.gif b/assets/code.case1.gif deleted file mode 100644 index 2218ed6..0000000 Binary files a/assets/code.case1.gif and /dev/null differ diff --git a/assets/code.case2.gif b/assets/code.case2.gif deleted file mode 100644 index f0a036c..0000000 Binary files a/assets/code.case2.gif and /dev/null differ diff --git a/assets/code_interpreter.gif b/assets/code_interpreter.gif new file mode 100644 index 0000000..98d72d1 Binary files /dev/null and b/assets/code_interpreter.gif differ diff --git a/assets/creation.case1.png b/assets/creation.case1.png deleted file mode 100644 index 3f6d1aa..0000000 Binary files a/assets/creation.case1.png and /dev/null differ diff --git a/assets/creation.case2.png b/assets/creation.case2.png deleted file mode 100644 index e4a7b1c..0000000 Binary files a/assets/creation.case2.png and /dev/null differ diff --git a/assets/creation.case3.png b/assets/creation.case3.png deleted file mode 100644 index 08e1eba..0000000 Binary files a/assets/creation.case3.png and /dev/null differ diff --git a/assets/en.code.case1.gif b/assets/en.code.case1.gif deleted file mode 100644 index 6c6e04f..0000000 Binary files a/assets/en.code.case1.gif and /dev/null differ diff --git a/assets/en.creation.case1.png b/assets/en.creation.case1.png deleted file mode 100644 index 2c390cc..0000000 Binary files a/assets/en.creation.case1.png and /dev/null differ diff --git a/assets/en.creation.case2.png b/assets/en.creation.case2.png deleted file mode 100644 index 08b72f3..0000000 Binary files a/assets/en.creation.case2.png and /dev/null differ diff --git a/assets/en.instruction_following.case1.png b/assets/en.instruction_following.case1.png deleted file mode 100644 index 69a6484..0000000 Binary files a/assets/en.instruction_following.case1.png and /dev/null differ diff --git a/assets/en.math.case1.png b/assets/en.math.case1.png deleted file mode 100644 index 4f6a0fc..0000000 Binary files a/assets/en.math.case1.png and /dev/null differ diff --git a/assets/en.math.case2.png b/assets/en.math.case2.png deleted file mode 100644 index 908885d..0000000 Binary files a/assets/en.math.case2.png and /dev/null differ diff --git a/assets/en.special_char.case1.png b/assets/en.special_char.case1.png deleted file mode 100644 index 5d80129..0000000 Binary files a/assets/en.special_char.case1.png and /dev/null differ diff --git a/assets/en.special_char.case2.png b/assets/en.special_char.case2.png deleted file mode 100644 index bd7a50e..0000000 Binary files a/assets/en.special_char.case2.png and /dev/null differ diff --git a/assets/en.translation.case1.png b/assets/en.translation.case1.png deleted file mode 100644 index adaffb4..0000000 Binary files a/assets/en.translation.case1.png and /dev/null differ diff --git a/assets/eval_needle.jpeg b/assets/eval_needle.jpeg new file mode 100644 index 0000000..cfb7e5f Binary files /dev/null and b/assets/eval_needle.jpeg differ diff --git a/assets/instruction_following.case1.png b/assets/instruction_following.case1.png deleted file mode 100644 index 43229e9..0000000 Binary files a/assets/instruction_following.case1.png and /dev/null differ diff --git a/assets/instruction_following.case2.png 
b/assets/instruction_following.case2.png deleted file mode 100644 index 8f3c146..0000000 Binary files a/assets/instruction_following.case2.png and /dev/null differ diff --git a/assets/knowledge.case1.png b/assets/knowledge.case1.png deleted file mode 100644 index bbe6f7b..0000000 Binary files a/assets/knowledge.case1.png and /dev/null differ diff --git a/assets/math.case1.png b/assets/math.case1.png deleted file mode 100644 index 617f0b4..0000000 Binary files a/assets/math.case1.png and /dev/null differ diff --git a/assets/math.case2.png b/assets/math.case2.png deleted file mode 100644 index fab8b54..0000000 Binary files a/assets/math.case2.png and /dev/null differ diff --git a/assets/minicpm_logo.png b/assets/minicpm_logo.png new file mode 100644 index 0000000..3da1191 Binary files /dev/null and b/assets/minicpm_logo.png differ diff --git a/assets/modelbest.png b/assets/modelbest.png new file mode 100644 index 0000000..c5d0b86 Binary files /dev/null and b/assets/modelbest.png differ diff --git a/assets/special_char.case1.png b/assets/special_char.case1.png deleted file mode 100644 index 7f51de1..0000000 Binary files a/assets/special_char.case1.png and /dev/null differ diff --git a/assets/special_char.case2.png b/assets/special_char.case2.png deleted file mode 100644 index 43a5d6c..0000000 Binary files a/assets/special_char.case2.png and /dev/null differ diff --git a/assets/thunlp.png b/assets/thunlp.png new file mode 100644 index 0000000..85f5128 Binary files /dev/null and b/assets/thunlp.png differ diff --git a/assets/translation.case1.png b/assets/translation.case1.png deleted file mode 100644 index c2a23af..0000000 Binary files a/assets/translation.case1.png and /dev/null differ diff --git a/assets/translation.case2.png b/assets/translation.case2.png deleted file mode 100644 index 95dee81..0000000 Binary files a/assets/translation.case2.png and /dev/null differ diff --git a/demo/code_interpreter.py b/demo/code_interpreter.py new file mode 100644 index 0000000..0cea893 --- /dev/null +++ b/demo/code_interpreter.py @@ -0,0 +1,186 @@ +import contextlib +import io +import json +import os +import re +import sys +import traceback + +import fire +from vllm import LLM, SamplingParams + +max_turns = 5 +system_prompt_template = """You are an AI Agent who is proficient in solve complicated task. +Each step you should wirte executable code to fulfill user query. Any Response without code means the task is completed and you do not have another chance to submit code + +You are equipped with a codeinterpreter. You can give the code and get the execution result of your code. You should use the codeinterpreter in the following format: +<|execute_start|> +```python + + + +``` +<|execute_end|> + + +WARNING:Do not use cv2.waitKey(0) cv2.destroyAllWindows()!!! Or the program will be destoried + +Each round, your answer should ALWAYS use the following format(Each of your response should contain code, until you complete the task): + + +Analyse:(Analyse the message you received and plan what you should do) + +This Step Todo: One Subtask need to be done at this step + +Code(WARNING:MAKE SURE YOU CODE FOLLOW THE FORMAT AND WRITE CODE OR THE TASK WILL BE FAILED): +<|execute_start|> +```python + + + + +``` +<|execute_end|> + + +You will got the result of your code after each step. 
When the code of previous subtask is excuted successfully, you can write and excuet the code for next subtask +When all the code your write are executed and you got the code result that can fulfill the user query, you should summarize the previous analyse process and make a formal response to user, The response should follow this format: +WARNING:MAKE SURE YOU GET THE CODE EXECUTED RESULT THAT FULFILLED ALL REQUIREMENT OF USER BEFORE USE "Finished" +Finished: + +Some notice: +1. When you want to draw a plot, use plt.savefig() and print the image path in markdown format instead of plt.show() +2. Save anything to ./output folder +3. End the process whenever you complete the task, When you do not have Action(Code), Use: Finished: +4. Do not ask for user input in your python code. +""" + +def execute_code(code): + + stdout_capture = io.StringIO() + stderr_capture = io.StringIO() + + # Note here we simplely imitate notebook output. + # if you want to run more complex tasks, try to use nbclient to run python code + lines = code.strip().split('\n') + last_expr = lines[-1].strip() + + if '=' in last_expr: + value = last_expr.split('=')[0].strip() + code += f"\nprint({value})" + + with contextlib.redirect_stdout(stdout_capture), contextlib.redirect_stderr(stderr_capture): + try: + # execute code here + exec(code) + except Exception as e: + return {'output': stdout_capture.getvalue(), 'error': str(e)} + + return {'output': stdout_capture.getvalue(), 'error': stderr_capture.getvalue()} + +class DemoLLM: + def __init__(self, model_path): + # Initialize default sampling parameters + params_dict = { + "n": 1, + "best_of": None, + "presence_penalty": 0.0, + "frequency_penalty": 0.0, + "repetition_penalty": 1.02, + "temperature": 1.0, + "top_p": 0.85, + "top_k": -1, + "use_beam_search": False, + "length_penalty": 1.0, + "early_stopping": False, + "stop": None, + "stop_token_ids": None, + "ignore_eos": False, + "max_tokens": 300, + "logprobs": None, + "prompt_logprobs": None, + "skip_special_tokens": True, + } + + # Create a SamplingParams object + self.sampling_params = SamplingParams(**params_dict) + + # Initialize the language model + self.llm = LLM( + model=model_path, + tensor_parallel_size=1, + trust_remote_code=True, + enforce_eager=True + ) + + def apply_template(self, messages): + """Formats messages into a prompt string for the LLM.""" + formatted_messages = [ + f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n" + for msg in messages + ] + formatted_messages.append("<|im_start|>assistant\n") + return ''.join(formatted_messages) + + def generate(self, messages): + """Generates a response from the LLM based on the input messages.""" + raw_input = self.apply_template(messages) + response = self.llm.generate(raw_input, self.sampling_params) + if response: + return response[0].outputs[0].text + return None + +def extract_code(text): + """ Extracts Python code blocks from the given text. """ + # Define a regular expression pattern to match Python code blocks + pattern = r'```python\s+(.*?)\s+```' + matches = re.findall(pattern, text, re.DOTALL) + + return matches + +def process(model_path): + """ + Processes interactions with the DemoLLM using provided model path. + + Args: + model_path (str): The path to the language model directory. 
+    """
+
+    # Initialize the language model
+    llm = DemoLLM(model_path)
+
+    # Define initial messages
+    messages = [
+        {"role": "system", "content": system_prompt_template},
+        {"role": "user", "content": "What is 2 to the power of 100?"},
+    ]
+
+    for index in range(max_turns):
+        print(f"Turn {index+1} start...")
+
+        # Generate response from the LLM
+        raw_resp = llm.generate(messages)
+        print(f"Raw response: {raw_resp}")
+
+        # Check if the response contains the termination keyword
+        if "Finished" in raw_resp:
+            break
+
+        # Extract code from the raw response
+        code_list = extract_code(raw_resp)
+
+        if not code_list:
+            break
+
+        # Execute the extracted code
+        code_str = code_list[-1]
+        run_result = execute_code(code_str)
+        executor_response = run_result['output'] if run_result['error'] == "" else run_result['error']
+        print(f"Code execution result: {run_result}")
+
+        # Append the execution result to the messages
+        messages.append({"role": "user", "content": executor_response})
+
+
+if __name__ == "__main__":
+    fire.Fire(process)
\ No newline at end of file
diff --git a/demo/function_calling.py b/demo/function_calling.py
new file mode 100644
index 0000000..cfab613
--- /dev/null
+++ b/demo/function_calling.py
@@ -0,0 +1,120 @@
+#!/usr/bin/env python
+# encoding: utf-8
+from transformers import AutoTokenizer
+from vllm import LLM, SamplingParams
+import json
+
+model_path = "openbmb/MiniCPM3-4B"
+
+tools = [
+    {
+        "type": "function",
+        "function": {
+            "name": "get_delivery_date",
+            "description": "Get the delivery date for a customer's order. Call this whenever you need to know the delivery date, for example when a customer asks 'Where is my package'",
+            "parameters": {
+                "type": "object",
+                "properties": {
+                    "order_id": {
+                        "type": "string",
+                        "description": "The customer's order ID.",
+                    },
+                },
+                "required": ["order_id"],
+                "additionalProperties": False,
+            },
+        },
+    }
+]
+messages = [
+    {
+        "role": "system",
+        "content": "You are a helpful customer support assistant. Use the supplied tools to assist the user.",
+    },
+    {
+        "role": "user",
+        "content": "Hi, can you tell me the delivery date for my order? The order id is 1234 and 4321.",
+    },
+    # {
+    #     "content": "",
+    #     "tool_calls": [
+    #         {
+    #             "type": "function",
+    #             "function": {
+    #                 "name": "get_delivery_date",
+    #                 "arguments": {"order_id": "1234"},
+    #             },
+    #             "id": "call_b4ab0b4ec4b5442e86f017fe0385e22e",
+    #         },
+    #         {
+    #             "type": "function",
+    #             "function": {
+    #                 "name": "get_delivery_date",
+    #                 "arguments": {"order_id": "4321"},
+    #             },
+    #             "id": "call_628965479dd84794bbb72ab9bdda0c39",
+    #         },
+    #     ],
+    #     "role": "assistant",
+    # },
+    # {
+    #     "role": "tool",
+    #     "content": '{"delivery_date": "2024-09-05", "order_id": "1234"}',
+    #     "tool_call_id": "call_b4ab0b4ec4b5442e86f017fe0385e22e",
+    # },
+    # {
+    #     "role": "tool",
+    #     "content": '{"delivery_date": "2024-09-05", "order_id": "4321"}',
+    #     "tool_call_id": "call_628965479dd84794bbb72ab9bdda0c39",
+    # },
+    # {
+    #     "content": "Both your orders will be delivered on 2024-09-05.",
+    #     "role": "assistant",
+    #     "thought": "\nI have the information you need, both orders will be delivered on the same date, 2024-09-05.\n",
+    # },
+]
+tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+prompt = tokenizer.apply_chat_template(
+    messages, tools=tools, tokenize=False, add_generation_prompt=True
+)
+llm = LLM(model_path, trust_remote_code=True)
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1000)
+
+
+def fake_tool_execute(toolcall):
+    data = {
+        "delivery_date": "2024-09-05",
+        "order_id": toolcall.get("function", {})
+        .get("arguments", {})
+        .get("order_id", "order_id"),
+    }
+    return json.dumps(data)
+
+
+while True:
+    prompt = tokenizer.apply_chat_template(
+        messages, tools=tools, tokenize=False, add_generation_prompt=True
+    )
+    outputs = llm.generate([prompt], sampling_params)
+    response = outputs[0].outputs[0].text
+    msg = tokenizer.decode_function_call(response)
+    if (
+        "tool_calls" in msg
+        and msg["tool_calls"] is not None
+        and len(msg["tool_calls"]) > 0
+    ):
+        messages.append(msg)
+        print(msg)
+        for toolcall in msg["tool_calls"]:
+            tool_response = fake_tool_execute(toolcall)
+            tool_msg = {
+                "role": "tool",
+                "content": tool_response,
+                "tool_call_id": toolcall["id"],
+            }
+            messages.append(tool_msg)
+            print(tool_msg)
+    else:
+        messages.append(msg)
+        print(msg)
+        break
diff --git a/inference/inference_vllm.py b/inference/inference_vllm.py
deleted file mode 100644
index d9c1094..0000000
--- a/inference/inference_vllm.py
+++ /dev/null
@@ -1,58 +0,0 @@
-from vllm import LLM, SamplingParams
-import argparse
-
-parser = argparse.ArgumentParser()
-
-parser.add_argument("--model_path", type=str, default="")
-parser.add_argument("--prompt_path", type=str, default="")
-
-
-args = parser.parse_args()
-
-with open(args.prompt_path, "r") as f:
-    prompts = f.readlines()
-
-prompt_template = "<用户>{}<AI>"
-
-prompts = [prompt_template.format(prompt.strip()) for prompt in prompts]
-
-params_dict = {
-    "n": 1,
-    "best_of": 1,
-    "presence_penalty": 1.0,
-    "frequency_penalty": 0.0,
-    "temperature": 0.5,
-    "top_p": 0.8,
-    "top_k": -1,
-    "use_beam_search": False,
-    "length_penalty": 1,
-    "early_stopping": False,
-    "stop": None,
-    "stop_token_ids": None,
-    "ignore_eos": False,
-    "max_tokens": 1000,
-    "logprobs": None,
-    "prompt_logprobs": None,
-    "skip_special_tokens": True,
-}
-
-# Create a sampling params object.
-sampling_params = SamplingParams(**params_dict)
-
-# Create an LLM.
-llm = LLM(model=args.model_path, tensor_parallel_size=1, dtype='bfloat16')
-# Generate texts from the prompts. The output is a list of RequestOutput objects
-# that contain the prompt, generated text, and other information.
-for prompt in prompts:
-    outputs = llm.generate(prompt, sampling_params)
-    # Print the outputs.
-    for output in outputs:
-        prompt = output.prompt
-        generated_text = output.outputs[0].text
-        print("================")
-        # find the first <用户> and remove the text before it.
-        clean_prompt = prompt[prompt.find("<用户>")+len("<用户>"):]
-
-        print(f"""<用户>: {clean_prompt.replace("<AI>", "")}""")
-        print(f"<AI>:")
-        print(generated_text)
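The deleted `inference/inference_vllm.py` above hard-coded the MiniCPM-2B `<用户>{}<AI>` prompt format. With MiniCPM3-4B, the tokenizer's chat template can build the prompt instead. Below is a minimal sketch of plain (non-tool) chat inference with vLLM, using only the APIs already shown in `demo/function_calling.py`; the query string and sampling values are illustrative assumptions, not project defaults.

```python
# Minimal sketch: plain chat inference with MiniCPM3-4B via vLLM.
# Assumes vLLM and transformers are installed; the query and sampling values are illustrative.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "openbmb/MiniCPM3-4B"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
llm = LLM(model_path, trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1000)

messages = [{"role": "user", "content": "What is 2 to the power of 100?"}]

# Let the chat template build the prompt instead of hard-coding special tokens.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```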