update readme

zh-zheng 2024-09-07 15:53:06 +08:00
parent bb2e8478c2
commit d20a13200a
2 changed files with 73 additions and 5 deletions


@@ -19,6 +19,7 @@ Join our <a href="https://discord.gg/3cGQn9b3YM" target="_blank">discord</a> and
## Changelog🔥
- [2024.09.05] We release [**MiniCPM3-4B**](https://huggingface.co/openbmb/MiniCPM3-4B)! This model outperforms Phi-3.5-mini-instruct and GPT-3.5-Turbo-0125 and is comparable to several models with 7B-9B parameters like Llama3.1-8B-Instruct, Qwen2-7B-Instruct, and GLM-4-9B-Chat.
- [2024.07.09] MiniCPM-2B has been supported by [SGLang](#sglang-inference)!
- [2024.07.05] Released [MiniCPM-S-1B](https://huggingface.co/openbmb/MiniCPM-S-1B-sft)! This model achieves an average sparsity of 87.89% in the FFN layer, reducing FFN FLOPs by 84%, while maintaining downstream task performance.
- [2024.04.11] Released [MiniCPM-2B-128k](https://huggingface.co/openbmb/MiniCPM-2B-128k), [MiniCPM-MoE-8x2B](https://huggingface.co/openbmb/MiniCPM-MoE-8x2B) and [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16)! Click [here](https://openbmb.vercel.app/) to read our technical blog.
- [2024.03.16] Intermediate checkpoints of MiniCPM-2B were released [here](https://huggingface.co/openbmb/MiniCPM-2B-history)!
@@ -787,9 +788,10 @@ python demo/hf_based_demo.py --model_path <hf_repo_path>
```
#### HuggingFace Inference
##### MiniCPM-2B
Install `transformers>=4.36.0` and `accelerate`, then run the following Python code:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -805,7 +807,9 @@ print(responds)
```
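The diff elides the middle of this snippet. A minimal sketch of what the full call typically looks like, assuming the `openbmb/MiniCPM-2B-dpo-bf16` checkpoint and the `chat()` helper from MiniCPM's remote code (prompt and sampling values are illustrative):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

path = "openbmb/MiniCPM-2B-dpo-bf16"  # assumed checkpoint; other MiniCPM-2B variants load the same way
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.bfloat16, device_map="cuda", trust_remote_code=True
)

# chat() comes from the model's remote code (hence trust_remote_code=True);
# it returns the reply together with the updated conversation history.
responds, history = model.chat(tokenizer, "What is the capital of China?", temperature=0.5, top_p=0.8)
print(responds)
```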
##### MiniCPM-2B (Llama Format)
For ease of use, we have converted the MiniCPM model weights into a [LLaMA-compatible format](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16-llama-format):
```python
import torch
from transformers import LlamaTokenizerFast, LlamaForCausalLM
@@ -820,7 +824,8 @@ responses = tokenizer.decode(responses[0], skip_special_tokens=True)
print(responses)
```
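As above, the diff shows only the edges of this block. A minimal sketch of using the converted Llama-format weights, assuming the checkpoint linked above; the chat markers and sampling values are illustrative and should follow the model card:
```python
import torch
from transformers import LlamaTokenizerFast, LlamaForCausalLM

path = "openbmb/MiniCPM-2B-sft-bf16-llama-format"  # converted weights (see the link above)
tokenizer = LlamaTokenizerFast.from_pretrained(path)
model = LlamaForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map="cuda")

# Plain LlamaForCausalLM has no chat() helper, so apply the chat markers manually.
prompt = "<User>What is the capital of China?<AI>"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
responses = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.7)
responses = tokenizer.decode(responses[0], skip_special_tokens=True)
print(responses)
```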
#### vLLM Inference
Install [vLLM](https://github.com/vllm-project/vllm).
@@ -829,6 +834,35 @@
```shell
pip install "vllm>=0.4.1"
```
See [here](#vllm) for the inference code.
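The referenced section is outside this diff. As a quick reference, here is a minimal offline-inference sketch with vLLM's `LLM` API; the model path, chat markers, and sampling values are assumptions borrowed from the SGLang example below:
```python
from vllm import LLM, SamplingParams

# trust_remote_code is required for MiniCPM's custom architecture
llm = LLM(model="openbmb/MiniCPM-2B-dpo-fp16", trust_remote_code=True)
params = SamplingParams(temperature=0.7, top_p=0.7, max_tokens=1024)

outputs = llm.generate(["<User>What is the capital of China?<AI>"], params)
print(outputs[0].outputs[0].text)
```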
#### SGLang Inference
Install [SGLang](https://github.com/sgl-project/sglang).
* First, start a server:
```bash
python -m sglang.launch_server --model-path openbmb/MiniCPM-2B-dpo-fp16 --trust-remote-code --port 30000
```
* You can use it for inference as shown below:
```python
from sglang import function, gen, set_default_backend, RuntimeEndpoint

# A simple question-answering prompt using the <User>/<AI> chat markers
@function
def text_qa(s, question):
    s += "<User>" + question + "<AI>"
    s += gen("answer", max_tokens=1024, temperature=0.7, top_p=0.7)

# Point the client at the server started above
set_default_backend(RuntimeEndpoint("http://localhost:30000"))

state = text_qa.run(
    question="What is the capital of China?",
)

print(state["answer"])
```
#### llama.cpp, Ollama, fastllm, mlx_lm Inference
Inference with [llama.cpp](https://github.com/ggerganov/llama.cpp/), [ollama](https://github.com/ollama/ollama), [fastllm](https://github.com/ztxz16/fastllm), and [mlx_lm](https://github.com/ml-explore/mlx-examples) is supported. Thanks to [@runfuture](https://github.com/runfuture) for adapting llama.cpp and ollama.
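For llama.cpp in particular, the usual flow is to convert the Hugging Face weights to GGUF and then run the CLI. A rough sketch; the conversion script and binary names depend on the llama.cpp version, and the file paths are hypothetical:
```bash
# Inside a llama.cpp checkout: convert the HF checkpoint to GGUF (paths and output name are placeholders)
python convert_hf_to_gguf.py /path/to/MiniCPM-2B-dpo-fp16 --outfile minicpm-2b.gguf

# Run a single prompt with the converted model
./llama-cli -m minicpm-2b.gguf -p "<User>What is the capital of China?<AI>" -n 256
```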


@@ -21,7 +21,7 @@
## Changelog🔥
- [2024.09.05] Released [**MiniCPM3-4B**](https://huggingface.co/openbmb/MiniCPM3-4B)! The model outperforms Phi-3.5-mini-instruct and GPT-3.5-Turbo-0125 and is comparable to several 7B-9B models such as Llama3.1-8B-Instruct, Qwen2-7B-Instruct, and GLM-4-9B-Chat.
- [2024.07.09] MiniCPM-2B now supports inference with [SGLang](#sglang-inference)!
- [2024.07.05] Released [MiniCPM-S-1B](https://huggingface.co/openbmb/MiniCPM-S-1B-sft)! The model achieves an average sparsity of 87.89% in the FFN layer, reducing FFN FLOPs by 84% while keeping downstream task performance intact.
- [2024.04.11] Released [MiniCPM-2B-128k](https://huggingface.co/openbmb/MiniCPM-2B-128k), [MiniCPM-MoE-8x2B](https://huggingface.co/openbmb/MiniCPM-MoE-8x2B) and [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16)! Click [here](https://openbmb.vercel.app/?category=Chinese+Blog) to read our technical blog.
- [2024.03.16] More than 30 intermediate checkpoints of MiniCPM-2B are now available! See the [HuggingFace link](https://huggingface.co/openbmb/MiniCPM-2B-history).
@@ -793,7 +793,9 @@ python demo/hf_based_demo.py --model_path <hf_repo_path>
#### HuggingFace Inference
##### MiniCPM-2B
Install `transformers>=4.36.0` and `accelerate`, then run the following code:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
@@ -808,7 +810,9 @@ print(responds)
```
##### MiniCPM-2B Llama Format
We have converted the MiniCPM model weights into a [format](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16-llama-format) that the Llama code can load directly, so you can try it out:
```python
import torch
from transformers import LlamaTokenizerFast, LlamaForCausalLM
@@ -825,13 +829,43 @@ print(responds)
#### vLLM Inference
Install [vLLM](https://github.com/vllm-project/vllm).
```shell
pip install "vllm>=0.4.1"
```
See [here](#vllm) for the inference code.
#### SGLang Inference
Install [SGLang](https://github.com/sgl-project/sglang).
* First, start a server:
```bash
python -m sglang.launch_server --model-path openbmb/MiniCPM-2B-dpo-fp16 --trust-remote-code --port 30000
```
* Below is a sample inference snippet:
```python
from sglang import function, gen, set_default_backend, RuntimeEndpoint

@function
def text_qa(s, question):
    s += "<用户>" + question + "<AI>"
    s += gen("answer", max_tokens=1024, temperature=0.7, top_p=0.7)

set_default_backend(RuntimeEndpoint("http://localhost:30000"))

state = text_qa.run(
    question="What is the capital of China?",
)

print(state["answer"])
```
#### llama.cpp, Ollama, fastllm, mlx_lm Inference
MiniCPM supports inference with [llama.cpp](https://github.com/ggerganov/llama.cpp/), [ollama](https://github.com/ollama/ollama), [fastllm](https://github.com/ztxz16/fastllm) and [mlx_lm](https://github.com/ml-explore/mlx-examples). Thanks to [@runfuture](https://github.com/runfuture) for adapting llama.cpp and ollama.