diff --git a/README-en.md b/README-en.md
index 49cfdd8..318c9c6 100644
--- a/README-en.md
+++ b/README-en.md
@@ -19,6 +19,7 @@ Join our discord and
 
 ## Changelog🔥
 - [2024.09.05] We release [**MiniCPM3-4B**](https://huggingface.co/openbmb/MiniCPM3-4B)! This model outperforms Phi-3.5-mini-instruct and GPT-3.5-Turbo-0125 and is comparable to several models with 7B-9B parameters like Llama3.1-8B-Instruct, Qwen2-7B-Instruct, and GLM-4-9B-Chat.
+- [2024.07.09] MiniCPM-2B is now supported by [SGLang](#sglang-inference)!
 - [2024.07.05] Released [MiniCPM-S-1B](https://huggingface.co/openbmb/MiniCPM-S-1B-sft)! This model achieves an average sparsity of 87.89% in the FFN layer, reducing FFN FLOPs by 84%, while maintaining downstream task performance.
 - [2024.04.11] Released [MiniCPM-2B-128k](https://huggingface.co/openbmb/MiniCPM-2B-128k), [MiniCPM-MoE-8x2B](https://huggingface.co/openbmb/MiniCPM-MoE-8x2B) and [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16)! Click [here](https://openbmb.vercel.app/) to read our technical blog.
 - [2024.03.16] Intermediate checkpoints of MiniCPM-2B were released [here](https://huggingface.co/openbmb/MiniCPM-2B-history)!
@@ -787,9 +788,10 @@ python demo/hf_based_demo.py --model_path
 ```
 #### Huggingface Inferene
+
 ##### MiniCPM-2B
 
-* Install `transformers>=4.36.0` and `accelerate`,run the following python code.
+Install `transformers>=4.36.0` and `accelerate`, then run the following Python code:
 
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch
@@ -805,7 +807,9 @@ print(responds)
 ```
 
 ##### MiniCPM-2B (Llama Format)
+
 To facilitate ease of use, we have converted the model weights of MiniCPM to adapt to the structure of the LLaMA model:
+
 ```python
 import torch
 from transformers import LlamaTokenizerFast, LlamaForCausalLM
@@ -820,7 +824,8 @@ responses = tokenizer.decode(responses[0], skip_special_tokens=True)
 print(responses)
 ```
 
-#### vLLM
+#### vLLM Inference
+
 Install [vLLM](https://github.com/vllm-project/vllm).
 
 ```shell
@@ -829,6 +834,35 @@ pip install "vllm>=0.4.1"
 ```
 
 See [here](#vllm) for the inference code.
 
+#### SGLang Inference
+
+Install [SGLang](https://github.com/sgl-project/sglang).
+
+* First, start a server:
+
+```bash
+python -m sglang.launch_server --model-path openbmb/MiniCPM-2B-dpo-fp16 --trust-remote-code --port 30000
+```
+
+* You can then use it for inference as shown below:
+
+```python
+from sglang import function, gen, set_default_backend, RuntimeEndpoint
+
+@function
+def text_qa(s, question):
+    s += "<用户>" + question + "<AI>"
+    s += gen("answer", max_tokens=1024, temperature=0.7, top_p=0.7)
+
+set_default_backend(RuntimeEndpoint("http://localhost:30000"))
+
+state = text_qa.run(
+    question="What is the capital of China?",
+)
+
+print(state["answer"])
+```
+
 #### llama.cpp, Ollama, fastllm, mlx_lm Inference
 We have supported inference with [llama.cpp](https://github.com/ggerganov/llama.cpp/), [ollama](https://github.com/ollama/ollama), [fastllm](https://github.com/ztxz16/fastllm), [mlx_lm](https://github.com/ml-explore/mlx-examples). Thanks to [@runfuture](https://github.com/runfuture) for the adaptation of llama.cpp and ollama.
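The vLLM section above only links to the inference code elsewhere in the README. For quick reference, a minimal sketch of offline inference with vLLM's `LLM` API might look like the following; the checkpoint name, `<用户>…<AI>` chat markers, and sampling settings simply mirror the SGLang example above and are assumptions, not the README's exact snippet.

```python
# Minimal vLLM sketch (assumes vllm>=0.4.1 and the openbmb/MiniCPM-2B-dpo-fp16 checkpoint;
# the repository's own snippet lives in the README's #vllm section).
from vllm import LLM, SamplingParams

# MiniCPM-2B is a custom architecture, so trust_remote_code is required.
llm = LLM(model="openbmb/MiniCPM-2B-dpo-fp16", trust_remote_code=True)
params = SamplingParams(temperature=0.7, top_p=0.7, max_tokens=1024)

# MiniCPM-2B's chat format wraps the user turn in <用户> ... <AI>.
prompt = "<用户>What is the capital of China?<AI>"
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```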
diff --git a/README.md b/README.md
index 8829c01..3b63797 100644
--- a/README.md
+++ b/README.md
@@ -21,7 +21,7 @@
 
 ## 更新日志🔥
 - [2024.09.05] 发布 [**MiniCPM3-4B**](https://huggingface.co/openbmb/MiniCPM3-4B)!该模型的表现超越 Phi-3.5-mini-instruct 和 GPT-3.5-Turbo-0125,并且能够比肩 Llama3.1-8B-Instruct、Qwen2-7B-Instruct、GLM-4-9B-Chat 等多个 7B-9B 参数量的模型。
-- [2024.07.09] MiniCPM-2B 已经支持使用 [SGLang](https://github.com/sgl-project/sglang) 推理!
+- [2024.07.09] MiniCPM-2B 已经支持使用 [SGLang](#sglang-推理) 推理!
 - [2024.07.05] 发布 [MiniCPM-S-1B](https://huggingface.co/openbmb/MiniCPM-S-1B-sft)!该模型在保持下游任务性能无损的前提下,FFN 层实现了 87.89% 的平均稀疏度,将 FFN FLOPs 降低了 84%。
 - [2024.04.11] 发布 [MiniCPM-2B-128k](https://huggingface.co/openbmb/MiniCPM-2B-128k)、[MiniCPM-MoE-8x2B](https://huggingface.co/openbmb/MiniCPM-MoE-8x2B) 和 [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16)!点击[这里](https://openbmb.vercel.app/?category=Chinese+Blog)查看技术博客。
 - [2024.03.16] MiniCPM-2B 的 30 余个中间检查点开放了![HuggingFace链接](https://huggingface.co/openbmb/MiniCPM-2B-history)
@@ -793,7 +793,9 @@ python demo/hf_based_demo.py --model_path
 ```
 #### HuggingFace 推理
 ##### MiniCPM-2B
-* 安装`transformers>=4.36.0`以及`accelerate`后,运行以下代码
+
+安装`transformers>=4.36.0`以及`accelerate`后,运行以下代码:
+
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch
@@ -808,7 +810,9 @@ print(responds)
 ```
 
 ##### MiniCPM-2B (Llama Format)
+
 我们将MiniCPM的模型权重转化成了Llama代码可以直接调用的[格式](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16-llama-format),以便大家尝试:
+
 ```python
 import torch
 from transformers import LlamaTokenizerFast, LlamaForCausalLM
@@ -825,13 +829,43 @@ print(responds)
 ```
 
 #### vLLM 推理
 
-* 安装[vLLM](https://github.com/vllm-project/vllm)
+安装 [vLLM](https://github.com/vllm-project/vllm)。
+
 ```shell
 pip install "vllm>=0.4.1"
 ```
 
 具体推理代码见[这里](#vllm)。
 
+#### SGLang 推理
+
+安装 [SGLang](https://github.com/sgl-project/sglang)。
+
+* 首先需要启动一个服务:
+
+```bash
+python -m sglang.launch_server --model-path openbmb/MiniCPM-2B-dpo-fp16 --trust-remote-code --port 30000
+```
+
+* 下面是一个推理代码的样例:
+
+```python
+from sglang import function, gen, set_default_backend, RuntimeEndpoint
+
+@function
+def text_qa(s, question):
+    s += "<用户>" + question + "<AI>"
+    s += gen("answer", max_tokens=1024, temperature=0.7, top_p=0.7)
+
+set_default_backend(RuntimeEndpoint("http://localhost:30000"))
+
+state = text_qa.run(
+    question="What is the capital of China?",
+)
+
+print(state["answer"])
+```
+
 #### llama.cpp、Ollama、fastllm、mlx_lm推理
 MiniCPM支持[llama.cpp](https://github.com/ggerganov/llama.cpp/) 、[ollama](https://github.com/ollama/ollama)、[fastllm](https://github.com/ztxz16/fastllm)、[mlx_lm](https://github.com/ml-explore/mlx-examples)推理。感谢[@runfuture](https://github.com/runfuture)对llama.cpp和ollama的适配。
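Both READMEs drive the SGLang server through the `sglang` frontend DSL. If you only want to smoke-test the server started by the launch command above, a plain HTTP request is enough; the sketch below assumes SGLang's native `/generate` endpoint on the default port 30000 from that command, so check your SGLang version's documentation if the request schema differs.

```python
# Querying the SGLang server directly over HTTP, without the sglang frontend DSL.
# Assumes the server launched above is listening on port 30000 and exposes the
# native /generate endpoint (verify against your SGLang version's docs).
import requests

# Same <用户> ... <AI> chat format as the README examples.
prompt = "<用户>What is the capital of China?<AI>"

resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": prompt,
        "sampling_params": {"max_new_tokens": 1024, "temperature": 0.7, "top_p": 0.7},
    },
)
resp.raise_for_status()
print(resp.json()["text"])  # generated continuation
```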