support fastllm

xuhaifeng 2024-03-01 11:20:24 +08:00
parent b3358343cb
commit ab25fc9bb1
2 changed files with 34 additions and 4 deletions


@@ -195,8 +195,8 @@ python inference.py --model_path <vllmcpm_repo_path> --prompt_path prompts/promp
```
#### llama.cpp and Ollama Inference
We have supported inference with [llama.cpp](https://github.com/ggerganov/llama.cpp/) and [ollama](https://github.com/ollama/ollama).
#### llama.cpp, Ollama, and fastllm Inference
We have supported inference with [llama.cpp](https://github.com/ggerganov/llama.cpp/), [ollama](https://github.com/ollama/ollama), and [fastllm](https://github.com/ztxz16/fastllm).
**llama.cpp**
@@ -218,6 +218,21 @@ Solving [this issue](https://github.com/ollama/ollama/issues/2383)
- [ChatLLM](https://github.com/foldl/chatllm.cpp): [Run MiniCPM on CPU](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16/discussions/2#65c59c4f27b8c11e43fc8796)
**fastllm**
1. [Install fastllm](https://github.com/ztxz16/fastllm)
2. Inference
```
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the original Hugging Face model and tokenizer
path = 'openbmb/MiniCPM-2B-dpo-fp16'
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float16, device_map='cuda', trust_remote_code=True)

# Convert the Hugging Face model into a fastllm model
from fastllm_pytools import llm
llm.set_device_map("cpu")  # run the converted model on CPU
model = llm.from_hf(model, tokenizer, dtype="float16")  # dtype supports "float16", "int8", "int4"
print(model.response("<用户>Write an acrostic poem with the word MINICPM (One line per letter)<AI>", top_p=0.8, temperature=0.5, repeat_penalty=1.02))
```
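The converted fastllm model can also stream its reply instead of returning it all at once. The sketch below is hedged and not part of the official instructions: it assumes `fastllm_pytools` exposes a `stream_response` generator that accepts the same sampling arguments as `response` (check the fastllm repository for the exact signature), and it reuses the `model` object created in the snippet above.
```
# Assumes `model` is the fastllm model returned by llm.from_hf(...) above,
# and that stream_response yields text chunks as they are generated (unverified).
prompt = "<用户>Write an acrostic poem with the word MINICPM (One line per letter)<AI>"
for chunk in model.stream_response(prompt, top_p=0.8, temperature=0.5, repeat_penalty=1.02):
    print(chunk, end="", flush=True)
print()
```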
<p id="3"></p>


@@ -202,8 +202,8 @@ python inference.py --model_path <vllmcpm_repo_path> --prompt_path prompts/promp
The capital city of China is Beijing. Beijing is a major political, cultural, and economic center in China, and it is known for its rich history, beautiful architecture, and vibrant nightlife. It is also home to many of China's most important cultural and historical sites, including the Forbidden City, the Great Wall of China, and the Temple of Heaven. Beijing is a popular destination for tourists from around the world, and it is an important hub for international business and trade.
```
#### llama.cpp and Ollama Inference
We have supported inference with [llama.cpp](https://github.com/ggerganov/llama.cpp/) and [ollama](https://github.com/ollama/ollama).
#### llama.cpp, Ollama, and fastllm Inference
We have supported inference with [llama.cpp](https://github.com/ggerganov/llama.cpp/), [ollama](https://github.com/ollama/ollama), and [fastllm](https://github.com/ztxz16/fastllm).
**llama.cpp**
1. [Install llama.cpp](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#build)
@@ -217,6 +217,21 @@ python inference.py --model_path <vllmcpm_repo_path> --prompt_path prompts/promp
**ollama**
Solving [this issue](https://github.com/ollama/ollama/issues/2383)
**fastllm**
1. [Compile and install fastllm](https://github.com/ztxz16/fastllm)
2. Model inference
```
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the original Hugging Face model and tokenizer
path = 'openbmb/MiniCPM-2B-dpo-fp16'
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float16, device_map='cuda', trust_remote_code=True)

# Convert the Hugging Face model into a fastllm model
from fastllm_pytools import llm
llm.set_device_map("cpu")  # run the converted model on CPU
model = llm.from_hf(model, tokenizer, dtype="float16")  # dtype supports "float16", "int8", "int4"
print(model.response("<用户>山东省最高的山是哪座山, 它比黄山高还是矮?差距多少?<AI>", top_p=0.8, temperature=0.5, repeat_penalty=1.02))
```
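Because the comment above notes that `from_hf` also accepts `"int8"` and `"int4"`, quantized inference only requires a different `dtype` argument. The following is a minimal sketch reusing the same calls shown above; the example prompt is an arbitrary placeholder, and the memory/quality trade-off of int4 is not measured here.
```
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from fastllm_pytools import llm

# Load the Hugging Face model, then convert it to an int4-quantized fastllm model
path = 'openbmb/MiniCPM-2B-dpo-fp16'
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float16, device_map='cuda', trust_remote_code=True)

llm.set_device_map("cpu")
model = llm.from_hf(model, tokenizer, dtype="int4")  # int4 reduces memory footprint relative to float16

prompt = "<用户>Hello, please introduce yourself.<AI>"  # any MiniCPM chat-format prompt works
print(model.response(prompt, top_p=0.8, temperature=0.5, repeat_penalty=1.02))
```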
<p id="community"></p>