-🎉 Official WeChat account of the Langchain-Chatchat project; scan the QR code to follow.
+🎉 Official WeChat account of the Langchain-Chatchat project; scan the QR code to follow.
\ No newline at end of file
diff --git a/README_en.md b/README_en.md
index 62ca1853..fca63ea9 100644
--- a/README_en.md
+++ b/README_en.md
@@ -8,6 +8,10 @@
An LLM application that implements knowledge-base and search-engine QA based on Langchain and open-source or remote
LLM APIs.
+⚠️ `0.2.10` will be the last release of the `0.2.x` series. The `0.2.x` series will stop receiving updates and technical support,
+and all efforts will go into developing the more broadly applicable `Langchain-Chatchat 0.3.x`.
+
+
---
## Table of Contents
@@ -25,7 +29,8 @@ LLM API.
## Introduction
🤖️ A Q&A application based on local knowledge base implemented using the idea
-of [langchain](https://github.com/hwchase17/langchain). The goal is to build a KBQA(Knowledge based Q&A) solution that
+of [langchain](https://github.com/langchain-ai/langchain). The goal is to build a KBQA (knowledge-based Q&A) solution
+that
is friendly to Chinese scenarios and open source models and can run both offline and online.
💡 Inspired by [document.ai](https://github.com/GanymedeNil/document.ai)
@@ -56,10 +61,9 @@ The main process analysis from the aspect of document process:
🚩 Training and fine-tuning are not part of this project, but one can always improve performance by doing
so.
-🌐 [AutoDL image](registry.cn-beijing.aliyuncs.com/chatchat/chatchat:0.2.5) is supported, and in v13 the codes are update
-to v0.2.9.
+🌐 [AutoDL image](https://www.codewithgpu.com/i/chatchat-space/Langchain-Chatchat/Langchain-Chatchat) is supported; in v13 the code is updated to v0.2.9.
-🐳 [Docker image](registry.cn-beijing.aliyuncs.com/chatchat/chatchat:0.2.7)
+🐳 [Docker image](registry.cn-beijing.aliyuncs.com/chatchat/chatchat:0.2.7) is supported up to version 0.2.7.
## Pain Points Addressed
@@ -99,7 +103,9 @@ $ pip install -r requirements_webui.txt
# The default dependencies cover the basic runtime environment (FAISS vector store). To use vector stores such as milvus/pg_vector, uncomment the corresponding lines in requirements.txt before installing.
```
-Please note that the LangChain-Chachat `0.2.x` series is for the Langchain `0.0.x` series version. If you are using the Langchain `0.1.x` series version, you need to downgrade.
+
+Please note that the Langchain-Chatchat `0.2.x` series targets the Langchain `0.0.x` series. If you are using the
+Langchain `0.1.x` series, you need to downgrade.
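+
+For example, one way to pin a compatible release before installing (a hedged suggestion; any Langchain `0.0.x` release should work): `pip install "langchain<0.1"`.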
### Model Download
@@ -159,15 +165,33 @@ please refer to the [Wiki](https://github.com/chatchat-space/Langchain-Chatchat/
---
+## Project Milestones
+
++ `April 2023`: `Langchain-ChatGLM 0.1.0` released, supporting local knowledge-base question answering based on the
+  ChatGLM-6B model.
++ `August 2023`: `Langchain-ChatGLM` was renamed to `Langchain-Chatchat`, `0.2.0` was released, using `fastchat` as the
+ model loading solution, supporting more models and databases.
++ `October 2023`: `Langchain-Chatchat 0.2.5` was released, introducing Agent support; the open source project won
+  the third prize in the hackathon held by `Founder Park & Zhipu AI & Zilliz`.
++ `December 2023`: the `Langchain-Chatchat` open source project surpassed **20K** stars.
++ `January 2024`: `LangChain 0.1.x` launched and `Langchain-Chatchat 0.2.x` was released. After the stable
+  version `0.2.10` ships, updates and technical support will stop, and all efforts will go into developing the more
+  broadly applicable `Langchain-Chatchat 0.3.x`.
+
+
++ 🔥 Let’s look forward to the Chatchat stories yet to come…
+
+---
+
## Contact Us
### Telegram
[](https://t.me/+RjliQ3jnJ1YyN2E9)
-### WeChat Group、
+### WeChat Group
-
+
### WeChat Official Account
diff --git a/configs/__init__.py b/configs/__init__.py
index e1c21010..e4f5f1d0 100644
--- a/configs/__init__.py
+++ b/configs/__init__.py
@@ -5,4 +5,4 @@ from .server_config import *
from .prompt_config import *
-VERSION = "v0.2.9"
+VERSION = "v0.2.10"
diff --git a/configs/kb_config.py.example b/configs/kb_config.py.example
index 731148a3..23e06bdc 100644
--- a/configs/kb_config.py.example
+++ b/configs/kb_config.py.example
@@ -21,10 +21,9 @@ OVERLAP_SIZE = 50
# Number of matched vectors returned from the knowledge base
VECTOR_SEARCH_TOP_K = 3
-# Distance threshold for knowledge base matching, in the range 0-1; the smaller the SCORE, the smaller the distance and thus the higher the relevance.
-# A value of 1 amounts to no filtering. In practice, most bge-large distance scores fall between 0.01 and 0.7,
-# and similar texts score at most around 0.55, so a threshold of 0.6 is recommended for bge.
-SCORE_THRESHOLD = 0.6
+# Distance threshold for knowledge base matching, usually in the range 0-1; the smaller the SCORE, the smaller the distance and thus the higher the relevance.
+# However, some users have reported matching scores above 1, so the default is 1 for compatibility; the WebUI allows adjusting it within 0-2.
+SCORE_THRESHOLD = 1.0
# Default search engine. Options: bing, duckduckgo, metaphor
DEFAULT_SEARCH_ENGINE = "duckduckgo"
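
A minimal sketch of how such a distance threshold is typically applied when post-filtering vector-store hits (illustrative only, not the project's actual API):

```python
# Illustrative: keep only hits whose distance score passes the threshold.
# `hits` is a list of (document, score) pairs; a smaller score means a closer match.
SCORE_THRESHOLD = 1.0

def filter_by_score(hits, threshold=SCORE_THRESHOLD):
    return [doc for doc, score in hits if score <= threshold]
```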
@@ -49,12 +48,17 @@ BING_SUBSCRIPTION_KEY = ""
# metaphor search requires a KEY
METAPHOR_API_KEY = ""
+# Seniverse (心知天气) weather API KEY, used by the weather Agent. Apply at: https://www.seniverse.com/
+SENIVERSE_API_KEY = ""
# Whether to enable Chinese title enhancement, plus its related options.
# Extra title detection marks which texts are titles (flagged in the metadata),
# then each text is concatenated with the title one level above it to enrich the text.
ZH_TITLE_ENHANCE = False
+# PDF OCR control: only run OCR on images whose size exceeds the given fraction of the page (image width / page width, image height / page height).
+# This avoids noise from small images in PDFs and speeds up processing of non-scanned PDFs.
+PDF_OCR_THRESHOLD = (0.6, 0.6)
# Introductory description for each knowledge base, shown at initialization and used by Agent calls; if omitted, the knowledge base has no description and will not be called by Agents.
KB_INFO = {
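
As a worked example of `PDF_OCR_THRESHOLD` above: with `(0.6, 0.6)`, an image spanning 50% of the page width is skipped even if it covers the full page height, because OCR only runs when both the width and height ratios reach their thresholds (the check itself appears in the `RapidOCRPDFLoader` change further down in this diff).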
diff --git a/configs/model_config.py.example b/configs/model_config.py.example
index b203e933..8746f098 100644
--- a/configs/model_config.py.example
+++ b/configs/model_config.py.example
@@ -6,9 +6,9 @@ import os
MODEL_ROOT_PATH = ""
# Name of the selected Embedding model
-EMBEDDING_MODEL = "bge-large-zh"
+EMBEDDING_MODEL = "bge-large-zh-v1.5"
-# Device for the Embedding model. "auto" detects automatically; it can also be set manually to one of "cuda", "mps", or "cpu".
+# Device for the Embedding model. "auto" detects automatically (with a warning); it can also be set manually to one of "cuda", "mps", "cpu", or "xpu".
EMBEDDING_DEVICE = "auto"
# Selected reranker model
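
For reference on the `EMBEDDING_DEVICE = "auto"` setting above, a minimal sketch of what an "auto" choice usually resolves to (illustrative; the project's own detection helper may differ and additionally covers "xpu"):

```python
import torch

def auto_detect_device() -> str:
    # Illustrative fallback chain for a device setting of "auto".
    if torch.cuda.is_available():
        return "cuda"
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return "mps"
    return "cpu"
```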
@@ -26,44 +26,33 @@ EMBEDDING_MODEL_OUTPUT_PATH = "output"
# Here we use two currently mainstream offline models, with chatglm3-6b as the default.
# If you are short on VRAM, use Qwen-1_8B-Chat, which needs only 3.8 GB of VRAM in FP16.
-# For chatglm3-6b emitting the <|user|> role tag and answering its own questions, see project wiki -> FAQ -> Q20.
-
-LLM_MODELS = ["chatglm3-6b", "zhipu-api", "openai-api"] # "Qwen-1_8B-Chat",
-
-# Name of the AgentLM model (optional; if set, it pins the model used by the Chain entered from the Agent; if unset, LLM_MODELS[0] is used)
+LLM_MODELS = ["chatglm3-6b", "zhipu-api", "openai-api"]
Agent_MODEL = None
-# Device for the LLM. "auto" detects automatically; it can also be set manually to one of "cuda", "mps", or "cpu".
+# Device for the LLM model. "auto" detects automatically (with a warning); it can also be set manually to one of "cuda", "mps", "cpu", or "xpu".
LLM_DEVICE = "auto"
-# Number of history dialogue turns
HISTORY_LEN = 3
-# Maximum length supported by the LLM; if unset, the model's default maximum length is used, otherwise the user-specified maximum
-MAX_TOKENS = None
+MAX_TOKENS = 2048
-# Common LLM chat parameters
TEMPERATURE = 0.7
-# TOP_P = 0.95  # not yet supported by ChatOpenAI
ONLINE_LLM_MODEL = {
-    # Online models. Set a different port for each online API in server_config
-
"openai-api": {
- "model_name": "gpt-3.5-turbo",
+ "model_name": "gpt-4",
"api_base_url": "https://api.openai.com/v1",
"api_key": "",
"openai_proxy": "",
},
-    # For registration and API key, visit http://open.bigmodel.cn
+    # Zhipu AI API; for registration and API key, visit http://open.bigmodel.cn
"zhipu-api": {
"api_key": "",
- "version": "chatglm_turbo", # 可选包括 "chatglm_turbo"
+ "version": "glm-4",
"provider": "ChatGLMWorker",
},
-
# For registration and API key, visit https://api.minimax.chat/
"minimax-api": {
"group_id": "",
@@ -72,13 +61,12 @@ ONLINE_LLM_MODEL = {
"provider": "MiniMaxWorker",
},
-
# For registration and API key, visit https://xinghuo.xfyun.cn/
"xinghuo-api": {
"APPID": "",
"APISecret": "",
"api_key": "",
- "version": "v1.5", # 你使用的讯飞星火大模型版本,可选包括 "v3.0", "v1.5", "v2.0"
+ "version": "v3.0", # 你使用的讯飞星火大模型版本,可选包括 "v3.0", "v2.0", "v1.5"
"provider": "XingHuoWorker",
},
@@ -93,8 +81,8 @@ ONLINE_LLM_MODEL = {
# Volcano Ark (火山方舟) API; see the docs at https://www.volcengine.com/docs/82379
"fangzhou-api": {
- "version": "chatglm-6b-model", # 当前支持 "chatglm-6b-model", 更多的见文档模型支持列表中方舟部分。
- "version_url": "", # 可以不填写version,直接填写在方舟申请模型发布的API地址
+ "version": "chatglm-6b-model",
+ "version_url": "",
"api_key": "",
"secret_key": "",
"provider": "FangZhouWorker",
@@ -102,15 +90,15 @@ ONLINE_LLM_MODEL = {
# Alibaba Cloud Tongyi Qianwen API; see the docs at https://help.aliyun.com/zh/dashscope/developer-reference/api-details
"qwen-api": {
- "version": "qwen-turbo", # 可选包括 "qwen-turbo", "qwen-plus"
- "api_key": "", # 请在阿里云控制台模型服务灵积API-KEY管理页面创建
+ "version": "qwen-max",
+ "api_key": "",
"provider": "QwenWorker",
- "embed_model": "text-embedding-v1" # embedding 模型名称
+ "embed_model": "text-embedding-v1" # embedding 模型名称
},
# Baichuan API; see https://www.baichuan-ai.com/home#api-enter for how to apply
"baichuan-api": {
- "version": "Baichuan2-53B", # 当前支持 "Baichuan2-53B", 见官方文档。
+ "version": "Baichuan2-53B",
"api_key": "",
"secret_key": "",
"provider": "BaiChuanWorker",
@@ -132,6 +120,11 @@ ONLINE_LLM_MODEL = {
"secret_key": "",
"provider": "TianGongWorker",
},
+ # Gemini API https://makersuite.google.com/app/apikey
+ "gemini-api": {
+ "api_key": "",
+ "provider": "GeminiWorker",
+ }
}
@@ -143,6 +136,7 @@ ONLINE_LLM_MODEL = {
# - GanymedeNil/text2vec-large-chinese
# - text2vec-large-chinese
# 2.2 If the local paths above do not exist, use the huggingface model
+
MODEL_PATH = {
"embed_model": {
"ernie-tiny": "nghuyong/ernie-3.0-nano-zh",
@@ -169,55 +163,59 @@ MODEL_PATH = {
},
"llm_model": {
-    # Some of the models below have not been fully tested; support is inferred from the fastchat and vllm model lists
"chatglm2-6b": "THUDM/chatglm2-6b",
"chatglm2-6b-32k": "THUDM/chatglm2-6b-32k",
-
"chatglm3-6b": "THUDM/chatglm3-6b",
"chatglm3-6b-32k": "THUDM/chatglm3-6b-32k",
- "chatglm3-6b-base": "THUDM/chatglm3-6b-base",
- "Qwen-1_8B": "Qwen/Qwen-1_8B",
+ "Orion-14B-Chat": "OrionStarAI/Orion-14B-Chat",
+ "Orion-14B-Chat-Plugin": "OrionStarAI/Orion-14B-Chat-Plugin",
+ "Orion-14B-LongChat": "OrionStarAI/Orion-14B-LongChat",
+
+ "Llama-2-7b-chat-hf": "meta-llama/Llama-2-7b-chat-hf",
+ "Llama-2-13b-chat-hf": "meta-llama/Llama-2-13b-chat-hf",
+ "Llama-2-70b-chat-hf": "meta-llama/Llama-2-70b-chat-hf",
+
"Qwen-1_8B-Chat": "Qwen/Qwen-1_8B-Chat",
- "Qwen-1_8B-Chat-Int8": "Qwen/Qwen-1_8B-Chat-Int8",
- "Qwen-1_8B-Chat-Int4": "Qwen/Qwen-1_8B-Chat-Int4",
-
- "Qwen-7B": "Qwen/Qwen-7B",
"Qwen-7B-Chat": "Qwen/Qwen-7B-Chat",
-
- "Qwen-14B": "Qwen/Qwen-14B",
"Qwen-14B-Chat": "Qwen/Qwen-14B-Chat",
-
- "Qwen-14B-Chat-Int8": "Qwen/Qwen-14B-Chat-Int8",
-        # Under newer transformers versions, you must manually edit the model's config.json and add
-        # `disable_exllama: true` to the quantization_config dict before Qwen quantized models can start
- "Qwen-14B-Chat-Int4": "Qwen/Qwen-14B-Chat-Int4",
-
- "Qwen-72B": "Qwen/Qwen-72B",
"Qwen-72B-Chat": "Qwen/Qwen-72B-Chat",
- "Qwen-72B-Chat-Int8": "Qwen/Qwen-72B-Chat-Int8",
- "Qwen-72B-Chat-Int4": "Qwen/Qwen-72B-Chat-Int4",
- "baichuan2-13b": "baichuan-inc/Baichuan2-13B-Chat",
- "baichuan2-7b": "baichuan-inc/Baichuan2-7B-Chat",
-
- "baichuan-7b": "baichuan-inc/Baichuan-7B",
- "baichuan-13b": "baichuan-inc/Baichuan-13B",
+ "baichuan-7b-chat": "baichuan-inc/Baichuan-7B-Chat",
"baichuan-13b-chat": "baichuan-inc/Baichuan-13B-Chat",
-
- "aquila-7b": "BAAI/Aquila-7B",
- "aquilachat-7b": "BAAI/AquilaChat-7B",
+ "baichuan2-7b-chat": "baichuan-inc/Baichuan2-7B-Chat",
+ "baichuan2-13b-chat": "baichuan-inc/Baichuan2-13B-Chat",
"internlm-7b": "internlm/internlm-7b",
"internlm-chat-7b": "internlm/internlm-chat-7b",
+ "internlm2-chat-7b": "internlm/internlm2-chat-7b",
+ "internlm2-chat-20b": "internlm/internlm2-chat-20b",
+
+ "BlueLM-7B-Chat": "vivo-ai/BlueLM-7B-Chat",
+ "BlueLM-7B-Chat-32k": "vivo-ai/BlueLM-7B-Chat-32k",
+
+ "Yi-34B-Chat": "https://huggingface.co/01-ai/Yi-34B-Chat",
+
+ "agentlm-7b": "THUDM/agentlm-7b",
+ "agentlm-13b": "THUDM/agentlm-13b",
+ "agentlm-70b": "THUDM/agentlm-70b",
"falcon-7b": "tiiuae/falcon-7b",
"falcon-40b": "tiiuae/falcon-40b",
"falcon-rw-7b": "tiiuae/falcon-rw-7b",
+ "aquila-7b": "BAAI/Aquila-7B",
+ "aquilachat-7b": "BAAI/AquilaChat-7B",
+ "open_llama_13b": "openlm-research/open_llama_13b",
+ "vicuna-13b-v1.5": "lmsys/vicuna-13b-v1.5",
+ "koala": "young-geng/koala",
+ "mpt-7b": "mosaicml/mpt-7b",
+ "mpt-7b-storywriter": "mosaicml/mpt-7b-storywriter",
+ "mpt-30b": "mosaicml/mpt-30b",
+ "opt-66b": "facebook/opt-66b",
+ "opt-iml-max-30b": "facebook/opt-iml-max-30b",
"gpt2": "gpt2",
"gpt2-xl": "gpt2-xl",
-
"gpt-j-6b": "EleutherAI/gpt-j-6b",
"gpt4all-j": "nomic-ai/gpt4all-j",
"gpt-neox-20b": "EleutherAI/gpt-neox-20b",
@@ -225,63 +223,51 @@ MODEL_PATH = {
"oasst-sft-4-pythia-12b-epoch-3.5": "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5",
"dolly-v2-12b": "databricks/dolly-v2-12b",
"stablelm-tuned-alpha-7b": "stabilityai/stablelm-tuned-alpha-7b",
-
- "Llama-2-13b-hf": "meta-llama/Llama-2-13b-hf",
- "Llama-2-70b-hf": "meta-llama/Llama-2-70b-hf",
- "open_llama_13b": "openlm-research/open_llama_13b",
- "vicuna-13b-v1.3": "lmsys/vicuna-13b-v1.3",
- "koala": "young-geng/koala",
-
- "mpt-7b": "mosaicml/mpt-7b",
- "mpt-7b-storywriter": "mosaicml/mpt-7b-storywriter",
- "mpt-30b": "mosaicml/mpt-30b",
- "opt-66b": "facebook/opt-66b",
- "opt-iml-max-30b": "facebook/opt-iml-max-30b",
-
- "agentlm-7b": "THUDM/agentlm-7b",
- "agentlm-13b": "THUDM/agentlm-13b",
- "agentlm-70b": "THUDM/agentlm-70b",
-
- "Yi-34B-Chat": "01-ai/Yi-34B-Chat",
},
- "reranker":{
- "bge-reranker-large":"BAAI/bge-reranker-large",
- "bge-reranker-base":"BAAI/bge-reranker-base",
-        # TODO: add online rerankers, e.g. cohere
+
+ "reranker": {
+ "bge-reranker-large": "BAAI/bge-reranker-large",
+ "bge-reranker-base": "BAAI/bge-reranker-base",
}
}
-
# Usually nothing below needs to be changed
# nltk model storage path
NLTK_DATA_PATH = os.path.join(os.path.dirname(os.path.dirname(__file__)), "nltk_data")
+# Using vLLM may degrade model reasoning ability and prevent Agent tasks from completing
VLLM_MODEL_DICT = {
- "aquila-7b": "BAAI/Aquila-7B",
- "aquilachat-7b": "BAAI/AquilaChat-7B",
-
- "baichuan-7b": "baichuan-inc/Baichuan-7B",
- "baichuan-13b": "baichuan-inc/Baichuan-13B",
- "baichuan-13b-chat": "baichuan-inc/Baichuan-13B-Chat",
-
"chatglm2-6b": "THUDM/chatglm2-6b",
"chatglm2-6b-32k": "THUDM/chatglm2-6b-32k",
"chatglm3-6b": "THUDM/chatglm3-6b",
"chatglm3-6b-32k": "THUDM/chatglm3-6b-32k",
+ "Llama-2-7b-chat-hf": "meta-llama/Llama-2-7b-chat-hf",
+ "Llama-2-13b-chat-hf": "meta-llama/Llama-2-13b-chat-hf",
+ "Llama-2-70b-chat-hf": "meta-llama/Llama-2-70b-chat-hf",
+
+ "Qwen-1_8B-Chat": "Qwen/Qwen-1_8B-Chat",
+ "Qwen-7B-Chat": "Qwen/Qwen-7B-Chat",
+ "Qwen-14B-Chat": "Qwen/Qwen-14B-Chat",
+ "Qwen-72B-Chat": "Qwen/Qwen-72B-Chat",
+
+ "baichuan-7b-chat": "baichuan-inc/Baichuan-7B-Chat",
+ "baichuan-13b-chat": "baichuan-inc/Baichuan-13B-Chat",
+ "baichuan2-7b-chat": "baichuan-inc/Baichuan-7B-Chat",
+ "baichuan2-13b-chat": "baichuan-inc/Baichuan-13B-Chat",
+
"BlueLM-7B-Chat": "vivo-ai/BlueLM-7B-Chat",
"BlueLM-7B-Chat-32k": "vivo-ai/BlueLM-7B-Chat-32k",
-    # Note: the bloom series keeps tokenizer and model separate, so although vllm supports them, they are incompatible with the fschat framework
- # "bloom": "bigscience/bloom",
- # "bloomz": "bigscience/bloomz",
- # "bloomz-560m": "bigscience/bloomz-560m",
- # "bloomz-7b1": "bigscience/bloomz-7b1",
- # "bloomz-1b7": "bigscience/bloomz-1b7",
-
"internlm-7b": "internlm/internlm-7b",
"internlm-chat-7b": "internlm/internlm-chat-7b",
+ "internlm2-chat-7b": "internlm/Models/internlm2-chat-7b",
+ "internlm2-chat-20b": "internlm/Models/internlm2-chat-20b",
+
+ "aquila-7b": "BAAI/Aquila-7B",
+ "aquilachat-7b": "BAAI/AquilaChat-7B",
+
"falcon-7b": "tiiuae/falcon-7b",
"falcon-40b": "tiiuae/falcon-40b",
"falcon-rw-7b": "tiiuae/falcon-rw-7b",
@@ -294,8 +280,6 @@ VLLM_MODEL_DICT = {
"oasst-sft-4-pythia-12b-epoch-3.5": "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5",
"dolly-v2-12b": "databricks/dolly-v2-12b",
"stablelm-tuned-alpha-7b": "stabilityai/stablelm-tuned-alpha-7b",
- "Llama-2-13b-hf": "meta-llama/Llama-2-13b-hf",
- "Llama-2-70b-hf": "meta-llama/Llama-2-70b-hf",
"open_llama_13b": "openlm-research/open_llama_13b",
"vicuna-13b-v1.3": "lmsys/vicuna-13b-v1.3",
"koala": "young-geng/koala",
@@ -305,37 +289,14 @@ VLLM_MODEL_DICT = {
"opt-66b": "facebook/opt-66b",
"opt-iml-max-30b": "facebook/opt-iml-max-30b",
- "Qwen-1_8B": "Qwen/Qwen-1_8B",
- "Qwen-1_8B-Chat": "Qwen/Qwen-1_8B-Chat",
- "Qwen-1_8B-Chat-Int8": "Qwen/Qwen-1_8B-Chat-Int8",
- "Qwen-1_8B-Chat-Int4": "Qwen/Qwen-1_8B-Chat-Int4",
-
- "Qwen-7B": "Qwen/Qwen-7B",
- "Qwen-7B-Chat": "Qwen/Qwen-7B-Chat",
-
- "Qwen-14B": "Qwen/Qwen-14B",
- "Qwen-14B-Chat": "Qwen/Qwen-14B-Chat",
- "Qwen-14B-Chat-Int8": "Qwen/Qwen-14B-Chat-Int8",
- "Qwen-14B-Chat-Int4": "Qwen/Qwen-14B-Chat-Int4",
-
- "Qwen-72B": "Qwen/Qwen-72B",
- "Qwen-72B-Chat": "Qwen/Qwen-72B-Chat",
- "Qwen-72B-Chat-Int8": "Qwen/Qwen-72B-Chat-Int8",
- "Qwen-72B-Chat-Int4": "Qwen/Qwen-72B-Chat-Int4",
-
- "agentlm-7b": "THUDM/agentlm-7b",
- "agentlm-13b": "THUDM/agentlm-13b",
- "agentlm-70b": "THUDM/agentlm-70b",
-
}
-# Models you consider Agent-capable can be added here; once added, the WebUI warning no longer appears
-# In our testing, only the following models natively support Agent use
SUPPORT_AGENT_MODEL = [
- "azure-api",
- "openai-api",
- "qwen-api",
- "Qwen",
- "chatglm3",
- "xinghuo-api",
+ "openai-api", # GPT4 模型
+ "qwen-api", # Qwen Max模型
+ "zhipu-api", # 智谱AI GLM4模型
+ "Qwen", # 所有Qwen系列本地模型
+ "chatglm3-6b",
+ "internlm2-chat-20b",
+ "Orion-14B-Chat-Plugin",
]
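
A minimal sketch of the kind of check this list enables (illustrative; the WebUI's actual warning logic may differ):

```python
SUPPORT_AGENT_MODEL = ["openai-api", "qwen-api", "zhipu-api", "Qwen",
                       "chatglm3-6b", "internlm2-chat-20b", "Orion-14B-Chat-Plugin"]

def supports_agent(model_name: str) -> bool:
    # A prefix match lets an entry like "Qwen" cover the whole local Qwen series.
    return any(model_name.startswith(prefix) for prefix in SUPPORT_AGENT_MODEL)
```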
diff --git a/configs/server_config.py.example b/configs/server_config.py.example
index 7fa0c412..eea9c34d 100644
--- a/configs/server_config.py.example
+++ b/configs/server_config.py.example
@@ -40,8 +40,6 @@ FSCHAT_MODEL_WORKERS = {
"device": LLM_DEVICE,
        # False or 'vllm': which inference acceleration framework to use; if vllm hits HuggingFace communication problems, see doc/FAQ
        # vllm support for some models is still immature, so it is disabled by default
-        # fschat 0.2.33 has a bug: to use vllm, edit fastchat.server.vllm_worker and change the SamplingParams
-        # argument on line 103 from stop=list(stop) to stop=[i for i in stop if i != ""]
"infer_turbo": False,
        # Parameters to configure when model_worker loads a model across multiple GPUs
@@ -92,11 +90,10 @@ FSCHAT_MODEL_WORKERS = {
# 'disable_log_requests': False
},
-    # You can override the default configuration as in the following example
-    # "Qwen-1_8B-Chat": {  # uses the IP and port from "default"
-    #     "device": "cpu",
-    # },
-    "chatglm3-6b": {  # uses the IP and port from "default"
+ "Qwen-1_8B-Chat": {
+ "device": "cpu",
+ },
+ "chatglm3-6b": {
"device": "cuda",
},
@@ -128,14 +125,11 @@ FSCHAT_MODEL_WORKERS = {
"tiangong-api": {
"port": 21009,
},
+ "gemini-api": {
+ "port": 21010,
+ },
}
-# fastchat multi model worker server
-FSCHAT_MULTI_MODEL_WORKERS = {
- # TODO:
-}
-
-# fastchat controller server
FSCHAT_CONTROLLER = {
"host": DEFAULT_BIND_HOST,
"port": 20001,
diff --git a/document_loaders/__init__.py b/document_loaders/__init__.py
index a4d6b28d..88cfeae8 100644
--- a/document_loaders/__init__.py
+++ b/document_loaders/__init__.py
@@ -1,2 +1,4 @@
from .mypdfloader import RapidOCRPDFLoader
-from .myimgloader import RapidOCRLoader
\ No newline at end of file
+from .myimgloader import RapidOCRLoader
+from .mydocloader import RapidOCRDocLoader
+from .mypptloader import RapidOCRPPTLoader
diff --git a/document_loaders/mydocloader.py b/document_loaders/mydocloader.py
new file mode 100644
index 00000000..7f5462a2
--- /dev/null
+++ b/document_loaders/mydocloader.py
@@ -0,0 +1,71 @@
+from langchain.document_loaders.unstructured import UnstructuredFileLoader
+from typing import List
+import tqdm
+
+
+class RapidOCRDocLoader(UnstructuredFileLoader):
+ def _get_elements(self) -> List:
+ def doc2text(filepath):
+ from docx.table import _Cell, Table
+ from docx.oxml.table import CT_Tbl
+ from docx.oxml.text.paragraph import CT_P
+ from docx.text.paragraph import Paragraph
+ from docx import Document, ImagePart
+ from PIL import Image
+ from io import BytesIO
+ import numpy as np
+ from rapidocr_onnxruntime import RapidOCR
+ ocr = RapidOCR()
+ doc = Document(filepath)
+ resp = ""
+
+ def iter_block_items(parent):
+ from docx.document import Document
+ if isinstance(parent, Document):
+ parent_elm = parent.element.body
+ elif isinstance(parent, _Cell):
+ parent_elm = parent._tc
+ else:
+ raise ValueError("RapidOCRDocLoader parse fail")
+
+ for child in parent_elm.iterchildren():
+ if isinstance(child, CT_P):
+ yield Paragraph(child, parent)
+ elif isinstance(child, CT_Tbl):
+ yield Table(child, parent)
+
+ b_unit = tqdm.tqdm(total=len(doc.paragraphs)+len(doc.tables),
+ desc="RapidOCRDocLoader block index: 0")
+ for i, block in enumerate(iter_block_items(doc)):
+ b_unit.set_description(
+ "RapidOCRDocLoader block index: {}".format(i))
+ b_unit.refresh()
+ if isinstance(block, Paragraph):
+ resp += block.text.strip() + "\n"
+                    images = block._element.xpath('.//pic:pic')  # collect all images in the paragraph
+                    for image in images:
+                        for img_id in image.xpath('.//a:blip/@r:embed'):  # the image's relationship id
+                            part = doc.part.related_parts[img_id]  # resolve the image part by its id
+ if isinstance(part, ImagePart):
+ image = Image.open(BytesIO(part._blob))
+ result, _ = ocr(np.array(image))
+ if result:
+ ocr_result = [line[1] for line in result]
+ resp += "\n".join(ocr_result)
+ elif isinstance(block, Table):
+ for row in block.rows:
+ for cell in row.cells:
+ for paragraph in cell.paragraphs:
+ resp += paragraph.text.strip() + "\n"
+ b_unit.update(1)
+ return resp
+
+ text = doc2text(self.file_path)
+ from unstructured.partition.text import partition_text
+ return partition_text(text=text, **self.unstructured_kwargs)
+
+
+if __name__ == '__main__':
+ loader = RapidOCRDocLoader(file_path="../tests/samples/ocr_test.docx")
+ docs = loader.load()
+ print(docs)
diff --git a/document_loaders/mypdfloader.py b/document_loaders/mypdfloader.py
index 51778b89..faaf63dd 100644
--- a/document_loaders/mypdfloader.py
+++ b/document_loaders/mypdfloader.py
@@ -1,5 +1,6 @@
from typing import List
from langchain.document_loaders.unstructured import UnstructuredFileLoader
+from configs import PDF_OCR_THRESHOLD
from document_loaders.ocr import get_ocr
import tqdm
@@ -15,23 +16,25 @@ class RapidOCRPDFLoader(UnstructuredFileLoader):
b_unit = tqdm.tqdm(total=doc.page_count, desc="RapidOCRPDFLoader context page index: 0")
for i, page in enumerate(doc):
-
-                # update the description
                b_unit.set_description("RapidOCRPDFLoader context page index: {}".format(i))
-                # refresh the progress bar immediately
                b_unit.refresh()
-                # TODO: adjust processing according to the order of text and images
text = page.get_text("")
resp += text + "\n"
- img_list = page.get_images()
+ img_list = page.get_image_info(xrefs=True)
for img in img_list:
- pix = fitz.Pixmap(doc, img[0])
- img_array = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, -1)
- result, _ = ocr(img_array)
- if result:
- ocr_result = [line[1] for line in result]
- resp += "\n".join(ocr_result)
+ if xref := img.get("xref"):
+ bbox = img["bbox"]
+                        # check whether the image size exceeds the configured threshold
+ if ((bbox[2] - bbox[0]) / (page.rect.width) < PDF_OCR_THRESHOLD[0]
+ or (bbox[3] - bbox[1]) / (page.rect.height) < PDF_OCR_THRESHOLD[1]):
+ continue
+ pix = fitz.Pixmap(doc, xref)
+ img_array = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, -1)
+ result, _ = ocr(img_array)
+ if result:
+ ocr_result = [line[1] for line in result]
+ resp += "\n".join(ocr_result)
                # update progress
b_unit.update(1)
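
For a quick smoke test of the updated loader, mirroring the `__main__` blocks of the other loaders in this diff (the sample path is hypothetical):

```python
from document_loaders.mypdfloader import RapidOCRPDFLoader

# Hypothetical sample file; use any PDF that mixes text with large images.
loader = RapidOCRPDFLoader(file_path="tests/samples/ocr_test.pdf")
docs = loader.load()
print(docs)
```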
diff --git a/document_loaders/mypptloader.py b/document_loaders/mypptloader.py
new file mode 100644
index 00000000..f14d0728
--- /dev/null
+++ b/document_loaders/mypptloader.py
@@ -0,0 +1,59 @@
+from langchain.document_loaders.unstructured import UnstructuredFileLoader
+from typing import List
+import tqdm
+
+
+class RapidOCRPPTLoader(UnstructuredFileLoader):
+ def _get_elements(self) -> List:
+ def ppt2text(filepath):
+ from pptx import Presentation
+ from PIL import Image
+ import numpy as np
+ from io import BytesIO
+ from rapidocr_onnxruntime import RapidOCR
+ ocr = RapidOCR()
+ prs = Presentation(filepath)
+ resp = ""
+
+ def extract_text(shape):
+ nonlocal resp
+ if shape.has_text_frame:
+ resp += shape.text.strip() + "\n"
+ if shape.has_table:
+ for row in shape.table.rows:
+ for cell in row.cells:
+ for paragraph in cell.text_frame.paragraphs:
+ resp += paragraph.text.strip() + "\n"
+                if shape.shape_type == 13:  # 13 = picture
+ image = Image.open(BytesIO(shape.image.blob))
+ result, _ = ocr(np.array(image))
+ if result:
+ ocr_result = [line[1] for line in result]
+ resp += "\n".join(ocr_result)
+                elif shape.shape_type == 6:  # 6 = group shape
+ for child_shape in shape.shapes:
+ extract_text(child_shape)
+
+ b_unit = tqdm.tqdm(total=len(prs.slides),
+ desc="RapidOCRPPTLoader slide index: 1")
+            # iterate over all slides
+ for slide_number, slide in enumerate(prs.slides, start=1):
+ b_unit.set_description(
+ "RapidOCRPPTLoader slide index: {}".format(slide_number))
+ b_unit.refresh()
+ sorted_shapes = sorted(slide.shapes,
+                                       key=lambda x: (x.top, x.left))  # traverse top-to-bottom, left-to-right
+ for shape in sorted_shapes:
+ extract_text(shape)
+ b_unit.update(1)
+ return resp
+
+ text = ppt2text(self.file_path)
+ from unstructured.partition.text import partition_text
+ return partition_text(text=text, **self.unstructured_kwargs)
+
+
+if __name__ == '__main__':
+ loader = RapidOCRPPTLoader(file_path="../tests/samples/ocr_test.pptx")
+ docs = loader.load()
+ print(docs)
diff --git a/embeddings/add_embedding_keywords.py b/embeddings/add_embedding_keywords.py
index 622a4cac..f46dee29 100644
--- a/embeddings/add_embedding_keywords.py
+++ b/embeddings/add_embedding_keywords.py
@@ -7,31 +7,35 @@
The merged model is saved under the directory of the original embedding model, named <original model name>_Merge_Keywords_<timestamp>
'''
import sys
+
sys.path.append("..")
+import os
+import torch
+
from datetime import datetime
from configs import (
MODEL_PATH,
EMBEDDING_MODEL,
EMBEDDING_KEYWORD_FILE,
)
-import os
-import torch
+
from safetensors.torch import save_model
from sentence_transformers import SentenceTransformer
+from langchain_core._api import deprecated
+
+
+@deprecated(
+    since="0.3.0",
+    message="Custom keywords will be rewritten in Langchain-Chatchat 0.3.x; the related functionality in 0.2.x will be deprecated",
+    removal="0.3.0"
+)
def get_keyword_embedding(bert_model, tokenizer, key_words):
tokenizer_output = tokenizer(key_words, return_tensors="pt", padding=True, truncation=True)
-
- # No need to manually convert to tensor as we've set return_tensors="pt"
input_ids = tokenizer_output['input_ids']
-
- # Remove the first and last token for each sequence in the batch
input_ids = input_ids[:, 1:-1]
keyword_embedding = bert_model.embeddings.word_embeddings(input_ids)
keyword_embedding = torch.mean(keyword_embedding, 1)
-
return keyword_embedding
@@ -47,14 +51,11 @@ def add_keyword_to_model(model_name=EMBEDDING_MODEL, keyword_file: str = "", out
bert_model = word_embedding_model.auto_model
tokenizer = word_embedding_model.tokenizer
key_words_embedding = get_keyword_embedding(bert_model, tokenizer, key_words)
- # key_words_embedding = st_model.encode(key_words)
embedding_weight = bert_model.embeddings.word_embeddings.weight
embedding_weight_len = len(embedding_weight)
tokenizer.add_tokens(key_words)
bert_model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=32)
-
- # key_words_embedding_tensor = torch.from_numpy(key_words_embedding)
embedding_weight = bert_model.embeddings.word_embeddings.weight
with torch.no_grad():
embedding_weight[embedding_weight_len:embedding_weight_len + key_words_len, :] = key_words_embedding
@@ -76,46 +77,3 @@ def add_keyword_to_embedding_model(path: str = EMBEDDING_KEYWORD_FILE):
output_model_name = "{}_Merge_Keywords_{}".format(EMBEDDING_MODEL, current_time)
output_model_path = os.path.join(model_parent_directory, output_model_name)
add_keyword_to_model(model_name, keyword_file, output_model_path)
-
-
-if __name__ == '__main__':
- add_keyword_to_embedding_model(EMBEDDING_KEYWORD_FILE)
-
- # input_model_name = ""
- # output_model_path = ""
-    # # Below is a comparison of tokenizer test cases before and after adding the keywords
- # def print_token_ids(output, tokenizer, sentences):
- # for idx, ids in enumerate(output['input_ids']):
- # print(f'sentence={sentences[idx]}')
- # print(f'ids={ids}')
- # for id in ids:
- # decoded_id = tokenizer.decode(id)
- # print(f' {decoded_id}->{id}')
- #
- # sentences = [
- # '数据科学与大数据技术',
- # 'Langchain-Chatchat'
- # ]
- #
- # st_no_keywords = SentenceTransformer(input_model_name)
- # tokenizer_without_keywords = st_no_keywords.tokenizer
- # print("===== tokenizer with no keywords added =====")
- # output = tokenizer_without_keywords(sentences)
- # print_token_ids(output, tokenizer_without_keywords, sentences)
- # print(f'-------- embedding with no keywords added -----')
- # embeddings = st_no_keywords.encode(sentences)
- # print(embeddings)
- #
- # print("--------------------------------------------")
- # print("--------------------------------------------")
- # print("--------------------------------------------")
- #
- # st_with_keywords = SentenceTransformer(output_model_path)
- # tokenizer_with_keywords = st_with_keywords.tokenizer
- # print("===== tokenizer with keyword added =====")
- # output = tokenizer_with_keywords(sentences)
- # print_token_ids(output, tokenizer_with_keywords, sentences)
- #
- # print(f'-------- embedding with keywords added -----')
- # embeddings = st_with_keywords.encode(sentences)
- # print(embeddings)
\ No newline at end of file
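
Since this diff removes the module's `__main__` entry point, the merge can still be invoked directly; a minimal sketch equivalent to the removed block:

```python
from configs import EMBEDDING_KEYWORD_FILE
from embeddings.add_embedding_keywords import add_keyword_to_embedding_model

# Equivalent to the removed __main__ block: merge the keyword file into the embedding model.
add_keyword_to_embedding_model(EMBEDDING_KEYWORD_FILE)
```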
diff --git a/img/qr_code_85.jpg b/img/qr_code_85.jpg
deleted file mode 100644
index d294f22f4ad4de3ac74e0f027ef4b0874d32dcae..0000000000000000000000000000000000000000