Langchain-Chatchat/test_pdf.py
zhenkaivip d2716addd6
使用paddleocr实现实现UnstructuredPaddlePDFLoader和UnstructuredPaddleImageLoader (#344)
* jpg and png ocr

* fix

* write docs to tmp file

* fix

* image loader

* fix

* fix

* add pdf_loader

* fix

* update INSTALL.md

---------

Co-authored-by: imClumsyPanda <littlepanda0716@gmail.com>
2023-05-13 11:13:40 +08:00

13 lines
292 B
Python

from configs.model_config import *
import nltk
nltk.data.path = [NLTK_DATA_PATH] + nltk.data.path
filepath = "docs/test.pdf"
from loader import UnstructuredPaddlePDFLoader
loader = UnstructuredPaddlePDFLoader(filepath, mode="elements")
docs = loader.load()
for doc in docs:
print(doc)