gettext PO - IT文库 (IT Library: free downloads of programming e-books and documents)


PyMuPDF 1.12.2 documentation

Extracting Text: We can also extract all text of a page as one string: >>> text = page.getText(type). Use one of the following strings for type: "text" (default): plain text with line breaks ...

... output = "text") Extracts the text of a page given its page number pno (zero-based). Invokes Page.getText(). Parameters: pno (int): page number, zero-based; any value < len(doc) is acceptable. output ...

... list(range(len(doc)))  # list of page numbers
for page in doc:
    if not page.getText():  # page contains no text
        r.remove(page.number)  # remove page

0 码力 | 387 pages | 2.70 MB | 1 year ago
PyMuPDF 1.24.2 Documentation

... bottom-right (ignored for XHTML, HTML and XML output). 2. Use the fitz module in CLI: python -m fitz gettext ..., which produces a text file where text has been re-arranged in layout-preserving mode. Many options ...

... corresponding unicode character. 20.5.6 Solution: 1. Use layout-preserving text extraction: python -m fitz gettext file.pdf. 2. If other text extraction tools also don't work, then the only solution again is OCRing ...

... like this (produced by the command pymupdf gettext -pages 1 demo1.pdf): ...

Note: The "gettext" command offers a functionality similar to ...
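The command lines quoted in the snippet above can also be assembled from Python. The following is a minimal sketch, not part of the PyMuPDF documentation: `build_gettext_command` is a hypothetical helper, and it uses only the pieces visible in the snippet (the `fitz` module name, the `gettext` subcommand, and the `-pages` option). It builds the argument list without running it, so it works even where PyMuPDF is not installed.

```python
import sys


def build_gettext_command(pdf_path, pages=None):
    """Build the argv list for 'python -m fitz gettext [-pages N] file.pdf'."""
    # sys.executable keeps the call tied to the current Python interpreter.
    cmd = [sys.executable, "-m", "fitz", "gettext"]
    if pages is not None:
        # '-pages' is the only option taken from the quoted command line.
        cmd += ["-pages", str(pages)]
    cmd.append(pdf_path)
    return cmd


cmd = build_gettext_command("demo1.pdf", pages=1)
print(cmd[1:])  # ['-m', 'fitz', 'gettext', '-pages', '1', 'demo1.pdf']
```

In an environment where PyMuPDF is installed, the resulting list could be passed to `subprocess.run(cmd, check=True)`. Passing a list instead of a shell string avoids quoting problems with file names that contain spaces.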

0 码力 | 565 pages | 6.84 MB | 1 year ago
