PyMuPDF 1.12.2 documentation
Extracting Text ... We can also extract all text of a page in one chunk of string: >>> text = page.getText(type) Use one of the following strings for type: "text": (default) plain text with line breaks ... output = "text") Extracts the text of a page given its page number pno (zero-based). Invokes Page.getText(). Parameters: pno (int) – Page number, zero-based. Any value < len(doc) is acceptable. output ... list(range(len(doc))) # list of page numbers for page in doc: if not page.getText(): # page contains no text r.remove(page.number) # remove page ...
387 pages | 2.70 MB | 1 year ago
PyMuPDF 1.24.2 Documentation
... bottom-right (ignored for XHTML, HTML and XML output). 2. Use the fitz module in CLI: python -m fitz gettext ..., which produces a text file where text has been re-arranged in layout-preserving mode. Many options ... corresponding unicode character. 20.5.6 Solution 1. Use layout preserving text extraction: python -m fitz gettext file.pdf. 2. If other text extraction tools also don’t work, then the only solution again is OCRing ... like this (produced by the command pymupdf gettext -pages 1 demo1.pdf): ... Note: The “gettext” command offers a functionality similar to ...
565 pages | 6.84 MB | 1 year ago
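
The page-filtering loop quoted in the PyMuPDF 1.12.2 excerpt (drop every page whose extracted text is empty) can be sketched roughly as follows. This is a minimal illustration, not the documentation's own code: the helper names `extract_page_text` and `pages_with_text` are hypothetical, and the sketch assumes only that each page object has a `number` attribute and either a `getText` method (as in the 1.12.2 excerpt) or a `get_text` method (the spelling used by newer PyMuPDF releases).

```python
def extract_page_text(page, kind="text"):
    """Return a page's text via whichever extraction method the page offers."""
    # Assumption: newer PyMuPDF releases expose get_text, while the
    # 1.12.2 excerpt uses getText. Prefer the newer name when present.
    method = getattr(page, "get_text", None) or getattr(page, "getText")
    return method(kind)


def pages_with_text(doc):
    """Return the zero-based numbers of pages whose extracted text is non-empty."""
    # Mirrors the excerpt's test `if not page.getText()`: an empty string
    # means the page contains no text and is left out of the result.
    return [page.number for page in doc if extract_page_text(page)]


# Possible usage with PyMuPDF installed ("file.pdf" is a placeholder name):
#   import fitz
#   doc = fitz.open("file.pdf")
#   print(pages_with_text(doc))
```

Building a new list of page numbers, rather than removing entries from a list while looping as the excerpt does, keeps the helper free of side effects; the set of pages reported is the same.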