How to Extract Text from a PDF with PDFMiner in Python

09/17/2021

In this article, you will learn how to extract text from a PDF with PDFMiner in Python.

Extract text from a PDF

PDFMiner is a Python package for extracting text from PDF files. Besides plain text, it provides tools and APIs for reading other information from a document, such as metadata and images. Here’s how you can use PDFMiner to extract text from a PDF file in Python:

Install PDFMiner:

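You can install PDFMiner using pip. On Python 3, install pdfminer.six, the community-maintained fork; the original pdfminer package on PyPI is no longer actively maintained. Both are imported under the name pdfminer. Open a terminal or command prompt and run the following command:

pip install pdfminer.six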
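
To confirm that the installation worked before going further, a quick check like the one below can help. It is a minimal sketch that only tests whether a package importable as pdfminer is visible to the current interpreter; the function name pdfminer_available is a hypothetical helper, not part of PDFMiner.

```python
import importlib.util


def pdfminer_available() -> bool:
    """Return True if a package importable as 'pdfminer' is installed.

    Hypothetical helper for illustration; it does not check the version
    or whether the package is the pdfminer.six fork.
    """
    return importlib.util.find_spec("pdfminer") is not None


if __name__ == "__main__":
    if pdfminer_available():
        print("pdfminer is installed")
    else:
        print("pdfminer is missing; install it with pip first")
```

If the check reports that the package is missing, make sure pip and your script use the same Python interpreter or virtual environment.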

Import the necessary modules:

Import the following modules in your Python script:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

Define a function to extract text:

Create a function that takes the path of the PDF file and returns the extracted text. Here’s an example function:

def extract_text_from_pdf(pdf_path):
    with open(pdf_path, 'rb') as fh:
        # create a PDF resource manager object that stores shared resources
        rsrcmgr = PDFResourceManager()

        # create a string object to store the extracted text
        output_string = StringIO()

        # create a PDF converter object that converts the input PDF file to plain text
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())

        # create a PDF interpreter object that reads and processes the input PDF file
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        # loop through the pages of the PDF file and extract text
        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            interpreter.process_page(page)

        # get the extracted text from the string object
        extracted_text = output_string.getvalue()

        # close the PDF converter and string objects
        device.close()
        output_string.close()

        # return the extracted text
        return extracted_text
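
For simple cases, the same result can usually be reached with less code. The sketch below assumes the pdfminer.six fork is installed, which provides a high-level extract_text function in pdfminer.high_level. The wrapper name extract_text_simple and the explicit file check are illustrative assumptions, not part of the library.

```python
from pathlib import Path


def extract_text_simple(pdf_path):
    """Extract all text from a PDF using pdfminer.six's high-level API.

    Hypothetical wrapper for illustration. Raises FileNotFoundError if
    pdf_path does not point to an existing file.
    """
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF file: {path}")

    # Imported here so the path check above runs even when pdfminer.six
    # is not installed; a top-level import works just as well.
    from pdfminer.high_level import extract_text

    # extract_text opens the file, walks every page, and returns one string.
    return extract_text(str(path))
```

The longer function above remains useful when you need control over the individual steps, for example to pass custom LAParams or to process pages one at a time.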

Call the function:

Call the extract_text_from_pdf function and pass the path of the PDF file as an argument. The function will return the extracted text, which you can print or use for further processing.

pdf_path = '/path/to/pdf/file.pdf'
extracted_text = extract_text_from_pdf(pdf_path)
print(extracted_text)
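
As one example of further processing, the returned string can be split back into pages. The sketch below assumes that the text converter writes a form feed character ("\f") after each page, which is what TextConverter does in the pdfminer.six versions I am aware of; verify this against your installed version. The name split_pages is a hypothetical helper.

```python
from typing import List


def split_pages(extracted_text: str) -> List[str]:
    """Split extracted text into one string per page.

    Assumes pages are terminated by a form feed character ("\f").
    """
    pages = extracted_text.split("\f")

    # A form feed after the last page leaves one empty trailing element.
    if pages and pages[-1] == "":
        pages.pop()

    # Trim leading and trailing whitespace on each page.
    return [page.strip() for page in pages]


if __name__ == "__main__":
    sample = "Page one\n\n\fPage two\n\f"
    for number, page in enumerate(split_pages(sample), start=1):
        print(f"Page {number}: {page}")
```

With real input, pass the value returned by extract_text_from_pdf instead of the sample string.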