Find and Highlight Text in PDF with Python

https://medium.com/@alice.yang_10652/find-and-highlight-text-in-pdf-with-python-213deeed1718

# Find and Highlight Text in PDF with Python

[Alice Yang](https://medium.com/@alice.yang_10652)

PDF (Portable Document Format) files are widely used for sharing and preserving documents with their original formatting intact. In a lengthy PDF, locating specific information by hand can be time-consuming. Searching for text programmatically and highlighting each match lets you locate relevant passages quickly and leave visual markers for easy reference. This article explains how to **find and highlight text in a PDF using Python**. It covers the following topics:

*   [Find and Highlight Text in PDF with Python](#a3a5)
*   [Find and Highlight Text in a PDF Page Area with Python](#6d9e)
*   [Find and Highlight Text in PDF using Regex with Python](#d8ce)
*   [Find Text and Get Its Coordinates in PDF with Python](#5414)

Python Library to Find and Highlight Text in PDF
------------------------------------------------

To find and highlight text in PDF files with Python, we will use [Spire.PDF for Python](https://www.e-iceblue.com/Introduce/pdf-for-python.html). It is a feature-rich and user-friendly library designed to create, read, edit, and convert PDF files within Python applications.

You can install Spire.PDF for Python from [PyPI](https://pypi.org/project/Spire.Pdf/) using the following pip command:

```
pip install Spire.Pdf
```

If you already have Spire.PDF for Python installed and would like to upgrade to the latest version, use the following pip command:

```
pip install --upgrade Spire.Pdf
```

For more detailed information about the installation, you can check this official documentation: [How to Install Spire.PDF for Python in VS Code](https://www.e-iceblue.com/Tutorials/Python/Spire.PDF-for-Python/Getting-Started/How-to-Install-Spire.PDF-for-Python-in-VS-Code.html).
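
Before running the examples below, it can help to confirm that the package is actually installed in the active environment. The following is a minimal sketch using only the Python standard library; the helper name `installed_version` is hypothetical and not part of Spire.PDF, and the distribution name `Spire.Pdf` is taken from the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "Spire.Pdf" is the distribution name used in the pip command above
version = installed_version("Spire.Pdf")
if version is None:
    print("Spire.Pdf is not installed; run: pip install Spire.Pdf")
else:
    print(f"Spire.Pdf {version} is installed")
```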

Find and Highlight Text in PDF with Python
------------------------------------------

The **PdfTextFinder** class in Spire.PDF for Python is used to search for text on PDF pages. Using the **Find()** method of this class, you can find a specific word or phrase on a page. You can then highlight each match in a bright color and record the number of matches and the pages on which they appear.

Here are the steps to find and highlight text in a PDF document with Python:

*   Create an instance of the **PdfDocument** class and load the PDF document using the **PdfDocument.LoadFromFile()** method.
*   Initialize a counter to track the number of text instances and a list to store the page numbers where the text appears.
*   Iterate through the pages in the PDF.
*   For each page, create a **PdfTextFinder** instance and set the text finding parameter (such as WholeWord or IgnoreCase) through the **PdfTextFinder.Options.Parameter** property.
*   Use the **PdfTextFinder.Find()** method to search for the specific text on the page. This method returns a list of **PdfTextFragment** objects, each representing one instance of the text on that page.
*   Iterate through the **PdfTextFragment** objects in the list. Use the **PdfTextFragment.HighLight()** method to highlight each instance, increment the count of text instances, and add the current page number to the list.
*   Use the **PdfDocument.SaveToFile()** method to save the resulting document to a new file.
*   Print the number of text instances and the page numbers.

Here is a code example of how to find and highlight text in a PDF with Python:

```
from spire.pdf.common import *
from spire.pdf import *
# Create an object of the PdfDocument class
doc = PdfDocument()
# Load a PDF file
doc.LoadFromFile("Adobe Acrobat.pdf")
# Initialize a counter to keep track of the number of instances
instance_count = 0
# Initialize a list to store the page numbers
page_numbers = []
# Iterate through the pages in the document
for i in range(doc.Pages.Count):
    page = doc.Pages[i]
    # Create a PdfTextFinder instance
    finder = PdfTextFinder(page)
    # Set the text finding parameter
    finder.Options.Parameter = TextFindParameter.WholeWord
    # Find a specific text
    results = finder.Find("Adobe Acrobat")
    # Highlight all instances of the specific text
    for text in results:
        text.HighLight(Color.get_Yellow())
        # Increment the instance count
        instance_count += 1
        # Record the page number once, even if the page has several matches
        if i + 1 not in page_numbers:
            page_numbers.append(i + 1)
# Save the result file
doc.SaveToFile("FindAndHighlightText.pdf")
# Release the resources held by the document
doc.Close()
# Print the number of instances and the page numbers
print(f"The text 'Adobe Acrobat' appears {instance_count} times in the PDF.")
print(f"The text appears on the following pages: {', '.join(map(str, page_numbers))}")
```
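
The counting logic can also be kept separate from the search loop, which makes it easy to check without a PDF at hand. The sketch below is plain Python and does not call Spire.PDF; the function name `summarize_matches` and the sample counts are hypothetical. It assumes you have collected the number of matches per page (for example, by counting the items returned by `Find()` for each page) and lists each page once regardless of how many matches it contains.

```python
from typing import List, Tuple


def summarize_matches(matches_per_page: List[int]) -> Tuple[int, List[int]]:
    """Return (total matches, 1-based page numbers that contain a match).

    matches_per_page[i] is the number of matches found on the page at index i.
    """
    total = sum(matches_per_page)
    pages = [index + 1 for index, count in enumerate(matches_per_page) if count > 0]
    return total, pages


# Hypothetical counts for a four-page document
total, pages = summarize_matches([2, 0, 3, 1])
print(f"{total} matches on pages: {', '.join(map(str, pages))}")
# prints: 6 matches on pages: 1, 3, 4
```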

Find and Highlight Text in a PDF Page Area with Python
------------------------------------------------------

In some cases, you may need to find and highlight text within a specific area or region of a PDF page, rather than the entire page. Using the **PdfTextFinder.Options.Area** property, you can easily define the page area to search for text.

Here are the steps to find and highlight text in a specific PDF page area with Python:

*   Create an instance of the **PdfDocument** class and use the **PdfDocument.LoadFromFile()** method to load the PDF document.
*   Iterate through the pages in the PDF.
*   For each page, create a **PdfTextFinder** instance and set the page area to search for text through the **PdfTextFinder.Options.Area** property.
*   Use the **PdfTextFinder.Find()** method to search for the specific text in the page area.
*   Highlight each found instance using the **PdfTextFragment.HighLight()** method.
*   Save the resulting document using the **PdfDocument.SaveToFile()** method.

Here is a code example of how to find and highlight text in a specific PDF page area with Python:

```
from spire.pdf.common import *
from spire.pdf import *
# Create an object of the PdfDocument class
doc = PdfDocument()
# Load a PDF file
doc.LoadFromFile("Adobe Acrobat.pdf")
# Iterate through the pages in the document
for i in range(doc.Pages.Count):
    page = doc.Pages[i]
    # Create a PdfTextFinder instance
    finder = PdfTextFinder(page)
    # Set the page area to search for text
    finder.Options.Area = RectangleF(0.0, 0.0, 300.0, 300.0)
    # Find a specific text
    results = finder.Find("Adobe Acrobat")
    # Highlight all instances of the specific text
    for text in results:
        text.HighLight(Color.get_Yellow())
# Save the resulting file
doc.SaveToFile("FindAndHighlightTextInPageArea.pdf")
doc.Close()
```
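
To reason about which part of the page a rectangle such as `RectangleF(0.0, 0.0, 300.0, 300.0)` covers, it may help to model it in plain Python. The sketch below is only an illustration: the `Area` class and `area_from_inches` helper are hypothetical stand-ins, not Spire.PDF types. It assumes the four arguments are x, y, width, and height, that values are in points (1 point = 1/72 inch), and that the origin is the top-left corner of the page. How the library treats text that only partly overlaps the area is not covered here.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72.0  # PDF user space: 1 point = 1/72 inch


@dataclass(frozen=True)
class Area:
    """Illustrative stand-in for a rectangle given as x, y, width, height."""
    x: float
    y: float
    width: float
    height: float

    def contains(self, px: float, py: float) -> bool:
        """Return True if the point (px, py) lies inside or on the border."""
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)


def area_from_inches(x_in: float, y_in: float, width_in: float, height_in: float) -> Area:
    """Build an Area from measurements in inches by converting them to points."""
    return Area(*(v * POINTS_PER_INCH for v in (x_in, y_in, width_in, height_in)))


area = Area(0.0, 0.0, 300.0, 300.0)
print(area.contains(72.0, 150.0))   # True: the point is inside the 300 x 300 region
print(area.contains(350.0, 150.0))  # False: x is beyond the right edge
print(area_from_inches(0, 0, 4, 2))  # Area(x=0.0, y=0.0, width=288.0, height=144.0)
```

The values from an `Area` built this way could then be passed on to `RectangleF(...)` when setting `finder.Options.Area`.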

Find and Highlight Text in PDF using Regex with Python
------------------------------------------------------

Regular expressions (regex) are a powerful tool for sophisticated text searches, allowing you to match and extract information based on patterns rather than fixed strings.

To search for and highlight text in a PDF using a regular expression, first set the **PdfTextFinder.Options.Parameter** property to **TextFindParameter.Regex** to enable regex-based searching. Then pass the regular expression to the **Find()** method.

Here are the steps to find and highlight text in a PDF using regular expression with Python:

*   Create an instance of the **PdfDocument** class and use the **PdfDocument.LoadFromFile()** method to load the PDF document.
*   Iterate through the pages in the PDF.
*   For each page, create a **PdfTextFinder** instance and set the **PdfTextFinder.Options.Parameter** property to **TextFindParameter.Regex** to enable regular expression-based searching.
*   Pass the regular expression to the **PdfTextFinder.Find()** method to implement text searching based on regular expressions.
*   Use the **PdfTextFragment.HighLight()** method to highlight each matched instance.
*   Save the resulting document using the **PdfDocument.SaveToFile()** method.

Here is a code example of how to find and highlight text in a PDF using regular expression in Python:

```
from spire.pdf.common import *
from spire.pdf import *
# Create an object of the PdfDocument class
doc = PdfDocument()
# Load a PDF file
doc.LoadFromFile("Template.pdf")
# Iterate through the pages in the document
for i in range(doc.Pages.Count):
    page = doc.Pages[i]
    # Create a PdfTextFinder instance
    finder = PdfTextFinder(page)
    # Set the text finding parameter to enable regex-based searching
    finder.Options.Parameter = TextFindParameter.Regex
    # Find words that start with the symbol "#" (raw string avoids doubled backslashes)
    results = finder.Find(r"\#\w+\b")
    # Highlight all matched text
    for text in results:
        text.HighLight(Color.get_Yellow())
# Save the resulting document
doc.SaveToFile("FindAndHighlightTextUsingRegex.pdf")
doc.Close()
```
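
It can be useful to try a pattern on sample text before running it against a PDF. The sketch below uses Python's built-in `re` module with the same pattern as above; the `find_tags` helper and the sample sentence are hypothetical. Note the assumption: the regex engine used by Spire.PDF may not behave identically to Python's `re` in every case, so treat this as a quick sanity check rather than a guarantee.

```python
import re

# Same pattern as in the Spire.PDF example: "#" followed by one or more word characters
HASHTAG_PATTERN = re.compile(r"\#\w+\b")


def find_tags(text: str) -> list:
    """Return every substring of text that matches the pattern."""
    return HASHTAG_PATTERN.findall(text)


# Hypothetical sample text resembling a template with placeholders
sample = "Dear #Name, your order #OrderId ships on #Date. Ref # 42."
print(find_tags(sample))
# prints: ['#Name', '#OrderId', '#Date']
```

The lone `#` before `42` is not matched, because the pattern requires at least one word character immediately after the symbol.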

Find Text and Get Its Coordinates in PDF with Python
----------------------------------------------------

You can find specific text in a PDF and retrieve the coordinates of each found text instance through the **PdfTextFragment.Positions\[\].X** and **PdfTextFragment.Positions\[\].Y** properties.

Here are the steps to find text in a PDF and retrieve the coordinates of each found instance with Python:

*   Create an instance of the **PdfDocument** class and use the **PdfDocument.LoadFromFile()** method to load the PDF document.
*   Iterate through the pages in the PDF.
*   For each page, create a **PdfTextFinder** instance.
*   Use the **PdfTextFinder.Find()** method to search for the specific text.
*   Use the **PdfTextFragment.Positions[0].X** and **PdfTextFragment.Positions[0].Y** properties to get the X and Y coordinates of each found instance.

Here is a code example of how to find text in a PDF and retrieve the coordinates of each found instance with Python:

```
from spire.pdf.common import *
from spire.pdf import *
# Create an object of the PdfDocument class
doc = PdfDocument()
# Load a PDF file
doc.LoadFromFile("Adobe Acrobat.pdf")
# Iterate through the pages in the document
for i in range(doc.Pages.Count):
    page = doc.Pages[i]
    # Create a PdfTextFinder instance
    finder = PdfTextFinder(page)
    # Find a specific text
    results = finder.Find("Adobe Acrobat")
    # Print the coordinates of each found instance
    for text in results:
        print(f"Text Position: ({text.Positions[0].X}, {text.Positions[0].Y})")
# Close the document only after all pages have been processed
doc.Close()
```
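
If you need the positions in physical units, the printed values can be converted afterwards. The sketch below is plain Python and assumes the coordinates are expressed in points (1 point = 1/72 inch), which is the usual PDF unit; check this against your own output. The helper names `points_to_mm` and `describe_position` and the sample values are hypothetical.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def points_to_mm(value_pt: float) -> float:
    """Convert a length in points to millimetres."""
    return value_pt * MM_PER_INCH / POINTS_PER_INCH


def describe_position(page_number: int, x_pt: float, y_pt: float) -> str:
    """Format a position given in points as a readable string in millimetres."""
    return (f"page {page_number}: "
            f"({points_to_mm(x_pt):.1f} mm, {points_to_mm(y_pt):.1f} mm)")


# Hypothetical position: 1 inch from the left edge, 2 inches along the y axis
print(describe_position(1, 72.0, 144.0))
# prints: page 1: (25.4 mm, 50.8 mm)
```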

Conclusion
----------

This article demonstrated several ways to search for and highlight text in a PDF using Python: across whole pages, within a specific page area, and with regular expressions. It also explained how to get the coordinates of specific text in a PDF. We hope you find it helpful.
