☰
Current Page
Main Menu
Home
Home
Editing Find and Highlight Text in PDF with Python
Edit
Preview
H1
H2
H3
default
Set your preferred keybinding
default
vim
emacs
markdown
Set this page's format to
AsciiDoc
Creole
Markdown
MediaWiki
Org-mode
Plain Text
RDoc
Textile
Rendering unavailable for
BibTeX
Pod
reStructuredText
Help 1
Help 1
Help 1
Help 2
Help 3
Help 4
Help 5
Help 6
Help 7
Help 8
Autosaved text is available. Click the button to restore it.
Restore Text
https://medium.com/@alice.yang_10652/find-and-highlight-text-in-pdf-with-python-213deeed1718 # Find and Highlight Text in PDF with Python | by Alice Yang | Medium  [Alice Yang](https://medium.com/@alice.yang_10652) Find and Highlight Text in PDF with Python PDF (Portable Document Format) files are widely used for sharing and preserving documents with their original formatting intact. When working with lengthy PDF documents, finding specific information can be time-consuming. That’s where the Find and highlight text feature becomes invaluable. By utilizing this feature, you can quickly locate relevant information, extract important details, and create visual markers for easy reference. This article will explore how to **find and highlight text in PDF using Python**. It covers the following topics: * [Find and Highlight Text in PDF with Python](#a3a5) * [Find and Highlight Text in a PDF Page Area with Python](#6d9e) * [Find and Highlight Text in PDF using Regex with Python](#d8ce) * [Find Text and Get Its Coordinates in PDF with Python](#5414) Python Library to Find and Highlight Text in PDF ------------------------------------------------ To find and highlight text in PDF files with Python, we will use [Spire.PDF for Python](https://www.e-iceblue.com/Introduce/pdf-for-python.html). It is a feature-rich and user-friendly library designed to create, read, edit, and convert PDF files within Python applications. You can install Spire.PDF for Python from [PyPI](https://pypi.org/project/Spire.Pdf/) using the following pip command: ``` pip install Spire.Pdf ``` If you already have Spire.PDF for Python installed and would like to upgrade to the latest version, use the following pip command: ``` pip install --upgrade Spire.Pdf ``` For more detailed information about the installation, you can check this official documentation: [How to Install Spire.PDF for Python in VS Code](https://www.e-iceblue.com/Tutorials/Python/Spire.PDF-for-Python/Getting-Started/How-to-Install-Spire.PDF-for-Python-in-VS-Code.html). Find and Highlight Text in PDF with Python ------------------------------------------ The P**dfTextFinder** class in Spire.PDF for Python is used to search for text in PDF documents. Using the **Find()** method of this class, you can find a specific word or sentence on PDF pages. Then you can highlight each found instance of the text with a bright color, and get the number of instances and the corresponding page numbers. Here are the steps to find and highlight text in a PDF document with Python: * Create an instance of the **PdfDocument** class and load the PDF document using the **PdfDocument.LoadFromFile()** method. * Initialize a counter to track the number of text instances and a list to store the page numbers where the text appears. * Iterate through the pages in the PDF. * For each page, create a **PdfTextFinder** instance and set the text finding parameters (such as WholeWord, IgnoreCase) through the **PdfTextFinder.Options.Parameter** property. * Use the **PdfTextFinder.Find()** method to search for the specific text on the page. This method will return a list of **PdfTextFragment** objects, each representing an instance of the text in the document. * Iterate through the **PdfTextFragment** objects in the list. Then use **PdfTextFragment.Highlight()** method to highlight each instance, increment the count of text instances and add the current page number to the list. * Use the **PdfDocument.SaveToFile()** method to save the resulting document to a new file. * Print the number of text instances and the page numbers. Here is a code example of how to find and highlight text in a PDF with Python: ``` from spire.pdf.common import * from spire.pdf import * # Create an object of the PdfDocument class doc = PdfDocument() # Load a PDF file doc.LoadFromFile("Adobe Acrobat.pdf") # Initialize a counter to keep track of the number of instances instance_count = 0 # Initialize a list to store the page numbers page_numbers = [] # Iterate through the pages in the document for i in range(doc.Pages.Count): page = doc.Pages[i] # Create a PdfTextFinder instance finder = PdfTextFinder(page) # Set the text finding parameter finder.Options.Parameter = TextFindParameter.WholeWord # Find a specific text results = finder.Find("Adobe Acrobat") # Highlight all instances of the specific text for text in results: text.HighLight(Color.get_Yellow()) # Increment the instance count instance_count += 1 # Add the page number to the list page_numbers.append(i+1) # Save the result file doc.SaveToFile("FindAndHighlightText.pdf") # Print the number of instances and the page numbers print(f"The text 'Adobe Acrobat' appears {instance_count} times in the PDF.") print(f"The text appears on the following pages: {', '.join(map(str, page_numbers))}") ``` Find and Highlight Text in a PDF Page Area with Python ------------------------------------------------------ In some cases, you may need to find and highlight text within a specific area or region of a PDF page, rather than the entire page. Using the **PdfTextFinder.Options.Area** property, you can easily define the page area to search for text. Here are the steps to find and highlight text in a specific PDF page area with Python: * Create an instance of the **PdfDocument** class and use the **PdfDocument.LoadFromFile()** method to load the PDF document. * Iterate through the pages in the PDF. * For each page, create a **PdfTextFinder** instance and set the page area to search for text through the **PdfTextFinder.Options.Area** property. * Use the **PdfTextFinder.Find()** method to search for the specific text in the page area. * Highlight each found instance using the **PdfTextFragment.Highlight()** method. * Save the resulting document using **PdfDocument.SaveToFile()** method. Here is a code example of how to find and highlight text in a specific PDF page area with Python: ``` from spire.pdf.common import * from spire.pdf import * # Create an object of the PdfDocument class doc = PdfDocument() # Load a PDF file doc.LoadFromFile("Adobe Acrobat.pdf") # Iterate through the pages in the document for i in range(doc.Pages.Count): page = doc.Pages[i] # Create a PdfTextFinder instance finder = PdfTextFinder(page) # Set the page area to search for text finder.Options.Area = RectangleF(0.0, 0.0, 300.0, 300.0) # Find a specific text results = finder.Find("Adobe Acrobat") # Highlight all instances of the specific text for text in results: text.HighLight(Color.get_Yellow()) # Save the resulting file doc.SaveToFile("FindAndHighlightTextInPageArea.pdf") doc.Close() ``` Find and Highlight Text in PDF using Regex with Python ------------------------------------------------------ Regular expressions (regex) are a powerful tool for performing sophisticated text finds, allowing you to precisely match and extract information based on complex patterns and rules. To search and highlight text using regular expression in a PDF, you first need to set the **PdfTextFinder.Options.Parameter** property to **TextFindParameter.Regex** to enable regex-based searching. Then, pass the regular expression as a parameter to the **Find()** method to implement text searching based on regular expression. Here are the steps to find and highlight text in a PDF using regular expression with Python: * Create an instance of the **PdfDocument** class and use the **PdfDocument.LoadFromFile()** method to load the PDF document. * Iterate through the pages in the PDF. * For each page, create a **PdfTextFinder** instance and set the **PdfTextFinder.Options.Parameter** property to **TextFindParameter.Regex** to enable regular expression-based searching. * Pass the regular expression to the **PdfTextFinder.Find()** method to implement text searching based on regular expressions. * Use the **PdfTextFragment.Highlight()** method to highlight each matched instance. * Save the resulting document using the **PdfDocument.SaveToFile()** method. Here is a code example of how to find and highlight text in a PDF using regular expression in Python: ``` from spire.pdf.common import * from spire.pdf import * # Create an object of the PdfDocument class doc = PdfDocument() # Load a PDF file doc.LoadFromFile("Template.pdf") # Iterate through the pages in the document for i in range(doc.Pages.Count): page = doc.Pages[i] # Create a PdfTextFinder instance finder = PdfTextFinder(page) # Set the text finding parameter to enable regex-based searching finder.Options.Parameter = TextFindParameter.Regex # Find the text starting with the symbol "#" results = finder.Find("""\\#\\w+\\b""") # Highlight all matched text for text in results: text.HighLight(Color.get_Yellow()) # Save the resulting document doc.SaveToFile("FindAndHighlightTextUsingRegex.pdf") doc.Close() ``` Find Text and Get Its Coordinates in PDF with Python ---------------------------------------------------- You can find specific text in a PDF and retrieve the coordinates of each found text instance through the **PdfTextFragment.Positions\[\].X** and **PdfTextFragment.Positions\[\].Y** properties. Here are the steps to find text in a PDF and retrieve the coordinates of each found instance with Python: * Create an instance of the **PdfDocument** class and use the **PdfDocument.LoadFromFile()** method to load the PDF document. * Iterate through the pages in the PDF. * For each page, create a **PdfTextFinder** instance. * Use the **PdfTextFinder.Find()** method to search for the specific text. * Use the **PdfTextFragment.Positions[0].X** and **PdfTextFragment.Positions[0].Y** properties to get the X and Y coordinates of each found instance. Here is a code example of how to find text in a PDF and retrieve the coordinates of each found instance with Python: ``` from spire.pdf.common import * from spire.pdf import * # Create an object of the PdfDocument class doc = PdfDocument() # Load a PDF file doc.LoadFromFile("Adobe Acrobat.pdf") # Iterate through the pages in the document for i in range(doc.Pages.Count): page = doc.Pages[i] # Create a PdfTextFinder instance finder = PdfTextFinder(page) # Find a specific text results = finder.Find("Adobe Acrobat") # Print the coordinates of each found instance for text in results: print(f"Text Position: ({text.Positions[0].X}, {text.Positions[0].Y})") doc.Close() ``` Conclusion ---------- This blog demonstrated various scenarios for searching and highlighting text in PDF using Python. Additionally, it also explained how to get the coordinates of specific text in PDF using Python. We hope you find it helpful. Related Topics -------------- * [How to Find and Replace Text in PDF with Python](https://medium.com/@alice.yang_10652/how-to-find-and-replace-text-in-pdf-with-python-9a788ed3cd9a) * [Extract PDF Tables to Text, Excel, and CSV in Python](https://medium.com/@alice.yang_10652/extract-pdf-tables-to-text-excel-and-csv-in-python-53fdbf3fad91) * [Extract Images and Image Information from PDF with Python](https://medium.com/@alice.yang_10652/extract-images-and-image-information-from-pdf-with-python-10719a3bda81) * [How to Encrypt and Decrypt PDF Files with Python](https://medium.com/@alice.yang_10652/how-to-encrypt-and-decrypt-pdf-files-with-python-124d86a70718) * [5 Ways to Compress PDF or Reduce PDF File Size with Python](https://medium.com/@alice.yang_10652/5-ways-to-compress-pdf-or-reduce-pdf-file-size-with-python-655551041982)
Uploading file...
Edit message:
Cancel