Python/Find and Highlight Text in PDF with Python.md
... ...
@@ -0,0 +1,224 @@
1
+https://medium.com/@alice.yang_10652/find-and-highlight-text-in-pdf-with-python-213deeed1718
2
+
3
+
4
+# Find and Highlight Text in PDF with Python
5
+
7
+[Alice Yang](https://medium.com/@alice.yang_10652)
8
+
11
+PDF (Portable Document Format) files are widely used for sharing and preserving documents with their original formatting intact. In a lengthy PDF, locating specific information by hand is time-consuming. Searching for text programmatically and highlighting each match lets you locate relevant passages quickly and leaves visual markers for later reference. This article explains how to **find and highlight text in PDF using Python**. It covers the following topics:
12
+
13
+* [Find and Highlight Text in PDF with Python](#a3a5)
14
+* [Find and Highlight Text in a PDF Page Area with Python](#6d9e)
15
+* [Find and Highlight Text in PDF using Regex with Python](#d8ce)
16
+* [Find Text and Get Its Coordinates in PDF with Python](#5414)
17
+
18
+Python Library to Find and Highlight Text in PDF
19
+------------------------------------------------
20
+
21
+To find and highlight text in PDF files with Python, we will use [Spire.PDF for Python](https://www.e-iceblue.com/Introduce/pdf-for-python.html). It is a feature-rich and user-friendly library designed to create, read, edit, and convert PDF files within Python applications.
22
+
23
+You can install Spire.PDF for Python from [PyPI](https://pypi.org/project/Spire.Pdf/) using the following pip command:
24
+
25
+```
26
+pip install Spire.Pdf
27
+```
28
+
29
+
30
+If you already have Spire.PDF for Python installed and would like to upgrade to the latest version, use the following pip command:
31
+
32
+```
33
+pip install --upgrade Spire.Pdf
34
+```
35
+
36
+
37
+For more detailed information about the installation, you can check this official documentation: [How to Install Spire.PDF for Python in VS Code](https://www.e-iceblue.com/Tutorials/Python/Spire.PDF-for-Python/Getting-Started/How-to-Install-Spire.PDF-for-Python-in-VS-Code.html).
38
+
39
+Find and Highlight Text in PDF with Python
40
+------------------------------------------
41
+
42
+The **PdfTextFinder** class in Spire.PDF for Python is used to search for text in PDF documents. Using the **Find()** method of this class, you can find a specific word or sentence on a PDF page. You can then highlight each found instance with a bright color and record the number of instances and the pages on which they occur.
43
+
44
+Here are the steps to find and highlight text in a PDF document with Python:
45
+
46
+* Create an instance of the **PdfDocument** class and load the PDF document using the **PdfDocument.LoadFromFile()** method.
47
+* Initialize a counter to track the number of text instances and a list to store the page numbers where the text appears.
48
+* Iterate through the pages in the PDF.
49
+* For each page, create a **PdfTextFinder** instance and set the text finding parameters (such as WholeWord, IgnoreCase) through the **PdfTextFinder.Options.Parameter** property.
50
+* Use the **PdfTextFinder.Find()** method to search for the specific text on the page. This method will return a list of **PdfTextFragment** objects, each representing an instance of the text in the document.
51
+* Iterate through the **PdfTextFragment** objects in the list. Use the **PdfTextFragment.HighLight()** method to highlight each instance, increment the count of text instances, and add the current page number to the list.
52
+* Use the **PdfDocument.SaveToFile()** method to save the resulting document to a new file.
53
+* Print the number of text instances and the page numbers.
54
+
55
+Here is a code example of how to find and highlight text in a PDF with Python:
56
+
57
+```
+from spire.pdf.common import *
+from spire.pdf import *
+
+# Create an object of the PdfDocument class
+doc = PdfDocument()
+# Load a PDF file
+doc.LoadFromFile("Adobe Acrobat.pdf")
+# Initialize a counter to keep track of the number of instances
+instance_count = 0
+# Initialize a list to store the page numbers
+page_numbers = []
+# Iterate through the pages in the document
+for i in range(doc.Pages.Count):
+    page = doc.Pages[i]
+    # Create a PdfTextFinder instance
+    finder = PdfTextFinder(page)
+    # Set the text finding parameter
+    finder.Options.Parameter = TextFindParameter.WholeWord
+    # Find a specific text
+    results = finder.Find("Adobe Acrobat")
+    # Highlight all instances of the specific text
+    for text in results:
+        text.HighLight(Color.get_Yellow())
+        # Increment the instance count
+        instance_count += 1
+        # Add the page number to the list (once per page)
+        if i + 1 not in page_numbers:
+            page_numbers.append(i + 1)
+# Save the result file
+doc.SaveToFile("FindAndHighlightText.pdf")
+doc.Close()
+# Print the number of instances and the page numbers
+print(f"The text 'Adobe Acrobat' appears {instance_count} times in the PDF.")
+print(f"The text appears on the following pages: {', '.join(map(str, page_numbers))}")
+```
90
+
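If you also want to know how many matches fall on each page, the tally can be kept separate from the PDF calls. The following is a minimal sketch using only the standard library. `summarize_hits` and its input (a list of zero-based page indexes, one entry per match) are hypothetical names introduced for illustration and are not part of Spire.PDF.

```python
from collections import Counter


def summarize_hits(page_indexes):
    """Return (total matches, {1-based page number: matches on that page}).

    page_indexes holds one zero-based page index per match, for example the
    loop variable i appended once for every fragment returned by Find().
    """
    per_page = Counter(index + 1 for index in page_indexes)
    # Sort by page number so the report reads in document order
    return sum(per_page.values()), dict(sorted(per_page.items()))


# Example: two matches on page 1, one on page 3
total, pages = summarize_hits([0, 0, 2])
print(f"{total} matches on pages {', '.join(map(str, pages))}")
```

Collecting plain indexes inside the loop and summarizing afterwards keeps the reporting logic testable without a PDF file.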
91
+
92
+Find and Highlight Text in a PDF Page Area with Python
93
+------------------------------------------------------
94
+
95
+In some cases, you may need to find and highlight text within a specific area or region of a PDF page, rather than the entire page. Using the **PdfTextFinder.Options.Area** property, you can easily define the page area to search for text.
96
+
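To reason about which part of a page a search area covers, it can help to model the rectangle yourself. The sketch below assumes that the four `RectangleF` arguments are x, y, width, and height, measured in points from the top-left corner of the page; confirm this against the library documentation for your version. `SearchArea` is a hypothetical helper, not a Spire.PDF class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchArea:
    """A rectangle given as x, y, width, height (assumed unit: points)."""
    x: float
    y: float
    width: float
    height: float

    def contains(self, px: float, py: float) -> bool:
        # Inside when the point lies within both the horizontal and vertical extents
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)


# Same numbers as RectangleF(0.0, 0.0, 300.0, 300.0)
area = SearchArea(0.0, 0.0, 300.0, 300.0)
print(area.contains(100.0, 250.0))  # True
print(area.contains(350.0, 10.0))   # False
```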
97
+Here are the steps to find and highlight text in a specific PDF page area with Python:
98
+
99
+* Create an instance of the **PdfDocument** class and use the **PdfDocument.LoadFromFile()** method to load the PDF document.
100
+* Iterate through the pages in the PDF.
101
+* For each page, create a **PdfTextFinder** instance and set the page area to search for text through the **PdfTextFinder.Options.Area** property.
102
+* Use the **PdfTextFinder.Find()** method to search for the specific text in the page area.
103
+* Highlight each found instance using the **PdfTextFragment.HighLight()** method.
104
+* Save the resulting document using the **PdfDocument.SaveToFile()** method.
105
+
106
+Here is a code example of how to find and highlight text in a specific PDF page area with Python:
107
+
108
+```
+from spire.pdf.common import *
+from spire.pdf import *
+
+# Create an object of the PdfDocument class
+doc = PdfDocument()
+# Load a PDF file
+doc.LoadFromFile("Adobe Acrobat.pdf")
+# Iterate through the pages in the document
+for i in range(doc.Pages.Count):
+    page = doc.Pages[i]
+    # Create a PdfTextFinder instance
+    finder = PdfTextFinder(page)
+    # Set the page area to search for text
+    finder.Options.Area = RectangleF(0.0, 0.0, 300.0, 300.0)
+    # Find a specific text
+    results = finder.Find("Adobe Acrobat")
+    # Highlight all instances of the specific text
+    for text in results:
+        text.HighLight(Color.get_Yellow())
+# Save the resulting file
+doc.SaveToFile("FindAndHighlightTextInPageArea.pdf")
+doc.Close()
+```
131
+
132
+
133
+Find and Highlight Text in PDF using Regex with Python
134
+------------------------------------------------------
135
+
136
+Regular expressions (regex) are a powerful tool for sophisticated text searches, allowing you to match and extract information based on complex patterns and rules.
137
+
138
+To search for and highlight text in a PDF using a regular expression, first set the **PdfTextFinder.Options.Parameter** property to **TextFindParameter.Regex** to enable regex-based searching. Then pass the regular expression to the **Find()** method.
139
+
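Before running a pattern against a PDF, you may want to try it on a plain string. The sketch below uses Python's built-in `re` module on made-up sample text. It assumes the library's regex engine interprets `#`, `\w`, and `\b` the same way as `re` does, which may not hold for every pattern.

```python
import re

# A "#" followed by one or more word characters, ending at a word boundary
pattern = re.compile(r"#\w+\b")

sample = "Order #A1023 shipped; see ticket #B77."  # made-up sample text
matches = pattern.findall(sample)
print(matches)  # ['#A1023', '#B77']
```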
140
+Here are the steps to find and highlight text in a PDF using a regular expression with Python:
141
+
142
+* Create an instance of the **PdfDocument** class and use the **PdfDocument.LoadFromFile()** method to load the PDF document.
143
+* Iterate through the pages in the PDF.
144
+* For each page, create a **PdfTextFinder** instance and set the **PdfTextFinder.Options.Parameter** property to **TextFindParameter.Regex** to enable regular expression-based searching.
145
+* Pass the regular expression to the **PdfTextFinder.Find()** method to implement text searching based on regular expressions.
146
+* Use the **PdfTextFragment.HighLight()** method to highlight each matched instance.
147
+* Save the resulting document using the **PdfDocument.SaveToFile()** method.
148
+
149
+Here is a code example of how to find and highlight text in a PDF using a regular expression in Python:
150
+
151
+```
+from spire.pdf.common import *
+from spire.pdf import *
+
+# Create an object of the PdfDocument class
+doc = PdfDocument()
+# Load a PDF file
+doc.LoadFromFile("Template.pdf")
+# Iterate through the pages in the document
+for i in range(doc.Pages.Count):
+    page = doc.Pages[i]
+    # Create a PdfTextFinder instance
+    finder = PdfTextFinder(page)
+    # Set the text finding parameter to enable regex-based searching
+    finder.Options.Parameter = TextFindParameter.Regex
+    # Find the text starting with the symbol "#" (raw string keeps the backslashes literal)
+    results = finder.Find(r"\#\w+\b")
+    # Highlight all matched text
+    for text in results:
+        text.HighLight(Color.get_Yellow())
+# Save the resulting document
+doc.SaveToFile("FindAndHighlightTextUsingRegex.pdf")
+doc.Close()
+```
174
+
175
+
176
+Find Text and Get Its Coordinates in PDF with Python
177
+----------------------------------------------------
178
+
179
+You can find specific text in a PDF and retrieve the coordinates of each found text instance through the **PdfTextFragment.Positions\[\].X** and **PdfTextFragment.Positions\[\].Y** properties.
180
+
181
+Here are the steps to find text in a PDF and retrieve the coordinates of each found instance with Python:
182
+
183
+* Create an instance of the **PdfDocument** class and use the **PdfDocument.LoadFromFile()** method to load the PDF document.
184
+* Iterate through the pages in the PDF.
185
+* For each page, create a **PdfTextFinder** instance.
186
+* Use the **PdfTextFinder.Find()** method to search for the specific text.
187
+* Use the **PdfTextFragment.Positions[0].X** and **PdfTextFragment.Positions[0].Y** properties to get the X and Y coordinates of each found instance.
188
+
189
+Here is a code example of how to find text in a PDF and retrieve the coordinates of each found instance with Python:
190
+
191
+```
+from spire.pdf.common import *
+from spire.pdf import *
+
+# Create an object of the PdfDocument class
+doc = PdfDocument()
+# Load a PDF file
+doc.LoadFromFile("Adobe Acrobat.pdf")
+# Iterate through the pages in the document
+for i in range(doc.Pages.Count):
+    page = doc.Pages[i]
+    # Create a PdfTextFinder instance
+    finder = PdfTextFinder(page)
+    # Find a specific text
+    results = finder.Find("Adobe Acrobat")
+    # Print the coordinates of each found instance
+    for text in results:
+        print(f"Text Position: ({text.Positions[0].X}, {text.Positions[0].Y})")
+# Close the document once all pages have been processed
+doc.Close()
+```
210
+
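The returned coordinates are plain numbers. Assuming they are expressed in PDF points (1 point = 1/72 inch), which is the usual unit in PDF tooling but should be confirmed for your library version, a small conversion helper makes them easier to interpret. The function names below are hypothetical.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def points_to_inches(value: float) -> float:
    """Convert a length in points to inches."""
    return value / POINTS_PER_INCH


def points_to_mm(value: float) -> float:
    """Convert a length in points to millimetres."""
    return points_to_inches(value) * MM_PER_INCH


print(points_to_inches(36.0))        # 0.5
print(round(points_to_mm(72.0), 1))  # 25.4
```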
211
+
212
+Conclusion
213
+----------
214
+
215
+This article demonstrated several ways to search for and highlight text in PDF files using Python, and explained how to get the coordinates of specific text. We hope you find it helpful.
216
+
217
+Related Topics
218
+--------------
219
+
220
+* [How to Find and Replace Text in PDF with Python](https://medium.com/@alice.yang_10652/how-to-find-and-replace-text-in-pdf-with-python-9a788ed3cd9a)
221
+* [Extract PDF Tables to Text, Excel, and CSV in Python](https://medium.com/@alice.yang_10652/extract-pdf-tables-to-text-excel-and-csv-in-python-53fdbf3fad91)
222
+* [Extract Images and Image Information from PDF with Python](https://medium.com/@alice.yang_10652/extract-images-and-image-information-from-pdf-with-python-10719a3bda81)
223
+* [How to Encrypt and Decrypt PDF Files with Python](https://medium.com/@alice.yang_10652/how-to-encrypt-and-decrypt-pdf-files-with-python-124d86a70718)
224
+* [5 Ways to Compress PDF or Reduce PDF File Size with Python](https://medium.com/@alice.yang_10652/5-ways-to-compress-pdf-or-reduce-pdf-file-size-with-python-655551041982)
... ...
\ No newline at end of file