Finding text in different types of pdfs

akim · April 10, 2024, 11:50pm

I have an AppleScript in which PDFKit’s findString:withOptions: method correctly finds instances of a string in a PDF made from a Word document, but fails on a scanned document that has been saved with OCR’d text.

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions
use framework "Foundation"
use framework "PDFKit"

set DirPath to "/Users/alan/Desktop/PDF Scripts/"
set fileName to "Target.pdf"
set posixPath to DirPath & fileName
set searchString to "Alan"

set theURL to current application's class "NSURL"'s fileURLWithPath:posixPath

#	PDFDocument object represents PDF data or a PDF file and defines methods for writing, searching, and selecting PDF data.
set thePDF to current application's PDFDocument's alloc()'s initWithURL:theURL

#	findString:withOptions: synchronously finds all instances of the specified string in the document and returns them as an array of PDFSelection objects. If there are no hits, it returns an empty array.
set PDFSelectionObjects to (thePDF's findString:searchString withOptions:0) -- find matches as PDFSelections

When the file Target.pdf is a PDF saved from a Word document, the AppleScript returns an array of found instances, as expected.

When Target.pdf is instead a PDF scanned from a paper document and saved with OCR, the AppleScript fails to find any instance. This happens even though the same text can be searched for and found manually using the Search field in the upper right-hand corner of Preview.

Questions:

What might cause PDFDocument’s findString method to give different results for these two types of PDF documents?
What method could I use to find text in a scanned PDF?

peavine · April 11, 2024, 1:56am

akim. Can you post an example of the scanned PDF? Lacking that, a few comments.

The findString method seemed to work reliably for me in limited testing.

It appears that the NSLiteralSearch option is enabled by default. You may want to disable that to see if it makes a difference.

If the OCR is done by the scanner, perhaps that’s the issue. To test for this you could scan the document to an image and then do the OCR with an AppleScript or shortcut.
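A minimal sketch of the AppleScript OCR route, assuming macOS 10.15 or later (needed for the Vision framework's VNRecognizeTextRequest) and a scan saved as an image file. The handler name getText is hypothetical, and error handling is kept to a bare minimum.

```applescript
use AppleScript version "2.4"
use scripting additions
use framework "Foundation"
use framework "Vision"

-- Hypothetical handler: returns the text Vision recognizes in an image file.
on getText(imagePath)
	set imageURL to current application's class "NSURL"'s fileURLWithPath:imagePath
	set noOptions to current application's NSDictionary's dictionary()
	set requestHandler to current application's VNImageRequestHandler's alloc()'s initWithURL:imageURL options:noOptions
	set theRequest to current application's VNRecognizeTextRequest's alloc()'s init()
	requestHandler's performRequests:(current application's NSArray's arrayWithObject:theRequest) |error|:(missing value)
	set theResults to theRequest's results()
	if theResults is missing value then return ""
	set theLines to current application's NSMutableArray's new()
	repeat with anObservation in theResults
		-- Keep only the most likely reading of each recognized line.
		set bestCandidate to (anObservation's topCandidates:1)'s firstObject()
		if bestCandidate is not missing value then (theLines's addObject:(bestCandidate's |string|()))
	end repeat
	return (theLines's componentsJoinedByString:linefeed) as text
end getText

-- Example use: pick a scanned image and check its OCR text for the search string.
-- set ocrText to getText(POSIX path of (choose file of type {"public.image"}))
-- ocrText contains "Alan"
```

If the search string turns up in this text but not in the scanner's PDF, that would point to the scanner's OCR layer rather than to findString.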

You may also want to view the strings returned by both documents using PDFDocument’s string property. Perhaps that will show something.

use scripting additions
use framework "Foundation"
use framework "PDFKit"

set posixPath to POSIX path of (choose file)
set theURL to current application's class "NSURL"'s fileURLWithPath:posixPath
set thePDF to current application's PDFDocument's alloc()'s initWithURL:theURL
set theText to thePDF's |string|()

akim · April 11, 2024, 2:27am

Peavine, thanks for your help. For the PDF that fails my findString call, the string property returns a large block of whitespace with no visible text whatsoever.

set thePDF to current application's PDFDocument's alloc()'s initWithURL:theURL
set theText to thePDF's |string|()
--> (NSString) " 
 
 
 "

For reasons that I do not understand, although the text is displayed and searchable in the Preview app, it is invisible to my AppleScript PDFKit script.
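To narrow down where the text goes missing, one option is to run the same |string|() check page by page. The sketch below assumes nothing beyond PDFKit and Foundation. The handler names isBlank and blankPages are hypothetical.

```applescript
use AppleScript version "2.4"
use scripting additions
use framework "Foundation"
use framework "PDFKit"

-- Hypothetical helper: true when the text is empty or only whitespace and newlines.
on isBlank(theText)
	set theString to current application's NSString's stringWithString:theText
	set whiteSpace to current application's NSCharacterSet's whitespaceAndNewlineCharacterSet()
	return ((theString's stringByTrimmingCharactersInSet:whiteSpace)'s |length|()) is 0
end isBlank

-- Hypothetical helper: 1-based numbers of the pages that yield no extractable text.
on blankPages(posixPath)
	set theURL to current application's class "NSURL"'s fileURLWithPath:posixPath
	set thePDF to current application's PDFDocument's alloc()'s initWithURL:theURL
	if thePDF is missing value then error "Could not open " & posixPath
	set blankList to {}
	repeat with i from 0 to ((thePDF's pageCount()) - 1)
		set pageText to (thePDF's pageAtIndex:i)'s |string|()
		if pageText is missing value or my isBlank(pageText) then set end of blankList to (i + 1)
	end repeat
	return blankList
end blankPages

-- Example use:
-- blankPages(POSIX path of (choose file of type {"com.adobe.pdf"}))
```

If every page is listed, PDFKit is extracting no usable text from the document at all, which would explain why findString has nothing to match.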

I will attempt to OCR the PDF document via AppleScript or a shortcut and compare the results.

If NSLiteralSearch option is enabled by default, how do I write an option to the findString function that disables it?

peavine · April 11, 2024, 12:57pm

akim. The option for NSLiteralSearch is 2. You may also want to consider setting NSCaseInsensitiveSearch, which is option 1. Of course, together they are option 3.
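As a rough sketch of how these option values could be passed to findString, the handler below counts matches under a given option value. The handler name countMatches and the two property names are hypothetical. The numeric values are the ones given in this post.

```applescript
use AppleScript version "2.4"
use scripting additions
use framework "Foundation"
use framework "PDFKit"

-- NSString compare option values, as given in the post above.
property caseInsensitiveOption : 1 -- NSCaseInsensitiveSearch
property literalOption : 2 -- NSLiteralSearch

-- Hypothetical handler: number of matches for searchString under the given options.
on countMatches(posixPath, searchString, searchOptions)
	set theURL to current application's class "NSURL"'s fileURLWithPath:posixPath
	set thePDF to current application's PDFDocument's alloc()'s initWithURL:theURL
	if thePDF is missing value then error "Could not open " & posixPath
	set theMatches to thePDF's findString:searchString withOptions:searchOptions
	return (theMatches's |count|()) as integer
end countMatches

-- Example use: compare the default search with a case-insensitive one.
-- set pdfPath to POSIX path of (choose file of type {"com.adobe.pdf"})
-- {countMatches(pdfPath, "alan", 0), countMatches(pdfPath, "alan", caseInsensitiveOption), countMatches(pdfPath, "alan", caseInsensitiveOption + literalOption)}
```

Comparing the counts for the same search string under options 0, 1, and 3 should show whether the options make any difference for a given PDF.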

BTW, I would normally expect that specifying NSLiteralSearch is what enables this option. However, the documentation (here) states the following:

Search and comparison are currently performed as if the NSLiteralSearch option were specified.