PDFKit findString using regex

Hi,

I’m new to this lark and struggling a little. I can successfully search a PDF for a specific string, but I can’t get the regex option to work. The documentation implies that if I set the regex option in “withOptions”, the search string is treated as a regular expression.

But I get nothing.

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

# Test PDF
# https://docs.oracle.com/en/database/oracle/oracle-database/19/cncpt/database-concepts.pdf

on searchPDFnoRegex:posixPath forString:searchString
        set theURL to current application's class "NSURL"'s fileURLWithPath:posixPath
        set thePDF to current application's PDFDocument's alloc()'s initWithURL:theURL
        set theMatches to (thePDF's findString:searchString withOptions:0)
        log ("Number of Hits " & (count of theMatches))
end searchPDFnoRegex:forString:

on searchPDFwithRegex:posixPath forString:searchString
        set theURL to current application's class "NSURL"'s fileURLWithPath:posixPath
        set thePDF to current application's PDFDocument's alloc()'s initWithURL:theURL
        set theMatches to (thePDF's findString:searchString withOptions:((current application's NSRegularExpressionSearch)))
        log ("Number of Hits " & (count of theMatches))
end searchPDFwithRegex:forString:

set theFile to "/Users/home/database-concepts.pdf"

log ("Regex option not set")
my searchPDFnoRegex:theFile forString:"Oracle"
#Reports 2698 hits

log ("Regex option set, but no wildcards")
my searchPDFwithRegex:theFile forString:"Oracle"
#Reports 2698 hits

log ("Regex option set, with wildcards")
my searchPDFwithRegex:theFile forString:"Oracl."
#Reports 0 hits

Please could some kind soul suggest what I might be doing wrong?

Thanks

Spooks. I’ve quoted below from the documentation, which would lead me to believe that you cannot do a regex search with the findString method. You indicate that the documentation implies otherwise, but I couldn’t find that.

NSRegularExpressionSearch
The search string is treated as an ICU-compatible regular expression. If set, no other options can apply except NSCaseInsensitiveSearch and NSAnchoredSearch. You can use this option only with the rangeOfString:… methods and stringByReplacingOccurrencesOfString:withString:options:range:.

You may already know this but you can get a count of regex matches in a PDF as follows:

use framework "Foundation"
use framework "PDFKit"
use scripting additions

set theFile to POSIX path of (choose file of type {"pdf"})
set theFile to (current application's |NSURL|'s fileURLWithPath:theFile)
set theDoc to (current application's PDFDocument's alloc()'s initWithURL:theFile)
set theString to theDoc's |string|()
set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:"Oracl." options:0 |error|:(missing value)
set matchCount to theRegex's numberOfMatchesInString:theString options:0 range:{location:0, |length|:theString's |length|()} --> 2698

BTW, you didn’t include a use statement for the PDFKit framework in your script, but including it didn’t make a difference in my testing.

Just on a proof-of-concept basis, I rewrote my script to return the number of matches in each page of a PDF. I checked the results against the Preview app, and they were the same. The timing result when searching for the string “list” in Shane’s ASObjC book (which contains 159 pages) was 272 milliseconds. In actual use, the regex pattern might be changed to specify word boundaries or other refinements.

use framework "Foundation"
use framework "PDFKit"
use scripting additions

set thePattern to "(?i)list" -- case insensitive flag set
set theFile to POSIX path of (choose file of type {"pdf"})
set theURL to (current application's |NSURL|'s fileURLWithPath:theFile)
set theDoc to (current application's PDFDocument's alloc()'s initWithURL:theURL)
set thePageCount to theDoc's pageCount()
set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
set theArray to current application's NSMutableArray's arrayWithArray:{"PAGE - MATCHES"}
repeat with i from 1 to thePageCount
	set aPage to (theDoc's pageAtIndex:(i - 1))
	set aString to aPage's |string|()
	set matchCount to (theRegex's numberOfMatchesInString:aString options:0 range:{location:0, |length|:aString's |length|()})
	if matchCount > 0 then
		set pageAndMatches to current application's NSString's stringWithFormat_("%@ - %@", i, matchCount)
		(theArray's addObject:pageAndMatches)
	end if
end repeat
set theString to (theArray's componentsJoinedByString:linefeed) as text
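
As a small illustration of the word-boundary refinement mentioned above, here is a minimal sketch that runs against a literal string rather than a PDF. The sample text is made up for the example; the pattern uses the ICU \b word-boundary marker, so "lists" and "enlisted" should not be counted.

```applescript
use framework "Foundation"
use scripting additions

-- Hypothetical sample text; only the standalone words "list" and "List" should match
set theString to current application's NSString's stringWithString:"A list of lists, enlisted; List."
set thePattern to "(?i)\\blist\\b" -- (?i) = case insensitive, \b = word boundary
set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
set matchCount to theRegex's numberOfMatchesInString:theString options:0 range:{location:0, |length|:theString's |length|()}
return matchCount --> 2
```

The same pattern could be dropped into the per-page script in place of "(?i)list".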

Thank you @peavine for spending time looking at this. I really appreciate it, I’m sure you’ve got better things to do.
My initial code was also just a proof of concept, and a failed one at that. The bigger picture is:

I have a large number of PDFs to process. Each document has a unique document ID. The content of each document will often reference other documents by documentID. Unfortunately I can’t share these documents with you but I can create a mock up.

So within document 123-456-789 there may be references to 234-567-890 and 345-678-901. At the moment each of these references includes an annotation such as “h__ps://old-service-provider.com/docid=234-567-890”

(1) I need to search a PDF for documentID references (using regex) ideally returning a PDFKit selection, and the matching documentID (e.g. 234-567-890).

(2) On finding the documentID I make an API call to another system requesting it return a new URL for the documentID (e.g. h__ps://new-service-provider.com/docid=234-567-890)

(3) Then using a modified version of Shane’s makeLinksInPDF code create new annotations with the new URL and the PDF Selection. Ideally the existing annotation would be removed as well.
Adding a hyperlink to a text string in existing PDF?

(4) Repeat 1-3 for each documentID found in the document.

My current solution is a hack.

(1) for step 1 I call pdfgrep with the “-o” option. This outputs all of the document IDs found by regex in a PDF which I then store in a list of records.

(2) I loop through the list and for each document ID I call the API to generate a new URL for that document and I add it to the record for that document ID.

(3) I loop through the list again calling Shane’s makeLinksInPDF with the documentID and URL from each record.

I would prefer to avoid using pdfgrep if I can, and to bring all of these steps into one nice loop that runs against each PDF. I would also prefer to remove existing annotations that are within the bounds of the new annotation.

So to resolve step 1
Challenge A - perform a regex search on a PDF. [Failed! But you’ve provided a solution, thanks!]
Challenge B - Adjust solution from A to return “PDF selections”
Challenge C - Adjust solution from B to return the “matched” value as well (e.g. 234-567-890)

I’m not asking you to write the solution; I’m trying to resolve an issue and learn more about AppleScript and, where necessary, the ObjC interaction with AppleScript as I go. But I have to admit I’m struggling with the ObjC part. So if there are any rabbit holes you can point out before I go down them, that would be appreciated. For example, I totally misread the limitations in Apple’s statement about PDFKit supporting regex.

Thanks

Spooky

@Spooks. Thanks for the detailed explanation. Part of what you are doing is beyond my knowledge level, but I’ve included below two scripts that might be helpful as regards the above.

The first script simply does a regex search and creates an array (coerced to a list for testing purposes) of the annotations. The regex pattern is easily modified to return the document IDs:

use framework "Foundation"
use framework "PDFKit"
use scripting additions

set theString to "This is some text h__ps://old-service-provider.com/docid=234-567-890 and some more text h__ps://old-service-provider.com/docid=389-003-444 and some more text h__ps://old-service-provider.com/docid=786-811-343 and ending text"
set theString to current application's NSString's stringWithString:theString

set thePattern to "h__.*?(\\d{3}-\\d{3}-\\d{3})"
set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
set regexResults to theRegex's matchesInString:theString options:0 range:{location:0, |length|:theString's |length|()}
set theRanges to (regexResults's valueForKey:"range")
set theMatches to current application's NSMutableArray's new()
repeat with aRange in theRanges
	(theMatches's addObject:(theString's substringWithRange:aRange))
end repeat
return theMatches as list --> {"h__ps://old-service-provider.com/docid=234-567-890", "h__ps://old-service-provider.com/docid=389-003-444", "h__ps://old-service-provider.com/docid=786-811-343"}

The second script goes one step further and creates two arrays: one with the annotations and one with the document IDs. It uses a regex capture group for brevity.

use framework "Foundation"
use framework "PDFKit"
use scripting additions

set theString to "This is some text h__ps://old-service-provider.com/docid=234-567-890 and some more text h__ps://old-service-provider.com/docid=389-003-444 and some more text h__ps://old-service-provider.com/docid=786-811-343 and ending text"
set theString to current application's NSString's stringWithString:theString

set thePattern to "h__.*?(\\d{3}-\\d{3}-\\d{3})"
set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
set regexResults to theRegex's matchesInString:theString options:0 range:{location:0, |length|:theString's |length|()}
set theAnnotations to current application's NSMutableArray's new()
set theIDs to current application's NSMutableArray's new()
repeat with aMatch in regexResults
	set aRange to aMatch's range()
	(theAnnotations's addObject:(theString's substringWithRange:aRange))
	set aRange to (aMatch's rangeAtIndex:1)
	(theIDs's addObject:(theString's substringWithRange:aRange))
end repeat

theAnnotations as list --> {"h__ps://old-service-provider.com/docid=234-567-890", "h__ps://old-service-provider.com/docid=389-003-444", "h__ps://old-service-provider.com/docid=786-811-343"}

theIDs as list --> {"234-567-890", "389-003-444", "786-811-343"}

Two comments. The output of the second script consists of two arrays, but this could be changed to whatever is needed. Both of the above scripts are based, in part, on examples in Shane’s ASObjC book.
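
For Challenges B and C, the per-page loop and the capture-group approach above might be combined roughly as follows. This is only a sketch, not tested against the real documents. It assumes that the document IDs appear in the page text in the form 234-567-890, and that the character ranges reported by the regex line up with what PDFPage's selectionForRange: expects (that method takes a range relative to the page's string). The pattern and the variable names are illustrative.

```applescript
use framework "Foundation"
use framework "PDFKit"
use scripting additions

-- Assumed ID format (three groups of three digits); adjust the pattern to suit
set thePattern to "\\b(\\d{3}-\\d{3}-\\d{3})\\b"
set theFile to POSIX path of (choose file of type {"pdf"})
set theURL to current application's |NSURL|'s fileURLWithPath:theFile
set theDoc to current application's PDFDocument's alloc()'s initWithURL:theURL
set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)

set theSelections to current application's NSMutableArray's new() -- PDFSelection objects
set theIDs to current application's NSMutableArray's new() -- matched ID strings, same order
repeat with i from 0 to ((theDoc's pageCount()) - 1)
	set aPage to (theDoc's pageAtIndex:i)
	set aString to aPage's |string|()
	if aString is not missing value then -- skip pages with no text
		set theMatches to (theRegex's matchesInString:aString options:0 range:{location:0, |length|:aString's |length|()})
		repeat with aMatch in theMatches
			-- the match range is relative to this page's string
			set aSelection to (aPage's selectionForRange:(aMatch's range()))
			if aSelection is not missing value then
				(theSelections's addObject:aSelection)
				(theIDs's addObject:(aString's substringWithRange:(aMatch's rangeAtIndex:1)))
			end if
		end repeat
	end if
end repeat
return {theSelections's |count|(), theIDs as list}
```

Each item in theSelections should then be usable wherever a PDFSelection is expected, for example with boundsForPage: to get the rectangle for a new link annotation, and the item at the same index in theIDs gives the document ID to pass to the API call.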