ASObjC replacement for read command

peavine · November 17, 2022, 1:39pm

I wrote a rudimentary PDF object viewer for some research I’m doing. The following works, but I’d prefer to replace the read command with ASObjC. I tried NSString’s stringWithContentsOfFile, but every encoding I tested returned a text-encoding error. Is there a simple ASObjC method that will replace the read command in my script? Thanks

use framework "Foundation"
use scripting additions

tell application "Finder" to set pdfFile to selection as alias -- a Finder selection
set theString to read pdfFile
set theString to current application's NSString's stringWithString:theString
set thePattern to "\\d{1,5} \\d obj((.|\\n)*?)(endobj|>>)"
set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value) -- this and following from Shane
set regexResults to theRegex's matchesInString:theString options:0 range:{location:0, |length|:theString's |length|()}
set theRanges to (regexResults's valueForKey:"range")
set theMatches to current application's NSMutableArray's new()
repeat with aRange in theRanges
	(theMatches's addObject:(theString's substringWithRange:aRange))
end repeat
set theMatches to (theMatches's sortedArrayUsingSelector:"localizedStandardCompare:")
set theString to theMatches's componentsJoinedByString:(linefeed & linefeed)
return (theString as text) & linefeed
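
One possible ASObjC stand-in for the read command is to load the file's raw bytes with NSData and decode them with a single-byte encoding. The sketch below assumes NSMacOSRomanStringEncoding, on the guess that it is closest to what read does by default on a Western-language system; NSISOLatin1StringEncoding is another candidate that accepts any byte value. Multi-byte encodings such as UTF-8 tend to fail on a PDF because its compressed streams are not valid text in those encodings.

```applescript
use framework "Foundation"
use scripting additions

set pdfFile to choose file of type {"pdf"}
set thePath to POSIX path of pdfFile

-- Load the raw bytes of the file.
set theData to current application's NSData's dataWithContentsOfFile:thePath
if theData is missing value then error "The file could not be read."

-- Decode with a single-byte encoding (an assumption; see the note above).
set theString to current application's NSString's alloc()'s initWithData:theData encoding:(current application's NSMacOSRomanStringEncoding)
if theString is missing value then error "The file could not be decoded."

-- theString is already an NSString, so it can be passed straight to the regex code.
return theString's |length|()
```

Because the result is already an NSString, the stringWithString: conversion in the script above would no longer be needed.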

Mark_FX · November 17, 2022, 8:53pm

It’s best to use the “PDFKit” framework for working with PDF documents and data.
With ASOC you will have to use the “Quartz” framework which contains the “PDFKit” framework.
With Swift you can import the “PDFKit” framework directly.

https://developer.apple.com/documentation/pdfkit

The “PDFKit” allows you to create a “PDFDocument” class object from an “NSURL” or “NSData”.
And from that created “PDFDocument” object, you can access the text or images contained in the document.
It also allows you to create “PDFPage” class objects, that represent the different pages within the document.
And you can then access the string or images contained within the different pages.
Possibly by iterating over the different pages within a repeat loop, based on the documents page count.

Here’s a couple of very simple examples of creating the “PDFDocument” class, and accessing the string of the whole document, or the string of the first page in the document.


use scripting additions
use framework "Quartz"
use framework "Foundation"

on run {}
	try
		set theFilePath to POSIX path of (choose file of type {"pdf"})
	on error
		return -- User Canceled
	end try
	set theFileURL to current application's NSURL's fileURLWithPath:theFilePath
	set thePDFDocument to current application's PDFDocument's alloc()'s initWithURL:theFileURL
	set thePDFDocumentString to thePDFDocument's |string|() as text
end run

use scripting additions
use framework "Quartz"
use framework "Foundation"

on run {}
	try
		set theFilePath to POSIX path of (choose file of type {"pdf"})
	on error
		return -- User Canceled
	end try
	set theFileURL to current application's NSURL's fileURLWithPath:theFilePath
	set thePDFDocument to current application's PDFDocument's alloc()'s initWithURL:theFileURL
	set thePDFDocPageCount to thePDFDocument's pageCount() as integer -- Num of Pages in PDF
	set thePDFDocFirstPage to thePDFDocument's pageAtIndex:1 -- Returns a PDFPage class object
	set thePDFDocFirstPageString to thePDFDocFirstPage's |string|() as text
end run
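
The page-by-page idea mentioned above (a repeat loop driven by the document's page count) might be sketched roughly as follows. Note that PDFKit's pageAtIndex: is zero-based, so the loop runs from 0 to the page count minus 1. The variable names are only illustrative.

```applescript
use framework "Foundation"
use framework "Quartz"
use scripting additions

set theFilePath to POSIX path of (choose file of type {"pdf"})
set theFileURL to current application's NSURL's fileURLWithPath:theFilePath
set thePDFDocument to current application's PDFDocument's alloc()'s initWithURL:theFileURL
if thePDFDocument is missing value then error "The file could not be opened as a PDF."

set pageStrings to {} -- one text item per page
set thePageCount to thePDFDocument's pageCount() as integer
repeat with i from 0 to (thePageCount - 1) -- page indexes start at 0
	set thePage to (thePDFDocument's pageAtIndex:i)
	set pageText to thePage's |string|()
	if pageText is missing value then
		set end of pageStrings to "" -- page has no extractable text
	else
		set end of pageStrings to (pageText as text)
	end if
end repeat
return pageStrings
```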



It's a very powerful framework for working with PDF files, and the documentation is worth reading to come up with more complete solutions than my very basic examples.

Regards Mark

peavine · November 17, 2022, 9:33pm

Mark. Thanks for looking at my script and for the suggestions, both of which work well.

akim · December 5, 2022, 6:56pm

I am also looking for a method to read text from a pdf page.
Although the above examples appeared to work for Peavine and MarkFX, they did not produce any readable text when I ran them.

Peavine’s example yielded:

1 0 obj
<< /Type /Page /Parent 2 0 R /Resources 4 0 R /Contents 3 0 R /MediaBox [0 0 609 790]
/CropBox [0 0 607 790] /Rotate 0 >>

2 0 obj
<< /Type /Pages /MediaBox [0 0 612 792] /Count 1 /Kids [ 1 0 R ] >>

3 0 obj
<< /Filter /FlateDecode /Length 69 >>

4 0 obj
<< /ProcSet [ /PDF /ImageB /ImageC /ImageI ] /XObject << /Im1 5 0 R >>

5 0 obj
<< /Type /XObject /Subtype /Image /Width 1687 /Height 2199 /Interpolate true
/ColorSpace 6 0 R /BitsPerComponent 8 /Length 263261 /Filter /DCTDecode >>

6 0 obj
[ /ICCBased 7 0 R ]
endobj

7 0 obj
<< /N 1 /Alternate /DeviceGray /Length 3385 /Filter /FlateDecode >>

8 0 obj
<< /Type /Catalog /Pages 2 0 R >>

9 0 obj
<< /Producer (macOS Version 13.0 (Build 22A380) Quartz PDFContext) /CreationDate
(D:20221101205514Z00'00') /ModDate (D:20221101205514Z00'00') >>

MarkFX’s first example yielded an empty string.
MarkFX’s second example yielded an execution error, which was resolved for this one-page document by changing the index to 0:


set thePDFDocFirstPage to thePDFDocument's pageAtIndex:0 -- Returns a PDFPage class object

In any regard, as my goal is to read the text of a pdf, what script needs to be added to capture a pdf’s text?

peavine · December 5, 2022, 9:26pm

akim. It’s not the purpose of my script to read text from a PDF–it won’t do what you want.

I tested Mark’s first suggestion on Shane’s ASObjC book (it’s a PDF) and it returned the text of the entire document. Are you sure the PDF you are using contains text? You can check this by attempting to select text in the PDF. Mark’s first script contains an errant AppleScript tag, and perhaps that’s the issue. Please try the following:

use scripting additions
use framework "Quartz"
use framework "Foundation"

set theFilePath to POSIX path of (choose file of type {"pdf"})
set theFileURL to current application's NSURL's fileURLWithPath:theFilePath
set thePDFDocument to current application's PDFDocument's alloc()'s initWithURL:theFileURL
set thePDFDocumentString to thePDFDocument's |string|() as text

Mark_FX · December 6, 2022, 11:11am

My mistake in my rushed basic example code above.
This line for the first page.


set thePDFDocFirstPage to thePDFDocument's pageAtIndex:1 -- Returns a PDFPage class object

Should have been.


set thePDFDocFirstPage to thePDFDocument's pageAtIndex:0 -- Returns a PDFPage class object

The page index is zero-based, so the first page has an index of 0.

@akim
What you have to understand about PDF documents is that their content can be made up of different types of media.
For example, you could create an image in a painting program that contains only typed text, save it as a PNG image file, and then embed this image in a PDF document.
To a reader of this PDF document, this would appear to be text, but in fact it is an image.
So the example above would not retrieve a string, but a media object.

So what you have to do is read the PDF document pages into an NSData object, and then query the type of data contained.
The above examples were only quick-and-dirty examples to get peavine started.
You will have to read the Apple developer documentation for a more comprehensive understanding of how you could use the PDFKit framework for other operations.
I’m on a Windows work computer at the moment, so no Xcode is available to me to show other uses of the PDFDocument class, but maybe at another time.
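
As a rough first check for the image-only case described above, one could test whether the document's string is empty after trimming whitespace. This is only a sketch of that one check, not the NSData approach; an empty result suggests the pages hold images and would need OCR rather than PDFKit's string method.

```applescript
use framework "Foundation"
use framework "Quartz"
use scripting additions

set theFilePath to POSIX path of (choose file of type {"pdf"})
set theFileURL to current application's NSURL's fileURLWithPath:theFilePath
set thePDFDocument to current application's PDFDocument's alloc()'s initWithURL:theFileURL
if thePDFDocument is missing value then error "The file could not be opened as a PDF."

set docString to thePDFDocument's |string|()
if docString is missing value then
	set hasText to false
else
	-- Ignore spaces and line breaks when deciding whether any text exists.
	set whitespaceSet to current application's NSCharacterSet's whitespaceAndNewlineCharacterSet()
	set trimmedString to (docString's stringByTrimmingCharactersInSet:whitespaceSet)
	set hasText to ((trimmedString's |length|()) > 0)
end if

if hasText then
	display dialog "This PDF appears to have a text layer."
else
	display dialog "No text layer was found; the pages are probably images."
end if
```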

Regards Mark

akim · December 6, 2022, 5:32pm

Peavine and Mark, thanks for your help.

The PDF I chose appears to contain text that is not part of a graphic object and that can be selected and copied, unless Preview is interpreting the PDF’s image data as text. I will research the idea of reading the PDF contents as NSData, as I am unfamiliar with this approach.

So far however, I have found that using the Edit Menu command in Preview to Select All, I was able to capture the pdf’s text. Even though the text was not often continuous in the pdf, I was able to select all the text scattered through the page and then copy it to the clipboard.

The following script using System Events also captured the text, but for reasons that I cannot comprehend, only inconsistently copied the pdf’s text to the clipboard.

set theFilePath to POSIX path of (choose file of type {"pdf"})
do shell script "open " & quoted form of theFilePath
tell application "System Events"
	tell process "Preview"
		repeat until exists window 1
			delay 0.2
		end repeat
		tell menu "Edit" of menu bar 1
			click menu item "Select All"
			delay 0.5
			click menu item "Copy"
		end tell
	end tell
end tell

Mark, referring to your idea of returning data from a pdf to read as text, do the 9 objects returned in Peavine’s example

set theString to read pdfFile

provide any clue on revealing pdf data that might be interpreted as text?

akim · December 8, 2022, 7:36am

I found a solution using Apple’s Shortcuts app by VikingOSX at https://discussions.apple.com/thread/253841866. As I do not know how to upload a picture of the Shortcut, I wrote the Shortcut commands as follows:

Calling the shortcut via a shell script yields a text document containing the text of the PDF document.

	set ShortCutSh to "shortcuts run 'Extract Image Text' -i " & quoted form of PDFFilePSX & " -o " & quoted form of OutputTextFile
	do shell script ShortCutSh

Using AppleScript, is there a method to write this Shortcuts’ commands using Objective C’s frameworks?

akim · December 8, 2022, 5:46pm

Fredrik71, I am happy running the Shortcuts script from an AppleScript calling a shell script. I was interested in substituting the Shortcuts script with an Objective C Framework script. It appears that the Vision framework needed to replace this Shortcuts script is not available in Objective C.

akim · December 9, 2022, 3:48am

Fredrik71, Thanks for alerting me to an AppleScript method for running the Shortcuts application. I have attempted to reconstruct the Shortcuts shell script in plain AppleScript.


set PDFFilePSX to POSIX path of (choose file of type {"pdf"})
tell application "Shortcuts Events"
	run the shortcut named "Extract Image Text" with input PDFFilePSX
end tell

The Shortcuts dictionary does not list a way to designate a location for writing the script’s result or output. Although the Shortcuts command line accepts -o or --output-path to identify a location to write its output, the Shortcuts AppleScript dictionary lists no such property.

Do you have any suggestions on how to list the output for an AppleScript Shortcut?
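
One thing that may be worth trying is to capture the value returned by the run command, on the assumption that the shortcut's final action produces the extracted text as its output. The variable name shortcutResult is only illustrative, and whether the shortcut accepts a POSIX path string as its input depends on how the shortcut itself is built.

```applescript
use scripting additions

set PDFFilePSX to POSIX path of (choose file of type {"pdf"})
tell application "Shortcuts Events"
	-- Assumption: the run command hands back whatever the shortcut outputs.
	set shortcutResult to run the shortcut named "Extract Image Text" with input PDFFilePSX
end tell
return shortcutResult
```

If the result comes back as expected, it could then be written to a file of your choosing with AppleScript's own file commands.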

akim · December 9, 2022, 10:00pm

Peavine, I appreciate your idea for using the clipboard as a transitional location. Previously, when I attempted to copy a pdf’s selected text, from Preview’s visible display of text, to the clipboard, the latter frequently failed in its ability to copy the selected text.
Although I was not sure why the clipboard copy function failed, I found that I could correct it with a shell command that kills the pasteboard process.

do shell script "killall pboard"

As my goal is to copy text from many pdf pages in a loop, I do not want to risk the failure of this copying activity or of disrupting the loop. As such, I am fearful of relying on the clipboard’s copy function.
I found the following shell script more successful.

set ShortCutSh to "shortcuts run 'Extract Image Text' -i " & quoted form of PDFFilePSX & " -o " & quoted form of OutputTextFile
do shell script ShortCutSh

The shell command line, running this Shortcuts app, in a repeat loop copying at least 60 pdfs, has not failed so far.
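
A loop of that kind might look roughly like the sketch below. It assumes a shortcut named "Extract Image Text" already exists, and it uses only the -i and -o options shown above; the output naming scheme (same base name with a .txt extension) is just an illustration.

```applescript
use scripting additions

set pdfFiles to choose file of type {"pdf"} with multiple selections allowed
set outputFolder to POSIX path of (choose folder with prompt "Save the text files in:")

repeat with aFile in pdfFiles
	set PDFFilePSX to POSIX path of (contents of aFile)
	-- Build an output path from the PDF's base name (hypothetical naming scheme).
	set baseName to do shell script "basename " & quoted form of PDFFilePSX & " .pdf"
	set OutputTextFile to outputFolder & baseName & ".txt"
	set ShortCutSh to "shortcuts run 'Extract Image Text' -i " & quoted form of PDFFilePSX & " -o " & quoted form of OutputTextFile
	try
		do shell script ShortCutSh
	on error errMsg
		-- Log the failure and carry on so one bad file does not stop the loop.
		log "Failed on " & PDFFilePSX & ": " & errMsg
	end try
end repeat
```

Wrapping each call in a try block keeps one failed PDF from interrupting the rest of the batch.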

peavine · March 20, 2023, 1:43pm

I revised my script in post 1 and thought I would post it here FWIW. The purpose of this script is to read and display objects contained in a PDF, although it’s a pretty rudimentary tool at best.

-- revised 2023.03.20

use framework "Foundation"
use scripting additions

on main()
	set pdfFile to (choose file of type {"pdf"})
	try
		set objectData to read pdfFile
		set objectData to getObjectData(objectData)
	on error
		display alert "An error has occurred" message "Object data was not found in the selected file"
		error number -128
	end try
	makeNote(pdfFile, objectData)
end main

on getObjectData(theString) -- a minor rewrite of a handler by Shane Stanley
	set theString to current application's NSString's stringWithString:theString
	set thePattern to "\\d{1,5} \\d obj((.|\\n)*?)(endobj|>>)"
	set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
	set regexResults to theRegex's matchesInString:theString options:0 range:{location:0, |length|:theString's |length|()}
	set theRanges to (regexResults's valueForKey:"range")
	set theMatches to current application's NSMutableArray's new()
	repeat with aRange in theRanges
		(theMatches's addObject:(theString's substringWithRange:aRange))
	end repeat
	set theMatches to (theMatches's sortedArrayUsingSelector:"localizedStandardCompare:")
	set theString to theMatches's componentsJoinedByString:(linefeed & linefeed)
	return (theString as text) & linefeed
end getObjectData

on makeNote(pdfFile, theData)
	set theData to "Source PDF File" & linefeed & POSIX path of pdfFile & linefeed & linefeed & theData
	tell application "TextEdit"
		activate
		if (count of documents) is 0 or (text of front document) is not "" then make new document
		set text of front document to theData
		set font of text of front document to "Menlo"
		set size of text of front document to 14
	end tell
end makeNote

main()

peavine · May 23, 2023, 1:23pm

I was curious about Mark’s above statement, because other forum members occasionally use the second of the following statements at the top of their scripts instead of the first. In his book Shane always uses a Quartz statement and that’s what I’ve done. Does it make any difference which I use to manipulate PDF documents? Thanks.

use framework "Quartz"
use framework "PDFKit"

Mark_FX · May 23, 2023, 4:10pm

It doesn’t matter which framework you use, as the ‘Quartz’ framework also contains the ‘PDFKit’.

https://developer.apple.com/documentation/quartz

I think in the earlier Mac OS X versions, the ‘PDFKit’ framework was not yet a standalone framework, so you had to use the Quartz framework because there was no choice.

Also, the standalone ‘PDFKit’ framework is a smaller package, so it doesn’t have the large overhead of the bigger ‘Quartz’ package. If you don’t need the other features of ‘Quartz’, use the ‘PDFKit’ framework.
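
As a minimal sketch, assuming a macOS version where PDFKit can be loaded on its own, the earlier example only needs its use statement changed; the PDFDocument calls stay the same.

```applescript
use framework "Foundation"
use framework "PDFKit" -- instead of use framework "Quartz"
use scripting additions

set theFilePath to POSIX path of (choose file of type {"pdf"})
set theFileURL to current application's NSURL's fileURLWithPath:theFilePath
set thePDFDocument to current application's PDFDocument's alloc()'s initWithURL:theFileURL
if thePDFDocument is missing value then error "The file could not be opened as a PDF."

-- Same PDFDocument methods as before, loaded through the smaller framework.
return thePDFDocument's pageCount() as integer
```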

Regards Mark

peavine · May 23, 2023, 11:06pm

Thanks Mark. I’ll use PDFKit in the future.