Extracting info with Text Item Delimiters.... [PDF]

Kyle_Jones · August 9, 2007, 4:15pm

Hi All,

I’ve been reading a bit about Text Item Delimiters on the forum, as I’m really new to them. I thought they might help me with a small issue I have, but I can’t figure out how to use them in this scenario. A lot of what I’ve read about them already has the text to amend inside the AppleScript.
I don’t know if this is possible or whether I’m looking at the wrong type of AppleScript, but I want my script to read through a PDF file, find the paths of linked images and set them as variables so I can automatically collect them later…
I will be using different PDFs with different linked images; the only thing in common is that all the images are held on the same server.
So…
Inside the PDF, the start of the linked file line would be: ‘%%DocumentFiles:/Volumes/MyServer/etc/etc’. I would like to take this line from the PDF and then have my AppleScript collect the linked image (which is the easy part :D). I’ve just got no idea how to get this variable line from my PDF.

I’m just pretty confused and don’t know if this is the right way to go about it. A point in the right direction would be a great help…

Thanks…

Adam_Bell · August 9, 2007, 8:27pm

This is easy to do if and only if a file is readable as plain or unicode text. A PDF file is not, and Adobe Reader is not scriptable so you have no way of applying TIDs. I might have a workaround and I’ll post back.
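
For the case where the text is already readable, a minimal sketch of the TID step might look like the following. The handler name pathsAfterMarker and the sample text are made up for illustration; only the %%DocumentFiles: marker and the example path come from the original question. It assumes AppleScript 2.0 or later for the linefeed constant.

```applescript
-- Sample text standing in for the readable contents of a file (hypothetical data).
set sampleText to "%%Creator: SomeApp" & linefeed & ¬
	"%%DocumentFiles:/Volumes/MyServer/etc/etc" & linefeed & ¬
	"%%EndComments"

-- Return whatever follows theMarker on every line that starts with it.
on pathsAfterMarker(theText, theMarker)
	set foundPaths to {}
	set savedTIDs to AppleScript's text item delimiters
	set AppleScript's text item delimiters to theMarker
	repeat with aLine in paragraphs of theText
		set lineText to contents of aLine
		if lineText starts with theMarker then
			-- Text item 1 is the empty string before the marker;
			-- items 2 thru -1 are everything after it.
			set end of foundPaths to (text items 2 thru -1 of lineText) as text
		end if
	end repeat
	-- Always restore the delimiters so later code is not affected.
	set AppleScript's text item delimiters to savedTIDs
	return foundPaths
end pathsAfterMarker

pathsAfterMarker(sampleText, "%%DocumentFiles:")
-- result: {"/Volumes/MyServer/etc/etc"}
```

Saving and restoring the delimiters around the split is a common habit rather than a requirement; it just keeps the change from leaking into the rest of the script.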

Adam_Bell · August 9, 2007, 8:59pm

Here’s a method that “sorta” works; a hack.

set F to choose file default location (path to documents folder) without invisibles
tell application "Preview" to open F
delay 2
activate application "Preview"
tell application "System Events" to tell process "Preview"
	keystroke "a" using {command down}
	delay 3
	keystroke "c" using {command down}
end tell
set R to the clipboard
-- and so on with the TID extractions

You may have to fiddle with the delays. Some PDFs simply don’t show up in Preview, and others are not copyable.

Bruce_Phillips · August 10, 2007, 5:42pm

Try something like this:

choose file with prompt "Get image data for this PDF:" without invisibles
set thePDF to result

try
	do shell script "/usr/bin/strings " & quoted form of POSIX path of thePDF & ¬
		" | /usr/bin/grep '^%%' | /usr/bin/grep --only-matching '/.*'"
	set grepStrings to paragraphs of result
on error
	display alert "No Image Data" message "No image data could be found in that file." buttons {"Cancel"} default button "Cancel"
	error number -128 -- cancel
end try

set imageList to {}
repeat with thisItem in grepStrings
	try
		-- Try to filter out results that aren't files (e.g. dates)
		set end of imageList to (POSIX file thisItem) as alias
	end try
end repeat

If you want the results to be POSIX paths, then you could change the repeat loop:

repeat with thisItem in grepStrings
	try
		(POSIX file thisItem) as alias -- Try to filter out results that aren't files (e.g. dates)
		set end of imageList to contents of thisItem -- store the text itself, not a reference into grepStrings
	end try
end repeat

StefanK · August 10, 2007, 6:38pm

Hi,

There’s a scriptable $15 shareware app, File Juicer, which is able to extract images from PDF files:

set f to choose file with multiple selections allowed
tell application "File Juicer"
	juice files f results location on the desktop with showing results
end tell

Kyle_Jones · August 13, 2007, 9:28am

Thanks for all your help!

Bruce’s script works a treat, thanks.

I tried to use the ‘grep’ shell command on its own, but I couldn’t manage to get only the paths for the linked images; I just got the whole text of the file or nothing at all… Is that because it’s a binary file?

Adam_Bell · August 13, 2007, 1:45pm

Bruce’s fails for me on all the same files that mine does.

Bruce_Phillips · August 13, 2007, 7:06pm

Yes. By default, grep returns the entire line that the match was found in, and a binary file isn’t broken into sensible lines, so a single "line" can be most of the file. The --only-matching (or -o) option will (obviously?) output only what was matched by the pattern; however, you wouldn’t be able to come up with a pattern that would determine the end of the file path.

What strings does is “find the printable strings in a object”; that is, it looks for what might be readable text in a (usually binary) file. The important part for my script is that each string that is found is output on its own line. This means we don’t have to use grep to find the end of the file path.

This is only useful for a PDF that is referencing external files. (Side note: I added a nicer error message in my script above.)
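
To see each stage of that pipeline separately, here is a rough sketch that builds a small throwaway file and runs it through strings and grep. The file path /tmp/tid_strings_demo.bin and its contents are invented for the demonstration, and it assumes /usr/bin/strings is available (on recent systems that may require the developer command line tools).

```applescript
-- Hypothetical demo file: a few non-printable bytes around two readable lines.
set samplePath to "/tmp/tid_strings_demo.bin"
do shell script "/usr/bin/printf '\\001\\002\\n%%%%DocumentFiles:/Volumes/MyServer/etc/etc\\n\\003\\004\\n%%%%Title: demo\\n' > " & quoted form of samplePath

-- Stage 1: strings prints each run of readable text on its own line.
set stringsCmd to "/usr/bin/strings " & quoted form of samplePath
set allStrings to paragraphs of (do shell script stringsCmd)

-- Stage 2: keep only the lines that start with %%.
set commentLines to paragraphs of (do shell script stringsCmd & " | /usr/bin/grep '^%%'")

-- Stage 3: --only-matching prints just the part from the first slash to the end of the line.
-- The line without a slash produces no output at this stage.
set foundPaths to paragraphs of (do shell script stringsCmd & ¬
	" | /usr/bin/grep '^%%' | /usr/bin/grep --only-matching '/.*'")

-- Tidy up the demo file.
do shell script "/bin/rm -f " & quoted form of samplePath

foundPaths
```

Because strings has already put each candidate on its own line, the end of the line doubles as the end of the path, which is why the last pattern can simply be '/.*'.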

In case you’re curious, here’s a small sample of the strings output: