Hi All,
So I’m a huge fan of spending hours on a script I’ll only use once. For mysterious reasons, I needed to take a large PDF of scanned images and get the images out. I can do this with an Automator action that takes <2 minutes to make, but that runs serially and is really slow, so I put together this guy that uses pdfimages from xpdf and mogrify from ImageMagick to get the job done in parallel.
The script assumes that a PDF file is selected and that the output folder TheFolder doesn’t already exist; otherwise it’ll throw errors. I use it as an Automator service that takes PDFs in Finder, so the first thing isn’t a problem. I guess it should probably iterate the output folder name to fix the second. I’ll do that later (no point in having a script you only use once if it’s not perfect, right?). It’s also not completely load balanced: every process gets PageCount div NumProcs pages and the last one also takes the remainder, so it could end up with up to NumProcs - 1 more pages than the rest, but that’s not a huge deal.
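To make the load-balancing point concrete, here is a rough Python sketch of the same page split the script does. The function names are made up for illustration, and `balanced_ranges` is a hypothetical alternative the script does not use.

```python
def page_ranges(page_count, num_procs):
    """Split pages 1..page_count like the script: equal intervals,
    with the last process taking whatever is left over."""
    # Mirrors the script's fallback: small PDFs run as a single job.
    if page_count < 3 * num_procs:
        return [(1, page_count)]
    interval = page_count // num_procs
    ranges = []
    for i in range(1, num_procs + 1):
        first = (i - 1) * interval + 1
        last = i * interval if i < num_procs else page_count
        ranges.append((first, last))
    return ranges


def balanced_ranges(page_count, num_procs):
    """Hypothetical alternative: the first `extra` processes get one more page.
    Assumes page_count >= num_procs."""
    base, extra = divmod(page_count, num_procs)
    ranges, first = [], 1
    for i in range(num_procs):
        size = base + (1 if i < extra else 0)
        ranges.append((first, first + size - 1))
        first += size
    return ranges
```

With 103 pages and 8 processes, `page_ranges` gives the first seven processes 12 pages each and the last one 19 (pages 85 to 103), which is the NumProcs - 1 = 7 extra pages mentioned above.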
The reason for finding and deleting output files under 1 KB is that pdfimages exports everything it can find that’s an “image”, which in some cases includes basically blank, very small images. Those are removed before everything is converted to JPG (pdfimages itself outputs .ppm and .pbm files). When this happens there are gaps in the output file names (missing numbers), but the files retain the correct order.
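For anyone who wants the cleanup step outside of `find`, this is a minimal Python sketch of the same idea. `remove_tiny_files` and its `min_bytes` parameter are names invented here, and unlike the `find .` in the script it only looks at the top level of the folder.

```python
from pathlib import Path


def remove_tiny_files(folder, min_bytes=1024):
    """Delete regular files smaller than min_bytes; return the names kept, sorted."""
    kept = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        if path.stat().st_size < min_bytes:
            path.unlink()  # near-blank "image", drop it
        else:
            kept.append(path.name)
    return kept
```

The survivors keep their original numbers, so a deleted file simply leaves a hole in the sequence and sorting by name still gives page order.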
You could add options for more output image types really easily if you wanted, but JPG is fine for my use. This method takes about 15 seconds on a test file that took about 8 minutes using the Automator action. Note that this does not work on vector PDFs; the PDF has to be scanned images, or at least have one image in each process’s page range. I think it’s pretty cool.
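As a rough idea of how the output type could be made an option, here is a Python sketch that builds the per-chunk shell command with the format as a parameter. `chunk_command` and `fmt` are hypothetical names, and the `-name` filter that limits cleanup to this chunk's own files is an assumption of the sketch.

```python
import shlex


def chunk_command(pdf_path, out_folder, file_root, first, last, fmt="jpg"):
    """Build the shell command line for one page range (not executed here)."""
    q = shlex.quote
    return ";".join([
        # Extract the images on pages first..last into out_folder/file_root-*
        f"pdfimages -f {first} -l {last} {q(pdf_path)} {q(out_folder + '/' + file_root)}",
        f"cd {q(out_folder)}",
        # Remove near-empty images belonging to this chunk
        f"find . -type f -name {q(file_root + '*')} -size -1k -exec rm {{}} +",
        # Convert what is left to the requested format
        f"mogrify -format {q(fmt)} ./{q(file_root)}*",
    ])
```

Calling `chunk_command(pdf, folder, root, 1, 12, fmt="png")` would swap JPG for PNG without touching anything else in the pipeline.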
Take care,
Tim
*Edit: naming of the output folder now iterates so it won’t fail if the folder already exists.
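The folder-name iteration boils down to "try to create it, and on failure bump a suffix". A minimal Python sketch of that idea follows; `make_unique_folder` is a made-up name, and unlike the AppleScript it lets the suffix grow past two digits instead of keeping only the last two.

```python
import os


def make_unique_folder(base):
    """Create `base`, or `base-01`, `base-02`, ... if the name is taken."""
    candidate, i = base, 0
    while True:
        try:
            os.mkdir(candidate)
            return candidate
        except FileExistsError:
            # Name already in use: try the next numbered suffix
            i += 1
            candidate = f"{base}-{i:02d}"
```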
property NumProcs : 8
on run {input, parameters}
--set input to {}
--set end of input to (choose file with multiple selections allowed) as alias
set CloseTerminal to not appIsRunning("Terminal")
repeat with TheFile in input
tell application "Finder" to set TheFileFolder to (container of TheFile) as alias
set TheFileName to items 1 through -2 of parseLine(name of (info for TheFile), ".") as text
set PageCount to (last word of (do shell script "mdls " & quoted form of POSIX path of TheFile & " | grep kMDItemNumberOfPages")) as number
set TheFolder to ((POSIX path of TheFileFolder) & TheFileName)
try
do shell script "mkdir " & quoted form of TheFolder
on error
set i to 1
repeat
try
set TheNewFolder to (TheFolder & "-" & text -2 through -1 of ("0" & i))
do shell script "mkdir " & quoted form of TheNewFolder
set TheFolder to TheNewFolder
exit repeat
on error
set i to i + 1
end try
end repeat
end try
tell application "Terminal" to activate
if PageCount ≥ 3 * NumProcs then
set PageInterval to PageCount div NumProcs
--set PageRemainder to PageCount mod NumProcs
repeat with i from 1 to NumProcs
set FirstPage to (i - 1) * PageInterval + 1
if i < NumProcs then
set LastPage to i * PageInterval
else
set LastPage to PageCount
end if
set FileRoot to TheFileName & "___" & text -3 through -1 of ("00" & i)
--set PageRange to LastPage - FirstPage
set ScriptString to "pdfimages -f " & FirstPage & " -l " & LastPage & space & quoted form of POSIX path of TheFile & space & quoted form of (TheFolder & "/" & FileRoot)
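-- Limit the cleanup to this process's own files so it can't delete images another process is still writing
set ScriptString to ScriptString & ";cd " & quoted form of TheFolder & ";find . -type f -name " & quoted form of (FileRoot & "*") & " -size -1k -exec rm {} +;mogrify -format jpg ./" & quoted form of (FileRoot) & "*;exit"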
tell application "Terminal" to do script ScriptString --& ";exit"
end repeat
delay 2
tell application "Terminal"
set WindowIDs to id of windows
set JobWindows to {}
repeat with AnID in WindowIDs
tell window id AnID
if history contains "pdfimages -f " then
set end of JobWindows to AnID
set visible to false
end if
end tell
end repeat
set NumFinished to 0
repeat
repeat with i from 1 to count of JobWindows
if item i of JobWindows is not missing value then
if history of window id (item i of JobWindows) contains "[Process completed]" then
close window id (item i of JobWindows)
set (item i of JobWindows) to missing value
set NumFinished to NumFinished + 1
end if
end if
end repeat
if NumFinished ≥ NumProcs then exit repeat
delay 1
end repeat
do shell script "cd " & quoted form of TheFolder & ";rm -rf *.ppm *.pbm"
end tell
else
set ScriptString to "pdfimages " & quoted form of POSIX path of TheFile & space & quoted form of (TheFolder & "/" & TheFileName)
set ScriptString to ScriptString & ";cd " & quoted form of TheFolder & ";find . -type f -size -1k -exec rm {} +;mogrify -format jpg ./" & quoted form of (TheFileName) & "*;exit"
tell application "Terminal" to do script ScriptString --& ";exit"
delay 2
tell application "Terminal"
set WindowIDs to id of windows
repeat with AnID in WindowIDs
tell window id AnID
if history contains "pdfimages " then
set JobWindow to AnID
set visible to false
exit repeat
end if
end tell
end repeat
repeat
if history of window id (JobWindow) contains "[Process completed]" then
close window id (JobWindow)
exit repeat
end if
delay 1
end repeat
do shell script "cd " & quoted form of TheFolder & ";rm -rf *.ppm *.pbm"
end tell
end if
end repeat
if CloseTerminal then tell application "Terminal" to quit
end run
on parseLine(theLine, delimiter)
set astid to AppleScript's text item delimiters
set AppleScript's text item delimiters to {delimiter}
set theTextItems to theLine's text items
set AppleScript's text item delimiters to astid
repeat with i from 1 to (count theTextItems)
if (item i of theTextItems is "") then set item i of theTextItems to missing value
end repeat
return theTextItems's every text
end parseLine
on appIsRunning(appName)
tell application "System Events" to (name of processes) contains appName
end appIsRunning