Batch Parallel Export of Scanned PDF Pages to JPG

Hi All,

So I’m a huge fan of spending hours on a script I’ll only use once. For mysterious reasons, I needed to take a large PDF of scanned images and get the images out. I can do this with an Automator action that takes <2 minutes to make, but that runs serially and is really slow, so I put together this guy, which uses pdfimages from xpdf and mogrify from ImageMagick to get the job done in parallel.

The script assumes that a PDF file is being selected and that the output folder TheFolder doesn’t already exist; otherwise it’ll throw errors. I use it as an Automator service that takes PDFs in Finder, so the first thing isn’t a problem. I guess it should probably iterate the output folder name to fix the second. I’ll do that later (no point in having a script you only use once if it’s not perfect, right?). It’s also not completely load balanced: every process gets PageCount div NumProcs pages and the last one also takes the remainder, so it could end up with up to NumProcs - 1 more pages than the rest of the processes, but that’s not a huge deal.
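
If the uneven split ever bothers you, here is a rough sketch of how the pages could be spread so that no process gets more than one page more than any other. balancedRanges is a made-up handler name, not something in the script below, and I'm assuming the same PageCount and NumProcs values the script uses.

```applescript
-- Hypothetical helper (not part of the script below): split PageCount pages
-- into NumProcs contiguous ranges whose sizes differ by at most one page.
on balancedRanges(PageCount, NumProcs)
	set PageInterval to PageCount div NumProcs
	set PageRemainder to PageCount mod NumProcs
	set TheRanges to {}
	set FirstPage to 1
	repeat with i from 1 to NumProcs
		-- The first PageRemainder processes each take one extra page.
		if i ≤ PageRemainder then
			set LastPage to FirstPage + PageInterval
		else
			set LastPage to FirstPage + PageInterval - 1
		end if
		set end of TheRanges to {FirstPage, LastPage}
		set FirstPage to LastPage + 1
	end repeat
	return TheRanges
end balancedRanges

-- 103 pages over 8 processes: seven ranges of 13 pages, then one of 12.
balancedRanges(103, 8)
```

With the split used in the script, the same 103-page file would give seven processes 12 pages each and the last one 19.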

The reason for finding and deleting output files under 1 KB is that pdfimages exports everything it can find that’s an “image”, which in some cases includes basically blank, very small images, so those are removed before everything is converted to JPG (pdfimages itself outputs .ppm and .pbm files). When this happens there are gaps in the output file names (missing numbers), but the files retain the correct order.
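
For reference, this is roughly what the cleanup step looks like pulled out into its own handler. It's a sketch, not the exact code from the script: cleanupCommand is a made-up name, it limits find to one job's file root, and it writes the size threshold as -1024c (fewer than 1024 bytes) instead of -1k.

```applescript
-- Hypothetical handler: build the shell command that deletes pdfimages
-- output smaller than 1024 bytes for a single file root.
on cleanupCommand(TheFolder, FileRoot)
	-- "cd ... &&" makes sure find only runs if the folder change succeeded.
	set TheCommand to "cd " & quoted form of TheFolder & " && find . -type f"
	-- Only touch files that belong to this file root.
	set TheCommand to TheCommand & " -name " & quoted form of (FileRoot & "*")
	-- -size -1024c matches files of fewer than 1024 bytes.
	set TheCommand to TheCommand & " -size -1024c -exec rm {} +"
	return TheCommand
end cleanupCommand

cleanupCommand("/tmp/out", "scan___001")
```

The returned string can be handed to do shell script or appended to the command sent to Terminal.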

You could add options for more output image types really easily if you wanted, but JPG is fine for my use. This method takes about 15 seconds on a test file that took about 8 minutes using the Automator action. Note that this does not work on vector PDFs; the PDF has to be scanned images, or at least contain one image in each process’s page range. I think it’s pretty cool :cool:
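
As a rough idea of what the output-type option could look like, here is a sketch that builds the mogrify part of the command from a format name. convertCommand and its OutputFormat parameter are made up for illustration; the script below hard-codes jpg.

```applescript
-- Hypothetical handler: build the mogrify command for a given output format.
-- mogrify -format writes new files with the new extension and leaves the
-- .ppm/.pbm originals in place, so they still need removing afterwards.
on convertCommand(FileRoot, OutputFormat)
	return "mogrify -format " & quoted form of OutputFormat & " ./" & quoted form of FileRoot & "*"
end convertCommand

convertCommand("scan___001", "png")
```

Any format name that your ImageMagick install can write should work in place of "png".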

Take care,

Tim

*Edit: naming of the output folder now iterates so it won’t fail if the folder already exists.

property NumProcs : 8

on run {input, parameters}
	--set input to {}
	--set end of input to (choose file with multiple selections allowed) as alias
	
	set CloseTerminal to not appIsRunning("Terminal")
	
	repeat with TheFile in input
		tell application "Finder" to set TheFileFolder to (container of TheFile) as alias
		set TheFileName to items 1 through -2 of parseLine(name of (info for TheFile), ".") as text
		
		set PageCount to (last word of (do shell script "mdls " & quoted form of POSIX path of TheFile & " | grep kMDItemNumberOfPages")) as number
		
		set TheFolder to ((POSIX path of TheFileFolder) & TheFileName)
		
		try
			do shell script "mkdir " & quoted form of TheFolder
		on error
			set i to 1
			repeat
				try
					set TheNewFolder to (TheFolder & "-" & text -2 through -1 of ("0" & i))
					do shell script "mkdir " & quoted form of TheNewFolder
					set TheFolder to TheNewFolder
					exit repeat
				on error
					set i to i + 1
				end try
			end repeat
		end try
		
		tell application "Terminal" to activate
		
		if PageCount ≥ 3 * NumProcs then
			set PageInterval to PageCount div NumProcs
			--set PageRemainder to PageCount mod NumProcs
			
			repeat with i from 1 to NumProcs
				set FirstPage to (i - 1) * PageInterval + 1
				if i < NumProcs then
					set LastPage to i * PageInterval
				else
					set LastPage to PageCount
				end if
				set FileRoot to TheFileName & "___" & text -3 through -1 of ("00" & i)
				--set PageRange to LastPage - FirstPage
				set ScriptString to "pdfimages -f " & FirstPage & " -l " & LastPage & space & quoted form of POSIX path of TheFile & space & quoted form of (TheFolder & "/" & FileRoot)
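				-- Limit the cleanup to this job's own files so parallel jobs can't delete each other's partial output, and only run it if cd succeeded
				set ScriptString to ScriptString & ";cd " & quoted form of TheFolder & " && find . -type f -name " & quoted form of (FileRoot & "*") & " -size -1k -exec rm {} + && mogrify -format jpg ./" & quoted form of (FileRoot) & "*;exit"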
				tell application "Terminal" to do script ScriptString --& ";exit"
			end repeat
			delay 2
			tell application "Terminal"
				set WindowIDs to id of windows
				set JobWindows to {}
				repeat with AnID in WindowIDs
					tell window id AnID
						if history contains "pdfimages -f " then
							set end of JobWindows to AnID
							set visible to false
						end if
					end tell
				end repeat
				set NumFinished to 0
				repeat
					repeat with i from 1 to count of JobWindows
						if item i of JobWindows is not missing value then
							if history of window id (item i of JobWindows) contains "[Process completed]" then
								close window id (item i of JobWindows)
								set (item i of JobWindows) to missing value
								set NumFinished to NumFinished + 1
							end if
						end if
					end repeat
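					-- Compare against the windows actually found, so a missed window can't cause an endless loop
					if NumFinished ≥ (count of JobWindows) then exit repeat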
					delay 1
				end repeat
				-- Only delete if the cd succeeded
				do shell script "cd " & quoted form of TheFolder & " && rm -f *.ppm *.pbm"
				
			end tell
		else
			set ScriptString to "pdfimages " & quoted form of POSIX path of TheFile & space & quoted form of (TheFolder & "/" & TheFileName)
			set ScriptString to ScriptString & ";cd " & quoted form of TheFolder & " && find . -type f -size -1k -exec rm {} + && mogrify -format jpg ./" & quoted form of (TheFileName) & "*;exit"
			tell application "Terminal" to do script ScriptString --& ";exit"
			delay 2
			tell application "Terminal"
				set WindowIDs to id of windows
				repeat with AnID in WindowIDs
					tell window id AnID
						if history contains "pdfimages " then
							set JobWindow to AnID
							set visible to false
							exit repeat
						end if
					end tell
				end repeat
				repeat
					if history of window id (JobWindow) contains "[Process completed]" then
						close window id (JobWindow)
						exit repeat
					end if
					delay 1
				end repeat
				-- Only delete if the cd succeeded
				do shell script "cd " & quoted form of TheFolder & " && rm -f *.ppm *.pbm"
			end tell
		end if
	end repeat
	if CloseTerminal then tell application "Terminal" to quit
	
end run

on parseLine(theLine, delimiter)
	set astid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to {delimiter}
	set theTextItems to theLine's text items
	set AppleScript's text item delimiters to astid
	
	repeat with i from 1 to (count theTextItems)
		if (item i of theTextItems is "") then set item i of theTextItems to missing value
	end repeat
	
	return theTextItems's every text
end parseLine

on appIsRunning(appName)
	tell application "System Events" to (name of processes) contains appName
end appIsRunning
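
If you'd rather not drive Terminal windows at all, one alternative is to start the jobs as background subshells inside a single do shell script call and let the shell's wait block until they have all finished. This is a different technique from what the script above does, and only a sketch: parallelCommand is a made-up handler, and PdfImagesPath is my guess at an install location. The full path is needed because do shell script runs /bin/sh with a minimal PATH, not your login shell's.

```applescript
-- Assumed install location; adjust to wherever pdfimages lives on your Mac.
property PdfImagesPath : "/usr/local/bin/pdfimages"

-- Hypothetical handler: build one shell command that runs pdfimages on each
-- {FirstPage, LastPage} range in the background, then waits for all of them.
on parallelCommand(ThePDF, TheFolder, TheFileName, TheRanges)
	set TheCommand to ""
	repeat with i from 1 to count of TheRanges
		set {FirstPage, LastPage} to item i of TheRanges
		set FileRoot to TheFileName & "___" & text -3 through -1 of ("00" & i)
		-- Each parenthesised group runs as its own background job.
		set TheCommand to TheCommand & "(" & quoted form of PdfImagesPath & " -f " & FirstPage & " -l " & LastPage & space & quoted form of ThePDF & space & quoted form of (TheFolder & "/" & FileRoot) & ") & "
	end repeat
	-- wait blocks until every background job has exited.
	return TheCommand & "wait"
end parallelCommand

parallelCommand("/tmp/scan.pdf", "/tmp/scan", "scan", {{1, 50}, {51, 100}})
```

Passing the returned string to do shell script would block until every job is done, so the polling for "[Process completed]" wouldn't be needed. The cleanup and mogrify steps would have to be added inside each group.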