PDF2MANY - Converts rich text of a PDF into seven different formats

At my workplace I often have to convert simple structured PDF documents into other formats like HTML, DOC or RTF. The PDF documents do not contain images, fancy charts or complex tables, just formatted text.

But as you know yourself, converting documents can be quite a tedious task. Copying and pasting text, opening and saving documents, adjusting the formatting…boring!

That’s why I wrote PDF2MANY, which combines two powerful applications on the Mac, to help me out with this process.

How does PDF2MANY work? It first uses the free, scriptable PDF reader Skim to get the styled text of a PDF document as rich text (RTF). It then converts the RTF data into one of seven formats with the built-in command line utility textutil.
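The two steps can be sketched in a few lines. This is only a condensed illustration, not the full script: it assumes a PDF is already open in Skim, it hard-codes HTML as the target format, and the scratch file name `pdf2many-sketch.rtf` is a made-up example.

```applescript
-- Step 1: ask Skim for the styled text of the frontmost PDF as RTF data
tell application "Skim"
	set rtfdata to RTF of text of document 1
end tell

-- Step 2: write the RTF data to a scratch file (hypothetical file name)
set tmpfolder to (path to temporary items folder from user domain) as text
set tmpfilepath to tmpfolder & "pdf2many-sketch.rtf"
set fileref to open for access file tmpfilepath with write permission
try
	set eof fileref to 0 -- empty the file in case it is left over from an earlier run
	write rtfdata to fileref
	close access fileref
on error errmsg number errnum
	close access fileref
	error errmsg number errnum
end try

-- Step 3: let textutil convert the RTF file. Without the -output option
-- the result is written next to the input file with the new extension.
do shell script "/usr/bin/textutil -convert html " & quoted form of (POSIX path of tmpfilepath)
```

The real script additionally picks an unused temporary file name, writes the result next to the original PDF via `-output`, and removes the temporary file afterwards.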

The available conversion formats are RTF, HTML, DOC, DOCX, ODT, WORDML and TXT.

If you want to try it yourself, I invite you to download the script right here:

PDF2MANY - Converts rich text of a PDF into seven different formats (ca. 365 KB)

Please note that PDF2MANY requires at least Skim 1.3.4 and Mac OS X 10.5. It was tested on Intel-based Macs.

The script offers two modes of operation: If you simply run it from the Finder, it will process the frontmost PDF document in Skim. But you can also drag and drop PDF files onto its icon to batch process them.

The converted files will be saved in the same folder as the original PDF document. Existing files will not be overwritten.

To switch between the various conversion formats, you just need to rename the script, or a copy of it, as shown in the screenshot above.
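The name-based switch boils down to checking which known format the script's name ends with. Here is a minimal sketch of that idea; the handler name `formatfromname` is hypothetical and the logic is simplified compared to the script's own `getconvformat` handler, which raises an error instead of returning `missing value`.

```applescript
-- Hypothetical helper: returns the conversion format encoded in a script name
on formatfromname(scriptname)
	set knownformats to {"html", "txt", "rtf", "doc", "docx", "wordml", "odt"}
	-- drop a trailing name extension such as ".app", if present
	set olddelims to AppleScript's text item delimiters
	set AppleScript's text item delimiters to {"."}
	if (count text items of scriptname) > 1 then
		set scriptname to (text items 1 thru -2 of scriptname) as text
	end if
	set AppleScript's text item delimiters to olddelims
	-- text comparisons ignore case by default, so "PDF2DOCX" matches "docx"
	repeat with knownformat in knownformats
		if scriptname ends with knownformat then return (contents of knownformat)
	end repeat
	-- no known format found in the name
	return missing value
end formatfromname

formatfromname("PDF2DOCX.app") -- "docx"
formatfromname("PDF2MANY.app") -- missing value
```

So a copy named PDF2DOCX converts to DOCX, while a name that ends with none of the seven formats is rejected.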

Please note that PDF2MANY only converts the styled text of a PDF document. It does not convert any images, charts or tables.

I am aware that, code-wise, this script might seem like overkill. Moreover, I know that the «info for» command is deprecated. But I just love it. It is so much better and faster than calling «System Events» :cool:
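For comparison, here are both ways of looking up the script's own file name. This is just a sketch; the variable names are made up for illustration, and the System Events variant assumes that addressing a `disk item` by POSIX path behaves the same on your system version.

```applescript
set mypath to path to me

-- the deprecated «info for» command from Standard Additions, as used in the script
set nameviainfo to name of (info for mypath)

-- the same lookup through System Events
set myposixpath to POSIX path of mypath
tell application "System Events"
	set nameviaevents to name of disk item myposixpath
end tell
```

The second variant has to send an Apple event to a separate application, which is where the extra overhead comes from.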

HAPPY CONVERTING! :smiley:


-- author: Martin Michel
-- version 1.1
-- created: June 2006
-- modified: 16.03.2010
-- requires:
-- • Mac OS X 10.5 or higher
-- • Skim 1.3.4 or higher
-- tested on/with:
-- • Mac OS X 10.6.2
-- • Skim 1.3.4
-- 
-- get Skim here for free:
-- http://skim-app.sourceforge.net
--
-- This script can convert the styled text of PDF documents into several different
-- formats using Skim and the built-in textutil command line tool.
-- The supported formats for
-- Mac OS X 10.6 are: html, txt, rtf, doc, docx, wordml and odt
--
-- To switch between the different conversion formats, just rename the script
-- itself according to the following naming scheme
-- PDF2HTML		-> HTML conversion
-- PDF2DOC		-> DOC conversion
-- PDF2RTF		-> RTF conversion
-- etc.
--
-- The converted files are saved in the same folder as the original files,
-- only with a different file name extension (matching the chosen conversion format).
-- The script will not overwrite an already existing file.
-- Don't expect any layout conversion (images, tables, etc.); the script only
-- converts the styled text part of a PDF document.

property mytitle : ""

-- I am called when the user opens the script with a double click
on run
	try
		-- getting the script's current name
		set mypath to path to me
		set mytitle to (displayed name of (info for mypath)) as text
		-- determining the conversion format by inspecting the script's name
		set convformat to my getconvformat()
		-- is the user running a compatible Skim version?
		if not my validskimversion() then
			set errmsg to "Skim 1.3.4 or higher is required to run this script."
			my dsperrmsg(errmsg, "--")
			-- stopping here, as we cannot continue with an incompatible Skim version
			return
		end if
		-- getting the PDF document and its RTF data from Skim
		tell application "Skim"
			
			if not (exists document 1) then
				set errmsg to "I could not find an open document in Skim."
				my dsperrmsg(errmsg, "--")
				return
			else
				set pdfdoc to document 1
			end if
		end tell
		-- processing the found PDF document
		my processpdfdoc(pdfdoc, convformat)
	on error errmsg number errnum
		-- ignoring 'User canceled'-error
		if errnum is not equal to -128 then
			my dsperrmsg(errmsg, errnum)
		end if
	end try
end run

-- I am called whenever a user drops Finder items onto the script's icon
on open finderitems
	try
		-- getting the script's current name
		set mypath to path to me
		set mytitle to (displayed name of (info for mypath)) as text
		-- determining the conversion format by inspecting the script's name
		set convformat to my getconvformat()
		-- is the user running a compatible Skim version?
		if not my validskimversion() then
			set errmsg to "Skim 1.3.4 or higher is required to run this script."
			my dsperrmsg(errmsg, "--")
			-- stopping here, as we cannot continue with an incompatible Skim version
			return
		end if
		-- searching the dropped Finder items for PDF documents
		set pdffiles to {}
		repeat with finderitem in finderitems
			set finderiteminfo to (info for finderitem)
			if not folder of finderiteminfo then
				if ((name of finderiteminfo) ends with "pdf") or ((kind of finderiteminfo) contains "PDF") then
					set pdffiles to pdffiles & finderitem
				end if
			end if
		end repeat
		-- unfortunately the user did not drop any PDF documents on the script
		-- ...or we did not find them :)
		if pdffiles is {} then
			set errmsg to "I could not find any PDF documents in the dropped Finder items."
			my dsperrmsg(errmsg, "--")
			return
		end if
		-- processing the list of dropped PDF files
		repeat with pdffile in pdffiles
			tell application "Skim"
				open {pdffile}
				set pdfdoc to document 1
				my processpdfdoc(pdfdoc, convformat)
				close pdfdoc
			end tell
		end repeat
	on error errmsg number errnum
		-- ignoring 'User canceled'-error
		if errnum is not equal to -128 then
			my dsperrmsg(errmsg, errnum)
		end if
	end try
end open

-- I am actually converting the styled text found in the
-- given Skim PDF document to the chosen format
on processpdfdoc(pdfdoc, convformat)
	tell application "Skim"
		set rtfdata to RTF of text of pdfdoc
		set pdffilepath to path of pdfdoc
	end tell
	-- obtaining the file path for the converted document
	set convfilepath to my getconvfilepath(pdffilepath, convformat)
	-- we do not overwrite existing files
	if not (my itempathexists((POSIX file convfilepath) as text)) then
		-- obtaining an unused temporary file path to store the RTF data
		set tmpfilepath to my TmpFile's newpath()
		-- writing the RTF data to the temporary file
		set writesuccess to my writetofile(tmpfilepath, rtfdata)
		if writesuccess is false then
			try
				my TmpFile's remove()
			end try
			set errmsg to "Could not write RTF data to temporary file:" & return & return & tmpfilepath
			my dsperrmsg(errmsg, "--")
		else
			-- converting the temporary RTF file to the chosen format using the built-in textutil utility
			set command to "/usr/bin/textutil -convert " & convformat & " -output " & quoted form of convfilepath & " " & quoted form of (POSIX path of tmpfilepath)
			try
				do shell script command
			on error errmsg number errnum
				my dsperrmsg(errmsg, "--")
			end try
			-- removing the temporary file
			my TmpFile's remove()
		end if
	end if
end processpdfdoc

-- I am returning the conversion format by inspecting the script's own file name
-- If determining the conversion format fails, I raise an error listing the available formats
on getconvformat()
	set convformats to {"html", "txt", "rtf", "doc", "docx", "wordml", "odt"}
	set mypath to path to me
	set myname to (name of (info for mypath)) as text
	-- stripping off the .app name extension
	set revmyname to (reverse of (characters of myname)) as text
	set dotoffset to offset of "." in revmyname
	if dotoffset > 0 then
		set myname to (characters 1 through -(dotoffset + 1) of myname) as text
	end if
	repeat with convformat in convformats
		if myname ends with convformat then
			return (convformat as text)
		end if
	end repeat
	-- unfortunately we could not detect a conversion format from the script's name
	-- so now we need to inform the user and tell him about the
	-- available conversion formats on his system
	set countconvformats to length of convformats
	set strconvformats to ""
	repeat with i from 1 to countconvformats
		set convformat to item i of convformats
		if i is not equal to countconvformats then
			set strconvformats to strconvformats & convformat & ", "
		else
			set strconvformats to strconvformats & convformat
		end if
	end repeat
	set errmsg to "Could not detect a valid conversion format from the script's name: " & mytitle & return & return & "Available conversion formats: " & strconvformats
	error errmsg
end getconvformat

-- I am returning the file path to the converted document
on getconvfilepath(pdffilepath, convformat)
	-- pdffilepath is a Posix path!
	set command to "/usr/bin/basename " & quoted form of pdffilepath
	set pdffilename to do shell script command
	if pdffilename ends with ".pdf" then
		set convfilename to ((characters 1 through -5 of pdffilename) as text) & "." & convformat
	else
		set convfilename to pdffilename & "." & convformat
	end if
	set pdffolderpath to my getparentfolderpath((POSIX file pdffilepath) as text)
	set convfilepath to (pdffolderpath & convfilename) as text
	return (POSIX path of convfilepath)
end getconvfilepath

-- I am indicating if the installed version of Skim is
-- sufficient to run the script (min. Skim 1.3.4)
on validskimversion()
	tell application "Skim"
		activate
		set appversion to version
	end tell
	-- comparing the version strings numerically, so that e.g. "1.10"
	-- counts as higher than "1.3.4" (a plain text sort would get this wrong)
	considering numeric strings
		return (appversion is greater than or equal to "1.3.4")
	end considering
end validskimversion

-- I am indicating whether a given item path exists or not
on itempathexists(itempath)
	try
		set itemalias to itempath as alias
		return true
	on error
		return false
	end try
end itempathexists

-- I am writing the given content to the given file path :)
-- I indicate the write success by returning a boolean value (false/true)
on writetofile(filepath, content)
	try
		set fileobj to open for access filepath with write permission
		write content to fileobj
		close access fileobj
	on error errmsg number errnum
		try
			close access fileobj
		end try
		return false
	end try
	return true
end writetofile

-- I am returning the parent folder of an item as a Mac path! ':'
on getparentfolderpath(itempath)
	set olddelims to AppleScript's text item delimiters
	set AppleScript's text item delimiters to {":"}
	set counttxtitems to (count text items of itempath)
	set lasttxtitem to the last text item of itempath
	if lasttxtitem = "" then
		set counttxtitems to counttxtitems - 2 -- for a path to a folder
	else
		set counttxtitems to counttxtitems - 1 -- for a path to a file
	end if
	set pardirpath to text 1 thru text item counttxtitems of itempath & ":"
	set AppleScript's text item delimiters to olddelims
	return pardirpath
end getparentfolderpath

-- Script object to manage a temporary file
script TmpFile
	property filepath : missing value
	
	-- I am creating a new, not yet existing file path in the temp folder
	on newpath()
		set tmpfolderpath to (path to temporary items folder from user domain) as Unicode text
		repeat
			set rndnum to random number from 1000 to 99999
			set tmpfilepath to (tmpfolderpath & (rndnum as Unicode text) & ".tmp")
			try
				set tmpfilepath to tmpfilepath as alias
			on error
				set filepath to tmpfilepath
				exit repeat
			end try
		end repeat
		return filepath
	end newpath
	
	-- I am returning the file path of the temporary file 
	on getpath()
		return filepath
	end getpath
	
	-- I am trying to delete the temporary file using the rm shell command
	on remove()
		try
			do shell script "rm " & quoted form of (POSIX path of filepath)
		end try
	end remove
end script

-- I am displaying error messages to the user (hopefully only a few)
on dsperrmsg(errmsg, errnum)
	tell me
		activate
		display dialog "Sorry, an error occurred:" & return & return & errmsg & return & "(" & errnum & ")" buttons {"OK"} default button 1 with icon stop with title mytitle
	end tell
end dsperrmsg