Extract text from pdf and reformat paragraphs

Hi List,

I am trying to extract the plain text of a PDF document using Automator’s Extract PDF Text action. I then want to use a regex to remove the paragraph marks at the end of each line of the returned text (except the ones that follow the period at the end of a paragraph), so that the paragraphs are reformatted.

But how can I call the Extract PDF Text action from a script?

Thanks for your input.

Michael

Hi Michael,

Skim, a free PDF viewer, might be of interest to you. It is scriptable and also features an AppleScript command to extract the text from specific PDF pages.

Here is an example:


tell application "Skim"
	open (POSIX file "/Users/martin/Desktop/example.pdf")
	tell document 1
		tell page 1
			set pdftext to get text for
		end tell
	end tell
end tell

You can get Skim here:
http://skim-app.sourceforge.net

Best regards,

Martin

Thanks Martin for your suggestion.

Skim is actually a great application; it works perfectly in my case and does so much more!

Michael
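
For the second half of the original question (rejoining the hard-wrapped lines), here is a minimal sketch in plain AppleScript rather than a regex. The handler name reflowText is made up for illustration, and the rule is only a heuristic: a line break is kept when the line before it ends with a period, and every other break becomes a space. A line that happens to end with a period in the middle of a paragraph will therefore be split, so treat this as a starting point.

```applescript
-- Join hard-wrapped lines into paragraphs (hypothetical helper).
-- A break is kept only after a line that ends with a period;
-- all other breaks are replaced by a single space. Empty lines are skipped.
on reflowText(rawText)
	set outText to ""
	set pendingBreak to "" -- separator to emit before the next non-empty line
	repeat with aLine in paragraphs of rawText
		set lineText to contents of aLine
		if lineText is not "" then
			set outText to outText & pendingBreak & lineText
			if lineText ends with "." then
				set pendingBreak to linefeed -- end of paragraph
			else
				set pendingBreak to space -- wrapped line, keep going
			end if
		end if
	end repeat
	return outText
end reflowText
```

You would pass it the text returned by Skim, coerced to plain text first if necessary, for example reflowText(pdftext as string).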

Is there a way to extract only specially formatted text from a PDF?
For example, extract all ‘bold’ words or all words in ‘italics’ from a given PDF.

Thanks for the prompt reply, but the script fails with this error:
Skim got an error: Can’t get every word of document 1 whose font contains “Italic”.

I get a similar error for “bold” as well.

Skim Version 1.1.12 (34)
Mac OS 10.5.6

I also made sure that the PDF contains proper text. It is not an image or encrypted text.

Thanks, Jacques. It does not work on the version I used earlier, but it does work on the latest version. However, it extracts text (bold and italics) only from the first page of document 1.

Sorry for my previous post. I referred to Skim’s dictionary, and it was not difficult to write a script that extracts the bold words from all pages.
This is the script that works:

tell application "Skim"
	tell document 1
		set bold_Word_list to words of every page whose its font contains "Bold"
	end tell
end tell

If there are two or more consecutive bold words, they show up as separate words in the results I get after running the script. I would like such consecutive words to be shown together as one item. Any ideas?
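
One possible approach, sketched in plain AppleScript: a filtered list of bold words no longer says which words were neighbours, so the grouping has to work on all words in reading order together with their font names. The handler below assumes you have already collected two parallel lists, wordList and fontList; both names, and the handler name joinBoldRuns, are hypothetical, and how you obtain the font of each word from Skim is left out here. If Skim's text also exposes attribute runs (an assumption I have not verified), filtering those might be simpler, since a run already covers a stretch of uniformly formatted text.

```applescript
-- Group consecutive bold words into phrases (hypothetical helper).
-- wordList and fontList are parallel lists: item i of fontList is the
-- font name of item i of wordList.
on joinBoldRuns(wordList, fontList)
	set phrases to {}
	set currentPhrase to ""
	repeat with i from 1 to (count of wordList)
		if (item i of fontList) contains "Bold" then
			-- extend the phrase in progress, or start a new one
			if currentPhrase is "" then
				set currentPhrase to item i of wordList
			else
				set currentPhrase to currentPhrase & space & (item i of wordList)
			end if
		else if currentPhrase is not "" then
			-- a non-bold word closes the phrase in progress
			set end of phrases to currentPhrase
			set currentPhrase to ""
		end if
	end repeat
	-- flush a phrase that runs up to the last word
	if currentPhrase is not "" then set end of phrases to currentPhrase
	return phrases
end joinBoldRuns
```

Note that the test is on list position only, so two bold words separated by punctuation or a line break would still be joined.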

Is there a way to get the words that start with “A” as a list?
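
Once the words are in an ordinary AppleScript list, a whose filter will not work on them, so a short loop is the usual way. This is a minimal sketch; the handler name wordsStartingWith is made up for illustration. The considering case block makes the match case-sensitive, so only a capital “A” counts; remove it if “apple” should match as well.

```applescript
-- Return the items of wordList that begin with prefix (hypothetical helper).
on wordsStartingWith(wordList, prefix)
	set matches to {}
	considering case -- AppleScript text comparisons ignore case by default
		repeat with aWord in wordList
			if (contents of aWord) begins with prefix then
				set end of matches to contents of aWord
			end if
		end repeat
	end considering
	return matches
end wordsStartingWith

-- Example with a literal string instead of text pulled from a PDF
wordsStartingWith(words of "Alice and Bob ate An apple", "A")
```

The example call returns {"Alice", "An"}.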