Getting text from multiple pdfs into spreadsheet

Hi again, I was wondering if anyone could help me with a script to help me tabulate my invoice totals. What I want to do is

  1. go through each pdf file in a directory
  2. pull only specific pieces of the text from the pdf document (namely, the chunks of text that follow the phrases “Customer No.” and “Total ex GST”)
  3. create a spreadsheet from this information (preferably in Excel, but anything exportable would work)
    Hopefully this will save me from going through all of my invoice pdfs individually and entering the data manually.
    Thanks in advance if anyone has any suggestions.
    Jeremy

You have a bit to do there, so start by breaking it down.

First, you need to get the text out of the PDFs. How it is returned will depend a lot on the way the PDF is laid out: you generally get everything in one string, and the order is based on position on the page. Assuming a fixed invoice format, you might be lucky.
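Not AppleScript, but here is a rough Python sketch of that first step, in case it helps to see what the raw text looks like. It assumes the third-party pypdf package is installed; the function name pdf_text and the open_reader parameter are made up for illustration (the parameter is only there so the function can be tried without a real PDF).

```python
def pdf_text(path, open_reader=None):
    """Return all the text of one PDF as a single string, page by page.

    open_reader is a hypothetical hook: by default it is pypdf's PdfReader,
    but anything that exposes .pages with .extract_text() will do.
    """
    if open_reader is None:
        # Third-party dependency, assumed to be installed (pip install pypdf).
        from pypdf import PdfReader
        open_reader = PdfReader

    reader = open_reader(str(path))
    # extract_text() can return an empty result for image-only pages,
    # so fall back to an empty string for those.
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

Printing the result for one or two invoices should show whether the labels and their values come out next to each other or get scattered by the page layout.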

Once you can get the text out, and can see a pattern in how the values appear, you can begin parsing. Then you can think about processing all the files in a directory, and putting the results in something like a tabbed-text file.
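As a rough illustration of the parsing and directory steps, here is a Python sketch (not AppleScript). It assumes each label is followed directly by its value, e.g. "Customer No. 10482" and "Total ex GST $1,234.50"; the real invoices may well be laid out differently, so the patterns would need adjusting. The names parse_invoice, parse_folder and get_text are hypothetical, and get_text stands for whatever you end up using to turn a PDF into a string.

```python
import re
from pathlib import Path

# Assumed layout: the value follows the label, optionally after a colon.
CUSTOMER_RE = re.compile(r"Customer No\.?\s*:?\s*(\S+)")
TOTAL_RE = re.compile(r"Total ex GST\s*:?\s*\$?\s*([\d,]+(?:\.\d+)?)")

def parse_invoice(text):
    """Pull the customer number and ex-GST total out of one invoice's text."""
    customer = CUSTOMER_RE.search(text)
    total = TOTAL_RE.search(text)
    return {
        "customer_no": customer.group(1) if customer else None,
        # Strip thousands separators before converting to a number.
        "total_ex_gst": float(total.group(1).replace(",", "")) if total else None,
    }

def parse_folder(folder, get_text):
    """Parse every .pdf in folder; get_text(path) must return the PDF's text."""
    rows = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        row = parse_invoice(get_text(pdf))
        row["file"] = pdf.name
        rows.append(row)
    return rows
```

A value of None in a row means the label was not found in that file, which is a handy flag for invoices that need checking by hand.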

What version of the OS are you using?

For the spreadsheet part, I would simply make the output a CSV-formatted file.

Excel and any other spreadsheet app will be able to read it with no problem.

You would then only need to add your sum functions etc. later.
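For what it's worth, writing the CSV is the easy part. A minimal Python sketch, assuming the extracted values have already been collected as one dictionary per invoice; the column names file, customer_no and total_ex_gst and the function name write_csv are invented for the example:

```python
import csv

def write_csv(rows, out_path):
    """Write one line per invoice, with a header row, to a CSV file."""
    fieldnames = ["file", "customer_no", "total_ex_gst"]
    # newline="" lets the csv module control line endings itself.
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Called as write_csv(rows, "invoices.csv"), this gives a file that opens directly in Excel, where the totals column can be summed as usual.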

Can you post an example of an invoice?

If your PDFs are actually PDF forms, you could use the following strategy.

  1. In Acrobat X or XI, have a pdf document open. This is to show the various tools.
  2. Under “Tools”, “Forms”, “More Form Options…” select “Merge Data Files Into Spreadsheet…”
  3. Add all the pdfs you are interested in and export the data.

This will produce a .csv file that you can manipulate in Excel as you wish. It will contain all the data from the form in each file, with each file appearing in the sheet as a separate record.

Sorry, no AppleScript, but it's an easy way to get your job done.