Extracting PDF Text

A friend has pointed out that there’s an Automator action called “Extract PDF Text” that does just that, in either plain text or Rich Text, and saves the result as a file. If the extracted file size is zero, the extraction failed (for instance, because the PDF was protected). Clearly, if Automator can do it, the system can.

The objective is to write an AppleScript that tests PDFs and, if they are not searchable, makes them searchable without having to print and OCR them or use Adobe Acrobat’s OCR function. Unfortunately, there doesn’t seem to be a way to determine whether a PDF is searchable just by looking at the file itself, without opening it and seeing if it has a search box.

Suggestions??

I guess the Automator action parses the raw data of the PDF file and searches for streams that contain text data. I guess you can do the same thing with AppleScript (or any other language which can read files and parse data). The only thing you need to know is how a PDF file is structured, so that you can parse it. Adobe itself publishes a (rather large) document about the PDF format and its structure. Reading it through is quite a task, and it’s actually like learning a new programming language.
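As a rough illustration of that idea, here is a minimal Python sketch that scans the raw bytes of a PDF for a few telltale keys. It is a heuristic, not a parser: the function name `scan_pdf_bytes` is made up for this example, and it assumes the relevant dictionaries are stored uncompressed. Newer PDFs can pack objects into compressed object streams, in which case a `/Font` entry may be missed.

```python
import re

def scan_pdf_bytes(data):
    """Rough byte-level checks on raw PDF data (heuristics, not a real parser)."""
    return {
        # A PDF file normally begins with a "%PDF-" header.
        "is_pdf": data.startswith(b"%PDF-"),
        # Encrypted files carry an /Encrypt entry in the trailer dictionary.
        "encrypted": re.search(rb"/Encrypt\b", data) is not None,
        # Pages that draw real text reference a /Font resource; image-only
        # scans usually do not (assumption: the entry is not compressed).
        "mentions_font": re.search(rb"/Font\b", data) is not None,
    }

# Usage (hypothetical path):
#   with open("/path/to/file.pdf", "rb") as f:
#       print(scan_pdf_bytes(f.read()))
```

A file that reports no `/Font` entry is only a candidate for OCR; confirming that requires actually extracting text.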

Hope it helps,
ief2

Here’s some Objective-C code for you. It tells you whether the file is encrypted with a password. It prints yes or no (unless there’s an error); yes means it’s encrypted. I’m assuming that files that are not searchable are encrypted.

Create a new Foundation Tool in Xcode. I called mine “pdfIsEncrypted”. Add the Quartz.framework to your project. Let me know if you need help creating the tool.

#import <Foundation/Foundation.h>
#import <Quartz/Quartz.h>

int main (int argc, const char * argv[]) {
    NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];
	
	// see if help is being requested
	NSArray* pInfo = [[NSArray alloc] initWithArray:[[NSProcessInfo processInfo] arguments]];
	if ([pInfo count] == 1) {
		fprintf(stderr, "Please pass in the path to a pdf file to find out if it is encrypted with a password.\n");
		return 1;
	} else if ([[pInfo objectAtIndex:1] isEqualToString:@"-h"]) {
		fprintf(stdout, "Pass in the path to a pdf file to find out if it is encrypted with a password.\n");
		return 0;
	}
	
	NSString* path = [pInfo objectAtIndex:1];
	NSURL* u = [NSURL fileURLWithPath:path];
	[pInfo release];
	
	PDFDocument* d = nil;
	d = [[PDFDocument alloc] initWithURL:u];
	if (!d) {
		// initWithURL: returned nil, so there is no document to release
		fprintf(stderr, "The file could not be queried as a pdf document: %s\n", [path UTF8String]);
		[pool drain];
		return 1;
	}
	
	if ([d isEncrypted]) {
		fprintf(stdout, "yes\n");
	} else {
		fprintf(stdout, "no\n");
	}
	[d release];
	
    [pool drain];
    return 0;
}

Hello Mr. Bell,

I guess I completely misunderstand you :wink: but to extract text from PDF files I use pdftotext from xpdf. Man page: http://manpagez.com/man/1/pdftotext/

works fine …
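
For what it’s worth, the “is it searchable?” test can be built on pdftotext in the same way as the zero-size check on the Automator output. Below is a minimal Python sketch, assuming `pdftotext` is installed and on the PATH. The function names are made up for this example, and the `extract` parameter exists only so the check can be tried without the tool.

```python
import subprocess

def run_pdftotext(pdf_path):
    """Return the bytes pdftotext extracts; "-" sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,  # raise if pdftotext exits with an error
    )
    return result.stdout

def is_searchable(pdf_path, extract=run_pdftotext):
    """Treat a PDF as searchable if extraction yields any non-whitespace text."""
    try:
        text = extract(pdf_path)
    except (subprocess.CalledProcessError, OSError):
        # pdftotext is missing, or it refused the file (e.g. it is protected)
        return False
    # strip() also removes the form feeds pdftotext writes between pages
    return bool(text.strip())
```

An image-only scan comes back empty, so `is_searchable` returns False for it, and also for files pdftotext cannot open.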

Hans

Thanks, Hans; I got it and installed it.

And thank you, Hank. I’ve never done it, but I’ll try. :slight_smile:

Hi Adam,

You can compile Hank’s code using the following from Terminal:

Just name the file “ispdfencrypted.m” or change the command above to match the file name.

Good, Craig showed how to compile the code without Xcode. It seems you want to write the text to a file in the cases where the PDF is encrypted, so I modified the code so that you can pass in a second argument: the password to use to decrypt an encrypted file.

This new code still passes back “no” when the file is not encrypted. If the file is encrypted and you supply the correct password, it passes back the text of the PDF (which you can then write to a file using AppleScript). If you don’t give a correct password (or don’t include a password at all), the tool passes back “yes” like before. Here’s an example AppleScript to show how you can use this new code.

set theResult to do shell script "pdfIsEncrypted '/path/to/pdf' 'password'"

if theResult is "no" then
	return "the file is not encrypted"
else if theResult is "yes" then
	return "the file is encrypted but we could not decrypt it"
else
	return "the file text of the encrypted pdf is:" & return & theResult
end if

And here’s the new Objective-C code.

#import <Foundation/Foundation.h>
#import <Quartz/Quartz.h>

int main (int argc, const char * argv[]) {
    NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];
	
	// see if help is being requested
	NSArray* pInfo = [[NSArray alloc] initWithArray:[[NSProcessInfo processInfo] arguments]];
	NSUInteger pCount = [pInfo count];
	if (pCount == 1) {
		fprintf(stderr, "Please pass in the path to a pdf file to find out if it is encrypted with a password. You can get the text of an encrypted file if you pass in a password as the second argument.\n");
		return 1;
	} else if ([[pInfo objectAtIndex:1] isEqualToString:@"-h"]) {
		fprintf(stdout, "Pass in the path to a pdf file to find out if it is encrypted with a password. You can get the text of an encrypted file if you pass in a password as the second argument.\n");
		return 0;
	}
	
	// get the arguments
	NSString* path = [pInfo objectAtIndex:1];
	NSURL* u = [NSURL fileURLWithPath:path];
	
	// look for a password
	NSString* pass = nil;
	if (pCount > 2) pass = [pInfo objectAtIndex:2];
	[pInfo release];
	
	PDFDocument* d = nil;
	d = [[PDFDocument alloc] initWithURL:u];
	if (!d) {
		// initWithURL: returned nil, so there is no document to release
		fprintf(stderr, "The file could not be queried as a pdf document: %s\n", [path UTF8String]);
		[pool drain];
		return 1;
	}
	
	if ([d isEncrypted]) {
		if (pass && [d unlockWithPassword:pass]) {
			fprintf(stdout, "%s\n", [[d string] UTF8String]);
		} else {
			fprintf(stdout, "yes\n");
		}
	} else {
		fprintf(stdout, "no\n");
	}
	[d release];
	
    [pool drain];
    return 0;
}

You may want to try the ‘kMDItemSecurityMethod’ attribute in the PDF’s Spotlight metadata too.
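
One way to read that attribute from a script is the `mdls` command-line tool. The Python sketch below is a guess at how that might look, not tested advice. The function names are made up, and it assumes that `mdls -raw` prints `(null)` when the attribute is missing, and that Spotlight reports values such as “None” or “Password Encrypted” for indexed PDFs.

```python
import subprocess

def parse_mdls_raw(output):
    """Normalise `mdls -raw` output; None means the attribute was not set."""
    value = output.strip()
    # Assumption: mdls prints "(null)" for an attribute the file lacks.
    return None if value in ("", "(null)") else value

def security_method(pdf_path):
    """Ask Spotlight for kMDItemSecurityMethod (macOS only)."""
    result = subprocess.run(
        ["mdls", "-raw", "-name", "kMDItemSecurityMethod", pdf_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_mdls_raw(result.stdout)
```

The attribute is only present once Spotlight has indexed the file, so a `None` result does not prove the PDF is unprotected.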

Wow, folks. I’m in for a bit of a learning curve, but I’ll get there :wink:

Thank you all. I’ll probably resurrect this thread when I hit a wall. :mad: