Reading File Header?

I’ve got a series of Mac files that were converted from one system to another, and only after it was too late to do much about it did we realize the files had been stripped of their Macintosh-specific information (the resource fork, presumably). So file type and creator codes are gone, as are icon previews and the embedded previews for file types like EPS.

I’ve written a script to “fix” some of these files when they are retrieved from the new system, based on certain assumptions. For example:

–If it ends in “.psd” make the appropriate changes to File Type and Creator, as well as open the file in Photoshop and re-save to get the preview back.

–If it ends in “.ai” or “.indd”, fix the File Type and Creator to Illustrator and InDesign

And other such basic assumptions. However, about 25% of the files have no extensions. The average human using these files can figure out what they are by looking at the file names or whatnot and then fix them manually, but that logic is too “fuzzy” to automate.

I’d like to automate this “hole.” I figured that if I could peer inside the files’ “guts” and work out what to look for in there, I could combine that with a series of checks to fix all of a file’s resource information.

But the first trick is to look inside.

I got BBEdit to peek at the raw file code of some file types, like Illustrator, by simply changing the file extension to “.txt” to trick it into opening them. Sure enough, the word “Illustrator” is in there near the top.

However, BBEdit will try to display graphics images (like JPEG) even when you give it fake extensions. Apparently it is pre-emptively peeking inside the headers itself (darn smart buggers those BareBones folks).

So what I need is a program, preferably scriptable (but I’ll resort to UI scripting if I have to), that I can have “peek inside,” look for certain strings, then fix the resource information via the Finder. I had hoped BBEdit would work, which would mean TextWrangler might have worked, since being able to access BBEdit’s GREP engine would have been useful in this automation.

One last thing: I can’t fix the files where they reside. They “live” in a Windows-based asset management system. I can only fix them at the user level, after they download them. So this script will see heavy use. My first version of this “fixit” script was a drag-n-drop script; the next version is going to get a host of enhancements via FaceSpan.

Anybody have any ideas?

If you’re looking for ASCII markers inside those files, you might use the shell command ‘cat’. I tried:


set catFile to (do shell script "cat " & quoted form of (POSIX path of (choose file)))
if catFile contains "MSWordDoc" then
	display dialog "seems to be a MSWord Document"
else if catFile contains "FreeHand11" then
	display dialog "seems to be a FreeHand 11 Document"
end if

and it seems to work.

Quick reply. You may be able to make use of the “file” command line tool to determine what kind of file it is.

Quick examples:

/usr/bin/file -b /Users/bruce/Dev/-misc/Large\ Arrow.psd

Adobe Photoshop Image

/usr/bin/file -bi /Users/bruce/Dev/-misc/Large\ Arrow.psd

image/x-photoshop

Here’s a script that may point you in the right direction.

http://forums.macosxhints.com/showthread.php?t=56435

tell application "Finder"
	set fileselection to selection as list
	
	repeat with i from 1 to number of items in fileselection
		tell application "Finder"
			set this_item to item i of fileselection as string
			set this_itemz to POSIX path of this_item
			
			set docname to name of (info for alias this_item)
		end tell
		set displaytype to do shell script "hexdump -C " & quoted form of this_itemz & " | head -1 | open -f"
		tell application "TextEdit"
			set name of window 1 to docname
		end tell
	end repeat
end tell

Sorry it took me so long to get back to everyone. I tried this method and it apparently relies on the Resource Fork to provide accurate information. As I mentioned, in my case the resource fork is gone. So I get weird things like “octet-stream” for files I know are Photoshop EPS and “application/postscript” for files I know are Illustrator. This is a cool trick for files with the normal Mac-specific information, and I’m going to log it in the back of my brain.

For the record, this is my first time playing with doing command-line commands from within AppleScript. Very cool.

–Kevin

This is about what I figured I’d have to do…get something to pry open the file in some text-readable format and scan for specific strings. My next trick is to see if I can make it discern between, say, a Photoshop EPS and a Photoshop TIFF. It should just be a matter of going into every application I anticipate files from, running “cat” on their files, then creating a series of checks to find the same information in the files that I need to “fix.”

Nice, and thanks!

EDIT: I’m running into a problem, though, that I suspect is due to some sort of limit on how much a variable can store, or a limitation on the amount of memory AppleScript can use. When I run “cat” on files from Photoshop and put the result into a variable, the AppleScript crashes. I suspect a limit on how many characters can be in a variable or how much memory it can use…is there such a limit?

Even a basic Illustrator file’s “cat”-dump made a 4.3 MB text file from a 2.1 MB Illustrator file. I tried a 14.1 MB Photoshop EPS file and instead of crashing I got an Out of Memory error. A 4.6 MB Photoshop file made it okay (4.6 MB text file). Problem is, I will need to parse files in the 500 MB and up range. :frowning:

I can’t use the “first X characters” since I found Word/Excel identifiers near the end of a “cat” dump. Adobe puts Illustrator info near the top, but InDesign and Photoshop info near the middle. :-/

I’m assuming this will thwart Mike’s dump-to-file method as well since a variable has to store the value first.

–Kevin

Correct me if I’m wrong, but this seems like a brute-force version of Dominik’s solution. Or does it have advantages I don’t see? I haven’t tried it yet since I’d prefer if I could avoid having a series of text files created/destroyed when the user does a whole bunch of files at once.

But I will certainly keep this in mind if the “cat” solution doesn’t seem to pan out.

I did borrow the loop itself as the starting point to put Dominik’s script in. This will be a “droplet,” so it needs to be able to parse through multiple files via the “open” handler.

EDIT: Well I gave this script a try, but I’m not sure what I’m looking at here. For a file I know is Illustrator, I get back:

00000000 25 21 50 53 2d 41 64 6f 62 65 2d 33 2e 30 20 0d |%!PS-Adobe-3.0 .|

For a Photoshop file:

00000000 c5 d0 d3 c6 1e 00 00 00 c2 85 5d 00 00 00 00 00 |..........].....|

So I’m not sure what I’ve gained here, unless there is something in the hex that should be helping me? I need a little help here.
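What the hex offers is the file's leading "magic number": many formats start with a fixed byte signature. Here is a minimal shell sketch of checking it, assuming only the first four bytes matter. The function name `sniff_magic` and the labels it prints are made up for illustration, and the table covers just a handful of signatures (`%!PS` for PostScript text, `c5 d0 d3 c6` for an EPS with a binary preview header, `8BPS` for a Photoshop document, `ff d8 ff` for JPEG, and the two TIFF byte orders).

```shell
# sniff_magic FILE
# Prints a guess at the file type based on the first four bytes of FILE.
# The function name and the labels are hypothetical, chosen for this sketch.
sniff_magic() {
    # od prints the bytes as hex pairs; tr strips the spaces and the newline.
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    case "$magic" in
        25215053)          echo "PostScript or EPS text (%!PS)" ;;
        c5d0d3c6)          echo "EPS with binary preview header" ;;
        38425053)          echo "Photoshop document (8BPS)" ;;
        ffd8ff*)           echo "JPEG" ;;
        49492a00|4d4d002a) echo "TIFF" ;;
        *)                 echo "unknown ($magic)" ;;
    esac
}
```

Read against the two dumps above, the first file starts with `%!PS` and the second with `c5 d0 d3 c6`. The signature alone will not say which application wrote a PostScript file, so a check like this narrows the candidates rather than settling them.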

TO EVERYONE: Great responses, thanks a bundle!

–Kevin

I figured I’d “finish this out” since it kinda died…

I’ve opted for a hybrid approach, a lot more complex to write, but it should also run faster:

–For certain file types I can assume the file type and creator from the extension. Any JPG, PSD, TIF, etc. here had to be created in Photoshop, and the extension itself lets me set the file type.

–Certain other “known extensions” can be fixed via simple checks. INDD, QXD, AI, etc.

–We used some odd naming conventions in the past for EPS files, and those standards always applied to Photoshop files. So I can weed out non-vector EPS files.

The above checks run fast and don’t require a lot of memory, just a complex series of if/then/else checks versus a set of list values.

For files without extensions, or EPS files of unknown origin, it looks like I can use the “file” command on most of these, if not all of them. I haven’t done exhaustive tests on all the file types the script may encounter. If I run into ambiguous returns (if “file” returns an answer applicable to multiple file types) then I’ll be forced to use “cat” on that file.
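For anyone following along, the extension checks boil down to a lookup table. Below is a minimal shell sketch of one, assuming the “everything came from Photoshop” rule described above. The function name `codes_for_extension` is hypothetical, and the three entries are only an illustrative subset of type/creator pairs (8BIM being Photoshop's creator code); setting the codes on the file is left to whatever the script already uses.

```shell
# codes_for_extension FILENAME
# Prints "TYPE CREATOR" for extensions in the table; returns 1 otherwise.
# Hypothetical helper: the table is an illustrative subset, not a full list.
codes_for_extension() {
    # A name without a dot has no extension to look up.
    case "$1" in
        *.*) ;;
        *)   return 1 ;;
    esac
    # Lowercase the part after the last dot so ".PSD" and ".psd" both match.
    ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
    case "$ext" in
        psd)      echo "8BPS 8BIM" ;;
        jpg|jpeg) echo "JPEG 8BIM" ;;
        tif|tiff) echo "TIFF 8BIM" ;;
        *)        return 1 ;;
    esac
}
```

A non-zero return is the cue to fall through to the slower content-based checks.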

If anyone has follow-up feedback, I’ll stay subscribed to this one. I’m still coding (wearer of many hats, constantly interrupted) so if something juicy comes along I’ve still got time to rethink my approach. :wink:

Thanks for your help,

–Kevin

Did not know this was still going,

I was gathering this script while Dominik was posting so did not see his until after I posted.

It was meant as an example of how you may be able to look inside the file. I only have jpegs and other image files on this mac…

when I run the script it gives me for a jpeg

00000000 ff d8 ff e0 00 10 4a 46 49 46 00 01 01 01 00 48 |......JFIF.....H|

The text files are what I need; again, it’s an example for you, not the end product.

That said, if I apply Dominik’s if/then statements and take out the head command, it does the same thing.

But Dominik’s works much, much faster.

mark,

Gave yours a whirl, removing the “head” command from your hexdump script, and AppleScript seems much more inclined to do this without crashing than when using “cat.” At first glance there seems to be good information hiding in the text it returns. For example, one problem I am having is telling a Photoshop 7 file from a Photoshop 10 file (just as a quick example).

Instead of dumping to a file, what would be an efficient way to search the return from a hexdump within AppleScript? For example, if I wanted to find “Photoshop 7.0” within all that hex? Is there a “find” or “search” available for strings, or would it be more efficient to just script into a text editor I can “hide” from the user while it’s working?
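One way to keep the search out of AppleScript variables entirely is to let the shell do it and hand back only the match. This is a minimal sketch, assuming the version appears in the file as plain text of the form “Photoshop 7.0”; the function name `find_photoshop_version` and the pattern are assumptions, not something confirmed for every Photoshop file.

```shell
# find_photoshop_version FILE
# Prints the first "Photoshop N.N" string found in FILE, or nothing at all.
# Hypothetical helper: the name and the pattern are assumptions.
find_photoshop_version() {
    # Turn every non-printable byte into a newline so the binary file becomes
    # a stream of short text lines, then pull the version string out of it.
    LC_ALL=C tr -c '[:print:]' '\n' < "$1" |
        sed -n 's/.*\(Photoshop [0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' |
        head -n 1
}
```

Called through “do shell script,” only the short result string comes back, however large the file is.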

Thanks for following-up, may have something here that will get more information when returns from “file” are still too vague, but without the crashing caused by large “cat” returns.

Still all tentative until I can write this into the larger script. Right now I’m testing these detection routines as separate snippets.

–Kevin

I was given a faster, better way to do this by a colleague at work whom I was bouncing ideas off of. So I have a “solution” now that others might like:

PROBLEM: Scan files for a given string as a means to detect file attributes. In my case, I have thousands of files that had their Macintosh File Type and File Creator wiped (long story…moral: don’t trust Windows-based digital asset management systems). Many of the files are older, with no extensions. I needed to restore “double-clickability” for my users.

SOLUTION: Use “grep”!

First, I created a simple droplet using the “cat” technique from this thread. Then I saved sample documents from every program and every file type we use for graphics around here, including some fairly obscure ones: over 56 files in all. Then I looked over the “cat” dump of each one for a unique string, or at least for strings that would help me narrow the possibilities, which I could combine with the other checks I was using.

In order to know if a file had a given string, I would load the string or strings into a list, which would then be passed to this:


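-- Try each candidate string in turn and stop at the first one grep finds.
-- Returns {match count, string tried last}; the count is 0 when nothing matched.
on grepForString(path_to_fix, search_list)
	set grep_result to 0
	set current_item to missing value
	repeat with i from 1 to number of items of search_list
		set current_item to item i of search_list
		try
			set grep_result to grepMe(path_to_fix, current_item)
			exit repeat
		on error
			-- grep exits with a non-zero status when nothing matches,
			-- and do shell script reports that as an error
			set grep_result to 0
		end try
	end repeat
	
	return {grep_result, current_item}
end grepForString

on grepMe(path_to_fix, search_item)
	set path_to_fix_POSIX to POSIX path of path_to_fix
	-- quoted form of protects both the search string and the path from the shell
	set shell_string to ("grep -ch -e " & quoted form of search_item & " " & quoted form of path_to_fix_POSIX)
	return (do shell script shell_string)
end grepMe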

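I couldn’t figure out why the grep query was failing when it was called directly in grepForString without the try. Drove me nuts (again). But a search of MacScripter showed me this was a known issue: grep exits with a non-zero status when it finds no match, and “do shell script” reports any non-zero exit status as an error. Yay MacScripters!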

Since my grep check was nested in a repeating loop, this error killed the loop. So it was necessary to isolate the grep query itself in a separate handler so it could fail without disturbing the loop.

So all I did was simply check if the grep result was greater than zero, in which case I knew the string(s) I was after was found. The return also gives me which string in the list was what it found.

This method is fast, even on 200+ MB Photoshop files.
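For comparison, here is the same first-match lookup as a minimal shell sketch, which also shows the exit-status behavior the try block is absorbing. The function name `first_marker` is made up for illustration; the marker strings in the usage line are the ones quoted earlier in the thread.

```shell
# first_marker FILE MARKER...
# Prints the first MARKER that occurs in FILE and returns 0;
# prints nothing and returns 1 when none of them occur.
# Hypothetical helper written for this sketch.
first_marker() {
    file_to_scan=$1
    shift
    for marker in "$@"; do
        # -q: no output, the exit status is the answer; -F: literal string.
        # grep reads the file as a stream, so the file size is not held in a variable.
        if LC_ALL=C grep -q -F -e "$marker" "$file_to_scan"; then
            printf '%s\n' "$marker"
            return 0
        fi
    done
    return 1
}
```

Usage would look like `first_marker "$path" "FreeHand11" "MSWordDoc"`. Because the `if` tests grep's exit status directly, a missing string is just a “no” rather than an error.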

NOTE:
Silly me, my friend and I assumed a command called “POSIX path of” would actually generate a valid, escaped *NIX path. It doesn’t. Probably common knowledge, except for us apparently. So don’t forget the “quoted form of” to accommodate that problem. This one puzzled my friend and me for a bit, since the script would work on some files and not on others, until we stared carefully at their full paths and realized some had spaces and special characters. V8 forehead-slap
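The failure is easy to reproduce in the shell alone. A minimal sketch, assuming sh or bash (zsh does not split unquoted variables the same way); the temporary directory and the file name are made up for the demonstration.

```shell
# Assumption: a throwaway directory and file name chosen for illustration.
dir=$(mktemp -d)
path="$dir/Large Arrow.txt"
printf 'Illustrator\n' > "$path"

# Unquoted: the shell splits the path at the space, so grep is handed two
# file names that do not exist and fails.
if grep -c Illustrator $path >/dev/null 2>&1; then unquoted=ok; else unquoted=failed; fi

# Quoted: the whole path reaches grep as a single argument.
if grep -c Illustrator "$path" >/dev/null 2>&1; then quoted=ok; else quoted=failed; fi

echo "unquoted: $unquoted, quoted: $quoted"
rm -rf "$dir"
```

“quoted form of” does for the string AppleScript builds what the double quotes do here.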

If anyone has improvements on the technique, let me know.

–Kevin

I wanted everyone who helped me to know that my “File Fixer” application finished internal testing and goes live to all 40+ Macs starting tomorrow. It could probably be streamlined, and I wouldn’t mind a good “peer review,” but there are two problems:

–Its final form is a FaceSpan project, so I’m not sure how useful that is for anyone
–I’m not sure if my company would let me “release it to the wild.” You know, paid for by the company, phalanxes of lawyers, etc.

I could probably post some of the key handlers if anyone was interested (and told me where to put them).

But I did want to say “Thanks for all the help.” :slight_smile: