The trick is YOU need to know the File Type and File Creator in order to make the fixes. Mac OS X is kinda dumb on this front, I found much to my dismay.
We had a bunch of Mac files in a Digital Asset Manager (DAM) that handled Mac files fine. We had to migrate to a new system, and all seemed well until we realized far too late that all File Type and File Creator information got stripped. Lost the resource fork, essentially.
Since we couldn’t fix the hundreds of thousands of assets (nearly 1.2 TB) server-side, we opted for a “fixit” routine I wrote for Mac folks to run after they download the files. After the download they drag-n-drop the files onto my droplet and it fixes over 95% of issues, including restoring previews in image files. It also makes minor changes, like adding extensions to files that didn’t have them, or fixing extensions that the asset manager “gives” them because they had none (the DAM guesses wrong a lot).
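To make the extension cleanup concrete, here is a minimal sketch in Python (not the droplet's actual code). It assumes the correct extension has already been worked out some other way; the function name `corrected_name` and the `KNOWN_EXTS` set are made up for illustration.

```python
import os

# Hypothetical set of extensions the DAM might have guessed; adjust to taste.
KNOWN_EXTS = {".pdf", ".ai", ".eps", ".tif", ".psd", ".qxd"}

def corrected_name(name, true_ext, known_exts=KNOWN_EXTS):
    """Return `name` with `true_ext` applied.

    - already correct: leave the name alone
    - a recognized but wrong extension (a bad guess): replace it
    - no recognized extension: append the right one
    """
    stem, ext = os.path.splitext(name)
    if ext.lower() == true_ext:
        return name
    if ext.lower() in known_exts:
        return stem + true_ext
    # Anything else after a dot is treated as part of the name, not an extension.
    return name + true_ext
```

So `corrected_name("logo", ".ai")` appends the extension, while `corrected_name("logo.pdf", ".ai")` swaps out the wrong guess. Checking against a known list keeps names like "v1.2 draft" from being mangled.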
I did all this “magic” by scouring our Macs for every common file type the DAM contained, even scouring old archives, and then creating a chart of Type and Creator. This involved a script that would write to a text file the file name, Creator, and Type. Then I had to see how much of the Type/Creator info was unique to certain extensions (this gives the File Type part of the equation). I also had to use hexdumps to view file contents looking for unique strings (the Creator Type detective work). Then I had to reference that against things I could figure out from a file that didn’t have this information, which meant reading headers, knowing something about our internal file naming standards, etc.
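For anyone wanting to try the same kind of survey, here is a rough sketch in Python (not the script I actually used). It relies on the fact that a file's Finder info is a 32-byte blob whose first four bytes are the Type and next four the Creator. The function names and the sample codes in the usage note are illustrative only.

```python
import os
from collections import defaultdict

def _code(raw):
    """Decode one four-byte code; all zeros means the code was never set."""
    if raw == b"\x00\x00\x00\x00":
        return ""
    return raw.decode("mac_roman", errors="replace")

def parse_finder_info(blob):
    """Return (file_type, creator) from a Finder info blob."""
    if len(blob) < 8:
        return ("", "")
    return _code(blob[0:4]), _code(blob[4:8])

def build_chart(rows):
    """rows: iterable of (filename, file_type, creator).

    Returns {extension: {(file_type, creator): count}} so you can see
    which extensions always map to the same pair.
    """
    chart = defaultdict(lambda: defaultdict(int))
    for name, file_type, creator in rows:
        ext = os.path.splitext(name)[1].lower()
        chart[ext][(file_type, creator)] += 1
    return chart

def unique_extensions(chart):
    """Extensions that were only ever seen with a single Type/Creator pair."""
    return {ext: next(iter(pairs))
            for ext, pairs in chart.items() if len(pairs) == 1}
```

Feeding it rows like `("a.pdf", "PDF ", "CARO")` and `("b.pdf", "PDF ", "ART5")` shows right away that ".pdf" alone does not pin down the Creator, which is where the content sniffing comes in. On a Mac the blob itself would come from the `com.apple.FinderInfo` extended attribute; how you read that is up to you.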
In the end, I had 40 cases I had to do lookups against. I narrowed the list down to a tree of conditions. Sometimes a single condition defined a Creator/Type pair; sometimes it spawned a sub-tree of conditionals to check; sometimes a “this condition AND that condition” had to be met, which could then spawn more sub-conditions. In other words, nowhere near straightforward.
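The shape of that tree can be sketched in Python roughly as follows. This is a toy illustration, not my real rule set: the `Rule` class, the `resolve` function, the fact keys, the "scan_" naming convention, and the Type/Creator codes are all assumptions picked for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

@dataclass
class Rule:
    test: Callable[[dict], bool]               # condition on what we know about the file
    result: Optional[Tuple[str, str]] = None   # (Type, Creator) if this node decides
    children: List["Rule"] = field(default_factory=list)  # more specific sub-conditions

def resolve(rules, facts):
    """Walk the tree depth-first; the most specific matching rule wins."""
    for rule in rules:
        if not rule.test(facts):
            continue
        hit = resolve(rule.children, facts)
        if hit is not None:
            return hit
        if rule.result is not None:
            return rule.result
    return None

# Illustrative rules only.
RULES = [
    # Single condition with a sub-tree: looks like a PDF, but might be Illustrator.
    Rule(lambda f: f["header"].startswith(b"%PDF-"),
         result=("PDF ", "CARO"),
         children=[Rule(lambda f: f["has_ai_marker"], result=("PDF ", "ART5"))]),
    # "This condition AND that condition", using a made-up naming standard.
    Rule(lambda f: f["ext"] == ".tif" and f["name"].startswith("scan_"),
         result=("TIFF", "8BIM")),
]
```

Calling `resolve(RULES, facts)` with a dict of facts gathered from the file returns the first pair it can commit to, or `None` when nothing matches. Keeping the rules as data at least makes it easier to rebuild one branch without touching the rest.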
In fact, last week we found that a particular “guess” made by the DAM left some files “unfixed” by my script. I narrowed the problem down to a set of circumstances I didn’t predict, and I will have to re-assess all my nested conditionals and rebuild one of the branches for PDF-vs-Illustrator detection.
One thing I learned: you can’t rely on the Mac OS itself. It is “guessing” based on the existing file extension, and if none exists, it’s a horrible guesser. For example, Adobe Illustrator CS files look like PDFs, so the Mac OS can’t tell the difference between a “real” PDF and an Illustrator file without an extension. The only real way to tell is to GREP the contents of the file for specific text strings embedded therein. This is further complicated by the “encrypted” file formats used by some companies (cough Quark XPress cough) that have no English-readable strings for GREP to find. Those were fun…NOT.
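Here is a bare-bones Python sketch of that kind of content check for the PDF-vs-Illustrator case. It is an illustration, not my actual handler. The `%PDF-` header is standard; the `AIPrivateData` marker is my assumption about what a PDF-compatible Illustrator file contains, so verify it against your own files. The function names and the read limit are made up.

```python
# Assumed marker for PDF-compatible Illustrator files; check against real samples.
AI_MARKERS = (b"AIPrivateData",)

def classify_bytes(data):
    """Return "illustrator", "pdf", or "unknown" for a chunk of file content."""
    if not data.startswith(b"%PDF-"):
        return "unknown"
    if any(marker in data for marker in AI_MARKERS):
        return "illustrator"
    return "pdf"

def classify_file(path, limit=4 * 1024 * 1024):
    """Classify a file by reading at most `limit` bytes from its start."""
    with open(path, "rb") as fh:
        return classify_bytes(fh.read(limit))
```

The read limit keeps big files from being pulled into memory whole, at the cost of missing a marker that sits further in, so treat "pdf" as "no marker found in the part I looked at". None of this helps with formats that carry no readable strings at all.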
If anyone is interested, I can post the code I came up with, but it’s pretty long. It’s not as elegant as it could be, and is very biased towards graphics files (I work in the Creative Services department). I’ve also since learned better ways to do things like file extension detection; I just don’t have the time to go back and rewrite the appropriate handlers.