File Organization with AppleScript (100,000+ files)

Hello, my name is Ryan, and I am the production manager for a publishing company that produces two 100 page + magazines, with print runs of around 200,000. We have an absolute nightmare of a server, and it contains something like 150,000 media files that are shoddily organized by hand, and rife with duplicates. When we package the magazines to send to our print techs, it creates a ton of duplicate files, and I need a way to get rid of them and organize the server.

So… I was hoping that if I wrote some pseudo-code you guys could tell me if it was really possible to do, so here goes:

---------------- Image Organization -------------------

in volume(server1) scan all files
if document type = image(.tiff, .jpg, .psd, etc…)
then create folders (#,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z) in (/server/images)
if document name starts with ‘a’ then place in (/server/images/a)
repeat for A-Z
if duplicate file = true then place in (/server/images/duplicatestoreview)


Is this possible? Is it a really bad idea? Will it choke on 150,000 images?

Thanks!

-Ryan

Also, if one of you Applescript gurus was interested in making a really great script for us, I imagine I could talk my boss into paying you for your time. If it’s easy though, I hope I can do it myself.

Hi Ryan - what you’re looking to do is relatively easy with a script - but AppleScript may be too slow to do it in an acceptable timeframe. I suggest a shell script - it will be far faster and easier to build. Also - if you’re looking for duplicates, there are 2 ways to programmatically check. One is by filename (i.e. two files with the same name MIGHT be dups), and the other is by the image’s MD5. The MD5 is a checksum calculated from a file’s contents - you compute it for one image, then compare it to another image’s…if the checksums match - then you almost DEFINITELY have a duplicate (byte-for-byte identical files always give the same MD5). Also, are all the images named with the proper extension? If they aren’t, we’ll have to write something that determines the filetype of the image - not difficult - it just adds a few lines and some time to the script.
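Here is a rough sketch of the checksum approach, just to show the idea. It assumes either md5sum (Linux) or md5 (Mac OS X) is installed, and the function names file_md5, list_checksums and find_duplicates are made up for this example. It also assumes no path contains a newline. It only reports duplicates and does not move or delete anything.

```shell
#!/bin/sh
# Sketch: report files whose contents have the same MD5 checksum.
# Assumption: md5sum (Linux) or md5 (Mac OS X) is available.

# Print the MD5 checksum of one file.
file_md5() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum < "$1" | awk '{print $1}'
    else
        md5 -q "$1"
    fi
}

# Print "checksum<TAB>path" for every non-hidden file under directory $1,
# sorted so that identical checksums end up next to each other.
list_checksums() {
    find "$1" -type f ! -name ".*" | while IFS= read -r path; do
        printf '%s\t%s\n' "$(file_md5 "$path")" "$path"
    done | sort
}

# Print only the lines whose checksum occurs more than once.
find_duplicates() {
    list_checksums "$1" | awk -F '\t' '
        { count[$1]++; lines[$1] = lines[$1] $0 "\n" }
        END { for (sum in count) if (count[sum] > 1) printf "%s", lines[sum] }'
}

# Demo in a scratch directory: two identical files and one different file.
demo=$(mktemp -d)
printf 'same pixels'  > "$demo/DSC001.jpg"
printf 'same pixels'  > "$demo/copy_of_DSC001.jpg"
printf 'other pixels' > "$demo/DSC002.jpg"
find_duplicates "$demo"    # lists the two identical files, not DSC002.jpg
rm -rf "$demo"
```

On 150,000 files, most of the time goes into reading every file once to compute its checksum, so a run over the whole server would best be tried on a small folder first.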

That said - if you could get me some additional details of the challenge - i.e. how are the folders arranged now, how much data are we talking about (in GB), etc… I would be happy to take a stab at it.

Thanks
Chris

Hi there.

There is something around 180 GB of files. They’re organized all over the place right now, in things like /server/xsortfile/randomimage.tiff and sort of semi-organized in an images folder with sub-folders from #-Z. There are a good number of duplicated file names - something like 40,000 (eeek) - but many of them are not true duplicates: they’re client-provided digital camera shots with names like DSC001.jpg. There might be 10 files with that particular name, but they’re all different images. I recently joined the company and I’m trying to get things straightened out. Hope that helped.

Thanks!

-Ryan

Hi Ryan - a few more questions:

Do they all have file extensions?

And what would you like the default behavior to be if it encounters 2 files of the same name that are actually different images? For example - should they be named DSC1000-1.jpg, DSC1000-2.jpg, etc…?
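To make the question concrete, here is one possible behavior as a sketch, not a decision. It files an image into a first-letter folder (A-Z, or # for anything else) and, if the name is already taken, adds -1, -2, etc… before the extension. In this variation the first file keeps its plain name. The function names bucket_for, unique_dest and file_image are invented for the example, and it does not check whether the two files are true duplicates.

```shell
#!/bin/sh
# Sketch: move one image into a first-letter folder without overwriting
# an existing file of the same name.
LC_ALL=C; export LC_ALL    # keep A-Z matching limited to plain ASCII letters

# Print the folder name for file name $1: A-Z, or "#" for anything else.
bucket_for() {
    first=$(printf '%s' "$1" | cut -c1 | tr '[:lower:]' '[:upper:]')
    case "$first" in
        [A-Z]) printf '%s\n' "$first" ;;
        *)     printf '#\n' ;;
    esac
}

# Print a path inside directory $2 for file name $1 that is not in use yet:
# name.ext first, then name-1.ext, name-2.ext, and so on.
unique_dest() {
    name=$1
    dir=$2
    case "$name" in
        *.*) base=${name%.*}; ext=.${name##*.} ;;
        *)   base=$name;      ext= ;;
    esac
    candidate=$dir/$name
    n=1
    while [ -e "$candidate" ]; do
        candidate=$dir/$base-$n$ext
        n=$((n + 1))
    done
    printf '%s\n' "$candidate"
}

# Move file $1 into the right letter folder under $2 and print where it went.
file_image() {
    src=$1
    root=$2
    name=$(basename "$src")
    dir=$root/$(bucket_for "$name")
    mkdir -p "$dir"
    dest=$(unique_dest "$name" "$dir")
    mv "$src" "$dest"
    printf '%s\n' "$dest"
}

# Demo in a scratch directory: two different files that share one name.
demo=$(mktemp -d)
mkdir "$demo/in1" "$demo/in2"
printf 'one' > "$demo/in1/dsc001.jpg"
printf 'two' > "$demo/in2/dsc001.jpg"
file_image "$demo/in1/dsc001.jpg" "$demo/images"   # ends up as D/dsc001.jpg
file_image "$demo/in2/dsc001.jpg" "$demo/images"   # ends up as D/dsc001-1.jpg
rm -rf "$demo"
```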

Let me know
Thanks
Chris

Ryan - copy and paste this line into a Terminal window - it’s just a find command (it only lists files, it doesn’t change them) but it will give you an idea of the power of Unix.

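This should be one line - no breaks:
find ~/ -type f ! -name ".*" \( -name "*.[Tt][Ii][Ff]" -or -name "*.[Tt][Ii][Ff][Ff]" -or -name "*.[Ee][Pp][Ss]" -or -name "*.[Jj][Pp][Gg]" -or -name "*.[Pp][Ss][Dd]" \)

Basically - this searches your home directory “~/” for “files” (skipping hidden ones whose names start with a dot) that have names ending in .tif, .tiff, .eps, .jpg or .psd. The [Ee] means match upper or lower case, etc… The \( \) parentheses group the -or tests so that “-type f” and the hidden-file check apply to every extension, not just the first one. Use plain straight quotes when you type it - curly quotes will not work in the shell.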

to put the results into a file - simply add:

> ~/Desktop/findresults.txt

to the end of that line and it will give you a text file on your desktop that will have all the paths to all the matching files.
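As a small sketch of that save-to-a-file step, the lines below wrap the search in a made-up function called list_images, send its output to a file with ">", and then count the lines. It uses -iname (case-insensitive matching) instead of the [Tt][Ii][Ff] patterns. As far as I know both the Mac and GNU versions of find support it, but it is not part of the POSIX standard, so treat it as an assumption. The demo runs in a scratch directory instead of your home directory.

```shell
#!/bin/sh
# Sketch: save the list of matching image paths to a text file, then count them.

# Print every non-hidden image file under directory $1.
# -iname matches regardless of case (supported by BSD and GNU find).
list_images() {
    find "$1" -type f ! -name ".*" \( -iname "*.tif" -o -iname "*.tiff" \
        -o -iname "*.eps" -o -iname "*.jpg" -o -iname "*.psd" \)
}

# Demo in a scratch directory with two images, one text file, one hidden file.
demo=$(mktemp -d)
touch "$demo/a.TIF" "$demo/b.jpg" "$demo/c.txt" "$demo/.hidden.jpg"

# ">" creates the file or overwrites it; ">>" would append to it instead.
list_images "$demo" > "$demo/findresults.txt"

# One path per line, so the line count is the number of matching files.
wc -l < "$demo/findresults.txt"    # two matches here: a.TIF and b.jpg
rm -rf "$demo"
```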

let me know how this works out for you,
thanks
Chris