Extension Detection with a Twist

I have this really, really UGLY extension detection routine I could use some help streamlining.

Before I display the ugly code, here’s what it has to do:

–Determine the file extension, taking into account 2-digit, 3-digit, and 4-digit extensions (.ai, .eps, .indd as examples)
–It also has to take into account “special” extensions we used for quite a while (long story) that are 6-digit (like .eps_lr)
–It also has to take into account the possibility of false extensions added by our asset management software, which causes a whopping 10-digit extension (like .eps_lr.tif, notice the TWO periods)

The last one requires a bit of explanation. The asset manager often plays games with the extensions…sometimes adding its own, sometimes replacing with its own. Usually it replaces (xxx.eps becomes xxx.pdf), but because those 6-digit special extensions aren’t recognized, it simply tacks on an extension (xxx.eps_lr becomes xxx.eps_lr.pdf).

So in the third case I need to know the “real” extension (.eps_lr) and not the false part on the end added by the asset manager. Ugly, yes? :rolleyes:

My methodology as shown is to first capture the extension, then suss-out the false, super-long extensions as needed.

Oh, as a reminder to folks not used to my scripts, logMe is a generic log file writer I use. Feel free to strip that out to give feedback; I can always put stuff back as needed. There is A LOT of logging because I needed to track every logic step this script made to be 100% certain it got it right. There is NO margin for error either in this script or in the larger parent script it is part of (this is a chunk of my File Type/Creator Type Repair software from Code Exchange).

Okay, here’s what I have, which works like a charm, but isn’t exactly pretty:


--
-- lr/hr file extension parser
--
on lrhrParse(name_type, path_to_fix)
	tell application "Finder"
		set file_to_fix to name of path_to_fix as string
		set name_length to length of file_to_fix as number
	end tell
	
	if name_type is "long" then
		set extension_length to 10
	else
		set extension_length to 6
	end if
	
	if name_length is greater than extension_length then
		set this_extension to (text items (name_length - extension_length) thru name_length of file_to_fix) as string
		
		--check validity of extension (trap for false "." characters)
		set extension_validate to (text items 5 thru 7 of this_extension) as string
		if extension_validate is not in {"_lr", "_hr"} then
			return "<invalid>"
		else
			return this_extension
		end if
	else
		return "too short"
	end if
end lrhrParse

--
-- Determines "real" extension
--
on extensionFinder(path_to_fix)
	logMe("extensionFinder Handler Called", 3)
	
	--get some basic Finder information
	tell application "Finder"
		set file_to_fix to name of path_to_fix as string
		set name_length to length of file_to_fix as number
	end tell
	logMe("name length = " & name_length, 4)
	
	--parse end of file name for possible extension types
	--
	--define possible "_lr/_hr + CCM" extension, watch for files names too short for this check
	set lrhr_long_extension to lrhrParse("long", path_to_fix)
	logMe("lrhr_long_extension = " & lrhr_long_extension, 4)
	
	--define possible _lr/_hr extension, watch for files names too short for this check
	set lrhr_short_extension to lrhrParse("short", path_to_fix)
	logMe("lrhr_short_extension = " & lrhr_short_extension, 4)
	
	--define any Finder-recognized extension
	set standard_2_extension to (text items (name_length - 2) thru name_length of file_to_fix) as string --like Illustrator's .ai
	logMe("standard_2_extension = " & standard_2_extension, 4)
	
	set standard_3_extension to (text items (name_length - 3) thru name_length of file_to_fix) as string --standard 3-letter extensions
	logMe("standard_3_extension = " & standard_3_extension, 4)
	
	set standard_4_extension to (text items (name_length - 4) thru name_length of file_to_fix) as string --like InDesign's .indd
	logMe("standard_4_extension = " & standard_4_extension, 4)
	
	--figure out which extension type captured above is the "real" one
	--
	--is it "_lr/_hr + CCM"?
	if text item 1 of lrhr_long_extension is "." then
		logMe("EXTENSION TYPE: CADS-style Long (CADS-style + CCM)", 4)
		
		--remove the extra CCM extension
		set fixed_long_extension to (text items 1 thru ((length of lrhr_long_extension) - 4) of lrhr_long_extension as string)
		set my_extension to fixed_long_extension
		logMe("• CONVERT TO CADS-style SHORT: " & fixed_long_extension, 4)
	else
		-- is it simply _lr/_hr?
		if text item 1 of lrhr_short_extension is "." then
			logMe("EXTENSION TYPE: CADS-style Short", 4)
			set my_extension to lrhr_short_extension
		else
			--is it something like InDesign (.indd)
			if text item 1 of standard_4_extension is "." then
				logMe("EXTENSION TYPE: Finder Long", 4)
				set my_extension to standard_4_extension
			else
				--is it a typical 3-digit Finder?
				if text item 1 of standard_3_extension is "." then
					logMe("EXTENSION TYPE: Finder Typical", 4)
					set my_extension to standard_3_extension
				else
					--is it something like Illustrator (.ai)
					if text item 1 of standard_2_extension is "." then
						logMe("EXTENSION TYPE: Finder Short", 4)
						set my_extension to standard_2_extension
					else
						--no extension
						logMe("EXTENSION TYPE: none", 4)
						set my_extension to "none"
					end if
				end if
			end if
		end if
	end if
	
	logMe("extensionFinder Handler Finished", 3)
	
	return my_extension as string
	
end extensionFinder

For the third case, are the special cases always the same? That is, could you identify them with a list of text or by length?

The special 6-digit extensions are always in the format:

“real extension” + (“_lr” or “_hr”)

I’m not 100% certain which file types the convention got used on over the years; some combos, but maybe not all of them:

.eps_lr
.eps_hr
.psd_lr
.psd_hr
.tif_lr
.tif_hr

The “false” extension adds to the confusion because it is variable (I can’t figure out how it chooses). All of these are likely to happen:

.eps_lr.eps
.eps_lr.tif
.eps_lr.pdf

I’d rather err on the side of a broader detection rather than searching a list, just to be safe.

ALSO: My bad, I forgot that sometimes there is NO extension, these being older Mac files that didn’t have them.

The trick, if you wade through my current script, is trying to figure out which extension, if any, but separate from the possibility of legitimate filenames with periods in them. For example…

12345.123_ABC_FileName.eps_lr.tif

…is a perfectly possible combination, and the returned extension should be “.eps_lr”

Cringing yet? :stuck_out_tongue:
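
One way the rules above could be sketched is to split the name on periods and look at the last two pieces. This is only an illustration, not the routine from the post: the handler name realExtension is made up, and it assumes the special extensions are always exactly six characters ending in "_lr" or "_hr", as described above. A name with a period but no real extension (say "My.File") would still be reported as having one.

```applescript
-- Hypothetical handler: returns the "real" extension of a file name, or "none".
on realExtension(file_name)
	-- Split the name on periods, restoring the delimiters straight afterwards.
	set saved_delims to AppleScript's text item delimiters
	set AppleScript's text item delimiters to "."
	set parts to text items of file_name
	set AppleScript's text item delimiters to saved_delims
	
	-- No period at all: an old-style name without an extension.
	if (count parts) < 2 then return "none"
	
	-- If the second-to-last piece looks like "eps_lr" or "tif_hr", treat it as
	-- the real extension and the last piece as the false one that was tacked on.
	if (count parts) > 2 then
		set candidate to item -2 of parts
		if (count candidate) is 6 and (candidate ends with "_lr" or candidate ends with "_hr") then
			return "." & candidate
		end if
	end if
	
	-- Otherwise the last piece is the extension, whatever its length.
	return "." & item -1 of parts
end realExtension

realExtension("12345.123_ABC_FileName.eps_lr.tif")
--> ".eps_lr"
```

Because only the last two pieces are inspected, extra periods earlier in the name do not matter.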

A couple of things that contribute to the unfortunate looks, if I might discreetly mention them: :wink:

tell application "Finder"
	set file_to_fix to name of path_to_fix as string
	set name_length to length of file_to_fix as number
end tell
  1. The name will be returned as Unicode text. There’s no need to coerce it to string. In fact, it’s a bad idea to do so if the name might contain Unicode-only characters.
  2. The length will be returned as a number. There’s no need to coerce it to something it is already.
set extension_validate to (text items 5 thru 7 of this_extension) as string
  1. A ‘text item’ is a subsection of the main text that’s delimited in that text by the current value of AppleScript’s text item delimiters. If the delimiter value is “” or the default {“”}, a ‘text item’ is the same as a ‘character’. Otherwise it could be anything. So: if you mean characters, you should write ‘characters’; if you mean text items, the script should set the text item delimiters to something first.
  2. ‘(text items 5 thru 7 of this_extension) as string’ [or ‘(characters 5 thru 7 of this_extension) as string’] is two commands. One creates a list of the individual text items [or characters] and the other then coerces the list to string. When a list is coerced to string, the current value of AppleScript’s text item delimiters is interpolated between each item in the list in the final string.
set AppleScript's text item delimiters to "*" -- Say the delimiters have been left at this value.

set this_extension to ".eps_lr.tif"
set extension_validate to (text items 5 thru 7 of this_extension) as string
--> Error, because there's only 1 'text item' in the string. (No asterisks.)

set extension_validate to (characters 5 thru 7 of this_extension) as string
--> "_*l*r"

The same happens when coercing a list to Unicode text; the class of the result depends on the coercion specified.

A much better way to extract a section of text between two indexed characters is with ‘text’:

set extension_validate to text 5 thru 7 of this_extension

This is a single command, so it’s much more efficient. It’s immune to text item delimiters, it returns a result that’s the same class as the original text, and of course it’s less to type. :slight_smile:

Point taken Nigel, thanks!

I wanted this sort of nitpicking. I plan to comb the much larger script this is a part of from top-to-bottom and details like this may help. I knew it looked “unfortunate”…partly because I was very novice when I started and just stacked more advanced stuff on top of it without going back. Curse of being rushed. Since the parent script will be “headed for the wild,” I plan to push back for time to do it right.

I do admit a tendency to over-coerce. Because of my haste, I often found being overly explicit to be more reliable…assume nothing. I hate tracking down coercion errors, they make my head hurt (aliases versus paths and what syntax needs which is still painful, for example). I’ve been getting better…this script has a lot of “legacy” bits, some of it dating back to OS 8.

I didn’t know about the “character” and “text item” differences and can’t remember why I did that, since I know about “character.” Perhaps because I kept getting a list, and coercing to a string would have “fixed” it at that level.

I’ve yet to play with delimiters in a script. Another “something” I need to learn for certain character manipulations.

The final example you gave uses “text” and I’m confused: after the effort you went thru to explain “character,” why wasn’t that used?

Maybe if you could give me a quick synopsis of the differences between:

–text item
–text
–character

I could also use an explanation of when I should coerce to “UNICODE text” versus “string.” It sounds like they do the same thing, but I’m getting the impression it would be safer to always use “UNICODE text” instead of “string.” Or do they do other fundamentally different things? (i.e. it sounds like “UNICODE text” is basically “string but keep the funky characters in place”)

As always, thanks for the help!

Sorry that wasn’t clear, Calvin. (Kevin?) I think my explanation this morning was a little too verbose. (It’s my birthday today and my concentration hasn’t been all it might have been. That’s my excuse, anyway. ;))

I was trying to point out two things about your text extraction lines. Firstly, that using ‘text items’ when you mean ‘characters’ is wrong. Secondly, that, in any case, ‘text 5 thru 7 of this_extension’ is better than ‘(characters 5 thru 7 of this_extension) as string’, for the reasons stated.

Perhaps a few examples would help:

set myText to "The quick brown fox jumps over the lazy dog."

-- Text items:
-- AppleScript's text item delimiters "belong" to the copy of AppleScript possessed by the application running the script.
-- Thus each script-running application has its own, independent set of delimiters, which start off with the value {""} when the application's launched.
-- The list aspect and plural keyword are for "possible future expansion", and have been for several years. Only one delimiter at a time is currently heeded.
-- If an application runs two or more scripts without quitting (eg. scripts in different Script Editor windows), the scripts all share the same delimiters. 
set AppleScript's text item delimiters to "" -- = {""}, the default start value.
get myText's text items
--> {"T", "h", "e", " ", "q", "u", "i", "c", "k", " ", "b", "r", "o", "w", "n", " ", "f", "o", "x", " ", "j", "u", "m", "p", "s", " ", "o", "v", "e", "r", " ", "t", "h", "e", " ", "l", "a", "z", "y", " ", "d", "o", "g", "."}

set AppleScript's text item delimiters to "quick"
get myText's text items
--> {"The ", " brown fox jumps over the lazy dog."}

set AppleScript's text item delimiters to "o"
get myText's text items
--> {"The quick br", "wn f", "x jumps ", "ver the lazy d", "g."}

set AppleScript's text item delimiters to "zzz" -- This isn't in myText.
get myText's text items
--> {"The quick brown fox jumps over the lazy dog."}

-- Characters:
get myText's characters
--> {"T", "h", "e", " ", "q", "u", "i", "c", "k", " ", "b", "r", "o", "w", "n", " ", "f", "o", "x", " ", "j", "u", "m", "p", "s", " ", "o", "v", "e", "r", " ", "t", "h", "e", " ", "l", "a", "z", "y", " ", "d", "o", "g", "."}

get characters 5 thru 7 of myText
--> {"q", "u", "i"}

-- List-to-text coercions and text item delimiters:
-- When lists are coerced to string or to Unicode text, the value of the text item delimiters is significant.
-- This is to allow the text-replacement technique of breaking up text with one delimiter and gluing it back together with another.
set AppleScript's text item delimiters to ""
get (characters 5 thru 7 of myText) as string
--> "qui"

set AppleScript's text item delimiters to "Hello!"
get (characters 5 thru 7 of myText) as string
--> "qHello!uHello!i"

-- Text:
-- No list, so no need for coercion, so no issue with text item delimiters:
set AppleScript's text item delimiters to "Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch"
get text 5 thru 7 of myText
-->"qui"

set AppleScript's text item delimiters to ""
get text 5 thru 7 of myText
-->"qui"
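
The break-and-glue replacement technique mentioned in the comments above can be sketched as a small handler. The name replaceText and its parameters are made up for illustration; the list is coerced to Unicode text in line with the advice elsewhere in this thread.

```applescript
-- Hypothetical handler: replaces every occurrence of search_text in the_text.
on replaceText(the_text, search_text, replacement_text)
	set saved_delims to AppleScript's text item delimiters
	-- Break the text apart wherever the search text occurs...
	set AppleScript's text item delimiters to search_text
	set the_items to text items of the_text
	-- ...then glue the pieces back together with the replacement in between.
	set AppleScript's text item delimiters to replacement_text
	set new_text to the_items as Unicode text
	-- Put the delimiters back the way they were found.
	set AppleScript's text item delimiters to saved_delims
	return new_text
end replaceText

replaceText("xxx.eps_lr.tif", "_lr", "_hr")
--> "xxx.eps_hr.tif"
```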

In Mac OS X, all file and folder names are Unicode text and are returned as such by the various system applications (such as the Finder) and other processes that return them. Most applications (especially Apple ones) return any textual information to scripts as Unicode text too. This is in order to make the system usable by users of many different languages and writing systems. Unicode uses up to four bytes per character, depending on the encoding, giving a wide range of numbers that can represent a host of different characters.

The older “string” class uses one byte per character and, since a byte can only have one of 256 different values, it’s necessary to switch to different “character sets” in order to get these 256 values to represent more characters.

Users of English and some European languages can generally switch between strings and Unicode text with impunity, which is fine if they’re only writing scripts for their own machines. But when writing scripts that might be used anywhere in the world, it’s best to favour Unicode text (where the software supports it), otherwise some users will find some of their favourite characters turning into question marks.

This from the Release Notes for AppleScript 1.10 (introduced with Tiger): “The implicitly encoded text types, typeText, typeCString, and typePString, are all deprecated as of AppleScript 1.9.2, since they are incapable of representing international characters and may be reinterpreted in unpredictable ways. Additionally, typeCString and typePString do not support the full range of text coercions, and will be removed entirely in a future release. typeStyledText and typeIntlText, while they have explicit encodings, are not recommended, since they are incapable of representing Unicode-only characters like Hungarian, Arabic, or Thai. The recommended text type is typeUnicodeText.”

In AppleScript terms, this means that the classes ‘string’ (in its “plain text” variety), ‘C string’ and ‘Pascal string’ are now deprecated, the latter two definitely being for the chop. Other forms of ‘string’ (which exist, but are all called ‘string’ in AppleScript) are not recommended, while the use of Unicode text is smiled upon.

Finally got a chance to try this out Jacques…and wow that works well. Also properly returned this:

file 1.1.1.xxx.eps_lr.pdf → .eps_lr

VERY nice!

A teensy and hesitantly offered correction for one of Nigel’s statements: “– If an application runs two or more scripts without quitting (eg. scripts in different Script Editor windows), the scripts all share the same delimiters.” This is true in the Script Editor, of course, but not in Script Debugger 4 where each running script is in its own instance of AppleScript, i.e. its own thread, so the TIDs can be independent unless one is loaded into the other. If, as Nigel points out in his discussion of Unicode, the script might be run by the Script Editor by another party, then great care must be taken to preserve them, set them, set them back.

Great care should always be taken when using AppleScript’s text item delimiters! Especially when you can’t depend on a certain environment (e.g. you don’t know how people will use code that is posted here).

Hi, Adam.

Thanks for the information about Script Debugger. I mentioned Script Editor because what I wrote is easily tested there. But it’s also true for Script Menu and for at least three other applications (PowerMail, TextWrangler, and DVD Player) that have their own “Scripts” menus and are amenable to having their delimiters tested.

Don’t worry, even though I haven’t toyed with text item delimiters in my own scripts yet, all the books I’ve read beat that into your head–if you change them, change them back the instant you’re done. :wink:
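
For what it’s worth, the “change them back” habit can be wrapped in a handler so the delimiters get restored even when something goes wrong in the middle. This is just a sketch; splitText is a made-up name.

```applescript
-- Hypothetical handler: splits the_text on the_delimiter and always
-- restores the previous text item delimiters, even after an error.
on splitText(the_text, the_delimiter)
	set saved_delims to AppleScript's text item delimiters
	try
		set AppleScript's text item delimiters to the_delimiter
		set the_items to text items of the_text
		set AppleScript's text item delimiters to saved_delims
	on error err_msg number err_num
		-- Restore first, then pass the original error along.
		set AppleScript's text item delimiters to saved_delims
		error err_msg number err_num
	end try
	return the_items
end splitText

splitText("12345.123_ABC_FileName.eps_lr.tif", ".")
--> {"12345", "123_ABC_FileName", "eps_lr", "tif"}
```

Saving whatever value was found, rather than resetting to "", means the handler behaves the same whether or not another script sharing the same delimiters has changed them.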

I forgot the one other case this routine needs to detect:

For a while, we used a “special” extension type that gave us things like .eps_lr and .tif_hr and so on. The above examples cover that nicely. But it didn’t quite work right with “false” extensions our asset manager uses.

Since our asset manager software doesn’t “recognize” an extension like .eps_lr it assumes it needs a “real” one and tacks one on, often guessing wrong. So we get things like .eps_lr.tif when we download them. So the script needs to discern that .eps_lr is the “real” extension, ignore the false ending, and return the base file name (without extension) so that another handler can fix the file name.

So I modified the above script a little. I also un-one-lined the “tell file_name” line since it was hard to stare at for me and I wanted to add to what it does. Apologies in advance to the one-line fans. :wink:

EDIT: forgot to mention…“CADS” is our previous asset manager that we used the “special long extensions” in…it didn’t freak-out about them. I’ve taken to calling those _lr/_hr extensions “CADS-style” extensions for brevity. MediaBin is the new asset manager.

on extensionFinder(file_path)
	--default finder info
	tell (info for file_path without size) to set {file_name, file_extension} to {name, name extension}
	
	if file_extension is not missing value then
		--the finder recognizes the _lr/_hr style extensions just fine,
		--use that to get the name without the extension the Finder sees
		set file_name to text 1 thru -((count file_extension) + 2) of file_name
		
		--check the file name to see if it still contains a CADS extension and
		--the one the Finder saw was the false one added by MediaBin
		tell file_name
			if length > 6 and character -7 is "." then
				if it ends with "_lr" or it ends with "_hr" then
					set file_extension to text -6 thru -1
					set file_name to text 1 thru -8
				end if
			end if
		end tell
	else
		set file_extension to "none"
	end if
	
	return {"." & file_extension, file_name}
	
end extensionFinder

Nothing the matter with that, Kevin. I can’t think of a better algorithm although, as you mention, some of the lines could be “squished” into one.