Problem with a shell/AppleScript and Scandinavian characters

I need to be able to detect if a specific file exists from an AppleScript. I also need to use wildcards for the file name, so I am doing it with a “do shell script”, like this:
do shell script "/bin/test -f 'part of file name'*"
Putting that inside a try block will return an error if the file doesn’t exist.

This works fine most of the time, but if the file name contains a character such as ö, it cannot find the file. What makes this interesting is that without the *, it works even with the character ö. It is the combination of the two that causes trouble.

Here is a small script that demonstrates the problem. It creates a folder called test on your desktop and two files (“Ykkönen.mp3” and “Ykkonen.mp3”) within that folder. It then shows that both files are found when the full file names are used, and that the file with a plain letter o is also found with a wildcard. But the wildcard match for the ö file fails.

set dir to (path to desktop folder from user domain as string) & "test:"
set dir to quoted form of POSIX path of dir
do shell script "/bin/mkdir -p " & dir
do shell script "/usr/bin/touch " & dir & "'Ykkönen.mp3'"
do shell script "/usr/bin/touch " & dir & "'Ykkonen.mp3'"
display dialog isfound("'Ykkönen.mp3'")
display dialog isfound("'Ykkonen.mp3'")
display dialog isfound("'Ykkönen'*")
display dialog isfound("'Ykkonen'*")

on isfound(filename)
	global dir
	try
		do shell script "/bin/test -f " & dir & filename
	on error
		return filename & " is NOT found"
	end try
	return filename & " is found"
end isfound

I also tried using ls instead of test, but it gives the same result. Also converting the strings to Unicode text has no effect. Does anybody know what is causing this problem?

It’s a Unicode normalisation issue. While the filesystem APIs are Unicode-aware, the great majority of Unix command line tools (including bash, which does the pathname expansion) aren’t. AppleScript represents that “ö” character as a single Unicode code point, U+00F6 (the composed form). The HFS+ filesystem stores the same character as two code points, U+006F + U+0308 (‘o’ followed by a combining diaeresis - the decomposed form).

Unicode-aware comparison routines normalise the two strings (i.e. convert them to the same form) before comparing them, so they declare them a match. That is why your non-wildcard version works: ‘test’ is evidently handing the path to the system APIs, which do the normalising, to check that it exists.

Non-Unicode-aware routines don’t normalise the two strings, so they declare them a non-match. That is why your wildcard version fails: the bash shell’s pathname expansion (*) gets a list of pathnames and does the comparisons itself.
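
To make the two forms concrete, here is a minimal Python sketch (Python is used purely for illustration; it is not part of the original script). It builds the composed and decomposed spellings with the standard unicodedata module and shows that a raw comparison, which is roughly what a non-Unicode-aware tool does, fails, while a comparison after normalising both sides succeeds. The helper name same_text is hypothetical.

```python
import unicodedata

composed = "Ykk\u00f6nen"      # 'ö' as a single code point, U+00F6
decomposed = "Ykko\u0308nen"   # 'o' followed by combining diaeresis, U+0308

def same_text(a, b):
    """Compare two strings after normalising both to the composed form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Raw comparison: the code point sequences differ, so this is False.
raw_equal = composed == decomposed

# Normalised comparison: both sides are converted to the same form first.
normalised_equal = same_text(composed, decomposed)

print(raw_equal, normalised_equal)
```

Both strings look identical on screen, but the decomposed one is one code point longer, which is enough to defeat a comparison that does not normalise.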

Frankly, the simplest solution would just be to use Unicode-aware tools. Is there some reason you couldn’t use Finder or System Events to do the test, or even just get a list of all filenames and test them yourself using AS? Example:

tell application "Finder"
	exists (first file of folder dir whose name starts with filename)
end tell

BTW, if you only want to ignore the filename extension then Finder/SE/bash pathname expansion isn’t sufficient as it may produce false matches, e.g. ‘foo*’ would match ‘foo.eps’ but also ‘foobar.eps’. You’ll need to get a list of filenames using ‘list folder’ and use AS to strip off the extensions using TIDs before comparing them.
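
The same idea, sketched in Python rather than AppleScript (an illustration only, assuming the list of names has already been read from the folder, e.g. with os.listdir): strip the extension from each name, normalise both sides, and compare for exact equality, so that ‘foo’ matches ‘foo.eps’ but not ‘foobar.eps’. The function names nfc and matches_ignoring_extension are hypothetical.

```python
import os
import unicodedata

def nfc(text):
    """Normalise text to the composed form so both spellings of 'ö' compare equal."""
    return unicodedata.normalize("NFC", text)

def matches_ignoring_extension(names, stem):
    """Return the names whose part before the extension equals stem exactly."""
    wanted = nfc(stem)
    return [name for name in names
            if nfc(os.path.splitext(name)[0]) == wanted]

# Sample listing; the last name uses the decomposed form, as HFS+ would store it.
names = ["foo.eps", "foobar.eps", "Ykko\u0308nen.mp3"]

foo_matches = matches_ignoring_extension(names, "foo")
umlaut_matches = matches_ignoring_extension(names, "Ykk\u00f6nen")
print(foo_matches, umlaut_matches)
```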


p.s. Your isfound() handler’s shell script contains a bug that’s being masked by the indiscriminate ‘try’ block. (‘test -f’ goes blooey if the pathname expansion produces more than one path.) To avoid such errors slipping by, always ensure your ‘try’ blocks trap only the error(s) you want; in this case, by using ‘on error number 1’.
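
A small Python sketch of why the error number matters (again only an illustration; it assumes a ‘test’ executable is available on the PATH, and the helper name test_f_status is hypothetical). It runs ‘test -f’ with one existing path, one missing path, and two paths at once, the last being what the shell passes along when a wildcard expands to more than one file.

```python
import os
import subprocess
import tempfile

def test_f_status(*paths):
    """Run `test -f` with the given arguments and return its exit status."""
    result = subprocess.run(["test", "-f", *paths], stderr=subprocess.DEVNULL)
    return result.returncode

with tempfile.TemporaryDirectory() as folder:
    first = os.path.join(folder, "Ykkonen.mp3")
    second = os.path.join(folder, "Ykkonen.txt")
    for path in (first, second):
        open(path, "w").close()  # create two empty files

    status_one = test_f_status(first)                             # file exists
    status_missing = test_f_status(os.path.join(folder, "none"))  # no such file
    status_two = test_f_status(first, second)                     # too many arguments

print(status_one, status_missing, status_two)
```

A missing file gives status 1, whereas the malformed call gives a status greater than 1, so a handler that traps only error number 1 lets the second kind of failure surface instead of reporting “not found”.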