Problems with Asian Characters

In another thread I was getting help with a script that would scan some user folders on a server and mark them red (eventually it will delete them) if they are older than a certain amount of time.

It was working fine until we got a foreign-language job from Southeast Asia whose files use the “double byte” Asian glyphs, which choke my script (below). Any ideas?

(I’ve left out some of the globals and properties because they give away sensitive information. I also left out a couple of the handlers, like logMe, because they are not the cause. If you need those in some abstract form, let me know.)

-- Get a file/folder list of all items at a certain level inside a given folder
--
-- with help from chrys of MacScripter
-- http://bbs.applescript.net/viewtopic.php?pid=91191#p91191
--
on listGetter(folder_to_scan, scan_level, folder_exceptions)
	--exceptions formatted for shell find
	copy folder_exceptions to folder_exceptions
	repeat with fe_ref in folder_exceptions
		set contents of fe_ref to quoted form of contents of fe_ref
	end repeat
	set ASTID to AppleScript's text item delimiters
	set AppleScript's text item delimiters to " -or -name "
	set exclude_code to text 6 thru -1 of ("" & ({""} & folder_exceptions))
	set AppleScript's text item delimiters to ASTID
	--do shell find with exceptions
	do shell script "/usr/bin/find " & (quoted form of POSIX path of folder_to_scan) & " ! \\( \\( " & exclude_code & " \\) -prune \\) -maxdepth " & scan_level & " -mindepth " & scan_level & " -print0 ; true" without altering line endings
	set find0 to result
	set {ASTID, text item delimiters} to {text item delimiters, {ASCII character 0}}
	try
		set POSIX_pathnames to text items 1 through -2 of find0 -- Drop the last text item because it is always empty (find -print0 always prints a trailing null).
		set text item delimiters to ASTID
	on error m number n from o partial result r to t
		set text item delimiters to ASTID
		error m number n from o partial result r to t
	end try
	script speedHack
		property Mac_pathnames : {}
	end script
	repeat with P_pn in POSIX_pathnames
		set end of speedHack's Mac_pathnames to (POSIX file (contents of P_pn)) as Unicode text
	end repeat
	speedHack's Mac_pathnames
end listGetter


--get user folders that aren't empty
--
on nonEmptyFolders(directory_to_filter)
	set filtered_folder_list to {}
	
	repeat with e from 1 to (number of items in directory_to_filter)
		tell application "Finder"
			set folder_to_check to (item e of directory_to_filter)
			set item_contents to number of items in folder folder_to_check
			if number of items in folder folder_to_check > 0 then
				set filtered_folder_list to filtered_folder_list & folder_to_check
			end if
		end tell
	end repeat
	
	return filtered_folder_list
end nonEmptyFolders

--
-- MAIN SCRIPT
--

my logMe("Transfer Folder Cleaner and Logger Begin--" & (current date), 1)

--COLLECT DATA

--get transfer folder categories (include empties)
set category_folders to {}
set category_folders to listGetter(g_transfer_files_location, 1, g_exclusions_folders)

--get user folders (include empties)
set user_folders to {}
repeat with j from 1 to (number of items in category_folders)
	set user_folders to user_folders & listGetter(item j of category_folders, 1, g_exclusions_users)
end repeat

--get user folders that aren't empty
set user_folders_filtered to {}
set user_folders_filtered to nonEmptyFolders(user_folders)

--get user folder contents (skip empties)
set user_folders_filtered_contents to {}
repeat with k from 1 to (number of items in user_folders_filtered)
	set user_folders_filtered_contents to user_folders_filtered_contents & listGetter(item k of user_folders_filtered, 1, g_exclusions_macosx)
end repeat

--filter for files older than mark date
set files_to_mark to {}
repeat with m from 1 to (number of items in user_folders_filtered_contents)
	tell application "Finder"
		if (modification date of (item m of user_folders_filtered_contents as alias)) < g_mark_date then
			set files_to_mark to files_to_mark & item m of user_folders_filtered_contents
		end if
	end tell
end repeat

--filter for files older than delete date
set files_to_delete to {}
repeat with d from 1 to (number of items in user_folders_filtered_contents)
	tell application "Finder"
		if (modification date of (item d of user_folders_filtered_contents as alias)) < g_delete_date then
			set files_to_delete to files_to_delete & item d of user_folders_filtered_contents
		end if
	end tell
end repeat

--MARK OLD FILES
repeat with currently_coloring in files_to_mark
	tell application "Finder"
		set label index of (currently_coloring as alias) to "2"
	end tell
end repeat

--DELETE OLDEST FILES
--(code to come...left out for safety reasons, delete oldest files)

--get user folders that aren't empty (updated)
set user_folders_filtered to {}
set user_folders_filtered to nonEmptyFolders(user_folders)

--get user folders over threshold size
set oversized_user_folders to {}
repeat with b from 1 to (number of items in user_folders_filtered)
	tell application "Finder"
		set size_in_megabytes to (size of (info for (item b of user_folders_filtered as alias))) / 1024 / 1024 as integer
		if size_in_megabytes ≥ g_threshold_size then
			set oversized_user_folders to oversized_user_folders & item b of user_folders_filtered
		end if
	end tell
end repeat

--all folders over 300 MB, send e-mail warning (or at least prefab the e-mail for user to click "send")
--(code to come)

--LOG INFORMATION

--Get total space used by Transfer folders by getting subfolder totals and adding together (does not cause the Finder to stall)
tell application "Finder"
	set transfer_folder_total to 0
	repeat with s from 1 to (number of items in user_folders_filtered)
		set transfer_folder_total to transfer_folder_total + (size of (info for (item s of user_folders_filtered as alias)))
	end repeat
	--convert to megabytes
	set transfer_folder_total to transfer_folder_total / 1024 / 1024 as integer
end tell

--Log transfer folder total in Excel, with Graph
--Log totals in each category in Excel, with Graph
--Log totals for each user in Excel, with Graph

my logMe("Transfer Folder Cleaner and Logger End--" & (current date), 1)

Hi Kevin,

I guess the problem is this odd conversion using text item delimiters with ASCII 0 (find -print0 always prints a trailing null).
The “trailing null” is actually Unicode (UTF-16) encoding, which is required to handle Asian glyphs properly.

Maybe chrys can help you change the script; I don’t understand some of its secrets, like

copy folder_exceptions to folder_exceptions

or

set ASTID to AppleScript's text item delimiters
set AppleScript's text item delimiters to " -or -name "
set exclude_code to text 6 thru -1 of ("" & ({""} & folder_exceptions))
set AppleScript's text item delimiters to ASTID

What role do these characters play? I would assume that they are in the pathnames, but it would be nice to be clear on this point.

Can you give an example of the strings/characters that are causing problems? Maybe the particular sequence that is giving you problems is confidential, but could you or your client(s) generate a benign string that triggers the same problem? That would help in reproducing the problem. At least let us know to which section of Unicode the problematic characters belong. To locate the characters in the Unicode table you can use the Character Palette (launch app “CharPaletteServer”). Select Code Tables from the drop down list at the top. Paste a character into the search box at the bottom. Wait for it to find it and produce a results list. Double-click on the character in the results list. It will show you the character in its place in the whole Unicode table.

How, exactly, does the script choke? Is there an error message? Does it just silently misbehave? If so, how does it misbehave?

Note: Unless you know for a fact that the characters are “double byte” (in what encoding?), it is best not to describe them like that. Be specific if possible. For example, TETRAGRAM FOR CENTRE (U+1D306) “𝌆” is four bytes in UTF-8 (hex F09D8C86), UTF-16 (big endian: D834DF06), and UTF-32 (0001D306). Every code point encoded in UTF-32 is four bytes. Every code point encoded in UTF-16 is either two or four bytes. Code points encoded in UTF-8 vary from one to four bytes. All of these are for the current size of Unicode. For example, there are longer sequences available in the framework that UTF-8 lays out, but they are not currently used or needed. Most CJK ideographs actually take up three bytes in UTF-8 but only two in UTF-16. CJK UNIFIED IDEOGRAPH-4E8C (U+4E8C) “二” (representing the number two, I think) is E4BA8C (three bytes) in UTF-8, 4E8C (two bytes) in UTF-16 and 00004E8C (four bytes) in UTF-32.
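For anyone who wants to check those byte counts, here is a minimal shell sketch (it is not from the original posts). It assumes an iconv that understands the encoding names UTF-16BE and UTF-32BE; byte_count is a made-up helper name, and the two characters are given as octal escapes of their UTF-8 bytes so the script itself stays plain ASCII.

```shell
#!/bin/sh
# byte_count is a hypothetical helper for illustration only.
# $1 = printf-style octal escapes for the UTF-8 bytes of one character
# $2 = target encoding name as understood by iconv (assumed to be available)
byte_count() {
    printf "$1" | iconv -f UTF-8 -t "$2" | wc -c | tr -d '[:space:]'
}

ER='\344\272\214'             # U+4E8C,  UTF-8 bytes E4 BA 8C
TETRAGRAM='\360\235\214\206'  # U+1D306, UTF-8 bytes F0 9D 8C 86

# The BE variants are used so that iconv does not prepend a byte order mark.
for enc in UTF-8 UTF-16BE UTF-32BE; do
    echo "U+4E8C  in $enc: $(byte_count "$ER" "$enc") bytes"
    echo "U+1D306 in $enc: $(byte_count "$TETRAGRAM" "$enc") bytes"
done
```

If the figures above are right, this should report 3, 2 and 4 bytes for U+4E8C and 4 bytes in all three encodings for U+1D306.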

Also, are you still on Mac OS X 10.4 (if I recall correctly), or are you running 10.5 by now?

According to the do shell script entry in my copy of the Standard Additions dictionary, UTF-8 is the default encoding for the data read from the shell commands started by do shell script. According to various sources (Wikipedia, IETF RFC 3629), UTF-8 can encode any code point that UTF-16 can (all of current Unicode).

While there are 0x00 (null) bytes in the UTF-16 encoding of the code points for many Latin characters, the CJK code points are high enough that the high byte is never zero in UTF-16 (the low byte still can be, as in U+4E00, which encodes as 4E00). I cannot rule out some encoding issue, but I do not think it has to do with the null bytes output by find (which will be generating UTF-8 encoded text between the null bytes, since that is the encoding used by the BSD filesystem API). The only code point whose UTF-8 encoding contains a 0x00 byte is U+0000 (which encodes to exactly 0x00). So the null bytes that find generates fit perfectly well in the UTF-8 data of the pathnames. The nulls end up representing U+0000 (the code point for the null character) in the decoded Unicode data.
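The point about null bytes can be checked outside AppleScript with a small shell sketch (again not from the original posts, and only a sketch). It assumes a find that supports -mindepth, -maxdepth and -print0, as the one in Kevin's script does; count_nul_separated and the /tmp/nuldemo.XXXXXX template are names made up for illustration. If CJK names contributed stray null bytes of their own, the count would come out higher than the number of entries.

```shell
#!/bin/sh
# count_nul_separated DIR (hypothetical helper): count the NUL bytes that
# `find -print0` emits for the direct children of DIR. Each pathname is
# followed by exactly one NUL, so the count should equal the entry count.
count_nul_separated() {
    find "$1" -mindepth 1 -maxdepth 1 -print0 | tr -dc '\000' | wc -c | tr -d '[:space:]'
}

demo_dir=$(mktemp -d /tmp/nuldemo.XXXXXX)

# Two folder names containing CJK ideographs, written as UTF-8 octal escapes:
#   U+4E8C = E4 BA 8C, U+65B0 = E6 96 B0
mkdir "$demo_dir/$(printf '\344\272\214')" "$demo_dir/$(printf 'job-\346\226\260')"
touch "$demo_dir/plain.txt"

echo "entries created: 3"
echo "NUL bytes seen:  $(count_nul_separated "$demo_dir")"

rm -rf "$demo_dir"
```

This only exercises the find side of the pipeline; it says nothing about how do shell script or the later POSIX file coercion treat the text.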

Here is some code that attempts to demonstrate, and test a few variations on using the encoding specifier for do shell script:

property promptForText : false
property removeFileWhenDone : true

set testText to getTestText(promptForText)
set testFilePathname to getTestFilePathname()
set testFileQuotedPOSIXPathname to quoted form of POSIX path of (a reference to file testFilePathname)
writeUTF8TextToPathname(testText, testFilePathname)

set UTF8_as_NoSpecifiedEncoding to ¬
    do shell script ("cat " & testFileQuotedPOSIXPathname) without altering line endings
set UTF8_as_UTF8 to ¬
    do shell script ("cat " & testFileQuotedPOSIXPathname) as «class utf8» without altering line endings
set UTF8_as_UTF16 to ¬
    do shell script ("cat " & testFileQuotedPOSIXPathname) as Unicode text without altering line endings
set UTF16_as_NoSpecifiedEncoding to ¬
    do shell script ("cat " & testFileQuotedPOSIXPathname & " | iconv -f utf-8 -t utf-16") without altering line endings
set UTF16_as_UTF8 to ¬
    do shell script ("cat " & testFileQuotedPOSIXPathname & " | iconv -f utf-8 -t utf-16") as «class utf8» without altering line endings
set UTF16_as_UTF16 to ¬
    do shell script ("cat " & testFileQuotedPOSIXPathname & " | iconv -f utf-8 -t utf-16") as Unicode text without altering line endings

if removeFileWhenDone then do shell script "rm -f " & testFileQuotedPOSIXPathname

set sameAsSameAreIdentical to stringsAreIdentical(UTF8_as_UTF8, UTF16_as_UTF16)
set origAnd8As8AreIdentical to stringsAreIdentical(testText, UTF8_as_UTF8)
set unspecifiedMeansUTF8 to stringsAreIdentical(UTF8_as_NoSpecifiedEncoding, UTF8_as_UTF8)
set unspecifiedMeansUTF8AnotherWay to stringsAreIdentical(UTF16_as_UTF8, UTF16_as_NoSpecifiedEncoding)
set unspecifiedMeansUTF16 to stringsAreIdentical(UTF16_as_NoSpecifiedEncoding, UTF16_as_UTF16)
set unspecifiedMeansUTF16AnotherWay to stringsAreIdentical(UTF8_as_UTF16, UTF8_as_NoSpecifiedEncoding)



set report to {}
if not sameAsSameAreIdentical then set end of report to "Reading UTF-8 data as UTF-8 did not produce identical result as reading UTF-16 data as UTF-16! (UNEXPECTED; rest of report may be meaningless)"
if origAnd8As8AreIdentical then
    set end of report to "Data read from \"do shell script\" is the same as was written via \"open for access\"+\"write\". (EXPECTED)"
else
    set end of report to "Data read from \"do shell script\" (UTF-8 as read as UTF-8) is NOT the same as was written via \"open for access\"+\"write\". (UNEXPECTED!)"
end if
if unspecifiedMeansUTF8 then
    set end of report to "do shell script without encoding specifier seems to mean UTF-8. (EXPECTED)"
    if unspecifiedMeansUTF8AnotherWay then
        set end of report to "do shell script decodes UTF-16 data identically with \"as «class utf8»\" and without encoding specifier (EXPECTED)"
    else
        set end of report to "do shell script decodes UTF-16 data differently with \"as «class utf8»\" and without encoding specifier (unexpected; OK? Maybe. But, probably not.)"
    end if
else if unspecifiedMeansUTF16 then
    set end of report to "do shell script without encoding specifier seems to mean UTF-16. (unexpected, but probably OK)"
    if unspecifiedMeansUTF16AnotherWay then
        set end of report to "do shell script decodes UTF-8 data identically with \"as Unicode text\" and without encoding specifier (expected; based only on the preceding unexpected result)"
    else
        set end of report to "do shell script decodes UTF-8 data differently with \"as Unicode text\" and without encoding specifier (unexpected; OK? Maybe. But, probably not.)"
    end if
else
    set end of report to "do shell script without encoding specifier does not match results when specifying either UTF-8 or UTF-16! (UNEXPECTED)"
end if
try
    set {otid, text item delimiters} to {text item delimiters, {return & return}}
    display dialog "\"do shell script\" encoding report:" with title "Report" default answer ("" as Unicode text) & (report)
    set text item delimiters to otid
on error
    set text item delimiters to otid
end try
report
(* The report from Chris's system:
Data read from "do shell script" is the same as was written via "open for access"+"write". (EXPECTED)

do shell script without encoding specifier seems to mean UTF-8. (EXPECTED)

do shell script decodes UTF-16 data identically with "as «class utf8»" and without encoding specifier (EXPECTED)
*)

(* HANDLERS *)

to getTestText(promptForText)
    set originalText to («data utxt003500200076006100720069006F007500730020006D00610072006B006500720073003A00202713260500B600A72022000D003200200065006C006C00690070007300650073003A0020202622EE000D00340020006D006F006E00650079002000730079006D0062006F006C0073003A002000A400A2002400A3000D0031» as Unicode text) & «data utxt00310020006B00650079002000730079006D0062006F006C0073003A002021E721EA005E23252318232B2326238B232421A923CF000D0033002000740065007400720061006700720061006D0073003A0020D834DF06D834DF2ED834DF56002000280073007500720072006F006700610074006500200070006100690072» & «data utxt00730020007700680065006E00200069006E0020005500540046002D003100360029000D003700200042007200610069006C006C00650020007000610074007400650072006E0073003A00202809282B2808282F280E28032815000D0031003000200043004A004B0020006E0075006D0062006500720073003F003A0020» & «data utxt4E004E8C4E0956DB4E94516D4E03516B4E5D5341000D»
    if promptForText then
        display dialog "Enter Test Unicode Text:" default answer originalText
        return text returned of result
    else
        return originalText
    end if
end getTestText
to getTestFilePathname()
    set testFileFolder to path to temporary items folder
    set testFilePathname to (testFileFolder as Unicode text) & "UTF-8 text.txt"
    repeat
        set testFilePOSIXPathname to POSIX path of file testFilePathname
        
        try
            alias testFilePathname
            true
        on error
            false
        end try
        if result then
            display dialog "The default test file " & return & "\"" & testFilePOSIXPathname & "\"" & return & " already exists" with title "Test File Already Exists" buttons {"Cancel", "Choose Another.", "Overwrite"} default button 1 cancel button 1
            set btn to button returned of result
            if btn is "Overwrite" then exit repeat
            choose file name "Choose a new or existing file to store the UTF-8 test data:" default name "" default location testFileFolder
            set testFilePathname to result as Unicode text
        else
            -- Use default pathname
            exit repeat
        end if
    end repeat
    testFilePathname
end getTestFilePathname
to writeUTF8TextToPathname(testText, testFilePathname)
    set frn to open for access testFilePathname with write permission
    try
        set eof frn to 0
        write testText to frn as «class utf8»
        close access frn
    on error m number n from o partial result r to t
        try
            close access frn
        end try
        error m number n from o partial result r to t
    end try
end writeUTF8TextToPathname
to stringsAreIdentical(a, b)
    -- AppleScript 1.10.7 (on Mac OS X 10.4.11) seems to have some bugs when doing a certain comparison (UTF16_as_NoSpecifiedEncoding (string)= UTF16_as_UTF16 (Unicode text)), so explicitly compare the class to avoid the bug
    (class of a is equal to class of b) and (a is equal to b)
end stringsAreIdentical

Mystery Code Explanations

The first is to prevent contaminating the contents of the original list that is passed by reference from the caller. The following code demonstrates what can happen if this is not done when using the “rewrite the contents of the list” pattern that follows the copy statement. The third handler in the code (quoteListItemsAlternate) has a similar effect (no contamination) to that of the second handler (quoteListItemsWithCopy) and may be in a more familiar style. The “copy x to x” style is basically just a way of reusing the variable instead of having to have an additional variable.

to quoteListItemsWithoutCopy(lst)
    repeat with itm in lst
        -- This modification done in the next line does not have to be "quoted form of", it could be anything, as long as it is different from the original, it will illustrate the point.
        set contents of itm to quoted form of itm
    end repeat
    lst
end quoteListItemsWithoutCopy

to quoteListItemsWithCopy(lst)
    copy lst to lst -- make our own copy of the list so that we do not pollute the caller's original copy with my changes
    repeat with itm in lst
        -- This modification done in the next line does not have to be "quoted form of", it could be anything, as long as it is different from the original, it will illustrate the point.
        set contents of itm to quoted form of itm
    end repeat
    lst
end quoteListItemsWithCopy

to quoteListItemsAlternate(lst)
    set newLst to {}
    repeat with itm in lst
        set end of newLst to quoted form of itm
    end repeat
    newLst
end quoteListItemsAlternate


set l to {"a", "b", "c c", "D E F"}
set l2 to {"a", "b", "c c", "D E F"}
set inputsAreIdenticalBeforeHandlers to ¬
    l is equal to l2

set outputWithoutCopy to ¬
    quoteListItemsWithoutCopy(l)
set outputWithCopy to ¬
    quoteListItemsWithCopy(l2)
set inputsAreIdenticalAfterHandlers to l is equal to l2
set outputsAreIdentical to outputWithoutCopy is equal to outputWithCopy


set summary to {}
if inputsAreIdenticalBeforeHandlers then
    set end of summary to "Inputs were identical before the handlers were called. (expected)"
else
    set end of summary to "Inputs were NOT identical before the handlers were called. (UNEXPECTED)"
    set end of summary to {|without copy|:l, |with copy|:l2}
end if
if inputsAreIdenticalAfterHandlers then
    set end of summary to "Inputs were identical after the handlers were called."
else
    set end of summary to "Inputs were NOT identical after the handlers were called (THIS IS WHY THE COPY IS DONE)."
    set end of summary to {|without copy|:l, |with copy|:l2}
end if
if outputsAreIdentical then
    set end of summary to "Outputs were identical. (expected)"
else
    set end of summary to "Outputs were NOT identical. (UNEXPECTED)."
    set end of summary to {|without copy|:outputWithoutCopy, |with copy|:outputWithCopy}
end if
summary
(* --> {
 *  "Inputs were identical before the handlers were called. (expected)",
 *  "Inputs were NOT identical after the handlers were called (THIS IS WHY THE COPY IS DONE).",
 *  {|without copy|:{"'a'", "'b'", "'c c'", "'D E F'"}, |with copy|:{"a", "b", "c c", "D E F"}},
 *  "Outputs were identical. (expected)"}
 *)

As for the ("" & ({""} & folder_exceptions)) mess, that was an attempt at kai’s class-preservation technique that is in other posts here. Unfortunately, my misapplication of it does not work any better than "" & folder_exceptions would. Here is a way that actually works, along with a demonstration of when it is useful:

to emptyStringWithSameClassAs(str)
    local someChar, otid, emptyStr
    set someChar to ASCII character 0
    set {otid, text item delimiters} to {text item delimiters, {someChar}}
    set str to str & someChar -- appending uses class of first string, so it preserves the class of str
    set emptyStr to last text item of str -- text items have same class as str
    set text item delimiters to otid
    emptyStr
end emptyStringWithSameClassAs
set a to "a"
set u to "u" as Unicode text
set emptyA to emptyStringWithSameClassAs(a)
set emptyU to emptyStringWithSameClassAs(u)
{{class of a, a, class of emptyA, emptyA}, {class of u, u, class of emptyU, emptyU}}
--> {{string, "a", string, ""}, {Unicode text, "u", Unicode text, ""}}

set lst to {"a" as Unicode text, "b" as Unicode text, "c" as Unicode text}
set lst2 to {"d", "e", "f"}
set r to {}
set {otid, text item delimiters} to {text item delimiters, {":"}}
to i(s)
    {class of s, s}
end i
repeat with l in {lst, lst2}
    set l to contents of l -- dereference implicit reference
    set unpreservedStr to "" & l
    set preservedStr to emptyStringWithSameClassAs(beginning of l) & l
    set textStr to l as text
    set unicodeStr to l as Unicode text
    set end of r to {unpreserved:i(unpreservedStr), text:i(textStr), preserved:i(preservedStr), unicode:i(unicodeStr)}
end repeat
set text item delimiters to otid
r
(*
-->
{
 {unpreserved:{string, "a:b:c"},
  text:{string, "a:b:c"},
  preserved:{Unicode text, "a:b:c"}, -- unicode in, unicode out
  unicode:{Unicode text, "a:b:c"}}, 
 {unpreserved:{string, "d:e:f"},
  text:{string, "d:e:f"},
  preserved:{string, "d:e:f"}, -- string in, string out
  unicode:{Unicode text, "d:e:f"}}
}
*)

In the context of Kevin’s script I do not think it matters that the code does not do what I originally intended. The input lists are probably made up from string literals in the AppleScript code, which on pre-10.5 systems can only use characters that can be represented in the system’s default encoding (probably MacRoman). Such characters will fit in a string type object just fine. On 10.5 one may use any of Unicode in AppleScript string literals, but the string object is also upgraded to Unicode, so a non-coerced empty string should be Unicode-capable, and when a list is concatenated with it, the result should be Unicode-capable, too. So, as long as the exclusion list always comes from a string literal, there should not be any problems. There might be problems if the exclusion list is created by getting the names of existing files, is read in from disk via a Unicode-capable encoding, comes from the output of do shell script, or is generated via «data utxt». In those cases (on 10.4) it would be coerced to a plain string and would misrepresent any characters that do not fit into the system’s default encoding.

Thank you very much, Chris, for this extraordinarily extensive explanation.

PS:

This is right, the default text class of the shell is UTF-8
but the object of do shell script to be passed to the shell is converted to UTF-8
and the result from the shell is converted from UTF-8 to UTF-16.

So the class of the result of the find line in Kevin’s script is definitely UTF-16.

They were in the folder name, yes.

Assuming MacScripter can handle them via cut-n-paste:

新汉仪中文字体
(on my screen, the second and third glyphs aren’t supported by Firefox it seems)

No idea what it says, so if it’s a swear word, insult, etc., it’s not my fault. :wink:

CJK UNIFIED IDEOGRAPH-65B0
UNICODE: 65B0
UTF8: E6 96 B0

That’s just the first glyph.

It simply gives an error dialog that lists the enclosing folder name (user folder). When I delete the folder with the Asian characters, the script runs fine. This has happened twice. AppleScript isn’t giving any other feedback. My only guess is that the failure is somewhere in the do shell script call, since AppleScript seems to give less feedback on those kinds of errors.

Terminology mix-up, sorry. I’m used to referring to these types of glyphs as “double byte fonts” because of my years in graphic design and PostScript fonts. I was always vague on why they were called that “back in the day.”

Tiger, 10.4.11, and will be for a while. Leopard needs a veterinarian. :wink:

Based on the script it looks like you have a transfer/category/user/stuff hierarchy. I tried variations with some or all of the problematic characters in each of those levels except for transfer (I expect that one has a fixed name that users do not change).

They seem to have gone through the forum and come out just fine in Safari here (they all look like ideographs on my screen). I was able to copy and paste them into Finder to make new folder names and file names. I created this structure:

I added variables for the various settings to your posted script and ran it on the above directory structure, and I never got a failure that displayed a folder name. I did get a couple of errors related to not handling zero exclusions and not handling empty output from find. These errors looked like “Can’t get text 6 thru -1 of “”.” and “Can’t get text items 1 thru -2 of “”.”. They both highlighted the “text . through .” parts of the code. Failing to get a pathname-related error, I reworked the exclusion-building code to hopefully be a bit easier to understand. I also added some log statements to check the values of the variables that were collected (marked, for deletion, oversize, total size). It did make some of my files red, and it seemed to return reasonable lists for the deletion targets and the oversize users, and a reasonable total for the whole transfer folder.

(*
THE CHANGES TO THIS SCRIPT ARE
ABOVE THE UPPER LINE OF ASTERISKS, 
BELOW THE LOWER LINE OF ASTERISKS, AND
INSIDE THE listGetter HANDLER TO
	DEAL WITH ZERO EXCLUSIONS, 
	DEAL WITH EMPTY OUTPUT FROM FIND (EMPTY (possibly except for exclusions) FOLDER), AND
	TO ELUCIDATE THE EXCLUSION BUILDING CODE
*)
to logMe(s, n)
	log ("" as Unicode text) & n & ": " & s
end logMe
set g_transfer_files_location to ((path to desktop as Unicode text) & "Test Transfer Folder")
set g_exclusions_folders to {}
set g_exclusions_users to {".DS_Store"}
set g_exclusions_macosx to {".DS_Store", ".bashrc"}
set g_mark_date to (current date) - 5 * days
set g_delete_date to (current date) - 100 * days
set g_threshold_size to 1
(*************)
-- Get a file/folder list of all items at a certain level inside a given folder
--
-- with help from chrys of MacScripter
-- http://bbs.applescript.net/viewtopic.php?pid=91191#p91191
--
on listGetter(folder_to_scan, scan_level, folder_exceptions)
	--exceptions formatted for shell find
	if length of folder_exceptions is greater than 0 then --XXX added
		copy folder_exceptions to folder_exceptions
		repeat with fe_ref in folder_exceptions
			set contents of fe_ref to quoted form of contents of fe_ref
		end repeat
		set ASTID to AppleScript's text item delimiters
		set AppleScript's text item delimiters to " -or -name "
		set exclude_code to folder_exceptions as Unicode text -- Does not preserve the class(es) of the list items, but Unicode text should be a superset of text/string.
		set AppleScript's text item delimiters to ASTID
		set exclude_code to "! \\( \\( -name " & exclude_code & " \\) -prune \\) "
	else
		set exclude_code to ""
	end if
	--do shell find with exceptions
	do shell script "/usr/bin/find " & (quoted form of POSIX path of folder_to_scan) & " " & exclude_code & " -maxdepth " & scan_level & " -mindepth " & scan_level & " -print0 ; true" without altering line endings
	set find0 to result
	if length of find0 is equal to 0 then return {}
	set {ASTID, text item delimiters} to {text item delimiters, {ASCII character 0}}
	try
		set POSIX_pathnames to text items 1 through -2 of find0 -- Drop the last text item because it is always empty (find -print0 always prints a trailing null).
		set text item delimiters to ASTID
	on error m number n from o partial result r to t
		set text item delimiters to ASTID
		error m number n from o partial result r to t
	end try
	script speedHack
		property Mac_pathnames : {}
	end script
	repeat with P_pn in POSIX_pathnames
		set end of speedHack's Mac_pathnames to (POSIX file (contents of P_pn)) as Unicode text
	end repeat
	speedHack's Mac_pathnames
end listGetter


--get user folders that aren't empty
--
on nonEmptyFolders(directory_to_filter)
	set filtered_folder_list to {}
	
	repeat with e from 1 to (number of items in directory_to_filter)
		tell application "Finder"
			set folder_to_check to (item e of directory_to_filter)
			set item_contents to number of items in folder folder_to_check
			if number of items in folder folder_to_check > 0 then
				set filtered_folder_list to filtered_folder_list & folder_to_check
			end if
		end tell
	end repeat
	
	return filtered_folder_list
end nonEmptyFolders

--
-- MAIN SCRIPT
--

my logMe("Transfer Folder Cleaner and Logger Begin--" & (current date), 1)

--COLLECT DATA

--get transfer folder categories (include empties)
set category_folders to {}
set category_folders to listGetter(g_transfer_files_location, 1, g_exclusions_folders)

--get user folders (include empties)
set user_folders to {}
repeat with j from 1 to (number of items in category_folders)
	set user_folders to user_folders & listGetter(item j of category_folders, 1, g_exclusions_users)
end repeat

--get user folders that aren't empty
set user_folders_filtered to {}
set user_folders_filtered to nonEmptyFolders(user_folders)

--get user folder contents (skip empties)
set user_folders_filtered_contents to {}
repeat with k from 1 to (number of items in user_folders_filtered)
	set user_folders_filtered_contents to user_folders_filtered_contents & listGetter(item k of user_folders_filtered, 1, g_exclusions_macosx)
end repeat

--filter for files older than mark date
set files_to_mark to {}
repeat with m from 1 to (number of items in user_folders_filtered_contents)
	tell application "Finder"
		if (modification date of (item m of user_folders_filtered_contents as alias)) < g_mark_date then
			set files_to_mark to files_to_mark & item m of user_folders_filtered_contents
		end if
	end tell
end repeat

--filter for files older than delete date
set files_to_delete to {}
repeat with d from 1 to (number of items in user_folders_filtered_contents)
	tell application "Finder"
		if (modification date of (item d of user_folders_filtered_contents as alias)) < g_delete_date then
			set files_to_delete to files_to_delete & item d of user_folders_filtered_contents
		end if
	end tell
end repeat

--MARK OLD FILES
repeat with currently_coloring in files_to_mark
	tell application "Finder"
		set label index of (currently_coloring as alias) to "2"
	end tell
end repeat

--DELETE OLDEST FILES
--(code to come...left out for safety reasons, delete oldest files)

--get user folders that aren't empty (updated)
set user_folders_filtered to {}
set user_folders_filtered to nonEmptyFolders(user_folders)

--get user folders over threshold size
set oversized_user_folders to {}
repeat with b from 1 to (number of items in user_folders_filtered)
	tell application "Finder"
		set size_in_megabytes to (size of (info for (item b of user_folders_filtered as alias))) / 1024 / 1024 as integer
		if size_in_megabytes ≥ g_threshold_size then
			set oversized_user_folders to oversized_user_folders & item b of user_folders_filtered
		end if
	end tell
end repeat

--all folders over 300 MB, send e-mail warning (or at least prefab the e-mail for user to click "send")
--(code to come)

--LOG INFORMATION

--Get total space used by Transfer folders by getting subfolder totals and adding together (does not cause the Finder to stall)
tell application "Finder"
	set transfer_folder_total to 0
	repeat with s from 1 to (number of items in user_folders_filtered)
		set transfer_folder_total to transfer_folder_total + (size of (info for (item s of user_folders_filtered as alias)))
	end repeat
	--convert to megabytes
	set transfer_folder_total to transfer_folder_total / 1024 / 1024 as integer
end tell

--Log transfer folder total in Excel, with Graph
--Log totals in each category in Excel, with Graph
--Log totals for each user in Excel, with Graph

my logMe("Transfer Folder Cleaner and Logger End--" & (current date), 1)
(*****************)
log "marked"
log files_to_mark
log "to delete"
log files_to_delete
log "oversize"
log oversized_user_folders
log "total"
log transfer_folder_total

When you get the error, is any particular part of the code highlighted? (That assumes you can run it from Script Editor or the like.) When I run my version as an app, it works OK: I see some files turn red and I get no errors. Could you try renaming the offending folder so that it has only normal characters, or would that confuse people or other automated processes? That might help narrow down the problem. Maybe the problem is with the contents of the folder and not the folder name itself.


Chris

Okay more details:

This weekend, my script stalled with this error:

File ServerName:Transfer Files:Designers:DesignerName:è¨Ã¤ÃÃ¨oöS(

The file it got stuck on was:

小狗出售(溫馨感人).pps

Here’s the last edit of my code:

--
-- DECLARE PROPERTIES
--
-- debugging/logging on?
property g_debug : true

--basic file path and names
property g_home_folder_path : path to home folder
property g_log_file_name : "LOG--Overnight Automation.txt"
property g_transfer_files_location : "ServerName:Transfer Files"

--Mac OS names to be ignored
property g_exclusions_macosx : {"Temporary Items", "Trash", ".DS_Store", "TheFindByContentFolder", "TheVolumeSettingsFolder", "Icon
"}

--folder names not to be scanned
property g_exclusions_folders : g_exclusions_macosx & {"_VACATION REQUESTS", "SomeDude"}

--folder names of users whose contents should never be deleted automatically
property g_exclusions_users : g_exclusions_macosx & {"SKU LISTS", "SomeDude", "SomeDude2", "SomeDude3", "SomeDude4", "Kevin Quosig", "Cost Tracking DB"}

--file marking threshold 
property g_mark_days : 7 --files older than this number of days will be marked
set g_mark_date to ((current date) - (g_mark_days * days))

--file deletion threshold 
property g_delete_days : 14 --files older than this number of days will be deleted
set g_delete_date to ((current date) - (g_delete_days * days))

--transfer folder size for additional reminders
property g_threshold_size : 300 as integer --size in megabytes


--
-- UTILITY HANDLERS
--

--Log Entry Generation
--
-- with help from StefanK of MacScripter
-- http://bbs.applescript.net/viewtopic.php?pid=76607#p76607
--
on logMe(log_string, indent_level)
	if g_debug is true then --allows turning the debugger on and off so my logMe's can be left in final version
		set log_target to ("Data: Automation:Logs:" & g_log_file_name) as text
		try
			set log_file_ref to open for access file log_target with write permission
			repeat indent_level times
				write tab to log_file_ref starting at eof
			end repeat
			write ((log_string as text) & return) to log_file_ref starting at eof
			close access log_file_ref
			return true
		on error
			try
				close access file log_target
			end try
			return false
		end try
	end if
end logMe


-- Date stamp generator
--
-- with help from StefanK of MacScripter
-- http://bbs.applescript.net/viewtopic.php?id=20420
--
on dateStamp()
	
	-- Load date components from system
	tell (current date)
		set dayStamp to day
		set monthStamp to (its month as integer)
		set yearStamp to year
	end tell
	
	--coerce components to two-digit form
	set dayStamp to (text -2 thru -1 of ("0" & dayStamp as string))
	set monthStamp to (text -2 thru -1 of ("0" & monthStamp as string))
	set yearStamp to (text 3 thru 4 of (yearStamp as string))
	
	--Assemble datestamp
	return yearStamp & monthStamp & dayStamp as text
	
end dateStamp


-- Get a file/folder list of all items at a certain level inside a given folder
--
-- with help from chrys of MacScripter
-- http://bbs.applescript.net/viewtopic.php?pid=91191#p91191
--
on listGetter(folder_to_scan, scan_level, folder_exceptions)
	--exceptions formatted for shell find
	copy folder_exceptions to folder_exceptions
	repeat with fe_ref in folder_exceptions
		set contents of fe_ref to quoted form of contents of fe_ref
	end repeat
	set ASTID to AppleScript's text item delimiters
	set AppleScript's text item delimiters to " -or -name "
	set exclude_code to text 6 thru -1 of ("" & ({""} & folder_exceptions))
	set AppleScript's text item delimiters to ASTID
	--do shell find with exceptions
	do shell script "/usr/bin/find " & (quoted form of POSIX path of folder_to_scan) & " ! \\( \\( " & exclude_code & " \\) -prune \\) -maxdepth " & scan_level & " -mindepth " & scan_level & " -print0 ; true" without altering line endings
	set find0 to result
	set {ASTID, text item delimiters} to {text item delimiters, {ASCII character 0}}
	try
		set POSIX_pathnames to text items 1 through -2 of find0 -- Drop the last text item because it is always empty (find -print0 always prints a trailing null).
		set text item delimiters to ASTID
	on error m number n from o partial result r to t
		set text item delimiters to ASTID
		error m number n from o partial result r to t
	end try
	script speedHack
		property Mac_pathnames : {}
	end script
	repeat with P_pn in POSIX_pathnames
		set end of speedHack's Mac_pathnames to (POSIX file (contents of P_pn)) as Unicode text
	end repeat
	speedHack's Mac_pathnames
end listGetter


--get user folders that aren't empty
--
on nonEmptyFolders(directory_to_filter)
	set filtered_folder_list to {}
	
	repeat with e from 1 to (number of items in directory_to_filter)
		tell application "Finder"
			set folder_to_check to (item e of directory_to_filter)
			set item_contents to number of items in folder folder_to_check
			if item_contents > 0 then
				set filtered_folder_list to filtered_folder_list & folder_to_check
			end if
		end tell
	end repeat
	
	return filtered_folder_list
end nonEmptyFolders

--
-- MAIN SCRIPT
--

my logMe("Transfer Folder Cleaner and Logger Begin--" & (current date), 1)

--COLLECT DATA

--get transfer folder categories (include empties)
set category_folders to {}
set category_folders to listGetter(g_transfer_files_location, 1, g_exclusions_folders)

--get user folders (include empties)
set user_folders to {}
repeat with j from 1 to (number of items in category_folders)
	set user_folders to user_folders & listGetter(item j of category_folders, 1, g_exclusions_users)
end repeat

--get user folders that aren't empty
set user_folders_filtered to {}
set user_folders_filtered to nonEmptyFolders(user_folders)

--get user folder contents (skip empties)
set user_folders_filtered_contents to {}
repeat with k from 1 to (number of items in user_folders_filtered)
	set user_folders_filtered_contents to user_folders_filtered_contents & listGetter(item k of user_folders_filtered, 1, g_exclusions_macosx)
end repeat

--filter for files older than mark date
set files_to_mark to {}
repeat with m from 1 to (number of items in user_folders_filtered_contents)
	tell application "Finder"
		if (modification date of (item m of user_folders_filtered_contents as alias)) < g_mark_date then
			set files_to_mark to files_to_mark & item m of user_folders_filtered_contents
		end if
	end tell
end repeat

--filter for files older than delete date
set files_to_delete to {}
repeat with d from 1 to (number of items in user_folders_filtered_contents)
	tell application "Finder"
		if (modification date of (item d of user_folders_filtered_contents as alias)) < g_delete_date then
			set files_to_delete to files_to_delete & item d of user_folders_filtered_contents
		end if
	end tell
end repeat

--MARK OLD FILES
repeat with currently_coloring in files_to_mark
	tell application "Finder"
		set label index of (currently_coloring as alias) to "2"
	end tell
end repeat

my logMe("Items over " & g_mark_days & " days: " & (number of items in files_to_mark), 2)

--DELETE OLDEST FILES
--(code to come...left out for safety reasons, delete oldest files)

--get user folders that aren't empty (updated)
set user_folders_filtered to {}
set user_folders_filtered to nonEmptyFolders(user_folders)

--get user folders over threshold size
set oversized_user_folders to {}
repeat with b from 1 to (number of items in user_folders_filtered)
	tell application "Finder"
		set size_in_megabytes to (size of (info for (item b of user_folders_filtered as alias))) / 1024 / 1024 as integer
		if size_in_megabytes ≥ g_threshold_size then
			set oversized_user_folders to oversized_user_folders & item b of user_folders_filtered
		end if
	end tell
end repeat

--all folders over 300 MB, send e-mail warning (or at least prefab the e-mail for user to click "send")
--(code to come)

--LOG INFORMATION

--Get total space used by Transfer folders by getting subfolder totals and adding together (does not cause the Finder to stall)
tell application "Finder"
	set transfer_folder_total to 0
	repeat with s from 1 to (number of items in user_folders_filtered)
		set transfer_folder_total to transfer_folder_total + (size of (info for (item s of user_folders_filtered as alias)))
	end repeat
	--convert to megabytes
	set transfer_folder_total to transfer_folder_total / 1024 / 1024 as integer
end tell

--Log transfer folder total in Excel, with Graph
--Log totals in each category in Excel, with Graph
--Log totals for each user in Excel, with Graph

my logMe("Transfer Folder Cleaner and Logger End--" & (current date), 1)

I ran the script in Script Debugger in debug mode, and the error happens in this code section:

--filter for files older than mark date
set files_to_mark to {}
repeat with m from 1 to (number of items in user_folders_filtered_contents)
	tell application "Finder"
		if (modification date of (item m of user_folders_filtered_contents as alias)) < g_mark_date then
			set files_to_mark to files_to_mark & item m of user_folders_filtered_contents
		end if
	end tell
end repeat

Specifically, this line:

		if (modification date of (item m of user_folders_filtered_contents as alias)) < g_mark_date then

Thanks again for everyone’s help!

Hmm, that error message seems to have been truncated somewhere. The filename seems incomplete, and the message itself seems to be missing some text. We have a noun phrase, “File .”, but no verb or adjective or anything else to tell us what the problem really is.

Interestingly, the junk from the error message is the same as what Script Editor shows when I try to compile a script with the string literal “小狗出售(溫馨感人).pps” in it. I have a script whose entire text is “小狗出售(溫馨感人).pps”. When Script Editor compiles that script, it silently changes the script’s text to “è¨Ã£ÃÃ¨oöS(∑≈ä]ä¥Ãªl).pps”. I have not determined the exact nature of this mangling, but I think it is because Script Editor (well, AppleScript really, I suppose) cannot handle string literals with characters that are outside MacRoman (my system’s default “legacy encoding”). Of course this changes on Mac OS X 10.5 (Leopard), where AppleScript and Script Editor can represent full Unicode string literals.

Anyway, it seems that something, somewhere is botching an encoding conversion in a way that is similar to the way that Script Editor mis-converts Unicode string literals to MacRoman.
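For anyone who wants to poke at the mangling from Terminal, here is a rough sketch. It assumes a UTF-8 Terminal and an iconv that accepts the encoding names MACINTOSH, SHIFT_JIS and BIG5 (names vary between iconv builds, so check iconv -l first). The function name misread_as_macroman is made up for this example, and the list of candidate encodings is only a guess. The function converts a string to a candidate encoding, then deliberately misreads those bytes as MacRoman, so the output can be compared against the junk in the error message. It does not tell you where in the chain the bad conversion happens.

```shell
# Hypothetical helper: show how TEXT ($1) would look if its bytes, stored in
# the candidate encoding ENC ($2), were wrongly interpreted as MacRoman.
# Assumes the shell and Terminal use UTF-8.
misread_as_macroman() {
	printf '%s' "$1" | iconv -f UTF-8 -t "$2" | iconv -f MACINTOSH -t UTF-8
}

# Try a few candidate encodings (a guess, not a known cause).
# A line with nothing after the colon means the local iconv refused that conversion.
for enc in UTF-8 SHIFT_JIS BIG5; do
	printf '%s: %s\n' "$enc" "$(misread_as_macroman '小狗出售' "$enc" 2>/dev/null)"
done
```

If one of the lines matches the start of the mangled name, that at least suggests which legacy encoding is involved somewhere along the way.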

I still cannot find the problem, though. Your script works OK for me. I added five lines to your latest script to adapt it to my environment (each of my lines is marked with XXX; the preceding, original line is included for context):

-- .
property g_transfer_files_location : "ServerName:Transfer Files"
set g_transfer_files_location to (path to desktop folder as text) & "Test Transfer Folder:" --XXX
-- .
property g_threshold_size : 300 as integer --size in megabytes
set g_threshold_size to 0.3 --XXX
-- .
		set log_target to ("Data: Automation:Logs:" & g_log_file_name) as text
		set log_target to ((path to desktop folder as text) & "Test Transfer Folder:" & g_log_file_name) as text --XXX
-- .
	set find0 to result
	if length of find0 is 0 then return {} --XXX I have some empty folders in my test environment, this prevents them from causing problems
-- .
my logMe("Transfer Folder Cleaner and Logger End--" & (current date), 1)
return {files_to_mark, files_to_delete, oversized_user_folders, transfer_folder_total} --XXX

Which ends up returning this:

{{"MacHD:Users:username:Desktop:Test Transfer Folder:A category folder:A user folder:a.file", "MacHD:Users:username:Desktop:Test Transfer Folder:A category folder:A user folder:a.file copy", "MacHD:Users:username:Desktop:Test Transfer Folder:A category folder:A user folder:a.file copy 1", "MacHD:Users:username:Desktop:Test Transfer Folder:A category folder:A user folder:a.file copy 2", "MacHD:Users:username:Desktop:Test Transfer Folder:新汉仪中文字体:User X:x.file", "MacHD:Users:username:Desktop:Test Transfer Folder:新汉仪中文字体:User X:新汉仪中文字体.", "MacHD:Users:username:Desktop:Test Transfer Folder:新汉仪中文字体:仪中:g.file", "MacHD:Users:username:Desktop:Test Transfer Folder:新汉仪中文字体:仪中:k.file", "MacHD:Users:username:Desktop:Test Transfer Folder:新汉仪中文字体:仪中:小狗出售(溫馨感人).pps", "MacHD:Users:username:Desktop:Test Transfer Folder:新汉仪中文字体:文字:b.file", "MacHD:Users:username:Desktop:Test Transfer Folder:新汉仪中文字体:文字:i.file", "MacHD:Users:username:Desktop:Test Transfer Folder:新汉仪中文字体:文字:j.file", "MacHD:Users:username:Desktop:Test Transfer Folder:新汉仪中文字体:文字:字", "MacHD:Users:username:Desktop:Test Transfer Folder:新汉仪中文字体:新汉:h.file", "MacHD:Users:username:Desktop:Test Transfer Folder:新汉仪中文字体:新汉:l.file", "MacHD:Users:username:Desktop:Test Transfer Folder:新汉仪中文字体:新汉:字"}, {"MacHD:Users:username:Desktop:Test Transfer Folder:新汉仪中文字体:仪中:g.file", "MacHD:Users:username:Desktop:Test Transfer Folder:新汉仪中文字体:仪中:小狗出售(溫馨感人).pps", "MacHD:Users:username:Desktop:Test Transfer Folder:新汉仪中文字体:新汉:h.file", "MacHD:Users:username:Desktop:Test Transfer Folder:新汉仪中文字体:新汉:l.file"}, {"MacHD:Users:username:Desktop:Test Transfer Folder:A category folder:A user folder", "MacHD:Users:username:Desktop:Test Transfer Folder:新汉仪中文字体:文字"}, 3}

That problematic filename ends up in that list twice, once as a file to mark, and again as a file to delete (yes, its date is old enough to qualify it for both). The file (and the other oldish files) were all marked with the red label, too.

Obviously, if the strings in user_folders_filtered_contents are mangled, the conversion to an alias will fail. I just don’t see any reason the contents would be mangled. Can Script Debugger evaluate statements in the context of the error? The result of tell item m of user_folders_filtered_contents to {class of it, it} might be interesting.

I don’t have Script Debugger, but I suppose the problem might be due to a difference between Script Debugger and Script Editor (though based on the reputation of Script Debugger, it seems unlikely).

Another difference is that I am doing all my testing with files on my boot drive. Maybe the file sharing is causing the problem. That would mean it causes a problem for the BSD layer (the results from find are bad?) but not for the Mac layer (I am assuming you see the file correctly in Finder), which would be strange. Do the problematic files show up OK in Terminal if you do an ls -Rlw from the user’s folder? That ls command can also be run from the top-level transfer folder or the category folder: the -R causes recursion, so it should work from any level; you will just see more junk the less deep you start.
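If the ls output is hard to read, a sketch along these lines prints each name one level below a folder, followed by the raw bytes of that name. That makes it easier to see whether the BSD layer is handing back valid UTF-8 for the problem file. The function name dump_names and the example path in the comment are invented for illustration; the -mindepth and -maxdepth options are the same ones the script's listGetter already passes to find.

```shell
# Hypothetical helper: list every entry directly inside folder $1 and
# print the bytes of each name in hex underneath it.
# Example call (the mount path is an assumption, adjust to your server):
#   dump_names "/Volumes/ServerName/Transfer Files/Designers/DesignerName"
dump_names() {
	# find hands the paths to the inner sh as arguments, so odd bytes in a
	# name are never re-parsed by the shell.
	find "$1" -mindepth 1 -maxdepth 1 -exec sh -c '
		for path in "$@"; do
			name=${path##*/}                    # strip the leading folders
			printf "%s\n" "$name"               # the name as Terminal shows it
			printf "%s" "$name" | od -An -tx1   # the same name, byte by byte
		done
	' sh {} +
}
```

For a name stored as UTF-8, the first character of the problem file (小, U+5C0F) should show up as the bytes e5 b0 8f. Anything else at the start of that name would point at the file server or the BSD layer rather than at AppleScript.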

Model: iBook G4 933
AppleScript: 1.10.7
Browser: Safari 3.0.4 (523.12)
Operating System: Mac OS X (10.4)