Unicode text conversion pains

Greetings. These really are my first AppleScripting steps, hence help would be appreciated.

The task: convert ID3 tag data on a selection of songs in iTunes from a legacy charset (assumed to be CP-866, i.e. Russian MS-DOS) to Unicode. My first attempt was to go via “do shell script” and call iconv(1):


tell application "iTunes"
	set songList to selection
	repeat with song in songList
		set tSongName to name of song as string
		set uSongName to do shell script "echo '" & tSongName & "'|iconv -f cp866 -t utf8" as «class utf8»
		display dialog uSongName
	end repeat
end tell

This kinda-sorta works, but I do get a lot of garbage characters like:

“├З├и├м├а”

instead of just

“Зима”

It gets worse the longer the string is.
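One plausible reading of the garbage, sketched in Python rather than AppleScript so that it can be checked byte by byte. It rests on two assumptions drawn from the output above, neither of which the thread confirms: the tag bytes are really CP1251 (Windows Cyrillic) rather than CP-866, and iTunes has already decoded them as Latin-1 before the script sees them. The shell then hands iconv the UTF-8 encoding of that mis-decoded text.

```python
# Assumption: iTunes shows CP1251 tag bytes decoded as Latin-1.
garbled = "Çèìà"

# What the iconv call effectively does: the shell passes UTF-8 bytes,
# which are then interpreted as CP866.
via_cp866 = garbled.encode("utf-8").decode("cp866")

# Undo the assumed Latin-1 decoding, then decode the raw bytes as CP1251.
repaired = garbled.encode("latin-1").decode("cp1251")

print(via_cp866)
print(repaired)
```

If the assumption holds, the same two-step conversion could be tried in the shell pipeline as `iconv -f UTF-8 -t ISO-8859-1 | iconv -f CP1251 -t UTF-8`; that variant is untested here.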

The next attempt was to employ a custom-built character map:

tell application "iTunes"
	set songList to selection
	repeat with song in songList
		set tSongName to name of song as string
		set uSongName to degarbled(tSongName) as Unicode text
		display dialog uSongName
	end repeat
end tell

set testStr to degarbled("Çèìà Áåçíà çâà íèÿÏîäî") as Unicode text
display dialog testStr

on degarbled(str)
	script o
		-- garbled up cyrillics are coming up as these "extended latin" chars
		property g : {192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 177, 184}
		-- the real thing is here
		property c : {1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1068, 1069, 1070, 1071, 1072, 1073, 1074, 1075, 1076, 1077, 1078, 1079, 1080, 1081, 1082, 1083, 1084, 1085, 1086, 1087, 1088, 1089, 1090, 1091, 1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101, 1102, 1103, 1025, 1105}
		-- `JO' is always at the end (not after `JE')
		property t : {}
	end script
	repeat with i from 1 to length of str
		set charCode to ASCII number of (character i of str)
		if charCode ≠ 184 and charCode ≠ 177 then
			set o's t's end to item (charCode - 191) of o's g
		else
			-- special treatment for `JOs'
			if charCode = 177 then set o's t's end to 1025
			if charCode = 184 then set o's t's end to 1105
		end if
	end repeat
	set outStr to "" as Unicode text
	repeat with i from 1 to length of o's t
		set outStr to outStr & (unicode character of item i of o's t)
	end repeat
	"" & outStr
end degarbled

And this breaks at “unicode character of…”. I realise there does not seem to be a conversion like that, but is there any way to build a Unicode string out of character codes?

Model: PowerBook G4, iMac G5, MacBook
AppleScript: 1.10.7
Browser: Safari 419.3
Operating System: Mac OS X (10.4)

Hi,

This is only possible with a trick.
First, replace the values in property c with their hexadecimal equivalents, as strings:

property c : {"0410", "0411"...}

Second, shouldn’t this line index into c rather than g?

set o's t's end to item (charCode - 191) of o's c -- c not g!!

Third, a Unicode character can only be created with the run script command:

set outStr to outStr & (run script "«data utxt" & item i of o's t & "»" as Unicode text)

Hope it helps

Edit: Here’s your subroutine, slightly optimized (of course I used a script to create the list of hex values ;) )

display dialog degarbled("Çèìà Áåçíà çâà íèÿÏîäî±âˆ")

on degarbled(str)
    script o
        -- the real thing is here
        property c : {"0410", "0411", "0412", "0413", "0414", "0415", "0416", "0417", "0418", "0419", "041A", "041B", "041C", "041D", "041E", "041F", ¬
            "0420", "0421", "0422", "0423", "0424", "0425", "0426", "0427", "0428", "0429", "042A", "042B", "042C", "042D", "042E", "042F", ¬
            "0430", "0431", "0432", "0433", "0434", "0435", "0436", "0437", "0438", "0439", "043A", "043B", "043C", "043D", "043E", "043F", ¬
            "0440", "0441", "0442", "0443", "0444", "0445", "0446", "0447", "0448", "0449", "044A", "044B", "044C", "044D", "044E", "044F"}
        -- `JO' is always at the end (not after `JE')
    end script
    set outStr to "" as Unicode text
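    repeat with i in characters of str
        set charCode to ASCII number of i
        if charCode = 184 then
            set Uni to "0451"
        else if charCode = 177 then
            set Uni to "0401"
        else if charCode ≥ 192 then
            set Uni to item (charCode - 191) of o's c
        else
            -- not in the table (space, digit, punctuation): keep it as is
            set Uni to ""
        end if
        if Uni is "" then
            set outStr to outStr & contents of i
        else
            set outStr to outStr & (run script "«data utxt" & Uni & "»" as Unicode text)
        end if
    end repeat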
    return outStr
end degarbled
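
The list of hex strings in property c does not have to be typed by hand. The reply does not show the script it used, so this is only a sketch of one way to generate such a list, written in Python; `hex_table` is a hypothetical name. It assumes the table covers the contiguous range U+0410 (А) to U+044F (я), with Ё and ё handled separately as in the handler above.

```python
# Build the 64 four-digit hex strings for U+0410..U+044F.
hex_table = [format(code_point, "04X") for code_point in range(0x0410, 0x0450)]

# Print in AppleScript list syntax, ready to paste into property c.
print("{" + ", ".join('"' + h + '"' for h in hex_table) + "}")
```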

Hmmm… Close, but not quite there… Must be something not quite right in the mapping table… Certainly no more garbage chars. Ok, let me try and take it from here – thanks a bunch!


Guys, did anyone manage to fix this conversion?

I’m getting the same silly thing:


	do shell script "echo 'Áè2- Ïîëêîâíèê'|iconv -f CP866 -t UTF-8" as «class utf8»
		"├Б├и2- ├П├о├л├к├о├в├н├и├к"

The text is garbled. However, it’s not just the extra characters: some characters are also converted wrongly, as in this one:


	do shell script "echo 'Ñìûñëîâûå ãà ëëflöèíà öèè - Âå÷íî'|iconv -f CP866 -t UTF-8" as «class utf8»
		"├С├м├?├▒├л├о├в├?├е ├г├а├л├лямВ├?├и├н├а├?├и├и - ├В├е├?├н├о"

If you can read Russian you can easily see that letters such as “ы с ю ц ч”, and I guess many others, are broken as well.

I’d appreciate any help.
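
A hedged sketch, in Python, of which letters this pipeline would be expected to break. It assumes the same history for the text as the output above suggests (CP1251 bytes shown as Latin-1 characters, whose UTF-8 bytes iconv then reads as CP866); `through_pipeline` is a hypothetical helper, not anything from the scripts in this thread.

```python
# Lowercase Russian letters а..я (U+0430..U+044F).
letters = [chr(cp) for cp in range(0x0430, 0x0450)]

def through_pipeline(letter):
    # Assumed history of the text: a CP1251 byte shown as a Latin-1 character...
    shown = letter.encode("cp1251").decode("latin-1")
    # ...whose two UTF-8 bytes are then read as CP866 by iconv.
    return shown.encode("utf-8").decode("cp866")

# A letter survives if the second output character is the letter itself
# (the first is always the stray box-drawing character).
broken = [ch for ch in letters if through_pipeline(ch)[1] != ch]
print("".join(broken))
```

Under these assumptions the letters а to п come through by coincidence, while р to я land on CP866 box-drawing characters, which would be consistent with the broken letters listed above.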

Hi white,

ceesaxp, the original poster, has published a solution here.
Hope it helps

One small word of caution: if you’re on an Intel machine, as-convert-russian does not work quite as expected :(

This is most likely down to the endianness difference between x86 and PPC.

So, if you are on an Intel machine, you may need to do a little bit of tweaking…
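
To illustrate the suspected endianness problem, here is a small Python sketch. It assumes that «data utxt…» is read as raw UTF-16 in the machine's native byte order, which the thread does not confirm; the hex value is the one used for А in the handler earlier in the thread.

```python
# The four hex digits the handler feeds to «data utxt...», as raw bytes.
raw = bytes.fromhex("0410")

big_endian = raw.decode("utf-16-be")     # how a PPC Mac would read them
little_endian = raw.decode("utf-16-le")  # how an Intel Mac would read them

# Swapping the two bytes gives the little-endian spelling of U+0410.
swapped = bytes.fromhex("1004").decode("utf-16-le")

print(big_endian, hex(ord(little_endian)), swapped)
```

If that assumption is right, the tweak on Intel would be to swap the two bytes of every entry in the table (for example "1004" instead of "0410").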