Unicode text conversion pains

ceesaxp · July 24, 2007, 7:43pm

Greetings. These really are my first steps in AppleScript, so any help would be appreciated.

The task: convert ID3 tag data on a selection of songs in iTunes from a legacy charset (assumed to be CP-866, i.e. Russian MS-DOS) to Unicode. The first attempt was to go via “do shell script” and call iconv(1):


tell application "iTunes"
	set songList to selection
	repeat with song in songList
		set tSongName to name of song as string
		set uSongName to do shell script "echo '" & tSongName & "'|iconv -f cp866 -t utf8" as «class utf8»
		display dialog uSongName
	end repeat
end tell

This kinda-sorta works, but I do get a lot of garbage characters like:

“├З├и├м├а”

instead of just

“Зима”

Gets worse if a string is longer.
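For readers following along, here is a rough sketch (in Python rather than AppleScript, purely for illustration) of where garbage of this kind can come from. It assumes the tag bytes were really CP1251 and were displayed as Latin-1, which is a guess based on the sample name and not something the thread confirms; the variable names are made up for the example.

```python
# Sketch of what the shell pipeline above appears to do to a song name.
shown_in_itunes = "Çèìà"  # assumption: CP1251 bytes displayed as Latin-1

# `echo ... | iconv -f cp866 -t utf8`: the shell receives the UTF-8 bytes of
# the Latin letters, and iconv then reads those bytes as CP866.
utf8_bytes = shown_in_itunes.encode("utf-8")
garbled = utf8_bytes.decode("cp866")
print(garbled)  # ├З├и├м├а

# Undoing the assumed mix-up instead: recover the raw bytes, decode as CP1251.
repaired = shown_in_itunes.encode("latin-1").decode("cp1251")
print(repaired)  # Зима
```

Every Latin letter becomes two UTF-8 bytes, and the first of them (0xC3) is a box-drawing character in CP866, which would explain why the garbage grows with the length of the string.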

The next attempt was to employ a custom-built character map:

tell application "iTunes"
	set songList to selection
	repeat with song in songList
		set tSongName to name of song as string
		set uSongName to degarbled(tSongName) as Unicode text
		display dialog uSongName
	end repeat
end tell

set testStr to degarbled("Ã‡Ã¨Ã¬Ã ÃÃ¥Ã§ÃÃ Ã§Ã¢Ã ÃÃ¨Ã¿ÃÃ®Ã¤Ã®") as Unicode text
display dialog testStr

on degarbled(str)
	script o
		-- garbled up cyrillics are coming up as these "extended latin" chars
		property g : {192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 177, 184}
		-- the real thing is here
		property c : {1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1068, 1069, 1070, 1071, 1072, 1073, 1074, 1075, 1076, 1077, 1078, 1079, 1080, 1081, 1082, 1083, 1084, 1085, 1086, 1087, 1088, 1089, 1090, 1091, 1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101, 1102, 1103, 1025, 1105}
		-- `JO' is always at the end (not after `JE')
		property t : {}
	end script
	repeat with i from 1 to length of str
		set charCode to ASCII number of (character i of str)
		if charCode ≠ 184 or charCode ≠ 177 then
			set o's t's end to item (charCode - 191) of o's g
		else
			-- special treatment for `JOs'
			if charCode = 177 then set o's t's end to 1025
			if charCode = 184 then set o's t's end to 1105
		end if
	end repeat
	set outStr to "" as Unicode text
	repeat with i from 1 to length of o's t
		set outStr to outStr & (unicode character of item i of o's t)
	end repeat
	"" & outStr
end degarbled

And this breaks at “unicode character of…”. I realise there does not seem to be a conversion like that, but is there any way to build a Unicode string out of character codes?

Model: PowerBook G4, iMac G5, MacBook
AppleScript: 1.10.7
Browser: Safari 419.3
Operating System: Mac OS X (10.4)

StefanK · July 24, 2007, 8:30pm

Hi,

this is only possible with a trick.
First, replace the values in property c with their hexadecimal equivalents, as strings:

property c : {"0410", "0411"...}

Second, isn’t it

set o's t's end to item (charCode - 191) of o's c -- c not g!!

??

Third, the Unicode character can only be created with the run script command

set outStr to outStr & (run script "«data utxt" & item i of o's t & "»" as Unicode text)

Hope it helps

Edit: Here’s your subroutine, slightly optimized (of course, I used a script to create the list of hex values):

display dialog degarbled("Ã‡Ã¨Ã¬Ã ÃÃ¥Ã§ÃÃ Ã§Ã¢Ã ÃÃ¨Ã¿ÃÃ®Ã¤Ã®±âˆ")

on degarbled(str)
    script o
        -- UTF-16 code units (as hex strings) for U+0410 to U+044F, Cyrillic capital A through small YA
        property c : {"0410", "0411", "0412", "0413", "0414", "0415", "0416", "0417", "0418", "0419", "041A", "041B", "041C", "041D", "041E", "041F", ¬
            "0420", "0421", "0422", "0423", "0424", "0425", "0426", "0427", "0428", "0429", "042A", "042B", "042C", "042D", "042E", "042F", ¬
            "0430", "0431", "0432", "0433", "0434", "0435", "0436", "0437", "0438", "0439", "043A", "043B", "043C", "043D", "043E", "043F", ¬
            "0440", "0441", "0442", "0443", "0444", "0445", "0446", "0447", "0448", "0449", "044A", "044B", "044C", "044D", "044E", "044F"}
        -- the two JO letters (U+0401, U+0451) are not in this list; they are handled separately below
    end script
    set outStr to "" as Unicode text
    repeat with i in characters of str
        set charCode to ASCII number of i
        if charCode = 184 then
            set Uni to "0451"
        else if charCode = 177 then
            set Uni to "0401"
        else
            set Uni to item (charCode - 191) of o's c
        end if
        set outStr to outStr & (run script "«data utxt" & Uni & "»" as Unicode text)
    end repeat
    return outStr
end degarbled
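As a cross-check of the table logic, the same mapping can be sketched in Python (again only for illustration; `HEX_TABLE`, `SPECIAL` and `degarble` are hypothetical names). It assumes the character codes are the Latin-1 values of the garbled letters, which may differ from what `ASCII number` returns on a given system, and it passes anything outside the table through unchanged instead of indexing the list out of range.

```python
# Generate the 64 hex strings "0410".."044F" instead of typing them by hand.
HEX_TABLE = [format(0x0410 + i, "04X") for i in range(64)]

# The two JO letters sit outside the contiguous range (codes taken from the thread).
SPECIAL = {177: 0x0401, 184: 0x0451}

def degarble(text: str) -> str:
    """Map codes 192..255 to U+0410..U+044F; leave everything else alone."""
    out = []
    for ch in text:
        code = ord(ch)
        if code in SPECIAL:
            out.append(chr(SPECIAL[code]))
        elif 192 <= code <= 255:
            out.append(chr(int(HEX_TABLE[code - 192], 16)))
        else:
            out.append(ch)  # spaces, digits, punctuation stay untouched
    return "".join(out)

print(degarble("Çèìà"))  # Зима
```

The pass-through branch matters for real song titles, which usually contain spaces or digits that fall below 192.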

ceesaxp · July 25, 2007, 2:49am

Hmmm… Close, but not quite there… Must be something not quite right in the mapping table… Certainly no more garbage chars. Ok, let me try and take it from here – thanks a bunch!


white · September 21, 2007, 4:40am

Guys, did anyone manage to fix this conversion?

I’m getting the same silly thing:


	do shell script "echo 'Áè2- Ïîëêîâíèê'|iconv -f CP866 -t UTF-8" as «class utf8»
		"├Б├и2- ├П├о├л├к├о├в├н├и├к"

The text is garbled; however, it’s not just extra characters: some characters are also converted wrongly, as in this one:


	do shell script "echo 'Ñìûñëîâûå ãàëëﬂöèíàöèè - Âå÷íî'|iconv -f CP866 -t UTF-8" as «class utf8»
		"├С├м├?├▒├л├о├в├?├е ├г├а├л├лямВ├?├и├н├а├?├и├и - ├В├е├?├н├о"

If you can read Russian, you can easily see that letters such as “ы с ю ц ч”, and I guess many others, are broken as well.

I’d appreciate any help.

StefanK · September 21, 2007, 9:09pm

Hi white,

ceesaxp, the original poster, has published a solution here.
Hope it helps

ceesaxp · September 24, 2007, 3:03am

One tiny word of caution: if you’re on an Intel machine, as-convert-russian does not work quite as expected.

This is almost certainly linked to the endianness of x86 vs. PPC.

So, if you are on an Intel machine, you may need to do a little bit of tweaking…
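One possible reading of the endianness remark, sketched in Python for illustration: if the hex digits in a «data utxt…» literal are taken as raw UTF-16 bytes in the machine's native byte order (an assumption, not something confirmed in this thread), then "0410" decodes to a different character on a little-endian machine unless the two bytes of each code unit are swapped. The helper names below are hypothetical.

```python
def utxt_to_text(hex_units: str, byteorder: str) -> str:
    """Decode hex digits as raw UTF-16 bytes in the given byte order."""
    codec = "utf-16-be" if byteorder == "big" else "utf-16-le"
    return bytes.fromhex(hex_units).decode(codec)

def swap_bytes(hex_units: str) -> str:
    """Swap the two bytes of every 4-digit code unit, e.g. "0410" -> "1004"."""
    units = [hex_units[i:i + 4] for i in range(0, len(hex_units), 4)]
    return "".join(u[2:] + u[:2] for u in units)

print(utxt_to_text("0410", "big"))                 # А (U+0410), big-endian reading
print(hex(ord(utxt_to_text("0410", "little"))))    # 0x1004, a different character
print(utxt_to_text(swap_bytes("0410"), "little"))  # А again once the bytes are swapped
```

If that reading is right, the tweak on Intel would amount to writing each table entry byte-swapped ("1004" instead of "0410", and so on).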