Parsing UTF-8 and UTF-16

maddogandnoriko · January 15, 2010, 11:14pm

I am putting together a simple app to geocode some addresses from a CSV file. Everything works correctly when the file is UTF-8, but if the file is UTF-16 things get fouled up. There are two fields per line, a name and an address, with no commas. I change the delimiter to a comma, set the variables, and set the delimiter back to “”. I display a dialog and the name and address look correct, but then an error erupts. When I set the read command to read as Unicode text it does read the UTF-16 file, but then it refuses to read the UTF-8 file. So my question is… can I detect whether the file is UTF-16 and “read as Unicode text” as needed? I put some of the code below… if more is needed please let me know.

the error occurs on:


					set shellScript to "php -r \"echo urlencode('" & formattedAddress & "');\""
					set query to queryPrefix & (do shell script shellScript) & "&output=csv&sensor=false&key=" & queryKey

Thank you once again,
Todd

on readFile(theFile)
	if theFile = "" then
		set theFile to (choose file with prompt "Select a file to read:")
	end if
	open for access theFile
	set thePlaces to (read theFile)
	close access theFile
	return thePlaces
end readFile

			try
				changeDelimiter(",")
				set theName to text item 1 of thePerson
				set formattedAddress to text item 2 of thePerson
				changeDelimiter("")
				if (count of formattedAddress) > 1 then
					display dialog (formattedAddress)
					
					set shellScript to "php -r \"echo urlencode('" & formattedAddress & "');\""
					set query to queryPrefix & (do shell script shellScript) & "&output=csv&sensor=false&key=" & queryKey
					set googleContent to (do shell script "curl " & query)
					
					set googleContentRemainder to (characters (offset of "\"coordinates\": [" in googleContent) thru (length of googleContent) of googleContent) as text
					set googleContentRemainder to (characters 17 thru ((offset of "]" in googleContentRemainder) - 2) of googleContentRemainder) as text
					set googleLongitude to (characters 1 thru ((offset of "," in googleContentRemainder) - 1) of googleContentRemainder) as text
					set googleLatitude to (characters ((offset of "," in googleContentRemainder) + 2) thru ((length of googleContentRemainder) - 3) of googleContentRemainder) as text
					set googleLongitude to googleLongitude as number
					set googleLatitude to googleLatitude as number
					
					--Latitudes can take any value between -90 and 90 while longitude values can take any value between -180 and 180
					--47.267585,-122.584693
					if googleLatitude > -90 and googleLatitude < 90 and googleLongitude > -180 and googleLongitude < 180 then
						set successful to successful + 1
						set newText to newText & googleLatitude & "," & googleLongitude & "," & theName & return
						--latitude and longitude values
						set googleMarkers to googleMarkers & "|" & googleLatitude & "," & googleLongitude
					else
						set incompleteAddress to incompleteAddress + 1
					end if
				end if
				
			on error msg number num
				beep
				set AppleScript's text item delimiters to {""}
				display alert "Error " & num & " occurred" message (msg & " while processing " & name of thePerson) as critical
				return
			end try

StefanK · January 16, 2010, 8:06am

Hi,

There is no definitive way to identify a text encoding. Try this:


set thePath to "path:to:textfile.txt"
try
	read file thePath as «class utf8»
on error
	read file thePath as Unicode text
end try

Nigel_Garvey · January 16, 2010, 10:01am

… unless whatever wrote the text to the file was kind enough to put a Unicode byte-order mark at the beginning. (One can always hope!)

«data rdatEFBBBF»: UTF-8.
«data rdatFEFF»: Big-endian UTF-16, which is native on PowerPC machines and is, for historical reasons, the default for AppleScript’s File Read/Write commands.
«data rdatFFFE»: Little-endian UTF-16, which is native on Intel machines and is understood by ‘read … as Unicode text’ if the read starts with a BOM.


-- Script cut.

Edit: UTF-16 BOM values corrected and script scrapped in favour of the one two posts down.

maddogandnoriko · January 18, 2010, 12:11am

I rewrote the script a bit so I could post the code here. It is acting differently than in Xcode, but I believe my problem is the same. When I run the script in Script Editor, the UTF-8 file has “Ôªø” before the first visible letter and the UTF-16 file has “˛ˇ” before it. I imagine this is a clue. I could see the “invisible” characters when displaying a dialog in Script Editor but not when displaying a dialog in Xcode. Know why? Also, in Xcode I displayed the first character’s ASCII value for the UTF-8 file and it returned “0”; Script Editor returned 239.

Todd

set theFile to (choose file with prompt "Select a file to read:")
try
	open for access theFile
	set thePlaces to (read theFile)
	close access theFile
on error
	display dialog ("There was an error reading the file")
end try


display dialog (ASCII number of character 1 of thePlaces)
return thePlaces
set theStartingCount to count thePlaces
set inputFormat to "csv"
set OutputFormat to "tomtom"

repeat with thePerson in thePlaces
	try
		set AppleScript's text item delimiters to {","}
		set theName to text item 1 of thePerson
		display dialog (theName)
		set formattedAddress to text item 2 of thePerson
		set AppleScript's text item delimiters to {""}
		
		if (count of formattedAddress) > 1 then
			display dialog ("Address:" & formattedAddress)
			set shellScript to "php -r \"echo urlencode('" & formattedAddress & "');\""
			set query to queryPrefix & (do shell script shellScript) & "&output=csv&sensor=false&key=" & queryKey
		end if
	on error msg number num
		beep
		set AppleScript's text item delimiters to {""}
		display alert "Error " & num & " occurred" message (msg & " while processing " & thePerson) as critical
		return
	end try
end repeat

Nigel_Garvey · January 18, 2010, 1:21am

Those are the BOMs I mentioned in my post. You’re seeing them because you’re not using an ‘as’ parameter with your ‘read’ command, so the text is being interpreted as a one-byte-per-character string.

But I’m getting rubbish results for some reason this evening from the UTF-8 test I posted. I also got the UTF-16 BOMs slightly wrong, for which I apologise! Here’s another version of the code, which works:


set theFile to (choose file with prompt "Select a file to read:")
set fref to (open for access theFile)
try
	if ((read fref for 3 as data) is «data rdatEFBBBF») then
		-- UTF-8 BOM found: reread from the start as UTF-8.
		set thePlaces to (read fref from 1 as «class utf8»)
	else if ((read fref from 1 to 2 as data) is in {«data rdatFFFE», «data rdatFEFF»}) then
		-- UTF-16 BOM found (either byte order): reread from the start as UTF-16.
		set thePlaces to (read fref from 1 as Unicode text)
	else
		-- Don't know.
	end if
end try
close access fref

-- Rest of your script.
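
For what it's worth, the same BOM check could be wrapped in a handler to stand in for the readFile handler from the first post. This is only a sketch: readTextAutodetect is a made-up name, and falling back to UTF-8 when there is no BOM is an assumption, not something the check itself can establish.

```applescript
-- Hypothetical handler (not from the thread): read a text file, choosing the
-- encoding from its byte-order mark where one is present.
on readTextAutodetect(theFile)
	set fref to (open for access theFile)
	try
		set fileLength to (get eof fref)
		if fileLength > 2 and (read fref from 1 for 3 as data) is «data rdatEFBBBF» then
			-- UTF-8 BOM.
			set theText to (read fref from 1 as «class utf8»)
		else if fileLength > 1 and (read fref from 1 for 2 as data) is in {«data rdatFEFF», «data rdatFFFE»} then
			-- UTF-16 BOM, either byte order.
			set theText to (read fref from 1 as Unicode text)
		else if fileLength > 0 then
			-- ASSUMPTION: a file without a BOM is treated as UTF-8.
			set theText to (read fref from 1 as «class utf8»)
		else
			-- Empty file: nothing to read.
			set theText to ""
		end if
		close access fref
	on error msg number num
		-- Make sure the file is closed, then pass the error on.
		close access fref
		error msg number num
	end try
	return theText
end readTextAutodetect

set thePlaces to readTextAutodetect(choose file with prompt "Select a file to read:")
```

The length checks are there so that a one- or two-byte file does not raise an end-of-file error during the BOM test.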

maddogandnoriko · January 18, 2010, 3:05pm

Thank you, Nigel. That is exactly what I needed. On the same note, I have another project that makes sure my subtitles are formatted correctly for my DVD player. I need to read a UTF-8 file with a BOM and write it as UTF-8 with no BOM. My questions: if I read the file into a variable using your method above, does it retain the invisible characters (the BOM)? And how can I write a variable to a UTF-8 file with no BOM?

Thanks again,
Todd

Nigel_Garvey · January 18, 2010, 9:29pm

Hi, Todd.

If a read is ‘as «class utf8»’, any UTF-8 BOM at the beginning is recognised and ignored. It’s not returned as part of the text. Also, the text itself is returned and handled within AppleScript as UTF-16. If you want to write it back to the same or another file as UTF-8, you have to use ‘as «class utf8»’ with the ‘write’ command. No BOM is written unless you supply one yourself, perhaps by writing «data rdatEFBBBF» to the file first.
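
A minimal sketch of the UTF-8 case just described, assuming the source file is UTF-8 (with or without a BOM). The handler name rewriteAsUTF8WithoutBOM and the two file prompts are illustrative only and do not come from the thread.

```applescript
-- Hypothetical handler: copy a UTF-8 file to a new file with no BOM.
on rewriteAsUTF8WithoutBOM(sourceFile, destinationFile)
	-- Reading 'as «class utf8»' leaves any leading BOM out of the returned text.
	set theText to (read sourceFile as «class utf8»)
	
	set fref to (open for access destinationFile with write permission)
	try
		set eof fref to 0 -- Empty the file in case it already exists.
		write theText to fref as «class utf8» -- No BOM is added by 'write'.
		close access fref
	on error msg number num
		-- Make sure the file is closed, then pass the error on.
		close access fref
		error msg number num
	end try
end rewriteAsUTF8WithoutBOM

set sourceFile to (choose file with prompt "Select a UTF-8 file:")
set destinationFile to (choose file name with prompt "Save the BOM-free copy as:")
rewriteAsUTF8WithoutBOM(sourceFile, destinationFile)
```

To keep a BOM instead, one extra line, write «data rdatEFBBBF» to fref, would go just before the text is written.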

If a read is ‘as Unicode text’, any UTF-16 BOM at the beginning is used to interpret whether the text in the file is big-endian or little-endian. Obviously this only works when reading the file from the beginning. Any subsequent reads from later in the file will not include the BOM and will assume big-endian text. As with UTF-8, no BOM is written when Unicode text is written to a file, so you have to supply your own if you want one. Since ‘write’ writes Unicode text exclusively in big-endian form, any BOM you supply must be big-endian too: «data rdatFEFF».

The “big-endian” assumption is because Macs used big-endian processors until recently. (68000s, then PowerPCs.) The File Read/Write commands have to make the same assumptions on all Macs when there are no BOMs, so for now that’s still “big-endian”.
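
The UTF-16 case could be sketched like this, on the assumption stated above that ‘write … as Unicode text’ produces big-endian text with no BOM of its own. The sample text and the file prompt are placeholders. The big-endian BOM is the byte pair FE FF.

```applescript
-- Placeholder content; any AppleScript text would do.
set theText to "Name,Address" & linefeed
set destinationFile to (choose file name with prompt "Save the UTF-16 file as:")

set fref to (open for access destinationFile with write permission)
try
	set eof fref to 0 -- Empty the file in case it already exists.
	write «data rdatFEFF» to fref -- Big-endian UTF-16 BOM, supplied by hand.
	write theText to fref as Unicode text -- ASSUMPTION: written big-endian, without a BOM.
	close access fref
on error msg number num
	-- Make sure the file is closed, then pass the error on.
	close access fref
	error msg number num
end try
```

Reading the result back from the beginning with ‘read … as Unicode text’ should then pick up the BOM and interpret the byte order from it.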

maddogandnoriko · January 19, 2010, 2:58pm

Thank you so much, Nigel. I did a little playing around with opening files yesterday and I think your file-opening code worked great. It solved my subtitler problem, and I now understand a bit better about the BOM, where the strange characters come from, and subsequently where they go when using “read as”.

Thank you,
Todd