how to translate html numeric character encodings?

regulus6633 · June 11, 2007, 12:21pm

I use curl and grep to get some information from a web page. My result is a person’s name like the following:

Fran&#XXX;ois Ber&#XXX;and (where the first XXX is the number 231 and the second is 233; I can’t put them in because MacScripter translates them into the real letters!)

The header for the web page says it’s encoded using “charset=iso-8859-1”. After some research I found this website (http://unicode.coeurlumiere.com/), which shows me the conversions. Does AppleScript or the Terminal have any translators for handling these numeric encodings?

The form seems to be &#XXX;
number 231 translates to ç
number 233 translates to é

Adam_Bell · June 11, 2007, 1:26pm

They are standard HTML special characters, included in HTML 2 through HTML 4, for encoding letters and symbols that are not among the first 127 ASCII characters. They start with a non-breaking space at 160 and continue to 255, which is a lowercase y with an umlaut. You can’t enter them to show in a post here because your browser interprets them, not the web site. AppleScript does not have a translator, and I don’t know of one, so if I had to deal with it, I’d set up an array of those I thought I’d encounter.

Google for HTML Special Character set, for example: http://www.w3.org/MarkUp/html-spec/html-spec_13.html

hhas · June 11, 2007, 1:27pm

Page encoding tells you how to translate unencoded characters; it’s irrelevant to interpreting HTML entities.
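To make that distinction concrete, here is a minimal sketch in Python (not the language used in this thread; chosen only because its standard library covers both steps). The byte strings are invented samples, not taken from the actual page. Charset decoding turns bytes into text, and entity decoding then works on that text regardless of which charset was used.

```python
import html

# Invented sample bytes, as they might arrive from curl for a page
# declared as iso-8859-1.
raw_literal = b"Fran\xe7ois"    # the letter is stored as one Latin-1 byte
raw_entity = b"Fran&#231;ois"   # the letter is a numeric entity, pure ASCII

# Step 1: the page encoding says how to turn bytes into text.
text_literal = raw_literal.decode("iso-8859-1")
text_entity = raw_entity.decode("iso-8859-1")

# Step 2: entity decoding operates on text and ignores the charset.
decoded_literal = html.unescape(text_literal)
decoded_entity = html.unescape(text_entity)

print(decoded_literal)
print(decoded_entity)
```

Because the entity form is plain ASCII, it would decode to the same text under almost any page encoding, which is why the charset header does not matter for that step.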

For converting HTML entities, see TextCommands’ ‘decode HTML’ command. For converting character sets, see the ‘convert to unicode’ command.

HTH

Adam_Bell · June 11, 2007, 1:32pm

Got it some time ago, but haven’t had occasion to use it yet, hhas - thanks for the reminder.

regulus6633 · June 11, 2007, 2:34pm

That’s what I was hoping to avoid.

Thanks. I gave it a try and it works. I thought there might be some straight conversion from ASCII characters though. Something like…

set the_letter to ASCII character (HTML_code_number + x) → But I guess not.

Adam_Bell · June 11, 2007, 3:05pm

AppleScript is not really HTML or XML oriented. It’s just barely Unicode capable. You have to rely on external helpers like HHAS’s.

hhas · June 11, 2007, 3:12pm

Dunno why you’d want to convert them yourself, but anyway, numeric entities are just UCS [1] code points, written either in decimal (&#229;) or hexadecimal (&#xE5;) form. Decimal values are easily converted using the ‘unicode characters’ command; hexadecimals you’d first need to convert to decimal yourself.
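As a rough illustration of "numeric entities are just code points", here is a minimal Python sketch that handles both the decimal and the hexadecimal form. The function name decode_numeric_refs and the regular expression are hypothetical, made up for this example; this is not how TextCommands does it.

```python
import re

# Matches decimal (&#231;) and hexadecimal (&#xE7;) numeric references.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def decode_numeric_refs(text):
    """Replace numeric character references with the characters they name."""

    def replace(match):
        hex_digits, dec_digits = match.groups()
        # Hex references are parsed base 16, decimal ones base 10.
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Value is outside the Unicode range: leave the reference as-is.
            return match.group(0)

    return NUMERIC_REF.sub(replace, text)


print(decode_numeric_refs("Fran&#231;ois Ber&#233;and"))
```

Named entities such as the ampersand escape are deliberately left alone here; only the numeric forms are converted.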

To be honest though, scraping webpages is best done using tools that already understand HTML (e.g. the Beautiful Soup module for Python). I think the closest you’ll get in AppleScript is to convert the page to valid XHTML using HTMLTidy, then use XMLLib or System Events to extract the data. Or just use something like Beautiful Soup from AppleScript, e.g. via ‘do shell script’.

[1] TextCommands is limited to UCS2, but this won’t be an issue unless you’re dealing with really exotic stuff.

regulus6633 · June 11, 2007, 11:22pm

So my code is more portable.

I couldn’t figure this out. In the end I had to convert the decimal value to hex and then to the unicode character.

Thanks for the help hhas and Adam! I appreciate it. Through what you told me and searching I came up with a solution that’ll work for my case. Any optimizations you see would be welcome.

-- The first XXX is the number 231 and the second is 233. Again, Safari is converting my code, so it's entered like this for this example.
set the_string to "Fran&#XXX;ois Ber&#XXX;and"
my decHTML_to_string(the_string)

on decHTML_to_string(the_string)
	set {TIDs, text item delimiters} to {text item delimiters, "&#"}
	set b to text items of the_string
	set text item delimiters to TIDs
	set uniList to {item 1 of b}
	repeat with i from 2 to (count of b)
		set this_string to item i of b
		set string_count to count of this_string
		repeat with j from 1 to string_count
			if item j of this_string is ";" or item j of this_string is "\\" then
				set nDec to text 1 thru (j - 1) of this_string -- get the decimal value
				set nHex to do shell script "perl -e 'printf(\"%04X\", " & nDec & ")'" -- convert decimal to hex
				set uChar to run script "«data utxt" & nHex & "»" -- convert unicode hex to unicode character
				if string_count > j then
					set u_string to (uChar & (text (j + 1) thru string_count of this_string)) as string
				else
					set u_string to uChar
				end if
				set end of uniList to u_string
				exit repeat
			end if
		end repeat
	end repeat
	return uniList as string
end decHTML_to_string

Adam_Bell · June 12, 2007, 12:51am

Works nicely if the code is given in numeric form, Regulus, but will poop out if you encounter a web site that uses the mnemonic equivalents.

e.g. “Fran&#231\ois Ber&#233\and” (\ = ; so it won’t convert) can also be acceptably written as “Fran&ccedil;ois Ber&eacute;and”, and most modern browsers will correctly interpret that (though punBB doesn’t). I stuck both in a web page and BBEdit’s preview got them right.
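For what it’s worth, the mnemonic names map onto the same numbers, so handling them comes down to a lookup table. As a hedged illustration in Python (not AppleScript), the standard library ships such a table as html.entities.name2codepoint, and html.unescape accepts either spelling; the sample strings below are the ones from this thread.

```python
import html
from html.entities import name2codepoint

# The mnemonic names resolve to the same code points as the numeric forms.
ccedil_code = name2codepoint["ccedil"]
eacute_code = name2codepoint["eacute"]
print(ccedil_code, eacute_code)

# html.unescape handles the numeric and the mnemonic spelling alike.
from_numeric = html.unescape("Fran&#231;ois Ber&#233;and")
from_mnemonic = html.unescape("Fran&ccedil;ois Ber&eacute;and")
print(from_numeric == from_mnemonic)
```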

regulus6633 · June 12, 2007, 1:50am

Thanks, that’s why I called the handler decHTML… for decimal form HTML.

I’m new to this, so I didn’t know about that form. Anyway, it’s an easy fix to account for that, and I added it to the above script.

I knew about those but I’m not going to worry about them. I’d need an array to handle that, and the website I’m targeting this for doesn’t use them. I’m targeting this for the IMDB website and the QuickTime script I posted in the Code Exchange… http://bbs.applescript.net/viewtopic.php?pid=82571#p82571.

Through my limited research into this I saw that it’s also acceptable to use the hex value instead of the decimal value. My script won’t handle those either. For the all-around job another tool is the better choice, but for my specific target my script should do nicely… at least until they change their website!

Adam_Bell · June 12, 2007, 3:35pm

Which some sites do with annoying regularity. I had a neat little script to grab an exchange rate from a bank that they broke so often I canned it.