xml & diacriticals (accented letters)

rickzu · September 7, 2004, 4:46pm

I have a script that writes an XML file. It works just fine unless some text in the file includes a diacritical (an accented letter such as î, û, é, etc.). Then the file appears to be corrupted, and I can neither import it into Illustrator nor open it with TextEdit. If I take out the diacriticals, it’s fine.

Does anyone know any way around this?

Thanks, Rick

jonn8 · September 7, 2004, 5:24pm

You probably have to encode your high-ASCII characters just like in HTML, so a diacritical like é should be written as either &eacute; or &#233;. There are probably better ways of doing this, but you can just do a straight search and replace:

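-- Characters to replace. The blank after … is a non-breaking space (U+00A0), not a plain space.
property search_entities : (characters of "&\"<>ÄÅÇÉÑÖÜáàâäãåçéèêëíìîïñóòôöõúùûü†°¢£§•¶ß®©™´¨≠ÆØ∞±≤≥¥µ∂∑∏π∫ªºΩæø¿¡¬√ƒ≈Δ… ÀÃÕŒœ–—“”‘’÷◊ÿŸ⁄€‹›‡·‚„‰ÊÁËÈÍÎÏÌÓÔÒÚÛÙˆ˜¯¸ﬁﬂı˘˙˚˝˛ˇ")
-- Named entities, in the same order as search_entities (numeric where HTML 4 has no name)
property replace_entities : {"&amp;", "&quot;", "&lt;", "&gt;", "&Auml;", "&Aring;", "&Ccedil;", "&Eacute;", "&Ntilde;", "&Ouml;", "&Uuml;", "&aacute;", "&agrave;", "&acirc;", "&auml;", "&atilde;", "&aring;", "&ccedil;", "&eacute;", "&egrave;", "&ecirc;", "&euml;", "&iacute;", "&igrave;", "&icirc;", "&iuml;", "&ntilde;", "&oacute;", "&ograve;", "&ocirc;", "&ouml;", "&otilde;", "&uacute;", "&ugrave;", "&ucirc;", "&uuml;", "&dagger;", "&deg;", "&cent;", "&pound;", "&sect;", "&bull;", "&para;", "&szlig;", "&reg;", "&copy;", "&trade;", "&acute;", "&uml;", "&ne;", "&AElig;", "&Oslash;", "&infin;", "&plusmn;", "&le;", "&ge;", "&yen;", "&micro;", "&part;", "&sum;", "&prod;", "&pi;", "&int;", "&ordf;", "&ordm;", "&Omega;", "&aelig;", "&oslash;", "&iquest;", "&iexcl;", "&not;", "&radic;", "&fnof;", "&asymp;", "&Delta;", "&hellip;", "&nbsp;", "&Agrave;", "&Atilde;", "&Otilde;", "&OElig;", "&oelig;", "&ndash;", "&mdash;", "&ldquo;", "&rdquo;", "&lsquo;", "&rsquo;", "&divide;", "&loz;", "&yuml;", "&Yuml;", "&frasl;", "&euro;", "&lsaquo;", "&rsaquo;", "&Dagger;", "&middot;", "&sbquo;", "&bdquo;", "&permil;", "&Ecirc;", "&Aacute;", "&Euml;", "&Egrave;", "&Iacute;", "&Icirc;", "&Iuml;", "&Igrave;", "&Oacute;", "&Ocirc;", "&Ograve;", "&Uacute;", "&Ucirc;", "&Ugrave;", "&circ;", "&tilde;", "&macr;", "&cedil;", "&#64257;", "&#64258;", "&#305;", "&#728;", "&#729;", "&#730;", "&#733;", "&#731;", "&#711;"}
-- Decimal character references, in the same order
property replace_entities_decimal : {"&#38;", "&#34;", "&#60;", "&#62;", "&#196;", "&#197;", "&#199;", "&#201;", "&#209;", "&#214;", "&#220;", "&#225;", "&#224;", "&#226;", "&#228;", "&#227;", "&#229;", "&#231;", "&#233;", "&#232;", "&#234;", "&#235;", "&#237;", "&#236;", "&#238;", "&#239;", "&#241;", "&#243;", "&#242;", "&#244;", "&#246;", "&#245;", "&#250;", "&#249;", "&#251;", "&#252;", "&#8224;", "&#176;", "&#162;", "&#163;", "&#167;", "&#8226;", "&#182;", "&#223;", "&#174;", "&#169;", "&#8482;", "&#180;", "&#168;", "&#8800;", "&#198;", "&#216;", "&#8734;", "&#177;", "&#8804;", "&#8805;", "&#165;", "&#181;", "&#8706;", "&#8721;", "&#8719;", "&#960;", "&#8747;", "&#170;", "&#186;", "&#937;", "&#230;", "&#248;", "&#191;", "&#161;", "&#172;", "&#8730;", "&#402;", "&#8776;", "&#916;", "&#8230;", "&#160;", "&#192;", "&#195;", "&#213;", "&#338;", "&#339;", "&#8211;", "&#8212;", "&#8220;", "&#8221;", "&#8216;", "&#8217;", "&#247;", "&#9674;", "&#255;", "&#376;", "&#8260;", "&#8364;", "&#8249;", "&#8250;", "&#8225;", "&#183;", "&#8218;", "&#8222;", "&#8240;", "&#202;", "&#193;", "&#203;", "&#200;", "&#205;", "&#206;", "&#207;", "&#204;", "&#211;", "&#212;", "&#210;", "&#218;", "&#219;", "&#217;", "&#710;", "&#732;", "&#175;", "&#184;", "&#64257;", "&#64258;", "&#305;", "&#728;", "&#729;", "&#730;", "&#733;", "&#731;", "&#711;"}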

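set t to "This is my résumé"
-- true selects the decimal list; XML predefines only &amp; &lt; &gt; &quot; &apos;, so named entities such as &eacute; need a DTD
set t to my replace_special_chars(t, true)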

on replace_special_chars(t, use_decimal)
	if use_decimal then
		set r to replace_entities_decimal
	else
		set r to replace_entities
	end if
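	-- comparisons ignore case by default, so without this "é" could match the "É" entry
	considering case
		repeat with i from 1 to count search_entities
			if t contains (item i of search_entities) then set t to my snr(t, (item i of search_entities), (item i of r))
		end repeat
	end considering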
	return t
end replace_special_chars

on snr(t, s, r)
	tell (a reference to my text item delimiters)
		set {o, contents} to {contents, s}
		set {t, contents} to {t's text items, r}
		set {t, contents} to {"" & t, o}
	end tell
	return t
end snr

Jon
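
For XML specifically, numeric character references are the safer choice, because XML predefines only five named entities (&amp;, &lt;, &gt;, &quot;, &apos;). As an alternative to the lookup table above, here is a minimal sketch that computes the reference from each character's code point instead. It assumes AppleScript 2.0 or later (for `id of` and `character id`), and the handler name `encode_for_xml` is made up for illustration:

```applescript
-- Hypothetical helper: escape XML markup characters and turn every
-- non-ASCII character into a decimal reference such as &#233;
on encode_for_xml(t)
	set resultText to ""
	repeat with ch in (characters of t)
		-- id may return a list when one character spans several code points
		repeat with n in ((id of (contents of ch)) as list)
			set cp to contents of n
			if cp is 38 then
				set resultText to resultText & "&amp;"
			else if cp is 60 then
				set resultText to resultText & "&lt;"
			else if cp is 62 then
				set resultText to resultText & "&gt;"
			else if cp is 34 then
				set resultText to resultText & "&quot;"
			else if cp > 127 then
				set resultText to resultText & "&#" & cp & ";"
			else
				set resultText to resultText & (character id cp)
			end if
		end repeat
	end repeat
	return resultText
end encode_for_xml

encode_for_xml("This is my résumé")
```

Because the ampersand is handled per character rather than in a later pass, the references it produces are never escaped a second time.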

julifos · September 7, 2004, 7:36pm

Or, if your file is (or should be) UTF-8 encoded, as declared in its first line (e.g., <?xml version="1.0" encoding="UTF-8"?>), just write the data “as «class utf8»”. Special characters will be translated automagically to UTF-8, and you will end up with a properly formatted file.
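
A minimal sketch of that approach, using the standard `open for access` and `write` commands. The file name `test.xml` on the desktop and the sample `<title>` element are placeholders, not anything from Rick's script:

```applescript
set xmlText to "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" & (ASCII character 10) & "<title>résumé</title>"
set filePath to (path to desktop as text) & "test.xml" -- placeholder location

set f to open for access file filePath with write permission
try
	set eof of f to 0 -- empty the file in case it already exists
	write xmlText to f as «class utf8» -- a precomposed é is stored as the two bytes C3 A9
	close access f
on error errMsg number errNum
	close access f -- never leave the file open
	error errMsg number errNum
end try
```

Without the `as «class utf8»` clause the text is written in a different encoding, which no longer matches the `encoding="UTF-8"` declaration and is the likely reason the file looked corrupted.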

rickzu · September 12, 2004, 12:43am

Jon, I couldn’t get your solution to work. I think I just don’t know what I’m doing. Thank you for being so quick with your reply.

jj, I got yours to work. Thank you. Elsewhere I found a tip to start the file with three ASCII characters (“write (ASCII character 239) & (ASCII character 187) & (ASCII character 191)”). It works, but I’m not sure why.

Thank you both for taking the time.

julifos · September 12, 2004, 1:06pm

Windows Notepad (or whatever it’s called) adds these three characters automatically when you save a text file in UTF-8 encoding. Some “folks” need these three characters to recognize the UTF-8 encoding of the file (for example, the PC version of the Flash player).
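
Those three bytes, 239 187 191 (hex EF BB BF), are the UTF-8 byte order mark (BOM). Here is a rough sketch of writing it ahead of the XML. It swaps the `ASCII character` calls Rick quoted for a raw data literal, which writes the same three bytes regardless of text encoding. The handler name `write_utf8_with_bom` and its parameters are invented for this example:

```applescript
-- Hypothetical helper: write theText as UTF-8, preceded by the BOM
on write_utf8_with_bom(theText, filePath)
	set f to open for access file filePath with write permission
	try
		set eof of f to 0 -- start from an empty file
		write «data rdatEFBBBF» to f -- the BOM bytes EF BB BF
		write theText to f as «class utf8» -- continues right after the BOM
		close access f
	on error errMsg number errNum
		close access f
		error errMsg number errNum
	end try
end write_utf8_with_bom

write_utf8_with_bom("<?xml version=\"1.0\" encoding=\"UTF-8\"?><title>résumé</title>", (path to desktop as text) & "test.xml")
```

An XML parser does not need the BOM to read UTF-8. It only helps applications that guess the encoding from the first bytes of the file.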