Coerce a list from Unicode text

Ray_Barber · June 17, 2004, 5:59pm

I need to read a file encoded in UTF-8. The data is a multidimensional array whose levels are separated by different delimiters. I have no problem reading it in as Unicode text and resolving the first-level array into a list. However, I can’t break it down any further.

Is there a way to coerce a list from a Unicode text?

The usual pairing of setting text item delimiters and then getting text items does not work on Unicode text the way it does on a string. :rolleyes:

Please give me a hand.

Thank you very much.

Ray_Barber · June 17, 2004, 9:17pm

Bad news!

Even reading the file with the using delimiter parameter does not work with non-ASCII characters!

Please help me parse Unicode text! Many thanks!

julifos · June 18, 2004, 9:12am

Could you post a live example of what you are trying to do? For instance, this works for me:

Ray_Barber · June 19, 2004, 3:55am

Thanks jj.

It works somehow.

However, the delimiters my data uses are u+001E and u+001D, and text items fails to distinguish between them.

For example, take a Unicode text such as
u+FEFF u+4E2D u+570D u+001D u+0032 u+0030 u+0030 u+001D u+0034 u+0037 u+0030 u+001E
u+65E5 u+672C u+000D u+0033 u+0030 u+0030 u+001D u+0035 u+0030 u+0030 u+001E

run through the following code:

set x to read file "path:to:file.txt" as «class utf8» -- x will be converted to Unicode text
set AppleScript's text item delimiters to ((ASCII character 30) as Unicode text) -- 30 = 0x001E
set y to x's text items

the resulting y =
{u+4E2D u+570D u+001D u+0032 u+0030 u+0030,
u+0034 u+0037 u+0030,
u+65E5 u+672C u+000D u+0033 u+0030 u+0030,
u+0035 u+0030 u+0030}
instead of
{u+4E2D u+570D u+001D u+0032 u+0030 u+0030 u+001D u+0034 u+0037 u+0030,
u+65E5 u+672C u+000D u+0033 u+0030 u+0030 u+001D u+0035 u+0030 u+0030}
It treats u+001D as if it were u+001E, which flattens the nested levels and prevents me from breaking the text down any further. :shock:

The same thing happens with u+0007 as well. Are there any other nonprintable control codes that text items does distinguish?

Thanks for helping me!

julifos · June 19, 2004, 10:37am

Hmm. I can’t run your code. You are describing “Unicode text” (UTF-16), not UTF-8, so you can’t read the file “as «class utf8»”; you have to read it “as Unicode text”. Anyway, what you say is true. If you try breaking the text by «data utxt001E», it mangles the result (and not only at the 001D).

If you explain what you need, perhaps we can find a workaround, such as treating the file as ASCII and then searching for ((ASCII character 0) & (ASCII character 30)). Another option might be the new search/replace feature in Smile, called “uchange” (“the Unicode version of change”), or “ufind text”.

Meanwhile, I’ll post this question to the AUL, so we can find out what’s going on (I’m also interested in utxt). It smells like a simple bug or a still-unsupported feature (perhaps that’s why Smile provides its “uchange”):

read alias "path:to:unicodeText.txt" as Unicode text
{result contains «data utxt001D», ¬
	sr(«data utxt001E», "", result) contains «data utxt001D»}
--> {true, false}

to sr(s, r, t)
	set AppleScript's text item delimiters to s
	set t to t's text items
	set AppleScript's text item delimiters to r
	set t to t as text
	set AppleScript's text item delimiters to {""}
	t
end sr
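
For anyone reading this on a later system: below is a minimal sketch of the two-level split discussed above. It assumes AppleScript 2.0 or later (Mac OS X 10.5 and up), where all text is Unicode and `character id` is available, so it would not have run on the systems in this thread. The handler name `splitText` and the in-memory sample text are made up for illustration; a real script would read the text from the file instead.

```applescript
-- Assumption: AppleScript 2.0 or later. splitText is a hypothetical helper.
on splitText(theText, theDelimiter)
	-- Save and restore the delimiters so the caller's setting is untouched.
	set savedDelimiters to AppleScript's text item delimiters
	set AppleScript's text item delimiters to theDelimiter
	set theItems to text items of theText
	set AppleScript's text item delimiters to savedDelimiters
	return theItems
end splitText

set groupSep to character id 29 -- u+001D, separates fields
set recordSep to character id 30 -- u+001E, terminates records

-- Illustrative sample built in memory: two records of three fields each.
set sampleText to "中國" & groupSep & "200" & groupSep & "470" & recordSep & ¬
	"日本" & groupSep & "300" & groupSep & "500" & recordSep

set parsed to {}
repeat with aRecord in splitText(sampleText, recordSep)
	-- The trailing record separator produces one empty item; skip it.
	if (contents of aRecord) is not "" then
		set end of parsed to splitText(contents of aRecord, groupSep)
	end if
end repeat
parsed
```

The outer split uses only u+001E and the inner split only u+001D, so under those assumptions the result should be a list of two lists with three fields each, which is the nesting the original post was after.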