Read and copy Chinese text from a simple text file

Joy · July 5, 2018, 9:12am

I saved Chinese quotes, which I need every day, in a simple Chinese.txt file.
My script reads, for example, paragraph 1 of that document and copies the output as Unicode text to my clipboard.

Unfortunately, I get a mess of characters when I do so, even when using the suffix “Unicode text”. Any suggestions on how to fetch the glyphs from my .txt document and put them on my clipboard in a readable format?

Example text in my Chinese.txt:
“下午好，让我们一起度过一个愉快的下午茶时光？”

Nigel_Garvey · July 5, 2018, 9:45am

Hi.

If you’re using scripts both to write to the file and read it, as Unicode text should be good enough:

set ChineseText to "下午好，让我们一起度过一个愉快的下午茶时光？"

set fileHandle to (open for access file ((path to desktop as text) & "Chinese.txt") with write permission)
try
	set eof fileHandle to 0
	write ChineseText as Unicode text to fileHandle
	set checkText to (read fileHandle from 1 as Unicode text)
end try
close access fileHandle

return checkText
--> "下午好，让我们一起度过一个愉快的下午茶时光？"

But for historical reasons, ‘Unicode text’ here means UTF-16 big-endian. The native form of UTF-16 on Intel machines is little-endian, so if you need to open the file in, say, TextEdit, it would be better to use as «class ut16», or better still as «class utf8».

If you’re using a script to read a file that’s been saved by an application, you’ll have to use whatever form of text the application’s set up to save, e.g. either «class ut16» or «class utf8».
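
For the case in the original question (reading one paragraph of an application-saved file and copying it), a minimal sketch might look like the following. It assumes that Chinese.txt sits on the desktop and that the application saved it as UTF-8; if it was saved as UTF-16 instead, «class ut16» would take the place of «class utf8». The variable names are only illustrative.

```applescript
-- Assumption: Chinese.txt is on the desktop and was saved as UTF-8.
set chinesePath to (path to desktop as text) & "Chinese.txt"

-- Tell 'read' explicitly which encoding to expect.
set fileText to (read file chinesePath as «class utf8»)

-- Copy the first paragraph (the first line of the file) to the clipboard.
set firstQuote to paragraph 1 of fileText
set the clipboard to firstQuote

return firstQuote
```

Because the whole file is read in one go, no ‘open for access’ or ‘close access’ is needed here.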

Joy · July 5, 2018, 10:51am

Ah! I got it! It was just a display issue… Quick Look and the clipboard aren’t able to display the glyphs correctly, but the pasted text is fine. ^^
Resolved

Joy · July 6, 2018, 3:51pm

It’s curious to notice that simple text documents written as Unicode text (e.g. Example.txt) give a different output when they’re read with a plain: read file
So if, for example, we have “sunflower” in the first paragraph of Example.txt, the read command outputs the paragraph not as a word, but as: s u n f l o w e r
I get characters, not words.

Yvan_Koenig · July 6, 2018, 5:38pm

I applied what Nigel wrote:

set ChineseText to "sunflower" & linefeed & "下午好，让我们一起度过一个愉快的下午茶时光？"

set fileHandle to (open for access file ((path to desktop as text) & "Chinese.txt") with write permission)
try
	set eof fileHandle to 0
	write ChineseText as «class utf8» to fileHandle
	set checkText to (read fileHandle from 1 as «class utf8»)
end try
close access fileHandle

return checkText

and it behaves flawlessly.
It worked OK too using «class ut16».

When I tried with Unicode text and opened the file in TextEdit, I got:

sunflower
NSHY}ˇã©bNÏNçw^¶è«NN*a _ÎvÑNSHÉ6eˆQIˇ

Yvan KOENIG running High Sierra 10.13.5 in French (VALLAURIS, France) vendredi 6 juillet 2018 19:38:13

Nigel_Garvey · July 6, 2018, 6:27pm

If you’re using ‘read’ to read a file that’s been saved as UTF-16 Unicode text, you have to use the ‘as’ parameter to tell ‘read’ what kind of text to expect. If you don’t, it’ll assume it’s reading MacRoman text and treat each byte as a separate character. (I don’t know why you’re getting spaces instead of invisible (character id 0)'s. Are you doing something else in the script involving TIDs?)

Normally, you’d use something like …

read (choose file) as Unicode text

… to read a file containing big-endian UTF-16 text, or …

read (choose file) as «class ut16»

… to read a file containing little-endian UTF-16. But if the UTF-16 has been saved with a Byte-Order Mark (BOM) — as happens when you save as UTF-16 from TextEdit — and if the read is from the beginning of the file, you can use either of the above because the BOM will tell them whether the text is big-endian or little-endian.
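
The byte-per-character effect described above can be checked with a small experiment. This is only a sketch: the scratch file name UTF16Demo.txt and the variable names are made up for the illustration, and it assumes that ‘write … as Unicode text’ produces big-endian UTF-16 without a BOM.

```applescript
-- Hypothetical scratch file in the temporary items folder.
set demoPath to (path to temporary items as text) & "UTF16Demo.txt"

set fileHandle to (open for access file demoPath with write permission)
try
	set eof fileHandle to 0
	-- Written as UTF-16: two bytes for each of the nine letters.
	write "sunflower" as Unicode text to fileHandle
	-- No 'as' parameter: every byte comes back as a separate character.
	set rawText to (read fileHandle from 1)
	-- With the 'as' parameter: the byte pairs are decoded into letters.
	set decodedText to (read fileHandle from 1 as Unicode text)
on error errMsg
	close access fileHandle
	error errMsg
end try
close access fileHandle

return {count rawText, count decodedText}
```

If ‘read’ behaves as described, the first count should be twice the second, the extra characters being the zero bytes that sit between the letters.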

Yvan_Koenig · July 6, 2018, 7:10pm

Hello Nigel

I’ve already had those spaces in the past.
It was when I tried to read “as utf8” a text saved as “ut16”.
It’s due to the fact that in this case, “text” characters are stored as (for instance) 00 78.
When the text is read as “utf8”, TextEdit treats the 00 as a blank character.

Yvan KOENIG running High Sierra 10.13.5 in French (VALLAURIS, France) vendredi 6 juillet 2018 21:09:55

Nigel_Garvey · July 6, 2018, 9:24pm

Thanks, Yvan. I haven’t been able to reproduce that, but I wouldn’t be surprised if it does happen sometimes when the combination of characters is right.

I forgot to mention that when writing as «class ut16», a BOM is included anyway. This may occasionally be useful, but only when writing from the beginning of the file.