AS to save ".txt" file from unicode (utf-8) to unicode (utf-16) format

johnmathew · May 18, 2015, 10:30pm

Hi Guys,

I have a folder containing a bunch of “.txt” files. Some are in UTF-8 format and some are in UTF-16 format. I need an AppleScript that finds the UTF-8 files and converts them to UTF-16.

I got the script below from this forum, but it saves the result as a new file with “_iconv” added to the file name. I want to change the format of the original file itself.

set thefile to POSIX path of (choose file)
set newFileName to (do shell script "str=" & quoted form of thefile & ";echo ${str%.*}") & "_iconv.txt"

do shell script "xxd -p -r <<< xfeff > " & quoted form of newFileName
do shell script "iconv -f UTF-8 -t UTF-16BE " & space & quoted form of thefile & " >> " & quoted form of newFileName

Thanks in advance,
John

kel1 · May 19, 2015, 1:29am

Hi John,

I’m still looking into doing the replacement with the Unix ‘textutil’ tool. You might want to look at its man page.

Personally, I’d rather use utf8 as it is more compatible with Mac. But you may want to use utf16 for other reasons.

gl,
kel

DJ_Bazzie_Wazzie · May 19, 2015, 7:31am

UTF-8 is popular because it’s compact and backward compatible with single-byte characters (7-bit ASCII). For everything else, such as counting characters, UTF-16 is superior to UTF-8.

Shane_Stanley · May 19, 2015, 7:45am

And reading substrings…

I’m not sure why the OP wants big-endian, but otherwise I’d suggest:

on open fileList
	repeat with aFile in fileList
		try
			set theText to read aFile as «class utf8»
			-- if we got here, it's UTF-8, so write it as UTF-16
			write theText to aFile as Unicode text
		on error
			-- it's not UTF-8, so nothing to do
		end try
	end repeat
end open

Shane_Stanley · May 19, 2015, 8:03am

… and for big-endian and Yosemite:

use scripting additions
use framework "Foundation"

on open fileList
	repeat with aFile in fileList
		set thePath to POSIX path of aFile
		set theNSString to (current application's NSString's stringWithContentsOfFile:thePath encoding:(current application's NSUTF8StringEncoding) |error|:(missing value))
		if theNSString is not missing value then -- it was UTF-8
			(theNSString's writeToFile:thePath atomically:true encoding:(current application's NSUTF16BigEndianStringEncoding) |error|:(missing value))
		end if
	end repeat
end open
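
For readers who want to see the "try to read as UTF-8, and convert only if that succeeds" idea outside AppleScript, here is a minimal sketch in Python. The function name `convert_utf8_file_to_utf16be` is made up for illustration. The NUL-byte check is an added assumption, not something the scripts above do: a BOM-less UTF-16 file of plain ASCII text is also valid UTF-8, so a strict UTF-8 read alone would not reject it.

```python
import codecs


def convert_utf8_file_to_utf16be(path):
    """Rewrite `path` as UTF-16BE with a BOM if it currently holds valid UTF-8.

    Returns True if the file was converted, False if it was left alone.
    """
    with open(path, "rb") as f:
        raw = f.read()
    # Assumption: a NUL byte suggests the file is already UTF-16,
    # even though such bytes would still decode as valid UTF-8.
    if b"\x00" in raw:
        return False
    try:
        text = raw.decode("utf-8")  # strict decoding raises on invalid bytes
    except UnicodeDecodeError:
        return False  # not UTF-8, so nothing to do
    if text.startswith("\ufeff"):
        text = text[1:]  # drop a UTF-8 BOM so the output has only one BOM
    with open(path, "wb") as f:  # "wb" truncates the file before writing
        f.write(codecs.BOM_UTF16_BE + text.encode("utf-16-be"))
    return True
```

Running it a second time on the same file returns False, because the converted file no longer decodes as UTF-8.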

DJ_Bazzie_Wazzie · May 19, 2015, 8:35am

The endianness of a file works best if it matches the processor’s endianness (read: no byte swaps are required for each character). So I think the file is meant for a different architecture than his Mac.

edit: Fun fact, the PPC architecture is one of the few processor architectures that support switching endian modes on the fly. Reading a UTF-16BE or UTF-16LE file on PPC Macs should make no difference, while on Intel Macs it does make a difference.

Shane_Stanley · May 19, 2015, 9:39am

I guess so.

Interesting, if becoming a bit academic at this point.

McUsrII · May 19, 2015, 9:40am

There is also the fact that some software only deals with Unicode text, so you have to convert the text to Unicode before importing it. Such software often runs on several platforms, and probably takes this approach to keep the versions for the different platforms compatible with each other.

Nigel_Garvey · May 19, 2015, 11:22am

Shane Stanley:

I’m not sure why the OP wants big-endian, but otherwise I’d suggest:

on open fileList
	repeat with aFile in fileList
		try
			set theText to read aFile as «class utf8»
			-- if we got here, it's UTF-8, so write it as UTF-16
			write theText to aFile as Unicode text
		on error
			-- it's not UTF-8, so nothing to do
		end try
	end repeat
end open

You assume of course that the UTF-16 code will be the same length or longer than the UTF-8, which is probably a fairly safe bet.

The uncredited code the OP posted also writes a big-endian BOM to the file. (Two, in fact!) This version of yours writes a BOM too:

on open fileList
	repeat with aFile in fileList
		set fRef to (open for access aFile with write permission)
		try
			set txt to (read fRef as «class utf8»)
			-- if we got here, it's UTF-8, so write it as UTF-16, big-endian with a BOM.
			set eof fRef to 0
			write «data rdatFEFF» to fRef
			write txt to fRef as Unicode text
		on error
			-- It's not UTF-8, so it's probably UTF-16 already.
		end try
		close access fRef
	end repeat
end open
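
A note on the length assumption mentioned in this post: it holds for Latin text, but characters in the range U+0800 to U+FFFF (which includes most CJK text) take three bytes in UTF-8 and only two in UTF-16, so the converted data can be shorter than the original. That is why truncating the file first (the `set eof fRef to 0` line) matters. The small Python sketch below just compares encoded sizes; `encoded_sizes` is a hypothetical helper name.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for `text`, not counting any BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


# ASCII: 1 byte per character in UTF-8, 2 in UTF-16
latin = encoded_sizes("hello")
# CJK characters in the Basic Multilingual Plane: 3 bytes in UTF-8, 2 in UTF-16
cjk = encoded_sizes("日本語")

print(latin)  # (5, 10)
print(cjk)    # (9, 6)
```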

Shane_Stanley · May 19, 2015, 11:35am

Or living dangerously…

I was thinking as Unicode text did that automatically, but it’s actually as «class ut16» that inserts the BOM.

Nigel_Garvey · May 19, 2015, 1:49pm

And it’s apparently machine-native: little-endian on my MacBook Pro, big-endian on my G5.
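
To make the byte-order point concrete, here is a short Python illustration (not AppleScript, and only a sketch) of what the two UTF-16 byte orders and the BOM look like on disk for the single character "A". It uses only the standard `codecs` module.

```python
import codecs

text = "A"  # U+0041

big = text.encode("utf-16-be")     # high byte first
little = text.encode("utf-16-le")  # low byte first

# The BOM is the character U+FEFF written in the file's own byte order
big_with_bom = codecs.BOM_UTF16_BE + big
little_with_bom = codecs.BOM_UTF16_LE + little

# A reader given the generic "utf-16" label uses the BOM to pick the byte order
decoded_big = big_with_bom.decode("utf-16")
decoded_little = little_with_bom.decode("utf-16")

print(big, little)                   # b'\x00A' b'A\x00'
print(big_with_bom, little_with_bom) # b'\xfe\xff\x00A' b'\xff\xfeA\x00'
```

Both `decoded_big` and `decoded_little` come back as "A", which is the reason a BOM makes the file readable regardless of the machine that wrote it.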

johnmathew · May 19, 2015, 3:47pm

Hi Guys,

Sorry for the late reply.

Thanks for all your efforts. I’ll check it and let you know if I have any questions.

Thanks,
John

Yvan_Koenig · May 19, 2015, 4:33pm

A question related to the original one.

For years I have used this handler:


#===== Handler borrowed from Regulus6633 - http://macscripter.net/viewtopic.php?id=36861

on writeTo(targetFile, theData, dataType, apendData)
	-- targetFile is the path to the file you want to write
	-- theData is the data you want in the file.
	-- dataType is the data type of theData and it can be text, list, record etc.
	-- apendData is true to append theData to the end of the current contents of the file or false to overwrite it
	try
		set targetFile to targetFile as text
		set openFile to open for access file targetFile with write permission
		if not apendData then set eof of openFile to 0
		write theData to openFile starting at eof as dataType
		close access openFile
		return true
	on error errMsg number errNbr
		log "errNbr #" & errNbr & "    " & errMsg
		try
			close access file targetFile
		end try
		return false
	end try
end writeTo

#=====

A few days ago, I had to store data in which some Eastern characters may appear.

I tried to replace the handler with this one:



#===== 

on writeUTF8(_text, _file, apendData)
	try
		set fRef to open for access _file with write permission
		if not apendData then set eof of fRef to 0
		write _text to fRef as «class utf8»
		close access fRef
	on error e number n
		try
			close access fRef
		on error e number n
			error "Error in writeUTF8() handler!" & linefeed & linefeed & e
		end try
	end try
end writeUTF8

#=====

The optional append / don't append feature wasn't available in the code which I borrowed from the net.

It seems that there was a good reason. When I call the handler to write a block of data starting from eof 0, it works flawlessly, but if I try to append data to the end of what has already been written, I get an awful result. Most of the data is lost.
Is this "normal" behavior? Must I accept that appending UTF-8 data to the end of existing UTF-8 data can't be done?

Of course, I double-checked my code. If I just switch from writeUTF8 to writeTo, everything is stored, but of course the Eastern characters aren't stored correctly. In French I would say that I can't "avoir le beurre et l'argent du beurre (et encore moins la crémière)"

Yvan KOENIG (VALLAURIS, France) mardi 19 mai 2015 18:33:04

Nigel_Garvey · May 19, 2015, 5:07pm

Hi Yvan.

I’m not sure what you’re saying doesn’t work, but your second script has ‘starting at eof’ missing from the ‘write’ line, so the write’s starting at the beginning of the file.

Less importantly, the direct parameter of ‘set eof’ should be a file or open-file reference, not ‘of’ and a reference.

if not apendData then set eof fRef to 0
write _text to fRef as «class utf8» starting at eof
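
The same overwrite-versus-append distinction can be seen in other file APIs. The sketch below is a Python analogy only, not a description of how AppleScript's `write` is implemented: a file opened for update starts with its position at byte 0, so a write overwrites the beginning unless you first move to the end. `write_utf8` is a hypothetical name chosen to mirror the handler above.

```python
import os


def write_utf8(path, text, append):
    """Write `text` as UTF-8, either appending to or replacing the contents."""
    # "r+b" opens an existing file at byte 0 without truncating it
    mode = "r+b" if os.path.exists(path) else "w+b"
    with open(path, mode) as f:
        if append:
            f.seek(0, os.SEEK_END)  # the analogue of 'starting at eof'
        else:
            f.truncate(0)           # the analogue of 'set eof fRef to 0'
        f.write(text.encode("utf-8"))
```

Leaving out the `seek` in the append branch would reproduce the symptom described above: each new write would land on top of the data already at the start of the file.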

Yvan_Koenig · May 19, 2015, 5:20pm

Hello Nigel

Thank you. It seems that it’s time to replace my glasses.

Yvan KOENIG (VALLAURIS, France) mardi 19 mai 2015 19:19:57