Convert XLS files into CSV preserving UTF-8 encoding

Oh yes! It does work! :smiley:

However it still breaks the column at the comma
 :slight_smile:

Hello.

As I suggested in a post above, you should really use semi colon, or colon as a field separator when you are using a “,” as the decimal separator.

I have changed the script to use semi-colon!

tell application "Microsoft Excel"
	activate
	
	set outFile to (path of active workbook)
	set fileName to (name of active workbook)
	set outFile to (outFile & ":" & fileName & ".csv")
	try
		set openFile to open for access file outFile with write permission
		set eof openFile to 0
		
		set lastCol to count of columns of used range of active sheet
		set lastRow to count of rows of used range of active sheet
		
		repeat with r from 1 to lastRow
			set cellVal to (value of cell r of column 1 of active sheet) as text
			if cellVal is not "" then
				if text 1 of cellVal as text is not ";" then
					set rowStr to cellVal
				else
					set rowStr to "0" & cellVal
				end if
			else
				set rowStr to ""
			end if
			repeat with c from 2 to lastCol
				set cellVal to (value of cell r of column c of active sheet) as text
				if cellVal is not "" then
					if text 1 of cellVal is not ";" then
						set rowStr to rowStr & ";" & cellVal
					else
						set rowStr to rowStr & ";" & "0" & cellVal
					end if
				else
					set rowStr to rowStr & ";"
				end if
				
			end repeat
			write rowStr & return to openFile as «class utf8»
		end repeat
		close access openFile
	on error e number n
		
		-- Chris Stone
		set {cr, sep} to {return, "------------------------------------------"}
		set errmsg to sep & cr & "Error: " & e & cr & sep & cr & "Error 
		Number: " & n & cr & sep
		tell application "SystemUIServer"
			activate
			display dialog errmsg
		end tell
		set fail to true
	end try
end tell

Hello

Here is an edited version taking care of the local delimiter.


character 2 of (0.5 as text)
if result is "," then
	set delim to ";"
else
	set delim to ","
end if

tell application "Microsoft Excel"
	activate
	
	set outFile to (path of active workbook)
	set fileName to (name of active workbook)
	set outFile to (outFile & ":" & fileName & ".csv")
	try
		set openFile to open for access file outFile with write permission
		set eof openFile to 0
		
		set lastCol to count of columns of used range of active sheet
		set lastRow to count of rows of used range of active sheet
		
		repeat with r from 1 to lastRow
			set cellVal to (value of cell r of column 1 of active sheet) as text
			if cellVal is not "" then
				if text 1 of cellVal as text is not delim then
					set rowStr to cellVal
				else
					set rowStr to "0" & cellVal
				end if
			else
				set rowStr to ""
			end if
			repeat with c from 2 to lastCol
				set cellVal to (value of cell r of column c of active sheet) as text
				if cellVal is not "" then
					if text 1 of cellVal is not delim then
						set rowStr to rowStr & delim & cellVal
					else
						set rowStr to rowStr & delim & "0" & cellVal
					end if
				else
					set rowStr to rowStr & delim
				end if
				
			end repeat
			write rowStr & return to openFile as «class utf8»
		end repeat
		close access openFile
	on error e number n
		
		-- Chris Stone
		set {cr, sep} to {return, "------------------------------------------"}
		set errmsg to sep & cr & "Error: " & e & cr & sep & cr & "Error 
		Number: " & n & cr & sep
		tell application "SystemUIServer"
			activate
			display dialog errmsg
		end tell
		set fail to true
	end try
end tell


Yvan KOENIG (VALLAURIS, France) samedi 8 septembre 2012 16:58:47

Hmm, tried McUsr’s latest version and the columns are merged into one :smiley:

Tried Yvan Koenig’s version and same column break after comma happens :smiley:

Does it happen to you too using this file?

http://www.sendspace.com/file/hq55j9

Hello!

This ought to work.

I’ll try it on the file.

tell application "Microsoft Excel"
	activate
	
	set outFile to (path of active workbook)
	set fileName to (name of active workbook)
	set outFile to (outFile & ":" & fileName & ".csv")
	try
		set openFile to open for access file outFile with write permission
		set eof openFile to 0
		
		set lastCol to count of columns of used range of active sheet
		set lastRow to count of rows of used range of active sheet
		
		repeat with r from 1 to lastRow
			set cellVal to (value of cell r of column 1 of active sheet) as text
			
			set rowStr to cellVal
			repeat with c from 2 to lastCol
				set cellVal to (value of cell r of column c of active sheet) as text
				set rowStr to rowStr & ";" & cellVal
			end repeat
			write rowStr & return to openFile as «class utf8»
		end repeat
		close access openFile
	on error e number n
		
		-- Chris Stone
		set {cr, sep} to {return, "------------------------------------------"}
		set errmsg to sep & cr & "Error: " & e & cr & sep & cr & "Error 
		Number: " & n & cr & sep
		tell application "SystemUIServer"
			activate
			display dialog errmsg
		end tell
		set fail to true
	end try
end tell

It seems to work. But now you have to set FS=“;” when you use awk.

difference between excel 2008 and 2011 is opening:
2008 it opens as macroman
2011 import the csv into cell A1 and select the correct encoding, delimiter and column types so you’re sure all data is in your sheet.

When it shows wrong in excel it is probably the correct encoding, because utf-8 CSV files should be displaying special characters wrong. Set ‘opening character encoding’ of text edit’s preferences to utf-8 and open the CSV file there, if the characters are displayed correctly, the encoding is correct. Because you said that my script with CRLR (return followed with a linefeed) is the correct line terminator I assumed that the encoding you got is Windows. Not certain if it’s UTF-8 or not already but here a script where you can change the first two variable values to the correct encodings.

set fromEncoding to "CP1252"
set toEncoding to "UTF-8"

set iconvCommand to " | iconv -f " & fromEncoding & " -t " & toEncoding & space
set theFiles to choose file with multiple selections allowed
set fileName to text returned of (display dialog "Type file name:" default answer ".csv")
set saveFile to POSIX path of (((path to desktop) as text) & fileName)

do shell script "cat " & quoted form of POSIX path of item 1 of theFiles & iconvCommand & "> " & quoted form of saveFile
if (count of theFiles) = 1 then return
repeat with fileToAdd in theFiles
	set x to quoted form of POSIX path of fileToAdd
	do shell script "cat " & x & " | awk 'BEGIN {RS=\"\\r\\n\"; getline}{print $0}' " & iconvCommand & ">> " & quoted form of saveFile
end repeat

Now we assume that the encoding of the selected files is CP1252 (Windows latin 1) and that the output file, the merged CSV file, is UTF-8 encoded.

We’re working with CSV files and automate these actions for CMS systems of world leading WebShops and we know the drawbacks (read my post in previous posts). How about fields that needs text indicators (quoting), does the CMS support that as well? You don’t want that the lines shifts and that the wrong copy or price will be related to a product.

Hmm tried on the file and everything is merged into a column, and only breaks for the comma :smiley:

Does it happen to you?

No you won’t. print$0 prints the whole line without considering the FS at all

You may see what I get here thru :

https://www.box.com/s/qvve9u8zj0nhf9nfbeug

I’m puzzled by McUsr tests for the first value in a row.
I assume that he is trying to take care of a thousands separator but I’m not sure.

Yvan KOENIG (VALLAURIS, France) samedi 8 septembre 2012 17:14:50

DJ Bazzie Wazzie, I get

error "
iconv: (stdin):1:21: cannot convert" number 1

The fromEncoding should match the encoding of the file you selected. Then you won’t get an error, now it was on the first line character 21 that couldn’t be converter. probably a â€˜ĂƒÂ©â€™ character or something.

Hello

It fail if a cell contain a semi colon.
In such case we must enclose it with double quotes.


character 2 of (0.5 as text)
if result is "," then
	set delim to ";"
	# set sep1000 to character id 160
else
	set delim to ","
	# set sep1000 to ","
end if

tell application "Microsoft Excel"
	activate
	
	set outFile to (path of active workbook)
	set fileName to (name of active workbook)
	set outFile to (outFile & ":" & fileName & ".csv")
	try
		set openFile to open for access file outFile with write permission
		set eof openFile to 0
		
		set lastCol to count of columns of used range of active sheet
		set lastRow to count of rows of used range of active sheet
		
		repeat with r from 1 to lastRow
			set cellVal to (value of cell r of column 1 of active sheet) as text
			if cellVal contains delim then set cellVal to quote & cellVal & quote # ADDED
			set rowStr to cellVal
			repeat with c from 2 to lastCol
				set cellVal to (value of cell r of column c of active sheet) as text
				if cellVal contains delim then set cellVal to quote & cellVal & quote # ADDED
				set rowStr to rowStr & delim & cellVal
			end repeat
			write rowStr & return to openFile as «class utf8»
		end repeat
		close access openFile
	on error e number n
		
		-- Chris Stone
		set {cr, sep} to {return, "------------------------------------------"}
		set errmsg to sep & cr & "Error: " & e & cr & sep & cr & "Error 
		Number: " & n & cr & sep
		tell application "SystemUIServer"
			activate
			display dialog errmsg
		end tell
		set fail to true
	end try
end tell

Yvan KOENIG (VALLAURIS, France) samedi 8 septembre 2012 17:29:53

Are your sure? because I asked TS if the CMS supports it or not and still no answer

Hello.

The Op don’t have to use awk at all if the purpose is merly to merge the files without the column heading.

He can do like this.

cat file1 >result.csv for i in file2 file3 file4 file5 file6 ; do cat $i |sed 1d >>result.csv done
from a terminal prompt.

I test always before writing.
I’m accustomed to the semi-colon delimiter because it’s the one in use here in France.
In his late code, McUsr use the semi-colon as value delimiter which matches the asker’s settings.
I inserted a semi-colon in a cell of the excel file and for sure, in the exported csv cells are moved one column to the right.

With this source contents :

type segmentCode geoCode formatCode languageCode keyPath description isPublished isOneOff value
Text all emea common cs pub.promo.parallelsdesktop8.september2012.headline Parallels Desktop 8 promo row VRAI FAUX Parallels Desktop 8
0,23 all emea common cs pub.promo.parallelsdesktop8.september2012.paragraph Parallels Desktop 8 promo row desc VRAI FAUX PouĂ…ÂŸĂƒÂ­vejte Windows na Macu.
;aa all emea common pub.promo.parallelsdesktop8edu.september2012.headline Parallels Desktop 8 - Education Edition promo row VRAI FAUX Parallels Desktop 8 “ Education Edition
Text all emea common cs pub.promo.parallelsdesktop8edu.september2012.paragraph Parallels Desktop 8 - Education Edition promo row desc VRAI FAUX PouĂ…ÂŸĂƒÂ­vejte Windows na Macu.

The exported contents is :

type segmentCode geoCode formatCode languageCode keyPath description isPublished isOneOff value
Text all emea common cs pub.promo.parallelsdesktop8.september2012.headline Parallels Desktop 8 promo row true false Parallels Desktop 8
0,23 all emea common cs pub.promo.parallelsdesktop8.september2012.paragraph Parallels Desktop 8 promo row desc true false PouĂ…ÂŸĂƒÂ­vejte Windows na Macu.
aa all emea common pub.promo.parallelsdesktop8edu.september2012.headline Parallels Desktop 8 - Education Edition promo row true false Parallels Desktop 8 “ Education Edition
Text all emea common cs pub.promo.parallelsdesktop8edu.september2012.paragraph Parallels Desktop 8 - Education Edition promo row desc true false PouĂ…ÂŸĂƒÂ­vejte Windows na Macu.

Yvan KOENIG (VALLAURIS, France) samedi 8 septembre 2012 17:37:04

That is correct Yvan.

I didn’t try to put a semicolon into a cell, as a start value. If it breaks for the op, then I’ll go back and fix it. :slight_smile:

It’s fixed in the edited version which I posted.

My scheme is OK with comma delimiter or with semi-colon one.

I’m always puzzled by what you applied to the first cell of a row in your “long” script :


 if cellVal is not "" then
               if text 1 of cellVal as text is not "," then
                   set rowStr to cellVal
               else
                   set rowStr to "0" & cellVal
               end if
           else
               set rowStr to ""
           end if

Yvan KOENIG (VALLAURIS, France) samedi 8 septembre 2012 17:46:45

Hello. I did that, as often the cell in row 1 col 1 in my tables are empty. So I figured it could be something else there as well, as numbers formatted like ,1 for instance.

Here is my edited version that escapes “;”, while I write an applescript to merge his files.

tell application "Microsoft Excel"
	activate
	
	set outFile to (path of active workbook)
	set fileName to (name of active workbook)
	set outFile to (outFile & ":" & fileName & ".csv")
	try
		set openFile to open for access file outFile with write permission
		set eof openFile to 0
		
		set lastCol to count of columns of used range of active sheet
		set lastRow to count of rows of used range of active sheet
		
		repeat with R from 1 to lastRow
			set cellVal to (value of cell R of column 1 of active sheet) as text
			if cellVal contains ";" then
				set cellVal to "\"" & cellVal & "\""
			end if
			set rowStr to cellVal
			repeat with c from 2 to lastCol
				set cellVal to (value of cell R of column c of active sheet) as text
				if cellVal contains ";" then
					set cellVal to "\"" & cellVal & "\""
				end if
				
				set rowStr to rowStr & ";" & cellVal
			end repeat
			write rowStr & return to openFile as «class utf8»
		end repeat
		close access openFile
	on error e number n
		
		-- Chris Stone
		set {cr, sep} to {return, "------------------------------------------"}
		set errmsg to sep & cr & "Error: " & e & cr & sep & cr & "Error 
		Number: " & n & cr & sep
		tell application "SystemUIServer"
			activate
			display dialog errmsg
		end tell
	end try
end tell


1 Like

I apologize but I disagree with this code.

if a cell contains com;mon,
when we open the csv created by your script, we get the value common.

It’s not the semi-colon which must be enclosed with quotes, it’s the value itself.


character 2 of (0.5 as text)
if result is "," then
	set delim to ";"
	# set sep1000 to character id 160
else
	set delim to ","
	# set sep1000 to ","
end if

tell application "Microsoft Excel"
	activate
	
	set outFile to (path of active workbook)
	set fileName to (name of active workbook)
	set outFile to (outFile & ":" & fileName & ".csv")
	try
		set openFile to open for access file outFile with write permission
		set eof openFile to 0
		
		set lastCol to count of columns of used range of active sheet
		set lastRow to count of rows of used range of active sheet
		
		repeat with r from 1 to lastRow
			set cellVal to (value of cell r of column 1 of active sheet) as text
			if cellVal contains delim then set cellVal to quote & cellVal & quote # ADDED
			set rowStr to cellVal
			repeat with c from 2 to lastCol
				set cellVal to (value of cell r of column c of active sheet) as text
				if cellVal contains delim then set cellVal to quote & cellVal & quote # ADDED
				set rowStr to rowStr & delim & cellVal
			end repeat
			write rowStr & return to openFile as «class utf8»
		end repeat
		close access openFile
	on error e number n
		
		-- Chris Stone
		set {cr, sep} to {return, "------------------------------------------"}
		set errmsg to sep & cr & "Error: " & e & cr & sep & cr & "Error 
		Number: " & n & cr & sep
		tell application "SystemUIServer"
			activate
			display dialog errmsg
		end tell
		set fail to true
	end try
end tell

behaves flawlessly.

Yvan KOENIG (VALLAURIS, France) samedi 8 septembre 2012 18:19:19

1 Like