How to Read CSV File and Parse into AppleScript List

JMichaelTX · December 11, 2015, 7:33am

As I had a recent need for this functionality, I did a fairly extensive Internet search, and could not find exactly what I needed. I found lots and lots of solutions, lots bits and pieces, and a few very complex solutions.

So, based on everything I read, I have built my own script. Certainly the idea is not new, and many have contributed long before me. I apologize in advance if I happen to be using parts of code that you have posted somewhere on the Internet.

For the benefit of others like me, I’m posting my compiled solution. It has no error checking (which it needs) and is a very SIMPLE approach that does NOT try to adhere to the RFC 4180 de facto standard.

For a more robust, extensive process, see:
CSV-to-list converter by Nigel Garvey, 2010-03-13
http://macscripter.net/viewtopic.php?pid=125444#p125444

For AppleScript newbies, it puts together the CSV file read, and the parsing of the fields/columns.
It meets my needs, but I’m sure it will not meet everyone’s needs.

If you see any issues, or have suggestions for improvement, please post here.


(*
====================================================
	[CSV]  How to Read CSV File into AppleScript List
====================================================

PURPOSE:
	¢ Provide a simple process to read a CSV file, and parse each line/row into a AS list
		for further use as needed by the programmer.
	¢ This is a tool/example for the AppleScript programmer, not the end-user of a script
	¢ See limitations below
	
DATE:   	Fri, Dec 11, 2015			VER: 1.0
AUTHOR: JMichaelTX (on MacScripter.net and StackOverflow.com forums)

LIMITATIONS:
	¢ I don't believe the parsing of the CSV data here conforms to the RFC 4180 de facto standard
	¢ It doesn't allow for clear separation of strings from numbers
	¢ If you need a more full-featured, compliant CSV parser, then see the below ref.

REF:    	For a more robust, extensive process, see:
			CSV-to-list converter by Nigel Garvey, 2010-03-13
			http://macscripter.net/viewtopic.php?pid=125444#p125444
			
SAMPLE INPUT DATA (from CSV file):
	Parent,Tag,Num Notes			<=== first line/row is column titles
	!SYMBOLS,SYM.ES,TBD
	ZIP_List,ZIP.77077,TBD
	FINANCE,FIN.Call,TBD
	Evernote,EN.UI,TBD
	HISTORY,HiS.NatlSec,TBD
	. . .

SAMPLE OUTPUT DATA (log)
	(*Number of Rows: 102*)
	(*!SYMBOLS, SYM.ES, TBD*)
	(*ZIP_List, ZIP.77077, TBD*)
	(*FINANCE, FIN.Call, TBD*)
	(*Evernote, EN.UI, TBD*)
	(*HISTORY, HiS.NatlSec, TBD*)
	. . .
====================================================
*)
--- GET THE CSV FILE YOU WANT TO READ ---
set pathInputFile to (choose file with prompt "Select the CSV file" of type "csv")

--- READ THE FILE CONTENTS ---
set strFileContents to read pathInputFile

--- BREAK THE FILE INTO PARAGRAPHS (i.e., ROWS or LINES) ---
--		(AS Paragraphs are separated by LF or CR)

set parFileContents to (paragraphs of strFileContents)
set numRows to count of parFileContents
log "Number of Rows: " & numRows
--log parFileContents

--- PROCESS EACH PARAGRAPH (AKA LINE or ROW) OF INPUT FILE ---

--“““““““““““““““““““““““““““““““““““““““““““““““““““““““““““
repeat with iPar from 2 to number of items in parFileContents
	--““““““““““““““““““““““““““““““““““““““““““““““““““““““““
	--		Skip first row since it has column titles, data starts in 2nd row
	
	set lstRow to item iPar of parFileContents
	if lstRow = "" then exit repeat -- EXIT Loop if Row is empty, like the last line
	
	set lstFieldsinRow to parseCSV(lstRow as text)
	
	--- THE FOLLOWING LINES ARE SPECIFIC TO YOUR FILE/DATA ---
	
	set strParTag to item 1 of lstFieldsinRow -- COL 1 of CSV file
	set strTag to item 2 of lstFieldsinRow -- COL 2 of CSV file
	set numNotes to item 3 of lstFieldsinRow -- COL 3 of CSV file
	
	log lstFieldsinRow
	--log "[" & (iPar - 1) & "]: " & strTag
	--“““““““““““““““““““““““““““““““““““““““““““““““““““
	
	--““““““““““““““““““	
end repeat -- with iPar
--“““““““““““““““““““““““““““““““““““““““““““““““““““““““““

--=============== END OF MAIN SCRIPT ==============

on parseCSV(pstrRowText)
	set {od, my text item delimiters} to {my text item delimiters, ","}
	set parsedText to text items of pstrRowText
	set my text item delimiters to od
	return parsedText
end parseCSV

alastor933 · December 11, 2015, 8:43am

This routine will convert csv data to a list of lists: each sublist contains one line of the input file, each sublist item is a field from the input.
I’ve used it for years to deal with the transactions files from my bank.

ccstone · December 13, 2015, 7:35pm

I think Shane has posted an ASObjC method for handling CSV data that intelligently handles commas within quoted field values.

But for simple field separation into a list of lists I’d use the Satimage.osax. It boils down to a couple of lines of code.


set cvsData to text 2 thru -1 of "
Parent,Tag,Num Notes
!SYMBOLS,SYM.ES,TBD
ZIP_List,ZIP.77077,TBD
FINANCE,FIN.Call,TBD
Evernote,EN.UI,TBD
HISTORY,HiS.NatlSec,TBD
"

# Grab each line containing non-whitespace characters into a list:
set cvsDataList to find text "^.*\\S.*" in cvsData with regexp, all occurrences and string result

# Split each list item into fields creating a list of lists:
set cvsDataList to find text "[^,]+" in cvsDataList with regexp, all occurrences and string result

–
Chris

{ MacBookPro6,1 · 2.66 GHz Intel Core i7 · 8GB RAM · OSX 10.11.2 }
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

JMichaelTX · December 14, 2015, 6:23am

Thanks Chris. As always I appreciate your excellent solutions.

For the benefit of all of us (especially me), would you mind commenting on the advantages of your solution vs the one I posted?

Always looking to learn.

Best Regards,
JMichaelTX

DJ_Bazzie_Wazzie · January 11, 2016, 5:09pm

Since we’re sharing osaxen solutions here: I have finished the AppleScript Toolbox 2.0 osax today and released I waited with an reply how you can use that osax as well for both quoted and unquoted fields to parse CSV data easily.

When you have simple fields as in ccstone’s example with no quoted fields:

set csvData to "Parent,Tag,Num Notes
!SYMBOLS,SYM.ES,TBD
ZIP_List,ZIP.77077,TBD
FINANCE,FIN.Call,TBD
Evernote,EN.UI,TBD
HISTORY,HiS.NatlSec,TBD"


set csvDataList to AST find regex "(?:" & linefeed & "?)([^" & linefeed & "]*)" in string csvData regex group 2
set csvDataList to AST find regex "(?:,?)([^,]*)" in string csvDataList regex group 2

When you have mixed quoted and unquoted fields:

set csvData to "Parent,Tag,Num Notes
!SYMBOLS,SYM.ES,TBD,\"Just an example
with multiline comments, field separators
and \"\"escpaped\"\" characters\"
ZIP_List,ZIP.77077,TBD
FINANCE,FIN.Call,TBD
Evernote,EN.UI,TBD
HISTORY,HiS.NatlSec,TBD"

set csvDataList to AST find regex "([" & linefeed & "])?((?:[^" & linefeed & "\"]*(?:(?:\"([^\"]*|\"\")*\"))*)*)" in string csvData regex group 3
set csvDataList to AST find regex "(,)?((?:\"((?:[^\"]|\"\")*)\")|([^,]*))" in string csvDataList regex group 3

Note: the field itself is not unquoted.

dlaurentny · January 8, 2018, 3:05am

Thanks much!! Creating CSS/HTML Timelines and using script to read a CSV file and then create the HTML code. Used your code almost verbatim


-- Thank you AUTHOR: JMichaelTX 2015 (on MacScripter.net and StackOverflow.com forums) for sample scripts used here --

set strOutput to "" as string

-- Choose and read contents of CSV file

set pathInputFile to (choose file with prompt "Select the CSV file" of type "csv")
set strFileContents to read pathInputFile

set parFileContents to (paragraphs of strFileContents)
set numRows to count of parFileContents
log "Number of Rows: " & numRows

-- Parse CSV Row into 'Columns'

repeat with iPar from 2 to number of items in parFileContents -- set to 1 if no header row
	set lstRow to item iPar of parFileContents
	if lstRow = "" then exit repeat -- EXIT Loop if Row is empty, like the last line
	
	set lstFieldsinRow to parseCSV(lstRow as text)
	
	set strYear to item 1 of lstFieldsinRow -- COL 1 of CSV file
	set strTitle to item 2 of lstFieldsinRow -- COL 2 of CSV file
	set strDesc to item 3 of lstFieldsinRow -- COL 3 of CSV file
	
	log lstFieldsinRow
	--log "[" & (iPar - 1) & "]: " & strTag
	
	
	-- Write content of Row to strOutput
	-- Use [\"] for Quotes inside output string
	
	set strOutput to strOutput & "<div class=\"column\">" & return & "<div class=\"title\">" & return & "<h1> " & strYear & "</h1>" & return & "<h2> " & strTitle & "</h2>" & return & "</div>" & return & "<div class=\"description\">" & return & "<p>" & strDesc & "</p>" & return & "</div>" & return & "</div>" & return & return
	
	
	log strOutput
	
	-- Loop to EOF or Exit
end repeat -- with iPar


-- Set Output File
set outputFile to ((path to desktop as text) & "TimeLineOutput_html.txt")

-- Write Body to File
try
	set fileReference to open for access file outputFile with write permission
	write strOutput to fileReference
	close access fileReference
on error
	try
		close access file outputFile
	end try
end try


-- Functions

on parseCSV(pstrRowText)
	set {od, my text item delimiters} to {my text item delimiters, ","}
	set parsedText to text items of pstrRowText
	set my text item delimiters to od
	return parsedText
end parseCSV



-- On errors --

Nigel_Garvey · January 24, 2018, 10:18pm

I’ve now given the script in that other post a cosmetic overhaul and have managed to speed it up very slightly.

Looking at the various solutions offered here and in the other thread:

• JMichaelTX’s and ccstone’s scripts, and DJ’s first, are obviously special-case jobs and have no provision either for quoted fields or for different field separators.

• DJ’s second script captures quoted fields but, as he noted, it doesn’t remove the quoting or unescape escaped quotes. Also, it returns too few fields for records whose first field is empty and unquoted and includes an extra “record” list if there’s a line break after the last record.

• The scripts linked to by alastor933 are by the late Kai Edwards and are typically brilliant, fast, and terse almost to the point of unintelligibility. Both take care of quoted fields, removing the quoting and unescaping any escaped double-quotes. The first script can easily be adapted to take an alternative field separator as a parameter, while the second has a go at guessing for itself whether the separator’s a comma or a semicolon. The first script actually stopped working in Leopard, which was released a year after the scripts were posted, but now works as before. It relies on the dubious technique of substituting low-ASCII codes for the CSV formatting characters and hoping they don’t occur in the text already. (Leopard’s TID system couldn’t tell low-ASCII codes apart, which made the script return rubbish.) The second script invents its own unique strings using letters and sequences of colons (genius! ), but it needs a minor rewrite to work on modern systems because a trick it has for getting round a problem which existed in Tiger runs foul of the way text items are returned now. On any system, when coaxed into working, both scripts return an extra “record” if the text ends with a line break and they mistake empty quoted fields for escaped double-quotes.

• In the same thread as my script, Yvan suggested Shane’s BridgePlus library, which offers an arrayFromCSV:commaIs: method. This is very fast indeed and allows the field separator character to be specified. However, it omits any spaces at the beginning of unquoted fields (except in the first field of a record) and returns too few fields for records whose last field is either quoted and empty or unquoted and either empty or white space.

My script ticks all the boxes I think should be ticked and takes optional parameters which allow a separator other than a comma to be specified and non-quoted fields to be trimmed if required. It’s entirely vanilla, not slow, and is known to work correctly in Tiger (10.4.11), Leopard (10.5.8), El Capitan (10.11.6), and High Sierra (10.13.3). However, here’s a slightly faster script which has mine’s virtues and uses Kai’s “unique string” technique. The problem with quoted empty fields is overcome by using a regex to identify enclosing quotes before escaped quotes and so the script requires a fairly recent version of the Satimage OSAX:

(* Assumes that the CSV text follows the RFC 4180 convention:
	Records are delimited by CRLF line breaks (but LFs or CRs are OK too with this script).
	The last record in the text may or may not be followed by a line break.
	Fields in the same record are separated by commas (but a different separator can be specified here with an optional parameter).
	The last field in a record is NOT followed by a field separator. Each record has (number of separators + 1) fields, even when these are empty.
	All the records should have the same number of fields (but this script just renders what it finds).
	Any field value may be enquoted with double-quotes and should be if the value contains line breaks, separator characters, or double-quotes.
	Double-quote pairs within quoted fields represent escaped double-quotes.
	Trailing or leading spaces in unquoted fields are part of the field values (but trimming can specified here with an optional parameter).
	By implication, spaces (or anything else!) outside the quotes of quoted fields are not allowed.
		
	No other variations are currently supported. *)

(* REQUIRES THE SATIMAGE OSAX in a couple of places, but otherwise uses TIDs as these are faster. *)

on csvToList(csvText, implementation)
	-- The 'implementation' parameter is a record with optional properties specifying the field separator character and/or trimming state. The defaults are: {separator:",", trimming:false}.
	set {separator:separator, trimming:trimming} to (implementation & {separator:",", trimming:false})
	
	considering case
		-- Create unique strings to substitute for quotes, linebreaks, and separators in the text. (Developed from an idea in a script by Kai Edwards.)
		-- Get a character which isn't the separator, a double-quote, a linefeed, a return, or a known regex operator. (This is the development!)
		set nonSeparator to "≈"
		if (nonSeparator is separator) then set nonSeparator to "§"
		-- Get a sequence of that character which doesn't occur in the text.
		set uniqueSequence to nonSeparator
		repeat while (csvText contains uniqueSequence)
			set uniqueSequence to uniqueSequence & nonSeparator
		end repeat
		-- Derive the unique strings.
		set quoteProxy to "q" & uniqueSequence
		set lineBreakProxy to "b" & uniqueSequence
		set separatorProxy to "s" & uniqueSequence
		
		script o -- For fast list access. Only one, reusable property is actually needed, but the script goes faster with two!
			property textBlocks : missing value
			property finalResult : missing value
		end script
		
		set astid to AppleScript's text item delimiters
		
		-- Replace the enclosing quotes of any quoted fields (including empty ones) with the quoteProxy character defined above.
		set doctoredCSVtext to (change " *+\"((?:[^\"]|\"\")*+)\" *+" into quoteProxy & "\\1" & quoteProxy in csvText with regexp) -- Satimage.
		-- Replace any double-quote pairs left in the text with double-quote singletons.
		set AppleScript's text item delimiters to "\"\""
		set textItems to doctoredCSVtext's text items
		set AppleScript's text item delimiters to quote
		set doctoredCSVtext to textItems as text
		
		-- Split the text at the quote proxies, if any.
		set AppleScript's text item delimiters to quoteProxy
		set o's textBlocks to doctoredCSVtext's text items
		-- o's textBlocks is a list of the CSV text's text items after delimitation with the double-quote proxy character.
		-- Assuming the convention described at top of this script, the number of blocks is always odd.
		-- Even-numbered blocks, if any, are the unquoted contents of quoted fields and don't need parsing.
		-- Odd-numbered blocks are everything else.
		
		repeat with i from 1 to (count o's textBlocks) by 2
			-- Replace whatever line endings there are in this block with the lineBreakProxy character defined above.
			set AppleScript's text item delimiters to lineBreakProxy
			set thisBlock to (paragraphs of item i of o's textBlocks) as text
			-- Replace all instances of the field separator in this block with the separatorProxy character defined above.
			set AppleScript's text item delimiters to separator
			set textItems to text items of thisBlock
			set AppleScript's text item delimiters to separatorProxy
			set item i of o's textBlocks to textItems as text
		end repeat
		-- Lose any trailing line break proxy from the last block.
		set lastBlock to end of o's textBlocks
		if (lastBlock ends with lineBreakProxy) then
			if (lastBlock is lineBreakProxy) then
				set item -1 of o's textBlocks to ""
			else
				set item -1 of o's textBlocks to text 1 thru (-1 - (count lineBreakProxy)) of lastBlock
			end if
		end if
		
		-- Coerce the blocks back to a single text, with further doctoring if trimming.
		if (trimming) then
			-- Reinstate any quote proxies.
			set AppleScript's text item delimiters to quoteProxy
			set doctoredCSVtext to o's textBlocks as text
			-- Lose any spaces or quote proxies immediately adjacent to separator or line break proxies.
			set doctoredCSVtext to (change "(?:" & quoteProxy & "| ++)?(" & (separatorProxy & "|" & lineBreakProxy) & ")(?: ++|" & quoteProxy & ")?" into "\\1" in doctoredCSVtext with regexp) -- Satimage.
		else
			-- Coerce to text without quote proxies.
			set AppleScript's text item delimiters to ""
			set doctoredCSVtext to o's textBlocks as text
		end if
		
		-- Break the text at the line break proxies for a list of the record texts.
		set AppleScript's text item delimiters to lineBreakProxy
		set o's finalResult to doctoredCSVtext's text items
		-- Replace each record text with a list of its individual field values.
		set AppleScript's text item delimiters to separatorProxy
		repeat with i from 1 to (count o's finalResult)
			set item i of o's finalResult to text items of item i of o's finalResult
		end repeat
		
		set AppleScript's text item delimiters to astid
	end considering
	
	return o's finalResult
end csvToList

-- Demos:
set csvText to "caiv2 , 2010BBDGRC,\"\"\"President\"\", \"\"Board of Directors\"\"\"" & linefeed & ",\"\"," & linefeed & " , , " & linefeed & "Another line, for demo purposes , " & linefeed & ",," & linefeed
csvToList(csvText, {})
--> {{"caiv2 ", " 2010BBDGRC", "\"President\", \"Board of Directors\""}, {"", "", ""}, {" ", " ", " "}, {"Another line", " for demo purposes ", " "}, {"", "", ""}}
csvToList(csvText, {trimming:true})
--> {{"caiv2", "2010BBDGRC", "\"President\", \"Board of Directors\""}, {"", "", ""}, {"", "", ""}, {"Another line", "for demo purposes", ""}, {"", "", ""}}

set csvText to "caiv2 ; 2010BBDGRC;\"\"\"President\"\"; \"\"Board of Directors\"\"\"" & linefeed & ";\"\";" & linefeed & " ; ; " & linefeed & "Another line; for demo purposes ; " & linefeed & ";;" & linefeed
csvToList(csvText, {separator:";"})
--> {{"caiv2 ", " 2010BBDGRC", "\"President\"; \"Board of Directors\""}, {"", "", ""}, {" ", " ", " "}, {"Another line", " for demo purposes ", " "}, {"", "", ""}}
csvToList(csvText, {separator:";", trimming:true})
--> {{"caiv2", "2010BBDGRC", "\"President\"; \"Board of Directors\""}, {"", "", ""}, {"", "", ""}, {"Another line", "for demo purposes", ""}, {"", "", ""}}

DJ_Bazzie_Wazzie · January 25, 2018, 4:25pm

To summarize the whole problem with all script solutions can simply be explained by the fact that CSV, unlike XML for example, is designed to process as a stream. It’s not only the fastest way to process data it also requires almost no memory and leaves no memory footprint. Unfortunately AppleScript is not designed to work with streams so the ideal solution is always a balance between being fast or the most correct implementation. With the most correct implementation I mean when the paradigm behind the file structure and the design of the software are the same.

I’ve written an CSV read/write implementation in one of my osaxen for xls(x) editing. It reads and writes at maximum speed of your storage device (400MB/s on my machine) and uses no memory. Maybe it’s time to move this over to AST where the CSV data is presented as a 2D list instead (the AS list itself will need the necessary memory of course). It supports variable text indicators, field delimiters and line breaks and most importantly it works according to the CSV standard, so quoting and unquoting in the same field isn’t a problem either. The processor doesn’t predict any characters or structures like Regex solutions do but changes state when certain characters are read.