I’ve now given the script in that other post a cosmetic overhaul and have managed to speed it up very slightly.
Looking at the various solutions offered here and in the other thread:
• JMichaelTX’s and ccstone’s scripts, and DJ’s first, are obviously special-case jobs and have no provision either for quoted fields or for different field separators.
• DJ’s second script captures quoted fields but, as he noted, it doesn’t remove the quoting or unescape escaped quotes. Also, it returns too few fields for records whose first field is empty and unquoted, and it includes an extra “record” list if there’s a line break after the last record.
• The scripts linked to by alastor933 are by the late Kai Edwards and are typically brilliant, fast, and terse almost to the point of unintelligibility. Both take care of quoted fields, removing the quoting and unescaping any escaped double-quotes. The first script can easily be adapted to take an alternative field separator as a parameter, while the second has a go at guessing for itself whether the separator’s a comma or a semicolon. The first script actually stopped working in Leopard, which was released a year after the scripts were posted, but now works as before. It relies on the dubious technique of substituting low-ASCII codes for the CSV formatting characters and hoping they don’t occur in the text already. (Leopard’s TID system couldn’t tell low-ASCII codes apart, which made the script return rubbish.) The second script invents its own unique strings using letters and sequences of colons (genius!), but it needs a minor rewrite to work on modern systems because a trick it has for getting round a problem which existed in Tiger runs foul of the way text items are returned now. On any system, when coaxed into working, both scripts return an extra “record” if the text ends with a line break, and they mistake empty quoted fields for escaped double-quotes.
• In the same thread as my script, Yvan suggested Shane’s BridgePlus library, which offers an arrayFromCSV:commaIs: method. This is very fast indeed and allows the field separator character to be specified. However, it omits any spaces at the beginning of unquoted fields (except in the first field of a record) and returns too few fields for records whose last field is either quoted and empty or unquoted and either empty or white space.
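The failure modes listed above can be gathered into a few probe inputs for checking any CSV parser. This is only a rough sketch: the variable name probes and the record labels csvText and expected are my own inventions for illustration, and the expected values assume the RFC 4180 reading that each record has (number of separators + 1) fields, with no trimming of unquoted fields.

```applescript
-- Hypothetical probe data; not taken from any of the scripts discussed.
-- Each probe pairs a CSV text with the list of records a conforming,
-- non-trimming parser would be expected to return for it.
set probes to {}
-- First field empty and unquoted.
set end of probes to {csvText:",b,c", expected:{{"", "b", "c"}}}
-- Last field quoted and empty.
set end of probes to {csvText:"a,b,\"\"", expected:{{"a", "b", ""}}}
-- Last field unquoted and empty, then unquoted white space.
set end of probes to {csvText:"a,b,", expected:{{"a", "b", ""}}}
set end of probes to {csvText:"a,b, ", expected:{{"a", "b", " "}}}
-- Leading spaces in unquoted fields are part of the values.
set end of probes to {csvText:" a, b", expected:{{" a", " b"}}}
-- A line break after the last record must not yield an extra "record".
set end of probes to {csvText:"a,b" & linefeed, expected:{{"a", "b"}}}
```

Each probe’s csvText can then be passed to whichever parser is under test and the result compared with the probe’s expected list.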
My script ticks all the boxes I think should be ticked and takes optional parameters which allow a separator other than a comma to be specified and non-quoted fields to be trimmed if required. It’s entirely vanilla, not slow, and is known to work correctly in Tiger (10.4.11), Leopard (10.5.8), El Capitan (10.11.6), and High Sierra (10.13.3). However, here’s a slightly faster script which has mine’s virtues and uses Kai’s “unique string” technique. The problem with quoted empty fields is overcome by using a regex to identify enclosing quotes before escaped quotes, so the script requires a fairly recent version of the Satimage OSAX:
(* Assumes that the CSV text follows the RFC 4180 convention:
Records are delimited by CRLF line breaks (but LFs or CRs are OK too with this script).
The last record in the text may or may not be followed by a line break.
Fields in the same record are separated by commas (but a different separator can be specified here with an optional parameter).
The last field in a record is NOT followed by a field separator. Each record has (number of separators + 1) fields, even when these are empty.
All the records should have the same number of fields (but this script just renders what it finds).
Any field value may be enquoted with double-quotes and should be if the value contains line breaks, separator characters, or double-quotes.
Double-quote pairs within quoted fields represent escaped double-quotes.
Trailing or leading spaces in unquoted fields are part of the field values (but trimming can be specified here with an optional parameter).
By implication, spaces (or anything else!) outside the quotes of quoted fields are not allowed.
No other variations are currently supported. *)
(* REQUIRES THE SATIMAGE OSAX in a couple of places, but otherwise uses TIDs as these are faster. *)
on csvToList(csvText, implementation)
-- The 'implementation' parameter is a record with optional properties specifying the field separator character and/or trimming state. The defaults are: {separator:",", trimming:false}.
set {separator:separator, trimming:trimming} to (implementation & {separator:",", trimming:false})
considering case
-- Create unique strings to substitute for quotes, linebreaks, and separators in the text. (Developed from an idea in a script by Kai Edwards.)
-- Get a character which isn't the separator, a double-quote, a linefeed, a return, or a known regex operator. (This is the development!)
set nonSeparator to "≈"
if (nonSeparator is separator) then set nonSeparator to "§"
-- Get a sequence of that character which doesn't occur in the text.
set uniqueSequence to nonSeparator
repeat while (csvText contains uniqueSequence)
set uniqueSequence to uniqueSequence & nonSeparator
end repeat
-- Derive the unique strings.
set quoteProxy to "q" & uniqueSequence
set lineBreakProxy to "b" & uniqueSequence
set separatorProxy to "s" & uniqueSequence
script o -- For fast list access. Only one, reusable property is actually needed, but the script goes faster with two!
property textBlocks : missing value
property finalResult : missing value
end script
set astid to AppleScript's text item delimiters
-- Replace the enclosing quotes of any quoted fields (including empty ones) with the quoteProxy character defined above.
set doctoredCSVtext to (change " *+\"((?:[^\"]|\"\")*+)\" *+" into quoteProxy & "\\1" & quoteProxy in csvText with regexp) -- Satimage.
-- Replace any double-quote pairs left in the text with double-quote singletons.
set AppleScript's text item delimiters to "\"\""
set textItems to doctoredCSVtext's text items
set AppleScript's text item delimiters to quote
set doctoredCSVtext to textItems as text
-- Split the text at the quote proxies, if any.
set AppleScript's text item delimiters to quoteProxy
set o's textBlocks to doctoredCSVtext's text items
-- o's textBlocks is a list of the CSV text's text items after delimitation with the double-quote proxy character.
-- Assuming the convention described at the top of this script, the number of blocks is always odd.
-- Even-numbered blocks, if any, are the unquoted contents of quoted fields and don't need parsing.
-- Odd-numbered blocks are everything else.
repeat with i from 1 to (count o's textBlocks) by 2
-- Replace whatever line endings there are in this block with the lineBreakProxy character defined above.
set AppleScript's text item delimiters to lineBreakProxy
set thisBlock to (paragraphs of item i of o's textBlocks) as text
-- Replace all instances of the field separator in this block with the separatorProxy character defined above.
set AppleScript's text item delimiters to separator
set textItems to text items of thisBlock
set AppleScript's text item delimiters to separatorProxy
set item i of o's textBlocks to textItems as text
end repeat
-- Lose any trailing line break proxy from the last block.
set lastBlock to end of o's textBlocks
if (lastBlock ends with lineBreakProxy) then
if (lastBlock is lineBreakProxy) then
set item -1 of o's textBlocks to ""
else
set item -1 of o's textBlocks to text 1 thru (-1 - (count lineBreakProxy)) of lastBlock
end if
end if
-- Coerce the blocks back to a single text, with further doctoring if trimming.
if (trimming) then
-- Reinstate any quote proxies.
set AppleScript's text item delimiters to quoteProxy
set doctoredCSVtext to o's textBlocks as text
-- Lose any spaces or quote proxies immediately adjacent to separator or line break proxies.
set doctoredCSVtext to (change "(?:" & quoteProxy & "| ++)?(" & (separatorProxy & "|" & lineBreakProxy) & ")(?: ++|" & quoteProxy & ")?" into "\\1" in doctoredCSVtext with regexp) -- Satimage.
else
-- Coerce to text without quote proxies.
set AppleScript's text item delimiters to ""
set doctoredCSVtext to o's textBlocks as text
end if
-- Break the text at the line break proxies for a list of the record texts.
set AppleScript's text item delimiters to lineBreakProxy
set o's finalResult to doctoredCSVtext's text items
-- Replace each record text with a list of its individual field values.
set AppleScript's text item delimiters to separatorProxy
repeat with i from 1 to (count o's finalResult)
set item i of o's finalResult to text items of item i of o's finalResult
end repeat
set AppleScript's text item delimiters to astid
end considering
return o's finalResult
end csvToList
-- Demos:
set csvText to "caiv2 , 2010BBDGRC,\"\"\"President\"\", \"\"Board of Directors\"\"\"" & linefeed & ",\"\"," & linefeed & " , , " & linefeed & "Another line, for demo purposes , " & linefeed & ",," & linefeed
csvToList(csvText, {})
--> {{"caiv2 ", " 2010BBDGRC", "\"President\", \"Board of Directors\""}, {"", "", ""}, {" ", " ", " "}, {"Another line", " for demo purposes ", " "}, {"", "", ""}}
csvToList(csvText, {trimming:true})
--> {{"caiv2", "2010BBDGRC", "\"President\", \"Board of Directors\""}, {"", "", ""}, {"", "", ""}, {"Another line", "for demo purposes", ""}, {"", "", ""}}
set csvText to "caiv2 ; 2010BBDGRC;\"\"\"President\"\"; \"\"Board of Directors\"\"\"" & linefeed & ";\"\";" & linefeed & " ; ; " & linefeed & "Another line; for demo purposes ; " & linefeed & ";;" & linefeed
csvToList(csvText, {separator:";"})
--> {{"caiv2 ", " 2010BBDGRC", "\"President\"; \"Board of Directors\""}, {"", "", ""}, {" ", " ", " "}, {"Another line", " for demo purposes ", " "}, {"", "", ""}}
csvToList(csvText, {separator:";", trimming:true})
--> {{"caiv2", "2010BBDGRC", "\"President\"; \"Board of Directors\""}, {"", "", ""}, {"", "", ""}, {"Another line", "for demo purposes", ""}, {"", "", ""}}
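For anyone wanting to reuse them elsewhere, two vanilla idioms the handler leans on can be pulled out on their own: growing a "unique string" until it doesn’t occur in the text, and supplying defaults for an optional-parameter record by concatenation. This is just a minimal sketch; the handler names uniqueSequenceFor and optionsWithDefaults and the variable names are hypothetical and don’t appear in the script above. Neither handler needs the Satimage OSAX.

```applescript
-- Illustrative handler names; neither appears in the script above.

-- Grow a run of seedCharacter until it no longer occurs in theText.
on uniqueSequenceFor(theText, seedCharacter)
	set uniqueSequence to seedCharacter
	considering case
		repeat while (theText contains uniqueSequence)
			set uniqueSequence to uniqueSequence & seedCharacter
		end repeat
	end considering
	return uniqueSequence
end uniqueSequenceFor

-- Merge caller-supplied options with defaults. When two records are
-- concatenated, a property in the left-hand record takes precedence
-- over a same-named property in the right-hand one.
on optionsWithDefaults(options)
	return options & {separator:",", trimming:false}
end optionsWithDefaults

set proxySuffix to uniqueSequenceFor("a≈b≈≈c", "≈")
--> "≈≈≈"
set {separator:fieldSeparator, trimming:trimFields} to optionsWithDefaults({trimming:true})
{fieldSeparator, trimFields}
--> {",", true}
```

Prefixing the returned sequence with different letters, as the script does with "q", "b", and "s", gives several distinct proxy strings from one search of the text.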