Use shell script sort for csv file

I have a large csv file of names and addresses (sample below) and would like to sort into surname sequence. Unfortunately the name field , field 1, contains both first name and surname so cannot figure out how to achieve this.

Adam Collins,27 The Mill Apartments Colchester CO1 2QT,01206 863601,
Alan Armstrong,Ruby Dene Clacton Road Weeley Heath CO16 9DN,01255 830942,
Alan Irwin,53 Greenacres Clacton on Sea Essex CO15 6LZ, ,
Alison Lightly,2 Upper Second Avenue Frinton on Sea CO13 9LL,01255 677407,
Alison O’Reilly,Small World Coggeshall Road Dedham CO7 6ET,01206 323363,
Amanda Elliot,6 Manor Road Great Holland Essex CO13 0JT,01255 674057,
Andrea Poulter,89 Rainham Way Frinton on Sea CO13 9NT,01255 673293,
Andrew Theobald,7 St.Andrews Place Brightlingsea CO7 0RH,01206 303000,
Angela Evans,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,

Have tried using:

do shell script "sort -t ‘,’ -k 1.6,1.9 " & quoted form of filePath & " -o " & quoted form of filePath

but is too variable on count of characters in surname.
Is there anyway to select the surname for sorting, any suggestions gratefully received.

It would be difficult to split correctly the combo first name + surname.

How will you automatically split :
Jean Paul Sartre which must be Jean Paul | Sartre
Bret Easton Ellis
Linda Joy Singleton I guess that it’s Linda Joy | Singleton but I’m not sure
Frank Lloyd Wright which if I remember well must be Frank | Lloyd Wright but is stored as Wright in my dictionary
Charles de Gaulle which must be Charles | de Gaulle but is stored as Gaulle in my dictionary
Antoine de Saint-Exupéry which must be Antoine | de Saint-Exupéry but is stored as Saint-Exupéry in my dictionary
Daphne du Maurier which must be Daphne | du Maurier

Yvan KOENIG running El Capitan 10.11.6 in French (VALLAURIS, France) mercredi 17 aout 2016 15:56:06

Hey Sunny,

Yvan is correct about potential difficulties with multi-word surnames, but for single-word surnames this should suffice:


set dataFile to POSIX path of ((path to downloads folder as text) & "test.txt")

set shCMD to "
CSVDATA=$(sed -E 's!,!TAB!; s! ([^ ]+TAB)!TAB\\1!' " & quoted form of dataFile & ")
CSVDATA=$(<<< \"$CSVDATA\" sort -t 'TAB' -k 2 | sed -E 's!TAB! !' | sed -E 's!TAB!,!')
echo \"$CSVDATA\"
"
do shell script shCMD

  • Replace “TAB” with a literal tab in the script.

I’m simply replacing the first comma with a tab and then bracketing the word (surname) before it with a tab.

From there its simple to sort by the surname column using tab as the delimiter.

Then I remove the tabs and replace the comma.


Chris


{ MacBookPro6,1 · 2.66 GHz Intel Core i7 · 8GB RAM · OSX 10.11.6 }
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

Hi Chris
Many thanks for the response but afraid I’m just a simple Applescript scripter and have no idea how to replace “TAB” in your script with a literal tab. Have looked on the net but have only found Ctrl+V and tab which has no effect, please could you enlighten me!

Hi. I can’t immediately figure out the purpose of the -t option. My edit appears to sort properly and at least partly address issues with names beginning with “du” or “de.” No testing of any real consequence was done beyond the supplied sample text.

set peeps to "Adam Collins,27 The Mill Apartments  Colchester  CO1 2QT,01206 863601,
Alan Armstrong,Ruby Dene  Clacton Road  Weeley Heath  CO16 9DN,01255 830942,
Alan Irwin,53 Greenacres  Clacton on Sea  Essex  CO15 6LZ, ,
Alison Lightly,2 Upper Second Avenue  Frinton on Sea  CO13 9LL,01255 677407,
Alison O'Reilly,Small World  Coggeshall Road  Dedham  CO7 6ET,01206 323363,
Amanda Elliot,6 Manor Road  Great Holland  Essex  CO13 0JT,01255 674057,
Andrea Poulter,89 Rainham Way  Frinton on Sea  CO13 9NT,01255 673293,
Andrew Theobald,7 St.Andrews Place  Brightlingsea  CO7 0RH,01206 303000,
Angela Evans,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885"


(do shell script "echo " & (peeps)'s quoted form & " | sort -fk 2.1,2.2 ")

This might be overkill, but it reverses the name order and sorts on that, so where last names match, the entries will be sorted on first name. It assumes everything after the first space is the last name. Requires 10.10 or later.

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

set posixPath to POSIX path of (choose file)
set theString to current application's NSString's stringWithContentsOfFile:posixPath encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
-- add trailing linefeed if not present; change to linebreak of choice
if not (theString's hasSuffix:linefeed) as boolean then
	set theString to theString's stringByAppendingString:linefeed
end if
set theLength to theString's |length|()
set theStart to 0
-- make array to hold dictionaries/records
set dataRecs to current application's NSMutableArray's array()
repeat
	-- get range of paragraph, plus its first space and comma
	set parRange to theString's paragraphRangeForRange:{theStart, 0}
	set spaceRange to theString's rangeOfString:space options:0 range:parRange
	set commaRange to theString's rangeOfString:"," options:0 range:parRange
	-- build string of last name followed by first name
	set lastNameRange to {spaceRange's location, (commaRange's location) - (spaceRange's location)}
	set firstNameRange to {parRange's location, (spaceRange's location) - (parRange's location)}
	set sortName to (theString's substringWithRange:lastNameRange)'s stringByAppendingString:(theString's substringWithRange:firstNameRange)
	-- add dictionary/record to array
	dataRecs's addObject:(current application's NSDictionary's dictionaryWithObjects:{theString's substringWithRange:parRange, sortName} forKeys:{"theEntry", "theKey"})
	-- increment start for next paragraph
	set theStart to (parRange's location) + (parRange's |length|)
	if theStart ≥ theLength then exit repeat
end repeat
-- sort
set sortDesc to current application's NSSortDescriptor's sortDescriptorWithKey:"theKey" ascending:true selector:"localizedCaseInsensitiveCompare:"
-- extract the original text
set sortedValues to ((dataRecs's sortedArrayUsingDescriptors:{sortDesc})'s valueForKey:"theEntry")'s componentsJoinedByString:""
-- build new path and write file
set posixPath to current application's NSString's stringWithString:posixPath
set theExt to posixPath's pathExtension()
set newPath to (posixPath's stringByDeletingPathExtension()'s stringByAppendingString:"-new")'s stringByAppendingPathExtension:theExt
sortedValues's writeToFile:newPath atomically:true encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)

Hello Marc

I tested you command with :

set peeps to "Adam de Collins,27 The Mill Apartments  Colchester  CO1 2QT,01206 863601,
Alan Armstrong,Ruby Dene  Clacton Road  Weeley Heath  CO16 9DN,01255 830942,
Alan le Irwin,53 Greenacres  Clacton on Sea  Essex  CO15 6LZ, ,
Alison Lightly,2 Upper Second Avenue  Frinton on Sea  CO13 9LL,01255 677407,
Alison O'Reilly,Small World  Coggeshall Road  Dedham  CO7 6ET,01206 323363,
Amanda Elliot,6 Manor Road  Great Holland  Essex  CO13 0JT,01255 674057,
Andrea Poulter,89 Rainham Way  Frinton on Sea  CO13 9NT,01255 673293,
Andrew Theobald,7 St.Andrews Place  Brightlingsea  CO7 0RH,01206 303000,
Angela Evans,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Paul Loup Sulitzer,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Paul-Loup Sulitzer,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Jean Paul Sartre,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Jean-Paul Sartre,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Bret Easton Ellis,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Linda Joy Singleton,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Frank Lloyd Wright,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Charles de Gaulle,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Antoine de Saint-Exupéry,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Daphne du Maurier,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885"


(do shell script "echo " & (peeps)'s quoted form & " | sort -fk 2.1,2.2 ")

and I got :

[format]“Alan Armstrong,Ruby Dene Clacton Road Weeley Heath CO16 9DN,01255 830942,
Adam de Collins,27 The Mill Apartments Colchester CO1 2QT,01206 863601,
Antoine de Saint-Exupéry,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Charles de Gaulle,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Daphne du Maurier,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885
Amanda Elliot,6 Manor Road Great Holland Essex CO13 0JT,01255 674057,
Angela Evans,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Bret Easton Ellis,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Linda Joy Singleton,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Alan le Irwin,53 Greenacres Clacton on Sea Essex CO15 6LZ, ,
Alison Lightly,2 Upper Second Avenue Frinton on Sea CO13 9LL,01255 677407,
Frank Lloyd Wright,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Paul Loup Sulitzer,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Alison O’Reilly,Small World Coggeshall Road Dedham CO7 6ET,01206 323363,
Andrea Poulter,89 Rainham Way Frinton on Sea CO13 9NT,01255 673293,
Jean Paul Sartre,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Jean-Paul Sartre,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Paul-Loup Sulitzer,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Andrew Theobald,7 St.Andrews Place Brightlingsea CO7 0RH,01206 303000,”[/format]

which is better than what I guessed.

I’m just wondering why Paul Loup Sulitzer is sorted upon Loup Sulitzer
while Jean Paul Sartre is sorted upon Sartre.

Yvan KOENIG running El Capitan 10.11.6 in French (VALLAURIS, France) vendredi 19 aout 2016 16:07:52

Hello Shane
With the datas used to test Mark proposal yours returned :

[format]Alan Armstrong,Ruby Dene Clacton Road Weeley Heath CO16 9DN,01255 830942,
Adam de Collins,27 The Mill Apartments Colchester CO1 2QT,01206 863601,
Charles de Gaulle,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Antoine de Saint-Exupéry,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Daphne du Maurier,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Bret Easton Ellis,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Amanda Elliot,6 Manor Road Great Holland Essex CO13 0JT,01255 674057,
Angela Evans,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Linda Joy Singleton,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Alan le Irwin,53 Greenacres Clacton on Sea Essex CO15 6LZ, ,
Alison Lightly,2 Upper Second Avenue Frinton on Sea CO13 9LL,01255 677407,
Frank Lloyd Wright,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Paul Loup Sulitzer,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Alison O’Reilly,Small World Coggeshall Road Dedham CO7 6ET,01206 323363,
Jean Paul Sartre,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Andrea Poulter,89 Rainham Way Frinton on Sea CO13 9NT,01255 673293,
Jean-Paul Sartre,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Paul-Loup Sulitzer,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Andrew Theobald,7 St.Andrews Place Brightlingsea CO7 0RH,01206 303000,
[/format]

which is a bit better which put Charles de Gaulle before Antoine de Saint-Exupery. (no political judgement intended :slight_smile: )

Yvan KOENIG running El Capitan 10.11.6 in French (VALLAURIS, France) vendredi 19 aout 2016 16:14:18

I think Marc’s works on the second word, whereas mine takes all the text from the first space through to the first comma,then appends the text before the space, and sorts on the result.

Hello Shane

You’re right (as quite always).
It explains the different treatment given to the two occurrences of Sartre.

Yvan KOENIG running El Capitan 10.11.6 in French (VALLAURIS, France) vendredi 19 aout 2016 16:26:41

Hey Sunny,

Select “TAB” and hit the tab key. :cool:

The forum doesn’t let you copy literal tabs from displayed code, and OSX’s sed doesn’t like the idiomatic \t.

-Chris

Hi Chris

With my test file it returns :
[format]“Alan Armstrong,Ruby Dene Clacton Road Weeley Heath CO16 9DN,01255 830942,
Adam de Collins,27 The Mill Apartments Colchester CO1 2QT,01206 863601,
Amanda Elliot,6 Manor Road Great Holland Essex CO13 0JT,01255 674057,
Bret Easton Ellis,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Angela Evans,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Charles de Gaulle,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Alan le Irwin,53 Greenacres Clacton on Sea Essex CO15 6LZ, ,
Alison Lightly,2 Upper Second Avenue Frinton on Sea CO13 9LL,01255 677407,
Daphne du Maurier,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Alison O’Reilly,Small World Coggeshall Road Dedham CO7 6ET,01206 323363,
Andrea Poulter,89 Rainham Way Frinton on Sea CO13 9NT,01255 673293,
Antoine de Saint-Exupéry,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Jean Paul Sartre,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Jean-Paul Sartre,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Linda Joy Singleton,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Paul Loup Sulitzer,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Paul-Loup Sulitzer,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,
Andrew Theobald,7 St.Andrews Place Brightlingsea CO7 0RH,01206 303000,
Frank Lloyd Wright,7 Regency Lodge Clacton on Sea CO15 2AN,07540 433885,”[/format]

Yvan KOENIG running El Capitan 10.11.6 in French (VALLAURIS, France) vendredi 19 aout 2016 17:40:13

Hey Yvan,

Yes “ I have not allowed for multipart surnames “ since the OP’s test data has none.

However the basic system I’ve employed permits one to allow for such things by marking common surname prefixes before moving on to the sort step.

It would be super simple to generically handle lower case prefixes.

I’ll give it some thought.

-Chris

Similar to Shane’s script (in concept at least) but using the Satimage.osax.

-------------------------------------------------------------------------------------------
# Sorting text in a variable.
# Requires the Satimage.osax to be installed.
-------------------------------------------------------------------------------------------
set theData to text 2 thru -1 of "
Adam Collins,27 The Mill Apartments  Colchester  CO1 2QT,01206 863601,
Alan Armstrong,Ruby Dene  Clacton Road  Weeley Heath  CO16 9DN,01255 830942,
Alan Irwin,53 Greenacres  Clacton on Sea  Essex  CO15 6LZ, ,
Alison Lightly,2 Upper Second Avenue  Frinton on Sea  CO13 9LL,01255 677407,
Alison O'Reilly,Small World  Coggeshall Road  Dedham  CO7 6ET,01206 323363,
Amanda Elliot,6 Manor Road  Great Holland  Essex  CO13 0JT,01255 674057,
Andrea Poulter,89 Rainham Way  Frinton on Sea  CO13 9NT,01255 673293,
Andrew Theobald,7 St.Andrews Place  Brightlingsea  CO7 0RH,01206 303000,
Angela Evans,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
"

set theData to change "^([^ ]+) ([^,]+)," into "\\2¢\\1," in theData with regexp
set theData to sortlist theData comparison 1
set theData to change "^([^¢]+)¢([^,]+)," into "\\2 \\1," in theData with regexp
set theData to join theData using linefeed
-------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------
# Sorting text in a text file (in place).
# Requires the Satimage.osax to be installed.
-------------------------------------------------------------------------------------------
# Change to reflect the correct path to your file.
set theFile to alias ((path to downloads folder as text) & "test.txt")
set theData to readtext theFile
set theData to change "^([^ ]+) ([^,]+)," into "\\2¢\\1," in theData with regexp
set theData to sortlist theData comparison 1 -- case insensitive sort
set theData to change "^([^¢]+)¢([^,]+)," into "\\2 \\1," in theData with regexp
set theData to join theData using linefeed
writetext theData to theFile
-------------------------------------------------------------------------------------------

And no Yvan “ I’m not yet handling multipart surnames. :cool:

-Chris

Okay. This one handles multipart surnames with a lowercase prefix.

Replace “TAB” with a literal tab character.

-Chris

set csvData to text 2 thru -1 of "
Adam de Collins,27 The Mill Apartments  Colchester  CO1 2QT,01206 863601,
Alan Armstrong,Ruby Dene  Clacton Road  Weeley Heath  CO16 9DN,01255 830942,
Alan le Irwin,53 Greenacres  Clacton on Sea  Essex  CO15 6LZ, ,
Alison Lightly,2 Upper Second Avenue  Frinton on Sea  CO13 9LL,01255 677407,
Alison O'Reilly,Small World  Coggeshall Road  Dedham  CO7 6ET,01206 323363,
Amanda Elliot,6 Manor Road  Great Holland  Essex  CO13 0JT,01255 674057,
Andrea Poulter,89 Rainham Way  Frinton on Sea  CO13 9NT,01255 673293,
Andrew Theobald,7 St.Andrews Place  Brightlingsea  CO7 0RH,01206 303000,
Angela Evans,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Paul Loup Sulitzer,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Paul-Loup Sulitzer,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Jean Paul Sartre,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Jean-Paul Sartre,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Bret Easton Ellis,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Linda Joy Singleton,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Frank Lloyd Wright,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Charles de Gaulle,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Antoine de Saint-Exupéry,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Daphne du Maurier,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885
"

set shCMD to "
CSVDATA=" & quoted form of csvData & ";
CSVDATA=$(
sed -E '
   s!^([^,]+) ([a-z][^,]+),!\\1TAB\\2,!
   s!^([^,TAB]+) ([^ ,]+),!\\1TAB\\2,!
' <<< \"$CSVDATA\"
);

CSVDATA=$(sort -f -t'TAB' -k2 <<< \"$CSVDATA\");
CSVDATA=$(sed -E 's!TAB! !' <<< \"$CSVDATA\");

echo \"$CSVDATA\";
"
do shell script shCMD

Thanks Chris.

I was busy elsewhere so I didn’t saw your message stamped : Today 01:33:13 pm

Yvan KOENIG running El Capitan 10.11.6 in French (VALLAURIS, France) vendredi 19 aout 2016 22:51:10

So let’s assume this thread prompts a minor migration to Clacton on Sea. Among the newcomers are Gordon Bennett, Lenny Bruce, and Maurice Béjart, who all move into Regency Lodge. What happens when they’re sorted in?

The sort command does a great job of ASCII sorting. Alas, the world now mostly runs on Unicode. Time to move on, I reckon.

Hi Shane

I wasn’t aware for Gordon Bennett, Lenny Bruce, and Maurice Béjart :slight_smile:

Yvan KOENIG running El Capitan 10.11.6 in French (VALLAURIS, France) samedi 20 aout 2016 11:00:11

That’s very Pythonesque! ˜:lol:’

Hi Nigel

I don’t know the address of their graves so I used a fake address.
I tested with other names but if I post them here, censor would strike :rolleyes:

Yvan KOENIG running El Capitan 10.11.6 in French (VALLAURIS, France) samedi 20 aout 2016 12:45:17