Sort a list of strings based on a substring

The localizedStandardCompare selector can be used directly, without mapping the list to a dictionary.

It sorts the list the way Finder does. To sort in descending order, set ascending to false.

use framework "Foundation"

set theList to {"A04", "B03", "C02", "D01"}
set theArray to current application's NSArray's arrayWithArray:theList
set theDescriptor to current application's NSSortDescriptor's sortDescriptorWithKey:"self" ascending:false selector:"localizedStandardCompare:"
return (theArray's sortedArrayUsingDescriptors:{theDescriptor}) as list

It also works with {"A-121", "D01", "CCC6", "BB011"} as-is.

If the sort order is ascending, you can use the sortedArrayUsingSelector: API without creating a sort descriptor:

use framework "Foundation"

set theList to {"A-121", "D01", "CCC6", "BB011"}
set theArray to current application's NSArray's arrayWithArray:theList
return (theArray's sortedArrayUsingSelector:"localizedStandardCompare:") as list

Thanks Stefan for the suggestions.

I didn’t properly explain my request in post 1, and my test lists were poor. My goal was to completely ignore the initial letter of each item of the list and to sort on the numbers only. The following demonstrates why localizedStandardCompare can’t be used directly to achieve this goal (unless I’m missing something):

use framework "Foundation"

set theList to {"B04", "A03", "C02", "D01"}
set theArray to current application's NSArray's arrayWithArray:theList
set theDescriptor to current application's NSSortDescriptor's sortDescriptorWithKey:"self" ascending:false selector:"localizedStandardCompare:"
return (theArray's sortedArrayUsingDescriptors:{theDescriptor}) as list
-- returns {"D01", "C02", "B04", "A03"}
-- desired result {"D01", "C02", "A03", "B04"}
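
For comparison, the goal stated above (ignore the leading letter and order by the number alone) can be sketched in JavaScript, the other language used in this thread. The helper name numericKey is made up for illustration, and the sketch assumes every item contains digits after its prefix.

```javascript
// Hypothetical helper: drop leading non-digit characters and read the rest as a number.
function numericKey(s) {
  return Number(s.replace(/^\D*/, ""));
}

const theList = ["B04", "A03", "C02", "D01"];

// Copy the array before sorting so the original order is preserved.
const sortedList = [...theList].sort((a, b) => numericKey(a) - numericKey(b));

console.log(sortedList); // ["D01", "C02", "A03", "B04"]
```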

I ran timing tests with a slightly modified version of the above scripts. The results with lists that contained 128 and 512 items were 12 and 39 milliseconds, respectively.

use framework "Foundation"
use scripting additions

set theList to {"Item 11", "Item 01", "Item 21"}
set thePattern to "^\\D*" -- remove any non-digit characters from the front of the string
set sortedList to getSortedList(theList, thePattern)

on getSortedList(theList, thePattern)
	set theArray to current application's NSArray's arrayWithArray:theList
	set sortingArray to current application's NSMutableArray's new()
	repeat with anItem in theArray
		set theSubstring to (anItem's stringByReplacingOccurrencesOfString:thePattern withString:"" options:1024 range:{0, anItem's |length|()}) -- option 1024 is RegEx search
		(sortingArray's addObject:{originalString:anItem, sortString:theSubstring})
	end repeat
	set theDescriptor to current application's NSSortDescriptor's sortDescriptorWithKey:"sortString" ascending:true selector:"localizedStandardCompare:"
	return ((sortingArray's sortedArrayUsingDescriptors:{theDescriptor})'s valueForKey:"originalString") as list
end getSortedList

The script immediately above isolates the desired substring by using the RegEx pattern to remove unwanted portions of the string. The following script works by using the RegEx pattern to directly identify the desired substring, which in most cases is probably a better approach.

As written, the following script throws an error if a list item does not contain a matching substring. Error handling appropriate to the task at hand needs to be added for this.

use framework "Foundation"
use scripting additions

set theList to {"item 21", "item 01", "item 11"}
set thePattern to "\\d+" -- match the first run of decimal digits
set sortedList to getSortedList(theList, thePattern)

on getSortedList(theList, thePattern)
	set theArray to current application's NSArray's arrayWithArray:theList
	set sortingArray to current application's NSMutableArray's new()
	repeat with anItem in theArray
		set theRange to (anItem's rangeOfString:thePattern options:1024) -- option 1024 is RegEx search
		set theSubstring to (anItem's substringWithRange:theRange)
		(sortingArray's addObject:{originalString:anItem, sortString:theSubstring})
	end repeat
	set theDescriptor to current application's NSSortDescriptor's sortDescriptorWithKey:"sortString" ascending:true selector:"localizedStandardCompare:"
	return ((sortingArray's sortedArrayUsingDescriptors:{theDescriptor})'s valueForKey:"originalString") as list
end getSortedList
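
The failure mode mentioned above (an item with no matching substring) is easy to see in a JavaScript sketch of the same decorate, sort, undecorate idea. Here unmatched items are collected separately rather than raising an error; that policy, and the name sortOnMatch, are assumptions made for illustration only.

```javascript
// Sort items on the first substring that matches `pattern`.
// Items without a match are returned separately instead of causing an error.
function sortOnMatch(items, pattern) {
  const keyed = [];
  const unmatched = [];
  for (const item of items) {
    const m = item.match(pattern);
    if (m === null) {
      unmatched.push(item);
    } else {
      keyed.push({ original: item, key: m[0] });
    }
  }
  // Compare the extracted keys numerically.
  keyed.sort((a, b) => Number(a.key) - Number(b.key));
  return { sorted: keyed.map(k => k.original), unmatched };
}

const result = sortOnMatch(["item 21", "no digits", "item 01", "item 11"], /\d+/);
console.log(result.sorted);    // ["item 01", "item 11", "item 21"]
console.log(result.unmatched); // ["no digits"]
```

Whether unmatched items should be dropped, appended, or reported is a decision that depends on the task.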

In addition to those detailed above, there’s actually a third approach, which utilizes a capture group and both filters and sorts. I don’t know if this would ever be of actual use, so I include it FWIW.

The following script returns all items that begin with the character “i” and that contain one or more numbers in parentheses. The substring that is sorted on is the first instance of numbers in parentheses.

use framework "Foundation"
use scripting additions

set theList to {"skip item (33)", "item (22)", "item (01)", "item (11)"}
set thePattern to "^i.*?\\((\\d+)\\).*$"
set sortedList to getSortedList(theList, thePattern)

on getSortedList(theList, thePattern)
	set theArray to current application's NSArray's arrayWithArray:theList
	set sortingArray to current application's NSMutableArray's new()
	repeat with anItem in theArray
		set theSubstring to (anItem's stringByReplacingOccurrencesOfString:thePattern withString:"$1" options:1024 range:{0, anItem's |length|()}) -- option 1024 is RegEx search
		if (anItem's isEqualToString:theSubstring) is false then (sortingArray's addObject:{originalString:anItem, sortString:theSubstring})
	end repeat
	set theDescriptor to current application's NSSortDescriptor's sortDescriptorWithKey:"sortString" ascending:true selector:"localizedStandardCompare:"
	return ((sortingArray's sortedArrayUsingDescriptors:{theDescriptor})'s valueForKey:"originalString") as list
end getSortedList

For purely inspirational purposes, the (hopefully) equivalent script in JavaScript:

const l = ["skip item (33)", "item (22)", "item (01)", "item (11)"];

const res = l.filter(x => /^i.*\(\d+\)/.test(x)).sort((a,b) => {
  const anum = +a.match(/\((\d+)\)/)[1];
  const bnum = +b.match(/\((\d+)\)/)[1];
  return anum-bnum;
})
console.log(res) 

That returns
["item (01)","item (11)","item (22)"]
which might not be exactly what the AppleScript version does.

The filter method returns a new array containing only those elements of l that begin with “i” and contain a number in parentheses. sort then sorts this array by using the anonymous function passed as a parameter. Note that this approach might not be the most performant one because of the repeated execution of match.

In any case, there’s not much sense in using .*$ in the regular expression since you don’t care what the rest of the string after the closing parenthesis is. But you still force the RE engine to continue its work. Also, .* is in many cases not a good idea because it might gobble up more than one wants.
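
One way to avoid the repeated match calls is to extract each number once, sort on the stored value, and then discard it. This is only a sketch; whether it is measurably faster was not tested here.

```javascript
const l = ["skip item (33)", "item (22)", "item (01)", "item (11)"];

const res = l
  .map(x => ({ text: x, m: x.match(/^i.*?\((\d+)\)/) })) // one match per item
  .filter(e => e.m !== null)                             // keep only matching items
  .map(e => ({ text: e.text, num: Number(e.m[1]) }))     // store the number once
  .sort((a, b) => a.num - b.num)                         // compare stored numbers
  .map(e => e.text);                                     // back to plain strings

console.log(res); // ["item (01)", "item (11)", "item (22)"]
```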

chrillek. Thanks for looking at my thread and for the suggestion.

IMO, the use of .*$ is necessary in my third script. If it’s not included in the pattern, all of the string after the first instance of numbers in parentheses is included in the substring. It’s rare this would impact the sort order, but it seems best to set the substring to the desired substring and nothing else. I ran some timing tests with a list that contained 500 items, and there was no difference if I included .*$ or not.
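
The effect described here can be reproduced with JavaScript's String.prototype.replace, which also replaces only the part of the string that the pattern matched. The sample string is made up for illustration.

```javascript
const s = "item (22) trailing text";

// Without .*$ the match stops at the closing parenthesis, so the tail survives.
const withoutTail = s.replace(/^i.*?\((\d+)\)/, "$1");

// With .*$ the whole string is matched and replaced by the capture group alone.
const withTail = s.replace(/^i.*?\((\d+)\).*$/, "$1");

console.log(withoutTail); // "22 trailing text"
console.log(withTail);    // "22"
```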

You’re right of course. I was thinking along the lines of „use the capturing group for comparison“, whereas your code is relying on a modified string for that. My method is probably not possible with the ObjC frameworks.

chrillek. You make a good point. In testing with a 640-item list, an ASObjC script that filters then sorts is 36 percent faster (30 versus 47 milliseconds). The use of a capture group is still necessary to get the numbers within parentheses (without the parentheses), although including the parentheses with the substring probably wouldn’t impact the sort order.

use framework "Foundation"

set theList to {"skip item (33)", "item (22)", "item (01)", "item (11)", "item (skip item)"}
set sortedList to getSortedList(theList)

on getSortedList(theList)
	set theArray to current application's NSArray's arrayWithArray:theList
	set thePredicate to current application's NSPredicate's predicateWithFormat:"self MATCHES 'i.*?\\\\(\\\\d+\\\\).*$'"
	set theArray to theArray's filteredArrayUsingPredicate:thePredicate
	set sortingArray to current application's NSMutableArray's new()
	repeat with anItem in theArray
		set theSubstring to (anItem's stringByReplacingOccurrencesOfString:"^.*?\\((\\d+)\\).*$" withString:"$1" options:1024 range:{0, anItem's |length|()}) -- option 1024 is RegEx search
		(sortingArray's addObject:{originalString:anItem, sortString:theSubstring})
	end repeat
	set theDescriptor to current application's NSSortDescriptor's sortDescriptorWithKey:"sortString" ascending:true selector:"localizedStandardCompare:"
	return ((sortingArray's sortedArrayUsingDescriptors:{theDescriptor})'s valueForKey:"originalString") as list
end getSortedList

Faster than the JavaScript variant?

I tested with Script Geek, and I don’t know how to test JavaScript with Script Geek. I created the test list with the following but didn’t include the creation of the list in the timing result:

set theList to {"skip item (33)", "item (22)", "item (01)", "item (11)", "item (skip item)"}
repeat 7 times
	set theList to theList & theList
end repeat
theList

FTR: I ran the JavaScript version in Scriptable on an iPad Pro 11 (M1) with this 640-element array. Measured with Date.now() after filling the array and again after sorting, it took 6 ms. To make the test more meaningful, I increased the array size to 100000, for which the script ran in 882 ms.

Here’s the code:

const l = ["skip item (33)", "item (22)", "item (01)", "item (11)", "item (skip item)"];
const list = Array.from({length:20000}, () => l).flat(); /* 100000 elements, use 128 for 640 elements*/

const startTime = Date.now();
const res = list.filter(x => /^i.*\(\d+\)/.test(x)).sort((a,b) => {
  const anum = +a.match(/\((\d+)\)/)[1];
  const bnum = +b.match(/\((\d+)\)/)[1];
  return anum-bnum;
})
console.log(Date.now()-startTime);

Saving that in a file and running via osascript -l JavaScript <file> in the terminal should write the elapsed time in ms to the terminal.

And a final timing result, this time for the AppleScript script above that filters with NSPredicate before sorting:

I used a list with 10240 elements, i.e. the repeat loop runs 11 times. I then changed the call to the sort handler like so:

set startTime to current application's NSDate's now
set sortedList to getSortedList(theList)
set thetime to ((startTime's timeIntervalSinceNow()) * -1000) as integer
log thetime

Then I used osascript to run both the JavaScript version posted earlier, with the same array length (i.e. 10240), and the AppleScript version (both on a MacBook Pro from 2019 with the current version of Ventura installed).

| JavaScript | AppleScript |
|       96ms |      1605ms |

Obviously, the JavaScript version is not only a lot shorter, but it also runs about 16 times faster. Which is not to imply that this is generally the case. But here, no ObjC framework and no marshalling between two languages is required.

chrillek. Thanks for the timing results.

I use AppleScript almost exclusively and wondered how JavaScript could be used in an AppleScript. Can the following be made to work?

set theList to {"skip item (33)", "item (22)", "item (01)", "item (11)", "item (skip item)"}

set sortedList to getSortedList(theList) --> {"item (01)", "item (11)", "item (22)"}

on getSortedList(theList)
const res = l.filter(x => /^i.*\(\d+\)/.test(x)).sort((a,b) => {
  const anum = +a.match(/\((\d+)\)/)[1];
  const bnum = +b.match(/\((\d+)\)/)[1];
  return anum-bnum;
})
console.log(res)
end getSortedList

That sounds challenging. First off, you can’t directly include JS in AS code since the script engine is „set“ (for lack of a better term) to AS.

Second, passing the list as a parameter to JS is not obvious. I’ll try to find out more about this.

Given that the items to be sorted are all strings, they can be concatenated to form part of the JS code string:

{"skip item (22)", "item (33)", "item (100)", "item (011)", "item (skip item)"}
sort_(result) --> {"item (011)", "item (33)", "item (100)"} 

to sort:(L as list)
        set my text item delimiters to character id 0
        run script "`" & L & "`.split('\\x00')
        .filter(x => x.match(/[0-9]/) && !(
                     x.match(/skip item/i)))
        .sort((a,b) => a.match(/[0-9]+/)
                     - b.match(/[0-9]+/));
        " in "JavaScript"
end sort:

The key aspect of this script is the joining of a list of strings using a nul byte (character id 0), as this wouldn’t ordinarily be a character that features in conventional strings. Nul bytes most often appear as markers delimiting special segments of a string in which specific byte positions or ranges are allocated for specific pieces of information. This is precisely the function they’ll be serving here: delimiting the boundaries between distinct items in the original list, which will be passed into the JXA code as a single string.

Thus, after setting the text item delimiters to character id 0, it is safe to perform this operation with L (the list of string items):

"`" & L & "`.split(..."

This implicitly coerces L from a list into a string, performing serial concatenations of all of its items with the nul byte as described above. In the JavaScript code, this string is enclosed inside a pair of back-ticks, i.e. `․․․` , which allows for the possibility of multi-line string items that might feature in the original list. However, it would require that any occurrence of a back-tick character in any of the string items be encoded before being sent through. If multi-line strings are not a consideration, it would be better to do this instead:

quoted form of (L as text) & ".split(..."

After passing the string representation of the original list through to the JXA environment, it needs to be converted from a string back into separate items housed within a native Array-prototype object. The items were originally joined using the nul byte (character id 0), so the string now needs to be split at every occurrence of a nul byte (which, in JavaScript, is expressed as the hexadecimal escaped form \x00—this will need to be double-escaped, of course), i.e.:

"`" & L & "`.split('\\x00')

or, if preferred:

quoted form of (L as text) & ".split('\\x00')"

The remainder of the JXA code acts upon the array to sort it using the custom callback function, which is by no means the best example of such a function, but it sufficiently handles and correctly sorts the test items supplied. It returns a JavaScript array, which conveniently gets converted back into an AppleScript list upon completion of the run script command.
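
The round trip can be imitated entirely in JavaScript to see what the generated source looks like and why back-ticks need encoding. This is a sketch only: escapeForTemplate and buildSource are hypothetical names, new Function stands in for run script, and the escaping rules (backslash, back-tick, and ${) are my reading of what a back-tick string requires, not something taken from the post.

```javascript
const NUL = "\x00";

// Escape the characters that are special inside a back-tick (template) string.
function escapeForTemplate(s) {
  return s
    .replace(/\\/g, "\\\\")    // backslash first, so later escapes are not doubled
    .replace(/`/g, "\\`")      // a bare back-tick would end the literal
    .replace(/\$\{/g, "\\${"); // ${ would start an interpolation
}

// Build the same kind of source text the AppleScript handler concatenates.
function buildSource(items) {
  return "`" + items.map(escapeForTemplate).join(NUL) + "`.split('\\x00')";
}

const items = ["item (33)", "has a ` tick", "two\nlines", "literal ${x}"];

// Evaluate the generated source; this plays the role of run script here.
const roundTripped = new Function("return " + buildSource(items))();

console.log(roundTripped.length === items.length); // true
```

Without the escaping step, the second and fourth sample items would either end the string early or be treated as an interpolation.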

I didn’t run that yet but am a bit puzzled: afaik, match returns an Array. What is the difference of two Arrays that is calculated in sort? And why do you use g in the regular expression there?

Edit: To answer my own question: the - operator applied to two single-element arrays calculates the difference of the two elements and returns a number. But you use g in the regular expression, so if the string contains more than one match, the result is NaN:

var a = '123 456';
var b = '789 123';
a.match(/\d+/g)-b.match(/\d+/g); // ['123','456']-['789','123'] => NaN

Since we’re interested only in the first number, one could simply use
/\d+/ without the g in the match.
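
A comparator that relies on neither the array coercion nor the g flag might look like the following. firstNumber is a hypothetical helper, and sending items without digits to the end (by treating them as Infinity) is an arbitrary choice for this sketch.

```javascript
// Return the first run of digits in s as a number, or Infinity if there is none.
function firstNumber(s) {
  const m = s.match(/\d+/); // no g flag: m[0] is the first match only
  return m === null ? Infinity : Number(m[0]);
}

// Compare with < and > rather than subtraction, so Infinity never yields NaN.
function byFirstNumber(a, b) {
  const x = firstNumber(a);
  const y = firstNumber(b);
  return x < y ? -1 : x > y ? 1 : 0;
}

const sorted = ["789 123", "no digits", "123 456", "45 999"].sort(byFirstNumber);

console.log(sorted); // ["45 999", "123 456", "789 123", "no digits"]
```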

Perhaps you could elaborate a bit on the working of your script? I gathered this

  • result contains the list defined in the first line
  • in sort, the first line establishes ASCII 0 as the new text item delimiter
  • In run script, using the string concatenation operator on L casts the list to a string, with the components separated by ASCII 0.

The rest is straightforward (except for the problem with the global flag in the match operator, see above).
But the result of the sort is a JavaScript array, and that seems to be returned as an AppleScript list to the caller – how/why does that happen? I’d expected that AS lists and JS Arrays are different beasts that can’t be simply passed between the two languages.

Quite right. That was a typo that resulted from my toying around with two different implementations, one of which utilised .replace(/^[0-9]/g, '') as a possibility for a more robust solution. But I decided not to focus on solving the sorting issue, which had already received a myriad solutions, for which I personally would elect to use Nigel’s solution. When I swapped the match() function back in, I left the g option flag in place by mistake.

What I had aimed to address specifically was the problem that I quoted from your earlier message in my response, namely that passing the list as a parameter to JS is not obvious.

I’ll add the details to highlight the key aspects of what the script is doing in order to work around this specific issue.

There’s not a huge difference between lists and arrays, the main one being how memory is allocated and subsequently how data is stored (arrays use a single, contiguous block of memory, which is allocated before the array is created, fixing the size of the array until its destruction). JavaScript arrays are a bit weird, of course, allowing greater flexibility and ease-of-use for high-level programming. Conceptually, bridging between a JavaScript Array and an AppleScript list is not only achievable, but I imagine arguably essential as part of the remit of Apple’s Open Scripting Architecture endeavour.

Each of the core AppleScript data types partners with an equivalent JavaScript prototype:

| AppleScript | JavaScript        |
| string      | String.prototype  |
| text        | String.prototype  |
| real        | Number.prototype  |
| integer     | Number.prototype  |
| date        | Date.prototype    |
| list        | Array.prototype   |
| record      | Object.prototype  |
| file        | Path.prototype    |
| boolean     | Boolean.prototype |

There was a time when calling JXA code from within AppleScript was properly implemented by way of the run script "..." in "JavaScript" using parameters {...} command. The supplied parameters would be passed into the JXA’s run() function, which would be explicitly declared, allowing the sending of AppleScript data types cleanly into the JavaScript context, and the receipt of JavaScript data types back to the AppleScript context. The last time I remember this working, at least in a limited fashion, was Mojave, and in a much more substantial manner in High Sierra.

Sadly, it doesn’t seem possible to send parameters through using run script (actually, the journey out still works fine, but the run() function doesn’t seem able to yield a meaningful return value, instead returning error code -4960). However, this seems to be a specific fatal flaw of the run() function, as it’s entirely possible to house code inside a named function, which, if called in the last line of a JXA script, will happily pass its result back through to the AppleScript context.
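
Following that description, a JXA source string handed to run script would end with a call to a named function rather than define run(). The sketch below only shows the shape of such a script: sortByFirstNumber is a hypothetical name, the sample list is written inline for simplicity, and the return behaviour under run script is as described in this post, not verified here.

```javascript
// A named function holding the actual work.
function sortByFirstNumber(items) {
  const num = s => Number(s.match(/\d+/)[0]); // assumes every item contains digits
  return [...items].sort((a, b) => num(a) - num(b));
}

// Last line: per the description above, the value of this call is what gets passed back.
sortByFirstNumber(["item (33)", "item (100)", "item (011)"]);
```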

Thanks a lot for spelling out all the details. I was not aware that there’s a direct mapping between all the AS and JS types. And I was still thinking do shell script "osascript …" when I asked about passing the array back to AS.

I did a little fine-tuning of my scripts included above and thought I would post the final result. As written, the script finds and sorts on a substring that matches a regex pattern. I’ve included an alternative that finds and sorts on a regex capture group, which allows a more refined pattern match. The script contains rudimentary error checking that reports an error when a match is not found. The timing result with a list of 100 items was 12 milliseconds.

use framework "Foundation"
use scripting additions

set theList to {"/Users/Peavine/File 103.txt", "/Users/Peavine/File 101.txt", "/Users/Peavine/File 102.txt"}
set sortedList to getSortedList(theList)

on getSortedList(theList)
	set theArray to current application's NSArray's arrayWithArray:theList
	set sortingArray to current application's NSMutableArray's new()
	repeat with aString in theArray
		set aRange to (aString's rangeOfString:"\\d+\\." options:1024) -- change pattern as desired
		if aRange's |length|() is 0 then display dialog "Match not found" buttons {"OK"} cancel button 1 default button 1
		set aSubstring to (aString's substringWithRange:aRange)
		(sortingArray's addObject:{originalString:aString, sortString:aSubstring})
	end repeat
	set theDescriptor to current application's NSSortDescriptor's sortDescriptorWithKey:"sortString" ascending:true selector:"localizedStandardCompare:"
	return ((sortingArray's sortedArrayUsingDescriptors:{theDescriptor})'s valueForKey:"originalString") as list
end getSortedList

(* 
-- replace the first 3 lines of the repeat loop with the following to use regex capture group
set aSubstring to aString's mutableCopy()
set matchFound to (aSubstring's replaceOccurrencesOfString:".*?(\\d+)\\..*" withString:"$1" options:1024 range:{0, aString's |length|()}) -- lazy .*? so the group captures the whole number, not just its last digit
if matchFound is not 1 then display dialog "Match not found" buttons {"OK"} cancel button 1 default button 1 
*)
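
When a capture-group pattern starts with .*, it is worth checking what the group actually captures, because a greedy .* takes as many characters as it can and leaves \d+ only the last digit. The JavaScript sketch below shows the difference a lazy .*? makes. As far as I know the ICU regular expressions used by Foundation treat greedy and lazy quantifiers the same way, but that was not tested here.

```javascript
const path = "/Users/Peavine/File 103.txt";

// Greedy: .* swallows everything up to "10", so the group captures only "3".
const greedy = path.replace(/.*(\d+)\..*/, "$1");

// Lazy: .*? stops as early as possible, so the group captures "103".
const lazy = path.replace(/.*?(\d+)\..*/, "$1");

console.log(greedy); // "3"
console.log(lazy);   // "103"
```

With file names such as 101, 102, and 103 the last digit alone happens to give the right order, which can hide the problem during testing.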