sorting paragraphs returned by a find shell script

mleonti · May 11, 2009, 11:58am

Does anyone know how you sort this by the file names?

set theFolder to choose folder
set theFiles to paragraphs of (do shell script "find " & quoted form of POSIX path of theFolder & " -type f ! -name '.*'")
tell application "Finder" to sort theFiles by name -- fails

James_Nierodzik · May 11, 2009, 1:10pm

The Finder sort command expects Finder paths, not POSIX paths, so you need to either do this all in the Finder or massage the data from POSIX paths into Finder paths.

set theFolder to choose folder
tell application "Finder"
	set theFiles to every file of entire contents of theFolder
	sort theFiles by name
end tell

mleonti · May 11, 2009, 2:01pm

Thanks James.
Your solution works fine but…
I went that way before, but I have to scan more than 100,000 files and the Finder slows down badly.
I was hoping to learn a shell script or an AppleScript to sort the result of
set theFiles to paragraphs of (do shell script "find " & quoted form of POSIX path of theFolder & " -type f ! -name '.*'")
or to massage the data from POSIX paths into Finder paths, if that can be done fast.

RetroDesign · May 11, 2009, 2:38pm

POSIX file theFile as alias

will do the trick… but if theFile is a list of posix paths, then you need to go with a repeat loop.

James_Nierodzik · May 11, 2009, 2:57pm

Well, I have no idea what the performance will be like, but give this a try.

set theFolder to choose folder
set theFiles to paragraphs of (do shell script "/usr/bin/find " & quoted form of POSIX path of theFolder & " -type f ! -name \".*\" -exec /usr/bin/osascript -e 'POSIX file \"'{}'\"' \\;")

tell application "Finder"
	sort theFiles by name
end tell

This approach turns the found results into Finder-style paths after the find, but still in the shell (albeit through an inline AppleScript). The alternative would be to do it with a repeat loop back inside the script, as Retro said, but I think that would be a worse performance hit.

James_Nierodzik · May 11, 2009, 3:09pm

Here is another version that does the same thing, but uses xargs instead of find -exec. This should give a slight performance boost.

set theFolder to choose folder
set theFiles to paragraphs of (do shell script "/usr/bin/find " & quoted form of POSIX path of theFolder & " -type f ! -name \".*\" -print0 | /usr/bin/xargs -0 -I {} /usr/bin/osascript -e 'POSIX file \"{}\"'")

tell application "Finder"
	sort theFiles by name
end tell
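
For comparison, here is a minimal shell-only sketch of the two invocation styles. It is an illustration rather than a benchmark: basename stands in for the osascript call, and the temporary folder and file names are made up for the demo. Note that with -I {}, xargs still runs the command once per name, so any speed difference is likely to be small.

```shell
#!/bin/sh
# Throwaway test tree (hypothetical names, including one with a space).
tmp=$(mktemp -d)
mkdir -p "$tmp/first" "$tmp/last"
: > "$tmp/first/z.jpg"
: > "$tmp/last/a b.jpg"

# Style 1: find starts one child process per file; the ';' ends the -exec clause.
via_exec=$(find "$tmp" -type f ! -name '.*' -exec basename {} \; | LC_ALL=C sort)

# Style 2: NUL-separated names go to xargs; no terminating ';' is needed here.
# With -I {}, xargs also starts the command once per name.
via_xargs=$(find "$tmp" -type f ! -name '.*' -print0 | xargs -0 -I {} basename {} | LC_ALL=C sort)

printf 'exec:\n%s\nxargs:\n%s\n' "$via_exec" "$via_xargs"
rm -rf "$tmp"
```

Both variables should hold the same two names; the -print0/-0 pair is what keeps the name with the space intact.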

Yvan_Koenig · May 11, 2009, 3:27pm

Why not this simple:


on run
	set theFolder to choose folder
	set theFiles to paragraphs of (do shell script "find " & quoted form of POSIX path of theFolder & " -type f ! -name '.*'")
	
	set newList to my recolle(my sort_list(theFiles), return)
end run

on sort_list(unsortedList)
	set AppleScript's text item delimiters to (ASCII character 10)
	set sortedList to paragraphs of (do shell script "echo " & quoted form of (unsortedList as string) & "| sort  -d -f")
	set AppleScript's text item delimiters to ""
	return sortedList
end sort_list

on recolle(l, d)
	local t
	set AppleScript's text item delimiters to d
	set t to l as text
	set AppleScript's text item delimiters to ""
	return t
end recolle

Which was already posted one or two days ago ?

The sort handler was given by Mark J. Reed.

Yvan KOENIG (from FRANCE, Monday 11 May 2009 17:28:00)

James_Nierodzik · May 11, 2009, 3:35pm

Well I just ran this on a very small test folder that contained these two file paths

./first/z.jpg
./last/a.jpg

When I ran Mark’s script as you posted it, it returned the elements in that order. While that is sorted, it is sorted by path, not by filename as requested. The script I posted (while I’m sure it is far from perfect) returned the list in this order

./last/a.jpg
./first/z.jpg

Yvan_Koenig · May 11, 2009, 7:25pm

Oops

I read too fast and missed the fact that the requested sort was ‘by name’.

Yvan KOENIG (from FRANCE, Monday 11 May 2009 21:27:16)

Nigel_Garvey · May 11, 2009, 9:32pm

If there’s no way to shell script the sort, there’s a customisable vanilla sort you could try here. Save it somewhere convenient as a compiled script, adjust the path to it in the script below, and run the script below. It sorts the paths by their last text items with AppleScript’s TIDs set to “/”.

script byLastTextItem -- Handlers for customising the sort.
	on isGreater(a, b)
		(text item -1 of a > text item -1 of b)
	end isGreater
	
	on isLess(a, b)
		(text item -1 of a < text item -1 of b)
	end isLess
	
	on swap(a, b)
	end swap
	
	on shift(a, b)
	end shift
end script
set lib to (load script file ((path to scripts folder as Unicode text) & "Libraries:Sorts:CustomQsort.scpt")) -- Your path to the CustomQsort script here.

set theFolder to choose folder
set theFiles to (do shell script "/usr/bin/find " & quoted form of POSIX path of theFolder & " -type f ! -name \".*\" -print0" without altering line endings)

set astid to AppleScript's text item delimiters
if ((system attribute "ascv") div 256 mod 16 < 2) then
	set AppleScript's text item delimiters to (ASCII character 0) as Unicode text
else
	set AppleScript's text item delimiters to character id 0
end if
set theFiles to theFiles's text items 1 thru -2
set AppleScript's text item delimiters to "/" as Unicode text
-- considering numeric strings
tell lib to CustomQsort(theFiles, 1, -1, byLastTextItem)
-- end considering
set AppleScript's text item delimiters to astid

theFiles

chrys · May 12, 2009, 8:01am

Heh.

Like in most systems, TIMTOWTDI. At least one for each of Perl, Python, and Ruby (the first by proof, below, the others by reputation). I also came up with a decent shell-based (awk/sort/cut) method.

(*
 * Two shell-based ways to sort a linefeed-delimited list of pathnames by just the filenames (the part after the last slash, if any).
 * The first is in Perl, the second uses the "shell tools" awk, sort, and cut.
 *
 * Both are effectively variations of the Schwartzian Transform.
 * <http://en.wikipedia.org/wiki/Schwartzian_transform> 
 * <http://www.stonehenge.com/merlyn/UnixReview/col64.html>
 *
 *)

set simulatedInput to generatePathLines()
set perlProgram to "
use strict;
use warnings;
use File::Spec::Functions qw(splitpath);
use locale; # just in case a locale is set, use it

# $/ = chr(0); # uncomment this line to use NUL-terminated 'lines' instead of linefeed-terminated lines (e.g. for use with find ... -print0)

# 1) (<>) Take the input lines.
# 2) (map) Extract the filename.
# 3) (sort) Sort based on only the filename.
# 4) (map) Throw away the filename.
# 5) (print) Print the results.
print #(5)
  map { $_->[1] } # (4)
    sort { $a->[0] cmp $b->[0] } # (3)
      map {
        my ($vol, $dirs, $file) = splitpath($_);
        #$file = lc($file); # uncomment to imitate Finder's case-insensitive sorting 
        [$file, $_];
      } # (2)
        <>; # (1)
"
do shell script "printf %s " & quoted form of simulatedInput & " |  perl -e " & quoted form of perlProgram without altering line endings
set perlResult to result
-- `printf %s blahblah` is like `echo blahblah` but does not automatically end the output with a line break.
-- Instead of piping in the output of printf/echo, another command's output (e.g. `find`) could be piped in.

(* This one is more "shellish" and provides the same output for the test data.
 * Since many shell tools can not handle NUL-terminated "lines", this method can not easily be extended to handle paths with embedded newlines.
 *)
--   The first command of this shell script was originally "sed 's|^\(.*\)/\([^/]*\)$|\2/\1/\2|", but filenames containing the Apple logo character (Option-Shift-K) seemed to send my sed into an infinite loop. Odd.

set shellScript to "
(
  awk -F / '
    NF > 1 {print $NF \"/\" $0} # if we have a slash, print 'filename/full-line' where filename is the stuff after the last slash in the full-line
    NF <= 1 {print $0} # otherwise, just print 'full-line'
    ' |
  sort -s -t / -k 1,1 | # sort by the filename (as previously arranged, the part before the first slash); add -f to do Finder-like case-insensitive sorting
 cut -d / -f 2- # remove 'filename/' from start of each line
) \\
"
do shell script "printf %s " & quoted form of simulatedInput & " | " & shellScript without altering line endings
set shellishResult to result
{perlResult = shellishResult, perlResult, shellishResult} -- compare Perl and shellish results


-- Here are some handlers that generate the test data. They are unrelated to the sorting technique.
to generatePathLines()
	"
/path/to/some/file-G
/path/to/some/file-A
/path/to/some/file-N
/blah/file-A
/path/to/some/file-U
/path/to/some/file-B
/other/path/to/some/file-b
/other/path/to/some/file-a
/other/path/to/some/file-g
/other/path/to/some/file-u
/a/path/that/ends/with/slash/
/other/path/to/some/file-n
/another/path/with/file-Nu
/another/path/with/file-Alpha
/another/path/with/file-Gamma
/another/path/with/file-Beta
/another/path/with/file-Upsilon
./first/z.jpg
./last/a.jpg
"
	switchText from result to ASCII character 10 instead of ASCII character 13 -- make sure we are using linefeeds, I am not sure if we can depend on AS string literals always encoding line breaks as linefeed characters
end generatePathLines

(* switchText From: http://bbs.applescript.net/viewtopic.php?pid=41257#p41257
Credit: kai, Nigel Garvey*)
to switchText from t to r instead of s
	local d
	set d to text item delimiters
	try
		set text item delimiters to s
		set t to t's text items
		-- The text items will be of the same class (string/unicode text) as the original string.
		set text item delimiters to r
		-- Using the first text item (beginning) as the first part of the concatenation means we preserve the class of the original string in the edited string.
		tell t to set t to beginning & ({""} & rest)
		set text item delimiters to d
	on error m number n from o partial result r to t
		set text item delimiters to d
		error m number n from o partial result r to t
	end try
	t
end switchText

Edit History: 1) Inconsequential white-space change.

Nigel_Garvey · May 12, 2009, 10:09am

Hi, Chris. Were you up all night?!

Your ‘shellScript’ script throws a syntax error on both my machines (Jaguar & Tiger). The perl one is lightning fast (especially in real-life situations with large numbers of files) but has a different idea of sort order from mine. Non-alphabetical characters are sorted differently and all lower case letters come after all upper case ones (i.e. “a” > “Z”).

Mine follows the AppleScript conventions. Upper and lower cases are equivalent except when ‘considering case’ is used, in which case “a” < “A” < “b” < “B” (Unicode text) or “A” < “a” < “B” < “b” (pre-Leopard string).

chrys · May 14, 2009, 11:49am

I wrote it on a Tiger machine, so I am surprised that it did not work at least on your Tiger machine. I did use some extra whitespace, but it is basically just a three-command pipe. The sub-shell parentheses are my primary suspect for the syntax problem. They are a remnant of an earlier implementation that was a bit more complicated and much slower, and they are not technically necessary in this version. My second guess at the cause of the syntax error is the backslash before the last line break (which is only there to make it easier to continue the shell pipeline when concatenating the script’s string with other strings on the AppleScript side).

There are comments in each of the scripts describing the modifications for case-insensitive sorting ($file=lc($file) in Perl, sort -f ... in the shell version).
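
To make the difference concrete, here is a minimal sketch of the shell method with and without -f. The helper name sort_by_name and the four sample paths are made up for this illustration, and LC_ALL=C is an assumption that pins plain byte ordering so the result does not depend on the machine's locale.

```shell
#!/bin/sh
# Hypothetical helper: sort full paths by their last component.
# Any extra arguments (such as -f) are passed straight to sort.
sort_by_name() {
  awk -F / '{ print $NF "/" $0 }' |      # decorate: prepend 'filename/'
    LC_ALL=C sort -s "$@" -t / -k 1,1 |  # stable sort on the filename only
    cut -d / -f 2-                       # undecorate: drop 'filename/'
}

paths='/x/file-b
/y/file-A
/z/file-a
/w/file-B'

case_sensitive=$(printf '%s\n' "$paths" | sort_by_name)
case_folded=$(printf '%s\n' "$paths" | sort_by_name -f)
printf 'case-sensitive:\n%s\ncase-folded:\n%s\n' "$case_sensitive" "$case_folded"
```

Without -f all the upper case names come first (the “a” > “Z” behaviour Nigel saw); with -f the A/a names come before the B/b names, and -s keeps names that compare equal in their input order.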

I tried out variations of using locale-sensitive collation in perl (LC_COLLATE and “use locale” and “use POSIX” to get “strcoll”), but neither changed the sorting order in my testing. When I looked at /usr/share/locale/*/LC_COLLATE for the collation database for en_US.UTF-8 (the usual locale in Terminal shells on English configurations), I found that the collation database seems to only be based on “Latin” ASCII characters (en_US.UTF-8/LC_COLLATE is a symlink to la_LN.US-ASCII/LC_COLLATE). Maybe there are fancier LC_COLLATE collation databases on Leopard (though I am getting the impression that “things” are moving away from locales and towards Unicode-specified collation).

Anyway, it seems that most AppleScript host processes’ (on Tiger) environments are devoid of locale environment variables (and this is passed on to shells for do shell script). So, if locale-based sorting was to be done inside a do shell script, the AppleScript would have to find a way to convert the current user’s International preferences to locale environment variable settings and enable those in a preamble in do shell script’s command string. The locale approach mostly seems like a dead end.

After reading a description of the sorting that Mac OS X Finder does, I tried using Unicode::Collate in Perl. The problem is that Perl (at least on Tiger) does not ship with a complete DUCET (Default Unicode Collation Element Table). The one that is on my system is (self-described as) a greatly trimmed down version that is only supposed to be used for Perl’s self-tests. The current, full DUCET is 1.3MB (the shipped one on my system is 52kB). The full version is easy to download, but probably not otherwise pre-installed on most systems.

For kicks, here is the Unicode collation version of the Perl program. As it says, it works best if a full DUCET is installed in a suitable location. I downloaded mine to /tmp/Unicode/Collate (a better, more permanent location might be /Library/Perl/Unicode/Collate) and the results seem to match AppleScript/Finder sorting fairly well. One deviation from Finder (or AppleScript with considering numeric strings) is that runs of multiple digits are not sorted according to their numeric values (such treatment is an optional modification in the latest Unicode Collation Algorithm, but the Perl code implements an older version of the UCA).

set perlProgram to "
# Read lines containing pathnames from STDIN and sort them by filename
# (last path component) according to the Unicode Collation Algorithm.

# NOTE: For best results, this program requires a full
# Default Unicode Collation Element Table (DUCET).

# Search for 'OPTION:' to find a couple of options for this program.

use strict;
use warnings;

use open qw(:encoding(UTF-8) :std); # decode STDIN from (encode STDOUT/ERR to) UTF-8
use File::Spec::Functions qw(splitpath);
use Unicode::Collate;
#use lib '/tmp'; # For testing, I installed the full DUCET under this dir (see OPTION below).

my $UCA_level = 2; # OPTION: use level 3 for case-sensitivity
my $collator = undef;
eval {
  # OPTION: more complete Unicode collation
  # Download <http://www.unicode.org/Public/UCA/latest/allkeys.txt>
  # (the DUCET) to somedir/Unicode/Collate/allkeys.txt. If your
  # somedir is not in the default @INC path (see /usr/bin/perl -V),
  # you will need to add 
  #     use lib 'somedir';
  # to this script.

  # If this system has a DUCET installed as 'allkeys.txt', try that
  # one first.
  local $/ = chr(10); # Tsk, tsk. Unicode::Collate does not guard $/ itself.
  $collator = Unicode::Collate->new(
                                    'table' => 'allkeys.txt',
                                    'level' => $UCA_level,
                                    'variable' => 'non-ignorable',
                                   );
};
if(!defined $collator) {
  # We did not find 'allkeys.txt'. Try to use the stripped down DUCET
  # 'keys.txt' that should be included in Perl installation.
  local $/ = chr(10); # Tsk, tsk. Unicode::Collate does not guard $/ itself.
  $collator = Unicode::Collate->new(
                                    'table' => 'keys.txt',
                                    'level' => $UCA_level,
                                    'variable' => 'non-ignorable',
                                   );
}

# Use a 'Schwartzian Transform' to sort the input lines by the last path component.
# <http://en.wikipedia.org/wiki/Schwartzian_transform> 
# <http://www.stonehenge.com/merlyn/UnixReview/col64.html>

# 1) (<>) Take the input lines.
# 2) (map) Extract the filename.
# 3) (sort) Sort based on only the filename.
# 4) (map) Throw away the filename.
# 5) (print) Print the results.

# OPTION: Uncomment the next line or use -0 on the perl commandline to read 'lines' that are NUL terminated instead of the system's default line break.
#local $/ = chr(0);

print # (5)
  map { $_->[1] } # (4)
    sort { $a->[0] cmp $b->[0] } # (3)
      map {
        my ($vol, $dirs, $file) = splitpath($_);
        my $key = $collator->getSortKey($file); # use Unicode collation
        [$key, $_];
      } # (2)
        <>; # (1)
"

Edit History: Fixed "OPTION"al setting of $/ to chr(0) instead of undef (0 reads NUL-terminated lines, undef reads the whole file as one string).

mleonti · May 14, 2009, 9:47pm

Hi Guys,

First of all a big thanks to all of you.

I tried all your solutions extensively and, though they all are fast, unfortunately I get errors in some cases from each of them.
I cannot get past these two errors:
The command exited with a non-zero status.
Can’t make quoted form of {“/Users/ml/Documents/G5 Backup/Leopard HD partition/m Pictures//Addressbook:iChat/mario.tiff”, “/Users/ml/Documents/G5 Backup/Leopard HD partition/m Pictures//Brisbane 07 on Drago usb/Address Book - 5:07:07.abbu/ABSubscribedPerson.skIndexInverted”

I will try to give more specifics in the hope that we can resolve this.
I get my files from choose folder and I need the entire contents of all the folders and files in it.
I then need to sort them.
(NB: the user may choose a folder with no files, or with one or more files in it.
File names sometimes end in odd ways (like the “Icon” file for custom icons, whose name ends with a carriage return). I do not know of any others, but there may be some, and that may be what causes the errors.)

set theFiles to paragraphs of (do shell script "find " & quoted form of POSIX path of theFolder & " -type f ! -name '.*'")
is what I found the fastest to collect them and this one does not fail.

Here is my attempt to make chrys’s script work for me

-- my code
set theFolder to choose folder
set strtScnds to (current date)
with timeout of 10000 seconds -- I eliminated the gathering and switch handlers and get my files like this
	set theFiles to paragraphs of (do shell script "find " & quoted form of POSIX path of theFolder & " -type f ! -name '.*'")
	set flCnt to (count theFiles)
end timeout
set simulatedInput to theFiles
if flCnt < 2 then -- the chosen folder is empty or contains 1 file only
	display dialog fldrNm & " cannot continue" & return & return & "You have chosen folder:" & (theFolder as text) & " which contains:" & flCnt & " file/s" & return & return & fldrNm & " needs at least 2 files to be able to scan for duplicates" buttons {"OK"} with icon 0 default button 1
	return
end if
-- return flCnt & return & simulatedInput -- so far so good never any error very fast
-- end my code

set perlProgram to "
use strict;
use warnings;
use File::Spec::Functions qw(splitpath);
use locale; # just in case a locale is set, use it
# $/ = chr(0); # uncomment this line to use NUL-terminated 'lines' instead of linefeed-terminated lines (e.g. for use with find ... -print0)
# 1) (<>) Take the input lines.
# 2) (map) Extract the filename.
# 3) (sort) Sort based on only the filename.
# 4) (map) Throw away the filename.
# 5) (print) Print the results.
print #(5)
map { $_->[1] } # (4)
sort { $a->[0] cmp $b->[0] } # (3)
map {
my ($vol, $dirs, $file) = splitpath($_);
#$file = lc($file); # uncomment to imitate Finder's case-insensitive sorting 
[$file, $_];
} # (2)
<>; # (1)
"
do shell script "printf %s " & quoted form of simulatedInput & " | perl -e " & quoted form of perlProgram without altering line endings
set perlResult to result
-- from the preceding line I get this error: Can't make quoted form of {"/Users/ml/Documents/G5 Backup/Leopard HD partition/m Pictures//Addressbook:iChat/mario.tiff", "/Users/ml/Documents/G5 Backup/Leopard HD partition/m Pictures//Brisbane 07 on Drago usb/Address Book - 5:07:07.abbu/ABSubscribedPerson.skIndexInverted", 

-- `printf %s blahblah` is like `echo blahblah` but does not automatically end the output with a line break.
-- Instead of piping in the output of printf/echo, another command's output (e.g. `find`) could be piped in.

(* This one is more "shellish" and provides the same output for the test data.
* Since many shell tools can not handle NUL-terminated "lines", this method can not easily be extended to handle paths with embedded newlines.
*)
-- The first command of this shell script was originally "sed 's|^\(.*\)/\([^/]*\)$|\2/\1/\2|", but filenames containing the Apple logo character (Option-Shift-K) seemed to send my sed into an infinite loop. Odd.

set shellScript to "
(
awk -F / '
NF > 1 {print $NF \"/\" $0} # if we have a slash, print 'filename/full-line' where filename is the stuff after the last slash in the full-line
NF <= 1 {print $0} # otherwise, just print 'full-line'
' |
sort -s -t / -k 1,1 | # sort by the filename (as previously arranged, the part before the first slash); add -f to do Finder-like case-insensitive sorting
cut -d / -f 2- # remove 'filename/' from start of each line
) \\
"
do shell script "printf %s " & quoted form of simulatedInput & " | " & shellScript without altering line endings
set shellishResult to result
{perlResult = shellishResult, perlResult, shellishResult} -- compare Perl and shellish results

-- my code
set lpsdTm to cnvrtTm(strtScnds)
return lpsdTm

-- function handlers
on cnvrtTm(strtTm)
	set lpsdScs to (current date) - strtTm
	set hrs to lpsdScs div 3600
	set lpsdScs to lpsdScs - (hrs * 3600)
	set mts to lpsdScs div 60
	set lpsdScs to lpsdScs - (mts * 60)
	return ("Hrs:" & hrs & " Mts:" & mts & " Sec:" & lpsdScs)
end cnvrtTm

-- end function handlers
-- end my code

chrys · May 14, 2009, 11:55pm

This is because the contents of simulatedInput was supposed to be a string (or Unicode text), but you are setting it to a list with set simulatedInput to theFiles.

Furthermore, the Perl program (or the shell script version, at least on my machine; Nigel had problems with it) can be used in the initial do shell script. It is inefficient to have find generate the data, save it in a variable, put it in an echo/printf command line, sort it with shell tools, and read it back in again. Both of my sorting methods were designed to be used in a “pipeline” that accepts find’s output.

Both the Perl and shell versions should handle embedded returns (like the Icon^M files in some dirs), and you can even capture such filenames accurately if you use do shell script ... without altering line endings. But if you use paragraphs of, you will end up with a small bit of garbage. find ends its lines with LF (^J). If it prints out a path to a file that ends in CR (^M), you will have a series of characters like “/path/to/dir/Icon^M^J” in the output string. paragraphs of breaks strings at all the various line-ending types (CR (old Mac, some parts of Mac OS X), LF (Unix), and CRLF (DOS)), so it will give you an item of just “/path/to/dir/Icon”, since it sees the CRLF (^M^J) as a DOS-style line break. Things are more obviously broken if the CR is in the middle of the filename. In that case, paragraphs of will break the filename into two items (since it will treat the embedded CR as a Mac-style line break).

-- Demonstration of how paragraphs of breaks its input on all three line termination conventions.

set lf to ASCII character 10
set cr to ASCII character 13

"/path/to/dir/some file" & lf & ¬
	"/path/to/dir/Icon" & cr & lf & ¬
	"/path/to/dir/carriage" & cr & "return" & lf & ¬
	"/path/to/dir/another file" & lf

paragraphs of result --> {"/path/to/dir/some file", "/path/to/dir/Icon", "/path/to/dir/carriage", "return", "/path/to/dir/another file", ""}

If you cannot live with this, then I suggest you stop using paragraphs of and move to something like text item delimiters = {ASCII character 10}. That still leaves the problem of embedded LFs in filenames (which are legal, but probably not very common). To handle those you need find ... -print0, perl -0 ... (or whatever you are using for sorting), and text item delimiters = {ASCII character 0}. One of Nigel’s scripts earlier shows how this (the find -print0, ASCII character 0 business) can be done.

Another thing I noticed in the script is the with timeout block. I think this should not be needed, even for shell scripts that take more than the default timeout of 2 minutes to finish as long as the do shell script is not inside an application tell block. The timeout only applies to AppleEvent transactions to applications. Here is a little experimental code to demonstrate it:

set r to {}

try
	with timeout of 2 seconds
		do shell script "sleep 5"
	end timeout
	set end of r to "non-app timeout did not trigger (expected)"
on error m number n
	if n is -1712 then
		set end of r to "non-app timeout DID trigger (UNEXPECTED)"
	else
		error m number n -- rethrow other errors
	end if
end try

try
	tell application "System Events" to ¬
		with timeout of 2 seconds
			do shell script "sleep 5"
		end timeout
	set end of r to "app timeout did NOT trigger (UNEXPECTED)"
on error m number n
	if n is -1712 then
		set end of r to "app timeout did trigger (expected)"
	else
		error m number n -- rethrow other errors
	end if
end try

r --> {"non-app timeout did not trigger (expected)", "app timeout did trigger (expected)"}

Last, unless you are testing the Perl and “shellish” sorting systems against one another there is no reason to include them both in your program. But if you are trying to test them, then you probably do want to save the output of find and feed it into each one. My earlier recommendation against this (and recommendation to put Perl in a pipeline after find) was based on the assumption that you are not testing the sorting methods against one another.

Here is a reworked version of your code:

-- A Perl program to sort lines by the text after the last slash (i.e. sort by filename).
-- This could be saved into a file (with a "#!/usr/bin/perl" line at the top), or it can be fed to perl with the -e command-line option.
set perlProgram to "
use strict;
use warnings;
use File::Spec::Functions qw(splitpath);
use locale; # just in case a locale is set, use it
# $/ = chr(0); # uncomment this line to use NUL-terminated 'lines' instead of linefeed-terminated lines (e.g. for use with find ... -print0)
# 1) (<>) Take the input lines.
# 2) (map) Extract the filename.
# 3) (sort) Sort based on only the filename.
# 4) (map) Throw away the filename.
# 5) (print) Print the results.
print #(5)
  map { $_->[1] } # (4)
    sort { $a->[0] cmp $b->[0] } # (3)
      map {
        my ($vol, $dirs, $file) = splitpath($_);
        $file = lc($file); # imitate Finder's case-insensitive sorting; not perfect, but maybe passable
        [$file, $_];
      } # (2)
    <>; # (1)
"

set theFolder to choose folder
set strtScnds to (current date)
set theFiles to paragraphs of (do shell script "find " & quoted form of POSIX path of theFolder & " -type f ! -name '.*' | perl -e " & quoted form of perlProgram without altering line endings)
-- Trim last, empty line that "without altering line endings" produces.
if last item of theFiles is "" then ¬
	if (count theFiles) > 1 then
		set theFiles to items 1 through -2 of theFiles
	else
		set theFiles to {}
	end if
set flCnt to (count theFiles)

if flCnt < 2 then -- the chosen folder is empty or contains 1 file only
	set fldrNm to "script name goes here?" -- placeholder for missing variable
	set theFolder to POSIX file (POSIX path of theFolder)
	display dialog fldrNm & " cannot continue" & return & return & "You have chosen folder:" & (theFolder as text) & " which contains:" & flCnt & " file/s" & return & return & fldrNm & " needs at least 2 files to be able to scan for duplicates" buttons {"OK"} with icon 0 default button 1
	return
end if

set lpsdTm to cnvrtTm(strtScnds)
return {lpsdTm, theFiles}

-- function handlers
on cnvrtTm(strtTm)
	set lpsdScs to (current date) - strtTm
	set hrs to lpsdScs div 3600
	set lpsdScs to lpsdScs - (hrs * 3600)
	set mts to lpsdScs div 60
	set lpsdScs to lpsdScs - (mts * 60)
	return ("Hrs:" & hrs & " Mts:" & mts & " Sec:" & lpsdScs)
end cnvrtTm

Model: iBook G4 933
AppleScript: 1.10.7
Browser: Safari Version 4 Public Beta (4528.16)
Operating System: Mac OS X (10.4)

mleonti · May 15, 2009, 12:47am

Thanks for taking the time to explain chrys, much appreciated.

Your last script works really well and it is really fast. I set it on my documents and it did 39228 files in 5 seconds.

I tried it on the whole hard disk and it stopped because of permissions problems:
find: /.fseventsd: Permission denied
find: /.Spotlight-V100: Permission denied

I changed the shell script line to:
set theFiles to paragraphs of (do shell script "find " & quoted form of POSIX path of theFolder & " -type f ! -name '.*' | perl -e " & quoted form of perlProgram password "mypass" with administrator privileges without altering line endings)
and I ran it again.

It stopped with the following Applescript error:
find: /dev/fd/7: Not a directory
find: /net/localhost: Operation timed out
find: /net/broadcasthost: Operation timed out
perl(897) malloc: *** mmap(size=16777216) failed (error code=12)
*** error: can’t allocate region
*** set a breakpoint in malloc_error_break to debug
Out of memory!

chrys · May 15, 2009, 2:40am

If you really need to process files that you can only see as an admin, then OK, but otherwise I would avoid using with administrator privileges here. If find is the last process in the pipeline of a do shell script command, it is often useful to add something like " ; true" to the end of the do shell script command, because find will return an “error” exit code if it encounters even a single unreadable directory. The error exit code makes do shell script raise an AppleScript error. Adding “true” to the end makes sure that do shell script sees a success exit code. But since find is not at the end of the pipeline, that is not an issue here.
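
A small sketch of the exit-code behaviour, for illustration only. The missing path is an assumption that stands in for an unreadable directory, since either one makes find exit with a non-zero status.

```shell
#!/bin/sh
# Hypothetical path that should not exist; it plays the role of an unreadable directory.
missing=/nonexistent-dir-for-demo

# Plain find: the non-zero exit status is what do shell script would turn into an error.
plain_status=0
find "$missing" -type f 2>/dev/null || plain_status=$?

# Same command followed by '; true': the final status is that of true, i.e. success.
guarded_status=0
{ find "$missing" -type f 2>/dev/null; true; } || guarded_status=$?

printf 'plain: %s, guarded: %s\n' "$plain_status" "$guarded_status"
```

The same reasoning covers a pipeline: by default the shell reports the status of the last command in it, so a failing find in front of perl goes unnoticed.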

The error messages indicate that something was attempting to allocate over 16MB there. It looks like the error message is coming from Perl, but I have not been able to reproduce that here.

Pointing it at my home folder, I get “Out of memory” from Script Editor after find and perl are finished (paragraphs is highlighted in Script Editor). AppleScript seemed to have run out of memory while splitting the output of the shell command into paragraphs. It was processing 14024156 bytes (~14MB) of UTF-8 encoded path data among 147900 paths (numbers found by piping the find output into wc). While I do have a few filenames with Unicode characters, most are just normal ASCII. I have reason to believe that AppleScript’s internal Unicode text representation is UTF-16BE. For mostly ASCII data, UTF-16BE will take up about twice as much space as UTF-8. So the resulting Unicode text from do shell script takes up about 28MB. Then paragraphs of does its work breaking it into a list of nearly 150k Unicode text objects that cumulatively take up the same amount of space again (at least until the original result string can be garbage collected).
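
For anyone who wants to size up their own folder first, the counts can be gathered roughly like this. It is a sketch only: the test folder is made up, and the doubling is the same back-of-the-envelope estimate as above, which assumes mostly ASCII names.

```shell
#!/bin/sh
# Throwaway test tree (hypothetical names); point find at a real folder instead.
tmp=$(mktemp -d)
mkdir -p "$tmp/sub"
: > "$tmp/a.txt"
: > "$tmp/sub/b.txt"

path_count=$(find "$tmp" -type f ! -name '.*' | wc -l)   # lines = paths (if no name contains a linefeed)
utf8_bytes=$(find "$tmp" -type f ! -name '.*' | wc -c)   # size of the text handed back to AppleScript
# Rough estimate only: mostly ASCII names take two bytes per character in UTF-16.
utf16_bytes=$((utf8_bytes * 2))

printf 'paths: %s, UTF-8 bytes: %s, estimated UTF-16 bytes: %s\n' \
  "$path_count" "$utf8_bytes" "$utf16_bytes"
rm -rf "$tmp"
```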

So the total would be up to probably 56MB of data for my test (but it ran out of memory part way through). None of this is terribly large compared to main memory sizes, but I have never really tried holding large-ish amounts of data in AppleScript. It appears there are some limits. Maybe they could be raised, but I am not sure how. As a data point, I watched the perl process grow to 300+MB of memory usage (in Activity Monitor) before it printed its output to AppleScript. So it does not appear to be some kind of ulimit-like resource limit.

My machine just “finished” running the script on my startup disk. find finished, and perl chugged along, causing heavy swapping along the way (its total memory usage got up to nearly 1.5GB, but only 300MB or so of it fit into main memory). I got a plain “Out of memory” error with do shell script highlighted this time. But I am not sure if it was ultimately perl or Script Editor that ran out of memory.

I think you will need another strategy if you want to process a whole startup volume’s worth of files. What are you ultimately trying to accomplish?

mleonti · May 15, 2009, 2:52am

Hi Nigel,

Thanks for your help.
I tried the script you proposed with the customQsort in an external file.

script byLastTextItem -- Handlers for customising the sort.
	on isGreater(a, b)
		(text item -1 of a > text item -1 of b)
	end isGreater
	
	on isLess(a, b)
		(text item -1 of a < text item -1 of b)
	end isLess
	
	on swap(a, b)
	end swap
	
	on shift(a, b)
	end shift
end script
set lib to (load script file ((path to scripts folder as Unicode text) & "Libraries:Sorts:CustomQsort.scpt")) -- Your path to the CustomQsort script here.

set theFolder to choose folder
set theFiles to (do shell script "/usr/bin/find " & quoted form of POSIX path of theFolder & " -type f ! -name \".*\" -print0" without altering line endings)

set astid to AppleScript's text item delimiters
if ((system attribute "ascv") div 256 mod 16 < 2) then
	set AppleScript's text item delimiters to (ASCII character 0) as Unicode text
else
	set AppleScript's text item delimiters to character id 0
end if
set theFiles to theFiles's text items 1 thru -2
set AppleScript's text item delimiters to "/" as Unicode text
-- considering numeric strings
tell lib to CustomQsort(theFiles, 1, -1, byLastTextItem)
-- end considering
set AppleScript's text item delimiters to astid

theFiles

I followed the link and saved the script to my Scripts folder, in ml/Library/Scripts/Libraries/Sorts, but when I ran it I got this AppleScript error: Stack overflow.
I then incorporated the qSort function directly in the script, called from a try statement, so that the list would get sorted all the same. To do that I tried to count theFiles, and I got an incorrect count: the character count is returned instead of the file count. I let this go, as I thought it would not hinder the code below.
I ran the script and I got the following error:
Can’t get item 344 of {“/Users/ml/Desktop/Quarantined viruses//1”, “/Users/ml/Desktop/Quarantined viruses//3”, “/Users/ml/Desktop/Quarantined viruses//Find all files in folde2r.scpt”, "/
The offending line was: set v to item ((leftEnd + rightEnd) div 2) of a's l -- pivot in the middle

Here is my script:

script byLastTextItem -- Handlers for customising the sort.
	on isGreater(a, b)
		(text item -1 of a > text item -1 of b)
	end isGreater
	
	on isLess(a, b)
		(text item -1 of a < text item -1 of b)
	end isLess
	
	on swap(a, b)
	end swap
	
	on shift(a, b)
	end shift
end script
set lib to (load script file ((path to scripts folder as Unicode text) & "Libraries:Sorts:CustomQsort.scpt")) -- Your path to the CustomQsort script here.

set theFolder to choose folder
set theFiles to (do shell script "/usr/bin/find " & quoted form of POSIX path of theFolder & " -type f ! -name \".*\" -print0" without altering line endings)
set flCnt to (count theFiles) -- returns the wrong count, counts every character in theFiles not the items
-- return flCnt
set astid to AppleScript's text item delimiters
if ((system attribute "ascv") div 256 mod 16 < 2) then
	set AppleScript's text item delimiters to (ASCII character 0) as Unicode text
else
	set AppleScript's text item delimiters to character id 0
end if
set theFiles to theFiles's text items 1 thru -2
set AppleScript's text item delimiters to "/" as Unicode text
-- considering numeric strings
try
	tell lib to CustomQsort(theFiles, 1, -1, byLastTextItem)
on error
	Qsort(theFiles, 1, flCnt)
end try

-- end considering
set AppleScript's text item delimiters to astid

theFiles

to Qsort(array, leftEnd, rightEnd) -- Hoare's QuickSort Algorithm	
	script a
		property l : array
	end script
	set {i, j} to {leftEnd, rightEnd}
	set v to item ((leftEnd + rightEnd) div 2) of a's l -- pivot in the middle
	repeat while (j > i)
		repeat while (item i of a's l < v)
			set i to i + 1
		end repeat
		repeat while (item j of a's l > v)
			set j to j - 1
		end repeat
		if (not i > j) then
			tell a's l to set {item i, item j} to {item j, item i} -- swap
			set {i, j} to {i + 1, j - 1}
		end if
	end repeat
	if (leftEnd < j) then Qsort(a's l, leftEnd, j)
	if (rightEnd > i) then Qsort(a's l, i, rightEnd)
end Qsort

mleonti · May 15, 2009, 3:42am

Hi Chrys,

The whole idea of this started with me wanting to clean up my HD.
I consolidated three Macs into two: one for DAW (music workstation), which I keep clean with music software only, and the other one for everything else.
In my diligence about not throwing anything out, I am afraid I kept a lot of duplicate files.
I want to find them by name (I think all the dups have the same names) and sort them so that they appear one under the other in my report with their paths, so that I can easily delete the ones I do not need any more.
So far I thought of the job going:
Find
Sort
Get a return-delimited list with only the dups (all of them, not exclusive)
After you pointed out the memory problem, I thought:
Find
Get a return-delimited list with only the dups (all of them, not exclusive)
Sort
The list to be sorted would then be much smaller.
Is there a shell script for finding duplicates that you know of that could be incorporated in the script?

Nigel_Garvey · May 15, 2009, 9:20pm

Hi, mleonti. Thanks for trying my sort.

You’ve put your ‘count theFiles’ line immediately after the shell script, which returns a single block of text, so the character count is the correct result at that point. It’s the business with the (ASCII character 0) delimiter after that which splits the text into a list of the individual paths.
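
The same distinction can be seen from the shell side. The sketch below uses a made-up scratch folder and compares the size of the raw -print0 output with the number of NUL terminators in it. Note that wc -c counts bytes, which matches the character count only for plain ASCII paths.

```shell
#!/bin/sh
# Hypothetical scratch folder with three visible files and one hidden one.
demo=$(mktemp -d)
touch "$demo/one.txt" "$demo/two.txt" "$demo/three.txt" "$demo/.hidden"

# Size of the raw -print0 output: one unsplit block, measured in bytes.
char_count=$(find "$demo" -type f ! -name '.*' -print0 | wc -c | tr -d ' ')

# Number of NUL terminators in that block: one per path, so the file count.
file_count=$(find "$demo" -type f ! -name '.*' -print0 | tr -cd '\000' | wc -c | tr -d ' ')

echo "characters: $char_count, files: $file_count"
rm -rf "$demo"
```

In the AppleScript, the equivalent is to count theFiles only after the text items 1 thru -2 line has turned the text into a list.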

I don’t know why you’re getting a “stack overflow” error with CustomQsort. It’s proved itself over the years and works OK for me with that particular script. Maybe there’s something flaky in AppleScript 2.0 that’s reacting badly to it.

The qSort you’ve incorporated into the script is an ordinary Quicksort that sorts on the entire paths, not just on the file names.
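
For comparison, sorting on the file name rather than the whole path can also be done in the shell before the data reaches AppleScript. This is only a sketch: the function name sort_by_name is made up, it assumes no path contains a tab or newline, and it compares names byte by byte, which will not match Finder's ordering in every case.

```shell
#!/bin/sh
# Sort find-style paths (one per line on stdin) by their last component.
# Assumption: no path contains a tab or a newline character.
tab=$(printf '\t')

sort_by_name() {
    # Prefix each path with its file name, sort on that helper column
    # (ties broken by the full path), then drop the helper column again.
    awk -F/ '{ print $NF "\t" $0 }' |
        LC_ALL=C sort -t "$tab" -k1,1 -k2 |
        cut -f2-
}
```

It would be used as find "$folder" -type f ! -name '.*' | sort_by_name, where $folder stands for whatever folder was chosen.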