Script to count files by extension

FWIW, on my test folder (NAS drive, no packages) it comes up with lower numbers than Nigel’s script and my ASObjC version for several extensions. It also (like mine) includes folders.

For a shell script solution, I’d go for Marc’s rather than mine. It doesn’t descend into packages and it does actually include them! On the other hand, it doesn’t descend into folders with dots in their names either, presumably believing them to be packages.
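A rough way to see the dotted-folder behaviour for yourself, as a sketch only: the folder tree and the helper name below are made up for illustration, and I've used the plain -name and -path tests rather than -E -regex so that it should run with either BSD or GNU find.

```shell
# Hypothetical test tree: one plain folder and one folder with a dot in its name.
demo=$(mktemp -d)
mkdir -p "$demo/root/plain" "$demo/root/v1.2 backups"
touch "$demo/root/a.txt" "$demo/root/plain/b.txt" "$demo/root/v1.2 backups/c.txt"

# Anything whose name contains a dot is pruned (so never entered) and listed.
# "plain" has no dot, so find descends into it; "v1.2 backups" is handled
# as if it were a package: listed itself, contents skipped.
list_dotted_items() {
	(cd "$1" && find root -name '*.*' -prune ! -path '*/.*' | sort)
}

list_dotted_items "$demo"
```

c.txt should be missing from the listing, while the dotted folder itself shows up as an item.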

“-iname” could perhaps be “-name”, since there’s no need for case insensitivity when looking for a dot. And "! -regex '.*/[.].*'" might be simplified to "! -path '*/.*'" or possibly "! -name '.*'", either of which would also make the -E option unnecessary. But I doubt any of these make much difference in practice. The “s” command in the “sed” string doesn’t need the “g” because the regex used there initially includes everything in the line and then winds back looking for a dot, so the match is always to the last dot in the line anyway.
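The point about the “g” can be checked with a couple of sample paths. This is only a sketch; the function name and the paths are invented for the purpose.

```shell
# Strip everything up to and including the LAST dot of each input line.
# No "g" flag: the greedy ".*" swallows the whole line and backtracks to the
# final dot, so there is only ever one match per line.
last_extension() {
	sed 's/.*[.]//'
}

printf '%s\n' 'folder/archive.tar.gz' 'notes.v2/readme.TXT' | last_extension
```

Both lines come out as the text after their last dot, the same as with 's/.*[.]//g'.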

The spaces are in fact an indent, in which the counts are right-aligned. But it has a fixed width.
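If the fixed width is a nuisance, the indent could be recalculated after the event. Here's a minimal sketch with awk, assuming input in the “count, space, extension” layout that uniq -c produces; the helper name is hypothetical and this isn't part of any of the scripts posted above.

```shell
# Re-indent "uniq -c" style lines ("   12 txt") so that the counts are
# right-aligned in a column only as wide as the largest count.
realign_counts() {
	awk '{
		count[NR] = $1
		# Keep everything after the count, in case the extension contains spaces.
		rest = $0
		sub(/^ *[0-9]+ /, "", rest)
		label[NR] = rest
		if (length($1) > width) width = length($1)
	}
	END {
		# Build the format string by hand rather than relying on "%*s".
		fmt = "%" width "s %s\n"
		for (i = 1; i <= NR; i++) printf fmt, count[i], label[i]
	}'
}

printf '%s\n' '   3 jpg' ' 120 txt' | realign_counts
```

All the lines are held in memory until the widest count is known, which shouldn't matter for a list of extensions.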

I don’t think any of the shell scripts above are totally bullet-proof when we don’t know what’s in the folder(s). Mine overlooks packages, Marc’s overlooks items in folders which have dots in their names. Shane’s ASObjC script is the best option in these respects, but includes extensions from folders which have dots in their names. (alear does specifically mention files.) While a CSV file is one of the options mentioned, we don’t know what separator’s considered the default on alear’s system.

Here’s a version of Shane’s script which produces a text file similar in appearance to those of the shell scripts. Folder extensions aren’t counted but file and package extensions are. Package contents are ignored, but folder contents aren’t. The results are sorted by extension and are preceded on their lines by their counts, which are dynamically indented to the minimum extent required. The total is displayed at the bottom. It would also be possible to include a header indicating the folder to which the results pertain, but I haven’t bothered here. Hopefully the script’s compatible with Mavericks ….

use AppleScript version "2.3.1" -- macOS 10.9 (Mavericks) or later
use framework "Foundation"
use scripting additions

-- For testing:
--set theSourceFolder to (choose folder with prompt "Select an HDD or folder:")

on reportOnFolder(theSourceFolder)
	set theDestinationFile to (choose file name with prompt "Choose file name" default name "zCARPA.txt")
	set destinationURL to current application's class "NSURL"'s fileURLWithPath:(POSIX path of theDestinationFile)
	-- get all files
	set theSourceFolder to current application's |NSURL|'s fileURLWithPath:(POSIX path of theSourceFolder)
	set fileManager to current application's NSFileManager's defaultManager()
	set URLKeys to current application's class "NSArray"'s arrayWithArray:({current application's NSURLIsRegularFileKey, current application's NSURLIsPackageKey})
	set theOptions to (current application's NSDirectoryEnumerationSkipsPackageDescendants) + (get current application's NSDirectoryEnumerationSkipsHiddenFiles)
	set theFiles to (fileManager's enumeratorAtURL:theSourceFolder includingPropertiesForKeys:(URLKeys) options:theOptions errorHandler:(missing value))'s allObjects()
	-- remove items with no extensions
	set theFilter to current application's NSPredicate's predicateWithFormat:"pathExtension != ''"
	set theFiles to theFiles's filteredArrayUsingPredicate:theFilter
	-- Build a counted set containing the extensions of those items which aren't folders.
	set theSet to current application's NSCountedSet's new()
	repeat with thisItem in theFiles
		if (((thisItem's resourceValuesForKeys:(URLKeys) |error|:(missing value)) as record as list) contains true) then (theSet's addObject:(thisItem's pathExtension()))
	end repeat
	-- build array of records so we can sort
	set theResults to current application's NSMutableArray's array()
	set theSum to 0
	tell (space & space) to tell (it & it) to set eightSpaces to (it & it) -- MacScripter /displays/ the literal string as a single space.
	repeat with aValue in theSet's allObjects()
		set theCount to (theSet's countForObject:(aValue)) as integer
		set theSum to theSum + theCount
		-- The spaces at the beginning of theEntry are padding for an indent, whose size will be adjusted later.
		(theResults's addObject:{theValue:aValue, theEntry:(eightSpaces & theCount) & (space & aValue)})
	end repeat
	-- sort on the dictionaries 'theValue' values.
	set sortDesc to current application's NSSortDescriptor's sortDescriptorWithKey:"theValue" ascending:true
	theResults's sortUsingDescriptors:{sortDesc}
	-- create the text with an entry for the total count at the end.
	set theSum to theSum as text
	theResults's addObject:({theEntry:linefeed & eightSpaces & theSum & " TOTAL"})
	set theText to (theResults's valueForKey:"theEntry")'s componentsJoinedByString:(linefeed)
	-- Adjust the width of the indent to the number of characters in the total.
	set theText to theText's stringByReplacingOccurrencesOfString:("(?m)^ +(?=[ \\d]{" & (count theSum) & "} )") withString:("") options:(current application's NSRegularExpressionSearch) range:({0, theText's |length|()})
	-- Write the text to the specified text file as UTF-8.
	theText's writeToURL:(destinationURL) atomically:(true) encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
end reportOnFolder

Edit: Bug pointed out below by Shane fixed. String of eight spaces explicitly set in a variable to avoid confusion when viewed on MacScripter.
Edit 2: Eight-space string set more efficiently for fun and to take up less room on the page.

Thanks Nigel. I guess I should have figured that out.


I think the line:

       (theResults's addObject:{theValue:aValue, theEntry:(" " & theSum) & " " & aValue})

should be:

	(theResults's addObject:{theValue:aValue, theEntry:(" " & theCount) & " " & aValue})

I’m also not seeing the alignment happening.

The process that takes most of the time with Nigel’s ASObjC version is the repeat loop that checks every file to see if it’s a folder or package. If we make an assumption that no folder will be named with a valid file extension — not a bullet-proof assumption, but probably a reasonable one in most cases — we can speed things up a bit by changing this code:

	set theSet to current application's NSCountedSet's new()
	repeat with thisItem in theFiles
		set theExt to thisItem's pathExtension()
		if (((thisItem's resourceValuesForKeys:(URLKeys) |error|:(missing value)) as record as list) contains true) then (theSet's addObject:(thisItem's pathExtension()))
	end repeat

To this:

	set theSet to current application's NSCountedSet's new()
	set sortDesc to current application's NSSortDescriptor's sortDescriptorWithKey:"pathExtension" ascending:true
	set theFiles to theFiles's sortedArrayUsingDescriptors:{sortDesc}
	repeat with thisItem in theFiles
		set theExt to thisItem's pathExtension()
		if (theSet's containsObject:theExt) or (((thisItem's resourceValuesForKeys:(URLKeys) |error|:(missing value)) as record as list) contains true) then (theSet's addObject:theExt)
	end repeat

In my test, that knocks nearly 40% off the overall time.

Thanks, Shane. You’re right. It should be theCount in the repeat (but theSum is right later on). Now fixed. :rolleyes:

There should be a string of eight spaces in front of both theCount and theSum in the lines you’ve quoted. They look like single spaces when viewed in MacScripter, but clicking the “scriplet” link produces the correct number in Script Editor. Your quotes both have only single spaces in those positions. To avoid any confusion, I’ve also edited the script to set the eight-space string using the AppleScript space constant.

Thanks too for this suggestion — although your “from” code’s not quite what’s in my script! :wink:

I don’t see the point of sorting the URL array first. Does it make checking against the set faster? Otherwise you’re sorting all the URLs instead of a potentially smaller number of dictionaries later.

I do see what’s happening in your repeat. Checking if the extension’s already in the set is (I imagine) faster than checking whether or not the URL represents a folder. If the extension is in the set, a file or package with that extension has already passed the test and the assumption is that the current URL is for another file or package of the same type and there’s no need to check if it’s a folder. It’s not an assumption I’d want to make myself, though. Also, of course, if the extension’s not already in the set, both tests have to be done, so any speed advantage depends on how frequently each extension occurs in relation to the number of URLs.

Neither can I :(. It got left there after I tried something else.

Actually, in the case of packages that’s more than an assumption: if you name a folder with a package’s extension, it’s regarded, and treated, as a package. So it’s really an assumption about files. (And it failed on my test case, because I’d deliberately named a folder that way for a test of something else I did ages ago.)

It’s interesting to see just how much time it saves, but I’d never rely on it. It’s code waiting to break.

True. But the extension test is much quicker, and I’m not sure one would be bothered counting at all unless one expected there to be multiple files with the same extension.

I wrote a script that adds a header and total file count to the output of Marc’s command line. I did not change the line alignment in the text file.

I ran this on a backup folder on an external SSD that contained 28,631 files including many apps. For comparison purposes, I also ran Nigel’s AppleScriptObjC solution from earlier in this thread. The extension and total-file counts were identical.

set theSourceFolder to (choose folder with prompt "Select an HDD or folder:")
set theSourceFolder to POSIX path of theSourceFolder

set textFile to (choose file name with prompt "Choose file name" default name "zCARPA")
set textFile to (textFile as text) & ".txt"

--Marc Anthony's command line that gets file-extension counts.
set extensionData to (do shell script "find -E " & quoted form of theSourceFolder & " -iname '*.*' -prune ! -regex '.*/[.].*' | sed 's/.*[.]//g' | sort | uniq -c")

set AppleScript's text item delimiters to linefeed
set extensionData to paragraphs of extensionData
set AppleScript's text item delimiters to ""

--Remove spaces from front of each line of file-extension data for file-count purposes.
set extensionCountData to {}
repeat with aLine in extensionData
	repeat until aLine does not start with " "
		set aLine to text 2 thru -1 of aLine
	end repeat
	set the end of extensionCountData to aLine
end repeat

--Get total file count.
set fileCount to 0
set AppleScript's text item delimiters to " "
repeat with anItem in extensionCountData
	try
		set fileCount to fileCount + ((text item 1 of anItem) as integer)
	end try
end repeat
set the end of extensionData to return & (fileCount as text) & " Total"
set AppleScript's text item delimiters to ""

--Add a header to text file.
set fileData to {"Recursive File-Extension Summary of " & theSourceFolder & return & return}

--Add extension and total file counts to text file.
set AppleScript's text item delimiters to linefeed
set fileData to (fileData & extensionData) as text
set AppleScript's text item delimiters to ""

--Save text file.
try
	set openedFile to open for access file textFile with write permission
	set eof of openedFile to 0
	write fileData to openedFile
	close access openedFile
on error
	-- Make sure the file isn't left open if the write fails.
	try
		close access openedFile
	end try
end try
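As an aside, the total could probably be added within the command line itself rather than in AppleScript. A sketch only, assuming uniq -c style input; the helper name is hypothetical and the “Total” label simply mirrors the script above.

```shell
# Pass "uniq -c" style lines through unchanged, summing the leading counts,
# then add a blank line and a total line at the end.
append_total() {
	awk '{ sum += $1; print } END { printf "\n%d Total\n", sum }'
}

printf '%s\n' '   3 jpg' ' 120 txt' | append_total
```

In the do shell script string it would go at the end of the pipeline, after uniq -c.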

At the risk of beating this to death, the results differ here. There’s just no obvious way of getting around the fact that find can’t tell the difference between a directory and a package.

Shane. I understand the issue with the find utility and packages, and I assume that the backup folder I used as a test contained no packages. I don’t believe my post was misleading, but if it was, I apologize.

FWIW, I’ve included below the results that were returned with my script (with Marc’s command line). Nigel’s script did in fact return identical results.

There’s certainly no need to apologize!

It’s not misleading, and it’s a valid approach to the problem – as long as one is aware of its limitations and knows they don’t apply to the particular application. My concern is that people often aren’t aware of the limitations, because they’re not obvious. Sometimes even I put reliability before speed :).

Even excluding package contents is an assumption on our part. :wink:

Shane. I agree with what you say, and so I decided to run some tests to determine which scripts return inaccurate results with packages. Thus far, my test folder contains one each of the following:

app - Application
rtfd - RTF with Attachments
scptd - Script Bundle
wdgt - Widget
download - Safari Download

My script with Marc’s command line and Nigel’s AppleScriptObjC script return:

Nigel’s command-line script returns:

I was surprised by the above, as I expected my script with Marc’s command line to return the contents of the packages.

Hi, peavine. Your result from #35 isn’t surprising. Due to the wildcarded prune statement, my code does not evaluate folders whose name contains a period; this necessarily includes packages.

Shane, regarding your discrepant results from post #21, I’m thinking that a culprit may possibly be a difference in resolving aliases/symbolic links. However, in a large test that I just ran, the ASObjC method returned items that were unexpected: it’s finding files inside “.lproj” folders. Are these not packages?


No, .lproj folders are just directories. They normally live inside packages (unless you have Xcode projects).