Building Acronym Table from Word Doc

I’m looking to write an AppleScript that will parse a Word doc for acronyms (usually upper-case words of 3 or 4 characters) and write them to a comma-delimited text file (.csv). I can open a Word doc, but all I’m finding in the forums is how to select text and act on that selection. It seems like it should be simple, but I am new to this and don’t have time to learn all about scripting. Any help would be most appreciated.

Model: Macbook Pro OSX 10.5.6
AppleScript: 2.0.1
Browser: Firefox
Operating System: Mac OS X (10.5)

I emailed Ken off list and here is what he provided

Example word doc contents

Desired output

(I really wish we had an attachment upload btw)

So, all that said, here is what I came up with. This assumes Word 2008. It may work under 2004, but I know there were a LOT of changes made. The other assumption is that any run of two or more sequential capital letters is an acronym.

set theFile to choose file with prompt "Please select the file to parse"
tell application "Microsoft Word"
	open theFile
	tell front document to set theData to (get content of text object)
end tell

set acronymList to paragraphs of (do shell script "/bin/echo " & quoted form of theData & " | /usr/bin/egrep -o '[A-Z]{2,}'")

set {MasterList, tid, text item delimiters} to {{}, text item delimiters, ", "}

repeat with i from 1 to count acronymList
	if MasterList does not contain item i of acronymList then set end of MasterList to item i of acronymList
end repeat

set masterString to MasterList as string
set text item delimiters to tid

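-- Only the file name lookups need the Finder
tell application "Finder"
	set {fullName, nameExt} to {name of theFile, name extension of theFile}
end tell

-- Build "<original name>-acronymList.txt" on the desktop (assumes the file has an extension)
set deskPath to path to desktop as Unicode text
set shortName to (text 1 thru -((count nameExt) + 2) of fullName) & "-acronymList.txt"

-- Write outside the Finder tell block; reset eof so a rerun does not leave stale data
set fileRef to (open for access file (deskPath & shortName) with write permission)
set eof of fileRef to 0
write masterString to fileRef
close access fileRef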

Works beautifully in Word 08, does not work in Word 04 (I have both).

I’ll have to test on my other workstation tomorrow to make it '04 friendly. In the meantime, any suggestions for the first version, Adam? I always take a back seat to the insight you and Stefan have to add :smiley:

Nicely done in my view. In the doc I tested with, it pulls out two or more contiguous caps, and I think the OP wanted three or more. You could even make the minimum length selectable.
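
For what it's worth, here is a rough sketch of what a selectable minimum length could look like on the shell side of the script. The function name `extract_acronyms` and its argument are made up for illustration; the only real change from the posted pattern is that the `2` in `{2,}` becomes a variable. It uses `grep -E`, which should behave the same as the `egrep` call in the script.

```shell
#!/bin/sh
# extract_acronyms MIN_LEN
# Reads text on stdin and prints every run of at least MIN_LEN consecutive
# capital letters, one per line. MIN_LEN defaults to 3 if omitted.
# (Hypothetical helper, not part of the script posted above.)
extract_acronyms() {
	min_len=${1:-3}
	# LC_ALL=C keeps [A-Z] limited to plain ASCII capitals in any locale.
	LC_ALL=C grep -oE "[A-Z]{${min_len},}"
}

# Example: with a minimum of 3, "EU" is skipped.
printf '%s\n' 'The FAA and NASA met with the EU.' | extract_acronyms 3
```

In the AppleScript itself, the same idea would presumably mean building the pattern string from a variable before handing it to `do shell script`.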

A small improvement using the `textutil` command: it avoids launching MS Word and works regardless of Word's version number.

set theFile to choose file with prompt "Please select the file to parse" of type ""
set acronymList to paragraphs of (do shell script "/usr/bin/textutil -stdout -cat txt " & quoted form of (POSIX path of theFile) & " | /usr/bin/egrep -o '[A-Z]{2,}'  #| sort | uniq")
-- etc...
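
As a sketch of where the commented-out `sort | uniq` part could lead: the pipeline below removes duplicates in the shell and joins the result into a single comma-separated line, so the AppleScript repeat loop would no longer be needed. The function name `acronyms_to_csv` is hypothetical, and the sample text stands in for the `textutil` output. Note two differences from the loop in the first script: the list comes out sorted alphabetically instead of in order of first appearance, and the separator is a bare comma with no trailing space.

```shell
#!/bin/sh
# acronyms_to_csv
# Reads text on stdin, extracts runs of two or more capital letters,
# removes duplicates, and prints them as one comma-separated line.
# (Hypothetical helper for illustration.)
acronyms_to_csv() {
	LC_ALL=C grep -oE '[A-Z]{2,}' |
		LC_ALL=C sort -u |   # sort and drop duplicates (same effect as sort | uniq)
		paste -sd, -         # join all lines into one, separated by commas
}

# Example input standing in for the document text.
printf '%s\n' 'NASA and the FAA; NASA again, plus the EU.' | acronyms_to_csv
```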