Need to get just a portion of the text document

I have Word documents from which I get text to build Quark documents. There are as few as 1 and as many as 6 different variations of the same text in the Word document for each client. These variations are separated by “Offer 1”, “Offer 2” and so on. Each Word file is one long document containing all the offers that apply to that client, and there are almost one hundred clients. One file might have Offers 1, 3 and 5, the next might have 2, 4 and 6, and the next may have all 6 offers. The only common denominator is “Offer 1” through “Offer 6”, which are on a line by themselves, preceding the text for that offer.
So, my task is to isolate, for instance, Offer 3, and get just the bits and pieces out of Offer 3, using TextEdit or the Finder. I assume I would use TIDs to select the text, but so far I have been unsuccessful. Do you have any suggestions?

Have you “inhaled” the text from Word to a variable (how you do this depends on which version of Word you’re using)? What happens that makes you unsuccessful?

Here’s the part of my script that deals with getting the copy from the word doc.


on run
	set aFolder to choose folder with prompt "Choose Folder to Search..."
	set lstFolderContents to list folder aFolder without invisibles
	repeat with i from 1 to (count of lstFolderContents)
		set strFile to ((aFolder & item i of lstFolderContents) as string) as alias
		tell application "TextEdit"
			open strFile
			tell strFile
				set theText to text of document 1 of application "TextEdit"
			end tell
			tell application "TextEdit"
				close document 1
				make new document
			end tell
			tell application "TextEdit"
				set text of document 1 of application "TextEdit" to theText
				set x to every word of theText as text
			end tell
			tell application "TextEdit"
				tell document 1
					set Offer1 to (paragraph 1 whose first word is "Offer")
					set Offer2 to (paragraph 2 whose first word is "Offer")
					set Offer3 to (paragraph 3 whose first word is "Offer")
					set Offer4 to (paragraph 4 whose first word is "Offer")
					set itemCount to count of text items in x
					set currentItem to (text item i of x) as string
					copy currentItem to Offer2
					get result
				end tell
				
			end tell
		end tell
		
	end repeat
end run

When I run the script, the event log tells me the script does set Offer 1 to every word on that line, and so on through Offer4. The lines after “set Offer4 to (paragraph 4 whose first word is “Offer”)” are intended to return the contents of Offer 1, but they return nothing.

At a glance (without setting up some dummy files to test), I’m not sure what you mean by this:

set Offer1 to (paragraph 1 whose first word is "Offer") -- etc.

Do you mean first paragraph whose first word is…, with offer 1 being the paragraph number? Are you trying to get the contents of every paragraph whose first word is “Offer”?
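For what it's worth, here is a minimal sketch of collecting every paragraph that starts with “Offer” from plain text, without going through TextEdit at all. The sample text and the names LF, sampleText and offerLines are made up for illustration, and it assumes the offer headers really do begin a line.

```applescript
-- Made-up sample standing in for the text of one Word file
set LF to ASCII character 10
set sampleText to "Client text" & LF & "Offer 1" & LF & "Some text" & LF & LF & "Offer 2" & LF & "More text"

set offerLines to {}
repeat with p in paragraphs of sampleText
	-- "contents of" dereferences the loop variable to plain text
	if (contents of p) begins with "Offer " then set end of offerLines to (contents of p)
end repeat
offerLines
```

This only finds the header lines themselves; getting the text that follows each header is a separate step.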

Actually, line 17 is incorrect. It should read

				set x to every text of document 1 as text

Here is what my end log returns


tell current application
	choose folder with prompt "Choose Folder to Search..."
		alias "Macintosh HD:Users:me:Desktop:Wordfiles:"
	list folder alias "Macintosh HD:Users:me:Desktop:Wordfiles:" without invisibles
		{"1234_ABC.doc"}
end tell
tell application "TextEdit"
	open alias "Macintosh HD:Users:me:Desktop:Wordfiles:1234_ABC.doc"
	get every text of document 1
		"Client text
Offer 1
Some text goes here (sensitive client text removed)

Offer 2
Some text goes here (sensitive client text removed)

Offer 3
Some text goes here (sensitive client text removed)

Offer 4
Some text goes here (sensitive client text removed)"

	close document 1
	make new document
		document 1
	set every text of document 1 to "Client text
Offer 1
Some text goes here (sensitive client text removed)

Offer 2
Some text goes here (sensitive client text removed)

Offer 3
Some text goes here (sensitive client text removed)

Offer 4
Some text goes here (sensitive client text removed)"

	get every text of document 1
		"Client text
Offer 1
Some text goes here (sensitive client text removed)

Offer 2
Some text goes here (sensitive client text removed)

Offer 3
Some text goes here (sensitive client text removed)

Offer 4
Some text goes here (sensitive client text removed)"

	get paragraph 1 of document 1 whose word 1 = "Offer"
		"Offer 1
"
	get paragraph 2 of document 1 whose word 1 = "Offer"
		"Offer 2 
"
	get paragraph 3 of document 1 whose word 1 = "Offer"
		"Offer 3
"
	get paragraph 4 of document 1 whose word 1 = "Offer"
		"Offer 4
"
end tell

I am trying to section the word document by the Offers:
Offer 1
Offer 2
Offer 3
Offer 4

Offer 1 is the first line preceding the text for Offer 1. So what I want to do is get all the text between the lines “Offer 1” and “Offer 2”

I use this (apologies for not citing author) to find text between a pair of delimiters entered as start_delim and end_delim. You get the text of the document as the_text, and then get the found_text. Try it on one of your docs.

--set the_text to (read (choose file))
set the_text to "1234567890asdfghjkl
.qwerty#
hELLO<_wORLD/
#FindMe!<blah/*>
aeiou#Find this too.*[-]
``~+#-*/=fin*"
set {start_delim, end_delim} to {"#", "*"}

set found_text to my find_delimited_text(the_text, start_delim, end_delim)
-->{"FindMe!", "Find this too."}

on find_delimited_text(the_text, start_delim, end_delim)
	set {escaped_start_delim, escaped_end_delim} to {my escaped_delim(start_delim), my escaped_delim(end_delim)}
	set ASCII_10 to (ASCII character 10)
	tell (a reference to my text item delimiters)
		set {old_tid, contents} to {contents, {ASCII_10}}
		set {the_text, contents} to {(the_text's paragraphs) as Unicode text, {end_delim & ASCII_10}}
	end tell
	set found_text to (do shell script "echo " & quoted form of the_text & " | grep -o '" & escaped_start_delim & ".\\+" & escaped_end_delim & "' | colrm 1 1" without altering line endings)'s text items 1 thru -2
	tell (a reference to my text item delimiters) to set contents to old_tid
	return found_text
end find_delimited_text

on escaped_delim(the_delim)
	if the_delim is in "*.?()[]^\\" then return "\\" & the_delim
	return the_delim
end escaped_delim

I probably did this wrong, but I plugged in my file, my delims

set the_text to (read (choose file))

set {start_delim, end_delim} to {"Offer 1", "Offer 2"}
set found_text to my find_delimited_text(the_text, start_delim, end_delim)
-->{"FindMe!", "Find this too."}

on find_delimited_text(the_text, start_delim, end_delim)
	set {escaped_start_delim, escaped_end_delim} to {my escaped_delim(start_delim), my escaped_delim(end_delim)}
	set ASCII_10 to (ASCII character 10)
	tell (a reference to my text item delimiters)
		set {old_tid, contents} to {contents, {ASCII_10}}
		set {the_text, contents} to {(the_text's paragraphs) as Unicode text, {end_delim & ASCII_10}}
	end tell
	set found_text to (do shell script "echo " & quoted form of the_text & " | grep -o '" & escaped_start_delim & ".\\+" & escaped_end_delim & "' | colrm 1 1" without altering line endings)'s text items 1 thru -2
	tell (a reference to my text item delimiters) to set contents to old_tid
	return found_text
end find_delimited_text

on escaped_delim(the_delim)
	if the_delim is in "*.?()[]^\\" then return "\\" & the_delim
	return the_delim
end escaped_delim

and I get an applescript error -
“sh: -c: line 1: unexpected EOF while looking for matching `''
sh: -c: line 2: syntax error: unexpected end of file”

I worked with it some and got it to return without error, but it’s all gibberish. I don’t know anything about grep, so I don’t understand that part of the script well enough to mess with it.

set the_text to (read (choose file))

set {start_delim, end_delim} to {"Offer 1", "Offer 2"}
set find_delimited_text to {the_text, start_delim, end_delim}
set found_text to my find_delimited_text

on find_delimited_text(the_text, start_delim, end_delim)
	set {escaped_start_delim, escaped_end_delim} to {my escaped_delim(start_delim), my escaped_delim(end_delim)}
	set ASCII_10 to (ASCII character 10)
	tell (a reference to my text item delimiters)
		set {old_tid, contents} to {contents, {ASCII_10}}
		set {the_text, contents} to {(the_text's paragraphs) as Unicode text, {end_delim & ASCII_10}}
	end tell
	set found_text to (do shell script "echo " & quoted form of the_text & " | grep -o '" & escaped_start_delim & ".\\+" & escaped_end_delim & "' | colrm 1 1" without altering line endings)'s text items 1 thru -2
	tell (a reference to my text item delimiters) to set contents to old_tid
	return found_text
end find_delimited_text

Maybe I need to ask it differently.

If I have:

Offer 1
Mary had a little lamb
It’s fleece was white as snow

Offer 2
And every where that Mary went
The lamb was sure to go

Offer 3
It followed her to school one day
which was against the rules

Offer 4
It made the children laugh and play
to see a lamb in school

How do I get:

Mary had a little lamb
It’s fleece was white as snow

Understand that I have nearly one hundred files and each one has different text, only the Offer lines are consistent.

Much better. This works for me in the script editor, but your text may have a line feed instead of a return:

set myText to "Offer 1
Mary had a little lamb
It's fleece was white as snow

Offer 2
And every where that Mary went
The lamb was sure to go

Offer 3
It followed her to school one day
which was against the rules

Offer 4
It made the children laugh and play
to see a lamb in school"

set R to return
set L to ASCII character 10 -- line feed
set extracts to {}
repeat with k from 1 to 4
	set end of extracts to extractBetween(myText, "Offer " & k & L, L & "Offer " & (k + 1))
end repeat

to extractBetween(SearchText, startText, endText)
	set tid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to startText
	set endItems to text of text item -1 of SearchText
	set AppleScript's text item delimiters to endText
	set beginningToEnd to text of text item 1 of endItems
	set AppleScript's text item delimiters to tid
	return beginningToEnd
end extractBetween
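Since the line-ending question comes up above, here is a rough sketch of one way to normalize the text before searching it, so the delimiters can always be built with a line feed. The handler name normalizeLineEndings is hypothetical, not something from the scripts in this thread. It relies on AppleScript's paragraphs splitting on returns, line feeds, and return/line feed pairs alike.

```applescript
-- Hypothetical helper: rewrite every line ending as a line feed
on normalizeLineEndings(someText)
	set LF to ASCII character 10
	-- "paragraphs" splits on CR, LF and CRLF
	set theLines to paragraphs of someText
	set tid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to LF
	set cleanText to theLines as text -- rejoin using LF only
	set AppleScript's text item delimiters to tid
	return cleanText
end normalizeLineEndings

set CR to ASCII character 13
set LF to ASCII character 10
set mixedText to "Offer 1" & CR & "Mary had a little lamb" & CR & LF & "Offer 2"
set cleanText to normalizeLineEndings(mixedText)
```

After this, delimiters such as "Offer 1" followed by a line feed should match no matter how the file was saved.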

Here’s a slight variation:

to list_offers from t
	set r to ASCII character 10
	set d to text item delimiters
	set text item delimiters to r & r & "Offer "
	set l to t's text items
	set text item delimiters to d
	repeat with i in l
		set i's contents to i's text from paragraph 2 to -1
	end repeat
	l
end list_offers

set offer_text to "Offer 1
Mary had a little lamb
It's fleece was white as snow

Offer 2
And every where that Mary went
The lamb was sure to go

Offer 3
It followed her to school one day
which was against the rules

Offer 4
It made the children laugh and play
to see a lamb in school"

list_offers from offer_text

Wonderfully minimal as always, Kai, and list_offers will have them in the order in which they appear in the text. We hope that every document has every offer number, however, without skipping any (no Offer 2, for example). That is why I worried about the numbers. My version has the disadvantage that after the last Offer, you’ll get everything left whereas yours grabs only two paragraphs. We’ll see what the OP says, I hope.

Yeah - that’s a possible point, Adam. If some kind of placeholder was required for any missing offer, then something like this might do the trick:

to list_offers from t
	set l to {"", "", "", "", "", ""} (* or {"N/A", "N/A", "N/A", "N/A", "N/A", "N/A"} or whatever... *)
	set r to ASCII character 10
	set t to r & r & t
	set d to text item delimiters
	repeat with n from 1 to count l
		set text item delimiters to r & r & "Offer " & n & r
		if (count t's text items) is 2 then
			set i to t's text item -1
			set text item delimiters to r & r & "Offer "
			set l's item n to i's text item 1
		end if
	end repeat
	set text item delimiters to d
	l
end list_offers

set offer_text to "Offer 1
Mary had a little lamb
It's fleece was white as snow

Offer 3
And every where that Mary went
The lamb was sure to go

Offer 4
It followed her to school one day
which was against the rules

Offer 6
It made the children laugh and play
to see a lamb in school"

list_offers from offer_text

You guys rock, these work fantastic.
But:

I am using a run handler to allow me to choose the folder where the Word docs are stored, and then run every file in the folder with a “repeat with”, using “end repeat” at the end of the script. My “on run” handler doesn’t seem to want to play nice with your script. I keep getting an error message saying: Expected “end” or “end tell” but found “to”, pointing to:

to list_offers from t
or
to extractBetween(SearchText, startText, endText)

Here is the “on run” handler, which I adapted from another script:


on run
	set aFolder to choose folder with prompt "Choose Folder to Search..."
	set strSearchValue to "ABCD" --the client file code beginning every file name
	set lstFolderContents to list folder aFolder without invisibles
	-- Repeat through the folder's content
	repeat with i from 1 to (count of lstFolderContents)
		set strFile to ((aFolder & item i of lstFolderContents) as string) as alias
		tell application "TextEdit"
			open strFile
			tell strFile
				set theText to text of document 1 of application "TextEdit"
			end tell
		end tell
		tell application "TextEdit"
			close document 1
			make new document
		end tell
		tell application "TextEdit"
			set text of document 1 of application "TextEdit" to theText
			-- the rest of the script dealing with getting the text into QuarkXPress file goes here"
		end tell
	end repeat
end run
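A likely cause of that compile error, assuming the handler was pasted inside the “on run” block: AppleScript does not allow one handler to be defined inside another, so a “to …” or “on …” line that appears before “end run” is rejected. A minimal sketch of the layout that compiles, using a made-up handler called firstParagraph:

```applescript
on run
	set sampleText to "Offer 1" & (ASCII character 10) & "Mary had a little lamb"
	-- "my" matters when the call sits inside a tell block
	set firstLine to my firstParagraph(sampleText)
	return firstLine
end run

-- Handlers live at the top level of the script, outside "on run"
on firstParagraph(someText)
	return paragraph 1 of someText
end firstParagraph
```

The same layout should apply to list_offers or extractBetween: keep them after “end run” and call them from inside the repeat loop.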

Hi, Skip;

Not clear to me from your on run handler how you expect to call extractBetween or list_offers from t. With a little pruning, your handler looks like this (you don’t need all the intermediate tells):

set aFolder to choose folder with prompt "Choose Folder to Search..."
set strSearchValue to "ABCD" --the client file code beginning every file name
set lstFolderContents to list folder aFolder without invisibles
-- Repeat through the folder's content
repeat with i from 1 to (count of lstFolderContents)
	set strFile to ((aFolder & item i of lstFolderContents) as string) as alias
	tell application "TextEdit"
		open strFile
		-- ask TextEdit directly; the alias itself has no documents
		set theText to text of document 1
		close document 1 -- [1]
		make new document
		set text of document 1 to theText
		-- the rest of the script dealing with getting the text into QuarkXPress file goes here
	end tell
end repeat

So at the point I’ve labeled [1] in your script, you’ve got the text of the document in strFile. What is the purpose of creating a new document and putting the text there? If you just want to extract the label stuff, do it at that point. I don’t have any way to test this, but try it yourself.

set whichOne to choose from list {"Offer 1", "Offer 2", "Offer 3", "Offer 4", "Offer 5", "Offer 6"} with prompt "Please choose the offer to isolate" without multiple selections allowed and empty selection allowed
set nextNum to 1 + (item -1 of characters of (whichOne as string))

set aFolder to choose folder with prompt "Choose Folder to Search..."
set strSearchValue to "ABCD" --the client file code beginning every file name
set lstFolderContents to list folder aFolder without invisibles
-- Repeat through the folder's content
repeat with i from 1 to (count of lstFolderContents)
	set strFile to ((aFolder & item i of lstFolderContents) as string) as alias
	tell application "TextEdit"
		open strFile
		-- ask TextEdit directly; the alias itself has no documents
		set theText to text of document 1
		close document 1
	end tell
	set theOffer to my extractBetween(theText, whichOne, "Offer " & nextNum)
	-- the rest of the script dealing with getting the text into QuarkXPress file goes here
end repeat

to extractBetween(SearchText, startText, endText)
	set tid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to startText
	set endItems to text of text item -1 of SearchText
	set AppleScript's text item delimiters to endText
	set beginningToEnd to text of text item 1 of endItems
	set AppleScript's text item delimiters to tid
	return beginningToEnd
end extractBetween
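One caveat with ending at "Offer " & nextNum: if the next number is missing from a file (Offer 3 followed directly by Offer 5, say), the end delimiter is never found and everything to the end of the document comes back. A possible variation, sketched here under the assumption that each header sits alone on its line and that line endings are line feeds, stops at whichever “Offer” header comes next. The handler name extractOffer and its variable names are hypothetical.

```applescript
-- Hypothetical variation: returns the text of one offer,
-- or "" when that offer's header is not in the text
on extractOffer(searchText, offerNumber)
	set LF to ASCII character 10 -- assumes line feed line endings
	set paddedText to LF & searchText & LF -- so a header on the first or last line still matches
	set tid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to LF & "Offer " & offerNumber & LF
	set pieces to text items of paddedText
	set offerBody to ""
	if (count pieces) > 1 then
		set afterHeader to item 2 of pieces
		if afterHeader is not "" then
			-- stop at the next header, whatever its number
			set AppleScript's text item delimiters to LF & "Offer "
			set offerBody to text item 1 of afterHeader
		end if
	end if
	set AppleScript's text item delimiters to tid
	return offerBody
end extractOffer

set LF to ASCII character 10
set sampleText to "Offer 1" & LF & "Mary" & LF & LF & "Offer 3" & LF & "Lamb" & LF & LF & "Offer 4" & LF & "School"
set offerThree to extractOffer(sampleText, 3)
set offerTwo to extractOffer(sampleText, 2)
```

The returned text keeps a trailing line break, which may need trimming before it goes into the Quark file.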

Terrific. I had to do a little tweaking, and now it works fantastic. This is so awesome. You guys totally rule.

The reason I set the text of the text file to a variable, close the text file, open a new file and set it to the variable is to change the text to plain text for QuarkXPress to read. I’ve seen others do it with the clipboard, but this seemed more efficient to me. Without the close/open, nothing happens, it ends there without error.

I actually have six AppleScripts, one wired up for each offer, because each Quark file for each offer differs from the others and can’t share the same script. With some tweaking, I hardwired your script to the specific offer of each script, and now I can throw seventy Word files of that offer in my target folder, start the script, tell the script which folder to select, and it slams out Quark files in about two seconds each. All I have to do is go in and put in the graphics.

When I first ran these scripts last week, I could not tell the script which offer to use, so if I was using offer 2, I had to tell the script to get paragraph 2 whose first word is “Pay”. This worked okay, until one of the text files had an offer 1 and 2 and 4 but no 3. The script would be looking for paragraph 4 whose first word is “Pay” and not finding it. I was having to put in way too many on error scripts to cover the possibilities.

Now, it goes straight to the offer and ignores all other text.
As soon as I clean one of the scripts of all client information, I’ll post it.

My next goal is to run all offers at one time, letting the script run through each file and run every offer it has text for, resulting in over three hundred files simultaneously.

If you want them all, modify the script to use Kai’s “list_offers” handler - it’s set up to grab them all and just puts “” if an offer is missing. You can scan that for the absence of text and eliminate it easily if that’s what you want.
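A small sketch of that scan, using a made-up list in place of what list_offers would return (six slots, with "" for the missing offers). The names offerList, foundOffers, offerNumber and offerText are only for illustration.

```applescript
-- Sample data standing in for the result of list_offers
set offerList to {"Mary had a little lamb", "", "And every where that Mary went", "", "", "to see a lamb in school"}

set foundOffers to {}
repeat with n from 1 to count offerList
	set offerBody to item n of offerList
	-- skip the placeholders left for missing offers
	if offerBody is not "" then
		set end of foundOffers to {offerNumber:n, offerText:offerBody}
	end if
end repeat
foundOffers
```

Each record keeps the offer number alongside its text, so the script can decide which Quark template to use for it.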

I am still having a problem
This script works but it doesn’t work. I hardwired it (and set it for Offer 3, for example). When I run the script and set the menu at the bottom for “Result”, I see that “Offer 3” and only “Offer 3” is the result. But, if I set the bottom menu to “Event Log”, I see the entire script, (including all offers) in the screen. When I plug in the rest of my file to get the text and put it into the Quark file, the script accesses “Offer 1” which is the beginning of the entire word document as in the Event Log, instead of isolating “Offer 3” as in the “Results” log. I would assume that what is in the “Result” window is what I have to work with when all is said and done, but that doesn’t seem to be the case.


to extractBetween(SearchText, startText, endText)
	set tid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to startText
	set endItems to text of text item -1 of SearchText
	set AppleScript's text item delimiters to endText
	set beginningToEnd to text of text item 1 of endItems
	set AppleScript's text item delimiters to tid
	return beginningToEnd
end extractBetween

set whichOne to ("Offer 3")
set nextNum to 1 + (item -1 of characters of (whichOne as string))

set aFolder to choose folder with prompt "Choose Folder to Search..."
set strSearchValue to "ABCD" --the client file code beginning every file name
set lstFolderContents to list folder aFolder without invisibles
-- Repeat through the folder's content
repeat with i from 1 to (count of lstFolderContents)
	set strFile to ((aFolder & item i of lstFolderContents) as string) as alias
	tell application "TextEdit"
		set theText to text of document 1 of application "TextEdit"
		set theOffer to my extractBetween(theText, whichOne, "Offer " & nextNum)
		-- the rest of the script dealing with getting the text into QuarkXPress file goes here"
	end tell
end repeat

This much of it works for me. Does it for you?

set theText to "Offer 1
Mary had a little lamb
It's fleece was white as snow

Offer 3
And every where that Mary went
The lamb was sure to go

Offer 4
It followed her to school one day
which was against the rules

Offer 6
It made the children laugh and play
to see a lamb in school"

to extractBetween(SearchText, startText, endText)
	set tid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to startText
	set endItems to text of text item -1 of SearchText
	set AppleScript's text item delimiters to endText
	set beginningToEnd to text of text item 1 of endItems
	set AppleScript's text item delimiters to tid
	return beginningToEnd
end extractBetween

set whichOne to ("Offer 3")
set nextNum to 1 + (item -1 of characters of (whichOne as string))
set theOffer to my extractBetween(theText, whichOne, "Offer " & nextNum)

One possibility is that since TextEdit also understands text item delimiters, you might try this line in front of whichOne:

set AppleScript's text item delimiters to {""}