extracting sentence

Here is a sample paragraph:

I would like to extract the sentence between the word START and END.

The word START appears in the above paragraph 3 times, so I need to extract the sentence that begins right after the third START and ends right before the first END.

Hi nyl,

Could there be four "start"s, or do you want exactly the third start?

Model: MacBook Pro
AppleScript: 2.2.3
Browser: Safari 536.26.17
Operating System: Mac OS X (10.8)

If there are exactly 3 "start"s at the start and 3 "end"s at the end, then this works:


set the_text to "start What is AppleScript? start AppleScript is a language used to automate the actions of the Macintosh Operating System and many of its applications. start Whether a task is as simple as copying a file or as complex as building a real estate catalog, AppleScript can perform the requisite actions for you with \"intelligence,\" controlling applications and making decisions based on its observations or from information provided by its interaction with the person running the script. end Every day, businesses and individuals alike use AppleScript to create newspapers and books, manage networks, build DVDs, process images, generate web pages, backup files and folders, make videos, and much more. end AppleScript is the most powerful, easy-to-use, automation tool available on any platform. And best of all, this technology is free and is built into every copy of the Mac OS!end"
set tids to AppleScript's text item delimiters
set AppleScript's text item delimiters to {"start", "end"}
set target_text to (middle item of (text items of the_text)) as string
set AppleScript's text item delimiters to tids
return target_text

Note: you should add error checking when setting text item delimiters.
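As a rough sketch of what that error checking could look like: wrap the delimiter change in a try block so the old delimiters are restored even if something fails. The handler name splitText is made up for illustration; it is not part of the script above.

```applescript
-- Hypothetical handler: splits str on the given delimiters and
-- restores the previous text item delimiters on success or on error.
on splitText(str, delims)
	set tids to AppleScript's text item delimiters
	try
		set AppleScript's text item delimiters to delims
		set theItems to text items of str
		set AppleScript's text item delimiters to tids
	on error errMsg number errNum
		-- put the delimiters back before passing the error along
		set AppleScript's text item delimiters to tids
		error errMsg number errNum
	end try
	return theItems
end splitText

set parts to splitText("start one start two end three end", {"start", "end"})
return middle item of parts
```

With this short sample string the list has five items, so the middle item is the text between the second "start" and the first "end".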

Edited: BTW, here’s a novel way I was thinking about earlier:


set the_text to "start What is AppleScript? start AppleScript is a language used to automate the actions of the Macintosh Operating System and many of its applications. start Whether a task is as simple as copying a file or as complex as building a real estate catalog, AppleScript can perform the requisite actions for you with \"intelligence,\" controlling applications and making decisions based on its observations or from information provided by its interaction with the person running the script. end Every day, businesses and individuals alike use AppleScript to create newspapers and books, manage networks, build DVDs, process images, generate web pages, backup files and folders, make videos, and much more. end AppleScript is the most powerful, easy-to-use, automation tool available on any platform. And best of all, this technology is free and is built into every copy of the Mac OS!end"
set end_offset to offset of "end" in the_text
set the_text to text 1 thru (end_offset - 1) of the_text
set reverse_text to (reverse of (characters of the_text)) as string
set trats_offset to offset of "trats" in reverse_text
set reverse_text to text 1 thru (trats_offset - 1) of reverse_text
set the_text to (reverse of (characters of reverse_text)) as string

:)

gl,

that is fantastic!!! thank you!!!

How about extracting after the 4th START or the 6th START?

Hi nyl,

You should use something else besides “start” and “end” because these are common words. You could use any markers that are not common in ordinary text. But it depends on what kind of text you’re working with. If the number of "start"s and "end"s is the same, then using ‘middle’ will work. Otherwise, there are many ways.

gl,

I can have the start word wrapped in brackets, but how can I tell the script to extract after the 5th one?

This script also works with multiple occurrences (it’s part of an HTML parser I use personally). Maybe it’s useful…

set the_text to "start What is AppleScript? start AppleScript is a language used to automate the actions of the Macintosh Operating System and many of its applications. start Whether a task is as simple as copying a file or as complex as building a real estate catalog, AppleScript can perform the requisite actions for you with \"intelligence,\" controlling applications and making decisions based on its observations or from information provided by its interaction with the person running the script. end Every day, businesses and individuals alike use AppleScript to create newspapers and books, manage networks, build DVDs, process images, generate web pages, backup files and folders, make videos, and start much more. end AppleScript is the most powerful, easy-to-use, automation tool available on any platform. And best of all, this technology is free and is built into every copy of the Mac OS!end"

getTextBetweenStrings(the_text, "start", "end", true)

on getTextBetweenStrings(str, startMark, endMark, keepMarks)
	set subStringsFound to {}
	set subStrings to explode(str, endMark)
	if (count subStrings) < 2 then return subStringsFound
	repeat with subString in subStrings
		set x to explode(subString, startMark)
		if (count x) > 1 then
			set end of subStringsFound to last item of x
		end if
	end repeat
	if keepMarks then
		repeat with x in subStringsFound
			set contents of x to startMark & x & endMark
		end repeat
	end if
	return subStringsFound
end getTextBetweenStrings

on explode(str, separator)
	set {AppleScript's text item delimiters, oDelimiter} to {separator, AppleScript's text item delimiters}
	set theList to every text item of str
	set AppleScript's text item delimiters to oDelimiter
	return theList
end explode

edit: changed some variable names and improved the last loop.

What you have to do is make your script dynamic. So you’re trying to find the last “start”, right? Think specifically about what you’re trying to do; then you’ll waste less time.

You could use a repeat loop to find exactly the fourth occurrence of “start” in the text. But when would you want the fourth or sixth or third, etc.?
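A minimal sketch of the repeat-loop idea, assuming you want the text after the nth “start” and before the first “end” that follows it. The handler name textAfterNthStart and its parameters are made up for illustration.

```applescript
-- Hypothetical handler: returns the text after the nth startMark
-- and before the first endMark that follows it.
on textAfterNthStart(str, startMark, endMark, n)
	set remaining to str
	repeat n times
		set pos to offset of startMark in remaining
		if pos is 0 then error ("Fewer than " & n & " occurrences of " & startMark)
		-- skip past this occurrence of startMark
		set nextIndex to pos + (length of startMark)
		if nextIndex > (length of remaining) then
			set remaining to ""
		else
			set remaining to text nextIndex thru -1 of remaining
		end if
	end repeat
	set endPos to offset of endMark in remaining
	if endPos is 0 then error ("No " & endMark & " after occurrence " & n)
	if endPos is 1 then return ""
	return text 1 thru (endPos - 1) of remaining
end textAfterNthStart

return textAfterNthStart("start a start b start c end d end e end", "start", "end", 3)
```

Note that this stops at the first “end” after the chosen “start”, so it does not treat the markers as nested pairs the way the ‘middle’ approach does.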

Ok.

Since it is the text between the start and end tags that is to be extracted, and other start/end tokens may be included, I’d set the text item delimiters to “end” and gather all the text items I get.

Then I’d treat every text item as a text, and try to get every text item of it using “start” as the text item delimiter.

If the count of text items is less than two, then I know that it should either be appended to the previous item, as the end was false, or the whole text item should be tossed away, since it has no start. But I noted that false starts were allowed above, so I think false ends should be as well.

If the count of text items is greater than 1 and the first text item is empty, then we know we can assemble a complete sentence out of it.

If the first text item isn’t empty, then the starting point for completing a complete start/end sequence is the first text item after one that is empty. What is between the first text item and the first that is empty is then junk, as it isn’t contained by a start/end sequence.

Edit
There is a special case for the first text item: if no start is found in it, then it is junk. This case must be taken into account when making the second pass with “start” as the text item delimiter.

I guess I first need to count how many times the word START is present, and when I reach the START word for the 4th time, execute the extracting script from kel1. I am not able to do it.

Hi nyl,

If you want exactly the fourth or exactly the fifth occurrence of a string in the text, then you can easily do it. It sounds like recursion. What you do is first send the original text to a subroutine. That subroutine calls itself and sends the text from after the first occurrence of the string. That’s another method, but how do you know which occurrence of the string “start” to look for? Perhaps the user could tell the program that he wants the fourth or fifth occurrence or whatever.
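Roughly, the recursive idea could look like the sketch below. The handler name textAfterNth is made up for illustration; each call strips everything up to and including one occurrence of the marker and then calls itself with a smaller count.

```applescript
-- Hypothetical recursive handler: returns everything after the
-- nth occurrence of marker in str.
on textAfterNth(str, marker, n)
	if n is 0 then return str
	set pos to offset of marker in str
	if pos is 0 then error ("Not enough occurrences of " & marker)
	set nextIndex to pos + (length of marker)
	if nextIndex > (length of str) then
		set tailText to ""
	else
		set tailText to text nextIndex thru -1 of str
	end if
	-- recurse on the text that follows this occurrence
	return textAfterNth(tailText, marker, n - 1)
end textAfterNth

return textAfterNth("start a start b end", "start", 2)
```

The result still contains the “end” markers, so you would then cut it at whichever “end” you want.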

gl,

Hi nyl,

I had a very long sleep :). Here’s a modified version that asks for user input:


set the_text to "start What is AppleScript? start AppleScript is a language used to automate the actions of the Macintosh Operating System and many of its applications. start Whether a task is as simple as copying a file or as complex as building a real estate catalog, AppleScript can perform the requisite actions for you with \"intelligence,\" controlling applications and making decisions based on its observations or from information provided by its interaction with the person running the script. end Every day, businesses and individuals alike use AppleScript to create newspapers and books, manage networks, build DVDs, process images, generate web pages, backup files and folders, make videos, and much more. end AppleScript is the most powerful, easy-to-use, automation tool available on any platform. And best of all, this technology is free and is built into every copy of the Mac OS!end"

display dialog "Enter start number:" default answer "1"
set the_index to (text returned of result) as integer

set tids to AppleScript's text item delimiters
set AppleScript's text item delimiters to {"start"}
try
	set target_text to (text items (the_index + 1) thru -1 of the_text) as string
	set AppleScript's text item delimiters to {"end"}
	set target_text to (text items 1 thru -(the_index + 1) of target_text) as string
	set AppleScript's text item delimiters to tids
on error
	set AppleScript's text item delimiters to tids
	error "Bad number."
end try
return target_text

You need to add error checking. For instance, if the number of "start"s in the text is greater than the number of "end"s, then what? etc.
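One possible check, as a sketch only: count both markers first and stop if the counts differ. The handler name countOccurrences and the short sample string are made up for illustration.

```applescript
-- Hypothetical helper: counts non-overlapping occurrences of marker in str.
on countOccurrences(str, marker)
	if str is "" then return 0
	set tids to AppleScript's text item delimiters
	set AppleScript's text item delimiters to {marker}
	set n to (count (text items of str)) - 1
	set AppleScript's text item delimiters to tids
	return n
end countOccurrences

set sample to "start a start b end c end"
set startCount to countOccurrences(sample, "start")
set endCount to countOccurrences(sample, "end")
if startCount is not endCount then error "Unbalanced markers"
return {startCount, endCount}
```

What to do when the counts differ depends on your text; raising an error is just the simplest choice.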

gl,

spectacular THANK YOU kel1!!!