Extracting paragraph containing text search string

custardo · August 27, 2008, 5:36pm

Hi

I have text documents with 20-50 (sometimes more, sometimes less) paragraphs of various info including a postcode and address. An example of how the paragraphs look is shown below:-

I’ve adapted a text extraction script which came originally from Appleworks to work in Text Edit as follows:-

property matchString : ""

tell application "TextEdit"
	display dialog "Extract (into a new document) all paragraphs containing what string?" default answer matchString
	set matchString to text returned of result
	set matches to every paragraph in text of front document where it contains matchString
	return matches
end tell

I need to sort the post codes geographically but need all the other info in the paragraph that the post code is contained in. The script allows me to search for the post code but extracts only one line of the paragraph that the post code is contained in. I am wanting to run a script to write these paragraphs in geographically sorted order to save copy/pasting. Can anyone tell me how I can extract the whole paragraph containing the search term rather than just the one line??

Have been searching for some time for clues but can find nothing.

Am a beginner with scripting so please allow for this in any explanations.

Thank you in advance!

Adam_Bell · August 27, 2008, 6:32pm

The sample you’ve given us is three paragraphs long. How are you defining “paragraph”? What distinguishes this set from the next? How would you sort the codes – alphabetically? Is the code always in the third line? Is it always followed (as in your example) by two colons?

Craig_Williams · August 27, 2008, 6:49pm

There is most likely a better solution but this should give you some ideas.
The script assumes that a ‘paragraph’ is defined as having two returns before
and after it.

set matchString to text returned of (display dialog "Extract (into a new document) all paragraphs containing what string?" default answer "")

tell application "TextEdit"
	set theText to text of document 1
end tell

set theParagraphs to tidStuff((ASCII character 10) & (ASCII character 10), theText)

set foundParas to {}
repeat with i from 1 to count of theParagraphs
	set thisPara to item i of theParagraphs
	if thisPara contains matchString then
		set end of foundParas to thisPara
	end if
end repeat

on tidStuff(paramHere, textHere)
	set OLDtid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to paramHere
	set theItems to text items of textHere
	set AppleScript's text item delimiters to OLDtid
	return theItems
end tidStuff

hth,

Craig

custardo · August 27, 2008, 8:13pm

Thanks for your response, Adam.

The text appears as per my example and could be 3, 4, 5 or even 6 lines long. The text is then separated by a carriage return (line feed) between the next text which takes a similar format as my example but may not be exactly the same layout. Is a line of text called a paragraph in scripting language?

The codes are sorted geographically from my head. My search will take the form of “SW17 1”, “SW17 2” etc. It is an English zip code.

The code may appear anywhere not always in the same place and not always by a colon.

Thanks for the script idea Craig. It runs as is but does not produce anything just blank. I’ll see if I can tweak it to get it working for my purposes. I am grateful to you.

mark_hunte · August 27, 2008, 9:13pm

It is in Applescript.

Is there a static order to each entry?

As in, does the first line of the entry always contain a word that is always the same, which you could search for,
and then find everything between its line and the postcode line?
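
To illustrate the point that a line of text is a "paragraph" in AppleScript, here is a minimal sketch. The sample text is made up for illustration; each return-terminated line counts as one paragraph, including the blank one.

```applescript
-- Hypothetical sample: three lines of text with one blank line before the last
set sampleText to "first line" & return & "second line" & return & "" & return & "fourth line"

-- "paragraphs of" splits on line endings, so every line is its own paragraph
set lineList to paragraphs of sampleText
set lineCount to count of lineList -- 4, the blank line is an empty paragraph
lineCount
```

This is why a search for paragraphs containing the postcode returns only the single line the postcode sits on, not the whole block around it.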

custardo · August 27, 2008, 9:47pm

Thanks for the info - that is a vital bit of information for a beginner!
The text entry will always have the word “photo” and the relevant date. Other text will vary each time.
I think you could be onto something?

mark_hunte · August 27, 2008, 9:50pm

Here is an example of what I mean, Biglist is the store for the records,

property matchString : "sw17"

property matchBorough : "Borough"
property matchText : "Borough:test  PHOTO:DEFAULTS Saturday  16-Aug-2008 
Clear Channel 096 001266 / 02 Test Lane C/O Test Way  
1139 CAMPAIGN >>Coca Cola - Campaign Description (2941) DESIGN > SW17 5GE ::Bike , Jump and Swim
Borough:test2  PHOTO:SHARP Sunday  17-Aug-2008 
Clear Channel 096 001266 / 02 Test Lane C/O Test Way  
1179 CHARGE >>Coca Cola - Campaign Description (2941) DESIGN > SW17 5GE ::Bike , Jump and Swim
Borough:test3  PHOTO:FUZZY Monday  18-Aug-2008 
Clear Channel 096 001266 / 02 Test Lane C/O Test Way  
1173 CAPE >>Coca Cola - Campaign Description (2941) DESIGN > SW17 5GE ::Bike , Jump and Swim"
property biglist : {}
set biglist to {}
set counter to count paragraph in text of matchString

repeat with i from 1 to count of paragraphs in matchText
	set this_item to paragraph i of matchText
	if this_item contains matchBorough then
		set Frommatch to i
		my endm(i, Frommatch)
	end if
	
end repeat

on endm(i, Frommatch)
	-- start at the line after the "Borough" line; incrementing the loop variable
	-- inside the loop would run one line past the end of the text when nothing matches
	repeat with ii from (i + 1) to count of paragraphs in matchText
		log ii
		set this_item2 to paragraph ii of matchText
		if this_item2 contains matchString then
			
			set ENDmatch to ii
			copy (paragraphs Frommatch thru ENDmatch of matchText as string) to end of biglist
			exit repeat
		end if
	end repeat
end endm

biglist

**edit

You can change property matchBorough : “Borough” to property matchBorough : “PHOTO:”

custardo · August 27, 2008, 11:26pm

Hi Mark

Your script works well; however, it will be necessary for me to copy a lot of text into the script and then edit it by taking line feeds (carriage returns) away so as to avoid conflict with script language. The example I gave is just one sample to be sorted out of anything from 20 maybe up to 100 (all in one document).

I’ve tried adapting some of your code into my original script but keep getting the error “Can’t get paragraph 2 of “photo”.”

I’ll keep fiddling with it. Thanks for your help so far.

Adam_Bell · August 28, 2008, 12:28am

Hey, Custardo;

Because of the variations you keep mentioning, would you be kind enough to simply copy here between quote tags or AppleScript tags some of your text containing several blocks of information, exactly as it is in the source document?

Assuming you are starting with a document that is, or could be, converted to plain text, it is easy for AppleScript to read the document into a variable in a script and then parse it on the basis of its keywords or structure. Our problem here is that the relevant key words and/or structure aren’t clear from your examples. Sorting all that later on the basis of anything found in it is easy – the hard part is separating it into parts and then recognizing features of the parts. For that, we gotta see what we’re being asked to deal with. If the info is sensitive in some way, use search/replace to provide fictional substitutions.

Can you do that?

Also, since you mention paragraphs and line feeds, this sounds like it might be a Windows document with paragraphs ending in \p\n and lines within paragraphs just ending in \n, where \p is a return character, and \n is a line feed character, ASCII 13 and 10 respectively. Is any of that so?
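
One way to answer the line-ending question is to count the return and line feed characters in the text. The following is only a sketch: `countOccurrences` is a hypothetical helper name, and the sample string stands in for the real document text.

```applescript
-- Hypothetical sample: one Windows-style ending (return + line feed) and one bare line feed
set CR to ASCII character 13
set LF to ASCII character 10
set sampleText to "line one" & CR & LF & "line two" & LF & "line three"

set crCount to countOccurrences(CR, sampleText)
set lfCount to countOccurrences(LF, sampleText)
{crCount, lfCount}

-- Count how often needle appears in haystack by splitting on it
on countOccurrences(needle, haystack)
	set oldTID to AppleScript's text item delimiters
	set AppleScript's text item delimiters to needle
	set n to (count of text items of haystack) - 1
	set AppleScript's text item delimiters to oldTID
	return n
end countOccurrences
```

If only one of the two counts is non-zero, the file uses a single kind of line ending; if both are non-zero, it is probably mixed or Windows-style.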

custardo · August 28, 2008, 10:02am

Hi Adam

This is not a one off document but a regular changing worksheet I get every two weeks. The origin is a pdf file which I extract to text (rtf or ascii - does not matter for me). I spend many hours copy/pasting each 3, 4 or 5 lines of text (script paragraphs) like the example given, by post (zip) codes into a geographically ordered list. The original script I have works great but does not give me the lines above the search string, which are needed.

I cannot post the documents I get as I would not be popular but it is not difficult to imagine from the example given. If you think of five pages of 3, 4 or 5 lines (script paragraphs) of text like the example I have given which I manually separate with one line feed (pressing the C/R button on the keyboard). If it makes it easier just think of five pages of text like the example I have given separated by one line feed/carriage return.

The text is more or less the same format but some may contain more lines depending on the amount of text being used for things like addresses and campaign descriptions. The amount of text being used may affect the text formatting when extracted from pdf to text file.

Am looking to extract a block of text containing my search string which may be 3, 4 or 5 script paragraphs long and then write those script paragraphs to a new document. The block of text will always contain the words “photo” and a date - other than that the text will be variable.

Hope that helps - let me know if you need more info.

Thank you.

Mark67 · August 28, 2008, 10:50am

I may be a mile off here, but I would be looking for the post codes in your text file (as you want to sort based on this info). I would make use of regular expressions in two ways: one using the Satimage OSAX, as this will return a matchResult string (I have no idea how you do this without it; others may?), and the other a do shell script grep command, as this returns the whole lines containing a search pattern. Make two lists and sort the “lines” list based on the “postcodes” list. You should be able to google for “UK post codes & regular expressions”; there are plenty about, as standards have shifted. I would expect to find an AppleScript list-sorting sub you could tweak here. (The expression I’ve used as an example here is a basic one.)

property Postcode_List : {}
property Postcode_Lines : {}
--
set The_Text to (choose file without invisibles) -- the text file
--
my Find_Postcodes(The_Text, "([A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z]{2})")
--
set This_File to POSIX path of The_Text
--
my GREP(This_File, "-E", "([A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z]{2})")
--
on Find_Postcodes(The_Text, Search_String) -- this sub requires Satimage OSAX
	try
		set Postcode_Info to find text Search_String in The_Text with regexp and all occurrences
		repeat with i from 1 to (count of Postcode_Info)
			set end of Postcode_List to matchResult of item i of Postcode_Info
		end repeat
		return Postcode_List
	on error
		return false
	end try
end Find_Postcodes
--
on GREP(This_File, Options, Search_String)
	try
		set Postcode_Lines to paragraphs of (do shell script "/usr/bin/grep" & space & Options & space & quoted form of Search_String & space & quoted form of This_File)
		return Postcode_Lines
	on error
		return false
	end try
end GREP

Postcode_List
-- Postcode_Lines

Adam_Bell · August 28, 2008, 2:25pm

custardo:

Hi Adam

This is not a one off document but a regular changing worksheet I get every two weeks. The origin is a pdf file which I extract to text (rtf or ascii - does not matter for me). I spend many hours copy/pasting each 3, 4 or 5 lines of text (script paragraphs) like the example given, by post (zip) codes into a geographically ordered list. The original script I have works great but does not give me the lines above the search string, which are needed.

I cannot post the documents I get as I would not be popular but it is not difficult to imagine from the example given. If you think of five pages of 3, 4 or 5 lines (script paragraphs) of text like the example I have given which I manually separate with one line feed (pressing the C/R button on the keyboard). If it makes it easier just think of five pages of text like the example I have given separated by one line feed/carriage return.

That’s not a line feed (ASCII character 10), custardo, that’s a return (ASCII character 13). A Macintosh keyboard doesn’t have a line feed character on it. Sounds like you put a blank line between the groups you have identified by hand, and that no other paragraphs of your text contain blank lines.

Does the rest of a block of data have any return characters in it or is it a long string?

My approach would be to read a block (identified by the leading and trailing blank lines) into a variable, find the zip code, and create two new lists. The first list would be the zip codes and the second would be the entire text of the paragraphs that corresponded to those zip codes. I would then sort the first while keeping the second in register (meaning that as I sorted the first, I would create a new second that kept the text with the sorted first list).

Assuming the blank line, there are two returns in a row; one at the end of a block, and one immediately following for the blank line.

-- assuming we've read the doc into tDocument...
set tid to AppleScript's text item delimiters
set AppleScript's text item delimiters to return & return
set tParts to text items of tDocument
set AppleScript's text item delimiters to tid
-- now tParts is a list of the isolated blocks

At this point, I’d iterate through the items of tParts using Mark67’s scheme to extract the zip code from it and store that at the end of a new list of just the zip codes. Now you’ve got two lists: tZips, and tParts. We want to sort tZips and keep tParts in the same order:

set NewList to item 2 of sort2Lists(tZips, tParts)
-- with this handler in your script to do it
to sort2Lists(|sortlist|, SecondList)
	tell (count |sortlist|) to repeat with i from (it - 1) to 1 by -1
		set s to |sortlist|'s item i
		set r to SecondList's item i
		repeat with i from (i + 1) to it
			tell |sortlist|'s item i to if s > it then
				set |sortlist|'s item (i - 1) to it
				set SecondList's item (i - 1) to SecondList's item i
			else
				set |sortlist|'s item (i - 1) to s
				set SecondList's item (i - 1) to r
				exit repeat
			end if
		end repeat
		if it is i and s > |sortlist|'s end then
			set |sortlist|'s item it to s
			set SecondList's item it to r
		end if
	end repeat
	return {|sortlist|, SecondList} -- the first is now ordered and the second is in the same order.
end sort2Lists

And NewList is your ordered blocks of the original.

This generic approach is the best I can do without a concrete example to work on.
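
The step of building tZips from tParts is described above but not shown. The following is a minimal sketch of one way it might be done without the Satimage addition, using `do shell script` and `grep -oE`. The handler name `firstPostcode` and the sample blocks are hypothetical, and the pattern is the basic one from Mark67's post.

```applescript
-- Hypothetical sample blocks standing in for the real contents of tParts
set tParts to {"Borough:test PHOTO:DEFAULTS" & return & "DESIGN > SW17 5GE ::Bike", "Borough:test2 PHOTO:SHARP" & return & "DESIGN > SW17 1AB ::Swim"}

-- Build a parallel list holding the first postcode found in each block
set tZips to {}
repeat with aPart in tParts
	set end of tZips to firstPostcode(contents of aPart)
end repeat
tZips

-- Returns the first match of the postcode pattern in blockText, or "" if there is none
on firstPostcode(blockText)
	set pattern to "[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z]{2}"
	-- tr turns returns into line feeds so grep sees the block one line at a time
	return do shell script "printf %s " & quoted form of blockText & " | /usr/bin/tr '\\r' '\\n' | /usr/bin/grep -oE " & quoted form of pattern & " | /usr/bin/head -n 1"
end firstPostcode
```

Because tZips is built in the same order as tParts, the two lists can then be handed to sort2Lists as they are. A block with no postcode gets an empty string, which sorts to the front.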

Mark67 · August 28, 2008, 2:43pm

Adam, we were along a similar line of thinking although I came up with this. (don’t laugh at the sort)

property Postcode_List : {}
property Postcode_Lines : {}
--
set The_Text to (choose file without invisibles) -- the text file
--
my Find_Postcodes(The_Text, "([A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z]{2})")

set This_File to POSIX path of The_Text

my GREP(This_File, "-E", "([A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z]{2})")

my Bubble_Sort(Postcode_List, Postcode_Lines)
--
on Find_Postcodes(The_Text, Search_String) -- this sub requires Satimage OSAX
	try
		set Postcode_Info to find text Search_String in The_Text with regexp and all occurrences
		repeat with i from 1 to (count of Postcode_Info)
			set end of Postcode_List to matchResult of item i of Postcode_Info
		end repeat
		return Postcode_List
	on error
		return false
	end try
end Find_Postcodes
--
on GREP(This_File, Options, Search_String)
	try
		set Postcode_Lines to paragraphs of (do shell script "/usr/bin/grep" & space & Options & space & quoted form of Search_String & space & quoted form of This_File)
		return Postcode_Lines
	on error
		return false
	end try
end GREP
--
on Bubble_Sort(List_1, List_2)
	-- borrowed from NG.
	-- slowed down by ME.
	-- call the speed police put out an APB.
	repeat with i from (count List_1) to 2 by -1
		set A to beginning of List_1
		set Y to beginning of List_2
		repeat with j from 2 to i
			set B to item j of List_1
			set Z to item j of List_2
			if (A > B) then
				set item (j - 1) of List_1 to B
				set item (j - 1) of List_2 to Z
				set item j of List_1 to A
				set item j of List_2 to Y
			else
				set A to B
				set Y to Z
			end if
		end repeat
	end repeat
	-- return List_1
	return List_2
end Bubble_Sort

posted wrong one oops

Adam_Bell · August 28, 2008, 3:39pm

No laughter here, Mark – the sort I put in was developed from one by Kai of yore in response to a query about sorting a list of events by a separate list of their dates.

Too bad that Custardo can’t contrive a dummy document for us that was representative of the data set he’s sorting. It’s sooo much more satisfying to actually test your ideas on a real sample.

custardo · August 28, 2008, 7:10pm

Thank you Adam and Mark67 for the time and effort you (and others) have put into this. It is really appreciated.

I’m probably doing lots of things wrong here as I am on a learning curve but for Mark67’s first script I couldn’t get it to run in AppleScript apart from returning a result of “{}”. I did run it in something else however, and it produced an impressive list of full post codes. Can’t remember what application that was but think it was “Smile”?? Anyway that script alone is really useful for me.

Adam’s script I am still looking at but can’t get past the error “Can’t get every text item of document 1” - it’s probably me.

Mark67’s second script does exactly what I want it to do ie. extracts the full block of text with post code when run with this short sample which is copy/pasted from this page/post:-

Borough:test PHOTO:DEFAULTS Saturday 16-Aug-2008 ¨Clear Channel 096 001266 / 02 Test Lane C/O Test Way ¨11739 CAMPAIGN >>Coca Cola - Campaign Description (2941) DESIGN > SW17 5GE ::Bike , Jump and Swim

Borough:test PHOTO:DEFAULTS Saturday 16-Aug-2008 ¨Clear Channel 096 001266 / 02 Test Lane C/O Test Way ¨11738 CAMPAIGN >>Coca Cola - Campaign Description (2941) DESIGN > SW17 1AB ::Bike , Jump and Swim

Borough:test PHOTO:DEFAULTS Saturday 16-Aug-2008 ¨Clear Channel 096 001266 / 02 Test Lane C/O Test Way ¨11737 CAMPAIGN >>Coca Cola - Campaign Description (2941) DESIGN > SW17 1AC ::Bike , Jump and Swim

Borough:test PHOTO:DEFAULTS Saturday 16-Aug-2008 ¨Clear Channel 096 001266 / 02 Test Lane C/O Test Way ¨11736 CAMPAIGN >>Coca Cola - Campaign Description (2941) DESIGN > SW17 5ab ::Bike , Jump and Swim

When I try the script with a larger sample and different text but more or less the same format it will only extract one line with the post code as my original script did.

I feel great progress and have a lot to go on anyway thanks to everybody’s help.

Adam_Bell · August 28, 2008, 10:24pm

With the document, preferably a text document saved somewhere, you get its text like this:

set tDocument to read (choose file with prompt "Choose a data file")
set tid to AppleScript's text item delimiters
set AppleScript's text item delimiters to return & return
set tParts to text items of tDocument
set AppleScript's text item delimiters to tid
-- now tParts is a list of the isolated blocks

Let us know if that works, and Mark or I will join it to the rest
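
Splitting on `return & return` only finds the blocks if the file really uses return characters. As a hedged sketch, assuming the file might contain line feeds or return + line feed pairs instead, the text could be normalised first. `replaceText` and `splitText` are hypothetical helper names, and the sample string stands in for the file contents.

```applescript
-- Hypothetical sample with mixed line endings standing in for the file contents
set CR to ASCII character 13
set LF to ASCII character 10
set tDocument to "block one, line 1" & CR & LF & "block one, line 2" & CR & LF & CR & LF & "block two, line 1" & LF & "block two, line 2"

-- Turn CR+LF pairs and bare LFs into single returns, then split on blank lines
set tDocument to replaceText(CR & LF, CR, tDocument)
set tDocument to replaceText(LF, CR, tDocument)
set tParts to splitText(CR & CR, tDocument)
count of tParts

-- Replace every occurrence of findText in sourceText with newText
on replaceText(findText, newText, sourceText)
	set oldTID to AppleScript's text item delimiters
	set AppleScript's text item delimiters to findText
	set pieces to text items of sourceText
	set AppleScript's text item delimiters to newText
	set resultText to pieces as text
	set AppleScript's text item delimiters to oldTID
	return resultText
end replaceText

-- Split sourceText into a list wherever delim occurs
on splitText(delim, sourceText)
	set oldTID to AppleScript's text item delimiters
	set AppleScript's text item delimiters to delim
	set pieces to text items of sourceText
	set AppleScript's text item delimiters to oldTID
	return pieces
end splitText
```

With the sample above, tParts ends up holding two blocks, each with its two lines joined by a single return.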

custardo · August 28, 2008, 11:40pm

Thanks for that Adam. Now get the error The variable tZips is not defined. You can try it with the sample text I posted that is what I am using or trying to use (saved as text file).

Just to add confusion to all this it appears that my original script will produce the same result as Mark’s second script when run using the four block text sample I posted here. I can add more blocks of text by pasting and editing some info and it will still work fine. As soon as I start adding more text from any other text document it will only extract one line for that added record. The format of the text is virtually the same but some words/numbers are different. I’ve been staring at it for some time and the only difference I can think of is that my sample text has been copy/pasted from a web page and not a text document. I’ve tried to put text into an html document and then paste it into a text document but I only get the one line extraction again.
Not sure what is going on.
I can PM examples of text that will only extract one line/paragraph if of any use but I can’t put anything online as it is not my info to publish.

Mark67 · August 29, 2008, 8:29am

Custardo, to get the regular expressions that part of my script used to work you would be required to add a scripting addition “Satimage” to your System Library ScriptingAdditions. I don’t have Smile but think the scripting addition that I have is a standard component of it and that is why it understood the syntax. If you need to run on a machine without Smile it would need this addition too. There are links to this from this site.

custardo · August 29, 2008, 9:00am

Hi Mark67. I do have Satimage.osax installed in Scripting Additions library folder.

Mark67 · August 29, 2008, 1:59pm

This works for me if I use a clean text file I made myself. For some reason, if I C&P your example, it fails to make one of the lists (I think there’s something in it I can’t see?). If I read your example saved as text, it has some odd characters at the beginning of the first paragraph.

property Postcode_List : {}
set Sorted_Text to (path to desktop as Unicode text) & "Sorted_Text.txt"
set The_Text to read (choose file without invisibles)

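-- Search text for regular expression for UK postcodes
-- (reset the list first: a property keeps its value between runs, so it would otherwise keep growing)
set Postcode_List to {}
my Find_Postcodes(The_Text, "([A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z]{2})")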

-- Break text into a list of blocks of info
set {ASTID, AppleScript's text item delimiters} to {AppleScript's text item delimiters, (ASCII character 10) & (ASCII character 10)}
set Info_Blocks to text items of The_Text
set AppleScript's text item delimiters to ASTID

-- Sort the info based on postcodes
set Sorted_List to my Bubble_Sort(Postcode_List, Info_Blocks)

-- Put info back in 2 paragraph blocks
set {ASTID, AppleScript's text item delimiters} to {AppleScript's text item delimiters, (ASCII character 10) & (ASCII character 10)}
set Two_Para_List to items of Sorted_List as text
set AppleScript's text item delimiters to ASTID

-- Write results to text file
my write_to_file(Two_Para_List, Sorted_Text, false)

-- This sub requires Satimage OSAX
on Find_Postcodes(The_Text, Search_String)
	try
		set Postcode_Info to find text Search_String in The_Text with regexp and all occurrences
		repeat with i from 1 to (count of Postcode_Info)
			set end of Postcode_List to matchResult of item i of Postcode_Info
		end repeat
		return Postcode_List
	on error
		return false
	end try
end Find_Postcodes
--
on Bubble_Sort(List_1, List_2)
	repeat with i from (count List_1) to 2 by -1
		set A to beginning of List_1
		set Y to beginning of List_2
		repeat with j from 2 to i
			set B to item j of List_1
			set Z to item j of List_2
			if (A > B) then
				set item (j - 1) of List_1 to B
				set item (j - 1) of List_2 to Z
				set item j of List_1 to A
				set item j of List_2 to Y
			else
				set A to B
				set Y to Z
			end if
		end repeat
	end repeat
	return List_2
end Bubble_Sort
--
on write_to_file(this_data, target_file, append_data)
	try
		set the target_file to the target_file as string
		set the open_target_file to open for access file target_file with write permission
		if append_data is false then set eof of the open_target_file to 0
		write this_data to the open_target_file starting at eof
		close access the open_target_file
		return true
	on error
		try
			close access file target_file
		end try
		return false
	end try
end write_to_file