Extracting paragraph containing text search string

I think you are right. The text is pulled off the web as HTML, printed to PDF, then extracted to text. After all those steps, I suspect some stray characters are being left in the final text document. I think I am doomed to copy/paste and tennis elbow for the rest of my life :expressionless:

Then you may want to start a new topic and see if anybody can help you get clean text from the source. Encoding issues are not my thing, but they could well be someone else’s.

I made some changes:

  • it tests whether the text contains double linefeeds or double returns.
  • it adds an instruction to the regexp command so that it ignores case (the original text contains an entry with lowercase letters)
  • as it already uses a Satimage feature, it can also use Satimage’s sortlist
  • on exit it resets the property to {} so that it is not saved to disk and is empty on a second run.
property Postcode_List : {}

set Sorted_Text to (path to desktop as Unicode text) & "Sorted_Text.txt"
set The_Text to read (choose file without invisibles)

if The_Text contains return & return then
	set delim to return & return
else
	set delim to (ASCII character 10) & (ASCII character 10)
end if
-- Search text for regular expression for UK postcodes
my Find_Postcodes(The_Text, "([A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z]{2})")
-- Break text into a list of blocks of info
set {ASTID, AppleScript's text item delimiters} to {AppleScript's text item delimiters, delim}
set Info_Blocks to text items of The_Text
set AppleScript's text item delimiters to ASTID

-- Sort the info based on postcodes
set twoSorted to sortlist {Postcode_List, Info_Blocks} with respect to 1

-- Put info back in 2 paragraph blocks
set {ASTID, AppleScript's text item delimiters} to {AppleScript's text item delimiters, delim}
set Two_Para_List to items of (item 2 of twoSorted) as text
set AppleScript's text item delimiters to ASTID

-- Write results to text file
my write_to_file(Two_Para_List, Sorted_Text, false)

set Postcode_List to {} (* so it will not be saved on disk *)

-- This sub requires the Satimage OSAX
on Find_Postcodes(The_Text, Search_String)
	try
		set Postcode_Info to find text Search_String in The_Text regexpflag {"IGNORECASE"} with regexp and all occurrences
		repeat with i from 1 to (count of Postcode_Info)
			set end of Postcode_List to matchResult of item i of Postcode_Info
		end repeat
		--return Postcode_List (* useless as it's a property *)
	on error
		return false
	end try
end Find_Postcodes
--
(*
on Bubble_Sort(List_1, List_2)
	repeat with i from (count List_1) to 2 by -1
		set A to beginning of List_1
		set y to beginning of List_2
		repeat with j from 2 to i
			set B to item j of List_1
			set Z to item j of List_2
			if (A > B) then
				set item (j - 1) of List_1 to B
				set item (j - 1) of List_2 to Z
				set item j of List_1 to A
				set item j of List_2 to y
			else
				set A to B
				set y to Z
			end if
		end repeat
	end repeat
	return List_2
end Bubble_Sort
*)
--
on write_to_file(this_data, target_file, append_data)
	try
		set the target_file to the target_file as string
		set the open_target_file to open for access file target_file with write permission
		if append_data is false then set eof of the open_target_file to 0
		write this_data to the open_target_file starting at eof
		close access the open_target_file
		return true
	on error
		try
			close access file target_file
		end try
		return false
	end try
end write_to_file

Yvan KOENIG (from FRANCE vendredi 29 août 2008 17:48:38)


Hello Yvan

Thank you for the script changes. I have a problem compiling the code above and get this error

??

I wrote:

  • as it already uses a Satimage feature, it can also use Satimage’s sortlist

It seems that Satimage is not available on your system.

On my system (10.4.11) the installed version is 3.3.1.
I installed it as:

Macintosh HD:Library:ScriptingAdditions:Satimage.osax:
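
To confirm whether the osax is actually in place, a quick check along these lines may help. This is only a minimal sketch: it assumes the osax lives in one of the two standard ScriptingAdditions folders (local or per-user), and the variable names are mine, not part of the script above.

```applescript
-- Build the two paths where Satimage.osax is usually installed (assumed locations)
set localPath to (path to library folder from local domain as Unicode text) & "ScriptingAdditions:Satimage.osax"
set userPath to (path to library folder from user domain as Unicode text) & "ScriptingAdditions:Satimage.osax"

-- Ask the Finder whether an item exists at each path
tell application "Finder"
	set inLocal to exists item localPath
	set inUser to exists item userPath
end tell

-- Report the result
if inLocal or inUser then
	display dialog "Satimage.osax was found."
else
	display dialog "Satimage.osax was not found in either ScriptingAdditions folder."
end if
```

If it reports "not found", the compile error is most likely just the missing scripting addition.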

Yvan KOENIG (from FRANCE samedi 30 août 2008 12:37:07)

I was using Satimage.osax 3.06. I have now installed 3.3.1. I can now compile the script, but when I run it I get the following error:

The same line as before is highlighted when the error appears.

Can this problem really be caused by Satimage, given that the highlighted line is not in the Satimage subroutine?
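
One thing that may be worth ruling out, since sortlist is also a Satimage command (see the list of changes above): the sort pairs Postcode_List with Info_Blocks item by item, so the two lists presumably need the same number of items. I have not confirmed how sortlist behaves when they differ, so treat this as a guess. A minimal sketch of such a check, using made-up sample data in place of the real lists:

```applescript
-- Sample data standing in for the script's real lists (hypothetical values)
set Postcode_List to {"SW1A 1AA", "EC1A 1BB"}
set Info_Blocks to {"block one", "block two", "block three"}

-- Compare the counts before attempting a paired sort
set postcodeCount to count of Postcode_List
set blockCount to count of Info_Blocks

if postcodeCount is not blockCount then
	-- A mismatch means some block has no postcode, or more than one
	display dialog "Found " & postcodeCount & " postcodes but " & blockCount & " blocks; the lists cannot be paired one to one."
else
	display dialog "The lists have the same length, so the paired sort can go ahead."
end if
```

In the real script the same comparison would go just before the sortlist line. A mismatch could happen if a block contains no postcode, contains two, or if the text ends with an empty block.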