Help locating one line of text within a larger text block

FlyingB · August 14, 2014, 2:23am

I need a script that will take a block of text (plain text from a webpage) and locate a certain line. It involves looking up information about books, and the text that I want the script to look at will always look like this:

I want the script to locate and return the title, and as you can see, there is no “Title:” at the beginning of the line that would make things easier. However, the title is always between the line “Find a book:” and the line that begins with “ISBN-13.” Complicating things, though, there are sometimes a different number of lines between “Find a book:” and the title; and moreover, the lines that appear blank actually can have a variable number of tabs in them. (In every instance I’ve seen there are two tabs before the title, and before the “ISBN-13…” so at least there’s something else consistent.)

I’ve tried and tried, but am not good enough with text delimiters, offsets, or anything else I’ve seen that might deal with the variability of lines and tabs. What I’d like to end up with is a script that I can use as a part of an Automator workflow, which is what I’m using to get the plain text for the script to look at. If anyone can help, I would really appreciate it.

Yvan_Koenig · August 14, 2014, 9:23am

You may try this code:

set p2d to path to desktop as text
set path2datas to p2d & "the datas.txt"
set theDatas to read file path2datas
set enListe1 to my decoupe(theDatas, "Find a book:")
if (count enListe1) = 1 then error "Keyword not found"
repeat with i from 2 to count enListe1
	set maybe to item i of enListe1 as text
	if maybe contains "ISBN-13" then
		set theResult to item 1 of my decoupe(maybe, "ISBN-13")
		log theResult
		# try to clean: strip line breaks, tabs and the stray characters seen in the raw result
		set theResult to my supprime(theResult, {linefeed, return, character id 8233, tab, "Ä", "®", "¬", ",", " "})
		theResult
		exit repeat
	end if
end repeat
log theResult
#=====

on decoupe(t, d)
	local oTIDs, l
	set {oTIDs, AppleScript's text item delimiters} to {AppleScript's text item delimiters, d}
	set l to text items of t
	set AppleScript's text item delimiters to oTIDs
	return l
end decoupe

#=====
(*
removes every occurrence of d from text t
*)
on supprime(t, d)
	local oTIDs, l
	set {oTIDs, AppleScript's text item delimiters} to {AppleScript's text item delimiters, d}
	set l to text items of t
	set AppleScript's text item delimiters to ""
	set t to l as text
	set AppleScript's text item delimiters to oTIDs
	return t
end supprime

#=====

I added an instruction that tries to clean the result.
I tested it using your own message as the source.
I was a bit puzzled, because what I got before cleaning was:
(*šÄ®¬ šÄ®šÄ®¬ ¬ ¬ ¬ ¬ James Madison: A Life ReconsideredšÄ®šÄ®¬ ¬ ¬ ¬ ¬ *)
No return, no linefeed, no tab embedded.
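
One possible explanation, which this thread does not confirm: as far as I know, read without an "as" parameter interprets the file's bytes in a legacy single-byte encoding, so each multi-byte UTF-8 character (a Unicode line separator or a no-break space copied from a web page, for instance) would show up as two or three odd characters. The sketch below only illustrates that effect. The file name, the variable names and the test string are hypothetical, and it assumes the source file is UTF-8.

```applescript
# Write a tiny UTF-8 test file: "A", a line separator (character id 8232),
# a no-break space (character id 160), then "B". The file name is hypothetical.
set demoPath to (path to temporary items as text) & "utf8 read demo.txt"
set fileRef to open for access file demoPath with write permission
try
	set eof fileRef to 0
	write ("A" & character id 8232 & character id 160 & "B") to fileRef as «class utf8»
	close access fileRef
on error errMsg
	close access fileRef
	error errMsg
end try

# Same bytes, two interpretations.
set legacyText to read file demoPath # no "as" parameter: legacy encoding
set utf8Text to read file demoPath as «class utf8» # decoded as UTF-8
{length of legacyText, length of utf8Text}
```

If the two lengths differ, the extra characters come from the way the file was read rather than from the page itself. In that case, adding as «class utf8» to the read instruction of the script above may make most of the odd characters in the cleaning list unnecessary.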

Yvan KOENIG (VALLAURIS, France) Thursday 14 August 2014 11:22:56

FlyingB · August 14, 2014, 3:49pm

Hi Yvan,

I can’t explain how those other characters got in there, but what I do know is that your script works beautifully! I’m grateful for your time and expertise. This will save me a lot of time. Thank you very much for your help!

Yvan_Koenig · August 14, 2014, 4:55pm

Thanks for the feedback.
As you certainly saw, I assumed that there is only one useful entry in the source data.
Small changes to the loop would allow it to extract several strings.
If the “odd” characters were introduced by the MacScripter message parser, you may try shortening the cleaning instruction to:

   set theResult to my supprime(theResult, {linefeed, return, character id 8233, tab})
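
The small changes to the loop mentioned above could look like the following sketch. It is only one way to do it: it assumes that every entry starts with "Find a book:" and that the title is the only text between that keyword and "ISBN-13". The sample text, the second title, the ISBN values and the names block1, block2 and theTitles are made-up placeholders.

```applescript
# Hypothetical sample text standing in for the web page.
set block1 to "Find a book:" & linefeed & tab & linefeed & tab & tab & "James Madison: A Life Reconsidered" & linefeed & tab & tab & "ISBN-13: 0000000000000" & linefeed
set block2 to "Find a book:" & linefeed & tab & tab & "A Second Sample Title" & linefeed & tab & tab & "ISBN-13: 0000000000000"
set theDatas to block1 & block2

set theTitles to {}
set enListe1 to my decoupe(theDatas, "Find a book:")
if (count enListe1) = 1 then error "Keyword not found"
repeat with i from 2 to count enListe1
	set maybe to item i of enListe1 as text
	if maybe contains "ISBN-13" then
		set theResult to item 1 of my decoupe(maybe, "ISBN-13")
		# keep every match instead of exiting after the first one
		set end of theTitles to my supprime(theResult, {linefeed, return, character id 8233, tab})
	end if
end repeat
return theTitles

# splits text t on delimiter d and returns the list of pieces
on decoupe(t, d)
	local oTIDs, l
	set {oTIDs, AppleScript's text item delimiters} to {AppleScript's text item delimiters, d}
	set l to text items of t
	set AppleScript's text item delimiters to oTIDs
	return l
end decoupe

# removes every occurrence of the delimiters d from text t
on supprime(t, d)
	local oTIDs, l
	set {oTIDs, AppleScript's text item delimiters} to {AppleScript's text item delimiters, d}
	set l to text items of t
	set AppleScript's text item delimiters to ""
	set t to l as text
	set AppleScript's text item delimiters to oTIDs
	return t
end supprime
```

With the sample text, the script should return a list holding the two titles. In an Automator "Run AppleScript" action, the text would arrive through the input parameter of on run {input, parameters} instead of being built in the script, so theDatas would be set from item 1 of input (assuming the previous action passes plain text).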

Yvan KOENIG (VALLAURIS, France) Thursday 14 August 2014 18:53:57