Issue with extracting text between defined strings

AndyMan · May 21, 2012, 4:42am

Grabbed and slightly modified the code below from: http://macscripter.net/viewtopic.php?id=24725

Is there an AppleScript way to fix this, or do you have a Perl/Awk/Sed method that might be better or more reliable?

The code does extract the entries that I want. Unfortunately, it also grabs the first “page” of data.

The data is an Adobe PDF file saved as text. I wanted only the QTD pages. Unfortunately, the script returned those plus the first page of the Daily report. The number of pages varies, but the format is consistent.

The format is:

Daily report (pg 1.)
Data
Created on

Daily report (pg 2. and possibly more)
Data
Created on

Weekly Report (pg 1, 2 and possibly more)
Data
Created on

QTD (multiple pages)
Data
Created on

-- Extract every instance of text between bounding delimiters (Yvan Koenig)

set t to (read (choose file with prompt "Choose File to Operate On"))

set extract to extractBetween(t, "QTD", "Created")

---- The handler ----
to extractBetween(SearchText, startText, endText)
	set tid to AppleScript's text item delimiters -- save them for later.
	set AppleScript's text item delimiters to startText -- find the first one.
	set liste to text items of SearchText
	set AppleScript's text item delimiters to endText -- find the end one.
	set extracts to {}
	repeat with subText in liste
		if subText contains endText then
			copy text item 1 of subText to end of extracts
		end if
	end repeat
	set AppleScript's text item delimiters to tid -- back to original values.
	return extracts
end extractBetween

Nigel_Garvey · May 21, 2012, 9:45am

Hi.

Yvan’s handler doesn’t drop the stuff before the first instance of the startText. I don’t know offhand if this is an oversight or whether it was originally intended for use in another context. Anyway, for your purposes, change this line:

set liste to text items of SearchText

… to this:

set liste to rest of text items of SearchText

You could write ‘set liste to text items 2 thru -1 of SearchText’ instead, but this will error if the start delimiter doesn’t occur in the text. The handler doesn’t check to see if both delimiters do occur in the text, or if they occur in the right order, but I don’t know what you’d want to do if they didn’t.
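As a rough sketch of the check mentioned above: one option is to return an empty list when the start delimiter never occurs, which also makes the ‘text items 2 thru -1’ form safe. The handler name extractBetweenChecked is only for illustration, and returning {} is an assumption about what you'd want in that case.

```applescript
-- Hypothetical variant of the handler: bails out early when startText is absent.
to extractBetweenChecked(SearchText, startText, endText)
	-- Assumption: an empty result is acceptable when there is nothing to extract.
	if SearchText does not contain startText then return {}
	set tid to AppleScript's text item delimiters -- save them for later.
	set AppleScript's text item delimiters to startText
	-- Safe here: the guard above guarantees at least two text items.
	set liste to text items 2 thru -1 of SearchText
	set AppleScript's text item delimiters to endText
	set extracts to {}
	repeat with subText in liste
		if subText contains endText then
			copy text item 1 of subText to end of extracts
		end if
	end repeat
	set AppleScript's text item delimiters to tid -- back to original values.
	return extracts
end extractBetweenChecked
```

It is called the same way as the original, e.g. extractBetweenChecked(t, "QTD", "Created").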

Yvan_Koenig · May 21, 2012, 11:52am

Hello Nigel
The original script does the job it was written for.

(1) it splits the source text with the first delimiter
(2) it scans the subitems to check if they contain the second delimiter.
If a subitem doesn’t contain this 2nd delimiter, it’s dropped.
If it contains the 2nd delimiter, we keep only the piece of text which is before this delim.
As far as I know, it’s doing exactly what it’s supposed to achieve.

If you do as you describe, you drop the first item, which may contain the 2nd delimiter. On my side, I assumed that what comes before it must be kept.

Yvan KOENIG (VALLAURIS, France) Monday 21 May 2012 13:52:44

Nigel_Garvey · May 21, 2012, 2:19pm

Hi, Yvan.

Yes. Sorry. I overlooked the ‘if’ statement in your repeat for some reason. The first text item is automatically excluded if it doesn’t contain the second delimiter. But if it does, as here, then the final result contains all the text from before the first instance of the second delimiter, as well as any text which comes between the two delimiters.
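To make that concrete, here is a small sketch with invented sample text (three short "pages", not the real report). It only shows the splitting step, with and without ‘rest of’.

```applescript
-- Made-up sample: one Daily page followed by two QTD pages.
set sample to "Daily AAA Created on 1" & return & "QTD BBB Created on 2" & return & "QTD CCC Created on 3"

set tid to AppleScript's text item delimiters
set AppleScript's text item delimiters to "QTD"
set allItems to text items of sample -- includes everything before the first "QTD"
set laterItems to rest of allItems -- only the pieces that follow a "QTD"
set AppleScript's text item delimiters to tid

-- allItems has 3 items; item 1 is the Daily page, and it contains "Created",
-- so the 'if' test in the handler lets it through.
-- laterItems has 2 items, both starting right after a "QTD".
```

With allItems the handler would also extract "Daily AAA "; with laterItems it extracts only " BBB " and " CCC ".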

Yvan_Koenig · May 21, 2012, 3:09pm

Hello Nigel

Only the asker knows exactly what to do with the beginning of the text.
In such a case, I choose the “conservative” workflow.
I think that it’s better to keep too much data than to drop something useful.

But my scheme drops a piece of data which is enclosed between two consecutive occurrences of the first delimiter.
I’m not sure that this is good behavior.

I thought that an alternate scheme would be to start splitting with the 2nd delimiter, but in that case we would drop pieces of data enclosed between two consecutive occurrences of this 2nd delimiter.
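A tiny sketch of the first case, using an invented string in which the first delimiter occurs twice before the second one appears:

```applescript
-- Made-up sample: two "QTD" markers, then a single "Created".
set sample to "QTD one QTD two Created"

set tid to AppleScript's text item delimiters
set AppleScript's text item delimiters to "QTD"
set pieces to text items of sample -- {"", " one ", " two Created"}
set AppleScript's text item delimiters to "Created"
set kept to {}
repeat with piece in pieces
	-- " one " has no "Created" in it, so it is silently dropped.
	if piece contains "Created" then copy text item 1 of piece to end of kept
end repeat
set AppleScript's text item delimiters to tid

-- kept is now {" two "}
```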

It’s always difficult to answer a question which is not really precise.

I have just noticed a sentence in the original message:

The code does extract the entries that I want. Unfortunately, it also grabs the first “page” of data.

So the OP doesn’t want to keep the first piece of text, and the presence of the 2nd delimiter must be tested only in “the rest of text items”.

Yvan KOENIG (VALLAURIS, France) Monday 21 May 2012 17:09:18

AndyMan · May 21, 2012, 3:42pm

Quite true. Both versions of the code do exactly what they are supposed to do.

Nigel’s solution does what I need it to do.

Thank you both for your input!

Andy

AndyMan · May 24, 2012, 6:35am

So I added a couple of lines at the bottom of this to save the “Extracts” as a text file. It does that. Unfortunately it adds strings in between the extracts such as: listutxt5z utxtÃ± utxt5z

It looks as though they might be consistent. Though who knows if one gets another set of files.

Any ideas why this might be happening?


set t to (read (choose file with prompt "Choose File to Operate On"))
set extract to extractBetween(t, "start", "end")


to extractBetween(SearchText, startText, endText)
	set tid to AppleScript's text item delimiters -- save them for later.
	set AppleScript's text item delimiters to startText -- find the first one.
	set liste to rest of text items of SearchText
	set AppleScript's text item delimiters to endText -- find the end one.
	set extracts to {}
	repeat with subText in liste
		if subText contains endText then
			copy text item 1 of subText to end of extracts
		end if
	end repeat
	set AppleScript's text item delimiters to tid -- back to original values.
	set theFilePath to (path to desktop as string) & "file.txt" as string
	set theFileReference to open for access theFilePath with write permission
	set eof of theFileReference to 0
	write extracts to theFileReference starting at eof
	close access theFileReference
end extractBetween

StefanK · May 24, 2012, 6:46am

You’re writing a list (extracts) to disk, which is what causes the “listutxt” tokens.
Coerce the list back to text first, using "" as the delimiter to simply concatenate the list items, or return to get one item per paragraph.


.
	end repeat
	set text item delimiters to whatEverYouWantAsDelimiter
	set extracts to extracts as text
	set text item delimiters to tid -- back to original values.
.
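For what it's worth, here is one way the saving part might look as its own handler, so that extractBetween can go back to simply returning the list. This is only a sketch: the handler name saveExtracts and the sample list are made up, return is just one possible delimiter, and like the code above it overwrites file.txt on the desktop.

```applescript
-- Example call with a made-up list; in the real script, pass the result of extractBetween.
saveExtracts({" BBB ", " CCC "}, (path to desktop as text) & "file.txt")

-- Hypothetical handler: joins the list into one text, then writes it.
to saveExtracts(extractList, theFilePath)
	set tid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to return -- one extract per paragraph
	set extractText to extractList as text -- now plain text, not a list
	set AppleScript's text item delimiters to tid -- back to original values.
	
	set theFileReference to open for access file theFilePath with write permission
	try
		set eof of theFileReference to 0 -- empty the file first
		write extractText to theFileReference
		close access theFileReference
	on error errMsg number errNum
		-- Make sure the file is not left open if the write fails.
		close access theFileReference
		error errMsg number errNum
	end try
end saveExtracts
```

Because the value handed to ‘write’ is text rather than a list, the “listutxt” markers should no longer appear between the extracts.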