Extracting variable-length text strings

SynapTECH · November 17, 2005, 7:21pm

Hi, I have searched the forum for code snippets to perform this function but haven’t found the right bits to do the job.

I am attempting to extract data strings from voluminous text files. Each string is bounded by two unique characters, let’s say # on the front end and * on the back end. The text in between is variable length, so I am unable to use the traditional find-and-replace functions in typical text editors to achieve my objective.

Could someone point me in the right direction? I am a code newbie, but am looking forward to tackling this project.

Thanks in advance!

hhas · November 17, 2005, 8:44pm

Regular expressions are ideal for this sort of task. Using TextCommands:

tell application "TextCommands"
	search txt for "#(.*?)\\*" with regex
end tell

HTH
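
For readers without TextCommands, roughly the same match can be sketched with grep in a shell, which AppleScript can reach through do shell script. grep's basic regular expressions have no lazy `.*?`, so this sketch assumes the bounded class `[^*]*` as a stand-in. The function name `extract_between` is hypothetical and not part of TextCommands.

```shell
# extract_between is a hypothetical helper name for this sketch.
# Reads text on stdin and prints each #...* payload on its own line.
extract_between() {
  # grep -o prints only the matched text, one match per line.
  # [^*]* stops at the first "*", standing in for the lazy (.*?) group.
  # sed then strips the leading "#" and the trailing "*".
  grep -o '#[^*]*\*' | sed -e 's/^#//' -e 's/\*$//'
}

printf 'abc #FindMe!* def\n' | extract_between
# prints: FindMe!
```

Because the class excludes `*`, each match ends at the nearest `*`, so two delimited strings on one line come back as two separate results.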

Bruce_Phillips · November 17, 2005, 9:28pm

As already noted, the Code Exchange forum is for posting full snippets/solutions. In the future, please use the other forums for help requests. This post will be moved.

Bruce_Phillips · November 17, 2005, 9:51pm

Here’s something to get you started:

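-- Split grep's output on "*" followed by a linefeed, so each match becomes one text item
set ASTID to AppleScript's text item delimiters
set AppleScript's text item delimiters to {("*" & (ASCII character 10))}

set sourceFile to choose file with prompt "Find text in this file:" without invisibles
set sourcePath to quoted form of POSIX path of sourceFile

-- grep -o prints only the matched text, one match per line; colrm 1 1 strips the leading "#"
set grepOutput to do shell script "grep -o '#.\\+\\*' " & sourcePath & " | colrm 1 1" without altering line endings
-- Drop the empty item that follows the final delimiter
set textItems to text items 1 thru -2 of grepOutput

set AppleScript's text item delimiters to ASTID

return textItems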

Edit: Small change to the shell script.

SynapTECH · November 18, 2005, 6:31pm

Bruce,

Many thanks for the gracious bump and the code. This is a great teaching example, and I worked with your first code example last night - but couldn’t get it to work just right. I thought at first that I chose poor delimiters, as they are common in HTML source files, but I will work with your second example and see if I can make a go of it!

Thanks again,

Regards,
Cliff

jonn8 · November 18, 2005, 9:39pm

If your text is in a variable already, you can use this code with variable delimiters (there are probably other characters that need to be escaped and this assumes your delimiters are single characters only):

--set the_text to (read (choose file))
set the_text to "1234567890asdfghjkl
.qwerty#
hELLO<_wORLD/
#FindMe!*<blah/>
aeiou#Find this too.*[-]
``~+-*/=fin"
set {start_delim, end_delim} to {"#", "*"}

set found_text to my find_delimited_text(the_text, start_delim, end_delim)
-->{"FindMe!", "Find this too."}

on find_delimited_text(the_text, start_delim, end_delim)
	set {escaped_start_delim, escaped_end_delim} to {my escaped_delim(start_delim), my escaped_delim(end_delim)}
	set ASCII_10 to (ASCII character 10)
	tell (a reference to my text item delimiters)
		set {old_tid, contents} to {contents, {ASCII_10}}
		set {the_text, contents} to {(the_text's paragraphs) as Unicode text, {end_delim & ASCII_10}}
	end tell
	set found_text to (do shell script "echo " & quoted form of the_text & " | grep -o '" & escaped_start_delim & ".\\+" & escaped_end_delim & "' | colrm 1 1" without altering line endings)'s text items 1 thru -2
	tell (a reference to my text item delimiters) to set contents to old_tid
	return found_text
end find_delimited_text

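on escaped_delim(the_delim)
	-- grep's basic regex treats only these as special when unescaped: * . [ ^ $ \
	-- (escaping ( ) or ? would instead turn them into grouping/repetition operators)
	if the_delim is in "*.[^$\\" then return "\\" & the_delim
	return the_delim
end escaped_delim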

Jon
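
The caveat about escaping and single-character delimiters can be made concrete on the shell side. What follows is a minimal sketch, not a drop-in replacement for the handler above: it assumes grep's basic regular expression syntax, where the characters that need a backslash are `. [ \ * ^ $`, and the names `escape_bre` and `extract_delimited` are hypothetical.

```shell
# escape_bre and extract_delimited are hypothetical names for this sketch.

# Backslash-escape the characters that are special in a basic regular
# expression: . [ \ * ^ $
escape_bre() {
  printf '%s' "$1" | sed 's/[[\\.*^$]/\\&/g'
}

# Read text on stdin; print what lies between start ($1) and end ($2),
# one match per line. The match is greedy, like the posted pattern.
extract_delimited() {
  start=$1
  end=$2
  pattern="$(escape_bre "$start").\\+$(escape_bre "$end")"
  grep -o -e "$pattern" | while IFS= read -r match; do
    match=${match#"$start"}   # strip the leading delimiter literally
    match=${match%"$end"}     # strip the trailing delimiter literally
    printf '%s\n' "$match"
  done
}

printf 'x (*FindMe!*) y\n' | extract_delimited '(*' '*)'
# prints: FindMe!
```

Stripping the delimiters with parameter expansion rather than colrm is a design choice made here so that delimiters longer than one character work; a delimiter containing a single quote would still need extra care when the command is built inside do shell script.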

Bruce_Phillips · November 18, 2005, 11:15pm

Yeah. I tried saving the source of this page, and running my script on that file, and it didn’t return the results that I wanted to see.
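
One plausible cause, offered as a guess rather than a diagnosis: `.\+` is greedy, so on a line that holds several `#` and `*` characters, as HTML source often does, a single match runs from the first `#` to the last `*`. The sketch below contrasts the posted pattern with a bounded one. The sample line is made up, and the bounded pattern is an assumption, not something from the thread.

```shell
# Made-up sample line with two delimited fields
line='color #red* and size #large* end'

# Posted pattern: greedy, one long match from the first "#" to the last "*"
greedy=$(printf '%s\n' "$line" | grep -o '#.\+\*')

# Bounded pattern (an assumption): no "#" or "*" allowed inside a match,
# so each match starts at the nearest "#" and ends at the nearest "*"
bounded=$(printf '%s\n' "$line" | grep -o '#[^#*]*\*')

printf 'greedy:  %s\n' "$greedy"
printf 'bounded: %s\n' "$bounded"
```

Here `greedy` holds the single string `#red* and size #large*`, while `bounded` holds `#red*` and `#large*` on separate lines. In an AppleScript string the bounded pattern would be written `'#[^#*]*\\*'`.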

SynapTECH · November 29, 2005, 8:23pm

Bruce, jonn8:

Thank you so much for your help. Between the examples you provided, I have been tearing through my data files extracting away - and I’ve discovered the power behind AppleScript. I really appreciate the help. This is a fantastic forum and you are a great resource!

Cheers,

Cliff

Bruce_Phillips · November 29, 2005, 9:01pm

I’m glad you’ve figured it out.