In need of a methodology for parsing text files!

cyan · June 4, 2007, 12:51am

Hello,

I’m looking for advice on how to parse text. My apologies if this is too basic a question.

I know you can get every word of something, or every character… the problem I see is when I ask for every word, escaped characters like quotes and parentheses are not considered words, and are not captured.

for example:

set theTextToParse to "1114_0_1_0_master8\" (8)
STOP"

Yields this:

{
"1114",
"_",
"0",
"_",
"1",
"_",
"0",
"_",
"master8",
"8",
"STOP"
}

If I get every character instead of every word, these characters (" and parens) are retained, but really, I just want to look for the word “STOP” and take some action… but I don’t have the WORD “STOP” anymore, only the characters “S”, “T”, “O”, “P” and I don’t know where to take it from here.

And what if I wanted to look for a different word in EVERY paragraph? I’m not sure whether text item delimiters would help here either.

Thoughts?

kel · June 4, 2007, 1:05am

Hi cyan,

In this kind of post, it’s good to know exactly what you’re trying to do. Hypothetical text parsing questions always lead to more questions when they are vague.

Basically, AppleScript’s built-in text parsing is pretty limited. You can use text item delimiters or offset. You can check whether one string contains another with the ‘contains’ operator and its other forms, such as ‘is in’ (‘is contained by’).


set theTextToParse to "1114_0_1_0_master8 (8) \"STOP\""
theTextToParse contains "\"STOP\""
--> true

gl,

kel · June 4, 2007, 4:49am

Hi cyan,

Let me explain better about being more specific. Because AppleScript’s text parsing abilities are limited, you might use text item delimiters in one case and offset in another. In certain applications, you can use regular expressions. You can also use UNIX’s various text parsing commands and languages, such as grep, sed, awk, and tr.

Each of these has its pros and cons, and the right choice depends on the task at hand. Starting a shell takes a fixed amount of time for initialization, so any UNIX tool you call with ‘do shell script’ carries that time handicap on every call.
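As a rough illustration of the shell route, here is a minimal sketch that hands the sample string to sed. The replacement text "GO" is made up for the example, and the script assumes the text fits on one line; treat it as a starting point rather than a recommended solution.

```applescript
set theTextToParse to "1114_0_1_0_master8 (8) \"STOP\""

-- 'quoted form of' protects the quotes and parentheses from the shell.
-- "GO" is a hypothetical replacement used only for illustration.
set shellCommand to "printf '%s' " & quoted form of theTextToParse & " | sed 's/STOP/GO/g'"

-- Each call pays the shell startup cost mentioned above.
set theNewText to do shell script shellCommand
```

For a single short string this is overkill, but the same pattern lets you use grep, awk, or tr when the built-in tools run out.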

With regard to AppleScript, the ‘offset’ command is straightforward: it returns the index of the first character of the string you’re searching for. AppleScript also has a property, ‘text item delimiters’. You can solve many text parsing puzzles with text item delimiters, but the resulting script can get quite complicated. The advantage of AppleScript’s built-in text parsing tools is speed.
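To make that concrete, here is a small sketch of both built-in tools on the sample string from earlier. The replacement text "GO" and the variable names are only for illustration.

```applescript
set theTextToParse to "1114_0_1_0_master8 (8) \"STOP\""

-- offset: 1-based index of the first character of the match (0 if not found).
set stopIndex to offset of "STOP" in theTextToParse

-- text item delimiters: split on "STOP", then rejoin with a replacement.
set savedDelimiters to AppleScript's text item delimiters
set AppleScript's text item delimiters to "STOP"
set thePieces to text items of theTextToParse
set AppleScript's text item delimiters to "GO" -- hypothetical replacement text
set theNewText to thePieces as text
-- Always restore the delimiters so later code is not affected.
set AppleScript's text item delimiters to savedDelimiters

{stopIndex, theNewText}
--> {25, "1114_0_1_0_master8 (8) \"GO\""}
```

Note that the quotes and parentheses survive untouched, because the string is never broken into words.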

gl,

cyan · June 4, 2007, 11:11pm

Thanks Kel,

I know these are deep waters. It’s just that I don’t have the experience yet to know when to take one approach over another.

As an example, the task I posted about seems simple enough: open a text file and loop through every paragraph, appending each one to a new text file as I go. Eventually, one paragraph contains the word STOP. When I encounter that word, I want to insert a return and then write some other stuff to the text file, then go right back to parsing and appending.

I can find the word “STOP” by getting every word of paragraph i. But since some of the text is not considered a “word”, I can’t easily coerce the list of words I get back to a paragraph and write it to the file. On the other hand, if I get every character, I no longer have the word “STOP” to replace - just a list of characters!

Then I think I have to say something like this in the loop: “if the first character of the list = "S" and the next character of the list = "T" and …”
But what if that list contains other "S"s? That seems like a lot of work to find the word “STOP”, so I know this can’t be the way to proceed.

Am I making sense?

Adam_Bell · June 5, 2007, 12:22am

A tutorial I wrote some time ago might help: “Tutorial for Using AppleScript’s Text Item Delimiters”.
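For the paragraph-by-paragraph task described above, one possible shape is sketched below. It is not taken from the tutorial. It assumes the input file can be read with the default text encoding, and the file prompts and the placeholder line written after STOP are invented for the example. The key idea is to use the word list only to detect STOP, and to write the original paragraph text, so quotes and parentheses are never lost.

```applescript
-- Hypothetical file choices; swap in fixed paths if you prefer.
set inputFile to choose file with prompt "Pick the text file to parse:"
set outputFile to choose file name with prompt "Save the result as:"

set theParagraphs to paragraphs of (read inputFile)

set fileRef to open for access outputFile with write permission
try
	set eof fileRef to 0 -- start with an empty output file
	repeat with thisParagraph in theParagraphs
		set paragraphText to contents of thisParagraph
		-- Write the paragraph exactly as it was read.
		write (paragraphText & return) to fileRef
		-- Only the test uses the word list; 'considering case' skips "stop".
		considering case
			set foundStop to (words of paragraphText) contains "STOP"
		end considering
		if foundStop then
			-- Placeholder for the "other stuff" to be written after STOP.
			write (return & "extra text goes here" & return) to fileRef
		end if
	end repeat
	close access fileRef
on error errMsg
	-- Make sure the file is closed before passing the error along.
	close access fileRef
	error errMsg
end try
```

To look for a different word in every paragraph, the literal "STOP" could be replaced by a variable that is set at the top of each pass through the loop.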

cyan · June 5, 2007, 1:28am

Thanks Adam,

I’m reading it over now. Looks great!