Thursday, December 14, 2017

#1 2007-08-06 05:00:18 am

chrys

Doing Structured Text Generation in AppleScript

If you have ever found yourself in a situation where you needed to produce a piece of text (or series of texts) with a predefined format, you have found an application for structured text generation.  For some uses, the predefined format may be something simple like the text "Looping with num = " followed by a number. Such a format might be useful for a debug statement in a repeat loop. It might be used in code like this:

Applescript:


repeat with num from 1 to 5
   set msgText to "Looping with num = " & num
   set titleText to "Loop Message"
   display dialog msgText with title titleText
   -- do other stuff with num
end repeat

For a purpose like debug output, the result is only read by people, so the format is mostly unconstrained. Other predefined formats are more strict in what they require. Something like CSV (comma-separated values; see the Wikipedia entry and RFC 4180) has very specific requirements for how the final text must be formatted. The goal of this article is to demonstrate an example-based method of developing AppleScript code that will generate structured texts.

AppleScript Concepts Used

Besides AppleScript strings, this article also uses lists, records and handlers. For an introduction to lists, see The Power of Lists (http://bbs.macscripter.net/viewtopic.php?id=24730) in the unScripted archives. The use of records in this article is fairly inconsequential, but if you would like an introduction to records, you can find one in the Records & Repeats article in unScripted. If you are not already familiar with AppleScript handlers, you might want to check out the Getting Started With Handlers series in unScripted. Also, do not forget about the AppleScript Language Guide, which is the official reference for core AppleScript language features.

The Method: Sample, Analyze, Decompose, Code, Repeat

The first step in this method for structured text generation is to find or create examples of the text you want to generate. It is important to include as many distinct variations as you can imagine. With a comprehensive set of examples (or failing that, a thorough description of the required format), you can be more confident that you will not have to go back into your code later to add a new feature to handle some new variation.

With a full set of examples at hand, you can start analyzing them for structure. Because this is an iterative process, you can simply start with the most coarse-grained structure first. The finer-grained structure will be dealt with in later iterations, so do not worry about the details at first. You can use these questions to guide your analysis:

What are the static parts of the examples?
What are the dynamic parts of the examples?
What is the purpose or content of each dynamic part of the example?


Static elements are those parts of the example texts that are identical across all the examples. The dynamic elements are those parts of the texts that are different in at least one of the examples. Thinking about the contents of the dynamic parts will give you ideas of what to name them in the following steps.
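
To make the distinction concrete, here is a minimal sketch built on the debug message from the start of this article. The handler name getLoopMessageText and the variable names are made up for illustration. The point is only that the static part is written into the code while the dynamic part arrives as data.

```applescript
-- Static part: identical in every message, so it is written into the code.
-- Dynamic part: differs between messages, so it is passed in as data.
-- (getLoopMessageText is a hypothetical name used only for this sketch.)
to getLoopMessageText(num)
   set staticText to "Looping with num = "
   return staticText & num
end getLoopMessageText

set messageList to {}
repeat with num from 1 to 3
   set end of messageList to getLoopMessageText(num)
end repeat
messageList
```

The resulting list holds three messages that differ only in the number at the end.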

After the analysis is done, you take your editor to the examples themselves and extract out and capture their dynamic elements in a simple data structure. While doing this, save the static parts for use in the last step. At first, this will just amount to using your script editor to break up big AppleScript strings into smaller AppleScript strings. As things progress, though, you will end up using lists and possibly even fancier data structures to hold your decomposed examples.

Once the examples have been decomposed, it is time to write the code to stitch the dynamic elements back together with the static elements to effect regeneration of the original text examples. This is where AppleScript's concatenation operator gets a real workout. The key here is that the new code is recombining the decomposed elements, not your brain driving your script editor. Your brain does the decomposition of the examples with your script editor, then you write AppleScript code to recompose the results.

A Practice Text: "Ten Green Bottles"

Ten green bottles hanging on the wall,
Ten green bottles hanging on the wall,
And if one green bottle should accidentally fall,
There'll be nine green bottles hanging on the wall.

Nine green bottles hanging on the wall,
Nine green bottles hanging on the wall,
And if one green bottle should accidentally fall,
There'll be eight green bottles hanging on the wall.

...

Two green bottles hanging on the wall,
Two green bottles hanging on the wall,
And if one green bottle should accidentally fall,
There'll be one green bottle hanging on the wall.

One green bottle hanging on the wall,
One green bottle hanging on the wall,
And if one green bottle should accidentally fall,
There'll be no green bottles hanging on the wall.


There are ten stanzas in this song, but I have omitted the middle six to save space here. The general pattern should be obvious, so let us get started with the text generation.

Iteration 1, Step 1 - Pick Examples

In this case, I only want to generate the one song about green bottles on a wall. If you also wanted to be able to produce songs about crystal chandeliers on the ceiling, you should whip up an example of that song to look at in comparison with the basic green bottle version, too.

Iteration 1, Step 2 - Analyze Structure

Since I only have one version of the song, the most coarse-grained structural element seems to be the stanzas. If you also had a song about chandeliers on the ceiling, your first structure might be the songs themselves: bottles versus chandeliers. For my single song, the static elements are the blank lines between stanzas. The dynamic elements are the stanzas themselves.

Iteration 1, Step 3 - Decompose Structure

Based on the previous step, I know I am working with blank lines and stanzas. How should I store them in my program? Since I have several of them, a list seems like a good idea. In the following code I have broken up the stanzas into multiple strings. The first eight stanzas are abbreviated to keep this article from being a mile long. You should just imagine that the strings in the program contain the appropriate stanzas in their entirety. There is no real code yet; I have only broken the original text up into stanzas and put them in a list.

Applescript:


set stanzaList to {"Ten...wall.", ¬
   "Nine...wall.", ¬
   "Eight...wall.", ¬
   "Seven...wall.", ¬
   "Six...wall.", ¬
   "Five...wall.", ¬
   "Four...wall.", ¬
   "Three...wall.", ¬
   "Two green bottles hanging on the wall,
Two green bottles hanging on the wall,
And if one green bottle should accidentally fall,
There'll be one green bottle hanging on the wall.", ¬
   "One green bottle hanging on the wall,
One green bottle hanging on the wall,
And if one green bottle should accidentally fall,
There'll be no green bottles hanging on the wall."}

Iteration 1, Step 4 - Code Recomposition

OK, now it is time for some code. This first pass is pretty simple. It is only going to concern itself with assembling the stanzas into the final song text. After all, this first decomposition was pretty simple. It is a good thing to keep the steps simple. The complexity will arise from the repetition and refinement, not from vast stretches of coding.

Applescript:


-- "set stanzaList to" from iteration 1, step 3 not shown in this snippet

to getStanzaText(stanzaInfo)
   return stanzaInfo
end getStanzaText

set theLyricsText to ""
repeat with stanza in stanzaList
   if theLyricsText is not equal to "" then
       set theLyricsText to theLyricsText & return & return
   end if
   set theLyricsText to theLyricsText & getStanzaText(contents of stanza)
end repeat

If you have used AppleScript's text item delimiters much, you may be thinking, "Why not just set the text item delimiters to {return & return} and concatenate the list that way?" Sure, that would work as the code stands right now. However, you should keep in mind that this is an iterative process. What was originally one big long string (the entire lyrics) is now a list of strings. In the next iteration, the strings in that list will probably be decomposed into something else, hopefully something much shorter. Yes, text item delimiters would work right now, but I am also writing the code with the next iteration held lightly in mind. By "held lightly in mind" I mean that I do not think too much about the details of what will happen in the next iteration. I do, however, remember that the structure of the examples as they stand in the current iteration will probably change drastically during subsequent iterations.

That is also what the getStanzaText() handler is about. It looks pretty silly right now since it immediately returns the value it was passed. But in the next iteration it will be rewritten to recombine the pieces of decomposed stanzas. When you are following this process on your own, you do not have to anticipate these abstractions, but you might come to appreciate them if you keep them in mind as your work progresses.
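
For comparison, the text item delimiters alternative mentioned above might look like the following sketch. It assumes stanzaList is still a list of plain strings, as it is at this point in the first iteration. The two abbreviated stanzas are stand-ins, and savedDelimiters is a name introduced just for this example.

```applescript
-- Stand-in data: two abbreviated stanzas, as in iteration 1, step 3.
set stanzaList to {"Ten...wall.", "Nine...wall."}

-- Remember the current delimiters so they can be restored afterwards.
set savedDelimiters to AppleScript's text item delimiters
set AppleScript's text item delimiters to {return & return}
-- Coercing a list of strings to text joins the items with the delimiter.
set theLyricsText to stanzaList as text
set AppleScript's text item delimiters to savedDelimiters

theLyricsText
```

This shortcut depends on every item already being a finished string. Once the items become lists or records, the coercion no longer yields the stanza text, which is why the loop above goes through getStanzaText() instead.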

Iteration 2, Step 1 - Pick Examples

The first iteration started with the whole lyrics (the only example) in a single string and yielded a string for each stanza and some code to concatenate them together. The task in this iteration is to distill the stanzas down to their overall static and dynamic parts and then write code to reconstitute the stanzas from those distilled parts. Since we already have the stanzas from the previous iteration those will be the examples for this iteration.

Iteration 2, Step 2 - Analyze Structure

The examples for this iteration are stanzas. What is a bit more fine-grained than a stanza? Maybe the individual lines would make a good candidate for the analysis in this iteration. Here is one way to break down the lines into static and dynamic elements:

Static: the line breaks between the lines and the 3rd line
Dynamic: the 1st line, the 2nd line (same as the first), and the 4th line

Iteration 2, Step 3 - Decompose Structure

Effectively there are only two dynamic items per stanza at this point: the first line and the fourth line. The second line changes between stanzas, but because it is a duplicate of the first, we do not need to represent it separately. So, with each stanza being represented by two items, how should we store them in the script? There are a couple of options here. We could use another list, or since there are always exactly two items, we could use a record. Either way, remember that this is just an interim solution until the next iteration (where it will change yet again). Here is one possible representation using a list for each stanza (all inside the original list for the whole song):

Applescript:


set stanzaList to {{"Ten...", ¬
"...wall."}, ¬
{"Nine...", ¬
"...wall."}, ¬
{"Eight...", ¬
"...wall."}, ¬
{"Seven...", ¬
"...wall."}, ¬
{"Six...", ¬
"...wall."}, ¬
{"Five...", ¬
"...wall."}, ¬
{"Four...", ¬
"...wall."}, ¬
{"Three...", ¬
"...wall."}, ¬
{"Two green bottles hanging on the wall,", ¬
"There'll be one green bottle hanging on the wall."}, ¬
{"One green bottle hanging on the wall,", ¬
"There'll be no green bottles hanging on the wall."}}

To do this, I edited the script source to break each stanza string up into a list of two strings. In terms of actual edits to the script code, I added an open-curly-brace and a close-curly-brace (respectively) before and after each stanza string. Then I replaced the second and third lines of each stanza (including line breaks) with a double-quote, a comma, a space, and another double-quote. The surrounding list structure was maintained, giving me a list of lists.
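
The record alternative mentioned in this step could look something like the sketch below. The labels firstLineText and fourthLineText are assumptions chosen for illustration, and only the last two stanzas are shown.

```applescript
-- Hypothetical record-based representation of the last two stanzas.
set stanzaList to {{firstLineText:"Two green bottles hanging on the wall,", fourthLineText:"There'll be one green bottle hanging on the wall."}, ¬
   {firstLineText:"One green bottle hanging on the wall,", fourthLineText:"There'll be no green bottles hanging on the wall."}}

-- With records, the dynamic parts are fetched by label instead of by position.
set lastStanza to item -1 of stanzaList
fourthLineText of lastStanza
```

Labels document what each item means, at the cost of a noisier literal. Since this representation only lives until the next iteration, either choice is fine.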

Iteration 2, Step 4 - Code Recomposition

Now it is time to bring together the static elements with the dynamic elements from the two previous steps. Since we already have a handler that produces the stanza text (remember the useless looking getStanzaText() from the previous iteration?), we can use it to hold the code to build the stanza from this new representation. One way to write it might look like this:

Applescript:


-- "set stanzaList to" from iteration 2, step 3 not shown in this snippet

-- main loop from iteration 1, step 4 not shown in this snippet

to getFirstLineText(stanzaInfo)
   item 1 of stanzaInfo
end getFirstLineText

to getFourthLineText(stanzaInfo)
   item 2 of stanzaInfo
end getFourthLineText

to getStanzaText(stanzaInfo)
   set firstLine to getFirstLineText(stanzaInfo)
   set secondLine to firstLine
   set thirdLine to "And if one green bottle should accidentally fall,"
   set fourthLine to getFourthLineText(stanzaInfo)
   return firstLine & return & secondLine & return & thirdLine & return & fourthLine
end getStanzaText

Iteration 3, Step 1 - Pick Examples

OK, so after the first iteration, this isn't really a step anymore. What this step would normally produce is always produced in the previous iteration at step 3. At this point, we are working with a list of two element lists that contain the text of the first line and the text of the fourth line of each stanza.

Iteration 3, Step 2 - Analyze Structure

If you look at just the first eight stanzas, you will find that the only change is in the number words (ten, nine, ...). But there is another part that does not start changing until the last two stanzas: the word "bottles" changes to "bottle" when there is only one left. If you miss something like this early on, it is not the end of the world; you can go back and add support for it later when you realize the omission. Here is one possible analysis of the dynamic parts of the text of the first and fourth lines (the dynamic parts are enclosed in square brackets):

First lines:
"[Two] green [bottles] hanging on the wall,"
"[One] green [bottle] hanging on the wall,"
Fourth lines:
"There'll be [one] green [bottle] hanging on the wall."
"There'll be [no] green [bottles] hanging on the wall."


Technically, I could have left "bottle" in the static parts and only used "s" in the dynamic part. However, since the simple "add an s" pattern does not hold for all plurals, I decided to keep the whole word together as one dynamic element. There are no strict rules about where the boundary should be drawn. If you wanted to generate variations of the song about wall-mounted items other than green bottles, you could pull the adjective "green " into the dynamic element (giving "green bottles"/"green bottle"). This may even be mandatory in some languages (like Spanish) where the adjectives must also be modified when forming plural noun phrases. If you think you might want to vary the adjective and noun independently, then you would need to split them up into two separate dynamic elements. Making these kinds of judgments is greatly aided by having picked an exhaustive set of examples in step 1 of the first iteration. So, if you know that you want your song to be about something more than just green bottles, you should be sure to start by picking example texts that illustrate that requirement.
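
As a rough sketch of the "pull the adjective into the dynamic element" option, the handler below treats the whole noun phrase as a single dynamic element. The handler name getItemLineText, the label itemPhrase, and the "brass lantern" example are all hypothetical. They are not part of the song or of the code developed in this article.

```applescript
-- itemPhrase is a hypothetical label holding the adjective and noun together.
to getItemLineText(lineInfo)
   return (numberWord of lineInfo) & " " & (itemPhrase of lineInfo) & " hanging on the wall,"
end getItemLineText

set bottleLine to getItemLineText({numberWord:"Two", itemPhrase:"green bottles"})
set lanternLine to getItemLineText({numberWord:"One", itemPhrase:"brass lantern"})
{bottleLine, lanternLine}
```

Keeping the adjective and noun in one string means the data has to supply both the singular and plural phrases, but the code never has to know how plurals are formed.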

Iteration 3, Step 3 - Decompose Structure

There are two dynamic elements per line. Each line in the previous example data structure will be replaced by a new data structure that holds two items. Last time we used a plain list, so for variety this time we will use a record. Here is a decomposition of the lines into a record that holds a "number word" and a "bottles word".

Applescript:


set stanzaList to {{{numberWord:"Ten", bottleWord:"bottles"}, {numberWord:"nine", bottleWord:"bottles"}}, {{numberWord:"Nine", bottleWord:"bottles"}, {numberWord:"eight", bottleWord:"bottles"}}, {{numberWord:"Eight", bottleWord:"bottles"}, {numberWord:"seven", bottleWord:"bottles"}}, {{numberWord:"Seven", bottleWord:"bottles"}, {numberWord:"six", bottleWord:"bottles"}}, {{numberWord:"Six", bottleWord:"bottles"}, {numberWord:"five", bottleWord:"bottles"}}, {{numberWord:"Five", bottleWord:"bottles"}, {numberWord:"four", bottleWord:"bottles"}}, {{numberWord:"Four", bottleWord:"bottles"}, {numberWord:"three", bottleWord:"bottles"}}, {{numberWord:"Three", bottleWord:"bottles"}, {numberWord:"two", bottleWord:"bottles"}}, {{numberWord:"Two", bottleWord:"bottles"}, {numberWord:"one", bottleWord:"bottle"}}, {{numberWord:"One", bottleWord:"bottle"}, {numberWord:"no", bottleWord:"bottles"}}}

Wow, that looks a bit noisy. Oh well, I have a feeling that we will be able to clean it up a bit in the next (final) iteration. Again, the decomposition is done while preserving the surrounding data structure. The new result is a list containing lists of two records with two properties.

Iteration 3, Step 4 - Code Recomposition

For this, we will rewrite the getFirstLineText() and getFourthLineText() handlers to use the new data structures. They will combine the static elements of the first and fourth lines with their dynamic elements.

Applescript:


-- "set stanzaList to" from iteration 3, step 3 not shown in this snippet

-- getStanzaText() handler from iteration 2, step 4 not shown in this snippet

-- main loop from iteration 1, step 4 not shown in this snippet

to getNumberText(lineInfo)
   numberWord of lineInfo
end getNumberText

to getBottlesText(lineInfo)
   bottleWord of lineInfo
end getBottlesText

to getFirstLineText(stanzaInfo)
   set theNumberText to getNumberText(item 1 of stanzaInfo)
   set theBottleText to getBottlesText(item 1 of stanzaInfo)
   return theNumberText & " green " & theBottleText & " hanging on the wall,"
end getFirstLineText

to getFourthLineText(stanzaInfo)
   set theNumberText to getNumberText(item 2 of stanzaInfo)
   set theBottlesText to getBottlesText(item 2 of stanzaInfo)
   "There'll be " & theNumberText & " green " & theBottlesText & " hanging on the wall."
end getFourthLineText

Iteration 4, Step 1 - Pick Examples

There is nothing to do for this step. The requirements have already been met by the actions taken in iteration 3, step 3.

Iteration 4, Step 2 - Analyze Structure

At this point, we have reduced the song down to just lowercase number words, number words with an uppercase first letter, and the plural and singular versions of the word "bottle". We can still distill the structure down a bit further though. The number word used in the fourth line is always one smaller than the number word used in the first line. So, both can be determined from a single value. Also, the singular version of bottle is only used where the associated number is one. Every other number uses the plural form. That means we can also determine which form of the word bottle to use based on a single value. If everything can be determined by a single number, why not use that number to describe each stanza instead of two pairs of strings?

Iteration 4, Step 3 - Decompose Structure

Using just a single number for each stanza sure cleans up the example list:

Applescript:


set stanzaList to {10, 9, 8, 7, 6, 5, 4, 3, 2, 1}

Iteration 4, Step 4 - Code Recomposition

Now we need to generate a first letter uppercase number word for the first line, and a lowercase number word for the fourth line. The bottle words for both lines are identical, since they correspond directly to the number associated with the line (though the number for the fourth line is always one below the first line's number). We need two different number words, but we only have one handler to get number words, so we will have to add another one and change one of the line generation handlers to use the new one. Both of the line generation handlers will have to be changed anyway since the data structure they pass around was changed drastically. Here is one way to go about the changes:

Applescript:


-- "set stanzaList to" from iteration 4, step 3 not shown in this snippet

-- getStanzaText() handler from iteration 2, step 4 not shown in this snippet

-- main loop from iteration 1, step 4 not shown in this snippet

to getFirstUpperCaseNumberText(lineInfo)
   item (lineInfo + 1) of {"No", "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Ten"}
end getFirstUpperCaseNumberText

to getLowerCaseNumberText(lineInfo)
   item (lineInfo + 1) of {"no", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"}
end getLowerCaseNumberText

to getBottlesText(lineInfo)
   if lineInfo is equal to 1 then return "bottle"
   return "bottles"
end getBottlesText

to getFirstLineText(stanzaInfo)
   set theNumberText to getFirstUpperCaseNumberText(stanzaInfo)
   set theBottleText to getBottlesText(stanzaInfo)
   return theNumberText & " green " & theBottleText & " hanging on the wall,"
end getFirstLineText

to getFourthLineText(stanzaInfo)
   set theNumberText to getLowerCaseNumberText(stanzaInfo - 1)
   set theBottlesText to getBottlesText(stanzaInfo - 1)
   "There'll be " & theNumberText & " green " & theBottlesText & " hanging on the wall."
end getFourthLineText

"Ten Green Bottles" - Done!

When the last code snippet is combined with the pieces of code mentioned in its comments, it will produce the entire lyrics of the song. Really, the only bit of structure left is the duplication between the two lists of number words. That can be taken out by writing a handler to convert the first letter of a string to upper case (or using one from a library or an OSAX). Now that the stanzas are represented by a sequence of numbers, the stanza list can be subsumed into the repeat loop itself. When I took care of these last few things, I ended up with this complete program:

Applescript:


property lettersLowerUpper : "aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ"

-- This only works for ASCII, so avoid using it when the input might not be ASCII.
to upCaseFirst(aWord)
   local firstLetter, letterPosition, newFirstLetter, newWord
   if length of aWord is equal to 0 then return ""
   set firstLetter to first character of aWord
   set letterPosition to offset of firstLetter in lettersLowerUpper
   if letterPosition mod 2 is equal to 0 then return aWord
   set newFirstLetter to character (letterPosition + 1) of lettersLowerUpper
   if length of aWord is equal to 1 then return newFirstLetter
   set newWord to newFirstLetter & text 2 through -1 of aWord
end upCaseFirst

to getFirstUpperCaseNumberText(lineInfo)
   upCaseFirst(getLowerCaseNumberText(lineInfo))
end getFirstUpperCaseNumberText

to getLowerCaseNumberText(lineInfo)
   item (lineInfo + 1) of {"no", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"}
end getLowerCaseNumberText

to getBottlesText(lineInfo)
   if lineInfo is equal to 1 then return "bottle"
   return "bottles"
end getBottlesText

to getFirstLineText(stanzaInfo)
   set theNumberText to getFirstUpperCaseNumberText(stanzaInfo)
   set theBottleText to getBottlesText(stanzaInfo)
   return theNumberText & " green " & theBottleText & " hanging on the wall,"
end getFirstLineText

to getFourthLineText(stanzaInfo)
   set theNumberText to getLowerCaseNumberText(stanzaInfo - 1)
   set theBottlesText to getBottlesText(stanzaInfo - 1)
   "There'll be " & theNumberText & " green " & theBottlesText & " hanging on the wall."
end getFourthLineText

to getStanzaText(stanzaInfo)
   set firstLine to getFirstLineText(stanzaInfo)
   set secondLine to firstLine
   set thirdLine to "And if one green bottle should accidentally fall,"
   set fourthLine to getFourthLineText(stanzaInfo)
   return firstLine & return & secondLine & return & thirdLine & return & fourthLine
end getStanzaText

to getTenGreenBottlesLyrics()
   set theLyricsText to ""
   repeat with stanza from 10 to 1 by -1
       if theLyricsText is not equal to "" then
           set theLyricsText to theLyricsText & return & return
       end if
       set theLyricsText to theLyricsText & getStanzaText(stanza)
   end repeat
   theLyricsText
end getTenGreenBottlesLyrics

set theLyricsText to getTenGreenBottlesLyrics()

What Was Accomplished

I started with an 1800-character song and changed it into a 2200-character AppleScript program. Well, since I did not manage to achieve any data compression, I hope I have at least given you something to think about. In practice, the process is not always this lengthy or tedious. As you gain experience doing the analysis and breaking down the examples, you will gradually start to employ larger-scale transformations in each iteration, and something like the song in this tutorial would take only one or two iterations instead of four or five.

What Use Is This, Anyway?

There are plenty of productive uses for text generation. A custom blog publishing program would use text generation to massage blog entries into full HTML pages and associated index pages. Do you need to get some of the metadata from your photo library to a buddy that does not have a Mac? Extract the data and write it out as CSV so your buddy can read it on his PC. For really advanced stuff, when combined with run script you can even do metaprogramming. Although it is all too easy to misapply metaprogramming, sometimes it turns out to be the only way to solve some programming problems in vanilla AppleScript.
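
To give the CSV case a little shape, here is a minimal sketch of generating one CSV row with the same static/dynamic split used for the song. All three handler names are hypothetical, and the sample field values are invented. As a simplification, the sketch encloses every field in double quotes and doubles any embedded double quotes, which RFC 4180 allows. Joining rows with line breaks is left out.

```applescript
-- Replace every occurrence of searchText in sourceText (helper for this sketch).
to replaceText(sourceText, searchText, replacementText)
   set savedDelimiters to AppleScript's text item delimiters
   set AppleScript's text item delimiters to {searchText}
   set textItems to text items of sourceText
   set AppleScript's text item delimiters to {replacementText}
   set resultText to textItems as text
   set AppleScript's text item delimiters to savedDelimiters
   return resultText
end replaceText

-- Dynamic part: the field value. Static parts: the surrounding double quotes.
to getCSVFieldText(fieldValue)
   return "\"" & replaceText(fieldValue as text, "\"", "\"\"") & "\""
end getCSVFieldText

-- Dynamic parts: the fields. Static parts: the commas between them.
to getCSVRowText(fieldList)
   set rowText to ""
   repeat with fieldValue in fieldList
      if rowText is not equal to "" then set rowText to rowText & ","
      set rowText to rowText & getCSVFieldText(contents of fieldValue)
   end repeat
   return rowText
end getCSVRowText

getCSVRowText({"IMG_0001.jpg", "Beach, at sunset", "She said \"wow\""})
```

The comma inside the second field and the quotes inside the third are the kinds of variations that a good set of examples should surface in step 1.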


--
Chris


Filed under: text, Structured, Crys
