Monday, January 22, 2018

#1 2018-01-13 11:08:28 am

bmose
Member
From: Massachusetts
Registered: 2006-01-03
Posts: 231

Get each line ending in a multi-line text string

It is sometimes desirable to programmatically modify a multi-line text string whose line endings differ from line to line while preserving the existing line endings. This problem might arise, for example, when one wishes to programmatically edit a multi-line script that contains both AppleScript and shell script code. Lines of AppleScript code would typically end with ASCII 13 carriage return [CR] characters, whereas lines of shell script code would typically end with ASCII 10 linefeed [LF] characters. The following is a contrived script for purposes of demonstration. The [CR] and [LF] markers have been added simply to show the otherwise invisible line endings; they are not part of the actual text:

Applescript:


-- The line endings may be Mac-style [CR], Unix-style [LF], or even Dos-style [CR-LF]; the "paragraphs" property will properly handle all line ending forms[CR]
set var1 to paragraphs of ("good line[CR]
bad line[CR]
good line[CR]
another bad line[CR]
good line[CR]
this line is bad[CR]
good line")[CR]
[CR]
-- The line endings MUST be Unix-style [LF] in order for the "sed" command to process the lines properly and convert "paragraph" to "line"[CR]
set var2 to paragraphs of (do shell script "echo 'good paragraph[LF]
bad paragraph[LF]
good paragraph[LF]
another bad paragraph[LF]
good paragraph[LF]
this paragraph is bad[LF]
good paragraph' | sed -E 's/paragraph/line/'")[CR]
[CR]
return {var1, var2}

Running this script returns two identical sublists:

Applescript:


{
   {
       "good line",
       "bad line",
       "good line",
       "another bad line",
       "good line",
       "this line is bad",
       "good line"
   },
   {
       "good line",
       "bad line",
       "good line",
       "another bad line",
       "good line",
       "this line is bad",
       "good line"
   }
}

Now let's say that the goal is to programmatically add a number sign "#" to the start of any line containing the substring "bad" while preserving the line endings.  The first step would be to assign a text representation of the script to a variable with a command such as the following:

Applescript:


-- If the script has been saved as an uncompiled .applescript text file:
set oldScript to read "/path/to/MyScript.applescript"

-- Or if the script has been saved as a compiled .scpt file:
set oldScript to do shell script "osadecompile " & "/path/to/MyScript.scpt"'s quoted form

Simply extracting oldScript's lines of text into a list with the paragraphs property then modifying any lines with the substring "bad" wouldn't work because line ending information would be lost. Also, using the sed shell command to perform a search-and-replace action wouldn't work because all line endings would first have to be converted to ASCII 10 linefeed characters, and again line ending information would be lost.
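
As a small illustration of the first point, the following sketch (not part of the original script; the sample string and variable names are made up for the demonstration) shows that paragraphs returns the same list no matter which line endings were present, so the original mix cannot be rebuilt from the list alone:

```applescript
-- Build a small sample that mixes all three line ending styles
set sample to "one" & return & "two" & linefeed & "three" & return & linefeed & "four"

-- "paragraphs" splits at every style of line ending and discards the endings
set theLines to paragraphs of sample
--> {"one", "two", "three", "four"}

-- Rejoining with any single delimiter cannot restore the original mix
set savedDelimiters to AppleScript's text item delimiters
set AppleScript's text item delimiters to linefeed
set rejoined to theLines as text
set AppleScript's text item delimiters to savedDelimiters

rejoined is sample --> false
```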

One approach that works is to process the text string character by character. The downside is the substantial amount of pattern-recognition coding required and, especially for large files, the potentially slow execution speed. Another approach would be the creative use of text (paragraph [N]) thru (paragraph [N+1]) statements. While this can be done, it suffers the same problems of coding complexity and potentially inefficient execution speed as the previous approach.
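
To give a sense of what the character-by-character approach involves, here is a minimal sketch that only detects the line endings by walking the text's character ids. The handler name scanLineEndingIds is hypothetical, and the sketch assumes AppleScript 2.0 or later, where text has an id property. Editing the lines as well would need still more bookkeeping:

```applescript
on scanLineEndingIds(theText)
	-- Hypothetical helper: walk the code points and record each line ending found
	set foundIds to {}
	if theText is "" then return foundIds
	-- "id of" yields a bare integer for one-character text, so coerce to a list
	set codePoints to (id of theText) as list
	set n to count of codePoints
	set i to 1
	repeat while i <= n
		set thisId to item i of codePoints
		if thisId is 13 then
			if (i < n) and ((item (i + 1) of codePoints) is 10) then
				-- CR immediately followed by LF is one Dos-style ending
				set end of foundIds to {13, 10}
				set i to i + 1
			else
				set end of foundIds to 13
			end if
		else if thisId is 10 then
			set end of foundIds to 10
		end if
		set i to i + 1
	end repeat
	return foundIds
end scanLineEndingIds

scanLineEndingIds("a" & return & "b" & linefeed & "c" & return & linefeed & "d")
--> {13, 10, {13, 10}}
```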

An alternative solution, and the recommended one, is to use AppleScript's remarkably efficient text item delimiters property to split lines while preserving line ending information:

Applescript:


on extractLineEndings(theText)
   -- Extracts all line endings and their ASCII ("character id") values
   set {lineEndings, lineEndingIds} to {{}, {}}
   set tid to AppleScript's text item delimiters
   try
       -- First, split the text at Dos-style (CR-LF) line endings
       set AppleScript's text item delimiters to return & linefeed
       set dosSections to (get theText's text items)
       repeat with iDos from 1 to dosSections's length
           -- Next, split each Dos section of text at Mac-style (CR) line endings
           set AppleScript's text item delimiters to return
           set macSections to (get dosSections's item iDos's text items)
           repeat with iMac from 1 to macSections's length
               -- Finally, split each Mac section of Dos-split text at Unix-style (LF) line endings
               set AppleScript's text item delimiters to linefeed
               set unixSections to (get macSections's item iMac's text items)
               repeat ((unixSections's length) - 1) times
                   -- Save the current sub-subsection's Unix-style line endings
                   set {end of lineEndings, end of lineEndingIds} to {linefeed, 10}
               end repeat
               -- Save the current subsection's Mac-style line endings
               if iMac < macSections's length then set {end of lineEndings, end of lineEndingIds} to {return, 13}
           end repeat
           -- Save the current section's Dos-style line endings
           if iDos < dosSections's length then set {end of lineEndings, end of lineEndingIds} to {return & linefeed, {13, 10}}
       end repeat
   end try
   set AppleScript's text item delimiters to tid
   -- Return the line endings themselves and their ASCII values as properties of a record
   return {lineEndings:lineEndings, lineEndingIds:lineEndingIds}
end extractLineEndings
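
The handler leans on one property of text items: splitting at a delimiter yields exactly one more item than there are occurrences of that delimiter, and anything that is not the delimiter stays inside the pieces untouched. A minimal sketch of a single split pass, with a made-up sample string and variable names:

```applescript
-- Two CR endings and one LF ending
set sample to "one" & return & "two" & linefeed & "three" & return & "four"

set savedDelimiters to AppleScript's text item delimiters
set AppleScript's text item delimiters to return
set crPieces to text items of sample
set AppleScript's text item delimiters to savedDelimiters

-- The LF is still embedded in the second piece, ready for the next split pass
crPieces --> {"one", "two" & linefeed & "three", "four"}
-- Number of CR endings found in this pass
(count of crPieces) - 1 --> 2
```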

Running this handler on the sample script confirms that the line endings have been extracted properly:

Applescript:


extractLineEndings(oldScript)'s lineEndingIds -->
{
   13,
   13,
   13,
   13,
   13,
   13,
   13,
   13,
   13,
   13,
   10,
   10,
   10,
   10,
   10,
   10,
   13,
   13
}

One can then easily perform the desired text modifications by extracting the lines of text and line endings separately, then reconstructing the script with the modified lines of text and old line endings:

Applescript:


-- Extract the lines of text and line endings separately
set {oldLines, lineEndings} to {oldScript's paragraphs, extractLineEndings(oldScript)'s lineEndings}
set newLinesWithLineEndings to {}
-- Process the text one line at a time
repeat with i from 1 to oldLines's length
   -- Get the current line of text and line ending
   -- Default the line ending to the empty string, which is what the last line of text keeps, because there will always be one fewer line ending than the total number of lines
   set {currLine, currLineEnding} to {oldLines's item i, ""}
   if i < oldLines's length then set currLineEnding to lineEndings's item i
   -- Prefix the line with a number sign if it contains the substring "bad"
   tell currLine to if it contains "bad" then set currLine to "#" & it
   -- Restore the old line ending to the line
   set end of newLinesWithLineEndings to currLine & currLineEnding
end repeat
-- Reconstruct the script from the individual modified lines
set {tid, AppleScript's text item delimiters} to {AppleScript's text item delimiters, ""}
set newScript to newLinesWithLineEndings as text
set AppleScript's text item delimiters to tid

Running the modified script once again returns two identical sublists, now with a number sign at the start of lines containing the substring "bad":

Applescript:


run script newScript -->
{
   {
       "good line",
       "#bad line",
       "good line",
       "#another bad line",
       "good line",
       "#this line is bad",
       "good line"
   },
   {
       "good line",
       "#bad line",
       "good line",
       "#another bad line",
       "good line",
       "#this line is bad",
       "good line"
   }
}

And running the handler on the modified script confirms that the line endings have been preserved:

Applescript:


extractLineEndings(newScript)'s lineEndingIds -->
{
   13,
   13,
   13,
   13,
   13,
   13,
   13,
   13,
   13,
   13,
   10,
   10,
   10,
   10,
   10,
   10,
   13,
   13
}

A couple of final points. As an "acid" test, I ran the extractLineEndings handler on a .applescript text file that is over 25,000 lines long, with mostly carriage return line endings but also numerous linefeed and a few carriage return-linefeed line endings, and a multitude of single- and multi-byte non-ASCII characters. [With a huge sigh of relief] I found that it extracted the line endings perfectly. Also, I compared execution speed with an AppleScriptObjC (ASOC) version of the handler that has the same structure but transforms the input text string into an NSString and performs the splits with NSString's componentsSeparatedByString method rather than AppleScript's text item delimiters property. For text strings up to about 2500 lines, the AppleScript version executed faster than the AppleScriptObjC version, significantly so for text strings of about 500 lines or fewer. This difference likely reflects a combination of text item delimiters' efficiency and the overhead of the ASOC bridge, which ASOC overcomes only at the largest text string sizes:

# text lines -> AppleScript/ASOC execution time ratio (values below 1 mean AppleScript was faster)
-----------------------------------------------------------
       10 -> 0.16
       20 -> 0.17
       40 -> 0.15
       80 -> 0.17
      160 -> 0.20
      320 -> 0.31
      640 -> 0.45
     1280 -> 0.71
     2560 -> 0.90
     5120 -> 1.18
    10240 -> 1.42
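
The ASOC handler used for the timing comparison is not shown in this post. Purely as an illustration of the kind of split described above, and not the code that was actually timed, a single ASOC split might look something like this. The sample string and variable names are made up, and the use statements assume OS X 10.10 or later:

```applescript
use AppleScript version "2.4" -- OS X 10.10 or later
use framework "Foundation"
use scripting additions

set cr to character id 13
set sample to "one" & cr & "two" & cr & "three"

-- Bridge the AppleScript text to an NSString, split it, and bridge the result back to a list
set nsText to current application's NSString's stringWithString:sample
set crPieces to (nsText's componentsSeparatedByString:cr) as list

crPieces --> {"one", "two", "three"}
```

Each such call crosses the bridge in both directions, which is the overhead referred to above.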

Last edited by bmose (2018-01-13 10:55:16 pm)
