Help with finding and moving text in a large document

I posted this earlier to the TextWrangler list, and thought I’d post here as well (I can use all the help I can get).

I’ve been using AppleScript and TextWrangler to massage large volumes of XML files, and it’s working very well. I just hit a new problem, though, while trying to massage some large MIF text files, and I’m wondering whether I should concentrate on learning grep more thoroughly or lean more on AppleScript to handle things. Knowing where to start will (hopefully) save me a huge amount of valuable time.

In a nutshell: I need to scour a large MIF file containing Marker tags of various types inside Para and ParaLine tags and, depending on what type of marker each one is, move it to the beginning or end of its <ParaLine tag, allowing in some cases for a <TextRectID ###> between the start of the <ParaLine and the start of the <Marker tag.

The following example contains two Markers; the first, a ‘Cross-Ref’, is located correctly, while the second, a ‘Type 18’, should be moved to just before the closing of the ParaLine tag:

Before:
<Para
 <Unique 1009287>
 <PgfTag `HeadP18'>
 <PgfNumString `5.1'>
 <ParaLine
  <TextRectID 581>
  <Marker
   <MType 9>
   <MTypeName `Cross-Ref'>
   <MText `SPECIFICATIONS'>
   <MCurrPage `1'>
   <Unique 1026941>
  > # end of Marker
  <String `SP'>
  <String `E'>
  <Marker
   <MType 18>
   <MTypeName `Type 18'>
   <MText `Specifications'>
   <MCurrPage `1'>
   <Unique 1026948>
  > # end of Marker
  <String `CIFICATIO'>
  <String `N'>
  <String `S'>
 > # end of ParaLine
> # end of Para

After:
<Para
 <Unique 1009287>
 <PgfTag `HeadP18'>
 <PgfNumString `5.1'>
 <ParaLine
  <TextRectID 581>
  <Marker
   <MType 9>
   <MTypeName `Cross-Ref'>
   <MText `SPECIFICATIONS'>
   <MCurrPage `1'>
   <Unique 1026941>
  > # end of Marker
  <String `SP'>
  <String `E'>
  <String `CIFICATIO'>
  <String `N'>
  <String `S'>
  <Marker
   <MType 18>
   <MTypeName `Type 18'>
   <MText `Specifications'>
   <MCurrPage `1'>
   <Unique 1026948>
  > # end of Marker
 > # end of ParaLine
> # end of Para

I’ve been having great luck using AppleScript with TextWrangler’s grep searches, but I haven’t done anything this complex (search, test, and move if appropriate, rather than search and replace). The files can run as high as 100,000 to 200,000 lines, though they’re short lines. I have a large batch of these files coming in around the end of the month, and it would be really great if I could do most of the cleaning using AppleScript (or some other scripting language, if needed… and relatively easy to get up to speed on) instead of moving the markers and such manually.

As always, any help is greatly appreciated.

I don’t have a solution, but the problem seems to be perfectly suited to awk. Why don’t you post your question to the newsgroup comp.lang.awk? I am sure they can solve this for you.

Andy

Model: G5 Dual 2.7 GHz
AppleScript: 2.0 (v43.1)
Browser: Safari 312
Operating System: Mac OS X (10.4)

Thanks Andy, I’ll keep it in mind. I do like AppleScript very much, and this problem is the last one I have to solve; the other 90% have been solved using AppleScript and TextWrangler’s grep-based Search and Replace. If at all possible, I’d really like to try to solve it with AppleScript before adding another language to the mix.

I realize that there’s a very real possibility that I won’t be able to handle it in AS, but I’d like to take one more serious try before bailing.

Thanks very much for your reply, it’s appreciated. I hadn’t looked at awk before, it’s nice to know about other options.

– Walt Sterdan

Thanks for your response. I will follow and see if an AS solution is available. If you have a handle on grep, you will find awk to be a natural extension. Good Luck!

Andy

Thanks, I’ll need it. I’m still far too new at grepping to say I have a handle on it.

I’ve taken a swing at the problem and broken it down to mainly some simple (?) string parsing. Setting my text item delimiter to “<ParaLine” splits the document into multi-paragraph text items, each of which can be dealt with as a single string. The problem now is reading each text item line by line until a line containing “> # end of ParaLine” is found, which gives a substring containing only the ParaLine. I then need to check that ParaLine for Marker substrings (e.g. <Marker … > # end of Marker) and, if any are found, check each one’s type, moving the Cross-Ref type to the front of the string and any other Marker types to the end of the string.

Any ideas on the easiest way to do this using Applescript’s limited string parsing set? Any simple way to read in the string, break it into appropriate substrings and, on checking the substrings, re-ordering them?
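The splitting step described above can be sketched with text item delimiters alone. This is only a minimal illustration: the handler names splitOn and joinWith are made up for this example, and the sample text is an invented, cut-down MIF fragment rather than real data.

```applescript
-- Sketch only: splitOn and joinWith are hypothetical helper names.

-- Split theText into a list of chunks wherever theDelimiter occurs
on splitOn(theText, theDelimiter)
	set oldTIDs to AppleScript's text item delimiters
	set AppleScript's text item delimiters to theDelimiter
	set theItems to text items of theText
	set AppleScript's text item delimiters to oldTIDs
	return theItems
end splitOn

-- Glue a list of chunks back together with theDelimiter between them
on joinWith(theItems, theDelimiter)
	set oldTIDs to AppleScript's text item delimiters
	set AppleScript's text item delimiters to theDelimiter
	set theText to theItems as string
	set AppleScript's text item delimiters to oldTIDs
	return theText
end joinWith

-- Invented sample: one Para holding two ParaLines
set sampleText to "<Para" & return & "<ParaLine" & return & "<String `A'>" & return & "> # end of ParaLine" & return & "<ParaLine" & return & "<String `B'>" & return & "> # end of ParaLine" & return & "> # end of Para"

-- First chunk is everything before the first <ParaLine; each later chunk
-- starts just after a <ParaLine and runs up to the next one
set chunks to splitOn(sampleText, "<ParaLine")

-- Joining with the same delimiter should give back the original text
set rebuilt to joinWith(chunks, "<ParaLine")
rebuilt is sampleText
```

Because the delimiter itself is dropped from the chunks, every chunk after the first begins inside a ParaLine, so each chunk can be rearranged on its own before the list is joined again.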

As always, any help is appreciated.

– Walt Sterdan

Hi,

Can you search for:

<Marker
 <MType 18>
 <MTypeName `Type 18'>
 <MText `Specifications'>
 <MCurrPage `1'>
 <Unique 1026948>
> # end of Marker

cut it. Then search for:

> # end of ParaLine

and replace with:

<Marker
 <MType 18>
 <MTypeName `Type 18'>
 <MText `Specifications'>
 <MCurrPage `1'>
 <Unique 1026948>
> # end of Marker
> # end of ParaLine

I don’t have TextWrangler, so don’t know if it can do that. Otherwise, you can do this with AppleScript.
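For what it’s worth, the cut-and-reinsert idea can be sketched in plain AppleScript with text item delimiters. The handler name moveBlockBeforeAnchor is hypothetical, the sample text is invented, and the sketch only handles one literal block whose exact text is already known, which is narrower than the general problem in this thread.

```applescript
-- Sketch only: moveBlockBeforeAnchor is a hypothetical helper name.
-- Removes the first occurrence of blockText and re-inserts it immediately
-- before the first anchorText that follows it. If either piece is missing,
-- the text is returned unchanged.
on moveBlockBeforeAnchor(theText, blockText, anchorText)
	set oldTIDs to AppleScript's text item delimiters
	
	-- Cut: split around the block
	set AppleScript's text item delimiters to blockText
	set parts to text items of theText
	if (count of parts) < 2 then
		set AppleScript's text item delimiters to oldTIDs
		return theText -- block not found
	end if
	set beforeBlock to item 1 of parts
	-- Rejoin the remainder with blockText so later copies stay in place
	set afterBlock to (items 2 thru -1 of parts) as string
	
	-- Find the anchor in the text that followed the block
	set AppleScript's text item delimiters to anchorText
	set tailParts to text items of afterBlock
	if (count of tailParts) < 2 then
		set AppleScript's text item delimiters to oldTIDs
		return theText -- anchor not found after the block
	end if
	set upToAnchor to item 1 of tailParts
	set fromAnchor to (items 2 thru -1 of tailParts) as string
	
	set AppleScript's text item delimiters to oldTIDs
	-- Paste: the block now sits directly in front of the anchor
	return beforeBlock & upToAnchor & blockText & anchorText & fromAnchor
end moveBlockBeforeAnchor

-- Invented sample; the block includes its trailing return so lines stay separate
set sampleText to "<ParaLine" & return & "<Marker" & return & "> # end of Marker" & return & "<String `S'>" & return & "> # end of ParaLine" & return
set markerBlock to "<Marker" & return & "> # end of Marker" & return
moveBlockBeforeAnchor(sampleText, markerBlock, "> # end of ParaLine")
```

Including the trailing return in the block keeps the moved marker on its own lines instead of running into the line after it.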

Edited: I copied and pasted the wrong text.

gl,

Hi Kel;

I’m hoping to do it all with Applescript.

You could search and replace as above, but I haven’t explained the task very well. Basically, there’s a very long MIF file containing “<ParaLine … > # end of ParaLine” segments that often (but not always) contain one or more “<Marker … > # end of Marker” segments, and each Marker segment contains a line describing its type (e.g. <MType 18>, <MType 9>, etc.).

I have to check the “<ParaLine”-delimited text item for markers and, if it has any, check what types they are, moving the crossRef types to the front of the text item while moving any other Marker types to the back of the string.
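The type check described here can be sketched as a small handler that pulls the number out of a marker’s <MType n> line. The names markerType and belongsAtFront are hypothetical, and the front/back rule coded below is only the one stated in this post (Cross-Ref, type 9, to the front; everything else to the back).

```applescript
-- Sketch only: markerType and belongsAtFront are hypothetical helper names.

-- Returns the number from the "<MType n>" line of one Marker block,
-- or missing value if the block has no such line.
on markerType(markerText)
	repeat with aLine in paragraphs of markerText
		set lineText to aLine as string
		-- The trailing space keeps "<MTypeName ..." lines from matching
		if lineText contains "<MType " then
			set oldTIDs to AppleScript's text item delimiters
			set AppleScript's text item delimiters to "<MType "
			set afterTag to text item 2 of lineText
			set AppleScript's text item delimiters to ">"
			set typeText to text item 1 of afterTag
			set AppleScript's text item delimiters to oldTIDs
			return typeText as integer
		end if
	end repeat
	return missing value
end markerType

-- Rule as stated in the post: Cross-Ref markers (type 9) go to the front
on belongsAtFront(markerText)
	return markerType(markerText) is 9
end belongsAtFront

set sampleMarker to "<Marker" & return & " <MType 18>" & return & " <MTypeName `Type 18'>" & return & "> # end of Marker"
belongsAtFront(sampleMarker)
```

Testing the number rather than the name text avoids depending on the quoting in the <MTypeName line.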

While I have done the rest of the file cleaning inside of TextWrangler using simple search and replace, occasionally using simple grep statements (they’re still very new to me), I feel that this final task won’t use TextWrangler, that Applescript should be able to carry the day (barring that, I’ll have to learn another language – ouch).

A big draw with Applescript is that it’s already installed on all of the work machines; using anything else would most likely mean installing software on any machines that might need to clean MIF files, something I’m hesitant to do.

Thanks for your reply, it’s appreciated.

– Walt Sterdan

I got a script working, then was informed that not only do the markers have to be moved to the front and rear of the <ParaLine tags, but glossary markers also have to be moved to the first <ParaLine tag nested within their <Para tag (a paragraph can have many lines; glossary markers must be at the front of the first line in the paragraph).

A second program (based on the first, using "<Para " as the delimiter instead of “<ParaLine”) solved the problem; I’ve joined them both into a single, crude, brute force script, totally lacking in elegance. :wink:

I’m currently working on wrapping it in a droplet shell, as well as eliminating the dialog boxes used to get the file names. For anyone interested (and I can’t see why anyone would be), here it is:

--    MIF Cleaner v0.04 2006/Mar/27 15:00
--    Pass 1 (moveMarkers): splits the chosen file on "<ParaLine", rearranges the Markers inside each ParaLine and writes the result to "MIF_Temp"
--    Pass 2 (moveGlossaryMarkers): splits the chosen file on "<Para ", moves Glossary markers to the front of the first ParaLine of their Para and writes the result to "Cleaned_MIF"

set vParaLine to "<ParaLine"
set CurrentDoc to (read (choose file))
moveMarkers of CurrentDoc for vParaLine


set vPara to "<Para "
set CurrentDoc to (read (choose file))
moveGlossaryMarkers of CurrentDoc for vPara

--- SUB-ROUTINES ---

to moveMarkers of currText for delimiterString
   with timeout of 10000 seconds (* time-consuming statements *)
       set tid to text item delimiters
       set text item delimiters to delimiterString
       --    set myFile to (open for access file "Test File" with write permission)
       set tempResult to ""
       considering case
           tell currText to if (count text items) is 1 then
               -- No delimiter found, so there is nothing to rearrange; pass the text through unchanged instead of writing an empty file
               set tempResult to currText
           else
               set TextCount to get count of text items in currText
               repeat with counter1 from 1 to TextCount
                   if counter1 = TextCount then
                       set searchResult to (text item counter1 as string)
                   else
                       -- Put back exactly the delimiter that was split on (no extra trailing space)
                       set searchResult to (text item counter1 & delimiterString as string)
                   end if
                   if searchResult contains "<Marker" and counter1 > 1 then
                       -- Break down the current text item line by line (eg. Paragraph by Paragraph), checking for
                       -- Location (eg. currentParagraph contains "Mtype") and type of Marker; once found, and the end of the ParaLine section has been reached (eg. currentParagraph contains "> # end ParaLine")
                       -- we re-order the Markers, moving crossref, Glossary and (?) Type 18 markers to the front of the string, all else to the back, Index marker last
                       (* Marker types: 9 - Cross-Ref 6 = Glossary 8 = Hypertext 12 = Type 12 14 = Type 14 18 = Type 18 2 = Index *)
                       
                       --    set CurrentParagraph to 1
                       set TotalParagraphs to (count of paragraphs in searchResult)
                       set ParaText to ""
                       set FrontMarkers to ""
                       set BackMarkers to ""
                       set tempString to ""
                       set tempMarker to ""
                       set MarkerFlag to "off"
                       set ParaLineFlag to "on"
                       
                       repeat with CurrentParagraph from 1 to TotalParagraphs -- while tempString does not contain "> # end of Para "
                           set tempString to get paragraph CurrentParagraph of searchResult
                           
                           -- If we're currently between <ParaLine and > # end of ParaLine...
                           if (ParaLineFlag = "on") then
                               if (tempString contains "<TextRectID") then
                                   set tempResult to (tempResult & return & tempString & return as string)
                                   
                               else if (tempString contains "<Marker") then
                                   
                                   -- Process the Marker
                                   set MarkerFlag to "on"
                                   set tempMarker to (tempMarker & tempString & return as string)
                               else if ((tempString contains "<MType") or (tempString contains "<MText") or (tempString contains "<MCurr")) then
                                   set tempMarker to (tempMarker & tempString & return as string)
                               else if ((MarkerFlag is "on") and (tempString contains "Unique")) then
                                   set tempMarker to (tempMarker & tempString & return as string)
                               else if (tempString contains "> # end of Marker") then
                                   set MarkerFlag to "off"
                                   set tempMarker to (tempMarker & tempString as string)
                                   
                                   
                                   -- Marker is complete; check type; move appropriately, clear tempMarker 
                                   if (tempMarker contains "<MType 9>") then
                                       set tempMarker to (tempMarker & return & FrontMarkers as string)
                                       set FrontMarkers to (tempMarker)
                                       set tempMarker to ""
                                   else if (tempMarker contains "<MType 2>") then
                                       set BackMarkers to (BackMarkers & tempMarker & return as string)
                                       set tempMarker to ""
                                   else
                                        -- End each marker with a return so two markers in a row do not run together on one line
                                        set FrontMarkers to (FrontMarkers & tempMarker & return as string)
                                       set tempMarker to ""
                                   end if
                                   -- End of Marker Processing
                                   
                                   -- If we're at the end of the ParaLine segment...
                               else if (tempString contains "> # end of ParaLine") then
                                   if (FrontMarkers is not "") then
                                       set tempResult to (tempResult & return & FrontMarkers & ParaText & BackMarkers & tempString & return as string)
                                   else
                                       set tempResult to (tempResult & FrontMarkers & ParaText & BackMarkers & tempString & return as string)
                                   end if
                                   set tempString to ""
                                   set ParaText to ""
                                   set ParaLineFlag to "off"
                               else
                                   set ParaText to (ParaText & tempString & return as string)
                               end if
                           else
                               set tempResult to (tempResult & tempString & return as string)
                           end if
                       end repeat
                       -- set tempResult to (tempResult & return & FrontMarkers & ParaText & BackMarkers)
                       --set FrontMarkers to ""
                       --set BackMarkers to ""
                   else
                       set tempResult to (tempResult & searchResult)
                   end if
               end repeat
           end if
       end considering
       set text item delimiters to tid
       --    set newFile to (CurrentDoc & "Updated" as string)
       my write_to_file(tempResult, "MIF_Temp", true)
   end timeout
end moveMarkers

-- set Number_of_Paragraphs to (count of "<Marker" in searchResult)
-- set the_offset to (offset of vmnum in CurrentDoc)
-- Number_of_Paragraphs

to moveGlossaryMarkers of currText for delimiterString
   with timeout of 10000 seconds (* time-consuming statements *)
       set tid to text item delimiters
       set text item delimiters to delimiterString
       --    set myFile to (open for access file "Test File" with write permission)
       set FinalResult to ""
       considering case
           tell currText to if (count text items) is 1 then
               -- No delimiter found, so there is nothing to rearrange; pass the text through unchanged instead of writing an empty file
               set FinalResult to currText
           else
               set TextCount to get count of text items in currText
               repeat with counter1 from 1 to TextCount
                   if counter1 = TextCount then
                       set searchResult to (text item counter1 as string)
                   else
                       set searchResult to (text item counter1 & "<Para " & return as string)
                   end if
                   if searchResult contains "<MType 6>" and counter1 > 1 then
                       
                       --    set CurrentParagraph to 1
                       set TotalParagraphs to (count of paragraphs in searchResult)
                       set ParaText to ""
                       set GlossaryMarkers to ""
                       set BackMarkers to ""
                       set tempString to ""
                       set tempMarker to ""
                       set MarkerFlag to "off"
                       set ParaLineFlag to "off"
                       
                       repeat with CurrentParagraph from 1 to TotalParagraphs
                           set tempString to get paragraph CurrentParagraph of searchResult
                           
                           -- If we're currently between <Para and > # end of Para ...
                           if (ParaLineFlag = "off") then
                               set FinalResult to (FinalResult & tempString & return as string)
                               if (tempString contains "<ParaLine") then
                                   set ParaLineFlag to "on"
                               end if
                           else if (tempString contains "<Marker") then
                               
                               -- Process the Marker
                               set MarkerFlag to "on"
                               set tempMarker to (tempMarker & tempString & return as string)
                           else if ((tempString contains "<MType") or (tempString contains "<MText") or (tempString contains "<MCurr")) then
                               set tempMarker to (tempMarker & tempString & return as string)
                           else if ((MarkerFlag is "on") and (tempString contains "Unique")) then
                               set tempMarker to (tempMarker & tempString & return as string)
                           else if (tempString contains "> # end of Marker") then
                               set MarkerFlag to "off"
                               set tempMarker to (tempMarker & tempString as string)
                               
                               -- Marker is complete; check type; move appropriately, clear tempMarker 
                               if (tempMarker contains "<MTypeName `Glossary'>") then
                                   set GlossaryMarkers to (GlossaryMarkers & return & tempMarker & return as string)
                                   set tempMarker to ""
                               else
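                                    -- ParaText already ends with a return; add one after the marker so the next line is not glued to "> # end of Marker"
                                    set ParaText to (ParaText & tempMarker & return as string)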
                                   set tempMarker to ""
                               end if
                               -- End of Marker Processing
                               
                               -- If we're at the end of the Para segment...
                           else if ((tempString contains "> # end of Para") and (tempString does not contain "> # end of ParaLine")) then
                               set FinalResult to (FinalResult & GlossaryMarkers & ParaText & tempString & return & "<Para " & return as string)
                               -- set ParaLineFlag to "off"
                           else
                               set ParaText to (ParaText & tempString & return as string)
                               
                           end if
                           set tempString to ""
                           -- set ParaText to ""
                       end repeat
                       -- set FinalResult to (FinalResult & return & FrontMarkers & ParaText & BackMarkers)
                       --set FrontMarkers to ""
                       --set BackMarkers to ""
                   else
                       set FinalResult to (FinalResult & searchResult)
                   end if
               end repeat
           end if
       end considering
       set text item delimiters to tid
       --    set newFile to (CurrentDoc & "Updated" as string)
       my write_to_file(FinalResult, "Cleaned_MIF", true)
       tell application "Finder"
           delete ("MIF_Temp")
       end tell
   end timeout
end moveGlossaryMarkers

-- set Number_of_Paragraphs to (count of "<Marker" in searchResult)
-- set the_offset to (offset of vmnum in CurrentDoc)
-- Number_of_Paragraphs

-- write-to-file subroutine, from AppleScript Resources
on write_to_file(this_data, target_file, append_data)
   try
       set the target_file to the target_file as text
       set the open_target_file to ¬
           open for access file target_file with write permission
       if append_data is false then ¬
           set eof of the open_target_file to 0
       write this_data to the open_target_file starting at eof
       close access the open_target_file
       return true
   on error
       try
           close access file target_file
       end try
       return false
   end try
end write_to_file
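
The droplet wrapper mentioned before the listing could look roughly like the sketch below. It is not wired into the script above: cleanMIF is a hypothetical placeholder standing in for both passes, and writing the result next to the original with a ".cleaned" suffix is an assumption made for illustration. Saved as an application, it would get its files from drag and drop, so no choose file dialogs are needed.

```applescript
-- Sketch of a droplet wrapper. cleanMIF and the ".cleaned" suffix are
-- assumptions for illustration, not part of the script above.
on open droppedItems
	repeat with anItem in droppedItems
		set sourceFile to contents of anItem -- an alias to one dropped file
		set docText to read sourceFile
		set cleanedText to cleanMIF(docText)
		
		-- Write the result beside the original rather than over it
		set outPath to (sourceFile as string) & ".cleaned"
		set fileRef to open for access file outPath with write permission
		try
			set eof of fileRef to 0 -- start from an empty file
			write cleanedText to fileRef
			close access fileRef
		on error errMsg
			close access fileRef
			display dialog "Could not write " & outPath & ": " & errMsg
		end try
	end repeat
end open

-- Placeholder (hypothetical): the two marker passes would go here and
-- return the rearranged text instead of writing it to a fixed file name.
on cleanMIF(docText)
	return docText
end cleanMIF
```

Having the passes return text instead of writing to fixed names like "MIF_Temp" would also remove the need to pick the intermediate file by hand for the second pass.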

I’ve also posted it in a new thread begging for help as my deadline approaches. :frowning:

Thanks for the help and feedback guys, it’s appreciated; looking forward to many more years of the same. :wink:

– Walt Sterdan