I’m trying to figure out how to break a string into multiple strings, ultimately determined by a maximum number of characters. I’m working on something that can’t have more than 180 characters.
Taking a string that has 400 characters in it:
I realize I could just tell it to get characters 1 through 180, and then 181 through 360, and then 361 through 400. However, I don’t want to break a string mid-word. I tried something along the lines of “the word which contains the 180th character”, but that didn’t work.
I’m guessing it has something to do with setting the offset, but I can’t quite get my head around it.
Any takers?
Try this. Basically it goes to character 181 of the string and moves backwards until it finds a space. Once a space is found, we know we are at the end of a word and the text before it will be shorter than 181 characters (i.e. it fits your requirement of 180 characters or less). Then we remove all of that text from the string and do the same thing again, until the string we are left with is 180 characters long or less. Then we’re finished.
I made it more fancy than that explanation but that’s the basics, so try this…
set theString to "MARLEY was dead: to begin with. There is no doubt whatever about that. The register of his burial was signed by the clergyman, the clerk, the undertaker, and the chief mourner. Scrooge signed it: and Scrooge's name was good upon 'Change, for anything he chose to put his hand to. Old Marley was as dead as a door-nail. Mind! I don't mean to say that I know, of my own knowledge, what there is particularly dead about a door-nail."
set charactersBetweenWords to {space, return, tab}
set maxStringLength to 180
set listOfStrings to {}
repeat
set stringCount to count of theString
if stringCount is less than or equal to maxStringLength then
if stringCount is not 0 then set end of listOfStrings to theString
exit repeat
else
repeat with i from (maxStringLength + 1) to 1 by -1
if character i of theString is in charactersBetweenWords then
set end of listOfStrings to text 1 thru (i - 1) of theString
try
set theString to text (i + 1) thru end of theString
on error -- we are at the end of the string because there is no (i+1)
set theString to ""
end try
exit repeat
else if i is 1 then -- no charactersBetweenWords were found
set end of listOfStrings to text 1 thru maxStringLength of theString
set theString to text (maxStringLength + 1) thru end of theString
exit repeat
end if
end repeat
end if
end repeat
return listOfStrings
Here’s a different version. It’s a bit faster than Hank’s, but, since it parses a list of character ids rather than text, it only works in Leopard or later:
on split(theString, maxStringLength)
script o
property ids : theString's id
end script
set stringLength to (count theString)
set listOfStrings to {}
set i to 1
repeat until (i > stringLength)
set j to i + maxStringLength
if (j > stringLength) then
set j to stringLength + 1
else
repeat with j from j to i by -1
if (item j of o's ids < 33) then exit repeat
end repeat
end if
if (j = i) then -- No between-words characters.
set j to i + maxStringLength - 1
if (j > stringLength) then set j to stringLength
set end of listOfStrings to string id (items i thru j of o's ids)
else -- Between-word character found.
set end of listOfStrings to string id (items i thru (j - 1) of o's ids)
end if
set i to j + 1
end repeat
return listOfStrings
end split
set theString to "MARLEY was dead: to begin with. There is no doubt whatever about that. The register of his burial was signed by the clergyman, the clerk, the undertaker, and the chief mourner. Scrooge signed it: and Scrooge's name was good upon 'Change, for anything he chose to put his hand to. Old Marley was as dead as a door-nail. Mind! I don't mean to say that I know, of my own knowledge, what there is particularly dead about a door-nail."
split(theString, 180)
Here is another version, for variety.
I chose to use the ludicrous offset of false method, my new favorite. I’m unable to try out Nigel’s version, because my machine is too old, but I am curious to see how it might compare, benchmark-wise, as I avoided conditionals.
set theString to "MARLEY was dead: to begin with. There is no doubt whatever about that. The register of his burial was signed by the clergyman, the clerk, the undertaker, and the chief mourner. Scrooge signed it: and Scrooge's name was good upon 'Change, for anything he chose to put his hand to. Old Marley was as dead as a door-nail. Mind! I don't mean to say that I know, of my own knowledge, what there is particularly dead about a door-nail."
set span to 180
set {Origin, Terminus} to {1, span}
set theList to {}
repeat while Terminus < (count theString)
repeat until (offset in ((theString's character Terminus) = space or return or tab) without of) ≠1
set Terminus to Terminus - 1
end repeat
set the end of theList to theString's strings Origin thru Terminus
set Origin to Terminus + 1
set Terminus to Terminus + span
end repeat
set the end of theList to theString's strings Origin thru (count theString)
theList
Hi, Marc.
I don’t know what system you’re on (or what medication! ;)), but the line quoted above errors in Snow Leopard:
- ‘(theString’s character Terminus) = space or return or tab’ isn’t a valid test.
- ‘offset of false in …’ may or may not work, but it’s a misuse of the ‘offset’ command, which is specifically for use with strings, not booleans. It may not happen to work in the future.
- The use of an OSAX command (‘offset’) to derive a boolean result from (presumably) three others is an unnecessary overhead when you only need one boolean in the first place:
repeat until (theString's character Terminus) is in (space & return & tab) -- (space & return & tab) preferably stored in a variable before the repeat.
- Performing a negative test (‘until … ≠ 1’) takes very slightly longer than performing a positive one (‘until … = 0’ or ‘while … = 1’).
Deliberate obfuscation’s fun (here you’re using a synonym for another context of ‘text’), but it shouldn’t be used in published scripts unless firmly marked as a joke. Many people blindly copy what they find on the Internet and bad code can propagate quickly and permanently, dooming those who know to years of having to answer the same old “Why doesn’t this work?” queries. 
PS. Once debugged, your script preserves the white-space character at the break point, which is a valid alternative if that’s what the OP wants. (He doesn’t say.) This does of course mean that occasionally, words which would just have fitted into the 180 characters of one substring are deferred to the next.
Hey, Nigel.
I’m using OS 10.4.11. The test you say isn’t valid (1) evaluates and runs properly there. I’ll note that it’s nonfunctional in Snow.
I hadn’t considered the OSAX or negative evaluation impact (3, 4) on speed (that’s actually helpful to know), but the offset of false’s functionality is proof of its validity; why else would it return a boolean in typical operation (0) if it wasn’t intended to work with them? The opposite statement might read oddly, but it’s logical. I also prefer using “strings” over “text,” as I find it more semantically descriptive for most uses.
I’ve just tried your script on my Leopard machine and it appears to work there too. However, the line I pointed out …
repeat until (offset in ((theString's character Terminus) = space or return or tab) without of) ≠1
… is still wrong. The ‘offset’ bit for the reason I’ve already discussed; the character test …
((theString's character Terminus) = space or return or tab)
… because the three possible “conditions” it specifies are:
- (theString’s character Terminus) = space
- return
- tab
Only the first of these is a test producing a boolean result. If it proves ‘false’, the other two “conditions”, which are just bits of text, are ignored for some reason (probably a bug) in Tiger and Leopard. Snow Leopard correctly complains that it can’t derive booleans from them. The conditions should be:
((theString's character Terminus) = space or (theString's character Terminus) = return or (theString's character Terminus) = tab)
Since the Dickens excerpt contains lots of spaces, and your first condition tests for a space, it’s always met where required, giving the illusion that your script’s doing what’s intended. If you change all the spaces in the text to, say, tabs, or change your test order to ‘((theString’s character Terminus) = return or tab or space)’, your script will fail.
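For illustration, here is a minimal sketch of the same loop using the ‘is in’ membership test suggested earlier in place of the ‘offset’ construct. The short sample text, the span of 10, and the variable name breakCharacters are assumptions made for the example, and the extra guard for a word longer than the span is an addition that isn’t in the original script. Like the original, it keeps the white-space character at each break point.

```applescript
-- Sketch only: a short sample string and span, chosen to keep the example small.
set theString to "one two three four five six seven"
set span to 10
-- Build the set of between-words characters once, before the loops.
set breakCharacters to space & return & tab
set {origin, terminus} to {1, span}
set theList to {}
repeat while terminus < (count theString)
	-- Step backwards until a between-words character is found,
	-- or until we run off the start of the current chunk.
	repeat until terminus < origin or (character terminus of theString) is in breakCharacters
		set terminus to terminus - 1
	end repeat
	-- Added guard: no break character in this chunk, so cut at the full span.
	if terminus < origin then set terminus to origin + span - 1
	set end of theList to text origin thru terminus of theString
	set origin to terminus + 1
	set terminus to terminus + span
end repeat
set end of theList to text origin thru (count theString) of theString
theList
```

Because the membership test treats space, return and tab alike, swapping the spaces in the sample for tabs should make no difference to where the breaks fall.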
Ah, I see what you mean, and that makes obvious sense because, if you were to evaluate multiple conditions across a whose clause, they’d also have to be separate. I didn’t test the other conditions well enough to establish their (non)functionality after the lack of an error.
Thanks to all for the comments.
I’ve taken Nigel’s approach, however I’m running into an issue. I’m trying to use this to parse an XML file. When I put this into the repeat loop that goes through each line of the XML file, it exits the first time that it runs into a line with more than 180 characters. What I ultimately want to end up with is a warning dialogue that says something to the effect of:
Marker 5 has more than 180 characters. Please split it at “item 1” and “item 2”
Marker 9 has more than 180 characters. Please split it at “item 2” and “Item 4”
Here is my attempt at it. Like I said, it exits on the first occurrence:
property FrameRate : 30
--select the Final Cut XML file
--set theXML to XMLOpen (choose file with prompt "Select the XML file which contains your caption markers" of type "dyn.agk8zuxnqea" default location (path to the desktop) without invisibles)
set theXML to XMLOpen (choose file with prompt "Select the XML file which contains your caption markers" of type "public.xml" default location (path to the desktop) without invisibles)
set the_root to XMLRoot theXML
set the_child to XMLChild the_root index 1
set theTitle to XMLChild the_child index 3
set youtubeTitle to (XMLGetText theTitle) & " - YouTube Captions"
set flashTitle to (XMLGetText theTitle) & " - Flash Captions"
set dvdTitle to (XMLGetText theTitle) & " - DVD Subtitles"
set mp4Title to (XMLGetText theTitle) & " - MP4 Subtitles"
set allMarkers to XMLXPath the_child with "marker/name"
set theMarkers to XMLGetText (allMarkers)
set startMarker to 11
set theIndex to startMarker
--character count
set listOfStrings to newCountCharacters(theXML, theMarkers, 180)
return listOfStrings
on newCountCharacters(theXML, theMarkers, maxStringLength)
set the_root to XMLRoot theXML
set the_child to XMLChild the_root index 1
set theTitle to XMLChild the_child index 3
set allMarkers to XMLXPath the_child with "marker/name"
set theMarkers to XMLGetText (allMarkers)
set startMarker to 11
set theIndex to startMarker
set listOfStrings to {}
set alertString to "Too many characters"
set messageString to "The following comments have too many characters. Please edit your original Final Cut markers so that you only have 180 characters per comment." & return & return
set warningString to ""
repeat with my_item in theMarkers
try
set the_child_2 to XMLChild the_child index theIndex
set theMarker to XMLGetText (XMLChild the_child_2 index 1)
set theComment to XMLGetText (XMLChild the_child_2 index 2)
set theCount to get the count of the characters of theComment
if theCount is greater than 180 then
script o
property ids : theComment's id
end script
set stringLength to (count theComment)
--set listOfStrings to {}
set i to 1
repeat until (i > stringLength)
set j to i + maxStringLength
if (j > stringLength) then
set j to stringLength + 1
else
repeat with j from j to i by -1
if (item j of o's ids < 33) then exit repeat
end repeat
end if
if (j = i) then -- No between-words characters.
set j to i + maxStringLength - 1
if (j > stringLength) then set j to stringLength
set end of listOfStrings to string id (items i thru j of o's ids)
else -- Between-word character found.
set end of listOfStrings to string id (items i thru (j - 1) of o's ids)
end if
set i to j + 1
end repeat
return listOfStrings
set warningString to warningSString & listOfStrings
--set messageString to messageString & "Comment " & theMarker & " has " & theCount & " characters." & return & return
end if
set theIndex to theIndex + 1
end try
end repeat
--if warningString is not "" then
-- set messageString to messageString & warningString
-- display alert alertString message messageString buttons {"OK"} default button "OK"
--else
-- return
--end if
--return theIndex
end newCountCharacters
Hi.
The ‘return listOfStrings’ line is returning listOfStrings instead of allowing the handler to continue.
Hmm. Ok. So I need to suss out how to get it to hand the result back out to the main script without jumping out of the handler. Or will it do this without needing the return statement? I’ll give it a go.
Thanks!
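One possible way to restructure this, sketched under some assumptions: keep the splitting code in its own handler (a copy of the ‘split’ handler posted earlier in the thread, so Leopard or later) and have the marker loop only build up the warning text, with a single ‘return’ after the loop has finished. The handler name collectWarnings, the hard-coded list standing in for the comments read from the XML file, the limit of 20, and the wording of the message are all made up for the example; numbering the markers by list position is also an assumption.

```applescript
-- Hypothetical stand-in for the marker comments read from the XML file.
set theComments to {"short comment", "a much longer comment that needs splitting", "another fairly long comment to split"}
set warningString to collectWarnings(theComments, 20)
warningString

on collectWarnings(theComments, maxStringLength)
	set warningString to ""
	repeat with n from 1 to (count theComments)
		set theComment to item n of theComments
		if (count theComment) > maxStringLength then
			-- The split handler's own 'return' only leaves split(), not this loop.
			set thePieces to split(theComment, maxStringLength)
			set warningString to warningString & "Marker " & n & " has more than " & maxStringLength & " characters. Suggested pieces:" & return
			repeat with aPiece in thePieces
				set warningString to warningString & "  " & (contents of aPiece) & return
			end repeat
		end if
	end repeat
	return warningString -- the only return, reached after every comment has been checked
end collectWarnings

-- Copy of the 'split' handler from earlier in the thread.
on split(theString, maxStringLength)
	script o
		property ids : theString's id
	end script
	set stringLength to (count theString)
	set listOfStrings to {}
	set i to 1
	repeat until (i > stringLength)
		set j to i + maxStringLength
		if (j > stringLength) then
			set j to stringLength + 1
		else
			repeat with j from j to i by -1
				if (item j of o's ids < 33) then exit repeat
			end repeat
		end if
		if (j = i) then -- No between-words characters.
			set j to i + maxStringLength - 1
			if (j > stringLength) then set j to stringLength
			set end of listOfStrings to string id (items i thru j of o's ids)
		else -- Between-word character found.
			set end of listOfStrings to string id (items i thru (j - 1) of o's ids)
		end if
		set i to j + 1
	end repeat
	return listOfStrings
end split
```

The resulting text could then be passed to ‘display alert’ if it isn’t empty, along the lines of the commented-out block at the end of the original handler.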