Script to split one large RTF into several renamed smaller files

Hi

I have some RTF files in the format:

1st January 2010

Para 1

Para 2

Para 3

Para 4

8th January 2010

Para 1

Para 2

Para 3

etc…

Essentially this is one long list of session notes.

I would like to split this large RTF file into smaller RTF files, one for each date, naming the file by the date at the top of each session (in the format 2010-01-01, 2010-01-08, etc) and removing the date from the content of the files, so that I just have the paragraphs.

Ideally the script would also be able to recognise dates in different formats, e.g. 1st January 2010, 1st Jan 2010, etc.

Would it be possible to do this with AppleScript?

Thanks

Nick

Hi,

Do you have MS Word? That’s the easiest way to split RTF files.
Try this: the script asks for the source file and saves the split files on the desktop.

The date formats “1st January 2010” and “1st Jan 2010” are handled; any other format is not.
To support other date formats, the checkDate() handler must be extended.


property dayOrdinalAbbreviationList : {"st", "nd", "rd", "th"}
property monthNameList : "JanFebMarAprMayJunJulAugSepOctNovDec"

set sourceFile to choose file

set newDocCount to 0
tell application "Microsoft Word"
	open sourceFile
	set aDoc to active document
	set numberOfParagraphs to count paragraphs of aDoc
	repeat with i from 1 to numberOfParagraphs
		set textValue to content of text object of paragraph i of aDoc
		set {dy, mn, yr} to my checkDate(textValue)
		if dy is not false then
			if newDocCount is not 0 then
				my makeNewDocument(aDoc, low, i - 1, fileName)
			end if
			set low to i + 1
			set newDocCount to newDocCount + 1
			set fileName to yr & "-" & mn & "-" & dy & ".rtf"
		else if dy is false and i = numberOfParagraphs then
			my makeNewDocument(aDoc, low, i, fileName)
		end if
	end repeat
end tell

on makeNewDocument(aDoc, fromParagraph, toParagraph, fName)
	tell application "Microsoft Word"
		set myRange to create range aDoc start (start of content of ¬
			text object of paragraph fromParagraph of aDoc) end (end of content ¬
			of text object of paragraph toParagraph of aDoc)
		select myRange
		copy object selection
		set newDoc to make new document
		paste object text object of newDoc
		save as newDoc file name ((path to desktop as Unicode text) & fName) file format format rtf
		close front document
	end tell
end makeNewDocument

on checkDate(theString)
	set {TID, text item delimiters} to {text item delimiters, space}
	try
		set {dy, mn, yr} to text items of theString
		if (count dy) < 3 or (count mn) < 3 or (count yr) < 4 then error
		if text -2 thru -1 of dy is in dayOrdinalAbbreviationList then
			set dy to text -2 thru -1 of ("0" & text 1 thru -3 of dy)
		else
			error
		end if
		set monthOffset to offset of (text 1 thru 3 of mn) in monthNameList
		if monthOffset = 0 then
			error
		else
			set mn to text -2 thru -1 of ("0" & (monthOffset div 3) + 1)
		end if
		if last character of yr is in {return, linefeed} then set yr to text 1 thru -2 of yr
		yr as integer -- a non-numeric year raises an error, which the outer try catches
		set text item delimiters to TID
		return {dy, mn, yr}
	on error
		set text item delimiters to TID
		return {false, false, false}
	end try
end checkDate
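
As a rough illustration of how checkDate() might be extended, here is a minimal sketch that accepts the day with or without an ordinal suffix. The helper names normalizeDay and normalizeMonth are hypothetical and not part of the posted script; the month lookup reuses the monthNameList idea from above, with an extra check that the match starts on a month boundary. It runs without Word, so it can be tried on its own in Script Editor.

```applescript
property monthNameList : "JanFebMarAprMayJunJulAugSepOctNovDec"

-- Quick check with date formats mentioned in this thread.
set results to {checkDate("1st January 2010"), checkDate("8 Sep 2007" & return), checkDate("Para 1")}
return results

-- Returns {"dd", "mm", "yyyy"} for a date line, otherwise {false, false, false}.
on checkDate(theString)
	set oldDelimiters to AppleScript's text item delimiters
	set AppleScript's text item delimiters to space
	set parts to text items of theString
	set AppleScript's text item delimiters to oldDelimiters
	if (count parts) < 3 then return {false, false, false}
	set dy to normalizeDay(item 1 of parts)
	set mn to normalizeMonth(item 2 of parts)
	set yr to item 3 of parts
	if dy is false or mn is false or (count yr) < 4 then return {false, false, false}
	set yr to text 1 thru 4 of yr -- drops a trailing paragraph mark, if any
	try
		yr as integer
	on error
		return {false, false, false}
	end try
	return {dy, mn, yr}
end checkDate

-- Hypothetical helper: "1", "1st", "22nd" -> "01", "01", "22"; anything else -> false.
on normalizeDay(dayText)
	if (count dayText) > 2 and (text -2 thru -1 of dayText) is in {"st", "nd", "rd", "th"} then
		set dayText to text 1 thru -3 of dayText
	end if
	try
		set dayNumber to dayText as integer
	on error
		return false
	end try
	if dayNumber < 1 or dayNumber > 31 then return false
	return text -2 thru -1 of ("0" & dayNumber)
end normalizeDay

-- Hypothetical helper: "January" or "Jan" -> "01"; unknown names -> false.
on normalizeMonth(monthText)
	if (count monthText) < 3 then return false
	set monthOffset to offset of (text 1 thru 3 of monthText) in monthNameList
	-- Only offsets 1, 4, 7, ... start a month abbreviation; this rejects
	-- accidental matches that straddle two months, such as "anF".
	if monthOffset = 0 or (monthOffset mod 3) is not 1 then return false
	return text -2 thru -1 of ("0" & ((monthOffset div 3) + 1))
end normalizeMonth
```

Like the original handler, this sketch only looks at the first three space-separated words, so a body paragraph that happens to begin with something like “3 March 2010” would still be treated as a date line.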

Thanks Stefan

Yes, I have MS Word. I tried to run your script on an RTF file and after a couple of minutes, got the error:

error “The variable low is not defined.” number -2753 from “low”

Nick

The script assumes that the first line is a date line and that
the date segments are separated by space characters, in one of the date formats mentioned above.

This is the format the main RTF file is in:

1st January 2010

Paragraph 1

Paragraph 2

etc…

8th January 2010

Paragraph 1

Paragraph 2

etc…

The first line in the document is a date line.

Not sure if this meets the required criteria. What’s the number -2753 in the error message? Maybe there’s a section some way down in the document which isn’t in this format.

Nick

The variable low holds the index of the first paragraph after the current date line paragraph.
It is only defined once the first date line has been parsed, so the error means that no paragraph was recognised as a date line.
I tested the script successfully with the text given in post #1, with some rich text attributes added.
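
To make the role of low a little more concrete, here is a minimal sketch of the same loop that runs without Word. The list paragraphTexts, the handler looksLikeDate and the result list sections are hypothetical stand-ins (looksLikeDate is a deliberately crude placeholder for checkDate()). Instead of creating documents, it records which paragraph range would go into which file, and it skips the save when no date line has been seen yet rather than failing with “The variable low is not defined.”

```applescript
-- Hypothetical stand-in for the Word paragraphs.
set paragraphTexts to {"Intro text", "1 January 2010", "Para 1", "Para 2", "8 January 2010", "Para 1"}

set sections to {}
set low to missing value -- no date line seen yet
set fileName to missing value
set lastIndex to count paragraphTexts

repeat with i from 1 to lastIndex
	if my looksLikeDate(item i of paragraphTexts) then
		-- Close the previous section, if one is open and not empty.
		if low is not missing value and low <= i - 1 then
			set end of sections to {fileName, low, i - 1}
		end if
		set low to i + 1
		set fileName to item i of paragraphTexts
	else if i = lastIndex and low is not missing value then
		-- Last paragraph: close the final section, but only if a date line was found.
		set end of sections to {fileName, low, i}
	end if
end repeat
return sections
-- {{"1 January 2010", 3, 4}, {"8 January 2010", 6, 6}}

-- Crude placeholder for checkDate(): a line counts as a date line
-- if its first word is a number.
on looksLikeDate(theLine)
	try
		(word 1 of theLine) as integer
		return true
	on error
		return false
	end try
end looksLikeDate
```

With a list that contains no date line at all, sections simply stays empty.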

Yes, it works for me too on the text in post #1, but when I tried it on the text below I got the same error:

error “The variable low is not defined.” number -2753 from “low”

1 September 2007

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

8 September 2007

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

The script expects the day portion as an ordinal number (1st, 2nd, 3rd, etc.).

Oops. Sorry, my mistake. All date formats are either 1 January 2010 or 1 Jan 2010. How would I amend the script to reflect this?

This is much easier.
I added a property, isEmptyLineAfterDateLine. Set it to true if there is always an empty line after the date line.


property monthNameList : "JanFebMarAprMayJunJulAugSepOctNovDec"
property isEmptyLineAfterDateLine : false

set sourceFile to choose file

set newDocCount to 0
tell application "Microsoft Word"
	open sourceFile
	set aDoc to active document
	set numberOfParagraphs to count paragraphs of aDoc
	repeat with i from 1 to numberOfParagraphs
		set textValue to content of text object of paragraph i of aDoc
		set {dy, mn, yr} to my checkDate(textValue)
		if dy is not false then
			if newDocCount is not 0 then
				my makeNewDocument(aDoc, low, i - 1, fileName)
			end if
			set low to i + 1 + (isEmptyLineAfterDateLine as integer)
			set newDocCount to newDocCount + 1
			set fileName to yr & "-" & mn & "-" & dy & ".rtf"
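		else if i = numberOfParagraphs and newDocCount is not 0 then
			-- Last paragraph: save the final section. The newDocCount check avoids
			-- the "low is not defined" error when no date line was ever found.
			my makeNewDocument(aDoc, low, i, fileName)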
		end if
	end repeat
end tell

on makeNewDocument(aDoc, fromParagraph, toParagraph, fName)
	tell application "Microsoft Word"
		set myRange to create range aDoc start (start of content of ¬
			text object of paragraph fromParagraph of aDoc) end (end of content ¬
			of text object of paragraph toParagraph of aDoc)
		select myRange
		copy object selection
		set newDoc to make new document
		paste object text object of newDoc
		save as newDoc file name ((path to desktop as Unicode text) & fName) file format format rtf
		close front document
	end tell
end makeNewDocument

on checkDate(theString)
	set {TID, text item delimiters} to {text item delimiters, space}
	try
		set {dy, mn, yr} to text items of theString
		if (count mn) < 3 or (count yr) < 4 then error
		try
			dy as integer
			set dy to text -2 thru -1 of ("0" & dy)
		on error
			error
		end try
		set monthOffset to offset of (text 1 thru 3 of mn) in monthNameList
		if monthOffset = 0 then
			error
		else
			set mn to text -2 thru -1 of ("0" & (monthOffset div 3) + 1)
		end if
		if (count yr) > 4 then set yr to text 1 thru 4 of yr
		yr as integer -- a non-numeric year raises an error, which the outer try catches
		set text item delimiters to TID
		return {dy, mn, yr}
	on error
		set text item delimiters to TID
		return {false, false, false}
	end try
end checkDate

Most excellent. Thanks Stefan. It works just fine. I wonder if it would be possible to make a couple of additions:

  1. Have the newly created RTFs put in a folder on the Desktop named after the original document.

  2. The newly created RTFs will be stored inside DEVONthink Pro. Within DEVONthink Pro there is a way of setting the right margin of the RTFs so that the content of the RTF always fits the available space of the view, i.e. it has a flexible width (you achieve this by dragging the right margin tab to the far right of the view). I am not sure whether this is a feature common to text files across other Applications, and whether there is a way of getting the script to format the RTFs in this way, but this is how I would like them to display in DEVONthink Pro and because there will be several hundred RTFs obviously I would like to find a way of automating this.

  1. is no problem

Replace the beginning of the script with this code:

property monthNameList : "JanFebMarAprMayJunJulAugSepOctNovDec"
property isEmptyLineAfterDateLine : false
property desktopFolderName : "myFolder"

set sourceFile to choose file

set newDocCount to 0
set destinationFolder to ((path to desktop as text) & desktopFolderName)
do shell script "/bin/mkdir -p " & quoted form of POSIX path of destinationFolder
tell application "Microsoft Word"
	open sourceFile
	set aDoc to active document
	set docName to name of aDoc
	if docName ends with ".rtf" then set docName to text 1 thru -5 of docName
	set numberOfParagraphs to count paragraphs of aDoc
.
-- continues with old code starting with the repeat with line

and change the save as line to:


save as newDoc file name (destinationFolder & ":" & fName) file format format rtf

The folder will be created on the desktop automatically if it doesn’t exist.

  2. In MS Word you can do almost everything with AppleScript, but I don’t know
    whether changed bounds will be displayed correctly in other applications.

Regarding Point 1:

I have made changes and now am getting error message:

error “The variable destinationFolder is not defined.” number -2753 from “destinationFolder”

which is a little strange as it is clearly defined earlier in the script. Also, I just wanted to make sure you’re aiming for the same outcome that I am aiming for, as from my reading of the script I am not sure of this.

If the original RTF is called “Brian.rtf” I would want all of the newly created RTFs to go into a folder called “Brian” on the desktop.

Regarding Point 2:

Would it be possible to try and change the bounds of the RTFs to “page width” or whatever the equivalent is in MS Word, so I can test if it looks right in DEVONthink?

Thanks

Ah, OK
Here’s the complete script


property monthNameList : "JanFebMarAprMayJunJulAugSepOctNovDec"
property isEmptyLineAfterDateLine : false

property destinationFolder : missing value

set sourceFile to choose file

set newDocCount to 0
tell application "Microsoft Word"
	open sourceFile
	set aDoc to active document
	set docName to name of aDoc
	set numberOfParagraphs to count paragraphs of aDoc
end tell
if docName ends with ".rtf" then set docName to text 1 thru -5 of docName
set destinationFolder to ((path to desktop as text) & docName)
do shell script "/bin/mkdir -p " & quoted form of POSIX path of destinationFolder
repeat with i from 1 to numberOfParagraphs
	tell application "Microsoft Word" to set textValue to content of text object of paragraph i of aDoc
	set {dy, mn, yr} to checkDate(textValue)
	if dy is not false then
		if newDocCount is not 0 then
			makeNewDocument(aDoc, low, i - 1, fileName)
		end if
		set low to i + 1 + (isEmptyLineAfterDateLine as integer)
		set newDocCount to newDocCount + 1
		set fileName to yr & "-" & mn & "-" & dy & ".rtf"
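	else if i = numberOfParagraphs and newDocCount is not 0 then
		-- Last paragraph: save the final section. The newDocCount check avoids
		-- the "low is not defined" error when no date line was ever found.
		makeNewDocument(aDoc, low, i, fileName)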
	end if
end repeat

on makeNewDocument(aDoc, fromParagraph, toParagraph, fName)
	tell application "Microsoft Word"
		set myRange to create range aDoc start (start of content of ¬
			text object of paragraph fromParagraph of aDoc) end (end of content ¬
			of text object of paragraph toParagraph of aDoc)
		select myRange
		copy object selection
		set newDoc to make new document
		paste object text object of newDoc
		save as newDoc file name (destinationFolder & ":" & fName) file format format rtf
		close front document
	end tell
end makeNewDocument

on checkDate(theString)
	set {TID, text item delimiters} to {text item delimiters, space}
	try
		set {dy, mn, yr} to text items of theString
		if (count mn) < 3 or (count yr) < 4 then error
		try
			dy as integer
			set dy to text -2 thru -1 of ("0" & dy)
		on error
			error
		end try
		set monthOffset to offset of (text 1 thru 3 of mn) in monthNameList
		if monthOffset = 0 then
			error
		else
			set mn to text -2 thru -1 of ("0" & (monthOffset div 3) + 1)
		end if
		if (count yr) > 4 then set yr to text 1 thru 4 of yr
		yr as integer -- a non-numeric year raises an error, which the outer try catches
		set text item delimiters to TID
		return {dy, mn, yr}
	on error
		set text item delimiters to TID
		return {false, false, false}
	end try
end checkDate


I’m not that familiar with Word scripting
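
For anyone wondering about the earlier “The variable destinationFolder is not defined” error: a variable set at the top level of a script is not visible inside a handler unless the handler declares it global or the script declares it as a property, which is presumably why the complete script above uses a property. The following minimal sketch shows the difference. The names localFolder, sharedFolder, readLocal and readShared and the path string are made up for illustration.

```applescript
property sharedFolder : missing value -- properties are visible in every handler

set localFolder to "Desktop:Brian" -- set at top level only
set sharedFolder to "Desktop:Brian"

return {readLocal(), readShared()}
-- {-2753, "Desktop:Brian"}

on readLocal()
	try
		return localFolder -- not defined inside this handler
	on error number errorNumber
		return errorNumber -- -2753, the same number as in the error message above
	end try
end readLocal

on readShared()
	return sharedFolder
end readShared
```

Passing the folder path to makeNewDocument() as an extra parameter would be another way to achieve the same thing without a property.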

That works perfectly now. I will look into the issue of margins/formatting some more.

If you have the time and inclination, there is another thread I started regarding another splitting task I am trying to automate, namely splitting an MS Word document according to the Levels in the outline:

http://macscripter.net/viewtopic.php?id=34582

But if not, no worries. You have been most helpful.

Enjoy the day!

Nick

I have made a bit of progress with the formatting issue. To get the RTF files to show up in the format I would like in DEVONthink, they have to be in ‘Wrap to Window’ mode. The only way I have found to achieve this is to open the files in TextEdit, choose ‘Wrap to Window’ from the Format menu, then select all, then drag the right margin tab to the far right of the TextEdit Window, and then save.

There is a ‘Wrap to Window’ option in MS Word preferences, which does get text to fill the screen in Draft and Outline view, but then there is no way to drag the right margin tab to the right edge of the window, so the file isn’t actually changed. So I am not sure if this can be achieved in MS Word. If not, then perhaps I need to be using TextEdit to process all the RTF files after they have been created by the script as it stands. Even here though, the Wrap to Window, Select All and Save commands could be scripted, but I am not sure about the dragging the right margin tab all the way to the right.

Isn’t this “wrap to window” just a particular view of the target text editor?

When just choosing wrap to window in TextEdit the file isn’t actually changed, but when the right margin marker is then dragged to the right of the window, the file is changed and the content then fills the available width in the window. So maybe what I am actually looking to script is not wrap to window, but whatever is actually happening to the file when the right margin tab is dragged to the far right of the window.

This sets the right indent of all paragraphs of the current Word document to 0, i.e. the rightmost position:


tell application "Microsoft Word"
	set aDoc to active document
	set right indent of paragraphs of text object of aDoc to 0.0
end tell

Thanks, but unfortunately that doesn’t seem to make any difference to the file created.