Dealing with Rich Text

Adam_Bell · September 21, 2007, 4:43pm

The general problem is to “chunk” a TextEdit document into new documents containing identified portions of the original. The original RTF doc has markers inserted in it by the App that exported it as one long doc that I can find with a “contains” filter and thus determine the range of paragraphs that constitute one “chunk” as those between markers. I’m not using text item delimiters because they don’t apply to rich text.
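A minimal sketch of that marker scan, assuming the paragraphs have already been fetched as a list of plain strings. The handler name chunkRanges and the marker "###" are hypothetical, used only for illustration:

```applescript
-- Return a list of {firstParagraph, lastParagraph} pairs, one for each
-- run of paragraphs lying between two consecutive marker paragraphs.
on chunkRanges(paraList, marker)
	-- Collect the indexes of the paragraphs that contain the marker.
	set markerIndexes to {}
	repeat with i from 1 to (count paraList)
		if (item i of paraList) contains marker then set end of markerIndexes to i
	end repeat
	-- A chunk runs from the paragraph after one marker to the paragraph before the next.
	set ranges to {}
	repeat with k from 1 to ((count markerIndexes) - 1)
		set firstPara to (item k of markerIndexes) + 1
		set lastPara to (item (k + 1) of markerIndexes) - 1
		if firstPara <= lastPara then set end of ranges to {firstPara, lastPara}
	end repeat
	return ranges
end chunkRanges

chunkRanges({"###", "a", "b", "###", "c", "###"}, "###")
```

Adjacent markers with nothing between them are skipped, so every returned pair is a usable paragraph range.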

The problem is that if I have a formatted line of rich text with some words bold, italic or colored, with font changes, how do I create a new TE document that preserves that formatting for the chunk? This script will determine what’s there:

tell application "TextEdit"
	open (choose file of type "public.rtf" without invisibles) -- grab the doc.
	set AR to {attribute runs, properties of attribute runs} of text of document 1
end tell

It returns a list of two lists. The first holds the characters of each attribute run; the second holds the matching formatting properties for each of those runs.

Now how do I get that into a new doc? If it was possible to select the paragraphs I wanted I could GUI script copying them, but it’s not.

EDIT: I see that hhas’ TextCommands can deal with formatting – I’ll pursue that unless someone knows a more straightforward method that doesn’t require the special treatment. This isn’t for me.

Adam_Bell · September 21, 2007, 5:33pm

Addendum to post above. If you know this isn’t on or strongly suspect it’s not possible, please let me know that.

mark_hunte · September 21, 2007, 8:06pm

Is this what you mean?

The only thing I am having a problem with is the colour.
I notice the colour is always the same in the properties, regardless of the actual word’s colour.

Edit: OK, that was me not spotting a typo, so colour works as well.
So now I need to work out placement, i.e. middle, left, right.


tell application "TextEdit"
	set theFile to open (choose file of type "public.rtf" without invisibles) -- grab the doc.
	set dname to name of document 1
	set a to paragraphs of text of document dname
	repeat with i from 1 to number of items in a
		set this_itemP to item i of a
		if this_itemP contains "yourword" then
			set newDoc to (make new document at beginning of documents)
			set the text of the front document to this_itemP
			-- Copy the formatting word by word from the matching paragraph of the source document.
			-- (A separate loop variable, j, avoids clobbering the paragraph index i.)
			repeat with j from 1 to number of words in this_itemP
				set ARsize to size of word j of paragraph i of text of document dname
				set ARfont to font of word j of paragraph i of text of document dname
				set ARcolor to color of word j of paragraph i of text of document dname
				set font of word j of text of newDoc to ARfont
				set size of word j of text of newDoc to ARsize
				set color of word j of text of newDoc to ARcolor
			end repeat
			
		end if
	end repeat
	
end tell

StefanK · September 21, 2007, 8:51pm

Everything is possible, Adam, but sometimes quite tricky.

I would read the raw RTF text with AppleScript’s read command,
extract the header, which contains all the color and font definitions, make your selections, and write the files
back to disk, each with the header and its selected text.
Maybe you have Hanaan Rosenthal’s AppleScript guide; one chapter is a tutorial on RTF syntax.
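As a rough sketch of the header-extraction step: the handler below assumes that the header (font and colour tables) is everything before the first "\pard" control word, which is how TextEdit's own RTF files tend to be laid out but is not guaranteed for other exporters. The handler name splitRTFHeader is hypothetical.

```applescript
-- Split raw RTF text into {header, body}.
-- Assumption: the header ends just before the first "\pard".
on splitRTFHeader(rtfText)
	set bodyStart to offset of "\\pard" in rtfText
	if bodyStart < 2 then error "No \\pard found after a header; cannot split this RTF text."
	set header to text 1 thru (bodyStart - 1) of rtfText
	set body to text bodyStart thru -1 of rtfText
	return {header, body}
end splitRTFHeader

splitRTFHeader("{\\rtf1{\\fonttbl\\f0 Helvetica;}\\pard\\f0 Hi}")
```

Each chunk file would then consist of the header, the selected portion of the body, and a closing "}" to end the RTF group.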

mark_hunte · September 21, 2007, 8:54pm

Ah, I think I’ve got it.


tell application "TextEdit"
	set theFile to open (choose file of type "public.rtf" without invisibles) -- grab the doc.
	set dname to name of document 1
	set a to paragraphs of text of document dname
	repeat with i from 1 to number of items in a
		set this_itemP to item i of a
		if this_itemP contains "Textedit" then
			set this_start to i
		else if this_itemP contains "open" then
			set this_end to i
		end if
	end repeat
	set chunk to (paragraphs this_start thru this_end of text of document dname) as string
	set newDoc to (make new document at beginning of documents)
	set the text of the front document to chunk
	-- Count the words before the chunk so that word i of the chunk maps to the right word of the source.
	set wordOffset to 0
	if this_start > 1 then set wordOffset to number of words in ((paragraphs 1 thru (this_start - 1) of text of document dname) as string)
	repeat with i from 1 to number of words in chunk
		set ARsize to size of word (wordOffset + i) of text of document dname
		set ARfont to font of word (wordOffset + i) of text of document dname
		set ARcolor to color of word (wordOffset + i) of text of document dname
		set font of word i of text of newDoc to ARfont
		set size of word i of text of newDoc to ARsize
		set color of word i of text of newDoc to ARcolor
	end repeat
end tell

waltr · September 21, 2007, 8:56pm

Hi Adam,

I agree with StefanK. Here’s something similar I did a while back, with lots of good input from Kai and Kel:

http://bbs.applescript.net/viewtopic.php?id=17413

Cheers.

mark_hunte · September 21, 2007, 9:59pm

That was a very good idea.
Although my other script seems to work well, it could be very slow.

This is very close…

It does all the fonts, colours, and sizes,
but does not seem to keep the paragraphs in order.

set thetext to do shell script "cat /Users/username/Untitled.rtf"

(* look for the end of the header *)
set a to paragraphs of thetext
set this_Header_start to 1
repeat with i from 1 to number of items in a
	set this_itemP to item i of a
	if this_itemP contains "pardirnatural" then
		set this_Header_end to i
		exit repeat
	end if
end repeat
(* look for the first and last paragraphs of the wanted text *)
repeat with i from 1 to number of items in a
	set this_itemP to item i of a
	if this_itemP contains "tell" then
		set this_start to i
		exit repeat
	end if
end repeat
repeat with i from 1 to number of items in a
	set this_itemP to item i of a
	if this_itemP contains "the doc" then
		set this_end to i
		exit repeat
	end if
end repeat
(* put it together and write it out *)
-- Join the paragraphs with line feeds; without a delimiter the lines run together.
set astid to AppleScript's text item delimiters
set AppleScript's text item delimiters to (ASCII character 10)
set header to (paragraphs this_Header_start thru this_Header_end of thetext) as string
set chunk to (paragraphs this_start thru this_end of thetext) as string
set AppleScript's text item delimiters to astid

-- quoted form and printf stop the shell from interpreting the RTF's backslashes and quotes.
do shell script "printf '%s\\n' " & quoted form of header & " > text.rtf"
do shell script "printf '%s\\n' " & quoted form of (chunk & "}") & " >> text.rtf"

Edit: change the line
if this_itemP contains "pardirnatural" then
to
if this_itemP contains "pard" then

Adam_Bell · September 21, 2007, 10:14pm

Thank you gentlemen; a lot to digest, but a good starting point for my chore.

Nigel_Garvey · September 21, 2007, 10:25pm

Hi, Adam.

Would it work for you to delete the text that’s not in a chunk of interest, save the document under a different name, and then reopen the original to get the next chunk? It’s possible on my Jaguar machine, but I suspect the scripting may be better in Tiger.

Adam_Bell · September 21, 2007, 11:22pm

I suspect that is possible; excellent thought. The doc is very long, however, so that may be a tad slow. I will nonetheless give it a go. Thanks.

mark_hunte · September 22, 2007, 1:16pm

You can select the text and use the TextEdit service New Window Containing Selection (mine has a hot key, but I cannot remember if this is standard).
You can use this in most apps, but more importantly, you can use it in TextEdit on the RTF doc.

Adam_Bell · September 22, 2007, 4:01pm

Thanks, Mark, but I want the script to select the paragraphs, not the user.

hhas · September 22, 2007, 4:14pm

In theory, you should be able to do:

tell application "TextEdit" to set text of document 1 to paragraphs i thru j of document 2

However, Cocoa Scripting’s standard Text Suite implementation blows chunks (as I’m sure you already know), so in practice all you get is an error.

As a workaround, in theory you could copy the text into a new document, then delete the portions you don’t want:

tell application "TextEdit"
    set text of document 1 to text of document 2
    -- Trim the copy (document 1) so that only paragraphs i thru j remain.
    delete paragraphs (j + 1) thru -1 of document 1
    delete paragraphs 1 thru (i - 1) of document 1
end tell

But, once again, Cocoa Scripting’s standard Text Suite implementation blows chunks (feel free to file bugs on that POS), and in practice appears to have O(n*n) efficiency when deleting text, so quickly grinds to a halt as document size increases.

Note that TextCommands doesn’t do RTF. Its ‘format’ command is for getting string representations of AppleScript values.

Regarding Hanaan Rosenthal’s book: note that while the first edition did include the RTF tutorial, the second doesn’t.

If you want to edit the RTF data manually, you can find the RTF specs online easily enough. How practical this is will depend on the clarity of the data and the complexity of the changes.

Other options would be to use a non-Cocoa Scripting-based rich text editor, e.g. Word or Tex-Edit Pro may be suitable, or use another language that provides RTF libraries or RTF-aware rich text classes, e.g. you could knock together a simple command line tool using NSAttributedString with the AppKit additions.

Adam_Bell · September 22, 2007, 7:00pm

Thank you hhas; it’s good to know that my ineptness at transferring easily delineated paragraphs of an export from an unscriptable database app (that insists on dumping the whole thing as one document) to a set of new separate documents is not entirely my own fault. I’ve discovered that attribute runs and properties of attribute runs contain all the required data for reformatting a new document, but that one must then alter the format of the new text on a word by word basis because making a new attribute run doesn’t seem to be possible (at least I’ve never discovered the language for doing it).

Be nice when AppleScript supports RTF better than it does now as more and more apps seem to be using it as their native text display.

Nigel_Garvey · September 23, 2007, 10:23pm

A variation on the RTF-editing approach is to strip the visible text from the bits outside the current chunk, but to leave the RTF formatting tags in place. That way, if the chunk starts in the middle of an attribute run, the relevant tags will be in force at the point where the chunk starts in the edited document. If the RTF text is loaded into TextEdit and resaved from there, any superfluous tags will be removed automatically.

The script below assumes you have the original document open in TextEdit’s front window and already know the paragraph ranges of the three chunks. The chunk files are saved to the same folder as the original. With my 44KB test document, most of the running time is taken up with the opening and resaving of the three chunk files at the very end of the getChunks() handler. Tested in Jaguar but not (yet) in Tiger.

-- Supervise the extraction of three chunks from an RTF document into new TextEdit files and documents.
-- docPath is TextEdit's POSIX path to the file of its front document.
-- rangeLists is a list of three two-integer lists, the integers representing paragraph numbers.
on getChunks(docPath, rangeLists)
	set origPath to docPath as POSIX file as Unicode text
	set rtf to (read file origPath)
	set newFiles to {}
	
	set astid to AppleScript's text item delimiters
	set rtfLF to "\\" & (ASCII character 10) -- RTF line feed.
	if (rtf does not contain rtfLF) then set rtfLF to "\\" & return
	set AppleScript's text item delimiters to rtfLF
	set paragraphCount to (count rtf's text items)
	
	repeat with chunk from 1 to 3
		set {i, j} to item chunk of rangeLists
		
		set AppleScript's text item delimiters to rtfLF
		set parts to {text from text item i to text item j of rtf}
		if (i > 1) then set beginning of parts to stripTextFromRTF(text 1 thru text item (i - 1) of rtf)
		if (j < paragraphCount) then set end of parts to stripTextFromRTF(text from text item (j + 1) to -1 of rtf)
		set end of parts to "}"
		
		set AppleScript's text item delimiters to ""
		set newRTF to parts as string
		
		set newPath to origPath & " Chunk " & chunk & ".rtf"
		set fRef to (open for access file newPath with write permission)
		try
			set eof fRef to 0
			write newRTF to fRef
		end try
		close access fRef
		set end of newFiles to alias newPath
	end repeat
	set AppleScript's text item delimiters to astid
	
	tell application "TextEdit"
		activate
		open newFiles
		set modified of documents 1 thru 3 to true
		save (documents 1 thru 3)
	end tell
end getChunks

-- Strip the text from an RTF chunk, leaving the formatting in place.
-- TextEdit will remove any redundant formatting when it opens and resaves the document.
on stripTextFromRTF(rtf)
	set skippables to "'uU{}" & (ASCII character 10) & return
	set astid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to "\\"
	script o
		property TIs : rtf's text items
	end script
	
	considering case
		-- Text item 1 is either "", text, or a brace for RTF code.
		if (beginning of o's TIs is not "{") then set item 1 of o's TIs to ""
		set zapNext to false
		repeat with i from 2 to (count o's TIs)
			set thisTI to item i of o's TIs
			if ((count thisTI) is 0) then -- Textual backslash ("\\"). The following text item will be text too.
				set item i of o's TIs to missing value
				set zapNext to true
			else if (zapNext) then -- Text following a textual backslash.
				set item i of o's TIs to missing value
				set zapNext to false
			else if (character 1 of thisTI is in skippables) then -- Exotic character, textual brace, or line end.
				set item i of o's TIs to missing value
			else if (thisTI contains " ") and (thisTI does not start with "fcharset") then -- Probably an attribute tag.
				set item i of o's TIs to word 1 of thisTI & " "
			end if
		end repeat
	end considering
	set rtf to o's TIs's strings as string
	set AppleScript's text item delimiters to astid
	
	return rtf
end stripTextFromRTF

-- Assuming you've already worked out that the three "chunks" are paragraphs 1 to 17, 18 to 48, and 49 to 111.
tell application "TextEdit" to set docPath to path of front document
getChunks(docPath, {{1, 17}, {18, 48}, {49, 111}})

Edit: Improved script.

Adam_Bell · September 23, 2007, 11:27pm

Fantastic, Mr. G.

I do know the paragraph ranges and can identify the name to be used for each chunk from within the chunk. Tomorrow, I’ll make a fresh start on that so I can try your method on my document.

Merci;

Adam

Adam_Bell · September 24, 2007, 3:26pm

Nigel;

I’ve discovered that the raw unicode of the document I’m working with which is an export from another program called BookEnds (which is not scriptable) starts off with every font on my machine, and doesn’t contain any newLine characters. The paragraphs of the text are delimited by \p symbols. With a few mods, however, I might get this running using your method.

Adam

Nigel_Garvey · September 24, 2007, 11:25pm

Does TextEdit display them as paragraph breaks? I can’t get them to work.

I’ve now cured the problems I noticed with my script and have edited it in my post above. I’ll leave it to you to deal with the vagaries of your file.

Adam_Bell · September 25, 2007, 12:47am

Thanks, Nigel. It turns out that most of my difficulties are caused by the “vagaries” of my file, which is an export from BookEnds. If I prepare my own files in TextEdit, even the very crude approach of grabbing the file’s attributes one word at a time and then transferring them to another works perfectly (not fast, but accurate). In my real case, I’m actually creating the second document as an entry in Journler, but this approach works there too since its language for dealing with words is the same. I think that the BookEnds file will have to have all its paragraph symbols changed back to proper newlines. I’m now convinced that it has to be fixed first.

tell application "TextEdit"
	set F to {}
	set S to {}
	set C to {}
	tell text of document 1
		set tText to it
		repeat with k from 1 to count words of it
			set F's end to font of word k
			set S's end to size of word k
			set C's end to color of word k
		end repeat
	end tell
	
	make new document
	set text of document 1 to tText
	tell text of document 1
		repeat with k from 1 to count words
			set font of word k to item k of F
			set size of word k to item k of S
			set color of word k to item k of C
		end repeat
	end tell
end tell
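
For the paragraph-symbol fix mentioned above, a generic find-and-replace handler built on text item delimiters is one possible starting point. This is only a sketch: the handler name replaceText is hypothetical, and the exact control word used in the BookEnds export has not been confirmed, so the search string has to be chosen carefully (a bare "\p" would also match standard RTF control words such as "\par" and "\pard").

```applescript
-- Replace every occurrence of oldText in theText with newText, case-sensitively.
on replaceText(theText, oldText, newText)
	set astid to AppleScript's text item delimiters
	considering case
		set AppleScript's text item delimiters to oldText
		set parts to text items of theText
	end considering
	set AppleScript's text item delimiters to newText
	set theText to parts as string
	set AppleScript's text item delimiters to astid
	return theText
end replaceText

-- Hypothetical usage on raw RTF: turn a "\p " paragraph symbol into a backslash plus line feed.
replaceText("one\\p two\\p three", "\\p ", "\\" & (ASCII character 10))
```

The raw RTF would be read from the file, passed through the handler, and written back before any chunking is attempted.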