Help needed with an AppleScript to split (explode) a text file into separate files

Bussard · February 13, 2012, 4:49pm

All,

The script takes a plain text file as input. Each file begins with the citationKey for a book. I then write all the notes on the book in this one text document: every note is preceded by a page notation (e.g. @88) and, except for the last note, ends with a unique delimiter: $%$

The script reads the file and splits it into chunks. For each chunk it appends the page number to the citationKey, combines the note text with the completed citationKey, and writes the result to a separate file, which is named after a reworked form of the citationKey. This ensures that notes are always hard-linked to their source.

set theFile to (choose file with prompt "Select a file with notes to split into separate files:")
open for access theFile
set someText to (read theFile)
close access theFile

-- Example citationKey: {Wessel, 2009, #82957}
-- Example page number: @88
-- Delimiter used in original file: $%$

on explode(delimiter, someText)
	set prevTIDs to AppleScript's text item delimiters
	set AppleScript's text item delimiters to delimiter -- This is the delimiter string that will split the text into chunks
	set output to text items of someText -- After splitting, the various chunks of text have now become individual 'items'
	set AppleScript's text item delimiters to prevTIDs -- Reset the delimiter to default value
	set itemCount to count of items of output -- itemCount stores the number of items (chunks)
	set refNo to item 1 of output -- item 1 is always the citationKey copied from the Reference manager software 
	set refNo to characters 1 through ((length of refNo) - 1) of refNo as string -- Remove the closing curly brace from the citationKey
	set refItem to last word of refNo -- Captures the clean number from the citationKey - From {Wessel, 2009, #82957} to 82957
	set i to 1 -- i is set to control the repeat loop. We start at i = 1 instead of i = 0 to skip Item 1, which is the citationKey
	try
		repeat until i is itemCount
			set i to i + 1 -- increment by one, until i becomes equal to itemCount, which halts the process
			set targetFolder to "MiniHD:Users:ImNotApc:Documents:Bookshelf in Finder:"
			set mainBody to item i of output -- Contents of each item are stored in local variable
			set mainBodyLean to paragraphs 3 thru -1 of mainBody as string -- Due to the delimiter, some empty lines were added, these are here removed
			set fileNameQuote to (characters 1 through 20 of mainBodyLean) as string -- Extract first 20 characters of the text to use in naming the file
			
			set pageNo to second paragraph of mainBody -- The second paragraph contains the page number
			set fullCite to refNo & pageNo & "}" -- Complete the citationKey, to include the page number
			set fileNamer to "{#" & refItem & pageNo & "} " & fileNameQuote & "..." -- Complete the file name, to include full citationKey plus 20 characters plus ...
			
			set writeIt to mainBodyLean & " " & fullCite -- Compose the content of each separate file, and include the full citationKey *after* the text it refers to
			set fileSpec to (open for access file (targetFolder & fileNamer as text) with write permission) -- Routine to create a new file in the targetFolder in the Finder
			
			write writeIt to fileSpec as text -- Writing the contents to the newly created file
			close access fileSpec
		end repeat
		return "Succeeded in Writing to Separate Files in:" & targetFolder -- Confirmation that process completed in writing the files
	on error
		try
			close access fileSpec
		end try
		return "Failed Writing to File."
	end try
end explode

explode("$%$", someText)

Issues:

1. The first file the script writes triggers a strange UTF-8 error message when I try to open it from the Finder. TextEdit does not open it, but Tex-Edit does, and the correct chunk of text is all there. All subsequent files are correct plain text files. Does anyone know of a solution to this?
2. If a note (a chunk of text) consists of two or more paragraphs, these are compacted into one. During the process, the paragraph returns (or line endings?) are removed. Does that happen automatically because the System reads the file, rather than TextEdit? Is there a way to prevent this from happening?

Many thanks!

Adam_Bell · February 13, 2012, 6:56pm

At one point you’ve collected paragraphs, i.e. a list of the text blocks that were separated by returns, but the returns themselves are not in the list. If you then coerce that list to text with a blank TID (text item delimiter), you’ve lost the returns. I don’t know what causes the error, unless there’s a Unicode character that can’t be represented as UTF-8. Finally, the line below is much shorter and faster than what you have:

set refNo to characters 1 through -2 of refNo as string
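
As a hedged illustration of the first point (the sample text and variable names below are made up, not taken from the script), this sketch shows how the returns disappear in the list-to-text coercion and how a text range reference keeps them:

```applescript
-- Hypothetical sample text: two paragraphs separated by a linefeed
set sampleText to "first line" & linefeed & "second line"

-- 'paragraphs of' yields a list; the line endings are not part of the items
set asList to paragraphs of sampleText

-- Coercing the list to text joins the items with the current text item delimiters;
-- with a blank delimiter the paragraphs run together
set prevTIDs to AppleScript's text item delimiters
set AppleScript's text item delimiters to ""
set joined to asList as text
set AppleScript's text item delimiters to prevTIDs

-- A text range reference returns one continuous piece of text, line endings included
set kept to text from paragraph 1 to paragraph -1 of sampleText

return {joined, kept}
```

Running this in the script editor should show the two results side by side: the first with the lines fused, the second with the line ending intact.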

Bussard · February 13, 2012, 8:08pm

Thanks, Adam,

There was indeed a UTF-8 conflict (an umlaut) in one text segment. Saving the TextEdit document as UTF-8 before running this script solved the conflict.

Thanks for the shortened line of code.

I used ‘text items’ as that is how I thought I could address the individual pieces of text after the delimiter had separated them. What would you suggest to maintain the returns inside of those pieces?

Nigel_Garvey · February 13, 2012, 9:50pm

Hi.

If the text in your original file is UTF-8, the line you use to read it should be:

-- open for access theFile -- No need to open for access if you're only reading it once.
set someText to (read theFile as «class utf8»)
-- close access theFile

Similarly, the write line should be:

write «data rdatEFBBBF» to fileSpec -- UTF-8 BOM. (May not be necessary.)
write writeIt to fileSpec as «class utf8» -- Writing the contents to the newly created file

Replace the list-to-text coercions with ‘text’ range references like these:

-- set mainBodyLean to paragraphs 3 thru -1 of mainBody as string -- Due to the delimiter, some empty lines were added, these are here removed
set mainBodyLean to text from paragraph 3 to -1 of mainBody

-- set fileNameQuote to (characters 1 through 20 of mainBodyLean) as string -- Extract first 20 characters of the text to use in naming the file
set fileNameQuote to text 1 through 20 of mainBodyLean
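
Putting the write lines together, a minimal sketch of a handler might look like the following. The handler name writeUTF8, its parameters and the example file name are assumptions for illustration, not part of the original script. It also empties any existing file first and closes the file if the write fails:

```applescript
-- Hypothetical helper: write theText as UTF-8 to the file at hfsPath (an HFS path string)
on writeUTF8(theText, hfsPath)
	set fileRef to (open for access file hfsPath with write permission)
	try
		set eof fileRef to 0 -- empty the file in case it already exists
		write theText to fileRef as «class utf8»
		close access fileRef
	on error errMsg number errNum
		-- fileRef is known to be open here, so it can be closed before re-raising the error
		close access fileRef
		error errMsg number errNum
	end try
end writeUTF8

-- Example call with a made-up file name on the desktop
writeUTF8("Grüße aus dem Notizbuch", (path to desktop as text) & "utf8-test.txt")
```

Opening the file before the try block means the error branch only ever runs with a valid file reference, which avoids closing a variable that was never set.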

Bussard · February 14, 2012, 9:14am

Many thanks, Nigel, the script flies through the files! Thanks to your corrections, it is now also possible to input Markdown-formatted text and have the script produce a whole series of separate Markdown files from it.