Breaking long text through keywords

I have several long documents with different content and encodings. Each one contains a keyword that marks where the document can be broken into parts.

The keyword is the same throughout a given document, but it is different for every document.

Example for document_01 in Folder Named “Collected Docs”

keyword: Mountain
text text text
keyword: Mountain (same as previous)
text text text

Example for document_02 still in Folder Named “Collected Docs”

keyword: China
text text text
keyword: China (same as previous)
text text text

and so on

I would be grateful for help with a script that would:

1st assign the first keyword (for example, Mountain for document_01) to the script and flag document_01 as “done”

2nd break document_01 into several files, one for each occurrence of the keyword the script finds

3rd move all the new files from document_01 into a folder which I can name at the end of the script

4th go back to the “Collected Docs” folder and repeat the process on document_02 with the second keyword (for example, China for document_02), and so on until every document has been processed
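
As a rough illustration of step 2 only, here is a minimal sketch of how AppleScript's text item delimiters can cut a text at each occurrence of a keyword. The handler name splitOnKeyword and the sample text are hypothetical; real documents would be read from disk. Because each document starts with its keyword, the first piece comes back empty, so the sketch drops empty pieces.

```applescript
-- Hypothetical helper: split theText at every occurrence of theKeyword
on splitOnKeyword(theText, theKeyword)
	set savedDelimiters to AppleScript's text item delimiters
	set AppleScript's text item delimiters to theKeyword
	set theParts to text items of theText
	set AppleScript's text item delimiters to savedDelimiters
	return theParts
end splitOnKeyword

-- Hypothetical sample standing in for document_01
set sampleText to "Mountain" & return & "first note" & return & "Mountain" & return & "second note"
set parts to splitOnKeyword(sampleText, "Mountain")

-- The text begins with the keyword, so item 1 of parts is the empty string.
-- Keep only the non-empty pieces so that no empty file would be written.
set nonEmptyParts to {}
repeat with aPart in parts
	if (contents of aPart) is not "" then set end of nonEmptyParts to (contents of aPart)
end repeat
```

Each item of nonEmptyParts could then be written to its own file.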

Thanks for any help

regards

I assume that this piece of code may be a starting point.


(*
Structure of the parameters file :

document01.txt<TAB>keyword1
document02.txt<TAB>keyword2
document03.txt<TAB>keyword3

Yvan KOENIG (VALLAURIS, France)
2010/08/13
*)

property nom_du_fichier_parametres : "parameters.txt"
property extension_txt : ".txt"

on run
	set le_dossier to "" & (choose folder)
	
	tell application "System Events"
		set fichiers_text to name of every file of folder (le_dossier) whose type identifier is "public.plain-text"
	end tell
	
	if fichiers_text does not contain nom_du_fichier_parametres then
		error "The file \"" & nom_du_fichier_parametres & "\" is unavailable !"
	end if
	set les_parametres to paragraphs of (read file (le_dossier & nom_du_fichier_parametres))
	
	repeat with ref_des_parametres in les_parametres
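		-- Pad the list so that a line without a tab (such as a blank last line) does not raise an error
		set {nom_source, delimiteur} to (my decoupe(ref_des_parametres, tab)) & {"", ""}
		if (nom_source ends with extension_txt) and (delimiteur is not "") then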
			set nom_du_dossier_cible to text 1 thru -(1 + (length of extension_txt)) of nom_source
			set chemin_cible to le_dossier & nom_du_dossier_cible
			tell application "System Events"
				if exists folder chemin_cible then set name of disk item chemin_cible to (nom_du_dossier_cible & (do shell script "date +_%Y%m%d-%H%M%S"))
				make new folder at end of folder le_dossier with properties {name:nom_du_dossier_cible}
			end tell -- System Events
			
			set les_blocs to my decoupe(read file (le_dossier & nom_source), delimiteur)
			repeat with i from 1 to count of les_blocs
				set nom_numero_i to nom_du_dossier_cible & "#" & text -3 thru -1 of ("000" & i) & extension_txt
				tell application "System Events" to make new file at end of folder chemin_cible with properties {name:nom_numero_i}
				write "" & item i of les_blocs to file (chemin_cible & ":" & nom_numero_i)
			end repeat
		end if -- nom_source ends.
	end repeat
end run

--=====

on decoupe(t, d)
	local oTIDs, l
	set oTIDs to AppleScript's text item delimiters
	set AppleScript's text item delimiters to d
	set l to text items of t
	set AppleScript's text item delimiters to oTIDs
	return l
end decoupe

--=====

Yvan KOENIG (VALLAURIS, France) Friday 13 August 2010 18:11:46

Hello

I edited it a bit because I wasn’t fully satisfied by the date_time stamp applied to existing folders.
Now, I no longer use the current date_time but the modification_date_time of the existing folder.


(*
Structure of the parameters file :

document01.txt<TAB>keyword1
document02.txt<TAB>keyword2
document03.txt<TAB>keyword3

Yvan KOENIG (VALLAURIS, France)
2010/08/13
changed the date_time stamp of existing folders.
Now it's no longer built according to current date_time but to the folder's modification date.
*)

property nom_du_fichier_parametres : "parameters.txt"
property extension_txt : ".txt"

on run
	set le_dossier to "" & (choose folder)
	
	tell application "System Events"
		if not (exists file (le_dossier & nom_du_fichier_parametres)) then
			error "The file \"" & nom_du_fichier_parametres & "\" is unavailable !"
		end if
	end tell
	
	set les_parametres to paragraphs of (read file (le_dossier & nom_du_fichier_parametres))
	
	repeat with ref_des_parametres in les_parametres
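		-- Pad the list so that a line without a tab (such as a blank last line) does not raise an error
		set {nom_source, delimiteur} to (my decoupe(ref_des_parametres, tab)) & {"", ""}
		if (nom_source ends with extension_txt) and (delimiteur is not "") then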
			set nom_du_dossier_cible to text 1 thru -(1 + (length of extension_txt)) of nom_source
			set chemin_cible to my makeNewFolder(le_dossier, nom_du_dossier_cible)
			set les_blocs to my decoupe(read file (le_dossier & nom_source), delimiteur)
			repeat with i from 1 to count of les_blocs
				set nom_numero_i to nom_du_dossier_cible & "#" & text -3 thru -1 of ("000" & i) & extension_txt
				tell application "System Events" to make new file at end of folder chemin_cible with properties {name:nom_numero_i}
				write "" & item i of les_blocs to file (chemin_cible & ":" & nom_numero_i)
			end repeat
		end if -- nom_source ends.
	end repeat
end run

--=====

on makeNewFolder(dossier_hote, sous_dossier)
	tell application "System Events"
		set chemin_cible to dossier_hote & sous_dossier
		if exists folder chemin_cible then
			set date_de_modification to modification date of disk item chemin_cible
			set les_secondes to time of date_de_modification
			set name of disk item chemin_cible to (sous_dossier & "_" & year of date_de_modification & text -2 thru -1 of ("0" & (month of date_de_modification as integer)) & text -2 thru -1 of ("0" & day of date_de_modification) & "_" & text -2 thru -1 of ("0" & les_secondes div 3600) & text -2 thru -1 of ("0" & (les_secondes mod 3600) div 60) & text -2 thru -1 of ("0" & les_secondes mod 60))
		end if
		make new folder at end of folder dossier_hote with properties {name:sous_dossier}
	end tell -- System Events
	return chemin_cible
end makeNewFolder

--=====

on decoupe(t, d)
	local oTIDs, l
	set oTIDs to AppleScript's text item delimiters
	set AppleScript's text item delimiters to d
	set l to text items of t
	set AppleScript's text item delimiters to oTIDs
	return l
end decoupe

--=====

Yvan KOENIG (VALLAURIS, France) Friday 13 August 2010 21:11:04

Thanks, but unfortunately the script stops as soon as it has run through the first document. The error happens before the document is broken into parts.

Can this be because the documents are general writings and include “:” or “;” or other punctuation signs?

I am only a novice with AppleScript, and at my age I find it hard to learn it properly.

You give me this instruction:

Structure of the parameters file :

document01.txt<TAB>keyword1
document02.txt<TAB>keyword2
document03.txt<TAB>keyword3

Do I actually need to insert this in

property nom_du_fichier_parametres : "parameters.txt"?

In any case, the script unfortunately stops after scanning the first document and before breaking it at the keyword.

The texts in question are notes taken during trips, and the keywords are indeed as I described: unique to each document. I set them up in TextWrangler, and I am certain that no keyword appears in more than one document.

I could change all the “wrong chars” in TextWrangler at once if that is the cause.

I tried to record a script there, as it has powerful scripting capabilities, but the results were poor.

Thanks a lot anyway

Danwan

Give this a try. It works on only 1 file and the script will ask you for 1) the file to convert, 2) the keyword, and 3) the folder where you want the created text files saved.

set theFile to choose file of type {"txt"} with prompt "Choose the file to convert."
set theKeyword to text returned of (display dialog "What is the keyword for this file?" default answer "")
set outFolder to (choose folder with prompt "Where would you like the output files saved?") as text

-- read the file
try
	set theText to read theFile
on error
	display dialog "The file could not be read:" & return & (theFile as text) buttons {"OK"} default button 1 with icon stop
	return
end try

-- get the parts of the file separated by the keyword
set text item delimiters to theKeyword
set textList to text items of theText
set text item delimiters to ""

set {Nm, Ex} to getName_andExtension(theFile)
repeat with i from 1 to count of textList
	-- calculate the output file path
	set theNum to text -3 thru -1 of ("000" & (i as text))
	set newPath to outFolder & Nm & "_" & theNum & ".txt"
	
	-- write the text to newPath
	set success to writeTo(item i of textList & theKeyword, newPath, false, text)
	if success is false then
		display dialog "There was a problem writing to the file:" & return & newPath buttons {"OK"} default button 1 with icon stop
		exit repeat
	end if
end repeat



(*============== SUBROUTINES =============*)
on writeTo(fileData, targetFile, apendData, mode) -- apendData is true or false... mode is string, list, or record (no quotes around either)
	try
		set targetFile to targetFile as text
		if targetFile does not contain ":" then set targetFile to POSIX file targetFile as text
		set openFile to open for access file targetFile with write permission
		if apendData is false then set eof of openFile to 0
		write fileData to openFile starting at eof as mode
		close access openFile
		return true
	on error
		try
			close access file targetFile
		end try
		return false
	end try
end writeTo

on getName_andExtension(F)
	set F to F as Unicode text
	set {name:Nm, name extension:Ex} to info for file F without size
	if Ex is missing value then set Ex to ""
	if Ex is not "" then
		set Nm to text 1 thru ((count Nm) - (count Ex) - 1) of Nm
	end if
	return {Nm, Ex}
end getName_andExtension

I used <TAB> as a symbol for the tab character.

Here the script behaves flawlessly.
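
To make the point about the tab character concrete, here is a small sketch of how one line of the parameters file is cut into a file name and a keyword. It reuses the decoupe handler from the script above; the sample line and the variable names uneLigne, ligneVide and ligneUtilisable are only illustrative.

```applescript
-- Same splitting handler as in the script above
on decoupe(t, d)
	local oTIDs, l
	set oTIDs to AppleScript's text item delimiters
	set AppleScript's text item delimiters to d
	set l to text items of t
	set AppleScript's text item delimiters to oTIDs
	return l
end decoupe

-- Hypothetical line of parameters.txt: file name, one real tab character, keyword
set uneLigne to "document01.txt" & tab & "Mountain"
set {nom_source, delimiteur} to decoupe(uneLigne, tab)
-- nom_source is now "document01.txt" and delimiteur is "Mountain"

-- A line with no tab (for example a blank last line) yields a one-item list,
-- which cannot be assigned to two variables, so it is worth testing first.
set ligneVide to ""
set ligneUtilisable to (ligneVide contains tab)
```

If the file name and the keyword are typed with spaces or with the literal characters "<TAB>" between them instead of a real tab, the split returns a single item and the script cannot find the keyword.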

Yvan KOENIG (VALLAURIS, France) Saturday 14 August 2010 11:27:44
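
On the earlier question about punctuation: as far as I know, text item delimiters treat “:” and “;” like any other characters, so they should not stop the split. The encoding may matter more, since read without an "as" parameter is generally understood to use the system's primary encoding rather than UTF-8. The sketch below assumes the documents are UTF-8; it writes a small test file to the temporary items folder and reads it back. The names cheminTest, texteLu and morceaux and the sample text are hypothetical.

```applescript
-- Hypothetical test file in the temporary items folder
set cheminTest to (path to temporary items as text) & "essai_utf8.txt"

-- Write a sample containing punctuation and accented characters as UTF-8
set f to open for access file cheminTest with write permission
try
	set eof of f to 0
	write "Mountain: café; thé" to f as «class utf8»
	close access f
on error
	close access f
end try

-- Read it back, stating the encoding explicitly
set texteLu to read file cheminTest as «class utf8»

-- Split on the keyword: the punctuation stays inside the second piece
set oTIDs to AppleScript's text item delimiters
set AppleScript's text item delimiters to "Mountain"
set morceaux to text items of texteLu
set AppleScript's text item delimiters to oTIDs
```

If the documents use different encodings, each one would need the matching "as" parameter on read, and the parameters file should be read with its own encoding as well so that the keywords compare correctly.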