How to select and keep some lines in a Pages file and delete the rest

Hello guys.

I have a huge txt file with thousands of paragraphs. Every paragraph starts with * (that’s very helpful, I guess) and has a title (see the example).

*Here is the title (some data, 0.9 deg.) -

Text here… until a new one!

*Here is the title (some data, 0.0 deg.) -

Text here… until a new one!

At the end of each title there are parentheses, and inside them are some numbers in degree format.

I want to keep only the 0.0 degree titles and the text underneath them, and delete every other title and its text. How can I do that? It would save me valuable time…

You may try this:


# Of course you may define the document as you want.
set leChemin to (path to desktop folder as text) & "pour essais.pages"

tell application "Pages"
	open file leChemin
	tell body text of document 1
		set theParas to paragraphs whose first character is "*"
	end tell
end tell
set theNumbers to {}
repeat with unPara in theParas
	set aNumber to item -2 of my decoupe(unPara as text, {", ", " deg."})
	set end of theNumbers to aNumber
end repeat
theNumbers
--> {"0.9", "0.0"}

#===== handler

on decoupe(t, d)
	local oTids, l
	set {oTids, AppleScript's text item delimiters} to {AppleScript's text item delimiters, d}
	set l to text items of t
	set AppleScript's text item delimiters to oTids
	return l
end decoupe

#===== handler 

Yvan KOENIG (VALLAURIS, France) Thursday 12 March 2015 18:15:47

Hello Yvan, and thank you very much for your time. I appreciate it.

The script opens the file but doesn’t delete the text areas with any of the degrees other than 0.0 (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 & 1.0). It doesn’t change anything in the text.

In the Script Editor results area I got… {“0.9”, “0.0”}

but there are no changes in the text!!! I am trying to figure out what’s wrong!

It seems that I read too fast.


# Of course you may define the document as you want.
set leChemin to (path to desktop folder as text) & "pour essais.pages"

tell application "Pages"
	open file leChemin
	tell body text of document 1
		repeat
			if (paragraph 1 starts with "*") and paragraph 1 contains "0.0 deg." then exit repeat
			delete paragraph 1
		end repeat
	end tell
end tell

Yvan KOENIG (VALLAURIS, France) Thursday 12 March 2015 22:37:56

Doesn’t work either. :frowning:

It doesn’t make any change to the text. Is there a chance that it can’t be done?

The format of the text is exactly as below…

*Here is the title (some data, 0.9 deg.) -

Text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here.

*Here is the title (some data, 0.0 deg.) -

Text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here.

*Here is the title (some data, 0.3 deg.) -

Text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here.

etc.

Is there a chance that the breaks (the empty lines in between) are causing the problem?

When I run the script on your sample, the result is :

[i]*Here is the title (some data, 0.0 deg.) -

Text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here.

*Here is the title (some data, 0.3 deg.) -

Text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here.

etc.

Is there a chance the breaks (empty lines between) to create the problem?

Offline[/i]

Isn’t that what was asked?

If it’s not, then there is something that I don’t understand in the English (not a surprise, as it’s not my main language).

Maybe you wanted some other items to be removed as well, as this script does.


# Of course you may define the document as you want.
set leChemin to (path to desktop folder as text) & "pour essais.pages"

tell application "Pages"
	open file leChemin
	tell body text of document 1
		repeat
			if (paragraph 1 starts with "*") and paragraph 1 contains "0.0 deg." then exit repeat
			delete paragraph 1
		end repeat
		# Here we have removed what was above the title to keep
		repeat with i from 2 to 10000
			if (paragraph i starts with "*") then exit repeat
		end repeat
		# Here i is the index of the first title/paragraph to delete
		
		repeat
			if (count paragraphs) = i then exit repeat
			delete paragraph i
		end repeat
	end tell
end tell

Yvan KOENIG (VALLAURIS, France) Friday 13 March 2015 09:38:24

Yvan, thank you!

The last script ran, and I could see the text being deleted. But it deleted everything and left only 2 titles (0.0 and 0.1) and 1 text. There must be at least 100 titles with 0.0 deg. Weird. And I’ve just noticed something very important: there are dates that I need to keep (sorry about that, I only just thought of it). Let me be more detailed.

Friday, 27 Mar 2015 - Bill

*Here is the title (some data, 0.9 deg.) -

Text here… until a new one!

*Here is the title (some data, 0.0 deg.) -

Text here… until a new one!

Saturday, 28 Mar 2015 - Bill

*Here is the title (some data, 0.2 deg.) -

Text here… until a new one!

*Here is the title (some data, 1.0 deg.) -

Text here… until a new one!

*Here is the title (some data, 0.0 deg.) -

Text here… until a new one!

and I would like it to look like this:

Friday, 27 Mar 2015 - Bill

*Here is the title (some data, 0.0 deg.) -
Text here… until a new one!

Saturday, 28 Mar 2015 - Bill

*Here is the title (some data, 0.0 deg.) -
Text here… until a new one!

Is it possible? Maybe it cannot be done! What do you think?

I’m really puzzled.

Here the script left only:
[i]*Here is the title (some data, 0.0 deg.) -

Text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here, text here.[/i]

The source I used was extracted from your message.

Now that you have, at last, described the real problem, I assume that it’s a job for regular expressions.
It’s an area where I am completely ignorant, but some helpers here are really fluent with them.
With a bit of luck one of them will see this thread.

Yvan KOENIG (VALLAURIS, France) Friday 13 March 2015 22:08:28
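
For what it’s worth, the regular-expression idea can be sketched outside Pages. The following is only a minimal illustration in Python, working on plain text rather than on the Pages document itself; the name `degree_of` and the pattern are assumptions based solely on the sample titles posted in this thread.

```python
import re

# Matches a title such as "*Here is the title (some data, 0.9 deg.) -".
# Spaces before the asterisk are tolerated (an assumption, not a given).
TITLE_RE = re.compile(r"^\s*\*.*?(\d+\.\d+) deg\.\)")


def degree_of(line):
    """Return the degree value of a title line as text, or None if the line is not a title."""
    match = TITLE_RE.match(line)
    return match.group(1) if match else None


print(degree_of("*Here is the title (some data, 0.9 deg.) -"))
print(degree_of("Text here, text here."))
```

Returning the value as text keeps the comparison with "0.0" exact and avoids any floating-point rounding questions.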

Hi.

Assuming there are no more surprises to come, this does what’s described and it’s quite fast. BUT it’s only been tested with Pages 4.3. I’ve no idea if it works with Pages 5.whatever. :slight_smile:

Edit: Now tested with and adapted for Pages 5.2.2.


-- Tested with Pages 4.3 and Pages 5.2.2.
-- This is the version for 5.2.2.

tell application "Pages"
	tell body text of document 1
		-- View the top of the document so that Pages doesn't have to keep updating the display during the deletions.
		-- select insertion point before character 1 -- Commented out. Pages 5.2.2 doesn't understand 'select' or 'insertion point'.
		-- Get the paragraphs as a list of text to parse in AppleScript rather than in Pages.
		set paras to paragraphs
	end tell
end tell

-- Initialise paragraph indices for the key paragraph types.
set pDate to (count paras) + 1 -- Date paragraphs (ending with  "- Bill" & return).
set pToKeep to pDate -- Paragraphs beginning with "*" and containing "0.0 deg."
set pToCut to pDate -- Other paragraphs beginning with "*".
-- Preconcatenated text.
set BillReturn to "- Bill" -- & return -- Pages 5.2.2 doesn't include returns in 'paragraph' results.

-- Parse the list, deleting blocks of unwanted paragraphs from the document in reverse order for ease of indexing and to minimise the housekeeping for Pages.
repeat with p from pDate - 1 to 1 by -1
	set thisPara to item p of paras
	-- Look out for paragraphs beginning with "*" or ending with "- Bill" & return.
	if (thisPara begins with "*") then
		if (thisPara contains "0.0 deg.") then
			-- This is a "*" paragraph containing "0.0 deg."
			if (pDate comes after pToKeep) then
				-- If there's been no date since the last such paragraph, set the date pointer to that paragraph to prevent it being deleted and delete the blank paragraph after it.
				set pDate to pToKeep
				tell application "Pages" to delete paragraph (pToKeep + 1) of body text of front document
			end if
			-- Note the index of the current "0.0 deg." paragraph..
			set pToKeep to p
			-- Delete from the following not-"0.0 deg." asterisk paragraph (previously indexed) to just before the next date paragraph (ditto).
			if (pToCut < pDate) then tell application "Pages" to delete paragraphs pToCut thru (pDate - 1) of body text of document 1
		else
			-- This is a "*" paragraph not containing "0.0 deg.". Note its index.
			set pToCut to p
		end if
	else if (thisPara ends with BillReturn) then
		-- This is a date paragraph.
		if (pToKeep comes after pDate) then
			-- If there's been no "0.0 deg." since the last date paragraph, set "0.0 deg." pointer to that date paragraph to prevent it being deleted.
			set pToKeep to pDate
		else
			-- Otherwise delete the empty paragraph after that "0.0 deg."
			tell application "Pages" to delete paragraph (pToKeep + 1) of body text of document 1
		end if
		-- Note the index of the current date paragraph.
		set pDate to p
		-- Delete from the following not-"0.0 deg." asterisk paragraph (previously indexed) to just before the next "0.0 deg." paragraph (ditto).
		if (pToCut < pToKeep) then tell application "Pages" to delete paragraphs pToCut thru (pToKeep - 1) of body text of document 1
	end if
end repeat
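
The reason for working backwards through the paragraphs can be seen in a small, self-contained sketch. It is written in Python on an ordinary list rather than in AppleScript on a Pages document, and `delete_ranges` is a hypothetical helper, not part of the script above; it assumes the ranges don’t overlap.

```python
def delete_ranges(items, ranges):
    """Delete inclusive, 1-based (first, last) ranges from a copy of items."""
    result = list(items)
    # Start with the highest range: removing items near the end never
    # changes the positions of the items before them.
    for first, last in sorted(ranges, reverse=True):
        del result[first - 1:last]
    return result


paragraphs = ["date", "*0.9", "text", "*0.0", "text", "*0.3", "text"]
print(delete_ranges(paragraphs, [(2, 3), (6, 7)]))
```

If the ranges were deleted front to back instead, every later index would have to be recalculated after each deletion.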

Yvan, thank you very much for spending your time on this for me! I appreciate it a lot.

Nigel, thank you too for your attempt! Unfortunately, I am working with version 5 and I received this message (I don’t know if it has to do with the version).

error “Pages got an error: insertion point before character 1 of body text of document 1 doesn’t understand the “select” message.” number -1708 from insertion point before character 1 of body text of document 1

That must be caused by the ‘select insertion point before character 1’ line near the top of the script. The line’s just a trick to speed things up if your document’s quite long. It’s not a vital part of the process and can be deleted or commented out without affecting the rest of the script. But I don’t know what other incompatibilities you’d find after that.

Nigel, I removed the line about select insertion.

What I have noticed is that it stopped on a title with 0.0. I pressed play again and it deleted the next line and stopped. Play again, and it deleted the next line. When it found a title with 0.0 (after many presses of play) it deleted that too. Another thing: it also deletes the dates (which I need).

Would it be easier if I copied the text into a txt file instead of Pages?
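
On the plain-text idea: if the text were exported to a txt file, the whole filter could be sketched in a few lines. This is only a rough illustration in Python, not a tested replacement for the Pages scripts. It assumes that date lines end with "- Bill", that titles start with "*" (possibly preceded by spaces), and that the blocks to keep contain "0.0 deg."; the function name `keep_zero_degree` is hypothetical.

```python
def keep_zero_degree(text, keep_marker="0.0 deg.", date_suffix="- Bill"):
    """Keep date lines and the title/text blocks whose title contains keep_marker."""
    groups = []      # each group is a list of lines that are printed together
    keeping = False  # True while inside a block that should be kept
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines are dropped and rebuilt when joining
        if line.endswith(date_suffix):
            groups.append([line])
            keeping = False
        elif line.startswith("*"):
            # A plain substring test would also match e.g. "10.0 deg.";
            # the values mentioned in the thread only run from 0.0 to 1.0.
            keeping = keep_marker in line
            if keeping:
                groups.append([line])
        elif keeping:
            groups[-1].append(line)
    # One blank line between groups, none inside a group.
    return "\n\n".join("\n".join(group) for group in groups)


sample = """Friday, 27 Mar 2015 - Bill

*Here is the title (some data, 0.9 deg.) -

Text to drop.

*Here is the title (some data, 0.0 deg.) -

Text to keep.
"""
print(keep_zero_degree(sample))
```

Any text that appears before the first title, or between a date line and the next title, is dropped by this sketch.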

OK. I’ve unzipped a copy of Pages 5.2.2 to have a look for myself. It seems that, unlike 4.3, 5.2.2 returns ‘paragraphs’ without returns on their ends. (A much better idea!) So the value of my ‘BillReturn’ variable should just be “- Bill”, not “- Bill” & return. With this additional adjustment, the script works with Pages 5.2.2 and the text you’ve posted. I’ve edited the code in post #10 above.

Nigel, thank you very much for your effort. I appreciate that you updated your Pages version for this script. I appreciate it a lot.

My Pages version is 5.5.2. The script responds in the same way as before. I don’t know why.

I could copy and paste the whole text into a text editor instead of Pages, so as to avoid the Pages problems. :frowning:

Here is a version working with both versions of Pages.


-- Tested with Pages 4.3 and Pages 5.2.2.
-- This is the version for both

tell application "Pages"
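	-- Compare as version numbers rather than plain strings, so that e.g. "10.0" also counts as ≥ "5".
	considering numeric strings
		set isVersion5 to version ≥ "5"
	end considering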
	tell body text of document 1
		-- View the top of the document so that Pages doesn't have to keep updating the display during the deletions.
		if isVersion5 then
			tell application "System Events" to tell process "Pages"
				set frontmost to true
				key code 126 using {command down}
			end tell
		else
			select insertion point before character 1
		end if
		-- Get the paragraphs as a list of text to parse in AppleScript rather than in Pages.
		set paras to paragraphs
	end tell
end tell

-- Initialise paragraph indices for the key paragraph types.
set pDate to (count paras) + 1 -- Date paragraphs (ending with  "- Bill" & return).
set pToKeep to pDate -- Paragraphs beginning with "*" and containing "0.0 deg."
set pToCut to pDate -- Other paragraphs beginning with "*".
-- Preconcatenated text.
if isVersion5 then
	set BillReturn to "- Bill" -- & return -- Pages 5.2.2 doesn't include returns in 'paragraph' results.
else
	set BillReturn to "- Bill" & return
end if

-- Parse the list, deleting blocks of unwanted paragraphs from the document in reverse order for ease of indexing and to minimise the housekeeping for Pages.
repeat with p from pDate - 1 to 1 by -1
	set thisPara to item p of paras
	-- Look out for paragraphs beginning with "*" or ending with "- Bill" & return.
	if (thisPara begins with "*") then
		if (thisPara contains "0.0 deg.") then
			-- This is a "*" paragraph containing "0.0 deg."
			if (pDate comes after pToKeep) then
				-- If there's been no date since the last such paragraph, set the date pointer to that paragraph to prevent it being deleted and delete the blank paragraph after it.
				set pDate to pToKeep
				tell application "Pages" to delete paragraph (pToKeep + 1) of body text of front document
			end if
			-- Note the index of the current "0.0 deg." paragraph..
			set pToKeep to p
			-- Delete from the following not-"0.0 deg." asterisk paragraph (previously indexed) to just before the next date paragraph (ditto).
			if (pToCut < pDate) then tell application "Pages" to delete paragraphs pToCut thru (pDate - 1) of body text of document 1
		else
			-- This is a "*" paragraph not containing "0.0 deg.". Note its index.
			set pToCut to p
		end if
	else if (thisPara ends with BillReturn) then
		-- This is a date paragraph.
		if (pToKeep comes after pDate) then
			-- If there's been no "0.0 deg." since the last date paragraph, set "0.0 deg." pointer to that date paragraph to prevent it being deleted.
			set pToKeep to pDate
		else
			-- Otherwise delete the empty paragraph after that "0.0 deg."
			tell application "Pages" to delete paragraph (pToKeep + 1) of body text of document 1
		end if
		-- Note the index of the current date paragraph.
		set pDate to p
		-- Delete from the following not-"0.0 deg." asterisk paragraph (previously indexed) to just before the next "0.0 deg." paragraph (ditto).
		if (pToCut < pToKeep) then tell application "Pages" to delete paragraphs pToCut thru (pToKeep - 1) of body text of document 1
	end if
end repeat

In fact, I wrote it before reading Nigel’s last message, so I hadn’t edited the instruction defining BillReturn.

It worked, and the result matched what Viti posted: a blank line between the “filled” ones.

I don’t know which one matches Viti’s needs.
In the version posted here it’s easy to enable or disable the instruction defining BillReturn.

As the code using GUI Scripting to move the cursor to the top is accepted by both versions, and assuming that the “large” line spacing is the wanted one, the script may be reduced to:


-- Tested with Pages 4.3 and Pages 5.2.2.
-- This is the version for both

tell application "Pages"
	set isVersion5 to version ≥ "5"
	tell body text of document 1
		-- View the top of the document so that Pages doesn't have to keep updating the display during the deletions.
		
		tell application "System Events" to tell process "Pages"
			set frontmost to true
			key code 126 using {command down}
		end tell
		
		-- Get the paragraphs as a list of text to parse in AppleScript rather than in Pages.
		set paras to paragraphs
	end tell
end tell

-- Initialise paragraph indices for the key paragraph types.
set pDate to (count paras) + 1 -- Date paragraphs (ending with  "- Bill" & return).
set pToKeep to pDate -- Paragraphs beginning with "*" and containing "0.0 deg."
set pToCut to pDate -- Other paragraphs beginning with "*".
-- Preconcatenated text.
set BillReturn to "- Bill" & return

-- Parse the list, deleting blocks of unwanted paragraphs from the document in reverse order for ease of indexing and to minimise the housekeeping for Pages.
repeat with p from pDate - 1 to 1 by -1
	set thisPara to item p of paras
	-- Look out for paragraphs beginning with "*" or ending with "- Bill" & return.
	if (thisPara begins with "*") then
		if (thisPara contains "0.0 deg.") then
			-- This is a "*" paragraph containing "0.0 deg."
			if (pDate comes after pToKeep) then
				-- If there's been no date since the last such paragraph, set the date pointer to that paragraph to prevent it being deleted and delete the blank paragraph after it.
				set pDate to pToKeep
				tell application "Pages" to delete paragraph (pToKeep + 1) of body text of front document
			end if
			-- Note the index of the current "0.0 deg." paragraph..
			set pToKeep to p
			-- Delete from the following not-"0.0 deg." asterisk paragraph (previously indexed) to just before the next date paragraph (ditto).
			if (pToCut < pDate) then tell application "Pages" to delete paragraphs pToCut thru (pDate - 1) of body text of document 1
		else
			-- This is a "*" paragraph not containing "0.0 deg.". Note its index.
			set pToCut to p
		end if
	else if (thisPara ends with BillReturn) then
		-- This is a date paragraph.
		if (pToKeep comes after pDate) then
			-- If there's been no "0.0 deg." since the last date paragraph, set "0.0 deg." pointer to that date paragraph to prevent it being deleted.
			set pToKeep to pDate
		else
			-- Otherwise delete the empty paragraph after that "0.0 deg."
			tell application "Pages" to delete paragraph (pToKeep + 1) of body text of document 1
		end if
		-- Note the index of the current date paragraph.
		set pDate to p
		-- Delete from the following not-"0.0 deg." asterisk paragraph (previously indexed) to just before the next "0.0 deg." paragraph (ditto).
		if (pToCut < pToKeep) then tell application "Pages" to delete paragraphs pToCut thru (pToKeep - 1) of body text of document 1
	end if
end repeat

Yvan KOENIG (VALLAURIS, France) Saturday 14 March 2015 14:22:09

Sorry Yvan, neither of them works… The script starts and then immediately stops.

The scripts work for both Yvan and myself. The only theories I have are:

  1. You may be looking at the top of the document and not waiting for the edits (which start at the bottom) to reach there.
  2. The text you want to edit may not be how you’ve described it.
  3. The text you want to edit may not be consistently how you’ve described it.
  4. There may be invisible characters in the text (such as spaces or tabs at the ends of lines) which don’t show up on this Web site.
  5. The security system on your computer may be disallowing the scripted keystroke, although since my version of the script doesn’t have it, this is unlikely to be the main problem.

The script does currently have a weakness in that any text before the first “date” line isn’t treated.
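
Theory 4 in the list above is easy to check on a plain-text copy of the document. The sketch below is in Python and is only an illustration; `suspicious_lines` is a hypothetical helper that reports every line carrying leading or trailing whitespace, shown with `repr` so that spaces and tabs become visible.

```python
def suspicious_lines(text):
    """Yield (line_number, repr(line)) for lines with leading or trailing whitespace."""
    for number, line in enumerate(text.splitlines(), start=1):
        if line != line.strip():
            yield number, repr(line)


sample = "*Here is a title (some data, 0.0 deg.) -\n *Here is a title (some data, 0.3 deg.) - \n"
for number, shown in suspicious_lines(sample):
    print(number, shown)
```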

Ok then, test it please.

Here is the Pages file.

http://we.tl/sJZNfGrV5Y

Phew! I had to agree to the Web site’s conditions before it would let me have the file, download the file, and then download Pages 5.5.2 in order to be able to open it! :wink:

The problem turns out to be my theory number 2. The asterisk lines don’t begin with asterisks, but with spaces. So …

if (thisPara begins with "*") then

… should be …

if (thisPara begins with " *") then

The script then performs the desired edits. But the result’s rather untidy as there are page numbers which get preserved because they follow “0.0 deg.” lines and there’s some uneven paragraph spacing for reasons I haven’t yet worked out.
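
A test that doesn’t depend on the exact number of leading spaces, or on whether a return is attached to the end of the paragraph, would be more forgiving. As a minimal sketch in Python on plain text (the function `classify` and its labels are illustrative assumptions, not part of the scripts above):

```python
def classify(paragraph, keep_marker="0.0 deg.", date_suffix="- Bill"):
    """Return 'keep', 'cut', 'date' or 'other' for one paragraph of text."""
    text = paragraph.strip()  # drops leading spaces and any trailing return
    if text.startswith("*"):
        return "keep" if keep_marker in text else "cut"
    if text.endswith(date_suffix):
        return "date"
    return "other"


print(classify(" *Here is the title (some data, 0.0 deg.) -"))
print(classify("Friday, 27 Mar 2015 - Bill\r"))
```

The same stripping step covers both differences that came up in this thread: the spaces before the asterisks and the trailing returns that some Pages versions include in their paragraph results.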