Find content change in a text list

I have hundreds of pages of text like this sample:

Page 797 090001331
Page 798 090001331
Page 799 090001501
Page 800 090001501
Page 801 090001501
Page 802 090001501
Page 803 090001501
Page 804 090001501
Page 805 090001856
Page 806 090001856
Page 807 090001856
Page 808 090001856
Page 809 090001856
Page 810 090001856
Page 811 100000099
Page 812 100000099
Page 813 100000099
Page 814 100000099

For AppleScript, each line above is a paragraph. I have to separate the data into chunks of roughly 800 paragraphs each, but the separations MUST occur where the 9-digit number changes to a new 9-digit number (see lines Page 804 and Page 805).
If I hit the line “Page 800 090001501”, what formula would I use to determine the line where “090001501” becomes the next section, 090001856, for example? The breaks in the pages are at random intervals. I know I have to increment and compare till the 9-digit number changes at every 800 paragraph point, but could use some direction.

Thanks,
Bob R.

Hey Bob,

This is just a little quick experimentation:


set pageList to paragraphs of text 2 thru -2 of "
Page 797 *090001331*
Page 798 *090001331*
Page 799 *090001501*
Page 800 *090001501*
Page 801 *090001501*
Page 802 *090001501*
Page 803 *090001501*
Page 804 *090001501*
Page 805 *090001856*
Page 806 *090001856*
Page 807 *090001856*
Page 808 *090001856*
Page 809 *090001856*
Page 810 *090001856*
Page 811 *100000099*
Page 812 *100000099*
Page 813 *100000099*
Page 814 *100000099*
"
set AppleScript's text item delimiters to "*"
set _temp to text item -2 of (last item of pageList)
set AppleScript's text item delimiters to return
set pageList to pageList as text
set AppleScript's text item delimiters to _temp
set myPage to first text item of pageList & _temp & "*"
set myPage to paragraphs 1 thru -2 of myPage

The bigger AppleScript lists get the more risk you have of a stack overflow.

But let’s say you can easily convert between list and text.

I’d take an 800 line bite of the main text and use the basic process above (or something similar) to break out the page change point.

Rinse and repeat.
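
The "bite" idea can also be sketched in vanilla AppleScript. This is only an illustration, not the script above: the handler name underRunCount is hypothetical, and it assumes each line ends with a bare 9-digit code (no asterisks). It returns how many lines of the bite to keep so that the bite stops just before the barcode run its last line belongs to.

```applescript
-- Hypothetical helper, not from the thread: given one "bite" of lines, return
-- how many leading lines to keep so the bite ends on a complete barcode run.
-- Assumes every line ends with a 9-digit code and nothing after it.
on underRunCount(biteLines)
	set lastCode to text -9 thru -1 of (item -1 of biteLines)
	set keepCount to (count biteLines)
	-- Walk backwards past every line that shares the last line's code.
	repeat while (keepCount > 0) and ((item keepCount of biteLines) ends with lastCode)
		set keepCount to keepCount - 1
	end repeat
	return keepCount
end underRunCount

-- A bite of 6 lines standing in for 800.
set bite to {"Page 797 090001331", "Page 798 090001331", "Page 799 090001501", ¬
	"Page 800 090001501", "Page 801 090001501", "Page 802 090001501"}
set keepCount to underRunCount(bite)
```

Here keepCount comes out as 2, so the first chunk would be lines 1 and 2 and the next bite would start at line 3. A result of 0 would mean the whole bite carries a single code, in which case a larger bite is needed.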

There are other ways of course, but you should tell us exactly how many hundreds of pages (thousands of lines) so we can properly test.

-Chris



Hi Chris,
The “real” document would have 20,000 lines of the above content, each line as a paragraph.
It starts as a 20,000 page supplied PDF. I converted the PDF to TEXT and GREPped the list of page numbers with their paired code. Each page has a barcode in the same place on the page, like chapters. The barcode changes randomly to a different code thru the big PDF. So the text file is just a table of contents telling me where the page ranges change. I will use this info to script the actual PDF file.

My goal is to break the 20,000 page PDF into separate PDFs of 800 pages each (more or less…the important thing is, the new PDFs have to have their code series start and stop in the document…not continuing across a pair of documents).
I need to break up the original PDF because 20,000 pages is too much for AppleScript to process later on in a timely manner at one time. It goes on for days; I tried it.

To your reply, I would not know where the 800-increment selection arrives in the text list, so I do not think delimiting the section of different codes would work. All I know is Page 804 and Page 805 have to be in different PDF files split off from the big PDF 20K page file. Thanks!!

Hey Bob,

Personally I’d rather do something like this using the Satimage.osax.


-------------------------------------------------------------------------------------------
# Requires the Satimage.osax be installed { http://tinyurl.com/dc3soh }
-------------------------------------------------------------------------------------------
set _text to bbeditFrontWinText()
set my800Lines to fnd("([^\\r]+\\r){800}", _text, false, true) of me
set myNumber to fndUsing("(\\*\\d{9}\\*)\\r\\Z", "\\1", my800Lines, false, true) of me
set myNumber to cng("\\*", "\\\\*", myNumber) of me
set myNewLines to fndUsing("(?m)(\\A.+?)[^\\r]+" & myNumber, "\\1", my800Lines, false, true) of me

-------------------------------------------------------------------------------------------
--» HANDLERS
-------------------------------------------------------------------------------------------
on cng(_find, _replace, _data)
	change _find into _replace in _data with regexp without case sensitive
end cng
-------------------------------------------------------------------------------------------
on fnd(_find, _data, _all, strRslt)
	try
		find text _find in _data all occurrences _all string result strRslt with regexp without case sensitive
	on error
		return false
	end try
end fnd
-------------------------------------------------------------------------------------------
on fndUsing(_find, _capture, _data, _all, strRslt)
	try
		set findResult to find text _find in _data using _capture all occurrences _all ¬
			string result strRslt with regexp without case sensitive
	on error
		false
	end try
end fndUsing
-------------------------------------------------------------------------------------------
on bbeditFrontWinText()
	tell application "BBEdit"
		tell front document to its text
	end tell
end bbeditFrontWinText
-------------------------------------------------------------------------------------------

I’d probably run this on a file instead of a BBEdit document, but again this depends upon how large the overall text is.

The next step would be to strip off the found lines (one more line of code) and then to loop through again.

You should provide us with a little more detail of your overall process, so we don’t have to guess and can better test.

-Chris

Hi.

This vanilla approach seems quite fast:

-- This assumes that each paragraph in the text ends with a reference consisting of eleven characters (say nine digits and two asterisks) with no trailing spaces. No special arrangements have been made for line endings at the end of the text, but they're easily added.

on separateText(txt)
	script o
		property separationList : {}
	end script
	
	set paraCount to (count txt's paragraphs)
	set i to 1
	repeat with j from 800 to paraCount by 800
		set refNo to text -11 thru -1 of paragraph j of txt
		repeat while ((j < paraCount) and (paragraph (j + 1) of txt ends with refNo))
			set j to j + 1
		end repeat
		set end of o's separationList to text from paragraph i to paragraph j of txt
		set i to j + 1
	end repeat
	if (i ≤ paraCount) then set end of o's separationList to text from paragraph i to end of txt
	
	return (o's separationList)
end separateText

local txt
set txt to (read (choose file)) -- For test purposes

separateText(txt)

Hey Bob,

I believe that is what I’ve done in the script, although I went for an under-run and Nigel went for an overrun.

The script takes an 800 line bite and then finds the last code number. It then truncates that code number off at the split where it deviates from the previous one.

The idea was for you to take that and run with it.

Nigel has done me one better and provided a complete script, and it is fast indeed, running on my system with a 20K line file in ~ 0.3 seconds.

I don’t understand what you mean by this. “All you know…”

If Nigel’s script doesn’t do what you want it to do then we plainly aren’t understanding what you really want.

-Chris

Thank you, Nigel, that seems the direction I need to go.

Chris, if Adobe Acrobat Pro had more scripting options, I would not even need to have this step. I could just open the 20,000 page PDF and use the Acrobat Split option at 800 page increments to make new 800 page documents from the original. The reason I cannot do that is, like I said, I need the page breaks near the 800 page mark to end with a series of the bar codes. In my list example, you see that tho the page numbers change, the barcodes stay the same for a certain amount of lines (this changes in the big doc also; some codes run for 6 lines, some for 12, some for 14, some for 8. The script does not know what it will find in this regard.).

Page 799 090001501
Page 800 090001501
Page 801 090001501
Page 802 090001501
Page 803 090001501
Page 804 090001501
Page 805 090001856

I could use Acrobat to split the 20000 page PDF doc into the first 800 page section. However, the bar code 090001501 at page 800 does not change until Page 805, where it becomes 090001856. So the 800 page PDF file is not what I need for the later step of breaking it out further into page levels. This is the need for my text file. My actual text file has 20000 lines. It is a key-pair list of the PDF page numbers and the respective 11-digit code that falls on that page. My script will break out the PDF into new PDFs depending on the page counts associated with the barcodes, but that is not my question here; that part is done.

I am trying to find an efficient way for my script to hit every 800 lines in the text list and at that point determine where the bar code series changes in the vicinity of lines 800, 1600, 2400, 3200, 4000, 4800, 5600, 6400, 7200, 8000, 8800, 9600, 10400, 11200, 12000, 12800, 13600, 14400, 15200, 16000, 16800, 17600, 18400, 19200, and 20000 in the text file. From there I will know exactly where the 11-digit codes change for each increment. Obviously, the bar code will not change exactly at line 14400. So I need the script to read forward from that point and find, for example, that the bar code with line 14400 will actually change at line 14408. Simple, right? Then the script goes to the next increment, 15200 and does the same thing. Maybe that 11-digit code change is at 15203. From that I get a list of the exact page breaks I want to split out the 20000 PDF into 800 page pdfs. I do not want to do all this manually; hence the need to script it.
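
The "read forward from each increment" step described above might look roughly like this. It is a sketch under assumptions: the handler name lastLineOfRun and the variable names are made up for illustration, each line is assumed to end with a bare 9-digit code, and a step of 6 stands in for 800 so the sample list can be used.

```applescript
-- Hypothetical helper: starting at startLine, read forward and return the
-- number of the last line that carries the same trailing 9-digit code.
on lastLineOfRun(codeLines, startLine)
	set lineCount to (count codeLines)
	set currentCode to text -9 thru -1 of (item startLine of codeLines)
	set j to startLine
	repeat while (j < lineCount) and ((item (j + 1) of codeLines) ends with currentCode)
		set j to j + 1
	end repeat
	return j
end lastLineOfRun

set codeLines to {"Page 797 090001331", "Page 798 090001331", "Page 799 090001501", ¬
	"Page 800 090001501", "Page 801 090001501", "Page 802 090001501", ¬
	"Page 803 090001501", "Page 804 090001501", "Page 805 090001856", ¬
	"Page 806 090001856", "Page 807 090001856", "Page 808 090001856", ¬
	"Page 809 090001856", "Page 810 090001856", "Page 811 100000099", ¬
	"Page 812 100000099", "Page 813 100000099", "Page 814 100000099"}

set stepSize to 6 -- stands in for 800
set breakLines to {}
repeat with checkLine from stepSize to (count codeLines) by stepSize
	set end of breakLines to lastLineOfRun(codeLines, checkLine)
end repeat
breakLines --> {8, 14, 18}
```

In this sample, line 8 is Page 804, line 14 is Page 810, and line 18 is the end of the list. With the real file, codeLines could be the paragraphs of the text file and stepSize would be 800.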

This neatly demonstrates why the scripts above are quite fast. Although we’re talking large numbers of paragraphs, at the AppleScript level there are only 25 items to extract.

I didn’t mention the way the repeat variable j is used (or perhaps misused) in my script, which may cause confusion if you’re not familiar with AppleScript. At the top of the repeat which begins …

repeat with j from 800 to paraCount by 800

… j is always set to the next value in the series 800, 1600, 2400, …, regardless of any value to which it may be changed within the repeat. So if j is incremented to, say, 804 during the first iteration to index the relevant end paragraph, it’ll be 1600 at the top of the next iteration, not 1604.
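
A tiny stand-alone test shows the same behaviour. The numbers are scaled down and the variable name seen is just for illustration:

```applescript
set seen to {}
repeat with j from 4 to 12 by 4
	set j to j + 1 -- change j inside the loop body
	set end of seen to j
end repeat
seen --> {5, 9, 13}
```

Each pass starts from 4, 8, and 12 in turn; the increment made inside the body does not carry over to the next pass.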

Hey Bob,

Okay, we do understand what you’re trying to do.

My script does exactly that on a small scale, EXCEPT that I’m breaking at the bar-code shift before the last line of the bite of data. I’m also demonstrating on your 18 line sample instead of extrapolating to 800 lines.

I didn’t show you on a full scale or write the whole script for you as Nigel did.

Nigel also correctly inferred that you wanted to step OVER the 800th line to the next barcode break if warranted.

It appears to me that Nigel’s script does exactly what you want.

Yes/No?

-Chris

I appreciate how your method increases speed by avoiding internal work, but it could potentially allow duplicates by not implementing either a span or backtracking. I tested the sample list at a scale of 4, rather than 800. In this case, problems showed up at page 810.

{“Page 797 090001331
Page 798 090001331”, “Page 799 090001501
Page 800 090001501
Page 801 090001501
Page 802 090001501
Page 803 090001501
Page 804 090001501”, “Page 805 090001856
Page 806 090001856
Page 807 090001856
Page 808 090001856
Page 809 090001856
Page 810 090001856”, “Page 810 090001856
Page 811 100000099”, “Page 811 100000099
Page 812 100000099
Page 813 100000099
Page 814 100000099”}
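
One possible way to avoid the overlap at small scales is to skip any checkpoint that an earlier, extended section has already passed. The following is a sketch of that idea as a variant of the separateText handler posted above. The name separateTextGuarded and the stepSize parameter are hypothetical, and it assumes each paragraph ends with a bare 9-digit code rather than the eleven characters the original handler reads.

```applescript
-- Variant of separateText with one added check (hypothetical name and
-- stepSize parameter). Assumes each paragraph ends with a bare 9-digit code.
on separateTextGuarded(txt, stepSize)
	set separationList to {}
	set paraCount to (count txt's paragraphs)
	set i to 1 -- first paragraph of the section being built
	repeat with j from stepSize to paraCount by stepSize
		-- Skip a checkpoint that an earlier, extended section already covers.
		if j ≥ i then
			set refNo to text -9 thru -1 of paragraph j of txt
			repeat while ((j < paraCount) and (paragraph (j + 1) of txt ends with refNo))
				set j to j + 1
			end repeat
			set end of separationList to text from paragraph i to paragraph j of txt
			set i to j + 1
		end if
	end repeat
	if (i ≤ paraCount) then set end of separationList to text from paragraph i to end of txt
	return separationList
end separateTextGuarded

set sampleText to "Page 797 090001331
Page 798 090001331
Page 799 090001501
Page 800 090001501
Page 801 090001501
Page 802 090001501
Page 803 090001501
Page 804 090001501
Page 805 090001856
Page 806 090001856"
set chunkList to separateTextGuarded(sampleText, 4)
```

With a step of 4, the first section runs on from line 4 to line 8 (Page 804). The checkpoint at line 8 is then skipped, and the remainder (Pages 805 and 806) becomes the second section, so no page appears twice. At a step of 800 with barcode runs of only a few lines, the extra check should never trigger.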

Thanks for all your inputs. Here is the solution:

Sample of the list of codes:

Page 799 090001501
Page 800 090001501
Page 801 090001501
Page 802 090001501
Page 803 090001501
Page 804 090001501
Page 805 090001856
Page 806 090001856
Page 807 090001856
Page 808 090001856

Here is the relevant part of the script that tells me that the code changes after Page 804:

-- for testing
set myline to 800

set linematch to true

-- mybarcode is assumed to hold one barcode per paragraph (no page numbers),
-- so two equal paragraphs mean the code has not changed yet.
repeat until linematch is false
	if paragraph myline of mybarcode is equal to paragraph (myline + 1) of mybarcode then
		display dialog "Same lines"
		set myline to myline + 1
	else
		display dialog "Different"
		display dialog (myline as text)
		set linematch to false
	end if
end repeat

This one requires the Satimage.osax, although I’ve taken a different approach than my first SIO-based script.

It runs in ~ 0.5 seconds on my system with a 20K line test file.

-Chris

-----------------------------------------------------------------------
# Requires the Satimage.osax be installed { http://tinyurl.com/dc3soh }
-----------------------------------------------------------------------

set sepList to {}
set {oldTIDS, AppleScript's text item delimiters} to {AppleScript's text item delimiters, "*"}
set barcodeText to readtext "~/Downloads/TestBarCode.txt"

# Remove any leading or trailing "vertical" whitespace.
set barcodeText to cng("\\A\\s+|\\s+\\Z", "", barcodeText) of me

repeat
	try
		try
			set barCode to "\\*" & (text item -2 of paragraph 800 of barcodeText) & "\\*"
		on error
			set barCode to "\\*" & (text item -2 of paragraph -1 of barcodeText) & "\\*"
		end try
		
		set findReco to fnd("(?m)\\A.+" & barCode & "$", barcodeText, false, false) of me
		
		if findReco ≠ false then
			set end of sepList to matchResult of findReco
			set barcodeText to text ((matchLen of findReco) + 2) thru -1 of barcodeText
		else
			exit repeat -- no match: stop rather than loop forever on unchanged text
		end if
		
	on error
		exit repeat
	end try
end repeat

set AppleScript's text item delimiters to oldTIDS

sepList

-----------------------------------------------------------------------
--» HANDLERS
-----------------------------------------------------------------------
on cng(_find, _replace, _data)
	change _find into _replace in _data with regexp without case sensitive
end cng
-----------------------------------------------------------------------
on fnd(_find, _data, _all, strRslt)
	try
		find text _find in _data all occurrences _all string result strRslt with regexp without case sensitive
	on error
		return false
	end try
end fnd
-----------------------------------------------------------------------