HTML Find pattern, use part of match to add new DIV ID tag

I would like to do the following:

Select a local folder that contains dozens of subfolders, each of which has several .html files. In each .html file, I need to find every instance of the <page page-num="..."> tag. Each instance of page-num will have a unique number, for example <page page-num="37.1">. This number (37.1 in this example) then needs to be inserted into the <div> tag directly below as an id.

So the existing section is:

It needs to read:

Each .html file may have from a few to several dozen instances of <page page-num="..."> with a corresponding <div> below it. Thankfully, the number (37.1) is unique in each page-num instance for that .html file. (But the number will repeat in the other .html files.)

I have been able to find the page-num pattern successfully using grep in BBEdit and can generate a file with the file location/name, line number and page-num instance but am at a loss for where to go from here. A line from the grep search results file looks like this:

/Volumes/Pookie/Issues/172/imageGallery.html:640:

I have looked at several examples of find and insert with AppleScript, but they all seem to focus on replacing the pattern found, and/or inserting at the pattern location. I have not been able to find how to take part of the pattern match (37.1) and then insert this info at a new location in the file. I’m not even sure how to phrase this for a google search.

Is this even possible in AppleScript? Should I be looking at another method for doing this? I did some searching and looking at sed and awk but those are much more complicated to me than AppleScript is (I am very much a pre-beginner in AppleScript). I would appreciate any help or ideas from the experts here about how to go about performing this search and insert.

Thank you for your time! 🙂

This is one of those issues that could be handled in myriad ways, so I’m going to stick with bbedit but otherwise just focus on your specific example. Hopefully this will give you some ideas as to how to proceed.

First, bbedit is extra nice because unlike pretty much any other app, it is recordable. This means that you can open a document in bbedit, create a new script in script editor, click record, and then switch back to bbedit and most of what you do will generate code in the new script. This is especially useful when it comes to learning how to script searches.

Try this: Make a bbedit document that has your provided text as its entire content and save it as ‘37.html’, and then begin recording a script as outlined above. Then switch back to bbedit and select the text of the first row (but not the line ending) and then type command-e, which will make the selection into the Find string. Then type command-f to open the find dialog and you should see some variation on your selection in the find box. Check the ‘grep’ box and hit return (or click Next) — this will ‘find’ the selected text. Then switch back to the script and stop the recording. Your script should look something like this:

tell application "BBEdit"
	find "<page page-num=\"37\\.1\">" searching in text 1 of text document "37.html" options {search mode:grep, wrap around:true} with selecting match
end tell

The options can be looked up in the bbedit dictionary but generally they correspond to the options in the find dialogue. Make sure that your script line includes ‘with selecting match’. Then, change each of the three digits in the script to ‘\\d’ and wrap the number in parentheses, so that the find string in the script looks like ‘<page page-num=\"(\\d\\d\\.\\d)\">’. NB in a grep search, at least in bbedit, ‘\d’ represents a single digit. Inside an AppleScript string, every backslash (and every double quote) needs an extra ‘\’ in front of it, which is why the script shows ‘\\d’ where the find dialogue shows ‘\d’. So basically, it will search for your ‘XX.X’ pattern.
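To see the escaping rule in isolation, here is a minimal sketch in plain AppleScript (no BBEdit required). The variable name is only illustrative.

```applescript
-- Typed in the script with doubled backslashes...
set escapedInScript to "\\d\\d\\.\\d"

-- ...but AppleScript stores the unescaped text \d\d\.\d, which is what
-- the grep engine receives: four two-character tokens, 8 characters in all.
return length of escapedInScript
```

Running this in Script Editor should return 8, not 12, which confirms that the doubled backslashes exist only in the script source.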

If you run the script at this point, you will find that the text of the first line will be selected. You then want to run another search but only within the selected text. This time, set the search string to match just the XX.X pattern. The script line should look like this:

find "\\d\\d\\.\\d" searching in selection options {search mode:grep} with selecting match

If you run your script now, you should end up with just the ‘37.1’ selected. So let’s assign this to a variable:

set ts to the selection

Now that we have the text to insert, we can set up the replacement, which might look something like this:

	set repStr to "<div class=\"imageGalleryPage\" "
	find repStr searching in text 1 of text document 1 options {search mode:grep} with selecting match
	replace repStr searching in selection options {search mode:grep} using repStr & "id=\"" & ts & "\" "

Note that I’ve replaced the document name with its index. Before you run the script, in bbedit move the cursor to the very beginning of the document. What this does is look for the beginning of the div tag and find and select the first attribute. It then replaces all that with itself and the id attribute.

Now, there are various issues like, what happens to the second XX.X? If you have multiple text blocks like the one provided, and the cursor is set to the beginning, then each time you run the script, it should insert the id attribute for one instance. You’ll have to figure out the rest, or at least provide more information. Hope this at least gets you going.

Thank you VERY much! This has helped immensely. I was able to successfully run the script; each run goes to the next instance of the appropriate <div> tag and inserts the correct id tag.

I did make a copy of the script and modified it to search for page-num without decimal points, and another for a single digit page-num. I did run into a strange issue. When running these versions, the find always works. But the insert will skip those <div> tags that only have the class attribute (no “style” attribute). This occurs for both the single and double digit versions.

For example, a <div> tag with only the class attribute (here, the one below page-num=“5”) will not have the appropriate id tag inserted; it is skipped. Instead, id=“5” gets inserted on the next <div> tag (page-num=“6”), which has a style attribute in the <div> tag.

Page-num 6

originally looks like this:

And after the wrong id tag is inserted:

If I keep running the script, it will run correctly until the next time a <div> tag with only a class attribute is encountered.

Here is the modified script I am running for the one digit find:

tell application "BBEdit"
	find "<page page-num=\"(\\d)\">" searching in text 1 of text document "1.ArchivalView.html" options {search mode:grep, wrap around:true} with selecting match
	find "\\d" searching in selection options {search mode:grep} with selecting match
	set ts to the selection
	set repStr to "<div class=\"imageGalleryPage\" "
	find repStr searching in text 1 of text document 1 options {search mode:grep} with selecting match
	replace repStr searching in selection options {search mode:grep} using repStr & "id=\"" & ts & "\" "
end tell

And for the two digit find:

tell application "BBEdit"
	find "<page page-num=\"(\\d\\d)\">" searching in text 1 of text document "1.ArchivalView.html" options {search mode:grep, wrap around:true} with selecting match
	find "\\d\\d" searching in selection options {search mode:grep} with selecting match
	set ts to the selection
	set repStr to "<div class=\"imageGalleryPage\" "
	find repStr searching in text 1 of text document 1 options {search mode:grep} with selecting match
	replace repStr searching in selection options {search mode:grep} using repStr & "id=\"" & ts & "\" "
end tell

I’m sure I’m missing something somewhere, but I have not been able to figure out the solution.

I also wanted to ask:

  1. Is there a way to have the script specify the open document name as the one to search through (i.e., [searching in text 1 of ‘open document’] instead of [searching in text 1 of text document “37.html”]), or will I need to manually update the file name each time it is run for a new document? This is not a deal breaker but I thought I’d ask about the options. I did a lot of searching on the site but the things I found didn’t seem to fit the scenario or I couldn’t get them to function in the script.

  2. Is there a way to repeat the search/insert so that it finds all of the page-num instances in a document? I’d assume it is more complicated as the script would have to know if a tag has already been inserted and stop running itself. Again, it isn’t a big issue. I can always use command R to run the script through each document, as that is still much faster than having to insert all the id tags by hand.

If you need to see the original working script for some reason, it is here:

tell application "BBEdit"
	find "<page page-num=\"(\\d\\d\\.\\d)\">" searching in text 1 of text document "37.html" options {search mode:grep, wrap around:true} with selecting match
	find "\\d\\d.\\d" searching in selection options {search mode:grep} with selecting match
	set ts to the selection
	set repStr to "<div class=\"imageGalleryPage\" "
	find repStr searching in text 1 of text document 1 options {search mode:grep} with selecting match
	replace repStr searching in selection options {search mode:grep} using repStr & "id=\"" & ts & "\" "
end tell

Thank you so much for your help and guidance. It will save me hours and hours of tedium!

I’m glad that you’re getting some progress. As to your questions… in general, there are multiple ways of doing things. This applies to identifying which document to process as well as what tags and attributes to search for. The issue that you need to always be aware of when using regex to parse html is the possibility of false positives. It is very easy to construct a search that finds more matches than you hoped it would.

Anyway, regarding the document reference — open documents are commonly referenced by name, index or ID. All open documents are open, which is why your syntax didn’t help. In bbedit, ‘text document 1’ refers to the frontmost document. You can always get the name of that document with this command:

	set fn to name of text document 1

You can then use something like:

	set fn to name of text document 1
	find "<page page-num=\"\\d" searching in text 1 of text document fn options {search mode:grep, starting at top:true}

You can use regex to work on variable length strings, meaning 1-digit or 2-digit, and with or without an extension. This means that you can have a single search for various different strings. For example:

\d\d?\.?\d? (or from within the script: \\d\\d?\\.?\\d?)

The ‘?’ specifies that there is 0 or 1 instance of the preceding character, so this would find any of the following examples: 37.1, 37., 37, 3. Note that if you had ‘37.12’, it would only find ‘37.1’.
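The behaviour of ‘?’ can be checked outside bbedit as well. The following is a rough sketch using AppleScriptObjC and NSRegularExpression, on the assumption that Foundation's regex engine treats this simple pattern the same way BBEdit's grep does. The handler name firstMatch is hypothetical, not part of any library.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Hypothetical helper: returns the first match of thePattern in sampleText, or "" if none.
on firstMatch(thePattern, sampleText)
	set nsText to current application's NSString's stringWithString:sampleText
	set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
	set theMatch to theRegex's firstMatchInString:nsText options:0 range:{0, nsText's |length|()}
	if theMatch is missing value then return ""
	-- Index 0 is the whole match (capture groups would start at index 1).
	return (nsText's substringWithRange:(theMatch's rangeAtIndex:0)) as text
end firstMatch

-- Try the pattern against each of the sample strings discussed above.
set foundList to {}
repeat with sampleRef in {"37.1", "37.", "37", "3", "37.12"}
	set end of foundList to my firstMatch("\\d\\d?\\.?\\d?", contents of sampleRef)
end repeat
return foundList
```

The result should be {"37.1", "37.", "37", "3", "37.1"}; the last item shows the pattern stopping after one decimal digit.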

As an aside, BBEdit’s Help > Grep Reference is a worthwhile read (and well-written). It can also be helpful to just play around with the find dialogue while grep is checked. It’s a well-designed dialogue.

As to not inserting the ID attribute… the issue is that your tags are inconsistent and as such, your search queries require tweaking. Note that in your original post, the <div> had multiple attributes and thus, the ‘class’ attribute is followed by a space; this trailing space is included in the search. However, in your new example, the <div> has only a single attribute (class), which is followed by a ‘>’; this causes the search to consider it non-matching. So try removing the trailing space at the end of ‘repStr’, and add a leading space to the inserted text (" id=…" rather than "id=… ") so the attributes stay separated. Again, this is the risk of using regex on html: inconsistency equals complexity.
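The effect of the trailing space can be reproduced with plain string tests. This is only a sketch; the two sample tags are hypothetical stand-ins for the real markup, and the style value is a placeholder.

```applescript
set oldFind to "<div class=\"imageGalleryPage\" " -- with trailing space
set newFind to "<div class=\"imageGalleryPage\"" -- trailing space removed

-- Hypothetical sample tags: one with two attributes, one with class only.
set multiAttr to "<div class=\"imageGalleryPage\" style=\"...\">"
set singleAttr to "<div class=\"imageGalleryPage\">"

-- Which search strings occur in which tags?
return {multiAttr contains oldFind, singleAttr contains oldFind, multiAttr contains newFind, singleAttr contains newFind}
```

This returns {true, false, true, true}: only the version without the trailing space matches both kinds of tag.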

NB this is a great example of the value of just playing around with the find dialogue. Construct a find string and keep tapping return and watching whether it selects all the correct matches. When it misses one, tweak the find string and try again.

Finally, as to repetition… yes, it is certainly an option. It would probably be wise to nail down the searching first, to ensure that it correctly matches every necessary string before expanding the scope. Once you have that settled, then you could set up a repeat loop and go through the document. Here is an example that might work:

tell application "BBEdit"
	set fn to name of text document 1
	set bt to text 1 of text document fn
	-- cause first find to begin at beginning of document
	select first insertion point of text 1 of text document fn
	
	-- get count of 'page' tags
	set ppn to "<page page-num=\""
	set AppleScript's text item delimiters to ppn
	set dl to (length of text items of bt) - 1
	-- restore the default delimiters so later text operations are unaffected
	set AppleScript's text item delimiters to ""
	
	repeat dl times
		find "<page page-num=\"\\d\\d?\\.?\\d?" searching in text 1 of text document fn options {search mode:grep, starting at top:false} with selecting match
	end repeat
	
end tell

Hopefully I didn’t miss anything.
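The tag-counting trick used above can also be written as a small standalone handler. This is a sketch in plain AppleScript; the handler name countOccurrences and the sample text are made up for illustration.

```applescript
-- Hypothetical helper: counts how many times needle occurs in haystack
-- by splitting on it; N occurrences produce N + 1 text items.
on countOccurrences(needle, haystack)
	set savedDelims to AppleScript's text item delimiters
	set AppleScript's text item delimiters to needle
	set pieceCount to count of text items of haystack
	-- Put the delimiters back so other code is not affected.
	set AppleScript's text item delimiters to savedDelims
	return pieceCount - 1
end countOccurrences

-- Made-up sample containing two page tags.
set sampleText to "<page page-num=\"5\"> ... <page page-num=\"6\">"
return countOccurrences("<page page-num=\"", sampleText)
```

With this sample the handler returns 2. Saving and restoring the delimiters is a defensive habit rather than something the script above strictly requires.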

Wow! Thank you again for all of your help. I have taken your advice and guidance and have come up with this:

tell application "BBEdit"
	set fn to name of text document 1
	set bt to text 1 of text document fn
	-- cause first find to begin at beginning of document
	select first insertion point of text 1 of text document fn
	
	-- get count of 'page' tags
	set ppn to "<page page-num=\""
	set AppleScript's text item delimiters to ppn
	set dl to (length of text items of bt) - 1
	
	repeat dl times
		find "<page page-num=\"(\\d\\d?\\d?\\.?\\d?)\">" searching in text 1 of text document fn options {search mode:grep, wrap around:true} with selecting match
		find "\\d\\d?\\d?\\.?\\d?" searching in selection options {search mode:grep} with selecting match
		set ts to the selection
		set repStr to "<div class=\"imageGalleryPage\""
		find repStr searching in text 1 of text document 1 options {search mode:grep} with selecting match
		replace repStr searching in selection options {search mode:grep} using repStr & " id=\"" & ts & "\""
		
	end repeat
	
end tell

I made a few small changes (it supports triple digits, and instead of starting at top:false for the repeat I found wrap around:true seemed to give the best results). I made sure to test, test, test on different variations of the numerals contained in the page-num tag. It also now finds the <div> tags with single or multiple attributes. It is working wonderfully.

Thank you for the suggestion about BBEdit’s grep reference materials. It looks quite good and extensive. I will be delving into that more to get a better understanding of the available options.

Thank you again for the help, I sincerely appreciate it. 🙂

It looks good. Glad I could help.

One last thing you can consider. It looks like it adds the ID attribute each time the script is run, regardless of whether one (or more) exists already. It is possible to check whether a tag has an ID attribute and only add the ID when it doesn’t already exist.

Try replacing the last two lines inside the loop with the following:

		set attrExists to find repStr & " id=\"" & ts & "\"" searching in text 1 of text document fn options {search mode:grep}
		
		if attrExists is {found:false} then
			find repStr searching in text 1 of text document 1 options {search mode:grep} with selecting match
			replace repStr searching in selection options {search mode:grep} using repStr & " id=\"" & ts & "\""
		end if

What this does is check to see whether the ID attribute exists immediately after repStr. If it does, then it continues on through the loop without making any changes. If not, then it inserts the ID attribute.

Also, I just noticed that the ‘set repStr’ line is inside the repeat loop but there isn’t any need for it there as that value never changes. I would move it to some place above the repeat.

And as a final suggestion to consider… bbedit offers a ‘Process Lines Containing…’ command (under Text). You can run it on your file, using one of the search strings and have it copy every matching line to a new document. You can then quickly inspect that new document to confirm that every line looks the way it should. Alternatively, copy every line that has the ‘<page’ tag but doesn’t have the ID attribute — presumably there wouldn’t be any but if there were, then you’d know something was amiss.

Look into using NSRegularExpression with a capture group.
You can search the HTML and find the “pattern” you’re looking for.

Have a look at:
https://www.regular-expressions.info/examples.html

Also check out Shane’s RegEx library to help simplify things

https://forum.latenightsw.com/t/regexandstufflib-script-library/2018
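As a rough illustration of the capture-group idea, here is a sketch that uses NSRegularExpression directly through AppleScriptObjC, without the library. It works on a made-up two-line sample and assumes the <div> follows the <page> tag with only whitespace in between; the real files may well differ.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Hypothetical sample markup (the style value is a placeholder).
set sampleHTML to "<page page-num=\"37.1\">" & linefeed & "<div class=\"imageGalleryPage\" style=\"...\">"

-- Group 1: the page tag through the div's class attribute.
-- Group 2: the page number itself.
-- (?! id=) skips divs that already carry an id, so a second run changes nothing.
set thePattern to "(<page page-num=\"(\\d+(?:\\.\\d+)?)\">\\s*<div class=\"imageGalleryPage\")(?! id=)"

set nsSource to current application's NSString's stringWithString:sampleHTML
set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)

-- In the template, $1 and $2 stand for the two capture groups.
set nsResult to theRegex's stringByReplacingMatchesInString:nsSource options:0 range:{0, nsSource's |length|()} withTemplate:"$1 id=\"$2\""
return nsResult as text
```

On the sample, the second line should come back as <div class="imageGalleryPage" id="37.1" style="...">. Reading and writing the actual files is left out here; this only shows the match-and-insert step on a string.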

Thanks for the feedback. I have tested the new code addition and moved the ‘set repStr’ line above the repeat. I have tested again and it will now only add the ID tag once and will not add it again, even if the script is run more than once.

tell application "BBEdit"
	set fn to name of text document 1
	set bt to text 1 of text document fn
	-- cause first find to begin at beginning of document
	select first insertion point of text 1 of text document fn
	
	-- get count of 'page' tags
	set ppn to "<page page-num=\""
	set AppleScript's text item delimiters to ppn
	set dl to (length of text items of bt) - 1
	set repStr to "<div class=\"imageGalleryPage\""
	
	repeat dl times
		find "<page page-num=\"(\\d\\d?\\d?\\.?\\d?)\">" searching in text 1 of text document fn options {search mode:grep, wrap around:true} with selecting match
		find "\\d\\d?\\d?\\.?\\d?" searching in selection options {search mode:grep} with selecting match
		set ts to the selection
		set attrExists to find repStr & " id=\"" & ts & "\"" searching in text 1 of text document fn options {search mode:grep}
		
		if attrExists is {found:false} then
			find repStr searching in text 1 of text document 1 options {search mode:grep} with selecting match
			replace repStr searching in selection options {search mode:grep} using repStr & " id=\"" & ts & "\""
		end if
	end repeat
	
end tell

That’s a good idea about the ‘Process Lines Containing…’ command to double check that things turn out the way I am aiming for. I appreciate your ideas and code about refining the script. I’m really jazzed that this turned out so well!

technomorph - thank you for the links, I will definitely check them out!

Unfortunately, I can’t replicate what you’re seeing. I just took the text that was in your original post, duplicated it, and changed the second ‘page-num’ to be “38.1”. When I run that latest version against that, it adds the ID attribute if it doesn’t exist and ignores it when it does. So regardless of how many times you run the script, the resulting text has one ID attribute for each page tag.

I should note however that when I ran the last script you posted yesterday, it added the attribute —no matter what— each time the script was run. But I have the impression that when you run it, it works fine. I’m not sure what would be causing a different result but it happens. What counts is which works for you.

So, if that is the case, I’d revert to that script (but still move the ‘set repStr’ line) and go with that. If you like, include some sample text where my last version fails and I’ll see if I can figure out why.

Hi Mockman,

It is adding the ID tag once per page-num instance, just as it should. I realize when I reread my last post that it probably came across as if the script was only adding one ID tag total. This is not the case! The script is working as it should, and as you described when you tested the newest version → (it adds the ID attribute if it doesn’t exist and ignores it when it does. So regardless of how many times you run the script, the resulting text has one ID attribute for each page tag.) Perfect 🙂

I’m sorry for the confusion! Thank you for the amazing help.

That’s great to hear. Glad to be of help.