I’ve got a list of URLs (900+). These are catalog pages of items. I’ve got Safari already set up to download each of these to local HTML files for further processing.
Each page contains this HTML in the source:
Master Category > Sub Category
Which basically displays as
Master Category > Sub Category
Using grep (I think), I’m trying to retrieve this text in order to:
- Insert MasterCategory > SubCategory into the page title, replacing what’s there
- Rename the file to “MasterCategory_SubCategory.html”
Grep is a newfangled thing to me. Any assistance is muy appreciated.
Model: pb tibook 667
AppleScript: 1.9.3
Browser: Safari 312.3.1
Operating System: Mac OS X (10.3.9)
Update: I’ve found a pattern that will find my target string:
([^<]) > ([^<])
But I’m stumped on how to:
- Leave this line in place
- Strip out the HTML and generate a regular string
- Put that string between the Title tags
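For the "strip out the HTML" step, one possible approach is to let sed delete anything that looks like a tag. This is only a sketch: the sample line is hypothetical, modelled on the markup that the grep command later in this thread matches, and the variable names line and plain are made up for illustration.

```shell
# Hypothetical sample line; the real pages may wrap this in more markup.
line='<font color="#0000ff">Master Category</font></a></nobr> > <nobr><strong>Sub Category</strong></nobr>'

# Delete every <...> run, leaving only the visible text.
# The bare " > " separator survives because it is not preceded by "<".
plain=$(printf '%s\n' "$line" | sed 's/<[^>]*>//g')

echo "$plain"
# prints: Master Category > Sub Category
```

The same sed expression can be piped after grep -o, so no character counting is needed.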
This works on a single file:
-- Pick the HTML file and shell-quote its POSIX path
choose file without invisibles
set theFile to quoted form of POSIX path of result
-- Split text on "<" for now; the old delimiters are restored at the end
set ASTID to AppleScript's text item delimiters
set AppleScript's text item delimiters to {"<"}
try
	-- Grab the category markup; colrm drops the leading 22-character <font ...> tag
	do shell script "grep -o '<font color=\"#0000ff\">[^<]*</font></a></nobr> > <nobr><strong>[^<]*</strong></nobr>' " & theFile & " | colrm 1 22"
	get every text item of result
	-- Item 1 is the master category; item 6 is "strong>" followed by the sub category
	get first item of result & " > " & (text 8 thru -1 of (sixth item of result))
	-- Write the plain string between the <title> tags, editing the file in place
	do shell script "perl -p -i -e " & quoted form of ("s~<title>[^<]*</title>~<title>" & result & "</title>~") & " " & theFile
on error errorMsg number errorNum
	display dialog "Error (" & errorNum & "):" & return & return & errorMsg buttons "Cancel" default button 1 with icon caution
end try
set AppleScript's text item delimiters to ASTID
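This assumes that “Master Category” and “Sub Category” do not contain a “<” character, since both the grep pattern and the text item delimiters rely on it. A “~” in either name would likewise break the perl substitution.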