Need help plucking URLs out of source code

I’m banging my head against the wall here, trying to do something that is apparently really hard. I’ve got a website and I want to find URLs in its pages that follow certain patterns but are somewhat randomized (since they’re part of Magento, which is designed to be as complex and hard to work with as possible). Basically, I need to find URLs that look like this

(the front half or so should always be the same, but the second half starts to randomize)

out of a text file which is the source code for a given website product page.

I know that grep and BBEdit can do this, as I’ve used them a lot in the past, but I can’t figure out the best way to proceed. Ideally I’d like to find the URLs using grep in the Terminal, for a “cleaner” script execution, though my attempts at making this work have been messy so far. (HTML code isn’t made up of clean “lines” as grep in Terminal seems to expect, and I can’t figure out how to get just the URLs that I need.)

Can anyone tell me how to proceed? Is there a better way of searching with grep, or with wildcards, in AppleScript itself? Any magic search plugin you can recommend?
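On the point about grep expecting clean lines: grep's -o flag prints only the matched text, one match per output line, so the layout of the HTML doesn't matter. Here is a minimal sketch of calling it from AppleScript. The handler name grepProductURLs, the file path argument and the "product/cache/" pattern are assumptions for illustration, and the pattern only stops a URL at a double quote, "<" or space.

```applescript
use AppleScript version "2.4"
use scripting additions

-- Hypothetical handler: htmlPath is the POSIX path of a saved page source.
on grepProductURLs(htmlPath)
	-- -o prints each match on its own line; -E enables extended regex syntax.
	set cmd to "grep -oE 'https://[^\"< ]*product/cache/[^\"< ]*' " & quoted form of htmlPath
	try
		return paragraphs of (do shell script cmd)
	on error number 1
		-- grep exits with status 1 when nothing matches.
		return {}
	end try
end grepProductURLs
```

It could be called as grepProductURLs("/path/to/page.html"), where the path is whatever file the page source was saved to. URLs wrapped in single quotes would need a slightly different character class.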

You could try using Data Detectors, which is what the system uses to highlight links in emails and so on.

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

set theURLs to my findURLsIn:theString -- theString: the page source, as text

on findURLsIn:theString
	set theString to current application's NSString's stringWithString:theString
	set theDD to current application's NSDataDetector's dataDetectorWithTypes:(current application's NSTextCheckingTypeLink) |error|:(missing value)
	set theURLs to theDD's matchesInString:theString options:0 range:{0, theString's |length|()}
	set thePred to current application's NSPredicate's predicateWithFormat:"SELF BEGINSWITH 'https'"
	set newArray to (theURLs's valueForKeyPath:"URL.absoluteString")'s filteredArrayUsingPredicate:thePred
	return newArray as list
end findURLsIn:

You could try a handler borrowed from Shane STANLEY’s Everyday AppleScriptObjC, third edition.

use AppleScript version "2.4"
use framework "Foundation"
use scripting additions
on findURLsIn:theString
	set theNSDataDetector to current application's NSDataDetector's dataDetectorWithTypes:(current application's NSTextCheckingTypeLink) |error|:(missing value)
	set theURLsNSArray to theNSDataDetector's matchesInString:theString options:0 range:{location:0, |length|:length of theString}
	return (theURLsNSArray's valueForKeyPath:"URL.absoluteString") as list
end findURLsIn:

set numOriginal to "HT200266"
quoted form of ("" & numOriginal & "?viewlocale=fr_FR")
set theContent to do shell script "curl -Ls -A 'Opera/9.70 (Linux ppc64 ; U; en) Presto/2.2.1' " & result

its findURLsIn:theContent

Yvan KOENIG running El Capitan 10.11.6 in French (VALLAURIS, France) Sunday 31 July 2016 14:53:39

Thanks for the replies, Yvan and Shane. You guys are great.

So I’ve got a good working script that grabs all the URLs from the tab I currently have open, and that’s good. What I’m having trouble with now is how to (quickly) grab the ones that fit a certain pattern. Mainly it’s product URLs that look like this:

They’re random enough that they’re hard to parse, but I think most should follow a few basic patterns. What I’d really like to do is quickly and efficiently (i.e. without manually going through each returned URL) find only those that contain “product/cache” or some other string that identifies the product URLs, and skip all the unrelated ones. I tried something like

set producturls to each item of theurls that contains "product/cache"

but that format is wrong and I can’t seem to find the best way of capturing only string items that contain “X.” Do you have any ideas?

You can just change the predicate:

   set thePred to current application's NSPredicate's predicateWithFormat:"(SELF BEGINSWITH 'https') AND (SELF CONTAINS 'product/cache/')"
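For comparison, the `each item ... that contains` attempt can also be written in plain AppleScript with a repeat loop, since filter clauses like that don't work on ordinary lists of strings. This is only a sketch: the handler name filterURLs and the sample URLs are made up for illustration, and on long lists it will likely be slower than the predicate.

```applescript
-- Hypothetical helper: keep only the items of a list that contain a substring.
on filterURLs(theURLs, theSubstring)
	set matchingURLs to {}
	repeat with thisURL in theURLs
		-- "contents of" dereferences the loop variable to the actual string.
		if (contents of thisURL) contains theSubstring then set end of matchingURLs to contents of thisURL
	end repeat
	return matchingURLs
end filterURLs

-- Made-up sample data; only the first URL contains "product/cache".
set sampleURLs to {"https://example.com/media/catalog/product/cache/1/a.jpg", "https://example.com/about"}
set productURLs to filterURLs(sampleURLs, "product/cache")
```

Note that AppleScript's `contains` ignores case by default, whereas the predicate's CONTAINS is case-sensitive unless written as CONTAINS[c].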

Thank you very much, that works perfectly! And so speedy too!

That’s the easy way, of course. :wink:

One thing I noticed when working on something similar a few weeks ago is that URL data detectors return everything in the text that could possibly be a URL. In the HTML for this thread page alone, there are clickable http links, javascripts with URLs for the adverts, URLs in the HTML headers, mailto links, ppayne’s https URLs, and the abbreviated versions of them shown on screen. They’re all returned by the data detector in the above scripts and most of them are then filtered out again with the predicate.

It turns out that if you need to discover only a small percentage of the URLs in an HTML text, and you know enough about them to be able to find them using a regex, it can be far more efficient to do so, although you’re not likely to notice much difference in practice. :wink:

use AppleScript version "2.4"
use scripting additions
use framework "Foundation"

on findURLs(theHTML, URLRegex)
	set theString to current application's class "NSString"'s stringWithString:theHTML
	set theRegex to current application's class "NSRegularExpression"'s regularExpressionWithPattern:(URLRegex) options:(0) |error|:(missing value)
	set regexMatches to theRegex's matchesInString:(theString) options:(0) range:({location:0, |length|:theString's |length|()})
	set matchRanges to regexMatches's valueForKey:("range")
	set newArray to {}
	repeat with thisRange in matchRanges
		set end of newArray to (theString's substringWithRange:(thisRange)) as text
	end repeat
	return newArray
end findURLs

tell application "Safari" to set HTMLContents to source of front document
-- Regex for complete "https" URLs wrapped in quotes (not included) and containing the sequence "/product/cache/" beginning no later than the fourth folder name in each URL.
set URLRegex to "(?<=\")https://([^/\"< ]++/){1,4}product/cache/[^\"< ]++(?=\")"
set theURLs to findURLs(HTMLContents, URLRegex)

Of course :slight_smile: The NSDataDetector class is just a specialised subclass of NSRegularExpression, after all. But any time I get my regular expression pattern written for me, I consider it a win :slight_smile:


By the way, I think Chris must be taking a break or he’d have suggested this:

tell application "Safari" to set HTMLContents to source of front document
set theURLs to (find text "(?<=\")https://([^/\"< ]++/){1,4}product/cache/[^\"< ]++(?=\")" in HTMLContents with regexp, all occurrences and string result) -- Requires Satimage OSAX.