Scrape the URLs out of a text string

This seems like it should have been done before, but I couldn’t find it anywhere, so I just wrote a solution from scratch. Basically, the subroutine accepts a big ugly block of text containing one or more plain-text URLs and returns a clean list of all the URLs it found. The following script gives an example of how the handler can be used to pop up a list of URLs. Double-clicking one will launch it in your default client for web, mail, chat, FTP, etc.

Important Considerations:

  • The script as written assumes that the URLs you’re looking for begin with a common URI scheme and end with a space. You can correct for this if necessary by changing the values of “endDelim” and “URISchemes”.
  • The script as written returns the URLs grouped first by their URI scheme, and within each scheme in the order that they appear in the text.
set myText to "http://zombo.com This is a bunch of ugly text with URLs littered in it
http://bbs.applescript.net/viewtopic.php?id=13333 yadda yadda
yadda http://www.macscripter.net/ yadda yadda yadda mailto:John@doe.com yadda yadda 
yadda yadda http://www.yahoo.com yadda yadda 
yadda yadda http://apple.com/applescript
ftp://ftp.gimp.org/
aim:JohnDoe"

set theURL to choose from list getURLs(myText)
if theURL is not false then open location (theURL as string)

on getURLs(sourceText)
	set endDelim to " " (* change this value if you're looking for
	 something other than a space at the end of your URLs *)
	set URLList to {}
	set oldDelims to AppleScript's text item delimiters
	set AppleScript's text item delimiters to "
" -- strip carriage returns
	set sourceText to text items of sourceText
	set AppleScript's text item delimiters to endDelim
	set sourceText to sourceText & endDelim as string
	set URISchemes to {"http:", "https:", "file:", "ftp:", "mailto:", "aim:", "telnet:", "news:", "rtsp:", "afp:", "eppc:", "rss:"}
	repeat with delim from 1 to length of URISchemes
		set delim to item delim of URISchemes as string
		set AppleScript's text item delimiters to delim
		set trimText to {}
		repeat with x from 2 to length of sourceText's text items
			set trimText to trimText & (text item x of sourceText) as list
		end repeat
		set AppleScript's text item delimiters to endDelim
		repeat with x from 1 to length of trimText
			set URLList to URLList & (delim & text item 1 of item x of trimText) as list
		end repeat
	end repeat
	set AppleScript's text item delimiters to oldDelims
	return URLList
end getURLs

Feel free to let us all know if you can break this subroutine, or make it more efficient…

A few comments: your script will break if the string is HTML code as opposed to just text with URLs in it. It will also break if you use the common practice of putting angle brackets around URLs (e.g., <http://macscripter.net>). Your script also doesn’t sort the URLs alphabetically or remove duplicates.
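For the angle-bracket case, one possible workaround (just a sketch; the stripAngles handler name is made up here) is to strip “<” and “>” with text item delimiters before handing the text to getURLs:

on stripAngles(someText)
	set oldDelims to AppleScript's text item delimiters
	repeat with bracket in {"<", ">"}
		-- split on the bracket, then rejoin with nothing to delete it
		set AppleScript's text item delimiters to (contents of bracket)
		set someText to text items of someText
		set AppleScript's text item delimiters to ""
		set someText to someText as string
	end repeat
	set AppleScript's text item delimiters to oldDelims
	return someText
end stripAngles

-- usage: getURLs(stripAngles(myText))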

When parsing HTML source, I find it easier to use a shell script built around grep and Perl, and to sort and remove duplicates with sort and uniq respectively:

set myText to "<a href=\"http://zombo.com\">http://zombo.com</a> This is a bunch of ugly text with URLs littered in it
<a href=\"http://bbs.applescript.net/viewtopic.php?id=13333\">http://bbs.applescript.net/viewtopic.php?id=13333</a> yadda yadda
yadda <a href=\"http://www.macscripter.net/\">http://www.macscripter.net/</a> yadda yadda yadda <a href=\"mailto:John@doe.com\">mailto:John@doe.com</a> yadda yadda 
yadda yadda <a href=\"http://www.yahoo.com\">http://www.yahoo.com</a> yadda yadda 
yadda yadda <a href=\"http://apple.com/applescript\">http://apple.com/applescript</a>
<a href=\"ftp://ftp.gimp.org/\">ftp://ftp.gimp.org/</a>
<a href=\"aim:JohnDoe\">aim:JohnDoe</a>"

try
	open location ((choose from list extract_URLs(myText)) as string)
end try

on extract_URLs(the_string)
	try
		return (do shell script "echo " & quoted form of (my snr(the_string, ">", ">" & (ASCII character 10))) & " | grep 'href' | perl -ne 'm/.*href=\"?([^>\"]+)\"?.*/; print $1,\"\\n\";' | sort -f | uniq")'s paragraphs
	on error the_error number the_number
		log the_error & " (" & the_number & ")"
		return {}
	end try
end extract_URLs

on snr(the_string, search_string, replace_string)
	tell (a reference to my text item delimiters)
		set {old_tid, contents} to {contents, search_string}
		set {the_string, contents} to {the_string's text items, replace_string}
		set {the_string, contents} to {the_string as Unicode text, old_tid}
	end tell
	return the_string
end snr

Jon

Having just finished reading Jeffrey Friedl’s “Mastering Regular Expressions”, I think I can confidently say that I will probably never be able to create an instruction such as the do shell script one-liner in Jon’s extract_URLs handler above.

Having said that, I don’t understand why grep is used and then piped to Perl, which I thought was practically built around regular expressions (unless, perhaps, the Mac OS X version of Perl is less than the full nine yards needed for a kilt).
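As it turns out, the grep stage (and the snr() line-splitting) could be folded into the Perl pattern itself by using a global match, so Perl handles both the filtering and the extraction. Here is a rough sketch, with a made-up sample string rather than real page source:

set the_string to "<a href=\"http://zombo.com\">zombo</a> yadda <a href=\"ftp://ftp.gimp.org/\">gimp</a>"
set the_URLs to paragraphs of (do shell script "echo " & quoted form of the_string & ¬
	" | perl -ne 'while (m/href=\"?([^>\"]+)\"?/g) { print \"$1\\n\"; }' | sort -f | uniq")
-- the_URLs --> {"ftp://ftp.gimp.org/", "http://zombo.com"}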

Wow, that is a very cool approach to the problem, thank you for sharing it! :smiley: I’ll still have to use my code in my particular situation, because I’m not processing HTML markup and I’m not chewing through a metric ton of data that requires industrial-strength CLI tools. At any rate, yours is definitely an excellent script to keep in the arsenal; now I just have to mosey down to the big-box bookstore, crack a regex book and a Perl book, and see if I can ever hope to make sense of that line! :wink:

I have this application that sets MP3 and M4A ID tags. I added a web search function to hunt down art, genre, year (to be released as freeware here eventually…). So I went through this sort of thing for a site already. Mind you, what I did is tailored to get exactly what I want; you will have to ponder and alter it.

I made a shell script that accepts a few arguments, and my AppleScript program calls it. For example, say I want a Discogs search for Some+artist. I curl that web page down to the /tmp folder and then:

do shell script "/path/to/websearch.sh -Discogs -search Releases -i /path/to/curled.out.html" -- paths here are placeholders

which will give something like:

/releases/23325 "Some Release"
/releases/4567567 "Another Release"

I could search for Year, Genre, Images, whatever I set my script up for. And it all gets funneled through the one shell script to give me the precise output I want. grep, sed and awk were made for these things. The “do shell script” penalty is minor compared to the ease of doing it on a single line.
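As a rough sketch of that flow (the search URL, the script path, and the file names here are only placeholders standing in for whatever your own setup uses), the AppleScript side might look something like this:

set searchURL to "http://www.discogs.com/search?q=Some+artist" -- hypothetical search URL
set cachedPage to "/tmp/curled.out.html"
do shell script "curl -s -o " & quoted form of cachedPage & " " & quoted form of searchURL
set releaseList to paragraphs of (do shell script ¬
	"/path/to/websearch.sh -Discogs -search Releases -i " & quoted form of cachedPage)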

My shell script looks like this:


[code]Discogs_search()
{
input_file="$2"
first_char=`echo "$input_file" | colrm 2`   # keep only the first character of the path
if [ "$first_char" = "/" ] ; then
song_name=""
else
song_name="$2"
input_file="$3"
fi

case "$1" in
Tracks )
tracklisting_line=`grep -n "Tracklisting:" "$input_file" | awk -F ':' '{ print $1 }'`
end_line=`grep -n "Add Comment" "$input_file" | awk -F ':' '{ print $1 }'`
diff_line=`expr "$end_line" "-" "$tracklisting_line"`
grep "Tracklisting:" -A $diff_line "$input_file" | grep -v "nowrap" | grep -v "nbsp" | grep -v "href" | sed -e 's/<[^>]*>//g' | sed 's/^[ \t]*//' | sed '/^$/d' | sed 's/ ([0-9]*:[0-9]*)//g' | grep -v "Tracklisting:"
;;
Year )
grep "Released:" "$input_file" | sed -e 's/<[^>]*>//g' | awk -F ":" '{ print $2 }'
;;
Genre )
grep "Style:" -A 5 "$input_file" | grep -v Credits | grep -v Style | grep -v "Notes:" | grep -v "Rating:" | sed -e 's/<[^>]*>//g' | sed 's/^[ \t]*//' | sed '/^$/d' | sed 's/,//g'
;;
Images )
grep viewimages "$input_file" | awk -F "<*>" '{ print $3 }' | cut -d '"' -f2
;;
Releases )
start_line=`grep "Releases:" -n "$input_file" | awk -F ':' '{ print $1 }'`
## it may not necessarily be Production for every artist ##
end_line=`grep "Tracks Appear On:" -n "$input_file" | awk -F ':' '{ print $1 }'`
diff_line=`expr "$end_line" "-" "$start_line"`
total_output=`grep "Releases:" -A $diff_line "$input_file" | grep release | awk -F '"' '{ print $2 " "$3 }' | cut -d "<" -f1 | sort -r`

if [ "$song_name" = "" ] ; then
  echo "$total_output"
else
  final_output=`echo "$total_output" | grep "$song_name"`
  if [ "$final_output" = "" ] ; then
    echo "$total_output"
  else
    echo "$final_output"
  fi
fi

;;

Artist )
grep artist "$input_file" | grep -v selected | grep -v nbsp | sed -e 's/<[^>]*>//g' | sed 's/^[ \t]*//'
;;
esac

}
#======================= This is where it really starts ==============#
song=""

args=`getopt 'Discogs' 'Juno' 'cddb' 'song:' 'search:' 'i:' $*`
for i
do
case "$i" in
-song)
song="$3";
shift;;
-search)
search_for="$2";
shift;;
-i)
if [ -z "$song" ] ; then
input_webfile="$3";
else
input_webfile="$4";
fi
shift;;
-Discogs)
search_site="Discogs_search";
shift;;
--)
shift; break;;
esac
done

if [ -z "$song" ] ; then
$search_site $search_for "$input_webfile"
else
$search_site $search_for "$song" "$input_webfile"
fi[/code]
Granted, that’s just my way of doing it; it looks scary and gawd-awful, but it can be broken down into smaller chunks and figured out. I know the data I’m looking for comes in a certain format, and having one unified way of getting it into my program makes the shell script route the way to go for me.

When you want to experiment, copy a line from your source HTML page and type the following in Terminal.app:

echo " [paste your text now, no space after the quote, that was just for show] " | sed -e 's/<[^>]*>//g'

or whatever other command you want to try out. You’ll see how different options affect what you get out. Don’t forget that if you make a shell script, you have to chmod +x it, or else invoke it through the shell explicitly (e.g., sh /path/to/shellscript.sh).
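In AppleScript terms, that one-time chmod step might look like this (the path is a placeholder):

do shell script "chmod +x " & quoted form of "/path/to/websearch.sh"

After that, the script can be called directly by its path, as in the earlier examples; without the execute bit, you can still run it by prefixing the path with sh.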

There’s no question that the Shell is King when it comes to string editing. Just the same, the whole idea of AppleScript is that it’s supposed to be absurd (a natural English language, wtf?), which is why I love it. It lets people like me approach issues from a completely different direction than sane, rational people. That said, I’m still quite interested in RegEx; I’d particularly like for some genius to wrap an elegant GUI around it. I’m thinking franken-Automator-action. Anybody up to that task? :slight_smile: