Can anyone help me out with searching a web page’s HTML for a certain string? Is it possible to do it without downloading the HTML? I tried with downloading:
set HTML to (do shell script "curl http://www.google.com")
set displaystring to "Not Found"
set stringtofind to "google"
if stringtofind is in HTML then set displaystring to "Found"
display dialog displaystring
and it always comes up with "Not Found".
I would love code that doesn't download the page.
I'm trying to convert from AutoIt for Windows to AppleScript.
Thanks for the help!
Your code works just fine here on my Mac with an active internet connection.
Therefore you should log the HTML output. Maybe in your case curl is being redirected to another site:
set HTML to (do shell script "curl http://www.google.com")
log HTML
set displaystring to "Not Found"
set stringtofind to "google"
if stringtofind is in HTML then set displaystring to "Found"
display dialog displaystring
Just activate the Event Log tab at the bottom of the Script Editor window and have a look at what is downloaded.
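If a redirect turns out to be the problem, a variation like the following might help. It is only a sketch: -s (no progress meter) and -L (follow redirects) are standard curl options, but whether a redirect is really the cause in your case is an assumption, and the variable theURL is just a name I made up for the example. It also logs how many characters came back, which makes a truncated or unexpected page easier to spot.

```applescript
-- Sketch only: follow redirects (-L), hide the progress meter (-s)
-- theURL is a hypothetical variable name used for illustration
set theURL to "http://www.google.com"
set HTML to (do shell script "curl -sL " & quoted form of theURL)
-- log the number of characters received, then the text itself
log (count of HTML)
log HTML
set displaystring to "Not Found"
set stringtofind to "google"
if stringtofind is in HTML then set displaystring to "Found"
display dialog displaystring
```

AppleScript ignores case in text comparisons by default, so "google" will also match "Google" here.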
If the string I'm searching for is too far down the page, then my code fails. I seem to remember reading about some limitation in size for curl. What can I do to download the entire page?
You could also try to use URL Access Scripting instead of curl:
on run
	-- getting an unused temp file path
	set tmpfilepath to my gettmpfilepath()
	-- downloading the HTML to the temp file
	tell application "URL Access Scripting"
		set tmpfile to download "http://www.apple.com" to tmpfilepath
	end tell
	-- opening and reading the content of the temp file
	-- (filecont starts out empty so the search below still works if reading fails)
	set filecont to ""
	try
		set fileobj to open for access tmpfile
		set filecont to read fileobj
		close access fileobj
	on error
		try
			close access fileobj
		end try
	end try
	-- searching for the string
	set searchstring to "iPod"
	if searchstring is in filecont then
		tell me
			activate
			display dialog "Search string found!"
		end tell
	end if
	-- removing the temporary file
	do shell script ("rm " & quoted form of (POSIX path of tmpfilepath))
end run

-- I am returning a file path to an unused temporary file
on gettmpfilepath()
	set tmpfolderpath to (path to temporary items folder from user domain) as Unicode text
	repeat
		set randnum to random number from 1000 to 9999
		set tmpfilepath to (tmpfolderpath & randnum & ".tmp")
		try
			-- the coercion to alias only succeeds if the file already exists
			set tmpfilealias to tmpfilepath as alias
		on error
			exit repeat
		end try
	end repeat
	return tmpfilepath
end gettmpfilepath
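The same download-to-a-temp-file idea can also be sketched with curl alone, which may be useful where URL Access Scripting is not installed (as far as I know it is missing from newer system versions, but please check your own machine). This is an assumption-laden sketch, not a tested replacement: the /tmp/pagesource.XXXXXX template and the variable names theURL and tmpposix are made up for the example.

```applescript
-- Sketch only: let curl write the page to a temp file, then read it back
set theURL to "http://www.apple.com"
-- mktemp creates a new, uniquely named empty file and prints its POSIX path
set tmpposix to do shell script "mktemp /tmp/pagesource.XXXXXX"
-- -o writes the response body to the given file, -L follows redirects
do shell script "curl -sL -o " & quoted form of tmpposix & " " & quoted form of theURL
-- reading the content of the temp file
set filecont to read (POSIX file tmpposix)
-- removing the temporary file
do shell script "rm " & quoted form of tmpposix
-- searching for the string
if "iPod" is in filecont then display dialog "Search string found!"
```

Because mktemp creates the file itself, there is no need for a handler that guesses unused file names.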
Also Safari is scriptable and lets you access the source of a loaded website:
tell application "Safari"
	set htmlsource to source of document 1
end tell
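The snippet above assumes the page is already open in the front document. If the script has to load the page first, something along these lines could work. Treat it as a rough sketch: the polling loop, its 20 second limit and the variable names are my own assumptions, and the source may still be incomplete if the page is slow to finish loading.

```applescript
-- Sketch only: open the page in Safari, then poll until some source text is available
set theURL to "http://www.apple.com"
set htmlsource to ""
tell application "Safari"
	make new document with properties {URL:theURL}
	-- crude wait: check every half second, for about 20 seconds at most
	repeat 40 times
		delay 0.5
		try
			set src to source of document 1
			if src is not missing value and src is not "" then
				set htmlsource to src
				exit repeat
			end if
		end try
	end repeat
end tell
-- searching for the string
if "iPod" is in htmlsource then display dialog "Search string found!"
```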
Moreover DEVONagent/DEVONthink have powerful AppleScript libraries to process web items:
tell application "DEVONagent"
	set htmlsource to download markup from "http://www.apple.com"
end tell
Thank you so much for your detailed response. Do you have any thoughts on which would be the fastest?