Copy contents of specific web page

I originally asked a question here 2 years ago, but its topic was only one facet of the overall objective. Not wishing to go off the original topic, I am looking at expanding on an answer given by Shane Stanley here https://macscripter.net/viewtopic.php?pid=199269#p199269 but I am not sure his approach, even if adjusted, would work with the specific website desired.

Objective: (1) Copy contents of page url=http://wireshare.sourceforge.net/bootstrap/ (2) Replace the tabs/spaces with LF (Linux/macOS) line breaks. (3) Output to a new UTF-8 text document.

I’d prefer to do this without using JavaScript. Of the websites in the previous query 2 years ago, one has since gone offline and another was incorrectly updated and is unusable. The only comparable website that works with either Shane’s or KniazidisR’s script has outdated data (up to 9 months old), whereas the webpage listed above is constantly updated and reliable.

Note: I am on OS 10.11 with Script Editor 2.8.1, AS 2.5. As much backward compatibility as possible is preferred. A (bundled) script will be offered free to the public to assist with a p2p protocol.

Model: mp3,1
AppleScript: 2.8.1
Browser: Firefox 78.0
Operating System: macOS 10.11

Using plain AppleScript:


-- open webpage
tell application "Safari" to open location "http://wireshare.sourceforge.net/bootstrap/"

-- wait until the webpage is loaded fully
tell application "System Events" to tell application process "Safari"
	set frontmost to true
	repeat until (UI element "Reload this page" of group 2 of toolbar 1 of window 1 exists)
		delay 0.1
	end repeat
end tell

-- get text of webpage
tell application "Safari" to set theText to text of document 1

-- replace tabs/spaces with linefeed
set text item delimiters of AppleScript to {space, tab}
set theText to text items of theText
set text item delimiters of AppleScript to linefeed
set theText to theText as text
set text item delimiters of AppleScript to ""

-- choose text file name
tell application "Safari" to set the_file to choose file name default location path to desktop folder

-- write to UTF-8 encoded text file
set file_ID to open for access the_file with write permission
set eof file_ID to 0
write theText to file_ID as «class utf8»
close access file_ID

The same using AsObjC:


use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

--  get the text of webpage
set anNSURL to current application's class "NSURL"'s URLWithString:"http://wireshare.sourceforge.net/bootstrap/"
set WebPageText to current application's class "NSString"'s stringWithContentsOfURL:(anNSURL) usedEncoding:(missing value) |error|:(missing value)

-- replace tabs and spaces with linefeed
set WebPageText to WebPageText's stringByReplacingOccurrencesOfString:space withString:linefeed
set WebPageText to WebPageText's stringByReplacingOccurrencesOfString:tab withString:linefeed

-- write to UTF-8 encoded text file
set the_file to choose file name default location path to desktop folder
(WebPageText's writeToFile:(POSIX path of the_file) atomically:true encoding:(current application's NSUTF8StringEncoding) |error|:(missing value))

Thanks KniazidisR for such a fast response. Whilst I’d have preferred not using a web browser, I’m not going to be fussy.

The first script does not work for me, possibly because I already have multiple tabs open. The new tab loads but the log shows repeatedly:
exists UI element “Reload this page” of group 2 of toolbar 1 of window 1 of application process “Safari”
exists UI element “Reload this page” of group 2 of toolbar 1 of window 1 of application process “Safari”
exists UI element “Reload this page” of group 2 of toolbar 1 of window 1 of application process “Safari”

and the script does not finish running.

The second script works fine. I simply adjusted the final line to output to a specific name. This is exactly what I wanted. Thanks again, you’ve put together a great script! :slight_smile:
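
The endless loop reported for the first script could be bounded with a timeout, so the script stops instead of hanging when the expected toolbar element is never found. The following is only a minimal sketch: the handler name waitForReloadButton and the 30 second limit are assumptions made for illustration, and the UI element path is copied unchanged from the first script, so it may still not match every Safari version or window layout.

```applescript
-- Hypothetical variant of the wait loop: gives up after maxSeconds instead of looping forever.
-- Returns true if the "Reload this page" button was found, false on timeout.
on waitForReloadButton(maxSeconds)
	set deadline to (current date) + maxSeconds
	repeat
		tell application "System Events" to tell application process "Safari"
			-- path taken from the first script; the group index may differ between setups
			if (UI element "Reload this page" of group 2 of toolbar 1 of window 1 exists) then return true
		end tell
		if (current date) > deadline then return false
		delay 0.1
	end repeat
end waitForReloadButton

tell application "Safari" to open location "http://wireshare.sourceforge.net/bootstrap/"
if not waitForReloadButton(30) then
	display dialog "The page did not finish loading within 30 seconds." buttons {"OK"} default button 1
end if
```

With a limit in place, a mismatch in the toolbar layout shows up as a dialog rather than as a script that never finishes.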

Here is a different solution that does not require launching a web browser:

property saveToFile : (path to desktop as text) & "webpage_text.txt"
property theURL : "http://wireshare.sourceforge.net/bootstrap/"

do shell script "curl " & quoted form of theURL & ¬
	" > " & quoted form of POSIX path of saveToFile

Thanks! Excellent second approach, similar to Shane’s script in that no browser is required. The opening of a browser or tab might confuse some people, young or old, as to what’s happening, so that’s sorted well here.

As far as I knew curl had major limitations but you’ve shown an ingenious method. :slight_smile:

When you provide people with a solution, it is desirable that it works at all, then that it keeps working for a long time, and not with just one single webpage. The OP specifically requested replacing tabs and spaces with linefeeds, as well as writing the result as a UTF-8 text file.

Your solutions, wch1zpink and Fredrik71, are, in contrast to a general solution, specific and incomplete, since they rely on the particulars of one webpage’s content. For that page they work, but with many other pages they will simply fail.
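
For reference, the curl approach could be extended to cover the two remaining requirements (linefeed separation and UTF-8 output) roughly as follows. This is a minimal sketch, not a tested replacement: the variable names are illustrative, and dropping empty items is an assumption about the desired output rather than something the OP specified.

```applescript
property saveToFile : (path to desktop as text) & "webpage_text.txt"
property theURL : "http://wireshare.sourceforge.net/bootstrap/"

-- fetch the page without a browser; -s hides curl's progress output
set pageText to do shell script "curl -s " & quoted form of theURL

-- split on spaces, tabs and line breaks
-- (do shell script returns line breaks as carriage returns unless told otherwise)
set savedDelimiters to AppleScript's text item delimiters
set AppleScript's text item delimiters to {space, tab, return, linefeed}
set rawItems to text items of pageText

-- assumption: empty items produced by consecutive separators are not wanted
set cleanItems to {}
repeat with anItem in rawItems
	if (contents of anItem) is not "" then set end of cleanItems to contents of anItem
end repeat

-- rejoin with linefeeds
set AppleScript's text item delimiters to linefeed
set pageText to cleanItems as text
set AppleScript's text item delimiters to savedDelimiters

-- write as UTF-8, closing the file even if the write fails
set fileRef to open for access file saveToFile with write permission
try
	set eof fileRef to 0
	write pageText to fileRef as «class utf8»
	close access fileRef
on error errMsg number errNum
	close access fileRef
	error errMsg number errNum
end try
```

Like the original curl one-liner, this assumes the page is served as plain text; a page with markup would need its tags stripped first.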

Thanks Fredrik71. For some reason I was getting either an empty first line or, if not, the first number of the first listing was missing; this happened in both cases. It may well be due to each of my methods of writing to file.

I’m personally tending toward wch1zpink’s answer as it is simple yet effective. It also works at least as far back as OS 10.8.

True. But the above approaches work for this specific topic and target.
In the previous query a couple of years ago, I found the approaches difficult to adapt to a different variation of these types of website. Thanks KniazidisR for your support.

And thanks to everyone for your excellent efforts.

Second URL

I changed my mind: as long as this is not stretching things too far, could a solution be found, if possible, for the following website:
url=http://wireshare.sourceforge.net/gwc/gwc.php?display=gnutella
This is one of those I struggled with.

The site lists 100 lines, but I’d prefer only the first 50 (as the rest are a day or two old). As above, only the address listing is desired, with all other details stripped in similar fashion to the query of 2 years ago.
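
Limiting the output to the first 50 entries can be done on a list of addresses before anything is written to file. A minimal sketch follows, assuming the addresses have already been extracted into a list; the handler name firstItems and the sample addresses are made up for illustration.

```applescript
-- Hypothetical helper: returns at most maxCount items from the start of theList.
on firstItems(theList, maxCount)
	if maxCount < 1 then return {}
	if (count of theList) <= maxCount then return theList
	return items 1 thru maxCount of theList
end firstItems

-- made-up sample data standing in for the extracted address lines
set sampleLines to paragraphs of ("1.1.1.1:6346" & linefeed & "2.2.2.2:6346" & linefeed & "3.3.3.3:6346")
set newestLines to firstItems(sampleLines, 2)
-- newestLines is {"1.1.1.1:6346", "2.2.2.2:6346"}
```

For the page in question the call would be firstItems(addressList, 50), where addressList is whatever list the extraction step produces.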

Reason for second URL: 1. I wish to add this to the results of the other website already discussed. 2. Either one of the websites might be down such as for maintenance purposes (usually no more than a few hours once/twice a year.)
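
The case where one site is down could be handled by fetching each page inside a try block, so a failure yields an empty result instead of stopping the script. This is a sketch under assumptions: the handler name fetchPage and the 20 second limit are illustrative, and curl is used here only because it appeared earlier in the thread.

```applescript
-- Hypothetical helper: returns the page text, or "" if the site cannot be reached.
on fetchPage(theURL)
	try
		-- -s: no progress output, -f: treat HTTP errors as failures, --max-time: overall time limit in seconds
		return do shell script "curl -sf --max-time 20 " & quoted form of theURL
	on error
		return ""
	end try
end fetchPage

set firstPage to fetchPage("http://wireshare.sourceforge.net/bootstrap/")
set secondPage to fetchPage("http://wireshare.sourceforge.net/gwc/gwc.php?display=gnutella")
-- false only when both sites failed
set anySiteOnline to (firstPage is not "") or (secondPage is not "")
```

The rest of the script can then skip whichever page came back empty and carry on with the other.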

Objective: 1. Copy and process details of the webpage. 2. (If both are online) add the results of each webpage together. 3. If both websites are online and have been processed, remove duplicates. 4. Output to file.
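
Steps 2 and 3 of this objective can be sketched with plain lists: join the two address lists and keep only the first occurrence of each line. The handler name mergeWithoutDuplicates and the sample addresses below are made up for illustration and assume each page has already been reduced to a list of address lines.

```applescript
-- Hypothetical helper: combines two lists, keeping order and skipping blanks and repeats.
on mergeWithoutDuplicates(listA, listB)
	set merged to {}
	repeat with anItem in (listA & listB)
		set thisLine to contents of anItem
		if (thisLine is not "") and (merged does not contain thisLine) then set end of merged to thisLine
	end repeat
	return merged
end mergeWithoutDuplicates

-- made-up sample data standing in for the two processed pages
set siteA to {"1.1.1.1:6346", "2.2.2.2:6346"}
set siteB to {"2.2.2.2:6346", "3.3.3.3:6346"}
set combined to mergeWithoutDuplicates(siteA, siteB)
-- combined is {"1.1.1.1:6346", "2.2.2.2:6346", "3.3.3.3:6346"}

-- turn the list back into linefeed-separated text, ready to write to file
set AppleScript's text item delimiters to linefeed
set outputText to combined as text
set AppleScript's text item delimiters to ""
```

Comparing whole lines in a list avoids a partial address such as 1.2.3.4:63 being treated as a duplicate of 1.2.3.4:6346, which can happen when a substring test is run against one long text. If one site is offline, its list is simply empty and the other list passes through unchanged.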

The increase in output might only be slightly larger but it offers an insurance against one site being temporarily offline.

I totally understand if this second request for assistance within same thread is stretching things too far. Although the topic is basically the same. :slight_smile:

Model: mp3,1
AppleScript: 2.8.1
Browser: Firefox 78.9
Operating System: macOS 10.11

The second page has XHTML. I will think about a non-browser solution. For now, a full solution using Safari is this:


-- open first webpage, wait for full loading
tell application "Safari" to open location "http://wireshare.sourceforge.net/bootstrap/"
my waitFullLoading()

tell application "Safari" to set goodLinks to text of document 1 -- get text of webpage

-- replace tabs/spaces with linefeed, get links list
set text item delimiters of AppleScript to {space, tab}
set goodLinks to text items of goodLinks
set text item delimiters of AppleScript to linefeed
set goodLinks to goodLinks as text
set text item delimiters of AppleScript to ""

-- open second webpage, wait for full loading
tell application "Safari" to open location "http://wireshare.sourceforge.net/gwc/gwc.php?display=gnutella"
my waitFullLoading()

-- execute JavaScript in the XHTML of the webpage to retrieve the links
tell application "Safari" to set theLinks to do JavaScript (my jScript()) in document 1

-- filter useful stuff
repeat with nextLink in theLinks
	if nextLink contains "gnutella:host:" then
		set dottedLink to text 15 thru -1 of nextLink
		if not (dottedLink is in goodLinks) then set goodLinks to goodLinks & dottedLink & linefeed
	end if
end repeat

my writeToUTF8TextFile(goodLinks) -- write to UTF-8 encoded text file


--================================= HANDLERS ==========================================

on waitFullLoading() -- wait until the webpage is loaded fully
	tell application "System Events" to tell application process "Safari"
		set frontmost to true
		repeat until (UI element "Reload this page" of group 2 of toolbar 1 of window 1 exists) or (UI element "Reload this page" of group 3 of toolbar 1 of window 1 exists)
			delay 0.1
		end repeat
	end tell
end waitFullLoading

on jScript() -- this JavaScript will be executed in Safari to get the links from the XHTML
	"function documentLinks() {
	//
	var arr = [], links = document.links;
	for(var i = 0; i < links.length; i++) {
	   arr.push(links[i].href);
	}
	return arr
	}
	//
	documentLinks()
	"
end jScript

on writeToUTF8TextFile(goodLinks)
	-- choose text file name
	tell application "Safari" to set the_file to choose file name default location path to desktop folder
	-- write to UTF-8 encoded text file
	set file_ID to open for access the_file with write permission
	set eof file_ID to 0
	write (text 1 thru -2 of goodLinks) to file_ID as «class utf8»
	close access file_ID
end writeToUTF8TextFile

Non-browser solution:


use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- get the text of first webpage
set anNSURL to current application's class "NSURL"'s URLWithString:"http://wireshare.sourceforge.net/bootstrap/"
set WebPageText to current application's class "NSString"'s stringWithContentsOfURL:(anNSURL) usedEncoding:(missing value) |error|:(missing value)
-- replace tabs and spaces with linefeed
set WebPageText to WebPageText's stringByReplacingOccurrencesOfString:space withString:linefeed
set WebPageText to WebPageText's stringByReplacingOccurrencesOfString:tab withString:linefeed

set theText to WebPageText as text -- text generated from first webpage

-- get the text of second webpage
set anNSURL to current application's class "NSURL"'s URLWithString:"http://wireshare.sourceforge.net/gwc/gwc.php?display=gnutella"
set WebPageText to current application's class "NSString"'s stringWithContentsOfURL:(anNSURL) usedEncoding:(missing value) |error|:(missing value)

-- get custom paragraphs
set AppleScript's text item delimiters to {"href=", "\" class=\"table_address\">"}
set theParagraphs to text items of (WebPageText as text)
set AppleScript's text item delimiters to ""

-- add non-duplicate dotted links to theText
repeat with anItem in theParagraphs
	if anItem contains "\"gnutella:host:" then
		set dottedLink to text 16 thru -1 of anItem
		if not (theText contains dottedLink) then set theText to theText & dottedLink & linefeed
	end if
end repeat

-- choose text file name
set the_file to choose file name default location path to desktop folder
-- write to UTF-8 encoded text file
set file_ID to open for access the_file with write permission
set eof file_ID to 0
write theText to file_ID as «class utf8»
close access file_ID