Delete line(s) of unsaved document within TextWrangler/BBEdit

My objective is to copy details from a webpage (GWebCache) and process them until only the host addresses remain, which are then inserted into a host-file text document. Chrome works fine if delays are added, and Firefox works very well even without the short delays, but Safari is not straightforward. My tests show that Safari copy-pastes each line broken up into 4 lines, so I’m trying to write an extra script for Safari to delete all the unnecessary lines. The 4 lines are: host address and other details, network name (gnutella), program name, and timestamp. Removing repetitive lines partly solves the issue, but I still need to remove the remaining unwanted lines (the timestamp lines are not identical).

It would be easier if I could instead instruct Safari to copy all lines in place, which would make this extra code unnecessary. I have placed the copy-webpage code at the bottom of the post. If that cannot be done, I need to figure out how to delete the unwanted lines: those containing words such as gnutella, wireshare, 2019/ or 2020/, or lines that contain two colons.

The text formatting to remove details is handled by sub-routines not listed here; there’s more than one, depending on which webcache site is accessed.

The keystroke deletion in the script below fails with error number -1700:

tell application "TextWrangler"
	activate
	
	---  open find window
	find "gnutella" searching in text 1 of text document "untitled text 41" options {wrap around:true} with selecting match --- you may need to change this to document 1 or front document.
	--- set linetobedeleted to selection
	tell application "TextWrangler"
		tell front window
			tell application "System Events"
				
				try
					keystroke (delete {shift down, control down})
					
				end try
			end tell
		end tell
	end tell
end tell

You can test the first section of code using Safari, Chrome or Firefox to see the difference and the problem. (A different script pre-checks whether the person possesses either BBEdit or TextWrangler; otherwise this extra processing is bypassed.)


set browser_name to "Safari" --- I've removed the code that checks for the default browser, and the code that checks whether any browsers are already open so that they are only quit if they were not initially open.

-- Write Document Text  
set DocText to ""

-- Look at whichever browser is default browser, open and do things  
tell application browser_name
	activate --- launch instead of activate as it will open in background. Remains in background if already open and hidden.
	
	open location "http://disobscure.velum-ultra.com/skulls.php?showhosts=1"
	--- "http://wireshare.sourceforge.net/gwc/gwc.php?display=gnutella"
	
end tell

to raiseWindow of browser_name for theName
	tell the application named browser_name
		activate
		set theWindow to the first item of ¬
			(get the windows whose name is theName)
		if index of theWindow is not 1 then
			set index to 1
			set visible to true
		end if
	end tell
end raiseWindow

delay 4
tell application "System Events"
	-- Open URL
	
	if frontmost of process browser_name then
		set visible of process browser_name to true
	else
		set frontmost of process browser_name to true
	end if
	
	tell application process browser_name
		set visible to true
		
		-- Press ⌘A
		set uiScript to "keystroke \"a\" using command down"
		try
			run script "tell application \"System Events\"
" & uiScript & "
end tell"
		end try
		delay 2
		set uiScript to "keystroke \"c\" using command down"
		
		try
			run script "tell application \"System Events\"
" & uiScript & "
end tell"
		end try
		delay 2
		-- Press ⌘W (close browser tab)
		set uiScript to "keystroke \"w\" using command down"
		try
			run script "tell application \"System Events\"
" & uiScript & "
end tell"
		end try
		delay 1
		keystroke return
	end tell
	
end tell
delay 1
set the clipboard to string of (the clipboard as record)
set webpage to (the clipboard as text)

tell application "TextWrangler"
	launch --- launch instead of activate as it will open in background. Remains in background if already open and hidden. --- realised the select, copy-paste probably does not work if in background.
	
	set thisDoc to make new document
	
	-- readying process to copy web-page contents
	set ContentRelitive to webpage & return
	
	set DocText to ContentRelitive & return
	tell thisDoc
		set its text to DocText
	end tell
	
	tell application "TextWrangler" to tell front text document to delete text of (lines 1 thru 808)
	tell application "TextWrangler" to tell front text document to delete text of (lines 78 thru 1610) --- I had to increase these numbers for Safari. For Chrome + Firefox see original line listing below. Also depends on which URL is accessed.)

	-- tell application "TextWrangler" --- I forgot, I added this for safari but does not do anything. Would not want this for the other browsers so remove this section if using chrome or firefox.
		-- repeat with i in (get every line of front text document)
			-- tell i's text
				-- if (contents of line contains "WireShare") then
					-- delete text of line
				-- end if
			-- end tell
		-- end repeat
	-- end tell

	-- tell application "TextWrangler" to tell front text document to delete text of (lines 1 thru 6)
	-- tell application "TextWrangler" to tell front text document to delete text of (lines 51 thru 70)
--- these are actually for the sourceforge url
end tell

I’m sorry my English is imperfect, but maybe you want this:


set JS to "var URLs = [];
function collectIfNew(url) {
if( URLs.indexOf(url) == -1 ) {
URLs.push(url);
}
}
function processDoc(doc) {
var l = undefined;
try {
// If the document is from a different protocol+domain+port than the main page (an iframe to a foreign location), this may throw a security exception. If so, note it and move on.
l = doc.links;
} catch(e) { console.warn(e) }
if( l !== undefined ) {
for( var i = 0; i < l.length; i++ ) { collectIfNew(l[i].href) }
}
}
function processFrameset(f) {
for( var i = 0; i < f.length; i++ ) { process(f[i]) }
}
function process(o) {
if( o.frames !== undefined && o.frames.length != 0 ) {
// It is a frameset
processDoc(o.document); // Process its links. Normal framesets probably have no links, but iframe-based framesets probably have many links.
processFrameset(o.frames);
} else {
// It is a document
processDoc(o.document);
}
}
process(window);
URLs;"

tell application "Safari" to open location "http://disobscure.velum-ultra.com/skulls.php?showhosts=1"

delay 2 -- wait while webpage is loaded
-- OR, better, wait while webpage is loaded this way:
--tell application "System Events" to tell application process "Safari"
--repeat until (UI element "Reload this page" of group 2 of toolbar 1 of window 1 exists)
--delay 0.1
--end repeat
--end tell

tell application "Safari" to set linkURLs to do JavaScript JS in front document

set hostAddresses to {}
repeat with i from 1 to count linkURLs
	set anItem to item i of linkURLs
	if anItem contains ":host:" then set end of hostAddresses to anItem
end repeat

hello KniazidisR

Thank you for your fast reply. Your script looks very clever.

Something I forgot to mention is that these scripts are to be part of a public deployment where people will run the overall script on their own computers. The JavaScript approach might be problematic for some people. These days running JavaScript from a script appears to be disabled in browsers by default.

Safari got an error: You must enable the ‘Allow JavaScript from Apple Events’ option in Safari’s Develop menu to use ‘do JavaScript’

Funny thing is the script ran in Firefox instead. :smiley: No doubt because firefox is my default browser. However I solved it with:

tell application "Safari"
	activate
	open location "http://disobscure.velum-ultra.com/skulls.php?showhosts=1"
end tell

Yes, my skills are very limited, and I generally search for scripts that do something similar or record actions.

Model: mp3,1
AppleScript: 2.8.1
Browser: Firefox 71.0
Operating System: macOS 10.11

Sometimes I install other browsers to test scripts, but then I uninstall them, since having several browsers on a Mac is a luxury for me and Safari meets all my requirements. So you did well to place open location in the tell application “Safari” block. I updated my script above according to your fair remark.

Now about JavaScript: I think enabling JavaScript in a browser costs less than installing TextWrangler. Prohibiting JavaScript in the browser is like never going outside because you might catch a cold…

Note: TextWrangler is a third-party application, whereas every Mac already has powerful tools for text manipulation, such as AppleScript’s text item delimiters and bash commands. They are also more efficient, as they work in the background. That is why TextWrangler never got a place on my computer.
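As a rough illustration of that point, the line filtering asked about in the first post could probably be done without any text editor. The sketch below is only an assumption about what is wanted: the handler names removeUnwantedLines and lineIsUnwanted are made up for this example, and the markers are the ones listed in the first post (gnutella, wireshare, 2019/, 2020/, or two colons). It uses the linefeed constant, so it needs a system whose AppleScript provides it.

```applescript
-- Hypothetical handler: keep only the lines that do not look like
-- one of the unwanted Safari lines, and rejoin them with linefeeds.
on removeUnwantedLines(theText)
	set keptLines to {}
	repeat with aLine in paragraphs of theText
		set aLine to contents of aLine
		if not lineIsUnwanted(aLine) then set end of keptLines to aLine
	end repeat
	set {ATID, text item delimiters} to {text item delimiters, linefeed}
	set cleanedText to keptLines as text
	set text item delimiters to ATID
	return cleanedText
end removeUnwantedLines

-- Hypothetical helper: true if the line contains a marker word
-- (comparison ignores case by default) or at least two colons.
on lineIsUnwanted(aLine)
	repeat with aMarker in {"gnutella", "wireshare", "2019/", "2020/"}
		if aLine contains (contents of aMarker) then return true
	end repeat
	-- splitting on ":" gives one more piece than there are colons
	set {ATID, text item delimiters} to {text item delimiters, ":"}
	set pieceCount to count text items of aLine
	set text item delimiters to ATID
	return pieceCount > 2
end lineIsUnwanted
```

It could be called on the clipboard text, for example removeUnwantedLines(the clipboard as text), before anything is written to a document. Whether the host lines themselves ever contain one of the marker words would need checking against the real page.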

BBEdit is a third-party application too, but I like it because it has a nicer interface, is quicker, and is less expensive than Xcode. It is the best solution for viewing nested content.

Isn’t Xcode free?

Do they pay you to take BBEdit?

I mean less expensive in its consumption of computer resources. I have no connection with BBEdit other than that it is one of the very few paid programs on my Mac; I’m just a user. But this is already taking the topic off course.

I am impressed and highly appreciative of your script and your efforts to solve the topic objective.

I can see the script limits a return of both the host and network arrays/columns which is very clever. From that point I would need to filter the list to those listed as gnutella and then remove the gnutella array. The host addresses would need to be on separate lines. Definitely an improvement from your original script which included the host slot available listings immediately after the port number.

In reference to my original script, it is a fair argument that it limits the bonus-hosts section to those who have either BBEdit or TextWrangler on their system. If I could achieve the same thing using either TextEdit or plain AS, so that it works for everyone, I would be extremely happy (TextEdit is not very scriptable; almost a dead horse from what I can see). But I suspect your suggested script will cause more issues for people attempting to run it, as it will probably fail for 99% of them because of the browser permission issue with JavaScript. A sad thought. To my understanding it is the JS within browsers that has all the security issues. It is a difficult call to demand that people allow JavaScript, whether they use my script at home, at work or at university.

But your approach of selecting the appropriate portions of the webpage are excellent.

My script first checks to see if the person possesses BBEdit/TextWr and if they do they are shown a different menu with an extra option for bonus hosts. I did not include that part of script in my original post. I was concerned including all the scripts and sub-routines of this project would have been a bit overwhelming to read and helpers might have simply stepped past it.

I don’t know if the following is what you want:


set JS to "var URLs = [];
function collectIfNew(url) {
if( URLs.indexOf(url) == -1 ) {
URLs.push(url);
}
}
function processDoc(doc) {
var l = undefined;
try {
// If the document is from a different protocol+domain+port than the main page (an iframe to a foreign location), this may throw a security exception. If so, note it and move on.
l = doc.links;
} catch(e) { console.warn(e) }
if( l !== undefined ) {
for( var i = 0; i < l.length; i++ ) { collectIfNew(l[i].href) }
}
}
function processFrameset(f) {
for( var i = 0; i < f.length; i++ ) { process(f[i]) }
}
function process(o) {
if( o.frames !== undefined && o.frames.length != 0 ) {
// It is a frameset
processDoc(o.document); // Process its links. Normal framesets probably have no links, but iframe-based framesets probably have many links.
processFrameset(o.frames);
} else {
// It is a document
processDoc(o.document);
}
}
process(window);
URLs;"

tell application "Safari"
	activate
	open location "http://disobscure.velum-ultra.com/skulls.php?showhosts=1"
end tell

delay 2 -- wait while webpage is loaded
-- OR, better, wait while webpage is loaded this way:
--tell application "System Events" to tell application process "Safari"
--repeat until (UI element "Reload this page" of group 2 of toolbar 1 of window 1 exists)
--delay 0.1
--end repeat
--end tell

tell application "Safari" to set linkURLs to do JavaScript JS in front document

set gnutella_HostAddresses to {}
set gnutella2_HostAddresses to {}
repeat with i from 1 to count linkURLs
	set anItem to item i of linkURLs
	if anItem contains "gnutella:host:" then
		-- split on the prefix and keep what follows it
		set {ATID, text item delimiters} to {text item delimiters, "gnutella:host:"}
		set theParts to text items of anItem
		set text item delimiters to ATID
		set end of gnutella_HostAddresses to item 2 of theParts
	else if anItem contains "g2:host:" then
		-- same for gnutella2 links; anItem is still the original string here
		set {ATID, text item delimiters} to {text item delimiters, "g2:host:"}
		set theParts to text items of anItem
		set text item delimiters to ATID
		set end of gnutella2_HostAddresses to item 2 of theParts
	end if
end repeat

set hostAddresses to gnutella_HostAddresses & gnutella2_HostAddresses

Why not skip the browser altogether:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

set theDesktop to POSIX path of (path to desktop)
-- load page
set pageURL to current application's |NSURL|'s URLWithString:"http://disobscure.velum-ultra.com/skulls.php?showhosts=1"
set {pageHTML, theError} to current application's NSData's dataWithContentsOfURL:pageURL options:0 |error|:(reference)
if pageHTML = missing value then error (theError's localizedDescription() as text)
-- make XML
set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithData:pageHTML options:(current application's NSXMLDocumentTidyHTML) |error|:(reference)
if theXMLDoc = missing value then error (theError's localizedDescription() as text)
-- get links with no title
set {theMatches, theError} to (theXMLDoc's nodesForXPath:"//a[@title = '']" |error|:(reference))
if theMatches = missing value then error (theError's localizedDescription() as text)
set theText to (theMatches's valueForKey:"stringValue")'s componentsJoinedByString:linefeed
theText's writeToFile:(theDesktop & "Output.txt") atomically:true

The similar script without using browser and JavaScript at all:


use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- load page
set pageURL to current application's |NSURL|'s URLWithString:"http://disobscure.velum-ultra.com/skulls.php?showhosts=1"
set {pageHTML, theError} to current application's NSData's dataWithContentsOfURL:pageURL options:0 |error|:(reference)
if pageHTML = missing value then error (theError's localizedDescription() as text)
-- make XML
set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithData:pageHTML options:(current application's NSXMLDocumentTidyHTML) |error|:(reference)
if theXMLDoc = missing value then error (theError's localizedDescription() as text)
-- parse for href attributes
set {linkAtts, theError} to (theXMLDoc's nodesForXPath:"//*[@href]/attribute::href" |error|:(reference))
if linkAtts = missing value then error (theError's localizedDescription() as text)
-- filter links to suit, for example
set thePred to current application's NSPredicate's predicateWithFormat:"stringValue BEGINSWITH[c] %@ OR stringValue BEGINSWITH[c] %@" argumentArray:{"gnutella:host", "g2:host"}
-- extract as list of strings
set linkURLs to ((linkAtts's filteredArrayUsingPredicate:thePred)'s valueForKey:"stringValue") as list

set gnutella_HostAddresses to {}
set gnutella2_HostAddresses to {}
repeat with i from 1 to count linkURLs
	set anItem to item i of linkURLs
	if anItem contains "gnutella:host:" then
		-- split on the prefix and keep what follows it
		set {ATID, text item delimiters} to {text item delimiters, "gnutella:host:"}
		set theParts to text items of anItem
		set text item delimiters to ATID
		set end of gnutella_HostAddresses to item 2 of theParts
	else if anItem contains "g2:host:" then
		-- same for gnutella2 links; anItem is still the original string here
		set {ATID, text item delimiters} to {text item delimiters, "g2:host:"}
		set theParts to text items of anItem
		set text item delimiters to ATID
		set end of gnutella2_HostAddresses to item 2 of theParts
	end if
end repeat

set hostAddresses to gnutella_HostAddresses & gnutella2_HostAddresses

Although we end up with just the host address, ideally the following fields would be appended to each host address + port number, either before or after the process:
“,en,0,PASSIVE,1,”
but without the quotation marks, obviously.
example:
25.132.1.5:6346,en,0,PASSIVE,1,
Lines should have LF line endings (Shane’s approach does.)
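A minimal sketch of that step, assuming the hosts are already in an AppleScript list such as hostAddresses from the scripts above. The handler name formatHostLines is made up for this example; the suffix and the sample address are the ones quoted in this post.

```applescript
-- Hypothetical handler: append the fixed fields to every host and
-- join the results with LF line endings.
on formatHostLines(hostAddresses)
	set suffix to ",en,0,PASSIVE,1,"
	set hostLines to {}
	repeat with aHost in hostAddresses
		set end of hostLines to (contents of aHost) & suffix
	end repeat
	set {ATID, text item delimiters} to {text item delimiters, linefeed}
	set hostText to hostLines as text
	set text item delimiters to ATID
	return hostText
end formatHostLines

formatHostLines({"25.132.1.5:6346"})
-- result: "25.132.1.5:6346,en,0,PASSIVE,1,"
```

Either script's final list could be passed to it, which would keep the fetching and the formatting separate.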

A conundrum: both scripts are very good!

hello Shane

This is excellent! Does so much with so little code. This is what I had hoped for. Only thing I would need to do is remove the first and last 200 host listings as they belong to different networks (200 per network ~ I’m after the middle network called gnutella.)

Any possibility of getting this to also work with a different URL such as “http://wireshare.sourceforge.net/gwc/gwc.php?display=gnutella”, perhaps run separately? This web page’s arrays are notably different.

Also, any possibility of having the list inserted into another document at a specific line number?
Line number 300 would be good. The file is read from bottom up when loaded by the program.

Question: Any idea how far back in AS and OSX versions this script might be compatible with?

Ideally it would work with OSX 10.5 or later. Better still if it could work with earlier OSX, but that is not so important. The number of people running earlier OSX and using these programs is inevitably extremely small.

This one is definitely a universal approach. Thank you.

I will need to then insert the lines into another document at specified point if possible (but without the array characters of course.) Line number 300 would be good. The file is read from bottom up when loaded by the program.
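The insertion itself might look something like the sketch below, which works purely on text: it takes the existing document's text, the new host lines and a line number, and returns the combined text. The handler name insertTextAtLine is hypothetical, and reading and writing the actual host file is left out because the file's location and encoding are not stated here.

```applescript
-- Hypothetical handler: put newText in front of line lineNumber of docText.
-- If the document has fewer lines, newText is simply added at the end.
on insertTextAtLine(docText, newText, lineNumber)
	set docLines to paragraphs of docText
	if lineNumber > (count docLines) then
		set combined to docLines & {newText}
	else if lineNumber is less than or equal to 1 then
		set combined to {newText} & docLines
	else
		set combined to (items 1 thru (lineNumber - 1) of docLines) & {newText} & (items lineNumber thru -1 of docLines)
	end if
	-- rejoin everything with LF line endings
	set {ATID, text item delimiters} to {text item delimiters, linefeed}
	set resultText to combined as text
	set text item delimiters to ATID
	return resultText
end insertTextAtLine

insertTextAtLine("a" & linefeed & "b" & linefeed & "c", "X", 2)
-- result: "a", "X", "b", "c" on four LF-separated lines
```

For the case described, lineNumber would be 300 and newText would be the LF-joined host lines.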

Any idea about the backward compatibility of OSX versions?

To adapt it to other pages, you need to examine them and change the XPath query accordingly (and perhaps the predicate if you go down that route).

In your dreams. As the code says, 10.10 or later.

Thanks Shane, that’s fine, and many thanks for your help. I know it is perhaps going off-topic, but rather than create a new topic, can I ask about a quick AS method to determine the OSX version that would be backward compatible to OSX 10.3 or 10.4 and later?
The reason I’m asking is that an approach I used some years ago stopped working at or after 10.10. I want to be sure the approach I use is not obsolete or restricted to much later OS versions, i.e. I want to set up a process that checks whether the person’s system is running at least OS 10.10.
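One commonly used way to make such a check is sketched below, as an assumption rather than a tested answer for every old system. It reads the version with the sw_vers command-line tool and compares it as a string under considering numeric strings, so that "10.10" counts as later than "10.9". The handler name systemIsAtLeast is made up for this example, and how far back the numeric strings attribute is supported should be tested on the oldest system that matters.

```applescript
-- Hypothetical handler: true if the running system is at least minimumVersion.
on systemIsAtLeast(minimumVersion)
	set currentVersion to do shell script "sw_vers -productVersion"
	-- numeric strings makes "10.10" compare as greater than "10.9"
	considering numeric strings
		return currentVersion ≥ minimumVersion
	end considering
end systemIsAtLeast

if systemIsAtLeast("10.10") then
	-- offer the options that depend on 10.10 or later
else
	-- fall back to the older approach
end if
```

Note that a script containing use framework "Foundation" may not even open on an older system, so the version-dependent part would probably have to live in a separate script that is only loaded after this check passes.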

If you include the line:

use AppleScript version "2.4" -- Yosemite (10.10) or later

the script won’t load on earlier systems, but will be fine on later ones.

Thanks. Tested without the AS version to test on 10.8 Mountain Lion and doesn’t work.
[NSURL] doesn’t understand the URLWithString_ message.
So that verifies it doesn’t work on earlier OSX. :smiley:
Tested with AS version and doesn’t load.

If you want to use the canonical method, you can set the LSMinimumSystemVersion in an applet’s Info.plist file. This is simple to do in Script Debugger: go to the Resources tab, and enter a value for Minimum OS:.

Using either of the above approaches, would it be possible to get an example of how the same approach could be used for a different website that needs different editing of the details, such as http://cache.jayl.de/g2/gwc.php?display=gnutella?

If I can see how to vary the same script I might be able to make changes for other websites as needed.

Edit: I hope my request is not considered ‘off-topic’. But it is related to the answers.

  1. Simply change URL of site in the open location statement
  2. You can change “host” to some other text key in the if anItem contains statement. I don’t change it here, because the “host” key already finds all the gnutella hosts on this particular site

set JS to "var URLs = [];
function collectIfNew(url) {
if( URLs.indexOf(url) == -1 ) {
URLs.push(url);
}
}
function processDoc(doc) {
var l = undefined;
try {
// If the document is from a different protocol+domain+port than the main page (an iframe to a foreign location), this may throw a security exception. If so, note it and move on.
l = doc.links;
} catch(e) { console.warn(e) }
if( l !== undefined ) {
for( var i = 0; i < l.length; i++ ) { collectIfNew(l[i].href) }
}
}
function processFrameset(f) {
for( var i = 0; i < f.length; i++ ) { process(f[i]) }
}
function process(o) {
if( o.frames !== undefined && o.frames.length != 0 ) {
// It is a frameset
processDoc(o.document); // Process its links. Normal framesets probably have no links, but iframe-based framesets probably have many links.
processFrameset(o.frames);
} else {
// It is a document
processDoc(o.document);
}
}
process(window);
URLs;"

tell application "Safari" to open location "http://cache.jayl.de/g2/gwc.php?display=gnutella"

delay 2 -- wait while webpage is loaded
-- OR, better, wait while webpage is loaded this way:
--tell application "System Events" to tell application process "Safari"
--repeat until (UI element "Reload this page" of group 2 of toolbar 1 of window 1 exists)
--delay 0.1
--end repeat
--end tell

tell application "Safari" to set linkURLs to do JavaScript JS in front document

set hostAddresses to {}
repeat with i from 1 to count linkURLs
	set anItem to item i of linkURLs
	if anItem contains ":host:" then set end of hostAddresses to anItem
end repeat

Hi KniazidisR, firstly a huge thanks. Would it be possible for you to provide the example instead via your previous answer using set pageURL and without using Safari?
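An untested sketch of what that might look like: it reuses the no-browser script from earlier in the thread unchanged, apart from the URL and the filter. The one assumption is that the host links on this page contain ":host:" in their href, as the Safari version above relies on; if the page is built differently, the XPath query or the predicate would need adjusting.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- load page (URL taken from the request above)
set pageURL to current application's |NSURL|'s URLWithString:"http://cache.jayl.de/g2/gwc.php?display=gnutella"
set {pageHTML, theError} to current application's NSData's dataWithContentsOfURL:pageURL options:0 |error|:(reference)
if pageHTML = missing value then error (theError's localizedDescription() as text)
-- make XML
set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithData:pageHTML options:(current application's NSXMLDocumentTidyHTML) |error|:(reference)
if theXMLDoc = missing value then error (theError's localizedDescription() as text)
-- collect every href attribute
set {linkAtts, theError} to (theXMLDoc's nodesForXPath:"//*[@href]/attribute::href" |error|:(reference))
if linkAtts = missing value then error (theError's localizedDescription() as text)
-- assumption: host links contain ":host:", as in the Safari version
set thePred to current application's NSPredicate's predicateWithFormat:"stringValue CONTAINS[c] %@" argumentArray:{":host:"}
-- extract as list of strings
set hostAddresses to ((linkAtts's filteredArrayUsingPredicate:thePred)'s valueForKey:"stringValue") as list
```

The resulting hostAddresses list has the same form as in the Safari version, so any later processing should carry over.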