Parsing text/attribute from <a href> links

hansolo625 · August 29, 2019, 11:11pm

Hello scripting experts!

I’ve hit a roadblock with my work project. I’m scraping a work webpage to aggregate user information efficiently instead of copying and pasting it manually. Our webpage was recently updated and one of the user properties is now a hyperlink, so when my scraper parses the information, I get an HTML link as a string.

Usually I would just modify my handlers to extract the text between two strings, or to eliminate the unneeded strings, and get the information I need. However, there’s something in the link that makes both handlers ineffective. The link looks like this:

As you can see, the URL contains a unique ID that’s randomly generated for each user (a combination of letters and numbers). Then there’s the User ID that’s publicly displayed, and that’s the property I need to scrape (it contains only numbers). As mentioned, the two handlers I currently have, a) extract between two strings and b) eliminate unwanted strings, are both ineffective because I can’t know the unique ID in advance.

I know it’s possible to detect a link in other languages, but a) my novice self doesn’t know if that’s possible with AppleScript, and b) my scraper uses a do JavaScript handler to parse the link as a string, so I’m not sure how that would work.

Any advice would be deeply appreciated!

TedW · August 30, 2019, 3:05am

If you can isolate it down to that html link, you can use this code to extract the user ID:

set htmlLink to "<a href=\"/userdata/(UniqueID)\">User_ID</a>"
-- above line for testing purposes

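-- save the current delimiters so they can be restored afterwards
set tid to my text item delimiters
set my text item delimiters to {"<", ">"}
set textBits to text items of htmlLink
set my text item delimiters to tid
-- textBits is now {"", "a href=...", "User_ID", "/a", ""}
set userID to item 3 of textBits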

By setting the text item delimiters to the opening and closing angle brackets, we cut up the string at the HTML tag boundaries. The first item is the empty string before the opening "<", the second is the contents of the opening tag, and the third is the link text, which is the user ID.
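
Since your scraper already goes through do JavaScript, the same split could also be done on the JavaScript side. This is only a sketch: `userIdFromLink` is a name I made up for illustration, and it assumes the string is a single link with no nested tags.

```javascript
// Hypothetical helper (not part of the original script): split a single
// <a ...>text</a> string on "<" and ">" and take the third piece,
// mirroring the text item delimiters approach.
function userIdFromLink(htmlLink) {
  const textBits = htmlLink.split(/[<>]/);
  // textBits: ["", 'a href="..."', "User_ID", "/a", ""]
  // JavaScript arrays are zero-based, so index 2 matches AppleScript's item 3.
  return textBits[2];
}

console.log(userIdFromLink('<a href="/userdata/(UniqueID)">User_ID</a>')); // User_ID
```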

If you have to pick the link out of the entire text, then we’ll need to switch to something more complicated. First let me know if this suits your needs.

KniazidisR · August 30, 2019, 3:48am

Hi.
I will try to help you if you provide a website address. I need to look into the HTML of your site.

In the meantime, here is a script that automatically retrieves all links from the current web page:


set JS to "var URLs = [];
function collectIfNew(url) {
	if( URLs.indexOf(url) == -1 ) {
		URLs.push(url);
	}
}
function processDoc(doc) {
	var l = undefined;
	try {
		// If the document is from a different protocol+domain+port than the main page (an iframe to a foreign location), this may throw a security exception. If so, note it and move on.
		l = doc.links;
	} catch(e) { console.warn(e) }
	if( l !== undefined ) {
		for( var i = 0; i < l.length; i++ ) { collectIfNew(l[i].href) }
	}
}
function processFrameset(f) {
	for( var i = 0; i < f.length; i++ ) { process(f[i]) }
}
function process(o) {
	if( o.frames !== undefined && o.frames.length != 0 ) {
		// It is a frameset
		processDoc(o.document); // Process its links. Normal framesets probably have no links, but iframe-based framesets probably have many links.
		processFrameset(o.frames);
	} else {
		// It is a document
		processDoc(o.document);
	}
}
process(window);
URLs;"

set linkURLs to {}
tell application "Safari" to set linkURLs to do JavaScript JS in front document
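
If only the visible text of the user links is needed rather than every URL, a narrower variant might look like the sketch below. It assumes the links have an href starting with /userdata/, as in the test string earlier in the thread. `collectUserIds` and the `{href, text}` record shape are names made up for illustration, and the sample data is invented.

```javascript
// Hypothetical helper: given link records of the form {href, text},
// keep those whose href starts with the assumed "/userdata/" prefix
// and return their trimmed visible text, without duplicates.
function collectUserIds(links, prefix = "/userdata/") {
  const ids = [];
  for (const link of links) {
    if (typeof link.href !== "string" || !link.href.startsWith(prefix)) continue;
    const text = String(link.text).trim();
    if (text !== "" && !ids.includes(text)) ids.push(text);
  }
  return ids;
}

// Stand-in data; these hrefs and IDs are invented for the example.
const sample = [
  { href: "/userdata/a1b2c3", text: " 1001 " },
  { href: "/about", text: "About" },
  { href: "/userdata/x9y8z7", text: "1002" },
  { href: "/userdata/a1b2c3", text: "1001" },
];
console.log(collectUserIds(sample)); // [ '1001', '1002' ]
```

In Safari, the records could be built from the page with something like `Array.from(document.links, a => ({ href: a.getAttribute("href"), text: a.textContent }))`. Note that `getAttribute("href")` gives the path as written in the HTML, while `a.href` gives the full absolute URL, so the prefix would need adjusting if you use the latter.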

bogdanw · August 30, 2019, 6:47am

Removing HTML Markup from Text https://developer.apple.com/library/archive/documentation/LanguagesUtilities/Conceptual/MacAutomationScriptingGuide/RemoveMarkupfromHTML.html
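
The general idea is to drop everything between a "<" and the next ">" and keep the rest. A rough JavaScript sketch of that idea follows. It is not the guide's code, `removeMarkup` is a made-up name, and it assumes the text itself contains no stray angle brackets.

```javascript
// Hypothetical helper: walk the string one character at a time and
// skip everything that sits between "<" and ">".
function removeMarkup(html) {
  let insideTag = false;
  let clean = "";
  for (const ch of html) {
    if (ch === "<") {
      insideTag = true;
    } else if (ch === ">") {
      insideTag = false;
    } else if (!insideTag) {
      clean += ch;
    }
  }
  return clean;
}

console.log(removeMarkup('<a href="/userdata/(UniqueID)">User_ID</a>')); // User_ID
```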

hansolo625 · September 9, 2019, 9:17pm

NO. WAY. WOW. That documentation has more than I expected… and my noob arse has actually read through that doc quite a few times… Can’t believe they have exactly what I needed. Imma go play with that now. Much, much appreciated!!! :lol:

hansolo625 · September 9, 2019, 9:19pm

This is wonderful!!! It appears to be the same concept as the method shown in the Apple Scripting Guide shared by another expert here! https://developer.apple.com/library/archive/documentation/LanguagesUtilities/Conceptual/MacAutomationScriptingGuide/RemoveMarkupfromHTML.html

I truly appreciate your help!

hansolo625 · September 9, 2019, 9:21pm

I appreciate your help, KniazidisR! Unfortunately the link is for an internal work site, so I won’t be able to share it publicly. It appears that your solution is JavaScript? I’m sorry, I’m definitely too much of a novice to understand it lol. However, I do plan on learning JavaScript, so it may still be of help to me in the future! Thank you!