XMLRPC, BeautifulSoup

Users on this forum have often asked how to extract information from a website, and there are many ways to do it. In this exercise we will use some Python modules:

  • urllib
  • lxml
  • requests
  • bs4 (BeautifulSoup)

The Python code shows how to use a for loop and an if statement.
It also shows how to use the ‘Tag’ class in bs4, from which you can extract
‘href’ or ‘src’, or build an AppleScript record from a tag’s attributes.
Finally, it shows a simple html_parser and xml_parser.

Many of these examples are from the BeautifulSoup documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/

tell methodCall
	-- its methodName:"html_parser" params:{"https://www.examples...."}
	-- its methodName:"xml_parser" params:{"https://www.examples.com"}
	its methodName:"tags" params:{"https://www.examples.com"}
	-- its methodName:"find_all_links_bs" params:{"https://www.examples.com"}
	-- its methodName:"find_all_links_lxml" params:{"https://www.examples.com"}
end tell

script methodCall
	on methodName:methodName params:params
		tell application "http://localhost:8090"
			call xmlrpc {method name:methodName, parameters:params}
		end tell
	end methodName:params:
end script
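
For readers curious about what the `call xmlrpc` handler above puts on the wire, here is a minimal sketch using Python's standard `xmlrpc.client` module. It only builds and decodes the request body; it does not contact the server. The method name and URL are simply the ones from the AppleScript example.

```python
import xmlrpc.client

# Build the request body that a call such as
#   methodName:"tags" params:{"https://www.examples.com"}
# is expected to send to the server.
payload = xmlrpc.client.dumps(("https://www.examples.com",), methodname="tags")
print(payload)

# Decode it again, which is roughly what the server does on receipt.
params, method = xmlrpc.client.loads(payload)
print(method, params)
```

Once the server is running, the same call could presumably be made from Python with `xmlrpc.client.ServerProxy("http://localhost:8090").tags("https://www.examples.com")`, which can be handy for testing the server without AppleScript.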

Copy the Python code to a file, e.g. xmlrpc_bs.py, and run it after you have installed the
modules it needs. Then run the AppleScript, which connects to the server on port 8090.

from twisted.web import xmlrpc, server
from twisted.logger import Logger
log = Logger()

import urllib.request
import lxml.html

import requests
from bs4 import BeautifulSoup

class beautiful_soup(xmlrpc.XMLRPC):
    def xmlrpc_html_parser(self, *args):
        """
        Fetch the URL in args[0] and return the page as parsed by 'html.parser'.
        """
        response = requests.get(args[0])
        soup = BeautifulSoup(response.content, 'html.parser')
        return str(soup)

    def xmlrpc_xml_parser(self, *args):
        """
        Fetch the URL in args[0] and return the document as parsed by the 'xml' parser.
        """
        response = requests.get(args[0])
        soup = BeautifulSoup(response.content, 'xml')
        return str(soup)

    def xmlrpc_tags(self, *args):
        """
        Return details of the first 'a' and 'img' tags found at the URL in args[0].
        """
        response = requests.get(args[0])
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # The 'tag_a' object is <class 'bs4.element.Tag'>
        tag_a = soup.a
        tagClass = str(type(tag_a))
        
        # return 'a'
        tagName = tag_a.name
        
        # return https://www.examples.com
        link = tag_a['href']
        
        tag_a_dict = tag_a.attrs
        
        # return https://images.examples.com/images/logo.svg
        tag_img = soup.img
        src = tag_img['src']
        
        tag_img_dict = tag_img.attrs
        
        return [str(tag_a), tagClass, tagName, link, src, tag_a_dict, tag_img_dict]

    def xmlrpc_find_all_links_bs(self, *args):
        """
        Example using requests and BeautifulSoup
        """
        response = requests.get(args[0])
        soup = BeautifulSoup(response.content, 'html.parser')
        links = soup.find_all('a')
        
        r = []
        for link in links:
            # keep only links whose href starts with 'http';
            # an 'a' tag without an href returns None, so guard against it
            href = link.get('href')
            if href and href.startswith('http'):
                r.append(href)

        return r

    def xmlrpc_find_all_links_lxml(self, *args):
        """
        Example using urllib and lxml
        """
        url = urllib.request.urlopen(args[0])
        dom = lxml.html.fromstring(url.read())
        links = dom.xpath('//a/@href')
        
        r = []
        for link in links:
            # get all links start with 'http'
            if link.startswith('http'):
                r.append(str(link))

        return r

    def xmlrpc_fault(self):
        """
        Raise a Fault indicating that the procedure should not be used.
        """
        raise xmlrpc.Fault(123, "The fault procedure is faulty.")

    def handleData(self, data):
        log.debug("Got data: {data!r}.", data=data)


if __name__ == '__main__':
    from twisted.internet import reactor
    r = beautiful_soup()
    reactor.listenTCP(8090, server.Site(r))
    reactor.run()
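
The opening post mentions turning tags into an AppleScript record. As a rough illustration of why that works, the sketch below uses only the standard `xmlrpc.client` module to encode a list that contains an attrs-style dictionary, the way an XML-RPC response would carry it. The attribute values are made up for the example; the dictionary is encoded as a `<struct>`, which is what the AppleScript side would presumably receive as a record.

```python
import xmlrpc.client

# Hypothetical values standing in for part of what xmlrpc_tags returns.
# BeautifulSoup reports multi-valued attributes such as 'class' as lists.
tag_a_attrs = {"href": "https://www.examples.com", "class": ["nav", "home"]}
result = ["a", "https://www.examples.com", tag_a_attrs]

# Encode the value as an XML-RPC method response.
body = xmlrpc.client.dumps((result,), methodresponse=True)
print(body)

# Decode it again to confirm the round trip keeps the structure.
(decoded,), _ = xmlrpc.client.loads(body)
print(decoded)
```

Note that XML-RPC has no standard way to send `None`, so a method that may return a missing value needs to substitute something like an empty string.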

What is the advantage of this approach over using JavaScript’s DOM methods directly in Safari or Chrome? That requires no additional software and has the added benefit that you can use far more powerful selectors than with the simple findAll from BS, which works only for elements (irritatingly and wrongly called “tags” in the BS documentation) and attributes – imagine extracting every even row from a table.

Also, browsers are all at HTML5 now, whereas BS requires the user to choose an appropriate parser. So, why tackle the problem this way, which requires several python modules, instead of letting the browser do the job?
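
On the even-row case raised above: besides `find_all`, BeautifulSoup also offers a `select()` method that accepts CSS selectors. The following is only a sketch on a made-up inline table, assuming a bs4 version recent enough to bundle the soupsieve selector engine; it is not part of the code in the opening post.

```python
from bs4 import BeautifulSoup

# Made-up sample table for illustration.
html = """
<table>
  <tr><td>row 1</td></tr>
  <tr><td>row 2</td></tr>
  <tr><td>row 3</td></tr>
  <tr><td>row 4</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector; nth-of-type(even) picks every second row.
even_rows = [tr.get_text(strip=True) for tr in soup.select("tr:nth-of-type(even)")]
print(even_rows)
```

Whether this is as convenient as the browser's own DOM methods is a matter of taste; the sketch only shows that the selector itself is available on the Python side.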

@chrillek
Here is why I prefer it to using JavaScript in a Safari tell block:

  • I do not need to mix two different languages in the main code.
  • It also means I could build a library of methods in Python and call the methods I need.
  • I could use a language that is easier for me to code in than JavaScript.

Feel free to do what I did in your solution.

If you are looking for a fast approach, I would suggest AppleScriptObjC. But not everyone knows AppleScriptObjC, and it may be easier to learn Python. I like to code in
Python, and that’s why I use it.

I have no comment on whether there is any advantage or not; it’s up to the user to decide.

Well, you could write everything in JS. No need for another language. But I get the point of “I’m happy with Python”, though it might not be the best tool for this task (or maybe there are more capable modules than BS).

@chrillek
I think you should share your code and show your approach. If not, then I guess your words or your thinking are pointless. As I said in the first post in this topic, there are many ways to do it.

The approach is shown here, for a RL case

The same thread contains other examples.

It’s not the first time you have questioned my motives for sharing code in some context. But it’s clear to me that if it is not JavaScript, you are very quick to judge. I do not know whether you have ever used BS or another method in Python, which you would need in order to state that running JavaScript in a tell block of a scriptable application is better.

  • My code didn’t use any JavaScript.
  • My code didn’t embed another scripting language.
  • My code was clean.

My code uses 100% AppleScript and 100% Python. In Xcode a user could mix Objective-C with Objective-C++, but there will still be guidelines or debate about whether that is a good approach.

And why should someone choose JavaScript or JXA if the same approach could be done in AppleScriptObjC? I do understand that you like to do everything in JavaScript, but that doesn’t mean I should do the same.

So please do not ask me to use JavaScript or try to convince me that it is good, thanks.

I didn’t. I asked why in this particular situation your approach makes sense. You explained your motives, I recognized them – but I have another opinion. And there are other people here who might want to read about different opinions.

The simple fact is that the W3C specified a JavaScript API for the DOM, which makes this language the natural choice for HTML parsing. But you can use whatever language you want, just as I may point out that it’s not the best choice in this case. This is a technical discussion, not an ad hominem attack.

Why do Facebook, Instagram or Dropbox use Python if your argument is always a fact in every case? And why do users or developers use Python as a back end for front-end web technology? Why do users or developers use Django or Flask?

So maybe your technical description of what is best is also your weakness. I have never told anyone what is best to use. The Code Exchange topic is about sharing code that solves some task or could be useful to someone. It’s not about the method or why a user does it that way. If you have a technical question about how to use it, you are welcome.

So it is an attack against me, and I’m tired of having this discussion with you. I try to keep this forum alive by sharing my knowledge and helping others.

Goodness. I was talking about HTML parsing. Only. I didn’t say that Python is a bad language nor that you shouldn’t use it. Python is fine, as are Ruby, Perl, C, Pascal, ALGOL; even Fortran has its uses. But not every language is good at every task. Which might be the reason why there’s not only Python out there. Or, god forbid, AppleScript.

Python simply is not the best tool to parse HTML in a scripting context on macOS. Nor did you make any argument that it is, except that it is the language you prefer. On another platform, outside of the browser – different story.

Please don’t criticize me for things I haven’t said or done. If you have difficulties understanding, use DeepL.

When I asked you for your solution, you sent me a link that uses a bridge to Objective-C. And that is no different from using a pure AppleScriptObjC method, which is faster and gives better error handling. Still you believe JavaScript is a great alternative to AppleScript.

I have never said Python is the best solution (if the definition of best is only performance). I would argue that Python is far easier to learn than other languages, not only because of the syntax and all the examples but also because of the very good documentation.

And when people have difficulty solving an error because nothing is returned, the only solution is to ask someone else for help. And if that has not been clear to them, it’s a nightmare to maintain.

@Fredrik71, @chrillek

Hey you two.

By all means discuss and disagree. That’s partly what Code Exchange is for. But please keep querulousness, nastiness, and personal insults out of your public posts.

That’s the code contained in the post I linked to:

(() => {
  const app = Application('DEVONthink 3');
  const rec = app.getRecordWithUuid('2F437490-F4EC-4E38-8747-8F4FCC073F86');
  const thinkWindow = app.openWindowFor({record:rec});
  const result = app.doJavaScript(`var headings = document.querySelectorAll('h1,h2,h3');JSON.stringify([...headings].map(h => h.innerText));`, {in: thinkWindow});
  console.log(result);
})()

Pure JavaScript/JXA. With minor modifications, it can be used with Safari or Chrome instead of DEVONthink. The code returns all headings of level 1, 2, and 3 from the HTML document.

The ASObjC stuff was actually in another post in the same thread.
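
For comparison with the JXA snippet above, the same heading extraction can be sketched in Python with BeautifulSoup's `select()`. The HTML string is made up for the example, and the sketch assumes a bs4 version that bundles the soupsieve selector engine; it works on a string rather than on a page loaded in a browser.

```python
from bs4 import BeautifulSoup

# Made-up sample document for illustration.
html = "<h1>Title</h1><p>text</p><h2>Section</h2><h4>Skipped</h4><h3>Sub</h3>"

soup = BeautifulSoup(html, "html.parser")

# Same selector list as the querySelectorAll call: h1, h2 and h3 only.
headings = [h.get_text(strip=True) for h in soup.select("h1, h2, h3")]
print(headings)
```

One difference worth keeping in mind: this parses the HTML as delivered, so content that a page builds with JavaScript after loading would not be seen, whereas the browser-based approach sees the rendered DOM.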

@chrillek
Thanks.

For any scriptable application, if it is not already open it has to launch and load the website before it can extract information. That process is not fast on a slow machine, so the approach comes with a delay. An XMLRPC call to an event-driven server has no such delay: it only needs to read the URL into memory and parse it as HTML or XML. The lxml module uses bindings to libxml2, which is written in C, and several websites claim it is a fast library. My code is not slow on my machine, and I use Apple Silicon. Yes, it is slower than AppleScriptObjC, but the difference is not an issue. I do believe it is easier for a less experienced person to adapt to an AppleScript + Python solution than to learn other solutions that have very poor error handling. Most of the time I have found solutions for things I did not know, and that has served me well.

I have already shared a way to use XPath in AppleScriptObjC, and others have too. That’s why I thought it would be interesting to share how it could be done in Python. It doesn’t have to be more than that, and it shouldn’t require describing in detail why the maintainer of the code uses that approach.