Welcome to part 4 of the web scraping with Beautiful Soup 4 tutorial mini-series. Here, we're going to discuss how to parse data that is dynamically loaded via JavaScript.
Many websites supply data that is dynamically loaded via JavaScript. In Python, you could render data on the server side with something like Jinja templating and avoid JavaScript entirely, but many websites use JavaScript to populate the page in the browser. To simulate this, I have added the following code to the parsememcparseface page:
<p>Javascript (dynamic data) test:</p>
<p class='jstest' id='yesnojs'>y u bad tho?</p>
<script>
   document.getElementById('yesnojs').innerHTML = 'Look at you shinin!';
</script>
The code is a regular paragraph tag with the class jstest, which initially contains the text y u bad tho?. After that, however, some JavaScript updates that jstest paragraph's contents to Look at you shinin!. Thus, if you are reading the JavaScript-updated information, you will see the shinin message. If you don't, then you will be ridiculed.
If you open the page in your web browser, you'll see the shinin message, so let's try the same thing with Beautiful Soup:
import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/')
soup = bs.BeautifulSoup(source, 'lxml')

js_test = soup.find('p', class_='jstest')
print(js_test.text)
y u bad tho?
What?! Beautiful Soup doesn't mimic a client. JavaScript is code that runs on the client, in the browser. With Python, we simply make a request to the server and get the server's response, which is the starting text along with the JavaScript, but it is the browser that reads and runs that JavaScript. Thus, we need to run the JavaScript ourselves. There are many ways to do this. If you're on Mac or Linux, you can set up dryscrape... or we can just do basically what dryscrape does with PyQt4, so everyone can follow along. Thus, get PyQt4. If you need help getting PyQt4, check out the PyQt4 tutorial.
import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage
import bs4 as bs


class Client(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        # Wait for the page (and its JavaScript) to finish loading before quitting the event loop.
        self.loadFinished.connect(self.on_page_load)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def on_page_load(self):
        self.app.quit()


url = 'https://pythonprogramming.net/parsememcparseface/'
client_response = Client(url)
# Grab the rendered HTML, after the JavaScript has run.
source = client_response.mainFrame().toHtml()

soup = bs.BeautifulSoup(source, 'lxml')
js_test = soup.find('p', class_='jstest')
print(js_test.text)
The main take-away here is that, since Qt is asynchronous, we need some handling for when the page has finished loading. If we don't do that, we're not going to get the data we want; we'll just get an empty page. Otherwise, we're simply using PyQt's WebKit to mimic a browser. Having done that, we can see the JavaScript data!
Look at you shinin!
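Note that the QtWebKit module used above was removed from newer versions of Qt, so if you only have PyQt5 available, the same idea works with the QtWebEngine module instead. Here's a minimal sketch, assuming PyQt5 with the PyQtWebEngine package installed; it mirrors the class above, with the caveat that toHtml() is asynchronous in QtWebEngine, so it hands the HTML to a callback:

import sys
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
from PyQt5.QtWebEngineWidgets import QWebEnginePage
import bs4 as bs


class Client(QWebEnginePage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ''
        # Same idea as before: wait for the page (and its JavaScript) to finish loading.
        self.loadFinished.connect(self.on_page_load)
        self.load(QUrl(url))
        self.app.exec_()

    def on_page_load(self, ok):
        # toHtml() is asynchronous in QtWebEngine, so it passes the HTML to a callback.
        self.toHtml(self.on_html_ready)

    def on_html_ready(self, html):
        self.html = html
        self.app.quit()


url = 'https://pythonprogramming.net/parsememcparseface/'
source = Client(url).html

soup = bs.BeautifulSoup(source, 'lxml')
js_test = soup.find('p', class_='jstest')
print(js_test.text)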
Just in case you wanted to make use of dryscrape:
import dryscrape
import bs4 as bs

sess = dryscrape.Session()
sess.visit('https://pythonprogramming.net/parsememcparseface/')
source = sess.body()

soup = bs.BeautifulSoup(source, 'lxml')
js_test = soup.find('p', class_='jstest')
print(js_test.text)
That's all for this series for now. For more tutorials: