Rendering JavaScript content using Python, Selenium, and a Headless Browser
Jan 21, 2019
Mike Page

This post demonstrates how to scrape web data that is rendered by JavaScript (JS), using Python and several other tools. To achieve this, we will build a scraper that collects the current date's weather forecast for a given location (rendered in JS) from Wunderground.

First, the robots.txt file and the Terms of Service of the Wunderground site must be inspected to see whether scraping information is permitted:

# Import modules
import requests
# Check Wunderground ToS - pass
# HTTP request Wunderground robots.txt file and check permissions
r = requests.get('https://www.wunderground.com/robots.txt')
print(r.text)

As the robots.txt file returns a large list, the results have not been printed in this post. Upon inspection it can be seen that there are no restrictions placed on scraping most weather stations. The Terms of Service of the site also reveal no restrictions on scraping data for personal use. Nonetheless, if you intend to scrape data from Wunderground, it is your responsibility to check the robots.txt and Terms of Service of the site independently, and take responsibility for any data you may decide to scrape.
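
Rather than reading the robots.txt file by eye, the check can also be made programmatically. The following is a minimal sketch using the standard library's `urllib.robotparser`; the helper name `is_allowed` and the sample rules are assumptions made for illustration and are not Wunderground's actual rules. In practice, `r.text` from the request above could be passed in as `robots_txt`.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file as a list of lines, so no second request is made
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only, not Wunderground's real file
sample_rules = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_rules, "https://example.com/hourly/nz/rotorua"))  # True
print(is_allowed(sample_rules, "https://example.com/private/data"))       # False
```

This only covers robots.txt; the Terms of Service still need to be read manually.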

Making HTTP Requests

Usually, a simple HTTP GET request is enough to retrieve content from a webpage. Python has many modules for making such requests, such as ‘requests’, as can be seen above when retrieving the robots.txt information. However, these libraries only return the HTML exactly as the server sent it (the static page) and do not execute JS. This means that any content the page builds dynamically in the browser is missing from the response to the HTTP GET request.
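
To make the limitation concrete, here is a small sketch that parses a made-up page with the standard library's `html.parser`. The markup in `RAW_HTML` and the class name `TagCollector` are hypothetical. The script in the page would add a table when run in a browser, but a static parser never executes it, so no table tag is found.

```python
from html.parser import HTMLParser

# Hypothetical page: the table only exists after the script has run in a browser
RAW_HTML = """
<html><body>
<div id="app"></div>
<script>
  var table = document.createElement('table');
  document.getElementById('app').appendChild(table);
</script>
</body></html>
"""


class TagCollector(HTMLParser):
    """Record the name of every start tag found in the raw markup."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(RAW_HTML)
collector.close()

print(collector.tags)             # ['html', 'body', 'div', 'script']
print("table" in collector.tags)  # False
```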

Retrieving JS Rendered Content

In order to return the rendered JS content in Python, a web browser must be used. This is because it is the web browser itself that executes JS and renders the result. Therefore, by combining a headless browser (headless meaning there is no GUI) with Selenium, a tool that automates web browsing tasks, it is possible to return rendered JS content. First, a headless browser must be set up. To use a headless Firefox browser, the Firefox gecko driver (geckodriver) must be downloaded from GitHub. Once installed, Selenium must be pointed to the location of the gecko driver executable (I recommend placing it in ‘/usr/local/bin’ if using Linux/Ubuntu):

# Import modules
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.wait import WebDriverWait
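# The requests library cannot return rendered JavaScript content, only the
# HTML sent by the server (static web page). Use Selenium and a web driver to
# automate a web browser and return rendered JavaScript content (dynamic web page).
# Note: the calls below use the Selenium 3 API. Recent Selenium 4 releases replace
# 'options.headless' with options.add_argument('-headless') and the
# 'executable_path' argument with a Service object.
# Initiate headless Firefox driver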
options = Options()
options.headless = True
driver = webdriver.Firefox(executable_path = '/usr/local/bin/geckodriver', options = options)
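
If the gecko driver may live in different places on different machines, the path can be looked up instead of hard-coded. This is a minimal sketch using the standard library's `shutil.which`; the helper name `locate_driver` is hypothetical, and the fallback path is simply the location recommended above.

```python
import shutil


def locate_driver(name="geckodriver", fallback="/usr/local/bin/geckodriver"):
    """Return the driver path found on PATH, else the fallback path."""
    found = shutil.which(name)
    return found if found is not None else fallback


# The result can be passed to Selenium in place of the hard-coded string
print(locate_driver())
```

The helper does not check that the fallback file exists, so Selenium will still raise an error if the driver is missing from both places.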

Now that the headless driver has been initiated, Selenium can be used to automate any web browsing tasks. For this example, we will just make a very simple GET request to load a single URL, and then quit the headless browser.
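
Because JS content is rendered after the initial page load, the page source can be read before the content of interest exists. One common safeguard is an explicit wait, which is presumably why ‘WebDriverWait’ is imported above. The sketch below is an assumption about how it could be applied here: the helper name `wait_for_element` and the 10 second timeout are hypothetical, and the default id is the table id used later in this post.

```python
def wait_for_element(driver, element_id="hourly-forecast-table", timeout=10):
    """Block until the element with element_id is in the DOM, then return it.

    Raises selenium.common.exceptions.TimeoutException if the element has not
    appeared after `timeout` seconds.
    """
    # Imported inside the function so the helper can be defined without
    # importing Selenium at module load time
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.wait import WebDriverWait

    locator = (By.ID, element_id)
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located(locator)
    )
```

It would be called as `wait_for_element(driver)` after the page has been requested and before the page source is read.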

As only weather information for the current day is to be collected, the Wunderground site needs to be inspected to see which pages contain the current date's weather information for any given location. It appears that this information can be found on the ‘hourly’ page for any given weather station. Moreover, the URL for any given date in any city appears to follow the convenient pattern: ‘https://www.wunderground.com/hourly/insert_country_abbreviation/insert_city/date/insert_date’. Therefore, to collect the current date's weather in Rotorua, NZ, for example, the URL ‘https://www.wunderground.com/hourly/nz/rotorua/date/insert_todays_date’ should be used. To automate the task of inserting the current date, the Python ‘datetime’ module can be used to convert the current date into a string. This string can then be appended to the URL of the headless browser GET request (using Selenium):

# Import modules
from datetime import datetime
# Set today's date as a string (ISO format, YYYY-MM-DD)
current_date = datetime.today().date().isoformat()
# Method 2 (more verbose):
# current_date = datetime.today().date().strftime('%Y-%m-%d')
# Load relevant webpage containing info on current weather
driver.get('https://www.wunderground.com/hourly/nz/rotorua/date/' + current_date)
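
The URL pattern described above can be wrapped in a small function so that other locations or dates can be requested. This is a sketch only: the function name `build_hourly_url` is hypothetical, and it assumes the pattern observed for Rotorua holds for other country and city values, which has not been verified here.

```python
from datetime import date

BASE_URL = "https://www.wunderground.com/hourly"


def build_hourly_url(country, city, day=None):
    """Build the hourly forecast URL for a location; defaults to today's date."""
    day = day or date.today()
    return f"{BASE_URL}/{country.lower()}/{city.lower()}/date/{day.isoformat()}"


print(build_hourly_url("nz", "rotorua", date(2019, 1, 21)))
# https://www.wunderground.com/hourly/nz/rotorua/date/2019-01-21
```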

Parsing Data Using Beautiful Soup

NB: as this post is focused upon retrieving rendered JS content, the following section is light on technical details (in order to keep the post brief). Nonetheless, the inline code comments should provide ample information to understand and run the code.

Now that Selenium has been used to load the desired web page, the Beautiful Soup Python module (bs4) can be used to parse the rendered page source into a parse tree of HTML tags:

# Import modules
from bs4 import BeautifulSoup
import lxml
# Retrieve page source
soup = BeautifulSoup(driver.page_source, 'lxml')
# Quit driver
driver.quit()

From here, it is a simple case of extracting the relevant nodes from the parse tree. To find the correct nodes in Firefox, load the desired page, hover over the desired web page element, and then right-click and click ‘Inspect Element (Q)’. Alternatively, you can press ‘Ctrl’ + ‘Shift’ + ‘c’ on Linux/Ubuntu. In the case of the daily Wunderground weather information, it can be seen that it resides in a JS rendered table with the id ‘hourly-forecast-table’, composed of a header and body. First, extract all the column names from the table:

# Extract table header names
# Method 1 | List comprehension - preferred method Python 3
titles = [element.text for element in soup.find(id = 'hourly-forecast-table').thead.find_all(class_ = 'tablesaw-sortable-btn')]
# Method 2 | for loop on list
# for element in soup.find(id = 'hourly-forecast-table').thead.find_all(class_ = 'tablesaw-sortable-btn'):
#     print(element.text)
# Method 3 | for loop on generator
# for string in soup.find(id = 'hourly-forecast-table').thead.stripped_strings:
#     print(string)

Next, extract the values contained within each row of the table:

# Extract table values
# Note: Table rows are denoted by <tr> tags
# First find all rows with table values and append to list
tr_list = [element for element in soup.find(id = 'hourly-forecast-table').tbody.find_all('tr')]
# Then, for each item in list, find each data point which can also be indexed as list (creating a list of lists)
tr_row = [tr.find_all('td') for tr in tr_list]
# Note tr_list != tr_row, i.e., len(tr_list[0]) != len(tr_row[0]).
# Each item in tr_list is a single <tr> tag, whose length counts all of its child nodes,
# whereas each item in tr_row is a list containing only that row's <td> tags
# Extract text elements for each row of data and clean up
for i in range(len(tr_row)):
    tr_row[i] = [element.text for element in tr_row[i]]
    tr_row[i] = [value.replace('\n', '') for value in tr_row[i]]
    tr_row[i] = [value.strip() for value in tr_row[i]]
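
The header and row extraction can also be combined into one reusable function and tried out on a small piece of markup without launching a browser. The sketch below makes several assumptions: the function name `extract_table` and the markup in `SAMPLE_HTML` are hypothetical and much simpler than the real page, header cells are located by their `th` tag rather than by class, and the built-in ‘html.parser’ is used instead of lxml. Passing a separator to `get_text` keeps text from nested tags apart, which `.text` would run together.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the rendered page source
SAMPLE_HTML = """
<table id="hourly-forecast-table">
  <thead><tr><th>Time</th><th>Conditions</th><th>Temp.</th></tr></thead>
  <tbody>
    <tr><td>10:00 am</td><td><span>Mostly Sunny</span><span>M Sunny</span></td><td>66 °F</td></tr>
    <tr><td>11:00 am</td><td><span>Sunny</span><span>Sunny</span></td><td>69 °F</td></tr>
  </tbody>
</table>
"""


def extract_table(html, table_id):
    """Return (column titles, list of row values) for the table with table_id."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(id=table_id)
    if table is None:
        raise ValueError(f"no element with id {table_id!r}")
    # get_text(" ", strip=True) strips whitespace and joins nested strings with a space
    titles = [th.get_text(" ", strip=True) for th in table.thead.find_all("th")]
    rows = [
        [td.get_text(" ", strip=True) for td in tr.find_all("td")]
        for tr in table.tbody.find_all("tr")
    ]
    return titles, rows


titles_demo, rows_demo = extract_table(SAMPLE_HTML, "hourly-forecast-table")
print(titles_demo)   # ['Time', 'Conditions', 'Temp.']
print(rows_demo[0])  # ['10:00 am', 'Mostly Sunny M Sunny', '66 °F']
```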

Finally, join the column names and table values in a Pandas DataFrame, and print the object for inspection:

# Import modules
import pandas as pd
# Create DataFrame
weather_table = pd.DataFrame(tr_row, columns = titles)
# Print DataFrame
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(weather_table)
##         Time           Conditions  Temp. Feels Like Precip Amount Cloud Cover  \
## 0   10:00 am  Mostly SunnyM Sunny  66 °F      66 °F     0%   0 in         27%   
## 1   11:00 am  Mostly SunnyM Sunny  69 °F      69 °F     0%   0 in         21%   
## 2   12:00 pm           SunnySunny  72 °F      72 °F     0%   0 in         18%   
## 3    1:00 pm           SunnySunny  74 °F      74 °F     0%   0 in         12%   
## 4    2:00 pm           SunnySunny  76 °F      76 °F     0%   0 in          8%   
## 5    3:00 pm           SunnySunny  77 °F      77 °F     0%   0 in          9%   
## 6    4:00 pm           SunnySunny  78 °F      78 °F     0%   0 in          8%   
## 7    5:00 pm           SunnySunny  77 °F      77 °F     0%   0 in          7%   
## 8    6:00 pm           SunnySunny  76 °F      76 °F     0%   0 in          4%   
## 9    7:00 pm           SunnySunny  73 °F      73 °F     0%   0 in          2%   
## 10   8:00 pm           SunnySunny  70 °F      70 °F     2%   0 in          2%   
## 11   9:00 pm           ClearClear  66 °F      66 °F     4%   0 in          6%   
## 12  10:00 pm           ClearClear  63 °F      63 °F     6%   0 in          5%   
## 13  11:00 pm           ClearClear  61 °F      60 °F     7%   0 in          3%   
## 
##    Dew Point Humidity        Wind  Pressure  
## 0      51 °F      59%    9 mph SW  29.92 in  
## 1      52 °F      54%    9 mph SW  29.92 in  
## 2      52 °F      49%    8 mph SW  29.92 in  
## 3      52 °F      46%    7 mph SW  29.91 in  
## 4      52 °F      44%    7 mph SW  29.91 in  
## 5      53 °F      43%   7 mph WSW  29.90 in  
## 6      54 °F      44%   8 mph WSW  29.89 in  
## 7      55 °F      46%   9 mph WSW  29.88 in  
## 8      56 °F      49%   9 mph WSW  29.88 in  
## 9      57 °F      55%  10 mph WSW  29.88 in  
## 10     57 °F      63%   9 mph WSW  29.89 in  
## 11     56 °F      71%   8 mph WSW  29.92 in  
## 12     56 °F      77%   8 mph WSW  29.94 in  
## 13     55 °F      82%   8 mph WSW  29.95 in
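
All values in the table above are strings that mix a number with a unit, such as ‘66 °F’ or ‘29.92 in’. If numeric values are needed for further analysis, they can be split apart. The following is a minimal sketch using a regular expression; the helper name `split_quantity` is hypothetical, and it assumes every value of interest begins with a plain decimal number.

```python
import re

# Leading number (optionally negative or decimal), then whatever follows as the unit
_QUANTITY = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*(.*?)\s*$")


def split_quantity(text):
    """Split '66 °F' into (66.0, '°F'); return (None, text) if no leading number."""
    match = _QUANTITY.match(text)
    if match is None:
        return None, text
    return float(match.group(1)), match.group(2)


print(split_quantity("66 °F"))     # (66.0, '°F')
print(split_quantity("29.92 in"))  # (29.92, 'in')
print(split_quantity("9 mph SW"))  # (9.0, 'mph SW')
print(split_quantity("Sunny"))     # (None, 'Sunny')
```

The function could then be applied to individual columns of the DataFrame, for example to obtain a numeric temperature column.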

That is it! Parsing JS content is as simple as using Selenium in combination with a headless browser and a powerful web parser such as Beautiful Soup. In the next post, I will show you how to create a bash alias to run the script simply from the command line.