First, the robots.txt file on the Wunderground site must be inspected, in addition to the Terms of Service of the site, to see if scraping information is permitted:
    # Import modules
    import requests

    # Check Wunderground ToS - pass

    # HTTP request Wunderground robots.txt file and check permissions
    r = requests.get('https://www.wunderground.com/robots.txt')
    print(r.text)
As the robots.txt file returns a large list, the results have not been printed in this post. Upon inspection it can be seen that there are no restrictions placed on scraping most weather stations. The Terms of Service of the site also reveal no restrictions on scraping data for personal use. Nonetheless, if you intend to scrape data from Wunderground, it is your responsibility to check the robots.txt and Terms of Service of the site independently, and take responsibility for any data you may decide to scrape.
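Rather than reading a long robots.txt file by eye, the rules can also be checked programmatically. The following is a minimal sketch using ‘urllib.robotparser’ from the Python standard library. The rules in ‘SAMPLE_ROBOTS_TXT’, the example.com URLs, and the ‘is_allowed’ helper are hypothetical and are not taken from Wunderground's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only (NOT Wunderground's real robots.txt)
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/hourly/nz/rotorua"))  # True
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/page"))       # False
```

With a real site, the text returned by the request (‘r.text’ above) could be passed in as ‘robots_txt’. Note that a check like this only covers robots.txt; the Terms of Service still need to be read separately.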
Making HTTP Requests
Usually, a simple HTTP GET request is enough to retrieve content from a webpage. Python has many modules for making such requests, such as ‘requests’, as can be seen above when retrieving the robots.txt information. However, these libraries return only the static HTML source sent by the server and cannot execute JavaScript (JS), so any content that JS adds to the Document Object Model (DOM) after the page loads is missing. This means that often not all content is returned when making the HTTP GET request.
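To make the limitation concrete, here is a small sketch using only the standard library. The page source in ‘STATIC_HTML’ is a hypothetical stand-in for what a plain GET request might return: the table body is empty, and the script that would fill it only runs inside a browser.

```python
from html.parser import HTMLParser

# Hypothetical page source, as a plain HTTP GET would return it: the table body
# is empty, and rows are only added once a browser runs the script.
STATIC_HTML = """
<html><body>
  <table id="demo-table"><tbody></tbody></table>
  <script>
    var row = document.querySelector('#demo-table tbody').insertRow();
    row.insertCell().textContent = '66 °F';
  </script>
</body></html>
"""


class CellCounter(HTMLParser):
    """Count the <td> tags that are present in the markup itself."""

    def __init__(self):
        super().__init__()
        self.cells = 0

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.cells += 1


counter = CellCounter()
counter.feed(STATIC_HTML)
print(counter.cells)  # 0: the script was never executed, so no cells exist
```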
Retrieving JS Rendered Content
In order to return the rendered JS content in Python, a web browser must be used, because it is the web browser itself that renders JS. Therefore, by using a headless browser (headless meaning there is no GUI) in combination with Selenium, a tool that automates web browsing tasks, it is possible to return rendered JS content. First, a headless browser must be set up. To use a headless Firefox browser, the Firefox driver (geckodriver) must be downloaded from GitHub. Once installed, Selenium must be pointed to the directory where geckodriver resides (I recommend placing it in ‘/usr/local/bin’ if using Linux/Ubuntu).
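The setup could look something like the sketch below. It assumes Selenium 4 (older releases passed the driver path and options to ‘webdriver.Firefox’ differently) and that the geckodriver executable sits at ‘/usr/local/bin/geckodriver’. The helper name ‘start_headless_firefox’ is hypothetical.

```python
def start_headless_firefox(driver_path="/usr/local/bin/geckodriver"):
    """Start Firefox without a GUI and return the Selenium driver object."""
    # Imported here so the helper can be defined even before Selenium is installed
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.firefox.service import Service

    options = Options()
    options.add_argument("-headless")  # run Firefox with no visible window

    # Point Selenium at the geckodriver executable (assumed location)
    service = Service(executable_path=driver_path)
    return webdriver.Firefox(service=service, options=options)
```

Calling ‘driver = start_headless_firefox()’ would then give a ‘driver’ object that supports ‘driver.get(url)’ to load a page and ‘driver.quit()’ to close the browser.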
Now that the headless driver has been initiated, Selenium can be used to automate any web browsing tasks. For this example, we will just make a very simple GET request to load a single URL, and then quit the headless browser.
As only weather information for the current day is to be collected, the Wunderground site needs to be inspected to see which page contains the current date's weather information for any given location. It appears that this information can be found on the ‘hourly’ page for any given weather station. Moreover, the URL for any given date in any city appears to follow the convenient pattern ‘https://www.wunderground.com/hourly/insert_country_abbreviation/insert_city/date/insert_date’. Therefore, to collect the current date's weather in Rotorua, NZ, for example, the URL ‘https://www.wunderground.com/hourly/nz/rotorua/date/insert_todays_date’ should be used. To automate the task of inserting the current date, the Python ‘datetime’ module can be used to convert the current date into a string. This string can then be appended to the URL passed to the headless browser GET request (using Selenium):
    # Import modules
    from datetime import datetime

    # Set today's date as a string
    current_date = datetime.today().date().isoformat()

    # Method 2 (more verbose, same result)
    current_date = datetime.today().date().strftime('%Y-%m-%d')

    # Load relevant webpage containing info on current weather
    driver.get('https://www.wunderground.com/hourly/nz/rotorua/date/' + current_date)
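If more than one location is of interest, the URL pattern described above can be wrapped in a small function. This is only a sketch: the ‘build_hourly_url’ helper and the ‘BASE_URL’ constant are hypothetical names, and the pattern is the one observed above rather than a documented interface.

```python
from datetime import date

BASE_URL = "https://www.wunderground.com/hourly"


def build_hourly_url(country, city, day=None):
    """Build the hourly-forecast URL for a location, defaulting to today's date."""
    if day is None:
        day = date.today()
    return f"{BASE_URL}/{country}/{city}/date/{day.isoformat()}"


print(build_hourly_url("nz", "rotorua", date(2019, 1, 5)))
# https://www.wunderground.com/hourly/nz/rotorua/date/2019-01-05
```

The result can be passed straight to ‘driver.get’ in place of the hand-built string.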
Parsing Data Using Beautiful Soup
NB: as this post is focused upon retrieving rendered JS content, the following section is light on technical details (in order to keep the post brief). Nonetheless, the inline code comments should provide ample information to understand and run the code.
Now that Selenium has been used to load the desired web page, the Beautiful Soup Python module (bs4) can be used to parse the rendered page source into a tree of HTML tags:
    # Import modules
    from bs4 import BeautifulSoup
    import lxml  # not called directly; confirms the 'lxml' parser is installed

    # Retrieve page source
    soup = BeautifulSoup(driver.page_source, 'lxml')

    # Quit driver
    driver.quit()
From here, it is a simple case of extracting the relevant nodes from the parse tree. To find the correct nodes in Firefox, load the desired page, hover over the desired web page element, and then right-click and select ‘Inspect Element (Q)’. Alternatively, you can press ‘Ctrl’ + ‘Shift’ + ‘C’ on Linux/Ubuntu. In the case of the daily Wunderground weather information, it can be seen that it resides in a JS-rendered table with the id ‘hourly-forecast-table’, consisting of a header and a body. First, extract all the column names from the table:
    # Extract table header names
    # Method 1 | List comprehension - preferred method Python 3
    titles = [element.text for element in soup.find(id='hourly-forecast-table').thead.find_all(class_='tablesaw-sortable-btn')]

    # Method 2 | for loop on list
    # for element in soup.find(id='hourly-forecast-table').thead.find_all(class_='tablesaw-sortable-btn'):
    #     print(element.text)

    # Method 3 | for loop on generator
    # for string in soup.find(id='hourly-forecast-table').thead.stripped_strings:
    #     print(string)
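One thing to be aware of is that ‘soup.find’ returns ‘None’ when nothing matches, so the chained ‘.thead’ above raises an AttributeError if the table was not rendered. A slightly more defensive version might look like the sketch below. The ‘extract_titles’ helper and the simplified ‘SAMPLE_HTML’ are hypothetical, and the built-in ‘html.parser’ is used instead of ‘lxml’ only so that the example runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for the rendered page source
SAMPLE_HTML = """
<table id="hourly-forecast-table">
  <thead><tr>
    <th><button class="tablesaw-sortable-btn">Time</button></th>
    <th><button class="tablesaw-sortable-btn">Temp.</button></th>
  </tr></thead>
  <tbody></tbody>
</table>
"""


def extract_titles(soup, table_id="hourly-forecast-table"):
    """Return the column names of the table, or [] if the table is missing."""
    table = soup.find(id=table_id)
    if table is None or table.thead is None:
        return []
    buttons = table.thead.find_all(class_="tablesaw-sortable-btn")
    return [button.get_text(strip=True) for button in buttons]


print(extract_titles(BeautifulSoup(SAMPLE_HTML, "html.parser")))
# ['Time', 'Temp.']
print(extract_titles(BeautifulSoup("<p>No table here</p>", "html.parser")))
# []
```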
Next, extract the values contained within each row of the table:
    # Extract table values
    # Note: Table rows are denoted by <tr> tags

    # First find all rows with table values and append to list
    tr_list = [element for element in soup.find(id='hourly-forecast-table').tbody.find_all('tr')]

    # Then, for each item in list, find each data point, which can also be indexed as a list (creating a list of lists)
    tr_row = [tr.find_all('td') for tr in tr_list]

    # Note: tr_list and tr_row have the same length (one item per table row), but
    # their items differ. Each item in tr_list is a single <tr> tag, whereas each
    # item in tr_row is the list of <td> tags inside that row

    # Extract text elements for each row of data and clean up
    for i in range(len(tr_row)):
        tr_row[i] = [element.text for element in tr_row[i]]
        tr_row[i] = [value.replace('\n', '') for value in tr_row[i]]
        tr_row[i] = [value.strip() for value in tr_row[i]]
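As an aside, the clean-up step can be tried out without a browser. The sketch below uses a slightly different technique from the code above: instead of deleting newlines, it collapses every run of whitespace into a single space, so text that was split across lines stays separated. The ‘clean_cell’ helper and the raw strings are hypothetical examples, not values taken from the site.

```python
def clean_cell(text):
    """Collapse all whitespace runs (including newlines) into single spaces."""
    return " ".join(text.split())


# Hypothetical raw cell strings, similar in shape to what element.text may return
raw_row = ["\n 10:00 am \n", "\n66\n°F\n", "  9 mph   SW "]
print([clean_cell(value) for value in raw_row])
# ['10:00 am', '66 °F', '9 mph SW']
```

With the original approach of replacing the newline with an empty string, the second value would come out as ‘66°F’ rather than ‘66 °F’.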
Finally, join the column names and table values in a Pandas DataFrame, and print the object for inspection:
    # Import modules
    import pandas as pd

    # Create DataFrame
    weather_table = pd.DataFrame(tr_row, columns=titles)

    # Print DataFrame (temporarily lift the row and column display limits)
    with pd.option_context('display.max_rows', None, 'display.max_columns', None):
        print(weather_table)
    ##         Time           Conditions  Temp.  Feels Like  Precip  Amount  Cloud Cover  \
    ##  0  10:00 am  Mostly SunnyM Sunny  66 °F       66 °F      0%    0 in          27%
    ##  1  11:00 am  Mostly SunnyM Sunny  69 °F       69 °F      0%    0 in          21%
    ##  2  12:00 pm           SunnySunny  72 °F       72 °F      0%    0 in          18%
    ##  3   1:00 pm           SunnySunny  74 °F       74 °F      0%    0 in          12%
    ##  4   2:00 pm           SunnySunny  76 °F       76 °F      0%    0 in           8%
    ##  5   3:00 pm           SunnySunny  77 °F       77 °F      0%    0 in           9%
    ##  6   4:00 pm           SunnySunny  78 °F       78 °F      0%    0 in           8%
    ##  7   5:00 pm           SunnySunny  77 °F       77 °F      0%    0 in           7%
    ##  8   6:00 pm           SunnySunny  76 °F       76 °F      0%    0 in           4%
    ##  9   7:00 pm           SunnySunny  73 °F       73 °F      0%    0 in           2%
    ## 10   8:00 pm           SunnySunny  70 °F       70 °F      2%    0 in           2%
    ## 11   9:00 pm           ClearClear  66 °F       66 °F      4%    0 in           6%
    ## 12  10:00 pm           ClearClear  63 °F       63 °F      6%    0 in           5%
    ## 13  11:00 pm           ClearClear  61 °F       60 °F      7%    0 in           3%
    ##
    ##     Dew Point  Humidity        Wind  Pressure
    ##  0      51 °F       59%    9 mph SW  29.92 in
    ##  1      52 °F       54%    9 mph SW  29.92 in
    ##  2      52 °F       49%    8 mph SW  29.92 in
    ##  3      52 °F       46%    7 mph SW  29.91 in
    ##  4      52 °F       44%    7 mph SW  29.91 in
    ##  5      53 °F       43%   7 mph WSW  29.90 in
    ##  6      54 °F       44%   8 mph WSW  29.89 in
    ##  7      55 °F       46%   9 mph WSW  29.88 in
    ##  8      56 °F       49%   9 mph WSW  29.88 in
    ##  9      57 °F       55%  10 mph WSW  29.88 in
    ## 10      57 °F       63%   9 mph WSW  29.89 in
    ## 11      56 °F       71%   8 mph WSW  29.92 in
    ## 12      56 °F       77%   8 mph WSW  29.94 in
    ## 13      55 °F       82%   8 mph WSW  29.95 in
That is it! Parsing JS content is as simple as using Selenium in combination with a headless browser and a powerful parser such as Beautiful Soup. In the next post, I will show you how to create a bash alias to run the script simply from the command line.