Web scraping is a technique to automatically access and extract large amounts of information from a website, which can save a huge amount of time and effort. In this article, we will go through an easy example of how to automate downloading hundreds of files from the New York MTA. This is a great exercise for beginners who want to understand how web scraping works. Web scraping can be slightly intimidating, so this tutorial will break the process down step by step.
We will be downloading turnstile data from this site: http://web.mta.info/developers/turnstile.html
Turnstile data is compiled every week from May 2010 to the present, so hundreds of .txt files exist on the site. It would be torturous to manually right-click on each link and save it to your desktop.
Important notes about web scraping:
Read through the website’s Terms and Conditions to understand how you can legally use the data. Most sites prohibit you from using the data for commercial purposes. Make sure you are not downloading data at too rapid a rate, because this may overload the website and may also get you blocked from the site.
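One simple way to keep the download rate low is to pause between requests. The sketch below is only an illustration: the helper name `throttled` and the one-second default are assumptions, not limits published by any particular site.

```python
import time
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def throttled(items: Iterable[T], delay_seconds: float = 1.0) -> Iterator[T]:
    """Yield items one at a time, sleeping between consecutive items.

    `throttled` is a hypothetical helper and the default delay is an
    assumption; pick a delay that suits the site you are scraping.
    """
    for index, item in enumerate(items):
        if index > 0:
            time.sleep(delay_seconds)  # pause before every item except the first
        yield item


# Example: each page would be requested at least `delay_seconds` apart.
for page in throttled(["page-1", "page-2", "page-3"], delay_seconds=0.1):
    print("would request", page)
```

Wrapping the list of links in a helper like this keeps the pause in one place, so it is hard to forget when the download code grows.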
Inspecting the Website
The first thing that we need to do is to figure out where we can locate the links to the files we want to download inside the multiple levels of HTML tags. Simply put, there is a lot of code on a web page and we want to find the relevant pieces of code that contain our data. If you are not familiar with HTML tags, refer to W3Schools Tutorials. It is important to understand the basics of HTML in order to successfully web scrape. On the website, right-click and click on “Inspect”. This allows you to see the raw code behind the site. Once you’ve clicked on “Inspect”, you should see a console pop up.
If you click on the arrow icon in the console and then click on an area of the site itself, the code for that particular item will be highlighted in the console. I’ve clicked on the very first data file, Saturday, September 22, 2018, and the console has highlighted in blue the link to that particular file.
<a href="data/nyct/turnstile/turnstile_180922.txt">Saturday, September 22, 2018</a>
Notice that all the .txt files are inside <a> tags on the lines following the one above. As you do more web scraping, you will find that the <a> tag is used for hyperlinks.
Now that we’ve identified the location of the links, let’s get started on coding!
We start by importing the following libraries.
import requests
import urllib.request
from bs4 import BeautifulSoup
Next, we set the url to the website and access the site with our requests library.
url = 'http://web.mta.info/developers/turnstile.html'
response = requests.get(url)
Next, we parse the HTML with BeautifulSoup so that we can work with a nicer, nested BeautifulSoup data structure. If you are interested in learning more about this library, check out the BeautifulSoup documentation.
soup = BeautifulSoup(response.text, "html.parser")
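To get a feel for the parsed structure without touching the network, the sketch below runs BeautifulSoup on the single anchor tag we found while inspecting the page. The variable names are chosen only for illustration.

```python
from bs4 import BeautifulSoup

# The anchor tag found while inspecting the page, used here as a tiny test document.
sample_html = (
    '<a href="data/nyct/turnstile/turnstile_180922.txt">'
    "Saturday, September 22, 2018</a>"
)

sample_soup = BeautifulSoup(sample_html, "html.parser")
first_a_tag = sample_soup.find("a")   # first <a> tag in the document

sample_href = first_a_tag["href"]     # attributes are read like dictionary keys
sample_text = first_a_tag.get_text()  # the visible link text

print(sample_href)
print(sample_text)
```

The same attribute lookup works on every tag in the full page, which is what makes it possible to pull the file paths out of the links.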
We use the method .findAll (.find_all is the newer spelling of the same method) to locate all of our <a> tags.
soup.findAll('a')
This code gives us every <a> tag on the page. The information that we are interested in starts on line 38 of that output. That is, the very first text file is located in line 38, and the rest of the text files follow it. Let’s test with that first link and pull out its href attribute.
one_a_tag = soup.findAll('a')[38]
link = one_a_tag['href']
This code saves the path of the first text file, 'data/nyct/turnstile/turnstile_180922.txt', to our variable link. The full url to download the data is actually 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_180922.txt', which I discovered by clicking on the first data file on the website as a test. We can use our urllib.request library to download this url to our computer. We provide urllib.request.urlretrieve with two parameters: the file url and the filename to save it under. For my files, I named them 'turnstile_180922.txt', 'turnstile_180901.txt', etc.
download_url = 'http://web.mta.info/developers/' + link
urllib.request.urlretrieve(download_url, 'turnstile_180922.txt')
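Instead of concatenating strings by hand, the full url and the local filename can also be derived with the standard library. This is a sketch of an alternative, not the approach used above, and `build_download_target` is a hypothetical helper name.

```python
import posixpath
from typing import Tuple
from urllib.parse import urljoin, urlparse

PAGE_URL = "http://web.mta.info/developers/turnstile.html"


def build_download_target(page_url: str, href: str) -> Tuple[str, str]:
    """Return (absolute url, local filename) for a relative link on a page.

    Hypothetical helper for illustration only.
    """
    # urljoin resolves the relative href against the page it appeared on.
    full_url = urljoin(page_url, href)
    # The last segment of the url path serves as the filename.
    filename = posixpath.basename(urlparse(full_url).path)
    return full_url, filename


target_url, target_name = build_download_target(
    PAGE_URL, "data/nyct/turnstile/turnstile_180922.txt"
)
print(target_url)
print(target_name)
```

The resulting pair could then be handed to urllib.request.urlretrieve as its url and filename arguments.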
Now that we understand how to download a file, let’s try downloading the entire set of data files with a for loop.
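A minimal sketch of such a loop is shown below. It assumes the href values have already been collected from the <a> tags, keeps only those ending in .txt, and pauses between downloads so the site is not flooded with requests. The function name `download_all`, its parameters, and the one-second delay are assumptions made for illustration. The `retrieve` argument defaults to urllib.request.urlretrieve and exists only so the loop can be tried out without network access.

```python
import os
import posixpath
import time
import urllib.request
from typing import Callable, Iterable, List
from urllib.parse import urljoin, urlparse

BASE_URL = "http://web.mta.info/developers/"


def download_all(
    hrefs: Iterable[str],
    dest_dir: str = ".",
    base_url: str = BASE_URL,
    delay_seconds: float = 1.0,  # assumed pause; adjust to the site's tolerance
    retrieve: Callable[[str, str], object] = urllib.request.urlretrieve,
) -> List[str]:
    """Download every .txt link in `hrefs` and return the local file paths.

    Hypothetical helper for illustration only.
    """
    saved_paths = []
    for href in hrefs:
        if not href.endswith(".txt"):
            continue  # skip navigation links and other non-data anchors
        download_url = urljoin(base_url, href)
        filename = posixpath.basename(urlparse(download_url).path)
        local_path = os.path.join(dest_dir, filename)
        retrieve(download_url, local_path)
        saved_paths.append(local_path)
        time.sleep(delay_seconds)  # pause so we do not hit the site too quickly
    return saved_paths
```

With the soup object from earlier, the href values could be gathered with something like [a['href'] for a in soup.findAll('a', href=True)] and passed straight to download_all. Filtering on the .txt ending avoids depending on the exact position of the first data link, which may shift if the page layout changes.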
Thanks for reading and happy web scraping everyone!