Some time ago I wrote a post about web crawling using Google's API (see here). However, that approach has no support for recognizing HTML tags, so finding the key components of a web page becomes tedious.
In this post, I will try to show you how to recognize a web page's key HTML tags, such as title, div, etc., using a Python library named BeautifulSoup. To follow along, you need basic HTML and Python knowledge. For the experiments I will be using the native Python installation on OS X 10.11.5 "El Capitan".
Before continuing, I recommend reading the terms and conditions of the web page you are going to scrape, since some sites don't allow scraping and our scripts won't work on them.
With that suggestion in mind, we need to install two tools to work with in Python:
- requests – Allows us to load a web page from Python.
- beautifulsoup – Allows us to analyze the HTML structure of the loaded web page.
To install them, we can simply use the pip command:
$ pip install requests
$ pip install beautifulsoup4
Once they are installed, the first thing to do is visually inspect the HTML code of the web page we are going to scrape and identify the data we are interested in. As a basic example, let's create a file named scraping.py and add the following code. It extracts and prints two things almost every web page has: the title and the inner URLs (links to other web pages). I have left a comment above each step to explain it.
#!/usr/bin/python
# -*- coding: utf-8 -*-

# Import libraries
from bs4 import BeautifulSoup
import requests

# Ask the user for the URL (type it without the "http://" part, which is added below)
url = raw_input("Type the URL: ")

# Load the web page
r = requests.get("http://" + url)

# Get the response body (the page's HTML code) as text
data = r.text
print ""

# Create a soup object from the HTML text we just loaded.
# 'html.parser' ships with Python; 'lxml' also works, but needs "pip install lxml" first.
soup = BeautifulSoup(data, 'html.parser')

# Extract the page's title (found in the "title" tag of the HTML code) and print it
title = soup.title.text
print "The page's title is: " + title
print ""

# Look for all links in the page ("a" tags), then read each link's URL from its "href" attribute and print it
for link in soup.find_all('a'):
    print(link.get('href'))
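The script above needs a network connection, and it assumes that the page has a title and that every link carries an href. As a minimal sketch of the same extraction that can be tried offline, the snippet below parses a small HTML string instead of a downloaded page. The sample HTML, the function name extract_title_and_links and the example.com base URL are assumptions made up for illustration; they are not part of the original script. It also turns relative links into absolute ones with urljoin, which is an addition of mine rather than something the script does.

```python
from bs4 import BeautifulSoup

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = """
<html>
  <head><title>Example page</title></head>
  <body>
    <a href="/about">About</a>
    <a href="http://example.org/contact">Contact</a>
    <a name="top">Anchor without href</a>
  </body>
</html>
"""


def extract_title_and_links(html, base_url):
    """Return (title, links) for an HTML string; title is None if the page has none."""
    soup = BeautifulSoup(html, "html.parser")
    # soup.title is None when the page has no <title> tag, so check before reading it.
    title = soup.title.get_text(strip=True) if soup.title else None
    # href=True keeps only the <a> tags that actually have an href attribute.
    links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
    return title, links


if __name__ == "__main__":
    title, links = extract_title_and_links(SAMPLE_HTML, "http://example.com/")
    print("The page's title is: %s" % title)
    for link in links:
        print(link)
```

To use it on a real page, pass r.text and the page's URL to the function instead of the sample string.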
And there you have it: the script shows the title and the links found in the web page. In the next post we will try to extract other kinds of tags we are interested in, for example to constantly refresh a weather report from a forecast web page.