Web Crawler Using Google’s API From Python.

If you want to implement your own Google-like web crawler, it's fairly easy using their API. To begin, download the API package here. Once downloaded, unpack and install it:

# unpack
tar -xzvf google-1.9.1.tar.gz
# move to the unpacked dir
cd google-1.9.1/
# install Google's API
sudo python setup.py install

And we are ready to go with our Python program. Let's begin with the simplest possible program: retrieving the list of links for all results of a search.

Google's API has three main functions: search(), get_page() and filter_result() (see the documentation). To implement our crawler we only need the search() function, which accepts several parameters:

search(query, tld='com', lang='en', num=10, start=0, stop=None, pause=2.0)

Where:

  • query (str) – Query string. Must NOT be url-encoded.
  • tld (str) – Top level domain.
  • lang (str) – Language.
  • num (int) – Number of results per page.
  • start (int) – First result to retrieve.
  • stop (int) – Last result to retrieve. Use None to keep searching forever.
  • pause (float) – Lapse to wait between HTTP requests. A lapse too long will make the search slow, but a lapse too short may cause Google to block your IP. Your mileage may vary!
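To see how num, start and stop fit together, here is a small sketch that builds the keyword arguments for one "page" of results. The page_params helper is my own illustration, not part of the package; the real search() call is commented out so nothing hits Google's server here:

```python
def page_params(page, per_page=10):
    """Build search() keyword arguments for one page of results.

    page is zero-based: page 0 covers results 0-9, page 2 covers 20-29, etc.
    """
    start = page * per_page
    return {'num': per_page, 'start': start, 'stop': start + per_page}

# Parameters for the third page of results (items 20-29):
params = page_params(2)
# for url in search('ironman', **params):  # would issue the real HTTP requests
#     print(url)
print(params)
```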

For example, create a file named myCrawler.py and add the code below:

from google import search

for url in search('ironman', lang='es', num=10, pause=2.0, stop=20):
    print(url)

The code above queries Google's server for "ironman", looking for results in Spanish, fetching 10 results per page, waiting 2 seconds between requests and stopping after receiving 20 results.
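The loop just prints each URL, but in practice you will probably want to keep the links around, without duplicates. A minimal sketch of such a collector (the collect helper and the fake result stream are my own, standing in for the real search() iterator so the example runs offline):

```python
def collect(urls, limit=20):
    """Collect unique URLs from an iterable, preserving order, up to limit."""
    seen = []
    for url in urls:
        if url not in seen:
            seen.append(url)
        if len(seen) >= limit:
            break
    return seen

# In the real crawler the iterable would be search('ironman', ...);
# a fake result stream stands in here so the sketch runs offline.
fake_results = ['http://a.com', 'http://b.com', 'http://a.com', 'http://c.com']
print(collect(fake_results, limit=3))
```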

After that, we can run our python program and see the results:

python myCrawler.py
http://www.ironman.com/
http://www.ironman.com/triathlon/coverage/live.aspx
http://www.ironman.com/events/triathlon-races.aspx
http://www.ironman.com/triathlon/coverage/past.aspx
http://www.ironman.com/triathlon-news/ironman-life.aspx
http://www.pasmar.cl/detalle/actividad/ironman
http://shamserg.deviantart.com/art/Invincible-Iron-Man-576833336
http://ironman.wikia.com/wiki/Mark_43
http://cinemarvel.wikia.com/wiki/Archivo:IronMan_Vengadores.png
http://worldversus.com/IronMan-vs-Hulk
http://es.marvel.wikia.com/wiki/Iron_Man_Armor_MK_XLIV_(Tierra-199999)
https://es.wikipedia.org/wiki/Iron_Man
.
.
.

And just like that, you have your own web crawler. However, be careful with the number of requests you send to Google's server, or your IP may be blocked for several minutes or forced to solve a browser CAPTCHA to continue. You can play with the parameters (especially pause and num) to stay under such blocking limits.
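One common way to cope with such blocking is to retry with growing pauses. A sketch under my own assumptions (the backoff schedule and the retry-on-any-error policy are illustrative, not part of the google package, which simply raises an HTTP error when blocked):

```python
import time

def backoff_delays(base=2.0, factor=2.0, retries=4):
    """Exponential backoff schedule: base, base*factor, base*factor**2, ..."""
    return [base * factor ** i for i in range(retries)]

def polite_search(do_search, retries=4, base=2.0):
    """Retry a search callable with growing pauses between attempts.

    do_search is any zero-argument callable returning a list of URLs,
    e.g. lambda: list(search('ironman', stop=20)).
    """
    for delay in backoff_delays(base=base, retries=retries):
        try:
            return do_search()
        except Exception:  # e.g. an HTTP 503 error while blocked
            time.sleep(delay)
    raise RuntimeError('still blocked after retries')

print(backoff_delays(base=2.0, retries=3))
```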

In the next post I will talk about how to get the pages' contents, including the title, body and references, using lynx (a console web browser). See you then!
