If you want to implement your own Google-like web crawler, it's fairly easy using their API. To begin, download the API package here. Once downloaded, we have to unpack and install it:
```
# unpack
tar -xzvf google-1.9.1.tar.gz
# move to the unpacked dir
cd google-1.9.1/
# install the google package
sudo python setup.py install
```
And we are ready to write our Python program. Let's begin with the simplest possible program: retrieving the list of links for all results of a search.
Google's API has three main functions: search(), get_page() and filter_result() (see the documentation). To implement our crawler we only need the search() function, which accepts several parameters:

```
search(query, tld='com', lang='en', num=10, start=0, stop=None, pause=2.0)
```
- query (str) – Query string. Must NOT be url-encoded.
- tld (str) – Top level domain.
- lang (str) – Language.
- num (int) – Number of results per page.
- start (int) – First result to retrieve.
- stop (int) – Last result to retrieve. Use None to keep searching forever.
- pause (float) – Lapse to wait between HTTP requests. A lapse too long will make the search slow, but a lapse too short may cause Google to block your IP. Your mileage may vary!
For example, create a file named myCrawler.py and add the code below:
```python
from google import search

for url in search('ironman', lang='es', num=10, pause=2.0, stop=20):
    print(url)
```
The code above sends requests to Google's servers searching for "ironman", looking for results in Spanish, getting 10 results per page, waiting 2 seconds between requests, and stopping after receiving 20 results.
After that, we can run our Python program and see the results:
```
python myCrawler.py
http://www.ironman.com/
http://www.ironman.com/triathlon/coverage/live.aspx
http://www.ironman.com/events/triathlon-races.aspx
http://www.ironman.com/triathlon/coverage/past.aspx
http://www.ironman.com/triathlon-news/ironman-life.aspx
http://www.pasmar.cl/detalle/actividad/ironman
http://shamserg.deviantart.com/art/Invincible-Iron-Man-576833336
http://ironman.wikia.com/wiki/Mark_43
http://cinemarvel.wikia.com/wiki/Archivo:IronMan_Vengadores.png
http://worldversus.com/IronMan-vs-Hulk
http://es.marvel.wikia.com/wiki/Iron_Man_Armor_MK_XLIV_(Tierra-199999)
https://es.wikipedia.org/wiki/Iron_Man
...
```
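If you want to do something with the results beyond printing them, it helps to collect them into a list first. Here is a minimal sketch of a hypothetical helper (unique_urls is not part of the google package) that deduplicates URLs while preserving order, since the same URL can occasionally show up more than once across result pages:

```python
def unique_urls(urls, limit=None):
    """Deduplicate URLs while preserving their original order.

    urls  -- any iterable of URL strings (e.g. the generator returned
             by search(); here we use a plain list to stay offline)
    limit -- optional cap on how many unique URLs to keep
    """
    seen = set()
    out = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            out.append(url)
            if limit is not None and len(out) >= limit:
                break
    return out


# Offline demo with a hard-coded list instead of a live search:
urls = ['http://a.example/', 'http://b.example/', 'http://a.example/']
print(unique_urls(urls))  # ['http://a.example/', 'http://b.example/']
```

With a live search you could call something like `unique_urls(search('ironman', stop=50), limit=20)` to end up with exactly 20 distinct links.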
And just like that, you have your own web crawler. However, be careful with the number of requests you send to Google's servers, or your IP may be blocked for several minutes or required to solve a browser CAPTCHA to continue. You can play with the parameters (especially pause and num) to avoid such blocking.
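One simple way to make your traffic look less mechanical is to randomize the delay instead of always waiting exactly 2 seconds. The sketch below is a hypothetical helper (polite_pause is my own name, not part of the google package) that adds random jitter to a base delay; the resulting value can be passed as the pause parameter:

```python
import random

def polite_pause(base=5.0, jitter=3.0):
    # Hypothetical helper: a base delay plus random jitter makes the
    # request timing less regular, which may reduce the chance of
    # Google blocking your IP. Returns a value in [base, base + jitter].
    return base + random.uniform(0.0, jitter)
```

You would then call, for example, `search('ironman', pause=polite_pause(), stop=20)`; larger base values are slower but safer.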
In the next post I will talk about how to get each page's content, including the title, body and references, using lynx (a console web browser). See you then!
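As a small preview of that kind of extraction, and staying in pure Python rather than lynx, the standard library's html.parser module can already pull the title and the links out of a page. This is only a sketch parsing a hard-coded HTML string; in practice you would feed it the HTML downloaded from each URL the crawler found:

```python
from html.parser import HTMLParser

class PageInfo(HTMLParser):
    """Collect the <title> text and all href links from an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ''
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self._in_title = True
        elif tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Demo on a hard-coded page instead of a live download:
parser = PageInfo()
parser.feed('<html><head><title>Iron Man</title></head>'
            '<body><a href="http://www.ironman.com/">home</a></body></html>')
print(parser.title)  # Iron Man
print(parser.links)  # ['http://www.ironman.com/']
```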