[PYTHON]A simple web crawler.

The idea came to me while reading an interview with Dries Buytaert, founder of Drupal. At some point he mentioned that he had built a web crawler and was collecting statistics from various websites.

So… why not ;)
Its main function is to find all the links on a page, store them, and then follow them.

First it reads the “feed” urls from a file named urls (which must be in the same directory). For every new host, the .info() method of the response returned by urllib2.urlopen() is called and we collect some information about it.
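For a quick feel of what that call returns, here is a tiny standalone sketch (example.com is just a placeholder host; the crawler below simply dumps these headers into the results file):

import urllib2

response = urllib2.urlopen('http://example.com/')
headers = response.info()            # the HTTP response headers
print headers.getheader('Server')    # pick out a single header, e.g. the server banner
print headers                        # or dump them all, as the script below does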

#!/usr/bin/env python

import urllib2, re, sys, urlparse

#******************************** Options ********************************#
def options():
   print "A simple web crawler by mpekatsoula."
   print "Options:"
   print "-h      : print help."
   print "-n i    : i is the number of \"crawls\" you want to do."
   print "          Enter -1 or leave blank for inf."
   print "-o name : the name of the file you want to store the results."
   print "          If blank the file will be named results."
   exit(0)

#************************************************************************#

# Default values
crawls = -1
results_file = "results"

# Check user input
for arg in sys.argv[1:]:
   if arg.lower()=="-h":
      options()
   if arg.lower()=="-n":
      crawls = sys.argv[int(sys.argv[1:].index(arg))+2]
      crawls = int(crawls)
   if arg.lower()=="-o":
      results_file = sys.argv[int(sys.argv[1:].index(arg))+2]
      results_file = str(results_file)

# Open the file with the 'feed' urls
feed_urls = open('urls','r')

# Create the file to store the results
results = open(results_file,'a')

# Sets that hold the urls to crawl, the urls already crawled, and the hosts whose info has been gathered
nexttocrawl = set([])
crawled_urls = set([])
gathered_info = set([])

# Compile the regular expression that extracts the href value from anchor tags
# More info on regular expressions in python here: http://docs.python.org/dev/howto/regex.html
expressions = re.compile('<a\s*href=[\'"](.*?)[\'"].*?>')

# Add the feed urls from the file to the set (strip the trailing newline)
for line in feed_urls:
   nexttocrawl.add(line.strip())

# Simple counter
i=0

while i!=crawls:

   i=i+1
   try:
      # Get next url and print it. If the set is empty, exit.
      crawling_url = nexttocrawl.pop()
      print "[*] Crawling...: " + crawling_url
   except KeyError:
      exit(0)

   # "Break" the url to components
   parsed_url = urlparse.urlparse(crawling_url)
   # Open the url
   try:
      url = urllib2.urlopen(crawling_url)
   except Exception:
      continue

   # Read the url
   url_message = url.read()
   # Find the new urls
   gen_urls = expressions.findall(url_message)

   # Store the crawled urls
   crawled_urls.add(crawling_url)

   # Add the new urls to the set
   for link in gen_urls:
      if link.startswith('/'):
         link = 'http://' + parsed_url[1] + link
      elif link.startswith('#'):
         link = 'http://' + parsed_url[1] + parsed_url[2].rstrip("\n") + link
      elif not link.startswith('http'):
         link = 'http://' + parsed_url[1] + '/' + link
      if link not in crawled_urls:
         nexttocrawl.add(link)

   if parsed_url[1] not in gathered_info:
      gathered_info.add(parsed_url[1])
      # Collect the info
      collected_info = str(url.info())
      # Here we store the results ;)
      results.write("!!!!"+parsed_url[1]+"!!!!\n")
      results.write(collected_info)
      results.write("*********************************\n")

#close the files & exit
feed_urls.close()
results.close()
exit(0)

Building on this we can do other things, e.g. figure out which CMS a host might be running, spot an old Apache version that is vulnerable, and whatever else comes to mind ;)
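As a rough illustration of that idea, here is a small hedged sketch; the header names checked and the version strings are my own assumptions about what one might look for, not something the crawler above already does:

# Sketch: look for a few fingerprinting hints in the headers the crawler collects.
def inspect_headers(headers):
   server = headers.getheader('Server') or ''
   pingback = headers.getheader('X-Pingback') or ''
   powered = headers.getheader('X-Powered-By') or ''

   if pingback.endswith('xmlrpc.php'):
      print "    looks like WordPress (X-Pingback points to xmlrpc.php)"
   if server.startswith('Apache/1.3') or server.startswith('Apache/2.0'):
      print "    old Apache banner: " + server
   if powered:
      print "    X-Powered-By: " + powered

It could be hooked into the main loop right where url.info() is already called, e.g. inspect_headers(url.info()).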

Running example:

urls:

https://stack0verflow.wordpress.com/

mpekatsoula@mpekatsospito:~/Desktop$ python crawl.py -o test.out -n 10
[*] Crawling...: https://stack0verflow.wordpress.com/

[*] Crawling...: http://avalonstar.com
[*] Crawling...: https://stack0verflow.wordpress.com/about/
[*] Crawling...: http://duartes.org/gustavo/blog/post/anatomy-of-a-program-in-memory
[*] Crawling...: http://duartes.org/gustavo/blog/category/linux
[*] Crawling...: http://en.wikipedia.org/wiki/Marcelo_Tosatti
[*] Crawling...: http://lxr.linux.no/linux+v2.6.28.1/arch/x86/mm/fault.c#L692
[*] Crawling...: http://www.cloudknow.com/2009/01/daily-links-18/
[*] Crawling...: http://mirror.href.com/thestarman/asm/debug/Segments.html
[*] Crawling...: http://www.newegg.com/Product/Product.aspx?Item=N82E16817371005

test.out:

!!!!stack0verflow.wordpress.com!!!!
Server: nginx
Date: Sat, 06 Nov 2010 15:28:36 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Vary: Cookie
X-hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
X-Pingback: https://stack0verflow.wordpress.com/xmlrpc.php
Link: ; rel=shortlink
Last-Modified: Sat, 06 Nov 2010 15:28:36 +0000
Cache-Control: max-age=300, must-revalidate
X-nananana: Batcache
*********************************
!!!!avalonstar.com!!!!
Date: Sat, 06 Nov 2010 15:28:37 GMT
Server: Apache/2.2.9 (Ubuntu) Phusion_Passenger/3.0.0
Vary: Host,Accept-Encoding
Last-Modified: Sat, 06 Nov 2010 15:27:45 GMT
ETag: "b43a2-60d-4946408601e40"
Accept-Ranges: bytes
Content-Length: 1549
Connection: close
Content-Type: text/html
*********************************
!!!!duartes.org!!!!
Date: Sat, 06 Nov 2010 15:28:21 GMT
Server: Apache
Last-Modified: Mon, 27 Sep 2010 14:15:24 GMT
ETag: "4a1247-1a8f7-4913e5bfab700;48ced8affcb80"
Accept-Ranges: bytes
Content-Length: 108791
Connection: close
Content-Type: text/html; charset=UTF-8
*********************************
!!!!lxr.linux.no!!!!
Date: Sat, 06 Nov 2010 15:28:42 GMT
Server: Apache/2.2.11 (Ubuntu) mod_apreq2-20051231/2.6.0 mod_perl/2.0.4 Perl/v5.10.0
Vary: Accept-Encoding
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8
*********************************
!!!!www.cloudknow.com!!!!
Date: Sat, 06 Nov 2010 15:28:42 GMT
Server: Apache
Last-Modified: Sat, 06 Nov 2010 14:39:02 GMT
ETag: "582c027-332f-494635a26ad80"
Accept-Ranges: bytes
Content-Length: 13103
Cache-Control: max-age=300, must-revalidate
Expires: Sat, 06 Nov 2010 15:33:42 GMT
Vary: Accept-Encoding
Connection: close
Content-Type: text/html; charset=UTF-8
*********************************
!!!!mirror.href.com!!!!
Content-Length: 28900
Content-Type: text/html
Last-Modified: Mon, 15 Oct 2007 08:29:14 GMT
Accept-Ranges: bytes
ETag: "0a9d8775fc81:8098"
Server: Microsoft-IIS/6.0
IISExport: This web site was exported using IIS Export v4.2
Date: Sat, 06 Nov 2010 15:28:40 GMT
Connection: close
*********************************
!!!!www.newegg.com!!!!
Cache-Control: no-cache
Pragma: no-cache
Content-Length: 155243
Content-Type: text/html; charset=utf-8
Expires: -1
Server: Microsoft-IIS/6.0
x-server-id: 115
X-UA-Compatible: IE=7
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
GA: 0
NEG-Created: 11/6/2010 8:28:45 AM
Set-Cookie: NV%5FDVINFO=#5%7b%22Sites%22%3a%7b%22USA%22%3a%7b%22Values%22%3a%7b%22w19%22%3a%22Y%22%7d%2c%22Exp%22%3a%221289147325%22%7d%7d%7d; domain=.newegg.com; path=/
Set-Cookie: NV%5FPRDLIST=#5%7b%22Sites%22%3a%7b%22USA%22%3a%7b%22Values%22%3a%7b%22wf%22%3a%22N82E16817371005%22%7d%2c%22Exp%22%3a%221375457325%22%7d%7d%7d; domain=.newegg.com; expires=Fri, 02-Aug-2013 15:28:45 GMT; path=/
Set-Cookie: NV%5FCONFIGURATION=#5%7b%22Sites%22%3a%7b%22USA%22%3a%7b%22Values%22%3a%7b%22wd%22%3a%221%22%2c%22w39%22%3a%227657%22%7d%2c%22Exp%22%3a%221375457325%22%7d%7d%7d; domain=.newegg.com; expires=Fri, 02-Aug-2013 15:28:45 GMT; path=/
Date: Sat, 06 Nov 2010 15:28:45 GMT
Set-Cookie: NSC_xxx.ofxfhh.dpn-WJQ=ffffffffaf183f1e45525d5f4f58455e445a4a423660;expires=Sat, 06-Nov-2010 16:21:52 GMT;path=/
*********************************

Note: If we don't specify how many crawls to do, it will almost certainly never stop. So it's a good idea to give it a finite number.

5 thoughts on “[PYTHON]A simple web crawler.”

  1. c0demasters

    gratz, very nice my friend.. :D i didn’t know that you have a blog, i will add you to my blogroll!

    r1nu-

  2. aaaaaaaaaaaaaaaaaaaaaaa

    It looks good overall, but if you want something efficient that does the job in the real world then you probably need to build a few more little things ;) For example, it won't work on the case below, because after the <a there isn't only whitespace; and if you put .*? there it will most likely not do what you want, it will match from the first <a to the last href :P Apart from that, it would be good to use some kind of multiplexing/demultiplexing for the connections, because with a single connection it is slow no matter what. I'm not saying go for a DoS, but something better than 1 connection ;) That's all..

  3. mpekatsoula Post author

    About the regex I didn't quite get what you mean (the formatting seems to have mangled it a bit..)
    As for multiple connections etc., yes, it would be quite a bit better that way, but to be honest when I wrote it I didn't see it as a serious tool. It's mostly there so someone can get the general idea ;) (a minimal threaded sketch of that follows these comments)
    Thanks for the comments :P

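A minimal sketch of the “more than one connection” idea from the comments, assuming a fixed pool of worker threads; the WORKERS/fetch names and the demo urls are made up for illustration, and it only fetches pages rather than feeding links back into the crawl:

import threading, Queue, urllib2

WORKERS = 4
tasks = Queue.Queue()

def fetch():
   # Each worker pulls urls off the shared queue and fetches them independently.
   while True:
      u = tasks.get()
      try:
         print "[*] Crawling...: " + u
         urllib2.urlopen(u).read()
      except Exception:
         pass
      tasks.task_done()

for _ in range(WORKERS):
   t = threading.Thread(target=fetch)
   t.setDaemon(True)
   t.start()

for u in ['http://example.com/', 'http://example.org/']:
   tasks.put(u)
tasks.join()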
