[PYTHON] A simple web crawler.

The idea came to me while reading an interview with Dries Buytaert, the founder of Drupal. At some point he mentioned that he had built a web crawler and was gathering statistics from various websites.

So… why not ;)
Its main job is to find all the links on a page, store them, and then follow them.

It starts by taking the “feed” urls from a file named urls (which must be in the same folder). For each new host, we call info() on the urllib2 response and keep some information about it.
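Before the full script, here is a standalone sketch (not part of the crawler itself) of the two building blocks: fetching one page, pulling its links out with the same kind of regular expression, and printing whatever info() reports for that host. The URL is just an example.

#!/usr/bin/env python
# Standalone sketch: fetch a single page, list its links, print its headers.

import urllib2, re

url = urllib2.urlopen('https://stack0verflow.wordpress.com/')
html = url.read()

# The same kind of pattern the crawler uses to find <a href="..."> links
links = re.findall('<a\s+href=[\'"](.*?)[\'"].*?>', html)
print links[:5]   # the first few links found on the page

# info() returns the HTTP response headers (Server, Content-Type, etc.)
print url.info()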

#!/usr/bin/env python

import urllib2, re, sys, urlparse

#******************************** Options ********************************#
def options():
   print "A simple web crawler by mpekatsoula."
   print "Options:"
   print "-h      : print help."
   print "-n i    : i is the number of \"crawls\" you want to do."
   print "          Enter -1 or leave blank for inf."
   print "-o name : the name of the file you want to store the results."
   print "          If blank the file will be named results."
   exit(0)

#************************************************************************#

# Default values
crawls = -1
results_file = "results"

# Check user input
for arg in sys.argv[1:]:
   if arg.lower()=="-h":
      options()
   if arg.lower()=="-n":
      crawls = sys.argv[int(sys.argv[1:].index(arg))+2]
      crawls = int(crawls)
   if arg.lower()=="-o":
      results_file = sys.argv[int(sys.argv[1:].index(arg))+2]
      results_file = str(results_file)

# Open the file with the 'feed' urls
feed_urls = open('urls','r')

# Create the file to store the results
results = open(results_file,'a')

# Sets that hold the urls to crawl, the urls already crawled, and the hosts whose info has been gathered
nexttocrawl = set([])
crawled_urls = set([])
gathered_info = set([])

# Compile a regular expression that matches the href of <a href="..."> links
# More info on regular expressions in python here: http://docs.python.org/dev/howto/regex.html
expressions = re.compile('<a\s+href=[\'"](.*?)[\'"].*?>')

# Add the feed urls from the file to the set (strip the trailing newline)
for line in feed_urls:
   nexttocrawl.add(line.strip())

# Simple counter
i=0

while i!=crawls:

   i=i+1
   try:
      # Get the next url and print it. If the set is empty, exit.
      crawling_url = nexttocrawl.pop()
      print "[*] Crawling...: " + crawling_url
   except KeyError:
      exit(0)

   # "Break" the url to components
   parsed_url = urlparse.urlparse(crawling_url)
   # Open the url
   try:
      url  = urllib2.urlopen(crawling_url)
   except:
      continue

   # Read the url
   url_message = url.read()
   # Find the new urls
   gen_urls = expressions.findall(url_message)

   # Store the crawled urls
   crawled_urls.add(crawling_url)

   # Add the new urls to the set of urls to crawl
   for link in gen_urls:
      if link.startswith('/'):
         link = 'http://' + parsed_url[1] + link
      elif link.startswith('#'):
         link = 'http://' + parsed_url[1] + parsed_url[2].rstrip("\n") + link
      elif not link.startswith('http'):
         link = 'http://' + parsed_url[1] + '/' + link
      if link not in crawled_urls:
         nexttocrawl.add(link)

   if parsed_url[1] not in gathered_info:
      gathered_info.add(parsed_url[1])
      # Collect the info
      collected_info = str(url.info())
      # Here we store the results ;)
      results.write("!!!!"+parsed_url[1]+"!!!!\n")
      results.write(collected_info)
      results.write("*********************************\n")

# Close the files & exit
feed_urls.close()
results.close()
exit(0)

Building on this, we can do other things too, such as figuring out which CMS a host might be running, spotting an old, vulnerable Apache version, and whatever else comes to mind ;)
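For example, here is a rough sketch (my own illustration, not part of the crawler) that reads the results file written above and flags hosts whose Server header reports an Apache release older than 2.2.15; the version cutoff is arbitrary and only there to show the idea.

#!/usr/bin/env python
# Rough sketch: scan the results file and flag old Apache versions.
# The 2.2.15 cutoff below is only an example threshold.

import re

current_host = None
for line in open('results'):
   m = re.match('!!!!(.*)!!!!', line)
   if m:
      current_host = m.group(1)
      continue
   m = re.match('Server:\s*Apache/(\d+)\.(\d+)\.(\d+)', line)
   if m:
      version = tuple(int(x) for x in m.groups())
      if version < (2, 2, 15):
         print "[!] %s runs Apache %d.%d.%d" % ((current_host,) + version)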

Running example:
urls:

https://stack0verflow.wordpress.com/

mpekatsoula@mpekatsospito:~/Desktop$ python crawl.py -o test.out -n 10
[*] Crawling...: https://stack0verflow.wordpress.com/

[*] Crawling...: http://avalonstar.com
[*] Crawling...: https://stack0verflow.wordpress.com/about/
[*] Crawling...: http://duartes.org/gustavo/blog/post/anatomy-of-a-program-in-memory
[*] Crawling...: http://duartes.org/gustavo/blog/category/linux
[*] Crawling...: http://en.wikipedia.org/wiki/Marcelo_Tosatti
[*] Crawling...: http://lxr.linux.no/linux+v2.6.28.1/arch/x86/mm/fault.c#L692
[*] Crawling...: http://www.cloudknow.com/2009/01/daily-links-18/
[*] Crawling...: http://mirror.href.com/thestarman/asm/debug/Segments.html
[*] Crawling...: http://www.newegg.com/Product/Product.aspx?Item=N82E16817371005

test.out:

!!!!stack0verflow.wordpress.com!!!!
Server: nginx
Date: Sat, 06 Nov 2010 15:28:36 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Vary: Cookie
X-hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
X-Pingback: https://stack0verflow.wordpress.com/xmlrpc.php
Link: ; rel=shortlink
Last-Modified: Sat, 06 Nov 2010 15:28:36 +0000
Cache-Control: max-age=300, must-revalidate
X-nananana: Batcache
*********************************
!!!!avalonstar.com!!!!
Date: Sat, 06 Nov 2010 15:28:37 GMT
Server: Apache/2.2.9 (Ubuntu) Phusion_Passenger/3.0.0
Vary: Host,Accept-Encoding
Last-Modified: Sat, 06 Nov 2010 15:27:45 GMT
ETag: "b43a2-60d-4946408601e40"
Accept-Ranges: bytes
Content-Length: 1549
Connection: close
Content-Type: text/html
*********************************
!!!!duartes.org!!!!
Date: Sat, 06 Nov 2010 15:28:21 GMT
Server: Apache
Last-Modified: Mon, 27 Sep 2010 14:15:24 GMT
ETag: "4a1247-1a8f7-4913e5bfab700;48ced8affcb80"
Accept-Ranges: bytes
Content-Length: 108791
Connection: close
Content-Type: text/html; charset=UTF-8
*********************************
!!!!lxr.linux.no!!!!
Date: Sat, 06 Nov 2010 15:28:42 GMT
Server: Apache/2.2.11 (Ubuntu) mod_apreq2-20051231/2.6.0 mod_perl/2.0.4 Perl/v5.10.0
Vary: Accept-Encoding
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8
*********************************
!!!!www.cloudknow.com!!!!
Date: Sat, 06 Nov 2010 15:28:42 GMT
Server: Apache
Last-Modified: Sat, 06 Nov 2010 14:39:02 GMT
ETag: "582c027-332f-494635a26ad80"
Accept-Ranges: bytes
Content-Length: 13103
Cache-Control: max-age=300, must-revalidate
Expires: Sat, 06 Nov 2010 15:33:42 GMT
Vary: Accept-Encoding
Connection: close
Content-Type: text/html; charset=UTF-8
*********************************
!!!!mirror.href.com!!!!
Content-Length: 28900
Content-Type: text/html
Last-Modified: Mon, 15 Oct 2007 08:29:14 GMT
Accept-Ranges: bytes
ETag: "0a9d8775fc81:8098"
Server: Microsoft-IIS/6.0
IISExport: This web site was exported using IIS Export v4.2
Date: Sat, 06 Nov 2010 15:28:40 GMT
Connection: close
*********************************
!!!!www.newegg.com!!!!
Cache-Control: no-cache
Pragma: no-cache
Content-Length: 155243
Content-Type: text/html; charset=utf-8
Expires: -1
Server: Microsoft-IIS/6.0
x-server-id: 115
X-UA-Compatible: IE=7
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
GA: 0
NEG-Created: 11/6/2010 8:28:45 AM
Set-Cookie: NV%5FDVINFO=#5%7b%22Sites%22%3a%7b%22USA%22%3a%7b%22Values%22%3a%7b%22w19%22%3a%22Y%22%7d%2c%22Exp%22%3a%221289147325%22%7d%7d%7d; domain=.newegg.com; path=/
Set-Cookie: NV%5FPRDLIST=#5%7b%22Sites%22%3a%7b%22USA%22%3a%7b%22Values%22%3a%7b%22wf%22%3a%22N82E16817371005%22%7d%2c%22Exp%22%3a%221375457325%22%7d%7d%7d; domain=.newegg.com; expires=Fri, 02-Aug-2013 15:28:45 GMT; path=/
Set-Cookie: NV%5FCONFIGURATION=#5%7b%22Sites%22%3a%7b%22USA%22%3a%7b%22Values%22%3a%7b%22wd%22%3a%221%22%2c%22w39%22%3a%227657%22%7d%2c%22Exp%22%3a%221375457325%22%7d%7d%7d; domain=.newegg.com; expires=Fri, 02-Aug-2013 15:28:45 GMT; path=/
Date: Sat, 06 Nov 2010 15:28:45 GMT
Set-Cookie: NSC_xxx.ofxfhh.dpn-WJQ=ffffffffaf183f1e45525d5f4f58455e445a4a423660;expires=Sat, 06-Nov-2010 16:21:52 GMT;path=/
*********************************

Note: if we don't tell it how many crawls to do, 99% of the time it will never stop. So it's a good idea to give it a finite number.