WWW::Spyder, for simple, easy web crawling

WWW::Spyder examples

WWW::Spyder is a general-purpose Perl module for creating simple spiders, or web robots. To view the POD in a nice format, please see this page.

This is the first module I have released through the standard Perl distribution channel, CPAN, the Comprehensive Perl Archive Network. It’s still in development but is already quite useful.

Discussion

Spiders and robots are programs that browse the web automatically, usually for gathering and indexing links or other information.
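To make that concrete, here is a toy sketch of what a spider does: start from a seed, visit each page once, and queue up the links it finds. It assumes a made-up, in-memory "web" (the %web hash and the example.com URLs) instead of real HTTP requests, and crawl() is a hypothetical helper written for this illustration. It is not how WWW::Spyder is implemented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A made-up, in-memory "web": each URL maps to its text and outgoing links.
my %web = (
    'http://example.com/'  => { text  => 'home',
                                links => [ 'http://example.com/a',
                                           'http://example.com/b' ] },
    'http://example.com/a' => { text  => 'page a',
                                links => [ 'http://example.com/b',
                                           'http://example.com/' ] },
    'http://example.com/b' => { text  => 'page b',
                                links => [ 'http://example.com/c' ] },
    'http://example.com/c' => { text  => 'page c',
                                links => [] },
);

# crawl() is a hypothetical helper: breadth-first traversal from a seed
# URL, visiting each page once and stopping after $max_pages pages.
sub crawl {
    my ( $seed, $max_pages ) = @_;
    my @queue = ($seed);
    my %seen  = ( $seed => 1 );
    my @visited;
    while ( @queue and @visited < $max_pages ) {
        my $url  = shift @queue;
        my $page = $web{$url} or next;    # skip dead links
        push @visited, $url;
        for my $link ( @{ $page->{links} } ) {
            # queue each link only the first time it is seen
            push @queue, $link unless $seen{$link}++;
        }
    }
    return @visited;
}

print "$_\n" for crawl( 'http://example.com/', 3 );
```

The %seen hash is what keeps the robot from looping forever when pages link back to each other, and the page limit plays the same role as an exit condition on a real crawl.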

XML and its grandparent SGML are attempts to instill meaningful order into information. With them, single documents become leaves of databases. A collection of pages can easily be converted to HTML for display, used for indexed searching, or even used to generate entirely new documents.

The Internet has always been full of data, just never with any real meta-organization. You can think of the Internet itself as the single most important database in existence, but without all of it being in a formatted language like XML or some other rigid scheme, it’s not a valuable database. Information without order, indices, and strong categorization reduces quickly to noise.

The real value of the Internet is found in its surfeit of plain text, no offense to the porn industry. The one arena where no one debates the supremacy of Perl is parsing and manipulating text. Therefore, it’s no real stretch to set some Perl loose on the Internet, with the right instructions, and find the value in that great unkeyed DB.

So let’s do something really valuable with the WWW! Let’s find a celebrity’s birthday. We’ll pick Jimmy Page to dull the irony somewhat. We are using simple regexes to check for birthdays. Much better ones could be crafted for serious applications.
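As a rough idea of what a "better" regex might look like, here is a sketch that only accepts things shaped like a date, in either "January 9, 1944" or "9 January 1944" order, and that treats the name literally with any whitespace between its words. Everything in it is an assumption made for illustration: find_birthday() is a hypothetical helper, not part of WWW::Spyder, and the two phrasings it looks for are just the ones the script below uses.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Month names, full or abbreviated ("Jan", "Jan.", "January").
my $month = qr/(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May
                |Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:t(?:ember)?)?
                |Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\.?/xi;

# Two common orders: "January 9, 1944" and "9 January 1944".
my $date = qr/(?: $month \s+ \d{1,2} (?:st|nd|rd|th)? ,? \s+ \d{4}
               |  \d{1,2} (?:st|nd|rd|th)? \s+ $month ,? \s+ \d{4} )/xi;

# find_birthday() is a hypothetical helper, not part of WWW::Spyder.
# Returns the matched date string, or undef if nothing date-like is found.
sub find_birthday {
    my ( $name, $text ) = @_;
    # Quote each word of the name; allow any whitespace between words.
    my $who = join '\s+', map { quotemeta($_) } split ' ', $name;
    return $1 if $text =~ /\b$who\s+was\s+born\s+(?:on\s+)?($date)/i;
    return $1 if $text =~ /\b${who}'s\s+birthday\s+is\s+(?:on\s+)?($date)/i;
    return;
}

my $text = 'Jimmy Page was born on January 9, 1944.';
my $bday = find_birthday( 'Jimmy Page', $text );
print defined $bday ? "$bday\n" : "not found\n";
```

Because the capture has to look like a date, a sentence such as "was born on stage" no longer produces a false hit, and splitting the name on whitespace means names of more than two words work as well.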

Code
#!/usr/bin/perl
use strict;
use warnings;
#---------------------------------------------------------------------
use WWW::Spyder;  # our crawler
use URI::Escape;  # to properly escape our query for the search engine
#---------------------------------------------------------------------
@ARGV == 2 or usage();
my $spyder = WWW::Spyder->new(sleep_base => 20,
                              exit_on    => { pages => 30,
                                              time  => '1min'});
my $name = join(' ',@ARGV);
$spyder->terms($name, qr/birthdays?/i);

$spyder->seed( 'http://www.google.com/search?q=' . 
               uri_escape(qq{"$name"}) );

my $bday;
while ( my $page = $spyder->crawl ) {

    print "Check-->> ", $page->url, "\n";

    # try to extract the birthday here; \Q...\E keeps any regex
    # metacharacters in the name literal
    ( $bday ) = $page->text =~
        m,\Q$name\E\s+was born on ([^.]+\d\d+),sio;
    last if $bday;
    ( $bday ) = $page->text =~
        m,\Q$name\E\'s\s+birthday is ([^.]+\d\d+),sio;
    last if $bday;
}

if ( $bday ) {
    print "\n  ${name}'s birthday seems to be: $bday\n\n";
} else {
    print "\n   Sorry, couldn't find ${name}'s birthday quickly.\n\n"; 
}

exit 0;
#=====================================================================
sub usage {
    my ( $tool ) = $0 =~ m,([^\/]+)$,;
    die <<KettleChips;
----------------------------------------------------------------------
USAGE:
   $tool [Proper Name]

I will try to find the birthday of someone famous if you will please
give me his/her name. I can only do two word names right now.
----------------------------------------------------------------------
KettleChips
}
#=====================================================================

Usage
jinx[96]>spyder-birthday Jimmy Page
Output
Check-->> http://www.google.com/search?q=%22Jimmy%20Page%22
Check-->> http://www.led-zeppelin.com/
Check-->> http://directory.google.com/Top/Arts/Music/Bands_and...
Check-->> http://home.earthlink.net/~juliannwh/

Jimmy Page's birthday seems to be: January 9, 1944

Text, original code, fonts, and graphics ©1990-2008 Ashley Pond V.