[plug] Perl programming using HTML::Parser
David Buddrige
buddrige at wasp.net.au
Mon Apr 28 15:20:13 WST 2003
Hi all,
I am attempting to learn how to use the HTML::Parser module for Perl.
To do this, I have written the following program:
#!/usr/bin/perl
# For info on HTML parsing, see
# http://www.perldoc.com/perl5.8.0/lib/HTML/Parser.html
{
    package MyParser;
    use base 'HTML::Parser';

    sub start {
        my($self, $tagname, $attr, $attrseq, $origtext) = @_;
        print "tagname : $tagname\n";
        if ( "$tagname" == "html" )
        {
            print "its a html tag!\n";
        }
        elsif ( "$tagname" == "span" )
        {
            print "its a span!\n";
        }
        elsif ( "$tagname" == "a" )
        {
            print "Its a A!\n";
        }
        elsif ( "$tagname" == "h1" )
        {
            print "Its a H1!\n";
        }
        else
        {
            print "Its some other sort of tag!\n";
        }
    }

    sub end {
        my($self, $tagname, $origtext) = @_;
        #...
    }

    sub text {
        my($self, $origtext, $is_cdata) = @_;
        #...
    }
}

my $p = MyParser->new;
#use HTML::Parser ();

# Parse document line by line.
while(<>)
{
    $chunk = $_;
    $p->parse( $chunk );
}
I have the following html file which I am trying to parse:
rabbits.html
<html><head></head><body><h1>All about Rabbits</h1>
<h3>main links</h3>
<a href="http://www.google.com/search?num=100&hl=en&lr=lang_en&ie=ISO-8859-1&q=wild+rabbit+habits">google search: Wild Rabbit
Habbits</a><br>
<a href="http://www.airguns.f9.co.uk/subs/quarry/rabbits-1.htm">Airguns in
action: Rabbit Habbits</a><br>
<a href="http://www.airguns.f9.co.uk/subs/quarry/rabbits-2.htm">Rabbit
Hunting tips</a><br>
<a href="http://www.google.com/search?num=100&hl=en&lr=lang_en&ie=ISO-8859-1&q=australia+rabbit+hunt">Google search: Australia Rabbit
Hunt</a><br>
<a href="http://www.wikipedia.org/wiki/Rabbit_%28Australia%29">Rabbits in
Australia (Wikipedia)</a><br>
<a
href="http://www.rabbithunting.com/">http://www.rabbithunting.com/</a><br>
<a href="http://www.lm.net.au/%7Erobert/robert1.htm">Tailem Bend Firearms
Club Inc</a><br>
<a href="http://lm.net.au/%7Erobert/rabbit.htm">Tailem Bend Firearms Club
Inc - Rabbit Hunt</a><br>
<a href="http://www3.gov.ab.ca/srd/fw/watch/rabb_style.html">Lifestyles and
Habbits of Rabbits</a><br>
<a href="http://www.gunsreview.com/views/Big_bore_rabbit_hunting514301.html">Big
bore rabbit hunting In Oz</a><br>
<a href="http://www.yptenc.org.uk/docs/factsheets/animal_facts/rabbit.html">Factsheet:
Rabbits</a><br>
<a href="http://www.chuckhawks.com/shotgun_410_australia.htm">The .410
Shotgun and A Young Hunter In Australia</a><br>
<a href="http://www.chuckhawks.com/index2c.shotguns.htm">GUNS AND SHOOTING
ONLINE : Shotgun Information Page (Part III of "The Definitive Firearms
Site" )</a><br>
<h1>Other articles</h1>
<a href="http://www.rabbit.org/behavior/">Pet Rabbit Behaviour</a><br>
</body></html>
When I run the program over the file rabbits.html, I get the following
output:
tagname : html
its a html tag!
tagname : head
its a html tag!
tagname : body
its a html tag!
tagname : h1
its a html tag!
tagname : h3
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
tagname : h1
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
Can anyone see why it keeps running the line
print "its a html tag!\n";
when the if() statement should cause it to run the code for the alternative
tags?
Thanks heaps, guys.
David Buddrige.
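
A note on the likely cause, offered as a sketch rather than a definitive diagnosis: in Perl, == compares its operands as numbers, while eq compares them as strings. A string such as "html", "a" or "br" has no leading digits, so it converts to 0, and the first test becomes 0 == 0, which is true for every tag name in rabbits.html. With warnings enabled, Perl reports this as: Argument "html" isn't numeric in numeric eq (==). The short program below does not need HTML::Parser; the sub names classify_numeric and classify_string are made up for this illustration and put the two operators side by side.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: classify a tag name using numeric comparison (==),
# as the start() handler in the program above does. Both operands are
# converted to numbers; a non-numeric string becomes 0, so the first
# test is 0 == 0 and succeeds for every ordinary tag name.
sub classify_numeric {
    my ($tagname) = @_;
    no warnings 'numeric';    # silence "Argument ... isn't numeric" in this sub
    if    ( $tagname == "html" ) { return "html" }
    elsif ( $tagname == "span" ) { return "span" }
    else                         { return "other" }
}

# Hypothetical helper: the same logic with string comparison (eq).
sub classify_string {
    my ($tagname) = @_;
    if    ( $tagname eq "html" ) { return "html" }
    elsif ( $tagname eq "span" ) { return "span" }
    elsif ( $tagname eq "a" )    { return "a" }
    elsif ( $tagname eq "h1" )   { return "h1" }
    else                         { return "other" }
}

# Show both results side by side for a few tag names from rabbits.html.
for my $tag (qw(html head a h1 br)) {
    printf "%-5s  == says: %-5s  eq says: %s\n",
        $tag, classify_numeric($tag), classify_string($tag);
}
```

If this is indeed the cause, replacing each == with eq in the start() handler should let the elsif branches run, and adding "use warnings;" at the top of the script would make Perl flag the numeric comparison of non-numeric strings.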