[plug] Perl programming using HTML::Parser

David Buddrige buddrige at wasp.net.au
Mon Apr 28 15:20:13 WST 2003


Hi all, 

I am trying to learn how to use the HTML::Parser class for Perl. 
To do this, I have written the following program: 

#!/usr/bin/perl 

# For info on HTML parsing, see
# http://www.perldoc.com/perl5.8.0/lib/HTML/Parser.html 

{
   package MyParser;
   use base 'HTML::Parser'; 

   sub start {
      my($self, $tagname, $attr, $attrseq, $origtext) = @_; 

      print "tagname : $tagname\n"; 


      if ( "$tagname" == "html" )
      {
         print "its a html tag!\n";
      }
      elsif ( "$tagname" == "span" )
      {
         print "its a span!\n";
      }
      elsif ( "$tagname" == "a" )
      {
         print "Its a A!\n";
      }
      elsif ( "$tagname" == "h1" )
      {
         print "Its a H1!\n";
      }
      else
      {
         print "Its some other sort of tag!\n";
      }
   } 

   sub end {
      my($self, $tagname, $origtext) = @_;
      #...
   } 

   sub text {
      my($self, $origtext, $is_cdata) = @_;
      #...
   }
} 

my $p = MyParser->new; 


#use HTML::Parser (); 

# Parse document line by line.
while (my $chunk = <>)
{
   $p->parse( $chunk );
}
$p->eof;   # signal end of document so any buffered text is flushed

 

I have the following html file which I am trying to parse: 

rabbits.html 

<html><head></head><body><h1>All about Rabbits</h1>
<h3>main links</h3>
<a href="http://www.google.com/search?num=100&hl=en&lr=lang_en&ie=ISO-8859-1&q=wild+rabbit+habits">google search: Wild Rabbit Habbits</a><br>
<a href="http://www.airguns.f9.co.uk/subs/quarry/rabbits-1.htm">Airguns in 
action: Rabbit Habbits</a><br>
<a href="http://www.airguns.f9.co.uk/subs/quarry/rabbits-2.htm">Rabbit 
Hunting tips</a><br>
<a href="http://www.google.com/search?num=100&hl=en&lr=lang_en&ie=ISO-8859-1&q=australia+rabbit+hunt">Google search: Australia Rabbit Hunt</a><br>
<a href="http://www.wikipedia.org/wiki/Rabbit_%28Australia%29">Rabbits in 
Australia (Wikipedia)</a><br>
<a 
href="http://www.rabbithunting.com/">http://www.rabbithunting.com/</a><br>
<a href="http://www.lm.net.au/%7Erobert/robert1.htm">Tailem Bend Firearms 
Club Inc</a><br>
<a href="http://lm.net.au/%7Erobert/rabbit.htm">Tailem Bend Firearms Club 
Inc - Rabbit Hunt</a><br>
<a href="http://www3.gov.ab.ca/srd/fw/watch/rabb_style.html">Lifestyles and 
Habbits of Rabbits</a><br> 

<a href="http://www.gunsreview.com/views/Big_bore_rabbit_hunting514301.html">Big bore rabbit hunting In Oz</a><br>
<a href="http://www.yptenc.org.uk/docs/factsheets/animal_facts/rabbit.html">Factsheet: Rabbits</a><br>
<a href="http://www.chuckhawks.com/shotgun_410_australia.htm">The .410 
Shotgun and A Young Hunter In Australia</a><br>
<a href="http://www.chuckhawks.com/index2c.shotguns.htm">GUNS AND SHOOTING 
ONLINE : Shotgun Information Page (Part III of "The Definitive Firearms 
Site" )</a><br>
<h1>Other articles</h1>
<a href="http://www.rabbit.org/behavior/">Pet Rabbit Behaviour</a><br> 

</body></html> 


When I run the program over the file rabbits.html, I get the following 
output: 

tagname : html
its a html tag!
tagname : head
its a html tag!
tagname : body
its a html tag!
tagname : h1
its a html tag!
tagname : h3
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag!
tagname : h1
its a html tag!
tagname : a
its a html tag!
tagname : br
its a html tag! 


Can anyone see why it keeps running the line 

	   print "its a html tag!\n"; 

when the if() statement should cause it to run the code for the alternative 
tags? 

thanks heaps guys 

David Buddrige. 
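
A likely explanation, offered as a sketch rather than a definitive diagnosis: in Perl, == compares its operands as numbers, and a string that does not begin with a digit (such as "html", "head" or "br") has the numeric value 0. Every test in the if() chain then becomes 0 == 0, so the first branch always wins. The string comparison operator is eq. The helper name classify_tag below is hypothetical and not part of the original program; it stands in for the body of start() so the example runs without HTML::Parser installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration): classify a tag
# name using the string operator 'eq' instead of the numeric '=='.
sub classify_tag {
    my ($tagname) = @_;
    if    ( $tagname eq "html" ) { return "its a html tag!"; }
    elsif ( $tagname eq "span" ) { return "its a span!"; }
    elsif ( $tagname eq "a" )    { return "Its a A!"; }
    elsif ( $tagname eq "h1" )   { return "Its a H1!"; }
    else                         { return "Its some other sort of tag!"; }
}

# '==' converts both sides to numbers; these non-numeric strings become 0.
{
    no warnings 'numeric';   # silence the "isn't numeric" warning for the demo
    print "numeric: ", ( "br" == "html" ? "equal" : "different" ), "\n";
}

# 'eq' compares the text itself.
print "string:  ", ( "br" eq "html" ? "equal" : "different" ), "\n";

# Classify a few of the tag names that appear in rabbits.html.
print "$_ : ", classify_tag($_), "\n" for qw(html head a h1 br);
```

With eq in place, "head" and "br" should fall through to the else branch instead of matching "html". Running the original program with `use warnings;` (or perl -w) would probably have pointed at the problem directly, since Perl warns when a non-numeric string is used as an operand of ==.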


