[plug] Perl regex question

Mon Oct 27 12:34:46 WST 2008

"Arie Hol" <arie99 at ozemail.com.au> writes:

> I am trying to run a perl script which needs to apply a regex which
> will be effective over multiple lines.....
>
> I would like to remove multiline javascript statements from within a
> web page which I want to archive.
>
> Example :
>
> <script language="javascript">
> 	I want to remove 
> 	the javascript lines as well as 
> 	the "script" tags which run over several lines
> </script>
>
> I have tried many different permutations, including the following :
>
> $line =~ s/<script.*script>//smi ;

This line does not do what you expect.  The flags following the regular
expression are 'smi', which mean:

  s => let '.' match newline too (treat the string as a single line)
  m => let '^' and '$' match at every line boundary, not only at the
       start and end of the string
  i => ignore case

Despite the names, /s and /m are independent and can be combined, but
/m changes nothing here because your pattern contains no '^' or '$'.
The flag that matters in your case is '/s', which is an instruction to
treat the string as a single line.

In that case the '.' operator /will/ match (the platform-dependent value
of) newline, which means that it will remove what you want.
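
As a minimal sketch of that difference (the $html string below is a
made-up sample, not the original poster's page), the same substitution
can be run with and without /s:

```perl
use strict;
use warnings;

# Hypothetical sample input: a script block that spans several lines.
my $html = "<p>before</p>\n"
         . "<script language=\"javascript\">\n"
         . "  alert(1);\n"
         . "</script>\n"
         . "<p>after</p>\n";

# Without /s, '.' does not match "\n", so the pattern cannot span lines.
(my $without_s = $html) =~ s/<script.*?script>//i;

# With /s, '.' matches any character, newline included.
(my $with_s = $html) =~ s/<script.*?script>//si;

print "without /s: ", ($without_s eq $html   ? "unchanged"    : "changed"), "\n";
print "with /s:    ", ($with_s =~ /script/i ? "still there"  : "script block removed"), "\n";
```

The copies are made with the (my $copy = $orig) =~ s/.../.../ idiom so
that both variants start from the same input.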

Also, as another poster just mentioned, a non-greedy regexp is probably
what you want, /if/ you demand regular expressions:

  $html =~ s/<script.*?script>//si;
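
To illustrate why greediness matters, here is a small sketch with two
script blocks and ordinary content between them.  The input is invented
for the example, and the /g flag is my addition so that every block is
removed rather than only the first:

```perl
use strict;
use warnings;

# Hypothetical input: two script blocks with real content between them.
my $html = join "\n",
    '<script>one();</script>',
    '<p>keep me</p>',
    '<script>two();</script>',
    '';

# Greedy: .* runs to the LAST "script>", taking the paragraph with it.
(my $greedy = $html) =~ s/<script.*script>//si;

# Non-greedy with /g: each block is matched and removed separately.
(my $lazy = $html) =~ s/<script.*?script>//gsi;

print "greedy keeps paragraph: ", ($greedy =~ /keep me/ ? "yes" : "no"), "\n";
print "lazy keeps paragraph:   ", ($lazy   =~ /keep me/ ? "yes" : "no"), "\n";
```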

Also, you *are* applying this to the entire document, not on a
line-by-line basis, right?  Right?  Because, if you split the above into
lines and apply that regexp to each line independently it will *NOT*
work.
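
A sketch of the two approaches side by side, assuming the page arrives
through a filehandle.  An in-memory filehandle stands in for a real
file here, and the page content is made up:

```perl
use strict;
use warnings;

# Hypothetical page content; an in-memory filehandle replaces a real file.
my $page = "<p>a</p>\n<script>\n  evil();\n</script>\n<p>b</p>\n";

# Line by line: no single line holds both the opening and closing tag,
# so the substitution never matches.
open my $by_line, '<', \$page or die "cannot open: $!";
my $line_result = '';
while (my $line = <$by_line>) {
    $line =~ s/<script.*?script>//si;
    $line_result .= $line;
}
close $by_line;

# Whole document: undefine $/ locally so one read returns everything.
open my $whole, '<', \$page or die "cannot open: $!";
my $html = do { local $/; <$whole> };
close $whole;
$html =~ s/<script.*?script>//si;

print "line by line removed script: ", ($line_result =~ /evil/ ? "no" : "yes"), "\n";
print "slurped removed script:      ", ($html        =~ /evil/ ? "no" : "yes"), "\n";
```

With a real file, only the first argument after '<' changes; the
local $/ slurp works the same way.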

Finally, what you *really* want to do here is use the HTML::Scrubber
library from CPAN, or (possibly) something like HTML::Parser to actually
strip out the script tags.

(Actually, in truth, you want to strip *everything* except for a little
 whitelist of permitted HTML content, rather than a blacklist of bad
 content.  Since you need to know. :)

Doing this yourself with regexp is an invitation to an HTML injection
attack from the folks submitting this content that you are trying to
clean up.

> I read the Perl documentation which tells me to use the /s and /m
> switches - but I can't get the desired results.

It works just fine for me with '/s' as indicated above...

> The above example does not seem to do anything at all - it doesn't
> throw an error or give me any feed back.

A regular expression substitution that doesn't match anything returns
false (rather than the number of substitutions made), but does not
generate any other diagnostics because a failed match is a routine
operation.
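
If you want feedback, test the return value yourself.  A minimal
sketch, with a made-up input and variable names of my own choosing:

```perl
use strict;
use warnings;

# Hypothetical input that contains no script block at all.
my $html = "<p>no scripts here</p>\n";

# s/// returns the number of substitutions made, or a false value when
# nothing matched; "|| 0" turns that false value into a plain zero.
my $removed = ($html =~ s/<script.*?script>//gsi) || 0;

if ($removed) {
    print "removed $removed script block(s)\n";
}
else {
    print "pattern did not match; document left unchanged\n";
}
```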

You would, of course, know that because you /did/ read the perlop(1)
manual page, which documents this in detail, right?

Regards,
        Daniel