[plug] Perl regex question
Daniel Pittman
daniel at rimspace.net
Mon Oct 27 12:34:46 WST 2008
"Arie Hol" <arie99 at ozemail.com.au> writes:
> I am trying to run a perl script which needs to apply a regex which
> will be effective over multiple lines.....
>
> I would like to remove multiline javascript statements from within a
> web page which I want to archive.
>
> Example :
>
> <script language="javascript">
> I want to remove
> the javascript lines as well as
> the "script" tags which run over several lines
> </script>
>
> I have tried many different permutations, including the following :
>
> $line =~ s/<script.*script>//smi ;
This line does not do quite what the flag names suggest; the flags
following the regular expression are 'smi', which means:
s => treat this as a single line ('.' also matches newline)
m => treat this as multiple lines ('^' and '$' match at each line)
i => ignore case
Despite the names, '/s' and '/m' are not opposites and can be combined;
'/m' simply does nothing here because the pattern contains no '^' or
'$'. In your case the one you want is '/s', which is an instruction to
treat it as a single line.
In that case the '.' metacharacter /will/ match (the platform-dependent
value of) newline, which means that the match can span lines and remove
what you want.
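
A minimal sketch of the difference '/s' makes, using a made-up
multi-line string in place of a real page (the variable names and the
sample input are illustrative only):

```perl
use strict;
use warnings;

# Hypothetical sample input standing in for a real page.
my $html = "before\n<script language=\"javascript\">\nalert(1);\n</script>\nafter\n";

# Without /s, '.' stops at each newline, so the pattern cannot span lines.
(my $without_s = $html) =~ s/<script.*script>//i;

# With /s, '.' also matches newline, so the whole element is removed.
(my $with_s = $html) =~ s/<script.*script>//si;

print "without /s: ", ($without_s eq $html ? "unchanged" : "changed"), "\n";
print "with /s:    ", ($with_s =~ /script/ ? "script left" : "script removed"), "\n";
```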
Also, as another poster just mentioned, a non-greedy regexp is probably
what you want, /if/ you demand regular expressions:
$html =~ s/<script.*?script>//si;
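
To see why greediness matters, here is a small sketch with two script
blocks on one made-up page. The '/g' flag on the second substitution is
my addition, assuming you want every script block removed rather than
just the first:

```perl
use strict;
use warnings;

# Hypothetical page with two separate script blocks.
my $html = "<p>one</p><script>a()</script><p>two</p><script>b()</script><p>three</p>";

# Greedy: '.*' runs from the first '<script' to the LAST 'script>',
# so the paragraph between the two blocks is deleted as well.
(my $greedy = $html) =~ s/<script.*script>//si;

# Non-greedy: '.*?' stops at the first 'script>' it can; /g repeats
# the substitution for each remaining block.
(my $lazy = $html) =~ s/<script.*?script>//gsi;

print "greedy:     $greedy\n";
print "non-greedy: $lazy\n";
```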
Also, you *are* applying this to the entire document, not on a
line-by-line basis, right? Right? Because, if you split the above into
lines and apply that regexp to each line independently it will *NOT*
work.
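
A sketch of the two approaches side by side. It reads from an in-memory
filehandle so that it is self-contained; in a real script you would open
the archived file instead. The sample page and variable names are
assumptions for illustration:

```perl
use strict;
use warnings;

# Hypothetical page content; a real script would read this from a file.
my $page = "<html>\n<script language=\"javascript\">\nalert(1);\n</script>\n<p>kept</p>\n</html>\n";

# Line by line: each line is matched on its own, so nothing can span lines.
open my $by_line, '<', \$page or die "cannot open in-memory file: $!";
my $line_result = '';
while (my $line = <$by_line>) {
    $line =~ s/<script.*?script>//si;
    $line_result .= $line;
}
close $by_line;

# Whole document: undefine $/ locally so a single read returns everything.
open my $whole, '<', \$page or die "cannot open in-memory file: $!";
my $doc = do { local $/; <$whole> };
close $whole;
$doc =~ s/<script.*?script>//gsi;

print "line by line: ", ($line_result eq $page ? "nothing removed" : "changed"), "\n";
print "whole file:   ", ($doc =~ /script/ ? "script left" : "script removed"), "\n";
```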
Finally, what you *really* want to do here is use the HTML::Scrubber
library from CPAN, or (possibly) something like HTML::Parser to actually
strip out the script tags.
(Actually, in truth, you want to strip *everything* except for a little
whitelist of permitted HTML content, rather than a blacklist of bad
content. Since you need to know. :)
Doing this yourself with regexp is an invitation to an HTML injection
attack from the folks submitting this content that you are trying to
clean up.
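
For the whitelist approach, a minimal sketch with HTML::Scrubber might
look like the following. It assumes the module is installed from CPAN,
and the list of allowed tags is a hypothetical example, not a
recommendation; check the module's documentation before relying on it:

```perl
use strict;
use warnings;
use HTML::Scrubber;    # from CPAN, not part of core Perl

# Hypothetical whitelist; choose the tags your archive actually needs.
my $scrubber = HTML::Scrubber->new( allow => [qw( p b i em strong br )] );

# Made-up input containing one permitted and one unwanted element.
my $dirty = '<p>Hello <b>world</b></p><script>alert(1)</script>';

# Anything not on the whitelist is stripped from the result.
my $clean = $scrubber->scrub($dirty);

print "$clean\n";
```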
> I read the Perl documentation which tells me to use the /s and /m
> switches - but I can't get the desired results.
It works just fine for me with '/s' as indicated above...
> The above example does not seem to do anything at all - it doesn't
> throw an error or give me any feed back.
A regular expression substitution that doesn't match anything returns a
false value (and otherwise the number of substitutions made), but does
not generate any other diagnostics because it is a routine operation.
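
If you want feedback, you can check the return value of the substitution
yourself. A small sketch, with made-up input strings and messages:

```perl
use strict;
use warnings;

# Two hypothetical inputs: one with script blocks, one without.
my $with    = "<script>a()</script><p>text</p><script>b()</script>";
my $without = "<p>no scripts here</p>";

# s/// returns the number of substitutions made, or false if none matched.
my $hits   = ($with    =~ s/<script.*?script>//gsi);
my $misses = ($without =~ s/<script.*?script>//gsi);

print $hits   ? "first page: removed $hits block(s)\n" : "first page: no match\n";
print $misses ? "second page: removed something\n"     : "second page: no match\n";
```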
You would, of course, know that because you /did/ read the perlop(1)
manual page, which documents this in detail, right?
Regards,
Daniel