[plug] Apache transfer logging
James Bromberger
james at rcpt.to
Mon Aug 19 13:16:44 WST 2002
On Mon, Aug 19, 2002 at 10:55:51AM +0800, Anthony J. Breeds-Taurima wrote:
> On Mon, 19 Aug 2002, Adrian Woodley wrote:
> > G'Day all,
> > This is directed at all the Apache warriors out there :) How can I generate
> > a monthly report of the amount of traffic transferred by each subdomain?
> > I thought about running webalizer for each domain, but that would quickly
> > become a pain in the proverbial.
> > I'm not after something flash, output doesn't really matter, provided it can
> > show me all the various subdomains in a table.
>
> *cough*
> perl
> *cough*
/me passes a Butter Menthol
No need to cough....
> Seriously, I don't know of any web traffic analysers as I've written all my
> own. It shouldn't be too hard to write in perl.
It can get very complicated. For example, one web site I run generates
around 250 MB of access logs per (trading) day. We have 10 web servers.
Each server rotates its logs every HOUR. So by the time you get to the end
of the month, you have 10 * 24 * 30 log files (= 7200), and 5 - 7 GB of
raw data (it's around 160,000 hits per hour). Dealing with that much data
reliably, and in a reasonable time, is an issue.
My current method of dealing with this is to do a two-stage process. Take
each hour's worth of logs, and load them into a hash in perl. My hash
has a structure of:
$var->{'year'}->{2002}->{'month'}->{08}->{'day'}->{19}->{'hour'}->{10}->{'hits'} = 123456
This means that by just reading the hash you can get the number of hits
for a given hour, instead of parsing thousands of log files. Note: where I
have the key "hits", I also have another one called "bytes", another called
"useragents", etc. Thus you can get a generalised view of every hour, and not
need to refer to the original logs (which are themselves then compressed and
archived to backup media).
These hashes are "frozen" to disk using the FreezeThaw module.
At reporting time, load all the frozen hashes, and then iterate over them.
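Roughly, the hourly pass looks something like this (a cut-down sketch rather
than the real script; the log regex and the output file name are only
illustrative):

  #!/usr/bin/perl -w
  # Sketch: tally one hour's combined-format access log from STDIN into a
  # nested hash, then freeze it to disk for the reporting pass.
  use strict;
  use FreezeThaw qw(freeze);

  my %agg;
  while (my $line = <STDIN>) {
      # host ident user [19/Aug/2002:10:55:51 +0800] "GET / HTTP/1.0" 200 1234 ...
      next unless $line =~
          m{^\S+ \S+ \S+ \[(\d+)/(\w+)/(\d+):(\d+):[^\]]+\] "[^"]*" \d{3} (\d+|-)};
      my ($day, $mon, $year, $hour, $bytes) = ($1, $2, $3, $4, $5);
      my $slot = $agg{'year'}{$year}{'month'}{$mon}{'day'}{$day}{'hour'}{$hour} ||= {};
      $slot->{'hits'}++;
      $slot->{'bytes'} += $bytes unless $bytes eq '-';
  }

  # One frozen file per hour in practice; the report script thaw()s them
  # and only ever walks the hashes, never the raw logs.
  open my $out, '>', 'agg.frozen' or die "agg.frozen: $!";
  print {$out} freeze(\%agg);
  close $out;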
However, this starts to get a bit big after a year. I have just started
working on a new approach: tying the data repository into a database.
I am placing Apache access logs into 3rd Normal Form in a MySQL database.
3rd normal form is about as compact as you can get; as more logs are added,
the space needed per hit drops, because repeated values are only stored once.
I'll give a demo if people want to see it. There are currently two scripts
to this: parse and replay. Parse takes a log file on STDIN, normalises it and
saves it. Replay either dumps the log back out as fast as possible, or replays
it in the relative time at which the logged events occurred.
An advantage of being in a database is that you can use the normalisation,
and the indexes you create on your tables. This makes sorting and reporting
reasonably fast. The next step is to add 'aggregation tables' to the
reporting side of the 3rd NF, so that you don't have to scan every hit row
to get the data, similar to the aggregation I currently do in hashes.
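To give a flavour of the schema (only a sketch; the real module differs, and
the table and column names here are made up for the example): repeated
strings such as request paths and user agents are stored once in lookup
tables, and each hit row carries nothing but a timestamp and small integer
keys.

  #!/usr/bin/perl -w
  # Sketch of a normalised (3NF-ish) log schema in MySQL via DBI.
  use strict;
  use DBI;

  my $dbh = DBI->connect('DBI:mysql:database=weblogs', 'loguser', 'secret',
                         { RaiseError => 1 });

  $dbh->do($_) for (
      q{CREATE TABLE IF NOT EXISTS url (
          id   INT AUTO_INCREMENT PRIMARY KEY,
          path VARCHAR(255) NOT NULL UNIQUE )},
      q{CREATE TABLE IF NOT EXISTS useragent (
          id    INT AUTO_INCREMENT PRIMARY KEY,
          agent VARCHAR(255) NOT NULL UNIQUE )},
      q{CREATE TABLE IF NOT EXISTS hit (
          stamp        DATETIME NOT NULL,
          url_id       INT NOT NULL,
          useragent_id INT NOT NULL,
          status       SMALLINT NOT NULL,
          bytes        INT NOT NULL,
          INDEX (stamp), INDEX (url_id) )},
  );

  # Return the id for a string, inserting it the first time it is seen.
  sub intern {
      my ($table, $column, $value) = @_;
      my ($id) = $dbh->selectrow_array(
          "SELECT id FROM $table WHERE $column = ?", undef, $value);
      return $id if defined $id;
      $dbh->do("INSERT INTO $table ($column) VALUES (?)", undef, $value);
      return $dbh->{'mysql_insertid'};
  }

Reporting then joins hit against the lookup tables and lets the indexes do
the work.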
I'll post my module later on the web when I am happy with it.
Now, just because you *can* feed your access logs straight into a database,
that doesn't mean you should do it live. You should batch it, for the simple
reason that databases/networks/etc. sometimes don't work. Once in a while
databases get shut down and restarted, and during those times you cannot
save logs. Doing reliable batch jobs means you can tolerate some downtime,
and can recover from insert failures.
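The batch loader itself can be a very dumb loop (again a sketch, assuming the
schema above; load_file() here is just a stand-in for the real parse/insert
code): each file only leaves the queue directory once its transaction has
committed, so a database that is down simply means the files wait for the
next run.

  #!/usr/bin/perl -w
  # Sketch: load queued log files inside transactions; anything that fails
  # stays in the queue and gets retried later.
  use strict;
  use DBI;
  use File::Copy qw(move);
  use File::Basename qw(basename);

  my $dbh = DBI->connect('DBI:mysql:database=weblogs', 'loguser', 'secret',
                         { RaiseError => 1, AutoCommit => 0 });

  sub load_file {
      my ($file) = @_;
      # ... parse $file and INSERT the rows here (see the schema sketch) ...
  }

  for my $file (glob 'queue/*.log') {
      eval {
          load_file($file);
          $dbh->commit;
          move($file, 'done/' . basename($file)) or die "move $file: $!";
      };
      if ($@) {
          warn "load of $file failed, leaving it queued: $@";
          eval { $dbh->rollback };
      }
  }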
While I am here, I'll plug mod_gzip. I installed it on the linux.conf.au
server about three weeks ago, and in that time roughly 10% of all hits have
been compressed, and those responses came out around 66% smaller. Many people
don't pay for outbound traffic, but there are bandwidth and download-time
considerations. That 10% resulted in only 2 MB being sent in place of 6 MB.
Not too shabby.
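The config I used is roughly the stanza the mod_gzip documentation suggests
(from memory, so check it against the version you install; the directive set
varies between releases):

  <IfModule mod_gzip.c>
      mod_gzip_on                  Yes
      mod_gzip_item_include        mime  ^text/.*
      mod_gzip_item_include        file  \.html$
      mod_gzip_item_exclude        mime  ^image/.*
      mod_gzip_minimum_file_size   500
  </IfModule>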
The next thing after that is to minimise some of the whitespace in
your HTML/CSS/JS/etc...
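Even something as blunt as stripping leading indentation and blank lines
takes a fair bit out of generated pages (a rough one-liner, not safe for
<pre> blocks or whitespace-sensitive JavaScript):

  perl -ne 's/^\s+//; print unless /^$/;' < page.html > page.stripped.html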
--
James Bromberger <james_AT_rcpt.to> www.james.rcpt.to
Remainder moved to http://www.james.rcpt.to/james/sig.html
The Australian Linux Technical Conference 2003: http://www.linux.conf.au/