[plug] Apache transfer logging

James Bromberger james at rcpt.to
Mon Aug 19 13:16:44 WST 2002


On Mon, Aug 19, 2002 at 10:55:51AM +0800, Anthony J. Breeds-Taurima wrote:
> On Mon, 19 Aug 2002, Adrian Woodley wrote:
> > G'Day all,
> >     This is directed at all the Apache warriors out there :) How can I generate
> > a monthly report of the amount of transfer for each subdomain?
> >     I thought about running webalizer for each domain, but that would quickly
> > become a pain in the proverbial.
> >     I'm not after something flash, output doesn't really matter, provided it can
> > show me all the various subdomains in a table.
> 
> *cough*
> perl
> *cough*

/me passes a Butter Menthol

No need to cough....
 
> Seriously, I don't know of any web traffic analysers as I've written all my
> own.  It shouldn't be too hard to write in perl.


It can get very complicated. For example, one web site I run generates 
around 250 MB of access logs per (trading) day. We have 10 web servers, 
and each server rotates its logs every HOUR. So by the time you get to the 
end of the month, you have 10 * 24 * 30 log files (= 7200), and 5 - 7 GB of 
raw data (it's around 160,000 hits per hour). Dealing with that much data 
reliably, in a reasonable time, is an issue.

My current method of dealing with this is a two-stage process. Take 
each hour's worth of logs, and load them into a hash in perl. My hash 
has a structure of:

	$var->{'year'}->{2002}->{'month'}->{08}->{'day'}->{19}->{'hour'}->{10}->{'hits'} = 123456

This means that by just reading the hash you can get the number of hits 
for a given hour, instead of parsing thousands of log files. Note that where 
I have the key "hits", I also have another one called "bytes", another called 
"useragents", etc. Thus you get a generalised view of every hour, and never 
need to refer to the original logs (which are themselves then compressed and 
archived to backup media).
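
As a rough sketch (assuming combined-format logs; this is not my exact 
code), that first aggregation stage looks something like:

    #!/usr/bin/perl -w
    # Sketch only: read one hour of combined-format access log on STDIN
    # and build the nested aggregate hash described above.
    use strict;

    my $agg = {};
    while (<STDIN>) {
        next unless m{ \[ (\d+) / (\w+) / (\d+) : (\d+) : \d+ : \d+ \s [^\]]+ \]
                       \s " [^"]* " \s \d+ \s (\S+) \s " [^"]* " \s " ([^"]*) " }x;
        my ($day, $mon, $year, $hour, $bytes, $ua) = ($1, $2, $3, $4, $5, $6);
        my $bucket = $agg->{'year'}{$year}{'month'}{$mon}{'day'}{$day}{'hour'}{$hour} ||= {};
        $bucket->{'hits'}++;
        $bucket->{'bytes'} += $bytes unless $bytes eq '-';
        $bucket->{'useragents'}{$ua}++;
    }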

These hashes are "frozen" to disk using the FreezeThaw module.

At reporting time, load all the frozen hashes, and then iterate over them.
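
FreezeThaw exports freeze() and thaw(), so both halves come down to 
something like this (the file naming scheme here is made up):

    use strict;
    use FreezeThaw qw(freeze thaw);

    # Write an aggregate hash out as a frozen string, one file per hour.
    sub save_hour {
        my ($agg, $file) = @_;
        open OUT, "> $file" or die "open $file: $!";
        print OUT freeze($agg);
        close OUT;
    }

    # At reporting time, thaw every saved hash, ready to be walked.
    sub load_hours {
        my @hashes;
        for my $file (glob("frozen/*.hash")) {
            open IN, $file or die "open $file: $!";
            my $str = do { local $/; <IN> };   # slurp the whole file
            close IN;
            push @hashes, (thaw($str))[0];
        }
        return @hashes;
    }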

However, this starts to get a bit big after a year. I have just started 
working on a new approach that ties the data repository into a database: 
I am placing Apache access logs into 3rd Normal Form in a MySQL database. 
3rd Normal Form is about as compact as you can get; as more logs are added, 
the extra space each new entry needs drops, because the repeated strings are 
already there. I'll give a demo if people want to see it. There are 
currently two scripts to this: parse and replay. Parse takes a log file from 
STDIN, normalises it and saves it. Replay either dumps the log back out as 
fast as it can, or in relative time to when the logged events occurred.
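
The schema below is not my module, just an illustration of the idea with 
DBI (table and column names are made up): repeated strings live once in 
lookup tables, and each hit row carries only small integer ids.

    use strict;
    use DBI;

    # Connection details are placeholders.
    my $dbh = DBI->connect('dbi:mysql:weblogs', 'user', 'password',
                           { RaiseError => 1 });

    $dbh->do(q{CREATE TABLE IF NOT EXISTS url (
        id INT AUTO_INCREMENT PRIMARY KEY, path VARCHAR(255) UNIQUE)});
    $dbh->do(q{CREATE TABLE IF NOT EXISTS useragent (
        id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(255) UNIQUE)});
    $dbh->do(q{CREATE TABLE IF NOT EXISTS hit (
        stamp DATETIME, url_id INT, useragent_id INT,
        status SMALLINT, bytes INT)});

    # Return the id for a value, inserting it the first time it is seen.
    sub lookup_id {
        my ($table, $column, $value) = @_;
        my ($id) = $dbh->selectrow_array(
            "SELECT id FROM $table WHERE $column = ?", undef, $value);
        return $id if defined $id;
        $dbh->do("INSERT INTO $table ($column) VALUES (?)", undef, $value);
        return $dbh->{'mysql_insertid'};
    }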

An advantage of being in a database is that you can use the normalisation, 
and the indexes you create on your tables. This makes sorting and reporting 
reasonably fast. The next step is to add 'aggregation tables' to the 
reporting side of the 3rd NF data, so that you don't have to scan all the 
logs to get the totals, similar to the aggregation I currently do in hashes. 
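
Something like this (again hypothetical, reusing the $dbh handle from the 
sketch above): one summary row per URL per day, rebuilt by a nightly job, 
so reports never touch the raw hit rows.

    # Placeholder date range for the day being summarised.
    my ($yesterday, $today) = ('2002-08-18', '2002-08-19');

    $dbh->do(q{CREATE TABLE IF NOT EXISTS daily_summary (
        day DATE, url_id INT, hits INT, bytes BIGINT,
        PRIMARY KEY (day, url_id))});

    $dbh->do(q{REPLACE INTO daily_summary (day, url_id, hits, bytes)
               SELECT DATE_FORMAT(stamp, '%Y-%m-%d'), url_id,
                      COUNT(*), SUM(bytes)
               FROM hit
               WHERE stamp >= ? AND stamp < ?
               GROUP BY 1, 2}, undef, $yesterday, $today);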

I'll post my module later on the web when I am happy with it.


Now, just because you can feed your access logs straight into a database 
doesn't mean you SHOULD. You should batch it, for the simple reason that 
databases/networks/etc sometimes don't work. Once in a while databases 
get shut down and restarted, and during those times you cannot save logs. 
Doing reliable batch jobs means you have some allowance for downtime, and 
can recover from insert failures.
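
One way to arrange that (paths and the insert_log_line() helper here are 
made up) is a cron job that only retires a log file once its batch has 
gone in cleanly, so a database outage just leaves it to be retried:

    use strict;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:weblogs', 'user', 'password',
                           { RaiseError => 1, AutoCommit => 1 });

    for my $file (glob("/var/spool/weblogs/*.log")) {
        eval {
            $dbh->begin_work;               # one transaction per batch
            open my $fh, $file or die "open $file: $!";
            while (<$fh>) {
                insert_log_line($dbh, $_);  # hypothetical normalise-and-insert
            }
            close $fh;
            $dbh->commit;
        };
        if ($@) {
            warn "batch $file failed, leaving it for next run: $@";
            eval { $dbh->rollback };
        } else {
            rename $file, "$file.done" or warn "rename $file: $!";
        }
    }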


While I am here, I'll plug mod_gzip. I installed it on the linux.conf.au 
server about three weeks ago, and in that time roughly 10% of all hits have 
been compressed, saving around 66% of the outbound traffic for those hits. 
Many people don't pay for outbound, but there are bandwidth and 
download-time considerations. That 10% resulted in only 2 MB being sent in 
place of 6 MB. Not too shabby. The next thing after that is to minimise 
some of the whitespace in your HTML/CSS/JS/etc...

 

-- 
 James Bromberger <james_AT_rcpt.to> www.james.rcpt.to
 Remainder moved to http://www.james.rcpt.to/james/sig.html
 The Australian Linux Technical Conference 2003: http://www.linux.conf.au/