2006-09-03 12:01

by Vasil Kolev

These are a few small tricks I use to gather stats and see how the data looks like…

(a note – this is mostly postgresql related, but is easily ported to another database)

Let’s say we have a table in the database, which looks like a log file, with two columns – timestamp and event_id, which is integer from 1 to 5. The tasks is to get in X minute intervals what type of event happened how much times and to draw them.

The first step is the following query:

SELECT  
	((EXTRACT(epoch FROM timestamp)::int/90)*90) AS period,
        SUM((event_id=1)::int) AS type_1,
	SUM((event_id=2)::int) AS type_2, 
	SUM((event_id=3 OR event_id=4 OR event_id=5)::int) AS type_rest,
	SUM(1) AS all
FROM
	log
WHERE 
	timestamp => '2006-08-28 00:00:00' AND timestamp < '2006-08-29 00:00:00'
GROUP BY
	period 
ORDER BY
	period ASC;

(thanks to Aqua for the testing ang the notes :) )

A short explanation - period is a timestamp rounded to 90 seconds (using integer division), which we use to do the grouping. For every such period we're counting how many of each event we have with the small trick with the typecast - true is 1, false is 0, which means that if the condition is satisfied, the sum will increase. This way we're counting what' in the table with one pass through it.

Some notes for the query:

1) date_trunc can be used instead of the EXTRACT trick, but it can't round to stuff different than a minute or hour.
2) In the older versions of postgresql (before 8) by default there's not typecast from boolean to integer, but in the documentation there's an example how to create one.
3) Instead of the trick with SUM() a grouping by event_id can be used, but to be able to combine a few event_ids again some boolean trick has to be used. Also getting the whole count and solving the problem for 0 count of some event_id for some period will be harder.
4) Instead of SUM(1) in the end COUNT(1) can be used, but I don't see a big difference.
5) BETWEEN can be used in the WHERE clause, but I like this more :)

To get the data in format, useful for drawing, we need the following line:

psql -A -q -t -F ' ' $DBNAME -c "$QUERY" > data

(what the options do, you can see in the man page :) DBNAME is the database, QUERY is the query from above)

We'll have one file with the following type of lines:

1156748310 23 12 6 41
1156748400 28 21 5 54
1156748490 32 15 2 49
....

Here gnuplot comes into the game, with the following commands to make a nice drawing:

set title "Events through the day, quantised at 90 sec. (fig. 1)"
set ylabel "number of events"
set xlabel "time through the day"
plot "data" using 5 with lines title  "all", "data" using 4 with lines title  "rest", \
	"data" using 3 with lines title  "type_1", "data" using 2 with lines title  "type_1"

And with putting these two lines in the beginning it will write the thing to file:

set terminal png font VeraSe 11 size 800,500
set output "test.png"

Of course, it would be great to write the hour underneath, but I still haven't thought up a way to do it :)

Leave a Reply