Thursday, December 12, 2013

Log-log distribution in gnuplot

The distribution of a variable (i.e. how frequently each of its values occurs) is a very useful tool for understanding the behavior of a dataset. In my case, I was exploring this reddit dataset and trying to see how the upvotes of different submissions are distributed.

Although gnuplot is a great tool for visualizing data, plotting the distribution of a variable from an external file is not very intuitive. To plot the distribution on linear axes we can use the smooth frequency option:

# Configure the output
set terminal png
set output "dist_linear_axes.png"

# Define a bin() function to aggregate close x-values.
bin_width = 5
bin(x) = bin_width * floor(x / bin_width)

# Apply the bin() function and save result to "temp.dat"
set table "temp.dat"
plot "data.dat" using (bin($1)):(1.0)
unset table

# Plot the result using linear scale.
set xlabel "Number of Upvotes"
set ylabel "Count"
set nokey
plot "temp.dat" using 1:2 smooth frequency with points
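To get a feel for what the bin() function defined in the script above does, it can be evaluated by hand at the gnuplot prompt. This is only a small sketch: the sample values are made up for illustration, and they are written as reals to mimic what gnuplot reads from a data file column.

```gnuplot
# Same definitions as in the script above.
bin_width = 5
bin(x) = bin_width * floor(x / bin_width)

# Hypothetical sample values, written as reals like column data.
print bin(2.0), bin(4.0)   # both fall in the bin starting at 0
print bin(5.0), bin(9.0)   # both fall in the bin starting at 5
print bin(12.0)            # falls in the bin starting at 10
```

Each value is mapped to the lower edge of its bin, so all values inside the same 5-wide interval end up with the same x coordinate.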

First I define a bin() function to group nearby x values. In this example I have used a bin width of 5, which means, for example, that the values 2 and 4 are grouped into the same bin. Intuitively, the next step would be to use the plot command as follows:

plot "data.dat" using (bin($1)):(1.0) smooth frequency with points

However, this did not work as expected for me. As far as I can tell, gnuplot first groups the x values and only afterwards applies the bin() function, whereas we want bin() to be applied before smooth frequency groups the x values. To achieve this, I applied the bin() function first and saved the result to a temporary file ("temp.dat") using the set table command. The final result looks like this:

Distribution in linear axes (it does not look very nice!)

As we can see, the result does not look very good: most of the points are clustered at the bottom of the plot. The reason is that the x and y values span such a wide range that we cannot see the shape of the distribution. When variables span a wide range of values, it is often useful to use log-scale axes.

Log scale axes

We can tell gnuplot to use log-scale axes with the set logscale command:

set logscale xy

However, when plotting distributions this causes a small problem: gnuplot applies the log function to the x and y values before applying the smooth option. Since in our case the y values are all 1 and log(1) = 0, the count for each bin will also be zero.
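The arithmetic behind this can be checked directly at the gnuplot prompt. This sketch only illustrates the sum itself and says nothing about how gnuplot implements smoothing internally; the choice of three terms is arbitrary.

```gnuplot
# The logarithm of 1 is zero...
print log10(1.0)

# ...so adding up any number of log-transformed ones still gives zero,
# no matter how many points fall into the bin.
print log10(1.0) + log10(1.0) + log10(1.0)
```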

We can solve this problem by using a second temporary file. First we save the result of the smooth frequency option, and then we tell gnuplot to use the log scale when plotting it:

# Configure the output
set terminal png
set output "dist_log_axes.png"

# Define a bin() function to aggregate close x-values.
bin_width = 5
bin(x) = bin_width * floor(x / bin_width)

# Apply the bin() function and save result to "temp.dat"
set table "temp.dat"
plot "data.dat" using (bin($1)):(1.0)
unset table

# Apply smooth frequency and save result to "temp2.dat"
set table "temp2.dat"
plot "temp.dat" using 1:2 smooth frequency with points
unset table

# Plot the result using log scale.
set xlabel "Number of Upvotes"
set ylabel "Count"
set nokey
set logscale xy
plot "temp2.dat" using 1:2 with points

And here is the resulting plot:

Distribution using log axes.