Thursday, December 12, 2013

Log-log distribution in gnuplot

The distribution of a variable (i.e., how frequent its values are) is a very useful tool for understanding the behavior of a given dataset. In my case, I was exploring this reddit dataset and trying to see how the upvotes of different submissions are distributed.

Although gnuplot is a great tool for visualizing data, plotting the distribution of a variable from an external file is not very intuitive. To plot the distribution on linear axes we can use the smooth frequency option:

# Configure the output
set terminal png
set output "dist_linear_axes.png"

# Define a bin() function to aggregate close x-values.
bin_width = 5
bin(x) = bin_width * floor(x / bin_width)

# Apply the bin() function and save result to "temp.dat"
set table "temp.dat"
plot "data.dat" using (bin($1)):(1.0)
unset table

# Plot the result using linear scale.
set xlabel "Number of Upvotes"
set ylabel "Count"
set nokey
plot "temp.dat" using 1:2 smooth frequency with points

First I define a bin() function to group close x values. In this example I have used a bin width of 5. That means, for example, that values 2 and 4 will be grouped into the same bin. Intuitively, the next step would be to use the plot command as follows:

plot "data.dat" using (bin($1)):(1.0) smooth frequency with points

However, this will not work as expected. The reason is that gnuplot first groups the x values and only then applies the bin() function. Instead, we want to apply bin() before smooth frequency groups the x values. To achieve this, I apply the bin() function and save the result to a temporary file ("temp.dat") using the set table command. The final result looks like this:
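To make the two-pass idea concrete, here is a sketch of the same bin-then-count logic in Python (the upvote values below are toy data, not the actual reddit dataset):

```python
from collections import Counter

bin_width = 5

def bin_(x):
    # Same idea as gnuplot's bin(): map x to the left edge of its bin.
    return bin_width * (x // bin_width)

# First pass (the "set table" step): apply bin() to every value.
upvotes = [2, 4, 7, 7, 8, 12, 31]      # toy values standing in for data.dat
binned = [bin_(x) for x in upvotes]

# Second pass (the "smooth frequency" step): count occurrences per bin.
counts = Counter(binned)
for edge in sorted(counts):
    print(edge, counts[edge])
```

Running this prints one line per bin (bin left edge, count), which is exactly the content that ends up in the plotted table.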

Distribution in linear axes (it does not look very nice!)

As we can see, the result does not look very good: most of the points are clustered at the bottom of the plot. The reason is that the x and y values span a wide range, which does not allow us to see the shape of the distribution. When a variable spans a wide range of values, it is often useful to use log-scale axes.

Log scale axes

We can tell gnuplot to use log-scale axes with the set logscale command:

set logscale xy

However, in this case (plotting distributions) we have a little problem: gnuplot applies the log function to the x and y values before applying the smooth option. Since in our case the y values are all 1 and log(1) = 0, the count for each bin will also be zero.
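A toy illustration of this order-of-operations problem in Python: taking logs before summing the 1s gives zero, while summing first and taking the log afterwards (which is what the second temporary file achieves) recovers the correct bin count:

```python
import math

# The y-column written to temp.dat is all 1s, one per data point in a bin.
ones = [1.0] * 8

# With logscale active, logs are taken BEFORE smooth frequency sums them:
wrong = sum(math.log(y) for y in ones)
print(wrong)                 # → 0.0, since log(1) = 0 for every point

# Saving the counts to a second file first means we sum, THEN take the log:
right = math.log(sum(ones))
print(right)                 # log(8) ≈ 2.079
```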

We can solve this problem by using a second temporary file. First we save the result of the smooth frequency step, and only then do we tell gnuplot to use the log scale:

# Configure the output
set terminal png
set output "dist_log_axes.png"

# Define a bin() function to aggregate close x-values.
bin_width = 5
bin(x) = bin_width * floor(x / bin_width)

# Apply the bin() function and save result to "temp.dat"
set table "temp.dat"
plot "data.dat" using (bin($1)):(1.0)
unset table

# Apply smooth frequency and save result to "temp2.dat"
set table "temp2.dat"
plot "temp.dat" using 1:2 smooth frequency with points
unset table

# Plot the result using log scale.
set xlabel "Number of Upvotes"
set ylabel "Count"
set nokey
set logscale xy
plot "temp2.dat" using 1:2 with points

And here is the resulting plot:

Distribution using log axes.

Sunday, November 10, 2013

Fractal dimension from image

Suppose that we want to estimate the fractal dimension from an object represented by an image:

Image depicting the Sierpinski triangle.

The image depicts the Sierpinski triangle, a well known fractal object. Computing the fractal dimension of an object can be useful, for example, in image feature extraction.

In our case, the fractal object is represented by a binary image I, where non-zero pixels belong to the object and zero pixels constitute the background. We can estimate the Hausdorff fractal dimension of the object using the following algorithm:

  1. Pad the image with background pixels so that its dimensions are a power of 2.
  2. Set the box size 'e' to the size of the image.
  3. Compute N(e), the number of boxes of size 'e' that contain at least one object pixel.
  4. If e > 1 then e = e / 2 and repeat step 3.
  5. Compute the points (log(1/e), log(N(e))) and use the least squares method to fit a line to them.
  6. The estimated Hausdorff fractal dimension D is the slope of that line.
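The steps above can be sketched in Python (a from-scratch stand-in for the Matlab hausDim() function below; the padding, box counting, and least-squares fit follow the algorithm as listed):

```python
import math

def haus_dim(img):
    """Estimate the box-counting (Hausdorff) dimension of a binary image,
    given as a list of rows of 0/1 values."""
    # Step 1: pad to a power-of-2 square with background (0) pixels.
    h, w = len(img), len(img[0])
    n = 1
    while n < max(h, w):
        n *= 2
    grid = [[img[r][c] if r < h and c < w else 0 for c in range(n)]
            for r in range(n)]

    # Steps 2-4: count occupied boxes N(e) for e = n, n/2, ..., 1.
    xs, ys = [], []
    e = n
    while e >= 1:
        count = 0
        for br in range(0, n, e):
            for bc in range(0, n, e):
                if any(grid[r][c]
                       for r in range(br, br + e)
                       for c in range(bc, bc + e)):
                    count += 1
        xs.append(math.log(1.0 / e))
        ys.append(math.log(count))
        e //= 2

    # Steps 5-6: least-squares slope of log(N(e)) vs log(1/e).
    xm, ym = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - xm) * (y - ym) for x, y in zip(xs, ys))
            / sum((x - xm) ** 2 for x in xs))

# A 4x4 approximation of the Sierpinski triangle.
tri = [[1, 0, 0, 0],
       [1, 1, 0, 0],
       [1, 0, 1, 0],
       [1, 1, 1, 1]]
print(round(haus_dim(tri), 4))   # → 1.585, i.e. log(3) / log(2)
```

On this tiny pattern the log-log points happen to fall exactly on a line, so the fit returns the theoretical dimension; on a real image the fit is only approximate.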

 

Matlab code

This link provides my implementation of the previous algorithm. The hausDim() function can be used to estimate the fractal dimension of the depicted fractal. Note that it is first necessary to convert the image to a binary matrix using the im2bw function:

% Read the image.
>> I = imread('http://4.bp.blogspot.com/-aHCfmDvyzFU/Un_U-Neo_GI/AAAAAAAAGpQ/DWzjztkh4HM/s1600/sierpinski.png');

% Convert to a binary image and negate pixel values.
>> I = ~im2bw(I);
>> hausDim(I)

ans = 1.5496

The estimated value is 1.5496, which is quite close to the exact value: log(3) / log(2) = 1.58496...



Monday, September 2, 2013

Texture Classification

In many applications, texture has a close relationship with semantics. For example, Fig. 1(a) shows a CT scan of a lung; in this case, there are differences between the textures of healthy and unhealthy tissues. In satellite images - Fig. 1(b) - texture can be employed to differentiate terrains such as forests, buildings and sea.

Figure 1 - Texture information can be used to (a) detect unhealthy lung tissue or (b) differentiate types of terrain.

The goal of this post is to provide a brief overview of how texture information can be used to classify images using Matlab. Basically, what we want to do is to learn a texture classifier from a set of labeled images depicting textures. Then, we will use the learned classifier to provide a class label for an unlabeled image.

Texture classification can be divided into three phases, which are discussed in the following sections:

  1. Extracting texture features;
  2. Training a classifier;
  3. Classification of an unlabeled texture image;

Downloading the image set

In order to classify texture images, we will use as example the Textured Surfaces Dataset from Prof. Jean Ponce's research group. Just click on the link and search for "Texture Database". Then download the 5 provided zip files (~40MB each).

Unzip all the images into a single imgs directory. We will also use a text file (image_class.txt) where each line contains the file name of an image and the corresponding class. Organize the images and image_class.txt under the following directory structure:

.
|
|--imgs
|    |--img1.jpg
|    |--img2.jpg
|    ...
|    |--img1000.jpg
|
|--image_class.txt 

Texture feature extraction

There are many feature extraction algorithms available for texture analysis. In this post I will use the SFTA Texture Extractor (Matlab code available here), which I developed as part of my PhD research. Download the SFTA code and unzip it into the working directory:

.
|
|--imgs
|    |--img1.jpg
|    |--img2.jpg
|    ...
|    |--img1000.jpg
|
|--image_class.txt
|--findBorders.m
|--hausDim.m
|--otsurec.m
|--sfta.m

To extract features we will use the sfta(I, nt) function, where I corresponds to the input image depicting a texture and nt is a parameter that defines the size of the feature vector.

Training the classifier

There are many good libraries and programs that provide implementations of classification algorithms, such as Weka for Java or LIBSVM. In this example, however, I will use the classifiers provided by Matlab. More specifically, I will use the Naive Bayes classifier.

In order to train the Naive Bayes classifier, we need to organize the images' feature vectors into a single NxM matrix, where N is the number of images and M is the size of the feature vector. We will also need an Nx1 cell array containing the images' labels. The following code reads the image_class.txt file and generates the F matrix with the feature vectors and the L cell array with the labels:

imgListFile = fopen('image_class.txt', 'r');
F = zeros(1000, 21); L = cell(1000, 1);
tline = fgetl(imgListFile); currentLine = 1;
while ischar(tline)       
    disp(tline)
    splittedLine = regexp(tline, ',[ ]*', 'split');
   
    % Extract the SFTA feature vector from the image.
    imagePath = fullfile('imgs', splittedLine{1});
    I = imread(imagePath);
    f = sfta(I, 4);
    F(currentLine, :) = f;
   
    % Store the image label.
    L{currentLine} = splittedLine{2};
   
    tline = fgetl(imgListFile);
    currentLine = currentLine + 1;
end
fclose(imgListFile);

Now that we have the F matrix and L cell array, we can train a Naive Bayes classifier nb using the following command:

>> nb = NaiveBayes.fit(F, L);

Classification

The nb classifier can be used to classify an image Itest using the following commands:

>> Itest = imread(testImagePath);
>> features = sfta(Itest, 4);
>> predict(nb, features)
ans = 'knit'
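For readers without Matlab, the same fit/predict flow can be sketched with a minimal from-scratch Gaussian Naive Bayes in Python; the feature vectors and labels below are made-up stand-ins for SFTA features, not values from the dataset:

```python
import math
from collections import defaultdict

class GaussianNB:
    """Minimal Gaussian Naive Bayes: per-class feature means and variances,
    prediction by maximum log-likelihood (uniform class priors)."""

    def fit(self, F, L):
        groups = defaultdict(list)
        for row, label in zip(F, L):
            groups[label].append(row)
        self.stats = {}
        for label, rows in groups.items():
            cols = list(zip(*rows))
            means = [sum(c) / len(c) for c in cols]
            # A small floor keeps variances positive for constant features.
            varis = [max(sum((v - m) ** 2 for v in c) / len(c), 1e-9)
                     for c, m in zip(cols, means)]
            self.stats[label] = (means, varis)
        return self

    def predict(self, x):
        def loglik(label):
            means, varis = self.stats[label]
            return sum(-0.5 * math.log(2 * math.pi * s)
                       - (v - m) ** 2 / (2 * s)
                       for v, m, s in zip(x, means, varis))
        return max(self.stats, key=loglik)

# Made-up 2-feature vectors standing in for SFTA features.
F = [[0.1, 1.0], [0.2, 1.1], [0.9, 5.0], [1.0, 5.2]]
L = ['knit', 'knit', 'brick', 'brick']
nb = GaussianNB().fit(F, L)
print(nb.predict([0.15, 1.05]))   # → knit
```

The structure mirrors the Matlab session above: fit on an NxM feature matrix plus an N-element label list, then predict on a single feature vector.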