Data Visualization

The process of data analysis does not just consist of picking an algorithm, fitting it to the data, and reporting the results. We have seen that we need to choose a representation for the data, which necessitates data preprocessing in many cases. Depending on the data representation and the task at hand we then have to choose an algorithm to continue our analysis. But even after we have run the algorithm and studied the results we are interested in, we may realize that our initial choice of algorithm or representation was not optimal. We may therefore decide to try another representation/algorithm, compare the results, and perhaps combine them. Data analysis is an iterative process.

What may help us in deciding on a representation and algorithm for further analysis? Consider the two datasets in Figure ??. In the left figure we see that the data naturally forms clusters, while in the right figure we observe that the data is approximately distributed on a line. The left figure suggests a clustering approach, while the right figure suggests a dimensionality reduction approach. This illustrates the importance of looking at the data before you start your analysis, instead of (literally) blindly picking an algorithm. After your first peek, you may decide to transform the data and then look again to see whether the transformed data better suit the assumptions of the algorithm you have in mind.
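As an illustration, here is a minimal sketch in Python (using NumPy and Matplotlib; the synthetic datasets and all parameters are assumptions for illustration, not the data behind the figures) of two scatter plots of the kind described above:

```python
# A minimal sketch: two synthetic datasets whose scatter plots suggest
# different analyses (clustering vs. dimensionality reduction).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Left: three well-separated blobs -> suggests a clustering approach.
centers = np.array([[0, 0], [5, 5], [0, 5]])
clustered = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

# Right: points scattered around a line -> suggests dimensionality reduction.
t = rng.uniform(-3, 3, size=300)
linear = np.column_stack([t, 2 * t]) + rng.normal(scale=0.3, size=(300, 2))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.scatter(clustered[:, 0], clustered[:, 1], s=10)
ax1.set_title("clusters -> clustering")
ax2.scatter(linear[:, 0], linear[:, 1], s=10)
ax2.set_title("near a line -> dimensionality reduction")
plt.show()
```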

“Looking at the data” sounds easier than it really is. The reason is that we are not equipped to think in more than 3 dimensions, while most data lives in much higher dimensions. For instance, image patches of size $10 \times 10$ live in a 100-dimensional pixel space. How are we going to visualize them? There are many answers to this problem, but most involve projection: we determine a number of, say, 2- or 3-dimensional subspaces onto which we project the data. The simplest choice of subspaces are the ones aligned with the features; e.g., we can plot $X_{1n}$ versus $X_{2n}$, etc. An example of such a scatter plot is given in Figure ??.
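For concreteness, a minimal sketch of such a feature-versus-feature scatter plot (assuming the dataset is stored as an $N \times d$ NumPy array; the data here is a synthetic placeholder):

```python
# A minimal sketch of a pairwise scatter plot; X is assumed to be an
# (N, d) array holding N data points with d features each.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))   # placeholder data, d = 4

i, j = 0, 1                     # plot feature i against feature j
plt.scatter(X[:, i], X[:, j], s=10)
plt.xlabel(f"feature {i}")
plt.ylabel(f"feature {j}")
plt.show()
```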

Note that we have a total of $d(d-1)/2$ possible two-dimensional axis-aligned projections, which amounts to 4950 projections for 100-dimensional data. This is usually too many to inspect manually. How do we cut down on the number of dimensions? Perhaps random projections may work? Unfortunately that turns out not to be a great idea in many cases. The reason is that data projected onto a random subspace often looks distributed according to what is known as a Gaussian distribution (see Figure ??). The deeper reason behind this phenomenon is the central limit theorem, which states that the sum of a large number of independent random variables is (under certain conditions) distributed as a Gaussian. Hence, if we denote by $w$ a vector in $\mathbb{R}^d$ and by $x$ the $d$-dimensional random variable, then $y = w^T x$ is the value of the projection. This is clearly a weighted sum of the random variables $x_i,\ i = 1, \ldots, d$. If we assume that the $x_i$ are approximately independent, then we can see that their sum will be governed by the central limit theorem. Analogously, a dataset $\{X_{in}\}$ can thus be visualized in one dimension by “histogramming”¹ the values of $Y_n = w^T X_n$; see Figure ??. In this figure we clearly recognize the characteristic bell shape of the Gaussian distribution of projected and histogrammed data.
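The following sketch illustrates this effect (the uniformly distributed synthetic data and the choice $d = 100$ are assumptions for illustration): even though each individual feature is decidedly non-Gaussian, the histogram of a random projection comes out bell-shaped.

```python
# A minimal sketch: project high-dimensional data onto a random direction w
# and histogram the values Y_n = w^T X_n. By the central limit theorem the
# histogram typically looks roughly Gaussian, even for non-Gaussian features.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
d, N = 100, 5000
X = rng.uniform(-1, 1, size=(N, d))   # non-Gaussian (uniform) features

w = rng.normal(size=d)
w /= np.linalg.norm(w)                # random unit-length direction

Y = X @ w                             # one projected value per data point
plt.hist(Y, bins=50)
plt.title("histogram of a random projection: roughly bell-shaped")
plt.show()
```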

In one sense the central limit theorem is a rather helpful quirk of nature: many variables follow Gaussian distributions, and the Gaussian distribution is one of the few distributions with very nice analytic properties. Unfortunately, the Gaussian distribution is also the most uninformative distribution. This notion of “uninformative” can actually be made very precise using information theory: given a fixed mean and variance, the Gaussian density represents the least amount of information among all densities with the same mean and variance. This is rather unfortunate for our purposes, because it means Gaussian projections are the least revealing dimensions to look at. So in general we have to work a bit harder to see interesting structure.
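In information-theoretic terms, this is the standard statement that the Gaussian maximizes the differential entropy $h(p) = -\int p(x)\log p(x)\,dx$ (i.e., is maximally uncertain, hence least informative) among all densities with a given mean and variance:

```latex
h(p) \;=\; -\int p(x)\,\log p(x)\,dx
\;\le\; \tfrac{1}{2}\log\!\left(2\pi e\,\sigma^2\right)
\;=\; h\!\left(\mathcal{N}(\mu,\sigma^2)\right)
```

for every density $p$ with mean $\mu$ and variance $\sigma^2$, with equality exactly when $p$ is the Gaussian.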

A large number of algorithms has been devised to search for informative projections, the simplest being “principal component analysis”, or PCA for short ??. Here, “interesting” means dimensions of high variance. However, it was recognized that high variance is not always a good measure of interestingness, and one should rather search for dimensions that are non-Gaussian. For instance, “independent components analysis” (ICA) ?? and “projection pursuit” ?? search for dimensions that have heavy tails relative to Gaussian distributions. Another criterion is to find projections onto which the data has multiple modes. A more recent approach is to project the data onto a potentially curved manifold ??.
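As a concrete example, here is a minimal sketch of projecting data onto its two highest-variance directions with PCA, using scikit-learn (the library choice and the synthetic placeholder data are assumptions; any PCA implementation would do):

```python
# A minimal sketch of a high-variance projection via PCA.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 100))   # placeholder 100-dimensional data

pca = PCA(n_components=2)         # keep the two highest-variance directions
Z = pca.fit_transform(X)          # project the data onto them

plt.scatter(Z[:, 0], Z[:, 1], s=10)
plt.xlabel("first principal component")
plt.ylabel("second principal component")
plt.show()
```

An ICA projection can be obtained in much the same way, e.g. with scikit-learn's FastICA class in the same decomposition module.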

Scatter plots are of course not the only way to visualize data. It is a creative exercise, and anything that helps enhance your understanding of the data is allowed in this game. To illustrate, I will give a few examples from a



1. A histogram is a bar plot where the height of each bar represents the number of items whose value lies in the interval on the x-axis on which the bar stands (i.e., the base of the bar). If many items have a value around zero, then the bar centered at zero will be very high.
