1.2 Preprocessing the Data

As mentioned in the previous section, algorithms are based on assumptions and can become more effective if we transform the data first. Consider the following example, depicted in figure ??a. The algorithm consists of estimating the area that the data occupy. It grows a circle starting at the origin and, at the point where it contains all the data, we record the area of the circle. The figure shows why this will be a bad estimate: the data-cloud is not centered. If we had first centered it, we would have obtained a reasonable estimate. Although this example is somewhat simple-minded, there are many, much more interesting algorithms that assume centered data. To center data we will introduce the _sample mean_ of the data, given by,

$$\mathbb{E}[X]_i = \frac{1}{N}\sum_{n=1}^{N} X_{in} \qquad (1.1)$$

Hence, for every attribute _i_ separately, we simply add all the attribute values across data-cases and divide by the total number of data-cases. To transform the data so that their sample mean is zero, we set,

$$X'_{in} = X_{in} - \mathbb{E}[X]_i \qquad (1.2)$$

It is now easy to check that the sample mean of X′ indeed vanishes. An illustration of the global shift is given in figure ??b. We also see in this figure that the algorithm described above now works much better!
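As a minimal sketch (in Python with NumPy; the array layout, function names, and example data are my own illustration, not from the text), centering per equations 1.1 and 1.2, together with the naive circle-growing area estimate described above, might look like this.

```python
import numpy as np

def center(X):
    """Subtract the sample mean (eq. 1.1) from every data-case (eq. 1.2)."""
    mean = X.sum(axis=0) / X.shape[0]    # E[X]_i for every attribute i
    return X - mean                      # X'_{in} = X_{in} - E[X]_i

def naive_area_estimate(X):
    """Grow a circle around the origin until it contains all 2-D data-cases
    and report its area (the simplistic algorithm from the text)."""
    r = np.linalg.norm(X, axis=1).max()  # radius needed to enclose every point
    return np.pi * r**2

# Illustrative, made-up data-cloud that sits far from the origin:
X = np.random.randn(1000, 2) + np.array([5.0, 5.0])
print(naive_area_estimate(X))            # badly inflated by the offset
print(naive_area_estimate(center(X)))    # much more reasonable after centering
print(np.allclose(center(X).mean(axis=0), 0.0))  # the sample mean of X' vanishes
```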

In a similar spirit as centering, we may also wish to scale the data along the coordinate axes in order to make it more “spherical”. Consider figure ??a,b. In this case the data was first centered, but the elongated shape still prevented us from using the simplistic algorithm to estimate the area covered by the data. The solution is to scale the axes so that the spread is the same in every dimension. To define this operation we first introduce the notion of _sample variance_,

$$\mathbb{V}[X]_i = \frac{1}{N}\sum_{n=1}^{N} X_{in}^2 \qquad (1.3)$$

where we have assumed that the data was first centered. Note that this is similar to the sample mean, but now we have used the square. It is important that we have removed the sign of the data-cases (by taking the square) because otherwise positive and negative values might cancel each other out. By first taking the square, all data-cases get mapped to the positive half of the axis (for each dimension or attribute separately) and are then added and divided by N. You have perhaps noticed that the variance does not have the same _units_ as _X_ itself. If _X_ is measured in grams, then the variance is measured in grams squared. So to scale the data to have the same scale in every dimension we divide by the square-root of the variance, which is usually called the _sample standard deviation_,

$$X''_{in} = \frac{X'_{in}}{\sqrt{\mathbb{V}[X]_i}} \qquad (1.4)$$

Note again that sphering requires centering, implying that we always have to perform these operations in this order: first center, then sphere. Figure ??a,b,c illustrates this process.
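A minimal sketch of the full center-then-sphere procedure (again a Python/NumPy illustration with made-up names, not the book's code) follows the order stressed above.

```python
import numpy as np

def sphere(X):
    """Center the data, then scale every attribute to unit sample variance."""
    X = X - X.mean(axis=0)                  # step 1: center (eq. 1.2)
    var = (X**2).sum(axis=0) / X.shape[0]   # sample variance V[X]_i (eq. 1.3)
    return X / np.sqrt(var)                 # step 2: divide by the standard deviation (eq. 1.4)

# After sphering, every attribute has zero mean and unit variance:
# np.allclose(sphere(X).var(axis=0), 1.0)  -> True
```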

You may now be asking, “well, what if the data were elongated in a diagonal direction?”. Indeed, we can also deal with such a case by first centering, then _rotating_ such that the elongated direction points in the direction of one of the axes, and then scaling. This requires quite a bit more math, and we will postpone this issue until chapter ?? on “principal components analysis”. However, the question is in fact a very deep one, because one could argue that one could keep changing the data using more and more sophisticated transformations until all the structure was removed from the data and there would be nothing left to analyze! It is indeed true that the pre-processing steps can be viewed as part of the modeling process, in that they identify structure (and then remove it). By remembering the sequence of transformations you performed, you have implicitly built a model. Conversely, many algorithms can easily be adapted to model the mean and scale of the data. Then the preprocessing is no longer necessary and becomes integrated into the model.

Just as preprocessing can be viewed as building a model, we can use a model to transform structured data into (more) unstructured data. The details of this process will be left for later chapters, but a good example is provided by compression algorithms. Compression algorithms are based on models for the redundancy in data (e.g. text, images). The compression consists in removing this redundancy and transforming the original data into a less structured or less redundant (and hence more succinct) code. Models and structure-reducing data transformations are in a sense each other's reverse: we often associate with a model an understanding of how the data was generated, starting from random noise. Conversely, pre-processing starts with the data and asks how we can get back to the unstructured, random state of the data.

Finally, I will mention one more popular data-transformation technique. Many algorithms are based on the assumption that the data is sort of symmetric around the origin. If the data happens to be just positive, it doesn’t fit this assumption very well. Taking the following logarithm can help in that case,

$$X'_{in} = \log(X_{in}) \qquad (1.5)$$
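As a one-line NumPy sketch (with made-up strictly positive data, since the logarithm is only defined there):

```python
import numpy as np

X = np.abs(np.random.randn(1000, 2)) + 0.1  # illustrative strictly positive data
X_log = np.log(X)                           # eq. 1.5, applied elementwise
```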

