1.1 Data Representation

What does “data” look like? In other words, what do we download into our computer? Data comes in many shapes and forms; for instance, it could be words from a document or pixels from an image. But it is useful to convert data into a standard format so that the algorithms we will discuss can be applied to it. Most datasets can be represented as a matrix, $X = [X_{in}]$, with rows indexed by the “attribute-index” $i$ and columns indexed by the “data-index” $n$. The value $X_{in}$ for attribute $i$ and data-case $n$ can be binary, real, discrete, etc., depending on what we measure. For instance, if we measure the weight and color of 100 cars, the matrix $X$ is $2 \times 100$ dimensional, and $X_{1,20} = 20{,}684.57$ is the weight of car number 20 in some units (a real value), while $X_{2,20} = 2$ is the color of car number 20 (say, one of 6 predefined colors).
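As a rough illustration of this layout, the sketch below builds a $2 \times 100$ matrix as a plain Python list of rows. All values are made up with a random generator, except the two entries quoted in the text; the weight range and the helper `entry` are assumptions introduced here only to bridge the text's 1-based indices and Python's 0-based ones.

```python
import random

random.seed(0)  # fixed seed so the made-up data is reproducible

N_CARS = 100
N_COLORS = 6

# Row 1 of the text: weight (real-valued, arbitrary hypothetical range).
weights = [random.uniform(15000.0, 30000.0) for _ in range(N_CARS)]
# Row 2 of the text: color code (discrete, one of 6 predefined colors).
colors = [random.randint(1, N_COLORS) for _ in range(N_CARS)]

# Plant the two entries quoted in the text for car number 20.
weights[19] = 20684.57
colors[19] = 2

# X[i][n] holds attribute i for data-case n (0-based in Python).
X = [weights, colors]


def entry(X, i, n):
    """Return X_{in} using the 1-based indices of the text (hypothetical helper)."""
    return X[i - 1][n - 1]


print(len(X), len(X[0]))  # 2 100
print(entry(X, 1, 20))    # 20684.57
print(entry(X, 2, 20))    # 2
```

Storing attributes as rows and data-cases as columns follows the convention of the text; many libraries use the transposed layout, so it is worth checking which one a given tool expects.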

Most datasets can be cast in this form (but not all). For documents, we can give each distinct word of a prespecified vocabulary a number and simply count how often each word is present. Say the word “book” is defined to have number 10,568 in the vocabulary; then $X_{10568,5076} = 4$ would mean that the word “book” appeared 4 times in document 5076. Sometimes the different data-cases do not have the same number of attributes. Consider searching the internet for images of rats. You’ll retrieve a large variety of images, most with a different number of pixels. We can either rescale the images to a common size or simply leave those entries in the matrix empty. It may also occur that a certain entry is supposed to be there but couldn’t be measured. For instance, if we run an optical character recognition system on a scanned document, some letters will not be recognized. We’ll use a question mark, “?”, to indicate that an entry wasn’t observed.
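The word-count representation and the “?” convention can be sketched in a few lines. The tiny vocabulary, the two documents, and the names `count_matrix`, `MISSING`, and `render` are all hypothetical; using `None` as the in-memory marker for an unobserved entry is one possible choice, not something the text prescribes.

```python
from collections import Counter

# Hypothetical vocabulary: word -> number (1-based, as in the text).
vocabulary = {"book": 1, "rat": 2, "image": 3}

# Two toy documents; each one becomes a column of the matrix.
documents = [
    "the book about the rat",
    "book book image book book",
]


def count_matrix(documents, vocabulary):
    """Build X with one row per vocabulary word and one column per document."""
    X = [[0] * len(documents) for _ in range(len(vocabulary))]
    for n, text in enumerate(documents):
        counts = Counter(text.lower().split())
        for word, number in vocabulary.items():
            X[number - 1][n] = counts[word]  # words outside the vocabulary are ignored
    return X


X = count_matrix(documents, vocabulary)
print(X)  # [[1, 4], [1, 0], [0, 1]]

# Mark an entry that could not be measured, and display it as "?".
MISSING = None
X_obs = [row[:] for row in X]  # copy so the original counts stay intact
X_obs[1][1] = MISSING


def render(X):
    """Format each row as text, showing unobserved entries as '?'."""
    return [" ".join("?" if v is MISSING else str(v) for v in row) for row in X]


print(render(X_obs))  # ['1 4', '1 ?', '0 1']
```

Keeping a distinct marker for “not observed” matters because a count of 0 is a real measurement, whereas a missing entry carries no information about the value.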

It is very important to realize that there are many ways to represent data and not all are equally suitable for analysis. By this I mean that in some representations the structure may be obvious, while in others it may become totally obscure. It is still there, but just harder to find. The algorithms that we will discuss are based on certain assumptions, such as “Hummers and Ferraris can be separated by a line” (see Figure ??). While this may be true if we measure weight in kilograms and height in meters, it is no longer true if we decide to recode these numbers into bit-strings. The structure is still in the data, but we would need a much more complex assumption to discover it. A lesson to be learned is thus to spend some time thinking about the representation in which the structure is as obvious as possible, and to transform the data if necessary before applying standard algorithms. In the next section we’ll discuss some standard preprocessing operations. It is often advisable to visualize the data before preprocessing and analyzing it. This will often tell you whether the structure is a good match for the algorithm you had in mind for further analysis. Chapter ?? will discuss some elementary visualization techniques.
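A small, deliberately simplified sketch of the bit-string point follows. It uses a single attribute (weight) with hypothetical values, and compares two very simple rules: one cut on the number line versus one bit of the binary encoding. This is a weaker notion of “simple structure” than the line in the text, and the function names are assumptions made for illustration.

```python
# Hypothetical weights (in some unit) for two classes of cars.
hummers = [2900, 3100, 4200]
ferraris = [1400, 1500, 2100]

N_BITS = 13  # enough bits to encode every weight above


def threshold_separates(a, b):
    """True if a single cut on the number line separates class a from class b."""
    return max(a) < min(b) or max(b) < min(a)


def to_bits(value, n_bits=N_BITS):
    """Encode a non-negative integer as a list of bits, most significant first."""
    return [(value >> k) & 1 for k in reversed(range(n_bits))]


def separating_bits(a, b, n_bits=N_BITS):
    """Bit positions where that one bit alone tells the two classes apart."""
    found = []
    for pos in range(n_bits):
        a_vals = {to_bits(v, n_bits)[pos] for v in a}
        b_vals = {to_bits(v, n_bits)[pos] for v in b}
        # The bit must be constant within each class and differ between classes.
        if len(a_vals) == 1 and len(b_vals) == 1 and a_vals != b_vals:
            found.append(pos)
    return found


print(threshold_separates(hummers, ferraris))  # True
print(separating_bits(hummers, ferraris))      # []
```

On the raw numbers, one threshold is enough; on the bit-strings, no single coordinate does the job for this data. The information is not lost, since a suitable combination of all bits recovers the original number, but a rule that inspects one coordinate at a time can no longer see it.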
