
...understood that this was a lion. They understood that all lions have these particular characteristics in common, but may differ in some others (such as the presence of a scar somewhere).

Bob has another disease, which is called over-generalization. Once he has seen an object he believes that almost everything is some, perhaps twisted, instance of the same object class (in fact, I seem to suffer from this every now and then, when I think all of machine learning can be explained by one new exciting principle). If ancestral Bob walks the savanna, has just encountered an instance of a lion and fled into a tree with his buddies, then the next time he sees a squirrel he believes it is a small instance of a dangerous lion and flees into the trees again. Over-generalization seems to be rather common among small children.

One of the main conclusions from this discussion is that we should neither over-generalize nor over-fit; we need to be just right. But just right about what? There does not seem to be one correct, God-given definition of the category "chairs". We mostly agree, but one can surely find examples that would be difficult to classify. So when do we generalize exactly right? The magic word is PREDICTION. From an evolutionary standpoint, all we have to do is make correct predictions about the aspects of life that help us survive. Nobody really cares about the definition of a lion, but we do care about our responses to the various animals (run away from a lion, chase a deer). And there are a lot of things that can be predicted in the world: this food kills me but that food is good for me; drumming my fists on my hairy chest in front of a female generates opportunities for sex; sticking my hand into that yellow-orange flickering "flame" hurts my hand; and so on. The world is wonderfully predictable and we are very good at predicting it.

So why do we care about object categories in the first place? Well, apparently they help us organize the world and make accurate predictions. The category "lions" is an abstraction, and abstractions help us to generalize. In a certain sense, learning is all about finding useful abstractions or concepts that describe the world. Take the concept "fluid": it describes all watery substances and summarizes some of their physical properties. Or take the concept "weight": an abstraction that describes a certain property of objects.

Here is one very important corollary for you: "machine learning is not in the business of remembering and regurgitating observed information, it is in the business of transferring (generalizing) properties from observed data onto new, yet unobserved data". This is the mantra of machine learning that you should repeat to yourself every night before you go to bed (at least until the final exam).

The information we receive from the world has two components to it: there is the part of the information which does not carry over to the future, the unpredictable information. We call this "noise". And then there is the information that is predictable, the learnable part of the information stream. The task of any learning algorithm is to separate the predictable part from the unpredictable part.

Now imagine Bob wants to send an image to Alice. He has to pay 1 dollar cent for every bit that he sends. If the image were completely white it would be really stupid of Bob to send the message "pixel 1: white, pixel 2: white, pixel 3: white, ...". He could just have sent the message "all pixels are white!". The blank image is completely predictable but carries very little information. Now imagine an image that consists of white noise (your television screen when the cable is not connected). To send the exact image Bob will have to send "pixel 1: white, pixel 2: black, pixel 3: black, ...". Bob cannot do better, because there is no predictable information in that image, i.e. there is no structure to be modeled. You can imagine playing a game in which you reveal one pixel at a time to someone and pay him $1 for every next pixel he predicts correctly. For the white image he can do perfectly; for the noisy picture he would be guessing at random. Real pictures are in between: some pixels are very hard to predict, while others are easier. To compress the image, Bob can extract rules such as: always predict the same color as the majority of the pixels next to you, except when there is an edge. These rules constitute the model for the regularities of the image. Instead of sending the entire image pixel by pixel, Bob will now first send his rules and ask Alice to apply them. Every time a rule fails, Bob also sends a correction: "pixel 103: white, pixel 245: black". A few rules and two corrections are obviously cheaper than 256 pixel values and no rules.
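To make the game concrete, here is a minimal sketch of the kind of rule Bob might use: predict every pixel as the majority colour of the neighbours that have already been revealed, and count how many corrections would have to be transmitted. The function names and the toy 16x16 images below are our own illustrative choices, and the edge exception mentioned above is deliberately left out to keep the rule simple.

    import numpy as np

    def predict_pixel(img, r, c):
        # Predict pixel (r, c) as the majority colour of its already-revealed
        # neighbours (above, left and upper-left in raster-scan order).
        neighbours = []
        if r > 0:
            neighbours.append(img[r - 1, c])
        if c > 0:
            neighbours.append(img[r, c - 1])
        if r > 0 and c > 0:
            neighbours.append(img[r - 1, c - 1])
        if not neighbours:               # the very first pixel: guess white (1)
            return 1
        return int(np.mean(neighbours) >= 0.5)

    def corrections(img):
        # Pixels where the rule fails; Bob transmits the rule plus this list.
        return [(r, c, int(img[r, c]))
                for r in range(img.shape[0])
                for c in range(img.shape[1])
                if predict_pixel(img, r, c) != img[r, c]]

    rng = np.random.default_rng(0)
    blank = np.ones((16, 16), dtype=int)          # all-white image
    noise = rng.integers(0, 2, size=(16, 16))     # television "snow"
    halves = np.ones((16, 16), dtype=int)
    halves[:, 8:] = 0                             # white left half, black right half

    for name, img in [("blank", blank), ("noise", noise), ("two halves", halves)]:
        print(f"{name:10s}: {len(corrections(img))} corrections out of {img.size} pixels")

The blank image needs no corrections at all, the noisy image needs a correction for roughly half of its pixels, and the structured image only needs corrections along the single edge that the simple rule cannot anticipate.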

There is one fundamental tradeoff hidden in this game. Since Bob is sending only a single image, it does not pay to send an incredibly complicated model that would require more bits to explain than simply sending all the pixel values. If he were sending 1 billion images it would pay off to first send the complicated model, because he would be saving a fraction of the bits for every image. On the other hand, if Bob wants to send just 2 pixels, there really is no need to send a model whatsoever. Therefore: the size of Bob's model depends on the amount of data he wants to transmit. Ironically, the boundary between what is model and what is noise depends on how much data we are dealing with! If we use a model that is too complex we overfit to the data at hand, i.e. part of the model represents noise. On the other hand, if we use a model that is too simple we "underfit" (over-generalize) and valuable structure remains unmodeled. Both lead to suboptimal compression of the image. But both also lead to suboptimal prediction on new images. The compression game can therefore be used to find the right model complexity for a given dataset. And so we have discovered a deep connection between learning and compression.
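As a toy illustration of this tradeoff (the bit counts below are invented for illustration, not taken from the text), we can compare the total price Bob pays under three hypothetical models: no model at all, a handful of simple rules, and a very detailed model. Which option is cheapest flips as the number of images to be transmitted grows:

    # Total transmission cost = bits to describe the model once
    #                          + correction bits needed per image under that model.
    models = {
        # name:          (model bits, correction bits per image)
        "no model":      (0,      2048),   # send every pixel verbatim
        "simple rules":  (500,     300),   # majority rule plus an edge exception
        "complex model": (50_000,   40),   # detailed statistical model of images
    }

    for n_images in (1, 10, 1_000_000):
        cost = {name: model_bits + n_images * corr_bits
                for name, (model_bits, corr_bits) in models.items()}
        best = min(cost, key=cost.get)
        print(f"{n_images:>9} image(s): cheapest is '{best}' at {cost[best]:,} bits")

For a single image the simple rules win, while the complex model only starts to pay for itself once it can be reused across a very large number of images.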

Now let’s think for a moment about what we really mean by “a model”. A model represents our prior knowledge of the world. It imposes structure that is not necessarily present in the data. We call this the “inductive bias”. Our inductive bias often comes in the form of a parametrized model; that is to say, we define a family of models but let the data determine which of these models is most appropriate. A strong inductive bias means that we don’t leave much flexibility in the model for the data to work on. We are so convinced of ourselves that we basically ignore the data. The downside is that we may be imposing a “bad bias” towards the wrong model. On the other hand, if we happen to be correct, we can learn the remaining degrees of freedom in our model from very few data-cases. Conversely, we may leave the door open for a huge family of possible models. If we now let the data zoom in on the model that best explains the training data, it will overfit to the peculiarities of that data. Now imagine you sampled 10 datasets of the same size N and trained one of these very flexible models separately on each of them (note that in reality you only have access to one such dataset, but please play along in this thought experiment). Let’s say we want to determine the value of some parameter θ. Because the models are so flexible, we can actually model the idiosyncrasies of each dataset. The result is that the value of θ is likely to be very different for each dataset. But because we didn’t impose much inductive bias, the average of many such estimates will be about right. We say that the bias is small, but the variance is high. In the case of very restrictive models the opposite happens: the bias is potentially large but the variance is small. Note that not only is a large bias bad (for obvious reasons), a large variance is bad as well: because we only have one dataset of size N, our estimate could be very far off simply because we were unlucky with the dataset we were given. What we should therefore strive for is to inject all our prior knowledge into the learning problem (this makes learning easier) but avoid injecting the wrong prior knowledge. If we don’t trust our prior knowledge we should let the data speak. However, letting the data speak too much might lead to overfitting, so we need to find the boundary between too complex and too simple a model and get its complexity just right. Access to more data means that the data can speak more relative to the prior knowledge. That, in a nutshell, is what machine learning is all about.
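The thought experiment with the ten datasets can be simulated directly. The sketch below is our own illustration (the true function, noise level, polynomial degrees and test input are arbitrary choices), and it lets the model’s prediction at a single input stand in for the parameter θ: the restrictive degree-1 model gives nearly the same answer on every dataset but is systematically off (large bias, small variance), while the flexible degree-9 model is about right on average but scatters from dataset to dataset (small bias, large variance).

    import numpy as np

    rng = np.random.default_rng(42)

    def true_function(x):
        return np.sin(2 * np.pi * x)       # the regularity the "world" follows

    N, n_datasets = 20, 10                 # ten datasets of the same size N
    x = np.linspace(0.0, 1.0, N)
    x_test = 0.25                          # input at which we track the estimate
    true_value = true_function(x_test)

    for degree, label in [(1, "restrictive (degree 1)"), (9, "flexible (degree 9)")]:
        estimates = []
        for _ in range(n_datasets):
            y = true_function(x) + rng.normal(scale=0.3, size=N)   # one noisy dataset
            coeffs = np.polyfit(x, y, deg=degree)                  # train the model
            estimates.append(np.polyval(coeffs, x_test))           # stand-in for theta
        estimates = np.array(estimates)
        print(f"{label}: bias = {estimates.mean() - true_value:+.2f}, "
              f"variance = {estimates.var():.3f}")

In reality we only ever see one of these datasets, which is exactly why a large variance hurts: the single estimate we happen to get may be far from the average.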

