To figure out the label of a test case, we simply look around and see what labels our neighbors have. Asking only your closest neighbor is like betting all your money on a single piece of advice, and you might get really unlucky if your closest neighbor happens to be the odd one out. It's typically better to ask for several opinions before making your decision. However, if you ask around too much, you will be forced to take advice from data-cases that are no longer very similar to you. So there is some optimal number of neighbors to ask, which may be different for every problem. Determining this optimal number of neighbors is not easy, but we can again use cross validation (section ??) to estimate it.
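To make this more concrete, here is a minimal sketch in Python/NumPy of the procedure just described: classify a test case by a majority vote among its k nearest training cases, and pick k by cross validation. The dataset X, y, the candidate values of k, and the helper names are illustrative placeholders, not part of this chapter.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Classify each test case by a majority vote among its k nearest training cases."""
    preds = []
    for x in X_test:
        # Euclidean distance from x to every stored training case.
        dists = np.linalg.norm(X_train - x, axis=1)
        # Labels of the k closest neighbors.
        neighbor_labels = y_train[np.argsort(dists)[:k]]
        # Majority vote over those labels.
        labels, counts = np.unique(neighbor_labels, return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

def choose_k_by_cv(X, y, candidate_ks, n_folds=5, seed=0):
    """Pick the k with the highest average accuracy over n_folds cross-validation folds."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, n_folds)
    avg_acc = {}
    for k in candidate_ks:
        accs = []
        for f in range(n_folds):
            val = folds[f]
            train = np.concatenate([folds[g] for g in range(n_folds) if g != f])
            preds = knn_predict(X[train], y[train], X[val], k)
            accs.append(np.mean(preds == y[val]))
        avg_acc[k] = float(np.mean(accs))
    best_k = max(avg_acc, key=avg_acc.get)
    return best_k, avg_acc
```

For example, `best_k, scores = choose_k_by_cv(X, y, candidate_ks=[1, 3, 5, 7, 9])` would return the value of k with the best average held-out accuracy, together with the score for each candidate.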
So what is good and bad about kNN? First, its simplicity makes it attractive. Very few assumptions about the data are used in the classification process. This property can also be a disadvantage: if you have prior knowledge about how the data was generated, it's better to use it, because less information has to be extracted from the data. A second consideration is computation time and memory efficiency. Assume you have a very large dataset, but you need to make decisions very quickly. As an example, consider surfing the web-pages of Amazon.com. Whenever you search for a book, it likes to suggest 10 others. To do that it could classify books into categories and suggest the top ranked books in that category. kNN requires Amazon to store all features of all books in a location that is accessible for fast computation. Moreover, to classify a new case, kNN has to perform the neighborhood search all over again. Clearly, there are tricks that can be played with smart indexing, but wouldn't it be much easier if we had summarized all books by a simple classification function f_θ(X) that "spits out" a class for any combination of features X?
This distinction between algorithms/models that summarize the data through parameters and those that require memorizing every data item is often called "parametric" versus "non-parametric". It's important to realize that this is somewhat of a misnomer: non-parametric models can have parameters (such as the number of neighbors to consider). The key distinction is rather whether the data is summarized through a set of parameters which together comprise a classification function f_θ(X), or whether we retain all the data to do the classification "on the fly".
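As an illustration of the parametric side of this distinction (a sketch only, not the method of this chapter), the NumPy code below fits a simple logistic-regression classifier: the entire training set is boiled down into a single parameter vector θ, after which f_θ(X) can classify new cases without ever touching the training data again. Contrast this with the kNN sketch above, which must store every training case and redo the neighborhood search for every query. Binary labels y in {0, 1} are assumed.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_steps=2000):
    """Summarize the whole training set (X, y) into one parameter vector theta."""
    Xb = np.hstack([X, np.ones((len(X), 1))])      # append a constant bias feature
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ theta))      # predicted probability of class 1
        theta -= lr * Xb.T @ (p - y) / len(y)      # gradient step on the logistic loss
    return theta                                    # the training data can now be discarded

def f_theta(theta, X):
    """The classification function f_theta(X): needs only theta, not the training data."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (1.0 / (1.0 + np.exp(-Xb @ theta)) >= 0.5).astype(int)
```

The memory trade-off is then plain: the parametric classifier keeps a handful of numbers per feature, while kNN keeps the entire dataset and pays the full distance computation at every prediction.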
kNN is also known to suffer from the "curse of high dimensions". If we use many features to describe our data, and in particular when most of these features turn out to be irrelevant and noisy for the classification, then kNN is quickly confused. Imagine that there are two features that contain all the information necessary for a perfect classification, but that we have added 98 noisy, uninformative features. The neighbors in the two-dimensional space of the relevant features are unfortunately no longer likely to be the neighbors in the 100-dimensional space,
because 98 noisy dimensions have been added. This effect is detrimental to the kNN algorithm. Once again, it is very important to choose your initial representation with great care and to preprocess the data before you apply the algorithm. In this case, preprocessing takes the form of "feature selection", on which a whole book could be written by itself.
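A small synthetic experiment (a sketch with made-up data, mirroring the two-informative-plus-98-noisy setup described above) lets you measure the effect directly: check how often a point's nearest neighbor in the full 100-dimensional space is still the same point it was in the clean two-dimensional space. With this much noise the overlap tends to be small.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two informative features that determine the label perfectly.
X_info = rng.normal(size=(n, 2))
y = (X_info[:, 0] + X_info[:, 1] > 0).astype(int)

# Add 98 purely noisy, uninformative features.
X_full = np.hstack([X_info, rng.normal(size=(n, 98))])

def nearest_neighbor(X, i):
    """Index of the closest other point to point i (Euclidean distance)."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf                      # exclude the point itself
    return int(np.argmin(d))

# How often is the nearest neighbor in 100 dimensions the same point as in 2-D?
same_neighbor = np.mean([nearest_neighbor(X_full, i) == nearest_neighbor(X_info, i)
                         for i in range(n)])
# How often does the 100-D nearest neighbor at least carry the correct label?
label_agrees = np.mean([y[nearest_neighbor(X_full, i)] == y[i] for i in range(n)])
print(f"same neighbor as in 2-D: {same_neighbor:.2f}, "
      f"neighbor has the right label: {label_agrees:.2f}")
```

Keeping only the informative columns before running kNN, which is exactly what feature selection does, restores the clean two-dimensional neighborhoods.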