factors that can explain the data. Knowing these factors is like denoising the data: we first peel off the uninteresting bits and pieces of the signal and then transform what remains onto an often lower dimensional space that exposes the underlying factors.
There are two dominant classes of unsupervised learning algorithms. Clustering-based algorithms assume that the data organizes into groups; finding these groups is then the task of the ML algorithm, and the identity of the group is the semantic factor. Another class of algorithms strives to project the data onto a lower dimensional space. This mapping can be nonlinear, but the underlying assumption is that the data is approximately distributed on some (possibly curved) lower dimensional manifold embedded in the input space. Unrolling that manifold is then the task of the learning algorithm, and in this case the dimensions of that space should be interpreted as the semantic factors.
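To make the two themes concrete, here is a minimal sketch in Python. It assumes scikit-learn is available; the synthetic data and the particular choices (three groups, two projected dimensions, k-means and PCA as representatives of the two classes of algorithms) are illustrative only and not prescribed by the text.

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    # Unlabeled data: 300 points in 10 dimensions that secretly form 3 groups.
    X, _ = make_blobs(n_samples=300, n_features=10, centers=3, random_state=0)

    # Clustering view: recover the group identity as the semantic factor.
    groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # Projection view: map the data onto a lower dimensional space whose
    # dimensions we then try to interpret as semantic factors.
    Z = PCA(n_components=2).fit_transform(X)

    print(groups[:10])   # inferred group labels for the first few points
    print(Z.shape)       # (300, 2): the same points in the low dimensional space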
There are many variations on the above themes. For instance, one is often confronted with a situation where you have access to many unlabeled data-cases (only $X_i$) and far fewer labeled instances (both $(X_i, Y_i)$). Take the task of classifying news articles by topic (weather, sports, national news, international news, etc.). Some people may have labeled some news articles by hand, but there won’t be all that many of those. However, we do have a very large digital library of scanned newspapers available. Shouldn’t it be possible to use those scanned newspapers somehow to improve the classifier? Imagine that the data naturally clusters into well separated groups (for instance because news articles reporting on different topics use very different words). This is depicted in Figure ??. Note that only very few cases have labels attached to them. From this figure it becomes clear that the expected optimal decision boundary nicely separates these clusters. In other words, you do not expect the decision boundary to cut through one of the clusters. Yet that is exactly what would happen if you were only using the labeled data. Hence, by simply requiring that decision boundaries do not cut through regions of high probability we can improve our classifier. The subfield that studies how to improve classification algorithms using unlabeled data goes under the name “semi-supervised learning”.
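As a rough illustration of this idea, the sketch below marks unlabeled cases with -1 and lets scikit-learn’s LabelSpreading propagate a handful of known labels through regions of high data density, so the resulting boundary avoids cutting through clusters. The data and parameter settings are invented for the example.

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.semi_supervised import LabelSpreading

    X, y_true = make_blobs(n_samples=200, centers=2, random_state=0)

    # Pretend only one article per topic was labeled by hand; the rest are -1.
    y = np.full(len(X), -1)
    labeled_idx = [np.where(y_true == c)[0][0] for c in (0, 1)]
    y[labeled_idx] = y_true[labeled_idx]

    # Spread the two known labels through the dense regions of the data.
    model = LabelSpreading(kernel='rbf', gamma=1.0).fit(X, y)
    print((model.transduction_ == y_true).mean())  # accuracy on all points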
A fourth major class of learning algorithms deals with problems where the supervised signal consists only of rewards (or costs) that are possibly delayed. Consider for example a mouse that needs to solve a labyrinth in order to obtain his food. While making his decisions he will not receive any feedback (apart from perhaps slowly getting more hungry). It is only at the end, when he reaches the cheese, that he receives his positive feedback, and he will have to use this to reinforce his perhaps random earlier decisions that led him to the cheese. These problems
fall under the name “reinforcement learning”. It is a very general setup in which almost all known cases of machine learning can be cast, but this generality also means that these types of problems can be very difficult. The most general RL problems do not even assume that you know what the world looks like (i.e. the maze for the mouse), so you have to simultaneously learn a model of the world and solve your task in it. This dual task induces interesting trade-offs: should you invest time now to learn machine learning and reap the benefit later in terms of a high salary working for Yahoo!, or should you stop investing now and start exploiting what you have learned so far? This is clearly a function of age, or the time horizon that you still have to take advantage of these investments. The mouse is similarly confronted with the problem of whether he should try out a new alley in the maze that could cut down his time to reach the cheese considerably, or whether he should simply stay with what he has learned and take the route he already knows. This clearly depends on how often he thinks he will have to run through the same maze in the future. We call this the exploration versus exploitation trade-off. The reason that RL is a very exciting field of research is its biological relevance. Do we not also have to figure out how the world works and survive in it?
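A toy version of this trade-off can be written down directly. The sketch below runs tabular Q-learning with an epsilon-greedy policy on a tiny corridor “maze” whose reward (the cheese) sits at the last state; the environment and all constants are made up for illustration and are not part of the text.

    import numpy as np

    n_states, actions = 6, [-1, +1]        # states 0..5, cheese (reward) at state 5
    Q = np.zeros((n_states, len(actions)))
    alpha, gamma, epsilon = 0.5, 0.9, 0.2  # learning rate, discount, exploration rate

    rng = np.random.default_rng(0)
    for episode in range(200):
        s = 0
        while s != n_states - 1:
            # Explore with probability epsilon, otherwise exploit what we know.
            a = rng.integers(len(actions)) if rng.random() < epsilon else int(Q[s].argmax())
            s_next = min(max(s + actions[a], 0), n_states - 1)
            r = 1.0 if s_next == n_states - 1 else 0.0   # reward only at the very end
            # Temporal-difference update: propagate the delayed reward backwards.
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next

    # Learned policy: mostly 1s ("move right"); the terminal state is never updated.
    print(Q.argmax(axis=1))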
Let’s go back to the news articles. Assume we have control over which article we will label next. Which one would we pick? Surely the one that would be most informative in some suitably defined sense. Or consider the mouse in the maze. Given that he decides to explore, where should he explore? Surely he will try to seek out alleys that look promising, i.e. alleys that he expects will maximize his reward. We call the problem of finding the next best data-case to investigate “active learning”.
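One common recipe, uncertainty sampling, captures the “most informative in some suitably defined sense” idea: repeatedly label the unlabeled case the current classifier is least sure about. The sketch below assumes scikit-learn and synthetic data; it is one possible instantiation, not the only one.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    labeled = list(range(10))                 # start with 10 hand-labeled cases
    pool = [i for i in range(len(X)) if i not in labeled]

    for _ in range(20):                       # query budget of 20 extra labels
        clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
        probs = clf.predict_proba(X[pool])
        # "Most informative" here = closest to the decision boundary (prob ~ 0.5).
        query = pool[int(np.abs(probs[:, 1] - 0.5).argmin())]
        labeled.append(query)                 # an oracle would now supply y[query]
        pool.remove(query)

    print(len(labeled))                       # 30 labels, chosen where they help most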
One may also be faced with learning multiple tasks at the same time. These tasks are related but not identical. For instance, consider the problem of recommending movies to customers of Netflix. Each person is different and would really require a separate model to make the recommendations. However, people also share commonalities, especially when they show evidence of being of the same “type” (for example a science-fiction fan or a comedy fan). We can learn personalized models but share features between them. Especially for new customers, for whom we don’t have access to many rated movies, we need to “draw statistical strength” from customers who seem to be similar. From this example it has hopefully become clear that we are trying to learn models for many different yet related problems, and that we can build better models if we share some of the things learned for one task with the other ones. The trick is to share neither too much nor too little, and how much we should share depends on how much data and prior knowledge we have access to for each task. We call this subfield of machine learning “multi-task learning”.
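One simple way to share neither too much nor too little is to fit a separate linear model per customer while shrinking each one toward a shared weight vector, with a single constant controlling how strongly the tasks are tied together. The sketch below is an invented illustration of that idea; the data, the alternating procedure, and the constant lam are all assumptions rather than anything prescribed here.

    import numpy as np

    rng = np.random.default_rng(0)
    n_tasks, n_features = 5, 8
    w_true = rng.normal(size=n_features)                 # common underlying "taste"
    tasks = []
    for t in range(n_tasks):
        n = rng.integers(3, 30)                          # some customers rate few movies
        X = rng.normal(size=(n, n_features))
        y = X @ (w_true + 0.1 * rng.normal(size=n_features)) + 0.1 * rng.normal(size=n)
        tasks.append((X, y))

    lam = 1.0                                            # how strongly the tasks share
    w_shared = np.zeros(n_features)
    for _ in range(20):                                  # alternate task fits and sharing
        W = []
        for X, y in tasks:
            # Ridge-style solve, regularized toward the shared weights.
            A = X.T @ X + lam * np.eye(n_features)
            W.append(np.linalg.solve(A, X.T @ y + lam * w_shared))
        w_shared = np.mean(W, axis=0)

    print(np.round(w_shared - w_true, 2))                # close to zero: shared taste recovered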