6.1 The Naive Bayes Model

NB is mostly used when dealing with discrete-valued attributes. We will explain the algorithm in this context, but note that extensions to continuous-valued attributes are possible. We will restrict attention to classification problems between two classes and refer to section ?? for approaches to extend this to more than two classes.

In our usual notation we consider $D$ discrete-valued attributes $X_i \in [0, \ldots, V_i]$, $i = 1, \ldots, D$. Note that each attribute can have a different number of values $V_i$. If the original data was supplied in a different format, e.g. $X_1 = [\text{Yes}, \text{No}]$, then we simply reassign these values to fit the above format, $\text{Yes} = 1$, $\text{No} = 0$ (or the reverse). In addition we are provided with a supervised signal, in this case the labels $Y = 0$ and $Y = 1$ indicating that the data-item fell in class $0$ or class $1$. Again, which class is assigned to $0$ or $1$ is arbitrary and has no impact on the performance of the algorithm.
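To make this encoding concrete, here is a minimal sketch in Python of mapping raw categorical values and labels to the integer format above. The attribute names and value dictionaries are hypothetical, chosen purely for illustration; the particular assignment of values to codes is arbitrary, as noted.

```python
# Map each attribute's raw values to integer codes 0 .. V_i - 1.
# These dictionaries are illustrative placeholders, not part of the algorithm.
attribute_codes = {
    "X1": {"No": 0, "Yes": 1},                # V_1 = 2
    "X2": {"low": 0, "medium": 1, "high": 2}  # V_2 = 3
}
label_codes = {"ham": 0, "spam": 1}

def encode(raw_item, raw_label):
    """Turn a dict of raw attribute values and a raw label into integer codes."""
    x = [attribute_codes[name][value] for name, value in sorted(raw_item.items())]
    y = label_codes[raw_label]
    return x, y

# Example: encode({"X1": "Yes", "X2": "high"}, "spam") returns ([1, 2], 1)
```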

Before we move on, let's consider a real-world example: spam filtering. Every day your mailbox gets bombarded with hundreds of spam emails. To give an example of the traffic that it generates: the University of California, Irvine receives on the order of 2 million spam emails a day. Fortunately, the bulk of these emails (approximately 97%) is filtered out or dumped into your spam-box and will never reach your attention. How is this done? Well, it turns out to be a classic example of a classification problem: spam or ham, that's the question. Let's say that spam will receive a label 1 and ham a label 0. Our task is thus to label each new email with either 0 or 1. What are the attributes? Rephrasing this question, what would you measure in an email to see if it is spam? Certainly, if I were to read "viagra" in the subject I would stop right there and dump it in the spam-box. What else? Here are a few: "enlargement, cheap, buy, pharmacy, money, loan, mortgage, credit" and so on. We can build a dictionary of words that we can detect in each email. This dictionary could also include word phrases such as "buy now" or "penis enlargement"; one can make phrases as sophisticated as necessary. One could measure whether the words or phrases appear at least once, or one could count the actual number of times they appear. Spammers know about the way these spam filters work and counteract with slight misspellings of certain key words. Hence we might also want to detect words like "via gra" and so on. In fact, a small arms race has ensued where spam filters and spam generators find new tricks to counteract the tricks of the "opponent". Putting all these subtleties aside for a moment, we'll simply assume that we measure a number of these attributes for every email in a dataset. We'll also assume that we have spam/ham labels for these emails, which were acquired by someone removing spam emails by hand from his/her inbox. Our task is then to train a predictor for spam/ham labels for future emails where we have access to attributes but not to labels.
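As a rough illustration of how such attributes could be extracted, the sketch below turns an email into either binary presence indicators or word counts. The keyword dictionary and the tokenization are deliberately simplistic assumptions made for this example, not a recipe for a real spam filter.

```python
import re

# A hypothetical, tiny dictionary of spam-indicative words and phrases.
DICTIONARY = ["viagra", "enlargement", "cheap", "buy now", "pharmacy",
              "money", "loan", "mortgage", "credit"]

def extract_attributes(email_text, binary=True):
    """Return one attribute per dictionary entry: presence (0/1) or a raw count."""
    text = email_text.lower()
    counts = [len(re.findall(re.escape(phrase), text)) for phrase in DICTIONARY]
    if binary:
        return [min(c, 1) for c in counts]
    return counts

# Example: extract_attributes("Buy now! Cheap viagra and more viagra.")
# returns [1, 0, 1, 1, 0, 0, 0, 0, 0]   (presence indicators);
# with binary=False the "viagra" entry would be 2.
```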

The NB model is what we call a "generative" model. This means that we imagine how the data was generated in an abstract sense. For emails, this works as follows: an imaginary entity first decides how many spam and ham emails it will generate on a daily basis. Say it decides to generate 40% spam and 60% ham. We will assume this doesn't change with time (of course it does, but we will make this simplifying assumption for now). It will then decide what the chance is that a certain word appears $k$ times in a spam email. For example, the word "viagra" has a chance of 96% to not appear at all, 1% to appear once, 0.9% to appear twice, etc. These probabilities are clearly different for spam and ham; "viagra" should have a much smaller probability to appear in a ham email (but it could of course; consider that I send this text to my publisher by email). Given these probabilities, we could then go on and try to generate emails that actually look like real emails, i.e. with proper sentences, but we won't need that in the following. Instead we make the simplifying assumption that an email consists of a "bag of words" in random order.
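This generative story can be made explicit with a small sampling sketch. The 40%/60% class proportions and the "viagra" probabilities are the illustrative numbers from the text; every other number below is an assumption invented purely for the example.

```python
import random

# Class prior: the imaginary entity generates 40% spam (Y=1) and 60% ham (Y=0).
P_CLASS = {1: 0.4, 0: 0.6}

# For each (word, class) pair, a distribution over how often the word appears
# (0, 1, 2, or 3+ times). The spam entries for "viagra" follow the text;
# the remaining numbers are made-up placeholders.
P_COUNT = {
    ("viagra", 1): [0.96, 0.01, 0.009, 0.021],
    ("viagra", 0): [0.999, 0.0009, 0.00009, 0.00001],
    ("cheap",  1): [0.80, 0.15, 0.04, 0.01],
    ("cheap",  0): [0.95, 0.04, 0.009, 0.001],
}

def generate_email():
    """Sample a label, then sample each word count independently given the label
    (the 'bag of words' assumption: word order and interactions are ignored)."""
    y = random.choices([1, 0], weights=[P_CLASS[1], P_CLASS[0]])[0]
    counts = {word: random.choices(range(4), weights=P_COUNT[(word, y)])[0]
              for word in ("viagra", "cheap")}
    return y, counts

# Example: generate_email() might return (1, {'viagra': 0, 'cheap': 1})
```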
