6.4 Regularization

The spam filter algorithm that we discussed in the previous sections unfortunately does not work very well if we wish to use many attributes (words, word-phrases). The reason is that for many attributes we may not encounter a single example in the dataset. Say, for example, that we defined the word “Nigeria” as an attribute, but that our dataset did not include one of those spam emails where you are promised mountains of gold if you invest your money in someone's bank in Nigeria. Also assume there are indeed a few ham emails which talk about the nice people in Nigeria. Then any future email that mentions Nigeria is classified as ham with 100% certainty. More importantly, one cannot recover from this decision even if the email also mentions viagra, enlargement, mortgage and so on, all in a single email! This can be seen from the fact that $\log \hat{P}_{\text{spam}}(X_{\text{“Nigeria”}} > 0) = -\infty$, while the final score is a sum of these individual word-scores.
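To make this concrete, here is a minimal sketch (the word list and counts are made up purely for illustration) of how a single word that never appears in the observed spam drags the summed log-score to minus infinity, no matter how incriminating the other words are:

```python
import math

# Hypothetical counts: out of 100 spam emails in the dataset,
# "Nigeria" happens to appear in none of them.
n_spam = 100
n_spam_with_word = {"viagra": 40, "mortgage": 25, "Nigeria": 0}

def log_prob_spam(word):
    # Maximum-likelihood estimate of P_spam(word present), without smoothing.
    p = n_spam_with_word[word] / n_spam
    return math.log(p) if p > 0 else float("-inf")

# The spam score is a sum of per-word log-probabilities, so one
# zero-count word makes the whole score -infinity.
score = sum(log_prob_spam(w) for w in ["viagra", "mortgage", "Nigeria"])
print(score)  # -inf: this email can never be classified as spam
```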

To counteract this phenomenon, we give each word in the dictionary a small probability of being present in any email (spam or ham), before seeing the data. This process is called smoothing. The impact on the estimated probabilities is given below,

$$\hat{P}_{\text{spam}}(X_i = j) \;=\; \frac{\alpha + \#\{\text{spam emails with } X_i = j\}}{V_i\,\alpha + \#\{\text{spam emails}\}} \qquad (6.12)$$

$$\hat{P}_{\text{ham}}(X_i = j) \;=\; \frac{\alpha + \#\{\text{ham emails with } X_i = j\}}{V_i\,\alpha + \#\{\text{ham emails}\}} \qquad (6.13)$$

where $V_i$ is the number of possible values of attribute $i$. Thus, $\alpha$ can be interpreted as a small, possibly fractional, number of “pseudo-observations” of the attribute in question: it is as if we added these observations to the actual dataset.
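As an illustration, the following sketch evaluates the smoothed estimate of equations (6.12)–(6.13) for a binary attribute ($V_i = 2$); the function name and the counts are hypothetical:

```python
def smoothed_estimate(count, n_emails, alpha=1.0, n_values=2):
    # (6.12)/(6.13): add alpha pseudo-observations of each of the
    # V_i = n_values possible attribute values before normalizing.
    return (alpha + count) / (n_values * alpha + n_emails)

# Without smoothing, "Nigeria" unseen in 100 spam emails gives 0/100 = 0;
# with one pseudo-observation the estimate stays small but nonzero.
print(smoothed_estimate(count=0, n_emails=100))    # 1/102 ~ 0.0098
print(smoothed_estimate(count=40, n_emails=100))   # 41/102 ~ 0.40
```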

What value for $\alpha$ should we use? Fitting its value on the training dataset will not work: we added it precisely because we assumed there was too little data in the first place (we had not yet received one of those annoying “Nigeria” emails), so fitting it on the same data runs into the phenomenon of overfitting. However, we can use the trick described in section ??, where we split the data into two pieces. We learn a model on one chunk and adjust $\alpha$ such that performance on the other chunk is optimal. We play this game multiple times with different splits and average the results.
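A possible sketch of this procedure is given below. It assumes binary attributes, uses a bare-bones naive Bayes fit with class priors omitted for brevity, and all function names are hypothetical rather than part of the text:

```python
import math
import random

def fit(X, y, alpha):
    # Smoothed word probabilities per class, as in (6.12)-(6.13), assuming
    # binary attributes (V_i = 2). X is a list of 0/1 feature vectors,
    # y the list of labels (1 = spam, 0 = ham).
    model = {}
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        model[c] = [(alpha + sum(col)) / (2 * alpha + len(rows))
                    for col in zip(*rows)]
    return model

def accuracy(model, X, y):
    # Score an email by summing per-word log-probabilities (class priors
    # omitted for brevity) and compare the predicted label to the truth.
    def score(x, c):
        return sum(math.log(p) if xi else math.log(1.0 - p)
                   for xi, p in zip(x, model[c]))
    preds = [1 if score(x, 1) > score(x, 0) else 0 for x in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)

def tune_alpha(X, y, candidates, n_splits=5, seed=0):
    # Average held-out accuracy over several random 50/50 splits and
    # return the candidate alpha that performs best on the held-out halves.
    rng = random.Random(seed)
    scores = {a: 0.0 for a in candidates}
    for _ in range(n_splits):
        idx = list(range(len(X)))
        rng.shuffle(idx)
        half = len(idx) // 2
        tr, va = idx[:half], idx[half:]
        for a in candidates:
            m = fit([X[i] for i in tr], [y[i] for i in tr], a)
            scores[a] += accuracy(m, [X[i] for i in va],
                                  [y[i] for i in va]) / n_splits
    return max(scores, key=scores.get)
```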

