Support Vector Regression
In kernel ridge regression we have seen that the final solution was not sparse in the variables $\alpha$. We will now formulate a regression method that is sparse, i.e. it has the concept of support vectors that determine the solution.
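To make the contrast concrete, recall that the kernel ridge regression dual solution can be written (with $\lambda$ the ridge regularization constant; the symbol may differ slightly from the earlier section) as
$$\boldsymbol{\alpha} = (\mathbf{K} + \lambda \mathbf{I})^{-1}\mathbf{y},$$
and since $(\mathbf{K} + \lambda \mathbf{I})^{-1}$ is generically a dense matrix, every $\alpha_i$ is nonzero: every data-case contributes to the predictor $f(\mathbf{x}) = \sum_i \alpha_i K(\mathbf{x}_i, \mathbf{x})$.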
The thing to notice is that the sparseness arose from the complementary slackness conditions, which in turn came from the fact that we had inequality constraints. In the SVM the penalty that was paid for being on the wrong side of the support plane was given by $C\sum_i \xi_i^k$ for positive integers $k$, where $\xi_i$ is the orthogonal distance away from the support plane. Note that the term $\frac{1}{2}\|\mathbf{w}\|^2$ was there to penalize large $\mathbf{w}$ and hence to regularize the solution. Importantly, there was no penalty if a data-case was on the right side of the plane. Because all these data-points do not have any effect on the final solution, the $\alpha$ was sparse. Here we do the same thing: we introduce a penalty for being too far away from the predicted line $\mathbf{w}^T\Phi_i + b$, but once you are close enough, i.e. in some “$\epsilon$-tube” around this line, there is no penalty. We thus expect that all the data-cases which lie inside the tube will have no impact on the final solution and hence have corresponding $\alpha_i = 0$.

Using the analogy of springs: in the case of ridge regression the springs were attached between the data-cases and the decision surface, hence every item had an impact on the position of this boundary through the force it exerted (recall that the surface was made of “rubber” and pulled back because it was parameterized using a finite number of degrees of freedom, or because it was regularized). For SVR there are only springs attached between the data-cases outside the tube, and these attach to the tube, not to the decision boundary. Hence, data-items inside the tube have no impact on the final solution (or rather, changing their position slightly does not perturb the solution).
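As a quick numerical check of this sparseness claim, here is a minimal sketch using scikit-learn's SVR (an assumption; any solver of the $\epsilon$-insensitive formulation introduced below would do). Data-cases that end up strictly inside the $\epsilon$-tube should never appear among the support vectors.

```python
# Minimal sketch (assuming scikit-learn is available): fit an SVR on toy data
# and verify that points strictly inside the epsilon-tube are not support vectors.
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(40, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(40)

model = SVR(kernel="rbf", C=10.0, epsilon=0.2)
model.fit(X, y)

# Residuals w.r.t. the fitted function; the tube has half-width epsilon.
residuals = np.abs(model.predict(X) - y)
strictly_inside = residuals < model.epsilon - 1e-8

is_support = np.zeros(len(X), dtype=bool)
is_support[model.support_] = True

print("support vectors:", is_support.sum(), "out of", len(X))
# Expected: 0 -- only points on or outside the tube carry nonzero alpha_i.
print("strictly inside the tube AND support vector:", np.sum(strictly_inside & is_support))
```

Changing the position of a point that lies strictly inside the tube (without pushing it onto the boundary) leaves the fitted function unchanged, which is exactly the spring picture above.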
We introduce different constraints for violating the tube constraint from above