and from below,
minimize_{w,ξ,ξ̂}   (1/2)||w||² + (C/2) Σ_i (ξ_i² + ξ̂_i²)
subject to          w^T Φ_i + b − y_i ≤ ε + ξ_i    ∀i
                    y_i − w^T Φ_i − b ≤ ε + ξ̂_i    ∀i                         (9.1)

The primal Lagrangian becomes,
L_P = (1/2)||w||² + (C/2) Σ_i (ξ_i² + ξ̂_i²) + Σ_i α_i (w^T Φ_i + b − y_i − ε − ξ_i) + Σ_i α̂_i (y_i − w^T Φ_i − b − ε − ξ̂_i)                         (9.2)
Remark I: We could have added the constraints ξ_i ≥ 0 and ξ̂_i ≥ 0. However, it is not hard to see that the final solution satisfies them automatically, so there is no point in imposing constraints that the optimal solution obeys anyway. To see this, imagine some ξ_i is negative; then by setting ξ_i = 0 the cost is lower and none of the constraints is violated, so ξ_i = 0 is preferred. By the same reasoning, at least one of ξ_i, ξ̂_i is always zero: inside the tube both are zero, outside the tube one of them is zero. This means that at the solution we have ξ_i ξ̂_i = 0.
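As a quick numerical illustration of Remark I (a minimal sketch that is not part of the text; the helper name slacks and the toy numbers are made up purely for illustration):

    def slacks(f_x, y, eps):
        """Optimal epsilon-insensitive slacks for a single data-case.
        xi    = max(0, f(x) - y - eps): active when y lies below the epsilon-tube
        xihat = max(0, y - f(x) - eps): active when y lies above the epsilon-tube"""
        xi = max(0.0, f_x - y - eps)
        xihat = max(0.0, y - f_x - eps)
        return xi, xihat

    # At most one slack is nonzero, and inside the tube both vanish,
    # so xi * xihat = 0 at the solution.
    print(slacks(f_x=1.0, y=2.0, eps=0.5))   # y above the tube:  (0.0, 0.5)
    print(slacks(f_x=1.0, y=1.2, eps=0.5))   # y inside the tube: (0.0, 0.0)
    print(slacks(f_x=1.0, y=0.2, eps=0.5))   # y below the tube:  (0.3, 0.0)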
Remark II: Note that we do not fix the scale by setting ε = 1 as in the SVM case. The reason is that {y_i} now determines the scale of the problem, i.e. we have not over-parameterized the problem.
We now take the derivatives w.r.t. w, b, ξ and ξ̂ to find the following KKT conditions (there are more of course),
w = Σ_i (α̂_i − α_i) Φ_i                         (9.3)
ξ_i = α_i / C,    ξ̂_i = α̂_i / C                 (9.4)
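For completeness, a sketch of the stationarity conditions from which (9.3), (9.4) and the constraint in (9.5) follow, assuming the (C/2)-weighted quadratic slack penalty as written in (9.2):

    ∂L_P/∂w   = w − Σ_i (α̂_i − α_i) Φ_i = 0     ⇒   w = Σ_i (α̂_i − α_i) Φ_i
    ∂L_P/∂b   = Σ_i (α_i − α̂_i) = 0             ⇒   Σ_i (α̂_i − α_i) = 0
    ∂L_P/∂ξ_i = C ξ_i − α_i = 0                  ⇒   ξ_i = α_i / C
    ∂L_P/∂ξ̂_i = C ξ̂_i − α̂_i = 0                 ⇒   ξ̂_i = α̂_i / C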
Plugging this back in, and using that we also have α_i α̂_i = 0, we find the dual problem,

maximize_{α,α̂}   −(1/2) Σ_{ij} (α̂_i − α_i)(α̂_j − α_j) (K_ij + δ_ij / C) − ε Σ_i (α_i + α̂_i) + Σ_i y_i (α̂_i − α_i)
subject to        Σ_i (α̂_i − α_i) = 0
                  α_i ≥ 0,  α̂_i ≥ 0    ∀i                         (9.5)

with K_ij = Φ_i^T Φ_j.
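The "plugging back in" is not spelled out above; a brief sketch of the elimination (again assuming the (C/2) slack penalty of (9.2)): substituting w = Σ_i (α̂_i − α_i) Φ_i and ξ_i = α_i/C, ξ̂_i = α̂_i/C into L_P gives

    (1/2)||w||² − w^T Σ_i (α̂_i − α_i) Φ_i = −(1/2) Σ_{ij} (α̂_i − α_i)(α̂_j − α_j) K_ij
    (C/2) Σ_i (ξ_i² + ξ̂_i²) − Σ_i (α_i ξ_i + α̂_i ξ̂_i) = −(1/2C) Σ_i (α_i² + α̂_i²)

and, using α_i α̂_i = 0, we have α_i² + α̂_i² = (α̂_i − α_i)², which yields the δ_ij/C term in (9.5). The remaining terms contribute −ε Σ_i (α_i + α̂_i) + Σ_i y_i (α̂_i − α_i), while b multiplies Σ_i (α_i − α̂_i) = 0.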
From the complementary slackness conditions we can read off the sparseness of the solution:
α_i (w^T Φ_i + b − y_i − ε − ξ_i) = 0                         (9.6)
α̂_i (y_i − w^T Φ_i − b − ε − ξ̂_i) = 0                         (9.7)
ξ_i ξ̂_i = 0,    α_i α̂_i = 0                                   (9.8)
where we added the last conditions by hand (they do not follow directly from the formulation). Now we clearly see that if a case lies above the tube, ξ̂_i will take on the smallest possible value that satisfies the constraint, ξ̂_i = y_i − w^T Φ_i − b − ε. This implies that α̂_i takes on a positive value, and the farther outside the tube the case lies, the larger α̂_i becomes (you can think of it as a compensating force). Note that in this case α_i = 0. A similar story holds if ξ_i > 0 and α_i > 0. If a data-case is inside the tube, then α_i and α̂_i are necessarily zero, and hence we obtain sparseness.
We now change variables to make this optimization problem look more similar to the SVM and ridge-regression case. Introduce β_i = α̂_i − α_i and use α̂_i α_i = 0 to write α̂_i + α_i = |β_i| (at most one of the two is nonzero). This gives

maximize_β   −(1/2) Σ_{ij} β_i β_j (K_ij + δ_ij / C) − ε Σ_i |β_i| + Σ_i y_i β_i
subject to   Σ_i β_i = 0                         (9.9)
where the constraint comes from the fact that we included a bias term³ b.
From the slackness conditions we can also find a value for b (similar to the SVM case). Also, as usual, the prediction for a new data-case x is given by,

y(x) = w^T Φ(x) + b = Σ_i β_i K(x_i, x) + b                         (9.10)
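A minimal numerical sketch of solving (9.9) and predicting with (9.10) follows. None of it comes from the text: the toy data, the RBF kernel, the cvxpy package and the way b is recovered (averaging the slackness condition over the support vectors) are assumptions made purely for illustration.

    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(0)
    X = np.linspace(-3, 3, 40)[:, None]                   # toy inputs
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)   # toy targets

    def rbf(A, B, gamma=0.5):
        # K(x, x') = exp(-gamma ||x - x'||^2)
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    C, eps = 10.0, 0.1
    n = len(y)
    K = rbf(X, X)

    # dual (9.9): maximize -1/2 beta'(K + I/C)beta - eps |beta|_1 + y'beta, s.t. sum(beta) = 0
    beta = cp.Variable(n)
    objective = cp.Maximize(-0.5 * cp.quad_form(beta, K + np.eye(n) / C)
                            - eps * cp.norm1(beta) + y @ beta)
    cp.Problem(objective, [cp.sum(beta) == 0]).solve()
    beta_val = beta.value

    # recover b from the slackness conditions: for any support vector (beta_i != 0),
    # b = y_i - sign(beta_i) * eps - beta_i / C - sum_j beta_j K_ij
    sv = np.abs(beta_val) > 1e-6
    b = np.mean(y[sv] - np.sign(beta_val[sv]) * eps - beta_val[sv] / C - K[sv] @ beta_val)

    def predict(Xnew):
        # prediction (9.10): y(x) = sum_i beta_i K(x_i, x) + b
        return rbf(Xnew, X) @ beta_val + b

    print("support vectors:", int(sv.sum()), "of", n)
    print("train RMSE:", np.sqrt(np.mean((predict(X) - y) ** 2)))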
It is an interesting exercise for the reader to work her way through the case
where the penalty is linear instead of quadratic, i.e.
minimize_{w,ξ,ξ̂}   (1/2)||w||² + C Σ_i (ξ_i + ξ̂_i)
subject to          w^T Φ_i + b − y_i ≤ ε + ξ_i    ∀i
                    y_i − w^T Φ_i − b ≤ ε + ξ̂_i    ∀i                         (9.11)
                    ξ_i ≥ 0,  ξ̂_i ≥ 0    ∀i                                   (9.12)

leading to the dual problem,
maximize_β   −(1/2) Σ_{ij} β_i β_j K_ij − ε Σ_i |β_i| + Σ_i y_i β_i
subject to   Σ_i β_i = 0                         (9.13)
             −C ≤ β_i ≤ +C    ∀i                 (9.14)
where we note that the quadratic penalty on the size of β is replaced by a box constraint, as is to be expected in switching from the L_2 norm to the L_1 norm.
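The linear-penalty formulation above is the one implemented in standard software. As a small sanity check (a minimal sketch, not part of the text; scikit-learn, the RBF kernel and the toy data are assumptions for illustration), sklearn.svm.SVR solves exactly this box-constrained dual:

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = np.linspace(-3, 3, 40)[:, None]
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)

    model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=0.5)
    model.fit(X, y)

    # dual_coef_ holds the dual coefficients of the support vectors
    # (the beta_i, up to sign convention); by (9.14) they satisfy |beta_i| <= C
    beta = model.dual_coef_.ravel()
    print("support vectors:", len(model.support_), "of", len(y))
    print("max |beta_i| =", np.abs(beta).max(), "<= C =", model.C)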
Final remark: Let's remind ourselves that the quadratic programs we have derived are convex optimization problems with a unique optimal solution that can be found efficiently using numerical methods. This is often claimed as great progress relative to the old neural-network days, which were plagued by many local optima.
3. Note by the way that we could not use the trick we used in ridge-regression of defining a constant feature φ_0 = 1 and b = w_0. The reason is that the objective does not depend on b.