14.1 Kernel CCA

As usual, the starting point is to map the data-cases to feature vectors Φ(x_i) and Ψ(y_i). When the dimensionality of the feature space is larger than the number of data-cases in the training set, the solution must lie in the span of the data-cases, i.e.

a = \sum_i \alpha_i \Phi(x_i), \qquad b = \sum_i \beta_i \Psi(y_i)    (14.7)

Using this equation in the Lagrangian we get,

L = \alpha^T K_x K_y \beta \;-\; \frac{1}{2}\lambda\left(\alpha^T K_x^2 \alpha - N\right) \;-\; \frac{1}{2}\lambda\left(\beta^T K_y^2 \beta - N\right)    (14.8)

where α is a vector in an N-dimensional space, different from e.g. a which lives in a D-dimensional space, and K_{x,ij} = Φ(x_i)^T Φ(x_j) is the Gram matrix (and similarly for K_y). Taking derivatives w.r.t. α and β we find,

K_x K_y \beta = \lambda K_x^2 \alpha    (14.9)
K_y K_x \alpha = \lambda K_y^2 \beta    (14.10)
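
As a side illustration (not part of the original text), the following minimal NumPy sketch builds the Gram matrices K_x and K_y that appear in these equations, using a hypothetical RBF kernel on toy paired data; the kernel choice, its bandwidth, and the data are assumptions made purely for illustration.

    import numpy as np

    def rbf_gram(Z, gamma=1.0):
        # Gram matrix K_ij = k(z_i, z_j) = exp(-gamma * ||z_i - z_j||^2)
        sq = np.sum(Z ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
        return np.exp(-gamma * d2)

    # Toy paired data-cases: N cases of an x-view and a y-view (shapes assumed).
    rng = np.random.default_rng(0)
    N = 50
    X = rng.normal(size=(N, 3))
    Y = np.hstack([X[:, :1] + 0.1 * rng.normal(size=(N, 1)),   # a shared signal
                   rng.normal(size=(N, 2))])                   # plus noise dimensions

    K_x = rbf_gram(X)   # N x N Gram matrix, K_x[i, j] = Phi(x_i)^T Phi(x_j)
    K_y = rbf_gram(Y)   # N x N Gram matrix for the y-view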

Let's try to solve these equations by assuming that K_x is full rank (which is typically the case). From Eq. 14.9 we then get α = (1/λ) K_x^{-1} K_y β, and inserting this into Eq. 14.10 gives K_y^2 β = λ^2 K_y^2 β, which always has a solution for λ = 1. By recalling that,

\lambda = \frac{1}{N}\,\alpha^T K_x K_y \beta = \mathrm{corr}[u, v], \qquad u_i = a^T \Phi(x_i) = (K_x \alpha)_i, \quad v_i = b^T \Psi(y_i) = (K_y \beta)_i    (14.11)

we observe that λ = 1 represents the solution with maximal correlation and would hence be the preferred one. This is a typical case of over-fitting and emphasizes once again the need to regularize in kernel methods. This can be done by adding a diagonal term to the constraints in the Lagrangian (or equivalently to the denominator of the original objective), leading to the Lagrangian,

L = \alpha^T K_x K_y \beta \;-\; \frac{1}{2}\lambda\left(\alpha^T K_x^2 \alpha + \eta\|\alpha\|^2 - N\right) \;-\; \frac{1}{2}\lambda\left(\beta^T K_y^2 \beta + \eta\|\beta\|^2 - N\right)    (14.12)

One can see that this acts as a quadratic penalty on the norms of α and β. The resulting equations are,

K_x K_y \beta = \lambda \left(K_x^2 + \eta I\right) \alpha    (14.13)
K_y K_x \alpha = \lambda \left(K_y^2 + \eta I\right) \beta    (14.14)


Analogous to the primal problem, we define big matrices: K_D, which contains (K_x^2 + ηI) and (K_y^2 + ηI) as the blocks on the diagonal and zeros in the off-diagonal blocks, and K_O, which has K_x K_y as the upper-right off-diagonal block and K_y K_x as the lower-left off-diagonal block. Also, we define γ = [α, β]. This leads to the equation,

K_O \gamma = \lambda K_D \gamma \;\;\Rightarrow\;\; K_D^{-1} K_O \gamma = \lambda \gamma \;\;\Rightarrow\;\; K_O^{\frac{1}{2}} K_D^{-1} K_O^{\frac{1}{2}} \left(K_O^{\frac{1}{2}} \gamma\right) = \lambda \left(K_O^{\frac{1}{2}} \gamma\right)    (14.15)

which is again a regular eigenvalue equation. Note that the regularization also moves the smallest eigenvalue of K_D away from zero, and hence makes the inverse numerically more stable. The value of η needs to be chosen using cross-validation or some other measure. Solving the equations through this larger eigenvalue problem is actually not necessary, and more efficient methods exist (see book).
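
To make this concrete, the following minimal sketch (not from the original text, and not the more efficient method alluded to above) solves the regularized problem directly as the generalized eigenvalue problem K_O γ = λ K_D γ with SciPy; the function name, the choice η = 0.1, and the reuse of the toy Gram matrices from the earlier sketch are assumptions for illustration.

    import numpy as np
    from scipy.linalg import eigh

    def kernel_cca(K_x, K_y, eta=0.1):
        # Solve K_O gamma = lambda K_D gamma (Eq. 14.15) for the top component.
        N = K_x.shape[0]
        Z = np.zeros((N, N))
        K_D = np.block([[K_x @ K_x + eta * np.eye(N), Z],
                        [Z, K_y @ K_y + eta * np.eye(N)]])
        K_O = np.block([[Z, K_x @ K_y],
                        [K_y @ K_x, Z]])
        # K_O is symmetric and K_D is symmetric positive definite, so this is a
        # generalized symmetric-definite eigenvalue problem.
        lam, gamma = eigh(K_O, K_D)        # eigenvalues in ascending order
        top = gamma[:, -1]                 # eigenvector of the largest lambda
        return lam[-1], top[:N], top[N:]   # lambda, alpha, beta

    # K_x, K_y: N x N Gram matrices, e.g. from the sketch above.
    lam, alpha, beta = kernel_cca(K_x, K_y, eta=0.1)
    u, v = K_x @ alpha, K_y @ beta         # projections of the training cases
    corr = (u @ v) / np.sqrt((u @ u) * (v @ v))
    print(lam, corr)   # for small eta, lambda is close to the canonical correlation

Because the ηI term makes K_D symmetric positive definite, the symmetric-definite eigensolver can be applied directly, without ever forming K_D^{-1} explicitly.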

The solutions are not expected to be sparse, because eigenvectors are not expected to be sparse. One would have to replace the L_2-norm penalties with L_1-norm penalties to obtain sparsity.

