10.1 Kernel Ridge Regression

We now replace every data-case with its feature vector: $x_i \rightarrow \Phi_i = \Phi(x_i)$. The number of feature dimensions can be much higher than, or even infinitely higher than, the number of data-cases. There is a neat trick that allows us to perform the inverse above in the smaller of the two spaces: either the dimension of the feature space or the number of data-cases. The trick is given by the following identity,

$$ (P^{-1} + B^T R^{-1} B)^{-1} B^T R^{-1} = P B^T (B P B^T + R)^{-1} \qquad (10.4) $$
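This identity is easy to check numerically. The sketch below is only an illustration, not part of the derivation; the matrix sizes and the choice of $P$ and $R$ are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 6, 3                      # two spaces of different size (illustrative choice)
B = rng.normal(size=(d, n))      # rectangular B, so the two inverses differ in size
P = np.eye(n)                    # any symmetric positive definite P works
R = 2.0 * np.eye(d)              # any symmetric positive definite R works

# left-hand side: the inverse is taken in the n x n space
lhs = np.linalg.inv(np.linalg.inv(P) + B.T @ np.linalg.inv(R) @ B) @ B.T @ np.linalg.inv(R)
# right-hand side: the inverse is taken in the d x d space
rhs = P @ B.T @ np.linalg.inv(B @ P @ B.T + R)

print(np.allclose(lhs, rhs))     # True: identity (10.4) holds
```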

Now note that if $B$ is not square, the two inverses are performed in spaces of different dimensionality. To apply this to our case we define $\Phi = [\Phi_{ai}]$ (the matrix whose columns are the feature vectors of the data-cases) and $y = [y_i]$. The solution is then given by,

$$ w = (\lambda I_d + \Phi \Phi^T)^{-1} \Phi y = \Phi (\Phi^T \Phi + \lambda I_n)^{-1} y \qquad (10.5) $$

This equation can be rewritten as $w = \sum_i \alpha_i \Phi(x_i)$ with $\alpha = (\Phi^T \Phi + \lambda I_n)^{-1} y$. This is an equation that will be a recurrent theme, and it can be interpreted as: the solution $w$ must lie in the span of the data-cases, even if the dimensionality of the feature space is much larger than the number of data-cases. This seems intuitively clear, since the algorithm is linear in feature space.
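A minimal NumPy sketch of this equivalence, using a random feature matrix of hypothetical size $d \times n$ purely to show that the primal and dual forms of (10.5) give the same $w$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, lam = 50, 10, 0.1          # feature dim d >> n data-cases (hypothetical sizes)
Phi = rng.normal(size=(d, n))    # columns are the feature vectors Phi(x_i)
y = rng.normal(size=n)

# primal form: invert a d x d matrix
w_primal = np.linalg.solve(lam * np.eye(d) + Phi @ Phi.T, Phi @ y)
# dual form: invert an n x n matrix, then expand w in the span of the data-cases
alpha = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n), y)
w_dual = Phi @ alpha

print(np.allclose(w_primal, w_dual))   # True: the two sides of (10.5) agree
```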

We finally need to show that we never actually need access to the feature vectors, which could be infinitely long (which would be rather impractical). What we need in practice is the predicted value for a new test point $x$. This is computed by projecting it onto the solution $w$,

$$ y = w^T \Phi(x) = y^T (\Phi^T \Phi + \lambda I_n)^{-1} \Phi^T \Phi(x) = y^T (K + \lambda I_n)^{-1} \kappa(x) \qquad (10.6) $$

where $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$ and $\kappa(x)$ is the vector with entries $\kappa_i(x) = K(x_i, x)$. The important message here is of course that we only need access to the kernel $K$.
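Putting equation (10.6) to work requires nothing but kernel evaluations. The sketch below assumes a Gaussian (RBF) kernel and synthetic one-dimensional data; both choices are illustrative and not prescribed by the text.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2); one possible kernel choice."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(30, 1))             # training inputs x_i
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)  # noisy targets y_i
lam = 0.1

K = rbf_kernel(X, X)                             # K_ij = Phi(x_i)^T Phi(x_j)
dual_weights = np.linalg.solve(K + lam * np.eye(len(X)), y)   # (K + lam*I)^{-1} y

x_test = np.array([[0.5]])
kappa = rbf_kernel(X, x_test)[:, 0]              # kappa_i(x) = K(x_i, x)
y_pred = dual_weights @ kappa                    # y^T (K + lam*I)^{-1} kappa(x), eq. (10.6)
print(y_pred)
```

Note that the feature map $\Phi$ never appears in the code; all computations go through the $n \times n$ Gram matrix and the kernel vector $\kappa(x)$.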

We can now add a bias to the whole story by adding one more, constant feature to $\Phi$: $\Phi_0 = 1$. The value of $w_0$ then represents the bias, since

$$ w^T \Phi_i = \sum_a w_a \Phi_{ai} + w_0 \qquad (10.7) $$

Hence, the story goes through unchanged.
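One way to see this in kernel terms: appending the constant feature $\Phi_0 = 1$ simply adds $1$ to every entry of the Gram matrix, since $\Phi'(x)^T \Phi'(x') = 1 + \Phi(x)^T \Phi(x')$. The small check below illustrates this; the matrix sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 4, 6
Phi = rng.normal(size=(d, n))                 # original features, one column per data-case
Phi_b = np.vstack([np.ones((1, n)), Phi])     # prepend the constant feature Phi_0 = 1

K = Phi.T @ Phi                               # original kernel matrix
K_b = Phi_b.T @ Phi_b                         # kernel matrix of the augmented features

print(np.allclose(K_b, K + 1.0))              # True: the bias feature adds 1 to every entry
```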

