7.1 The Perceptron Model

Our assumption is that a line can separate the two classes of interest. To make our life a little easier we will switch to the $Y = \{+1, -1\}$ representation. With this, we can express the condition mathematically as²,

$$Y_n \approx \mathrm{sign}\Bigl(\sum_k w_k X_{kn} - \alpha\Bigr) \qquad (7.1)$$

where “sign” is the sign-function ($+1$ for nonnegative reals and $-1$ for negative reals). We have introduced $K+1$ parameters $\{w_1, \ldots, w_K, \alpha\}$ which define the line for us. The vector $w$ represents the direction orthogonal to the decision boundary depicted in Figure ??. For example, a line through the origin is represented by $w^T x = 0$, i.e. all vectors $x$ with a vanishing inner product with $w$. The scalar quantity $\alpha$ represents the offset of the line $w^T x - \alpha = 0$ from the origin, i.e. the shortest distance from the origin to the line. This can be seen by writing the points on the line as $x = y + v$, where $y$ is a fixed vector pointing to an arbitrary point on the line and $v$ is a vector on the line starting at $y$ (see Figure ??). Hence, $w^T(y + v) - \alpha = 0$. Since by definition $w^T v = 0$, we find $w^T y = \alpha$, which means that $\alpha$ is the projection of $y$ onto $w$, which is the shortest distance from the origin to the line.
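To make this geometry concrete, here is a tiny numerical sketch (the vector $w$, offset $\alpha$, and the points below are made-up values, and numpy is used only for the arithmetic): a point $x = y + v$ on the line satisfies $w^T x - \alpha = 0$, and for a unit-length $w$ the projection $w^T y$ equals $\alpha$, the distance from the origin to the line.

```python
import numpy as np

# A made-up decision boundary w^T x - alpha = 0 in 2D, with ||w|| = 1
# so that alpha can be read off directly as the distance to the origin.
w = np.array([0.6, 0.8])                  # unit-length normal vector
alpha = 2.0                               # offset of the line from the origin

y = alpha * w                             # a fixed point on the line (w^T y = alpha)
v = np.array([-0.8, 0.6])                 # direction along the line: w^T v = 0
x = y + 3.5 * v                           # another point on the line

print(np.dot(w, x) - alpha)               # ~0: x lies on the line
print(np.dot(w, y))                       # = alpha: projection of y onto w
print(np.abs(alpha) / np.linalg.norm(w))  # shortest distance from origin to line
```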


We would like to estimate these parameters from the data (which we will do in a minute), but it is important to notice that the number of parameters is fixed in advance. In some sense, we believe so much in our assumption that the data is linearly separable that we stick to it irrespective of how many data-cases we will encounter. This fixed capacity of the model is typical for parametric methods, but perhaps a little unrealistic for real data. A more reasonable assumption is that the decision boundary may become more complex as we see more data. Too few data-cases simply do not provide the resolution (evidence) necessary to see more complex structure in the decision boundary. Recall that non-parametric methods, such as the “nearest-neighbors” classifiers, actually do have this desirable feature. Nevertheless, the linear separability assumption comes with some computational advantages as well, such as very fast class prediction on new test data. I believe that this computational convenience may be at the root of its popularity. By the way, when we take the limit of an infinite number of features, we will have happily returned to the land of “non-parametrics”, but we have to exercise a little patience before we get there.

Now let’s write down a cost function that we wish to minimize in order for our linear decision boundary to become a good classifier. Clearly, we would like to control performance on future, yet unseen test data. However, this is a little hard (since we don’t have access to this data by definition). As a surrogate we will simply fit the line parameters on the training data. It cannot be stressed enough that this is dangerous in principle due to the phenomenon of overfitting (see section ??). If we have introduced very many features and no form of regularization, then we have many parameters to fit. When this capacity is too large relative to the number of data-cases at our disposal, we will be fitting the idiosyncrasies of this particular dataset and these will not carry over to the future test data. So, one should split off a subset of the training data and reserve it for monitoring performance (one should not use this set in the training procedure). Cycling through multiple splits and averaging the result was the cross-validation procedure discussed in section ??. If we do not use too many features relative to the number of data-cases, the model class is very limited and overfitting is not an issue. (In fact, one may want to worry more about “underfitting” in this case.)
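As a concrete illustration of the monitoring procedure just described, here is a minimal sketch of cycling through several train/validation splits and averaging the scores, in the spirit of cross-validation. The function `train_and_score` is a hypothetical stand-in for whatever fitting and evaluation routine one uses; the data layout $X_{kn}$ (features in rows, data-cases in columns) follows the notation of this chapter.

```python
import numpy as np

def cross_validate(X, Y, train_and_score, n_folds=5, seed=0):
    """Cycle through n_folds train/validation splits and average the scores.

    X: (K, N) feature matrix with one column per data-case (as in X_kn),
    Y: (N,) labels in {+1, -1}. train_and_score is a user-supplied routine
    (a placeholder here) taking (X_train, Y_train, X_val, Y_val) -> score.
    """
    n_cases = Y.shape[0]
    idx = np.random.RandomState(seed).permutation(n_cases)
    folds = np.array_split(idx, n_folds)
    scores = []
    for k in range(n_folds):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        scores.append(train_and_score(X[:, train], Y[train], X[:, val], Y[val]))
    return float(np.mean(scores))
```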

Ok, so now that we agree on writing down a cost on the training data, we need to choose an explicit expression. Consider now the following choice:

$$C(w, \alpha) = \frac{1}{2N}\sum_{n=1}^{N}\bigl(Y_n - w^T X_n + \alpha\bigr)^2 \qquad (7.2)$$

where $N$ is the number of data-cases and we have rewritten $w^T X_n = \sum_k w_k X_{kn}$. If we minimize this cost then $w^T X_n - \alpha$ tends to be positive when $Y_n = +1$ and negative when $Y_n = -1$. This is what we want! Once optimized we can then easily use our optimal parameters to perform prediction on new test data $X^{test}$ as follows:

$$\tilde{Y}^{test} = \mathrm{sign}\Bigl(\sum_k w_k^* X_k^{test} - \alpha^*\Bigr) \qquad (7.3)$$

where $\tilde{Y}$ is used to indicate the predicted value for $Y$.
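For concreteness, the prediction rule (7.3) amounts to a one-liner in numpy; the sketch below uses the text’s sign convention ($+1$ for nonnegative reals) and placeholder values for $w^*$ and $\alpha^*$ (any trained values would do), with test cases stored as columns, matching $X_{kn}$.

```python
import numpy as np

def predict(w_star, alpha_star, X_test):
    """Eq. (7.3): sign(sum_k w*_k X^test_k - alpha*) for each test case (column).

    Uses the text's convention: +1 for nonnegative reals, -1 for negative reals."""
    z = X_test.T @ w_star - alpha_star
    return np.where(z >= 0, 1.0, -1.0)

# Placeholder parameters and two test cases (as columns), just to show the call.
w_star = np.array([0.6, 0.8])
alpha_star = 2.0
X_test = np.array([[3.0, -1.0],
                   [2.0,  0.5]])
print(predict(w_star, alpha_star, X_test))   # -> [ 1. -1.]
```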

So far so good, but how do we obtain our values for $\{w^*, \alpha^*\}$? The simplest approach is to compute the gradient and slowly descend on the cost function (see appendix ?? for background). In this case, the gradients are simple:

$$\nabla_{w} C(w, \alpha) = -\frac{1}{N}\sum_{n=1}^{N}\bigl(Y_n - w^T X_n + \alpha\bigr)\,X_n = -\frac{1}{N}\,X\bigl(Y - X^T w + \alpha\bigr) \qquad (7.4)$$

$$\nabla_{\alpha} C(w, \alpha) = \frac{1}{N}\sum_{n=1}^{N}\bigl(Y_n - w^T X_n + \alpha\bigr) = \frac{1}{N}\,\mathbf{1}^T\bigl(Y - X^T w + \alpha\bigr) \qquad (7.5)$$

where in the latter matrix expressions we have used the convention that $X$ is the matrix with elements $X_{kn}$ (and $\mathbf{1}$ is a vector of ones, so that $\alpha$ is added to every component). Our gradient descent is now simply given as,

$$w_{t+1} = w_t - \eta\,\nabla_{w} C(w_t, \alpha_t) \qquad (7.6)$$

$$\alpha_{t+1} = \alpha_t - \eta\,\nabla_{\alpha} C(w_t, \alpha_t) \qquad (7.7)$$

Iterating these equations until convergence will minimize the cost function. One may criticize plain vanilla gradient descent for many reasons. For example, you need to carefully choose the stepsize $\eta$ or risk either excruciatingly slow convergence or exploding values of the iterates $w_t, \alpha_t$. Even if convergence is achieved asymptotically, it is typically slow. Using a Newton-Raphson method will improve convergence properties considerably but is also very expensive. Many methods have been developed to improve the optimization of the cost function, but that is not the focus of this book.
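As a sketch of how (7.4)–(7.7) fit together in practice (the stepsize, iteration count, and toy data below are arbitrary illustrative choices, not prescribed by the text):

```python
import numpy as np

def fit_perceptron_gd(X, Y, eta=0.1, iters=500):
    """Minimize C(w, alpha) = (1/2N) sum_n (Y_n - w^T X_n + alpha)^2.

    X: (K, N) with columns X_n, Y: (N,) labels in {+1, -1}."""
    K, N = X.shape
    w, alpha = np.zeros(K), 0.0
    for _ in range(iters):
        r = Y - X.T @ w + alpha            # residuals Y_n - w^T X_n + alpha
        grad_w = -(X @ r) / N              # eq. (7.4)
        grad_a = r.sum() / N               # eq. (7.5)
        w -= eta * grad_w                  # eq. (7.6)
        alpha -= eta * grad_a              # eq. (7.7)
    return w, alpha

# Tiny linearly separable toy problem (two Gaussian blobs), just to exercise it.
rng = np.random.RandomState(0)
X = np.hstack([rng.randn(2, 20) + 2.0, rng.randn(2, 20) - 2.0])
Y = np.hstack([np.ones(20), -np.ones(20)])
w, alpha = fit_perceptron_gd(X, Y)
print(np.mean(np.sign(X.T @ w - alpha) == Y))   # training accuracy
```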

However, I do want to mention a very popular approach to optimization on very large datasets known as “stochastic gradient descent”. The idea is to select a single data-item randomly and perform an update on the parameters based on that:

$$w_{t+1} = w_t + \eta\,\bigl(Y_n - w_t^T X_n + \alpha_t\bigr)\,X_n \qquad (7.8)$$

$$\alpha_{t+1} = \alpha_t - \eta\,\bigl(Y_n - w_t^T X_n + \alpha_t\bigr) \qquad (7.9)$$


The fact that we are picking data-cases randomly injects noise in the updates, so even close to convergence we are “wiggling around” the solution. If we decrease the stepsize however, the wiggles get smaller. So it seems a sensible strategy would be to slowly decrease the stepsize and wiggle our way to the solution. This stochastic gradient descent is actually very efficient in practice if we can find a good annealing schedule for the stepsize. Why is that? It seems that if we use more data-cases in a mini-batch to perform a parameter update, we should be able to make larger steps in parameter space by using bigger stepsizes. While this reasoning holds close to the solution, it does not hold far away from the solution. The intuitive reason is that far away from convergence every datapoint will tell you the same story: move in direction X to improve your model. You simply do not need to query many datapoints in order to extract that information. So for a bad model there is a lot of redundancy in the information that data-cases can convey about improving the parameters, and querying a few is sufficient. Closer to convergence you need to either use more data or decrease the stepsize to increase the resolution of your gradients.
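Here is a sketch of the stochastic updates (7.8) and (7.9) with a decaying stepsize; the schedule $\eta_t = \eta_0/(1 + t/\tau)$ is just one common annealing choice, not one prescribed by the text, and the hyperparameter values are arbitrary.

```python
import numpy as np

def fit_perceptron_sgd(X, Y, eta0=0.1, tau=100.0, sweeps=50, seed=0):
    """Single-data-case updates, eqs. (7.8)-(7.9), with a decaying stepsize.

    X: (K, N) with columns X_n, Y: (N,) labels in {+1, -1}."""
    K, N = X.shape
    w, alpha = np.zeros(K), 0.0
    rng = np.random.RandomState(seed)
    t = 0
    for _ in range(sweeps):
        for i in rng.permutation(N):
            eta = eta0 / (1.0 + t / tau)      # annealing schedule (one choice)
            r = Y[i] - w @ X[:, i] + alpha    # residual for this data-case
            w += eta * r * X[:, i]            # eq. (7.8)
            alpha -= eta * r                  # eq. (7.9)
            t += 1
    return w, alpha
```

The routine can be called on the same toy `(X, Y)` data as the batch sketch above; decreasing the stepsize over time is exactly what shrinks the “wiggles” described in the preceding paragraph.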

This type of reasoning clearly makes an effort to include the computational budget as part of the overall objective. This is what we argued in chapter XX to be the distinguishing feature of machine learning. If you are not convinced about how important this is in the face of modern-day datasets, imagine the following. Company C organizes a contest where they provide a virtually infinite dataset for some prediction task. You can earn 1 million dollars if you make accurate predictions on some test set by Friday next week. You can choose between a single parameter update based on all the data or many updates on small subsets of the data. Who do you think will win the contest?

2. Note that we can replace $X_k \rightarrow \phi_k(X)$ but that for the sake of simplicity we will refrain from doing so at this point.

results matching ""

    No results matching ""