Fisher Linear Discriminant Analysis
The most famous example of dimensionality reduction is "principal components analysis" (PCA). This technique searches for the directions in the data that have the largest variance and subsequently projects the data onto them. In this way, we obtain a lower-dimensional representation of the data that removes some of the "noisy" directions. There are many difficult issues concerning how many directions one should keep, but that is beyond the scope of this note.
PCA is an unsupervised technique and as such does not use the label information in the data. For instance, imagine two cigar-like clusters in two dimensions, one cigar with y = 1 and the other with y = -1. The cigars are positioned in parallel and very close together, so that the direction of largest variance in the total data-set, ignoring the labels, runs along the cigars. For classification this would be a terrible projection, because the labels get evenly mixed and we destroy the useful information. A much more useful projection is orthogonal to the cigars, i.e. in the direction of smallest overall variance, which would perfectly separate the data-cases (obviously, we would still need to perform classification in this 1-D space).
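This scenario is easy to verify numerically. The sketch below is a made-up example (the cluster scales and offsets are assumptions, not taken from these notes): the largest-variance direction runs along the cigars and mixes the classes, while the smallest-variance direction separates the projected class means.

```python
# Numerical sketch of the "two cigars" scenario (made-up data).
import numpy as np

rng = np.random.default_rng(0)
n = 200
X_pos = rng.normal(size=(n, 2)) * [5.0, 0.2] + [0.0, +0.5]   # class y = +1
X_neg = rng.normal(size=(n, 2)) * [5.0, 0.2] + [0.0, -0.5]   # class y = -1
X = np.vstack([X_pos, X_neg])

C = np.cov(X - X.mean(axis=0), rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
w_pca = eigvecs[:, -1]   # largest-variance direction (along the cigars)
w_min = eigvecs[:, 0]    # smallest-variance direction (orthogonal to the cigars)

print("class means along PCA direction:       ",
      (X_pos @ w_pca).mean(), (X_neg @ w_pca).mean())   # nearly identical
print("class means along orthogonal direction:",
      (X_pos @ w_min).mean(), (X_neg @ w_min).mean())   # well separated
```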
So the question is: how do we utilize the label information to find informative projections? To this end, Fisher-LDA considers maximizing the following objective:
J(w) = \frac{w^T S_B w}{w^T S_W w}    (13.1)
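As a small sketch of how the objective in Eq. (13.1) can be evaluated and maximized numerically, assuming the scatter matrices S_B and S_W (defined just below) are already available: the maximizer shown here uses the standard generalized-eigenvalue approach to such ratio objectives, which is an assumption on my part and not necessarily the derivation followed in these notes.

```python
# Evaluate J(w) from Eq. (13.1) and find a maximizing direction
# (via the leading eigenvector of S_W^{-1} S_B; a standard approach,
# stated here as an assumption).
import numpy as np

def fisher_criterion(w, S_B, S_W):
    """J(w) = (w^T S_B w) / (w^T S_W w)."""
    return float(w @ S_B @ w) / float(w @ S_W @ w)

def fisher_direction(S_B, S_W):
    """Direction maximizing J(w): leading eigenvector of S_W^{-1} S_B."""
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    w = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    return w / np.linalg.norm(w)
```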
Here S_B is the "between classes scatter matrix" and S_W is the "within classes scatter matrix". Note that because scatter matrices are proportional to covariance matrices, we could equally have defined J using covariance matrices – the proportionality constant would have no effect on the solution. The definitions of