Kernel Canonical Correlation Analysis

Imagine you are given two copies of a corpus of documents, one written in English, the other written in German. You may consider an arbitrary representation of the documents, but for definiteness we will use the “vector space” representation, where there is an entry for every possible word in the vocabulary and a document is represented by the count value for every word, i.e. if the word “the” appeared 12 times and is the first word in the vocabulary, we have $X_1(\text{doc}) = 12$, etc.
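As an aside, here is a minimal sketch of this count representation. The vocabulary, tokenizer, and function name are illustrative assumptions, not part of the notes.

```python
import numpy as np
from collections import Counter

def count_vector(doc, vocabulary):
    """Return one count entry per vocabulary word (bag-of-words counts)."""
    counts = Counter(doc.lower().split())          # naive whitespace tokenizer (assumption)
    return np.array([counts[word] for word in vocabulary], dtype=float)

vocabulary = ["the", "cat", "sat", "mat"]          # toy vocabulary
x = count_vector("The cat sat on the mat", vocabulary)
print(x)  # [2. 1. 1. 1.] -> the first entry counts occurrences of "the"
```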

Let’s say we are interested in extracting low-dimensional representations for each document. If we had only one language, we could consider running PCA to extract directions in word space that carry most of the variance. This has the ability to infer semantic relations between words, such as synonymy, because words that co-occur often in documents, i.e. that are highly correlated, tend to be combined into a single dimension in the new space. These spaces can often be interpreted as topic spaces.
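A brief sketch of this single-language PCA step, assuming a documents-by-words count matrix `X` as described above; the function name and the use of an SVD are choices made here for illustration.

```python
import numpy as np

def pca_topics(X, k):
    """Project documents onto the top-k variance directions ("topics") in word space."""
    Xc = X - X.mean(axis=0)                        # center each word dimension
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    directions = Vt[:k]                            # top-k principal directions in word space
    projections = Xc @ directions.T                # low-dimensional document representations
    return directions, projections
```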

If we have two translations, we can try to find projections of each representation separately such that the projections are maximally correlated. Hopefully, this means that the projected coordinates represent the same topic in the two different languages. In this way we can extract language-independent topics.

Let $x$ be a document in English and $y$ a document in German. Consider the projections $u = a^T x$ and $v = b^T y$. Also assume that the data have zero mean. We now consider the following objective,

$$\rho = \frac{E[uv]}{\sqrt{E[u^2]\,E[v^2]}} \tag{14.1}$$
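The following sketch estimates this correlation empirically for fixed projection vectors. It assumes zero-mean data matrices `X` and `Y` (documents in rows, one row per aligned translation pair); the variable and function names are assumptions for illustration, and expectations are replaced by sample averages.

```python
import numpy as np

def correlation(a, b, X, Y):
    """Empirical estimate of rho in Eq. (14.1) for projection vectors a and b."""
    u = X @ a                                      # projections of the English documents
    v = Y @ b                                      # projections of the German documents
    return np.mean(u * v) / np.sqrt(np.mean(u**2) * np.mean(v**2))
```

CCA then seeks the pair $(a, b)$ that maximizes this quantity; note that $\rho$ is invariant to rescaling $a$ and $b$, since any scale factor cancels between numerator and denominator.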
