Multilingual text classification with word embeddings
Our work “Multi-label, Multi-class Classification Using Polylingual Embeddings” was accepted at the European Conference on Information Retrieval. In case you missed the paper, here is a short summary of the work we presented. It was presented as a poster, which you can have a look at on SlideShare.
The main question that motivates the paper is whether parallel translations of a document can be used to create richer representations and whether, for a given task, those new representations perform better than monolingual ones. I named these representations polylingual because they are generated by combining information from more than one language. For convenience, assume that we operate at the word level. Departing from the space of each language (the language-dependent space), e.g., English and French, we generate a new space. In the language-dependent spaces each word is mapped to a point; in the induced polylingual space, each pair of words (a word and its translation) is mapped to a point. This is also what the main poster figure illustrates. The intuition is that by combining languages one can create richer semantic representations that help with, for example, word disambiguation.
How to do that? Since at that time I was into distributed representations and word2vec, I decided to follow this path (rough code sketches of the steps are given after the list):
- Generate word2vec vectors for each language
- Apply the average composition function to generate document representations from word representations: given the words of a document, average their vectors to obtain the document representation.
- Having the document representations in, say, English and French, obtain the polylingual document representation using a denoising autoencoder. The denoising autoencoder learns a compressed representation of its inputs; I compressed the representations by a factor of 0.7, which was tuned experimentally.
- Compare the performance on document classification with SVMs using tf-idf representations, monolingual distributed representations and polylingual distributed representations.
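To make the first two steps concrete, here is a minimal sketch of how the monolingual document vectors could be built with gensim's word2vec and simple averaging. The toy corpora, variable names, and hyperparameters (dimensionality, window size) are illustrative assumptions, not the exact setup of the paper.

```python
import numpy as np
from gensim.models import Word2Vec

# Toy parallel corpora (hypothetical): docs_en[i] and docs_fr[i] are
# translations of the same document, already tokenized.
docs_en = [["the", "cat", "sat", "on", "the", "mat"],
           ["the", "dog", "barked", "loudly"]]
docs_fr = [["le", "chat", "assis", "sur", "le", "tapis"],
           ["le", "chien", "aboyait", "fort"]]

def train_embeddings(docs, dim=100):
    """Train monolingual word2vec vectors (hyperparameters are illustrative)."""
    return Word2Vec(sentences=docs, vector_size=dim, window=5, min_count=1, workers=4)

def average_composition(doc, model):
    """Average the word vectors of a document, skipping out-of-vocabulary words."""
    vecs = [model.wv[w] for w in doc if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

model_en = train_embeddings(docs_en)
model_fr = train_embeddings(docs_fr)
doc_en = np.vstack([average_composition(d, model_en) for d in docs_en])
doc_fr = np.vstack([average_composition(d, model_fr) for d in docs_fr])
```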
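For the third step, a denoising autoencoder fuses the two monolingual document vectors into a polylingual code. The sketch below is one way to do this in Keras, with Gaussian input noise and a single hidden layer; the noise level, optimizer, and training schedule are assumptions, and only the 0.7 compression factor comes from the description above.

```python
import numpy as np
from tensorflow.keras.layers import Input, Dense, GaussianNoise
from tensorflow.keras.models import Model

def train_denoising_autoencoder(X, compression=0.7, noise_std=0.1, epochs=50):
    """Learn a compressed code from noisy inputs; compression=0.7 mirrors the
    factor mentioned above, the rest of the setup is an assumption."""
    input_dim = X.shape[1]
    code_dim = int(input_dim * compression)

    inputs = Input(shape=(input_dim,))
    noisy = GaussianNoise(noise_std)(inputs)            # corrupt the input (denoising)
    code = Dense(code_dim, activation="relu")(noisy)    # compressed polylingual code
    outputs = Dense(input_dim, activation="linear")(code)

    autoencoder = Model(inputs=inputs, outputs=outputs)
    autoencoder.compile(optimizer="adam", loss="mse")
    autoencoder.fit(X, X, epochs=epochs, batch_size=32, verbose=0)
    return Model(inputs=inputs, outputs=code)            # keep only the encoder

# Fuse the two monolingual views and encode them into polylingual vectors.
X_both = np.hstack([doc_en, doc_fr])
encoder = train_denoising_autoencoder(X_both)
doc_poly = encoder.predict(X_both)
```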
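Finally, the comparison in the last step can be run with a linear SVM from scikit-learn over the three feature sets. The labels, tf-idf configuration, and evaluation below are placeholders; in the multi-label case the targets would first be binarized with MultiLabelBinarizer, and evaluation would use proper cross-validation rather than a fit on the toy data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Hypothetical labels for the toy documents above.
y = np.array([0, 1])

X_tfidf = TfidfVectorizer().fit_transform([" ".join(d) for d in docs_en])

clf = OneVsRestClassifier(LinearSVC())
for name, X in [("tf-idf", X_tfidf),
                ("monolingual (EN)", doc_en),
                ("polylingual", doc_poly)]:
    clf.fit(X, y)                       # placeholder fit; the paper evaluates on held-out data
    print(f"{name}: train accuracy = {clf.score(X, y):.2f}")
```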
In the experiments I found that when only a few labeled documents are available, polylingual representations perform best. As more labeled data become available, though, standard tf-idf representations become competitive and outperform the polylingual ones.
Discussion: I have assumed access to parallel translations of the texts to be classified. This is not quite realistic, and I used Google Translate, which is considered a state-of-the-art system, to generate those translations. The effect of translation quality has to be investigated further. Note also that a simple composition function (averaging) was used to obtain document representations. I plan to try better composition functions that either rely on more operations than simple averaging, such as min and max (see the sketch below), or use neural networks. Among these, I have tried paragraph vectors (the results are included in the paper), but they were not as competitive as word2vec + composition functions. Finally, I used English and French as the language pair, which raised questions about how pairs like English and Chinese would perform. This is to be investigated in the future.
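As an illustration of what a richer composition function might look like, here is a small sketch that concatenates element-wise min, max, and mean pooling of the word vectors; this is an assumed variant for future work, not something evaluated in the paper.

```python
import numpy as np

def minmax_composition(doc, model):
    """Concatenate element-wise min, max, and mean pooling of the word vectors,
    a richer (still unsupervised) alternative to plain averaging."""
    vecs = np.array([model.wv[w] for w in doc if w in model.wv])
    if vecs.size == 0:
        return np.zeros(3 * model.vector_size)
    return np.concatenate([vecs.min(axis=0), vecs.max(axis=0), vecs.mean(axis=0)])
```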
Conclusions: as machine translation systems improve, this work provides evidence that one can improve performance on a task through such fusion mechanisms. This is related to multi-view representation learning. The paper also builds on distributed representations, a concept that is quite exciting given the observation that those representations can capture semantic and linguistic similarities. I strongly believe that representation learning is a very promising direction.