A fundamental question in Natural Language Processing (NLP) is how to represent words. Whether we have a paragraph we want to translate, a product review we want to classify as positive or negative, or a question we want to answer, ultimately the easiest building block to start from is the individual word. The main problem with this approach is that treating each word as just a symbol loses a lot of information. How could we tell, from such a representation, that the relationship between the symbol PAGE and the symbol PAPER is not the same as that between PAGE and MOON?
Several popular techniques try to learn an abstract representation that identifies these relationships and preserves them. In essence, these methods go over a huge body of text (a corpus), such as the entire English Wikipedia, word by word, and assign a representation to each word. Using mathematical operations over the resulting representations, our model can then tell that PAGE is very “similar” to PAPER, and that DOG is similar to CAT, to KENNEL, and to the verb BARK. These methods are even powerful enough to represent analogies: the “difference” between MAN and WOMAN is very similar to that between KING and QUEEN.
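To make the “similarity” and “difference” operations concrete, here is a toy sketch in plain Python. The tiny 3-dimensional vectors below are hand-crafted, hypothetical stand-ins; real learned embeddings (e.g. from word2vec or GloVe) have hundreds of dimensions estimated from a corpus.

```python
import math

# Hypothetical 3-d vectors standing in for real learned embeddings,
# hand-crafted purely so the illustration works.
emb = {
    "page":  [0.9, 0.8, 0.1],
    "paper": [0.8, 0.9, 0.2],
    "moon":  [0.1, 0.2, 0.9],
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Similarity: PAGE is much closer to PAPER than to MOON.
print(cosine(emb["page"], emb["paper"]))  # high (~0.99)
print(cosine(emb["page"], emb["moon"]))   # low  (~0.30)

# Analogy: MAN is to WOMAN as KING is to ... ?
# Compute woman - man + king and find the nearest word vector.
guess = [w - m + k for m, w, k in zip(emb["man"], emb["woman"], emb["king"])]
best = max(emb, key=lambda word: cosine(guess, emb[word]))
print(best)  # -> queen
```

In real embedding spaces the analogy vector rarely lands exactly on the answer, so the nearest-neighbor search (usually excluding the three query words) is what makes the trick work.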
These representations go a long way toward our end goal, be it machine translation, review classification, or anything else (we call these the downstream tasks): they can be used as input to a new system trained to perform them. But now a new problem arises, stemming from the fact that language is both (a) big and (b) constantly growing. A paragraph in our downstream task is more likely than not to contain a word we never saw in the Wikipedia text, or one too rare to save a representation for. Such an out-of-vocabulary (OOV) word could be a new coinage (BLOCKCHAIN), a typo, a highly technical term, or in some languages even a regular inflection (in Bulgarian, for example, some verbs have more than 600 theoretical forms; in Hebrew and Arabic, prepositions like “in” and “for” are attached to the following word).
How do we deal with this unseen-word problem? One common approach is to learn a single representation shared by all unknown words (this can be done while training the original representations: once in a while, we treat a Wikipedia word as if it were unknown, and train the unknown representation instead of the word’s own), but this is rather drastic. In a paper published last year at EMNLP, we developed a method that suggests a representation for an unseen word by generalizing from the known representations alone (the output of the Wikipedia traversal, before any downstream task). The only signal available at this point is the way words are spelled, so that is exactly what we use as input. Essentially, we “pretend” that the representations were trained based on spelling (rather than on the Wikipedia corpus) and train a model accordingly. When we encounter an unknown word in our downstream task, we know how it is spelled, and can therefore apply the model to guess a representation for it as well. Ideally, we’ve captured enough information at the training phase to make a guess that is more helpful than a single shared guess for all words.
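The paper’s model is a character-level bidirectional LSTM (the “subword RNN” of the title). As a much simpler stand-in, the sketch below captures just the training signal: it averages learnable character-bigram vectors to predict a word’s embedding, and trains them to reproduce a handful of “pretrained” embeddings. All words, vectors, and dimensions here are hypothetical, chosen only for illustration.

```python
import random

random.seed(0)
DIM = 4  # toy dimensionality; real embeddings use hundreds of dimensions

# "Pretrained" embeddings we pretend came out of the Wikipedia traversal
# (hypothetical values, hand-picked for this example).
known = {
    "walk":   [0.9, 0.1, 0.0, 0.2],
    "walked": [0.8, 0.2, 0.1, 0.2],
    "talk":   [0.7, 0.1, 0.1, 0.6],
    "talked": [0.6, 0.2, 0.2, 0.6],
}

def bigrams(word):
    """Split a word into character bigrams, with boundary markers."""
    w = "<" + word + ">"
    return [w[i:i + 2] for i in range(len(w) - 1)]

table = {}  # one learnable vector per character bigram

def vec(g):
    if g not in table:
        table[g] = [random.uniform(-0.1, 0.1) for _ in range(DIM)]
    return table[g]

def predict(word):
    """Predicted embedding: the average of the word's bigram vectors."""
    vs = [vec(g) for g in bigrams(word)]
    return [sum(v[d] for v in vs) / len(vs) for d in range(DIM)]

# Train by SGD on squared error against the known embeddings -- i.e.
# "pretend" the embeddings were produced from spelling alone.
lr = 0.5
for _ in range(2000):
    for word, target in known.items():
        pred = predict(word)
        gs = bigrams(word)
        for g in gs:
            for d in range(DIM):
                table[g][d] -= lr * 2 * (pred[d] - target[d]) / len(gs)

# An unseen inflection: "talking" never appeared in training, but it
# shares bigrams with "talk" and "walked", so we can still guess a vector.
print(predict("talking"))
```

Averaging bigram vectors throws away the order information that the paper’s RNN exploits, but the core idea is the same: the model’s only input is spelling, so it can produce a guess for any string, seen or unseen.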
In our paper we show this is indeed the case. In addition to some sensible nearest-neighbor results (see figure), we present results on the downstream task of part-of-speech tagging, showing improvements on nearly all of the 23 languages tested. The method proved particularly successful in languages with rich inflection, reinforcing the conclusion that in these languages important characteristics are encoded in the way words are spelled.
[Written by Yuval Pinter]
Yuval Pinter, Robert Guthrie, Jacob Eisenstein. Mimicking Word Embeddings using Subword RNNs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017.