Determining semantic similarity between texts is important in many tasks in information retrieval such as search, query
suggestion, automatic summarization and image finding. Many approaches have been suggested, based on lexical matching, handcrafted
patterns, syntactic parse trees, external sources of structured semantic knowledge and distributional semantics. However,
lexical features, like string matching, do not capture semantic similarity beyond a trivial level. Furthermore, handcrafted
patterns and external sources of structured semantic knowledge cannot be assumed to be available in all circumstances and
for all domains. Lastly, approaches depending on parse trees are restricted to syntactically well-formed texts, typically
of one sentence in length.
We investigate whether determining short text similarity is possible using only semantic
features (where by semantic we mean pertaining to a representation of meaning) rather than relying on similarity in lexical
or syntactic representations. We use word embeddings, vector representations of terms, computed from unlabelled data, that
represent terms in a semantic space in which proximity of vectors can be interpreted as semantic similarity.
We propose to go from word-level to text-level semantics by combining insights from methods based on external sources of semantic knowledge
with word embeddings. A novel feature of our approach is that an arbitrary number of word embedding sets can be incorporated.
We derive multiple types of meta-features from the comparison of the word vectors for short text pairs, and from the vector
means of their respective word embeddings. The features representing labelled short text pairs are used to train a supervised
learning algorithm. We use the trained model at testing time to predict the semantic similarity of new, unlabelled pairs of
short texts. On a publicly available evaluation set commonly used for the task of semantic similarity, we show that
our method outperforms baseline methods that work under the same conditions.
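
To make the pipeline described above more concrete, the following is a minimal sketch of how features for a short text pair might be derived from word embeddings. It is not the method evaluated here: the toy embedding values, the names `EMBEDDINGS`, `cosine`, `mean_vector` and `pair_features`, and the choice of exactly two features (cosine similarity of the vector means, and an averaged best-match similarity between individual word vectors) are all assumptions made for illustration. In practice the embeddings would be computed from unlabelled data, and the resulting feature vectors for labelled pairs would be passed to a supervised learner.

```python
import math

# Toy 3-dimensional embeddings. The values are invented for illustration;
# real embeddings would be learned from a large unlabelled corpus.
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "dog": [0.7, 0.3, 0.0],
    "car": [0.0, 0.1, 0.9],
    "drives": [0.1, 0.2, 0.8],
    "sleeps": [0.3, 0.9, 0.1],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def pair_features(text_a, text_b, embeddings=EMBEDDINGS):
    """Return [mean_sim, align_sim] for a pair of short texts.

    mean_sim:  cosine similarity between the mean word vectors of each text.
    align_sim: for every word in text_a, the best cosine match among the
               words of text_b, averaged over text_a (a hypothetical
               word-level comparison feature).
    Words without an embedding are skipped.
    """
    vecs_a = [embeddings[w] for w in text_a.lower().split() if w in embeddings]
    vecs_b = [embeddings[w] for w in text_b.lower().split() if w in embeddings]
    if not vecs_a or not vecs_b:
        return [0.0, 0.0]
    mean_sim = cosine(mean_vector(vecs_a), mean_vector(vecs_b))
    best_matches = [max(cosine(u, v) for v in vecs_b) for u in vecs_a]
    align_sim = sum(best_matches) / len(best_matches)
    return [mean_sim, align_sim]


if __name__ == "__main__":
    print(pair_features("cat sleeps", "kitten sleeps"))
    print(pair_features("cat sleeps", "car drives"))
```

With these toy vectors, the pair "cat sleeps" / "kitten sleeps" scores higher on both features than "cat sleeps" / "car drives", even though each pair shares at most one surface word. Adding a second embedding set would, under the same assumptions, simply mean calling `pair_features` once per set and concatenating the results.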