Understanding sentences via sentence embeddings


In this paper we propose a novel algorithm, WordGraph2Vec (WG2Vec), to analyse text data. WG2Vec combines two aspects of natural language processing: language models and word embeddings.

As a first step, WG2Vec uses language models to dissect and understand text on a grammatical level. So-called word graphs are extracted from sentences based on the grammatical properties of related words. These word graphs are meant to capture the important phrases in the larger text.
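The paper does not spell out the extraction rules, so the following is only a minimal sketch. It assumes that a word graph corresponds roughly to a noun phrase together with the dependency relations of its words; the spaCy model and the `extract_word_graphs` helper are illustrative choices, not part of WG2Vec itself.

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_word_graphs(sentence):
    """Extract candidate phrases ("word graphs") from a sentence.

    Hypothetical rule: take each noun chunk and record, for every word
    in it, its dependency relation and grammatical head. This mimics the
    idea of building small graphs from grammatically related words.
    """
    doc = nlp(sentence)
    graphs = []
    for chunk in doc.noun_chunks:
        # Each node is (token text, dependency relation, head text).
        nodes = [(tok.text, tok.dep_, tok.head.text) for tok in chunk]
        graphs.append(nodes)
    return graphs

for graph in extract_word_graphs("Strong communication skills are required."):
    print(graph)
```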

As a next step, the semantics of the word graphs are obtained using word embedding models such as Word2Vec. WG2Vec converts each phrase into a vector of numbers, i.e., a sentence embedding. Phrases whose vectors lie close to each other should carry the same semantic meaning.
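The paper states that phrases are mapped to vectors, but not how word vectors are combined into a phrase vector. The sketch below assumes simple mean pooling over pre-trained Word2Vec vectors; the file path `word2vec.bin` is a placeholder, and cosine similarity is the usual way to measure vector proximity.

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical path; any pre-trained vectors in word2vec format work.
vectors = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

def phrase_embedding(phrase):
    """Pool word vectors into a single phrase vector.

    Mean pooling is an assumption here, not a detail taken from the
    paper; it is simply a common baseline for phrase embeddings.
    """
    words = [w for w in phrase.lower().split() if w in vectors]
    if not words:
        return None
    return np.mean([vectors[w] for w in words], axis=0)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = phrase_embedding("project management")
b = phrase_embedding("managing projects")
print(cosine_similarity(a, b))  # high values indicate similar meaning
```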

In conclusion, WG2Vec analyses texts on both a grammatical and a semantic level, whereas standard language models or word embedding models only do one or the other. In the paper we give preliminary evidence that WG2Vec can efficiently find semantically similar labour market skills in vacancy data.
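To illustrate the vacancy use case, once every skill phrase has an embedding, finding similar skills reduces to a nearest-neighbour search. The skill list and random vectors below are purely hypothetical stand-ins; in practice the embeddings would come from a pooling step like the one sketched above.

```python
import numpy as np

# Hypothetical skill phrases with placeholder embeddings; random vectors
# stand in for real phrase embeddings purely for illustration.
skills = ["python programming", "data analysis", "stakeholder management",
          "machine learning", "project planning"]
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(skills), 100))

def most_similar(query_vec, embeddings, skills, k=3):
    """Rank skills by cosine similarity to a query embedding."""
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_vec)
    sims = embeddings @ query_vec / norms
    order = np.argsort(-sims)[:k]
    return [(skills[i], float(sims[i])) for i in order]

query = embeddings[0]  # pretend this embeds a skill from a new vacancy
print(most_similar(query, embeddings, skills))
```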