Embedded Lexica: Extracting Topical Dictionaries from Unlabeled Corpora using Word Embeddings


The rise of the internet, social media, and the digitization of archives have led to the accumulation of vast quantities of unlabeled text data relevant to the social sciences. Efficiently extracting information from these corpora frequently involves applying topical dictionaries to identify tweets, news articles, or other documents of interest to researchers. However, human-coded dictionaries are too costly to generate to be practical for task-specific information extraction. Moreover, existing approaches to dictionary extraction, such as supervised machine learning and the semi-automatic WordNet, require many user-provided seed words to generate useful results and do not incorporate the contextual information of natural language. In this paper, I present a novel algorithm, conclust, that applies word embeddings to extract topically related dictionaries from unlabeled text using a small number of user-provided seed words and a fitted word embeddings model. Compared to existing methods of lexicon extraction, conclust requires few seed words, is computationally efficient, and takes word context into account. I describe the algorithm's properties and evaluate its ability to replicate word topics from the WordNet Domains database, comparing its performance with that of existing methods of lexical dictionary extraction.
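To make the abstract's core idea concrete, here is a minimal sketch, not the author's conclust implementation: given a fitted word-embedding model and a few seed words, it ranks the vocabulary by cosine similarity to the seeds' mean vector and keeps the nearest words as a candidate topical dictionary. The toy vocabulary, hand-built embedding vectors, and the function name `expand_seeds` are all illustrative assumptions.

```python
import numpy as np

# Toy 3-dimensional embeddings; in practice these come from a fitted
# word embeddings model (e.g. word2vec or GloVe). Vectors are hand-built
# so that conflict-related words share a direction in the space.
emb = {
    "war":     np.array([1.0, 0.9, 0.1]),
    "battle":  np.array([0.9, 1.0, 0.0]),
    "army":    np.array([0.8, 0.8, 0.2]),
    "soldier": np.array([0.9, 0.7, 0.1]),
    "peace":   np.array([0.3, 0.2, 0.5]),
    "banana":  np.array([0.0, 0.1, 1.0]),
    "fruit":   np.array([0.1, 0.0, 0.9]),
}

def expand_seeds(seeds, emb, k=4):
    """Return the k vocabulary words closest (by cosine similarity)
    to the mean vector of the user-provided seed words."""
    centroid = np.mean([emb[s] for s in seeds], axis=0)

    def cos(v):
        return float(v @ centroid / (np.linalg.norm(v) * np.linalg.norm(centroid)))

    return sorted(emb, key=lambda w: cos(emb[w]), reverse=True)[:k]

# Two seed words expand into a small conflict-themed dictionary.
print(expand_seeds(["war", "battle"], emb))
```

This nearest-to-centroid expansion is only the simplest embedding-based baseline; it illustrates why few seeds can suffice when the embedding space already encodes topical proximity.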

Patrick J. Chester
Postdoctoral researcher

Patrick Chester is a postdoctoral researcher at the China Data Lab at UC San Diego. He received his PhD from New York University's Politics Department.