Embedded Lexica: Extracting Topical Dictionaries from Unlabeled Corpora using Word Embeddings

Abstract

Researchers frequently need to extract information, such as events or target topics, from large corpora. One common solution is to apply semantically related keywords to identify tweets, news articles, or other documents of interest. However, dictionaries relevant to the topic, event, or language rarely both exist and are accessible. Existing algorithms for extracting dictionaries require many user-provided seed words or hand-coded documents to generate useful results, and they do not incorporate contextual information from natural language. In this paper, I present a novel algorithm, conclust, that extracts keywords from unlabeled text using a small number of user-provided seed words and a fitted word embeddings model. Compared to existing methods of lexicon extraction, conclust requires few seed words, is computationally efficient, and takes word context into account. I describe the algorithm's properties and benchmark its performance against existing methods of lexical dictionary extraction, comparing differences in user labor, conceptual clarity, and the ability to replicate existing keyword dictionaries.
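The general idea of expanding a small seed-word set with a fitted embeddings model can be sketched as follows. This is a minimal, generic nearest-neighbor expansion over a hypothetical toy embedding table, not the conclust algorithm itself; the words, vectors, and the `expand_seeds` helper are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy embedding table; in practice these vectors would come
# from a fitted model such as word2vec or GloVe.
EMB = {
    "protest": np.array([0.9, 0.1, 0.0]),
    "riot":    np.array([0.8, 0.2, 0.1]),
    "march":   np.array([0.7, 0.3, 0.0]),
    "banana":  np.array([0.0, 0.1, 0.9]),
    "apple":   np.array([0.1, 0.0, 0.8]),
}

def expand_seeds(seeds, emb, k=2):
    """Return the k non-seed words closest (cosine) to the mean seed vector."""
    centroid = np.mean([emb[w] for w in seeds], axis=0)
    centroid /= np.linalg.norm(centroid)
    scores = {
        word: float(vec @ centroid / np.linalg.norm(vec))
        for word, vec in emb.items()
        if word not in seeds
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(expand_seeds(["protest"], EMB))  # → ['riot', 'march']
```

Starting from the single seed "protest", the sketch recovers the topically related words while ignoring the unrelated ones, which is the basic behavior a lexicon-extraction algorithm builds on.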

Patrick J. Chester
Postdoctoral researcher

Patrick Chester is a postdoctoral researcher at the China Data Lab at UC San Diego. He received his PhD from New York University's Department of Politics.
