Embedded Lexica: Extracting Topical Dictionaries from Unlabeled Corpora using Word Embeddings

Patrick J. Chester

July 2024

Abstract

Researchers frequently need to extract information, such as events or target topics, from large corpora. One common solution involves applying semantically related keywords to identify tweets, news articles, or other documents of interest. However, dictionaries relevant to the topic, event, or language of interest rarely both exist and are accessible. Existing algorithms for extracting dictionaries require many user-provided seed words or hand-coded documents to generate useful results. Additionally, they do not incorporate contextual information from natural language. In this paper, I present a novel algorithm, keyclust, that extracts keywords from unlabeled text using a small number of user-provided seed words and a fitted word embeddings model. Compared to existing methods of lexicon extraction, keyclust requires few seed words, is computationally efficient, and takes word context into account. I describe this algorithm’s properties and benchmark its performance against existing methods of lexical dictionary extraction, comparing differences in user labor, conceptual clarity, and the ability to replicate existing keyword dictionaries.
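
The abstract does not spell out keyclust's procedure, so the following is only a generic sketch of the broader idea it builds on: expanding a few seed words into candidate keywords by ranking vocabulary items by cosine similarity to the seeds in an embedding space. It is not the keyclust algorithm. The function names (`cosine`, `expand_seeds`), the centroid-based ranking, and the toy three-dimensional vectors are all assumptions made for illustration; a real application would load vectors from a fitted word embeddings model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_seeds(embeddings, seeds, top_n=3):
    """Rank non-seed words by similarity to the centroid of the seed vectors.

    Hypothetical helper: a simple nearest-neighbour expansion, not keyclust.
    """
    seed_vecs = [embeddings[s] for s in seeds if s in embeddings]
    if not seed_vecs:
        raise ValueError("none of the seed words are in the vocabulary")
    dim = len(seed_vecs[0])
    # Average the seed vectors component-wise to get a single query point.
    centroid = [sum(v[i] for v in seed_vecs) / len(seed_vecs) for i in range(dim)]
    scored = [
        (word, cosine(vec, centroid))
        for word, vec in embeddings.items()
        if word not in seeds
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors invented for this example; they do not come from any real model.
TOY_EMBEDDINGS = {
    "protest": [0.90, 0.10, 0.00],
    "riot": [0.80, 0.20, 0.10],
    "march": [0.85, 0.15, 0.05],
    "demonstration": [0.88, 0.12, 0.02],
    "banana": [0.00, 0.10, 0.90],
    "recipe": [0.10, 0.00, 0.95],
}

candidates = expand_seeds(TOY_EMBEDDINGS, {"protest", "riot"}, top_n=2)
for word, score in candidates:
    print(f"{word}\t{score:.3f}")
```

With these toy vectors, the two protest-related words rank above the food-related ones. In practice, a ranked list like this would still need review before being used as a dictionary.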

Type

Preprint


