Packaged Datasets

All datasets are available on our GitHub page, on the World Well-Being Project site, and with the pip installation of DLATK.

Note: some lexica and datasets are distributed under more restrictive licenses than DLATK itself. Please review each license before use.

Language Data

Blog Authorship Corpus

A subset of the Blog Authorship Corpus collected by J. Schler, M. Koppel, S. Argamon and J. Pennebaker. This subset contains all posts from a random sample of 1,000 users. Shared with permission from Moshe Koppel.

  • [.zip]

  • MySQL: dla_tutorial.msgs, dla_tutorial.blog_outcomes

Lexica

Age and Gender Lexica

Our data-driven age and gender lexica were generated from about 97,000 Facebook, Blogger and Twitter users.

PERMA Lexicon

Our lexicon for predicting well-being as measured by the PERMA scales.

Spanish PERMA Lexicon

Our lexicon to measure PERMA in Spanish, derived from Spanish tweets annotated with PERMA.

Other Lexica

  • Prospection Lexicon: Temporal Orientation

  • Affect and Intensity Lexicon

LDA Topics

2000 Facebook Topics