Packaged Datasets

All datasets are available on our GitHub page, on the World Well-Being Project site, and through the pip installation.

Note: some lexica and datasets are distributed under more restrictive licenses than DLATK. Please review each license before use.

Language Data

Blog Authorship Corpus

A subset of blog posts from the Blog Authorship Corpus, collected by J. Schler, M. Koppel, S. Argamon and J. Pennebaker. The subset contains all posts from a random sample of 1000 users. Shared with permission from Moshe Koppel.

[.zip]
MySQL: dla_tutorial.msgs, dla_tutorial.blog_outcomes

Lexica

Age and Gender Lexica

Our data-driven age and gender lexica were generated from about 97,000 Facebook, Blogger and Twitter users.

[.zip]
MySQL: permaLexicon.dd_emnlp14_ageGender
Link to publication
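
As a rough illustration of how a weighted, data-driven lexicon of this kind is commonly applied, the sketch below scores a message as an intercept plus the weighted sum of relative word frequencies. The terms, weights, the `_intercept` key, and the `score_text` helper are made-up placeholders for illustration, not values or names taken from the released files, and this is not presented as DLATK's own implementation.

```python
from collections import Counter

# Hypothetical weights for illustration only; real values come from the lexicon files.
TOY_LEXICON = {"_intercept": 20.0, "homework": -50.0, "mortgage": 80.0}


def score_text(text, lexicon):
    """Return intercept + sum(weight * relative frequency) over lexicon terms."""
    # Naive whitespace tokenization; a real pipeline would use a proper tokenizer.
    tokens = text.lower().split()
    score = lexicon.get("_intercept", 0.0)
    if not tokens:
        return score
    counts = Counter(tokens)
    total = len(tokens)
    for term, weight in lexicon.items():
        if term == "_intercept":
            continue
        # Relative frequency: occurrences of the term divided by all tokens.
        score += weight * counts[term] / total
    return score


print(score_text("my mortgage is due and my homework is not", TOY_LEXICON))
```

Normalizing by the total token count keeps scores comparable across users who wrote different amounts of text.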

PERMA Lexicon

Our lexicon for predicting well-being as measured by the PERMA scales.

[.zip]
MySQL: permaLexicon.dd_permaV3
Link to publication
[Usage license]

Spanish PERMA Lexicon

Our lexicon to measure PERMA in Spanish, derived from Spanish tweets annotated with PERMA.

[.zip]
MySQL: permaLexicon.dd_sperma_v2
Link to publication

Other Lexica

Prospection Lexicon: Temporal Orientation:

[.csv]
MySQL: permaLexicon.dd_PaPreFut
Link to publication

Affect and Intensity Lexicon:

[.csv]
MySQL: permaLexicon.dd_intAff
Link to publication

LDA Topics

2000 Facebook Topics

Top 20 words per topic: [.csv] [Excel file]
MySQL: permaLexicon.met_a30_2000_cp and permaLexicon.met_a30_2000_freq_t50ll
All words: [.csv]
Conditional probabilities: [.csv] (sparse matrix format)
Link to publication
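
The layout of the sparse matrix format is not spelled out here. Assuming each row holds a (term, topic id, weight) triple and only nonzero entries are stored, a sketch of reading such rows and listing the heaviest words per topic might look like the following. The column names, the sample rows, and the `top_words` helper are assumptions for illustration; check the actual file header before relying on them.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in an assumed term,topic,weight layout; not real data.
SAMPLE = """term,topic,weight
coffee,12,0.031
morning,12,0.027
game,7,0.044
coffee,7,0.002
"""


def top_words(handle, n=20):
    """Group sparse (term, topic, weight) rows by topic and keep the n heaviest terms."""
    by_topic = defaultdict(list)
    for row in csv.DictReader(handle):
        by_topic[row["topic"]].append((float(row["weight"]), row["term"]))
    return {
        topic: [term for _, term in sorted(pairs, reverse=True)[:n]]
        for topic, pairs in by_topic.items()
    }


print(top_words(io.StringIO(SAMPLE), n=2))
```

To read a downloaded file instead of the in-memory sample, pass an open file handle, for example `open(path, newline="")`, where `path` is wherever the .csv was saved.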