Packaged Datasets¶
All datasets are available on our GitHub page, on the World Well-Being Project site, and with the pip installation.
Note: some lexica and datasets are distributed under more restrictive licenses than DLATK. Please review each license before use.
Language Data¶
Blog Authorship Corpus¶
A subset of the Blog Authorship Corpus collected by J. Schler, M. Koppel, S. Argamon and J. Pennebaker. The subset contains all posts from a random sample of 1,000 users. Shared with permission from Moshe Koppel.
- [.zip]
- MySQL: dla_tutorial.msgs, dla_tutorial.blog_outcomes
Lexica¶
Age and Gender Lexica¶
Our data-driven age and gender lexica were generated from about 97,000 Facebook, Blogger and Twitter users.
- [.zip]
- MySQL: permaLexicon.dd_emnlp14_ageGender
- Link to publication
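One common way to apply a weighted lexicon such as this one is to sum each word's weight multiplied by its relative frequency in a text, then add an intercept if the lexicon provides one. The following is a minimal sketch of that idea, not DLATK's own implementation. It assumes a hypothetical layout in which the lexicon has been loaded into a dictionary of term to weight, with the intercept stored under the key `_intercept`; the toy weights and the `score_text` helper are invented for illustration. Check the downloaded file before relying on any of these names.

```python
import re
from collections import Counter


def score_text(text, lexicon, intercept_key="_intercept"):
    """Score text as intercept + sum(weight * relative frequency).

    `lexicon` maps term -> weight. The intercept key is an assumption
    about how the lexicon was loaded, not a documented format.
    """
    # Naive tokenizer (lowercase letters and apostrophes); an assumption.
    tokens = re.findall(r"[a-z']+", text.lower())
    score = lexicon.get(intercept_key, 0.0)
    if not tokens:
        return score
    counts = Counter(tokens)
    total = len(tokens)
    for word, count in counts.items():
        if word in lexicon:
            score += lexicon[word] * count / total
    return score


# Invented toy weights, for illustration only.
TOY_LEXICON = {"_intercept": 20.0, "homework": -10.0, "mortgage": 30.0}

demo_score = score_text("my homework is late", TOY_LEXICON)
print(demo_score)  # 17.5: intercept 20.0 plus -10.0 * (1 / 4)
```

Using relative frequency rather than raw counts keeps scores comparable between users who have written different amounts of text.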
PERMA Lexicon¶
Our lexicon for predicting well-being as measured by the PERMA scales.
- [.zip]
- MySQL: permaLexicon.dd_permaV3
- Link to publication
- [Usage license]
Spanish PERMA Lexicon¶
Our lexicon to measure PERMA in Spanish, derived from Spanish tweets annotated with PERMA.
- [.zip]
- MySQL: permaLexicon.dd_sperma_v2
- Link to publication
Other Lexica¶
Prospection Lexicon: Temporal Orientation:
- [.csv]
- MySQL: permaLexicon.dd_PaPreFut
- Link to publication
Affect and Intensity Lexicon:
- [.csv]
- MySQL: permaLexicon.dd_intAff
- Link to publication
LDA Topics¶
2000 Facebook Topics¶
- Top 20 words per topic: [.csv] [Excel file]
- MySQL: permaLexicon.met_a30_2000_cp and permaLexicon.met_a30_2000_freq_t50ll
- All words: [.csv]
- Conditional probabilities [.csv] (sparse matrix format)
- Link to publication
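A sparse matrix format generally stores one row per non-zero (word, topic) entry instead of a full topic-by-word table. The sketch below shows one way to read such a file into a nested dictionary and list the top words of a topic. It assumes a hypothetical three-column layout with a header row (`term`, `category`, `weight`); the column names, the helper functions and the sample rows are all illustrative, so check the header of the downloaded file first.

```python
import csv
import io
from collections import defaultdict


def load_sparse_topics(handle):
    """Read (term, category, weight) rows into {topic: {word: weight}}.

    The column names are assumed, not taken from the actual file.
    """
    topics = defaultdict(dict)
    for row in csv.DictReader(handle):
        topics[row["category"]][row["term"]] = float(row["weight"])
    return dict(topics)


def top_words(topics, topic_id, n=20):
    """Return the n highest-weighted words for one topic."""
    words = topics[topic_id]
    return sorted(words, key=words.get, reverse=True)[:n]


# Made-up sample rows standing in for the real file.
SAMPLE = """term,category,weight
happy,101,0.12
birthday,101,0.08
day,101,0.03
game,202,0.10
"""

demo_topics = load_sparse_topics(io.StringIO(SAMPLE))
print(top_words(demo_topics, "101", n=2))  # ['happy', 'birthday']
```

To read a real file, pass an open file handle, for example `open(path, newline="", encoding="utf-8")`, in place of the `io.StringIO` sample.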