DLATK Tutorial

In Differential Language Analysis (DLA) we correlate patterns in language with other characteristics, such as gender or voting results. We may look at text broken down by user, county, or individual message, among other things. You can learn more about the conceptual side of DLA in this YouTube video and in this journal paper, Toward Personality Insights from Language Exploration in Social Media.

In this tutorial we will walk you through the process of running DLA using the dlatkInterface.py interface tool. Before running DLA here are some questions to ask yourself:

  • What text am I using?
    • Twitter, Facebook, blogs
  • What relationships am I looking at?
    • LIWC category prevalence vs. voting habits
    • Usage of the word “I” vs. narcissism score on a personality quiz
    • What words correlate with extraversion score on a personality quiz
  • How will I group my data?
    • by message, by user, by country, by hour

Answers for this tutorial:

  • We will be using text from blogs (a subset of data from this corpus).
  • We will look at age and gender and how they correlate with n-grams and Facebook topics.
  • We will group the data at the user level.

Difference between MySQL and dlatk: You will be using our infrastructure code (which is primarily written in Python) together with MySQL. If you are new to using a terminal this can be confusing. You will start every dlatk command with dlatkInterface.py. As the name suggests, this is an interface to a much larger body of code called dlatk; the .py tells you that it is a Python file. MySQL, on the other hand, is a database management system. You access the MySQL command line interface by typing mysql in your terminal. MySQL commands only work inside the MySQL command line interface, just as dlatkInterface.py commands only work in the regular terminal shell. Anytime you are asked to run a MySQL command you must first type mysql; to exit, simply type exit. Note also that all MySQL commands end with a semicolon. When running this tutorial it might be helpful to have two terminal windows open, one for MySQL and another for dlatk.

STEP 0 - Prepare

The text we will use is in the MySQL database dla_tutorial, in the table msgs. We will copy it to a new table using the following MySQL commands:

Replace xxx with your initials so that your message table name is unique and your results don’t get messed up with anyone else’s results!!

USE dla_tutorial;
CREATE TABLE msgs_xxx LIKE msgs;
INSERT INTO msgs_xxx SELECT * FROM msgs;

The relationships we will look at are ngram usage versus age and gender. We will also look at Facebook topics versus age and gender. This outcome data is stored in the table dla_tutorial.blog_outcomes in the columns age and gender.

We will group our data by user. You can see both the message table and the outcome table have a column called user_id.

STEP 1 - Feature Extraction

Generate 1-gram Features

This step generates a quantitative summary of a body of text: it counts words (n-grams) on a group-by-group basis. It also normalizes by group, so at the end of this step you can answer questions like “what proportion of the words used by user 234459 were the word ‘the’?”

##EXPECTED OUTPUT TABLES
#feat$1gram$msgs_xxx$user_id$16to16
#feat$1gram$msgs_xxx$user_id$16to16$0_001
dlatkInterface.py -d dla_tutorial -t msgs_xxx -c user_id --add_ngrams -n 1 --feat_occ_filter --set_p_occ 0.001 --group_freq_thresh 500

Brief descriptions of the flags:

  • -d: the database we are using
  • -t: the table inside the database where our text lives
  • -c: the table column we will be grouping the text by
  • --add_ngrams: the flag which starts the ngram extraction process
  • -n: the value or values for n in ngrams
  • --feat_occ_filter: This tells us to ignore features which occur in a small percentage of groups
  • --set_p_occ: The percentage value for the feature occurrence filter
  • --group_freq_thresh: Ignore groups which do not contain a certain number of words
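The two filters above can be sketched in plain Python. This is a simplified illustration with made-up data and thresholds, not dlatk's actual implementation:

```python
from collections import Counter

# Hypothetical mini-corpus: group_id -> 1gram counts
counts = {
    "user_a": Counter({"the": 10, "dog": 2, "zyzzy": 1}),
    "user_b": Counter({"the": 8, "cat": 3}),
    "user_c": Counter({"the": 5, "dog": 1}),
}

group_freq_thresh = 5   # drop groups with fewer than 5 total words
set_p_occ = 0.5         # keep feats occurring in >= 50% of kept groups

# --group_freq_thresh: discard low-volume groups
kept = {g: c for g, c in counts.items() if sum(c.values()) >= group_freq_thresh}

# --feat_occ_filter / --set_p_occ: keep features present in enough groups
n_groups = len(kept)
occ = Counter(f for c in kept.values() for f in c)
feats = {f for f, n in occ.items() if n / n_groups >= set_p_occ}
print(sorted(feats))  # rare feats like 'zyzzy' and 'cat' are filtered out
```

With these toy numbers, only "the" and "dog" survive the occurrence filter; in the real command above, 0.001 means a feature must appear in at least 0.1% of groups.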
##OTHER COMMAND OPTIONS
dlatkInterface.py -d <database> -t <message_table> -c <group_data_column> --add_ngrams -n 1 2 3 --combine_feat_tables 1to3gram

##FOLLOWED BY
dlatkInterface.py -d <database> -t <message_table> -c <group_data_column> -f <feature table> --feat_occ_filter --set_p_occ <pocc> --group_freq_thresh <gft>

To view the columns in your feature table use the following mysql command:

describe feat$1gram$msgs_xxx$user_id$16to16;

This will give you the following output:

+------------+---------------------+------+-----+---------+----------------+
| Field      | Type                | Null | Key | Default | Extra          |
+------------+---------------------+------+-----+---------+----------------+
| id         | bigint(16) unsigned | NO   | PRI | NULL    | auto_increment |
| group_id   | varchar(45)         | YES  | MUL | NULL    |                |
| feat       | varchar(28)         | YES  | MUL | NULL    |                |
| value      | int(11)             | YES  |     | NULL    |                |
| group_norm | double              | YES  |     | NULL    |                |
+------------+---------------------+------+-----+---------+----------------+

Summary of the columns:

  • id: numeric value of a sql table row
  • group_id: user ids from your message table
  • feat: the 1grams
  • value: the number of times the 1gram occurred within the group
  • group_norm: the value divided by the total number of features for this group
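The value and group_norm columns can be reproduced for a single group in a few lines of Python. This is a toy sketch with a made-up message, just to show the arithmetic:

```python
from collections import Counter

# Hypothetical text for one group (one user)
tokens = "beware the bull the bull caused focus".split()
counts = Counter(tokens)          # 'value' column: raw 1gram counts
total = sum(counts.values())      # total number of 1grams in the group

# 'group_norm' column: value divided by the group's total
group_norms = {feat: value / total for feat, value in counts.items()}

print(group_norms["the"])                    # 2/7
print(round(sum(group_norms.values()), 10))  # 1.0
```

Because each group_norm is a proportion of the group's total, the group norms for any one group sum to 1.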

To view the feature table, use the following command in mysql. This will show every column's value for the first 10 rows.

mysql> select * from dla_tutorial.feat$1gram$msgs_xxx$user_id$16to16 limit 10;
+----+----------------------------------+-----------+-------+----------------------+
| id | group_id                         | feat      | value | group_norm           |
+----+----------------------------------+-----------+-------+----------------------+
|  1 | 003ae43fae340174a67ffbcf19da1549 | neighbors |     1 | 0.000260010400416017 |
|  2 | 003ae43fae340174a67ffbcf19da1549 | all       |    15 |  0.00390015600624025 |
|  3 | 003ae43fae340174a67ffbcf19da1549 | jason     |     1 | 0.000260010400416017 |
|  4 | 003ae43fae340174a67ffbcf19da1549 | <newline> |     5 |  0.00130005200208008 |
|  5 | 003ae43fae340174a67ffbcf19da1549 | caused    |     1 | 0.000260010400416017 |
|  6 | 003ae43fae340174a67ffbcf19da1549 | beware    |     1 | 0.000260010400416017 |
|  7 | 003ae43fae340174a67ffbcf19da1549 | bull      |     1 | 0.000260010400416017 |
|  8 | 003ae43fae340174a67ffbcf19da1549 | focus     |     1 | 0.000260010400416017 |
|  9 | 003ae43fae340174a67ffbcf19da1549 | yellow    |     1 | 0.000260010400416017 |
| 10 | 003ae43fae340174a67ffbcf19da1549 | four      |     3 |  0.00078003120124805 |
+----+----------------------------------+-----------+-------+----------------------+

You can also compare the sizes of the two tables to see the effect of --feat_occ_filter:

mysql> select count(distinct feat) from dla_tutorial.feat$1gram$msgs_xxx$user_id$16to16;
+----------------------+
| count(distinct feat) |
+----------------------+
|                65593 |
+----------------------+

mysql> select count(distinct feat) from dla_tutorial.feat$1gram$msgs_xxx$user_id$16to16$0_001;
+----------------------+
| count(distinct feat) |
+----------------------+
|                 1872 |
+----------------------+

What would you expect the count to be if you had used a set_p_occ value of 0.01?

Given the definition of group norm above, what would you expect to get if you summed all of the group norms for a single group? Verify your answer with the following mysql command:

select group_id, sum(group_norm) from dla_tutorial.feat$1gram$msgs_xxx$user_id$16to16 group by group_id limit 10;

Generate 1to3-gram Features

Now we will generate a 1-to-3-gram table which will contain all 1grams, 2grams, and 3grams for each user. This next command will create four tables: one for each value of n, plus one combined table containing the contents of all three.
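Extracting the n-grams themselves is a simple sliding window over each message's tokens. A minimal sketch (dlatk's own tokenization is more sophisticated than `split()`):

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i love my yellow dog".split()
for n in (1, 2, 3):
    print(n, ngrams(tokens, n))
# 1 ['i', 'love', 'my', 'yellow', 'dog']
# 2 ['i love', 'love my', 'my yellow', 'yellow dog']
# 3 ['i love my', 'love my yellow', 'my yellow dog']
```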

##EXPECTED OUTPUT TABLES
#feat$1gram$msgs_xxx$user_id$16to16
#feat$2gram$msgs_xxx$user_id$16to16
#feat$3gram$msgs_xxx$user_id$16to16
#feat$1to3gram$msgs_xxx$user_id$16to16
dlatkInterface.py -d dla_tutorial -t msgs_xxx -c user_id --add_ngrams -n 1 2 3 --combine_feat_tables 1to3gram

The argument to the --combine_feat_tables flag is used to name the combined table. We used 1to3gram, which gave us the feature table feat$1to3gram$msgs_xxx$user_id$16to16.

Generate Lexicon (topic) Features

This step uses the 1gram feature table from step 1a in addition to some topic definitions. It calculates a value that characterizes how strongly each topic was present in the text of a given group. Sometimes this is as simple as aggregating counts; sometimes there is a weighting factor involved. We will use a weighted, data-driven lexicon: our 2000 Facebook topics (distributed with this release). These topics were created from Facebook data using Latent Dirichlet allocation (LDA). Go here for more info on LDA. Also see our Mallet LDA Interface. The Facebook topic table in permaLexicon looks like:

mysql> select * from permaLexicon.met_a30_2000_cp limit 10;
+----+---------+----------+--------------------+
| id | term    | category | weight             |
+----+---------+----------+--------------------+
|  1 | ce      | 344      |  0.000162284972412 |
|  2 | concept | 344      |  0.000556947925369 |
|  3 | cough   | 344      | 0.0000711541198235 |
|  4 | bring   | 344      |   0.00570741964554 |
|  5 | finest  | 344      |  0.000520020800832 |
|  6 | human   | 344      | 0.0000762679547477 |
|  7 | winds   | 344      |   0.00839234198794 |
|  8 | faster  | 344      |  0.000218674830527 |
|  9 | halfway | 344      |  0.000872790748418 |
| 10 | blow    | 344      |    0.0120238095238 |
+----+---------+----------+--------------------+

Every lex table will have the columns: id, term, category and weight. In an unweighted lexicon (for example LIWC (Linguistic Inquiry and Word Count)) the weight column is set to 1.
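A weighted lexicon score can be sketched as a weighted sum: for each category, add up each lexicon word's weight times that word's 1gram group norm in the group. This is a simplified illustration with made-up numbers, not dlatk's exact computation:

```python
# Hypothetical lexicon rows: (term, category, weight)
lexicon = [
    ("bring", "344", 0.0057), ("winds", "344", 0.0084),
    ("blow",  "344", 0.0120), ("happy", "7",   0.0300),
]

# Hypothetical 1gram group norms for one group
word_norms = {"bring": 0.01, "blow": 0.02, "happy": 0.005}

# Topic score = sum over the topic's words of weight * word group_norm
topic_scores = {}
for term, cat, weight in lexicon:
    if term in word_norms:
        topic_scores[cat] = topic_scores.get(cat, 0.0) + weight * word_norms[term]

print(topic_scores)
```

For an unweighted lexicon such as LIWC, every weight would be 1, and the score reduces to the summed group norms of the category's words.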

Since this lexicon was produced using a data-driven approach, we make no attempt to label the categories and instead give them numerical ids. (By contrast, in a hand-built lexicon like LIWC we see named categories such as 'personal pronouns'.) The Facebook lexicon's weights are conditional probabilities. We now apply it to our message set:

# EXPECTED OUTPUT TABLE
# feat$cat_met_a30_2000_cp_w$msgs_xxx$user_id$16to16
dlatkInterface.py -d dla_tutorial -t msgs_xxx -c user_id --add_lex_table -l met_a30_2000_cp --weighted_lexicon

Brief descriptions of the flags:

  • --add_lex_table: the flag which starts the lexicon feature extraction process
  • -l: the lexicon table to use
  • --weighted_lexicon: use the weight column of the lexicon table when aggregating

Note - dlatk pieces together the expected name of the 1gram table from the information you give it in the -d, -t, and -c options.
Note - in the table name met_a30_2000_cp, met stands for messages, English, tokenized; a30 stands for alpha = 30 (a tuning parameter in the LDA process); and 2000 means there are 2000 topics.

In general use the following syntax (permaLexicon is a database where all of our lexica are stored):

## GENERAL SYNTAX FOR CREATING LEXICON FEATURE TABLES
dlatkInterface.py -d <db> -t <msg_tbl> -c <grp_col> --add_lex_table -l <topic_tbl_from_permalexicon> [--weighted_lexicon]

Again, you can view the tables with the following mysql commands:

select * from dla_tutorial.feat$cat_met_a30_2000_cp_w$msgs_xxx$user_id$16to16 limit 10;

What should the group norms sum to for a single group in the lexicon tables? Will this be the same as above? Why or why not?

select group_id, sum(group_norm) from dla_tutorial.feat$cat_met_a30_2000_cp_w$msgs_xxx$user_id$16to16 group by group_id limit 10;

STEP 2 - Insights (DLA): Correlate features with outcomes

This step takes the quantified/summarized text and examines its relationships with other information about each group. One basic output is a correlation matrix in HTML format. You may need to download a program such as WinSCP to transfer the output files from our server to your computer in order to view them.
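Conceptually, this step computes, for each feature, the correlation between its group norms and an outcome across groups. A minimal Pearson r sketch with made-up numbers (dlatk also reports p-values, N, and frequency, which are omitted here):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical: one feature's group norms and the corresponding users' ages
norms = [0.001, 0.004, 0.002, 0.006, 0.005]
ages  = [19,    35,    22,    41,    33]
print(round(pearson_r(norms, ages), 3))  # strongly positive here
```

A positive r would mean the feature is used more by older users; dlatk repeats this computation for every feature/outcome pair to build the correlation matrix.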

dlatkInterface.py -d dla_tutorial -t msgs_xxx -c user_id \
-f 'feat$cat_LIWC2007$msgs_xxx$user_id$16to16' \
 --outcome_table blog_outcomes \
 --group_freq_thresh 500 \
 --outcomes age gender \
 --output_name xxx_output --rmatrix --sort --csv

Brief descriptions of the flags:

  • -f: the feature table to correlate with the outcomes
  • --outcome_table: the table containing group-level outcome information
  • --outcomes: the outcome columns to correlate against
  • --output_name: the base name for the output files
  • --rmatrix: produce a correlation matrix in HTML format
  • --sort: sort the output by correlation strength
  • --csv: also write the correlation matrix as a csv

Output will be written to the files xxx_output.csv and xxx_output.html. The csv output should look like:

feature,age,p,N,freq,gender,p,N,freq
ACHIEV,0.10453337969466858,1.2486251420175023,499,24061,-0.1327959917320303,0.18924871053777773,499,24061
ADVERBS,-0.097823107908957693,1.8490497097147072,499,77661,0.091427449910103736,2.6369379754861826,499,77661
AFFECT,-0.060118741047985133,11.519149773307243,499,133155,0.094864627490032188,2.1840596807077146,499,133155

The HTML file should look like this when opened in a browser: Attach:rmatrix_output.png

In this example, a positive value for age means the feature correlates with older users, and a negative value with younger users. Similarly, a positive value for gender indicates correlation with female users, and a negative value with male users. We can also use the Facebook topics and create topic tag clouds:

dlatkInterface.py -d dla_tutorial -t msgs_xxx -c user_id \
-f 'feat$cat_met_a30_2000_cp_w$msgs_xxx$user_id$16to16' \
 --outcome_table blog_outcomes  --group_freq_thresh 500 \
 --outcomes age gender --output_name xxx_output \
 --topic_tagcloud --make_topic_wordcloud --topic_lexicon met_a30_2000_freq_t50ll \
--tagcloud_colorscheme bluered

Brief descriptions of the flags:

  • --topic_tagcloud: generate topic tagcloud data from the correlation results
  • --make_topic_wordcloud: render the topic tagclouds as wordcloud images
  • --topic_lexicon: the lexicon used to pick the words displayed for each topic
  • --tagcloud_colorscheme: the color scheme for the wordclouds (here, blue/red)

The following line will be printed to the screen:

Yielding norms with zeros (500 groups * 2000 feats).

This tells us that we have 500 users (since our -c field is user_id), each with 2000 features. The 2000 features come from the fact that we are working with 2000 Facebook topics. Looking in MySQL, we see that we have 500 users total in our dataset:

mysql> select count(distinct user_id) from msgs_xxx;
+-------------------------+
| count(distinct user_id) |
+-------------------------+
|                     500 |
+-------------------------+

This means that every user in our dataset passes the group frequency threshold, i.e., each user has at least 500 words. If we were to set the group frequency threshold to 5000 we would see:

Yielding norms with zeros (125 groups * 2000 feats).

Output will be written to the file xxx_output_topic_tagcloud.txt. The topic tagcloud images will be in a directory called xxx_output_topic_tagcloud_wordclouds, organized as follows:

Outcome   Positive Correlation   Negative Correlation
Gender    gender_pos             gender_neg
Age       age_pos                age_neg

Here is the general syntax for some other commands:

####MAKE WORDCLOUDS
dlatkInterface.py -d <db> -t <msg_tbl> -c <grp_col> -f <feat_tbl>  \
 --outcome_table <table_with_group_info>  \
 --outcomes <list of outcomes separated by spaces>  \
 --output_name <desired_output_name> --tagcloud --make_wordclouds
####MAKE TOPIC WORDCLOUDS
dlatkInterface.py -d <db> -t <msg_tbl> -c <grp_col> -f <feat_tbl>  \
 --outcome_table <table_with_group_info>  \
 --outcomes <list of outcomes separated by spaces>  \
 --output_name <desired_output_name> --topic_tagcloud --make_topic_wordcloud \
 --topic_lexicon <lex_table>

Continuing on...

More information about dlatk's interface can be found in the following places: