Mallet LDA Interface

For a conceptual overview of LDA, see this intro

Step 0: Get Access To Mallet

  • This tutorial uses Mallet.
  • Install it in your home directory; download it from http://mallet.cs.umass.edu/download.php (see the install sketch just after this list).
  • NOTE - if you plan to run on large datasets (~15M Facebook messages or similar) you may have to adjust parameters in your mallet script file. See more info in the "Run LDA with Mallet" step.
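
A minimal install sketch, assuming you downloaded the 2.0.8 tarball from the page above (adjust the filename to whatever release you actually grabbed); later steps refer to the launcher as ~/Mallet/bin/mallet, so the unpacked directory is renamed to match:

cd ~
tar -xzf mallet-2.0.8.tar.gz      # unpack the release downloaded from the Mallet site
mv mallet-2.0.8 ~/Mallet          # keep the path consistent with the rest of this tutorial
~/Mallet/bin/mallet               # with no arguments this prints the list of Mallet commands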

Step 1: (If necessary) Create sample tweet table

If necessary, create a message table to run LDA on:

use dla_tutorial;
create table msgs_lda like msgs;
insert into msgs_lda select * from messages_en where rand()<(2/706);
# this example samples roughly 20 million random tweets from messages_en
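
Optionally, a quick sanity check that the sample is roughly the size you expect:

select count(*) from msgs_lda;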

We will create an output folder in your home directory

mkdir ~/lda_tutorial

Step 2, option A: Create tokenized table in MySQL and export it

Use the DLATK infrastructure to tokenize the messages and write the tokenized messages to a file:

dlatkInterface.py -d dla_tutorial -t msgs_lda -c message_id --add_tokenized   # creates table msgs_lda_tok in dla_tutorial

dlatkInterface.py -d dla_tutorial -t msgs_lda -c message_id --print_tokenized_lines ~/lda_tutorial/msgs_lda.txt

NOTE - if you have newlines in your messages, this step may cause issues down the line; it is recommended to replace all newlines with a representative character before proceeding (see the example below).
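
As a minimal sketch, one way to do this in MySQL, assuming the text lives in a column named message (as in the dla_tutorial tables) and using a placeholder token of your choosing:

update msgs_lda set message = replace(replace(message, '\r', ' '), '\n', ' <newline> ');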

OPTIONAL: If you want to restrict your LDA words to a subset of words in the corpus (e.g., those said by at least N people, or said in no more than Y percent of messages), you can use the --feat_whitelist flag, e.g.:

dlatkInterface.py -d dla_tutorial -t msgs_lda --print_tokenized_lines ser1_filt.txt --feat_whitelist 'feat$1gram$msgs_lda$user_id$16to16$0_005'
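
For reference, a whitelist table like the one in the example above could be produced with DLATK's occurrence filter; this is only a sketch, assuming user_id-level grouping and a 0.5% group-occurrence threshold (which yields the $0_005 suffix):

dlatkInterface.py -d dla_tutorial -t msgs_lda -c user_id --add_ngrams -n 1
dlatkInterface.py -d dla_tutorial -t msgs_lda -c user_id -f 'feat$1gram$msgs_lda$user_id$16to16' --feat_occ_filter --set_p_occ 0.005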

Step 2, option B: Generate a feature table and convert it to a Mallet-appropriate text file

You can use this step in place of step 2, option A.

dlatkInterface.py -d dla_tutorial -t msgs_lda -c message_id --add_ngrams -n 1

dlatkInterface.py -d dla_tutorial -t msgs_lda -c message_id -f 'feat$1gram$msgs_lda$message_id$16to16' --print_joined_feature_lines ~/lda_tutorial/msgs_lda.txt

Step 3: Format for Mallet

This prepares the messages and tokenizes them again for Mallet, removing stopwords and non-English characters. For help, run ./bin/mallet import-file --help

./bin/mallet import-file --input ~/lda_tutorial/msgs_lda.txt \
--token-regex "(#|@)?(?!(\W)\2+)([a-zA-Z\_\-\'0-9\(-\@]{2,})" \
--output ~/lda_tutorial/msgs_lda.mallet \
--remove-stopwords --keep-sequence [--extra-stopwords EXTRA_STOPWORDS_FILE]

Step 4: Run LDA with Mallet

This is the actual LDA step, which can take a while (roughly four and a half days on 20 million tweets). For help, run ./bin/mallet train-topics --help

./bin/mallet train-topics --input  ~/lda_tutorial/msgs_lda.mallet \
--alpha 5 --num-topics 2000 --optimize-burn-in 0 --output-model ~/lda_tutorial/msgs_lda.model \
--output-state ~/lda_tutorial/msgs_lda_state.gz \
--output-topic-keys ~/lda_tutorial/msgs_lda.keys

Here alpha is a prior on the number of topics per document. The other hyperparameter, beta (which we usually do not change), is a prior on the number of words per topic.
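
If you do want to pin beta explicitly, Mallet's train-topics accepts it as an option; as a sketch, you would add the following flag to the command above (0.01 is, to our knowledge, Mallet's default):

--beta 0.01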

The train-topics command above creates the following files:

  • ~/lda_tutorial/msgs_lda.model
  • ~/lda_tutorial/msgs_lda_state.gz
  • ~/lda_tutorial/msgs_lda.keys

Note: When dealing with very large datasets, for example creating Facebook topics, you might encounter the error Exception in thread "main" java.lang.OutOfMemoryError: Java heap space. In that case, edit the line MEMORY=1g in ~/Mallet/bin/mallet, raising the 1g value to 2g, 4g, or higher depending on how much RAM your machine has available.
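
If you prefer to make that change from the shell, a one-liner along these lines works (it assumes the launcher is at ~/Mallet/bin/mallet and that an 8 GB heap is what you want):

sed -i 's/^MEMORY=1g$/MEMORY=8g/' ~/Mallet/bin/mallet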

Step 5: Add message IDs to state file

Adds the message IDs to the topic distributions and stores the result under ~/lda_tutorial/lda_topics

dlatkInterface.py --add_message_id ~/lda_tutorial/msgs_lda.txt ~/lda_tutorial/msgs_lda_state.gz --output_name ~/lda_tutorial/lda_topics

Step 6: Import state file into database

Imports the topic-message probability distributions in a raw, JSON-like format that is not yet directly usable for DLA

dlatkInterface.py -d dla_tutorial -t msgs_lda_tok --add_lda_messages  ~/lda_tutorial/lda_topics

This creates the table msgs_lda_tok_lda$lda_topics in the database dla_tutorial.

NOTE - "Duplicate entry 'xxxx' for key 'PRIMARY'" errors may be indicative of an issues with newlines. See step 2 for a solution.

Step 7: Create topic-word distributions

Creates the human-readable topic-word distributions from the messages

python dlatk/LexicaInterface/ldaExtractor.py -d dla_tutorial -t msgs_lda -m 'msgs_lda_tok_lda$lda_topics' --create_dists

This creates the following files:

  • msgs_lda_tok_lda.lda_topics.freq.threshed50.loglik.csv
  • msgs_lda_tok_lda.lda_topics.lik.csv
  • msgs_lda_tok_lda.lda_topics.loglik.csv
  • msgs_lda_tok_lda.lda_topics.topicGivenWord.csv
  • msgs_lda_tok_lda.lda_topics.wordGivenTopic.csv

Step 8: Add topic-lexicon to lexicon database

Generates the lexicons based on different probability distribution types

  • topic given word
python dlatk/LexicaInterface/lexInterface.py --topic_csv \
--topicfile=~/lda_tutorial/msgs_lda_tok_lda.lda_topics.topicGivenWord.csv \
-c msgs_lda_cp
  • frequency, thresholded to loglik >= 50
python dlatk/LexicaInterface/lexInterface.py --topic_csv \
--topicfile=~/lda_tutorial/msgs_lda_tok_lda.lda_topics.freq.threshed50.loglik.csv \
-c msgs_lda_freq_t50ll
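
To confirm the lexica were written, you can peek at the new tables; this sketch assumes DLATK's default lexicon database is named dlatk_lexica (older installs use permaLexicon) and that the lexicon tables carry term, category, and weight columns:

mysql -e "select term, category, weight from dlatk_lexica.msgs_lda_cp limit 5;"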

Step 9: Extract features from lexicon

You’re now ready to start using the topic distribution lexicon:

dlatkInterface.py -d DATABASE -t MESSAGE_TABLE --add_lex_table -l msgs_lda_cp --weighted_lexicon -c GROUP_ID

(Always extract features using the _cp lexicon. The "freq_t50ll" lexicon is only used when generating topic tagclouds: --topic_tagcloud --topic_lexicon ...freq_t50ll.)
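
For example, sticking with this tutorial's tables and grouping by user (the user_id grouping here is an assumption; substitute whatever group ID your analysis uses):

dlatkInterface.py -d dla_tutorial -t msgs_lda -c user_id --add_lex_table -l msgs_lda_cp --weighted_lexicon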