DLATK LDA Interface
Note: These instructions introduce the new streamlined interface for LDA topic estimation. To use the old manual interface, see Mallet LDA Interface.
For a conceptual overview of LDA, see this intro
Step 0: Setup
- This tutorial uses Mallet.
- Download Mallet from http://mallet.cs.umass.edu/download.php and install it to your home directory.
- NOTE: If you plan to run on large datasets (~15M FB messages or similar), you may have to adjust parameters in your Mallet script file. See more info in the "Run LDA with Mallet" step.
- Depending on your DLATK installation, you may also need to install pymallet with the following command:
pip install dlatk-pymallet
Step 1: (If necessary) Create sample tweet table
If necessary, create a message table to run LDA on:
use dla_tutorial;
create table msgs_lda like msgs;
insert into msgs_lda select * from msgs where rand()<(2/6);
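The `rand()<(2/6)` filter keeps each row independently with probability 1/3, so the sample is roughly a third of the table and its exact size changes from run to run. A minimal Python sketch of the same idea follows; the function name `sample_rows` and the list of integers standing in for the `msgs` table are illustrative assumptions, not part of DLATK.

```python
import random


def sample_rows(rows, keep_prob=2 / 6, seed=None):
    """Keep each row independently with probability keep_prob."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < keep_prob]


# Hypothetical stand-in for the message ids in the msgs table.
rows = list(range(30000))
sample = sample_rows(rows, seed=0)

# The fraction kept is close to 1/3 but not exactly 1/3.
print(len(sample) / len(rows))
```

If you need an exact sample size instead, `random.sample(rows, k)` draws exactly `k` rows.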
Step 2: Generate a feature table
This is a standard unigram feature table generation command.
dlatkInterface.py -d dla_tutorial -t msgs_lda -c message_id --add_ngrams -n 1
Step 3: Estimate LDA topics
A minimal command for estimating LDA topics is shown below:
dlatkInterface.py -d dla_tutorial -t msgs_lda -c message_id \
    -f 'feat$1gram$msgs_lda$message_id$16to16' \
    --estimate_lda_topics \
    --lda_lexicon_name my_lda_lexicon
Note, however, that the command above estimates LDA topics using PyMallet, which is generally much slower than Mallet. To use Mallet for topic estimation, add the --mallet_path option:
dlatkInterface.py -d dla_tutorial -t msgs_lda -c message_id \
    -f 'feat$1gram$msgs_lda$message_id$16to16' \
    --estimate_lda_topics \
    --lda_lexicon_name my_lda_lexicon \
    --mallet_path /path/to/mallet/bin/mallet
Be sure to replace /path/to/mallet/bin/mallet with the path to the Mallet executable you installed in Step 0.
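When scripting these runs, it can help to assemble the command programmatically and check the Mallet path before launching a long job. The sketch below is one way to do that, using only the flags shown on this page. The helper names `feature_table_name` and `build_lda_command` are hypothetical, the `$`-joined naming is inferred from the table name used above, and the executable check is an added precaution rather than something DLATK is known to perform.

```python
import os
import shlex


def feature_table_name(feat_type, table, group_field, suffix="16to16"):
    """Join the name parts with '$', mirroring the table name in this tutorial."""
    return "$".join(["feat", feat_type, table, group_field, suffix])


def build_lda_command(db, table, group_field, lexicon_name, mallet_path=None):
    """Assemble the dlatkInterface.py argument list for LDA estimation."""
    args = [
        "dlatkInterface.py", "-d", db, "-t", table, "-c", group_field,
        "-f", feature_table_name("1gram", table, group_field),
        "--estimate_lda_topics",
        "--lda_lexicon_name", lexicon_name,
    ]
    if mallet_path is not None:
        # Fail early rather than discovering a bad path partway through a run.
        if not (os.path.isfile(mallet_path) and os.access(mallet_path, os.X_OK)):
            raise FileNotFoundError(f"No executable Mallet script at {mallet_path}")
        args += ["--mallet_path", mallet_path]
    return args


cmd = build_lda_command("dla_tutorial", "msgs_lda", "message_id", "my_lda_lexicon")
# shlex.join single-quotes the feature table name so the shell does not expand '$'.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` avoids shell quoting altogether; the single quotes only matter when the command is typed or pasted into a shell.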
It is good practice not to store the topics as a lexicon until you have reviewed them. The interim LDA estimation files are typically stored in your /tmp directory, but you can specify a different directory to make the estimated topics easier to review. The following command stores these files in the lda_files directory and skips creating a topic lexicon:
dlatkInterface.py -d dla_tutorial -t msgs_lda -c message_id \
    -f 'feat$1gram$msgs_lda$message_id$16to16' \
    --estimate_lda_topics \
    --save_lda_files lda_files --no_lda_lexicon \
    --mallet_path /path/to/mallet/bin/mallet
An important difference between this new interface and the old one is that stop words are no longer derived from a static Mallet stoplist. Instead, DLATK will determine the most common terms in your feature table and remove them (by default, it sets the top 50 most frequent terms as stop words, but this can be controlled with --num_stopwords). To disable stopping entirely, use --no_lda_stopping.
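The frequency-based stopping described above can be sketched in a few lines of Python. This is an illustration of the general idea under simple assumptions (tokenized documents held in memory, ties broken by `Counter` ordering), not DLATK's actual implementation; the function names and toy documents are hypothetical.

```python
from collections import Counter


def top_stopwords(term_counts, num_stopwords=50):
    """Return the num_stopwords most frequent terms as a set."""
    return {term for term, _ in Counter(term_counts).most_common(num_stopwords)}


def remove_stopwords(docs, num_stopwords=50):
    """Drop the most frequent terms in the corpus from every document."""
    counts = Counter(tok for doc in docs for tok in doc)
    stop = top_stopwords(counts, num_stopwords)
    filtered = [[tok for tok in doc if tok not in stop] for doc in docs]
    return filtered, stop


docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["the", "bird", "sang"],
]
filtered, stop = remove_stopwords(docs, num_stopwords=3)
print(sorted(stop))
print(filtered)
```

Because the stop list is derived from the corpus itself, it adapts to the vocabulary of your data; setting `num_stopwords=0` in this sketch removes nothing, which is analogous to disabling stopping.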
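The options used with --estimate_lda_topics in this tutorial are:
- --lda_lexicon_name: name of the topic lexicon to create.
- --mallet_path: path to the Mallet executable; if omitted, PyMallet is used.
- --save_lda_files: directory in which to store the interim LDA estimation files (instead of /tmp).
- --no_lda_lexicon: do not create a topic lexicon.
- --num_stopwords: number of most frequent terms to treat as stop words (default 50).
- --no_lda_stopping: disable stop word removal entirely.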
Step 4: Extract features from lexicon
You’re now ready to start using the topic distribution lexicon:
dlatkInterface.py -d DATABASE -t MESSAGE_TABLE -c GROUP_ID \
    --add_lex_table -l my_lda_lexicon_cp --weighted_lexicon
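To build intuition for what applying a weighted lexicon produces, the sketch below scores one group's tokens against a toy topic lexicon. It assumes the common formulation in which each word's lexicon weight is multiplied by the word's relative frequency in the group and summed per topic. That formulation, the function name `topic_scores`, and the toy weights are assumptions for illustration, not a description of DLATK's code.

```python
from collections import Counter, defaultdict


def topic_scores(tokens, weighted_lexicon):
    """Sum weight * relative frequency over the lexicon words found in tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    scores = defaultdict(float)
    for topic, weights in weighted_lexicon.items():
        for word, weight in weights.items():
            if word in counts:
                scores[topic] += weight * counts[word] / total
    return dict(scores)


# Hypothetical lexicon: topic -> {word: weight}.
lexicon = {
    "topic_0": {"cat": 0.75, "dog": 0.25},
    "topic_1": {"dog": 0.5, "game": 1.0},
}
scores = topic_scores(["cat", "dog", "dog", "park"], lexicon)
print(scores)
```

Words that are absent from the lexicon (here "park") still count toward the group's total, so they lower every topic's score without contributing to any of them.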