Understanding Feature Table Names

This page explains the "standard" way of naming feature tables in DLATK. This is how DLATK expects them to be named, so deviate at your own risk.

Example 1: unigram, bigram, etc. features

These tables are generally created with the --add_ngrams flag of fwInterface.

feat$1to3gram$statuses_er1$user_id$16to16$0_01$pmi3_0
| f0 |field 1 |  field 2   |field3 |field4| f5 |field 6|

Field 0 Identifies the table as a feature table

Field 1 Specifies the kind of features. Here these are 1-, 2-, and 3-grams, the result of running --combine_feat_tables after --add_ngrams

Field 2 Gives the message table (-t) that the features were derived from

Field 3 Gives the group ID (-c) that features were grouped by

Field 4 Specifies scaling on features

Field 5 Shows the feature occurrence filter (--feat_occ_filter) applied to the feature table (i.e., the percentage of groups in which a feature must occur to be included in the table)

Field 6 Gives the PMI threshold set by --feat_colloc_filter and, optionally, --set_pmi_threshold
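
The field layout of Example 1 can be illustrated with a short Python sketch that splits a table name on "$" and labels the pieces. This is not part of DLATK: the function names parse_feature_table_name and decode_filter and the dictionary keys are hypothetical, and the decoding of "0_01" as 0.01 and "pmi3_0" as 3.0 assumes that "_" stands in for the decimal point.

```python
def parse_feature_table_name(name):
    """Split a feature table name on '$' and label the positional fields.

    Hypothetical helper for illustration; it is not a DLATK function.
    """
    parts = name.split("$")
    if len(parts) < 5 or parts[0] != "feat":
        raise ValueError(f"not a standard feature table name: {name!r}")
    labels = ["prefix", "feature_type", "message_table", "group_id", "scaling"]
    fields = dict(zip(labels, parts[:5]))
    # Anything after field 4 is treated as a filter suffix (fields 5 and 6).
    fields["filters"] = parts[5:]
    return fields


def decode_filter(token):
    """Turn a filter suffix into (kind, value).

    Assumption: '_' replaces the decimal point, and a leading 'pmi'
    marks the PMI threshold; anything else is an occurrence filter.
    """
    kind = "pmi" if token.startswith("pmi") else "occurrence"
    number = token[3:] if kind == "pmi" else token
    return kind, float(number.replace("_", "."))


fields = parse_feature_table_name(
    "feat$1to3gram$statuses_er1$user_id$16to16$0_01$pmi3_0"
)
print(fields["message_table"])                          # statuses_er1
print([decode_filter(t) for t in fields["filters"]])    # [('occurrence', 0.01), ('pmi', 3.0)]
```

Because the fields are positional, a name that omits the filter suffixes simply yields an empty "filters" list.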

Example 2: extracted lexicon/topic features

These tables are generally created with the --add_lex_table flag of fwInterface.

feat$cat_met_a30_2000_cp_w$messages_en$cty_id$16to16
| f0 |       field 1       |  field 2  |field3|field4|

Field 0 Identifies the table as a feature table

Field 1 Specifies the source of the features. Here they are extracted from the topic lexicon met_a30_2000, and the table was created via --add_lex_table. The trailing "_w" indicates a weighted lexicon, and "_cp" stands for "conditional probability", one of the two types of topic lexica normally created (see [[Tutorials/LDA|LDA_Tutorial]]).

Field 2 Gives the message table (-t) that the features were derived from

Field 3 Gives the group ID (-c) that features were grouped by

Field 4 Specifies scaling on features
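
As a rough illustration of how Field 1 of a lexicon feature table can be read, the Python sketch below peels the "_w" and "_cp" suffixes off the field. The function name describe_lexicon_field is hypothetical and not part of DLATK, and stripping a leading "cat_" to recover the lexicon name is an assumption inferred from this one example.

```python
def describe_lexicon_field(field1):
    """Interpret field 1 of a lexicon feature table name.

    Hypothetical helper for illustration; it is not a DLATK function.
    """
    # A trailing "_w" indicates a weighted lexicon.
    weighted = field1.endswith("_w")
    core = field1[:-len("_w")] if weighted else field1
    # "_cp" marks a conditional-probability topic lexicon.
    conditional_probability = core.endswith("_cp")
    if conditional_probability:
        core = core[:-len("_cp")]
    # Assumption: the leading "cat_" is a prefix in front of the lexicon name.
    lexicon = core[len("cat_"):] if core.startswith("cat_") else core
    return {
        "lexicon": lexicon,
        "conditional_probability": conditional_probability,
        "weighted": weighted,
    }


table_name = "feat$cat_met_a30_2000_cp_w$messages_en$cty_id$16to16"
field1 = table_name.split("$")[1]
info = describe_lexicon_field(field1)
print(info["lexicon"])   # met_a30_2000
print(info["weighted"])  # True
```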