Understanding Feature Table Names

This page will explain the standard way of naming feature tables in DLATK.

This is how DLATK expects them to be named.

Deviate at your own risk.

Structure

Every feature table has the same structure: id, group_id, feat, value and group_norm. Here is an example of a message level (group_id = message_id) 1gram table:

mysql> describe feat$1gram$msgs$message_id$16to16;
+------------+---------------------+------+-----+---------+----------------+
| Field      | Type                | Null | Key | Default | Extra          |
+------------+---------------------+------+-----+---------+----------------+
| id         | bigint(16) unsigned | NO   | PRI | NULL    | auto_increment |
| group_id   | int(11)             | YES  | MUL | NULL    |                |
| feat       | varchar(36)         | YES  | MUL | NULL    |                |
| value      | int(11)             | YES  |     | NULL    |                |
| group_norm | double              | YES  |     | NULL    |                |
+------------+---------------------+------+-----+---------+----------------+

The column naming convention is identical across tables but the MySQL Type is not, though generally feat is a varchar, value is an int and group_norm is a double. The columns are defined as follows:

  • group_id: Identifier for each group as determined from the -c flag. This is typically a message id (e.g. Tweet id), user id (e.g. Twitter user id), community id (e.g. U.S. County FIPS code or state code), etc.
  • feat: feature name such as an ngram, LDA topic id, etc.
  • value: The number of times the feature was used by the group_id.
  • group_norm: The relative frequency of the feature use for the group_id. This is usually value divided by the sum of all value`s for the `group_id.

Things to keep in mind when creating your own feature tables:

  • The id column is technically not necessary but every other column is needed.
  • Tables are sparse encoded: group_id / feat pairs are assumed to be zero if missing from the table.
  • Nulls and 0's in the group_norm column will throw an error.
  • Do not use Decimal types in feature tables.
  • Keep the group_id and feat columns indexed.

Example 1: unigram, bigram, etc features

These tables are generally created with the --add_ngrams flag of fwInterface.

feat$1to3gram$statuses_er1$user_id$16to16$0_01$pmi3_0
| f0 |field 1 |  field 2   |field3 |field4| f5 |field 6|

Field 0 Specifies this as a feature table

Field 1 Specifies kinds of features; these are 1-, 2-, and 3-grams, the result of running --combine_feat_tables after --add_ngrams

Field 2 Gives the message table (-t) that the features were derived from

Field 3 Gives the group ID (-c) that features were grouped by

Field 4 Specifies scaling on features:

Field 5 Shows feature occurrence filter (--feat_occ_filter) used on feature table (i.e., what %age of groups necessary to include feature in table)

Field 6 Gives the PMI threshold set by --feat_colloc_filter, and optionally, --set_pmi_threshold

Example 2: extracted lexicon/topic features

These tables are generally created with the --add_lex_table flag of dlatkInterface.

feat$cat_met_a30_2000_cp_w$messages_en$cty_id$1gra
| f0 |       field 1      |  field 2  |field3|field4|

Field 0 Specifies this as a feature table

Field 1 Specifies the source of features; these are extracted from the topic lexicon met_a30_2000, and the table was created via --add_lex_table. The trailing "_w" indicates a weighted lexicon. "_cp" stands for "conditional probability", one of the two types of topic lexica normally created (see Mallet LDA Interface).

Field 2 Gives the message table (-t) that the features were derived from

Field 3 Gives the group ID (-c) that features were grouped by

Field 4 The first four characters from Field 1 of the word table (--word_table) used to derive the lexicon/topic features. By default this is the 1gram table. In previous version (less than 1.1.5) this field specified the scaling on features.