.. _tut_feat_tables:

=================================
Understanding Feature Table Names
=================================

This page explains the **standard** way of naming feature tables in DLATK.

This is how DLATK expects them to be named.

Deviate at your own risk.

Structure
---------

Every feature table has the same columns: `id`, `group_id`, `feat`, `value` and `group_norm`. Here is an example of a message-level (`group_id` = `message_id`) 1gram table:

.. code-block:: mysql 

   	mysql> describe feat$1gram$msgs$message_id;
	+------------+---------------------+------+-----+---------+----------------+
	| Field      | Type                | Null | Key | Default | Extra          |
	+------------+---------------------+------+-----+---------+----------------+
	| id         | bigint(16) unsigned | NO   | PRI | NULL    | auto_increment |
	| group_id   | int(11)             | YES  | MUL | NULL    |                |
	| feat       | varchar(36)         | YES  | MUL | NULL    |                |
	| value      | int(11)             | YES  |     | NULL    |                |
	| group_norm | double              | YES  |     | NULL    |                |
	+------------+---------------------+------+-----+---------+----------------+

Column names are identical across feature tables, but the MySQL types can differ. Generally, `feat` is a `varchar`, `value` is an `int` and `group_norm` is a `double`. The columns are defined as follows:

* **group_id**: Identifier for each group as determined from the :doc:`../fwinterface/fwflag_c` flag. This is typically a message id (e.g. Tweet id), user id (e.g. Twitter user id), community id (e.g. U.S. County FIPS code or state code), etc.
* **feat**: Feature name, such as an ngram, LDA topic id, etc.
* **value**: The number of times the feature was used by the `group_id`.
* **group_norm**: The relative frequency of the feature use for the `group_id`. This is usually `value` divided by the sum of all `value` entries for the `group_id`.
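
As a minimal sketch of the usual case described above, the following Python computes `group_norm` from raw counts. The function name `add_group_norms` and the sample rows are illustrative assumptions and are not part of DLATK.

.. code-block:: python

   from collections import defaultdict

   def add_group_norms(rows):
       """Take (group_id, feat, value) rows and append group_norm to each."""
       # Sum the raw counts within each group.
       totals = defaultdict(int)
       for group_id, _feat, value in rows:
           totals[group_id] += value
       # group_norm = value / sum of all values for the same group_id.
       return [(g, f, v, v / totals[g]) for g, f, v in rows]

   # Hypothetical counts: group 1 used two features, group 2 used one.
   rows = [(1, "hello", 3), (1, "world", 1), (2, "hello", 2)]
   normed = add_group_norms(rows)

Within each group, the `group_norm` values computed this way sum to 1.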

Things to keep in mind when creating your own feature tables:

* The `id` column is optional, but every other column is required.
* Tables are sparse encoded: `group_id` / `feat` pairs are assumed to be zero if missing from the table.
* `NULL` or 0 values in the `group_norm` column will cause an error.
* Do not use `DECIMAL` types in feature tables.
* Keep the `group_id` and `feat` columns indexed.
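
The sparse encoding and the `group_norm` rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `find_bad_rows` and `lookup` are hypothetical helpers, not DLATK functions, and the rows are made-up data.

.. code-block:: python

   def find_bad_rows(rows):
       """Return rows whose group_norm is NULL (None) or 0."""
       return [r for r in rows if r[3] is None or r[3] == 0]

   def lookup(table, group_id, feat):
       """Sparse lookup: a missing (group_id, feat) pair counts as zero."""
       return table.get((group_id, feat), 0.0)

   # Hypothetical (group_id, feat, value, group_norm) rows; the last two are invalid.
   rows = [
       (1, "hello", 3, 0.75),
       (1, "world", 1, 0.25),
       (2, "hello", 0, 0.0),
       (2, "bye", 1, None),
   ]
   problems = find_bad_rows(rows)

   # Keep only valid rows; zero-valued pairs are simply left out of the table.
   table = {(g, f): gn for g, f, _v, gn in rows if gn}
   missing = lookup(table, 1, "bye")  # pair not stored, so treated as zero

Leaving zero-valued rows out, rather than storing them, is what keeps the table sparse.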

Example 1: unigram, bigram, etc. features
-----------------------------------------
These tables are generally created with the :doc:`../fwinterface/fwflag_add_ngrams` flag of dlatkInterface.

.. code-block:: bash

	feat$1to3gram$statuses_er1$user_id$16to1$0_01$pmi3_0
	| f0 |field 1 |  field 2   |field3 |field4| f5 |field 6|

**Field 0** Specifies this as a feature table. All feature tables begin with the word "feat".

**Field 1** Specifies the kind of features; here, these are 1-, 2-, and 3-grams, the result of running :doc:`../fwinterface/fwflag_combine_feat_tables` after :doc:`../fwinterface/fwflag_add_ngrams`.

**Field 2** Gives the message table (:doc:`../fwinterface/fwflag_t`) that the features were derived from

**Field 3** Gives the group ID (:doc:`../fwinterface/fwflag_c`) that features were grouped by

**Field 4** Specifies scaling on features. The default (or unscaled) feature tables do not include this field.

* *16to8*: :doc:`../fwinterface/fwflag_anscombe`
* *16to4*: :doc:`../fwinterface/fwflag_sqrt`
* *16to3*: :doc:`../fwinterface/fwflag_log`
* *16to1*: :doc:`../fwinterface/fwflag_boolean`

**Field 5** Shows the feature occurrence filter (:doc:`../fwinterface/fwflag_feat_occ_filter`) applied to the feature table, i.e., the percentage of groups in which a feature must occur to be included in the table.

**Field 6** Gives the PMI threshold set by :doc:`../fwinterface/fwflag_feat_colloc_filter` and, optionally, :doc:`../fwinterface/fwflag_set_pmi_threshold`.
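
To make the field layout concrete, here is a small Python sketch that splits an ngram feature table name on `$` and labels the fields described above. The helper `parse_ngram_table` is hypothetical (it is not a DLATK function), and it assumes that `$` never appears inside a field and that the optional fields appear in the order shown.

.. code-block:: python

   # Scaling codes and the flags they correspond to, as listed under Field 4.
   SCALING = {
       "16to8": "anscombe",
       "16to4": "sqrt",
       "16to3": "log",
       "16to1": "boolean",
   }

   def parse_ngram_table(name):
       """Split a feature table name into labeled fields (illustrative only)."""
       parts = name.split("$")
       if len(parts) < 4 or parts[0] != "feat":
           raise ValueError("not a standard feature table name: " + name)
       info = {
           "feature_type": parts[1],   # Field 1
           "message_table": parts[2],  # Field 2
           "group_field": parts[3],    # Field 3
           "scaling": None,            # Field 4 (absent on unscaled tables)
       }
       rest = parts[4:]
       if rest and rest[0] in SCALING:
           info["scaling"] = SCALING[rest.pop(0)]
       # Remaining fields (occurrence filter, PMI threshold) are kept as raw strings.
       info["filters"] = rest
       return info

   info = parse_ngram_table("feat$1to3gram$statuses_er1$user_id$16to1$0_01$pmi3_0")
   plain = parse_ngram_table("feat$1gram$msgs$message_id")

The filter fields are deliberately left undecoded, since this sketch makes no assumption about how their values are written.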

Example 2: extracted lexicon/topic features
-------------------------------------------
These tables are generally created with the :doc:`../fwinterface/fwflag_add_lex_table` flag of dlatkInterface.

.. code-block:: bash

	feat$cat_met_a30_2000_cp_w$messages_en$cty_id$1gra
	| f0 |       field 1      |  field 2  |field3|field4|

**Field 0** Specifies this as a feature table

**Field 1** Specifies the source of features; these are extracted from the topic lexicon *met_a30_2000*, and the table was created via :doc:`../fwinterface/fwflag_add_lex_table`. The trailing "*_w*" indicates a weighted lexicon. "*_cp*" stands for "conditional probability", one of the two types of topic lexica normally created (see :doc:`../tutorials/tut_lda`).

**Field 2** Gives the message table (:doc:`../fwinterface/fwflag_t`) that the features were derived from

**Field 3** Gives the group ID (:doc:`../fwinterface/fwflag_c`) that features were grouped by

**Field 4** The first four characters of Field 1 of the word table (:doc:`../fwinterface/fwflag_word_table`) used to derive the lexicon/topic features. By default, this is the 1gram table. In versions before 1.1.5, this field specified the scaling on features.
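
The same approach works for lexicon/topic tables. The sketch below is illustrative only: `parse_lex_table` is a hypothetical helper, the word table name `feat$1gram$messages_en$cty_id` is an assumption inferred from the naming convention, and the layout assumed is the one used from version 1.1.5 onward.

.. code-block:: python

   def parse_lex_table(name):
       """Split a lexicon/topic feature table name into labeled fields."""
       parts = name.split("$")
       if len(parts) != 5 or parts[0] != "feat":
           raise ValueError("unexpected lexicon feature table name: " + name)
       _, source, message_table, group_field, word_prefix = parts
       return {
           "source": source,                    # Field 1
           "weighted": source.endswith("_w"),   # trailing "_w" marks a weighted lexicon
           "message_table": message_table,      # Field 2
           "group_field": group_field,          # Field 3
           "word_table_prefix": word_prefix,    # Field 4
       }

   info = parse_lex_table("feat$cat_met_a30_2000_cp_w$messages_en$cty_id$1gra")

   # Field 4 is the first four characters of Field 1 of the word table.
   word_table = "feat$1gram$messages_en$cty_id"  # assumed default 1gram table
   prefix_matches = word_table.split("$")[1][:4] == info["word_table_prefix"]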