--aggregate_feats_by_new_group

Switch

--aggregate_feats_by_new_group

Description

Aggregate feature table by group field (i.e. message_id features by user_ids).

Argument and Default Value

Required Switches:

Details

Other Switches

Required Switches:

Optional Switches:

  • None

Example Commands

Example use case: feature tables broken down by month being recombined. Feature tables in question:

| feat$1to3gram$msgsPA_2012$cntyYM$16to16$0_01                   |
| feat$1to3gram$msgsPA_2012_01$cnty$16to16$0_01                  |
| feat$1to3gram$msgsPA_2012_02$cnty$16to16$0_01                  |
| feat$1to3gram$msgsPA_2012_03$cnty$16to16$0_01                  |
| feat$1to3gram$msgsPA_2012_04$cnty$16to16$0_01                  |
| feat$1to3gram$msgsPA_2012_05$cnty$16to16$0_01                  |
| feat$1to3gram$msgsPA_2012_06$cnty$16to16$0_01                  |
| feat$1to3gram$msgsPA_2012_07$cnty$16to16$0_01                  |
| feat$1to3gram$msgsPA_2012_08$cnty$16to16$0_01                  |
| feat$1to3gram$msgsPA_2012_09$cnty$16to16$0_01                  |
| feat$1to3gram$msgsPA_2012_10$cnty$16to16$0_01                  |
| feat$1to3gram$msgsPA_2012_11$cnty$16to16$0_01                  |
| feat$1to3gram$msgsPA_2012_12$cnty$16to16$0_01                  |

Note that feat$1to3gram$msgsPA_2012$cntyYM$16to16$0_01 has been created to hold the intermediate step.

Original tables like feat$1to3gram$msgsPA_2012_12$cnty$16to16$0_01:

+----------+--------------+-------+--------------------------+
| group_id | feat         | value | group_norm               |
+----------+--------------+-------+--------------------------+
|    42101 | soon as      |    73 |  0.000037435667062561665 |
|    42017 | are now      |     1 |  0.000019275622120703946 |
|    42077 | i tell u     |     1 |  0.000036302911493501776 |
|    42003 | fuck out .   |     3 |  0.000005261560964829973 |
|    42025 | the last day |     1 |   0.00040225261464199515 |
|    42101 | it ! we      |     2 | 0.0000011117948649530572 |
|    42017 | is willing   |     1 |  0.000019275622120703946 |
|    42049 | kno wat      |     1 |   0.00003332777870354941 |
|    42003 | just barely  |     1 | 0.0000016047294586605644 |
|    42101 | 2006         |    48 |   0.00002283123484160593 |
+----------+--------------+-------+--------------------------+

Initial collapse mysql command, e.g. (note different group field, cntyYM instead of cnty):

First collapsed table (feat$1to3gram$msgsPA_2012$cntyYM$16to16$0_01):

+---------------+------------------+-------+--------------------------+
| group_id      | feat             | value | group_norm               |
+---------------+------------------+-------+--------------------------+
| 42091_2012_11 | #theatreproblems |     1 |   0.00001484692817056151 |
| 42101_2012_02 | inappropriate    |     7 | 0.0000026320123388738445 |
| 42125_2012_03 | for everything   |     1 |  0.000040645449741901396 |
| 42101_2012_05 | fun and          |    18 |  0.000013396794444795903 |
| 42071_2012_02 | at my desk       |     1 |  0.000019890601690701143 |
| 42071_2012_11 | album will       |     1 |  0.000019987607683236393 |
| 42003_2012_06 | waste it         |     1 | 0.0000012686121256484195 |
| 42003_2012_08 | tweet goes out   |     1 |  0.000006096408605690388 |
| 42101_2012_04 | day :            |    29 |    0.0000182426537148019 |
| 42101_2012_07 | girl you better  |     1 | 0.0000007502882982786135 |
+---------------+------------------+-------+--------------------------+

Create a new dummy "message table" containing mappings from new group_id to original group_id with mysql commands like: insert into cntyYM_to_cnty values('42007_2012_01','42007');

Resulting table (cntyYM_to_cnty):

+---------------+-------+
| cntyYM        | cnty  |
+---------------+-------+
| 42001_2012_01 | 42001 |
| 42001_2012_02 | 42001 |
| 42001_2012_03 | 42001 |
| 42001_2012_04 | 42001 |
| 42001_2012_05 | 42001 |
+---------------+-------+

Aggregation fwInterface command:

dlatkInterface.py -d paHealth -t cntyYM_to_cnty -c cnty \
  -f 'feat$1to3gram$msgsPA_2012$cntyYM$16to16$0_01' \
  --aggregate_feats_by_new_group

At its heart, this function runs two SQL commands:

INSERT INTO feat$agg_1to3gram$msgsPA_2012$cnty
  SELECT m.cnty, f.feat, sum(f.value), 0 FROM feat$1to3gram$msgsPA_2012$cntyYM$16to16$0_01 AS f,
  cntyYM_to_cnty AS m where m.cntyYM = f.group_id GROUP BY m.cnty, f.feat
UPDATE feat$agg_1to3gram$msgsPA_2012$cnty a INNER JOIN
  (SELECT group_id,sum(value) sum FROM feat$agg_1to3gram$msgsPA_2012$cnty
  GROUP BY group_id) b ON a.group_id=b.group_id SET a.group_norm=a.value/b.sum

The output table name could probably be improved with better logic. After the fact, I changed it from feat$agg_1to3gram$msgsPA_2012$cnty to feat$1to3gram$msgsPA_2012$cnty$10to16$0_01.