.. _tut_dla:

==================
DLA Rules of Thumb
==================

Notation: p = |feats| (the number of features); N is the number of observations (groups).

First rule of thumb: there really are no hard and fast rules that always apply, but these are good places to start.

DLA
---

* Feat_occ_filter (:doc:`../fwinterface/fwflag_feat_occ_filter`)

  * Rule of thumb: N/2 < p < N
  * Depends on:

    * Expected effect sizes
    * Sparsity of your observations
    * How many words you have per person (a measure of how well word use rates are estimated)
    * "True" strength of the relationship

Colloc_filter
-------------

* See :doc:`../fwinterface/fwflag_feat_colloc_filter`
* When to apply?

  * Good for DLA.
  * Usually not good for prediction (yields less accurate models).

* Generally, a PMI threshold of 3 works for anything from 2-grams to 4-grams (see the example command at the end of this page).

Prediction
----------

* Feat_occ_filter

  * See :doc:`../fwinterface/fwflag_feat_occ_filter`
  * Rule of thumb: N < p < 2*N (when using "magic sauce" feature selection or LASSO (L1) penalization)

* Colloc filter: doesn't usually help (and sometimes hurts).
* What usually works best:

  * Regression (listed in order of what usually works best; see the example command at the end of this page):

    * Feat_occ_filter => univariate selection => PCA => L2 (ridge) regression (:doc:`../fwinterface/fwflag_feature_selection` magic sauce; :doc:`../fwinterface/fwflag_model` ridgecv)
    * LASSO (L1) regression, with no separate feature selection
    * ElasticNet (L1 + L2) regression (presumably worse because there is one more hyper-parameter to set)

  * Choosing the number of dimensions for PCA:

    * Many observations (N >> 10k): 10% of p
    * Few observations (N < 10k): 50% of N
    * (See the more complicated functions in regressionPredictor.py, featureSelectionString.)

  * Classification:

    * L1 linear SVM (:doc:`../fwinterface/fwflag_model` linear-svc)
    * L1 logistic regression (:doc:`../fwinterface/fwflag_model` lr)
    * Extremely randomized trees (:doc:`../fwinterface/fwflag_model` etc)

Levels of analysis and group frequency threshold
------------------------------------------------

* See :doc:`../fwinterface/fwflag_group_freq_thresh`
* County level:

  * OK to push the boundaries for p (use many features relative to the number of observations), because the features are well estimated.
  * GFT: 20k to 50k range (if the data are really good, use 50k).

* User level:

  * The rules for p above apply directly.
  * GFT: 1k (500 if N < 5k; 2k if N > 100k usually has no benefit).

* Message level:

  * The rules above apply, except that it is normally best to use binary encoding of 1- to 3-grams.
  * GFT: 1 (but depends on the task).
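
Example commands
----------------

The regression recipe above (occurrence filter, then "magic sauce" feature selection with ridge regression) might look as follows on the command line. This is a minimal sketch: the database, message table, group field, feature table, outcome table, and outcome name (dla_tutorial, msgs, user_id, blog_outcomes, age) are placeholder names from the DLATK tutorials; substitute your own, and tune --set_p_occ so that N < p < 2*N.

.. code-block:: bash

    # Step 1: make an occurrence-filtered copy of the feature table
    # (here, keep features used by at least 5% of groups).
    dlatkInterface.py -d dla_tutorial -t msgs -c user_id \
        -f 'feat$1gram$msgs$user_id$16to16' \
        --group_freq_thresh 500 \
        --feat_occ_filter --set_p_occ 0.05

    # Step 2: cross-validated ridge regression with "magic sauce"
    # feature selection, run on the filtered table from step 1.
    dlatkInterface.py -d dla_tutorial -t msgs -c user_id \
        -f 'feat$1gram$msgs$user_id$16to16$0_05' \
        --group_freq_thresh 500 \
        --outcome_table blog_outcomes --outcomes age \
        --combo_test_regression --folds 10 \
        --model ridgecv --feature_selection magic_sauce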
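
Similarly, a collocation-filtered 1- to 3-gram table for DLA might be built as sketched below, using the PMI threshold of 3 suggested above; again, the database and table names are placeholders from the DLATK tutorials.

.. code-block:: bash

    # Extract 1-, 2-, and 3-grams, combine them into one table, and
    # keep only multiword features whose PMI is at least 3.
    dlatkInterface.py -d dla_tutorial -t msgs -c user_id \
        --add_ngrams -n 1 2 3 \
        --combine_feat_tables 1to3gram \
        --feat_colloc_filter --set_pmi_threshold 3.0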