



Filter features based on how many groups use them

Argument and Default Value

There is no default value.


The group ratio is set with --set_p_occ RATIO_OF_GROUPS. --feat_occ_filter then keeps only those features that are used by at least a RATIO_OF_GROUPS fraction of the groups. The features that are removed are aggregated into a single feature called <OOV>, which contains the value and group norm data for all of the removed features.
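The filtering rule described above can be illustrated with a small in-memory sketch. This is not DLATK's implementation (DLATK operates on database feature tables). The function name occ_filter, the dictionary layout, and the toy data are hypothetical, and summing the values and group norms of the removed features into <OOV> is an assumption about how the aggregation is done.

```python
from collections import Counter


def occ_filter(group_feats, p_occ):
    """Keep features used by at least p_occ of the groups; fold the rest into <OOV>.

    group_feats: {group_id: {feat: (value, group_norm)}}  (hypothetical layout)
    """
    n_groups = len(group_feats)
    # Count how many groups use each feature (each group counts once per feature).
    groups_using = Counter(feat for feats in group_feats.values() for feat in feats)
    min_groups = p_occ * n_groups
    kept = {feat for feat, n in groups_using.items() if n >= min_groups}

    filtered = {}
    for group_id, feats in group_feats.items():
        row = {}
        oov_value, oov_norm, has_oov = 0, 0.0, False
        for feat, (value, group_norm) in feats.items():
            if feat in kept:
                row[feat] = (value, group_norm)
            else:
                # Assumption: removed features are summed into one <OOV> entry.
                oov_value += value
                oov_norm += group_norm
                has_oov = True
        if has_oov:
            row["<OOV>"] = (oov_value, oov_norm)
        filtered[group_id] = row
    return filtered


# Toy data: three groups (e.g. users) with made-up counts and group norms.
toy = {
    "u1": {"the": (2, 0.5), "cat": (1, 0.25), "zyx": (1, 0.25)},
    "u2": {"the": (3, 0.75), "dog": (1, 0.25)},
    "u3": {"the": (1, 0.5), "cat": (1, 0.5)},
}
result = occ_filter(toy, p_occ=0.5)
```

With p_occ=0.5 and three groups, a feature must appear in at least 1.5 groups to be kept, so "the" and "cat" survive, while "zyx" and "dog" are folded into <OOV> for the groups that used them.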

Other Switches

Required Switches:

Example Commands

# Extract ngrams and filter in one command
dlatkInterface.py -d fb22 -t msgsEn_r5k -c user_id --add_ngrams -n 1 2 3 --combine_feat_tables 1to3gram --feat_occ_filter --set_p_occ 0.05

# Add a filter to a table that was generated without using collocs
dlatkInterface.py -d fb22 -t msgsEn_r5k -c user_id -f 'feat$1to3gram$msgsEn_r5k$user_id$16to16' --feat_occ_filter --set_p_occ 0.05

# Add a filter to a table that was generated using collocs, override the default group_freq_thresh value
dlatkInterface.py -d fb22 -t msgsEn_r5k -c user_id -f 'feat$colloc$msgsEn_r5k$user_id$16to16' --word_table 'feat$colloc$msgsEn_r5k$user_id$16to16' --feat_occ_filter --set_p_occ 0.05 --group_freq_thresh 50