--train_classifiers

Switch

--train_classifiers

Description

Trains a classification model using the features given.

Argument and Default Value

None

Details

This switch causes the infrastructure to train a machine learning model that predicts the outcome(s) given by --outcomes from the features in the feature table(s) given by -f (multiple feature tables can be listed). Features are loaded into memory, filtered or clustered by the feature selection step (see below), standardized over the groups (unless --no_standardize is used), and then fed into the classification model. It is usually useful to combine this switch with --save_model; when you do, record the order of the features in the saved model's name, because that order is not yet stored in the model itself.
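As a rough illustration of the standardization step, the sketch below z-scores each feature column across all groups using only the standard library. The function name, the sample data, the use of the population standard deviation, and the handling of constant columns are assumptions made for this example and are not taken from the infrastructure's code.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Z-score each feature column across all groups (one row per group)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stdevs = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stdevs)]
        for row in rows
    ]


# Hypothetical data: three groups (rows) by two features (columns).
features = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
standardized = standardize_columns(features)
```

After this step every non-constant column has mean 0 and unit variance, so features measured on different scales contribute comparably to the classifier.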

Feature Selection

To avoid overfitting, a couple of feature selection steps are available. Most of the feature selection is done with the Scikit-Learn package. Several pre-made feature selectors are provided; to pick one, (un)comment the corresponding lines below this line:

# feature selection:
featureSelectionString = None

Every feature selector string creates an object when evaluated, and that object must provide the following two methods:

fit(X, y)
transform(X)

When putting a lot of features into the model, it is a good idea to use the pipeline feature selection:

featureSelectionString = (
    'Pipeline([("1_mean_value_filter", OccurrenceThreshold(threshold=(X.shape[0]/100.0))), '
    '("2_univariate_select", SelectFwe(f_regression, alpha=70.0)), '
    '("3_rpca", RandomizedPCA(n_components=.4/len(self.featureGetters), random_state=42, '
    'whiten=False, iterated_power=3, max_components=X.shape[0]/max(1.5, len(self.featureGetters))))])'
)
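The numeric expressions inside the pipeline string depend on the data size and on how many feature tables are loaded. The sketch below evaluates them for a hypothetical run, assuming that X.shape[0] is the number of groups (rows) and that len(self.featureGetters) is the number of feature tables; the function and parameter names are made up for illustration.

```python
def pipeline_parameters(num_groups, num_feature_tables):
    """Evaluate the numeric expressions used in the pipeline string."""
    return {
        # OccurrenceThreshold(threshold=(X.shape[0]/100.0))
        "occurrence_threshold": num_groups / 100.0,
        # RandomizedPCA(n_components=.4/len(self.featureGetters), ...)
        "n_components": 0.4 / num_feature_tables,
        # max_components=X.shape[0]/max(1.5, len(self.featureGetters))
        "max_components": num_groups / max(1.5, num_feature_tables),
    }


# Hypothetical run: 3000 groups and two feature tables.
params = pipeline_parameters(num_groups=3000, num_feature_tables=2)
```

With these inputs the occurrence threshold is 30.0, n_components is 0.2, and max_components is 1500.0. With a single feature table the divisor for max_components is 1.5 rather than 1, so it becomes 2000.0. This sketch uses Python 3 true division throughout.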

If there aren't many features, you can choose not to use any feature selection. Talk to a CS PostDoc about this :)
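The selector contract described above (a string that, when evaluated, yields an object with fit(X, y) and transform(X)) can be sketched as follows. MinOccurrenceFilter is a hypothetical stand-in written for this example, not one of the pre-made selectors, and the plain nested lists stand in for the real feature matrix.

```python
class MinOccurrenceFilter:
    """Hypothetical selector: keep columns that are nonzero in enough rows."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.kept_columns = []

    def fit(self, X, y=None):
        # Count, per column, how many rows hold a nonzero value.
        counts = [sum(1 for value in column if value != 0) for column in zip(*X)]
        self.kept_columns = [i for i, count in enumerate(counts) if count >= self.threshold]
        return self

    def transform(self, X):
        # Keep only the columns chosen during fit.
        return [[row[i] for i in self.kept_columns] for row in X]


X = [[1, 0, 3], [2, 0, 0], [4, 5, 6]]
y = [0, 1, 0]

# The selector is stored as a string and only becomes an object when evaluated,
# with X available in the evaluation namespace.
selector_string = "MinOccurrenceFilter(threshold=len(X) / 2.0)"
selector = eval(selector_string, {"MinOccurrenceFilter": MinOccurrenceFilter, "X": X})
X_selected = selector.fit(X, y).transform(X)
```

Here the middle column is nonzero in only one of three rows, which falls below the threshold of 1.5, so it is dropped. Any object exposing the same two methods could be swapped in through the string.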

Model selection

See below for choosing the model. Once the model is chosen, tune its parameters by commenting in or out the appropriate lines in classifyPredictor.py, below this line:

# Model Parameters
cvParams = {...

You can choose your model using --model, with one of the following values:

svc (Support Vector Classification)
linear-svc (Support Vector Classification with Linear Kernel)
lr (Logistic Regression)
etc (ExtraTrees Classification)
rfc (RandomForest Classification)
pac (Passive Aggressive Classification)
lda (Linear Discriminant Analysis)
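For orientation, the sketch below pairs each --model value with the Scikit-Learn class that most plausibly corresponds to it. The pairing is an assumption based on the descriptions in the list above, not something read from classifyPredictor.py, and the helper name is hypothetical.

```python
# Assumed correspondence between --model values and Scikit-Learn class names.
MODEL_TO_SKLEARN_CLASS = {
    "svc": "SVC",
    "linear-svc": "LinearSVC",
    "lr": "LogisticRegression",
    "etc": "ExtraTreesClassifier",
    "rfc": "RandomForestClassifier",
    "pac": "PassiveAggressiveClassifier",
    "lda": "LinearDiscriminantAnalysis",
}


def resolve_model(name):
    """Return the assumed class name for a --model value, or raise on unknown names."""
    try:
        return MODEL_TO_SKLEARN_CLASS[name]
    except KeyError:
        valid = ", ".join(sorted(MODEL_TO_SKLEARN_CLASS))
        raise ValueError(f"Unknown model {name!r}; expected one of: {valid}")


chosen = resolve_model("lr")
```

Failing early on an unrecognized name avoids loading features into memory before discovering a typo in the --model value.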

Other Switches

Required Switches: -d, -g, -t, -f, --outcome_table, --outcomes

Optional Switches: --group_freq_thresh, --model, --save_model, --picklefile, --no_standardize, --sparse, --classification_to_lexicon, etc.
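The sketch below shows one way to assemble the required switches into a single command line. It rests on several assumptions: -d, -g, and -t are taken to name a database, a group field, and a message table; --picklefile is taken to accept a file path; and the executable name and all argument values are hypothetical placeholders.

```python
import shlex


def build_command(executable, database, group_field, table, feature_tables,
                  outcome_table, outcomes, picklefile=None):
    """Assemble a command line from the required switches (meanings assumed)."""
    args = [
        executable,
        "-d", database,
        "-g", group_field,
        "-t", table,
        "-f", *feature_tables,          # multiple feature tables are allowed
        "--outcome_table", outcome_table,
        "--outcomes", *outcomes,
        "--train_classifiers",
    ]
    if picklefile is not None:
        # Assumption: --picklefile takes the path where the model is saved.
        args += ["--save_model", "--picklefile", picklefile]
    # Quote each argument so unusual characters survive the shell.
    return " ".join(shlex.quote(arg) for arg in args)


# Every value below is a hypothetical placeholder.
command = build_command(
    executable="interface_script.py",
    database="my_db",
    group_field="user_id",
    table="msgs",
    feature_tables=["feat_a", "feat_b"],
    outcome_table="outcomes",
    outcomes=["gender"],
    picklefile="model.feat_a.feat_b.pickle",
)
```

The placeholder pickle file name includes the feature tables in order, following the advice to record the feature order alongside a saved model.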

Example Commands