dlatk package

Submodules

dlatk.DDLA module

class dlatk.DDLA.DDLA(file1, file2, outputFile=None)[source]

Bases: object

add2Output(csvOutput, outcome_data, outcome_name, featsInOrder)[source]
compare_correl(ra, na, rb, nb)[source]
data = None
differential()[source]
file1 = None
file2 = None
get_next(some_iterable, window=1)[source]
header = None
ignoreColumns = {'feature', 'freq', 'N', 'p'}
load_data()[source]
outputData = None
outputForTagclouds(sizeField=1, colorField='dr')[source]
print_sorted(dataDict, features)[source]
signed_r_log(r)[source]
signed_r_square(r)[source]
write2CSV(dataDict, features)[source]
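
Example

A minimal usage sketch (the file names are illustrative and the exact call sequence is an assumption; both inputs are taken to be correlation output CSVs of the kind DDLA compares):

>>> dd = DDLA('correls_group_a.csv', 'correls_group_b.csv', outputFile='ddla_output.csv')
>>> dd.load_data()
>>> dd.differential()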

dlatk.classifyPredictor module

Classify Predictor

Interfaces with DLATK and scikit-learn to perform classification of binary outcomes for language features.

class dlatk.classifyPredictor.ClassifyPredictor(og, fgs, modelName='svc')[source]

Bases: object

Interfaces with scikit-learn to perform prediction of outcomes for language features.

cvParams

dict

modelToClassName

dict

modelToCoeffsName

dict

cvJobs

int

cvFolds

int

chunkPredictions

boolean

maxPredictAtTime

int

backOffPerc

float

backOffModel

str

featureSelectionString

str or None

featureSelectMin

int

featureSelectPerc

float

testPerc

float

randomState

int

trainingSize

int

Parameters:
  • outcomeGetter (OutcomeGetter object) --
  • featureGetters (list of FeatureGetter objects) --
  • modelName (str, optional) --
Returns:

Return type:

ClassifyPredictor object
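
Examples

A minimal train/test sketch (assumes og is an initialized OutcomeGetter and fg an initialized FeatureGetter; the variable and file names are illustrative):

>>> cp = ClassifyPredictor(og, [fg], modelName='svc')
>>> cp.train(standardize=True, sparse=True)
>>> cp.test(standardize=True, sparse=True)
>>> cp.save('my_classifier.pickle')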

backOffModel = 'linear-svc'
backOffPerc = 0.0
chunkPredictions = False
classificationModels = None

dict -- Docstring after attribute, with type specified.

cvFolds = 3
cvJobs = 8
cvParams = {'rfc': [{'n_estimators': [1000], 'n_jobs': [10]}], 'gnb': [{}], 'linear-svc': [{'C': [0.01], 'penalty': ['l1'], 'class_weight': ['balanced'], 'dual': [False]}], 'gbc': [{'subsample': [0.4], 'random_state': [42], 'n_estimators': [500], 'max_depth': [5]}], 'etc': [{'max_features': ['sqrt'], 'min_samples_split': [2], 'criterion': ['gini'], 'n_estimators': [200], 'n_jobs': [12]}], 'lr': [{'C': [0.01], 'penalty': ['l2'], 'dual': [False]}], 'pac': [{'C': [1, 0.1, 10], 'n_jobs': [10]}], 'svc': [{'C': [1, 10, 100, 1000, 10000, 0.1], 'kernel': ['rbf'], 'gamma': [0.001, 0.01, 0.0001, 0.1, 1e-05, 1]}], 'bnb': [{'alpha': [1.0], 'fit_prior': [False], 'binarize': [True]}], 'mnb': [{'alpha': [1.0], 'fit_prior': [False], 'class_prior': [None]}]}
fSelectors = None

dict -- Docstring after attribute, with type specified.

featureNames = None

list -- Holds the order the features are expected in.

featureSelectMin = 50
featureSelectPerc = 1.0
featureSelectionString = None
getWeightsForFeaturesAsADict()[source]
load(filename, pickle2_7=True)[source]
maxPredictAtTime = 30000
modelName = None

str -- Docstring after attribute, with type specified.

modelToClassName = {'rfc': 'RandomForestClassifier', 'gnb': 'GaussianNB', 'linear-svc': 'LinearSVC', 'gbc': 'GradientBoostingClassifier', 'etc': 'ExtraTreesClassifier', 'lr': 'LogisticRegression', 'pac': 'PassiveAggressiveClassifier', 'svc': 'SVC', 'bnb': 'BernoulliNB', 'mnb': 'MultinomialNB'}
modelToCoeffsName = {'lr': 'coef_', 'svc': 'coef_', 'bnb': 'feature_log_prob_', 'linear-svc': 'coef_', 'mnb': 'coef_'}
multiFSelectors = None

str -- Docstring after attribute, with type specified.

multiScalers = None

str -- Docstring after attribute, with type specified.

multiXOn = None

boolean -- whether multiX was used for training.

old_predict(standardize=True, sparse=False, restrictToGroups=None)[source]
old_train(standardize=True, sparse=False, restrictToGroups=None)[source]

Trains classification models

predict(standardize=True, sparse=False, restrictToGroups=None, groupsWhere='')[source]
predictAllToFeatureTable(standardize=True, sparse=False, fe=None, name=None, groupsWhere='')[source]
predictNoOutcomeGetter(groups, standardize=True, sparse=False, restrictToGroups=None)[source]
predictToFeatureTable(standardize=True, sparse=False, fe=None, name=None, groupsWhere='')[source]
predictToOutcomeTable(standardize=True, sparse=False, fe=None, name=None, restrictToGroups=None, groupsWhere='')[source]
static printComboControlPredictionProbsToCSV(scores, outputstream, paramString=None, delimiter='|')[source]

Prints predictions with all combinations of controls to CSV.

static printComboControlPredictionsToCSV(scores, outputstream, paramString=None, delimiter='|')[source]

Prints predictions with all combinations of controls to CSV.

static printComboControlScoresToCSV(scores, outputstream=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>, paramString=None, delimiter='|')[source]

Prints scores with all combinations of controls to CSV.

randomState = 42
roc(standardize=True, sparse=False, restrictToGroups=None, output_name=None, groupsWhere='')[source]

Tests the classifier by pulling out a random testPerc percentage as a test set

roc_curves(y_test, y_score, output_name=None)[source]
save(filename)[source]
scalers = None

dict -- Docstring after attribute, with type specified.

test(standardize=True, sparse=False, saveModels=False, blacklist=None, groupsWhere='')[source]

Tests the classifier by pulling out a random testPerc percentage as a test set

testControlCombos(standardize=True, sparse=False, saveModels=False, blacklist=None, noLang=False, allControlsOnly=False, comboSizes=None, nFolds=2, savePredictions=False, weightedEvalOutcome=None, stratifyFolds=True, adaptTables=None, adaptColumns=None, groupsWhere='')[source]

Tests the classifier by pulling out a random testPerc percentage as a test set

testPerc = 0.2
train(standardize=True, sparse=False, restrictToGroups=None, groupsWhere='')[source]

Trains classification models

trainingSize = 1000000
class dlatk.classifyPredictor.VERPCA(n_components=None, copy=True, iterated_power=3, whiten=False, random_state=None, max_components_ratio=0.25)[source]

Bases: sklearn.decomposition.pca.RandomizedPCA

Randomized PCA that sets number of components by variance explained

Parameters:
  • n_components (int) -- Maximum number of components to keep: default is 50.
  • copy (bool) -- If False, data passed to fit are overwritten
  • iterated_power (int, optional) -- Number of iteration for the power method. 3 by default.
  • whiten (bool, optional) --

    When True (False by default) the components_ vectors are divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.

    Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.

  • random_state (int or RandomState instance or None (default)) -- Pseudo Random Number generator seed control. If None, use the numpy.random singleton.
  • max_components_ratio (float) -- Maximum number of components in terms of their ratio to the number of features. Default is 0.25 (1/4).
fit(X, y=None)[source]

Fit the model to the data X.

Parameters:X (array-like or scipy.sparse matrix, shape (n_samples, n_features)) -- Training vector, where n_samples is the number of samples and n_features is the number of features.
Returns:self -- Returns the instance itself.
Return type:object
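
Example

A small sketch (assumes a scikit-learn version that still provides sklearn.decomposition.pca.RandomizedPCA, which VERPCA subclasses; the data here is random and purely illustrative):

>>> import numpy as np
>>> X = np.random.rand(200, 50)
>>> pca = VERPCA(n_components=20, max_components_ratio=0.25)
>>> pca.fit(X)
>>> reduced = pca.transform(X)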
dlatk.classifyPredictor.alignDictsAsXy(X, y, sparse=False, returnKeyList=False, keys=None)[source]

turns a list of dicts for x and a dict for y into a matrix X and vector y
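
Example

A sketch of the documented behavior, assuming each dict in X maps group ids to one feature's values and y maps the same group ids to outcome values (keys and values are illustrative):

>>> X = [{'user1': 0.2, 'user2': 0.5}, {'user1': 0.0, 'user2': 0.1}]
>>> y = {'user1': 1, 'user2': 0}
>>> Xmatrix, yvector = alignDictsAsXy(X, y)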

dlatk.classifyPredictor.alignDictsAsy(y, *yhats, **kwargs)[source]
dlatk.classifyPredictor.chunks(X, y, size)[source]

Yield successive n-sized chunks from X and Y.

dlatk.classifyPredictor.foldN(l, folds)[source]
dlatk.classifyPredictor.getGroupsFromGroupNormValues(gnvs)[source]
dlatk.classifyPredictor.grouper(3, 'abcdefg', 'x') --> ('a', 'b', 'c'), ('d', 'e', 'f'), ('g', 'x', 'x')[source]
dlatk.classifyPredictor.hasMultValuesPerItem(listOfD)[source]

returns true if the dictionary has a list with more than one element

dlatk.classifyPredictor.matrixAppendHoriz(A, B)[source]
dlatk.classifyPredictor.pos_neg_auc(y1, y2)[source]
dlatk.classifyPredictor.r2simple(ytrue, ypred)[source]
dlatk.classifyPredictor.stratifyGroups(groups, outcomes, folds, randomState=42)[source]

breaks groups up into folds such that each fold has at most 1 more of a class than other folds

dlatk.classifyPredictor.xFoldN(l, folds)[source]

Yield successive n-sized chunks from l.

dlatk.clustering module

dlatk.featureExtractor module

class dlatk.featureExtractor.FeatureExtractor(corpdb, corptable, correl_field, mysql_host, message_field, messageid_field, encoding, use_unicode, lexicondb='permaLexicon', date_field='updated_time', wordTable=None)[source]

Bases: dlatk.dlaWorker.DLAWorker

Deals with extracting features from text and writing tables of features

Returns:
Return type:FeatureExtractor object

Examples

Extract 1, 2 and 3 grams

>>> for n in range(1, 4):
...     fe.addNGramTable(n=n)
addCharNGramTable(n, lowercase_only=True, min_freq=1, tableName=None, valueFunc=<function FeatureExtractor.<lambda>>, metaFeatures=True)[source]

Extract character ngrams from a message table

Parameters:
  • n (int) --
  • lowercase_only (boolean) -- use only lowercase charngrams if True
  • min_freq (int, optional) --
  • tableName (str, optional) --
  • valueFunc (lambda, optional) -- Scales the features by the function given
  • metaFeatures (boolean, optional) --
Returns:

featureTableName -- Name of n-gram table: feat%nCgram%corptable%correl_field%transform

Return type:

str

addCollocFeatTable(collocList, lowercase_only=True, min_freq=1, tableName=None, valueFunc=<function FeatureExtractor.<lambda>>, includeSubCollocs=False, featureTypeName=None)[source]

???

Parameters:
  • collocList (list) --
  • min_freq (int, optional) --
  • tableName (str, optional) --
  • valueFunc (lambda, optional) -- Scales the features by the function given
  • includeSubCollocs (boolean, optional) --
  • featureTypeName (?????, optional) --
Returns:

featureTableName -- Name of n-gram table: ?????

Return type:

str

addCorpLexTable(lexiconTableName, lowercase_only=True, tableName=None, valueFunc=<function FeatureExtractor.<lambda>>, isWeighted=False, featValueFunc=<function FeatureExtractor.<lambda>>)[source]
Parameters:
  • lexiconTableName (str) --
  • lowercase_only (boolean) -- use only lowercase charngrams if True
  • tableName (str, optional) --
  • valueFunc (lambda, optional) -- Scales the features by the function given
  • isWeighted (boolean, optional) -- Is the lexicon weighted?
  • featValueFunc (lambda, optional) --
Returns:

tableName -- Name of created feature table: lex%cat_lexTable%corptable$correl_field

Return type:

str

addFleschKincaidTable(tableName=None, valueFunc=<function FeatureExtractor.<lambda>>, removeXML=True, removeURL=True)[source]

Creates feature tuples (correl_field, feature, values) table where features are flesch-kincaid scores.

Parameters:
  • tableName (str, optional) --
  • valueFunc (lambda, optional) -- Scales the features by the function given
  • removeXML (boolean, optional) --
  • removeURL (boolean, optional) --
Returns:

featureTableName -- Name of Flesch Kincaid table: feat$flkin$corptable$correl_field%transform

Return type:

str

addLDAFeatTable(ldaMessageTable, tableName=None, valueFunc=<function FeatureExtractor.<lambda>>)[source]

??? This assumes each row is a unique message, originally meant for twitter

Parameters:
  • ldaMessageTable --
  • tableName (str, optional) --
  • valueFunc (lambda, optional) -- Scales the features by the function given
Returns:

featureTableName -- Name of n-gram table: ?????

Return type:

str

addLexiconFeat(lexiconTableName, lowercase_only=True, tableName=None, valueFunc=<function FeatureExtractor.<lambda>>, isWeighted=False, featValueFunc=<function FeatureExtractor.<lambda>>)[source]

Creates a feature table given a 1gram feature table name, a lexicon table / database name

Parameters:
  • lexiconTableName (str) --
  • lowercase_only (boolean) -- use only lowercase charngrams if True
  • tableName (str, optional) --
  • valueFunc (lambda, optional) -- Scales the features by the function given
  • isWeighted (boolean, optional) -- Is the lexicon weighted?
  • featValueFunc (lambda, optional) --
Returns:

tableName -- Name of created feature table: feat%cat_lexTable%corptable$correl_field

Return type:

str
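
Example

A sketch (assumes fe is an initialized FeatureExtractor, a 1gram feature table already exists, and 'LIWC2015' names a lexicon table in the lexicon database; all names are illustrative):

>>> fe.addLexiconFeat('LIWC2015', lowercase_only=True, isWeighted=True)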

addNGramTable(n, lowercase_only=True, min_freq=1, tableName=None, valueFunc=<function FeatureExtractor.<lambda>>, metaFeatures=True)[source]

Creates feature tuples (correl_field, feature, values) table where features are ngrams

Parameters:
  • n (int) --
  • lowercase_only (boolean) -- use only lowercase charngrams if True
  • min_freq (int, optional) --
  • tableName (str, optional) --
  • valueFunc (lambda, optional) -- Scales the features by the function given
  • metaFeatures (boolean, optional) --
Returns:

featureTableName -- Name of n-gram table: feat%ngram%corptable%correl_field%transform

Return type:

str

addNGramTableFromTok(n, lowercase_only=True, min_freq=1, tableName=None, valueFunc=<function FeatureExtractor.<lambda>>, metaFeatures=True)[source]

???

Parameters:
  • n (int) --
  • lowercase_only (boolean) -- use only lowercase charngrams if True
  • min_freq (int, optional) --
  • tableName (str, optional) --
  • valueFunc (lambda, optional) -- Scales the features by the function given
  • metaFeatures (boolean, optional) --
Returns:

featureTableName -- Name of n-gram table: feat%nCgram%corptable%correl_field%transform

Return type:

str

addNGramTableGzipCsv(n, gzCsv, idxMsgField, idxIdField, idxCorrelField, lowercase_only=True, min_freq=1, tableName=None, valueFunc=<function FeatureExtractor.<lambda>>)[source]

??? This assumes each row is a unique message, originally meant for twitter

Parameters:
  • n (int) --
  • gzCsv --
  • idxMsgField --
  • idxIdField --
  • idxCorrelField --
  • lowercase_only (boolean) -- use only lowercase charngrams if True
  • min_freq (int, optional) --
  • tableName (str, optional) --
  • valueFunc (lambda, optional) -- Scales the features by the function given
Returns:

featureTableName -- Name of n-gram table: ?????

Return type:

str

addOutcomeFeatTable(outcomeGetter, tableName=None, valueFunc=<function FeatureExtractor.<lambda>>)[source]

Creates feature table of outcomes

Parameters:
  • outcomeGetter (OutcomeGetter object) --
  • tableName (str, optional) --
  • valueFunc (lambda, optional) -- Scales the features by the function given
Returns:

outcomeFeatTableName -- Name of created feature table: feat%out_outcomes%corptable$correl_field

Return type:

str

addPNamesTable(nameLex, languageLex, tableName=None, valueFunc=<function FeatureExtractor.<lambda>>, lastNameCat='LAST', firstNameCats=['FIRST-FEMALE', 'FIRST-MALE'])[source]

Creates feature tuples (correl_field, feature, values) table where features are People's Names

Parameters:
  • nameLex (str) --
  • languageLex --
  • tableName (?????, optional) --
  • valueFunc (lambda, optional) -- Scales the features by the function given
  • lastNameCat (str, optional) --
  • firstNameCats (list, optional) --
Returns:

(strings, tagged) -- ?????

Return type:

(list, list)

addPOSAndTimexDiffFeatTable(dateField='updated_time', tableName=None, serverPort=20202, valueFunc=<function FeatureExtractor.<lambda>>)[source]

Creates a feature table of differences between sent-time and the times of time expressions (mean, std), and a POS table version of the table

Parameters:
  • dateField (str, optional) --
  • tableName (str, optional) --
  • serverPort (str, optional) --
  • valueFunc (lambda, optional) -- Scales the features by the function given
Returns:

featureTableName -- Name of created feature table: feat%timex%corptable$correl_field

Return type:

str

addPhraseTable(tableName=None, valueFunc=<function FeatureExtractor.<lambda>>, maxTaggedPhraseChars=255)[source]

Creates feature tuples (correl_field, feature, values) table where features are parsed phrases

Parameters:
  • tableName (str, optional) --
  • valueFunc (lambda, optional) -- Scales the features by the function given
  • maxTaggedPhraseChars (int, optional) --
Returns:

phraseTableName -- Name of phrase table table: ?????

Return type:

str

addPosTable(tableName=None, valueFunc=<function FeatureExtractor.<lambda>>, keep_words=False)[source]

Creates feature tuples (correl_field, feature, values) table where features are parts of speech

Parameters:
  • tableName (str, optional) --
  • pos_table (str, optional) --
  • valueFunc (lambda, optional) -- Scales the features by the function given
  • featValueFunc (lambda, optional) --
Returns:

posFeatTableName -- Name of created feature table: feat%pos%corptable$correl_field or feat%1gram_pos%corptable$correl_field

Return type:

str

addTimexDiffFeatTable(dateField='updated_time', tableName=None, serverPort=20202)[source]

Creates a feature table of differences between sent-time and the times of time expressions (mean, std)

Parameters:
  • dateField (str, optional) --
  • tableName (str, optional) --
  • serverPort (str, optional) --
Returns:

featureTableName -- Name of created feature table: feat%timex%corptable$correl_field

Return type:

str

addTopicLexFromTopicFile(topicfile, newtablename, topiclexmethod, threshold)[source]

Creates a lexicon from a topic file

Parameters:
  • topicfile (str) -- Name of topic file to use to build the topic lexicon.
  • newtablename (str) -- New (topic) lexicon name.
  • topiclexmethod (str) -- must be one of: "csv_lik", "standard".
  • threshold (float) -- Default = float('-inf').
Returns:

newtablename -- New (topic) lexicon name.

Return type:

str
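
Example

A sketch (assumes fe is an initialized FeatureExtractor; the topic file and lexicon names are illustrative):

>>> fe.addTopicLexFromTopicFile('my_topics.csv', 'my_topic_lex', 'csv_lik', float('-inf'))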

addWNNoPosFeat(tableName=None, valueFunc=<function FeatureExtractor.<lambda>>, featValueFunc=<function FeatureExtractor.<lambda>>)[source]

Creates a wordnet concept feature table (based on words without pos tags) given a 1gram feature table name

Parameters:
  • tableName (str, optional) --
  • valueFunc (lambda, optional) -- Scales the features by the function given
  • featValueFunc (lambda, optional) --
Returns:

tableName -- Name of created feature table: feat%wn_nopos%corptable$correl_field

Return type:

str

addWNPosFeat(tableName=None, pos_table=None, valueFunc=<function FeatureExtractor.<lambda>>, featValueFunc=<function FeatureExtractor.<lambda>>)[source]

Creates a wordnet concept feature table (based on words with pos tags) given a POS feature table name

Parameters:
  • tableName (str, optional) --
  • pos_table (str, optional) --
  • valueFunc (lambda, optional) -- Scales the features by the function given
  • featValueFunc (lambda, optional) --
Returns:

tableName -- Name of created feature table: feat%wn_pos%corptable$correl_field

Return type:

str

constParseMatchRe = re.compile('^\\s*\\([A-Z]')
createFeatureTable(featureName, featureType='VARCHAR(64)', valueType='INTEGER', tableName=None, valueFunc=None, correlField=None, extension=None)[source]

Creates a feature table based on self data and feature name

Parameters:
  • featureName (str) -- Type of feature table (ex: 1gram, 1to3gram, cat_LIWC).
  • featureType (str, optional) -- MySQL type of feature.
  • valueType (str, optional) -- MySQL type of value.
  • tableName (str, optional) -- Name of table to be created. If not supplied the name will be automatically generated.
  • valueFunc (lambda, optional) -- Scales the features by the function given
  • correlField (str, optional) -- Correlation Field (AKA Group Field): The field which features are aggregated over
  • extension (str, optional) --
createLexFeatTable(lexiconTableName, lexKeys, isWeighted=False, tableName=None, valueFunc=None, correlField=None, extension=None)[source]

Creates a feature table of the form lex$featureType$messageTable$groupID$valueFunc$ext. This table is used when printing topic tagclouds and looks at the corpus the lexicon is applied to rather than relying on the posteriors from the model to dictate which words to display for a topic.

Parameters:
  • lexiconTableName (str) --
  • lexKeys (list) --
  • isWeighted (boolean) --
  • tableName (str) --
  • valueFunc (lambda, optional) -- Scales the features by the function given
  • correlField (str, optional) -- Correlation Field (AKA Group Field): The field which features are aggregated over
  • extension (str, optional) --
Returns:

tableName -- Name of created feature table: lex%cat_lexTable%corptable$correl_field

Return type:

str

findPhrasesInConstParse(parse)[source]

Traverses a constituent parse tree to pull out all phrases

Parameters:parse (?????, optional) --
Returns:(strings, tagged) -- ?????
Return type:(list, list)
getCorrelFieldType(correlField)[source]

Returns the type of correlField

Parameters:correlField (str) -- Correlation Field (AKA Group Field): The field which features are aggregated over
static getTimexDiff(timexXML, messageDT)[source]
Parameters:
  • timexXML --
  • messageDT --
Returns:

tid, -- ?????

Return type:

list, float or None

static noneToNull(data)[source]

Changes None values to the string 'Null'

static parseCoreNLPForPOSTags(parseInfo)[source]

returns a dictionary of pos tags and frequencies

Parameters:parseInfo (dict) --
Returns:posTags, numWords -- ?????
Return type:dict, int
static parseCoreNLPForTimexDiffs(parseInfo, messageDT)[source]

Returns a list of differences between the message datetime and timexes, and normalized NE tags for any timex

Parameters:
  • parseInfo (dict) --
  • messageDT (dict) --
Returns:

timexes, netags, numWords -- ?????

Return type:

list, set, int

static removeURL(text)[source]

Removes URLs from text

static removeXML(text)[source]

Removes XML from text

static shortenDots(text)[source]

Shortens long runs of dots ('...') in text

static timexAltValueParser(altValueStr, messageDT)[source]

resolve valueStr to a datetime object

Parameters:
  • altValueStr --
  • messageDT --
Returns:

workingDT -- ?????

Return type:

static timexValueParser(valueStr, messageDT)[source]

resolve valueStr to a datetime object

Parameters:
  • valueStr --
  • messageDT --
Returns:

valueStr -- ?????

Return type:

str

dlatk.featureGetter module

class dlatk.featureGetter.FeatureGetter(corpdb='dla_tutorial', corptable='msgs', correl_field='user_id', mysql_host='127.0.0.1', message_field='message', messageid_field='message_id', encoding='utf8mb4', use_unicode=True, lexicondb='permaLexicon', featureTable='feat$1gram$messages_en$user_id$16to16$0_01', featNames=['honor'], wordTable=None)[source]

Bases: dlatk.dlaWorker.DLAWorker

General class for reading from feature tables.

Parameters:
  • featureTable (str) -- Table containing feature information to work with
  • featNames (str) -- Limit outputs to the given set of features
Returns:

Return type:

FeatureGetter object

Examples

Initialize a FeatureGetter

>>> fg = FeatureGetter.fromFile('~/myInit.ini')

Get group norms as pandas dataframe

>>> fg_gns = fg.getGroupNormsAsDF()
countGroups(groupThresh=0, where='')[source]

returns the number of distinct groups (note that this runs on the corptable to be accurate)

disableFeatTableKeys()[source]

Disable keys: good before doing a lot of inserts

enableFeatTableKeys()[source]

Enables the keys, for use after inserting (and with keys disabled)

classmethod fromFile(initFile)[source]

Loads specified features from file

Parameters:initFile (str) -- Path to file
getContingencyArrayFeatNorm(where='')[source]

returns a list of lists: each row is a group_id and each col is a feature

getDistinctFeatures(where='')[source]

returns a distinct list of (feature) tuples given the name of the feature value field (either value, group_norm, or feat_norm)

getDistinctGroups(where='')[source]

returns the distinct groups (note that this runs on the corptable to be accurate)

getDistinctGroupsFromFeatTable(where='')[source]

Returns the distinct group ids that are in the feature table

getFeatAll(where='')[source]

returns a list of (group_id, feature, value, group_norm) tuples

getFeatAllSS(where='', featNorm=True)[source]

returns a list of (group_id, feature, value, group_norm) tuples

getFeatMeanData(where='')[source]

returns a dict of (feature => (mean, std, zero_feat_norm))

getFeatNorms(where='')[source]

returns a list of (group_id, feature, feat_norm) triples

getFeatNormsSS(where='')[source]

returns a server-side cursor pointing to (group_id, feature, feat_norm) triples

getFeatNormsWithZeros(groups=[], where='')[source]

returns a dict of (group_id => feature => feat_norm)

getFeatureCounts(groupFreqThresh=0, where='', SS=False, groups=set())[source]

Gets feature occurrence by group

Parameters:
  • groupFreqThresh (int) -- Minimum number of words a group must contain to be considered valid
  • where (str) -- Filter groups with sql-style call.
  • SS (boolean) -- Indicates the use of SSCursor (if True, use SSCursor to access MySQL)
  • groups (set) -- Set of group IDs
Returns:

a list of (feature, count) tuples, where count is the feature occurrence in each group
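
Example

A usage sketch (assumes fg is an initialized FeatureGetter; the 500-word threshold is illustrative):

>>> counts = fg.getFeatureCounts(groupFreqThresh=500)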

getFeatureCountsSS(groupFreqThresh=0, where='')[source]

Gets feature occurrence by group.

Parameters:
  • groupFreqThresh (int) -- Minimum number of words a group must contain to be considered valid
  • where (str) -- Filter groups with sql-style call.
Returns:

a list of (feature, count) tuples, where count is the feature occurrence in each group

getFeatureValueSums(where='')[source]

returns a list of (feature, count) tuples, where count is the number of groups with the feature

getFeatureZeros(where='')[source]

returns a distinct list of (feature) tuples given the name of the feature value field (either value, group_norm, or feat_norm)

getGroupAndFeatureValues(featName=None, where='')[source]

returns a list of (group_id, feature_value) tuples

getGroupNorms(where='')[source]

returns a list of (group_id, feature, group_norm) triples

getGroupNormsAsDF(where='')[source]

returns a dataframe of (group_id, feature, group_norm)

getGroupNormsForFeat(feat, where='', warnMsg=False)[source]

returns a list of (group_id, feature, group_norm) triples

getGroupNormsForFeats(feats, where='', warnMsg=False)[source]

returns a list of (group_id, feature, group_norm) triples

getGroupNormsSparseFeatsFirst(groups=[], where='')[source]

returns a dict of (feature => group_id => group_norm)

getGroupNormsWithZeros(groups=[], where='')[source]

returns a dict of (group_id => feature => group_norm)

getGroupNormsWithZerosAsDF(groups=[], where='', pivot=False, sparse=False)[source]

returns a dict of (group_id => feature => group_norm)

getGroupNormsWithZerosFeatsFirst(groups=[], where='', blacklist=None)[source]

returns a dict of (feature => group_id => group_norm)

getGroupsAndFeats(where='')[source]
getSumValue(where='')[source]

returns the sum of all values

getSumValuesByFeat(where='')[source]
getSumValuesByGroup(where='')[source]
getTopMessages(lex_tbl, outputfile, lim_num, whitelist)[source]
getValues(where='')[source]

returns a list of (group_id, feature, value) triples

getValuesAndGroupNorms(where='')[source]

returns a list of (group_id, feature, value, group_norm) triples

getValuesAndGroupNormsAsDF(where='')[source]

returns a dataframe of (group_id, feature, value, group_norm)

getValuesAndGroupNormsForFeat(feat, where='', warnMsg=False)[source]

returns a list of (group_id, feature, group_norm) triples

getValuesAndGroupNormsForFeats(feats, where='', warnMsg=False)[source]

returns a list of (group_id, feature, group_norm) triples

getValuesAsDF(where='')[source]

returns a dataframe of (group_id, feature, value)

optimizeFeatTable()[source]

Optimizes the table -- good after a lot of deletes

static pairedTTest(y1, y2)[source]
printJoinedFeatureLines(filename, delimeter=' ')[source]

Prints the feature table like a message table, in a format Mallet can use

ttestWithOtherFG(other, maskTable=None, groupFreqThresh=0)[source]

Performs PAIRED ttest on differences between group norms for 2 tables, within features

yieldGroupNormsWithZerosByFeat(groups=[], where='', values=False, feats=[])[source]

yields (feat, groupnorms, number of features)

yieldGroupNormsWithZerosByGroup(groups=[], where='', allFeats=None)[source]

returns a dict of (group_id, feature_values)

yieldValuesSparseByGroup(groups=[], where='', allFeats=None)[source]

returns a dict of (group_id, feature_values)

yieldValuesWithZerosByGroup(groups=[], where='', allFeats=None)[source]

returns a dict of (group_id, feature_values)

dlatk.featureRefiner module

class dlatk.featureRefiner.FeatureRefiner(corpdb='dla_tutorial', corptable='msgs', correl_field='user_id', mysql_host='127.0.0.1', message_field='message', messageid_field='message_id', encoding='utf8mb4', use_unicode=True, lexicondb='permaLexicon', featureTable='feat$1gram$messages_en$user_id$16to16$0_01', featNames=['honor'], wordTable=None)[source]

Bases: dlatk.featureGetter.FeatureGetter

Deals with the refinement of feature information already in a table (outputs to new table)

Returns:
Return type:FeatureRefiner object
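
Examples

Initialize a FeatureRefiner from an init file (fromFile is inherited from FeatureGetter; the path is illustrative):

>>> fr = FeatureRefiner.fromFile('~/myInit.ini')

Write a new table without rare features (a sketch; the thresholds are illustrative):

>>> fr.createTableWithRemovedFeats(p=0.05, groupFreqThresh=500)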
addFeatNorms(ReCompute=False)[source]

Adds the mean normalization by feature (z-score) for each feature

addFeatTableMeans(field='group_norm', groupNorms=None)[source]

Add to the feature mean table: mean, standard deviation, and zero_mean for the current feature table

createAggregateFeatTableByGroup(valueFunc=<function FeatureRefiner.<lambda>>)[source]

combines feature tables, and groups by the given group field

createCollocRefinedFeatTable(threshold=3.0, featNormTable=False)[source]
createCombinedFeatureTable(featureName=None, featureTables=[], tableName=None)[source]

Create a new feature table by combining others

createCorrelRefinedFeatTable(correls, pValue=0.05, featNormTable=True)[source]
createFeatTableByDistinctOutcomes(outcomeGetter, controlValuesToAvg=[], outcomeRestriction=None, nameSuffix=None)[source]

Creates a new feature table, by combining values based on an outcome, then applies an averaging based on controls

createFeatureTable(featureName, featureType='VARCHAR(64)', valueType='INTEGER', tableName=None, valueFunc=None, correlField=None, extension=None)[source]

Creates a feature table based on self data and feature name

createNewTableWithGivenFeats(toKeep, label, featNorm=False)[source]

Creates a new table only containing the given features

createTableWithBinnedFeats(num_bins, group_id_range, valueFunc=<function FeatureRefiner.<lambda>>, gender=None, genderattack=False, reporting_percent=0.04, outcomeTable='masterstats_r500', skip_binning=False)[source]
createTableWithRemovedFeats(p, minimumFeatSum=0, groupFreqThresh=0, setGFTWarning=False)[source]

creates a new table with features that appear in more than p*|correl_field| rows, only considering groups above groupfreqthresh

createTfIdfTable(ngram_table)[source]

Creates a new feature table where group_norm = tf-idf (term frequency-inverse document frequency). Written by Phil.

Parameters:ngram_table -- table containing words/ngrams, collocs, etc.

findMeans(field='group_norm', addZeros=True, groupNorms=None)[source]

Finds feature means from group norms

getCollocsWithPMI()[source]

Calculates PMI for each ngram longer than one token, reading from self.featureTable. pmi_threshold_val is pmi/(num_tokens-1), which is what --feat_colloc_filter is based on.

Returns:a dict of colloc => [pmi, num_tokens, pmi_threshold_val]
getCorrelFieldType(correlField)[source]
makeTopicLabelMap(topiclexicon, numtopicwords=5, is_weighted_lexicon=False)[source]
static pmi(jointFreq, indFreqs, allFreq, words=None)[source]
static salience(jointFreq, indFreqs, allFreq)[source]

dlatk.featureStar module

class dlatk.featureStar.FeatureStar(corpdb='dla_tutorial', corptable='msgs', correl_field='user_id', mysql_host='127.0.0.1', message_field='message', messageid_field='message_id', encoding='utf8mb4', use_unicode=True, lexicondb='permaLexicon', featureTable='feat$1gram$messages_en$user_id$16to16$0_01', featNames=['honor'], date_field='updated_time', outcome_table='masterstats_r500', outcome_value_fields=['demog_age'], outcome_controls=[], outcome_interaction=[], group_freq_thresh=None, featureMappingTable='', featureMappingLex='', output_name='', wordTable=None, model='ridgecv', feature_selection='', feature_selection_string='', init=None)[source]

Bases: object

Generic class for importing an instance of each class in DLATK

Parameters:fe (FeatureExtractor object) --

fg : FeatureGetter object

fr : FeatureRefiner object

og : OutcomeGetter object

oa : OutcomeAnalyzer object

cp : ClassifyPredictor object

rp : RegressionPredictor object

allFW : dict
Dictionary containing all of the above attributes keyed on object name

Examples

Initialize a FeatureStar

>>> fs = FeatureStar.fromFile('~/myInit.ini')

Create a pandas dataframe with both feature and outcome information

>>> df = fs.combineDFs()
combineDFs(fg=None, og=None, fillNA=True)[source]

Method for combining a feature table with an outcome table in a single dataframe

Parameters:
  • fg (FeatureGetter object) --
  • og (OutcomeGetter object) --
  • fillNA (boolean) -- option to fill missing or NA values in dataframe, fill value = 0
Returns:

Dataframe indexed on group_id (correl_field)

Return type:

pandas dataframe

classmethod fromFile(initFile, initList=None)[source]

Loads specified features from file

Parameters:
  • initFile (str) -- Path to file
  • initList (list) -- List of classes to load

dlatk.featureWorker module

dlatk.fwConstants module

dlatk.mediation module

Mediation Analysis

Interfaces with DLATK and Statsmodels

class dlatk.mediation.MediationAnalysis(fg, og, path_starts, mediators, outcomes, controls, method='parametric', boot_number=1000, sig_level=0.05, style='baron')[source]

Bases: object

Interface between Mediation class in Statsmodels and DLATK with the addition of standard Baron and Kenny approach.

Parameters:
  • outcomeGetter (OutcomeGetter object) --
  • featureGetter (FeatureGetter object) --
  • pathStartNames (list) --
  • mediatorNames (list) --
  • outcomeNames (list) --
  • controlNames (list) --
  • mediation_method (str) -- "parametric" or "bootstrap"
  • boot_number (int) -- number of bootstrap iterations
  • sig_level (float) -- significance level for reporting results in summary
output

dict

output_sobel

dict

output_p

dict

baron_and_kenny

boolean -- if True runs Baron and Kenny method

imai_and_keele

boolean -- if True runs Imai, Keele, and Tingley method
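
Examples

A minimal sketch (assumes fg is an initialized FeatureGetter and og an initialized OutcomeGetter; the outcome and path-start names are illustrative, and mediators is left empty here on the assumption that, with the default switch, mediators are drawn from the feature table):

>>> ma = MediationAnalysis(fg, og, path_starts=['age'], mediators=[], outcomes=['life_satisfaction'], controls=[], method='parametric')
>>> ma.mediate(switch='default')
>>> ma.print_summary('mediation_results')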

get_data(switch, outcome_field, location, features)[source]

Get data from outcomeGetter / featureGetter

mediate(switch='default', p_correction_method='BH', zscoreRegression=True, logisticReg=False)[source]

Runs the mediation analysis.

Parameters:
  • switch (str) -- controls source (FeatureGetter or OutcomeGetter) of variables (path_starts, mediators, outcomes, controls)
  • p_correction_method (str) -- Name of p correction method
  • zscoreRegression (boolean) -- True if data is z-scored
  • logisticReg (boolean) -- True if running logistic regression

Notes

Data sources according to 'switch':

"default": FeatureGetter: mediators; OutcomeGetter: path_starts and outcomes
"feat_as_path_start": FeatureGetter: path_starts; OutcomeGetter: mediators, outcomes and controls
"feat_as_outcome": FeatureGetter: outcomes; OutcomeGetter: path_starts, mediators and controls
"feat_as_control": FeatureGetter: controls; OutcomeGetter: path_starts, mediators and outcomes
"no_features": OutcomeGetter: path_starts, mediators, outcomes and controls
prep_data(path_start, mediator, outcome, controlDict=None, controlNames=None, zscoreRegression=None)[source]

Take dictionary data and return a Pandas DataFrame indexed by group_id. Column names are 'path_start', 'mediator' and 'outcome'

print_csv(output_name='')[source]
print_summary(output_name='')[source]

dlatk.occurrenceSelection module

class dlatk.occurrenceSelection.OccurrenceThreshold(threshold=1.0)[source]

Bases: sklearn.base.BaseEstimator, sklearn.feature_selection.base.SelectorMixin

Feature selector that removes features with low occurrence counts. This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.

Parameters:threshold (float, optional) -- Features whose training-set occurrence count (number of non-zero observations) does not exceed this threshold will be removed.
`counts_`

array, shape (n_features,)

Number of non-zero observations of individual features.

Examples

The following dataset has integer features, one of which is zero in every sample. That feature is removed with the default setting for threshold:

>>> X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]
>>> selector = OccurrenceThreshold()
>>> selector.fit_transform(X)
array([[2, 0, 3],
       [1, 4, 3],
       [1, 1, 3]])
fit(X, y=None)[source]

Learn the occurrences; good for frequency / count data.

Parameters:
  • X ({array-like, sparse matrix}, shape (n_samples, n_features)) -- Sample vectors from which to compute occurrences.
  • y -- Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
Returns:
Return type:self

dlatk.outcomeAnalyzer module

class dlatk.outcomeAnalyzer.OutcomeAnalyzer(corpdb='dla_tutorial', corptable='msgs', correl_field='user_id', mysql_host='127.0.0.1', message_field='message', messageid_field='message_id', encoding='utf8mb4', use_unicode=True, lexicondb='permaLexicon', outcome_table='masterstats_r500', outcome_value_fields=['demog_age'], outcome_controls=[], outcome_interaction=[], group_freq_thresh=None, featureMappingTable='', featureMappingLex='', output_name='', wordTable=None)[source]

Bases: dlatk.outcomeGetter.OutcomeGetter

Deals with Outcome Tables and provides various functionalities for how outcomes are processed/analyzed.

output_name

str

Returns:
Return type:OutcomeAnalyzer object.
IDP_correlate(featGetter, outcomeWithOutcome=False, includeFreqs=False, useValuesInsteadOfGroupNorm=False, blacklist=None, whitelist=None)[source]

Informative Dirichlet prior, based on http://pan.oxfordjournals.org/content/16/4/372.full Finds the correlations between features and outcomes

Parameters:
  • featGetter (featureGetter object) --
  • outcomeWithOutcome (boolean, optional) -- Adds the outcomes themselves to the list of variables to correlate with the outcomes if True
  • includeFreqs (boolean, optional) -- Include the frequency of each feature if True
  • useValuesInsteadOfGroupNorm (boolean, optional) -- use value field instead of group_norm
  • blacklist (list, optional) -- list of feature table fields (str) to ignore
  • whitelist (list, optional) -- list of feature table fields (str) to include
Returns:

out

Return type:

dict

IDPcomparison(featGetter, sample1, sample2, blacklist=None, whitelist=None)[source]

Finds the correlations between features and outcomes

Parameters:
  • featGetter (featureGetter object) --
  • sample1 --
  • sample2 --
  • blacklist (list, optional) -- list of feature table fields (str) to ignore
  • whitelist (list, optional) -- list of feature table fields (str) to include
Returns:

out

Return type:

dict

aucWithFeatures(featGetter, p_correction_method='BH', interaction=None, bootstrapP=None, blacklist=None, whitelist=None, includeFreqs=False, outcomeWithOutcome=False, zscoreRegression=True, outputInteraction=False, groupsWhere='')[source]

Finds the AUC between features and dichotomous outcomes

Parameters:
  • featGetter (featureGetter object) --
  • p_correction_method (str, optional) -- Specified method for p-value correction
  • interaction (list, optional) -- list of interaction terms
  • bootstrapP (list, optional) --
  • blacklist (list, optional) -- list of feature table fields (str) to ignore
  • whitelist (list, optional) -- list of feature table fields (str) to include
  • includeFreqs (boolean, optional) -- Include the frequency of each feature if True
  • outcomeWithOutcome (boolean, optional) -- Adds the outcomes themselves to the list of variables to correlate with the outcomes if True
  • zscoreRegression (boolean, optional) -- standardize both variables if True
  • outputInteraction (boolean, optional) -- True - append output interactions to results
  • groupsWhere (str, optional) -- string specified with the --where flag containing a sql statement for filtering

Returns:aucs -- dict of outcome=>feature=>(auc, p, numGroups, ci, featFreqs) where ci = nan
Return type:dict
barPlot(correls, outputFile=None, featSet=set(), featsPerOutcome=5)[source]
static buildBatchPlotFile(corpdb, featTable, topicList='')[source]

Builds a file to be used for batch plotting

Parameters:
  • corpdb (str) -- database name
  • featTable (str) -- feature table name
  • topicList (list, optional) -- list of strings containing topics
Returns:

outputfile -- output file name

Return type:

str

buildTopicLabelDict(topic_lex, num_words=3)[source]

Build a topic label dictionary

Parameters:
  • topic_lex (str) -- name lex table
  • num_words (int, optional) -- number of words
Returns:

topicLabels -- list of labels

Return type:

list

ci_idx = 3
correlMatrix(correlMatrix, outputFile=None, outputFormat='html', sort=False, pValue=True, nValue=True, cInt=True, freq=False, paramString=None)[source]
correlateControlCombosWithFeatures(featGetter, spearman=False, p_correction_method='BH', blacklist=None, whitelist=None, includeFreqs=False, outcomeWithOutcome=False, zscoreRegression=True)[source]

Finds the correlations between features and all combinations of outcomes

Parameters:
  • featGetter (featureGetter object) --
  • spearman (boolean, optional) --
  • p_correction_method (str, optional) -- Specified method for p-value correction
  • blacklist (list, optional) -- list of feature table fields (str) to ignore
  • whitelist (list, optional) -- list of feature table fields (str) to include
  • includeFreqs (boolean, optional) -- Include the frequency of each feature if True
  • outcomeWithOutcome (boolean, optional) -- Adds the outcomes themselves to the list of variables to correlate with the outcomes if True
  • zscoreRegression (boolean, optional) -- standardize both variables if True
Returns:

comboCorrels -- dict of outcome=>feature=>(R, p)

Return type:

dict

correlateWithFeatures(featGetter, spearman=False, p_correction_method='BH', interaction=None, blacklist=None, whitelist=None, includeFreqs=False, outcomeWithOutcome=False, outcomeWithOutcomeOnly=False, zscoreRegression=True, logisticReg=False, outputInteraction=False, groupsWhere='')[source]

Finds the correlations between features and outcomes

Parameters:
  • featGetter (featureGetter object) --
  • spearman (boolean, optional) --
  • p_correction_method (str, optional) -- Specified method for p-value correction
  • interaction (list, optional) -- list of interaction terms
  • blacklist (list, optional) -- list of feature table fields (str) to ignore
  • whitelist (list, optional) -- list of feature table fields (str) to include
  • includeFreqs (boolean, optional) -- Include the frequency of each feature if True
  • outcomeWithOutcome (boolean, optional) -- Adds the outcomes themselves to the list of variables to correlate with the outcomes if True
  • outcomeWithOutcomeOnly (boolean, optional) -- True - only correlate outcomes with outcomes; False - correlate features with outcomes
  • zscoreRegression (boolean, optional) -- standardize both variables if True
  • logisticReg (boolean, optional) -- True - use logistic regression, False - use default linear
  • outputInteraction (boolean, optional) -- True - append output interactions to results
  • groupsWhere (str, optional) -- string specified with the --where flag containing a sql statement for filtering
Returns:

correls -- dict of outcome=>feature=>(R, p, numGroups, CI, featFreqs)

Return type:

dict
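
Example

A sketch of a typical correlate-then-visualize flow (assumes oa is an initialized OutcomeAnalyzer and fg an initialized FeatureGetter; the output file name is illustrative):

>>> correls = oa.correlateWithFeatures(fg, spearman=False, p_correction_method='BH', includeFreqs=True)
>>> oa.printTagCloudData(correls, maxP=0.05, outputFile='tagcloud_data.txt')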

correls_length = 5
static duplicateFilter(rList, wordFreqs, maxToCheck=100)[source]

Filters out duplicate words

Parameters:
  • rList (list) -- list of (word, correl) tuples
  • wordFreqs (dict) -- word - word frequency pairs
  • maxToCheck (int, optional) -- will stop checking after this many in order to speed up operation
Returns:

newList -- filtered version of rList

Return type:

list

static freqToColor(freq, maxFreq=1000, resolution=64, colorScheme='multi')[source]

Alter color scheme of plot based on the word frequencies

Parameters:
  • freq (int) -- word frequency
  • maxFreq (int, optional) -- maximum frequency threshold
  • resolution (int, optional) -- pixels of resolution
  • colorScheme (str, optional) -- specifies color scheme of plot
Returns:

htmlcode -- string of html code specifying color scheme

Return type:

str

freq_idx = 4
classmethod fromFile(initFile)[source]

Load specified features from INI file

Parameters:initFile (str) -- path to file

Example

Creates an OutcomeAnalyzer object with features specified in the initFile

>>> oa = OutcomeAnalyzer.fromFile('~/myInit.ini')
generateTagCloudImage(correls, maxP=0.05, paramString=None, colorScheme='multi', cleanCloud=False)[source]

Generates a tag cloud image from correls

Parameters:
  • correls (dict) -- outcome: feature dict
  • maxP (float, optional) -- p-value max
  • paramString (str, optional) -- string to be printed to screen
  • colorScheme (str, optional) -- argument for color scheme
static generateTagCloudImageFromTuples(rList, maxWords)[source]

Generates a tag cloud from a list of tuples

Parameters:
  • rList (list) --
  • maxWords (int) --
getGGplotCommands(outcome, file_in, file_out, featLabels=None, research=False)[source]
Parameters:
  • outcome (str) -- name of the outcome to plot
  • file_in (str) -- input file name
  • file_out (str) -- output file name (plot is stored here)
  • featLabels (list, optional) -- list of feature labels
  • research (boolean, optional) -- False - detailed color version of the plot; True - standard plot
Returns:

commands -- Command string used to generate a plot

Return type:

str

getLabelmapFromLabelmapTable(labelmap_table='', lda_id=None)[source]

Parses a labelmap table and returns a python dictionary: {feat:feat_label}

Parameters:
  • labelmap_table (str, optional) -- name of label map table
  • lda_id (str, optional) -- lda model id
Returns:

feat_to_label -- {feat:feat_label}

Return type:

dict

getLabelmapFromLexicon(lexicon_table)[source]

Returns a label map based on a lexicon. labelmap is {feat:concatenated_categories}

Parameters:lexicon_table (str) -- Lexicon table name
Returns:feat_to_label -- a label map based on a lexicon. labelmap is {feat:concatenated_categories}
Return type:dict
getTopicFeatLabel(topicLexicon, feat, numTopicTerms=8)[source]

Get the terms with the highest weight for a given topicLexicon

Parameters:
  • topicLexicon (str) -- table name containing the topic lexicon
  • feat (str) -- the topic label
  • numTopicTerms (int, optional) -- number of topic terms
Returns:

label -- list of terms with highest weight from the topic lexicon for the given topic

Return type:

list

static getTopicKeepList(rs, topicWords, filterThresh=0.25)[source]

Gets a list of topics to keep during duplication filtering

Parameters:
  • rs (list) --
  • topicWords (list) -- list of words
  • filterThresh (float, optional) -- Filter threshold
Returns:

keptTopics -- list of topics to keep

Return type:

list

getTopicWords(topicLex, maxWords=15)[source]

Gets the most prevalent words in a topic

Parameters:
  • topicLex (str) -- lexicon table
  • maxWords (int, optional) -- max number of words to return
Returns:

topicWords -- {term : weight}

Return type:

dict

loessPlotFeaturesByOutcome(featGetter, spearman=False, p_correction_method='BH', blacklist=None, whitelist=None, includeFreqs=False, zscoreRegression=True, outputdir='/data/ml/fb20', outputname='loess.jpg', topicLexicon=None, numTopicTerms=8, outputOrder=[])[source]

Finds the correlations between features and outcomes

Parameters:
  • featGetter (featureGetter object) --
  • spearman (boolean, optional) --
  • p_correction_method (str, optional) -- Specified method for p-value correction
  • blacklist (list, optional) -- list of feature table fields (str) to ignore
  • whitelist (list, optional) -- list of feature table fields (str) to include
  • includeFreqs (boolean, optional) -- Include the frequency of each feature if True
  • zscoreRegression (boolean, optional) -- standardize both variables if True
  • outputdir (str, optional) -- directory for results to be written
  • outputname (str, optional) -- name of output file
  • topicLexicon (list, optional) --
  • numTopicTerms (int, optional) --
  • outputOrder (list, optional) --
Returns:

Return type:

no return value, creates loess plot and saves it to a file

mapFeatureName(feat, mapping)[source]
Parameters:
  • feat (int) -- feature id
  • mapping (dict) -- mapping dict for feature labels
Returns:

newFeat -- new feature name

Return type:

str

multRegressionWithFeatures(featGetter, spearman=False, p_correction_method='BH', blacklist=None, whitelist=None, includeFreqs=False, outcomeWithOutcome=False, zscoreRegression=True, interactions=False)[source]

Finds the multiple regression coefficient between outcomes and features

Parameters:
  • featGetter (featureGetter object) --
  • spearman (boolean, optional) --
  • " str (p_correction_method) -- Specified method for p-value correction
  • blacklist (list, optional) -- list of feature table fields (str) to ignore
  • whitelist (list, optional) -- list of feature table fields (str) to include
  • includeFreqs (boolean, optional) -- Include the frequency of each feature if True
  • outcomeWithOutcome (boolean, optional) -- Adds the outcomes themselves to the list of variables to correlate with the outcomes if True
  • zscoreRegression (boolean, optional) -- standardize both variables if True
  • interactions (boolean, optional) --
Returns:

coeffs -- dict of feature=>outcome=>(R, p)

Return type:

dict

n_idx = 2
static outputComboCorrelMatrixCSV(comboCorrelMatrix, outputstream=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>, paramString=None)[source]

Prints correl matrices for all combinations of features; always prints p-values, n, and freq

static outputCorrelMatrixCSV(correlMatrix, pValue=True, nValue=True, cInt=True, freq=True, outputFilePtr=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>)[source]
static outputCorrelMatrixHTML(correlMatrix, pValue=True, nValue=True, cInt=True, freq=False, outputFilePtr=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>)[source]
static outputSortedCorrelCSV(correlMatrix, pValue=True, nValue=True, cInt=True, freq=False, outputFilePtr=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>, topN=50)[source]

Outputs a sorted correlation matrix (note correlmatrix is reversed from non-sorted)

static outputSortedCorrelHTML(correlMatrix, pValue=True, nValue=True, cInt=True, freq=False, outputFilePtr=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>)[source]
p_idx = 1
static plotFlexibinnedTable(corpdb, flexiTable, featureFile, feat_to_label=None, preserveBinTable=False)[source]

Plots a flexi binned table

Parameters:
  • corpdb (str) -- database name
  • flexiTable (str) -- table name
  • featureFile (str) -- feature file name
  • feat_to_label (list, optional) -- feature-label tuples
  • preserveBinTable (boolean, optional) -- True - preserve bin table, else drop table
static plotWordcloudFromTuples(rList, maxWords, outputFile, wordcloud)[source]
Parameters:
  • rList (list) -- list of (word, correl) tuples
  • maxWords (int) -- max number of words
  • outputFile (str) -- file name
  • wordcloud --
printBinnedGroupsAndOutcomesToCSV(featGetter, outputfile, where='', freqs=False)[source]

Print csv file with binned groups and output

Parameters:
  • featGetter (featureGetter object) --
  • outputfile (str) -- File name for the outcome to be printed to
  • where (str, optional) -- Additional options based on a where statement, similar to mysql select * from table where ....
  • freqs (boolean, optional) -- if True, feature values are returned; if False, group norms
Returns:

Return type:

No return value - Raises an Error

printGroupsAndOutcomesToCSV(featGetter, outputfile, where='', freqs=False)[source]

Prints sas-style csv file output

Parameters:
  • featGetter (featureGetter object) --
  • outputfile (str) -- File name for the outcome to be printed to
  • where (str, optional) -- Additional options based on a where statement, similar to mysql select * from table where ....
  • freqs (boolean, optional) -- if True, feature values are returned; if False, group norms
Returns:

Return type:

No return value - Writes to a CSV file containing groups and outcomes

Example

Print groups and outcomes to csv where fg is a feature getter object and args.printcsv is a user defined filename supplied to fwInterface.

>>> oa.printGroupsAndOutcomesToCSV(fg, args.printcsv)
printSignificantCoeffs(coeffs, outputFile=None, outputFormat='tsv', sort=False, pValue=True, nValue=False, maxP=0.05, paramString=None)[source]
printTagCloudData(correls, maxP=0.05, outputFile='', paramString=None, maxWords=100, duplicateFilter=False, colorScheme='multi', cleanCloud=False)[source]

Prints data that can be inputted into tag cloud software

Parameters:
  • correls (dict) -- outcome: feature dict
  • maxP (float, optional) -- p-value max
  • outputFile (str, optional) -- output file name
  • paramString (str, optional) -- string to be printed to screen
  • maxWords (int, optional) -- max number of words
  • duplicateFilter (boolean, optional) --
  • colorScheme (str, optional) -- argument for color scheme
static printTagCloudFromTuples(rList, maxWords, rankOrderFreq=True, rankOrderR=False, colorScheme='multi', use_unicode=True, cleanCloud=False, censor_dict={'bitches': 'b**ches', 'hoe': 'h**', 'hoes': 'h**s', 'pussy': 'p**sy', 'niggas': 'n**gas', 'bullshit': 'bulls**t', 'dick': 'd**k', 'fuckn': 'f**kn', 'fuck': 'f**k', 'niggaz': 'n**gaz', 'shit': 's**t', 'dickhead': 'd**khead', 'fucking': 'f**king', 'fucked': 'f**ked', "nigga's": "n**ga's", 'nigga': 'n**ga', 'fuckin': 'f**kin', 'bitch': 'b**ch', 'cock': 'c**k', 'whore': 'w**re', 'motherfucker': 'motherf**ker'})[source]

Prints a tag cloud from a set of tuples

Parameters:
  • rlist (list) -- list of (word, correl) tuples
  • maxWords (int) -- maximum number of words for the cloud
  • rankOrderFreq (boolean, optional) --
  • rankOrderR (boolean, optional) --
  • colorScheme (str, optional) -- color scheme of plot
  • use_unicode (boolean, optional) -- When true include unicode in clouds
static printTopicListTagCloudFromTuples(rs, topicWords, maxWords=25, maxTopics=40, duplicateFilter=False, wordFreqs=None, filterThresh=0.25, colorScheme='multi', use_unicode=True, cleanCloud=False)[source]

Prints a topic tag cloud

Parameters:
  • rs (list) --
  • topicWords (list) -- list of words
  • maxWords (int, optional) -- max number of words
  • maxTopics (int, optional) -- max number of topics
  • duplicateFilter (boolean, optional) -- use duplicate filter if True
  • wordFreqs (list, optional) -- list of word frequencies
  • filterThresh (float, optional) -- Filter threshold
  • colorScheme (str, optional) -- color scheme option for cloud
  • use_unicode (boolean, optional) -- keep unicode characters if True
Returns:

topicWords -- list of words

Return type:

list

printTopicTagCloudData(correls, topicLex, maxP=0.05, paramString=None, maxWords=15, maxTopics=100, duplicateFilter=False, colorScheme='multi', outputFile='', useFeatTableFeats=False, cleanCloud=False)[source]

Prints Topic Tag Cloud data to text file

Parameters:
  • correls (dict) -- outcome:feature
  • topicLex (str) -- table name
  • maxP (float, optional) -- p-value max
  • paramString (str, optional) -- string to be printed to screen
  • maxWords (int, optional) -- max number of words
  • maxTopics (int, optional) -- max number of topics
  • duplicateFilter (boolean, optional) -- use duplicate filter if True
  • colorScheme (str, optional) -- color scheme option for cloud
  • outputFile (str, optional) -- output file name
  • useFeatTableFeats (boolean, optional) --
  • cleanCloud (boolean, optional) -- if True, replace expletives with ** in the center of the word, e.g. f**k
r_idx = 0
tableToDenseCsv(row_column, col_column, value_column, output_csv_filename=None, compress_csv=True)[source]

Converts a long MySQL table (e.g. a feature table) to a dense contingency matrix of size N by M, where N is the number of distinct rows and M is the number of distinct columns. Efficient (uses lookups instead of a single iteration through all entries of the contingency matrix); could be slightly more efficient if it used the dbCursor pointer.

Parameters:
  • row_column (str) -- table column that will populate the rows of the contingency csv
  • col_column (str) -- table column that will populate the columns of the contingency csv
  • value_column (str) -- table column that will populate the values at the intersection of the rows and columns of the contingency csv
  • output_csv_filename (str) -- the name of the output file; if empty, a name is created based on the values provided
  • compress_csv (boolean) -- whether to gzip the csv
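
Conceptually, the long-to-dense conversion resembles a pandas pivot; the sketch below illustrates the idea only (not the MySQL-backed implementation), and the column names are hypothetical:

import pandas as pd

# Hypothetical long-format rows: (group_id, feat, group_norm)
long_df = pd.DataFrame([("user1", "happy", 0.02),
                        ("user1", "sad",   0.01),
                        ("user2", "happy", 0.03)],
                       columns=["group_id", "feat", "group_norm"])

# Pivot to a dense N-by-M matrix: rows = distinct group_ids, columns = distinct feats
dense = long_df.pivot_table(index="group_id", columns="feat",
                            values="group_norm", fill_value=0)
dense.to_csv("feat_dense.csv.gz", compression="gzip")  # analogous to compress_csv=True
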
topicDupeFilterCorrels(correls, topicLex, maxWords=15, filterThresh=0.25)[source]

Filters out topics that have many similar words to those with a stronger correlation

Parameters:
  • correls (dict) -- outcome:feature
  • topicLex (str) -- table name
  • maxWords (int, optional) -- max number of words
  • filterThresh (float, optional) -- filter threshold
Returns:

newCorrels -- filtered dict (same structure as correls)

Return type:

dict

wildcardMatch(string, list1)[source]
Parameters:
  • string (str) --
  • list1 (list) -- list of strings
Returns:

True if string is in list1, False if there is no match

Return type:

boolean
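
A short illustration of wildcard matching against a list of strings using Python's fnmatch; this is a sketch of the idea, not necessarily how wildcardMatch is implemented:

from fnmatch import fnmatch

def wildcard_match(string, patterns):
    # True if string matches any pattern in patterns, e.g. 'happ*' matches 'happiness'
    return any(fnmatch(string, p) for p in patterns)

wildcard_match("happiness", ["sad", "happ*"])  # -> True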

writeSignificantCoeffs4dVis(coeffs, outputFile, outputFormat='tsv', sort=False, pValue=True, nValue=False, maxP=0.05, paramString=None, interactions=False)[source]
yieldDataForOneFeatAtATime(featGetter, blacklist=None, whitelist=None, outcomeWithOutcome=False, includeFreqs=False, groupsWhere='', outcomeWithOutcomeOnly=False)[source]

Finds the correlations between features and outcomes

Parameters:
  • featGetter (featureGetter object) --
  • blacklist (list, optional) -- list of feature table fields (str) to ignore
  • whitelist (list, optional) -- list of feature table fields (str) to include
  • outcomeWithOutcome (boolean, optional) -- Adds the outcomes themselves to the list of variables to correlate with the outcomes if True
  • includeFreqs (boolean, optional) -- Include the frequency of each feature if True
  • groupsWhere (str, optional) -- condition restricting groups, as specified with the --where flag
  • outcomeWithOutcomeOnly (boolean, optional) -- True - only correlate outcomes with outcomes False - correlate features with outcomes
Yields:
  • groups -- contains all groups looked at, i.e., all users
  • allOutcomes -- contains a dictionary of the outcomes and their values for each group in groups
  • dataDict -- contains the group_norms (i.e. what we're z-scoring) for every feature
  • controls -- dict of controls and counts
  • numOutcomes
  • featFreqs
zScoreGroup(featGetter, outcomeWithOutcome=False, includeFreqs=False, blacklist=None, whitelist=None)[source]

Calculates group zScore

Parameters:
  • featGetter (featureGetter object) --
  • outcomeWithOutcome (boolean, optional) -- Adds the outcomes themselves to the list of variables to correlate with the outcomes if True
  • includeFreqs (boolean, optional) -- Include the frequency of each feature if True
  • blacklist (list, optional) -- list of feature table fields (str) to ignore
  • whitelist (list, optional) -- list of feature table fields (str) to include
Returns:

correls -- dict of outcome=>feature=>(R, p, numGroups, CI, featFreqs)

Return type:

dict

dlatk.outcomeGetter module

class dlatk.outcomeGetter.OutcomeGetter(corpdb='dla_tutorial', corptable='msgs', correl_field='user_id', mysql_host='127.0.0.1', message_field='message', messageid_field='message_id', encoding='utf8mb4', use_unicode=True, lexicondb='permaLexicon', outcome_table='masterstats_r500', outcome_value_fields=['demog_age'], outcome_controls=[], outcome_interaction=[], group_freq_thresh=None, featureMappingTable='', featureMappingLex='', wordTable=None, fold_column=None)[source]

Bases: dlatk.dlaWorker.DLAWorker

Deals with outcome tables

Parameters:
  • outcome_table (str) --
  • outcome_value_fields (list) --
  • outcome_controls (list) --
  • outcome_interaction (list) --
  • group_freq_thresh (int) --
  • featureMapping (str) --
  • oneGroupSetForAllOutcomes (str) --
  • fold_column (str) --
Returns:

Return type:

OutcomeGetter object

Examples

Initialize an OutcomeGetter

>>> og = OutcomeGetter.fromFile('~/myInit.ini')

Get outcome table as pandas dataframe

>>> outAndCont = og.getGroupsAndOutcomesAsDF()
copy()[source]
createOutcomeTable(tablename, dataframe, ifExists='fail')[source]
classmethod fromFile(initFile)[source]

Loads specified features from file

Parameters:initFile (str) -- Path to file
getDistinctOutcomeAndControlValueCounts(outcome=None, control=None, includeNull=True, where='')[source]

returns a dict of (outcome_value, count)

getDistinctOutcomeValueCounts(outcome=None, requireControls=False, includeNull=True, where='')[source]

returns a dict of (outcome_value, count)

getDistinctOutcomeValues(outcome=None, includeNull=True, where='')[source]

returns a list of outcome values

getFeatureMapping(featureMappingTable, featureMappingLex, bracketlabels)[source]
getGroupAndOutcomeValues(outcomeField=None, where='')[source]

returns a list of (group_id, outcome_value) tuples

getGroupAndOutcomeValuesAsDF(outcomeField=None, where='')[source]

returns a dataframe of (group_id, outcome_value)

getGroupsAndOutcomes(lexicon_count_table=None, groupsWhere='', includeFoldLabels=False)[source]
getGroupsAndOutcomesAsDF(lexicon_count_table=None, groupsWhere='', sparse=False)[source]
hasOutcomes()[source]
makeBinnedOutcomeTable(buckets, mid_aom_list)[source]

buckets is a list of tuples

makeContingencyTable(featureGetter, featureValueField, outcome_filter_where='', feature_value_group_sum_min=0)[source]

makes a contingency table from this outcome value, a featureGetter, and the desired column of the featureGetter; assumes both correl_fields are the same

numGroupsPerOutcome(featGetter, outputfile, where='')[source]

prints SAS-style CSV file output

dlatk.pca_mod module

Principal Component Analysis

class dlatk.pca_mod.PCA(n_components=None, copy=True, whiten=False)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Principal component analysis (PCA)

Linear dimensionality reduction using Singular Value Decomposition of the data and keeping only the most significant singular vectors to project the data to a lower dimensional space.

This implementation uses the scipy.linalg implementation of the singular value decomposition. It only works for dense arrays and is not scalable to large dimensional data.

The time complexity of this implementation is O(n ** 3) assuming n ~ n_samples ~ n_features.

Parameters:
  • n_components (int, None or string) --

    Number of components to keep. if n_components is not set all components are kept:

    n_components == min(n_samples, n_features)
    

    If n_components == 'mle', Minka's MLE is used to guess the dimension. If 0 < n_components < 1, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components.

  • copy (bool) -- If False, data passed to fit are overwritten and running fit(X).transform(X) will not yield the expected results, use fit_transform(X) instead.
  • whiten (bool, optional) --

    When True (False by default) the components_ vectors are divided by n_samples times singular values to ensure uncorrelated outputs with unit component-wise variances.

    Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.

`components_`

array, [n_components, n_features] -- Components with maximum variance.

`explained_variance_ratio_`

array, [n_components] -- Percentage of variance explained by each of the selected components. If n_components is not set then all components are stored and the sum of explained variances is equal to 1.0

`mean_`

array, [n_features] -- Per-feature empirical mean, estimated from the training set.

`n_components_`

int -- The estimated number of components. Relevant when n_components is set to 'mle' or a number between 0 and 1 to select using explained variance.

`noise_variance_`

float -- The estimated noise covariance following the Probabilistic PCA model from Tipping and Bishop 1999. See "Pattern Recognition and Machine Learning" by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf. It is required to compute the estimated data covariance and score samples.

Notes

For n_components='mle', this class uses the method of Thomas P. Minka: Automatic Choice of Dimensionality for PCA. NIPS 2000: 598-604

Implements the probabilistic PCA model from: M. Tipping and C. Bishop, Probabilistic Principal Component Analysis, Journal of the Royal Statistical Society, Series B, 61, Part 3, pp. 611-622 via the score and score_samples methods. See http://www.miketipping.com/papers/met-mppca.pdf

Due to implementation subtleties of the Singular Value Decomposition (SVD), which is used in this implementation, running fit twice on the same matrix can lead to principal components with signs flipped (change in direction). For this reason, it is important to always use the same estimator object to transform data in a consistent fashion.

Examples

>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> pca = PCA(n_components=2)
>>> pca.fit(X)
PCA(copy=True, n_components=2, whiten=False)
>>> print(pca.explained_variance_ratio_) 
[ 0.99244...  0.00755...]

See also

ProbabilisticPCA, RandomizedPCA, KernelPCA, SparsePCA, TruncatedSVD

fit(X, y=None)[source]

Fit the model with X.

Parameters:X (array-like, shape (n_samples, n_features)) -- Training data, where n_samples is the number of samples and n_features is the number of features.
Returns:self -- Returns the instance itself.
Return type:object
fit_transform(X, y=None)[source]

Fit the model with X and apply the dimensionality reduction on X.

Parameters:X (array-like, shape (n_samples, n_features)) -- Training data, where n_samples is the number of samples and n_features is the number of features.
Returns:X_new
Return type:array-like, shape (n_samples, n_components)
get_covariance()[source]

Compute data covariance with the generative model.

cov = components_.T * S**2 * components_ + sigma2 * eye(n_features) where S**2 contains the explained variances.

Returns:cov -- Estimated covariance of data.
Return type:array, shape=(n_features, n_features)
get_precision()[source]

Compute data precision matrix with the generative model.

Equals the inverse of the covariance but computed with the matrix inversion lemma for efficiency.

Returns:precision -- Estimated precision of data.
Return type:array, shape=(n_features, n_features)
inverse_transform(X)[source]

Transform data back to its original space, i.e., return an input X_original whose transform would be X

Parameters:X (array-like, shape (n_samples, n_components)) -- New data, where n_samples is the number of samples and n_components is the number of components.
Returns:
Return type:X_original array-like, shape (n_samples, n_features)

Notes

If whitening is enabled, inverse_transform does not compute the exact inverse operation as transform.

score(X, y=None)[source]

Return the average log-likelihood of all samples

See "Pattern Recognition and Machine Learning" by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf

Parameters:X (array, shape(n_samples, n_features)) -- The data.
Returns:ll -- Average log-likelihood of the samples under the current model
Return type:float
score_samples(X)[source]

Return the log-likelihood of each sample

See "Pattern Recognition and Machine Learning" by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf

Parameters:X (array, shape(n_samples, n_features)) -- The data.
Returns:ll -- Log-likelihood of each sample under the current model
Return type:array, shape (n_samples,)
transform(X)[source]

Apply the dimensionality reduction on X.

X is projected on the first principal components previously extracted from a training set.

Parameters:X (array-like, shape (n_samples, n_features)) -- New data, where n_samples is the number of samples and n_features is the number of features.
Returns:X_new
Return type:array-like, shape (n_samples, n_components)
class dlatk.pca_mod.ProbabilisticPCA(*args, **kwargs)[source]

Bases: dlatk.pca_mod.PCA

Additional layer on top of PCA that adds a probabilistic evaluation.

fit(X, y=None, homoscedastic=True)[source]

In addition to PCA.fit, learns a covariance model

Parameters:
  • X (array of shape(n_samples, n_features)) -- The data to fit
  • homoscedastic (bool, optional) -- If True, average variance across remaining dimensions
score(X, y=None)[source]

Return a score associated to new data

Parameters:X (array of shape(n_samples, n_features)) -- The data to test
Returns:ll -- log-likelihood of each row of X under the current model
Return type:array of shape (n_samples,)
class dlatk.pca_mod.RandomizedPCA(n_components=None, copy=True, iterated_power=3, whiten=False, random_state=None, max_components=None)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Principal component analysis (PCA) using randomized SVD

Linear dimensionality reduction using approximated Singular Value Decomposition of the data and keeping only the most significant singular vectors to project the data to a lower dimensional space.

Parameters:
  • n_components (int, optional) -- Maximum number of components to keep. When not given or None, this is set to n_features (the second dimension of the training data).
  • copy (bool) -- If False, data passed to fit are overwritten and running fit(X).transform(X) will not yield the expected results, use fit_transform(X) instead.
  • iterated_power (int, optional) -- Number of iterations for the power method. 3 by default.
  • whiten (bool, optional) --

    When True (False by default) the components_ vectors are divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.

    Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometime improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.

  • random_state (int or RandomState instance or None (default)) -- Pseudo Random Number generator seed control. If None, use the numpy.random singleton.
`components_`

array, [n_components, n_features] -- Components with maximum variance.

`explained_variance_ratio_`

array, [n_components] -- Percentage of variance explained by each of the selected components. If n_components is not set then all components are stored and the sum of explained variances is equal to 1.0

`mean_`

array, [n_features] -- Per-feature empirical mean, estimated from the training set.

Examples

>>> import numpy as np
>>> from sklearn.decomposition import RandomizedPCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> pca = RandomizedPCA(n_components=2)
>>> pca.fit(X)                 
RandomizedPCA(copy=True, iterated_power=3, n_components=2,
       random_state=None, whiten=False)
>>> print(pca.explained_variance_ratio_) 
[ 0.99244...  0.00755...]

See also

PCA, ProbabilisticPCA, TruncatedSVD

References

[Halko2009]Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions Halko, et al., 2009 (arXiv:909)
[MRT]A randomized algorithm for the decomposition of matrices Per-Gunnar Martinsson, Vladimir Rokhlin and Mark Tygert

Notes

This class supports sparse matrix input for backward compatibility, but actually computes a truncated SVD instead of a PCA in that case (i.e. no centering is performed). This support is deprecated; use the class TruncatedSVD for sparse matrix support.

fit(X, y=None)[source]

Fit the model with X by extracting the first principal components.

Parameters:X (array-like, shape (n_samples, n_features)) -- Training data, where n_samples is the number of samples and n_features is the number of features.
Returns:self -- Returns the instance itself.
Return type:object
fit_transform(X, y=None)[source]

Fit the model with X and apply the dimensionality reduction on X.

Parameters:X (array-like, shape (n_samples, n_features)) -- New data, where n_samples is the number of samples and n_features is the number of features.
Returns:X_new
Return type:array-like, shape (n_samples, n_components)
inverse_transform(X, y=None)[source]

Transform data back to its original space.

Returns an array X_original whose transform would be X.

Parameters:X (array-like, shape (n_samples, n_components)) -- New data, where n_samples is the number of samples and n_components is the number of components.
Returns:
Return type:X_original array-like, shape (n_samples, n_features)

Notes

If whitening is enabled, inverse_transform does not compute the exact inverse operation of transform.

transform(X, y=None)[source]

Apply dimensionality reduction on X.

X is projected on the first principal components previous extracted from a training set.

Parameters:X (array-like, shape (n_samples, n_features)) -- New data, where n_samples is the number of samples and n_features is the number of features.
Returns:X_new
Return type:array-like, shape (n_samples, n_components)

dlatk.regressionPredictor module

Regression Predictor

Interfaces with DLATK and scikit-learn to perform prediction of outcomes for language features.

class dlatk.regressionPredictor.ClassifyToRegressionPredictor(og, fg, modelC='linear-svc', modelR='ridgecv')[source]

Bases: object

Performs classification for 0/non-zero then regression on non-zeros

classOutcomeLabel = 'bin_'

predict(standardize=True, sparse=False, restrictToGroups=None)[source]

Predicts with the classifier and regressor; zero must be the "false" prediction from the classifier

randomState = 42
test(standardize=True, sparse=False, saveModels=False, groupsWhere='')[source]
testPerc = 0.2
train(standardize=True, sparse=False, restrictToGroups=None, nFolds=4, trainRegOnAll=True, classifierAsFeat=True, groupsWhere='')[source]
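
The classify-then-regress flow described above can be sketched with scikit-learn (a conceptual illustration on synthetic data, not this class's implementation; the estimators only loosely mirror the modelC/modelR defaults):

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import RidgeCV

rng = np.random.RandomState(42)
X = rng.randn(200, 10)
y = np.where(rng.rand(200) < 0.4, 0.0, rng.rand(200) * 5)  # outcome with many exact zeros

# Stage 1: classify zero vs. non-zero
clf = LinearSVC(C=0.01, dual=False).fit(X, (y != 0).astype(int))

# Stage 2: regress only on the non-zero subset
reg = RidgeCV().fit(X[y != 0], y[y != 0])

# Predict: zero whenever the classifier says "false", otherwise use the regressor
pred = np.where(clf.predict(X) == 0, 0.0, reg.predict(X))
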
class dlatk.regressionPredictor.CombinedRegressionPredictor(og, fgs, modelNames=['ridge'], combinedModelName='ridgecv')[source]

Bases: dlatk.regressionPredictor.RegressionPredictor

A class to handle a combination of regression predictors, implemented as a linear model

combinedTrainPerc = 0.15
cvFolds = 3
cvJobs = 6
cvParams = {'ridgecv': [{'alphas': array([ 1.00000000e+01, 2.00000000e+00, 2.00000000e+01, 1.00000000e+00, 1.00000000e+02, 2.00000000e+02, 1.00000000e+03, 1.00000000e-01, 2.00000000e-01, 1.00000000e-02, 1.00000000e-03, 1.00000000e-04, 1.00000000e-05, 1.00000000e-06])}], 'ridge': [{'alpha': [10], 'fit_intercept': [True]}]}
load(filename)[source]
maxPredictAtTime = 10000000
modelToClassName = {'ridgecv': 'RidgeCV', 'ridge': 'Ridge'}
predict(standardize=True, testPerc=0.25, sparse=False)[source]
predictToFeatureTable(standardize=True, testPerc=0.25, sparse=False, fe=None, name=None)[source]
randomState = 42
save(filename)[source]
test(standardize=True, sparse=False, saveModels=False, groupsWhere='')[source]

Tests combined regression

testPerc = 0.2
train(standardize=True, sparse=False)[source]

Trains regression models

class dlatk.regressionPredictor.RPCRidgeCV(component_percs=[0.01, 0.0316, 0.1, 0.316, 1], alphas=array([ 0.1, 1. , 10. ]), fit_intercept=True, normalize=False, cv=None, gcv_mode=None)[source]

Bases: sklearn.linear_model.base.LinearModel, sklearn.base.RegressorMixin

Randomized PCA Ridge Regression with built-in cross-validation

To set the RPCA number of components, it uses an 80% training set and a 20% dev set.

For ridge, it performs Generalized Cross-Validation, which is a form of efficient Leave-One-Out cross-validation.

Parameters:
  • component_percs (list of percentages) -- number of components to try, as a percentage of observations; default is [0.01, 0.0316, 0.1, 0.316, 1]
  • alphas (numpy array of shape [n_alphas]) -- Array of alpha values to try. Small positive values of alpha improve the conditioning of the problem and reduce the variance of the estimates. Alpha corresponds to (2*C)^-1 in other linear models such as LogisticRegression or LinearSVC.
  • fit_intercept (boolean) -- Whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (e.g. data is expected to be already centered).
  • normalize (boolean, optional) -- If True, the regressors X are normalized
  • cv (cross-validation generator, optional) -- If None, Generalized Cross-Validation (efficient Leave-One-Out) will be used.
  • gcv_mode ({None, 'auto', 'svd', 'eigen'}, optional) --

    Flag indicating which strategy to use when performing Generalized Cross-Validation. Options are:

    'auto' : use svd if n_samples > n_features, otherwise use eigen
    'svd' : force computation via singular value decomposition of X
    'eigen' : force computation via eigendecomposition of X^T X
    

    The 'auto' mode is the default and is intended to pick the cheaper option of the two depending upon the shape of the training data.

  • store_cv_values (boolean, default=False) -- Flag indicating if the cross-validation values corresponding to each alpha should be stored in the cv_values_ attribute (see below). This flag is only compatible with cv=None (i.e. using Generalized Cross-Validation).
`cv_values_`

array, shape = [n_samples, n_alphas] or shape = [n_samples, n_targets, n_alphas], optional -- Cross-validation values for each alpha (if store_cv_values=True and cv=None). After fit() has been called, this attribute will contain the mean squared errors (by default) or the values of the {loss,score}_func function (if provided in the constructor).

`coef_`

array, shape = [n_features] or [n_targets, n_features] -- Weight vector(s).

`alpha_`

float -- Estimated regularization parameter.

See also

RidgeCV
Ridge Regression with built-in cross-val to set alpha
Ridge
Ridge regression
fit(X, y, sample_weight=1.0)[source]

fits the randomized PCA and ridge through cross-validation

predict(X)[source]
reducerString = 'RandomizedPCA(n_components=n_comps, random_state=42, whiten=False, iterated_power=3)'
transform(X)[source]
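
A rough sketch of the reduce-then-ridge idea behind this class (illustrative only: it uses scikit-learn's PCA in place of the randomized PCA named in reducerString, and the component percentages, alphas, and 80/20 split are taken from the description above):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(42)
X, y = rng.randn(100, 50), rng.randn(100)

# Try several component counts (as a fraction of training observations),
# keeping whichever scores best on a held-out 20% dev set.
n_train = int(0.8 * len(X))
best = None
for perc in [0.01, 0.0316, 0.1, 0.316, 1]:
    n_comps = min(max(1, int(perc * n_train)), X.shape[1])
    model = make_pipeline(PCA(n_components=n_comps),
                          RidgeCV(alphas=[0.1, 1.0, 10.0]))
    model.fit(X[:n_train], y[:n_train])
    score = model.score(X[n_train:], y[n_train:])
    if best is None or score > best[0]:
        best = (score, n_comps, model)
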
class dlatk.regressionPredictor.RegressionPredictor(og, fgs, modelName='ridge')[source]

Bases: object

Handles prediction of continuous outcomes

cvParams

dict

modelToClassName

dict

modelToCoeffsName

dict

cvJobs

int

cvFolds

int

chunkPredictions

boolean -- whether or not to predict in chunks (good for keeping track when there are a lot of predictions to do)

maxPredictAtTime

int

backOffPerc

float -- when num_features / training_insts is less than this, back off to backOffModel

backOffModel

str

featureSelectionString

str or None

featureSelectMin

int -- must have at least this many features to perform feature selection

featureSelectPerc

float -- only perform feature selection on a sample of training (set to 1 to perform on all)

testPerc

float -- percentage of sample to use as test set (the rest is training)

randomState

int -- random state seed (used for reproducible splits)

trainingSize

int

Parameters:
  • outcomeGetter (OutcomeGetter object) --
  • featureGetters (list) -- list of FeatureGetter objects
  • modelName (str, optional) --
Returns:

Return type:

RegressionPredictor object

accuracyStats(ytrue, ypred)[source]
adjustOutcomesFromControls(standardize=True, sparse=False, saveModels=False, allControlsOnly=False, comboSizes=None, nFolds=2, savePredictions=True, groupsWhere='')[source]

Produces adjusted outcomes given the controls
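
Residualizing an outcome against controls can be sketched as fitting the controls alone and keeping what they cannot explain (a conceptual illustration on synthetic data, not the cross-validated procedure this method runs):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
controls = rng.randn(100, 2)                    # e.g. age and gender columns
outcome = 3 * controls[:, 0] + rng.randn(100)   # outcome partly explained by controls

# Adjusted (residualized) outcome: what remains after removing the controls' fit
adjusted = outcome - LinearRegression().fit(controls, outcome).predict(controls)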

backOffModel = 'linear'
backOffPerc = 0.05
chunkPredictions = False
cvFolds = 3
cvJobs = 8
cvParams = {'ridgecv': [{'alphas': array([ 1.00000000e+00, 1.00000000e-02, 1.00000000e-04, 1.00000000e+02, 1.00000000e+04, 1.00000000e+06])}], 'lasso': [{'max_iter': [1500], 'alpha': [0.001]}], 'ridge100': [{'alpha': [100]}], 'lassolars': [{'alpha': [1, 0.1, 0.01, 0.001, 0.0001]}], 'linear': [{'fit_intercept': [True]}], 'par': [{'C': [0.01, 0.1, 0.001], 'n_iter': [10], 'shuffle': [False], 'random_state': [42], 'verbose': [1], 'epsilon': [0.01, 0.1, 1]}], 'lassocv': [{'n_alphas': [12], 'max_iter': [2200]}], 'extratrees': [{'random_state': [42], 'n_estimators': [1000], 'n_jobs': [12]}], 'ridge100000': [{'alpha': [100000]}], 'rpcridgecv': [{'component_percs': array([ 0.01 , 0.02154, 0.0464 , 0.1 , 0.2154 , 0.464 , 1. ]), 'alphas': array([ 1.00000000e-05])}], 'lassolarscv': [{'max_iter': [1000], 'max_n_alphas': [60]}], 'ridge': [{'alpha': [100]}], 'ridgefirstpasscv': [{'alphas': array([ 1.00000000e+00, 1.00000000e-02, 1.00000000e-04, 1.00000000e+02, 1.00000000e+04, 1.00000000e+06])}], 'ridge250': [{'alpha': [250]}], 'lars': [{}], 'ridgehighcv': [{'alphas': array([ 10, 100, 1, 1000, 10000, 100000, 1000000])}], 'sgdregressor': [{'n_iter': [50], 'alpha': array([ 1.00000000e+04, 1.00000000e+03, 1.00000000e+02, 1.00000000e+01, 1.00000000e+00, 1.00000000e-01, 1.00000000e-02, 1.00000000e-03, 1.00000000e-04]), 'penalty': ['l1'], 'fit_intercept': [True], 'verbose': [0]}], 'elasticnetcv': [{'n_alphas': [100], 'max_iter': [5000], 'cv': [10], 'verbose': [1], 'l1_ratio': array([ 1. , 0.99 , 0.975, 0.95 , 0.9 , 0.75 , 0.5 , 0.1 , 0.05 , 0.025, 0.01 , 0. ]), 'n_jobs': [10]}], 'elasticnet': [{'max_iter': [1500], 'alpha': [0.001], 'l1_ratio': [0.8]}], 'ridge1000': [{'alpha': [1000]}], 'svr': [{'C': [0.01, 0.001, 0.0001, 1e-05, 1e-06, 1e-07], 'epsilon': [0.25], 'kernel': ['linear']}], 'ridgelowcv': [{'alphas': array([ 1.00000000e-02, 1.00000000e-01, 1.00000000e-03, 1.00000000e+00, 1.00000000e-04, 1.00000000e-05])}], 'ridge10000': [{'alpha': [10000]}]}
fSelectors = None

dict -- Docstring after attribute, with type specified.

featureNames = None

list -- Holds the order the features are expected in.

featureSelectMin = 30
featureSelectPerc = 1.0
featureSelectionString = None
getWeightsForFeaturesAsADict()[source]
load(filename, pickle2_7=True)[source]
maxPredictAtTime = 60000
modelName = None

str -- Docstring after attribute, with type specified.

modelToClassName = {'ridgecv': 'RidgeCV', 'lasso': 'Lasso', 'ridge100': 'Ridge', 'lassolars': 'LassoLars', 'linear': 'LinearRegression', 'ridge10000': 'Ridge', 'par': 'PassiveAggressiveRegressor', 'lassocv': 'LassoCV', 'extratrees': 'ExtraTreesRegressor', 'ridge100000': 'Ridge', 'rpcridgecv': 'RPCRidgeCV', 'lassolarscv': 'LassoLarsCV', 'ridge': 'Ridge', 'ridgefirstpasscv': 'RidgeCV', 'ridge250': 'Ridge', 'lars': 'Lars', 'sgdregressor': 'SGDRegressor', 'elasticnetcv': 'ElasticNetCV', 'elasticnet': 'ElasticNet', 'ridge1000': 'Ridge', 'svr': 'SVR', 'ridgelowcv': 'RidgeCV', 'ridgehighcv': 'RidgeCV'}
modelToCoeffsName = {'randomizedlasso': 'scores_', 'ridgecv': 'coef_', 'lasso': 'coef_', 'ridge100': 'coef_', 'lassolars': 'coef_', 'linear': 'coef_', 'ridge10000': 'coef_', 'par': 'coef_', 'lassocv': 'coef_', 'extratrees': 'feature_importances_', 'ridge100000': 'coef_', 'rpcridgecv': 'coef_', 'lassolarscv': 'coef_', 'ridge': 'coef_', 'ridgefirstpasscv': 'coef_', 'ridge250': 'coef_', 'lars': 'coef_', 'sgdregressor': 'coef_', 'elasticnetcv': 'coef_', 'elasticnet': 'coef_', 'ridge1000': 'coef_', 'svr': 'coef_', 'ridgelowcv': 'coef_', 'ridgehighcv': 'coef_'}
multiFSelectors = None

str -- Docstring after attribute, with type specified.

multiScalers = None

str -- Docstring after attribute, with type specified.

multiXOn = None

boolean -- whether multiX was used for training.

old_predict(standardize=True, sparse=False, restrictToGroups=None)[source]
old_train(standardize=True, sparse=False, restrictToGroups=None)[source]

Trains regression models

predict(standardize=True, sparse=False, restrictToGroups=None, groupsWhere='')[source]
predictAllToFeatureTable(standardize=True, sparse=False, fe=None, name=None, nFolds=10, groupsWhere='')[source]
predictNoOutcomeGetter(groups, standardize=True, sparse=False, restrictToGroups=None)[source]
predictToFeatureTable(standardize=True, sparse=False, fe=None, name=None, groupsWhere='')[source]
predictToOutcomeTable(standardize=True, sparse=False, name=None, nFolds=10, groupsWhere='')[source]
static printComboControlPredictionsToCSV(scores, outputstream, paramString=None, delimiter='|')[source]

prints predictions with all combinations of controls to csv

static printComboControlScoresToCSV(scores, outputstream=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>, paramString=None, delimiter='|')[source]

prints scores with all combinations of controls to csv

randomState = 42
regressionModels = None

dict -- Docstring after attribute, with type specified.

save(filename)[source]
scalers = None

dict -- Docstring after attribute, with type specified.

test(standardize=True, sparse=False, saveModels=False, blacklist=None, groupsWhere='')[source]

Tests the regressor, by pulling out a random testPerc percentage as a test set

testControlCombos(standardize=True, sparse=False, saveModels=False, blacklist=None, noLang=False, allControlsOnly=False, comboSizes=None, nFolds=2, savePredictions=False, weightedEvalOutcome=None, residualizedControls=False, groupsWhere='')[source]

Tests regressors, by cross-validating over folds with different combinations of controls

testPerc = 0.2
train(standardize=True, sparse=False, restrictToGroups=None, groupsWhere='')[source]

Train Regressors

trainingSize = 1000000
class dlatk.regressionPredictor.SpamsGroupLasso(alpha=1.0, l1_ratio=0.5, fit_intercept=True, normalize=False, precompute='auto', max_iter=1000, copy_X=True, tol=0.0001, warm_start=False, positive=False, rho=None)[source]

Bases: sklearn.linear_model.base.LinearModel, sklearn.base.RegressorMixin

interfaces with the SPAMS implementation of group lasso

spamsParams = {'verbose': True, 'numThreads': -1}
class dlatk.regressionPredictor.VERPCA(n_components=None, copy=True, iterated_power=3, whiten=False, random_state=None, max_components_ratio=0.25)[source]

Bases: dlatk.pca_mod.RandomizedPCA

Randomized PCA that sets number of components by variance explained

Parameters:
  • n_components (int) -- Maximum number of components to keep: default is 50.
  • copy (bool) -- If False, data passed to fit are overwritten
  • iterated_power (int, optional) -- Number of iterations for the power method. 3 by default.
  • whiten (bool, optional) --

    When True (False by default) the components_ vectors are divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.

    Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometime improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.

  • random_state (int or RandomState instance or None (default)) -- Pseudo Random Number generator seed control. If None, use the numpy.random singleton.
  • max_components_ratio (float) -- Maximum number of components in terms of their ratio to the number of features. Default is 0.25 (1/4).
fit(X, y=None)[source]

Fit the model to the data X.

Parameters:X (array-like or scipy.sparse matrix, shape (n_samples, n_features)) -- Training vector, where n_samples is the number of samples and n_features is the number of features.
Returns:self -- Returns the instance itself.
Return type:object
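
Choosing the number of components by explained variance can be sketched with scikit-learn's PCA (an illustration of the idea only; the 95% target is an assumed threshold, and the 0.25 cap mirrors max_components_ratio's default):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).randn(200, 30)
full = PCA().fit(X)

# Smallest component count explaining at least 95% of the variance,
# capped at max_components_ratio * n_features
target, max_ratio = 0.95, 0.25
cum = np.cumsum(full.explained_variance_ratio_)
n_components = min(int(np.searchsorted(cum, target)) + 1,
                   int(max_ratio * X.shape[1]))
reduced = PCA(n_components=n_components).fit_transform(X)
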
dlatk.regressionPredictor.alignDictsAsXy(X, y, sparse=False, returnKeyList=False, keys=None)[source]

turns a list of dicts for X and a dict for y into a matrix X and a vector y
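
A minimal sketch of aligning per-feature dicts and an outcome dict on their shared group keys (illustrative; the real function also supports sparse output, an explicit key list, and returning the key order):

import numpy as np

def align_dicts_as_Xy(X_dicts, y_dict):
    # X_dicts: list of {group_id: value} dicts, one per feature; y_dict: {group_id: outcome}
    keys = sorted(set(y_dict).intersection(*[set(d) for d in X_dicts]))
    X = np.array([[d[k] for d in X_dicts] for k in keys])
    y = np.array([y_dict[k] for k in keys])
    return X, y

X, y = align_dicts_as_Xy([{'u1': 0.2, 'u2': 0.1}, {'u1': 1.0, 'u2': 3.0}],
                         {'u1': 25, 'u2': 31})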

dlatk.regressionPredictor.alignDictsAsy(y, *yhats, **kwargs)[source]
dlatk.regressionPredictor.chunks(X, y, size)[source]

Yield successive n-sized chunks from X and Y.
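
A minimal sketch consistent with the description (assuming indexable X and y of equal length):

def chunks(X, y, size):
    # yield successive size-sized chunks of X and y together
    for i in range(0, len(y), size):
        yield X[i:i + size], y[i:i + size]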

dlatk.regressionPredictor.foldN(l, folds)[source]

Yield successive n-sized chunks from l.

dlatk.regressionPredictor.getGroupsFromGroupNormValues(gnvs)[source]
dlatk.regressionPredictor.grouper(3, 'abcdefg', 'x') --> ('a', 'b', 'c'), ('d', 'e', 'f'), ('g', 'x', 'x')[source]
dlatk.regressionPredictor.hasMultValuesPerItem(listOfD)[source]

returns true if the dictionary has a list with more than one element

dlatk.regressionPredictor.r2simple(ytrue, ypred)[source]

dlatk.semanticsExtractor module

Semantic Extractor

Bridges DLATK's Feature Extractor with reading XML semantic annotations. Assumes that corptable is a directory with files rather than a database table.

class dlatk.semanticsExtractor.SemanticsExtractor(corpdb, corptable, correl_field, mysql_host, message_field, messageid_field, corpdir)[source]

Bases: dlatk.featureExtractor.FeatureExtractor

addNERTable(tableName=None, min_freq=1, valueFunc=<function SemanticsExtractor.<lambda>>, normalizeByCollocsInGroup=False)[source]

extracts named entity features from semantic XML files

Module contents