dlatk.lib package

Submodules

dlatk.lib.StanfordParser module

class dlatk.lib.StanfordParser.StanfordParser(**kwargs)[source]

Bases: object

depParseRe = re.compile('^[a-z0-9\\_]+\\([a-z0-9]', re.IGNORECASE)
getParseDicts(output)[source]
parse(sents)[source]

returns a list of dicts with parse information: const, dep, and pos
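
A minimal usage sketch (the const/dep/pos keys follow the docstring above; the sketch assumes the Stanford parser dependencies that DLATK wraps are installed and configured):

   from dlatk.lib.StanfordParser import StanfordParser

   sp = StanfordParser()
   # parse() takes a list of sentences and returns one dict per sentence
   for p in sp.parse(["The quick brown fox jumps over the lazy dog."]):
       print(p['const'])  # constituency parse
       print(p['dep'])    # dependency parse
       print(p['pos'])    # part-of-speech tags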

dlatk.lib.StanfordParser.shortenToNWords(sent, n)[source]

dlatk.lib.TweetNLP module

class dlatk.lib.TweetNLP.TweetNLP(**kwargs)[source]

Bases: object

getTaggerProcess()[source]
getTokenizerProcess()[source]
tag(sents)[source]

returns a list of lists of tuples (word, tag)

tokenize(sents)[source]

returns a list of lists of tokens, one list per input sentence
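
A minimal usage sketch (assumes the external Twitter NLP tagger/tokenizer processes that getTaggerProcess() and getTokenizerProcess() launch are installed where DLATK expects them):

   from dlatk.lib.TweetNLP import TweetNLP

   tn = TweetNLP()
   sents = ["ikr smh he asked fir yo real name"]
   tagged = tn.tag(sents)       # one list of (word, tag) tuples per sentence
   tokens = tn.tokenize(sents)  # one list of tokens per sentence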

dlatk.lib.descStats module

dlatk.lib.happierfuntokenizing module

This code implements a basic, Twitter-aware tokenizer.

A tokenizer is a function that splits a string of text into words. In Python terms, we map string and unicode objects into lists of unicode objects.

There is not a single right way to do tokenizing. The best method depends on the application. This tokenizer is designed to be flexible and easy to adapt to new domains and tasks. The basic logic is this (a minimal sketch follows the list):

  1. The tuple regex_strings defines a list of regular expression strings.
  2. The regex_strings strings are put, in order, into a compiled regular expression object called word_re.
  3. The tokenization is done by word_re.findall(s), where s is the user-supplied string, inside the tokenize() method of the class Tokenizer.
  4. When instantiating Tokenizer objects, the key option is preserve_case. In this implementation it defaults to False (see the class signature below); when False, the tokenizer downcases everything except emoticons.
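
A minimal sketch of that logic, using a hypothetical, trimmed-down regex_strings (the real module defines many more patterns, including emoticons, phone numbers, HTML tags, and Twitter markup):

   import re

   regex_strings = (
       r"(?:@[\w_]+)",                   # Twitter username
       r"(?:\#+[\w_]+[\w'_\-]*[\w_]+)",  # hashtag
       r"(?:[a-z][a-z'\-_]+[a-z])",      # word with apostrophe or dash
       r"(?:[\w_]+)",                    # other word
       r"(?:\S)",                        # fallback: any non-space character
   )

   # the strings are joined, in order, into one compiled pattern
   word_re = re.compile(r"|".join(regex_strings), re.VERBOSE | re.I)

   # tokenization is a single findall over the input string
   print(word_re.findall("@user loved #dlatk, didn't you?"))
   # ['@user', 'loved', '#dlatk', ',', "didn't", 'you', '?']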

The __main__ method illustrates by tokenizing a few examples.

I've also included a Tokenizer method tokenize_random_tweet(). If the twitter library is installed (http://code.google.com/p/python-twitter/) and Twitter is cooperating, then it should tokenize a random English-language tweet.

class dlatk.lib.happierfuntokenizing.Tokenizer(preserve_case=False, use_unicode=True)[source]

Bases: object

tokenize(s)[source]

Argument: s -- any string or unicode object
Value: a tokenized list of strings; concatenating this list returns the original string if preserve_case=False
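
For example:

   from dlatk.lib.happierfuntokenizing import Tokenizer

   tok = Tokenizer(preserve_case=False)
   # emoticons keep their form; other tokens are downcased
   print(tok.tokenize("RT @USER: DLATK is great :-)"))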

tokenize_random_tweet()[source]

If the twitter library is installed and a twitter connection can be established, then tokenize a random tweet.

dlatk.lib.notify module

dlatk.lib.wordcloud module

dlatk.lib.wordcloud.coerceToValidFileName(filename)[source]
dlatk.lib.wordcloud.createZColumn(schema, ngramTable)[source]
dlatk.lib.wordcloud.duplicateFilterLineIntoInformativeString(line)[source]
dlatk.lib.wordcloud.explode(string)[source]
dlatk.lib.wordcloud.extract(text, sub1, sub2)[source]
dlatk.lib.wordcloud.findall(string, sep=':')[source]
dlatk.lib.wordcloud.freqToColor(freq, maxFreq=1000, resolution=64, colorScheme='multi')[source]
dlatk.lib.wordcloud.getColorList(word_list, freq_list=[], randomize=False, colorScheme='multi', scale='linear')[source]
dlatk.lib.wordcloud.getFeatValueAndZ(user, schema, ngramTable, min_value=5, ordered=True, z_threshold=0)[source]

returns list of (feat, value, z) for a given user

dlatk.lib.wordcloud.getFeatWithLimit(schema, table, group='', amount=50, orderBy='group_norm', desc=True)[source]

get the first amount words, sorted on the orderBy column (descending if desc is True, ascending otherwise). If group is specified, get only from that specific group. Returns a list of (feat, group_norm) tuples.

dlatk.lib.wordcloud.getMeanAndStd(word, ngramTable, schema, num_groups=-1, distTable='', distTableSource=None)[source]

get mean and std for a word using the ngramTable

dlatk.lib.wordcloud.getNgrams(ngramTable, schema)[source]
dlatk.lib.wordcloud.getOneGram(schema, ngramTable)[source]
dlatk.lib.wordcloud.getRankedFreqList(word_list, max_size=75, min_size=30, scale='linear')[source]

returns freq_list, i.e. a list of sizes derived from word_list. freq_list runs from biggest to smallest, so make sure word_list is sorted accordingly.
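
A hypothetical call (word_list already sorted from most to least frequent):

   from dlatk.lib import wordcloud as wc

   sizes = wc.getRankedFreqList(['love', 'happy', 'nice'], max_size=75, min_size=30)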

dlatk.lib.wordcloud.getUniqueNgrams(schema, ngramTable, user='', max=-1)[source]

get up to max ngrams from ngramTable where the z-score equals 0, sorted by group_norm. If user is specified, only grab unique ngrams from that user.

dlatk.lib.wordcloud.getUsers(schema, ngramTable)[source]
dlatk.lib.wordcloud.getZscore(word, user, ngramTable, schema, distTable='')[source]
dlatk.lib.wordcloud.makeLexiconTopicWordclouds(lexdb, lextable, output, color, max_words=15)[source]
dlatk.lib.wordcloud.normalizeFreqList(old_freq_list, word_count=15)[source]

Given a sorted freq_list and a word count, return a normalized freq_list, based on the old sizing algorithm from oa.printTagCloudFromTuples.

Parameters:
   old_freq_list -- a list of integers sorted in descending order
   word_count -- how many entries the new freq_list should have
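
A hypothetical call, using only the documented signature:

   from dlatk.lib import wordcloud as wc

   new_sizes = wc.normalizeFreqList([300, 150, 90, 40, 10], word_count=3)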

dlatk.lib.wordcloud.processTopicLine(line)[source]
dlatk.lib.wordcloud.random() → x in the interval [0, 1). (This is Python's built-in random.random, picked up by the documentation because the module imports it at top level.)
dlatk.lib.wordcloud.rgb(hex_str)[source]

converts a hex string to an rgb tuple
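
The conversion can be sketched as follows (a generic hex-to-RGB recipe, not necessarily the module's exact implementation):

   def rgb(hex_str):
       """Convert '#RRGGBB' (or 'RRGGBB') to an (r, g, b) tuple of ints."""
       h = hex_str.lstrip('#')
       return tuple(int(h[i:i + 2], 16) for i in range(0, 6, 2))

   print(rgb('#FFFFFF'))  # (255, 255, 255)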

dlatk.lib.wordcloud.tagcloudToWordcloud(filename='', directory='', withTitle=False, fontFamily='Helvetica-Narrow', fontStyle=None, toFolders=False, useTopicWords=True)[source]
dlatk.lib.wordcloud.updateZscore(schema, ngramTable, user='', use_feat_table=False, distTable='')[source]
dlatk.lib.wordcloud.wordcloud(word_list, freq_list, output_prefix='test', color_list=None, random_colors=True, random_order=False, width=None, height=None, rgb=False, title=None, fontFamily='Helvetica-Narrow', keepPdfs=False, fontStyle=None, font_path='', min_font_size=40, max_font_size=250, max_words=500, big_mask=False, background_color='#FFFFFF', wordcloud_algorithm='ibm')[source]

given a list of words and a list of their frequencies, builds a wordle
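
A minimal call relying on the documented defaults (output images are written with the given prefix; rendering depends on the backends available locally):

   from dlatk.lib import wordcloud as wc

   words = ['happy', 'great', 'love', 'nice']
   freqs = [120, 80, 60, 30]
   wc.wordcloud(words, freqs, output_prefix='my_cloud')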

dlatk.lib.wordcloud.wordcloudByFile(input_filename, output_filename='wordle_test')[source]

Reads a file in format (per line) word:number and builds a wordle
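
A hypothetical input file and call:

   # words.txt -- one word:number pair per line
   # happy:120
   # great:80
   # love:60

   from dlatk.lib import wordcloud as wc

   wc.wordcloudByFile('words.txt', output_filename='my_wordle')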

Module contents