--deduplicate

Switch

--deduplicate

Description

Removes duplicate tweets within each -c group and writes the result to a new table, corptable_dedup. Not to be run at the message level.

Argument and Default Value

None

Details

Takes a MySQL message table and removes all duplicate messages within a given user. Two tweets count as duplicates if their first 6 tokens match once usernames, URLs, hashtags, smileys, punctuation, etc. are ignored. The result is written to a new message table with _dedup appended to the end of the name. For example, the following two tweets would be considered duplicates despite not being identical:

Written by Daniel Preotiuc, original code found here.

The --clean_messages flag replaces URLs with <URL> and @mentions with <USER>.
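
The duplicate rule described above can be illustrated with a minimal Python sketch. This is not DLATK's implementation: the function names (dedup_key, deduplicate), the regular expressions, and the whitespace tokenization are assumptions made for illustration, and the real tokenizer and filters may differ. It also assumes that messages with fewer than 6 remaining tokens are compared on whatever tokens they have.

```python
import re

# Hypothetical filters; DLATK's actual tokenizer and filters may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
NON_WORD_RE = re.compile(r"[^\w\s]")


def dedup_key(message, n_tokens=6):
    """Return the first n_tokens tokens left after filtering, as a tuple."""
    text = message
    # Strip URLs, @mentions and hashtags before touching punctuation,
    # so that "http://..." is removed as a whole.
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    # Drop punctuation; this also removes punctuation-only smileys such as ":)".
    text = NON_WORD_RE.sub(" ", text)
    return tuple(text.split()[:n_tokens])


def deduplicate(rows, n_tokens=6):
    """Keep the first message per (group_id, key).

    rows is an iterable of (group_id, message) pairs, e.g. (user_id, message).
    """
    seen = set()
    kept = []
    for group_id, message in rows:
        key = (group_id, dedup_key(message, n_tokens))
        if key not in seen:
            seen.add(key)
            kept.append((group_id, message))
    return kept
```

Because the group id is part of the key, the same text posted by two different users is kept for both; only repeats within one user are dropped.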

Other Switches

Required Switches:

-d, -c, -t

Optional Switches:

--clean_messages

Example Commands

Remove duplicate tweets while cleaning URLs and @mentions:

# creates the table msgs_dedup
./dlatkInterface.py -d dla_tutorial -t msgs -c user_id --deduplicate --clean_messages