Data Cleaning

This is an overview of the different methods for cleaning text data. We start with a message table called "msgs" (available in the packaged data):

mysql> describe msgs;
+--------------+------------------+------+-----+---------+----------------+
| Field        | Type             | Null | Key | Default | Extra          |
+--------------+------------------+------+-----+---------+----------------+
| message_id   | int(11)          | NO   | PRI | NULL    | auto_increment |
| user_id      | int(10) unsigned | YES  | MUL | NULL    |                |
| date         | varchar(64)      | YES  |     | NULL    |                |
| created_time | datetime         | YES  | MUL | NULL    |                |
| message      | text             | YES  |     | NULL    |                |
+--------------+------------------+------+-----+---------+----------------+

Language Filtering

Language filtering uses the langid Python package. For each message, langid assigns a confidence, and the message is kept if the confidence is over 0.80. By default, messages are lowercased before being run through langid.

This creates a new table named after the table given by the -t flag, with "_en" appended.

  • --language_filter
# creates the table msgs_en
./dlatkInterface.py -d dla_tutorial -t msgs -c user_id --language_filter en

Adding the --clean_messages flag removes URLs and @-mentions before the language filter is applied, which improves the language classification:

# creates the table msgs_en
./dlatkInterface.py -d dla_tutorial -t msgs -c user_id --language_filter en --clean_messages
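
The thresholding step described above can be sketched in a few lines of Python. This is a minimal illustration, not DLATK's actual code: the function names and the toy classifier are hypothetical, and the sketch assumes that a message must also be predicted as the requested language and that the original (not lowercased) text is what gets stored. With langid installed, a classifier that returns a (language, confidence) pair could be passed in place of the toy one.

```python
from typing import Callable, Iterable, List, Tuple

# A classifier maps text to (language_code, confidence in [0, 1]).
Classifier = Callable[[str], Tuple[str, float]]


def filter_by_language(
    messages: Iterable[str],
    classify: Classifier,
    language: str = "en",
    threshold: float = 0.80,
    lowercase: bool = True,
) -> List[str]:
    """Keep messages classified as `language` with confidence above `threshold`."""
    kept = []
    for message in messages:
        # Lowercase before classification, mirroring the default described above.
        text = message.lower() if lowercase else message
        predicted, confidence = classify(text)
        if predicted == language and confidence > threshold:
            # Assumption of this sketch: keep the original text, not the lowercased copy.
            kept.append(message)
    return kept


def toy_classifier(text: str) -> Tuple[str, float]:
    """Hypothetical stand-in for a real language identifier such as langid."""
    english_hints = {"the", "and", "is", "you"}
    tokens = text.split()
    hits = sum(token in english_hints for token in tokens)
    if hits:
        return "en", min(1.0, 0.7 + 0.1 * hits)
    return "es", 0.6


messages = ["The cat is on the mat", "El gato está aquí", "you and me"]
english_only = filter_by_language(messages, toy_classifier)
print(english_only)
```

Passing the classifier as an argument keeps the threshold logic independent of any particular language identification library.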

Deduplicating

Removes messages that contain the same first 6 tokens as another message within the same group (-c). This creates a new table named after the table given by the -t flag, with "_dedup" appended.

  • --deduplicate
# creates the table msgs_dedup
./dlatkInterface.py -d dla_tutorial -t msgs -c user_id --deduplicate

To also replace URLs (with <URL>) and @-mentions (with <USER>), add the --clean_messages flag (this helps the language classification task):

./dlatkInterface.py -d dla_tutorial -t msgs -c user_id --deduplicate --clean_messages
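
The deduplication rule can be sketched as follows. This is a hedged illustration rather than DLATK's implementation: the `deduplicate` function and the row layout are hypothetical, tokens are assumed to be whitespace-separated, and the first message seen for each group is assumed to be the one that is kept.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical row layout: (message_id, user_id, message)
Message = Tuple[int, int, str]


def deduplicate(messages: Iterable[Message], num_tokens: int = 6) -> List[Message]:
    """Keep only the first message per (group, leading tokens) combination."""
    seen: Dict[int, Set[Tuple[str, ...]]] = {}
    kept: List[Message] = []
    for message_id, user_id, text in messages:
        # Assumption: simple whitespace tokenization.
        key = tuple(text.split()[:num_tokens])
        group_seen = seen.setdefault(user_id, set())
        if key in group_seen:
            continue  # same leading tokens already seen within this group
        group_seen.add(key)
        kept.append((message_id, user_id, text))
    return kept


rows = [
    (1, 10, "I just entered to win a free phone today"),
    (2, 10, "I just entered to win a free tablet tonight"),
    (3, 11, "I just entered to win a free phone today"),
    (4, 10, "Good morning"),
]
deduped = deduplicate(rows)
print([message_id for message_id, _, _ in deduped])
```

Message 2 is dropped because it shares its first 6 tokens with message 1 for the same user, while message 3 survives because it belongs to a different group.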

Removing URLs and @-mentions

When --clean_messages is used alone, it creates a new table named after the table given by the -t flag, with "_an" appended.

  • --clean_messages
# creates the table msgs_an
./dlatkInterface.py -d dla_tutorial -t msgs -c user_id --clean_messages
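
To illustrate the kind of substitution involved, here is a minimal regex-based sketch that swaps URLs for <URL> and @-mentions for <USER>. The patterns and the `clean_message` function are assumptions made for this example and are deliberately simplified; they are not the patterns DLATK uses.

```python
import re

# Simplified, hypothetical patterns; real URL and mention matching has more edge cases.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# The lookbehind avoids treating the "@" inside an email address as a mention.
MENTION_PATTERN = re.compile(r"(?<!\w)@\w+")


def clean_message(message: str) -> str:
    """Replace URLs with <URL> and @-mentions with <USER>."""
    message = URL_PATTERN.sub("<URL>", message)
    return MENTION_PATTERN.sub("<USER>", message)


cleaned = clean_message("@alice check http://example.com/a?b=1 and mail bob@site.org")
print(cleaned)
```

URLs are replaced first so that any "@" inside a URL is not mistaken for a mention.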

Spam Filtering

A message is marked as spam (is_spam = 1) if it contains any of the spam words; otherwise is_spam = 0. If a user's ratio of spam messages to total messages exceeds THRESHOLD (the value passed to --spam_filter), that user is removed from the new message table.

Spam words = 'share', 'win', 'check', 'enter', 'products', 'awesome', 'prize', 'sweeps', 'bonus', 'gift'

This creates a new table named after the table given by the -t flag, with "_nospam" appended.

  • --spam_filter
# creates the table msgs_nospam
./dlatkInterface.py -d dla_tutorial -t msgs -c user_id --spam_filter 0.1
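
The spam rule can be sketched as below. This is an illustration under stated assumptions, not DLATK's code: the function names and the row layout are hypothetical, and spam words are assumed to be matched case-insensitively as whole words.

```python
import re
from collections import defaultdict
from typing import Iterable, List, Tuple

SPAM_WORDS = {"share", "win", "check", "enter", "products",
              "awesome", "prize", "sweeps", "bonus", "gift"}

# Hypothetical row layout: (message_id, user_id, message)
Message = Tuple[int, int, str]


def is_spam(message: str) -> bool:
    """Assumption: case-insensitive, whole-word matching against SPAM_WORDS."""
    tokens = re.findall(r"[a-z']+", message.lower())
    return any(token in SPAM_WORDS for token in tokens)


def filter_spam_users(messages: Iterable[Message], threshold: float = 0.1) -> List[Message]:
    """Drop every message from users whose spam ratio exceeds `threshold`."""
    rows = list(messages)
    totals = defaultdict(int)
    spam_counts = defaultdict(int)
    for _, user_id, text in rows:
        totals[user_id] += 1
        spam_counts[user_id] += int(is_spam(text))
    spam_users = {u for u in totals if spam_counts[u] / totals[u] > threshold}
    return [row for row in rows if row[1] not in spam_users]


rows = [
    (1, 10, "Enter now to WIN a prize!"),
    (2, 10, "Lunch was fine"),
    (3, 11, "Heading to the gym"),
    (4, 11, "Winter is here"),
]
no_spam = filter_spam_users(rows, threshold=0.1)
print([message_id for message_id, _, _ in no_spam])
```

User 10 has one spam message out of two (ratio 0.5), so both of that user's messages are removed; "Winter" does not count as a match for "win" under the whole-word assumption.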

Real World Example

Using both the language filter and the deduplication filter typically yields nice results (both with the --clean_messages flag):

# creates the table msgs_en
./dlatkInterface.py -d dla_tutorial -t msgs -c user_id --language_filter en --clean_messages

# creates the table msgs_en_dedup
./dlatkInterface.py -d dla_tutorial -t msgs_en -c user_id --deduplicate --clean_messages