Tokenization, Part of Speech Tagging and Segmentation

This is an overview of the different methods for transforming your text. Each method needs the following flags:

-d: the database we are using
-t: the table inside the database where our text lives (aka the message table)
-g: the table column we will be grouping the text by (aka group); the example commands below pass this column as -c message_id

We start with a message table called "msgs" (available in the packaged data):

mysql> describe msgs;
+--------------+------------------+------+-----+---------+----------------+
| Field        | Type             | Null | Key | Default | Extra          |
+--------------+------------------+------+-----+---------+----------------+
| message_id   | int(11)          | NO   | PRI | NULL    | auto_increment |
| user_id      | int(10) unsigned | YES  | MUL | NULL    |                |
| date         | varchar(64)      | YES  |     | NULL    |                |
| created_time | datetime         | YES  | MUL | NULL    |                |
| message      | text             | YES  |     | NULL    |                |
+--------------+------------------+------+-----+---------+----------------+

Every command below creates a new version of the message table that includes all columns from the original table. The only differences are the new table name and the contents of the message column. Before running them, consider the following:

You could be duplicating a lot of unnecessary data
All of this data could take a long time to write back to MySQL and cause your connection to drop

Also consider the grouping level (-g). In the following commands, changing this flag only changes the chunk size used for reading and writing, which can also cause MySQL connections to drop, depending on the data size.

Tokenization

Happier Fun Tokenizer

Use DLATK's built-in tokenizer, Happier Fun Tokenizer, which is an extension of Happy Fun Tokenizer.

--add_tokenized

# creates the table msgs_tok
./dlatkInterface.py -d dla_tutorial -t msgs -c message_id --add_tokenized

mysql> select message from msgs_tok limit 1;
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| message                                                                                                                                                                                                                                                                                                                |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["can", "you", "believe", "it", "?", "?", "my", "mom", "wouln't", "let", "me", "go", "out", "on", "my", "b'day", "...", "i", "was", "really", "really", "mad", "at", "her", ".", "still", "am", ".", "but", "i", "got", "more", "presents", "from", "my", "friends", "this", "year", ".", "so", "thats", "great", "."] |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
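
The message column of msgs_tok holds the token list as a JSON-style array, as the output above shows. A minimal Python sketch for turning such a value back into a list, assuming the value has already been fetched from MySQL as a string (the variable names are illustrative and not part of DLATK):

```python
import json
from collections import Counter

# The first eight tokens of the row shown above, as they would arrive from MySQL.
stored_message = '["can", "you", "believe", "it", "?", "?", "my", "mom"]'

# Decode the stored array into a Python list of tokens.
tokens = json.loads(stored_message)

# Count how often each token occurs in this message.
token_counts = Counter(tokens)

print(len(tokens))
print(token_counts["?"])
```

The same approach should apply to msgs_tweettok, since its sample output uses the same array format.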

TweetNLP Tokenizer

Use Carnegie Mellon University's TweetNLP tokenizer.

--add_tweettok

# creates the table msgs_tweettok
./dlatkInterface.py -d dla_tutorial -t msgs -c message_id --add_tweettok

mysql> select message from msgs_tweettok limit 1;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| message                                                                                                                                                                                                                                                                                                            |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["can", "you", "believe", "it", "??", "my", "mom", "wouln't", "let", "me", "go", "out", "on", "my", "b'day", "...", "i", "was", "really", "really", "mad", "at", "her", ".", "still", "am", ".", "but", "i", "got", "more", "presents", "from", "my", "friends", "this", "year", ".", "so", "thats", "great", "."] |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Sentence Tokenization

--add_sent_tokenized

# creates the table msgs_stoks
./dlatkInterface.py -d dla_tutorial -t msgs -c message_id --add_sent_tokenized

mysql> select message_id, message from msgs_stoks limit 1;
+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| message_id | message                                                                                                                                                                                      |
+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|          1 | ["can you believe it??", "my mom wouln't let me go out on my b'day...i was really really mad at her.", "still am.", "but i got more presents from my friends this year.", "so thats great."] |
+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Or you can save each sentence as its own row in MySQL:

--add_sent_per_row

# creates the table msgs_sent
./dlatkInterface.py -d dla_tutorial -t msgs -c message_id --add_sent_per_row

mysql> select message_id, message from msgs_sent limit 5;
+------------+----------------------------------------------------------------------------+
| message_id | message                                                                    |
+------------+----------------------------------------------------------------------------+
| 1_01       | can you believe it??                                                       |
| 1_02       | my mom wouln't let me go out on my b'day...i was really really mad at her. |
| 1_03       | still am.                                                                  |
| 1_04       | but i got more presents from my friends this year.                         |
| 1_05       | so thats great.                                                            |
+------------+----------------------------------------------------------------------------+
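
The two sentence outputs carry the same information in different shapes. The sketch below converts a msgs_stoks-style value into msgs_sent-style rows and splits a sentence id back into its parts. The id format (message id, underscore, two-digit sentence number starting at 01) is an assumption inferred from the sample output above, not taken from DLATK's code, and both function names are hypothetical:

```python
import json


def sentence_rows(message_id, stored_sentences):
    """Yield (sentence_id, sentence) pairs in the style of msgs_sent."""
    sentences = json.loads(stored_sentences)
    # Assumption: ids look like "<message_id>_<two-digit, 1-based number>",
    # as in the sample rows 1_01, 1_02, ...
    for number, sentence in enumerate(sentences, start=1):
        yield f"{message_id}_{number:02d}", sentence


def split_sentence_id(sentence_id):
    """Recover (message_id, sentence_number) from an id such as '1_03'."""
    message_id, number = sentence_id.rsplit("_", 1)
    return message_id, int(number)


# A shortened msgs_stoks value using three of the sentences shown above.
stored = '["can you believe it??", "still am.", "so thats great."]'
rows = list(sentence_rows(1, stored))

for sentence_id, sentence in rows:
    print(sentence_id, sentence)
```

Splitting on the last underscore keeps the helper usable if the original message ids themselves contain underscores.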

Part of Speech Tagging

Stanford Parser

Use the Stanford Parser to create three tables:

msgs_const - a tree structure corresponding to the grammatical structure of the message
msgs_pos - a part of speech tagged version of the original message
msgs_dep - a list of dependencies that represent the grammatical relations between words in a sentence

Use the flag:

--add_parses

# creates the tables msgs_const, msgs_pos, msgs_dep
./dlatkInterface.py -d dla_tutorial -t msgs -c message_id --add_parses

mysql> select message from msgs_const limit 1;
+-------------------------------------------------------------------------------------------------+
| message                                                                                         |
+-------------------------------------------------------------------------------------------------+
| (ROOT (S (VP (VB urlLink) (NP (DT The) (NNP Obligatory) (NNP Field) (NNP Shot) (NN urlLink))))) |
+-------------------------------------------------------------------------------------------------+

mysql> select message from msgs_pos limit 1;
+----------------------------------------------------------------+
| message                                                        |
+----------------------------------------------------------------+
| urlLink/VB The/DT Obligatory/NNP Field/NNP Shot/NNP urlLink/NN |
+----------------------------------------------------------------+

mysql> select message from msgs_dep limit 1;
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| message                                                                                                                                                              |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ['root(ROOT-0, urlLink-1)', 'det(urlLink-6, The-2)', 'nn(urlLink-6, Obligatory-3)', 'nn(urlLink-6, Field-4)', 'nn(urlLink-6, Shot-5)', 'dobj(urlLink-1, urlLink-6)'] |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Note that msgs_pos uses the Penn Treebank Project part-of-speech tags.
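
To work with the parser output downstream, the stored strings need to be split back into their parts. The following is a rough sketch, assuming the formats are exactly as in the sample rows above: token/TAG pairs separated by spaces in msgs_pos, and a single-quoted Python-style list of relation(governor-index, dependent-index) strings in msgs_dep. The function names and the regular expression are illustrative and not part of DLATK:

```python
import ast
import re


def parse_pos(tagged):
    """Split 'token/TAG token/TAG ...' into (token, tag) pairs."""
    # Split on the last slash so tokens that contain '/' stay intact.
    return [tuple(item.rsplit("/", 1)) for item in tagged.split()]


# Matches strings of the form relation(governor-index, dependent-index).
DEP_PATTERN = re.compile(r"^([^(]+)\((.+)-(\d+), (.+)-(\d+)\)$")


def parse_dep(stored):
    """Turn the stored list into (relation, governor, dependent) triples."""
    # The sample uses single-quoted strings, so it is read as a Python
    # literal here rather than as JSON.
    triples = []
    for item in ast.literal_eval(stored):
        match = DEP_PATTERN.match(item)
        if match:  # entries in an unexpected format are skipped
            relation, governor, _, dependent, _ = match.groups()
            triples.append((relation, governor, dependent))
    return triples


pos = parse_pos("urlLink/VB The/DT Obligatory/NNP Field/NNP Shot/NNP urlLink/NN")
deps = parse_dep("['root(ROOT-0, urlLink-1)', 'det(urlLink-6, The-2)']")

print(pos)
print(deps)
```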

TweetNLP Part of Speech Tags

Use Carnegie Mellon University's TweetNLP part of speech tagger.

--add_tweetpos

# creates the table msgs_tweetpos
./dlatkInterface.py -d dla_tutorial -t msgs -c message_id --add_tweetpos

mysql> select message from msgs_tweetpos limit 1;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| message                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {"probs": ["0.9990", "0.9993", "0.9999", "0.9853", "0.9934", "0.9958", "0.9813", "0.9890", "0.9999", "0.9994", "0.9973", "0.7924", "0.9962", "0.9963", "0.9934", "0.9776", "0.9931", "0.9997", "0.9997", "0.9997", "0.9505", "0.9997", "0.8819", "0.9984", "0.9925", "0.9268", "0.9984", "0.9964", "0.9957", "0.9996", "0.6084", "0.5645", "0.9990", "0.9986", "0.9735", "0.9791", "0.9904", "0.9991", "0.5527", "0.9695", "0.9981", "0.9985"], "tags": ["V", "O", "V", "O", ",", "D", "N", "V", "V", "O", "V", "T", "P", "D", "N", ",", "O", "V", "R", "R", "A", "P", "O", ",", "R", "V", ",", "&", "O", "V", "A", "V", "P", "D", "N", "D", "N", ",", "P", "L", "A", ","], "tokens": ["can", "you", "believe", "it", "??", "my", "mom", "wouln't", "let", "me", "go", "out", "on", "my", "b'day", "...", "i", "was", "really", "really", "mad", "at", "her", ".", "still", "am", ".", "but", "i", "got", "more", "presents", "from", "my", "friends", "this", "year", ".", "so", "thats", "great", "."], "original": "can you believe it?? my mom wouln't let me go out on my b'day...i was really really mad at her. still am. but i got more presents from my friends this year. so thats great."} |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
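
Each msgs_tweetpos row is a JSON object with parallel probs, tags, and tokens lists plus the original text. One possible use, sketched below, is to line the lists up and pick out tokens whose tag probability falls below a threshold. The function name and the threshold are hypothetical, and the record is shortened to the first four tokens of the row above:

```python
import json


def low_confidence_tokens(stored, threshold=0.9):
    """Return (token, tag, prob) triples whose probability is below threshold."""
    record = json.loads(stored)
    # Probabilities are stored as strings in the sample output, so convert them.
    probs = [float(p) for p in record["probs"]]
    triples = zip(record["tokens"], record["tags"], probs)
    return [(tok, tag, p) for tok, tag, p in triples if p < threshold]


# Shortened record with the same keys as the row shown above.
stored = json.dumps({
    "probs": ["0.9990", "0.9993", "0.9999", "0.9853"],
    "tags": ["V", "O", "V", "O"],
    "tokens": ["can", "you", "believe", "it"],
    "original": "can you believe it",
})

uncertain = low_confidence_tokens(stored, threshold=0.99)
print(uncertain)
```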

Segmentation with the Stanford Segmenter

For Chinese text you can use the Stanford Segmenter. Use the --segmentation_model flag to change models: ctb (default, Penn Chinese Treebank) or pku (Peking University).

# creates the table msgs_seg using the Penn Chinese Treebank model (the default)
./dlatkInterface.py -d dla_tutorial -t msgs -c message_id --add_segmented

The sample output below uses a Chinese-language message table, msgs_ch, so the segmented table is msgs_ch_seg:

mysql> select message from msgs_ch limit 1;
+------------------------------------------------------------------------------+
| message                                                                      |
+------------------------------------------------------------------------------+
| [神马]欧洲站夏季女装雪纺短袖长裤女士运动时尚休闲套装女夏装2014新款  http://t.cn/RvCypCj |
+------------------------------------------------------------------------------+

mysql> select message from msgs_ch_seg limit 1;
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| message                                                                                                                                                                                                                                                                               |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["[\u795e\u9a6c]", "\u6b27\u6d32", "\u7ad9", "\u590f\u5b63", "\u5973\u88c5", "\u96ea\u7eba", "\u77ed\u8896", "\u957f\u88e4", "\u5973\u58eb", "\u8fd0\u52a8", "\u65f6\u5c1a", "\u4f11\u95f2", "\u5957\u88c5", "\u5973", "\u590f\u88c5", "2014", "\u65b0\u6b3e", "http://t.cn/RvCypCj"] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
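
The segmented message is stored with \u escapes, which makes it hard to read in the MySQL console. Assuming the column holds a JSON array, as the sample row suggests, decoding it in Python restores the original characters. The snippet below is a small sketch using the first three segments of the row above:

```python
import json

# The first three segments from the msgs_ch_seg row shown above.
# The raw string keeps the backslashes exactly as MySQL displays them.
stored = r'["[\u795e\u9a6c]", "\u6b27\u6d32", "\u7ad9"]'

# json.loads turns the \u escapes back into the original characters.
segments = json.loads(stored)

# Joining the segments gives back the start of the unsegmented message.
joined = "".join(segments)

# ensure_ascii=False keeps the characters readable when re-serializing.
readable = json.dumps(segments, ensure_ascii=False)

print(segments)
print(joined)
print(readable)
```

Joining the segments reproduces the beginning of the msgs_ch message, which is a quick way to check that segmentation only inserted boundaries and did not alter the text.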