.. _fwflag_add_segmented:

===============
--add_segmented
===============

Switch
======

--add_segmented

Description
===========

Creates a word-segmented version of the message table (Chinese text only).

Argument and Default Value
==========================

None

Details
=======

This will create a table called TABLE_seg (where TABLE is the table specified by :doc:`fwflag_t`) in the database specified by :doc:`fwflag_d`. The message column in the new table holds a list of segmented words. Note that word segmentation is only meaningful for Chinese messages. Choose the segmentation model with :doc:`fwflag_segmentation_model`.

Once the segmented table exists, use :doc:`fwflag_add_ngrams_from_tokenized` to extract ngrams from it.

How it works: the infrastructure writes the (message_id, message) pairs to a temp file, runs the segmentor from the command line (via os.system), and writes the segmented messages to a second temp file. The segmentor introduces some artifacts (it incorrectly splits long numbers and URLs), so the Python code repairs these. Weibo by default renders emoji as '[emoji_label_word]', which the segmentor splits apart, so the Python code joins those pieces back together. A sketch of this pipeline is given at the end of this page.

Example on one message.

Original message:

.. code-block:: bash

    [神马]欧洲站夏季女装雪纺短袖长裤女士运动时尚休闲套装女夏装2014新款 http://t.cn/RvCypCj

will be turned into:

.. code-block:: bash

    ["[\u795e\u9a6c]", "\u6b27\u6d32", "\u7ad9", "\u590f\u5b63", "\u5973\u88c5", "\u96ea\u7eba", "\u77ed\u8896", "\u957f\u88e4", "\u5973\u58eb", "\u8fd0\u52a8", "\u65f6\u5c1a", "\u4f11\u95f2", "\u5957\u88c5", "\u5973", "\u590f\u88c5", "2014", "\u65b0\u6b3e", "http://t.cn/RvCypCj"]

Other Switches
==============

Required Switches:

* :doc:`fwflag_d`, :doc:`fwflag_c`, :doc:`fwflag_t`

Optional Switches:

* :doc:`fwflag_segmentation_model`

Example Commands
================

.. code-block:: bash

    # creates the table msgs_seg via the Penn Chinese Treebank
    ./dlatkInterface.py -d dla_tutorial -t msgs -c message_id --add_segmented
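
To make the "How it works" description above concrete, here is a minimal Python sketch of the same pipeline. This is not DLATK's actual implementation: the tab-separated temp-file format, the ``segmenter_cmd`` argument, and the helper ``rejoin_bracketed`` are assumptions for illustration, and the repair of split-up long numbers and URLs is omitted.

.. code-block:: python

    import os
    import tempfile

    def segment_messages(rows, segmenter_cmd):
        """Sketch: rows is an iterable of (message_id, message) pairs;
        segmenter_cmd is the external segmentor command (hypothetical)."""
        # 1. Write the (message_id, message) pairs to a temp file.
        with tempfile.NamedTemporaryFile("w", suffix=".txt",
                                         delete=False, encoding="utf-8") as fin:
            for msg_id, msg in rows:
                fin.write("%s\t%s\n" % (msg_id, msg))
            in_path = fin.name
        out_path = in_path + ".seg"

        # 2. Run the segmentor from the command line, as DLATK does via os.system.
        os.system("%s < %s > %s" % (segmenter_cmd, in_path, out_path))

        # 3. Read the segmented messages back and repair segmentor artifacts.
        #    (The real code also fixes split-up long numbers and URLs; omitted here.)
        with open(out_path, encoding="utf-8") as fout:
            for line in fout:
                msg_id, _, seg = line.rstrip("\n").partition("\t")
                yield msg_id, rejoin_bracketed(seg.split())

    def rejoin_bracketed(tokens):
        """Re-join token runs like '[', '神马', ']' that the segmentor split
        apart; Weibo renders emoji as '[label]' literals."""
        out, buf = [], []
        for tok in tokens:
            if buf or tok.startswith("["):
                buf.append(tok)
                if tok.endswith("]"):
                    out.append("".join(buf))
                    buf = []
            else:
                out.append(tok)
        out.extend(buf)  # an unclosed '[' run is kept as separate tokens
        return out

The re-joining walks the token stream rather than the raw string, since by this point the segmentor has already inserted spaces inside the bracketed emoji label and only the token boundaries remain to work with.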