.. _tut_clustering:

===========================
Clustering and Super Topics
===========================

Step 1 - Clustering
===================

In this step we will

#. Cluster features based on their distribution over some corpus
#. Output the reduced feature space to a lexicon

Note: This step describes a general clustering method. While the following commands cluster a topic-based feature table, other feature types can be clustered as well.

Setup:

* :doc:`../fwinterface/fwflag_d`: the database we are using
* :doc:`../fwinterface/fwflag_t`: the table inside the database where our text lives
* :doc:`../fwinterface/fwflag_c`: the table column we will be grouping the text by
* :doc:`../fwinterface/fwflag_group_freq_thresh`: ignore groups that do not contain a certain number of words

Clustering flags:

* :doc:`../fwinterface/fwflag_fit_reducer`: flag to initialize the clustering
* :doc:`../fwinterface/fwflag_reducer_to_lexicon`: writes the clustered features to a lexicon format
* :doc:`../fwinterface/fwflag_n_components`: specify the number of clusters
* :doc:`../fwinterface/fwflag_model`: specify the clustering algorithm

Step 1a: Fit Reducer
--------------------

This flag initializes the clustering process. Using :doc:`../fwinterface/fwflag_model` one can specify the following clustering algorithms:

* ``nmf`` - Non-negative Matrix Factorization (NMF)
* ``pca`` - (Principal Component Analysis) Linear dimensionality reduction using Singular Value Decomposition of the data, keeping only the most significant singular vectors to project the data to a lower dimensional space.
* ``sparsepca`` - (Sparse Principal Components Analysis) Finds the set of sparse components that can optimally reconstruct the data. The amount of sparseness is controllable by the coefficient of the L1 penalty.
* ``lda`` - (Linear Discriminant Analysis) A classifier with a linear decision boundary, generated by fitting class conditional densities to the data and using Bayes' rule.
* ``kmeans`` - K-Means clustering.
* ``dbscan`` - (Density-Based Spatial Clustering of Applications with Noise) Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density.
* ``spectral`` - Applies clustering to a projection of the normalized Laplacian. In practice spectral clustering is very useful when the structure of the individual clusters is highly non-convex, or more generally when a measure of the center and spread of the cluster is not a suitable description of the complete cluster, for instance when clusters are nested circles in the 2D plane.
* ``gmm`` - Gaussian Mixture Model.

We can also specify the number of components in the clustering with :doc:`../fwinterface/fwflag_n_components`.

**Sample Command**

.. code-block:: bash

    dlatkInterface.py -d dla_tutorial -t msgs -c user_id -f 'feat$cat_met_a30_2000_cp_w$msgs$user_id$16to16' --fit_reducer --model nmf --group_freq_thresh 500

    SQL QUERY: select distinct feat from feat$cat_met_a30_2000_cp_w$msgs$user_id$16to16
    Adding zeros to group norms (978 groups * 48 feats).
    [Applying StandardScaler to X: StandardScaler(copy=True, with_mean=True, with_std=True)]
    (N, features): (978, 48)
    [Doing clustering using : nmf]
      model: NMF(alpha=10, beta=1, eta=0.1, init='nndsvd', l1_ratio=0.95, max_iter=200,
        n_components=30, nls_max_iter=2000, random_state=42, shuffle=False, solver='cd',
        sparseness=None, tol=0.0001, verbose=0)
    --
    Interface Runtime: 1.06 seconds
    DLATK exits with success! A good day indeed  ¯\_(ツ)_/¯.

This command clusters the topics, using only users with 500 or more words. Note, however, that while we ran the clustering, we have not yet told DLATK what to do with the resulting clusters.
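As the log shows, DLATK hands a group-by-feature matrix of group norms to scikit-learn's ``NMF``. Below is a minimal Python sketch of the equivalent scikit-learn call; the random matrix is only a stand-in for a real group norm matrix, with shapes borrowed from the log above.

.. code-block:: python

    # Minimal sketch of what --fit_reducer does conceptually: fit a
    # scikit-learn reducer on a group-by-feature matrix of group norms.
    # The random matrix below is a stand-in, not real DLATK output.
    import numpy as np
    from sklearn.decomposition import NMF

    n_groups, n_feats = 978, 48        # shapes taken from the log above
    rng = np.random.RandomState(42)
    X = rng.rand(n_groups, n_feats)    # NMF requires a non-negative matrix

    model = NMF(n_components=30, init='nndsvd', random_state=42, max_iter=200)
    W = model.fit_transform(X)         # group-by-component scores: (978, 30)
    H = model.components_              # component-by-feature weights: (30, 48)
    print(W.shape, H.shape)

Roughly speaking, the ``components_`` matrix is the factorization that gets written to a lexicon in the next step.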
Step 1b: Reducer to lexicon
---------------------------

This flag takes the output from :doc:`../fwinterface/fwflag_fit_reducer` and saves the weights in a "lexicon".

**Sample Command**

.. code-block:: bash

    dlatkInterface.py -d dla_tutorial -t msgs -c user_id -f 'feat$cat_met_a30_2000_cp_w$msgs$user_id$16to16' --fit_reducer --model nmf --reducer_to_lexicon msgs_reduced10_nmf --n_components 10

This command produces the table *msgs_reduced10_nmf* in the database *dlatk_lexica*. Here the term column contains topic ids and the category column contains the reduced component number. The weights are entries in the *m x n* factorization matrix produced by the clustering method, where *m* is the number of features in your feature table and *n* is the number of components specified by :doc:`../fwinterface/fwflag_n_components`.

.. code-block:: mysql

    mysql> select * from dlatk_lexica.msgs_reduced10_nmf limit 10;
    +----+------+----------+----------------+
    | id | term | category | weight         |
    +----+------+----------+----------------+
    |  1 | 272  | RFEAT7   | 8.73150685503  |
    |  2 | 101  | RFEAT7   | 0.141548621844 |
    |  3 | 278  | RFEAT1   | 3.56542413757  |
    |  4 | 346  | RFEAT1   | 0.28171112431  |
    |  5 | 290  | RFEAT1   | 0.731260898487 |
    |  6 | 349  | RFEAT1   | 7.81170807542  |
    |  7 | 276  | RFEAT1   | 6.14425018597  |
    |  8 | 1781 | RFEAT1   | 1.26721023671  |
    |  9 | 107  | RFEAT1   | 1.29672280401  |
    | 10 | 344  | RFEAT8   | 0.159080830815 |
    +----+------+----------+----------------+

Step 2: Super Topics
====================

This step unrolls the reduced lexicon table to the word level, creating a new topic set. We refer to the reduced lexicon table produced in Step 1b using the :doc:`../fwinterface/fwflag_reduced_lexicon` flag. We will use two additional flags:

* :doc:`../fwinterface/fwflag_reduced_lexicon`
* :doc:`../fwinterface/fwflag_super_topics`

Note: For this step you need to first cluster a topic-based feature table. Super topics do not make sense for other features such as 1to3grams.

**Sample Command**

.. code-block:: bash

    dlatkInterface.py -d dla_tutorial -t msgs -c user_id --reduced_lexicon msgs_reduced10_nmf --super_topics msgs_10nmf_fbcp -l met_a30_2000_cp

This produces the table *msgs_10nmf_fbcp* in *dlatk_lexica*:

.. code-block:: mysql

    mysql> select * from dlatk_lexica.msgs_10nmf_fbcp limit 10;
    +----+--------------+----------+------------------------+
    | id | term         | category | weight                 |
    +----+--------------+----------+------------------------+
    |  1 | 8)           | RFEAT7   | 0.00006541063856000905 |
    |  2 | 8d           | RFEAT7   | 0.8575587089762855     |
    |  3 | :d           | RFEAT7   | 1.7992666876157764     |
    |  4 | :p           | RFEAT7   | 0.020210781774380928   |
    |  5 | ;d           | RFEAT7   | 1.5833280925503392     |
    |  6 | <3           | RFEAT7   | 0.010472837603757787   |
    |  7 | >:           | RFEAT7   | 0.0003415147203434118  |
    |  8 | >:d          | RFEAT7   | 1.8820983398248488     |
    |  9 | accident     | RFEAT7   | 0.0026496137794309827  |
    | 10 | accidentally | RFEAT7   | 0.009535178394562414   |
    +----+--------------+----------+------------------------+

Under the hood
--------------

Super topic creation runs a MySQL command of the following form, where *super* stands for the reduced lexicon table, *orig* for the original topic lexicon, and the created table is named after the :doc:`../fwinterface/fwflag_super_topics` argument:

.. code-block:: mysql

    CREATE TABLE <super_topic_table> AS
    SELECT super.category, orig.term, SUM(super.weight * orig.weight) AS weight
    FROM super, orig
    WHERE super.term = orig.category
    GROUP BY super.category, orig.term;
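The join on ``super.term = orig.category`` lines the two lexica up so that the SUM is, in effect, a matrix product: reduced-lexicon weights (components x topics) times topic-lexicon weights (topics x words). A toy numpy illustration, with made-up weights rather than values from the tables above:

.. code-block:: python

    # Toy illustration of the SQL above: super topic word weights are the
    # product of the reduced lexicon matrix and the original topic lexicon
    # matrix.  All numbers here are made up for illustration.
    import numpy as np

    # reduced lexicon: 2 components x 3 topics
    super_w = np.array([[8.73, 0.14, 0.00],
                        [0.00, 3.57, 0.28]])

    # original topic lexicon: 3 topics x 2 words
    orig_w = np.array([[0.02, 0.00],
                       [0.01, 0.05],
                       [0.00, 0.03]])

    # word_weights[c, w] = sum over topics t of super_w[c, t] * orig_w[t, w],
    # matching SUM(super.weight * orig.weight) grouped by (category, term)
    word_weights = super_w @ orig_w
    print(word_weights)    # 2 components x 2 words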
Everything in a single command:

.. code-block:: bash

    dlatkInterface.py -d dla_tutorial -t msgs -c user_id -f 'feat$cat_met_a30_2000_cp_w$msgs$user_id$16to16' \
        --fit_reducer --model nmf --reducer_to_lexicon msgs_reduced10_nmf --n_components 10 \
        --super_topics msgs_10nmf_fbcp -l met_a30_2000_cp

Step 3: Using your super topics
===============================

Now that we have the super topic table, we can extract features over our corpus:

.. code-block:: bash

    dlatkInterface.py -d dla_tutorial -t msgs -c user_id --add_lex_table -l msgs_10nmf_fbcp --weighted_lexicon

This command produces the following feature table:

.. code-block:: mysql

    mysql> select * from feat$cat_msgs_10nmf_fbcp_w$msgs$user_id$16to16 limit 10;
    +----+----------+--------+-------+-------------------------+
    | id | group_id | feat   | value | group_norm              |
    +----+----------+--------+-------+-------------------------+
    |  1 | 28451    | RFEAT9 |     5 | 0.000000921854579604825 |
    |  2 | 28451    | RFEAT1 |    36 | 0.00467374305816601     |
    |  3 | 28451    | RFEAT8 |    12 | 0.0000368626647544864   |
    |  4 | 28451    | RFEAT4 |    18 | 0.00112652429225695     |
    |  5 | 28451    | RFEAT0 |   203 | 0.0366415948096726      |
    |  6 | 28451    | RFEAT2 |    17 | 0.000283381640611496    |
    |  7 | 28451    | RFEAT7 |     8 | 0.00000232305243540397  |
    |  8 | 28451    | RFEAT6 |     7 | 0.000518033494791095    |
    |  9 | 28451    | RFEAT3 |    64 | 0.00532112905379904     |
    | 10 | 28451    | RFEAT5 |    15 | 0.000161019216334877    |
    +----+----------+--------+-------+-------------------------+

We can also create a set of super topics whose weights are based on the *freq_t50ll* (log likelihood) version of the topic lexicon and use them for printing wordclouds:

.. code-block:: bash

    # create the log likelihood version of the super topics
    # creates msgs_10nmf_fbll in dlatk_lexica
    dlatkInterface.py -d dla_tutorial -t msgs -c user_id --reduced_lexicon msgs_reduced10_nmf --super_topics msgs_10nmf_fbll -l met_a30_2000_freq_t50ll

    # print wordclouds using all of the above
    dlatkInterface.py -d dla_tutorial -t msgs -c user_id -f 'feat$cat_msgs_10nmf_fbcp_w$msgs$user_id$16to16' \
        --outcome_table blog_outcomes --group_freq_thresh 500 --outcomes age gender \
        --output_name supertopic_output \
        --topic_tagcloud --make_topic_wordcloud --topic_lexicon msgs_10nmf_fbll \
        --tagcloud_colorscheme bluered
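The extracted super topic features can also be inspected outside of DLATK. Below is a minimal pandas sketch for loading the feature table into a user-by-component matrix; the connection string is a placeholder for your own MySQL credentials and host.

.. code-block:: python

    # Minimal sketch: load the super topic feature table into pandas for
    # ad-hoc analysis.  The connection string is a placeholder for your
    # own MySQL credentials and host.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine('mysql+pymysql://user:password@localhost/dla_tutorial')

    # backquotes are required because the table name contains '$'
    df = pd.read_sql(
        'SELECT group_id, feat, group_norm '
        'FROM `feat$cat_msgs_10nmf_fbcp_w$msgs$user_id$16to16`',
        engine)

    # pivot into a user-by-super-topic matrix of group norms
    user_topics = df.pivot(index='group_id', columns='feat', values='group_norm').fillna(0)
    print(user_topics.head())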