21.5.14

WEKA Text Mining Trick: Copying Options from the Explorer to the Command Line

From previous posts (especially Command Line Functions for Text Mining in WEKA), you may know that writing command-line calls to WEKA can be far from trivial, mostly because you may need to nest FilteredClassifier, MultiFilter, StringToWordVector, AttributeSelection and a classifier into a single command with plenty of options, including nested strings with escaped characters.
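To see where the piles of backslashes come from, here is a minimal Python sketch of the nesting. It is not WEKA code: the `quote` helper is hypothetical, and it escapes only backslashes and double quotes, whereas the WEKA output shown later in this post also escapes characters such as `'` and `%`.

```python
def quote(spec: str) -> str:
    """Wrap an option string in double quotes for one more nesting level.

    Backslashes are doubled first and double quotes are escaped afterwards,
    so quoting added by inner levels survives the extra level.
    (Hypothetical helper, simplified with respect to what WEKA does.)
    """
    escaped = spec.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


# Innermost level: the search method and its options.
ranker = "weka.attributeSelection.Ranker -T 0.0"

# Each enclosing class receives the inner specification as one quoted option.
attsel = "weka.filters.supervised.attribute.AttributeSelection -S " + quote(ranker)
multi = "weka.filters.MultiFilter -F " + quote(attsel)
command = "weka.classifiers.meta.FilteredClassifier -F " + quote(multi)

print(command)
```

Under these assumptions, the quote around the Ranker specification carries 0, 1 and then 3 backslashes as it is wrapped once, twice and three times; one level deeper it would carry 7. Each extra level turns n backslashes into 2n + 1.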

For instance, consider the following need: I want to test the classifier J48 over the smsspam.small.arff file, which contains one {class,text} pair per line. However, I want to:

  • Apply StringToWordVector with specific options: lowercased tokens, specific string delimiters, etc.
  • Get only those words with Information Gain over zero, which implies using the filter AttributeSelection with InfoGainAttributeEval and Ranker with threshold 0.0.
  • Make use of 10-fold cross validation, which implies using FilteredClassifier; and since I have two filters (StringToWordVector and AttributeSelection), I need to make use of MultiFilter as well.

With some experience, this is not too hard to do by hand. However, it is much easier to configure your test in the WEKA Explorer, make a quick test with a very small subset of your dataset, then copy the configuration to a text file and edit it to fully fit your needs. For this specific example, I start by loading the dataset in the Preprocess tab, and then I configure the classifier by:

  1. Choosing FilteredClassifier, and J48 as the classifier.
  2. Choosing MultiFilter as the filter, then deleting the default AllFilter and adding StringToWordVector and AttributeSelection filters to it.
  3. Editing the StringToWordVector filter to lowercase the tokens, not operate per class, and use my list of delimiters.
  4. Editing the AttributeSelection filter to choose InfoGainAttributeEval as the evaluator, and Ranker with threshold 0.0 as the search method.

I show a picture in the middle of the process, just when editing the StringToWordVector filter:

Then you can specify spamclass as the class and run it to get something like:

=== Run information ===
Scheme: weka.classifiers.meta.FilteredClassifier -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W 100000 -prune-rate -1.0 -N 0 -L -stemmer weka.core.stemmers.NullStemmer -M 1 -O -tokenizer \\\"weka.core.tokenizers.WordTokenizer -delimiters \\\\\\\" \\\\\\\\r \\\\\\\\t.,;:\\\\\\\\\\\\\\\'\\\\\\\\\\\\\\\"()?!\\\\\\\\\\\\\\\%-/<>#@+*£&\\\\\\\"\\\"\" -F \"weka.filters.supervised.attribute.AttributeSelection -E \\\"weka.attributeSelection.InfoGainAttributeEval \\\" -S \\\"weka.attributeSelection.Ranker -T 0.0 -N -1\\\"\"" -W weka.classifiers.trees.J48 -- -C 0.25 -M 2

Relation: sms_test
Instances: 200
Attributes: 2 spamclass text
Test mode: 10-fold cross-validation
(../..)
=== Confusion Matrix ===
   a   b   <-- classified as
  16  17 |   a = spam
   6 161 |   b = ham
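As a quick sanity check on that output, the following sketch derives a few figures from the confusion matrix alone. The metric choices (accuracy, and recall and precision for the spam class) are mine, added only to help read the matrix.

```python
# Confusion matrix copied from the run above: rows are actual classes,
# columns are predicted classes, both in the order (spam, ham).
matrix = [[16, 17],
          [6, 161]]

total = sum(sum(row) for row in matrix)
accuracy = (matrix[0][0] + matrix[1][1]) / total
# Spam recall: correctly detected spam over all actual spam (first row).
spam_recall = matrix[0][0] / sum(matrix[0])
# Spam precision: correctly detected spam over everything labelled spam (first column).
spam_precision = matrix[0][0] / (matrix[0][0] + matrix[1][0])

print(f"instances:      {total}")
print(f"accuracy:       {accuracy:.3f}")
print(f"spam recall:    {spam_recall:.3f}")
print(f"spam precision: {spam_precision:.3f}")
```

The total of 200 agrees with the Instances line of the run information.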

As you can see, the Scheme line gives us the exact command options we need to get that result! You can just copy and edit it (after saving the result buffer) to get what you want. Alternatively, you can right-click on the command in the Explorer, as in the following picture:

In any case, you get the following messy thing:

weka.classifiers.meta.FilteredClassifier -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W 100000 -prune-rate -1.0 -N 0 -L -stemmer weka.core.stemmers.NullStemmer -M 1 -O -tokenizer \\\"weka.core.tokenizers.WordTokenizer -delimiters \\\\\\\" \\\\\\\\r \\\\\\\\t.,;:\\\\\\\\\\\\\\\'\\\\\\\\\\\\\\\"()?!\\\\\\\\\\\\\\\%-/<>#@+*£&\\\\\\\"\\\"\" -F \"weka.filters.supervised.attribute.AttributeSelection -E \\\"weka.attributeSelection.InfoGainAttributeEval \\\" -S \\\"weka.attributeSelection.Ranker -T 0.0 -N -1\\\"\"" -W weka.classifiers.trees.J48 -- -C 0.25 -M 2

Then you can strip the options you do not need. For instance, some default options in StringToWordVector are -R first-last, -prune-rate -1.0, -N 0, the stemmer, etc. You can find the default options by issuing the help command:

$>java weka.filters.unsupervised.attribute.StringToWordVector -h
Help requested.

Filter options:
-C
Output word counts rather than boolean word presence.
-R <index1,index2-index4,...>
Specify list of string attributes to convert to words (as weka Range).
(default: select all string attributes)
...
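The stripping itself can be sketched in a few lines of Python. This is only an illustration: the `strip_defaults` helper is hypothetical, the defaults table is filled in by hand from the options named above (with -M 1 assumed to be a default as well), and the -tokenizer option is left out so that plain whitespace splitting is enough.

```python
# Flag/value pairs assumed to be StringToWordVector defaults; check them
# against the -h output of your own WEKA version before relying on them.
DEFAULTS = {
    "-R": "first-last",
    "-prune-rate": "-1.0",
    "-N": "0",
    "-stemmer": "weka.core.stemmers.NullStemmer",
    "-M": "1",
}


def strip_defaults(tokens, defaults):
    """Drop every flag that is immediately followed by its default value."""
    kept = []
    i = 0
    while i < len(tokens):
        flag = tokens[i]
        value = tokens[i + 1] if i + 1 < len(tokens) else None
        if flag in defaults and value == defaults[flag]:
            i += 2  # skip both the flag and its default value
        else:
            kept.append(flag)
            i += 1
    return kept


# StringToWordVector options from the Scheme line, without the tokenizer.
options = ("-R first-last -W 100000 -prune-rate -1.0 -N 0 -L "
           "-stemmer weka.core.stemmers.NullStemmer -M 1 -O")

print(" ".join(strip_defaults(options.split(), DEFAULTS)))
```

What remains is -W 100000 -L -O, the non-default part of the filter configuration.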

So after cleaning the default options (in all filters and the classifier), adding the dataset file and the class index (-t smsspam.small.arff -c 1), and with some pretty printing for clarification, you can easily build the following command:

java weka.classifiers.meta.FilteredClassifier
-c 1
-t smsspam.small.arff
-F "weka.filters.MultiFilter
-F \"weka.filters.unsupervised.attribute.StringToWordVector
-W 100000
-L
-O
-tokenizer \\\"weka.core.tokenizers.WordTokenizer
-delimiters \\\\\\\" \\\\\\\\r \\\\\\\\t.,;:\\\\\\\\\\\\\\\'\\\\\\\\\\\\\\\"()?!\\\\\\\\\\\\\\\%-/<>#@+*£&\\\\\\\"\\\"\"
-F \"weka.filters.supervised.attribute.AttributeSelection
-E \\\"weka.attributeSelection.InfoGainAttributeEval \\\"
-S \\\"weka.attributeSelection.Ranker -T 0.0 \\\"\""
-W weka.classifiers.trees.J48

So now you can change other parameters if you want, in order to test other text representations, classifiers, etc., without dealing with escaping the options, delimiters, etc.
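A pretty-printed command like the one above is meant for reading. The sketch below, which assumes you want a single line before running it, joins the lines and then uses Python's shlex module as an approximation of POSIX shell word splitting to check how the nesting is consumed. The -tokenizer option is left out to keep the example readable.

```python
import shlex

# Pretty-printed command modelled on the one above, without the tokenizer.
PRETTY = r'''java weka.classifiers.meta.FilteredClassifier
-c 1
-t smsspam.small.arff
-F "weka.filters.MultiFilter
-F \"weka.filters.unsupervised.attribute.StringToWordVector
-W 100000
-L
-O\"
-F \"weka.filters.supervised.attribute.AttributeSelection
-E \\\"weka.attributeSelection.InfoGainAttributeEval \\\"
-S \\\"weka.attributeSelection.Ranker -T 0.0 \\\"\""
-W weka.classifiers.trees.J48'''

# Join the lines with single spaces to get a one-line command.
one_line = " ".join(line.strip() for line in PRETTY.splitlines())

# shlex only approximates what a POSIX shell does with quotes and backslashes.
argv = shlex.split(one_line)

print(len(argv))  # number of words the program would receive
print(argv[7])    # the whole MultiFilter specification, as one argument
```

In this approximation the outer double quotes disappear and every inner quote loses one level of escaping, so the whole MultiFilter specification reaches the program as a single argument. That is consistent with the Scheme line being usable as it is on the command line.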