A Simple Text Classifier in Java with WEKA

In previous posts [1, 2, 3], I have shown how to make use of the WEKA classes FilteredClassifier and MultiFilter in order to properly build and evaluate a text classifier using WEKA. For this purpose, I have made use of the Explorer GUI provided by WEKA, and its command-line interface.

In my opinion, it is a good idea to get familiar with both the Explorer and the command-line interface if you want to get a feel for the amazing power of this data mining library. However, where you can take full advantage of its power is in your own Java programs. Now it is time to deal with that.

Following Salton, and Belkin and Croft, the process of text classification involves two main steps:

  • Representing your text database so that a classifier can learn from it, and training a classifier on that representation.
  • Using the classifier to predict text labels of new, unseen documents.

The first step is a batch process, in the sense that you can do it periodically (as long as your labelled data set improves with time: bigger sizes, new labels or categories, corrected predictions via user feedback). The second step is the moment in which you take advantage of the knowledge distilled by the learning process, and it is online in the sense that it is done on demand (when new documents arrive). This distinction is conceptual: many modern text classifiers retrain on new documents as soon as they receive them, in order to keep or improve accuracy over time.

Consequently, to demonstrate the text classification process we need two programs: one to learn from the text dataset, and another to use the learnt model to classify new documents. Let us start by showing a very simple text learner in Java, using WEKA. The class is named MyFilteredLearner.java, and its main() method demonstrates its usage, which involves:

  1. Loading the text dataset.
  2. Evaluating the classifier.
  3. Training the classifier.
  4. Storing the classifier.

The most interesting parts of the process are:

  • We read the dataset by simply using the method getData() of an ArffReader object that wraps a BufferedReader.
  • We programmatically create the classifier by combining a StringToWordVector filter (in order to represent the texts as feature vectors) and a NaiveBayes classifier (for learning), using the FilteredClassifier class discussed in previous posts.
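To make the idea of representing texts as feature vectors a bit more concrete, here is a minimal sketch in plain Java (no WEKA needed) that turns a message into a bag of words, that is, a map from each word to its count. It is only an illustration: the class BagOfWordsSketch and its tokenization rule (lowercase, then split on anything that is not a letter a-z) are my own assumptions, and StringToWordVector is far more configurable than this.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not a WEKA class): turns a text into a word -> count map. */
final class BagOfWordsSketch {

    /** Lowercases the text, splits it on non-letter characters, and counts each word. */
    static Map<String, Integer> vectorize(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {                  // split() may yield a leading empty token
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Each distinct word becomes one feature; its count is the feature value.
        System.out.println(vectorize("Free prize, FREE entry"));
    }
}
```

In this toy input the word "free" appears twice, so it gets the value 2, while "prize" and "entry" get 1. A learner such as Naive Bayes then works on these per-word values instead of on the raw string.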

The process of creating the classifier is demonstrated in the next code snippet:

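trainData.setClassIndex(0);                  // the class is the first attribute
filter = new StringToWordVector();
filter.setAttributeIndices("last");          // transform the last attribute (the text)
classifier = new FilteredClassifier();
classifier.setFilter(filter);                // apply the filter first...
classifier.setClassifier(new NaiveBayes());  // ...then learn with Naive Bayes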

So we set the class of the dataset to be the first attribute, then we create the filter and set the attribute to be transformed from text into a feature vector (the last one), and then we create the FilteredClassifier object and add the previous filter and a new NaiveBayes classifier to it. Given these settings, the dataset has to have the class as the first attribute and the text as the second (and last) one, as in my usual example, the SMS spam subset (smsspam.small.arff).

You can compile and run this class with the following commands, which produce the output shown:

$>javac MyFilteredLearner.java
$>java MyFilteredLearner smsspam.small.arff myClassifier.dat
===== Loaded dataset: smsspam.small.arff =====

Correctly Classified Instances 187 93.5 %
Incorrectly Classified Instances 13 6.5 %
Kappa statistic 0.7277
Mean absolute error 0.0721
Root mean squared error 0.2568
Relative absolute error 25.8792 %
Root relative squared error 69.1763 %
Coverage of cases (0.95 level) 94 %
Mean rel. region size (0.95 level) 51.75 %
Total Number of Instances 200

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0,636 0,006 0,955 0,636 0,764 0,748 0,943 0,858 spam
0,994 0,364 0,933 0,994 0,962 0,748 0,943 0,986 ham
Weighted Avg. 0,935 0,305 0,936 0,935 0,930 0,748 0,943 0,965
===== Evaluating on filtered (training) dataset done =====
===== Training on filtered (training) dataset done =====
===== Saved model: myClassifier.dat =====
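The headline figures in this report can be checked by hand. The sketch below, in plain Java, computes the accuracy and the Kappa statistic from a confusion matrix. The matrix itself is not printed in the run above: I have reconstructed it from the TP rates (0,636 for spam and 0,994 for ham), assuming that smsspam.small.arff contains 33 spam and 167 ham messages. The class ConfusionMatrixSketch is a hypothetical helper, not part of WEKA.

```java
/** Hypothetical helper (not a WEKA class): accuracy and Cohen's kappa from a confusion matrix. */
final class ConfusionMatrixSketch {

    /** Fraction of instances on the diagonal (rows = actual class, columns = predicted class). */
    static double accuracy(int[][] m) {
        long correct = 0;
        long total = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                total += m[i][j];
                if (i == j) {
                    correct += m[i][j];
                }
            }
        }
        return (double) correct / total;
    }

    /** Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement). */
    static double kappa(int[][] m) {
        int n = m.length;
        long total = 0;
        long[] rowSum = new long[n];
        long[] colSum = new long[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                rowSum[i] += m[i][j];
                colSum[j] += m[i][j];
                total += m[i][j];
            }
        }
        double expected = 0.0;
        for (int k = 0; k < n; k++) {
            expected += (double) rowSum[k] * colSum[k];
        }
        expected /= (double) total * total;
        return (accuracy(m) - expected) / (1.0 - expected);
    }

    public static void main(String[] args) {
        // Reconstructed counts (an inference, see the text): row 0 = actual spam, row 1 = actual ham.
        int[][] matrix = { {21, 12}, {1, 166} };
        System.out.println("accuracy = " + accuracy(matrix));
        System.out.println("kappa    = " + kappa(matrix));
    }
}
```

With these counts, 21 + 166 = 187 instances fall on the diagonal, which matches the 93.5 % of correctly classified instances, and the kappa value rounds to the 0.7277 reported above. The 12 spam messages predicted as ham also explain the low recall for the spam class.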

The evaluation has been performed with default values except for the number of folds, which has been set to 4 as shown in the next code snippet:

Evaluation eval = new Evaluation(trainData);
eval.crossValidateModel(classifier, trainData, 4, new Random(1));
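If cross-validation is new to you, the following plain Java sketch shows what splitting 200 instances into 4 folds means: in each round one fold is held out for testing and the other three are used for training. This is an illustration under my own assumptions (the class KFoldSketch, a simple seeded shuffle, round-robin assignment) and not WEKA's implementation, which may, for example, also balance the classes across folds.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper (not a WEKA class): assigns instance indices to k folds. */
final class KFoldSketch {

    /** Returns k lists of indices; fold i is the test set of round i, the rest is training data. */
    static List<List<Integer>> split(int numInstances, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < numInstances; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));   // seeded, so the split is repeatable
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        for (int pos = 0; pos < indices.size(); pos++) {
            folds.get(pos % k).add(indices.get(pos));      // deal the indices out round-robin
        }
        return folds;
    }

    public static void main(String[] args) {
        List<List<Integer>> folds = split(200, 4, 1L);
        for (int round = 0; round < folds.size(); round++) {
            int testSize = folds.get(round).size();
            System.out.println("round " + round + ": test on " + testSize
                    + " instances, train on " + (200 - testSize));
        }
    }
}
```

With 200 instances and 4 folds, every round tests on 50 instances and trains on the remaining 150, and each instance is used for testing exactly once. That is why the report above still totals 200 instances.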

If you do not want to evaluate the classifier on the training data, you can omit the call to the evaluate() method.

Now let us deal with the classification program, which is far more complex, but only because of the process of creating an instance. The class is named MyFilteredClassifier.java, and its main() method demonstrates its usage, which involves:

  1. Reading the text to be classified from a file.
  2. Reading the model or classifier from a file.
  3. Creating the instance.
  4. Classifying it.
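Step 2 boils down to reading a serialized Java object and casting it to the type the program expects. The sketch below shows that round trip with plain Java serialization, assuming the model file is written that way. WrappedModel and BareModel are hypothetical stand-ins that I have made up for "a classifier wrapped together with its filter" and "a bare classifier"; they are not WEKA classes, and the bytes are kept in memory instead of a file to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Hypothetical sketch of saving and loading a model object; none of these are WEKA classes. */
final class SerializationSketch {

    /** Stand-in for a model that bundles a filter and a classifier. */
    static class WrappedModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final String description;
        WrappedModel(String description) { this.description = description; }
    }

    /** Stand-in for a model saved without its filter. */
    static class BareModel implements Serializable {
        private static final long serialVersionUID = 1L;
    }

    /** Serializes the object to bytes, as a file-based writer would do to disk. */
    static byte[] save(Serializable model) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(model);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the object back and checks its type before casting. */
    static WrappedModel loadWrapped(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            Object obj = in.readObject();
            if (!(obj instanceof WrappedModel)) {
                throw new IllegalArgumentException(
                        "Expected a WrappedModel but found " + obj.getClass().getName());
            }
            return (WrappedModel) obj;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        WrappedModel restored = loadWrapped(save(new WrappedModel("filter + classifier")));
        System.out.println(restored.description);
    }
}
```

The explicit type check is a design choice: if the stored object is of a different class than the one the program casts to, you get a message naming the class that was actually found instead of a bare ClassCastException.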

Creating the instance is performed in the makeInstance() method, and its code is the following one:

// Create the attributes, class and text
FastVector fvNominalVal = new FastVector(2);
fvNominalVal.addElement("spam");
fvNominalVal.addElement("ham");
Attribute attribute1 = new Attribute("class", fvNominalVal);
Attribute attribute2 = new Attribute("text", (FastVector) null);
// Create list of instances with one element
FastVector fvWekaAttributes = new FastVector(2);
fvWekaAttributes.addElement(attribute1);
fvWekaAttributes.addElement(attribute2);
instances = new Instances("Test relation", fvWekaAttributes, 1);
// Set class index
instances.setClassIndex(0);
// Create and add the instance
DenseInstance instance = new DenseInstance(2);
instance.setValue(attribute2, text);
// Another way to do it:
// instance.setValue((Attribute)fvWekaAttributes.elementAt(1), text);
instances.add(instance);

The classifier learnt with MyFilteredLearner.java expects an instance to have two attributes: the first one is the class, a nominal attribute with values "spam" or "ham"; the second one is a String, which is the text to be classified. Instead of creating a single instance, we create a whole new dataset whose first instance is the one that we want to classify. This is required in order to let the classifier know the schema of the dataset, which is stored in the Instances object (and not in each instance).

So first we create the attributes by using the FastVector class provided by WEKA. The case of the nominal attribute ("class") is relatively simple, but the case of the String one is a bit more complex because it requires the second argument of the constructor to be null, but cast to FastVector. Then we create an Instances object by using a FastVector to store the two previous attributes, and set the class index to 0 (which means that the first attribute will be the class). As a note, the FastVector class is deprecated in the WEKA development version.

The last step is to create an actual instance. I am using the WEKA development version in this code (as of the date of this post), so we have to use a DenseInstance object. However, if you use the stable version, you have to use Instance instead (see the stable version documentation), and change this code to:

Instance instance = new Instance(2);

As a note, I have left a different way of setting the value of the second attribute commented out in the code. I must also note that we do not set the value of the first attribute, as it is unknown.

The rest of the methods are (more or less) straightforward if you follow the documentation (weka - Programmatic Use, and weka - Use WEKA in your Java code). You get the class prediction on your text with the following lines:

double pred = classifier.classifyInstance(instances.instance(0));
System.out.println("Class predicted: " + instances.classAttribute().value((int) pred));
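The cast to int in the snippet above is needed because classifyInstance() returns the index of the predicted class value as a double. The following plain Java sketch mimics that lookup: it picks the most probable entry of a per-class probability array and maps it back to a label. The class ArgMaxSketch, the labels array and the probabilities are made up for illustration; they are not output of the real classifier.

```java
/** Hypothetical helper (not a WEKA class): picks the label with the highest probability. */
final class ArgMaxSketch {

    /** Returns the index of the largest value; ties go to the lowest index. */
    static int argMax(double[] distribution) {
        int best = 0;
        for (int i = 1; i < distribution.length; i++) {
            if (distribution[i] > distribution[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] labels = {"spam", "ham"};      // same order as in the class attribute
        double[] distribution = {0.2, 0.8};     // made-up probabilities, not real output
        double pred = argMax(distribution);     // kept as a double, like the prediction above
        System.out.println("Class predicted: " + labels[(int) pred]);
    }
}
```

The order of the labels matters: index 0 means "spam" and index 1 means "ham" only because the class attribute was declared as {spam,ham}.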

And if you feed this classifier with a file (smstest.txt) that stores the text "this is spam or not, who knows?", and the model learnt with MyFilteredLearner.java (that is stored in myClassifier.dat), then you get the following result:

$>javac MyFilteredClassifier.java
$>java MyFilteredClassifier smstest.txt myClassifier.dat
===== Loaded text data: smstest.txt =====
this is spam or not, who knows?
===== Loaded model: myClassifier.dat =====
===== Instance created with reference dataset =====
@relation 'Test relation'

@attribute class {spam,ham}
@attribute text string

?,' this is spam or not, who knows?'
===== Classified instance =====
Class predicted: ham

It is interesting to see that the class assigned to the instance before classifying it is "?", which means undefined or unknown.

For those interested in using the classifiers discussed in my previous posts (I mean including AttributeSelection, and using PART and SMO as classifiers), the only parts of this code that you have to change are the learn() and evaluate() methods in MyFilteredLearner.java. Just play with it, and have fun.

Thanks for reading, and please feel free to leave a comment if you think I can improve this article, or if you have questions or suggestions for further articles on this topic!

UPDATE (June 26th, 2013): Since I wrote this post, I have moved my code examples and other stuff to a GitHub repository. I have just updated the links.

58 comments:

Ajith Nair said...

Gr8 post..thanks..!!!

Sujit Pal said...

Thanks for the post, Jose. However, the links to the Java code are returning 404 Not Found. Can you please update the links?

Jose Maria Gomez Hidalgo said...

Thanks for noting it, Sujit. I have just updated them.

Sujit Pal said...


Just FYI (and you can probably just delete this comment, no need to put it up if you don't want to) this one is still throwing 404s:
but I found it based on the other URL, it should be:

Jose Maria Gomez Hidalgo said...

Just corrected, sorry for the inconvenience and thank you again.

PostDoc said...
This comment has been removed by the author.
Jose Maria Gomez Hidalgo said...

Dear PostDoc

It depends on the WEKA version you are using. DenseInstance is available in the development version, but not in the book/stable version. I recommend using the development version.

Regards, Jose Maria

PostDoc said...
This comment has been removed by the author.
Justin Edse said...

Thank you very much for this tutorial. It helped me out tremendously. Personally I've always found the documentation on Weka to be quite poor and you made everything make much more sense to me.

Your tutorials are wonderful!

Jose Maria Gomez Hidalgo said...

Thanks a lot for your encouraging feedback, Justin.

Please feel free to suggest any improvement or a topic for another post.

jb ignacio said...

Hi, you've got a great post, but I get an error when loading the model file (.model file). I'm using Naive Bayes Multinomial with the StringToWordVector filter. I used the WEKA Explorer to save the model file.

Jose Maria Gomez Hidalgo said...


Please ensure that you are storing a FilteredClassifier class. Can you provide more details? For instance, the reported error.

Thanks for your comment.

jb ignacio said...

I added e.getMessage() in the catch part so I can get the real error.

This is the error: weka.classifiers.bayes.NaiveBayesMultinomial cannot be cast to weka.classifiers.meta.FilteredClassifier

Do I need to use a stringToWordVector filter in my text?

Again, thank you in advance.

Jose Maria Gomez Hidalgo said...

Yes. This code assumes that you have the raw text (e.g. ["this is my text",label] instances), so you need a FilteredClassifier that first applies the StringToWordVector filter to the text (to get a word-weight vector representation), then applies the classifier to that word-based representation. The FilteredClassifier does this smoothly.

In my previous post: http://jmgomezhidalgo.blogspot.com.es/2013/01/text-mining-in-weka-chaining-filters.html I show how to use a FilteredClassifier in the WEKA Explorer. Once you have it, you can save it as a FilteredClassifier.

jb ignacio said...

Sorry, I am a bit confused.

I need to use the FilteredClassifier on my training set to produce my trained model, right?

I now have the trained model (e.g. multinomial.model).

The next step is to run the Java code, loading the text file and the model file (multinomial.model).

Is this correct?

Thanks, and sorry, I am new to WEKA.

Jose Maria Gomez Hidalgo said...

In case you have any problem, try to modify MyFilteredLearner.java (my file) to fit your classes (I assume they will not be {spam,ham}), and use it instead of the Explorer.

No need to apologize!

BTW, if you need it, you can send me an excerpt from your ARFF file (the header plus 2 or 3 instances/rows) and I can test it on my side.

jb ignacio said...

Thanks! Can I have your email address?

Or how do I email you or send my arff file?

Jose Maria Gomez Hidalgo said...

You can get it at my home page: http://www.esp.uem.es/jmgomez/.

jb ignacio said...

Thanks! I sent you an email.

Jose Maria Gomez Hidalgo said...

Solved. I have just sent you back the files. Regards

Bino said...

The information above was really helpful. I just need a clarification about the ARFF file: how do I create an ARFF file with certain attributes for large data?

Jose Maria Gomez Hidalgo said...

Bino, I am afraid your question is very generic.

You can create your ARFF files with scripts, as the output of other programs, etc. There are many ways; it depends on the source of your data. If the data is going to be very large, you may consider using a database and the appropriate connectors in WEKA.

Bino said...

Thank you for your response, sir. I am a student doing my final-year project, which aims to identify disease-treatment relations in short texts. As an initial task I have to annotate the sentences as informative or non-informative. Before that I have to do the tagging part. My question is: should I give the tagged base words as input for creating the ARFF file, or are the normal sentences enough? Which one will give the better result? Thanks in advance.

Jose Maria Gomez Hidalgo said...

Hi, Bino

My experience is that if you have the sentences tagged, applying the StringToWordVector filter and then AttributeSelection with Ranker and Information Gain will give you which words are most valuable to predict if a sentence is informative or not.

For instance, if you have an ARFF file like:

"word1sentence1 word2sentence1",informative
"word1sentence2 word2sentence2",non-informative

Then the StringToWordVector filter will give you the words, and after that the AttributeSelection filter will rank those words according to how good they are as predictors. Beware, it could be the case that a word is not very "informative" (that is, a good predictor of your positive class) but very "non-informative" (that is, a good predictor of your negative class).

To get the ARFF file, you can have two folders, one called "informative" with a sentence per file, and another one called "non-informative" with a sentence per file as well.

Hope this helps you

Jose Maria

Sushil Kumar said...

I'm having a small problem when using this code. No matter what the test sentence is, it always predicts the same class.

Jose Maria Gomez Hidalgo said...

Hi, Sushil

This code completely depends on your training set. If you are using mine (smsspam.small.arff), that is to be expected, as you are more likely to get the class ham, which is the majority class. You can test it by submitting a sentence from the dataset that is already labelled as spam.


Ivan said...

Hi Jose Maria,

This post was really useful to me. I made a study for a Data Mining subject and tested different machine learning algorithms over your SMS Spam Collection Data Set.

I developed an application to test some algorithms. Here is the app and the code: SMS Spam Filtering.

Here you can find my results.


Jose Maria Gomez Hidalgo said...

Thanks a lot for your support, Ivan. It is nice to read that my posts help people. And it happens that I work a lot with that collection (in fact I contributed to building that dataset).

Good luck with your next experiments!

chi kuan said...

Dear Sir,

I have counted the number of spam messages in smsspam.small.arff, and I found that there are only 33 spam lines in it. But after running the Java code, the output shows only 13 incorrectly classified instances. Is there nothing wrong with that?

Jose Maria Gomez Hidalgo said...

Hi, Chi Kuan

No, there is nothing wrong. There are 33 spam instances in the dataset, and 167 ham instances as well. The error obtained by cross-validating on the 200 training instances is 6.5% (13 instances). That is, each of the 200 examples is classified by a model trained on the other folds, which gives 187 correctly classified instances and 13 mistakes; some of them will be of the class spam and some of them will belong to ham. That's all.

Obviously the classifier is more likely to fail on a spam message, because there are few spam messages and it tends to choose the majority class (ham).

Hope I made it clear.


Anonymous said...

First I would like to say that your posts here are amazing, keep up the good work! I am using WEKA too in my project now (I am still a beginner) and I wish to use a topic model such as Latent Dirichlet Allocation. I have looked into the documentation but there is no implementation of LDA. There are some APIs such as LingPipe and Mallet that allow LDA transformation. However, I do not know how I can get this representation into WEKA so I can classify the documents. Do you have any experience with doing this? Help is really appreciated!

Jose Maria Gomez Hidalgo said...

Unfortunately, LDA is not implemented in WEKA. You can ask for it in the WEKA list at: http://list.waikato.ac.nz/mailman/listinfo/wekalist.

In a search, I have found this quote by Mark Hall:

Q: I was looking for a LDA in Weka, but I didn't found it. Is there a LDA in Weka or something similar?
A: Weka doesn't have an implementation of LDA, but it does have a number of other methods that are arguably as good or better: multi response linear regression, logistic regression, PCA, partial least squares regression and linear support vector machines.

Found in: http://list.waikato.ac.nz/pipermail/wekalist/2011-September/053397.html

Anonymous said...

I get a "java.lang.ArrayIndexOutOfBoundsException: 2" error when running the classify() method.

I think there is a problem in the makeInstance() method, something to do with the format of my ARFF file or model.

Any ideas?

Anonymous said...

I can send you my arff file so you can have a look if possible please

Jose Maria Gomez Hidalgo said...

If you are working with your own file, it is very likely that the error is caused by having a different classification problem (class type, for instance). A quick search will give you my email; please send the file to me (or a subset of it) if you want me to check it, as it works perfectly with my sample files.

chi kuan said...

Is there any way to show the calculations done in MyFilteredClassifier? I mean, how it computes the probabilities of the instances.

Jose Maria Gomez Hidalgo said...

Dear Chi Kuan

It is possible to get the probability of each class value or label in a classification problem (nominal class) using the distributionForInstance() method available in every classifier (see http://weka.sourceforge.net/doc.dev/weka/classifiers/Classifier.html#distributionForInstance(weka.core.Instance) ). Instead of calling classifyInstance() in line #116, you can call that method to get an array with the probabilities of each class value. Beware, not all classifiers produce robust class membership probabilities, so this depends on the base classifier that you are using inside the FilteredClassifier.

However, if you want to get information about the internal probability calculations done during training, the only way I see to do this is to use a base classifier that makes use of probabilities (e.g. the NaiveBayes family), output the classifier as a String somewhere after training, and then post-process that output.


Anonymous said...

I used my own files and all the functions work, but I am having a problem with the last one, classify(): it shows me "Problem found when classifying the text". Can you please tell me what the problem is?

Anonymous said...

Which version of weka.jar did you use?

Jose Maria Gomez Hidalgo said...

First, I am using version 3.7.9 (development version) in those tests.

Second, regarding the exception: you get that message because I catch the exception (lines 120-122 of MyFilteredClassifier.java). Just replace line #121 with e.printStackTrace(); to get a more informative error message, and post it here if you are not able to solve it.

Most likely, the error is produced because either the model has not been previously learnt, or the training and test datasets are not compatible.

Anonymous said...

Thank you for your reply. How can I know if they are not compatible? I built them using the WEKA tool, not your MyFilteredLearner.java; does this cause the problem?

Also, I have replaced line #121 and I got this error:

java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:604)
at java.util.ArrayList.get(ArrayList.java:382)
at weka.core.Instances.attribute(Instances.java:341)
at weka.core.AttributeLocator.locate(AttributeLocator.java:153)
at weka.core.AttributeLocator.initialize(AttributeLocator.java:119)
at weka.core.AttributeLocator.<init>(AttributeLocator.java:102)
at weka.core.StringLocator.<init>(StringLocator.java:69)
at weka.filters.Filter.flushInput(Filter.java:431)
at weka.filters.unsupervised.attribute.StringToWordVector.batchFinished(StringToWordVector.java:768)
at weka.classifiers.meta.FilteredClassifier.filterInstance(FilteredClassifier.java:474)
at weka.classifiers.meta.FilteredClassifier.distributionForInstance(FilteredClassifier.java:495)
at weka.classifiers.AbstractClassifier.classifyInstance(AbstractClassifier.java:70)
at myfilteredclassifier.MyFilteredClassifier.classify(MyFilteredClassifier.java:117)
at myfilteredclassifier.MyFilteredClassifier.main(MyFilteredClassifier.java:197)

Anonymous said...

can you please check my error :)?

Jose Maria Gomez Hidalgo said...

I am afraid that the output is not very informative, so I cannot help you with this unless I have more information. In particular, a short sample of the training and testing files may be enough. However, you also need to describe the process for generating the model in more detail: did you just use the Explorer? Which version? Which model (classifier)? Etc.

HNJM said...

Hey Jose, thanks for this example.
I tried it but I have a problem. You suggested to switch the methods learn() and evaluate(). I did this, and the training and evaluation work. But when I then want to classify my own text, I get the following error:

java.lang.NullPointerException: No output instance format defined

I didn't see in your code that you set the output format. Do you know what I have to do?


Jose Maria Gomez Hidalgo said...


Can you post in which line you get the error? I guess you get it when running MyFilteredClassifier.java, but it works for me with the sample data and WEKA 3.7.9...


Raúl Zavala said...

Hi Jose, reading this article I was wondering...

Whether there is already an API or method in this area of computing that lets you take a text, whether from an article or a book, and classify its content into paragraph, title, subtitle... Basically, how to break it down by recognizing the logical structure of the text itself.

If so, could you mention one, or recommend where to look?

I am asking because I am researching something along these lines at my university, and I would like to know your opinion on this.


Jose Maria Gomez Hidalgo said...

Hi, Raúl

To be honest, this is not a topic I am an expert in; as you know, Natural Language Processing is a very broad field...

My advice is, on the one hand, to look for APIs using the keyword "textmining" on Twitter, where there are several, to see if any of them solves your problem.

On the other hand, you should search for "text segmentation" on Google; in a first search I already got some results that would need further investigation.

Good luck!

Anonymous said...

Hello Mr. Raul, I have this:
double pred = classifier.classifyInstance(instances.instance(0));
System.out.println("Class predicted: " + instances.classAttribute().value((int) pred));

How can I get the error percentage for the class it predicts?

The WEKA application does it, but how do I do it in Java? I have tried all the methods but none of them works for me. Please help... thanks

Anonymous said...

Hi, I have tried to run the code but I get an error on this line:
DenseInstance instance = new DenseInstance(2);
I do not know what causes the error.

Jose Maria Gomez Hidalgo said...

Without knowing more details about your installation I cannot be 100% sure, but most likely we have different versions of WEKA. In this post I used the development version which, at the time of writing, was 3.7.1 if I remember correctly.

Regards

Anonymous said...

Hi, This looks like an excellent demonstration of how to use Weka with java. But I have unfortunately experienced an issue right at the end:

I have copy and pasted your classes and used the example file formats for the training instances and the new instance and I am using the Weka developer version. The classifier is built, learned and evaluated correctly. But when I run the MyFilteredClassifier methods to load instance, load model, make instance and classify it fails to classify the instance? I get the following error: No output instance format defined

This is the single line of my instance file:
this is spam or not, who knows?

This is the start of my train ARFF file:
@relation sms_test

@attribute spamclass {spam,ham}
@attribute text String


Could you please let me know why this is happening, because I am using the exact code and file formats you have supplied. Thanks in advance.

Tharaka Mayadunne said...

Hi, I'm new to WEKA and I'm implementing a movie classifier system based on genres for my project. I have a small question regarding your code. When you load the model, it seems that you load a "something.dat" file. But I'm loading a "something.model" file previously created and saved using the WEKA Explorer. Can you tell me whether this is the reason why I'm continuously getting errors in the classify() function? Thank you in advance.

Jose Maria Gomez Hidalgo said...

Hi, Tharaka

It is strange; in principle you should be able to use a model file previously saved with the Explorer with my code, as long as the classifier is compatible (the same kind of FilteredClassifier with the same filters, classifier and so on). The name of the file does not matter...

I am afraid I cannot provide better guidance without more details...



Jose Maria Gomez Hidalgo said...

Hi, Anonymous

Well, if you are following the instructions exactly and using the right file format and WEKA version, I cannot guess what is wrong, as it works for me.

My suggestion: pack everything and send it to me by email or put it in Dropbox. I will examine it.



Adina Lazar said...

Hello. I am new to WEKA. I have read and understood about classification, but I don't understand one thing about testing:
I have 4 news categories and I made an ARFF file, transformed it with StringToWordVector and classified it.
Now I want to test one new text (one news item).
How am I going to transform this plain text into a test set?

Kikazz said...

Hello Jose,

This was a really great way for me to understand how to get started with Weka. More than with any other tutorial I have come across. A million thanks for this!
One question - Your MyFilteredLearner class has an evaluate and a learn method, both of which perform mostly the same steps of initialising/setting options for many of the same variables. Can't this be handled in the main function itself? Or by declaring the classifiers globally and avoiding having to repeat the code in the learn() method?

Jose Maria Gomez Hidalgo said...

@Adina - This post explains exactly that. You can apply the same configuration of the StringToWordVector filter to the test set by using a FilteredClassifier.

@Kikazz You are right, that code can be factored out into the main function or another "initialization" one. My purpose was to let you easily delete the function you don't need without losing the one you need, and at the same time to keep all the code for evaluation or training together. But the way you propose is better.

Unknown said...

Hello, my name is Abiud Leal.
I am very interested in this post.

I want to do something similar. In my case I want to train the model with a corpus of comments that contains the following classes: complaint, suggestion, congratulation.

Once the model is trained, it has to tell me the type of any comment I enter.

Example: congratulations to the cook, everything was delicious. congratulation

I have already tried to run your program but it gives me several errors.

Which version of WEKA do you use? Could you please provide it to me?

I look forward to your help.

Thank you very much.