Text written by users in social networks is noisy: it contains emoticons, chat codes, typos, grammar mistakes and, on top of that, deliberate noise that users create as a matter of style, trend or fashion. Consider the following utterance, taken from a post in the social network Tuenti:
"felicidadees!! k t lo pases muy bien!! =)Feeeliiciidaaadeeess !! (:Felicidadesss!!pasatelo genialll :DFeliicCiidaDesS! :D Q tte Lo0 paseS bN! ;) (heart)"
This is a real text. Its approximate translation into English would be something like:
"happybirthdaay!! njy it lot!! =)Haaapyyybirthdaaayyy !! (:Happybirthdayyy!!have a great timeee :DHappyyBiirtHdayY :D Enjy! ;) (heart)"
The last word, in parentheses, is a Tuenti code that is displayed as a heart.
If you want to find more text like this out there, just point your browser to Fotolog.
As you can imagine, just tokenizing this kind of text for further analysis is quite a headache. During our experiments for the project WENDY (link in Spanish), we designed a relatively simple tokenization algorithm to deal with this kind of text for age prediction. Although the method was designed for Spanish, it is largely language-independent and may well apply to other languages, although we have not tested this yet. The algorithm is as follows:
- Separate the initial string into candidate tokens using white spaces.
- A candidate token can be:
  - A pure sequence of alphabetic characters (a potential word), or a pure sequence of punctuation symbols (a potential emoticon). In this case, the candidate token is already considered a token.
  - A mixed sequence of alphabetic characters and punctuation symbols. In this case, the character sequence is divided into sequences of alphabetic characters and sequences of punctuation symbols. For instance, "Hola:-)ketal" is further divided into "Hola", ":-)", and "ketal".
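The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the code used in WENDY: the function name `tokenize` is hypothetical, and it assumes that every non-alphabetic character (digits included) counts as a "punctuation symbol", which the description above does not specify.

```python
from itertools import groupby


def tokenize(text):
    """Split noisy text into alphabetic runs and non-alphabetic runs.

    Assumption (not stated in the post): any character for which
    str.isalpha() is False, digits included, is treated as punctuation.
    """
    tokens = []
    # Step 1: candidate tokens are separated by white space.
    for candidate in text.split():
        # Step 2: split each candidate into maximal runs of alphabetic
        # and non-alphabetic characters. A pure candidate yields a
        # single run, so it is kept as one token.
        for _, run in groupby(candidate, key=str.isalpha):
            tokens.append("".join(run))
    return tokens


print(tokenize("Hola:-)ketal"))
# ['Hola', ':-)', 'ketal']
```

Note that this sketch also splits emoticons that contain letters: ";D" becomes ";" and "D", and the "D" is merged with any word that follows it without a space.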
As a further example, consider the following (real) text utterance:
"Felicidades LauraHey, felicidades! ^^felicidiadeees;DFelicidades!Un beso! FELIZIDADESS LAURIIIIIIIIIIIIII (LL)felicidadeeeeeees! :D jajaja mira mi tablonme meo jajajajajjajate quiero(:,"
The output of our algorithm is the list of tokens in the next table:
We have evaluated this algorithm directly and indirectly. Direct evaluation consists of comparing how many hits we get with a space-only tokenizer and with our tokenizer, in a Spanish dictionary and in an SMS-language dictionary. The more hits you get, the better the words are recognized. On average, per text utterance (comment), our tokenizer finds about 9.5 more words in the Spanish dictionary and 1.13 more words in the SMS-language dictionary.
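To make the direct evaluation concrete, here is a rough sketch of how such a comparison could be computed. All names (`count_hits`, `average_gain`, the toy dictionary and the toy comments) are made up for illustration, and the case-insensitive lookup is an assumption; the actual dictionaries and matching rules are described in the paper, not here.

```python
from itertools import groupby


def space_tokenize(text):
    # Baseline: split on white space only.
    return text.split()


def mixed_tokenize(text):
    # Split each candidate into alphabetic and non-alphabetic runs
    # (assumption: non-alphabetic means str.isalpha() is False).
    return ["".join(run)
            for candidate in text.split()
            for _, run in groupby(candidate, key=str.isalpha)]


def count_hits(tokens, dictionary):
    # Assumption: dictionary lookup ignores case.
    return sum(1 for token in tokens if token.lower() in dictionary)


def average_gain(comments, dictionary):
    # Average number of extra dictionary hits per comment obtained by
    # the mixed tokenizer over the space-only baseline.
    gains = [count_hits(mixed_tokenize(c), dictionary)
             - count_hits(space_tokenize(c), dictionary)
             for c in comments]
    return sum(gains) / len(gains)


# Toy data, invented for this example only.
toy_dictionary = {"hola", "felicidades", "un", "beso"}
toy_comments = ["Hola:-)ketal", "Felicidades!Un beso!"]
print(average_gain(toy_comments, toy_dictionary))
# 2.0
```

In the toy data the baseline finds no dictionary words at all, because every word is glued to punctuation or to another word, while the mixed tokenizer recovers one word in the first comment and three in the second.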
The indirect evaluation is performed by plugging the algorithm into the full pipeline of the WENDY age recognition system. The new tokenizer increases the accuracy of the age recognition system from 0.768 to 0.770, which may seem marginal, but it accounts for 206 new hits in our text collection of Tuenti comments. The new tokenizer also provides relatively important gains in recall and precision for the most under-represented but most critical class, namely users under 14.
The following paper, written in Spanish, details the tokenizer, the experiments, and the context of the WENDY project:
José María Gómez Hidalgo, Andrés Alfonso Caurcel Díaz, Yovan Iñiguez del Rio. Un método de análisis de lenguaje tipo SMS para el castellano. Linguamatica, Vol. 5, No. 1, pp. 31-39, July 2013.
If you are interested in the first steps of text analysis (tokenization, text normalization, POS tagging), then these two recent news items may be useful for you:
- The results of the Tweet Normalization Workshop/Task at SEPLN 2013 have just been published, with interesting data and a dataset.
- Leon Derczynski et al. have released a GATE-based POS-Tagger for Twitter with very good levels of accuracy.
And you may want to take a look at my previous post on text normalization.