Automated Text Clustering of Newspaper and Scientific Texts in Brazilian Portuguese: Analysis and Comparison of Methods

Alexandre Ribeiro Afonso, Cláudio Gottschalg Duque

Abstract


This article reports the findings of an empirical study of automated text clustering applied to scientific articles and newspaper texts in Brazilian Portuguese. The objective was to find the most effective computational method for clustering the input texts into their original groups. The study comprised four experiments, each with four procedures: 1. Corpus Selection (a set of texts is selected for clustering); 2. Word Class Selection (nouns, verbs and adjectives are extracted from each text by specific algorithms); 3. Filtering Algorithms (a set of terms is selected from the output of the previous stage, a semantic weight is assigned to each term, and an index is generated for each text); 4. Clustering Algorithms (the Simple K-Means, sIB and EM clustering algorithms are applied to the indexes). After these procedures, statistics on clustering correctness and clustering time were collected. The sIB clustering algorithm was the best choice for both the scientific and the newspaper corpora, with the caveat that sIB requires the number of clusters as input before running (68.9% correctness in 1 minute for the newspaper corpus and 77.8% correctness in 1 minute for the scientific corpus). The EM clustering algorithm can additionally estimate the number of clusters without user intervention, but its best case is below 53% correctness. In the experiments carried out, the results of automated clustering remain far from those of human text classification; it was also observed that clustering correctness varies with the number of input texts and their topics.
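The last two procedures (indexing with term weights, then clustering) and the correctness measure can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the word class selection step has already reduced each text to a list of nouns, verbs and adjectives, uses TF-IDF as a stand-in for the "semantic weight", implements only a plain K-Means (not sIB or EM), and assumes that correctness is the share of texts whose original group matches the majority group of their cluster. All function names, the `seeds` parameter and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_index(docs):
    """Build one sparse, unit-length TF-IDF vector (term -> weight) per text."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    index = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: tf[t] * math.log(n / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        index.append({t: w / norm for t, w in vec.items()})
    return index


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in set(a) | set(b))


def mean(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for t, w in vec.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}


def k_means(index, seeds, max_iter=20):
    """Plain K-Means; `seeds` lists the texts used as initial centroids,
    so the number of clusters is fixed in advance by len(seeds)."""
    centroids = [dict(index[i]) for i in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
            for vec in index
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [index[i] for i, lab in enumerate(labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    return labels


def correctness(labels, classes):
    """Fraction of texts whose original group equals the majority group
    of the cluster they were assigned to (an assumed definition)."""
    hits = 0
    for c in set(labels):
        members = [classes[i] for i, lab in enumerate(labels) if lab == c]
        hits += Counter(members).most_common(1)[0][1]
    return hits / len(labels)


# Toy corpus: tokens stand for already-selected nouns, verbs and adjectives.
docs = [
    ["gol", "time", "jogo", "gol"],
    ["jogo", "time", "campeonato"],
    ["gol", "campeonato", "time"],
    ["juros", "banco", "inflação"],
    ["banco", "juros", "mercado"],
    ["inflação", "mercado", "juros"],
]
classes = ["sport", "sport", "sport", "economy", "economy", "economy"]

labels = k_means(tfidf_index(docs), seeds=[0, 3])
print(labels, correctness(labels, classes))
```

Because the initial centroids are passed in explicitly, the number of clusters is fixed before the algorithm runs, which mirrors the condition the abstract notes for sIB. An algorithm that estimates the number of clusters itself, as EM is reported to do, would not take such a parameter.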

Keywords


Text Mining; Text Clustering; Natural Language Processing; Brazilian Portuguese; Effectiveness.

