SMOTE_Easy: an algorithm for handling the classification problem in real-world databases

Hugo Leonardo Pereira Rufino, Antônio Cláudio Paschoarelli Veiga, Paula Teixeira Nakamoto

Abstract


Most classification tools assume that the data distribution is balanced, or that misclassification costs are equal across classes. In practice, however, databases with imbalanced classes are very common, as in disease diagnosis, where confirmed cases are usually rare compared with the healthy population. Other examples are the detection of fraudulent calls and of network intrusions. In these cases, misclassifying a minority-class instance (e.g., diagnosing a person with cancer as healthy) can have more serious consequences than misclassifying a majority-class instance. Handling databases with imbalanced classes is therefore important. This paper presents the SMOTE_Easy algorithm, which is able to classify data even with a high imbalance rate between the different classes. To demonstrate its effectiveness, it was compared with the main algorithms for classification problems with imbalanced data, and it succeeded on nearly all of the tested databases.
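For readers unfamiliar with the oversampling family the algorithm's name refers to, the following is a minimal, hypothetical sketch of SMOTE-style synthetic oversampling (Chawla et al., 2002): each synthetic minority point is an interpolation between a minority sample and one of its k nearest minority neighbours. It is an illustration of the general technique only, not the authors' SMOTE_Easy; the function name and parameters are this sketch's own.

```python
import random

def smote_sample(minority, k=5, n_new=10, seed=0):
    """Generate synthetic minority samples by interpolating between
    each chosen point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x by squared Euclidean distance (excluding x itself)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Toy 2-D minority class; generate 5 synthetic points along neighbour segments.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sample(minority, k=2, n_new=5)
print(len(new_points))  # 5
```

Because each new point lies on a segment between two existing minority samples, the oversampled class stays inside the region the minority data already occupies, rather than duplicating points as naive replication would.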

Keywords


Machine Learning; Data Classification; Support Vector Machines; Committee Machines; Imbalanced Classes.

References


Akbani, R., Kwek, S., and Japkowicz, N. (2004). Applying support vector machines to imbalanced datasets. In Machine learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings (pp. 39–50). Springer.

Bennett, K. P., and Campbell, C. (2000). Support vector machines: Hype or hallelujah? SIGKDD Explorations.

Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.

Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In COLT: Proceedings of the Workshop on Computational Learning Theory. Morgan Kaufmann.

Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7), 1145–1159.

Breiman, L. (1996, August). Bagging predictors. Machine Learning, 24, 123–140.

Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. In U. Fayyad (Ed.), Data mining and knowledge discovery (pp. 121–167). Kluwer Academic.

Castro, C. L., Carvalho, M. A., and Braga, A. P. (2009). An improved algorithm for SVMs classification of imbalanced data sets. In D. Palmer-Brown, C. Draganova, E. Pimenidis, and H. Mouratidis (Eds.), Engineering applications of neural networks (Vol. 43, pp. 108–118). Springer Berlin Heidelberg.

Chan, P., Fan, W., Prodromidis, A., and Stolfo, S. (1999). Distributed data mining in credit card fraud detection. IEEE Intelligent Systems, 14, 67–74.

Chang, C.-C., and Lin, C.-J. (2001). LIBSVM: A library for support vector machines [Computer software manual]. (Retrieved August 15, 2010, from http://www.csie.ntu.edu.tw/~cjlin/libsvm)

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. (JAIR), 16, 321–357.

Chawla, N. V., Japkowicz, N., and Kotcz, A. (2004). Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explorations, 6(1), 1–6.

Chawla, N. V., Lazarevic, A., Hall, L. O., and Bowyer, K. W. (2003). SMOTEBoost: Improving prediction of the minority class in boosting. In Principles and practice of knowledge discovery in databases (pp. 107–119). Springer.

Cherkassky, V., and Mulier, F. (2007). Learning from data: Concepts, theory, and methods (2nd ed.). Wiley.

Cortes, C., and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.

Cristianini, N., and Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge, U.K.: Cambridge University Press.

Fan, X., Zhang, G., and Xia, X. (2008). Performance evaluation of SVM in image segmentation. In IWSCA ’08: Proceedings of the 2008 IEEE International Workshop on Semantic Computing and Applications.

Fawcett, T. (2004). Roc graphs: Notes and practical considerations for researchers.

Frank, A., and Asuncion, A. (2010). UCI machine learning repository. Retrieved September 26, 2010, from http://archive.ics.uci.edu/ml.

Freund, Y., and Schapire, R. E. (1996). Experiments with a new boosting algorithm. In International Conference on Machine Learning (pp. 148–156).

Freund, Y., and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.

Gunn, S. R. (1998). Support vector machines for classification and regression (Tech. Rep.). University of Southampton.

Hansen, L., and Salamon, P. (1990). Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10), 993–1001.

Haykin, S. (1999). Neural networks: A comprehensive foundation (2nd ed.). Prentice Hall.

He, H., and Garcia, E. (2009, September). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284.

Hearst, M. A. (1998). Trends & controversies: Support vector machines. IEEE Intelligent Systems, 13(4), 18–28.

Hulley, G., and Marwala, T. (2007). Evolving classifiers: Methods for incremental learning. Computing Research Repository, abs/0709.3965.

Japkowicz, N. (2000). Learning from imbalanced data sets: A comparison of various strategies. In Proceedings of Learning from Imbalanced Data Sets, Papers from the AAAI workshop, Technical Report ws-00-05 (pp. 10–15). AAAI Press.

Kubat, M., Holte, R. C., and Matwin, S. (1998). Machine learning for the detection of oil spills in satellite radar images. Machine Learning, 30, 195–215.

Kubat, M., and Matwin, S. (1997). Addressing the curse of imbalanced training sets: One-sided selection. In Proc. 14th International Conference on Machine Learning (pp. 179–186).

Kuncheva, L. I. (2004). Combining pattern classifiers - methods and algorithms. John Wiley & Sons.

Lewis, D. D., and Gale, W. A. (1994). A sequential algorithm for training text classifiers. In SIGIR ’94: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 3–12). Springer-Verlag New York, Inc.

Li, S., Fu, X., and Yang, B. (2008). Nonsubsampled contourlet transform for texture classifications using support vector machines. In ICNSC ’08: IEEE International Conference on Networking, Sensing and Control.

Liu, X.-Y., Wu, J., and Zhou, Z.-H. (2009, April). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2), 539–550.

Mitchell, T. M. (1997). Machine learning. New York: McGraw-Hill.

Moraes Lima, C. A. de. (2004). Comitê de máquinas: Uma abordagem unificada empregando máquinas de vetores-suporte [Committee machines: A unified approach employing support vector machines]. Doctoral dissertation, Universidade Estadual de Campinas.

Osuna, E., Freund, R., and Girosi, F. (1997). Training support vector machines: An application to face detection. In IEEE Conference on Computer Vision and Pattern Recognition.

Provost, F. (2000). Machine learning from imbalanced data sets 101. (Invited paper for the AAAI’2000 Workshop on Imbalanced Data Sets).

Rao, R. B., Krishnan, S., and Niculescu, R. S. (2006). Data mining for improved cardiac care. SIGKDD Explorations, 8(1), 3–10.

Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5, 197–227.

Schölkopf, B. (1997). Support vector learning. Unpublished doctoral dissertation, Technische Universität Berlin.

Smola, A. J., Bartlett, P. L., Schölkopf, B., and Schuurmans, D. (2000). Introduction to large margin classifiers. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 1–29). The MIT Press.

Sun, Y. M., Kamel, M. S., Wong, A. K. C., and Wang, Y. (2007, December). Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 40(12), 3358–3378.

Tao, D., Tang, X., e Li, X. (2008). Which components are important for interactive image searching? IEEE Trans. Circuits Syst. Video Techn, 18(1).

Vapnik, V. N. (1998). Statistical learning theory. John Wiley & Sons, Inc.

Veropoulos, K., Campbell, C., and Cristianini, N. (1999). Controlling the sensitivity of support vector machines. In Proceedings of the International Joint Conference on AI (pp. 55–60).

Wang, G. (2008, September). A survey on training algorithms for support vector machine classifiers. In J. Kim, D. Delen, J. Park, F. Ko, and Y. J. Na (Eds.), International Conference on Networked Computing and Advanced Information Management (Vol. 1, pp. 123–128). IEEE Computer Society.

Wolpert, D., and Macready, W. (1997, April). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1), 67–82.

Zien, A., Rätsch, G., Mika, S., Schölkopf, B., Lengauer, T., and Müller, K.-R. (2000). Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics, 16.




DOI: http://dx.doi.org/10.4301/S1807-17752016000100004

Copyright (c) 2016
