Classificação de câncer de pâncreas utilizando técnicas de imputação de dados faltantes e undersampling baseado em clusterização: uma análise comparativa com diferentes algoritmos de Machine Learning

Sena, Wanessa Layssa Batista de

dc.creator	Sena, Wanessa Layssa Batista de
dc.date.accessioned	2023-11-01T21:19:55Z
dc.date.available	2023-11-01T21:19:55Z
dc.date.issued	2023-07-13
dc.identifier.citation	SENA, Wanessa Layssa Batista de. Classificação de câncer de pâncreas utilizando técnicas de imputação de dados faltantes e undersampling baseado em clusterização: uma análise comparativa com diferentes algoritmos de Machine Learning. 2023. 15f. Trabalho de Conclusão de Curso (Curso de Tecnologia em Análise e Desenvolvimento de Sistemas) - Instituto Federal de Ciência e Tecnologia de Pernambuco, Recife.	pt_BR
dc.identifier.uri	https://repositorio.ifpe.edu.br/xmlui/handle/123456789/1068
dc.description.abstract	Missing values and class imbalance are frequent in real-world databases, including those used for cancer classification. If these issues are not properly addressed before analysis, the performance of Machine Learning (ML) models can be affected. This paper proposes a combined solution for pancreatic cancer classification: missing data imputation using kNN (k-nearest neighbors) and cluster-based undersampling using k-means. Data subsets were generated by combining different preprocessing methods, and performance was analyzed using an ML analysis pipeline from a previous study. This pipeline implements ten ML classifiers, including Random Forest, Support Vector Machine and Artificial Neural Network. For every data subset, most ML algorithms performed significantly better (p<0.05 with Student's t-test) than when the pipeline was first evaluated. The results suggest that kNN and k-means can be used in the data preprocessing phase to overcome missing values and class imbalance and to improve classification accuracy.	pt_BR
dc.format.extent	15f.	pt_BR
dc.language	pt_BR	pt_BR
dc.rights	Open Access	pt_BR
dc.subject	Computer Science	pt_BR
dc.subject	Machine Learning	pt_BR
dc.subject	k-means clustering	pt_BR
dc.subject	kNN	pt_BR
dc.subject	Undersampling	pt_BR
dc.subject	Missing data imputation	pt_BR
dc.subject	Classification	pt_BR
dc.title	Classificação de câncer de pâncreas utilizando técnicas de imputação de dados faltantes e undersampling baseado em clusterização: uma análise comparativa com diferentes algoritmos de Machine Learning	pt_BR
dc.type	Article	pt_BR
dc.creator.Lattes	http://lattes.cnpq.br/3122764958081123	pt_BR
dc.contributor.advisor1	Neves, Renata Freire de Paiva
dc.contributor.advisor1Lattes	http://lattes.cnpq.br/9029559122700209	pt_BR
dc.contributor.referee1	Neves, Renata Freire de Paiva
dc.contributor.referee2	Ferreira, Aida Araújo
dc.contributor.referee3	Macedo, Samuel Victor Medeiros de
dc.contributor.referee1Lattes	http://lattes.cnpq.br/9029559122700209	pt_BR
dc.contributor.referee2Lattes	http://lattes.cnpq.br/8515798754882166	pt_BR
dc.contributor.referee3Lattes	http://lattes.cnpq.br/0753964115099661	pt_BR
dc.publisher.department	Recife	pt_BR
dc.publisher.country	Brazil	pt_BR
dc.subject.cnpq	CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO::SISTEMAS DE COMPUTACAO	pt_BR
dc.description.resumo	Missing data and class imbalance are problems frequently observed in databases drawn from real-world scenarios, including cancer classification. If these problems are not adequately addressed before analysis, the performance of Machine Learning (ML) models can be affected. This article proposes a combined solution, focused on pancreatic cancer classification, that imputes missing data with kNN (k-nearest neighbors) and applies cluster-based undersampling with k-means. Different data subsets were generated by combining different preprocessing methods, and their performance was analyzed using an ML analysis pipeline from a previous study. This pipeline runs ten ML algorithms, including Random Forest, Support Vector Machine and Artificial Neural Networks. All generated data subsets showed a significant increase in performance (p<0.05 with Student's t-test) for most of the ML algorithms compared with the results obtained when the pipeline was first evaluated. The results suggest that kNN and k-means can be used in the data preprocessing phase to address missing data and class imbalance and to improve classification accuracy.	pt_BR
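
A minimal sketch of the two preprocessing steps the abstract describes, kNN imputation followed by k-means cluster-based undersampling, written in plain Python. It is an illustration under stated assumptions, not the author's implementation: the function names (`knn_impute`, `kmeans`, `cluster_undersample`), the distance rescaling for partially observed rows, the choice of one cluster per minority sample, and the toy data are all hypothetical. In practice a library such as scikit-learn (`KNNImputer`, `KMeans`) would likely be used instead.

```python
import math
import random


def knn_impute(rows, k=3):
    """Fill None entries with the mean of that feature over the k nearest rows.

    Distance is Euclidean over the features observed in both rows, rescaled
    by the fraction of features compared (an assumption of this sketch).
    """
    n_features = len(rows[0])

    def distance(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum(shared) * n_features / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows where feature j is observed.
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d < math.inf]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


def kmeans(points, n_clusters, n_iter=50, seed=0):
    """Plain Lloyd's k-means on complete rows; returns the final centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, n_clusters)]
    for _ in range(n_iter):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(n_clusters),
                          key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        # Empty clusters keep their previous centroid.
        updated = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[c]
            for c, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def cluster_undersample(majority, minority, seed=0):
    """Shrink the majority class to the size of the minority class.

    Uses one cluster per minority sample (an assumption) and represents each
    cluster by the real majority sample closest to its centroid.
    """
    centroids = kmeans(majority, len(minority), seed=seed)
    return [min(majority, key=lambda p: math.dist(p, c)) for c in centroids]


if __name__ == "__main__":
    # Toy data: None marks a missing value.
    majority_raw = [[1.0, 2.0], [1.2, None], [0.9, 2.1],
                    [8.0, 9.0], [8.2, None], [7.9, 9.1]]
    minority = [[4.0, 4.0], [4.5, 5.0]]
    majority = knn_impute(majority_raw, k=2)
    print(cluster_undersample(majority, minority))
```

Representing each cluster by its nearest real sample keeps the reduced majority set made of observed records; using the centroids themselves as synthetic samples is another option.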

Files in this item

Name: Classificação de câncer de ...
Size: 560.6Kb
Format: PDF
Description: Undergraduate thesis (Trabalho de Conclusão de Curso)

This item appears in the following collection(s)

Tecnólogo em Análise e Desenvolvimento de Sistemas
