
dc.creator: Sena, Wanessa Layssa Batista de
dc.date.accessioned: 2023-11-01T21:19:55Z
dc.date.available: 2023-11-01T21:19:55Z
dc.date.issued: 2023-07-13
dc.identifier.citation: SENA, Wanessa Layssa Batista de. Classificação de câncer de pâncreas utilizando técnicas de imputação de dados faltantes e undersampling baseado em clusterização: uma análise comparativa com diferentes algoritmos de Machine Learning. 2023. 15 f. Undergraduate final paper (Technology degree in Systems Analysis and Development) - Instituto Federal de Ciência e Tecnologia de Pernambuco, Recife. [pt_BR]
dc.identifier.uri: https://repositorio.ifpe.edu.br/xmlui/handle/123456789/1068
dc.description.abstract: Missing values and class imbalance are frequent in databases from real-world scenarios, including cancer classification. If these issues are not properly addressed before the analysis, the performance of Machine Learning (ML) models can suffer. This paper proposes a combined solution for pancreatic cancer classification: missing data imputation using kNN (k-nearest neighbors) and cluster-based undersampling using k-means. Different data subsets were generated by combining different preprocessing methods, and performance was analyzed using an ML analysis pipeline from a previous study. This pipeline implements ten ML classifiers, including Random Forest, Support Vector Machine and Artificial Neural Network. For most ML algorithms, all data subsets showed a significant improvement in performance (p<0.05, Student's t-test) over the results obtained when the pipeline was first evaluated. The results suggest that kNN and k-means can be used in the data preprocessing phase to overcome missing values and class imbalance and to improve classification accuracy. [pt_BR]
dc.format.extent: 15f. [pt_BR]
dc.language: pt_BR [pt_BR]
dc.rights: Open Access [pt_BR]
dc.subject: Computer Science [pt_BR]
dc.subject: Machine Learning [pt_BR]
dc.subject: k-means clustering [pt_BR]
dc.subject: kNN [pt_BR]
dc.subject: Undersampling [pt_BR]
dc.subject: Missing data imputation [pt_BR]
dc.subject: Classification [pt_BR]
dc.title: Pancreatic cancer classification using missing data imputation techniques and clustering-based undersampling: a comparative analysis with different Machine Learning algorithms [pt_BR]
dc.type: Article [pt_BR]
dc.creator.Lattes: http://lattes.cnpq.br/3122764958081123 [pt_BR]
dc.contributor.advisor1: Neves, Renata Freire de Paiva
dc.contributor.advisor1Lattes: http://lattes.cnpq.br/9029559122700209 [pt_BR]
dc.contributor.referee1: Neves, Renata Freire de Paiva
dc.contributor.referee2: Ferreira, Aida Araújo
dc.contributor.referee3: Macedo, Samuel Victor Medeiros de
dc.contributor.referee1Lattes: http://lattes.cnpq.br/9029559122700209 [pt_BR]
dc.contributor.referee2Lattes: http://lattes.cnpq.br/8515798754882166 [pt_BR]
dc.contributor.referee3Lattes: http://lattes.cnpq.br/0753964115099661 [pt_BR]
dc.publisher.department: Recife [pt_BR]
dc.publisher.country: Brazil [pt_BR]
dc.subject.cnpq: CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO::SISTEMAS DE COMPUTACAO [pt_BR]
dc.description.resumo: Missing data and class imbalance are problems frequently observed in databases from real-world scenarios, including cancer classification. If these problems are not adequately addressed before the analysis, the performance of Machine Learning (ML) models can be affected. This article proposes a combined solution that imputes missing data using kNN (k-nearest neighbors) and applies clustering-based undersampling using k-means, focusing on pancreatic cancer classification. Different data subsets were generated by combining different preprocessing methods, and their performance was analyzed using an ML analysis pipeline from a previous study. This pipeline runs ten ML algorithms, including Random Forest, Support Vector Machine and Artificial Neural Networks. For most ML algorithms, all generated data subsets showed a significant increase in performance (p<0.05 with Student's t-test) compared with the results obtained when the pipeline was first evaluated. The results suggest that kNN and k-means can be used in the data preprocessing phase to resolve missing data and class imbalance problems and to improve classification accuracy. [pt_BR]
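
The two preprocessing steps named in the abstract, kNN imputation of missing values and k-means-based undersampling of the majority class, can be sketched roughly as below. This is a minimal standard-library illustration, not the author's implementation: the record does not say which library or which undersampling variant was used. The function names, the toy data, the order of the two steps, and the choice to keep the real sample nearest each centroid are all assumptions made for the sketch.

```python
import math
import random


def knn_impute(rows, k=2):
    """Replace None entries with the mean of that feature over the k nearest rows.

    Distance is the root mean squared difference over features observed in
    both rows, a simplification chosen for this sketch.
    """
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = []  # (distance, donor value) for rows where feature j is observed
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                    donors.append((dist, other[j]))
            nearest = sorted(donors, key=lambda d: d[0])[:k]
            if nearest:  # leave the gap if no donor row is comparable
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled


def kmeans(points, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns the final centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, n_clusters)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(n_clusters)]
        for p in points:
            nearest = min(range(n_clusters), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def cluster_undersample(majority, n_keep, seed=0):
    """Keep the real majority sample closest to each of n_keep k-means centroids."""
    kept = []
    for centroid in kmeans(majority, n_keep, seed=seed):
        nearest = min(majority, key=lambda p: math.dist(p, centroid))
        if nearest not in kept:
            kept.append(nearest)
    return kept


# Toy data (hypothetical, not from the study): six majority rows, two minority rows.
rows = [[1.0, 1.0], [1.2, 0.9], [0.9, None], [5.0, 5.0], [5.1, 4.9], [4.8, 5.2],
        [3.0, 3.0], [3.2, 2.9]]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

imputed = knn_impute(rows, k=2)  # assumed order: impute first, then undersample
majority = [r for r, y in zip(imputed, labels) if y == 0]
minority = [r for r, y in zip(imputed, labels) if y == 1]
kept = cluster_undersample(majority, n_keep=len(minority))
balanced = kept + minority
print(len(majority), "majority rows reduced to", len(kept))
```

Setting the number of clusters to the minority class size is one common way to balance the classes. In practice, library implementations such as scikit-learn's KNNImputer and KMeans would probably replace the hand-written functions.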


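The significance criterion quoted in the abstract (p<0.05 with Student's t-test) could be checked along the following lines. The record does not say whether the test was paired or independent, nor how many scores were compared, so this sketch assumes an independent two-sample test with pooled variance on ten scores per condition. The scores are hypothetical illustration values, not results from the study, and the comparison uses the tabulated two-sided critical value for 18 degrees of freedom instead of computing a p-value.

```python
import math
import statistics


def student_t(sample_a, sample_b):
    """Two-sample Student's t statistic with pooled variance (equal variances assumed)."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled * (1 / n_a + 1 / n_b))
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / std_err


# Hypothetical per-run scores, NOT taken from the study.
baseline = [0.58, 0.60, 0.57, 0.61, 0.59, 0.60, 0.58, 0.62, 0.59, 0.60]
preprocessed = [0.66, 0.68, 0.65, 0.69, 0.67, 0.66, 0.68, 0.70, 0.67, 0.66]

# Two-sided critical value of Student's t for alpha = 0.05 and df = 10 + 10 - 2 = 18.
T_CRIT_DF18 = 2.101

t_stat = student_t(preprocessed, baseline)
significant = abs(t_stat) > T_CRIT_DF18
print(f"t = {t_stat:.2f}, significant at the 0.05 level: {significant}")
```

If the two conditions were evaluated on the same cross-validation folds, a paired test on the per-fold differences would be the more appropriate variant.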