dc.description.abstract | In recent years, Public Administration has made significant progress in modernizing government systems in alignment with the Digital Government policy. This movement encourages the adoption of disruptive technologies, such as automation and machine learning, which can make administrative processes more efficient and improve the quality of public services. Data processing has become a central issue in this context, requiring compliance with the Access to Information Law (LAI) and the General Data Protection Law (LGPD). Furthermore, strict adherence to information security protocols is essential to safeguard sensitive data. This article evaluates a machine learning model for classifying confidential information based on historical data from the Federal Government. The study is divided into three sections: the first presents the justification and motivation for the work, along with the theoretical framework, considering advances in the use of information and communication technologies (ICTs) in public services; the second describes the experimental structure, detailing the methodology used for data collection and processing, as well as the experiments conducted; the third discusses the results, the challenges faced, and future work. The Random Forest model classified confidential information with an accuracy of 94.39%. Other metrics, such as the Precision-Recall Curve (PRC), were also analyzed. However, challenges related to class imbalance and to distinguishing between confidentiality categories were identified, highlighting the need for future refinements in modeling and feature selection and for the collection of a more representative volume of samples. | pt_BR |
dc.relation | ALIFERIS, Constanti; SIMON, Gyorgy. Overfitting, Underfitting and General Model Overconfidence and Under-Performance Pitfalls and Best Practices in Machine Learning and AI. Computers in health care (New York), p. 477–524, jan. 2024.
BERRAR, Daniel et al. Cross-validation. 2019. Disponível em: https://dberrar.github.io/papers/Berrar_EBCB_2nd_edition_Cross-validation_preprint.pdf. Acesso em 12 de fevereiro de 2025.
BORRA, S.; CIACCIO, A. Measuring the Prediction Error. A Comparison of Cross-Validation, Bootstrap and Covariance Penalty Methods. Computational Statistics & Data Analysis, v. 54, p. 2976-2989. 2010.
BERGSTRA, J.; BENGIO, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, v. 13, n. Feb, p. 281–305, 2012.
BRASIL. Estratégia de Governo Digital 2023-2026. [S.l.: s.n.], 2024. Disponível em: https://www.gov.br/governodigital/pt-br/estrategias-e-governanca-digital/estrategianacional. Acesso em 22 de outubro de 2024.
CASTRO, C. L. DE; BRAGA, A. P. Aprendizado supervisionado com conjuntos de dados desbalanceados. Sba: Controle & Automação Sociedade Brasileira de Automatica, v. 22, n. 5, p. 441–466, set. 2011.
CHAWLA, N. V.; BOWYER, K. W.; KEGELMEYER, P. W. SMOTE: Synthetic minority
over-sampling technique. Journal of Artificial Intelligence Research, v. 16, p. 321-357,
jun. 2002.
CRISTÓVAM, José Sérgio da Silva; HAHN, Tatiana Meinhart. Administração Pública
Orientada Por Dados: Governo Aberto E Infraestrutura Nacional De Dados Abertos. Revista
de Direito Administrativo e Gestão Pública, v. 6, n. 1, p. 1–24, ago. 2020. Disponível em:
https://www.indexlaw.org/index.php/rdagp/article/view/6388. Acesso em 23 de outubro de
2024.
CRISTÓVAM, José Sérgio da Silva; SAIKALI, Lucas Bossoni; SOUSA, Thanderson Pereira
de. Governo Digital na Implementação de Serviços Públicos para a Concretização de
Direitos Sociais no Brasil. Sequência (Florianópolis). Programa de Pós-Graduação em
Direito da Universidade Federal de Santa Catarina, n. 84, p. 209–242, jan. 2020. Disponível
em: https://doi.org/10.5007/2177-7055.2020v43n89p209. Acesso em 21 de outubro de 2024.
DOSHI-VELEZ, Finale; KIM, Been. Towards A Rigorous Science of Interpretable
Machine Learning. [S.l.: s.n.], 2017. Disponível em: https://arxiv.org/abs/1702.08608.
Acesso em 12 de novembro de 2024.
ERICKSON, Bradley J.; KITAMURA, Felipe. Magician’s Corner: 9. Performance Metrics for
Machine Learning Models. Radiology: Artificial Intelligence, v. 3, n. 3, 2021. Disponível
em: https://doi.org/10.1148/ryai.2021200126. Acesso em 11 de novembro de 2024.
FARQUAD, M.A.H.; BOSE, Indranil. Preprocessing unbalanced data using support vector
machine. Decision Support Systems, v. 53, n. 1, p. 226–233, 2012. Disponível em:
https://www.sciencedirect.com/science/article/pii/S0167923612000425. Acesso em 11 de
novembro de 2024.
GARCÍA-PABLOS, Aitor; PÉREZ, Naiara; CUADROS, Montse. Sensitive Data Detection
and Classification in Spanish Clinical Text: Experiments with BERT. 2020. Disponível
em: https://api.semanticscholar.org/CorpusID:212628622. Acesso em 9 de janeiro de 2025.
GOUVEIA, Luís Borges. Local e-Government-A governação digital na autarquia.
SPI/Principia, 2004. Disponível em: https://bdigital.ufp.pt/handle/10284/263. Acesso em 11
de novembro de 2024.
HAYKIN, S. Redes neurais: princípios e prática. [S.l.]: Bookman Editora, 2007.
HE, H; GARCIA, E. A. Learning from imbalanced data. IEEE Transactions on Knowledge
and Data Engineering, v. 21, n. 9, p. 1263–1284. 2009.
HULJANAH, Mia et al. Feature Selection using Random Forest Classifier for Predicting
Prostate Cancer. IOP Publishing, v. 546, n. 5, p. 052031, 2019. Disponível em:
https://dx.doi.org/10.1088/1757-899X/546/5/052031. Acesso em 10 de novembro de 2024.
KOHAVI, Ron. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and
Model Selection. Mar. 1995. Disponível em:
https://www.researchgate.net/publication/2352264_A_Study_of_Cross-Validation_and_Bootstrap_for_Accuracy_Estimation_and_Model_Selection. Acesso em 19 fev. de 2025.
KIM, Ji-Hyun. Estimating classification error rate: Repeated cross-validation, repeated
hold-out and bootstrap. Computational Statistics & Data Analysis, v. 53, n. 11, p.
3735-3745. 2009.
LIMA, Tiago et al. Previsão de óbito e importância de características clínicas em idosos com
COVID19 utilizando o Algoritmo Random Forest. Revista Brasileira de Saúde Materno
Infantil. v. 21, p. 445-451, mar. 2021. Disponível em: http://higia.imip.org.br/handle/123456789/878. Acesso em 15 nov. 2024.
LUDERMIR, Teresa Bernarda. Inteligência Artificial e Aprendizado de Máquina: estado atual
e tendências. Estudos Avançados, Instituto de Estudos Avançados da Universidade de São
Paulo, v. 35, n. 101, p. 85–94, jan. 2021. Disponível em: https://doi.org/10.1590/s0103-4014.2021.35101.007. Acesso em 15 nov. 2024.
MIAO, Jiaju; ZHU, Wei. Precision–recall curve (PRC) classification trees. Evolutionary
Intelligence, v. 15, p. 1545–1569, 2022. Disponível em: https://arxiv.org/abs/2011.07640. Acesso em 13 de fevereiro de 2025. | pt_BR |