Comparative Study of Chronic Kidney Disease Predictor Performance Given Insufficient Training Dataset

Oluwadamilare Alabi


This study compares the performance of logistic regression and Classification and Regression Tree (CART) model implementations in predicting chronic kidney disease outcomes from predictor variables when training data are scarce. Missing values were imputed using a technique based on k-nearest neighbours. To simulate a shortage of training data, the dataset was deliberately split into a 10 % training set and a 90 % test set. Performance was assessed primarily by accuracy, supplemented by ROC curves, area under the ROC curve (AUC) values, and confusion matrices. Results were validated with a shuffled 5-fold cross-validation procedure. Logistic regression achieved an average accuracy of approximately 99 %, compared with approximately 97 % for the decision tree.
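The pipeline described above can be sketched with scikit-learn. This is a minimal illustration, not the paper's exact implementation: the UCI chronic kidney disease dataset is replaced by a synthetic binary-classification dataset so the example is self-contained, and the hyperparameters (5 imputation neighbours, the random seeds, 10 % missingness) are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import roc_auc_score, confusion_matrix

# Synthetic stand-in for the CKD data (the real study uses the UCI dataset).
X, y = make_classification(n_samples=400, n_features=24, random_state=0)

# Knock out ~10 % of entries at random to mimic missing clinical values.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.10] = np.nan

# Impute missing entries with a k-nearest-neighbours scheme.
X = KNNImputer(n_neighbors=5).fit_transform(X)

# Deliberately small training set: 10 % train / 90 % test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.10, stratify=y, random_state=0)

# Fit both classifiers and report accuracy, AUC, and the confusion matrix.
for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("decision tree (CART)", DecisionTreeClassifier(random_state=0))]:
    model.fit(X_tr, y_tr)
    acc = model.score(X_te, y_te)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    cm = confusion_matrix(y_te, model.predict(X_te))
    print(f"{name}: accuracy={acc:.3f}, AUC={auc:.3f}")
    print(cm)

# Validate with a shuffled 5-fold cross-validation procedure.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(f"shuffled 5-fold CV mean accuracy: {scores.mean():.3f}")
```

Note that `KFold(shuffle=True)` randomises the row order before folding, which matters when the original dataset is sorted by class, as the UCI CKD data is.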


Binary classification; decision tree; logistic regression; machine learning





DOI: 10.7250/itms-2022-0001



Copyright (c) 2022 Oluwadamilare Alabi

This work is licensed under a Creative Commons Attribution 4.0 International License.