Variability of Classification Results in Data with High Dimensionality and Small Sample Size

Jana Busa, Inese Polaka


The study focuses on the analysis of biological data containing counts of genome sequences of gut microbiome bacteria before and after antibiotic use. The data have high dimensionality (bacterial taxa) and a small number of records, which is typical of bioinformatics data. Classification models induced on such data sets are usually unstable, and their accuracy metrics have high variance. The aim of the study is to create a preprocessing workflow and a classification model that classify the microbiome into before- and after-antibiotic groups as accurately as possible and reduce the variability of the classifier's accuracy measures. Model performance was evaluated using the area under the ROC curve and the overall accuracy of the classifier. In the experiments, the authors examined how classification results were affected by feature selection and by an increased size of the data set.
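The workflow described above can be sketched in code. The snippet below is a minimal illustration, not the authors' exact pipeline: it builds a synthetic high-dimensionality, small-sample data set, nests a feature-selection step inside the classifier pipeline so that selection happens only on training folds, and uses repeated cross-validation to expose the variance of the AUC estimate. All parameter values (sample count, feature count, `k`) are illustrative assumptions.

```python
# Sketch of a feature-selection + classification workflow on high-dimensional,
# small-sample data, with repeated cross-validation to measure AUC variability.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# 60 samples x 1000 features mimics microbiome-style dimensionality.
X, y = make_classification(n_samples=60, n_features=1000, n_informative=20,
                           random_state=0)

# Putting feature selection inside the pipeline keeps it within each training
# fold and avoids leaking test-fold information into the model.
model = Pipeline([
    ("select", SelectKBest(f_classif, k=50)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 5-fold CV repeated 10 times yields 50 AUC scores; their spread shows how
# unstable the accuracy estimate is on a data set this small.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"AUC mean={scores.mean():.3f} std={scores.std():.3f}")
```

Rerunning the sketch with a larger `n_samples` would show the standard deviation of the fold scores shrinking, which is the dataset-size effect the study examines.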


Classification algorithms; feature selection; high dimensionality; machine learning




D. M. Camacho, K. M. Collins, R. K. Powers, J. C. Costello, and J. J. Collins, “Next-generation machine learning for biological networks,” Cell, vol. 173, no. 7, pp. 1581–1592, June 2018.

X.-B. Qian et al., “A guide to human microbiome research: study design, sample collection, and bioinformatics analysis,” Chinese Medical Journal, vol. 133, no. 15, pp. 1844–1855, June 2020.

M. Oh and L. Zhang, “DeepMicro: deep representation learning for disease prediction based on microbiome data,” Sci. Rep., vol. 10, no. 1, p. 6026, Apr. 2020.

H. Li and H. Li, “Introduction to special issue on statistics in microbiome and metagenomics,” Statistics in Biosciences, vol. 13, no. 2, pp. 197–199, Mar. 2021.

C. F. A. Ribeiro, G. Silveira, E. S. Candido, M. H. Cardoso, C. M. Espinola Carvalho, and O. L. Franco, “Effects of antibiotic treatment on gut microbiota and how to overcome its negative impacts on human health,” ACS Infect. Dis., vol. 6, no. 10, pp. 2544–2559, July 2020.

A. Golugula, G. Lee, and A. Madabhushi, “Evaluating feature selection strategies for high dimensional, small sample size datasets,” in 2011 Annu. Int. Conf. of the IEEE Eng. in Med. and Biol. Soc., Aug. 2011, pp. 949–952.

S. Bang, D. Yoo, S.-J. Kim, S. Jhang, S. Cho, and H. Kim, “Establishment and evaluation of prediction model for multiple disease classification based on gut microbial data,” Scientific Reports, vol. 9, no. 1, July 2019, Art. no. 10189.

B. D. Topcuoglu, N. A. Lesniak, M. Ruffin, J. Wiens, and P. D. Schloss, “A framework for effective application of machine learning to microbiome-based classification problems,” mBio, vol. 11, no. 3, June 2020.

L. J. Marcos-Zambrano et al., “Applications of machine learning in human microbiome studies: A review on feature selection, biomarker identification, disease prediction and treatment,” Frontiers in Microbiology, vol. 12, no. 313, Feb. 2021.

M. Ziemski, T. Wisanwanichthan, N. A. Bokulich, and B. D. Kaehler, “Beating naive Bayes at taxonomic classification of 16S rRNA gene sequences,” Front. Microbiol., vol. 12, p. 644487, June 2021.

A. Vabalas, E. Gowen, E. Poliakoff, and A. J. Casson, “Machine learning algorithm validation with a limited sample size,” PLoS One, vol. 14, no. 11, p. e0224365, Nov. 2019.

D. Brain and G. Webb, “On the effect of data set size on bias and variance in classification learning,” in Proceedings of the Fourth Australian Knowledge Acquisition Workshop, June 2000, pp. 117–128.

A. V. Joshi, Machine Learning and Artificial Intelligence. Cham, Switzerland: Springer, 2020.

C. Sammut and G. I. Webb, Encyclopedia of Machine Learning and Data Mining. New York: Springer Nature, 2017.

H. Zhou, “Decision trees,” in Learn Data Mining Through Excel: A Step-by-Step Approach for Understanding Machine Learning Methods. Berkeley, CA: Apress, 2020, pp. 125–148.

L. Igual and S. Seguí, Introduction to Data Science. A Python Approach to Concepts, Techniques and Applications (Undergraduate Topics in Computer Science). Switzerland: Springer, 2017.

G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning (Springer Texts in Statistics). New York: Springer-Verlag, 2013.

H. Rajaguru and S. K. Prabhakar, “kNN Classifier,” in KNN Classifier and K-Means Clustering for Robust Classification of Epilepsy From EEG Signals. A Detailed Analysis. Hamburg: Anchor Academic Publishing, 2017, ch. 3, pp. 31–38.

K. Ashley, “Neural networks,” in Applied Machine Learning for Health and Fitness: A Practical Guide to Machine Learning with Deep Vision, Sensors and IoT. Berkeley, CA: Apress, 2020, pp. 73–91.

A. Meyer-Baese and V. Schmid, “Foundations of neural networks,” in Pattern Recognition and Signal Analysis in Medical Imaging, A. Meyer-Baese and V. Schmid, Eds. Oxford: Academic Press, 2014, pp. 197–243.

V. Bolón-Canedo and A. Alonso-Betanzos, “Feature selection,” in Recent Advances in Ensembles for Feature Selection, vol. 147. Cham: Springer International Publishing, 2018, pp. 13–37.

J. Demšar et al., “Orange: data mining toolbox in Python,” Journal of Machine Learning Research, vol. 14, pp. 2349–2353, 2013.

I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal, Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers Inc., 2016.

DOI: 10.7250/itms-2021-0007



Copyright (c) 2021 Jana Busa, Inese Polaka

This work is licensed under a Creative Commons Attribution 4.0 International License.