The Impact of Feature Selection on the Information Held in Bioinformatics Data

Madara Gasparovica-Asite; Inese Polaka; Ludmila Aleksejeva

Abstract

This study examines a wide range of attribute selection methods: 86 methods covering both ranking and subset evaluation approaches. Their effectiveness is evaluated on bioinformatics data sets provided by the Latvian Biomedical Research and Study Centre. The data sets are intended for diagnostic tasks and contain values of more than 1000 proteomics features together with a diagnosis (a specific cancer or healthy) determined by a gold standard method (biopsy and histological analysis). The diagnostic task is solved with the classification algorithms FURIA, RIPPER, C4.5, CART, KNN, SVM, FB+ and GARF, applied to the initial data sets and to various data sets with reduced dimensionality. The paper closes with conclusions about the most effective attribute subset selection methods for classification of diagnostic proteomics data.
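As a rough illustration of the workflow the abstract describes (rank the attributes, keep a reduced subset, then compare classification on the initial and the reduced data), the following minimal sketch may help. Everything in it is an assumption made for illustration: the synthetic data, the mean-difference ranking score, the 1-nearest-neighbour classifier, and the function names `make_toy_data`, `rank_features` and `loo_1nn_accuracy`. It is not one of the 86 methods evaluated in the paper and does not use the paper's data.

```python
import random
import statistics


def make_toy_data(n_per_class=20, n_noise=50, seed=0):
    """Hypothetical stand-in for a diagnostic data set: features 0 and 1
    carry class information, the remaining features are pure noise."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            informative = [rng.gauss(5.0 * label, 1.0) for _ in range(2)]
            noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
            X.append(informative + noise)
            y.append(label)
    return X, y


def rank_features(X, y):
    """Simple filter ranking (an assumed score, not taken from the paper):
    absolute difference of class means divided by the summed class spreads."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        spread = (statistics.pstdev(a) + statistics.pstdev(b)) or 1e-12
        score = abs(statistics.mean(a) - statistics.mean(b)) / spread
        scores.append((score, j))
    # Best-scoring feature index first.
    return [j for _, j in sorted(scores, reverse=True)]


def loo_1nn_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier that
    only looks at the given feature indices."""
    correct = 0
    for i, xi in enumerate(X):
        best_dist, best_label = float("inf"), None
        for k, xk in enumerate(X):
            if k == i:
                continue
            dist = sum((xi[j] - xk[j]) ** 2 for j in features)
            if dist < best_dist:
                best_dist, best_label = dist, y[k]
        correct += best_label == y[i]
    return correct / len(X)


X, y = make_toy_data()
ranking = rank_features(X, y)
top_k = ranking[:2]

# Caveat: ranking on the full data before leave-one-out evaluation is
# optimistic; a careful study would repeat the selection inside each fold.
acc_full = loo_1nn_accuracy(X, y, range(len(X[0])))
acc_reduced = loo_1nn_accuracy(X, y, top_k)

print("selected features:", sorted(top_k))
print("accuracy, all features:", acc_full)
print("accuracy, reduced set:", acc_reduced)
```

A ranking approach like this scores each attribute independently; a subset evaluation approach would instead score candidate groups of attributes together, which this sketch does not attempt.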

Keywords:

Bioinformatics; classification; data mining; diagnostics; feature selection

This work is licensed under a Creative Commons Attribution 4.0 International License.

Information Technology and Management Science