The Effect of Feature Selection Methods to Classification Performance in Health Datasets

Authors

DOI:

https://doi.org/10.52309/jai.2021.2

Abstract

Nowadays, as data sets collected from different devices become very high-dimensional and domain-specific, feature selection is an important pre-processing step for reducing data size in data mining. This study aims to improve classification performance while reducing computation time and cost by using feature selection methods. Feature selection methods are examined under three main headings: filter, wrapper, and embedded methods. Support vector machines, Naïve Bayes, and decision trees (J48) were used as machine learning classification algorithms. Data sets were obtained from the UCI and Kaggle repositories. Accuracy values were calculated to compare the classification performance of the algorithms. WEKA 3.8.3, R 3.3.0, and Tableau were used for all analyses. After unnecessary features were removed using appropriate methods, the classification performance and run time of each algorithm were calculated. After feature selection, accuracy values increased to 87% for Colorectal Histology MNIST, 85% for Parkinson's disease, 97% for SCADI, 100% for HCC, and 78% for breast cancer. The best-performing combination was the wrapper method with decision trees (J48). The filter method was the fastest, while the wrapper method had the longest run time. According to the results, the performance improvement after feature selection was greater for data sets with a large number of attributes. As a result, lower-dimensional data sets may provide higher classification accuracy at lower computation cost.
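
The abstract does not include the authors' WEKA/R pipelines. The sketch below is only an illustration of the kind of comparison described: filter-, wrapper-, and embedded-style feature selection evaluated with the three classifier families mentioned, using Python/scikit-learn and its built-in breast-cancer data as a stand-in for the study's UCI/Kaggle data sets. All parameter and dataset choices here are assumptions, not the paper's settings.

    # Illustrative comparison of feature selection strategies and classifiers.
    # Not the authors' method: scikit-learn stands in for WEKA 3.8.3 / R 3.3.0,
    # and the built-in breast-cancer data stands in for the study's data sets.
    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    classifiers = {
        "decision tree (C4.5-like)": DecisionTreeClassifier(random_state=0),
        "naive Bayes": GaussianNB(),
        "SVM (RBF)": SVC(kernel="rbf"),
    }

    # Three selection strategies loosely matching the abstract's headings,
    # plus a no-selection baseline. k=10 is an arbitrary illustrative choice.
    selectors = {
        "no selection": "passthrough",
        "filter (ANOVA F-score, k=10)": SelectKBest(f_classif, k=10),
        "wrapper (RFE with tree, k=10)": RFE(
            DecisionTreeClassifier(random_state=0), n_features_to_select=10),
        "embedded (tree importances)": SelectFromModel(
            DecisionTreeClassifier(random_state=0), threshold="mean"),
    }

    # 10-fold cross-validated accuracy for every selector/classifier pair.
    for sel_name, selector in selectors.items():
        for clf_name, clf in classifiers.items():
            pipe = Pipeline([("scale", StandardScaler()),
                             ("select", selector),
                             ("clf", clf)])
            acc = cross_val_score(pipe, X, y, cv=10, scoring="accuracy").mean()
            print(f"{sel_name:32s} + {clf_name:26s}: {acc:.3f}")

As in the study, the wrapper approach is the most expensive to run because it repeatedly refits the underlying model, while the filter approach scores each feature once and is the fastest.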

Published

2021-04-15