Comparative Analysis of Filter-based Feature Selection Methods for High-Dimensional Data in Classification Tasks
DOI: https://doi.org/10.69987/JACS.2023.30803

Keywords: feature selection, high-dimensional data classification, feature filtering methods, dimensionality reduction

Abstract
High-dimensional data classification faces substantial computational barriers when feature spaces exceed sample sizes by orders of magnitude. Filter-based feature selection addresses this curse of dimensionality by keeping feature evaluation statistically independent of classifier training. This study examines six prevalent feature filtering methods on datasets ranging from 10³ to 10⁵ dimensions, measuring their impact on classification accuracy, computational overhead, and feature subset stability. Experimental results demonstrate that correlation-based approaches achieve 8.7% higher accuracy than variance thresholding on bioinformatics datasets while maintaining O(n log n) time complexity. Chi-square and mutual information methods exhibit comparable performance on categorical data but diverge in behavior on continuous features. The analysis reveals trade-offs between statistical power and computational tractability, with the F-score emerging as optimal for balanced datasets and ReliefF excelling under class imbalance. Performance degrades beyond 10⁴ features for correlation methods due to spurious associations, suggesting hybrid architectures for ultra-high-dimensional data.
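To make the filter paradigm concrete, the following is a minimal sketch of two of the methods the abstract names, variance thresholding and correlation-based ranking, applied to a small synthetic dataset. The data, the variance cutoff, and the choice of NumPy are illustrative assumptions, not the study's actual experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 100 samples, 50 features; only feature 0 carries class signal.
n, d = 100, 50
y = rng.integers(0, 2, size=n)          # binary class labels
X = rng.normal(size=(n, d))             # uninformative noise features
X[:, 0] += 2.0 * y                      # inject class signal into feature 0

# Filter 1 -- variance thresholding: drop features whose sample variance
# falls below a cutoff (here 0.1, an illustrative value).
variances = X.var(axis=0)
keep_var = variances > 0.1

# Filter 2 -- correlation-based ranking: score each feature by the absolute
# Pearson correlation with the label, computed from centered columns.
yc = y - y.mean()
Xc = X - X.mean(axis=0)
scores = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

# Rank features by score; the injected feature should come out on top.
top5 = np.argsort(scores)[::-1][:5]
```

Both scores are computed without training any classifier, which is the defining property of filter methods and the source of their low computational overhead relative to wrapper approaches.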