Evaluating Machine Learning Approaches for Sensitive Data Identification: A Comparative Study of NLP and Rule-Based Methods

Jin Zhang

doi:10.69987/JACS.2024.40703

Authors

Jin Zhang, Computer Science, Illinois Institute of Technology, IL, USA

Keywords:

Data Leakage Prevention, Personally Identifiable Information Detection, Natural Language Processing, Machine Learning Security

Abstract

We present an empirical evaluation comparing machine learning approaches for detecting personally identifiable information (PII) in digital systems. Through systematic experimentation on 5.15 million database records and 855,000 documents containing 23.05 million PII entities, we assess natural language processing (NLP) techniques against traditional rule-based methods. Our experiments measure detection accuracy, computational efficiency, and deployment complexity across thirteen entity categories. Results show transformer-based NLP methods reaching a macro-averaged F1-score of 0.917, exceeding rule-based baselines (0.860) by 5.7 percentage points (a 6.6% relative improvement). Hybrid architectures combining both approaches achieve an F1-score of 0.935 with 1.56× higher throughput than pure NLP implementations. We quantify the trade-offs between accuracy and computational overhead across five database management systems, providing practitioners with empirical guidance for implementing data leakage prevention systems.
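The abstract reports a macro-averaged F1-score and states the gain over the rule-based baseline both in percentage points and as a relative improvement. The following is a minimal sketch of that arithmetic, not the study's evaluation code. The function names and the per-category counts in the toy example are hypothetical; only the scores 0.860 and 0.917 come from the abstract.

```python
from typing import Dict, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from true-positive, false-positive, and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Unweighted mean of per-category F1, so every category counts equally."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


def improvement(baseline: float, candidate: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_pp = (candidate - baseline) * 100
    relative_pct = (candidate - baseline) / baseline * 100
    return absolute_pp, relative_pct


# Hypothetical (tp, fp, fn) counts for two entity categories, illustration only.
toy_counts = {"EMAIL": (90, 10, 10), "SSN": (50, 0, 50)}
toy_macro = macro_f1(toy_counts)

# Scores reported in the abstract: rule-based 0.860, transformer-based NLP 0.917.
abs_pp, rel_pct = improvement(0.860, 0.917)
print(f"{abs_pp:.1f} percentage points, {rel_pct:.1f}% relative")
```

Because the macro average weights all categories equally, a rare entity type with poor recall lowers the score as much as a frequent one would. The same `improvement` helper can be applied to the hybrid score of 0.935 to compare it against either of the other two approaches.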