A Comparative Study of Multi-source Data Fusion Approaches for Credit Default Early Warning
DOI:
https://doi.org/10.69987/AIMLR.2024.50109
Keywords:
Credit Default Prediction, Multi-source Data Fusion, Ensemble Learning, Feature Engineering
Abstract
This study presents a comparative analysis of multi-source data fusion approaches for early warning of credit defaults in financial institutions. The research integrates heterogeneous data sources, including credit bureau records, transaction behavior patterns, textual financial reports, and macroeconomic indicators. Three fusion strategies (early fusion, late fusion, and hybrid fusion) are systematically evaluated using ensemble machine learning algorithms, including XGBoost, LightGBM, and Random Forest. Experimental results on a real-world dataset comprising 125,847 credit records demonstrate that the hybrid fusion approach achieves the highest predictive performance with an AUC-ROC of 0.8934, outperforming the best single-source credit-bureau model (AUC-ROC 0.8234) by 7.0 percentage points (8.5% relative improvement). Feature importance analysis using SHAP values indicates that transaction behavior features account for 34.2% of the total feature importance, whereas NLP-extracted sentiment scores from financial texts account for 18.6%. Statistical tests, including DeLong's test and bootstrap confidence intervals, indicate that the hybrid fusion configuration significantly outperforms the early-fusion baseline (p < 0.001 for the difference in AUC).
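The three fusion strategies named in the abstract can be illustrated with a minimal Python sketch. The abstract does not define how the hybrid configuration is built, so the version below assumes one common reading: per-source model scores are appended to the concatenated features and passed to a meta-model. All function names, the toy `mean_score` model, the equal weights, and the sample values are hypothetical and stand in for the trained XGBoost, LightGBM, or Random Forest models used in the study.

```python
from typing import Callable, Dict, List

Vector = List[float]
Model = Callable[[Vector], float]  # maps a feature vector to a default score


def early_fusion(sources: Dict[str, Vector]) -> Vector:
    """Concatenate every source's features into one vector for a single model."""
    fused: Vector = []
    for name in sorted(sources):  # fixed order keeps feature columns aligned
        fused.extend(sources[name])
    return fused


def late_fusion(sources: Dict[str, Vector],
                models: Dict[str, Model],
                weights: Dict[str, float]) -> float:
    """Score each source with its own model, then take a weighted average."""
    total_weight = sum(weights[name] for name in sources)
    weighted = sum(weights[name] * models[name](sources[name]) for name in sources)
    return weighted / total_weight


def hybrid_fusion(sources: Dict[str, Vector],
                  models: Dict[str, Model],
                  meta_model: Model) -> float:
    """Assumed hybrid: raw features plus per-source scores feed a meta-model."""
    per_source_scores = [models[name](sources[name]) for name in sorted(sources)]
    return meta_model(early_fusion(sources) + per_source_scores)


def mean_score(x: Vector) -> float:
    """Toy stand-in for a trained classifier (hypothetical)."""
    return sum(x) / len(x)


# Hypothetical, already-scaled features for one borrower.
borrower = {"bureau": [0.2, 0.4], "transactions": [0.9, 0.7, 0.8]}
models = {"bureau": mean_score, "transactions": mean_score}
weights = {"bureau": 1.0, "transactions": 1.0}

early_input = early_fusion(borrower)
late_score = late_fusion(borrower, models, weights)
hybrid_score = hybrid_fusion(borrower, models, mean_score)
```

In this sketch, early fusion lets one model learn cross-source interactions, late fusion keeps each source's model independent, and the assumed hybrid variant gives the meta-model access to both the raw features and the per-source scores.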

