A Model-Risk-Friendly Probability of Default Workflow: Calibration, Distribution-Free Uncertainty Quantification, and SHAP Explanations on the UCI Credit Card Default Dataset

Jiaying Jin; Tina Huang; Sam Lu

doi:10.69987/JACS.2024.40606

Authors

Jiaying Jin, Columbia University, NY, USA
Tina Huang, Computer Engineering, Columbia University, NY, USA
Sam Lu, Computer Science, Columbia University, NY, USA

Keywords:

Probability of default, credit risk, model risk management, calibration, conformal prediction, bootstrap ensembles, SHAP

Abstract

Probability of default (PD) models are central to retail credit risk, but governance frameworks require more than a single score. This paper develops an auditable PD workflow that combines predictive performance, probability calibration, uncertainty quantification, and explanation. Using the UCI Default of Credit Card Clients dataset (30,000 observations, 23 explanatory variables), we evaluate logistic regression, random forest, XGBoost, and LightGBM under a strict 60/10/10/20 train/probability-calibration/conformal-calibration/test split. Ranking performance is assessed with AUC, PR-AUC, and KS; probability quality is assessed with Brier score, log loss, and expected calibration error (ECE). On the held-out test set, raw (uncalibrated) XGBoost gives the strongest overall balance of discrimination and proper-score probability quality (AUC=0.7796, PR-AUC=0.5526, Brier=0.1351, log loss=0.4301), while sigmoid-calibrated logistic regression achieves the lowest ECE (0.0068). Split conformal prediction for the final model achieves empirical coverage of 0.9017 and 0.9462 at the 90% and 95% targets, with average set sizes of 1.222 and 1.469. Bootstrap sensitivity bands have average widths of 0.0852 (90%) and 0.0934 (95%), and the decile-level observed default rate falls within the mean interval bounds in every decile. SHAP analysis identifies recent repayment status and credit limit as the dominant global drivers. For model selection, proper scoring rules are the primary criterion for probability quality, while ECE and reliability plots serve as diagnostic complements. Overall, the results show that a PD workflow can be made calibration-aware, uncertainty-aware, and governance-ready without departing from standard supervised-learning tools.
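To make three of the steps named in the abstract concrete (the 60/10/10/20 split, ECE, and split conformal prediction sets), the following is a minimal NumPy sketch rather than the authors' implementation. Several things here are assumptions: the function names are hypothetical, ECE uses 10 equal-width bins, the nonconformity score is 1 minus the predicted probability of the true class (a common choice that the abstract does not specify), and the scores are synthetic calibrated probabilities standing in for a fitted PD model on the UCI data.

```python
import numpy as np

def expected_calibration_error(y_true, p_hat, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |observed rate - mean predicted PD|."""
    y_true = np.asarray(y_true, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    # Map each prediction to one of n_bins equal-width bins on [0, 1].
    bin_ids = np.minimum((p_hat * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_true[mask].mean() - p_hat[mask].mean())
            ece += mask.mean() * gap  # weight by the share of points in the bin
    return ece

def conformal_threshold(y_cal, p_cal, alpha):
    """Split conformal quantile of the score 1 - p(true class) (assumed score)."""
    y_cal = np.asarray(y_cal, dtype=int)
    p_cal = np.asarray(p_cal, dtype=float)
    p_true = np.where(y_cal == 1, p_cal, 1.0 - p_cal)
    scores = np.sort(1.0 - p_true)
    n = len(scores)
    k = int(np.ceil((n + 1) * (1.0 - alpha)))  # finite-sample corrected rank
    return np.inf if k > n else scores[k - 1]

def prediction_sets(p_hat, q_hat):
    """Boolean (n, 2) array: column j is True if label j is in the prediction set."""
    p_hat = np.asarray(p_hat, dtype=float)
    probs = np.column_stack([1.0 - p_hat, p_hat])
    return (1.0 - probs) <= q_hat

# Synthetic stand-in for model output (assumption): perfectly calibrated scores.
rng = np.random.default_rng(0)
n = 20_000
p = rng.uniform(0.02, 0.98, size=n)
y = (rng.random(n) < p).astype(int)

# 60/10/10/20 split: train / probability calibration / conformal calibration / test.
idx = rng.permutation(n)
cuts = [n * 60 // 100, n * 70 // 100, n * 80 // 100]
train_idx, prob_cal_idx, conf_cal_idx, test_idx = np.split(idx, cuts)

# Threshold from the conformal-calibration fold only, then evaluate on the test fold.
q_hat = conformal_threshold(y[conf_cal_idx], p[conf_cal_idx], alpha=0.10)
sets = prediction_sets(p[test_idx], q_hat)
coverage = sets[np.arange(len(test_idx)), y[test_idx]].mean()
avg_size = sets.sum(axis=1).mean()
ece = expected_calibration_error(y[test_idx], p[test_idx])
print(f"coverage={coverage:.4f} avg_set_size={avg_size:.3f} ECE={ece:.4f}")
```

Keeping the conformal-calibration fold separate from both the training and probability-calibration folds is what lets the coverage guarantee hold marginally on the test fold; reusing a fold for two purposes would break the exchangeability argument behind the threshold.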