Credit Card Default Risk Tiering with Probability Calibration and Uncertainty-Driven Rejection: A Reproducible Study on the UCI Credit Card Clients Dataset

Authors

  • Yuanzheng Chen, Accounting, UIUC, IL, USA
  • Yitian Zhang, Accounting, The University of Wisconsin-Madison (UW-Madison), WI, USA
  • David Chau, Computer Engineering, Dartmouth College, NH, USA
  • Matt Sherman, Computer Engineering, Dartmouth College, NH, USA

DOI:

https://doi.org/10.69987/JACS.2023.30403

Keywords:

credit scoring, default prediction, probability calibration, expected calibration error, Brier score, selective classification

Abstract

Accurate probability of default (PD) estimates are central to credit risk management, yet modern tabular classifiers can be miscalibrated and overly confident, complicating downstream decisions such as pricing, limit management, and manual review. This paper presents a fully reproducible empirical study of calibration and uncertainty-aware decision policies for credit card default prediction using the UCI Default of Credit Card Clients dataset (30,000 clients; 23 features; Taiwan; 2005) introduced by Yeh and Lien and distributed via the UCI repository. We compare logistic regression (LR), gradient-boosted decision trees (XGBoost), and a lightweight TabTransformer neural architecture on a fixed train/validation/test split (18k/6k/6k) and evaluate both discrimination (ROC-AUC, PR-AUC) and calibration (Brier score, expected calibration error (ECE)). On the held-out test set, XGBoost achieves the best ranking performance (ROC-AUC=0.778, PR-AUC=0.554), followed by TabTransformer (ROC-AUC=0.767, PR-AUC=0.540) and LR (ROC-AUC=0.759, PR-AUC=0.526). We then apply post-hoc calibration (Platt scaling, isotonic regression, and temperature scaling) and quantify calibration changes via Brier and ECE. Finally, we operationalize uncertainty via predictive entropy and study a reject option: abstaining on the most uncertain cases yields coverage–risk trade-offs consistent with selective classification theory. For the temperature-scaled XGBoost model, selective risk (0–1 error among accepted predictions) drops from 0.182 at full coverage to 0.095 at 50% coverage, with 95% bootstrap confidence intervals reported. We also propose an uncertainty-driven risk tiering policy combining PD quantiles with a high-uncertainty “Review” bucket, producing sharply separated observed default rates on test (XGBoost Tier 1: 5.9%; Tier 4: 41.7%; Review: 49.7%). 
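The calibration metrics and temperature scaling described above can be illustrated with a minimal sketch. This is not the authors' released code; the equal-width 10-bin ECE and the grid-searched single-parameter temperature are assumed implementation details:

```python
import numpy as np

def ece(probs, labels, n_bins=10):
    """Expected calibration error: confidence-weighted mean |observed rate - mean
    predicted probability| over equal-width probability bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total, err = len(probs), 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.any():
            conf = probs[mask].mean()   # mean predicted PD in the bin
            acc = labels[mask].mean()   # observed default rate in the bin
            err += mask.sum() / total * abs(acc - conf)
    return err

def temperature_scale(logits, labels, temps=np.linspace(0.5, 5.0, 91)):
    """Fit a single temperature T on validation logits by grid-searching the
    negative log-likelihood (a simple stand-in for gradient-based fitting)."""
    def nll(T):
        p = 1.0 / (1.0 + np.exp(-logits / T))
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return min(temps, key=nll)
```

Because temperature scaling divides all logits by one fitted constant, it adjusts confidence without changing the ranking of scores, which is why ROC-AUC is unaffected while Brier and ECE can improve.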
Overall, the results show that calibration and uncertainty-aware policies materially improve decision reliability beyond headline AUC, and they provide a practical template for risk-tier design on tabular credit datasets.
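As a companion sketch, the entropy-based reject option reported in the abstract can be expressed as a coverage–risk computation. The binary predictive entropy as the uncertainty score and the 0.5 decision threshold are assumptions for illustration, not details taken from the paper:

```python
import numpy as np

def coverage_risk(probs, labels, coverage):
    """Selective risk at a target coverage: accept the least-uncertain fraction
    of cases (lowest binary predictive entropy) and return the 0-1 error rate
    among the accepted predictions."""
    p = np.clip(probs, 1e-12, 1 - 1e-12)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))  # binary predictive entropy
    n_keep = int(round(coverage * len(p)))
    keep = np.argsort(entropy)[:n_keep]                   # most confident cases first
    preds = (p[keep] >= 0.5).astype(int)
    return float(np.mean(preds != labels[keep]))
```

Sweeping `coverage` from 1.0 down to 0.5 traces the coverage–risk curve; on a well-behaved model, selective risk falls monotonically as the most uncertain cases are routed to manual review.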

Published

2023-04-10

How to Cite

Yuanzheng Chen, Yitian Zhang, David Chau, & Matt Sherman. (2023). Credit Card Default Risk Tiering with Probability Calibration and Uncertainty-Driven Rejection: A Reproducible Study on the UCI Credit Card Clients Dataset. Journal of Advanced Computing Systems, 3(4), 31–47. https://doi.org/10.69987/JACS.2023.30403
