Human-Uncertainty Distillation for Calibrated Vision Models on CIFAR-10H

Ziliang Samuel Zhong; Ruiyan Ma; Hailey Zhao

doi:10.69987/JACS.2023.30206

Authors

Ziliang Samuel Zhong, New York University, NY, USA
Ruiyan Ma, Software Engineering, UC Irvine, CA, USA
Hailey Zhao, Business Analytics, Columbia University, NY, USA

Keywords:

uncertainty calibration, CIFAR-10H, human soft labels, knowledge distillation, label distributions, selective prediction, robustness

Abstract

Human uncertainty is informative when a visual example is genuinely ambiguous, because a full label distribution captures plausible class confusions that a one-hot hard label suppresses. This paper evaluates human-uncertainty distillation (HUD) directly on CIFAR-10H, which provides human label distributions for the 10,000-image CIFAR-10 test set. The study uses a stratified 60/20/20 split of CIFAR-10H, yielding 6,000 training images, 2,000 validation images, and 2,000 test images. A compact vision classifier is trained on standardized HOG and color descriptors so that the effect of the supervision signal can be isolated from that of a larger backbone. HUD combines label-smoothed hard-label supervision with human soft-label distillation whose weight increases on high-entropy human targets, together with a small entropy-alignment penalty. On the held-out test split, HUD reached 58.43% top-1 accuracy, 1.1872 negative log-likelihood (NLL), 0.0284 expected calibration error (ECE), 0.5449 Brier score, and 0.2188 area under the risk-coverage curve (AURC). Relative to standard cross-entropy training, HUD improved accuracy by 1.50 percentage points, reduced NLL by 3.6%, reduced ECE by 50.9%, and reduced Brier score by 4.0%. Label smoothing remained a strong baseline, but HUD produced the best student NLL, Brier score, human-label cross-entropy, and selective-prediction AURC. Under five corruption families at three severities, HUD improved mean corrupted accuracy from 41.45% to 42.01% and reduced mean corrupted ECE from 0.1574 to 0.1233. These results indicate that real human soft labels can improve likelihood, calibration, selective prediction, and robustness even when top-1 gains are modest.
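To make the three ingredients of the HUD objective concrete, the following is a minimal per-example sketch in plain Python. The abstract does not give the exact functional form, so everything specific here is an assumption made for illustration: the linear entropy-dependent weight, the squared-difference entropy penalty, and the hypothetical parameter names and default values (smoothing, base_weight, entropy_gain, align_weight).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(q + eps) for t, q in zip(target, pred))


def distillation_weight(human_probs, base_weight=0.5, entropy_gain=1.0):
    """Assumed form: the weight grows linearly with the normalized human entropy."""
    h_human = entropy(human_probs) / math.log(len(human_probs))
    return base_weight * (1.0 + entropy_gain * h_human)


def hud_loss(logits, hard_label, human_probs,
             smoothing=0.1, base_weight=0.5, entropy_gain=1.0, align_weight=0.05):
    """Illustrative HUD-style loss for one example (not the authors' exact formula)."""
    k = len(logits)
    pred = softmax(logits)

    # 1) Label-smoothed hard-label term.
    smooth_target = [smoothing / k] * k
    smooth_target[hard_label] += 1.0 - smoothing
    hard_term = cross_entropy(smooth_target, pred)

    # 2) Human soft-label distillation, weighted more on ambiguous images.
    weight = distillation_weight(human_probs, base_weight, entropy_gain)
    soft_term = cross_entropy(human_probs, pred)

    # 3) Entropy alignment: penalize a mismatch between the normalized
    #    predictive entropy and the normalized human entropy.
    h_pred = entropy(pred) / math.log(k)
    h_human = entropy(human_probs) / math.log(k)
    align_term = (h_pred - h_human) ** 2

    return hard_term + weight * soft_term + align_weight * align_term
```

Under these assumptions, an image on which annotators agree unanimously receives the base distillation weight, while an image with a uniform human label distribution receives twice that weight, so ambiguous examples pull the student more strongly toward the human distribution.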