Abstract:

Deep ensembles are a simple and effective method for improving both predictive performance and epistemic uncertainty estimation in deep learning. However, their high computational cost, especially at inference time, limits their practicality in real-world deployments. Ensemble distillation offers a promising solution by training a single student model to match the ensemble’s predictive distribution. Yet existing approaches typically assume full access to all M teacher predictions during training, which is often infeasible due to compute constraints, memory limitations, or asynchronous model evaluations. Here we introduce STEDD (Stochastic Teacher-sampling for Ensemble Distribution Distillation), a framework for distilling both the mean and variance of an ensemble using only a small number of random teacher queries per input, as few as one. STEDD includes three estimators tailored to different access regimes and provides theoretical guarantees for convergence and calibration. Experiments on genomics and vision benchmarks demonstrate that STEDD preserves ensemble-level performance and uncertainty calibration while significantly reducing training-time cost.
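The core idea of stochastic teacher sampling can be illustrated with a minimal sketch. The abstract does not specify STEDD's actual estimators, so the function below is a hypothetical illustration of the general setting: for each input, only k of the M teachers are queried, and the sampled predictions give unbiased estimates of the ensemble mean and across-teacher variance (the variance estimate requires k >= 2). All names here are assumptions, not the paper's API.

```python
import numpy as np

def sampled_ensemble_targets(teacher_probs, k, rng):
    """Estimate ensemble mean and variance from k randomly sampled teachers.

    teacher_probs: array of shape (M, C) holding the M teachers' predictive
        distributions for one input; in practice only the k sampled rows
        would ever be computed.
    Returns an unbiased estimate of the ensemble mean and, for k >= 2, an
    unbiased estimate of the across-teacher variance.
    """
    M = teacher_probs.shape[0]
    idx = rng.choice(M, size=k, replace=False)   # random teacher queries
    sampled = teacher_probs[idx]
    mean_hat = sampled.mean(axis=0)              # unbiased for the ensemble mean
    var_hat = sampled.var(axis=0, ddof=1)        # unbiased for teacher variance
    return mean_hat, var_hat

# Toy usage: M = 5 teachers over C = 3 classes, querying k = 2 per input.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 3))
teacher_probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
mean_hat, var_hat = sampled_ensemble_targets(teacher_probs, k=2, rng=rng)
```

A student would then be trained to regress onto these sampled targets; averaged over minibatches, the stochastic targets match the full-ensemble statistics in expectation, which is what lets training proceed without ever evaluating all M teachers on one input.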

Paper In Preparation
