Calibration in Neural Nets

Bijula Ratheesh
5 min read · May 30, 2021

I remember an interview I gave a few years back, where I was explaining a neural net classifier to the interviewer. He strongly believed only in Logistic Regression, and hence it was tough for me. One of his questions was 'how well calibrated are these models?', for which I didn't have an answer then. Accuracy is not the only factor; calibration is equally important in conveying the confidence of a system. Calibration becomes critical in models used for medical image analysis and autonomous driving.

Calibration is a procedure in statistical classification to determine class membership probabilities, which assess the uncertainty of a new observation belonging to each of the already established classes (Wikipedia); or, simply put, the 'probability of each class occurrence'.

Batch Normalization, reduced weight decay and increases in the depth/breadth of deep neural nets are supposedly the main reasons for miscalibration.

Logistic regression is, to my knowledge, one of the well-calibrated models. Even though Naïve Bayes, SVMs and Random Forests output a probability value, their interpretation differs widely from that of Logistic Regression. Read more about it here: https://scikit-learn.org/stable/modules/calibration.html
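As a quick illustration, scikit-learn's CalibratedClassifierCV can wrap a poorly calibrated classifier such as an SVM and recalibrate its scores with Platt scaling ('sigmoid') or isotonic regression. The toy dataset and parameters below are just placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV

# Toy data; any binary classification dataset works here.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SVC decision scores are not probabilities; the wrapper maps them
# to calibrated probabilities via Platt scaling on held-out folds.
calibrated = CalibratedClassifierCV(SVC(), method="sigmoid", cv=3)
calibrated.fit(X_train, y_train)

probs = calibrated.predict_proba(X_test)[:, 1]  # calibrated P(y=1 | x)
```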

In the quest for calibration in neural nets I came across a few state-of-the-art calibration techniques, such as Temperature Scaling, Dirichlet Calibration and, most recently, Calibration using Splines.

Measures of Calibration

There are two approaches to calibration in deep nets: probabilistic and measure-based. The probabilistic approach includes the Bayesian formalism, which requires a correctly estimated prior distribution. This is complex and computationally expensive, hence the measure-based approach is more practical.

In the measure-based approach, the main idea is to decrease the miscalibration of the network by minimizing a loss that is itself a calibration measure. The common calibration measures are: Negative Log Likelihood (NLL), Expected Calibration Error (ECE) and the Brier score.
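Of the three, the Brier score is the simplest and is not revisited below (NLL and ECE each get their own section), so here is a minimal NumPy sketch of it: the mean squared difference between the predicted probability vector and the one-hot true label. The input names are placeholders:

```python
import numpy as np

def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and the
    one-hot true labels; lower is better, 0 is a perfect score."""
    one_hot = np.eye(probs.shape[1])[labels]
    return np.mean(np.sum((probs - one_hot) ** 2, axis=1))
```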

Negative Log Likelihood (NLL)

NLL measures the similarity between the probability distribution produced by the softmax output layer and the true conditional probability distribution estimated from samples. Similarity is measured using Gibbs' inequality:

E[log Q(y|X)] ≥ E[log P(y|X)], with equality only when P = Q

where Q(y|X) is the true conditional probability, P(y|X) is an arbitrary distribution and E is the expected value function (taken over Q).

NLL = −E[log P(y|X)], and in a DNN P(y|X) is given by the softmax output probabilities. Hence this equation can be re-phrased as

NLL = −(1/|V|) Σ_{(x,y)∈V} log S_y(x)

where S_y is the softmax output (also called confidence), given by

S_y(x) = exp(h_y(x)) / Σ_{i=1..k} exp(h_i(x))

and (x, y) belongs to a validation set V, k is the number of classes and h_i are the logits.
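As a sanity check, here is a small NumPy sketch of the NLL computed directly from raw logits (the `logits` and `labels` names are placeholders):

```python
import numpy as np

def softmax(logits):
    # Subtract the max logit per row for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels):
    """Negative log likelihood: -mean log S_y(x) over the validation set."""
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))
```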

Expected Calibration Error (ECE)

Miscalibration can be interpreted as the gap between a model's confidence and its probability of correctly classifying a sample. ECE is proposed as the empirical expectation of this gap between accuracy and confidence.

It is calculated by partitioning the confidence range [0, 1] into L equally-spaced bins, assigning each sample to a bin B_l, l ∈ {1, . . . , L}, by its confidence, and then taking the weighted absolute difference between accuracy and confidence within each bin:

ECE = Σ_{l=1..L} (|B_l|/N) · |acc(B_l) − conf(B_l)|

where N is the total number of samples, acc(B_l) is the accuracy of the samples in bin B_l and conf(B_l) is their average confidence.
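A direct NumPy translation of this definition (the bin count and inputs are placeholders):

```python
import numpy as np

def ece(probs, labels, n_bins=15):
    """Expected Calibration Error with equally-spaced confidence bins."""
    confidences = probs.max(axis=1)        # confidence = max softmax prob
    predictions = probs.argmax(axis=1)
    correct = (predictions == labels).astype(float)

    bins = np.linspace(0.0, 1.0, n_bins + 1)
    error = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()        # acc(B_l)
            conf = confidences[in_bin].mean()   # conf(B_l)
            error += in_bin.mean() * abs(acc - conf)  # weighted by |B_l|/N
    return error
```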

Temperature Scaling (TS) and Attended Temperature Scaling

TS is a post-processing approach that rescales the logit layer of a deep model by a parameter T called the temperature. TS softens the output of the softmax layer and makes it better calibrated:

S_y(x, T) = exp(h_y(x)/T) / Σ_{i=1..k} exp(h_i(x)/T)

The best value of T is obtained by minimizing the NLL loss with respect to T, subject to T > 0, on the validation set V:

T* = argmin_{T>0} −Σ_{(x,y)∈V} log S_y(x, T)
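A minimal sketch of temperature scaling on cached validation logits, using SciPy for the one-dimensional NLL minimization; the optimizer choice and search bounds are my own, not the original paper's (which uses an equivalent gradient-based search):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll_at_temperature(T, logits, labels):
    """NLL of the validation set after dividing the logits by T."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(labels)), labels])

def fit_temperature(logits, labels):
    """Find T > 0 minimizing the NLL; a bounded 1-D search suffices."""
    result = minimize_scalar(nll_at_temperature, bounds=(0.05, 10.0),
                             args=(logits, labels), method="bounded")
    return result.x
```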

TS has previously been applied for calibration, for knowledge distillation, and for enhancing the output of DNNs to better discriminate between in- and out-of-distribution samples. However, TS cannot find the optimal T value when the validation set contains too few samples. TS is also sensitive to label noise, as the optimal T value depends strongly on the true labels of the samples through the NLL loss function; hence the introduction of Attended TS (ATS).

The idea of ATS is to increase the number of usable samples in the validation set at low computational cost. TS minimizes the NLL to decrease the dissimilarity between S_y(x, T) and Q(y|x) as a whole. Instead, ATS attends to the conditional distribution of each class and decreases the dissimilarity between S_{y=k}(x, T) and Q(y = k|x) for each class k = 1, . . . , K. This setting makes it possible to both increase the number of samples per class and gain robustness to label noise.

For each class k, a sample set M_k is gathered as

M_k = {(x, k) : (x, y) ∈ V, S_{y=k}(x) ≥ θ}

where θ is a hyperparameter that is fine-tuned on the validation set; that is, any validation sample whose softmax probability for class k is at least θ joins class k's set, regardless of its original label. A new loss function L_ATS, and thereby the new temperature T, is proposed as

L_ATS(T) = −Σ_{k=1..K} Σ_{(x,k)∈M_k} log S_{y=k}(x, T),   T* = argmin_{T>0} L_ATS(T)
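A rough sketch of the ATS gathering step, built on the `softmax` and `fit_temperature` helpers from the earlier snippets; the threshold value and the exact gathering rule here are my reading of the description above, not a verbatim reproduction of the paper:

```python
import numpy as np

def gather_ats_samples(logits, theta=0.3):
    """Build an augmented (logits, labels) set: sample x joins class k's
    set M_k whenever its softmax probability for class k is >= theta,
    regardless of its original label."""
    probs = softmax(logits)                   # softmax() from the NLL sketch
    rows, classes = np.where(probs >= theta)  # all qualifying (x, k) pairs
    return logits[rows], classes

# Usage: augment the validation set, then reuse the TS fit from above.
# aug_logits, aug_labels = gather_ats_samples(val_logits, theta=0.3)
# T = fit_temperature(aug_logits, aug_labels)
```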

Calibration using Splines

This is a binning-free calibration measure and calibration method based on spline fitting, inspired by the classical Kolmogorov-Smirnov (KS) statistical test. KS is a non-parametric test that compares the class-wise cumulative (empirical) distributions and makes no assumptions about the underlying distribution. For a perfect model, the two distributions are completely segregated, resulting in a KS statistic of 1. For details, please read the paper 'Calibration of Neural Networks using Splines'.
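To give a flavour of the binning-free idea, here is a sketch of a KS-style calibration error: sort the samples by confidence, accumulate confidence and correctness, and take the maximum gap between the two cumulative curves. This is my condensed reading of that line of work, not the paper's reference implementation, which additionally fits a spline to the cumulative difference:

```python
import numpy as np

def ks_calibration_error(probs, labels):
    """Max gap between cumulative confidence and cumulative accuracy,
    with samples sorted by confidence; no binning required."""
    confidences = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)

    order = np.argsort(confidences)
    n = len(labels)
    cum_conf = np.cumsum(confidences[order]) / n
    cum_acc = np.cumsum(correct[order]) / n
    return np.abs(cum_conf - cum_acc).max()
```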
