Default for binary and multi-class classification problems.

Negative Log-Likelihood - also known as Log Loss, Cross-Entropy Loss, or NLL - is typically the default loss function for classification tasks. Despite its popularity, however, research has shown that other loss functions can sometimes be more effective.

Formula

For Binary Classification on a single datapoint:

$$ NLL_{BinClass}= -\left(y^{(i)}\log(\hat y^{(i)})+ (1-y^{(i)})\log(1- \hat y^{(i)})\right) $$
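As a quick sketch, the binary formula can be written in NumPy as below. The function name and the eps clamp against log(0) are illustrative additions, not part of the formula itself:

```python
import numpy as np

def binary_nll(y_true: float, y_pred: float, eps: float = 1e-12) -> float:
    """Negative log loss for a single datapoint; eps guards against log(0)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(-(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)))

# A confident, correct prediction has low loss; a confident, wrong one is heavily penalized.
print(binary_nll(1.0, 0.9))   # ~0.105
print(binary_nll(1.0, 0.1))   # ~2.303
```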

For Multi-Class Classification on a single datapoint it becomes the following, where $C$ is the number of classes and the subscript $c$ indexes classes:

$$ NLL_{MultiClass}= -\sum_{c=1}^C y_c^{(i)}\log(\hat y_c^{(i)}) $$

Since the label is one-hot, the only component with $y_c^{(i)} \neq 0$ is the correct class ($y_c^{(i)}=1$), so the function boils down to $-\log(\hat y_c^{(i)})$ where $c$ is the correct class.
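A minimal NumPy sketch of the multi-class case, assuming a one-hot label vector and a prediction vector that already sums to 1 (the function name and the eps clamp are illustrative additions):

```python
import numpy as np

def multiclass_nll(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-12) -> float:
    """Cross-entropy for one datapoint: y_true is one-hot, y_pred is a probability vector."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(-np.sum(y_true * np.log(y_pred)))

y_true = np.array([0.0, 1.0, 0.0])       # correct class is index 1
y_pred = np.array([0.05, 0.90, 0.05])    # model's predicted distribution
print(multiclass_nll(y_true, y_pred))    # ~0.105
print(-np.log(y_pred[1]))                # same value: only the correct class contributes
```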

Both of these are easily extended to batches of examples - you simply take the sum of the losses over all datapoints in the batch.
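A sketch of the batch version using the binary formula, assuming 1-D arrays of labels and predictions. The mean reduction shown alongside the sum is an extra option frameworks commonly default to, not something required by the formula:

```python
import numpy as np

def batch_nll(y_true: np.ndarray, y_pred: np.ndarray, reduction: str = "sum") -> float:
    """Binary NLL over a batch; y_true and y_pred have the same length."""
    eps = 1e-12
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    per_example = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(per_example.sum() if reduction == "sum" else per_example.mean())

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7])
print(batch_nll(y_true, y_pred))            # sum of the three per-example losses
print(batch_nll(y_true, y_pred, "mean"))    # mean over the batch
```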

Derivatives

For Binary Classification on a single datapoint:

$$ \frac{\partial L_{BC}}{\partial \hat{y}}= -\frac{y}{\hat y} + \frac{1-y}{1-\hat y} $$
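A quick numerical sanity check of this derivative, comparing the closed form against a central finite difference (the specific values of $y$ and $p$ below are arbitrary):

```python
import numpy as np

# Loss as a function of the prediction p, with the true label fixed at y = 1.
y = 1.0
loss = lambda p: -(y * np.log(p) + (1 - y) * np.log(1 - p))

p, h = 0.7, 1e-6
analytic = -y / p + (1 - y) / (1 - p)                 # derivative formula above
numeric = (loss(p + h) - loss(p - h)) / (2 * h)       # central finite difference
print(analytic, numeric)                              # both ~ -1.4286
```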

For Multi-Class Classification on a single datapoint, with respect to each predicted probability $\hat y_c^{(i)}$:

$$ \frac{\partial L_{MC}}{\partial \hat y_c^{(i)}}= -\frac{y_c^{(i)}}{\hat y_c^{(i)}} $$
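The same kind of check works per component for the multi-class case (the label and prediction vectors here are made up for illustration):

```python
import numpy as np

y_true = np.array([0.0, 1.0, 0.0])       # one-hot label, correct class is index 1
y_pred = np.array([0.2, 0.5, 0.3])       # predicted probabilities

grad = -y_true / y_pred                   # per-component derivative from the formula above
print(grad)                               # [-0. -2. -0.]: only the correct class is nonzero

# Quick finite-difference check on the correct-class component.
h = 1e-6
loss = lambda p: -np.sum(y_true * np.log(p))
bump = np.array([0.0, h, 0.0])
print((loss(y_pred + bump) - loss(y_pred - bump)) / (2 * h))   # ~ -2.0
```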

Theory Behind Loss Function

The basic idea is that we want to think about loss probabilistically. Let’s say we’re doing a binary classification problem on two datapoints and, for each one, our model’s output after the sigmoid function is 0.9.

Probabilistic View: Define the expression below as the likelihood of observing the desired labels given our inputs over a dataset of size $n$ - this is the quantity we want to maximize:

$$ \prod_{i=1}^n \begin{cases} g^{(i)} & \text{if } y^{(i)} = 1 \\ 1-g^{(i)} & \text{otherwise} \end{cases} $$
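A toy sketch of this likelihood, assuming a list of binary labels $y$ and sigmoid outputs $g$ (the helper name and the example values are illustrative):

```python
import numpy as np

def likelihood(y, g):
    """Product over the dataset: g_i where y_i == 1, (1 - g_i) otherwise."""
    y, g = np.asarray(y, dtype=float), np.asarray(g, dtype=float)
    return float(np.prod(np.where(y == 1, g, 1 - g)))

print(likelihood([1, 0, 1], [0.8, 0.3, 0.9]))   # 0.8 * 0.7 * 0.9 = 0.504
```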

This would result in the following for our situation, depending on whether we’re correct or incorrect:

$$ Correct: 0.9*0.9 = 0.81 $$