Default for binary and multi-class classification problems.
Negative Log-Likelihood - also known as Log Loss, Cross-Entropy Loss, or NLL - is typically the default loss function used for classification tasks. That said, despite its popularity, research has shown that other loss functions can be more effective in some settings.
For Binary Classification on a single datapoint:
$$ NLL_{BinClass}= -\left(y^{(i)}\log(\hat y^{(i)})+ (1-y^{(i)})\log(1- \hat y^{(i)})\right) $$
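As a sanity check, here is a minimal NumPy sketch of this formula - the `binary_nll` helper and the example values are made up for illustration:

```python
import numpy as np

def binary_nll(y, y_hat, eps=1e-12):
    """NLL for a single binary datapoint.

    y     -- true label, 0 or 1
    y_hat -- predicted probability of the positive class (sigmoid output)
    eps   -- small constant to avoid log(0)
    """
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(binary_nll(1, 0.9))  # ~0.105: confident and correct -> small loss
print(binary_nll(0, 0.9))  # ~2.303: confident and wrong   -> large loss
```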
For Multi-Class Classification on a single datapoint it becomes the following, where $C$ is the number of classes and the subscript $c$ picks out the $c$-th component of the one-hot label $y^{(i)}$ and of the predicted probability vector $\hat y^{(i)}$:
$$ NLL_{MultiClass}= -\sum_{c=1}^C y_c^{(i)}\log(\hat y_c^{(i)}) $$
Since $y^{(i)}$ is one-hot, the only nonzero component is the one for the correct class ($y_c^{(i)}=1$), so the sum collapses to $-\log(\hat y_c^{(i)})$ where $c$ is the correct class.
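A small NumPy sketch of both forms - the `multiclass_nll` helper and the probability vector are made-up illustrations, not a reference implementation:

```python
import numpy as np

def multiclass_nll(y_onehot, y_hat, eps=1e-12):
    """Cross-entropy for a single datapoint: one-hot label vs. probability vector."""
    y_hat = np.clip(y_hat, eps, 1.0)
    return -np.sum(y_onehot * np.log(y_hat))

y_hat = np.array([0.1, 0.7, 0.2])   # e.g. a softmax output over C = 3 classes
y     = np.array([0.0, 1.0, 0.0])   # one-hot label: class index 1 is correct

print(multiclass_nll(y, y_hat))     # full sum        ~0.357
print(-np.log(y_hat[1]))            # -log(correct)   ~0.357, same value
```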
Both of these are easily extended to batches - you simply sum the loss over all the datapoints in the batch (many libraries average instead).
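For example, a sketch of the batched binary case with made-up labels and predictions:

```python
import numpy as np

# Made-up batch of 4 binary datapoints.
y     = np.array([1, 0, 1, 1])
y_hat = np.array([0.9, 0.2, 0.6, 0.99])

per_example = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(per_example.sum())    # batch loss as a sum, as described above
print(per_example.mean())   # many libraries average instead of summing
```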
To train with this loss we also need its derivative with respect to the prediction. For Binary Classification on a single datapoint:
$$ \frac{\partial L_{BC}}{\partial \hat{y}}= -\frac{{y}}{{\hat y}} + \frac{{(1-y)}}{{(1-\hat y)}} $$
For Multi-Class Classification on a single datapoint, taking the derivative with respect to each component $\hat y_c^{(i)}$:
$$ \frac{\partial L_{MC}}{\partial \hat y_c^{(i)}}= -\frac{y_c^{(i)}}{\hat y_c^{(i)}} $$
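One way to sanity-check these expressions is against a finite-difference estimate; the sketch below does that with made-up values:

```python
import numpy as np

def binary_nll(y, y_hat):
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def binary_grad(y, y_hat):
    # dL/d(y_hat) = -y / y_hat + (1 - y) / (1 - y_hat)
    return -y / y_hat + (1 - y) / (1 - y_hat)

y, y_hat, h = 1.0, 0.9, 1e-6
numeric = (binary_nll(y, y_hat + h) - binary_nll(y, y_hat - h)) / (2 * h)
print(binary_grad(y, y_hat), numeric)   # both ~ -1.111

# Multi-class: the gradient w.r.t. each component is -y_c / y_hat_c,
# so it is zero for every class except the correct one.
y_vec     = np.array([0.0, 1.0, 0.0])
y_hat_vec = np.array([0.1, 0.7, 0.2])
print(-y_vec / y_hat_vec)               # ~[0, -1.43, 0]
```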
The basic idea is that we want to think about loss probabilistically. Let’s say we’re doing a binary classification problem and our output, after passing through a sigmoid function, is 0.9.
Probabilistic View: Define the expression below - where $g^{(i)}$ is the model’s sigmoid output for datapoint $i$ - as the likelihood of observing the desired labels given our inputs on a dataset of size $n$. This is something we want to maximize:
$$ \prod_{i=1}^n \begin{cases} g^{(i)} & \text{if } y^{(i)} = 1 \\ 1-g^{(i)} & \text{otherwise} \end{cases} $$
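The sketch below evaluates this likelihood on a made-up dataset and shows why its negative log is convenient: $-\log$ turns the product into a sum, so maximizing the likelihood is the same as minimizing the summed NLL.

```python
import numpy as np

# Made-up sigmoid outputs g and binary labels y for a dataset of size n = 3.
g = np.array([0.9, 0.8, 0.3])
y = np.array([1,   1,   0  ])

# Likelihood of the observed labels: g_i where y_i = 1, (1 - g_i) otherwise.
per_example = np.where(y == 1, g, 1 - g)
likelihood = per_example.prod()           # 0.9 * 0.8 * 0.7 = 0.504

# Taking -log turns the product into a sum, so maximizing the likelihood
# is equivalent to minimizing the summed NLL.
nll = -np.log(per_example).sum()
print(likelihood, np.exp(-nll))           # both ~0.504
```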
Plugging our example into the likelihood above gives the following, depending on whether we’re correct or incorrect:
$$ \text{Correct: } 0.9 \times 0.9 = 0.81 $$