Common last-layer activation function for multi-class classification problems.

Overview

The Softmax function takes an entire vector of logits as input and returns a vector of the same size containing the probability associated with each logit.

The formula for it is fairly straightforward:

$$ \text{Softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}} $$
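In code, this is just exponentiating each logit and normalizing by the sum. A minimal NumPy sketch (the shift by the maximum logit is a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(logits):
    # Subtract the max logit for numerical stability; the shift cancels
    # in the numerator and denominator, so the output is unchanged.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Example: three logits -> three probabilities that sum to 1
print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.659, 0.242, 0.099]
```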

Derivative

In backpropagation, computing the gradient of the loss with respect to the Softmax inputs isn’t entirely straightforward, but the math works out nicely.

You actually want to compute it with respect to the logits, since each Softmax output depends on all of the logits (unlike most activation functions, which act element-wise). If you use cross-entropy as your loss function, you end up with the following friendly form:

$$ \frac{\partial L}{\partial z} = \hat y - y $$

This can be computed as a single vector over all logits, where $\hat y$ is the Softmax output and $y$ is the one-hot target. The more general form of the Softmax derivative (its full Jacobian) is:

$$ \frac{\partial \hat y_i}{\partial z_j} = \begin{cases} \hat y_i (1 - \hat y_i) & i = j \\ -\hat y_i \hat y_j & i \ne j \end{cases} $$

Working through the math to get to this result:

Step 1: Define the Two Cases
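A sketch of this step, using $\hat y_i = \frac{e^{z_i}}{\sum_{k=1}^{n} e^{z_k}}$ for the Softmax output of logit $z_i$: differentiating output $\hat y_i$ with respect to logit $z_j$ splits into two cases, depending on whether the logit being differentiated is the one in the numerator.

$$ \frac{\partial \hat y_i}{\partial z_j} = \frac{\partial}{\partial z_j} \left( \frac{e^{z_i}}{\sum_{k=1}^{n} e^{z_k}} \right), \qquad \text{Case 1: } i = j, \qquad \text{Case 2: } i \ne j $$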

Step 2: Solve for Case 1
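For $i = j$, the logit appears in both the numerator and the denominator, so the quotient rule gives:

$$ \frac{\partial \hat y_i}{\partial z_i} = \frac{e^{z_i} \sum_k e^{z_k} - e^{z_i} e^{z_i}}{\left( \sum_k e^{z_k} \right)^2} = \hat y_i - \hat y_i^2 = \hat y_i (1 - \hat y_i) $$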

Step 3: Solve for Case 2
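For $i \ne j$, the numerator $e^{z_i}$ is constant with respect to $z_j$, so only the denominator contributes:

$$ \frac{\partial \hat y_i}{\partial z_j} = \frac{0 \cdot \sum_k e^{z_k} - e^{z_i} e^{z_j}}{\left( \sum_k e^{z_k} \right)^2} = -\hat y_i \hat y_j $$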

Step 4: Combine Cases 1 and 2
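The two cases can be written as a single expression using the Kronecker delta $\delta_{ij}$ (1 when $i = j$, 0 otherwise):

$$ \frac{\partial \hat y_i}{\partial z_j} = \hat y_i (\delta_{ij} - \hat y_j) $$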

[Optional] Step 5: Combining with Cross-Entropy Loss
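With cross-entropy loss $L = -\sum_i y_i \log \hat y_i$ and the chain rule over all outputs, the Jacobian terms collapse:

$$ \frac{\partial L}{\partial z_j} = \sum_i \frac{\partial L}{\partial \hat y_i} \frac{\partial \hat y_i}{\partial z_j} = \sum_i \left( -\frac{y_i}{\hat y_i} \right) \hat y_i (\delta_{ij} - \hat y_j) = -y_j + \hat y_j \sum_i y_i = \hat y_j - y_j $$

since $\sum_i y_i = 1$ for a one-hot target, which recovers the $\hat y - y$ form from above. A quick numeric sanity check of this result, with hypothetical logits and a one-hot target on class 0, comparing the analytic gradient against a finite-difference estimate:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])   # logits
y = np.array([1.0, 0.0, 0.0])   # one-hot target
y_hat = softmax(z)

analytic = y_hat - y            # the derived gradient dL/dz

# Numerical gradient of L = -sum(y * log(softmax(z))) for comparison
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    Lp = -np.sum(y * np.log(softmax(zp)))
    Lm = -np.sum(y * np.log(softmax(zm)))
    numeric[i] = (Lp - Lm) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```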