The sigmoid function is most commonly used as the output function in logistic regression. It was previously a popular hidden-layer activation function as well, but its drawbacks have led newer functions (notably ReLU and its variants) to take over that role.
The output of the sigmoid function lies in (0, 1), bounds non-inclusive. The function is as follows:
$$ \sigma(z) = \frac{1}{1+e^{-z}} $$
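As a quick illustration, here is a minimal NumPy sketch of the function (the name `sigmoid` and the numerically stable split on the sign of $z$ are my own implementation choices, not part of the definition above):

```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid: sigma(z) = 1 / (1 + exp(-z))."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the standard form is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z)) to avoid overflow.
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

print(sigmoid([-10.0, 0.0, 10.0]))  # ≈ [4.54e-05, 0.5, 0.99995]
```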
Let’s run through a quick list of pros and cons:

| Pros | Cons |
|---|---|
| Output in an interpretable range (0, 1) → particularly useful in binary classifiers (e.g., logistic regression) | Vanishing gradient problem at the extremes, since the gradient gets very small as the input moves far from 0 (see the sketch after the table) |
| Smooth gradient and easily differentiable | Can be computationally expensive compared to simpler functions (e.g., ReLU) because of the exponential |
| Provides non-linearity | Output isn’t centered on 0 → you may prefer zero-centered activations |
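To make the vanishing-gradient and not-zero-centered cons concrete, here is a rough numerical check (it uses the derivative formula derived just below):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-20.0, -5.0, 0.0, 5.0, 20.0])
o = sigmoid(z)
grad = o * (1.0 - o)  # sigmoid derivative (see the formula below)

print(o)     # all outputs fall in (0, 1) and are never negative -> not zero-centered
print(grad)  # ≈ [2.1e-09, 6.6e-03, 0.25, 6.6e-03, 2.1e-09] -> gradients vanish at the extremes
```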
In backpropagation through a sigmoid, you store the output of the sigmoid from the forward pass (we’ll denote it as $o$) and plug it into the following expression to get the gradient of the activation with respect to its pre-activation input $z$:
$$ \frac{\partial \sigma(z)}{\partial z} = o(1-o) = \sigma(z)(1-\sigma(z)) $$
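Here is a minimal sketch of that forward/backward pairing in NumPy (the names `sigmoid_forward`/`sigmoid_backward` and the caching convention are illustrative assumptions, not any particular library's API):

```python
import numpy as np

def sigmoid_forward(z):
    """Forward pass: compute o = sigma(z) and return it along with a cache for backprop."""
    o = 1.0 / (1.0 + np.exp(-z))
    return o, o  # (output, cache) -- the cached value is simply the output itself

def sigmoid_backward(grad_out, cache):
    """Backward pass: dL/dz = dL/do * o * (1 - o), using the cached output o."""
    o = cache
    return grad_out * o * (1.0 - o)

# Example: gradient of L = sum(sigma(z)) with respect to z (upstream gradient dL/do = 1).
z = np.array([-2.0, 0.0, 2.0])
o, cache = sigmoid_forward(z)
grad_z = sigmoid_backward(np.ones_like(z), cache)
print(grad_z)  # ≈ [0.105, 0.25, 0.105]
```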