Most commonly used as an output function in Logistic Regression.

Overview

Sigmoid was previously used more often as a hidden-layer activation function; however, its drawbacks have led to newer functions becoming more popular (notably ReLU and its variants).

The output of the sigmoid function lies in (0, 1), with the bounds excluded. The function is as follows:

$$ \sigma(z) = \frac{1}{1+e^{-z}} $$
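As a quick, minimal sketch (the function name `sigmoid` and the example inputs are purely illustrative), this can be written in NumPy as:

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid: maps any real-valued input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative inputs: large negative, zero, large positive
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # -> approx [4.5e-05, 0.5, 0.99995]
```

Note that the output never actually reaches 0 or 1, which matches the open interval above.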

Let’s run through the quick pros and cons:

| Pros | Cons |
| --- | --- |
| Output in an interpretable format, (0, 1) → particularly useful in binary classifiers (i.e., logistic regression) | Vanishing gradient problem at the extremes (the gradient gets very small as the input moves away from 0; see the sketch below) |
| Smooth gradient and easily differentiable | Can be computationally expensive compared to other functions |
| Provides non-linearity | Output isn't centered on 0 → you may prefer zero-centered outputs |
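To make the vanishing-gradient point concrete, here is a small sketch (using the derivative $\sigma(z)(1-\sigma(z))$ derived in the Differentiation section below) showing how quickly the gradient shrinks away from $z = 0$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The gradient sigma(z) * (1 - sigma(z)) peaks at 0.25 (at z = 0) and shrinks fast
for z in [0.0, 2.0, 5.0, 10.0]:
    o = sigmoid(z)
    print(f"z = {z:5.1f}   sigma(z) = {o:.6f}   gradient = {o * (1 - o):.2e}")
```

At $z = 10$ the gradient is already on the order of $10^{-5}$, so layers feeding a saturated sigmoid receive almost no learning signal.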

Differentiation

In backpropagation through a sigmoid, you store the forward-pass value of the sigmoid output (we'll denote it as $o$) and then plug that value into the following to determine the gradient with respect to the pre-activation input $z$:

$$ \frac{\partial \sigma(z)}{\partial z} = o\,(1-o) = \sigma(z)\,(1-\sigma(z)) $$
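Here is a minimal sketch of how this plays out in code, assuming a simple layer object that caches the forward-pass output (the class and variable names are illustrative, not from any particular library):

```python
import numpy as np

class Sigmoid:
    """Illustrative sigmoid layer: caches the forward output o so the
    backward pass can reuse it instead of recomputing sigma(z)."""

    def forward(self, z):
        self.o = 1.0 / (1.0 + np.exp(-z))  # store o = sigma(z) for backprop
        return self.o

    def backward(self, upstream_grad):
        # Chain rule: dL/dz = dL/do * do/dz, where do/dz = o * (1 - o)
        return upstream_grad * self.o * (1.0 - self.o)

layer = Sigmoid()
z = np.array([-2.0, 0.0, 3.0])
o = layer.forward(z)
grad_z = layer.backward(np.ones_like(z))  # upstream gradient of ones for demo
print(o, grad_z)
```

Caching $o$ is the usual trade-off here: one extra array of memory in exchange for not recomputing the exponential during the backward pass.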