Support Vector Machine (SVM)

Won’t implement one in raw code but it is very similar to a Logistic Regressor. However, the key difference is that an SVM tries to maximize the margin between the hyperplane (linear separator) and the datapoints that are nearest the hyperplane in each category.

SVMs as Binary Classifiers

The most common implementation of an SVM.

A way to train these is to use the Pegasos Algorithm (Primal Estimated Sub-Gradient Solver for SVM) - this was introduced in 2007 and uses an iterative stochastic sub-gradient descent approach to updating.

Essentially you start with zero-weights (similar to a perceptron) and randomly sample datapoints. At each datapoint:

If incorrect:

$$ \theta_{t+1} = (1-\eta_t \lambda)\theta_t + \eta_ty_tx_t $$

If correct:

$$ \theta_{t+1} = (1-\eta_t \lambda)\theta_t $$

This utilizes a “scaling down” factor $(1-\eta_t \lambda)$ that is applied to the weights at each step regardless of if the data is correct or not.

We typically use hinge loss as the loss function in SVMs.

SVMs as Multi-Class Classifiers

SVMs can be leveraged for multi-class classification problems although they are inherently designed for binary classification. The method of extending these typically involves one of the following approaches:

One-vs-One (OvO)
- You train SVMs for each unique possible pair of labels - resulting in $\frac{{K(K-1)}}{{2}}$ different classifiers. Then at inference, you run each SVM and the most commonly picked classifier is the classifier you go with.
- This is computationally expensive as it involves a ton of models that scales exponentially with the number of classes you are classifying against.
One-vs-All (OvA)
- You train $K$ SVMs - one for each class distinguishing it against all other classes. Then at inference, you run the data through each SVM and the SVM that most confidently predicts the outcome is chosen as the classifier.
- This is more computationally efficient than the OvO approach.