Maximizing the margin

Often in machine learning, we introduce a parametrized distribution and search for the parameters that maximize the likelihood over some data. This can be equivalent to minimizing an error function defined on the data (mean squared error, mean cross-entropy).

In this notebook, we describe the Support Vector Machine (SVM) problem. Previously, we have seen that many loss functions in machine learning are related to Maximum Likelihood. See:

  1. Minimizing the cross entropy: a nice trip from Maximum Likelihood to Kullback–Leibler divergence (http://romain.raveaux.free.fr/document/CrossEntropy.html)
  2. Overfitting phenomenon of the maximum likelihood (http://romain.raveaux.free.fr/document/Overfittingbiaisedandunbiaisedvariance.html)
  3. Gaussian Mixture and Expectation Maximization algorithm (http://romain.raveaux.free.fr/document/GaussianMixtureandExpectationMaximization.html)
  4. Relation between the least squares error and the Maximum Likelihood (http://romain.raveaux.free.fr/document/LeastSquaresError.html)

The maximum likelihood can suffer from a severe overfitting phenomenon, for instance:

  1. A data set is not well fitted by Gaussian models (underestimation of the variance)
  2. In Gaussian mixture models, a Gaussian distribution can be fitted to a single data point

We have shown that maximizing the likelihood can be equivalent to minimizing an average error on a given data set (mean squared error, mean cross-entropy).

But instead of minimizing a mean error, would it be more interesting to minimize the largest error? Would it offer a better generalization property?

This is where the idea of maximizing the margin comes from ....

Author: Romain Raveaux

Distance to the decision line

Let us define a linear function $\hat{t}=f(x,w,b)=w^Tx+b \in \mathbb{R}$ (see Figure). The signed distance between a point $x$ and the line represented by the function $f(.)$ is given by the orthogonal projection: $$d^{\perp}(x,f(.))=\frac{f(x,w,b)}{||w||_2}$$ Let us define the true target $t \in \{-1,1\}$ to distinguish between the two classes. A sample $x$ is correctly classified if $\text{sign}(f(x,w,b))=t$, that is, if $t \cdot f(x,w,b)>0$. The distance of a correctly classified sample to the line represented by the function $f(.)$ is then: $$d(x,f(.),t)=\frac{t \cdot f(x,w,b)}{||w||_2}$$

Figure: orthogonal projection of the point A on the line.
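
This distance can be computed directly. Below is a minimal sketch in Python (NumPy); the hyperplane parameters and the test point are hypothetical values chosen for illustration only.

    import numpy as np

    def signed_distance(x, w, b, t):
        # Signed distance of a labelled sample to the hyperplane w^T x + b = 0:
        # d = t * (w^T x + b) / ||w||_2 (positive when the sample is correctly classified)
        return t * (w @ x + b) / np.linalg.norm(w)

    w, b = np.array([1.0, -1.0]), 0.5          # hypothetical hyperplane parameters
    x, t = np.array([2.0, 1.0]), 1             # hypothetical sample and its true target
    print(signed_distance(x, w, b, t))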

Finding the margin

Given a data set $\mathcal{D}=\{ (x_1,t_1), \cdots, (x_M,t_M) \}$, the margin is the smallest distance between the decision line $f(.)$ and the samples. Finding the closest sample to the line is then defined by: $$d_{min}= \min_{i\in [1,\cdots,M]} \frac{t_i \cdot f(x_i,w,b)}{||w||_2}$$
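
As a small illustration, the margin is the minimum of the distances computed above. The data set and the parameters below are toy values assumed only for this sketch.

    import numpy as np

    # Toy, linearly separable data set (assumption for illustration)
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
    t = np.array([1.0, 1.0, -1.0, -1.0])
    w, b = np.array([1.0, 1.0]), 0.0           # hypothetical parameters

    # Margin: smallest signed distance over the data set
    d_min = np.min(t * (X @ w + b) / np.linalg.norm(w))
    print("margin:", d_min)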

Maximizing the margin

A key idea of SVM is to find the largest possible margin. As the data cannot be modified, the parameters $w$ and $b$ must be tuned to maximize $d_{min}$: \begin{eqnarray} arg \, \max_{w,b}& d_{min}(w,b)\\ arg \, \max_{w,b}& \left[ \frac{1}{||w||_2} \left( \underset{i \in [1,\cdots,M]}{\min} t_i(w^T x_i +b) \right) \right] \end{eqnarray} This equation is the heart of the SVM problem: finding the parameters that maximize the margin. However, this formulation is not easy to optimize directly.

Rescaling the parameters

To make the problem easier to solve, a rescaling of the parameters can be performed. If $w \rightarrow cst \cdot w$ and $b \rightarrow cst \cdot b$, where $cst>0$ is a constant, then the rescaling does not change the distance $d(x,f(.),t)$. Let $x_{min}= arg\, \min_{i\in [1,\cdots,M]} \frac{t_i \cdot f(x_i,w,b)}{||w||_2}$ be the closest sample; we can use this freedom to set $ t_{min}(w^Tx_{min}+b)=1 $. All the samples then satisfy: $$t_i(w^Tx_i+b) \geq 1, \forall i\in [1,\cdots,M] $$
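
A quick numerical check of this invariance, with toy values assumed for the sketch: rescaling $w$ and $b$ by the same positive constant leaves the distance unchanged, so the constant can be chosen such that $t_{min}(w^Tx_{min}+b)=1$.

    import numpy as np

    x, t = np.array([2.0, 2.0]), 1.0           # hypothetical closest sample and its target
    w, b = np.array([1.0, 1.0]), 0.0           # hypothetical parameters
    cst = 1.0 / (t * (w @ x + b))              # canonical rescaling: t_min (w^T x_min + b) = 1

    for scale in (1.0, cst, 10.0):
        ws, bs = scale * w, scale * b
        print(scale, t * (ws @ x + bs) / np.linalg.norm(ws))   # same distance every time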

Canonical formulation

Maximizing $\frac{1}{||w||_2}$ is equivalent to minimizing $\frac{1}{2} ||w||_2^2$: \begin{eqnarray} arg \, \min_{w,b}&\frac{1}{2} ||w||_2^2 &\\ Subject\; to& t_i(w^Tx_i+b)\geq 1,& \forall i\in [1,\cdots,M] \end{eqnarray} The determination of the model parameters then corresponds to a convex optimization problem. More precisely, it is a quadratic programming problem, in which we minimize a quadratic function subject to a set of linear inequality constraints.
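
As an illustration, this quadratic program can be handed to a general-purpose solver. The sketch below uses SciPy's SLSQP method on a toy, linearly separable data set (an assumption made only for this example); it is not meant as a production SVM solver.

    import numpy as np
    from scipy.optimize import minimize

    # Toy linearly separable data (assumption for illustration)
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
    t = np.array([1.0, 1.0, -1.0, -1.0])

    def objective(z):                      # z = [w_1, ..., w_d, b]
        w = z[:-1]
        return 0.5 * np.dot(w, w)          # 1/2 ||w||^2

    def constraints(z):                    # SLSQP expects g(z) >= 0
        w, b = z[:-1], z[-1]
        return t * (X @ w + b) - 1.0       # t_i (w^T x_i + b) - 1 >= 0

    res = minimize(objective, x0=np.zeros(X.shape[1] + 1), method="SLSQP",
                   constraints=[{"type": "ineq", "fun": constraints}])
    w_opt, b_opt = res.x[:-1], res.x[-1]
    print("w =", w_opt, "b =", b_opt, "margin =", 1.0 / np.linalg.norm(w_opt))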

Lagrangian formulation

In order to solve this constrained optimization problem, we introduce Lagrange multipliers $a_i \geq 0$, with one multiplier $a_i$ for each of the constraints ($a=(a_1, \cdots, a_M)$): $$L(a,w,b)= \frac{1}{2} ||w||_2^2 - \sum_{i=1}^M a_i \left[ t_i(w^Tx_i+b)-1 \right] $$

Note the minus sign in front of the Lagrange multiplier term, because we are minimizing with respect to $w$ and $b$, and maximizing with respect to $a$.

Dual formulation and kernel formulation

From the Lagrangian function $L(.)$, we want to find its minimum with respect to the parameters $w$ and $b$, so we set its derivatives to zero: $$\frac{\partial L}{\partial w} = 0= w-\sum_{i=1}^M a_i t_i x_i \implies w=\sum_{i=1}^M a_i t_i x_i$$ $$\frac{\partial L}{\partial b} = 0= \sum_{i=1}^M a_i t_i $$ By substituting $w=\sum_{i=1}^M a_i t_i x_i$ into the function $L(.)$, we obtain the dual formulation: $$\tilde{L}(a)= \sum_{i=1}^M a_i -\frac{1}{2} \sum_{i=1}^M \sum_{j=1}^M a_i a_j t_i t_j x_i^T x_j $$ By replacing the dot product $x_i^T x_j$ by a kernel $k(x_i,x_j)$, we obtain the SVM kernel machine: \begin{eqnarray} arg \, \max_{a} \tilde{L}(a)=& \sum_{i=1}^M a_i -\frac{1}{2} \sum_{i=1}^M \sum_{j=1}^M a_i a_j t_i t_j k(x_i,x_j )\\ Subject\; to&\\ & a_i\geq 0, \forall i\in [1,\cdots,M] \\ & \sum_{i=1}^M a_i t_i= 0 \end{eqnarray}
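
The dual problem is again a quadratic program and can be solved numerically. The sketch below uses SciPy's SLSQP method with a linear kernel on the same toy data as before (assumptions made for illustration); dedicated solvers such as SMO are preferred in practice.

    import numpy as np
    from scipy.optimize import minimize

    # Toy linearly separable data (assumption for illustration)
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
    t = np.array([1.0, 1.0, -1.0, -1.0])
    K = X @ X.T                                    # linear kernel k(x_i, x_j) = x_i^T x_j
    M = len(t)

    def neg_dual(a):                               # minimize the negative of the dual objective
        return 0.5 * a @ (np.outer(t, t) * K) @ a - a.sum()

    res = minimize(neg_dual, x0=np.zeros(M), method="SLSQP",
                   bounds=[(0.0, None)] * M,                              # a_i >= 0
                   constraints=[{"type": "eq", "fun": lambda a: a @ t}])  # sum_i a_i t_i = 0
    a = res.x
    w = (a * t) @ X                                # w = sum_i a_i t_i x_i
    sv = a > 1e-6                                  # support vectors have a_i > 0
    b = np.mean(t[sv] - X[sv] @ w)                 # from t_i (w^T x_i + b) = 1 on support vectors
    print("a =", np.round(a, 3), "w =", w, "b =", b)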

Solving SVM

Methods for solving this optimization problem either: (i) reduce it to an equivalent polynomial-size reformulation (for certain decomposable loss functions) and use methods like SMO (sequential minimal optimization) \citep{smo} or general-purpose solvers; or (ii) work with the original problem by considering a subset of constraints and employing cutting-plane \citep{Tsochantaridis:2005:LMM:1046920.1088722} or stochastic subgradient methods. The solution of a quadratic programming problem in $d$ variables in general has computational complexity $O(d^3)$. In going to the dual formulation, we have turned the original optimization problem, which involved minimizing over $d$ variables, into the dual problem, which has $M$ variables. For a fixed set of basis functions whose number $d$ is smaller than the number $M$ of data points, the move to the dual problem appears disadvantageous. However, it allows the model to be reformulated using kernels, and so the maximum margin classifier can be applied efficiently to feature spaces whose dimensionality exceeds the number of data points.
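
In practice, one rarely writes the solver by hand. As a usage sketch, scikit-learn's SVC (backed by the LIBSVM solver) can be used; a large value of the regularization parameter C approximates the hard-margin formulation above (the data set is again a toy assumption).

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
    t = np.array([1, 1, -1, -1])

    clf = SVC(kernel="linear", C=1e6).fit(X, t)    # large C ~ hard margin
    print("support vectors:", clf.support_vectors_)
    print("w =", clf.coef_, "b =", clf.intercept_)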

Classification

In order to classify new data points using the trained model, the sign of $f(x,w,b)=w^Tx+b$ is evaluated (if the canonical formulation is solved). Alternatively, the prediction can be expressed in terms of the parameters $\{ a_i \}$ and the kernel function if the dual formulation was solved: $$\hat{t}=\sum_{i=1}^M a_i t_i k(x,x_i) +b $$ Any data point for which $ a_i=0 $ does not appear in this sum and hence plays no role in making predictions for new data points. The remaining data points are called support vectors. This property is central to the practical applicability of support vector machines: once the model is trained, a significant proportion of the data points can be discarded and only the support vectors retained. $$\hat{t}=\sum_{i\in S} a_i t_i k(x,x_i) +b $$ where $S$ denotes the set of indices of the support vectors.
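
A minimal prediction sketch from the dual solution, keeping only the support vectors; it assumes the quantities a, X, t and b were obtained as in the dual sketch above, and the kernel defaults to a dot product.

    import numpy as np

    def predict(x_new, a, X, t, b, kernel=lambda u, v: u @ v, tol=1e-6):
        # Only support vectors (a_i > 0, up to a numerical tolerance) contribute
        S = np.where(a > tol)[0]
        score = sum(a[i] * t[i] * kernel(x_new, X[i]) for i in S) + b
        return np.sign(score)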