## Introduction to Support Vector Machines

Starting from slides drawn by Ming-Hsuan Yang and Antoine Cornuéjols

## SVM Bibliography

B. Boser, I. Guyon, V. Vapnik, “A training algorithm for optimal margin classifiers”, 1992.

C. Cortes, V. Vapnik, “Support-vector networks”. Machine Learning, 20(3):273-297, 1995.

V. Vapnik, “The Nature of Statistical Learning Theory”. Springer Verlag, 1995.

C. Burges, “A tutorial on support vector machines for pattern recognition”. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.

N. Cristianini, J. Shawe-Taylor, “Support Vector Machines and other kernel-based learning methods”. Cambridge University Press, 2000.

Andrew Ng, “Support Vector Machines”, Stanford University, CS229 Lecture Notes, Part V.


## SVM — The Main Idea

Given a set of data points which belong to either of two classes, find an optimal separating hyperplane

- maximizing the distance (from closest points) of either class to the separating hyperplane, and

- minimizing the risk of misclassifying the training samples and the unseen test samples.

Approach: Formulate a constrained optimisation problem, then solve it using quadratic programming (QP).

## Optimal Separating Hyperplane

[Figure: a valid separating hyperplane compared with the optimal separating hyperplane, which achieves the maximal margin.]

## Plan

1. Linear SVMs

The primal form and the dual form of linear SVMs

Linear SVMs with soft margin

2. Non-Linear SVMs

Kernel functions for SVMs

An example of non-linear SVM

## 1. Linear SVMs: Formalisation

Let S be a set of points x_{i} ∈ R^{d} with i = 1, . . . , m. Each point x_{i} belongs to either of two classes, with label y_{i} ∈ {−1, +1}.

The set S is linearly separable if there are w ∈ R^{d} and w_{0} ∈ R such that

y_{i}(w · x_{i} + w_{0}) ≥ 1, i = 1, . . . , m

The pair (w, w_{0}) defines the hyperplane of equation w · x + w_{0} = 0, named the separating hyperplane.

The signed distance d_{i} of a point x_{i} to the separating hyperplane (w, w_{0}) is given by d_{i} = (w · x_{i} + w_{0})/||w||.

It follows that y_{i}d_{i} ≥ 1/||w||, therefore 1/||w|| is a lower bound on the distance between the points x_{i} and the separating hyperplane (w, w_{0}).

## Optimal Separating Hyperplane

Given a linearly separable set S, the optimal separating hyperplane is the separating hyperplane for which the distance to the closest (either positive or negative) points in S is maximum, therefore the one that maximizes 1/||w||.

[Figure: the optimal separating hyperplane D(x) = 0, where D(x) = w · x + w_{0}, together with the regions D(x) > 1, D(x) = 1, D(x) = −1 and D(x) < −1. The support vectors lie on the hyperplanes D(x) = ±1, at geometric margin 1/||w|| from the separating hyperplane, which yields the maximal margin between the two classes.]

## Linear SVMs: The Primal Form

minimize (1/2)||w||^{2}

subject to y_{i}(w · x_{i} + w_{0}) ≥ 1 for i = 1, . . . , m

This is a constrained quadratic problem (QP) with d + 1 parameters (w ∈ R^{d} and w_{0} ∈ R). It can be solved by quadratic optimisation methods if d is not very big (d ≈ 10^{3}).

For large values of d (d ≈ 10^{5}): due to the Kuhn-Tucker theorem, since the above objective function and the associated constraints are convex, we can use the method of Lagrange multipliers (α_{i} ≥ 0, i = 1, . . . , m) to put the above problem under an equivalent “dual” form.

Note: In the dual form, the variables (α_{i}) will be subject to much simpler
constraints than the variables (w, w_{0}) in the primal form.

## Linear SVMs: Getting the Dual Form

The Lagrangean function associated to the primal form of the given QP is

L_{P}(w, w_{0}, α) = (1/2)||w||^{2} − Σ_{i=1}^{m} α_{i}(y_{i}(w · x_{i} + w_{0}) − 1)

with α_{i} ≥ 0, i = 1, . . . , m. Finding the minimum of L_{P} implies

∂L_{P}/∂w_{0} = − Σ_{i=1}^{m} y_{i}α_{i} = 0

∂L_{P}/∂w = w − Σ_{i=1}^{m} y_{i}α_{i}x_{i} = 0 ⇒ w = Σ_{i=1}^{m} y_{i}α_{i}x_{i}

where ∂L_{P}/∂w = (∂L_{P}/∂w_{1}, . . . , ∂L_{P}/∂w_{d}).

By substituting these constraints into L_{P} we get its dual form

L_{D}(α) = Σ_{i=1}^{m} α_{i} − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} α_{i}α_{j}y_{i}y_{j} x_{i} · x_{j}

## Linear SVMs: The Dual Form

maximize Σ_{i=1}^{m} α_{i} − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} α_{i}α_{j}y_{i}y_{j} x_{i} · x_{j}

subject to Σ_{i=1}^{m} y_{i}α_{i} = 0

α_{i} ≥ 0, i = 1, . . . , m

The link between the primal and the dual form:

The optimal solution (w, w_{0}) of the primal QP problem
is given by

w = Σ_{i=1}^{m} α_{i}y_{i}x_{i}

α_{i}(y_{i}(w · x_{i} + w_{0}) − 1) = 0 for any i = 1, . . . , m

where α_{i} are the optimal solutions of the above (dual
form) optimisation problem.

## Support Vectors

The only α_{i} (solutions of the dual form of our QP problem)
that can be nonzero are those for which the constraints
y_{i}(w · x_{i} + w_{0}) ≥ 1 for i = 1, . . . , m in the primal form of the
QP are satisfied with the equality sign.

Because most α_{i} are null, the vector w is a linear combination of a relatively small percentage of the points x_{i}.

These points are called support vectors because they are the closest points to the optimal separating hyperplane (OSH) and the only points of S needed to determine the OSH.

The problem of classifying a new data point x is now simply
solved by looking at sign(w · x + w_{0}).
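As a concrete illustration, the sketch below (assuming scikit-learn and NumPy are available; the tiny data set is made up for the example) trains a linear SVM on a separable set, inspects the support vectors, and checks that the rule sign(w · x + w_{0}) reproduces the classifier's own predictions:

```python
import numpy as np
from sklearn.svm import SVC

# A small linearly separable set: three points per class
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 4.0], [3.0, 4.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin
w, w0 = clf.coef_[0], clf.intercept_[0]

print("support vectors:\n", clf.support_vectors_)

# Classifying a new point by sign(w . x + w0) matches clf.predict
x_new = np.array([2.5, 2.5])
print(int(np.sign(w @ x_new + w0)))
```

Only the support vectors (the points closest to the separating hyperplane) end up with nonzero α_{i}, so `clf.support_vectors_` is a small subset of X.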

## Linear SVMs with Soft Margin

If the set S is not linearly separable — or one simply ignores
whether or not S is linearly separable —, the previous
analysis can be generalised by introducing m non-negative
(“slack”) variables ξ_{i}, for i = 1, . . . , m such that

y_{i}(w · x_{i} + w_{0}) ≥ 1 − ξ_{i}, for i = 1, . . . , m

Purpose: to allow a small number of misclassified points, for better generalisation or computational efficiency.

## Generalised OSH

The generalised OSH is then viewed as the solution to the problem:

minimize (1/2)||w||^{2} + C Σ_{i=1}^{m} ξ_{i}

subject to y_{i}(w · x_{i} + w_{0}) ≥ 1 − ξ_{i} for i = 1, . . . , m

ξ_{i} ≥ 0 for i = 1, . . . , m

The associated dual form:

maximize Σ_{i=1}^{m} α_{i} − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} α_{i}α_{j}y_{i}y_{j} x_{i} · x_{j}

subject to Σ_{i=1}^{m} y_{i}α_{i} = 0

0 ≤ α_{i} ≤ C, i = 1, . . . , m

As before:

w = Σ_{i=1}^{m} α_{i}y_{i}x_{i}

α_{i}(y_{i}(w · x_{i} + w_{0}) − 1 + ξ_{i}) = 0

(C − α_{i}) ξ_{i} = 0
The role of C:

it acts as a regularizing parameter:

• large C ⇒ minimize the number of misclassified points

• small C ⇒ maximize the minimum distance 1/||w||
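To see this trade-off in practice, the sketch below (again assuming scikit-learn; the overlapping Gaussian data is invented for the example) fits a soft-margin linear SVM with a small and a large C and compares ||w||, hence the margin 1/||w||:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian classes, so some slack is unavoidable
X = np.vstack([rng.normal(-1.0, 1.2, size=(40, 2)),
               rng.normal(+1.0, 1.2, size=(40, 2))])
y = np.array([-1] * 40 + [+1] * 40)

norms = {}
for C in (0.01, 100.0):
    w = SVC(kernel="linear", C=C).fit(X, y).coef_[0]
    norms[C] = np.linalg.norm(w)
    print(f"C = {C}: ||w|| = {norms[C]:.3f}, margin 1/||w|| = {1.0 / norms[C]:.3f}")
```

The smaller C yields a smaller ||w||, i.e. a wider margin at the price of more margin violations.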

## 2. Nonlinear Support Vector Machines

• Note that the only way the data points appear in (the dual form of)
the training problem is in the form of dot products x_{i} · x_{j}.

• In a higher dimensional space, it is very likely that a linear separator can be constructed.

• We map the data points from the input space R^{d} into some space of
higher dimension R^{n} (n > d) using a function Φ : R^{d} → R^{n}

• Then the training algorithm would depend only on dot products of
the form Φ(x_{i}) · Φ(x_{j}).

• Constructing (via Φ) a separating hyperplane with maximum margin in the higher-dimensional space yields a nonlinear decision boundary in the input space.

## General Schema for Nonlinear SVMs

[Diagram: the input space (x) is mapped by Φ into an internal redescription space (h), where a linear separator produces the value (y) in the output space.]

## Introducing Kernel Functions

• But computing the dot products Φ(x_{i}) · Φ(x_{j}) in the higher-dimensional space is computationally expensive...

• If there were a “kernel function” K such that K(x_{i}, x_{j}) =
Φ(x_{i})·Φ(x_{j}), we would only use K in the training algorithm.

• All the previous derivations in the model of linear SVM hold (substituting the dot product with the kernel func- tion), since we are still doing a linear separation, but in a different space.

• Important remark: By the use of the kernel function, it is possible to compute the separating hyperplane without explicitly carrying out the map into the higher space.
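The identity K(x_{i}, x_{j}) = Φ(x_{i}) · Φ(x_{j}) can be checked numerically. The NumPy sketch below (the two test vectors are arbitrary) does so for the degree-2 polynomial kernel K(x, x^{′}) = (x · x^{′} + 1)^{2} on R^{2}, whose explicit feature map Φ is spelled out later in the xor exercise:

```python
import numpy as np

def K(x, xp):
    """Degree-2 polynomial kernel, computed in the input space."""
    return (x @ xp + 1.0) ** 2

def Phi(x):
    """Explicit feature map on R^2 with K(x, x') = Phi(x) . Phi(x')."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1 ** 2, x2 ** 2, s * x1 * x2, s * x1, s * x2, 1.0])

x, xp = np.array([0.5, -1.0]), np.array([2.0, 0.3])
print(K(x, xp), Phi(x) @ Phi(xp))   # the two values coincide up to rounding
```

Note that K touches only 2-dimensional vectors, while Φ lives in R^{6}; this gap grows quickly with the degree and the input dimension.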

## Some Classes of Kernel Functions for SVMs

• Polynomial: K(x, x^{′}) = (x · x^{′} + c)^{q}

• RBF (radial basis function): K(x, x^{′}) = e^{−||x−x^{′}||^{2}/(2σ^{2})}

• Sigmoid: K(x, x^{′}) = tanh(αx · x^{′} − b)

An Illustration

[Figure: decision surfaces obtained (a) by a polynomial classifier, and (b) by an RBF. Support vectors are indicated in dark fill.]

## Important Remark

The kernel functions only require calculations on x ∈ R^{d} (the input space), therefore they are not difficult to compute.

It remains to determine which kernel function K can be as- sociated with a given (redescription space) function Φ.

In practice, one proceeds the other way around:

we test kernel functions that are known to correspond to the dot product in a certain space (which will work as the redescription space, never made explicit).

Therefore, the user operates by “trial and error”...

Advantage: the only parameters when training an SVM are the kernel function K, and the “tradeoff” parameter C.

## Mercer’s Theorem (1909):

## A Characterisation of Kernel Functions for SVMs

Theorem: Let K : R^{d} × R^{d} → R be a symmetric function.

K represents a dot product,

i.e. there is a function Φ : R^{d} → R^{n} such that
K(x, x^{′}) = Φ(x) · Φ(x^{′})

if and only if

∫∫ K(x, x^{′})f(x)f(x^{′}) dx dx^{′} ≥ 0

for any function f such that ∫ f^{2}(x)dx is finite.

Remark: The theorem doesn’t say how to construct Φ.

## Some simple rules for building (Mercer) kernels

If K_{1} and K_{2} are kernels over X × X, with X ⊆ R^{n},
then

• K(x, y) = K_{1}(x, y) + K_{2}(x, y)

• K(x, y) = aK_{1}(x, y), with a ∈ R^{+}

• K(x, y) = K_{1}(x, y)K_{2}(x, y)
are also kernels.
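These closure rules can be sanity-checked numerically: a Mercer kernel must produce positive semidefinite Gram matrices on any sample. The NumPy sketch below (random sample points; linear and RBF kernels playing K_{1} and K_{2}) verifies that the sum, a positive scaling, and the elementwise product of two Gram matrices have no negative eigenvalues (up to rounding):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))       # arbitrary sample points in R^3

def gram(k):
    """Gram matrix G_ij = k(x_i, x_j) over the sample."""
    return np.array([[k(a, b) for b in X] for a in X])

K1 = gram(lambda a, b: a @ b)                                # linear kernel
K2 = gram(lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0))  # RBF kernel

# Sum, positive scaling, and elementwise (Schur) product stay PSD
for G in (K1 + K2, 3.0 * K1, K1 * K2):
    print(np.linalg.eigvalsh(G).min() >= -1e-8)
```

The elementwise-product case is the Schur product theorem in disguise, which is what makes the polynomial kernel (a sum of products of linear kernels) a valid Mercer kernel.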

## Illustrating the General Architecture of SVMs

for the problem of hand-written character recognition

[Diagram: the SVM computes K(x_{i}, x) between the input and each support vector, multiplies the results by the weights α_{i}y_{i}, and sums them.

Input: x

Comparison: K(x_{i}, x) for the support vectors x_{1}, x_{2}, x_{3}, . . .

Output: sign(Σ_{i} α_{i}y_{i}K(x_{i}, x) + w_{0})]

## An Exercise: xor

[Figure: the xor problem in the input space. The points (1, 1) and (−1, −1) have label −1; the points (1, −1) and (−1, 1) have label +1.]

Note: use K(x, x^{′}) = (x · x^{′} + 1)^{2}.

It can be easily shown that Φ(x) = (x_{1}^{2}, x_{2}^{2}, √2 x_{1}x_{2}, √2 x_{1}, √2 x_{2}, 1) ∈ R^{6} for x = (x_{1}, x_{2}) ∈ R^{2}.

| i | x_{i} | y_{i} | Φ(x_{i}) |
|---|---------|------|--------------------------|
| 1 | (1, 1) | −1 | (1, 1, √2, √2, √2, 1) |
| 2 | (1, −1) | +1 | (1, 1, −√2, √2, −√2, 1) |
| 3 | (−1, 1) | +1 | (1, 1, −√2, −√2, √2, 1) |
| 4 | (−1, −1) | −1 | (1, 1, √2, −√2, −√2, 1) |

L_{D}(α) = Σ_{i=1}^{4} α_{i} − (1/2) Σ_{i=1}^{4} Σ_{j=1}^{4} α_{i}α_{j}y_{i}y_{j} Φ(x_{i}) · Φ(x_{j})

= α_{1} + α_{2} + α_{3} + α_{4} − (1/2)(9α_{1}^{2} − 2α_{1}α_{2} − 2α_{1}α_{3} + 2α_{1}α_{4} + 9α_{2}^{2} + 2α_{2}α_{3} − 2α_{2}α_{4} + 9α_{3}^{2} − 2α_{3}α_{4} + 9α_{4}^{2})

subject to:

−α_{1} + α_{2} + α_{3} − α_{4} = 0

∂L_{D}(α)/∂α_{1} = 0 ⇔ 9α_{1} − α_{2} − α_{3} + α_{4} = 1

∂L_{D}(α)/∂α_{2} = 0 ⇔ α_{1} − 9α_{2} − α_{3} + α_{4} = −1

∂L_{D}(α)/∂α_{3} = 0 ⇔ α_{1} − α_{2} − 9α_{3} + α_{4} = −1

∂L_{D}(α)/∂α_{4} = 0 ⇔ α_{1} − α_{2} − α_{3} + 9α_{4} = 1

ᾱ_{1} = ᾱ_{2} = ᾱ_{3} = ᾱ_{4} = 1/8

w̄ = (1/8)(−Φ(x_{1}) + Φ(x_{2}) + Φ(x_{3}) − Φ(x_{4})) = (1/8)(0, 0, −4√2, 0, 0, 0)

w̄ · Φ(x_{i}) + w̄_{0} = y_{i} ⇒ w̄_{0} = 0

The optimal separating hyperplane: w̄ · Φ(x) + w̄_{0} = 0 ⇔ −x_{1}x_{2} = 0

Test: sign(−x_{1}x_{2})
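The solution above can be verified numerically (plain NumPy): with α_{i} = 1/8 and w̄_{0} = 0, the conditions y_{i}(Σ_{j} α_{j}y_{j}K(x_{j}, x_{i}) + w̄_{0}) = 1 hold at all four points, and sign(−x_{1}x_{2}) labels the xor data correctly:

```python
import numpy as np

X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y = np.array([-1, 1, 1, -1], dtype=float)

K = (X @ X.T + 1.0) ** 2        # Gram matrix of K(x, x') = (x . x' + 1)^2
alpha = np.full(4, 1.0 / 8.0)   # the solution found above
w0 = 0.0

# Each training point lies exactly on the margin: y_i * f(x_i) = 1
print(y * (K @ (alpha * y) + w0))                  # -> [1. 1. 1. 1.]

# The recovered decision function is sign(-x1 * x2)
print([int(np.sign(-x1 * x2)) for x1, x2 in X])    # -> [-1, 1, 1, -1]
```

All four points are support vectors here, which is why every α_{i} is strictly positive.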

## The xor Exercise: Result

[Figure: the xor result, shown both in the input space (x_{1}, x_{2}) and in the feature space (along the axis √2 x_{1}x_{2}). The decision function is D(x_{1}, x_{2}) = −x_{1}x_{2}: the boundary D = 0 corresponds to the coordinate axes of the input space, and the four training points lie on the level curves D = ±1. In the feature space, D = 0 is the maximum-margin separating hyperplane, with the support vectors at distance 1/||w̄|| = √2 from it.]

## Concluding Remarks: SVM — Pros and Cons

Pros:

• Find the optimal separating hyperplane.

• Can deal with very high dimensional data.

• Some kernels have infinite Vapnik-Chervonenkis dimension (see Computational learning theory, ch. 7 in Tom Mitchell’s book), which means that they can learn very elaborate concepts.

• Usually work very well.

Cons:

• Require both positive and negative examples.

• Need to select a good kernel function.

• Require lots of memory and CPU time.

• There are some numerical stability problems in solving the constrained QP.

## Multi-class Classification with SVM

SVMs can only do binary classification.

For M classes, one can use the one-against-the-rest approach: construct a hyperplane between class k and the M − 1 other classes, for k = 1, . . . , M ⇒ M SVMs.

To predict the output for a new instance, predict with each of these M SVMs, and then find out which one puts the instance furthest into the positive region of the instance space.
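A sketch of one-against-the-rest (assuming scikit-learn; the three-cluster data set is invented for the example): train one binary SVM per class, then predict with the machine whose decision value is largest:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in centers])
labels = np.repeat([0, 1, 2], 30)

machines = []                       # one SVM per class: class k vs. the rest
for k in range(3):
    y_k = np.where(labels == k, 1, -1)
    machines.append(SVC(kernel="linear").fit(X, y_k))

def predict(x):
    # Pick the class whose SVM pushes x furthest into its positive region
    scores = [m.decision_function([x])[0] for m in machines]
    return int(np.argmax(scores))

print(predict(np.array([4.0, 0.2])))
```

Comparing raw decision values works because each `decision_function` returns the signed distance (up to scaling) to that machine's hyperplane.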

## SVM Implementations

• SVMlight

• LIBSVM

• mySVM

• Matlab

• Huller

• ...

The SMO (Sequential Minimal Optimization) algorithm (John Platt, 1998)

Optimization problem:

max_{α} W(α) = Σ_{i=1}^{m} α_{i} − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} y_{i}y_{j}α_{i}α_{j} x_{i} · x_{j}

s.t. 0 ≤ α_{i} ≤ C, i = 1, . . . , m

Σ_{i=1}^{m} α_{i}y_{i} = 0.

Andrew Ng, Stanford, 2012 fall, ML course, Lecture notes 3.

Algorithm:

Repeat till convergence {

1. Select some pair α_{i} and α_{j} to update next (using a heuristic that tries to pick the two that will allow us to make the biggest progress towards the global maximum).

2. Reoptimize W(α) with respect to α_{i} and α_{j}, while holding all the other α_{k}'s (k ≠ i, j) fixed.

}

Update equations:

• α_{j}^{new, unclipped} = α_{j} − y_{j}(E_{i} − E_{j})/η

α_{j}^{new, clipped} = H if α_{j}^{new, unclipped} > H; α_{j}^{new, unclipped} if L ≤ α_{j}^{new, unclipped} ≤ H; L if α_{j}^{new, unclipped} < L

◦ where

E_{k} = w · x_{k} + w_{0} − y_{k}

w = Σ_{i=1}^{m} y_{i}α_{i}x_{i}

η = −||x_{i} − x_{j}||^{2}

L = max(0, α_{j} − α_{i}) and H = min(C, C + α_{j} − α_{i}) if y_{i} ≠ y_{j}

L = max(0, α_{j} + α_{i} − C) and H = min(C, α_{j} + α_{i}) if y_{i} = y_{j}

• α_{1}^{new} = α_{1}^{old} + y_{1}y_{2}(α_{2}^{old} − α_{2}^{new})
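The update equations above translate almost line by line into code. Below is a minimal SMO sketch in plain NumPy with a linear kernel, using random pair selection instead of Platt's selection heuristics and a crude intercept update (both simplifications):

```python
import numpy as np

def smo(X, y, C=1.0, iters=2000, seed=0):
    """Simplified SMO: random pair selection, linear kernel."""
    rng = np.random.default_rng(seed)
    m = len(y)
    K = X @ X.T                                 # Gram matrix, K_ij = x_i . x_j
    alpha, w0 = np.zeros(m), 0.0
    for _ in range(iters):
        i, j = rng.choice(m, size=2, replace=False)
        E = K @ (alpha * y) + w0 - y            # E_k = w . x_k + w0 - y_k
        eta = -(K[i, i] + K[j, j] - 2 * K[i, j])    # eta = -||x_i - x_j||^2
        if eta == 0:                            # x_i == x_j: no progress possible
            continue
        if y[i] != y[j]:                        # box constraints L <= alpha_j <= H
            L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
        else:
            L, H = max(0.0, alpha[j] + alpha[i] - C), min(C, alpha[j] + alpha[i])
        a_j = np.clip(alpha[j] - y[j] * (E[i] - E[j]) / eta, L, H)
        alpha[i] += y[i] * y[j] * (alpha[j] - a_j)  # keeps sum alpha_i y_i = 0
        alpha[j] = a_j
        sv = (alpha > 1e-8) & (alpha < C - 1e-8)    # margin support vectors
        if sv.any():                            # crude intercept: average over them
            w0 = float(np.mean((y - K @ (alpha * y))[sv]))
    return alpha, w0

X = np.array([[0., 0.], [1., 0.], [3., 3.], [4., 3.]])
y = np.array([-1., -1., 1., 1.])
alpha, w0 = smo(X, y, C=10.0)
w = (alpha * y) @ X                             # recover w = sum alpha_i y_i x_i
print(np.sign(X @ w + w0))                      # training predictions
```

On this small separable set the sketch should classify all four training points correctly; a production implementation would add Platt's pair-selection heuristics and a convergence test on the KKT conditions.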

Assume that α_{1} and α_{2} are the free dual variables, and let the indices i and j be used for the other variables.

Credit: John Platt, Fast training of SVMs using Sequential Minimal Optimization, 2000.

The two Lagrange multipliers must fulfill all the constraints of the full problem. The inequality constraints cause the Lagrange multipliers to lie in the box [0, C] × [0, C], while the linear equality constraint causes them to lie on a diagonal line. Therefore, one step of SMO must find an optimum of the objective function on a diagonal line segment. Here γ = α_{1}^{old} + sα_{2}^{old} is a constant that depends on the previous values of α_{1} and α_{2}, and s = y_{1}y_{2}.

Proof

[following N. Cristianini and J. Shawe-Taylor, An Introduction to SVM, 2000, pag. 138-140]

The objective function (writing x_{i}^{2} for x_{i} · x_{i}, and f(x) = w · x + w_{0}):

W(α_{1}, α_{2}, . . . , α_{m}) = Σ_{i=1}^{m} α_{i} − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} y_{i}y_{j}α_{i}α_{j} x_{i} · x_{j}

v_{i} := (Σ_{j=3}^{m} y_{j}α_{j}x_{j}) · x_{i} = f(x_{i}) − w_{0} − Σ_{j=1}^{2} y_{j}α_{j}x_{j} · x_{i} for i = 1, 2

⇒ W(α_{1}, α_{2}) = α_{1} + α_{2} − (1/2)α_{1}^{2}x_{1}^{2} − (1/2)α_{2}^{2}x_{2}^{2} − y_{1}y_{2}α_{1}α_{2}x_{1} · x_{2} − y_{1}α_{1}v_{1} − y_{2}α_{2}v_{2} + const

From Σ_{i=1}^{m} y_{i}α_{i} = 0, with s := y_{1}y_{2}, we get α_{1}^{old} + sα_{2}^{old} = α_{1}^{new} + sα_{2}^{new} = γ (another constant that depends on the previous values of α_{1} and α_{2})

⇒ α_{1}^{new} = α_{1}^{old} + y_{1}y_{2}(α_{2}^{old} − α_{2}^{new})

Substituting α_{1} = γ − sα_{2}:

⇒ W(α_{2}) = γ − sα_{2} + α_{2} − (1/2)(γ − sα_{2})^{2}x_{1}^{2} − (1/2)α_{2}^{2}x_{2}^{2} − y_{1}y_{2}x_{1} · x_{2}(γ − sα_{2})α_{2} − y_{1}v_{1}(γ − sα_{2}) − y_{2}v_{2}α_{2} + const

∂W(α_{2})/∂α_{2} = −s + 1 + sx_{1}^{2}(γ − sα_{2}) − x_{2}^{2}α_{2} − y_{1}y_{2}(x_{1} · x_{2})γ + 2y_{1}y_{2}(x_{1} · x_{2})sα_{2} + y_{1}v_{1}s − y_{2}v_{2}

= 1 − s + x_{1}^{2}(sγ − α_{2}) − x_{2}^{2}α_{2} − sγ(x_{1} · x_{2}) + 2(x_{1} · x_{2})α_{2} + y_{1}v_{1}s − y_{2}v_{2}

Finding the stationary point: ∂W(α_{2})/∂α_{2} = 0 ⇒

α_{2}^{new, unclipped}(x_{1}^{2} + x_{2}^{2} − 2x_{1} · x_{2}) = 1 − s + γsx_{1}^{2} − γs(x_{1} · x_{2}) + y_{2}v_{1} − y_{2}v_{2}

For the right-hand side (using y_{1}s = y_{2} and y_{2}^{2} = 1):

1 − s + γsx_{1}^{2} − γs(x_{1} · x_{2}) + y_{2}v_{1} − y_{2}v_{2} = y_{2}(y_{2} − y_{1} + γy_{1}(x_{1}^{2} − x_{1} · x_{2}) + v_{1} − v_{2})

v_{1} − v_{2} = f(x_{1}) − y_{1}α_{1}^{old}x_{1}^{2} − y_{2}α_{2}^{old}x_{1} · x_{2} − f(x_{2}) + y_{1}α_{1}^{old}x_{1} · x_{2} + y_{2}α_{2}^{old}x_{2}^{2}

= f(x_{1}) − y_{1}(γ − sα_{2}^{old})x_{1}^{2} − y_{2}α_{2}^{old}x_{1} · x_{2} − f(x_{2}) + y_{1}(γ − sα_{2}^{old})x_{1} · x_{2} + y_{2}α_{2}^{old}x_{2}^{2}

Therefore

1 − s + γsx_{1}^{2} − γs(x_{1} · x_{2}) + y_{2}v_{1} − y_{2}v_{2} = y_{2}(y_{2} − y_{1} + f(x_{1}) + y_{1}sα_{2}^{old}x_{1}^{2} − y_{2}α_{2}^{old}x_{1} · x_{2} − f(x_{2}) − y_{1}sα_{2}^{old}x_{1} · x_{2} + y_{2}α_{2}^{old}x_{2}^{2})

= y_{2}(f(x_{1}) − y_{1} − f(x_{2}) + y_{2}) + α_{2}^{old}(x_{1}^{2} + x_{2}^{2} − 2x_{1} · x_{2})

= y_{2}(E_{1} − E_{2}) + α_{2}^{old}(x_{1}^{2} + x_{2}^{2} − 2x_{1} · x_{2})

⇒ α_{2}^{new, unclipped}(x_{1}^{2} + x_{2}^{2} − 2x_{1} · x_{2}) = y_{2}(E_{1} − E_{2}) + α_{2}^{old}(x_{1}^{2} + x_{2}^{2} − 2x_{1} · x_{2})

⇒ α_{2}^{new, unclipped} = α_{2}^{old} − y_{2}(E_{1} − E_{2})/η, since η = −||x_{1} − x_{2}||^{2} = −(x_{1}^{2} + x_{2}^{2} − 2x_{1} · x_{2}).