Artificial Neural Networks
Based on “Machine Learning”, T. Mitchell, McGraw-Hill, 1997, ch. 4
Acknowledgement:
The present slides are an adaptation of slides drawn by T. Mitchell
PLAN
1. Introduction
   Connectionist models
   Two NN examples: ALVINN driving system, face recognition
2. The perceptron; the linear unit; the sigmoid unit
   The gradient descent learning rule
3. Multilayer networks of sigmoid units
   The Backpropagation algorithm
   Hidden layer representations
   Overfitting in NNs
4. Advanced topics
   Alternative error functions
   Predicting a probability function
   [Recurrent networks]
   [Dynamically modifying the network structure]
   [Alternative error minimization procedures]
5. Expressive capabilities of NNs
Connectionist Models
Consider humans:
• Neuron switching time: ~0.001 sec.
• Number of neurons: ~$10^{10}$
• Connections per neuron: $10^4$–$10^5$
• Scene recognition time: ~0.1 sec.
• 100 inference steps doesn’t seem like enough
→ much parallel computation
Properties of artificial neural nets:
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processing
• Emphasis on tuning weights automatically
A First NN Example:
ALVINN drives at 70 mph on highways
[Pomerleau, 1993]
[Figure: the ALVINN network — a 30×32 sensor input retina feeding 4 hidden units, which feed 30 output units ranging from Sharp Left through Straight Ahead to Sharp Right]
A Second NN Example
Neural Nets for Face Recognition
[Figure: network with 30×32 inputs feeding a hidden layer, and four outputs for head pose: left, strt, rght, up]
Typical input images:
http://www.cs.cmu.edu/∼tom/faces.html
Results:
90% accurate learning head pose, and recognizing 1-of-20 faces
Learned Weights
[Figure: learned hidden-unit weights over the 30×32 inputs for the left/strt/rght/up network, shown after 1 epoch and after 100 epochs]
Design Issues for these two NN Examples
See Tom Mitchell’s “Machine Learning” book, pag. 82-83 and 114 for ALVINN, and pag. 112-117 for face recognition:
• input encoding
• output encoding
• network graph structure
• learning parameters: η (learning rate), α (momentum), number of iterations
When to Consider Neural Networks
• Input is high-dimensional discrete or real-valued (e.g. raw sensor input)
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant
2. The Perceptron
[Rosenblatt, 1962]
[Figure: perceptron — inputs $x_1, \dots, x_n$ with weights $w_1, \dots, w_n$, a fixed input $x_0 = 1$ with weight $w_0$, a summation node computing $net = \sum_{i=0}^{n} w_i x_i$, and a threshold output: 1 if $net > 0$, −1 otherwise]

$$o(x_1, \dots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n \ge 0 \\ -1 & \text{otherwise} \end{cases}$$
Sometimes we’ll use simpler vector notation:
$$o(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} \ge 0 \\ -1 & \text{otherwise} \end{cases}$$
Decision Surface of a Perceptron
[Figure: (a) a set of +/− examples in the $(x_1, x_2)$ plane that is linearly separable by a perceptron decision surface; (b) a set of examples (XOR-like) that is not linearly separable]
Represents some useful functions
• What weights represent $g(x_1, x_2) = AND(x_1, x_2)$? (One answer: $w_0 = -0.8$, $w_1 = w_2 = 0.5$.)
But certain examples may not be linearly separable
• Therefore, we’ll want networks of these...
The Perceptron Training Rule
$$w_i \leftarrow w_i + \Delta w_i \quad \text{with} \quad \Delta w_i = \eta (t - o) x_i$$

or, in vector notation:

$$\vec{w} \leftarrow \vec{w} + \Delta \vec{w} \quad \text{with} \quad \Delta \vec{w} = \eta (t - o) \vec{x}$$

where:
• $t = c(\vec{x})$ is the target value
• $o$ is the perceptron output
• $\eta$ is a small positive constant (e.g., 0.1) called the learning rate

It will converge (proven by [Minsky & Papert, 1969])
• if the training data is linearly separable
• and $\eta$ is sufficiently small.
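To make the rule concrete, here is a minimal sketch in Python/NumPy (the function names and the AND training data are illustrative, not from the slides):

```python
import numpy as np

# A minimal sketch of the perceptron training rule (illustrative names).
# Each input vector is prefixed with x0 = 1; targets t are in {-1, +1}.

def perceptron_output(w, x):
    """o = 1 if w . x >= 0, else -1."""
    return 1 if np.dot(w, x) >= 0 else -1

def train_perceptron(examples, eta=0.1, max_epochs=100):
    w = np.zeros(len(examples[0][0]))
    for _ in range(max_epochs):
        converged = True
        for x, t in examples:
            o = perceptron_output(w, x)
            if o != t:                  # (t - o) is 0 unless misclassified
                w += eta * (t - o) * x  # Delta w_i = eta (t - o) x_i
                converged = False
        if converged:
            return w                    # every example classified correctly
    return w

# Usage: learning AND over inputs in {-1, +1} (linearly separable data).
data = [(np.array([1., -1., -1.]), -1), (np.array([1., -1., 1.]), -1),
        (np.array([1., 1., -1.]), -1), (np.array([1., 1., 1.]), 1)]
w = train_perceptron(data)
```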
2′. The Linear Unit
To understand the perceptron’s training rule, consider a (simpler) linear unit, where
o = w0 + w1x1 + · · · + wnxn
Let’s learn wi’s that minimize the squared error
$$E[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$

where $D$ is the set of training examples.
[Figure: linear unit — inputs $x_1, \dots, x_n$ with weights $w_1, \dots, w_n$, $x_0 = 1$ with weight $w_0$, and output $o = \sum_{i=0}^{n} w_i x_i$]
The linear unit uses the gradient descent training rule, presented on the next slides.
Remark:
Ch. 6 (Bayesian Learning) shows that the hypothesis h = (w0, w1, . . . , wn) that minimises E is the most probable hypothesis given the training data.
The Gradient Descent Rule
[Figure: the error surface $E[\vec{w}]$ as a function of $w_0$ and $w_1$ — a paraboloid with a single global minimum]
Gradient:
$$\nabla E[\vec{w}] \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \cdots, \frac{\partial E}{\partial w_n} \right]$$

Training rule:
$$\vec{w} \leftarrow \vec{w} + \Delta \vec{w}, \quad \text{with} \quad \Delta \vec{w} = -\eta \nabla E[\vec{w}]$$

Therefore,
$$w_i \leftarrow w_i + \Delta w_i, \quad \text{with} \quad \Delta w_i = -\eta \frac{\partial E}{\partial w_i}$$
The Gradient Descent Rule for the Linear Unit Computation
$$\begin{aligned}
\frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_d (t_d - o_d)^2 = \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2 \\
&= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d) = \sum_d (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \vec{w} \cdot \vec{x}_d) \\
&= \sum_d (t_d - o_d)(-x_{i,d})
\end{aligned}$$

Therefore
$$\Delta w_i = \eta \sum_d (t_d - o_d) x_{i,d}$$
The Gradient Descent Algorithm for the Linear Unit
Gradient-Descent(training examples, η)
Each training example is a pair of the form $\langle \vec{x}, t \rangle$, where $\vec{x}$ is the vector of input values and $t$ is the target output value. $\eta$ is the learning rate (e.g., 0.05).
• Initialize each $w_i$ to some small random value
• Until the termination condition is met:
  – Initialize each $\Delta w_i$ to zero.
  – For each $\langle \vec{x}, t \rangle$ in training examples:
    ∗ Input the instance $\vec{x}$ to the unit and compute the output $o$
    ∗ For each linear unit weight $w_i$: $\Delta w_i \leftarrow \Delta w_i + \eta (t - o) x_i$
  – For each linear unit weight $w_i$: $w_i \leftarrow w_i + \Delta w_i$
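A minimal NumPy rendering of this pseudocode might look as follows (the names, the fixed step count, and the initialization range are illustrative choices):

```python
import numpy as np

# A minimal sketch of Gradient-Descent(training_examples, eta) for the
# linear unit. Each example is a pair (x, t), x already prefixed by x0 = 1.

def gradient_descent(training_examples, eta=0.05, n_steps=1000):
    n = len(training_examples[0][0])
    w = np.random.uniform(-0.05, 0.05, n)  # small random initial weights
    for _ in range(n_steps):               # termination: fixed step count
        delta_w = np.zeros(n)
        for x, t in training_examples:
            o = np.dot(w, x)               # linear unit output
            delta_w += eta * (t - o) * x   # accumulate eta (t - o) x_i
        w += delta_w                       # apply the batch update
    return w
```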
Convergence
[Hertz et al., 1991]
The gradient descent training rule used by the linear unit is guaranteed to converge to a hypothesis with minimum squared error
• given a sufficiently small learning rate η
• even when the training data contains noise
• even when the training data is not separable by H
Note:
If η is too large, the gradient descent search runs the risk of overstepping the minimum in the error surface rather than settling into it. For this reason, one common modification of the algorithm is to gradually reduce the value of η as the number of gradient descent steps grows.
Remark
Gradient descent (and similarly, gradient ascent: $\vec{w} \leftarrow \vec{w} + \eta \nabla E$) is an important general paradigm for learning. It is a strategy for searching through a large or infinite hypothesis space that can be applied whenever
• the hypothesis space contains continuously parametrized hypotheses
• the error can be differentiated w.r.t. these hypothesis parameters.
Practical difficulties in applying the gradient method:
• if there are multiple local optima in the error surface, then there is no guarantee that the procedure will find the global optimum.
• converging to a local optimum can sometimes be quite slow.
To alleviate these difficulties, a variation called incremental
(or: stochastic) gradient method was proposed.
Incremental (Stochastic) Gradient Descent
Batch mode Gradient Descent:
Do until satisfied:
1. Compute the gradient $\nabla E_D[\vec{w}]$
2. $\vec{w} \leftarrow \vec{w} - \eta \nabla E_D[\vec{w}]$

Incremental mode Gradient Descent:
Do until satisfied:
• For each training example $d$ in $D$:
  1. Compute the gradient $\nabla E_d[\vec{w}]$
  2. $\vec{w} \leftarrow \vec{w} - \eta \nabla E_d[\vec{w}]$

$$E_D[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \qquad E_d[\vec{w}] \equiv \frac{1}{2} (t_d - o_d)^2$$
Convergence:
The Incremental Gradient Descent can approximate the Batch Gradient Descent arbitrarily closely if η is made small enough.
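For contrast with the batch sketch above, an incremental variant (same illustrative assumptions; note the weights move after every single example):

```python
import numpy as np

# Incremental (stochastic) variant: the update is applied immediately after
# each example, i.e. a step along -grad E_d[w] rather than -grad E_D[w].

def incremental_gradient_descent(training_examples, eta=0.05, n_steps=1000):
    n = len(training_examples[0][0])
    w = np.random.uniform(-0.05, 0.05, n)
    for _ in range(n_steps):
        for x, t in training_examples:
            o = np.dot(w, x)
            w += eta * (t - o) * x   # immediate update for this example
    return w
```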
2′′. The Sigmoid Unit

[Figure: sigmoid unit — inputs $x_1, \dots, x_n$ with weights $w_1, \dots, w_n$, $x_0 = 1$ with weight $w_0$; $net = \sum_{i=0}^{n} w_i x_i$ and $o = \sigma(net) = \frac{1}{1 + e^{-net}}$]

$\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function.
Nice property: $\frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))$
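A tiny numerical spot-check of this property (an illustrative sketch, not from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # d sigma/dx = sigma(x) (1 - sigma(x))

# Compare against a central finite difference at x = 0.5:
x = 0.5
numeric = (sigmoid(x + 1e-6) - sigmoid(x - 1e-6)) / 2e-6
assert abs(numeric - sigmoid_prime(x)) < 1e-8
```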
We can derive gradient descent rules to train
• One sigmoid unit
• Multilayer networks of sigmoid units → Backpropagation
Error Gradient for the Sigmoid Unit
$$\begin{aligned}
\frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 = \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2 \\
&= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d) = \sum_d (t_d - o_d) \left( -\frac{\partial o_d}{\partial w_i} \right) \\
&= -\sum_d (t_d - o_d) \frac{\partial o_d}{\partial net_d} \frac{\partial net_d}{\partial w_i}
\end{aligned}$$

where $net_d = \sum_{i=0}^{n} w_i x_{i,d}$.

But
$$\frac{\partial o_d}{\partial net_d} = \frac{\partial \sigma(net_d)}{\partial net_d} = o_d (1 - o_d) \qquad \frac{\partial net_d}{\partial w_i} = \frac{\partial (\vec{w} \cdot \vec{x}_d)}{\partial w_i} = x_{i,d}$$

So:
$$\frac{\partial E}{\partial w_i} = -\sum_{d \in D} o_d (1 - o_d)(t_d - o_d) x_{i,d}$$
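Translated into code, one batch gradient-descent step for a single sigmoid unit might look as follows (a sketch; the names and the bias convention $x_0 = 1$ are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_unit_step(w, examples, eta=0.1):
    """One batch step: w <- w - eta * dE/dw, using the gradient above."""
    grad = np.zeros_like(w)
    for x, t in examples:
        o = sigmoid(np.dot(w, x))           # o_d = sigma(w . x_d)
        grad -= (t - o) * o * (1 - o) * x   # dE/dw_i = -sum_d (t-o) o(1-o) x_i
    return w - eta * grad
```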
Remark
Instead of gradient descent, one could use linear programming to find a hypothesis consistent with separable data.
[Duda & Hart, 1973] have shown that linear programming can be extended to the non-linearly separable case.
However, linear programming does not scale to multilayer networks, as gradient descent does (see next section).
3. Multilayer Networks of Sigmoid Units
An example
This network was trained to recognize 1 of 10 vowel sounds occurring in the context “h d” (e.g. “head”, “hid”).
The inputs have been obtained from a spectral analysis of sound.
The 10 network outputs correspond to the 10 possible vowel sounds. The network prediction is the output whose value is the highest.
[Figure: the learned decision regions in the $(F_1, F_2)$ input plane, with regions labeled head, hid, who’d, hood, ...]
This plot illustrates the highly non-linear decision surface represented by the learned network.
Points shown on the plot are test examples distinct from the examples used to train the network.
from [Huang & Lippmann, 1988]
3.1 The Backpropagation Algorithm (Rumelhart et al., 1986)
Formulation for a feed-forward 2-layer network of sigmoid units, the stochastic version
Idea: Gradient descent over the entire vector of network weights.
Initialize all weights to small random numbers.
Until satisfied (stopping criterion to be defined later), for each training example:
1. input the training example to the network, and compute the network outputs
2. for each output unit $k$: $\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$
3. for each hidden unit $h$: $\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{kh} \delta_k$
4. update each network weight $w_{ji}$: $w_{ji} \leftarrow w_{ji} + \Delta w_{ji}$, where $\Delta w_{ji} = \eta \delta_j x_{ji}$ and $x_{ji}$ is the $i$th input to unit $j$.
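Below is a minimal sketch of this stochastic Backpropagation loop for one hidden layer, in Python/NumPy; the numbered comments map to the steps above (array shapes, names, and bias handling are illustrative choices, not prescribed by the slides):

```python
import numpy as np

# A minimal sketch of stochastic Backpropagation for one hidden layer of
# sigmoid units (shapes, names, and bias handling are illustrative).

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_epoch(X, T, W_hid, W_out, eta=0.05):
    """One pass over the data.

    X: (m, n_in) inputs; T: (m, n_out) targets in (0, 1);
    W_hid: (n_hid, n_in + 1); W_out: (n_out, n_hid + 1); column 0 = bias."""
    for x, t in zip(X, T):
        x = np.append(1.0, x)                  # bias input x0 = 1
        o_h = sigmoid(W_hid @ x)               # 1. hidden-unit outputs
        h = np.append(1.0, o_h)                # bias for the output layer
        o_k = sigmoid(W_out @ h)               # 1. network outputs
        delta_k = o_k * (1 - o_k) * (t - o_k)  # 2. output-unit deltas
        delta_h = o_h * (1 - o_h) * (W_out[:, 1:].T @ delta_k)  # 3. hidden
        W_out += eta * np.outer(delta_k, h)    # 4. w_ji += eta delta_j x_ji
        W_hid += eta * np.outer(delta_h, x)
    return W_hid, W_out
```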
Derivation of the Backpropagation rule,
(following [Tom Mitchell, 1997], pag. 101–103)
Notations:
$x_{ji}$: the $i$th input to unit $j$ ($j$ could be either a hidden or an output unit)
$w_{ji}$: the weight associated with the $i$th input to unit $j$
$net_j = \sum_i w_{ji} x_{ji}$
$\sigma$: the sigmoid function
$o_j$: the output computed by unit $j$ ($o_j = \sigma(net_j)$)
$outputs$: the set of units in the final layer of the network
$Downstream(j)$: the set of units whose immediate inputs include the output of unit $j$
$E_d$: the training error on the example $d$ (summing over all of the network output units)

[Figure: a layered network with weight $w_{ji}$ feeding unit $j$ from unit $i$; the units belonging to $Downstream(j)$ are highlighted in magenta]
Preliminaries
$$E_d(\vec{w}) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2 = \frac{1}{2} \sum_{k \in outputs} (t_k - \sigma(net_k))^2$$

[Figure: a single unit $j$ with inputs $x_{ji}$, weights $w_{ji}$, net input $net_j$, and output $o_j = \sigma(net_j)$]

Common to both hidden and output units:
$$net_j = \sum_i w_{ji} x_{ji}$$
$$\Rightarrow \frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} x_{ji}$$
$$\Rightarrow \Delta w_{ji} \stackrel{def}{=} -\eta \frac{\partial E_d}{\partial w_{ji}} = -\eta \frac{\partial E_d}{\partial net_j} x_{ji}$$

[Figure: unit $j$ feeding downstream units $k$ through weights $w_{kj}$, with net inputs $net_k$ and outputs $o_k$]

Note: In the sequel we will use the notation $\delta_j = -\frac{\partial E_d}{\partial net_j}$, so that $\Delta w_{ji} = \eta \delta_j x_{ji}$.
Stage/Case 1: Computing the increments (∆) for output unit weights
$$\frac{\partial E_d}{\partial net_j} = \frac{\partial E_d}{\partial o_j} \frac{\partial o_j}{\partial net_j} \qquad \frac{\partial o_j}{\partial net_j} = \frac{\partial \sigma(net_j)}{\partial net_j} = o_j (1 - o_j)$$

$$\frac{\partial E_d}{\partial o_j} = \frac{\partial}{\partial o_j} \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2 = \frac{\partial}{\partial o_j} \frac{1}{2} (t_j - o_j)^2 = \frac{1}{2} \, 2 (t_j - o_j) \frac{\partial (t_j - o_j)}{\partial o_j} = -(t_j - o_j)$$

(the terms for $k \ne j$ vanish, since the output $o_k$ of another output unit does not depend on $o_j$)

$$\Rightarrow \frac{\partial E_d}{\partial net_j} = -(t_j - o_j) \, o_j (1 - o_j) = -o_j (1 - o_j)(t_j - o_j)$$
$$\Rightarrow \delta_j \stackrel{not.}{=} -\frac{\partial E_d}{\partial net_j} = o_j (1 - o_j)(t_j - o_j)$$
$$\Rightarrow \Delta w_{ji} = \eta \delta_j x_{ji} = \eta \, o_j (1 - o_j)(t_j - o_j) \, x_{ji}$$
Stage/Case 2: Computing the increments (∆) for hidden unit weights
$$\begin{aligned}
\frac{\partial E_d}{\partial net_j} &= \sum_{k \in Downstream(j)} \frac{\partial E_d}{\partial net_k} \frac{\partial net_k}{\partial net_j} = \sum_{k \in Downstream(j)} -\delta_k \frac{\partial net_k}{\partial net_j} \\
&= \sum_{k \in Downstream(j)} -\delta_k \frac{\partial net_k}{\partial o_j} \frac{\partial o_j}{\partial net_j} = \sum_{k \in Downstream(j)} -\delta_k w_{kj} \frac{\partial o_j}{\partial net_j} \\
&= \sum_{k \in Downstream(j)} -\delta_k w_{kj} \, o_j (1 - o_j)
\end{aligned}$$

Therefore:
$$\delta_j \stackrel{not.}{=} -\frac{\partial E_d}{\partial net_j} = o_j (1 - o_j) \sum_{k \in Downstream(j)} \delta_k w_{kj}$$

$$\Delta w_{ji} \stackrel{def}{=} -\eta \frac{\partial E_d}{\partial w_{ji}} = -\eta \frac{\partial E_d}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}} = -\eta \frac{\partial E_d}{\partial net_j} x_{ji} = \eta \delta_j x_{ji} = \eta \left[ o_j (1 - o_j) \sum_{k \in Downstream(j)} \delta_k w_{kj} \right] x_{ji}$$
Convergence of Backpropagation
for NNs of Sigmoid units
Nature of convergence
• The weights are initialized near zero;
therefore, initial decision surfaces are near-linear.
Explanation: $o_j$ is of the form $\sigma(\vec{w} \cdot \vec{x})$, and $w_{ji} \approx 0$ for all $i, j$ at initialization;
note that the graph of $\sigma$ is approximately linear in the vicinity of 0.
• Increasingly non-linear functions are possible as training progresses
• Will find a local, not necessarily global error minimum.
In practice, often works well (can run multiple times).
More on Backpropagation
• Easily generalized to arbitrary directed graphs
• Training can take thousands of iterations → slow!
• Often include weight momentum $\alpha$:
$$\Delta w_{ji}(n) = \eta \delta_j x_{ji} + \alpha \Delta w_{ji}(n-1)$$
Effect:
– speed up convergence (increase the step size in regions where the gradient is unchanging);
– “keep the ball rolling” through local minima (or along flat regions) in the error surface.
(A small update sketch is given after this list.)
• Using network after training is very fast
• Minimizes error over training examples;
Will it generalize well to subsequent examples?
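A minimal sketch of the momentum update (hypothetical names; `velocity` stores the previous increment $\Delta w(n-1)$):

```python
import numpy as np

def momentum_update(W, grad_step, velocity, alpha=0.9):
    """Apply Dw(n) = eta*delta_j*x_ji + alpha*Dw(n-1).

    grad_step is the plain gradient-descent increment eta*delta_j*x_ji;
    velocity holds the previous increment Dw(n-1)."""
    velocity = grad_step + alpha * velocity
    return W + velocity, velocity

# Usage: start with zero velocity of the same shape as the weights.
W = np.zeros((3, 4))
velocity = np.zeros_like(W)
```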
3.2 Stopping Criteria when Training ANNs and Overfitting
(see Tom Mitchell’s “Machine Learning” book, pag. 108-112)
[Plot: training set error and validation set error versus number of weight updates (example 1)]
[Plot: training set error and validation set error versus number of weight updates (example 2)]
Plots of the error E, as a function of the number of weights updates, for two different robot perception tasks.
3.3 Learning Hidden Layer Representations
An Example: Learning the identity function $f(\vec{x}) = \vec{x}$
[Figure: the 8 × 3 × 8 network — eight input units, three hidden units, eight output units]
Input → Output
10000000 → 10000000
01000000 → 01000000
00100000 → 00100000
00010000 → 00010000
00001000 → 00001000
00000100 → 00000100
00000010 → 00000010
00000001 → 00000001
Learned hidden layer representation:
Input → Hidden Values → Output
10000000 → .89 .04 .08 → 10000000
01000000 → .15 .99 .99 → 01000000
00100000 → .01 .97 .27 → 00100000
00010000 → .99 .97 .71 → 00010000
00001000 → .03 .05 .02 → 00001000
00000100 → .01 .11 .88 → 00000100
00000010 → .80 .01 .98 → 00000010
00000001 → .60 .94 .01 → 00000001
After 8000 training epochs, the 3 hidden unit values encode the 8 distinct inputs. Note that if the encoded values are rounded to 0 or 1, the result is a standard binary encoding of 8 distinct values (however not the usual one, i.e. 1 → 001, 2 → 010, etc.).
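This 8 × 3 × 8 experiment can be reproduced with the `backprop_epoch` sketch from section 3.1 (a usage sketch under assumed settings: the learning rate, initialization, and absence of momentum are illustrative; the 8000-epoch count follows the slide):

```python
import numpy as np

X = np.eye(8)                             # the 8 one-hot input strings
T = X.copy()                              # identity target: f(x) = x
rng = np.random.default_rng(0)
W_hid = rng.uniform(-0.05, 0.05, (3, 9))  # 8 inputs + bias -> 3 hidden units
W_out = rng.uniform(-0.05, 0.05, (8, 4))  # 3 hidden + bias -> 8 outputs
for _ in range(8000):                     # 8000 epochs, as on the slide
    W_hid, W_out = backprop_epoch(X, T, W_hid, W_out, eta=0.3)
# Inspect the learned hidden encoding of each input, as in the table above:
hidden = 1.0 / (1.0 + np.exp(-(np.hstack([np.ones((8, 1)), X]) @ W_hid.T)))
```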
Training (I)
[Plot: sum of squared errors for each output unit, versus number of training epochs (0 to 2500)]
Training (II)
[Plot: the hidden unit encoding for input 01000000, versus number of training epochs]
Training (III)
[Plot: weights from the inputs to one hidden unit, versus number of training epochs]
4. Advanced Topics
4.1 Alternative Error Functions
(see ML book, pag. 117-118)
• Penalize large weights:
$$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 + \gamma \sum_{i,j} w_{ji}^2$$
• Train on target slopes as well as values:
$$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} \left[ (t_{kd} - o_{kd})^2 + \mu \sum_{j \in inputs} \left( \frac{\partial t_{kd}}{\partial x_{jd}} - \frac{\partial o_{kd}}{\partial x_{jd}} \right)^2 \right]$$
• Tie together weights: e.g., in phoneme recognition network;
• Minimizing the cross entropy (see next 3 slides):
$$-\sum_{d \in D} \left[ t_d \log o_d + (1 - t_d) \log(1 - o_d) \right]$$
where $o_d$, the output of the network, represents the estimated probability that the training instance $x_d$ is associated with the label (target value) 1.
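Illustrative sketches of the first and last of these error functions (the names are assumptions; the clipping constant guards against log(0)):

```python
import numpy as np

def error_with_weight_decay(T, O, weights, gamma=1e-4):
    """E = 1/2 sum_d sum_k (t_kd - o_kd)^2 + gamma * sum_ji w_ji^2."""
    penalty = gamma * sum(np.sum(W ** 2) for W in weights)
    return 0.5 * np.sum((T - O) ** 2) + penalty

def cross_entropy(t, o, eps=1e-12):
    """-sum_d [ t_d log o_d + (1 - t_d) log(1 - o_d) ]."""
    o = np.clip(o, eps, 1.0 - eps)
    return -np.sum(t * np.log(o) + (1.0 - t) * np.log(1.0 - o))
```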
4.2 Predicting a probability function:
Learning the ML hypothesis using a NN
(see Tom Mitchell’s “Machine Learning” book, pag. 118, 167-171)
Let us consider a non-deterministic function (LC: one-to-many relation) f : X → {0,1}.
Given a set of independently drawn examples
D = {< x1, d1 >, . . . , < xm, dm >} where di = f(xi) ∈ {0,1},
we would like to learn the probability function $g(x) \stackrel{def}{=} P(f(x) = 1)$.
The ML hypothesis $h_{ML} = \mathrm{argmax}_{h \in H} P(D \mid h)$ in such a setting is $h_{ML} = \mathrm{argmax}_{h \in H} G(h, D)$, where
$$G(h, D) = \sum_{i=1}^{m} \left[ d_i \ln h(x_i) + (1 - d_i) \ln(1 - h(x_i)) \right]$$
We will use a NN for this task.
For simplicity, a single layer with sigmoidal units is considered.
The training will be done by gradient ascent:
$$\vec{w} \leftarrow \vec{w} + \eta \nabla G(D, h)$$
The partial derivative of G(D, h) with respect to wjk, which is the weight for the kth input to unit j, is:
$$\frac{\partial G(D,h)}{\partial w_{jk}} = \sum_{i=1}^{m} \frac{\partial G(D,h)}{\partial h(x_i)} \cdot \frac{\partial h(x_i)}{\partial w_{jk}} = \sum_{i=1}^{m} \frac{\partial \left( d_i \ln h(x_i) + (1 - d_i) \ln(1 - h(x_i)) \right)}{\partial h(x_i)} \cdot \frac{\partial h(x_i)}{\partial w_{jk}} = \ldots = \sum_{i=1}^{m} \frac{d_i - h(x_i)}{h(x_i)(1 - h(x_i))} \cdot \frac{\partial h(x_i)}{\partial w_{jk}}$$

and because
$$\frac{\partial h(x_i)}{\partial w_{jk}} = \sigma'(x_i) \, x_{i,jk} = h(x_i)(1 - h(x_i)) \, x_{i,jk}$$

it follows that
$$\frac{\partial G(D,h)}{\partial w_{jk}} = \sum_{i=1}^{m} (d_i - h(x_i)) \, x_{i,jk}$$

Note: Above we denoted by $x_{i,jk}$ the $k$th input to unit $j$ for the $i$th training example, and by $\sigma'$ the derivative of the sigmoid function.
Therefore
$$w_{jk} \leftarrow w_{jk} + \Delta w_{jk} \quad \text{with} \quad \Delta w_{jk} = \eta \frac{\partial G(D,h)}{\partial w_{jk}} = \eta \sum_{i=1}^{m} (d_i - h(x_i)) \, x_{i,jk}$$
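A minimal gradient-ascent sketch for this rule, for a single sigmoid unit (i.e. logistic regression on $G(h, D)$; the names, step count, and leading bias column are assumptions):

```python
import numpy as np

# A single sigmoid unit trained to output h(x) ~ P(f(x) = 1) by gradient
# ascent on G(h, D) (illustrative names and settings).

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_probability(X, d, eta=0.1, n_steps=1000):
    """X: (m, n) inputs with a leading column of ones; d: (m,) labels in {0,1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        h = sigmoid(X @ w)          # h(x_i) for every example
        w += eta * (X.T @ (d - h))  # Dw_jk = eta sum_i (d_i - h(x_i)) x_i,jk
    return w
```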
4.3 Recurrent Networks
• applied to time series data
• can be trained using a version of the Backpropagation algorithm
• see [Mozer, 1995]
An example: [Figure: an example recurrent network architecture]
4.4 Dynamically Modifying the Network Structure
Two ideas:
• Begin with a network with no hidden units, then grow the network until the training error is reduced to some acceptable level.
Example: the Cascade-Correlation algorithm [Fahlman & Lebiere, 1990]
• Begin with a complex network and prune it as you find that certain connections $w$ are inessential.
E.g., see whether $w \simeq 0$, or analyze $\partial E / \partial w$, i.e. the effect that a small variation in $w$ has on the error $E$.
Example: [LeCun et al., 1990]
4.5 Alternative Optimisation Methods for Training ANNs
See Tom Mitchell’s “Machine Learning” book, pag. 119
• line search
• conjugate gradient
4.6 Other Advanced Issues
• Ch. 6:
A Bayesian justification for choosing to minimize the sum of square errors
• Ch. 7:
The estimation of the number of needed training examples to reliably learn boolean functions;
The Vapnik-Chervonenkis dimension of certain types of ANNs
• Ch. 12:
How to use prior knowledge to improve the generalization accuracy of ANNs
5. Expressive Capabilities of [Feed-forward] ANNs
Boolean functions:
• Every boolean function can be represented by a network with a single hidden layer,
but it might require a number of hidden units that is exponential in the number of inputs.
Continuous functions:
• Every bounded continuous function can be approximated with arbitrarily small error, by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989].
• Any function can be approximated to arbitrary accuracy
by a network with two hidden layers [Cybenko 1988].
Summary / What you should know
◦ The gradient descent optimisation method
• The thresholded perceptron;
the training rule, the test rule;
convergence result
• The linear unit and the sigmoid unit;
the gradient descent rule (including the proofs);
convergence result
• Multilayer networks of sigmoid units;
the Backpropagation algorithm
(including the proof for the stochastic version);
convergence result
◦ Batch/offline vs stochastic/incremental gradient descent for artificial neurons and neural networks;
convergence result
◦ Overfitting in neural networks; solutions