• Nu S-Au Găsit Rezultate

Artificial Neural Networks

N/A
N/A
Protected

Academic year: 2022

Share "Artificial Neural Networks"

Copied!
46
0
0
Arată mai multe ( pagini)

Text complet

(1)

Artificial Neural Networks

Based on “Machine Learning”, T. Mitchell, McGRAW Hill, 1997, ch. 4

Acknowledgement:

The present slides are an adaptation of slides drawn by T. Mitchell

(2)

PLAN

1. Introduction

Connectionist models

2 NN examples: ALVINN driving system, face recognition 2. The perceptron; the linear unit; the sigmoid unit

The gradient descent learning rule 3. Multilayer networks of sigmoid units

The Backpropagation algorithm Hidden layer representations Overfitting in NNs

4. Advanced topics

Alternative error functions

Predicting a probability function [Recurrent networks]

[Dynamically modifying the network structure]

[Alternative error minimization procedures]

5. Expressive capabilities of NNs

(3)

Connectionist Models

Consider humans:

• Neuron switching time: .001 sec.

• Number of neurons: 1010

• Connections per neuron: 1045

• Scene recognition time: 0.1 sec.

• 100 inference steps doesn’t seem like enough

→ much parallel computation

Properties of

artificial neural nets

• Many neuron-like threshold switching units

• Many weighted interconnections among units

• Highly parallel, distributed pro- cess

• Emphasis on tuning weights au- tomatically

(4)

A First NN Example:

ALVINN drives at 70 mph on highways

[Pomerleau, 1993]

Sharp Left

Sharp Right

4 Hidden Units

30 Output Units

30x32 Sensor Input Retina Straight

Ahead

(5)

A Second NN Example

Neural Nets for Face Recognition

... ...

left strt rght up

30x32 inputs

Typical input images:

http://www.cs.cmu.edu/∼tom/faces.html

Results:

90% accurate learning head pose, and recognizing 1-of-20 faces

(6)

Learned Weights

... ...

left strt rght up

30x32 inputs

after 1 epoch:

after 100 epochs:

(7)

Design Issues for these two NN Examples

See Tom Mitchell’s “Machine Learning” book, pag. 82-83, and 114 for ALVINN, and

pag. 112-177 for face recognition:

• input encoding

• output encoding

• network graph structure

• learning parameters:

η (learning rate), α (momentum), number of itera-

tions

(8)

When to Consider Neural Networks

• Input is high-dimensional discrete or real-valued (e.g. raw sensor input)

• Output is discrete or real valued

• Output is a vector of values

• Possibly noisy data

• Form of target function is unknown

• Human readability of result is unimportant

(9)

2. The Perceptron

[Rosenblat, 1962]

w1 w2

wn

w0 x1

x2

xn

x0=1

. .

.

Σ

Σ

AAA AAA AAA

wi xi

n

i=0 1 if > 0

-1 otherwise

{

o = Σn wi xi

i=0

o(x

1

, . . . , x

n

) =

1 if w

0

+ w

1

x

1

+ · · · + w

n

x

n

≥ 0

−1 otherwise.

Sometimes we’ll use simpler vector notation:

o(~x) =

1 if w ~ · ~x ≥ 0

−1 otherwise.

(10)

Decision Surface of a Perceptron

x1 x2

+ +

- - +

-

x1 x2

(a) (b)

-

+ -

+

Represents some useful functions

• What weights represent g(x1, x2) = AND(x1, x2)?

But certain examples may not be linearly separable

• Therefore, we’ll want networks of these...

(11)

The Perceptron Training Rule

w

i

← w

i

+ ∆w

i

with ∆w

i

= η(t − o)x

i

or, in vectorial notation:

~

w ← w ~ + ∆ w ~ with ∆ w ~ = η(t − o)~x where:

• t = c(~x) is target value

• o is perceptron output

• η is small positive constant (e.g., .1) called learning rate It will converge (proven by [Minsky & Papert, 1969])

• if the training data is linearly separable

• and η is sufficiently small.

(12)

2 . The Linear Unit

To understand the perceptron’s training rule, consider a (simpler) linear unit, where

o = w0 + w1x1 + · · · + wnxn

Let’s learn wi’s that minimize the squared error

E[w]~ ≡ 1 2

X

dD

(td − od)2

where D is set of training examples.

x

1

x

n

w

2

x

2

w

0

x =1

0

w

1

w

n

.. . Σ

o =

P

n

i=0wixi

The linear unit uses the descent gradient training rule, presented on the next slides.

Remark:

Ch. 6 (Bayesian Learning) shows that the hypothesis h = (w0, w1, . . . , wn) that minimises E is the most probable hypothesis given the training data.

(13)

The Gradient Descent Rule

-1 0

1 2

-2 -1

0 1

2 3

0 5 10 15 20 25

w0 w1

E[w]

Gradient:

∇E[w]~ ≡

∂E

∂w0, ∂E

∂w1,· · · ∂E

∂wn

Training rule:

~

w = w~ + ∆w,~

with ∆w~ = −η∇E[w].~ Therefore,

wi = wi + ∆wi, with ∆wi = −η ∂E

∂wi

.

(14)

The Gradient Descent Rule for the Linear Unit Computation

∂E

∂wi

= ∂

∂wi

1 2

X

d

(td − od)2 = 1 2

X

d

∂wi

(td − od)2

= 1 2

X

d

2(td − od) ∂

∂wi

(td − od) = X

d

(td − od) ∂

∂wi

(td − w~ · x~d)

= X

d

(td − od)(−xi,d)

Therefore

∆wi = η X

d

(td − od)xi,d

(15)

The Gradient Descent Algorithm for the Linear Unit

Gradient-Descent(training examples, η)

Each training example is a pair of the form h~x, ti, where

~x is the vector of input values t is the target output value.

η is the learning rate (e.g., .05).

• Initialize each wi to some small random value

• Until the termination condition is met – Initialize each ∆wi to zero.

– For each h~x, ti in training examples

∗ Input the instance ~x to the unit and compute the output o

∗ For each linear unit weight wi

∆wi ← ∆wi + η(t − o)xi

– For each linear unit weight wi

wi ← wi + ∆wi

(16)

Convergence

[Hertz et al., 1991]

The gradient descent training rule used by the linear unit is guaranteed to converge to a hypothesis with minimum squared error

• given a sufficiently small learning rate η

• even when the training data contains noise

• even when the training data is not separable by H

Note:

If η is too large, the gradient descent search runs the risk of over- stepping the minimum in the error surface rather than settling into it.

For this reason, one common modification of the algorithm is to gradually reduce the value of η as the number of gradient descent steps grows.

(17)

Remark

Gradient descent (and similary, gradient ascent: w ~ ← w ~ + η∇E ) is an important general paradigm for learning. It is a strategy for searching through a large or infinite hypothesis space that can be applied whenever

• the hypothesis space contains continuously parametrized hypotheses

• the error can be differentiated w.r.t. these hypothesis parameters.

Practical difficulties in applying the gradient method:

• if there are multiple local optima in the error surface, then there is no guarantee that the procedure will find the global optimum.

• converging to a local optimum can sometimes be quite slow.

To alleviate these difficulties, a variation called incremental

(or: stochastic) gradient method was proposed.

(18)

Incremental (Stochastic) Gradient Descent

Batch mode Gradient Descent:

Do until satisfied

1. Compute the gradient ∇ED[w]~ 2. w~ ← w~ − η∇ED[w]~

Incremental mode Gradient Descent:

Do until satisfied

• For each training example d in D 1. Compute the gradient ∇Ed[w]~ 2. w~ ← w~ − η∇Ed[w]~

ED[w]~ ≡ 1 2

X

dD

(td − od)2 Ed[w]~ ≡ 1

2(td − od)2

Covergence:

The Incremental Gradient Descent can approximate the Batch Gradient Descent arbitrarily closely if η is made small enough.

(19)

2

′′

. The Sigmoid Unit

w1 w2

wn

w0 x1

x2

xn

x0 = 1

AA AA AA AA

. .

. Σ

net =

Σ

wi xi

i=0 n

1 1 + e-net o = σ(net) =

σ(x) is the sigmoid function

1+e1−x

Nice property: dσ(x)dx = σ(x)(1 − σ(x))

We can derive gradient descent rules to train

• One sigmoid unit

• Multilayer networks of sigmoid units → Backpropagation

(20)

Error Gradient for the Sigmoid Unit

∂E

∂wi

= ∂

∂wi

1 2

X

dD

(td − od)2

= 1 2

X

d

∂wi

(td − od)2

= 1 2

X

d

2(td − od) ∂

∂wi

(td − od)

= X

d

(td − od)

−∂od

∂wi

= −X

d

(td − od) ∂od

∂netd

∂netd

∂wi

where netd = Pn

i=0 wixi,d

But

∂od

∂netd

= ∂σ(netd)

∂netd

= od(1 − od)

∂netd

∂wi

= ∂(w~ · ~xd)

∂wi

= xi,d

So:

∂E

∂wi

= −X

dD

od(1 − od)(td − od)xi,d

(21)

Remark

Instead of gradient descent, one could use linear pro- gramming to find hypothesis consistent with separable data.

[Duda & Hart, 1973] have shown that linear program- ming can be extended to the non-linear separable case.

However, linear programming does not scale to multi-

layer networks, as gradient descent does (see next sec-

tion).

(22)

3. Multilayer Networks of Sigmoid Units

An example

This network was trained to recognize 1 of 10 vowel sounds occurring in the context “h d” (e.g. “head”, “hid”).

The inputs have been obtained from a spectral analysis of sound.

The 10 network outputs correspond to the 10 possible vowel sounds. The net- work prediction is the output whose

value is the highest. F1 F2

head hid who’d hood

... ...

(23)

This plot illustrates the highly non-linear decision surface represented by the learned network.

Points shown on the plot are test examples distinct from the examples used to train the network.

from [Haug & Lippmann, 1988]

(24)

3.1 The Backpropagation Algorithm (Rumelhart et al., 1986)

Formulation for a feed-forward 2-layer network of sigmoid units, the stochastic version

Idea: Gradient descent over the entire vector of network weights.

Initialize all weights to small random numbers.

Until satisfied, // stopping criterion to be (later) defined for each training example,

1. input the training example to the network, and compute the network outputs

2. for each output unit k:

δk ← ok(1 − ok)(tk − ok) 3. for each hidden unit h:

δh ← oh(1 − oh)P

koutputs wkhδk

4. update each network weight wji: wji ← wji + ∆wji where ∆wji = ηδjxji, and xji is the ith input to unit j.

(25)

Derivation of the Backpropagation rule,

(following [Tom Mitchell, 1997], pag. 101–103)

Notations:

xji: the ith input to unit j;

(j could be either hidden or output unit)

wji: the weight associated with the ith input to unit j netj = P

iwjixji

σ: the sigmoid function

oj: the output computed by unit j; (oj = σ(netj))

outputs: the set of units in the final layer of the network Downstream(j): the set of units whose immediate in-

puts include the output of unit j

Ed: the training error on the example d (summing over all of the network output units)

...

...

...

...

. . . . . .

...

...

...

... . ...

. . .. .

...

...

wji j

i

Legend: in magenta color, units be- longing to Downstream(j)

(26)

Preliminaries

Ed(w)~ = 1 2

X

koutputs

(tk ok)2 = 1 2

X

koutputs

(tk σ(netk))2 net j

..

..

..

..

wji

xji

oj

Σ

σ

Common staff for both hidden and output units:

netj = X

i

wjixji

∂Ed

∂wji = ∂Ed

netj

netj

∂wji = ∂Ed

netj

xji

∆wji

def= −η ∂Ed

∂wji = −η ∂Ed

∂netj

xji

netj

..

..

.. σ

..

..

netk ok

..

..

oj

....

wkj

..

x ji

wji Σ

Σ Σ

Σ

Note: In the sequel we will use the notation: δj = − ∂Ed

∂netj

⇒ ∆wji = ηδjxji

(27)

Stage/Case 1: Computing the increments (∆) for output unit weights

∂Ed

netj = ∂Ed

∂oj

∂oj

netj

∂oj

netj = ∂σ(netj)

netj = oj(1oj)

∂Ed

∂oj

=

∂oj

1 2

X

k∈outputs

(tk ok)2

=

∂oj 1

2(tj oj)2 = 1

22(tj oj)(tj oj)

∂oj

= −(tj oj)

net j

..

..

..

..

Σ

σ o j

∂Ed

netj = −(tj oj)oj(1 oj) = −oj(1oj)(tj oj)

δj not.

= ∂Ed

netj = oj(1oj)(tj oj)

∆wji = ηδjxji = ηoj(1 oj)(tj oj)xji

(28)

Stage/Case 2: Computing the increments (∆) for hidden unit weights

∂Ed

netj = X

kDownstream(j)

∂Ed

netk

netk

netj

= X

k∈Downstream(j)

−δk

netk

netj

= X

k∈Downstream(j)

δknetk

∂oj

∂oj

netj

netj

..

..

.. σ

..

..

netk

ok

..

..

oj

....

wkj

..

Σ

Σ Σ

Σ

∂Ed

netj = X

k∈Downstream(j)

δkwkj ∂oj

netj = X

k∈Downstream(j)

δkwkjoj(1 oj)

Therefore:

δj

not= ∂Ed

netj = oj(1 oj) X

k∈Downstream(j)

δkwkj

wji def= η ∂Ed

∂wji

= η ∂Ed

netj

netj

∂wji

= η ∂Ed

netjxji = ηδjxji = η [ oj(1 oj) X

k∈Downstream(j)

δkwkj ] xji

(29)

Convergence of Backpropagation

for NNs of Sigmoid units

Nature of convergence

• The weights are initialized near zero;

therefore, initial decision surfaces are near-linear.

Explanation: oj is of the form σ(w~ ·~x), therefore wji 0 for all i, j;

note that the graph of σ is approximately liniar in the vecinity of 0.

• Increasingly non-linear functions are possible as training progresses

• Will find a local, not necessarily global error minimum.

In practice, often works well (can run multiple times).

(30)

More on Backpropagation

• Easily generalized to arbitrary directed graphs

• Training can take thousands of iterations → slow!

• Often include weight momentum α

∆w

i,j

(n) = ηδ

j

x

ij

+ α∆w

ij

(n − 1) Effect:

– speed up convergence (increase the step size in regions where the gradient is unchanging);

– “keep the ball rolling” through local minima (or along flat regions) in the error surface

• Using network after training is very fast

• Minimizes error over training examples;

Will it generalize well to subsequent examples?

(31)

3.2 Stopping Criteria when Training ANNs and Overfitting

(see Tom Mitchell’s “Machine Learning” book, pag. 108-112)

0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 0.01

0 5000 10000 15000 20000

Error

Number of weight updates Error versus weight updates (example 1)

Training set error Validation set error

0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08

0 1000 2000 3000 4000 5000 6000

Error

Number of weight updates Error versus weight updates (example 2)

Training set error Validation set error

Plots of the error E, as a function of the number of weights updates, for two different robot perception tasks.

(32)

3.3 Learning Hidden Layer Representations An Example: Learning the identity function f (~x) = ~x

Inputs Outputs

Input Output

10000000 → 10000000

01000000 → 01000000

00100000 → 00100000

00010000 → 00010000

00001000 → 00001000

00000100 → 00000100

00000010 → 00000010

00000001 → 00000001

(33)

Learned hidden layer representation:

Input Hidden Output

Values

10000000 → .89 .04 .08 → 10000000 01000000 → .15 .99 .99 → 01000000 00100000 → .01 .97 .27 → 00100000 00010000 → .99 .97 .71 → 00010000 00001000 → .03 .05 .02 → 00001000 00000100 → .01 .11 .88 → 00000100 00000010 → .80 .01 .98 → 00000010 00000001 → .60 .94 .01 → 00000001

After 8000 training epochs, the 3 hidden unit values encode the 8 distinct inputs. Note that if the encoded values are rounded to 0 or 1, the result is the standard binary encoding for 8 distinct values (however not the usual one, i.e. 1 001, 2 010, etc).

(34)

Training (I)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

0 500 1000 1500 2000 2500

Sum of squared errors for each output unit

(35)

Training (II)

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 500 1000 1500 2000 2500

Hidden unit encoding for input 01000000

(36)

Training (III)

-5 -4 -3 -2 -1 0 1 2 3 4

0 500 1000 1500 2000 2500

Weights from inputs to one hidden unit

(37)

4. Advanced Topics

4.1 Alternative Error Functions

(see ML book, pag. 117-118):

• Penalize large weights;

E(w)~ ≡ 1 2

X

dD

X

koutputs

(tkd − okd)2 + γ X

i,j

w2ji

• Train on target slopes as well as values

E(w)~ ≡ 1 2

X

dD

X

koutputs

(tkd − okd)2 + µ X

jinputs

∂tkd

∂xjd − ∂okd

∂xjd

!2

• Tie together weights: e.g., in phoneme recognition network;

• Minimizing the cross entropy (see next 3 slides):

−X

dD

td logod + (1 − td) log(1 − od)

where od, the output of the network, represents the estimated probability that the training instance xd is associated the label (target value) 1.

(38)

4.2 Predicting a probability function:

Learning the ML hypothesis using a NN

(see Tom Mitchell’s “Machine Learning” book, pag. 118, 167-171)

Let us consider a non-deterministic function (LC: one-to-many relation) f : X → {0,1}.

Given a set of independently drawn examples

D = {< x1, d1 >, . . . , < xm, dm >} where di = f(xi) ∈ {0,1},

we would like to learn the probability function g(x) def.= P(f(x) = 1).

The ML hypotesis hM L = argmaxhH P(D | h) in such a setting is hM L = argmaxhH G(h, D)

where G(h, D) = Pm

i=1[di lnh(xi) + (1 − di)ln(1 − h(xi))]

We will use a NN for this task.

For simplicity, a single layer with sigmoidal units is considered.

The training will be done by gradient ascent:

~

w ← w~ + η ∇G(D, h)

(39)

The partial derivative of G(D, h) with respect to wjk, which is the weight for the kth input to unit j, is:

∂G(D, h)

∂wjk

=

m

X

i=1

∂G(D, h)

∂h(xi) · ∂h(xi)

∂wjk

=

m

X

i=1

∂(di lnh(xi) + (1 − di)ln(1 − h(xi)))

∂h(xi) · ∂h(xi)

∂wjk

= . . .

=

m

X

i=1

di − h(xi)

h(xi)(1 − h(xi)) · ∂h(xi)

∂wjk

and because

∂h(xi)

∂wjk

= σ(xi)xi, jk = h(xi)(1 − h(xi))xi, jk it follows that

∂G(D, h)

∂wjk

=

m

X

i=1

(di − h(xi))xi, jk

Note: Here above we denoted xi, jk the kth input to unit j for the ith training example, and σ the derivative of the sigmoid function.

(40)

Therefore

w

jk

← w

jk

+ ∆w

jk

with

∆w

jk

= η ∂G(D, h)

∂w

jk

= η

m

X

i=1

(d

i

− h(x

i

))x

i,jk

(41)

4.3 Recurrent Networks

• applied to time series data

• can be trained using a version of Backpropagation algorithm

• see [Mozer, 1995]

An example:

(42)

4.4 Dynamically Modifying the Network Structure

Two ideas:

• Begin with a network with no hidden unit, then grow the network until the training error is reduced to some accept- able level.

Example: Cascade-Correlation algorithm, [Fahlman &

Lebiere, 1990]

• Begin with a complex network and prune it as you find that certain connections w are inessential.

E.g. see whether w ≃ 0, or analyze

∂E∂w

, i.e. the effect that a small variation in w has on the error E .

Example: [LeCun et al. 1990]

(43)

4.5 Alternative Optimisation Methods for Training ANNs

See Tom Mitchell’s “Machine Learning” book, pag. 119

• linear search

• conjugate gradient

(44)

4.6 Other Advanced Issues

• Ch. 6:

A Bayesian justification for choosing to minimize the sum of square errors

• Ch. 7:

The estimation of the number of needed training examples to reliably learn boolean functions;

The Vapnik-Chervonenkis dimension of certain types of ANNs

• Ch. 12:

How to use prior knowledge to improve the generalization

acuracy of ANNs

(45)

5. Expressive Capabilities of [Feed-forward] ANNs

Boolean functions:

• Every boolean function can be represented by a network with a single hidden layer,

but it might require exponential (in the number of inputs) hidden units.

Continuous functions:

• Every bounded continuous function can be approximated with arbitrarily small error, by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989].

• Any function can be approximated to arbitrary accuracy

by a network with two hidden layers [Cybenko 1988].

(46)

Summary / What you should know

◦ The gradient descent optimisation method

• The thresholded perceptron;

the training rule, the test rule;

convergence result

The linear unit and the sigmoid unit;

the gradient descent rule (including the proofs);

convergence result

• Multilayer networks of sigmoid units;

the Backpropagation algorithm

(including the proof for the stochastic version);

convergence result

◦ Batch/online vs stochastic/incremental gradient descent for artifical neurons and neural networks;

convergence result

◦ Overfitting in neural networks; solutions

Referințe

DOCUMENTE SIMILARE

The residues on a protein surface play a key role in interaction with other molecules, determine many physical properties, and constrain the structure of the folded protein.

The training is given with pre-trained model VGG-16 is used to extract the accurate features of human activity recognition.. VGG-16 is the deep learning model that gives the

Vectorization, Logistic Regression Cost Function, Gradient Descent, Image processing, Neural Networks, Forward Propagation,Backward

Neural Networks (NN) , Random Forest and SVM algorithms.The output of rule-based techniques and machine learning algorithms is evaluated using regular datasets such

In weight updating process the learning rate is used to diminish the function of time, it is relevant to the error rate in each epoch, and also determines how much

It also discusses how Artificial Neural Networks (ANN) provide a platform to perform sentiment analysis in a much easier and therefore less time consuming

learning the entries in the conditional probabilities tables is similar to (learning the weights of hidden units in) training a neural network with hidden units:. − We can learn

3.3 Implementation 29 Keep in mind that the purpose of this experiment was to test if Collaborative Artificial Neural Network, Simple Collaboration Networks to be exact, can be used

For the calibration of the connection between the integral intensity of the band “amide I” and the acetylation degree, five mixes were prepared with different proportions of

Illustration of the neuromuscular anatomy in the inferior axilla (a); sonoanatomy of the muscles at the level of the distal axillary fold (b) and at the lesser tubercle of the

The general problem of association rule mining is: Given a data set of transac- tions where each transaction is a set of items, a minimum support threshold and a minimum

Results obtained reveal the reliability and good predicatively of neural network model for the prediction of the size of Ag-NPs in green method.. (Received May 22, 2013;

1 Department of Physical Medicine and Rehabilitation, National Taiwan University Hospital, Bei-Hu Branch and National Taiwan University College of Medicine, Taipei, Taiwan, 2

Locations of the tibial nerve, popliteal artery, vein (b), and medial sural cutaneous nerve (c), and safe angles for nee- dle insertion (d).. n: tibial nerve, a: popliteal artery,

1. Enlarged spinoglenoid notch veins causing suprascapular nerve compression. Dynamic ultrasonogra- phy of the shoulder. Lafosse L, Tomasi A, Corbett S, Baier G, Willems K,

ductal orifice, and presence of a sphincter-like mecha- nism in the distal 3 cm of the duct [2].In the last four dec- ades, 22 cases with foreign bodies in the submandibular

1 Department of Physical Medicine and Rehabilitation, National Taiwan University Hospital, Bei Hu Branch and National Taiwan University College of Medicine, Taipei, Taiwan,

Transverse (a) and longitudinal (b) transvaginal ultrasound exhibit an isoechoic solid mass measuring 4 cm in size, with mul- tiple intralesional echogenic foci (arrows) and

The diagnostic accuracy of US could be improved in combination with CEUS (65.3% vs 83.7%). The diagnostic accuracy of the GB wall thickening type was higher than the mass forming

The number of vacancies for the doctoral field of Medicine, Dental Medicine and Pharmacy for the academic year 2022/2023, financed from the state budget, are distributed to

- aligners for minor dental changes: small dental corrections after an orthodontic treatment with a fixed appliance/ patients with a stable occlusion but a slightly

The basic concept we used is of transfer learning in neural networks and the final output is to detect the presence or absence of a face mask in a video that is streaming or in

The authors have used a transfer learning on a pre-trained convolutional neural network (CNN) as their first hypothesis, the second hypothesis is based on a conventional