(1)

Bayesian Learning

Based on "Machine Learning", T. Mitchell, McGraw-Hill, 1997, ch. 6

Acknowledgement:

The present slides are an adaptation of slides drawn by T. Mitchell

(2)

Two Roles for the Bayesian Methods in Learning

1. Provides practical learning algorithms,
by combining prior knowledge/probabilities with observed data:

• the Naive Bayes learning algorithm

• the Expectation Maximization (EM) learning algorithm (scheme):
learning in the presence of unobserved variables

• Bayesian Belief Network learning

2. Provides a useful conceptual framework

• Serves for evaluating other learning algorithms, e.g.
concept learning through general-to-specific ordering of hypotheses (Find-S and Candidate Elimination),
neural networks, linear regression

• Provides additional insight into Occam's razor

(3)

1. Basic Notions

Bayes' Theorem
Defining classes of hypotheses:
Maximum A Posteriori Probability (MAP) hypotheses
Maximum Likelihood (ML) hypotheses

2. Learning MAP hypotheses

2.1 The brute-force MAP hypothesis learning algorithm
2.2 The Bayes optimal classifier
2.3 The Gibbs classifier
2.4 The Naive Bayes and the Joint Bayes classifiers
    Example: Learning over text data using Naive Bayes
2.5 The Minimum Description Length (MDL) Principle; MDL hypotheses

3. Learning ML hypotheses

3.1 ML hypotheses in learning real-valued functions
3.2 ML hypotheses in learning to predict probabilities
3.3 The Expectation Maximization (EM) algorithm

4. Bayesian Belief Networks

(4)

1 Basic Notions

• Product Rule:

probability of a conjunction of two events A and B:

P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A)

• Bayes’ Theorem:

P(A|B) = P(B|A) P(A) / P(B)

• Theorem of total probability:

if events A1, . . . , An are mutually exclusive, with Σ_{i=1}^{n} P(Ai) = 1, then

P(B) = Σ_{i=1}^{n} P(B|Ai) P(Ai)

in particular

P(B) = P(B|A)P(A) + P(B|¬A)P(¬A)

(5)

Using Bayes’ Theorem for Hypothesis Learning

P(h | D) = P(D | h) P(h) / P(D)

• P(D) = the (prior) probability of the training data D

• P(h) = the (prior) probability of the hypothesis h

• P(D | h) = the (a posteriori) probability of D given h

• P(h | D) = the (a posteriori) probability of h given D

(6)

Classes of Hypotheses

Maximum Likelihood (ML) hypothesis:
the hypothesis that best explains the training data

h_ML = argmax_{hi ∈ H} P(D | hi)

Maximum A Posteriori Probability (MAP) hypothesis:
the most probable hypothesis given the training data

h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h) / P(D) = argmax_{h ∈ H} P(D|h) P(h)

Note: If P(hi) = P(hj), ∀ i, j, then h_MAP = h_ML

(7)

Exemplifying MAP Hypotheses

Suppose the following data characterize the lab test for cancer-suspect patients:

P(cancer) = 0.008          P(¬cancer) = 0.992
P(+ | cancer) = 0.98       P(− | cancer) = 0.02
P(+ | ¬cancer) = 0.03      P(− | ¬cancer) = 0.97

Hypotheses: h1 = cancer, h2 = ¬cancer; possible observations: D = {+, −}.

Question: Should we diagnose a patient x whose lab result is positive as having cancer?

Answer: No.

Indeed, we have to find argmax{P(cancer|+), P(¬cancer|+)}. Applying Bayes' theorem (for D = {+}):

P(+ | cancer) P(cancer) = 0.98 × 0.008 = 0.0078
P(+ | ¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298

⇒ h_MAP = ¬cancer

(We can infer P(cancer | +) = 0.0078 / (0.0078 + 0.0298) ≈ 21%.)
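A minimal Python sketch of this computation (not part of the original slides); it uses only the probabilities stated above and reproduces, up to rounding, the numbers on the slide:

```python
# Choosing h_MAP for a positive test result, and recovering P(cancer | +).
priors = {"cancer": 0.008, "not_cancer": 0.992}          # P(h)
likelihood_pos = {"cancer": 0.98, "not_cancer": 0.03}    # P(+ | h)

scores = {h: likelihood_pos[h] * priors[h] for h in priors}   # P(+|h) P(h)
h_map = max(scores, key=scores.get)
posterior_cancer = scores["cancer"] / sum(scores.values())    # normalize

print({h: round(s, 4) for h, s in scores.items()})  # {'cancer': 0.0078, 'not_cancer': 0.0298}
print(h_map)                                        # not_cancer
print(round(posterior_cancer, 2))                   # 0.21, i.e. about 21%
```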

(8)

2 Learning MAP Hypothesis

2.1 The Brute Force MAP Hypothesis Learning Algorithm

Training:

Choose the hypothesis with the highest posterior probability:

h_MAP = argmax_{h ∈ H} P(h | D) = argmax_{h ∈ H} P(D | h) P(h)

Testing:

Given x, compute h_MAP(x).

Drawback:

Requires computing all the probabilities P(D | h) and P(h).

(9)

The Most Probable Classification of New Instances

So far we’ve sought h_MAP, the most probable hypothesis given the data D.

Question: Given a new instance x — the classification of which can take any value vj in some set V —, what is its most probable classification?

Answer: P(vj | D) = Σ_{hi ∈ H} P(vj | hi) P(hi | D)

Therefore, the Bayes optimal classification of x is:

argmax_{vj ∈ V} Σ_{hi ∈ H} P(vj | hi) P(hi | D)

Remark: h_MAP(x) is not the most probable classification of x!

(See the next example.)

(10)

The Bayes Optimal Classifier: An Example

Let us consider three possible hypotheses, with posteriors:

P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3

Obviously, h_MAP = h1.

Let's consider an instance x such that h1(x) = +, h2(x) = −, h3(x) = −.

Question: What is the most probable classification of x?

Answer:

P(−|h1) = 0, P(+|h1) = 1
P(−|h2) = 1, P(+|h2) = 0
P(−|h3) = 1, P(+|h3) = 0

Σ_{hi ∈ H} P(+|hi) P(hi|D) = 0.4   and   Σ_{hi ∈ H} P(−|hi) P(hi|D) = 0.6

therefore

argmax_{vj ∈ V} Σ_{hi ∈ H} P(vj|hi) P(hi|D) = −
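A small Python sketch of the Bayes optimal vote for this example (an added illustration, not part of the original slides):

```python
# Weight each class value by the posterior mass of the hypotheses that predict it.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}      # P(hi | D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}     # hi(x); here P(v | hi) is 0 or 1

def bayes_optimal(x_predictions, h_posteriors, values=("+", "-")):
    vote = {v: sum(p for h, p in h_posteriors.items() if x_predictions[h] == v)
            for v in values}
    return max(vote, key=vote.get), vote

label, vote = bayes_optimal(predictions, posteriors)
print(vote)    # {'+': 0.4, '-': 0.6}
print(label)   # '-', even though h_MAP = h1 predicts '+'
```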

(11)

2.3 The Gibbs Classifier

[Opper and Haussler, 1991]

Note: The Bayes optimal classifier provides the best result, but it can be expensive if there are many hypotheses.

Gibbs algorithm:

1. Choose one hypothesis at random, according to the posterior distribution P(h|D).
2. Use this hypothesis to classify the new instance.

Surprising fact [Haussler et al. 1994]:

If the target concept is selected randomly according to the P(h|D) distribution, then the expected error of Gibbs Classifier is no worse than twice the expected error of the Bayes optimal classifier!

E[errorGibbs] ≤ 2E[errorBayesOptimal]
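A sketch of the Gibbs classifier in Python, reusing the hypotheses and posteriors of the previous example (an added illustration, not part of the original slides):

```python
import random

def gibbs_classify(x, hypotheses, posteriors, rng=random):
    # hypotheses: name -> callable h(x); posteriors: name -> P(h|D)
    names = list(hypotheses)
    chosen = rng.choices(names, weights=[posteriors[n] for n in names], k=1)[0]
    return hypotheses[chosen](x)        # classify x with the sampled hypothesis

hypotheses = {"h1": lambda x: "+", "h2": lambda x: "-", "h3": lambda x: "-"}
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
print(gibbs_classify(None, hypotheses, posteriors))   # '+' w.p. 0.4, '-' w.p. 0.6
```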

(12)

2.4 The Naive Bayes Classifier

When to use it:

The target function f takes values from a finite set V = {v1, . . . , vk}

A moderate or large training data set is available

The attributes ⟨a1, . . . , an⟩ that describe the instances are conditionally independent with respect to the given classification:

P(a1, a2, . . . , an | vj) = Π_i P(ai | vj)

The most probable value of f(x) is:

v_MAP = argmax_{vj ∈ V} P(vj | a1, a2, . . . , an) = argmax_{vj ∈ V} P(a1, a2, . . . , an | vj) P(vj) / P(a1, a2, . . . , an)

      = argmax_{vj ∈ V} P(a1, a2, . . . , an | vj) P(vj) = argmax_{vj ∈ V} P(vj) Π_i P(ai | vj)   (notation: v_NB)

This is the so-called decision rule of the Naive Bayes classifier.

(13)

The Joint Bayes Classifier

v_MAP = argmax_{vj ∈ V} P(vj | a1, a2, . . . , an) = . . .

      = argmax_{vj ∈ V} P(a1, a2, . . . , an | vj) P(vj) = argmax_{vj ∈ V} P(a1, a2, . . . , an, vj)   (notation: v_JB)

(14)

The Naive Bayes Classifier: Remarks

1. Along with decision trees, neural networks, and k-nearest neighbours, the Naive Bayes classifier is one of the most practical learning methods.

2. Compared to the previously presented learning algorithms, the Naive Bayes classifier does no search through the hypothesis space; the output hypothesis is simply formed by estimating the parameters P(vj) and P(ai | vj).

(15)

The Naive Bayes Classification Algorithm

Naive_Bayes_Learn(examples)

for each value vj of the output attribute
    P̂(vj) ← estimate P(vj)
    for each value ai of each input attribute a
        P̂(ai | vj) ← estimate P(ai | vj)

Classify_New_Instance(x)

v_NB = argmax_{vj ∈ V} P̂(vj) Π_{ai ∈ x} P̂(ai | vj)
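A compact Python sketch of Naive_Bayes_Learn / Classify_New_Instance for discrete attributes, using simple relative-frequency estimates (the m-estimate smoothing discussed a few slides below is omitted here); an added illustration, not part of the original slides:

```python
from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    # examples: list of (attribute_dict, target_value) pairs
    class_counts = Counter(v for _, v in examples)
    cond_counts = defaultdict(Counter)           # (attribute, vj) -> Counter of values
    for attrs, v in examples:
        for a, val in attrs.items():
            cond_counts[(a, v)][val] += 1
    priors = {v: c / len(examples) for v, c in class_counts.items()}     # P̂(vj)
    def cond_prob(a, val, v):                                            # P̂(ai | vj)
        return cond_counts[(a, v)][val] / class_counts[v]
    return priors, cond_prob

def classify_new_instance(x, priors, cond_prob):
    # v_NB = argmax_vj  P̂(vj) * prod_i P̂(ai | vj)
    def score(v):
        p = priors[v]
        for a, val in x.items():
            p *= cond_prob(a, val, v)
        return p
    return max(priors, key=score)
```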

(16)

The Naive Bayes: An Example

Consider again the PlayTennis example, and the new instance

⟨Outlook = sun, Temp = cool, Humidity = high, Wind = strong⟩

We compute:

v_NB = argmax_{vj ∈ V} P(vj) Π_i P(ai|vj)

P(yes) = 9/14 = 0.64          P(no) = 5/14 = 0.36
. . .
P(strong|yes) = 3/9 = 0.33    P(strong|no) = 3/5 = 0.60

P(yes) P(sun|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.0053
P(no) P(sun|no) P(cool|no) P(high|no) P(strong|no) = 0.0206

→ v_NB = no
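A numeric check of this slide in Python. The two conditional probabilities shown above come from the slide; the remaining ones (for sun, cool and high) are the standard PlayTennis estimates from Mitchell, ch. 6, filled in here as an assumption:

```python
p_yes, p_no = 9/14, 5/14
cond_yes = {"sun": 2/9, "cool": 3/9, "high": 3/9, "strong": 3/9}   # assumed counts
cond_no  = {"sun": 3/5, "cool": 1/5, "high": 4/5, "strong": 3/5}   # assumed counts

score_yes, score_no = p_yes, p_no
for a in ("sun", "cool", "high", "strong"):
    score_yes *= cond_yes[a]
    score_no *= cond_no[a]

print(round(score_yes, 4), round(score_no, 4))    # 0.0053 0.0206
print("yes" if score_yes > score_no else "no")    # no
```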

(17)

A Note on The Conditional Independence Assumption of Attributes

P(a1, a2, . . . , an | vj) = Π_i P(ai | vj)

It is often violated in practice... but it works surprisingly well anyway.

Note that we don't need the estimated posteriors P̂(vj | x) to be correct; we only need that

argmax_{vj ∈ V} P̂(vj) Π_i P̂(ai | vj) = argmax_{vj ∈ V} P(vj) P(a1, . . . , an | vj)

[Domingos & Pazzani, 1996] analyses this phenomenon.

(18)

Naive Bayes Classification:

The problem of unseen data

What if none of the training instances with target value vj have the attribute value ai?

It follows that P̂(ai|vj) = 0, and therefore P̂(vj) Π_i P̂(ai|vj) = 0.

The typical solution is to (re)define P̂(ai|vj), for each value ai of the attribute a, using the m-estimate:

P̂(ai | vj) ← (nc + m·p) / (n + m), where

• n is the number of training examples for which v = vj,
• nc is the number of examples for which v = vj and a = ai,
• p is a prior estimate for P̂(ai|vj)
  (for instance, if the attribute a has k values, then p = 1/k),
• m is a weight given to that prior estimate (i.e., the number of "virtual" examples).
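A one-line Python sketch of the m-estimate, with the notation of this slide (the example numbers below are illustrative only):

```python
def m_estimate(nc, n, p, m):
    """nc: examples with v = vj and a = ai;  n: examples with v = vj;
    p: prior estimate (e.g. 1/k for an attribute with k values);  m: weight."""
    return (nc + m * p) / (n + m)

# e.g. an attribute value never seen among 5 examples of class vj, k = 3 values:
print(m_estimate(nc=0, n=5, p=1/3, m=3))   # 0.125 instead of 0
```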

(19)

Using the Naive Bayes Learner:

Learning to Classify Text

• Learn which news articles are of interest
  Target concept Interesting? : Document → {+, −}

• Learn to classify web pages by topic
  Target concept Category : Document → {c1, . . . , cn}

Naive Bayes is among the most effective algorithms.

(20)

Learning to Classify Text: Main Design Issues

1. Represent each document by a vector of words
   • one attribute per word position in the document

2. Learning:
   • use the training examples to estimate P(+), P(−), P(doc|+), P(doc|−)
   • Naive Bayes conditional independence assumption:

     P(doc|vj) = Π_{i=1}^{length(doc)} P(ai = wk | vj)

     where P(ai = wk|vj) is the probability that the word in position i is wk, given vj

   • Make one more assumption:

     ∀ i, m:  P(ai = wk|vj) = P(am = wk|vj) = P(wk|vj)

     i.e., the attributes are not only independent but also identically distributed

(21)

Learn_naive_Bayes_text(Examples, Vocabulary)

1. Collect all words and other tokens that occur in Examples
   Vocabulary ← all distinct words and other tokens in Examples

2. Calculate the required P(vj) and P(wk|vj) probability terms
   For each target value vj in V
     docs_j ← the subset of Examples for which the target value is vj
     P(vj) ← |docs_j| / |Examples|
     Text_j ← a single document created by concatenating all members of docs_j
     n ← the total number of words in Text_j
     For each word wk in Vocabulary
       n_k ← the number of times the word wk occurs in Text_j
       P(wk | vj) ← (n_k + 1) / (n + |Vocabulary|)     (here we use the m-estimate)
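A Python sketch of Learn_naive_Bayes_text; the bag-of-words representation and the smoothing follow the pseudocode above (an added illustration, not part of the original slides):

```python
from collections import Counter

def learn_naive_bayes_text(examples):
    # examples: list of (list_of_tokens, target_value) pairs
    vocabulary = {w for doc, _ in examples for w in doc}
    priors, word_probs = {}, {}
    for v in {t for _, t in examples}:
        docs_v = [doc for doc, t in examples if t == v]
        priors[v] = len(docs_v) / len(examples)                 # P(vj)
        text_v = [w for doc in docs_v for w in doc]             # Text_j
        n, counts = len(text_v), Counter(text_v)
        word_probs[v] = {w: (counts[w] + 1) / (n + len(vocabulary))   # P(wk | vj)
                         for w in vocabulary}
    return vocabulary, priors, word_probs
```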

(22)

Classify_naive_Bayes_text(Doc)

positions ← all word positions in Doc that contain tokens from Vocabulary

Return v_NB = argmax_{vj ∈ V} P(vj) Π_{i ∈ positions} P(ai = wk | vj)
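A matching Python sketch of Classify_naive_Bayes_text; log-probabilities replace the raw product to avoid floating-point underflow on long documents (again an added illustration):

```python
import math

def classify_naive_bayes_text(doc, vocabulary, priors, word_probs):
    positions = [w for w in doc if w in vocabulary]
    scores = {v: math.log(priors[v]) +
                 sum(math.log(word_probs[v][w]) for w in positions)
              for v in priors}
    return max(scores, key=scores.get)

# toy usage with the learner sketched above (hypothetical two-document corpus):
examples = [("good great fun".split(), "+"), ("bad boring bad".split(), "-")]
vocab, priors, wp = learn_naive_bayes_text(examples)
print(classify_naive_bayes_text("great fun fun".split(), vocab, priors, wp))  # '+'
```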

(23)

Given 1000 training documents from each of the 20 newsgroups, learn to classify new documents according to the newsgroup they came from:

comp.graphics               misc.forsale
comp.os.ms-windows.misc     rec.autos
comp.sys.ibm.pc.hardware    rec.motorcycles
comp.sys.mac.hardware       rec.sport.baseball
comp.windows.x              rec.sport.hockey
alt.atheism                 sci.space
soc.religion.christian      sci.crypt
talk.religion.misc          sci.electronics
talk.politics.mideast       sci.med
talk.politics.misc          talk.politics.guns

Naive Bayes: 89% classification accuracy (having used 2/3 of each group for training, and having eliminated rare words and the 100 most frequent words).

(24)

Learning Curve for 20 Newsgroups

[Figure: learning curve on the 20News data — classification accuracy (0–100%) vs. training set size (100 to 10000 documents), comparing Bayes, TFIDF, and PRTFIDF.]

(25)

2.5 The Minimum Description Length Principle

Occam's razor: prefer the shortest hypothesis
Bayes analysis: prefer the hypothesis h_MAP

h_MAP = argmax_{h ∈ H} P(D|h) P(h) = argmax_{h ∈ H} (log2 P(D|h) + log2 P(h))
      = argmin_{h ∈ H} (−log2 P(D|h) − log2 P(h))

Interesting fact from information theory:
The optimal (shortest expected coding length) code for an event with probability p uses −log2 p bits.

So we can interpret:
−log2 P(h): the length of h under the optimal code
−log2 P(D|h): the length of D given h under the optimal code

Therefore we prefer the hypothesis h that minimizes...

(26)

Bayes Analysis and the MDL Principle

We saw that a MAP learner prefers the hypothesis h that minimizes L_C1(h) + L_C2(D|h), where L_C(x) is the description length of x under the encoding C:

h_MDL = argmin_{h ∈ H} (L_C1(h) + L_C2(D|h))

Example: H = decision trees, D = training data labels

• L_C1(h) is the number of bits needed to describe the tree h
• L_C2(D|h) is the number of bits needed to describe D given h

In the literature, applications of MDL to practical problems often include arguments justifying the choice of the encodings C1 and C2.
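An illustrative Python sketch of the MDL trade-off. The encodings below (a fixed number of bits per tree node, and an index plus corrected label per misclassified example) are hypothetical choices, only meant to show the argmin structure of h_MDL:

```python
import math

def description_length(tree_size, errors, n_examples, bits_per_node=10.0):
    l_c1 = tree_size * bits_per_node                       # L_C1(h): encode the tree
    l_c2 = errors * (math.log2(max(n_examples, 2)) + 1)    # L_C2(D|h): encode exceptions
    return l_c1 + l_c2

candidates = {"big_perfect_tree": (40, 0), "small_tree": (7, 3)}   # (size, errors)
dl = {h: description_length(size, errs, n_examples=100)
      for h, (size, errs) in candidates.items()}
print({h: round(v, 1) for h, v in dl.items()})   # the small, slightly imperfect tree wins
print(min(dl, key=dl.get))                       # 'small_tree'
```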

(27)

For instance:

L_C2(D|h) = 0 if the examples are classified perfectly by h, and both the transmitter and the receiver know h.

Therefore, in this situation we only need to describe the exceptions. So:

h_MDL = argmin_{h ∈ H} (length(h) + length(misclassifications))

In general, MDL trades off hypothesis size against training errors: it might select a shorter hypothesis that makes a few errors over a longer hypothesis that perfectly classifies the data!

Consequence: In learning (for instance) decision trees, the MDL principle can work as an alternative to pruning.

(28)

The MDL Principle: Back to Occam's Razor

MDL hypotheses are not necessarily also the best/MAP ones.

(For that, we would have to know all the probabilities P(D | h) and P(h).)

(29)

3 Learning Maximum Likelihood (ML) Hypotheses

3.1 Learning Real-Valued Functions:
ML Hypotheses as Least-Squared-Error Hypotheses

[Figure: a real-valued target function f, noisy observations y, and the ML hypothesis h_ML fitted to them.]

Problem: Consider learning a real-valued target function f : X → R from D, a training set consisting of examples ⟨xi, di⟩, i = 1, . . . , m, with

• xi assumed fixed (to simplify),
• di a noisy training value: di = f(xi) + ei,
• ei a random variable (noise) drawn independently for each xi, according to some Gaussian distribution with mean 0.

(30)

Proposition

Considering H, a certain class of functions h : X → R such that h(xi) = f(xi), and assuming that the xi are mutually independent given h, the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors:

h_ML := argmax_{h ∈ H} P(D|h) = argmin_{h ∈ H} Σ_{i=1}^{m} (di − h(xi))²
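A small sketch in Python (numpy assumed available, synthetic data) showing the proposition in action for a simple hypothesis class H = {h(x) = w·x + b}: under Gaussian noise, the ML hypothesis is the least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
d = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.shape)    # di = f(xi) + ei

# minimize the sum of squared errors in closed form (normal equations)
A = np.column_stack([x, np.ones_like(x)])
w_ls, b_ls = np.linalg.lstsq(A, d, rcond=None)[0]

sse = np.sum((d - (w_ls * x + b_ls)) ** 2)
print(w_ls, b_ls)   # close to the true parameters (2.0, 1.0)
print(sse)          # the minimal sum of squared errors over H
```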

(31)

Note: We will use the probability density function

p(x0) := lim_{ε→0} (1/ε) P(x0 ≤ x < x0 + ε)

h_ML = argmax_{h ∈ H} P(D|h) = argmax_{h ∈ H} Π_{i=1}^{m} p(di|h)

     = argmax_{h ∈ H} Π_{i=1}^{m} p(ei|h)                            (since µi = f(xi))

     = argmax_{h ∈ H} Π_{i=1}^{m} p(di − f(xi)|h)

     = argmax_{h ∈ H} Π_{i=1}^{m} p(di − h(xi)|h)                    (since h(xi) = f(xi))

     = argmax_{h ∈ H} Π_{i=1}^{m} (1/√(2πσ²)) e^{−(1/2)((di − h(xi))/σ)²}

     = argmax_{h ∈ H} Σ_{i=1}^{m} [ ln(1/√(2πσ²)) − (1/2)((di − h(xi))/σ)² ]

     = argmax_{h ∈ H} Σ_{i=1}^{m} −(1/2)((di − h(xi))/σ)²

     = argmax_{h ∈ H} Σ_{i=1}^{m} −(di − h(xi))²

     = argmin_{h ∈ H} Σ_{i=1}^{m} (di − h(xi))²

(32)

Generalisations...

1. Similar derivations can be performed starting from other assumed noise distributions (than the Gaussian), producing different results.

2. It was assumed that
   a. the noise affects only f(xi), and
   b. no noise was recorded in the attribute values of the given examples xi.

Otherwise, the analysis becomes significantly more complex.

(33)

3.2 ML Hypotheses for Learning Probability Functions

Let us consider a non-deterministic function (i.e., a one-to-many relation) f : X → {0, 1}.

Given a set of independently drawn examples D = {⟨x1, d1⟩, . . . , ⟨xm, dm⟩} where di = f(xi) ∈ {0, 1},
we would like to learn an ML hypothesis for the probability function g(x) := P(f(x) = 1).

For example, h(xi) = 0.92 if P(f(xi) = 1) = 0.92.

Proposition: In this setting, h_ML = argmax_{h ∈ H} P(D | h) maximizes the sum

Σ_{i=1}^{m} [di ln h(xi) + (1 − di) ln(1 − h(xi))]

Proof:

P(D | h) = Π_{i=1}^{m} P(xi, di | h) = Π_{i=1}^{m} P(di | xi, h) · P(xi | h)

It can be assumed that xi is independent of h, therefore:

P(D | h) = Π_{i=1}^{m} P(di | xi, h) · P(xi)

(34)

Proof (continued):

What we want to compute is h(xi) = P(di = 1 | xi, h). In a more general form:

P(di | xi, h) = h(xi)       if di = 1
                1 − h(xi)   if di = 0

In a more convenient mathematical form: P(di | xi, h) = h(xi)^di (1 − h(xi))^(1−di).

⇒ h_ML = argmax_{h ∈ H} Π_{i=1}^{m} [h(xi)^di (1 − h(xi))^(1−di) P(xi)]

       = argmax_{h ∈ H} Π_{i=1}^{m} h(xi)^di (1 − h(xi))^(1−di) · Π_{i=1}^{m} P(xi)

       = argmax_{h ∈ H} Π_{i=1}^{m} h(xi)^di (1 − h(xi))^(1−di)

       = argmax_{h ∈ H} Σ_{i=1}^{m} [di ln h(xi) + (1 − di) ln(1 − h(xi))]

Note: The quantity −Σ_{i=1}^{m} [di ln h(xi) + (1 − di) ln(1 − h(xi))] is called cross-entropy; the above h_ML minimizes this quantity.
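A tiny Python sketch of the cross-entropy quantity just defined; the predictions below are illustrative values, not from the slides:

```python
import math

def cross_entropy(h_values, d_values):
    # h_values[i] = h(xi) in (0,1);  d_values[i] = di in {0,1}
    return -sum(d * math.log(h) + (1 - d) * math.log(1 - h)
                for h, d in zip(h_values, d_values))

d = [1, 0, 1, 1]
h_good = [0.9, 0.2, 0.8, 0.7]
h_uninformed = [0.5, 0.5, 0.5, 0.5]
# the hypothesis with smaller cross-entropy has the larger likelihood P(D|h)
print(cross_entropy(h_good, d) < cross_entropy(h_uninformed, d))   # True
```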

(35)

3.3 The Expectation Maximization (EM) Algorithm

[Dempster et al., 1977]

Find (local) maximum likelihood hypotheses when the data is only partially observable:

• Unsupervised learning (i.e., clustering): the target value is unobservable

• Supervised learning: some instance attributes are unobservable

Some applications:

• Non-hierarchical clustering: estimate the means of k Gaussians

• Learning Hidden Markov Models

• Learning Probabilistic Context-Free Grammars

• Training Radial Basis Function Networks

• Training Bayesian Belief Networks

(36)

The General EM Problem

Given

• observed data X = {x1, . . . , xm}, independently generated using the parameterized distributions/hypotheses h1, . . . , hm,

• unobserved data Z = {z1, . . . , zm},

determine ĥ that (locally) maximizes P(X | h).

(37)

The Essence of the EM Approach

Start with h(0), an arbitrarily/conveniently chosen value of h.

Repeatedly:

1. Use the observed data X and the current hypothesis h(t) to estimate [the probabilities associated with the values of] the unobserved variables Z, and further compute their expectations E[Z].

2. The expected values of the unobserved variables Z are used to calculate an improved hypothesis h(t+1), based on maximizing the mean of a log-likelihood function, E[ln P(Y | h) | X, h(t)], where Y = {y1, . . . , ym} is the complete (observed and unobserved) data, i.e. yi = (xi, zi) for i = 1, . . . , m.

(38)

The General EM Algorithm

Repeat the following two steps until convergence is reached:

Estimation (E) step:
Calculate the log-likelihood function

Q(h | h(t)) := E[ln P(Y | h) | X, h(t)],  where Y = X ∪ Z.

Maximization (M) step:
Replace the hypothesis h(t) by the hypothesis h(t+1) that maximizes this Q function:

h(t+1) ← argmax_h Q(h | h(t))

(39)

The EM Algorithmic Schema

Idea: replace missing values by estimated values.

• initialize the parameters with arbitrary values
• estimate the missing values based on the current parameter values
• re-estimate the parameters using the complete data
• repeat the previous two steps until convergence

[Diagram: starting at t = 0 with initialized h(t), the E step computes E[Z | X, h(t)] using P(Z | X; h(t)); the M step sets h(t+1) = argmax_h E[ln P(X, Z | h)]; then t is incremented, monitoring ln P(X | h).]

(40)

Methodology: How to apply an instance of the EM schema

[Diagram: start by initializing the parameters arbitrarily (t = 0); E step: apply updating rules (1) to compute E[Z | X, h(t)] from P(Z | X; h(t)); M step: apply updating rules (2) to obtain h(t+1) = argmax_h E[ln P(X, Z | h)]; increment t and repeat.]

(41)

Methodology: How to derive an instance of the EM schema

[Diagram: inputs are the hypothesis/parameters h(t), the "observable" variables/data X, and the "unobservable"/latent ("auxiliary") variables Z. Steps: (1) compute the log-likelihood of the "complete" data; (2) apply the linearity of expectation to obtain the Q(h | h(t)) function; (3) apply Bayes' formula, together with the total probability formula (and sometimes the exponentiation trick), to compute P(Z | X; h(t)) for the E step, E[Z | X, h(t)]; (4) solve the resulting optimization problem (usually using partial derivatives, and sometimes Lagrange's method) for the M step, h(t+1) = argmax_h E[ln P(X, Z | h)]. Then increment t, monitoring ln P(X | h), the log-likelihood of the "observable" data.]

(42)

Baum-Welch Theorem

When Q is continuous, it can be shown that EM converges to a stationary point (local maximum) of the likelihood function P(X | h).

(43)

Two mixtures of Bernoulli distributions

[Diagram, left: a latent Z ~ Bernoulli(π) selects between two components; the observation is X ~ Bernoulli(θ) or X ~ Bernoulli(θ/2) accordingly, so overall X ~ π·Bernoulli(θ) + (1−π)·Bernoulli(θ/2).
Diagram, right: Z ~ Bernoulli(π) selects between X ~ Bernoulli(p) and X ~ Bernoulli(q), so overall X ~ π·Bernoulli(p) + (1−π)·Bernoulli(q).]

(44)

EM Algorithm for Solving a Mixture of Vectors of Independent and Identically Distributed (i.i.d.) Bernoulli Variables

[Diagram: a latent Z ~ Bernoulli(π) selects component A or B; given the component, the coordinates X1, . . . , X10 are i.i.d. Bernoulli(θA), respectively Bernoulli(θB).]

X = (x1, . . . , x10) ~ π·Bernoulli(x1; θA)·. . .·Bernoulli(x10; θA) + (1−π)·Bernoulli(x1; θB)·. . .·Bernoulli(x10; θB)

Cf. "What is the expectation maximization algorithm?", Chuong B. Do, Serafim Batzoglou, Nature Biotechnology, vol. 26, no. 8, 2008, pp. 897-899.

(45)

Initialization:
assign arbitrary values θA(0), θB(0) in the interval (0, 1) to the parameters θA and θB, respectively;

Iterative body:
for t = 0, . . . , T − 1 (with T fixed in advance)
(or: until the log-likelihood of the observable data no longer increases significantly),
(or: until |θA(t) − θA(t+1)| < ε and |θB(t) − θB(t+1)| < ε, with ε fixed in advance)
do

E step: for i = 1, . . . , 5 compute

p_{i,A}(t+1) = (θA(t))^{xi} (1 − θA(t))^{10−xi} / [ (θA(t))^{xi} (1 − θA(t))^{10−xi} + (θB(t))^{xi} (1 − θB(t))^{10−xi} ]

p_{i,B}(t+1) = (θB(t))^{xi} (1 − θB(t))^{10−xi} / [ (θA(t))^{xi} (1 − θA(t))^{10−xi} + (θB(t))^{xi} (1 − θB(t))^{10−xi} ];

M step:

θA(t+1) = Σ_{i=1}^{5} (xi/10) p_{i,A}(t+1) / Σ_{i=1}^{5} p_{i,A}(t+1)    and    θB(t+1) = Σ_{i=1}^{5} (xi/10) p_{i,B}(t+1) / Σ_{i=1}^{5} p_{i,B}(t+1);

Return θA(t+1), θB(t+1).
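A direct Python implementation of the two-coin EM above (cf. Do & Batzoglou, 2008). Here x[i] is the number of heads observed in the 10 tosses of trial i; the mixing weight is kept fixed at 1/2, as in the E-step formulas of the slide:

```python
def em_two_coins(x, theta_A=0.6, theta_B=0.5, n_tosses=10, T=50):
    for _ in range(T):
        # E step: responsibility of each coin for each trial
        sA = sB = wA = wB = 0.0
        for xi in x:
            lA = theta_A ** xi * (1 - theta_A) ** (n_tosses - xi)
            lB = theta_B ** xi * (1 - theta_B) ** (n_tosses - xi)
            p_iA = lA / (lA + lB)
            p_iB = 1.0 - p_iA
            # accumulate the statistics needed by the M step
            sA += (xi / n_tosses) * p_iA
            sB += (xi / n_tosses) * p_iB
            wA += p_iA
            wB += p_iB
        # M step
        theta_A, theta_B = sA / wA, sB / wB
    return theta_A, theta_B

# the five trials used by Do & Batzoglou (head counts out of 10 tosses each)
x = [5, 9, 8, 4, 7]
print(em_two_coins(x))   # roughly (0.80, 0.52) with this initialization
```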

(46)

A Mixture of Vectors of Independent Bernoulli Distributions, Applied to Clustering of Hand-Written Digits (MNIST)

[Diagram: a latent Z ~ Categorical(π1, . . . , πK) selects one of K components; given Z = k, the coordinates X_{k,1}, . . . , X_{k,D} are independent Bernoulli(p_{k,i}) variables. Cf. CMU, 2015 fall, A. Smola, B. Poczos, HW2, pr. 1.]

X = (x1, . . . , xD) ~ π1·Bernoulli(x1; p_{1,1})·. . .·Bernoulli(xD; p_{1,D}) + . . . + πK·Bernoulli(x1; p_{K,1})·. . .·Bernoulli(xD; p_{K,D})

(47)

4 Bayesian Belief Networks

(also called Bayes Nets)

Interesting because:

• The Naive Bayes assumption of conditional independence of attributes is too restrictive.
  (But learning is intractable without some such assumptions...)

• Bayesian Belief Networks describe conditional independence among subsets of variables.

• They allow combining prior knowledge about (in)dependencies among variables with the observed training data.

(48)

Conditional Independence

Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given a value of Z:

(∀ xi, yj, zk)  P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)

More compactly, we write P(X | Y, Z) = P(X | Z).

Note: Naive Bayes uses conditional independence to justify

P(A1, A2 | V) = P(A1 | A2, V) P(A2 | V) = P(A1 | V) P(A2 | V)

Generalizing the above definition:

P(X1 . . . Xl | Y1 . . . Ym, Z1 . . . Zn) = P(X1 . . . Xl | Z1 . . . Zn)

(49)

A Bayes Net

[Network structure (cf. Mitchell, Fig. 6.3): Storm and BusTourGroup are the parents of Campfire; Storm is the parent of Lightning; Lightning is the parent of Thunder; Storm, Lightning and Campfire are the parents of ForestFire.]

Conditional probability table for Campfire:

        S,B    S,¬B   ¬S,B   ¬S,¬B
  C     0.4    0.1    0.8    0.2
 ¬C     0.6    0.9    0.2    0.8

The network is defined by:

• A directed acyclic graph, representing a set of conditional independence assertions:
  Each node — representing a random variable — is asserted to be conditionally independent of its nondescendants, given its immediate predecessors.
  Example: P(Thunder | ForestFire, Lightning) = P(Thunder | Lightning)

• A table of local conditional probabilities for each node/variable.

(50)

A Bayes Net (Cont’d)

A Bayes net represents the joint probability distribution over all the variables Y1, Y2, . . . , Yn.

This joint distribution is fully defined by the graph, plus the local conditional probabilities:

P(y1, . . . , yn) = P(Y1 = y1, . . . , Yn = yn) = Π_{i=1}^{n} P(yi | Parents(Yi))

where Parents(Yi) denotes the immediate predecessors of Yi in the graph.

In our example: P(Storm, BusTourGroup, . . . , ForestFire)
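A Python sketch of this product formula. The Campfire table is the one shown two slides above; the priors for Storm and BusTourGroup are hypothetical values, added only so the example runs:

```python
def joint_probability(assignment, parents, cpt):
    # P(y1,...,yn) = prod_i P(yi | Parents(Yi))
    p = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[u] for u in parents[var])
        p *= cpt[var][parent_values][value]
    return p

parents = {"Storm": (), "BusTourGroup": (), "Campfire": ("Storm", "BusTourGroup")}
cpt = {
    "Storm":        {(): {True: 0.3, False: 0.7}},     # hypothetical prior
    "BusTourGroup": {(): {True: 0.5, False: 0.5}},     # hypothetical prior
    "Campfire": {                                      # table from the earlier slide
        (True, True): {True: 0.4, False: 0.6},
        (True, False): {True: 0.1, False: 0.9},
        (False, True): {True: 0.8, False: 0.2},
        (False, False): {True: 0.2, False: 0.8},
    },
}
p = joint_probability({"Storm": True, "BusTourGroup": False, "Campfire": False},
                      parents, cpt)
print(round(p, 3))   # 0.3 * 0.5 * 0.9 = 0.135
```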

(51)

Inference in Bayesian Nets

Question: Given a Bayes net, can one infer the probabilities of the values of one or more network variables, given the observed values of (some) others?

Example: Given the Bayes net below, compute: (a) P(S), (b) P(A, S), (c) P(A).

[Network: L and F are the parents of S; S is the parent of A and of G.]

P(L) = 0.4        P(F) = 0.6
P(S|L,F) = 0.8    P(S|L,¬F) = 0.6    P(S|¬L,F) = 0.5    P(S|¬L,¬F) = 0.3
P(A|S) = 0.7      P(A|¬S) = 0.3
P(G|S) = 0.8      P(G|¬S) = 0.2
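A brute-force Python sketch answering (a)-(c) by enumerating the full joint distribution, using only the probabilities written on this slide (G is omitted, since it is marginalized out in these queries anyway):

```python
from itertools import product

P_L, P_F = 0.4, 0.6
P_S = {(True, True): 0.8, (True, False): 0.6, (False, True): 0.5, (False, False): 0.3}
P_A = {True: 0.7, False: 0.3}          # P(A | S)

def joint(l, f, s, a):
    p = (P_L if l else 1 - P_L) * (P_F if f else 1 - P_F)
    p *= P_S[(l, f)] if s else 1 - P_S[(l, f)]
    p *= P_A[s] if a else 1 - P_A[s]
    return p

def prob(event):     # event: observed values, e.g. {"S": True}
    total = 0.0
    for l, f, s, a in product([True, False], repeat=4):
        values = {"L": l, "F": f, "S": s, "A": a}
        if all(values[k] == v for k, v in event.items()):
            total += joint(l, f, s, a)
    return total

print(round(prob({"S": True}), 3))              # (a) P(S)   = 0.54
print(round(prob({"A": True, "S": True}), 3))   # (b) P(A,S) = 0.378
print(round(prob({"A": True}), 3))              # (c) P(A)   = 0.516
```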

(52)

Inference in Bayesian Nets (Cont’d)

Answer(s):

• If only one variable has an unknown (probability) value, then it is easy to infer it.

• In the general case, we can compute the probability distribution for any subset of the network variables, given the distribution for any subset of the remaining variables.

But...

• The exact inference of probabilities for an arbitrary Bayes net is an NP-hard problem!

(53)

Inference in Bayesian Nets (Cont’d)

In practice, we can succeed in many cases:

• Exact inference methods work well for some net structures.

• Monte Carlo methods "simulate" the network randomly to calculate approximate solutions [Pradhan & Dagum, 1996].

(In theory even approximate inference of probabilities in

Bayes Nets can be NP-hard!! [ Dagum & Luby, 1993])

(54)

Learning Bayes Nets (I)

There are several variants of this learning task

• The network structure might be either known or unknown (i.e., it has to be inferred from the training data).

• The training examples might provide values of all network variables, or just for some of them.

The simplest case:

If the structure is known and we can observe the values of all variables,

then it is easy to estimate the conditional probability table

entries (analogous to training a Naive Bayes classifier).

(55)

Learning Bayes Nets (II)

When

• the structure of the Bayes Net is known, and

• the variables are only partially observable in the training data

learning the entries in the conditional probability tables is similar to (learning the weights of the hidden units in) training a neural network with hidden units:

− We can learn the net's conditional probability tables using gradient ascent!

− This converges to the network h that (locally) maximizes P(D | h).

(56)

Gradient Ascent for Bayes Nets

Let w_ijk denote one entry in the conditional probability table for the variable Yi in the network:

w_ijk = P(Yi = y_ij | Parents(Yi) = the list u_ik of values)

It can be shown (see the next two slides) that

∂ ln P_h(D) / ∂w_ijk = Σ_{d ∈ D} P_h(y_ij, u_ik | d) / w_ijk

therefore we can perform gradient ascent by repeatedly:

1. updating all w_ijk using the training data D:

   w_ijk ← w_ijk + η Σ_{d ∈ D} P_h(y_ij, u_ik | d) / w_ijk

2. renormalizing the w_ijk to ensure that Σ_j w_ijk = 1 and 0 ≤ w_ijk ≤ 1.

(57)

Gradient Ascent for Bayes Nets: Calculus

∂ ln P_h(D) / ∂w_ijk = ∂/∂w_ijk ln Π_{d ∈ D} P_h(d) = Σ_{d ∈ D} ∂ ln P_h(d) / ∂w_ijk = Σ_{d ∈ D} (1 / P_h(d)) · ∂P_h(d) / ∂w_ijk

Summing over all values y_ij′ of Yi, and u_ik′ of Ui = Parents(Yi):

∂ ln P_h(D) / ∂w_ijk = Σ_{d ∈ D} (1 / P_h(d)) · ∂/∂w_ijk Σ_{j′,k′} P_h(d | y_ij′, u_ik′) P_h(y_ij′, u_ik′)

                     = Σ_{d ∈ D} (1 / P_h(d)) · ∂/∂w_ijk Σ_{j′,k′} P_h(d | y_ij′, u_ik′) P_h(y_ij′ | u_ik′) P_h(u_ik′)

Note that w_ijk ≡ P_h(y_ij | u_ik), therefore...

(58)

Gradient Ascent for Bayes Nets: Calculus (Cont’d)

∂ ln P_h(D) / ∂w_ijk = Σ_{d ∈ D} (1 / P_h(d)) · ∂/∂w_ijk [ P_h(d | y_ij, u_ik) w_ijk P_h(u_ik) ]

   = Σ_{d ∈ D} (1 / P_h(d)) · P_h(d | y_ij, u_ik) P_h(u_ik)

   = Σ_{d ∈ D} (1 / P_h(d)) · P_h(y_ij, u_ik | d) P_h(d) P_h(u_ik) / P_h(y_ij, u_ik)      (applying Bayes' theorem)

   = Σ_{d ∈ D} P_h(y_ij, u_ik | d) P_h(u_ik) / P_h(y_ij, u_ik)

   = Σ_{d ∈ D} P_h(y_ij, u_ik | d) / P_h(y_ij | u_ik)

   = Σ_{d ∈ D} P_h(y_ij, u_ik | d) / w_ijk

(59)

Learning Bayes Nets (II, Cont’d)

The EM algorithm can also be used. Repeatedly:

1. Calculate/estimate from the data the probabilities of the unobserved variables w_ijk, assuming that the hypothesis h holds.

2. Calculate a new h (i.e., new values of the w_ijk) so as to maximize E[ln P(D | h)], where D now includes both the observed and the unobserved variables.

(60)

Learning Bayes Nets (III)

When the structure is unknown, algorithms usually use greedy search to trade off network complexity (adding/removing edges and nodes) against the degree of fit to the data.

Example: [Cooper & Herskovits, 1992], the K2 algorithm:

When the data is fully observable, use a score metric to choose among alternative networks.

They report an experiment on (re-)learning a network with 37 nodes and 46 arcs describing anesthesia problems in a hospital operating room. Using 3000 examples, the program succeeds almost perfectly: it misses one arc and adds one arc which is not in the original net.
