Two Roles for the Bayesian Methods in Learning

(1)

Bayesian Learning

Based on “Machine Learning”, T. Mitchell, McGRAW Hill, 1997, ch. 6

Acknowledgement:

The present slides are an adaptation of slides drawn by T. Mitchell

(2)

Two Roles for the Bayesian Methods in Learning

1.

Provides practical learning algorithms

by combining prior knowledge/probabilities with observed data:

• Naive Bayes learning algorithm

• Expectation Maximization (EM) learning algorithm (scheme):

learning in the presence of unobserved variables

• Bayesian Belief Network learning

2.

Provides a useful conceptual framework

• Serves for evaluating other learning algorithms, e.g.

concept learning through general-to-specific hypotheses ordering (FindS, and CandidateElimination),

neural networks, liniar regression

• Provides additional insight into Occam’s razor

(3)

1. Basic Notions Bayes’ Theorem

Defining classes of hypotheses:

Maximum A posteriori Probability (MAP) hypotheses Maximum Likelihood (ML) hypotheses

2. Learning MAP hypotheses

2.1 The brute force MAP hypotheses learning algorithm 2.2 The Bayes optimal classifier;

2.3 The Gibbs classifier;

2.4 The Naive Bayes and the Joint Bayes classifiers.

Example: Learning over text data using Naive Bayes 2.5 The Minimum Description Length (MDL) Principle;

MDL hypotheses

3. Learning ML hypotheses

3.1 ML hypotheses in learning real-valued functions 3.2 ML hypotheses in learning to predict probabilities 3.3 The Expectation Maximization (EM) algorithm 4. Bayesian Belief Networks

(4)

1 Basic Notions

• Product Rule:

probability of a conjunction of two events A and B:

P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A)

• Bayes’ Theorem:

P(A|B) = P(B|A)P(A) P(B)

• Theorem of total probability:

if events A1, . . . , An are mutually exclusive, with Pn

i=1 P(A_i) = 1, then

P(B) =

n

X

i=1

P(B|Ai)P(Ai)

in particular

P(B) = P(B|A)P(A) + P(B|¬A)P(¬A)

(5)

Using Bayes’ Theorem for Hypothesis Learning

P (h | D) = P (D | h)P (h) P (D)

• P (D) = the (prior) probability of training data D

• P (h) = the (prior) probability of the hypothesis h

• P (D | h) = the (a posteriori) probability of D given h

• P (h | D) = the (a posteriori) probability of h given D

(6)

Classes of Hypotheses

Maximum Likelihood (ML) hypothesis:

the hypothesis that best explains the training data h

M L

= argmax

h_i∈H

P (D | h

i

)

Maximum A posteriori Probability (MAP) hypothesis:

the most probable hypothesis given the training data

hM AP = argmax

h∈H

P(h|D) = argmax

h∈H

P(D|h)P(h)

P(D) = argmax

h∈H

P(D|h)P(h)

Note: If P (h

i

) = P (h

j

), ∀ i, j , then h

M AP

= h

M L

(7)

Exemplifying MAP Hypotheses

Suppose the following data characterize the lab result for cancer-suspect people.

P(cancer) = 0.008 P(¬cancer) = 0.992 h₁ = cancer, h₂ = ¬cancer

P(+|cancer) = 0.98 P(−|cancer) = 0.02 D = {+,−}, P(D | h₁), P(D | h₂) P(+|¬cancer) = 0.03 P(−|¬cancer) = 0.97

Question: Should we diagnoze a patient x whose lab result is positive as having cancer?

Answer: No.

Indeed, we have to find argmax{P(cancer|+), P(¬cancer|+)}. Applying Bayes theorem (for D = {+}):

P(+ | cancer)P(cancer) = 0.98 × 0.008 = 0.0078 P(+ | ¬cancer)P(¬cancer) = 0.03 × 0.992 = 0.0298

⇒ hM AP = ¬cancer (We can infer

P (cancer | +) =

0.0078+0.0298⁰^.⁰⁰⁷⁸

= 21%

⁾

(8)

2 Learning MAP Hypothesis

2.1 The Brute Force MAP Hypothesis Learning Algorithm

Training:

Choose the hypothesis with the highest posterior proba- bility

h

M AP

= argmax

h∈H

P (h | D) = argmax

h∈H

P (D | h)P (h) Testing:

Given x, compute h

M AP

(x) Drawback:

Requires to compute all probabilities P (D | h) and P (h).

(9)

The Most Probable Classification of New Instances

So far we’ve sought h

M AP

, the most probable hypothesis given the data D.

Question: Given new instance x — the classification of which can take any value v

j

in some set V —, what is its most probable classification?

Answer: P (v

j

| D) = P

h_i∈H

P (v

j

| h

i

)P (h

i

| D)

Therefore, the Bayes optimal classification of x is:

argmax

v_j∈V

X

h_i∈H

P (v

j

| h

i

)P (h

i

| D)

Remark: h

M AP

(x) is not the most probable classification of x!

(See the next example.)

(10)

The Bayes Optimal Classifier: An Example

Let us consider three possible hypotheses:

P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3 Obviously, hM AP = h1.

Let’s consider an instance x such that h1(x) = +, h2(x) = −, h3(x) = −

Question: What is the most probable classification of x?

Answer:

P(−|h₁) = 0, P(+|h₁) = 1 P(−|h₂) = 1, P(+|h₂) = 0 P(−|h₃) = 1, P(+|h₃) = 0 X

hi∈H

P(+|hi)P(hi|D) = 0.4 and X

hi∈H

P(−|hi)P(hi|D) = 0.6 therefore

argmax

vj∈V

X

h_i∈H

P(v_j|h_i)P(h_i|D) = −

(11)

2.3 The Gibbs Classifier

[Opper and Haussler, 1991]

Note: The Bayes optimal classifier provides the best result, but it can be expensive if there are many hypotheses.

Gibbs algorithm:

1. Choose one hypothesis at random, according to P(h|D) 2. Use this to classify new instance

Surprising fact [Haussler et al. 1994]:

If the target concept is selected randomly according to the P(h|D) distribution, then the expected error of Gibbs Classifier is no worse than twice the expected error of the Bayes optimal classifier!

E[errorGibbs] ≤ 2E[errorBayesOptimal]

(12)

2.4 The Naive Bayes Classifier

When to use it:

• The target function f takes value from a finite set V = {v₁, . . . , vk}

• Moderate or large training data set is available

• The attributes < a₁, . . . , a_n > that describe instances are conditionally independent w.r.t. to the given classification:

P(a1, a2 . . . an|vj) = Y

i

P(ai|vj)

The most probable value of f(x) is:

vM AP = argmax

vj∈V

P(vj|a₁, a₂ . . . an) = argmax

vj∈V

P(a1, a2 . . . an|vj)P(vj) P(a1, a2 . . . an)

= argmax

v_j∈V

P(a1, a2 . . . an|vj)P(vj) = argmax

v_j∈V

Y

i

P(ai|vj)P(vj) ^not.= vN B

This is the so-called decision rule of the Naive Bayes classifier.

(13)

The Joint Bayes Classifier

vM AP = argmax

vj∈V

P(vj|a1, a2 . . . an) = . . .

= argmax

vj∈V

P(a1, a2 . . . an|vj)P(vj) = argmax

vj∈V

P(a1, a2 . . . an, vj) ^not.= vJ B

(14)

The Naive Bayes Classifier: Remarks

1. Along with decision trees, neural networks, k-nearest neigh- bours, the Naive Bayes Classifier is one of the most prac- tical learning methods.

2. Compared to the previously presented learning algorithms, the Naive Bayes Classifier does no search through the hy- pothesis space;

the output hypothesis is simply formed by estimating the

parameters P (v

j

), P (a

i

| v

j

).

(15)

The Naive Bayes Classification Algorithm

Naive Bayes Learn (examples)

for each value v

j

of the output attribute P ˆ (v

j

) ← estimate P (v

j

)

for each value a

i

of each input attribute a P ˆ (a

i

| v

j

) ← estimate P (a

i

| v

j

)

Classify New Instance (x) v

N B

= argmax

_v_j_∈_V

P ˆ (v

j

) Q

ai∈x

P ˆ (a

i

| v

j

)

(16)

The Naive Bayes: An Example

Consider again the PlayTennis example, and new instance

hOutlook = sun, T emp = cool, Humidity = high, W ind = strongi We compute:

vN B = argmax_v_j∈V P(vj)Q

i P(ai|vj) P(yes) = ₁₄⁹ = 0.64 P(no) = ₁₄⁵ = 0.36 . . .

P(strong|yes) = ³₉ = 0.33 P(strong|no) = ³₅ = 0.60

→ v_{N B} = no

(17)

A Note on The Conditional Independence Assumption of Attributes

P (a

₁

, a

₂

. . . a

n

| v

j

) = Y

i

P (a

i

| v

j

)

It is often violated in practice ...but it works surprisingly well anyway.

Note that we don’t need estimated posteriors P ˆ (v

j

| x) to be correct; we only need that

argmax

v_j∈V

P ˆ (v

j

) Y

i

P ˆ (a

i

| v

j

) = argmax

v_j∈V

P (v

j

)P (a

₁

. . . , a

n

| v

j

)

[Domingos & Pazzani, 1996] analyses this phenomenon.

(18)

Naive Bayes Classification:

The problem of unseen data

What if none of the training instances with target value v_j have the attribute value ai?

It follows that Pˆ(ai|vj) = 0, and Pˆ(vj)Q

i Pˆ(ai|vj) = 0

The typical solution is to (re)define P(ai|vj), for each value vj of ai:

P ˆ (a

_i

| v

_j

) ←

ⁿ_n+m^c^+mp^{, where}

• n is number of training examples for which v = vj,

• nc number of examples for which v = vj and a = ai

• p is a prior estimate for Pˆ(ai|vj)

(for instance, if the attribute a has k values, then p = _k¹)

• m is a weight given to that prior estimate (i.e. number of “virtual” examples)

(19)

Using the Naive Bayes Learner:

Learning to Classify Text

• Learn which news articles are of interest

Target concept Interesting? : Document → { +, −}

• Learn to classify web pages by topic

Target concept Category : Document → { c

₁

, . . . , c

n

}

Naive Bayes is among most effective algorithms

(20)

Learning to Classify Text: Main Design Issues

1. Represent each document by a vector of words

• one attribute per word position in document 2. Learning:

• use training examples to estimate P(+), P(−), P(doc|+), P(doc|−)

• Naive Bayes conditional independence assumption:

P(doc|v_j) =

length(doc)

Y

i=1

P(a_i = w_k|v_j)

where P(ai = wk|vj) is probability that word in position i is wk, given vj

• Make one more assumption:

∀i, m P(ai = wk|vj) = P(am = wk|vj) = P(wk|vj)

i.e. attributes are (not only indep. but) also identically distributed

(21)

Learn naive Bayes text ( Examples, Vocabulary )

1. Collect all words and other tokens that occur in Examples V ocabulary ← all distinct words and other tokens in Examples 2. Calculate the required P(v_j) and P(w_k|v_j) probability terms

For each target value vj in V

docsj ← the subset of Examples for which the target value is vj

P (v

j

) ←

_|_Examples^|^docs^j^| _|

T extj ← a single doc. created by concat. all members of docsj

n ← the total number of words in T extj

For each word w_k in V ocabulary

n_k ← the number of times word w_k occurs in T ext_j

P (w

k

| v

j

) ←

_n+_|V ocabularyⁿ^k⁺¹ | (here we use the m-estimate)

(22)

Classify naive Bayes text (Doc)

positions ← all word positions in Doc that contain tokens from V ocabulary Return

v

N B

= argmax

_v_j_∈_V

P (v

j

) Q

i∈positions

P (a

i

= w

k

| v

j

)

(23)

Given 1000 training documents from each of the 20 newsgroups, learn to classify new documents according to which newsgroup it came from

comp.graphics misc.forsale comp.os.ms-windows.misc rec.autos comp.sys.ibm.pc.hardware rec.motorcycles

comp.sys.mac.hardware rec.sport.baseball comp.windows.x rec.sport.hockey

alt.atheism sci.space soc.religion.christian sci.crypt

talk.religion.misc sci.electronics talk.politics.mideast sci.med

talk.politics.misc talk.politics.guns

Naive Bayes: 89% classification accuracy (having used 2/3 of each group for training; eliminated rare words, and the 100 most freq. words)

(24)

Learning Curve for 20 Newsgroups

0 10 20 30 40 50 60 70 80 90 100

100 1000 10000

20News

Bayes TFIDF PRTFIDF

Accuracy vs. Training set size

(25)

2.5 The Minimum Description Length Principle

Occam’s razor: prefer the shortest hypothesis Bayes analysis: prefer the hypothesis hM AP

hM AP = argmax

h∈H

P(D|h)P(h) = argmax

h∈H

(log₂ P(D|h) + log₂ P(h))

= argmin

h∈H

(−log₂ P(D|h) − log₂ P(h)) Interesting fact from the Information Theory:

The optimal (shortest expected coding length) code for an event with probability p is the one using −log₂ p bits.

So we can interpret:

−log₂ P(h): the length of h under the optimal code

−log₂ P(D|h): the length of D given h under the optimal code Therefore we prefer the hypothesis h that minimizes...

(26)

Bayes Analysis and the MDL Principle

We saw that a MAP learner prefers the hypothesis h that minimizes L_C₁(h) + L_C₂(D|h), where L_C(x) is the description length of x under encoding C

hM DL = argmin

h∈H

(LC1(h) + LC2(D|h)) Example: H = decision trees, D = training data labels

• L_C₁(h) is the number of bits to describe tree h

• L_C₂(D|h) is the number of bits to describe D given h

In literature, the application of MDL to practical problems often include arguments justifying the choice of the encodings C1 and C2.

(27)

For instance:

LC2(D|h) = 0 if examples are classified perfectly by h, and both the transmitter and the receiver know h.

Therefore, in this situation we need only to describe exceptions. So:

hM DL = argmin

h∈H

(length(h) + length(misclassif ications))

In general, MDL trades off hypothesis size for training errors:

it might select a shorter hypothesis that makes few errors over a longer hypothesis that perfectly classifies the data!

Consequence: In learning (for instance) decision trees, (using) the MDL principle can work as an alternative to pruning.

(28)

The MDL Principle: Back to Occam’s Rasor

MDL hypotheses are not necessarily also the best/MAP ones.

(For that, we should know all the probabilities P (D | h)

and P (h).)

(29)

3 Learning Maximum Likelihood (ML) Hypothesis

3.1 Learning Real Valued Functions:

ML Hypotheses as Least Suquered Error Hypotheses

h_ML f

e y

x

Problem: Consider learning a real-valued target function f : X → R from D, a training set consisting of examples hxi, dii, i = 1, . . . , m with

xi, assumed fixed (to simplify)

d_i noisy training value d_i = f(x_i) + e_i

ei is random variable (noise) drawn independently for each xi, according to some Gaussian distribution with mean=0.

(30)

Proposition

Considering H, a certain class of functions h : X → R such that h(xi) = f(xi) and assuming that xi are mutually independent given h,

the maximum likelihood hypothesis hM L is the one that minimizes the sum of squared errors:

hM L

def.= argmax

h∈H

P(D|h) = arg min

h∈H m

X

i=1

(di − h(xi))²

(31)

Note: We will use the probability density function:

p(x0) ^def.= limǫ→0

1

ǫP(x0 ≤ x < x0 + ǫ) hM L = argmax

h∈H

P(D|h) = argmax

h∈H

m

Y

i=1

p(di|h) ^µⁱ^=f=^(xⁱ⁾ argmax

h∈H

m

Y

i=1

p(ei|h)

= argmax

h∈H

m

Y

i=1

p(di − f(xi)|h) ^h(xⁱ^)=f= ^(xⁱ⁾ argmax

h∈H

m

Y

i=1

p(di − h(xi)|h)

= argmax

h∈H

m

Y

i=1

√ 1

2πσ²e⁻¹²⁽^di⁻^h(^σ ^xi⁾⁾² = argmax

h∈H

(

m

X

i=1

ln 1

√2πσ² − 1 2

di − h(xi) σ

2

)

= argmax

h∈H

m

X

i=1

−1 2

di − h(xi) σ

2

= argmax

h∈H

m

X

i=1

−(di − h(xi))²

= argmin

h∈H

m

X

i=1

(di − h(xi))²

(32)

Generalisations...

1. Similar derivations can be performed starting with other assumed noise distributions (than Gaussians), producing different results.

2. It was assumed that

a. the noise affects only f(xi), and

b. no noise was recorded in the attribute values for the given examples x_i.

Otherwise, the analysis becomes significantly more complex.

(33)

3.2 ML hypotheses for Learning Probability Functions

Let us consider a non-deterministic function (i.e. one-to-many relation) f : X → {0,1}.

Given a set of independently drawn examples

D = {< x₁, d₁ >, . . . , < x_m, d_m >} where d_i = f(x_i) ∈ {0,1},

we would like to learn a ML hypothesis for the probability function g(x) ^def.= P(f(x) = 1).

For example, h(xi) = 0.92 if P({< xi, di > |di = 1}) = 0.92.

Proposition: In this setting, hM L = argmax_h_∈_H P(D | h) maximizes the sum Pm

i=1[di lnh(xi) + (1 − di)ln(1 − h(xi))].

Proof:

P(D | h) = Π^m_i=1P(xi, di | h) = Π^m_i=1P(di | xi, h) · P(xi | h) It can be assumed that xi is independent of h, therefore:

P(D | h) = Π^m_i=1P(d_i | x_i, h) · P(x_i)

(34)

Proof (continued):

What we wanted to compute is h(xi) = P(di = 1 | xi, h).

In a more general form:

P(di | xi, h) =

h(xi) if di = 1 1 − h(xi) if di = 0

In a more convenient mathematical form: P(di | xi, h) = h(xi)^dⁱ(1−h(xi))¹⁻^dⁱ.

⇒ h_{M L} = argmax_h_∈_H Π^m_i=1[h(x_i)^dⁱ(1 − h(x_i))¹⁻^dⁱP(x_i)]

= argmax_h_∈_H Π^m_i=1h(x_i)^dⁱ(1 − h(x_i))¹⁻^dⁱ · Π^m_i=1P(x_i)

= argmax_h_∈_H Π^m_i=1h(xi)^dⁱ(1 − h(xi))¹⁻^dⁱ

= argmax_h_∈_H

m

X

i=1

[di lnh(xi) + (1 − di)ln(1 − h(xi))]

Note: The quantity −Pm

i=1[di lnh(xi) + (1 − di)ln(1 − h(xi))] is called cross- entropy; the above hM L minimizes this quantity.

(35)

3.3 The Expectation Maximization (EM) Algorithm

[Dempster et al, 1977]

Find (local) Maximum Likelihood hypotheses when data is only partially observable:

• Unsupervised learning (i.e., clustering):

the target value is unobservable

• Supervised learning:

some instance attributes are unobservable

Some applications:

• Non-hierarchical clustering:

Estimate the means of k Gausseans

• Learn Hidden Markov Models

• Learn Probabilistic Context Free Grammars

• Train Radial Basis Function Networks

• Train Bayesian Belief Networks

(36)

The General EM Problem

Given

• observed data X = { x

₁

, . . . , x

m

}

independently generated using the parameterized distributions/hypotheses h

₁

, . . . , h

m

• unobserved data Z = { z

₁

, . . . , z

m

} determine

ˆ h that (locally) maximizes P (X | h).

(37)

The Essence of the EM Approach

Start with h

⁽⁰⁾

, an arbitrarily/conveniently chosen value of h.

Repeatedly

1. Use the observed data X and the current hypothesis h

^(t)

to estimate [the probabilities associated to the values of ] the unobserved variables Z , and further on compute their expectations, E [Z ].

2. The expected values of the unobserved variables Z are used to calculate an improved hypothesis h

^(t+1)

, based on maximizing the mean of a log-likelihood function: E [ln P (Y | h) | X, h

^(t)

], where Y = { y

₁

, . . . , y

m

} is the complete (observed and unobserved) data, i.e.

y

i

= (x

i

, z

i

), for i = 1, . . . , m.

(38)

The General EM Algorithm

Repeat the following two steps until convergence is reached:

Estimation (E) step:

Calculate the log likelihood function

Q(h | h

^(t)

)

^not.

= E [ln P (Y | h) | X, h

^(t)

] where Y = X ∪ Z .

Maximization (M) step:

Replace hypothesis h

^(t)

by the hypothesis h

^(t+1)

that maxi- mizes this Q function.

h

^(t+1)

← argmax

h

Q(h | h

^(t)

)

(39)

The EM algorithmic Schema

◦ idea: replace missing values by estimated values

• initialize parameters with arbitrary values

• estimate missing val- ues based on current parameter values

• re-estimate param- eters using the complete data

• repeat the previous two steps until conver- gence

X, h^(t) E[Z | ] h^(t)

h

= argmax h^(t+1)

P(Z|X; )h^(t)

E [ln P(X,Z|h)]

++t

ln P(X|h)

t=0

(40)

Methodology: How to apply an instance of the EM schema

initialize parametersarbitrarily

h

^(t)

++t t=0

h

= argmax h

^(t+1)

P(Z|X; )h^(t)

E [ln P(X,Z|h)]

X, h

^(t)

E[Z | ]

M Step:

apply updating rules (2)

apply

E Step:

updating rules (1)

(41)

Methodology: How to derive an instance of the EM schema

h^(t)

"observable" data log−likelihood of the

h^(t) Q(h| )

function

"auxiliary"

variables

"unobservable" / latent

"observable"

variables / data

parameters hypothesis /

(4) solve this optimization problem

and sometimes Lagrange’s method) (usually using partial derivatives,

of "complete" data

(2) apply expectation linearity (1) compute the log−likelihood

with the total probability formula

(3) apply Bayes formula

(and sometimes the exponentiation trick)

++t

ln P(X|h)

t=0

h

= argmax h^(t+1)

P(Z|X; )h^(t)

E [ln P(X,Z|h)]

X, h^(t) E[Z | ]

M Step

initialized parameters consider arbitrarily

E Step

(42)

Baum-Welch Theorem

When Q is continuous, it can be shown that EM con-

verges to a stationary point (local maximum) of the

likelihood function P (X | h).

(43)

Two mixtures of Bernoulli distributions

1− θ 2 θ

θ

2

1−

0 1

Z~Bernoulli

X~Bernoulli

₁

X~Bernoulli

₀

π ¹⁻ π

0 1

θ

X ∼ πBernoulli(θ) + (1−π)Bernoulli(2θ)

0 1

X~Bernoulli

₀

1

0 p 1−p q

1−q 1

X~Bernoulli

Z~Bernoulli

1− π π

X ∼ πBernoulli(p) + (1−π)Bernoulli(q)

(44)

EM algorithm for solving

A mixture of vectors of independent and identically distributed (i.i.d.) Bernoulli variables

X_i

X₁ X₁₀

x₁ x_i x₁₀ x

~Bernoulli (θ )_A ^~Bernoulli (θ )_B π

Z~Bernoulli

1−π

B A

X₁ X_i X₁₀

X = (x¹, . . . , x10) ∼π·Bernoulli(x¹;θ_A)·. . .·Bernoulli(x¹⁰;θ_A) + (1−π)·Bernoulli(x¹;θ_B)·. . .·Bernoulli(x¹⁰;θ_B).

Cf. “What is the expectation maximization algorithm?”, Chuong B. Do, Serafim Batzoglou, Nature Biotechnology, vol. 26, no. 8, 2008, pag. 897-899

(45)

Init¸ializare:

atribuie valori arbitrare (θ_A⁽⁰⁾, θ_B⁽⁰⁾ˆın intervalul (0,1)) pentru parametrii θA ¸si respectiv θB;

Corpul iterativ:

pentru t = 0, . . . , T −1 (cu T fixat ˆın avans)

(sau: pân˘a când log-verosimilitatea datelor observabile nu mai cre¸ste semnificativ), (sau: pân˘a când |θ_A^(t) −θ_A^(t+1)| < ε, |θ_B^(t) −θ_B^(t+1)| < ε),

cu ε fixat ˆın avans) execut˘a

Pasul E:

pentru i = 1, . . . ,5 calculeaz˘a

p^(t+1)_i,A = (θ^(t)_A )^xⁱ(1−θ^(t)_A )¹⁰⁻^xⁱ

(θ^(t)_A )^xⁱ(1−θ^(t)_A )¹⁰⁻^xⁱ + (θ_B^(t))^xⁱ(1−θ_B^(t))¹⁰⁻^xⁱ

p^(t+1)_i,B = (θ^(t)_B )^xⁱ(1−θ^(t)_B )¹⁰⁻^xⁱ

(θ^(t)_A )^xⁱ(1−θ^(t)_A )¹⁰⁻^xⁱ + (θ_B^(t))^xⁱ(1−θ_B^(t))¹⁰⁻^xⁱ ; Pasul M:

θ_A^(t+1) = P5

i=1

xi

10 p^(t+1)_i,A P5

i=1p^(t+1)_i,A ¸si θ_B^(t+1) = P5

i=1

xi

10 p^(t+1)_i,B P5

i=1p^(t+1)_i,B ; Returneaz˘a θ_A^(t+1), θ_B^(t+1);

(46)

A mixture of vectors of independent Bernoulli distributions, applied to clustering of hand-written digits (MNIST)

π

₂

π

₁

π

_K

(p )1,i

~B

X_1,1 X_1,i X_1,D

(p )2,i

~B

X_2,1 X_2,i X_2,D

(p )K,i

~B

XK,1 X _K,i X_K,D

x₁ x_i x_D 1

Z~Categorical

2 K

x CMU, 2015f, A. Smola, B. Poczos,

HW2, pr. 1

X = (x¹, . . . , xD) ∼ π1·Bernoulli(x¹;p1,1)·. . .·Bernoulli(xD;p1,D) +πK·Bernoulli(x¹;pK,1)·. . .·Bernoulli(xD;pK,D).

(47)

4 Bayesian Belief Networks

(also called Bayes Nets) Interesting because:

• The Naive Bayes assumption of conditional independence of attributes is too restrictive.

(But it’s intractable without some such assumptions...)

• Bayesian Belief networks describe conditional indepen- dence among subsets of variables.

• It allows the combination of prior knowledge about

(in)dependencies among variables with observed training

data.

(48)

Conditional Independence

Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given a value of Z :

( ∀ x

i

, y

j

, z

k

) P (X = x

i

| Y = y

j

, Z = z

k

) = P (X = x

i

| Z = z

k

) More compactly, we write P (X | Y, Z ) = P (X | Z )

Note: Naive Bayes uses conditional independence to justify P (A

₁

, A

₂

| V ) = P (A

₁

| A

₂

, V )P (A

₂

| V ) = P (A

₁

| V )P (A

₂

| V )

Generalizing the above definition:

P (X

₁

. . . X

l

| Y

₁

. . . Y

m

, Z

₁

. . . Z

n

) = P (X

₁

. . . X

l

| Z

₁

. . . Z

n

)

(49)

A Bayes Net

Storm

Campfire Lightning

Thunder ForestFire

Campfire C

¬C

¬S,B ¬S,¬B 0.4

0.6 0.1 0.9

0.8 0.2

0.2 0.8 S,¬B

BusTourGroup

S,B

The network is defined by

• A directed acyclic graph, represening a set of conditional independence assertions:

Each node — representing a random variable — is asserted to be conditionally independent of its nondescendants, given its immediate predecessors.

Example: P(T hunder|F orestF ire, Lightning) = P(T hunder|Lightning)

• A table of local conditional probabilities for each node/variable.

(50)

A Bayes Net (Cont’d)

represents the joint probability distribution over all variables Y

₁

, Y

₂

, . . . , Y

n

:

This joint distribution is fully defined by the graph, plus the conditional probabilities:

P (y

₁

, . . . , y

n

) = P (Y

₁

= y

₁

, . . . , Y

n

= y

n

) =

n

Y

i=1

P (y

i

| P arents(Y

i

)) where P arents(Y

i

) denotes immediate predecessors of Y

i

in the graph.

In our example: P (Storm, BusT ourGroup, . . . , F orestF ire)

(51)

Inference in Bayesian Nets

Question: Given a Bayes net, can one infer the probabilities of values of one or more network variables, given the observed values of (some) others?

Example:

Given the Bayes net compute:

(a) P (S ) (b) P (A, S )

(b) P (A)

_P(A|S)=0.7

P(A|~S)=0.3

P(G|S)=0.8 P(G|~S)=0.2 P(S|L,F)=0.8

P(S|~L,F)=0.5 P(S|~L,~F)=0.3 P(S|L,~F)=0.6

L F

S

A G

P(L)=0.4 P(F)=0.6

(52)

Inference in Bayesian Nets (Cont’d)

Answer(s):

• If only one variable is of unknown (probability) value, then it is easy to infer it

• In the general case, we can compute the probability dis- tribution for any subset of network variables, given the distribution for any subset of the remaining variables.

But...

• The exact inference of probabilities for an arbitrary

Bayes net is an NP-hard problem!!

(53)

Inference in Bayesian Nets (Cont’d)

In practice, we can succeed in many cases:

• Exact inference methods work well for some net structures.

• Monte Carlo methods “simulate” the network randomly to calculate approximate solutions [Pradham & Dagum, 1996].

(In theory even approximate inference of probabilities in

Bayes Nets can be NP-hard!! [ Dagum & Luby, 1993])

(54)

Learning Bayes Nets (I)

There are several variants of this learning task

• The network structure might be either known or unknown (i.e., it has to be inferred from the training data).

• The training examples might provide values of all network variables, or just for some of them.

The simplest case:

If the structure is known and we can observe the values of all variables,

then it is easy to estimate the conditional probability table

entries (analogous to training a Naive Bayes classifier).

(55)

Learning Bayes Nets (II)

When

• the structure of the Bayes Net is known, and

• the variables are only partially observable in the training data

learning the entries in the conditional probabilities tables is similar to (learning the weights of hidden units in) training a neural network with hidden units:

− We can learn the net’s conditional probability tables using the gradient ascent!

− Converge to the network h that (locally) maximizes P (D | h).

(56)

Gradient Ascent for Bayes Nets

Let w

ijk

denote one entry in the conditional probability table for the variable Y

i

in the network

w

ijk

= P (Y

i

= y

ij

| P arents(Y

i

) = the list u

ik

of values) It can be shown (see the next two slides) that

∂ ln P

h

(D)

∂w

ijk

= X

d∈D

P

h

(y

ij

, u

ik

| d) w

ijk

therefore perform gradient ascent by repeatedly 1. update all w

ijk

using the

training data D w

ijk

← w

ijk

+η X

d∈D

P

h

(y

ij

, u

ik

| d) w

ijk

2. renormalize the w

ijk

to as- sure

X

j

w

ijk

= 1 and 0 ≤ w

ijk

≤ 1

56.

(57)

Gradient Ascent for Bayes Nets: Calculus

∂ ln P

h

(D)

∂w

ijk

= ∂

∂w

ijk

ln Y

d∈D

P

h

(d) = X

d∈D

∂ ln P

h

(d)

∂w

ijk

= X

d∈D

1 P

h

(d)

∂P

h

(d)

∂w

ijk

Summing over all values y

ij^′

of Y

i

, and u

ik^′

of U

i

= P arents(Y

i

):

∂ ln P

h

(D)

∂w

ijk

= X

d∈D

1 P

h

(d)

∂

∂w

ijk

X

j^′k^′

P

h

(d | y

ij^′

, u

ik^′

)P

h

(y

ij^′

, u

ik^′

)

= X

d∈D

1 P

h

(d)

∂

∂w

ijk

X

j^′k^′

P

h

(d | y

ij^′

, u

ik^′

)P

h

(y

ij^′

| u

ik^′

)P

h

(u

ik^′

)

Note that w

ijk

≡ P

h

(y

ij

| u

ik

), therefore...

(58)

Gradient Ascent for Bayes Nets: Calculus (Cont’d)

∂ ln P

h

(D)

∂w

ijk

= X

d∈D

1 P

h

(d)

∂

∂w

ijk

P

h

(d | y

ij

, u

ik

)w

ijk

P

h

(u

ik

)

= X

d∈D

1 P

h

(d) P

h

(d | y

ij

, u

ik

)P

h

(u

ik

) (applying Bayes th.)

= X

d∈D

1 P

h

(d)

P

h

(y

ij

, u

ik

| d)P

h

(d)P

h

(u

ik

) P

h

(y

ij

, u

ik

)

= X

d∈D

P

h

(y

ij

, u

ik

| d)P

h

(u

ik

)

P

h

(y

ij

, u

ik

) = X

d∈D

P

h

(y

ij

, u

ik

| d) P

h

(y

ij

| u

ik

)

= X

d∈D

P

h

(y

ij

, u

ik

| d)

w

ijk

(59)

Learning Bayes Nets (II, Cont’d)

The EM algorithm can also be used.

Repeatedly:

1. Calculate/estimate from data the probabilities of unob- served variables w

ijk

,

assuming that the hypothesis h holds

2. Calculate a new h (i.e. new values of w

ijk

) so to maximize E [ln P (D | h)],

where D now includes both the observed and the unob-

served variables.

(60)

Learning Bayes Nets (III)

When the structure is unknown, algorithms usually use greedy search to trade off network complexity (add/substract edges/nodes) against degree of fit to the data.