## Bayesian Learning

Based on “Machine Learning”, T. Mitchell, McGRAW Hill, 1997, ch. 6

Acknowledgement:

The present slides are an adaptation of slides drawn by T. Mitchell

## Two Roles for the Bayesian Methods in Learning

### 1. Provides practical learning algorithms

by combining prior knowledge/probabilities with observed data:

• Naive Bayes learning algorithm

• Expectation Maximization (EM) learning algorithm (scheme):

learning in the presence of unobserved variables

• Bayesian Belief Network learning

### 2. Provides a useful conceptual framework

• Serves for evaluating other learning algorithms, e.g.

concept learning through general-to-specific ordering of hypotheses (Find-S and Candidate-Elimination),

neural networks, linear regression

• Provides additional insight into Occam’s razor

1. Basic Notions: Bayes’ Theorem

   Defining classes of hypotheses:
   Maximum A posteriori Probability (MAP) hypotheses,
   Maximum Likelihood (ML) hypotheses

2. Learning MAP hypotheses

   2.1 The brute force MAP hypothesis learning algorithm
   2.2 The Bayes optimal classifier
   2.3 The Gibbs classifier
   2.4 The Naive Bayes and the Joint Bayes classifiers.
       Example: Learning over text data using Naive Bayes
   2.5 The Minimum Description Length (MDL) Principle; MDL hypotheses

3. Learning ML hypotheses

   3.1 ML hypotheses in learning real-valued functions
   3.2 ML hypotheses in learning to predict probabilities
   3.3 The Expectation Maximization (EM) algorithm

4. Bayesian Belief Networks

## 1 Basic Notions

• Product Rule:

probability of a conjunction of two events A and B:

P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A)

• Bayes’ Theorem:

P(A|B) = P(B|A)P(A) / P(B)

• Theorem of total probability:

if events A_1, . . . , A_n are mutually exclusive, with Σ_{i=1}^{n} P(A_i) = 1, then

P(B) = Σ_{i=1}^{n} P(B|A_i)P(A_i)

in particular

P(B) = P(B|A)P(A) + P(B|¬A)P(¬A)

## Using Bayes’ Theorem for Hypothesis Learning

P(h|D) = P(D|h)P(h) / P(D)

• P(D) = the (prior) probability of the training data D

• P(h) = the (prior) probability of the hypothesis h

• P(D|h) = the (a posteriori) probability of D given h

• P(h|D) = the (a posteriori) probability of h given D

## Classes of Hypotheses

### Maximum Likelihood (ML) hypothesis:

the hypothesis that best explains the training data

h_ML = argmax_{h_i∈H} P(D|h_i)

### Maximum A posteriori Probability (MAP) hypothesis:

the most probable hypothesis given the training data

h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h)P(h) / P(D) = argmax_{h∈H} P(D|h)P(h)

### Note: If P(h_i) = P(h_j), ∀ i, j, then h_MAP = h_ML.

## Exemplifying MAP Hypotheses

Suppose the following data characterize the lab result for cancer-suspect people:

P(cancer) = 0.008, P(¬cancer) = 0.992; the hypotheses are h_1 = cancer, h_2 = ¬cancer

P(+|cancer) = 0.98, P(−|cancer) = 0.02; the possible observations are D ∈ {+, −}, and we compare P(D|h_1) with P(D|h_2)
P(+|¬cancer) = 0.03, P(−|¬cancer) = 0.97

Question: Should we diagnose a patient x whose lab result is positive as having cancer?

Answer: No.

Indeed, we have to find argmax{P(cancer|+), P(¬cancer|+)}. Applying Bayes’ theorem (for D = {+}):

P(+ | cancer)P(cancer) = 0.98 × 0.008 = 0.0078
P(+ | ¬cancer)P(¬cancer) = 0.03 × 0.992 = 0.0298

⇒ h_MAP = ¬cancer

(We can infer P(cancer | +) = 0.0078 / (0.0078 + 0.0298) ≈ 21%.)
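The MAP computation above can be sketched in a few lines; the numbers are exactly those of the slide, and the hypothesis names are just labels chosen here for readability.

```python
# MAP hypothesis for the cancer lab-test example: compare P(+|h)P(h).
priors = {"cancer": 0.008, "not_cancer": 0.992}
likelihood_pos = {"cancer": 0.98, "not_cancer": 0.03}   # P(+ | h)

# Unnormalized posteriors P(+ | h) P(h)
scores = {h: likelihood_pos[h] * priors[h] for h in priors}
h_map = max(scores, key=scores.get)

# Normalizing gives the actual posterior P(cancer | +)
p_cancer_given_pos = scores["cancer"] / sum(scores.values())

print(h_map)                          # not_cancer
print(round(p_cancer_given_pos, 2))   # 0.21
```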

## 2 Learning MAP Hypothesis

### 2.1 The Brute Force MAP Hypothesis Learning Algorithm

### Training:

Choose the hypothesis with the highest posterior probability:

h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h)P(h)

### Testing:

Given x, compute h_MAP(x).

### Drawback:

Requires computing all probabilities P(D|h) and P(h).

### The Most Probable Classification of New Instances

So far we’ve sought h_MAP, the *most probable hypothesis* given the data D.

Question: Given a new instance x — the classification of which can take any value v_j in some set V — what is its *most probable classification*?

Answer: P(v_j | D) = Σ_{h_i∈H} P(v_j | h_i)P(h_i | D)

Therefore, the Bayes optimal classification of x is:

argmax_{v_j∈V} Σ_{h_i∈H} P(v_j | h_i)P(h_i | D)

Remark: h_MAP(x) is not necessarily the most probable classification of x! (See the next example.)

### The Bayes Optimal Classifier: An Example

Let us consider three possible hypotheses:

P(h_1|D) = 0.4, P(h_2|D) = 0.3, P(h_3|D) = 0.3. Obviously, h_MAP = h_1.

Let’s consider an instance x such that h_1(x) = +, h_2(x) = −, h_3(x) = −.

Question: What is the most probable classification of x?

Answer:

P(−|h_1) = 0, P(+|h_1) = 1
P(−|h_2) = 1, P(+|h_2) = 0
P(−|h_3) = 1, P(+|h_3) = 0

Σ_{h_i∈H} P(+|h_i)P(h_i|D) = 0.4 and Σ_{h_i∈H} P(−|h_i)P(h_i|D) = 0.6, therefore

argmax_{v_j∈V} Σ_{h_i∈H} P(v_j|h_i)P(h_i|D) = −
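The weighted vote of this example can be sketched directly; since each h_i is deterministic here, P(v_j | h_i) is simply 0 or 1.

```python
# Bayes optimal vote for the three-hypothesis example above.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}    # P(h_i | D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}   # h_i(x)

votes = {"+": 0.0, "-": 0.0}
for h, p in posteriors.items():
    votes[predictions[h]] += p    # P(v_j | h_i) is 0 or 1 here

best = max(votes, key=votes.get)
print(best, votes["+"], votes["-"])   # - 0.4 0.6
```

Note how the MAP hypothesis h_1 votes "+", yet the optimal classification is "−".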

## 2.3 The Gibbs Classifier

### [Opper and Haussler, 1991]

Note: The Bayes optimal classifier provides the best result, but it can be expensive if there are many hypotheses.

Gibbs algorithm:

1. Choose one hypothesis at random, according to P(h|D).
2. Use this hypothesis to classify the new instance.

Surprising fact [Haussler et al. 1994]:

If the target concept is selected randomly according to the P(h|D) distribution, then the expected error of Gibbs Classifier is no worse than twice the expected error of the Bayes optimal classifier!

E[errorGibbs] ≤ 2E[errorBayesOptimal]

## 2.4 The Naive Bayes Classifier

When to use it:

• The target function f takes values from a finite set V = {v_1, . . . , v_k}

• A moderate or large training data set is available

• The attributes ⟨a_1, . . . , a_n⟩ that describe instances are conditionally independent given the classification:

P(a_1, a_2, . . . , a_n | v_j) = Π_i P(a_i | v_j)

The most probable value of f(x) is:

v_MAP = argmax_{v_j∈V} P(v_j | a_1, a_2, . . . , a_n) = argmax_{v_j∈V} P(a_1, a_2, . . . , a_n | v_j)P(v_j) / P(a_1, a_2, . . . , a_n)

= argmax_{v_j∈V} P(a_1, a_2, . . . , a_n | v_j)P(v_j) = argmax_{v_j∈V} P(v_j) Π_i P(a_i | v_j) =(not.) v_NB

This is the so-called *decision rule* of the Naive Bayes classifier.

## The Joint Bayes Classifier

v_MAP = argmax_{v_j∈V} P(v_j | a_1, a_2, . . . , a_n) = . . .

= argmax_{v_j∈V} P(a_1, a_2, . . . , a_n | v_j)P(v_j) = argmax_{v_j∈V} P(a_1, a_2, . . . , a_n, v_j) =(not.) v_JB

## The Naive Bayes Classifier: Remarks

### 1. Along with decision trees, neural networks, and k-nearest neighbours, the Naive Bayes classifier is one of the most practical learning methods.

### 2. Compared to the previously presented learning algorithms, the Naive Bayes classifier does no search through the hypothesis space; the output hypothesis is simply formed by estimating the parameters P(v_j) and P(a_i | v_j).

## The Naive Bayes Classification Algorithm

### Naive_Bayes_Learn(examples)

    for each value v_j of the output attribute
        P̂(v_j) ← estimate P(v_j)
        for each value a_i of each input attribute a
            P̂(a_i | v_j) ← estimate P(a_i | v_j)

### Classify_New_Instance(x)

    v_NB = argmax_{v_j∈V} P̂(v_j) Π_{a_i∈x} P̂(a_i | v_j)

## The Naive Bayes: An Example

Consider again the *PlayTennis* example, and the new instance

⟨Outlook = sun, Temp = cool, Humidity = high, Wind = strong⟩

We compute:

v_NB = argmax_{v_j∈V} P(v_j) Π_i P(a_i | v_j)

P(yes) = 9/14 = 0.64, P(no) = 5/14 = 0.36
. . .
P(strong|yes) = 3/9 = 0.33, P(strong|no) = 3/5 = 0.60

P(yes) P(sun|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.0053
P(no) P(sun|no) P(cool|no) P(high|no) P(strong|no) = 0.0206

→ v_NB = no
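The computation can be reproduced numerically; the conditional probabilities below are the standard counts from Mitchell’s 14-example *PlayTennis* table (only a few of which are shown on the slide).

```python
# Naive Bayes decision for <sun, cool, high, strong> on PlayTennis.
from math import prod

p_class = {"yes": 9/14, "no": 5/14}
p_attr = {
    "yes": {"sun": 2/9, "cool": 3/9, "high": 3/9, "strong": 3/9},
    "no":  {"sun": 3/5, "cool": 1/5, "high": 4/5, "strong": 3/5},
}
instance = ["sun", "cool", "high", "strong"]

# P(v_j) * prod_i P(a_i | v_j) for each class
scores = {v: p_class[v] * prod(p_attr[v][a] for a in instance)
          for v in p_class}
v_nb = max(scores, key=scores.get)
print(round(scores["yes"], 4), round(scores["no"], 4), v_nb)
# 0.0053 0.0206 no
```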

## A Note on The Conditional Independence Assumption of Attributes

P(a_1, a_2, . . . , a_n | v_j) = Π_i P(a_i | v_j)

It is often violated in practice ...but it works surprisingly well anyway.

Note that we don’t need the estimated posteriors P̂(v_j | x) to be correct; we only need that

argmax_{v_j∈V} P̂(v_j) Π_i P̂(a_i | v_j) = argmax_{v_j∈V} P(v_j)P(a_1, . . . , a_n | v_j)

[Domingos & Pazzani, 1996] analyses this phenomenon.

## Naive Bayes Classification:

## The problem of unseen data

What if none of the training instances with target value v_j have the attribute value a_i?

It follows that P̂(a_i | v_j) = 0, and so P̂(v_j) Π_i P̂(a_i | v_j) = 0.

The typical solution is to (re)define P̂(a_i | v_j), for each value a_i of attribute a:

P̂(a_i | v_j) ← (n_c + m·p) / (n + m), where

• n is the number of training examples for which v = v_j,

• n_c is the number of examples for which v = v_j and a = a_i,

• p is a prior estimate for P̂(a_i | v_j)
  (for instance, if the attribute a has k values, then p = 1/k),

• m is a weight given to that prior estimate (i.e., the number of “virtual” examples).
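The m-estimate above is a one-line formula; a minimal sketch with the variable names of the slide:

```python
# m-estimate: smooth P̂(a_i | v_j) so unseen values never get probability 0.
def m_estimate(n_c, n, p, m):
    """n_c: examples with v = v_j and a = a_i; n: examples with v = v_j;
    p: prior estimate (e.g. 1/k for an attribute with k values);
    m: weight of the prior (number of 'virtual' examples)."""
    return (n_c + m * p) / (n + m)

# Unseen value (n_c = 0), attribute with k = 3 values, m = 3 virtual examples:
print(m_estimate(0, 9, 1/3, 3))   # 1/12 ≈ 0.083 instead of 0
```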

## Using the Naive Bayes Learner:

## Learning to Classify Text

• Learn which news articles are of interest

  Target concept Interesting?: Document → {+, −}

• Learn to classify web pages by topic

  Target concept Category: Document → {c_1, . . . , c_n}

Naive Bayes is among the most effective algorithms for such tasks.

## Learning to Classify Text: Main Design Issues

1. Represent each document by a vector of words

• one attribute per word position in the document

2. Learning:

• use training examples to estimate P(+), P(−), P(doc|+), P(doc|−)

• Naive Bayes conditional independence assumption:

P(doc | v_j) = Π_{i=1}^{length(doc)} P(a_i = w_k | v_j)

where P(a_i = w_k | v_j) is the probability that the word in position i is w_k, given v_j

• Make one more assumption:

∀ i, m: P(a_i = w_k | v_j) = P(a_m = w_k | v_j) = P(w_k | v_j)

i.e., attributes are (not only independent but) also identically distributed

### Learn_naive_Bayes_text(Examples, Vocabulary)

1. Collect all words and other tokens that occur in Examples

   Vocabulary ← all distinct words and other tokens in Examples

2. Calculate the required P(v_j) and P(w_k | v_j) probability terms

   For each target value v_j in V

       docs_j ← the subset of Examples for which the target value is v_j

       P(v_j) ← |docs_j| / |Examples|

       Text_j ← a single document created by concatenating all members of docs_j

       n ← the total number of words in Text_j

       For each word w_k in Vocabulary

           n_k ← the number of times word w_k occurs in Text_j

           P(w_k | v_j) ← (n_k + 1) / (n + |Vocabulary|)   (here we use the m-estimate)

### Classify_naive_Bayes_text(Doc)

positions ← all word positions in Doc that contain tokens from Vocabulary

Return v_NB = argmax_{v_j∈V} P(v_j) Π_{i∈positions} P(a_i = w_k | v_j)
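The two procedures above can be sketched compactly. This is only an illustration on a toy two-document corpus (the words and labels are made up here), using the m-estimate P(w_k|v_j) = (n_k + 1)/(n + |Vocabulary|) and log-probabilities to avoid underflow on long documents.

```python
# Minimal Learn_naive_Bayes_text / Classify_naive_Bayes_text sketch.
from collections import Counter
from math import log

def learn(examples):
    """examples: list of (list_of_words, label) pairs."""
    vocab = {w for words, _ in examples for w in words}
    labels = {v for _, v in examples}
    prior, cond = {}, {}
    for v in labels:
        docs = [words for words, lab in examples if lab == v]
        prior[v] = len(docs) / len(examples)          # P(v_j)
        text = [w for words in docs for w in words]   # concatenate docs_j
        counts = Counter(text)
        cond[v] = {w: (counts[w] + 1) / (len(text) + len(vocab))
                   for w in vocab}                    # m-estimate
    return vocab, prior, cond

def classify(doc, vocab, prior, cond):
    scores = {v: log(prior[v]) +
                 sum(log(cond[v][w]) for w in doc if w in vocab)
              for v in prior}
    return max(scores, key=scores.get)

examples = [("like this movie".split(), "+"),
            ("hate this movie".split(), "-")]
vocab, prior, cond = learn(examples)
print(classify("like like this".split(), vocab, prior, cond))   # +
```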

Given 1000 training documents from each of the 20 newsgroups, learn to classify each new document according to the newsgroup it came from:

comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, alt.atheism, soc.religion.christian, talk.religion.misc, talk.politics.mideast, talk.politics.misc, talk.politics.guns, sci.space, sci.crypt, sci.electronics, sci.med

Naive Bayes: 89% classification accuracy (using 2/3 of each group for training; rare words and the 100 most frequent words were eliminated).

### Learning Curve for 20 Newsgroups

[Figure: accuracy vs. training set size (100 to 10000 documents) on the 20News data, comparing Bayes, TFIDF, and PRTFIDF.]

## 2.5 The Minimum Description Length Principle

Occam’s razor: prefer the shortest hypothesis.
Bayes analysis: prefer the hypothesis h_MAP.

h_MAP = argmax_{h∈H} P(D|h)P(h) = argmax_{h∈H} (log_2 P(D|h) + log_2 P(h))
      = argmin_{h∈H} (−log_2 P(D|h) − log_2 P(h))

Interesting fact from Information Theory:

The optimal (shortest expected coding length) code for an event with probability p uses −log_2 p bits.

So we can interpret:

−log_2 P(h): the length of h under the optimal code

−log_2 P(D|h): the length of D given h under the optimal code

Therefore we prefer the hypothesis h that minimizes the combined description length of the hypothesis and of the data given the hypothesis.

### Bayes Analysis and the MDL Principle

We saw that a MAP learner prefers the hypothesis h that minimizes L_{C_1}(h) + L_{C_2}(D|h), where L_C(x) is the description length of x under encoding C:

h_MDL = argmin_{h∈H} (L_{C_1}(h) + L_{C_2}(D|h))

Example: H = decision trees, D = training data labels

• L_{C_1}(h) is the number of bits needed to describe tree h

• L_{C_2}(D|h) is the number of bits needed to describe D given h

In the literature, applications of MDL to practical problems often include arguments justifying the choice of the encodings C_1 and C_2.

### For instance:

L_{C_2}(D|h) = 0 if the examples are classified perfectly by h, and both the transmitter and the receiver know h.

Therefore, in this situation we need only describe the exceptions. So:

h_MDL = argmin_{h∈H} (length(h) + length(misclassifications))

In general, MDL trades off hypothesis size for training errors:

it might select a shorter hypothesis that makes a few errors over a longer hypothesis that classifies the data perfectly!

Consequence: In learning (for instance) decision trees, the MDL principle can work as an alternative to pruning.

### The MDL Principle: Back to Occam’s Razor

MDL hypotheses are not necessarily also the best/MAP ones. (For that, we would have to know all the probabilities P(D|h) and P(h).)

## 3 Learning Maximum Likelihood (ML) Hypothesis

### 3.1 Learning Real Valued Functions:

### ML Hypotheses as Least Squared Error Hypotheses

[Figure: a target function f, a maximum likelihood hypothesis h_ML, and noisy observations y = f(x) + e.]

Problem: Consider learning a real-valued target function f : X → R from D, a training set consisting of examples ⟨x_i, d_i⟩, i = 1, . . . , m, with

• x_i assumed fixed (to simplify),

• d_i a noisy training value: d_i = f(x_i) + e_i,

• e_i a random variable (noise) drawn independently for each x_i, according to some Gaussian distribution with mean 0.

### Proposition

Considering H, a certain class of functions h : X → R such that h(x_i) = f(x_i), and assuming that the x_i are mutually independent given h, the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors:

h_ML =(def.) argmax_{h∈H} P(D|h) = argmin_{h∈H} Σ_{i=1}^{m} (d_i − h(x_i))^2

Note: We will use the probability density function:

p(x_0) =(def.) lim_{ε→0} (1/ε) P(x_0 ≤ x < x_0 + ε)

h_ML = argmax_{h∈H} P(D|h) = argmax_{h∈H} Π_{i=1}^{m} p(d_i|h)

= argmax_{h∈H} Π_{i=1}^{m} p(e_i|h)   (since μ_i = f(x_i))

= argmax_{h∈H} Π_{i=1}^{m} p(d_i − f(x_i)|h)

= argmax_{h∈H} Π_{i=1}^{m} p(d_i − h(x_i)|h)   (since h(x_i) = f(x_i))

= argmax_{h∈H} Π_{i=1}^{m} (1/√(2πσ^2)) e^{−(1/2)((d_i − h(x_i))/σ)^2}

= argmax_{h∈H} Σ_{i=1}^{m} [ln(1/√(2πσ^2)) − (1/2)((d_i − h(x_i))/σ)^2]

= argmax_{h∈H} Σ_{i=1}^{m} −(1/2)((d_i − h(x_i))/σ)^2

= argmax_{h∈H} Σ_{i=1}^{m} −(d_i − h(x_i))^2

= argmin_{h∈H} Σ_{i=1}^{m} (d_i − h(x_i))^2
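A small numerical check of this proposition: under Gaussian noise, the hypothesis maximizing the likelihood of the d_i coincides with the one minimizing the sum of squared errors. Here H is taken (as a toy assumption) to be the class of constant functions h_c(x) = c, whose least-squares solution is the sample mean.

```python
# ML = least squares, checked on noisy observations of a constant target.
import random

random.seed(0)
f = lambda x: 3.0                                  # true target function
xs = range(20)
ds = [f(x) + random.gauss(0, 0.5) for x in xs]     # d_i = f(x_i) + e_i

def sse(c):
    """Sum of squared errors of the constant hypothesis h_c."""
    return sum((d - c) ** 2 for d in ds)

# Scan a grid of candidate constants: the SSE minimizer...
candidates = [i / 100 for i in range(200, 401)]
h_ml = min(candidates, key=sse)

# ...agrees (up to the grid step) with the sample mean.
mean = sum(ds) / len(ds)
print(abs(h_ml - mean) < 0.01)   # True
```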

## Generalisations...

1. Similar derivations can be performed starting from other assumed noise distributions (not just Gaussians), producing different results.

2. It was assumed that

a. the noise affects only f(x_i), and

b. no noise was recorded in the attribute values of the given examples x_i.

Otherwise, the analysis becomes significantly more complex.

### 3.2 ML hypotheses for Learning Probability Functions

Let us consider a non-deterministic function (i.e., a one-to-many relation) f : X → {0, 1}.

Given a set of independently drawn examples

D = {⟨x_1, d_1⟩, . . . , ⟨x_m, d_m⟩} where d_i = f(x_i) ∈ {0, 1},

we would like to learn a ML hypothesis for the probability function g(x) =(def.) P(f(x) = 1).

For example, h(x_i) = 0.92 if P(f(x_i) = 1) = 0.92.

Proposition: In this setting, h_ML = argmax_{h∈H} P(D|h) maximizes the sum Σ_{i=1}^{m} [d_i ln h(x_i) + (1 − d_i) ln(1 − h(x_i))].

Proof:

P(D|h) = Π_{i=1}^{m} P(x_i, d_i|h) = Π_{i=1}^{m} P(d_i|x_i, h) · P(x_i|h)

It can be assumed that x_i is independent of h, therefore:

P(D|h) = Π_{i=1}^{m} P(d_i|x_i, h) · P(x_i)

Proof (continued):

What we want to learn is h(x_i) = P(d_i = 1|x_i, h). In a more general form:

P(d_i|x_i, h) = h(x_i) if d_i = 1, and 1 − h(x_i) if d_i = 0

In a more convenient mathematical form: P(d_i|x_i, h) = h(x_i)^{d_i} (1 − h(x_i))^{1−d_i}.

⇒ h_ML = argmax_{h∈H} Π_{i=1}^{m} [h(x_i)^{d_i} (1 − h(x_i))^{1−d_i} P(x_i)]

= argmax_{h∈H} Π_{i=1}^{m} h(x_i)^{d_i} (1 − h(x_i))^{1−d_i} · Π_{i=1}^{m} P(x_i)

= argmax_{h∈H} Π_{i=1}^{m} h(x_i)^{d_i} (1 − h(x_i))^{1−d_i}

= argmax_{h∈H} Σ_{i=1}^{m} [d_i ln h(x_i) + (1 − d_i) ln(1 − h(x_i))]

Note: The quantity −Σ_{i=1}^{m} [d_i ln h(x_i) + (1 − d_i) ln(1 − h(x_i))] is called cross-entropy; the above h_ML minimizes this quantity.
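The equivalence can be illustrated numerically: a hypothesis whose probabilities are well calibrated to the labels has higher log-likelihood, i.e., lower cross-entropy, than an uninformative one. The probability values below are made up for the illustration.

```python
# Conditional log-likelihood = negative cross-entropy, for binary targets.
from math import log

def log_likelihood(h_vals, d_vals):
    """sum_i [d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i))]."""
    return sum(d * log(h) + (1 - d) * log(1 - h)
               for h, d in zip(h_vals, d_vals))

d = [1, 0, 1, 1]
good = [0.9, 0.1, 0.8, 0.7]    # probabilities close to the labels
bad  = [0.5, 0.5, 0.5, 0.5]    # uninformative probabilities

# The better-calibrated h has higher likelihood, i.e. lower cross-entropy.
print(log_likelihood(good, d) > log_likelihood(bad, d))   # True
```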

### 3.3 The Expectation Maximization (EM) Algorithm

[Dempster et al, 1977]

### Find (local) Maximum Likelihood hypotheses when data is only partially observable:

• Unsupervised learning (i.e., clustering):

the target value is unobservable

• Supervised learning:

some instance attributes are unobservable

### Some applications:

• Non-hierarchical clustering:

Estimate the means of k Gaussians

• Learn Hidden Markov Models

• Learn Probabilistic Context Free Grammars

• Train Radial Basis Function Networks

• Train Bayesian Belief Networks

## The General EM Problem

### Given

• observed data X = {x_1, . . . , x_m}, independently generated using a parameterized distribution / hypothesis h

• unobserved data Z = {z_1, . . . , z_m}

### Determine

ĥ that (locally) maximizes P(X|h).

## The Essence of the EM Approach

Start with h^{(0)}, an arbitrarily/conveniently chosen value of h.

Then repeatedly:

1. Use the observed data X and the current hypothesis h^{(t)} to estimate [the probabilities associated with the values of] the unobserved variables Z, and further compute their expectations E[Z].

2. The expected values of the unobserved variables Z are used to calculate an improved hypothesis h^{(t+1)}, based on maximizing the mean of a log-likelihood function E[ln P(Y|h) | X, h^{(t)}], where Y = {y_1, . . . , y_m} is the complete (observed and unobserved) data, i.e., y_i = (x_i, z_i) for i = 1, . . . , m.

## The General EM Algorithm

Repeat the following two steps until convergence is reached:

### Estimation (E) step:

Calculate the log-likelihood function

Q(h | h^{(t)}) =(not.) E[ln P(Y|h) | X, h^{(t)}], where Y = X ∪ Z.

### Maximization (M) step:

Replace hypothesis h^{(t)} by the hypothesis h^{(t+1)} that maximizes this Q function:

h^{(t+1)} ← argmax_h Q(h | h^{(t)})

## The EM algorithmic Schema

◦ Idea: replace missing values by estimated values

• initialize parameters with arbitrary values

• estimate *missing values* based on *current parameter values*

• re-estimate *parameters* using the *complete* data

• repeat the previous two steps until *convergence*

[Figure: the EM loop — starting from h^{(0)}, the E step computes E[Z | X, h^{(t)}] via P(Z|X; h^{(t)}), the M step sets h^{(t+1)} = argmax_h E[ln P(X, Z|h)], increasing ln P(X|h) at each iteration t.]

### Methodology: How to apply an instance of the EM schema

[Figure: the same EM loop, with the E step realized by applying updating rules (1) for E[Z | X, h^{(t)}], and the M step by applying updating rules (2) for h^{(t+1)}, starting from arbitrarily initialized parameters.]

### Methodology: How to derive an instance of the EM schema

[Figure: to derive the updating rules, one (1) computes the log-likelihood of the "complete" data (observable data plus "unobservable"/latent variables), (2) applies the linearity of expectation to obtain the Q(h | h^{(t)}) function, (3) applies Bayes’ formula together with the total probability formula (and sometimes the exponentiation trick), and (4) solves the resulting optimization problem, usually using partial derivatives and sometimes Lagrange’s method.]

## Baum-Welch Theorem

When Q is continuous, it can be shown that EM converges to a stationary point (local maximum) of the likelihood function P(X|h).

### Two mixtures of Bernoulli distributions

[Figure: in each mixture, Z ∼ Bernoulli(π) selects which of two Bernoulli components generates X.]

X ∼ π·Bernoulli(θ_1) + (1−π)·Bernoulli(θ_2)

X ∼ π·Bernoulli(p) + (1−π)·Bernoulli(q)

### EM algorithm for solving a mixture of *vectors of independent and identically distributed* (i.i.d.) Bernoulli variables

[Figure: Z ∼ Bernoulli(π) selects component A or B; given the component, the ten coordinates X_1, . . . , X_10 are i.i.d. Bernoulli(θ_A), respectively Bernoulli(θ_B).]

X = (x^1, . . . , x^10) ∼ π·Bernoulli(x^1; θ_A)·. . .·Bernoulli(x^10; θ_A) + (1−π)·Bernoulli(x^1; θ_B)·. . .·Bernoulli(x^10; θ_B)

Cf. *“What is the expectation maximization algorithm?”*, Chuong B. Do, Serafim Batzoglou, Nature Biotechnology, vol. 26, no. 8, 2008, pp. 897–899

Initialization:

assign arbitrary values θ_A^{(0)}, θ_B^{(0)} in the interval (0, 1) to the parameters θ_A and θ_B, respectively;

Iterative body:

for t = 0, . . . , T−1 (with T fixed in advance)
(or: until the log-likelihood of the observable data no longer increases significantly),
(or: until |θ_A^{(t)} − θ_A^{(t+1)}| < ε and |θ_B^{(t)} − θ_B^{(t+1)}| < ε, with ε fixed in advance)
do

E step:

for i = 1, . . . , 5 compute

p_{i,A}^{(t+1)} = (θ_A^{(t)})^{x_i} (1−θ_A^{(t)})^{10−x_i} / [(θ_A^{(t)})^{x_i} (1−θ_A^{(t)})^{10−x_i} + (θ_B^{(t)})^{x_i} (1−θ_B^{(t)})^{10−x_i}]

p_{i,B}^{(t+1)} = (θ_B^{(t)})^{x_i} (1−θ_B^{(t)})^{10−x_i} / [(θ_A^{(t)})^{x_i} (1−θ_A^{(t)})^{10−x_i} + (θ_B^{(t)})^{x_i} (1−θ_B^{(t)})^{10−x_i}];

M step:

θ_A^{(t+1)} = Σ_{i=1}^{5} (x_i/10) p_{i,A}^{(t+1)} / Σ_{i=1}^{5} p_{i,A}^{(t+1)} and θ_B^{(t+1)} = Σ_{i=1}^{5} (x_i/10) p_{i,B}^{(t+1)} / Σ_{i=1}^{5} p_{i,B}^{(t+1)};

Return θ_A^{(t+1)}, θ_B^{(t+1)};
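The E and M steps above translate directly into code. The five head counts x_i below are those of the Do & Batzoglou two-coin tutorial; note that the binomial coefficient cancels in the responsibility ratio, so it is omitted.

```python
# EM for the two-coin mixture (theta_A, theta_B) of Do & Batzoglou.
xs = [5, 9, 8, 4, 7]           # heads out of 10 flips, per experiment
theta_A, theta_B = 0.6, 0.5    # arbitrary initialization in (0, 1)

for t in range(10):
    # E step: responsibilities p_{i,A}, p_{i,B} for each experiment
    pA, pB = [], []
    for x in xs:
        lA = theta_A ** x * (1 - theta_A) ** (10 - x)
        lB = theta_B ** x * (1 - theta_B) ** (10 - x)
        pA.append(lA / (lA + lB))
        pB.append(lB / (lA + lB))
    # M step: re-estimate parameters from expected head fractions
    theta_A = sum(p * x / 10 for p, x in zip(pA, xs)) / sum(pA)
    theta_B = sum(p * x / 10 for p, x in zip(pB, xs)) / sum(pB)

# converges toward theta_A ≈ 0.80, theta_B ≈ 0.52 (Do & Batzoglou)
print(round(theta_A, 2), round(theta_B, 2))
```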

### A mixture of *vectors of independent* Bernoulli distributions, applied to clustering of hand-written digits (MNIST)

[Figure: Z ∼ Categorical(π_1, . . . , π_K) selects a component k; given k, the D coordinates X_{k,1}, . . . , X_{k,D} are independent Bernoulli(p_{k,i}) variables. Cf. CMU, 2015 fall, A. Smola, B. Poczos, HW2, pr. 1.]

X = (x^1, . . . , x^D) ∼ π_1·Bernoulli(x^1; p_{1,1})·. . .·Bernoulli(x^D; p_{1,D}) + . . . + π_K·Bernoulli(x^1; p_{K,1})·. . .·Bernoulli(x^D; p_{K,D})

## 4 Bayesian Belief Networks

### (also called Bayes Nets)

Interesting because:

• The Naive Bayes assumption of conditional independence of attributes is too restrictive. (But the problem is intractable without some such assumptions...)

• Bayesian belief networks describe conditional independence among *subsets* of variables.

• They allow combining prior knowledge about (in)dependencies among variables with the observed training data.

## Conditional Independence

Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given a value of Z:

(∀ x_i, y_j, z_k) P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k)

More compactly, we write P(X|Y, Z) = P(X|Z).

*Note:* Naive Bayes uses conditional independence to justify

P(A_1, A_2|V) = P(A_1|A_2, V)P(A_2|V) = P(A_1|V)P(A_2|V)

Generalizing the above definition:

P(X_1 . . . X_l | Y_1 . . . Y_m, Z_1 . . . Z_n) = P(X_1 . . . X_l | Z_1 . . . Z_n)

## A Bayes Net

[Figure: a Bayes net over *Storm*, *BusTourGroup*, *Lightning*, *Campfire*, *Thunder*, *ForestFire*, with the conditional probability table for *Campfire*:]

| *Campfire* | S,B | S,¬B | ¬S,B | ¬S,¬B |
|------------|-----|------|------|-------|
| C          | 0.4 | 0.1  | 0.8  | 0.2   |
| ¬C         | 0.6 | 0.9  | 0.2  | 0.8   |

The network is defined by:

• A directed acyclic graph, representing a set of conditional independence assertions:

Each node — representing a random variable — is asserted to be conditionally independent of its nondescendants, given its immediate predecessors.

Example: P(Thunder|ForestFire, Lightning) = P(Thunder|Lightning)

• A table of local conditional probabilities for each node/variable.

## A Bayes Net (Cont’d)

A Bayes net represents the joint probability distribution over all variables Y_1, Y_2, . . . , Y_n.

This joint distribution is fully defined by the graph, plus the conditional probabilities:

P(y_1, . . . , y_n) = P(Y_1 = y_1, . . . , Y_n = y_n) = Π_{i=1}^{n} P(y_i | Parents(Y_i))

where Parents(Y_i) denotes the immediate predecessors of Y_i in the graph.

In our example: P(Storm, BusTourGroup, . . . , ForestFire)
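The factorization can be sketched on a toy two-node chain S → C (the slide gives the full table only for Campfire, so the numbers here are a made-up illustration of the product rule):

```python
# Joint probability of a Bayes net: P(y_1,...,y_n) = prod_i P(y_i | Parents(Y_i)).
p_S = {True: 0.3, False: 0.7}            # P(Storm) — illustrative values
p_C_given_S = {True: 0.4, False: 0.8}    # P(Campfire | Storm = s)

def joint(s, c):
    """P(S = s, C = c) = P(s) * P(c | s)."""
    p_c = p_C_given_S[s] if c else 1 - p_C_given_S[s]
    return p_S[s] * p_c

# All four assignments sum to 1, as a joint distribution must.
total = sum(joint(s, c) for s in (True, False) for c in (True, False))
print(round(total, 10))   # 1.0
```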

## Inference in Bayesian Nets

Question: Given a Bayes net, can one infer the probabilities of values of one or more network variables, given the observed values of (some) others?

### Example:

Given the Bayes net below, compute: (a) P(S), (b) P(A, S), (c) P(A).

[Figure: a net with root nodes L and F, where P(L) = 0.4, P(F) = 0.6; a node S with parents L and F, where P(S|L,F) = 0.8, P(S|L,¬F) = 0.6, P(S|¬L,F) = 0.5, P(S|¬L,¬F) = 0.3; and children A and G of S, with P(A|S) = 0.7, P(A|¬S) = 0.3, P(G|S) = 0.8, P(G|¬S) = 0.2.]
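Working the example by marginalizing over the parents with the total probability formula, using exactly the CPT values of the figure:

```python
# (a) P(S), (b) P(A, S), (c) P(A) for the example net above.
P_L, P_F = 0.4, 0.6
P_S_given = {(1, 1): 0.8, (1, 0): 0.6, (0, 1): 0.5, (0, 0): 0.3}

# (a) P(S) = sum_{l,f} P(S | l, f) P(l) P(f)   (L and F are root nodes)
P_S = sum(P_S_given[(l, f)]
          * (P_L if l else 1 - P_L) * (P_F if f else 1 - P_F)
          for l in (0, 1) for f in (0, 1))

# (b) P(A, S) = P(A | S) P(S); (c) P(A) by the total probability formula
P_A_given_S, P_A_given_notS = 0.7, 0.3
P_AS = P_A_given_S * P_S
P_A = P_A_given_S * P_S + P_A_given_notS * (1 - P_S)

print(round(P_S, 3), round(P_AS, 3), round(P_A, 3))   # 0.54 0.378 0.516
```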

## Inference in Bayesian Nets (Cont’d)

### Answer(s):

• If only one variable has an unknown (probability) value, then it is easy to infer it.

• In the general case, we can compute the probability distribution for any subset of network variables, given the distribution for any subset of the remaining variables.

But...

• The exact inference of probabilities for an arbitrary Bayes net is an NP-hard problem!

## Inference in Bayesian Nets (Cont’d)

In practice, we can succeed in many cases:

• Exact inference methods work well for some net structures.

• Monte Carlo methods “simulate” the network randomly to calculate approximate solutions [Pradham & Dagum, 1996].

(In theory, even approximate inference of probabilities in Bayes nets can be NP-hard! [Dagum & Luby, 1993])

## Learning Bayes Nets (I)

There are several variants of this learning task:

• The network structure might be either *known* or *unknown* (i.e., it has to be inferred from the training data).

• The training examples might provide values of *all* network variables, or just of *some* of them.

### The simplest case:

If the structure is known and we can observe the values of all variables, then it is easy to estimate the conditional probability table entries (analogous to training a Naive Bayes classifier).

## Learning Bayes Nets (II)

When

• the structure of the Bayes net is known, and

• the variables are only partially observable in the training data,

learning the entries in the conditional probability tables is similar to (learning the weights of hidden units in) training a neural network with hidden units:

− We can learn the net’s conditional probability tables using gradient ascent.

− We converge to the network h that (locally) maximizes P(D|h).

## Gradient Ascent for Bayes Nets

Let w_ijk denote one entry in the conditional probability table for the variable Y_i in the network:

w_ijk = P(Y_i = y_ij | Parents(Y_i) = the list u_ik of values)

It can be shown (see the next two slides) that

∂ ln P_h(D) / ∂w_ijk = Σ_{d∈D} P_h(y_ij, u_ik | d) / w_ijk

therefore we perform gradient ascent by repeatedly

1. updating all w_ijk using the training data D:

w_ijk ← w_ijk + η Σ_{d∈D} P_h(y_ij, u_ik | d) / w_ijk

2. renormalizing the w_ijk to ensure

Σ_j w_ijk = 1 and 0 ≤ w_ijk ≤ 1

## Gradient Ascent for Bayes Nets: Calculus

∂ ln P_h(D) / ∂w_ijk = ∂/∂w_ijk ln Π_{d∈D} P_h(d) = Σ_{d∈D} ∂ ln P_h(d) / ∂w_ijk = Σ_{d∈D} (1/P_h(d)) ∂P_h(d)/∂w_ijk

Summing over all values y_ij′ of Y_i and u_ik′ of U_i = Parents(Y_i):

∂ ln P_h(D) / ∂w_ijk = Σ_{d∈D} (1/P_h(d)) ∂/∂w_ijk Σ_{j′k′} P_h(d | y_ij′, u_ik′) P_h(y_ij′, u_ik′)

= Σ_{d∈D} (1/P_h(d)) ∂/∂w_ijk Σ_{j′k′} P_h(d | y_ij′, u_ik′) P_h(y_ij′ | u_ik′) P_h(u_ik′)

Note that w_ijk ≡ P_h(y_ij | u_ik), therefore...

## Gradient Ascent for Bayes Nets: Calculus (Cont’d)

∂ ln P_h(D) / ∂w_ijk = Σ_{d∈D} (1/P_h(d)) ∂/∂w_ijk [P_h(d | y_ij, u_ik) w_ijk P_h(u_ik)]

= Σ_{d∈D} (1/P_h(d)) P_h(d | y_ij, u_ik) P_h(u_ik)

= Σ_{d∈D} (1/P_h(d)) · P_h(y_ij, u_ik | d) P_h(d) P_h(u_ik) / P_h(y_ij, u_ik)   (applying Bayes’ th.)

= Σ_{d∈D} P_h(y_ij, u_ik | d) P_h(u_ik) / P_h(y_ij, u_ik) = Σ_{d∈D} P_h(y_ij, u_ik | d) / P_h(y_ij | u_ik)

= Σ_{d∈D} P_h(y_ij, u_ik | d) / w_ijk