Bayesian Learning
Based on “Machine Learning”, T. Mitchell, McGRAW Hill, 1997, ch. 6
Acknowledgement:
The present slides are an adaptation of slides drawn by T. Mitchell
Two Roles for the Bayesian Methods in Learning
1.
Provides practical learning algorithms
by combining prior knowledge/probabilities with observed data:
• Naive Bayes learning algorithm
• Expectation Maximization (EM) learning algorithm (scheme):
learning in the presence of unobserved variables
• Bayesian Belief Network learning
2.
Provides a useful conceptual framework
• Serves for evaluating other learning algorithms, e.g.
concept learning through general-to-specific hypotheses ordering (FindS, and CandidateElimination),
neural networks, liniar regression
• Provides additional insight into Occam’s razor
1. Basic Notions Bayes’ Theorem
Defining classes of hypotheses:
Maximum A posteriori Probability (MAP) hypotheses Maximum Likelihood (ML) hypotheses
2. Learning MAP hypotheses
2.1 The brute force MAP hypotheses learning algorithm 2.2 The Bayes optimal classifier;
2.3 The Gibbs classifier;
2.4 The Naive Bayes and the Joint Bayes classifiers.
Example: Learning over text data using Naive Bayes 2.5 The Minimum Description Length (MDL) Principle;
MDL hypotheses
3. Learning ML hypotheses
3.1 ML hypotheses in learning real-valued functions 3.2 ML hypotheses in learning to predict probabilities 3.3 The Expectation Maximization (EM) algorithm 4. Bayesian Belief Networks
1 Basic Notions
• Product Rule:
probability of a conjunction of two events A and B:
P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A)
• Bayes’ Theorem:
P(A|B) = P(B|A)P(A) P(B)
• Theorem of total probability:
if events A1, . . . , An are mutually exclusive, with Pn
i=1 P(Ai) = 1, then
P(B) =
n
X
i=1
P(B|Ai)P(Ai)
in particular
P(B) = P(B|A)P(A) + P(B|¬A)P(¬A)
Using Bayes’ Theorem for Hypothesis Learning
P (h | D) = P (D | h)P (h) P (D)
• P (D) = the (prior) probability of training data D
• P (h) = the (prior) probability of the hypothesis h
• P (D | h) = the (a posteriori) probability of D given h
• P (h | D) = the (a posteriori) probability of h given D
Classes of Hypotheses
Maximum Likelihood (ML) hypothesis:
the hypothesis that best explains the training data h
M L= argmax
hi∈H
P (D | h
i)
Maximum A posteriori Probability (MAP) hypothesis:
the most probable hypothesis given the training data
hM AP = argmax
h∈H
P(h|D) = argmax
h∈H
P(D|h)P(h)
P(D) = argmax
h∈H
P(D|h)P(h)
Note: If P (h
i) = P (h
j), ∀ i, j , then h
M AP= h
M LExemplifying MAP Hypotheses
Suppose the following data characterize the lab result for cancer-suspect people.
P(cancer) = 0.008 P(¬cancer) = 0.992 h1 = cancer, h2 = ¬cancer
P(+|cancer) = 0.98 P(−|cancer) = 0.02 D = {+,−}, P(D | h1), P(D | h2) P(+|¬cancer) = 0.03 P(−|¬cancer) = 0.97
Question: Should we diagnoze a patient x whose lab result is positive as having cancer?
Answer: No.
Indeed, we have to find argmax{P(cancer|+), P(¬cancer|+)}. Applying Bayes theorem (for D = {+}):
P(+ | cancer)P(cancer) = 0.98 × 0.008 = 0.0078 P(+ | ¬cancer)P(¬cancer) = 0.03 × 0.992 = 0.0298
⇒ hM AP = ¬cancer (We can infer
P (cancer | +) =
0.0078+0.02980.0078= 21%
)2 Learning MAP Hypothesis
2.1 The Brute Force MAP Hypothesis Learning Algorithm
Training:
Choose the hypothesis with the highest posterior proba- bility
h
M AP= argmax
h∈H
P (h | D) = argmax
h∈H
P (D | h)P (h) Testing:
Given x, compute h
M AP(x) Drawback:
Requires to compute all probabilities P (D | h) and P (h).
The Most Probable Classification of New Instances
So far we’ve sought h
M AP, the most probable hypothesis given the data D.
Question: Given new instance x — the classification of which can take any value v
jin some set V —, what is its most probable classification?
Answer: P (v
j| D) = P
hi∈H
P (v
j| h
i)P (h
i| D)
Therefore, the Bayes optimal classification of x is:
argmax
vj∈V
X
hi∈H
P (v
j| h
i)P (h
i| D)
Remark: h
M AP(x) is not the most probable classification of x!
(See the next example.)
The Bayes Optimal Classifier: An Example
Let us consider three possible hypotheses:
P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3 Obviously, hM AP = h1.
Let’s consider an instance x such that h1(x) = +, h2(x) = −, h3(x) = −
Question: What is the most probable classification of x?
Answer:
P(−|h1) = 0, P(+|h1) = 1 P(−|h2) = 1, P(+|h2) = 0 P(−|h3) = 1, P(+|h3) = 0 X
hi∈H
P(+|hi)P(hi|D) = 0.4 and X
hi∈H
P(−|hi)P(hi|D) = 0.6 therefore
argmax
vj∈V
X
hi∈H
P(vj|hi)P(hi|D) = −
2.3 The Gibbs Classifier
[Opper and Haussler, 1991]
Note: The Bayes optimal classifier provides the best result, but it can be expensive if there are many hypotheses.
Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D) 2. Use this to classify new instance
Surprising fact [Haussler et al. 1994]:
If the target concept is selected randomly according to the P(h|D) distribution, then the expected error of Gibbs Classifier is no worse than twice the expected error of the Bayes optimal classifier!
E[errorGibbs] ≤ 2E[errorBayesOptimal]
2.4 The Naive Bayes Classifier
When to use it:
• The target function f takes value from a finite set V = {v1, . . . , vk}
• Moderate or large training data set is available
• The attributes < a1, . . . , an > that describe instances are conditionally independent w.r.t. to the given classification:
P(a1, a2 . . . an|vj) = Y
i
P(ai|vj)
The most probable value of f(x) is:
vM AP = argmax
vj∈V
P(vj|a1, a2 . . . an) = argmax
vj∈V
P(a1, a2 . . . an|vj)P(vj) P(a1, a2 . . . an)
= argmax
vj∈V
P(a1, a2 . . . an|vj)P(vj) = argmax
vj∈V
Y
i
P(ai|vj)P(vj) not.= vN B
This is the so-called decision rule of the Naive Bayes classifier.
The Joint Bayes Classifier
vM AP = argmax
vj∈V
P(vj|a1, a2 . . . an) = . . .
= argmax
vj∈V
P(a1, a2 . . . an|vj)P(vj) = argmax
vj∈V
P(a1, a2 . . . an, vj) not.= vJ B
The Naive Bayes Classifier: Remarks
1. Along with decision trees, neural networks, k-nearest neigh- bours, the Naive Bayes Classifier is one of the most prac- tical learning methods.
2. Compared to the previously presented learning algorithms, the Naive Bayes Classifier does no search through the hy- pothesis space;
the output hypothesis is simply formed by estimating the
parameters P (v
j), P (a
i| v
j).
The Naive Bayes Classification Algorithm
Naive Bayes Learn (examples)
for each value v
jof the output attribute P ˆ (v
j) ← estimate P (v
j)
for each value a
iof each input attribute a P ˆ (a
i| v
j) ← estimate P (a
i| v
j)
Classify New Instance (x) v
N B= argmax
vj∈VP ˆ (v
j) Q
ai∈x
P ˆ (a
i| v
j)
The Naive Bayes: An Example
Consider again the PlayTennis example, and new instance
hOutlook = sun, T emp = cool, Humidity = high, W ind = strongi We compute:
vN B = argmaxvj∈V P(vj)Q
i P(ai|vj) P(yes) = 149 = 0.64 P(no) = 145 = 0.36 . . .
P(strong|yes) = 39 = 0.33 P(strong|no) = 35 = 0.60
P(yes) P(sun|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.0053 P(no) P(sun|no) P(cool|no) P(high|no) P(strong|no) = 0.0206
→ vN B = no
A Note on The Conditional Independence Assumption of Attributes
P (a
1, a
2. . . a
n| v
j) = Y
i
P (a
i| v
j)
It is often violated in practice ...but it works surprisingly well anyway.
Note that we don’t need estimated posteriors P ˆ (v
j| x) to be correct; we only need that
argmax
vj∈V
P ˆ (v
j) Y
i
P ˆ (a
i| v
j) = argmax
vj∈V
P (v
j)P (a
1. . . , a
n| v
j)
[Domingos & Pazzani, 1996] analyses this phenomenon.
Naive Bayes Classification:
The problem of unseen data
What if none of the training instances with target value vj have the at- tribute value ai?
It follows that Pˆ(ai|vj) = 0, and Pˆ(vj)Q
i Pˆ(ai|vj) = 0
The typical solution is to (re)define P(ai|vj), for each value vj of ai:
P ˆ (a
i| v
j) ←
nn+mc+mp, where• n is number of training examples for which v = vj,
• nc number of examples for which v = vj and a = ai
• p is a prior estimate for Pˆ(ai|vj)
(for instance, if the attribute a has k values, then p = k1)
• m is a weight given to that prior estimate (i.e. number of “virtual” examples)
Using the Naive Bayes Learner:
Learning to Classify Text
• Learn which news articles are of interest
Target concept Interesting? : Document → { +, −}
• Learn to classify web pages by topic
Target concept Category : Document → { c
1, . . . , c
n}
Naive Bayes is among most effective algorithms
Learning to Classify Text: Main Design Issues
1. Represent each document by a vector of words
• one attribute per word position in document 2. Learning:
• use training examples to estimate P(+), P(−), P(doc|+), P(doc|−)
• Naive Bayes conditional independence assumption:
P(doc|vj) =
length(doc)
Y
i=1
P(ai = wk|vj)
where P(ai = wk|vj) is probability that word in position i is wk, given vj
• Make one more assumption:
∀i, m P(ai = wk|vj) = P(am = wk|vj) = P(wk|vj)
i.e. attributes are (not only indep. but) also identically distributed
Learn naive Bayes text ( Examples, Vocabulary )
1. Collect all words and other tokens that occur in Examples V ocabulary ← all distinct words and other tokens in Examples 2. Calculate the required P(vj) and P(wk|vj) probability terms
For each target value vj in V
docsj ← the subset of Examples for which the target value is vj
P (v
j) ←
|Examples|docsj| |T extj ← a single doc. created by concat. all members of docsj
n ← the total number of words in T extj
For each word wk in V ocabulary
nk ← the number of times word wk occurs in T extj
P (w
k| v
j) ←
n+|V ocabularynk+1 | (here we use the m-estimate)Classify naive Bayes text (Doc)
positions ← all word positions in Doc that contain tokens from V ocabulary Return
v
N B= argmax
vj∈VP (v
j) Q
i∈positions
P (a
i= w
k| v
j)
Given 1000 training documents from each of the 20 newsgroups, learn to classify new documents according to which newsgroup it came from
comp.graphics misc.forsale comp.os.ms-windows.misc rec.autos comp.sys.ibm.pc.hardware rec.motorcycles
comp.sys.mac.hardware rec.sport.baseball comp.windows.x rec.sport.hockey
alt.atheism sci.space soc.religion.christian sci.crypt
talk.religion.misc sci.electronics talk.politics.mideast sci.med
talk.politics.misc talk.politics.guns
Naive Bayes: 89% classification accuracy (having used 2/3 of each group for training; eliminated rare words, and the 100 most freq. words)
Learning Curve for 20 Newsgroups
0 10 20 30 40 50 60 70 80 90 100
100 1000 10000
20News
Bayes TFIDF PRTFIDF
Accuracy vs. Training set size
2.5 The Minimum Description Length Principle
Occam’s razor: prefer the shortest hypothesis Bayes analysis: prefer the hypothesis hM AP
hM AP = argmax
h∈H
P(D|h)P(h) = argmax
h∈H
(log2 P(D|h) + log2 P(h))
= argmin
h∈H
(−log2 P(D|h) − log2 P(h)) Interesting fact from the Information Theory:
The optimal (shortest expected coding length) code for an event with probability p is the one using −log2 p bits.
So we can interpret:
−log2 P(h): the length of h under the optimal code
−log2 P(D|h): the length of D given h under the optimal code Therefore we prefer the hypothesis h that minimizes...
Bayes Analysis and the MDL Principle
We saw that a MAP learner prefers the hypothesis h that minimizes LC1(h) + LC2(D|h), where LC(x) is the description length of x under encoding C
hM DL = argmin
h∈H
(LC1(h) + LC2(D|h)) Example: H = decision trees, D = training data labels
• LC1(h) is the number of bits to describe tree h
• LC2(D|h) is the number of bits to describe D given h
In literature, the application of MDL to practical problems often include arguments justifying the choice of the encodings C1 and C2.
For instance:
LC2(D|h) = 0 if examples are classified perfectly by h, and both the transmitter and the receiver know h.
Therefore, in this situation we need only to describe exceptions. So:
hM DL = argmin
h∈H
(length(h) + length(misclassif ications))
In general, MDL trades off hypothesis size for training errors:
it might select a shorter hypothesis that makes few errors over a longer hypothesis that perfectly classifies the data!
Consequence: In learning (for instance) decision trees, (using) the MDL principle can work as an alternative to pruning.
The MDL Principle: Back to Occam’s Rasor
MDL hypotheses are not necessarily also the best/MAP ones.
(For that, we should know all the probabilities P (D | h)
and P (h).)
3 Learning Maximum Likelihood (ML) Hypothesis
3.1 Learning Real Valued Functions:
ML Hypotheses as Least Suquered Error Hypotheses
hML f
e y
x
Problem: Consider learning a real-valued target function f : X → R from D, a training set consisting of examples hxi, dii, i = 1, . . . , m with
xi, assumed fixed (to simplify)
di noisy training value di = f(xi) + ei
ei is random variable (noise) drawn inde- pendently for each xi, according to some Gaussian distribution with mean=0.
Proposition
Considering H, a certain class of functions h : X → R such that h(xi) = f(xi) and assuming that xi are mutually independent given h,
the maximum likelihood hypothesis hM L is the one that minimizes the sum of squared errors:
hM L
def.= argmax
h∈H
P(D|h) = arg min
h∈H m
X
i=1
(di − h(xi))2
Note: We will use the probability density function:
p(x0) def.= limǫ→0
1
ǫP(x0 ≤ x < x0 + ǫ) hM L = argmax
h∈H
P(D|h) = argmax
h∈H
m
Y
i=1
p(di|h) µi=f=(xi) argmax
h∈H
m
Y
i=1
p(ei|h)
= argmax
h∈H
m
Y
i=1
p(di − f(xi)|h) h(xi)=f= (xi) argmax
h∈H
m
Y
i=1
p(di − h(xi)|h)
= argmax
h∈H
m
Y
i=1
√ 1
2πσ2e−12(di−h(σ xi))2 = argmax
h∈H
(
m
X
i=1
ln 1
√2πσ2 − 1 2
di − h(xi) σ
2
)
= argmax
h∈H
m
X
i=1
−1 2
di − h(xi) σ
2
= argmax
h∈H
m
X
i=1
−(di − h(xi))2
= argmin
h∈H
m
X
i=1
(di − h(xi))2
Generalisations...
1. Similar derivations can be performed starting with other assumed noise distributions (than Gaussians), producing different results.
2. It was assumed that
a. the noise affects only f(xi), and
b. no noise was recorded in the attribute values for the given ex- amples xi.
Otherwise, the analysis becomes significantly more complex.
3.2 ML hypotheses for Learning Probability Functions
Let us consider a non-deterministic function (i.e. one-to-many relation) f : X → {0,1}.
Given a set of independently drawn examples
D = {< x1, d1 >, . . . , < xm, dm >} where di = f(xi) ∈ {0,1},
we would like to learn a ML hypothesis for the probability function g(x) def.= P(f(x) = 1).
For example, h(xi) = 0.92 if P({< xi, di > |di = 1}) = 0.92.
Proposition: In this setting, hM L = argmaxh∈H P(D | h) maximizes the sum Pm
i=1[di lnh(xi) + (1 − di)ln(1 − h(xi))].
Proof:
P(D | h) = Πmi=1P(xi, di | h) = Πmi=1P(di | xi, h) · P(xi | h) It can be assumed that xi is independent of h, therefore:
P(D | h) = Πmi=1P(di | xi, h) · P(xi)
Proof (continued):
What we wanted to compute is h(xi) = P(di = 1 | xi, h).
In a more general form:
P(di | xi, h) =
h(xi) if di = 1 1 − h(xi) if di = 0
In a more convenient mathematical form: P(di | xi, h) = h(xi)di(1−h(xi))1−di.
⇒ hM L = argmaxh∈H Πmi=1[h(xi)di(1 − h(xi))1−diP(xi)]
= argmaxh∈H Πmi=1h(xi)di(1 − h(xi))1−di · Πmi=1P(xi)
= argmaxh∈H Πmi=1h(xi)di(1 − h(xi))1−di
= argmaxh∈H
m
X
i=1
[di lnh(xi) + (1 − di)ln(1 − h(xi))]
Note: The quantity −Pm
i=1[di lnh(xi) + (1 − di)ln(1 − h(xi))] is called cross- entropy; the above hM L minimizes this quantity.
3.3 The Expectation Maximization (EM) Algorithm
[Dempster et al, 1977]
Find (local) Maximum Likelihood hypotheses when data is only partially observable:
• Unsupervised learning (i.e., clustering):
the target value is unobservable
• Supervised learning:
some instance attributes are unobservable
Some applications:
• Non-hierarchical clustering:
Estimate the means of k Gausseans
• Learn Hidden Markov Models
• Learn Probabilistic Context Free Grammars
• Train Radial Basis Function Networks
• Train Bayesian Belief Networks
The General EM Problem
Given
• observed data X = { x
1, . . . , x
m}
independently generated using the parameterized distributions/hypotheses h
1, . . . , h
m• unobserved data Z = { z
1, . . . , z
m} determine
ˆ h that (locally) maximizes P (X | h).
The Essence of the EM Approach
Start with h
(0), an arbitrarily/conveniently chosen value of h.
Repeatedly
1. Use the observed data X and the current hypothesis h
(t)to estimate [the probabilities associated to the values of ] the unobserved variables Z , and further on compute their expectations, E [Z ].
2. The expected values of the unobserved variables Z are used to calculate an improved hypothesis h
(t+1), based on maximizing the mean of a log-likelihood function: E [ln P (Y | h) | X, h
(t)], where Y = { y
1, . . . , y
m} is the complete (observed and unobserved) data, i.e.
y
i= (x
i, z
i), for i = 1, . . . , m.
The General EM Algorithm
Repeat the following two steps until convergence is reached:
Estimation (E) step:
Calculate the log likelihood function
Q(h | h
(t))
not.= E [ln P (Y | h) | X, h
(t)] where Y = X ∪ Z .
Maximization (M) step:
Replace hypothesis h
(t)by the hypothesis h
(t+1)that maxi- mizes this Q function.
h
(t+1)← argmax
h
Q(h | h
(t))
The EM algorithmic Schema
◦ idea: replace missing values by estimated values
• initialize parameters with arbitrary values
• estimate missing val- ues based on current parameter values
• re-estimate param- eters using the complete data
• repeat the previous two steps until conver- gence
X, h(t) E[Z | ] h(t)
h
= argmax h(t+1)
P(Z|X; )h(t)
E [ln P(X,Z|h)]
++t
ln P(X|h)
t=0
Methodology: How to apply an instance of the EM schema
initialize parametersarbitrarily
h
(t)++t t=0
h
= argmax h
(t+1)P(Z|X; )h(t)
E [ln P(X,Z|h)]
X, h
(t)E[Z | ]
M Step:
apply updating rules (2)apply
E Step:
updating rules (1)
Methodology: How to derive an instance of the EM schema
h(t)
"observable" data log−likelihood of the
h(t) Q(h| )
function
"auxiliary"
variables
"unobservable" / latent
"observable"
variables / data
parameters hypothesis /
(4) solve this optimization problem
and sometimes Lagrange’s method) (usually using partial derivatives,
of "complete" data
(2) apply expectation linearity (1) compute the log−likelihood
with the total probability formula
(3) apply Bayes formula
(and sometimes the exponentiation trick)
++t
ln P(X|h)
t=0
h
= argmax h(t+1)
P(Z|X; )h(t)
E [ln P(X,Z|h)]
X, h(t) E[Z | ]
M Step
initialized parameters consider arbitrarily
E Step
Baum-Welch Theorem
When Q is continuous, it can be shown that EM con-
verges to a stationary point (local maximum) of the
likelihood function P (X | h).
Two mixtures of Bernoulli distributions
1− θ 2 θ
θ
2
1−
0 1
Z~Bernoulli
X~Bernoulli
1X~Bernoulli
0π 1− π
0 1
θ
X ∼ πBernoulli(θ) + (1−π)Bernoulli(2θ)
0 1
X~Bernoulli
01
0 p 1−p q
1−q 1
X~Bernoulli
Z~Bernoulli
1− π π
X ∼ πBernoulli(p) + (1−π)Bernoulli(q)
EM algorithm for solving
A mixture of vectors of independent and identically distributed (i.i.d.) Bernoulli variables
Xi
X1 X10
x1 xi x10 x
~Bernoulli (θ )A ~Bernoulli (θ )B π
Z~Bernoulli
1−π
B A
X1 Xi X10
X = (x1, . . . , x10) ∼π·Bernoulli(x1;θA)·. . .·Bernoulli(x10;θA) + (1−π)·Bernoulli(x1;θB)·. . .·Bernoulli(x10;θB).
Cf. “What is the expectation maximization algorithm?”, Chuong B. Do, Serafim Batzoglou, Nature Biotechnology, vol. 26, no. 8, 2008, pag. 897-899
Init¸ializare:
atribuie valori arbitrare (θA(0), θB(0)ˆın intervalul (0,1)) pentru parametrii θA ¸si respectiv θB;
Corpul iterativ:
pentru t = 0, . . . , T −1 (cu T fixat ˆın avans)
(sau: pˆan˘a cˆand log-verosimilitatea datelor observabile nu mai cre¸ste semnificativ), (sau: pˆan˘a cˆand |θA(t) −θA(t+1)| < ε, |θB(t) −θB(t+1)| < ε),
cu ε fixat ˆın avans) execut˘a
Pasul E:
pentru i = 1, . . . ,5 calculeaz˘a
p(t+1)i,A = (θ(t)A )xi(1−θ(t)A )10−xi
(θ(t)A )xi(1−θ(t)A )10−xi + (θB(t))xi(1−θB(t))10−xi
p(t+1)i,B = (θ(t)B )xi(1−θ(t)B )10−xi
(θ(t)A )xi(1−θ(t)A )10−xi + (θB(t))xi(1−θB(t))10−xi ; Pasul M:
θA(t+1) = P5
i=1
xi
10 p(t+1)i,A P5
i=1p(t+1)i,A ¸si θB(t+1) = P5
i=1
xi
10 p(t+1)i,B P5
i=1p(t+1)i,B ; Returneaz˘a θA(t+1), θB(t+1);
A mixture of vectors of independent Bernoulli distributions, applied to clustering of hand-written digits (MNIST)
π
2π
1π
K(p )1,i
~B
X1,1 X1,i X1,D
(p )2,i
~B
X2,1 X2,i X2,D
(p )K,i
~B
XK,1 X K,i XK,D
x1 xi xD 1
Z~Categorical
2 K
x CMU, 2015f, A. Smola, B. Poczos,
HW2, pr. 1
X = (x1, . . . , xD) ∼ π1·Bernoulli(x1;p1,1)·. . .·Bernoulli(xD;p1,D) +πK·Bernoulli(x1;pK,1)·. . .·Bernoulli(xD;pK,D).
4 Bayesian Belief Networks
(also called Bayes Nets) Interesting because:
• The Naive Bayes assumption of conditional independence of attributes is too restrictive.
(But it’s intractable without some such assumptions...)
• Bayesian Belief networks describe conditional indepen- dence among subsets of variables.
• It allows the combination of prior knowledge about
(in)dependencies among variables with observed training
data.
Conditional Independence
Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given a value of Z :
( ∀ x
i, y
j, z
k) P (X = x
i| Y = y
j, Z = z
k) = P (X = x
i| Z = z
k) More compactly, we write P (X | Y, Z ) = P (X | Z )
Note: Naive Bayes uses conditional independence to justify P (A
1, A
2| V ) = P (A
1| A
2, V )P (A
2| V ) = P (A
1| V )P (A
2| V )
Generalizing the above definition:
P (X
1. . . X
l| Y
1. . . Y
m, Z
1. . . Z
n) = P (X
1. . . X
l| Z
1. . . Z
n)
A Bayes Net
Storm
Campfire Lightning
Thunder ForestFire
Campfire C
¬C
¬S,B ¬S,¬B 0.4
0.6 0.1 0.9
0.8 0.2
0.2 0.8 S,¬B
BusTourGroup
S,B
The network is defined by
• A directed acyclic graph, represening a set of conditional independence assertions:
Each node — representing a random variable — is asserted to be conditionally independent of its nondescendants, given its immediate predecessors.
Example: P(T hunder|F orestF ire, Lightning) = P(T hunder|Lightning)
• A table of local conditional probabilities for each node/variable.
A Bayes Net (Cont’d)
represents the joint probability distribution over all variables Y
1, Y
2, . . . , Y
n:
This joint distribution is fully defined by the graph, plus the conditional probabilities:
P (y
1, . . . , y
n) = P (Y
1= y
1, . . . , Y
n= y
n) =
n
Y
i=1
P (y
i| P arents(Y
i)) where P arents(Y
i) denotes immediate predecessors of Y
iin the graph.
In our example: P (Storm, BusT ourGroup, . . . , F orestF ire)
Inference in Bayesian Nets
Question: Given a Bayes net, can one infer the probabilities of values of one or more network variables, given the observed values of (some) others?
Example:
Given the Bayes net compute:
(a) P (S ) (b) P (A, S )
(b) P (A)
P(A|S)=0.7P(A|~S)=0.3
P(G|S)=0.8 P(G|~S)=0.2 P(S|L,F)=0.8
P(S|~L,F)=0.5 P(S|~L,~F)=0.3 P(S|L,~F)=0.6
L F
S
A G
P(L)=0.4 P(F)=0.6
Inference in Bayesian Nets (Cont’d)
Answer(s):
• If only one variable is of unknown (probability) value, then it is easy to infer it
• In the general case, we can compute the probability dis- tribution for any subset of network variables, given the distribution for any subset of the remaining variables.
But...
• The exact inference of probabilities for an arbitrary
Bayes net is an NP-hard problem!!
Inference in Bayesian Nets (Cont’d)
In practice, we can succeed in many cases:
• Exact inference methods work well for some net structures.
• Monte Carlo methods “simulate” the network randomly to calculate approximate solutions [Pradham & Dagum, 1996].
(In theory even approximate inference of probabilities in
Bayes Nets can be NP-hard!! [ Dagum & Luby, 1993])
Learning Bayes Nets (I)
There are several variants of this learning task
• The network structure might be either known or unknown (i.e., it has to be inferred from the training data).
• The training examples might provide values of all network variables, or just for some of them.
The simplest case:
If the structure is known and we can observe the values of all variables,
then it is easy to estimate the conditional probability table
entries (analogous to training a Naive Bayes classifier).
Learning Bayes Nets (II)
When
• the structure of the Bayes Net is known, and
• the variables are only partially observable in the training data
learning the entries in the conditional probabilities tables is similar to (learning the weights of hidden units in) training a neural network with hidden units:
− We can learn the net’s conditional probability tables using the gradient ascent!
− Converge to the network h that (locally) maximizes P (D | h).
Gradient Ascent for Bayes Nets
Let w
ijkdenote one entry in the conditional probability table for the variable Y
iin the network
w
ijk= P (Y
i= y
ij| P arents(Y
i) = the list u
ikof values) It can be shown (see the next two slides) that
∂ ln P
h(D)
∂w
ijk= X
d∈D
P
h(y
ij, u
ik| d) w
ijktherefore perform gradient ascent by repeatedly 1. update all w
ijkusing the
training data D w
ijk← w
ijk+η X
d∈D
P
h(y
ij, u
ik| d) w
ijk2. renormalize the w
ijkto as- sure
X
j
w
ijk= 1 and 0 ≤ w
ijk≤ 1
56.
Gradient Ascent for Bayes Nets: Calculus
∂ ln P
h(D)
∂w
ijk= ∂
∂w
ijkln Y
d∈D
P
h(d) = X
d∈D
∂ ln P
h(d)
∂w
ijk= X
d∈D
1 P
h(d)
∂P
h(d)
∂w
ijkSumming over all values y
ij′of Y
i, and u
ik′of U
i= P arents(Y
i):
∂ ln P
h(D)
∂w
ijk= X
d∈D
1 P
h(d)
∂
∂w
ijkX
j′k′
P
h(d | y
ij′, u
ik′)P
h(y
ij′, u
ik′)
= X
d∈D
1 P
h(d)
∂
∂w
ijkX
j′k′
P
h(d | y
ij′, u
ik′)P
h(y
ij′| u
ik′)P
h(u
ik′)
Note that w
ijk≡ P
h(y
ij| u
ik), therefore...
Gradient Ascent for Bayes Nets: Calculus (Cont’d)
∂ ln P
h(D)
∂w
ijk= X
d∈D
1 P
h(d)
∂
∂w
ijkP
h(d | y
ij, u
ik)w
ijkP
h(u
ik)
= X
d∈D
1
P
h(d) P
h(d | y
ij, u
ik)P
h(u
ik) (applying Bayes th.)
= X
d∈D
1 P
h(d)
P
h(y
ij, u
ik| d)P
h(d)P
h(u
ik) P
h(y
ij, u
ik)
= X
d∈D
P
h(y
ij, u
ik| d)P
h(u
ik)
P
h(y
ij, u
ik) = X
d∈D
P
h(y
ij, u
ik| d) P
h(y
ij| u
ik)
= X
d∈D