
Hidden Markov Models

Based on

• “Foundations of Statistical NLP” by C. Manning & H. Schütze, ch. 9, MIT Press, 2002

• “Biological Sequence Analysis”, R. Durbin et al., ch. 3 and 11.6, Cambridge University Press, 1998

PLAN

1 Markov Models

    Markov assumptions

2 Hidden Markov Models

3 Fundamental questions for HMMs

    3.1 Probability of an observation sequence: the Forward algorithm, the Backward algorithm

    3.2 Finding the “best” sequence: the Viterbi algorithm

    3.3 HMM parameter estimation: the Forward-Backward (EM) algorithm

4 HMM extensions

5 Applications

1 Markov Models (generally)

Markov Models are used to model a sequence of random variables in which each element depends on previous elements.

$X = \langle X_1 \ldots X_T \rangle$, $X_t \in S = \{s_1, \ldots, s_N\}$

X is also called a Markov Process or Markov Chain.

S = set of states

Π = initial state probabilities: $\pi_i = P(X_1 = s_i)$; $\sum_{i=1}^{N} \pi_i = 1$

A = transition probabilities: $a_{ij} = P(X_{t+1} = s_j \mid X_t = s_i)$; $\sum_{j=1}^{N} a_{ij} = 1 \;\; \forall i$

Markov assumptions

• Limited Horizon:

$P(X_{t+1} = s_i \mid X_1 \ldots X_t) = P(X_{t+1} = s_i \mid X_t)$ (first-order Markov model)

• Time Invariance: $P(X_{t+1} = s_j \mid X_t = s_i) = p_{ij} \;\; \forall t$

Probability of a Markov Chain

$P(X_1 \ldots X_T) = P(X_1) P(X_2 \mid X_1) P(X_3 \mid X_1 X_2) \ldots P(X_T \mid X_1 X_2 \ldots X_{T-1})$

$= P(X_1) P(X_2 \mid X_1) P(X_3 \mid X_2) \ldots P(X_T \mid X_{T-1})$

$= \pi_{X_1} \prod_{t=1}^{T-1} a_{X_t X_{t+1}}$
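The product formula for the probability of a Markov chain can be sketched in a few lines of Python. The two-state weather chain and the names `chain_probability`, `pi` and `a` are hypothetical, chosen only for illustration; they do not come from the slides.

```python
def chain_probability(states, pi, a):
    """P(X_1..X_T) = pi[X_1] * product over t of a[X_t][X_{t+1}]."""
    p = pi[states[0]]
    for cur, nxt in zip(states, states[1:]):
        p *= a[cur][nxt]
    return p

# Hypothetical first-order chain (illustrative numbers, not from the slides).
pi = {"rain": 0.4, "sun": 0.6}
a = {
    "rain": {"rain": 0.7, "sun": 0.3},
    "sun": {"rain": 0.2, "sun": 0.8},
}

# pi[sun] * a[sun][sun] * a[sun][rain]
p = chain_probability(["sun", "sun", "rain"], pi, a)
print(p)
```

Only T − 1 transition factors appear, because the first state is accounted for by the initial distribution.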

A 1st Markov chain example: DNA

(from [Durbin et al., 1998])

[Figure: a Markov chain with one state per nucleotide: A, C, G, T.]

Note: Here we leave the transition probabilities unspecified.

A 2nd Markov chain example:

CpG islands in DNA sequences

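Maximum Likelihood estimation of parameters using real data (+ and −):

$a^{+}_{st} = \frac{c^{+}_{st}}{\sum_{t'} c^{+}_{st'}}$, $a^{-}_{st} = \frac{c^{-}_{st}}{\sum_{t'} c^{-}_{st'}}$

where $c^{+}_{st}$ (respectively $c^{-}_{st}$) is the number of times letter t follows letter s in the + (respectively −) training sequences.

+      A      C      G      T
A  0.180  0.274  0.426  0.120
C  0.171  0.368  0.274  0.188
G  0.161  0.339  0.375  0.125
T  0.079  0.355  0.384  0.182

−      A      C      G      T
A  0.300  0.205  0.285  0.210
C  0.322  0.298  0.078  0.302
G  0.248  0.246  0.298  0.208
T  0.177  0.239  0.292  0.292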

Using log likelihood (log-odds) ratios for discrimination

$S(x) = \log_2 \frac{P(x \mid \text{model } +)}{P(x \mid \text{model } -)} = \sum_{i=1}^{L} \log_2 \frac{a^{+}_{x_{i-1} x_i}}{a^{-}_{x_{i-1} x_i}} = \sum_{i=1}^{L} \beta_{x_{i-1} x_i}$

β       A      C      G       T
A  −0.740  0.419  0.580  −0.803
C  −0.913  0.302  1.812  −0.685
G  −0.624  0.461  0.331  −0.730
T  −1.169  0.573  0.393  −0.679
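As a rough illustration of the log-odds score, the sketch below recomputes the β values from the + and − transition tables and scores a sequence. It makes one simplifying assumption that is not in the slides: the sum runs over adjacent pairs only, so the transition out of a begin state is ignored. The names `PLUS`, `MINUS`, `BETA` and `log_odds` are illustrative.

```python
from math import log2

BASES = "ACGT"
# Rows: current base; columns: next base, in the order A, C, G, T.
PLUS = {
    "A": [0.180, 0.274, 0.426, 0.120],
    "C": [0.171, 0.368, 0.274, 0.188],
    "G": [0.161, 0.339, 0.375, 0.125],
    "T": [0.079, 0.355, 0.384, 0.182],
}
MINUS = {
    "A": [0.300, 0.205, 0.285, 0.210],
    "C": [0.322, 0.298, 0.078, 0.302],
    "G": [0.248, 0.246, 0.298, 0.208],
    "T": [0.177, 0.239, 0.292, 0.292],
}

# beta[s][t] = log2(a+_st / a-_st); small differences from the printed
# beta table are expected because the inputs are already rounded.
BETA = {
    s: {t: log2(PLUS[s][j] / MINUS[s][j]) for j, t in enumerate(BASES)}
    for s in BASES
}

def log_odds(x):
    """Sum of beta over adjacent pairs (begin-state transition ignored)."""
    return sum(BETA[s][t] for s, t in zip(x, x[1:]))

print(round(log_odds("CGCGCG"), 3), round(log_odds("ATATAT"), 3))
```

A positive score favours the + model (CpG island), a negative score the − model.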

2 Hidden Markov Models

K = output alphabet = $\{k_1, \ldots, k_M\}$

B = output emission probabilities: $b_{ijk} = P(O_t = k \mid X_t = s_i, X_{t+1} = s_j)$

Notice that $b_{ijk}$ does not depend on t.

In HMMs we only observe a probabilistic function of the state sequence: $\langle O_1 \ldots O_T \rangle$

When the state sequence $\langle X_1 \ldots X_T \rangle$ is also observable: Visible Markov Model (VMM)

Remark:

In all our subsequent examples $b_{ijk}$ is independent of j.

A program for a HMM

t = 1;

start in state s_i with probability π_i (i.e., X_1 = i);

forever do

    move from state s_i to state s_j with prob. a_ij (i.e., X_{t+1} = j);

    emit observation symbol O_t = k with probability b_ijk;

    t = t + 1; i = j;
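The generative program above could be made concrete roughly as follows. This is a sketch under two assumptions: the emission probability is independent of the destination state j (as in all examples of these slides), and the parameters are those of the crazy soft drink machine example. The helper names `draw` and `sample` are illustrative.

```python
import random

PI = {"CP": 1.0, "IP": 0.0}
A = {"CP": {"CP": 0.7, "IP": 0.3}, "IP": {"CP": 0.5, "IP": 0.5}}
B = {
    "CP": {"coke": 0.6, "ice_tea": 0.1, "lemon": 0.3},
    "IP": {"coke": 0.1, "ice_tea": 0.7, "lemon": 0.2},
}

def draw(dist, rng):
    """Draw one key of `dist` with probability proportional to its value."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]

def sample(T, rng):
    """Generate T observations and the T + 1 states visited."""
    i = draw(PI, rng)                # X_1
    path, obs = [i], []
    for _ in range(T):
        j = draw(A[i], rng)          # move from s_i to s_j
        obs.append(draw(B[i], rng))  # emit O_t; here b_ijk does not depend on j
        path.append(j)
        i = j
    return path, obs

path, obs = sample(5, random.Random(0))
print(path, obs)
```

The loop is bounded by T instead of running forever, so that the sketch terminates.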

A 1st HMM example: CpG islands

[Figure: eight states, A₊, C₊, G₊, T₊ and A₋, C₋, G₋, T₋, with transitions drawn between the two sets.]

Notes:

1. In addition to the transitions shown, there is also a complete set of transitions within each set (+ and − respectively).

2. Transition probabilities in this model are set so that within each group they are close to the transition probabilities of the original model, but there is also a small chance of switching into the other component. Overall, there is more chance of switching from ’+’ to ’−’ than vice versa.

A 2nd HMM example: The occasionally dishonest casino

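[Figure: two states, F (fair die) and L (loaded die).]

Emission probabilities:

F:  1: 1/6   2: 1/6   3: 1/6   4: 1/6   5: 1/6   6: 1/6

L:  1: 1/10  2: 1/10  3: 1/10  4: 1/10  5: 1/10  6: 1/2

Transition labels shown in the figure: 0.99, 0.01, 0.95, 0.05, 0.9, 0.1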

A 3rd HMM example: The crazy soft drink machine

(from [Manning & Schütze, 2000])

[Figure: two states, Coke Preference (CP) and Ice tea Preference (IP); $\pi_{CP} = 1$.]

Transitions: CP → CP 0.7, CP → IP 0.3, IP → CP 0.5, IP → IP 0.5

Emissions:

      Coke  Ice tea  Lemon
CP     0.6      0.1    0.3
IP     0.1      0.7    0.2

[Figure not reproduced] (from [Eddy, 2004])

3 Three fundamental questions for HMMs

1. Probability of an Observation Sequence:

Given a model µ = (A, B, Π) over S, K, how do we (efficiently) compute the likelihood of a particular sequence, P(O|µ)?

2. Finding the “Best” State Sequence:

Given an observation sequence and a model, how do we choose a state sequence $(X_1, \ldots, X_{T+1})$ to best explain the observation sequence?

3. HMM Parameter Estimation:

Given an observation sequence (or corpus thereof), how do we acquire a model µ = (A, B, Π) that best explains the data?

3.1 Probability of an observation sequence

$P(O \mid X, \mu) = \prod_{t=1}^{T} P(O_t \mid X_t, X_{t+1}, \mu) = b_{X_1 X_2 O_1} b_{X_2 X_3 O_2} \ldots b_{X_T X_{T+1} O_T}$

$P(O \mid \mu) = \sum_{X} P(O \mid X, \mu) P(X \mid \mu) = \sum_{X_1 \ldots X_{T+1}} \pi_{X_1} \prod_{t=1}^{T} a_{X_t X_{t+1}} b_{X_t X_{t+1} O_t}$

Complexity: $(2T + 1) N^{T+1}$, too inefficient.

Better: use dynamic programming to store partial results:

$\alpha_i(t) = P(O_1 O_2 \ldots O_{t-1}, X_t = s_i \mid \mu)$.

3.1.1 Probability of an observation sequence:

The Forward algorithm

1. Initialization: $\alpha_i(1) = \pi_i$, for 1 ≤ i ≤ N

2. Induction: $\alpha_j(t + 1) = \sum_{i=1}^{N} \alpha_i(t) a_{ij} b_{ijO_t}$, 1 ≤ t ≤ T, 1 ≤ j ≤ N

3. Total: $P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(T + 1)$.

Complexity: $2N^2T$
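The three steps of the Forward algorithm can be sketched as below. The sketch assumes emissions that depend only on the origin state (b_ijk independent of j) and uses the crazy soft drink machine parameters with state 0 = CP and state 1 = IP; the function and variable names are illustrative.

```python
def forward(obs, pi, a, b):
    """Return [alpha(1), ..., alpha(T+1)] and P(O | mu)."""
    n = len(pi)
    alpha = [list(pi)]                       # 1. initialization
    for o in obs:                            # 2. induction over t = 1..T
        prev = alpha[-1]
        alpha.append([
            sum(prev[i] * a[i][j] * b[i][o] for i in range(n))
            for j in range(n)
        ])
    return alpha, sum(alpha[-1])             # 3. total

PI = [1.0, 0.0]
A = [[0.7, 0.3], [0.5, 0.5]]
B = [
    {"coke": 0.6, "ice_tea": 0.1, "lemon": 0.3},
    {"coke": 0.1, "ice_tea": 0.7, "lemon": 0.2},
]

alpha, likelihood = forward(["lemon", "ice_tea", "coke"], PI, A, B)
print(alpha[-1], likelihood)
```

Each induction step costs on the order of N² multiplications, which is where the 2N²T complexity comes from.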

Proof of induction step:

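$\alpha_j(t + 1) = P(O_1 O_2 \ldots O_{t-1} O_t, X_{t+1} = j \mid \mu)$

$= \sum_{i=1}^{N} P(O_1 O_2 \ldots O_{t-1} O_t, X_t = i, X_{t+1} = j \mid \mu)$

$= \sum_{i=1}^{N} P(O_t, X_{t+1} = j \mid O_1 O_2 \ldots O_{t-1}, X_t = i, \mu) \, P(O_1 O_2 \ldots O_{t-1}, X_t = i \mid \mu)$

$= \sum_{i=1}^{N} P(O_1 O_2 \ldots O_{t-1}, X_t = i \mid \mu) \, P(O_t, X_{t+1} = j \mid O_1 O_2 \ldots O_{t-1}, X_t = i, \mu)$

$= \sum_{i=1}^{N} \alpha_i(t) \, P(O_t, X_{t+1} = j \mid X_t = i, \mu)$

$= \sum_{i=1}^{N} \alpha_i(t) \, P(O_t \mid X_t = i, X_{t+1} = j, \mu) \, P(X_{t+1} = j \mid X_t = i, \mu)$

$= \sum_{i=1}^{N} \alpha_i(t) \, b_{ijO_t} \, a_{ij}$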

Closeup of the Forward update step

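[Figure: closeup of the Forward update step. At time t, each state $s_i$ (i = 1 … N) carries $\alpha_i(t) = P(O_1 \ldots O_{t-1}, X_t = s_i \mid \mu)$; the arc from $s_i$ to $s_j$ is weighted by $a_{ij} b_{ijO_t}$; at time t + 1, state $s_j$ carries $\alpha_j(t+1) = P(O_1 \ldots O_t, X_{t+1} = s_j \mid \mu)$.]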

Trellis

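Each node $(s_i, t)$ stores information about paths through $s_i$ at time t.

[Figure: trellis with the states $s_1, s_2, s_3, \ldots, s_N$ on the vertical axis (State) and time 1, 2, …, T + 1 on the horizontal axis (Time t).]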

3.1.2 Probability of an observation sequence:

The Backward algorithm

$\beta_i(t) = P(O_t \ldots O_T \mid X_t = i, \mu)$

1. Initialization: $\beta_i(T + 1) = 1$, for 1 ≤ i ≤ N

2. Induction: $\beta_i(t) = \sum_{j=1}^{N} a_{ij} b_{ijO_t} \beta_j(t + 1)$, 1 ≤ t ≤ T, 1 ≤ i ≤ N

3. Total: $P(O \mid \mu) = \sum_{i=1}^{N} \pi_i \beta_i(1)$

Complexity: $2N^2T$
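A minimal sketch of the Backward algorithm, mirroring the three steps above. As before it assumes that b_ijk does not depend on j and uses the crazy soft drink machine parameters (state 0 = CP, state 1 = IP); all names are illustrative.

```python
def backward(obs, pi, a, b):
    """Return [beta(1), ..., beta(T+1)] and P(O | mu)."""
    n = len(pi)
    beta = [[1.0] * n]                       # 1. initialization: beta(T+1)
    for o in reversed(obs):                  # 2. induction for t = T..1
        nxt = beta[0]
        beta.insert(0, [
            sum(a[i][j] * b[i][o] * nxt[j] for j in range(n))
            for i in range(n)
        ])
    total = sum(pi[i] * beta[0][i] for i in range(n))   # 3. total
    return beta, total

PI = [1.0, 0.0]
A = [[0.7, 0.3], [0.5, 0.5]]
B = [
    {"coke": 0.6, "ice_tea": 0.1, "lemon": 0.3},
    {"coke": 0.1, "ice_tea": 0.7, "lemon": 0.2},
]

beta, likelihood = backward(["lemon", "ice_tea", "coke"], PI, A, B)
print(beta[0], likelihood)
```

The likelihood obtained this way should agree with the one produced by the Forward algorithm on the same sequence.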

Induction:

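$\beta_i(t) = P(O_t O_{t+1} \ldots O_T \mid X_t = i, \mu)$

$= \sum_{j=1}^{N} P(O_t O_{t+1} \ldots O_T, X_{t+1} = j \mid X_t = i, \mu)$

$= \sum_{j=1}^{N} P(O_t O_{t+1} \ldots O_T \mid X_t = i, X_{t+1} = j, \mu) \, P(X_{t+1} = j \mid X_t = i, \mu)$

$= \sum_{j=1}^{N} P(O_{t+1} \ldots O_T \mid O_t, X_t = i, X_{t+1} = j, \mu) \, P(O_t \mid X_t = i, X_{t+1} = j, \mu) \, a_{ij}$

$= \sum_{j=1}^{N} P(O_{t+1} \ldots O_T \mid X_{t+1} = j, \mu) \, b_{ijO_t} \, a_{ij}$

$= \sum_{j=1}^{N} \beta_j(t + 1) \, b_{ijO_t} \, a_{ij}$

Total:

$P(O \mid \mu) = \sum_{i=1}^{N} P(O_1 O_2 \ldots O_T \mid X_1 = i, \mu) \, P(X_1 = i \mid \mu) = \sum_{i=1}^{N} \beta_i(1) \, \pi_i$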

Combining Forward and Backward probabilities

$P(O, X_t = i \mid \mu) = \alpha_i(t) \beta_i(t)$

$P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(t) \beta_i(t)$ for 1 ≤ t ≤ T + 1

Proofs:

$P(O, X_t = i \mid \mu) = P(O_1 \ldots O_T, X_t = i \mid \mu)$

$= P(O_1 \ldots O_{t-1}, X_t = i, O_t \ldots O_T \mid \mu)$

$= P(O_1 \ldots O_{t-1}, X_t = i \mid \mu) \, P(O_t \ldots O_T \mid O_1 \ldots O_{t-1}, X_t = i, \mu)$

$= \alpha_i(t) \, P(O_t \ldots O_T \mid X_t = i, \mu)$

$= \alpha_i(t) \beta_i(t)$

$P(O \mid \mu) = \sum_{i=1}^{N} P(O, X_t = i \mid \mu) = \sum_{i=1}^{N} \alpha_i(t) \beta_i(t)$

Note: The “total” forward and backward formulae are special cases of the above one (for t = T + 1 and t = 1, respectively).

3.2.1 Posterior decoding

One way to find the most likely state sequence underlying the observation sequence: choose the states individually.

$\gamma_i(t) = P(X_t = i \mid O, \mu)$

$\hat{X}_t = \arg\max_{1 \le i \le N} \gamma_i(t)$ for 1 ≤ t ≤ T + 1

Computing $\gamma_i(t)$:

$\gamma_i(t) = P(X_t = i \mid O, \mu) = \frac{P(X_t = i, O \mid \mu)}{P(O \mid \mu)} = \frac{\alpha_i(t) \beta_i(t)}{\sum_{j=1}^{N} \alpha_j(t) \beta_j(t)}$

Remark:

$\hat{X}$ maximizes the expected number of states that will be guessed correctly. However, it may yield a quite unlikely/unnatural state sequence.
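Posterior decoding combines the forward and backward variables. The sketch below recomputes both inside one helper so that it stands alone; it assumes emissions that depend only on the origin state and uses the crazy soft drink machine parameters (state 0 = CP, state 1 = IP). The function names are illustrative.

```python
def forward_backward(obs, pi, a, b):
    """Return the alpha and beta tables, each with T + 1 rows."""
    n = len(pi)
    alpha = [list(pi)]
    for o in obs:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * a[i][j] * b[i][o] for i in range(n))
                      for j in range(n)])
    beta = [[1.0] * n]
    for o in reversed(obs):
        nxt = beta[0]
        beta.insert(0, [sum(a[i][j] * b[i][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return alpha, beta

def posterior_decode(obs, pi, a, b):
    """gamma_i(t) = alpha_i(t) beta_i(t) / sum_j alpha_j(t) beta_j(t)."""
    alpha, beta = forward_backward(obs, pi, a, b)
    gamma = []
    for al, be in zip(alpha, beta):
        w = [x * y for x, y in zip(al, be)]
        z = sum(w)
        gamma.append([v / z for v in w])
    # Pick the individually most probable state at each time step.
    path = [max(range(len(g)), key=g.__getitem__) for g in gamma]
    return gamma, path

PI = [1.0, 0.0]
A = [[0.7, 0.3], [0.5, 0.5]]
B = [
    {"coke": 0.6, "ice_tea": 0.1, "lemon": 0.3},
    {"coke": 0.1, "ice_tea": 0.7, "lemon": 0.2},
]

gamma, path = posterior_decode(["lemon", "ice_tea", "coke"], PI, A, B)
print(gamma, path)
```

Because each state is chosen independently, nothing in this procedure prevents consecutive choices that are joined by a zero-probability transition.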

Note

Sometimes it is not the state itself that is of interest, but some other property derived from it.

For instance, in the CpG islands example, let g be a function defined on the set of states: g takes the value 1 for A₊, C₊, G₊, T₊ and 0 for A₋, C₋, G₋, T₋.

Then

$\sum_{j} P(X_t = s_j \mid O) \, g(s_j)$

designates the posterior probability that the symbol $O_t$ comes from a state in the + set.

Thus it is possible to find the most probable label of the state at each position in the output sequence O.

3.2.2 Finding the “best” state sequence: The Viterbi algorithm

Compute the probability of the most likely path

$\arg\max_{X} P(X \mid O, \mu) = \arg\max_{X} P(X, O \mid \mu)$

through a node in the trellis:

$\delta_i(t) = \max_{X_1 \ldots X_{t-1}} P(X_1 \ldots X_{t-1}, O_1 \ldots O_{t-1}, X_t = s_i \mid \mu)$

1. Initialization: $\delta_j(1) = \pi_j$, for 1 ≤ j ≤ N

2. Induction: (see the similarity with the Forward algorithm)

$\delta_j(t + 1) = \max_{1 \le i \le N} \delta_i(t) a_{ij} b_{ijO_t}$, 1 ≤ t ≤ T, 1 ≤ j ≤ N

$\psi_j(t + 1) = \arg\max_{1 \le i \le N} \delta_i(t) a_{ij} b_{ijO_t}$, 1 ≤ t ≤ T, 1 ≤ j ≤ N

3. Termination and readout of best path:

$P(\hat{\hat{X}}, O \mid \mu) = \max_{1 \le i \le N} \delta_i(T + 1)$

$\hat{\hat{X}}_{T+1} = \arg\max_{1 \le i \le N} \delta_i(T + 1)$, $\hat{\hat{X}}_t = \psi_{\hat{\hat{X}}_{t+1}}(t + 1)$
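The Viterbi recursion differs from the Forward recursion only in replacing the sum by a max and in keeping back-pointers. A minimal sketch, assuming emissions that depend only on the origin state and using the crazy soft drink machine parameters (state 0 = CP, state 1 = IP); names are illustrative.

```python
def viterbi(obs, pi, a, b):
    """Return the most likely state path (T + 1 states) and P(path, O | mu)."""
    n = len(pi)
    delta = [list(pi)]                       # 1. initialization
    psi = []                                 # back-pointers for t = 2..T+1
    for o in obs:                            # 2. induction
        prev = delta[-1]
        scores = [[prev[i] * a[i][j] * b[i][o] for i in range(n)]
                  for j in range(n)]
        psi.append([max(range(n), key=row.__getitem__) for row in scores])
        delta.append([max(row) for row in scores])
    # 3. termination and readout of the best path
    last = max(range(n), key=delta[-1].__getitem__)
    path = [last]
    for back in reversed(psi):
        path.insert(0, back[path[0]])
    return path, delta[-1][last]

PI = [1.0, 0.0]
A = [[0.7, 0.3], [0.5, 0.5]]
B = [
    {"coke": 0.6, "ice_tea": 0.1, "lemon": 0.3},
    {"coke": 0.1, "ice_tea": 0.7, "lemon": 0.2},
]

best_path, best_prob = viterbi(["lemon", "ice_tea", "coke"], PI, A, B)
print(best_path, best_prob)
```

For long sequences one would typically work with log probabilities to avoid underflow; that refinement is left out here to keep the sketch close to the formulas.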

Example:

Variable calculations for the crazy soft drink machine HMM

Output                      lemon   ice tea   coke

t                           1       2         3         4

α_CP(t)                     1.0     0.21      0.0462    0.021294
α_IP(t)                     0.0     0.09      0.0378    0.010206
P(o_1 … o_{t−1})            1.0     0.3       0.084     0.0315

β_CP(t)                     0.0315  0.045     0.6       1.0
β_IP(t)                     0.029   0.245     0.1       1.0
P(o_1 … o_T)                0.0315

γ_CP(t)                     1.0     0.3       0.88      0.676
γ_IP(t)                     0.0     0.7       0.12      0.324

$\hat{X}_t$                 CP      IP        CP        CP

δ_CP(t)                     1.0     0.21      0.0315    0.01323
δ_IP(t)                     0.0     0.09      0.0315    0.00567

ψ_CP(t)                             CP        IP        CP
ψ_IP(t)                             CP        IP        CP

$\hat{\hat{X}}_t$           CP      IP        CP        CP

$P(\hat{\hat{X}})$          0.019404

3.3 HMM parameter estimation

Given a single observation sequence for training, we want to find the model (parameters) µ = (A, B, π) that best explains the observed data.

Under Maximum Likelihood Estimation, this means:

$\arg\max_{\mu} P(O_{training} \mid \mu)$

There is no known analytic method for doing this.

However, we can choose µ so as to locally maximize $P(O_{training} \mid \mu)$ by an iterative hill-climbing algorithm:

Forward-Backward (or: Baum-Welch), which is a special case of the EM algorithm.

3.3.1 The Forward-Backward algorithm: The idea

• Assume some (perhaps randomly chosen) model parameters. Calculate the probability of the observed data.

• Using the above calculation, we can see which transitions and signal emissions were probably used the most; by increasing the probability of these, we will get a higher probability of the observed sequence.

• Iterate, hopefully arriving at an optimal parameter setting.

The Forward-Backward algorithm: Expectations

Define the probability of traversing a certain arc at time t, given the observation sequence O:

$p_t(i, j) = P(X_t = i, X_{t+1} = j \mid O, \mu)$

$p_t(i, j) = \frac{P(X_t = i, X_{t+1} = j, O \mid \mu)}{P(O \mid \mu)} = \frac{\alpha_i(t) a_{ij} b_{ijO_t} \beta_j(t + 1)}{\sum_{m=1}^{N} \alpha_m(t) \beta_m(t)} = \frac{\alpha_i(t) a_{ij} b_{ijO_t} \beta_j(t + 1)}{\sum_{m=1}^{N} \sum_{n=1}^{N} \alpha_m(t) a_{mn} b_{mnO_t} \beta_n(t + 1)}$

Summing over t:

$\sum_{t=1}^{T} p_t(i, j)$ = expected number of transitions from $s_i$ to $s_j$ in O

$\sum_{j=1}^{N} \sum_{t=1}^{T} p_t(i, j)$ = expected number of transitions from $s_i$ in O

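[Figure: the arc from state $s_i$ at time t to state $s_j$ at time t + 1, weighted by $a_{ij} b_{ijO_t}$, with $\alpha_i(t)$ covering the trellis up to time t and $\beta_j(t+1)$ covering it from time t + 1 onwards.]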

The Forward-Backward algorithm: Re-estimation

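From µ = (A, B, Π), derive $\hat{\mu} = (\hat{A}, \hat{B}, \hat{\Pi})$:

$\hat{\pi}_i = \frac{\sum_{j=1}^{N} p_1(i, j)}{\sum_{l=1}^{N} \sum_{j=1}^{N} p_1(l, j)} = \sum_{j=1}^{N} p_1(i, j) = \gamma_i(1)$

$\hat{a}_{ij} = \frac{\sum_{t=1}^{T} p_t(i, j)}{\sum_{l=1}^{N} \sum_{t=1}^{T} p_t(i, l)}$

$\hat{b}_{ijk} = \frac{\sum_{t: O_t = k, \, 1 \le t \le T} p_t(i, j)}{\sum_{t=1}^{T} p_t(i, j)}$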
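One re-estimation step can be sketched as follows. The sketch assumes that emissions depend only on the origin state, so the emission counts are pooled over the destination state j (a tied version of the formula for the emission estimate); it uses the crazy soft drink machine parameters (state 0 = CP, state 1 = IP) and assumes every state has a non-zero expected number of outgoing transitions. All names are illustrative.

```python
def baum_welch_step(obs, pi, a, b):
    """One Forward-Backward re-estimation step; returns new (pi, a, b) and P(O | mu)."""
    n, T = len(pi), len(obs)
    alpha = [list(pi)]
    for o in obs:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * a[i][j] * b[i][o] for i in range(n))
                      for j in range(n)])
    beta = [[1.0] * n]
    for o in reversed(obs):
        nxt = beta[0]
        beta.insert(0, [sum(a[i][j] * b[i][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    likelihood = sum(alpha[-1])
    # p[t][i][j]: probability of traversing arc i -> j at step t, given O.
    p = [[[alpha[t][i] * a[i][j] * b[i][obs[t]] * beta[t + 1][j] / likelihood
           for j in range(n)] for i in range(n)] for t in range(T)]
    new_pi = [sum(p[0][i]) for i in range(n)]
    new_a, new_b = [], []
    for i in range(n):
        out_i = sum(sum(p[t][i]) for t in range(T))   # expected transitions out of i
        new_a.append([sum(p[t][i][j] for t in range(T)) / out_i
                      for j in range(n)])
        new_b.append({k: sum(sum(p[t][i]) for t in range(T) if obs[t] == k) / out_i
                      for k in b[i]})
    return new_pi, new_a, new_b, likelihood

PI = [1.0, 0.0]
A = [[0.7, 0.3], [0.5, 0.5]]
B = [
    {"coke": 0.6, "ice_tea": 0.1, "lemon": 0.3},
    {"coke": 0.1, "ice_tea": 0.7, "lemon": 0.2},
]
OBS = ["lemon", "ice_tea", "coke"]

new_pi, new_a, new_b, old_likelihood = baum_welch_step(OBS, PI, A, B)
_, _, _, new_likelihood = baum_welch_step(OBS, new_pi, new_a, new_b)
print(new_a, new_b)
```

The second call is there only to read off the likelihood under the re-estimated parameters, which should not be lower than the initial one.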

The Forward-Backward algorithm: Justification

Theorem (Baum-Welch): $P(O \mid \hat{\mu}) \ge P(O \mid \mu)$

Note 1: However, it does not necessarily converge to a global optimum.

Note 2: There is a straightforward extension of the algorithm that deals with multiple observation sequences (i.e., a corpus).

Example: Re-estimation of HMM parameters

The crazy soft drink machine, after one EM iteration on the sequence O = (Lemon, Ice-tea, Coke)

[Figure: the two states Coke Preference (CP) and Ice tea Preference (IP) with the re-estimated parameters; $\pi_{CP} = 1$.]

Transitions: CP → CP 0.5486, CP → IP 0.4514, IP → CP 0.8049, IP → IP 0.1951

Emissions:

        Coke  Ice tea   Lemon
CP    0.4037   0.1376  0.4587
IP    0.1463   0.8537  0

On this HMM, we obtained P(O) = 0.1324, a significant improvement on the initial P(O) = 0.0315.

3.3.2 HMM parameter estimation: Viterbi version

Objective: maximize $P(O \mid \Pi^{\star}(O), \mu)$, where $\Pi^{\star}(O)$ is the Viterbi path for the sequence O.

Idea:

Instead of estimating the parameters $a_{ij}$, $b_{ijk}$ using the expected values of hidden variables ($p_t(i, j)$), estimate them (by Maximum Likelihood) based on the computed Viterbi path.

Note:

In practice, this method performs worse than the main Forward-Backward (Baum-Welch) version. However, it is widely used, especially when the HMM is primarily intended to produce Viterbi paths.

3.3.3 Proof of the Baum-Welch theorem...

3.3.3.1 ...In the general EM setup (not only that of HMM)

Assume

• some statistical model determined by parameters θ,

• the observed quantities x,

• and some missing data y that determines/influences the probability of x.

The aim is to find the model (in fact, the value of the parameter θ) that maximises the log likelihood

$\log P(x \mid \theta) = \log \sum_{y} P(x, y \mid \theta)$

Given a valid model $\theta^t$, we want to estimate a new and better model $\theta^{t+1}$, i.e. one for which

$\log P(x \mid \theta^{t+1}) > \log P(x \mid \theta^t)$

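From $P(x, y \mid \theta) = P(y \mid x, \theta) P(x \mid \theta)$ we get $\log P(x \mid \theta) = \log P(x, y \mid \theta) - \log P(y \mid x, \theta)$. Multiplying by $P(y \mid x, \theta^t)$ and summing over y, it follows (since $\sum_{y} P(y \mid x, \theta^t) = 1$):

$\log P(x \mid \theta) = \sum_{y} P(y \mid x, \theta^t) \log P(x, y \mid \theta) - \sum_{y} P(y \mid x, \theta^t) \log P(y \mid x, \theta)$

The first sum will be denoted $Q(\theta \mid \theta^t)$.

The difference

$\log P(x \mid \theta) - \log P(x \mid \theta^t) = Q(\theta \mid \theta^t) - Q(\theta^t \mid \theta^t) + \sum_{y} P(y \mid x, \theta^t) \log \frac{P(y \mid x, \theta^t)}{P(y \mid x, \theta)}$

should be positive.

Note that the last sum is the relative entropy of $P(y \mid x, \theta^t)$ with respect to $P(y \mid x, \theta)$, therefore it is non-negative. So,

$\log P(x \mid \theta) - \log P(x \mid \theta^t) \ge Q(\theta \mid \theta^t) - Q(\theta^t \mid \theta^t)$

with equality only if $\theta = \theta^t$, or if $P(x \mid \theta) = P(x \mid \theta^t)$ for some other $\theta \ne \theta^t$.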

Taking $\theta^{t+1} = \arg\max_{\theta} Q(\theta \mid \theta^t)$ will imply $\log P(x \mid \theta^{t+1}) - \log P(x \mid \theta^t) \ge 0$.

(If $\theta^{t+1} = \theta^t$, the maximum has been reached.)

Note: The function $Q(\theta \mid \theta^t) \stackrel{def}{=} \sum_{y} P(y \mid x, \theta^t) \log P(x, y \mid \theta)$ is an average of $\log P(x, y \mid \theta)$ over the distribution of y obtained with the current set of parameters $\theta^t$. This average can be expressed as a function of θ in which the constants are expectation values in the old model. (See details in the sequel.)

The (backbone of the) EM algorithm:

initialize θ to some arbitrary value $\theta^0$;

until a certain stop criterion is met, do:

    E-step: compute the expectations $E[y \mid x, \theta^t]$; calculate the Q function;

    M-step: compute $\theta^{t+1} = \arg\max_{\theta} Q(\theta \mid \theta^t)$.

Note: Since the likelihood increases at each iteration, the procedure will always reach a local (or maybe global) maximum asymptotically as t → ∞.

Note:

For many models, such as HMM, both of these steps can be carried out analytically.

If the second step cannot be carried out exactly, we can use some numerical optimisation technique to maximise Q.

In fact, it is enough to make Q(θ^t+1 | θ^t) > Q(θ^t | θ^t), thus getting generalised EM algorithms. See [Dempster, Laird, Rubin, 1977], [Meng, Rubin, 1992], [Neal, Hinton, 1993].


3.3.3.2 Derivation of EM steps for HMM

In this case, the ‘missing data’ are the state paths π. We want to maximize

$Q(\theta \mid \theta^t) = \sum_{\pi} P(\pi \mid x, \theta^t) \log P(x, \pi \mid \theta)$

For a given path, each parameter of the model will appear some number of times in $P(x, \pi \mid \theta)$, computed as usual. We will denote this number $A_{kl}(\pi)$ for transitions and $E_k(b, \pi)$ for emissions. Then,

$P(x, \pi \mid \theta) = \prod_{k=1}^{M} \prod_{b} [e_k(b)]^{E_k(b, \pi)} \prod_{k=0}^{M} \prod_{l=1}^{M} a_{kl}^{A_{kl}(\pi)}$

By taking the logarithm in the above formula, it follows

$Q(\theta \mid \theta^t) = \sum_{\pi} P(\pi \mid x, \theta^t) \times \left[ \sum_{k=1}^{M} \sum_{b} E_k(b, \pi) \log e_k(b) + \sum_{k=0}^{M} \sum_{l=1}^{M} A_{kl}(\pi) \log a_{kl} \right]$

The expected values $A_{kl}$ and $E_k(b)$ can be written as expectations of $A_{kl}(\pi)$ and $E_k(b, \pi)$ with respect to $P(\pi \mid x, \theta^t)$:

$E_k(b) = \sum_{\pi} P(\pi \mid x, \theta^t) E_k(b, \pi)$ and $A_{kl} = \sum_{\pi} P(\pi \mid x, \theta^t) A_{kl}(\pi)$

Therefore,

$Q(\theta \mid \theta^t) = \sum_{k=1}^{M} \sum_{b} E_k(b) \log e_k(b) + \sum_{k=0}^{M} \sum_{l=1}^{M} A_{kl} \log a_{kl}$

To maximise, let us look first at the A term.

The difference between this term for $a^0_{ij} = \frac{A_{ij}}{\sum_{k} A_{ik}}$ and for any other $a_{ij}$ is

$\sum_{k=0}^{M} \sum_{l=1}^{M} A_{kl} \log \frac{a^0_{kl}}{a_{kl}} = \sum_{k=0}^{M} \left( \sum_{l'} A_{kl'} \right) \sum_{l=1}^{M} a^0_{kl} \log \frac{a^0_{kl}}{a_{kl}}$

The last sum is a relative entropy, and thus it is larger than 0 unless $a_{kl} = a^0_{kl}$. This proves that the maximum is at $a^0_{kl}$.

Exactly the same procedure can be used for the E term.

For the HMM, the E-step of the EM algorithm consists of calculating the expectations $A_{kl}$ and $E_k(b)$. This is done by using the Forward and Backward probabilities. This completely determines the Q function, and the maximum is expressed directly in terms of these numbers.

Therefore, the M-step just consists of plugging $A_{kl}$ and $E_k(b)$ into the re-estimation formulae for $a_{kl}$ and $e_k(b)$. (See formulae (3.18) in the R. Durbin et al. BSA book.)

4 HMM extensions

• Null (epsilon) emissions

• Initialization of parameters: improves the chances of reaching the global optimum

• Parameter tying: helps cope with data sparseness

• Linear interpolation of HMMs

• Variable-Memory HMMs

• Acquiring HMM topologies from data

5 Some applications of HMMs

◦ Speech Recognition

• Text Processing: Part Of Speech Tagging

• Probabilistic Information Retrieval

◦ Bioinformatics: genetic sequence analysis


5.1 Part Of Speech (POS) Tagging

Sample POS tags for the Brown/Penn Corpora

AT      article
BEZ     is
IN      preposition
JJ      adjective
JJR     adjective: comparative
MD      modal
NN      noun: singular or mass
NNP     noun: singular proper
PERIOD  .:?!
PN      personal pronoun
RB      adverb
RBR     adverb: comparative
TO      to
VB      verb: base form
VBD     verb: past tense
VBG     verb: present participle, gerund
VBN     verb: past participle
VBP     verb: non-3rd singular present
VBZ     verb: 3rd singular present
WDT     wh-determiner (what, which)

POS Tagging: Methods

[Charniak, 1993] Frequency-based: 90% accuracy, now considered baseline performance

[Schmid, 1994] Decision lists; artificial neural networks

[Brill, 1995] Transformation-based learning

[Brants, 1998] Hidden Markov Models

[Chelba & Jelinek, 1998] Lexicalized probabilistic parsing (the best!)

A fragment of a HMM for POS tagging

(from [Charniak, 1997])

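[Figure: three states, det, adj and noun; $\pi_{det} = 1$.]

Emission probabilities shown:

det:   P(a) = 0.245, P(the) = 0.586

adj:   P(large) = 0.004, P(small) = 0.005

noun:  P(house) = 0.001, P(stock) = 0.001

Transition labels shown in the figure: 0.218, 0.45, 0.475, 0.016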

Using HMMs for POS tagging

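$\arg\max_{t_{1 \ldots n}} P(t_{1 \ldots n} \mid w_{1 \ldots n}) = \arg\max_{t_{1 \ldots n}} \frac{P(w_{1 \ldots n} \mid t_{1 \ldots n}) P(t_{1 \ldots n})}{P(w_{1 \ldots n})}$

$= \arg\max_{t_{1 \ldots n}} P(w_{1 \ldots n} \mid t_{1 \ldots n}) P(t_{1 \ldots n})$

using the two Markov assumptions

$= \arg\max_{t_{1 \ldots n}} \prod_{i=1}^{n} P(w_i \mid t_i) \prod_{i=1}^{n} P(t_i \mid t_{i-1})$

Supervised POS Tagging:

MLE estimations: $P(w \mid t) = \frac{C(w, t)}{C(t)}$, $P(t'' \mid t') = \frac{C(t', t'')}{C(t')}$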
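The two MLE formulas for supervised tagging amount to counting and dividing. The sketch below uses a tiny hand-made tagged corpus that is purely hypothetical; the function name `mle_estimates` and the data are illustrative only, and no smoothing is applied.

```python
from collections import Counter

def mle_estimates(tagged_sentences):
    """Return P(w | t) = C(w, t) / C(t) and P(t'' | t') = C(t', t'') / C(t')."""
    tag_count, emit_count, bigram_count = Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        tags = [t for _, t in sent]
        for w, t in sent:
            tag_count[t] += 1
            emit_count[(w, t)] += 1
        for t1, t2 in zip(tags, tags[1:]):
            bigram_count[(t1, t2)] += 1
    p_word = {(w, t): c / tag_count[t] for (w, t), c in emit_count.items()}
    p_tag = {(t1, t2): c / tag_count[t1] for (t1, t2), c in bigram_count.items()}
    return p_word, p_tag

# Hypothetical toy corpus of (word, tag) pairs.
corpus = [
    [("the", "AT"), ("house", "NN")],
    [("the", "AT"), ("large", "JJ"), ("house", "NN")],
    [("a", "AT"), ("stock", "NN")],
]

p_word, p_tag = mle_estimates(corpus)
print(p_word[("the", "AT")], p_tag[("AT", "NN")])
```

Any word/tag or tag/tag pair absent from the corpus simply has no entry, which corresponds to a zero estimate and motivates the smoothing techniques used in practice.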

The Treatment of Unknown Words:

• use an a priori uniform distribution over all tags:

error rate 40% ⇒ 20%

• feature-based estimation [Weischedel et al., 1993]:

$P(w \mid t) = \frac{1}{Z} P(\text{unknown word} \mid t) \, P(\text{Capitalized} \mid t) \, P(\text{Ending} \mid t)$

• using both roots and suffixes [Charniak, 1993]

Smoothing:

$P(t \mid w) = \frac{C(t, w) + 1}{C(w) + k_w}$ [Church, 1988]

where $k_w$ is the number of possible tags for w

$P(t'' \mid t') = (1 - \epsilon) \frac{C(t', t'')}{C(t')} + \epsilon$ [Charniak et al., 1993]

Fine-tuning HMMs for POS tagging

See [ Brants, 1998 ]


5.2 The Google PageRank Algorithm

A Markov Chain worth no. 5 on Forbes list!

(2 × 18.5 billion USD, as of November 2007)

“Sergey Brin and Lawrence Page introduced Google in 1998, a time when the pace at which the web was growing began to outstrip the ability of current search engines to yield usable results.

In developing Google, they wanted to improve the design of search engines by moving it into a more open, academic environment.

In addition, they felt that the usage of statistics for their search engine would provide an interesting data set for research.”

From David Austin, “How Google finds your needle in the web’s haystack”, Monthly Essays on Mathematical Topics, 2006.


Notations

Let n = the number of pages on the Internet, and let H and A be two n × n matrices defined by

$h_{ij} = 1 / l_j$ if page j points to page i (notation: $P_j \in B_i$), 0 otherwise

$a_{ij} = 1 / n$ if page j contains no outgoing links, 0 otherwise

where $l_j$ is the number of links pointing out from page $P_j$.

$\alpha \in [0, 1]$ (this is a parameter that was initially set to 0.85)

The transition matrix of the Google Markov Chain is

$G = \alpha (H + A) + \frac{1 - \alpha}{n} \cdot \mathbf{1}$

where $\mathbf{1}$ is the n × n matrix whose entries are all 1.

The significance of G is derived from:

• the Random Surfer model

• the definition of the (relative) importance of a page: combining votes from the pages that point to it

$I(P_i) = \sum_{P_j \in B_i} \frac{I(P_j)}{l_j}$

where $l_j$ is the number of links pointing out from $P_j$.

The PageRank algorithm

[Brin & Page, 1998]

G is a stochastic matrix ($g_{ij} \in [0, 1]$, $\sum_{i=1}^{n} g_{ij} = 1$), therefore $\lambda_1$, the greatest eigenvalue of G, is 1, and G has a stationary vector I (i.e., GI = I).

G is also primitive ($|\lambda_2| < 1$, where $\lambda_2$ is the second eigenvalue of G) and irreducible (I > 0).

From matrix calculus it follows that I can be computed using the power method:

if $I^1 = G I^0$, $I^2 = G I^1$, …, $I^k = G I^{k-1}$, then $I^k \to I$.

I gives the relative importance of pages.
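The power method can be sketched on a toy web graph. The sketch assumes the column-stochastic convention in which a page with l_j outgoing links passes weight 1/l_j to each target and a page without outgoing links spreads its weight uniformly (1/n per page); it also assumes that each list of outgoing links has no duplicates. The four-page graph and the names `google_matrix` and `pagerank` are hypothetical.

```python
def google_matrix(links, alpha=0.85):
    """links[j] lists the pages that page j points to; returns G as a list of rows."""
    n = len(links)
    g = [[(1 - alpha) / n] * n for _ in range(n)]      # teleportation part
    for j, outs in enumerate(links):
        if outs:
            for i in outs:
                g[i][j] += alpha / len(outs)           # H part: 1 / l_j
        else:
            for i in range(n):
                g[i][j] += alpha / n                   # A part: dangling page
    return g

def pagerank(links, alpha=0.85, iterations=100):
    """Power method: repeatedly apply I <- G I, starting from the uniform vector."""
    g = google_matrix(links, alpha)
    n = len(links)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        rank = [sum(g[i][j] * rank[j] for j in range(n)) for i in range(n)]
    return rank

# Hypothetical toy web: page 0 -> 1, 2; page 1 -> 2; page 2 -> 0; page 3 has no links.
links = [[1, 2], [2], [0], []]
rank = pagerank(links)
print([round(r, 4) for r in rank])
```

A fixed number of iterations is used for simplicity; a real implementation would more likely stop once successive vectors differ by less than a tolerance, and would exploit the sparsity of H instead of building G densely.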

ADDENDA

Formalisation of HMM algorithms in

“Biological Sequence Analysis” [ Durbin et al, 1998 ]

Note

A begin state was introduced. The transition probability $a_{0k}$ from this begin state to state k can be thought of as the probability of starting in state k.

An end state is assumed, which is the reason for $a_{k0}$ in the termination step.

If ends are not modelled, this $a_{k0}$ will disappear.

For convenience we label both begin and end states as 0. There is no conflict because you can only transit out of the begin state and only into the end state, so variables are not used more than once.

The emission probabilities are considered independent of the origin state.

(Thus the emission of (pairs of) symbols can be seen as being done when reaching the non-end states.) The begin and end states are silent.

Forward:

1. Initialization (i = 0): $f_0(0) = 1$; $f_k(0) = 0$, for k > 0

2. Induction (i = 1 … L): $f_l(i) = e_l(x_i) \sum_{k} f_k(i - 1) a_{kl}$

3. Total: $P(x) = \sum_{k} f_k(L) a_{k0}$.

Backward:

1. Initialization (i = L): $b_k(L) = a_{k0}$, for all k

2. Induction (i = L − 1, …, 1): $b_k(i) = \sum_{l} a_{kl} e_l(x_{i+1}) b_l(i + 1)$

3. Total: $P(x) = \sum_{l} a_{0l} e_l(x_1) b_l(1)$

Combining f and b: $P(\pi_i = k, x) = f_k(i) b_k(i)$

Viterbi:

1. Initialization (i = 0): $v_0(0) = 1$; $v_k(0) = 0$, for k > 0

2. Induction (i = 1 … L):

$v_l(i) = e_l(x_i) \max_{k} (v_k(i - 1) a_{kl})$;

$ptr_i(l) = \arg\max_{k} (v_k(i - 1) a_{kl})$

3. Termination and readout of best path:

$P(x, \pi^{\star}) = \max_{k} (v_k(L) a_{k0})$;

$\pi^{\star}_L = \arg\max_{k} v_k(L) a_{k0}$, and $\pi^{\star}_{i-1} = ptr_i(\pi^{\star}_i)$, for i = L … 1.

Baum-Welch:

1. Initialization: Pick arbitrary model parameters

2. Induction:

For each sequence j = 1 … n, calculate $f^j_k(i)$ and $b^j_k(i)$ for sequence j using the forward and backward algorithms, respectively.

Calculate the expected number of times each transition or emission is used, given the training sequences:

$A_{kl} = \sum_{j} \frac{1}{P(x^j)} \sum_{i} f^j_k(i) \, a_{kl} \, e_l(x^j_{i+1}) \, b^j_l(i + 1)$

$E_k(b) = \sum_{j} \frac{1}{P(x^j)} \sum_{\{i \mid x^j_i = b\}} f^j_k(i) \, b^j_k(i)$

Calculate the new model parameters:

$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}$ and $e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$

Calculate the new log likelihood of the model.

3. Termination:

Stop if the change in log likelihood is less than some predefined threshold or the maximum number of iterations is exceeded.
