• Nu S-Au Găsit Rezultate

C. Manning & H. Sch¨ utze, ch. 2, MIT Press, 2002

N/A
N/A
Protected

Academic year: 2022

Share "C. Manning & H. Sch¨ utze, ch. 2, MIT Press, 2002"

Copied!
59
0
0
Arată mai multe ( pagini)

Text complet

(1)

Basic Statistics and Probability Theory

Based on

“Foundations of Statistical NLP”

C. Manning & H. Sch¨ utze, ch. 2, MIT Press, 2002

“Probability theory is nothing but common sense reduced to calculation.”

Pierre Simon, Marquis de Laplace (1749-1827)

(2)

PLAN

1.

Elementary Probability Notions:

• Sample Space, Event Space, and Probability Function

• Conditional Probability

• Bayes’ Theorem

• Independence of Probabilistic Events 2.

Random Variables:

• Discrete Variables and Continuous Variables

• Mean, Variance and Standard Deviation

• Standard Distributions

• Joint, Marginal and and Conditional Distributions

• Independence of Random Variables

(3)

PLAN (cont’d)

3.

Limit Theorems

• Laws of Large Numbers

• Central Limit Theorems

4.

Estimating the parameters

of probabilistic models from data

• Maximum Likelihood Estimation (MLE)

• Maximum A Posteriori (MAP) Estimation 5.

Elementary Information Theory

• Entropy; Conditional Entropy; Joint Entropy

• Information Gain / Mutual Information

• Cross-Entropy

• Relative Entropy / Kullback-Leibler (KL) Divergence

• Properties: bounds, chain rules, (non-)symmetries, properties pertaining to independence

(4)

1. Elementary Probability Notions

• sample space: Ω (either discrete or continuous)

• event: A ⊆ Ω

– the certain event: Ω – the impossible event: ∅

– elementary event: any {ω}, where ω ∈ Ω

• event space: F = 2 (or a subspace of 2 that contains ∅ and is closed under complement and countable union)

• probability function/distribution: P : F → [0,1] such that:

– P(Ω) = 1

– the “countable additivity” property:

∀A1, ..., Ak disjoint events, P(∪Ai) = P

P(Ai)

Consequence: for a uniform distribution in a finite sample space:

P(A) = #favorable elementary events

#all elementary events

(5)

Conditional Probability

• P(A | B) =

P (A ∩ B) P (B )

Note: P(A | B) is called the a posteriory probability of A, given B.

• The “multiplication” rule:

P(A ∩B) = P(A | B)P(B) = P(B | A)P(A)

• The “chain” rule:

P(A1 ∩ A2 ∩ . . .∩ An) =

P(A1)P(A2 | A1)P(A3 | A1, A2). . . P(An | A1, A2, . . . , An−1)

(6)

The “total probability” formula:

P (A) = P (A | B)P (B) + P (A | ¬ B )P ( ¬ B)

More generally:

if A ⊆ ∪Bi and ∀i 6= j Bi ∩Bj = ∅, then P(A) = P

i P(A | Bi)P(Bi)

Bayes’ Theorem:

P (B | A)

=

P (A | B) P (B) P (A)

or P(B | A) = P(A | B) P(B)

P(A | B)P(B) + P(A | ¬B)P(¬B) or ...

(7)

Independence of Probabilistic Events

• Independent events: P(A∩ B) = P(A)P(B)

Note: When P(B) 6= 0, the above definition is equivalent to P(A|B) = P(A).

• Conditionally independent events:

P(A ∩ B | C) = P(A | C)P(B | C), assuming, of course, that P(C) 6= 0.

Note: When P(B ∩ C) 6= 0, the above definition is equivalent to P(A|B, C) = P(A|C).

(8)

2. Random Variables 2.1 Basic Definitions

Let Ω be a sample space, and

P : 2 → [0,1] a probability function.

• A random variable of distribution P is a function X : Ω → Rn

◦ For now, let us consider n = 1.

◦ The cumulative distribution function of X is F : R → [0,∞) defined by F(x) = P(X ≤ x) = P({ω ∈ Ω | X(ω) ≤ x})

(9)

Definition: Let P : 2 → [0,1] be a probability function, and X be a random variable of distribution P.

• If Val(X) is either finite or unfinite countable, then X is called a discrete random variable.

◦ For such a variable we define the probability mass function (pmf) p : R → [0,1] as p(x) not.= p(X = x) def.= P({ω ∈ Ω | X(ω) = x}).

(Obviously, it follows that P

xi∈V al(X) p(xi) = 1.) Mean, Variance, and Standard Deviation:

• Expectation / mean of X: E(X) not.= E[X] = P

xxp(x) if X is a discrete random variable.

• Variance of X: Var(X) not.= Var[X] = E((X − E(X))2).

• Standard deviation: σ = p

Var(X).

Covariance of X and Y , two random variables of distribution P:

• Cov(X, Y) = E[(X −E[X])(Y −E[Y ])]

(10)

Exemplification:

• the Binomial distribution: b(r;n, p) = Cnr pr(1 −p)n−r for r = 0, . . . , n mean: np, variance: np(1 −p)

◦ the Bernoulli distribution: b(r; 1, p)

mean: p, variance: p(1 −p), entropy: −plog2 p −(1 − p) log2(1 − p)

0 10 20 30 40

0.000.050.100.150.200.25

r

b(r ; n, p)

0 10 20 30 40

0.000.050.100.150.200.25

Binomial probability mass function

p = 0.5, n = 20 p = 0.7, n = 20 p = 0.5, n = 40

0 10 20 30 40

0.00.20.40.60.81.0

r

F(r)

0 10 20 30 40

0.00.20.40.60.81.0

Binomial cumulative distribution function

p = 0.5, n = 20 p = 0.7, n = 20 p = 0.5, n = 40

(11)

2.3 Continuous Random Variables

Definitions:

Let P : 2 → [0,1] be a probability function, and X : Ω → R be a random variable of distribution P.

• If Val(X) is unfinite non-countable set, and

F, the cumulative distribution function of X is continuous, then X is called a continuous random variable.

(It follows, naturally, that P(X = x) = 0, for all x ∈ R.)

• If there exists p : R → [0,∞) such that F(x) = Rx

−∞p(t)dt, then X is called absolutely continuous.

In such a case, p is called the probability density function (pdf) of X.

◦ For B ⊆ R for which R

B p(x)dx exists, P(X−1(B)) = R

B p(x)dx, where X−1(B) not.= {ω ∈ Ω | X(ω) ∈ B}.

In particular, R+∞

−∞ p(x)dx = 1.

• Expectation / mean of X: E(X) not.= E[X] = R

xp(x)dx.

(12)

Exemplification:

• Normal (Gaussean) distribution: N(x;µ, σ) = √ 1

2πσ

e

(xµ)

2

2

mean: µ, variance: σ2

◦ Standard Normal distribution: N(x; 0,1)

• Remark:

For n, p such that np(1 − p) > 5, the Binomial distributions can be approximated by Normal distributions.

(13)

−4 −2 0 2 4

0.00.20.40.60.81.0

x Nµ,σ2(X=x)

−4 −2 0 2 4

0.00.20.40.60.81.0

Gaussian probability density function

µ = 0, σ = 0.2 µ = 0, σ = 1.0 µ = 0, σ = 5.0 µ = −2, σ = 0.5

−4 −2 0 2 4

0.00.20.40.60.81.0

x φµ2(x)

−4 −2 0 2 4

0.00.20.40.60.81.0

Gaussian cumulative distribution function

µ = 0, σ = 0.2 µ = 0, σ = 1.0 µ = 0, σ = 5.0 µ = −2, σ = 0.5

(14)

2.4 Basic Properties of Random Variables

Let P : 2 → [0,1] be a probability function,

X : Ω → Rn be a random discrete/continuous variable of distribution P.

• If g : Rn → Rm is a function, then g(X) is a random variable.

If g(X) is discrete, then E(g(X)) = P

xg(x)p(x).

If g(X) is continuous, then E(g(X)) = R

g(x)p(x)dx.

◦ If g is non-linear 6⇒ E(g(X)) = g(E(X)).

• E(aX) = aE(X).

• E(X + Y) = E(X) + E(Y), therefore E[Pn

i=1aiXi] = Pn

i=1aiE[Xi].

◦ Var(aX) = a2Var(X).

◦ Var(X + a) = Var(X).

• Var(X) = E(X2) − E2(X).

• Cov(X, Y) = E[XY ] −E[X]E[Y ].

(15)

2.5 Joint, Marginal and Conditional Distributions

Exemplification for the bi-variate case:

Let Ω be a sample space, P : 2 → [0,1] a probability function, and V : Ω → R2 be a random variable of distribution P.

One could naturally see V as a pair of two random variables X : Ω → R and Y : Ω → R. (More precisely, V (ω) = (x, y) = (X(ω), Y(ω)).)

• the joint pmf/pdf of X and Y is defined by

p(x, y) not.= pX,Y(x, y) = P(X = x, Y = y) = P(ω ∈ Ω | X(ω) = x, Y(ω) = y).

• the marginal pmf/pdf functions of X and Y are:

for the discrete case:

pX(x) = P

y p(x, y), pY(y) = P

xp(x, y) for the continuous case:

pX(x) = R

y p(x, y)dy, pY(y) = R

xp(x, y)dx

• the conditional pmf/pdf of X given Y is:

pX|Y(x | y) = pX,Y(x, y) pY(y)

(16)

2.6 Independence of Random Variables

Definitions:

• Let X, Y be random variables of the same type (i.e. either discrete or continuous), and pX,Y their joint pmf/pdf.

X and Y are said to be independent if

pX,Y(x, y) = pX(x) · pY(y)

for all possible values x and y of X and Y respectively.

• Similarly, let X, Y and Z be random variables of the same type, and p their joint pmf/pdf.

X and Y are conditionally independent given Z if pX,Y|Z(x, y | z) = pX|Z(x | z) · pY|Z(y | z)

for all possible values x, y and z of X, Y and Z respectively.

(17)

Properties of random variables pertaining to independence

• If X, Y are independent, then

Var(X + Y ) = Var(X) + Var(Y).

• If X, Y are independent, then

E(XY) = E(X)E(Y ), i.e. Cov(X, Y ) = 0.

◦ Cov(X, Y) = 0 6⇒ X, Y are independent.

◦ The covariance matrix corresponding to a vector of random variables is symmetric and positive semi-definite.

• If the covariance matrix of a multi-variate Gaussian distribution is diagonal, then the marginal distributions are independent.

(18)

3. Limit Theorems

[ Sheldon Ross, A first course in probability, 5th ed., 1998 ]

“The most important results in probability theory are limit theo- rems. Of these, the most important are...

laws of large numbers, concerned with stating conditions under which the average of a sequence of random variables converges (in some sense) to the expected average;

central limit theorems, concerned with determining the conditions under which the sum of a large number of random variables has a probability distribution that is approximately normal.”

(19)

Two Probability Bounds

Markov’s inequality:

If X is a random variable that takes only non-negative values, then for any value a > 0,

P(X ≥ a) ≤ E[X] a

Chebyshev’s inequality:

If X is a random variable with finite mean µ and variance σ2, then for any value a > 0,

P(| X −µ |≥ a) ≤ σ2 a2.

Note: As Chebyshev’s inequality is valid for all distributions of the random variable X, we cannot expect the bound of the probability to be very close to the actual probability in most cases. (See ex. 2b, pag.

397 in Ross’ book.)

(20)

The weak law of large numbers

[ Bernoulli; Khintchine ]

Let X1, X2, . . . , Xn be a sequence of independent and identically distributed random variables, each having a finite mean E[Xi] = µ.

Then, for any value ǫ > 0, P

X1 + . . . + Xn

n −µ

≥ ǫ

→ 0 as n → ∞.

(21)

The central limit theorem for i.i.d. random variables

[ Pierre Simon, Marquis de Laplace; Liapunoff in 1901-1902 ] Let X1, X2, . . . , Xn be a sequence of independent random variables, each having finite mean µ and finite variance σ2.

Then the distribution of

X1 + . . . + Xn −nµ σ √

n

tends to be the standard normal (Gaussian) as n → ∞. That is, for −∞ < a < ∞,

P

X1 + . . . + Xn − nµ σ √

n ≤ a

→ 1

√2π Z a

−∞

e−x2/2dx as n → ∞

(22)

The central limit theorem

for independent random variables

Let X1, X2, . . . , Xn be a sequence of independent random variables having respective means µi and variances σi2.

If

(a) the variables Xi are uniformly bounded,

i.e. for some M ∈ R+ P(| Xi |< M) = 1 for all i, and

(b) P

i=1σi2 = ∞, then

P

Pn

i=1(Xi −µi) pPn

i=1σi2 ≤ a

!

→ Φ(a) as n → ∞

where Φ is the cumulative distribution function for the standard normal (Gaussian) distribution.

(23)

The strong law of large numbers

Let X1, X2, . . . , Xn be a sequence of independent and identically distributed random variables, each having a finite mean E[Xi] = µ.

Then, with probability 1,

X1 + . . . + Xn

n → µ as n → ∞ That is,

P

n→∞lim (X1 + . . . + Xn)/n = µ

= 1

(24)

Other Probability Bounds

One-sided Chebyshev inequality:

If X is a random variable with mean 0 and finite variance σ2, then for any a > 0,

P(X ≥ a) ≤ σ2 σ2 + a2 Corollary:

If E[X] = µ, Var(X) = σ2, then for a > 0 P(X ≥ µ + a) ≤ σ2

σ2 + a2 P(X ≤ µ− a) ≤ σ2

σ2 + a2

(25)

Other Probability Bounds (cont’d)

Chernoff bounds:

If X is a random variable, then M(t) not= E[etX], is called the moment generating function of X.

It can be shown that

P(X ≥ a) ≤ e−taM(t) for all t > 0 P(X ≤ a) ≤ e−taM(t) for all t < 0.

Chernoff bounds for the standard normal distribution:

If Z is a standard normal random variable, then M(t) not.= E[etX] calculus= et2/2.

It can be shown that

P(Z ≥ a) ≤ e−a2/2 for all a > 0 P(Z ≤ a) ≤ e−a2/2 for all a < 0.

(26)

Other Probability Bounds (cont’d)

Hoeffding bounds:

Let X1, . . . , Xn be some independent random variables, each Xi being bounded by the interval [ai, bi].

If X¯ not.= n1 Pn

i=1 Xi, then it follows that for any t ≥ 0

P( ¯X − E[ ¯X] ≥ t) ≤ exp − 2n2t2 Pn

i=1 (bi − ai)2

!

P(E[ ¯X] − X¯ ≥ t) ≤ exp

− 2n2t2 Pn

i=1(bi − ai)2

⇒ P(|X¯ − E[ ¯X]| ≥ t) ≤ 2 exp

− 2n2t2 Pn

i=1(bi − ai)2

.

(27)

4. Estimation/inference of the parameters of probabilistic models from data

(based on [Durbin et al, Biological Sequence Analysis, 1998], p. 311-313, 319-321)

A probabilistic model can be anything from a simple distribution to a complex stochastic grammar with many implicit probability distributions. Once the type of the model is chosen, the parame- ters have to be inferred from data.

We will first consider the case of the categorical distribution, and then we will present the different strategies that can be used in general.

(28)

A case study: Estimation of the parameters of a categorical distribution from data

Assume that the observations — for example, when rolling a die about which we don’t know whether it is fair or not, or when counting the number of times the amino acid i occurs in a column of a multiple sequence align- ment — can be expressed as counts ni for each outcome i (i = 1, l . . . , K), and we want to estimate the probabilities θi of the underlying distribution.

Case 1:

When we have plenty of data, it is natural to use

the maximum likeli- hood (ML) solution,

i.e. the observed frequency θiML = ni

P

j nj

not.= ni N. Note: it is easy to show that indeed P(n | θML) > P(n | θ) for any θ 6= θML.

ln P(n | θML)

P(n | θ) = ln ΠiiML)ni

Πiθini = X

i

niln θiML

θi = N X

i

θiMLln θMi L θi > 0

The inequality follows from the fact that the relative entropy is always positive except when the two distributions are identical.

(29)

When the data is scarce, it is not clear what is the best estimate.

In general, we should use prior knowledge, via Bayesian statistics.

For instance, one can use the Dirichlet distribution with parameters α.

P | n) = P(n | θ)D | α) P(n)

It can be shown (see calculus on R. Durbin et. al. BSA book, pag. 320) that

the posterior mean estimation (PME)

of the parameters is

θPMEi def.= Z

θP(θ | n)dθ = ni + αi N + P

j αj

The αs are like pseudocounts added to the real counts. (If we think of the αs as extra observations added to the real ones, this is precisely the ML estimate!) This makes the Dirichlet regulariser very intuitive.

How to use the pseudocounts: If it is fairly obvious that a certain residue, let’s say i, is very common, than we should give it a very high pseudocount αi; if the residue j is generaly rare, we should give it a low pseudocount.

(30)

Strategies to be used in the general case

A. The Maximum Likelihood (ML) Estimate

When we wish to infer the parameters θ = (θi) for a model M from a set of data D, the most obvious strategy is to maximise P(D | θ, M) over all possible values of θ. Formally:

θML = argmax

θ

P(D | θ, M)

Note: Generally speaking, when we treat P(x | y) as a function of x (and y is fixed), we refer to it as a probability. When we treat P(x | y) as a function of y (and x is fixed), we call it a likelihood. Note that a likelihood is not a probability distribution or density; it is simply a function of the variable y.

A serious drawback of maximum likelihood is that it gives poor results when data is scarce. The solution then is to introduce more prior knowl- edge, using Bayes’ theorem. (In the Bayesian framework, the parameters are themselves seen as random variables!)

(31)

B. The Maximum A posteriori Probability (MAP) Estimate

θMAP d=ef. argmax

θ

P | D, M) = argmax

θ

P(D | θ, M)P | M) P(D | M)

= argmax

θ

P(D | θ, M)P | M)

The prior probability P(θ | M) has to be chosen in some reasonable manner, and this is the art of Bayesian estimation (although this freedom to choose a prior has made Bayesian statistics controversial at times...).

C. The Posterior Mean Estimator (PME)

θPME = Z

θP(θ | D, M)dθ

where the integral is over all probability vectors, i.e. all those that sum to one.

D. Yet another solution is to use the posterior probability P(θ | D, M) to

sample

from it (see [Durbin et al, 1998], section 11.4) and thereby locate regions of high probability for the model parameters.

(32)

5. Elementary Information Theory Definitions:

Let X and Y be discrete random variables.

• Entropy: H(X) def.= P

xp(x) log2 1

p(x) = −P

x p(x) log2 p(x) = Ep[−log2 p(X)].

Convention: if p(x) = 0 then we shall consider p(x) log2p(x) = 0.

• Specific Conditional entropy: H(Y | X = x) def.= −P

y∈Y p(y | x) log2 p(y | x).

• Average conditional entropy:

H(Y | X) def.= P

x∈X p(x)H(Y | X = x) imed.= −P

x∈X

P

y∈Y p(x, y) log2 p(y | x).

• Joint entropy:

H(X, Y)def.= −P

x,y p(x, y) log2 p(x, y)dem.= H(X)+H(Y |X)dem.= H(Y )+H(X|Y ).

• Information gain (or: Mutual information):

IG(X;Y ) def.= H(X) − H(X | Y ) imed.= H(Y) − H(Y | X)

imed.

= H(X, Y ) − H(X | Y) − H(Y | X) = IG(Y;X).

(33)

Exemplification: Entropy of a Bernoulli Distribution

H(p) = −plog2 p −(1 − p) log2(1 − p)

0.0 0.2 0.4 0.6 0.8 1.0

0.00.20.40.60.81.0

p

0.0 0.2 0.4 0.6 0.8 1.0

0.00.20.40.60.81.0

(34)

Basic properties of

Entropy, Conditional Entropy, Joint Entropy and Information Gain / Mutual Information

• 0 ≤ H(p1, . . . , pn) ≤ H 1

n, . . . , 1 n

= log2 n;

H(X) = 0 iff X is a constant random variable.

• IG(X;Y) ≥ 0;

IG(X;Y) = 0 iff X and Y are independent;

IG(X;X) = H(X).

• H(X | Y ) ≤ H(X)

H(X | Y ) = H(X) iff X and Y are independent.

• H(X, Y) ≤ H(X) + H(Y );

H(X, Y) = H(X) + H(Y ) iff X and Y are independent;

H(X, Y|A) = H(X|A) + H(Y |A) (a conditional form).

• a chain rule: H(X1, . . . , Xn) = H(X1)+H(X2|X1)+. . .+H(Xn|X1, . . . , Xn−1).

(35)

The Relationship between

Entropy, Conditional Entropy, Joint Entropy and Information Gain

H(X|Y) H(Y|X) H(X,Y)

H(X) H(Y)

IG(X;Y)

(36)

Other definitions

• Let X be a discrete random variable, p its pmf and q another pmf (usually a model of p).

Cross-entropy:

CH(X, q) = −X

x∈X

p(x) log2 q(x) = Ep

log2 1 q(X)

• Let X and Y be discrete random variables, and p and q their respective pmf’s.

Relative entropy (or, Kullback-Leibler divergence):

KL(p || q) = −X

x∈X

p(x) log2 q(x)

p(x) = Ep

log2 p(X) q(X)

= CH(X, q) − H(X).

(37)

The Relationship between

Entropy, Conditional Entropy, Joint Entropy, Information Gain,

Cross-Entropy and Relative Entropy (or KL divergence)

H(p )

X

CH(X, p )

Y

CH(Y, p )

X

X

KL(p || p )

Y

H(p )

Y Y

KL(p || p )

X

H(X|Y) H(Y|X)

H(X,Y)

Y XY X

IG(X,Y) = KL(p || p p )

(38)

Basic properties of

cross-entropy and relative entropy

◦ CH(X, q) ≥ 0

• KL(p || q) ≥ 0 for all p and q;

KL(p || q) = 0 iff p and q are identical.

◦ [Consequence:]

If X is a discrete random variable, p its pmf, and q another pmf, then CH(X, q) ≥ H(X) ≥ 0.

The first of these two inequations is also known as Gibbs’ inequation:

−Pn

i=1 pi log2 pi ≤ −P

i pi log2 qi.

◦ Unlike H of a discrete n-ary variable, which is bounded by log2 n, there is no (general) upper bound for CH. (However, KL is upper-bounded.)

• Unlike H(X, Y ), which is symmetric in its arguments, CH and KL are not! Therefore KL is NOT a distance metric! (See the next slide.)

• IG(X;Y) = KL(pX,Y || pX pY) = −P

x

P

y p(x, y) log2

p(x)p(y) p(x, y)

.

(39)

Remark

◦ The quantity

V I(X, Y ) def= H(X, Y ) − IG(X;Y) = H(X) + H(Y ) − 2IG(X;Y)

= H(X | Y ) + H(Y | X)

known as variation of information, is a distance metric, i.e. it is nonengative, symmetric, implies indiscernability, and satisfies the tri- angle inequality.

◦ Consider M(p, q) = 1

2(p + q).

The function JSD(p||q) = 1

2KL(p||M) + 1

2KL(q||M) is called the Jensen- Shannon divergence.

One can prove that p

JSD(p||q) defines a distance metric (the Jensen- Shannon distance).

(40)

6. Recommended Exercises

• From [Manning & Sch¨ utze, 2002 , ch. 2:]

Examples 1, 2, 4, 5, 7, 8, 9 Exercises 2.1, 2.3, 2.4, 2.5

• From [Sheldon Ross, 1998 , ch. 8:]

Examples 2a, 2b, 3a, 3b, 3c, 5a, 5b

(41)

Addenda:

Other Examples of Probabilistic Distributions

(42)

Multinomial distribution:

generalises the binomial distribution to the case where there are K inde- pendent outcomes with probabilities θi, i = 1, . . . , K such that PK

i=1 θi = 1.

The probability of getting ni occurrence of outcome i is given by P(n | θ) = n!

ΠKi=1(ni!)ΠKi=1θini, where n = n1 + . . .+ nK, and θ = (θ1, . . . , θK).

Note:

The particular case n = 1 represents the

categorical distribu- tion

. This is a generalisation of the Bernoulli distribution.

Example:

The outcome of rolling a die n times is described by a categor- ical distribution. The probabilities of each of the 6 outcomes are θ1, . . . , θ6. For a fair die, θ1 = . . . = θ6, and the probability of rolling it 12 times and getting each outcome twice is:

12!

(2!)6 1

6 12

= 3.4 ×10−3

(43)

Poisson distribution

(or, Poisson law of small numbers):

p(k;λ) = λk

k! · e−λ, with k ∈ N and parameter λ > 0.

Mean = variance = λ.

0 5 10 15 20

0.00.10.20.30.4

k

P(X=k)

0 5 10 15 20

0.00.10.20.30.4

Poisson probability mass function

λ = 1 λ = 4 λ = 10

0 5 10 15 20

0.00.20.40.60.81.0

k

P(X k)

0 5 10 15 20

0.00.20.40.60.81.0

Poisson cumulative distribution function

λ = 1 λ = 4 λ = 10

(44)

Exponential distribution

(a.k.a. negative exponential distribution):

p(x;λ) = λe−λx for x ≥ 0 and parameter λ > 0.

Mean = λ−1, variance = λ−2.

0 1 2 3 4 5

0.00.51.01.5

x

p(x)

0 1 2 3 4 5

0.00.51.01.5

Exponential probability density function

λ = 0.5 λ = 1 λ = 1.5

0 1 2 3 4 5

0.00.20.40.60.81.0

x

P(X x)

0 1 2 3 4 5

0.00.20.40.60.81.0

Exponential cumulative distribution function

λ = 0.5 λ = 1 λ = 1.5

Note: The Exponential distribution is a particular case of the Gamma distribution (take k= 1 in the next slide).

(45)

p(x;k, θ) = xk−1 e−x/θ

Γ(k)θk for x 0 and parameters k > 0 (shape) and θ > 0 (scale).

Mean = kθ, variance = 2.

The gamma function is a generalisation of the factorial function to real values. For any positive real number x, Γ(x+ 1) = xΓ(x). (Thus, for integers Γ(n) = (n1)!.)

0 5 10 15 20

0.00.10.20.30.40.5

x

p(x)

0 5 10 15 20

0.00.10.20.30.40.5

Gamma probability density function

k = 1.0, θ = 2.0 k = 2.0, θ = 2.0 k = 3.0, θ = 2.0 k = 5.0, θ = 1.0 k = 9.0, θ = 0.5 k = 7.5, θ = 1.0 k = 0.5, θ = 1.0

0 5 10 15 20

0.00.20.40.60.81.0

x

P(X x)

0 5 10 15 20

0.00.20.40.60.81.0

Gamma cumulative distribution function

k = 1.0, θ = 2.0 k = 2.0, θ = 2.0 k = 3.0, θ = 2.0 k = 5.0, θ = 1.0 k = 9.0, θ = 0.5 k = 7.5, θ = 1.0 k = 0.5, θ = 1.0

(46)

χ

2

distribution:

p(x;ν) = 1 Γ(ν/2)

1 2

ν/2

xν/2−1e12x for x 0 and ν a positive integer.

It is obtained from Gamma distribution by taking k = ν/2 and θ = 2.

Mean = ν, variance = .

0 2 4 6 8

0.00.10.20.30.40.5

x

p(x)

0 2 4 6 8

0.00.10.20.30.40.5

Chi Squared probability density function

k = 1 k = 2 k = 3 k = 4 k = 6 k = 9

0 2 4 6 8

0.00.20.40.60.81.0

x

P(X x)

0 2 4 6 8

0.00.20.40.60.81.0

Chi Squared cumulative distribution function

k = 1 k = 2 k = 3 k = 4 k = 6 k = 9

(47)

Laplace distribution:

p(x;µ, θ) = 1 e

|µ x|

θ , with θ > 0.

Mean = µ, variance = 2.

−10 −5 0 5 10

0.00.10.20.30.40.5

x

p(x)

−10 −5 0 5 10

0.00.10.20.30.40.5

Laplace probability density function

µ = 0, θ = 1 µ = 0, θ = 2 µ = 0, θ = 4 µ = −5, θ = 4

−10 −5 0 5 10

0.00.20.40.60.81.0

x

P(X x)

−10 −5 0 5 10

0.00.20.40.60.81.0

Laplace cumulative density function

µ = 0, θ = 1 µ = 0, θ = 2 µ = 0, θ = 4 µ = −5, θ = 4

(48)

Student’s distribution:

p(x;ν) = Γ

ν + 1 2

νπ Γν 2

1 + x2 ν

ν + 1

2 for x R and ν > 0 (the “degree of freedom” param.)

Mean = 0 for ν > 1, otherwise undefined.

Variance = ν

ν 2 for ν > 2, for 1 < ν 2, otherwise undefined.

The probability density function and the cumulative distribution function:

Note [from Wiki]: The t-distribution is symmetric and bell-shaped, like the normal distribution, but it has havier tails, meaning that it is more prone to producing values that fall far from its mean.

(49)

Beta distribution:

p(θ;α, β) = θα−1(1 −θ)β−1 B(α, β) ,

where B(α, β) is the Beta function of arguments α, β ∈ R+

B(α, β) = Γ(α)Γ(β) Γ(α+ β),

with Γ(x) = (x −1)! for any x ∈ N.

Beta distribution: p.d.f.

(50)

Dirichlet distribution:

D(θ | α) = 1

Z(α)ΠKi=1θiαi−1δ(PK

i=1 θi −1) where

α = α1, . . . , αK with αi > 0 are the parameters,

θi satisfy 0 ≤ θi ≤ 1 and sum to 1, this being indicated by the delta function term δ(P

iθi − 1), and

the normalising factor can be expressed in terms of the gamma function:

Z(α) = R

ΠKi=1θαi i−1δ(P

i −1)dθ = ΠiΓ(αi) Γ(P

i αi) Mean of θi: αi

P

j αj.

For K = 2, the Dirichlet distribution reduces to the Beta distribution.

(51)

Remark:

Concerning the multinomial and Dirichlet distributions:

The algebraic expression for the parameters θi is similar in the two distri- butions.

However, the multinomial is a distribution over its exponents ni, whereas the Dirichlet is a distribution over the numbers θi that are exponentiated.

The two distributions are said to be

conjugate distributions

and their close formal relationship leads to a harmonious interplay in many estima- tion problems.

Similarly,

the Beta distribution is the conjugate of the Bernoulli distribution, and the Gamma distribution is the conjugate of the Poisson distribution.

(52)

Notations

Let n = the number of pages on Internet, and H and A two n ×n matrices defined by

hij =

1 if page j points to page i (notation: Pj ∈ Bi) 0 otherwise

aij =

1 if page i contains no outgoing links 0 otherwise

α ∈ [0; 1] (this is a parameter that was initially set to 0.85) The transition matrix of the Google Markov Chain is

G = α(H + A) + 1 − α n · 1

where 1 is the n ×n matrix whose entries are all 1

(53)

The significance of G is derived from:

• the Random Surfer model

• the definition the (relative) importance of a page: com- bining votes from the pages that point to it

I (P

i

) = X

PjBi

I (P

j

) l

j

where l

j

is the number of links pointing out from P j .

(54)

The PageRank algorithm

[Brin & Page, 1998]

G is a stochastic matrix (gij ∈ [0; 1], Pn

i=1 gij = 1),

therefore λ1 the greatest eigenvalue of G is 1, and G has a stationary vector I (i.e., GI = I).

G is also primitive (| λ2 |< 1, where λ2 is the second eigenvalue of G) and irreducible (I > 0).

From the matrix calculus it follows that

I can be computed using the power method:

if I1 = GI0, I2 = GI1, . . . , Ik = GIk−1 then Ik → I. I gives the relative importance of pages.

(55)

Suggested readings

“Using Google’s PageRank algorithm to identify important attributes of genes”, G.M. Osmani, S.M. Rahman, 2006

(56)

ADDENDA

Formalisation of HMM algorithms in

“Biological Sequence Analysis” [ Durbin et al, 1998 ]

Note

A begin state was introduced. The transition probability a0k from this begin state to state k can be thought as the probability of starting in state k.

An end state is assumed, which is the reason for ak0 in the termination step.

If ends are not modelled, this ak0 will disappear.

For convenience we label both begin and end states as 0. There is no conflict because you can only transit out of the begin state and only into the end state, so variables are not used more than once.

The emission probabilities are considered independent of the origin state.

(Thus te emission of (pairs of ) symbols can be seen as being done when reaching the non-end states.) The begin and end states are silent.

(57)

Forward:

1. Initialization (i = 0): f

0

(0) = 1; f

k

(0) = 0, for k > 0 2. Induction (i = 1 . . . L): f

l

(i) = e

l

(x

i

) P

k

f

k

(i − 1)a

kl

3. Total: P (x) = P

k

f

k

(L)a

k0

. Backward:

1. Initialization (i = L): b

k

(L) = a

k0

, for all k 2. Induction (i = L − 1, . . . , 1: b

k

(i) = P

l

a

kl

e

l

(x

i+1

)b

l

(i + 1) 3. Total: P (x) = P

l

a

0l

e

l

(x

1

)b

l

(1)

Combining f and b: P (π

k

, x) = f

k

(i)b

k

(i)

(58)

Viterbi:

1. Initialization (i = 0): v

0

(0) = 1; v

k

(0) = 0, for k > 0 2. Induction (i = 1 . . . L):

v

l

(i) = e

l

(x

i

) max

k

(v

k

(i − 1)a

kl

);

ptr

i

(l) = argmax

k

v

k

(i − 1)a

kl

)

3. Termination and readout of best path:

P (x, π

) = max

k

(v

k

(L)a

k0

);

π

L

= argmax

k

v

k

(L)a

k0

, and π

i1

= ptr

i

i

), for i = L . . . 1.

(59)

Baum-Welch:

1. Initialization: Pick arbitrary model parameters 2. Induction:

For each sequence j = 1. . . n calculate fkj(i) and bjk(i) for sequence j using the forward and respectively backward algorithms.

Calculate the expected number of times each transition of emission is used, given the training sequences:

Akl = X

j

1 P(xj)

X

i

fkj(i)aklel(xji+1)bjl(i+ 1)

Ekl = X

j

1 P(xj)

X

{i|xji=b}

fkj(i)bjk(i)

Calculate the new model parameters:

akl = Akl P

l Akl and ek(b) = Ek(b) P

b Ek(b) Calculate the new log likelihood of the model.

3. Termination:

Stop is the change in log likelihood is less than some predefined threshold or the maximum number of iterations is exceeded.

Referințe

DOCUMENTE SIMILARE

The study results concluded that the ACEIs are effective in lowering systolic and diastolic blood pressure, they reduce global cardiovascular risk through

Finally, thedeterministic model is converted to stochastic model and the results ofstochastic model are compared with corresponding deterministic model.. It is

These factors should be envisaged by Romanian society when trying to make both the minority and the majority cooperate in order to avoid the transformation of the Roma

Key words: Greta Thunberg, Public Appearance, Protestant Ethics, Martin Luther, Self- Restraint, Public Responsibility, Cultural Patterns, Modern and Postmodern

The distinction between first order truth claims and second order grammatical reflection stems from the application of the linguistic metaphor to religion and from allowing

If the genus of X is greater than 2, eventually after a finite base change, the stable model X is obtained from the minimal regular (semistable) model by contracting the

maximally non-Gaussian linear combinations of the ob- served data x..

Given a single observation sequence for training, we want to find the model (parameters) µ = (A, B, π) that best explains the observed data.. Under Maximum Likelihood Estimation,

For the calibration of the connection between the integral intensity of the band “amide I” and the acetylation degree, five mixes were prepared with different proportions of

By a simple wet chemical method, using different precipitating agents KOH, NaOH and (CH 2 ) 6 N 4 ) and a structure-directing agent (gum arabic) we obtain ZnO structures with complex

Keywords: Electric arc furnace, random differential equation, Ornstein-Uhlenbeck process, Monte Carlo method, polynomial chaos

It will also discuss several free trade agreements that are in effect in the region as well as efforts by the country to join the Trans-Pacific Partnership (TPP) and the

Communication model proposed, Figure 1, is a holistic model based on the correlation between a series of factors which can influence in a favorable way

In order to describe the economic implications of a prolonged military rivalry, we have constructed a nonlinear dynamical model that merges the classical Richardson arms race

De¸si ˆın ambele cazuri de mai sus (S ¸si S ′ ) algoritmul Perceptron g˘ ase¸ste un separator liniar pentru datele de intrare, acest fapt nu este garantat ˆın gazul general,

Thus, if Don Quixote is the idealist, Casanova the adventurous seducer, Werther the suicidal hero, Wilhelm Meister the apprentice, Jesus Christ will be, in the audacious and

– Players, Objectives, Procedures, Rules, Resources, Conflict, Boundaries, Outcome. •

According to our previous investigations it seems that tolerance, whether regarded as a political practice or a philosophical or moral principle, is a strategy (or tactics) of one

the internal memory has a limited size of M words, the external memory can be accessed using I/Os that move B contiguous words between the two memories.. I the measure of

However, the sphere is topologically different from the donut, and from the flat (Euclidean) space.. Classification of two

The number of vacancies for the doctoral field of Medicine, Dental Medicine and Pharmacy for the academic year 2022/2023, financed from the state budget, are distributed to

- aligners for minor dental changes: small dental corrections after an orthodontic treatment with a fixed appliance/ patients with a stable occlusion but a slightly

Subsequently, it can presume that the proposed factual model is can be utilized as a pre-handling model for different computerized picture preparing strategy to improve