f-divergences and Monte Carlo methods

Academic year: 2022


Ph.D. candidate OLARIU Emanuel Florentin

Advisor Professor LUCHIAN Henri


Contents

1 Introduction 3

2 Monte Carlo and Importance Sampling 5

2.1 Monte Carlo Integration . . . 6

2.2 Importance Sampling . . . 7

2.3 f-divergences and IS . . . 9

2.3.1 Kullback-Leibler divergence . . . 10

2.3.2 Pearson divergence . . . 10

2.3.3 Rényi divergence . . . 11

2.4 Sampling Importance Resampling . . . 11

3 Option pricing. Importance sampling and f-divergences 13

3.1 Spread options . . . 15

3.1.1 Kullback-Leibler divergence . . . 15

3.1.2 Pearson divergence . . . 16

3.1.3 Relative information . . . 16

3.1.4 Least squares problem and IS . . . 16

3.1.5 Numerical results . . . 18

3.2 American options . . . 18

3.2.1 Pricing Bermudan options . . . 19

3.2.2 The algorithm . . . 20

3.2.3 Models used . . . 23

3.2.4 Numerical experiments and conclusions . . . 25

4 Rare-events probabilities and optimization 27

4.1 Minimizing the Rényi divergence . . . 28

4.1.1 Convex optimization problem . . . 29

4.1.2 Lagrangian analysis . . . 31

4.2 Estimation of rare-events probabilities . . . 33

4.2.1 Numerical results . . . 35

4.3 Optimization . . . 37


5 Measure the similarity using divergences 41

5.1 The data-generator . . . 43

5.1.1 Syntactic pattern recognition . . . 43

5.1.2 Generating data . . . 44

5.2 Measuring similarities . . . 44

5.2.1 Feature extraction . . . 45

5.2.2 Similarity measures . . . 45

5.3 Numerical results . . . 47

6 Conclusions 49

A Spread options 55

B American options 57

Bibliography 59


Chapter 1

Introduction

The problem addressed by this thesis broadly concerns the use of f-divergences, mainly for variance reduction in Monte Carlo (MC) integration. An f-divergence is a particular type of measure of the discrepancy between two probability distributions. MC is a classical randomized method for solving various types of problems for which no analytical solutions are known; it is based on generating samples from particular distributions. Hence the problem of comparing distributions comes up naturally.

After the name of one of the first researchers who studied them, they are also called Csiszár divergences; they are generated by convex functions. More generally, a divergence measure is a function of two probability density (or distribution) functions which takes nonnegative values and becomes zero only when the two arguments (distributions) are the same. Often a divergence is not a symmetric function, but it can be easily symmetrized.

There are many techniques for reducing the variance of the MC estimator, and one of these is Importance Sampling (IS). Monte Carlo, f-divergences, and various directions of variance minimization for IS and MC estimators are described in more detail in Chapter 2.

We use the MC method in two ways: for pricing financial derivatives known as options, and for estimating rare-event probabilities. Both applications are again linked with the use of f-divergences.

In Chapter 3 we develop some techniques for pricing two option styles.

Spread European options are valued using IS and minimizing various divergences; this approach is compared with the least squares method for direct variance minimization. Bermudan options are priced using a modified version of the MRAS algorithm, involving sampling importance resampling following the reference distributions from the standard algorithm.

The problem of estimating rare-event probabilities appears frequently in the analysis of the performance of communication systems (e.g. the probability of failure of a network system). The IS problem for this estimation consists of increasing the frequency of rare events by changing to a more important distribution. We introduce a new algorithm for such an estimation based on the Rényi divergence instead of the Kullback-Leibler divergence (the cross-entropy method). This algorithm, with its stochastic counterpart and a version for solving continuous optimization problems, is presented in Chapter 4.

The last chapter concerns means for measuring the similarity of time series. Time series are common in various fields of science: medicine, multimedia, computational finance etc., and synthetic datasets are used in prediction and computer simulations.

Numerical experiments show that we can measure similarity with great accuracy by using simple instruments like mean similarity and symmetrized divergences; these tools are easier to compute than the usual features, which include common statistics, extremal points, slope and filtered data.


Chapter 2

Monte Carlo and Importance Sampling

The Monte Carlo method (see [21], [27], [46]), although used in the beginning for stochastic simulation only, covers today a wide range of problems which can benefit from randomness and adjacent properties. Generally speaking, any technique which approaches a problem using a "large" number of random samples for various computations takes this famous name.

This method is intended to solve problems for which deterministic/analytic approaches are not available, or give poor results.

This method relies heavily on computer-generated numbers which are not truly random, being produced by deterministic mechanisms. Many techniques have been developed in order to obtain unrelated (pseudo-random) or low-discrepancy (quasi-random) sequences of numbers (see [19], [34]).

Applications of the Monte Carlo method range from the physical sciences, biology and medicine to weather forecasting, risk management and pricing financial derivative instruments (see [12], [17], [23]). We emphasize here a common use in mathematics, namely Monte Carlo integration, which is, perhaps, the first technique to properly receive this name.


2.1 Monte Carlo Integration

Many problems which arise in a variety of applications can be described as the evaluation of the expected value of a given random variable. Let $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), \lambda)$ be the Lebesgue measure space. Suppose that $X$ is an $n$-dimensional random variable (random vector) and $f$ is a probability density function (pdf), i.e., it has the following properties:

$$f : \mathbb{R}^n \to \mathbb{R}_+, \qquad \int_{\mathbb{R}^n} f(s)\, d\lambda(s) = 1. \tag{2.1}$$

Let $H : \mathbb{R}^n \to \mathbb{R}_+$ be (at least) a Lebesgue measurable¹ (or, integrable) function. We want to calculate the expectation of $H(X)$:

$$m = \mathbb{E}_f[H(X)] = \int_{\mathbb{R}^n} H(s) f(s)\, d\lambda(s). \tag{2.2}$$

If $\mathbf{X} = (X_i)_{1 \le i \le N}$ are independent and identically distributed (i.i.d.) samples from $f$, then, by the Law of Large Numbers,

$$m_N(\mathbf{X}) = \frac{1}{N} \sum_{i=1}^{N} H(X_i) \tag{2.3}$$

is an unbiased estimator of $m$ from (2.2): $\mathbb{E}[m_N(\mathbf{X})] = m$. The variance of this estimator is

$$\mathrm{Var}[m_N(\mathbf{X})] = \frac{1}{N} \int_{\mathbb{R}^n} [H(s) - m]^2 f(s)\, d\lambda(s) = \frac{1}{N} \mathrm{Var}_f[H(X)]. \tag{2.4}$$

The above variance is a measure of Monte Carlo efficiency and becomes one of the limitations of this method: in order to halve the standard deviation of $m_N(\mathbf{X})$, one has to quadruple the number of samples. A number of techniques are used to reduce the variance, among them Antithetic Variables, Control Variates and Importance Sampling. All these methods aim to reduce the variance without increasing the number of samples. A big part of our work concerns the Importance Sampling method for variance reduction.

¹ $\lambda$ is the Lebesgue measure on $\mathbb{R}^n$.
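The estimator (2.3) is easy to sketch in code. The following Python fragment is an illustration only (not part of the thesis): it estimates $m = \mathbb{E}[H(X)]$ for the hypothetical choice $H(x) = x^2$ with $X \sim N(0,1)$, whose exact value is $m = 1$.

```python
import random

def mc_estimate(H, sampler, N, seed=0):
    """Plain Monte Carlo estimator (2.3): m_N = (1/N) * sum_i H(X_i), X_i ~ f."""
    rng = random.Random(seed)
    return sum(H(sampler(rng)) for _ in range(N)) / N

# Illustrative choice: H(x) = x^2 with X ~ N(0, 1), so m = E[X^2] = 1 exactly.
estimate = mc_estimate(lambda x: x * x, lambda rng: rng.gauss(0.0, 1.0), N=200_000)
```

With $N = 200{,}000$ samples the estimator's standard deviation is $\sqrt{2/N} \approx 0.003$, so the estimate lands very close to 1.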


2.2 Importance Sampling

Importance sampling is a technique for estimating a parameter of a distribution while sampling from a different distribution. This means choosing a (possibly better known, or easier to simulate) distribution from which to draw one's random variables. Let $f$ be the original pdf and $g$ another pdf such that $\mathrm{supp}(H \cdot f) \subseteq \mathrm{supp}(g)$. We can rewrite (2.2):

$$\mathbb{E}_f[H(X)] = \int_{\mathbb{R}^n} \left[ H(s)\, \frac{f(s)}{g(s)} \right] g(s)\, d\lambda(s) = \mathbb{E}_g\!\left[ H(X)\, \frac{f(X)}{g(X)} \right]. \tag{2.5}$$

Using (2.5) we get another unbiased estimator of $m$:

$$m_N(g, \mathbf{X}) = \frac{1}{N} \sum_{i=1}^{N} H(X_i)\, \frac{f(X_i)}{g(X_i)}, \tag{2.6}$$

where $\mathbf{X} = (X_i)_{1 \le i \le N}$ are i.i.d. samples from $g$, which is known as the importance sampling (IS) distribution. The idea behind Importance Sampling is to draw from another distribution ($g$) and then modify the result to correct the bias introduced in this way.

The variance of this estimator is

$$\mathrm{Var}_g[m_N(g, \mathbf{X})] = \frac{1}{N} \int_{\mathbb{R}^n} \left[ H(s)\, \frac{f(s)}{g(s)} - m \right]^2 g(s)\, d\lambda(s).$$

Minimizing this variance is equivalent to

$$\min_{g \in \mathcal{G}} \mathbb{E}_f\!\left[ H^2(X)\, \frac{f(X)}{g(X)} \right], \tag{2.7}$$

and its stochastic counterpart is

$$\min_{g \in \mathcal{G}} \sum_{j=1}^{M} H^2(Y_j)\, \frac{f(Y_j)}{g(Y_j)}, \tag{2.8}$$

where $Y_1, Y_2, \ldots, Y_M$ are i.i.d. samples from $f$.

The main reason for changing the pdf is to reduce the variance by an appropriate choice of $g$: samples from $g$ could be more "important" for the estimation of our integral. As long as $H(t) > 0$, $\forall t \in \mathbb{R}^n$, the variance of this estimator is minimized when $g$ is proportional to $H \cdot f$:


$$g^*(s) = \frac{H(s) f(s)}{m} \tag{2.9}$$

is the zero-variance IS distribution.

This pdf is hard to determine, as it depends on the desired value $m$. From another point of view [31], the variance of this estimator is

$$\begin{aligned}
\mathrm{Var}[m_N(g, \mathbf{X})] &= \frac{1}{N}\, \mathrm{Var}_g\!\left[ H(X)\, \frac{f(X)}{g(X)} \right] \\
&= \frac{1}{N}\, \mathbb{E}_g\!\left[ \left( H(X)\, \frac{f(X)}{g(X)} \right)^{\!2} \right] - \frac{1}{N} \left( \mathbb{E}_g\!\left[ H(X)\, \frac{f(X)}{g(X)} \right] \right)^{\!2} \\
&= \frac{1}{N}\, \mathbb{E}_f\!\left[ H^2(X)\, \frac{f(X)}{g(X)} \right] - \frac{m^2}{N},
\end{aligned} \tag{2.10}$$

therefore minimizing the variance is equivalent to solving the following problem

$$\min_{g \in \mathcal{G}} \mathbb{E}_f\!\left[ H^2(X)\, \frac{f(X)}{g(X)} \right] \tag{2.11}$$

over the entire set of pdf's $\mathcal{G}$.

A more practical approach is to search for an IS distribution from a parametric family $(g_\theta)$ which minimizes a certain divergence with respect to the zero-variance pdf $g^*$ (see [4] and section 2.3). In many situations this search is refined and we look for distributions of the form (see [4], [22], [31] and [44]):

$$g(s) = g_1(s_1)\, g_2(s_2) \cdots g_n(s_n), \qquad s = (s_1, s_2, \ldots, s_n), \tag{2.12}$$

i.e., multi-dimensional distributions having independent components. The (mean-)parametrized distributions usually belong to the so-called natural exponential families ([13], [33]), which have a pdf of the form

$$f_\theta(s) = \varphi(s) \exp\left( \langle \theta, s \rangle - \psi(\theta) \right), \tag{2.13}$$

where $s \in \mathbb{R}^n$, $\theta \in \mathbb{R}^n$ is the parameter of the family, and $\varphi$ and $\psi$ are known functions. We can include here the following distributions: normal with known covariance, Poisson, and gamma with known shape parameter.
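As an illustration of the estimator (2.6) (not taken from the thesis), the sketch below estimates the tail probability $m = P(X > 2)$ for $X \sim N(0,1)$, once with $g = f$ (crude Monte Carlo) and once with the shifted proposal $g = N(2,1)$; both estimates agree with the true value $\approx 0.02275$, but the IS one has much lower variance.

```python
import math
import random

def norm_pdf(x, mu):
    """Density of N(mu, 1)."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def is_estimate(N, mu_is, seed=1):
    """IS estimator (2.6) for m = P(X > 2), X ~ f = N(0,1), sampling from g = N(mu_is, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(N):
        x = rng.gauss(mu_is, 1.0)
        if x > 2.0:                                    # H(x) = 1_{x > 2}
            total += norm_pdf(x, 0.0) / norm_pdf(x, mu_is)
    return total / N

m_crude = is_estimate(50_000, 0.0)   # g = f: plain Monte Carlo
m_is    = is_estimate(50_000, 2.0)   # proposal centered on the event region
```

Centering the proposal on the rare region makes almost every sample contribute, which is exactly the variance-reduction effect formalized by (2.7).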


2.3 f-divergences and IS

Let $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), \lambda)$ be the Lebesgue measure space ($\lambda$ is a $\sigma$-finite measure), and let $\mu_1, \mu_2$ be two probability measures on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$, absolutely continuous with respect to $\lambda$ ($\lambda(M) = 0$ implies $\mu_i(M) = 0$). Denote by $p_i = \frac{d\mu_i}{d\lambda}$ ($i = 1, 2$) the corresponding Radon-Nikodym derivatives with respect to $\lambda$. If $f : \mathbb{R}_+ \to \mathbb{R}$ is a convex function, continuous at $0$, the $f$-divergence of $\mu_1$ and $\mu_2$ (or of $p_1$ and $p_2$) (cf. [2], [11] and [16]) is

$$D_f(\mu_1, \mu_2) = D_f(p_1, p_2) = \int_{\mathbb{R}^n} p_2(s)\, f\!\left[ \frac{p_1(s)}{p_2(s)} \right] d\lambda(s). \tag{2.14}$$

A few examples of particular interest:

$f : \mathbb{R}_+ \to \mathbb{R}$, $f(x) = (x - 1)^2$ gives the Karl Pearson $\chi^2$-divergence:

$$D_{\chi^2}(\mu_1, \mu_2) = \int_{\mathbb{R}^n} \frac{[p_1(s) - p_2(s)]^2}{p_2(s)}\, d\lambda(s);$$

$f : \mathbb{R}_+ \to \mathbb{R}$, $f(x) = |x - 1|$ gives the total variation distance:

$$V(\mu_1, \mu_2) = \int_{\mathbb{R}^n} |p_1(s) - p_2(s)|\, d\lambda(s) \;\left( = 2 \sup_{M \in \mathcal{B}(\mathbb{R}^n)} [\mu_1(M) - \mu_2(M)] \right);$$

$f : \mathbb{R}_+ \to \mathbb{R}$, $f(x) = x \log x$ gives the Kullback-Leibler divergence:

$$D_{KL}(\mu_1, \mu_2) = \int_{\mathbb{R}^n} p_1(s) \log\!\left[ \frac{p_1(s)}{p_2(s)} \right] d\lambda(s);$$

$f : \mathbb{R}_+ \to \mathbb{R}$, $f(x) = x^\alpha$ gives (up to a logarithmic transformation, see section 2.3.3) the $\alpha$-order Rényi divergence (or Rényi entropy):

$$D_\alpha(\mu_1, \mu_2) = \int_{\mathbb{R}^n} p_1^\alpha(s)\, p_2^{1-\alpha}(s)\, d\lambda(s).$$

For all these divergences (with the generator normalized so that $f(1) = 0$), $D_f(\mu_1, \mu_2) \ge 0$, and $D_f(\mu_1, \mu_2) = 0$ if and only if $\mu_1 = \mu_2$, i.e., $p_1 = p_2$ almost everywhere (a.e.).
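The general formula (2.14) can be checked numerically on discrete distributions. The sketch below (illustrative only; the two probability vectors are invented) evaluates the Kullback-Leibler, Pearson and total variation generators.

```python
import math

def f_divergence(p1, p2, f):
    """Discrete analogue of (2.14): D_f(p1, p2) = sum_s p2(s) * f(p1(s)/p2(s))."""
    return sum(q * f(p / q) for p, q in zip(p1, p2))

kl      = lambda x: x * math.log(x)   # Kullback-Leibler generator
pearson = lambda x: (x - 1.0) ** 2    # Pearson chi-square generator
tv      = lambda x: abs(x - 1.0)      # total variation generator

p1 = [0.2, 0.5, 0.3]                  # illustrative distributions
p2 = [0.3, 0.4, 0.3]

d_kl = f_divergence(p1, p2, kl)
d_p  = f_divergence(p1, p2, pearson)  # equals sum (p1 - p2)^2 / p2
d_tv = f_divergence(p1, p2, tv)       # equals sum |p1 - p2|
zero = f_divergence(p1, p1, kl)       # identical arguments give divergence 0
```

The computed values confirm the nonnegativity property stated above, and the vanishing at identical arguments.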

As we already mentioned in section 2.2, instead of solving problem (2.11) we can try to minimize a given $f$-divergence with respect to the zero-variance pdf $g^*$, that is,

$$\min_{g \in \mathcal{G}} D_f(g^*, g), \tag{2.15}$$

although sometimes we can change the order of the pdf's; remember that these divergences are not all symmetric.


2.3.1 Kullback-Leibler divergence

For the Kullback-Leibler divergence we get a problem (see [4], [26] and [45]) which leads to the well known method named cross-entropy:

$$\min_{g \in \mathcal{G}} D_{KL}(g^*, g) = \min_{g \in \mathcal{G}} \int_{\mathbb{R}^n} g^*(s) \log\!\left[ \frac{g^*(s)}{g(s)} \right] d\lambda(s), \tag{2.16}$$

or

$$\min_{g \in \mathcal{G}} \int_{\mathbb{R}^n} \left[ g^*(s) \log g^*(s) - g^*(s) \log g(s) \right] d\lambda(s),$$

which, since the first term does not depend on $g$ and $g^* = H f / m$, is equivalent to

$$\max_{g \in \mathcal{G}} \int_{\mathbb{R}^n} H(s) f(s) \log g(s)\, d\lambda(s).$$

An IS distribution is thus a solution to the following optimization problem:

$$\arg\max_{g \in \mathcal{G}} \mathbb{E}_f[H(X) \log g(X)]. \tag{2.17}$$
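For a one-dimensional Gaussian family $g_v = N(v, 1)$, problem (2.17) has the closed-form solution $v^* = \mathbb{E}_f[H(X) X] / \mathbb{E}_f[H(X)]$, and its stochastic counterpart is an $H$-weighted sample mean. A minimal sketch (the payoff $H(x) = e^x$ is an illustrative choice, not from the thesis; the exact optimum is then $v^* = 1$):

```python
import math
import random

def ce_mean_update(H, N=100_000, seed=2):
    """Stochastic counterpart of (2.17) for the family g_v = N(v, 1):
    the maximizer of sum_i H(X_i) * log g_v(X_i) over v is the H-weighted mean."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(N):
        x = rng.gauss(0.0, 1.0)   # X_i ~ f = N(0, 1)
        w = H(x)
        num += w * x
        den += w
    return num / den

# Illustrative payoff H(x) = exp(x): exact optimum is E[X e^X] / E[e^X] = 1.
v_star = ce_mean_update(lambda x: math.exp(x))
```

This weighted-mean update is exactly the mechanism that makes the cross-entropy method tractable for natural exponential families.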

2.3.2 Pearson divergence

The (reversed) Pearson divergence can be calculated as follows:

$$D_{\chi^2}(\mu_1, \mu_2) = \int_{\mathbb{R}^n} \left( \frac{p_1^2(s)}{p_2(s)} - 2 p_1(s) + p_2(s) \right) d\lambda(s) = \int_{\mathbb{R}^n} \frac{p_1^2(s)}{p_2(s)}\, d\lambda(s) - 1 = \mathbb{E}_{p_1}\!\left[ \frac{p_1(X)}{p_2(X)} \right] - 1.$$

In order to find an IS distribution with respect to this divergence, we have to solve the following problem:

$$\arg\min_{g \in \mathcal{G}} D_{\chi^2}(g^*, g) = \arg\min_{g \in \mathcal{G}} \mathbb{E}_f\!\left[ H^2(X)\, \frac{f(X)}{g(X)} \right]. \tag{2.18}$$


2.3.3 Rényi divergence

The relative information of order $\alpha$, or $\alpha$-order Rényi divergence, is

$$D_\alpha(p_1, p_2) = \frac{1}{\alpha - 1} \ln \int_{\mathbb{R}^n} p_1^\alpha(s)\, p_2^{1-\alpha}(s)\, \lambda(ds).$$

Therefore, for $\alpha > 1$ and reversing the order of the pdf's, we have the following optimization problem:

$$\arg\min_{g \in \mathcal{G}} D_\alpha(g, g^*) = \arg\min_{g \in \mathcal{G}} \int_{\mathbb{R}^n} g^\alpha(s)\, H^{1-\alpha}(s)\, f^{1-\alpha}(s)\, \lambda(ds),$$

or

$$\arg\min_{g \in \mathcal{G}} \mathbb{E}_f\!\left[ \frac{g^\alpha(X)}{f^\alpha(X)\, H^{\alpha-1}(X)} \right]. \tag{2.19}$$

2.4 Sampling Importance Resampling

The sampling importance resampling (SIR) method (see [36]) draws a random sample from a distribution with pdf $h$ in two steps. First, an independent random sample $\mathbf{X} = (X_1, X_2, \ldots, X_n)$ is drawn from a distribution with pdf $g$; then a (usually smaller) sample $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_m)$ is drawn (often with replacement) from $\mathbf{X}$ with sample probabilities $w(X_i)$ proportional to $h(X_i)/g(X_i)$. In practice,

$$w(X_i) = \frac{h(X_i)/g(X_i)}{\displaystyle\sum_{j=1}^{n} h(X_j)/g(X_j)}. \tag{2.20}$$

We generate the new samples using a multinomial distribution with these weights. That is, starting from $\mathbf{X}$, we give more importance to the samples that are more likely under $h$. We shall see the value of this technique in the following chapters.
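A minimal sketch of the two-step SIR procedure with the weights (2.20); the target $h = N(2,1)$ and proposal $g = N(0,1)$ (both unnormalized below) are illustrative choices, not taken from the thesis.

```python
import math
import random

def sir(xs, h, g, m, seed=3):
    """Sampling importance resampling: draw m values (with replacement) from xs,
    which were sampled from g, with probabilities proportional to h(x)/g(x) as in (2.20)."""
    rng = random.Random(seed)
    raw = [h(x) / g(x) for x in xs]
    total = sum(raw)
    return rng.choices(xs, weights=[r / total for r in raw], k=m)

# Illustrative (unnormalized) densities: target h = N(2, 1), proposal g = N(0, 1).
h_pdf = lambda x: math.exp(-0.5 * (x - 2.0) ** 2)
g_pdf = lambda x: math.exp(-0.5 * x * x)
rng0 = random.Random(0)
xs = [rng0.gauss(0.0, 1.0) for _ in range(100_000)]
ys = sir(xs, h_pdf, g_pdf, 10_000)
mean_y = sum(ys) / len(ys)   # close to E_h[X] = 2
```

Note that the normalizing constants of $h$ and $g$ cancel in (2.20), which is what makes SIR usable when only unnormalized densities are available.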


Chapter 3

Option pricing. Importance sampling and f-divergences

In this chapter we develop some techniques used mainly for pricing financial instruments, but useful in a more general framework. We involve here importance sampling techniques and the minimization of f-divergences.

There are many financial instruments for which closed-form formulae cannot be derived from the existing mathematical models. One example is the classical result of Black and Scholes, which covers only a small part of the entire spectrum of derivatives, especially for multivariate contingent claims.

In the following sections we study two types of option pricing: spread options and a variety of American option known as the Bermudan option. An option is a derivative instrument, because its value derives from the price of an underlying asset (commodities, stocks, currencies or financial indexes).

An option is a contract between two parties (a seller and a buyer) in which one party, the buyer, buys from the other, the seller, the right to engage in a future transaction concerning the asset. The buyer has the right, but not the obligation, to carry out the above transaction, while the seller has the obligation to engage in the transaction if the first party chooses to do so. Therefore an option contract may or may not be exercised at the agreed moment(s) in time.

Depending on the transaction involved, there are two types of options: if the contract gives the right to buy the asset(s), it is called a call option, while if it gives the right to sell the asset(s), it is called a put option.

Depending on the moment when the transaction can be exercised, there are two main styles of options: European, where the option can be exercised only at the expiration (maturity) date, and American, where the contract can be exercised at any time between the writing and the expiration date. In between those two reference types exist many others, for example the Bermudan option, where the buyer has the right to exercise at a designated number of times, and the Canary option, where the buyer has the right to exercise at a designated number of times but not before a given time period.


3.1 Spread options

One of these derivatives is the spread option, which is very widespread in the financial markets (equity, commodities, foreign exchange and energy markets), despite the fact that the corresponding pricing and hedging methods are not so well developed.

In this section we use the divergences from section 2.3 to approximate the price of call European spread options. The payoff function $H(Z)$ is given by formula (A.5), and we have to approximate $V = \mathbb{E}[H(Z)]$.

We consider here the restriction of $\mathcal{G}$ to the family $\mathcal{N}_2$ of bivariate normal distributions with unit variances and constant correlation $\rho_0$, parameterized by the means $v = (v_1, v_2)$. The pdf's from this family are

$$\varphi_v(s) = \varphi_{(v_1, v_2)}(s_1, s_2) = \varphi_0(s_1 - v_1, s_2 - v_2),$$

where

$$\varphi_0(s_1, s_2) = \frac{1}{2\pi \sqrt{1 - \rho_0^2}} \exp\left[ -\frac{1}{2(1 - \rho_0^2)} \left( s_1^2 - 2\rho_0 s_1 s_2 + s_2^2 \right) \right].$$

3.1.1 Kullback-Leibler divergence

Supposing that $f = \varphi_u$, problem (2.17) becomes

$$\arg\max_{v} \mathbb{E}_{\varphi_u}[H(Z) \ln \varphi_v(Z)], \tag{3.1}$$

while its stochastic version is

$$\arg\max_{v} \sum_{i=1}^{N} H(Z_i) \ln \varphi_v(Z_i), \tag{3.2}$$

where $Z_1, Z_2, \ldots, Z_N$ are i.i.d. samples from $\varphi_u$. After some algebra, problem (3.2) takes the following form:

$$\arg\min_{v} \sum_{i=1}^{N} H(Z_i) \left[ (Z_{1i} - v_1)^2 - 2\rho_0 (Z_{1i} - v_1)(Z_{2i} - v_2) + (Z_{2i} - v_2)^2 \right]. \tag{3.3}$$

As the function to optimize in (3.3) is a quadratic polynomial in $v_1$ and $v_2$, the problem can be solved directly. Alternatively, for this simulation, one can use importance sampling again, and the optimal solution can be estimated by an adaptive procedure similar to that of the cross-entropy method (see [4]).


3.1.2 Pearson divergence

Problem (2.18) is transformed into

$$\arg\min_{v} \mathbb{E}_{\varphi_u}\!\left[ H^2(Z)\, \frac{\varphi_u(Z)}{\varphi_v(Z)} \right]. \tag{3.4}$$

Using a Monte Carlo estimator for the above expectation, we get the stochastic counterpart of (3.4):

$$\arg\min_{v} \sum_{i=1}^{N} H^2(Z_i)\, \frac{\varphi_u(Z_i)}{\varphi_v(Z_i)}, \tag{3.5}$$

where $Z_1, Z_2, \ldots, Z_N$ is a random sample from $\varphi_u$.

3.1.3 Relative information

The problem of minimizing the relative information of order $\alpha$ from (2.19) is

$$\arg\min_{v} \mathbb{E}_{\varphi_u}\!\left[ H^{1-\alpha}(Z)\, \frac{\varphi_v^\alpha(Z)}{\varphi_u^\alpha(Z)} \right], \tag{3.6}$$

with its stochastic version:

$$\arg\min_{v} \sum_{i=1}^{N} H^{1-\alpha}(Z_i)\, \frac{\varphi_v^\alpha(Z_i)}{\varphi_u^\alpha(Z_i)}, \tag{3.7}$$

where $Z_1, Z_2, \ldots, Z_N$ is a random sample from $\varphi_u$.

3.1.4 Least squares problem and IS

We recall the nonlinear least squares problem framework ([30], [35]).

Let us suppose that we have a function $p : \mathbb{R}^n \to \mathbb{R}^m$, where $m > n$, and we have to solve the problem

$$\arg\min_{x} P(x), \qquad \text{where } P(x) = \|p(x)\|^2 = \sum_{i=1}^{m} p_i^2(x), \tag{3.8}$$


where $p(x) = (p_1(x), p_2(x), \ldots, p_m(x))^T$. If all $p_j$ are linear functions we have the linear least squares problem; otherwise we have a nonlinear least squares problem.

This is a frequent approach to approximating solutions of overdetermined systems of equations: instead of solving such a system, we try to minimize the sum of squares of the errors.

Applications of this method include data fitting, regression analysis and statistics. There are some very efficient methods which address this problem; we used the Levenberg-Marquardt method and Powell's Dog Leg method, which was potentially more efficient but gave worse results. These (iterative) methods are useful whenever the function to minimize is a sum of non-negative and smooth enough (say, $C^2$-class) terms.

Let us recall problem (2.8), the stochastic version of variance minimization, in our framework ($\mathcal{G} = \mathcal{N}_2$, $Y_1, Y_2, \ldots, Y_M$ i.i.d. samples from $\varphi_u$):

$$\arg\min_{v} \sum_{j=1}^{M} H^2(Y_j)\, \frac{\varphi_u(Y_j)}{\varphi_v(Y_j)} = \arg\min_{v} \frac{1}{2} \sum_{j=1}^{M} \left( H(Y_j) \sqrt{\frac{\varphi_u(Y_j)}{\varphi_v(Y_j)}} \right)^{\!2}. \tag{3.9}$$

Since $\varphi_v(s) > 0$, $\forall s \in \mathbb{R}^n$ and $\forall v \in \mathbb{R}^2$, problem (3.9) is a least squares problem of the form (3.8). Therefore we can solve it using one of the already mentioned iterative methods.

In conclusion, the least squares method can be used in two different ways for solving the stochastic problems we have already formulated. First, to minimize the variance directly and find an optimal IS distribution. On the other hand, we can use the divergences to find the IS distribution. The Kullback-Leibler divergence can be used directly but, since $H(s) > 0$, $\forall s \in \mathbb{R}^n$, problems (3.5) and (3.7) also have the form of a least squares problem (3.8); hence, for the Pearson divergence and the relative information we can again involve the least squares method.

It is worth noting here that Kullback-Leibler divergence has a big advan- tage over many other divergences, at least when using distributions from a natural exponential family. That is because the logarithm involved in its definion cancels the exponential function of the corresponding pdfs. In this respect using Kullback-Leibler divergence conducts to exact methods, while for other divergences, as we already seen, we have to solve one more (stochastic) optimization problem.


3.1.5 Numerical results

Our numerical experiments (see [37]) address all four importance sampling variants: (3.3) KL-IS, (3.5) $\chi^2$-IS, (3.7) RI-IS, and (3.9) LSq-IS. The tests were conducted for a European call spread option with the following characteristics: spot prices $S_{10} = 105$ and $S_{20} = 110$, dividend yields $q_1 = 2\%$ and $q_2 = 3\%$, volatilities $\sigma_1 = 15\%$ and $\sigma_2 = 10\%$, and interest rate $r = 5\%$.

The parameter of the distributions which is changed is the pair $(\mu_1, \mu_2)$, i.e., the means of the two jointly Gaussian variables $(Z_1, Z_2)$, which are $\rho$-correlated. As an estimate of efficiency we used the variance ratio $\sigma^2_{MC} / \sigma^2_{IS}$ (a ratio between the variances estimated over $1{,}000{,}000$ simulated paths). Table 6.4 shows the results for various levels of the correlation coefficient $\rho$ and strike $K$ (the larger the results, the better the method). Every value is obtained as an average of 30 samples; in parentheses are the standard errors of the mean.

The Least Squares (LSq-IS) and Kullback-Leibler (KL-IS) methods are the best. It is worth noting that, although LSq-IS works better most of the time, KL-IS performs better for deep in the money spread options. KL-IS always has at least five times better variance than crude Monte Carlo, making it the more reliable method. LSq-IS has its worst results for deeply correlated assets, while KL-IS is less dependent on the correlation level.

Overall, $\chi^2$-IS is the worst of the first three methods, but still performs better than LSq-IS for deep in the money and positively correlated options. The RI-IS method needs further investigation concerning the fine tuning of the parameter $\alpha$, which is $1.5$ in these experiments. This method gives the best results at low correlation levels.

3.2 American options

This section is motivated by the problem of pricing American options, a very challenging financial instrument from a mathematical point of view. American options can be exercised before (but not after) the maturity time; the problem of pricing such options is more delicate since, in addition to estimating the option's value, one has to find the optimal exercise time first. We restrict our study to Bermudan options, a style of American options which can be exercised only at a finite number of times.

We improve (see [38], [39]) a method successfully used for solving optimization problems, named Model Reference Adaptive Search (MRAS, [22]), which is an approach similar to the cross-entropy method (see [4], [45]).

Such stochastic optimization methods involve two iterative stages:

- generate data samples using a specific random procedure, most likely a distribution with known parameters;
- update the parameters of the random procedure using the data from the previous step.

The computation of the parameters in the second step often involves expectations of random variables, which are estimated (as we have already seen) by Monte Carlo simulation. Although MRAS itself is a form of importance sampling, we use here importance sampling in the form of sampling importance resampling.

That is, from the samples generated in the first step we resample (with replacement) using a multinomial distribution with probabilities proportional to their importance ratios. In this way we give more importance to samples which shift towards another distribution; the main distributions we chose for resampling are the reference distributions of the original algorithm.

3.2.1 Pricing Bermudan options

Recall, from Appendix B, the value of a put Bermudan option with early exercise thresholds $S = (S_i)_{1 \le i \le n}$:

$$H(S) = \mathbb{E}[V_{p,S}], \tag{3.10}$$

where $V_{p,S}$ is given by (B.3).

The vector of threshold prices must be generated from a multivariate Gaussian distribution truncated to one of the polytopes (B.7); the pdf of such a distribution (with mean vector $\mu$ and covariance matrix $\Sigma$) is

$$\varphi_X(s; \mu, \Sigma) = \frac{\mathbf{1}_{[X]}(s)}{\sqrt{(2\pi)^n\, |\Sigma|}} \exp\left( -\frac{1}{2} (s - \mu)^T \Sigma^{-1} (s - \mu) \right).$$


As the threshold prices are the arguments being optimized in the algorithm, a fast and high-quality sampling procedure is crucial for the accuracy of our results. Therefore we avoid the accept-reject method for this truncated distribution and use a Gibbs sampler combined with a Metropolis chain (see [14], [24], [50]).

After all these prices are determined, the value of the desired option can be calculated by estimating the expectation of the value function (B.3) (this is done by a forward simulation, knowing the equation which models the price dynamics).

3.2.2 The algorithm

In this section we analyze only put options; the problem is to find

$$S^* = \arg\max_{S} H(S).$$

MRAS is an iterative procedure which, in our case, generates candidate solutions following the normal distribution $N(\mu, \Sigma)$ and updates the parameters of the distribution using:

$$\mu_{t+1} = \frac{\mathbb{E}_t\!\left[ \dfrac{s[H(S)]^t}{\varphi_{X_p}(S; \mu_t, \Sigma_t)}\, \mathbf{1}_{[H(S) \ge \gamma_{t+1}]}\, S \right]}{\mathbb{E}_t\!\left[ \dfrac{s[H(S)]^t}{\varphi_{X_p}(S; \mu_t, \Sigma_t)}\, \mathbf{1}_{[H(S) \ge \gamma_{t+1}]} \right]}, \tag{3.11}$$

$$\Sigma_{t+1} = \frac{\mathbb{E}_t\!\left[ \dfrac{s[H(S)]^t}{\varphi_{X_p}(S; \mu_t, \Sigma_t)}\, \mathbf{1}_{[H(S) \ge \gamma_{t+1}]}\, M_{S, \mu_{t+1}} \right]}{\mathbb{E}_t\!\left[ \dfrac{s[H(S)]^t}{\varphi_{X_p}(S; \mu_t, \Sigma_t)}\, \mathbf{1}_{[H(S) \ge \gamma_{t+1}]} \right]}, \tag{3.12}$$

where $s(\cdot)$ is a continuous, strictly increasing, positive function, the expectation $\mathbb{E}_t[\cdot]$ is taken under the truncated distribution $N_{X_p}(\mu_t, \Sigma_t)$, and

$$M_{S, \mu_{t+1}} = (S - \mu_{t+1})(S - \mu_{t+1})^T.$$

The above expectations are estimated by Monte Carlo simulation; we sample a sequence of i.i.d. Gaussian vectors $\mathbf{S} = \left( S^{(t,j)} \right)_{1 \le j \le N_t} \subset X_p$ and, for the sake of simplicity, define the following weights:

$$w_j(\mathbf{S}; \mu, \Sigma, \gamma) = \frac{\dfrac{s\!\left[ H\!\left( S^{(t,j)} \right) \right]^t}{\varphi_{X_p}(S^{(t,j)}; \mu, \Sigma)}\, \mathbf{1}_{[H(S^{(t,j)}) \ge \gamma]}}{\displaystyle\sum_{j=1}^{N_t} \dfrac{s\!\left[ H\!\left( S^{(t,j)} \right) \right]^t}{\varphi_{X_p}(S^{(t,j)}; \mu, \Sigma)}\, \mathbf{1}_{[H(S^{(t,j)}) \ge \gamma]}},$$

where $\mu \in \mathbb{R}^n$, $\Sigma \in \mathbb{R}^{n \times n}$, $\gamma \in \mathbb{R}$. The parameters of the truncated multivariate Gaussian distribution are updated using the following formulas:

$$\mu_{t+1} = \sum_{j=1}^{N_t} w_j(\mathbf{S}; \mu_t, \Sigma_t, \gamma_{t+1})\, S^{(t,j)}, \tag{3.13}$$

$$\Sigma_{t+1} = \sum_{j=1}^{N_t} w_j(\mathbf{S}; \mu_t, \Sigma_t, \gamma_{t+1})\, M_{S^{(t,j)}, \mu_{t+1}}. \tag{3.14}$$

A smoothing coefficient is defined ($\nu \in (0, 1)$, $p \in \mathbb{N}$):

$$\nu_t = \nu \left( 1 - \frac{1}{t^p} \right),$$

and the parameters are modified accordingly:

$$\hat{\mu}_t = \nu_t\, \mu_t + (1 - \nu_t)\, \hat{\mu}_{t-1}, \qquad \hat{\Sigma}_t = \nu_t\, \Sigma_t + (1 - \nu_t)\, \hat{\Sigma}_{t-1}. \tag{3.15}$$

We describe a modified version of the MRAS algorithm, obtained by including a sampling importance resampling phase: before calculating the sample $(1 - \rho)$-quantile, we replace the actual samples using sampling importance resampling. First, we can resample based on the natural weights: if $\mathbf{S} = (S_j)_{1 \le j \le N}$ are the currently generated samples, we have to determine the weights

$$u_j = \frac{H(S_j)}{\displaystyle\sum_{i=1}^{N} H(S_i)} \tag{3.16}$$

and generate new samples using a multinomial distribution with these weights. That is, we give more importance to those samples which have greater payoffs.


In the following, let us recall a few theoretical considerations concerning the global convergence of MRAS (exact version). The parameters to update at each iteration are $(\mu_t, \Sigma_t) = \theta_t$; this merging parameter is chosen in such a way that $\varphi_{t+1}$, the next pdf, is "closer" to the corresponding distribution from the sequence of so-called reference distributions $(g_k)_{k \ge 1}$, where:

$$g_{t+1}(x) = \frac{s[H(x)]\, \mathbf{1}_{[H(x) \ge \gamma_{t+1}]}\, g_t(x)}{\mathbb{E}_{g_t}\!\left[ s[H(X)]\, \mathbf{1}_{[H(X) \ge \gamma_{t+1}]} \right]}, \qquad \forall\, t \ge 1, \tag{3.17}$$

$$g_1(x) = \frac{\mathbf{1}_{[H(x) \ge \gamma_1]}}{\mathbb{E}_{\theta_0}\!\left[ \mathbf{1}_{[H(X) \ge \gamma_1]} \right]}. \tag{3.18}$$

Hence, for step $t$, a more natural resampling would use the following weights:

$$v_j = \frac{\left[ s(H(S_j)) \right]^{t-1}\, \mathbf{1}_{[H(S_j) \ge \gamma_t]}}{\displaystyle\sum_{i=1}^{N} \left[ s(H(S_i)) \right]^{t-1}\, \mathbf{1}_{[H(S_i) \ge \gamma_t]}}. \tag{3.19}$$

The algorithm follows:

Step 1 (initialization): quantile level $\rho_0$, sample size $N_0$, $\mu_0$, $\Sigma_0$, sample size level $\alpha$, a limit parameter $\varepsilon > 0$, a weight $\lambda \in (0, 1)$, a continuous increasing positive function $s(\cdot)$, and $t = 0$.

Step 2 (the general iteration of the algorithm):

- generate i.i.d. samples $\mathbf{S}_t = \left( S^{(t,1)}, S^{(t,2)}, \ldots, S^{(t,N_t)} \right)$ from the density $\hat{\varphi}_t = \lambda \varphi_t + (1 - \lambda) \varphi_0$, where $\varphi_t$ is the density of $N(\hat{\mu}_t, \hat{\Sigma}_t)$;
- resample using (3.16) or (3.19) and obtain new samples $S^{(t,\sigma(1))}, S^{(t,\sigma(2))}, \ldots, S^{(t,\sigma(N_t))}$;
- calculate the $(1 - \rho_t)$-quantile, $\gamma_{t+1}(\rho_t, N_t)$, of the values $H\big(S^{(t,\sigma(1))}\big), H\big(S^{(t,\sigma(2))}\big), \ldots, H\big(S^{(t,\sigma(N_t))}\big)$;
- if $t = 0$ or $\gamma_{t+1}(\rho_t, N_t) \ge \gamma_t + \varepsilon$, then set $\gamma_{t+1} \leftarrow \gamma_{t+1}(\rho_t, N_t)$, $\rho_{t+1} \leftarrow \rho_t$, $N_{t+1} \leftarrow N_t$; else, if there exists $\bar{\rho} = \max\{ \rho' : \gamma_{t+1}(\rho', N_t) \ge \gamma_t + \varepsilon,\; 0 \le \rho' \le \rho_t \}$, then set $\gamma_{t+1} \leftarrow \gamma_{t+1}(\bar{\rho}, N_t)$, $\rho_{t+1} \leftarrow \bar{\rho}$, $N_{t+1} \leftarrow N_t$; otherwise set $\gamma_{t+1} \leftarrow \gamma_t$, $\rho_{t+1} \leftarrow \rho_t$, $N_{t+1} \leftarrow \alpha N_t$;
- update and smooth $\mu_{t+1}$ and $\Sigma_{t+1}$, using (3.13), (3.14) and, respectively, (3.15);
- $t \leftarrow t + 1$.
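A drastically simplified, one-dimensional sketch of this elite-sample iteration (no truncation, no resampling phase, fixed sample size; essentially a cross-entropy-style stand-in, with the objective function and all parameter values invented for illustration):

```python
import math
import random

def mras_like_search(H, mu0=0.0, sigma0=5.0, n=500, rho=0.5, nu=0.8, iters=40, seed=8):
    """Sample candidates from N(mu, sigma), keep those above the (1-rho)-quantile
    of H, and re-fit the Gaussian parameters from the elite set, with smoothing nu
    applied as in (3.15)."""
    rng = random.Random(seed)
    mu, sigma = mu0, sigma0
    for _ in range(iters):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        hs = sorted(H(x) for x in xs)
        gamma = hs[int((1.0 - rho) * n)]          # sample (1-rho)-quantile
        elite = [x for x in xs if H(x) >= gamma]
        m = sum(elite) / len(elite)
        s = math.sqrt(sum((x - m) ** 2 for x in elite) / len(elite)) + 1e-12
        mu = nu * m + (1.0 - nu) * mu             # smoothed mean update
        sigma = nu * s + (1.0 - nu) * sigma       # smoothed deviation update
    return mu

# Invented objective with maximizer x = 3: the search concentrates around it.
x_star = mras_like_search(lambda x: 10.0 - (x - 3.0) ** 2)
```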

3.2.3 Models used

The techniques described above are tested by pricing Bermudan options under three different models for the stock price dynamics: geometric Brownian motion, normal jump diffusion, and a relatively new framework, an asymmetric double-exponential jump diffusion model.

Geometric Brownian motion model. We say that the underlying stock price $S(t)$ follows a geometric Brownian motion if

$$dS(t) = \mu S(t)\, dt + \sigma S(t)\, dW(t), \tag{3.20}$$

where $(W(t))_{t \ge 0}$ is a standard Wiener process (or Brownian motion), i.e. $W(0) = 0$, $t \mapsto W(t)$ is continuous almost surely, and its increments are mutually independent and stationary ($W(t+s) - W(s) \sim N(0, t)$, $\forall\, s \ge 0$).

For this model $\mu = r - \delta$, where $r$ is the interest rate, $\delta$ is the dividend yield, and $\sigma$ is the volatility, all supposed constant. Under these conditions, using Itô's lemma, equation (3.20) has the following solution:

$$S(t) = S(0) \exp\left[ \left( \mu - \frac{\sigma^2}{2} \right) t + \sigma W(t) \right],$$

from which the discrete counterpart used for simulation is

$$\frac{S_{\tau+\Delta}}{S_\tau} = \exp\left[ \left( \mu - \frac{\sigma^2}{2} \right) \Delta + \sigma \sqrt{\Delta}\, Z \right],$$

where $Z$ is a standard normal random variable and $\Delta$ is the time step.
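The discrete step above translates directly into a path simulator; the sketch below (all parameter values are illustrative) also checks the martingale property $\mathbb{E}[e^{-(r-\delta)T} S_T] = S_0$ implied by these dynamics.

```python
import math
import random

def gbm_path(s0, r, q, sigma, dt, n_steps, rng):
    """Simulate S on a time grid via the exact discrete step
    S_{t+dt} = S_t * exp((r - q - sigma^2/2) dt + sigma sqrt(dt) Z)."""
    s, path = s0, [s0]
    drift = (r - q - 0.5 * sigma * sigma) * dt
    vol = sigma * math.sqrt(dt)
    for _ in range(n_steps):
        s *= math.exp(drift + vol * rng.gauss(0.0, 1.0))
        path.append(s)
    return path

# Martingale check: E[exp(-(r - q) T) S_T] = S_0 under these dynamics.
rng = random.Random(5)
T, n, n_paths = 1.0, 12, 100_000
disc = math.exp(-(0.05 - 0.02) * T)
mean_ST = sum(gbm_path(100.0, 0.05, 0.02, 0.2, T / n, n, rng)[-1]
              for _ in range(n_paths)) / n_paths
```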


Merton normal jump diffusion model. From a practical point of view, we know that geometric Brownian motion does not always accurately simulate the stock price behaviour. Therefore other models, which allow jumps, have been introduced, namely jump-diffusion models ([10], [47]). Merton ([32]) proposed the following dynamics for the underlying stock price:

$$dS(t) = \mu S(t)\, dt + \sigma S(t)\, dW(t) + S(t)\, dX(t), \qquad X(t) = \sum_{i=1}^{N(t)} Y_i, \tag{3.21}$$

where $W(t)$ is a standard Wiener process and $X(t)$ is a compound Poisson process: $N(t)$, the number of jumps, is a Poisson process with parameter $\lambda$, and $Y_1, Y_2, \ldots$ is a sequence of independent and identically distributed log-normal, $LN(-\delta^2/2, \delta^2)$, random variables; here $\lambda$ is the frequency and $\delta$ the volatility of the jumps.

The discrete form of equation (3.21) is:

$$\frac{S_{\tau+\Delta}}{S_\tau} = \exp\left[ \left( \mu - \frac{\sigma^2}{2} \right) \Delta + \sigma \sqrt{\Delta}\, Z_0 + \sum_{i=1}^{N(\lambda\Delta)} \left( \delta Z_i - \frac{\delta^2}{2} \right) \right],$$

where $Z_0, Z_1, \ldots$ are i.i.d. standard normal random variables, and $N(\lambda\Delta)$ is Poisson distributed with parameter $\lambda\Delta$.
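A sketch of one discrete step of (3.21), with the Poisson jump count sampled by CDF inversion; all numerical parameters are illustrative. Since each log-normal jump factor has mean one, $\mathbb{E}[S_{\tau+\Delta}] = S_\tau e^{\mu\Delta}$, which the driver below checks.

```python
import math
import random

def poisson_draw(mean, rng):
    """Sample a Poisson(mean) count by CDF inversion."""
    n, p, u = 0, math.exp(-mean), rng.random()
    cdf = p
    while u > cdf:
        n += 1
        p *= mean / n
        cdf += p
    return n

def merton_step(s, mu, sigma, lam, delta, dt, rng):
    """One discrete step of Merton's model: each of the N ~ Poisson(lam*dt)
    jumps contributes delta*Z_i - delta^2/2 to the exponent."""
    n = poisson_draw(lam * dt, rng)
    jumps = sum(delta * rng.gauss(0.0, 1.0) - 0.5 * delta * delta for _ in range(n))
    diffusion = (mu - 0.5 * sigma * sigma) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return s * math.exp(diffusion + jumps)

rng = random.Random(6)
n_sim = 100_000
mean_S = sum(merton_step(100.0, 0.05, 0.2, 0.5, 0.3, 1.0, rng) for _ in range(n_sim)) / n_sim
target = 100.0 * math.exp(0.05)   # jump factors have mean one, so E[S] = S0 * exp(mu*dt)
```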

A double-exponential jump diffusion model. Kou proposed (see [25]) another jump diffusion model for the asset price, which basically differs from the above model in the distribution of the jump sizes, which are double-exponential:

$$dS(t) = \mu S(t)\, dt + \sigma S(t)\, dW(t) + S(t)\, dV(t), \qquad V(t) = \sum_{i=1}^{N(t)} (V_i - 1), \tag{3.22}$$

where $W(t)$ is a standard Wiener process, $V(t)$ is a compound Poisson process: $N(t)$ is a Poisson process with rate $\lambda$, and $V_1, V_2, \ldots$ are independent and identically distributed log-asymmetric double-exponential random variables, i.e. $Y = \log(V_i)$ has density

$$f_Y(x) = \begin{cases} p\, \eta_1 e^{-\eta_1 x}, & x \ge 0, \\ (1 - p)\, \eta_2 e^{\eta_2 x}, & x < 0, \end{cases}$$

$p$ and $(1 - p)$ being the probabilities of up and down jumps. The moments of this distribution are

$$\mathbb{E}[Y] = \frac{p}{\eta_1} - \frac{1 - p}{\eta_2}, \qquad \mathrm{Var}[Y] = p(1-p)\left( \frac{1}{\eta_1} + \frac{1}{\eta_2} \right)^{\!2} + \frac{p}{\eta_1^2} + \frac{1 - p}{\eta_2^2}.$$

The solution of equation (3.22) is

$$S(t) = S(0) \exp\left[ \left( \mu - \frac{\sigma^2}{2} \right) t + \sigma W(t) \right] \prod_{i=1}^{N(t)} V_i, \tag{3.23}$$

and its discrete form is:

$$\frac{S_{\tau+\Delta}}{S_\tau} = \exp\left[ \left( \mu - \frac{\sigma^2}{2} \right) \Delta + \sigma \sqrt{\Delta}\, Z_0 + \sum_{i=1}^{N(\lambda\Delta)} Y_i \right],$$

where $Z_0$ is a standard normal random variable, $Y_1, Y_2, \ldots$ are i.i.d. asymmetric double-exponential random variables (with the density above), and $N(\lambda\Delta)$ is Poisson distributed with parameter $\lambda\Delta$.
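Sampling the asymmetric double-exponential jump size $Y$ is a two-stage draw: choose the jump direction with probability $p$, then draw an exponential magnitude. A sketch with illustrative parameters, checked against the moment formula for $\mathbb{E}[Y]$:

```python
import random

def sample_dexp(p, eta1, eta2, rng):
    """Draw Y from the asymmetric double-exponential jump-size density:
    with probability p an upward jump ~ Exp(eta1), else a downward jump ~ -Exp(eta2)."""
    if rng.random() < p:
        return rng.expovariate(eta1)
    return -rng.expovariate(eta2)

rng = random.Random(7)
p, eta1, eta2 = 0.6, 10.0, 5.0            # illustrative parameters
ys = [sample_dexp(p, eta1, eta2, rng) for _ in range(200_000)]
mean_jump = sum(ys) / len(ys)
target = p / eta1 - (1.0 - p) / eta2      # E[Y] from the moment formula above
```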

3.2.4 Numerical experiments and conclusions

In our implementation the stopping criteria include, besides the maximum number of steps ($N_t \ge N_{\max}$), a minimum number of valid samples to be used in the updating phase.

The numerical results (see also [38], [39]) are obtained under the following conditions: initial sample size $N_0 = 200$, initial quantile level $\rho_0 = 0.5$, smoothing parameter $\nu = 0.8$, sample size level correction $\alpha = 2$, weight parameter $\lambda = 0.3$, and $\varepsilon = 0.001$. The increasing positive function used is $s(x) = \exp(0.1 x)$, the initial mean is a vector having all components $10$, and the initial covariance matrix has diagonal elements $225$. Option prices are obtained using $50{,}000$ simulations, after the threshold prices are estimated.

Tables 6.1 to 6.3 show the results of our simulations: the prices, the standard errors, and the average number of iterations. All models are tested for various early exercise dates and different values of the first threshold price.

The two sampling importance resampling versions are the uniform one (u-SIR) and the reference-distribution one (rd-SIR). Both perform fewer steps than the standard algorithm, with rd-SIR requiring at most half as many.


The prices we get are slightly smaller for importance resampling, although they remain very close to those obtained with the standard MRAS procedure. In almost all cases the standard error of the mean is similar for all three algorithms; an exception occurs for the Kou and Merton models with 6 early exercise dates, where the standard error almost doubles (remaining under 0.05).

Our algorithm runs almost twice as fast as the standard algorithm while achieving the same standard errors, which makes it a reliable and faster method.


Chapter 4

Rare-events probabilities and optimization

The estimation of rare-event probabilities is very important for the performance guarantees of various systems, usually stochastic networks (e.g., in telecommunications). There are a number of randomized methods for this estimation: genetic algorithms, simulated annealing, etc.; among them, the cross-entropy method (see [4] and [44]) is one of the most successful.

This type of approach can be used not only for estimating probabilities but also for solving various combinatorial optimization problems. Applications of the cross-entropy method show its utility even for hard combinatorial problems.

The general procedure we describe (see [40], [41]) does not involve any specific family of distributions; the only restriction is that the search space consists of product-form probability density functions. We discuss an algorithm for the estimation of rare-event probabilities and a version for continuous optimization. The results of the numerical experiments with these algorithms, reported in the last section, support their performance.


4.1 Minimizing the Rényi divergence

Many problems arising in a variety of operations research applications can be described as the evaluation of the expected value of a given random variable. Areas of interest to us which use such an evaluation are rare-event simulation and global optimization.

Let us recall problem (2.15) of minimizing an f-divergence with respect to the zero-variance pdf $g^*$; this is a subproblem of (2.11). If we use the Kullback-Leibler divergence we get problem (2.17).

In this section we present an alternative approach to the problem of estimating rare-event probabilities using the class of Rényi divergences of order $\alpha$. As we will see later, this approach can also be used for solving global optimization problems.

In the remainder of this chapter we suppose that $\alpha > 1$. The restriction we impose is that the pdfs have a product form, i.e., they are densities of distributions with independent components:

$$G = \left\{ g : \mathbb{R}^n \to \mathbb{R}_+ \ : \ \int_{\mathbb{R}^n} g(s)\,d\mu(s) = 1, \ g(s) = \prod_{i=1}^{n} g_i(s_i), \ \forall s \in \mathbb{R}^n \right\}.$$

This constraint will allow us to study the divergence minimization problem in more detail.
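The product form makes both sampling and density evaluation factor across coordinates. A minimal illustration, assuming independent Gaussian components (the means and variances below are arbitrary choices for the example, not part of the thesis):

```python
import numpy as np

rng = np.random.default_rng(1)

# product-form pdf: g(s) = prod_i g_i(s_i); here each g_i is N(m_i, v_i)
m = np.array([0.0, 1.0, -2.0])
v = np.array([1.0, 4.0, 0.25])

def g_joint(s):
    """Joint density evaluated as the product of the one-dimensional marginals."""
    comps = np.exp(-(s - m)**2 / (2 * v)) / np.sqrt(2 * np.pi * v)
    return comps.prod(axis=-1)

def sample(n):
    """Sampling also factorizes: draw each coordinate independently."""
    return m + np.sqrt(v) * rng.standard_normal((n, len(m)))

s = sample(5)
print(g_joint(s))   # one joint density value per sample
```

The same factorization is what later lets the minimization over $G$ be reduced to coordinate-wise conditions.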

As mentioned earlier, we propose to choose as IS distribution the one that minimizes the Rényi divergence:

(4.1)   $\min\limits_{g \in G} \displaystyle\int_{\mathbb{R}^n} \left[ (g^*)^{\alpha}(s)\, g^{1-\alpha}(s) \right] d\mu(s) = \min\limits_{g \in G} \displaystyle\int_{\mathbb{R}^n} \left[ H^{\alpha}(s) f^{\alpha}(s)\, g^{1-\alpha}(s) \right] d\mu(s).$

For a given $\varepsilon_0 \in (0, 1)$, say $\varepsilon_0 = 1/2$, we define

$$U = \left\{ h : \mathbb{R} \to \mathbb{R}_+ \ : \ h \in L^1(\mathbb{R}) \right\}, \qquad U_0 = \left\{ h \in U \ : \ \left| \int_{\mathbb{R}} h(t)\,d\mu(t) - 1 \right| < \varepsilon_0 \right\};$$

obviously, $U_0$ and $U_0^n$ are convex subsets of the Banach spaces $L^1(\mathbb{R})$ and $L^1(\mathbb{R})^n$, respectively, where

$$L^1(\mathbb{R}) = \left\{ \varphi : \mathbb{R} \to \mathbb{R} \ : \ \int_{\mathbb{R}} |\varphi(s)|\,d\mu(s) < \infty \right\}$$



is the space of absolutely integrable¹ functions. This is the most relaxed framework we can use, although it would be possible to restrict our study to the space of square-integrable functions $L^2(\mathbb{R})$.

The latter space has the advantage of being reflexive, so its unit ball is relatively compact in the weak topology. Since our search set $G$ is a convex subset of the unit ball of $L^1(\mathbb{R}^n)$, one could try to direct the analysis towards a Weierstrass-type optimization of a continuous function on a compact set.

As our functional is not proven to be continuous, we focus instead on a different approach, based on convexity and on the critical points of the Lagrangian.

Problem (4.1) can be viewed as a functional minimization problem:

(4.2)   $\min\limits_{g \in U_0^n} \displaystyle\int_{\mathbb{R}^n} \left[ H^{\alpha}(s) f^{\alpha}(s) \prod_{i=1}^{n} g_i^{1-\alpha}(s_i) \right] d\mu(s),$

subject to

$$\int_{\mathbb{R}} g_i(s_i)\,d\mu(s_i) = 1, \quad \forall i = \overline{1,n}.$$

For the sake of simplicity, we introduce the notations

$$\Phi : U_0^n \to \mathbb{R}, \qquad \Psi : L^1(\mathbb{R})^n \to \mathbb{R}^n,$$

$$\Phi(g) = \int_{\mathbb{R}^n} \left[ H^{\alpha}(s) f^{\alpha}(s) \prod_{i=1}^{n} g_i^{1-\alpha}(s_i) \right] d\mu(s),$$

$$\Psi(g) = \left( \int_{\mathbb{R}} g_1(s_1)\,d\mu(s_1) - 1, \ \ldots, \ \int_{\mathbb{R}} g_n(s_n)\,d\mu(s_n) - 1 \right).$$

4.1.1 Convex optimization problem.

Thus, problem (4.2) becomes

(4.3)   $\min\limits_{g \in U_0^n} \Phi(g), \qquad \Psi(g) = 0.$

This is a convex optimization problem with constraints. We first prove the convexity of the objective functional.

LEMMA 1 For $\alpha > 1$, $\Phi$ is a convex functional on $U_0^n$.

¹We use the same notation $\mu$ for the Lebesgue measures on the different $\sigma$-algebras $\mathcal{B}(\mathbb{R}^p)$.


PROOF: It will suffice to show that the function $\varphi : \mathbb{R}_+^n \to \mathbb{R}$, $\varphi(x) = (x_1 \cdots x_n)^{1-\alpha}$, is convex.

To this end, let $x, y \in \mathbb{R}_+^n$ and $t \in (0, 1)$; first, from the concavity of $\ln(\cdot)$, one has:

$$\ln(\varphi[t x + (1-t) y]) = \ln\left( \prod_{i=1}^{n} [t x_i + (1-t) y_i]^{1-\alpha} \right) = (1-\alpha) \sum_{i=1}^{n} \ln[t x_i + (1-t) y_i] \le (1-\alpha) \sum_{i=1}^{n} [t \ln x_i + (1-t) \ln y_i] = t \ln(\varphi(x)) + (1-t) \ln(\varphi(y)).$$

Therefore $\varphi(\cdot)$ is log-convex; now, we observe that $\varphi[t x + (1-t) y] \le t \varphi(x) + (1-t) \varphi(y)$ is equivalent to

(4.4)   $\ln(\varphi[t x + (1-t) y]) \le \ln[t \varphi(x) + (1-t) \varphi(y)].$

Since $\varphi(\cdot)$ is log-convex, we have

(4.5)   $\ln(\varphi[t x + (1-t) y]) \le t \ln(\varphi(x)) + (1-t) \ln(\varphi(y))$

and, by the concavity of $\ln(\cdot)$,

(4.6)   $t \ln(\varphi(x)) + (1-t) \ln(\varphi(y)) \le \ln(t \varphi(x) + (1-t) \varphi(y)).$

Inequalities (4.5) and (4.6) combined give (4.4); hence, $\varphi(\cdot)$ is a convex function. $\square$
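Lemma 1 rests on the pointwise convexity of $\varphi(x) = (x_1 \cdots x_n)^{1-\alpha}$ on the positive orthant. A quick numerical sanity check of the convexity inequality, with an arbitrary $\alpha > 1$ and random positive points (all values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = 2.5                                  # any alpha > 1

def phi(x):
    """phi(x) = (x_1 * ... * x_n)^(1 - alpha) on the positive orthant."""
    return np.prod(x) ** (1.0 - alpha)

# check phi(t x + (1-t) y) <= t phi(x) + (1-t) phi(y) on random positive points
ok = True
for _ in range(10_000):
    x = rng.uniform(0.1, 5.0, 4)
    y = rng.uniform(0.1, 5.0, 4)
    t = rng.uniform()
    lhs = phi(t * x + (1 - t) * y)
    rhs = t * phi(x) + (1 - t) * phi(y)
    ok &= lhs <= rhs * (1 + 1e-9)            # small relative tolerance for rounding
print(ok)  # True
```

This does not replace the proof, of course; it only confirms the inequality on a random sample of segments.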

The Lagrange function of (4.3) is $L(g, \lambda) = \Phi(g) + \langle \lambda, \Psi(g) \rangle$, and $\Psi$ is an affine mapping.

COROLLARY 1 For every $\lambda \in \mathbb{R}^n$, $L(\cdot, \lambda)$ is a convex functional.



PROOF: Use Lemma 1 and the fact that $\Psi(\cdot)$ is affine. $\square$

It is easily seen that, for $0 < \alpha < 1$, similar arguments show that $\Phi$ is a concave functional. Thus, in this case, problem (4.3) can be written as

(4.7)   $\min\limits_{g \in U_0^n} -\Phi(g), \qquad \Psi(g) = 0.$

As we said at the beginning of this section, we choose to study only the case $\alpha > 1$, the two cases being very similar. Therefore, in the following, we shall assume that $\alpha > 1$ unless otherwise mentioned.

4.1.2 Lagrangian analysis

For every $h \in U^n$ and $g \in U_0^n$, there exists $t_0 > 0$ such that $g + t_0 h \in U_0^n$; $\Phi$ being convex, the function

$$t \mapsto \frac{\Phi(g + t h) - \Phi(g)}{t}$$

is monotone, therefore the directional derivatives of the above functionals can be computed using the Lebesgue monotone convergence theorem:

$$D\Phi(g)(h) = \lim_{t \to 0} \frac{\Phi(g + t h) - \Phi(g)}{t} = (1-\alpha) \sum_{i=1}^{n} \int_{\mathbb{R}^n} \left[ H^{\alpha}(s) f^{\alpha}(s) g^{1-\alpha}(s) \frac{h_i(s_i)}{g_i(s_i)} \right] d\mu(s),$$

$$D\Psi(g)(h) = \lim_{t \to 0} \frac{\Psi(g + t h) - \Psi(g)}{t} = \left( \int_{\mathbb{R}} h_1(s_1)\,d\mu(s_1), \ \ldots, \ \int_{\mathbb{R}} h_n(s_n)\,d\mu(s_n) \right),$$

$$DL(g, \lambda)(h) = D\Phi(g)(h) + \langle \lambda, D\Psi(g)(h) \rangle.$$

Using Theorem 3.4 from [3], $g$ is a solution to (4.3) if and only if there exists $\lambda \in \mathbb{R}^n$ (the Lagrange multipliers) such that

(4.8)   $g \in \arg\min\limits_{g' \in U_0^n} L(g', \lambda) \quad \text{and} \quad \Psi(g) = 0_{\mathbb{R}^n}.$


For a given $\lambda \in \mathbb{R}^n$, the first condition in (4.8) is equivalent to $0 \in \partial L(g, \lambda)$, the subdifferential of $L(\cdot, \lambda)$. As $L(\cdot, \lambda)$ admits directional derivatives (at least) in every direction $h \in U^n$, a natural way to solve (4.8) is to find a solution to $DL(g, \lambda)(h) = 0$, $\forall h \in U^n$, or, equivalently,

(4.9)   $(\alpha - 1) \displaystyle\sum_{i=1}^{n} \int_{\mathbb{R}^n} \left[ H^{\alpha}(s) f^{\alpha}(s) g^{1-\alpha}(s) \frac{h_i(s_i)}{g_i(s_i)} \right] d\mu(s) = \sum_{i=1}^{n} \lambda_i \int_{\mathbb{R}} h_i(s_i)\,d\mu(s_i), \quad \forall h \in U^n.$

For an index $1 \le i \le n$ and a vector $s = (s_1, s_2, \ldots, s_n) \in \mathbb{R}^n$, let us denote $s_{-i} = (s_1, \ldots, s_{i-1}, s_{i+1}, \ldots, s_n)$.

LEMMA 2 If $g$ is a solution to (4.9), then, for every $1 \le i \le n$, we have

(4.10)   $g_i(s_i) = \dfrac{ \displaystyle\int_{\mathbb{R}^{n-1}} \left[ H^{\alpha}(s) f^{\alpha}(s) g^{1-\alpha}(s) \right] d\mu(s_{-i}) }{ \displaystyle\int_{\mathbb{R}^n} \left[ H^{\alpha}(s) f^{\alpha}(s) g^{1-\alpha}(s) \right] d\mu(s) } \quad$ a.e., and,

if $X_i$ is a random variable having density $g_i$, and $b : \mathbb{R} \to \mathbb{R}$ is a continuous function, then

(4.11)   $E_{g_i}[b(X_i)] = \dfrac{ \displaystyle\int_{\mathbb{R}^n} \left[ H^{\alpha}(s) f^{\alpha}(s) g^{1-\alpha}(s)\, b(s_i) \right] d\mu(s) }{ \displaystyle\int_{\mathbb{R}^n} \left[ H^{\alpha}(s) f^{\alpha}(s) g^{1-\alpha}(s) \right] d\mu(s) }.$

PROOF: Equation (4.9) has the form:

(4.12)   $\displaystyle\sum_{i=1}^{n} \int_{\mathbb{R}^n} A_i(s)\, h_i(s_i)\,d\mu(s) = \sum_{i=1}^{n} \lambda_i \int_{\mathbb{R}} h_i(s_i)\,d\mu(s_i), \quad \forall h \in U^n,$

where $A_i(s)$ denotes the integrand corresponding to the index $i$ on the left-hand side of (4.9). We will use in this proof the simple fact that, for a given measurable function $\psi : \mathbb{R} \to \mathbb{R}_+$, $\int_{\mathbb{R}} \psi(s)\,d\mu(s) = 0$ if and only if $\psi(s) = 0$ almost everywhere (i.e., except, perhaps, for $s$ lying in a negligible set).

For a given $i$, let us choose $h_j \equiv 0$ for all $j \neq i$, and denote

$$\Delta_i^+ = \left\{ s_i : \int_{\mathbb{R}^{n-1}} A_i(s)\,d\mu(s_{-i}) \ge \lambda_i \right\}, \qquad \Delta_i^- = \mathbb{R} \setminus \Delta_i^+.$$



We choose in (4.12), first,

$$h_i(s_i) = \left[ \int_{\mathbb{R}^{n-1}} A_i(s)\,d\mu(s_{-i}) - \lambda_i \right]^+,$$

and, secondly,

$$h_i(s_i) = \left[ \int_{\mathbb{R}^{n-1}} A_i(s)\,d\mu(s_{-i}) - \lambda_i \right]^-,$$

and we obtain

$$\int_{\Delta_i^+} \left( \int_{\mathbb{R}^{n-1}} A_i(s)\,d\mu(s_{-i}) - \lambda_i \right)^2 d\mu(s_i) = 0$$

and, respectively,

$$\int_{\Delta_i^-} \left( \int_{\mathbb{R}^{n-1}} A_i(s)\,d\mu(s_{-i}) - \lambda_i \right)^2 d\mu(s_i) = 0,$$

or, equivalently,

$$\int_{\mathbb{R}^{n-1}} A_i(s)\,d\mu(s_{-i}) = \lambda_i \quad a.e.$$

From here and the expression of $A_i$ we get

$$g_i(s_i) = \frac{\alpha - 1}{\lambda_i} \int_{\mathbb{R}^{n-1}} \left[ H^{\alpha}(s) f^{\alpha}(s) g^{1-\alpha}(s) \right] d\mu(s_{-i}).$$

Moreover, if, for any given $i$, we choose $h_i = g_i$ and $h_j \equiv 0$ for all $j \neq i$, we get:

$$\lambda_i = (\alpha - 1) \int_{\mathbb{R}^n} \left[ H^{\alpha}(s) f^{\alpha}(s) g^{1-\alpha}(s) \right] d\mu(s).$$

From here (4.10) follows easily; on the other hand, (4.11) is a simple calculation. $\square$
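Formula (4.11) is what makes the result usable in practice: the expectation under the updated marginal $g_i$ can be estimated by self-normalized importance sampling, with weights $H^{\alpha}(s) f^{\alpha}(s) g^{1-\alpha}(s)$ computed on samples drawn from the current $g$. A toy sketch, in which the nominal density $f$, the proposal $g$, the indicator $H$, the function $b(x) = x$, and all parameter values are illustrative assumptions, not specifics from the thesis:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha = 2.0
a = 2.0                                      # threshold: H(s) = 1[s1 + s2 >= a]

def f_pdf(s):                                # nominal density: N(0, I) in R^2
    return np.exp(-0.5 * (s**2).sum(axis=1)) / (2 * np.pi)

def g_pdf(s, m):                             # current product-form proposal N(m, I)
    return np.exp(-0.5 * ((s - m)**2).sum(axis=1)) / (2 * np.pi)

m = np.array([1.0, 1.0])                     # proposal shifted toward the rare region
s = m + rng.standard_normal((200_000, 2))

H = (s.sum(axis=1) >= a).astype(float)
w = H**alpha * f_pdf(s)**alpha * g_pdf(s, m)**(1.0 - alpha)

# self-normalized estimate of E_{g_i}[b(X_i)] with b(x) = x, per (4.11)
est = (w[:, None] * s).sum(axis=0) / w.sum()
print(est)   # estimated mean of each coordinate under the updated marginals
```

In a multistage procedure, estimates of this kind (e.g., means and variances) drive the update of each component $g_i$ at every stage.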

4.2 Estimation of rare-events probabilities

In this section we suppose that $H(s) = 1_{[F(s) \ge a]}$, where $F$ is a Lebesgue measurable function, $a \in \mathbb{R}$, and $[F(s) \ge a]$ is a small-probability event (say, of probability at most $10^{-5}$).

We follow here the framework of a multistage procedure for the estimation of $g$; this type of algorithm appears often in the literature (e.g. [4], [31], [45]).
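A generic multistage loop of this kind can be sketched as follows: raise an intermediate threshold via empirical quantiles, refit a product-form proposal on the elite samples, and finally estimate $P[F(s) \ge a]$ by importance sampling. The sketch below uses a fixed-variance Gaussian product-form proposal whose mean is refitted at each stage; the threshold function $F$, the quantile level, and all parameters are illustrative, not the thesis's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(4)
n, N, rho = 5, 10_000, 0.1
a = 9.0                                      # rare threshold for F(s) = sum(s)
F = lambda s: s.sum(axis=1)

def logf(s):                                 # nominal density N(0, I): log-pdf
    return -0.5 * (s**2).sum(axis=1) - 0.5 * n * np.log(2 * np.pi)

def logg(s, m, v):                           # product-form proposal N(m, diag(v))
    return (-0.5 * ((s - m)**2 / v) - 0.5 * np.log(2 * np.pi * v)).sum(axis=1)

m = np.zeros(n)
v = np.ones(n)                               # variances kept fixed for stability
for _ in range(50):
    s = m + rng.standard_normal((N, n))
    y = F(s)
    gamma = min(a, np.quantile(y, 1 - rho))  # intermediate threshold
    m = s[y >= gamma].mean(axis=0)           # refit proposal mean on the elite set
    if gamma >= a:
        break

# final importance-sampling estimate of P[F(s) >= a]
s = m + rng.standard_normal((N, n))
w = np.exp(logf(s) - logg(s, m, v)) * (F(s) >= a)
print(w.mean())   # importance-sampling estimate of the rare-event probability
```

For this toy $F$, the exact probability is the Gaussian tail $P[N(0, n) \ge a]$, which the estimate should approach; crude Monte Carlo at this probability level would need orders of magnitude more samples.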
