### f -divergences and Monte Carlo methods

### Ph.D. candidate OLARIU Emanuel Florentin

### Advisor Professor LUCHIAN Henri

## Contents

1 Introduction

2 Monte Carlo and Importance Sampling

2.1 Monte Carlo Integration

2.2 Importance Sampling

2.3 f-divergences and IS

2.3.1 Kullback-Leibler divergence

2.3.2 Pearson divergence

2.3.3 Rényi divergence

2.4 Sampling Importance Resampling

3 Option pricing. Importance sampling and f-divergences

3.1 Spread options

3.1.1 Kullback-Leibler divergence

3.1.2 Pearson divergence

3.1.3 Relative information

3.1.4 Least squares problem and IS

3.1.5 Numerical results

3.2 American options

3.2.1 Pricing Bermudan options

3.2.2 The algorithm

3.2.3 Models used

3.2.4 Numerical experiments and conclusions

4 Rare-events probabilities and optimization

4.1 Minimizing the Rényi divergence

4.1.1 Convex optimization problem

4.1.2 Lagrangian analysis

4.2 Estimation of rare-events probabilities

4.2.1 Numerical results

4.3 Optimization

5 Measure the similarity using divergences

5.1 The data-generator

5.1.1 Syntactic pattern recognition

5.1.2 Generating data

5.2 Measuring similarities

5.2.1 Feature extraction

5.2.2 Similarity measures

5.3 Numerical results

6 Conclusions

A Spread options

B American options

Bibliography

## Chapter 1 Introduction

The problem addressed by this thesis broadly concerns the use of f-divergences, mainly for variance reduction in Monte Carlo (MC) integration. An f-divergence is a particular type of measure of the discrepancy between two probability distributions. MC is a classical randomized method for solving various types of problems for which no analytical solutions are known; it is based on generating samples from particular distributions. Hence the problem of comparing distributions comes up naturally.

They are also called Csiszár divergences, after one of the first researchers who studied them, and are generated by convex functions. More generally, a divergence measure is a function of two probability density (or distribution) functions which takes nonnegative values and becomes zero only when the two arguments (distributions) are the same. Often a divergence is not a symmetric function, but it can easily be symmetrized.

There are many techniques for reducing the variance of the MC estimator, and one of these is Importance Sampling (IS). Monte Carlo, f-divergences and various directions of variance minimization for IS and MC estimators are described in more detail in Chapter 2.

We use the MC method in two ways: for pricing financial derivatives known as options, and for estimating rare-event probabilities. Both applications are again linked with the use of f-divergences.

In Chapter 3 we develop some techniques for pricing two option styles.

European spread options are valued using IS and minimizing various divergences; this approach is compared with the least squares method for direct variance minimization. Bermudan options are priced using a modified version of the MRAS algorithm, involving sampling importance resampling following the reference distributions from the standard algorithm.

The problem of estimating rare-event probabilities appears frequently in the analysis of the performance of communication systems (e.g. the probability of failure of a network system). The IS problem for this estimation consists in increasing the frequency of the rare events by changing to a more important distribution. We introduce a new algorithm for such an estimation based on the Rényi divergence instead of the Kullback-Leibler divergence (the cross-entropy method). This algorithm, with its stochastic counterpart and a version for solving continuous optimization problems, is presented in Chapter 4.

The last chapter concerns means for measuring the similarity of time series. Time series are common in various fields of science: medicine, multimedia, computational finance etc., and synthetic datasets are used in prediction and computer simulations.

Numerical experiments show that we can measure similarity with great accuracy by using simple instruments like mean similarity and symmetrized divergences; these tools are easier to compute than the usual features, which include common statistics, extremal points, slope and filtered data.

## Chapter 2

## Monte Carlo and Importance Sampling

The Monte Carlo method (see [21], [27], [46]), although used in the beginning for stochastic simulation only, covers today a wide range of problems which can benefit from randomness and adjacent properties. Generally speaking, any technique which approaches a problem using a "large" number of random samples for various computations takes this famous name.

This method is intended to solve problems for which deterministic/analytic approaches are not available or give poor results.

The method relies heavily on computer generated numbers, which are not truly random, being generated by deterministic mechanisms. Many techniques were developed in order to obtain unrelated (pseudo-random) or low discrepancy (quasi-random) sequences of numbers (see [19], [34]).

Applications of the Monte Carlo method range from the physical sciences, biology and medicine to weather forecasting, risk management and pricing financial derivative instruments (see [12], [17], [23]). We emphasize here a common use in mathematics, namely Monte Carlo integration, which is, perhaps, the first technique to properly receive this name.

### 2.1 Monte Carlo Integration

Many problems which arise in a variety of applications can be described as the evaluation of the expected value of a given random variable. Let $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), \lambda)$ be the Lebesgue measure space. Suppose that $X$ is an $n$-dimensional random variable (random vector) and $f$ is a probability density function (pdf), i.e. $f$ has the following properties:

(2.1) $f : \mathbb{R}^n \to \mathbb{R}_+, \qquad \int_{\mathbb{R}^n} f(s)\, d\lambda(s) = 1.$

Let $H : \mathbb{R}^n \to \mathbb{R}_+$ be (at least) a Lebesgue measurable^1 (or, integrable) function. We want to calculate the expectation of $H(X)$:

(2.2) $m = E_f[H(X)] = \int_{\mathbb{R}^n} H(s) f(s)\, d\lambda(s).$

If $X = (X^i)_{1 \le i \le N}$ are independent and identically distributed (i.i.d.) samples from $f$, then, by the Law of Large Numbers,

(2.3) $m_N(X) = \frac{1}{N} \sum_{i=1}^N H(X^i)$

is an unbiased estimator of $m$ from (2.2): $E[m_N(X)] = m$. The variance of this estimator is

(2.4) $\mathrm{Var}[m_N(X)] = \frac{1}{N} \int_{\mathbb{R}^n} [H(s) - m]^2 f(s)\, d\lambda(s) = \frac{1}{N} \mathrm{Var}_f[H(X)].$

The above variance is a measure of Monte Carlo efficiency and becomes one of the limitations of the method: in order to halve the standard deviation of $m_N(X)$, one has to quadruple the number of samples. A number of techniques are used to reduce the variance, among them Antithetic Variables, Control Variates and Importance Sampling. All these methods aim to reduce the variance without increasing the number of samples. A big part of our work concerns the Importance Sampling method for variance reduction.

^1 $\lambda$ is the Lebesgue measure on $\mathbb{R}^n$.
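A minimal sketch of the crude estimator (2.3) and its standard error (2.4), for the toy choice $H(x) = x^2$, $X \sim N(0,1)$ (so that $m = 1$); the function names are illustrative only:

```python
import math
import random

def mc_estimate(h, sampler, n, seed=0):
    """Crude Monte Carlo estimate of m = E[h(X)] (eq. 2.3), together with
    the estimated standard error sqrt(Var_f[h(X)] / n) (eq. 2.4)."""
    rng = random.Random(seed)
    vals = [h(sampler(rng)) for _ in range(n)]
    m_n = sum(vals) / n
    var = sum((v - m_n) ** 2 for v in vals) / (n - 1)
    return m_n, math.sqrt(var / n)

# Toy example: H(x) = x^2 and X ~ N(0,1), so m = E[X^2] = 1.
m_n, se = mc_estimate(lambda x: x * x, lambda rng: rng.gauss(0.0, 1.0), 100_000)
```

Quadrupling `n` halves the reported standard error, which is exactly the limitation discussed above.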


### 2.2 Importance Sampling

Importance sampling is a technique for estimating a parameter of a distribution while sampling from a different distribution. This means choosing a (possibly better known, or easier to simulate) distribution from which to draw one's random variables. Let $f$ be the original pdf and $g$ another pdf such that $\mathrm{supp}(H f) \subseteq \mathrm{supp}(g)$. We can rewrite (2.2):

(2.5) $E_f[H(X)] = \int_{\mathbb{R}^n} \left[H(s)\frac{f(s)}{g(s)}\right] g(s)\, d\lambda(s) = E_g\!\left[H(X)\frac{f(X)}{g(X)}\right].$

Using (2.5) we get another unbiased estimator of $m$:

(2.6) $m_N(g, X) = \frac{1}{N}\sum_{i=1}^N H(X^i)\frac{f(X^i)}{g(X^i)},$

where $X = (X^i)_{1\le i\le N}$ are i.i.d. samples from $g$, which is known as the importance sampling (IS) distribution. The idea behind Importance Sampling is to draw from another distribution ($g$), and then modify the result to correct the bias introduced in this way.

The variance of this estimator is

$\mathrm{Var}_g[m_N(g, X)] = \frac{1}{N}\int_{\mathbb{R}^n} \left[H(s)\frac{f(s)}{g(s)} - m\right]^2 g(s)\, d\lambda(s).$

Minimizing this variance is equivalent to

(2.7) $\min_{g\in\mathcal{G}} E_f\!\left[H^2(X)\frac{f(X)}{g(X)}\right],$

and its stochastic counterpart is

(2.8) $\min_{g\in\mathcal{G}} \sum_{j=1}^M H^2(Y^j)\frac{f(Y^j)}{g(Y^j)},$

where $Y^1, Y^2, \ldots, Y^M$ are i.i.d. samples from $f$.

The main reason for changing the pdf is to reduce the variance by an appropriate choice of $g$: samples from $g$ could be more "important" for the estimation of our integral. As long as $H(t) > 0$, $\forall t \in \mathbb{R}^n$, the variance of this estimator is minimized when $g$ is proportional to $H f$:

(2.9) $g^*(s) = \frac{H(s) f(s)}{m}$

is the zero-variance IS distribution.
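The effect of a good proposal can be seen on a small tail-probability example: to estimate $m = P(X > 3)$ for $X \sim N(0,1)$, we sample from the shifted proposal $g = N(3, 1)$ and weight by $f/g$ as in (2.6). The setup below is a hypothetical illustration, not one of the thesis experiments:

```python
import math
import random

def phi(x, mu=0.0):
    """Unit-variance normal pdf with mean mu."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def is_estimate(n, mu_is, seed=1):
    """IS estimator (2.6) for m = P(X > 3), X ~ N(0,1), drawing from the
    shifted proposal g = N(mu_is, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu_is, 1.0)
        if x > 3.0:                       # H(x) = 1_{x > 3}
            total += phi(x) / phi(x, mu_is)   # likelihood ratio f/g
    return total / n

est = is_estimate(100_000, 3.0)
# True value: P(X > 3) = 1 - Phi(3) ≈ 1.3499e-3.
```

Almost every proposal sample now lands where $H$ is nonzero, which is why the shifted estimator is far more accurate than crude MC at the same sample size.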

This pdf is hard to determine, as it depends on the desired value $m$. From another point of view [31], the variance of this estimator is

(2.10) $\mathrm{Var}[m_N(g, X)] = \frac{1}{N}\mathrm{Var}_g\!\left[H(X)\frac{f(X)}{g(X)}\right] = \frac{1}{N} E_g\!\left[\left(H(X)\frac{f(X)}{g(X)}\right)^2\right] - \frac{1}{N}\left(E_g\!\left[H(X)\frac{f(X)}{g(X)}\right]\right)^2 = \frac{1}{N} E_f\!\left[H^2(X)\frac{f(X)}{g(X)}\right] - \frac{m^2}{N};$

therefore, minimizing the variance is equivalent to solving the following problem

(2.11) $\min_{g\in\mathcal{G}} E_f\!\left[H^2(X)\frac{f(X)}{g(X)}\right]$

over the entire set of pdf's $\mathcal{G}$.

A more practical approach is to search for an IS distribution from a parametric family $(g_\theta)_\theta$ which minimizes a certain divergence with respect to the zero-variance pdf $g^*$ (see [4] and section 2.3). In many situations this search is refined and we look for distributions of the form (see [4], [22], [31] and [44]):

(2.12) $g(s) = g_1(s_1)\, g_2(s_2)\cdots g_n(s_n), \qquad s = (s_1, s_2, \ldots, s_n),$

i.e., multi-dimensional distributions having independent components.

The (mean-)parametrized distributions usually belong to the so-called natural exponential families ([13], [33]), which have pdf's of the form

(2.13) $f_\theta(s) = \varphi(s)\exp(\langle\theta, s\rangle - \psi(\theta)),$

where $s \in \mathbb{R}^n$, $\theta \in \mathbb{R}^n$ is the parameter of the family, and $\varphi$ and $\psi$ are known functions. We can include here the following distributions: normal with known covariance, Poisson, gamma with known shape parameter.


### 2.3 f-divergences and IS

Let $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), \lambda)$ be the Lebesgue measure space ($\lambda$ is a $\sigma$-finite measure) and let $\mu_1, \mu_2$ be two probability measures on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$, absolutely continuous with respect to $\lambda$ (i.e. $\lambda(M) = 0$ implies $\mu_i(M) = 0$). Denote by $p_i = d\mu_i/d\lambda$ ($i = 1, 2$) the corresponding Radon-Nikodym derivatives with respect to $\lambda$. If $f : \mathbb{R}_+ \to \mathbb{R}$ is a convex function, continuous at 0, the f-divergence of $\mu_1$ and $\mu_2$ (or of $p_1$ and $p_2$) (cf. [2], [11] and [16]) is

(2.14) $D_f(\mu_1, \mu_2) = D_f(p_1, p_2) = \int_{\mathbb{R}^n} p_2(s)\, f\!\left[\frac{p_1(s)}{p_2(s)}\right] d\lambda(s).$

A few examples of particular interest:

$f : \mathbb{R}_+ \to \mathbb{R}$, $f(x) = (x - 1)^2$ gives Karl Pearson's $\chi^2$-divergence:

$D_{\chi^2}(\mu_1, \mu_2) = \int_{\mathbb{R}^n} \frac{[p_1(s) - p_2(s)]^2}{p_2(s)}\, d\lambda(s);$

$f : \mathbb{R}_+ \to \mathbb{R}$, $f(x) = |x - 1|$ gives the total variation distance:

$V(\mu_1, \mu_2) = \int_{\mathbb{R}^n} |p_1(s) - p_2(s)|\, d\lambda(s) \;\left(= 2\sup_{M\in\mathcal{B}(\mathbb{R}^n)} [\mu_1(M) - \mu_2(M)]\right);$

$f : \mathbb{R}_+^* \to \mathbb{R}$, $f(x) = x\log x$ gives the Kullback-Leibler divergence:

$D_{KL}(\mu_1, \mu_2) = \int_{\mathbb{R}^n} p_1(s)\log\!\left[\frac{p_1(s)}{p_2(s)}\right] d\lambda(s);$

$f : \mathbb{R}_+^* \to \mathbb{R}$, $f(x) = x^\alpha$ gives the integral underlying the $\alpha$-order Rényi divergence (see section 2.3.3):

$D_\alpha(\mu_1, \mu_2) = \int_{\mathbb{R}^n} p_1^\alpha(s)\, p_2^{1-\alpha}(s)\, d\lambda(s).$

For all these divergences, $D_f(\mu_1, \mu_2) \ge 0$, and $D_f(\mu_1, \mu_2) = 0$ if and only if $\mu_1 = \mu_2$, i.e., $p_1 = p_2$ almost everywhere (a.e.).

As we already mentioned in section 2.2, instead of solving problem (2.11) we can try to minimize a given f-divergence with respect to the zero-variance pdf $g^*$, that is,

(2.15) $\min_{g\in\mathcal{G}} D_f(g^*, g).$

Sometimes we can also reverse the order of the pdf's; remember that these divergences are not, in general, symmetric.
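Definition (2.14) can be checked numerically: for two unit-variance normal densities the Kullback-Leibler and Pearson divergences have simple closed forms, and a midpoint-rule quadrature over a wide interval (assuming the tails beyond it are negligible) reproduces them:

```python
import math

def f_div(gen_f, p1, p2, lo=-10.0, hi=10.0, n=100_000):
    """Midpoint-rule approximation of the f-divergence (2.14):
    integral of p2(s) * f(p1(s)/p2(s)) over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        s = lo + (i + 0.5) * h
        total += p2(s) * gen_f(p1(s) / p2(s)) * h
    return total

def gauss(mu):
    """Unit-variance normal pdf with mean mu."""
    return lambda x: math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

p1, p2 = gauss(0.0), gauss(1.0)
kl = f_div(lambda x: x * math.log(x), p1, p2)    # f(x) = x log x
chi2 = f_div(lambda x: (x - 1.0) ** 2, p1, p2)   # f(x) = (x - 1)^2
# Closed forms for unit-variance normals with means mu1, mu2:
# KL = (mu1 - mu2)^2 / 2,  chi^2 = exp((mu1 - mu2)^2) - 1.
```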

### 2.3.1 Kullback-Leibler divergence

For the Kullback-Leibler divergence we get a problem (see [4], [26] and [45]) which leads to the well-known method named cross-entropy:

(2.16) $\min_{g\in\mathcal{G}} D_{KL}(g^*, g) = \min_{g\in\mathcal{G}} \int_{\mathbb{R}^n} g^*(s)\log\!\left[\frac{g^*(s)}{g(s)}\right] d\lambda(s),$

or

$\min_{g\in\mathcal{G}} \int_{\mathbb{R}^n} [g^*(s)\log g^*(s) - g^*(s)\log g(s)]\, d\lambda(s),$

which, since $g^* = Hf/m$, is equivalent to

$\max_{g\in\mathcal{G}} \int_{\mathbb{R}^n} H(s) f(s)\log g(s)\, d\lambda(s).$

An IS distribution is then a solution to the following optimization problem:

(2.17) $\arg\max_{g\in\mathcal{G}} E_f[H(X)\log g(X)].$
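When $\mathcal{G}$ is, for instance, the family of unit-variance normal pdf's $\{N(v, 1)\}$, problem (2.17) is solved in closed form by an $H$-weighted mean; below is a sketch of its stochastic counterpart, with an indicator payoff chosen purely for illustration:

```python
import random

def ce_optimal_mean(h, n=100_000, seed=2):
    """Stochastic counterpart of (2.17) for G = {N(v,1) : v real}: the
    maximizer of sum_i H(X^i) log g_v(X^i) is the H-weighted sample mean
    v = sum_i H(X^i) X^i / sum_i H(X^i), with X^i drawn from f = N(0,1)."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        w = h(x)
        num += w * x
        den += w
    return num / den

# H = indicator of {x > 2}: the cross-entropy mean shifts into the rare
# region, close to E[X | X > 2] ≈ 2.37.
v = ce_optimal_mean(lambda x: 1.0 if x > 2.0 else 0.0)
```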

### 2.3.2 Pearson divergence

The (reversed) Pearson divergence can be expanded as

$D_{\chi^2}(\mu_1, \mu_2) = \int_{\mathbb{R}^n} \left(\frac{p_1^2(s)}{p_2(s)} - 2p_1(s) + p_2(s)\right) d\lambda(s) = \int_{\mathbb{R}^n} \frac{p_1^2(s)}{p_2(s)}\, d\lambda(s) - 1 = E_{p_1}\!\left[\frac{p_1(X)}{p_2(X)}\right] - 1.$

In order to find an IS distribution with respect to this divergence we have to solve the following problem:

(2.18) $\arg\min_{g\in\mathcal{G}} D_{\chi^2}(g^*, g) = \arg\min_{g\in\mathcal{G}} E_f\!\left[H^2(X)\frac{f(X)}{g(X)}\right].$


### 2.3.3 Rényi divergence

The relative information of order $\alpha$, or $\alpha$-order Rényi divergence, is:

$D_\alpha(p_1, p_2) = \frac{1}{\alpha - 1}\ln\int_{\mathbb{R}^n} p_1^\alpha(s)\, p_2^{1-\alpha}(s)\, \lambda(ds).$

Therefore, for $\alpha > 1$, and reversing the order of the pdf's, we have the following optimization problem:

$\arg\min_{g\in\mathcal{G}} D_\alpha(g, g^*) = \arg\min_{g\in\mathcal{G}} \int_{\mathbb{R}^n} g^\alpha(s)\, H^{1-\alpha}(s)\, f^{1-\alpha}(s)\, \lambda(ds),$

or

(2.19) $\arg\min_{g\in\mathcal{G}} E_f\!\left[\frac{g^\alpha(X)}{f^\alpha(X)}\, H^{1-\alpha}(X)\right].$

### 2.4 Sampling Importance Resampling

The sampling importance resampling (SIR) method (see [36]) draws a random sample from a distribution with pdf $h$ in two steps. First an independent random sample $X = (X_1, X_2, \ldots, X_n)$ is drawn from a distribution with pdf $g$; then a (usually smaller) sample $Y = (Y_1, Y_2, \ldots, Y_m)$ is drawn (often with replacement) from $X$ with sample probabilities $w(X_i)$, proportional to $h(X_i)/g(X_i)$. In practice

(2.20) $w(X_i) = \frac{h(X_i)/g(X_i)}{\sum_{j=1}^n h(X_j)/g(X_j)}.$

We generate the new samples using a multinomial distribution with these weights; that is, starting from $X$, we give more importance to $h$. We shall see the value of this technique in the following chapters.
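A small sketch of the procedure, with an illustrative target $h = N(2,1)$ and proposal $g = N(0,2)$; since the weights (2.20) are normalized, both densities may be supplied up to constants:

```python
import math
import random

def sir(h, g, g_sampler, n, m, rng):
    """Sampling importance resampling: draw n samples from the proposal g,
    compute weights (2.20) proportional to h/g, then resample m values
    with replacement from a multinomial with those weights."""
    xs = [g_sampler(rng) for _ in range(n)]
    ws = [h(x) / g(x) for x in xs]
    return rng.choices(xs, weights=ws, k=m)  # choices normalizes the weights

rng = random.Random(3)
# Target h = N(2,1), proposal g = N(0,2), both up to normalizing constants.
target = lambda x: math.exp(-0.5 * (x - 2.0) ** 2)
proposal = lambda x: math.exp(-x * x / 8.0)
ys = sir(target, proposal, lambda r: r.gauss(0.0, 2.0), 50_000, 5_000, rng)
mean_y = sum(ys) / len(ys)
var_y = sum((y - mean_y) ** 2 for y in ys) / len(ys)
```

The resampled values are approximately distributed according to the target: their mean and variance should be near 2 and 1, respectively.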

## Chapter 3

## Option pricing. Importance sampling and f-divergences

In this chapter we develop some techniques used mainly for pricing financial instruments, but useful in a more general framework. We involve here importance sampling techniques and the minimization of f-divergences.

There are many financial instruments for which closed-form formulae cannot be derived from the existing mathematical models. One example is the classical result of Black and Scholes, which covers only a small part of the entire spectrum of derivatives, especially when it comes to multivariate contingent claims.

In the following sections we study two types of option pricing: spread options and a variety of American option known as the Bermudan option. An option is a derivative instrument: its value is derived from the price of an underlying asset (commodities, stocks, currencies or financial indexes).

An option is a contract between two parties (a seller and a buyer) in which one party, the buyer, buys from the other, the seller, the right to engage in a future transaction concerning the underlying asset. The buyer has the right, but not the obligation, to carry out this transaction, while the seller has the obligation to engage in the transaction if the first party decides to do so. Therefore an option contract may or may not be exercised at the agreed moment(s) in time.

Depending on the transaction involved, there are two types of options: if the contract gives the right to buy the asset(s) it is called a call option, while if it gives the right to sell the asset(s) it is called a put option.

Depending on the moment when the transaction can be exercised, there are two main styles of options: European, where the option can be exercised only at the expiration (maturity) date, and American, where the contract can be exercised any time between the writing and the expiration date. Between these two reference styles there exist many others, like the Bermudan option, where the buyer has the right to exercise at a designated number of times, and the Canary option, where the buyer has the right to exercise at a designated number of times but not before a given time period.


### 3.1 Spread options

One of these derivatives is the spread option, which is very widespread in the financial markets (equity, commodities, foreign exchange and energy markets), despite the fact that the corresponding pricing and hedging methods are not so well developed.

In this section we use the divergences from section 2.3 to approximate the price of European call spread options. The payoff function $H(Z)$ is given by formula (A.5), and we have to approximate $V = E[H(Z)]$.

We consider here the restriction of $\mathcal{G}$ to the family $\mathcal{N}_2$ of bivariate normal distributions with unit variances and constant correlation, parameterized by the means $v = (v_1, v_2)$. The pdf's from this family are:

$\varphi_v(s) = \varphi_{(v_1, v_2)}(s_1, s_2) = \varphi_0(s_1 - v_1, s_2 - v_2),$

where

$\varphi_0(s_1, s_2) = \frac{1}{2\pi\sqrt{1 - \rho_0^2}}\exp\!\left[-\frac{1}{2(1 - \rho_0^2)}\left(s_1^2 - 2\rho_0 s_1 s_2 + s_2^2\right)\right].$

### 3.1.1 Kullback-Leibler divergence

Supposing that $f = \varphi_u$, problem (2.17) becomes

(3.1) $\arg\max_v E_{\varphi_u}[H(Z)\ln\varphi_v(Z)],$

while its stochastic version is

(3.2) $\arg\max_v \sum_{i=1}^N H(Z^i)\ln\varphi_v(Z^i),$

where $Z^1, Z^2, \ldots, Z^N$ are i.i.d. samples from $\varphi_u$. After some algebra, problem (3.2) takes the following form:

(3.3) $\arg\min_v \sum_{i=1}^N H(Z^i)\left[(Z_1^i - v_1)^2 - 2\rho_0(Z_1^i - v_1)(Z_2^i - v_2) + (Z_2^i - v_2)^2\right].$

As the function to optimize in (3.3) is a quadratic polynomial in $v_1$ and $v_2$, the problem can be solved directly. Alternatively, for this simulation, one can use importance sampling again, and the optimal solution can be estimated by an adaptive procedure similar to the one used in the cross-entropy method (see [4]).
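Since (3.3) is quadratic, for $|\rho_0| < 1$ its exact minimizer reduces to the $H$-weighted sample means of $Z_1$ and $Z_2$, independently of the correlation level. The sketch below uses a toy payoff $H = \max(z_1 - z_2, 0)$ rather than the actual spread payoff (A.5):

```python
import random

def kl_is_means(h, rho, n=50_000, seed=4):
    """Exact minimizer of the quadratic problem (3.3): for |rho| < 1 the
    optimal IS means are the H-weighted sample means of Z1 and Z2,
    with (Z1, Z2) drawn rho-correlated from the original pdf."""
    rng = random.Random(seed)
    num1 = num2 = den = 0.0
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + (1.0 - rho * rho) ** 0.5 * rng.gauss(0.0, 1.0)
        w = h(z1, z2)
        num1 += w * z1
        num2 += w * z2
        den += w
    return num1 / den, num2 / den

# Toy payoff H = max(z1 - z2, 0): by symmetry the IS means separate,
# v1 > 0 > v2 with v1 ≈ -v2.
v1, v2 = kl_is_means(lambda a, b: max(a - b, 0.0), rho=0.3)
```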

### 3.1.2 Pearson divergence

Problem (2.18) is transformed into

(3.4) $\arg\min_v E_{\varphi_u}\!\left[H^2(Z)\frac{\varphi_u(Z)}{\varphi_v(Z)}\right].$

Using a Monte Carlo estimator for the above expectation, we get the stochastic counterpart of (3.4):

(3.5) $\arg\min_v \sum_{i=1}^N H^2(Z^i)\frac{\varphi_u(Z^i)}{\varphi_v(Z^i)},$

where $Z^1, Z^2, \ldots, Z^N$ is a random sample from $\varphi_u$.

### 3.1.3 Relative information

The problem of minimizing the relative information of order $\alpha$ from (2.19) is:

(3.6) $\arg\min_v E_{\varphi_u}\!\left[H^{1-\alpha}(Z)\frac{\varphi_v^\alpha(Z)}{\varphi_u^\alpha(Z)}\right],$

with its stochastic version:

(3.7) $\arg\min_v \sum_{i=1}^N H^{1-\alpha}(Z^i)\frac{\varphi_v^\alpha(Z^i)}{\varphi_u^\alpha(Z^i)},$

where $Z^1, Z^2, \ldots, Z^N$ is a random sample from $\varphi_u$.

### 3.1.4 Least squares problem and IS

We recall the nonlinear least squares framework ([30], [35]). Suppose we have a function $p : \mathbb{R}^n \to \mathbb{R}^m$, where $m \ge n$, and we have to solve the problem

(3.8) $\arg\min_x P(x), \quad \text{where } P(x) = \|p(x)\|^2 = \sum_{i=1}^m p_i^2(x)$

and $p(x) = (p_1(x), p_2(x), \ldots, p_m(x))^T$. If all $p_j$ are linear functions we have the linear least squares problem, otherwise we have a nonlinear least squares problem.

This is a frequent approach for approximating solutions to overdetermined systems of equations: instead of solving such a system we try to minimize the sum of the squares of the errors.

Applications of this method include data fitting, regression analysis and statistics. There are some very efficient iterative methods which address this problem; we used the Levenberg-Marquardt method and Powell's Dog Leg method, the latter being potentially more efficient but giving worse results in our experiments. These methods are useful whenever the function to minimize is a sum of non-negative and smooth enough (say $C^2$-class) terms.

Let us recall the problem (2.8), the stochastic version for variance min-
imization, in our framework (G = N2, Y^{1}; Y^{2}; : : : ; Y^{M} i.i.d. samples from
'u):

arg min

v

XM j=1

H^{2}(Y^{j})'u(Y^{j})
'v(Y^{j}) =
arg min

v

1 2

XM j=1

0

@H(Y^{j})

vu

ut'u(Y^{j})
'v(Y^{j})

1 A

2

: (3.9)

Since 'v(s) > 0, 8s 2 R^{n} and 8v 2 R^{2}, the problem (3.9) is a least
squares problem of form (3.8). Therefore we can solve it using one of the
already mentioned iterative methods.

In conclusion, the least squares method can be used in two different ways for solving the stochastic problems we already formulated. First, to minimize the variance directly and find an optimal IS distribution. On the other hand, we can use the divergences to find the IS distribution: the Kullback-Leibler divergence can be used directly, but, since $H(s) \ge 0$, $\forall s \in \mathbb{R}^n$, problems (3.5) and (3.7) also have the form of a least squares problem (3.8); hence, for the Pearson divergence and the relative information, we can again involve the least squares method.

It is worth noting here that the Kullback-Leibler divergence has a big advantage over many other divergences, at least when using distributions from a natural exponential family: the logarithm involved in its definition cancels the exponential function of the corresponding pdf's. In this respect the Kullback-Leibler divergence leads to exact methods, while for other divergences, as we have already seen, we have to solve one more (stochastic) optimization problem.
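A 1-D toy version of the least squares formulation (3.9), with unit-variance normal pdf's; a plain grid search stands in for Levenberg-Marquardt so the sketch stays dependency-free. For $H(y) = e^y$ and $f = N(0,1)$ the zero-variance pdf is exactly $N(1,1)$, so the estimated optimal mean should be near 1:

```python
import math
import random

def lsq_is_mean(h, u=0.0, n=2_000, seed=5):
    """Least-squares IS sketch for a 1-D analogue of (3.9): the residuals
    are r_j(v) = H(Y^j) * sqrt(phi_u(Y^j) / phi_v(Y^j)) with unit-variance
    normal pdfs; the sum of squares is minimized by grid search here
    instead of an iterative least squares solver."""
    rng = random.Random(seed)
    ys = [rng.gauss(u, 1.0) for _ in range(n)]
    def obj(v):
        # sum_j r_j(v)^2 = sum_j H(y_j)^2 * phi_u(y_j) / phi_v(y_j)
        return sum(h(y) ** 2 * math.exp(-0.5 * (y - u) ** 2
                                        + 0.5 * (y - v) ** 2) for y in ys)
    grid = [i / 100.0 for i in range(-100, 301)]   # search v in [-1, 3]
    return min(grid, key=obj)

# H(y) = e^y tilts N(0,1) into N(1,1), so the minimizer is close to 1.
v_opt = lsq_is_mean(lambda y: math.exp(y))
```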

### 3.1.5 Numerical results

Our numerical experiments (see [37]) address all four importance sampling variants: (3.3) - KL-IS, (3.5) - $\chi^2$-IS, (3.7) - RI-IS, and (3.9) - LSq-IS.

The tests were conducted for a European call spread option with the following characteristics: spot prices $S_1^0 = 105$ and $S_2^0 = 110$, dividend yields $q_1 = 2\%$ and $q_2 = 3\%$, volatilities $\sigma_1 = 15\%$ and $\sigma_2 = 10\%$, and interest rate $r = 5\%$.

The parameter of the distributions which is changed is the pair of means $(v_1, v_2)$ of the two jointly Gaussian variables $(Z_1, Z_2)$, which are $\rho$-correlated.

As an estimate of efficiency we used the variance ratio $\sigma^2_{MC}/\sigma^2_{IS}$, the ratio between the variances estimated over 1,000,000 simulated paths.

Table 6.4 shows the results for various levels of the correlation coefficient $\rho$ and strike $K$ (the larger the value, the better the method). Every value is obtained as an average over 30 samples; in parentheses are the standard errors of the mean.

The Least Squares method (LSq-IS) and Kullback-Leibler (KL-IS) are the best methods. It is worth noting that, although LSq-IS works better most of the time, KL-IS performs better for deep in-the-money spread options.

KL-IS always achieves at least a fivefold variance reduction over crude Monte Carlo, which makes it the more reliable method. LSq-IS has its worst results for strongly correlated assets, while KL-IS is less dependent on the correlation level.

Overall, $\chi^2$-IS is the worst of the first three methods, but it still performs better than LSq-IS for deep in-the-money, positively correlated options. The RI-IS method needs further investigation concerning the fine tuning of the parameter $\alpha$, which is 1.5 in these experiments. This method gives the best results at low correlation levels.

### 3.2 American options

This section is motivated by the problem of pricing American options, very challenging financial instruments from a mathematical point of view.

American options can be exercised before (but not after) the maturity time; the problem of pricing such options is more delicate since, in addition to estimating the option's value, one has first to find the optimal exercise time. We restrict our study to Bermudan options, a style of American options which can be exercised only at a finite number of times.

We improve (see [38], [39]) a method successfully used for solving optimization problems, named Model Reference Adaptive Search (MRAS [22]), which is an approach similar to the cross-entropy method (see [4], [45]).

Such stochastic optimization methods involve two iterative stages:

- generate data samples using a specific random procedure, typically a distribution with known parameters;

- update the parameters of the random procedure using the data from the previous step.

The computation of the parameters in the second step often involves expectations of random variables, which are estimated (as we have already seen) by Monte Carlo simulation. Although MRAS itself is a form of importance sampling, we use importance sampling here in the form of sampling importance resampling.

That is, from the samples generated in the first step we resample (with replacement) using a multinomial distribution having probabilities proportional to their importance ratios. In this way we give more importance to samples which shift towards another distribution; the main distributions we chose for resampling are the reference distributions of the original algorithm.

### 3.2.1 Pricing Bermudan options

Recall, from Appendix B, the value of a put Bermudan option with early exercise thresholds $S = (S_i^*)_{1\le i\le n}$:

(3.10) $H(S^*) = \max_S E[V_{p,S}],$

where $V_{p,S}$ is given by (B.3).

The vector of threshold prices must be generated from a multivariate Gaussian distribution truncated to one of the polytopes (B.7); the pdf of such a distribution (with mean vector $\mu$ and covariance matrix $\Sigma$), truncated to a region $X$, is

$\varphi_X(s; \mu, \Sigma) = \frac{\mathbf{1}_X(s)}{\sqrt{(2\pi)^n |\Sigma|}}\exp\!\left(-\frac{1}{2}(s - \mu)^T\Sigma^{-1}(s - \mu)\right).$

As the threshold prices are the arguments being optimized in the algorithm, a fast and accurate sampling procedure is crucial for the quality of our results. Therefore we avoid the accept-reject method for this truncated distribution and instead use a Gibbs sampler combined with a Metropolis chain (see [14], [24], [50]).

After all these prices are determined, the value of the desired option can be calculated by estimating the expectation of the value function (B.3) (this is done by forward simulation, knowing the equation which models the price dynamics).

### 3.2.2 The algorithm

In this section we analyze only put options; the problem is to find

$S^* = \arg\max_S H(S).$

MRAS is an iterative procedure which, in our case, generates candidate solutions following the normal distribution $N(\mu, \Sigma)$, and updates the parameters of the distribution using:

(3.11) $\mu_{t+1} = \dfrac{E_t\!\left[\frac{s[H(S)]^t}{\varphi_{X_p}(S;\,\mu_t,\Sigma_t)}\,\mathbf{1}_{[H(S)\ge\gamma_{t+1}]}\, S\right]}{E_t\!\left[\frac{s[H(S)]^t}{\varphi_{X_p}(S;\,\mu_t,\Sigma_t)}\,\mathbf{1}_{[H(S)\ge\gamma_{t+1}]}\right]},$

(3.12) $\Sigma_{t+1} = \dfrac{E_t\!\left[\frac{s[H(S)]^t}{\varphi_{X_p}(S;\,\mu_t,\Sigma_t)}\,\mathbf{1}_{[H(S)\ge\gamma_{t+1}]}\, M(S, \mu_{t+1})\right]}{E_t\!\left[\frac{s[H(S)]^t}{\varphi_{X_p}(S;\,\mu_t,\Sigma_t)}\,\mathbf{1}_{[H(S)\ge\gamma_{t+1}]}\right]},$

where $s$ is a continuous, strictly increasing, positive function, the expectation $E_t[\cdot]$ is taken under the truncated distribution $N_{X_p}(\mu_t, \Sigma_t)$, and

$M(S, \mu_{t+1}) = (S - \mu_{t+1})(S - \mu_{t+1})^T.$

The above expectations are estimated by Monte Carlo simulation; we sample a sequence of i.i.d. Gaussian vectors $S = (S^{(t,j)})_{1\le j\le N_t} \subset X_p$ and, for the sake of simplicity, define the following weights:

$w_j(S; \mu, \Sigma, \gamma) = \dfrac{\frac{s[H(S^{(t,j)})]^t}{\varphi_{X_p}(S^{(t,j)};\,\mu,\Sigma)}\,\mathbf{1}_{[H(S^{(t,j)})\ge\gamma]}}{\sum_{j=1}^{N_t} \frac{s[H(S^{(t,j)})]^t}{\varphi_{X_p}(S^{(t,j)};\,\mu,\Sigma)}\,\mathbf{1}_{[H(S^{(t,j)})\ge\gamma]}},$

where $\mu \in \mathbb{R}^n$, $\Sigma \in \mathbb{R}^{n\times n}$, $\gamma \in \mathbb{R}$. The parameters of the truncated multivariate Gaussian distribution are updated using the following formulas:

(3.13) $\mu_{t+1} = \sum_{j=1}^{N_t} w_j(S; \mu_t, \Sigma_t, \gamma_{t+1})\, S^{(t,j)},$

(3.14) $\Sigma_{t+1} = \sum_{j=1}^{N_t} w_j(S; \mu_t, \Sigma_t, \gamma_{t+1})\, M(S^{(t,j)}, \mu_{t+1}).$

A smoothing coefficient is defined ($\alpha \in (0, 1)$, $p \in \mathbb{N}^*$):

$\alpha_t = \alpha\left(1 - \frac{1}{t}\right)^p,$

and the parameters are modified accordingly:

(3.15) $\hat{\mu}_t = \alpha_t\mu_t + (1 - \alpha_t)\hat{\mu}_{t-1}, \qquad \hat{\Sigma}_t = \alpha_t\Sigma_t + (1 - \alpha_t)\hat{\Sigma}_{t-1}.$

We describe a modified version of the MRAS algorithm which includes a sampling importance resampling phase: before calculating the sample $(1 - \rho)$-quantile, we replace the actual samples using sampling importance resampling. First, we can resample based on the natural weights: if $S = (S^j)_{1\le j\le N}$ is the currently generated sample, we determine the following weights:

(3.16) $u_j = \frac{H(S^j)}{\sum_{i=1}^N H(S^i)}$

and generate new samples using a multinomial distribution with these weights. That is, we give more importance to those samples which have greater payoffs.

In the following, let us recall a few theoretical considerations concerning the global convergence of (the exact version of) MRAS. The parameters to update at each iteration are $(\mu_t, \Sigma_t) = \theta_t$; this merged parameter is chosen in such a way that $\varphi_{\theta_{t+1}}$, the next pdf, is "closer" to the corresponding distribution from the sequence of so-called reference distributions $(g_k)_{k\ge 1}$, where:

(3.17) $g_{t+1}(x) = \frac{s[H(x)]\,\mathbf{1}_{[H(x)\ge\gamma_{t+1}]}\, g_t(x)}{E_{g_t}\!\left[s[H(X)]\,\mathbf{1}_{[H(X)\ge\gamma_{t+1}]}\right]}, \qquad \forall\, t \ge 1,$

(3.18) $g_1(x) = \frac{\mathbf{1}_{[H(x)\ge\gamma_1]}}{E_{\theta_0}\!\left[\mathbf{1}_{[H(X)\ge\gamma_1]}\right]}.$

Hence, for step $t$, a more natural resampling would use the following weights:

(3.19) $v_j = \frac{[s(H(S^j))]^{t-1}\,\mathbf{1}_{[H(S^j)\ge\gamma_t]}}{\sum_{i=1}^N [s(H(S^i))]^{t-1}\,\mathbf{1}_{[H(S^i)\ge\gamma_t]}}.$
The algorithm follows:

Step 1 (initialization): quantile level $\rho_0$, sample size $N_0$, initial parameters $\mu_0$ and $\Sigma_0$, sample size level $\delta$, a limit parameter $\varepsilon > 0$, a weight $\lambda \in (0, 1)$, a continuous increasing positive function $s(\cdot)$, and $t = 0$.

Step 2 (the general iteration of the algorithm):

- generate i.i.d. samples $S^t = \left(S^{(t,1)}, S^{(t,2)}, \ldots, S^{(t,N_t)}\right)$ from the density $\hat{\varphi}_t = \lambda\varphi_t + (1 - \lambda)\varphi_0$, where $\varphi_t$ is the density of the truncated normal distribution with parameters $\hat{\mu}_t, \hat{\Sigma}_t$;

- resample using (3.16) or (3.19) and get new samples $S^{(t,(1))}, S^{(t,(2))}, \ldots, S^{(t,(N_t))}$;

- calculate the $(1 - \rho_t)$-quantile $\bar{\gamma}_{t+1}(\rho_t, N_t)$ of the sample payoffs $H\!\left(S^{(t,(1))}\right), H\!\left(S^{(t,(2))}\right), \ldots, H\!\left(S^{(t,(N_t))}\right)$;

- if ($t = 0$ or $\bar{\gamma}_{t+1}(\rho_t, N_t) \ge \gamma_t + \varepsilon$) then set $\gamma_{t+1} \leftarrow \bar{\gamma}_{t+1}(\rho_t, N_t)$, $\rho_{t+1} \leftarrow \rho_t$, $N_{t+1} \leftarrow N_t$; else, if there exists $\bar{\rho} = \max\{\rho' : \bar{\gamma}_{t+1}(\rho', N_t) \ge \gamma_t + \varepsilon,\ 0 \le \rho' \le \rho_t\}$, then set $\gamma_{t+1} \leftarrow \bar{\gamma}_{t+1}(\bar{\rho}, N_t)$, $\rho_{t+1} \leftarrow \bar{\rho}$, $N_{t+1} \leftarrow N_t$; otherwise set $\gamma_{t+1} \leftarrow \gamma_t$, $\rho_{t+1} \leftarrow \rho_t$ and increase the sample size, $N_{t+1} \leftarrow \delta N_t$;

- update and smooth $\mu_{t+1}$ and $\Sigma_{t+1}$ using (3.13), (3.14) and, respectively, (3.15);

- $t \leftarrow t + 1$.
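The resampling-plus-quantile part of the iteration can be sketched in isolation: with the natural weights (3.16) and toy payoffs uniform on $[0, 1]$, the resampled payoffs have cdf $x^2$, so the 0.9-quantile moves from 0.9 up to about $\sqrt{0.9} \approx 0.949$:

```python
import random

def sir_quantile_step(payoffs, rho, rng):
    """One resampling + quantile step: resample indices with the natural
    weights (3.16) (multinomial, with replacement), then take the
    (1 - rho) sample quantile of the resampled payoffs as the new gamma."""
    idx = rng.choices(range(len(payoffs)), weights=payoffs, k=len(payoffs))
    resampled = sorted(payoffs[i] for i in idx)
    return resampled[int((1 - rho) * (len(resampled) - 1))]

rng = random.Random(6)
pay = [rng.random() for _ in range(10_000)]   # toy payoffs on [0, 1]
gamma = sir_quantile_step(pay, rho=0.1, rng=rng)
```

Resampling proportionally to the payoff biases the quantile threshold upward, which is precisely the intended effect of the SIR phase.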

### 3.2.3 Models used

The techniques described above are tested by pricing Bermudan options under three different models of stock price dynamics: geometric Brownian motion, normal jump diffusion, and a relatively new framework, an asymmetric double-exponential jump diffusion model.

Geometric Brownian motion model. We say that the underlying stock price $S(t)$ follows a geometric Brownian motion if

(3.20) $dS(t) = \mu S(t)\, dt + \sigma S(t)\, dW(t),$

where $(W(t))_{t\ge 0}$ is a standard Wiener process (or Brownian motion), i.e. $W(0) = 0$, $t \mapsto W(t)$ is continuous almost surely, and its increments are mutually independent and stationary ($W(t + s) - W(s) \sim N(0, t)$, $\forall s \ge 0$).

For this model $\mu = r - q$, where $q$ is the dividend yield, $r$ the interest rate, and $\sigma$ is the volatility, all supposed constant. Under these conditions, using Itô's lemma, equation (3.20) has the following solution:

$S(t) = S(0)\exp\!\left[\left(\mu - \frac{\sigma^2}{2}\right)t + \sigma W(t)\right],$

from which the discrete counterpart used for simulation, over a time step $\Delta$, is

$\frac{S_+}{S} = \exp\!\left[\left(\mu - \frac{\sigma^2}{2}\right)\Delta + \sigma\sqrt{\Delta}\, Z\right],$

where $Z$ is a standard normal random variable.
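The one-step log-normal transition above can be simulated exactly; a sketch (parameter values are illustrative only), checking the identity $E[S(T)] = S(0)e^{(r-q)T}$:

```python
import math
import random

def gbm_path(s0, r, q, sigma, dt, steps, rng):
    """Discrete geometric-Brownian-motion path with mu = r - q, using the
    exact log-normal one-step transition."""
    s = s0
    path = [s0]
    for _ in range(steps):
        z = rng.gauss(0.0, 1.0)
        s *= math.exp((r - q - 0.5 * sigma ** 2) * dt
                      + sigma * math.sqrt(dt) * z)
        path.append(s)
    return path

rng = random.Random(7)
# Check E[S(T)] = S(0) * exp((r - q) * T) by averaging terminal values.
n, T, steps = 20_000, 1.0, 12
mean_st = sum(gbm_path(100.0, 0.05, 0.02, 0.2, T / steps, steps, rng)[-1]
              for _ in range(n)) / n
```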

Merton normal jump diffusion model. From a practical point of view we know that geometric Brownian motion does not always accurately reproduce the stock price behaviour. Therefore other models which allow jumps have been introduced, namely jump-diffusion models ([10], [47]). Merton ([32]) proposed the following dynamics for the underlying stock price:

(3.21) $dS(t) = \mu S(t)\, dt + \sigma S(t)\, dW(t) + S(t)\, dX(t), \qquad X(t) = \sum_{i=1}^{N(t)} Y_i,$

where $W(t)$ is a standard Wiener process and $X(t)$ is a compound Poisson process: $N(t)$, the number of jumps, is a Poisson process with parameter $\lambda$, and $Y_1, Y_2, \ldots$ is a sequence of independent and identically log-normally distributed, $LN(-\delta^2/2, \delta^2)$, random variables; here $\lambda$ is the frequency and $\delta$ is the volatility of the jumps.

The discrete form of equation (3.21), over a time step $\Delta$, is:

$\frac{S_+}{S} = \exp\!\left[\left(\mu - \frac{\sigma^2}{2}\right)\Delta + \sigma\sqrt{\Delta}\, Z_0 + \sum_{i=1}^{N(\Delta)}\left(\delta Z_i - \frac{\delta^2}{2}\right)\right],$

where $Z_0, Z_1, \ldots$ are i.i.d. standard normal random variables, and $N(\Delta)$ is Poisson distributed with parameter $\lambda\Delta$.
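A sketch of one discretized Merton step (illustrative parameters); the Poisson variate is sampled by inversion, since the Python standard library has no Poisson generator. With the $LN(-\delta^2/2, \delta^2)$ jump convention, $E[S_+/S] = e^{\mu\Delta}$, which the sample mean should reproduce:

```python
import math
import random

def merton_step(s, mu, sigma, lam, delta, dt, rng):
    """One step of the discretized Merton jump diffusion: a diffusion
    factor plus N ~ Poisson(lam*dt) log-normal LN(-delta^2/2, delta^2)
    jump factors."""
    # Poisson sampling by inversion of the cdf.
    p = rng.random()
    k = 0
    term = cum = math.exp(-lam * dt)
    while p > cum:
        k += 1
        term *= lam * dt / k
        cum += term
    jumps = sum(delta * rng.gauss(0.0, 1.0) - 0.5 * delta ** 2
                for _ in range(k))
    z = rng.gauss(0.0, 1.0)
    return s * math.exp((mu - 0.5 * sigma ** 2) * dt
                        + sigma * math.sqrt(dt) * z + jumps)

rng = random.Random(9)
# With these jumps, E[S_+/S] = exp(mu * dt) ≈ 1.0513 for mu=0.05, dt=1.
mean_ratio = sum(merton_step(1.0, 0.05, 0.2, 0.5, 0.3, 1.0, rng)
                 for _ in range(40_000)) / 40_000
```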

A double exponentially jump diffusion model Kou proposed (see [25]) another jump diffusion model for the asset price, which basically differs from the above model by the distribution of jump sizes which are double exponentially:

$$(3.22)\qquad dS(t) = \mu S(t)\,dt + \sigma S(t)\,dW(t) + S(t)\,dV(t), \qquad V(t) = \sum_{i=1}^{N(t)} (V_i - 1),$$

where $W(t)$ is a standard Wiener process and $V(t)$ is a compound Poisson process: $N(t)$ is a Poisson process with rate $\lambda$, and $V_1, V_2, \ldots$ are independent and identically distributed log-asymmetric double exponential random variables, i.e. $Y = \log(V_i)$ has density

$$f(x) = \begin{cases} p\,\eta_1 e^{-\eta_1 x}, & x \ge 0 \\ (1-p)\,\eta_2 e^{\eta_2 x}, & x < 0, \end{cases}$$


$p$ and $(1-p)$ being the probabilities of up and down jumps. The mean and variance of this distribution are

$$E[Y] = \frac{p}{\eta_1} - \frac{1-p}{\eta_2}, \qquad \mathrm{Var}[Y] = p(1-p)\left(\frac{1}{\eta_1} + \frac{1}{\eta_2}\right)^2 + \frac{p}{\eta_1^2} + \frac{1-p}{\eta_2^2}.$$
The solution to equation (3.22) is

$$(3.23)\qquad S(t) = S(0) \exp\left[\left(\mu - \frac{\sigma^2}{2}\right) t + \sigma W(t)\right] \prod_{i=1}^{N(t)} V_i.$$

The discrete form of equation (3.23) is:

$$S_{t+\Delta} = S_t \exp\left[\left(\mu - \frac{\sigma^2}{2}\right)\Delta + \sigma\sqrt{\Delta}\, Z_0 + \sum_{i=1}^{N(\Delta)} Y_i\right],$$

where $Z_0$ is a standard normal random variable, $Y_1, Y_2, \ldots$ are i.i.d. asymmetric double exponential random variables (with the density above), and $N(\Delta)$ is Poisson distributed with parameter $\lambda\Delta$.
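The only new ingredient relative to the Merton scheme is sampling the jump sizes $Y_i$, which reduces to a coin flip between two exponentials. A hedged sketch (names and parameters are ours, not from the thesis); the moment formulas above can be checked empirically against it:

```python
import math
import random

def sample_double_exp(p, eta1, eta2, rng=random):
    """Sample Y = log(V) from the asymmetric double exponential density
    f(x) = p*eta1*exp(-eta1*x) for x >= 0, (1-p)*eta2*exp(eta2*x) for x < 0:
    an upward Exp(eta1) jump with probability p, otherwise a downward
    Exp(eta2) jump."""
    if rng.random() < p:
        return rng.expovariate(eta1)
    return -rng.expovariate(eta2)

def kou_step(s, mu, sigma, lam, p, eta1, eta2, dt, rng=random):
    """One step of the discretized Kou model:
    S_{t+dt} = S_t * exp((mu - sigma^2/2)*dt + sigma*sqrt(dt)*Z_0
               + sum_{i=1}^{N} Y_i), with N ~ Poisson(lam*dt)."""
    # count jump arrivals in [0, dt] (Knuth's Poisson sampler)
    n, l, q = 0, math.exp(-lam * dt), 1.0
    while True:
        q *= rng.random()
        if q <= l:
            break
        n += 1
    log_ret = (mu - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    log_ret += sum(sample_double_exp(p, eta1, eta2, rng) for _ in range(n))
    return s * math.exp(log_ret)
```

For example, with $p = 0.4$, $\eta_1 = 10$, $\eta_2 = 5$ the formulas give $E[Y] = -0.08$ and $\mathrm{Var}[Y] = 0.0496$, and the empirical moments of `sample_double_exp` converge to these values.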

### 3.2.4 Numerical experiments and conclusions

In our implementation the stopping criterion includes, besides the maximum number of steps ($N_t > N_{\max}$), a minimum number of valid samples to be used in the updating phase.

The numerical results (see also [38], [39]) are obtained under the following conditions: initial sample size $N_0 = 200$, initial quantile level $0.5$, smoothing parameter $0.8$, sample size level correction $2$, weight parameter $0.3$, and $\varepsilon = 0.001$. The increasing positive function used is $s(x) = \exp(0.1x)$, the initial mean is a vector having all components $10$, and the initial covariance matrix has diagonal elements $225$. Option prices are obtained using $50{,}000$ simulations, after the threshold prices are estimated.

Tables 6.1 to 6.3 show the results of our simulations: the prices, the standard errors and the average number of iterations. All models are tested for various early exercise dates and different first threshold price values.

The two sampling importance resampling versions are: the uniform (u-SIR) and the reference-distribution (rd-SIR) sampling importance resampling. Both perform fewer steps than the standard algorithm, and rd-SIR needs at most half as many.

The prices we get are slightly smaller for importance resampling, although they remain very close to those obtained with the standard MRAS procedure. In almost all cases the standard error of the mean is similar for all three algorithms; an exception is the Kou and Merton models with 6 early exercise dates, where the standard error almost doubles (remaining under 0.05).

Our algorithm runs almost twice as fast as the standard algorithm while producing the same standard errors, which makes it a reliable and faster method.

## Chapter 4

## Rare-events probabilities and optimization

Estimation of rare-event probabilities is very important for the performance guarantees of various systems – usually stochastic networks (e.g. in telecommunications). There are a number of randomized methods for this estimation: genetic algorithms, simulated annealing, etc.; among them the cross-entropy method (see [4] and [44]) is one of the most successful.

This type of approach can be used not only for estimating probabilities but also for solving various combinatorial optimization problems. Applications of the cross-entropy method show its utility even for hard combinatorial problems.

The general procedure we describe (see [40], [41]) does not involve any specific family of distributions; the only restriction is that the search space consists of product-form probability density functions. We discuss an algorithm for the estimation of rare-event probabilities and a version for continuous optimization. The results of numerical experiments with these algorithms, reported in the last section, support their performance.

### 4.1 Minimizing the Rényi divergence

Many problems which arise in a variety of applications of operations research can be described as the evaluation of the expected value of a given random variable. Areas of interest to us which use such an evaluation are rare-event simulation and global optimization.

Let us recall the problem (2.15) of minimizing an $f$-divergence with respect to the zero-variance pdf, $g^*$; this is a subproblem of (2.11). If we use the Kullback-Leibler divergence we get the problem (2.17).

In this section we present an alternative approach to the problem of estimating rare-event probabilities using the class of Rényi divergences of order $\alpha$. As we will see later, this approach can also be used for solving global optimization problems.

In the remainder of this chapter we suppose that $\alpha > 1$. The restriction we impose is that the pdfs have a product form, i.e., they are distributions with independent components:

$$\mathcal{G} = \left\{ g : \mathbb{R}^n \to \mathbb{R}_+ \,:\, \int_{\mathbb{R}^n} g(s)\, d\mu(s) = 1, \; g(s) = \prod_{i=1}^n g_i(s_i), \; \forall\, s \in \mathbb{R}^n \right\}.$$

This constraint will allow us to study the problem of divergence minimization in more detail.

As mentioned earlier, we propose to choose as IS distribution the one which minimizes the Rényi divergence:

$$(4.1)\qquad \min_{g \in \mathcal{G}} \int_{\mathbb{R}^n} \left[(g^*)^{\alpha}(s)\, g^{1-\alpha}(s)\right] d\mu(s) = \min_{g \in \mathcal{G}} \int_{\mathbb{R}^n} \left[H^{\alpha}(s) f^{\alpha}(s)\, g^{1-\alpha}(s)\right] d\mu(s).$$

For a given $\varepsilon_0 \in (0, 1)$, say $\varepsilon_0 = 1/2$, we define

$$U = \left\{ h : \mathbb{R} \to \mathbb{R}_+ \,:\, h \in L^1(\mathbb{R}) \right\}, \qquad U_0 = \left\{ h \in U \,:\, \left|\int_{\mathbb{R}} h(t)\, d\mu(t) - 1\right| < \varepsilon_0 \right\};$$

obviously, $U_0$ and $U_0^n$ are convex subsets of the Banach spaces $L^1(\mathbb{R})$ and $\left(L^1(\mathbb{R})\right)^n$, respectively, where

$$L^1(\mathbb{R}) = \left\{ \varphi : \mathbb{R} \to \mathbb{R} \,:\, \int_{\mathbb{R}} |\varphi(s)|\, d\mu(s) < \infty \right\}$$


is the space of absolutely integrable$^1$ functions. This is the most relaxed framework we can use, although it is possible to restrict our study to the square integrable functions $L^2(\mathbb{R})$.

The latter space has the advantage of being reflexive, therefore its unit sphere is relatively compact in the weak topology. As our search set $\mathcal{G}$ is a convex subset of the unit ball in $L^1(\mathbb{R}^n)$, we could direct the analysis towards a Weierstrass-type optimization of a continuous function on a compact set.

As our functional is not proven to be continuous, we focus on a different approach, based on the convexity and the critical points of the Lagrangian.

Problem (4.1) can be viewed as a functional minimization problem:

$$(4.2)\qquad \min_{g \in U_0^n} \int_{\mathbb{R}^n} \left[ H^{\alpha}(s) f^{\alpha}(s) \prod_{i=1}^n g_i^{1-\alpha}(s_i) \right] d\mu(s),$$

subject to

$$\int_{\mathbb{R}} g_i(s_i)\, d\mu(s_i) = 1, \qquad \forall\, i = 1, \ldots, n.$$

For the sake of simplicity, we introduce the following notations:

$$\Phi : U_0^n \to \mathbb{R}, \qquad \Psi : \left(L^1(\mathbb{R})\right)^n \to \mathbb{R}^n,$$

$$\Phi(g) = \int_{\mathbb{R}^n} \left[ H^{\alpha}(s) f^{\alpha}(s) \prod_{i=1}^n g_i^{1-\alpha}(s_i) \right] d\mu(s),$$

$$\Psi(g) = \left( \int_{\mathbb{R}} g_1(s_1)\, d\mu(s_1) - 1, \ldots, \int_{\mathbb{R}} g_n(s_n)\, d\mu(s_n) - 1 \right).$$

### 4.1.1 Convex optimization problem

Thus, problem (4.2) becomes

$$(4.3)\qquad \min_{g \in U_0^n} \Phi(g), \qquad \Psi(g) = 0.$$

This is a convex optimization problem with constraints. We first prove the convexity of the objective function.

LEMMA 1 For $\alpha > 1$, $\Phi$ is a convex functional on $U_0^n$.

$^1$We use the same notation $\mu$ for Lebesgue measures on different $\sigma$-algebras $\mathcal{B}(\mathbb{R}^p)$.

PROOF: It will suffice to show that the function $\varphi : \mathbb{R}_+^n \to \mathbb{R}$, $\varphi(x) = (x_1 \cdots x_n)^{1-\alpha}$, is convex.

To this end, let $x, y \in \mathbb{R}_+^n$ and $t \in (0, 1)$; first, from the concavity of $\ln(\cdot)$ (whose inequality is reversed after multiplication by $1-\alpha < 0$), one has:

$$\ln\left(\varphi[t x + (1-t) y]\right) = \ln\left(\prod_{i=1}^n [t x_i + (1-t) y_i]^{1-\alpha}\right) = (1-\alpha) \sum_{i=1}^n \ln[t x_i + (1-t) y_i] \le$$

$$\le (1-\alpha) \sum_{i=1}^n \left[t \ln x_i + (1-t) \ln y_i\right] = t \ln(\varphi(x)) + (1-t) \ln(\varphi(y)).$$

Therefore $\varphi(\cdot)$ is log-convex; now, we observe that $\varphi[t x + (1-t) y] \le t \varphi(x) + (1-t) \varphi(y)$ is equivalent with

$$(4.4)\qquad \ln\left(\varphi[t x + (1-t) y]\right) \le \ln\left[t \varphi(x) + (1-t) \varphi(y)\right].$$

Since $\varphi(\cdot)$ is log-convex, we have

$$(4.5)\qquad \ln\left(\varphi[t x + (1-t) y]\right) \le t \ln(\varphi(x)) + (1-t) \ln(\varphi(y))$$

and, by the concavity of $\ln(\cdot)$,

$$(4.6)\qquad t \ln(\varphi(x)) + (1-t) \ln(\varphi(y)) \le \ln\left(t \varphi(x) + (1-t) \varphi(y)\right).$$

Inequalities (4.5) and (4.6) combined give (4.4); hence, $\varphi(\cdot)$ is a convex function.

The Lagrange function of (4.3) is $L(g, \lambda) = \Phi(g) + \langle \lambda, \Psi(g) \rangle$, and it is a convex functional.

COROLLARY 1 For every $\lambda \in \mathbb{R}^n$, $L(\cdot, \lambda)$ is a convex function.

4.1. Minimizing the Rényi divergence

PROOF: Use Lemma 1 and the fact that $\Psi(\cdot)$ is affine.

It is easily seen that, for $0 < \alpha < 1$, similar arguments prove that $\Phi$ is a concave functional. Thus, in this case, problem (4.3) can be written as

$$(4.7)\qquad \min_{g \in U_0^n} -\Phi(g), \qquad \Psi(g) = 0.$$

As we said at the beginning of this section, we choose to study only the case $\alpha > 1$, the two cases being very similar. Therefore, in what follows, we shall assume that $\alpha > 1$ unless otherwise mentioned.

### 4.1.2 Lagrangian analysis

For every $h \in U^n$ and $g \in U_0^n$, there exists $t_0 > 0$ such that $g + t_0 h \in U_0^n$; $\Phi$ being convex, the function

$$t \mapsto \frac{\Phi(g + t h) - \Phi(g)}{t}$$

is monotone, therefore the directional derivatives of the above functionals can be calculated using the Lebesgue monotone convergence theorem:

$$D\Phi(g)(h) = \lim_{t \to 0} \frac{\Phi(g + t h) - \Phi(g)}{t} = (1-\alpha) \sum_{i=1}^n \int_{\mathbb{R}^n} \left[ H^{\alpha}(s) f^{\alpha}(s) g^{1-\alpha}(s) \frac{h_i(s_i)}{g_i(s_i)} \right] d\mu(s),$$

$$D\Psi(g)(h) = \lim_{t \to 0} \frac{\Psi(g + t h) - \Psi(g)}{t} = \left( \int_{\mathbb{R}} h_1(s_1)\, d\mu(s_1), \ldots, \int_{\mathbb{R}} h_n(s_n)\, d\mu(s_n) \right),$$

$$DL(g, \lambda)(h) = D\Phi(g)(h) + \langle \lambda, D\Psi(g)(h) \rangle.$$

Using Theorem 3.4 from [3], $g$ is a solution to (4.3) if and only if there exists $\lambda \in \mathbb{R}^n$ (the Lagrange multipliers) such that

$$(4.8)\qquad g \in \arg\min_{g \in U_0^n} L(g, \lambda) \quad \text{and} \quad \Psi(g) = 0_{\mathbb{R}^n}.$$

For a given $\lambda \in \mathbb{R}^n$, the first condition in (4.8) is equivalent with $0 \in \partial L(g, \lambda)$ – the subdifferential of $L(\cdot, \lambda)$. As $L(\cdot, \lambda)$ admits directional derivatives (at least) in every direction $h \in U^n$, a natural way to solve (4.8) is to find a solution to $DL(g, \lambda)(h) = 0$, $\forall\, h \in U^n$, or, equivalently,

$$(4.9)\qquad (\alpha - 1) \sum_{i=1}^n \int_{\mathbb{R}^n} \left[ H^{\alpha}(s) f^{\alpha}(s) g^{1-\alpha}(s) \frac{h_i(s_i)}{g_i(s_i)} \right] d\mu(s) = \sum_{i=1}^n \lambda_i \int_{\mathbb{R}} h_i(s_i)\, d\mu(s_i), \quad \forall\, h \in U^n.$$

For an index $1 \le i \le n$ and a vector $s = (s_1, s_2, \ldots, s_n) \in \mathbb{R}^n$, let us denote $s^i = (s_1, \ldots, s_{i-1}, s_{i+1}, \ldots, s_n)$.

LEMMA 2 If $g$ is a solution to (4.9), then, for every $1 \le i \le n$, we have

$$(4.10)\qquad g_i(s_i) = \frac{\displaystyle\int_{\mathbb{R}^{n-1}} \left[ H^{\alpha}(s) f^{\alpha}(s) g^{1-\alpha}(s) \right] d\mu(s^i)}{\displaystyle\int_{\mathbb{R}^n} \left[ H^{\alpha}(s) f^{\alpha}(s) g^{1-\alpha}(s) \right] d\mu(s)} \quad \text{a.e.,}$$

and, if $X_i$ is a random variable having density $g_i$ and $b : \mathbb{R} \to \mathbb{R}$ is a continuous function, then

$$(4.11)\qquad E_{g_i}[b(X_i)] = \frac{\displaystyle\int_{\mathbb{R}^n} \left[ H^{\alpha}(s) f^{\alpha}(s) g^{1-\alpha}(s)\, b(s_i) \right] d\mu(s)}{\displaystyle\int_{\mathbb{R}^n} \left[ H^{\alpha}(s) f^{\alpha}(s) g^{1-\alpha}(s) \right] d\mu(s)}.$$

PROOF: Equation (4.9) has the form:

$$(4.12)\qquad \sum_{i=1}^n \int_{\mathbb{R}^n} A(s)\, h_i(s_i)\, d\mu(s) = \sum_{i=1}^n \int_{\mathbb{R}} \lambda_i h_i(s_i)\, d\mu(s_i), \quad \forall\, h \in U^n,$$

where, for each $i$, $A(s)$ stands for the integrand $(\alpha - 1) H^{\alpha}(s) f^{\alpha}(s) g^{1-\alpha}(s) / g_i(s_i)$.

We will use in this proof the simple fact that, for a given measurable function $\psi : \mathbb{R} \to \mathbb{R}_+$,

$$\int_{\mathbb{R}} \psi(s)\, d\mu(s) = 0 \quad \text{if and only if} \quad \psi(s) = 0 \text{ almost everywhere}$$

(i.e., except, perhaps, for $s$ lying in a negligible set).

For a given $i$, let us choose $h_j \equiv 0$ for all $j \ne i$, and denote

$$\Delta^i_+ = \left\{ s_i : \int_{\mathbb{R}^{n-1}} A(s)\, d\mu(s^i) \ge \lambda_i \right\}, \qquad \Delta^i_- = \mathbb{R} \setminus \Delta^i_+.$$


We choose in (4.12), first,

$$h_i(s_i) = \left[ \int_{\mathbb{R}^{n-1}} A(s)\, d\mu(s^i) - \lambda_i \right]_+,$$

and, secondly,

$$h_i(s_i) = \left[ \int_{\mathbb{R}^{n-1}} A(s)\, d\mu(s^i) - \lambda_i \right]_-,$$

and we obtain

$$\int_{\Delta^i_+} \left( \int_{\mathbb{R}^{n-1}} A(s)\, d\mu(s^i) - \lambda_i \right)^2 d\mu(s_i) = 0$$

and, respectively,

$$\int_{\Delta^i_-} \left( \int_{\mathbb{R}^{n-1}} A(s)\, d\mu(s^i) - \lambda_i \right)^2 d\mu(s_i) = 0,$$

or, equivalently,

$$\int_{\mathbb{R}^{n-1}} A(s)\, d\mu(s^i) = \lambda_i \quad \text{a.e.}$$

From (4.12) we get

$$g_i(s_i) = \frac{\alpha - 1}{\lambda_i} \int_{\mathbb{R}^{n-1}} \left[ H^{\alpha}(s) f^{\alpha}(s) g^{1-\alpha}(s) \right] d\mu(s^i).$$

Moreover, if, for any given $i$, we choose $h_i = g_i$ and $h_j \equiv 0$ for all $j \ne i$, we get:

$$\lambda_i = (\alpha - 1) \int_{\mathbb{R}^n} \left[ H^{\alpha}(s) f^{\alpha}(s) g^{1-\alpha}(s) \right] d\mu(s).$$

From here (4.10) follows easily; on the other hand, (4.11) is a simple calculation.

### 4.2 Estimation of rare-events probabilities

In this section we suppose that $H(s) = 1_{[F(s) \ge a]}$, where $F$ is a Lebesgue measurable function, $a \in \mathbb{R}$, and $[F(s) \ge a]$ is a small-probability event (say, of probability at most $10^{-5}$).

We follow here the framework of a multistage procedure for the estimation of $g^*$ – this type of algorithm appears often in the literature (e.g. [4], [31], [45]).
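To see why a specially tuned IS density is needed at such probability levels, consider the crude Monte Carlo estimator of $\ell = P[F(S) \ge a]$: its relative error $\sqrt{(1-\ell)/(n\ell)}$ blows up as $\ell \to 0$, so on the order of $1/\ell$ samples are needed just to observe a single event. A small illustrative sketch (not the thesis algorithm; the names are ours):

```python
import random

def crude_mc_prob(F, sampler, a, n, rng=random):
    """Crude Monte Carlo estimate of P[F(S) >= a] from n i.i.d. draws.
    For a probability around 1e-5 the relative error sqrt((1-p)/(n*p))
    forces n far beyond what is practical, which motivates replacing the
    nominal density by a divergence-minimizing IS density g."""
    hits = sum(1 for _ in range(n) if F(sampler(rng)) >= a)
    return hits / n
```

For a moderate tail, e.g. $P[Z \ge 2]$ for a standard normal $Z$, the estimator still works; pushing the threshold further out makes the hit count, and hence the estimate, collapse to zero for any feasible $n$.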