## 8. Experiments

AEA 2021/2022

Based on *Algorithm Engineering: Bridging the Gap Between Algorithm Theory and Practice*, ch. 8.

## Content

- Experimentation
- Planning experiments
- Test Data Generation
- Setting-up and Running the Experiment
- Evaluating the data
- Graphical Analysis
- Statistical Analysis

## Experimentation

The analysis of an algorithm (the 2nd step of the AE process):

- worst-case, average-case, and experimental analysis; traditionally, the focus was on theoretical analysis
- recently, interest in experimental analysis has grown
- there are no well-established methods, generally accepted as standard, for empirical studies
- theoretical analysis cannot reveal all aspects of algorithmic behavior, especially for real-world applications

## Experimentation

- The analysis shows a bad worst-case behavior (on a small subset of instances), but the algorithm is much better in practice; example: the simplex algorithm
- A theoretically good algorithm is practically irrelevant due to huge hidden constants
  - Example: Robertson & Seymour's algorithm for testing whether a graph is a minor of another (an undirected graph H is a minor of G if H can be formed from G by deleting edges and vertices and by contracting edges)

## Experimentation

- The analysis is invalidated by experiments: the theoretically good behavior doesn't apply to practically relevant instances
- Example: the simplest algorithm for MST (Prim's) is the fastest in experiments, but doesn't have the best running time in theory (there exists a provably optimal deterministic comparison-based MST algorithm):
  - adjacency matrix: O(|V|^2)
  - binary heap and adjacency list: O(|E| log |V|)
  - Fibonacci heap and adjacency list: O(|E| + |V| log |V|)
- the study of data structures and algorithms can improve the implementation up to the point of drawing new conclusions
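The binary-heap variant listed above can be sketched in a few lines of Python (a minimal illustration, not the book's implementation; the adjacency-list format of `(weight, neighbor)` pairs is an assumption of this sketch):

```python
import heapq

def prim_mst_weight(adj, start=0):
    """Prim's algorithm with a binary heap over an adjacency list,
    O(|E| log |V|).  adj[u] is a list of (weight, v) pairs."""
    in_tree = [False] * len(adj)
    total, heap = 0, [(0, start)]   # (cost to connect vertex, vertex)
    while heap:
        w, u = heapq.heappop(heap)
        if in_tree[u]:
            continue                # stale heap entry, skip it
        in_tree[u] = True
        total += w
        for wv, v in adj[u]:
            if not in_tree[v]:
                heapq.heappush(heap, (wv, v))
    return total
```

Lazy deletion (skipping stale heap entries) keeps the code short at the cost of a slightly larger heap; this implementation detail is exactly the kind of choice an experiment can evaluate.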

## Experimentation

- Experimental analysis provides insights into the structure and properties of an algorithm that is hard to analyze theoretically
  - examples: simulated annealing, genetic algorithms, Union-Find with path compression
- The results of the experiments can be used in the next cycle of the Algorithm Engineering process

## The Experimentation Process

Experimentation: to answer a hypothesis/question using meaningful test data obtained by a reproducible process.

1. Define the goals of the experiment (type, primary goals, newsworthiness of the experiment)
2. Choose the measures of performance and factors to explore
3. Choose a good set of test instances (test data libraries)
4. Implement and execute the experiment
5. Analyze the data and draw conclusions (graphical/statistical methods to analyze the data)
6. Report the experiment's results (reproducibly)


## Planning experiments

In the planning step, define a clear set of objectives (questions and statements to verify) before implementing the algorithm and starting to collect data.

The motivation to perform an experiment:

- show the relevance of an algorithm for a specific application
- show the superiority of an algorithm compared with the existing ones
- compare the performance of competing algorithms
- improve existing algorithms
- show that an existing algorithm performs surprisingly badly
- analyze an algorithm/problem to better understand it (experimental analysis)
- support/reject/refine conjectures on problems and algorithms
- check for correctness, quality, and robustness
- develop refined models and optimization criteria

## Planning experiments

Verify the newsworthiness of the experiment:

- Does the performance of several algorithms differ, and do some of the algorithms clearly dominate the others?
- Does dominance hold for all instances or only for some subset? If so, what are the structural properties of this subset?
- What are the reasons for dominance? (structural properties of inputs, comparison of operation counts for critical operations)
- How much does each phase of the algorithm contribute to the running time/performance of the algorithm?

## Measures

Measures: quantities obtained by the execution of the experiment.

- Running time
- Space consumption
- Value/quality of the solution (heuristics and approximation algorithms)

The first two measures depend on the programming language, compiler, computer, and implementation style.

The choice of the test instances has a strong influence, especially on the value/quality of the solution in the case of heuristics.

It is unlikely that a good understanding of the problem and the algorithm emerges from these measures alone. They are aggregate measures that do not reveal much about the algorithm's behavior:

- e.g., they cannot discover bottleneck operations (operations that are performed repeatedly and dominate the running time)

## Measures

Measures to gain a deeper understanding:

- Extensions of running time; for heuristics: the time to find the best-found solution; for multi-phase algorithms: measure the time and the improvement in each phase
- Structural measures: no. of iterations, no. of calls to a crucial subroutine, no. of comparisons, no. of nodes in a search tree, memory references, data moves
- Bottleneck operation counts: count the no. of calls to a crucial subroutine / the no. of executions of a major sub-task; an asymptotic nonbottleneck operation is one whose fraction of the computation time becomes smaller and approaches 0 as the problem size increases
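As an illustration (a hypothetical example, not from the text), a sorting routine can be instrumented to count its comparisons — a structural, machine-independent measure:

```python
def insertion_sort_counting(a):
    """Insertion sort instrumented with a comparison counter, a typical
    bottleneck-operation count that is independent of the machine."""
    a = list(a)
    comparisons = 0
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0:
            comparisons += 1        # count the bottleneck operation
            if a[j] <= key:
                break
            a[j + 1] = a[j]         # shift larger element to the right
            j -= 1
        a[j + 1] = key
    return a, comparisons
```

Plotting the comparison count against the instance size reveals the quadratic behavior on adversarial inputs and the near-linear behavior on almost-sorted inputs, independently of compiler and hardware.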

## Other "measures"

- Robustness: the algorithm should perform well on a wide range of test instances
  - to estimate the robustness: the no. of solved instances of a benchmark library of hard instances
- Ease of implementation
- Scalability: the algorithm can deal with small and large data sets

## Measures

How to find good measures that help understanding an algorithm? Exploratory experimentation:

- How is the running time affected by implementation details, parameter settings, heuristics, data structures, instance size/structure, etc.?
- Is there a correlation between running time and the count of some operations?
- Find out the bottlenecks of the algorithm.
- How does the running time depend on the machine architecture? How does the algorithm perform compared with its competitors?

Profilers help to quickly find good structural measures.

## Factors and Sampling Points

Factors: properties of the experimental environment or setup that influence the result of the experiment (the parameters of the algorithm, the computing environment, etc.)

- Which factors have a major influence on the measures?
- The level of a factor: the value it is fixed at (e.g., the processor speed, the memory usage, the value of a configuration variable)

Define a sampling point: fix the factors at some levels.

- Which sampling points will be considered in the experiment? How many runs should be performed for each sampling point?
- Apply preliminary tests to find out which factors have a major influence; pick the ones we are interested in and their levels

## Factors and Sampling Points

DOE (Design Of Experiments): the careful design of experiments and the choice of factors and sampling points to allow a sound statistical analysis
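The simplest DOE layout is a full factorial design: every combination of the chosen factor levels becomes a sampling point. A minimal sketch (the factor names and levels below are invented for illustration):

```python
import itertools

def full_factorial(factors):
    """Enumerate all sampling points of a full factorial design:
    one point per combination of factor levels.
    factors maps a factor name to its list of levels."""
    names = sorted(factors)
    return [dict(zip(names, combo))
            for combo in itertools.product(*(factors[n] for n in names))]

# Example: 2 factors with 2 and 3 levels -> 2 * 3 = 6 sampling points
points = full_factorial({"heap": ["binary", "fibonacci"],
                         "size": [100, 1000, 10000]})
```

Full factorial designs grow multiplicatively in the number of levels, which is why preliminary tests that eliminate insignificant factors pay off.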

## Advanced Techniques

Methods that should be considered when planning an experiment:

- Simulation
- Simulation speedup: how to speed up the process of obtaining data (tests that run faster) → reduce variance
- Variance reduction techniques: reduced variance admits fewer test runs, speeding up the process of gathering data
- Differentiate between a real-world process (an application program running in a computing environment) and a mathematical model of a process (the underlying algorithm)
- To get a reliable value for a measure, repeat the test run several times and compute the mean; for measures with high variance, a high no. of runs is needed to get a reliable mean value with low deviation
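The last point can be sketched as a small measurement harness (a hypothetical helper, not from the text): repeat the run and report both mean and standard deviation, so a high-variance measure is visible immediately:

```python
import statistics
import time

def timed_mean(run, repetitions=30):
    """Repeat a zero-argument test run and return the mean and standard
    deviation of its wall-clock times; a high std dev relative to the
    mean signals that more repetitions are needed for a reliable value."""
    times = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run()
        times.append(time.perf_counter() - start)
    return statistics.fmean(times), statistics.stdev(times)
```

A common rule of thumb is to increase `repetitions` until the standard error of the mean drops below the differences one wants to detect.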


## Test Data Generation

- A good choice of test instances can result in meaningful conclusions drawn from the experiment
- Properties to have in mind:
  - Comparability: the results of experiments should be comparable to other experiments; the standard solution to assure comparability is to make the instances or their generator programs available
  - Measurability: how far the heuristic's solution is from an optimal solution
  - Purpose: show the potential of an algorithm (lots of instances) or the practicality in specific situations (restricted instances)
  - Portability: use a widely accepted or standard format (ex: the cnf format for SAT instances)
  - Variety, Quantity, Unbiasedness

## Types of Test Instances

Real-world instances

- Advantages
  - representative of real-world behavior (purpose)
  - allow evaluation of practical usefulness
- Disadvantages
  - only of bounded size (variety)
  - only few available (quantity)
  - sometimes proprietary (comparability)
  - lack of control of characteristics

## Types of Test Instances

Artificial instances: randomly generated by a generator program, given a list of parameters

- a possibility to overcome the main disadvantages of real-world instances
- ex: NETGEN (Network Generator for Combinatorial Graph Problems) generates network problem instances

- Advantages
  - arbitrary size available (variety)
  - arbitrary number available (quantity)
  - rarely proprietary (comparability)
  - ability to control characteristics
- Disadvantages
  - lack of realism (purpose)
  - difficult to evaluate real-world performance (purpose)
  - susceptible to unintended correlations and biases (unbiasedness)
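A toy instance generator shows how the properties above play out in practice (a sketch with invented parameter names, not NETGEN): a published seed makes the instances reproducible (comparability), while the parameters control size and characteristics:

```python
import random

def random_weighted_graph(n, m, max_weight=100, seed=0):
    """Seeded generator for artificial weighted-graph instances with n
    vertices and m simple edges (requires m <= n*(n-1)/2).  The same
    seed always reproduces the same instance, so publishing generator
    plus seed gives comparability; n, m and max_weight give control
    over the instance characteristics."""
    rng = random.Random(seed)
    edges = set()
    while len(edges) < m:
        u, v = rng.sample(range(n), 2)
        edges.add((min(u, v), max(u, v)))   # undirected, no duplicates
    return [(u, v, rng.randint(1, max_weight)) for u, v in sorted(edges)]
```

Note the unbiasedness caveat: uniform edge sampling like this produces graphs with characteristics (e.g. degree distribution) that may be quite unlike real-world networks.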

## Types of Test Instances

Perturbed real-world instances

- Advantages
  - better quantity than real-world instances
  - more realistic than artificial instances
- Disadvantages
  - variety comparable to real-world instances
  - less realistic than real-world instances
  - often hard to identify a meaningful perturbation

## Phase transitions

- From random 3-SAT experiments: the phase transition occurs when the ratio of clauses to variables is approximately 4.3.
- Theoretically, the best known lower and upper bounds on the threshold are 3.003 and 4.602, respectively.
- The existence of a threshold has been shown for random k-SAT (by proving that the width of the transition region narrows as the no. of variables increases).
- The phase transition phenomenon provably exists for Model RB as the no. of variables approaches ∞: there exist control parameters r, p with corresponding critical values r_cr, p_cr such that, for each fixed value r < r_cr or p < p_cr, a random CSP instance generated following Model RB is satisfiable with probability tending to 1 as the no. of variables approaches ∞, and when r > r_cr or p > p_cr, unsatisfiable with probability tending to 1.
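The 3-SAT phase transition can be reproduced in miniature with a brute-force experiment (a sketch; the brute-force check only scales to a handful of variables, so the transition is blurrier than at large n):

```python
import itertools
import random

def random_3sat(n_vars, n_clauses, rng):
    """Random 3-SAT: each clause picks 3 distinct variables, each
    negated with probability 1/2 (literal v positive, -v negated)."""
    return [tuple(v if rng.random() < 0.5 else -v
                  for v in rng.sample(range(1, n_vars + 1), 3))
            for _ in range(n_clauses)]

def is_satisfiable(n_vars, clauses):
    """Brute-force satisfiability check -- exponential, small n only."""
    return any(all(any(bits[abs(lit) - 1] == (lit > 0) for lit in c)
                   for c in clauses)
               for bits in itertools.product([False, True], repeat=n_vars))

def sat_fraction(n_vars, ratio, trials=20, seed=0):
    """Fraction of satisfiable instances at a clauses/variables ratio."""
    rng = random.Random(seed)
    m = round(ratio * n_vars)
    return sum(is_satisfiable(n_vars, random_3sat(n_vars, m, rng))
               for _ in range(trials)) / trials
```

Sweeping the ratio from well below 4.3 to well above it shows the satisfiable fraction dropping from near 1 to near 0, which is exactly the kind of exploratory plot this chapter advocates.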

## Test data libraries

- CSPLib - constraint satisfaction problems: http://csplib.org/
- MIPLIB - real-world mixed integer programs: http://miplib.zib.de/
- OR-Library - operations research problems: http://people.brunel.ac.uk/~mastjjb/jeb/info.html
- SATLIB - satisfiability of Boolean formulas: http://www.cs.ubc.ca/~hoos/SATLIB/
- TSPLIB - traveling salesman and related problems; SteinLib - Steiner tree problems in graphs; FAP web - frequency assignment problems; QAPLIB - quadratic assignment problem; PSPLIB - project scheduling problems; TPTP - thousands of problems for theorem provers; etc.


## Setting-up and Running the Experiment

- Setup phase: ensure you use reasonably efficient implementations, check your input data, use different platforms, use appropriate formats, use version control, use scripting
- Running phase: keep notes, record changes of factors, change only one thing at a time, run the experiment often enough and with appropriate data sets, look at the results


## Evaluate the data

Does the data support the working hypothesis? What else may be deduced?

- Look at the data without being biased by the working hypothesis; observe patterns in the data and try to explain them (this may involve a more detailed analysis and new experiments).
  - Example: B&B using a new pruning rule performs worse than using the old one
  - the reason might be poor pruning or too much time spent on the pruning
  - look at the no. of nodes visited and the fraction of the time spent on pruning
- Generate "raw" data: instead of averages / minimum / maximum, record all values and related quantities → large amounts of data, but it saves you from re-running time-consuming experiments

## Evaluate the data

- The significance of the findings is increased if you can provide explanations
  - Example: it is not only interesting which algorithm runs faster, but also where the respective running times come from.
- Look at more machine-independent running time measures (ex: nodes evaluated in B&B, improvement steps taken in an LS heuristic, or simply the no. of iterations). Investigate how these measures depend on instance size.


## Graphical Analysis

- Use diagrams and plots to discover patterns
  - they can represent vast amounts of data in a succinct way
  - a standard tool of statisticians for arriving at good hypotheses
- Graphical analysis can give some evidence for conclusions; it is also useful to communicate your results to other people.

A diagram plots some metric (e.g., running time) as a function of some parameter (e.g., the input size of the instance).

- The interpretation: there is a functional relation between the variables on the x-/y-axis, which may imply causality (ex: an increase in input size causes an increase in running time).

## Functional plot

To show the convergence of the algorithms: a functional plot, a time series where time is shown on the x-axis.

Ex: a functional plot showing the typical behavior of B&B algorithms.

Figure: Displayed are the upper and lower bounds relative to the optimal value, evolving with the no. of nodes of the tree that have been explored so far. Every time an improved solution is found, the upper bound drops and remains on this level for some time. Note: an optimal solution has been found long before the lower bound certifies its optimality.

## Functional plot

To give average values,

- provide error bars which indicate the std. dev. of the data, or
- use box plots, which characterize the underlying distribution, to depict more information about the range of the data.

## Scatter plot

To investigate the relationship of two variables (the graph is the set of points corresponding to the measurements); useful when it is unclear which of the variables is "independent" or "dependent".

- can be applied if the data are not ordered quantities
  - Example: it is not clear in which order to put the instances for which you measured the solution quality; use a scatter plot relating instance size to solution quality.

## Scatter plot - Example

A scatter plot can also be used to compare many different measurements, e.g., the performance of many TSP heuristics.

Figure: The approximation ratio and the running time of some heuristics for symmetric TSP. The data are averages over a set of 10,000-city random Euclidean instances. Each heuristic's label depicts its average excess over the Held-Karp lower bound.

## Bar chart

A bar chart consists of bars whose heights represent numerical quantities and are scaled proportionally. It is appropriate when multiple quantities need to be compared.

Figure: A bar chart for B&B that shows running time data for different settings used to solve the same instance. LB1 runs fast and gives weak lower bounds, whereas LB2 runs longer and gives stronger bounds. The better quality of the lower bounds provided by LB2 decreases the time spent searching. It only pays off to use LB2 if the heuristic is not used, since the total time is lowest if the heuristic and LB1 are used.

## Histograms

Used to analyze distributions of some quantity. The range of the quantity is divided into buckets (intervals of equal size), and for each bucket the percentage of values lying in this interval gives the height of the bar in a bar chart.

Figure: Comparison of waiting time distributions achieved by some vehicle dispatching algorithms. BestInsert and 2-OPT are heuristics based on LS; ZIBDIP is an exact algorithm based on IP techniques. The distribution of ZIBDIP is the one that is farthest left, indicating short waiting times.

## Box-plots (box-and-whisker diagrams)

Used in statistics to compare distributions. Box plots are based on quartiles (i.e., the quantiles a_0, a_{0.25}, a_{0.5}, a_{0.75}, a_1).

The (empirical) p-quantile, 0 ≤ p ≤ 1, of a sorted sample x_1 ≤ ... ≤ x_n is a_p := x_{⌈pn⌉}, the smallest value s.t. at least pn values of the sample are ≤ a_p.

Figure: All quartiles of ZIBDIP's distribution are smaller than the respective quartiles of the heuristics' distributions.
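The empirical quantile defined above is straightforward to compute (a sketch; the 1-indexed x_{⌈pn⌉} becomes a 0-indexed lookup):

```python
import math

def empirical_quantile(sample, p):
    """a_p := x_{ceil(p*n)} of the sorted sample: the smallest value
    such that at least p*n of the sample values are <= a_p."""
    xs = sorted(sample)
    k = max(1, math.ceil(p * len(xs)))   # clamp so p = 0 gives the minimum
    return xs[k - 1]

def quartiles(sample):
    """The quantiles a_0, a_0.25, a_0.5, a_0.75, a_1 a box plot is based on."""
    return [empirical_quantile(sample, p) for p in (0, 0.25, 0.5, 0.75, 1)]
```

Comparing the five quartiles of two samples numerically mirrors what the eye does when comparing their box plots.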

## Scales and Normalization

- Linear scales: appropriate if the numbers are in a relatively small range
- Logarithmic scales: useful if the numbers are of different orders of magnitude
  - if interested in the asymptotic behavior of the running times as a function of the instance size, use logarithmic axes (if instance sizes grow exponentially by setup)
  - if instance sizes grow additively by a constant, but the running times are known to be roughly exponential, use a logarithmic y-axis scale

Normalization: given a set of functions to be compared, if a lower bound for them is known, e.g., f_1(n), f_2(n) ∈ Ω(n), look at f_1'(n) := f_1(n)/n and f_2'(n) := f_2(n)/n instead of f_1, f_2, since the "common part" is factored out and differences become more visible.
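The normalization step can be sketched directly (a hypothetical helper; the n log n lower bound matches the example in the following slides):

```python
import math

def normalize(sizes, times, lower_bound=lambda n: n * math.log2(n)):
    """Divide each measured time by a common lower bound, so the
    'common part' is factored out and differences become visible.
    sizes and times are parallel lists of instance size and running time."""
    return [t / lower_bound(n) for n, t in zip(sizes, times)]
```

An asymptotically optimal algorithm yields a roughly flat normalized curve, while an asymptotically worse one keeps growing, which is exactly what diagram (d) in the example below makes visible.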

## Example

Figure: Running time for five algorithms, depending on the input size.

E and D are worse than the other three on large instances.

## Example

There is no clear picture for smaller instances (the plots coincide). Furthermore, almost all data points are in the left half of the diagram. Changing to a logarithmic x-scale (b) fixes this, but there is still much coincidence of the plots.

Figure: Effect of different choices for the scaling of the diagram.

## Example

If both axes are logarithmic (c), the plots seem to have approximately the same slope (the same asymptotic behavior); from the first diagram, this is not quite true.

There is a lower bound of Θ(n log n) for all algorithms: use it to normalize the running times (still keeping the logarithmic x-axis). Diagram (d) shows the consistent ranking, brings out the asymptotically worse running times of E and D, and indicates that the other three algorithms are asymptotically optimal (up to a constant).

Figure: In (c) and (d) the ordering of the algorithms is consistent. Only (d) reveals that the performance of D and E is asymptotically much worse than the lower bound, a fact that cannot be seen directly from the table.


## Statistical Analysis

- Gives much stronger support for claims, or indicates that claims are not justified by the data
- Gain a deeper understanding of the data
  - it is possible to analyze the impact of parameter choices on the performance of the algorithm, and to distinguish between significant and insignificant parameters
- May suggest directions for further experiments

## Confidence intervals

Most likely, for an estimator θ̂ of a population parameter θ, we have θ̂ ≠ θ due to a sampling error. How much can we trust the reported estimator?

Confidence intervals give a range of plausible values for a parameter and are based on sample data.

Example: we may be 95% confident that µ lies in the interval (0.2, 3.4).

Definition: An interval [a, b] is a (1 − α)·100% confidence interval for the parameter θ if it contains the parameter with probability 1 − α: P{a ≤ θ ≤ b} = 1 − α.

The coverage probability 1 − α is called the confidence level.

## Confidence interval

Given a sample of data and a desired confidence level 1 − α, how do we construct a confidence interval [a, b] that satisfies P{a ≤ θ ≤ b} = 1 − α?

Estimate the parameter θ. Assume there is an unbiased estimator θ̂ that has a normal distribution. When we standardize it, we get a standard normal variable

Z = (θ̂ − E(θ̂)) / σ(θ̂) = (θ̂ − θ) / σ(θ̂)

(the second equality holds because θ̂ is unbiased, i.e., E(θ̂) = θ).

## Confidence interval

This variable falls between the standard normal quantiles q_{α/2} and q_{1−α/2}, denoted by −z_{α/2} and z_{α/2}, with probability 1 − α.

If the parameter θ has an unbiased, normally distributed estimator θ̂, then θ̂ ± z_{α/2}·σ(θ̂), i.e., [θ̂ − z_{α/2}·σ(θ̂), θ̂ + z_{α/2}·σ(θ̂)], is a (1 − α)·100% confidence interval for θ.

## 1. Deriving a confidence interval for the mean µ

When sampling from a normally distributed population with a known value of σ:

The confidence interval for µ: X̄ ± margin of error, where X̄ is the sample mean.

## Confidence intervals

Suppose we draw a random sample of n independent observations from a normally distributed population.

X̄ is a normally distributed random variable with mean µ and standard deviation σ/√n.

Standardize X̄:

Z = (X̄ − µ) / (σ/√n)

Z has the standard normal distribution, so

P(−z_{α/2} ≤ (X̄ − µ) / (σ/√n) ≤ z_{α/2}) = 1 − α

## Confidence intervals

Example: α = 0.05 (z_{0.025} = 1.96)

## Confidence intervals

Isolate µ: P(X̄ − z_{α/2}·σ/√n < µ < X̄ + z_{α/2}·σ/√n) = 1 − α.

X̄ is the random variable; µ is a fixed, unknown quantity. A (1 − α)·100% confidence interval for µ is given by:

X̄ ± z_{α/2}·σ/√n

Example: for a 95% confidence level, α = 0.05, z_{0.025} = 1.96, and the interval is X̄ ± 1.96·σ/√n.
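A minimal sketch of this z-based interval (it assumes σ is known; the default z = 1.96 is z_{0.025}, i.e., 95% confidence):

```python
import math
import statistics

def z_confidence_interval(sample, sigma, z=1.96):
    """(1 - alpha)*100% CI for the mean mu when the population standard
    deviation sigma is known: X_bar +/- z_{alpha/2} * sigma / sqrt(n)."""
    mean = statistics.fmean(sample)
    margin = z * sigma / math.sqrt(len(sample))
    return mean - margin, mean + margin
```

Applied to, say, repeated running-time measurements, the returned interval quantifies how much the reported mean can be trusted.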

## 2. Confidence intervals for the population mean µ using the t procedure

Requirements: a simple random sample from a normally distributed population.

If the population standard deviation σ is not known:

X̄ ± t_{α/2}·s/√n

where s is the sample standard deviation and SE(X̄) = s/√n is the standard error of X̄ (the estimated standard deviation of the sampling distribution of X̄).

The t value is found from the t distribution with n − 1 degrees of freedom.

## Confidence intervals for the population mean µ using the t procedure

t_{α/2} is the t value that has an area of α/2 to its right.
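The t procedure looks the same with s in place of σ (a sketch; the t value is passed in as an argument because the Python standard library has no t-distribution quantile function — e.g., t_{0.025} with 3 degrees of freedom is about 3.182):

```python
import math
import statistics

def t_confidence_interval(sample, t_value):
    """CI for mu when sigma is unknown: X_bar +/- t_{alpha/2} * s / sqrt(n),
    where s is the sample standard deviation and t_value is the quantile
    of the t distribution with n - 1 degrees of freedom."""
    n = len(sample)
    mean = statistics.fmean(sample)
    margin = t_value * statistics.stdev(sample) / math.sqrt(n)
    return mean - margin, mean + margin
```

For small samples the t interval is wider than the z interval, reflecting the extra uncertainty from estimating σ by s.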