

8. Experiments

AEA 2021/2022

Based on: Algorithm Engineering: Bridging the Gap Between Algorithm Theory and Practice, Ch. 8



• Experimentation
• Planning experiments
• Test Data Generation
• Setting-up and Running the Experiment
• Evaluating the data
• Graphical Analysis
• Statistical Analysis



The analysis of an algorithm (the 2nd step of the AE process):

• worst-case, average-case, and experimental analysis; traditionally, the focus was on theoretical analysis
• recently, the interest in experimental analysis has grown
• there are no well-established methods, generally accepted as standard for empirical studies
• theoretical analysis cannot reveal all aspects of algorithmic behavior, especially for real-world applications



• The analysis shows bad worst-case behavior (caused by a small subset of instances), but the algorithm is much better in practice; example: the simplex algorithm
• A theoretically good algorithm is practically irrelevant due to huge hidden constants
  • Example: Robertson & Seymour's algorithm for testing whether a graph is a minor of another (an undirected graph H is a minor of G if H can be formed from G by deleting edges and vertices and by contracting edges)



• The analysis is invalidated by experiments: the theoretically good behavior doesn't apply to practically relevant instances
• Example: the simplest algorithm for MST (Prim's) is the fastest in experiments, but doesn't have the best running time in theory (there exists a provably optimal deterministic comparison-based MST algorithm). Prim's running time depends on the data structures used:
  • adjacency matrix: O(|V|²)
  • binary heap and adjacency list: O(|E| log |V|)
  • Fibonacci heap and adjacency list: O(|E| + |V| log |V|)
• the study of data structures and algorithms can improve the implementation up to the point of drawing new conclusions
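The binary-heap variant can be sketched in a few lines; this is an illustrative implementation (not the book's code) using Python's heapq, giving the O(|E| log |V|) bound for a connected, undirected graph:

```python
import heapq

def prim_mst_weight(adj, start=0):
    """Total MST weight via Prim's algorithm with a binary heap.
    adj[u] is a list of (neighbor, weight) pairs; runs in O(|E| log |V|)."""
    visited = [False] * len(adj)
    heap = [(0, start)]                  # (connecting edge weight, vertex)
    total = 0
    while heap:
        w, u = heapq.heappop(heap)
        if visited[u]:
            continue                     # skip stale heap entries
        visited[u] = True
        total += w
        for v, wv in adj[u]:
            if not visited[v]:
                heapq.heappush(heap, (wv, v))
    return total

# Triangle with edge weights 1, 2, 3: the MST keeps the two lightest edges.
triangle = [[(1, 1), (2, 3)], [(0, 1), (2, 2)], [(0, 3), (1, 2)]]
```

Replacing the heap with a linear scan over an adjacency matrix yields the O(|V|²) variant, which can win in practice on dense graphs, which is exactly the kind of effect experiments reveal.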



• Experimental analysis provides insights into the structure and properties of algorithms that are hard to analyze theoretically
  • examples: simulated annealing (SA), genetic algorithms (GA), Union-Find with path compression
• The results of the experiments can be used in the next cycle of the Algorithm Engineering process


The Experimentation Process

Experimentation: to answer a hypothesis/question using meaningful test data obtained by a reproducible process.

1. Define the goals of the experiment (type, primary goals, newsworthiness of the experiment)

2. Choose the measures of performance and the factors to explore

3. Choose a good set of test instances (test data libraries)

4. Implement and execute the experiment

5. Analyze the data and draw conclusions (graphical/statistical methods to analyze the data)

6. Report the experiment's results (reproducibly)
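A minimal driver for steps 4-6 might look as follows. This is a sketch, not from the book; the algorithm and instance choices are placeholders. Each run is logged as one CSV row and the random seed is fixed, so the raw data can be re-analyzed and the experiment reproduced:

```python
import csv
import io
import random
import time

def run_experiment(algorithms, instances, writer):
    """Run each algorithm on each instance; log one CSV row per run."""
    writer.writerow(["algorithm", "instance_size", "seconds"])
    for name, fn in algorithms.items():
        for inst in instances:
            start = time.perf_counter()
            fn(list(inst))               # copy, so runs do not interfere
            elapsed = time.perf_counter() - start
            writer.writerow([name, len(inst), f"{elapsed:.6f}"])

random.seed(42)                          # fixed seed: reproducible instances
instances = [[random.random() for _ in range(n)] for n in (100, 1000)]
out = io.StringIO()                      # in practice: a file per experiment
run_experiment({"timsort": sorted}, instances, csv.writer(out))
```

Keeping the raw per-run rows (rather than only aggregates) supports the later analysis steps without re-running anything.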





Planning experiments

In the planning step, define a clear set of objectives (questions to answer and statements to verify) before implementing the algorithm and starting to collect data.

The motivation to perform an experiment:

• show the relevance of an algorithm for a specific application
• show the superiority of an algorithm compared with the existing ones
• compare the performance of competing algorithms
• improve existing algorithms
• show that an existing algorithm performs surprisingly badly
• analyze an algorithm/problem to better understand it (experimental analysis)
• support/reject/refine conjectures on problems and algorithms
• check for correctness, quality, and robustness
• develop refined models and optimization criteria


Planning experiments

Verify the newsworthiness of the experiment:

• Does the performance of several algorithms differ, and do some of the algorithms clearly dominate the others?
• Does dominance hold for all instances or only for some subset? If so, what are the structural properties of this subset?
• What are the reasons for dominance? Structural properties of the inputs, comparison of operation counts for critical operations
• How much does each phase of the algorithm contribute to the running time/performance of the algorithm?



Measures: quantities obtained by the execution of the experiment.

• Running time
• Space consumption
• Value/quality of the solution (for heuristics and approximation algorithms)

The first two measures depend on the programming language, compiler, computer, and implementation style.

The choice of the test instances has a strong influence, especially on the value/quality of the solution in the case of heuristics.

It is unlikely that a good understanding of the problem and the algorithm emerges from these measures alone. They are aggregate measures that do not reveal much about the algorithm's behavior:

• ex: they cannot discover bottleneck operations (operations that are performed repeatedly and influence the running time the most)
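Running time, the most machine-dependent of these measures, is typically sampled with a wall-clock timer. A small sketch (not from the chapter) that takes the minimum over several repetitions to damp out background load:

```python
import time

def running_time(fn, arg, repetitions=5):
    """Minimum wall-clock time of fn(arg) over several repetitions;
    the minimum is less sensitive to background load than the mean."""
    best = float("inf")
    for _ in range(repetitions):
        start = time.perf_counter()
        fn(arg)
        best = min(best, time.perf_counter() - start)
    return best

# Example: time the built-in sort on a reversed list.
t = running_time(sorted, list(range(10_000, 0, -1)))
```

The measured value still depends on the machine and interpreter, which is precisely why structural measures (below) are needed as a complement.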



Measures to gain a deeper understanding:

• Extensions of running time; for heuristics: the time to find the best-found solution; for multi-phase algorithms: measure the time and the improvement in each phase
• Structural measures: no. of iterations, no. of calls to a crucial subroutine, no. of comparisons, no. of nodes in a search tree, memory references, data moves
• Bottleneck operation counts: count the no. of calls to a crucial subroutine / the no. of executions of a major sub-task; an asymptotic nonbottleneck operation is one whose fraction of the computation time becomes smaller and approaches 0 as the problem size increases
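As a sketch (not from the chapter), comparisons in an insertion sort can be counted alongside the result, giving a structural measure that is independent of machine, compiler, and system load:

```python
def insertion_sort_counting(data):
    """Sort a copy of data; return (sorted list, number of comparisons).
    The comparison count is a machine-independent structural measure."""
    a = list(data)
    comparisons = 0
    for i in range(1, len(a)):
        j = i
        while j > 0:
            comparisons += 1            # count the bottleneck operation
            if a[j - 1] > a[j]:
                a[j - 1], a[j] = a[j], a[j - 1]
                j -= 1
            else:
                break
    return a, comparisons
```

Plotting such counts against instance size often reveals the asymptotic behavior more cleanly than noisy wall-clock times.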


Other "measures":

• Robustness: the algorithm should perform well on a wide range of test instances
  • to estimate robustness: the no. of solved instances of a benchmark library of hard instances
• Ease of implementation
• Scalability: the algorithm can deal with both small and large data sets



How to find good measures that help in understanding an algorithm?

Exploratory experimentation:

• How is the running time affected by implementation details, parameter settings, heuristics, data structures, instance size/structure, etc.?
• Is there a correlation between the running time and the count of some operations?
• Find out the bottlenecks of the algorithm.
• How does the running time depend on the machine architecture? How does the algorithm perform compared with its competitors?

Profilers help to quickly find good structural measures.
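For example, Python's built-in cProfile shows per-function call counts and times, suggesting candidate structural measures; the workload below is an arbitrary stand-in:

```python
import cProfile
import io
import pstats

def inner(n):
    return sum(i * i for i in range(n))      # hot spot by construction

def workload():
    return [inner(1000) for _ in range(100)]

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Render the profile into a string, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
```

If the profile shows that `inner` dominates, its call count becomes a natural bottleneck operation count to record.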


Factors and Sampling Points

Factors: properties of the experimental environment or setup that influence the result of the experiment (the parameters of the algorithm, the computing environment, etc.)

• Which factors have a major influence on the measures?
• The level of a factor (the processor speed, the memory usage, the value of a configuration variable)

Define a sampling point: fix the factors at some levels.

• Which sampling points will be considered in the experiment? How many runs should be performed for each sampling point?
• Apply preliminary tests to find out which factors have a major influence; pick the ones we are interested in and their levels


Factors and Sampling Points

DOE (Design Of Experiments): the careful design of experiments and the choice of factors and sampling points to allow a sound statistical analysis


Advanced Techniques

Methods that should be considered when planning an experiment:

I Simulation

I Simulation speedup: how to speedup the process of obtaining data (tests that run faster) →reduce variance

I Variance reduction techniques: reduced variance admits fewer test runs, speeding up the process of gathering data

I Differentiate between a real-world process (an application program running in a computing environment) and a mathematical model of a process (the underlying algorithm) I To get a reliable value for a measure, repeat the test run

several times and compute the mean; for measures with high variance, a high no. of runs is needed to get a reliable mean value with low deviation
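The last point can be sketched with a small helper (not from the chapter): the standard error of the mean shrinks with √n, which quantifies how many repetitions a high-variance measure needs.

```python
import math
import statistics

def summarize_runs(measurements):
    """Mean, sample standard deviation, and standard error of the mean
    for repeated measurements at the same sampling point."""
    n = len(measurements)
    mean = statistics.mean(measurements)
    sd = statistics.stdev(measurements)
    return mean, sd, sd / math.sqrt(n)    # SEM: reliability of the mean

# Five hypothetical repetitions of the same measurement:
mean, sd, sem = summarize_runs([10.2, 9.8, 10.1, 9.9, 10.0])
```

Halving the standard error requires four times as many runs, which is why variance reduction pays off directly in data-gathering time.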





Test Data Generation

• A good choice of test instances can result in meaningful conclusions drawn from the experiment
• Properties to have in mind:
  • Comparability: the results of experiments should be comparable to other experiments
    • the standard solution to ensure comparability is to make the instances or their generator programs available
  • Measurability: how far the heuristic's solution is from an optimal solution
  • Purpose: show the potential of an algorithm (lots of instances) or the practicality in specific situations (restricted instances)
  • Portability: use a widely accepted or standard format (ex: the CNF format for SAT instances)
  • Variety, Quantity, Unbiasedness


Types of Test Instances

Real-world instances

• Advantages
  • representative of real-world behavior (purpose)
  • allow evaluation of practical usefulness
• Disadvantages
  • only of bounded size (variety)
  • only few available (quantity)
  • sometimes proprietary (comparability)
  • lack of control of characteristics


Types of Test Instances

Artificial instances: randomly generated by a generator program, given a list of parameters

• a possibility to overcome the main disadvantages of real-world instances
• ex: NETGEN (Network Generator for Combinatorial Graph Problems) generates network problem instances

Advantages

• arbitrary size available (variety)
• arbitrary number available (quantity)
• rarely proprietary (comparability)
• ability to control characteristics

Disadvantages

• lack of realism (purpose)
• difficult to evaluate real-world performance (purpose)
• susceptible to unintended correlations and biases



Types of Test Instances

Perturbed real-world instances

• Advantages
  • better quantity than real-world instances
  • more realistic than artificial instances
• Disadvantages
  • variety comparable to real-world instances
  • less realistic than real-world instances
  • often hard to identify a meaningful perturbation


Phase transitions

• From random 3-SAT experiments: the phase transition occurs when the ratio of clauses to variables is approximately 4.3.
• Theoretically, the best known lower and upper bounds are 3.003 and 4.602, respectively.
• The existence of a threshold has been proved for random k-SAT (by proving that the width of the transition region narrows as the no. of variables increases).
• The phase transition phenomenon does exist for Model RB as the no. of variables approaches ∞: there exist control parameters r, p and corresponding critical values r_cr, p_cr such that, for each fixed value r < r_cr or p < p_cr, a random CSP instance generated following Model RB is satisfiable with probability tending to 1 as the no. of variables approaches ∞, and, when r > r_cr or p > p_cr, unsatisfiable with probability tending to 1.


Test data libraries

• CSPLib - constraint satisfaction problems, http://csplib.org/
• MIPLIB - real-world mixed integer programs, http://miplib.zib.de/
• OR-Library - operations research problems, http://people.brunel.ac.uk/~mastjjb/jeb/info.html
• SATLIB - satisfiability of Boolean formulas
• TSPLIB - traveling salesman and related problems
• SteinLib - Steiner tree problems in graphs
• FAP web - frequency assignment problems
• QAPLIB - quadratic assignment problem
• PSPLIB - project scheduling problems
• TPTP - thousands of problems for theorem provers, etc.





Setting-up and Running the Experiment

• Setup phase: ensure you use reasonably efficient implementations, check your input data, use different platforms, use appropriate formats, use version control, use scripting
• Running phase: keep notes, document changes of factors, change only one thing at a time, run the experiment often enough and with appropriate data sets, look at the results





Evaluate the data

Does the data support the working hypothesis? What else may be deduced?

• Look at the data without being biased by the working hypothesis; observe patterns in the data and try to explain them (this may involve a more detailed analysis and new experiments).
  Example: B&B using a new pruning rule performs worse than using the old one
  • the reason might be poor pruning or too much time spent on pruning
  • look at the no. of nodes visited and the fraction of time spent on pruning.
• Generate "raw" data: instead of averages/minimum/maximum, record all values and related quantities → large amounts of data, but it saves you from re-running time-consuming experiments


Evaluate the data

• The significance of the findings is increased if you can provide explanations.
  Example: it is not only interesting which algorithm runs faster, but also where the respective running times come from.
• Look at more machine-independent running time measures (ex: nodes evaluated in B&B, improvement steps taken in a local search heuristic, or simply the no. of iterations). Investigate how these measures depend on the instance size.





Graphical Analysis

• Use diagrams and plots to discover patterns
  • they can represent vast amounts of data in a succinct way
  • a standard tool of statisticians for arriving at good hypotheses
• Graphical analysis can give some evidence for conclusions; it is useful for communicating your results to other people.

A diagram plots some metric (e.g., running time) as a function of some parameter (e.g., the input size of the instance).

• The interpretation: there is a functional relation between the variables on the x-/y-axes, which may imply causality (ex: an increase in input size causes an increase in running time).


Functional plot

To show the convergence of the algorithms: a functional plot, a time series where time is shown on the x-axis.

Ex: a functional plot showing the typical behavior of B&B algorithms.

Figure: Displayed are the upper and lower bounds relative to the optimal value, evolving with the no. of nodes of the tree that have been explored so far. Every time an improved solution is found, the upper bound drops and remains at this level for some time. Note: an optimal solution has been […]


Functional plot

To give average values,

• provide error bars which indicate the std. dev. of the data, or
• use box plots which characterize the underlying distribution, to depict more information about the range of the data.


Scatter plot

To investigate the relationship of two variables (the graph is the set of points corresponding to the measurements); useful when it's unclear which of the variables is "independent" or "dependent".

• can be applied if the data are not ordered quantities
  Example: it's not clear in which order to put the instances for which you measured the solution quality; use a scatter plot relating instance size to solution quality.


Scatter plot - Example

A scatter plot can also be used to compare many different measurements, e.g., the performance of many TSP heuristics.

Figure:The approximation ratio and the running time of some heuristics for symmetric TSP. The data: averages over a set of 10,000-city random Euclidean instances. Each heuristic’s label depicts its average excess over the Held-Karp lower bound.


Bar chart

A bar chart consists of bars whose heights represent numerical quantities and are scaled proportionally. It is appropriate when multiple quantities need to be compared.

Figure:A bar chart for B&B that shows running time data for different settings used to solve the same instance. LB1 runs fast and gives weak lower bounds, whereas LB2 runs longer and gives stronger bounds. A better quality of the lower bounds provided by method 2 decreases the time spent for searching. It only pays off to use LB2 if the heuristic is not used, since the total time is lowest if the heuristic and LB1 are used.



Histogram

Used to analyze distributions of some quantity. The range of the quantity is divided into buckets (intervals of equal size), and for each bucket the percentage of values lying in that interval gives the height of the bar in a bar chart.

Figure: Comparison of waiting time distributions achieved by some vehicle dispatching algorithms. BestInsert and 2-OPT are heuristics based on local search; ZIBDIP is an exact algorithm based on IP techniques. The distribution of ZIBDIP is the one that is farthest left, indicating short waiting times.
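A histogram of this kind boils down to a simple bucketing computation; a sketch with a hypothetical helper (assuming the quantity has a nonzero range):

```python
def histogram_percentages(values, buckets):
    """Split [min, max] into equal-width buckets and return, per bucket,
    the percentage of values falling into it (the bar heights)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        i = min(int((v - lo) / width), buckets - 1)  # clamp max into last bucket
        counts[i] += 1
    return [100.0 * c / len(values) for c in counts]
```

The bucket heights can then be fed to any bar-chart routine; the percentages always sum to 100.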


Box-plots (box-and-whisker diagrams)

Used in statistics to compare distributions. They are based on quartiles (i.e., the quantiles a_0, a_0.25, a_0.5, a_0.75, a_1).

The (empirical) p-quantile, 0 ≤ p ≤ 1, of a sample x_1, ..., x_n is a_p := x_⌈pn⌉, the smallest value such that at least p·n values of the sample are at most a_p.

Figure: All quartiles of ZIBDIP's distribution are smaller than the respective quartiles of the heuristics' distributions.
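The definition a_p := x_⌈pn⌉ translates directly into code (an illustrative helper, not from the chapter):

```python
import math

def empirical_quantile(sample, p):
    """Empirical p-quantile a_p := x_ceil(p*n) (1-indexed) of the sorted
    sample: the smallest value with at least p*n sample values <= a_p."""
    xs = sorted(sample)
    k = max(1, math.ceil(p * len(xs)))   # clamp p = 0 to the first value
    return xs[k - 1]

sample = [7, 1, 5, 3, 9]
quartiles = [empirical_quantile(sample, p) for p in (0, 0.25, 0.5, 0.75, 1)]
```

These five values are exactly what a box plot displays: the whisker ends, the box edges, and the median.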


Scales and Normalization

• Linear scales: appropriate if the numbers are in a relatively small range
• Logarithmic scales: useful if the numbers are of different orders of magnitude
  • if interested in the asymptotic behavior of the running times as a function of the instance size, use logarithmic axes (if the instance sizes grow exponentially by setup)
  • if the instance sizes grow additively by a constant, but the running times are known to be roughly exponential, use a logarithmic y-axis scale

Normalization: given a set of functions to be compared, if a lower bound for them is known, say f_1(n), f_2(n) ∈ Ω(n), look at f_1'(n) := f_1(n)/n and f_2'(n) := f_2(n)/n instead of f_1, f_2, since the "common part" is factored out and the differences become more visible.
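A tiny numeric sketch of this normalization (the running-time data is made up for illustration, with an n log n lower bound as in the example below):

```python
import math

sizes = [2 ** k for k in range(10, 15)]
# Hypothetical running times: f1 matches the Omega(n log n) lower bound,
# f2 carries an extra log factor.
f1 = [1e-7 * n * math.log2(n) for n in sizes]
f2 = [1e-7 * n * math.log2(n) ** 2 for n in sizes]

# Normalize by the common lower bound n * log2(n):
norm1 = [t / (n * math.log2(n)) for t, n in zip(f1, sizes)]
norm2 = [t / (n * math.log2(n)) for t, n in zip(f2, sizes)]
# norm1 is flat (asymptotically optimal); norm2 keeps growing.
```

On a plot, the flat normalized curve signals asymptotic optimality up to a constant, while any growth reveals a worse asymptotic order.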



Figure: Running time of five algorithms, depending on the input size.

(a) E and D are worse than the other three on large instances. There is no clear picture for smaller instances (the plots coincide). Furthermore, almost all data points are in the left half of the diagram. Changing to a logarithmic x-scale (b) fixes this, but there is still much coincidence of the plots.

Figure: Effect of different choices for the scaling of the diagram.



If both axes are logarithmic (c), the plots seem to have approximately the same slope (the same asymptotic behavior); from the first diagram, this is not quite true.

There is a lower bound of Θ(n log n) for all algorithms: use it to normalize the running times (still keeping the logarithmic x-axis). (d) shows the consistent ranking, brings out the asymptotically worse running times of E and D, and indicates that the other three algorithms are asymptotically optimal (up to a constant).

Figure: In (c) and (d), the ordering of the algorithms is consistent. Only (d) reveals that the performance of D and E is asymptotically much worse than the lower bound, a fact that cannot be seen directly from the table.





Statistical Analysis

• Gives much stronger support for claims, or indicates that the claims are not justified by the data.
• Gain a deeper understanding of the data
  • it's possible to analyze the impact of parameter choices on the performance of the algorithm, and to distinguish between significant and insignificant parameters.
• May suggest directions for further experiments.


Confidence intervals

Most likely, for an estimator θ̂ of a population parameter θ, we have θ̂ ≠ θ due to sampling error. How much can we trust the reported estimator?

Confidence intervals give a range of plausible values for a parameter and are based on sample data.

Example: we may be 95% confident that µ lies in the interval (0.2, 3.4).

An interval [a, b] is a (1−α)·100% confidence interval for the parameter θ if it contains the parameter with probability (1−α): P{a ≤ θ ≤ b} = 1−α.

The coverage probability (1−α) is called a confidence level.


Confidence interval

Given a sample of data and a desired confidence level (1−α), how do we construct a confidence interval [a, b] that satisfies P{a ≤ θ ≤ b} = 1−α?

Estimate the parameter θ. Assume there is an unbiased estimator θ̂ that has a Normal distribution. Standardizing it gives a Standard Normal variable:

Z = (θ̂ − E(θ̂)) / σ(θ̂) = (θ̂ − θ) / σ(θ̂)


Confidence interval

This variable falls between the Standard Normal quantiles q_{α/2} and q_{1−α/2}, denoted by −z_{α/2} and z_{α/2}, with probability (1−α).

If the parameter θ has an unbiased, Normally distributed estimator θ̂, then

θ̂ ± z_{α/2}·σ(θ̂) = [θ̂ − z_{α/2}·σ(θ̂), θ̂ + z_{α/2}·σ(θ̂)]

is a (1−α)·100% confidence interval for θ.


1. Deriving a confidence interval for the mean µ

When sampling from a normally distributed population with a known value of σ:

The confidence interval for µ: X̄ ± margin of error, where X̄ is the sample mean.


Confidence intervals

Suppose we draw a random sample of n independent observations from a normally distributed population.

X̄ is a normally distributed random variable with mean µ and standard deviation σ/√n.

Standardize X̄:

Z = (X̄ − µ) / (σ/√n)

Z has the standard normal distribution, so

P(−z_{α/2} ≤ (X̄ − µ)/(σ/√n) ≤ z_{α/2}) = 1 − α


Confidence intervals

Example: α = 0.05 (z_{0.025} = 1.96)


Confidence intervals

Isolate µ: P(X̄ − z_{α/2}·σ/√n < µ < X̄ + z_{α/2}·σ/√n) = 1 − α.

X̄ is the random variable; µ is a fixed, unknown quantity.

A (1−α)·100% confidence interval for µ is given by:

X̄ ± z_{α/2}·σ/√n

Example: for a 95% confidence level, α = 0.05, z_{0.025} = 1.96, and the interval is X̄ ± 1.96·σ/√n.

2. Confidence intervals for the population mean µ using the t procedure

Requirements: a simple random sample from the population and a normally distributed population.

If the population standard deviation σ is not known:

X̄ ± t_{α/2}·s/√n

where s is the sample standard deviation and SE(X̄) = s/√n is the standard error of X̄ (the estimated standard deviation of the sampling distribution of X̄).

The t value is found from the t distribution with n−1 degrees of freedom.


Confidence intervals for the population mean µ using the t procedure

t_{α/2} is the t value that has an area of α/2 to its right.
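A sketch of the t procedure (the sample is made up; since the Python standard library has no t distribution, the critical value t_{0.025, 9} = 2.262 is taken from a t table):

```python
import math
import statistics

def t_confidence_interval(sample, t_crit):
    """CI for the mean with unknown sigma:
    xbar +/- t_{alpha/2, n-1} * s / sqrt(n); t_crit from a t table."""
    n = len(sample)
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)          # sample standard deviation
    margin = t_crit * s / math.sqrt(n)
    return xbar - margin, xbar + margin

# Ten hypothetical measurements; df = 9, 95% confidence level:
sample = [9.9, 10.1, 10.0, 9.8, 10.2, 10.0, 9.9, 10.1, 10.0, 10.0]
lo, hi = t_confidence_interval(sample, t_crit=2.262)
```

For the same data, the t interval is wider than the z interval, reflecting the extra uncertainty from estimating σ by s.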


