### 8. Experiments

AEA 2021/2022


### Content

- Inferential statistics
- Parametric Tests
- Non-Parametric Tests

### Inferential statistics

Hypothesis testing is a general tool to analyze the data.

H_0 — the null hypothesis
H_A — the alternative hypothesis

A statistical test is characterized by a null hypothesis, assumptions on the experiment (i.e., how the data is generated), and a test statistic (a number computed from the data).

The purpose of the test is to check whether some data is consistent with the null hypothesis or not.

If the null hypothesis is not consistent with the data, it is rejected and there is some evidence that the research hypothesis is true.


### Statistical hypothesis testing

Steps:

- State the null hypothesis H_0 and the alternative hypothesis H_A
- Select a significance level α (a threshold below which the null hypothesis will be rejected)
- Identify the test statistic
- Identify the critical values (using the distribution of the test statistic and the significance level)
- Construct the critical region (where the null hypothesis is rejected)
- Take the decision: reject the null hypothesis if the computed value is in the critical region

### Hypothesis testing

Example: verify that the average connection speed is 54 Mbps.

H_0: µ = 54
H_A: µ ≠ 54, where µ is the average speed of all connections
(two-sided alternative)

If we worry about a low connection speed only, we can conduct a one-sided test:

H_0: µ = 54
H_A: µ < 54

### Hypothesis testing

An alternative of the type H_A: µ ≠ µ_0, covering regions on both sides of the hypothesis H_0: µ = µ_0, is a two-sided alternative.

The alternative H_A: µ < µ_0, covering the region to the left of H_0, is one-sided, left-tail.

The alternative H_A: µ > µ_0, covering the region to the right of H_0, is one-sided, right-tail.

### Types of errors

Result of the test:

|              | Reject H_0   | Accept H_0    |
|--------------|--------------|---------------|
| H_0 is true  | Type I error | correct       |
| H_0 is false | correct      | Type II error |

Sampling errors:

- A type I error occurs when we reject a true null hypothesis.
- A type II error occurs when we accept a false null hypothesis.

The probability of a type I error is the significance level of the test,
α = P{reject H_0 | H_0 is true}.

### Level α test

Hypothesis testing is based on a test statistic T, a quantity computed from the data, that has some known, tabulated distribution F_0 if the hypothesis H_0 is true.

The null distribution F_0: the distribution of the test statistic T when the hypothesis H_0 is true.

Accept H_0 if the test statistic T belongs to the acceptance region.

### Statistical tests

If normality and equal variances are not guaranteed, use non-parametric tests.



### Parametric tests

- applied when the shape of the distribution is known
- variants:
  - for a population (ex: hypothesis about the mean/variance of a population)
  - for two populations (relation between the means)
  - for more than two populations
- examples: Z-test (hypothesis about the mean of a population with normal distribution, known variance), T-test (unknown variance and the sample size is not large)

### 0. Z-test

The null distribution of the test statistic is Standard Normal.

The test statistic:

Z = (θ̂ − E(θ̂)) / s(θ̂), where s(θ̂) = √Var̂(θ̂) is the estimated standard deviation of θ̂

- Z-test for means: used when we know the population variance, or when the sample size is large

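As a minimal sketch (Python, stdlib only; the helper name `z_test` and the illustrative sample below are ours, not from the slides), a two-sided Z-test for a mean with known population standard deviation:

```python
from math import sqrt
from statistics import NormalDist, mean

def z_test(sample, mu0, sigma):
    """Two-sided Z-test of H0: mu = mu0 when the population stdev sigma is known."""
    n = len(sample)
    z = (mean(sample) - mu0) / (sigma / sqrt(n))   # standardized test statistic
    p = 2 * (1 - NormalDist().cdf(abs(z)))         # two-sided P-value under Normal(0,1)
    return z, p

# Hypothetical sample: 25 measurements averaging 55 Mbps, sigma assumed to be 2
z, p = z_test([55.0] * 25, 54, 2)
```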

### 1. t-test (unknown stdev)

T-statistic:

t = (θ̂ − E(θ̂)) / s(θ̂), where s(θ̂) = √Var̂(θ̂)

When the distribution of θ̂ is Normal, the test is based on Student's T-distribution (with acceptance and rejection regions according to the direction of H_A):

- For a right-tail alternative: reject H_0 if t ≥ t_α; accept H_0 if t < t_α
- For a left-tail alternative: reject H_0 if t ≤ −t_α; accept H_0 if t > −t_α
- For a two-sided alternative: reject H_0 if |t| ≥ t_{α/2}; accept H_0 if |t| < t_{α/2}


### t-test

When two populations have equal variances, σ²_X = σ²_Y = σ², the estimator of σ² is the pooled sample variance:

s_p² = [ Σ_{i=1}^{n} (X_i − X̄)² + Σ_{j=1}^{m} (Y_j − Ȳ)² ] / (n + m − 2) = [ (n−1)s_X² + (m−1)s_Y² ] / (n + m − 2)

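The formula translates directly into code. A small sketch, assuming equal population variances as above (the helper name `pooled_variance` is ours):

```python
from statistics import variance

def pooled_variance(x, y):
    """Pooled estimator of the common variance sigma^2 from two samples."""
    n, m = len(x), len(y)
    return ((n - 1) * variance(x) + (m - 1) * variance(y)) / (n + m - 2)

# Toy data: s_X^2 = 1, s_Y^2 = 20/3, so s_p^2 = (2*1 + 3*20/3) / 5 = 4.4
sp2 = pooled_variance([1, 2, 3], [4, 6, 8, 10])
```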

### a. t-test for a population mean (variance unknown)

Used when σ² is not known and the population is normal.

H_0: µ = µ_0
H_A: µ ≠ µ_0

Test statistic: t = (x̄ − µ_0) / (s/√n), where s² = Σ(x_i − x̄)² / (n − 1).


### t-test - Critical region

- For a two-sided alternative: R = (−∞, t_{n−1,1−α/2}) ∪ (t_{n−1,α/2}, ∞)
- For a right-tail alternative: R = (t_{n−1,α}, ∞)
- For a left-tail alternative: R = (−∞, t_{n−1,1−α})


### Example 1: unauthorized use of a computer account

A long-time authorized user of the account takes about 0.2 seconds between keystrokes. The following times between keystrokes were recorded when a user typed the username and password:

.24, .22, .26, .34, .35, .32, .33, .29, .19, .36, .30, .15, .17, .28, .38, .40, .37, .27 seconds.

At a 5% level of significance, is this an evidence of an unauthorized attempt?

Test: H_0: µ = 0.2 vs H_A: µ ≠ 0.2. Significance level α = 0.05.

The sample statistics: n = 18, X̄ = 0.29, s = 0.074.


### Example 1: unauthorized use of a computer account

Compute the T-statistic:

t = (X̄ − 0.2) / (s/√n) = (0.29 − 0.2) / (0.074/√18) = 5.16

The rejection region: R = (−∞, −2.11] ∪ [2.11, ∞) (we used the T-distribution with 18 − 1 = 17 degrees of freedom and α/2 = 0.025 because of the two-sided alternative).

Since t ∈ R, we reject the null hypothesis and conclude that there is significant evidence of unauthorized use of that account.

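The slide's arithmetic can be reproduced with a few lines of Python (stdlib only):

```python
from math import sqrt
from statistics import mean, stdev

# Seconds between keystrokes, as recorded on the slide
times = [.24, .22, .26, .34, .35, .32, .33, .29, .19, .36,
         .30, .15, .17, .28, .38, .40, .37, .27]

# One-sample T-statistic for H0: mu = 0.2
t = (mean(times) - 0.2) / (stdev(times) / sqrt(len(times)))
# t is about 5.16, far inside the rejection region |t| >= 2.11
```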

### Table of Student's T-distribution

(table not reproduced here)

### b. t-test for comparing means of two populations

- Equal variances
  - for small sample sizes, the hypothesis of normality of the populations is required
- Unequal variances
  - estimate the degrees of freedom ν of a T-distribution that is "closest" to t

Satterthwaite approximation:

ν = (s_X²/n + s_Y²/m)² / [ s_X⁴/(n²(n−1)) + s_Y⁴/(m²(m−1)) ]

If this number of degrees of freedom is non-integer, take the closest integer ν.


### Example 2: Comparison of two servers

An account on serverA is more expensive than an account on serverB. However, serverA is faster. To see if it’s optimal to go with the faster but more expensive server, a manager needs to know how much faster it is.

A certain computer algorithm is executed 30 times on server A and 20 times on server B, with the following results:

|                           | Server A | Server B |
|---------------------------|----------|----------|
| Sample mean               | 6.7 min  | 7.5 min  |
| Sample standard deviation | 0.6 min  | 1.2 min  |

Is server A faster?


### Example 2: Comparison of two servers

Test H_0: µ_X = µ_Y vs H_A: µ_X < µ_Y. Significance level α = 0.05.

n = 30, m = 20, X̄ = 6.7, Ȳ = 7.5, s_X = 0.6, and s_Y = 1.2.

This is the case of unknown, unequal standard deviations.

Use the Satterthwaite approximation to find the number of degrees of freedom:

ν = ( (0.6)²/30 + (1.2)²/20 )² / [ (0.6)⁴/(30²·29) + (1.2)⁴/(20²·19) ] = 25.4

Reject the null hypothesis if t ≤ −1.708.

t = (6.7 − 7.5) / √( (0.6)²/30 + (1.2)²/20 ) = −2.7603 ∈ R

Reject H_0 and conclude that there is evidence that server A is faster.

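Likewise, the Satterthwaite and t computations can be checked in Python:

```python
from math import sqrt

n, m = 30, 20
sx, sy = 0.6, 1.2                     # sample standard deviations from the slide
xbar, ybar = 6.7, 7.5

se2 = sx**2 / n + sy**2 / m           # squared standard error of the difference
nu = se2**2 / (sx**4 / (n**2 * (n - 1)) + sy**4 / (m**2 * (m - 1)))
t = (xbar - ybar) / sqrt(se2)
# nu is about 25.4 and t about -2.76, below the critical value -1.708
```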

### Inferential statistics

1. Choose the significance level 0 < α < 1, the probability of a type I error you are willing to accept (ex: α = 0.05 accepts a 5% error).

2. Then compute the test statistic for the data.

3. If thep-value of the statistic is smaller than the significance level, the null hypothesis is rejected.


### P-value

How do we choose the significance levelα?

The P-value is the lowest significance level α that forces rejection of the null hypothesis.

The P-value is also the highest significance level α that forces acceptance of the null hypothesis.

Testing hypotheses with a P-value:

- For α > P, reject H_0
- For α < P, accept H_0

Practically:

- If P < 0.01, reject H_0
- If P > 0.1, accept H_0

Only if the P-value falls between 0.01 and 0.1 do we have to think about the level of significance.


### Computing P-values

The P-value is the probability of observing a test statistic at least as extreme as t_obs.

F_ν is the cumulative distribution function of the T-distribution with ν degrees of freedom.

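A sketch of the "at least as extreme" computation. The standard library has no T-distribution CDF, so this uses the standard Normal as the large-sample stand-in for F_ν (the function and its signature are our own illustration, not from the slides):

```python
from statistics import NormalDist

def p_value(t_obs, alternative="two-sided"):
    """P-value of an observed statistic under a standard Normal null distribution."""
    F = NormalDist().cdf
    if alternative == "right":
        return 1 - F(t_obs)            # P{T >= t_obs}
    if alternative == "left":
        return F(t_obs)                # P{T <= t_obs}
    return 2 * (1 - F(abs(t_obs)))     # two-sided: both tails
```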


### Non-Parametric Tests

- Non-parametric statistics do not assume any particular distribution
- They are less powerful (the less you assume about the data, the less you can find out from it)
- Having fewer requirements, they are applicable to a wider range of problems
- Variants: chi-squared test (verify a hypothesis about a population distribution), independence/association tests (Fisher, chi-squared), comparisons of nominal/ordinal characteristics (sign test, rank test, etc.)


### 2. The sign test

A sample (x_1, ..., x_n) of n real numbers. The assumption: all the x_i are drawn independently from the same distribution.

The null hypothesis H_0: the median M = m.

Test against a one-sided or a two-sided alternative: H_A: M < m, H_A: M > m, or H_A: M ≠ m.

We are testing whether exactly half of the population is below m and half is above m.


### The sign test

Compute the test statistic S := |{i | x_i > m}|.

If the null hypothesis is true, the probability that x_i is greater than m is the same as the probability that it is smaller than m (both 1/2). Therefore, S follows a binomial distribution with parameters p = 1/2 and n.

Suppose the observed value of S is k; w.l.o.g. k ≥ n/2. Compute the P-value, the probability that S is at least k:

P{S ≥ k} = (1/2^n) Σ_{i=k}^{n} C(n, i)

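With `math.comb`, the exact tail probability is a one-liner (the helper name is ours):

```python
from math import comb

def sign_test_upper_p(n, k):
    """Exact P{S >= k} when S ~ Binomial(n, 1/2)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n
```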


### The sign test

Example: if n = 15 and k = 12, we get a P-value of 0.018, so the null hypothesis is rejected at a significance level of α = 0.05.

Instead, we have some evidence that the real median is greater than 0.

- we couldn't conclude this if we selected α = 0.01


### Example 1: Unauthorized use of a computer account

Times between keystrokes: .24, .22, .26, .34, .35, .32, .33, .29, .19, .36, .30, .15, .17, .28, .38, .40, .37, .27 seconds. The account owner usually makes 0.2 sec between keystrokes.

We had to assume a Normal distribution of the data (for the T-test).

The histogram does not confirm this assumption.


### Example 1: Unauthorized use of a computer account

The sign test: H_0: M = 0.2 vs H_A: M ≠ 0.2.

The test statistic: S_obs = 15 (15 of the 18 recorded times exceed 0.2).

From the Binomial distribution table with n = 18 and p = 0.5, find the P-value:

P = 2 min(P{S ≤ S_obs}, P{S ≥ S_obs}) = 2 min(0.9993, 0.0038) = 0.0076

The sign test rejects H_0 at any α > 0.0076, which is evidence that the account was used by an unauthorized person.

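The same numbers can be reproduced exactly in Python:

```python
from math import comb

# Seconds between keystrokes, as on the slide
times = [.24, .22, .26, .34, .35, .32, .33, .29, .19, .36,
         .30, .15, .17, .28, .38, .40, .37, .27]
n, m0 = len(times), 0.2
s_obs = sum(1 for x in times if x > m0)   # number of observations above 0.2

# Exact binomial tails under H0: S ~ Binomial(18, 1/2)
upper = sum(comb(n, i) for i in range(s_obs, n + 1)) / 2**n   # P{S >= s_obs}
lower = sum(comb(n, i) for i in range(0, s_obs + 1)) / 2**n   # P{S <= s_obs}
p = 2 * min(lower, upper)
# p is about 0.0075 (the slide's 2 * 0.0038 = 0.0076 after rounding)
```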

### The sign test

Application: compare pairwise samples from two different distributions. For comparing two algorithms, suppose we have samples (x_1, ..., x_n) and (y_1, ..., y_n), where x_i and y_i are the performance measures of the algorithms on instance i. Question: is it true that the 1st algorithm is better than the 2nd?

Consider the sequence of differences d_i = y_i − x_i and do the sign test on this sample. The null hypothesis: the medians of the performance distributions are equal (the performance of both algorithms is the same).

If sufficiently many d_i are positive, the null hypothesis is rejected and there is evidence that the 1st algorithm is better.

Note: the null hypothesis is also rejected if there are too few positive d_i, which would indicate that the 2nd algorithm is better.


### The Sign Test and Heuristics for the TSP

Example: a new algorithm (CCAO) for Euclidean TSP:

1. construct a partial tour from the convex hull of the cities, 2. includes remaining cities (cheapest insertion, angle selection) 3. improve the solution (Or-opt); other post-processors: 2-opt,

3-opt (find better tours by exchanging 2 or 3 edges of the current tour until no further improvement is possible).

Only 8 instances (a larger no. of samples is required to draw statistical significant conclusions).

Compare the algorithm to other heuristics: apply the sign test to assess solution quality

I the algorithm is better than heuristics with a weak post-processor (i.e., 2-opt).

I the algorithm is as good as those with a strong post-processor (i.e., Or-opt, 3-opt).


### 3. The Wilcoxon signed-rank test

The Wilcoxon signed-rank test is an extension of the sign test: it takes the magnitudes of the differences into account. It is used to compare two related samples, matched samples, or repeated measurements on a single sample.

1. Compute the distances between the observations and the tested value, d_i = |X_i − m|.
2. Order the distances and compute their ranks R_i (R_i = r: d_i is the r-th smallest distance in the sample).
3. Take only the ranks corresponding to observations X_i greater than m. Their sum is the test statistic W = Σ_{i: X_i > m} R_i.
4. Large values of W suggest rejection of H_0 in favor of H_A: M > m; small values support H_A: M < m; both support a two-sided alternative H_A: M ≠ m.


### The Wilcoxon signed-rank test

Test of the median, H_0: M = m.

Test statistic: W = Σ_{i: X_i > m} R_i, where R_i is the rank of d_i = |X_i − m|.

Null distribution: Table of Critical Values.
For n ≥ 15, W ≈ Normal( n(n+1)/4, √(n(n+1)(2n+1)/24) ).

Assumptions: the distribution of X_i is continuous and symmetric.

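A compact sketch of the statistic W (our own helper; it assumes no tied distances and no observation equal to m, which holds in the slides' examples):

```python
def signed_rank_W(sample, m):
    """Sum of the ranks of |x - m| over the observations with x > m."""
    # Sort by distance; remember whether each observation lies above m
    d = sorted((abs(x - m), x > m) for x in sample)
    # Ranks are positions 1..n in this ordering; sum those with x > m
    return sum(rank for rank, (_, above) in enumerate(d, start=1) if above)

# Daily paper consumption from the supply-and-demand example below
paper = [7, 5.5, 9.5, 6, 3.5, 9]
W = signed_rank_W(paper, 5)    # 18: ranks 4 + 1 + 6 + 2 + 5
```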

### Example 1: Supply and demand

You have to ensure that the printers don’t run out of paper.

During the first six days, the lab consumed 7, 5.5, 9.5, 6, 3.5, and 9 cartons of paper. Is this significant evidence, at the 5% level of significance, that the median daily consumption of paper is more than 5 cartons?

Right-tail test: H_0: M = 5 vs H_A: M > 5.


### Example 1: Supply and demand

For n = 6 and α = 0.05, we reject H_0 when the sum of positive ranks T ≥ 19.

Compute the distances d_i = |X_i − 5| and rank them from the smallest to the largest.

| i | X_i | X_i − 5 | d_i | R_i | sign |
|---|-----|---------|-----|-----|------|
| 1 | 7   | 2       | 2   | 4   | +    |
| 2 | 5.5 | 0.5     | 0.5 | 1   | +    |
| 3 | 9.5 | 4.5     | 4.5 | 6   | +    |
| 4 | 6   | 1       | 1   | 2   | +    |
| 5 | 3.5 | −1.5    | 1.5 | 3   | −    |
| 6 | 9   | 4       | 4   | 5   | +    |

Compute T by adding the "positive" ranks only:

T = 4 + 1 + 6 + 2 + 5 = 18 < 19

No rejection: at the 5% level, the data do not provide significant evidence that the median consumption of paper exceeds 5 cartons.


### Table of Critical Values for the Wilcoxon Signed Rank Test

(table not reproduced here)

### Example 2: Unauthorized use of a computer account

Test M = 0.2 vs M ≠ 0.2.

Compute the distances d_1 = |X_1 − m| = |0.24 − 0.2| = 0.04, ..., d_18 = |0.27 − 0.2| = 0.07, and rank them.

Notice that the 9th, 12th, and 13th observations are below the tested value m = 0.2 while all the others are above. The test statistic (the sum of the positive-sign ranks only):

W = Σ_{i: X_i > m} R_i = 162


### Example 2: Unauthorized use of a computer account

Compute a P-value. This is a two-sided test, therefore

P = 2 min(P{W ≤ 162}, P{W ≥ 162}) < 2 · 0.001 = 0.002

(use the Table with n = 18).

- Obs: for the sample size n = 18, we can also use the Normal approximation.

The test shows strong evidence that the account was used by an unauthorized person.


### 4. Mann-Whitney-Wilcoxon rank sum test

The Wilcoxon signed-rank test can be extended to a two-sample problem: compare two populations, the population of X and the population of Y. In terms of their cumulative distribution functions, test

H_0: F_X(t) = F_Y(t), for all t

Alternative H_A: either Y is stochastically larger than X, and F_X(t) > F_Y(t), or it is stochastically smaller than X, and F_X(t) < F_Y(t).


### Mann-Whitney-Wilcoxon rank sum test

1. Combine all X_i and Y_j into one sample.
2. Rank the observations in this combined sample. Ranks R_i run from 1 to (n + m). Some of these ranks correspond to X-variables, others to Y-variables.
3. The test statistic U is the sum of all X-ranks.

If U is small, the X-variables have low ranks in the combined sample, so they are generally smaller than the Y-variables. This implies that Y is stochastically larger than X, supporting the alternative H_A: F_Y(t) < F_X(t).


### Mann-Whitney-Wilcoxon rank sum test

Variable Y is stochastically larger than variable X: it has a larger median, M_Y > M_X, and a smaller cdf, F_Y(t) < F_X(t).


### Mann-Whitney-Wilcoxon rank sum test

Test of two populations, H_0: F_X = F_Y.

Test statistic: U = Σ_i R_i, where R_i are the ranks of the X_i in the combined sample of X_i and Y_j.

Null distribution: Table of Critical Values.
For n, m ≥ 10, U ≈ Normal( n(n+m+1)/2, √(nm(n+m+1)/12) ).

Assumptions: the distributions of X_i and Y_j are continuous; F_X(t) = F_Y(t) under H_0; F_X(t) < F_Y(t) for all t or F_X(t) > F_Y(t) for all t under H_A.

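A sketch of U (our own helper; it assumes no ties, as in the slides' examples), checked on the data of the shopping-portal example below:

```python
def rank_sum_U(x, y):
    """Sum of the ranks of the x's in the combined ordered sample."""
    # Tag each value with its origin, then sort the combined sample
    combined = sorted([(v, True) for v in x] + [(v, False) for v in y])
    # Ranks are positions 1..(n+m); sum the ranks of the X-values
    return sum(rank for rank, (_, is_x) in enumerate(combined, start=1) if is_x)

# Hits without the discount (X) and with the discount (Y), from the slide
hits_no_discount = [1400, 900, 1300, 1800, 700, 1000]
hits_discount = [1200, 1700, 2600, 1500, 2400, 2100]
U = rank_sum_U(hits_no_discount, hits_discount)   # ranks 1+2+3+5+6+9 = 26
```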

### Example 1: On-line incentives

Managers of a shopping portal suspect that more customers participate in on-line shopping if they are offered some incentive, such as a discount or cash back. To verify this hypothesis, they chose 12 days at random, offered a 5% discount on 6 randomly selected days, but did not offer any incentives on the other 6 days.

The discounts were indicated on the links leading to this shopping portal.

With the discount, the portal received (rounded to 100s) 1200, 1700, 2600, 1500, 2400, and 2100 hits. Without the discount, 1400, 900, 1300, 1800, 700, and 1000 hits were registered. Does this support the managers’ hypothesis?


### Example 1: On-line incentives

Let F_X and F_Y be the cdfs of the number of hits without the discount and with the discount, respectively.

Test H_0: F_X = F_Y vs H_A: X is stochastically smaller than Y. The discount is suspected to boost the number of hits to the shopping portal, so Y should be stochastically larger than X.

To compute the test statistic, combine all the observations and order them:

700, 900, 1000, 1200, 1300, 1400, 1500, 1700, 1800, 2100, 2400, 2600

The X-variables (the days without the discount: 700, 900, 1000, 1300, 1400, 1800) have ranks 1, 2, 3, 5, 6, and 9 in the combined sample, and their sum is U_obs = Σ_{i=1}^{6} R_i = 26.

From the Table of critical values with n = m = 6: the one-sided left-tail P-value is p ∈ (0.01, 0.025]. Although this implies some evidence that discounts help increase on-line shopping activity, the evidence is not overwhelming. We can conclude that the evidence supporting the managers' claim is significant at any α > 0.025.


### Table of Critical Values

(table not reproduced here)

### Example 2: Pings

Round-trip transit times (pings) at two locations, arranged in the increasing order:

Location I: 0.0156, 0.0210, 0.0215, 0.0280, 0.0308, 0.0327, 0.0335, 0.0350, 0.0355, 0.0396, 0.0419, 0.0437, 0.0480, 0.0483, 0.0543 seconds

Location II: 0.0039, 0.0045, 0.0109, 0.0167, 0.0198, 0.0298, 0.0387, 0.0467, 0.0661, 0.0674, 0.0712, 0.0787 seconds

Is there evidence that the median ping depends on the location?

Test H_0: F_X = F_Y vs H_A: F_X ≠ F_Y, where X and Y are the pings at the two locations.

Samples of sizes n = 15 and m = 12.

51/51

### Pings

X-pings have ranks 4, 7, 8, 9, 11, 12, 13, 14, 15, 17, 18, 19, 21, 22, and 23, and their sum is

U_obs = Σ_{i=1}^{15} R_i = 213

To use the Normal distribution, compute

E(U | H_0) = n(n+m+1)/2 = 210,  Var(U | H_0) = nm(n+m+1)/12 = 420

Compute the P-value for the two-sided test:

P = 2 min(P{U ≤ 213}, P{U ≥ 213}) = 2 min(P{U ≤ 213.5}, P{U ≥ 212.5})
  = 2 min(P{Z ≤ (213.5 − 210)/√420}, P{Z ≥ (212.5 − 210)/√420})
  = 2 P{Z ≥ (212.5 − 210)/√420} = 2(1 − Φ(0.12)) = 2(0.4522) = 0.9044

There is no evidence that the pings at the two locations have different distributions.
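The Normal-approximation P-value can be reproduced with `statistics.NormalDist`, keeping the continuity correction (without rounding the z-score to 0.12, the value comes out near 0.903 rather than 0.9044):

```python
from math import sqrt
from statistics import NormalDist

n, m, U_obs = 15, 12, 213
mu = n * (n + m + 1) / 2          # E(U | H0) = 210
var = n * m * (n + m + 1) / 12    # Var(U | H0) = 420

Phi = NormalDist().cdf
# Two-sided P-value with continuity correction, as on the slide
p = 2 * min(Phi((U_obs + 0.5 - mu) / sqrt(var)),
            1 - Phi((U_obs - 0.5 - mu) / sqrt(var)))
# p is about 0.90: no evidence of a difference between the locations
```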