8. Experiments
AEA 2021/2022
Content
- Inferential statistics
- Parametric Tests
- Non-Parametric Tests
Inferential statistics
Hypothesis testing is a general tool for analyzing data.
H0: the null hypothesis
HA: the alternative hypothesis
A statistical test is characterized by a null hypothesis, assumptions on the experiment (i.e., how the data is generated), and a test statistic (a number computed from the data).
The purpose of the test is to check whether the data is consistent with the null hypothesis.
If the data is not consistent with the null hypothesis, the null hypothesis is rejected, which gives some evidence that the alternative (research) hypothesis is true.
Statistical hypothesis testing
Steps:
- State the null hypothesis H0 and the alternative hypothesis HA
- Select a significance level α (a threshold below which the null hypothesis will be rejected)
- Identify the test statistic
- Identify the critical values (using the distribution of the test statistic and the significance level)
- Construct the critical region (where the null hypothesis is rejected)
- Take the decision: reject the null hypothesis if the computed value of the test statistic is in the critical region
Hypothesis testing
Example: verify that the average connection speed is 54 Mbps.
H0: µ = 54
HA: µ ≠ 54, where µ is the average speed of all connections (two-sided alternative)
If we worry only about a low connection speed, we can conduct a one-sided test:
H0: µ = 54
HA: µ < 54
Hypothesis testing
An alternative of the type HA: µ ≠ µ0, covering regions on both sides of the hypothesis (H0: µ = µ0), is a two-sided alternative.
An alternative HA: µ < µ0, covering the region to the left of H0, is one-sided, left-tail.
An alternative HA: µ > µ0, covering the region to the right of H0, is one-sided, right-tail.
Types of errors
Result of the test    Reject H0       Accept H0
H0 is true            Type I error    correct
H0 is false           correct         Type II error

Sampling errors:
- A type I error occurs when we reject a true null hypothesis.
- A type II error occurs when we accept a false null hypothesis.
The probability of a type I error is the significance level of the test, α = P{reject H0 | H0 is true}.
Level α test
Hypothesis testing is based on a test statistic T, a quantity computed from the data that has a known, tabulated distribution F0 if the hypothesis H0 is true.
The null distribution F0: the distribution of the test statistic T when the hypothesis H0 is true.
Accept H0 if the test statistic T belongs to the acceptance region; reject it otherwise.
Statistical tests
If normality and equal variances are not guaranteed, use non-parametric tests.
Parametric tests
- applied when the shape of the distribution is known
- variants:
  - for one population (e.g., a hypothesis about the mean or variance of a population),
  - for two populations (relation between the means),
  - for more than two populations
- examples: Z-test (hypothesis about the mean of a normally distributed population with known variance), T-test (unknown variance and a sample size that is not large)
0. Z-test
The null distribution of the test statistic is Standard Normal.
The test statistic:
\[ Z = \frac{\hat\theta - E(\hat\theta)}{s(\hat\theta)} = \frac{\hat\theta - E(\hat\theta)}{\sqrt{\widehat{\mathrm{Var}}(\hat\theta)}} \]
- Z-test for means: used when we know the population variance, or when the sample size is large
1. t-test (unknown stdev)
T-statistic:
\[ t = \frac{\hat\theta - E(\hat\theta)}{s(\hat\theta)} = \frac{\hat\theta - E(\hat\theta)}{\sqrt{\widehat{\mathrm{Var}}(\hat\theta)}} \]
When the distribution of $\hat\theta$ is Normal, the test is based on Student's T-distribution (with acceptance and rejection regions according to the direction of HA):
- For a right-tail alternative: reject H0 if $t \ge t_{\alpha}$; accept H0 if $t < t_{\alpha}$
- For a left-tail alternative: reject H0 if $t \le -t_{\alpha}$; accept H0 if $t > -t_{\alpha}$
- For a two-sided alternative: reject H0 if $|t| \ge t_{\alpha/2}$; accept H0 if $|t| < t_{\alpha/2}$
t-test
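When two populations have equal variances, $\sigma_X^2 = \sigma_Y^2 = \sigma^2$, the estimator of $\sigma^2$ is the pooled sample variance (for samples of sizes $n$ and $m$):
\[ s_p^2 = \frac{\sum_{i=1}^{n} (X_i - \bar X)^2 + \sum_{i=1}^{m} (Y_i - \bar Y)^2}{n + m - 2} = \frac{(n-1)s_X^2 + (m-1)s_Y^2}{n + m - 2} \]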
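The pooled variance formula can be sketched in a few lines of Python. This is only an illustration: the function name `pooled_variance` and the two small samples are hypothetical, not taken from the slides.

```python
from statistics import variance

def pooled_variance(xs, ys):
    """Pooled sample variance of two samples assumed to share one variance."""
    n, m = len(xs), len(ys)
    # statistics.variance divides by (size - 1), i.e. it is the sample variance
    return ((n - 1) * variance(xs) + (m - 1) * variance(ys)) / (n + m - 2)

# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0, 8.0]
sp2 = pooled_variance(xs, ys)
print(sp2)
```

The weights (n-1) and (m-1) mean that the larger sample contributes more to the estimate.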
a. t-test for a population mean (variance unknown)
Used when $\sigma^2$ is unknown and the population is normal.
H0: µ = µ0
HA: µ ≠ µ0
Test statistic:
\[ t = \frac{\bar x - \mu_0}{s/\sqrt{n}}, \quad \text{where } s^2 = \frac{\sum (x - \bar x)^2}{n - 1}. \]
t-test - Critical region
- For a two-sided alternative, $R = (-\infty, t_{n-1,1-\alpha/2}) \cup (t_{n-1,\alpha/2}, \infty)$.
- For a right-tail alternative, $R = (t_{n-1,\alpha}, \infty)$.
- For a left-tail alternative, $R = (-\infty, t_{n-1,1-\alpha})$.
Example 1: unauthorized use of a computer account
A long-time authorized user of the account averages 0.2 seconds between keystrokes. The following times between keystrokes were recorded when a user typed the username and password:
.24, .22, .26, .34, .35, .32, .33, .29, .19, .36, .30, .15, .17, .28, .38, .40, .37, .27 seconds.
At a 5% level of significance, is this evidence of an unauthorized attempt?
Test: H0: µ = 0.2 vs HA: µ ≠ 0.2. Significance level α = 0.05.
The sample statistics: $n = 18$, $\bar X = 0.29$, $s = 0.074$.
Example 1: unauthorized use of a computer account
Compute the T-statistic:
\[ t = \frac{\bar X - 0.2}{s/\sqrt{n}} = \frac{0.29 - 0.2}{0.074/\sqrt{18}} = 5.16 \]
The rejection region: $R = (-\infty, -2.11] \cup [2.11, \infty)$ (we used the T-distribution with 18 - 1 = 17 degrees of freedom and α/2 = 0.025 because of the two-sided alternative).
Since $t \in R$, we reject the null hypothesis and conclude that there is significant evidence of an unauthorized use of that account.
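The computation above can be checked with a short Python sketch using only the standard library. The variable names are illustrative, and the critical value 2.11 is copied from the text rather than computed, since the standard library has no T-distribution quantile function.

```python
from math import sqrt
from statistics import mean, stdev

# Times between keystrokes from the example (seconds)
times = [.24, .22, .26, .34, .35, .32, .33, .29, .19,
         .36, .30, .15, .17, .28, .38, .40, .37, .27]
mu0 = 0.2                      # hypothesized mean under H0

n = len(times)
x_bar = mean(times)            # sample mean
s = stdev(times)               # sample standard deviation (n - 1 denominator)
t_obs = (x_bar - mu0) / (s / sqrt(n))

t_crit = 2.11                  # table value quoted in the text (17 df, alpha/2 = 0.025)
reject = abs(t_obs) >= t_crit  # two-sided decision rule
print(round(t_obs, 2), reject)
```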
Table of Student’s T-distribution
b. t-test for comparing means of two populations
- Equal variances
  - for small sample sizes, the hypothesis of normality of the populations is required
- Unequal variances
  - estimate the degrees of freedom $\nu$ of the T-distribution that is "closest" to $t$
Satterthwaite approximation:
\[ \nu = \frac{\left(\frac{s_X^2}{n} + \frac{s_Y^2}{m}\right)^2}{\frac{s_X^4}{n^2(n-1)} + \frac{s_Y^4}{m^2(m-1)}} \]
If this number of degrees of freedom is non-integer, take the closest integer $\nu$.
Example 2: Comparison of two servers
An account on server A is more expensive than an account on server B. However, server A is faster. To see if it's optimal to go with the faster but more expensive server, a manager needs to know how much faster it is.
A certain computer algorithm is executed 30 times on server A and 20 times on server B with the following results:

                             Server A    Server B
Sample mean                  6.7 min     7.5 min
Sample standard deviation    0.6 min     1.2 min

Is server A faster?
Example 2: Comparison of two servers
Test H0: $\mu_X = \mu_Y$ vs HA: $\mu_X < \mu_Y$. Significance level α = 0.05.
$n = 30$, $m = 20$, $\bar X = 6.7$, $\bar Y = 7.5$, $s_X = 0.6$, and $s_Y = 1.2$.
This is the case of unknown, unequal standard deviations.
Use the Satterthwaite approximation to find the number of degrees of freedom:
\[ \nu = \frac{\left(\frac{(0.6)^2}{30} + \frac{(1.2)^2}{20}\right)^2}{\frac{(0.6)^4}{30^2(29)} + \frac{(1.2)^4}{20^2(19)}} = 25.4 \]
Reject the null hypothesis if $t \le -1.708$.
\[ t = \frac{6.7 - 7.5}{\sqrt{\frac{(0.6)^2}{30} + \frac{(1.2)^2}{20}}} = -2.7603 \in R \]
Reject H0 and conclude that there is evidence that server A is faster.
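As a check on the arithmetic, here is a minimal Python sketch of the unequal-variance T-statistic and the Satterthwaite degrees of freedom. The helper name `welch_t_and_df` is hypothetical, and the critical value -1.708 is the table value quoted in the text.

```python
from math import sqrt

def welch_t_and_df(n, x_bar, s_x, m, y_bar, s_y):
    """T-statistic and Satterthwaite degrees of freedom from summary statistics."""
    vx, vy = s_x**2 / n, s_y**2 / m           # estimated variances of the two sample means
    t = (x_bar - y_bar) / sqrt(vx + vy)
    # vx**2 / (n - 1) equals s_x^4 / (n^2 (n - 1)), as in the formula above
    nu = (vx + vy)**2 / (vx**2 / (n - 1) + vy**2 / (m - 1))
    return t, nu

t_obs, nu = welch_t_and_df(30, 6.7, 0.6, 20, 7.5, 1.2)
reject = t_obs <= -1.708                      # left-tail rule with the table value from the text
print(round(t_obs, 4), round(nu, 1), reject)
```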
Inferential statistics
1. Choose the significance level 0 < α < 1, i.e., the probability of a type I error you are willing to accept (e.g., α = 0.05 means accepting a 5% risk of rejecting a true null hypothesis).
2. Compute the test statistic for the data.
3. If the p-value of the statistic is smaller than the significance level, reject the null hypothesis.
P-value
How do we choose the significance level α?
The P-value is the lowest significance level α that forces rejection of the null hypothesis.
The P-value is also the highest significance level α that forces acceptance of the null hypothesis.
Testing hypotheses with a P-value:
- For α > P, reject H0
- For α < P, accept H0
In practice:
- If P < 0.01, reject H0
- If P > 0.1, accept H0
Only if the P-value falls between 0.01 and 0.1 do we have to think about the level of significance.
Computing P-values
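The P-value is the probability, computed under H0, of observing a test statistic at least as extreme as the observed value $t_{obs}$.
$F_\nu$ is the cumulative distribution function of the T-distribution with $\nu$ degrees of freedom. For a T-test, the P-value is $1 - F_\nu(t_{obs})$ for a right-tail alternative, $F_\nu(t_{obs})$ for a left-tail alternative, and $2(1 - F_\nu(|t_{obs}|))$ for a two-sided alternative.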
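The same tail-probability logic can be sketched for a Z-test, where the null distribution is Standard Normal and the cdf is available in Python's standard library. This is an assumption made for illustration: the function `z_p_value` is hypothetical, and for a T-test the Normal cdf would have to be replaced by the cdf of the T-distribution with the appropriate degrees of freedom.

```python
from statistics import NormalDist

def z_p_value(z_obs, alternative="two-sided"):
    """P-value of an observed Z-statistic under a Standard Normal null distribution."""
    Phi = NormalDist().cdf                 # Standard Normal cdf
    if alternative == "right":
        return 1 - Phi(z_obs)              # P{Z >= z_obs}
    if alternative == "left":
        return Phi(z_obs)                  # P{Z <= z_obs}
    return 2 * (1 - Phi(abs(z_obs)))       # both tails

print(z_p_value(1.96))                     # close to 0.05
print(z_p_value(-1.0, "left"))
```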
Non-Parametric Tests
- Non-parametric statistics do not assume any particular distribution
- They are less powerful (the less you assume about the data, the less you can find out from it)
- Having fewer requirements, they are applicable to a wider range of problems
- Variants: chi-squared test (verify a hypothesis about the population distribution), independence/association tests (Fisher, chi-squared), comparison in the case of nominal/ordinal characteristics (sign test, rank test, etc.)
2. The sign test
A sample $(x_1, ..., x_n)$ of $n$ real numbers. The assumptions: all the $x_i$ are drawn independently from the same distribution.
The null hypothesis H0: the median $M = m$.
Test against a one-sided or a two-sided alternative, HA: $M < m$, HA: $M > m$, or HA: $M \ne m$.
We are testing whether exactly half of the population is below $m$ and half is above $m$.
The sign test
Compute the test statistic $S := |\{i \mid x_i > m\}|$.
If the null hypothesis is true, the probability that $x_i$ is greater than $m$ is the same as the probability that it is smaller than $m$ (both are 1/2). Therefore, $S$ follows a binomial distribution with parameters $p = 1/2$ and $n$.
Suppose the observed value of $S$ is $k$ and, w.l.o.g., $k \ge n/2$. Compute the p-value, the probability that $S$ is at least $k$:
\[ P\{S \ge k\} = \frac{1}{2^n} \sum_{i=k}^{n} \binom{n}{i} \]
The sign test
Example: if n = 15 and k = 12, we get a p-value of 0.018, which leads to rejection of the null hypothesis at a significance level of α = 0.05.
Instead, we have some evidence that the real median is greater than 0.
- we couldn't conclude this if we selected α = 0.01
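The one-sided p-value of the sign test is a binomial tail sum, which can be sketched directly in Python. The function name `sign_test_upper_p` is illustrative only.

```python
from math import comb

def sign_test_upper_p(n, k):
    """P{S >= k} for S ~ Binomial(n, 1/2): the one-sided sign-test p-value."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

p = sign_test_upper_p(15, 12)
print(round(p, 3))
print(p < 0.05, p < 0.01)   # decisions at the two significance levels discussed above
```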
32/51
Example 1: Unauthorized use of a computer account
Times between keystrokes: .24, .22, .26, .34, .35, .32, .33, .29, .19, .36, .30, .15, .17, .28, .38, .40, .37, .27 seconds. The account owner usually averages 0.2 seconds between keystrokes.
For the T-test, we had to assume that the data follow a Normal distribution.
The histogram does not confirm this assumption.
33/51
Example 1: Unauthorized use of a computer account
The sign test: H0: M = 0.2 vs HA: M ≠ 0.2
The test statistic: $S_{obs} = 15$ (15 of the 18 recorded times exceed 0.2).
From the Binomial distribution table with n = 18 and p = 0.5, find the P-value:
\[ P = 2\min(P\{S \le S_{obs}\}, P\{S \ge S_{obs}\}) = 2\min(0.9993, 0.0038) = 0.0076 \]
The sign test rejects H0 at any α > 0.0076, which is evidence that the account was used by an unauthorized person.
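A short Python sketch reproduces this two-sided sign test from the raw data. The variable names are illustrative. The exact binomial computation gives a P-value of about 0.0075, slightly below the 0.0076 in the text, which comes from doubling the rounded table value 0.0038.

```python
from math import comb

times = [.24, .22, .26, .34, .35, .32, .33, .29, .19,
         .36, .30, .15, .17, .28, .38, .40, .37, .27]
m = 0.2                                   # tested median

n = len(times)
s_obs = sum(1 for x in times if x > m)    # number of observations above m

# Binomial(n, 1/2) tail probabilities on each side of the observed count
upper = sum(comb(n, i) for i in range(s_obs, n + 1)) / 2**n   # P{S >= s_obs}
lower = sum(comb(n, i) for i in range(0, s_obs + 1)) / 2**n   # P{S <= s_obs}
p_value = 2 * min(lower, upper)
print(s_obs, round(upper, 4), round(lower, 4), round(p_value, 4))
```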
34/51
The sign test
Application: compare paired samples from two different distributions. For comparing two algorithms, suppose we have samples $(x_1, ..., x_n)$ and $(y_1, ..., y_n)$, where $x_i$ and $y_i$ are performance measures of the two algorithms on instance $i$. Question: is the 1st algorithm better than the 2nd?
Consider the sequence of differences $d_i = y_i - x_i$ and do the sign test on this sample. The null hypothesis: the medians of the performance distributions are equal (the performance of both algorithms is the same).
If sufficiently many $d_i$ are positive, the null hypothesis is rejected and there is evidence that the 1st algorithm is better (assuming that a smaller measure, such as running time or tour length, means better performance).
Note: the null hypothesis is also rejected if there are too few positive $d_i$, which would indicate that the 2nd algorithm is better.
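A paired sign test along these lines might look as follows in Python. Everything here is a sketch under stated assumptions: the function `paired_sign_test` and the running times are hypothetical, smaller values are taken to mean better performance, and instances with equal measurements are dropped (a common convention, not something the slides specify).

```python
from math import comb

def paired_sign_test(xs, ys):
    """Two-sided sign test on the differences d_i = y_i - x_i."""
    diffs = [y - x for x, y in zip(xs, ys) if y != x]   # assumption: ties are dropped
    n = len(diffs)
    s = sum(1 for d in diffs if d > 0)                  # number of positive differences
    upper = sum(comb(n, i) for i in range(s, n + 1)) / 2**n
    lower = sum(comb(n, i) for i in range(0, s + 1)) / 2**n
    return s, n, min(1.0, 2 * min(lower, upper))

# Hypothetical running times (seconds) of two algorithms on 8 instances
alg1 = [10, 12, 9, 11, 13, 8, 10, 12]
alg2 = [12, 13, 11, 12, 15, 9, 9, 14]
s, n, p_value = paired_sign_test(alg1, alg2)
print(s, n, round(p_value, 4))
```

With only 8 instances, even 7 positive differences out of 8 do not reach significance at the 5% level in a two-sided test, which illustrates why a larger number of instances is usually needed.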
35/51
The Sign Test and Heuristics for the TSP
Example: a new algorithm (CCAO) for the Euclidean TSP:
1. construct a partial tour from the convex hull of the cities,
2. include the remaining cities (cheapest insertion, angle selection),
3. improve the solution (Or-opt); other post-processors: 2-opt, 3-opt (find better tours by exchanging 2 or 3 edges of the current tour until no further improvement is possible).
Only 8 instances were used (a larger number of samples is required to draw statistically significant conclusions).
Compare the algorithm to other heuristics: apply the sign test to assess solution quality
- the algorithm is better than heuristics with a weak post-processor (i.e., 2-opt).
- the algorithm is as good as those with a strong post-processor (i.e., Or-opt, 3-opt).
36/51
3. The Wilcoxon signed-rank test
The Wilcoxon signed-rank test is an extension of the sign test: it takes the magnitude of the differences into account. It is used to compare two related samples, matched samples, or repeated measurements on a single sample.
1. Compute the distances between the observations and the tested value, $d_i = |X_i - m|$.
2. Order the distances and compute their ranks $R_i$ ($R_i = r$ means that $d_i$ is the $r$-th smallest distance in the sample).
3. Take only the ranks corresponding to observations $X_i$ greater than $m$. Their sum is the test statistic $W = \sum_{i: X_i > m} R_i$.
4. Large values of $W$ suggest rejection of H0 in favor of HA: $M > m$; small values support HA: $M < m$; both support a two-sided alternative HA: $M \ne m$.
37/51
The Wilcoxon signed-rank test
Test of the median, H0: $M = m$.
Test statistic: $W = \sum_{i: X_i > m} R_i$, where $R_i$ is the rank of $d_i = |X_i - m|$.
Null distribution: Table of Critical values. For $n \ge 15$,
\[ W \approx \mathrm{Normal}\left(\frac{n(n+1)}{4}, \sqrt{\frac{n(n+1)(2n+1)}{24}}\right) \]
Assumptions: the distribution of $X_i$ is continuous and symmetric.
38/51
Example 1: Supply and demand
You have to ensure that the printers don't run out of paper.
During the first six days, the lab consumed 7, 5.5, 9.5, 6, 3.5, and 9 cartons of paper. Does this provide significant evidence, at the 5% level of significance, that the median daily consumption of paper is more than 5 cartons?
Right-tail test: H0: M = 5 vs HA: M > 5.
39/51
Example 1: Supply and demand
For n = 6 and α = 0.05, we'll reject H0 when the sum of positive ranks $T \ge 19$.
Compute the distances $d_i = |X_i - 5|$ and rank them from the smallest to the largest.

i    X_i    X_i - 5    d_i    R_i    sign
1    7      2          2      4      +
2    5.5    0.5        0.5    1      +
3    9.5    4.5        4.5    6      +
4    6      1          1      2      +
5    3.5    -1.5       1.5    3      -
6    9      4          4      5      +

Compute T by adding the "positive" ranks only:
T = 4 + 1 + 6 + 2 + 5 = 18 < 19.
No rejection: at the 5% level, the data do not provide significant evidence that the median consumption of paper exceeds 5 cartons.
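The rank table can be reproduced with a small Python sketch. The helper `signed_rank_statistic` is hypothetical and assumes, as in this example, that no observation equals the tested value and that all distances are distinct. The critical value 19 is the table value quoted in the text.

```python
def signed_rank_statistic(xs, m):
    """Sum of the ranks of |x - m| over observations with x > m.

    Assumes no x equals m and no ties among the distances.
    """
    dists = sorted(abs(x - m) for x in xs)
    rank = {d: r for r, d in enumerate(dists, start=1)}   # distance -> rank
    return sum(rank[abs(x - m)] for x in xs if x > m)

cartons = [7, 5.5, 9.5, 6, 3.5, 9]     # daily paper consumption from the example
t_obs = signed_rank_statistic(cartons, 5)
reject = t_obs >= 19                   # right-tail rule for n = 6, alpha = 0.05 (from the text)
print(t_obs, reject)
```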
40/51
Table of Critical Values for the Wilcoxon Signed Rank Test
41/51
Example 2: Unauthorized use of a computer account
Test M = 0.2 vs M ≠ 0.2.
Compute the distances $d_1 = |X_1 - m| = |0.24 - 0.2| = 0.04$, ..., $d_{18} = |0.27 - 0.2| = 0.07$ and rank them.
Notice that the 9th, 12th, and 13th observations are below the tested value m = 0.2 while all the others are above. The test statistic (the sum of the ranks of the positive differences only):
\[ W = \sum_{i: X_i > m} R_i = 162 \]
42/51
Example 2: Unauthorized use of a computer account
Compute a P-value. This is a two-sided test, therefore
\[ P = 2\min(P\{W \le 162\}, P\{W \ge 162\}) < 2 \cdot 0.001 = 0.002 \]
(use the Table with n = 18).
- Note: for the sample size n = 18, we can also use the Normal approximation.
The test shows strong evidence that the account was used by an unauthorized person.
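The Normal approximation mentioned in the note can be sketched as follows. This is an illustration with hypothetical variable names; it applies no continuity correction, and it relies on the fact that all 18 distances in this sample are distinct.

```python
from math import sqrt
from statistics import NormalDist

times = [.24, .22, .26, .34, .35, .32, .33, .29, .19,
         .36, .30, .15, .17, .28, .38, .40, .37, .27]
m = 0.2
n = len(times)

# Rank the distances |x - m| (all distinct here) and sum the ranks of x > m
dists = sorted(abs(x - m) for x in times)
rank = {d: r for r, d in enumerate(dists, start=1)}
w_obs = sum(rank[abs(x - m)] for x in times if x > m)

# Normal approximation of the null distribution of W
mean_w = n * (n + 1) / 4
sd_w = sqrt(n * (n + 1) * (2 * n + 1) / 24)
z = (w_obs - mean_w) / sd_w
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided
print(w_obs, round(z, 2), p_value)
```

The approximate P-value is below 0.002, in line with the bound obtained from the table.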
43/51
4. Mann-Whitney-Wilcoxon rank sum test
The Wilcoxon signed-rank test can be extended to a two-sample problem: compare two populations, the population of X and the population of Y. In terms of their cumulative distribution functions, test
H0: $F_X(t) = F_Y(t)$ for all $t$
Alternative HA: either Y is stochastically larger than X, and $F_X(t) > F_Y(t)$, or it is stochastically smaller than X, and $F_X(t) < F_Y(t)$.
44/51
Mann-Whitney-Wilcoxon rank sum test
1. Combine all $X_i$ and $Y_j$ into one sample.
2. Rank the observations in this combined sample. The ranks $R_i$ run from 1 to $(n+m)$. Some of these ranks correspond to X-variables, others to Y-variables.
3. The test statistic $U$ is the sum of all X-ranks.
If $U$ is small, the X-variables have low ranks in the combined sample, so they are generally smaller than the Y-variables. This implies that Y is stochastically larger than X, which supports the alternative HA: $F_Y(t) < F_X(t)$.
45/51
Mann-Whitney-Wilcoxon rank sum test
Variable Y is stochastically larger than variable X: it has a larger median, $M_Y > M_X$, and a smaller cdf, $F_Y(t) < F_X(t)$.
46/51
Mann-Whitney-Wilcoxon rank sum test
Test of two populations, H0: $F_X = F_Y$.
Test statistic: $U = \sum_i R_i$, where $R_i$ are the ranks of the $X_i$ in the combined sample of $X_i$ and $Y_j$.
Null distribution: Table of Critical values. For $n, m \ge 10$,
\[ U \approx \mathrm{Normal}\left(\frac{n(n+m+1)}{2}, \sqrt{\frac{nm(n+m+1)}{12}}\right) \]
Assumptions: the distributions of $X_i$ and $Y_j$ are continuous; $F_X(t) = F_Y(t)$ under H0; $F_X(t) < F_Y(t)$ for all $t$ or $F_X(t) > F_Y(t)$ for all $t$ under HA.
47/51
Example 1: On-line incentives
Managers of a shopping portal suspect that more customers participate in on-line shopping if they are offered some incentive, such as a discount or cash back. To verify this hypothesis, they chose 12 days at random, offered a 5% discount on 6 randomly selected days, but did not offer any incentives on the other 6 days.
The discounts were indicated on the links leading to this shopping portal.
With the discount, the portal received (rounded to 100s) 1200, 1700, 2600, 1500, 2400, and 2100 hits. Without the discount, 1400, 900, 1300, 1800, 700, and 1000 hits were registered. Does this support the managers’ hypothesis?
48/51
Example 1: On-line incentives
$F_X$ and $F_Y$ are the cdfs of the number of hits without the discount and with the discount, respectively.
Test H0: $F_X = F_Y$ vs HA: X is stochastically smaller than Y. The discount is suspected to boost the number of hits to the shopping portal, so Y should be stochastically larger than X.
To compute the test statistic, combine all the observations and order them:
700, 900, 1000, 1200, 1300, 1400, 1500, 1700, 1800, 2100, 2400, 2600
The X-variables (no discount) are 700, 900, 1000, 1300, 1400, and 1800. In the combined sample, their ranks are 1, 2, 3, 5, 6, and 9, and their sum is $U_{obs} = \sum_{i=1}^{6} R_i = 26$.
From the Table of critical values with n = m = 6: the one-sided left-tail P-value is $p \in (0.01, 0.025]$. Although it implies some evidence that discounts help increase the on-line shopping activity, this evidence is not overwhelming. We can conclude that the evidence supporting the managers' claim is significant at any α > 0.025.
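The rank-sum statistic of this example can be recomputed with a few lines of Python. The variable names are illustrative, and the sketch assumes there are no tied observations, which holds for this data.

```python
# Hits per day from the example
without_discount = [1400, 900, 1300, 1800, 700, 1000]   # X
with_discount = [1200, 1700, 2600, 1500, 2400, 2100]    # Y

# Rank all observations in the combined sample (no ties in this data)
combined = sorted(without_discount + with_discount)
rank = {value: r for r, value in enumerate(combined, start=1)}

x_ranks = sorted(rank[x] for x in without_discount)
u_obs = sum(x_ranks)          # test statistic: sum of the X-ranks
print(x_ranks, u_obs)
```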
49/51
Table of Critical Values
50/51
Example 2: Pings
Round-trip transit times (pings) at two locations, arranged in increasing order:
Location I: 0.0156, 0.0210, 0.0215, 0.0280, 0.0308, 0.0327, 0.0335, 0.0350, 0.0355, 0.0396, 0.0419, 0.0437, 0.0480, 0.0483, 0.0543 seconds
Location II: 0.0039, 0.0045, 0.0109, 0.0167, 0.0198, 0.0298, 0.0387, 0.0467, 0.0661, 0.0674, 0.0712, 0.0787 seconds
Is there evidence that the median ping depends on the location?
Test H0: $F_X = F_Y$ vs HA: $F_X \ne F_Y$, where X and Y are the pings at the two locations.
Samples of sizes n = 15 and m = 12.
51/51
Pings
X-pings have ranks 4, 7, 8, 9, 11, 12, 13, 14, 15, 17, 18, 19, 21, 22, and 23, and their sum is
\[ U_{obs} = \sum_{i=1}^{15} R_i = 213 \]
To use the Normal approximation, compute
\[ E(U \mid H_0) = \frac{n(n+m+1)}{2} = 210, \qquad \mathrm{Var}(U \mid H_0) = \frac{nm(n+m+1)}{12} = 420 \]
Compute the P-value for the two-sided test (with a continuity correction):
\[ P = 2\min(P\{U \le 213\}, P\{U \ge 213\}) = 2\min(P\{U \le 213.5\}, P\{U \ge 212.5\}) \]
\[ = 2\min\left(P\left\{Z \le \frac{213.5 - 210}{\sqrt{420}}\right\}, P\left\{Z \ge \frac{212.5 - 210}{\sqrt{420}}\right\}\right) \]
\[ = 2P\left\{Z \ge \frac{212.5 - 210}{\sqrt{420}}\right\} = 2(1 - \Phi(0.12)) = 2(0.4522) = 0.9044 \]
There is no evidence that pings at the two locations have different distributions.
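The whole computation can be sketched in Python with the standard library. The variable names are illustrative and the data has no ties. Because the code does not round the Z-value to 0.12 before looking up the Normal cdf, it gives a P-value of about 0.903 rather than exactly 0.9044.

```python
from math import sqrt
from statistics import NormalDist

loc1 = [0.0156, 0.0210, 0.0215, 0.0280, 0.0308, 0.0327, 0.0335, 0.0350,
        0.0355, 0.0396, 0.0419, 0.0437, 0.0480, 0.0483, 0.0543]   # X
loc2 = [0.0039, 0.0045, 0.0109, 0.0167, 0.0198, 0.0298, 0.0387, 0.0467,
        0.0661, 0.0674, 0.0712, 0.0787]                           # Y
n, m = len(loc1), len(loc2)

# Rank the combined sample (no ties) and sum the ranks of the X-pings
combined = sorted(loc1 + loc2)
rank = {value: r for r, value in enumerate(combined, start=1)}
u_obs = sum(rank[x] for x in loc1)

# Normal approximation of the null distribution of U
mean_u = n * (n + m + 1) / 2
var_u = n * m * (n + m + 1) / 12
cdf = NormalDist(mean_u, sqrt(var_u)).cdf

# Two-sided P-value with continuity correction
p_value = 2 * min(cdf(u_obs + 0.5), 1 - cdf(u_obs - 0.5))
print(u_obs, mean_u, var_u, round(p_value, 3))
```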