
Inferential statistics


Academic year: 2022



8. Experiments

AEA 2021/2022





Inferential statistics Parametric Tests Non-Parametric Tests



Inferential statistics

Hypothesis testing is a general tool to analyze the data.

H0 = hypothesis (the null hypothesis)
HA = alternative (the alternative hypothesis)

A statistical test is characterized by a null hypothesis, assumptions on the experiment (i.e., how the data is generated), and a test statistic (a number computed from the data).

The purpose of the test is to check whether some data is consistent with the null hypothesis or not.

If the null hypothesis is not consistent with the data, it is rejected and there is some evidence that the research hypothesis is true.



Statistical hypothesis testing


- State the null hypothesis H0 and the alternative hypothesis HA
- Select a significance level α (a threshold for the P-value, below which the null hypothesis will be rejected)
- Identify the test statistic
- Identify the critical values (using the distribution of the test statistic and the significance level)
- Construct the critical region (where the null hypothesis is rejected)
- Take the decision: reject the null hypothesis if the computed value is in the critical region



Hypothesis testing

Example: verify that the average connection speed is 54 Mbps.

H0: µ = 54
HA: µ ≠ 54, where µ is the average speed of all connections (two-sided alternative)

If we worry about a low connection speed only, we can conduct a one-sided test:

H0: µ = 54
HA: µ < 54



Hypothesis testing

An alternative of the type HA: µ ≠ µ0, covering regions on both sides of the hypothesis (H0: µ = µ0), is a two-sided alternative.

An alternative HA: µ < µ0, covering the region to the left of H0, is one-sided, left-tail.

An alternative HA: µ > µ0, covering the region to the right of H0, is one-sided, right-tail.



Types of errors

Result of the test    Reject H0       Accept H0
H0 is true            Type I error    correct
H0 is false           correct         Type II error

Sampling errors:

- A type I error occurs when we reject a true null hypothesis.
- A type II error occurs when we accept a false null hypothesis.

The probability of a type I error is the significance level of a test, α = P{reject H0 | H0 is true}.
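The definition of α can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: it draws samples for which H0 is true (Normal data with mean 54, borrowed from the connection-speed example), runs a two-sided Z-test with known σ at α = 0.05, and counts how often H0 is wrongly rejected. The function name, σ = 10, n = 25, the number of trials and the seed are all made up for the illustration.

```python
import math
import random


def type_one_error_rate(trials=20000, n=25, mu0=54.0, sigma=10.0,
                        z_crit=1.96, seed=1):
    """Estimate P{reject H0 | H0 is true} for a two-sided Z-test.

    All default values are illustrative assumptions, not data from the slides.
    z_crit = 1.96 is the standard Normal critical value for alpha = 0.05.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # Generate a sample for which the null hypothesis mu = mu0 holds.
        sample_mean = sum(rng.gauss(mu0, sigma) for _ in range(n)) / n
        z = (sample_mean - mu0) / (sigma / math.sqrt(n))
        if abs(z) >= z_crit:
            rejections += 1  # a type I error: a true H0 was rejected
    return rejections / trials


rate = type_one_error_rate()
print(f"estimated type I error rate: {rate:.3f}")
```

The estimated rate should land close to the chosen significance level of 0.05.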



Level α test

Testing a hypothesis is based on a test statistic T, a quantity computed from the data, that has some known, tabulated distribution F0 if the hypothesis H0 is true.

The null distribution F0: the distribution of the test statistic T when the hypothesis H0 is true.

Accept H0 if the test statistic T belongs to the acceptance region.



Statistical tests

If normality and equal variances are not guaranteed, use non-parametric tests.







Parametric tests

- applied when the shape of the distribution is known
- variants:
  - for a population (e.g., a hypothesis about the mean/variance of a population),
  - for two populations (relation between means),
  - for more than two populations
- examples: Z-test (hypothesis about the mean of a population with normal distribution, known variance), T-test (unknown variance and the sample size is not large)



0. Z-test

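The null distribution of the test statistic is Standard Normal.

The test statistic:

$$Z = \frac{\hat\theta - E(\hat\theta)}{s(\hat\theta)} = \frac{\hat\theta - E(\hat\theta)}{\sqrt{\mathrm{Var}(\hat\theta)}}$$

- Z-test for means: used when we know the population variance, or when the sample size is large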



1. t-test (unknown stdev)

T-statistic:

$$t = \frac{\hat\theta - E(\hat\theta)}{s(\hat\theta)} = \frac{\hat\theta - E(\hat\theta)}{\sqrt{\widehat{\mathrm{Var}}(\hat\theta)}}$$

When the distribution of θ̂ is Normal, the test is based on Student's T-distribution (with acceptance and rejection regions according to the direction of HA):

- For a right-tail alternative: reject H0 if t ≥ tα; accept H0 if t < tα.
- For a left-tail alternative: reject H0 if t ≤ −tα; accept H0 if t > −tα.
- For a two-sided alternative: reject H0 if |t| ≥ tα/2; accept H0 if |t| < tα/2.
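The three decision rules above can be written as one small function. This is a minimal sketch: the function name and the string labels for the alternatives are hypothetical, and the critical value is assumed to be looked up in a T-distribution table by the caller (tα for a one-sided test, tα/2 for a two-sided test).

```python
def rejects_h0(t_obs, t_crit, alternative):
    """Apply the t-test decision rule for the given alternative.

    t_crit is the positive table value: t_alpha for a one-sided test,
    t_{alpha/2} for a two-sided test (assumed to be supplied by the caller).
    """
    if alternative == "right":
        return t_obs >= t_crit
    if alternative == "left":
        return t_obs <= -t_crit
    if alternative == "two-sided":
        return abs(t_obs) >= t_crit
    raise ValueError(f"unknown alternative: {alternative!r}")


# Illustrative values only: a statistic of 2.3 against a table value of 1.74.
print(rejects_h0(2.3, 1.74, "right"))
print(rejects_h0(2.3, 1.74, "left"))
```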




When two populations have equal variances, σX² = σY² = σ², the estimator of σ² is the pooled sample variance:

$$s_p^2 = \frac{(n-1)s_X^2 + (m-1)s_Y^2}{n+m-2}$$
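As a minimal sketch of the pooled estimator, the following function evaluates the formula from summary statistics. The function name, the sample sizes and the standard deviations in the call are made up for illustration; they do not come from the slides.

```python
def pooled_variance(n, s_x, m, s_y):
    """Pooled sample variance of two samples assumed to share one variance.

    n, m     -- sample sizes
    s_x, s_y -- sample standard deviations
    """
    return ((n - 1) * s_x**2 + (m - 1) * s_y**2) / (n + m - 2)


# Hypothetical summary statistics: n = 10, s_X = 2.0, m = 12, s_Y = 3.0.
sp2 = pooled_variance(10, 2.0, 12, 3.0)
print(sp2)  # (9*4 + 11*9) / 20 = 6.75
```

The result always lies between the two sample variances, weighted toward the larger sample.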



a. t-test for a population mean (variance unknown)

Used when σ² is not known and the population is normal.

H0: µ = µ0
HA: µ ≠ µ0

Test statistic: $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$, where $s^2 = \frac{\sum (x - \bar{x})^2}{n-1}$.



t-test - Critical region

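Here $t_{n-1,\alpha}$ denotes the value of the T-distribution with n − 1 degrees of freedom that has right-tail probability α, so $t_{n-1,1-\alpha} = -t_{n-1,\alpha}$.

- For a two-sided alternative, $R = (-\infty, t_{n-1,1-\alpha/2}) \cup (t_{n-1,\alpha/2}, \infty)$.
- For a right-tail alternative, $R = (t_{n-1,\alpha}, \infty)$.
- For a left-tail alternative, $R = (-\infty, t_{n-1,1-\alpha})$.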



Example 1: unauthorized use of a computer account

A long-time authorized user of the account takes 0.2 seconds on average between keystrokes. The following times between keystrokes were recorded when a user typed the username and password:

.24, .22, .26, .34, .35, .32, .33, .29, .19, .36, .30, .15, .17, .28, .38, .40, .37, .27 seconds.

At a 5% level of significance, is this evidence of an unauthorized attempt?

Test: H0: µ = 0.2 vs HA: µ ≠ 0.2. Significance level α = 0.05.

The sample statistics: n = 18, X̄ = 0.29, s = 0.074.



Example 1: unauthorized use of a computer account

Compute the T-statistic:

$$t = \frac{\bar{X} - 0.2}{s/\sqrt{n}} = \frac{0.29 - 0.2}{0.074/\sqrt{18}} = 5.16$$

The rejection region: R = (−∞, −2.11] ∪ [2.11, ∞) (we used the T-distribution with 18 − 1 = 17 degrees of freedom and α/2 = 0.025 because of the two-sided alternative).

Since t ∈ R, we reject the null hypothesis and conclude that there is significant evidence of an unauthorized use of that account.
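The computation in this example can be reproduced with the Python standard library. This is a sketch rather than a general-purpose routine: the variable names are illustrative, and the critical value 2.11 is hard-coded from the T-distribution table (17 degrees of freedom, α/2 = 0.025) instead of being computed.

```python
import math
import statistics

# Times between keystrokes from the example, in seconds.
times = [0.24, 0.22, 0.26, 0.34, 0.35, 0.32, 0.33, 0.29, 0.19,
         0.36, 0.30, 0.15, 0.17, 0.28, 0.38, 0.40, 0.37, 0.27]
mu0 = 0.2       # hypothesized mean under H0
T_CRIT = 2.11   # table value for 17 degrees of freedom, alpha/2 = 0.025

n = len(times)
x_bar = statistics.mean(times)
s = statistics.stdev(times)                 # sample standard deviation (divides by n - 1)
t_obs = (x_bar - mu0) / (s / math.sqrt(n))  # one-sample T-statistic

reject = abs(t_obs) >= T_CRIT               # two-sided rejection rule
print(f"n = {n}, mean = {x_bar:.2f}, s = {s:.3f}, t = {t_obs:.2f}, reject H0: {reject}")
```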



Table of Student’s T-distribution



b. t-test for comparing means of two populations

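- Equal variances
  - for small sample sizes, the hypothesis of normality of the populations is required
- Unequal variances
  - estimate the degrees of freedom ν of a T-distribution that is "closest" to the distribution of t

Satterthwaite approximation:

$$\nu = \frac{\left(\frac{s_X^2}{n} + \frac{s_Y^2}{m}\right)^2}{\frac{s_X^4}{n^2(n-1)} + \frac{s_Y^4}{m^2(m-1)}}$$

If this number of degrees of freedom is non-integer, take the closest ν listed in the T-distribution table.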



Example 2: Comparison of two servers

An account on server A is more expensive than an account on server B. However, server A is faster. To see if it's optimal to go with the faster but more expensive server, a manager needs to know how much faster it is.

A certain computer algorithm is executed 30 times on server A and 20 times on server B with the following results:

                             Server A    Server B
Sample mean                  6.7 min     7.5 min
Sample standard deviation    0.6 min     1.2 min

Is server A faster?



Example 2: Comparison of two servers

Test H0: µX = µY vs HA: µX < µY. Significance level α = 0.05.

n = 30, m = 20, X̄ = 6.7, Ȳ = 7.5, sX = 0.6, and sY = 1.2.

This is the case of unknown, unequal standard deviations.

Use the Satterthwaite approximation to find the number of degrees of freedom:

$$\nu = \frac{\left(\frac{(0.6)^2}{30} + \frac{(1.2)^2}{20}\right)^2}{\frac{(0.6)^4}{30^2 \cdot 29} + \frac{(1.2)^4}{20^2 \cdot 19}} = 25.4$$

Reject the null hypothesis if t ≤ −1.708.

$$t = \frac{6.7 - 7.5}{\sqrt{\frac{(0.6)^2}{30} + \frac{(1.2)^2}{20}}} = -2.7603 \in R$$

Reject H0 and conclude that there is evidence that server A is faster.
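The numbers in this example can be checked with a few lines of Python. The sketch below assumes the standard unequal-variance T-statistic and the Satterthwaite formula for ν; the function name is hypothetical, and the critical value −1.708 is taken from the example rather than computed.

```python
import math


def unequal_variance_t(mean_x, s_x, n, mean_y, s_y, m):
    """Return (t, nu) for two samples with unknown, unequal variances.

    t  -- difference of sample means divided by its estimated standard error
    nu -- Satterthwaite approximation of the degrees of freedom
    """
    var_x = s_x**2 / n   # estimated variance of the first sample mean
    var_y = s_y**2 / m   # estimated variance of the second sample mean
    t = (mean_x - mean_y) / math.sqrt(var_x + var_y)
    nu = (var_x + var_y) ** 2 / (var_x**2 / (n - 1) + var_y**2 / (m - 1))
    return t, nu


# Server A: 30 runs, mean 6.7, sd 0.6. Server B: 20 runs, mean 7.5, sd 1.2.
t_obs, nu = unequal_variance_t(6.7, 0.6, 30, 7.5, 1.2, 20)
reject = t_obs <= -1.708   # left-tail rule with the table value from the example
print(f"t = {t_obs:.4f}, nu = {nu:.1f}, reject H0: {reject}")
```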



Inferential statistics

1. Choose the significance level 0 < α < 1, the probability of a type I error you are willing to accept (e.g., α = 0.05: accept a 5% chance of rejecting a true null hypothesis).

2. Then compute the test statistic for the data.

3. If the p-value of the statistic is smaller than the significance level, the null hypothesis is rejected.




How do we choose the significance level α?

P-value is the lowest significance level α that forces rejection of the null hypothesis.

P-value is also the highest significance level α that forces acceptance of the null hypothesis.

Testing hypotheses with a P-value:

- For α > P, reject H0
- For α < P, accept H0

Practically,

- If P < 0.01, reject H0
- If P > 0.1, accept H0

Only if the P-value falls between 0.01 and 0.1 do we have to think about the level of significance.



Computing P-values

P-value is the probability, computed assuming H0 is true, of observing a test statistic at least as extreme as t_obs.

Fν is the cumulative distribution function of the T-distribution with ν degrees of freedom.
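For a T-test, "at least as extreme" depends on the direction of the alternative. The formulas below are the standard expressions in terms of Fν, written out here as a reminder; they are a reconstruction of what this slide presumably tabulated, not a copy of it.

```latex
\begin{align*}
\text{right-tail } (H_A: \mu > \mu_0): \quad & P = P\{t \ge t_{obs}\} = 1 - F_\nu(t_{obs}) \\
\text{left-tail } (H_A: \mu < \mu_0): \quad  & P = P\{t \le t_{obs}\} = F_\nu(t_{obs}) \\
\text{two-sided } (H_A: \mu \ne \mu_0): \quad & P = P\{|t| \ge |t_{obs}|\} = 2\bigl(1 - F_\nu(|t_{obs}|)\bigr)
\end{align*}
```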







Non-Parametric Tests

- Non-parametric statistics do not assume any particular distribution
- They are less powerful (the less you assume about the data, the less you can find out from it)
- Having fewer requirements, they apply to a wider range of problems
- Variants: chi-squared test (verify a hypothesis about the population distribution), independence/association (Fisher, chi-squared), comparison in the case of nominal/ordinal characteristics (sign test, rank test, etc.)



2. The sign test

A sample (x1, ..., xn) of n real numbers. The assumptions: all the xi are drawn independently from the same distribution.

The null hypothesis H0: the median M = m.

Test against a one-sided or a two-sided alternative, HA: M < m, HA: M > m, or HA: M ≠ m.

We are testing whether exactly half of the population is below m and half is above m.



The sign test

Compute the test statistic S := |{i | xi > m}|.

If the null hypothesis is true, the probability that xi is greater than m is the same as the probability that it is smaller than m (both are 1/2). Therefore, S is distributed according to a binomial distribution with parameters p = 1/2 and n.

Suppose the observed value of S is k, w.l.o.g. k ≥ n/2. Compute the p-value, the probability that S is at least k:

$$\frac{1}{2^n} \sum_{i=k}^{n} \binom{n}{i}$$






The sign test

Example: if n = 15, k = 12, we get a p-value of 0.018 → rejection of the null hypothesis at a significance level of α = 0.05.

Instead, we have some evidence that the real median is greater than m.

- we couldn't conclude this if we had selected α = 0.01
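The p-value in this example is a binomial tail sum and is easy to verify. A minimal sketch (the function name is hypothetical) that evaluates the one-sided p-value P{S ≥ k} for S distributed as Binomial(n, 1/2):

```python
from math import comb


def sign_test_upper_p(n, k):
    """One-sided sign-test p-value: P{S >= k} for S ~ Binomial(n, 1/2)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n


p = sign_test_upper_p(15, 12)
print(f"p-value = {p:.3f}")
print("reject at alpha = 0.05:", p < 0.05)
print("reject at alpha = 0.01:", p < 0.01)
```

The tail sum is (455 + 105 + 15 + 1) / 2^15 = 576 / 32768, which rounds to the 0.018 quoted above.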



Example 1: Unauthorized use of a computer account

Times between keystrokes: .24, .22, .26, .34, .35, .32, .33, .29, .19, .36, .30, .15, .17, .28, .38, .40, .37, .27 seconds. The account owner usually takes 0.2 sec between keystrokes.

We had to assume the Normal distribution of data (for T-test).

The histogram does not confirm this assumption.



Example 1: Unauthorized use of a computer account

The sign test: H0: M = 0.2 vs HA: M ≠ 0.2

The test statistic: S_obs = 15 (15 of 18 recorded times exceed 0.2).

From the Binomial distribution table with n = 18 and p = 0.5, find the P-value:

P = 2 min(P{S ≤ S_obs}, P{S ≥ S_obs}) = 2 min(0.9993, 0.0038) = 0.0076

The sign test rejects H0 at any α > 0.0076, which is evidence that the account was used by an unauthorized person.
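The two-sided sign test for the keystroke data can be reproduced directly from the binomial distribution, without a table. This is a sketch with illustrative variable names; it assumes no observation equals the tested median exactly, which holds for this data set.

```python
from math import comb

# Times between keystrokes from the example, in seconds.
times = [0.24, 0.22, 0.26, 0.34, 0.35, 0.32, 0.33, 0.29, 0.19,
         0.36, 0.30, 0.15, 0.17, 0.28, 0.38, 0.40, 0.37, 0.27]
m = 0.2                                   # tested median under H0
n = len(times)
s_obs = sum(1 for x in times if x > m)    # number of observations above m

# Binomial(n, 1/2) tail probabilities of the observed count.
p_lower = sum(comb(n, i) for i in range(0, s_obs + 1)) / 2**n   # P{S <= s_obs}
p_upper = sum(comb(n, i) for i in range(s_obs, n + 1)) / 2**n   # P{S >= s_obs}
p_two_sided = 2 * min(p_lower, p_upper)

print(f"S_obs = {s_obs}, P(S <= S_obs) = {p_lower:.4f}, "
      f"P(S >= S_obs) = {p_upper:.4f}, P-value = {p_two_sided:.4f}")
```

The exact P-value is slightly below the 0.0076 obtained from the rounded table entry 0.0038.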



The sign test

Application: compare paired samples from two different distributions. For comparing two algorithms, suppose we have samples (x1, ..., xn) and (y1, ..., yn), where xi and yi are performance measures of the two algorithms on instance i (here a smaller value, such as a running time, means better performance). Question: is it true that the 1st algorithm is better than the 2nd?

Consider the sequence of differences di = yi − xi and do the sign test on this sample. The null hypothesis: the medians of the performance distributions are equal (the performance of both algorithms is the same).

If sufficiently many di are positive, the null hypothesis is rejected and there is evidence that the 1st algorithm is better.

Note: the null hypothesis is also rejected if there are too few positive di, which would indicate that the 2nd algorithm is better.



The Sign Test and Heuristics for the TSP

Example: a new algorithm (CCAO) for the Euclidean TSP:

1. construct a partial tour from the convex hull of the cities,
2. include the remaining cities (cheapest insertion, angle selection),
3. improve the solution (Or-opt); other post-processors: 2-opt, 3-opt (find better tours by exchanging 2 or 3 edges of the current tour until no further improvement is possible).

Only 8 instances (a larger number of samples is required to draw statistically significant conclusions).

Compare the algorithm to other heuristics: apply the sign test to assess solution quality

- the algorithm is better than heuristics with a weak post-processor (i.e., 2-opt).
- the algorithm is as good as those with a strong post-processor (i.e., Or-opt, 3-opt).



3. The Wilcoxon signed-rank test

The Wilcoxon signed-rank test is an extension of the sign test: it takes the magnitude of the differences into account. It is used to compare two related samples, matched samples, or repeated measurements on a single sample.

1. Compute the distances between the observations and the tested value, di = |Xi − m|.
2. Order the distances and compute their ranks Ri (Ri = r means that di is the r-th smallest distance in the sample).
3. Take only the ranks corresponding to observations Xi greater than m. Their sum is the test statistic $W = \sum_{i: X_i > m} R_i$.
4. Large values of W suggest rejection of H0 in favor of HA: M > m; small values support HA: M < m; both support a two-sided alternative HA: M ≠ m.



The Wilcoxon signed-rank test

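Test of the median, H0: M = m.

Test statistic $W = \sum_{i: X_i > m} R_i$, where Ri is the rank of di = |Xi − m|.

Null distribution: Table of Critical values. For n ≥ 15, W is approximately Normal with mean and standard deviation

$$W \approx \mathrm{Normal}\left(\frac{n(n+1)}{4}, \sqrt{\frac{n(n+1)(2n+1)}{24}}\right)$$

Assumptions: the distribution of Xi is continuous and symmetric.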



Example 1: Supply and demand

You have to ensure that the printers don’t run out of paper.

During the first six days, the lab consumed: 7, 5.5, 9.5, 6, 3.5, and 9 cartons of paper. Does this imply significant evidence, at the 5% level of significance, that the median daily consumption of paper is more than 5 cartons?

Right-tail test H0: M = 5 vs HA: M > 5.



Example 1: Supply and demand

For n = 6 and α = 0.05, we'll reject H0 when the sum of positive ranks W ≥ 19.

Compute the distances di = |Xi − 5| and rank them from the smallest to the largest.

i    Xi     Xi − 5    di     Ri    sign
1    7       2        2      4     +
2    5.5     0.5      0.5    1     +
3    9.5     4.5      4.5    6     +
4    6       1        1      2     +
5    3.5    −1.5      1.5    3     −
6    9       4        4      5     +

Compute W by adding the "positive" ranks only:

W = 4 + 1 + 6 + 2 + 5 = 18 < 19.

No rejection: at the 5% level, the data do not provide significant evidence that the median consumption of paper exceeds 5 cartons.
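The ranking procedure in this example can be sketched in a few lines of Python. The function name is hypothetical. Two choices in it go beyond the slides and are assumptions: observations equal to the tested value are dropped, and tied distances receive the average of their ranks. Neither case occurs in this data set. The critical value 19 is taken from the example.

```python
def signed_rank_statistic(xs, m):
    """Sum of the ranks of |x - m| over the observations with x > m.

    Assumptions not stated on the slides: observations equal to m are
    dropped, and tied distances share the average of their ranks.
    """
    diffs = [x - m for x in xs if x != m]
    # Indices of the differences, ordered by increasing distance from m.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next distance ties with the current one.
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]])):
            end += 1
        avg_rank = (pos + end) / 2 + 1   # positions are 0-based, ranks 1-based
        for j in range(pos, end + 1):
            ranks[order[j]] = avg_rank
        pos = end + 1
    return sum(r for d, r in zip(diffs, ranks) if d > 0)


paper = [7, 5.5, 9.5, 6, 3.5, 9]   # cartons per day, from the example
w_obs = signed_rank_statistic(paper, 5)
print(f"W = {w_obs}, reject H0: {w_obs >= 19}")
```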



Table of Critical Values for the Wilcoxon Signed Rank Test



Example 2: Unauthorized use of a computer account

Test M = 0.2 vs M ≠ 0.2.

Compute the distances d1 = |X1 − m| = |0.24 − 0.2| = 0.04, ..., d18 = |0.27 − 0.2| = 0.07 and rank them.

Notice that the 9th, 12th, and 13th observations are below the tested value m = 0.2 while all the others are above. The test statistic (the sum of the positive signed ranks only):

$$W = \sum_{i: X_i > m} R_i = 162$$



Example 2: Unauthorized use of a computer account

Compute a P-value. This is a two-sided test, therefore P = 2 min(P{W ≤ 162}, P{W ≥ 162}) < 2 · 0.001 = 0.002 (use the Table with n = 18).

- Obs: for the sample size n = 18, we can also use the Normal approximation.

The test shows strong evidence that the account was used by an unauthorized person.
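The Normal approximation mentioned in the observation can be sketched as follows. The mean n(n+1)/4 and variance n(n+1)(2n+1)/24 are the standard null moments of W. The continuity correction of 0.5 is an assumption of this sketch, and the helper name `normal_cdf` is illustrative.

```python
import math


def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


n = 18
w_obs = 162                                           # statistic from the example

mean_w = n * (n + 1) / 4                              # E(W | H0) = 85.5
sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)      # Std(W | H0)

# W_obs lies above the null mean, so the upper tail is the smaller one.
z = (w_obs - 0.5 - mean_w) / sd_w                     # 0.5 is a continuity correction
p_two_sided = 2 * (1 - normal_cdf(z))

print(f"z = {z:.2f}, approximate two-sided P-value = {p_two_sided:.5f}")
```

The approximate P-value agrees with the bound P < 0.002 obtained from the table.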



4. Mann-Whitney-Wilcoxon rank sum test

The Wilcoxon signed-rank test can be extended to a two-sample problem: compare two populations, the population of X and the population of Y. In terms of their cumulative distribution functions, test

H0: FX(t) = FY(t), for all t

Alternative HA: either Y is stochastically larger than X, and FX(t) > FY(t), or it is stochastically smaller than X, and FX(t) < FY(t).



Mann-Whitney-Wilcoxon rank sum test

1. Combine all Xi and Yj into one sample.
2. Rank the observations in this combined sample. Ranks Ri are from 1 to (n + m). Some of these ranks correspond to X-variables, others to Y-variables.
3. The test statistic U is the sum of all X-ranks.

If U is small, X-variables have low ranks in the combined sample, so they are generally smaller than Y-variables. This implies that Y is stochastically larger than X: support the alternative HA: FY(t) < FX(t).



Mann-Whitney-Wilcoxon rank sum test

Variable Y is stochastically larger than variable X. It has a larger median, MY > MX, and a smaller cdf, FY(t) < FX(t).



Mann-Whitney-Wilcoxon rank sum test

Test of two populations, H0: FX = FY.

Test statistic $U = \sum_i R_i$, where Ri are the ranks of Xi in the combined sample of Xi and Yj.

Null distribution: Table of Critical values. For n, m ≥ 10, U is approximately Normal with mean and standard deviation

$$U \approx \mathrm{Normal}\left(\frac{n(n+m+1)}{2}, \sqrt{\frac{nm(n+m+1)}{12}}\right)$$

Assumptions: the distributions of Xi and Yj are continuous; FX(t) = FY(t) under H0; FX(t) < FY(t) for all t or FX(t) > FY(t) for all t under HA.



Example 1: On-line incentives

Managers of a shopping portal suspect that more customers participate in on-line shopping if they are offered some incentive, such as a discount or cash back. To verify this hypothesis, they chose 12 days at random, offered a 5% discount on 6 randomly selected days, but did not offer any incentives on the other 6 days.

The discounts were indicated on the links leading to this shopping portal.

With the discount, the portal received (rounded to 100s) 1200, 1700, 2600, 1500, 2400, and 2100 hits. Without the discount, 1400, 900, 1300, 1800, 700, and 1000 hits were registered. Does this support the managers’ hypothesis?



Example 1: On-line incentives

FX, FY: the cdf of the number of hits without the discount and with the discount, respectively.

Test H0: FX = FY vs HA: X is stochastically smaller than Y. The discount is suspected to boost the number of hits to the shopping portal, so Y should be stochastically larger than X.

To compute the test statistic, combine all the observations and order them (X-variables are marked with an asterisk):

700*, 900*, 1000*, 1200, 1300*, 1400*, 1500, 1700, 1800*, 2100, 2400, 2600

In the combined sample, the ranks of the X-variables are 1, 2, 3, 5, 6, and 9, and their sum is $U_{obs} = \sum_{i=1}^{6} R_i = 26$.

From the Table of critical values with n = m = 6: the one-sided left-tail P-value is p ∈ (0.01, 0.025]. Although it implies some evidence that discounts help increase the on-line shopping activity, this evidence is not overwhelming. We can conclude that the evidence supporting the managers' claim is significant at any α > 0.025.
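The rank-sum statistic of this example can be computed mechanically. The sketch below uses a hypothetical function name and assumes there are no tied observations, which is true for this data; the P-value itself still has to be read from the table of critical values.

```python
def rank_sum_statistic(xs, ys):
    """Sum of the ranks of the X-observations in the combined sample.

    Assumes all observations are distinct (no ties).
    """
    combined = sorted(xs + ys)
    rank_of = {value: rank for rank, value in enumerate(combined, start=1)}
    return sum(rank_of[x] for x in xs)


without_discount = [1400, 900, 1300, 1800, 700, 1000]   # X: hits without the discount
with_discount = [1200, 1700, 2600, 1500, 2400, 2100]    # Y: hits with the discount

u_obs = rank_sum_statistic(without_discount, with_discount)
print(f"U_obs = {u_obs}")
```

A small value of U relative to its null mean n(n+m+1)/2 = 39 points toward Y being stochastically larger than X.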



Table of Critical Values



Example 2: Pings

Round-trip transit times (pings) at two locations, arranged in the increasing order:

Location I: 0.0156, 0.0210, 0.0215, 0.0280, 0.0308, 0.0327, 0.0335, 0.0350, 0.0355, 0.0396, 0.0419, 0.0437, 0.0480, 0.0483, 0.0543 seconds

Location II: 0.0039, 0.0045, 0.0109, 0.0167, 0.0198, 0.0298, 0.0387, 0.0467, 0.0661, 0.0674, 0.0712, 0.0787 seconds

Is there evidence that the median ping depends on the location?

Test H0: FX = FY vs HA: FX ≠ FY, where X and Y are pings at the two locations.

Samples of sizes n = 15 and m = 12.




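X-pings have ranks 4, 7, 8, 9, 11, 12, 13, 14, 15, 17, 18, 19, 21, 22, and 23, and their sum is

$$U_{obs} = \sum_i R_i = 213$$

To use the Normal approximation, compute

$$E(U \mid H_0) = \frac{n(n+m+1)}{2} = 210, \qquad \mathrm{Var}(U \mid H_0) = \frac{nm(n+m+1)}{12} = 420$$

Compute the P-value for the two-sided test (with a continuity correction):

$$P = 2\min(P\{U \le 213\}, P\{U \ge 213\}) = 2\min(P\{U \le 213.5\}, P\{U \ge 212.5\})$$

$$= 2\min\left(P\left\{Z \le \frac{213.5 - 210}{\sqrt{420}}\right\}, P\left\{Z \ge \frac{212.5 - 210}{\sqrt{420}}\right\}\right)$$

$$= 2P\left\{Z \ge \frac{212.5 - 210}{\sqrt{420}}\right\} = 2(1 - \Phi(0.12)) = 2(0.4522) = 0.9044$$

There is no evidence that pings at the two locations have different distributions.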
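The whole ping computation, from raw data to the approximate P-value, can be sketched in Python. Variable names and the `normal_cdf` helper are illustrative, and the ranking assumes no ties. Because the code does not round z to 0.12 before looking up Φ, its P-value differs from 0.9044 in the third decimal.

```python
import math


def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


loc1 = [0.0156, 0.0210, 0.0215, 0.0280, 0.0308, 0.0327, 0.0335, 0.0350,
        0.0355, 0.0396, 0.0419, 0.0437, 0.0480, 0.0483, 0.0543]   # X, n = 15
loc2 = [0.0039, 0.0045, 0.0109, 0.0167, 0.0198, 0.0298, 0.0387, 0.0467,
        0.0661, 0.0674, 0.0712, 0.0787]                           # Y, m = 12
n, m = len(loc1), len(loc2)

# Rank-sum statistic: sum of the ranks of the X-pings in the combined sample.
combined = sorted(loc1 + loc2)
u_obs = sum(combined.index(x) + 1 for x in loc1)

# Null mean and variance of U for the Normal approximation.
mean_u = n * (n + m + 1) / 2
var_u = n * m * (n + m + 1) / 12
sd_u = math.sqrt(var_u)

# Two-sided P-value with a continuity correction of 0.5 on each tail.
p_lower = normal_cdf((u_obs + 0.5 - mean_u) / sd_u)       # P{U <= u_obs}
p_upper = 1 - normal_cdf((u_obs - 0.5 - mean_u) / sd_u)   # P{U >= u_obs}
p_two_sided = 2 * min(p_lower, p_upper)

print(f"U_obs = {u_obs}, E(U) = {mean_u}, Var(U) = {var_u}, P-value = {p_two_sided:.3f}")
```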


