(1)

Decision Tree Learning

Based on “Machine Learning”, T. Mitchell, McGraw-Hill, 1997, ch. 3

Acknowledgement:

The present slides are an adaptation of slides drawn by T. Mitchell


(2)

PLAN

DT Learning: Basic Issues

1. Concept learning: an example
2. Decision tree representation
3. ID3 learning algorithm (Ross Quinlan, 1986)
   Hypothesis space search by ID3
   Statistical measures in decision tree learning: Entropy, Information Gain
4. Inductive bias in ID3
5. Time complexity of the ID3 algorithm
6. Other “impurity” measures (apart from entropy): Gini, misclassification

(3)

PLAN (cont’d)

Useful extensions to the ID3 algorithm

1. Dealing with...
   continuous-valued attributes
   attributes with many values
   attributes with different costs
   training examples with missing attribute values
2. Avoiding overfitting of the data:
   reduced-error pruning, and rule post-pruning

Advanced Issues

Ensemble Learning using DTs: boosting, bagging, Random Forests


(4)

When to Consider Decision Trees

• Instances are described by attribute–value pairs

• Target function is discrete valued

• Disjunctive hypothesis may be required

• Possibly noisy training data

Examples:

• Equipment or medical diagnosis

• Credit risk analysis

• Modeling calendar scheduling preferences

(5)

1. Basic Issues in DT Learning

1.1 Concept learning: An example

Given the data:

Day Outlook Temperature Humidity Wind EnjoyTennis

D1 Sunny Hot High Weak No

D2 Sunny Hot High Strong No

D3 Overcast Hot High Weak Yes

D4 Rain Mild High Weak Yes

D5 Rain Cool Normal Weak Yes

D6 Rain Cool Normal Strong No

D7 Overcast Cool Normal Strong Yes

D8 Sunny Mild High Weak No

D9 Sunny Cool Normal Weak Yes

D10 Rain Mild Normal Weak Yes

D11 Sunny Mild Normal Strong Yes

D12 Overcast Mild High Strong Yes

D13 Overcast Hot Normal Weak Yes

D14 Rain Mild High Strong No

predict the value of EnjoyTennis for

⟨Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong⟩


(6)

1.2. Decision tree representation

• Each internal node tests an attribute

• Each branch corresponds to attribute value

• Each leaf node assigns a classification

Example:

Decision Tree for EnjoyTennis:

Outlook = Sunny:    Humidity = High:   No
                    Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:     Wind = Strong: No
                    Wind = Weak:   Yes

(7)

Another example:

A Tree to Predict C-Section Risk

Learned from medical records of 1000 women
Negative examples are C-sections

[833+,167-] .83+ .17-

Fetal_Presentation = 1: [822+,116-] .88+ .12-

| Previous_Csection = 0: [767+,81-] .90+ .10-

| | Primiparous = 0: [399+,13-] .97+ .03-

| | Primiparous = 1: [368+,68-] .84+ .16-

| | | Fetal_Distress = 0: [334+,47-] .88+ .12-

| | | | Birth_Weight < 3349: [201+,10.6-] .95+ .05-

| | | | Birth_Weight >= 3349: [133+,36.4-] .78+ .22-

| | | Fetal_Distress = 1: [34+,21-] .62+ .38-

| Previous_Csection = 1: [55+,35-] .61+ .39-
Fetal_Presentation = 2: [3+,29-] .11+ .89-
Fetal_Presentation = 3: [8+,22-] .27+ .73-


(8)

1.3. Top-Down Induction of Decision Trees:

ID3 algorithm outline [Ross Quinlan, 1979, 1986]

START

create the root node;

assign all examples to root;

Main loop:

1. A ← the “best” decision attribute for next node;

2. for each value of A, create a new descendant of node;

3. sort training examples to leaf nodes;

4. if training examples perfectly classified, then STOP;

else iterate over new leaf nodes

(9)

ID3 Algorithm: basic version

ID3(Examples, Target_attribute, Attributes)

create a Root node for the tree; assign all Examples to Root;
if all Examples are positive, return the single-node tree Root, with label = +;
if all Examples are negative, return the single-node tree Root, with label = −;
if Attributes is empty, return the single-node tree Root,
    with label = the most common value of Target_attribute in Examples;
otherwise // Main loop:
    A ← the attribute from Attributes that best classifies Examples;
    the decision attribute for Root ← A;
    for each possible value vi of A
        add a new tree branch below Root, corresponding to the test A = vi;
        let Examples_vi be the subset of Examples that have the value vi for A;
        if Examples_vi is empty
            below this new branch add a leaf node with
                label = the most common value of Target_attribute in Examples;
        else
            below this new branch add the subtree
                ID3(Examples_vi, Target_attribute, Attributes \ {A});
    return Root;

Note: the “best” attribute is the one with the highest information gain.
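To make the recursion concrete, here is a minimal Python sketch of the algorithm above (my own illustration, not from the slides); the function names and the dictionary-based representation are my choices.

```python
# Minimal ID3 sketch (illustration only). Each example is a dict, e.g.
# {"Outlook": "Sunny", ..., "EnjoyTennis": "No"}; the tree is a nested dict.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, target, attr):
    """Expected reduction in entropy obtained by splitting on `attr`."""
    gain = entropy([e[target] for e in examples])
    for v in {e[attr] for e in examples}:
        subset = [e[target] for e in examples if e[attr] == v]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

def id3(examples, target, attributes):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:            # all positive, or all negative
        return labels[0]
    if not attributes:                   # no attributes left to test
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(examples, target, a))
    # Only attribute values actually present in `examples` are branched on,
    # so the "Examples_vi is empty" case of the pseudocode never arises here.
    tree = {best: {}}
    for v in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == v]
        tree[best][v] = id3(subset, target, [a for a in attributes if a != best])
    return tree
```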


(10)

Hypothesis Space Search by ID3

• Hypothesis space is complete!
  The target function is surely in there...

• Outputs a single hypothesis. Which one?

• Inductive bias:
  approximately “prefer the shortest tree”

• Statistically-based search choices
  Robust to noisy data...

• No back-tracking
  Local minima...

[Figure: ID3’s greedy search through the space of decision trees, progressively refining partial trees by adding attribute tests A1, A2, A3, A4.]

(11)

Statistical measures in DT learning:

Entropy, and Information Gain

Information gain:

the expected reduction of the entropy of the instance set S due to sorting on the attribute A

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$


(12)

Entropy

• Let S be a sample of training examples
  p⊕ is the proportion of positive examples in S
  p⊖ is the proportion of negative examples in S

• Entropy measures the impurity of S

• Information theory:
  Entropy(S) = expected number of bits needed to encode ⊕ or ⊖ for a randomly drawn member of S (under the optimal, shortest-length code)

The optimal-length code for a message having probability p uses −log₂ p bits. So:

$$\mathrm{Entropy}(S) = p_\oplus(-\log_2 p_\oplus) + p_\ominus(-\log_2 p_\ominus) = -p_\oplus \log_2 p_\oplus - p_\ominus \log_2 p_\ominus$$

(13)

[Figure: Entropy(S) as a function of p⊕ for a boolean classification; the entropy is 0 when p⊕ ∈ {0, 1} and reaches its maximum of 1 bit at p⊕ = 0.5.]

$$\mathrm{Entropy}(S) = p_\oplus \log_2 \frac{1}{p_\oplus} + p_\ominus \log_2 \frac{1}{p_\ominus}$$

Note: By convention, 0 · log₂ 0 = 0.
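A quick numeric check of this formula (my own, not part of the slides), for the [9+,5−] EnjoyTennis sample used on the next slide:

```python
import math

def entropy(p_pos, p_neg):
    # 0 * log2(0) is taken to be 0, per the convention above
    return sum(p * math.log2(1 / p) for p in (p_pos, p_neg) if p > 0)

print(round(entropy(9/14, 5/14), 3))   # 0.94 -- the value E = 0.940 on the next slide
```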


(14)

Back to the EnjoyTennis example:

Selecting the root attribute

Which attribute is the best classifier?

S: [9+,5−], E = 0.940

Humidity:  High → [3+,4−] (E = 0.985),  Normal → [6+,1−] (E = 0.592)
Gain(S, Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151

Wind:  Weak → [6+,2−] (E = 0.811),  Strong → [3+,3−] (E = 1.00)
Gain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.00 = 0.048

Similarly,

Gain(S, Outlook) = 0.246
Gain(S, Temperature) = 0.029
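As a cross-check (my own code, not from the slides), the four gains can be recomputed directly from the EnjoyTennis table of slide (5):

```python
# Recomputing the information gains for the EnjoyTennis data.
import math
from collections import Counter

DATA = [  # (Outlook, Temperature, Humidity, Wind, EnjoyTennis)
    ("Sunny","Hot","High","Weak","No"),       ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),   ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),    ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),   ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
]
ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    g = entropy([r[-1] for r in rows])
    for v in {r[col] for r in rows}:
        sub = [r[-1] for r in rows if r[col] == v]
        g -= len(sub) / len(rows) * entropy(sub)
    return g

for i, a in enumerate(ATTRS):
    print(f"Gain(S, {a}) = {gain(DATA, i):.3f}")
# Output: Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048
# (the slides round Outlook and Humidity to 0.246 and 0.151).
```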

(15)

A partially learned tree

[Figure: after choosing Outlook as the root, {D1, ..., D14} [9+,5−] are split into
  Sunny → {D1,D2,D8,D9,D11} [2+,3−]  (subtree still to be decided: ?)
  Overcast → {D3,D7,D12,D13} [4+,0−]  → Yes
  Rain → {D4,D5,D6,D10,D14} [3+,2−]  (subtree still to be decided: ?)]

Which attribute should be tested here?

S_sunny = {D1, D2, D8, D9, D11}

Gain(S_sunny, Humidity)    = 0.970 − (3/5)·0.0 − (2/5)·0.0 = 0.970
Gain(S_sunny, Temperature) = 0.970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = 0.570
Gain(S_sunny, Wind)        = 0.970 − (2/5)·1.0 − (3/5)·0.918 = 0.019


(16)

Converting A Tree to Rules

[The EnjoyTennis tree from slide (6): Outlook at the root; Sunny → Humidity (High: No, Normal: Yes); Overcast → Yes; Rain → Wind (Strong: No, Weak: Yes).]

IF (Outlook = Sunny) ∧ (Humidity = High) THEN EnjoyTennis = No

IF (Outlook = Sunny) ∧ (Humidity = Normal) THEN EnjoyTennis = Yes

. . .

(17)

1.4 Inductive Bias in ID3

Is ID3 unbiased?

Not really...

• Preference for short trees, and for those with high information gain attributes near the root

• The ID3 bias is a preference for some hypotheses (i.e., a search bias);

there are learning algorithms (e.g., Candidate-Elimination, ch. 2) whose bias is a restriction of the hypothesis space H (i.e., a language bias).

• Occam’s razor: prefer the shortest hypothesis that fits the data


(18)

Occam’s Razor

Why prefer short hypotheses?

Argument in favor:

• Fewer short hypotheses than long hypotheses

→ a short hypothesis that fits data unlikely to be coincidence

→ a long hypothesis that fits data might be coincidence

Argument opposed:

• There are many ways to define small sets of hypotheses

(e.g., all trees with a prime number of nodes that use attributes beginning with “Z”)

• What’s so special about small sets based on the size of hypotheses?

(19)

1.5 Complexity of decision tree induction

from “Data Mining: Practical Machine Learning Tools and Techniques”,
Witten et al., 3rd ed., 2011, pp. 199–200

• Input: d attributes, and m training instances

• Simplifying assumptions:

(A1): the depth of the ID3 tree is O(log m)
(i.e., it remains “bushy” and doesn’t degenerate into long, stringy branches);

(A2): [most] instances differ from each other;

(A2’): the d attributes provide enough tests to allow the instances to be differentiated.

• Time complexity: O(d · m · log m).


(20)

Other “impurity” measures:

Gini impurity: $i(n) \overset{\text{not.}}{=} 1 - \sum_{i=1}^{k} P^2(c_i)$

Misclassification impurity: $i(n) \overset{\text{not.}}{=} 1 - \max_{i=1}^{k} P(c_i)$

Drop-of-impurity: $\Delta i(n) \overset{\text{not.}}{=} i(n) - P(n_l)\, i(n_l) - P(n_r)\, i(n_r)$,

where $n_l$ and $n_r$ are the left and right children of node $n$ after splitting.

For a Bernoulli variable of parameter p:

$\mathrm{Entropy}(p) = -p \log_2 p - (1-p) \log_2 (1-p)$

$\mathrm{Gini}(p) = 1 - p^2 - (1-p)^2 = 2p(1-p)$

$\mathrm{MisClassif}(p) = 1 - \max(p, 1-p) = \begin{cases} p, & \text{if } p \in [0, 1/2) \\ 1-p, & \text{if } p \in [1/2, 1] \end{cases}$
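A small sketch (mine, not from the slides) comparing the three impurity measures for a Bernoulli class variable with parameter p:

```python
import math

def entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini(p):
    return 2 * p * (1 - p)          # 1 - p^2 - (1-p)^2

def misclassification(p):
    return min(p, 1 - p)            # 1 - max(p, 1-p)

for p in (0.0, 0.1, 0.3, 0.5):
    print(f"p={p:.1f}  entropy={entropy(p):.3f}  gini={gini(p):.3f}  "
          f"misclassif={misclassification(p):.3f}")
# All three measures vanish for pure nodes (p = 0 or 1) and peak at p = 0.5.
```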

(21)

2. Extensions of the ID3 algorithm

2.1 Dealing with... continuous-valued attributes

Create one or more discrete attributes to test the continuous one.

For instance:

Temperature = 82.5

(Temperature > 72.3) = t, f

How to choose such (threshold) values:

Sort the examples according to the values of the continuous attribute, then identify examples that differ in their target classification.

For EnjoyTennis:

Temperature:  40   48   60   72   80   90
EnjoyTennis:  No   No   Yes  Yes  Yes  No

Candidate thresholds (midway between consecutive values where the classification changes):
Temperature > 54, Temperature > 85
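The threshold-selection rule can be sketched in a few lines of Python (my own illustration): sort by the continuous value, then place a candidate threshold midway between consecutive examples whose labels differ.

```python
temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]

pairs = sorted(zip(temps, labels))
thresholds = [
    (v1 + v2) / 2
    for (v1, _l1), (v2, _l2) in zip(pairs, pairs[1:])
    if _l1 != _l2
]
print(thresholds)   # [54.0, 85.0] -- the two candidates on this slide
```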


(22)

...Attributes with many values

Problem:

• If an attribute has many values, Gain will select it

• Imagine using Date = Jun_3_1996 as an attribute

One approach: use GainRatio instead

$$\mathrm{GainRatio}(S, A) \equiv \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInformation}(S, A)}$$

$$\mathrm{SplitInformation}(S, A) \equiv -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$

where $S_i$ is the subset of S for which A has the value $v_i$.
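A short sketch (mine) of SplitInformation and GainRatio; the Outlook figures reuse the gain computed earlier, while the Date remark shows how a many-valued attribute is penalized.

```python
import math

def split_information(split_sizes):
    n = sum(split_sizes)
    return -sum(s / n * math.log2(s / n) for s in split_sizes if s > 0)

def gain_ratio(gain, split_sizes):
    return gain / split_information(split_sizes)

# Outlook splits the 14 EnjoyTennis examples into subsets of sizes 5, 4, 5:
print(round(split_information([5, 4, 5]), 3))     # ~1.577
print(round(gain_ratio(0.246, [5, 4, 5]), 3))     # ~0.156
# A many-valued attribute like Date splits S into 14 singletons, so its
# SplitInformation is log2(14) ~ 3.81, sharply lowering its GainRatio.
```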

(23)

...Attributes with different costs

Consider

• medical diagnosis: BloodTest has cost $150

• robotics: Width_from_1ft has cost 23 sec.

Question:

How to learn a consistent tree with low expected cost?

One approach: replace Gain by

$$\frac{\mathrm{Gain}^2(S, A)}{\mathrm{Cost}(A)}$$   (Tan and Schlimmer, 1990)

$$\frac{2^{\mathrm{Gain}(S, A)} - 1}{(\mathrm{Cost}(A) + 1)^w}$$   (Nunez, 1988)

where w ∈ [0, 1] determines the importance of cost.
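A tiny illustration (my own; the gain value and the attribute costs are made up) of how the two criteria trade gain against measurement cost:

```python
def tan_schlimmer(gain, cost):
    return gain ** 2 / cost

def nunez(gain, cost, w=0.5):          # w in [0, 1] weights the importance of cost
    return (2 ** gain - 1) / (cost + 1) ** w

# Two hypothetical attributes with the same gain but very different costs:
for cost in (150, 1):                  # e.g., an expensive BloodTest vs. a cheap test
    print(cost, tan_schlimmer(0.25, cost), nunez(0.25, cost))
```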


(24)

...Training examples with unknown attribute values

Question:

What if an example is missing the value of an attribute A?

Answer:

Use the training example anyway; sort it through the tree, and if node n tests A,

• assign the most common value of A among the other examples sorted to node n, or

• assign the most common value of A among the other examples with same target value, or

• assign probability pi to each possible value vi of A;

assign the fraction pi of the example to each descendant in the tree.

Classify the test instances in the same fashion.
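A minimal sketch (mine, not from the slides) of the third strategy: an example missing attribute A is split into weighted fractions, one per value of A, in proportion to how often each value occurs among the other examples at node n.

```python
from collections import Counter

def fractional_split(examples_at_node, attr, weight=1.0):
    """(value, fractional weight) pairs for an example whose `attr` is missing."""
    counts = Counter(e[attr] for e in examples_at_node if e.get(attr) is not None)
    total = sum(counts.values())
    return [(v, weight * c / total) for v, c in counts.items()]

node_examples = [{"Wind": "Weak"}, {"Wind": "Weak"}, {"Wind": "Strong"}]
print(fractional_split(node_examples, "Wind"))
# [('Weak', 0.666...), ('Strong', 0.333...)] -- the fractions p_i sent down each branch
```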

(25)

2.2 Overfitting in Decision Trees

Consider adding noisy training example #15:

⟨Sunny, Hot, Normal, Strong, EnjoyTennis = No⟩

What effect does it produce on the earlier tree?

[The EnjoyTennis tree from slide (6): Outlook at the root; Sunny → Humidity (High: No, Normal: Yes); Overcast → Yes; Rain → Wind (Strong: No, Weak: Yes).]


(26)

Overfitting: Definition

Consider the error of hypothesis h over
• the training data: $\mathrm{error}_{\mathrm{train}}(h)$
• the entire distribution D of the data: $\mathrm{error}_{D}(h)$

Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h′ ∈ H such that

$$\mathrm{error}_{\mathrm{train}}(h) < \mathrm{error}_{\mathrm{train}}(h') \quad \text{and} \quad \mathrm{error}_{D}(h) > \mathrm{error}_{D}(h')$$

(27)

Overfitting in Decision Tree Learning

[Figure: accuracy (0.5–0.9) as a function of tree size (number of nodes, 0–100), on the training data and on the test data; training accuracy keeps increasing as the tree grows, while test accuracy peaks and then decreases.]


(28)

Avoiding Overfitting

How can we avoid overfitting?

• stop growing when the data split is no longer statistically significant
• grow the full tree, then post-prune

How to select the “best” tree:

• Measure performance over a separate validation data set
• Minimum Description Length (MDL) principle:
  minimize size(tree) + size(misclassifications(tree))

(29)

2.2.1 Reduced-Error Pruning

Split the data into a training set and a validation set.

Do until further pruning is harmful:

1. Evaluate impact on validation set of pruning each possible node (plus those below it)

2. Greedily remove the one that most improves validation set accuracy
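A rough sketch (my own, assuming trees are the nested dictionaries produced by the id3() sketch earlier and that `train` is the data the tree was grown from) of this greedy pruning loop:

```python
import copy
from collections import Counter

def classify(tree, example, default="No"):
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(example[attr], default)
    return tree

def accuracy(tree, data, target):
    return sum(classify(tree, e) == e[target] for e in data) / len(data)

def internal_paths(tree, path=()):
    """Yield, for every internal node, the branch decisions leading to it."""
    if isinstance(tree, dict):
        yield path
        attr = next(iter(tree))
        for value, child in tree[attr].items():
            yield from internal_paths(child, path + ((attr, value),))

def routed(train, path):
    """Training examples that reach the node identified by `path`."""
    return [e for e in train if all(e[a] == v for a, v in path)]

def pruned_copy(tree, path, leaf_label):
    """A copy of `tree` with the node at `path` replaced by a leaf."""
    if not path:
        return leaf_label
    new = copy.deepcopy(tree)
    node = new
    for attr, value in path[:-1]:
        node = node[attr][value]
    attr, value = path[-1]
    node[attr][value] = leaf_label
    return new

def reduced_error_prune(tree, train, valid, target):
    while True:
        best, best_acc = None, accuracy(tree, valid, target)
        for path in internal_paths(tree):
            reached = routed(train, path)
            majority = Counter(e[target] for e in reached).most_common(1)[0][0]
            candidate = pruned_copy(tree, path, majority)
            if accuracy(candidate, valid, target) > best_acc:   # strict improvement
                best, best_acc = candidate, accuracy(candidate, valid, target)
        if best is None:                 # further pruning would be harmful
            return tree
        tree = best                      # greedily apply the best prune, then repeat
```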


(30)

Effect of Reduced-Error Pruning

[Figure: accuracy vs. tree size (number of nodes) on the training data, on the test data, and on the test data during reduced-error pruning; as nodes are pruned away, accuracy on the test data increases.]

Note:

A validation set (distinct from both the training and test sets) was used for pruning. Accuracy over this validation set is not shown here.

(31)

2.2.2 Rule Post-Pruning

1. Convert the tree to an equivalent set of rules
2. Prune each rule independently of the others
3. Sort the final rules into the desired sequence (e.g., according to their estimated accuracy) for use

This is perhaps the most frequently used method (e.g., C4.5, Ross Quinlan, 1993).


(32)

3. Ensemble Learning: a very brief introduction

There exist several well-known meta-learning techniques that aggregate decision trees:

Boosting

[Freund et al., 1995; Shapire et al., 1996]:

When constructing a new tree, the data points that have been incorrectly predicted by earlier trees are given some extra weight, thus forcing the learner to concentrate successively on more and more difficult cases.

In the end, a weighted vote is taken for prediction.

Bagging

[Breiman, 1996]:

New trees do not depend on earlier trees; each tree is independently constructed using a bootstrap sample (i.e., sampling with replacement) of the data set.

The final classification is done via simple majority voting.

(33)

The AdaBoost Algorithm

[pseudo-code from Statistical Pattern Recognition, Andrew Webb, Keith Copsey, 2011]

Input:

{(xi, yi) | i = 1, . . . , n} — a set of labeled instances;

T ∈ N — the number of boosting rounds;

Training:

initialization: w_i = 1/n, for i = 1, ..., n

for t = 1, ..., T
  a. construct a classifier η_t (e.g., a decision tree) using the given training data, with the weights w_i, i = 1, ..., n;
  b. e_t = Σ_i w_i, where i indexes all instances misclassified by η_t;
  c. if e_t = 0 or e_t > 1/2, then terminate the procedure;
     else w_i ← w_i · (1/e_t − 1) for all instances that were misclassified by η_t, and then renormalize the weights w_i, i = 1, ..., n so that they sum to 1;

Prediction:

given a test instance x, and assuming that the classifiers η_t have two classes, −1 and +1, compute

$$\hat{\eta}(x) = \sum_{t=1}^{T} \log\left(\frac{1}{e_t} - 1\right) \eta_t(x);$$

assign x the label sign(η̂(x)).
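The loop above can be sketched compactly with scikit-learn decision stumps as the weak classifiers (my own code, assuming labels in {−1, +1}; not taken from the book cited above):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                      # w_i = 1/n
    classifiers, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y
        e = w[miss].sum()                        # weighted error e_t
        if e == 0 or e > 0.5:                    # termination condition from the slide
            break
        w[miss] *= (1.0 / e - 1.0)               # up-weight misclassified instances
        w /= w.sum()                             # renormalize
        classifiers.append(stump)
        alphas.append(np.log(1.0 / e - 1.0))     # weight used in the final vote
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    scores = sum(a * clf.predict(X) for clf, a in zip(classifiers, alphas))
    return np.sign(scores)
```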


(34)

The Bagging Algorithm (Bootstrap aggregating)

[pseudo-code from Statistical Pattern Recognition, Andrew Webb, Keith Copsey, 2011]

Input:

{(xi, yi) | i = 1, . . . , n} — a set of labeled instances;

B ∈ N — the number of samples/(sub)classifiers to be produced;

Training:

for b = 1, ..., B
  a. generate a bootstrap sample of size n by extracting with replacement from the training set;
     (Note: some instances will be replicated, others will be omitted.)
  b. construct a classifier η_b (e.g., a decision tree), using the bootstrap sample as training data;

Prediction:

given a test instance x, assign it the most common label in the set {ηb(x) | b = 1, . . . , B};
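A minimal sketch of this procedure (my own, again using scikit-learn trees as the base classifiers, and assuming X and y are NumPy arrays with integer class labels 0, 1, ...):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, B=25, random_state=0):
    rng = np.random.default_rng(random_state)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # bootstrap: n indices drawn with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    votes = np.array([t.predict(X) for t in trees])
    # simple majority vote, one column of votes per test instance
    return np.array([np.bincount(col).argmax() for col in votes.T.astype(int)])
```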

(35)

Random Forests (RF)

[Breiman, 2001]

RF extends bagging with an additional layer of randomness:

random feature selection:

While in standard classification trees each node is split using the best split among all variables, in RF each node is split using the best among a subset of features randomly chosen at that node.

RF uses only two parameters:

− the number of variables in the random subset at each node;

− the number of trees in the forest.

This somewhat counter-intuitive strategy is robust against overfitting, and it compares well to other machine learning techniques (SVMs, neural networks, discriminant analysis, etc.).


(36)

The Random Forests (RF) Algorithm

[pseudo-code from Statistical Pattern Recognition, Andrew Webb, Keith Copsey, 2011]

Input:

{(xi, yi) | i = 1, . . . , n} — a set of labeled instances;

B ∈ N — the number of samples to be produced / trees in the forest;

m — the number of features to be selected at each node

Training:

for b = 1, ..., B
  a. generate a bootstrap sample of size n by extracting with replacement from the training set;
     (Note: some instances will be replicated, others will be omitted.)
  b. construct a decision tree η_b using the bootstrap sample as training data, choosing at each node the “best” among m randomly selected attributes;

Computation of the out-of-bag (OOB) error:

a training instance x_i is misclassified by the RF if its label y_i differs from z_i, the most common label in the set {η_b(x_i) | b ∈ {1, ..., B} such that x_i ∉ the bootstrap sample used for the classifier η_b};

Prediction:

given a test instance x, assign it the most common label in the set {η_b(x) | b = 1, ..., B};
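In practice, the two RF parameters above map onto scikit-learn's RandomForestClassifier arguments; a usage sketch (the parameter values and the synthetic dataset are arbitrary choices of mine):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rf = RandomForestClassifier(
    n_estimators=100,   # B: the number of trees in the forest
    max_features=3,     # m: the size of the random feature subset tried at each node
    oob_score=True,     # estimate the out-of-bag error described above
    random_state=0,
).fit(X, y)
print(rf.oob_score_)    # out-of-bag accuracy (1 - OOB error)
```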

(37)

Some exercises


(38)

Exemplifying

The application of the ID3 algorithm to continuous attributes;

Decision surfaces; decision boundaries;

The computation of the CVLOO error

CMU, 2002 fall, Andrew Moore, midterm, pr. 3

(39)

training data: ten 1-D points (X = 1, 2, 3, 4, 6, 7, 8, 8.5, 9, 10), five labeled − and five labeled + ([5−,5+] overall)

discretization / decision thresholds: 5, 8.25, 8.75

[Figure: the compact representation of the ID3 tree, its decision “surfaces” on the X axis (the + regions lie between 5 and 8.25 and beyond 8.75), and the ID3 tree itself: X < 5 → leaf [4−,0+] (class −); otherwise ([1−,5+]) test X < 8.25 → leaf [0−,3+] (class +); otherwise ([1−,2+]) test X < 8.75 → leaf [1−,0+] (class −), else leaf [0−,2+] (class +). Branch labels “Da”/“Nu” = Yes/No.]


(40)

ID3: IG computations

[Figure: Level 0 (the whole set, [5−,5+]): the candidate tests X < 5 ([4−,0+] vs. [1−,5+]), X < 8.25 ([4−,3+] vs. [1−,2+]) and X < 8.75 ([5−,3+] vs. [0−,2+]) are compared, and X < 5 is selected for the root. Level 1 (the [1−,5+] branch): X < 8.75 splits it into [1−,3+] / [0−,2+] with IG = 0.109, while X < 8.25 splits it into [0−,3+] / [1−,2+] with IG = 0.191, so X < 8.25 is selected. The resulting decision “surfaces” are the intervals delimited by the thresholds 5, 8.25 and 8.75.]

(41)

ID3, CVLOO: Decision surfaces

CVLOO error: 3/10

[Figure: the decision “surfaces” obtained when ID3 is re-run with each training point left out in turn (the cases X = 1, 2, 3; X = 4; X = 6; X = 7; X = 8; X = 8.5; X = 9; X = 10), with the split thresholds shifting accordingly (e.g., to 4.5, 7.75 or 9.25 in some of the cases). Three of the ten left-out points end up misclassified by the tree grown on the remaining nine, hence the leave-one-out cross-validation (CVLOO) error of 3/10.]


(42)

DT2

[Figure: a simpler tree, DT2, that keeps only the root test X < 5: the “Da” (Yes) branch, [4−,0+], predicts class 0, and the “Nu” (No) branch, [1−,5+], predicts class 1; its single decision “surface” is the threshold X = 5, with the + region to its right.]

(43)

DT2, CVLOO IG computations

[Figure: the leave-one-out re-computations for DT2. Case 1 — X = 1, 2, 3 or 4 is left out: the remaining data are [4−,5+], and X < 5 (shifted to threshold 4.5 when X = 4 is removed) splits them into [3−,0+] / [1−,5+]; it is again compared against X < 8.25 and X < 8.75. Case 2 — X = 6, 7 or 8 is left out: the remaining data are [5−,4+], split by X < 5 into [4−,0+] / [1−,4+] (alternative thresholds such as 5.5 and 7.75 appear when X = 6 or X = 8 is removed).]


(44)

DT2, CVLOO IG computations (cont’d)

[Figure: Case 3 — X = 8.5 is left out: the remaining data are [4−,5+], and X < 5 splits them perfectly into [4−,0+] / [0−,5+]. Case 4 — X = 9 or 10 is left out: the remaining data are [5−,4+], split by X < 5 into [4−,0+] / [1−,4+]; the candidate thresholds X < 8.25 and X < 8.75 (shifted to 9.25 when X = 9 is removed) are compared again.]

CVLOO error: 1/10

(45)

Exemplifying

χ²-Based Pruning of Decision Trees

CMU, 2010 fall, Ziv Bar-Joseph, HW2, pr. 2.1

Input:

X1 X2 X3 X4 Class

1 1 0 0 0

1 0 1 0 1

0 1 0 0 0

1 0 1 1 1

0 1 1 1 1

0 0 1 0 0

1 0 0 0 1

0 1 0 1 1

1 0 0 1 1

1 1 0 1 1

1 1 1 1 1

0 0 0 0 0

The ID3 tree for these data:

[4−,8+]
X4 = 1: 1                  [0−,6+]
X4 = 0: X1                 [4−,2+]
    X1 = 0: 0              [3−,0+]
    X1 = 1: X2             [1−,2+]
        X2 = 0: 1          [0−,2+]
        X2 = 1: 0          [1−,0+]


(46)

Idea

While traversing the ID3 tree [usually in a bottom-up manner], remove those nodes for which the set of instances assigned to the node does not provide enough (“significant”) statistical evidence of a dependence between the values of the input attribute tested at that node and the values of the output attribute (the labels).

(47)

Contingency tables

O_X4 (N = 12):           X4 = 0   X4 = 1
  Class = 0                  4        0
  Class = 1                  2        6

P(Class = 0) = 4/12 = 1/3,  P(Class = 1) = 2/3
P(X4 = 0) = 6/12 = 1/2,     P(X4 = 1) = 1/2

O_X1 | X4 = 0 (N = 6):   X1 = 0   X1 = 1
  Class = 0                  3        1
  Class = 1                  0        2

P(Class = 0 | X4 = 0) = 4/6 = 2/3,  P(Class = 1 | X4 = 0) = 1/3
P(X1 = 0 | X4 = 0) = 3/6 = 1/2,     P(X1 = 1 | X4 = 0) = 1/2

O_X2 | X4 = 0, X1 = 1 (N = 3):   X2 = 0   X2 = 1
  Class = 0                          0        1
  Class = 1                          2        0

P(Class = 0 | X4 = 0, X1 = 1) = 1/3,  P(Class = 1 | X4 = 0, X1 = 1) = 2/3
P(X2 = 0 | X4 = 0, X1 = 1) = 2/3,     P(X2 = 1 | X4 = 0, X1 = 1) = 1/3


(48)

The rationale behind the computation of the expected number of observations

Under the independence hypothesis:

$$P(C = i, A = j) = P(C = i) \cdot P(A = j)$$

$$P(C = i) = \frac{\sum_{k=1}^{c} O_{i,k}}{N} \quad \text{and} \quad P(A = j) = \frac{\sum_{k=1}^{r} O_{k,j}}{N}$$

$$P(C = i, A = j) \overset{\text{indep.}}{=} \frac{\left(\sum_{k=1}^{c} O_{i,k}\right)\left(\sum_{k=1}^{r} O_{k,j}\right)}{N^2}$$

$$E[C = i, A = j] = N \cdot P(C = i, A = j)$$

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$$


(50)

Expected number of observations

E_X4:                     X4 = 0   X4 = 1
  Class = 0                   2        2
  Class = 1                   4        4

E_X1 | X4 = 0:            X1 = 0   X1 = 1
  Class = 0                   2        2
  Class = 1                   1        1

E_X2 | X4 = 0, X1 = 1:    X2 = 0   X2 = 1
  Class = 0                  2/3      1/3
  Class = 1                  4/3      2/3

Example: E_X4(Class = 0, X4 = 0): N = 12, P(Class = 0) = 1/3 and P(X4 = 0) = 1/2, so

N · P(Class = 0, X4 = 0) = N · P(Class = 0) · P(X4 = 0) = 12 · (1/3) · (1/2) = 2

(51)

χ² Statistics

$$\chi^2_{X_4} = \frac{(4-2)^2}{2} + \frac{(0-2)^2}{2} + \frac{(2-4)^2}{4} + \frac{(6-4)^2}{4} = 2 + 2 + 1 + 1 = 6$$

$$\chi^2_{X_1 \mid X_4 = 0} = \frac{(3-2)^2}{2} + \frac{(1-2)^2}{2} + \frac{(0-1)^2}{1} + \frac{(2-1)^2}{1} = 3$$

$$\chi^2_{X_2 \mid X_4 = 0, X_1 = 1} = \frac{\left(0-\frac{2}{3}\right)^2}{\frac{2}{3}} + \frac{\left(1-\frac{1}{3}\right)^2}{\frac{1}{3}} + \frac{\left(2-\frac{4}{3}\right)^2}{\frac{4}{3}} + \frac{\left(0-\frac{2}{3}\right)^2}{\frac{2}{3}} = \frac{4}{9} \cdot \frac{27}{4} = 3$$

p-values (1 degree of freedom): 0.0143, 0.0833, and 0.0833 respectively.

The first of these p-values is smaller than the significance threshold (α = 0.05 for the 95% confidence level), therefore the root node (X4) cannot be pruned; the other two p-values exceed 0.05, so the nodes testing X1 and X2 are pruned.
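A quick check of these statistics and p-values (my own, using SciPy; each 2×2 table has (2−1)·(2−1) = 1 degree of freedom):

```python
from scipy.stats import chi2

def chi2_stat(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

tables = {
    "X4":              ([4, 0, 2, 6], [2, 2, 4, 4]),
    "X1 | X4=0":       ([3, 1, 0, 2], [2, 2, 1, 1]),
    "X2 | X4=0, X1=1": ([0, 1, 2, 0], [2/3, 1/3, 4/3, 2/3]),
}
for name, (obs, exp) in tables.items():
    stat = chi2_stat(obs, exp)
    print(f"{name}: chi2 = {stat:.3f}, p-value = {chi2.sf(stat, df=1):.4f}")
# X4: chi2 = 6.000, p = 0.0143;  X1 and X2 nodes: chi2 = 3.000, p = 0.0833
```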


(52)

Output (pruned tree) for the 95% confidence level:

X4 = 0: 0
X4 = 1: 1
