(1)

Decision Trees

Contains:

ID3 algo.:

ex. 34, 36, 2, 4, 12, 10, 47, 15, 19, 20,

AdaBoost algo.:

ex. 22 and 23, 24, 25, 26, 27, 28, 70, 29

from the 2022f version of the ML exercise book by L. Ciortuz et al.

(2)

Exemplifying

how to compute information gains and how to work with decision stumps

CMU, 2013 fall, W. Cohen, E. Xing, Sample questions, pr. 4

(3)

Timmy wants to know how to do well in the ML exam. He collects old exam statistics and decides to use a decision tree to build his model. He has 9 data points and two features: “whether he stayed up late before the exam” (S) and “whether he attended all the classes” (A). The statistics are as follows:

Set(all) = [5+,4−]
Set(S+) = [3+,2−], Set(S−) = [2+,2−]
Set(A+) = [5+,1−], Set(A−) = [0+,3−]

Suppose we are going to use the feature that gains the most information for the first split. Which feature should we choose? How much is the information gain?

You may use the following approximations:

N:      3     5     7
log2 N: 1.58  2.32  2.81

(4)

[Figure: two decision stumps on [5+,4−]. Splitting on S gives the children [3+,2−] (S = +) and [2+,2−] (S = −, H = 1); splitting on A gives [5+,1−] (A = +) and [0+,3−] (A = −, H = 0).]

H(all) = H[5+,4−] = H(5/9)
= (5/9)·log2(9/5) + (4/9)·log2(9/4)
= log2 9 − (5/9)·log2 5 − (4/9)·log2 4
= 2·log2 3 − (5/9)·log2 5 − 8/9 = 0.991076

H(all|S) = (5/9)·H[3+,2−] + (4/9)·H[2+,2−] = (5/9)·0.970951 + (4/9)·1 = 0.983861

H(all|A) = (6/9)·H[5+,1−] + (3/9)·H[0+,3−] = (6/9)·0.650022 + (3/9)·0 = 0.433348

IG(all, S) = H(all) − H(all|S) = 0.007215
IG(all, A) = H(all) − H(all|A) = 0.557728

IG(all, S) < IG(all, A) ⇔ H(all|S) > H(all|A), so A is the feature chosen for the first split.
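As a quick numerical cross-check (not part of the original exercise book), the computation above can be reproduced with a short Python sketch; the helper names entropy and mean_cond_entropy are ours.

from math import log2

def entropy(pos, neg):
    """Entropy H[pos+, neg-] of a binary class distribution, in bits."""
    total = pos + neg
    h = 0.0
    for k in (pos, neg):
        if k > 0:
            p = k / total
            h -= p * log2(p)
    return h

def mean_cond_entropy(children):
    """Weighted average of the entropies of the child partitions of a stump."""
    n = sum(p + q for p, q in children)
    return sum((p + q) / n * entropy(p, q) for p, q in children)

H_all = entropy(5, 4)                         # ~0.9911
H_S = mean_cond_entropy([(3, 2), (2, 2)])     # ~0.9839
H_A = mean_cond_entropy([(5, 1), (0, 3)])     # ~0.4333
print(H_all - H_S, H_all - H_A)               # IG(all, S) ~0.0072, IG(all, A) ~0.5577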

(5)

Decision stumps;

entropy, mean conditional entropy, and information gain:

some very convenient formulas to be used when working with pocket calculators

Sebastian Ciobanu, Liviu Ciortuz, 2017

(6)

Consider the decision stump given in the nearby image.

The symbols a, b, c, d, e and f represent counts computed from a training dataset (not provided). As you can see, the label (or output variable), here denoted by Y, is binary, and so is the attribute (or input variable) A. Obviously, a = c + e and b = d + f.

[Figure: a decision stump that tests the attribute A on the partition [a+,b−], with children [c+,d−] (A = 0) and [e+,f−] (A = 1).]

a. Prove that the entropy of [the output variable] corresponding to the partition associated to the test node in this decision stump is

H[a+, b−] = (1/(a+b)) · log2( (a+b)^(a+b) / (a^a · b^b) ),  if a ≠ 0 and b ≠ 0.

b. Derive a similar formula for the entropy, for the case when the output variable has three values, and the partition associated to the test node in the decision stump would be [a+, b−, c∗].

(Note that there is no link between the last c and the c count in the above decision stump.)

(7)

c. Assume that for the above given decision stump all the counts c, d, e and f are different from 0. Prove that the mean conditional entropy corresponding to this decision stump is

H(node|attribute) = (1/(a+b)) · log2( (c+d)^(c+d) / (c^c · d^d) · (e+f)^(e+f) / (e^e · f^f) ).

d. Now suppose that one of the counts c, d, e and f is 0; for example, let us consider c = 0. Infer the formula for the mean conditional entropy in this case.

e. Prove the following formula for the information gain corresponding to the above given decision stump, assuming that a, b, c, d, e and f are all strictly positive:

IG(node; attribute) = (1/(a+b)) · log2( (a+b)^(a+b) / (a^a · b^b) · (c^c · d^d) / (c+d)^(c+d) · (e^e · f^f) / (e+f)^(e+f) ).

(8)

WARNING!

A serious problem when using the above formulas on a pocket calculator is that the internal representation capacity for intermediate results can overflow.

For example, a Sharp EL-531VH pocket calculator can represent the number 56^56 but not 57^57. Similarly, the calculator provided by the Linux Mint operating system [see the Accessories menu] can represent 179^179 but not 180^180.

In such overflow cases, you should use the basic / general formulas for the entropies and the information gain, because they make better use of the log function.
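A simple way to sidestep the overflow (a sketch of ours, not a prescription from the exercise book) is never to form x^x explicitly, but to work with x·log2(x) instead, which is exactly what the basic entropy formula does:

from math import log2

def H_closed_form(a, b):
    """(1/(a+b)) * log2((a+b)^(a+b) / (a^a * b^b)), evaluated entirely in log space."""
    n = a + b
    log_quotient = n * log2(n) - a * log2(a) - b * log2(b)   # no n**n, a**a or b**b is ever formed
    return log_quotient / n

def H_standard(a, b):
    n = a + b
    return -(a / n) * log2(a / n) - (b / n) * log2(b / n)

# Counts far too large for the x^x form on a pocket calculator:
print(H_closed_form(1000, 2500))   # ~0.863
print(H_standard(1000, 2500))      # same value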

(9)

Answer

a.

H[a+, b−] = −(a/(a+b))·log2(a/(a+b)) − (b/(a+b))·log2(b/(a+b))
= −(1/(a+b)) · [ a·log2(a/(a+b)) + b·log2(b/(a+b)) ]
= −(1/(a+b)) · [ log2 (a/(a+b))^a + log2 (b/(a+b))^b ]
= −(1/(a+b)) · [ log2( a^a / (a+b)^a ) + log2( b^b / (a+b)^b ) ]
= −(1/(a+b)) · log2( a^a · b^b / (a+b)^(a+b) )
= (1/(a+b)) · log2( (a+b)^(a+b) / (a^a · b^b) ).

b.

H[a+, b−, c∗] = −(a/(a+b+c))·log2(a/(a+b+c)) − (b/(a+b+c))·log2(b/(a+b+c)) − (c/(a+b+c))·log2(c/(a+b+c))
= −(1/(a+b+c)) · [ log2( a^a / (a+b+c)^a ) + log2( b^b / (a+b+c)^b ) + log2( c^c / (a+b+c)^c ) ]
= −(1/(a+b+c)) · log2( a^a · b^b · c^c / (a+b+c)^(a+b+c) )
= (1/(a+b+c)) · log2( (a+b+c)^(a+b+c) / (a^a · b^b · c^c) ).

(10)

c.

H(node|attribute) = ((c+d)/(a+b))·H[c+, d−] + ((e+f)/(a+b))·H[e+, f−]
= ((c+d)/(a+b)) · (1/(c+d)) · log2( (c+d)^(c+d) / (c^c · d^d) ) + ((e+f)/(a+b)) · (1/(e+f)) · log2( (e+f)^(e+f) / (e^e · f^f) )
= (1/(a+b)) · log2( (c+d)^(c+d) / (c^c · d^d) · (e+f)^(e+f) / (e^e · f^f) ).

d. With c = 0, the [c+, d−] child has zero entropy, so

H(node|attribute) = ((e+f)/(a+b))·H[e+, f−]
= ((e+f)/(a+b)) · [ (e/(e+f))·log2((e+f)/e) + (f/(e+f))·log2((e+f)/f) ]
= (1/(a+b)) · [ e·log2((e+f)/e) + f·log2((e+f)/f) ]
= (1/(a+b)) · log2( (e+f)^(e+f) / (e^e · f^f) ).

e.

IG(node; attribute) = (1/(a+b)) · log2( (a+b)^(a+b) / (a^a · b^b) ) − (1/(a+b)) · log2( (c+d)^(c+d) / (c^c · d^d) · (e+f)^(e+f) / (e^e · f^f) )
= (1/(a+b)) · log2( (a+b)^(a+b) / (a^a · b^b) · (c^c · d^d) / (c+d)^(c+d) · (e^e · f^f) / (e+f)^(e+f) ).

(11)

Important REMARKS

1. Since most pocket calculators do not have the log2 function but only ln and lg, in the formulas presented or derived at points a–e it is preferable to change the base of the logarithm. Besides replacing log2 with ln or lg, this amounts to multiplying the right-hand side by 1/ln 2 or 1/lg 2, respectively.

2. Since, when applying the ID3 algorithm, computing the mean conditional entropies is sufficient for choosing the best attribute to put in the current node, it is enough to compare the products of the form

(c+d)^(c+d) / (c^c · d^d) · (e+f)^(e+f) / (e^e · f^f)     (1)

for the decision stumps considered at that node, and to choose the minimum among these products.
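In the same spirit, remark 2 can be turned into a tiny attribute-selection routine; the sketch below (ours) compares the products of form (1) in log space, so it also stays safe from overflow.

from math import log2

def log_product(children):
    """log2 of the product (1) over the children [p+, q-] of a decision stump;
    0 * log2(0) is taken to be 0."""
    def xlog(x):
        return x * log2(x) if x > 0 else 0.0
    return sum(xlog(p + q) - xlog(p) - xlog(q) for p, q in children)

# The two decision stumps of the first exercise: S and A.
stumps = {"S": [(3, 2), (2, 2)], "A": [(5, 1), (0, 3)]}
best = min(stumps, key=lambda name: log_product(stumps[name]))
print(best)   # 'A' -- the smallest product corresponds to the smallest mean conditional entropy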

(12)

Exemplifying the application of the ID3 algorithm on a toy mushrooms dataset

CMU, 2002(?) spring, Andrew Moore, midterm example questions, pr. 2

(13)

You are stranded on a deserted island. Mushrooms of various types grow widely all over the island, but no other food is anywhere to be found. Some of the mushrooms have been determined as poisonous and others as not (determined by your former companions’ trial and error). You are the only one remaining on the island. You have the following data to consider:

Example NotHeavy Smelly Spotted Smooth Edible

A 1 0 0 0 1

B 1 0 1 0 1

C 0 1 0 1 1

D 0 0 0 1 0

E 1 1 1 0 0

F 1 0 1 1 0

G 1 0 0 1 0

H 0 1 0 0 0

U 0 1 1 1 ?

V 1 1 0 1 ?

W 1 1 0 0 ?

You know whether or not mushrooms A through H are poisonous, but you do not know about U through W.

(14)

For questions a–d, consider only mushrooms A through H.

a. What is the entropy of Edible?

b. Which attribute should you choose as the root of a decision tree?

Hint: You can figure this out by looking at the data, without explicitly computing the information gain of all four attributes.

c. What is the information gain of the attribute you chose in the previous question?

d. Build an ID3 decision tree to classify mushrooms as poisonous or not.

e. Classify mushrooms U, V and W using the decision tree as poisonous or not poisonous.

f. If the mushrooms A through H that you know are not poisonous suddenly became scarce, should you consider trying U, V and W? Which one(s), and why? Or, if none of them, why not?

(15)

a.

H(Edible) = H[3+, 5−] = −(3/8)·log2(3/8) − (5/8)·log2(5/8)
= (3/8)·log2(8/3) + (5/8)·log2(8/5)
= (3/8)·3 − (3/8)·log2 3 + (5/8)·3 − (5/8)·log2 5
= 3 − (3/8)·log2 3 − (5/8)·log2 5
≈ 0.9544

(16)

b.

[Figure: the four candidate decision stumps on [3+,5−]:
NotHeavy: [1+,2−] (value 0) and [2+,3−] (value 1);
Smelly: [2+,3−] (0) and [1+,2−] (1);
Spotted: [2+,3−] (0) and [1+,2−] (1);
Smooth: [2+,2−] (0, Node 1) and [1+,3−] (1, Node 2).]

Note that NotHeavy, Smelly and Spotted produce the same pair of sub-partitions (up to swapping the branches), so they have equal information gains; only Smooth has to be compared against them. As the computations in parts c and d show, Smooth has the larger information gain, so it is chosen as the root.

(17)

c.

H0/Smooth = (4/8)·H[2+,2−] + (4/8)·H[1+,3−]
= (1/2)·1 + (1/2)·[ (1/4)·log2 4 + (3/4)·log2(4/3) ]
= 1/2 + (1/2)·[ (1/4)·2 + (3/4)·2 − (3/4)·log2 3 ]
= 1/2 + (1/2)·[ 2 − (3/4)·log2 3 ]
= 3/2 − (3/8)·log2 3 ≈ 0.9056

IG0/Smooth = H(Edible) − H0/Smooth = 0.9544 − 0.9056 = 0.0488

(18)

d.

H0/NotHeavy = (3/8)·H[1+,2−] + (5/8)·H[2+,3−]
= (3/8)·[ (1/3)·log2 3 + (2/3)·log2(3/2) ] + (5/8)·[ (2/5)·log2(5/2) + (3/5)·log2(5/3) ]
= (3/8)·[ (1/3)·log2 3 + (2/3)·log2 3 − 2/3 ] + (5/8)·[ (2/5)·log2 5 − 2/5 + (3/5)·log2 5 − (3/5)·log2 3 ]
= (3/8)·[ log2 3 − 2/3 ] + (5/8)·[ log2 5 − (3/5)·log2 3 − 2/5 ]
= (3/8)·log2 3 − 2/8 + (5/8)·log2 5 − (3/8)·log2 3 − 2/8
= (5/8)·log2 5 − 4/8 ≈ 0.9512

⇒ IG0/NotHeavy = H(Edible) − H0/NotHeavy = 0.9544 − 0.9512 = 0.0032

IG0/NotHeavy = IG0/Smelly = IG0/Spotted = 0.0032 < IG0/Smooth = 0.0488

(19)

Important Remark

Instead of actually computing these information gains, in order to determine the “best” attribute it would have been enough to compare the values of the mean conditional entropies H0/Smooth and H0/NotHeavy:

IG0/Smooth > IG0/NotHeavy ⇔ H0/Smooth < H0/NotHeavy
⇔ 3/2 − (3/8)·log2 3 < (5/8)·log2 5 − 1/2
⇔ 12 − 3·log2 3 < 5·log2 5 − 4
⇔ 16 < 5·log2 5 + 3·log2 3 ⇔ 16 < 11.6096 + 4.7548 (true)

Alternatively, taking into account the formulas from the problem UAIC, 2017 fall, S. Ciobanu, L. Ciortuz, we can proceed with even less computation (not only here, but whenever the number of instances is not too large):

H0/Smooth < H0/NotHeavy
⇔ 4^4/(2^2 · 2^2) · 4^4/3^3 < 5^5/(2^2 · 3^3) · 3^3/2^2
⇔ 4^8/3^3 < 5^5 ⇔ 4^8 < 3^3 · 5^5 ⇔ 2^16 < 3^3 · 5^5
⇔ 64 · 2^10 (= 65536) < 27 · 25 · 125 (= 84375)   (true)

(20)

Node 1: Smooth = 0

[Figure: the candidate decision stumps on [2+,2−]:
Smelly: [2+,0−] (value 0) and [0+,2−] (value 1);
NotHeavy: [0+,1−] (0) and [2+,1−] (1);
Spotted: [1+,1−] (0) and [1+,1−] (1).
Smelly splits the node perfectly, so it is placed in Node 1.]

(21)

Node 2: Smooth = 1

[Figure: the candidate decision stumps on [1+,3−]:
NotHeavy: [1+,1−] (value 0) and [0+,2−] (value 1);
Smelly: [0+,3−] (0) and [1+,0−] (1);
Spotted: [1+,2−] (0) and [0+,1−] (1).
Again Smelly yields pure children, so it is placed in Node 2.]

(22)

The resulting ID3 Tree

[Figure: root Smooth on [3+,5−]; Smooth = 0 → [2+,2−], tested with Smelly (Smelly = 0 → [2+,0−], Smelly = 1 → [0+,2−]); Smooth = 1 → [1+,3−], tested with Smelly (Smelly = 0 → [0+,3−], Smelly = 1 → [1+,0−]).]

IF (Smooth = 0 AND Smelly = 0) OR (Smooth = 1 AND Smelly = 1) THEN Edible;
ELSE ¬Edible;

Classification of test instances:
U: Smooth = 1, Smelly = 1 ⇒ Edible = 1
V: Smooth = 1, Smelly = 1 ⇒ Edible = 1
W: Smooth = 0, Smelly = 1 ⇒ Edible = 0
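The learned tree amounts to a two-attribute rule, so the classification of the test instances can be checked with a few lines of Python (an illustration of ours, not part of the original solution):

def edible(smooth, smelly):
    # The rule extracted from the ID3 tree above.
    return (smooth == 0 and smelly == 0) or (smooth == 1 and smelly == 1)

# (Smooth, Smelly) values of the test mushrooms, read off the data table:
for name, smooth, smelly in [("U", 1, 1), ("V", 1, 1), ("W", 0, 1)]:
    print(name, int(edible(smooth, smelly)))   # U 1, V 1, W 0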

(23)

Exemplifying the greedy character of the ID3 algorithm

CMU, 2003 fall, T. Mitchell, A. Moore, midterm, pr. 9.a

(24)

Consider the binary input attributes A, B and C, the output attribute Y, and the following training examples:

A B C Y
1 1 0 0
1 0 1 1
0 1 1 1
0 0 1 0

a. Determine the decision tree computed by the ID3 algorithm. Is this decision tree consistent with the training data?

(25)

Answer

Node 0 (the root):

[Figure: the three candidate decision stumps on [2+,2−]:
A: [1+,1−] (value 0) and [1+,1−] (value 1);
B: [1+,1−] (0) and [1+,1−] (1);
C: [0+,1−] (0) and [2+,1−] (1, Node 1).]

One immediately sees that the first two decision stumps have IG = 0, while the third decision stump has IG > 0. Therefore, attribute C is placed in node 0 (the root).

(26)

Node 1: We have to classify the instances with C = 1, so the choice is between attributes A and B.

[Figure: the two candidate decision stumps on [2+,1−]:
A: [1+,1−] (value 0, Node 2) and [1+,0−] (value 1);
B: [1+,1−] (0) and [1+,0−] (1).]

The two mean conditional entropies are equal:

H1/A = H1/B = (2/3)·H[1+,1−] + (1/3)·H[1+,0−]

Therefore we may choose either of the two attributes. To fix ideas, we choose A.

(27)

Node 2: At this node the only remaining attribute is B, so we place it here.

The complete ID3 tree is shown in the nearby figure.

By construction, the ID3 algorithm is consistent with the training data whenever the data itself is consistent (i.e., non-contradictory). In our case, one immediately checks that the training data is consistent.

[Figure: the complete ID3 tree: root C on [2+,2−]; C = 0 → leaf 0 ([0+,1−]); C = 1 → A on [2+,1−]; A = 1 → leaf 1 ([1+,0−]); A = 0 → B on [1+,1−]; B = 0 → leaf 0 ([0+,1−]), B = 1 → leaf 1 ([1+,0−]).]

(28)

b. Is there a decision tree of smaller depth (than that of the ID3 tree) consistent with the above data? If so, what (logical) concept does this tree represent?

Answer:

From the data one sees that the output attribute Y is in fact the logical function A xor B.

Representing this function as a decision tree, we obtain the nearby tree.

This tree has one level fewer than the tree built by the ID3 algorithm. Therefore, the ID3 tree is not optimal with respect to the number of levels.

[Figure: the depth-two tree for A xor B: root A on [2+,2−]; each branch tests B; the leaves are [0+,1−], [1+,0−], [1+,0−] and [0+,1−], i.e., Y = 1 exactly when A ≠ B.]
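The claim that Y coincides with A xor B can be verified mechanically on the four training examples; the check below is our own illustration.

# (A, B, C, Y) training examples from the table above.
data = [(1, 1, 0, 0), (1, 0, 1, 1), (0, 1, 1, 1), (0, 0, 1, 0)]
assert all(y == (a ^ b) for a, b, _, y in data)
print("Y equals A xor B on every training example")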

(29)

This is a consequence of the “greedy” character of the ID3 algorithm, due to the fact that at each iteration we choose the “best” attribute with respect to the information gain criterion.

It is well known that greedy algorithms do not guarantee reaching the global optimum.

(30)

Exemplifying the application of the ID3 algorithm in the presence of both

categorical and continuous attributes

CMU, 2012 fall, Eric Xing, Aarti Singh, HW1, pr. 1.1

(31)

As of September 2012, 800 extrasolar planets have been identified in our galaxy. Super-secret surveying spaceships sent to all these planets have established whether they are habitable for humans or not, but sending a spaceship to each planet is expensive. In this problem, you will come up with decision trees to predict if a planet is habitable based only on features observable using telescopes.

a. In the nearby table you are given the data from all 800 planets surveyed so far. The features observed by telescope are Size (“Big” or “Small”) and Orbit (“Near” or “Far”). Each row indicates the values of the features and habitability, and how many times that set of values was observed. So, for example, there were 20 “Big” planets “Near” their star that were habitable.

Size Orbit Habitable Count

Big Near Yes 20

Big Far Yes 170

Small Near Yes 139

Small Far Yes 45

Big Near No 130

Big Far No 30

Small Near No 11

Small Far No 255

Derive and draw the decision tree learned by ID3 on this data (use the maximum information gain criterion for splits, don't do any pruning). Make sure to clearly mark at each node what attribute you are splitting on, and which value corresponds to which branch. By each leaf node of the tree, write in the number of habitable and non-habitable planets in the training data that belong to that node.

(32)

Answer: Level 1

[Figure: the two candidate decision stumps on [374+,426−]:
Size: [184+,266−] (Small) and [190+,160−] (Big);
Orbit: [215+,285−] (Far) and [159+,141−] (Near).]

H(Habitable|Size) = (35/80)·H(19/35) + (45/80)·H(92/225) = (35/80)·0.9946 + (45/80)·0.9759 = 0.9841
H(Habitable|Orbit) = (3/8)·H(47/100) + (5/8)·H(43/100) = (3/8)·0.9974 + (5/8)·0.9858 = 0.9901

IG(Habitable; Size) = H(374/800) − 0.9841 = 0.0128
IG(Habitable; Orbit) = H(374/800) − 0.9901 = 0.0067

so Size is placed at the root.
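Since each row of the table stands for many identical planets, the information gains can also be computed directly from the counts; the sketch below (ours) does exactly that.

from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

# (Size, Orbit, Habitable) -> number of planets observed with those values.
table = {("Big", "Near", 1): 20, ("Big", "Far", 1): 170,
         ("Small", "Near", 1): 139, ("Small", "Far", 1): 45,
         ("Big", "Near", 0): 130, ("Big", "Far", 0): 30,
         ("Small", "Near", 0): 11, ("Small", "Far", 0): 255}

def info_gain(attr_index):
    """IG of Habitable w.r.t. the attribute in position attr_index (0 = Size, 1 = Orbit)."""
    n = sum(table.values())
    prior = [sum(c for k, c in table.items() if k[2] == y) for y in (1, 0)]
    ig = entropy(prior)
    for v in {k[attr_index] for k in table}:
        child = [sum(c for k, c in table.items() if k[attr_index] == v and k[2] == y)
                 for y in (1, 0)]
        ig -= sum(child) / n * entropy(child)
    return ig

print(info_gain(0), info_gain(1))   # ~0.013 for Size, ~0.007 for Orbit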

(33)

The final decision tree

[Figure: root Size on [374+,426−]; Size = Big → [190+,160−], tested with Orbit (Near → [20+,130−], leaf −; Far → [170+,30−], leaf +); Size = Small → [184+,266−], tested with Orbit (Near → [139+,11−], leaf +; Far → [45+,255−], leaf −).]

(34)

b. For just 9 of the planets, a third feature, Temperature (in degrees Kelvin), has been measured, as shown in the nearby table.

Redo all the steps from part a on this data using all three features. For the Temperature feature, in each iteration you must maximize over all possible binary thresholding splits (such as T ≤ 250 vs. T > 250, for example).

Size Orbit Temperature Habitable

Big Far 205 No

Big Near 205 No

Big Near 260 Yes

Big Near 380 Yes

Small Far 205 No

Small Far 260 Yes

Small Near 260 Yes

Small Near 380 No

Small Near 380 No

According to your decision tree, would a planet with the features (Big, Near, 280) be predicted to be habitable or not habitable?

Hint: You might need to use the following values of the entropy function for a Bernoulli variable of parameter p:

H(1/3) = 0.9182, H(2/5) = 0.9709, H(92/225) = 0.9759, H(43/100) = 0.9858, H(16/35) = 0.9946, H(47/100) = 0.9974.

(35)

Answer

Binary threshold splits for the continuous attribute Temperature:

[Figure: the sorted values 205, 260 and 380 on the Temperature axis, with the candidate thresholds 232.5 and 320 placed halfway between consecutive distinct values.]
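The candidate thresholds are simply the midpoints between consecutive distinct values of the attribute; a small helper (ours) makes this explicit.

def candidate_thresholds(values):
    """Midpoints between consecutive distinct sorted values of a continuous attribute."""
    vs = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(vs, vs[1:])]

temps = [205, 205, 260, 380, 205, 260, 260, 380, 380]
print(candidate_thresholds(temps))   # [232.5, 320.0]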

(36)

Answer: Level 1

[Figure: the four candidate splits on [4+,5−]:
Size: [2+,3−] (Small, H(2/5)) and [2+,2−] (Big, H = 1);
Orbit: [1+,2−] (Far, H(1/3)) and [3+,3−] (Near, H = 1);
T ≤ 232.5: [0+,3−] (Yes, H = 0) and [4+,2−] (No, H(1/3));
T ≤ 320: [3+,3−] (Yes, H = 1) and [1+,2−] (No, H(1/3)).]

H(Habitable|Size) = (4/9)·1 + (5/9)·H(2/5) = 4/9 + (5/9)·0.9709 = 0.9838
H(Habitable|T ≤ 232.5) = (3/9)·0 + (6/9)·H(1/3) = (2/3)·0.9182 = 0.6121

IG(Habitable; Size) = H(4/9) − 0.9838 ≈ 0.0072
IG(Habitable; T ≤ 232.5) = H(4/9) − 0.6121 ≈ 0.3788

so the root test is T ≤ 232.5.

(37)

Answer: Level 2

[Figure: the candidate splits on the [4+,2−] branch (T > 232.5):
Size: [2+,2−] (Small, H = 1) and [2+,0−] (Big, H = 0);
Orbit: [1+,0−] (Far, H = 0) and [3+,2−] (Near, H(2/5));
T ≤ 320: [3+,0−] (Yes, H = 0) and [1+,2−] (No, H(1/3)).
The split T ≤ 320 gives the smallest mean conditional entropy; comparison lines between the stumps indicate the ordering of the specific entropies and of their weights.]

Note:

The plain lines indicate that both the specific conditional entropies and their coefficients (weights) in the mean conditional entropies satisfy the indicated relationship. (For example, H(2/5) > H(1/3) and 5/6 > 3/6.)

The dotted lines indicate that only the specific conditional entropies satisfy the indicated relationship. (For example, H[2+,2−] = 1 > H(2/5), but 4/6 < 5/6.)

(38)

The final decision tree:

[Figure: root T ≤ 232.5 on [4+,5−]; Yes → [0+,3−], leaf −; No → [4+,2−], tested with T ≤ 320 (Yes → [3+,0−], leaf +; No → [1+,2−], tested with Size: Big → [1+,0−], leaf +; Small → [0+,2−], leaf −).]

c. According to your decision tree, would a planet with the features (Big, Near, 280) be predicted to be habitable or not habitable?

Answer: habitable .

(39)

Exemplifying

the application of the ID3 algorithm on continuous attributes,

and in the presence of noise.

Decision surfaces; decision boundaries.

The computation of the LOOCV error

CMU, 2002 fall, Andrew Moore, midterm, pr. 3

(40)

Suppose we are learning a classifier with binary output values Y = 0 and Y = 1. There is one real-valued input X. The training data is given in the nearby table.

Assume that we learn a decision tree on this data. Assume that when the decision tree splits on the real-valued attribute X, it puts the split threshold halfway between the two data values that surround the split. For example, using information gain as the splitting criterion, the decision tree would initially choose to split at X = 5, which is halfway between the X = 4 and X = 6 datapoints.

X Y

1 0

2 0

3 0

4 0

6 1

7 1

8 1

8.5 0

9 1

10 1

Let Algorithm DT2 be the method of learning a decision tree with only two leaf nodes (i.e., only one split).

Let Algorithm DT be the method of learning a decision tree fully, with no pruning.

a. What will be the training set error for DT2 and respectively DT on our data?

b. What will be the leave-one-out cross-validation error (LOOCV) for DT2 and respectively DT on our data?

(41)

• training data: the ten points X = 1, 2, 3, 4, 6, 7, 8, 8.5, 9, 10 on the real axis, labelled as in the table ([5−,5+] overall).

• discretization / decision thresholds: 5, 8.25 and 8.75.

• decision “surfaces”: − for X < 5, + for 5 < X < 8.25, − for 8.25 < X < 8.75, + for X > 8.75.

• the ID3 tree:
X < 5? Yes → 0 ([4−,0+]); No ([1−,5+]) → X < 8.25? Yes → 1 ([0−,3+]); No ([1−,2+]) → X < 8.75? Yes → 0 ([1−,0+]); No → 1 ([0−,2+]).
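Since the tree tests a single real-valued attribute, it collapses into a sequence of threshold tests; the sketch below (our own rendering of the tree above) confirms that the full ID3 tree has zero training error.

def predict(x):
    # The full ID3 tree above, written as nested threshold tests.
    if x < 5:
        return 0
    if x < 8.25:
        return 1
    if x < 8.75:
        return 0
    return 1

data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1),
        (7, 1), (8, 1), (8.5, 0), (9, 1), (10, 1)]
print(sum(predict(x) != y for x, y in data) / len(data))   # 0.0 training error for DT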

(42)

ID3: IG computations

Level 0 (on [5−,5+]):
X < 5: [4−,0+] (Yes) / [1−,5+] (No);
X < 8.25: [4−,3+] (Yes) / [1−,2+] (No);
X < 8.75: [5−,3+] (Yes) / [0−,2+] (No).
X < 5 yields the largest information gain, so it is placed in the root.

Level 1 (on [1−,5+]):
X < 8.25: [0−,3+] (Yes) / [1−,2+] (No), IG = 0.191;
X < 8.75: [1−,3+] (Yes) / [0−,2+] (No), IG = 0.109.
So X < 8.25 is chosen.

[Figure: the decision “surfaces” obtained after the level-1 split.]

(43)

ID3, LOOCV: decision surfaces

LOOCV error: 3/10

• X = 1, 2, 3, 7 or 10 left out: the decision surfaces stay − / + / − / + with boundaries 5, 8.25 and 8.75; the left-out point is classified correctly.
• X = 4: the first boundary moves to 4.5; X = 4 is still classified correctly (−).
• X = 6: the first boundary moves to 5.5; X = 6 is classified correctly (+).
• X = 8: the boundaries become 5, 7.75 and 8.75; X = 8 falls in the − region, so it is misclassified.
• X = 8.5: the tree reduces to the single boundary 5 (− / +); X = 8.5 is predicted +, so it is misclassified.
• X = 9: the boundaries become 5, 8.25 and 9.25; X = 9 falls in the − region, so it is misclassified.

(44)

DT2

[Figure: the DT2 tree on [5−,5+]: X < 5? Yes → 0 ([4−,0+]); No → 1 ([1−,5+]).
Decision “surfaces”: − for X < 5, + for X > 5.]

(45)

DT2, LOOCV: IG computations

Case 1: X = 1, 2, 3 or 4 is left out. On the remaining [4−,5+], the split X < 5 (with threshold 4.5 instead of 5 when X = 4 is removed) separates [3−,0+] from [1−,5+] and beats the candidates X < 8.25 ([3−,3+] / [1−,2+]) and X < 8.75 ([4−,3+] / [0−,2+]). The left-out point falls on the all-negative side, so it is classified correctly.

Case 2: X = 6, 7 or 8 is left out. On the remaining [5−,4+], the split X < 5 (threshold 5.5 when X = 6 is removed) separates [4−,0+] from [1−,4+] and beats X < 8.25 (threshold 7.75 when X = 8 is removed; [4−,2+] / [1−,2+]) and X < 8.75 ([5−,2+] / [0−,2+]). The left-out point falls on the majority-positive side, so it is classified correctly.

(46)

DT2, LOOCV: IG computations (cont'd)

Case 3: X = 8.5 is left out. The remaining data [4−,5+] is perfectly separated by X < 5 into [4−,0+] and [0−,5+]. The left-out point X = 8.5 is then predicted +, although its label is −, so it is misclassified.

Case 4: X = 9 or 10 is left out. On the remaining [5−,4+], the split X < 5 ([4−,0+] / [1−,4+]) again beats X < 8.25 ([4−,3+] / [1−,1+]) and X < 8.75 (threshold 9.25 when X = 10 is removed; [5−,3+] / [0−,1+]). The left-out point is predicted + and is classified correctly.

LOOCV error for DT2: 1/10
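The case analysis above can be automated; the sketch below (ours) refits the best one-split tree (DT2) on each leave-one-out training set, chooses the threshold by information gain as in the problem statement, and counts the misclassified held-out points.

from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n) for c in set(labels))

def best_stump(data):
    """Best single-threshold split by information gain; returns (threshold, left label, right label)."""
    xs = sorted({x for x, _ in data})
    all_labels = [y for _, y in data]
    best = None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in data if x < t]
        right = [y for x, y in data if x >= t]
        ig = entropy(all_labels) \
             - len(left) / len(data) * entropy(left) \
             - len(right) / len(data) * entropy(right)
        if best is None or ig > best[0]:
            best = (ig, t,
                    max(set(left), key=left.count),
                    max(set(right), key=right.count))
    return best[1], best[2], best[3]

data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1),
        (7, 1), (8, 1), (8.5, 0), (9, 1), (10, 1)]
errors = 0
for i, (x, y) in enumerate(data):
    t, left_label, right_label = best_stump(data[:i] + data[i + 1:])
    errors += (left_label if x < t else right_label) != y
print(errors / len(data))   # 0.1 -- the LOOCV error of DT2 is 1/10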

(47)

Applying ID3 on a dataset with two continuous attributes:

decision zones

Liviu Ciortuz, 2017

(48)

Consider the training dataset in the nearby figure. X1 and X2 are considered continuous attributes.

Apply the ID3 algorithm on this dataset. Draw the resulting decision tree.

Make a graphical representation of the decision areas and decision boundaries determined by ID3.

[Figure: the training data points in the (X1, X2) plane, with X1 ranging from 0 to 5 and X2 from 0 to 4; there are 4 positive and 5 negative examples.]

(49)

Solution

Level 1:

[Figure: the candidate decision stumps at the root, on [4+,5−], for the tests X1 < 5/2, X1 < 9/2, X2 < 3/2, X2 < 5/2 and X2 < 7/2. The best three mean conditional entropies are
(2/3)·H(1/3), i.e., IG = 0.378, for the split into [4+,2−] and [0+,3−];
(7/9)·H(2/7), i.e., IG = 0.319, for a split into [2+,0−] and [2+,5−];
(5/9)·H(2/5) + (4/9)·H(1/4), i.e., IG = 0.091.
The best split is X2 < 7/2, so it is placed in the root.]

(50)

Level 2:

[Figure: the candidate decision stumps on the [4+,2−] child of the root, for the tests X1 < 5/2, X1 < 4, X2 < 3/2 and X2 < 5/2. Both X1 < 5/2 and X1 < 4 split [4+,2−] into [2+,0−] and [2+,2−], with H[Y | ·] = 2/3 and IG = 0.251; the other two stumps give IG = 0.109 (H[Y | ·] = (5/6)·H(2/5), children [3+,2−] and [1+,0−]) and IG = 0.04 (H[Y | ·] = 1/3 + (2/3)·H(1/4), children [1+,1−] and [3+,1−]).]

Notes:

1. Split thresholds for continuous attributes must be recomputed at each new iteration, because they may change. (For instance, here 4 replaces 4.5 as a threshold for X1.)

2. At the current stage, i.e., for the current node of the ID3 tree, you may choose as test either X1 < 5/2 or X1 < 4.

3. Here we also have an example of a reversed relationship between the weighted and the un-weighted specific entropies: H[2+,2−] > H[3+,2−], but (4/6)·H[2+,2−] < (5/6)·H[3+,2−].

(51)

The final decision tree:

[Figure: root X2 < 7/2 on [4+,5−]; one branch is the leaf [0+,3−] (−); the [4+,2−] branch is tested with X1 < 5/2, whose branches are the leaf [2+,0−] (+) and [2+,2−]; the latter is tested with X1 < 4, with leaves [0+,2−] (−) and [2+,0−] (+).]

Decision areas:

[Figure: the (X1, X2) plane partitioned by the boundaries X2 = 7/2, X1 = 5/2 and X1 = 4 into the corresponding + and − decision areas.]

(52)

Other criteria than IG for

the best attribute selection in ID3:

Gini impurity / index

and Misclassification impurity

CMU, 2003 fall, T. Mitchell, A. Moore, HW1, pr. 4

(53)

Entropy is a natural measure to quantify the impurity of a data set. The Decision Tree learning algorithm uses entropy as a splitting criterion, by calculating the information gain to decide the next attribute with which to partition the current node.

However, there are other impurity measures that could be used as splitting criteria too. Let's investigate two of them.

Assume the current node n has k classes c1, c2, . . . , ck.

• Gini Impurity: i(n) = 1 − Σ_{i=1}^{k} P^2(c_i).

• Misclassification Impurity: i(n) = 1 − max_{i=1,...,k} P(c_i).

a. Assume node n has two classes, c1 and c2. Please draw a figure in which the three impurity measures (Entropy, Gini and Misclassification) are represented as functions of P(c1).

(54)

Answer

Entropy(p) = −p·log2 p − (1−p)·log2(1−p)
Gini(p) = 1 − p^2 − (1−p)^2 = 2p(1−p)
MisClassif(p) = p for p ∈ [0, 1/2) and 1−p for p ∈ [1/2, 1], i.e., min(p, 1−p)

[Plot: the three impurity measures as functions of p = P(c1) on [0, 1].]
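For reference, the three curves can be tabulated with a few lines of Python (our own illustration):

from math import log2

def entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def gini(p):
    return 2 * p * (1 - p)

def misclassification(p):
    return min(p, 1 - p)

for p in (0.1, 0.25, 0.5):
    print(p, round(entropy(p), 3), round(gini(p), 3), round(misclassification(p), 3))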

(55)

b. Now we can define new splitting criteria based on the Gini and Misclassification impurities, called Drop-of-Impurity in some of the literature. That is the difference between the impurity of the current node and the weighted sum of the impurities of its children.

For binary splits, the Drop-of-Impurity is defined as

∆i(n) = i(n) − P(nl)·i(nl) − P(nr)·i(nr),

where nl and nr are the left and the right child of node n after splitting.

Please calculate the Drop-of-Impurity (using both Gini and Misclassification Impurity) for the following example data set in which C is the class variable to be predicted.

A a1 a1 a1 a2 a2 a2

C c1 c1 c2 c2 c2 c2

(56)

Answer

[Figure: the decision stump for A on [2+,4−] (node 0), with children [2+,1−] (A = a1, node 1) and [0+,3−] (A = a2, node 2).]

Gini: p = 2/6 = 1/3 ⇒
i(0) = 2·(1/3)·(2/3) = 4/9
i(1) = 2·(2/3)·(1/3) = 4/9
i(2) = 0
⇒ ∆i(0) = 4/9 − (3/6)·(4/9) − (3/6)·0 = 4/9 − 2/9 = 2/9.

Misclassification: p = 1/3 < 1/2 ⇒
i(0) = p = 1/3
i(1) = 1 − 2/3 = 1/3
i(2) = 0
⇒ ∆i(0) = 1/3 − (1/2)·(1/3) − (1/2)·0 = 1/6.
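The same numbers come out of a generic Drop-of-Impurity routine; the helper names below are ours.

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def misclassification(counts):
    n = sum(counts)
    return 1 - max(counts) / n

def drop_of_impurity(impurity, parent, children):
    n = sum(parent)
    return impurity(parent) - sum(sum(ch) / n * impurity(ch) for ch in children)

parent = [2, 4]                 # class counts [c1, c2] at node 0
children = [[2, 1], [0, 3]]     # class counts for A = a1 and A = a2
print(drop_of_impurity(gini, parent, children))              # 2/9 ~ 0.222
print(drop_of_impurity(misclassification, parent, children)) # 1/6 ~ 0.167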

(57)

c. We choose the attribute that can maximize the Drop-of-Impurity to split a node. Please create a data set and show that on this data set, Misclassification Impurity based ∆i(n) couldn’t determine which attribute should be used for splitting (e.g., ∆i(n) = 0 for all the attributes), but Information Gain and Gini Impurity based ∆i(n) can.

Answer

A: a1 a1 a1 a2 a2 a2 a2
C: c1 c2 c2 c2 c2 c2 c1

Entropy: ∆i(0) = H[2+,5−] − [ (3/7)·H[1+,2−] + (4/7)·H[1+,3−] ] = 0.006 ≠ 0;

Gini: ∆i(0) = 2·[ (2/7)·(1 − 2/7) − (3/7)·(1/3)·(1 − 1/3) − (4/7)·(1/4)·(1 − 1/4) ]
= 2·(10/49 − 2/21 − 3/28) = 2·(10/49 − 17/84) = 1/294 ≠ 0;

Misclassification: ∆i(0) = 2/7 − [ (3/7)·(1/3) + (4/7)·(1/4) ] = 0.

(58)

Note: A [quite bad] property

If C1 < C2, C1l < C2l and C1r < C2r (with C1 = C1l + C1r and C2 = C2l + C2r), then the Drop-of-Impurity based on Misclassification is 0.

[Figure: the decision stump for A on [C1+, C2−], with children [C1l+, C2l−] (A = a1) and [C1r+, C2r−] (A = a2).]

Proof

∆i(n) = C1/(C1+C2) − (C1l+C2l)/(C1+C2) · C1l/(C1l+C2l) − (C1r+C2r)/(C1+C2) · C1r/(C1r+C2r)
= C1/(C1+C2) − (C1l + C1r)/(C1+C2) = C1/(C1+C2) − C1/(C1+C2) = 0.

(59)

Exemplifying

pre- and post-pruning of decision trees using a threshold for the Information Gain

CMU, 2006 spring, Carlos Guestrin, midterm, pr. 4

[adapted by Liviu Ciortuz]

(60)

Starting from the data in the following table, the ID3 algorithm builds the decision tree shown nearby.

V W X Y

0 0 0 0

0 1 0 1

1 0 0 1

1 1 0 0

1 1 1 1

[Figure: the ID3 tree built from this data: root X; X = 1 → leaf 1; X = 0 → V; V = 0 → W (W = 0 → 0, W = 1 → 1); V = 1 → W (W = 0 → 1, W = 1 → 0).]

a. One idea for pruning such a decision tree would be to start at the root, and prune splits for which the information gain (or some other criterion) is less than some small ε. This is called top-down pruning. What is the decision tree returned for ε = 0.0001?

What is the training set error for this tree?

(61)

Answer

We will first augment the given decision tree with information regarding the data partitions (i.e., the number of positive and negative instances) assigned to each test node during the application of the ID3 algorithm.

The information gain yielded by the attribute X in the root node is:

H[3+,2−] − (1/5)·0 − (4/5)·1 = 0.971 − 0.8 = 0.171 > ε.

Therefore, this node will not be eliminated from the tree.

The information gain for the attribute V (in the left-hand-side child of the root node) is:

H[2+,2−] − (1/2)·1 − (1/2)·1 = 1 − 1 = 0 < ε.

[Figure: the augmented tree: root X on [3+;2−]; X = 1 → [1+;0−], leaf 1; X = 0 → [2+;2−], tested with V; each V branch is [1+;1−] and is tested with W, whose leaves are [0+;1−], [1+;0−], [1+;0−] and [0+;1−].]

So, the whole left subtree will be cut off and replaced by a decision node, as shown nearby. The training error produced by this tree is 2/5.

[Figure: the pruned tree: a single test on X, with X = 0 → 0 and X = 1 → 1.]

(62)

b. Another option would be to start at the leaves, and prune subtrees for which the information gain (or some other criterion) of a split is less than some small ε. In this method, no ancestors of children with high information gain will get pruned. This is called bottom-up pruning. What is the tree returned for ε = 0.0001?

What is the training set error for this tree?

Answer:

The information gain of V is IG(Y; V) = 0. A step later, the information gain of W (for either one of the descendant nodes of V) is IG(Y; W) = 1. So bottom-up pruning won't delete any nodes, and the tree [given in the problem statement] remains unchanged.

The training error is 0.

(63)

c. Discuss when you would want to choose bottom-up pruning over top-down pruning and vice versa.

Answer:

Top-down pruning is computationally cheaper. When building the tree we can determine when to stop (no need for real pruning).

But, as we saw, top-down pruning may prune too much.

On the other hand, bottom-up pruning is more expensive, since we have to first build a full tree (which can be exponentially large) and then apply pruning. A second problem with bottom-up pruning is that superfluous attributes may fool it (see CMU, 2009 fall, Carlos Guestrin, HW1, pr. 2.4). A third problem is that in the lower levels of the tree the number of examples in the subtree gets smaller, so information gain might be an inappropriate criterion for pruning; one would usually use a statistical test instead.

(64)

Exemplifying χ²-Based Pruning of Decision Trees

CMU, 2010 fall, Ziv Bar-Joseph, HW2, pr. 2.1

(65)

In class, we learned a decision tree pruning algorithm that iteratively visited subtrees and used a validation dataset to decide whether to remove the subtree. However, sometimes it is desirable to prune the tree after training on all of the available data.

One such approach is based on statistical hypothesis testing.

After learning the tree, we visit each internal node and test whether the attribute split at that node is actually uncorrelated with the class labels.

We hypothesize that the attribute is independent and then use Pearson's chi-square test to generate a test statistic that may provide evidence that we should reject this “null” hypothesis. If we fail to reject the hypothesis, we prune the subtree at that node.

(66)

a. At each internal node we can create a contingency table for the training examples that pass through that node on their paths to the leaves. The table will have the c class labels associated with the columns and the r values of the split attribute associated with the rows.

Each entry O_{i,j} in the table is the number of times we observe a training sample with that attribute value and label, where i is the row index that corresponds to an attribute value and j is the column index that corresponds to a class label.

In order to calculate the chi-square test statistic, we need a similar table of expected counts. The expected count is the number of observations we would expect if the class and attribute are independent.

Derive a formula for each expected count E_{i,j} in the table.

Hint: What is the probability that a training example that passes through the node has a particular label? Using this probability and the independence assumption, what can you say about how many examples with a specific attribute value are expected to also have the class label?

(67)

b. Given these two tables for the split, you can now calculate the chi-square test statistic

χ² = Σ_{i=1}^{r} Σ_{j=1}^{c} (O_{i,j} − E_{i,j})² / E_{i,j}

with (r − 1)·(c − 1) degrees of freedom.

You can plug the test statistic and degrees of freedom into a software package^a or an online calculator^b to calculate a p-value. Typically, if p < 0.05 we reject the null hypothesis that the attribute and class are independent and say the split is statistically significant.

The decision tree given on the next slide was built from the data in the table nearby.

For each of the 3 internal nodes in the decision tree, show the p-value for the split and state whether it is statistically significant.

How many internal nodes will the tree have if we prune splits with p ≥ 0.05?

a Use 1-chi2cdf(x,df) in MATLAB or CHIDIST(x,df) in Excel.

b (https://en.m.wikipedia.org/wiki/Chi-square distribution#Table of .CF.872 value vs p-value.

(68)

Input:

X1 X2 X3 X4 Class

1 1 0 0 0

1 0 1 0 1

0 1 0 0 0

1 0 1 1 1

0 1 1 1 1

0 0 1 0 0

1 0 0 0 1

0 1 0 1 1

1 0 0 1 1

1 1 0 1 1

1 1 1 1 1

0 0 0 0 0

[Figure: the decision tree built from this data: root X4 on [4−,8+]; X4 = 1 → [0−,6+], leaf 1; X4 = 0 → [4−,2+], tested with X1; X1 = 0 → [3−,0+], leaf 0; X1 = 1 → [1−,2+], tested with X2; X2 = 0 → [0−,2+], leaf 1; X2 = 1 → [1−,0+], leaf 0.]

(69)

Idea

While traversing the ID3 tree [usually in a bottom-up manner], remove the nodes for which there is not enough (“significant”) statistical evidence that there is a dependence between the values of the input attribute tested in that node and the values of the output attribute (the labels), supported by the set of instances assigned to that node.

(70)

Contingency tables

O_X4 (N = 12):
           Class = 0   Class = 1
X4 = 0         4           2
X4 = 1         0           6

P(X4 = 0) = 6/12 = 1/2, P(X4 = 1) = 1/2
P(Class = 0) = 4/12 = 1/3, P(Class = 1) = 2/3

O_{X1 | X4=0} (N = 6):
           Class = 0   Class = 1
X1 = 0         3           0
X1 = 1         1           2

P(X1 = 0 | X4 = 0) = 3/6 = 1/2, P(X1 = 1 | X4 = 0) = 1/2
P(Class = 0 | X4 = 0) = 4/6 = 2/3, P(Class = 1 | X4 = 0) = 1/3

O_{X2 | X4=0, X1=1} (N = 3):
           Class = 0   Class = 1
X2 = 0         0           2
X2 = 1         1           0

P(X2 = 0 | X4 = 0, X1 = 1) = 2/3, P(X2 = 1 | X4 = 0, X1 = 1) = 1/3
P(Class = 0 | X4 = 0, X1 = 1) = 1/3, P(Class = 1 | X4 = 0, X1 = 1) = 2/3

(71)

The reasoning that leads to the computation of the expected number of observations

P(A = i, C = j) = P(A = i) · P(C = j) (under the independence assumption), where

P(A = i) = (Σ_{k=1}^{c} O_{i,k}) / N  and  P(C = j) = (Σ_{k=1}^{r} O_{k,j}) / N.

Therefore

P(A = i, C = j) = (Σ_{k=1}^{c} O_{i,k}) · (Σ_{k=1}^{r} O_{k,j}) / N^2

and

E[A = i, C = j] = N · P(A = i, C = j) = (Σ_{k=1}^{c} O_{i,k}) · (Σ_{k=1}^{r} O_{k,j}) / N.

(72)

Expected number of observations

E_X4:
           Class = 0   Class = 1
X4 = 0         2           4
X4 = 1         2           4

E_{X1 | X4=0}:
           Class = 0   Class = 1
X1 = 0         2           1
X1 = 1         2           1

E_{X2 | X4=0, X1=1}:
           Class = 0   Class = 1
X2 = 0        2/3         4/3
X2 = 1        1/3         2/3

For example, E_X4(X4 = 0, Class = 0): N = 12, P(X4 = 0) = 1/2 and P(Class = 0) = 1/3, so
N · P(X4 = 0, Class = 0) = N · P(X4 = 0) · P(Class = 0) = 12 · (1/2) · (1/3) = 2.

(73)

χ² statistics

χ² = Σ_{i=1}^{r} Σ_{j=1}^{c} (O_{i,j} − E_{i,j})² / E_{i,j}

χ²_{X4} = (4−2)²/2 + (0−2)²/2 + (2−4)²/4 + (6−4)²/4 = 2 + 2 + 1 + 1 = 6

χ²_{X1 | X4=0} = (3−2)²/2 + (1−2)²/2 + (0−1)²/1 + (2−1)²/1 = 3

χ²_{X2 | X4=0, X1=1} = (0 − 2/3)²/(2/3) + (1 − 1/3)²/(1/3) + (2 − 4/3)²/(4/3) + (0 − 2/3)²/(2/3) = (4/9)·(27/4) = 3

The corresponding p-values (for (2−1)·(2−1) = 1 degree of freedom) are 0.0143, 0.0833 and 0.0833, respectively.

The first of these p-values is smaller than 0.05; therefore the root node (X4) cannot be pruned.
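The statistics and p-values above can be reproduced with the sketch below; it assumes SciPy is available for the χ² survival function (scipy.stats.chi2.sf), and the helper chi_square is our own.

from scipy.stats import chi2

def chi_square(observed):
    """Pearson's chi-square statistic for an r x c contingency table (list of rows)."""
    n = sum(sum(row) for row in observed)
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_sums[i] * col_sums[j] / n   # expected count under independence
            stat += (o - e) ** 2 / e
    return stat

tables = {"X4": [[4, 2], [0, 6]],
          "X1 | X4=0": [[3, 0], [1, 2]],
          "X2 | X4=0, X1=1": [[0, 2], [1, 0]]}
for name, obs in tables.items():
    stat = chi_square(obs)
    df = (len(obs) - 1) * (len(obs[0]) - 1)
    print(name, round(stat, 2), round(chi2.sf(stat, df), 4))
# X4: 6.0 with p ~ 0.0143; the other two nodes: 3.0 with p ~ 0.0833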

(74)

[Plot: the p-value as a function of Pearson's cumulative test statistic χ², for k = 1, 2, 3, 4, 6 and 9 degrees of freedom.]

(75)

Output (pruned tree) for the 95% confidence level

[Figure: the pruned tree consists of the single test X4: X4 = 0 → class 0, X4 = 1 → class 1.]

(76)

The AdaBoost algorithm:

why it was designed the way it was designed, and

the convergence of the training error, under certain conditions

CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.1-5 CMU, 2009 fall, Carlos Guestrin, HW2, pr. 3.1

CMU, 2009 fall, Eric Xing, HW3, pr. 4.2.2
