Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Advances in Decision Tree Construction

Johannes Gehrke Cornell University [email protected] http://www.cs.cornell.edu/johannes

Wei-Yin Loh University of Wisconsin-Madison

[email protected] http://www.stat.wisc.edu/~loh

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Tutorial Overview

• Part I: Classification Trees
  • Introduction
  • Classification tree construction schema
  • Split selection
  • Pruning
  • Data access
  • Missing values
  • Evaluation
  • Bias in split selection (Short Break)
• Part II: Regression Trees



Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Classification

Goal: Learn a function that assigns a record to one of several predefined classes.

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Classification Example

• Example training database
• Two predictor attributes: Age and Car-type (Sport, Minivan, and Truck)
• Age is ordered, Car-type is a categorical attribute
• The class label indicates whether the person bought the product
• The dependent attribute is categorical

Age  Car  Class
20   M    Yes
30   M    Yes
25   T    No
30   S    Yes
40   S    Yes
20   T    No
30   M    Yes
25   M    Yes
40   M    Yes
20   S    No

Types of Variables

• Numerical: Domain is ordered and can be represented on the real line (e.g., age, income)
• Nominal or categorical: Domain is a finite set without any natural ordering (e.g., occupation, marital status, race)
• Ordinal: Domain is ordered, but absolute differences between values are unknown (e.g., preference scale, severity of an injury)


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Definitions

• Random variables X1, …, Xk (predictor variables) and Y (dependent variable)
• Xi has domain dom(Xi), Y has domain dom(Y)
• P is a probability distribution on dom(X1) x … x dom(Xk) x dom(Y)
• The training database D is a random sample from P
• A predictor d is a function d: dom(X1) x … x dom(Xk) → dom(Y)

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Classification Problem

• C is called the class label, d is called a classifier.
• Let r be a record randomly drawn from P.
  Define the misclassification rate of d:
  RT(d,P) = P(d(r.X1, …, r.Xk) != r.C)

Problem definition: Given dataset D that is a random sample from probability distribution P, find classifier d such that RT(d,P) is minimized.

(More on regression problems in the second part of the tutorial.)

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Goals and Requirements

Goals:

• To produce an accurate classifier/regression function
• To understand the structure of the problem

Requirements on the model:

• High accuracy
• Understandable by humans, interpretable
• Fast construction for very large training databases


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

What are Decision Trees?

[Figure: a decision tree and the corresponding partition of the predictor space.
 The tree splits on Age (<30 vs. >=30); the <30 branch splits on Car Type
 (Minivan → YES, Sports/Truck → NO), and the >=30 branch is a YES leaf.
 The partition plot shows the Age axis (0-60) divided at 30, with the region
 Age >= 30 labeled YES for both car types.]

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Decision Trees

• A decision tree T encodes d (a classifier or regression function) in the form of a tree.
• A node t in T without children is called a leaf node. Otherwise t is called an internal node.
• Each internal node has an associated splitting predicate. Most common are binary predicates.

Example splitting predicates:
• Age <= 20
• Profession in {student, teacher}
• 5000*Age + 3*Salary – 10000 > 0

Internal and Leaf Nodes

Internal nodes:

• Binary univariate splits:
  • Numerical or ordered X: X <= c, c in dom(X)
  • Categorical X: X in A, A subset of dom(X)
• Binary multivariate splits:
  • Linear combination split on numerical variables: Σ ai*Xi <= c
• k-ary (k>2) splits are analogous

Leaf nodes:
• Node t is labeled with one class label c in dom(C)


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Example

Encoded classifier:

If (age < 30 and carType = Minivan) Then YES
If (age < 30 and (carType = Sports or carType = Truck)) Then NO
If (age >= 30) Then YES

[Figure: the decision tree from the previous slide — root split on Age (<30 / >=30);
 the <30 branch splits on Car Type (Minivan → YES, Sports/Truck → NO), the >=30 branch is a YES leaf.]

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Evaluation of Misclassification Error

Problem:

• In order to quantify the quality of a classifier d, we need to know its misclassification rate RT(d,P).
• But unless we know P, RT(d,P) is unknown.
• Thus we need to estimate RT(d,P) as well as possible.

Approaches:

• Resubstitution estimate
• Test sample estimate
• V-fold cross-validation

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Resubstitution Estimate

The resubstitution estimate R(d,D) estimates RT(d,P) of a classifier d using D:

• Let D be the training database with N records.
• R(d,D) = 1/N Σ I(d(r.X) != r.C), summed over the records r in D
• Intuition: R(d,D) is the proportion of training records that are misclassified by d
• Problem with the resubstitution estimate:
  Overly optimistic; classifiers that overfit the training dataset will have very low resubstitution error.
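A minimal sketch of this estimate, assuming the training database is a list of (x, c) pairs and a classifier is any Python function mapping x to a class label; the data and the classifier below are illustrative only:

def resubstitution_error(d, D):
    # R(d,D): fraction of records in D that d misclassifies
    return sum(1 for x, c in D if d(x) != c) / len(D)

# The Age/Car example database from the "Classification Example" slide
D = [((20, 'M'), 'Yes'), ((30, 'M'), 'Yes'), ((25, 'T'), 'No'), ((30, 'S'), 'Yes'),
     ((40, 'S'), 'Yes'), ((20, 'T'), 'No'), ((30, 'M'), 'Yes'), ((25, 'M'), 'Yes'),
     ((40, 'M'), 'Yes'), ((20, 'S'), 'No')]
d = lambda x: 'Yes' if x[0] >= 30 or x[1] == 'M' else 'No'
print(resubstitution_error(d, D))   # 0.0 -- a classifier tailored to D looks perfect here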


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Test Sample Estimate

• Divide D into D1 and D2
• Use D1 to construct the classifier d
• Then use the resubstitution estimate R(d,D2) to calculate the estimated misclassification error of d
• Unbiased and efficient, but removes D2 from the training dataset D

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

V-fold Cross Validation

Procedure:

• Construct classifier d from D
• Partition D into V datasets D1, …, DV
• Construct classifier di using D \ Di
• Calculate the estimated misclassification error R(di,Di) of di using test sample Di

Final misclassification estimate:

• Weighted combination of the individual misclassification errors:
  R(d,D) = 1/V Σ R(di,Di)
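A minimal sketch of the procedure, assuming records are (x, c) pairs and build(training_records) returns a classifier function; fold assignment by simple striding is an illustrative choice:

import random

def cross_validation_error(build, D, V=10, seed=0):
    # Estimate of R(d,P): build d_i without fold D_i, test it on D_i, average the errors
    D = list(D)
    random.Random(seed).shuffle(D)
    folds = [D[i::V] for i in range(V)]                 # V roughly equal partitions
    errors = []
    for i in range(V):
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        d_i = build(train)                              # classifier constructed on D without fold i
        errors.append(sum(1 for x, c in folds[i] if d_i(x) != c) / len(folds[i]))
    return sum(errors) / V                              # 1/V * sum_i R(d_i, D_i)

The classifier that is actually reported is still built from all of D; the V auxiliary classifiers only serve to estimate its misclassification rate.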

Cross-Validation: Example

[Figure: the classifier d is built from all of D; each cross-validation classifier di (d1, d2, …) is built from D with fold Di held out and evaluated on Di.]


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Cross-Validation

• The misclassification estimate obtained through cross-validation is usually nearly unbiased
• Costly computation (we need to compute d and d1, …, dV); computation of di is nearly as expensive as computation of d
• Preferred method to estimate the quality of learning algorithms in the machine learning literature

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Tutorial Overview

• Part I: Classification Trees
  • Introduction
  • Classification tree construction schema
  • Split selection
  • Pruning
  • Data access
  • Missing values
  • Evaluation
  • Bias in split selection (Short Break)
• Part II: Regression Trees

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Decision Tree Construction

• Top-down tree construction schema:
  • Examine the training database and find the best splitting predicate for the root node
  • Partition the training database
  • Recurse on each child node

BuildTree(Node t, Training database D, Split Selection Method S)
(1) Apply S to D to find the splitting criterion
(2) if (t is not a leaf node)
(3)   Create children nodes of t
(4)   Partition D into children partitions
(5)   Recurse on each partition
(6) endif
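A compact Python rendering of this schema, assuming the split selection method is supplied as a function find_split(D) that returns a boolean splitting predicate over records (or None when the node should become a leaf) and majority_label(D) supplies the leaf label; all names are illustrative:

class Node:
    def __init__(self):
        self.split = None        # splitting predicate: record -> True/False
        self.children = None     # [left, right] for a binary split
        self.label = None        # class label if this node is a leaf

def build_tree(t, D, find_split, majority_label):
    # Top-down schema: apply S, partition D, recurse on the children
    split = find_split(D)                     # (1) splitting criterion (None = stop)
    if split is None:
        t.label = majority_label(D)
        return t
    t.split = split                           # (2)-(4) create children, partition D
    left  = [r for r in D if split(r)]        # find_split must not send all records one way
    right = [r for r in D if not split(r)]
    t.children = [build_tree(Node(), left,  find_split, majority_label),
                  build_tree(Node(), right, find_split, majority_label)]
    return t                                  # (5) recursion built the subtrees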


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Decision Tree Construction (Contd.)

Three algorithmic components:

• Split selection (CART, C4.5, QUEST, CHAID, CRUISE, …)
• Pruning (direct stopping rule, test dataset pruning, cost-complexity pruning, statistical tests, bootstrapping)
• Data access (CLOUDS, SLIQ, SPRINT, RainForest, BOAT, UnPivot operator)

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Split Selection Methods

• Multitude of split selection methods in the literature
• In this tutorial:
  • Impurity-based split selection: CART (most common in today’s data mining tools)
  • Model-based split selection: QUEST

Split Selection Methods: CART

• Classification And Regression Trees (Breiman, Friedman, Olshen, Stone, 1984; considered “the” reference on decision tree construction)
• Commercial version sold by Salford Systems (www.salford-systems.com)
• Many other, slightly modified implementations exist (e.g., IBM Intelligent Miner implements the CART split selection method)


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

CART Split Selection Method

Motivation: We need a way to choose quantitatively between different splitting predicates

• Idea: Quantify the impurity of a node
• Method: Select the splitting predicate that generates children nodes with minimum impurity from a space of possible splitting predicates

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Intuition: Impurity Function

X1  X2  Class
1   1   Yes
1   2   Yes
1   2   Yes
1   2   Yes
1   2   Yes
1   1   No
2   1   No
2   1   No
2   2   No
2   2   No

[Figure: two candidate splits of the root node (50%,50%).
 X1 <= 1: children (83%,17%) labeled Yes and (0%,100%) labeled No.
 X2 <= 1: children (25%,75%) labeled No and (66%,33%) labeled Yes.]

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Impurity Function

Let p(j|t) be the proportion of class j training records at node t. Then the node impurity measure at node t:

i(t) = phi(p(1|t), …, p(J|t))

Properties:
• phi is symmetric
• Maximum value at the arguments (1/J, …, 1/J)
• phi(1,0,…,0) = … = phi(0,…,0,1) = 0

The reduction in impurity through splitting predicate s (t splits into children nodes tL with impurity phi(tL) and tR with impurity phi(tR), receiving fractions pL and pR of the records) is:
Δphi(s,t) = phi(t) – pL*phi(tL) – pR*phi(tR)


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Example

Root node t: p(1|t) = 0.5; p(2|t) = 0.5
Left child node tL: p(1|tL) = 0.83; p(2|tL) = 0.17
Impurity of root node: phi(0.5, 0.5)
Impurity of left child node: phi(0.83, 0.17)
Impurity of right child node: phi(0.0, 1.0)
Impurity of whole tree: 0.6*phi(0.83, 0.17) + 0.4*phi(0, 1)
Impurity reduction: phi(0.5, 0.5) - 0.6*phi(0.83, 0.17) - 0.4*phi(0, 1)

[Figure: the split X1 <= 1 of the root (50%,50%) into a Yes child (83%,17%) and a No child (0%,100%).]

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Error Reduction as Impurity Function

• Possible impurity function: the resubstitution error R(T,D).
• Example:
  R(no tree, D) = 0.5
  R(T1,D) = 0.6*0.17
  R(T2,D) = 0.4*0.25 + 0.6*0.33

[Figure: T1 splits the root (50%,50%) on X1 <= 1 into (83%,17%) and (0%,100%); T2 splits it on X2 <= 1 into (25%,75%) and (66%,33%).]

Problems with Resubstitution Error

• Obvious problem: there are situations where no split can decrease impurity
• Example:
  R(no tree, D) = 0.2
  R(T1,D) = 0.6*0.17 + 0.4*0.25 = 0.2
• More subtle problems exist

[Figure: the split X3 <= 1 of a node (80%,20%) into a Yes child with 6 records (83%,17%) and a Yes child with 4 records (75%,25%).]


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Remedy: Concavity

Concave Impurity Functions

Use impurity functions that are concave: phi'' < 0. Example concave impurity functions:
• Entropy: phi(t) = -Σ p(j|t) log(p(j|t))
• Gini index: phi(t) = 1 - Σ p(j|t)^2

Nonnegative Decrease in Impurity

Theorem: Let phi(p1, …, pJ) be a strictly concave function on pj >= 0, Σj pj = 1.
Then for any split s: Δphi(s,t) >= 0,
with equality if and only if p(j|tL) = p(j|tR) = p(j|t), j = 1, …, J
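A small sketch of the entropy and gini impurity functions and of the impurity reduction Δphi(s,t), assuming records are (x, c) pairs and a split is a boolean function of a record (illustrative names):

from collections import Counter
from math import log2

def class_proportions(records):
    counts = Counter(c for _, c in records)
    return [n / len(records) for n in counts.values()]

def entropy(records):                       # phi(t) = -sum p(j|t) log p(j|t)
    return -sum(p * log2(p) for p in class_proportions(records) if p > 0)

def gini(records):                          # phi(t) = 1 - sum p(j|t)^2
    return 1 - sum(p * p for p in class_proportions(records))

def impurity_reduction(phi, records, split):
    # delta-phi(s,t) = phi(t) - pL*phi(tL) - pR*phi(tR); nonnegative for concave phi
    left  = [r for r in records if split(r)]
    right = [r for r in records if not split(r)]
    if not left or not right:               # degenerate split: everything on one side
        return 0.0
    pL, pR = len(left) / len(records), len(right) / len(records)
    return phi(records) - pL * phi(left) - pR * phi(right)

On the X1/X2 example above, impurity_reduction(gini, D, lambda r: r[0][0] <= 1) comes out larger than the reduction for the X2 <= 1 split, matching the intuition behind the figure.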

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

CART Univariate Split Selection

• Use the gini index as the impurity function
• For each numerical or ordered attribute X, consider all binary splits s of the form X <= x, where x in dom(X)
• For each categorical attribute X, consider all binary splits s of the form X in A, where A subset of dom(X)
• At a node t, select the split s* such that Δphi(s*,t) is maximal over all s considered

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

CART: Shortcut for Categorical Splits

Computational shortcut if |dom(Y)| = 2.

• Theorem: Let X be a categorical attribute with dom(X) = {b1, …, bk}, |dom(Y)| = 2, let phi be a concave function, and order the categories so that
  p(Y=1 | X=b1) <= … <= p(Y=1 | X=bk).
  Then the best split is of the form X in {b1, b2, …, bl} for some l < k
• Benefit: We only need to check k-1 subsets of dom(X) instead of 2^(k-1) - 1 subsets
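A sketch of this shortcut for the two-class case with the gini index, assuming records are (x, c) tuples and attr_index selects the categorical attribute; all names are illustrative:

from collections import defaultdict

def gini2(pos, n):
    # gini impurity of a two-class node with pos positive records out of n
    if n == 0:
        return 0.0
    p = pos / n
    return 2 * p * (1 - p)

def best_categorical_split(records, attr_index, positive_class):
    # Order the categories by the proportion of the positive class;
    # only the k-1 "prefix" subsets need to be evaluated.
    stats = defaultdict(lambda: [0, 0])            # category -> [positives, total]
    for x, c in records:
        stats[x[attr_index]][1] += 1
        stats[x[attr_index]][0] += (c == positive_class)
    order = sorted(stats, key=lambda b: stats[b][0] / stats[b][1])
    n, pos = len(records), sum(s[0] for s in stats.values())
    parent = gini2(pos, n)
    best_subset, best_gain = None, -1.0
    for l in range(1, len(order)):                 # the k-1 candidate prefixes
        A = order[:l]
        posL, nL = sum(stats[b][0] for b in A), sum(stats[b][1] for b in A)
        gain = parent - (nL / n) * gini2(posL, nL) - ((n - nL) / n) * gini2(pos - posL, n - nL)
        if gain > best_gain:
            best_subset, best_gain = set(A), gain
    return best_subset, best_gain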


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Problems with CART Split Selection

• Biased towards variables with more splits (an M-category variable has 2^(M-1) - 1 possible splits, an M-valued ordered variable has M-1 possible splits)
  (Explanation and remedy later)
• Computationally expensive for categorical variables with large domains

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

QUEST: Model-based split selection

“The purpose of models is not to fit the data but to sharpen the questions.”

Karlin, Samuel (1923 - )

(11th R. A. Fisher Memorial Lecture, Royal Society, 20 April 1983.)

Split Selection Methods: QUEST

• Quick, Unbiased, Efficient, Statistical Tree (Loh and Shih, Statistica Sinica, 1997)
  Freeware, available at www.stat.wisc.edu/~loh; also implemented in SPSS.
• Main new ideas:
  • Separate splitting predicate selection into variable selection and split point selection
  • Use statistical significance tests instead of an impurity function


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

QUEST Variable Selection

Let X1, …, Xl be numerical predictor variables, and let Xl+1, …, Xk be categorical predictor variables.

1. Find the p-value from an ANOVA F-test for each numerical variable.
2. Find the p-value from a chi-square test for each categorical variable.
3. Choose the variable Xk' with the overall smallest p-value pk'.

(Actual algorithm is more complicated.)
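A deliberately simplified sketch of steps 1-3 using SciPy, assuming records are (x, c) tuples and that numerical and categorical hold the attribute indices; as noted above, the real QUEST procedure is more involved:

from collections import Counter
from scipy.stats import f_oneway, chi2_contingency

def quest_select_variable(records, numerical, categorical):
    classes = sorted({c for _, c in records})
    pvals = {}
    for i in numerical:                                  # ANOVA F-test per numerical variable
        groups = [[x[i] for x, c in records if c == cls] for cls in classes]
        _, p = f_oneway(*groups)
        pvals[i] = p
    for i in categorical:                                # chi-square test per categorical variable
        cats = sorted({x[i] for x, _ in records})
        counts = Counter((x[i], c) for x, c in records)
        table = [[counts[(cat, cls)] for cls in classes] for cat in cats]
        _, p, _, _ = chi2_contingency(table)
        pvals[i] = p
    return min(pvals, key=pvals.get)                     # variable with the smallest p-value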

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

QUEST Split Point Selection

CRIMCOORD transformation of categorical variables into numerical variables:

1. Take categorical variable X with domain dom(X)={x1, …, xl}

2. For each record in the training database, create vector (v1, …, vl) where vi = I(X=xi)

3. Find principal components of set of vectors V

4. Project the dimensionality-reduced data onto the largest discriminant coordinate dxi

5. Replace X with the numerical value dxi in the rest of the algorithm

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

CRIMCOORDs: Examples

• Values(X|Y=1) = {4c1, c2, 5c3}, values(X|Y=2) = {2c1, 2c2, 6c3}  ⇒  dx1 = 1, dx2 = -1, dx3 = -0.3
• Values(X|Y=1) = {5c1, 5c3}, values(X|Y=2) = {5c1, 5c3}  ⇒  dx1 = 1, dx2 = 0, dx3 = 1
• Values(X|Y=1) = {5c1, 5c3}, values(X|Y=2) = {5c1, c2, 5c3}  ⇒  dx1 = 1, dx2 = -1, dx3 = 1

Advantages

• Avoids the exponential subset search of CART
• Each dxi has the form Σ bi*I(X=xi) for some b1, …, bl, thus there is a 1-1 correspondence between subsets of X and a dxi


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

QUEST Split Point Selection

• Assume X is the selected variable (either numerical, or categorical transformed to CRIMCOORDs)
• Group J > 2 classes into two superclasses
• Now the problem is reduced to a one-dimensional two-class problem
  • Use exhaustive search for the best split point (as in CART), or
  • Use quadratic discriminant analysis (QDA, next few bullets)

QUEST Split Point Selection: QDA

• Let x1, x2 and s1^2, s2^2 be the means and variances of the two superclasses
• Make a normal distribution assumption, and find the intersections of the two normal densities N(x1, s1^2) and N(x2, s2^2)
• QDA splits the X-axis into three intervals
• Select as split point the root that is closer to the sample means
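A sketch of the QDA split point as described above: the intersections of the two fitted normal densities are the roots of a quadratic, and the root nearer the two sample means is returned. Class priors are ignored here, and x1, s1, x2, s2 are the sample means and standard deviations of the superclasses (illustrative names):

from math import log, sqrt

def qda_split_point(x1, s1, x2, s2):
    # Solve N(x1, s1^2) density == N(x2, s2^2) density for z
    a = 1.0 / s2**2 - 1.0 / s1**2
    b = -2.0 * (x2 / s2**2 - x1 / s1**2)
    c = x2**2 / s2**2 - x1**2 / s1**2 - 2.0 * log(s1 / s2)
    if abs(a) < 1e-12:                     # equal variances: single intersection (the midpoint)
        return -c / b
    disc = b * b - 4.0 * a * c
    roots = [(-b + sqrt(disc)) / (2.0 * a), (-b - sqrt(disc)) / (2.0 * a)]
    return min(roots, key=lambda z: abs(z - (x1 + x2) / 2.0))

print(qda_split_point(0.0, 1.0, 2.0, 1.5))   # ~1.09 for the N(0,1) vs. N(2,2.25) illustration below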

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Illustration: QDA Splits

[Figure: the densities of N(0,1) and N(2,2.25); the QDA split point is the density intersection that lies between the two means.]

QUEST Linear Combination Splits

• Transform all categorical variables to CRIMCOORDs
• Apply PCA to the correlation matrix of the data
• Drop the smallest principal components, and project the remaining components onto the largest CRIMCOORD
• Group J > 2 classes into two superclasses
• Find the split on the largest CRIMCOORD using ES or QDA


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Key Differences CART/QUEST

Feature                                           QUEST              CART
Variable selection                                Statistical tests  ES
Split point selection                             QDA or ES          ES
Categorical variables                             CRIMCOORDs         ES
Monotone transformations of numerical variables   Not invariant      Invariant
Ordinal variables                                 No                 Yes
Variable selection bias                           No                 Yes (No)

(ES = exhaustive search)

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Tutorial Overview

• Part I: Classification Trees
  • Introduction
  • Classification tree construction schema
  • Split selection
  • Pruning
  • Data access
  • Missing values
  • Evaluation
  • Bias in split selection (Short Break)
• Part II: Regression Trees

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Pruning Methods

• Test dataset pruning
• Direct stopping rule
• Cost-complexity pruning (not covered)
• MDL pruning
• Pruning by randomization testing


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Stopping Policies

A stopping policy indicates when further growth of the tree at a node t is counterproductive.

• All records are of the same class
• The attribute values of all records are identical
• All records have missing values
• At most one class has a number of records larger than a user-specified number
• All records go to the same child node if t is split (only possible with some split selection methods)

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Test Dataset Pruning

• Use an independent test sample D’ to estimate the misclassification cost using the resubstitution estimate R(T,D’) at each node
• Select the subtree T’ of T with the smallest expected cost

Reduced Error Pruning

(Quinlan, C4.5, 1993)

• Assume the observed misclassification rate at a node is p
• Replace p (pessimistically) with the upper 75% confidence bound p’, assuming a binomial distribution
• Then use p’ to estimate the error rate of the node
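One way to compute such a pessimistic estimate is an exact binomial (Clopper-Pearson style) upper bound; this is a sketch of that construction, not necessarily the exact approximation C4.5 uses internally:

from scipy.stats import beta

def pessimistic_error(errors, n, confidence=0.75):
    # Upper confidence bound p' for a binomial error proportion (errors out of n records)
    if errors >= n:
        return 1.0
    return beta.ppf(confidence, errors + 1, n - errors)

# A leaf with 1 error out of 6 records: observed rate 0.167, pessimistic estimate is larger
print(pessimistic_error(1, 6))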


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Pruning Using the MDL Principle

(Mehta, Rissanen, Agrawal, KDD 1996)

Also used before by Fayyad, Quinlan, and others.

• MDL: Minimum Description Length principle
• Idea: Think of the decision tree as encoding the class labels of the records in the training database
• MDL principle: The best tree is the tree that encodes the records using the fewest bits

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

How To Encode a Node

Given a node t, we need to encode the following:

• Nodetype: One bit to encode the type of each node (leaf or internal node)

For an internal node:
• Cost(P(t)): The cost of encoding the splitting predicate P(t) at node t

For a leaf node:
• n*E(t): The cost of encoding the n records from the training database in leaf node t (E(t) is the entropy of t)

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

How To Encode a Tree

Recursive definition of the minimal cost of a node:

• Node t is a leaf node:
  cost(t) = n*E(t)
• Node t is an internal node with children nodes t1 and t2. Choice: either make t a leaf node, or keep the best subtrees, whichever is cheaper:
  cost(t) = min( n*E(t), 1 + cost(P(t)) + cost(t1) + cost(t2) )
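A sketch of this recursion, assuming each node carries class_counts (a dict of class label to count) and a children list, and that split_cost(node) returns the encoding cost of the splitting predicate P(t); these attribute names are illustrative:

from math import log2

def leaf_cost(node):
    # n*E(t): the records in the node encoded with the node's class entropy
    n = sum(node.class_counts.values())
    entropy = -sum((k / n) * log2(k / n) for k in node.class_counts.values() if k > 0)
    return n * entropy

def mdl_cost(node, split_cost):
    # cost(t) = n*E(t) for a leaf, else min(n*E(t), 1 + cost(P(t)) + cost(t1) + cost(t2))
    if not node.children:
        return leaf_cost(node)
    keep = 1 + split_cost(node) + sum(mdl_cost(ch, split_cost) for ch in node.children)
    return min(leaf_cost(node), keep)

def mdl_prune(node, split_cost):
    # Bottom-up: make t a leaf whenever the leaf encoding is at least as cheap as the subtree
    if node.children:
        for ch in node.children:
            mdl_prune(ch, split_cost)
        keep = 1 + split_cost(node) + sum(mdl_cost(ch, split_cost) for ch in node.children)
        if leaf_cost(node) <= keep:
            node.children = None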


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

How to Prune

1. Construct the decision tree to its maximum size
2. Compute the MDL cost for each node of the tree bottom-up
3. Prune the tree bottom-up: if cost(t) = n*E(t), make t a leaf node.

The resulting tree is the final tree output by the pruning algorithm.

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Performance Improvements: PUBLIC

(Shim and Rastogi, VLDB 1998)

• MDL bottom-up pruning requires construction of a complete tree before the bottom-up pruning can start
• Idea: Prune the tree during (not after) the tree construction phase
• Why is this possible?
  • Calculate a lower bound on cost(t) and compare it with n*E(t)

PUBLIC Lower Bound Theorem

• Theorem: Consider a classification problem with k predictor attributes and J classes. Let Tt be a subtree with s internal nodes, rooted at node t, and let ni be the number of records with class label i. Then
  cost(Tt) >= 2*s + 1 + s*log k + Σ ni
• The lower bound on cost(Tt) is thus the minimum of:
  • n*E(t) + 1 (t becomes a leaf node)
  • 2*s + 1 + s*log k + Σ ni (the subtree at t remains)


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Large Datasets Lead to Large Trees

• Oates and Jensen (KDD 1998)
• Problem: Constant probability distribution P, datasets D1, D2, …, Dk with
  |D1| < |D2| < … < |Dk|,  |Dk| = c|Dk-1| = … = c^(k-1)|D1|
• Observation: Trees grow
  |T1| < |T2| < … < |Tk|,  |Tk| = c’|Tk-1| = … = c’^(k-1)|T1|
• But: No gain in accuracy due to the larger trees
  R(T1,D1) ~ R(T2,D2) ~ … ~ R(Tk,Dk)

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Pruning By Randomization Testing

• Reduce the pruning decision at each node to a hypothesis test
• Generate the empirical distribution of the pruning statistic under the null hypothesis for the node

Node n with subtree T(n) and pruning statistic S(n)
For (i = 0; i < K; i++)
  1. Randomize the class labels of the data at n
  2. Build and prune a tree rooted at n
  3. Calculate the pruning statistic Si(n)
Compare S(n) to the empirical distribution of the Si(n) to estimate the significance of S(n)
If S(n) is not significant at significance level alpha, then prune T(n) to n

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Tutorial Overview

• Part I: Classification Trees
  • Introduction
  • Classification tree construction schema
  • Split selection
  • Pruning
  • Data access
  • Missing values
  • Evaluation
  • Bias in split selection (Short Break)
• Part II: Regression Trees


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

SLIQ

Mehta, Agrawal, Rissanen (EDBT 1996)

• Motivation:
  • Scalable data access method for CART
  • To find the best split we need to evaluate the impurity function at all possible split points for each numerical attribute, at each node of the tree
• Ideas:
  • Avoid re-sorting at each node of the tree through pre-sorting and maintenance of sort orders (vertical partitioning)
  • Main-memory resident data structure with schema (class label, leaf node index); very likely to fit in memory for nearly all training databases

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

SLIQ: Pre-Sorting

Training database:
Age  Car  Class
20   M    Yes
30   M    Yes
25   T    No
30   S    Yes
40   S    Yes
20   T    No
30   M    Yes
25   M    Yes
40   M    Yes
20   S    No

Sorted attribute list for Age:
Age  Ind
20   1
20   6
20   10
25   3
25   8
30   2
30   4
30   7
40   5
40   9

Class list:
Ind  Class  Leaf
1    Yes    1
2    Yes    1
3    No     1
4    Yes    1
5    Yes    1
6    No     1
7    Yes    1
8    Yes    1
9    Yes    1
10   No     1

SLIQ: Evaluation of Splits

Sorted attribute list for Age: as on the previous slide.

Class list (leaf column updated):
Ind  Class  Leaf
1    Yes    2
2    Yes    2
3    No     2
4    Yes    3
5    Yes    3
6    No     2
7    Yes    2
8    Yes    2
9    Yes    2
10   No     3

Class histograms:
Node 2:   Yes  No        Node 3:   Yes  No
Left       2    0        Left       0    1
Right      3    2        Right      2    0


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

SLIQ: Splitting of a Node

Class list after splitting (the leaf column now points to nodes 4-7):
Ind  Class  Leaf
1    Yes    4
2    Yes    5
3    No     5
4    Yes    7
5    Yes    7
6    No     4
7    Yes    7
8    Yes    7
9    Yes    7
10   No     6

The sorted attribute list for Age is unchanged.

[Figure: the tree with node 1 as root, children 2 and 3, and leaves 4, 5, 6, 7.]

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

SLIQ: Summary

• Uses vertical partitioning to avoid re-sorting
• Main-memory resident data structure with schema (class label, leaf node index)
  Very likely to fit in memory for nearly all training databases

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

SPRINT

Shafer, Agrawal, Mehta (VLDB 1996)

• Motivation:
  • Scalable data access method for CART
  • Improvement over SLIQ to avoid the main-memory data structure
• Ideas:
  • Create vertical partitions called attribute lists for each attribute
  • Pre-sort the attribute lists

Recursive tree construction:
1. Scan all attribute lists at node t to find the best split
2. Partition the current attribute lists over the children nodes while maintaining the sort orders
3. Recurse


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

SPRINT Attribute Lists

Training database:
Age  Car  Class
20   M    Yes
30   M    Yes
25   T    No
30   S    Yes
40   S    Yes
20   T    No
30   M    Yes
25   M    Yes
40   M    Yes
20   S    No

Attribute list for Age (pre-sorted):
Age  Class  Ind
20   Yes    1
20   No     6
20   No     10
25   No     3
25   Yes    8
30   Yes    2
30   Yes    4
30   Yes    7
40   Yes    5
40   Yes    9

Attribute list for Car:
Car  Class  Ind
M    Yes    1
M    Yes    2
T    No     3
S    Yes    4
S    Yes    5
T    No     6
M    Yes    7
M    Yes    8
M    Yes    9
S    No     10

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

SPRINT: Evaluation of Splits

Attribute list for Age: as on the previous slide. Scanning it for a candidate split point between 20 and 25 gives the class histogram at node 1:

Node 1:   Yes  No
Left       1    2
Right      6    1

SPRINT: Splitting of a Node

1. Scan all attribute lists to find the best split
2. Partition the attribute list of the splitting attribute X
3. For each attribute Xi != X:
   Perform the partitioning step of a hash-join between the attribute list of X and the attribute list of Xi


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

SPRINT: Hash-Join Partitioning

[Figure: hash-join partitioning — the record identifiers (Ind) of the partitioned splitting attribute's list are placed in a hash table, which is probed to route the entries of the Car attribute list to the left or right child.]

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

SPRINT: Summary

• Scalable data access method for the CART split selection method
• Completely scalable; can be (and has been) implemented “inside” a database system
• Hash-join partitioning step is expensive (each attribute, at each node of the tree)

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

RainForest

(Gehrke, Ramakrishnan, Ganti, VLDB 1998)

Training database:           AVC-set for Age:        AVC-set for Car:
Age  Car  Class              Age  Yes  No            Car      Yes  No
20   M    Yes                20   1    2             Sport    2    1
30   M    Yes                25   1    1             Truck    0    2
25   T    No                 30   3    0             Minivan  5    0
30   S    Yes                40   2    0
40   S    Yes
20   T    No
30   M    Yes
25   M    Yes
40   M    Yes
20   S    No
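A sketch of AVC-set construction, assuming records are (x, c) tuples and attributes is a list of attribute indices; the counts for the Car attribute of the training database above come out as in the table:

from collections import defaultdict

def avc_group(records, attributes):
    # For every predictor attribute: (attribute value, class label) -> count
    group = {a: defaultdict(lambda: defaultdict(int)) for a in attributes}
    for x, c in records:
        for a in attributes:
            group[a][x[a]][c] += 1
    return group

# avc_group(D, [0, 1])[1] for the example data gives, roughly,
# {'M': {'Yes': 5}, 'T': {'No': 2}, 'S': {'Yes': 2, 'No': 1}}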


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Refined RainForest Top-Down Schema

BuildTree(Node n, Training database D, Split Selection Method S)
[ (1) Apply S to D to find the splitting criterion ]
(1a) foreach predictor attribute X
(1b)   Call S.findSplit(AVC-set of X)
(1c) endfor
(1d) S.chooseBest();
(2) if (n is not a leaf node) ...

S: C4.5, CART, CHAID, FACT, ID3, GID3, QUEST, etc.

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

RainForest Data Access Method

Assume the data partition at a node is D. Then the following steps are carried out:

1. Construct the AVC-group of the node
2. Choose the splitting attribute and splitting predicate
3. Partition D across the children

RainForest Algorithms: RF-Write

First scan: [Figure: the database is scanned and the AVC-sets of the root node are built in main memory.]


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

RainForest Algorithms: RF-Write

Second scan: [Figure: after the root split (Age < 30?) is chosen, the database is scanned again and written out as Partition 1 and Partition 2.]

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

RainForest Algorithms: RF-Write

Analysis:
• Assumes that the AVC-group of the root node fits into main memory
• Two database scans per level of the tree
• Usually more main memory is available than a single AVC-group needs

[Figure: the first scan builds the AVC-group in main memory; the second scan writes Partition 1 and Partition 2 according to the split Age < 30?, and each partition is then processed recursively.]

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

RainForest Algorithms: RF-Read

First scan: [Figure: the database is scanned and the AVC-sets of the root node are built in main memory.]


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

RainForest Algorithms: RF-Read

Second scan: [Figure: no partitions are written; the database is scanned again and the AVC-sets of both children of the root split (Age < 30) are built in main memory at the same time.]

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

RainForest Algorithms: RF-Read

Third scan: [Figure: the database is scanned once more to build the AVC-sets of the nodes below the splits Age < 30, Sal < 20k, and Car == S.]

RainForest Algorithms: RF-Hybrid

First scan: [Figure: the database is scanned and the AVC-sets of the root node are built in main memory.]


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

RainForest Algorithms: RF-Hybrid

Second scan: [Figure: as in RF-Read, the AVC-sets of the children of the root split (Age < 30) are built in main memory.]

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

RainForest Algorithms: RF-Hybrid

Third scan: [Figure: when the AVC-sets of all nodes at a level no longer fit in main memory, RF-Hybrid switches to RF-Write and partitions the database (Partitions 1-4) below the splits Age < 30, Sal < 20k, and Car == S.]

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

RainForest Algorithms: RF-Hybrid

Further optimization: While writing the partitions, concurrently build the AVC-groups of as many nodes as possible in memory.

[Figure: while Partitions 1-4 are written out below the splits Age < 30, Sal < 20k, and Car == S, leftover main memory is used to build the AVC-groups of some of the new nodes.]


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

BOAT

(Gehrke, Ganti, Ramakrishnan, Loh; SIGMOD 1999)

[Figure: the training database is split at the root on Age (<30 / >=30) into a left partition and a right partition.]

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

BOAT: Algorithm Overview

[Figure: Sampling phase — an in-memory sample yields an approximate tree with bounds. Cleanup phase — the approximate tree and its bounds are refined into the final tree using all the data.]

Tutorial Overview

• Part I: Classification Trees
  • Introduction
  • Classification tree construction schema
  • Split selection
  • Pruning
  • Data access
  • Missing values
  • Evaluation
  • Bias in split selection (Short Break)
• Part II: Regression Trees


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Missing Values

What is the problem?

• During computation of the splitting predicate, we can selectively ignore records with missing values (note that this has some problems)
• But if a record r is missing the value of the splitting attribute, r cannot participate further in tree construction

Algorithms for missing values address this problem.

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Mean and Mode Imputation

Assume record r has missing value r.X, and splitting variable is X.

• Simplest algorithm:
  • If X is numerical (categorical), impute the overall mean (mode)
• Improved algorithm:
  • If X is numerical (categorical), impute mean(X | t.C) (the mode(X | t.C)), i.e., conditioned on the record's class
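A sketch of both variants, assuming records are (x, c) tuples with missing values stored as None; impute by the overall mean/mode, or by the mean/mode within the record's class when by_class is set (illustrative names):

from statistics import mean, mode

def impute(records, attr, numerical, by_class=False):
    center = mean if numerical else mode
    filled = []
    for x, c in records:
        if x[attr] is None:
            pool = [y[attr] for y, d in records
                    if y[attr] is not None and (not by_class or d == c)]
            x = x[:attr] + (center(pool),) + x[attr + 1:]   # replace the missing entry
        filled.append((x, c))
    return filled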

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Surrogate Splits (CART)

Assume record r has missing value r.X, and splitting predicate is PX.

• Idea: Find a splitting predicate QX’ involving another variable X’ != X that is most similar to PX.
• Similarity sim(Q,P|D) between splits Q and P:
  sim(Q,P|D) = ( |{r in D: P(r) and Q(r)}| + |{r in D: not P(r) and not Q(r)}| ) / |D|
  (the fraction of records that P and Q send to the same side)
• 0 <= sim(Q,P|D) <= 1
• sim(P,P|D) = 1


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Surrogate Splits: Example

Consider splitting predicate X1 <= 1.

Sim((X1 <= 1), (X2 <= 1) | D) = (3+4)/10
Sim((X1 <= 1), (X2 <= 2) | D) = (6+3)/10

(X2 <= 2) is the preferred surrogate split.

X1  X2  Class
1   1   Yes
1   1   Yes
1   1   Yes
1   2   Yes
1   2   Yes
1   2   No
2   2   No
2   3   No
2   3   No
2   3   No
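A sketch that reproduces the example, treating a split as a boolean function of a record and measuring the fraction of records sent to the same side (the class labels are not needed for this computation):

def similarity(P, Q, D):
    # sim(Q,P|D): fraction of records on which P and Q agree
    return sum(1 for r in D if P(r) == Q(r)) / len(D)

D = [(1, 1), (1, 1), (1, 1), (1, 2), (1, 2), (1, 2), (2, 2), (2, 3), (2, 3), (2, 3)]
P = lambda r: r[0] <= 1                              # X1 <= 1
print(similarity(P, lambda r: r[1] <= 1, D))         # 0.7 = (3+4)/10
print(similarity(P, lambda r: r[1] <= 2, D))         # 0.9 = (6+3)/10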

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Tutorial Overview

• Part I: Classification Trees
  • Introduction
  • Classification tree construction schema
  • Split selection
  • Pruning
  • Data access
  • Missing values
  • Evaluation
  • Bias in split selection (Short Break)
• Part II: Regression Trees

Choice of Classification Algorithm?

• Example study: Lim, Loh, and Shih (Machine Learning, 2000)
• 33 classification algorithms
• 16 (small) data sets (UC Irvine ML Repository)
• Each algorithm applied to each data set
• Experimental measurements:
  • Classification accuracy
  • Computational speed
  • Classifier complexity


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Experimental Setup

Algorithms:

• Tree-structured classifiers (IND, S-Plus Trees, C4.5, FACT, QUEST, CART, OC1, LMDT, CAL5, T1)
• Statistical methods (LDA, QDA, NN, LOG, FDA, PDA, MDA, POL)
• Neural networks (LVQ, RBF)

Setup:
• 16 primary data sets; 16 more data sets created by adding noise
• Categorical predictor variables converted to 0-1 dummy variables where necessary
• Error rates for 6 data sets estimated from supplied test sets; 10-fold cross-validation used for the other data sets

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Results

Rank  Algorithm            Mean Error  Time
1     Polyclass            0.195       3 hours
2     Quest Multivariate   0.202       4 min
3     Logistic Regression  0.204       4 min
6     LDA                  0.208       10 s
8     IND CART             0.215       47 s
12    C4.5 Rules           0.220       20 s
16    Quest Univariate     0.221       40 s

• The number of leaves for tree-based classifiers varied widely (median number of leaves between 5 and 32, removing some outliers)
• The mean misclassification rates of the top 26 algorithms are not statistically significantly different; the bottom 7 algorithms have significantly higher error rates

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Tutorial Overview

• Part I: Classification Trees
  • Introduction
  • Classification tree construction schema
  • Split selection
  • Pruning
  • Data access
  • Missing values
  • Evaluation
  • Bias in split selection (Short Break)
• Part II: Regression Trees


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Bias in Split Selection for ES

Assume: No correlation with the class label.

• Question: Should we choose Age or Car?
• Answer: We should be equally likely to choose either of them!

Age  Yes  No          Car      Yes  No
20   15   15          Sport    20   20
25   15   15          Truck    20   20
30   15   15          Minivan  20   20
40   15   15

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Formal Definition of the Bias

• Bias: “Odds of choosing X1 over X2 as the split variable when neither X1 nor X2 is correlated with the class label”
• Formally:
  Bias(X1,X2) = log10( P(X1,X2) / (1 - P(X1,X2)) ),
  where P(X1,X2) is the probability of choosing variable X1 over X2.
  We would like Bias(X1,X2) = 0 in the null case.

Formal Definition of the Bias (Contd.)

• Example: Synthetic data with two categorical predictor variables
  • X1: 10 categories
  • X2: 2 categories
• For each category: the same probability of observing “Yes” (no correlation)

[Table sketches: class counts (Yes/No) for a Car variable with categories Car1 … Car10, and for a State variable with categories CA and NY.]


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Evidence of the Bias

[Figure: empirical evidence of the bias for the Gini index, entropy, and gain ratio splitting criteria.]

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

One Explanation

Theorem (Expected Value of the Gini Gain): Assume
• Two class labels
• n: number of categories
• N: number of records
• p: probability of having class label “Yes”

Then: E(gini gain) = 2p(1-p)*(n-1)/N

The expected gini gain increases linearly with the number of categories!

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Bias Correction: Intuition

• The value of the splitting criterion is biased under the null hypothesis.
• Idea: Use the p-value of the criterion:
  the probability that the value of the criterion under the null case is as extreme as the observed value

Method:

1. Compute criterion (gini, entropy, etc.)

2. Compute p-value

3. Choose splitting variable


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Correction Through P-Value

• New p-value criterion:
• Maintains the “good” properties of your favorite splitting criterion
• Theorem: The correction through the p-value is nearly unbiased.

Computation:
1. Exact (randomization statistic; very expensive to compute)
2. Bootstrapping (Monte Carlo simulations; computationally expensive; works only for small p-values)
3. Asymptotic approximations (G^2 for entropy, chi-square distribution for the chi-square test; don’t work well in boundary conditions)

4. Tight approximations (cheap, often work well in practice)

Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Tight Approximation

• Experimental evidence shows that the Gamma distribution approximates the distribution of the gini gain very well.
• We can calculate:
  • Expected gain:
    E(gain) = 2p(1-p)*(n-1)/N
  • Variance of gain:
    Var(gain) = 4p(1-p)/N^2 * [ (1 - 6p - 6p^2) * (Σ 1/Ni – (2n-1)/N) + 2(n-1)p(1-p) ]
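A sketch of the resulting test: match a Gamma distribution to the mean and variance above and use its upper tail as the p-value of the observed gini gain. The moment matching is a standard construction; e_gain and var_gain are assumed to be computed from the formulas on this slide:

from scipy.stats import gamma

def gini_gain_pvalue(observed_gain, e_gain, var_gain):
    # Gamma(shape, scale) with mean e_gain and variance var_gain
    shape = e_gain ** 2 / var_gain
    scale = var_gain / e_gain
    return gamma.sf(observed_gain, shape, scale=scale)   # P(gain >= observed) under the null

# The variable (or split) with the smallest p-value is then selected.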

Problem: ES and Missing Values

Consider a training database with the following schema: (X1, …, Xk, C)

• Assume the projection onto (X1, C) is the following:
  {(1, Class1), (2, Class2), (NULL, Class13), …, (NULL, Class1N)}
  (X1 has missing values except for the first two records)
• Exhaustive search will very likely split on X1!


Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees

Concluding Remarks Part I

• There are many algorithms available for:
  • Split selection
  • Pruning
  • Data access
  • Handling missing values
• Challenges: performance, getting the “right” model, data streams, new applications
