Gehrke and Loh KDD 2001 Tutorial: Advances in Decision Trees
Advances in Decision Tree Construction
Johannes Gehrke, Cornell University, [email protected], http://www.cs.cornell.edu/johannes
Wei-Yin Loh, University of Wisconsin-Madison, [email protected], http://www.stat.wisc.edu/~loh
Tutorial Overview
- Part I: Classification Trees
  - Introduction
  - Classification tree construction schema
  - Split selection
  - Pruning
  - Data access
  - Missing values
  - Evaluation
  - Bias in split selection (Short Break)
- Part II: Regression Trees
Classification
Goal: Learn a function that assigns a record to one of several predefined classes.
Classification Example
- Example training database
- Two predictor attributes: Age and Car-type (Sport, Minivan, and Truck)
- Age is ordered; Car-type is a categorical attribute
- The class label indicates whether the person bought the product
- The dependent attribute is categorical
Age  Car  Class
20   M    Yes
30   M    Yes
25   T    No
30   S    Yes
40   S    Yes
20   T    No
30   M    Yes
25   M    Yes
40   M    Yes
20   S    No
Types of Variables
- Numerical: Domain is ordered and can be represented on the real line (e.g., age, income)
- Nominal or categorical: Domain is a finite set without any natural ordering (e.g., occupation, marital status, race)
- Ordinal: Domain is ordered, but absolute differences between values are unknown (e.g., preference scale, severity of an injury)
Definitions
- Random variables X1, …, Xk (predictor variables) and Y (dependent variable)
- Xi has domain dom(Xi); Y has domain dom(Y)
- P is a probability distribution on dom(X1) x … x dom(Xk) x dom(Y)
- The training database D is a random sample from P
- A predictor d is a function d: dom(X1) x … x dom(Xk) -> dom(Y)
Classification Problem
- C is called the class label; d is called a classifier.
- Let r be a record randomly drawn from P. Define the misclassification rate of d:
  RT(d,P) = P(d(r.X1, …, r.Xk) != r.C)
Problem definition: Given a dataset D that is a random sample from probability distribution P, find a classifier d such that RT(d,P) is minimized.
(More on regression problems in the second part of the tutorial.)
Goals and Requirements
Goals:
- To produce an accurate classifier/regression function
- To understand the structure of the problem
Requirements on the model:
- High accuracy
- Understandable by humans, interpretable
- Fast construction for very large training databases
What are Decision Trees?
[Figure: a decision tree whose root splits on Age (<30 vs. >=30); the Age >= 30 branch is a YES leaf, and the Age < 30 branch splits on Car Type into YES (Minivan) and NO (Sports, Truck). A companion plot shows the corresponding partitioning of the Age / Car Type space.]
Decision Trees
- A decision tree T encodes d (a classifier or regression function) in the form of a tree.
- A node t in T without children is called a leaf node; otherwise t is called an internal node.
- Each internal node has an associated splitting predicate. Most common are binary predicates. Example splitting predicates:
  - Age <= 20
  - Profession in {student, teacher}
  - 5000*Age + 3*Salary - 10000 > 0
Internal and Leaf Nodes
Internal nodes:
- Binary univariate splits:
  - Numerical or ordered X: X <= c, with c in dom(X)
  - Categorical X: X in A, with A a subset of dom(X)
- Binary multivariate splits:
  - Linear combination split on numerical variables: Σ ai*Xi <= c
- k-ary (k > 2) splits are analogous
Leaf nodes:
- Node t is labeled with one class label c in dom(C)
Example
Encoded classifier:
If (Age < 30 and CarType = Minivan) Then YES
If (Age < 30 and (CarType = Sports or CarType = Truck)) Then NO
If (Age >= 30) Then YES
[Figure: the corresponding decision tree: root split on Age (<30 vs. >=30), with a Car Type split on the Age < 30 branch.]
Evaluation of Misclassification Error
Problem:
- In order to quantify the quality of a classifier d, we need to know its misclassification rate RT(d,P).
- But unless we know P, RT(d,P) is unknown.
- Thus we need to estimate RT(d,P) as well as possible.
Approaches:
- Resubstitution estimate
- Test sample estimate
- V-fold cross-validation
Resubstitution Estimate
The resubstitution estimate R(d,D) estimates RT(d,P) of a classifier d using D:
- Let D be the training database with N records.
- R(d,D) = 1/N Σ I(d(r.X1, …, r.Xk) != r.C)
- Intuition: R(d,D) is the proportion of training records that are misclassified by d.
- Problem with the resubstitution estimate: it is overly optimistic; classifiers that overfit the training dataset will have very low resubstitution error.
Test Sample Estimate
- Divide D into D1 and D2
- Use D1 to construct the classifier d
- Then use the resubstitution estimate R(d,D2) to calculate the estimated misclassification error of d
- Unbiased and efficient, but removes D2 from the training dataset D
V-fold Cross Validation
Procedure:
- Construct classifier d from D
- Partition D into V datasets D1, …, DV
- Construct classifier di using D \ Di
- Calculate the estimated misclassification error R(di, Di) of di using test sample Di
Final misclassification estimate:
- Weighted combination of the individual misclassification errors:
  R(d,D) = 1/V Σ R(di, Di)
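To make the cross-validation estimate concrete, here is a minimal Python sketch. Using scikit-learn's DecisionTreeClassifier as the classifier-construction routine is an illustrative assumption, not part of the tutorial; any learner could be substituted.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # stand-in classifier (assumption)

def cv_misclassification_estimate(X, y, V=10, seed=0):
    """V-fold cross-validation estimate R(d,D) = 1/V * sum_i R(d_i, D_i)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), V)   # partition D into D_1, ..., D_V
    errors = []
    for i in range(V):
        test = folds[i]                                   # D_i, used as the test sample
        train = np.concatenate([folds[j] for j in range(V) if j != i])  # D \ D_i
        d_i = DecisionTreeClassifier().fit(X[train], y[train])
        errors.append(np.mean(d_i.predict(X[test]) != y[test]))         # R(d_i, D_i)
    return float(np.mean(errors))
```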
Cross-Validation: Example
[Figure: the cross-validation procedure; classifier d is built from all of D, and classifiers d1, d2, … are each built from D with one fold Di held out.]
Cross-Validation
- The misclassification estimate obtained through cross-validation is usually nearly unbiased.
- Costly computation (we need to compute d and d1, …, dV); computing each di is nearly as expensive as computing d.
- Preferred method for estimating the quality of learning algorithms in the machine learning literature.
Tutorial Overview
- Part I: Classification Trees
  - Introduction
  - Classification tree construction schema
  - Split selection
  - Pruning
  - Data access
  - Missing values
  - Evaluation
  - Bias in split selection (Short Break)
- Part II: Regression Trees
Decision Tree Construction
- Top-down tree construction schema:
  - Examine the training database and find the best splitting predicate for the root node
  - Partition the training database
  - Recurse on each child node

BuildTree(Node t, Training database D, Split Selection Method S)
(1) Apply S to D to find the splitting criterion
(2) if (t is not a leaf node)
(3)   Create children nodes of t
(4)   Partition D into children partitions
(5)   Recurse on each partition
(6) endif
Decision Tree Construction (Contd.)
Three algorithmic components:
- Split selection (CART, C4.5, QUEST, CHAID, CRUISE, …)
- Pruning (direct stopping rule, test dataset pruning, cost-complexity pruning, statistical tests, bootstrapping)
- Data access (CLOUDS, SLIQ, SPRINT, RainForest, BOAT, UnPivot operator)
Split Selection Methods
- Multitude of split selection methods in the literature
- In this tutorial:
  - Impurity-based split selection: CART (most common in today's data mining tools)
  - Model-based split selection: QUEST
Split Selection Methods: CART
- Classification And Regression Trees (Breiman, Friedman, Olshen, Stone, 1984; considered “the” reference on decision tree construction)
- Commercial version sold by Salford Systems (www.salford-systems.com)
- Many other, slightly modified implementations exist (e.g., IBM Intelligent Miner implements the CART split selection method)
CART Split Selection Method
Motivation: We need a way to choose quantitatively between different splitting predicates.
- Idea: Quantify the impurity of a node
- Method: Select the splitting predicate that generates children nodes with minimum impurity, from a space of possible splitting predicates
Intuition: Impurity Function
X1  X2  Class
1   1   Yes
1   2   Yes
1   2   Yes
1   2   Yes
1   2   Yes
1   1   No
2   1   No
2   1   No
2   2   No
2   2   No
[Figure: two candidate splits of the root node (50% Yes, 50% No). The split X1 <= 1 yields children with class proportions (83%, 17%) and (0%, 100%); the split X2 <= 1 yields children with (25%, 75%) and (66%, 33%).]
Impurity Function
Let p(j|t) be the proportion of class j training records at node t. Then the node impurity measure at node t is
i(t) = phi(p(1|t), …, p(J|t))
Properties:
- phi is symmetric
- Maximum value at the arguments (1/J, …, 1/J)
- phi(1,0,…,0) = … = phi(0,…,0,1) = 0
The reduction in impurity through splitting predicate s (node t splits into children tL and tR, which receive fractions pL and pR of t's records) is
∆phi(s,t) = phi(t) - pL*phi(tL) - pR*phi(tR)
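A small sketch of this computation, assuming the Gini index phi(p) = 1 - Σ p_j^2 as the impurity function (entropy would work the same way); the label lists reproduce the X1 <= 1 split of the example data.

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_reduction(parent, left, right, phi=gini):
    """Delta phi(s,t) = phi(t) - pL*phi(tL) - pR*phi(tR)."""
    n, nL, nR = len(parent), len(left), len(right)
    return phi(parent) - (nL / n) * phi(left) - (nR / n) * phi(right)

parent = ["Yes"] * 5 + ["No"] * 5      # root node: (50%, 50%)
left   = ["Yes"] * 5 + ["No"] * 1      # X1 = 1: (83%, 17%)
right  = ["No"] * 4                    # X1 = 2: (0%, 100%)
print(impurity_reduction(parent, left, right))
```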
Example
Root node t: p(1|t) = 0.5, p(2|t) = 0.5
Left child node tL: p(1|tL) = 0.83, p(2|tL) = 0.17
Impurity of the root node: phi(0.5, 0.5)
Impurity of the left child node: phi(0.83, 0.17)
Impurity of the right child node: phi(0.0, 1.0)
Impurity of the whole tree: 0.6*phi(0.83, 0.17) + 0.4*phi(0, 1)
Impurity reduction: phi(0.5, 0.5) - 0.6*phi(0.83, 0.17) - 0.4*phi(0, 1)
[Figure: the split X1 <= 1 of the (50%, 50%) root into an (83%, 17%) Yes child and a (0%, 100%) No child.]
Error Reduction as Impurity Function
- Possible impurity function: the resubstitution error R(T,D).
- Example:
  R(no tree, D) = 0.5
  R(T1, D) = 0.6*0.17
  R(T2, D) = 0.4*0.25 + 0.6*0.33
[Figure: the two candidate trees for the (50%, 50%) root. T1 splits on X1 <= 1, producing children (83%, 17%) Yes and (0%, 100%) No; T2 splits on X2 <= 1, producing children (25%, 75%) No and (66%, 33%) Yes.]
Problems with Resubstitution Error
- Obvious problem: there are situations where no split can decrease impurity.
- Example:
  R(no tree, D) = 0.2
  R(T1, D) = 0.6*0.17 + 0.4*0.25 = 0.2
- More subtle problems exist
[Figure: a node with class proportions (80%, 20%) split on X3 <= 1 into two Yes leaves, one with 6 records and proportions (83%, 17%) and one with 4 records and proportions (75%, 25%).]
Remedy: Concave Impurity Functions
Use impurity functions that are concave: phi'' < 0.
Example concave impurity functions:
- Entropy: phi(t) = -Σj p(j|t) log p(j|t)
- Gini index: phi(t) = 1 - Σj p(j|t)^2
Nonnegative Decrease in Impurity
Theorem: Let phi(p1, …, pJ) be a strictly concave function of (p1, …, pJ) with Σj pj = 1. Then for any split s:
∆phi(s,t) >= 0,
with equality if and only if p(j|tL) = p(j|tR) = p(j|t) for j = 1, …, J.
CART Univariate Split Selection
- Use the Gini index as the impurity function
- For each numerical or ordered attribute X, consider all binary splits s of the form X <= x, where x in dom(X)
- For each categorical attribute X, consider all binary splits s of the form X in A, where A is a subset of dom(X)
- At node t, select the split s* such that ∆phi(s*,t) is maximal over all splits s considered
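The following is an illustrative sketch, not the CART implementation itself: an exhaustive search over the binary splits X <= x of one numerical attribute, scored by the Gini-based impurity reduction, using the tutorial's Age/class example.

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_numeric_split(x, y):
    """Return (best threshold c, best impurity reduction) over all splits X <= c."""
    best_c, best_gain = None, 0.0
    n, parent_impurity = len(y), gini(y)
    for c in np.unique(x)[:-1]:                     # every distinct value is a candidate cut
        left, right = y[x <= c], y[x > c]
        gain = parent_impurity - len(left) / n * gini(left) - len(right) / n * gini(right)
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c, best_gain

age = np.array([20, 30, 25, 30, 40, 20, 30, 25, 40, 20])
cls = np.array(["Yes", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No"])
print(best_numeric_split(age, cls))
```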
CART: Shortcut for Categorical Splits
Computational shortcut for the two-class case (|dom(Y)| = 2):
- Theorem: Let X be a categorical attribute with dom(X) = {b1, …, bk}, let phi be a concave function, and order the categories so that
  p(Y=1 | X=b1) <= … <= p(Y=1 | X=bk).
  Then the best split is of the form X in {b1, b2, …, bl} for some l < k.
- Benefit: We only need to check k-1 subsets of dom(X) instead of 2^(k-1) - 1 subsets.
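A sketch of the two-class shortcut: sort the categories by their proportion of one class and evaluate only the k-1 prefix subsets. The gini() helper is repeated so the snippet is self-contained; the example reuses the Car/class columns of the tutorial's training database.

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_categorical_split_two_class(x, y, positive="Yes"):
    cats = np.unique(x)
    order = sorted(cats, key=lambda b: np.mean(y[x == b] == positive))  # sort by class proportion
    n, parent_impurity = len(y), gini(y)
    best_subset, best_gain = None, 0.0
    for l in range(1, len(order)):                     # only the k-1 prefix subsets
        subset = set(order[:l])
        mask = np.isin(x, list(subset))
        left, right = y[mask], y[~mask]
        gain = parent_impurity - len(left) / n * gini(left) - len(right) / n * gini(right)
        if gain > best_gain:
            best_subset, best_gain = subset, gain
    return best_subset, best_gain

car = np.array(["M", "M", "T", "S", "S", "T", "M", "M", "M", "S"])
cls = np.array(["Yes", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No"])
print(best_categorical_split_two_class(car, cls))
```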
Problems with CART Split Selection
- Biased towards variables with more splits: an M-category variable has 2^(M-1) - 1 possible splits, while an M-valued ordered variable has only M-1 possible splits (explanation and remedy later)
- Computationally expensive for categorical variables with large domains
QUEST: Model-based split selection
“The purpose of models is not to fit the data but to sharpen the questions.”
Samuel Karlin (1923 - )
(11th R. A. Fisher Memorial Lecture, Royal Society, 20 April 1983)
Split Selection Methods: QUEST
- Quick, Unbiased, Efficient, Statistical Tree (Loh and Shih, Statistica Sinica, 1997). Freeware, available at www.stat.wisc.edu/~loh; also implemented in SPSS.
- Main new ideas:
  - Separate splitting predicate selection into variable selection and split point selection
  - Use statistical significance tests instead of an impurity function
QUEST Variable Selection
Let X1, …, Xl be numerical predictor variables, and let Xl+1, …, Xk be categorical predictor variables.
1. Find the p-value of an ANOVA F-test for each numerical variable.
2. Find the p-value of a chi-square test for each categorical variable.
3. Choose the variable Xk' with the overall smallest p-value pk'.
(The actual algorithm is more complicated.)
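A hedged sketch of this variable-selection step: ANOVA F-test p-values for numerical predictors, chi-square p-values for categorical predictors, and the smallest p-value wins. The real QUEST algorithm adds further corrections (e.g., a Bonferroni adjustment), which are omitted here.

```python
import numpy as np
from scipy.stats import f_oneway, chi2_contingency

def quest_select_variable(numeric, categorical, y):
    """numeric/categorical: dicts mapping variable name -> 1-D numpy array of values."""
    pvalues = {}
    classes = np.unique(y)
    for name, x in numeric.items():
        groups = [x[y == c] for c in classes]          # one sample of X per class
        pvalues[name] = f_oneway(*groups).pvalue       # ANOVA F-test
    for name, x in categorical.items():
        table = [[np.sum((x == v) & (y == c)) for c in classes] for v in np.unique(x)]
        pvalues[name] = chi2_contingency(table)[1]     # chi-square test of independence
    return min(pvalues, key=pvalues.get), pvalues

age = np.array([20, 30, 25, 30, 40, 20, 30, 25, 40, 20])
car = np.array(["M", "M", "T", "S", "S", "T", "M", "M", "M", "S"])
cls = np.array(["Yes", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No"])
print(quest_select_variable({"Age": age}, {"Car": car}, cls))
```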
QUEST Split Point Selection
CRIMCOORD transformation of a categorical variable into a numerical variable:
1. Take a categorical variable X with domain dom(X) = {x1, …, xl}.
2. For each record in the training database, create the indicator vector (v1, …, vl) where vi = I(X = xi).
3. Find the principal components of the set of vectors V.
4. Project the dimensionality-reduced data onto the largest discriminant coordinate, yielding a value dxi for each category xi.
5. Replace X with the numerical value dxi in the rest of the algorithm.
CRIMCOORDs: Examples
- Values(X|Y=1) = {4c1, c2, 5c3}, Values(X|Y=2) = {2c1, 2c2, 6c3}  =>  dx1 = 1, dx2 = -1, dx3 = -0.3
- Values(X|Y=1) = {5c1, 5c3}, Values(X|Y=2) = {5c1, 5c3}  =>  dx1 = 1, dx2 = 0, dx3 = 1
- Values(X|Y=1) = {5c1, 5c3}, Values(X|Y=2) = {5c1, c2, 5c3}  =>  dx1 = 1, dx2 = -1, dx3 = 1
Advantages:
- Avoids the exponential subset search of CART
- Each dxi has the form Σi bi*I(X = xi) for some b1, …, bl; thus there is a 1-1 correspondence between subsets of X and the dxi
QUEST Split Point Selection
- Assume X is the selected variable (either numerical, or categorical transformed to CRIMCOORDs)
- Group the J > 2 classes into two superclasses
- Now the problem is reduced to a one-dimensional, two-class problem:
  - Use exhaustive search for the best split point (as in CART), or
  - Use quadratic discriminant analysis (QDA, next few bullets)
QUEST Split Point Selection: QDA
- Let x1, x2 and s1^2, s2^2 be the means and variances of X in the two superclasses
- Make a normal distribution assumption, and find the intersections of the two normal densities N(x1, s1^2) and N(x2, s2^2)
- QDA splits the X-axis into three intervals
- Select as the split point the root that is closer to the sample means
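A sketch of this computation under the stated normality assumption: setting the two normal log-densities equal gives a quadratic in x whose roots are the candidate split points. Picking the root nearest the midpoint of the two means is a simplification of the "closer to the sample means" rule.

```python
import numpy as np

def qda_split_point(x, superclass):
    """x: 1-D array of attribute values; superclass: boolean mask of superclass 1."""
    m1, m2 = x[superclass].mean(), x[~superclass].mean()
    v1, v2 = x[superclass].var(ddof=1), x[~superclass].var(ddof=1)
    # Equal log-densities of N(m1,v1) and N(m2,v2):  a*t^2 + b*t + c = 0
    a = 1.0 / v2 - 1.0 / v1
    b = 2.0 * (m1 / v1 - m2 / v2)
    c = m2 ** 2 / v2 - m1 ** 2 / v1 + np.log(v2 / v1)
    if abs(a) < 1e-12:                        # equal variances: a single intersection
        return -c / b
    roots = np.roots([a, b, c]).real          # two real intersections when v1 != v2
    return min(roots, key=lambda r: abs(r - (m1 + m2) / 2.0))

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 50), rng.normal(2, 1.5, 50)])
sc = np.array([True] * 50 + [False] * 50)
print(qda_split_point(x, sc))
```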
Illustration: QDA Splits
[Figure: the densities of N(0,1) and N(2, 2.25) plotted on one axis; the QDA split candidates are the points where the two densities intersect.]
QUEST Linear Combination Splits
- Transform all categorical variables to CRIMCOORDs
- Apply PCA to the correlation matrix of the data
- Drop the smallest principal components, and project the remaining components onto the largest CRIMCOORD
- Group the J > 2 classes into two superclasses
- Find the split on the largest CRIMCOORD using exhaustive search (ES) or QDA
Key Differences CART/QUEST
Feature                                          QUEST              CART
Variable selection                               Statistical tests  ES
Split point selection                            QDA or ES          ES
Categorical variables                            CRIMCOORDs         ES
Monotone transformations of numerical variables  Not invariant      Invariant
Ordinal variables                                No                 Yes
Variable selection bias                          No                 Yes (No)
(ES = exhaustive search)
Tutorial Overview
- Part I: Classification Trees
  - Introduction
  - Classification tree construction schema
  - Split selection
  - Pruning
  - Data access
  - Missing values
  - Evaluation
  - Bias in split selection (Short Break)
- Part II: Regression Trees
Pruning Methods
- Test dataset pruning
- Direct stopping rule
- Cost-complexity pruning (not covered)
- MDL pruning
- Pruning by randomization testing
Stopping Policies
A stopping policy indicates when further growth of the tree at a node t is counterproductive:
- All records are of the same class
- The attribute values of all records are identical
- All records have missing values
- At most one class has a number of records larger than a user-specified number
- All records go to the same child node if t is split (only possible with some split selection methods)
Test Dataset Pruning
- Use an independent test sample D' to estimate the misclassification cost at each node, using the resubstitution estimate R(T, D')
- Select the subtree T' of T with the smallest expected cost
Reduced Error Pruning
(Quinlan, C4.5, 1993)
- Assume the observed misclassification rate at a node is p
- Replace p (pessimistically) with the upper 75% confidence bound p', assuming a binomial distribution
- Then use p' to estimate the error rate of the node
Pruning Using the MDL Principle
(Mehta, Rissanen, Agrawal, KDD 1996)
Also used earlier by Fayyad, Quinlan, and others.
- MDL: Minimum Description Length principle
- Idea: Think of the decision tree as encoding the class labels of the records in the training database
- MDL principle: The best tree is the tree that encodes the records using the fewest bits
How To Encode a Node
Given a node t, we need to encode the following:
- Node type: one bit to encode the type of each node (leaf or internal node)
For an internal node:
- Cost(P(t)): the cost of encoding the splitting predicate P(t) at node t
For a leaf node:
- n*E(t): the cost of encoding the n training-database records in leaf node t, where E(t) is the entropy of t
How To Encode a Tree
Recursive definition of the minimal cost of a node:
- Node t is a leaf node:
  cost(t) = n*E(t)
- Node t is an internal node with children nodes t1 and t2. Choice: either make t a leaf node, or keep the best subtrees, whichever is cheaper:
  cost(t) = min( n*E(t), 1 + cost(P(t)) + cost(t1) + cost(t2) )
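A minimal sketch of this recursion. The cost of encoding a splitting predicate, cost(P(t)), depends on the predicate's type; in this sketch it is simply supplied per node as an assumption rather than derived from the predicate.

```python
import math

def entropy(class_counts):
    n = sum(class_counts)
    return -sum(c / n * math.log2(c / n) for c in class_counts if c > 0)

def mdl_cost(node):
    """node: dict with 'counts' (class counts) and, for internal nodes, 'children' and 'split_cost'."""
    counts = node["counts"]
    leaf_cost = sum(counts) * entropy(counts)           # n * E(t)
    if "children" not in node:                           # t is already a leaf
        return leaf_cost
    subtree_cost = 1 + node["split_cost"] + sum(mdl_cost(c) for c in node["children"])
    if leaf_cost <= subtree_cost:                        # cheaper to encode t as a leaf: prune
        node.pop("children")
        return leaf_cost
    return subtree_cost

# Tiny example: an internal node with two fairly pure children (it gets pruned here).
tree = {"counts": [6, 4], "split_cost": 3.0,
        "children": [{"counts": [5, 1]}, {"counts": [1, 3]}]}
print(mdl_cost(tree))
```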
How to Prune
1. Construct the decision tree to its maximum size.
2. Compute the MDL cost for each node of the tree bottom-up.
3. Prune the tree bottom-up: if cost(t) = n*E(t), make t a leaf node.
The resulting tree is the final tree output by the pruning algorithm.
Performance Improvements: PUBLIC
(Rastogi and Shim, VLDB 1998)
- MDL bottom-up pruning requires construction of the complete tree before bottom-up pruning can start
- Idea: Prune the tree during (not after) the tree-construction phase
- Why is this possible?
  - Calculate a lower bound on cost(t) and compare it with n*E(t)
PUBLIC Lower Bound Theorem
- Theorem: Consider a classification problem with k predictor attributes and J classes. Let Tt be a subtree with s internal nodes, rooted at node t, and let ni be the number of records with class label i. Then
  cost(Tt) >= 2*s + 1 + s*log k + Σ ni
- The lower bound on cost(Tt) is thus the minimum of:
  - n*E(t) + 1 (t becomes a leaf node)
  - 2*s + 1 + s*log k + Σ ni (the subtree at t remains)
Large Datasets Lead to Large Trees
- Oates and Jensen (KDD 1998)
- Problem: constant probability distribution P, datasets D1, D2, …, Dk with
  |D1| < |D2| < … < |Dk|,  |Dk| = c*|Dk-1| = … = c^k*|D1|
- Observation: the trees grow:
  |T1| < |T2| < … < |Tk|,  |Tk| = c'*|Tk-1| = … = c'^k*|T1|
- But: no gain in accuracy from the larger trees:
  R(T1,D1) ~ R(T2,D2) ~ … ~ R(Tk,Dk)
Pruning By Randomization Testing
- Reduce the pruning decision at each node to a hypothesis test
- Generate the empirical distribution of the pruning statistic under the null hypothesis for a node:
  Node n with subtree T(n) and pruning statistic S(n)
  For (i = 0; i < K; i++):
    1. Randomize the class labels of the data at n
    2. Build and prune a tree rooted at n
    3. Calculate the pruning statistic Si(n)
  Compare S(n) to the empirical distribution of the Si(n) to estimate the significance of S(n).
  If S(n) is not significant at significance level alpha, then prune T(n) to n.
Tutorial Overview
- Part I: Classification Trees
  - Introduction
  - Classification tree construction schema
  - Split selection
  - Pruning
  - Data access
  - Missing values
  - Evaluation
  - Bias in split selection (Short Break)
- Part II: Regression Trees
SLIQ
(Mehta, Agrawal, Rissanen; EDBT 1996)
- Motivation:
  - A scalable data access method for CART
  - To find the best split, we need to evaluate the impurity function at all possible split points for each numerical attribute, at each node of the tree
- Ideas:
  - Avoid re-sorting at each node of the tree through pre-sorting and maintenance of sort orders (vertical partitioning)
  - Keep a main-memory resident data structure with schema (class label, leaf node index), which is very likely to fit in memory for nearly all training databases
SLIQ: Pre-Sorting
Training database:
(the 10-record Age / Car / Class table shown earlier)

Sorted Age attribute list:
Age  Ind
20   1
20   6
20   10
25   3
25   8
30   2
30   4
30   7
40   5
40   9

Class list:
Ind  Class  Leaf
1    Yes    1
2    Yes    1
3    No     1
4    Yes    1
5    Yes    1
6    No     1
7    Yes    1
8    Yes    1
9    Yes    1
10   No     1
SLIQ: Evaluation of Splits
Sorted Age attribute list: (as above)

Class list (leaf assignments after the root split):
Ind  Class  Leaf
1    Yes    2
2    Yes    2
3    No     2
4    Yes    3
5    Yes    3
6    No     2
7    Yes    2
8    Yes    2
9    Yes    2
10   No     3

Class histograms maintained while scanning the Age list:
Node 2:   Yes  No
  Left     2    0
  Right    3    2
Node 3:   Yes  No
  Left     0    1
  Right    2    0
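A sketch of this one-scan evaluation: walk the pre-sorted attribute list, move each record's class count from the "right" to the "left" histogram of the leaf it currently belongs to, and score the candidate split after each entry (the real algorithm only scores between distinct attribute values). The data structures mirror the attribute list and class list above.

```python
from collections import defaultdict

def gini_from_counts(counts):
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def sliq_best_splits(attribute_list, class_list):
    """attribute_list: sorted [(value, rid)]; class_list: {rid: (label, leaf)}."""
    left = defaultdict(lambda: defaultdict(int))        # per-leaf class histograms
    right = defaultdict(lambda: defaultdict(int))
    for label, leaf in class_list.values():
        right[leaf][label] += 1                         # initially all records are "right"
    best = {}                                           # leaf -> (best gain, threshold)
    for value, rid in attribute_list:
        label, leaf = class_list[rid]
        right[leaf][label] -= 1                         # the record moves to the left side
        left[leaf][label] += 1
        nL, nR = sum(left[leaf].values()), sum(right[leaf].values())
        parent = {k: left[leaf][k] + right[leaf][k] for k in set(left[leaf]) | set(right[leaf])}
        gain = (gini_from_counts(parent)
                - nL / (nL + nR) * gini_from_counts(left[leaf])
                - nR / (nL + nR) * gini_from_counts(right[leaf]))
        if gain > best.get(leaf, (-1.0, None))[0]:
            best[leaf] = (gain, value)                  # candidate split: X <= value
    return best

attr = [(20, 1), (20, 6), (25, 3), (30, 2), (30, 7)]
cls = {1: ("Yes", 2), 6: ("No", 2), 3: ("No", 2), 2: ("Yes", 2), 7: ("Yes", 2)}
print(sliq_best_splits(attr, cls))
```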
SLIQ: Splitting of a Node
Class list after splitting nodes 2 and 3 (new leaves 4-7):
Ind  Class  Leaf
1    Yes    4
2    Yes    5
3    No     5
4    Yes    7
5    Yes    7
6    No     4
7    Yes    7
8    Yes    7
9    Yes    7
10   No     6

Sorted Age attribute list: (unchanged)

[Figure: the growing tree; root node 1 has children 2 and 3, which in turn have children 4, 5, 6, and 7.]
SLIQ: Summary
- Uses vertical partitioning to avoid re-sorting
- Main-memory resident data structure with schema (class label, leaf node index); very likely to fit in memory for nearly all training databases
SPRINT
Shafer, Agrawal, Mehta (VLDB 1996)
- Motivation:
  - A scalable data access method for CART
  - An improvement over SLIQ that avoids the main-memory data structure
- Ideas:
  - Create vertical partitions, called attribute lists, for each attribute
  - Pre-sort the attribute lists
Recursive tree construction:
1. Scan all attribute lists at node t to find the best split
2. Partition the current attribute lists over the children nodes while maintaining the sort orders
3. Recurse
SPRINT Attribute Lists
Training database:
(the 10-record Age / Car / Class table shown earlier)

Age attribute list:
Age  Class  Ind
20   Yes    1
20   No     6
20   No     10
25   No     3
25   Yes    8
30   Yes    2
30   Yes    4
30   Yes    7
40   Yes    5
40   Yes    9

Car attribute list:
Car  Class  Ind
M    Yes    1
M    Yes    2
T    No     3
S    Yes    4
S    Yes    5
T    No     6
M    Yes    7
M    Yes    8
M    Yes    9
S    No     10
SPRINT: Evaluation of Splits
Age attribute list: (as above)

Class histogram at node 1 while scanning the Age attribute list (here after the first three entries, i.e., for the candidate split Age <= 20):
Node 1:   Yes  No
  Left     1    2
  Right    6    1
SPRINT: Splitting of a Node
1. Scan all attribute lists to find the best split.
2. Partition the attribute list of the splitting attribute X.
3. For each attribute Xi != X: perform the partitioning step of a hash-join between the attribute list of X and the attribute list of Xi.
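A sketch of this hash-join step: the record ids that the splitting attribute's list sends to the left child are collected in a hash table, and every other attribute list is split by probing that table, which preserves its sort order. The example data follow the attribute lists above; the split Age <= 25 is chosen only for illustration.

```python
def partition_splitting_list(split_list, predicate):
    """split_list: [(value, label, rid)]; predicate routes a value to the left child."""
    left  = [row for row in split_list if predicate(row[0])]
    right = [row for row in split_list if not predicate(row[0])]
    return left, right

def hash_join_partition(other_list, left_rids):
    """Split another attribute list by probing the hash table of left record ids."""
    left  = [row for row in other_list if row[2] in left_rids]
    right = [row for row in other_list if row[2] not in left_rids]
    return left, right

age_list = [(20, "Yes", 1), (20, "No", 6), (20, "No", 10), (25, "No", 3), (25, "Yes", 8),
            (30, "Yes", 2), (30, "Yes", 4), (30, "Yes", 7), (40, "Yes", 5), (40, "Yes", 9)]
car_list = [("M", "Yes", 1), ("M", "Yes", 2), ("T", "No", 3), ("S", "Yes", 4), ("S", "Yes", 5),
            ("T", "No", 6), ("M", "Yes", 7), ("M", "Yes", 8), ("M", "Yes", 9), ("S", "No", 10)]

age_left, age_right = partition_splitting_list(age_list, lambda age: age <= 25)
left_rids = {rid for _, _, rid in age_left}          # the hash table of left record ids
print(hash_join_partition(car_list, left_rids))
```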
SPRINT: Hash-Join Partitioning
[Figure: the hash-join probe; the record indices of the Age attribute list's partitions form a hash table, and while scanning the Car attribute list each entry (e.g., the Minivan rows with indices 1, 2, 7, 8, 9) is routed to the left or right child according to the table.]
SPRINT: Summary
- A scalable data access method for the CART split selection method
- Completely scalable; can be (and has been) implemented “inside” a database system
- The hash-join partitioning step is expensive (it is performed for each attribute, at each node of the tree)
RainForest
(Gehrke, Ramakrishnan, Ganti, VLDB 1998)
Training database:
(the 10-record Age / Car / Class table shown earlier)

AVC-sets at the root node:
Age  Yes  No        Car      Yes  No
20    1    2        Sport     2    1
25    1    1        Truck     0    2
30    3    0        Minivan   5    0
40    2    0
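A small sketch of AVC-set construction: one pass over the node's records produces, for every predictor attribute, the class-label counts per attribute value, i.e., exactly the two tables above.

```python
from collections import Counter

def build_avc_group(records, predictors, class_attr="Class"):
    """records: list of dicts; returns {attribute: {value: Counter(class label -> count)}}."""
    avc = {a: {} for a in predictors}
    for r in records:
        for a in predictors:
            avc[a].setdefault(r[a], Counter())[r[class_attr]] += 1
    return avc

data = [{"Age": 20, "Car": "M", "Class": "Yes"}, {"Age": 30, "Car": "M", "Class": "Yes"},
        {"Age": 25, "Car": "T", "Class": "No"},  {"Age": 30, "Car": "S", "Class": "Yes"},
        {"Age": 40, "Car": "S", "Class": "Yes"}, {"Age": 20, "Car": "T", "Class": "No"},
        {"Age": 30, "Car": "M", "Class": "Yes"}, {"Age": 25, "Car": "M", "Class": "Yes"},
        {"Age": 40, "Car": "M", "Class": "Yes"}, {"Age": 20, "Car": "S", "Class": "No"}]
print(build_avc_group(data, ["Age", "Car"]))
```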
Refined RainForest Top-Down Schema
BuildTree(Node n, Training database D, Split Selection Method S)
[ (1) Apply S to D to find the splitting criterion ]
(1a) foreach predictor attribute X
(1b)   Call S.findSplit(AVC-set of X)
(1c) endfor
(1d) S.chooseBest();
(2) if (n is not a leaf node) ...

S: C4.5, CART, CHAID, FACT, ID3, GID3, QUEST, etc.
RainForest Data Access Method
Assume the data partition at a node is D. Then the following steps are carried out:
1. Construct the AVC-group of the node
2. Choose the splitting attribute and the splitting predicate
3. Partition D across the children
RainForest Algorithms: RF-Write, first scan:
[Figure: the database is scanned and the root node's AVC-sets are built in main memory.]
RainForest Algorithms: RF-Write, second scan:
[Figure: the database is scanned again and partitioned across the children of the root split (Age < 30); Partition 1 and Partition 2 are written out.]
RainForest Algorithms: RF-Write
Analysis:
- Assumes that the AVC-group of the root node fits into main memory
- Two database scans per level of the tree
- Usually more main memory is available than a single AVC-group needs
[Figure: main memory holds one AVC-group at a time while the database and its partitions (root split Age < 30) reside on disk.]
RainForest Algorithms: RF-Read, first scan:
[Figure: the database is scanned and the root node's AVC-sets are built in main memory.]
RainForest Algorithms: RF-Read, second scan:
[Figure: instead of writing partitions, the whole database is re-scanned and the AVC-sets of both children of the root split (Age < 30) are built in main memory.]
RainForest Algorithms: RF-Read, third scan:
[Figure: the database is scanned once more to build the AVC-sets of the nodes at the next level (splits Age < 30, Sal < 20k, Car == S).]
RainForest Algorithms: RF-Hybrid, first scan:
[Figure: the database is scanned and the root node's AVC-sets are built in main memory.]
RainForest Algorithms: RF-Hybrid, second scan:
[Figure: as in RF-Read, the database is re-scanned and the AVC-sets of the children of the root split (Age < 30) are built in main memory as long as they fit.]
RainForest Algorithms: RF-Hybrid, third scan:
[Figure: when the AVC-groups of a level no longer fit in memory, RF-Hybrid switches to RF-Write and partitions the database (Partitions 1-4) across the leaves of the current tree (splits Age < 30, Sal < 20k, Car == S).]
RainForest Algorithms: RF-Hybrid
Further optimization: while writing the partitions, concurrently build the AVC-groups of as many nodes as possible in memory.
[Figure: the database is partitioned (Partitions 1-4) while AVC-groups of some of the new nodes (splits Age < 30, Sal < 20k, Car == S) are built in main memory during the same scan.]
BOAT
(Gehrke, Ganti, Ramakrishnan, Loh; SIGMOD 1999)
[Figure: the training database is partitioned by the root split Age < 30 vs. Age >= 30 into a left and a right partition.]
BOAT: Algorithm Overview
[Figure: BOAT's two phases. A sampling phase builds an approximate tree with bounds from an in-memory sample; a cleanup phase then uses all the data to turn the approximate tree into the final tree.]
Tutorial Overview
- Part I: Classification Trees
  - Introduction
  - Classification tree construction schema
  - Split selection
  - Pruning
  - Data access
  - Missing values
  - Evaluation
  - Bias in split selection (Short Break)
- Part II: Regression Trees
Missing Values
What is the problem?
- During computation of the splitting predicate, we can selectively ignore records with missing values (note that this has some problems)
- But if a record r is missing the value of the splitting attribute, r cannot participate further in tree construction
Algorithms for missing values address this problem.
Mean and Mode Imputation
Assume record r has a missing value for r.X, and the splitting variable is X.
- Simplest algorithm:
  - If X is numerical (categorical), impute the overall mean (mode)
- Improved algorithm:
  - If X is numerical (categorical), impute mean(X | t.C) (respectively mode(X | t.C))
Surrogate Splits (CART)
Assume record r has a missing value for r.X, and the splitting predicate is PX.
- Idea: find a splitting predicate QX' involving another variable X' != X that is most similar to PX.
- Similarity sim(Q, P | D) between splits Q and P: the fraction of records in D that P and Q send to the same child,
  sim(Q, P | D) = |{r in D : P(r) = Q(r)}| / |D|
  - 0 <= sim(Q, P | D) <= 1
  - sim(P, P | D) = 1
Surrogate Splits: Example
Consider the splitting predicate X1 <= 1.
sim((X1 <= 1), (X2 <= 1) | D) = (3 + 4)/10
sim((X1 <= 1), (X2 <= 2) | D) = (6 + 3)/10
(X2 <= 2) is the preferred surrogate split.

X1  X2  Class
1   1   Yes
1   1   Yes
1   1   Yes
1   2   Yes
1   2   Yes
1   2   No
2   2   No
2   3   No
2   3   No
2   3   No
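A sketch of this computation on the example data; sim() counts the fraction of records that the primary split and a candidate surrogate send to the same child.

```python
def sim(P, Q, records):
    """Fraction of records routed the same way by splitting predicates P and Q."""
    return sum(P(r) == Q(r) for r in records) / len(records)

# The (X1, X2) pairs from the table above.
D = [(1, 1), (1, 1), (1, 1), (1, 2), (1, 2), (1, 2), (2, 2), (2, 3), (2, 3), (2, 3)]
P  = lambda r: r[0] <= 1          # primary split:      X1 <= 1
Q1 = lambda r: r[1] <= 1          # candidate surrogate: X2 <= 1
Q2 = lambda r: r[1] <= 2          # candidate surrogate: X2 <= 2
print(sim(P, Q1, D), sim(P, Q2, D))   # 0.7 and 0.9, as on the slide
```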
Tutorial Overview
- Part I: Classification Trees
  - Introduction
  - Classification tree construction schema
  - Split selection
  - Pruning
  - Data access
  - Missing values
  - Evaluation
  - Bias in split selection (Short Break)
- Part II: Regression Trees
Choice of Classification Algorithm?
- Example study: Lim, Loh, and Shih (Machine Learning, 2000)
- 33 classification algorithms
- 16 (small) data sets (UC Irvine ML Repository)
- Each algorithm applied to each data set
- Experimental measurements:
  - Classification accuracy
  - Computational speed
  - Classifier complexity
Experimental Setup
Algorithms:
- Tree-structured classifiers (IND, S-Plus trees, C4.5, FACT, QUEST, CART, OC1, LMDT, CAL5, T1)
- Statistical methods (LDA, QDA, NN, LOG, FDA, PDA, MDA, POL)
- Neural networks (LVQ, RBF)
Setup:
- 16 primary data sets; 16 more data sets created by adding noise
- Categorical predictor variables converted to 0-1 dummy variables where necessary
- Error rates for 6 data sets estimated from supplied test sets; 10-fold cross-validation used for the other data sets
Results
Rank  Algorithm             Mean Error  Time
1     Polyclass             0.195       3 hours
2     QUEST multivariate    0.202       4 min
3     Logistic regression   0.204       4 min
6     LDA                   0.208       10 s
8     IND CART              0.215       47 s
12    C4.5 rules            0.220       20 s
16    QUEST univariate      0.221       40 s
…
- The number of leaves for the tree-based classifiers varied widely (median number of leaves between 5 and 32, removing some outliers)
- The mean misclassification rates of the top 26 algorithms are not statistically significantly different; the bottom 7 algorithms have significantly higher error rates
Tutorial Overview
- Part I: Classification Trees
  - Introduction
  - Classification tree construction schema
  - Split selection
  - Pruning
  - Data access
  - Missing values
  - Evaluation
  - Bias in split selection (Short Break)
- Part II: Regression Trees
Bias in Split Selection for ES
Assume: neither attribute is correlated with the class label.
- Question: Should we choose Age or Car?
- Answer: We should choose each of them equally often!

Age  Yes  No        Car      Yes  No
20   15   15        Sport    20   20
25   15   15        Truck    20   20
30   15   15        Minivan  20   20
40   15   15
Formal Definition of the Bias
- Bias: the odds of choosing X1 over X2 as the split variable when neither X1 nor X2 is correlated with the class label
- Formally:
  Bias(X1, X2) = log10( P(X1, X2) / (1 - P(X1, X2)) ),
  where P(X1, X2) is the probability of choosing variable X1 over X2.
  We would like Bias(X1, X2) = 0 in the null case.
Formal Definition of the Bias (Contd.)
- Example: synthetic data with two categorical predictor variables:
  - X1: 10 categories (Car1, …, Car10)
  - X2: 2 categories (CA, NY)
- For each category: the same probability of class “Yes” (no correlation)
[Table sketches: a Car table with rows Car1 … Car10 and Yes/No columns, and a State table with rows CA and NY and Yes/No columns.]
Evidence of the Bias
[Figure: empirical evidence of the bias for the Gini index, entropy, and gain-ratio criteria.]
One Explanation
Theorem (expected value of the Gini gain). Assume:
- Two class labels
- n: number of categories
- N: number of records
- p: probability of having class label “Yes”
Then: E(Gini gain) = 2p(1-p)*(n-1)/N.
The expected Gini gain increases linearly with the number of categories!
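As an illustrative check (not from the tutorial), the following Monte Carlo simulation estimates the average Gini gain of an n-way split on a pure-noise categorical attribute and compares it with 2p(1-p)(n-1)/N; the sample size and number of repetitions are arbitrary assumptions, and the match is only approximate.

```python
import numpy as np

def gini(y):
    p = y.mean()
    return 2 * p * (1 - p)                   # two-class Gini index

def gini_gain_random_split(N, n, p, rng):
    y = rng.random(N) < p                    # class labels, independent of the attribute
    x = rng.integers(0, n, N)                # attribute with n categories, pure noise
    gain = gini(y)
    for v in range(n):
        mask = x == v
        if mask.any():
            gain -= mask.mean() * gini(y[mask])
    return gain

rng = np.random.default_rng(0)
N, p = 200, 0.5
for n in (2, 5, 10):
    avg = np.mean([gini_gain_random_split(N, n, p, rng) for _ in range(2000)])
    print(n, avg, 2 * p * (1 - p) * (n - 1) / N)   # empirical mean vs. 2p(1-p)(n-1)/N
```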
Bias Correction: Intuition
- The value of the splitting criterion is biased under the null hypothesis.
- Idea: use the p-value of the criterion, i.e., the probability that the value of the criterion under the null case is as extreme as the observed value.
Method:
1. Compute the criterion (Gini, entropy, etc.)
2. Compute its p-value
3. Choose the splitting variable with the smallest p-value
Correction Through P-Value
- New p-value criterion:
  - Maintains the “good” properties of your favorite splitting criterion
  - Theorem: the correction through the p-value is nearly unbiased.
Computation of the p-value:
1. Exact (randomization statistic; very expensive to compute)
2. Bootstrapping (Monte Carlo simulation; computationally expensive; works only for small p-values)
3. Asymptotic approximations (G^2 for entropy, chi-square distribution for the chi-square test; these do not work well in boundary conditions)
4. Tight approximations (cheap, often work well in practice)
Tight Approximation
- Experimental evidence shows that a Gamma distribution approximates the Gini gain very well.
- We can calculate:
  - Expected gain: E(gain) = 2p(1-p)*(n-1)/N
  - Variance of the gain:
    Var(gain) = 4p(1-p)/N^2 * [ (1 - 6p - 6p^2) * (Σ 1/Ni - (2n-1)/N) + 2(n-1)p(1-p) ]
Problem: ES and Missing Values
Consider a training database with the following schema: (X1, …, Xk, C)
- Assume the projection onto (X1, C) is the following:
  {(1, Class1), (2, Class2), (NULL, Class3), …, (NULL, ClassN)}
  (X1 has missing values in all but the first two records)
- Exhaustive search will very likely split on X1!
Concluding Remarks Part I
- There are many algorithms available for:
  - Split selection
  - Pruning
  - Data access
  - Handling missing values
- Challenges: performance, getting the “right” model, data streams, new applications