6.867 Machine learning and neural networks
Problem set 3
Deadline: November 2, 11am in recitation
What and how to turn in?
Turn in short written answers to the questions explicitly stated, and when requested to explain or prove. Do not turn in answers when requested to “think”, “consider”, “try” or
“experiment” (except when specifically instructed). You are encouraged to think about, and turn in, answers to questions marked “optional”— they will be read and corrected, but a grade will not be recorded for them.
Turn in all MATLAB code explicitly requested, or that you used to calculate requested values. It should be clear exactly what command was used to get the answer to each question.
To help the graders (including yourself...), please be neat, answer the questions briefly, and in the order they are stated. Staple each “Problem” separately, and be sure to write your name on the top of every page.
Problem 1: Boosting
This question is about AdaBoost, the simple boosting algorithm presented in the lectures.
We assume that there is a method unknown to us that produces a component (weak) hypothesis h(x) in response to a weighted training set. The resulting weak classifier should do at least a bit better than random guessing on the weighted training set it was trained on. We make no assumptions about how well it does on other, possibly differently weighted, training sets.
Let $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ be the training set of examples $x_i$ and binary ($\pm 1$) labels $y_i$. We denote the weights on the training examples at the beginning of the $k$th boosting iteration as $p_k(1), \ldots, p_k(n)$, where $\sum_{i=1}^n p_k(i) = 1$. Unlike in the lecture, the component hypotheses $h(x)$ can produce a real-valued output (confidence rated classifiers). The sign of the output indicates the label and the magnitude specifies a measure of "confidence" in the classification decision. So, for example, $h(x) = 10$ would be interpreted as a relatively confident prediction of label 1.
Now, starting with equal weights $p_1(i) = 1/n$, AdaBoost generates a sequence of hypotheses $h_1(x), \ldots, h_m(x)$, where each $h_k(x)$ was trained with a different set of example weights.
After having generated a hypothesis $h_k(x)$ in response to weights $p_k(i)$ at the $k$th iteration, the example weights are updated according to
$$p_{k+1}(i) = c \cdot p_k(i) \exp(-\hat{\alpha}_k y_i h_k(x_i)), \qquad i = 1, \ldots, n \qquad (1)$$

where $c$ is the normalization constant ensuring that $\sum_{i=1}^n p_{k+1}(i) = 1$ and $\hat{\alpha}_k$ gives the votes that we assign to the new component classifier $h_k(x)$. We would like to assign these votes so that the new component classifier $h_k(x)$ is at chance performance relative to the new weights $p_{k+1}(i)$. One way to express this is as follows:

$$\sum_{i=1}^n p_{k+1}(i)\, y_i h_k(x_i) = 0 \qquad (2)$$
1. Show that if $h_k(x_i)$ generates only binary $\pm 1$ outputs, then the above condition corresponds to requiring that the training error is exactly 0.5 relative to the new weights.
Answer: In the sum $\sum_{i=1}^n p_{k+1}(i) y_i h_k(x_i)$, let $P$ be the set of $i$ for which $p_{k+1}(i)$ is multiplied by $+1$, and $Q$ be the set of $i$ for which $p_{k+1}(i)$ is multiplied by $-1$. Then since $\sum_{i=1}^n p_{k+1}(i) y_i h_k(x_i) = 0$, we have $\sum_{i \in P} p_{k+1}(i) = \sum_{j \in Q} p_{k+1}(j)$. Also, since $\sum_{i=1}^n p_{k+1}(i) = 1$, we have $\sum_{i \in P} p_{k+1}(i) + \sum_{j \in Q} p_{k+1}(j) = 1$, so $\sum_{i \in P} p_{k+1}(i) = \frac{1}{2}$. But $\sum_{i \in P} p_{k+1}(i) = \sum_{h_k(x_i) = y_i} p_{k+1}(i)$, so the training error of $h_k$ relative to $p_{k+1}$ is $\frac{1}{2}$.
Scoring: 4 points
Things are a bit different here since the component classifier can generate real valued outputs. We can view the condition instead as a way of decorrelating the predictions and the labels relative to the new weights.
2. Show that if we choose $\hat{\alpha}_k$ as the minimizing argument of

$$J(\alpha) = \log\left( \sum_{i=1}^n p_k(i) \exp\{-\alpha y_i h_k(x_i)\} \right) \qquad (3)$$

we will indeed ensure that the decorrelation condition (2) holds for the new weights.
(Hint: Do not attempt to compute $\hat{\alpha}_k$. Instead, set the derivative to zero and use the resulting equation, which $\hat{\alpha}_k$ must satisfy, in order to show that (2) is satisfied.)
Answer: Let $\hat{\alpha}_k$ be the minimizing argument of $J(\alpha)$ above; then $J'(\hat{\alpha}_k) = 0$. Taking the derivative of $J$, we get:

$$0 = \frac{\partial}{\partial \hat{\alpha}_k} \log\left( \sum_{i=1}^n p_k(i) \exp\{-\hat{\alpha}_k y_i h_k(x_i)\} \right)
= \frac{\sum_{i=1}^n p_k(i) \frac{\partial}{\partial \alpha} e^{-\hat{\alpha}_k y_i h_k(x_i)}}{\sum_{i=1}^n p_k(i) e^{-\hat{\alpha}_k y_i h_k(x_i)}}
= -\frac{\sum_{i=1}^n p_k(i) y_i h_k(x_i) e^{-\hat{\alpha}_k y_i h_k(x_i)}}{\sum_{i=1}^n p_k(i) e^{-\hat{\alpha}_k y_i h_k(x_i)}}
= -\sum_{i=1}^n \frac{p_k(i) e^{-\hat{\alpha}_k y_i h_k(x_i)}}{\sum_{j=1}^n p_k(j) e^{-\hat{\alpha}_k y_j h_k(x_j)}}\, y_i h_k(x_i)$$

Matching this with the definition of the weights $p_{k+1}(i)$ given in (1), and noticing that the denominator is exactly the correct normalization constant $1/c$:

$$0 = -\sum_{i=1}^n p_{k+1}(i)\, y_i h_k(x_i)$$

This is the desired decorrelation condition (2).
Scoring: 5 points
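As a quick numerical sanity check (this snippet is our own illustration, not part of the provided code; the toy labels, predictions and weights below are made up), the following MATLAB fragment minimizes $J(\alpha)$ with fminsearch for a small confidence-rated hypothesis and verifies that the reweighted examples satisfy the decorrelation condition (2):

y  = [ 1; -1;  1;  1; -1];          % labels
h  = [ 0.8; 0.3; 1.5; -0.2; -0.9];  % real-valued predictions h_k(x_i)
p  = [0.1; 0.3; 0.2; 0.25; 0.15];   % current weights p_k(i), summing to one
J  = @(a) log(sum(p .* exp(-a .* y .* h)));
ahat = fminsearch(J, 0);            % minimize J(alpha) over alpha
pnew = p .* exp(-ahat .* y .* h);
pnew = pnew / sum(pnew);            % new weights p_{k+1}(i)
disp(sum(pnew .* y .* h));          % approximately zero: condition (2) holds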
This should look a bit suspicious... as if there were an objective function that the boosting algorithm were minimizing at each iteration. This is indeed the case. Let Hm(x) be the combined hypothesis resulting from m boosting iterations:
$$H_m(x) = \hat{\alpha}_1 h_1(x) + \ldots + \hat{\alpha}_m h_m(x) \qquad (4)$$

We'd like you to show that
$$J(H_m) = \log\left( \sum_{i=1}^n \exp\{-y_i H_m(x_i)\} \right) \qquad (5)$$

serves as an objective function for the boosting algorithm. In other words, we claim that every time we add a new component hypothesis to the combined classifier $H_m(x)$, we decrease the objective.
3. Show that $J(H_m) \geq J(H_{m+1})$, where

$$H_{m+1}(x) = H_m(x) + \hat{\alpha}_{m+1} h_{m+1}(x), \qquad (6)$$

$h_{m+1}(x)$ is the new component hypothesis, and $\hat{\alpha}_{m+1}$ is optimized as shown above.
(Hint: expand the weights $p_m(i)$ and use the optimization of $\hat{\alpha}_{m+1}$.)
Answer: Denote by $Z_{k+1}$ the normalization constant $\frac{1}{c}$ in (1):

$$Z_{k+1} = \sum_i p_k(i)\, e^{-\hat{\alpha}_k y_i h_k(x_i)}$$

so that (1) can be rewritten as:

$$p_{k+1}(i) = \frac{1}{Z_{k+1}} p_k(i)\, e^{-\hat{\alpha}_k y_i h_k(x_i)}.$$

Unrolling this recursive definition of the weights yields:

$$p_{m+1}(i) = \frac{1}{Z_{m+1}} p_m(i)\, e^{-\hat{\alpha}_m y_i h_m(x_i)}
= \frac{1}{Z_{m+1}} \frac{1}{Z_m} p_{m-1}(i)\, e^{-\hat{\alpha}_{m-1} y_i h_{m-1}(x_i)} e^{-\hat{\alpha}_m y_i h_m(x_i)}
= \cdots = \left( \prod_{k=2}^{m+1} \frac{1}{Z_k} \right) p_1(i) \prod_{k=1}^m e^{-\hat{\alpha}_k y_i h_k(x_i)}$$

Initially $p_1(i) = \frac{1}{n}$, hence:

$$p_{m+1}(i) = \frac{e^{-y_i \sum_{k=1}^m \hat{\alpha}_k h_k(x_i)}}{n \prod_{k=2}^{m+1} Z_k}$$

Noticing that the sum in the exponent corresponds to the combined classifier $H_m$:

$$p_{m+1}(i) = \frac{e^{-y_i H_m(x_i)}}{n \prod_{k=2}^{m+1} Z_k}$$

We now note that $\sum_i p_{m+1}(i) = 1$, and thus:

$$0 = \log \sum_i p_{m+1}(i) = \log\left( \sum_i \frac{e^{-y_i H_m(x_i)}}{n \prod_{k=2}^{m+1} Z_k} \right)
= \log\left( \sum_i e^{-y_i H_m(x_i)} \right) - \sum_{k=2}^{m+1} \log Z_k - \log n
= J(H_m) - \sum_{k=2}^{m+1} \log Z_k - \log n$$

Noting that $\log Z_{k+1} = J(\hat{\alpha}_k)$, we can rewrite $J(H_m)$ as:

$$J(H_m) = \log n + \sum_{k=1}^m J(\hat{\alpha}_k)$$
Consequently, $J(H_{m+1}) = J(H_m) + J(\hat{\alpha}_{m+1})$. Noting that $\hat{\alpha}_{m+1}$ was chosen to minimize $J$:

$$J(\hat{\alpha}_{m+1}) \leq J(\alpha) \quad \text{for any } \alpha$$

In particular,

$$J(\hat{\alpha}_{m+1}) \leq J(0) = \log\left( \sum_{i=1}^n p_{m+1}(i)\, e^{-0 \cdot y_i h_{m+1}(x_i)} \right) = \log\left( \sum_{i=1}^n p_{m+1}(i) \right) = \log 1 = 0$$

Hence $J(H_{m+1}) \leq J(H_m)$.
Scoring: 6 points
Having now put some effort into understanding the boosting algorithm, let's explore a bit how it behaves in practice. We have provided you with MATLAB code that finds and evaluates (confidence rated) decision stumps. These are the hypotheses that our boosting algorithm assumes we can generate. The relevant MATLAB files are "boost_digit.m", "boost.m", "eval_boost.m", "find_stump.m", and "eval_stump.m". You'll only have to make minor modifications to "boost.m" and, a bit later, to "eval_boost.m" and "boost_digit.m" to make these work.
4. Complete the weight update in "boost.m" and run boost_digit to plot the training and test errors for the combined classifier, as well as the corresponding training error of the decision stump, as a function of the number of iterations. Are the errors what you would expect them to be? Why or why not?
Answer: The completed boost.m:

function [param] = boost(X,y,niter)
n = size(X,1);
p = ones(n,1)/n;
param = [];
for i=1:niter,
  stump = find_stump(X,y,p);
  h = eval_stump(X,stump);             % stump predictions on the training set
  % compute the number of votes to give to the new stump
  epsilon = sum(p .* (y ~= h));
  alpha = 0.5*log((1-epsilon)/epsilon);
  % update the weights
  p = p .* exp(-alpha * y .* h);
  p = p / sum(p);                      % don't forget to normalize!
  param = [param,struct('stump',stump,'alpha',alpha)];
end;
[Figure: training error, test error, and single-stump training error as a function of the number of boosting iterations (1 to 16).]
As can be seen in the figure, the training and test errors decrease as we perform more boosting iterations. Eventually the training error reaches zero, but we do not overfit, and the test error remains low (though higher than the training error).
However, no single stump can predict the training set well, and since we keep emphasizing the "difficult" parts of the training set, the error of each individual stump remains high and does not drop below about 1/3.
Scoring: 6 points for the execution, 3 points for the explanation
We will now investigate the classification margins of training examples. Recall that the classification margin of a training point in the boosting context reflects the "confidence" with which the point was classified correctly. You can view the margin of a training example as the difference between the weighted fraction of votes assigned to the correct label and those assigned to the incorrect one. Note that this is not a geometric notion of "margin" but one based on votes. The margin will be positive for correctly classified training points and negative for others.
5. Modify "eval_boost.m" so that it returns normalized predictions (normalized by the total number of votes). The resulting predictions should be in the range $[-1, 1]$. Fill in the missing computation of the training set margins in "boost_digit.m" (that is, the classification margins for each of the training points). You should also uncomment the plotting script for cumulative margin distributions (what is plotted is, for each $-1 < r < 1$ on the horizontal axis, the fraction of the training points that have a margin of at least $r$). Explain the differences between the cumulative distributions after 4 and 16 boosting iterations.
Answer: The completed eval_boost.m:

function [H] = eval_boost(X,param);
[n,m] = size(X);
H = zeros(n,1);       % combined predictions
totalalpha = 0;       % total weight of all votes
for i = 1:length(param),
  H = H + param(i).alpha*eval_stump(X,param(i).stump);
  totalalpha = totalalpha + param(i).alpha;
end;
H = H/totalalpha;     % normalize by the total vote weight

The completed boost_digit.m:
load digit_x.dat;
load digit_y.dat;
digit_y = 2*digit_y-1;
load digit_x_test.dat;
load digit_y_test.dat;
digit_y_test = 2*digit_y_test-1;
E=[];
M=[];
for i=1:16,
  param = boost(digit_x,digit_y,i);
  yb = eval_boost(digit_x,param);
  trainerr = mean(yb.*digit_y<=0);
  % classification margins on the training set:
  % mar = a column vector of length length(digit_y), where
  % mar(i) is the classification margin of the ith training point
  mar = yb .* digit_y;
  M = [M,mar];
  yb = eval_boost(digit_x_test,param);
  testerr = mean(yb.*digit_y_test<=0);
  E=[E;trainerr,testerr,param(i).stump.err];
end;
figure(1); plot(E);
legend('Training error','Test error','Single-stump error');
figure(2);
n = length(digit_y);
plot(sort(M(:,4)),(1:n)'/n,':g',sort(M(:,16)),(1:n)'/n,'-r');
legend('After four iterations','After sixteen iterations');
(note that we also added code to add legends to the figures).
[Figure: cumulative margin distributions of the training points, after four and after sixteen boosting iterations.]
The key difference between the cumulative distributions after 4 and 16 boosting iterations is that the additional iterations seem to push the left (low end) tail of the cumulative distribution to the right. To understand the effect, note that the examples that are difficult to classify have poor or negative classification margins and therefore define the low end tail of the cumulative distribution. Additional boosting iterations concentrate on the difficult examples and ensure that their margins will improve. As the margins improve, the left tail of the cumulative distribution moves to the right, as we see in the figure.
Scoring: 4 points for the execution, 2 points for the explanation
Problem 2: support vector machines
Lagrange multipliers and optimization problems
We'll present here a very simple tutorial example of using and understanding Lagrange multipliers. Let $w$ be a scalar parameter we wish to estimate and $x$ a fixed scalar. We wish to solve the following (tiny) SVM-like optimization problem:
$$\text{minimize } \tfrac{1}{2} w^2 \quad \text{subject to} \quad wx - 1 \geq 0 \qquad (7)$$
This is difficult only because of the constraint. We'd rather solve an unconstrained version of the problem but, somehow, we have to take into account the constraint. We can do this by including the constraint itself in the minimization objective, as it allows us to twist the solution towards satisfying the constraint. We need to know how much to emphasize the constraint, and this is what the Lagrange multiplier is doing. We will denote the Lagrange multiplier by $\alpha$ to be consistent with the SVM problem. So we have now constructed a new minimization problem (still minimizing with respect to $w$) that includes the constraint as an additional linear term:
$$J(w; \alpha) = \tfrac{1}{2} w^2 - \alpha(wx - 1) \qquad (8)$$
The Lagrange multiplier $\alpha$ appears here as a parameter. You might view this new objective with a bit of suspicion, since we appear to have lost the information about what type of constraint we had, i.e., whether the constraint was $wx - 1 \geq 0$, $wx - 1 \leq 0$, or $wx - 1 = 0$. How is this information encoded? We can encode it by constraining the values of the Lagrange multiplier:
$$wx - 1 \geq 0 \;\Rightarrow\; \alpha \geq 0$$
$$wx - 1 \leq 0 \;\Rightarrow\; \alpha \leq 0$$
$$wx - 1 = 0 \;\Rightarrow\; \alpha \text{ is unconstrained}$$
Note, for example, that when the constraint is $wx - 1 \geq 0$, as we have above, large positive values of $\alpha$ will encourage choices of $w$ that result in large positive values for $wx - 1$. This is because in the above objective, $J(w; \alpha)$, we try to minimize $-\alpha(wx - 1)$ in addition to $w^2/2$; minimizing $-\alpha(wx - 1)$ is the same as maximizing $\alpha(wx - 1)$, or $wx - 1$, since $\alpha$ is positive. Figure 1 tries to illustrate this effect. Assuming $x = 1$, we can plot the new objective function as a function of $w$ for different values of $\alpha$. Larger values of $\alpha$ clearly move the solution (the minimizing $w$) towards satisfying $w - 1 \geq 0$ (we assume here that $x = 1$).
Based on the figure we can see that setting $\alpha = 1$ produces just the right solution, i.e., $w^* = 1$, which satisfies the constraint $wx - 1 \geq 0$ (when $x = 1$) with minimal distortion of the original objective. There is no reason to consider negative values for $\alpha$, since they would push the solution away from satisfying our inequality constraint.
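For reference, here is a minimal MATLAB sketch (our own addition, not part of the provided code) that reproduces Figure 1 by plotting $J(w;\alpha) = \tfrac{1}{2}w^2 - \alpha(wx - 1)$ for $x = 1$ and $\alpha = 0, 1, 2$:

w = linspace(-3,3,200);
x = 1;
hold on;
for a = [0 1 2]
  J = 0.5*w.^2 - a*(w*x - 1);   % objective J(w;alpha) for this value of alpha
  plot(w, J);
end
hold off;
xlabel('w'); ylabel('J(w;\alpha)');
legend('\alpha = 0','\alpha = 1','\alpha = 2');

The minimizer of each curve is $w = \alpha x$, so the minimum slides to the right as $\alpha$ grows.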
Figure 1: $J(w;\alpha)$ as a function of $w$ for different values of $\alpha$ ($\alpha = 0, 1, 2$). The minimizing $w$ values are indicated with dashed line segments. $x$ was set to 1.

Effectively what we are doing here is solving a large number of optimization problems, once for each setting of the Lagrange multiplier $\alpha$. Indeed, we can express the solution (the minimizing $w$) as a parametric function of $\alpha$:

$$\frac{\partial}{\partial w} J(w; \alpha) = w - \alpha x = 0 \qquad (9)$$

meaning that $w^*_\alpha = \alpha x$. We could now find the setting of $\alpha$ such that the constraint $w^*_\alpha x - 1 \geq 0$ is satisfied. There are multiple answers to this, since larger values of $\alpha$ would better satisfy the constraint. Finding the smallest (non-negative) $\alpha$ for which the constraint is satisfied would in this case produce the right solution (the one corresponding to the minimal change of the original problem).
We can proceed a bit more generally, however, the way we handled the quadratic optimization problem for SVMs. Let's insert our solution $w^*_\alpha$ back into the objective function:

$$J(w^*_\alpha; \alpha) = \tfrac{1}{2}(w^*_\alpha)^2 - \alpha(w^*_\alpha x - 1) = \tfrac{1}{2}(\alpha x)^2 - \alpha(\alpha x^2 - 1) = \alpha - \tfrac{1}{2}(\alpha x)^2 \qquad (10)$$

The result, which we denote as $J(\alpha)$, is a function of the Lagrange multiplier $\alpha$ only. Let's understand this function a bit better. In Figure 1, the values of the objective at the dashed lines correspond exactly to $J(w^*_\alpha; \alpha)$, or $J(\alpha)$, evaluated at $\alpha = 0, 1, 2$. Isn't it strange that the right solution ($\alpha = 1$) appears to yield the maximum of $J(\alpha)$? This is a very useful property. Let's verify this by finding the maximum of $J(\alpha)$ a bit more formally:
$$J(\alpha) = \alpha - \tfrac{1}{2}(\alpha x)^2 \qquad (11)$$

$$\frac{\partial}{\partial \alpha} J(\alpha) = 1 - \alpha x^2 = 1 - w^*_\alpha x = 0 \qquad (12)$$

where we have used our previous result $w^*_\alpha = \alpha x$. So, the constraint is satisfied with equality at the maximum of $J(\alpha)$. More rigorously, since $\alpha \geq 0$ in our setting, the maximum is obtained either at $\alpha = 0$ or at the point where $1 - w^*_\alpha x = 0$. We can express this more concisely by saying that their product vanishes, i.e., $\alpha(w^*_\alpha x - 1) = 0$ at the optimum. This is generally true, i.e., either the Lagrange multiplier is not used and $\alpha = 0$ (the constraint is satisfied without any modification), or the Lagrange multiplier is positive and the constraint is satisfied with equality.
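As a small worked example (our own addition, using only the quantities defined above), solving (12) explicitly gives the optimal multiplier and recovers the primal solution:

$$1 - \alpha x^2 = 0 \;\Rightarrow\; \alpha^* = \frac{1}{x^2}, \qquad
w^*_{\alpha^*} = \alpha^* x = \frac{1}{x}, \qquad
w^*_{\alpha^*} x - 1 = 0,$$

and $J(\alpha^*) = \alpha^* - \tfrac{1}{2}(\alpha^* x)^2 = \frac{1}{2x^2}$, which equals the primal optimum $\tfrac{1}{2}(w^*)^2$. For $x = 1$ this is $\alpha^* = 1$ and $w^* = 1$, matching Figure 1.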
The remaining question for us here is why maximizing

$$\alpha - \tfrac{1}{2}(\alpha x)^2 \quad \text{subject to} \quad \alpha \geq 0 \qquad (13)$$

is any better than the problem we started with. The short answer is that the constraints here are very simple non-negativity constraints that are easy to deal with in the optimization. In the SVM context, we have another reason to prefer this formulation.
Examples of SVM training
We provided a skeleton for a SVM training routine, trainsvm.m. The routine takes as input the training set, a constant C which will be an upper bound on the Lagrange multipliers, and a specification for a kernel function. Your first task is to complete this routine. Two parts of the routine are missing.
First, you should set up and solve the quadratic programming problem:
$$\text{minimize} \quad \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_i \alpha_i$$

subject to $\sum_i \alpha_i y_i = 0$ and, for all $i$, $0 \leq \alpha_i \leq C$. Solve the quadratic program using the MATLAB routine quadprog. (Note that in lecture we saw this as a maximization problem. MATLAB's quadprog solves quadratic minimization problems, and so we changed the sign of the objective function to get an equivalent minimization problem.)
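For concreteness, here is a sketch of how this dual maps onto quadprog's standard form, which minimizes 0.5*a'*H*a + f'*a subject to Aeq*a = beq and lb <= a <= ub (the variable names below are ours; the completed routine further down does the same thing in one call):

H   = diag(y)*K*diag(y);   % H(i,j) = y_i y_j K(x_i, x_j)
f   = -ones(n,1);          % the -sum_i alpha_i term
Aeq = y';  beq = 0;        % equality constraint sum_i alpha_i y_i = 0
lb  = zeros(n,1);          % alpha_i >= 0
ub  = C*ones(n,1);         % alpha_i <= C
a   = quadprog(H,f,[],[],Aeq,beq,lb,ub);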
Next, add code to calculate the bias term w0. We have already provided code for finding vectors that lie on the margin (note that we account for numeric inaccuracies by ignoring α values very close to zero or toC). Although it is enough to use one of these input vectors for calculating w0, in order to increase accuracy, we will calculate w0 using each one of these input vectors separately, and then take w0 to be the median of all the calculated values (all of which should theoretically be the same).
Note that the routine also returns a list of the support vectors, and the vectors that lie exactly on the margin.
1. Why do we use only training vectors $x_i$ with $0 < \alpha_i < C$ for calculating $w_0$ ?

Answer: We know what the prediction $w^T x + w_0$ should be on points $x$ which are exactly on the margin, that is, with the constraint $y_i(w^T x + w_0) \geq 1$ tightly satisfied.
Points that satisfy the constraint loosely, i.e. with $y_i(w^T x + w_0) > 1$, will necessarily have their Lagrange multiplier equal to zero (since a higher Lagrange multiplier would hurt the objective function).
If we impose some limit $C$ on the Lagrange multipliers, then if a point does not satisfy the constraint, we would like to increase its Lagrange multiplier as much as possible, and so its Lagrange multiplier will be equal to $C$.
Taking only points with Lagrange multipliers strictly between zero and $C$ ensures that all the points considered are on the margin (note that there might still be margin vectors with Lagrange multipliers equal to zero or $C$, and we will miss them).
Scoring: 2 points
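The computation of $w_0$ in the completed routine below follows directly from this observation. As a short worked equation (our own addition), for any margin vector $x_i$ (with $y_i \in \{\pm 1\}$, so $1/y_i = y_i$):

$$y_i\Big(\sum_j \alpha_j y_j K(x_j, x_i) + w_0\Big) = 1
\;\Longrightarrow\;
w_0 = y_i - \sum_j \alpha_j y_j K(x_j, x_i),$$

which is exactly what the line w0 = median(y(margvecs)-K(margvecs,:)*ay) computes, taking the median over all margin vectors for numerical robustness.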
2. Complete the routine trainsvm.m.
Answer: Here is the completed routine. Note that we also compute the margin (which is requested in some of the questions).
function [a,w0,fval,supvecs,margvecs,margin] = trainsvm(x,y,C,...
                                                        kernalfunc,varargin)
if nargin<4
  kernalfunc = 'polyK';
end
if nargin<3
  C = 1e+10;   % should be inf. Set like this because of
end            % bug with quadprog
n = size(x,1);
K = feval(kernalfunc,x,x,varargin{:});
[a,fval,exitflag] = quadprog(diag(y)*K*diag(y),-ones(n,1),...
                             [],[],y',0,zeros(n,1),C*ones(n,1));
ay = a .* y;
eps = 0.001*max(a);
supvecs = find(a>=eps);
margvecs = find(a>=eps & a<=C-eps);
w0 = median(y(margvecs)-K(margvecs,:)*ay);
% Calculate the margin in feature space, which is 1/||w||
margin = 1/sqrt(ay'*K*ay);
Scoring: 6 points (not including the margin computation)
3. Write a routine svmpred.m that classifies a set of input vectors based on a specified SVM classifier. The routine should take as input a matrix of input vectors, and the specifications of the SVM classifier. You may use the provided skeleton.
Answer:

function [y,z] = svmpred(x,tx,ty,a,w0,kernalfunc,varargin)
if nargin<6
  kernalfunc = 'polyK';
end
z = feval(kernalfunc,x,tx,varargin{:})*(a.*ty)+w0;
y = sign(z);
Scoring: 5 points
In studying SVM training, it will be useful to plot the decision boundary, the margin, and the support vectors. The provided routine plotsvm2d.m does just that. The routine takes as input the training set, the lists of support and margin vectors, the name of the prediction function (which will probably be svmpred), and the classifier specifications to be passed to the prediction function. (If you deviated from the provided skeleton of svmpred.m, you might need to change plotsvm2d.m appropriately).
The example data sets used in this question can be found in the file svmexample.mat, and can be read using:
>> load svmexample
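As a usage sketch (our own illustration; it assumes the trainsvm and svmpred interfaces completed above, and that calling polyK with no extra degree argument yields the linear kernel, otherwise pass the degree as an additional argument), training and evaluating on the first example might look like:

[a,w0,fval,supvecs,margvecs,margin] = trainsvm(train1x,train1y,1e+10,'polyK');
ypred    = svmpred(train1x,train1x,train1y,a,w0,'polyK');
trainerr = mean(ypred ~= train1y)   % training error of the learned SVM
margin                              % classification margin, 1/||w||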
4. We start with a simple example, given in train1x, train1y. Use the routines above to train a SVM using a linear kernel function $K(x_1, x_2) = (1 + x_1^T x_2)$ (note the provided polyK.m), and without bounding $\alpha_i$ from above. Turn in the plot created by plotsvm2d.m.
Answer:
[Figure: Example 1, linear kernal, unbounded C: margin=0.3464]
Scoring: 1 point
5. Why is it not necessary to bound the values of $\alpha_i$ from above in this case ?

Answer: The data set is separable, and so the dual problem is bounded. Increasing $\alpha$ to infinity will not improve the objective function of the dual quadratic program.
Scoring: 2 points
6. What is the classification margin ? (Explain how you calculated it, providing any relevant MATLAB code)
Answer: The margin, as computed by trainsvm.m above, is 0.3464.
Scoring: 3 points (including the explanation/code)
We now move on to a slightly more complex data set given in train2x, train2y.
This time it is necessary to bound the values of αi by some finite C.
7. In order to understand what happens when we do not impose such a bound, try bounding with increasingly higher constants C and plot the value of the quadratic programming objective function as a function of C (turn in this plot). Why can we not solve the quadratic optimization problem when the values of αi are unbounded ?
Answer: Using the following routine:

function fvals = calcfvals(x,y,Cs,varargin)
fvals = zeros(size(Cs));
for i=1:length(Cs)
  [a,w0,fvals(i),supvecs,margvecs,margin] = trainsvm(x,y,Cs(i),varargin{:});
end

we can generate the plot using:

>> Cs = 2.^(0:12);
>> fvals = calcfvals(train2x,train2y,Cs);
>> plot(Cs,fvals);

We get the following plot:
[Figure: value of the quadratic program as a function of C (the bound on the Lagrange multipliers).]
From this plot, we can see that the objective function continues to decrease as the maximum allowed value of the Lagrange multipliers increases. If the Lagrange multipliers are not bounded, the value of the quadratic program is unbounded, and there is no optimal solution.
For comparison, consider the following plot of the objective function as a function of C for the linearly separable Example 1:

[Figure: Example 1 (separable): value of the quadratic program as a function of C (bound on the Lagrange multipliers).]
In this case, the objective function initially decreases (it is always monotonically non-increasing), but then reaches a lower bound, and does not change further even as the bound C increases.
Scoring: 2 points for the plot, 2 points for the explanation
8. Recall that we are using Lagrange multipliers to solve a minimization problem on the weights $w$, where we would like to minimize $\|w\|^2$ subject to $y_i(w^T x_i + w_0) \geq 1$ for all training examples $i$. In this case, no weight vector satisfies all the constraints.
How is this addressed by bounding the Lagrange multipliers ?
Answer: If some of the constraints are not satisfied, then there are $i$ for which $y_i(w^T x_i + w_0) - 1 < 0$. The term $\alpha_i \left( y_i(w^T x_i + w_0) - 1 \right)$ appearing in our objective function would thus decrease as we increase $\alpha_i$. Bounding $\alpha_i$ in effect sets a limit on how much an unsatisfied constraint can influence the objective function. Once the Lagrange multipliers are bounded, the dual objective function becomes bounded as well.
Scoring: 2 points
9. Train a SVM using a linear kernel, imposing a bound of C = 5. Turn in the plotsvm2d plot. How many support vectors are there ? How many training points are misclassified ? How many training points do not satisfy the SVM constraints $y_i(w^T x_i + w_0) \geq 1$ ?
Answer:
[Figure: Example 2, linear kernal, C=5: margin=0.7404]
There are 26 support vectors, of which 23 do not satisfy the SVM constraints, but only 12 are misclassified.
Scoring: 4 points, one for the plot, one for each number
10. In order to enrich the space of allowable classifiers, we would like to use a kernel function representing a more complex feature space. Try training a SVM, on the same example data, using polynomial kernels of the form $(1 + x_1^T x_2)^p$, with various settings of $p$, and look at the resulting graphs (no need to turn them in). What is the lowest degree under which the training set is separable ?
Answer: The training data is separable using fourth-degree features.
Scoring: 1 point
[Figure: Example 2, 3rd degree kernal, C=1e+10]
[Figure: Example 2, 4th degree kernal, C=1e+10: margin=0.0434]
Also try using the Gaussian kernel $e^{-\|x_1 - x_2\|^2/2}$ (given in expK.m).
[Figure: Example 2, Gaussian kernal (unit variance): margin=0.0229]
11. Note that in many of the graphs you produced, the graphical margin, as it appears in the graphs, is not uniform as with the linear kernel: it is narrower in some areas and wider in others. Why does this happen ?
Answer: The margin is uniform in feature space. What we see in input space is a non-linear transformation of feature space. This transformation does not preserve the width of the margin.
Scoring: 2 points
12. Even if the training set is separable, it is sometimes useful to set a bound on the values of $\alpha_i$. Train a SVM using a fourth-degree kernel $(1 + x_1^T x_2)^4$, once without a bound, and once with a bound C = 2. Turn in both resulting graphs, and the margin implied by the SVM weights (i.e. the classification margins for those training points satisfying the SVM constraints).
Answer: Margin without a bound is 0.0434. With a bound of C = 2 it increases to 0.74.
[Figure: Example 2, 4th degree kernal, C=1e+10: margin=0.0434]
[Figure: Example 2, 4th degree kernal, C=2: margin=0.7400]
Scoring: 5 points: one for the plots, four for the margins and their calculation
13. What is the effect of imposing a low value for C ? In this example, what happened when C was set to two ?
Answer: A low value of C decreases the importance of satisfying the con- straints, relative to increasing the margin. We allow a few points to violate the constraints (and two points to even be misclassified), but we increase the margin more than ten-fold.
Scoring: 2 points
14. Using the training data train3x, train3y, investigate training a SVM with different kernels and different values of C. Is the training set separable using a Gaussian kernel ? Do you think the classification given by a SVM with a Gaussian kernel would generalize well ?
Answer: The training set is separable using a Gaussian kernel, as are all training sets (without repeated input values). Especially for small variances, almost all training points are support vectors, and so we would not necessarily expect the classification to generalize well.
Scoring: 1 point
[Figure: Example 3, Gaussian kernal, unit variance: margin=0.00017]
[Figure: Example 3, Gaussian kernal, variance 0.1: margin=0.0428]
[Figure: Example 3, Gaussian kernal, variance 0.1: margin=0.0428]
[Figure: Example 3, Gaussian kernal, variance 1: margin=0.0000075]
Problem 3: VC-dimensionality
In this problem, we will investigate the VC-dimension of various sets of classifiers. In this context, a classifier is a function from some input space to the binary class labels $\{+1, -1\}$.
A classifier can also be described as a subset of the input space which gets the label +1.
For example, a linear classifier in the plane R2, can be described by a half-plane. For this reason, we can discuss the family of linear classifiers as the set of all half-planes (and possibly also the plane itself and the empty set).
We say that a class (i.e. set) $\mathcal{H}$ of classifiers shatters a set of points $X = \{x_1, x_2, \ldots, x_n\}$ if we can classify the points in $X$ in all possible ways. More precisely, for all $2^n$ possible labeling vectors $(y_1, y_2, \ldots, y_n) \in \{-1, 1\}^n$, there exists a classifier $h \in \mathcal{H}$ such that $h(x_i) = y_i$ for all $i$. For any possible labeling of the points, there has to be a classifier in our set that reproduces those labels. Using the set notation for classifiers, this means that for any subset $X' \subseteq X$ (indicating the subset of points labeled $+1$), there exists a classifier $h \in \mathcal{H}$ such that $X \cap h = X'$ (the set of points for which $h$ assigns label $+1$ includes $X'$ but not the rest of $X$). It is important to understand that shattering is a property of a set of classifiers, not of a single classifier (a single classifier cannot even shatter a single point).
The VC-dimension of a set H of classifiers is the size of the largest set X that can be shattered by H.
We first analyze the VC-dimension of the class of axis-parallel rectangles in the plane. That is, a classifier is defined in terms of an axis-parallel rectangle, and classifies points inside the rectangle as positive, and points outside of it as negative. More formally (or perhaps just using more notation), a classifier belongs to this class if it is specified by the left, right, top and bottom coordinates and
$$h_{l,r,t,b}(x_1, x_2) = \begin{cases} +1 & \text{if } l \leq x_1 \leq r \text{ and } b \leq x_2 \leq t \\ -1 & \text{otherwise} \end{cases}$$

Or, written as a set:

$$h_{l,r,t,b} = \{(x_1, x_2) \mid l \leq x_1 \leq r \text{ and } b \leq x_2 \leq t\}$$

And the class of axis-parallel rectangles is

$$\mathcal{H}_{\text{rect}} = \{h_{l,r,t,b} \mid l < r \text{ and } b < t\},$$

which is just the set of all axis-parallel rectangles. (Your answers need not use such formal notation.)
1. Find a set of four points that can be shattered by the class Hrect of axis-parallel rectangles. That is, specify four points, and show how for all 16 possible subsets (each in turn specifying the points labeled +1), you can find a rectangle that captures exactly this subset.
Answer: Consider four points forming a diamond. The following sixteen rectangles classify the points in all possible ways:
Scoring: 3 points
2. Prove that no set of five points can be shattered by Hrect. The claim here is that for any set of five points, there exists a labeling that cannot be obtained with an axis-parallel rectangle. Conclude that the VC-dimension of Hrect is exactly four.
Answer: For any set of five points, take a left-most point, a right-most point, a top-most point and a bottom-most point. Consider a labeling where these (at most) four points are labeled '+', and the remaining point(s) are labeled '-' (note that there might be more than one remaining point if a single point is, e.g., both topmost and leftmost). A rectangle containing these four extreme points must have its left edge left of the left-most point, its right edge right of the right-most point, its top edge above the top-most point and its bottom edge beneath the bottom-most point. Hence, it must contain all five points, including those which we wish to label '-'. This labeling is thus unachievable with axis-parallel rectangles.
It follows that no five points can be shattered by axis-parallel rectangles.
Since we saw in the previous question how four points can be shattered, we can conclude that the VC dimension is exactly four.
Scoring: 3 points
Now consider classification of points on the one-dimensional line $\mathbb{R}$ using degree-$d$ polynomials as discriminant functions. These are linear classifiers with basis functions $x^k$, $k = 1, \ldots, d$. Any classifier in this class has the form $h_{\mathbf{w}}(x) = \mathrm{sign}(w_0 + w_1 x + w_2 x^2 + \ldots + w_d x^d)$. The set of all such classifiers is

$$\mathcal{H}_d = \{h_{\mathbf{w}} \mid \mathbf{w} \in \mathbb{R}^{d+1}\}$$

3. (Optional) Prove that the VC-dimension of $\mathcal{H}_d$ is $d + 1$.
Answer: Consider the points $\{1, 2, \ldots, d+1\}$. For any labeling $(y_1, y_2, \ldots, y_{d+1})$, we will search for a degree-$d$ polynomial $p$ such that $p(i) = y_i$ for all $i$ (where $y_i$ is $+1$ or $-1$). We know such a polynomial exists, since for any designated target values on $d+1$ input values, there exists a unique degree-$d$ polynomial that satisfies all the constraints. This polynomial achieves the desired labeling. Since this is true for any labeling, the $d+1$ points can be shattered (in fact, any $d+1$ points can be shattered).
We will now show that no $d+2$ points can be shattered. For any $d+2$ points, consider a desired labeling in which the points are assigned alternating labels (i.e. each two consecutive points on the real line have different labels). The polynomial must cross zero between any two consecutive points, hence it must be equal to zero in at least $d+1$ different places. But a degree-$d$ polynomial that has $d+1$ roots must be identically zero, and such a polynomial surely does not achieve the desired labeling (in fact, it does not achieve any labeling).
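As a small illustration of the interpolation argument (our own sketch; the degree and labels below are arbitrary), one can fit a degree-$d$ polynomial through $d+1$ labeled points with polyfit and check that its sign reproduces the labels:

d  = 4;
xs = 1:(d+1);                  % the points {1, 2, ..., d+1}
ys = [1 -1 -1 1 -1];           % an arbitrary +/-1 labeling of the d+1 points
c  = polyfit(xs, ys, d);       % interpolating polynomial of degree d
disp(isequal(sign(polyval(c, xs)), ys));  % prints 1: the labels are reproduced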
As mentioned in class, this result extends to more general types of basis functions. Consider a class $\mathcal{H}$ of classifiers that can all be written as $h_{\mathbf{w}}(x) = \mathrm{sign}(w_0 + w_1 \phi_1(x) + w_2 \phi_2(x) + \cdots + w_d \phi_d(x) + \phi_{\text{fixed}}(x))$ for some fixed set of basis functions (by fixed basis functions we mean that the basis functions are the same for all classifiers, and the classifiers differ only in their weights). Note that we also allow a special term $\phi_{\text{fixed}}(x)$, for which we cannot vary the weight. Regardless of the basis functions, $\mathcal{H}$ has a VC-dimension of at most $d+1$. In general, if the basis functions are independent (i.e. no basis function is a function of the other basis functions), and all weight vectors are allowed, the VC-dimension will be exactly $d+1$.
4. Determine the VC-dimension of each of the following classes, by considering them as linear discriminants with specific basis functions. Briefly explain your reasoning in each case.
(a) Circles in the plane. (Hint: use the basis-function formulation only as an upper bound).
Answer: It is easy to see how to shatter three points (e.g. in an equilateral triangle) using circles.
In order to prove that the VC-dimension is not more than three, note that a point $(x_1, x_2)$ lies inside a circle of radius $r$ with center $(a, b)$ iff:

$$(x_1 - a)^2 + (x_2 - b)^2 \leq r^2$$

Or equivalently:

$$x_1^2 - 2a x_1 + a^2 + x_2^2 - 2b x_2 + b^2 - r^2 \leq 0$$

Collecting terms we get:

$$(-2a) x_1 + (-2b) x_2 + (a^2 + b^2 - r^2) \cdot 1 + (x_1^2 + x_2^2) \leq 0$$

We can now rewrite the above as a weighted combination of three features, plus a fixed-weight feature:

$$\mathrm{sign}\left( w_1 \phi_1(x_1, x_2) + w_2 \phi_2(x_1, x_2) + w_3 \phi_3(x_1, x_2) + \phi_{\text{fixed}}(x_1, x_2) \right)$$

where:

$$w_1 = -2a, \quad \phi_1(x_1, x_2) = x_1; \qquad w_2 = -2b, \quad \phi_2(x_1, x_2) = x_2;$$
$$w_3 = a^2 + b^2 - r^2, \quad \phi_3(x_1, x_2) = 1; \qquad \phi_{\text{fixed}}(x_1, x_2) = x_1^2 + x_2^2$$

All circle classifiers can be written as such a weighted combination of these features, and thus the VC-dimension of circle classifiers is at most three.
Note that since not all weights are possible (since $r^2$ must be positive), we cannot conclude from this alone that the VC-dimension is in fact three, and not lower. We can conclude this from the shattering of three points, establishing a VC-dimension of exactly three.
Scoring: 4 points
(b) (Optional) Ellipsoids in the plane.
Answer: Using similar arguments, the VC-dimension of axis-parallel ellipses is four, and of general ellipses it is five.
(c) Classifiers representing decision regions of 2-component Gaussian mixture models in the plane. (Hint: rewrite the class posterior as $\frac{1}{1 + e^{z(x)}}$ and use $z(x)$ as a discriminant function).

Answer: We already saw how the posterior class probability in a Gaussian mixture model is of the form $\frac{1}{1 + e^{z(x)}}$, where $z(x)$ is a quadratic function. In fact, we saw how for any quadratic function $z(\cdot)$, there exists a Gaussian mixture model with a posterior of $\frac{1}{1 + e^{z(x)}}$. Based on the posterior, the class decision criterion is $\frac{1}{1 + e^{z(x)}} > 0.5$, which is equivalent to $z(x) < 0$. Gaussian mixture classifiers are thus exactly the classifiers given by the sign of quadratic functions. The number of quadratic features in the plane is six: $1, x_1, x_1^2, x_2, x_2^2, x_1 x_2$. Thus, the VC-dimension of Gaussian mixture models in the plane is six.
Scoring: 4 points
(d) (Optional) Classifiers representing decision regions of 2-component Gaussian mixture models in $\mathbb{R}^d$.

Answer: Using a similar argument, and counting the number of quadratic features in $\mathbb{R}^d$: one bias term, $d$ linear features, $d$ squared features, and $\binom{d}{2}$ cross terms, for a combined VC-dimension of $1 + 2d + d(d-1)/2 = 1 + \frac{3}{2}d + \frac{1}{2}d^2$.
Note that this is less than the straightforward counting of parameters in the standard parameterization of Gaussian mixture models through priors, means and the covariance matrix. Following this count we would get 2 parameters from the priors, $d$ from each mean, and $d^2$ from each covariance matrix, for a combined total of $2 + 2d + 2d^2$. This is certainly an overcounting, since we know, for example, that the priors must sum to one (hence it is enough to specify a single prior) and that the covariance matrix must be symmetric. But even after accounting for this, we are still left with $1 + 2d + d(d+1)/2 = 1 + \frac{5}{2}d + \frac{1}{2}d^2$ parameters. But we know there are still more redundancies in the parameterization, since the covariance matrix must be positive definite, and there are many different Gaussian mixture models sharing the same posterior, as well as different posteriors sharing the same decision boundary. The VC-dimension provides a way of accounting for these further redundancies.
5. In this question we will explore the VC-dimension of decision stumps. Recall that a stump classifier $h_{i,a,b}$ in $\mathbb{R}^d$ is specified by an axis-parallel half-space: $h_{i,a,b} = \{(x_1, x_2, \ldots, x_d) \mid a x_i \leq b\}$ ($a$ can always be taken to be $+1$ or $-1$).
(a) (Optional) Consider a $d = 2^\delta$ dimensional space. Suggest a set of $\delta$ points in $\mathbb{R}^d = \mathbb{R}^{2^\delta}$ that can be shattered by stumps, and show how it can be shattered. Conclude that the VC-dimension of stumps in $\mathbb{R}^d$ is at least $\lfloor \log_2 d \rfloor$ (i.e. $\log_2 d$ rounded down to the nearest integer).

Answer: Each of the $d$ coordinates will correspond to one of the $2^\delta = d$ possible labelings of our $\delta$ points. We will set the value of the $i$th coordinate of the $j$th point to $+1$ or $-1$ according to the labeling of the $j$th point under the $i$th labeling. To achieve any labeling, all that is needed is a stump $x_i > 0$ along the relevant coordinate $i$.
(b) For any set of $n$ points in $\mathbb{R}^d$, show that stumps can only classify the $n$ points in at most $2dn$ different ways. That is, there are at most $2dn$ different labelings of the points which are attainable using stump classifiers. (Hint: count the number of possible classifications using stumps on a specific axis.)

Answer: For each of the $d$ coordinates, consider the ordering $\sigma = (x_1, x_2, \ldots, x_n)$ of the points along that coordinate. Stumps along this coordinate can only separate some prefix of $\sigma$ from the corresponding suffix. There are $n + 1$ prefixes (including the empty prefix and the complete set of points), and each one can either be classified as $+1$ or as $-1$, for a total of at most $2(n+1)$ classifications along this coordinate (not all of these might be possible, if there are repeated values in this coordinate among the $n$ points). But note that classifying the empty prefix as $+1$ is equivalent to classifying the complete prefix as $-1$, and likewise for classifying the empty prefix as $-1$ and the complete prefix as $+1$, reducing the number of possible classifications to at most $2n$.
This analysis can be applied to each of the $d$ coordinates, and so the total number of classifications is at most $2dn$ (we are overcounting, since stumps along different coordinates may lead to the same classification).
Scoring: 2 points
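As an optional empirical check (our own sketch; the random data and variable names are made up), one can enumerate all stump labelings of a small point set and confirm that their number stays within the $2dn$ bound:

n = 5; d = 3;
X = rand(n,d);                          % n random points in R^d
labels = [];
for i = 1:d
  v = sort(X(:,i));
  thr = [v(1)-1; (v(1:end-1)+v(2:end))/2; v(end)+1];   % thresholds between points
  for t = thr'
    for s = [+1 -1]
      labels = [labels; (s*sign(X(:,i)-t))'];          % labeling induced by this stump
    end
  end
end
fprintf('%d distinct labelings, bound 2dn = %d\n', ...
        size(unique(labels,'rows'),1), 2*d*n);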
(c) Use the above bound to show that the VC-dimension of stumps is at most $2(\log_2 d + 1)$. (Hint: use the fact that $\log_2 n < n/2$, and so $n - \log_2 n > n/2$.)

Answer: If the VC-dimension is $n$, then there must be a set of $n$ points that is shattered. For this set of $n$ points, all $2^n$ labelings must be achievable. But using the above bound we can only classify $n$ points in at most $2dn$ different ways, hence:

$$2^n \leq 2dn$$
$$n \leq \log_2(2dn) = 1 + \log_2 d + \log_2 n$$
$$\frac{n}{2} < n - \log_2 n \leq 1 + \log_2 d$$
$$n < 2(\log_2 d + 1)$$

Scoring: 4 points
Combining the results, we see that the VC-dimension of stumps is roughly $\log d$ (more formally, it is $\Theta(\log d)$: for large enough $d$, it differs from $\log d$ only by a multiplicative constant).
Although the VC-dimension of stumps is fairly low, combining stumps, as we did in AdaBoost, is much more expressive. A linear combination of $m$ stumps is a classifier given by $h(x) = \mathrm{sign}(\alpha_1 h_1(x) + \alpha_2 h_2(x) + \cdots + \alpha_m h_m(x))$, where the $h_j$ are decision stumps, and $\alpha_j \in \mathbb{R}$ are their corresponding weights. We will focus on linear combinations of stumps on the one-dimensional line $\mathbb{R}$.
Note that this is not a linear combination of fixed basis functions, since the stumps $h_j$ vary from classifier to classifier.
(d) Show how m/2 points in R can be shattered by convex combinations of m stumps. (Hint: for each point use a combination of two stumps that classifies the point correctly, and is ambivalent on all other points). State the resulting bound on the VC-dimension of convex combinations of m stumps in the one- dimensional line.
Answer: For each point $x_i$ which should be positively classified, consider the two stumps given by $x \leq x_i + \epsilon$ and $x \geq x_i - \epsilon$, for $\epsilon$ small enough that there is no other point within $\epsilon$ of $x_i$. Similarly, if $x_i$ should be negatively classified, consider the two stumps $x \geq x_i + \epsilon$ and $x \leq x_i - \epsilon$. Let the weights of all stumps be equal. The votes of the two $x_i$ stumps cancel out when voting on any other point $x_j$ ($j \neq i$), and so the total classification of any point $x_i$ is exactly the (common) vote of its own two stumps on it, which is correct.
Since we can shatter a set of $m/2$ points, the VC-dimension of a convex combination of $m$ stumps is at least $m/2$.
Note that we can also show how to shatter $m$ points using $m$ stumps, and thus show a lower bound of $m$ on the VC-dimension. We can shatter $m$ points by having a stump between each two consecutive points with non-agreeing labels, with the stump classifying these two points correctly. This will require at most $m-1$ stumps, all of which will have equal weight, and results in a combined weighted vote of either $-1$ or $0$ on points that should be classified $-1$, and $0$ or $+1$ on points that should be classified $+1$. By introducing an additional stump that classifies all points the same way, and has half the weight of the rest of the stumps, we can get correct classification of all points.
Scoring: 4 points: 3 for the example, 1 for the bound of m/2.
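As a quick illustration of the two-stumps-per-point construction (our own sketch; the points, labels and variable names are made up), the following MATLAB fragment builds the pairs of stumps and checks that the equal-weight vote reproduces an arbitrary labeling:

x = [0.3 1.1 2.0 3.5];            % m/2 = 4 points on the line
y = [+1 -1 -1 +1];                % an arbitrary labeling to reproduce
e = 0.1;                          % smaller than the minimal gap between points
vote = zeros(size(x));
for i = 1:length(x)
  if y(i) == +1
    s1 = 2*(x <= x(i)+e) - 1;     % stump "x <= x_i + e"
    s2 = 2*(x >= x(i)-e) - 1;     % stump "x >= x_i - e"
  else
    s1 = 2*(x >= x(i)+e) - 1;     % mirrored pair for a negative label
    s2 = 2*(x <= x(i)-e) - 1;
  end
  vote = vote + s1 + s2;          % equal weights on all 2*(m/2) stumps
end
disp(isequal(sign(vote), y));     % prints 1: the combined vote matches the labels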
(e) (Optional) For $m+2$ points in $\mathbb{R}$, show a specific labeling that cannot be attained by a convex combination of $m$ stumps. State the resulting bound on the VC-dimension of convex combinations of $m$ stumps on the one-dimensional line.

Answer: A labeling with alternating labels cannot be attained, since there must be at least one stump between every two consecutive points, for a total of at least $m+1$ stumps.
(f) (Optional) Bound the VC-dimension of a convex combination of m stumps in higher dimensions.
Now consider sinusoidal classifiers over the real line $\mathbb{R}$. A sinusoidal classifier $h_\omega$ is specified by:

$$h_\omega(x) = \mathrm{sign}(\sin(\omega x))$$

And the class of sinusoidal classifiers contains all such classifiers:

$$\mathcal{H}_{\sin} = \{h_\omega \mid \omega > 0\}$$
6. (Optional) Consider a set of $n$ points $x_i = 4^{-i}$ for $i = 1, \ldots, n$. Show how this set of points can be shattered by $\mathcal{H}_{\sin}$.
Answer: Suppose we are given any labeling $y_1, \ldots, y_n$ for the $n$ points. Then we set

$$\omega = \pi \sum_{i=1}^n \frac{(1 - y_i)\, 4^i}{2} = \pi \sum_{i : y_i = -1} 4^i$$

Suppose $k$ is such that $y_k = -1$. Then we have

$$\omega x_k = \omega 4^{-k} = \pi(\alpha + 1 + \beta)$$

Here,

$$\alpha = \sum_{i : y_i = -1 \,\wedge\, i > k} 4^{i-k}, \qquad \beta = \sum_{i : y_i = -1 \,\wedge\, i < k} 4^{i-k}$$

The 1 term comes from the cancellation of $4^{-k}$ with one of the $4^i$ terms in the sum. Note that $\alpha$ is even, and $\beta < 1$. Thus, $\omega x_k = \pi(\gamma + \beta)$, where $\gamma$ is odd. Thus, $\mathrm{sign}(\sin(\omega x_k)) = -1 = y_k$. For $k$ such that $y_k = +1$, we get $\omega x_k = \pi(\gamma + \beta)$, where $\gamma$ is even, and $\mathrm{sign}(\sin(\omega x_k)) = +1 = y_k$. Thus, $\mathcal{H}_{\sin}$ can shatter the $n$ points.
7. What is the VC-dimension of $\mathcal{H}_{\sin}$ ? (You may use the previous question, even if you did not answer it)
Answer: The VC-dimension is infinite, since for any n, there is a set of n points that can be shattered.
Scoring: 1 point
Thanks to Rui Fan and Gregory Shakhnarovich for making available their typeset solutions.