
1 Regression Methods

1. (Gradient descent: comparison with another training algorithm on a function approximation task)

• CMU, 2004 fall, Carlos Guestrin, HW4, pr. 2

In this problem you'll compare the Gradient Descent training algorithm with one [other] training algorithm of your choosing, on a particular function approximation problem. For this problem, the idea is to familiarize yourself with Gradient Descent and at least one other numerical solution technique.

The dataset data.txt contains a series of (x, y) records, where x ∈ [0, 5] and y is a function of x given by y = a sin(bx) + w, where a and b are parameters to be learned and w is a noise term such that w ∼ N(0, σ²). We want to learn from the data the best values of a and b to minimize the sum of squared errors:

$$\arg\min_{a,b} \sum_{i=1}^{n} \left( y_i - a\sin(bx_i) \right)^2.$$

Use any programming language of your choice and implement two training techniques to learn these parameters. The first technique should be Gradient Descent with a fixed learning rate, as discussed in class. The second can be any of the other numerical solutions listed in class: Levenberg-Marquardt, Newton's Method, Conjugate Gradient, Gradient Descent with dynamic learning rate and/or momentum considerations, or one of your own choice not mentioned in class.

You may want to look at a scatterplot of the data to get rough initial values for the parameters a and b. If you are getting a large sum of squared error after convergence (where large means > 100), you may want to try random restarts.

Write a short report detailing the method you chose and its relative performance in comparison to standard Gradient Descent (report the final solution obtained (values of a and b) and some measure of the computation required to reach it and/or the resistance of the approach to local minima). If possible, explain the difference in performance based on the algorithmic difference between the two approaches you implemented and the function being learned.
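A minimal Octave sketch of the fixed-rate Gradient Descent baseline is given below; the learning rate, initialization, and stopping rule are illustrative assumptions, not part of the problem statement.

% Fixed-rate gradient descent for y = a*sin(b*x) + noise (a sketch, assuming
% data.txt has two columns x, y; eta and the stopping rule are choices).
D = load('data.txt'); x = D(:,1); y = D(:,2);
a = 1; b = 1; eta = 1e-4;                     % rough initial values from a scatterplot
for it = 1:100000
  r = y - a*sin(b*x);                         % residuals
  ga = -2 * sum(r .* sin(b*x));               % dE/da
  gb = -2 * sum(r .* a .* x .* cos(b*x));     % dE/db
  a = a - eta*ga;
  b = b - eta*gb;
  if sqrt(ga^2 + gb^2) < 1e-6, break; end     % stop when the gradient is tiny
end
fprintf('a = %.4f, b = %.4f, SSE = %.4f\n', a, b, sum((y - a*sin(b*x)).^2));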


2. (The exponential distribution: parameter estimation in the MLE and, respectively, the MAP sense, using the Gamma distribution as prior)

CMU, 2015 fall, A. Smola, B. Poczos, HW1, pr. 1.1.ab

a. An exponential distribution with parameter λ has the probability density function (p.d.f.) $\mathrm{Exp}(x) = \lambda e^{-\lambda x}$ for $x \ge 0$. Given some i.i.d. data $\{x_i\}_{i=1}^n \sim \mathrm{Exp}(\lambda)$, derive the maximum likelihood estimate (MLE) $\lambda_{\mathrm{MLE}}$.

b. A Gamma distribution with parameters $r > 0$, $\alpha > 0$ has the p.d.f.

$$\mathrm{Gamma}(x \mid r, \alpha) = \frac{\alpha^r}{\Gamma(r)}\, x^{r-1} e^{-\alpha x} \quad \text{for } x \ge 0,$$

where $\Gamma$ is Euler's gamma function.

If the posterior distribution is in the same family as the prior distribution, then we say that the prior distribution is the conjugate prior for the likelihood function.

Show that the Gamma distribution is a conjugate prior of the $\mathrm{Exp}(\lambda)$ distribution. In other words, show that if $X \doteq \{x_i\}_{i=1}^n$ with $X \sim \mathrm{Exp}(\lambda)$ and $\lambda \sim \mathrm{Gamma}(r, \alpha)$, then $P(\lambda \mid X) \sim \mathrm{Gamma}(r', \alpha')$ for some values $r', \alpha'$.

c. Derive the maximum a posteriori estimator (MAP) $\lambda_{\mathrm{MAP}}$ as a function of $r$, $\alpha$.

d. What happens [with $\lambda_{\mathrm{MLE}}$ and $\lambda_{\mathrm{MAP}}$] as $n$ gets large?

e. Let's perform an experiment in the above setting. Generate n = 20 random variables drawn from Exp(λ = 0.2). Fix α = 100 and vary r over the range (1, 30) using a stepsize of 1. Compute the corresponding MLE and MAP estimates for λ. For each r, repeat this process 50 times and compute the mean squared error of both estimates compared against the true value. Plot the mean squared error as a function of r. (Note: Octave parameterizes the exponential distribution with θ = 1/λ.)

f. Now, fix (r, α) = (30, 100) and vary n up to 1000. Plot the MSE for each n of the corresponding estimates.

g. Under what conditions is the MLE estimator better? Under what condi- tions is the MAP estimator better? Explain the behavior in the two above plots.

Answer:

a. The log-likelihood is

$$\ell(\lambda) = \sum_i \left( \ln\lambda - \lambda x_i \right) = n \ln\lambda - \lambda \sum_i x_i.$$

Setting the derivative to 0:

$$\frac{n}{\lambda} - \sum_i x_i = 0 \;\Rightarrow\; \lambda_{\mathrm{MLE}} = \frac{1}{\bar{x}}.$$

This estimator is biased.


b. Using the prior $\lambda \sim \mathrm{Gamma}(r, \alpha)$:

$$P(\lambda \mid X) \propto P(X \mid \lambda)\, P(\lambda) \propto \lambda^n e^{-\lambda \sum_i x_i} \cdot \lambda^{r-1} e^{-\alpha\lambda} \propto \lambda^{n+r-1} e^{-\lambda\left( \sum_i x_i + \alpha \right)}.$$

Therefore $P(\lambda \mid X) \sim \mathrm{Gamma}\!\left( r + n,\; \sum_i x_i + \alpha \right)$.

c. The log-posterior is

$$\ln P(\lambda \mid X) \propto -\lambda \left( \sum_i x_i + \alpha \right) + (n + r - 1) \ln\lambda.$$

Setting the derivative to 0:

$$0 = -\left( \sum_i x_i + \alpha \right) + \frac{n + r - 1}{\lambda} \;\Rightarrow\; \lambda_{\mathrm{MAP}} = \frac{n + r - 1}{\sum_i x_i + \alpha}.$$

d.

$$\lambda_{\mathrm{MAP}} = \frac{n + r - 1}{\sum_i x_i + \alpha} = \frac{1 + \frac{r-1}{n}}{\bar{x} + \frac{\alpha}{n}} \longrightarrow \frac{1}{\bar{x}} = \lambda_{\mathrm{MLE}} \quad \text{as } n \to \infty.$$

e.

f.
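The plots for parts e and f are not reproduced here. A hedged Octave sketch of the experiment in part e (exprnd from the statistics package is assumed; recall that Octave parameterizes the exponential by θ = 1/λ):

lambda = 0.2; n = 20; alpha = 100; reps = 50; rs = 1:30;
mseMLE = zeros(size(rs)); mseMAP = zeros(size(rs));
for ri = 1:length(rs)
  r = rs(ri); eM = zeros(reps,1); eP = zeros(reps,1);
  for t = 1:reps
    x = exprnd(1/lambda, n, 1);               % n draws from Exp(lambda)
    lamMLE = n / sum(x);                      % 1 / sample mean
    lamMAP = (n + r - 1) / (sum(x) + alpha);  % MAP under the Gamma(r, alpha) prior
    eM(t) = (lamMLE - lambda)^2;
    eP(t) = (lamMAP - lambda)^2;
  end
  mseMLE(ri) = mean(eM); mseMAP(ri) = mean(eP);
end
plot(rs, mseMLE, rs, mseMAP); xlabel('r'); ylabel('MSE'); legend('MLE','MAP');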

g. The MLE is better when prior information is incorrect. The MAP is better with low sample size and good prior information. Asymptotically they are the same.


3. (The binomial distribution: MLE estimation of the parameter, using Newton's method; the Gamma distribution: MLE estimation of the parameters, using the gradient method and Newton's method)

• CMU, 2008 spring, Tom Mitchell, HW2, pr. 1.2
• ◦ CMU, 2015 fall, A. Smola, B. Poczos, HW1, pr. 1.2.c

a. For the binomial sampling function, with p.d.f. $f(x) = \binom{n}{x} p^x (1-p)^{n-x}$, find the MLE using the Newton-Raphson method, starting with an estimate $\theta_0 = 0.1$, with $n = 100$, $x = 8$. Show the resulting $\theta_j$ until convergence ($|\theta_{j+1} - \theta_j| < 0.01$). (Note that the binomial MLE may be calculated analytically; you may use this to check your answer.)

b. Note: For this [part of the] exercise, please make use of the digamma and trigamma functions. You can find the digamma and trigamma functions in any scientific computing package (e.g. Octave, Matlab, Python...).

Inside the handout, the estimators.mat file contains a vector drawn from a Gamma distribution. Run your implementation of gradient descent and Newton's method (for the latter, see ex. 132 in our exercise book) to obtain the MLE estimators for this distribution. Create a plot showing the convergence of the two above methods. How do they compare? Which took more iterations? Lastly, provide the actual estimated values obtained.

Solution:

a.
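The written solution for part a is not reproduced; a minimal Octave sketch of the Newton-Raphson iteration for the binomial log-likelihood, under the stated setup, might look like:

n = 100; x = 8; theta = 0.1; tol = 0.01; delta = Inf;
while abs(delta) >= tol
  g = x/theta - (n-x)/(1-theta);          % first derivative of the log-likelihood
  h = -x/theta^2 - (n-x)/(1-theta)^2;     % second derivative
  delta = -g/h;                           % Newton step
  theta = theta + delta;
  fprintf('theta = %.5f\n', theta);
end
% Analytic check: the MLE is x/n = 0.08.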

b. You should have gotten $\alpha \approx 4$, $\beta \approx 0.5$.


4. (The categorical distribution: parameter estimation in the MLE and, respectively, the MAP sense, using the Dirichlet distribution as prior; testing on document classification)

• ◦ MIT, 2011 fall, Leslie Kaelbling, HW3, pr. 2


5. (Linear, polynomial, regularized (L2), and kernelized regression: application on a [UCI ML Repository] dataset for housing prices in the Boston area)

• · MIT, 2001 fall, Tommi Jaakkola, HW1, pr. 1
MIT, 2004 fall, Tommi Jaakkola, HW1, pr. 3
MIT, 2006 fall, Tommi Jaakkola, HW2, pr. 2.d

A. Here we will be using a regression method to predict housing prices in suburbs of Boston. You'll find the data in the file housing.data. Information about the data, including the column interpretation, can be found in the file housing.names. These files are taken from the UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets.html.

We will predict the median house value (the 14th, and last, column of the data) based on the other columns.

a. First, we will use a linear regression model to predict the house values, using squared error as the criterion to minimize. In other words, $y = f(x; \hat{w}) = \hat{w}_0 + \sum_{i=1}^{13} \hat{w}_i x_i$, where $\hat{w} = \arg\min_w \sum_{t=1}^{n} (y_t - f(x_t; w))^2$; here $y_t$ are the house values, $x_t$ are input vectors, and $n$ is the number of training examples.

Write the following MATLAB functions (these should be simple functions to code in MATLAB):

• A function that takes as input weights $w$ and a set of input vectors $\{x_t\}_{t=1,\ldots,n}$, and returns the predicted output values $\{y_t\}_{t=1,\ldots,n}$.

• A function that takes as input training input vectors and output values, and returns the optimal weight vector $\hat{w}$.

• A function that takes as input a training set of input vectors and output values, and a test set of input vectors and output values, and returns the mean training error (i.e., average squared error over all training samples) and mean test error.

b. To test our linear regression model, we will use part of the data set as a training set, and the rest as a test set. For each training set size, use the first lines of the data file as a training set, and the remaining lines as a test set.

Write a MATLAB function that takes as input the complete data set, and the desired training set size, and returns the mean training and test errors.

Turn in the mean squared training and test errors for each of the following training set sizes: 10, 50, 100, 200, 300, 400.

(Quick validation: For a sample size of 100, we got a mean training error of 4.15 and a mean test error of 1328.)

c. What condition must hold for the training input vectors so that the training error will be zero for any set of output values?

d. Do the training and test errors tend to increase or decrease as the training set size increases? Why? Try some other training set sizes to see that this is only a tendency, and sometimes the change is in the opposite direction.

e. We will now move on to polynomial regression. We will predict the house values using a function of the form:

$$f(x; w) = w_0 + \sum_{i=1}^{13} \sum_{d=1}^{m} w_{i,d}\, x_i^d,$$


where again, the weights w are chosen so as to minimize the mean squared error of the training set. Think about why we also include all lower order polynomial terms up to the highest order rather than just the highest ones.

Note that we only use features which are powers of a single input feature.

We do so mostly in order to simplify the problem. In most cases, it is more beneficial to use features which are products of different input features, and perhaps also their powers.

Think of why such features are usually more powerful.

Write a version of your MATLAB function from section b that takes as input also a maximal degree m and returns the training and test error under such a polynomial regression model.

NOTE: When the degree is high, some of the features will have extremely high values, while others will have very low values. This causes severe numeric precision problems with matrix inversion, and yields wrong answers. To overcome this problem, you will have to appropriately scale each feature $x_i^d$ included in the regression model, to bring all features to roughly the same magnitude. Be sure to use the same scaling for the training and test sets.

For example, divide each feature by the maximum absolute value of the feature, among all training and test examples. (MATLAB matrix and vector operations can be very useful for doing such scaling operations easily.)

f.

g. For a training set size of 400, turn in the mean squared training and test errors for maximal degrees of zero through ten.

(Quick validation: for maximal degree two, we got a training error of 14.5 and a test error of 32.8.)

h. Explain the qualitative behavior of the test error as a function of the polynomial degree. Which degree seems to be the best choice?

i. Prove (in two sentences) that the training error is monotonically decreasing with the maximal degree m. That is, the training error using a higher degree and the same training set is necessarily less than or equal to the training error using a lower degree.

j. We claim that if there is at least one feature (component of the input vector x) with no repeated values in the training set, then the training error will approach zero as the polynomial degree increases. Why is this true?

B. In this [part of the] problem, we explore the behavior of polynomial regression methods when only a small amount of training data is available.

...

We will begin by using a maximum likelihood estimation criterion for the parameters w that reduces to least squares fitting.

k. Consider a simple 1D regression problem. The data in housing.data provides information on how 13 different factors affect house price in the Boston area. (Each column of data represents a different factor, and is described briefly in the file housing.names.) To simplify matters (and make the problem easier to visualise), we consider predicting the house price (the 14th column) from the LSTAT feature (the 13th column).

We split the data set into two parts (in testLinear.m), train on the first part and test on the second. We have provided you with the necessary MATLAB code for training and testing a polynomial regression model. Simply edit the script (ps1_part2.m) to generate the variations discussed below.

i. Use ps1_part2.m to calculate and plot training and test errors for polynomial regression models as a function of the polynomial order (from 1 to 7). Use 250 training examples (set numtrain=250).

ii. Briefly explain the qualitative behavior of the errors. Which of the regression models are over-fitting to the data? Provide a brief justification.

iii. Rerun ps1_part2.m with only 50 training examples (set numtrain=50). Briefly explain key differences between the resulting plot and the one from part i. Which of the models are over-fitting this time?

Comment: There are many ways of trying to avoid over-fitting. One way is to use a maximum a posteriori (MAP) estimation criterion rather than maximum likelihood. The MAP criterion allows us to penalize parameter choices that we would not expect to lead to good generalization. For example, very large parameter values in linear regression make predictions very sensitive to slight variations in the inputs. We can express a preference against such large parameter values by assigning a prior distribution over the parameters, such as a simple Gaussian,

$$p(w; \alpha^2) = N(0, \alpha^2 I).$$

This prior decreases rapidly as the parameters deviate from zero. The single variance (hyper-parameter) $\alpha^2$ controls the extent to which we penalize large parameter values. This prior needs to be combined with the likelihood to get the MAP criterion. The MAP parameter estimate maximizes

$$\ln\left( p(y \mid X, w, \sigma^2)\, p(w; \alpha^2) \right) = \ln p(y \mid X, w, \sigma^2) + \ln p(w; \alpha^2).$$

The resulting parameter estimates are biased towards zero due to the prior.

We can find these estimates as before by setting the derivatives to zero.

l. Show that

$$\hat{w}_{\mathrm{MAP}} = \left( X^\top X + \frac{\sigma^2}{\alpha^2} I \right)^{-1} X^\top y.$$

m. In the above solution, show that in the limit of infinitely large $\alpha$, the MAP estimate is equal to the ML estimate, and explain why this happens.

n. Let us see how the MAP estimate changes our solution in the housing-price estimation problem. The MATLAB code you used above actually contains a variable corresponding to the variance ratio var_ratio = $\sigma^2/\alpha^2$ for the MAP estimator. This has been set to a default value of zero to simulate the ML estimator. In this part, you should vary this value from 1e-8 to 1e-4 in multiples of 10 (i.e., 1e-8, 1e-7, . . . , 1e-4). A larger ratio corresponds to a stronger prior (smaller values of $\alpha^2$ constrain the parameters w to lie closer to the origin).


iv. Plot the training and test errors as a function of the polynomial order using the above 5 MAP estimators and 250 and 50 training points.

v. Describe how the prior affects the estimation results.

C. Implement the kernel linear regression method (described in MIT, 2006 fall, Tommi Jaakkola, HW2, pr. 2.a-c / Regr-56 / Regr-72) for $\lambda > 0$. We are interested in exploring how the regularization parameter $\lambda \ge 0$ affects the solution when the kernel function is the radial basis kernel

$$K(x, x') = \exp\left( -\frac{\beta}{2} \|x - x'\|^2 \right), \quad \beta > 0.$$

We have provided training and test data as well as helpful MATLAB scripts in hw2/prob2. You should only need to complete the relevant lines in the runprob2 script. The data pertains to the problem of predicting Boston housing prices based on various indicators (normalized). Evaluate and plot the training and test errors (mean squared errors) as a function of $\lambda$ in the range $\lambda \in (0, 1)$. Use $\beta = 0.05$. Explain the qualitative behavior of the two curves.

Solution:

A.

a.

b. First, we read the data (ignoring column four):

» data = load(’housing.data’);

» x = data(:,[1:3 5:13]);

» y = data(:,14);

To get the training and test errors for training set of size s, we invoke the following MATLAB command:

» [trainE,testE] = testLinear(x,y,s)

Here are the errors I got:

training size    training error    test error
10               6.27×10⁻²⁶        1.05×10⁵
50               3.437             24253
100              4.150             1328
200              9.538             316.1
300              9.661             381.6
400              22.52             41.23

[ Note that for a training size of ten, the training error should have been zero. The very low, but still non-zero, error is a result of the limited precision of the calculations, and is not a problem. Furthermore, with only ten training examples, the optimal regression weights are not uniquely defined. There is a four-dimensional linear subspace of weight vectors that all yield zero training error. The test error above (for a training size of ten) represents an arbitrary choice of weights from this subspace (implicitly made by the pinv() function). Using different, equally optimal, weights would yield different test errors. ]


c. The training error will be zero if the input vectors are linearly independent. More precisely, since we are allowing an affine term $w_0$, it is enough that the input vectors, with an additional term always equal to one, are linearly independent. Let X be the matrix of input vectors, with additional 'one' terms, y any output vector, and w a possible weight vector. If the inputs are linearly independent, Xw = y always has a solution, and the weights w lead to zero training error.

[ Note that if X is a square matrix with linearly independent rows, then it is invertible, and Xw = y has a unique solution. But even if X is not a square matrix, while its rows are still linearly independent (this can only happen if there are fewer rows than columns, i.e., fewer training examples than features), then there are solutions to Xw = y, which do not determine w uniquely, but still yield zero training error (as in the case of a sample size of ten above). ]

d. The training error tends to increase. As more examples have to be fitted, it becomes harder to 'hit', or even come close to, all of them.

The test error tends to decrease. As we take into account more examples when training, we have more information, and can come up with a model that better resembles the true behavior. More training examples lead to better generalization.

e. We will use the following functions, on top of those from question b:

function xx = degexpand(x, deg)
function [trainE, testE] = testPoly(x, y, numtrain, deg)

f.
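The bodies of these helpers are not included in the text; a hypothetical sketch of degexpand (the per-column scaling is one possible choice, in the spirit of the NOTE above):

function xx = degexpand(x, deg)
  % Expand each input feature into its powers 1..deg.
  xx = [];
  for d = 1:deg
    xx = [xx, x.^d];
  end
  xx = bsxfun(@rdivide, xx, max(abs(xx)));  % scale each column by its max |value|
end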

g. To get the training and test errors for maximum degree d, we invoke the following MATLAB command:

» [trainE,testE] = testPoly(x,y,400,d)

Here are the errors I got:

degree    training error    test error
0         83.8070           102.2266
1         22.5196           41.2285
2         14.8128           32.8332
3         12.9628           31.7880
4         10.8684           5262
5         9.4376            5067
6         7.2293            4.8562×10⁷
7         6.7436            1.5110×10⁶
8         5.9908            3.0157×10⁹
9         5.4299            7.8748×10¹⁰
10        4.3867            5.2349×10¹³

[ These results were obtained using pinv(). Using different operations, although theoretically equivalent, might produce different results for higher degrees. In any case, using any of the suggested methods above, the errors should match the above table at least up to degree five. Beyond that, using inv() starts producing unreasonable results due to extremely small values in the matrix, which make it almost singular (non-invertible). If you used inv() and got such values, you should point this out.

Degree zero refers to having a constant predictor, i.e., predicting the same output value for all inputs. The constant value that minimizes the training error (and is thus used) is the mean training output. ]

h. Allowing more complex models, with more features, we can use as predictors functions that better correspond to the true behavior of the data. And so, the approximation error (the difference between the optimal model from our limited class and the true behavior of the data) decreases as we increase the degree. As long as there is enough training data to support such complex models, the generalization error is not too bad, and the test error decreases. However, past some point we start over-fitting the training data and the increase in the generalization error becomes much more significant than the continued decrease in the approximation error (which we cannot directly observe), causing the test error to rise.

Looking at the test error, the best maximum degree seems to be three.

i. Predictors of lower maximum degree are included in the set of predictors of higher maximum degree (they correspond to predictors in which the weights of higher-degree features are set to zero). Since we choose, from within this set, the predictor that minimizes the training error, allowing more predictors can only decrease the training error.

j. We show that for all $m \ge n - 1$ (where $n$ is the number of training examples) the training error is 0, by constructing weights which predict the training examples exactly. Let $j$ be a component of the input with no repeated values. We let $w_{i,d} = 0$ for all $i \ne j$ and all $d = 1, \ldots, m$. Then we have

$$f(x) = w_0 + \sum_i \sum_d w_{i,d}\, x_i^d = w_0 + \sum_{d=1}^{m} w_{j,d}\, x_j^d.$$

Given $n$ training points $(x_1, y_1), \ldots, (x_n, y_n)$ we are required to find $w_0, w_{j,1}, \ldots, w_{j,m}$ s.t. $w_0 + \sum_{d=1}^{m} w_{j,d} (x_i)_j^d = y_i$, $\forall i = 1, \ldots, n$. That is, we want to interpolate $n$ points with a degree $m \ge n - 1$ polynomial, which can be done exactly as long as the points $(x_i)_j$ are distinct.

B.

k.

i.

ii. The training error is monotonically decreasing (non-increasing) with polynomial order. This is because higher-order models can fully represent any lower-order model by adequate setting of parameters, which in turn implies that the former can do no worse than the latter when fitting to the same training data.

(Note that this monotonicity property need not hold if the training sets to which the higher and lower order models were fit were different, even if these were drawn from the same underlying distribution.)

The test error mostly decreases with model order till about 5th order, and then increases. This is an indication (but not proof) that higher order models (6th and 7th) might be overfitting to the data. Based on these results, the best choice of model for training on the given data is the 5th order model, since it has lowest error on an independent test set of around 250 examples.

iii.

We note the following differences between the plots for 250 and 50 examples:

• The training errors are lower in the present case. This is because we are having to fit fewer points with the same model. In this example, in particular, we are fitting only a subset of the points we were previously fitting (since there is no randomness in drawing points for training).

• The test errors for most models are higher. This is evidence of systematic overfitting for all model orders, relative to the case where there were many more training points.

• The model with the lowest test error is now the third-order model. From the 4th order onwards, the test error generally increases (though the 7th order is an exception, perhaps due to the particular choice of training and test sets). This tells us that with fewer training examples, our preference should switch towards lower-order models (in the interest of achieving low generalisation error), even though the true model responsible for generating the underlying data might be of much higher order. This relates to the trade-off between bias and variance. We typically want to minimise the mean squared error, which is the sum of the squared bias and the variance. Low-order models typically have high bias but low variance.

Higher order models may be unbiased, but have higher variance.

l.

m.

n.
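The written solutions for parts l and m are not reproduced above; a brief derivation sketch, using the Gaussian likelihood and prior defined earlier:

$$\ln p(y \mid X, w, \sigma^2) + \ln p(w; \alpha^2) = -\frac{1}{2\sigma^2}\|y - Xw\|^2 - \frac{1}{2\alpha^2}\|w\|^2 + \text{const}.$$

Setting the gradient with respect to $w$ to zero:

$$\frac{1}{\sigma^2} X^\top (y - Xw) - \frac{1}{\alpha^2} w = 0 \;\Rightarrow\; \hat{w}_{\mathrm{MAP}} = \left( X^\top X + \frac{\sigma^2}{\alpha^2} I \right)^{-1} X^\top y.$$

As $\alpha \to \infty$ the ridge term $\sigma^2/\alpha^2$ vanishes and the MAP estimate reduces to the ML (least squares) solution, which is the claim of part m.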


iv.

Plots for 250 training examples. Left to right, (then) top to bottom, variance ratio = 1e-8 to 1e-4:

Plots for 50 training examples. Left to right, (then) top to bottom, variance ratio = 1e-8 to 1e-4:

v. We make the following observations:

• As the variance ratio (i.e., the strength of the prior) increases, the training error increases (slightly). This is because we are no longer solely interested in obtaining the best fit to the training data.

• The test error for higher order models decreases dramatically with strong priors. This is because we are no longer allowing these models to overfit to training data by restricting the range of weights possible.

• Test error generally decreases with increasing prior.


• As a consequence of the above two points, the best model changes slightly with increasing prior in the direction of more complex models.

• For 50 training samples, the difference in test error between ML and MAP is more significant than with 250 training examples. This is because overfitting is a more serious problem in the former case.

C. Sample code for this problem is shown below:

Ntrain = size(Xtrain,1);
Ntest = size(Xtest,1);
for i=1:length(lambda)
  lmb = lambda(i);
  alpha = lmb * inv(lmb*eye(Ntrain) + K) * Ytrain;
  Atrain = (1/lmb) * repmat(alpha', Ntrain, 1);
  yhat_train = sum(Atrain.*K, 2);
  Atest = (1/lmb) * repmat(alpha', Ntest, 1);
  yhat_test = sum(Atest.*(Ktrain_test'), 2);
  E(i,:) = [mean((yhat_train-Ytrain).^2), mean((yhat_test-Ytest).^2)];
end;

The resulting plot is shown in the nearby figure. As can be seen, the training error is zero at $\lambda = 0$ and increases as $\lambda$ increases. The test error initially decreases, reaches a minimum around 0.1, and then increases again. This is exactly as we would expect.

$\lambda \approx 0$ results in over-fitting (the model is too powerful). Our regression function has a low bias but high variance.

By increasing $\lambda$ we constrain the model, thus increasing the training error. While the regularization increases bias, the variance decreases faster, and we generalize better.

High values of $\lambda$ result in under-fitting (high bias, low variance) and both the training and test errors are high.


6. (Locally weighted (unregularized) linear regression)

• ◦ Stanford, 2011 fall, Andrew Ng, HW1, pr. 2.d

The files q2x.dat and q2y.dat contain the inputs ($x^{(i)}$) and outputs ($y^{(i)}$) for a regression problem, with one training example per row.

a. Implement (unweighted) linear regression ($y = \theta^\top x$) on this dataset (using the normal equations, [LC: i.e., the analytical / closed-form solution]), and plot on the same figure the data and the straight line resulting from your fit. (Remember to include the intercept term.)

b. Implement locally weighted linear regression on this dataset (using the weighted normal equations you derived in part (b) [LC: i.e., Stanford, 2008 fall, Andrew Ng, HW1, pr. 2.b, but you may take a look at CMU, 2010 fall, Aarti Singh, midterm, pr. 4, found in the exercise book]), and plot on the same figure the data and the curve resulting from your fit. When evaluating $h(\cdot)$ at a query point $x$, use weights

$$w(x^{(i)}) = \exp\left( -\frac{(x - x^{(i)})^2}{2\tau^2} \right),$$

with a bandwidth parameter $\tau = 0.8$. (Again, remember to include the intercept term.)

c. Repeat (b) four times, with $\tau = 0.1, 0.3, 2$, and $10$. Comment briefly on what happens to the fit when $\tau$ is too small or too large.

Solution:

LC: See the code in the prob2.m Matlab file that I put in the HW1 subfolder of the Stanford 2011f folder in the main (Stanford) archive, and also in book/fig/Stanford.2008f.ANg.HW1.pr2d.
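Since prob2.m is not reproduced here, a hedged Octave sketch of the locally weighted prediction at one query point (the function name and argument shapes are assumptions):

function yq = lwr_predict(X, y, xq, tau)
  % X: n-by-2 design matrix [ones, x]; y: n-by-1 targets; xq: 1-by-2 query row.
  d = X(:,2) - xq(2);                  % distances in the single input feature
  w = exp(-d.^2 / (2*tau^2));          % Gaussian weights
  W = diag(w);
  theta = inv(X'*W*X) * (X'*W*y);      % weighted normal equations
  yq = xq * theta;                     % prediction at the query point
end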


(Plotted in color where available.)

For a small bandwidth parameter $\tau$, the fit is dominated by the closest training samples. The smaller the bandwidth, the fewer training samples are actually taken into account when doing the regression, and the regression results thus become very susceptible to noise in those few training samples.

For larger $\tau$, we have enough training samples to reliably fit straight lines; unfortunately, a straight line is not the right model for these data, so we also get a bad fit for large bandwidths.


7. ([Weighted] linear regression applied to predicting the needed quantity of insulin, starting from the sugar level in the patient's blood)

• ◦ CMU, 2009 spring, Ziv Bar-Joseph, HW1, pr. 4

An automated insulin injector needs to calculate how much insulin it should inject into a patient based on the patient's blood sugar level. Let us formulate this as a linear regression problem as follows: let $y_i$ be the dependent predicted variable (the insulin dose), and let $\beta_0$, $\beta_1$ and $\beta_2$ be the unknown coefficients of the regression function. Thus, $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2$, and we can formulate the problem of finding the unknown $\beta \doteq (\beta_0, \beta_1, \beta_2)$ as:

$$\hat{\beta} = (X^\top X)^{-1} X^\top y.$$

See data2.txt (posted on website) for data based on the above scenario with space separated fields conforming to:

bloodsugarlevel insulinedose weightage

The purpose of the weightage field will be made clear in part c.

a. Write code in Matlab to estimate the regression coefficients given the dataset consisting of pairs of independent and dependent variables. Generate a space-separated file with the estimated parameters from the entire dataset by writing out all the parameters.

b. Write code in Matlab to perform inference by predicting the insulin dosage given the blood sugar level based on training data, using a leave-one-out cross-validation scheme. Generate a space-separated file with the predicted dosages in order.

c. However, it has been found that one group of patients is twice as sensitive to the insulin dosage as the other. In the training data, these particular patients are given a weightage of 2, while the others are given a weightage of 1. Is your goodness-of-fit function flexible enough to incorporate this information?

d. Show how to formulate the regression function, and correspondingly calculate the coefficients of regression under this new scenario, by incorporating the given weights.

e. Code up this variant of regression analysis. Write out the new coefficients of regression you obtain by using the whole dataset as training data.

Solution:

a. The betas, in order:

−74.3825 13.4215 1.1941

b.

1417.0177 1501.4423 1966.3563 2833.7942 2953.4532 3075.472 3199.8566 3326.9663 3456.4704 5038.3777 5196.6767 5357.0094 7091.8113 7278.2709 7467.3604 7658.5256 7852.0808 9703.9748 9921.8604 10142.5438 10365.8512 12498.2968 12749.6202 13003.94 1798.3745 1869.1966 2024.7112 2265.6988 2392.0227 2756.1588 3915.9302 4004.465 4878.0057 5094.3282 6217.981 6485.8542 6544.4688 6805.7564 7073.8455 7207.5032 7285.3657 9393.5129 9515.9043 9704.0029 10060.5539 12037.6304 12361.3586 12903.5394


c. Since we need to weigh each point differently, our current goodness-of-fit function is unable to work in this scenario. However, since the weights for this specific dataset are 1 and 2, we may just use the old formalism and double the data items with weightage 2. The changed formalism, which enables us to assign weights of any precision to the data samples, is shown below.

d. Let:

$$y = X\beta, \quad y = (y_1, y_2, \ldots, y_n)^\top, \quad \beta = (\beta_1, \beta_2, \ldots, \beta_m)^\top, \quad X_{i,1} = 1, \quad X_{i,j+1} = x_{i,j}.$$

Let us define the weight matrix as:

$$\Omega_{i,i} = \sqrt{w_i}, \qquad \Omega_{i,j} = 0 \ (\text{for } i \ne j).$$

So, $\Omega y = \Omega X \beta$.

To minimize the weighted square error, we have to take the derivative with respect to $\beta$:

$$\frac{\partial}{\partial\beta}\left( (\Omega y - \Omega X\beta)^\top (\Omega y - \Omega X\beta) \right) = \frac{\partial}{\partial\beta}\left( (\Omega y)^\top(\Omega y) - 2(\Omega y)^\top(\Omega X\beta) + \beta^\top X^\top \Omega^\top \Omega X \beta \right).$$

Therefore

$$\frac{\partial}{\partial\beta}\left( (\Omega y - \Omega X\beta)^\top (\Omega y - \Omega X\beta) \right) = 0 \;\Leftrightarrow\; -2\left( (\Omega y)^\top (\Omega X) \right)^\top + 2\, X^\top \Omega^\top \Omega X \hat{\beta} = 0 \;\Leftrightarrow\; \hat{\beta} = (X^\top \Omega^\top \Omega X)^{-1} X^\top \Omega^\top \Omega y.$$

e. The new beta coefficients are, in order:

−57.808 13.821 1.199
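A minimal Matlab/Octave sketch of this weighted fit (the function name and variable names are assumptions; wt holds the per-row weightage column):

function beta = wls_fit(X, y, wt)
  % X: n-by-3 design matrix [1, x, x.^2]; y: insulin doses; wt: weights.
  Om2 = diag(wt);                      % Omega' * Omega, a diagonal matrix
  beta = inv(X'*Om2*X) * (X'*Om2*y);   % weighted normal equations
end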


8. (Linear [Ridge] regression applied to predicting the level of PSA in the prostate gland, using a set of medical test results)

• ◦ ⋆ ⋆ CMU, 2009 fall, Geoff Gordon, HW3, pr. 3

The linear regression method is widely used in the medical domain. In this question you will work on prostate cancer data from a study by Stamey et al.963 You can download the data from . . ..

Your task is to predict the level of prostate-specific antigen (PSA) using a set of medical test results. PSA is a protein produced by the cells of the prostate gland. High levels of PSA often indicate the presence of prostate cancer or other prostate disorders.

The attributes are several clinical measurements on men who have prostate cancer. There are 8 attributes: log cancer volume (lcavol), log prostate weight (lweight), log of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), age, log of capsular penetration (lcp), Gleason score (gleason), and percent of Gleason scores of 4 or 5 (pgg45). svi and gleason are categorical, that is, they take values either 1 or 0; the others are real-valued.

We will refer to these attributes as A1 = lcavol, A2 = lweight, A3 = age, A4 = lbph, A5 = svi, A6 = lcp, A7 = gleason, A8 = pgg45.

Each row of the input file describes one data point: the first column is the index of the data point, the following eight columns are attributes, and the tenth column gives the log PSA level lpsa, the response variable we are interested in. We already randomized the data and split it into three parts corresponding to training, validation and test sets. The last column of the file indicates whether the data point belongs to the training set, validation set or test set, indicated by '1' for training, '2' for validation and '3' for testing.

The training data includes 57 examples; validation and test sets contain 20 examples each.

Inspecting the Data

a. Calculate the correlation matrix of the 8 attributes and report it in a table.

The table should be 8-by-8. You can use Matlab functions.

b. Report the top 2 pairs of attributes that show the highest pairwise positive correlation and the top 2 pairs of attributes that show the highest pairwise negative correlation.

Solving the Linear Regression Problem

You will now try to find several models in order to predict the lpsa levels.

The linear regression model is

$$Y = f(X) + \epsilon,$$

where $\epsilon$ is a Gaussian noise variable, and

$$f(X) = \sum_{j=0}^{p} w_j \phi_j(X),$$

963 Stamey TA, Kabalin JN, McNeal JE et al. Prostate specific antigen in the diagnosis and treatment of the prostate. II. Radical prostatectomy treated patients. J Urol 1989; 141: 1076-83.


where $p$ is the number of basis functions (features), $\phi_j$ is the $j$th basis function, and $w_j$ is the weight we wish to learn for the $j$th basis function. In the models below, we will always assume that $\phi_0(X) = 1$ represents the intercept term.

c. Write a Matlab function that takes the data matrix Φ and the column vector of responses y as an input and produces the least squares fit w as the output (refer to the lecture notes for the calculation of w).

d. You will create the following three models. Note that before solving each regression problem below, you should scale each feature vector to have a zero mean and unit variance. Don’t forget to include the intercept column, φ0(X) = 1, after scaling the other features. Notice that since you shifted the attributes to have zero mean, in your solutions, the intercept term will be the mean of the response variable.

• Model1: Features are equal to input attributes, with the addition of a constant feature $\phi_0$. That is, $\phi_0(X) = 1$, $\phi_1(X) = A_1, \ldots, \phi_8(X) = A_8$. Solve the linear regression problem and report the resulting feature weights. Discuss what it means for a feature to have a large negative weight, a large positive weight, or a small weight. Would you be able to comment on the weights if you had not scaled the predictors to have the same variance? Report the mean squared error (MSE) on the training and validation data.

• Model2: Include additional features corresponding to pairwise products of the first six of the original attributes,964 i.e., $\phi_9(X) = A_1 \cdot A_2, \ldots, \phi_{13}(X) = A_1 \cdot A_6$, $\phi_{14}(X) = A_2 \cdot A_3, \ldots, \phi_{23}(X) = A_5 \cdot A_6$. First compute the features according to the formulas above using the unnormalized values, then shift and scale the new features to have zero mean and unit variance, and add the column for the intercept term $\phi_0(X) = 1$. Report the five features whose weights achieved the largest absolute values.

• Model3: Starting with the results of Model1, drop the four features with the lowest weights (in absolute values). Build a new model using only the remaining features. Report the resulting weights.

e. Make two bar charts, the first to compare the training errors of the three models, the second to compare the validation errors of the three models.

Which model achieves the best performance on the training data? Which model achieves the best performance on the validation data? Comment on differences between training and validation errors for individual models.

f. Which of the models would you use for predicting the response variable?

Explain.

Ridge Regression

For this question you will start with Model2 and employ regularization on it.

g. Write a Matlab function to solve Ridge regression. The function should take the data matrix Φ, the column vector of responses y, and the regularization parameter λ as the inputs and produce the least squares fit w as the output (refer to the lecture notes for the calculation of w). Do not penalize $w_0$, the intercept term. (You can achieve this by replacing the first column of the λI matrix with zeros.)

964 These features are also called interactions, because they attempt to account for the effect of two attributes being simultaneously high or simultaneously low.

h. You will create a plot exploring the effect of the regularization parameter on training and validation errors. The x-axis is the regularization parameter (on a log scale) and the y-axis is the mean squared error. Show two curves in the same graph, one for the training error and one for the validation error.

Starting with $\lambda = 2^{-30}$, try 50 values: at each iteration increase $\lambda$ by a factor of 2, so that for example the second iteration uses $\lambda = 2^{-29}$. For each $\lambda$, you need to train a new model.

i. What happens to the training error as the regularization parameter increases? What about the validation error? Explain the curve in terms of overfitting, bias and variance.

j. What is the $\lambda$ that achieves the lowest validation error, and what is the validation error at that point? Compare this validation error to the Model2 validation error when no regularization was applied (you solved this in part e). How does w differ in the regularized and unregularized versions, i.e., what effect did regularization have on the weights?

k. Is this validation error lower or higher than the validation error of the model you chose in part f? Which one should be your final model?

l. Now that you have decided on your model (features and possibly the regularization parameter), combine your training and validation data to make a combined training set, train your model on this combined training set, and evaluate it on the test set. Report the training and test errors.

Solution:

a.

lcavol lweight age lbph svi lcp gleason pgg45 lpsa

lcavol   1.0000  0.2805  0.2249  0.0273   0.5388  0.6753  0.4324  0.4336  0.7344
lweight  0.2805  1.0000  0.3479  0.4422   0.1553  0.1645  0.0568  0.1073  0.4333
age      0.2249  0.3479  1.0000  0.3501   0.1176  0.1276  0.2688  0.2761  0.1695
lbph     0.0273  0.4422  0.3501  1.0000  -0.0858 -0.0069  0.0778  0.0784  0.1798
svi      0.5388  0.1553  0.1176 -0.0858   1.0000  0.6731  0.3204  0.4576  0.5662
lcp      0.6753  0.1645  0.1276 -0.0069   0.6731  1.0000  0.5148  0.6315  0.5488
gleason  0.4324  0.0568  0.2688  0.0778   0.3204  0.5148  1.0000  0.7519  0.3689
pgg45    0.4336  0.1073  0.2761  0.0784   0.4576  0.6315  0.7519  1.0000  0.4223
lpsa     0.7344  0.4333  0.1695  0.1798   0.5662  0.5488  0.3689  0.4223  1.0000

b. The top 2 pairs that show the highest pairwise positive correlation are gleason-pgg45 (0.7519) and lcavol-lcp (0.6753). Highest negative correlations: lbph-svi (−0.0858) and lbph-lcp (−0.0069).

c. See below:

function what = lregress(Y,X)
% least squares solution to linear regression
% X is the feature matrix
% Y is the response variable vector
what = inv(X'*X)*X'*Y;
end

d.

Model1:

The weight vector:

w = [2.68265, 0.71796, 0.17843, −0.21235, 0.25752, 0.42998, −0.14179, 0.08745, 0.02928].

Model2:

The largest five absolute values in descending order:

lweight*age, lbph, lweight, age, age*lbph.

Model3:

The features with the lowest absolute weights in Model1:

pgg45, gleason, lcp, lweight.

The resulting weights: w = [2.6827, 0.7164, −0.1735, 0.3441, 0.4095].

e. [Bar charts omitted: "Training Error of the Three Models" (Model ID vs. training MSE) and "Validation Error of the Three Models" (Model ID vs. validation MSE).]

Model2 achieves the best performance on the training data, whereas Model1 achieves the best performance on the validation data. Model2 suffers from overfitting, indicated by its very low training error but high validation error.

Model3 seems to be too simple: it has both a higher training error and a higher validation error compared to Model1. The dropped features are thus informative, as indicated by Model1's lower training and validation errors.

f. Model1, since it achieves the best performance on the validation data.

Model2 overfits, and Model3 is too simple.

g. See below:

function what = ridgeregress(Y,X,lambda)
% X is the feature matrix
% Y is the response vector
% what are the estimated weights
penal = lambda*eye(size(X,2));
penal(:,1) = 0;
what = inv(X'*X+penal)*X'*Y;
end
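A possible sketch of the λ sweep requested in part h, reusing ridgeregress (the variable names for the data splits are assumptions):

lambdas = 2.^(-30:19);    % 50 values: 2^-30, 2^-29, ..., 2^19
for i = 1:length(lambdas)
  w = ridgeregress(Ytrain, Xtrain, lambdas(i));
  trainMSE(i) = mean((Xtrain*w - Ytrain).^2);
  validMSE(i) = mean((Xvalid*w - Yvalid).^2);
end
semilogx(lambdas, trainMSE, lambdas, validMSE);
xlabel('lambda'); ylabel('MSE'); legend('training', 'validation');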


h. [Plot omitted: training and validation MSE as a function of log2(lambda).]

i. When the model is not regularized much (the left side of the graph), the training error is low and the validation error is high, indicating the model is too complex and overfitting to the training data. In that region, the bias is low and the variance is high.

As the regularization parameter increases, the bias increases and variance decreases. The overfitting problem is overcome as indicated by decreasing validation error and increasing training error.

As the regularization penalty increases too much, the model becomes too simple and starts suffering from underfitting, as shown by the poor performance on the training data.

j. $\log_2 \lambda = 4$, i.e., $\lambda = 16$, achieves the lowest validation error, which is 0.447. This validation error is much less than the validation error of the model without regularization, which was 0.867. Regularized weights are smaller than unregularized weights: regularization decreases the magnitude of the weights.

k. The validation error of the penalized model ($\lambda = 16$) is 0.447, which is lower than Model1's validation error, 0.5005. Therefore, this model is chosen.

l. The final model's training error is 0.40661 and the test error is 0.58892.


9. (Linear weighted, unweighted, and functional regression: application to denoising quasar spectra)

• ◦ · Stanford, 2017 fall, Andrew Ng, Dan Boneh, HW1, pr. 5
Stanford, 2016 fall, Andrew Ng, John Duchi, HW1, pr. 5

Solution:


10. (Linear regression with L1 regularization: the coordinate descent method)

• ◦ ⋆ Stanford, 2007 fall, Andrew Ng, HW3, pr. 3.bc

Solution:


11. (Feature selection in the context of linear regression with L1 regularization: the coordinate descent method)

• ◦ ⋆ MIT, 2003 fall, Tommi Jaakkola, HW4, pr. 1

Solution:


12. (Logistic regression with gradient ascent: application on a synthetic dataset from R²; overfitting)

• ◦ CMU, 2015 spring, T. Mitchell, N. Balcan, HW4, pr. 2.c-i

In logistic regression, our goal is to learn a set of parameters by maximizing the conditional log-likelihood of the data.

In this problem you will implement a logistic regression classifier and apply it to a two-class classification problem. In the archive, you will find one .m file for each of the functions that you are asked to implement, along with a file called HW4Data.mat that contains the data for this problem. You can load the data into Octave by executing load("HW4Data.mat") in the Octave interpreter.

Make sure not to modify any of the function headers that are provided.

a. Implement a logistic regression classifier using gradient ascent (for the formulas and their calculation see ex. 13 in our exercise book965) by filling in the missing code for the following functions:

• Calculate the value of the objective function:

obj = LR_CalcObj(XTrain,yTrain,wHat)

• Calculate the gradient:

grad = LR_CalcGrad(XTrain,yTrain,wHat)

• Update the parameter value:

wHat = LR_UpdateParams(wHat,grad,eta)

• Check whether gradient ascent has converged:

hasConverged = LR_CheckConvg(oldObj,newObj,tol)

• Complete the implementation of gradient ascent:

[wHat,objVals] = LR_GradientAscent(XTrain,yTrain)

• Predict the labels for a set of test examples:

[yHat,numErrors] = LR_PredictLabels(XTest,yTest,wHat)

where the arguments and return values of each function are defined as follows:

• XTrain is an n×p dimensional matrix that contains one training instance per row

• yTrain is an n×1 dimensional vector containing the class labels for each training instance

• wHat is a (p+1)×1 dimensional vector containing the regression parameter estimates $\hat{w}_0, \hat{w}_1, \ldots, \hat{w}_p$

• grad is a (p+1)×1 dimensional vector containing the value of the gradient of the objective function with respect to each parameter in wHat

• eta is the gradient ascent step size that you should set to eta = 0.01

• obj, oldObj and newObj are values of the objective function

• tol is the convergence tolerance, which you should set to tol = 0.001

• objVals is a vector containing the objective value at each iteration of gradient ascent

965 Formally, you will assume that a dataset with $n$ training examples and $p$ features will be given to you. The class labels will be denoted $y^{(i)}$, the features $x_1^{(i)}, \ldots, x_p^{(i)}$, and the parameters $w_0, w_1, \ldots, w_p$, where the superscript $(i)$ denotes the sample index.


• XTest is an m×p dimensional matrix that contains one test instance per row

• yTest is an m×1 dimensional vector containing the true class labels for each test instance

• yHat is an m×1 dimensional vector containing your predicted class labels for each test instance

• numErrors is the number of misclassified examples, i.e., the number of differences between yHat and yTest

To complete the LR_GradientAscent function, you should use the helper functions LR_CalcObj, LR_CalcGrad, LR_UpdateParams, and LR_CheckConvg.
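A hedged Octave sketch of three of these helpers, using the argument shapes defined above (prepending the intercept column inside each helper is an assumption about how the starter code is organized):

function obj = LR_CalcObj(XTrain, yTrain, wHat)
  X = [ones(size(XTrain,1),1), XTrain];      % prepend the intercept column
  z = X * wHat;
  obj = sum(yTrain .* z - log(1 + exp(z)));  % conditional log-likelihood
end

function grad = LR_CalcGrad(XTrain, yTrain, wHat)
  X = [ones(size(XTrain,1),1), XTrain];
  p = 1 ./ (1 + exp(-X * wHat));             % P(Y = 1 | x, w)
  grad = X' * (yTrain - p);                  % gradient of the log-likelihood
end

function wHat = LR_UpdateParams(wHat, grad, eta)
  wHat = wHat + eta * grad;                  % gradient ascent step
end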

b. Train your logistic regression classifier on the data provided in XTrain and yTrain with LR_GradientAscent, and then use your estimated parameters wHat to calculate predicted labels for the data in XTest with LR_PredictLabels.

c. Report the number of misclassified examples in the test set.

d. Plot the value of the objective function on each iteration of gradient ascent, with the iteration number on the horizontal axis and the objective value on the vertical axis. Make sure to include axis labels and a title for your plot. Report the number of iterations that are required for the algorithm to converge.

e. Next, you will evaluate how the training and test error change as the training set size increases. For each value of k in the set {10, 20, 30, . . . , 480, 490, 500}, first choose a random subset of the training data of size k using the following code:

subsetInds = randperm(n,k)
XTrainSubset = XTrain(subsetInds,:)
yTrainSubset = yTrain(subsetInds)

Then re-train your classifier using XTrainSubset and yTrainSubset, and use the estimated parameters to calculate the number of misclassified examples on both the training set XTrainSubset and yTrainSubset and on the original test set XTest and yTest. Finally, generate a plot with two lines: in blue, plot the value of the training error against k, and in red, plot the value of the test error against k, where the error should be on the vertical axis and the training set size should be on the horizontal axis. Make sure to include a legend in your plot to label the two lines. Describe what happens to the training and test error as the training set size increases, and provide an explanation for why this behavior occurs.

f. Based on the logistic regression formula you learned in class, derive the analytical expression for the decision boundary of the classifier in terms of w0, w1, . . . , wp and x1, . . . , xp. What can you say about the shape of the decision boundary?

g. In this part, you will plot the decision boundary produced by your classifier.

First, create a two-dimensional scatter plot of your test data by choosing the two features that have the highest absolute weight in your estimated parameters wHat (let's call them features j and k), and plotting the j-th dimension stored in XTest(:,j) on the horizontal axis and the k-th dimension stored in XTest(:,k) on the vertical axis. Color each point on the plot so that examples with true label y = 1 are shown in blue and those with label y = 0 are shown in red. Next, using the formula that you derived in part (f), plot the decision boundary of your classifier in black on the same figure, again considering only dimensions j and k.

Solution:

a. See the functions LR_CalcObj, LR_CalcGrad, LR_UpdateParams, LR_CheckConvg, LR_GradientAscent, and LR_PredictLabels in the solution code.

b. See the function RunLR in the solution code.

c. There are 13 misclassified examples in the test set.

d. See the figure below. The algorithm converges after 87 iterations.


e. See figure below.

As the training set size increases, the test error decreases but the training error increases. This pattern becomes even more evident when we perform the same experiment using multiple random sub-samples for each training set size, and calculate the average training and test error over these samples, the result of which is shown in the figure below.

When the training set size is small, the logistic regression model is often capable of perfectly classifying the training data, since it has relatively little variation. This is why the training error is close to zero. However, such a model has poor generalization ability because its estimate of wHat is based on a sample that is not representative of the true population from which the data is drawn. This phenomenon is known as overfitting, because the model fits too closely to the training data. As the training set size increases, more variation is introduced into the training data, and the model is usually no longer able to fit to the training set as well. This is also due to the fact that the complete dataset is not 100% linearly separable. At the same time, more training data provides the model with a more complete picture of the overall population, which allows it to learn a more accurate estimate of wHat. This in turn leads to better generalization ability, i.e., lower prediction error on the test dataset.

f. The analytical formula for the decision boundary is given by $w_0 + \sum_{j=1}^{p} w_j x_j = 0$. This is the equation of a hyperplane in $\mathbb{R}^p$, which indicates that the decision boundary is linear.

g. See the function PlotDB in the solution code. See the figure below.


13. (Logistic Regression (with gradient ascent) and Rosenblatt's Perceptron: application on the Breast Cancer dataset; n-fold cross-validation; confidence interval)

• ◦ (CMU, 2009 spring, Ziv Bar-Joseph, HW2, pr. 4)

For this exercise, you will use the Breast Cancer dataset, downloadable from the course web page. Given 9 different attributes, such as "uniformity of cell size", the task is to predict malignancy.966 The archive from the course web page contains a Matlab method loaddata.m, so you can easily load in the data by typing (from the directory containing loaddata.m): data = loaddata. The variables in the resulting data structure relevant for you are:

• data.X: 683 9-dimensional data points, each element in the interval [1, 10].

• data.Y: the 683 corresponding classes, either 0 (benign) or 1 (malignant).

Logistic Regression

a. Write code in Matlab to train the weights for logistic regression. To avoid dealing with the intercept term explicitly, you can add a nonzero-constant tenth dimension to data.X: data.X(:,10)=1. Your regression function thus becomes simply:

$$P(Y = 0 \mid x; w) = \frac{1}{1 + \exp\left( \sum_{k=1}^{10} x_k w_k \right)}, \qquad P(Y = 1 \mid x; w) = \frac{\exp\left( \sum_{k=1}^{10} x_k w_k \right)}{1 + \exp\left( \sum_{k=1}^{10} x_k w_k \right)},$$

and the gradient-ascent update rule:

$$w \leftarrow w + \frac{\alpha}{683} \sum_{j=1}^{683} x_j \left( y_j - P(Y_j = 1 \mid x_j; w) \right).$$

Use the learning rate $\alpha = 1/10$. Try different learning rates if you cannot get $w$ to converge.

b. To test your program, use 10-fold cross-validation, splitting [data.X data.Y]

into 10 random approximately equal-sized portions, training on 9 concatenated parts, and testing on the remaining part. Report the mean classification accuracy over the 10 runs, and the 95% confidence interval.

Rosenblatt’s Perceptron

A very simple and popular linear classifier is the perceptron algorithm of Rosenblatt (1962), a single-layer neural network model of the form

$$y(x) = f(w^\top x),$$

with the activation function

$$f(a) = \begin{cases} 1 & \text{if } a \ge 0 \\ -1 & \text{otherwise.} \end{cases}$$

966 For more information on what the individual attributes mean, see ftp://ftp.ics.uci.edu/pub/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names.


For this classifier, we need our classes to be −1 (benign) and 1 (malignant), which can be achieved with the Matlab command: data.Y = data.Y * 2 - 1.

Weight training usually proceeds in an online fashion, iterating through the individual data points $x_j$ one or more times. For each $x_j$, we compute the predicted class $\hat{y}_j = f(w^\top x_j)$ under the current parameters $w$, and update the weight vector as follows:

$$w \leftarrow w + x_j (y_j - \hat{y}_j).$$

Note how $w$ only changes if $x_j$ was misclassified under the current model.
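A minimal Octave sketch of this training loop (20 passes, as in part c below; the variable names are assumptions):

X = data.X; Y = data.Y;            % X already has the constant tenth column
w = zeros(size(X,2), 1);
for pass = 1:20
  errs = 0;
  for j = 1:size(X,1)
    yhat = 2*(X(j,:)*w >= 0) - 1;        % f(w'x): +1 if >= 0, else -1
    if yhat ~= Y(j)
      w = w + X(j,:)' * (Y(j) - yhat);   % update only on a misclassification
      errs = errs + 1;
    end
  end
end
fprintf('misclassified in the last pass: %d\n', errs);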

c. Implement this training algorithm in Matlab. To avoid dealing with the intercept term explicitly, augment each point in data.X with a non-zero constant tenth element. In Matlab this can be done by typing: data.X(:,10)=1. Have your algorithm iterate through the whole training data 20 times and report the number of examples that were still misclassified in the 20th iteration. Does it look like the training data is linearly separable? (Hint: The perceptron algorithm is guaranteed to converge if the data is linearly separable.)

d. To test your program, use 10-fold cross-validation, using the splits you obtained in part b. For each split, do 20 training iterations to train the weights. Report the mean classification accuracy over the 10 runs, and the 95% confidence interval.

e. If the data is not linearly separable, weights can toggle back and forth from iteration to iteration. Even in the linearly separable case, the learned model is often very dependent on which training data points come first in the training sequence. A simple improvement is the weighted perceptron: training proceeds as before, but the weight vector w is saved after each update. After training, instead of the final w, the average of all saved w is taken to be the learned weight vector. Report the 10-fold CV accuracy for this variant and compare it to the simple perceptron's.

Solution:

You should have gotten something like this:

b. mean accuracy: 0.965, confidence interval: (0.951217, 0.978783).

c. 30 misclassifications in the 20th iteration. (Note that using the trained weights *after* the 20th iteration results in only around 24 misclassifications.) When running with 200 iterations, still more than 20 misclassifications occur, so the data is unlikely to be linearly separable, as otherwise the training error would become zero after enough iterations.

d. Perceptron:

mean accuracy = 0.956, 95% confidence interval: (0.940618, 0.971382).

e. Weighted perceptron:

mean accuracy = 0.968, 95% confidence interval: (0.954800, 0.981200).


14. (Logistic regression with gradient ascent: application to text classification)

• ◦ CMU, 2010 fall, Aarti Singh, HW1, pr. 5

In this problem you will implement Logistic Regression and evaluate its performance on a document classification task. The data for this task is taken from the 20 Newsgroups data set,967 and is available from the course web page.

Our model will use the bag-of-words assumption. This model assumes that each word in a document is drawn independently from a categorical distribution over possible words. (A categorical distribution is a generalization of a Bernoulli distribution to multiple values.) Although this model ignores the ordering of words in a document, it works surprisingly well for a number of tasks. We number the words in our vocabulary from 1 to m, where m is the total number of distinct words in all of the documents. Documents from class y are drawn from a class-specific categorical distribution parameterized by $\theta_y$. $\theta_y$ is a vector, where $\theta_{y,i}$ is the probability of drawing word $i$ and $\sum_{i=1}^{m} \theta_{y,i} = 1$. Therefore, the class-conditional probability of drawing document $x$ from our model is

$$P(X = x \mid Y = y) = \prod_{i=1}^{m} \theta_{y,i}^{\mathrm{count}_i(x)},$$

where $\mathrm{count}_i(x)$ is the number of times word $i$ appears in $x$.

a. Provide high-level descriptions of the Logistic Regression algorithm. Be sure to describe how to estimate the model parameters and how to classify a new example.

b. Implement Logistic Regression. We found that a step size around 0.0001 worked well. Train the model on the provided training data and predict the labels of the test data. Report the training and test error.

Solution:

a. The logistic regression model is

P(Y = 1|X=x, w) = exp(w0+P

iwixi) 1 + exp(w0+P

iwixi),

where w= (w0, w1, . . . , wm) is our parameter vector. We will find wˆ by maxi- mizing the data loglikelihood l(w):

l(w) = log

 Y

j

exp(yj(w0+P

iwixji)) 1 + exp(w0+P

iwixji)

= X

j

yj(w0+X

i

wixji)−log(1 + exp(w0+X

i

wixji))

!

We can estimate/learn the parameters (w) of logistic regression by optimi- zing l(w), using gradient ascent. The gradient of l(w) is the array of partial derivatives of l(w):

967 Full version available from http://people.csail.mit.edu/jrennie/20Newsgroups/.


$$\frac{\partial l(w)}{\partial w_0} = \sum_j \left( y_j - \frac{\exp\left( w_0 + \sum_i w_i x_{ji} \right)}{1 + \exp\left( w_0 + \sum_i w_i x_{ji} \right)} \right) = \sum_j \left( y_j - P(Y = 1 \mid X = x_j; w) \right)$$

$$\frac{\partial l(w)}{\partial w_k} = \sum_j \left( y_j x_{jk} - \frac{x_{jk} \exp\left( w_0 + \sum_i w_i x_{ji} \right)}{1 + \exp\left( w_0 + \sum_i w_i x_{ji} \right)} \right) = \sum_j x_{jk} \left( y_j - P(Y = 1 \mid X = x_j; w) \right)$$

Let $w^{(t)}$ represent our parameter vector on the $t$-th iteration of gradient ascent. To perform gradient ascent, we first set $w^{(0)}$ to some arbitrary value (say 0). We then repeat the following updates until convergence:

$$w_0^{(t+1)} \leftarrow w_0^{(t)} + \alpha \sum_j \left( y_j - P(Y = 1 \mid X = x_j; w^{(t)}) \right)$$

$$w_k^{(t+1)} \leftarrow w_k^{(t)} + \alpha \sum_j x_{jk} \left( y_j - P(Y = 1 \mid X = x_j; w^{(t)}) \right)$$

where $\alpha$ is a step size parameter which controls how far we move along our gradient at each step. We set $\alpha = 0.0001$. The algorithm converges when $\|w^{(t)} - w^{(t+1)}\| < \delta$, that is, when the weight vector doesn't change much during an iteration. We set $\delta = 0.001$.

b. Training error: 0.00. Test error: 0.29. The large difference between training and test error means that our model overfits the training data. A possible reason is that we do not have enough training data to estimate the model accurately.


15. (Logistic regression, regularized (L2): application to binary text classification; feature selection using mutual information)

• ◦ CMU, 2012 fall, T. Mitchell, Z. Bar-Joseph, HW2, pr. 4
