MACHINE LEARNING
Liviu Ciortuz
Department of CS, University of Iași, România
What is Machine Learning?
• ML studies algorithms that improve with (i.e., learn from) experience.
Tom Mitchell's Definition of the [general] learning problem:
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance on tasks in T, as measured by P, improves with experience E."
• Examples of [specific] learning problems (see next slide)
• [Liviu Ciortuz:] ML is data-driven programming
• [Liviu Ciortuz:] ML gathers a number of well-defined sub-domains/disciplines, each of them aiming to solve in its own way the above-formulated [general] learning problem.
What is Machine Learning good for?
• natural language (text & speech) processing
• genetic sequence analysis
• robotics
• customer (financial risk) evaluation
• terrorist threat detection
• compiler optimisation
• semantic web
• computer security
• software engineering
• computer vision (image processing)
• etc.
Related courses at FII
• Genetic Algorithms
• Artificial Neural Networks
• Probabilistic programming
• Special Chapters of Machine Learning
• Special Chapters of Artificial Neural Networks
• Data Mining
• Nature-inspired computing methods
• Big Data Analytics
• Image Processing
• Computer Vision
◦ Bioinformatics
A multi-domain view
[Diagram: a multi-domain view. Machine Learning lies at the intersection of Artificial Intelligence (concept learning), Mathematics and Statistics (model fitting), Algorithms, Statistical Learning, Pattern Recognition, and Data Mining / Database Systems Engineering (Knowledge Discovery in Databases).]
The Machine Learning Undergraduate Course:
Plan
0. Introduction to Machine Learning (T. Mitchell, ch. 1)
1. Probabilities Revision (Ch. Manning & H. Schütze, ch. 2)
2. Decision Trees (T. Mitchell, ch. 3)
3. Bayesian Learning (T. Mitchell, ch. 6)
[and the relationship with Logistic Regression]
4. Instance-based Learning (T. Mitchell, ch. 8)
5. Clustering Algorithms (Ch. Manning & H. Schütze, ch. 14)
The Machine Learning Master Course:
Tentative Plan
1. Decision Trees: Boosting
2. Support Vector Machines (N. Cristianini & J. Shawe-Taylor, 2000)
3. Computational Learning Theory (T. Mitchell, ch. 7)
   Probabilities Revision (Ch. Manning & H. Schütze, ch. 2)
4. Gaussian Bayesian Learning
5. The EM algorithmic schemata (T. Mitchell, ch. 6.12)
6. Hidden Markov Models (Ch. Manning & H. Schütze, ch. 9)
Bibliography
0. "Exerciții de învățare automată"
   L. Ciortuz, A. Munteanu, E. Bădărău. Iași, Romania, 2022
   www.info.uaic.ro/∼ciortuz/ML.ex-book/editia-2022f/ex-book.22sept2022.pdf
1. "Machine Learning"
   Tom Mitchell. McGraw-Hill, 1997
2. "Support Vector Machines and other kernel-based learning methods"
   Nello Cristianini, John Shawe-Taylor. Cambridge University Press, 2000
3. "Foundations of Statistical Natural Language Processing"
   Christopher Manning, Hinrich Schütze. MIT Press, 2002
4. "Machine Learning Foundations"
   Taeho Jo. Springer, 2021
5. "A First Course in Machine Learning"
   Simon Rogers, Mark Girolami. Chapman and Hall / CRC, 2nd ed., 2016
A general schema for machine learning methods
[Diagram: a general schema for machine learning methods — training data is fed to a machine learning algorithm, which produces a model; the model is then applied to test/generalization data to output a predicted classification.]
“We are drowning in information but starved for knowledge.”
John Naisbitt, “Megatrends” book, 1982
Basic ML Terminology
1. instance x, instance set X
   concept c ⊆ X, or c : X → {0, 1}
   example (labeled instance): ⟨x, c(x)⟩; positive examples, negative examples
2. hypotheses h : X → {0, 1}
   hypotheses representation language, hypotheses set H
   hypotheses consistent with the concept c: h(x) = c(x) for every example ⟨x, c(x)⟩
   version space
3. learning = train + test
   supervised learning (classification), unsupervised learning (clustering)
4. error of h: error_h = |{x ∈ X : h(x) ≠ c(x)}|
   training error, test error
   accuracy, precision, recall
5. validation set, development set
   n-fold cross-validation (see the sketch below), leave-one-out cross-validation
   overfitting
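Item 5 mentions n-fold cross-validation; the sketch below (illustration only; the train/accuracy interface is a placeholder, not from the course materials) shows how the folds are formed and used:

def n_fold_cross_validation(examples, labels, n, train_fn, accuracy_fn):
    """Split the data into n folds; each fold serves once as the test set while the
    remaining n-1 folds are used for training. Returns the mean test accuracy."""
    fold_size = len(examples) // n
    scores = []
    for f in range(n):
        lo, hi = f * fold_size, (f + 1) * fold_size
        test_x, test_y = examples[lo:hi], labels[lo:hi]
        train_x = examples[:lo] + examples[hi:]
        train_y = labels[:lo] + labels[hi:]
        model = train_fn(train_x, train_y)          # e.g., an ID3 or Naive Bayes learner
        scores.append(accuracy_fn(model, test_x, test_y))
    return sum(scores) / n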
The Inductive Learning Assumption
Any hypothesis found to approximate the target function well over a sufficiently large set of training examples
will also approximate the target function well
over other unobserved examples.
Inductive Bias
Consider
• a concept learning algorithm L
• the instances X, and the target concept c
• the training examples D_c = {⟨x, c(x)⟩}.
• Let L(x_i, D_c) denote the classification assigned to the instance x_i by L after training on the data D_c.
Definition:
The inductive bias of L is any minimal set of assertions B such that
  (∀x_i ∈ X) [(B ∧ D_c ∧ x_i) ⊢ L(x_i, D_c)]
for any target concept c and corresponding training examples D_c.
(A ⊢ B means A logically entails B)
Inductive systems can be modelled by equivalent deductive systems.
Evaluation measures in Machine Learning
[Diagram: the target concept c and the hypothesis h, partitioning the instance set into tp, fp, fn and tn regions.]
tp − true positives    fp − false positives
tn − true negatives    fn − false negatives

accuracy:                 Acc = (tp + tn) / (tp + tn + fp + fn)
precision:                P = tp / (tp + fp)
recall (or: sensitivity): R = tp / (tp + fn)
F-measure:                F = 2 P × R / (P + R)
specificity:              Sp = tn / (tn + fp)
fallout:                  fp / (tn + fp)
Matthews Correlation Coefficient:
  MCC = (tp × tn − fp × fn) / √((tp + fp) × (tn + fn) × (tp + fn) × (tn + fp))
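A minimal Python sketch (illustration only, not part of the course materials) that computes these measures from the four confusion-matrix counts:

import math

def evaluation_measures(tp, fp, tn, fn):
    """Compute the evaluation measures listed above from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # a.k.a. sensitivity
    f_measure = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    fallout = fp / (tn + fp)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    return {"Acc": acc, "P": precision, "R": recall, "F": f_measure,
            "Sp": specificity, "fallout": fallout, "MCC": mcc}

# Example: 40 true positives, 10 false positives, 45 true negatives, 5 false negatives
print(evaluation_measures(tp=40, fp=10, tn=45, fn=5))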
Lazy learning vs. eager learning algorithms
Eager: generalize before seeing the query
◦ ID3, Backpropagation, Naive Bayes, Radial basis function networks, . . .
• Must create a global approximation
Lazy: wait for the query before generalizing
◦ k-Nearest Neighbor, Locally weighted regression, Case-based reasoning
• Can create many local approximations
Does it matter?
If they use the same hypothesis space H, lazy learners can represent more complex functions.
E.g., a lazy Backpropagation algorithm can learn a NN which is different for each query point, compared to the eager version of Backpropagation.
Basic Machine Learning Algorithms
ID3 algorithm: a simplified version
Ross Quinlan, 1979, 1986

START
create the root node;
assign all examples to the root node;
Main loop:
1. A ← the "best" decision attribute for the next node;
2. for each value of A, create a new descendant of node;
3. sort training examples to leaf nodes;
4. if training examples are perfectly classified, then STOP;
   else iterate over the new leaf nodes
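A compact Python sketch of this simplified loop, assuming categorical attributes and using information gain as the "best attribute" criterion (the helper names are illustrative choices, not a reference implementation):

import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, labels, attr):
    """Entropy reduction obtained by splitting on attribute `attr`."""
    base = entropy(labels)
    remainder = 0.0
    for value in set(ex[attr] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return base - remainder

def id3(examples, labels, attributes):
    """examples: list of dicts attribute -> value; labels: parallel list of classes."""
    if len(set(labels)) == 1:                 # perfectly classified: STOP
        return labels[0]
    if not attributes:                        # no attribute left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    node = {best: {}}
    for value in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        sub_ex = [examples[i] for i in idx]
        sub_lab = [labels[i] for i in idx]
        node[best][value] = id3(sub_ex, sub_lab, [a for a in attributes if a != best])
    return node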
Consider m training examples S = {(x_1, y_1), . . . , (x_m, y_m)}, where x_i ∈ X and y_i ∈ {−1, +1}.
Suppose we have a weak learning algorithm A which produces a hypothesis h : X → {−1, +1} given any distribution D of examples.
• Begin with a uniform distribution D_1(i) = 1/m, i = 1, . . . , m.
• At each iteration t = 1, . . . , T,
  ◦ run the weak learning algorithm A on the distribution D_t and produce the hypothesis h_t;
    Note (1): Since A is a weak learning algorithm, the produced hypothesis h_t at round t is only slightly better than random guessing, say, by a margin γ_t:
      ε_t = err_{D_t}(h_t) = Pr_{x∼D_t}[y ≠ h_t(x)] = 1/2 − γ_t.
    Note (2): If at a certain iteration t ≤ T the weak classifier A cannot produce a hypothesis better than random guessing (i.e., γ_t = 0), or it produces a hypothesis for which ε_t = 0, then the AdaBoost algorithm should be stopped.
  ◦ update the distribution
      D_{t+1}(i) = (1/Z_t) · D_t(i) · e^{−α_t y_i h_t(x_i)}   for i = 1, . . . , m,   (1)
    where α_t := (1/2) ln((1 − ε_t)/ε_t), and Z_t is the normalizer.
• In the end, deliver H_T = sign(Σ_{t=1}^T α_t h_t) as the learned hypothesis, which will act as a weighted majority vote.
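A minimal NumPy sketch of this loop, with decision stumps standing in for the abstract weak learner A (the stump search and the parameter defaults are illustrative simplifications):

import numpy as np

def best_stump(X, y, D):
    """Weak learner A: exhaustively pick the threshold stump with the
    smallest weighted error under the distribution D."""
    m, d = X.shape
    best = (None, np.inf)          # ((feature, threshold, polarity), weighted error)
    for j in range(d):
        for thr in np.unique(X[:, j]):
            for polarity in (+1, -1):
                pred = np.where(X[:, j] <= thr, polarity, -polarity)
                err = np.sum(D[pred != y])
                if err < best[1]:
                    best = ((j, thr, polarity), err)
    return best

def stump_predict(stump, X):
    j, thr, polarity = stump
    return np.where(X[:, j] <= thr, polarity, -polarity)

def adaboost(X, y, T=20):
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                        # D_1(i) = 1/m
    ensemble = []                                  # list of (alpha_t, h_t)
    for t in range(T):
        h_t, eps = best_stump(X, y, D)
        if eps == 0 or eps >= 0.5:                 # Note (2): stop the algorithm
            break
        alpha = 0.5 * np.log((1 - eps) / eps)      # alpha_t = 1/2 ln((1-eps_t)/eps_t)
        pred = stump_predict(h_t, X)
        D = D * np.exp(-alpha * y * pred)          # formula (1), unnormalized
        D /= D.sum()                               # divide by the normalizer Z_t
        ensemble.append((alpha, h_t))
    def H(Xq):                                     # H_T = sign(sum_t alpha_t h_t)
        scores = sum(a * stump_predict(h, Xq) for a, h in ensemble)
        return np.sign(scores)
    return H

Usage: H = adaboost(X, y, T=20); H(X_test) then returns the weighted-majority-vote labels in {−1, +1}.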
AdaBoost as an instance of a more general stepwise algorithm
Input: S, T, H, φ, where
  S = {(x_1, y_1), . . . , (x_m, y_m)} is the training dataset, with y_i ∈ {−1, +1},
  T is the number of iterations to be executed,
  H is a set of "hypotheses",
  φ(y, y′) is a "loss" / "cost" / "risk" function;
Procedure:
Initialize the classifier by taking f_0(x) = 0 (the constant function 0), and D_1(i) = 1/m for i = 1, . . . , m
for t = 1 to T do:
  1. Compute (h_t, α_t) = argmin_{α ∈ R, h ∈ H} Σ_{i=1}^m φ(y_i, f_{t−1}(x_i) + α h(x_i))
  2. Update the classifier f_t(x) = f_{t−1}(x) + α_t h_t(x); compute D_{t+1}
end for
return the classifier sign(f_T(x))
Note: At each step, the algorithm greedily adds a hypothesis h ∈ H to the current combined hypothesis to minimize the φ-loss.
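For context, a standard derivation not spelled out on this slide: instantiating the schema with the exponential loss φ(y, y′) = e^{−y y′} recovers AdaBoost. In LaTeX:

\text{With } \varphi(y,y') = e^{-y y'} \text{ and } D_t(i) \propto e^{-y_i f_{t-1}(x_i)}:
\sum_{i=1}^{m} \varphi\bigl(y_i,\, f_{t-1}(x_i) + \alpha h(x_i)\bigr)
    \;\propto\; e^{-\alpha}(1-\varepsilon_t) + e^{\alpha}\varepsilon_t,
    \qquad \varepsilon_t = \sum_{i:\, h(x_i) \neq y_i} D_t(i);
\frac{d}{d\alpha}\bigl[e^{-\alpha}(1-\varepsilon_t) + e^{\alpha}\varepsilon_t\bigr] = 0
    \;\Longrightarrow\;
    \alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t},

i.e., exactly the vote α_t used in the AdaBoost update above; minimizing over h (for α > 0) amounts to minimizing ε_t.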
[MIT, 2003 fall, Tommi Jaakkola, HW4, pr. 2.1-3]
Initialization: W̃_i^(1) = 1/m and f_0(x_i) = 0 for i = 1, . . . , m.
Loop: for t = 1 to T do:
Step 1: Find a classifier h(x; θ̂_t) performing better than chance w.r.t. the weighted training error:
    ε_t := Σ_{i: y_i ≠ h(x_i; θ̂_t)} W̃_i^(t) = (1/2) (1 − Σ_{i=1}^m W̃_i^(t) y_i h(x_i; θ̂_t)).
  Note: Minimizing ε_t is equivalent to finding the θ̂_t that minimizes ∂/∂α J_t(α, θ_t)|_{α=0} w.r.t. θ_t, where
    J_t(α, θ_t) = (1/m) Σ_{i=1}^m Loss(y_i f_{t−1}(x_i) + y_i α h(x_i; θ_t)).
Step 2: Set the vote α_t of the new component by minimizing the overall empirical loss:
    α_t = argmin_{α ≥ 0} J_t(α, θ̂_t).
Step 3: Recompute the normalized weights for the next iteration according to
    W̃_i^(t+1) = −c_t · dL(y_i f_t(x_i))   for i = 1, . . . , m,
  where y_i f_t(x_i) = y_i f_{t−1}(x_i) + y_i α_t h(x_i; θ̂_t), dL is the derivative of the loss, and c_t is chosen so that Σ_{i=1}^m W̃_i^(t+1) = 1.
Output: f_T
The Naive Bayes Classifier
• Assume that the attributes ⟨a_1, . . . , a_n⟩ that describe instances are conditionally independent w.r.t. the given classification:
    P(a_1, a_2, . . . , a_n | v_j) = Π_i P(a_i | v_j)
• Training procedure:
  Naive_Bayes_Learn(examples)
    for each value v_j of the output attribute
      P̂(v_j) ← estimate P(v_j)
      for each value a_i of each input attribute a
        P̂(a_i | v_j) ← estimate P(a_i | v_j)
• The decision rule of the Naive Bayes classifier is:
    v_MAP = argmax_{v_j ∈ V} P(v_j | a_1, a_2, . . . , a_n)
          = argmax_{v_j ∈ V} P(a_1, a_2, . . . , a_n | v_j) P(v_j) / P(a_1, a_2, . . . , a_n)
          = argmax_{v_j ∈ V} P(a_1, a_2, . . . , a_n | v_j) P(v_j)
          = argmax_{v_j ∈ V} P(v_j) Π_i P(a_i | v_j) := v_NB
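A small Python sketch of the training procedure and of the v_NB decision rule, assuming categorical attributes and, for simplicity, no smoothing of the estimates (the function names are illustrative):

from collections import Counter, defaultdict

def naive_bayes_learn(examples, labels):
    """examples: list of dicts attribute -> value; labels: parallel list of classes.
    Returns the estimated priors P(v_j) and the counts needed for P(a_i | v_j)."""
    m = len(labels)
    priors = {v: c / m for v, c in Counter(labels).items()}
    cond = defaultdict(lambda: defaultdict(Counter))   # cond[v][attr][value] = count
    for ex, v in zip(examples, labels):
        for attr, value in ex.items():
            cond[v][attr][value] += 1
    class_counts = Counter(labels)
    return priors, cond, class_counts

def naive_bayes_classify(priors, cond, class_counts, instance):
    """Return v_NB = argmax_v P(v) * prod_i P(a_i | v)."""
    best_v, best_score = None, -1.0
    for v, prior in priors.items():
        score = prior
        for attr, value in instance.items():
            score *= cond[v][attr][value] / class_counts[v]
        if score > best_score:
            best_v, best_score = v, score
    return best_v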
Logistic Regression
Given the dataset {(x^(1), y^(1)), (x^(2), y^(2)), . . . , (x^(n), y^(n))}, where each vector x^(i) has d features / attributes, and y^(i) ∈ {0, 1} for i = 1, . . . , n, its complete log-likelihood is:
  log-likelihood = ln Π_{i=1}^n P(x^(i), y^(i)) = ln Π_{i=1}^n ( P_{Y|X}(y^(i) | x^(i)) · P_X(x^(i)) )
                 = ln ( (Π_{i=1}^n P_{Y|X}(y^(i) | x^(i))) · (Π_{i=1}^n P_X(x^(i))) )
                 = ln Π_{i=1}^n P_{Y|X}(y^(i) | x^(i)) + ln Π_{i=1}^n P_X(x^(i)) := ℓ(w) + ℓ_x.
Note that ℓx does not depend on the parameter w.
P(Y = 1 | X = x) = σ(z) and P(Y = 0 | X = x) = 1 − σ(z), where
  σ(z) := 1 / (1 + e^{−z}) = e^z / (1 + e^z),
  z := w_0 + Σ_{i=1}^d w_i x_i := w · x,
with w := (w_0, w_1, . . . , w_d) ∈ R^{d+1}, assuming x_0 = 1.
[Plot: the sigmoid function σ(x), increasing from 0 to 1, for x ∈ [−6, 6].]
It can be shown that the conditional log-likelihood function ℓ(w) can be written as:
  ℓ(w) = Σ_{i=1}^n [ y^(i) ln σ(w · x^(i)) + (1 − y^(i)) ln(1 − σ(w · x^(i))) ]
  w_LogR := argmax_w ℓ(w) = argmin_w (−ℓ(w)).   (2)
Note that −ℓ(w) is a cross-entropy.
For a more general result than (2), see Stanford, 2015 fall, Andrew Ng, HW3, pr. 5.c.
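As an illustration of how w_LogR can be obtained in practice, here is a small NumPy sketch that maximizes ℓ(w) by batch gradient ascent; the learning rate and iteration count are arbitrary values chosen for the example:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_fit(X, y, lr=0.1, iters=1000):
    """X: (n, d) feature matrix, y: (n,) labels in {0, 1}.
    Maximizes the conditional log-likelihood l(w) by gradient ascent."""
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])      # prepend x_0 = 1
    w = np.zeros(d + 1)                       # w = (w_0, w_1, ..., w_d)
    for _ in range(iters):
        p = sigmoid(Xb @ w)                   # P(Y = 1 | x) for every example
        gradient = Xb.T @ (y - p)             # gradient of l(w) w.r.t. w
        w += lr * gradient / n
    return w

def logistic_regression_predict(w, X):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return (sigmoid(Xb @ w) >= 0.5).astype(int)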
See the analogy with the sigmoidal perceptron
(CMU, 2011 fall, Eric Xing, HW1, pr. 3.3)
[Diagram: a sigmoid unit with inputs x_0 = 1, x_1, . . . , x_n, weights w_0, w_1, . . . , w_n,
 net = Σ_{i=0}^n w_i x_i and output o = σ(net) = 1 / (1 + e^{−net}).]
The k-Nearest Neighbor Algorithm
Evelyn Fix, Joseph Hodges, 1951; Thomas Cover, Peter Hart, 1967
Training:
  Store all training examples.
Classification:
  Given a query/test instance x_q,
  first locate the k nearest training examples x_1, . . . , x_k, then estimate f̂(x_q):
  • take a vote among its k nearest neighbors:
      f̂(x_q) ← argmax_{v ∈ V} Σ_{i=1}^k 1{f(x_i) = v},
    where 1{·} is the well-known indicator function.
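A minimal Python sketch of this voting rule, using Euclidean distance to find the k nearest neighbors (an illustrative choice; the algorithm itself does not fix the distance):

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_q, k=3):
    """Majority vote among the k training examples closest to the query x_q."""
    distances = np.linalg.norm(X_train - x_q, axis=1)
    nearest = np.argsort(distances)[:k]               # indices of the k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]                 # argmax_v sum_i 1{f(x_i) = v}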
The Bottom-up Hierarchical Clustering Algorithm
Given: a set X = {x_1, . . . , x_n} of objects
       a function sim : P(X) × P(X) → R
for i = 1, n do c_i = {x_i} end
C = {c_1, . . . , c_n}
j = n + 1
while |C| > 1
  (c_{n1}, c_{n2}) = argmax_{(c_u, c_v) ∈ C×C} sim(c_u, c_v)
  c_j = c_{n1} ∪ c_{n2}
  C = (C \ {c_{n1}, c_{n2}}) ∪ {c_j}
  j = j + 1
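A short Python sketch of this loop; single-link similarity is used here as one possible choice for sim, which the pseudocode leaves abstract (the helper names are illustrative):

import numpy as np

def similarity(cluster_a, cluster_b, X):
    """Single-link similarity between two clusters (sets of indices into X):
    the negated distance between their closest members."""
    return max(-np.linalg.norm(X[i] - X[j]) for i in cluster_a for j in cluster_b)

def bottom_up_clustering(X):
    """Start from singleton clusters and repeatedly merge the most similar pair,
    as in the pseudocode above. Returns the merge history."""
    clusters = [frozenset([i]) for i in range(len(X))]     # c_i = {x_i}
    history = []
    while len(clusters) > 1:
        pairs = [(a, b) for ai, a in enumerate(clusters)
                         for b in clusters[ai + 1:]]
        a, b = max(pairs, key=lambda p: similarity(p[0], p[1], X))
        merged = a | b                                     # c_j = c_n1 ∪ c_n2
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        history.append(merged)
    return history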
The k -Means Algorithm
S. P. Lloyd, 1957
Given: a set X = {x_1, . . . , x_n} ⊆ R^m, a distance measure d on R^m,
a function for computing the mean µ : P(R^m) → R^m,
build k clusters so as to satisfy a certain ("stopping") criterion (e.g., maximization of group-average similarity).
Procedure:
Select (arbitrarily) k initial centers f_1, . . . , f_k in R^m;
while the stopping criterion is not satisfied
  for all clusters c_j do c_j = {x_i | ∀f_l : d(x_i, f_j) ≤ d(x_i, f_l)} end
  for all means f_j do f_j ← µ(c_j) end
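A minimal NumPy sketch of this procedure, using Euclidean distance, the component-wise mean for µ, and "the assignments no longer change" as the stopping criterion (these concrete choices are illustrative):

import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    """X: (n, m) data matrix. Returns (assignments, centers)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # arbitrary initial centers
    assignments = np.full(len(X), -1)
    for _ in range(max_iters):
        # Step 1: assign each x_i to its closest center (Euclidean distance d)
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignments = distances.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):                  # stopping criterion
            break
        assignments = new_assignments
        # Step 2: recompute each center as the mean of its (non-empty) cluster
        for j in range(k):
            members = X[assignments == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return assignments, centers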
K-Means algorithm revisited (I)
• Initialize the centroids µ_1, µ_2, . . . , µ_K arbitrarily and take C = {1, . . . , K}.
• As long as the value of the criterion J strictly decreases, repeat:
  Step 1:
  Compute γ as follows:
    γ_ij ← 1 if ||x_i − µ_j||² ≤ ||x_i − µ_j′||² for all j′ ∈ C, and 0 otherwise.
  In case of a tie, choose arbitrarily which of the eligible clusters x_i belongs to.
  Step 2:
  Recompute µ_j using the updated matrix γ:
  for every j ∈ C, if Σ_{i=1}^n γ_ij > 0, assign
    µ_j ← (Σ_{i=1}^n γ_ij x_i) / (Σ_{i=1}^n γ_ij);
  otherwise, keep the centroid µ_j unchanged.

The K-means clustering algorithm can be viewed [and reformulated] as an optimization algorithm, using the coordinate descent method.
The objective is to minimize an objective function which (indirectly) measures intra-cluster cohesion:
  J(L, µ) = Σ_{i=1}^n ||x_i − µ_{l_i}||².
The K-means algorithm initializes the cluster centroids µ with certain values, and then proceeds as follows:
Step 1:
Keeping µ fixed, find the assignment L of the instances to clusters which minimizes the function J(L, µ);
Step 2:
Keeping the assignment L fixed, find the value of µ which minimizes J(L, µ).
Stopping criterion: If [this is not the first iteration and] none of the assignments in the list L has changed with respect to the previous iteration, go to the next step (Termination); otherwise repeat from Step 1.
Termination: Return L and µ.
K-Means algorithm revisited (II)
[Diagram: the coordinate-descent view of K-means. Starting at t = 0 with centroids µ^(t) = (µ_1^(t), . . . , µ_K^(t)), alternate between
  C^(t+1) = argmin_C J(C, µ^(t))    (recompute the assignments γ_ij^(t+1))
and
  µ^(t+1) = argmin_µ J(C^(t+1), µ)  (recompute µ_j^(t+1) = (Σ_i γ_ij^(t+1) x_i) / (Σ_i γ_ij^(t+1))),
then increment t.]
The General EM Problem
Given
• observed data X = {x_1, . . . , x_m}, independently generated using the parameterized distributions/hypotheses h_1, . . . , h_m
• unobserved data Z = {z_1, . . . , z_m}
determine
  ĥ that (locally) maximizes P(X|h).
Approach
Start with h^(0), an arbitrarily/conveniently chosen value of h.
Repeatedly
1. Use the observed data X and the current hypothesis h^(t) to estimate [the probabilities associated to the values of] the unobserved variables Z, and further on compute their expectations, E[Z].
2. The expected values of the unobserved variables Z are used to calculate an improved hypothesis h^(t+1), based on maximizing the mean of a log-likelihood function: E[ln P(Y|h) | X, h^(t)], where Y = {y_1, . . . , y_m} is the complete (observed and unobserved) data, i.e. y_i = (x_i, z_i), for i = 1, . . . , m.
The EM algorithmic Schema
[Diagram: the EM iteration schema. Starting at t = 0 with an initial hypothesis h^(t), repeat:
  E-step: compute E[Z | X, h^(t)], using P(Z | X; h^(t));
  M-step: h^(t+1) = argmax_h E[ln P(X, Z | h)];
then increment t, monitoring ln P(X|h).]
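To make the E/M alternation concrete, here is a minimal Python sketch of EM for a two-component one-dimensional Gaussian mixture; this particular model and its initialization are an illustration only, since the slide describes the general schema:

import numpy as np

def em_gmm_1d(x, iters=50):
    """EM for a mixture of two 1-D Gaussians.
    Unobserved z_i = which component generated x_i; h = (p, mu1, mu2, var1, var2)."""
    # arbitrary/convenient initialization of h^(0)
    p, mu1, mu2 = 0.5, float(x.min()), float(x.max())
    var1 = var2 = float(x.var())
    for _ in range(iters):
        # E-step: E[z_i | x_i, h^(t)] = P(z_i = 1 | x_i; h^(t))  (the "responsibilities")
        d1 = p * np.exp(-(x - mu1) ** 2 / (2 * var1)) / np.sqrt(2 * np.pi * var1)
        d2 = (1 - p) * np.exp(-(x - mu2) ** 2 / (2 * var2)) / np.sqrt(2 * np.pi * var2)
        r = d1 / (d1 + d2)
        # M-step: h^(t+1) = argmax_h E[ln P(X, Z | h)]
        p = r.mean()
        mu1, mu2 = (r * x).sum() / r.sum(), ((1 - r) * x).sum() / (1 - r).sum()
        var1 = (r * (x - mu1) ** 2).sum() / r.sum()
        var2 = ((1 - r) * (x - mu2) ** 2).sum() / (1 - r).sum()
    return p, mu1, mu2, var1, var2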
ADMINISTRATIVIA
Who is Liviu Ciortuz?
• Diploma (maths and CS) from UAIC, Iași, Romania, 1985
  PhD in CS from Université de Lille, France, 1996
• programmer:
  Bacău, Romania (1985-1987)
• full-time researcher:
  Germany (DFKI, Saarbrücken, 1997-2001),
UK (Univ. of York and Univ. of Aberystwyth, 2001-2003), France (INRIA, Rennes, 2012-2013)
• assistant, lecturer, and then associate professor:
Univ. of Iasi, Romania (1990-1997, 2003-2012, 2013-today)
Teaching assistants for the ML undergraduate course 2022 (fall semester)
• Conf. dr. Anca Ignat ( . . . Image processing) https://profs.info.uaic.ro/∼ancai/ML/
• Andi Munteanu (PhD student)
• Ștefan Panțiru (MSc)
• Ștefan Matcovici (MSc)
• Ștefan Bălăucă (MSc student)
• Astrid Mocanu (MSc student)
• Ramona Albert (MSc student)
Grading standards for the ML undergraduate course 2022
Objective: learning throughout the whole semester!

Points:  T1 (test): 6p, minimum 1.5p
         T2 (test): 6p, minimum 1.5p
         S (seminar): 8p, minimum 2p
         P1 (partial exam): 10p, minimum 2.5p
         P2 (partial exam): 10p, minimum 2.5p

Seminar attendance: mandatory!
Penalty: 0.2p for each absence, from the second one onward!
Lecture attendance: recommended!
To pass: Grade >= 4.5
Grade = (10 + T1 + T2 + S + P1 + P2) / 5
<=> T1 + T2 + S + P1 + P2 >= 12.5
GENERAL RULES for the undergraduate Machine Learning course
The organization rules for the Machine Learning (ML) course, 1st semester, are specified in the course description sheet
http://profs.info.uaic.ro/∼ciortuz/fisa-disciplinei.pdf
• Minimal bibliography: see slide #8
• The week-by-week planning of the material (lecture + seminar):
http://profs.info.uaic.ro/∼ciortuz/what-you-should-know.pdf
• Lecture attendance: recommended!
• Rule 0: Seminar attendance: mandatory!
For each absence from the seminar, starting with the second absence, a penalty of 0.2 points is applied. (See the grading formula.)
These rules also apply to re-enrolled students.
• Weekly, on Tuesdays between 18:00 and 20:00, in room C309, an additional seminar will be held, intended for those students who are very interested in this field and who enjoy mathematical proofs. (See the "Advanced issues" sections in the document http://profs.info.uaic.ro/∼ciortuz/what-you-should-know.pdf.)
GENERAL RULES for the undergraduate Machine Learning course (cont.)
Rule 1:
For seminars, students may not move from one group to another, except among groups that have the same teaching assistant / professor in charge of the seminar.
Rule 2:
No point equivalences are granted to students who did not pass the course in previous years.
Rule 3:
The professor responsible for this course, Liviu Ciortuz, will NOT answer emails asking questions whose answer has already been given
– either in these slides,
– or on the Piazza site dedicated to this course:
  https://piazza.com/info.uaic.ro/fall2022/ml2022f/home,
– or during the lectures.

Important recommendation (1)
At every lecture and seminar, students should have with them the exercise collection Exerciții de învățare automată (by L. Ciortuz et al.); we recommend printing the chapters on Bayesian classification, memory-based (instance-based) learning, decision trees and clustering, and possibly the slides indicated on the previous slide.
Important recommendation (2)
Consult weekly the document what-you-should-know.pdf from the Resources page of the Piazza site dedicated to this course.
GENERAL RULES for the undergraduate Machine Learning course (cont.)
• Slides to print (in this order and, preferably, in COLOR):
http://profs.info.uaic.ro/∼ciortuz/SLIDES/foundations.pdf
https://profs.info.uaic.ro/∼ciortuz/ML.ex-book/SLIDES/ML.ex-book.SLIDES.ProbStat.pdf
https://profs.info.uaic.ro/∼ciortuz/ML.ex-book/SLIDES/ML.ex-book.SLIDES.DT.pdf
https://profs.info.uaic.ro/∼ciortuz/ML.ex-book/SLIDES/ML.ex-book.SLIDES.Bayes.pdf
https://profs.info.uaic.ro/∼ciortuz/ML.ex-book/SLIDES/ML.ex-book.SLIDES.IBL.pdf
https://profs.info.uaic.ro/∼ciortuz/ML.ex-book/SLIDES/ML.ex-book.SLIDES.Cluster.pdf
(Note: this set of slides may be updated during the semester!)
• To print (in BLACK-AND-WHITE):
http://profs.info.uaic.ro/∼ciortuz/SLIDES/ml0.pdf
http://profs.info.uaic.ro/∼ciortuz/SLIDES/ml3.pdf
http://profs.info.uaic.ro/∼ciortuz/SLIDES/ml6.pdf
http://profs.info.uaic.ro/∼ciortuz/SLIDES/ml8.pdf
http://profs.info.uaic.ro/∼ciortuz/SLIDES/cluster.pdf