Machine learning Similarity. Similarity in (machine) learning Kernels Semi-supervised learning and kernels A toy dataset
Similarity and kernels in machine learning
Zalán Bodó
Babeș–Bolyai University, Cluj-Napoca/Kolozsvár, Faculty of Mathematics and Computer Science
MACS 2016 Eger, Hungary
1/31
Overview of the presentation
Machine learning
Similarity. Similarity in (machine) learning
Kernels
  Kernel methods
  Examples of general purpose kernels
  Kernels and similarities
  A sample/simple method: prototype learning
  The representer theorem
  Dimensionality
  The “kernelization period”
Semi-supervised learning and kernels
  Assumptions in SSL
  Humans and SSL
  Data-dependent kernels
  Reweighting cluster kernels
A toy dataset
Machine learning
Arthur Samuel, 1959: “field of study that gives computers the ability to learn without being explicitly programmed”

“[. . . ] machine learning is now an independent and mature field that has moved beyond psychologically or neurally inspired algorithms towards providing foundations for a theory of learning that is rooted in statistics and functional analysis” [Jäkel et al., 2007]

Machine learning =
• supervised learning – classification, regression
• unsupervised learning – clustering, density estimation
• reinforcement learning
+ semi-supervised learning (classification)
Example: Content-based spam filtering
[Figure: example spam and ham messages]
Similarity. Similarity in (machine) learning
• similarity is fundamental to learning
• Shepard: in each individual there is an “internal metric of similarity between possible situations” [Shepard, 1987]
• generalization is based on similarity between situations/events/objects/. . .
• learning = generalize. . .
  (a) supervised scenarios: . . . from labeled to unlabeled data
  (b) unsupervised scenarios: . . . from familiar to novel data

“The fundamental challenge confronted by any system that is expected to generalize from familiar to unfamiliar stimuli is how to estimate similarity over stimuli in a principled and feasible manner.” [Shahbazi et al., 2016]
Similarity of. . .
• sets, e.g. Jaccard similarity:
  J(A, B) = |A ∩ B| / |A ∪ B|
• sequences, e.g. edit (Levenshtein) distance-based similarity:
  E(s, t) = 1 − edist(s, t) / max(|s|, |t|)
• vectors, e.g. cosine similarity (= normalized dot product):
  C(x, z) = x′z / (‖x‖ ‖z‖)
• . . .
• complex objects, e.g. two text segments extracted from a PDF file
• . . .
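The three concrete measures above can be sketched in a few lines of Python; these are straightforward textbook implementations, not code from the talk.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # convention for two empty sets
    return len(a & b) / len(a | b)

def edit_distance(s, t):
    """Levenshtein distance via dynamic programming (one row at a time)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]

def edit_similarity(s, t):
    """E(s, t) = 1 - edist(s, t) / max(|s|, |t|)."""
    m = max(len(s), len(t))
    return 1.0 if m == 0 else 1.0 - edit_distance(s, t) / m

def cosine(x, z):
    """Cosine similarity: x'z / (||x|| ||z||)."""
    dot = sum(a * b for a, b in zip(x, z))
    nx = sum(a * a for a in x) ** 0.5
    nz = sum(b * b for b in z) ** 0.5
    return dot / (nx * nz)
```

For instance, jaccard({1, 2, 3}, {2, 3, 4}) gives 2/4 = 0.5, and the classic pair "kitten"/"sitting" has edit distance 3.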
20.05.2016, MACS dinner
(www.mindmegette.hu/husveti-zserbo.recept)
Kernels
Figure: XOR problem – separate the o’s from the ×’s
• Marvin Minsky, Seymour Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, Mass., 1969. – a single artificial neuron/perceptron (= linear classifier) cannot solve the problem
• M. A. Aizerman, E. M. Braverman, L. I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, vol. 25, pp. 821–837, 1964. – use kernels!
Figure: using the polynomial kernel – the XOR data mapped to the space with axes x₁², x₂², x₁x₂
• map the points using the function φ(x) = [x₁², x₂², √2·x₁x₂]′
• this is equivalent to using k(x, z) = ⟨φ(x), φ(z)⟩ = (x′z)² (= polynomial kernel)
• polynomial kernel: links the features using logical AND (the size of the group of “linked” features is determined by the degree of the kernel)
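This equivalence is easy to check numerically; a minimal sketch for the two-dimensional case from the slide:

```python
import math

def phi(x):
    # explicit feature map of the degree-2 homogeneous polynomial kernel in R^2
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def k_poly2(x, z):
    # the kernel gives the same value without ever forming phi: k(x, z) = (x'z)^2
    return (x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))  # dot product in feature space
assert abs(explicit - k_poly2(x, z)) < 1e-9
```

Both sides evaluate to (1·3 + 2·(−1))² = 1 here; the kernel avoids computing the three-dimensional mapping explicitly.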
Kernel methods
• 1909: James Mercer – any continuous symmetric, positive semi-definite kernel function can be expressed as a dot product in a high-dimensional space [Mercer, 1909]
• 1964: Aizerman, Braverman and Rozonoer – first application [Aizerman et al., 1964]
• 1992: Boser, Guyon and Vapnik – famous application (SVM) [Boser et al., 1992]
• linear algorithms → non-linear algorithms
• feature mapping: φ : X → H (φ : R^d1 → R^d2)
• kernels: k(x, z) = ⟨φ(x), φ(z)⟩ = φ(x)′φ(z)
• covers all geometric constructions that can be formulated in terms of angles, lengths and distances

Kernel trick
Given an algorithm which is formulated in terms of a positive definite kernel k(·, ·), one can construct an alternative algorithm by replacing k(·, ·) with another positive definite kernel k̃(·, ·).
Examples of general purpose kernels
• linear:
  k(x, z) = x′z
• polynomial:
  k(x, z) = (a·x′z + b)^c
• Gaussian (RBF):
  k(x, z) = exp(−γ‖x − z‖²)
[Three figures illustrating, in order, the linear, polynomial and Gaussian kernels on the same dataset]
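All three kernels can be evaluated over whole data matrices at once; a minimal numpy sketch, with parameter names a, b, c, gamma matching the formulas above:

```python
import numpy as np

def linear_kernel(X, Z):
    # k(x, z) = x'z for every pair of rows of X and Z
    return X @ Z.T

def polynomial_kernel(X, Z, a=1.0, b=1.0, c=2):
    # k(x, z) = (a x'z + b)^c
    return (a * (X @ Z.T) + b) ** c

def gaussian_kernel(X, Z, gamma=0.5):
    # k(x, z) = exp(-gamma ||x - z||^2), using
    # ||x - z||^2 = ||x||^2 + ||z||^2 - 2 x'z for all pairs at once
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2 * (X @ Z.T)
    return np.exp(-gamma * sq)
```

Each function returns the full kernel (Gram) matrix, which is exactly what a kernel method consumes in place of the raw features.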
Kernels and similarities
  kernel                            similarity
  real-valued                       real-valued
  symmetric                         not necessarily symmetric
  positive definite                 not necessarily p.d.
  k(x, z) = ½[k(x, x) + k(z, z)     sim(x, z) = inverse of the
             − ‖φ(x) − φ(z)‖₂²]       distance between x and z
  k(x, z) = ⟨φ(x), φ(z)⟩ = the cosine similarity of the mapped vectors, provided they are normalized
⇒ every kernel can be read as a similarity measure
A sample/simple method: prototype learning
[Figure: the class centers c+ and c−, the weight vector w, and the vector x − c]
• class centers (centroids, prototypes):
  c+ = (1/N+) Σ_{xi ∈ X+} xi,   c− = (1/N−) Σ_{xi ∈ X−} xi
• define the following vectors: w = c+ − c− and c = (c+ + c−)/2
• then
  y(x) = sgn⟨x − c, w⟩ = sgn(⟨c+, x⟩ − ⟨c−, x⟩ + b)
  with b = (‖c−‖² − ‖c+‖²)/2
• using dot products between the xi’s:
  y(x) = sgn( (1/N+) Σ_{xi ∈ X+} ⟨x, xi⟩ − (1/N−) Σ_{xi ∈ X−} ⟨x, xi⟩ + b )
  where
  b = ½ [ (1/N−²) Σ_{xi, xj ∈ X−} ⟨xi, xj⟩ − (1/N+²) Σ_{xi, xj ∈ X+} ⟨xi, xj⟩ ]
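Since the classifier touches the xi only through dot products, the kernel trick applies directly: replace every ⟨·, ·⟩ with kernel evaluations. A sketch (function and argument names are illustrative, not from the talk):

```python
import numpy as np

def centroid_predict(K_test, K_train, y_train):
    """Kernelized prototype (centroid) classifier.

    K_train[i, j] = k(x_i, x_j) on the training set,
    K_test[t, i]  = k(x_t, x_i) between test and training points,
    y_train entries in {-1, +1}.
    """
    pos = y_train == 1
    neg = y_train == -1
    # (1/N+) sum of <x, x_i> over X+, and the analogous negative-class term
    s_pos = K_test[:, pos].mean(axis=1)
    s_neg = K_test[:, neg].mean(axis=1)
    # b = 1/2 [ (1/N-^2) sum_{X-} <x_i, x_j> - (1/N+^2) sum_{X+} <x_i, x_j> ]
    b = 0.5 * (K_train[np.ix_(neg, neg)].mean() - K_train[np.ix_(pos, pos)].mean())
    return np.sign(s_pos - s_neg + b)
```

With the linear kernel K = XX′ this reproduces the formula above; swapping in any other positive definite kernel yields a non-linear centroid classifier.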
The representer theorem
Theorem (Schölkopf and Smola, 2002)
Let H be the feature space associated to a positive semi-definite kernel k : X × X → R. Denote by Ω : [0, ∞) → R a strictly monotonically increasing function, and by c : (X × R²)^ℓ → R ∪ {∞} an arbitrary loss function. Then each minimizer of the regularized risk
  c((x₁, y₁, f(x₁)), . . . , (x_ℓ, y_ℓ, f(x_ℓ))) + Ω(‖f‖_H)
admits a representation of the form
  f(x) = Σ_{i=1}^{ℓ} αᵢ k(xᵢ, x)
Semiparametric representer theorem
f(x) = Σ_{i=1}^{ℓ} αᵢ k(xᵢ, x) + Σ_{p=1}^{M} β_p ψ_p(x)

Loss function + regularization for the centroid classifier:
  − Σ_{yᵢ=1} yᵢ f(xᵢ) − (N+/N−) Σ_{yᵢ=−1} yᵢ f(xᵢ) + (N+/2) ‖w‖₂²
where f(xᵢ) = w′xᵢ + b
Dimensionality
• curse or blessing?
• usually: φ : R^d1 → R^d2 with d2 > d1 or d2 ≫ d1
• why? – the higher the dimensionality, the easier it is to find a separating hyperplane
• Vapnik–Chervonenkis dimension of a classification algorithm = the size of the largest set of points that the algorithm can shatter
• (shattering a set of points = all possible labelings of the points can be realized by the method)
• the VC dimension of oriented hyperplanes in R^d is d + 1 (see proof in [Burges, 1998])
• φ need not increase the dimensionality
• it suffices to map the points to a better representational space
• in either case: the Johnson–Lindenstrauss lemma [Johnson and Lindenstrauss, 1984]
  • if the number of data points is relatively small (compared to the dimensionality)
  • a random projection to a logarithmically lower-dimensional space
  • relative distances will be approximately preserved
• corollary: kernels can be used for dimensionality reduction
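A small random-projection experiment illustrating the lemma; this is a sketch, and the 0.8–1.2 tolerance is an informal sanity check rather than the lemma’s actual bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 5000, 400            # few points in a high-dimensional space
X = rng.normal(size=(n, d))

# Gaussian random projection, scaled so squared lengths are preserved in expectation
R = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ R

# the distance between a fixed pair of points is approximately preserved
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
assert 0.8 < proj / orig < 1.2
```

The target dimension k needs to grow only logarithmically in the number of points n, which is what makes this useful when n is small relative to d.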
The “kernelization period”
199x – 200y
• 1992: SVM
• ?: kernel regularized least squares
• 1996: kernel PCA
• 1999: kernel Fisher discriminant analysis, transductive SVM
• 2001: kernel k-means clustering, kernel canonical correlation analysis, SVC (support vector clustering)
• 2005: first data-dependent non-parametric kernel, Laplacian regularized least squares, Laplacian SVM
• . . .
(Some DBLP stats)
[Figure: number of works per year, 1960–2016 – 11 463 works retrieved for keyword “kernel” on 21.03.2016 (among 3 294 261)]

Top 10 authors for the same keyword:
Bernhard Schölkopf (73), Johan A. K. Suykens (68), José Carlos Príncipe (63), Stefan Kratsch (60), Alessandro Moschitti (56), Alexander J. Smola (53), Hortensia Galeana-Sánchez (51), Arthur Gretton (47), Saket Saurabh (44), Edwin R. Hancock (44)
Semi-supervised learning and kernels
Semi-supervised learning (SSL)
• supervised learning:
D={(xxxi,yi)|xxxi ∈X ⊆Rd,yi∈ {−1,+1},i= 1, . . . , `};
find f :X → {−1,+1}which agrees withD
• semi-supervised learning:
D={(xxxi,yi)|i= 1, . . . , `} ∪ {xxxj|j= 1, . . . ,u}, `u,N=`+u;
• inductive: findf :X → {−1,+1}which agrees withD+ use the information ofDU
• transductive: findf :DU→ {−1,+1}by usingD=DL∪DU
Assumptions in SSL
1. smoothness assumption: if two points xᵢ and xⱼ in a high-density region are close, then so should be the corresponding outputs yᵢ and yⱼ
2. cluster assumption: if two points are in the same cluster, they are likely to be of the same class
3. manifold assumption (a.k.a. graph-based learning): the high-dimensional data lie roughly on a low-dimensional manifold
Humans and SSL
• humans do semi-supervised classification too
• 2007: experiment (Zhu and his colleagues), University of Wisconsin [Zhu et al., 2007]
• complex 3D shapes classified into two categories
• participants were told they see microscopic images of pollen particles from two fictitious flowers (Belianthus and Nortulaca)
• data given:
  • 2 labeled examples (each appearing 10 times in 20 trials)
  • test set of 21 evenly spaced unlabeled examples – to test the learned decision boundary
  • 3 × 230 unlabeled examples – the means are shifted away from the labeled examples (left-shifted or right-shifted)
  • test set of 21 evenly spaced unlabeled examples – to test whether the decision boundary has changed
• ⇒ the learned decision boundary is determined by both labeled and unlabeled data
Data-dependent kernels
• supervised learning + data-dependent kernels = semi-supervised learning
• conventional kernels do not depend on the data set: given data sets D₁ ≠ D₂ and points x, z ∈ D₁ ∩ D₂,
  k(x, z; D₁) = k(x, z; D₂)
• data-dependent kernels: given data sets D₁ ≠ D₂ and x, z ∈ D₁ ∩ D₂,
  k(x, z; D₁) and k(x, z; D₂) are not necessarily equal
Reweighting cluster kernels
• idea borrowed from the bagged cluster kernel [Weston et al., 2005]
• reweighting conventional kernels according to some clustering of the data [Bodó and Csató, 2010]
• kernel combinations: K₁ + K₂,  a·K,  K₁ ∘ K₂ (entrywise product)
• cluster kernel: K = K_rw ∘ K_b, where
  • K_b = base kernel (e.g. Gaussian, polynomial, etc.)
  • K_rw = reweighting kernel
  • K = resulting cluster kernel used in the learning algorithm
• reweighting kernels:
  k_rw(x, z) = exp(−‖u_x − u_z‖² / (2σ²))
  K_rw = U′U + α·11′,  α ∈ [0, 1)
  K_rw = β·U′U + 11′,  β ∈ (0, ∞)
  (U = matrix of cluster membership vectors (columns) of size K (no. of clusters) × N (no. of points); u_x is the column of U belonging to x)
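A sketch of one such construction, reading the product K_rw K_b as the entrywise (Hadamard) product listed among the kernel combinations above; the hard cluster memberships and the α-variant are illustrative assumptions, not the exact setup of [Bodó and Csató, 2010]:

```python
import numpy as np

def reweighting_cluster_kernel(K_base, U, alpha=0.5):
    """K = K_rw * K_base (entrywise), with K_rw = U'U + alpha * 11'.

    U is the K x N cluster-membership matrix (one column per point):
    (U'U)[i, j] counts the clusters shared by points i and j, and the
    alpha * 11' term keeps some weight between points in different clusters.
    """
    n = K_base.shape[0]
    K_rw = U.T @ U + alpha * np.ones((n, n))
    return K_rw * K_base  # entrywise product of the two kernel matrices

# toy example: 4 points, 2 hard clusters
U = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)
K_base = np.ones((4, 4))
K = reweighting_cluster_kernel(K_base, U, alpha=0.0)
# with alpha = 0, only within-cluster similarities survive
assert K[0, 1] == 1.0 and K[0, 2] == 0.0
```

Since both U′U + α·11′ and the base kernel are positive semi-definite, their entrywise product is a valid kernel again (Schur product theorem).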
A toy dataset
Figure: linked tori dataset – labeled examples: 3 + 3, unlabeled examples: 197 + 197
Figure: linear SVM – accuracy = 70.8122% (279/394)
Figure: Gaussian SVM (γ = 0.003) – accuracy = 69.5431% (274/394)
Figure: SVM with reweighting cluster kernel (RCK)
• clustering: fuzzy,p= 2, no. of clusters = 30
• 3rd kernel
• β= 1000
Accuracy = 76.1421% (300/394)
Thank you!
References
[Aizerman et al., 1964] M. A. Aizerman, E. M. Braverman, L. I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, vol. 25, pp. 821–837, 1964.
[Bodó and Csató, 2010] Z. Bodó, L. Csató. Hierarchical and Reweighting Cluster Kernels for Semi-Supervised Learning. International Journal of Computers, Communications & Control, vol. V, no. 4, pp. 469–476, 2010.
[Boser et al., 1992] B. E. Boser, I. M. Guyon, V. N. Vapnik. A Training Algorithm for Optimal Margin Classifiers. COLT, pp. 144–152, 1992.
[Burges, 1998] C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(4), pp. 121–167, 1998.
[Jäkel et al., 2007] F. Jäkel, B. Schölkopf, F. A. Wichmann. A Tutorial on Kernel Methods for Categorization. Journal of Mathematical Psychology, 51(6), pp. 343–358, 2007.
[Johnson and Lindenstrauss, 1984] W. B. Johnson, J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26, pp. 189–206, 1984.
[Mercer, 1909] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society, Series A, vol. 209, pp. 415–446, 1909.
[Minsky and Papert, 1969] M. Minsky, S. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, Mass., 1969.
[Schölkopf and Smola, 2002] B. Schölkopf, A. J. Smola. Learning with Kernels. MIT Press, Cambridge, Mass., 2002.
[Shahbazi et al., 2016] R. Shahbazi, R. Raizada, S. Edelman. Similarity, kernels, and the fundamental constraints on cognition. Journal of Mathematical Psychology, vol. 70, pp. 21–34, 2016.
[Shepard, 1987] R. N. Shepard. Toward a universal law of generalization for psychological science. Science, 237, pp. 1317–1323, 1987.
[Weston et al., 2005] J. Weston, C. Leslie, D. Zhou, A. Elisseeff, W. S. Noble. Semi-Supervised Protein Classification using Cluster Kernels. Bioinformatics, 21(15), pp. 3241–3247, 2005.
[Zhu et al., 2007] X. Zhu, T. Rogers, R. Qian, C. Kalish. Humans perform semi-supervised classification too. AAAI, pp. 864–870, 2007.