Machine learning Similarity. Similarity in (machine) learning Kernels Semi-supervised learning and kernels A toy dataset
Similarity and kernels in machine learning
Zalán Bodó
Babeș–Bolyai University, Cluj-Napoca/Kolozsvár, Faculty of Mathematics and Computer Science
MACS 2016 Eger, Hungary
1/31
Overview of the presentation
Machine learning
Similarity. Similarity in (machine) learning
Kernels
  Kernel methods
  Examples of general purpose kernels
  Kernels and similarities
  A sample/simple method: prototype learning
  The representer theorem
  Dimensionality
  The “kernelization period”
Semi-supervised learning and kernels
  Assumptions in SSL
  Humans and SSL
  Data-dependent kernels
  Reweighting cluster kernels
A toy dataset
Machine learning
Arthur Samuel, 1959: “field of study that gives computers the ability to learn without being explicitly programmed”

“[. . . ] machine learning is now an independent and mature field that has moved beyond psychologically or neurally inspired algorithms towards providing foundations for a theory of learning that is rooted in statistics and functional analysis” [Jäkel et al., 2007]

Machine learning =
• supervised learning – classification, regression
• unsupervised learning – clustering, density estimation
• reinforcement learning
+ semi-supervised learning (classification)
Example: Content-based spam filtering
[Figure: example spam and ham messages]
Similarity. Similarity in (machine) learning
• similarity is fundamental to learning
• Shepard: in each individual there is an “internal metric of similarity between possible situations” [Shepard, 1987]
• generalization is based on similarity between situations/events/objects/. . .
• learning = generalize. . .
  (a) supervised scenarios: . . . from labeled to unlabeled data
  (b) unsupervised scenarios: . . . from familiar to novel data

“The fundamental challenge confronted by any system that is expected to generalize from familiar to unfamiliar stimuli is how to estimate similarity over stimuli in a principled and feasible manner.” [Shahbazi et al., 2016]
Similarity of. . .
• sets, e.g. Jaccard similarity:
  J(A, B) = |A ∩ B| / |A ∪ B|
• sequences, e.g. edit (Levenshtein) distance-based similarity:
  E(s, t) = 1 − edist(s, t) / max(|s|, |t|)
• vectors, e.g. cosine similarity (= normalized dot product):
  C(x, z) = x′z / (‖x‖ ‖z‖)
• . . .
• complex objects, e.g. two text segments extracted from a PDF file
• . . .
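The three concrete measures above can be sketched in a few lines of Python; these are straightforward textbook implementations, not code from the talk.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # convention for two empty sets
    return len(a & b) / len(a | b)

def edit_distance(s, t):
    """Levenshtein distance via dynamic programming (one row at a time)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]

def edit_similarity(s, t):
    """E(s, t) = 1 - edist(s, t) / max(|s|, |t|)."""
    m = max(len(s), len(t))
    return 1.0 if m == 0 else 1.0 - edit_distance(s, t) / m

def cosine(x, z):
    """Cosine similarity: x'z / (||x|| ||z||)."""
    dot = sum(a * b for a, b in zip(x, z))
    nx = sum(a * a for a in x) ** 0.5
    nz = sum(b * b for b in z) ** 0.5
    return dot / (nx * nz)
```

For instance, jaccard({1, 2, 3}, {2, 3, 4}) gives 2/4 = 0.5, and the classic pair "kitten"/"sitting" has edit distance 3.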
20.05.2016, MACS dinner
(www.mindmegette.hu/husveti-zserbo.recept)
Kernels
Figure: XOR problem – separate the o’s from the ×’s
• Marvin Minsky, Seymour Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, Mass., 1969. – a single artificial neuron/perceptron (= linear classifier) cannot solve the problem
• M. A. Aizerman, E. M. Braverman, L. I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, vol. 25, pp. 821–837, 1964. – use kernels!
Figure: using the polynomial kernel – the XOR data mapped to the space with axes x₁², x₂², x₁x₂
• map the points using the function φ(x) = [x₁², x₂², √2·x₁x₂]′
• this is equivalent to using k(x, z) = ⟨φ(x), φ(z)⟩ = (x′z)² (= polynomial kernel)
• polynomial kernel: links the features using logical AND (the size of the group of “linked” features is determined by the degree of the kernel)
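This equivalence is easy to check numerically; a minimal sketch for the two-dimensional case from the slide:

```python
import math

def phi(x):
    # explicit feature map of the degree-2 homogeneous polynomial kernel in R^2
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def k_poly2(x, z):
    # the kernel gives the same value without ever forming phi: k(x, z) = (x'z)^2
    return (x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))  # dot product in feature space
assert abs(explicit - k_poly2(x, z)) < 1e-9
```

Both sides evaluate to (1·3 + 2·(−1))² = 1 here; the kernel avoids computing the three-dimensional mapping explicitly.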
Kernel methods
• 1909: James Mercer – any continuous symmetric, positive semi-definite kernel function can be expressed as a dot product in a high-dimensional space [Mercer, 1909]
• 1964: Aizerman, Braverman and Rozonoer – first application [Aizerman et al., 1964]
• 1992: Boser, Guyon and Vapnik – famous application (SVM) [Boser et al., 1992]
• linear algorithms → non-linear algorithms
• feature mapping: φ : X → H (φ : R^d1 → R^d2)
• kernels: k(x, z) = ⟨φ(x), φ(z)⟩ = φ(x)′φ(z)
• covers all geometric constructions that can be formulated in terms of angles, lengths and distances

Kernel trick
Given an algorithm which is formulated in terms of a positive definite kernel k(·, ·), one can construct an alternative algorithm by replacing k(·, ·) with another positive definite kernel k̃(·, ·).
Examples of general purpose kernels
• linear:
  k(x, z) = x′z
• polynomial:
  k(x, z) = (a·x′z + b)^c
• Gaussian (RBF):
  k(x, z) = exp(−γ‖x − z‖²)
[Three figures illustrating, in order, the linear, polynomial and Gaussian kernels on the same dataset]
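All three kernels can be evaluated over whole data matrices at once; a minimal numpy sketch, with parameter names a, b, c, gamma matching the formulas above:

```python
import numpy as np

def linear_kernel(X, Z):
    # k(x, z) = x'z for every pair of rows of X and Z
    return X @ Z.T

def polynomial_kernel(X, Z, a=1.0, b=1.0, c=2):
    # k(x, z) = (a x'z + b)^c
    return (a * (X @ Z.T) + b) ** c

def gaussian_kernel(X, Z, gamma=0.5):
    # k(x, z) = exp(-gamma ||x - z||^2), using
    # ||x - z||^2 = ||x||^2 + ||z||^2 - 2 x'z for all pairs at once
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2 * (X @ Z.T)
    return np.exp(-gamma * sq)
```

Each function returns the full kernel (Gram) matrix, which is exactly what a kernel method consumes in place of the raw features.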
Kernels and similarities
  kernel                            similarity
  real-valued                       real-valued
  symmetric                         not necessarily symmetric
  positive definite                 not necessarily p.d.
  k(x, z) = ½[k(x, x) + k(z, z)     sim(x, z) = inverse of the
             − ‖φ(x) − φ(z)‖₂²]       distance between x and z
  k(x, z) = ⟨φ(x), φ(z)⟩ = the cosine similarity of the mapped vectors, provided they are normalized
⇒ every kernel can be read as a similarity measure
A sample/simple method: prototype learning
[Figure: the class centers c+ and c−, the weight vector w, and the vector x − c]
• class centers (centroids, prototypes):
  c+ = (1/N+) Σ_{xi ∈ X+} xi,   c− = (1/N−) Σ_{xi ∈ X−} xi
• define the following vectors: w = c+ − c− and c = (c+ + c−)/2
• then
  y(x) = sgn⟨x − c, w⟩ = sgn(⟨c+, x⟩ − ⟨c−, x⟩ + b)
  with b = (‖c−‖² − ‖c+‖²)/2
• using dot products between the xi’s:
  y(x) = sgn( (1/N+) Σ_{xi ∈ X+} ⟨x, xi⟩ − (1/N−) Σ_{xi ∈ X−} ⟨x, xi⟩ + b )
  where
  b = ½ [ (1/N−²) Σ_{xi, xj ∈ X−} ⟨xi, xj⟩ − (1/N+²) Σ_{xi, xj ∈ X+} ⟨xi, xj⟩ ]
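Since the classifier touches the xi only through dot products, the kernel trick applies directly: replace every ⟨·, ·⟩ with kernel evaluations. A sketch (function and argument names are illustrative, not from the talk):

```python
import numpy as np

def centroid_predict(K_test, K_train, y_train):
    """Kernelized prototype (centroid) classifier.

    K_train[i, j] = k(x_i, x_j) on the training set,
    K_test[t, i]  = k(x_t, x_i) between test and training points,
    y_train entries in {-1, +1}.
    """
    pos = y_train == 1
    neg = y_train == -1
    # (1/N+) sum of <x, x_i> over X+, and the analogous negative-class term
    s_pos = K_test[:, pos].mean(axis=1)
    s_neg = K_test[:, neg].mean(axis=1)
    # b = 1/2 [ (1/N-^2) sum_{X-} <x_i, x_j> - (1/N+^2) sum_{X+} <x_i, x_j> ]
    b = 0.5 * (K_train[np.ix_(neg, neg)].mean() - K_train[np.ix_(pos, pos)].mean())
    return np.sign(s_pos - s_neg + b)
```

With the linear kernel K = XX′ this reproduces the formula above; swapping in any other positive definite kernel yields a non-linear centroid classifier.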
The representer theorem
Theorem (Schölkopf and Smola, 2002)
Let H be the feature space associated to a positive semi-definite kernel k : X × X → R. Denote by Ω : [0, ∞) → R a strictly monotonically increasing function, and by c : (X × R²)^ℓ → R ∪ {∞} an arbitrary loss function. Then each minimizer of the regularized risk
  c((x₁, y₁, f(x₁)), . . . , (x_ℓ, y_ℓ, f(x_ℓ))) + Ω(‖f‖_H)
admits a representation of the form
  f(x) = Σ_{i=1}^{ℓ} αᵢ k(xᵢ, x)
Semiparametric representer theorem
f(x) = Σ_{i=1}^{ℓ} αᵢ k(xᵢ, x) + Σ_{p=1}^{M} β_p ψ_p(x)

Loss function + regularization for the centroid classifier:
  − Σ_{yᵢ=1} yᵢ f(xᵢ) − (N+/N−) Σ_{yᵢ=−1} yᵢ f(xᵢ) + (N+/2) ‖w‖₂²
where f(xᵢ) = w′xᵢ + b
Dimensionality
• curse or blessing?
• usually: φ : R^d1 → R^d2 with d2 > d1 or d2 ≫ d1
• why? – the higher the dimensionality, the easier it is to find a separating hyperplane
• Vapnik–Chervonenkis dimension of a classification algorithm = the size of the largest set of points that the algorithm can shatter
• (shattering a set of points = all possible labelings of the points can be realized by the method)
• the VC dimension of oriented hyperplanes in R^d is d + 1 (see proof in [Burges, 1998])
• φ need not increase the dimensionality
• it suffices to map the points to a better representational space
• in either case: the Johnson–Lindenstrauss lemma [Johnson and Lindenstrauss, 1984]
  • if the number of data points is relatively small (compared to the dimensionality)
  • a random projection to a logarithmically lower-dimensional space
  • relative distances will be approximately preserved
• corollary: kernels can be used for dimensionality reduction
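A small random-projection experiment illustrating the lemma; this is a sketch, and the 0.8–1.2 tolerance is an informal sanity check rather than the lemma’s actual bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 5000, 400            # few points in a high-dimensional space
X = rng.normal(size=(n, d))

# Gaussian random projection, scaled so squared lengths are preserved in expectation
R = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ R

# the distance between a fixed pair of points is approximately preserved
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
assert 0.8 < proj / orig < 1.2
```

The target dimension k needs to grow only logarithmically in the number of points n, which is what makes this useful when n is small relative to d.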
The “kernelization period”
199x – 200y
• 1992: SVM
• ?: kernel regularized least squares
• 1996: kernel PCA
• 1999: kernel Fisher discriminant analysis, transductive SVM
• 2001: kernel k-means clustering, kernel canonical correlation analysis, SVC (support vector clustering)
• 2005: first data-dependent non-parametric kernel, Laplacian regularized least squares, Laplacian SVM
• . . .
(Some DBLP stats)
[Figure: number of works per year, 1960–2016 – 11 463 works retrieved for keyword “kernel” on 21.03.2016 (among 3 294 261)]

Top 10 authors for the same keyword:
Bernhard Schölkopf (73), Johan A. K. Suykens (68), José Carlos Príncipe (63), Stefan Kratsch (60), Alessandro Moschitti (56), Alexander J. Smola (53), Hortensia Galeana-Sánchez (51), Arthur Gretton (47), Saket Saurabh (44), Edwin R. Hancock (44)
Semi-supervised learning and kernels
Semi-supervised learning (SSL)
• supervised learning:
D={(xxxi,yi)|xxxi ∈X ⊆Rd,yi∈ {−1,+1},i= 1, . . . , `};
find f :X → {−1,+1}which agrees withD
• semi-supervised learning:
D={(xxxi,yi)|i= 1, . . . , `} ∪ {xxxj|j= 1, . . . ,u}, `u,N=`+u;
• inductive: findf :X → {−1,+1}which agrees withD+ use the information ofDU
• transductive: findf :DU→ {−1,+1}by usingD=DL∪DU
Assumptions in SSL
1. smoothness assumption: if two points xᵢ and xⱼ in a high-density region are close, then so should be the corresponding outputs yᵢ and yⱼ
2. cluster assumption: if two points are in the same cluster, they are likely to be of the same class
3. manifold assumption (a.k.a. graph-based learning): the high-dimensional data lie roughly on a low-dimensional manifold
Humans and SSL
• humans do semi-supervised classification too
• 2007: experiment (Zhu and his colleagues), University of Wisconsin [Zhu et al., 2007]
• complex 3D shapes classified into two categories
• participants were told they see microscopic images of pollen particles from two fictitious flowers (Belianthus and Nortulaca)
• data given:
  • 2 labeled examples (each appearing 10 times in 20 trials)
  • test set of 21 evenly spaced unlabeled examples – to test the learned decision boundary
  • 3 × 230 unlabeled examples – the means are shifted away from the labeled examples (left-shifted or right-shifted)
  • test set of 21 evenly spaced unlabeled examples – to test whether the decision boundary has changed
• ⇒ the learned decision boundary is determined by both labeled and unlabeled data
Data-dependent kernels
• supervised learning + data-dependent kernels = semi-supervised learning
• conventional kernels do not depend on the data set: given data sets D₁ ≠ D₂ and points x, z ∈ D₁ ∩ D₂,
  k(x, z; D₁) = k(x, z; D₂)
• data-dependent kernels: given data sets D₁ ≠ D₂ and x, z ∈ D₁ ∩ D₂,
  k(x, z; D₁) and k(x, z; D₂) are not necessarily equal
Reweighting cluster kernels
• idea borrowed from the bagged cluster kernel [Weston et al., 2005]
• reweighting conventional kernels according to some clustering of the data [Bodó and Csató, 2010]
• kernel combinations: K₁ + K₂,  a·K,  K₁ ∘ K₂ (entrywise product)
• cluster kernel: K = K_rw ∘ K_b, where
  • K_b = base kernel (e.g. Gaussian, polynomial, etc.)
  • K_rw = reweighting kernel
  • K = resulting cluster kernel used in the learning algorithm
• reweighting kernels:
  k_rw(x, z) = exp(−‖u_x − u_z‖² / (2σ²))
  K_rw = U′U + α·11′,  α ∈ [0, 1)
  K_rw = β·U′U + 11′,  β ∈ (0, ∞)
  (U = matrix of cluster membership vectors (columns) of size K (no. of clusters) × N (no. of points); u_x is the column of U belonging to x)
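A sketch of one such construction, reading the product K_rw K_b as the entrywise (Hadamard) product listed among the kernel combinations above; the hard cluster memberships and the α-variant are illustrative assumptions, not the exact setup of [Bodó and Csató, 2010]:

```python
import numpy as np

def reweighting_cluster_kernel(K_base, U, alpha=0.5):
    """K = K_rw * K_base (entrywise), with K_rw = U'U + alpha * 11'.

    U is the K x N cluster-membership matrix (one column per point):
    (U'U)[i, j] counts the clusters shared by points i and j, and the
    alpha * 11' term keeps some weight between points in different clusters.
    """
    n = K_base.shape[0]
    K_rw = U.T @ U + alpha * np.ones((n, n))
    return K_rw * K_base  # entrywise product of the two kernel matrices

# toy example: 4 points, 2 hard clusters
U = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)
K_base = np.ones((4, 4))
K = reweighting_cluster_kernel(K_base, U, alpha=0.0)
# with alpha = 0, only within-cluster similarities survive
assert K[0, 1] == 1.0 and K[0, 2] == 0.0
```

Since both U′U + α·11′ and the base kernel are positive semi-definite, their entrywise product is a valid kernel again (Schur product theorem).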
A toy dataset
Figure: linked tori dataset – labeled examples: 3 + 3, unlabeled examples: 197 + 197
Figure: linear SVM – accuracy = 70.8122% (279/394)
Figure: Gaussian SVM (γ = 0.003) – accuracy = 69.5431% (274/394)
Figure: SVM with reweighting cluster kernel (RCK)
• clustering: fuzzy,p= 2, no. of clusters = 30
• 3rd kernel
• β= 1000
Accuracy = 76.1421% (300/394)
Thank you!
References
[Aizerman et al., 1964] M. A. Aizerman, E. M. Braverman, L. I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, vol. 25, pp. 821–837, 1964.
[Bodó and Csató, 2010] Z. Bodó, L. Csató. Hierarchical and Reweighting Cluster Kernels for Semi-Supervised Learning. International Journal of Computers, Communications & Control, vol. V, no. 4, pp. 469–476, 2010.
[Boser et al., 1992] B. E. Boser, I. M. Guyon, V. N. Vapnik. A Training Algorithm for Optimal Margin Classifiers. COLT, pp. 144–152, 1992.
[Burges, 1998] C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(4), pp. 121–167, 1998.
[Jäkel et al., 2007] F. Jäkel, B. Schölkopf, F. A. Wichmann. A Tutorial on Kernel Methods for Categorization. Journal of Mathematical Psychology, 51(6), pp. 343–358, 2007.
[Johnson and Lindenstrauss, 1984] W. B. Johnson, J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26, pp. 189–206, 1984.
[Mercer, 1909] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society, Series A, vol. 209, pp. 415–446, 1909.
[Minsky and Papert, 1969] M. Minsky, S. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, Mass., 1969.
[Schölkopf and Smola, 2002] B. Schölkopf, A. J. Smola. Learning with Kernels. MIT Press, Cambridge, Mass., 2002.
[Shahbazi et al., 2016] R. Shahbazi, R. Raizada, S. Edelman. Similarity, kernels, and the fundamental constraints on cognition. Journal of Mathematical Psychology, vol. 70, pp. 21–34, 2016.
[Shepard, 1987] R. N. Shepard. Toward a universal law of generalization for psychological science. Science, 237, pp. 1317–1323, 1987.
[Weston et al., 2005] J. Weston, C. Leslie, D. Zhou, A. Elisseeff, W. S. Noble. Semi-Supervised Protein Classification using Cluster Kernels. Bioinformatics, 21(15), pp. 3241–3247, 2005.
[Zhu et al., 2007] X. Zhu, T. Rogers, R. Qian, C. Kalish. Humans perform semi-supervised classification too. AAAI, pp. 864–870, 2007.