
Similarity and kernels in machine learning

Zalán Bodó

Babeș–Bolyai University, Cluj-Napoca/Kolozsvár, Faculty of Mathematics and Computer Science

MACS 2016, Eger, Hungary


Overview of the presentation

Machine learning

Similarity. Similarity in (machine) learning

Kernels
  Kernel methods
  Examples of general purpose kernels
  Kernels and similarities
  A sample/simple method: prototype learning
  The representer theorem
  Dimensionality
  The "kernelization period"

Semi-supervised learning and kernels
  Assumptions in SSL
  Humans and SSL
  Data-dependent kernels
  Reweighting cluster kernels

A toy dataset


Machine learning

Arthur Samuel, 1959: “field of study that gives computers the ability to learn without being explicitly programmed”

“[. . . ] machine learning is now an independent and mature field that has moved beyond psychologically or neurally inspired algorithms towards providing foundations for a theory of learning that is rooted in statistics and functional analysis”

[Jäkel et al., 2007]

Machine learning =

• supervised learning – classification, regression

• unsupervised learning – clustering, density estimation

• reinforcement learning

+ semi-supervised learning (classification)


Example: Content-based spam filtering

(Figure: example spam and ham messages)


Similarity. Similarity in (machine) learning

• similarity is fundamental to learning

• Shepard: in each individual there is an "internal metric of similarity between possible situations" [Shepard, 1987]

• generalization is based on similarity between situations/events/objects/. . .

• learning = generalize. . .

  (a) supervised scenarios: . . . from labeled to unlabeled data
  (b) unsupervised scenarios: . . . from familiar to novel data

“The fundamental challenge confronted by any system that is expected to generalize from familiar to unfamiliar stimuli is how to estimate similarity over stimuli in a principled and feasible manner.”

[Shahbazi et al., 2016]


Similarity of. . .

• sets, e.g. Jaccard similarity

  J(A, B) = |A ∩ B| / |A ∪ B|

• sequences, e.g. edit (Levenshtein) distance-based similarity

  E(s, t) = 1 − edist(s, t) / max(|s|, |t|)

• vectors, e.g. cosine similarity (= normalized dot product; see the code sketch after this list)

  C(x, z) = x′z / (‖x‖ ‖z‖)

• . . .

• complex objects, e.g. two text segments extracted from a PDF file

• . . .
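A minimal code sketch of the first three measures (plain Python, standard library only; the example values are made up for illustration):

def jaccard(A, B):
    # J(A, B) = |A intersection B| / |A union B|
    return len(A & B) / len(A | B)

def edit_distance(s, t):
    # Levenshtein distance via the classic single-row dynamic program
    d = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        prev, d[0] = d[0], i
        for j, ct in enumerate(t, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (cs != ct))
    return d[len(t)]

def edit_similarity(s, t):
    # E(s, t) = 1 - edist(s, t) / max(|s|, |t|)
    return 1.0 - edit_distance(s, t) / max(len(s), len(t))

def cosine(x, z):
    # C(x, z) = x'z / (||x|| ||z||)
    dot = sum(a * b for a, b in zip(x, z))
    nx = sum(a * a for a in x) ** 0.5
    nz = sum(b * b for b in z) ** 0.5
    return dot / (nx * nz)

print(jaccard({1, 2, 3}, {2, 3, 4}))         # 0.5
print(edit_similarity("kitten", "sitting"))  # 1 - 3/7, approx. 0.571
print(cosine([1.0, 0.0], [1.0, 1.0]))        # approx. 0.707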


(Photo: MACS dinner, 20.05.2016)


(Photo: www.mindmegette.hu/husveti-zserbo.recept)


Kernels

Figure: XOR problem: separate the o's from the x's

• Marvin Minsky, Seymour Papert. Perceptrons: an introduction to computational geometry. MIT Press, Cambridge, Mass., 1969. – a single artificial neuron/perceptron (= linear classifier) cannot solve the problem

• M. A. Aizerman, E. M. Braverman, L. I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, vol. 25, pp. 821–837, 1964. – use kernels!


Figure: Using the polynomial kernel – the XOR points plotted in the coordinates x1², x2², x1x2

• map the points using the function φ(x) = [x1², x2², √2·x1x2]′

• this is equivalent to using k(x, z) = ⟨φ(x), φ(z)⟩ = (x′z)² (= the polynomial kernel); a small numeric check of this equivalence is given below

• polynomial kernel: links the features using logical AND (the size of the group of "linked" features is determined by the order of the kernel)
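A small numeric check of the claimed equivalence, a sketch assuming 2-dimensional inputs and the homogeneous degree-2 polynomial kernel (the test points are arbitrary):

import numpy as np

def phi(x):
    # explicit feature map: [x1^2, x2^2, sqrt(2) * x1 * x2]
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def poly2(x, z):
    # homogeneous polynomial kernel of degree 2: k(x, z) = (x'z)^2
    return np.dot(x, z) ** 2

x, z = np.array([0.3, 0.8]), np.array([0.9, 0.1])
print(np.dot(phi(x), phi(z)))  # <phi(x), phi(z)> = 0.1225
print(poly2(x, z))             # (x'z)^2          = 0.1225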


Kernel methods

• 1909: James Mercer – any continuous symmetric, positive semi-definite kernel function can be expressed as a dot product in a high-dimensional space [Mercer, 1909]

• 1964: Aizerman, Braverman and Rozonoer – first application [Aizerman et al., 1964]

• 1992: Boser, Guyon and Vapnik – famous application (SVM) [Boser et al., 1992]

• linear algorithms → non-linear algorithms

• feature mapping: φ : X → H (φ : R^d1 → R^d2)

• kernels: k(x, z) = ⟨φ(x), φ(z)⟩ = φ(x)′φ(z)

• covers all geometric constructions that can be formulated in terms of angles, lengths and distances

Kernel trick

Given an algorithm which is formulated in terms of a positive definite kernel k(·,·), one can construct an alternative algorithm by replacing k(·,·) by another positive definite kernel k̃(·,·).


Examples of general purpose kernels

• linear:
  k(x, z) = x′z

• polynomial:
  k(x, z) = (a·x′z + b)^c

• Gaussian (RBF):
  k(x, z) = exp(−γ‖x − z‖²)

(a short Gram-matrix sketch of the three kernels follows below)
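A minimal NumPy sketch of the three kernels as Gram-matrix computations; the parameters a, b, c and γ are placeholders to be tuned for the task at hand:

import numpy as np

def linear_kernel(X, Z):
    # k(x, z) = x'z for all pairs of rows of X and Z
    return X @ Z.T

def polynomial_kernel(X, Z, a=1.0, b=1.0, c=2):
    # k(x, z) = (a * x'z + b)^c
    return (a * (X @ Z.T) + b) ** c

def gaussian_kernel(X, Z, gamma=0.5):
    # k(x, z) = exp(-gamma * ||x - z||^2), via ||x - z||^2 = ||x||^2 + ||z||^2 - 2 x'z
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-gamma * sq)

X = np.random.default_rng(0).normal(size=(5, 3))
print(linear_kernel(X, X).shape, gaussian_kernel(X, X).shape)  # (5, 5) Gram matrices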


Kernels and similarities

  kernel                                              |  similarity
  real-valued                                         |  real-valued
  symmetric                                           |  not necessarily symmetric
  positive definite                                   |  not necessarily positive definite
  k(x, z) = ½ [k(x, x) + k(z, z) − ‖φ(x) − φ(z)‖²₂]   |  sim(x, z) = inverse of the distance between x and z
  k(x, z) = ⟨φ(x), φ(z)⟩                               |  the cosine similarity of the mapped vectors, provided they are normalized

(the first identity is checked numerically below)
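The first identity in the table just rewrites the squared feature-space distance ‖φ(x) − φ(z)‖² = k(x, x) + k(z, z) − 2 k(x, z); a short sketch checking it with the degree-2 polynomial kernel and its explicit feature map from the XOR example (the points are arbitrary):

import numpy as np

def k(x, z):
    # degree-2 homogeneous polynomial kernel
    return np.dot(x, z) ** 2

def phi(x):
    # its explicit feature map for 2D inputs
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x, z = np.array([0.2, 0.7]), np.array([0.5, 0.4])
dist2 = np.sum((phi(x) - phi(z)) ** 2)       # ||phi(x) - phi(z)||^2
print(k(x, z))                               # kernel value
print(0.5 * (k(x, x) + k(z, z) - dist2))     # the same value via the identity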


A sample/simple method: prototype learning

(Figure: the two class centroids c+ and c−, the vector w, and a point x − c)

• class centers (centroids, prototypes):

  c+ = (1/N+) Σ_{xi ∈ X+} xi
  c− = (1/N−) Σ_{xi ∈ X−} xi


• define the following vectors: w = c+ − c− and c = (c+ + c−)/2

• then

  y(x) = sgn ⟨x − c, w⟩
       = sgn(⟨c+, x⟩ − ⟨c−, x⟩ + b),   with b = (‖c−‖² − ‖c+‖²)/2

• using dot products between the xi's:

  y(x) = sgn( (1/N+) Σ_{xi ∈ X+} ⟨x, xi⟩ − (1/N−) Σ_{xi ∈ X−} ⟨x, xi⟩ + b )

  where

  b = ½ ( (1/N−²) Σ_{xi, xj ∈ X−} ⟨xi, xj⟩ − (1/N+²) Σ_{xi, xj ∈ X+} ⟨xi, xj⟩ )

(a kernelized sketch of this classifier follows below)
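A minimal sketch of the centroid classifier written purely in terms of kernel evaluations, so any positive definite kernel can be substituted for the dot products above (the Gaussian kernel and the synthetic data below are illustrative choices, not from the slides):

import numpy as np

def gaussian_kernel(X, Z, gamma=0.5):
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-gamma * sq)

def centroid_predict(X_train, y_train, X_test, kernel=gaussian_kernel):
    pos, neg = X_train[y_train == 1], X_train[y_train == -1]
    # b = 1/2 * (mean k over the negative class - mean k over the positive class)
    b = 0.5 * (kernel(neg, neg).mean() - kernel(pos, pos).mean())
    # y(x) = sgn( mean_i k(x, xi in X+) - mean_i k(x, xi in X-) + b )
    return np.sign(kernel(X_test, pos).mean(1) - kernel(X_test, neg).mean(1) + b)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
print(centroid_predict(X, y, np.array([[-2.0, -2.0], [2.0, 2.0]])))  # [-1.  1.]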


The representer theorem

Theorem (Schölkopf and Smola, 2002)

Let H be the feature space associated to a positive semi-definite kernel k : X × X → R. Denote by Ω : [0, ∞) → R a strictly monotonically increasing function, and by c : (X × R²)^ℓ → R ∪ {∞} an arbitrary loss function. Then each minimizer of the regularized risk

  c((x1, y1, f(x1)), . . . , (xℓ, yℓ, f(xℓ))) + Ω(‖f‖_H)

admits a representation of the form

  f(x) = Σ_{i=1}^{ℓ} αi k(xi, x)


Semiparametric representer theorem

  f(x) = Σ_{i=1}^{ℓ} αi k(xi, x) + Σ_{p=1}^{M} βp ψp(x)

Loss function + regularization for the centroid classifier:

  −Σ_{yi=+1} yi f(xi) − (N+/N−) Σ_{yi=−1} yi f(xi) + (N+/2) ‖w‖²₂,   where f(xi) = w′xi + b


Dimensionality

• curse or blessing?

• usually: φ : R^d1 → R^d2 with d2 > d1 or d2 ≫ d1

  why? – the higher the dimensionality, the easier it is to find a separating hyperplane

Vapnik–Chervonenkis dimension of a classification algorithm = the size of the largest set of points that the algorithm can shatter

(shattering of a set of points = all possible labelings of the points can be realized by the method)

VC dimension of oriented hyperplanes in R^d is d + 1 (see proof in [Burges, 1998])


• φ need not be a dimensionality increaser/raiser

• it suffices to map the points to a better representational space

• in either case: Johnson–Lindenstrauss lemma [Johnson and Lindenstrauss, 1984] – if the number of data points is relatively small (compared to the dimensionality), a random projection to a space of logarithmically lower dimensionality approximately preserves the relative distances (a small random-projection sketch follows below)

• corollary: kernels can be used for dimensionality reduction
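A minimal sketch of the Johnson–Lindenstrauss idea: project N points from dimensionality d down to roughly O(log N / ε²) dimensions with a random Gaussian matrix and check that pairwise distances are approximately preserved (the constant 8 and the parameter values are illustrative, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
N, d, eps = 100, 5000, 0.2
k = int(8 * np.log(N) / eps**2)           # target dimensionality ~ O(log N / eps^2)

X = rng.normal(size=(N, d))               # N points in R^d
R = rng.normal(size=(d, k)) / np.sqrt(k)  # random projection matrix
Y = X @ R                                 # projected points, N x k

for i, j in [(0, 1), (2, 3), (4, 5)]:     # a few pairwise distances, before vs. after
    before, after = np.linalg.norm(X[i] - X[j]), np.linalg.norm(Y[i] - Y[j])
    print(f"{before:.1f} -> {after:.1f} (ratio {after / before:.3f})")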


The “kernelization period”

199x – 200y

• 1992: SVM

• ?: kernel regularized least squares

• 1996: kernel PCA

• 1999: kernel Fisher discriminant analysis, transductive SVM

• 2001: kernel k-means clustering, kernel canonical correlation analysis, SVC (support vector clustering)

• 2005: first data-dependent non-parametric kernel, Laplacian regularized least squares, Laplacian SVM

• . . .


(Some DBLP stats)

Figure: 11 463 works retrieved for the keyword "kernel" on 21.03.2016 (among 3 294 261), by year of publication

Figure: Top 10 authors for the same keyword: Bernhard Schölkopf (73), Johan A. K. Suykens (68), José Carlos Príncipe (63), Stefan Kratsch (60), Alessandro Moschitti (56), Alexander J. Smola (53), Hortensia Galeana-Sánchez (51), Arthur Gretton (47), Saket Saurabh (44), Edwin R. Hancock (44)


Semi-supervised learning and kernels

Semi-supervised learning (SSL)

• supervised learning:
  D = {(xi, yi) | xi ∈ X ⊆ R^d, yi ∈ {−1, +1}, i = 1, . . . , ℓ};
  find f : X → {−1, +1} which agrees with D

• semi-supervised learning:
  D = DL ∪ DU = {(xi, yi) | i = 1, . . . , ℓ} ∪ {xj | j = 1, . . . , u},   ℓ ≪ u,   N = ℓ + u;
  inductive: find f : X → {−1, +1} which agrees with D + use the information of DU
  transductive: find f : DU → {−1, +1} by using D = DL ∪ DU


Assumptions in SSL

1. smoothness assumption: if two points xi and xj in a high-density region are close, then so should be the corresponding outputs yi and yj.

2. cluster assumption: if two points are in the same cluster, they are likely to be of the same class.

3. manifold assumption (a.k.a. graph-based learning): the high dimensional data lie roughly on a low dimensional manifold.


Humans and SSL

• humans do semi-supervised classification too

• 2007: experiment (Zhu and his colleagues), University of Wisconsin [Zhu et al., 2007]

• complex 3D shapes classified into two categories

• participants were told they see microscopic images of pollen particles from two fictitious flowers (Belianthus and Nortulaca)

• data given:

2 labeled examples (each appearing 10 times in 20 trials)

test set of 21 evenly spaced unlabeled examples – to test the learned decision boundary

3×230 unlabeled examples – the means are shifted away from the labeled examples (left-shifted or right-shifted)

test set of 21 evenly spaced unlabeled examples – to test whether the decision boundary has changed

• ⇒ the learned decision boundary is determined by both labeled and unlabeled data


Data-dependent kernels

• supervised learning + data-dependent kernels = semi-supervised learning

• conventional kernels: given data sets D1 ≠ D2 and x, z ∈ D1 ∩ D2,
  k(x, z) = k(x, z)
  (the kernel value does not depend on the data set)

• data-dependent kernels: given data sets D1 ≠ D2 and x, z ∈ D1 ∩ D2,
  k(x, z; D1) is not necessarily equal to k(x, z; D2)


Reweighting cluster kernels

• idea borrowed from bagged cluster kernel [Weston et al., 2005]

• reweighting conventional kernels according to some clustering of the data [Bodó and Csató, 2010]

• kernel combinations: K1 + K2, a·K, K1K2

• cluster kernel: K = Krw Kb, where

  Kb = base kernel (e.g. Gaussian, polynomial, etc.)
  Krw = reweighting kernel
  K = resulting cluster kernel used in the learning algorithm

  krw(x, z) = exp(−‖U·,x − U·,z‖²₂)
  Krw = U′U + α·11′,   α ∈ [0, 1)
  Krw = β·U′U + 11′,   β ∈ (0, ∞)

  (U = matrix of cluster membership vectors (columns) of size K × N, K = no. of clusters, N = no. of points; U·,x is read here as the column of U belonging to x, i.e. its cluster membership vector)

(a small sketch of building such a reweighted Gram matrix follows below)
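A minimal sketch of building such a reweighted Gram matrix, assuming the product Krw Kb is taken elementwise (which keeps the result positive semi-definite) and using hard k-means memberships for U; the cluster count, α and the base kernel are illustrative choices, not the settings of [Bodó and Csató, 2010]:

import numpy as np

def gaussian_gram(X, gamma=0.5):
    sq = (X**2).sum(1)[:, None] + (X**2).sum(1)[None, :] - 2 * X @ X.T
    return np.exp(-gamma * sq)

def kmeans_memberships(X, K, iters=20, seed=0):
    # hard k-means: U is K x N with a single 1 per column marking the point's cluster
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(K):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(0)
    U = np.zeros((K, len(X)))
    U[labels, np.arange(len(X))] = 1.0
    return U

X = np.random.default_rng(1).normal(size=(60, 2))
U = kmeans_memberships(X, K=5)
alpha = 0.5
K_rw = U.T @ U + alpha * np.ones((len(X), len(X)))  # Krw = U'U + alpha * 11'
K_b = gaussian_gram(X)                              # base kernel
K = K_rw * K_b                                      # elementwise reweighting
print(K.shape)                                      # (60, 60) cluster-kernel Gram matrix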


A toy dataset

Figure: Linked tori dataset – labeled examples: 3 + 3, unlabeled examples: 197 + 197.


Figure: Linear SVM: Accuracy = 70.8122% (279/394)

Figure: Gaussian SVM (γ = 0.003): Accuracy = 69.5431% (274/394)


Figure: SVM with reweighting cluster kernel (RCK)

• clustering: fuzzy, p = 2, no. of clusters = 30

• 3rd kernel

• β = 1000

Accuracy = 76.1421% (300/394)


Thank you!


References

[Aizerman et al., 1964] M. A. Aizerman, E. M. Braverman, L. I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, vol. 25, pp. 821–837, 1964.

[Bodó and Csató, 2010] Z. Bodó, L. Csató. Hierarchical and Reweighting Cluster Kernels for Semi-Supervised Learning. Int. J. of Computers, Communications & Control, Vol. V, No. 4, pp. 469–476, 2010.

[Boser et al., 1992] B. E. Boser, I. M. Guyon, V. N. Vapnik. A Training Algorithm for Optimal Margin Classifiers. COLT, pp. 144–152, 1992.

[Burges, 1998] C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(4), pp. 121–167, 1998.

[Jäkel et al., 2007] F. Jäkel, B. Schölkopf, F. A. Wichmann. A Tutorial on Kernel Methods for Categorization. Journal of Mathematical Psychology, 51(6), pp. 343–358, 2007.

[Johnson and Lindenstrauss, 1984] W. B. Johnson, J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26, pp. 189–206, 1984.

[Mercer, 1909] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society, Series A, vol. 209, pp. 415–446, 1909.

[Minsky and Papert, 1969] M. Minsky, S. Papert. Perceptrons: an introduction to computational geometry. MIT Press, Cambridge, Mass., 1969.

[Schölkopf and Smola, 2002] B. Schölkopf, A. J. Smola. Learning with Kernels. MIT Press, Cambridge, Mass., 2002.

[Shahbazi et al., 2016] R. Shahbazi, R. Raizada, S. Edelman. Similarity, kernels, and the fundamental constraints on cognition. Journal of Mathematical Psychology, vol. 70, pp. 21–34, 2016.

[Shepard, 1987] R. N. Shepard. Toward a universal law of generalization for psychological science. Science, 237, pp. 1317–1323, 1987.

[Weston et al., 2005] J. Weston, C. Leslie, D. Zhou, A. Elisseeff, W. S. Noble. Semi-Supervised Protein Classification using Cluster Kernels. Bioinformatics, 21(15), pp. 3241–3247, 2005.

[Zhu et al., 2007] X. Zhu, T. Rogers, R. Qian, C. Kalish. Humans perform semi-supervised classification too. AAAI, pp. 864–870, 2007.
