Auto-encoders: reconstruction versus compression

Yann Ollivier

Abstract

We discuss the similarities and differences between training an auto-encoder to minimize the reconstruction error, and training the same auto-encoder to compress the data via a generative model. Minimizing a codelength for the data using an auto-encoder is equivalent to minimizing the reconstruction error plus some correcting terms which have an interpretation as either a denoising or a contractive property of the decoding function. These terms are related but not identical to those used in denoising or contractive auto-encoders [VLL+10, RVM+11]. In particular, the codelength viewpoint fully determines an optimal noise level for the denoising criterion.

Given a dataset, auto-encoders (for instance, [PH87, Section 8.1] or [HS06]) aim at building a hopefully simpler representation of the data via a hidden, usually lower-dimensional feature space. This is done by looking for a pair of maps 𝑓: 𝑋 → 𝑌 and 𝑔: 𝑌 → 𝑋, from data space 𝑋 to feature space 𝑌 and back, such that the reconstruction error between 𝑥 and 𝑔(𝑓(𝑥)) is small.

Identifying relevant features hopefully makes the data more understandable, more compact, or simpler to describe.

Here we take this interpretation literally, by considering auto-encoders in the framework of minimum description length (MDL), i.e., data compression via a probabilistic generative model, using the general correspondence between compression and “simple” probability distributions on the data [Grü07]. The objective is then to minimize the codelength (log-likelihood) of the data using the features found by the auto-encoder.¹

We use the “variational” approach to answer the following question: Do auto-encoders trained to minimize reconstruction error actually minimize the length of a compressed encoding of the data, at least approximately?

We will see that by adding an information-theoretic term to the reconstruction error, auto-encoders can be trained to minimize a tight upper bound on the codelength (compressed size) of the data.

In Section 3 we introduce a first, simple bound on codelength based on reconstruction error: a dataset 𝒟 ⊂ 𝑋 can be encoded by encoding a (hopefully simpler) feature value 𝑓(𝑥) for each 𝑥 ∈ 𝒟, and applying the decoding function 𝑔. However, this result only applies to discrete features, and the resulting bound is far from tight. Still, this already illustrates how minimizing codelength favors using fewer features.

¹ The goal here is not to build an actual compressed code of the data, but to find a good pair of feature and generative functions that would yield a short codelength [Grü07]. If the codelength is known as a function of the parameters of the auto-encoder, it can be used as the training criterion.

In Section 4 we refine the bound from Section 3 and make it valid for general feature spaces (Proposition 2). This bound is tight in the sense that it gets arbitrarily close to the actual codelength when the feature and generative functions are inverse to each other in a probabilistic sense. This is an instance of the variational bound [Bis06, Chapter 10]. Related results appear in [HOT06] and in [KW13, Section 2.2].

The result in Section 4 also illustrates how, to optimize codelength, an auto-encoder approach helps compared to directly looking for a generative model. Trying to optimize the codelength directly is often difficult (Section 2). So even though the codelength 𝐿gen depends only on the generative function 𝑔 and not on a feature function, we build an upper bound on 𝐿gen depending on both; optimizing over 𝑔 aims at lowering 𝐿gen by lowering this upper bound, while optimizing over 𝑓 aims at making the upper bound more precise.

In Sections 5 and 6 we provide a connection with denoising auto-encoders [VLL+10]. When the feature space is continuous, it is impossible to encode a feature value 𝑓(𝑥) exactly for each 𝑥 in the dataset, as this yields an infinite codelength. Thus, it is necessary to encode features with finite precision and to use a decoding function that is not too sensitive to approximate features. Quantifying this effect leads to an explicit upper bound on codelength (Corollary 3). The denoising criterion is from features to output, rather than from input to features as in [VLL+10]. Moreover the MDL approach allows us to find the optimal noise level for the denoising criterion, i.e., the one which yields the best codelength (Theorem 5). In particular, the noise level should be set differently for each data sample.

In Section 7 we establish a connection with contractive auto-encoders [RVM+11]: under various approximations, minimizing codelength penalizes large derivatives of the output (Proposition 6). The penalty takes a form somewhat different from [RVM+11], though: contractivity occurs from features to output rather than from input to features, and the penalty term is not the Frobenius norm of the Jacobian matrix but the sum of the logs of the norms of its rows. An advantage of the MDL approach is that the penalty constant is determined from theory.

In Section 8 we show that optimal compression requires including the variance of each data component as additional parameters, especially when various data components have different variances or noise levels. Compression focuses on relative rather than absolute error, minimizing the logarithms of the errors.

The variational bound has already been applied to neural networks in non-auto-encoding situations, to evaluate the cost of encoding the network parameters [Gra11, HvC93]. In that situation, one tries to find a map 𝑔: 𝑌 → 𝑋 that minimizes the codelength of the output data 𝑥 if the features 𝑦 are given; this decomposes as the output error plus a term describing the cost of encoding the parameters of 𝑔. In an auto-encoding setting 𝑓: 𝑋 → 𝑌, 𝑔: 𝑌 → 𝑋, it is meaningless to encode the dataset given the very same inputs: so the dataset is encoded by encoding the features 𝑦 together with 𝑔. In this text we focus on the cost of encoding 𝑦, and the consequences of minimizing the resulting codelength. Encoding of the parameters of 𝑔 can be done following [Gra11] and we do not reproduce it here. Still, the cost of 𝑔 must be included for actual data compression, and especially when comparing generative models with different dimensions.

1 Notation: Auto-encoders, reconstruction error. Let 𝑋 be an input space and 𝑌 be a feature space, usually of smaller dimension. 𝑌 may be discrete, such as 𝑌 = {0,1}^𝑑 (each feature present/absent) or 𝑌 = {1, . . . , 𝑑} (classification), or continuous.

An auto-encoder can be seen as a pair of functions 𝑓 and 𝑔, the feature function and the generative function. The feature function goes from 𝑋 to 𝑌 (deterministic features) or to Prob(𝑌) (probability distribution on features), while the generative function goes from 𝑌 to 𝑋 or Prob(𝑋).

The functions 𝑓 and 𝑔 depend on parameters 𝜃𝑓 and 𝜃𝑔 respectively. For instance, 𝑓 and 𝑔 may each represent a multilayer neural network or any other model. Training the parameters via the reconstruction error criterion focuses on having 𝑔(𝑓(𝑥)) close to 𝑥, as follows.

Given a feature function 𝑓: 𝑋 → 𝑌 and a generative function 𝑔: 𝑌 → Prob(𝑋), define the reconstruction error for a dataset 𝒟 ⊂ 𝑋 as

$$L_{\mathrm{rec}}(x) := -\log g_{f(x)}(x), \qquad L_{\mathrm{rec}}(\mathcal{D}) := \sum_{x\in\mathcal{D}} L_{\mathrm{rec}}(x) \tag{1}$$

where 𝑔𝑦 is the probability distribution on 𝑋 associated with feature 𝑦.

The case of a deterministic 𝑔: 𝑌 → 𝑋 with square error ‖𝑔(𝑓(𝑥)) − 𝑥‖² is recovered by interpreting 𝑔 as a Gaussian distribution² centered at 𝑔(𝑓(𝑥)). So we will always consider that 𝑔 is a probability distribution on 𝑋.

Discrete-valued features can be difficult to train using gradient-based methods. For this reason, with discrete features it is more natural to define 𝑓(𝑥) as a distribution over the feature space 𝑌 describing the law of inferred features for 𝑥. Thus 𝑓(𝑥) will have continuous parameters. If 𝑓: 𝑋 → Prob(𝑌) describes a probability distribution on features for each 𝑥, we define the expected reconstruction error as the expectation of the above:

$$\mathbb{E}L_{\mathrm{rec}}(x) := -\mathbb{E}_{y\sim f(x)}\log g_y(x), \qquad \mathbb{E}L_{\mathrm{rec}}(\mathcal{D}) := \sum_{x\in\mathcal{D}} \mathbb{E}L_{\mathrm{rec}}(x) \tag{2}$$

² While the choice of variance does not influence minimization of the reconstruction error, when working with codelengths it will change the scaling of the various terms in Propositions 1–6. See Section 8 for the optimal variance.

This covers the previous case when 𝑓(𝑥) is a Dirac mass at a single value 𝑦.
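As a minimal numerical illustration of definition (1) (not from the paper: the linear encoder/decoder, its weights, and the fixed decoder variance below are invented for the example), 𝐿rec(𝑥) for a Gaussian decoder can be computed as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy auto-encoder: linear feature function f and linear decoder g,
# with g_y taken to be the Gaussian N(g_mean(y), sigma^2 I) on X, as in the text.
d_x, d_y, sigma = 8, 3, 1.0
W_enc = rng.normal(size=(d_y, d_x)) / np.sqrt(d_x)
W_dec = rng.normal(size=(d_x, d_y)) / np.sqrt(d_y)

def f(x):                    # deterministic feature function X -> Y
    return W_enc @ x

def g_mean(y):               # mean of the Gaussian g_y on X
    return W_dec @ y

def L_rec(x):
    """Reconstruction error -log g_{f(x)}(x) of eq. (1), in nats."""
    diff = x - g_mean(f(x))
    return (0.5 * np.sum(diff**2) / sigma**2
            + d_x * (np.log(sigma) + 0.5 * np.log(2 * np.pi)))

x = rng.normal(size=d_x)
print("L_rec(x) =", L_rec(x))
```

Up to the additive constant coming from the Gaussian normalization, this is the usual square reconstruction error divided by 2𝜎².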

In Sections 2–4 the logarithms may be in any base; in Sections 5–8 the logarithms are in base e.

2 Auto-encoders as generative models. Alternatively, auto-encoders can be viewed as generative models for the data. For this we assume that we are given (or learn) an elementary model 𝜌 on feature space, such as a Gaussian or Bernoulli model, or even a uniform model in which each feature is present or absent with probability 1/2. Then, to generate the data, we draw features at random according to 𝜌 and apply the generative function 𝑔. The goal is to maximize the probability to generate the actual data. In this viewpoint the feature function 𝑓 is used only as a prop to learn a good feature space and a good generative function 𝑔.

Given a probability distribution 𝑝 on a set 𝑋, a dataset (𝑥1, . . . , 𝑥𝑛) of points on 𝑋 can be encoded in −∑𝑖 log₂ 𝑝(𝑥𝑖) bits.³ Let 𝜌 ∈ Prob(𝑌) be the elementary model on feature space and let 𝑔: 𝑌 → Prob(𝑋) be the generative function. The probability to obtain 𝑥 ∈ 𝑋 by drawing 𝑦 ∼ 𝜌 and applying 𝑔 is

$$p_g(x) := \int_y \rho(y)\, g_y(x) \tag{3}$$

(where the integral is a sum if the feature space 𝑌 is discrete). Thus minimizing the codelength of the dataset 𝒟 amounts to minimizing

$$L_{\mathrm{gen}}(\mathcal{D}) := \sum_{x\in\mathcal{D}} L_{\mathrm{gen}}(x), \tag{4}$$

$$L_{\mathrm{gen}}(x) := -\log p_g(x) = -\log \int_y \rho(y)\, g_y(x) \tag{5}$$

over 𝑔.

This is the codelength of the data knowing the distribution 𝜌 and the function 𝑔. We do not consider here the problem of encoding the parameters of 𝜌 and 𝑔; this can be done following [Gra11], for instance.

The codelength 𝐿gen does not depend on any feature function 𝑓. However, it is difficult to optimize 𝐿gen via a direct approach: this leads to working with all possible values of 𝑦 for every sample 𝑥, as 𝐿gen(𝑥) is an integral over 𝑦. Presumably, for each given 𝑥 only a few feature values contribute significantly to 𝐿gen(𝑥). Using a feature function is a way to explore fewer possible values of 𝑦 for a given 𝑥, hopefully those that contribute most to 𝐿gen(𝑥).

³ Technically, for continuous-valued data 𝑥, the actual compressed length is rather −log₂ 𝑝(𝑥) − log₂ 𝜀, where 𝜀 is the quantization threshold of the data and 𝑝 is the probability density for 𝑥. For the purpose of comparing two different probabilistic models 𝑝 on the same data with the same 𝜀, the term log₂ 𝜀 can be dropped.
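To make the difficulty concrete, here is a toy numpy sketch (an illustration, not part of the paper) that estimates 𝐿gen(𝑥) of (5) by naively sampling 𝑦 ∼ 𝜌, assuming a Gaussian prior 𝜌 = 𝒩(0, 𝜆 Id) and a linear Gaussian decoder; most sampled features contribute almost nothing to the integral, which is the inefficiency a feature function is meant to avoid:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative generative model: rho = N(0, lam*I) on Y, g_y = N(W_dec @ y, sigma^2 I) on X.
d_x, d_y, sigma, lam = 8, 3, 1.0, 1.0
W_dec = rng.normal(size=(d_x, d_y)) / np.sqrt(d_y)

def L_gen_naive(x, n_samples=100_000):
    """Naive Monte Carlo estimate of -log E_{y~rho} g_y(x), eq. (5)."""
    ys = rng.normal(scale=np.sqrt(lam), size=(n_samples, d_y))   # y ~ rho
    diffs = x - ys @ W_dec.T                                     # x - g_mean(y)
    log_terms = (-0.5 * np.sum(diffs**2, axis=1) / sigma**2
                 - d_x * (np.log(sigma) + 0.5 * np.log(2 * np.pi)))
    m = log_terms.max()                                          # log-sum-exp for stability
    return -(m + np.log(np.mean(np.exp(log_terms - m))))

x = rng.normal(size=d_x)
print("naive estimate of L_gen(x):", L_gen_naive(x))
```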


For instance, consider the gradient of 𝐿gen(𝑥) with respect to a parameter 𝜃:

$$\frac{\partial L_{\mathrm{gen}}(x)}{\partial \theta} = -\frac{\int_y \rho(y)\,\partial g_y(x)/\partial\theta}{\int_y \rho(y)\, g_y(x)} = -\frac{\int_y \rho(y)\, g_y(x)\,\partial \ln g_y(x)/\partial\theta}{\int_y \rho(y)\, g_y(x)} \tag{6}$$

$$= -\mathbb{E}_{y\sim p_g(y|x)}\, \frac{\partial \ln g_y(x)}{\partial\theta} \tag{7}$$

where $p_g(y|x) = \rho(y)\, g_y(x) \big/ \int_y \rho(y)\, g_y(x)$ is the conditional probability of 𝑦 knowing 𝑥, in the generative model given by 𝜌 and 𝑔. In general we have no easy access to this distribution.

Using a (probabilistic) feature function 𝑓 and minimizing the reconstruction error E𝐿rec(𝑥) amounts to replacing the expectation under 𝑦 ∼ 𝑝𝑔(𝑦|𝑥) with an expectation under 𝑓(𝑥) in the above, presumably easier to handle.

However this gives no guarantees about minimizing 𝐿gen unless we know that the feature function 𝑓 is close to the inverse of the generative function 𝑔, in the sense that 𝑓(𝑥)(𝑦) is close to the conditional distribution 𝑝𝑔(𝑦|𝑥) of 𝑦 knowing 𝑥. It would be nice to obtain a guarantee on the codelength based on the reconstruction error of a feature function 𝑓 and generative function 𝑔.

The variational bound in Proposition 2 below shows that, given a feature function 𝑓 and a generative function 𝑔, the quantity 𝐿rec(𝑥) + KL(𝑓(𝑥)||𝜌) is an upper bound on the codelength 𝐿gen(𝑥). Training an auto-encoder to minimize this criterion will thus minimize an upper bound on 𝐿gen.

Moreover, Proposition 2 shows that the bound is tight when 𝑓(𝑥) is close to 𝑝𝑔(𝑦|𝑥), and that minimizing this bound will indeed bring 𝑓(𝑥) closer to 𝑝𝑔(𝑦|𝑥). On the other hand, just minimizing the reconstruction error does not, a priori, guarantee any of this.

3 Two-part codes: explicitly encoding feature values. We first discuss a simple, less efficient “two-part” [Grü07] coding method. It always yields a codelength larger than 𝐿gen but is more obviously related to the auto-encoder reconstruction error.

Given a generative model 𝑔: 𝑌 → Prob(𝑋) and a prior⁴ distribution 𝜌 on 𝑌, one way to encode a data sample 𝑥 ∈ 𝑋 is to explicitly encode a well-chosen feature value 𝑦 ∈ 𝑌 using the prior distribution 𝜌 on features, then encode 𝑥 using the probability distribution 𝑔𝑦(𝑥) on 𝑥 defined by 𝑦. The codelength resulting from this choice of 𝑦 is thus

$$L_{\text{two-part}}(x) := -\log\rho(y) - \log g_y(x) \tag{8}$$

In this section we assume that 𝑌 is a discrete set. Indeed, for continuous features the above does not make sense, as encoding a precise value for 𝑦 would require an infinite codelength. Continuous features are dealt with in Sections 4 and 5.

We always have

$$L_{\text{two-part}}(x) \geqslant L_{\mathrm{gen}}(x) \tag{9}$$

for discrete features, as 𝐿two-part uses a single value of 𝑦 while 𝐿gen uses a sum over 𝑦. The difference can be substantial if, for instance, not all feature components are relevant for all 𝑥: using the two-part code, it is always necessary to fully encode the feature values 𝑦.

⁴ We use the term “prior” in a loose way: we just encode features 𝑦 with a code of length −log 𝜌(𝑦), without implying any a priori belief. Thus 𝜌 is just a simple model used on feature space.

From an auto-encoder perspective, the feature function 𝑓 is used to choose the feature value 𝑦 used to encode 𝑥. So if the feature function is deterministic, 𝑓: 𝑋 → 𝑌, and if we set 𝑦 = 𝑓(𝑥) in the above, the cost of encoding the dataset is

$$L_{\text{two-part}}(\mathcal{D}) = -\sum_{x\in\mathcal{D}} \Big(\log\rho(f(x)) + \log g_{f(x)}(x)\Big) = L_{\mathrm{rec}}(\mathcal{D}) - \sum_{x\in\mathcal{D}} \log\rho(f(x))$$

involving the reconstruction error and a cross-entropy term between the empirical distribution of features 𝑓(𝑥) and the prior 𝜌 on feature space. We can further decompose

$$-\frac{1}{\#\mathcal{D}} \sum_{x\in\mathcal{D}} \log\rho(f(x)) = \mathrm{KL}(q_f\,\|\,\rho) + \mathrm{Ent}\, q_f \tag{10}$$

where $q_f$ is the empirical distribution of the feature 𝑓(𝑥) when 𝑥 runs over the dataset,

$$q_f(y) := \frac{1}{\#\mathcal{D}} \sum_{x\in\mathcal{D}} \mathbf{1}_{f(x)=y} \tag{11}$$

and KL(𝑞𝑓||𝜌) = E𝑦∼𝑞𝑓 log(𝑞𝑓(𝑦)/𝜌(𝑦)) is the Kullback–Leibler divergence between 𝑞𝑓 and 𝜌.

If the feature function 𝑓 is probabilistic, 𝑓: 𝑋 → Prob(𝑌), the analysis is identical, with the expected two-part codelength of 𝑥 being

$$\mathbb{E}L_{\text{two-part}}(x) = \mathbb{E}_{y\sim f(x)}\big(-\log\rho(y) - \log g_y(x)\big) \tag{12}$$
$$= \mathbb{E}L_{\mathrm{rec}}(x) - \mathbb{E}_{y\sim f(x)}\log\rho(y) \tag{13}$$

Thus we have proved the following, which covers both the case of probabilistic 𝑓 and of deterministic 𝑓 (by specializing 𝑓 to a Dirac mass) on a discrete feature space.

Proposition 1 (Two-part codelength and reconstruction error for discrete features). The expected two-part codelength of 𝑥 ∈ 𝒟 and the reconstruction error are related by

$$\mathbb{E}L_{\text{two-part}}(\mathcal{D}) = \mathbb{E}L_{\mathrm{rec}}(\mathcal{D}) - \sum_{x\in\mathcal{D}} \mathbb{E}_{y\sim f(x)}\log\rho(y) \tag{14}$$
$$= \mathbb{E}L_{\mathrm{rec}}(\mathcal{D}) + (\#\mathcal{D})\big(\mathrm{KL}(q_f\,\|\,\rho) + \mathrm{Ent}\, q_f\big) \tag{15}$$

where

$$q_f(y) := \frac{1}{\#\mathcal{D}} \sum_{x\in\mathcal{D}} \Pr(f(x)=y) \tag{16}$$

is the empirical distribution of features.
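As a concrete (and purely illustrative) sketch of the quantities in (12)–(14), assume binary features 𝑌 = {0,1}^𝑑 with a sigmoid feature function, a product Bernoulli model 𝜌, and a linear Gaussian decoder; none of these specific choices come from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

d_x, d_y, sigma = 8, 4, 1.0
W_enc = rng.normal(size=(d_y, d_x))
W_dec = rng.normal(size=(d_x, d_y))
rho_p = np.full(d_y, 0.5)                          # Bernoulli parameters of rho

def f(x):                                          # feature probabilities in [0,1]^d
    return 1.0 / (1.0 + np.exp(-W_enc @ x))

def L_rec_y(x, y):                                 # -log g_y(x) for a Gaussian decoder
    diff = x - W_dec @ y
    return (0.5 * np.sum(diff**2) / sigma**2
            + d_x * (np.log(sigma) + 0.5 * np.log(2 * np.pi)))

def expected_two_part(x, n_samples=2000):
    """E_{y~f(x)} [ -log rho(y) - log g_y(x) ]: expected reconstruction error
    plus the cross-entropy term, as in eqs. (13)-(14)."""
    p = f(x)
    ys = (rng.random((n_samples, d_y)) < p).astype(float)        # y ~ f(x)
    E_rec = np.mean([L_rec_y(x, y) for y in ys])
    cross_ent = -np.sum(p * np.log(rho_p) + (1 - p) * np.log(1 - rho_p))
    return E_rec + cross_ent

x = rng.normal(size=d_x)
print("E L_two-part(x) =", expected_two_part(x))
```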

Here are a few comments on this relation. These comments also apply to the codelength discussed in Section 4.

∙ The reconstruction error in (14) is the average reconstruction error for features 𝑦 sampled from 𝑓(𝑥), in case 𝑓(𝑥) is probabilistic. For instance, applying Proposition 1 to neural networks requires interpreting the activities of the 𝑌 layer as probabilities to sample 0/1-valued features on the 𝑌 layer. (This is not necessary for the results of Sections 4–7, which hold for continuous features.)

∙ The cross-entropy term −E𝑦∼𝑞𝑓 log 𝜌(𝑦) = KL(𝑞𝑓||𝜌) + Ent 𝑞𝑓 is an added term to the optimization problem. The Kullback–Leibler divergence favors feature functions that do actually match an elementary model on 𝑌, e.g., feature distributions that are “as Bernoulli-like” as possible. The entropy term Ent 𝑞𝑓 favors parsimonious feature functions that use fewer feature components if possible, arguably introducing some regularization or sparsity. (Note the absence of any arbitrary parameter in front of this regularization term: its value is fixed by the MDL interpretation.)

∙ If the elementary model 𝜌 has tunable parameters (e.g., a Bernoulli parameter for each feature), these come into the optimization problem as well. If 𝜌 is elementary it will be fairly easy to tune the parameters to find the elementary model 𝜌*(𝑓) minimizing the Kullback–Leibler divergence to 𝑞𝑓. Thus in this case the optimization problem over 𝑓 and 𝑔 involves a term KL(𝑞𝑓||𝜌*(𝑓)) between the empirical distribution of features and the closest elementary model.

This two-part code is somewhat naive in case not all feature components are relevant for all samples 𝑥: indeed for every 𝑥, a value of 𝑦 has to be fully encoded. For instance, with feature space 𝑌 = {0,1}^𝑑, if two values of 𝑦 differ in one place and contribute equally to generating some sample 𝑥, one could expect to save one bit on the codelength, by leaving a blank in the encoding where the two values of 𝑦 differ. In general, one could expect to save Ent 𝑓(𝑥) bits on the encoding of 𝑦 if several 𝑦 ∼ 𝑓(𝑥) have a high probability to generate 𝑥. We now show that indeed E𝐿two-part(𝑥) − Ent 𝑓(𝑥) is still an upper bound on 𝐿gen(𝑥).


4 Comparing 𝐿gen and 𝐿rec. We now turn to the actual codelength 𝐿gen(𝑥) = −log 𝑝𝑔(𝑥) associated with the probabilistic model 𝑝𝑔(𝑥) defined by the generative function 𝑔 and the prior 𝜌 on feature space. As mentioned above, it is always smaller than the two-part codelength.

Recall that this model first picks a feature value 𝑦 at random according to the distribution 𝜌 and then generates an object 𝑥 according to the distribution 𝑔𝑦(𝑥), so that the associated codelength is −log 𝑝𝑔(𝑥) = −log ∫𝑦 𝜌(𝑦)𝑔𝑦(𝑥).

So far this does not depend on the feature function, so it is not clear how 𝑓 can help in optimizing this codelength. Actually each choice of 𝑓 leads to upper bounds on 𝐿gen: the two-part codelength 𝐿two-part above is one such bound in the discrete case, and we now introduce a more precise and more general one, 𝐿𝑓-gen.

We have argued above (Section 2) that for gradient-based training it would be helpful to be able to sample features from the distribution 𝑝𝑔(𝑦|𝑥), and it is natural to expect the feature function 𝑓(𝑥) to approximate 𝑝𝑔(𝑦|𝑥), so that 𝑓 and 𝑔 are inverse to each other in a probabilistic sense. The tightness of the bound 𝐿𝑓-gen is related to the quality of this approximation.

Moreover, while auto-encoder training based on the reconstruction error provides no guarantee that 𝑓 will get closer to 𝑝𝑔(𝑦|𝑥), minimizing 𝐿𝑓-gen does.

Proposition 2 (Codelength and reconstruction error for probabilistic features). The codelength 𝐿gen and reconstruction error 𝐿rec for an auto-encoder with feature function 𝑓: 𝑋 → Prob(𝑌) and generative function 𝑔: 𝑌 → Prob(𝑋) satisfy

$$L_{\mathrm{gen}}(x) = \mathbb{E}L_{\mathrm{rec}}(x) + \mathrm{KL}(f(x)\,\|\,\rho) - \mathrm{KL}\big(f(x)\,\|\,p_g(y|x)\big), \tag{17}$$
$$L_{\mathrm{gen}}(\mathcal{D}) = \mathbb{E}L_{\mathrm{rec}}(\mathcal{D}) + \sum_{x\in\mathcal{D}} \mathrm{KL}(f(x)\,\|\,\rho) - \sum_{x\in\mathcal{D}} \mathrm{KL}\big(f(x)\,\|\,p_g(y|x)\big) \tag{18}$$

where 𝜌 is the elementary model on features, and $p_g(y|x) = \rho(y)\, g_y(x) \big/ \int_y \rho(y)\, g_y(x)$. In particular, for any feature function 𝑓, the quantity

$$L_{f\text{-gen}}(\mathcal{D}) := \sum_{x\in\mathcal{D}} L_{f\text{-gen}}(x) \tag{19}$$

where

$$L_{f\text{-gen}}(x) := \mathbb{E}L_{\mathrm{rec}}(x) + \mathrm{KL}(f(x)\,\|\,\rho) \tag{20}$$

is an upper bound on the codelength 𝐿gen(𝒟) of the generative function 𝑔.

The result holds whether 𝑌 is discrete or continuous.

The proof is by substitution in the right-hand side of (17); actually this is an instance of the variational bound [Bis06, Chapter 10]. Related results appear in [HOT06] and [KW13, Section 2.2].
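For completeness (this computation is only alluded to in the text), writing $p_g(y|x) = \rho(y)\,g_y(x)/p_g(x)$, the substitution reads:

```latex
\begin{aligned}
\mathrm{KL}\big(f(x)\,\|\,p_g(y|x)\big)
  &= \mathbb{E}_{y\sim f(x)}\Big[\log f(x)(y) - \log\rho(y) - \log g_y(x) + \log p_g(x)\Big] \\
  &= \mathrm{KL}\big(f(x)\,\|\,\rho\big) + \mathbb{E}L_{\mathrm{rec}}(x) - L_{\mathrm{gen}}(x),
\end{aligned}
```

which rearranges into (17); summing over 𝑥 ∈ 𝒟 gives (18), and dropping the last (nonnegative) Kullback–Leibler term gives the upper bound 𝐿𝑓-gen.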


On a discrete feature space, 𝐿𝑓-gen is always smaller than the codelength 𝐿two-part above; indeed

$$L_{f\text{-gen}}(x) = \mathbb{E}L_{\text{two-part}}(x) - \mathrm{Ent}\, f(x) \tag{21}$$

as can be checked directly.

The term KL(𝑓(𝑥)||𝜌) represents the cost of encoding a feature value 𝑦 drawn from 𝑓(𝑥) for each 𝑥 (encoded using the distribution 𝜌). The last, negative term in (17)–(18) represents how pessimistic the reconstruction error is w.r.t. the true codelength when 𝑓(𝑥) is far from the feature values that contribute most to 𝐿gen(𝑥).

The codelength 𝐿gen depends only on 𝑔 and not on the feature function 𝑓, so that the right-hand side in (17)–(18) is the same for all 𝑓 despite appearances. Ideally, this relation could be used to evaluate 𝐿gen(𝒟) for a given generative function 𝑔, and then to minimize this quantity over 𝑔. However, as explained above, the conditional probabilities 𝑝𝑔(𝑦|𝑥) are not easy to work with, hence the introduction of the upper bound 𝐿𝑓-gen, which does depend on the feature function 𝑓.

Minimizing 𝐿𝑓-gen over 𝑓 will bring 𝐿𝑓-gen closer to 𝐿gen. Since $L_{\mathrm{gen}}(\mathcal{D}) = L_{f\text{-gen}}(\mathcal{D}) - \sum_{x\in\mathcal{D}} \mathrm{KL}(f(x)\,\|\,p_g(y|x))$, and since 𝐿gen does not depend on 𝑓, minimizing 𝐿𝑓-gen is the same as bringing 𝑓(𝑥) closer to 𝑝𝑔(𝑦|𝑥) on average. Thus, in the end, an auto-encoder trained by minimizing 𝐿𝑓-gen as a function of 𝑓 and 𝑔 will both minimize an upper bound on the codelength 𝐿gen and bring 𝑓(𝑥) close to the “inverse” of 𝑔.

This also clarifies the role of the auto-encoder structure in minimizing the codelength, which does not depend on a feature function: optimizing over 𝑔 aims at actually reducing the codelength by decreasing an upper bound on it, while optimizing over 𝑓 will make this upper bound more precise.

One can apply to 𝐿𝑓-gen the same decomposition as for the two-part codelength, and write

$$L_{f\text{-gen}}(\mathcal{D}) = \mathbb{E}L_{\mathrm{rec}}(\mathcal{D}) + (\#\mathcal{D})\big(\mathrm{KL}(q_f\,\|\,\rho) + \mathrm{Ent}\, q_f\big) - \sum_{x\in\mathcal{D}} \mathrm{Ent}\, f(x) \tag{22}$$

where as above $q_f = \frac{1}{\#\mathcal{D}} \sum_{x\in\mathcal{D}} f(x)$ is the empirical feature distribution. As above, the term KL(𝑞𝑓||𝜌) favors feature distributions that match a simple model. The terms Ent 𝑞𝑓 and $\sum_{x\in\mathcal{D}} \mathrm{Ent}\, f(x)$ pull in different directions. Minimizing Ent 𝑞𝑓 favors using fewer features overall (more compact representation). Increasing the entropy of 𝑓(𝑥) for a given 𝑥, if it can be done without impacting the reconstruction error, means that more features are “indifferent” for reconstructing 𝑥 and do not have to be encoded, as discussed at the end of Section 3.

The “auto-encoder approximation” $\tilde{x} = x$ from [AO12, Section 2.4] can be used to define another bound on 𝐿gen, but it is not tight when 𝑓(𝑥) ≈ 𝑝𝑔(𝑦|𝑥).


5 Continuous-valued features and denoising. Proposition 2 cannot be directly applied to a deterministic feature function 𝑓: 𝑋 → 𝑌 with values in a continuous space 𝑌. In the continuous case, the reconstruction error based on a single value 𝑦 ∈ 𝑌 cannot control the codelength 𝐿gen(𝑥), which involves an integral over 𝑦. In the setting of Proposition 2, a deterministic 𝑓 seen as a probability distribution is a Dirac mass at a single value, so that the term KL(𝑓(𝑥)||𝜌) is infinite: it is infinitely costly to encode the feature value 𝑓(𝑥) exactly.

This can be overcome by considering the feature values 𝑦 as probability distributions over an underlying space 𝑍, namely, 𝑌 ⊂ Prob(𝑍). Then Proposition 2 can be applied to 𝑓(𝑥) seen as a probability distribution over the feature space 𝑍.

For instance, one possibility for neural networks with logistic activation function is to see the activities 𝑦 ∈ [0; 1] of the feature layer as Bernoulli probabilities over discrete-valued binary features, 𝑍 = {0,1}.

One may also use Gaussian distributions over 𝑍 = 𝑌 and apply Proposition 2 to a normal distribution 𝒩(𝑓(𝑥), Σ) centered at 𝑓(𝑥) with small covariance matrix Σ. Intuitively we overcome the problem of infinite codelength for 𝑓(𝑥) by encoding 𝑓(𝑥) with finite accuracy given by Σ.

The reconstruction error 𝐿rec from Proposition 2 then becomes an expectation over features sampled around 𝑓(𝑥): this is similar to denoising auto-encoders [VLL+10], except that here the noise is added to the features rather than the inputs. This relationship is not specific to a particular choice of feature noise (Bernoulli, Gaussian, ...) but leads to interesting developments in the Gaussian case, as follows.

Corollary 3 (Codelength and denoising the features). Let 𝑓: 𝑋 → 𝑌 be a deterministic feature function with values in 𝑌 = ℝ^𝑑. Let Σ be any positive definite matrix. Then

$$L_{\mathrm{gen}}(x) \leqslant \mathbb{E}L_{\mathrm{rec}}(x) - \mathbb{E}_{y\sim\mathcal{N}(f(x),\Sigma)}\log\rho(y) - \frac{1}{2}\log\det\Sigma - \frac{d}{2}(1+\log 2\pi) \tag{23}$$

where E𝐿rec(𝑥) is the expected reconstruction error obtained from a feature 𝑦 ∼ 𝒩(𝑓(𝑥), Σ).

If the elementary model 𝜌 on feature space is 𝒩(0, 𝜆 Id), this reads

$$L_{\mathrm{gen}}(x) \leqslant \mathbb{E}L_{\mathrm{rec}}(x) + \frac{\|f(x)\|^2}{2\lambda} + \frac{1}{2\lambda}\operatorname{Tr}(\Sigma) - \frac{1}{2}\log\det\Sigma + \frac{d}{2}\log\lambda - \frac{d}{2} \tag{24}$$

Thus the codelength bound decomposes as the sum of the average (noisy) reconstruction error, constant terms, and a term that penalizes improbable feature values under the elementary model.

Proof. Apply Proposition 2 with a normal distribution 𝒩(𝑓(𝑥), Σ) as the feature distribution.


We refer to [VLL+10, Section 4.2] for a discussion and further references on training with noise in an auto-encoder setting.

In practice, the bound (23) can be optimized over 𝑓 and 𝑔 via Monte Carlo sampling over 𝑦 ∼ 𝒩(𝑓(𝑥), Σ). For the case of neural networks, this can be done via ordinary backpropagation if one considers that the activation function of the layer representing 𝑌 is 𝑦 = 𝑓(𝑥) + 𝒩(0, Σ): one can then run several independent samples 𝑦𝑖, backpropagate the loss obtained with each 𝑦𝑖, and average over 𝑖. The backpropagation from 𝑦 to the input layer can even be factorized over the samples, thanks to linearity of backpropagation, namely: generate samples 𝑦𝑖 ∼ 𝒩(𝑓(𝑥), Σ), backpropagate the error obtained with 𝑦𝑖 from the output to the layer representing 𝑌, average the obtained backpropagated values over 𝑖, and backpropagate from the 𝑌 layer to the input layer using 𝑓. For any explicit choice of 𝜌, the contribution of the gradient of the log 𝜌(𝑦) term can easily be incorporated into this scheme.
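As an illustration of this scheme (a sketch only: the linear toy auto-encoder, the fixed isotropic Σ, and all sizes below are assumptions, not the paper's setup), the right-hand side of (24) can be evaluated by Monte Carlo as follows:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative setup: linear encoder/decoder, Gaussian decoder noise sigma,
# Gaussian prior rho = N(0, lam*I) on features.
d_x, d_y, sigma, lam = 8, 3, 1.0, 1.0
W_enc = rng.normal(size=(d_y, d_x)) / np.sqrt(d_x)
W_dec = rng.normal(size=(d_x, d_y)) / np.sqrt(d_y)
f = lambda x: W_enc @ x

def bound_24(x, Sigma, n_samples=1000):
    """Monte Carlo evaluation of the right-hand side of (24)."""
    fx = f(x)
    L = np.linalg.cholesky(Sigma)
    noisy = fx + (L @ rng.normal(size=(d_y, n_samples))).T       # y ~ N(f(x), Sigma)
    diffs = x - noisy @ W_dec.T
    E_L_rec = np.mean(0.5 * np.sum(diffs**2, axis=1) / sigma**2
                      + d_x * (np.log(sigma) + 0.5 * np.log(2 * np.pi)))
    return (E_L_rec
            + np.sum(fx**2) / (2 * lam)
            + np.trace(Sigma) / (2 * lam)
            - 0.5 * np.linalg.slogdet(Sigma)[1]
            + 0.5 * d_y * np.log(lam)
            - 0.5 * d_y)

x = rng.normal(size=d_x)
print("upper bound on L_gen(x):", bound_24(x, Sigma=0.1 * np.eye(d_y)))
```

In a neural network this expectation would be estimated with a few noisy feature samples per input and trained by backpropagation, exactly as described above.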

6 Optimal noise level. A good choice of noise level Σ leads to tighter bounds on 𝐿gen: a small Σ results in a high cost of encoding the features up to Σ (log det Σ term), while a large Σ will result in more noise on features and a worse reconstruction error. An approximately optimal choice of Σ can be obtained by a Taylor expansion of the reconstruction error around 𝑓(𝑥), as follows. (A theoretical treatment of using such Taylor expansions for optimization with denoising can be found in [GCB97].)

Lemma 4 (Taylor expansion of 𝐿𝑓-gen for small Σ). Let 𝑓: 𝑋 → 𝑌 be a deterministic feature function with values in 𝑌 = ℝ^𝑑. Let 𝐿𝑓,Σ-gen be the upper bound (23) using a normal distribution 𝒩(𝑓(𝑥), Σ) for features. Then for small covariance matrix Σ we have

$$L_{f,\Sigma\text{-gen}}(x) \approx L_{\mathrm{rec}}(x) - \log\rho(f(x)) - \frac{1}{2}\log\det\Sigma + \frac{1}{2}\operatorname{Tr}(\Sigma H) - \frac{d}{2}(1+\log 2\pi) \tag{25}$$

where 𝐿rec(𝑥) is the deterministic reconstruction error using feature 𝑦 = 𝑓(𝑥), and 𝐻 is the Hessian

$$H = \frac{\partial^2}{\partial y^2}\Big(L^y_{\mathrm{rec}}(x) - \log\rho(y)\Big) \tag{26}$$

at 𝑦 = 𝑓(𝑥), where $L^y_{\mathrm{rec}}(x)$ is the reconstruction error using feature 𝑦. Thus this is an approximate upper bound on 𝐿gen(𝑥).

Theorem 5 (Optimal choice of Σ for feature noise). Let 𝑓: 𝑋 → 𝑌 be a deterministic feature function with values in 𝑌 = ℝ^𝑑. Let 𝐿𝑓,Σ-gen be the upper bound (23) using a normal distribution 𝒩(𝑓(𝑥), Σ) for features. Let as above

$$H(x) := \frac{\partial^2}{\partial y^2}\Big(L^y_{\mathrm{rec}}(x) - \log\rho(y)\Big) \tag{27}$$

at 𝑦 = 𝑓(𝑥).

Then the choice Σ(𝑥) = 𝐻(𝑥)⁻¹ (provided 𝐻 is positive) is optimal in the bound (25) and yields

$$L_{f,\Sigma\text{-gen}}(x) \approx L_{\mathrm{rec}}(x) - \log\rho(f(x)) + \frac{1}{2}\log\det H(x) - \frac{d}{2}\log 2\pi \tag{28}$$

as an approximate upper bound on 𝐿gen(𝑥).

Among diagonal matrices Σ, the optimal choice is Σ(𝑥) = (diag 𝐻(𝑥))⁻¹ and produces a corresponding term ½ log det diag 𝐻(𝑥) instead of ½ log det 𝐻(𝑥).

In addition to the reconstruction error 𝐿rec(𝑥) at 𝑓(𝑥) and to the encoding cost −log 𝜌(𝑓(𝑥)) under the elementary model, this codelength bound involves the reconstruction error around 𝑓(𝑥) through the Hessian. Minimizing this bound will favor points where the error is small in the widest possible feature region around 𝑓(𝑥). This presumably leads to more robust reconstruction.

Several remarks can be made on this result. First, the optimal choice of noise Σ depends on the data sample 𝑥, since 𝐻 does. This should not be a practical problem when training denoising auto-encoders.

Second, this choice only optimizes a Taylor approximation of the actual bound in Corollary 3, so it is only approximately optimal; see [GCB97]. Still, Corollary 3 applies to any choice of Σ so it provides a valid, exact bound for this approximately optimal choice.

Third, computing the Hessian 𝐻(𝑥) may not be practical. Still, since again Corollary 3 applies to an arbitrary Σ, it is not necessary to compute 𝐻(𝑥) exactly, and any reasonable approximation of 𝐻(𝑥)⁻¹ yields a valid near-optimal bound and should provide a suitable order of magnitude for feature noise. [LBOM96, Section 7] provides useful Hessian approximations for neural networks, in particular the diagonal Gauss–Newton approximation (see the Appendix for more details).

In practice there are two different ways of using this result:

∙ One can use the denoising criterion of Corollary 3, in which at each step the noise level is set to an approximation of 𝐻(𝑥)⁻¹, such as diagonal Gauss–Newton (see the sketch after this list). This alternates between optimizing the model parameters for a given noise level, and optimizing the noise level for given model parameters.

∙ One can work directly with the objective function (28) from Theorem 5, which has an error term 𝐿rec and a regularization term log det 𝐻(𝑥). Computing a gradient of the latter may be tricky. For multilayer neural networks, we provide in the Appendix (Theorem 9) an algorithm to compute this gradient at a cost of two forward and backpropagation passes if the layer-wise diagonal Gauss–Newton approximation of [LBOM96] is used for 𝐻. The algorithm is inspired from dynamic programming and the forward–backward algorithm used in hidden Markov models.
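The sketch below illustrates the first option, assuming a quadratic reconstruction error with per-component output variances, a Gaussian prior 𝒩(0, 𝜆 Id) on features, and a toy linear decoder whose Jacobian ∂x̂/∂𝑦 is simply the decoding matrix; in a real network the Jacobian would be obtained by backpropagation, and all weights and sizes here are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(4)

d_x, d_y, lam = 8, 3, 1.0
sigma = np.full(d_x, 1.0)                          # output standard deviations
W_dec = rng.normal(size=(d_x, d_y)) / np.sqrt(d_y)

def optimal_diag_noise(x_hat_jacobian):
    """Per-sample noise Sigma(x) = (diag H(x))^{-1} with the diagonal
    Gauss-Newton approximation H_ii ~ 1/lam + sum_k (d x_hat_k / d y_i)^2 / sigma_k^2."""
    J = x_hat_jacobian                             # shape (d_x, d_y)
    H_diag = 1.0 / lam + np.sum((J / sigma[:, None])**2, axis=0)
    return 1.0 / H_diag                            # diagonal entries of Sigma(x)

# For a linear decoder the Jacobian is constant and equal to W_dec.
print("per-feature noise variances:", optimal_diag_noise(W_dec))
```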


Proof of Lemma 4.
Using 𝑦 ∼ 𝒩(𝑓(𝑥), Σ) in Proposition 2, the reconstruction error E𝐿rec(𝑥) is $\mathbb{E}_y L^y_{\mathrm{rec}}(x)$. Using a second-order Taylor expansion of $L^y_{\mathrm{rec}}(x)$ around 𝑦 = 𝑓(𝑥), and using that $\mathbb{E}_{z\sim\mathcal{N}(0,\Sigma)}(z^\top M z) = \operatorname{Tr}(\Sigma M)$ for any matrix 𝑀, we find $\mathbb{E}L_{\mathrm{rec}}(x) \approx L_{\mathrm{rec}}(x) + \tfrac{1}{2}\operatorname{Tr}(\Sigma H_g)$ where $H_g$ is the Hessian of $L^y_{\mathrm{rec}}(x)$ at 𝑦 = 𝑓(𝑥). By a similar argument the term KL(𝒩(𝑓(𝑥), Σ)||𝜌) is approximately $-\mathrm{Ent}\,\mathcal{N}(f(x),\Sigma) - \log\rho(f(x)) + \tfrac{1}{2}\operatorname{Tr}(\Sigma H_\rho)$ with $H_\rho$ the Hessian of −log 𝜌(𝑦) at 𝑦 = 𝑓(𝑥). Thus the bound 𝐿𝑓,Σ-gen(𝑥) is approximately $L_{\mathrm{rec}}(x) - \log\rho(f(x)) - \mathrm{Ent}\,\mathcal{N}(f(x),\Sigma) + \tfrac{1}{2}\operatorname{Tr}(\Sigma H)$ with 𝐻 = 𝐻𝑔 + 𝐻𝜌. The result follows from $\mathrm{Ent}\,\mathcal{N}(f(x),\Sigma) = \tfrac{1}{2}\log\det\Sigma + \tfrac{d}{2}(1+\log 2\pi)$.

Proof of Theorem 5.
Substituting Σ = 𝐻(𝑥)⁻¹ in (25) directly yields the estimate in the proposition. Let us prove that this choice is optimal. We have to minimize −log det Σ + Tr(Σ𝐻) over Σ. The case of diagonal Σ follows by direct minimization over the diagonal entries. For the general case, we have $\operatorname{Tr}(\Sigma H) = \operatorname{Tr}(H^{1/2}\Sigma H^{1/2})$. Since $H^{1/2}\Sigma H^{1/2}$ is symmetric we can decompose $H^{1/2}\Sigma H^{1/2} = O D O^\top$ with 𝑂 orthogonal and 𝐷 diagonal. Then $\operatorname{Tr}(\Sigma H) = \operatorname{Tr}(O D O^\top) = \operatorname{Tr}(D)$. Moreover, $\log\det\Sigma = \log\det(H^{-1/2} O D O^\top H^{-1/2}) = -\log\det H + \log\det D$, so that $-\log\det\Sigma + \operatorname{Tr}(\Sigma H) = \log\det H - \log\det D + \operatorname{Tr}(D) = \log\det H + \sum_k (d_k - \log d_k)$ with $d_k$ the entries of 𝐷. The function $z \mapsto z - \log z$ is convex on $\mathbb{R}_+$ with a unique minimum at 𝑧 = 1, so this is minimal if and only if 𝐷 = Id, i.e., Σ = 𝐻⁻¹.

7 Link with contractive auto-encoders. The Hessian of the reconstruction error may not be easy to compute in practice. However, when reconstruction error is small this Hessian is related to the square derivatives of the reconstructed output with respect to the features, using the well-known Gauss–Newton approximation.

The resulting bound on codelength penalizes large square derivatives of the reconstructed outputs, as follows.

This is reminiscent of contractive auto-encoders ([RVM+11]; see also [Bis95] for the relationship between denoising and contractivity as regularization methods), with two differences: the contractivity is from features to output instead of from input to features, and instead of the Frobenius norm of the Jacobian matrix [RVM+11], the penalty is the sum of the logs of the norms of the rows of this matrix.

Proposition 6 (Codelength and contractivity). Consider a quadratic reconstruction error of the type $L = \sum_k \frac{(\hat{x}_k - x_k)^2}{2\sigma_k^2}$ where $\hat{x}_k$ are the components of the reconstructed data $\hat{x} = \hat{x}(y)$ using features 𝑦. Let the elementary model 𝜌 on 𝑌 be Gaussian with variance diag(𝜆𝑖).

Then, when the reconstruction error is small enough,

$$L_{\mathrm{rec}}(x) - \log\rho(f(x)) + \sum_i \log\sqrt{\frac{1}{\lambda_i} + \sum_k \frac{1}{\sigma_k^2}\left(\frac{\partial\hat{x}_k}{\partial y_i}\right)^2} - \frac{d}{2}\log 2\pi \tag{29}$$

is an approximate upper bound on 𝐿gen(𝑥).

This corresponds to the approximately optimal choice Σ = (diag 𝐻)⁻¹ together with the Gauss–Newton approximation $\frac{\partial^2 L}{\partial y_i \partial y_j} \approx \sum_k \frac{1}{\sigma_k^2}\frac{\partial\hat{x}_k}{\partial y_i}\frac{\partial\hat{x}_k}{\partial y_j}$. The terms 1/𝜆𝑖 prevent the logarithms from diverging to −∞ in case a feature component 𝑖 has no influence on the output $\hat{x}$. Typically 𝜆𝑖 will be large, so the Jacobian norm $\sum_k \frac{1}{\sigma_k^2}\big(\frac{\partial\hat{x}_k}{\partial y_i}\big)^2$ dominates.

[RVM+11] contains an indication on how to optimize objective functions involving such derivatives for the case of a single-layer neural network: in that case the square derivatives are related to the squared weights of the network, so that the gradient of this term can be computed. For more complex models, however, $\partial\hat{x}_k/\partial y_i$ is a complex (though computable) function of the model parameters. Computing the gradient of $\big(\frac{\partial\hat{x}_k}{\partial y_i}\big)^2$ with respect to the model parameters is thus feasible but costly. Lemma 10 in the Appendix allows one to compute a similar quantity for multilayer networks if the Gauss–Newton approximation is used on each layer in turn, instead of once globally from the 𝑦 layer to the $\tilde{x}$ layer as used here. Optimizing (29) for multilayer networks using Lemma 10 would require (dim 𝑋) distinct backpropagations. More work is needed on this, such as stacking auto-encoders [HS06] to work with only one layer at a time.
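To make the cost of the penalty in (29) concrete, here is a small sketch that evaluates $\sum_i \log\sqrt{1/\lambda_i + \sum_k (\partial\hat{x}_k/\partial y_i)^2/\sigma_k^2}$ for an arbitrary decoder using finite differences (in practice automatic differentiation would be used; the toy nonlinear decoder below is purely illustrative):

```python
import numpy as np

def contractive_penalty(x_hat, y, sigma, lam, eps=1e-5):
    """sum_i log sqrt( 1/lambda_i + sum_k (d x_hat_k / d y_i)^2 / sigma_k^2 )."""
    y = np.asarray(y, dtype=float)
    base = x_hat(y)
    penalty = 0.0
    for i in range(y.size):
        dy = np.zeros_like(y)
        dy[i] = eps
        col_i = (x_hat(y + dy) - base) / eps       # i-th column of the Jacobian
        penalty += 0.5 * np.log(1.0 / lam[i] + np.sum((col_i / sigma)**2))
    return penalty

rng = np.random.default_rng(5)
W = rng.normal(size=(8, 3))
x_hat = lambda y: np.tanh(W @ y)                   # toy nonlinear decoder
print(contractive_penalty(x_hat, y=rng.normal(size=3),
                          sigma=np.ones(8), lam=np.ones(3)))
```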

Proof of Proposition 6.
Starting from Theorem 5, we have to approximate the Hessian $H(x) = \frac{\partial^2}{\partial y^2}\big(L^y_{\mathrm{rec}}(x) - \log\rho(y)\big)$. By the assumption that $L^y_{\mathrm{rec}}(x) = \sum_k \frac{(\hat{x}_k - x_k)^2}{2\sigma_k^2}$, where $\hat{x}$ is a function of 𝑦, and since 𝜌 is Gaussian, we get

$$H_{ij}(x) = \operatorname{diag}(1/\lambda_i) + \sum_k \frac{1}{2\sigma_k^2}\,\frac{\partial^2}{\partial y_i \partial y_j}\big(\hat{x}_k(y) - x_k\big)^2 \tag{30}$$

and we can use the well-known Gauss–Newton approximation [Bis06, 5.4.2], namely

$$\frac{\partial^2}{\partial y_i \partial y_j}\big(\hat{x}_k(y) - x_k\big)^2 = 2\,\frac{\partial\hat{x}_k}{\partial y_i}\frac{\partial\hat{x}_k}{\partial y_j} + 2\big(\hat{x}_k(y) - x_k\big)\frac{\partial^2\hat{x}_k}{\partial y_i \partial y_j} \tag{31}$$
$$\approx 2\,\frac{\partial\hat{x}_k}{\partial y_i}\frac{\partial\hat{x}_k}{\partial y_j} \tag{32}$$

valid whenever the error $\hat{x}_k(y) - x_k$ is small enough. (Interestingly, when summing over the dataset, it is not necessary that every error is small enough, because errors with opposite signs will compensate; a fact used implicitly in [Bis95].)

The diagonal terms of 𝐻(𝑥) are thus

$$H_{ii}(x) \approx \frac{1}{\lambda_i} + \sum_k \frac{1}{\sigma_k^2}\left(\frac{\partial\hat{x}_k}{\partial y_i}\right)^2 \tag{33}$$

Now, from Theorem 5 the choice Σ = (diag 𝐻)⁻¹ is optimal among diagonal noise matrices Σ. Computing the term $\tfrac{1}{2}\log\det\operatorname{diag} H = \sum_i \log\sqrt{H_{ii}}$ from (28) and substituting $H_{ii}$ ends the proof.

Remark 7. A tighter (approximately optimal) but less convenient bound is

$$L_{\mathrm{rec}}(x) - \log\rho(f(x)) + \frac{1}{2}\log\det\left(-\frac{\partial^2\log\rho(y)}{\partial y_i \partial y_j} + \sum_k \frac{1}{\sigma_k^2}\frac{\partial\hat{x}_k}{\partial y_i}\frac{\partial\hat{x}_k}{\partial y_j}\right)_{ij} - \frac{d}{2}\log 2\pi \tag{34}$$

which forgoes the diagonal approximation. For more general loss functions, a similar argument applies, resulting in a more complex expression which involves the Hessian of the loss with respect to the reconstruction $\hat{x}$.

Remark 8 (Adapting the elementary feature model 𝜌). Since all our bounds on 𝐿gen involve log 𝜌(𝑓(𝑥)) terms, the best choice of elementary model 𝜌 is the one which maximizes the log-likelihood of the empirical feature distribution in space 𝑌. This can be done concurrently with the optimization of the codelength, by re-adapting the prior after each step in the optimization of the functions 𝑓 and 𝑔. For Gaussian models 𝜌 as in Proposition 6, this leads to

$$\lambda_i \leftarrow \operatorname{Var}\big[f(x)_i\big] \tag{35}$$

with 𝑓(𝑥)𝑖 the 𝑖-th component of feature 𝑓(𝑥), and 𝑥 ranging over the dataset. If using the “denoising” criterion from Corollary 3, the noise on 𝑓(𝑥) must be included when computing this variance.
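In code, this re-adaptation is a one-liner over the stored features (a sketch; the array shapes and values are assumptions made for the example):

```python
import numpy as np

def refit_lambda(features, noise_var=0.0):
    """Remark 8: set lambda_i to the empirical variance of the i-th feature
    component over the dataset, including the feature noise variance if any.
    `features` has shape (#dataset, d) and holds f(x) for each x."""
    return np.var(features, axis=0) + noise_var

features = np.random.default_rng(6).normal(size=(500, 3)) * np.array([1.0, 0.3, 2.0])
print(refit_lambda(features, noise_var=0.05))
```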

8 Variance of the output, and relative versus absolute error. A final, important choice when considering auto-encoders from a compression perspective is whether or not to include the variance of the output as a model parameter. While minimizing the reconstruction error usually focuses on absolute error, dividing the error by two will reduce codelength by one bit whether the error is large or small. This works out as follows.

Consider a situation where the outputs are real-valued (e.g., images). The usual loss is the square loss $L = \sum_n \sum_i (x^i_n - \hat{x}^i_n)^2$ where 𝑛 goes through all samples and 𝑖 goes through the components of each sample (output dimension), the $x_n$ are the actual data, and the $\hat{x}_n$ are the reconstructed data computed from the features 𝑦.

This square loss is recovered as the log-likelihood of the data over a Gaussian model with fixed variance 𝜎 and mean $\hat{x}_n$:

$$L_{\mathrm{rec}}(\mathcal{D}) = -\sum_{x\in\mathcal{D}} \log g_{\hat{x}}(x) = \sum_{x\in\mathcal{D}} \sum_i \left(\frac{(x^i - \hat{x}^i)^2}{2\sigma^2} + \log\sigma + \frac{1}{2}\log 2\pi\right) \tag{36}$$

For any fixed 𝜎, the optimum is the same as for the square loss above.

Incorporating a new parameter 𝜎𝑖 for the variance of the 𝑖-th component into the model may make a difference if the various output components have different scales or noise levels. The reconstruction error becomes

$$L_{\mathrm{rec}}(\mathcal{D}) = \sum_i \sum_{x\in\mathcal{D}} \left(\frac{(x^i - \hat{x}^i)^2}{2\sigma_i^2} + \log\sigma_i + \frac{1}{2}\log 2\pi\right) \tag{37}$$

which is now to be optimized jointly over the functions 𝑓 and 𝑔, and the 𝜎𝑖's.

The optimal 𝜎𝑖 for a given 𝑓 and 𝑔 is the mean square error⁵ of component 𝑖,

$$\sigma_i^{*2} = E_i := \frac{1}{\#\mathcal{D}} \sum_{x\in\mathcal{D}} (x^i - \hat{x}^i)^2 \tag{38}$$

so with this optimal choice the reconstruction error is

$$L_{\mathrm{rec}}(\mathcal{D}) = (\#\mathcal{D}) \sum_i \left(\frac{1}{2} + \frac{1}{2}\log E_i + \frac{1}{2}\log 2\pi\right) \tag{39}$$

and so we have to optimize

$$L_{\mathrm{rec}}(\mathcal{D}) = \frac{\#\mathcal{D}}{2} \sum_i \log E_i + \text{Cst} \tag{40}$$

that is, the sum of the logarithms of the mean square error for each component. (Note that this is not additive over the dataset: each 𝐸𝑖 is an average over the dataset.) Usually, the sum of the 𝐸𝑖 themselves is used. Thus, including the 𝜎𝑖 as parameters changes the minimization problem by focusing on relative error, both for codelength and reconstruction error.

This is not cancelled out by normalizing the data: indeed the above does not depend on the variance of each component, but on the mean square prediction error, which can vary even if all components have the same variance, if some components are harder to predict.

This is to be used with caution when some errors become close to 0 (the log tends to −∞). Indeed, optimizing this objective function means that being able to predict an output component with an accuracy of 100 digits (for every sample 𝑥 in the data) can balance out 100 bad predictions on other output components. This is only relevant if the data are actually precise up to 100 significant digits. In practice an error of 0 only means that the actual error is below the quantization level 𝜀. Thus, numerically, we might want to consider that the smallest possible square error is 𝜀², and to optimize $\sum_i \log(E_i + \varepsilon^2)$ for data quantized up to 𝜀.

⁵ If working with feature noise as in Corollary 3, this is the error after adding the noise. Optimizing 𝜎𝑖 for the estimate in Proposition 6 is more complicated since 𝜎𝑖 influences both the reconstruction error and the regularization term.
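A short sketch of this quantized relative-error objective (the data, reconstructions, and 𝜀 below are made up for the example):

```python
import numpy as np

def relative_error_loss(X, X_hat, eps=1e-3):
    """(#D / 2) * sum_i log(E_i + eps^2), with E_i the per-component mean
    square error over the dataset and eps the data quantization level."""
    X, X_hat = np.asarray(X), np.asarray(X_hat)
    E = np.mean((X - X_hat) ** 2, axis=0)          # E_i, one per output component
    return 0.5 * X.shape[0] * np.sum(np.log(E + eps**2))

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 4)) * np.array([1.0, 0.1, 10.0, 0.01])   # mixed scales
X_hat = X + 0.05 * rng.normal(size=X.shape)                        # toy reconstructions
print(relative_error_loss(X, X_hat))
```

Components with tiny errors then stop dominating the objective once their error falls below the quantization level.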

When working with the results of the previous sections (Prop. 2, Corollary 3, Thm. 5, and Prop. 6), changing 𝜎 has an influence: it changes the relative scaling of the reconstruction error term 𝐿rec w.r.t. the remaining information-theoretic terms. Choosing the optimal 𝜎𝑖 as described here fixes this problem and makes all terms homogeneous.

Intuitively, from the minimum description length or compression viewpoint, dividing an error by 2 is an equally good move whether the error is small or large (one bit per sample gained on the codelength). Still, in a specific application, the relevant loss function may be the actual sum of square errors as usual, or a user-defined perceptual error. But in order to find a good representation of the data as an intermediate step in a final, user-defined problem, the compression point of view might be preferred.

Conclusions and perspectives

We have established that there is a strong relationship between minimizing a codelength of the data and minimizing reconstruction error using an auto-encoder. A variational approach provides a bound on data codelength in terms of the reconstruction error to which certain regularization terms are added.

The additional terms in the codelength bounds can be interpreted as a denoising condition from features to reconstructed output. This is in contrast with previously proposed denoising auto-encoders. For neural networks, this criterion can be trained using standard backpropagation techniques.

The codelength approach determines an optimal noise level for this denoising interpretation, namely, the one that will provide the tightest codelength. This optimal noise is approximately the inverse Hessian of the reconstruction function, for which several approximation techniques exist in the literature.

A practical consequence is that the noise level should be set differently for each data sample in a denoising approach.

Under certain approximations, the codelength approach also translates as a penalty for large derivatives from feature to output, different from that posited in contractive auto-encoders. However, the resulting criterion is hard to train for complex models such as multilayer neural networks. More work is needed on this point.


Including the variances of the outputs as parameters results in better compression bounds and a modified reconstruction error involving the logarithms of the square errors together with the data quantization level. Still, having these variances as parameters is a modeling choice that may be relevant for compression but not in applications where the actual reconstruction error is considered.

It would be interesting to explore the practical consequences of these insights. Another point in need of further inquiry is how this codelength viewpoint combines with the stacking approach to deep learning, namely, after the data 𝑥 have been learned using features 𝑦 and an elementary model for 𝑦, to further learn a finer model of 𝑦. For instance, it is likely that there is an interplay, in the denoising interpretation, between the noise level used on 𝑦 when computing the codelength of 𝑥, and the output variance 𝜎𝑦 used in the definition of the reconstruction error of a model of 𝑦 at the next level.

This would require modeling the transmission of noise from one layer to another in stacked generative models and optimizing the levels of noise to minimize a resulting bound on codelength of the output.

References

[AO12] Ludovic Arnold and Yann Ollivier. Layer-wise learning of deep generative models. Preprint, arXiv:1212.1524, 2012.

[Bis95] Christopher M. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116, 1995.

[Bis06] Christopher M. Bishop. Pattern recognition and machine learning. Springer, 2006.

[GCB97] Yves Grandvalet, Stéphane Canu, and Stéphane Boucheron. Noise injection: Theoretical prospects. Neural Computation, 9(5):1093–1108, 1997.

[Gra11] Alex Graves. Practical variational inference for neural networks. In John Shawe-Taylor, Richard S. Zemel, Peter L. Bartlett, Fernando C. N. Pereira, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12–14 December 2011, Granada, Spain, pages 2348–2356, 2011.

[Grü07] Peter D. Grünwald. The minimum description length principle. MIT Press, 2007.

[HOT06] G.E. Hinton, S. Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.

[HS06] Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313:504–507, 2006.

[HvC93] Geoffrey E. Hinton and Drew van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Lenny Pitt, editor, Proceedings of the Sixth Annual ACM Conference on Computational Learning Theory, COLT 1993, Santa Cruz, CA, USA, July 26–28, 1993, pages 5–13. ACM, 1993.

[KW13] Diederik P. Kingma and Max Welling. Stochastic gradient VB and the variational auto-encoder. Preprint, arXiv:1312.6114, 2013.

[LBOM96] Yann LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Genevieve B. Orr and Klaus-Robert Müller, editors, Neural Networks: Tricks of the Trade, volume 1524 of Lecture Notes in Computer Science, pages 9–50. Springer, 1996.

[Oll13] Yann Ollivier. Riemannian metrics for neural networks I: feedforward networks. Preprint, http://arxiv.org/abs/1303.0818, 2013.

[PH87] David C. Plaut and Geoffrey Hinton. Learning sets of filters using back-propagation. Computer Speech and Language, 2:35–61, 1987.

[RVM+11] Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In Lise Getoor and Tobias Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pages 833–840. Omnipress, 2011.

[VLL+10] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408, 2010.
