Neural Networks and Deep Learning

A Textbook

ISBN 978-3-319-94462-3 ISBN 978-3-319-94463-0 (eBook) https://doi.org/10.1007/978-3-319-94463-0

Library of Congress Control Number: 2018947636

© Springer International Publishing AG, part of Springer Nature 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Preface

“Any A.I. smart enough to pass a Turing test is smart enough to know to fail it.”—Ian McDonald

Neural networks were developed to simulate the human nervous system for machine learning tasks by treating the computational units in a learning model in a manner similar to human neurons. The grand vision of neural networks is to create artificial intelligence by building machines whose architecture simulates the computations in the human nervous system. This is obviously not a simple task because the computational power of the fastest computer today is a minuscule fraction of the computational power of a human brain. Neural networks were developed soon after the advent of computers in the fifties and sixties. Rosenblatt’s perceptron algorithm was seen as a fundamental cornerstone of neural networks, which caused an initial excitement about the prospects of artificial intelligence.

However, after the initial euphoria, there was a period of disappointment in which the data-hungry and computationally intensive nature of neural networks was seen as an impediment to their usability. Eventually, at the turn of the century, greater data availability and increasing computational power led to increased successes of neural networks, and this area was reborn under the new label of “deep learning.” Although we are still far from the day that artificial intelligence (AI) is close to human performance, there are specific domains like image recognition, self-driving cars, and game playing, where AI has matched or exceeded human performance. It is also hard to predict what AI might be able to do in the future. For example, few computer vision experts would have thought two decades ago that any automated system could ever perform an intuitive task like categorizing an image more accurately than a human.

Neural networks are theoretically capable of learning any mathematical function with sufficient training data, and some variants like recurrent neural networks are known to be Turing complete. Turing completeness refers to the fact that a neural network can simulate any learning algorithm, given sufficient training data. The sticking point is that the amount of data required to learn even simple tasks is often extraordinarily large, which causes a corresponding increase in training time (if we assume that enough training data is available in the first place). For example, the training time for image recognition, which is a simple task for a human, can be on the order of weeks even on high-performance systems. Furthermore, there are practical issues associated with the stability of neural network training, which are being resolved even today. Nevertheless, given that the speed of computers is expected to increase rapidly over time, and fundamentally more powerful paradigms like quantum computing are on the horizon, the computational issue might not eventually turn out to be quite as critical as imagined.

Although the biological analogy of neural networks is an exciting one and evokes comparisons with science fiction, the mathematical understanding of neural networks is a more mundane one. The neural network abstraction can be viewed as a modular approach of enabling learning algorithms that are based on continuous optimization on a computational graph of dependencies between the input and output. To be fair, this is not very different from traditional work in control theory; indeed, some of the methods used for optimization in control theory are strikingly similar to (and historically preceded) the most fundamental algorithms in neural networks. However, the large amounts of data available in recent years together with increased computational power have enabled experimentation with deeper architectures of these computational graphs than was previously possible. The resulting success has changed the broader perception of the potential of deep learning.

The chapters of the book are organized as follows:

1. The basics of neural networks: Chapter 1 discusses the basics of neural network design. Many traditional machine learning models can be understood as special cases of neural learning. Understanding the relationship between traditional machine learning and neural networks is the first step to understanding the latter. The simulation of various machine learning models with neural networks is provided in Chapter 2. This will give the analyst a feel of how neural networks push the envelope of traditional machine learning algorithms.

2. Fundamentals of neural networks: Although Chapters 1 and 2 provide an overview of the training methods for neural networks, a more detailed understanding of the training challenges is provided in Chapters 3 and 4. Chapters 5 and 6 present radial-basis function (RBF) networks and restricted Boltzmann machines.

3. Advanced topics in neural networks: A lot of the recent success of deep learning is a result of the specialized architectures for various domains, such as recurrent neural networks and convolutional neural networks. Chapters 7 and 8 discuss recurrent and convolutional neural networks. Several advanced topics like deep reinforcement learning, neural Turing mechanisms, and generative adversarial networks are discussed in Chapters 9 and 10.

We have taken care to include some of the “forgotten” architectures like RBF networks and Kohonen self-organizing maps because of their potential in many applications. The book is written for graduate students, researchers, and practitioners. Numerous exercises are available along with a solution manual to aid in classroom teaching. Where possible, an application-centric view is highlighted in order to give the reader a feel for the technology.

Throughout this book, a vector or a multidimensional data point is annotated with a bar, such as X̄ or ȳ. A vector or multidimensional point may be denoted by either small letters or capital letters, as long as it has a bar. Vector dot products are denoted by centered dots, such as X̄ · Ȳ. A matrix is denoted in capital letters without a bar, such as R. Throughout the book, the n × d matrix corresponding to the entire training data set is denoted by D, with n documents and d dimensions. The individual data points in D are therefore d-dimensional row vectors. On the other hand, vectors with one component for each data point are usually n-dimensional column vectors. An example is the n-dimensional column vector ȳ of class variables of n data points. An observed value y_i is distinguished from a predicted value ŷ_i by a circumflex at the top of the variable.

Yorktown Heights, NY, USA
Charu C. Aggarwal

Acknowledgments

I would like to thank my family for their love and support during the busy time spent in writing this book. I would also like to thank my manager Nagui Halim for his support during the writing of this book.

Several figures in this book have been provided by the courtesy of various individuals and institutions. The Smithsonian Institution made the image of the Mark I perceptron (cf. Figure 1.5) available at no cost. Saket Sathe provided the outputs in Chapter 7 for the tiny Shakespeare data set, based on code available/described in [233, 580]. Andrew Zisserman provided Figures 8.12 and 8.16 in the section on convolutional visualizations.

Another visualization of the feature maps in the convolution network (cf. Figure 8.15) was provided by Matthew Zeiler. NVIDIA provided Figure 9.10 on the convolutional neural network for self-driving cars in Chapter 9, and Sergey Levine provided the image on self-learning robots (cf. Figure 9.9) in the same chapter. Alec Radford provided Figure 10.8, which appears in Chapter 10. Alex Krizhevsky provided Figure 8.9(b) containing AlexNet.

This book has benefitted from significant feedback and several collaborations that I have had with numerous colleagues over the years. I would like to thank Quoc Le, Saket Sathe, Karthik Subbian, Jiliang Tang, and Suhang Wang for their feedback on various portions of this book. Shuai Zheng provided feedback on the section on regularized autoencoders in Chapter 4. I received feedback on the sections on autoencoders from Lei Cai and Hao Yuan.

Feedback on the chapter on convolutional neural networks was provided by Hongyang Gao, Shuiwang Ji, and Zhengyang Wang. Shuiwang Ji, Lei Cai, Zhengyang Wang, and Hao Yuan also reviewed Chapters 3 and 7, and suggested several edits. They also suggested the ideas of using Figures 8.6 and 8.7 for elucidating the convolution/deconvolution operations.

For their collaborations, I would like to thank Tarek F. Abdelzaher, Jinghui Chen, Jing Gao, Quanquan Gu, Manish Gupta, Jiawei Han, Alexander Hinneburg, Thomas Huang, Nan Li, Huan Liu, Ruoming Jin, Daniel Keim, Arijit Khan, Latifur Khan, Mohammad M. Masud, Jian Pei, Magda Procopiuc, Guojun Qi, Chandan Reddy, Saket Sathe, Jaideep Srivastava, Karthik Subbian, Yizhou Sun, Jiliang Tang, Min-Hsuan Tsai, Haixun Wang, Jianyong Wang, Min Wang, Suhang Wang, Joel Wolf, Xifeng Yan, Mohammed Zaki, ChengXiang Zhai, and Peixiang Zhao. I would also like to thank my advisor James B. Orlin for his guidance during my early years as a researcher.


I would like to thank Lata Aggarwal for helping me with some of the figures created using PowerPoint graphics in this book. My daughter, Sayani, was helpful in incorporating special effects (e.g., image color, contrast, and blurring) in several JPEG images used at various places in this book.

Contents

1 An Introduction to Neural Networks
  1.1 Introduction
    1.1.1 Humans Versus Computers: Stretching the Limits of Artificial Intelligence
  1.2 The Basic Architecture of Neural Networks
    1.2.1 Single Computational Layer: The Perceptron
      1.2.1.1 What Objective Function Is the Perceptron Optimizing?
      1.2.1.2 Relationship with Support Vector Machines
      1.2.1.3 Choice of Activation and Loss Functions
      1.2.1.4 Choice and Number of Output Nodes
      1.2.1.5 Choice of Loss Function
      1.2.1.6 Some Useful Derivatives of Activation Functions
    1.2.2 Multilayer Neural Networks
    1.2.3 The Multilayer Network as a Computational Graph
  1.3 Training a Neural Network with Backpropagation
  1.4 Practical Issues in Neural Network Training
    1.4.1 The Problem of Overfitting
      1.4.1.1 Regularization
      1.4.1.2 Neural Architecture and Parameter Sharing
      1.4.1.3 Early Stopping
      1.4.1.4 Trading Off Breadth for Depth
      1.4.1.5 Ensemble Methods
    1.4.2 The Vanishing and Exploding Gradient Problems
    1.4.3 Difficulties in Convergence
    1.4.4 Local and Spurious Optima
    1.4.5 Computational Challenges
  1.5 The Secrets to the Power of Function Composition
    1.5.1 The Importance of Nonlinear Activation
    1.5.2 Reducing Parameter Requirements with Depth
    1.5.3 Unconventional Neural Architectures
      1.5.3.1 Blurring the Distinctions Between Input, Hidden, and Output Layers
      1.5.3.2 Unconventional Operations and Sum-Product Networks
  1.6 Common Neural Architectures
    1.6.1 Simulating Basic Machine Learning with Shallow Models
    1.6.2 Radial Basis Function Networks
    1.6.3 Restricted Boltzmann Machines
    1.6.4 Recurrent Neural Networks
    1.6.5 Convolutional Neural Networks
    1.6.6 Hierarchical Feature Engineering and Pretrained Models
  1.7 Advanced Topics
    1.7.1 Reinforcement Learning
    1.7.2 Separating Data Storage and Computations
    1.7.3 Generative Adversarial Networks
  1.8 Two Notable Benchmarks
    1.8.1 The MNIST Database of Handwritten Digits
    1.8.2 The ImageNet Database
  1.9 Summary
  1.10 Bibliographic Notes
    1.10.1 Video Lectures
    1.10.2 Software Resources
  1.11 Exercises

2 Machine Learning with Shallow Neural Networks
  2.1 Introduction
  2.2 Neural Architectures for Binary Classification Models
    2.2.1 Revisiting the Perceptron
    2.2.2 Least-Squares Regression
      2.2.2.1 Widrow-Hoff Learning
      2.2.2.2 Closed Form Solutions
    2.2.3 Logistic Regression
      2.2.3.1 Alternative Choices of Activation and Loss
    2.2.4 Support Vector Machines
  2.3 Neural Architectures for Multiclass Models
    2.3.1 Multiclass Perceptron
    2.3.2 Weston-Watkins SVM
    2.3.3 Multinomial Logistic Regression (Softmax Classifier)
    2.3.4 Hierarchical Softmax for Many Classes
  2.4 Backpropagated Saliency for Feature Selection
  2.5 Matrix Factorization with Autoencoders
    2.5.1 Autoencoder: Basic Principles
      2.5.1.1 Autoencoder with a Single Hidden Layer
      2.5.1.2 Connections with Singular Value Decomposition
      2.5.1.3 Sharing Weights in Encoder and Decoder
      2.5.1.4 Other Matrix Factorization Methods
    2.5.2 Nonlinear Activations
    2.5.3 Deep Autoencoders
    2.5.4 Application to Outlier Detection
    2.5.5 When the Hidden Layer Is Broader than the Input Layer
      2.5.5.1 Sparse Feature Learning
    2.5.6 Other Applications
    2.5.7 Recommender Systems: Row Index to Row Value Prediction
    2.5.8 Discussion
  2.6 Word2vec: An Application of Simple Neural Architectures
    2.6.1 Neural Embedding with Continuous Bag of Words
    2.6.2 Neural Embedding with Skip-Gram Model
    2.6.3 Word2vec (SGNS) Is Logistic Matrix Factorization
    2.6.4 Vanilla Skip-Gram Is Multinomial Matrix Factorization
  2.7 Simple Neural Architectures for Graph Embeddings
    2.7.1 Handling Arbitrary Edge Counts
    2.7.2 Multinomial Model
    2.7.3 Connections with DeepWalk and Node2vec
  2.8 Summary
  2.9 Bibliographic Notes
    2.9.1 Software Resources
  2.10 Exercises

3 Training Deep Neural Networks
  3.1 Introduction
  3.2 Backpropagation: The Gory Details
    3.2.1 Backpropagation with the Computational Graph Abstraction
    3.2.2 Dynamic Programming to the Rescue
    3.2.3 Backpropagation with Post-Activation Variables
    3.2.4 Backpropagation with Pre-activation Variables
    3.2.5 Examples of Updates for Various Activations
      3.2.5.1 The Special Case of Softmax
    3.2.6 A Decoupled View of Vector-Centric Backpropagation
    3.2.7 Loss Functions on Multiple Output Nodes and Hidden Nodes
    3.2.8 Mini-Batch Stochastic Gradient Descent
    3.2.9 Backpropagation Tricks for Handling Shared Weights
    3.2.10 Checking the Correctness of Gradient Computation
  3.3 Setup and Initialization Issues
    3.3.1 Tuning Hyperparameters
    3.3.2 Feature Preprocessing
    3.3.3 Initialization
  3.4 The Vanishing and Exploding Gradient Problems
    3.4.1 Geometric Understanding of the Effect of Gradient Ratios
    3.4.2 A Partial Fix with Activation Function Choice
    3.4.3 Dying Neurons and “Brain Damage”
      3.4.3.1 Leaky ReLU
      3.4.3.2 Maxout
  3.5 Gradient-Descent Strategies
    3.5.1 Learning Rate Decay
    3.5.2 Momentum-Based Learning
      3.5.2.1 Nesterov Momentum
    3.5.3 Parameter-Specific Learning Rates
      3.5.3.1 AdaGrad
      3.5.3.2 RMSProp
      3.5.3.3 RMSProp with Nesterov Momentum
      3.5.3.4 AdaDelta
      3.5.3.5 Adam
    3.5.4 Cliffs and Higher-Order Instability
    3.5.5 Gradient Clipping
    3.5.6 Second-Order Derivatives
      3.5.6.1 Conjugate Gradients and Hessian-Free Optimization
      3.5.6.2 Quasi-Newton Methods and BFGS
      3.5.6.3 Problems with Second-Order Methods: Saddle Points
    3.5.7 Polyak Averaging
    3.5.8 Local and Spurious Minima
  3.6 Batch Normalization
  3.7 Practical Tricks for Acceleration and Compression
    3.7.1 GPU Acceleration
    3.7.2 Parallel and Distributed Implementations
    3.7.3 Algorithmic Tricks for Model Compression
  3.8 Summary
  3.9 Bibliographic Notes
    3.9.1 Software Resources
  3.10 Exercises

4 Teaching Deep Learners to Generalize
  4.1 Introduction
  4.2 The Bias-Variance Trade-Off
    4.2.1 Formal View
  4.3 Generalization Issues in Model Tuning and Evaluation
    4.3.1 Evaluating with Hold-Out and Cross-Validation
    4.3.2 Issues with Training at Scale
    4.3.3 How to Detect Need to Collect More Data
  4.4 Penalty-Based Regularization
    4.4.1 Connections with Noise Injection
    4.4.2 L1-Regularization
    4.4.3 L1- or L2-Regularization?
    4.4.4 Penalizing Hidden Units: Learning Sparse Representations
  4.5 Ensemble Methods
    4.5.1 Bagging and Subsampling
    4.5.2 Parametric Model Selection and Averaging
    4.5.3 Randomized Connection Dropping
    4.5.4 Dropout
    4.5.5 Data Perturbation Ensembles
  4.6 Early Stopping
    4.6.1 Understanding Early Stopping from the Variance Perspective
  4.7 Unsupervised Pretraining
    4.7.1 Variations of Unsupervised Pretraining
    4.7.2 What About Supervised Pretraining?
  4.8 Continuation and Curriculum Learning
    4.8.1 Continuation Learning
    4.8.2 Curriculum Learning
  4.9 Parameter Sharing
  4.10 Regularization in Unsupervised Applications
    4.10.1 Value-Based Penalization: Sparse Autoencoders
    4.10.2 Noise Injection: De-noising Autoencoders
    4.10.3 Gradient-Based Penalization: Contractive Autoencoders
    4.10.4 Hidden Probabilistic Structure: Variational Autoencoders
      4.10.4.1 Reconstruction and Generative Sampling
      4.10.4.2 Conditional Variational Autoencoders
      4.10.4.3 Relationship with Generative Adversarial Networks
  4.11 Summary
  4.12 Bibliographic Notes
    4.12.1 Software Resources
  4.13 Exercises

5 Radial Basis Function Networks
  5.1 Introduction
  5.2 Training an RBF Network
    5.2.1 Training the Hidden Layer
    5.2.2 Training the Output Layer
      5.2.2.1 Expression with Pseudo-Inverse
    5.2.3 Orthogonal Least-Squares Algorithm
    5.2.4 Fully Supervised Learning
  5.3 Variations and Special Cases of RBF Networks
    5.3.1 Classification with Perceptron Criterion
    5.3.2 Classification with Hinge Loss
    5.3.3 Example of Linear Separability Promoted by RBF
    5.3.4 Application to Interpolation
  5.4 Relationship with Kernel Methods
    5.4.1 Kernel Regression as a Special Case of RBF Networks
    5.4.2 Kernel SVM as a Special Case of RBF Networks
    5.4.3 Observations
  5.5 Summary
  5.6 Bibliographic Notes
  5.7 Exercises

6 Restricted Boltzmann Machines
  6.1 Introduction
    6.1.1 Historical Perspective
  6.2 Hopfield Networks
    6.2.1 Optimal State Configurations of a Trained Network
    6.2.2 Training a Hopfield Network
    6.2.3 Building a Toy Recommender and Its Limitations
    6.2.4 Increasing the Expressive Power of the Hopfield Network
  6.3 The Boltzmann Machine
    6.3.1 How a Boltzmann Machine Generates Data
    6.3.2 Learning the Weights of a Boltzmann Machine
  6.4 Restricted Boltzmann Machines
    6.4.1 Training the RBM
    6.4.2 Contrastive Divergence Algorithm
    6.4.3 Practical Issues and Improvisations
  6.5 Applications of Restricted Boltzmann Machines
    6.5.1 Dimensionality Reduction and Data Reconstruction
    6.5.2 RBMs for Collaborative Filtering
    6.5.3 Using RBMs for Classification
    6.5.4 Topic Models with RBMs
    6.5.5 RBMs for Machine Learning with Multimodal Data
  6.6 Using RBMs Beyond Binary Data Types
  6.7 Stacking Restricted Boltzmann Machines
    6.7.1 Unsupervised Learning
    6.7.2 Supervised Learning
    6.7.3 Deep Boltzmann Machines and Deep Belief Networks
  6.8 Summary
  6.9 Bibliographic Notes
  6.10 Exercises

7 Recurrent Neural Networks
  7.1 Introduction
    7.1.1 Expressiveness of Recurrent Networks
  7.2 The Architecture of Recurrent Neural Networks
    7.2.1 Language Modeling Example of RNN
      7.2.1.1 Generating a Language Sample
    7.2.2 Backpropagation Through Time
    7.2.3 Bidirectional Recurrent Networks
    7.2.4 Multilayer Recurrent Networks
  7.3 The Challenges of Training Recurrent Networks
    7.3.1 Layer Normalization
  7.4 Echo-State Networks
  7.5 Long Short-Term Memory (LSTM)
  7.6 Gated Recurrent Units (GRUs)
  7.7 Applications of Recurrent Neural Networks
    7.7.1 Application to Automatic Image Captioning
    7.7.2 Sequence-to-Sequence Learning and Machine Translation
      7.7.2.1 Question-Answering Systems
    7.7.3 Application to Sentence-Level Classification
    7.7.4 Token-Level Classification with Linguistic Features
    7.7.5 Time-Series Forecasting and Prediction
    7.7.6 Temporal Recommender Systems
    7.7.7 Secondary Protein Structure Prediction
    7.7.8 End-to-End Speech Recognition
    7.7.9 Handwriting Recognition
  7.8 Summary
  7.9 Bibliographic Notes
    7.9.1 Software Resources
  7.10 Exercises

8 Convolutional Neural Networks
  8.1 Introduction
    8.1.1 Historical Perspective and Biological Inspiration
    8.1.2 Broader Observations About Convolutional Neural Networks
  8.2 The Basic Structure of a Convolutional Network
    8.2.1 Padding
    8.2.2 Strides
    8.2.3 Typical Settings
    8.2.4 The ReLU Layer
    8.2.5 Pooling
    8.2.6 Fully Connected Layers
    8.2.7 The Interleaving Between Layers
    8.2.8 Local Response Normalization
    8.2.9 Hierarchical Feature Engineering
  8.3 Training a Convolutional Network
    8.3.1 Backpropagating Through Convolutions
    8.3.2 Backpropagation as Convolution with Inverted/Transposed Filter
    8.3.3 Convolution/Backpropagation as Matrix Multiplications
    8.3.4 Data Augmentation
  8.4 Case Studies of Convolutional Architectures
    8.4.1 AlexNet
    8.4.2 ZFNet
    8.4.3 VGG
    8.4.4 GoogLeNet
    8.4.5 ResNet
    8.4.6 The Effects of Depth
    8.4.7 Pretrained Models
  8.5 Visualization and Unsupervised Learning
    8.5.1 Visualizing the Features of a Trained Network
    8.5.2 Convolutional Autoencoders
  8.6 Applications of Convolutional Networks
    8.6.1 Content-Based Image Retrieval
    8.6.2 Object Localization
    8.6.3 Object Detection
    8.6.4 Natural Language and Sequence Learning
    8.6.5 Video Classification
  8.7 Summary
  8.8 Bibliographic Notes
    8.8.1 Software Resources and Data Sets
  8.9 Exercises

9 Deep Reinforcement Learning
  9.1 Introduction
  9.2 Stateless Algorithms: Multi-Armed Bandits
    9.2.1 Naïve Algorithm
    9.2.2 ε-Greedy Algorithm
    9.2.3 Upper Bounding Methods
  9.3 The Basic Framework of Reinforcement Learning
    9.3.1 Challenges of Reinforcement Learning
    9.3.2 Simple Reinforcement Learning for Tic-Tac-Toe
    9.3.3 Role of Deep Learning and a Straw-Man Algorithm
  9.4 Bootstrapping for Value Function Learning
    9.4.1 Deep Learning Models as Function Approximators
    9.4.2 Example: Neural Network for Atari Setting
    9.4.3 On-Policy Versus Off-Policy Methods: SARSA
    9.4.4 Modeling States Versus State-Action Pairs
  9.5 Policy Gradient Methods
    9.5.1 Finite Difference Methods
    9.5.2 Likelihood Ratio Methods
    9.5.3 Combining Supervised Learning with Policy Gradients
    9.5.4 Actor-Critic Methods
    9.5.5 Continuous Action Spaces
    9.5.6 Advantages and Disadvantages of Policy Gradients
  9.6 Monte Carlo Tree Search
  9.7 Case Studies
    9.7.1 AlphaGo: Championship Level Play at Go
      9.7.1.1 Alpha Zero: Enhancements to Zero Human Knowledge
    9.7.2 Self-Learning Robots
      9.7.2.1 Deep Learning of Locomotion Skills
      9.7.2.2 Deep Learning of Visuomotor Skills
    9.7.3 Building Conversational Systems: Deep Learning for Chatbots
    9.7.4 Self-Driving Cars
    9.7.5 Inferring Neural Architectures with Reinforcement Learning
  9.8 Practical Challenges Associated with Safety
  9.9 Summary
  9.10 Bibliographic Notes
    9.10.1 Software Resources and Testbeds
  9.11 Exercises

10 Advanced Topics in Deep Learning
  10.1 Introduction
  10.2 Attention Mechanisms
    10.2.1 Recurrent Models of Visual Attention
      10.2.1.1 Application to Image Captioning
    10.2.2 Attention Mechanisms for Machine Translation
  10.3 Neural Networks with External Memory
    10.3.1 A Fantasy Video Game: Sorting by Example
      10.3.1.1 Implementing Swaps with Memory Operations
    10.3.2 Neural Turing Machines
    10.3.3 Differentiable Neural Computer: A Brief Overview
  10.4 Generative Adversarial Networks (GANs)
    10.4.1 Training a Generative Adversarial Network
    10.4.2 Comparison with Variational Autoencoder
    10.4.3 Using GANs for Generating Image Data
    10.4.4 Conditional Generative Adversarial Networks
  10.5 Competitive Learning
    10.5.1 Vector Quantization
    10.5.2 Kohonen Self-Organizing Map
  10.6 Limitations of Neural Networks
    10.6.1 An Aspirational Goal: One-Shot Learning
    10.6.2 An Aspirational Goal: Energy-Efficient Learning
  10.7 Summary
  10.8 Bibliographic Notes
    10.8.1 Software Resources
  10.9 Exercises

Bibliography

Index

About the Author

Charu C. Aggarwal is a Distinguished Research Staff Member (DRSM) at the IBM T. J. Watson Research Center in Yorktown Heights, New York. He completed his undergraduate degree in Computer Science from the Indian Institute of Technology at Kanpur in 1993 and his Ph.D. from the Massachusetts Institute of Technology in 1996.

He has worked extensively in the field of data mining. He has published more than 350 papers in refereed conferences and journals and authored over 80 patents. He is the author or editor of 18 books, including textbooks on data mining, recommender systems, and outlier analysis. Because of the commercial value of his patents, he has thrice been designated a Master Inventor at IBM. He is a recipient of an IBM Corporate Award (2003) for his work on bio-terrorist threat detection in data streams, a recipient of the IBM Outstanding Innovation Award (2008) for his scientific contributions to privacy technology, and a recipient of two IBM Outstanding Technical Achievement Awards (2009, 2015) for his work on data streams/high-dimensional data. He received the EDBT 2014 Test of Time Award for his work on condensation-based privacy-preserving data mining. He is also a recipient of the IEEE ICDM Research Contributions Award (2015), which is one of the two highest awards for influential research contributions in the field of data mining.

He has served as the general co-chair of the IEEE Big Data Conference (2014) and as the program co-chair of the ACM CIKM Conference (2015), the IEEE ICDM Conference (2015), and the ACM KDD Conference (2016). He served as an associate editor of the IEEE Transactions on Knowledge and Data Engineering from 2004 to 2008. He is an associate editor of the IEEE Transactions on Big Data, an action editor of the Data Mining and Knowledge Discovery Journal, and an associate editor of the Knowledge and Information Systems Journal. He serves as the editor-in-chief of the ACM Transactions on Knowledge Discovery from Data as well as the ACM SIGKDD Explorations. He serves on the advisory board of the Lecture Notes on Social Networks, a publication by Springer. He has served as the vice-president of the SIAM Activity Group on Data Mining and is a member of the SIAM industry committee. He is a fellow of the SIAM, ACM, and the IEEE, for “contributions to knowledge discovery and data mining algorithms.”


1 An Introduction to Neural Networks

“Thou shalt not make a machine to counterfeit a human mind.”—Frank Herbert

1.1 Introduction

Artificial neural networks are popular machine learning techniques that simulate the mechanism of learning in biological organisms. The human nervous system contains cells, which are referred to as neurons. The neurons are connected to one another with the use of axons and dendrites, and the connecting regions between axons and dendrites are referred to as synapses. These connections are illustrated in Figure 1.1(a). The strengths of synaptic connections often change in response to external stimuli. This change is how learning takes place in living organisms.

This biological mechanism is simulated in artificial neural networks, which contain computation units that are referred to as neurons. Throughout this book, we will use the term “neural networks” to refer to artificial neural networks rather than biological ones.

Figure 1.1: The synaptic connections between neurons. (a) Biological neural network; (b) artificial neural network. The image in (a) is from “The Brain: Understanding Neurobiology Through the Study of Addiction” [598]. Copyright © 2000 by BSCS & Videodiscovery. All rights reserved. Used with permission.


The computational units are connected to one another through weights, which serve the same role as the strengths of synaptic connections in biological organisms. Each input to a neuron is scaled with a weight, which affects the function computed at that unit. This architecture is illustrated in Figure 1.1(b). An artificial neural network computes a function of the inputs by propagating the computed values from the input neurons to the output neuron(s) and using the weights as intermediate parameters. Learning occurs by changing the weights connecting the neurons. Just as external stimuli are needed for learning in biological organisms, the external stimulus in artificial neural networks is provided by the training data containing examples of input-output pairs of the function to be learned. For example, the training data might contain pixel representations of images (input) and their annotated labels (e.g., carrot, banana) as the output. These training data pairs are fed into the neural network by using the input representations to make predictions about the output labels. The training data provides feedback about the correctness of the weights in the neural network depending on how well the predicted output (e.g., probability of carrot) for a particular input matches the annotated output label in the training data. One can view the errors made by the neural network in the computation of a function as a kind of unpleasant feedback in a biological organism, leading to an adjustment in the synaptic strengths. Similarly, the weights between neurons are adjusted in a neural network in response to prediction errors. The goal of changing the weights is to modify the computed function to make the predictions more correct in future iterations. Therefore, the weights are changed carefully in a mathematically justified way so as to reduce the error in computation on that example. By successively adjusting the weights between neurons over many input-output pairs, the function computed by the neural network is refined over time so that it provides more accurate predictions. Therefore, if the neural network is trained with many different images of bananas, it will eventually be able to properly recognize a banana in an image it has not seen before. This ability to accurately compute functions of unseen inputs by training over a finite set of input-output pairs is referred to as model generalization. The primary usefulness of all machine learning models is gained from their ability to generalize their learning from seen training data to unseen examples.

The biological comparison is often criticized as a very poor caricature of the workings of the human brain; nevertheless, the principles of neuroscience have often been useful in designing neural network architectures. A different view is that neural networks are built as higher-level abstractions of the classical models that are commonly used in machine learning. In fact, the most basic units of computation in the neural network are inspired by traditional machine learning algorithms like least-squares regression and logistic regression. Neural networks gain their power by putting together many such basic units, and learning the weights of the different units jointly in order to minimize the prediction error. From this point of view, a neural network can be viewed as a computational graph of elementary units in which greater power is gained by connecting them in particular ways. When a neural network is used in its most basic form, without hooking together multiple units, the learning algorithms often reduce to classical machine learning models (see Chapter 2). The real power of a neural model over classical methods is unleashed when these elementary computational units are combined, and the weights of the elementary models are trained using their dependencies on one another. By combining multiple units, one is increasing the power of the model to learn more complicated functions of the data than are inherent in the elementary models of basic machine learning. The way in which these units are combined also plays a role in the power of the architecture, and requires some understanding and insight from the analyst. Furthermore, sufficient training data is also required in order to learn the larger number of weights in these expanded computational graphs.

Figure 1.2: An illustrative comparison of the accuracy of a typical machine learning algorithm with that of a large neural network (accuracy as a function of the amount of data). Deep learners become more attractive than conventional methods primarily when sufficient data/computational power is available. Recent years have seen an increase in data availability and computational power, which has led to a “Cambrian explosion” in deep learning technology.

1.1.1 Humans Versus Computers: Stretching the Limits of Artificial Intelligence

Humans and computers are inherently suited to different types of tasks. For example, computing the cube root of a large number is very easy for a computer, but it is extremely difficult for humans. On the other hand, a task such as recognizing the objects in an image is a simple matter for a human, but has traditionally been very difficult for an automated learning algorithm. It is only in recent years that deep learning has shown an accuracy on some of these tasks that exceeds that of a human. In fact, the recent results by deep learning algorithms that surpass human performance [184] in (some narrow tasks on) image recognition would not have been considered likely by most computer vision experts as recently as 10 years ago.

Many deep learning architectures that have shown such extraordinary performance are not created by indiscriminately connecting computational units. The superior performance of deep neural networks mirrors the fact that biological neural networks gain much of their power from depth as well. Furthermore, biological networks are connected in ways we do not fully understand. In the few cases that the biological structure is understood at some level, significant breakthroughs have been achieved by designing artificial neural networks along those lines. A classical example of this type of architecture is the use of the convolutional neural network for image recognition. This architecture was inspired by Hubel and Wiesel’s experiments [212] in 1959 on the organization of the neurons in the cat’s visual cortex. The precursor to the convolutional neural network was the neocognitron [127], which was directly based on these results.

The human neuronal connection structure has evolved over millions of years to optimize survival-driven performance; survival is closely related to our ability to merge sensation and intuition in a way that is currently not possible with machines. Biological neuroscience [232] is a field that is still very much in its infancy, and only a limited amount is known about how the brain truly works. Therefore, it is fair to suggest that the biologically inspired success of convolutional neural networks might be replicated in other settings, as we learn more about how the human brain works [176]. A key advantage of neural networks over traditional machine learning is that the former provides a higher-level abstraction of expressing semantic insights about data domains by architectural design choices in the computational graph. The second advantage is that neural networks provide a simple way to adjust the complexity of a model by adding or removing neurons from the architecture according to the availability of training data or computational power. A large part of the recent success of neural networks is explained by the fact that the increased data availability and computational power of modern computers has outgrown the limits of traditional machine learning algorithms, which fail to take full advantage of what is now possible. This situation is illustrated in Figure 1.2. The performance of traditional machine learning remains better at times for smaller data sets because of more choices, greater ease of model interpretation, and the tendency to hand-craft interpretable features that incorporate domain-specific insights. With limited data, the best of a very wide diversity of models in machine learning will usually perform better than a single class of models (like neural networks). This is one reason why the potential of neural networks was not realized in the early years.

The “big data” era has been enabled by the advances in data collection technology; virtually everything we do today, including purchasing an item, using the phone, or clicking on a site, is collected and stored somewhere. Furthermore, the development of powerful Graphics Processor Units (GPUs) has enabled increasingly efficient processing on such large data sets. These advances largely explain the recent success of deep learning using algorithms that are only slightly adjusted from the versions that were available two decades back.

Furthermore, these recent adjustments to the algorithms have been enabled by increased speed of computation, because reduced run-times enable efficient testing (and subsequent algorithmic adjustment). If it requires a month to test an algorithm, at most twelve variations can be tested in a year on a single hardware platform. This situation has historically constrained the intensive experimentation required for tweaking neural-network learning algorithms. The rapid advances associated with the three pillars of improved data, computation, and experimentation have resulted in an increasingly optimistic outlook about the future of deep learning. By the end of this century, it is expected that computers will have the power to train neural networks with as many neurons as the human brain. Although it is hard to predict what the true capabilities of artificial intelligence will be by then, our experience with computer vision should prepare us to expect the unexpected.

Chapter Organization

This chapter is organized as follows. The next section introduces single-layer and multi-layer networks. The different types of activation functions, output nodes, and loss functions are discussed. The backpropagation algorithm is introduced in Section 1.3. Practical issues in neural network training are discussed in Section 1.4. Some key points on how neural networks gain their power with specific choices of activation functions are discussed in Section 1.5. The common architectures used in neural network design are discussed in Section 1.6. Advanced topics in deep learning are discussed in Section 1.7. Some notable benchmarks used by the deep learning community are discussed in Section 1.8. A summary is provided in Section 1.9.

1.2 The Basic Architecture of Neural Networks

In this section, we will introduce single-layer and multi-layer neural networks. In the single-layer network, a set of inputs is directly mapped to an output by using a generalized variation of a linear function. This simple instantiation of a neural network is also referred to as the perceptron. In multi-layer neural networks, the neurons are arranged in layered fashion, in which the input and output layers are separated by a group of hidden layers. This layer-wise architecture of the neural network is also referred to as a feed-forward network. This section will discuss both single-layer and multi-layer networks.

Figure 1.3: The basic architecture of the perceptron. (a) Perceptron without bias; (b) perceptron with bias.

1.2.1 Single Computational Layer: The Perceptron

The simplest neural network is referred to as the perceptron. This neural network contains a single input layer and an output node. The basic architecture of the perceptron is shown in Figure 1.3(a). Consider a situation where each training instance is of the form (X, y), where each X = [x_1, . . . , x_d] contains d feature variables, and y ∈ {−1, +1} contains the observed value of the binary class variable. By “observed value” we refer to the fact that it is given to us as a part of the training data, and our goal is to predict the class variable for cases in which it is not observed. For example, in a credit-card fraud detection application, the features might represent various properties of a set of credit card transactions (e.g., amount and frequency of transactions), and the class variable might represent whether or not this set of transactions is fraudulent. Clearly, in this type of application, one would have historical cases in which the class variable is observed, and other (current) cases in which the class variable has not yet been observed but needs to be predicted.

The input layer contains d nodes that transmit the d features X = [x_1 . . . x_d] with edges of weight W = [w_1 . . . w_d] to an output node. The input layer does not perform any computation in its own right. The linear function W · X = Σ_{i=1}^d w_i x_i is computed at the output node. Subsequently, the sign of this real value is used in order to predict the dependent variable of X. Therefore, the prediction ŷ is computed as follows:

ŷ = sign{W · X} = sign{ Σ_{j=1}^d w_j x_j }    (1.1)

The sign function maps a real value to either +1 or −1, which is appropriate for binary classification. Note the circumflex on top of the variable y to indicate that it is a predicted value rather than an observed value. The error of the prediction is therefore E(X) = y − ŷ, which is one of the values drawn from the set {−2, 0, +2}. In cases where the error value E(X) is nonzero, the weights in the neural network need to be updated in the (negative) direction of the error gradient. As we will see later, this process is similar to that used in various types of linear models in machine learning. In spite of the similarity of the perceptron with respect to traditional machine learning models, its interpretation as a computational unit is very useful because it allows us to put together multiple units in order to create far more powerful models than are available in traditional machine learning.
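The prediction rule of Equation 1.1 is simple enough to state directly in code. The following is a minimal NumPy sketch (my own illustration, not from the book); the convention of mapping a dot product of exactly zero to +1 is an assumption, since the text leaves sign(0) unspecified.

```python
import numpy as np

def perceptron_predict(W, X):
    """Prediction of Equation 1.1: y_hat = sign(W . X).

    W, X: d-dimensional NumPy arrays.
    Returns +1 or -1 (ties at W . X == 0 broken toward +1 by assumption).
    """
    return 1.0 if np.dot(W, X) >= 0 else -1.0

# Example with d = 3 features and hand-picked weights.
W = np.array([0.5, -1.0, 0.25])
X = np.array([1.0, 0.2, 2.0])
print(perceptron_predict(W, X))  # 1.0, since W . X = 0.8 >= 0
```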


The architecture of the perceptron is shown in Figure 1.3(a), in which a single input layer transmits the features to the output node. The edges from the input to the output contain the weights w_1 . . . w_d with which the features are multiplied and added at the output node. Subsequently, the sign function is applied in order to convert the aggregated value into a class label. The sign function serves the role of an activation function. Different choices of activation functions can be used to simulate different types of models used in machine learning, like least-squares regression with numeric targets, the support vector machine, or a logistic regression classifier. Most of the basic machine learning models can be easily represented as simple neural network architectures. It is a useful exercise to model traditional machine learning techniques as neural architectures, because it provides a clearer picture of how deep learning generalizes traditional machine learning. This point of view is explored in detail in Chapter 2. It is noteworthy that the perceptron contains two layers, although the input layer does not perform any computation and only transmits the feature values. The input layer is not included in the count of the number of layers in a neural network. Since the perceptron contains a single computational layer, it is considered a single-layer network.
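As an aside, the activation choices mentioned above are easy to write down. The snippet below is an illustrative sketch (the names are my own); the precise correspondence between each activation and a classical model is developed in Chapter 2.

```python
import numpy as np

# Illustrative activation functions, applied to the aggregated value W . X:
sign_activation     = lambda v: np.where(v >= 0, 1.0, -1.0)  # perceptron-style binary output
identity_activation = lambda v: v                            # numeric targets (least-squares regression)
sigmoid_activation  = lambda v: 1.0 / (1.0 + np.exp(-v))     # probabilistic output (logistic regression)
```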

In many settings, there is an invariant part of the prediction, which is referred to as the bias. For example, consider a setting in which the feature variables are mean centered, but the mean of the binary class prediction from {−1, +1} is not 0. This will tend to occur in situations in which the binary class distribution is highly imbalanced. In such a case, the aforementioned approach is not sufficient for prediction. We need to incorporate an additional bias variable b that captures this invariant part of the prediction:

ŷ = sign{W · X + b} = sign{ Σ_{j=1}^d w_j x_j + b }    (1.2)

The bias can be incorporated as the weight of an edge by using a bias neuron. This is achieved by adding a neuron that always transmits a value of 1 to the output node. The weight of the edge connecting the bias neuron to the output node provides the bias variable. An example of a bias neuron is shown in Figure 1.3(b). Another approach that works well with single-layer architectures is to use a feature engineering trick in which an additional feature is created with a constant value of 1. The coefficient of this feature provides the bias, and one can then work with Equation 1.1. Throughout this book, biases will not be explicitly used (for simplicity in architectural representations) because they can be incorporated with bias neurons. The details of the training algorithms remain the same by simply treating the bias neurons like any other neuron with a fixed activation value of 1. Therefore, the following will work with the predictive assumption of Equation 1.1, which does not explicitly use biases.
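The feature engineering trick is a one-liner in practice. The sketch below (my own illustration, assuming NumPy) appends a constant-1 column to the n × d data matrix D, so that the last coordinate of the weight vector plays the role of b and Equation 1.1 applies unchanged.

```python
import numpy as np

def add_bias_feature(D):
    """Append a constant feature of 1 to each row of the n x d matrix D."""
    return np.hstack([D, np.ones((D.shape[0], 1))])

D = np.array([[0.5, -1.2],
              [2.0,  0.3]])
D_aug = add_bias_feature(D)  # shape (2, 3); the last weight now acts as the bias b
```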

At the time that the perceptron algorithm was proposed by Rosenblatt [405], these optimizations were performed in a heuristic way with actual hardware circuits, and it was not presented in terms of a formal notion of optimization in machine learning (as is common today). However, the goal was always to minimize the error in prediction, even if a formal optimization formulation was not presented. The perceptron algorithm was, therefore, heuristically designed to minimize the number of misclassifications, and convergence proofs were available that provided correctness guarantees of the learning algorithm in simplified settings. Therefore, we can still write the (heuristically motivated) goal of the perceptron algorithm in least-squares form with respect to all training instances in a data set D containing feature-label pairs:

Minimize_W L = Σ_{(X,y)∈D} (y − ŷ)² = Σ_{(X,y)∈D} (y − sign{W · X})²

This type of minimization objective function is also referred to as a loss function. As we will see later, almost all neural network learning algorithms are formulated with the use of a loss function. As we will learn in Chapter 2, this loss function looks a lot like least-squares regression. However, the latter is defined for continuous-valued target variables, and the corresponding loss is a smooth and continuous function of the variables. On the other hand, for the least-squares form of the objective function, the sign function is non-differentiable, with step-like jumps at specific points. Furthermore, the sign function takes on constant values over large portions of the domain, and therefore the exact gradient takes on zero values at differentiable points. This results in a staircase-like loss surface, which is not suitable for gradient descent. The perceptron algorithm (implicitly) uses a smooth approximation of the gradient of this objective function with respect to each example:

∇L_smooth = Σ_{(X,y)∈D} (y − ŷ) X    (1.3)

Note that the above gradient is not a true gradient of the staircase-like surface of the (heuristic) objective function, which does not provide useful gradients. Therefore, the staircase is smoothed out into a sloping surface defined by the perceptron criterion. The properties of the perceptron criterion will be described in Section 1.2.1.1. It is noteworthy that concepts like the “perceptron criterion” were proposed later than the original paper by Rosenblatt [405] in order to explain the heuristic gradient-descent steps. For now, we will assume that the perceptron algorithm optimizes some unknown smooth function with the use of gradient descent.

Although the above objective function is defined over the entire training data, the training algorithm of neural networks works by feeding each input data instance X into the network one by one (or in small batches) to create the prediction ŷ. The weights are then updated, based on the error value E(X) = (y − ŷ). Specifically, when the data point X is fed into the network, the weight vector W is updated as follows:

W ⇐ W + α(y − ŷ)X    (1.4)

The parameter α regulates the learning rate of the neural network. The perceptron algorithm repeatedly cycles through all the training examples in random order and iteratively adjusts the weights until convergence is reached. A single training data point may be cycled through many times. Each such cycle is referred to as an epoch. One can also write the gradient-descent update in terms of the error E(X) = (y − ŷ) as follows:

W ⇐ W + αE(X)X    (1.5)

The basic perceptron algorithm can be considered a stochastic gradient-descent method, which implicitly minimizes the squared error of prediction by performing gradient-descent updates with respect to randomly chosen training points. The assumption is that the neural network cycles through the points in random order during training and changes the weights with the goal of reducing the prediction error on that point. It is easy to see from Equation 1.5 that non-zero updates are made to the weights only when y ≠ ŷ, which occurs only when errors are made in prediction.
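Putting Equations 1.1 and 1.5 together gives the full training loop. The following is a sketch under stated assumptions (NumPy, zero initialization, a fixed epoch budget rather than a convergence test, and the sign(0) = +1 tie-breaking convention used earlier); it is an illustration rather than the book's reference implementation.

```python
import numpy as np

def train_perceptron(D, y, alpha=1.0, epochs=100, seed=0):
    """Train a perceptron with the update of Equation 1.5.

    D: n x d matrix whose rows are training points; y: labels in {-1, +1}.
    Cycles through the points in random order each epoch; a non-zero
    update occurs only when a point is misclassified (y != y_hat).
    """
    rng = np.random.default_rng(seed)
    n, d = D.shape
    W = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            y_hat = 1.0 if np.dot(W, D[i]) >= 0 else -1.0
            error = y[i] - y_hat            # E(X), drawn from {-2, 0, +2}
            W += alpha * error * D[i]       # Equation 1.5
    return W
```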


In mini-batch stochastic gradient descent, the aforementioned updates of Equation 1.5 are implemented over a randomly chosen subset of training points S:

W ⇐ W + α Σ_{X∈S} E(X) X    (1.6)
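In code, the mini-batch update of Equation 1.6 vectorizes naturally. A minimal sketch, assuming the same NumPy conventions as the loop above:

```python
import numpy as np

def minibatch_update(W, D_batch, y_batch, alpha):
    """One update of Equation 1.6 over a sampled subset S of training points.

    D_batch: |S| x d matrix of the points in S; y_batch: their labels.
    """
    y_hat = np.where(D_batch @ W >= 0, 1.0, -1.0)
    errors = y_batch - y_hat                # E(X) for each point in S
    return W + alpha * (errors @ D_batch)   # adds alpha * sum over S of E(X) * X
```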

Figure 1.4: Examples of linearly separable and inseparable data in two classes (a separating hyperplane W · X = 0 exists only in the former case)

The advantages of using mini-batch stochastic gradient descent are discussed in Section 3.2.8 of Chapter 3. An interesting quirk of the perceptron is that it is possible to set the learning rate α to 1, because the learning rate only scales the weights.

The type of model proposed in the perceptron is a linear model, in which the equation W · X = 0 defines a linear hyperplane. Here, W = (w_1 . . . w_d) is a d-dimensional vector that is normal to the hyperplane. Furthermore, the value of W · X is positive for values of X on one side of the hyperplane, and it is negative for values of X on the other side. This type of model performs particularly well when the data is linearly separable. Examples of linearly separable and inseparable data are shown in Figure 1.4.

The perceptron algorithm is good at classifying data sets like the one shown on the left-hand side of Figure 1.4, when the data is linearly separable. On the other hand, it tends to perform poorly on data sets like the one shown on the right-hand side of Figure 1.4. This example shows the inherent modeling limitation of a perceptron, which necessitates the use of more complex neural architectures.

Since the original perceptron algorithm was proposed as a heuristic minimization of classification errors, it was particularly important to show that the algorithm converges to reasonable solutions in some special cases. In this context, it was shown [405] that the perceptron algorithm always converges to provide zero error on the training data when the data are linearly separable. However, the perceptron algorithm is not guaranteed to converge in instances where the data are not linearly separable. For reasons discussed in the next section, the perceptron might sometimes arrive at a very poor solution with data that are not linearly separable (in comparison with many other learning algorithms).
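The convergence guarantee on linearly separable data is easy to check empirically. The demonstration below is a hypothetical example (reusing the train_perceptron sketch from earlier, with synthetic Gaussian clusters that are linearly separable through the origin with overwhelming probability at these margins):

```python
import numpy as np

rng = np.random.default_rng(1)
X_pos = rng.normal(loc=+3.0, size=(50, 2))   # class +1 cluster
X_neg = rng.normal(loc=-3.0, size=(50, 2))   # class -1 cluster
D = np.vstack([X_pos, X_neg])
y = np.array([1.0] * 50 + [-1.0] * 50)

W = train_perceptron(D, y, alpha=1.0, epochs=50)
y_hat = np.where(D @ W >= 0, 1.0, -1.0)
print("training errors:", int(np.sum(y_hat != y)))  # expected: 0 in the separable case
```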

1.2.1.1 What Objective Function Is the Perceptron Optimizing?

As discussed earlier in this chapter, the original perceptron paper by Rosenblatt [405] did not formally propose a loss function. In those years, these implementations were achieved using actual hardware circuits. The original Mark I perceptron was intended to be a machine rather than an algorithm, and custom-built hardware was used to create it (cf. Figure 1.5).


The general goal was to minimize the number of classification errors with a heuristic update process (in hardware) that changed weights in the “correct” direction whenever errors were made. This heuristic update strongly resembled gradient descent but it was not derived as a gradient-descent method. Gradient descent is defined only for smooth loss functions in algorithmic settings, whereas the hardware-centric approach was designed in a more heuristic way with binary outputs. Many of the binary and circuit-centric principles were inherited from the McCulloch-Pitts model [321] of the neuron. Unfortunately, binary signals are not prone to continuous optimization.

Figure 1.5: The perceptron algorithm was originally implemented using hardware circuits. The image depicts the Mark I perceptron machine built in 1958. (Courtesy: Smithsonian Institute)

Can we find a smooth loss function whose gradient turns out to be the perceptron update? The number of classification errors in a binary classification problem can be written in the form of a 0/1 loss function for the training data point (X_i, y_i) as follows:

L_i^{(0/1)} = (1/2)(y_i − sign{W·X_i})² = 1 − y_i·sign{W·X_i}        (1.7)

The simplification to the right-hand side of the above objective function is obtained by setting both y_i² and sign{W·X_i}² to 1, since each is the square of a value drawn from {−1, +1}. However, this objective function is not differentiable, because it has a staircase-like shape, especially when it is added over multiple points. Note that the 0/1 loss above is dominated by the term −y_i·sign{W·X_i}, in which the sign function causes most of the problems associated with non-differentiability. Since neural networks are defined by gradient-based optimization, we need to define a smooth objective function that is responsible for the perceptron updates. It can be shown [41] that the updates of the perceptron implicitly optimize the perceptron criterion. This objective function is defined by dropping the sign function in the above 0/1 loss and setting negative values to 0 in order to treat all correct predictions in a uniform and lossless way:

L_i = max{−y_i(W·X_i), 0}        (1.8)

The reader is encouraged to use calculus to verify that the gradient of this smoothed objective function leads to the perceptron update, which is essentially W ⇐ W − α∇_W L_i. The modified loss function that enables gradient computation of a non-differentiable function is also referred to as a smoothed surrogate loss function. Almost all continuous optimization-based learning methods (such as neural networks) with discrete outputs (such as class labels) use some type of smoothed surrogate loss function.
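For the reader who prefers to see the verification, the perceptron criterion is piecewise linear, so its gradient can be written case by case (using a subgradient at the non-differentiable point y_i(W·X_i) = 0):

∇_W L_i = −y_i·X_i   if y_i(W·X_i) < 0
∇_W L_i = 0          if y_i(W·X_i) > 0

Substituting into W ⇐ W − α∇_W L_i gives the update W ⇐ W + α·y_i·X_i for misclassified points and no change otherwise, which matches the perceptron update of Equation 1.6 because E(X) = 2y for misclassified points (a substitution discussed in the next subsection) and the factor of 2 can be absorbed into α.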

Figure 1.6: Perceptron criterion versus hinge loss (the loss is plotted against the value of W·X for a positive class instance)

Although the aforementioned perceptron criterion was reverse engineered by working backwards from the perceptron updates, the nature of this loss function exposes some of the weaknesses of the updates in the original algorithm. An interesting observation about the perceptron criterion is that one can set W to the zero vector irrespective of the training data set in order to obtain the optimal loss value of 0. In spite of this fact, the perceptron updates continue to converge to a clear separator between the two classes in linearly separable cases; after all, a separator between the two classes provides a loss value of 0 as well. However, the behavior for data that are not linearly separable is rather arbitrary, and the resulting solution is sometimes not even a good approximate separator of the classes. The direct sensitivity of the loss to the magnitude of the weight vector can dilute the goal of class separation; it is possible for updates to worsen the number of misclassifications significantly while improving the loss. This is an example of how surrogate loss functions might sometimes not fully achieve their intended goals. Because of this fact, the approach is not stable and can yield solutions of widely varying quality.

Several variations of the learning algorithm were therefore proposed for inseparable data, and a natural approach is to always keep track of the best solution in terms of the number of misclassifications [128]. This approach of always keeping the best solution in one's “pocket” is referred to as the pocket algorithm. Another high-performing variant incorporates the notion of margin in the loss function, which results in an algorithm identical to the linear support vector machine. For this reason, the linear support vector machine is also referred to as the perceptron of optimal stability.
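The pocket idea requires only a small change to the training loop: after each update, the misclassification count of the current weight vector is evaluated, and a copy of the best weights seen so far is retained. The following Python sketch illustrates the idea; the function names and the evaluate-after-every-update schedule are illustrative choices rather than details prescribed by [128]:

import numpy as np

def misclassification_count(W, X, y):
    """Count training points with non-positive margin y * (W . x)."""
    return int(np.sum(y * (X @ W) <= 0))

def pocket_perceptron(X, y, n_epochs=100, alpha=1.0):
    n, d = X.shape
    W = np.zeros(d)
    best_W = W.copy()
    best_errors = misclassification_count(W, X, y)
    for _ in range(n_epochs):
        for i in np.random.permutation(n):
            if y[i] * (X[i] @ W) <= 0:            # misclassified point
                W = W + alpha * y[i] * X[i]       # standard perceptron update
                errors = misclassification_count(W, X, y)
                if errors < best_errors:          # keep the best solution in the "pocket"
                    best_W, best_errors = W.copy(), errors
    return best_W

Note that evaluating the misclassification count after every update costs O(n·d) per update; a practical implementation might evaluate it periodically instead.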

1.2.1.2 Relationship with Support Vector Machines

The perceptron criterion is a shifted version of the hinge loss used in support vector machines (see Chapter 2). The hinge loss looks even more similar to the zero-one loss criterion of Equation 1.7, and is defined as follows:

L_i^{svm} = max{1 − y_i(W·X_i), 0}        (1.9)

Note that the perceptron does not keep the constant term of 1 on the right-hand side of Equation 1.7, whereas the hinge loss keeps this constant within the maximization function.

This change does not affect the algebraic expression for the gradient, but it does change which points are lossless and should not cause an update. The relationship between the perceptron criterion and the hinge loss is shown in Figure 1.6. This similarity becomes particularly evident when the perceptron updates of Equation 1.6 are rewritten as follows:

W ⇐ W + α ∑_{(X,y) ∈ S⁺} y·X        (1.10)

Here, S⁺ is defined as the set of all misclassified training points X ∈ S that satisfy the condition y(W·X) < 0. This update seems to look somewhat different from the perceptron, because the perceptron uses the error E(X) for the update, which is replaced with y in the update above. A key point is that the (integer) error value E(X) = (y − sign{W·X}) ∈ {−2, +2} can never be 0 for misclassified points in S⁺. Therefore, we have E(X) = 2y for misclassified points, and E(X) can be replaced with y in the updates after absorbing the factor of 2 within the learning rate. This update is identical to that used by the primal support vector machine (SVM) algorithm [448], except that the updates are performed only for the misclassified points in the perceptron, whereas the SVM also uses the marginally correct points near the decision boundary for updates. Note that the SVM uses the condition y(W·X) < 1 [instead of using the condition y(W·X) < 0] to define S⁺, which is one of the key differences between the two algorithms. This point shows that the perceptron is fundamentally not very different from well-known machine learning algorithms like the support vector machine, in spite of its different origins. Freund and Schapire provide a beautiful exposition of the role of margin in improving the stability of the perceptron and also its relationship with the support vector machine [123]. It turns out that many traditional machine learning models can be viewed as minor variations of shallow neural architectures like the perceptron. The relationships between classical machine learning models and shallow neural networks are described in detail in Chapter 2.
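The one-line difference between the two update rules is easy to see in code. The following sketch contrasts the two membership tests for S⁺, with everything else shared; the function name and the stochastic single-point form are illustrative simplifications, and the primal SVM of [448] would additionally include a regularization term that is omitted here:

import numpy as np

def sgd_linear_classifier(X, y, margin, n_epochs=100, alpha=0.1):
    """margin=0 gives perceptron updates; margin=1 gives hinge-loss (SVM-style) updates."""
    n, d = X.shape
    W = np.zeros(d)
    for _ in range(n_epochs):
        for i in np.random.permutation(n):
            if y[i] * (X[i] @ W) < margin:        # membership test for S+
                W = W + alpha * y[i] * X[i]       # identical update in both cases
    return W

# Perceptron: updates only on misclassified points, i.e., y(W . x) < 0.
# SVM-style: also updates on marginally correct points, i.e., y(W . x) < 1.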

1.2.1.3 Choice of Activation and Loss Functions

The choice of activation function is a critical part of neural network design. In the case of the perceptron, the choice of the sign activation function is motivated by the fact that a binary class label needs to be predicted. However, it is possible to have other types of situations where different target variables may be predicted. For example, if the target variable to be predicted is real, then it makes sense to use the identity activation function, and the resulting algorithm is the same as least-squares regression. If it is desirable to predict the probability of a binary class, it makes sense to use a sigmoid function for activating the output node, so that the prediction ŷ indicates the probability that the observed value, y, of the dependent variable is 1. The negative logarithm of |y/2 − 0.5 + ŷ| is used as the loss, assuming that y is coded from {−1, +1}. If ŷ is the probability that y is 1, then |y/2 − 0.5 + ŷ| is the probability that the correct value is predicted. This assertion is easy to verify by examining the two cases where y is −1 or +1: in the former case, the expression evaluates to 1 − ŷ, and in the latter case, it evaluates to ŷ. This loss function can be shown to be representative of the negative log-likelihood of the training data (see Section 2.2.3 of Chapter 2).
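A quick numerical check of this identity is possible in a few lines of Python; the probability values below are arbitrary test inputs:

import numpy as np

def smooth_loss(y, y_hat):
    """Negative logarithm of |y/2 - 0.5 + y_hat| for labels y in {-1, +1}."""
    return -np.log(np.abs(y / 2 - 0.5 + y_hat))

for y_hat in [0.1, 0.5, 0.9]:                        # arbitrary predicted probabilities
    assert np.isclose(smooth_loss(+1, y_hat), -np.log(y_hat))       # correct with prob. y_hat
    assert np.isclose(smooth_loss(-1, y_hat), -np.log(1 - y_hat))   # correct with prob. 1 - y_hat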

The importance of nonlinear activation functions becomes significant when one moves from the single-layered perceptron to the multi-layered architectures discussed later in this chapter. Different types of nonlinear functions such as the sign, sigmoid, or hyperbolic tangent may be used in various layers. We use the notation Φ to denote the activation function:

ŷ = Φ(W·X)        (1.11)

Therefore, a neuron really computes two functions within the node, which is why we have incorporated the summation symbol Σ as well as the activation symbol Φ within a neuron. The break-up of the neuron computations into two separate values is shown in Figure 1.7.
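In code, this two-stage view corresponds to computing a pre-activation value followed by a post-activation value. The sketch below makes the separation explicit; the variable names are illustrative:

import numpy as np

def neuron(W, x, phi=np.tanh):
    """A neuron computes two functions: a linear sum followed by an activation."""
    pre_activation = W @ x                  # the summation: W . x
    post_activation = phi(pre_activation)   # the activation function Phi
    return post_activation

# With phi = np.sign, this reduces to the perceptron prediction y_hat = sign(W . x).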
