Chapter 9
Convolutional Networks
Convolutional networks (LeCun, 1989), also known as convolutional neural networks, or CNNs, are a specialized kind of neural network for processing data that has a known grid-like topology. Examples include time-series data, which can be thought of as a 1-D grid taking samples at regular time intervals, and image data, which can be thought of as a 2-D grid of pixels. Convolutional networks have been tremendously successful in practical applications. The name “convolutional neural network” indicates that the network employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation. Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
In this chapter, we first describe what convolution is. Next, we explain the
motivation behind using convolution in a neural network. We then describe an
operation called pooling, which almost all convolutional networks employ. Usually,
the operation used in a convolutional neural network does not correspond precisely
to the definition of convolution as used in other fields, such as engineering or
pure mathematics. We describe several variants on the convolution function that
are widely used in practice for neural networks. We also show how convolution
may be applied to many kinds of data, with different numbers of dimensions. We
then discuss means of making convolution more efficient. Convolutional networks
stand out as an example of neuroscientific principles influencing deep learning.
We discuss these neuroscientific principles, then conclude with comments about
the role convolutional networks have played in the history of deep learning. One
topic this chapter does not address is how to choose the architecture of your
convolutional network. The goal of this chapter is to describe the kinds of tools
that convolutional networks provide, while chapter 11 describes general guidelines
for choosing which tools to use in which circumstances. Research into convolutional
network architectures proceeds so rapidly that a new best architecture for a given
benchmark is announced every few weeks to months, rendering it impractical to
describe in print the best architecture. Nonetheless, the best architectures have
consistently been composed of the building blocks described here.
9.1 The Convolution Operation
In its most general form, convolution is an operation on two functions of a real-
valued argument. To motivate the definition of convolution, we start with examples
of two functions we might use.
Suppose we are tracking the location of a spaceship with a laser sensor. Our
laser sensor provides a single output x(t), the position of the spaceship at time t. Both x and t are real valued, that is, we can get a different reading from the laser sensor at any instant in time.
Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy
estimate of the spaceship’s position, we would like to average several measurements.
Of course, more recent measurements are more relevant, so we will want this to
be a weighted average that gives more weight to recent measurements. We can do this with a weighting function w(a), where a is the age of a measurement. If we apply such a weighted average operation at every moment, we obtain a new function s providing a smoothed estimate of the position of the spaceship:

s(t) = ∫ x(a) w(t − a) da.    (9.1)
This operation is called convolution. The convolution operation is typically denoted with an asterisk:

s(t) = (x ∗ w)(t).    (9.2)
In our example, w needs to be a valid probability density function, or the output will not be a weighted average. Also, w needs to be 0 for all negative arguments, or it will look into the future, which is presumably beyond our capabilities. These limitations are particular to our example, though. In general, convolution is defined for any functions for which the above integral is defined and may be used for other purposes besides taking weighted averages.
In convolutional network terminology, the first argument (in this example, the function x) to the convolution is often referred to as the input, and the second
argument (in this example, the function w) as the kernel. The output is sometimes referred to as the feature map.
In our example, the idea of a laser sensor that can provide measurements at
every instant is not realistic. Usually, when we work with data on a computer,
time will be discretized, and our sensor will provide data at regular intervals.
In our example, it might be more realistic to assume that our laser provides a
measurement once per second. The time index t can then take on only integer values. If we now assume that x and w are defined only on integer t, we can define the discrete convolution:

s(t) = (x ∗ w)(t) = Σ_{a=−∞}^{∞} x(a) w(t − a).    (9.3)
In machine learning applications, the input is usually a multidimensional array
of data, and the kernel is usually a multidimensional array of parameters that are
adapted by the learning algorithm. We will refer to these multidimensional arrays
as tensors. Because each element of the input and kernel must be explicitly stored
separately, we usually assume that these functions are zero everywhere but in the
finite set of points for which we store the values. This means that in practice, we
can implement the infinite summation as a summation over a finite number of
array elements.
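As a concrete illustration of this finite summation, the following Python/NumPy sketch (our own; it is not part of the original text) evaluates equation 9.3 for a short signal and a three-element kernel, and checks the result against NumPy's built-in routine. The names x, w and s mirror the notation above.

import numpy as np

# A short signal x(t) and a 3-tap weighting kernel w(a); both are treated as
# zero outside the stored range, so the infinite sum reduces to a finite one.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.5, 0.3, 0.2])   # more weight on more recent measurements

def discrete_conv(x, w):
    # s(t) = sum_a x(a) w(t - a), for every t where the kernel overlaps the signal.
    s = np.zeros(len(x) + len(w) - 1)
    for t in range(len(s)):
        for a in range(len(x)):
            if 0 <= t - a < len(w):
                s[t] += x[a] * w[t - a]
    return s

s = discrete_conv(x, w)
assert np.allclose(s, np.convolve(x, w))   # NumPy computes the same finite sum
print(s)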
Finally, we often use convolutions over more than one axis at a time. For
example, if we use a two-dimensional image I as our input, we probably also want to use a two-dimensional kernel K:

S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n).    (9.4)
Convolution is commutative, meaning we can equivalently write

S(i, j) = (K ∗ I)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n).    (9.5)

Usually the latter formula is more straightforward to implement in a machine learning library, because there is less variation in the range of valid values of m and n.
The commutative property of convolution arises because we have flipped the kernel relative to the input, in the sense that as m increases, the index into the input increases, but the index into the kernel decreases. The only reason to flip the kernel is to obtain the commutative property. While the commutative property
is useful for writing proofs, it is not usually an important property of a neural
network implementation. Instead, many neural network libraries implement a
related function called the cross-correlation, which is the same as convolution but without flipping the kernel:

S(i, j) = (K ∗ I)(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n).    (9.6)
Many machine learning libraries implement cross-correlation but call it convolution.
In this text we follow this convention of calling both operations convolution and
specify whether we mean to flip the kernel or not in contexts where kernel flipping
is relevant. In the context of machine learning, the learning algorithm will learn
the appropriate values of the kernel in the appropriate place, so an algorithm based
on convolution with kernel flipping will learn a kernel that is flipped relative to the
kernel learned by an algorithm without the flipping. It is also rare for convolution
to be used alone in machine learning; instead convolution is used simultaneously
with other functions, and the combination of these functions does not commute
regardless of whether the convolution operation flips its kernel or not.
See figure 9.1 for an example of convolution (without kernel flipping) applied
to a 2-D tensor.
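The difference between the two operations is easy to verify numerically. The sketch below (an illustration of ours; the helper function name is our own) computes a “valid” cross-correlation as in equation 9.6, and shows that applying it to a flipped kernel reproduces the convolution of equation 9.4 restricted to the same positions.

import numpy as np

def cross_correlate2d_valid(I, K):
    # S(i, j) = sum_m sum_n I(i + m, j + n) K(m, n), over 'valid' positions only.
    kH, kW = K.shape
    out = np.zeros((I.shape[0] - kH + 1, I.shape[1] - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i + kH, j:j + kW] * K)
    return out

I = np.arange(20.0).reshape(4, 5)
K = np.array([[1.0, 2.0],
              [3.0, 4.0]])

corr = cross_correlate2d_valid(I, K)              # cross-correlation, no flipping
conv = cross_correlate2d_valid(I, K[::-1, ::-1])  # flipping K first gives the convolution
print(corr)
print(conv)   # same shape, different values unless K happens to be symmetric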
Discrete convolution can be viewed as multiplication by a matrix, but the
matrix has several entries constrained to be equal to other entries. For example,
for univariate discrete convolution, each row of the matrix is constrained to be
equal to the row above shifted by one element. This is known as a Toeplitz matrix. In two dimensions, a doubly block circulant matrix corresponds to
convolution. In addition to these constraints that several elements be equal to
each other, convolution usually corresponds to a very sparse matrix (a matrix
whose entries are mostly equal to zero). This is because the kernel is usually much
smaller than the input image. Any neural network algorithm that works with
matrix multiplication and does not depend on specific properties of the matrix
structure should work with convolution, without requiring any further changes
to the neural network. Typical convolutional neural networks do make use of
further specializations in order to deal with large inputs efficiently, but these are
not strictly necessary from a theoretical perspective.
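To make the matrix view concrete, the sketch below (our own illustration, not code from the text) builds the Toeplitz-structured matrix corresponding to a 1-D “valid” convolution and checks that multiplying by it reproduces the convolution. Each row repeats the flipped kernel shifted by one position, and most entries are zero.

import numpy as np

def conv_matrix(w, n):
    # Matrix C such that C @ x equals the 'valid' convolution of a length-n signal
    # x with the kernel w. Each row is the row above shifted right by one element.
    k = len(w)
    C = np.zeros((n - k + 1, n))
    for r in range(n - k + 1):
        C[r, r:r + k] = w[::-1]     # flipped kernel, per the definition of convolution
    return C

x = np.array([1.0, 4.0, 2.0, 0.0, 3.0, 5.0])
w = np.array([0.25, 0.5, 0.25])

C = conv_matrix(w, len(x))
assert np.allclose(C @ x, np.convolve(x, w, mode='valid'))
print(C)    # a sparse, banded matrix with tied entries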
9.2 Motivation
Convolution leverages three important ideas that can help improve a machine learning system: sparse interactions, parameter sharing and equivariant representations. Moreover, convolution provides a means for working with inputs of variable size. We now describe each of these ideas in turn.

Figure 9.1: An example of 2-D convolution without kernel flipping. We restrict the output to only positions where the kernel lies entirely within the image, called “valid” convolution in some contexts. The figure shows a 3×4 input with entries a through l, a 2×2 kernel with entries w, x, y, z, and the resulting 2×3 output; boxes with arrows indicate how the upper-left element of the output tensor, aw + bx + ey + fz, is formed by applying the kernel to the corresponding upper-left region of the input tensor.
Traditional neural network layers use matrix multiplication by a matrix of
parameters with a separate parameter describing the interaction between each
input unit and each output unit. This means that every output unit interacts
with every input unit. Convolutional networks, however, typically have sparse interactions (also referred to as sparse connectivity or sparse weights). This
is accomplished by making the kernel smaller than the input. For example,
when processing an image, the input image might have thousands or millions of
pixels, but we can detect small, meaningful features such as edges with kernels
that occupy only tens or hundreds of pixels. This means that we need to store
fewer parameters, which both reduces the memory requirements of the model
and improves its statistical efficiency. It also means that computing the output
requires fewer operations. These improvements in efficiency are usually quite large.
If there are m inputs and n outputs, then matrix multiplication requires m × n parameters, and the algorithms used in practice have O(m × n) runtime (per example). If we limit the number of connections each output may have to k, then the sparsely connected approach requires only k × n parameters and O(k × n) runtime. For many practical applications, it is possible to obtain good performance on the machine learning task while keeping k several orders of magnitude smaller than m. For graphical demonstrations of sparse connectivity, see figure 9.2 and
figure 9.3. In a deep convolutional network, units in the deeper layers may indirectly
interact with a larger portion of the input, as shown in figure 9.4. This allows the
network to efficiently describe complicated interactions between many variables by
constructing such interactions from simple building blocks that each describe only
sparse interactions.
Figure 9.2: Sparse connectivity, viewed from below. We highlight one input unit, x3, and highlight the output units in s that are affected by this unit. (Top) When s is formed by convolution with a kernel of width 3, only three outputs are affected by x. (Bottom) When s is formed by matrix multiplication, connectivity is no longer sparse, so all the outputs are affected by x3.

Figure 9.3: Sparse connectivity, viewed from above. We highlight one output unit, s3, and highlight the input units in x that affect this unit. These units are known as the receptive field of s3. (Top) When s is formed by convolution with a kernel of width 3, only three inputs affect s3. (Bottom) When s is formed by matrix multiplication, connectivity is no longer sparse, so all the inputs affect s3.

Figure 9.4: The receptive field of the units in the deeper layers of a convolutional network is larger than the receptive field of the units in the shallow layers. This effect increases if the network includes architectural features like strided convolution (figure 9.12) or pooling (section 9.3). This means that even though direct connections in a convolutional net are very sparse, units in the deeper layers can be indirectly connected to all or most of the input image.

Parameter sharing refers to using the same parameter for more than one function in a model. In a traditional neural net, each element of the weight matrix
is used exactly once when computing the output of a layer. It is multiplied by
one element of the input and then never revisited. As a synonym for parameter
sharing, one can say that a network has
tied weights
, because the value of the
weight applied to one input is tied to the value of a weight applied elsewhere. In
a convolutional neural net, each member of the kernel is used at every position
of the input (except perhaps some of the boundary pixels, depending on the
design decisions regarding the boundary). The parameter sharing used by the
convolution operation means that rather than learning a separate set of parameters
for every location, we learn only one set. This does not affect the runtime of
forward propagation—it is still O(k × n)—but it does further reduce the storage requirements of the model to k parameters. Recall that k is usually several orders of magnitude smaller than m. Since m and n are usually roughly the same size, k is practically insignificant compared to m × n. Convolution is thus dramatically more
efficient than dense matrix multiplication in terms of the memory requirements
and statistical efficiency. For a graphical depiction of how parameter sharing works,
see figure 9.5.
As an example of both of these first two principles in action, figure 9.6 shows how sparse connectivity and parameter sharing can dramatically improve the efficiency of a linear function for detecting edges in an image.

Figure 9.5: Parameter sharing. Black arrows indicate the connections that use a particular parameter in two different models. (Top) The black arrows indicate uses of the central element of a 3-element kernel in a convolutional model. Because of parameter sharing, this single parameter is used at all input locations. (Bottom) The single black arrow indicates the use of the central element of the weight matrix in a fully connected model. This model has no parameter sharing, so the parameter is used only once.

Figure 9.6: Efficiency of edge detection. The image on the right was formed by taking each pixel in the original image and subtracting the value of its neighboring pixel on the left. This shows the strength of all the vertically oriented edges in the input image, which can be a useful operation for object detection. Both images are 280 pixels tall. The input image is 320 pixels wide, while the output image is 319 pixels wide. This transformation can be described by a convolution kernel containing two elements, and requires 319 × 280 × 3 = 267,960 floating-point operations (two multiplications and one addition per output pixel) to compute using convolution. To describe the same transformation with a matrix multiplication would take 320 × 280 × 319 × 280, or over eight billion, entries in the matrix, making convolution four billion times more efficient for representing this transformation. The straightforward matrix multiplication algorithm performs over sixteen billion floating point operations, making convolution roughly 60,000 times more efficient computationally. Of course, most of the entries of the matrix would be zero. If we stored only the nonzero entries of the matrix, then both matrix multiplication and convolution would require the same number of floating-point operations to compute. The matrix would still need to contain 2 × 319 × 280 = 178,640 entries. Convolution is an extremely efficient way of describing transformations that apply the same linear transformation of a small local region across the entire input. Photo credit: Paula Goodfellow.
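The arithmetic in the caption of figure 9.6 is easy to reproduce. The short sketch below (our own, applied to an arbitrary random image of the stated size) implements the two-element kernel as a subtraction of neighboring pixels and prints the operation and storage counts quoted above.

import numpy as np

H, W = 280, 320                       # image size used in figure 9.6
image = np.random.rand(H, W)

# Each output pixel is the pixel value minus its neighbor to the left, i.e.
# convolution of every row with a two-element kernel.
edges = image[:, 1:] - image[:, :-1]  # output is 280 x 319

print(edges.shape)
print(319 * 280 * 3)            # 267,960 floating-point operations via convolution
print(320 * 280 * 319 * 280)    # entries in the equivalent dense matrix (over 8 billion)
print(2 * 319 * 280)            # 178,640 nonzero entries if stored sparsely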
In the case of convolution, the particular form of parameter sharing causes the
layer to have a property called
equivariance
to translation. To say a function is
equivariant means that if the input changes, the output changes in the same way.
Specifically, a function f(x) is equivariant to a function g if f(g(x)) = g(f(x)). In the case of convolution, if we let g be any function that translates the input, that is, shifts it, then the convolution function is equivariant to g. For example, let I be a function giving image brightness at integer coordinates. Let g be a function mapping one image function to another image function, such that I′ = g(I) is the image function with I′(x, y) = I(x − 1, y). This shifts every pixel of I one unit to the right. If we apply this transformation to I, then apply convolution, the result will be the same as if we applied convolution to I, then applied the transformation g to the output. When processing time-series data, this means that convolution
produces a sort of timeline that shows when different features appear in the input.
If we move an event later in time in the input, the exact same representation of it
will appear in the output, just later. Similarly with images, convolution creates a
2-D map of where certain features appear in the input. If we move the object in the
input, its representation will move the same amount in the output. This is useful
for when we know that some function of a small number of neighboring pixels is
useful when applied to multiple input locations. For example, when processing
images, it is useful to detect edges in the first layer of a convolutional network.
The same edges appear more or less everywhere in the image, so it is practical to
share parameters across the entire image. In some cases, we may not wish to share
parameters across the entire image. For example, if we are processing images that
are cropped to be centered on an individual’s face, we probably want to extract
different features at different locations—the part of the network processing the top
of the face needs to look for eyebrows, while the part of the network processing
the bottom of the face needs to look for a chin.
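Translation equivariance can be checked directly in code. The toy sketch below (our own; it uses circular convolution and circular shifts so that there are no boundary effects to obscure the equality) verifies that convolving a shifted input gives the same result as shifting the convolved output.

import numpy as np

def circ_conv(x, w):
    # Circular convolution: s(t) = sum_a x(a) w((t - a) mod n).
    n = len(x)
    s = np.zeros(n)
    for t in range(n):
        for a in range(n):
            s[t] += x[a] * w[(t - a) % n]
    return s

x = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0])
w = np.zeros(8)
w[:3] = [0.25, 0.5, 0.25]

shift = lambda v, d: np.roll(v, d)       # g: translate the signal by d positions

lhs = circ_conv(shift(x, 2), w)          # convolve the shifted input:   f(g(x))
rhs = shift(circ_conv(x, w), 2)          # shift the convolved output:   g(f(x))
assert np.allclose(lhs, rhs)             # convolution commutes with translation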
Convolution is not naturally equivariant to some other transformations, such
as changes in the scale or rotation of an image. Other mechanisms are necessary
for handling these kinds of transformations.
Finally, some kinds of data cannot be processed by neural networks defined by
matrix multiplication with a fixed-shape matrix. Convolution enables processing
of some of these kinds of data. We discuss this further in section 9.7.
9.3 Pooling
A typical layer of a convolutional network consists of three stages (see figure 9.7).
In the first stage, the layer performs several convolutions in parallel to produce a
set of linear activations. In the second stage, each linear activation is run through
a nonlinear activation function, such as the rectified linear activation function.
This stage is sometimes called the
detector stage
. In the third stage, we use a
pooling function to modify the output of the layer further.
A pooling function replaces the output of the net at a certain location with a
summary statistic of the nearby outputs. For example, the
max pooling
(Zhou
and Chellappa, 1988) operation reports the maximum output within a rectangular
neighborhood. Other popular pooling functions include the average of a rectangular
neighborhood, the L2 norm of a rectangular neighborhood, or a weighted average
based on the distance from the central pixel.
Figure 9.7: The components of a typical convolutional neural network layer. There are two commonly used sets of terminology for describing these layers. (Left) In the complex layer terminology, the convolutional net is viewed as a small number of relatively complex layers, with each layer having many “stages” (a convolution stage computing an affine transform, a detector stage applying a nonlinearity such as the rectified linear function, and a pooling stage). In this terminology, there is a one-to-one mapping between kernel tensors and network layers. In this book we generally use this terminology. (Right) In the simple layer terminology, the convolutional net is viewed as a larger number of simple layers (convolution layer, detector layer, pooling layer); every step of processing is regarded as a layer in its own right. This means that not every “layer” has parameters.
In all cases, pooling helps to make the representation approximately
invariant
to small translations of the input. Invariance to translation means that if we
translate the input by a small amount, the values of most of the pooled outputs
do not change. See figure 9.8 for an example of how this works. Invariance to local
translation can be a useful property if we care more about whether some feature
is present than exactly where it is. For example, when determining whether an
image contains a face, we need not know the location of the eyes with pixel-perfect
accuracy, we just need to know that there is an eye on the left side of the face and
an eye on the right side of the face. In other contexts, it is more important to
preserve the location of a feature. For example, if we want to find a corner defined
by two edges meeting at a specific orientation, we need to preserve the location of the edges well enough to test whether they meet.

Figure 9.8: Max pooling introduces invariance. (Top) A view of the middle of the output of a convolutional layer. The bottom row shows outputs of the nonlinearity. The top row shows the outputs of max pooling, with a stride of one pixel between pooling regions and a pooling region width of three pixels. (Bottom) A view of the same network, after the input has been shifted to the right by one pixel. Every value in the bottom row has changed, but only half of the values in the top row have changed, because the max pooling units are sensitive only to the maximum value in the neighborhood, not its exact location.
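The behavior illustrated in figure 9.8 can be reproduced in a few lines. In this sketch (a toy example of ours), max pooling with a width of three and a stride of one is applied to a row of detector outputs before and after the input is shifted by one position; every detector value changes, but several pooled values stay the same.

import numpy as np

def max_pool_1d(v, width=3, stride=1):
    # Report the maximum within each pooling region of the given width.
    return np.array([v[i:i + width].max()
                     for i in range(0, len(v) - width + 1, stride)])

detector = np.array([0.1, 1.0, 0.2, 0.1, 0.0, 0.3])
shifted  = np.roll(detector, 1)          # the same responses moved right by one
                                         # (with wraparound, for simplicity)

print(max_pool_1d(detector))             # [1.  1.  0.2 0.3]
print(max_pool_1d(shifted))              # [1.  1.  1.  0.2] -- half the values unchanged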
The use of pooling can be viewed as adding an infinitely strong prior that
the function the layer learns must be invariant to small translations. When this
assumption is correct, it can greatly improve the statistical efficiency of the network.
Pooling over spatial regions produces invariance to translation, but if we pool
over the outputs of separately parametrized convolutions, the features can learn
which transformations to become invariant to (see figure 9.9).
Because pooling summarizes the responses over a whole neighborhood, it is
possible to use fewer pooling units than detector units, by reporting summary
statistics for pooling regions spaced k pixels apart rather than 1 pixel apart. See figure 9.10 for an example. This improves the computational efficiency of the network because the next layer has roughly k times fewer inputs to process. When
the number of parameters in the next layer is a function of its input size (such as when the next layer is fully connected and based on matrix multiplication), this reduction in the input size can also result in improved statistical efficiency and reduced memory requirements for storing the parameters.

Figure 9.9: Example of learned invariances. A pooling unit that pools over multiple features that are learned with separate parameters can learn to be invariant to transformations of the input. Here we show how a set of three learned filters and a max pooling unit can learn to become invariant to rotation. All three filters are intended to detect a handwritten 5. Each filter attempts to match a slightly different orientation of the 5. When a 5 appears in the input, the corresponding filter will match it and cause a large activation in a detector unit. The max pooling unit then has a large activation regardless of which detector unit was activated. We show here how the network processes two different inputs, resulting in two different detector units being activated. The effect on the pooling unit is roughly the same either way. This principle is leveraged by maxout networks (Goodfellow et al., 2013a) and other convolutional networks. Max pooling over spatial positions is naturally invariant to translation; this multichannel approach is only necessary for learning other transformations.

Figure 9.10: Pooling with downsampling. Here we use max pooling with a pool width of three and a stride between pools of two. This reduces the representation size by a factor of two, which reduces the computational and statistical burden on the next layer. Note that the rightmost pooling region has a smaller size but must be included if we do not want to ignore some of the detector units.
For many tasks, pooling is essential for handling inputs of varying size. For
example, if we want to classify images of variable size, the input to the classification
layer must have a fixed size. This is usually accomplished by varying the size of an
offset between pooling regions so that the classification layer always receives the
same number of summary statistics regardless of the input size. For example, the
final pooling layer of the network may be defined to output four sets of summary
statistics, one for each quadrant of an image, regardless of the image size.
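A minimal sketch of this idea (our own, not code from the text): pool each quadrant of a feature map down to a single maximum, so that inputs of any spatial size yield exactly four summary statistics per channel.

import numpy as np

def quadrant_max_pool(feature_map):
    # One maximum per quadrant: a fixed-size (2, 2) summary regardless of the
    # spatial size of the input feature map.
    h, w = feature_map.shape
    h2, w2 = h // 2, w // 2
    return np.array([[feature_map[:h2, :w2].max(), feature_map[:h2, w2:].max()],
                     [feature_map[h2:, :w2].max(), feature_map[h2:, w2:].max()]])

print(quadrant_max_pool(np.random.rand(37, 61)).shape)    # (2, 2)
print(quadrant_max_pool(np.random.rand(128, 96)).shape)   # (2, 2), same-size summary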
Some theoretical work gives guidance as to which kinds of pooling one should
use in various situations (Boureau et al., 2010). It is also possible to dynamically
pool features together, for example, by running a clustering algorithm on the
locations of interesting features (Boureau et al., 2011). This approach yields a
different set of pooling regions for each image. Another approach is to learn a
single pooling structure that is then applied to all images (Jia et al., 2012).
Pooling can complicate some kinds of neural network architectures that use
top-down information, such as Boltzmann machines and autoencoders. These
issues are discussed further when we present these types of networks in part III.
Pooling in convolutional Boltzmann machines is presented in section 20.6. The
inverse-like operations on pooling units needed in some differentiable networks are
covered in section 20.10.6.
Some examples of complete convolutional network architectures for classification
using convolution and pooling are shown in figure 9.11.
9.4 Convolution and Pooling as an Infinitely Strong Prior
Recall the concept of a
prior probability distribution
from section 5.6. This is
a probability distribution over the parameters of a model that encodes our beliefs
about what models are reasonable, before we have seen any data.
Priors can be considered weak or strong depending on how concentrated the
probability density in the prior is. A weak prior is a prior distribution with high
entropy, such as a Gaussian distribution with high variance. Such a prior allows
the data to move the parameters more or less freely. A strong prior has very low
entropy, such as a Gaussian distribution with low variance. Such a prior plays a
more active role in determining where the parameters end up.
An infinitely strong prior places zero probability on some parameters and says
that these parameter values are completely forbidden, regardless of how much support the data give to those values.

Figure 9.11: Examples of architectures for classification with convolutional networks. The specific strides and depths used in this figure are not advisable for real use; they are designed to be very shallow to fit onto the page. Real convolutional networks also often involve significant amounts of branching, unlike the chain structures used here for simplicity. (Left) A convolutional network that processes a fixed image size (input image 256×256×3; convolution + ReLU, 256×256×64; pooling with stride 4, 64×64×64; convolution + ReLU, 64×64×64; pooling with stride 4, 16×16×64; reshape to a vector of 16,384 units; matrix multiply to 1,000 units; softmax over 1,000 class probabilities). After alternating between convolution and pooling for a few layers, the tensor for the convolutional feature map is reshaped to flatten out the spatial dimensions. The rest of the network is an ordinary feedforward network classifier, as described in chapter 6. (Center) A convolutional network that processes a variably sized image but still maintains a fully connected section (input image 256×256×3; convolution + ReLU, 256×256×64; pooling with stride 4, 64×64×64; convolution + ReLU, 64×64×64; pooling to a 3×3 grid, 3×3×64; reshape to a vector of 576 units; matrix multiply to 1,000 units; softmax over 1,000 class probabilities). This network uses a pooling operation with variably sized pools but a fixed number of pools, in order to provide a fixed-size vector of 576 units to the fully connected portion of the network. (Right) A convolutional network that does not have any fully connected weight layer (input image 256×256×3; convolution + ReLU, 256×256×64; pooling with stride 4, 64×64×64; convolution + ReLU, 64×64×64; pooling with stride 4, 16×16×64; convolution, 16×16×1,000; average pooling, 1×1×1,000; softmax over 1,000 class probabilities). Instead, the last convolutional layer outputs one feature map per class. The model presumably learns a map of how likely each class is to occur at each spatial location. Averaging a feature map down to a single value provides the argument to the softmax classifier at the top.
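As a rough, hedged illustration of the center pipeline in figure 9.11, the following sketch uses the PyTorch library (our choice; the text does not prescribe any particular framework), with an adaptive pooling layer standing in for “pooling to a 3×3 grid,” so that the 576-unit vector is produced for any input size.

import torch
import torch.nn as nn

# Loosely: conv + ReLU, pool with stride 4, conv + ReLU, pool to a fixed 3x3 grid,
# then a fully connected classifier over 1,000 classes.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),   # 3 input channels -> 64 feature maps
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=4, stride=4),        # pooling with stride 4
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveMaxPool2d((3, 3)),                 # fixed number of pools: 3 x 3 x 64 = 576
    nn.Flatten(),
    nn.Linear(576, 1000),                         # matrix multiply: 1,000 units
    nn.Softmax(dim=1),                            # 1,000 class probabilities
)

# Variably sized images map to the same 1,000-dimensional output.
for size in (256, 320):
    x = torch.randn(1, 3, size, size)
    print(model(x).shape)                         # torch.Size([1, 1000])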
We can imagine a convolutional net as being similar to a fully connected net,
but with an infinitely strong prior over its weights. This infinitely strong prior
says that the weights for one hidden unit must be identical to the weights of its
neighbor but shifted in space. The prior also says that the weights must be zero,
except for in the small, spatially contiguous receptive field assigned to that hidden
unit. Overall, we can think of the use of convolution as introducing an infinitely
strong prior probability distribution over the parameters of a layer. This prior
says that the function the layer should learn contains only local interactions and is
equivariant to translation. Likewise, the use of pooling is an infinitely strong prior
that each unit should be invariant to small translations.
Of course, implementing a convolutional net as a fully connected net with an
infinitely strong prior would be extremely wasteful computationally. But thinking
of a convolutional net as a fully connected net with an infinitely strong prior can
give us some insights into how convolutional nets work.
One key insight is that convolution and pooling can cause underfitting. Like
any prior, convolution and pooling are only useful when the assumptions made
by the prior are reasonably accurate. If a task relies on preserving precise spatial
information, then using pooling on all features can increase the training error.
Some convolutional network architectures (Szegedy et al., 2014a) are designed to
use pooling on some channels but not on other channels, in order to get both
highly invariant features and features that will not underfit when the translation
invariance prior is incorrect. When a task involves incorporating information from
very distant locations in the input, then the prior imposed by convolution may be
inappropriate.
Another key insight from this view is that we should only compare convolu-
tional models to other convolutional models in benchmarks of statistical learning
performance. Models that do not use convolution would be able to learn even if we
permuted all the pixels in the image. For many image datasets, there are separate
benchmarks for models that are
permutation invariant
and must discover the
concept of topology via learning and for models that have the knowledge of spatial
relationships hard coded into them by their designer.
9.5 Variants of the Basic Convolution Function
When discussing convolution in the context of neural networks, we usually do
not refer exactly to the standard discrete convolution operation as it is usually
understood in the mathematical literature. The functions used in practice differ
slightly. Here we describe these differences in detail and highlight some useful
properties of the functions used in neural networks.
First, when we refer to convolution in the context of neural networks, we usually
actually mean an operation that consists of many applications of convolution in
parallel. This is because convolution with a single kernel can extract only one kind
of feature, albeit at many spatial locations. Usually we want each layer of our
network to extract many kinds of features, at many locations.
Additionally, the input is usually not just a grid of real values. Rather, it is a
grid of vector-valued observations. For example, a color image has a red, green
and blue intensity at each pixel. In a multilayer convolutional network, the input
to the second layer is the output of the first layer, which usually has the output
of many different convolutions at each position. When working with images, we
usually think of the input and output of the convolution as being 3-D tensors, with
one index into the different channels and two indices into the spatial coordinates
of each channel. Software implementations usually work in batch mode, so they
will actually use 4-D tensors, with the fourth axis indexing different examples in
the batch, but we will omit the batch axis in our description here for simplicity.
Because convolutional networks usually use multichannel convolution, the linear
operations they are based on are not guaranteed to be commutative, even if kernel
flipping is used. These multichannel operations are only commutative if each
operation has the same number of output channels as input channels.
Assume we have a 4-D kernel tensor K with element K_{i,j,k,l} giving the connection strength between a unit in channel i of the output and a unit in channel j of the input, with an offset of k rows and l columns between the output unit and the input unit. Assume our input consists of observed data V with element V_{i,j,k} giving the value of the input unit within channel i at row j and column k. Assume our output consists of Z with the same format as V. If Z is produced by convolving K across V without flipping K, then

Z_{i,j,k} = Σ_{l,m,n} V_{l, j+m−1, k+n−1} K_{i,l,m,n},    (9.7)
where the summation over l, m and n is over all values for which the tensor
indexing operations inside the summation are valid. In linear algebra notation,
we index into arrays using a 1 for the first entry. This necessitates the −1 in the
above formula. Programming languages such as C and Python index starting from
0, rendering the above expression even simpler.
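A direct, loop-based transcription of equation 9.7 in Python (zero-based indexing, so the −1 terms disappear) is given below; this is an illustrative sketch of ours rather than an efficient implementation.

import numpy as np

def multichannel_conv(V, K):
    # Z[i, j, k] = sum_{l, m, n} V[l, j + m, k + n] * K[i, l, m, n]
    # (equation 9.7 with zero-based indexing; 'valid' output positions only).
    out_ch, in_ch, kH, kW = K.shape
    _, H, W = V.shape
    Z = np.zeros((out_ch, H - kH + 1, W - kW + 1))
    for i in range(out_ch):
        for j in range(Z.shape[1]):
            for k in range(Z.shape[2]):
                Z[i, j, k] = np.sum(V[:, j:j + kH, k:k + kW] * K[i])
    return Z

V = np.random.rand(3, 8, 8)       # 3 input channels, 8 x 8 image
K = np.random.rand(4, 3, 2, 2)    # 4 output channels, 3 input channels, 2 x 2 kernels
print(multichannel_conv(V, K).shape)    # (4, 7, 7)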
We may want to skip over some positions of the kernel to reduce the computa-
tional cost (at the expense of not extracting our features as finely). We can think
of this as downsampling the output of the full convolution function. If we want to
sample only every s pixels in each direction in the output, then we can define a downsampled convolution function c such that

Z_{i,j,k} = c(K, V, s)_{i,j,k} = Σ_{l,m,n} V_{l, (j−1)×s+m, (k−1)×s+n} K_{i,l,m,n}.    (9.8)
We refer to s as the stride of this downsampled convolution. It is also possible
to define a separate stride for each direction of motion. See figure 9.12 for an
illustration.
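Strided convolution can be computed directly from equation 9.8, or by downsampling the unit-stride result as in figure 9.12. The single-channel sketch below (our own illustration) checks that the two agree.

import numpy as np

def conv2d_valid(V, K):
    # Unit-stride 'valid' cross-correlation of a single-channel image.
    kH, kW = K.shape
    out = np.zeros((V.shape[0] - kH + 1, V.shape[1] - kW + 1))
    for j in range(out.shape[0]):
        for k in range(out.shape[1]):
            out[j, k] = np.sum(V[j:j + kH, k:k + kW] * K)
    return out

def strided_conv2d(V, K, s):
    # Equation 9.8, single channel, zero-based: keep only every s-th output position.
    kH, kW = K.shape
    rows = (V.shape[0] - kH) // s + 1
    cols = (V.shape[1] - kW) // s + 1
    out = np.zeros((rows, cols))
    for j in range(rows):
        for k in range(cols):
            out[j, k] = np.sum(V[j * s:j * s + kH, k * s:k * s + kW] * K)
    return out

V = np.random.rand(9, 9)
K = np.random.rand(3, 3)
s = 2

# Strided convolution equals unit-stride convolution followed by downsampling.
assert np.allclose(strided_conv2d(V, K, s), conv2d_valid(V, K)[::s, ::s])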
One essential feature of any convolutional network implementation is the ability
to implicitly zero pad the input
V
to make it wider. Without this feature, the
width of the representation shrinks by one pixel less than the kernel width at
each layer. Zero padding the input allows us to control the kernel width and
the size of the output independently. Without zero padding, we are forced to
choose between shrinking the spatial extent of the network rapidly and using small
kernels—both scenarios that significantly limit the expressive power of the network.
See figure 9.13 for an example.
Three special cases of the zero-padding setting are worth mentioning. One is
the extreme case in which no zero padding is used whatsoever, and the convolution
kernel is allowed to visit only positions where the entire kernel is contained entirely
within the image. In MATLAB terminology, this is called
valid
convolution. In
this case, all pixels in the output are a function of the same number of pixels in
the input, so the behavior of an output pixel is somewhat more regular. However,
the size of the output shrinks at each layer. If the input image has width m and the kernel has width k, the output will be of width m − k + 1. The rate of this
shrinkage can be dramatic if the kernels used are large. Since the shrinkage is
greater than 0, it limits the number of convolutional layers that can be included
in the network. As layers are added, the spatial dimension of the network will
eventually drop to 1 × 1, at which point additional layers cannot meaningfully be
considered convolutional. Another special case of the zero-padding setting is when
just enough zero padding is added to keep the size of the output equal to the size
of the input. MATLAB calls this
same
convolution. In this case, the network
can contain as many convolutional layers as the available hardware can support,
since the operation of convolution does not modify the architectural possibilities
available to the next layer. The input pixels near the border, however, influence fewer output pixels than the input pixels near the center. This can make the border pixels somewhat underrepresented in the model. This motivates the other extreme case, which MATLAB refers to as full convolution, in which enough zeros are added for every pixel to be visited k times in each direction, resulting in an output image of width m + k − 1. In this case, the output pixels near the border are a function of fewer pixels than the output pixels near the center. This can make it difficult to learn a single kernel that performs well at all positions in the convolutional feature map. Usually the optimal amount of zero padding (in terms of test set classification accuracy) lies somewhere between “valid” and “same” convolution.

Figure 9.12: Convolution with a stride. In this example, we use a stride of two. (Top) Convolution with a stride length of two implemented in a single operation. (Bottom) Convolution with a stride greater than one pixel is mathematically equivalent to convolution with unit stride followed by downsampling. Obviously, the two-step approach involving downsampling is computationally wasteful, because it computes many values that are then discarded.

Figure 9.13: The effect of zero padding on network size. Consider a convolutional network with a kernel of width six at every layer. In this example, we do not use any pooling, so only the convolution operation itself shrinks the network size. (Top) In this convolutional network, we do not use any implicit zero padding. This causes the representation to shrink by five pixels at each layer. Starting from an input of sixteen pixels, we are only able to have three convolutional layers, and the last layer does not ever move the kernel, so arguably only two of the layers are truly convolutional. The rate of shrinking can be mitigated by using smaller kernels, but smaller kernels are less expressive, and some shrinking is inevitable in this kind of architecture. (Bottom) By adding five implicit zeros to each layer, we prevent the representation from shrinking with depth. This allows us to make an arbitrarily deep convolutional network.
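In one dimension the three padding regimes are exposed directly by NumPy's convolve through its mode argument; the short sketch below (our own) prints the output widths m − k + 1, m, and m + k − 1 discussed above.

import numpy as np

m, k = 16, 6                   # input width and kernel width, as in figure 9.13
x = np.random.rand(m)
w = np.random.rand(k)

for mode in ("valid", "same", "full"):
    print(mode, len(np.convolve(x, w, mode=mode)))
# valid 11   (m - k + 1: no padding, the output shrinks)
# same  16   (just enough zero padding to preserve the width)
# full  21   (m + k - 1: every input pixel is visited k times)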
In some cases, we do not actually want to use convolution, but want to use
locally connected layers instead (LeCun, 1986, 1989). In this case, the adjacency
matrix in the graph of our MLP is the same, but every connection has its own
weight, specified by a 6-D tensor W. The indices into W are respectively: i, the output channel; j, the output row; k, the output column; l, the input channel; m, the row offset within the input; and n, the column offset within the input. The linear part of a locally connected layer is then given by

Z_{i,j,k} = Σ_{l,m,n} [V_{l, j+m−1, k+n−1} w_{i,j,k,l,m,n}].    (9.9)
This is sometimes also called unshared convolution, because it is a similar operation to discrete convolution with a small kernel, but without sharing parameters
across locations. Figure 9.14 compares local connections, convolution, and full
connections.
Locally connected layers are useful when we know that each feature should be
a function of a small part of space, but there is no reason to think that the same
feature should occur across all of space. For example, if we want to tell if an image
is a picture of a face, we only need to look for the mouth in the bottom half of the
image.
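For comparison with the shared-weight case, here is a loop-based sketch of the locally connected layer of equation 9.9 (zero-based indexing, single example, our own illustrative code); every output position has its own kernel.

import numpy as np

def locally_connected(V, W):
    # Z[i, j, k] = sum_{l, m, n} V[l, j + m, k + n] * W[i, j, k, l, m, n]
    # (equation 9.9, zero-based): no parameter sharing across positions (j, k).
    out_ch, out_h, out_w, in_ch, kH, kW = W.shape
    Z = np.zeros((out_ch, out_h, out_w))
    for i in range(out_ch):
        for j in range(out_h):
            for k in range(out_w):
                Z[i, j, k] = np.sum(V[:, j:j + kH, k:k + kW] * W[i, j, k])
    return Z

V = np.random.rand(3, 6, 6)
W = np.random.rand(4, 5, 5, 3, 2, 2)    # a separate 3 x 2 x 2 kernel for every output unit
print(locally_connected(V, W).shape)    # (4, 5, 5)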
It can also be useful to make versions of convolution or locally connected layers
in which the connectivity is further restricted, for example to constrain each output
channel i to be a function of only a subset of the input channels l. A common way to do this is to make the first m output channels connect to only the first n input channels, the second m output channels connect to only the second n input channels, and so on. See figure 9.15 for an example. Modeling interactions
between few channels allows the network to have fewer parameters, reducing
memory consumption, increasing statistical efficiency, and reducing the amount of
computation needed to perform forward and back-propagation. It accomplishes
these goals without reducing the number of hidden units.
Tiled convolution (Gregor and LeCun, 2010a; Le et al., 2010) offers a compromise between a convolutional layer and a locally connected layer. Rather than
learning a separate set of weights at every spatial location, we learn a set of kernels
that we rotate through as we move through space. This means that immediately
neighboring locations will have different filters, as in a locally connected layer, but
the memory requirements for storing the parameters will increase only by a factor
of the size of this set of kernels, rather than by the size of the entire output feature
map. See figure 9.16 for a comparison of locally connected layers, tiled convolution,
and standard convolution.
To define tiled convolution algebraically, let K be a 6-D tensor, where two of the dimensions correspond to different locations in the output map. Rather than having a separate index for each location in the output map, output locations cycle through a set of t different choices of kernel stack in each direction. If t is equal to
the output width, this is the same as a locally connected layer.

Z_{i,j,k} = Σ_{l,m,n} V_{l, j+m−1, k+n−1} K_{i, l, m, n, j%t+1, k%t+1},    (9.10)

where percent is the modulo operation, with t % t = 0, (t + 1) % t = 1, and so on. It is straightforward to generalize this equation to use a different tiling range for each dimension.

Figure 9.14: Comparison of local connections, convolution, and full connections. (Top) A locally connected layer with a patch size of two pixels. Each edge is labeled with a unique letter to show that each edge is associated with its own weight parameter. (Center) A convolutional layer with a kernel width of two pixels. This model has exactly the same connectivity as the locally connected layer. The difference lies not in which units interact with each other, but in how the parameters are shared. The locally connected layer has no parameter sharing. The convolutional layer uses the same two weights repeatedly across the entire input, as indicated by the repetition of the letters labeling each edge. (Bottom) A fully connected layer resembles a locally connected layer in the sense that each edge has its own parameter (there are too many to label explicitly with letters in this diagram). It does not, however, have the restricted connectivity of the locally connected layer.
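Tiled convolution only adds a modulo when selecting which kernel to apply. A zero-based sketch of equation 9.10 (our own illustration; the axis ordering of K is our own convention) follows.

import numpy as np

def tiled_conv(V, K):
    # Z[i, j, k] = sum_{l, m, n} V[l, j + m, k + n] * K[i, l, m, n, j % t, k % t]
    # (equation 9.10, zero-based). The last two axes of K index a t x t set of
    # kernels that is cycled through as we move across the output.
    out_ch, in_ch, kH, kW, t, _ = K.shape
    _, H, W = V.shape
    Z = np.zeros((out_ch, H - kH + 1, W - kW + 1))
    for i in range(out_ch):
        for j in range(Z.shape[1]):
            for k in range(Z.shape[2]):
                Z[i, j, k] = np.sum(V[:, j:j + kH, k:k + kW] * K[i, :, :, :, j % t, k % t])
    return Z

V = np.random.rand(3, 8, 8)
K = np.random.rand(4, 3, 2, 2, 2, 2)    # t = 2: kernels alternate across space
print(tiled_conv(V, K).shape)           # (4, 7, 7)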
Figure 9.15: A convolutional network with the first two output channels connected to
only the first two input channels, and the second two output channels connected to only
the second two input channels.
Figure 9.16: A comparison of locally connected layers, tiled convolution, and standard convolution. All three have the same sets of connections between units, when the same size of kernel is used. This diagram illustrates the use of a kernel that is two pixels wide. The differences between the methods lie in how they share parameters. (Top) A locally connected layer has no sharing at all. We indicate that each connection has its own weight by labeling each connection with a unique letter. (Center) Tiled convolution has a set of t different kernels. Here we illustrate the case of t = 2. One of these kernels has edges labeled “a” and “b,” while the other has edges labeled “c” and “d.” Each time we move one pixel to the right in the output, we move on to using a different kernel. This means that, like the locally connected layer, neighboring units in the output have different parameters. Unlike the locally connected layer, after we have gone through all t available kernels, we cycle back to the first kernel. If two output units are separated by a multiple of t steps, then they share parameters. (Bottom) Traditional convolution is equivalent to tiled convolution with t = 1. There is only one kernel, and it is applied everywhere, as indicated in the diagram by using the kernel with weights labeled “a” and “b” everywhere.
Locally connected layers and tiled convolutional layers both have an interesting
interaction with max pooling: the detector units of these layers are driven by
different filters. If these filters learn to detect different transformed versions of
the same underlying features, then the max-pooled units become invariant to the
learned transformation (see figure 9.9). Convolutional layers are hard coded to be
invariant specifically to translation.
Other operations besides convolution are usually necessary to implement a
convolutional network. To perform learning, one must be able to compute the
gradient with respect to the kernel, given the gradient with respect to the outputs.
In some simple cases, this operation can be performed using the convolution
operation, but many cases of interest, including the case of stride greater than 1,
do not have this property.
Recall that convolution is a linear operation and can thus be described as a
matrix multiplication (if we first reshape the input tensor into a flat vector). The
matrix involved is a function of the convolution kernel. The matrix is sparse, and
each element of the kernel is copied to several elements of the matrix. This view
helps us to derive some of the other operations needed to implement a convolutional
network.
Multiplication by the transpose of the matrix defined by convolution is one
such operation. This is the operation needed to back-propagate error derivatives
through a convolutional layer, so it is needed to train convolutional networks
that have more than one hidden layer. This same operation is also needed if we
wish to reconstruct the visible units from the hidden units (Simard et al., 1992).
Reconstructing the visible units is an operation commonly used in the models
described in part III of this book, such as autoencoders, RBMs, and sparse coding.
Transpose convolution is necessary to construct convolutional versions of those
models. Like the kernel gradient operation, this input gradient operation can
sometimes be implemented using a convolution but in general requires a third
operation to be implemented. Care must be taken to coordinate this transpose
operation with the forward propagation. The size of the output that the transpose
operation should return depends on the zero-padding policy and stride of the
forward propagation operation, as well as the size of the forward propagation’s
output map. In some cases, multiple sizes of input to forward propagation can
result in the same size of output map, so the transpose operation must be explicitly
told what the size of the original input was.
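The transpose operation is easiest to see in the matrix view from section 9.1. The 1-D sketch below (our own, building the convolution matrix explicitly) shows that multiplying by the transpose maps a gradient with the shape of the output back to the shape of the input, which is what back-propagation through the layer and reconstruction both require.

import numpy as np

def conv_matrix(w, n):
    # Matrix C with C @ x equal to the 'valid' convolution of a length-n input with w.
    k = len(w)
    C = np.zeros((n - k + 1, n))
    for r in range(n - k + 1):
        C[r, r:r + k] = w[::-1]
    return C

n = 8
w = np.array([1.0, 2.0, 3.0])
C = conv_matrix(w, n)

x = np.random.rand(n)           # input to the layer
y = C @ x                       # forward propagation: length n - k + 1 = 6
g = np.random.rand(len(y))      # gradient arriving from the layer above

grad_x = C.T @ g                # multiplication by the transpose: length n again
print(y.shape, grad_x.shape)    # (6,) (8,)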
These three operations—convolution, backprop from output to weights, and
backprop from output to inputs—are sufficient to compute all the gradients needed
to train any depth of feedforward convolutional network, as well as to train
convolutional networks with reconstruction functions based on the transpose of
convolution. See Goodfellow (2010) for a full derivation of the equations in the
fully general multidimensional, multiexample case. To give a sense of how these
equations work, we present the two-dimensional, single example version here.
Suppose we want to train a convolutional network that incorporates strided convolution of kernel stack K applied to multichannel image V with stride s as defined by c(K, V, s), as in equation 9.8. Suppose we want to minimize some loss function J(V, K). During forward propagation, we will need to use c itself to output Z, which is then propagated through the rest of the network and used to compute the cost function J. During back-propagation, we will receive a tensor G such that G_{i,j,k} = ∂J(V, K)/∂Z_{i,j,k}.
To train the network, we need to compute the derivatives with respect to the
weights in the kernel. To do so, we can use a function
g(G, V, s)_{i,j,k,l} = ∂J(V, K)/∂K_{i,j,k,l} = Σ_{m,n} G_{i,m,n} V_{j, (m−1)×s+k, (n−1)×s+l}.    (9.11)
If this layer is not the bottom layer of the network, we will need to compute
the gradient with respect to V to back-propagate the error further down. To do
so, we can use a function
h(K, G, s)_{i,j,k} = ∂J(V, K)/∂V_{i,j,k}    (9.12)
         = Σ_{l,m s.t. (l−1)×s+m=j} Σ_{n,p s.t. (n−1)×s+p=k} Σ_q K_{q,i,m,p} G_{q,l,n}.    (9.13)
Autoencoder networks, described in chapter 14, are feedforward networks
trained to copy their input to their output. A simple example is the PCA algorithm,
which copies its input x to an approximate reconstruction r using the function W⊤W x. It is common for more general autoencoders to use multiplication by the transpose of the weight matrix just as PCA does. To make such models convolutional, we can use the function h to perform the transpose of the convolution operation. Suppose we have hidden units H in the same format as Z and we define a reconstruction

R = h(K, H, s).    (9.14)
To train the autoencoder, we will receive the gradient with respect to R as a tensor E. To train the decoder, we need to obtain the gradient with respect to K. This is given by g(H, E, s). To train the encoder, we need to obtain the gradient with respect to H. This is given by c(K, E, s). It is also possible to differentiate through g using c and h, but these operations are not needed for the back-propagation algorithm on any standard network architectures.
Generally, we do not use only a linear operation to transform from the inputs
to the outputs in a convolutional layer. We generally also add some bias term to
each output before applying the nonlinearity. This raises the question of how to
share parameters among the biases. For locally connected layers, it is natural to
give each unit its own bias, and for tiled convolution, it is natural to share the
biases with the same tiling pattern as the kernels. For convolutional layers, it is
typical to have one bias per channel of the output and share it across all locations
within each convolution map. If the input is of known, fixed size, however, it is also
possible to learn a separate bias at each location of the output map. Separating
the biases may slightly reduce the statistical efficiency of the model, but it allows
the model to correct for differences in the image statistics at different locations.
For example, when using implicit zero padding, detector units at the edge of the
image receive less total input and may need larger biases.
9.6 Structured Outputs
Convolutional networks can be used to output a high-dimensional structured object,
rather than just predicting a class label for a classification task or a real value for
a regression task. Typically this object is just a tensor, emitted by a standard
convolutional layer. For example, the model might emit a tensor S, where S_{i,j,k} is the probability that pixel (j, k) of the input to the network belongs to class i.
This allows the model to label every pixel in an image and draw precise masks
that follow the outlines of individual objects.
One issue that often comes up is that the output plane can be smaller than
the input plane, as shown in figure 9.13. In the kinds of architectures typically
used for classification of a single object in an image, the greatest reduction in the
spatial dimensions of the network comes from using pooling layers with large stride.
To produce an output map of similar size as the input, one can avoid pooling
altogether (Jain et al., 2007). Another strategy is to simply emit a lower-resolution
grid of labels (Pinheiro and Collobert, 2014, 2015). Finally, in principle, one could
use a pooling operator with unit stride.
One strategy for pixel-wise labeling of images is to produce an initial guess
of the image labels, then refine this initial guess using the interactions between
neighboring pixels. Repeating this refinement step several times corresponds to
using the same convolutions at each stage, sharing weights between the last layers of
the deep net (Jain et al., 2007). This makes the sequence of computations performed
by the successive convolutional layers with weights shared across layers a particular
352
CHAPTER 9. CONVOLUTIONAL NETWORKS
[Figure 9.17 diagram: the network unrolled for three steps, with hidden tensors H(1), H(2), H(3) and label estimates Ŷ(1), Ŷ(2), Ŷ(3) computed from the input X via the kernels U, V, and W.]
Figure 9.17: An example of a recurrent convolutional network for pixel labeling. The input is an image tensor X, with axes corresponding to image rows, image columns, and channels (red, green, blue). The goal is to output a tensor of labels Ŷ, with a probability distribution over labels for each pixel. This tensor has axes corresponding to image rows, image columns, and the different classes. Rather than outputting Ŷ in a single shot, the recurrent network iteratively refines its estimate Ŷ by using a previous estimate of Ŷ as input for creating a new estimate. The same parameters are used for each updated estimate, and the estimate can be refined as many times as we wish. The tensor of convolution kernels U is used on each step to compute the hidden representation given the input image. The kernel tensor V is used to produce an estimate of the labels given the hidden values. On all but the first step, the kernels W are convolved over Ŷ to provide input to the hidden layer. On the first time step, this term is replaced by zero. Because the same parameters are used on each step, this is an example of a recurrent network, as described in chapter 10.
kind of recurrent network (Pinheiro and Collobert, 2014, 2015). Figure 9.17 shows
the architecture of such a recurrent convolutional network.
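A minimal sketch of the refinement loop suggested by figure 9.17 is given below. It assumes single-channel tensors, random placeholder kernels U, V, W, and tanh/sigmoid nonlinearities chosen only for illustration; scipy's correlate2d stands in for the multichannel convolution operation.

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 16))          # single-channel input "image"
U = rng.standard_normal((3, 3)) * 0.1      # input -> hidden kernel
V = rng.standard_normal((3, 3)) * 0.1      # hidden -> label kernel
W = rng.standard_normal((3, 3)) * 0.1      # previous labels -> hidden kernel

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Y_hat = np.zeros_like(X)                   # on the first step the W-term is zero
for step in range(3):                      # refine the estimate a few times
    H = np.tanh(correlate2d(X, U, mode="same")
                + correlate2d(Y_hat, W, mode="same"))
    Y_hat = sigmoid(correlate2d(H, V, mode="same"))
# The same U, V, W are reused at every step, which is what makes this recurrent.
```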
Once a prediction for each pixel is made, various methods can be used to
further process these predictions to obtain a segmentation of the image into regions
(Briggman et al., 2009; Turaga et al., 2010; Farabet et al., 2013). The general idea
is to assume that large groups of contiguous pixels tend to be associated with the
same label. Graphical models can describe the probabilistic relationships between
neighboring pixels. Alternatively, the convolutional network can be trained to
maximize an approximation of the graphical model training objective (Ning et al.,
2005; Thompson et al., 2014).
9.7 Data Types
The data used with a convolutional network usually consist of several channels,
each channel being the observation of a different quantity at some point in space
or time. See table 9.1 for examples of data types with different dimensionalities
and number of channels.
For an example of convolutional networks applied to video, see Chen et al.
(2010).
So far we have discussed only the case where every example in the train and test
data has the same spatial dimensions. One advantage to convolutional networks
is that they can also process inputs with varying spatial extents. These kinds of
input simply cannot be represented by traditional, matrix multiplication-based
neural networks. This provides a compelling reason to use convolutional networks
even when computational cost and overfitting are not significant issues.
For example, consider a collection of images in which each image has a different
width and height. It is unclear how to model such inputs with a weight matrix of
fixed size. Convolution is straightforward to apply; the kernel is simply applied a
different number of times depending on the size of the input, and the output of the
convolution operation scales accordingly. Convolution may be viewed as matrix
multiplication; the same convolution kernel induces a different size of doubly block
circulant matrix for each size of input. Sometimes the output of the network as
well as the input is allowed to have variable size, for example, if we want to assign
a class label to each pixel of the input. In this case, no further design work is
necessary. In other cases, the network must produce some fixed-size output, for
example, if we want to assign a single class label to the entire image. In this case,
we must make some additional design steps, like inserting a pooling layer whose
pooling regions scale in size proportional to the size of the input, to maintain a
fixed number of pooled outputs. Some examples of this kind of strategy are shown
in figure 9.11.
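The sketch below illustrates pooling regions that scale with the input: whatever the spatial extent of the feature map, the pooled output has a fixed shape. The 2 × 2 output grid and the use of max pooling are arbitrary choices made here for illustration.

```python
import numpy as np

def fixed_size_max_pool(feature_map, out_rows=2, out_cols=2):
    """Pool a 2-D feature map of any size down to (out_rows, out_cols)
    by letting each pooling region cover a proportional slice of the input."""
    rows, cols = feature_map.shape
    row_edges = np.linspace(0, rows, out_rows + 1, dtype=int)
    col_edges = np.linspace(0, cols, out_cols + 1, dtype=int)
    pooled = np.empty((out_rows, out_cols))
    for i in range(out_rows):
        for j in range(out_cols):
            region = feature_map[row_edges[i]:row_edges[i + 1],
                                 col_edges[j]:col_edges[j + 1]]
            pooled[i, j] = region.max()
    return pooled

# Inputs of different sizes yield a pooled output of the same, fixed size.
print(fixed_size_max_pool(np.random.randn(9, 13)).shape)   # (2, 2)
print(fixed_size_max_pool(np.random.randn(30, 21)).shape)  # (2, 2)
```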
Note that the use of convolution for processing variably sized inputs makes
sense only for inputs that have variable size because they contain varying amounts
of observation of the same kind of thing—different lengths of recordings over time,
different widths of observations over space, and so forth. Convolution does not
make sense if the input has variable size because it can optionally include different
kinds of observations. For example, if we are processing college applications, and
our features consist of both grades and standardized test scores, but not every
applicant took the standardized test, then it does not make sense to convolve the
same weights over features corresponding to the grades as well as the features
1-D, single channel (audio waveform): The axis we convolve over corresponds to time. We discretize time and measure the amplitude of the waveform once per time step.

1-D, multichannel (skeleton animation data): Animations of 3-D computer-rendered characters are generated by altering the pose of a “skeleton” over time. At each point in time, the pose of the character is described by a specification of the angles of each of the joints in the character’s skeleton. Each channel in the data we feed to the convolutional model represents the angle about one axis of one joint.

2-D, single channel (audio data that has been preprocessed with a Fourier transform): We can transform the audio waveform into a 2-D tensor with different rows corresponding to different frequencies and different columns corresponding to different points in time. Using convolution over the time axis makes the model equivariant to shifts in time. Using convolution across the frequency axis makes the model equivariant to frequency, so that the same melody played in a different octave produces the same representation but at a different height in the network’s output.

2-D, multichannel (color image data): One channel contains the red pixels, one the green pixels, and one the blue pixels. The convolution kernel moves over both the horizontal and the vertical axes of the image, conferring translation equivariance in both directions.

3-D, single channel (volumetric data): A common source of this kind of data is medical imaging technology, such as CT scans.

3-D, multichannel (color video data): One axis corresponds to time, one to the height of the video frame, and one to the width of the video frame.

Table 9.1: Examples of different formats of data that can be used with convolutional networks.
corresponding to the test scores.
9.8 Efficient Convolution Algorithms
Modern convolutional network applications often involve networks containing more
than one million units. Powerful implementations exploiting parallel computation
resources, as discussed in section 12.1, are essential. In many cases, however, it
is also possible to speed up convolution by selecting an appropriate convolution
algorithm.
Convolution is equivalent to converting both the input and the kernel to the
frequency domain using a Fourier transform, performing point-wise multiplication
of the two signals, and converting back to the time domain using an inverse
Fourier transform. For some problem sizes, this can be faster than the naive
implementation of discrete convolution.
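The following sketch checks this equivalence for a 1-D signal: zero-pad both signals to the full output length, transform with the FFT, multiply point-wise, and invert. The signal and kernel sizes are arbitrary, and scipy's fftconvolve is included only as a cross-check.

```python
import numpy as np
from scipy.signal import fftconvolve

x = np.random.randn(256)    # input signal
k = np.random.randn(31)     # kernel

# Direct (naive) discrete convolution.
direct = np.convolve(x, k, mode="full")

# Same result via the frequency domain: transform, multiply point-wise,
# transform back.  Both signals are zero-padded to the full output length.
n = len(x) + len(k) - 1
freq = np.fft.rfft(x, n) * np.fft.rfft(k, n)
via_fft = np.fft.irfft(freq, n)

assert np.allclose(direct, via_fft)
assert np.allclose(direct, fftconvolve(x, k, mode="full"))
```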
When a d-dimensional kernel can be expressed as the outer product of d vectors, one vector per dimension, the kernel is called separable. When the kernel is separable, naive convolution is inefficient. It is equivalent to composing d one-dimensional convolutions with each of these vectors. The composed approach is significantly faster than performing one d-dimensional convolution with their outer product. The kernel also takes fewer parameters to represent as vectors. If the kernel is w elements wide in each dimension, then naive multidimensional convolution requires O(w^d) runtime and parameter storage space, while separable convolution requires O(w × d) runtime and parameter storage space. Of course, not every convolution can be represented in this way.
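A small sketch of this equivalence for d = 2, assuming a 5 × 5 kernel built as the outer product of two vectors: composing two cheap 1-D convolutions, one per axis, reproduces the full 2-D convolution with the outer-product kernel.

```python
import numpy as np
from scipy.signal import convolve2d

image = np.random.randn(64, 64)
row_filter = np.random.randn(5)
col_filter = np.random.randn(5)

# A separable kernel is the outer product of one vector per dimension.
kernel_2d = np.outer(row_filter, col_filter)

# One d-dimensional convolution with the w x w outer-product kernel ...
direct = convolve2d(image, kernel_2d)                    # O(w^2) work per output

# ... equals d one-dimensional convolutions composed, one per axis.
composed = convolve2d(image, row_filter[:, None])        # convolve along rows
composed = convolve2d(composed, col_filter[None, :])     # then along columns

assert np.allclose(direct, composed)                     # O(w * d) work per output
```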
Devising faster ways of performing convolution or approximate convolution
without harming the accuracy of the model is an active area of research. Even tech-
niques that improve the efficiency of only forward propagation are useful because
in the commercial setting, it is typical to devote more resources to deployment of
a network than to its training.
9.9 Random or Unsupervised Features
Typically, the most expensive part of convolutional network training is learning
the features. The output layer is usually relatively inexpensive because of the
small number of features provided as input to this layer after passing through
several layers of pooling. When performing supervised training with gradient
descent, every gradient step requires a complete run of forward propagation and
backward propagation through the entire network. One way to reduce the cost of
convolutional network training is to use features that are not trained in a supervised
fashion.
There are three basic strategies for obtaining convolution kernels without
supervised training. One is to simply initialize them randomly. Another is to
design them by hand, for example, by setting each kernel to detect edges at a
certain orientation or scale. Finally, one can learn the kernels with an unsupervised
criterion. For example, Coates et al. (2011) apply k-means clustering to small image patches, then use each learned centroid as a convolution kernel. In Part III
we describe many more unsupervised learning approaches. Learning the features
with an unsupervised criterion allows them to be determined separately from the
classifier layer at the top of the architecture. One can then extract the features for
the entire training set just once, essentially constructing a new training set for the
last layer. Learning the last layer is then typically a convex optimization problem,
assuming the last layer is something like logistic regression or an SVM.
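A toy sketch in the spirit of this pipeline is shown below. The patch size, the number of centroids, the mean pooling, the random placeholder "images" and labels, and the use of scikit-learn and scipy are all choices made here purely for illustration.

```python
import numpy as np
from scipy.signal import correlate2d
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
images = rng.standard_normal((100, 16, 16))        # toy grayscale "dataset"
labels = rng.integers(0, 2, size=100)              # toy binary labels

# 1. Learn kernels without supervision: k-means on randomly sampled patches.
patch = 5
patches = np.stack([img[r:r + patch, c:c + patch]
                    for img in images
                    for r, c in [rng.integers(0, 16 - patch, size=2)]])
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(
    patches.reshape(len(patches), -1))
kernels = kmeans.cluster_centers_.reshape(8, patch, patch)

# 2. Extract features once: convolve with the fixed kernels, then average-pool.
def features(img):
    return np.array([correlate2d(img, k, mode="valid").mean() for k in kernels])

X = np.stack([features(img) for img in images])

# 3. Train only the last layer, a convex problem.
clf = LogisticRegression().fit(X, labels)
```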
Random filters often work surprisingly well in convolutional networks (Jarrett
et al., 2009; Saxe et al., 2011; Pinto et al., 2011; Cox and Pinto, 2011). Saxe et al.
(2011) showed that layers consisting of convolution followed by pooling naturally
become frequency selective and translation invariant when assigned random weights.
They argue that this provides an inexpensive way to choose the architecture of
a convolutional network: first, evaluate the performance of several convolutional
network architectures by training only the last layer; then take the best of these
architectures and train the entire architecture using a more expensive approach.
An intermediate approach is to learn the features, but using methods that do
not require full forward and back-propagation at every gradient step. As with multilayer perceptrons, we use greedy layer-wise pretraining to train the first layer in isolation, then extract all features from the first layer only once, then train the second layer in isolation given those features, and so on. In chapter 8 we described how to perform supervised greedy layer-wise pretraining, and in part III we extend this to greedy layer-wise pretraining using an unsupervised criterion at each layer. The
canonical example of greedy layer-wise pretraining of a convolutional model is the
convolutional deep belief network (Lee et al., 2009). Convolutional networks offer
us the opportunity to take the pretraining strategy one step further than is possible
with multilayer perceptrons. Instead of training an entire convolutional layer at a
time, we can train a model of a small patch, as Coates et al. (2011) do with k-means.
We can then use the parameters from this patch-based model to define the kernels
of a convolutional layer. This means that it is possible to use unsupervised learning
to train a convolutional network without ever using convolution during the training
process. Using this approach, we can train very large models and incur a high
computational cost only at inference time (Ranzato et al., 2007b; Jarrett et al.,
2009; Kavukcuoglu et al., 2010; Coates et al., 2013). This approach was popular
from roughly 2007 to 2013, when labeled datasets were small and computational
power was more limited. Today, most convolutional networks are trained in a
purely supervised fashion, using full forward and back-propagation through the
entire network on each training iteration.
As with other approaches to unsupervised pretraining, it remains difficult to
tease apart the cause of some of the benefits seen with this approach. Unsupervised
pretraining may offer some regularization relative to supervised training, or it
may simply allow us to train much larger architectures because of the reduced
computational cost of the learning rule.
9.10 The Neuroscientific Basis for Convolutional
Networks
Convolutional networks are perhaps the greatest success story of biologically
inspired artificial intelligence. Though convolutional networks have been guided
by many other fields, some of the key design principles of neural networks were
drawn from neuroscience.
The history of convolutional networks begins with neuroscientific experiments
long before the relevant computational models were developed. Neurophysiologists
David Hubel and Torsten Wiesel collaborated for several years to determine many
of the most basic facts about how the mammalian vision system works (Hubel and
Wiesel, 1959, 1962, 1968). Their accomplishments were eventually recognized with
a Nobel prize. Their findings that have had the greatest influence on contemporary
deep learning models were based on recording the activity of individual neurons in
cats. They observed how neurons in the cat’s brain responded to images projected
in precise locations on a screen in front of the cat. Their great discovery was
that neurons in the early visual system responded most strongly to very specific
patterns of light, such as precisely oriented bars, but responded hardly at all to
other patterns.
Their work helped to characterize many aspects of brain function that are
beyond the scope of this book. From the point of view of deep learning, we can
focus on a simplified, cartoon view of brain function.
In this simplified view, we focus on a part of the brain called V1, also known as the primary visual cortex. V1 is the first area of the brain that begins to
perform significantly advanced processing of visual input. In this cartoon view,
images are formed by light arriving in the eye and stimulating the retina, the
light-sensitive tissue in the back of the eye. The neurons in the retina perform
some simple preprocessing of the image but do not substantially alter the way it is
represented. The image then passes through the optic nerve and a brain region
called the lateral geniculate nucleus. The main role, as far as we are concerned
here, of both anatomical regions is primarily just to carry the signal from the eye
to V1, which is located at the back of the head.
A convolutional network layer is designed to capture three properties of V1:
1. V1 is arranged in a spatial map. It actually has a two-dimensional structure, mirroring the structure of the image in the retina. For example, light arriving at the lower half of the retina affects only the corresponding half of V1. Convolutional networks capture this property by having their features defined in terms of two-dimensional maps.

2. V1 contains many simple cells. A simple cell's activity can to some extent be characterized by a linear function of the image in a small, spatially localized receptive field. The detector units of a convolutional network are designed to emulate these properties of simple cells.

3. V1 also contains many complex cells. These cells respond to features that are similar to those detected by simple cells, but complex cells are invariant to small shifts in the position of the feature. This inspires the pooling units of convolutional networks. Complex cells are also invariant to some changes in lighting that cannot be captured simply by pooling over spatial locations. These invariances have inspired some of the cross-channel pooling strategies in convolutional networks, such as maxout units (Goodfellow et al., 2013a).
Though we know the most about V1, it is generally believed that the same
basic principles apply to other areas of the visual system. In our cartoon view of
the visual system, the basic strategy of detection followed by pooling is repeatedly
applied as we move deeper into the brain. As we pass through multiple anatomical
layers of the brain, we eventually find cells that respond to some specific concept
and are invariant to many transformations of the input. These cells have been
nicknamed “grandmother cells”—the idea is that a person could have a neuron that
activates when seeing an image of their grandmother, regardless of whether she
appears in the left or right side of the image, whether the image is a close-up of
her face or a zoomed-out shot of her entire body, whether she is brightly lit or in
shadow, and so on.
These grandmother cells have been shown to actually exist in the human brain,
in a region called the medial temporal lobe (Quiroga et al., 2005). Researchers
tested whether individual neurons would respond to photos of famous individuals.
They found what has come to be called the “Halle Berry neuron,” an individual
neuron that is activated by the concept of Halle Berry. This neuron fires when a
person sees a photo of Halle Berry, a drawing of Halle Berry, or even text containing
the words “Halle Berry.” Of course, this has nothing to do with Halle Berry herself;
other neurons responded to the presence of Bill Clinton, Jennifer Aniston, and so
forth.
These medial temporal lobe neurons are somewhat more general than modern
convolutional networks, which would not automatically generalize to identifying
a person or object when reading its name. The closest analog to a convolutional
network’s last layer of features is a brain area called the inferotemporal cortex (IT).
When viewing an object, information flows from the retina, through the LGN, to
V1, then onward to V2, then V4, then IT. This happens within the first 100ms
of glimpsing an object. If a person is allowed to continue looking at the object
for more time, then information will begin to flow backward as the brain uses
top-down feedback to update the activations in the lower level brain areas. If we
interrupt the person’s gaze, however, and observe only the firing rates that result
from the first 100ms of mostly feedforward activation, then IT proves to be similar
to a convolutional network. Convolutional networks can predict IT firing rates and
perform similarly to (time-limited) humans on object recognition tasks (DiCarlo,
2013).
That being said, there are many differences between convolutional networks
and the mammalian vision system. Some of these differences are well known
to computational neuroscientists but outside the scope of this book. Some of
these differences are not yet known, because many basic questions about how the
mammalian vision system works remain unanswered. As a brief list:
• The human eye is mostly very low resolution, except for a tiny patch called the fovea. The fovea only observes an area about the size of a thumbnail held at arm's length. Though we feel as if we can see an entire scene in high resolution, this is an illusion created by the subconscious part of our brain, as it stitches together several glimpses of small areas. Most convolutional networks actually receive large full-resolution photographs as input. The human brain makes several eye movements called saccades to glimpse the most visually salient or task-relevant parts of a scene. Incorporating similar attention mechanisms into deep learning models is an active research direction. In the context of deep learning, attention mechanisms have been most successful for natural language processing, as described in section 12.4.5.1. Several visual models with foveation mechanisms have been developed but so far have not become the dominant approach (Larochelle and Hinton, 2010; Denil et al., 2012).
• The human visual system is integrated with many other senses, such as hearing, and factors like our moods and thoughts. Convolutional networks so far are purely visual.
• The human visual system does much more than just recognize objects. It is able to understand entire scenes, including many objects and relationships between objects, and it processes rich 3-D geometric information needed for our bodies to interface with the world. Convolutional networks have been applied to some of these problems, but these applications are in their infancy.
• Even simple brain areas like V1 are heavily affected by feedback from higher levels. Feedback has been explored extensively in neural network models but has not yet been shown to offer a compelling improvement.
• While feedforward IT firing rates capture much of the same information as convolutional network features, it is not clear how similar the intermediate computations are. The brain probably uses very different activation and pooling functions. An individual neuron's activation probably is not well characterized by a single linear filter response. A recent model of V1 involves multiple quadratic filters for each neuron (Rust et al., 2005). Indeed our cartoon picture of “simple cells” and “complex cells” might create a nonexistent distinction; simple cells and complex cells might both be the same kind of cell but with their “parameters” enabling a continuum of behaviors ranging from what we call “simple” to what we call “complex.”
It is also worth mentioning that neuroscience has told us relatively little
about how to train convolutional networks. Model structures with parameter
sharing across multiple spatial locations date back to early connectionist models
of vision (Marr and Poggio, 1976), but these models did not use the modern
back-propagation algorithm and gradient descent. For example, the neocognitron
(Fukushima, 1980) incorporated most of the model architecture design elements of
the modern convolutional network but relied on a layer-wise unsupervised clustering
algorithm.
Lang and Hinton (1988) introduced the use of back-propagation to train time-delay neural networks (TDNNs). To use contemporary terminology,
TDNNs are one-dimensional convolutional networks applied to time series. Back-
propagation applied to these models was not inspired by any neuroscientific observa-
tion and is considered by some to be biologically implausible. Following the success
of back-propagation-based training of TDNNs, LeCun et al. (1989) developed the
modern convolutional network by applying the same training algorithm to 2-D
convolution applied to images.
So far we have described how simple cells are roughly linear and selective
for certain features, complex cells are more nonlinear and become invariant to
some transformations of these simple cell features, and stacks of layers that
alternate between selectivity and invariance can yield grandmother cells for specific
phenomena. We have not yet described precisely what these individual cells detect.
In a deep nonlinear network, it can be difficult to understand the function of
individual cells. Simple cells in the first layer are easier to analyze, because their
responses are driven by a linear function. In an artificial neural network, we can
just display an image of the convolution kernel to see what the corresponding
channel of a convolutional layer responds to. In a biological neural network, we
do not have access to the weights themselves. Instead, we put an electrode in
the neuron, display several samples of white noise images in front of the animal’s
retina, and record how each of these samples causes the neuron to activate. We
can then fit a linear model to these responses to obtain an approximation of the
neuron's weights. This approach is known as reverse correlation (Ringach and Shapley, 2004).
Reverse correlation shows us that most V1 cells have weights that are described by Gabor functions. The Gabor function describes the weight at a 2-D point in the image. We can think of an image as being a function of 2-D coordinates, I(x, y). Likewise, we can think of a simple cell as sampling the image at a set of locations, defined by a set of x coordinates X and a set of y coordinates Y, then applying weights that are also a function of the location, w(x, y). From this point of view, the response of a simple cell to an image is given by

s(I) = \sum_{x \in X} \sum_{y \in Y} w(x, y)\, I(x, y).    (9.15)
Specifically, w(x, y) takes the form of a Gabor function:

w(x, y; \alpha, \beta_x, \beta_y, f, \phi, x_0, y_0, \tau) = \alpha \exp\left(-\beta_x x'^2 - \beta_y y'^2\right) \cos(f x' + \phi),    (9.16)

where

x' = (x - x_0) \cos(\tau) + (y - y_0) \sin(\tau)    (9.17)

and

y' = -(x - x_0) \sin(\tau) + (y - y_0) \cos(\tau).    (9.18)
Figure 9.18: Gabor functions with a variety of parameter settings. White indicates large positive weight, black indicates large negative weight, and the background gray corresponds to zero weight. (Left) Gabor functions with different values of the parameters that control the coordinate system: x_0, y_0, and τ. Each Gabor function in this grid is assigned a value of x_0 and y_0 proportional to its position in its grid, and τ is chosen so that each Gabor filter is sensitive to the direction radiating out from the center of the grid. For the other two plots, x_0, y_0, and τ are fixed to zero. (Center) Gabor functions with different Gaussian scale parameters β_x and β_y. Gabor functions are arranged in increasing width (decreasing β_x) as we move left to right through the grid, and increasing height (decreasing β_y) as we move top to bottom. For the other two plots, the β values are fixed to 1.5 times the image width. (Right) Gabor functions with different sinusoid parameters f and φ. As we move top to bottom, f increases, and as we move left to right, φ increases. For the other two plots, φ is fixed to 0 and f is fixed to 5 times the image width.
Here, α, β_x, β_y, f, φ, x_0, y_0, and τ are parameters that control the properties of the Gabor function. Figure 9.18 shows some examples of Gabor functions with different settings of these parameters.
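As a sketch, the snippet below transcribes equations 9.16-9.18 directly into NumPy to build the weight array for one such filter on a grid of pixel coordinates; the grid size and the default parameter values are arbitrary choices made here for illustration.

```python
import numpy as np

def gabor_weights(size, alpha=1.0, beta_x=0.05, beta_y=0.05,
                  f=0.5, phi=0.0, x0=0.0, y0=0.0, tau=0.0):
    """Evaluate w(x, y; alpha, beta_x, beta_y, f, phi, x0, y0, tau) from
    equations 9.16-9.18 on a size x size grid of integer coordinates."""
    coords = np.arange(size) - size // 2
    x, y = np.meshgrid(coords, coords, indexing="ij")
    x_p = (x - x0) * np.cos(tau) + (y - y0) * np.sin(tau)     # eq. 9.17
    y_p = -(x - x0) * np.sin(tau) + (y - y0) * np.cos(tau)    # eq. 9.18
    gaussian = alpha * np.exp(-beta_x * x_p ** 2 - beta_y * y_p ** 2)
    return gaussian * np.cos(f * x_p + phi)                   # eq. 9.16

w = gabor_weights(size=15, tau=np.pi / 4)   # a filter tuned to a 45-degree direction
```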
The parameters x_0, y_0, and τ define a coordinate system. We translate and rotate x and y to form x′ and y′. Specifically, the simple cell will respond to image features centered at the point (x_0, y_0), and it will respond to changes in brightness as we move along a line rotated τ radians from the horizontal.
Viewed as a function of x′ and y′, the function w then responds to changes in brightness as we move along the x′ axis. It has two important factors: one is a Gaussian function, and the other is a cosine function.
The Gaussian factor α exp(−β_x x′² − β_y y′²) can be seen as a gating term that ensures that the simple cell will respond only to values near where x′ and y′ are both zero, in other words, near the center of the cell's receptive field. The scaling factor α adjusts the total magnitude of the simple cell's response, while β_x and β_y control how quickly its receptive field falls off.
The cosine factor cos(f x′ + φ) controls how the simple cell responds to changing brightness along the x′ axis. The parameter f controls the frequency of the cosine, and φ controls its phase offset.
Altogether, this cartoon view of simple cells means that a simple cell responds
to a specific spatial frequency of brightness in a specific direction at a specific
location. Simple cells are most excited when the wave of brightness in the image
has the same phase as the weights. This occurs when the image is bright where the
weights are positive and dark where the weights are negative. Simple cells are most
inhibited when the wave of brightness is fully out of phase with the weights—when
the image is dark where the weights are positive and bright where the weights are
negative.
The cartoon view of a complex cell is that it computes the L² norm of the 2-D vector containing two simple cells' responses: c(I) = √(s_0(I)² + s_1(I)²). An important special case occurs when s_1 has all the same parameters as s_0 except for φ, and φ is set such that s_1 is one quarter cycle out of phase with s_0. In this case, s_0 and s_1 form a quadrature pair. A complex cell defined in this way responds when the Gaussian reweighted image I(x, y) exp(−β_x x′² − β_y y′²) contains a high-amplitude sinusoidal wave with frequency f in direction τ near (x_0, y_0), regardless of the phase offset of this wave. In other words, the complex cell is invariant to small translations of the image in direction τ, or to negating the image (replacing black with white and vice versa).
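The following sketch illustrates this quadrature-pair behavior numerically. It repeats a simplified Gabor construction so the snippet stands alone (parameters again arbitrary, direction fixed to τ = 0), builds two filters a quarter cycle apart in phase, and checks that shifting the phase of a sinusoidal grating swings each simple cell's response strongly while leaving the complex cell's response nearly unchanged.

```python
import numpy as np

def gabor(size, beta=0.05, f=0.5, phi=0.0):
    """Simplified, axis-aligned Gabor weights (illustrative parameters)."""
    coords = np.arange(size) - size // 2
    x, y = np.meshgrid(coords, coords, indexing="ij")
    return np.exp(-beta * (x ** 2 + y ** 2)) * np.cos(f * x + phi)

size, f = 15, 0.5
w0 = gabor(size, f=f, phi=0.0)               # simple cell s_0
w1 = gabor(size, f=f, phi=np.pi / 2)         # s_1, one quarter cycle out of phase

def responses(image_phase):
    coords = np.arange(size) - size // 2
    x, _ = np.meshgrid(coords, coords, indexing="ij")
    image = np.cos(f * x + image_phase)      # sinusoidal grating in direction tau = 0
    s0, s1 = np.sum(w0 * image), np.sum(w1 * image)
    return s0, s1, np.sqrt(s0 ** 2 + s1 ** 2)   # complex cell: L2 norm of (s0, s1)

# Shifting the grating's phase swings the simple-cell responses strongly,
# while the complex-cell response stays approximately constant.
for phase in [0.0, np.pi / 4, np.pi / 2]:
    print(["%.3f" % v for v in responses(phase)])
```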
Some of the most striking correspondences between neuroscience and machine
learning come from visually comparing the features learned by machine learning
models with those employed by V1. Olshausen and Field (1996) showed that
a simple unsupervised learning algorithm, sparse coding, learns features with
receptive fields similar to those of simple cells. Since then, we have found that
an extremely wide variety of statistical learning algorithms learn features with
Gabor-like functions when applied to natural images. This includes most deep
learning algorithms, which learn these features in their first layer. Figure 9.19
shows some examples. Because so many different learning algorithms learn edge
detectors, it is difficult to conclude that any specific learning algorithm is the “right”
model of the brain just based on the features it learns (though it can certainly be a
bad sign if an algorithm does not learn some sort of edge detector when applied to
natural images). These features are an important part of the statistical structure
of natural images and can be recovered by many different approaches to statistical
modeling. See Hyvärinen et al. (2009) for a review of the field of natural image
statistics.
Figure 9.19: Many machine learning algorithms learn features that detect edges or specific
colors of edges when applied to natural images. These feature detectors are reminiscent
of the Gabor functions known to be present in the primary visual cortex. (Left) Weights learned by an unsupervised learning algorithm (spike and slab sparse coding) applied to small image patches. (Right) Convolution kernels learned by the first layer of a fully
supervised convolutional maxout network. Neighboring pairs of filters drive the same
maxout unit.
9.11 Convolutional Networks and the History of Deep
Learning
Convolutional networks have played an important role in the history of deep
learning. They are a key example of a successful application of insights obtained
by studying the brain to machine learning applications. They were also some of
the first deep models to perform well, long before arbitrary deep models were
considered viable. Convolutional networks were also some of the first neural
networks to solve important commercial applications and remain at the forefront
of commercial applications of deep learning today. For example, in the 1990s, the
neural network research group at AT&T developed a convolutional network for
reading checks (LeCun et al., 1998b). By the end of the 1990s, this system deployed
by NCR was reading over 10 percent of all the checks in the United States. Later,
several OCR and handwriting recognition systems based on convolutional nets
were deployed by Microsoft (Simard et al., 2003). See chapter 12 for more details
on such applications and more modern applications of convolutional networks. See
LeCun et al. (2010) for a more in-depth history of convolutional networks up to
2010.
Convolutional networks were also used to win many contests. The current
intensity of commercial interest in deep learning began when Krizhevsky et al.
(2012) won the ImageNet object recognition challenge, but convolutional networks
had been used to win other machine learning and computer vision contests with
less impact for years earlier.
Convolutional nets were some of the first working deep networks trained with
back-propagation. It is not entirely clear why convolutional networks succeeded
when general back-propagation networks were considered to have failed. It may
simply be that convolutional networks were more computationally efficient than
fully connected networks, so it was easier to run multiple experiments with them
and tune their implementation and hyperparameters. Larger networks also seem
to be easier to train. With modern hardware, large fully connected networks
appear to perform reasonably on many tasks, even when using datasets that were
available and activation functions that were popular during the times when fully
connected networks were believed not to work well. It may be that the primary
barriers to the success of neural networks were psychological (practitioners did
not expect neural networks to work, so they did not make a serious effort to use
neural networks). Whatever the case, it is fortunate that convolutional networks
performed well decades ago. In many ways, they carried the torch for the rest of
deep learning and paved the way to the acceptance of neural networks in general.
Convolutional networks provide a way to specialize neural networks to work
with data that has a clear grid-structured topology and to scale such models to
very large size. This approach has been the most successful on a two-dimensional
image topology. To process one-dimensional sequential data, we turn next to
another powerful specialization of the neural networks framework: recurrent neural
networks.