Chapter 9
Convolutional Networks
Convolutional networks (LeCun, 1989), also known as convolutional neural networks, or CNNs, are a specialized kind of neural network for processing data that has a known grid-like topology. Examples include time-series data, which can be thought of as a 1-D grid taking samples at regular time intervals, and image data, which can be thought of as a 2-D grid of pixels. Convolutional networks have been tremendously successful in practical applications. The name “convolutional neural network” indicates that the network employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation. Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
In this chapter, we first describe what convolution is. Next, we explain the
motivation behind using convolution in a neural network. We then describe an
operation called pooling, which almost all convolutional networks employ. Usually,
the operation used in a convolutional neural network does not correspond precisely
to the definition of convolution as used in other fields, such as engineering or
pure mathematics. We describe several variants on the convolution function that
are widely used in practice for neural networks. We also show how convolution
may be applied to many kinds of data, with different numbers of dimensions. We
then discuss means of making convolution more efficient. Convolutional networks
stand out as an example of neuroscientific principles influencing deep learning.
We discuss these neuroscientific principles, then conclude with comments about
the role convolutional networks have played in the history of deep learning. One
topic this chapter does not address is how to choose the architecture of your
convolutional network. The goal of this chapter is to describe the kinds of tools
that convolutional networks provide, while chapter 11 describes general guidelines
for choosing which tools to use in which circumstances. Research into convolutional
network architectures proceeds so rapidly that a new best architecture for a given
benchmark is announced every few weeks to months, rendering it impractical to
describe in print the best architecture. Nonetheless, the best architectures have
consistently been composed of the building blocks described here.
9.1 The Convolution Operation
In its most general form, convolution is an operation on two functions of a real-
valued argument. To motivate the definition of convolution, we start with examples
of two functions we might use.
Suppose we are tracking the location of a spaceship with a laser sensor. Our
laser sensor provides a single output x(t), the position of the spaceship at time t. Both x and t are real valued, that is, we can get a different reading from the laser sensor at any instant in time.
Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy
estimate of the spaceship’s position, we would like to average several measurements.
Of course, more recent measurements are more relevant, so we will want this to
be a weighted average that gives more weight to recent measurements. We can do this with a weighting function w(a), where a is the age of a measurement. If we apply such a weighted average operation at every moment, we obtain a new function s providing a smoothed estimate of the position of the spaceship:

s(t) = ∫ x(a) w(t − a) da.    (9.1)
This operation is called convolution. The convolution operation is typically denoted with an asterisk:

s(t) = (x ∗ w)(t).    (9.2)
In our example, w needs to be a valid probability density function, or the output will not be a weighted average. Also, w needs to be 0 for all negative arguments, or it will look into the future, which is presumably beyond our capabilities. These limitations are particular to our example, though. In general, convolution is defined for any functions for which the above integral is defined and may be used for other purposes besides taking weighted averages.
In convolutional network terminology, the first argument (in this example, the function x) to the convolution is often referred to as the input, and the second
argument (in this example, the function w) as the kernel. The output is sometimes referred to as the feature map.
In our example, the idea of a laser sensor that can provide measurements at
every instant is not realistic. Usually, when we work with data on a computer,
time will be discretized, and our sensor will provide data at regular intervals.
In our example, it might be more realistic to assume that our laser provides a
measurement once per second. The time index t can then take on only integer values. If we now assume that x and w are defined only on integer t, we can define the discrete convolution:

s(t) = (x ∗ w)(t) = Σ_{a=−∞}^{∞} x(a) w(t − a).    (9.3)
In machine learning applications, the input is usually a multidimensional array
of data, and the kernel is usually a multidimensional array of parameters that are
adapted by the learning algorithm. We will refer to these multidimensional arrays
as tensors. Because each element of the input and kernel must be explicitly stored
separately, we usually assume that these functions are zero everywhere but in the
finite set of points for which we store the values. This means that in practice, we
can implement the infinite summation as a summation over a finite number of
array elements.
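As a concrete illustration of this finite summation, the following Python/NumPy sketch (our own; it is not part of the original text) evaluates equation 9.3 for a short signal and a three-element kernel, and checks the result against NumPy's built-in routine. The names x, w and s mirror the notation above.

import numpy as np

# A short signal x(t) and a 3-tap weighting kernel w(a); both are treated as
# zero outside the stored range, so the infinite sum reduces to a finite one.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.5, 0.3, 0.2])   # more weight on more recent measurements

def discrete_conv(x, w):
    # s(t) = sum_a x(a) w(t - a), for every t where the kernel overlaps the signal.
    s = np.zeros(len(x) + len(w) - 1)
    for t in range(len(s)):
        for a in range(len(x)):
            if 0 <= t - a < len(w):
                s[t] += x[a] * w[t - a]
    return s

s = discrete_conv(x, w)
assert np.allclose(s, np.convolve(x, w))   # NumPy computes the same finite sum
print(s)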
Finally, we often use convolutions over more than one axis at a time. For
example, if we use a two-dimensional image I as our input, we probably also want to use a two-dimensional kernel K:

S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n).    (9.4)
Convolution is commutative, meaning we can equivalently write

S(i, j) = (K ∗ I)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n).    (9.5)

Usually the latter formula is more straightforward to implement in a machine learning library, because there is less variation in the range of valid values of m and n.
The commutative property of convolution arises because we have flipped the kernel relative to the input, in the sense that as m increases, the index into the input increases, but the index into the kernel decreases. The only reason to flip the kernel is to obtain the commutative property. While the commutative property
is useful for writing proofs, it is not usually an important property of a neural
network implementation. Instead, many neural network libraries implement a
related function called the cross-correlation, which is the same as convolution but without flipping the kernel:

S(i, j) = (K ∗ I)(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n).    (9.6)
Many machine learning libraries implement cross-correlation but call it convolution.
In this text we follow this convention of calling both operations convolution and
specify whether we mean to flip the kernel or not in contexts where kernel flipping
is relevant. In the context of machine learning, the learning algorithm will learn
the appropriate values of the kernel in the appropriate place, so an algorithm based
on convolution with kernel flipping will learn a kernel that is flipped relative to the
kernel learned by an algorithm without the flipping. It is also rare for convolution
to be used alone in machine learning; instead convolution is used simultaneously
with other functions, and the combination of these functions does not commute
regardless of whether the convolution operation flips its kernel or not.
See figure 9.1 for an example of convolution (without kernel flipping) applied
to a 2-D tensor.
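The difference between the two operations is easy to verify numerically. The sketch below (an illustration of ours; the helper function name is our own) computes a “valid” cross-correlation as in equation 9.6, and shows that applying it to a flipped kernel reproduces the convolution of equation 9.4 restricted to the same positions.

import numpy as np

def cross_correlate2d_valid(I, K):
    # S(i, j) = sum_m sum_n I(i + m, j + n) K(m, n), over 'valid' positions only.
    kH, kW = K.shape
    out = np.zeros((I.shape[0] - kH + 1, I.shape[1] - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i + kH, j:j + kW] * K)
    return out

I = np.arange(20.0).reshape(4, 5)
K = np.array([[1.0, 2.0],
              [3.0, 4.0]])

corr = cross_correlate2d_valid(I, K)              # cross-correlation, no flipping
conv = cross_correlate2d_valid(I, K[::-1, ::-1])  # flipping K first gives the convolution
print(corr)
print(conv)   # same shape, different values unless K happens to be symmetric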
Discrete convolution can be viewed as multiplication by a matrix, but the
matrix has several entries constrained to be equal to other entries. For example,
for univariate discrete convolution, each row of the matrix is constrained to be
equal to the row above shifted by one element. This is known as a Toeplitz matrix. In two dimensions, a doubly block circulant matrix corresponds to
convolution. In addition to these constraints that several elements be equal to
each other, convolution usually corresponds to a very sparse matrix (a matrix
whose entries are mostly equal to zero). This is because the kernel is usually much
smaller than the input image. Any neural network algorithm that works with
matrix multiplication and does not depend on specific properties of the matrix
structure should work with convolution, without requiring any further changes
to the neural network. Typical convolutional neural networks do make use of
further specializations in order to deal with large inputs efficiently, but these are
not strictly necessary from a theoretical perspective.
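To make the matrix view concrete, the sketch below (our own illustration, not code from the text) builds the Toeplitz-structured matrix corresponding to a 1-D “valid” convolution and checks that multiplying by it reproduces the convolution. Each row repeats the flipped kernel shifted by one position, and most entries are zero.

import numpy as np

def conv_matrix(w, n):
    # Matrix C such that C @ x equals the 'valid' convolution of a length-n signal
    # x with the kernel w. Each row is the row above shifted right by one element.
    k = len(w)
    C = np.zeros((n - k + 1, n))
    for r in range(n - k + 1):
        C[r, r:r + k] = w[::-1]     # flipped kernel, per the definition of convolution
    return C

x = np.array([1.0, 4.0, 2.0, 0.0, 3.0, 5.0])
w = np.array([0.25, 0.5, 0.25])

C = conv_matrix(w, len(x))
assert np.allclose(C @ x, np.convolve(x, w, mode='valid'))
print(C)    # a sparse, banded matrix with tied entries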
9.2 Motivation
Convolution leverages three important ideas that can help improve a machine learning system: sparse interactions, parameter sharing and equivariant representations. Moreover, convolution provides a means for working with inputs of variable size. We now describe each of these ideas in turn.

Figure 9.1: An example of 2-D convolution without kernel flipping. We restrict the output to only positions where the kernel lies entirely within the image, called “valid” convolution in some contexts. The figure shows a 3×4 input with entries a through l, a 2×2 kernel with entries w, x, y, z, and the resulting 2×3 output; boxes with arrows indicate how the upper-left element of the output tensor, aw + bx + ey + fz, is formed by applying the kernel to the corresponding upper-left region of the input tensor.
Traditional neural network layers use matrix multiplication by a matrix of
parameters with a separate parameter describing the interaction between each
input unit and each output unit. This means that every output unit interacts
with every input unit. Convolutional networks, however, typically have sparse interactions (also referred to as sparse connectivity or sparse weights). This
is accomplished by making the kernel smaller than the input. For example,
when processing an image, the input image might have thousands or millions of
pixels, but we can detect small, meaningful features such as edges with kernels
that occupy only tens or hundreds of pixels. This means that we need to store
fewer parameters, which both reduces the memory requirements of the model
and improves its statistical efficiency. It also means that computing the output
requires fewer operations. These improvements in efficiency are usually quite large.
If there are m inputs and n outputs, then matrix multiplication requires m × n parameters, and the algorithms used in practice have O(m × n) runtime (per example). If we limit the number of connections each output may have to k, then the sparsely connected approach requires only k × n parameters and O(k × n) runtime. For many practical applications, it is possible to obtain good performance on the machine learning task while keeping k several orders of magnitude smaller than m. For graphical demonstrations of sparse connectivity, see figure 9.2 and
figure 9.3. In a deep convolutional network, units in the deeper layers may indirectly
interact with a larger portion of the input, as shown in figure 9.4. This allows the
network to efficiently describe complicated interactions between many variables by
constructing such interactions from simple building blocks that each describe only
sparse interactions.
Figure 9.2: Sparse connectivity, viewed from below. We highlight one input unit, x3, and highlight the output units in s that are affected by this unit. (Top) When s is formed by convolution with a kernel of width 3, only three outputs are affected by x. (Bottom) When s is formed by matrix multiplication, connectivity is no longer sparse, so all the outputs are affected by x3.

Figure 9.3: Sparse connectivity, viewed from above. We highlight one output unit, s3, and highlight the input units in x that affect this unit. These units are known as the receptive field of s3. (Top) When s is formed by convolution with a kernel of width 3, only three inputs affect s3. (Bottom) When s is formed by matrix multiplication, connectivity is no longer sparse, so all the inputs affect s3.

Figure 9.4: The receptive field of the units in the deeper layers of a convolutional network is larger than the receptive field of the units in the shallow layers. This effect increases if the network includes architectural features like strided convolution (figure 9.12) or pooling (section 9.3). This means that even though direct connections in a convolutional net are very sparse, units in the deeper layers can be indirectly connected to all or most of the input image.

Parameter sharing refers to using the same parameter for more than one function in a model. In a traditional neural net, each element of the weight matrix
is used exactly once when computing the output of a layer. It is multiplied by
one element of the input and then never revisited. As a synonym for parameter
sharing, one can say that a network has
tied weights
, because the value of the
weight applied to one input is tied to the value of a weight applied elsewhere. In
a convolutional neural net, each member of the kernel is used at every position
of the input (except perhaps some of the boundary pixels, depending on the
design decisions regarding the boundary). The parameter sharing used by the
convolution operation means that rather than learning a separate set of parameters
for every location, we learn only one set. This does not affect the runtime of
forward propagation—it is still O(k × n)—but it does further reduce the storage requirements of the model to k parameters. Recall that k is usually several orders of magnitude smaller than m. Since m and n are usually roughly the same size, k is practically insignificant compared to m × n. Convolution is thus dramatically more
efficient than dense matrix multiplication in terms of the memory requirements
and statistical efficiency. For a graphical depiction of how parameter sharing works,
see figure 9.5.
As an example of both of these first two principles in action, figure 9.6 shows how sparse connectivity and parameter sharing can dramatically improve the efficiency of a linear function for detecting edges in an image.

Figure 9.5: Parameter sharing. Black arrows indicate the connections that use a particular parameter in two different models. (Top) The black arrows indicate uses of the central element of a 3-element kernel in a convolutional model. Because of parameter sharing, this single parameter is used at all input locations. (Bottom) The single black arrow indicates the use of the central element of the weight matrix in a fully connected model. This model has no parameter sharing, so the parameter is used only once.

Figure 9.6: Efficiency of edge detection. The image on the right was formed by taking each pixel in the original image and subtracting the value of its neighboring pixel on the left. This shows the strength of all the vertically oriented edges in the input image, which can be a useful operation for object detection. Both images are 280 pixels tall. The input image is 320 pixels wide, while the output image is 319 pixels wide. This transformation can be described by a convolution kernel containing two elements, and requires 319 × 280 × 3 = 267,960 floating-point operations (two multiplications and one addition per output pixel) to compute using convolution. To describe the same transformation with a matrix multiplication would take 320 × 280 × 319 × 280, or over eight billion, entries in the matrix, making convolution four billion times more efficient for representing this transformation. The straightforward matrix multiplication algorithm performs over sixteen billion floating point operations, making convolution roughly 60,000 times more efficient computationally. Of course, most of the entries of the matrix would be zero. If we stored only the nonzero entries of the matrix, then both matrix multiplication and convolution would require the same number of floating-point operations to compute. The matrix would still need to contain 2 × 319 × 280 = 178,640 entries. Convolution is an extremely efficient way of describing transformations that apply the same linear transformation of a small local region across the entire input. Photo credit: Paula Goodfellow.
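The arithmetic in the caption of figure 9.6 is easy to reproduce. The short sketch below (our own, applied to an arbitrary random image of the stated size) implements the two-element kernel as a subtraction of neighboring pixels and prints the operation and storage counts quoted above.

import numpy as np

H, W = 280, 320                       # image size used in figure 9.6
image = np.random.rand(H, W)

# Each output pixel is the pixel value minus its neighbor to the left, i.e.
# convolution of every row with a two-element kernel.
edges = image[:, 1:] - image[:, :-1]  # output is 280 x 319

print(edges.shape)
print(319 * 280 * 3)            # 267,960 floating-point operations via convolution
print(320 * 280 * 319 * 280)    # entries in the equivalent dense matrix (over 8 billion)
print(2 * 319 * 280)            # 178,640 nonzero entries if stored sparsely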
In the case of convolution, the particular form of parameter sharing causes the
layer to have a property called
equivariance
to translation. To say a function is
equivariant means that if the input changes, the output changes in the same way.
Specifically, a function f(x) is equivariant to a function g if f(g(x)) = g(f(x)). In the case of convolution, if we let g be any function that translates the input, that is, shifts it, then the convolution function is equivariant to g. For example, let I be a function giving image brightness at integer coordinates. Let g be a function mapping one image function to another image function, such that I′ = g(I) is the image function with I′(x, y) = I(x − 1, y). This shifts every pixel of I one unit to the right. If we apply this transformation to I, then apply convolution, the result will be the same as if we applied convolution to I, then applied the transformation g to the output. When processing time-series data, this means that convolution
produces a sort of timeline that shows when different features appear in the input.
If we move an event later in time in the input, the exact same representation of it
will appear in the output, just later. Similarly with images, convolution creates a
2-D map of where certain features appear in the input. If we move the object in the
input, its representation will move the same amount in the output. This is useful
for when we know that some function of a small number of neighboring pixels is
useful when applied to multiple input locations. For example, when processing
images, it is useful to detect edges in the first layer of a convolutional network.
The same edges appear more or less everywhere in the image, so it is practical to
share parameters across the entire image. In some cases, we may not wish to share
parameters across the entire image. For example, if we are processing images that
are cropped to be centered on an individual’s face, we probably want to extract
different features at different locations—the part of the network processing the top
of the face needs to look for eyebrows, while the part of the network processing
the bottom of the face needs to look for a chin.
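Translation equivariance can be checked directly in code. The toy sketch below (our own; it uses circular convolution and circular shifts so that there are no boundary effects to obscure the equality) verifies that convolving a shifted input gives the same result as shifting the convolved output.

import numpy as np

def circ_conv(x, w):
    # Circular convolution: s(t) = sum_a x(a) w((t - a) mod n).
    n = len(x)
    s = np.zeros(n)
    for t in range(n):
        for a in range(n):
            s[t] += x[a] * w[(t - a) % n]
    return s

x = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0])
w = np.zeros(8)
w[:3] = [0.25, 0.5, 0.25]

shift = lambda v, d: np.roll(v, d)       # g: translate the signal by d positions

lhs = circ_conv(shift(x, 2), w)          # convolve the shifted input:   f(g(x))
rhs = shift(circ_conv(x, w), 2)          # shift the convolved output:   g(f(x))
assert np.allclose(lhs, rhs)             # convolution commutes with translation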
Convolution is not naturally equivariant to some other transformations, such
as changes in the scale or rotation of an image. Other mechanisms are necessary
for handling these kinds of transformations.
Finally, some kinds of data cannot be processed by neural networks defined by
matrix multiplication with a fixed-shape matrix. Convolution enables processing
of some of these kinds of data. We discuss this further in section 9.7.
9.3 Pooling
A typical layer of a convolutional network consists of three stages (see figure 9.7).
In the first stage, the layer performs several convolutions in parallel to produce a
set of linear activations. In the second stage, each linear activation is run through
a nonlinear activation function, such as the rectified linear activation function.
This stage is sometimes called the
detector stage
. In the third stage, we use a
pooling function to modify the output of the layer further.
A pooling function replaces the output of the net at a certain location with a
summary statistic of the nearby outputs. For example, the
max pooling
(Zhou
and Chellappa, 1988) operation reports the maximum output within a rectangular
neighborhood. Other popular pooling functions include the average of a rectangular
neighborhood, the L2 norm of a rectangular neighborhood, or a weighted average
based on the distance from the central pixel.
Figure 9.7: The components of a typical convolutional neural network layer. There are two commonly used sets of terminology for describing these layers. (Left) In the complex layer terminology, the convolutional net is viewed as a small number of relatively complex layers, with each layer having many “stages” (a convolution stage computing an affine transform, a detector stage applying a nonlinearity such as the rectified linear function, and a pooling stage). In this terminology, there is a one-to-one mapping between kernel tensors and network layers. In this book we generally use this terminology. (Right) In the simple layer terminology, the convolutional net is viewed as a larger number of simple layers (convolution layer, detector layer, pooling layer); every step of processing is regarded as a layer in its own right. This means that not every “layer” has parameters.
In all cases, pooling helps to make the representation approximately
invariant
to small translations of the input. Invariance to translation means that if we
translate the input by a small amount, the values of most of the pooled outputs
do not change. See figure 9.8 for an example of how this works. Invariance to local
translation can be a useful property if we care more about whether some feature
is present than exactly where it is. For example, when determining whether an
image contains a face, we need not know the location of the eyes with pixel-perfect
accuracy, we just need to know that there is an eye on the left side of the face and
an eye on the right side of the face. In other contexts, it is more important to
preserve the location of a feature. For example, if we want to find a corner defined
by two edges meeting at a specific orientation, we need to preserve the location of the edges well enough to test whether they meet.

Figure 9.8: Max pooling introduces invariance. (Top) A view of the middle of the output of a convolutional layer. The bottom row shows outputs of the nonlinearity. The top row shows the outputs of max pooling, with a stride of one pixel between pooling regions and a pooling region width of three pixels. (Bottom) A view of the same network, after the input has been shifted to the right by one pixel. Every value in the bottom row has changed, but only half of the values in the top row have changed, because the max pooling units are sensitive only to the maximum value in the neighborhood, not its exact location.
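The behavior illustrated in figure 9.8 can be reproduced in a few lines. In this sketch (a toy example of ours), max pooling with a width of three and a stride of one is applied to a row of detector outputs before and after the input is shifted by one position; every detector value changes, but several pooled values stay the same.

import numpy as np

def max_pool_1d(v, width=3, stride=1):
    # Report the maximum within each pooling region of the given width.
    return np.array([v[i:i + width].max()
                     for i in range(0, len(v) - width + 1, stride)])

detector = np.array([0.1, 1.0, 0.2, 0.1, 0.0, 0.3])
shifted  = np.roll(detector, 1)          # the same responses moved right by one
                                         # (with wraparound, for simplicity)

print(max_pool_1d(detector))             # [1.  1.  0.2 0.3]
print(max_pool_1d(shifted))              # [1.  1.  1.  0.2] -- half the values unchanged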
The use of pooling can be viewed as adding an infinitely strong prior that
the function the layer learns must be invariant to small translations. When this
assumption is correct, it can greatly improve the statistical efficiency of the network.
Pooling over spatial regions produces invariance to translation, but if we pool
over the outputs of separately parametrized convolutions, the features can learn
which transformations to become invariant to (see figure 9.9).
Because pooling summarizes the responses over a whole neighborhood, it is
possible to use fewer pooling units than detector units, by reporting summary
statistics for pooling regions spaced k pixels apart rather than 1 pixel apart. See figure 9.10 for an example. This improves the computational efficiency of the network because the next layer has roughly k times fewer inputs to process. When
the number of parameters in the next layer is a function of its input size (such as when the next layer is fully connected and based on matrix multiplication), this reduction in the input size can also result in improved statistical efficiency and reduced memory requirements for storing the parameters.

Figure 9.9: Example of learned invariances. A pooling unit that pools over multiple features that are learned with separate parameters can learn to be invariant to transformations of the input. Here we show how a set of three learned filters and a max pooling unit can learn to become invariant to rotation. All three filters are intended to detect a handwritten 5. Each filter attempts to match a slightly different orientation of the 5. When a 5 appears in the input, the corresponding filter will match it and cause a large activation in a detector unit. The max pooling unit then has a large activation regardless of which detector unit was activated. We show here how the network processes two different inputs, resulting in two different detector units being activated. The effect on the pooling unit is roughly the same either way. This principle is leveraged by maxout networks (Goodfellow et al., 2013a) and other convolutional networks. Max pooling over spatial positions is naturally invariant to translation; this multichannel approach is only necessary for learning other transformations.

Figure 9.10: Pooling with downsampling. Here we use max pooling with a pool width of three and a stride between pools of two. This reduces the representation size by a factor of two, which reduces the computational and statistical burden on the next layer. Note that the rightmost pooling region has a smaller size but must be included if we do not want to ignore some of the detector units.
For many tasks, pooling is essential for handling inputs of varying size. For
example, if we want to classify images of variable size, the input to the classification
layer must have a fixed size. This is usually accomplished by varying the size of an
offset between pooling regions so that the classification layer always receives the
same number of summary statistics regardless of the input size. For example, the
final pooling layer of the network may be defined to output four sets of summary
statistics, one for each quadrant of an image, regardless of the image size.
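A minimal sketch of this idea (our own, not code from the text): pool each quadrant of a feature map down to a single maximum, so that inputs of any spatial size yield exactly four summary statistics per channel.

import numpy as np

def quadrant_max_pool(feature_map):
    # One maximum per quadrant: a fixed-size (2, 2) summary regardless of the
    # spatial size of the input feature map.
    h, w = feature_map.shape
    h2, w2 = h // 2, w // 2
    return np.array([[feature_map[:h2, :w2].max(), feature_map[:h2, w2:].max()],
                     [feature_map[h2:, :w2].max(), feature_map[h2:, w2:].max()]])

print(quadrant_max_pool(np.random.rand(37, 61)).shape)    # (2, 2)
print(quadrant_max_pool(np.random.rand(128, 96)).shape)   # (2, 2), same-size summary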
Some theoretical work gives guidance as to which kinds of pooling one should
use in various situations (Boureau et al., 2010). It is also possible to dynamically
pool features together, for example, by running a clustering algorithm on the
locations of interesting features (Boureau et al., 2011). This approach yields a
different set of pooling regions for each image. Another approach is to learn a
single pooling structure that is then applied to all images (Jia et al., 2012).
Pooling can complicate some kinds of neural network architectures that use
top-down information, such as Boltzmann machines and autoencoders. These
issues are discussed further when we present these types of networks in part III.
Pooling in convolutional Boltzmann machines is presented in section 20.6. The
inverse-like operations on pooling units needed in some differentiable networks are
covered in section 20.10.6.
Some examples of complete convolutional network architectures for classification
using convolution and pooling are shown in figure 9.11.
9.4 Convolution and Pooling as an Infinitely Strong Prior
Recall the concept of a
prior probability distribution
from section 5.6. This is
a probability distribution over the parameters of a model that encodes our beliefs
about what models are reasonable, before we have seen any data.
Priors can be considered weak or strong depending on how concentrated the
probability density in the prior is. A weak prior is a prior distribution with high
entropy, such as a Gaussian distribution with high variance. Such a prior allows
the data to move the parameters more or less freely. A strong prior has very low
entropy, such as a Gaussian distribution with low variance. Such a prior plays a
more active role in determining where the parameters end up.
An infinitely strong prior places zero probability on some parameters and says
that these parameter values are completely forbidden, regardless of how much support the data give to those values.

Figure 9.11: Examples of architectures for classification with convolutional networks. The specific strides and depths used in this figure are not advisable for real use; they are designed to be very shallow to fit onto the page. Real convolutional networks also often involve significant amounts of branching, unlike the chain structures used here for simplicity. (Left) A convolutional network that processes a fixed image size (input image 256×256×3; convolution + ReLU, 256×256×64; pooling with stride 4, 64×64×64; convolution + ReLU, 64×64×64; pooling with stride 4, 16×16×64; reshape to a vector of 16,384 units; matrix multiply to 1,000 units; softmax over 1,000 class probabilities). After alternating between convolution and pooling for a few layers, the tensor for the convolutional feature map is reshaped to flatten out the spatial dimensions. The rest of the network is an ordinary feedforward network classifier, as described in chapter 6. (Center) A convolutional network that processes a variably sized image but still maintains a fully connected section (input image 256×256×3; convolution + ReLU, 256×256×64; pooling with stride 4, 64×64×64; convolution + ReLU, 64×64×64; pooling to a 3×3 grid, 3×3×64; reshape to a vector of 576 units; matrix multiply to 1,000 units; softmax over 1,000 class probabilities). This network uses a pooling operation with variably sized pools but a fixed number of pools, in order to provide a fixed-size vector of 576 units to the fully connected portion of the network. (Right) A convolutional network that does not have any fully connected weight layer (input image 256×256×3; convolution + ReLU, 256×256×64; pooling with stride 4, 64×64×64; convolution + ReLU, 64×64×64; pooling with stride 4, 16×16×64; convolution, 16×16×1,000; average pooling, 1×1×1,000; softmax over 1,000 class probabilities). Instead, the last convolutional layer outputs one feature map per class. The model presumably learns a map of how likely each class is to occur at each spatial location. Averaging a feature map down to a single value provides the argument to the softmax classifier at the top.
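As a rough, hedged illustration of the center pipeline in figure 9.11, the following sketch uses the PyTorch library (our choice; the text does not prescribe any particular framework), with an adaptive pooling layer standing in for “pooling to a 3×3 grid,” so that the 576-unit vector is produced for any input size.

import torch
import torch.nn as nn

# Loosely: conv + ReLU, pool with stride 4, conv + ReLU, pool to a fixed 3x3 grid,
# then a fully connected classifier over 1,000 classes.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),   # 3 input channels -> 64 feature maps
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=4, stride=4),        # pooling with stride 4
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveMaxPool2d((3, 3)),                 # fixed number of pools: 3 x 3 x 64 = 576
    nn.Flatten(),
    nn.Linear(576, 1000),                         # matrix multiply: 1,000 units
    nn.Softmax(dim=1),                            # 1,000 class probabilities
)

# Variably sized images map to the same 1,000-dimensional output.
for size in (256, 320):
    x = torch.randn(1, 3, size, size)
    print(model(x).shape)                         # torch.Size([1, 1000])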
We can imagine a convolutional net as being similar to a fully connected net,
but with an infinitely strong prior over its weights. This infinitely strong prior
says that the weights for one hidden unit must be identical to the weights of its
neighbor but shifted in space. The prior also says that the weights must be zero,
except for in the small, spatially contiguous receptive field assigned to that hidden
unit. Overall, we can think of the use of convolution as introducing an infinitely
strong prior probability distribution over the parameters of a layer. This prior
says that the function the layer should learn contains only local interactions and is
equivariant to translation. Likewise, the use of pooling is an infinitely strong prior
that each unit should be invariant to small translations.
Of course, implementing a convolutional net as a fully connected net with an
infinitely strong prior would be extremely wasteful computationally. But thinking
of a convolutional net as a fully connected net with an infinitely strong prior can
give us some insights into how convolutional nets work.
One key insight is that convolution and pooling can cause underfitting. Like
any prior, convolution and pooling are only useful when the assumptions made
by the prior are reasonably accurate. If a task relies on preserving precise spatial
information, then using pooling on all features can increase the training error.
Some convolutional network architectures (Szegedy et al., 2014a) are designed to
use pooling on some channels but not on other channels, in order to get both
highly invariant features and features that will not underfit when the translation
invariance prior is incorrect. When a task involves incorporating information from
very distant locations in the input, then the prior imposed by convolution may be
inappropriate.
Another key insight from this view is that we should only compare convolu-
tional models to other convolutional models in benchmarks of statistical learning
performance. Models that do not use convolution would be able to learn even if we
permuted all the pixels in the image. For many image datasets, there are separate
benchmarks for models that are
permutation invariant
and must discover the
concept of topology via learning and for models that have the knowledge of spatial
relationships hard coded into them by their designer.
9.5 Variants of the Basic Convolution Function
When discussing convolution in the context of neural networks, we usually do
not refer exactly to the standard discrete convolution operation as it is usually
understood in the mathematical literature. The functions used in practice differ
slightly. Here we describe these differences in detail and highlight some useful
properties of the functions used in neural networks.
First, when we refer to convolution in the context of neural networks, we usually
actually mean an operation that consists of many applications of convolution in
parallel. This is because convolution with a single kernel can extract only one kind
of feature, albeit at many spatial locations. Usually we want each layer of our
network to extract many kinds of features, at many locations.
Additionally, the input is usually not just a grid of real values. Rather, it is a
grid of vector-valued observations. For example, a color image has a red, green
and blue intensity at each pixel. In a multilayer convolutional network, the input
to the second layer is the output of the first layer, which usually has the output
of many different convolutions at each position. When working with images, we
usually think of the input and output of the convolution as being 3-D tensors, with
one index into the different channels and two indices into the spatial coordinates
of each channel. Software implementations usually work in batch mode, so they
will actually use 4-D tensors, with the fourth axis indexing different examples in
the batch, but we will omit the batch axis in our description here for simplicity.
Because convolutional networks usually use multichannel convolution, the linear
operations they are based on are not guaranteed to be commutative, even if kernel
flipping is used. These multichannel operations are only commutative if each
operation has the same number of output channels as input channels.
Assume we have a 4-D kernel tensor K with element K_{i,j,k,l} giving the connection strength between a unit in channel i of the output and a unit in channel j of the input, with an offset of k rows and l columns between the output unit and the input unit. Assume our input consists of observed data V with element V_{i,j,k} giving the value of the input unit within channel i at row j and column k. Assume our output consists of Z with the same format as V. If Z is produced by convolving K across V without flipping K, then

Z_{i,j,k} = Σ_{l,m,n} V_{l, j+m−1, k+n−1} K_{i,l,m,n},    (9.7)
where the summation over l, m and n is over all values for which the tensor
indexing operations inside the summation are valid. In linear algebra notation,
we index into arrays using a 1 for the first entry. This necessitates the −1 in the
above formula. Programming languages such as C and Python index starting from
0, rendering the above expression even simpler.
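A direct, loop-based transcription of equation 9.7 in Python (zero-based indexing, so the −1 terms disappear) is given below; this is an illustrative sketch of ours rather than an efficient implementation.

import numpy as np

def multichannel_conv(V, K):
    # Z[i, j, k] = sum_{l, m, n} V[l, j + m, k + n] * K[i, l, m, n]
    # (equation 9.7 with zero-based indexing; 'valid' output positions only).
    out_ch, in_ch, kH, kW = K.shape
    _, H, W = V.shape
    Z = np.zeros((out_ch, H - kH + 1, W - kW + 1))
    for i in range(out_ch):
        for j in range(Z.shape[1]):
            for k in range(Z.shape[2]):
                Z[i, j, k] = np.sum(V[:, j:j + kH, k:k + kW] * K[i])
    return Z

V = np.random.rand(3, 8, 8)       # 3 input channels, 8 x 8 image
K = np.random.rand(4, 3, 2, 2)    # 4 output channels, 3 input channels, 2 x 2 kernels
print(multichannel_conv(V, K).shape)    # (4, 7, 7)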
We may want to skip over some positions of the kernel to reduce the computa-
tional cost (at the expense of not extracting our features as finely). We can think
of this as downsampling the output of the full convolution function. If we want to
sample only every s pixels in each direction in the output, then we can define a downsampled convolution function c such that

Z_{i,j,k} = c(K, V, s)_{i,j,k} = Σ_{l,m,n} V_{l, (j−1)×s+m, (k−1)×s+n} K_{i,l,m,n}.    (9.8)
We refer to s as the stride of this downsampled convolution. It is also possible
to define a separate stride for each direction of motion. See figure 9.12 for an
illustration.
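Strided convolution can be computed directly from equation 9.8, or by downsampling the unit-stride result as in figure 9.12. The single-channel sketch below (our own illustration) checks that the two agree.

import numpy as np

def conv2d_valid(V, K):
    # Unit-stride 'valid' cross-correlation of a single-channel image.
    kH, kW = K.shape
    out = np.zeros((V.shape[0] - kH + 1, V.shape[1] - kW + 1))
    for j in range(out.shape[0]):
        for k in range(out.shape[1]):
            out[j, k] = np.sum(V[j:j + kH, k:k + kW] * K)
    return out

def strided_conv2d(V, K, s):
    # Equation 9.8, single channel, zero-based: keep only every s-th output position.
    kH, kW = K.shape
    rows = (V.shape[0] - kH) // s + 1
    cols = (V.shape[1] - kW) // s + 1
    out = np.zeros((rows, cols))
    for j in range(rows):
        for k in range(cols):
            out[j, k] = np.sum(V[j * s:j * s + kH, k * s:k * s + kW] * K)
    return out

V = np.random.rand(9, 9)
K = np.random.rand(3, 3)
s = 2

# Strided convolution equals unit-stride convolution followed by downsampling.
assert np.allclose(strided_conv2d(V, K, s), conv2d_valid(V, K)[::s, ::s])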
One essential feature of any convolutional network implementation is the ability
to implicitly zero pad the input
V
to make it wider. Without this feature, the
width of the representation shrinks by one pixel less than the kernel width at
each layer. Zero padding the input allows us to control the kernel width and
the size of the output independently. Without zero padding, we are forced to
choose between shrinking the spatial extent of the network rapidly and using small
kernels—both scenarios that significantly limit the expressive power of the network.
See figure 9.13 for an example.
Three special cases of the zero-padding setting are worth mentioning. One is
the extreme case in which no zero padding is used whatsoever, and the convolution
kernel is allowed to visit only positions where the entire kernel is contained entirely
within the image. In MATLAB terminology, this is called
valid
convolution. In
this case, all pixels in the output are a function of the same number of pixels in
the input, so the behavior of an output pixel is somewhat more regular. However,
the size of the output shrinks at each layer. If the input image has width m and the kernel has width k, the output will be of width m − k + 1. The rate of this
shrinkage can be dramatic if the kernels used are large. Since the shrinkage is
greater than 0, it limits the number of convolutional layers that can be included
in the network. As layers are added, the spatial dimension of the network will
eventually drop to 1 × 1, at which point additional layers cannot meaningfully be
considered convolutional. Another special case of the zero-padding setting is when
just enough zero padding is added to keep the size of the output equal to the size
of the input. MATLAB calls this
same
convolution. In this case, the network
can contain as many convolutional layers as the available hardware can support,
since the operation of convolution does not modify the architectural possibilities
available to the next layer. The input pixels near the border, however, influence fewer output pixels than the input pixels near the center. This can make the border pixels somewhat underrepresented in the model. This motivates the other extreme case, which MATLAB refers to as full convolution, in which enough zeros are added for every pixel to be visited k times in each direction, resulting in an output image of width m + k − 1. In this case, the output pixels near the border are a function of fewer pixels than the output pixels near the center. This can make it difficult to learn a single kernel that performs well at all positions in the convolutional feature map. Usually the optimal amount of zero padding (in terms of test set classification accuracy) lies somewhere between “valid” and “same” convolution.

Figure 9.12: Convolution with a stride. In this example, we use a stride of two. (Top) Convolution with a stride length of two implemented in a single operation. (Bottom) Convolution with a stride greater than one pixel is mathematically equivalent to convolution with unit stride followed by downsampling. Obviously, the two-step approach involving downsampling is computationally wasteful, because it computes many values that are then discarded.

Figure 9.13: The effect of zero padding on network size. Consider a convolutional network with a kernel of width six at every layer. In this example, we do not use any pooling, so only the convolution operation itself shrinks the network size. (Top) In this convolutional network, we do not use any implicit zero padding. This causes the representation to shrink by five pixels at each layer. Starting from an input of sixteen pixels, we are only able to have three convolutional layers, and the last layer does not ever move the kernel, so arguably only two of the layers are truly convolutional. The rate of shrinking can be mitigated by using smaller kernels, but smaller kernels are less expressive, and some shrinking is inevitable in this kind of architecture. (Bottom) By adding five implicit zeros to each layer, we prevent the representation from shrinking with depth. This allows us to make an arbitrarily deep convolutional network.
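In one dimension the three padding regimes are exposed directly by NumPy's convolve through its mode argument; the short sketch below (our own) prints the output widths m − k + 1, m, and m + k − 1 discussed above.

import numpy as np

m, k = 16, 6                   # input width and kernel width, as in figure 9.13
x = np.random.rand(m)
w = np.random.rand(k)

for mode in ("valid", "same", "full"):
    print(mode, len(np.convolve(x, w, mode=mode)))
# valid 11   (m - k + 1: no padding, the output shrinks)
# same  16   (just enough zero padding to preserve the width)
# full  21   (m + k - 1: every input pixel is visited k times)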
In some cases, we do not actually want to use convolution, but want to use
locally connected layers instead (LeCun, 1986, 1989). In this case, the adjacency
matrix in the graph of our MLP is the same, but every connection has its own
weight, specified by a 6-D tensor W. The indices into W are respectively: i, the output channel; j, the output row; k, the output column; l, the input channel; m, the row offset within the input; and n, the column offset within the input. The linear part of a locally connected layer is then given by

Z_{i,j,k} = Σ_{l,m,n} [V_{l, j+m−1, k+n−1} w_{i,j,k,l,m,n}].    (9.9)
This is sometimes also called unshared convolution, because it is a similar operation to discrete convolution with a small kernel, but without sharing parameters
across locations. Figure 9.14 compares local connections, convolution, and full
connections.
Locally connected layers are useful when we know that each feature should be
a function of a small part of space, but there is no reason to think that the same
feature should occur across all of space. For example, if we want to tell if an image
is a picture of a face, we only need to look for the mouth in the bottom half of the
image.
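For comparison with the shared-weight case, here is a loop-based sketch of the locally connected layer of equation 9.9 (zero-based indexing, single example, our own illustrative code); every output position has its own kernel.

import numpy as np

def locally_connected(V, W):
    # Z[i, j, k] = sum_{l, m, n} V[l, j + m, k + n] * W[i, j, k, l, m, n]
    # (equation 9.9, zero-based): no parameter sharing across positions (j, k).
    out_ch, out_h, out_w, in_ch, kH, kW = W.shape
    Z = np.zeros((out_ch, out_h, out_w))
    for i in range(out_ch):
        for j in range(out_h):
            for k in range(out_w):
                Z[i, j, k] = np.sum(V[:, j:j + kH, k:k + kW] * W[i, j, k])
    return Z

V = np.random.rand(3, 6, 6)
W = np.random.rand(4, 5, 5, 3, 2, 2)    # a separate 3 x 2 x 2 kernel for every output unit
print(locally_connected(V, W).shape)    # (4, 5, 5)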
It can also be useful to make versions of convolution or locally connected layers
in which the connectivity is further restricted, for example to constrain each output
channel i to be a function of only a subset of the input channels l. A common way to do this is to make the first m output channels connect to only the first n input channels, the second m output channels connect to only the second n input channels, and so on. See figure 9.15 for an example. Modeling interactions
between few channels allows the network to have fewer parameters, reducing
memory consumption, increasing statistical efficiency, and reducing the amount of
computation needed to perform forward and back-propagation. It accomplishes
these goals without reducing the number of hidden units.
Tiled convolution (Gregor and LeCun, 2010a; Le et al., 2010) offers a compromise between a convolutional layer and a locally connected layer. Rather than
learning a separate set of weights at every spatial location, we learn a set of kernels
that we rotate through as we move through space. This means that immediately
neighboring locations will have different filters, as in a locally connected layer, but
the memory requirements for storing the parameters will increase only by a factor
of the size of this set of kernels, rather than by the size of the entire output feature
map. See figure 9.16 for a comparison of locally connected layers, tiled convolution,
and standard convolution.
To define tiled convolution algebraically, let K be a 6-D tensor, where two of the dimensions correspond to different locations in the output map. Rather than having a separate index for each location in the output map, output locations cycle through a set of t different choices of kernel stack in each direction. If t is equal to
the output width, this is the same as a locally connected layer.

Z_{i,j,k} = Σ_{l,m,n} V_{l, j+m−1, k+n−1} K_{i, l, m, n, j%t+1, k%t+1},    (9.10)

where percent is the modulo operation, with t % t = 0, (t + 1) % t = 1, and so on. It is straightforward to generalize this equation to use a different tiling range for each dimension.

Figure 9.14: Comparison of local connections, convolution, and full connections. (Top) A locally connected layer with a patch size of two pixels. Each edge is labeled with a unique letter to show that each edge is associated with its own weight parameter. (Center) A convolutional layer with a kernel width of two pixels. This model has exactly the same connectivity as the locally connected layer. The difference lies not in which units interact with each other, but in how the parameters are shared. The locally connected layer has no parameter sharing. The convolutional layer uses the same two weights repeatedly across the entire input, as indicated by the repetition of the letters labeling each edge. (Bottom) A fully connected layer resembles a locally connected layer in the sense that each edge has its own parameter (there are too many to label explicitly with letters in this diagram). It does not, however, have the restricted connectivity of the locally connected layer.
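Tiled convolution only adds a modulo when selecting which kernel to apply. A zero-based sketch of equation 9.10 (our own illustration; the axis ordering of K is our own convention) follows.

import numpy as np

def tiled_conv(V, K):
    # Z[i, j, k] = sum_{l, m, n} V[l, j + m, k + n] * K[i, l, m, n, j % t, k % t]
    # (equation 9.10, zero-based). The last two axes of K index a t x t set of
    # kernels that is cycled through as we move across the output.
    out_ch, in_ch, kH, kW, t, _ = K.shape
    _, H, W = V.shape
    Z = np.zeros((out_ch, H - kH + 1, W - kW + 1))
    for i in range(out_ch):
        for j in range(Z.shape[1]):
            for k in range(Z.shape[2]):
                Z[i, j, k] = np.sum(V[:, j:j + kH, k:k + kW] * K[i, :, :, :, j % t, k % t])
    return Z

V = np.random.rand(3, 8, 8)
K = np.random.rand(4, 3, 2, 2, 2, 2)    # t = 2: kernels alternate across space
print(tiled_conv(V, K).shape)           # (4, 7, 7)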
Figure 9.15: A convolutional network with the first two output channels connected to
only the first two input channels, and the second two output channels connected to only
the second two input channels.
Figure 9.16: A comparison of locally connected layers, tiled convolution, and standard convolution. All three have the same sets of connections between units, when the same size of kernel is used. This diagram illustrates the use of a kernel that is two pixels wide. The differences between the methods lie in how they share parameters. (Top) A locally connected layer has no sharing at all. We indicate that each connection has its own weight by labeling each connection with a unique letter. (Center) Tiled convolution has a set of t different kernels. Here we illustrate the case of t = 2. One of these kernels has edges labeled “a” and “b,” while the other has edges labeled “c” and “d.” Each time we move one pixel to the right in the output, we move on to using a different kernel. This means that, like the locally connected layer, neighboring units in the output have different parameters. Unlike the locally connected layer, after we have gone through all t available kernels, we cycle back to the first kernel. If two output units are separated by a multiple of t steps, then they share parameters. (Bottom) Traditional convolution is equivalent to tiled convolution with t = 1. There is only one kernel, and it is applied everywhere, as indicated in the diagram by using the kernel with weights labeled “a” and “b” everywhere.
Locally connected layers and tiled convolutional layers both have an interesting
interaction with max pooling: the detector units of these layers are driven by
different filters. If these filters learn to detect different transformed versions of
the same underlying features, then the max-pooled units become invariant to the
learned transformation (see figure 9.9). Convolutional layers are hard coded to be
invariant specifically to translation.
Other operations besides convolution are usually necessary to implement a
convolutional network. To perform learning, one must be able to compute the
gradient with respect to the kernel, given the gradient with respect to the outputs.
In some simple cases, this operation can be performed using the convolution
operation, but many cases of interest, including the case of stride greater than 1,
do not have this property.
Recall that convolution is a linear operation and can thus be described as a
matrix multiplication (if we first reshape the input tensor into a flat vector). The
matrix involved is a function of the convolution kernel. The matrix is sparse, and
each element of the kernel is copied to several elements of the matrix. This view
helps us to derive some of the other operations needed to implement a convolutional
network.
Multiplication by the transpose of the matrix defined by convolution is one
such operation. This is the operation needed to back-propagate error derivatives
through a convolutional layer, so it is needed to train convolutional networks
that have more than one hidden layer. This same operation is also needed if we
wish to reconstruct the visible units from the hidden units (Simard et al., 1992).
Reconstructing the visible units is an operation commonly used in the models
described in part III of this book, such as autoencoders, RBMs, and sparse coding.
Transpose convolution is necessary to construct convolutional versions of those
models. Like the kernel gradient operation, this input gradient operation can
sometimes be implemented using a convolution but in general requires a third
operation to be implemented. Care must be taken to coordinate this transpose
operation with the forward propagation. The size of the output that the transpose
operation should return depends on the zero-padding policy and stride of the
forward propagation operation, as well as the size of the forward propagation’s
output map. In some cases, multiple sizes of input to forward propagation can
result in the same size of output map, so the transpose operation must be explicitly
told what the size of the original input was.
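The transpose operation is easiest to see in the matrix view from section 9.1. The 1-D sketch below (our own, building the convolution matrix explicitly) shows that multiplying by the transpose maps a gradient with the shape of the output back to the shape of the input, which is what back-propagation through the layer and reconstruction both require.

import numpy as np

def conv_matrix(w, n):
    # Matrix C with C @ x equal to the 'valid' convolution of a length-n input with w.
    k = len(w)
    C = np.zeros((n - k + 1, n))
    for r in range(n - k + 1):
        C[r, r:r + k] = w[::-1]
    return C

n = 8
w = np.array([1.0, 2.0, 3.0])
C = conv_matrix(w, n)

x = np.random.rand(n)           # input to the layer
y = C @ x                       # forward propagation: length n - k + 1 = 6
g = np.random.rand(len(y))      # gradient arriving from the layer above

grad_x = C.T @ g                # multiplication by the transpose: length n again
print(y.shape, grad_x.shape)    # (6,) (8,)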
These three operations—convolution, backprop from output to weights, and
backprop from output to inputs—are sufficient to compute all the gradients needed
to train any depth of feedforward convolutional network, as well as to train
convolutional networks with reconstruction functions based on the transpose of
convolution. See Goodfellow (2010) for a full derivation of the equations in the
fully general multidimensional, multiexample case. To give a sense of how these
equations work, we present the two-dimensional, single example version here.
Suppose we want to train a convolutional network that incorporates strided convolution of kernel stack K applied to multichannel image V with stride s as defined by c(K, V, s), as in equation 9.8. Suppose we want to minimize some loss function J(V, K). During forward propagation, we will need to use c itself to output Z, which is then propagated through the rest of the network and used to compute the cost function J. During back-propagation, we will receive a tensor G such that G_{i,j,k} = ∂J(V, K)/∂Z_{i,j,k}.
To train the network, we need to compute the derivatives with respect to the
weights in the kernel. To do so, we can use a function
g(G, V, s)_{i,j,k,l} = ∂J(V, K)/∂K_{i,j,k,l} = Σ_{m,n} G_{i,m,n} V_{j, (m−1)×s+k, (n−1)×s+l}.    (9.11)
If this layer is not the bottom layer of the network, we will need to compute
the gradient with respect to V to back-propagate the error further down. To do
so, we can use a function
h(K, G, s)_{i,j,k} = ∂J(V, K)/∂V_{i,j,k}    (9.12)
         = Σ_{l,m s.t. (l−1)×s+m=j} Σ_{n,p s.t. (n−1)×s+p=k} Σ_q K_{q,i,m,p} G_{q,l,n}.    (9.13)
Autoencoder networks, described in chapter 14, are feedforward networks
trained to copy their input to their output. A simple example is the PCA algorithm,
which copies its input x to an approximate reconstruction r using the function W⊤W x. It is common for more general autoencoders to use multiplication by the transpose of the weight matrix just as PCA does. To make such models convolutional, we can use the function h to perform the transpose of the convolution operation. Suppose we have hidden units H in the same format as Z and we define a reconstruction

R = h(K, H, s).    (9.14)
To train the autoencoder, we will receive the gradient with respect to R as a tensor E. To train the decoder, we need to obtain the gradient with respect to K. This is given by g(H, E, s). To train the encoder, we need to obtain the gradient with respect to H. This is given by c(K, E, s). It is also possible to differentiate through g using c and h, but these operations are not needed for the back-propagation algorithm on any standard network architectures.
Generally, we do not use only a linear operation to transform from the inputs
to the outputs in a convolutional layer. We generally also add some bias term to
each output before applying the nonlinearity. This raises the question of how to
share parameters among the biases. For locally connected layers, it is natural to
give each unit its own bias, and for tiled convolution, it is natural to share the
biases with the same tiling pattern as the kernels. For convolutional layers, it is
typical to have one bias per channel of the output and share it across all locations
within each convolution map. If the input is of known, fixed size, however, it is also
possible to learn a separate bias at each location of the output map. Separating
the biases may slightly reduce the statistical efficiency of the model, but it allows
the model to correct for differences in the image statistics at different locations.
For example, when using implicit zero padding, detector units at the edge of the
image receive less total input and may need larger biases.
9.6 Structured Outputs
Convolutional networks can be used to output a high-dimensional structured object,
rather than just predicting a class label for a classification task or a real value for
a regression task. Typically this object is just a tensor, emitted by a standard
convolutional layer. For example, the model might emit a tensor S, where S_{i,j,k} is the probability that pixel (j, k) of the input to the network belongs to class i.
This allows the model to label every pixel in an image and draw precise masks
that follow the outlines of individual objects.
One issue that often comes up is that the output plane can be smaller than
the input plane, as shown in figure 9.13. In the kinds of architectures typically
used for classification of a single object in an image, the greatest reduction in the
spatial dimensions of the network comes from using pooling layers with large stride.
To produce an output map of similar size as the input, one can avoid pooling
altogether (Jain et al., 2007). Another strategy is to simply emit a lower-resolution
grid of labels (Pinheiro and Collobert, 2014, 2015). Finally, in principle, one could
use a pooling operator with unit stride.
One strategy for pixel-wise labeling of images is to produce an initial guess
of the image labels, then refine this initial guess using the interactions between
neighboring pixels. Repeating this refinement step several times corresponds to
using the same convolutions at each stage, sharing weights between the last layers of
the deep net (Jain et al., 2007). This makes the sequence of computations performed
by the successive convolutional layers with weights shared across layers a particular
352
CHAPTER 9. CONVOLUTIONAL NETWORKS
[Figure 9.17 diagram: the network unrolled for three steps, with hidden tensors H(1), H(2), H(3) and label estimates Ŷ(1), Ŷ(2), Ŷ(3) computed from the input X via the kernels U, V, and W.]
Figure 9.17: An example of a recurrent convolutional network for pixel labeling. The input is an image tensor X, with axes corresponding to image rows, image columns, and channels (red, green, blue). The goal is to output a tensor of labels Ŷ, with a probability distribution over labels for each pixel. This tensor has axes corresponding to image rows, image columns, and the different classes. Rather than outputting Ŷ in a single shot, the recurrent network iteratively refines its estimate Ŷ by using a previous estimate of Ŷ as input for creating a new estimate. The same parameters are used for each updated estimate, and the estimate can be refined as many times as we wish. The tensor of convolution kernels U is used on each step to compute the hidden representation given the input image. The kernel tensor V is used to produce an estimate of the labels given the hidden values. On all but the first step, the kernels W are convolved over Ŷ to provide input to the hidden layer. On the first time step, this term is replaced by zero. Because the same parameters are used on each step, this is an example of a recurrent network, as described in chapter 10.
kind of recurrent network (Pinheiro and Collobert, 2014, 2015). Figure 9.17 shows
the architecture of such a recurrent convolutional network.
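A minimal sketch of the refinement loop suggested by figure 9.17 is given below. It assumes single-channel tensors, random placeholder kernels U, V, W, and tanh/sigmoid nonlinearities chosen only for illustration; scipy's correlate2d stands in for the multichannel convolution operation.

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 16))          # single-channel input "image"
U = rng.standard_normal((3, 3)) * 0.1      # input -> hidden kernel
V = rng.standard_normal((3, 3)) * 0.1      # hidden -> label kernel
W = rng.standard_normal((3, 3)) * 0.1      # previous labels -> hidden kernel

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Y_hat = np.zeros_like(X)                   # on the first step the W-term is zero
for step in range(3):                      # refine the estimate a few times
    H = np.tanh(correlate2d(X, U, mode="same")
                + correlate2d(Y_hat, W, mode="same"))
    Y_hat = sigmoid(correlate2d(H, V, mode="same"))
# The same U, V, W are reused at every step, which is what makes this recurrent.
```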
Once a prediction for each pixel is made, various methods can be used to
further process these predictions to obtain a segmentation of the image into regions
(Briggman et al., 2009; Turaga et al., 2010; Farabet et al., 2013). The general idea
is to assume that large groups of contiguous pixels tend to be associated with the
same label. Graphical models can describe the probabilistic relationships between
neighboring pixels. Alternatively, the convolutional network can be trained to
maximize an approximation of the graphical model training objective (Ning et al.,
2005; Thompson et al., 2014).
9.7 Data Types
The data used with a convolutional network usually consist of several channels,
each channel being the observation of a different quantity at some point in space
or time. See table 9.1 for examples of data types with different dimensionalities
and number of channels.
For an example of convolutional networks applied to video, see Chen et al.
(2010).
So far we have discussed only the case where every example in the train and test
data has the same spatial dimensions. One advantage to convolutional networks
is that they can also process inputs with varying spatial extents. These kinds of
input simply cannot be represented by traditional, matrix multiplication-based
neural networks. This provides a compelling reason to use convolutional networks
even when computational cost and overfitting are not significant issues.
For example, consider a collection of images in which each image has a different
width and height. It is unclear how to model such inputs with a weight matrix of
fixed size. Convolution is straightforward to apply; the kernel is simply applied a
different number of times depending on the size of the input, and the output of the
convolution operation scales accordingly. Convolution may be viewed as matrix
multiplication; the same convolution kernel induces a different size of doubly block
circulant matrix for each size of input. Sometimes the output of the network as
well as the input is allowed to have variable size, for example, if we want to assign
a class label to each pixel of the input. In this case, no further design work is
necessary. In other cases, the network must produce some fixed-size output, for
example, if we want to assign a single class label to the entire image. In this case,
we must make some additional design steps, like inserting a pooling layer whose
pooling regions scale in size proportional to the size of the input, to maintain a
fixed number of pooled outputs. Some examples of this kind of strategy are shown
in figure 9.11.
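The sketch below illustrates pooling regions that scale with the input: whatever the spatial extent of the feature map, the pooled output has a fixed shape. The 2 × 2 output grid and the use of max pooling are arbitrary choices made here for illustration.

```python
import numpy as np

def fixed_size_max_pool(feature_map, out_rows=2, out_cols=2):
    """Pool a 2-D feature map of any size down to (out_rows, out_cols)
    by letting each pooling region cover a proportional slice of the input."""
    rows, cols = feature_map.shape
    row_edges = np.linspace(0, rows, out_rows + 1, dtype=int)
    col_edges = np.linspace(0, cols, out_cols + 1, dtype=int)
    pooled = np.empty((out_rows, out_cols))
    for i in range(out_rows):
        for j in range(out_cols):
            region = feature_map[row_edges[i]:row_edges[i + 1],
                                 col_edges[j]:col_edges[j + 1]]
            pooled[i, j] = region.max()
    return pooled

# Inputs of different sizes yield a pooled output of the same, fixed size.
print(fixed_size_max_pool(np.random.randn(9, 13)).shape)   # (2, 2)
print(fixed_size_max_pool(np.random.randn(30, 21)).shape)  # (2, 2)
```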
Note that the use of convolution for processing variably sized inputs makes
sense only for inputs that have variable size because they contain varying amounts
of observation of the same kind of thing—different lengths of recordings over time,
different widths of observations over space, and so forth. Convolution does not
make sense if the input has variable size because it can optionally include different
kinds of observations. For example, if we are processing college applications, and
our features consist of both grades and standardized test scores, but not every
applicant took the standardized test, then it does not make sense to convolve the
same weights over features corresponding to the grades as well as the features
1-D, single channel (audio waveform): The axis we convolve over corresponds to time. We discretize time and measure the amplitude of the waveform once per time step.

1-D, multichannel (skeleton animation data): Animations of 3-D computer-rendered characters are generated by altering the pose of a “skeleton” over time. At each point in time, the pose of the character is described by a specification of the angles of each of the joints in the character’s skeleton. Each channel in the data we feed to the convolutional model represents the angle about one axis of one joint.

2-D, single channel (audio data that has been preprocessed with a Fourier transform): We can transform the audio waveform into a 2-D tensor with different rows corresponding to different frequencies and different columns corresponding to different points in time. Using convolution over the time axis makes the model equivariant to shifts in time. Using convolution across the frequency axis makes the model equivariant to frequency, so that the same melody played in a different octave produces the same representation but at a different height in the network’s output.

2-D, multichannel (color image data): One channel contains the red pixels, one the green pixels, and one the blue pixels. The convolution kernel moves over both the horizontal and the vertical axes of the image, conferring translation equivariance in both directions.

3-D, single channel (volumetric data): A common source of this kind of data is medical imaging technology, such as CT scans.

3-D, multichannel (color video data): One axis corresponds to time, one to the height of the video frame, and one to the width of the video frame.

Table 9.1: Examples of different formats of data that can be used with convolutional networks.
corresponding to the test scores.
9.8 Efficient Convolution Algorithms
Modern convolutional network applications often involve networks containing more
than one million units. Powerful implementations exploiting parallel computation
resources, as discussed in section 12.1, are essential. In many cases, however, it
is also possible to speed up convolution by selecting an appropriate convolution
algorithm.
Convolution is equivalent to converting both the input and the kernel to the
frequency domain using a Fourier transform, performing point-wise multiplication
of the two signals, and converting back to the time domain using an inverse
Fourier transform. For some problem sizes, this can be faster than the naive
implementation of discrete convolution.
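The following sketch checks this equivalence for a 1-D signal: zero-pad both signals to the full output length, transform with the FFT, multiply point-wise, and invert. The signal and kernel sizes are arbitrary, and scipy's fftconvolve is included only as a cross-check.

```python
import numpy as np
from scipy.signal import fftconvolve

x = np.random.randn(256)    # input signal
k = np.random.randn(31)     # kernel

# Direct (naive) discrete convolution.
direct = np.convolve(x, k, mode="full")

# Same result via the frequency domain: transform, multiply point-wise,
# transform back.  Both signals are zero-padded to the full output length.
n = len(x) + len(k) - 1
freq = np.fft.rfft(x, n) * np.fft.rfft(k, n)
via_fft = np.fft.irfft(freq, n)

assert np.allclose(direct, via_fft)
assert np.allclose(direct, fftconvolve(x, k, mode="full"))
```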
When a d-dimensional kernel can be expressed as the outer product of d vectors, one vector per dimension, the kernel is called separable. When the kernel is separable, naive convolution is inefficient. It is equivalent to composing d one-dimensional convolutions with each of these vectors. The composed approach is significantly faster than performing one d-dimensional convolution with their outer product. The kernel also takes fewer parameters to represent as vectors. If the kernel is w elements wide in each dimension, then naive multidimensional convolution requires O(w^d) runtime and parameter storage space, while separable convolution requires O(w × d) runtime and parameter storage space. Of course, not every convolution can be represented in this way.
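A small sketch of this equivalence for d = 2, assuming a 5 × 5 kernel built as the outer product of two vectors: composing two cheap 1-D convolutions, one per axis, reproduces the full 2-D convolution with the outer-product kernel.

```python
import numpy as np
from scipy.signal import convolve2d

image = np.random.randn(64, 64)
row_filter = np.random.randn(5)
col_filter = np.random.randn(5)

# A separable kernel is the outer product of one vector per dimension.
kernel_2d = np.outer(row_filter, col_filter)

# One d-dimensional convolution with the w x w outer-product kernel ...
direct = convolve2d(image, kernel_2d)                    # O(w^2) work per output

# ... equals d one-dimensional convolutions composed, one per axis.
composed = convolve2d(image, row_filter[:, None])        # convolve along rows
composed = convolve2d(composed, col_filter[None, :])     # then along columns

assert np.allclose(direct, composed)                     # O(w * d) work per output
```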
Devising faster ways of performing convolution or approximate convolution
without harming the accuracy of the model is an active area of research. Even tech-
niques that improve the efficiency of only forward propagation are useful because
in the commercial setting, it is typical to devote more resources to deployment of
a network than to its training.
9.9 Random or Unsupervised Features
Typically, the most expensive part of convolutional network training is learning
the features. The output layer is usually relatively inexpensive because of the
small number of features provided as input to this layer after passing through
several layers of pooling. When performing supervised training with gradient
descent, every gradient step requires a complete run of forward propagation and
backward propagation through the entire network. One way to reduce the cost of
convolutional network training is to use features that are not trained in a supervised
fashion.
There are three basic strategies for obtaining convolution kernels without
supervised training. One is to simply initialize them randomly. Another is to
design them by hand, for example, by setting each kernel to detect edges at a
certain orientation or scale. Finally, one can learn the kernels with an unsupervised
criterion. For example, Coates et al. (2011) apply k-means clustering to small image patches, then use each learned centroid as a convolution kernel. In Part III
we describe many more unsupervised learning approaches. Learning the features
with an unsupervised criterion allows them to be determined separately from the
classifier layer at the top of the architecture. One can then extract the features for
the entire training set just once, essentially constructing a new training set for the
last layer. Learning the last layer is then typically a convex optimization problem,
assuming the last layer is something like logistic regression or an SVM.
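A toy sketch in the spirit of this pipeline is shown below. The patch size, the number of centroids, the mean pooling, the random placeholder "images" and labels, and the use of scikit-learn and scipy are all choices made here purely for illustration.

```python
import numpy as np
from scipy.signal import correlate2d
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
images = rng.standard_normal((100, 16, 16))        # toy grayscale "dataset"
labels = rng.integers(0, 2, size=100)              # toy binary labels

# 1. Learn kernels without supervision: k-means on randomly sampled patches.
patch = 5
patches = np.stack([img[r:r + patch, c:c + patch]
                    for img in images
                    for r, c in [rng.integers(0, 16 - patch, size=2)]])
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(
    patches.reshape(len(patches), -1))
kernels = kmeans.cluster_centers_.reshape(8, patch, patch)

# 2. Extract features once: convolve with the fixed kernels, then average-pool.
def features(img):
    return np.array([correlate2d(img, k, mode="valid").mean() for k in kernels])

X = np.stack([features(img) for img in images])

# 3. Train only the last layer, a convex problem.
clf = LogisticRegression().fit(X, labels)
```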
Random filters often work surprisingly well in convolutional networks (Jarrett
et al., 2009; Saxe et al., 2011; Pinto et al., 2011; Cox and Pinto, 2011). Saxe et al.
(2011) showed that layers consisting of convolution followed by pooling naturally
become frequency selective and translation invariant when assigned random weights.
They argue that this provides an inexpensive way to choose the architecture of
a convolutional network: first, evaluate the performance of several convolutional
network architectures by training only the last layer; then take the best of these
architectures and train the entire architecture using a more expensive approach.
An intermediate approach is to learn the features, but using methods that do
not require full forward and back-propagation at every gradient step. As with multilayer perceptrons, we use greedy layer-wise pretraining to train the first layer in isolation, then extract all features from the first layer only once, then train the second layer in isolation given those features, and so on. In chapter 8 we described how to perform supervised greedy layer-wise pretraining, and in part III we extend this to greedy layer-wise pretraining using an unsupervised criterion at each layer. The
canonical example of greedy layer-wise pretraining of a convolutional model is the
convolutional deep belief network (Lee et al., 2009). Convolutional networks offer
us the opportunity to take the pretraining strategy one step further than is possible
with multilayer perceptrons. Instead of training an entire convolutional layer at a
time, we can train a model of a small patch, as Coates et al. (2011) do with k-means.
We can then use the parameters from this patch-based model to define the kernels
of a convolutional layer. This means that it is possible to use unsupervised learning
to train a convolutional network without ever using convolution during the training
process. Using this approach, we can train very large models and incur a high
computational cost only at inference time (Ranzato et al., 2007b; Jarrett et al.,
2009; Kavukcuoglu et al., 2010; Coates et al., 2013). This approach was popular
from roughly 2007 to 2013, when labeled datasets were small and computational
power was more limited. Today, most convolutional networks are trained in a
purely supervised fashion, using full forward and back-propagation through the
entire network on each training iteration.
As with other approaches to unsupervised pretraining, it remains difficult to
tease apart the cause of some of the benefits seen with this approach. Unsupervised
pretraining may offer some regularization relative to supervised training, or it
may simply allow us to train much larger architectures because of the reduced
computational cost of the learning rule.
9.10 The Neuroscientific Basis for Convolutional
Networks
Convolutional networks are perhaps the greatest success story of biologically
inspired artificial intelligence. Though convolutional networks have been guided
by many other fields, some of the key design principles of neural networks were
drawn from neuroscience.
The history of convolutional networks begins with neuroscientific experiments
long before the relevant computational models were developed. Neurophysiologists
David Hubel and Torsten Wiesel collaborated for several years to determine many
of the most basic facts about how the mammalian vision system works (Hubel and
Wiesel, 1959, 1962, 1968). Their accomplishments were eventually recognized with
a Nobel prize. Their findings that have had the greatest influence on contemporary
deep learning models were based on recording the activity of individual neurons in
cats. They observed how neurons in the cat’s brain responded to images projected
in precise locations on a screen in front of the cat. Their great discovery was
that neurons in the early visual system responded most strongly to very specific
patterns of light, such as precisely oriented bars, but responded hardly at all to
other patterns.
Their work helped to characterize many aspects of brain function that are
beyond the scope of this book. From the point of view of deep learning, we can
focus on a simplified, cartoon view of brain function.
In this simplified view, we focus on a part of the brain called V1, also known as the primary visual cortex. V1 is the first area of the brain that begins to
perform significantly advanced processing of visual input. In this cartoon view,
images are formed by light arriving in the eye and stimulating the retina, the
light-sensitive tissue in the back of the eye. The neurons in the retina perform
some simple preprocessing of the image but do not substantially alter the way it is
represented. The image then passes through the optic nerve and a brain region
called the lateral geniculate nucleus. The main role, as far as we are concerned
here, of both anatomical regions is primarily just to carry the signal from the eye
to V1, which is located at the back of the head.
A convolutional network layer is designed to capture three properties of V1:
1. V1 is arranged in a spatial map. It actually has a two-dimensional structure, mirroring the structure of the image in the retina. For example, light arriving at the lower half of the retina affects only the corresponding half of V1. Convolutional networks capture this property by having their features defined in terms of two-dimensional maps.

2. V1 contains many simple cells. A simple cell's activity can to some extent be characterized by a linear function of the image in a small, spatially localized receptive field. The detector units of a convolutional network are designed to emulate these properties of simple cells.

3. V1 also contains many complex cells. These cells respond to features that are similar to those detected by simple cells, but complex cells are invariant to small shifts in the position of the feature. This inspires the pooling units of convolutional networks. Complex cells are also invariant to some changes in lighting that cannot be captured simply by pooling over spatial locations. These invariances have inspired some of the cross-channel pooling strategies in convolutional networks, such as maxout units (Goodfellow et al., 2013a).
Though we know the most about V1, it is generally believed that the same
basic principles apply to other areas of the visual system. In our cartoon view of
the visual system, the basic strategy of detection followed by pooling is repeatedly
applied as we move deeper into the brain. As we pass through multiple anatomical
layers of the brain, we eventually find cells that respond to some specific concept
and are invariant to many transformations of the input. These cells have been
nicknamed “grandmother cells”—the idea is that a person could have a neuron that
activates when seeing an image of their grandmother, regardless of whether she
appears in the left or right side of the image, whether the image is a close-up of
her face or a zoomed-out shot of her entire body, whether she is brightly lit or in
shadow, and so on.
These grandmother cells have been shown to actually exist in the human brain,
in a region called the medial temporal lobe (Quiroga et al., 2005). Researchers
tested whether individual neurons would respond to photos of famous individuals.
They found what has come to be called the “Halle Berry neuron,” an individual
neuron that is activated by the concept of Halle Berry. This neuron fires when a
person sees a photo of Halle Berry, a drawing of Halle Berry, or even text containing
the words “Halle Berry.” Of course, this has nothing to do with Halle Berry herself;
other neurons responded to the presence of Bill Clinton, Jennifer Aniston, and so
forth.
These medial temporal lobe neurons are somewhat more general than modern
convolutional networks, which would not automatically generalize to identifying
a person or object when reading its name. The closest analog to a convolutional
network’s last layer of features is a brain area called the inferotemporal cortex (IT).
When viewing an object, information flows from the retina, through the LGN, to
V1, then onward to V2, then V4, then IT. This happens within the first 100ms
of glimpsing an object. If a person is allowed to continue looking at the object
for more time, then information will begin to flow backward as the brain uses
top-down feedback to update the activations in the lower level brain areas. If we
interrupt the person’s gaze, however, and observe only the firing rates that result
from the first 100ms of mostly feedforward activation, then IT proves to be similar
to a convolutional network. Convolutional networks can predict IT firing rates and
perform similarly to (time-limited) humans on object recognition tasks (DiCarlo,
2013).
That being said, there are many differences between convolutional networks
and the mammalian vision system. Some of these differences are well known
to computational neuroscientists but outside the scope of this book. Some of
these differences are not yet known, because many basic questions about how the
mammalian vision system works remain unanswered. As a brief list:
• The human eye is mostly very low resolution, except for a tiny patch called the fovea. The fovea only observes an area about the size of a thumbnail held at arm's length. Though we feel as if we can see an entire scene in high resolution, this is an illusion created by the subconscious part of our brain, as it stitches together several glimpses of small areas. Most convolutional networks actually receive large full-resolution photographs as input. The human brain makes several eye movements called saccades to glimpse the most visually salient or task-relevant parts of a scene. Incorporating similar attention mechanisms into deep learning models is an active research direction. In the context of deep learning, attention mechanisms have been most successful for natural language processing, as described in section 12.4.5.1. Several visual models with foveation mechanisms have been developed but so far have not become the dominant approach (Larochelle and Hinton, 2010; Denil et al., 2012).
• The human visual system is integrated with many other senses, such as hearing, and factors like our moods and thoughts. Convolutional networks so far are purely visual.
• The human visual system does much more than just recognize objects. It is able to understand entire scenes, including many objects and relationships between objects, and it processes rich 3-D geometric information needed for our bodies to interface with the world. Convolutional networks have been applied to some of these problems, but these applications are in their infancy.
• Even simple brain areas like V1 are heavily affected by feedback from higher levels. Feedback has been explored extensively in neural network models but has not yet been shown to offer a compelling improvement.
• While feedforward IT firing rates capture much of the same information as convolutional network features, it is not clear how similar the intermediate computations are. The brain probably uses very different activation and pooling functions. An individual neuron's activation probably is not well characterized by a single linear filter response. A recent model of V1 involves multiple quadratic filters for each neuron (Rust et al., 2005). Indeed our cartoon picture of “simple cells” and “complex cells” might create a nonexistent distinction; simple cells and complex cells might both be the same kind of cell but with their “parameters” enabling a continuum of behaviors ranging from what we call “simple” to what we call “complex.”
It is also worth mentioning that neuroscience has told us relatively little
about how to train convolutional networks. Model structures with parameter
sharing across multiple spatial locations date back to early connectionist models
of vision (Marr and Poggio, 1976), but these models did not use the modern
back-propagation algorithm and gradient descent. For example, the neocognitron
(Fukushima, 1980) incorporated most of the model architecture design elements of
the modern convolutional network but relied on a layer-wise unsupervised clustering
algorithm.
Lang and Hinton (1988) introduced the use of back-propagation to train time-delay neural networks (TDNNs). To use contemporary terminology,
TDNNs are one-dimensional convolutional networks applied to time series. Back-
propagation applied to these models was not inspired by any neuroscientific observa-
tion and is considered by some to be biologically implausible. Following the success
of back-propagation-based training of TDNNs, LeCun et al. (1989) developed the
modern convolutional network by applying the same training algorithm to 2-D
convolution applied to images.
So far we have described how simple cells are roughly linear and selective
for certain features, complex cells are more nonlinear and become invariant to
some transformations of these simple cell features, and stacks of layers that
alternate between selectivity and invariance can yield grandmother cells for specific
phenomena. We have not yet described precisely what these individual cells detect.
In a deep nonlinear network, it can be difficult to understand the function of
individual cells. Simple cells in the first layer are easier to analyze, because their
responses are driven by a linear function. In an artificial neural network, we can
just display an image of the convolution kernel to see what the corresponding
channel of a convolutional layer responds to. In a biological neural network, we
do not have access to the weights themselves. Instead, we put an electrode in
the neuron, display several samples of white noise images in front of the animal’s
retina, and record how each of these samples causes the neuron to activate. We
can then fit a linear model to these responses to obtain an approximation of the
neuron's weights. This approach is known as reverse correlation (Ringach and Shapley, 2004).
Reverse correlation shows us that most V1 cells have weights that are described by Gabor functions. The Gabor function describes the weight at a 2-D point in the image. We can think of an image as being a function of 2-D coordinates, I(x, y). Likewise, we can think of a simple cell as sampling the image at a set of locations, defined by a set of x coordinates X and a set of y coordinates Y, then applying weights that are also a function of the location, w(x, y). From this point of view, the response of a simple cell to an image is given by

s(I) = \sum_{x \in X} \sum_{y \in Y} w(x, y)\, I(x, y).    (9.15)
Specifically, w(x, y) takes the form of a Gabor function:

w(x, y; \alpha, \beta_x, \beta_y, f, \phi, x_0, y_0, \tau) = \alpha \exp\left(-\beta_x x'^2 - \beta_y y'^2\right) \cos(f x' + \phi),    (9.16)

where

x' = (x - x_0) \cos(\tau) + (y - y_0) \sin(\tau)    (9.17)

and

y' = -(x - x_0) \sin(\tau) + (y - y_0) \cos(\tau).    (9.18)
Figure 9.18: Gabor functions with a variety of parameter settings. White indicates large positive weight, black indicates large negative weight, and the background gray corresponds to zero weight. (Left) Gabor functions with different values of the parameters that control the coordinate system: x_0, y_0, and τ. Each Gabor function in this grid is assigned a value of x_0 and y_0 proportional to its position in its grid, and τ is chosen so that each Gabor filter is sensitive to the direction radiating out from the center of the grid. For the other two plots, x_0, y_0, and τ are fixed to zero. (Center) Gabor functions with different Gaussian scale parameters β_x and β_y. Gabor functions are arranged in increasing width (decreasing β_x) as we move left to right through the grid, and increasing height (decreasing β_y) as we move top to bottom. For the other two plots, the β values are fixed to 1.5 times the image width. (Right) Gabor functions with different sinusoid parameters f and φ. As we move top to bottom, f increases, and as we move left to right, φ increases. For the other two plots, φ is fixed to 0 and f is fixed to 5 times the image width.
Here, α, β_x, β_y, f, φ, x_0, y_0, and τ are parameters that control the properties of the Gabor function. Figure 9.18 shows some examples of Gabor functions with different settings of these parameters.
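As a sketch, the snippet below transcribes equations 9.16-9.18 directly into NumPy to build the weight array for one such filter on a grid of pixel coordinates; the grid size and the default parameter values are arbitrary choices made here for illustration.

```python
import numpy as np

def gabor_weights(size, alpha=1.0, beta_x=0.05, beta_y=0.05,
                  f=0.5, phi=0.0, x0=0.0, y0=0.0, tau=0.0):
    """Evaluate w(x, y; alpha, beta_x, beta_y, f, phi, x0, y0, tau) from
    equations 9.16-9.18 on a size x size grid of integer coordinates."""
    coords = np.arange(size) - size // 2
    x, y = np.meshgrid(coords, coords, indexing="ij")
    x_p = (x - x0) * np.cos(tau) + (y - y0) * np.sin(tau)     # eq. 9.17
    y_p = -(x - x0) * np.sin(tau) + (y - y0) * np.cos(tau)    # eq. 9.18
    gaussian = alpha * np.exp(-beta_x * x_p ** 2 - beta_y * y_p ** 2)
    return gaussian * np.cos(f * x_p + phi)                   # eq. 9.16

w = gabor_weights(size=15, tau=np.pi / 4)   # a filter tuned to a 45-degree direction
```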
The parameters x_0, y_0, and τ define a coordinate system. We translate and rotate x and y to form x′ and y′. Specifically, the simple cell will respond to image features centered at the point (x_0, y_0), and it will respond to changes in brightness as we move along a line rotated τ radians from the horizontal.
Viewed as a function of x′ and y′, the function w then responds to changes in brightness as we move along the x′ axis. It has two important factors: one is a Gaussian function, and the other is a cosine function.
The Gaussian factor α exp(−β_x x′² − β_y y′²) can be seen as a gating term that ensures that the simple cell will respond only to values near where x′ and y′ are both zero, in other words, near the center of the cell's receptive field. The scaling factor α adjusts the total magnitude of the simple cell's response, while β_x and β_y control how quickly its receptive field falls off.
The cosine factor cos(f x′ + φ) controls how the simple cell responds to changing brightness along the x′ axis. The parameter f controls the frequency of the cosine, and φ controls its phase offset.
Altogether, this cartoon view of simple cells means that a simple cell responds
to a specific spatial frequency of brightness in a specific direction at a specific
location. Simple cells are most excited when the wave of brightness in the image
has the same phase as the weights. This occurs when the image is bright where the
weights are positive and dark where the weights are negative. Simple cells are most
inhibited when the wave of brightness is fully out of phase with the weights—when
the image is dark where the weights are positive and bright where the weights are
negative.
The cartoon view of a complex cell is that it computes the L² norm of the 2-D vector containing two simple cells' responses: c(I) = √(s_0(I)² + s_1(I)²). An important special case occurs when s_1 has all the same parameters as s_0 except for φ, and φ is set such that s_1 is one quarter cycle out of phase with s_0. In this case, s_0 and s_1 form a quadrature pair. A complex cell defined in this way responds when the Gaussian reweighted image I(x, y) exp(−β_x x′² − β_y y′²) contains a high-amplitude sinusoidal wave with frequency f in direction τ near (x_0, y_0), regardless of the phase offset of this wave. In other words, the complex cell is invariant to small translations of the image in direction τ, or to negating the image (replacing black with white and vice versa).
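The following sketch illustrates this quadrature-pair behavior numerically. It repeats a simplified Gabor construction so the snippet stands alone (parameters again arbitrary, direction fixed to τ = 0), builds two filters a quarter cycle apart in phase, and checks that shifting the phase of a sinusoidal grating swings each simple cell's response strongly while leaving the complex cell's response nearly unchanged.

```python
import numpy as np

def gabor(size, beta=0.05, f=0.5, phi=0.0):
    """Simplified, axis-aligned Gabor weights (illustrative parameters)."""
    coords = np.arange(size) - size // 2
    x, y = np.meshgrid(coords, coords, indexing="ij")
    return np.exp(-beta * (x ** 2 + y ** 2)) * np.cos(f * x + phi)

size, f = 15, 0.5
w0 = gabor(size, f=f, phi=0.0)               # simple cell s_0
w1 = gabor(size, f=f, phi=np.pi / 2)         # s_1, one quarter cycle out of phase

def responses(image_phase):
    coords = np.arange(size) - size // 2
    x, _ = np.meshgrid(coords, coords, indexing="ij")
    image = np.cos(f * x + image_phase)      # sinusoidal grating in direction tau = 0
    s0, s1 = np.sum(w0 * image), np.sum(w1 * image)
    return s0, s1, np.sqrt(s0 ** 2 + s1 ** 2)   # complex cell: L2 norm of (s0, s1)

# Shifting the grating's phase swings the simple-cell responses strongly,
# while the complex-cell response stays approximately constant.
for phase in [0.0, np.pi / 4, np.pi / 2]:
    print(["%.3f" % v for v in responses(phase)])
```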
Some of the most striking correspondences between neuroscience and machine
learning come from visually comparing the features learned by machine learning
models with those employed by V1. Olshausen and Field (1996) showed that
a simple unsupervised learning algorithm, sparse coding, learns features with
receptive fields similar to those of simple cells. Since then, we have found that
an extremely wide variety of statistical learning algorithms learn features with
Gabor-like functions when applied to natural images. This includes most deep
learning algorithms, which learn these features in their first layer. Figure 9.19
shows some examples. Because so many different learning algorithms learn edge
detectors, it is difficult to conclude that any specific learning algorithm is the “right”
model of the brain just based on the features it learns (though it can certainly be a
bad sign if an algorithm does not learn some sort of edge detector when applied to
natural images). These features are an important part of the statistical structure
of natural images and can be recovered by many different approaches to statistical
modeling. See Hyvärinen et al. (2009) for a review of the field of natural image
statistics.
Figure 9.19: Many machine learning algorithms learn features that detect edges or specific
colors of edges when applied to natural images. These feature detectors are reminiscent
of the Gabor functions known to be present in the primary visual cortex. (Left) Weights learned by an unsupervised learning algorithm (spike and slab sparse coding) applied to small image patches. (Right) Convolution kernels learned by the first layer of a fully
supervised convolutional maxout network. Neighboring pairs of filters drive the same
maxout unit.
9.11 Convolutional Networks and the History of Deep
Learning
Convolutional networks have played an important role in the history of deep
learning. They are a key example of a successful application of insights obtained
by studying the brain to machine learning applications. They were also some of
the first deep models to perform well, long before arbitrary deep models were
considered viable. Convolutional networks were also some of the first neural
networks to solve important commercial applications and remain at the forefront
of commercial applications of deep learning today. For example, in the 1990s, the
neural network research group at AT&T developed a convolutional network for
reading checks (LeCun et al., 1998b). By the end of the 1990s, this system deployed
by NCR was reading over 10 percent of all the checks in the United States. Later,
several OCR and handwriting recognition systems based on convolutional nets
were deployed by Microsoft (Simard et al., 2003). See chapter 12 for more details
on such applications and more modern applications of convolutional networks. See
LeCun et al. (2010) for a more in-depth history of convolutional networks up to
2010.
Convolutional networks were also used to win many contests. The current
intensity of commercial interest in deep learning began when Krizhevsky et al.
(2012) won the ImageNet object recognition challenge, but convolutional networks
had been used to win other machine learning and computer vision contests with
less impact for years earlier.
Convolutional nets were some of the first working deep networks trained with
back-propagation. It is not entirely clear why convolutional networks succeeded
when general back-propagation networks were considered to have failed. It may
simply be that convolutional networks were more computationally efficient than
fully connected networks, so it was easier to run multiple experiments with them
and tune their implementation and hyperparameters. Larger networks also seem
to be easier to train. With modern hardware, large fully connected networks
appear to perform reasonably on many tasks, even when using datasets that were
available and activation functions that were popular during the times when fully
connected networks were believed not to work well. It may be that the primary
barriers to the success of neural networks were psychological (practitioners did
not expect neural networks to work, so they did not make a serious effort to use
neural networks). Whatever the case, it is fortunate that convolutional networks
performed well decades ago. In many ways, they carried the torch for the rest of
deep learning and paved the way to the acceptance of neural networks in general.
Convolutional networks provide a way to specialize neural networks to work
with data that has a clear grid-structured topology and to scale such models to
very large size. This approach has been the most successful on a two-dimensional
image topology. To process one-dimensional sequential data, we turn next to
another powerful specialization of the neural networks framework: recurrent neural
networks.