1 Introduction
Convolutional neural networks (CNNs) deliver reliable solutions for the problems of image classification, speech recognition, recommendation, and language translation. Software frameworks such as Caffe, TensorFlow, and PyTorch have emerged to support the increasing variety of CNNs on multiple hardware architectures. Most of these frameworks introduce a middle-layer representation for the network primitives that are efficiently implemented in high-performance numerical libraries [5, 16, 24].
Training and running a CNN is a computationally-intensive task with convolution layers accounting for roughly 80% of the total CNN inference time on a contemporary CPU. More than three-quarters of the CNN inference market relies on CPUs because of their low inference latency [
22].
General matrix multiplication (GEMM) is the most-used primitive in high-performance libraries. Convolution is typically computed by performing the
im2col transformation on the input image and calling a library
GEMM routine [
3]. Other approaches suggest reducing the memory footprint of
im2col or implementing convolution through efficient architecture-specific assembly kernels [
1,
4,
7,
8,
11].
The convolution algorithm presented in this article targets memory hierarchy optimization on CPUs.
The core idea of YaConv is to pack the input image into an L3-cache-resident buffer, preload a smaller chunk of the buffer into L1 cache, and use this image chunk for computation with all L2-cache-resident filter elements before switching to other image elements. To our knowledge,
YaConv is the first convolution algorithm that implements all of the following:
(1)
uses an unchanged GEMM microkernel from a high-performance library;
(2)
integrates domain-specific packing of elements into the cache with the calls to GEMM microkernels;
(3)
avoids unnecessary additional copies of input-image elements.
Points (2) and (3) are important distinct design decisions. Several methods have been proposed to eliminate or reduce the copy of the image tensor in memory [
1,
4,
11]. However, unlike
YaConv, they do not address the cache reload issue encountered while calling a library
GEMM routine multiple times on the same elements. Moreover,
YaConv is a novel algorithm that benefits from lessons learned in the seminal work by Goto and van de Geijn [9]. In particular, YaConv employs a tiling and packing strategy that allows convolution to be efficiently expressed in terms of
GEMM calls, while eliminating redundant copies of the input image.
YaConv successfully repurposes
GEMM building blocks to perform convolution. It aims to compute convolution at a performance level that is close to the machine’s peak without requiring the writing of new assembly kernels. Different from other
Basic Linear Algebra Subprograms (BLAS) operations, which can be directly and efficiently implemented in terms of
GEMM micro-kernel calls, convolution has many variants and widely different algorithms. Therefore, it is difficult to justify the time and effort needed not only to implement, but also to maintain, multiple micro-kernel implementations for convolution. Processor vendors already spend significant resources ensuring that the GEMM micro kernel is performant on their current and new architectures. Thus, YaConv's design saves engineering time and effort by reusing the already widely used and maintained
GEMM micro kernels in high-performance BLAS libraries. This article shows that a very efficient convolution algorithm can be constructed using the building blocks and lessons learned from the BLIS Project [
16] to implement novel tiling and packing logic while reusing
GEMM micro kernels—also a core concept in BLIS. There are several ways to implement a convolution algorithm with the same constraints. The version implemented and evaluated in this article is the most promising in terms of performance on the actual layers found in PyTorch.
For some layers,
YaConv’s performance is on par with a library
GEMM routine, which is around
\(80\text{--}90\%\) of the machine’s theoretical peak performance [
9,
16,
24]. In comparison with another solution that also works for multiple CPU architectures,
im2col convolution,
YaConv is 24% faster, measured as the geometric mean of the speedup (w.r.t.
im2col) over 73 layers taken from real CNN models. Moreover,
YaConv achieves this level of performance using
\(10 \times\) less memory than
im2col convolution, requiring only a small buffer space.
An experimental study based on varying input image sizes confirms that the performance of
YaConv is dependent on architecture-specific parameters within the
GEMM microkernel. The results of this study point to ways to reduce this sensitivity by tuning the image height in the intermediate layers of certain CNNs. Alternatively, the insights from this study could guide design decisions in code generators for CNN, such as Ansor [
26], or into compiler framework, e.g., LLVM [
12] and MLIR [
13]. A cache-utilization evaluation over a range of inputs reveals that two parameters—the number of output channels (
M) and image height (
H)—affect the performance of
YaConv the most, with consistent speedup over the baseline when
\(M \lt 500\) or
\(H \gt 20\) .
The contributions of this article include:
•
A clear description of the cache inefficiencies of the previous convolution algorithms (Section
3) that guides the design principles for
YaConv;
•
Explanation of the new convolution algorithm that avoids unnecessary additional copies of image elements (Section
4) and a working implementation that uses unchanged
GEMM microkernels, integrated into the popular numerical library BLIS [
16];
•
A public and free software implementation of
YaConv as an extension to BLIS [
16];
and
•
An experimental evaluation study on convolutional layers found in practice and on a grid of parameters indicating that the performance of
YaConv is superior to
im2col on multiple architectures due to better utilization of the memory hierarchy (Section
5).
4 YaConv
Two principles are at the core of the new convolution algorithm: the algorithm should not require redundant copies or loads of input elements in cache, and it must use unaltered
GEMM microkernels. We follow the CPU cache utilization guidelines presented in Reference [
9].
YaConv introduces a new iteration pattern for convolution and controls packing of the input tensor elements into the cache. Eliminating redundant copies in the image tensor is central to
YaConv and it also enables the reuse of the filter tensor, as discussed below (Figure 4). The tensor-streaming choices for both
im2col-based convolution and
YaConv are determined by the nesting order of the loops surrounding the micro kernel.
YaConv nests the loops in such a way that the packing of each tensor in a given level of the memory hierarchy is based on how the operands are used in the micro kernel. Although the results show that
YaConv significantly outperforms
im2col-based convolutions, this article does not claim that the packing choices made are optimal. Alternative nesting orders can be found by space exploration via analytical models [
15].
In naive convolution, two loops iterate over the spatial dimensions of the input image and compute the sum of products between the weight tensor and each image patch (Figure
2(b)). In
YaConv, these sums do not happen in the same loop iteration. Instead, YaConv computes one-row convolutions between a selected row of each filter and the corresponding image patches.
Figure
4 demonstrates how
YaConv computes the same convolution as in Figure
3 using packing and outer-product microkernels from a library
GEMM. In the first step,
YaConv reshapes the input tensor as a column-major matrix
\([W \cdot C] \times [H]\) . This step does not incur any data copy overhead, because the memory layout of such a matrix matches the layout of the input tensor
\(I_{H \times W \times C}\) .
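This zero-copy reinterpretation can be illustrated with NumPy (a sketch; the sizes are hypothetical):

```python
import numpy as np

H, W, C = 5, 4, 3  # hypothetical image sizes
I = np.arange(H * W * C, dtype=np.float32).reshape(H, W, C)  # row-major HWC

# View the same memory as a column-major [W*C] x H matrix: column h of the
# matrix is image row h flattened, so no data is copied.
I_mat = I.reshape(H, W * C).T  # transpose of a row-major view = column-major

assert I_mat.flags['F_CONTIGUOUS']   # column-major layout
assert np.shares_memory(I, I_mat)    # zero-copy view
assert np.array_equal(I_mat[:, 2], I[2].ravel())
```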
The second step in Figure 4 shows how
YaConv packs the input tensor into an L3-cache-resident buffer. Zeroed-out elements in the right tile of size
\([W \cdot C] \times n_r\) artificially extend the size of the packed input tensor buffer. The reason for doing so lies within the
GEMM microkernel—it performs poorly when the matrix sizes are not multiples of
\(n_r\) . By first packing the input tensor, we bring the input tensor elements to all cache levels, ensuring that the buffer is in L3 cache.
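A minimal sketch of this packing step, with hypothetical sizes and a plain NumPy array standing in for the library's packed buffer:

```python
import numpy as np

def pack_image(I, n_r):
    """Pack an H x W x C image into a column-major [W*C] x H_padded buffer,
    zero-padding H up to the next multiple of n_r (sketch of this step)."""
    H, W, C = I.shape
    H_pad = -(-H // n_r) * n_r            # round H up to a multiple of n_r
    buf = np.zeros((W * C, H_pad), dtype=I.dtype, order='F')
    buf[:, :H] = I.reshape(H, W * C).T    # columns H..H_pad-1 stay zero
    return buf

buf = pack_image(np.ones((5, 4, 3), dtype=np.float32), n_r=4)
assert buf.shape == (12, 8)               # H = 5 padded to 8
assert buf[:, 5:].sum() == 0              # padding columns are zero
```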
In the third step,
YaConv packs the weight tensor as
\(F_h\) separate matrices of size
\(M \times [F_w \cdot C]\) into an L2-cache-resident buffer. This copy operation might evict some input tensor elements—as it happens in the traditional
GEMM algorithm—but ensures that
\(M \cdot F_w \cdot C\) weight tensor elements are loaded to L2 cache. To perform partial convolutions for each filter row, one
\(M \times [F_w \cdot C]\) matrix is packed in the weight buffer at a time.
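The per-row weight packing can be sketched as follows (the tensor layout F_h x F_w x C x M and the sizes are illustrative):

```python
import numpy as np

def pack_filter_row(Wt, f):
    """From weights of shape (F_h, F_w, C, M), pack filter row f as an
    M x [F_w*C] matrix (sketch of the per-row weight packing)."""
    F_h, F_w, C, M = Wt.shape
    return Wt[f].reshape(F_w * C, M).T    # row-major (F_w, C) flattening

Wt = np.random.rand(3, 3, 8, 16).astype(np.float32)
A = pack_filter_row(Wt, 1)
assert A.shape == (16, 3 * 8)             # M x [F_w*C]
assert A[5, 2 * 8 + 4] == Wt[1, 2, 4, 5]  # element (fw=2, c=4) of channel 5
```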
In the fourth step,
YaConv computes each convolution
\(W_{1 \times F_w \times C \times M} \otimes I_{H \times W \times C} = O_{H_{{out}} \times W_{{out}} \times M}\) as
\(W_{{out}}\) GEMMs with sizes
\([H] \times [F_w \cdot C] \times [M]\) . Having packed the input tensor as contiguous tiles of size
\([W \cdot C] \times [n_r]\) ,
YaConv passes portions of size
\([F_w \cdot C] \times [n_r]\) of the packed input buffer as a second operand to the
GEMM microkernel—one such portion is highlighted by the blue rectangle in Figure
4. The first operand of the microkernel call is a tile of size
\([m_r] \times [F_w \cdot C]\) from the packed weight buffer.
Each
GEMM with sizes
\([m_r] \times [F_w \cdot C] \times [n_r]\) computes
\(n_r\) output elements for
\(m_r\) output channels. All of these elements correspond to the same column of the output image (
\(W_{{out}}\) -dimension), given by the vertical offset of the tile of size
\([F_w \cdot C] \times [n_r]\) within the packed input tensor. One
GEMM call computes the result of applying one row of
\(m_r\) filters to
\(n_r\) rows of the input image at the same column offset. The corresponding weight and input tensor elements are highlighted by green and blue rectangles in Figure
4. Each result of size
\(m_r \times n_r\) is placed in the output array (column-major
\([M] \times [H_{{out}} \cdot W_{{out}}]\) matrix) as
\(n_r\) columns at stride
\(W_{{out}} \cdot M\) , because every
GEMM call corresponds to applying one filter row over
\(n_r\) input image rows at a fixed column.
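The following sketch illustrates the scatter of one m_r x n_r microkernel result into the column-major output; the plain `A @ B` stands in for the library's assembly microkernel, and all sizes and offsets are hypothetical:

```python
import numpy as np

m_r, n_r, k = 4, 4, 6
A = np.random.rand(m_r, k).astype(np.float32)    # tile of the packed weights
B = np.random.rand(k, n_r).astype(np.float32)    # tile of the packed image
acc = A @ B                                      # microkernel result

M, H_out, W_out = 8, 4, 5
out = np.zeros((M, H_out * W_out), dtype=np.float32, order='F')
m0, h0, w0 = 0, 0, 2                             # channel/row/column offsets

# The n_r results belong to n_r consecutive image rows at one fixed output
# column, i.e., to output columns W_out apart (memory stride W_out * M).
for j in range(n_r):
    out[m0:m0 + m_r, (h0 + j) * W_out + w0] += acc[:, j]

assert np.allclose(out[0:4, 2], acc[:, 0])
```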
4.1 Extra Memory Usage
Because YaConv computes a GEMM between every possible combination of the weight and image tiles, some of the computed elements are not part of the convolution result and must not be accumulated in the output. In the example in Figure
4, the first filter row is multiplied with elements of the blue portion of the input image as part of the
GEMM microkernel call with width
\(n_r = 4\). However, only the first three columns of this highlighted image area should be multiplied with the first filter row and contribute to the output tensor. The dot product between elements 16, 17, and 18 of the input tensor and the first filter row corresponds to applying the filter at a position where the last filter row falls outside the image bounds.
One way to discard these elements is to adjust the width of every GEMM microkernel call to smaller values than \(n_r\) , leading to significant performance degradation. Instead, YaConv reserves a larger buffer for the output elements than the \(H_{{out}} \cdot W_{{out}} \cdot M\) elements needed by convolution. The actual output tensor \(H_{{out}} \times W_{{out}} \times M\) is located at an offset within this buffer space, while the extra space is used for the spillover elements of the GEMM calls.
The output buffer contains the output array and two extra parts
before and
after the actual output tensor. For a given problem,
YaConv uses space for
\((F_h - 1 - P_h) \cdot W_{{out}} \cdot M\) extra elements before the output array. Because the image buffer is enlarged to the optimal microkernel size and to account for the spillover results from
GEMM,
YaConv also reserves some space for the elements immediately after the actual output. In Figure
4,
\(H_{{extra}}\) is the enlargement of H required to reach the next multiple of \(n_r\). Therefore, the extra space required after the output array is
\((H_{{extra}} + F_h - 1 - P_h) \cdot W_{{out}} \cdot M\) , where the second part comes from the
GEMM spillover elements. In total,
YaConv requires
\((H_{{extra}} + 2 \cdot (F_h - 1 - P_h)) \cdot W_{{out}} \cdot M\) extra elements, where \(H_{{extra}} \lt n_r\). Asymptotically, the extra space complexity of
YaConv is
\(O((F_h - P_h) \cdot W_{{out}} \cdot M)\) , which is sublinear on the output size
\(H_{{out}} \cdot W_{{out}} \cdot M\) .
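The buffer-size formulas above can be collected into a small helper (a sketch; the parameter values in the example are illustrative):

```python
def yaconv_extra_elems(H, F_h, P_h, W_out, M, n_r):
    """Extra output-buffer elements YaConv reserves around the real output
    (collects the formulas of Section 4.1; parameters are as in the text)."""
    H_extra = (-H) % n_r                  # enlargement of H to a multiple of n_r
    before = (F_h - 1 - P_h) * W_out * M  # spillover space before the output
    after = (H_extra + F_h - 1 - P_h) * W_out * M  # spillover space after
    return before + after

# Illustrative layer: 3x3 filter, no vertical padding, H already a multiple of n_r
extra = yaconv_extra_elems(H=56, F_h=3, P_h=0, W_out=54, M=64, n_r=8)
assert extra == (0 + 2 * (3 - 1 - 0)) * 54 * 64  # (H_extra + 2(F_h-1-P_h))*W_out*M
```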
4.2 Tiling and Block Sizes
In real applications, weight and image tensors are large enough to not fit in the cache. We add loop tiling to our algorithm to improve cache locality and TLB entry usage, as suggested by Goto and van de Geijn. The conventional
GEMM algorithm [
9] uses three cache block sizes
\(MC, KC, NC\) for two packed buffers
\(\tilde{A}_{MC \times KC}\) and
\(\tilde{B}_{KC \times NC}\) :
(1)
KC is calculated to fill L1 cache with tiles \(m_r \times KC\) and \(KC \times n_r\) of the operands of each microkernel call;
(2)
MC is set to fill the L2 cache with \(MC \times KC\) elements of the packed buffer \(\tilde{A}\) ;
(3)
NC determines the size of the buffer \(\tilde{B}\) that resides in L3 cache.
High-performance BLAS libraries set these sizes more conservatively, because temporary variables and the output tile
\(m_r \times n_r\) take up some L1 cache space, and the TLB is typically a more limiting factor than the L2 cache [
9,
16,
24].
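A back-of-the-envelope version of the three rules above, ignoring the conservative adjustments just mentioned (the cache sizes and register tile sizes are illustrative):

```python
def gemm_block_sizes(l1, l2, l3, m_r, n_r, dtype_size=4):
    """Cache block sizes in the style of Goto's GEMM algorithm;
    real libraries choose more conservative values."""
    # (1) KC: an m_r x KC tile of A plus a KC x n_r tile of B fill L1
    KC = l1 // ((m_r + n_r) * dtype_size)
    # (2) MC: an MC x KC panel of packed A fills L2
    MC = l2 // (KC * dtype_size)
    # (3) NC: a KC x NC panel of packed B fills L3
    NC = l3 // (KC * dtype_size)
    return KC, MC, NC

# Hypothetical 32 KiB L1 / 512 KiB L2 / 8 MiB L3 and a 6x16 register tile
KC, MC, NC = gemm_block_sizes(32 * 1024, 512 * 1024, 8 * 1024 * 1024,
                              m_r=6, n_r=16)
assert MC * KC * 4 <= 512 * 1024 and KC * NC * 4 <= 8 * 1024 * 1024
```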
The sizes for weight and image buffers in YaConv are taken from the BLIS implementation of the GEMM routine. The order of calls to packing routines determines the cache-level residence for each buffer. YaConv packs a portion of the image in the outermost loop, thus loading its elements to all cache levels. The packing of weight elements into the buffer follows image packing. Such an order, coupled with adequate tile sizes, ensures that the weight elements will not be evicted from L2 cache by the image elements during packing. Two innermost loops around the microkernel (with steps \(n_r\) and \(m_r\) ) iterate over L1-sized tiles of weight and image buffers and call the microkernel code to compute partial results.
Additionally, when \(W = F_h = F_w = 1, P_h = P_w = 0\) , the weight and image tensors can be thought of as matrices and convolution degenerates into a GEMM with sizes \(M \times C \times H_{{out}}\) . The loops over \(F_h\) and \(W_{{out}}\) contain only one iteration and YaConv becomes the conventional matrix multiplication algorithm.
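The degenerate case can be checked directly (a sketch with W = F_h = F_w = 1, no padding, and illustrative sizes):

```python
import numpy as np

# With a 1x1 filter and a W = 1 image, convolution of an H x 1 x C image
# with M filters equals a GEMM with sizes M x C x H.
H, C, M = 7, 5, 6
I = np.random.rand(H, 1, C).astype(np.float32)
Wt = np.random.rand(1, 1, C, M).astype(np.float32)   # F_h x F_w x C x M

out_gemm = Wt[0, 0].T @ I.reshape(H, C).T            # (M x C) @ (C x H)

# Naive convolution for reference: one dot product per output pixel
out_conv = np.zeros((M, H), dtype=np.float32)
for h in range(H):
    out_conv[:, h] = Wt[0, 0].T @ I[h, 0]

assert np.allclose(out_gemm, out_conv)
```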
6 Related Work
Many convolution algorithms are not expressed in terms of
GEMM [
2]. For instance, prior work explored the use of Fast-Fourier Transform [
18] and Winograd’s algorithm [
14]. Winograd-based convolution algorithms are designed to minimize arithmetic intensity and aim to be efficient for convolutions over small tiles for small filters and small batch sizes.
YaConv does not reduce the arithmetic intensity of convolutions, but does eliminate unnecessary copies performed by traditional
GEMM-based convolutions. Moreover,
YaConv is not sensitive to small filters and batch sizes, but it is sensitive to the number of filters and performs better with a small number of filters. FFT-based convolution algorithms are known to be faster than direct convolution algorithms for large filter sizes [
19]. Although the complexity of convolution algorithms is established, there is no thorough empirical study contrasting standardized and well-maintained implementations of such algorithms on modern hardware. This work enables such a study by making publicly available, as free software, an implementation of
YaConv that extends and re-uses building blocks from BLIS [16]—a widely used, well-documented, and well-maintained linear algebra library.
The remainder of this section focuses on previous work closely related to
YaConv, which is a novel convolution algorithm aimed at reusing
GEMM micro kernels from high-performance libraries.
A common approach to deal with the performance inefficiencies of naïve convolution is
im2col convolution. The method was first introduced by Chellapilla et al. and it became popular due to its use in Caffe [
3,
10].
im2col is a data-copy procedure that prepares patches of the input image in a separate patch matrix. Each patch is a portion of the input image of the same size as the filter. Each input element is copied up to
\(F_h \cdot F_w\) times into this patch matrix—one time for each patch-filter product in which the element takes part. (The exact number of copies depends on the stride of the convolution.) The goal of the
im2col transformation is to place patch elements adjacent in memory to optimize cache locality and to take advantage of the highly optimized
GEMM routines readily available for most architectures.
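A reference implementation of im2col convolution, for unit stride and no padding, can be sketched as follows (an illustration, not the Caffe code):

```python
import numpy as np

def im2col_conv(I, Wt):
    """Convolution via the im2col patch matrix (unit stride, no padding).
    I: H x W x C image, Wt: F_h x F_w x C x M weights."""
    H, W, C = I.shape
    F_h, F_w, _, M = Wt.shape
    H_out, W_out = H - F_h + 1, W - F_w + 1
    # Each row of the patch matrix is one flattened F_h x F_w x C patch
    patches = np.empty((H_out * W_out, F_h * F_w * C), dtype=I.dtype)
    for h in range(H_out):
        for w in range(W_out):
            patches[h * W_out + w] = I[h:h + F_h, w:w + F_w].ravel()
    # One GEMM finishes the convolution
    return (patches @ Wt.reshape(F_h * F_w * C, M)).reshape(H_out, W_out, M)

I = np.random.rand(6, 6, 3).astype(np.float32)
Wt = np.random.rand(3, 3, 3, 4).astype(np.float32)
out = im2col_conv(I, Wt)
assert np.allclose(out[1, 2], np.tensordot(I[1:4, 2:5], Wt, axes=3))
```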
im2col has three drawbacks: (1) memory footprint: the patch matrix requires additional storage of
\(O(F_h \cdot F_w)\) times the image size; (2) an additional copy operation; and (3) an enlarged inner dimension of the
GEMM, given by the size
\(F_h \cdot F_w \cdot C\) of the patch matrix, leads to lower reuse of elements loaded into the cache. This effect is pronounced when
\(K \gg M\) and \(K \gg N\).
A way to reduce the overhead of
im2col consists in not creating the patch matrix explicitly [
6,
11]. One solution to avoid the creation of a patch matrix consists in creating an extra level of indirection in the form of a buffer of pointers to input image patches. This extra indirection reduces the space and time needed to prepare the patches for access—there is no need to have multiple copies of the same element. However, it increases the access time for each element, because each access requires one additional pointer dereference [
6]. This Indirect Convolution Algorithm requires a modified
GEMM microkernel that allows arbitrary strides between elements [
6]. Another solution integrates
im2col with the packing step of
GEMM in BLIS [
16] to prevent the extra data copying [
11]. Although neither of these methods keeps redundant copies of input elements in memory, each matrix element is loaded into cache
\(F_h \cdot F_w\) times. In contrast,
YaConv changes the packing routine and the iteration pattern to eliminate the need to copy the input elements either directly or via pointers.
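The indirection-buffer idea can be sketched in NumPy, with row offsets standing in for the pointers used by the real algorithm (sizes are illustrative):

```python
import numpy as np

# For each output pixel, store the offsets of the F_h image rows that its
# patch touches, instead of copying the patch itself.
H, W, C, F_h, F_w = 6, 6, 3, 3, 3
H_out, W_out = H - F_h + 1, W - F_w + 1
I = np.random.rand(H, W, C).astype(np.float32)

indirection = np.empty((H_out * W_out, F_h), dtype=np.intp)
for h in range(H_out):
    for w in range(W_out):
        indirection[h * W_out + w] = [(h + fh) * W + w for fh in range(F_h)]

# Gathering through the buffer reproduces an im2col patch without a
# materialized patch matrix.
flat = I.reshape(H * W, C)
row = indirection[1 * W_out + 3]                 # output pixel (1, 3)
patch = np.concatenate([flat[r:r + F_w] for r in row]).ravel()
assert np.array_equal(patch, I[1:4, 3:6].ravel())
```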
The same approach can be implemented in hardware [
21]. Soltaniyeh et al. design a new accelerator that integrates the
im2col transform and
GEMM. To decrease the overhead of the data transformation, the authors design interconnected patch units based on systolic arrays. These units act as buffers for holding common patch elements and reduce redundant accesses to the same elements. Although
YaConv could be implemented in hardware,
YaConv’s key design points allow it to be implemented on any commodity hardware without special instructions or specific hardware support.
Another way to reduce the memory footprint of
im2col is to call the library
GEMM routine several times [
4].
Memory Efficient Convolution (MEC) creates a different patch matrix, reducing its size by a factor of
\(F_w\) in comparison with the original
im2col transformation [
4]. The
GEMM routine is called
\(W_{{out}}\) times on this transformed patch matrix, each time storing its output at an offset in the output tensor. Although the MEC algorithm reduces the space overhead and can be faster in some cases, it was shown to underperform
im2col on a wide range of input parameters [
1]. A reason for the MEC’s inefficiency could be the cache overwrite between consecutive
GEMM calls. The number of algebraic operations needed to compute a
GEMM with sizes
\(100 \times 100 \times 100\) is the same as the number of operations needed to compute a
GEMM with sizes
\(100 \times 100 \times 10\) ten times. However, the latter can be several times slower, because fewer elements are reused while they are available in the caches.
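The equivalence of the operation counts can be seen by splitting the N dimension; the sketch below checks that ten narrow GEMMs reproduce one wide GEMM, even though each narrow call reuses the cached elements of the left operand far less:

```python
import numpy as np

# One GEMM with sizes 100 x 100 x 100 versus ten GEMMs with sizes
# 100 x 100 x 10: the same arithmetic, split over the N dimension.
A = np.random.rand(100, 100).astype(np.float32)
B = np.random.rand(100, 100).astype(np.float32)

full = A @ B
pieces = np.hstack([A @ B[:, j:j + 10] for j in range(0, 100, 10)])
assert np.allclose(full, pieces)
```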
Reducing the memory bandwidth per element can be achieved by manipulating data in registers [
25]. Zhao et al. optimize convolution training for the Sunway TaihuLight supercomputer that features CPE vector units organized in a mesh. The authors propose to map the whole convolution to tiles that fit in each CPE and move the overlapping patch elements between adjacent CPEs with specialized register communication instructions. In the same way as
YaConv, this approach eliminates the need for extra buffer space and uses elements in the cache until no longer needed. However, the solution depends on the instruction-level optimizations available on the specific architecture under evaluation [
25].
Anderson et al. propose a way to repeatedly call the
GEMM routine without copying elements of the input tensor. Their low-memory algorithm separates the weight tensor
\(F_h \times F_w \times M \times C\) into
\(F_h \cdot F_w\) tensors, each of shape
\(1 \times 1 \times M \times C\). Anderson et al. notice that convolution with
\(1 \times 1\) filters is the same as a
GEMM with sizes
\(M \times C \times [H_{{out}} \cdot W_{{out}}]\) . Convolution with any other filter size is a sum of
\(F_h \cdot F_w\) such convolutions, each corresponding to scaling the whole input image by one of the
\(F_h \cdot F_w\) filter elements [
1]. To account for the relative position of the
\(1 \times 1\) filter elements within the
\(F_h \times F_w \times M \times C\) weight tensor,
\(F_h \cdot F_w\) scalings are accumulated in the output array at corresponding offsets. Although MEC and the approach of Anderson et al. solve the memory problem of
im2col convolution, they do not address the problem of redundant cache reloads.
YaConv makes full use of each element present in the cache before the element is evicted, because
YaConv integrates the loading of elements into cache with computation.
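The decomposition of Anderson et al. can be sketched as a sum of shifted 1 x 1 convolutions (unit stride, no padding; sizes are illustrative):

```python
import numpy as np

# An F_h x F_w convolution as the sum of F_h*F_w 1x1 convolutions, each
# accumulated over a shifted window of the input image.
H, W, C, M, F_h, F_w = 6, 6, 3, 4, 3, 3
H_out, W_out = H - F_h + 1, W - F_w + 1
I = np.random.rand(H, W, C).astype(np.float32)
Wt = np.random.rand(F_h, F_w, C, M).astype(np.float32)

out = np.zeros((H_out, W_out, M), dtype=np.float32)
for fh in range(F_h):
    for fw in range(F_w):
        # 1x1 convolution = GEMM over the shifted image window
        win = I[fh:fh + H_out, fw:fw + W_out].reshape(-1, C)
        out += (win @ Wt[fh, fw]).reshape(H_out, W_out, M)

# Reference: direct convolution at one output position
assert np.allclose(out[1, 2], np.tensordot(I[1:4, 2:5], Wt, axes=3))
```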
While the aforementioned methods either explicitly copy the input tensor or call a
GEMM routine several times, direct convolution methods optimize the naive convolution loop nest. As naive convolution suffers from cache misses and an unoptimized multiply-add pattern, one solution is to call specific efficient assembly code for each convolution size at run-time. Georganas et al. present a JIT-optimized implementation for convolution that is aimed at whole-model performance optimization. They design efficient assembly kernels for direct convolution using hardware-specific vector instructions and cache prefetching [
7]. To alleviate memory bandwidth issues, cache and register blocking are applied to change the layout of elements in memory [
7]. This approach attains up to 90% of the peak machine performance but requires an enormous amount of engineering for each instruction set (10K+ lines of assembly code). In contrast,
YaConv introduces a generic cache utilization strategy and relies on the existing
GEMM microkernels. With its new approach,
YaConv achieves the same convolution performance as the work of Georganas et al. without requiring hand-crafted assembly programming.
A novel batch-reduce GEMM kernel may be better suited for convolution [
8]. The suggested microkernel introduces additional optimization opportunities for convolution, as contrasted with the traditional
GEMM microkernel from Equation (
1). Among several deep-learning primitives, the convolution implementation of Georganas et al. matches the performance of their previous work [
7] and significantly reduces the amount of manual assembly engineering. While this work achieves the best performance of all convolution algorithms found in the literature, it introduces a new microkernel that has to be written for each architecture.
YaConv relies on the
GEMM microkernel that is also used for most level-3 BLAS routines and machine learning workloads, e.g., kMeans, PCA, SVM.
As compared to
GEMM-based algorithms,
YaConv uses asymptotically less extra memory [
1,
3,
4]. Convolution based on the
im2col transform creates a patch matrix of size
\(O(F_h \cdot F_w \cdot C \cdot H_{{out}} \cdot W_{{out}})\) [
3]. MEC requires
\(O(F_h \cdot C \cdot H_{{out}} \cdot W_{{out}})\) extra space [
4]. The algorithm of Anderson et al. also requires extra memory immediately before and after the output array, which they estimate as
\(O(M \cdot H \cdot W)\) . Considering that
\(C \approx M\) and
\(H \approx H_{{out}} \approx W \approx W_{{out}}\) ,
YaConv uses the same or less extra space in comparison with
GEMM-based methods.
Optimal tiling size can be found through space exploration or by using an analytical model [
15]. Convolution algorithms such as
YaConv and the analytical model by Li et al. are complementary. Li et al.’s solution also relies on BLIS-like micro kernels (Section 6, second paragraph in Reference [
15]). The analytical-model approach proposed by Li et al. can be used in an automated-tuning approach to discover good tile sizes for
YaConv, and similar algorithms, for specific convolutions.