WO2018073975A1

WO2018073975A1 - Improved sparse convolution neural network

Info

Publication number: WO2018073975A1
Application number: PCT/JP2016/081973
Authority: WO
Inventors: Vijay DAULTANI
Original assignee: Nec Corporation
Priority date: 2016-10-21
Filing date: 2016-10-21
Publication date: 2018-04-26

Abstract

A computer-implemented information processing method for an inference phase of a convolution neural network, the method including steps of: generating a list of non-zero elements from a learned sparse kernel to be used for a convolution layer of the convolution neural network; when performing convolution on an input feature map, loading only elements of the input feature map which correspond to the non-zero elements of the generated list; and performing convolution arithmetic operations using the loaded elements of the input data map and the non-zero elements of the list, thereby reducing the number of operations necessary to generate an output feature map of the convolution layer.

Description

DESCRIPTION

TITLE OF INVENTION

IMPROVED SPARSE CONVOLUTION NEURAL NETWORK

TECHNICAL FIELD

The present disclosure relates to convolutional neural networks used for information processing. More specifically, the present disclosure relates to a method implemented on a computer system to provide an improved convolutional neural network for information processing using sparse direct convolution (improved sparse convolution).

BACKGROUND ART

Recently, deep learning has been widely applied to the field of machine learning, particularly through the use of artificial neural networks which have shown promising results in various fields. A convolutional neural network (CNN), which is one class of artificial neural networks, has seen significant research contributions in the past few years. CNNs have exhibited exceptional properties which have inspired their use for a multitude of challenging tasks. Image processing, text processing, speech processing, trade markets, etc. are some examples of the many fields where CNNs are being applied. Machine learning has a long history and its techniques been applied in many fields for various tasks. Before CNNs were used for these tasks, designers of machine learning systems had to determine which input features should be used to train computers in order to achieve good results. Specific features were chosen based on the designer's experience and intuition. Machine learning techniques used these manually decided features for learning on training data. Careful selection of features required a large amount of time and effort, and had a huge impact on the results of tasks that machine learning was used to solve. Such decisions with regard to choosing features were limited by a designer's capability of wisely choosing the correct set of features. However, the use of CNNs has changed this by automatically learning such features and has replaced the need for a designer to choose the features.

In general a CNN can be viewed as a computation graph which is a thin wrapper around nodes (i.e. layers) connected together in some order. This interconnection of layers which form a computation graph or a network is also known as a model. Different types of inputs (e.g., image, voice, etc.) have different characteristics and hence a single CNN model which suits every type of input is extremely difficult to design. Therefore, new CNN models are often singularly designed to either solve a particular problem or optimize an existing model. A CNN model includes a number of layers and their interconnections. A typical CNN model may include some or all of the below mentioned common elements like a convolutional layer, an activation layer, a pooling layer, a fully connected layer, a softmax layer and an SVM layer. Although the above mentioned elements may be common to CNN models, the configuration of the connections of these layers differentiates one CNN model from another.

Artificial neural networks can be thought of as a simplified emulation of the visual cortex system in a human brain. However, current artificial neural networks are designed with specific engineering goals and not to emulate all the functionalities of a brain. Hence, researchers have developed models inspired by very complex human visual cortex systems. This has an advantage in that it reduces the amount of computations within the limits of current state of the art hardware. In these abstracted mathematical models, specific tasks from the visual cortex system may be assigned to specific layers in artificial neural networks.

Layers in CNN models are arranged in specific patterns. For example, a convolutional layer is usually followed by an activation layer which is sometimes followed by a pooling layer. Together, the convolutional and activation layers model the capability of a single cell in the brain, i.e., where a cell fires (activates) if an excitatory signal (encouraging the cell to transmit the information forward to other neurons) on its dendrites is strong enough to overcome a particular threshold. Similarly, in a CNN model, a neuron activates if the output of a convolution operation is stronger than a predetermined threshold.

Since CNNs can have millions of neurons, the computing capability required to perform the computation for convolutional neural networks is proportional to the number of neurons in the network. High computing capability required by CNNs has inspired researchers to find acceleration methods and optimization techniques which can reduce such requirement while still achieving the same state of the art results.

Different CNN models can vary from each other in many ways. One of these differences can be the depth of the network (i.e. the number of layers in the network), the size (height, width, and depth) of each layer, the type of activation functions, usage of pooling layers and so on.

Convolutional layer consists of stacks of kernels which are convolved with the input from the previous layer. Each of the kernels produce an output feature map as the output of a convolution operation between an input and kernels and each feature map forms a channel of the convolutional layer output.

Convolution operations can be realized using several well-known standard algorithms like GEMM (General Matrix Multiplication), FFT (Fast Fourier Transform), and Direct Convolution. In most applications of CNNs (e.g. , image processing) each CNN model consists of one or more convolutional layers. A convolutional layer's input, output, and kernels are typically 4 dimensional arrays which are also called 4-tensors. Different convolution neural networks have different characteristics and are benefited if tensors are stored in memory in some specific data layout which will be useful for the CNN.

A general and simple CNN model may have a configuration such that an input operation is followed by convolution, followed by activation, followed by pooling, followed by a fully connected layer, followed by a softmax layer which may be used as an output of CNN model.

Different CNN models have different numbers and different configurations of these layers. One such well-known CNN model is AlexNet, which is used for image recognition. An example of the time breakdown of forward propagation for each layer of a sample CNN model can be found in NPL 2.

A single CNN model typically consists of several convolutional layers. A stack of kernels is associated with each convolutional layer, which defines the 4-tensor shape for that layer. Different shapes of these 4-tensors can favor different convolutional algorithms. Such influence of the 4-tensor shape on different algorithms was studied extensively for several state-of-the art CNN models, as disclosed in NPL 3. As a conclusion, the author of the paper found that no single convolutional (GEMM, FFT, Direct) algorithm outperformed all other algorithms for different settings (shapes of 4-tensor) of the convolutional layers. Also shown that no single data layout for tensors is good for all these different setting of convolutional layers, hence a sophisticated mechanism to change the data layout from one to another was devised. [Citation List]

[Non Patent Literature]

[NPL 1 ]

"Convolutional Neural Networks for Visual Recognition.", Stanford CS class CS231 n notes

http://cs231 n.github.io/convolutional-networks/

[NPL 2]

Emily L. Denton, Wojciech Zaremba, Joan Bruna, Yann Lecun and Rob Fergus, "Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation", Advances in Neural Information Processing Systems 27 (2014), pp 1269-1277

https://arxiv.org/pdf/1404.0736v2.pdf

[NPL 3]

Chao Li, Yi Yang, Min Feng, Chakradhar Srimat and Huiyang Zhou. "Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs.", International Conference on High Performance Computing, Networking, Storage, and Analysis (2016)

http://people.engr.ncsu.edu/hzhou/SC-16-DNN.pdf

[NPL4]

Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, Marianna Penksy, "Sparse Convolutional Neural Networks", 2015 IEEE

Conference on Computer Vision and Pattern Recognition (CVPR) (2015), pp. 806-814

http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/ Liu_Sparse_Convolutional_Neural_2015_CVPR_paper.pdf

DISCLOSURE OF INVENTION Problem to be Solved by the Invention

CNNs have been applied in a variety of areas ranging from image processing, text processing, speech processing, trade markets, etc. Data in these domains have completely different characteristics and hence demands researchers to come up with CNN models while keeping this different data and their characteristics in mind. This leads to CNN models which vary from each other. Properties like depth, size, type, or location of layers are what make one CNN model different from another. Different convolutional layers can have very different input, kernels, and output 4-tensor shapes. This large difference between different convolution layers in the same CNN or different CNN models can lead to a huge performance gap between different computational methods for a single layer at once. A thorough analysis is shown in NPL3 in which, for convolution layers with different parameters, no single existing convolution algorithm can work best for all configurations.

Since memory in a computer is always sequential (starting from 0), and therefore, if one wants to store a matrix which is logically two-dimensional data in a computer's memory, it can be stored in one of two ways. These two ways are known as row-major order (HW) or column-major (WH) order. Similarly, when one wants to store four-dimensional arrays, i.e. 4-tensors, there are twenty-four possible ways. Out of these twenty-four ways, the two most commonly used are NCHW or CHWN. However, it is shown in NPL3 that neither data layout (e.g., NCHW or CHWN) is optimal over the other in terms of performance for different convolutional layers.

Selecting an optimal convolutional algorithm and optimal data layout have mostly been derived from heuristics and often depends upon a computer's configuration. No single convolution algorithm and no single data layout for storing 4-tensors for input, kernel and output feature maps outperform all other convolution algorithms and data layout for all the different configurations of the convolutional layers.

Optimizing the computational method for a CNN and thereby reducing the number of operations required to complete a processing task is in high demand and is of great importance because conventional CNNs require a significant amount of computational power and processing time.

Means for Solving the Problem

The present disclosure provides an improved convolution computational method which can reduce the number of operations in CNNs and provides an information processing system using the same. This makes it possible to provide a single data layout which increases processing speed by reducing the number of necessary memory access operations regardless of different settings such as the shape of a 4-tensor of input, the kernel, or the output of a convolutional layer.

As a first aspect of the present invention, there is provided a computer-implemented information processing method for an inference phase of a convolution neural network, the method including steps of: generating a list of non-zero elements from a learned sparse kernel to be used for a convolution layer of the convolution neural network; when performing convolution on an input feature map, loading only elements of the input feature map which correspond to 0 the non-zero elements of the generated list; and performing convolution arithmetic operations using the loaded elements of the input data map and the non-zero elements of the list, thereby reducing the number of operations necessary to generate an output feature map of the convolution layer.

As a second aspect of the present invention in accordance with the first aspect, the input feature map is a 4-tensor; and elements of the input feature map are stored in memory in CHWN order.

As a third aspect of the present invention, there is provided a non-transitory computer readable storage medium storing instructions which cause a computer to perform an information processing method for an inference phase of a convolution neural network, the method including steps of: generating a list of non-zero elements from a learned sparse kernel to be used for a convolution layer of the convolution neural network; when performing convolution on an input feature map, loading only elements of the input feature map which correspond to the non-zero elements of the generated list; and performing convolution arithmetic operations using the loaded elements of the input data map and the non-zero elements of the list, thereby reducing the number of operations necessary to generate an output feature map of the convolution layer. As a fourth aspect of the present invention in accordance with the third aspect, the input feature map is a 4-tensor; and elements of the input feature map are stored in memory in CHWN order. Advantageous Effects of the Invention

The various aspects of the present invention improve the overall performance of a CNN system realize more efficient convolution operations of a convolutional neural network over conventional methods. By implementing a combination of kernel preprocessing and sparse direct convolution (improved sparse convolution), the number of basic operations in convolution (e.g. , multiplication and addition) depending upon a sparsity ratio (i.e. , number of zero elements present in the kernel) can be reduced, thereby reducing processing time and necessary processing power.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a block diagram of a configuration of a computer system by which information processing apparatuses according to exemplary embodiments of the present disclosure may be achieved.

FIG. 2 is a block diagram of a schematic configuration for a simplistic representation of a CNN model.

FIG. 3 is a block diagram of a comparison between prior art and an embodiment of the present invention. FIG. 4A is a flowchart of an example of training phase processing, which as an output of processing generates dense kernels to be used in the inference phase processing.

FIG. 4B is a flowchart of an example of inference phase processing, which uses dense kernels generated during training phase processing.

FIG. 5 is an example of a convolution operation being performed in the convolutional layer of a CNN model.

FIG. 6A is a block diagram of a comparative example of a direct convolution algorithm used to perform the convolution operation in the convolutional layer of a CNN of prior art.

FIG. 6B is a block diagram of a comparative example of a GEMM (Matrix Multiplication) convolution algorithm used to perform the convolution operation in the convolutional layer of a CNN of prior art.

FIG. 6C is a block diagram of sparse direct convolution and kernel preprocessing used to perform the convolution operation in the convolutional layer of a CNN according to an embodiment of the present invention.

FIG. 7A is a flowchart of an example of kernel preprocessing of an embodiment of the present invention.

FIG. 7B is a flowchart of an example of operation in the information processing apparatus according to the embodiment of the present invention.

FIG. 8A is a comparative example representing the direct convolution algorithm and the number of basic operations performed therein.

FIG. 8B is a comparative example representing GEMM (Matrix

Multiplication) convolution and the number of basic operations performed therein.

FIG. 8C is an example representing the sparse direct convolution algorithm and the number of the basic operations performed therein for the embodiment of the present invention.

FIG. 9 is a detailed example of the sparse direct convolution operations using a nonzero element list to generate an output feature map in accordance with an embodiment of the present invention.

FIG. 10A is an example of input feature maps for two images. FIG. 10B is an example of elements stored in memory in NCHW order.

FIG. 10C is an example of elements stored in memory in CHWN order. EMBODIMENTS FOR CARRYING OUT THE INVENTION

Hereinafter, embodiments of the present invention will be described with reference to the figures.

First, Fig. 1 a block diagram of a configuration of a computer 100 (also referred to as a "computer system") on which a computational method in accordance with embodiments of the present invention may be achieved. The computer system 100 includes a processor 110, a cache subsystem 120, a GPU subsystem 130, a graphic output device 140, a memory bridge 150, an I/O (Input/Output) subsystem 160, input devices 170 (e.g. a mouse 171 and a keyboard 172), a memory subsystem 180, and secondary storage 190. The computer system 100 may include a plurality of graphics output devices 140. The processor 110 includes registers 111 . The registers 111 are used to store data used by execution units included in the processor 110 from the cache subsystem 120. The registers 111 and other parts of the processor 110 are present on the same chip to reduce latency. The cache subsystem 120 may have two or more levels of cache. The processor 110 and at least a level of cache subsystem may be implemented on the same chip. The number (e.g. level 1 , level 2, level 3, etc.) and locations (on or off chip in the processor 110) of the levels may vary among systems having different architectures. Therefore, for the sake of simplification of the variation in configurations among systems having different architectures, the cache subsystem 120 is shown as a module separated from the processor 110. Input devices such as a mouse and a keyboard 170 are connected to the memory bridge 150 via the I/O subsystem. In order to facilitate a high level functional understanding of a CNN to which the present invention may be applied, a block diagram for a simplified representation of a known CNN model (AlexNet) is shown in Fig. 2 and will be briefly described. This particular CNN is commonly used for image processing (image recognition) and therefore, input 210 (also referred to as an input feature map) in this case is an image input to the CNN. The input data 210 is input to the convolution neural network 200, first to a convolution layer 220 where a convolution operation is performed on the input data 210 (input feature map) in conjunction with a kernel for the convolution layer. The convolution processing function is the same function for all convolution layers 220, 230, 240, 250, 260 of the CNN . Next, the output (also referred to as an output feature map) of the convolution layer 220 is input to the activation layer 230 where activation processing is performed. Next, the output of the activation layer 230 is input to the pooling layer 240 where pooling processing is performed. The convolution neural network 200 may further include one or more of fully connected layers 270, 280 each of which may be combined with an activation layer 271 , 281 . Operation continues through each layer until the final layer's output is generated as the output of the system. In this example, Softmax 290 is the output of the system which is the determination of the class of the object in the input 210. Although Fig. 2 is given as an example, the convolution neural network 200 may include any number of such combinations of convolution, activation and/or pooling processing, which may be followed by fully-connected layers and activation layers. The convolution neural network 200 includes any number of convolution processing layers, and it is the convolution layers and the operations therein that are improved in the present invention, as will be described below.

FIG. 3 is a block diagram of a comparison between prior art and an embodiment of the present invention. Conventionally, CNNs consists of two phases, i.e. , a training phase 311 and an inference phase 313. Kernels may be initialized to zero, random values, or values may be chosen in accordance with more advanced techniques. A training phase consists of learning the values of a kernel through a series of forward and backward passes through the CNN by inputting a sufficient number of training data sets in order to adjust the kernel values until a target accuracy is achieved. The final output of the training phase 311 is a set of learned dense kernels (hereinafter, the term "learned" kernel means that the values for the kernel were acquired through training). These learned dense kernels 312 are used in an inference phase 313 for evaluating test images at run time. For this example of a conventional CNN, dense convolution 314 is used for all convolution layers of CNN model. In the case of a conventional CNN using sparse convolution, the block diagram would be substantially the same as that of the dense convolution except with a sparsity constraint in the training phase to produce a set of learned sparse kernels to be used for an inference phase of the CNN.

In the block diagram of an embodiment of the present invention shown in the lower half of Fig. 3, the CNN consists of two phases, i.e., a training phase with a sparsity constraint 321 and an inference phase 325. Conventional techniques are used in training phase to perform training with sparsity constraint 321 , and the result of the training phase with the sparsity constraint 321 is a set of learned sparse kernels 322. However, the present invention takes particular advantage of a special property of learned sparse kernels in that the learned sparse kernels contain many zero elements depending upon the sparsity constraint in training phase. These learned sparse kernels 322 are given as input to a preprocessing algorithm 323 which extracts information regarding values of the non-zero elements of the learned sparse kernels 322 and their indexes, and generates, as output, a nonzero element list 324 for each convolutional layer of the CNN model. This nonzero element list 324 for each convolutional layer of the CNN model is then given as input to the CNN for use the inference phase 325 which uses a sparse direct convolutional algorithm (improved sparse CNN algorithm) 326 to perform convolutional layer processing. Since many elements in the learned sparse kernels 322 are zero, the nonzero element list 324 is used to reduce unnecessary convolution arithmetic operations (i.e., multiplication with zero) and to avoid unnecessary loading of values from memory, thereby resulting in an overall increase in computation efficiency.

Here, a detailed description of the flowchart for the training phase 311 of Fig. 3 will be described with reference to Fig. 4. First, the input network configuration is analyzed in S411 , where a CNN model, such as that shown as an example in Fig. 1 , is initialized along with input data to be used to train the model. Next, forward propagation processing is performed in S412 using current weights (kernels). Thereafter, if a predetermined target accuracy has not been achieved, S413 proceeds to backward propagation processing in S414 in which the value of elements in the kernels are adjusted so as to reduce the error and increase accuracy of the evaluation performed by the CNN. This process of forward and backward processing is iteratively performed iteratively for all of the training data until the predetermined target accuracy is achieved in step S413. Once the target accuracy achieved in S413, learned dense kernels become the output in S415 and training process in completed. The output of training phase of Fig. 4A is the set of learned dense kernels S415, which is represented by reference symbol 312 in Fig. 3.

FIG. 4B is a flowchart for the inference phase 325 according to the conventional CNN on the upper half of Fig. 3. In S421 , the input network configuration is analyzed similarly to the first step of the training phase except, here, data to be evaluated is input instead of training data. Next, S422 fetches the next layer in the CNN model. S423 checks if end of the network configuration has been reached. If not, i.e., there are more layers to process in the network, the flow advances to S424. In S424, if the current processing layer is a convolution layer, the flow advances to S426 where the data to be evaluated (input feature maps) and dense kernels are utilized to perform convolution processing. On the other hand, if S424 finds that the current processed layer is not convolutional layer than layer specific processing is performed. If S423 finds that there are no more layers to be processed the input network configuration control advances to S427 where the output is reported. After S427 is executed the inference process is completed.

Fig. 5 represents an example of the convolution operations performed in the convolutional layer of a conventional CNN model using a sparse kernel (i.e. , a kernel in which many elements are zero). An input feature map 510 represents input to the convolutional layer from previous layer or, in the case of the first convolution layer, the input feature map 410 is the input image to the CNN model. The first convolutional layer takes input i.e. input feature map 510 in the form of input images. Each channel of the input image is represented as a separate input feature map. For example, for CNN models for image recognition input images can be grayscale or color images. Grayscale images consist of only one value for each pixel in the image, representing the intensity of light on that pixel. A single value for each pixel results in grayscale images to be represented by only one channel. A color image has three values for each pixel corresponding to the red, green, and blue components for each pixel. Therefore, a color image consists of three channels each representing one of three colors for each pixel in an image. For grayscale images, the first convolutional layer takes one channel per image as a single input feature map, and for color images, the first convolutional layer takes three channels per image as three input feature maps. In Fig. 5, one input feature map 510 represents a single channel of a grayscale image as input. It should be understood that this is merely an example of one feature map for explanatory purposes and that each convolutional layer can have any number of input feature maps.

The kernel 520 of Fig. 5 represents a filter (with weights) applied on the input feature maps. Each kernel 520 is applied on each input feature maps 510 spatially. The number of channels in kernel is equal to number of input feature maps. For example, in Fig. 5, there is only one input feature map, and therefore, the number of channels in kernel is also one. A single convolution operation between kernel 520 and a patch of input feature map 510 is represented in Fig. 4. The patch, as can be seen in Fig. 5, is a portion of the input feature map that is filtered using the kernel 520 to produce a pixel of the output feature map. The convolution operation is realized by arithmetic operations of multiplication and addition. The output of a convolution operation is calculated by first multiplying each pixel in kernel with its corresponding pixel in the input feature map patch, and then all multiplication results are added as the output of the convolution operation. The output feature map 530 is generated as an output of the convolution operation between input feature map 510 and kernel 520. The output feature map calculation 540 shows how a combination of multiplications and addition between input feature map 510 and kernel 520 results in the output feature map 530. The convolution between an input feature map 510 and kernel 520 results in one output feature map 530. Therefore, the number of output feature maps 530 is always equal to the number of kernels present in the convolutional layer.

Fig. 6A is a block diagram of the direct convolution algorithm 613, which is a conventional algorithm used in prior art such as that shown in the upper half of Fig. 3. The direct convolution algorithm takes input such as input feature maps 611 from a previous layer. Input feature maps 611 of the current layer are output feature maps 614 of the previous layer in the CNN network model configuration. Learned dense kernels 612, represent learned dense kernels 312 of Fig. 3. Direct convolutional algorithm 613 takes input feature maps 611 and then performs convolution operations using learned dense kernels 612 of current convolution layer. Input feature maps 611 for each convolutional layer are represented by a 3-dimensional matrix. In order to utilize the hardware optimally, in a practical setting, rather than giving a single input image as input, a batch of images is given as input. A 3-dimensional matrix for each image leads to a 4-dimensional matrix for batch of images. Also a single convolutional layer consists of more than one kernel. Each kernel is represented by a 3-dimensional matrix which leads to a 4-dimensional matrix for multiple kernels. The direct convolutional algorithm 613 takes a 4-dimensional matrix representation of input feature maps 611 and 4-dimensional matrix representation of learned dense kernels 612 and generates 4-dimensional matrix representation for output feature maps 614. It should be noted that the direct convolutional algorithm 613, does not perform any kind of intermediate transformation either on the 4-dimensional matrix representation of input feature maps 611 or on the 4-dimensional matrix representation of the learned dense kernels 612. The output feature maps 614 are then given as an input to the next layer in the CNN model hierarchy.

Fig. 6B is a block diagram of example of a GEMM (General Matrix Multiplication) convolution algorithm which is another conventional algorithm in prior art in 314 of Fig. 3. Input feature maps 621 are taken as input to the convolutional layer. Input feature maps 621 are output feature maps of previous layers. Input feature maps 621 which are represented by 4-dimensional matrix are first transformed to 2-dimensional representation using 4-dimensional to 2-dimensional 622 transformation procedure. Input matrix 623 is 2-dimensional representation of the input feature maps 621 . Learned dense kernels are also represented by 4-dimensional matrix are first transformed to 2-dimensional representation using 4-dimensional to 2-dimensional 625 transformation procedure. Kernel matrix 626 is 2-dimensional representation of the learned dense kernels 624. After both input feature maps 621 and learned dense kernels 624 are transformed in their respective 2-dimensional representation input matrix 623, and kernel matrix 626 GEMM (general matrix multiplication) 627 is performed to achieve the effect of the convolution operation. Output of GEMM is output feature maps 628. Output feature maps 628 are also represented by their 2-dimensional representation. It is evident from the block diagram that steps 622 and 625 which transform the 4-dimensional matrix representation input feature maps and 4-dimensional matrix representation of learned dense kernels to their respective 2-dimensional matrix representation contribute to overhead towards the total execution time of the algorithm. Although one can avoid processing of 625 by doing it before hand and keep the 2-dimensional matrix representation of learned dense kernels 626 ready before starting the inference phase, but still one cannot avoid the processing in step 622, as it changes each time the input image is changed.

Fig. 6C is a block diagram of an embodiment of the present invention in an inference phase of a CNN. This embodiment of the present invention utilizes a combination of preprocessing kernels 633 to produce a nonzero element list and sparse direct convolution 635. By performing the preprocessing and sparse direct convolution using the nonzero element list, it is possible to reduce the number of multiplications and additions significantly, resulting in increased computational efficiency. Input feature maps 631 (which may be an initial input to the CNN or output feature maps of a previous layer) are taken as input to the convolution layer. Input feature maps 631 may be represented by a 4-dimensional matrix. Learned sparse kernels 632 correspond to learned sparse kernels 322 of Fig. 3. In this embodiment, learned sparse kernels 632 are taken as an output of a training phase with a sparsity constraint 321 of Fig. 3. Here, preprocessing kernels 633 represents preprocessing kernels algorithm 323 in Fig. 3 and sparse direct convolution algorithm 635 represents sparse direct convolution algorithm 325 of Fig. 3.

A detailed flow of the preprocessing kernels algorithm 633 of this embodiment is explained here with reference to Fig. 7A. Input to the preprocessing kernels algorithm 323 is a set of learned sparse 5 kernels 322, and the output of the preprocessing kernels algorithms 323 is a nonzero element list 324. Processing starts from a fetch layer in the network configuration S711 . Then, S712 checks if there is a layer in the network configuration of CNN model to be processed next. If so, S713 then checks if the layer to be processed is a convolutional layer. If it is found that the layer to be processed is convolutional layer then path for YES is followed. An empty nonzero element list is initialized in S714, if none exists, for the convolution layer to be processed. Next, nonzero elements in learned sparse kernels are identified and appended to the nonzero element list in S716 and S717 respectively. Each entry of the list has a nonzero value and an index of the nonzero element which indicates a position of the value from the learned sparse kernels. More specifically, the index information is the index of the nonzero element along each dimension in a 4-dimensional matrix representation of the learned sparse kernels. After an entry is appended to the list in S717, the operation repeats (through S715) until all nonzero elements of the kernels for the convolution layer are contained in the list. When all the nonzero elements in the learned sparse kernels are processed, the flow returns to S711 where the process of creating nonzero element lists for each convolution layer of the CNN is repeated until there are no more layers to be processed in the CNN at S712. After all the layers are processed sequentially in the network configuration of CNN model, the nonzero element lists are output from the preprocessing algorithm.

Fig. 7B is a flowchart for an overview of the inference phase

325 of Fig. 3. Inference processing of this embodiment of the present invention starts by fetching a next layer in the network configuration at S721 . If the next layer exists at S722, the flow advances to S723 where it is checked whether the layer is a convolution layer or not. If it is found that the layer to be processed is a convolutional layer then path of YES is followed to S725. S725 then performs the processing of sparse direct convolutional algorithm

326 of Fig. 3 and 635 of Fig. 6C. The input to the sparse direct convolutional algorithm is input feature map from previous layers (or the initial input data to the CNN in the case that the convolution layer is the first layer of the CNN) and a nonzero element list S718 which is the output of the preprocessing algorithm using learned sparse kernels. If the next layer of S723 is not a convolutional layer, then it performs the layer specific tasks of that layer without any modification. This process repeats until all layers of the CNN are processed and S722 finds no more layers exist in the network configuration of CNN model at which point the flow advances to S726 where the output is reported.

Here, a more detailed operation of the sparse direct convolution using the nonzero element list of the embodiment of the present invention will be described with reference to Fig. 9. First, in S910, a next entry from the nonzero element list is fetched and the flow advances to S915. If the entry exists, the flow advances to S920 where the nonzero kernel value KV_nz at the index <K_nz, KC_nz, Kl_nz, KJ_nz> is extracted, where K_nz is the kernel index, KC_nz is the kernel channel index (i.e., red, blue, or green channel of a color image), Kl_nz is the height index, and KJ_nz is the width index. Next, from S925, for each output element <Ρκ_ηζ, Οκ_ηζ> of the output feature map for K_nz, the steps S935 to S950 are performed until all of the convolution arithmetic operations for the current fetched entry are completed and stored in the output maps, as described below.

In S935, the starting index <i, j> for the patch of the input feature map is found using

i = P«_nz ^* KerU and j = Οκ_ ηζ ^* KerV, where KerU is the height of the kernel and KerV is the width of the kernel. Then, in S940, the actual index <ii, jj> for multiplication in the input feature map patch is found using ii_nz= i + Ki_nz and jj_nz= j + Kj_nz. Thereafter, in S942, multiple values are loaded for the batch_size (number of input feature maps) starting from the value at index <0, KC_nz, ii_nz, jj_nz> in memory, where 0 is the first index for the input feature map, KC_nz is the channel index, ii_nz is the height index, and jj_nz is the width index. Here, the batch_size number of elements are loaded into a vector iv nz to be used in S945. In S945, scalar vector multiplication is performed, i.e., Kv_nz x iv_nz[0... batch_size] and stored in the output feature map at <0, K_nz, Ρκ_ηζ, Οκ_ηζ> in S950. If the output feature map at the index has an existing value from the previous steps, the new value may be added to the existing value to calculate a summation for all entries of the nonzero element list. After these values are added to the output feature maps, the flow returns to S910 to fetch the next entry in the nonzero element list. If there are no more elements in the nonzero element list, the output feature map is output from the sparse direct convolution process.

It should be noted that any memory addressing scheme may be used, the above method of this embodiment for retrieving values from memory is particularly preferable if the feature maps are stored in memory according to a CHWN addressing scheme such as that shown in Fig. 10C. The reason is that pixel values at the same location in all the input feature maps (or images) are stored in consecutive memory locations, which provides further efficiency in terms of loading values from memory. The CHWN order for storing elements in memory will be described in more detail later.

Fig. 8A is a comparative example of processing of the direct convolution algorithm 613 of Fig. 6A. The direct convolutional algorithm takes in two inputs, i.e. , a first input of input feature maps 811 (which are the input feature maps 611 of Fig. 6A) and a second input of learned dense kernels 612 of Fig. 6A. The three input feature maps 811 represent three color channels (red, green, and blue) in a 3-dimensional matrix of an input image. Three channels in learned dense kernels 812 represent one channel for each feature map of the input. Since the example Fig. 8A consists of only one kernel, only one output feature map is generated. Output feature map 814 consists of 4 values O0, O1 , 02, and 03. Calculation for each of these four values of is represented by four columns in 813. Calculation of OO, 01 , 02, and 03, takes 48 multiplications and 44 additions in total as shown in 815.

Fig. 8B is a comparative example of processing of convolution using a conventional GEMM algorithm 627 of Fig. 6B. Convolution using the GEMM algorithm takes in two inputs, i.e., a first input of input feature maps 821 (which are the input feature maps 621 of Fig. 6B) and a second input of learned dense kernels 624 of Fig. 6B. The convolution using GEMM requires transforming input feature map 821 (input feature maps 621 of Fig.6B) to an input matrix 823 (input matrix 623 of Fig. 6B) using a 4-dimenional matrix to 2-dimensional matrix 822. Similarly learned dense kernels 824 (learned dense kernels 624 of Fig.6B) are transformed to a kernel matrix 826 (kernel matrix 626 of Fig. 6B) using 4-dimensional matrix to 2-dimensional matrix 825. After input matrix 823 and kernel matrix 826 are generated, general matrix multiplication is performed. Here, the number of multiplications and additions for generating outputs O0, 01 , 02, and 03, require 48 multiplications and 44 additions in total as shown in 829.

Fig. 8C is an example of the embodiment of the present invention for processing convolution using sparse direct convolution algorithm 635 of Fig. 6C. The preprocessing kernel algorithm 833 (preprocessing kernel algorithm 633 of Fig. 6C) takes in learned sparse kernels 832 (learned sparse kernels 632 of Fig. 6C) as input and transforms them into nonzero element lists 834 (nonzero element lists 634 of Fig. 6C). The nonzero element lists 834 (nonzero element lists 634 of Fig. 6C) include one entry for each nonzero element present in the kernels. For this specific example it is shown that kernel elements KRO, KG2, and KB3 are nonzero elements, hence corresponding values and indexes (shown as <channel, height, width>) are stored in the nonzero element list 834. This nonzero element list 834 is taken as input by the sparse direct convolution algorithm 635 of Fig. 6C. Calculations made by the sparse direct convolution algorithm 635 of Fig. 6C are shown in 835. The sparse direct convolution algorithm 635 generates output as output feature maps. The number of multiplications and additions for generating output OO, 01 , 02, and 03, take 12 multiplications and 8 additions in total as shown in 837.

Here, NCHW and CHWN memory addressing schemes will be described with reference to FIGS. 10A, 10B, and 10C. As can be seen in FIGS. 10A, 10B, and 10C, N represents a data item which is an image in this case where N = 0 is Image 0 and N = 1 is Image 1 ; C represents the channels of the data item which in this case are Red, Green, and Blue; H represents the height (row); and W represents the width (column). As an example, the value B6 would have an index in which N = 1 , H = 1 , W = 0, and C = 2, wherein the NCHW index would be <1 , 2, 1 , 0> and for the CHWN scheme, the index would be <2, 1 , 0, 1 >. FIG. 10B shows the memory storage scheme for NCHW order where memory addresses increase from left to right. NCHW gives first priority to N then C then H and finally W. Therefore, as an example, the indexes from left to right would be <0, 0, 0, 0>, <0, 0, 0, 1 >, <0, 0, 1 , 0>, <0, 0, 1 , 1 > ... which results in the storage of the corresponding values of R0, R1 , R2, R3, ... as can be seen in FIG. 10B. On the other hand, CHWN gives first priority to C then H then W and finally N. Using the same example for the first four indexes of <0, 0, 0, 0>, <0, 0, 0, 1 >, <0, 0, 1 , 0>, <0, 0, 1 , 1 > the corresponding values would be R0, R4, R1 , R5 as can be seen in FIG. 10C. For the CHWN order, it should be noted that pixel values for the same CHW positions of the two images are side by side in memory. In the case of a large number of images, the same pixel positions for different images would be stored in consecutive addresses in memory. This is particularly advantageous for the embodiment of the present invention described above and leads to higher efficiency when loading a vector of values to be used in the convolution processes using the nonzero element list as previously mentioned.

Hereinabove, a preferred embodiment of the present invention has been described with reference to the figures. However, it should be apparent to those skilled in the art that the present invention is not limited to the above embodiment and may be modified without departing from the scope of the invention. Therefore, it should be understood that the present invention is not to be considered as being limited by the foregoing description and is only limited by the scope of the appended claims.

For example, in the embodiment of the present invention, the CNN model to which the concept of the present invention is applied is one which is commonly used for recognizing objects in images, i.e., AlexNet. However, those skilled in the art should easily recognize that the present invention may be applied to any CNN model for the evaluation of any type of input data and still achieve a reduction in convolution arithmetic operations, thereby improving the efficiency of processing.

Furthermore, in the above described embodiment, the preprocessing operations in which the nonzero element list is generated is performed as part of in the inference phase of the CNN processing. However, such preprocessing may also be performed at the final output stage of a training phase where the nonzero element list is the output of the training phase. This would be particularly useful if, for example, the training phase which may require a very large amount of computing power is performed using a large number of training data sets on supercomputer, and then the output nonzero element list is then provided to a less powerful device, such as a mobile device, in order to perform inference phases with less processing overhead. Reference Symbols List

computer system 100

processor 110

registers 111

cache subsystem 120

GPU subsystem 130

graphic output device 140

memory bridge 150

Input/Output subsystem 160

input devices 170

mouse 171

keyboard 172

memory subsystem 180

secondary storage 190 convolution neural network 200

input data 210

convolution layer 220, 230, 240, 250, 260

fully connected layer 270, 280

activation layer 271 , 281

Softmax 290

training phase 311

learned dense kernel 312, 322, 612, 624, 626, 632, 824, 832 preprocessing kernels algorithm 323, 633, 833

nonzero element list 324, 634, 834

inference phase 325

sparse direct convolutional algorithm 326, 635

input feature map 410, 510, 611 , 621 , 631 , 811 , 821 kernel 520

output feature map 530, 614, 628, 814

direct convolutional algorithm 613, 635

input matrix 623, 823

kernel matrix 626, 826

GEMM algorithm 627

sparse direct convolution 635

2-dimensional matrix 822, 825

Claims

1 . A computer-implemented information processing method for an inference phase of a convolution neural network, the method comprising steps of:

generating a list of non-zero elements from a learned sparse kernel to be used for a convolution layer of the convolution neural network;

when performing convolution on an input feature map, loading only elements of the input feature map which correspond to the non-zero elements of the generated list; and

performing convolution arithmetic operations using the loaded elements of the input data map and the non-zero elements of the list, thereby reducing the number of operations necessary to generate an output feature map of the convolution layer.

2. The method of Claim 1 , wherein

the input feature map is a 4-tensor; and

elements of the input feature map are stored in memory in CHWN order.

3. A non-transitory computer readable storage medium storing instructions which cause a computer to perform an information processing method for an inference phase of a convolution neural network, the method comprising the steps of:

4. The non-transitory computer readable storage medium of Claim 3, wherein

the input feature map is a 4-tensor; and

elements of the input feature map are stored in memory in CHWN order.