WO2022191730A1 - Accelerated execution of convolution operation by convolutional neural network

Info

Publication number: WO2022191730A1
Authority: WO (WIPO (PCT))
Prior art keywords: processor, convolutional, convolutional filters, groups, filter
Application number: PCT/RU2021/000100
Other languages: French (fr)
Inventors: Yury Alexandrovich PARFENOV, Vladimir Mikhailovich KRYZHANOVSKIY, Stanislav Yuryevich KAMENEV, Alexander Alexandrovich ZURUEV
Original Assignee: Huawei Technologies Co., Ltd
Application filed by: Huawei Technologies Co., Ltd
Priority applications: EP21720868.5A (EP4295276A1), PCT/RU2021/000100 (WO2022191730A1), CN202180095369.3A (CN116997911A)
Publication: WO2022191730A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Definitions

  • a comprehensive collection of computation-intensive functions (in particular, for the convolution operation) optimized for the Mali GPUs is available in the form of the ARM Compute Library. These functions are available to application developers, for example, through the Android Neural Networks API (NNAPI). With respect to the convolution operation, the ARM Compute Library involves dealing with 2D input/output data of weights and activations and performing a series of 2D convolutions.
  • FIG. 1 shows a flowchart of a method 100 for performing a series of 2D convolution operations in accordance with the ARM Compute Library. More specifically, the method 100 describes the execution of the 2D convolution operations for one thread of a Mali GPU.
  • the method 100 starts with a step S102, in which the Mali GPU loads one subarray of weights of a 3 x 3 convolutional filter of a CNN and one subarray of input activations. Both the subarray of weights and the subarray of input activations are schematically shown as rectangular parallelepipeds in FIG. 1.
  • the convolutional filter is usually divided into filtering channels, and the loaded subarray of weights of the convolutional filter corresponds to one of these filtering channels in this case.
  • the method 100 proceeds to a step S104, in which the Mali GPU performs a 2D convolution operation based on the loaded subarray of weights and the loaded subarray of the input activations.
  • the method 100 goes on to a step S106, in which the Mali GPU accumulates the result of the 2D convolution operation to its registers.
  • the method 100 proceeds to a step S108, in which the Mali GPU loads a subsequent subarray of weights of the convolutional filter (e.g., by moving a sliding window to a next filtering channel).
  • the Mali GPU repeats the steps S102-S108 c − 1 times, i.e. until all subarrays of weights are loaded from all c filtering channels of the convolutional filter.
  • the method 100 ends with a step S110, in which the Mali GPU stores the accumulated results of the 2D convolution operations as a subarray of output activations to a system memory, such as RAM.
  • the method 100 may be repeated by the Mali GPU for other threads, i.e. for other convolutional filters of the CNN and other subarrays of input activations, resulting in other subarrays of output activations. These subarrays of output activations are then combined into a final array of output activations.
  • FIG. 2 shows an exemplary visual representation of a data storage format 200 for the weights of the convolutional filters used in the method 100.
  • this format is referred to as a KCRS format, where K is the output number of channels of each 2D convolution operation, C is the input number of channels of each 2D convolution operation, R is the filter height, and S is the filter width.
  • There are four convolutional filters shown in FIG. 2, for which reason K = 4.
  • the numbers shown in the filtering channels correspond to memory addresses.
  • weights of the next convolutional filter are not used until all weights of the previous convolutional filter are loaded (a minimal addressing sketch is given below).
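  • as a minimal illustration of this KCRS addressing (a sketch written for this description, with the function name and a 4 × 3 × 3 × 3 shape as assumptions; not code from the patent), the linear memory offset of a weight can be computed as follows:

    # Linear offset of weight (k, c, r, s) in a contiguous KCRS buffer,
    # where S varies fastest and K varies slowest.
    def kcrs_offset(k: int, c: int, r: int, s: int, C: int, R: int, S: int) -> int:
        return ((k * C + c) * R + r) * S + s

    # Example in the spirit of FIG. 2: four filters (K = 4), assuming three
    # 3 x 3 filtering channels each. All 27 weights of one filter occupy
    # addresses before any weight of the next filter.
    assert kcrs_offset(0, 0, 0, 0, C=3, R=3, S=3) == 0   # first weight of filter 0
    assert kcrs_offset(0, 2, 2, 2, C=3, R=3, S=3) == 26  # last weight of filter 0
    assert kcrs_offset(1, 0, 0, 0, C=3, R=3, S=3) == 27  # first weight of filter 1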
  • FIGs. 3A-3C explain how the input activations are converted to the output activations by using each 2D convolution operation in accordance with the method 100.
  • each part of the input activations is loaded by the Mali GPU row by row, and each loaded row (colored in light grey in FIGs. 3A-3C) may be used from 1 to 3 times (for the case of the 3 x 3 convolutional filter) whenever it is needed to calculate rows of the output activations.
  • each 2D convolution operation is presented as a series of 1D convolution operations performed by a fused multiply-add (FMA) operation.
  • FIGs. 4A-4C explain how to use the FMA operation in the method 100. In particular, to convolve a 6-component row-vector with a filter of 3 weights, it is required to divide the 6-component row-vector into 3 sub-vectors by using a sliding window. These sub-vectors are denoted as v_{1:4}, v_{2:5}, and v_{3:6}.
  • the sub-vector v_{1:4} is multiplied by the weight w_1 (see FIG. 4A), the sub-vector v_{2:5} is multiplied by the weight w_2 (see FIG. 4B), and the sub-vector v_{3:6} is multiplied by the weight w_3 (see FIG. 4C), the three products being accumulated into the output row, as in the sketch below.
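  • the following minimal numpy sketch (variable names and numeric values are assumptions for this illustration, not code from the patent) reproduces this 1D convolution as three FMA operations and checks the result against a reference:

    import numpy as np

    v = np.arange(1.0, 7.0)            # 6-component input row-vector v1..v6
    w = np.array([0.5, -1.0, 2.0])     # filter weights w1, w2, w3

    out = np.zeros(4)                  # 4-component output row
    out += w[0] * v[0:4]               # FMA with sub-vector v1:4 (FIG. 4A)
    out += w[1] * v[1:5]               # FMA with sub-vector v2:5 (FIG. 4B)
    out += w[2] * v[2:6]               # FMA with sub-vector v3:6 (FIG. 4C)

    # Reference: 'valid' cross-correlation of v with w.
    assert np.allclose(out, np.convolve(v, w[::-1], mode="valid"))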
  • the ARM Compute Library involves using the 2D convolution operations, in each of which each loaded row of the subarray of the input activations may be used from 1 to 3 times (depending on the row used) for the 3 x 3 convolutional filter.
  • This small number of reuses implies that the Mali GPU needs to perform multiple memory reads during each 2D convolution operation, thereby causing a delay in its execution. For this reason, it may be impossible to use the CNNs on the Mali GPU when solving time-sensitive computational tasks, thereby limiting the application of the CNNs.
  • the exemplary embodiments disclosed herein provide a technical solution that allows mitigating or even eliminating the above-mentioned drawbacks of the prior art.
  • the technical solution disclosed herein involves dividing convolutional filters of a CNN into groups, for each of which individual parts (e.g., individual rows of filtering channels or the filtering channels themselves) of the different convolutional filters are then stored physically close to each other in a system memory. For example, for each group, one part of a first convolutional filter is followed by one part of a second convolutional filter which is then followed by one part of a third convolutional filter, and so on.
  • the more convolutional filters each group comprises, the more reuses of input activations are possible during a convolution operation, which allows accelerating its execution.
  • the number of the convolutional filters in each group is limited by a number of registers of a processor (e.g., a GPU) to be used for performing the convolution operation.
  • FIG. 5 shows a block diagram of an apparatus 500 for preparing convolutional filters of a CNN for a convolution operation in accordance with one exemplary embodiment.
  • the apparatus 500 may be part of a user equipment (UE) or implemented as an individual apparatus which may be accessed by the UE via a wireless or wired connection.
  • the UE may refer to a mobile device, a mobile station, a terminal, a subscriber unit, a mobile phone, a cellular phone, a smart phone, a cordless phone, a personal digital assistant (PDA), a wireless communication device, a desktop computer, a laptop computer, a tablet computer, a single-board computer (SBC) (e.g., a Raspberry Pi device), a gaming device, a netbook, a smartbook, an ultrabook, a medical device or medical equipment, a biometric sensor, a wearable device (e.g., a smart watch, smart glasses, a smart wrist band, etc.), an entertainment device (e.g., an audio player, a video player, etc.), a vehicular component or sensor (e.g., a driver-assistance system), a smart meter/sensor, an unmanned vehicle (e.g., an industrial robot, a quadcopter, etc.) and its component (e.g., a self-driving car computer), industrial
  • the apparatus 500 comprises a processor 502 and a data storage 504.
  • the data storage 504 stores (trained) weights initially arranged in the form of convolutional filters 506 of the CNN.
  • Each of the convolutional filters 506 has a filter length that is defined as a number of the weights therein.
  • the data storage 504 further stores processor-executable instructions 508 which, when executed by the processor 502, cause the processor 502 to store the convolutional filters 506 of the CNN in a certain data storage format, as will be described further in more detail. It should be noted that the number, arrangement and interconnection of the constructive elements constituting the apparatus 500, which are shown in FIG. 5, are not intended to be any limitation of the present disclosure, but are merely used to provide a general idea of how the constructive elements may be implemented within the apparatus 500.
  • the apparatus 500 may further comprise a transceiver configured to perform data reception and transmission for different purposes.
  • a transceiver may be implemented as two individual devices, with one for a receiving operation and another for a transmitting operation.
  • the transceiver is intended to be capable of performing different operations required to perform the data reception and transmission, such, for example, as signal modulation/demodulation, encoding/decoding, etc.
  • the processor 502 may be implemented as a CPU, general-purpose processor, single purpose processor, GPU, microcontroller, microprocessor, application specific integrated circuit (ASIC), field programmable gate array (FPGA), digital signal processor (DSP), complex programmable logic device, etc. It should be also noted that the processor 502 may be implemented as any combination of one or more of the aforesaid. As an example, the processor 502 may be a combination of two or more microprocessors.
  • the data storage 504 may be implemented as a classical nonvolatile or volatile memory used in the modern electronic computing machines.
  • the nonvolatile memory may include Read-Only Memory (ROM), ferroelectric Random-Access Memory (RAM), Programmable ROM (PROM), Electrically Erasable PROM (EEPROM), solid state drive (SSD), flash memory, magnetic disk storage (such as hard drives and magnetic tapes), optical disc storage (such as CD, DVD and Blu-ray discs), etc.
  • as for the volatile memory, examples thereof include Dynamic RAM, Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Static RAM, etc.
  • the processor-executable instructions 508 stored in the data storage 504 may be configured as a computer-executable code which causes the processor 502 to perform the aspects of the present disclosure.
  • the computer-executable code for carrying out operations or steps for the aspects of the present disclosure may be written in any combination of one or more programming languages, such as Java, C++, or the like.
  • the computer-executable code may be in the form of a high-level language or in a pre-compiled form and be generated by an interpreter (also pre-stored in the data storage 504) on the fly.
  • FIG. 6 shows a flowchart of a method 600 for preparing the convolutional filters 506 of the CNN for the convolution operation in accordance with one exemplary embodiment.
  • the method 600 describes the operation of the apparatus 500.
  • the method 600 starts with a step S602, in which the convolutional filters 506 are provided to the processor 502. Given the configuration of the apparatus 500, said providing implies that the processor 502 reads the convolutional filters 506 from the data storage 504. Then, the method 600 proceeds to a step S604, in which the processor 502 divides the convolutional filters 506 into N groups, where N ≥ 1. Each of the N groups comprises a number n of the convolutional filters 506, where n > 1.
  • each of the n convolutional filters 506 in each of the N groups is provided with a filter index.
  • the method 600 goes on to a step S606, in which the processor 502 additionally divides, for each of the N groups, each of the n convolutional filters 506 into m weight vectors.
  • Each of the m weight vectors has a length g less than the filter length and is provided with a vector index.
  • the processor 502 then stores the N groups one-by-one in the data storage 504 such that the weight vectors of each group are arranged as an array W = {v_{i,j}}, where v_{i,j} is the weight vector with the vector index j in the convolutional filter 506 with the filter index i in the group.
  • the vector index j changes by 1 incrementally whenever the filter index i returns to 1, the filter index i changes by 1 incrementally every g weights, and the filter index i returns to 1 after the filter index i reaches n. It should be noted that the method 600 does not change the weights of the convolutional filters 506 but merely changes their memory addresses in the data storage 504.
  • FIG. 7 shows an exemplary visual representation of a data storage format 700 for the weights of the convolutional filters 506, as provided by the method 600.
  • FIG. 7 shows four convolutional filters divided equally into two groups, i.e. each group comprises two of the four convolutional filters.
  • Each of the four convolutional filters comprises three filtering channels, and each of the filtering channels comprises three rows and three columns.
  • the convolutional filters may be configured differently, if required and depending on particular applications.
  • the same is true for the shown number of the convolutional filters and the shown number of the groups, i.e. these numbers may be changed, if required and depending on particular applications.
  • the numbers shown in FIG. 7 correspond to the memory addresses.
  • the first g weights of the first filtering channel of the second convolutional filter in the first group are loaded after the first g weights of the first filtering channel of the first convolutional filter in the first group are loaded.
  • the second g weights of the first filtering channel of the second convolutional filter in the first group are loaded after the second g weights of the first filtering channel of the first convolutional filter in the first group are loaded, and so on.
  • the length g of each weight vector obtained in the step S606 of the method 600 is set to a row length (i.e. the number of the weights in each row of the filtering channel).
  • the data storage format 700 thus differs fundamentally from the data storage format 200 typically used in the ARM Compute Library.
  • the present authors have found that the data storage format 700 may allow one to alleviate the data sparseness problem and, thus, increase the utilization efficiency of a processor that should execute the convolution operation itself by using the convolutional filters 506 of the CNN. More specifically, it may allow increasing the number of reuses of loaded input data (or, in other words, input activations) during the convolution operation without having to load the weights from sparse memory addresses, thereby accelerating its execution.
  • the length g of each weight vector obtained in the step S606 of the method 600 may be set in a variety of ways, depending on particular applications. In one embodiment, it may be equal to a part of the sum of the channel lengths of the filtering channels constituting the convolutional filter 506. In another embodiment, it may be less than the channel length of one of the filtering channels constituting the convolutional filter 506 (i.e. g may be equal to one or two rows of the filtering channel, for example). Furthermore, the number n of the convolutional filters 506 in each of the N groups may be selected based on a number of registers of a processor to be used for performing the convolution operation. By so doing, it is possible to adapt the method 600 for different types of processors (e.g., GPUs); a minimal sketch of the resulting repacking is given after the next paragraph.
  • if the weights are initially arranged in the form of a number k of the convolutional filters and k is indivisible by n, the method 600 may comprise a further step before the step S604, in which the processor 502 adds at least one zero-filled convolutional filter to the k convolutional filters to make k divisible by n.
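  • a minimal numpy sketch of this repacking is given below (written for this description; the function name, the shapes and the choice g = S, i.e. one channel row as in FIG. 7, are assumptions, not the patent's reference code):

    import numpy as np

    # Minimal sketch of the repacking performed by the method 600 for
    # weights initially stored in KCRS order (K filters, C channels,
    # R rows, S columns), with g fixed to one channel row (g = S).
    def repack_weights(w_kcrs: np.ndarray, n: int) -> np.ndarray:
        K, C, R, S = w_kcrs.shape
        assert K % n == 0, "add zero-filled filters first (see above)"
        # View each filter as m = C * R weight vectors of length g = S and
        # split the K filters into N = K // n groups of n filters each.
        groups = w_kcrs.reshape(K // n, n, C * R, S)
        # Store vector j of every filter in a group before vector j + 1,
        # i.e. flatten in the index order (group, j, i, g) instead of the
        # KCRS order (group, i, j, g).
        return groups.transpose(0, 2, 1, 3).reshape(-1)

    # Example matching FIG. 7: four 3-channel 3 x 3 filters, groups of n = 2.
    w = np.arange(4 * 3 * 3 * 3, dtype=np.float32).reshape(4, 3, 3, 3)
    packed = repack_weights(w, n=2)
    assert np.array_equal(packed[0:3], w[0, 0, 0])  # row 1 of filter 1
    assert np.array_equal(packed[3:6], w[1, 0, 0])  # row 1 of filter 2
    assert np.array_equal(packed[6:9], w[0, 0, 1])  # row 2 of filter 1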
  • FIG. 8 shows a block diagram of a computing apparatus 800 for performing a convolution operation in accordance with one exemplary embodiment. Similar to the apparatus 500, the apparatus 800 may be part of a UE or implemented as an individual apparatus which may be accessed by the UE via a wireless or wired connection. Moreover, the apparatuses 500 and 800 may be integrated into the same UE, if required. As shown in FIG. 8, the apparatus 800 comprises a processor 802 and a data storage 804. Unlike the processor 502, the processor 802 should comprise registers 806 which are used in the convolution operation.
  • the data storage 804 stores the convolutional filters 506 of the CNN which are prepared by the apparatus 500 in accordance with the method 600.
  • the data storage 804 further stores processor-executable instructions 808 which, when executed by the processor 802, cause the processor 802 to perform the convolution operation, as will be described further in more detail.
  • the data storage 804 further stores an input data array 810 to be convoluted.
  • the input data array 810 comprises input data subarrays.
  • the number, arrangement and interconnection of the constructive elements constituting the apparatus 800, which are shown in FIG. 8, are not intended to be any limitation of the present disclosure, but merely used to provide a general idea of how the constructive elements may be implemented within the apparatus 800.
  • the processor 802 may be replaced with several processors, as well as the data storage 804 may be replaced with several removable and/or fixed storage devices, depending on particular applications.
  • the apparatus 800 may further comprise a transceiver configured to perform data reception and transmission for different purposes.
  • a transceiver may be implemented as two individual devices, with one for a receiving operation and another for a transmitting operation.
  • the transceiver is intended to be capable of performing different operations required to perform the data reception and transmission, such, for example, as signal modulation/demodulation, encoding/decoding, etc.
  • as for the processor 802, the data storage 804 and the processor-executable instructions 808, they may be implemented in a similar manner as the processor 502, the data storage 504 and the processor-executable instructions 508, respectively.
  • possible implementations of the processor 802 should allow for the presence of the registers 806.
  • FIG. 9 shows a flowchart of a method 900 for performing the convolution operation by using the convolutional filters 506 of the CNN in accordance with one exemplary embodiment.
  • the method 900 describes the operation of the apparatus 800.
  • the method 900 starts with a step S902, in which the processor 802 loads, to the registers 806, one of the input data subarrays, and proceeds to a step S904, in which the processor 802 loads, to the registers 806, a subarray of at least n weight vectors from the array W = {v_{i,j}} of one of the N groups of the convolutional filters 506. The subarray of the at least n weight vectors may be loaded sequentially or all at once, depending on particular applications. Moreover, the steps S902 and S904 may be performed in parallel, if required.
  • the method 900 goes on to a step S906, in which the processor 802 obtains intermediate vectors for the loaded input data subarray by using a sliding window having a length t. The length t of the sliding window may be selected based on the type of the processor 802.
  • the processor 802 produces partial output data of size n × t by multiplying the intermediate vectors by the loaded subarray of the at least n weight vectors in a step S908, and accumulates the partial output data to the registers 806 in a step S910.
  • the processor 802 repeats the steps S902-S914 for the rest of the input data subarrays and the rest of the N groups of the convolutional filters, thereby forming an output data array (or the whole set of output activations) from the stored output data subarrays in the data storage 804.
  • the output data subarrays may be formed in parallel (e.g., in case of a GPU or multicore processor, one thread may perform the steps S902-S914 to produce some part of the output data, while the other thread may perform the same steps to produce some other part of the output data).
  • the arrows shown in FIG. 10 denote the corresponding steps of the method 900.
  • by using these eight weight vectors, it is possible for the processor 802 to use each loaded input data subarray (or, in other words, each loaded subarray of input activations) 8 times, thereby performing 24 scalar-vector multiplications in the step S908; a single-thread sketch of this pattern is given below.
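  • the following single-thread numpy sketch illustrates this accumulation pattern; the sizes n = 8, g = 3, t = 4 and the variable names are assumptions chosen to reproduce the 24 multiplications of the example above, not the patent's reference code:

    import numpy as np

    # Single-thread sketch of the steps S902-S910.
    n, g, t = 8, 3, 4
    rng = np.random.default_rng(0)

    x = rng.standard_normal(t + g - 1)   # loaded input data subarray (S902)
    w = rng.standard_normal((n, g))      # subarray of n weight vectors (S904)
    acc = np.zeros((n, t))               # registers accumulating n x t outputs

    # Each scalar weight w[i, k] multiplies a t-long sub-vector of x (an
    # FMA), so the loaded input is reused n = 8 times and n * g = 24
    # scalar-vector multiplications are performed (S906-S910).
    for i in range(n):
        for k in range(g):
            acc[i] += w[i, k] * x[k:k + t]

    # Check: acc[i, p] is the dot product of weight vector i with the
    # sliding window x[p:p + g].
    ref = np.array([[w[i] @ x[p:p + g] for p in range(t)] for i in range(n)])
    assert np.allclose(acc, ref)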
  • the data storage format 700 obtained by the method 600 provides a larger number of reuses of the loaded input data subarray (without having to load the weights from sparse memory addresses) in the method 900 compared to the data storage format 200 used in the ARM Compute Library.
  • FIGs. 11A and 11B show comparative bar charts of speedup coefficients obtained by using the method 900 and a benchmark method for performing a convolution operation based on the ARM Compute Library for different sizes of an input data array and convolutional filters.
  • the benchmark method was performed on the following System-on-a-Chip: Huawei Kirin 980 (GPU: Mali-G76 MP10). It should be noted that the method 900 and the benchmark method were performed for aligned and unaligned memory store operations. An unaligned memory store operation is implemented when data with a size of N bytes are stored to a memory address that is not evenly divisible by N. If the memory address is evenly divisible by N, an aligned memory store operation is implemented.
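  • for example, storing a 16-byte vector to the memory address 0x1010 (which is evenly divisible by 16) is an aligned store, whereas storing it to the address 0x1008 is an unaligned store.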
  • each speedup coefficient is defined as a ratio between a convolution execution time t_ARM achieved by using the benchmark method and a convolution execution time t_900 achieved by using the method 900.
  • the sizes of the input data array and the convolutional filters are shown in the form of the following string: “WxHxCxF”, where W is the width of the input data array, H is the height of the input data array, C is the number of channels of the input data array, and F is the number of the convolutional filters.
  • the comparative bar charts are obtained at constant spatial dimensions W and H, i.e. 1920x1080, while changing only parameters C and F.
  • the speedup coefficient is always more than 1 for all sizes of the input data array, thereby meaning that t_900 < t_ARM.
  • the maximum speedup coefficient is 1.34.
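  • for example, the maximum speedup coefficient of 1.34 means that t_900 = t_ARM / 1.34 ≈ 0.75 · t_ARM, i.e. the method 900 takes about 25% less time than the benchmark method for that size.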
  • the comparative bar charts are obtained at different parameters W, H, C, and F.
  • both the method 900 and the benchmark method are inefficient for the input data array with small W, H, and F (see, in particular, the speedup coefficient at 60x24x2048x8). Therefore, there is little sense in comparing the method 900 and the benchmark method at small W, H, and F.
  • each step or operation of the methods 600 or 900, or any combinations of the steps or operations can be implemented by various means, such as hardware, firmware, and/or software.
  • one or more of the steps or operations described above can be embodied by processor-executable instructions, data structures, program modules, and other suitable data representations.
  • the executable instructions which embody the steps or operations described above can be stored on a corresponding data carrier and executed by the processor 502 and the processor 802, respectively.
  • this data carrier can be implemented as any computer-readable storage medium configured to be readable by the processor 502 and the processor 802 to execute the processor-executable instructions.
  • Such computer-readable storage media can include both volatile and nonvolatile media, removable and non-removable media.
  • the computer-readable media comprise media implemented in any method or technology suitable for storing information.
  • the practical examples of the computer-readable media include, but are not limited to, information-delivery media, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile discs (DVD), holographic media or other optical disc storage, magnetic tape, magnetic cassettes, magnetic disk storage, and other magnetic storage devices.


Abstract

The present disclosure relates to a technique for accelerating the execution of a convolution operation by a convolutional neural network (CNN). Specifically, convolutional filters of the CNN are divided into groups, for each of which individual parts of the different convolutional filters are stored physically close to each other in a system memory. That is, for each group, one part of the first convolutional filter is followed by one part of the second convolutional filter which is then followed by one part of the third convolutional filter, and so on. The more convolutional filters each group comprises, the more reuses of input activations are possible during the convolution operation (without having to load the weights from sparse memory addresses), thereby accelerating its execution. The number of the convolutional filters in each group is generally limited by the number of registers of a processor which will perform the convolution operation.

Description

ACCELERATED EXECUTION OF CONVOLUTION OPERATION BY CONVOLUTIONAL NEURAL NETWORK
TECHNICAL FIELD
The present disclosure relates generally to the field of data processing, and particularly to an apparatus and method for preparing convolutional filters of a convolutional neural network (CNN) for a convolution operation, an apparatus and method for performing the convolution operation by using the prepared convolutional filters of the CNN, as well as computer program products embodying the method steps in the form of computer codes.
BACKGROUND
Deep neural networks (DNNs) are widely used in various fields of human activity. Examples of their applications include object detection and recognition, image analysis, and data classification. A special type of the DNN that deals with image processing extremely well is a CNN. The main constructive blocks of the CNN are convolutional layers each comprising a set of convolutional filters. Each convolutional filter is configured as a combination of small weight matrices (usually 3 × 3 weight matrices). The convolutional layers are the most computation-intensive part of the CNN, for which reason they are also the most time-consuming and power-consuming layers of the CNN. Thus, techniques for reducing the computational costs and/or memory requirements of the convolutional layers are desired.
It is quite popular to use graphics processing units (GPUs) to accelerate the operation (e.g., training) of the convolutional neural network. Out of the existing GPUs, the Mali family of GPUs designed by ARM is of particular interest. It offers application programming interface (API) support for Open Computing Language (OpenCL), Open Graphics Library (OpenGL), DirectX, and Google RenderScript. The Mali GPUs are characterized in that they do not have a local memory. In other words, the Mali GPUs do not deal with a fast-shared memory region that can be shared among a group of GPU threads.
The implementation of the convolution operation in the form of the ARM Compute Library is delivered with the existing mobile devices comprising the Mali GPUs. The ARM Compute Library involves using a series of 2D convolutions, in each of which input activations loaded from Random Access Memory (RAM) to registers of the Mali GPU are used from 1 to 3 times in case of 3 x 3 convolutional filters. This small number of reuses may cause the Mali GPU to perform a large number of memory load operations during the convolution operation, thereby delaying the execution of the convolution operation. In turn, this may not allow the CNNs to be used in some computational tasks, thereby limiting the application of the CNNs.
SUMMARY
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features of the present disclosure, nor is it intended to be used to limit the scope of the present disclosure.
It is an objective of the present disclosure to provide a technical solution that enables accelerated execution of convolutional operations of CNNs.
The objective above is achieved by the features of the independent claims in the appended claims. Further embodiments and examples are apparent from the dependent claims, the detailed description and the accompanying drawings.
According to a first aspect, an apparatus for preparing convolutional filters of a CNN is provided. The apparatus comprises a data storage and at least one processor. The data storage is configured to store weights and processor-executable instructions. The weights are initially arranged in the form of the convolutional filters of the CNN. Each of the convolutional filters has a filter length that is defined as a number of the weights in the convolutional filter. When executed by the at least one processor, the processor-executable instructions cause the at least one processor to operate as follows. At first, the at least one processor divides the convolutional filters into N groups, where N ≥ 1. Each of the N groups comprises a number n of the convolutional filters, where n > 1. Each of the n convolutional filters in each of the N groups is provided with a filter index. For each of the N groups, the at least one processor then additionally divides each of the n convolutional filters into m weight vectors. Each of the m weight vectors has a length g less than the filter length and is provided with a vector index. After that, the at least one processor stores the N groups one-by-one in the data storage such that the m × n weight vectors of each of the N groups are arranged as an array W = {v_{i,j}}, where i = 1, 2, 3, ..., n and j = 1, 2, 3, ..., m, and where v_{i,j} is the weight vector with the vector index j in the convolutional filter with the filter index i in the group. The vector index j changes by 1 incrementally whenever the filter index i returns to 1, the filter index i changes by 1 incrementally every g weights, and the filter index i returns to 1 after the filter index i reaches n. By storing the weights of the convolutional filters of the CNN in this data storage format, i.e. as the array W = {v_{i,j}}, it is possible to alleviate the so-called data sparseness problem and, thus, increase the utilization efficiency of a processor (e.g., the Mali GPU) that should execute a convolution operation itself. By using this data storage format, it is possible to increase the number of reuses of loaded input data (or, in other words, input activations) in the convolution operation without having to load the weights from sparse memory addresses, thereby reducing computation time as well as power consumption. As a result, the list of computational tasks to be solved by using the CNN may be expanded irrespective of the limited hardware capabilities of the existing mobile devices (e.g., the mobile devices with the Mali GPUs).
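For example, for n = 2 filters per group and m = 3 weight vectors per filter, the weight vectors of one group are stored in the order v_{1,1}, v_{2,1}, v_{1,2}, v_{2,2}, v_{1,3}, v_{2,3}: the filter index i cycles through the whole group before the vector index j advances.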
In one embodiment of the first aspect, each of the convolutional filters comprises an equal number of filtering channels, each of the filtering channels has a channel length that is defined as a number of the weights in the filtering channel, and the filter length is defined as a sum of the channel lengths of the filtering channels in each of the convolutional filters. In this embodiment, the length g of each of the m weight vectors is equal to a part of the sum of the channel lengths. By selecting g in this way, it is possible to change the number of the weight vectors in each convolutional filter, if required and depending on particular applications, thereby making the apparatus according to the first aspect more flexible in use.
In one embodiment of the first aspect, the length g of each of the m weight vectors is less than the channel length. By selecting g in this way, it is possible to change the number of the weight vectors in each filtering channel of the convolutional filter, if required and depending on particular applications, thereby making the apparatus according to the first aspect more flexible in use.
In one embodiment of the first aspect, the number n of the convolutional filters in each of the N groups is selected based on a number of registers of at least one processor to be used for performing the convolution operation. This may allow one to adapt the data storage format of the weights of the convolutional filters for different types of processors (e.g., the Mali GPUs).
In one embodiment of the first aspect, if the weights are initially arranged in the form of a number k of the convolutional filters in the data storage and k is indivisible by n, the at least one processor is further configured, before dividing the convolutional filters into the N groups, to add at least one zero-filled convolutional filter to the k convolutional filters to make k divisible by n. By so doing, it is possible to ensure the same number of the convolutional filters in each of the N groups, irrespective of the initial number of the stored convolutional filters.
According to a second aspect, a computing apparatus for performing a convolution operation is provided. The apparatus comprises a data storage and at least one processor. The data storage is configured to store an input data array to be convoluted, the N groups of the convolutional filters which are prepared by the apparatus according to the first aspect, and processor-executable instructions. The input data array comprises input data subarrays. The at least one processor comprises registers. When executed by the at least one processor, the processor-executable instructions cause the at least one processor to operate as follows: a) load, to the registers, one of the input data subarrays; b) load, to the registers, a subarray of at least n weight vectors from the array W = {v_{i,j}} of one of the N groups of the convolutional filters; c) for the loaded input data subarray, obtain intermediate vectors by using a sliding window having a length t; d) produce partial output data of size n × t by multiplying the intermediate vectors by the loaded subarray of the at least n weight vectors; e) accumulate the partial output data to the registers; f) repeat operations a)-e) for other one or more of the input data subarrays with one or more subsequent subarrays of at least n weight vectors until all subarrays of at least n weight vectors are loaded from the array W = {v_{i,j}} of said one of the N groups of the convolutional filters; g) store all the accumulated partial output data as an output data subarray to the data storage; and h) repeat operations a)-g) for the rest of the input data subarrays and the rest of the N groups of the convolutional filters, thereby forming an output data array from the stored output data subarrays in the data storage.
By using the weights thus stored, i.e. as the array W = {v_{i,j}}, it is possible to alleviate the so-called data sparseness problem and, thus, increase the utilization efficiency of the at least one processor included in the apparatus according to the second aspect. By using this data storage format, it is possible to increase the number of reuses of loaded input data (or, in other words, input activations) during the convolution operation without having to load the weights from sparse memory addresses, thereby reducing computation time as well as power consumption by the apparatus according to the second aspect. As a result, the list of computational tasks to be solved by using the CNN may be expanded.
In one embodiment of the second aspect, the at least one processor is configured to load, to the registers, the subarray of the at least n weight vectors in operation b) sequentially or all at once. This may make the apparatus according to the second aspect more flexible in use.

In one embodiment of the second aspect, the at least one processor is configured to perform operations a) and b) in parallel. This may reduce the time required to perform these operations.
In one embodiment of the second aspect, the length t of the sliding window is selected based on a type of the at least one processor performing operations a)-h). By so doing, it is possible to properly select the best length t for a specific processor, thereby increasing the execution efficiency of the convolution operation itself.
In one embodiment of the second aspect, the at least one processor is implemented as at least one GPU. By using the GPU(s), it is possible to increase the execution efficiency of the convolution operation.
In one embodiment of the second aspect, the input data array comprises the overlapping input data subarrays. This means that the apparatus according to the second aspect is not “sensitive” to the initial segmentation of the input data array, which makes it more flexible in use.
According to a third aspect, a method for preparing convolutional filters of a CNN for a convolution operation is provided. The method starts with the step of providing weights initially arranged in the form of the convolutional filters of the CNN. Each of the convolutional filters has a filter length that is defined as a number of the weights in the convolutional filter. Then, the method proceeds to the step of dividing the convolutional filters into N groups, where N ≥ 1. Each of the N groups comprises a number n of the convolutional filters, where n > 1. Each of the n convolutional filters in each of the N groups is provided with a filter index. Next, the method goes on to the step of additionally dividing, for each of the N groups, each of the n convolutional filters into m weight vectors. Each of the m weight vectors has a length g less than the filter length and is provided with a vector index. After that, the method proceeds to the step of storing the N groups one-by-one such that the m × n weight vectors in each of the N groups are arranged as an array W = {v_{i,j}}, where i = 1, 2, 3, ..., n and j = 1, 2, 3, ..., m, and where v_{i,j} is the weight vector with the vector index j in the convolutional filter with the filter index i in the group. The vector index j changes by 1 incrementally whenever the filter index i returns to 1, the filter index i changes by 1 incrementally every g weights, and the filter index i returns to 1 after the filter index i reaches n. By storing the weights of the convolutional filters of the CNN in this data storage format, i.e. as the array W = {v_{i,j}}, it is possible to alleviate the so-called data sparseness problem and, thus, increase the utilization efficiency of a processor that should execute a convolution operation itself. By using this data storage format, it is possible to increase the number of reuses of loaded input data (or, in other words, input activations) in the convolution operation without having to load the weights from sparse memory addresses, thereby reducing computation time as well as power consumption. As a result, the list of computational tasks to be solved by the CNN may be expanded irrespective of the limited hardware capabilities of the existing mobile devices (e.g., the mobile devices with the Mali GPUs).
According to a fourth aspect, a method for performing a convolution operation by using the convolutional filters prepared using the method according to the third aspect is provided. The method according to the fourth aspect comprises the following steps: a) providing an input data array to be convoluted, the input data array comprising input data subarrays; b) loading, to registers of at least one processor, one of the input data subarrays; c) loading, to the registers of the at least one processor, a subarray of at least n weight vectors from the array W = {v_{i,j}} of one of the N groups of the convolutional filters; d) for the loaded input data subarray, obtaining intermediate vectors by using a sliding window having a length t; e) producing partial output data of size n × t by multiplying the intermediate vectors by the loaded subarray of the at least n weight vectors; f) accumulating the partial output data to the registers of the at least one processor; g) repeating steps b)-f) for other one or more of the input data subarrays with one or more subsequent subarrays of at least n weight vectors until all subarrays of at least n weight vectors are loaded from the array W = {v_{i,j}} of said one of the N groups of the convolutional filters; h) storing all the accumulated partial output data as an output data subarray to a data storage; and i) repeating operations b)-h) for the rest of the input data subarrays and the rest of the N groups of the convolutional filters, thereby forming an output data array from the stored output data subarrays in the data storage.
By using the weights thus stored, i.e. as the array W = {v_{i,j}}, it is possible to alleviate the so-called data sparseness problem and, thus, increase the utilization efficiency of the at least one processor that should execute the convolution operation. By using this data storage format, it is possible to increase the number of reuses of loaded input data (or, in other words, input activations) during the convolution operation without having to load the weights from sparse memory addresses, thereby reducing computation time as well as power consumption by the at least one processor. As a result, the list of computational tasks to be solved by the CNN may be expanded.
According to a fifth aspect, a computer program product is provided. The computer program product comprises a computer-readable storage medium storing a computer code which, when executed by at least one processor, causes the at least one processor to perform the method according to the third aspect. By using such a computer program product, it is possible to simplify the implementation of the method according to the third aspect in any computing apparatus, like the apparatus according to the first aspect.
According to a sixth aspect, a computer program product is provided. The computer program product comprises a computer-readable storage medium storing a computer code which, when executed by at least one processor, causes the at least one processor to perform the method according to the fourth aspect. By using such a computer program product, it is possible to simplify the implementation of the method according to the fourth aspect in any computing apparatus, like the apparatus according to the second aspect.
Other features and advantages of the present disclosure will be apparent upon reading the following detailed description and reviewing the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
The present disclosure is explained below with reference to the accompanying drawings in which:
FIG. 1 shows a flowchart of a method for performing a series of 2D convolution operations in accordance with the ARM Compute Library;
FIG. 2 shows an exemplary visual representation of a data storage format for weights of convolutional filters used in the method shown in FIG. 1;
FIGs. 3A-3C explain how input activations are converted to output activations in each 2D convolution operation in accordance with the method shown in FIG. 1;
FIGs. 4A-4C explain how to use a fused multiply-add (FMA) operation in the method shown in FIG. 1;
FIG. 5 shows a block diagram of an apparatus for preparing convolutional filters of a CNN for a convolution operation in accordance with one exemplary embodiment;
FIG. 6 shows a flowchart of a method for preparing the convolutional filters of the CNN for the convolution operation in accordance with one exemplary embodiment;
FIG. 7 shows an exemplary visual representation of a data storage format for the weights of the convolutional filters, as provided by the method shown in FIG. 6;
FIG. 8 shows a block diagram of a computing apparatus for performing a convolution operation in accordance with one exemplary embodiment;
FIG. 9 shows a flowchart of a method for performing the convolution operation by using the convolutional filters obtained by the method shown in FIG. 6, in accordance with one exemplary embodiment;
FIG. 10 shows one example of how the method shown in FIG. 9 is performed with respect to one thread of a processor included in the apparatus shown in FIG. 8 by using one group of convolutional filters prepared in accordance with the method shown in FIG. 6;
FIGs. 11A and 11B show comparative bar charts of speedup coefficients obtained by using the method shown in FIG. 9 and a benchmark method for performing a convolution operation based on the ARM Compute Library for different sizes of an input data array and convolutional filters.
DETAILED DESCRIPTION
Various embodiments of the present disclosure are further described in more detail with reference to the accompanying drawings. However, the present disclosure may be embodied in many other forms and should not be construed as limited to any certain structure or function discussed in the following description. In contrast, these embodiments are provided to make the description of the present disclosure detailed and complete.
According to the detailed description, it will be apparent to the ones skilled in the art that the scope of the present disclosure encompasses any embodiment thereof, which is disclosed herein, irrespective of whether this embodiment is implemented independently or in concert with any other embodiment of the present disclosure. For example, the apparatuses and methods disclosed herein may be implemented in practice by using any numbers of the embodiments provided herein. Furthermore, it should be understood that any embodiment of the present disclosure may be implemented using one or more of the features presented in the appended claims.
The word “exemplary” is used herein in the meaning of “used as an illustration”. Unless otherwise stated, any embodiment described herein as “exemplary” should not be construed as preferable or having an advantage over other embodiments.

As used in the embodiments disclosed herein, a convolutional neural network (CNN) may refer to a trained multilayer neural network architecture in which one or more convolution operations are implemented for various purposes. In particular, the CNN may be used to solve computer vision tasks (e.g., image classification, classification with localization, object detection, super resolution, joint demosaicing and denoising, noise reduction, image enhancement, etc.), as well as speech and audio processing tasks (e.g., text-to-speech, speech-to-text, etc.). The main constructive blocks of the CNN are convolutional layers. The convolutional layers are strong feature extractors in which convolutional filters are configured to retrieve features of the input data (e.g., image data or a time series) to be processed. The input data may be also referred to as input activations (correspondingly, output data resulting from the convolution operation may be also referred to as output activations). The convolutional filters are typically represented as a combination of small weight matrices (usually 3 x 3 weight matrices) which are “slid” over the input data. In this sense, the convolution operation may be considered as the result of respective matrix multiplication operations between the convolutional filters and the input data. In the CNN, the convolutional layers are not fully connected, but their output is usually passed to one or more fully connected layers that make a final decision (e.g., a classification decision). Since the convolutional layers are known as the most computation-intensive part of the CNN, they are also the most time-consuming and power-consuming layers of the CNN. Given this, the execution speed of the convolution operation is mainly determined by these layers.
To accelerate the execution of the convolution operation by the CNNs, graphics processing units (GPUs) are usually used, among which the Mali GPUs designed by ARM are of particular interest. The Mali GPUs do not have a local memory. In other words, during operation, they do not rely on a fast shared memory that can be shared among a group of GPU threads.
It should be also noted that a comprehensive collection of computation-intensive functions (in particular, for the convolution operation) optimized for the Mali GPUs is available in the form of the ARM Compute Library. These functions are available for application developers, for example, through the Android Neural Networks API (NNAPI). Relative to the convolution operation, the ARM Compute Library involves dealing with 2D input/output data of weights and activations and performing a series of 2D convolutions.
FIG. 1 shows a flowchart of a method 100 for performing a series of 2D convolution operations in accordance with the ARM Compute Library. More specifically, the method 100 describes the execution of the 2D convolution operations for one thread of a Mali GPU. The method 100 starts with a step S102, in which the Mali GPU loads one subarray of weights of a 3 x 3 convolutional filter of a CNN and one subarray of input activations. Both the subarray of weights and the subarray of input activations are schematically shown as rectangular parallelepipeds in FIG. 1. The convolutional filter is usually divided into filtering channels, and the loaded subarray of weights of the convolutional filter corresponds to one of the filtering channels in this case. It should be also noted that the sizes of the convolutional filter and the loaded subarray of input data are given for illustrative purposes only. Then, the method 100 proceeds to a step S104, in which the Mali GPU performs a 2D convolution operation based on the loaded subarray of weights and the loaded subarray of the input activations. Next, the method 100 goes on to a step S106, in which the Mali GPU accumulates the result of the 2D convolution operation to its registers. After that, the method 100 proceeds to a step S108, in which the Mali GPU loads a subsequent subarray of weights of the convolutional filter (e.g., by moving a sliding window to a next filtering channel). The Mali GPU repeats the steps S102-S108 c - 1 times, i.e. until all subarrays of weights are loaded from all c filtering channels of the convolutional filter. The method 100 ends with a step S110, in which the Mali GPU stores the accumulated results of the 2D convolution operations as a subarray of output activations to a system memory, such, for example, as RAM. The method 100 may be repeated by the Mali GPU for other threads, i.e. for other convolutional filters of the CNN and other subarrays of input activations, resulting in other subarrays of output activations. These subarrays of output activations are then combined into a final array of output activations.
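To make the baseline concrete, the per-thread flow of the method 100 can be sketched in Python (a simplification under assumed array shapes, not the actual ARM Compute Library code):

```python
import numpy as np

def conv2d_valid(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Plain 'valid' 2D cross-correlation of one channel with one kernel."""
    H, W = x.shape
    R, S = w.shape
    out = np.zeros((H - R + 1, W - S + 1))
    for r in range(R):
        for s in range(S):
            out += w[r, s] * x[r:r + H - R + 1, s:s + W - S + 1]
    return out

def thread_method_100(weights: np.ndarray, activations: np.ndarray) -> np.ndarray:
    """Sketch of method 100 for one GPU thread (shapes are illustrative).

    weights: (c, 3, 3) per-channel subarrays of one convolutional filter.
    activations: (c, H, W) input activation subarrays, one per channel.
    """
    c = weights.shape[0]
    acc = 0.0                                 # stands in for accumulator registers
    for ch in range(c):                       # S108: move to the next filtering channel
        acc = acc + conv2d_valid(activations[ch], weights[ch])  # S102-S106
    return acc                                # S110: stored as output activations
```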
FIG. 2 shows an exemplary visual representation of a data storage format 200 for the weights of the convolutional filters used in the method 100. In particular, this format is referred to as a KCRS format, where K is the output number of channels of each 2D convolution operation, C is the input number of channels of each 2D convolution operation, R is the filter height, and S is the filter width. There are four convolutional filters shown in FIG. 2, for which reason K = 4. Each of the four filters comprises three filtering channels, for which reason C = 3. Each of the filtering channels comprises three rows and three columns, for which reason R = 3 and S = 3. The numbers shown in the filtering channels correspond to memory addresses. As follows from FIG. 2, weights of a next convolutional filter are not used until all weights of a previous convolutional filter are loaded.
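Assuming 0-based indices, the address pattern of the KCRS format corresponds to the following linear offset (a sketch for illustration, not code from the library):

```python
def kcrs_offset(k: int, c: int, r: int, s: int,
                C: int = 3, R: int = 3, S: int = 3) -> int:
    # KCRS layout: all weights of filter k precede any weight of filter k + 1,
    # and within a filter all weights of channel c precede those of channel c + 1.
    return ((k * C + c) * R + r) * S + s
```

For the filters of FIG. 2 (C = 3, R = S = 3), kcrs_offset(1, 0, 0, 0) = 27, i.e. the second filter starts only after all 27 weights of the first one.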
FIGs. 3A-3C explain how the input activations are converted to the output activations by using each 2D convolution operation in accordance with the method 100. As can be seen, each part of the input activations is loaded by the Mali GPU row by row, and each loaded row (colored in light grey in FIGs. 3A-3C) may be used from 1 to 3 times (for the case of the 3 x 3 convolutional filter) whenever it is needed to calculate rows of the output activations. Thus, each 2D convolution operation is presented as a series of 1D convolution operations performed by a fused multiply-add (FMA) operation. FIGs. 4A-4C explain how to use the FMA operation in the method 100. In particular, FIGs. 4A-4C show how a 6-component row-vector v = (v1, v2, v3, v4, v5, v6) of the input activations is convoluted to a 4-component row-vector of the output activations by using one row-vector w = (w1, w2, w3) of weights of the 3 x 3 convolutional filter. To make this convolution possible, it is required to divide the 6-component row-vector into 3 sub-vectors by using a sliding window. These sub-vectors are denoted as v1:4, v2:5, and v3:6. Then, the sub-vector v1:4 is multiplied by the weight w1 (see FIG. 4A), the sub-vector v2:5 is multiplied by the weight w2 (see FIG. 4B), and the sub-vector v3:6 is multiplied by the weight w3 (see FIG. 4C). The results of these multiplications are accumulated as the 4-component row-vector of the output activations. In FIGs. 4A-4C, the symbol “+=” denotes this accumulation.
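These three FMA steps can be reproduced with a short sketch (the numeric weight and activation values are illustrative only):

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # 6-component input row-vector
w = np.array([0.5, -1.0, 2.0])                # one weight row of the filter

out = np.zeros(4)          # 4-component output row-vector (the accumulator)
out += v[0:4] * w[0]       # FIG. 4A: sub-vector v1:4 multiplied by w1
out += v[1:5] * w[1]       # FIG. 4B: sub-vector v2:5 multiplied by w2
out += v[2:6] * w[2]       # FIG. 4C: sub-vector v3:6 multiplied by w3

# The result matches a 'valid' 1D cross-correlation of v with w:
assert np.allclose(out, np.correlate(v, w, mode="valid"))
```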
As noted above, the ARM Compute Library involves using the 2D convolution operations, in each of which each loaded row of the subarray of the input activations may be used from 1 to 3 times (depending on the row used) for the 3 x 3 convolutional filter. This small number of reuses implies that the Mali GPU needs to perform multiple memory reads during each 2D convolution operation, thereby causing a delay in its execution. For this reason, it may be impossible to use the CNNs on the Mali GPU when solving time-sensitive computational tasks, thereby limiting the application of the CNNs.
The exemplary embodiments disclosed herein provide a technical solution that allows mitigating or even eliminating the above-mentioned drawbacks peculiar to the prior art. In particular, the technical solution disclosed herein involves dividing convolutional filters of a CNN into groups, for each of which individual parts (e.g., individual rows of filtering channels or the filtering channels themselves) of the different convolutional filters are then stored physically close to each other in a system memory. For example, for each group, one part of a first convolutional filter is followed by one part of a second convolutional filter which is then followed by one part of a third convolutional filter, and so on. The more convolutional filters each group comprises, the more reuses of input activations are possible during a convolution operation, which allows accelerating its execution. In general, the number of the convolutional filters in each group is limited by a number of registers of a processor (e.g., a GPU) to be used for performing the convolution operation.
FIG. 5 shows a block diagram of an apparatus 500 for preparing convolutional filters of a CNN for a convolution operation in accordance with one exemplary embodiment. The apparatus 500 may be part of a user equipment (UE) or implemented as an individual apparatus which may be accessed by the UE via a wireless or wired connection. The UE may refer to a mobile device, a mobile station, a terminal, a subscriber unit, a mobile phone, a cellular phone, a smart phone, a cordless phone, a personal digital assistant (PDA), a wireless communication device, a desktop computer, a laptop computer, a tablet computer, a single-board computer (SBC) (e.g., a Raspberry Pi device), a gaming device, a netbook, a smartbook, an ultrabook, a medical device or medical equipment, a biometric sensor, a wearable device (e.g., a smart watch, smart glasses, a smart wrist band, etc.), an entertainment device (e.g., an audio player, a video player, etc.), a vehicular component or sensor (e.g., a driver-assistance system), a smart meter/sensor, an unmanned vehicle (e.g., an industrial robot, a quadcopter, etc.) and its component (e.g., a self-driving car computer), industrial manufacturing equipment, a global positioning system (GPS) device, an Internet-of-Things (IoT) device, an Industrial IoT (IIoT) device, a machine-type communication (MTC) device, a group of Massive IoT (MIoT) or Massive MTC (mMTC) devices/sensors, or any other suitable device configured to support wireless communications. In some embodiments, the UE may refer to at least two collocated and inter-connected UEs thus defined.
As shown in FIG. 5, the apparatus 500 comprises a processor 502 and a data storage 504. The data storage 504 stores (trained) weights initially arranged in the form of convolutional filters 506 of the CNN. Each of the convolutional filters 506 has a filter length that is defined as a number of the weights therein. The data storage 504 further stores processor-executable instructions 508 which, when executed by the processor 502, cause the processor 502 to store the convolutional filters 506 of the CNN in a certain data storage format, as will be described further in more detail. It should be noted that the number, arrangement and interconnection of the constructive elements constituting the apparatus 500, which are shown in FIG. 5, are not intended to be any limitation of the present disclosure, but merely used to provide a general idea of how the constructive elements may be implemented within the apparatus 500. For example, the processor 502 may be replaced with several processors, as well as the data storage 504 may be replaced with several removable and/or fixed storage devices, depending on particular applications. Furthermore, being implemented individually, the apparatus 500 may further comprise a transceiver configured to perform data reception and transmission for different purposes. In some embodiments, such a transceiver may be implemented as two individual devices, with one for a receiving operation and another for a transmitting operation. Irrespective of its implementation, the transceiver is intended to be capable of performing different operations required to perform the data reception and transmission, such, for example, as signal modulation/demodulation, encoding/decoding, etc.
The processor 502 may be implemented as a CPU, general-purpose processor, single purpose processor, GPU, microcontroller, microprocessor, application specific integrated circuit (ASIC), field programmable gate array (FPGA), digital signal processor (DSP), complex programmable logic device, etc. It should be also noted that the processor 502 may be implemented as any combination of one or more of the aforesaid. As an example, the processor 502 may be a combination of two or more microprocessors.
The data storage 504 may be implemented as a classical nonvolatile or volatile memory used in the modern electronic computing machines. As an example, the nonvolatile memory may include Read-Only Memory (ROM), ferroelectric Random-Access Memory (RAM), Programmable ROM (PROM), Electrically Erasable PROM (EEPROM), solid state drive (SSD), flash memory, magnetic disk storage (such as hard drives and magnetic tapes), optical disc storage (such as CD, DVD and Blu-ray discs), etc. As for the volatile memory, examples thereof include Dynamic RAM, Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Static RAM, etc.
The processor-executable instructions 508 stored in the data storage 504 may be configured as a computer-executable code which causes the processor 502 to perform the aspects of the present disclosure. The computer-executable code for carrying out operations or steps for the aspects of the present disclosure may be written in any combination of one or more programming languages, such as Java, C++, or the like. In some examples, the computer-executable code may be in the form of a high-level language or in a pre-compiled form and be generated by an interpreter (also pre-stored in the data storage 504) on the fly.
FIG. 6 shows a flowchart of a method 600 for preparing the convolutional filters 506 of the CNN for the convolution operation in accordance with one exemplary embodiment. In other words, the method 600 describes the operation of the apparatus 500. The method 600 starts with a step S602, in which the convolutional filters 506 are provided to the processor 502. Given the configuration of the apparatus 500, said providing implies that the processor 502 reads the convolutional filters 506 from the data storage 504. Then, the method 600 proceeds to a step S604, in which the processor 502 divides the convolutional filters 506 into N groups, where N ≥ 1. Each of the N groups comprises a number n of the convolutional filters 506, where n > 1. Each of the n convolutional filters 506 in each of the N groups is provided with a filter index. Next, the method 600 goes on to a step S606, in which the processor 502 additionally divides, for each of the N groups, each of the n convolutional filters 506 into m weight vectors. Each of the m weight vectors has a length g less than the filter length and is provided with a vector index. After that, the method 600 proceeds to a step S608, in which the processor 502 stores, to the data storage 504, the N groups one-by-one such that the m * n weight vectors in each of the N groups are arranged as an array W = {vij}, where i = 1, 2, 3, ..., n and j = 1, 2, 3, ..., m, and where vij is the weight vector with the vector index j in the convolutional filter 506 with the filter index i in the group. The vector index j changes by 1 incrementally whenever the filter index i returns to 1, the filter index i changes by 1 incrementally every g weights, and the filter index i returns to 1 after the filter index i reaches n. It should be noted that the method 600 does not change the weights of the convolutional filters 506 but merely changes their memory addresses in the data storage 504.
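A minimal NumPy sketch of the steps S604-S608 under simplifying assumptions (each filter is flattened to one dimension, k is divisible by n, the filter length is divisible by g, and all names are illustrative):

```python
import numpy as np

def repack_weights(filters: np.ndarray, n: int, g: int) -> np.ndarray:
    """Rearrange k flattened filters of shape (k, filter_length) into the
    grouped data storage format, returning all N = k // n arrays W = {vij}
    concatenated one-by-one."""
    k, filter_length = filters.shape
    m = filter_length // g                     # S606: m weight vectors of length g
    groups = filters.reshape(k // n, n, m, g)  # S604: N groups of n filters each
    # S608: within each group, store weight vector j of every filter before
    # any weight vector j + 1, i.e. swap the filter and vector axes.
    return groups.transpose(0, 2, 1, 3).reshape(-1)
```

For the example of FIG. 7 discussed below (k = 4, n = 2, g = 3), this sketch reproduces the interleaved memory addresses shown there.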
FIG. 7 shows an exemplary visual representation of a data storage format 700 for the weights of the convolutional filters 506, as provided by the method 600. In particular, FIG. 7 shows four convolutional filters divided equally in two groups, i.e. each group comprises two of the four convolutional filters. Each of the four convolutional filters comprises three filtering channels, and each of the filtering channels comprises three rows and three columns. It should be apparent to those skilled in the art that such configuration of the convolutional filters is for illustrative purposes only, and the convolutional filters may be configured differently, if required and depending on particular applications. The same is true for the shown number of the convolutional filters and the shown number of the groups, i.e. these numbers may be changed, if required and depending on particular applications. The numbers shown in FIG. 7 correspond to the memory addresses. As follows from FIG. 7, the first g weights of the first filtering channel of the second convolutional filter in the first group are loaded after the first g weights of the first filtering channel of the first convolutional filter in the first group are loaded. Similarly, the second g weights of the first filtering channel of the second convolutional filter in the first group are loaded after the second g weights of the first filtering channel of the first convolutional filter in the first group are loaded, and so on. In other words, the length g of each weight vector obtained in the step S606 of the method 600 is set to a row length (i.e. the number of the weights in each row of the filtering channel). At the same time, the next (second) group is not involved until the previous group is fully used, i.e. until all the weights of the first group are loaded. Thus, the data storage format 700 fully differs from the data storage format 200 typically used in the ARM Compute Library. The present authors have found that the data storage format 700 may allow one to alleviate the data sparseness problem and, thus, increase the utilization efficiency of a processor that should execute the convolution operation itself by using the convolutional filters 506 of the CNN. More specifically, it may allow increasing the number of reuses of loaded input data (or, in other words, input activations) during the convolution operation without having to load the weights from sparse memory addresses, thereby accelerating its execution.
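Put differently, assuming 0-based indices i and j, the offset of the weight vector vij relative to the start of its group under the data storage format 700 would be:

```python
def group_offset(i: int, j: int, n: int, g: int) -> int:
    # Weight vector j of every filter in the group precedes any weight
    # vector j + 1; within one j, vectors follow the filter index i.
    return (j * n + i) * g
```

With n = 2 and g = 3 as in FIG. 7, the first row of the second filter (i = 1, j = 0) starts at offset 3, directly after the first row of the first filter.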
It should be noted that the length g of each weight vector obtained in the step S606 of the method 600 may be set in a variety of ways, depending on particular applications. In one embodiment, it may be equal to a part of the sum of the channel lengths of the filtering channels constituting the convolutional filter 506. In another embodiment, it may be less than the channel length of one of the filtering channels constituting the convolutional filter 506 (i.e. g may be equal to one or two rows of the filtering channel, for example). Furthermore, the number n of the convolutional filters 506 in each of the N groups may be selected based on a number of registers of a processor to be used for performing the convolution operation. By so doing, it is possible to adapt the method 600 for different types of processors (e.g., GPUs).
In one embodiment, if the weights are initially arranged in the form of a number k of the convolutional filters 506 in the data storage 504 and k is indivisible by n, the method 600 may comprise a further step before the step S604, in which the processor 502 adds at least one zero-filled convolutional filter to the k convolutional filters to make k divisible by n. By so doing, it is possible to ensure the same number of the convolutional filters in each of the N groups, irrespective of the initial number of the stored convolutional filters.
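A short sketch of this optional padding step, under the same flattened-filter assumption as the repacking sketch above:

```python
import numpy as np

def pad_filters(filters: np.ndarray, n: int) -> np.ndarray:
    """Append zero-filled filters so that the filter count becomes divisible by n."""
    k, filter_length = filters.shape
    pad = (-k) % n                    # how many zero-filled filters to add
    if pad:
        filters = np.vstack([filters, np.zeros((pad, filter_length))])
    return filters
```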
FIG. 8 shows a block diagram of a computing apparatus 800 for performing a convolution operation in accordance with one exemplary embodiment. Similar to the apparatus 500, the apparatus 800 may be part of a UE or implemented as an individual apparatus which may be accessed by the UE via a wireless or wired connection. Moreover, the apparatuses 500 and 800 may be integrated into the same UE, if required. As shown in FIG. 8, the apparatus 800 comprises a processor 802 and a data storage 804. Unlike the processor 502, the processor 802 should comprise registers 806 which are used in the convolution operation. The data storage 804 stores the convolutional filters 506 of the CNN which are prepared by the apparatus 500 in accordance with the method 600. The data storage 804 further stores processor-executable instructions 808 which, when executed by the processor 802, cause the processor 802 to perform the convolution operation, as will be described further in more detail. The data storage 804 further stores an input data array 810 to be convoluted. The input data array 810 comprises input data subarrays. It should be again noted that the number, arrangement and interconnection of the constructive elements constituting the apparatus 800, which are shown in FIG. 8, are not intended to be any limitation of the present disclosure, but merely used to provide a general idea of how the constructive elements may be implemented within the apparatus 800. For example, the processor 802 may be replaced with several processors, as well as the data storage 804 may be replaced with several removable and/or fixed storage devices, depending on particular applications. Furthermore, being implemented individually, the apparatus 800 may further comprise a transceiver configured to perform data reception and transmission for different purposes. In some embodiments, such a transceiver may be implemented as two individual devices, with one for a receiving operation and another for a transmitting operation. Irrespective of its implementation, the transceiver is intended to be capable of performing different operations required to perform the data reception and transmission, such, for example, as signal modulation/demodulation, encoding/decoding, etc. As for the processor 802, the data storage 804 and the processor-executable instructions 808, they may be implemented in a similar manner as the processor 502, the data storage 504 and the processor-executable instructions 508, respectively. In the meantime, possible implementations of the processor 802 should allow for the presence of the registers 806.
FIG. 9 shows a flowchart of a method 900 for performing the convolution operation by using the convolutional filters 506 of the CNN in accordance with one exemplary embodiment. In other words, the method 900 describes the operation of the apparatus 800. The method 900 starts with a step S902, in which the processor 802 loads, to the registers 806, one of the input data subarrays of the input data array 810. Then, the method 900 proceeds to a step S904, in which the processor 802 loads, to the registers 806, a subarray of at least n weight vectors from the array W = {vij} of one of the N groups of the convolutional filters 506. The subarray of the at least n weight vectors may be loaded sequentially or all at once, depending on particular applications. Moreover, the steps S902 and S904 may be performed in parallel, if required. Next, the method 900 goes on to a step S906, in which the processor 802 obtains intermediate vectors for the loaded input data subarray by using a sliding window having a length t. The length t of the sliding window may be selected based on the type of the processor 802. After that, the processor 802 produces partial output data of size n x t by multiplying the intermediate vectors by the loaded subarray of the at least n weight vectors in a step S908, and accumulates the partial output data to the registers 806 in a step S910. A next step S912 involves repeating the steps S902-S910 for other one or more of the input data subarrays with one or more subsequent subarrays of at least n weight vectors until all subarrays of at least n weight vectors are loaded from the array W = {vij} of said one of the N groups of the convolutional filters. Then, the method 900 proceeds to a step S914, in which the processor 802 stores all the accumulated partial output data as an output data subarray to the data storage 804. Further, in a step S916, the processor 802 repeats the steps S902-S914 for the rest of the input data subarrays and the rest of the N groups of the convolutional filters, thereby forming an output data array (or the whole set of output activations) from the stored output data subarrays in the data storage 804. It should be noted that, if required, the output data subarrays may be formed in parallel (e.g., in case of a GPU or multicore processor, one thread may perform the steps S902-S914 to produce some part of the output data, while the other thread may perform the same steps to produce some other part of the output data).
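A sequential Python sketch of the steps S902-S914 for one group of the convolutional filters (a simplification in which the registers 806 are modeled by a NumPy accumulator, the input data subarrays are one-dimensional, and all names are illustrative):

```python
import numpy as np

def sliding_window(x: np.ndarray, g: int, t: int) -> np.ndarray:
    """S906: t intermediate vectors of length g taken from the loaded subarray."""
    return np.stack([x[k:k + g] for k in range(t)])      # shape (t, g)

def convolve_one_group(subarrays, w_group: np.ndarray, t: int) -> np.ndarray:
    """Steps S902-S914 for one of the N groups of convolutional filters.

    subarrays: sequence of m loaded input data subarrays (1D here).
    w_group: the group's weight vectors of shape (m, n, g), ordered as in W.
    """
    m, n, g = w_group.shape
    acc = np.zeros((n, t))                       # stands in for the registers 806
    for x, w_sub in zip(subarrays, w_group):     # S912: until all m subarrays
        inter = sliding_window(x, g, t)          # S906  of weight vectors are used
        acc += w_sub @ inter.T                   # S908-S910: n x t partial sums
    return acc                                   # S914: the output data subarray
```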
FIG. 10 shows one example of how the method 900 is performed with respect to one thread of the processor 802 of the apparatus 800 by using one of the groups of convolutional filters prepared in accordance with the method 600. More specifically, the group of convolutional filters is shown to have eight (n = 8) convolutional filters. Each of the eight convolutional filters comprises an equal number of filtering channels each comprising three rows and three columns. The length g of each weight vector in each of the eight convolutional filters is intended to be equal to one row, as schematically shown by using grey-colored rectangular parallelepipeds in the eight convolutional filters. In other words, g is equal to 3. The arrows shown in FIG. 10 denote the corresponding steps of the method 900. During the convolution operation executed by the processor 802 in accordance with the method 900, the processor 802 loads, each time in the step S904, the subarray of n = 8 weight vectors to the registers 806, i.e. g * n = 3 * 8 = 24 weights in total. By using these eight weight vectors, it is possible for the processor 802 to use each loaded input data subarray (or, in other words, each loaded subarray of input activations) 8 times, thereby performing 24 scalar-vector multiplications in the step S908. Thus, the data storage format 700 obtained by the method 600 provides a larger number of reuses of the loaded input data subarray (without having to load the weights from sparse memory addresses) in the method 900 compared to the data storage format 200 used in the ARM Compute Library.
FIGs. 11A and 11B show comparative bar charts of speedup coefficients obtained by using the method 900 and a benchmark method for performing a convolution operation based on the ARM Compute Library for different sizes of an input data array and convolutional filters. The benchmark method was performed on the following System-on-a-Chip: Huawei Kirin 980 (GPU: Mali-G76 MP10). It should be noted that the method 900 and the benchmark method were performed for aligned and unaligned memory store operations. An unaligned memory store operation is implemented when data with a size of N bytes are stored to a memory address that is not evenly divisible by N. If the memory address is evenly divisible by N, an aligned memory store operation is implemented. N is selected as the most efficient size for accessing memory on a specific processor. Referring back to FIGs. 11A and 11 B, each speedup coefficient is defined as a ratio between a convolution execution time tARM achieved by using the benchmark method and a convolution execution time t900 achieved by using the method 900. The sizes of the input data array and the convolutional filters are shown in the form of the following string: “WxHxCxF”, where W is the width of the input data array, H is the height of the input data array, C is the number of channels of the input data array, and F is the number of the convolutional filters.
In FIG. 11A, the comparative bar charts are obtained at constant spatial dimensions W and H, i.e. 1920x1080, while changing only parameters C and F. As can be seen, the speedup coefficient is always more than 1 for all sizes of the input data array, thereby meaning that t900 < tARM. The maximum speedup coefficient is 1.34. In FIG. 11B, the comparative bar charts are obtained at different parameters W, H, C, and F. As follows from FIG. 11B, both the method 900 and the benchmark method are inefficient for the input data array with small W, H, and F (see, in particular, the speedup coefficient at 60x24x2048x8). Therefore, there is little sense in comparing the method 900 and the benchmark method at small W, H, and F.
It should be noted that each step or operation of the methods 600 or 900, or any combinations of the steps or operations, can be implemented by various means, such as hardware, firmware, and/or software. As an example, one or more of the steps or operations described above can be embodied by processor executable instructions, data structures, program modules, and other suitable data representations. Furthermore, the executable instructions which embody the steps or operations described above can be stored on a corresponding data carrier and executed by the processor 502 and the processor 802, respectively. This data carrier can be implemented as any computer-readable storage medium configured to be readable by the processor 502 and the processor 802 to execute the processor executable instructions. Such computer-readable storage media can include both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, the computer-readable media comprise media implemented in any method or technology suitable for storing information. In more detail, the practical examples of the computer- readable media include, but are not limited to information-delivery media, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile discs (DVD), holographic media or other optical disc storage, magnetic tape, magnetic cassettes, magnetic disk storage, and other magnetic storage devices.
Although the exemplary embodiments of the present disclosure are described herein, it should be noted that various changes and modifications could be made in the embodiments of the present disclosure, without departing from the scope of legal protection which is defined by the appended claims. In the appended claims, the word “comprising” does not exclude other elements or operations, and the indefinite article “a” or “an” does not exclude a plurality. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims

1. An apparatus for preparing convolutional filters of a convolutional neural network (CNN) for a convolution operation, comprising: a data storage configured to store: weights initially arranged in the form of the convolutional filters of the CNN, each of the convolutional filters having a filter length that is defined as a number of the weights in the convolutional filter; and processor-executable instructions; and at least one processor coupled to the data storage and configured, when executing the processor-executable instructions, to: divide the convolutional filters into N groups, where N ≥ 1, each of the N groups comprising a number n of the convolutional filters, where n > 1, and each of the n convolutional filters in each of the N groups being provided with a filter index; for each of the N groups, additionally divide each of the n convolutional filters into m weight vectors, each of the m weight vectors having a length g less than the filter length and being provided with a vector index; and store the N groups one-by-one in the data storage such that the m * n weight vectors of each of the N groups are arranged as an array W = {vij}, where i = 1, 2, 3, ..., n and j = 1, 2, 3, ..., m, and where vij is the weight vector with the vector index j in the convolutional filter with the filter index i in the group, the vector index j changing by 1 incrementally whenever the filter index i returns to 1, the filter index i changing by 1 incrementally every g weights, and the filter index i returning to 1 after the filter index i reaches n.
2. The apparatus of claim 1, wherein each of the convolutional filters comprises an equal number of filtering channels, each of the filtering channels having a channel length that is defined as a number of the weights in the filtering channel, and the filter length being defined as a sum of the channel lengths of the filtering channels in each of the convolutional filters, and wherein the length g of each of the m weight vectors is equal to a part of the sum of the channel lengths.
3. The apparatus of claim 2, wherein the length g of each of the m weight vectors is less than the channel length.
4. The apparatus of any one of claims 1 to 3, wherein the number n of the convolutional filters in each of the N groups is selected based on a number of registers of at least one processor to be used for performing the convolution operation.
5. The apparatus of any one of claims 1 to 4, wherein, if the weights are initially arranged in the form of a number k of the convolutional filters in the data storage and k is indivisible by n, the at least one processor is further configured, before dividing the convolutional filters into the N groups, to add at least one zero-filled convolutional filter to the k convolutional filters to make k divisible by n.
6. A computing apparatus for performing a convolution operation, comprising: a data storage configured to store: an input data array to be convoluted, the input data array comprising input data subarrays; the N groups of the convolutional filters which are prepared by the apparatus according to any one of claims 1 to 5; and processor-executable instructions; and at least one processor coupled to the data storage and comprising registers, wherein the at least one processor is configured, when executing the processor-executable instructions, to: a) load, to the registers, one of the input data subarrays; b) load, to the registers, a subarray of at least n weight vectors from the array W = {vij} of one of the N groups of the convolutional filters; c) for the loaded input data subarray, obtain intermediate vectors by using a sliding window having a length t; d) produce partial output data of size n x t by multiplying the intermediate vectors by the loaded subarray of the at least n weight vectors; e) accumulate the partial output data to the registers; f) repeat operations a)-e) for other one or more of the input data subarrays with one or more subsequent subarrays of at least n weight vectors until all subarrays of at least n weight vectors are loaded from the array W = {vij} of said one of the N groups of the convolutional filters; g) store all the accumulated partial output data as an output data subarray to the data storage; h) repeat operations a)-g) for the rest of the input data subarrays and the rest of the N groups of the convolutional filters, thereby forming an output data array from the stored output data subarrays in the data storage.
7. The apparatus of claim 6, wherein the at least one processor is configured to load, to the registers, the subarray of the at least n weight vectors in operation b) sequentially or all at once.
8. The apparatus of claim 6 or 7, wherein the at least one processor is configured to perform operations a) and b) in parallel.
9. The apparatus of any one of claims 6 to 8, wherein the length t of the sliding window is selected based on a type of the at least one processor performing operations a)-h).
10. The apparatus of any one of claims 6 to 9, wherein the at least one processor is implemented as at least one graphics processing unit (GPU).
11. The apparatus of any one of claims 6 to 10, wherein the input data array comprises the overlapping input data subarrays.
12. A method for preparing convolutional filters of a convolutional neural network (CNN) for a convolution operation, comprising: providing weights initially arranged in the form of the convolutional filters of the CNN, each of the convolutional filters having a filter length that is defined as a number of the weights in the convolutional filter; dividing the convolutional filters into N groups, where N ≥ 1, each of the N groups comprising a number n of the convolutional filters, where n > 1, and each of the n convolutional filters in each of the N groups being provided with a filter index; for each of the N groups, additionally dividing each of the n convolutional filters into m weight vectors, each of the m weight vectors having a length g less than the filter length and being provided with a vector index; storing the N groups one-by-one such that the m * n weight vectors in each of the N groups are arranged as an array W = {vij}, where i = 1, 2, 3, ..., n and j = 1, 2, 3, ..., m, and where vij is the weight vector with the vector index j in the convolutional filter with the filter index i in the group, the vector index j changing by 1 incrementally whenever the filter index i returns to 1, the filter index i changing by 1 incrementally every g weights, and the filter index i returning to 1 after the filter index i reaches n.
13. A method for performing a convolution operation by using the convolutional filters prepared using the method according to claim 12, comprising: a) providing an input data array to be convoluted, the input data array comprising input data subarrays; b) loading, to registers of at least one processor, one of the input data subarrays; c) loading, to the registers of the at least one processor, a subarray of at least n weight vectors from the array W = {vij} of one of the N groups of the convolutional filters; d) for the loaded input data subarray, obtaining intermediate vectors by using a sliding window having a length t; e) producing partial output data of size n x t by multiplying the intermediate vectors by the loaded subarray of the at least n weight vectors; f) accumulating the partial output data to the registers of the at least one processor; g) repeating steps b)-f) for other one or more of the input data subarrays with one or more subsequent subarrays of at least n weight vectors until all subarrays of at least n weight vectors are loaded from the array W = {vij} of said one of the N groups of the convolutional filters; h) storing all the accumulated partial output data as an output data subarray to a data storage; and i) repeating operations b)-h) for the rest of the input data subarrays and the rest of the N groups of the convolutional filters, thereby forming an output data array from the stored output data subarrays in the data storage.
14. A computer program product comprising a computer-readable storage medium, wherein the computer-readable storage medium stores a computer code which, when executed by at least one processor, causes the at least one processor to perform the method according to claim 12.
15. A computer program product comprising a computer-readable storage medium, wherein the computer-readable storage medium stores a computer code which, when executed by at least one processor, causes the at least one processor to perform the method according to claim 13.
PCT/RU2021/000100 2021-03-11 2021-03-11 Accelerated execution of convolution operation by convolutional neural network WO2022191730A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP21720868.5A EP4295276A1 (en) 2021-03-11 2021-03-11 Accelerated execution of convolution operation by convolutional neural network
PCT/RU2021/000100 WO2022191730A1 (en) 2021-03-11 2021-03-11 Accelerated execution of convolution operation by convolutional neural network
CN202180095369.3A CN116997911A (en) 2021-03-11 2021-03-11 Accelerating convolutional neural networks to perform convolutional operations

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/RU2021/000100 WO2022191730A1 (en) 2021-03-11 2021-03-11 Accelerated execution of convolution operation by convolutional neural network

Publications (1)

Publication Number Publication Date
WO2022191730A1 true WO2022191730A1 (en) 2022-09-15

Family

ID=75639957

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RU2021/000100 WO2022191730A1 (en) 2021-03-11 2021-03-11 Accelerated execution of convolution operation by convolutional neural network

Country Status (3)

Country Link
EP (1) EP4295276A1 (en)
CN (1) CN116997911A (en)
WO (1) WO2022191730A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115640494A (en) * 2022-12-14 2023-01-24 北京登临科技有限公司 Convolution calculation unit, AI operation array and related equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
PERRY GIBSON ET AL: "Optimizing Grouped Convolutions on Edge Devices", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 17 June 2020 (2020-06-17), XP081697831 *
RAMANATHAN AKSHAY KRISHNA ET AL: "Look-Up Table based Energy Efficient Processing in Cache Support for Neural Network Acceleration", 2020 53RD ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE (MICRO), IEEE, 17 October 2020 (2020-10-17), pages 88 - 101, XP033856424, DOI: 10.1109/MICRO50266.2020.00020 *
ZHENG LIANMIN: "Optimizing Mobile Deep Learning on ARM GPU with TVM", 16 January 2018 (2018-01-16), XP055870554, Retrieved from the Internet <URL:https://tvm.apache.org/2018/01/16/opt-mali-gpu> [retrieved on 20211208] *

Also Published As

Publication number Publication date
EP4295276A1 (en) 2023-12-27
CN116997911A (en) 2023-11-03

Similar Documents

Publication Publication Date Title
CN108229655B (en) Convolutional neural network (CNN) processing method and device
US20230186062A1 (en) Neural Network Architecture Using Convolution Engines
US20210319284A1 (en) System and architecture including processor and neural network accelerator
CN111213125B (en) Efficient direct convolution using SIMD instructions
JP6720264B2 (en) Learning method and learning apparatus for image segmentation, and image segmentation method and image segmentation apparatus using the same
US10534841B2 (en) Appartus and methods for submatrix operations
KR102452951B1 (en) Method and apparatus for performing convolution operation in neural network
GB2568102A (en) Exploiting sparsity in a neural network
US20230019151A1 (en) Implementation of pooling and unpooling or reverse pooling in hardware
US20200065646A1 (en) Method and device with convolution neural network processing
KR20190111810A (en) Systems and methods of data processing
CN110109646A (en) Data processing method, device and adder and multiplier and storage medium
JP2020126651A (en) Method and apparatus for processing convolution operation in neural network
CN114792387A (en) Image restoration method and apparatus
EP4295276A1 (en) Accelerated execution of convolution operation by convolutional neural network
CN117435855B (en) Method for performing convolution operation, electronic device, and storage medium
CN111310115A (en) Data processing method, device and chip, electronic equipment and storage medium
CN111985617A (en) Processing method and device of 3D convolutional neural network on neural network processor
CN111953318B (en) Median filtering method and device adaptive to pipeline architecture and filter
EP4309083A1 (en) Efficient compression of activation functions
CN111860838A (en) Full connection layer calculation method and device of neural network
CN114902240A (en) Neural network channel number searching method and device
CN111831207A (en) Data processing method, device and equipment
EP4120142A1 (en) Implementation of argmax or argmin in hardware
US20230012553A1 (en) Implementation of argmax or argmin in hardware

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21720868

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202180095369.3

Country of ref document: CN

WWE Wipo information: entry into national phase

Ref document number: 2021720868

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2021720868

Country of ref document: EP

Effective date: 20230920

NENP Non-entry into the national phase

Ref country code: DE