WO2022191730A1 - Accelerated execution of convolution operation by convolutional neural network
- Publication number: WO2022191730A1
- Application: PCT/RU2021/000100 (RU2021000100W)
- Authority: WIPO (PCT)
Classifications
- G06N3/063: Computing arrangements based on biological models; Neural networks; Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
- G06N3/045: Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
Definitions
- The present disclosure relates generally to the field of data processing, and particularly to an apparatus and method for preparing convolutional filters of a convolutional neural network (CNN) for a convolution operation, an apparatus and method for performing the convolution operation by using the prepared convolutional filters of the CNN, as well as computer program products embodying the method steps in the form of computer codes.
- Deep neural networks (DNNs) are widely used in various fields of human activity. Examples of their applications include object detection and recognition, image analysis, and data classification.
- A special type of DNN that deals extremely well with image processing is the CNN.
- The main constructive blocks of the CNN are convolutional layers, each comprising a set of convolutional filters. Each convolutional filter is configured as a combination of small weight matrices (usually 3 × 3 weight matrices).
- The convolutional layers are the most computation-intensive part of the CNN, for which reason they are also the most time-consuming and power-consuming layers of the CNN. Thus, techniques for reducing the computational costs and/or memory requirements of the convolutional layers are desired.
- To speed up such computations, graphics processing units (GPUs) are commonly used. Among them, the Mali family of GPUs designed by ARM is of particular interest. It offers application programming interface (API) support for Open Computing Language (OpenCL), Open Graphics Library (OpenGL), DirectX, and Google RenderScript.
- The Mali GPUs are characterized in that they do not have a local memory. In other words, the Mali GPUs do not deal with a fast shared memory region that can be shared among a group of GPU threads.
- The implementation of the convolution operation in the form of the ARM Compute Library is delivered with the existing mobile devices comprising the Mali GPUs.
- The ARM Compute Library involves using a series of 2D convolutions, in each of which input activations loaded from Random Access Memory (RAM) to registers of the Mali GPU are reused from 1 to 3 times in the case of 3 × 3 convolutional filters.
- This small number of reuses may cause the Mali GPU to perform a large number of memory load operations during the convolution operation, thereby delaying its execution. In turn, this may prevent CNNs from being used in some computational tasks, thereby limiting their application.
- According to a first aspect, an apparatus for preparing convolutional filters of a CNN comprises a data storage and at least one processor.
- The data storage is configured to store weights and processor-executable instructions.
- The weights are initially arranged in the form of the convolutional filters of the CNN.
- Each of the convolutional filters has a filter length that is defined as a number of the weights in the convolutional filter.
- The processor-executable instructions cause the at least one processor to operate as follows. At first, the at least one processor divides the convolutional filters into N groups, where N ≥ 1. Each of the N groups comprises a number n of the convolutional filters, where n > 1.
- Each of the n convolutional filters in each of the N groups is provided with a filter index i.
- The at least one processor then additionally divides each of the n convolutional filters into m weight vectors.
- Each of the m weight vectors has a length g less than the filter length and is provided with a vector index j.
- The weight vectors are stored such that the vector index j changes by 1 incrementally whenever the filter index i returns to 1, the filter index i changes by 1 incrementally every g weights, and the filter index i returns to 1 after the filter index i reaches n.
- Each of the convolutional filters comprises an equal number of filtering channels, each of the filtering channels has a channel length that is defined as a number of the weights in the filtering channel, and the filter length is defined as a sum of the channel lengths of the filtering channels in each of the convolutional filters.
- In one embodiment, the length g of each of the m weight vectors is equal to a part of the sum of the channel lengths.
- In another embodiment, the length g of each of the m weight vectors is less than the channel length.
- The number n of the convolutional filters in each of the N groups is selected based on a number of registers of the at least one processor to be used for performing the convolution operation. This may allow one to adapt the data storage format of the weights of the convolutional filters for different types of processors (e.g., the Mali GPUs).
- In case a total number k of the convolutional filters is not divisible by n, the at least one processor is further configured, before dividing the convolutional filters into the N groups, to add at least one zero-filled convolutional filter to the k convolutional filters to make k divisible by n.
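- The following is a minimal Python sketch of this preparation step, assuming (purely for illustration) that each filter is handled as a flat weight array; the helper name repack_filters and the (k, filter_length) layout are this description restated in code, not any reference implementation:

```python
import numpy as np

def repack_filters(filters: np.ndarray, n: int, g: int) -> np.ndarray:
    """Repack flattened CNN filters into the grouped, interleaved format.

    filters: (k, filter_length) array; row i holds the flattened weights
             of convolutional filter i.
    n:       number of filters per group (chosen from the register budget
             of the target processor).
    g:       weight-vector length; must divide filter_length evenly.
    """
    k, filter_length = filters.shape
    assert filter_length % g == 0, "g must divide the filter length"
    m = filter_length // g                      # weight vectors per filter

    # Add zero-filled filters so that the total number k is divisible by n.
    if k % n:
        pad = n - k % n
        filters = np.vstack([filters,
                             np.zeros((pad, filter_length), filters.dtype)])
        k += pad

    out = []
    for group in range(k // n):                 # N groups
        for j in range(m):                      # vector index j
            for i in range(n):                  # filter index i wraps to 1
                f = filters[group * n + i]      # every g weights, at which
                out.append(f[j * g:(j + 1) * g])  # point j advances by 1
    return np.concatenate(out)                  # weights in load order
```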
- According to a second aspect, a computing apparatus for performing a convolution operation is provided.
- The apparatus comprises a data storage and at least one processor.
- The data storage is configured to store an input data array to be convoluted, the N groups of the convolutional filters which are prepared by the apparatus according to the first aspect, and processor-executable instructions.
- The input data array comprises input data subarrays.
- The at least one processor comprises registers.
- The at least one processor is configured to load, to the registers, the subarray of the at least n weight vectors in operation b) sequentially or all at once. This may make the apparatus according to the second aspect more flexible in use.
- The at least one processor is configured to perform operations a) and b) in parallel. This may reduce the time required to perform these operations.
- The length t of the sliding window is selected based on a type of the at least one processor performing operations a)-h).
- The at least one processor is implemented as at least one GPU.
- By using the GPU(s), it is possible to increase the execution efficiency of the convolution operation.
- The input data array comprises overlapping input data subarrays. This means that the apparatus according to the second aspect is not “sensitive” to the initial segmentation of the input data array, which makes it more flexible in use.
- According to a third aspect, a method for preparing convolutional filters of a CNN for a convolution operation starts with the step of providing weights initially arranged in the form of the convolutional filters of the CNN.
- Each of the convolutional filters has a filter length that is defined as a number of the weights in the convolutional filter.
- The method proceeds to the step of dividing the convolutional filters into N groups, where N ≥ 1.
- Each of the N groups comprises a number n of the convolutional filters, where n > 1.
- Each of the n convolutional filters in each of the N groups is provided with a filter index i.
- The method goes on to the step of additionally dividing, for each of the N groups, each of the n convolutional filters into m weight vectors.
- Each of the m weight vectors has a length g less than the filter length and is provided with a vector index j.
- The weight vectors are stored such that the vector index j changes by 1 incrementally whenever the filter index i returns to 1, the filter index i changes by 1 incrementally every g weights, and the filter index i returns to 1 after the filter index i reaches n.
- Thus, the list of computational tasks to be solved by the CNN may be expanded irrespective of the limited hardware capabilities of the existing mobile devices (e.g., the mobile devices with the Mali GPUs).
- With this data storage format, it is possible to increase the number of reuses of loaded input data (or, in other words, input activations) during the convolution operation without having to load the weights from sparse memory addresses, thereby reducing computation time as well as power consumption by the at least one processor.
- As a result, the list of computational tasks to be solved by the CNN may be expanded.
- A computer program product comprises a computer-readable storage medium storing a computer code which, when executed by at least one processor, causes the at least one processor to perform the method according to the third aspect.
- Another computer program product comprises a computer-readable storage medium storing a computer code which, when executed by at least one processor, causes the at least one processor to perform the method according to the fourth aspect.
- FIG. 1 shows a flowchart of a method for performing a series of 2D convolution operations in accordance with the ARM Compute Library;
- FIG. 2 shows an exemplary visual representation of a data storage format for weights of convolutional filters used in the method shown in FIG. 1;
- FIGs. 3A-3C explain how input activations are converted to output activations in each 2D convolution operation in accordance with the method shown in FIG. 1;
- FIGs. 4A-4C explain how to use a fused multiply-add (FMA) operation in the method shown in FIG. 1;
- FIG. 5 shows a block diagram of an apparatus for preparing convolutional filters of a CNN for a convolution operation in accordance with one exemplary embodiment;
- FIG. 6 shows a flowchart of a method for preparing the convolutional filters of the CNN for the convolution operation in accordance with one exemplary embodiment;
- FIG. 7 shows an exemplary visual representation of a data storage format for the weights of the convolutional filters, as provided by the method shown in FIG. 6;
- FIG. 8 shows a block diagram of a computing apparatus for performing a convolution operation in accordance with one exemplary embodiment;
- FIG. 9 shows a flowchart of a method for performing the convolution operation by using the convolutional filters obtained by the method shown in FIG. 6, in accordance with one exemplary embodiment;
- FIG. 10 shows one example of how the method shown in FIG. 9 is performed with respect to one thread of a processor included in the apparatus shown in FIG. 8 by using one group of convolutional filters prepared in accordance with the method shown in FIG. 6;
- FIGs. 11A and 11B show comparative bar charts of speedup coefficients obtained by using the method shown in FIG. 9 and a benchmark method for performing a convolution operation based on the ARM Compute Library for different sizes of an input data array and convolutional filters.
- A convolutional neural network (CNN) may refer to a trained multilayer neural network architecture in which one or more convolution operations are implemented for various purposes.
- The CNN may be used to solve computer vision tasks (e.g., image classification, classification with localization, object detection, super resolution, joint demosaicing and denoising, noise reduction, image enhancement, etc.), as well as speech and audio processing tasks (e.g., text-to-speech, speech-to-text, etc.).
- The main constructive blocks of the CNN are convolutional layers.
- The convolutional layers are strong feature extractors in which convolutional filters are configured to retrieve features of input data (e.g., image data or a time series) to be processed.
- The input data may also be referred to as input activations (correspondingly, output data resulting from the convolution operation may also be referred to as output activations).
- The convolutional filters are typically represented as a combination of small weight matrices (usually 3 × 3 weight matrices) which are “slid” over the input data. In this sense, the convolution operation may be considered as the result of respective matrix multiplication operations between the convolutional filters and the input data.
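- As a small illustration of this view (an im2col-style sketch under assumed shapes, not any particular library's implementation), a 2D convolution can be computed as one matrix multiplication between patches of the input and the flattened filter:

```python
import numpy as np

x = np.arange(16.0).reshape(4, 4)        # input activations (illustrative)
w = np.arange(9.0).reshape(3, 3)         # one 3 x 3 weight matrix

# Gather every 3 x 3 patch of x into the rows of a matrix ("im2col"),
# then one matrix-vector product yields all output activations at once.
patches = np.array([x[r:r + 3, c:c + 3].ravel()
                    for r in range(2) for c in range(2)])
out = (patches @ w.ravel()).reshape(2, 2)

# Same result as sliding the filter directly over the input.
ref = np.array([[np.sum(x[r:r + 3, c:c + 3] * w) for c in range(2)]
                for r in range(2)])
assert np.allclose(out, ref)
```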
- The convolutional layers are not fully connected, but their output is usually passed to one or more fully connected layers that make a final decision (e.g., a classification decision). Since the convolutional layers are known as the most computation-intensive part of the CNN, they are also the most time-consuming and power-consuming layers of the CNN. Given this, the execution speed of the convolution operation is mainly determined by these layers.
- Among graphics processing units (GPUs), the Mali GPUs designed by ARM are of particular interest.
- As noted above, the Mali GPUs do not have a local memory. In other words, during operation, they do not rely on a fast shared memory that can be shared among a group of GPU threads.
- A comprehensive collection of computation-intensive functions (in particular, for the convolution operation) optimized for the Mali GPUs is available in the form of the ARM Compute Library. These functions are available to application developers, for example, through the Android Neural Networks API (NNAPI). Relative to the convolution operation, the ARM Compute Library involves dealing with 2D input/output data of weights and activations and performing a series of 2D convolutions.
- FIG. 1 shows a flowchart of a method 100 for performing a series of 2D convolution operations in accordance with the ARM Compute Library. More specifically, the method 100 describes the execution of the 2D convolution operations for one thread of a Mali GPU.
- The method 100 starts with a step S102, in which the Mali GPU loads one subarray of weights of a 3 × 3 convolutional filter of a CNN and one subarray of input activations. Both the subarray of weights and the subarray of input activations are schematically shown as rectangular parallelepipeds in FIG. 1.
- The convolutional filter is usually divided into filtering channels, and the loaded subarray of weights of the convolutional filter corresponds to one of the filtering channels in this case.
- Next, the method 100 proceeds to a step S104, in which the Mali GPU performs a 2D convolution operation based on the loaded subarray of weights and the loaded subarray of the input activations.
- Then, the method 100 goes on to a step S106, in which the Mali GPU accumulates the result of the 2D convolution operation to its registers.
- After that, the method 100 proceeds to a step S108, in which the Mali GPU loads a subsequent subarray of weights of the convolutional filter (e.g., by moving a sliding window to a next filtering channel).
- The Mali GPU repeats the steps S102-S108 c − 1 times, i.e. until all subarrays of weights are loaded from all c filtering channels of the convolutional filter.
- The method 100 ends with a step S110, in which the Mali GPU stores the accumulated results of the 2D convolution operations as a subarray of output activations to a system memory, such as, for example, RAM.
- The method 100 may be repeated by the Mali GPU for other threads, i.e. for other convolutional filters of the CNN and other subarrays of input activations, resulting in other subarrays of output activations. These subarrays of output activations are then combined into a final array of output activations.
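- Below is a minimal Python/NumPy sketch of this per-thread loop; the function name thread_conv_series and the (channels, height, width) layout are illustrative assumptions, not the ARM Compute Library API, and the accumulator array stands in for the GPU registers:

```python
import numpy as np

def thread_conv_series(weights: np.ndarray, activations: np.ndarray) -> np.ndarray:
    """One thread of the method 100: a series of 2D convolutions.

    weights:     (c, 3, 3) -- one 3 x 3 convolutional filter with c
                 filtering channels.
    activations: (c, H, W) -- the matching subarray of input activations.
    Returns the accumulated (H - 2, W - 2) subarray of output activations.
    """
    c, H, W = activations.shape
    acc = np.zeros((H - 2, W - 2), activations.dtype)  # registers (S106)
    for ch in range(c):                 # S108: slide to the next channel
        w = weights[ch]                 # S102: load one subarray of weights
        x = activations[ch]             # S102: load input activations
        for dy in range(3):             # S104: 2D convolution ("valid")
            for dx in range(3):
                acc += w[dy, dx] * x[dy:dy + H - 2, dx:dx + W - 2]
    return acc                          # S110: store output activations
```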
- FIG. 2 shows an exemplary visual representation of a data storage format 200 for the weights of the convolutional filters used in the method 100.
- This format is referred to as a KCRS format, where K is the output number of channels of each 2D convolution operation, C is the input number of channels of each 2D convolution operation, R is the filter height, and S is the filter width.
- There are four convolutional filters shown in FIG. 2, for which reason K = 4.
- The numbers shown in the filtering channels correspond to memory addresses.
- As follows from the data storage format 200, the weights of a next convolutional filter are not used until all weights of a previous convolutional filter are loaded.
- FIGs. 3A-3C explain how the input activations are converted to the output activations by using each 2D convolution operation in accordance with the method 100.
- In particular, each part of the input activations is loaded by the Mali GPU row by row, and each loaded row (colored in light grey in FIGs. 3A-3C) may be used from 1 to 3 times (for the case of the 3 × 3 convolutional filter) whenever it is needed to calculate rows of the output activations.
- Each 2D convolution operation is presented as a series of 1D convolution operations performed by a fused multiply-add (FMA) operation.
- FIGs. 4A-4C explain how the FMA operation is used in the method 100. In particular, a 6-component row vector of the input activations is divided into three sub-vectors by using a sliding window. These sub-vectors are denoted as v1:4, v2:5, and v3:6. The sub-vector v1:4 is multiplied by the weight w1 (see FIG. 4A), the sub-vector v2:5 is multiplied by the weight w2 (see FIG. 4B), and the sub-vector v3:6 is multiplied by the weight w3 (see FIG. 4C), with the products accumulated into the same output row.
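- A minimal numeric sketch of this FMA pattern (the values are chosen only for illustration):

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # v1..v6 (illustrative)
w = np.array([0.5, -1.0, 2.0])                  # w1, w2, w3 (illustrative)

out = np.zeros(4)
out += w[0] * v[0:4]    # FIG. 4A: w1 * v1:4
out += w[1] * v[1:5]    # FIG. 4B: w2 * v2:5
out += w[2] * v[2:6]    # FIG. 4C: w3 * v3:6

# The three scalar-vector FMAs reproduce a 1D "valid" correlation:
assert np.allclose(out, np.convolve(v, w[::-1], mode="valid"))
```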
- Thus, the ARM Compute Library involves using the 2D convolution operations, in each of which each loaded row of the subarray of the input activations may be used from 1 to 3 times (depending on the row used) for the 3 × 3 convolutional filter.
- This small number of reuses implies that the Mali GPU needs to perform multiple memory reads during each 2D convolution operation, thereby causing a delay in its execution. For this reason, it may be impossible to use CNNs on the Mali GPU when solving time-sensitive computational tasks, thereby limiting the application of the CNNs.
- The exemplary embodiments disclosed herein provide a technical solution that allows mitigating or even eliminating the above-mentioned drawbacks peculiar to the prior art.
- In particular, the technical solution disclosed herein involves dividing convolutional filters of a CNN into groups, for each of which individual parts (e.g., individual rows of filtering channels or the filtering channels themselves) of the different convolutional filters are then stored physically close to each other in a system memory. For example, for each group, one part of a first convolutional filter is followed by one part of a second convolutional filter, which is then followed by one part of a third convolutional filter, and so on.
- The more convolutional filters each group comprises, the more reuses of input activations are possible during a convolution operation, which allows accelerating its execution.
- In practice, the number of the convolutional filters in each group is limited by a number of registers of a processor (e.g., a GPU) to be used for performing the convolution operation.
- FIG. 5 shows a block diagram of an apparatus 500 for preparing convolutional filters of a CNN for a convolution operation in accordance with one exemplary embodiment.
- The apparatus 500 may be part of a user equipment (UE) or implemented as an individual apparatus which may be accessed by the UE via a wireless or wired connection.
- The UE may refer to a mobile device, a mobile station, a terminal, a subscriber unit, a mobile phone, a cellular phone, a smart phone, a cordless phone, a personal digital assistant (PDA), a wireless communication device, a desktop computer, a laptop computer, a tablet computer, a single-board computer (SBC) (e.g., a Raspberry Pi device), a gaming device, a netbook, a smartbook, an ultrabook, a medical device or medical equipment, a biometric sensor, a wearable device (e.g., a smart watch, smart glasses, a smart wrist band, etc.), an entertainment device (e.g., an audio player, a video player, etc.), a vehicular component or sensor (e.g., a driver-assistance system), a smart meter/sensor, an unmanned vehicle (e.g., an industrial robot, a quadcopter, etc.) and its component (e.g., a self-driving car computer), industrial equipment, etc.
- As shown in FIG. 5, the apparatus 500 comprises a processor 502 and a data storage 504.
- The data storage 504 stores (trained) weights initially arranged in the form of convolutional filters 506 of the CNN.
- Each of the convolutional filters 506 has a filter length that is defined as a number of the weights therein.
- The data storage 504 further stores processor-executable instructions 508 which, when executed by the processor 502, cause the processor 502 to store the convolutional filters 506 of the CNN in a certain data storage format, as will be described further in more detail. It should be noted that the number, arrangement and interconnection of the constructive elements constituting the apparatus 500, which are shown in FIG. 5, are not intended to be any limitation of the present disclosure, but merely used to provide a general idea of how the constructive elements may be implemented within the apparatus 500.
- The apparatus 500 may further comprise a transceiver configured to perform data reception and transmission for different purposes.
- Such a transceiver may be implemented as two individual devices, with one for a receiving operation and another for a transmitting operation.
- In this case, the transceiver is intended to be capable of performing different operations required for the data reception and transmission, such as, for example, signal modulation/demodulation, encoding/decoding, etc.
- The processor 502 may be implemented as a CPU, a general-purpose processor, a single-purpose processor, a GPU, a microcontroller, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), a complex programmable logic device, etc. It should also be noted that the processor 502 may be implemented as any combination of one or more of the aforesaid. As an example, the processor 502 may be a combination of two or more microprocessors.
- The data storage 504 may be implemented as a classical nonvolatile or volatile memory used in modern electronic computing machines.
- Examples of the nonvolatile memory include Read-Only Memory (ROM), ferroelectric Random-Access Memory (RAM), Programmable ROM (PROM), Electrically Erasable PROM (EEPROM), solid state drives (SSD), flash memory, magnetic disk storage (such as hard drives and magnetic tapes), optical disc storage (such as CD, DVD and Blu-ray discs), etc.
- As for the volatile memory, examples thereof include Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Static RAM (SRAM), etc.
- The processor-executable instructions 508 stored in the data storage 504 may be configured as a computer-executable code which causes the processor 502 to perform the aspects of the present disclosure.
- The computer-executable code for carrying out operations or steps for the aspects of the present disclosure may be written in any combination of one or more programming languages, such as Java, C++, or the like.
- The computer-executable code may be in the form of a high-level language or in a pre-compiled form and be generated by an interpreter (also pre-stored in the data storage 504) on the fly.
- FIG. 6 shows a flowchart of a method 600 for preparing the convolutional filters 506 of the CNN for the convolution operation in accordance with one exemplary embodiment.
- In other words, the method 600 describes the operation of the apparatus 500.
- The method 600 starts with a step S602, in which the convolutional filters 506 are provided to the processor 502. Given the configuration of the apparatus 500, said providing implies that the processor 502 reads the convolutional filters 506 from the data storage 504. Then, the method 600 proceeds to a step S604, in which the processor 502 divides the convolutional filters 506 into N groups, where N ≥ 1. Each of the N groups comprises a number n of the convolutional filters 506, where n > 1.
- Each of the n convolutional filters 506 in each of the N groups is provided with a filter index i.
- Next, the method 600 goes on to a step S606, in which the processor 502 additionally divides, for each of the N groups, each of the n convolutional filters 506 into m weight vectors.
- Each of the m weight vectors has a length g less than the filter length and is provided with a vector index j.
- Thus, vi,j is the weight vector with the vector index j in the convolutional filter 506 with the filter index i in the group.
- The weight vectors are stored such that the vector index j changes by 1 incrementally whenever the filter index i returns to 1, the filter index i changes by 1 incrementally every g weights, and the filter index i returns to 1 after the filter index i reaches n. It should be noted that the method 600 does not change the weights of the convolutional filters 506 but merely changes their memory addresses in the data storage 504.
- FIG. 7 shows an exemplary visual representation of a data storage format 700 for the weights of the convolutional filters 506, as provided by the method 600.
- FIG. 7 shows four convolutional filters divided equally into two groups, i.e. each group comprises two of the four convolutional filters.
- Each of the four convolutional filters comprises three filtering channels, and each of the filtering channels comprises three rows and three columns.
- However, the convolutional filters may be configured differently, if required and depending on particular applications.
- The same is true for the shown number of the convolutional filters and the shown number of the groups, i.e. these numbers may be changed, if required and depending on particular applications.
- The numbers shown in FIG. 7 correspond to the memory addresses.
- As follows from the data storage format 700, the first g weights of the first filtering channel of the second convolutional filter in the first group are loaded after the first g weights of the first filtering channel of the first convolutional filter in the first group are loaded.
- Similarly, the second g weights of the first filtering channel of the second convolutional filter in the first group are loaded after the second g weights of the first filtering channel of the first convolutional filter in the first group are loaded, and so on.
- In this example, the length g of each weight vector obtained in the step S606 of the method 600 is set to a row length (i.e. the number of the weights in each row of the filtering channel).
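- The address arithmetic implied by this ordering can be sketched as follows (zero-based indices; the function name vector_offset and the back-to-back group layout are assumptions made for illustration):

```python
def vector_offset(group: int, i: int, j: int, n: int, m: int, g: int) -> int:
    """Start offset (in weights) of vector v_{i,j} in the format 700.

    Each group holds m * n vectors of length g, stored j-major and
    i-minor, and the groups are assumed to be laid out back to back.
    """
    return ((group * m + j) * n + i) * g

# FIG. 7 configuration: n = 2 filters per group, three 3 x 3 filtering
# channels per filter, and g equal to one row (3 weights), so m = 27 / 3 = 9.
n, m, g = 2, 9, 3
assert vector_offset(0, 0, 0, n, m, g) == 0  # row 1 of filter 1, group 1
assert vector_offset(0, 1, 0, n, m, g) == 3  # then row 1 of filter 2
assert vector_offset(0, 0, 1, n, m, g) == 6  # then row 2 of filter 1
```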
- As can be seen, the data storage format 700 fully differs from the data storage format 200 typically used in the ARM Compute Library.
- The present authors have found that the data storage format 700 may allow one to alleviate the data sparseness problem and, thus, increase the utilization efficiency of a processor that executes the convolution operation by using the convolutional filters 506 of the CNN. More specifically, it may allow increasing the number of reuses of loaded input data (or, in other words, input activations) during the convolution operation without having to load the weights from sparse memory addresses, thereby accelerating its execution.
- The length g of each weight vector obtained in the step S606 of the method 600 may be set in a variety of ways, depending on particular applications. In one embodiment, it may be equal to a part of the sum of the channel lengths of the filtering channels constituting the convolutional filter 506. In another embodiment, it may be less than the channel length of one of the filtering channels constituting the convolutional filter 506 (i.e. g may be equal to one or two rows of the filtering channel, for example). Furthermore, the number n of the convolutional filters 506 in each of the N groups may be selected based on a number of registers of a processor to be used for performing the convolution operation. By so doing, it is possible to adapt the method 600 for different types of processors (e.g., GPUs).
- In case a total number k of the convolutional filters 506 is not divisible by n, the method 600 may comprise a further step before the step S604, in which the processor 502 adds at least one zero-filled convolutional filter to the k convolutional filters to make k divisible by n.
- FIG. 8 shows a block diagram of a computing apparatus 800 for performing a convolution operation in accordance with one exemplary embodiment. Similar to the apparatus 500, the apparatus 800 may be part of a UE or implemented as an individual apparatus which may be accessed by the UE via a wireless or wired connection. Moreover, the apparatuses 500 and 800 may be integrated into the same UE, if required. As shown in FIG. 8, the apparatus 800 comprises a processor 802 and a data storage 804. Unlike the processor 502, the processor 802 should comprise registers 806 which are used in the convolution operation.
- The data storage 804 stores the convolutional filters 506 of the CNN which are prepared by the apparatus 500 in accordance with the method 600.
- The data storage 804 further stores processor-executable instructions 808 which, when executed by the processor 802, cause the processor 802 to perform the convolution operation, as will be described further in more detail.
- The data storage 804 further stores an input data array 810 to be convoluted.
- The input data array 810 comprises input data subarrays.
- It should be noted that the number, arrangement and interconnection of the constructive elements constituting the apparatus 800, which are shown in FIG. 8, are not intended to be any limitation of the present disclosure, but merely used to provide a general idea of how the constructive elements may be implemented within the apparatus 800.
- For example, the processor 802 may be replaced with several processors, and the data storage 804 may be replaced with several removable and/or fixed storage devices, depending on particular applications.
- The apparatus 800 may further comprise a transceiver configured to perform data reception and transmission for different purposes.
- Such a transceiver may be implemented as two individual devices, with one for a receiving operation and another for a transmitting operation.
- In this case, the transceiver is intended to be capable of performing different operations required for the data reception and transmission, such as, for example, signal modulation/demodulation, encoding/decoding, etc.
- As for the processor 802, the data storage 804 and the processor-executable instructions 808, they may be implemented in a similar manner as the processor 502, the data storage 504 and the processor-executable instructions 508, respectively.
- At the same time, possible implementations of the processor 802 should allow for the presence of the registers 806.
- FIG. 9 shows a flowchart of a method 900 for performing the convolution operation by using the convolutional filters 506 of the CNN in accordance with one exemplary embodiment.
- In other words, the method 900 describes the operation of the apparatus 800.
- The method 900 starts with steps S902 and S904, in which the processor 802 loads, to the registers 806, one of the input data subarrays and a subarray of at least n weight vectors from one of the N groups of the convolutional filters 506. The subarray of the at least n weight vectors may be loaded sequentially or all at once, depending on particular applications. Moreover, the steps S902 and S904 may be performed in parallel, if required.
- Next, the method 900 goes on to a step S906, in which the processor 802 obtains intermediate vectors for the loaded input data subarray by using a sliding window having a length t. The length t of the sliding window may be selected based on the type of the processor 802.
- After that, the processor 802 produces partial output data of size n x t by multiplying the intermediate vectors by the loaded subarray of the at least n weight vectors in a step S908, and accumulates the partial output data to the registers 806 in a step S910.
- The processor 802 then repeats the steps S902-S914 for the rest of the input data subarrays and the rest of the N groups of the convolutional filters, thereby forming an output data array (or the whole set of output activations) from the stored output data subarrays in the data storage 804.
- It should be noted that the output data subarrays may be formed in parallel (e.g., in the case of a GPU or a multicore processor, one thread may perform the steps S902-S914 to produce some part of the output data, while another thread may perform the same steps to produce some other part of the output data).
- The arrows shown in FIG. 10 denote the corresponding steps of the method 900.
- In the example of FIG. 10, eight weight vectors are loaded to the registers 806. By using these eight weight vectors, it is possible for the processor 802 to use each loaded input data subarray (or, in other words, each loaded subarray of input activations) 8 times, thereby performing 24 scalar-vector multiplications in the step S908.
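- A minimal sketch of this register tile computation (the function name conv_tile, the 1D input layout, and the exact windowing are assumptions consistent with the FMA pattern of FIGs. 4A-4C, not a verbatim rendering of FIG. 10):

```python
import numpy as np

def conv_tile(weight_vecs: np.ndarray, x: np.ndarray, t: int) -> np.ndarray:
    """Steps S906-S910 for one thread.

    weight_vecs: (n, g) -- the loaded subarray of weight vectors
                 (eight 3-component vectors in the FIG. 10 example).
    x:           1D input data subarray of length t + g - 1.
    t:           sliding-window length (processor-dependent).
    Returns the (n, t) tile of partial output data held in registers.
    """
    n, g = weight_vecs.shape
    acc = np.zeros((n, t), x.dtype)               # register tile (S910)
    for p in range(g):
        window = x[p:p + t]                       # S906: intermediate vector
        for i in range(n):                        # reused n times per load
            acc[i] += weight_vecs[i, p] * window  # S908: scalar-vector FMA
    return acc

# FIG. 10 numbers: 8 weight vectors of length 3 and t = 3 give
# 8 * 3 = 24 scalar-vector multiplications, each input window reused 8 times.
tile = conv_tile(np.ones((8, 3)), np.arange(5.0), t=3)
assert tile.shape == (8, 3)
```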
- Thus, the data storage format 700 obtained by the method 600 provides a larger number of reuses of the loaded input data subarray (without having to load the weights from sparse memory addresses) in the method 900 compared to the data storage format 200 used in the ARM Compute Library.
- FIGs. 11A and 11B show comparative bar charts of speedup coefficients obtained by using the method 900 and a benchmark method for performing a convolution operation based on the ARM Compute Library for different sizes of an input data array and convolutional filters.
- The benchmark method was performed on the following System-on-a-Chip: Huawei Kirin 980 (GPU: Mali-G76 MP10). It should be noted that the method 900 and the benchmark method were performed for aligned and unaligned memory store operations. An unaligned memory store operation is implemented when data with a size of N bytes are stored to a memory address that is not evenly divisible by N. If the memory address is evenly divisible by N, an aligned memory store operation is implemented.
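- A one-line check illustrating this distinction (an illustrative helper, not part of the benchmark code):

```python
def is_aligned_store(addr: int, nbytes: int) -> bool:
    """True when an nbytes-sized store at address addr is aligned."""
    return addr % nbytes == 0

assert is_aligned_store(0x1000, 16)        # 0x1000 is evenly divisible by 16
assert not is_aligned_store(0x1004, 16)    # 0x1004 is not
```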
- Each speedup coefficient is defined as a ratio between a convolution execution time t_ARM achieved by using the benchmark method and a convolution execution time t_900 achieved by using the method 900.
- The sizes of the input data array and the convolutional filters are shown in the form of the following string: “WxHxCxF”, where W is the width of the input data array, H is the height of the input data array, C is the number of channels of the input data array, and F is the number of the convolutional filters.
- In FIG. 11A, the comparative bar charts are obtained at constant spatial dimensions W and H, i.e. 1920x1080, while changing only the parameters C and F.
- As shown in FIG. 11A, the speedup coefficient is always more than 1 for all sizes of the input data array, thereby meaning that t_900 < t_ARM.
- The maximum speedup coefficient is 1.34.
- In FIG. 11B, the comparative bar charts are obtained at different parameters W, H, C, and F.
- As can be seen, both the method 900 and the benchmark method are inefficient for the input data array with small W, H, and F (see, in particular, the speedup coefficient at 60x24x2048x8). Therefore, there is little sense in comparing the method 900 and the benchmark method at small W, H, and F.
- Each step or operation of the methods 600 and 900, or any combination of the steps or operations, can be implemented by various means, such as hardware, firmware, and/or software.
- As an example, one or more of the steps or operations described above can be embodied by processor-executable instructions, data structures, program modules, and other suitable data representations.
- Furthermore, the processor-executable instructions which embody the steps or operations described above can be stored on a corresponding data carrier and executed by the processor 502 and the processor 802, respectively.
- This data carrier can be implemented as any computer-readable storage medium configured to be readable by the processor 502 and the processor 802 to execute the processor-executable instructions.
- Such computer-readable storage media can include both volatile and nonvolatile media, removable and non-removable media.
- The computer-readable media comprise media implemented in any method or technology suitable for storing information.
- The practical examples of the computer-readable media include, but are not limited to, information-delivery media, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile discs (DVD), holographic media or other optical disc storage, magnetic tape, magnetic cassettes, magnetic disk storage, and other magnetic storage devices.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/RU2021/000100 WO2022191730A1 (en) | 2021-03-11 | 2021-03-11 | Accelerated execution of convolution operation by convolutional neural network |
EP21720868.5A EP4295276A1 (en) | 2021-03-11 | 2021-03-11 | Accelerated execution of convolution operation by convolutional neural network |
CN202180095369.3A CN116997911A (en) | 2021-03-11 | 2021-03-11 | Accelerating convolutional neural networks to perform convolutional operations |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/RU2021/000100 WO2022191730A1 (en) | 2021-03-11 | 2021-03-11 | Accelerated execution of convolution operation by convolutional neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022191730A1 (en) | 2022-09-15 |
Family
ID=75639957
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/RU2021/000100 WO2022191730A1 (en) | 2021-03-11 | 2021-03-11 | Accelerated execution of convolution operation by convolutional neural network |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP4295276A1 (en) |
CN (1) | CN116997911A (en) |
WO (1) | WO2022191730A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115640494A (en) * | 2022-12-14 | 2023-01-24 | 北京登临科技有限公司 | Convolution calculation unit, AI operation array and related equipment |
Non-Patent Citations (3)
- PERRY GIBSON ET AL.: "Optimizing Grouped Convolutions on Edge Devices", arXiv, 17 June 2020 (2020-06-17), XP081697831 *
- RAMANATHAN AKSHAY KRISHNA ET AL.: "Look-Up Table based Energy Efficient Processing in Cache Support for Neural Network Acceleration", 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), IEEE, 17 October 2020 (2020-10-17), pages 88-101, XP033856424, DOI: 10.1109/MICRO50266.2020.00020 *
- ZHENG LIANMIN: "Optimizing Mobile Deep Learning on ARM GPU with TVM", 16 January 2018 (2018-01-16), XP055870554, retrieved from the Internet: <URL:https://tvm.apache.org/2018/01/16/opt-mali-gpu> [retrieved on 2021-12-08] *
Also Published As
Publication number | Publication date |
---|---|
EP4295276A1 (en) | 2023-12-27 |
CN116997911A (en) | 2023-11-03 |
Similar Documents
Publication | Title |
---|---|
CN108229655B (en) | Convolutional neural network (CNN) processing method and device |
CN111213125B (en) | Efficient direct convolution using SIMD instructions |
US11574171B2 (en) | Neural network architecture using convolution engines |
US20210319284A1 (en) | System and architecture including processor and neural network accelerator |
JP6720264B2 (en) | Learning method and learning apparatus for image segmentation, and image segmentation method and image segmentation apparatus using the same |
US10534841B2 (en) | Appartus and methods for submatrix operations |
KR102452951B1 (en) | Method and apparatus for performing convolution operation in neural network |
US11836971B2 (en) | Method and device with convolution neural network processing |
CN110781923A (en) | Feature extraction method and device |
US20230019151A1 (en) | Implementation of pooling and unpooling or reverse pooling in hardware |
KR20190111810A (en) | Systems and methods of data processing |
CN110109646A (en) | Data processing method, device and adder and multiplier and storage medium |
JP2020126651A (en) | Method and apparatus for processing convolution operation in neural network |
CN114792387A (en) | Image restoration method and apparatus |
EP4295276A1 (en) | Accelerated execution of convolution operation by convolutional neural network |
CN117435855B (en) | Method for performing convolution operation, electronic device, and storage medium |
CN111310115A (en) | Data processing method, device and chip, electronic equipment and storage medium |
CN111985617A (en) | Processing method and device of 3D convolutional neural network on neural network processor |
CN111860838A (en) | Full connection layer calculation method and device of neural network |
CN116933864A (en) | Universal high-precision distributed algorithm training method and system |
US20230131543A1 (en) | Apparatus and method with multi-task processing |
CN111985618A (en) | Processing method and device of 3D convolutional neural network on neural network processor |
CN111953318B (en) | Median filtering method and device adaptive to pipeline architecture and filter |
WO2022198233A1 (en) | Efficient compression of activation functions |
CN114902240A (en) | Neural network channel number searching method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21720868; Country of ref document: EP; Kind code of ref document: A1 |
 | WWE | Wipo information: entry into national phase | Ref document number: 202180095369.3; Country of ref document: CN |
 | WWE | Wipo information: entry into national phase | Ref document number: 2021720868; Country of ref document: EP |
 | ENP | Entry into the national phase | Ref document number: 2021720868; Country of ref document: EP; Effective date: 20230920 |
 | NENP | Non-entry into the national phase | Ref country code: DE |