WO2024104879A1 - Apparatus and method for performing a depthwise convolution operation - Google Patents


Info

Publication number
WO2024104879A1
Authority
WO
WIPO (PCT)
Prior art keywords
output data
data channel
convolution
input data
filter
Application number
PCT/EP2023/081278
Other languages
French (fr)
Inventor
Pasi Kolinummi
Andrew Robert BALDWIN
Original Assignee
Nokia Solutions And Networks Oy
Application filed by Nokia Solutions And Networks Oy filed Critical Nokia Solutions And Networks Oy
Publication of WO2024104879A1 publication Critical patent/WO2024104879A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/15 Correlation function computation including computation of convolution operations
    • G06F17/153 Multidimensional correlation or convolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]

Definitions

  • the present disclosure relates generally to the field of neural networks, and particularly to a technique for efficiently performing a depthwise convolution operation on a set of input data channels.
  • the data transfer and memory operations have a large effect on power consumption since the area of the ML accelerator is typically more than 50% memory.
  • an apparatus for performing a depthwise convolution operation comprises at least one processor and at least one memory.
  • the at least one memory comprises a computer program code.
  • the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to operate at least as follows.
  • the apparatus receives a set of input data channels.
  • the apparatus performs the depthwise convolution operation independently for each input data channel of the set of input data channels. More specifically, the depthwise convolution operation is performed by scanning the input data channel with a convolution filter.
  • the apparatus calculates at least one output data element of an output data channel, determines whether at least one partial convolution result obtained at the scanning position of the convolution filter is required for calculating at least one next output data element of the output data channel at one or more next scanning positions of the convolution filter, and accumulates the at least one partial result if it is determined that the at least one partial result is required for calculating the at least one next output data element of the output data channel at the one or more next scanning positions of the convolution filter.
  • the partial convolution result(s) obtained at the previous scanning position of the convolution filter may be summed or accumulated together with one or more new partial convolution results obtained at the current scanning position of the convolution filter to produce the output data element(s) of the output data channel.
  • the apparatus does not need to reload or reread each input data element from the input data channel at the current scanning position of the convolution filter if that input data element has been used at the previous scanning position of the convolution filter.
  • the apparatus thus configured may efficiently support different convolution parameters (kernel sizes, shapes, padding, dilation, etc.) for each input data channel - i.e., each input data element is loaded only once independently of the filter size.
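The "loaded only once" property above can be quantified with a small back-of-the-envelope sketch (not part of the patent; function names are invented for illustration). It compares the input reads of a naive sliding-window scheme, which re-fetches every element of each filter window, with a single-pass scheme that loads each input element exactly once:

```python
def sliding_window_reads(h, w, k):
    """Input reads if every k x k window is fetched from memory in full:
    each output element of a 'valid' convolution re-reads k*k inputs."""
    return (h - k + 1) * (w - k + 1) * k * k

def single_pass_reads(h, w, k):
    """Input reads under the single-load scheme: each input element is
    read once, regardless of the filter size k."""
    return h * w

# For a 3x3 filter on a 64x64 channel, the naive scheme reads an interior
# element up to 9 times; the single-pass scheme reads it exactly once.
```

For larger filters the gap widens further, since `sliding_window_reads` grows with k*k while `single_pass_reads` does not depend on k at all.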
  • the at least one processor comprises a processor register configured to store a set of filter coefficients.
  • the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to define the convolution filter based on the set of filter coefficients.
  • the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the depthwise convolution operation for each input data channel of the set of input data channels in parallel. By so doing, it is possible to reduce the execution time of the depthwise convolution operation.
  • each of the input data channel and the output data channel is a 2D data array.
  • the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to calculate, during the depthwise convolution operation, the output data channel row by row.
  • This "row by row” approach may be beneficial in some use scenarios (e.g., when it is impossible to implement the "at least two rows in parallel” approach discussed below).
  • the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to calculate, during the depthwise convolution operation, at least two rows of output data elements of the output data channel in parallel. This may also allow reducing the execution time of the depthwise convolution operation.
  • a method for performing a depthwise convolution operation starts with the step of receiving a set of input data channels. Then, the method proceeds to the step of performing the depthwise convolution operation independently for each input data channel of the set of input data channels. More specifically, the depthwise convolution operation involves scanning the input data channel with a convolution filter. At each scanning position of the convolution filter, at least one output data element of an output data channel is calculated, and it is determined whether at least one partial convolution result obtained at the scanning position of the convolution filter is required for calculating at least one next output data element of the output data channel at one or more next scanning positions of the convolution filter.
  • the at least one partial result is then accumulated if it is determined that the at least one partial result is required for calculating the at least one next output data element of the output data channel at the one or more next scanning positions of the convolution filter.
  • the partial convolution result(s) obtained at the previous scanning position of the convolution filter may be summed or accumulated together with one or more new partial convolution results obtained at the current scanning position of the convolution filter to produce the output data element(s) of the output data channel. In other words, there is no need to reload or reread each input data element from the input data channel at the current scanning position of the convolution filter if that input data element has been used at the previous scanning position of the convolution filter.
  • the method may efficiently support different convolution parameters (kernel sizes, shapes, padding, dilation, etc.) for each input data channel - i.e., each input data element is loaded only once independently of the filter size.
  • the method further comprises the steps of storing a set of filter coefficients in a processor register and defining the convolution filter based on the set of filter coefficients.
  • the depthwise convolution operation is performed for each input data channel of the set of input data channels in parallel. By so doing, it is possible to reduce the execution time of the depthwise convolution operation.
  • each of the input data channel and the output data channel is a 2D data array.
  • the output data channel is calculated row by row during the depthwise convolution operation. This "row by row” approach may be beneficial in some use scenarios (e.g., when it is impossible to implement the "at least two rows in parallel” approach).
  • At least two rows of output data elements of the output data channel are calculated in parallel during the depthwise convolution operation. This may also allow reducing the execution time of the depthwise convolution operation.
  • a computer program product comprises a computer-readable storage medium that stores a computer code. Being executed by at least one processor, the computer code causes the at least one processor to perform the method according to the second aspect.
  • an apparatus for performing a depthwise convolution operation comprises a means for receiving a set of input data channels and a means for performing the depthwise convolution operation independently for each input data channel of the set of input data channels. More specifically, the depthwise convolution operation is performed by scanning the input data channel with a convolution filter. At each scanning position of the convolution filter, at least one output data element of an output data channel is calculated, and it is determined whether at least one partial convolution result obtained at the scanning position of the convolution filter is required for calculating at least one next output data element of the output data channel at one or more next scanning positions of the convolution filter.
  • the at least one partial result is then accumulated if it is determined that the at least one partial result is required for calculating the at least one next output data element of the output data channel at the one or more next scanning positions of the convolution filter.
  • the partial convolution result(s) obtained at the previous scanning position of the convolution filter may be summed or accumulated together with one or more new partial convolution results obtained at the current scanning position of the convolution filter to produce the output data element(s) of the output data channel.
  • the apparatus does not need to reload or reread each input data element from the input data channel at the current scanning position of the convolution filter if that input data element has been used at the previous scanning position of the convolution filter.
  • the apparatus thus configured may efficiently support different convolution parameters (kernel sizes, shapes, padding, dilation, etc.) for each input data channel - i.e., each input data element is loaded only once independently of the filter size.
  • FIG. 1 shows a block diagram of an apparatus for performing a depthwise convolution operation in accordance with one example embodiment
  • FIG. 2 shows a block diagram of a processor that may be used in the apparatus of FIG. 1 in accordance with one example embodiment
  • FIG. 3 explains how two parallel-connected processors of FIG. 2 may perform the depthwise convolution operation on two different input data channels in parallel by using the same operation commands or programming parameters;
  • FIG. 4 shows a flowchart of a method for operating the apparatus of FIG. 1 in accordance with one example embodiment
  • FIGs. 5A and 5B schematically illustrate how the first and middle rows of output data elements of an output data channel are calculated based on the "row by row" approach by using the method of FIG. 4;
  • FIGs. 6A and 6B schematically illustrate how rows of output data elements of an output data channel are calculated based on the "two rows in parallel" approach by using the method of FIG. 4;
  • FIG. 7 schematically illustrates how all rows of output data elements of an output data channel are calculated based on the "all rows in parallel” approach by using the method of FIG. 4;
  • FIG. 8 schematically illustrates another example of how to calculate all rows of output data elements of an output data channel based on the "all rows in parallel" approach by using the method of FIG. 4.
  • Depthwise-separable convolutions are a well understood and widely used ML model architecture technique that reduces the total number of calculation operations needed to achieve a desired level of accuracy in an output result when compared to the traditional 3D convolution (also referred to as "full convolution") with a similar filter size. More specifically, the depthwise-separable convolution replaces one full convolution layer with a filter size N X M X C with the following two layers: a depthwise convolution layer that applies a separate N X M filter to each of the C input data channels, and a pointwise convolution layer that combines the resulting channels with 1 X 1 filters.
  • although depthwise convolutions are not mathematically identical to the full convolution, they have been found in practice to give similar results with many fewer calculation operations.
  • the pointwise convolution can generally be computed with good efficiency using the same hardware (HW) (typically a matrix multiplier unit) as the full convolution.
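The operation-count saving of the depthwise + pointwise split can be checked with simple arithmetic. The sketch below is illustrative only (the layer sizes are invented, not taken from the patent); it counts multiply-accumulate (MAC) operations for a full convolution versus its depthwise-separable replacement:

```python
def conv_macs(h, w, n, m, c_in, c_out):
    """MAC counts for one layer at output resolution h x w.
    `full` is a standard n x m x c_in convolution producing c_out channels;
    `separable` is its depthwise (n x m per channel) + pointwise (1x1) split."""
    full = h * w * n * m * c_in * c_out
    separable = h * w * (n * m * c_in + c_in * c_out)
    return full, separable

# Illustrative sizes: 56x56 feature map, 3x3 filters, 128 channels in and out.
full, sep = conv_macs(56, 56, 3, 3, 128, 128)
print(full / sep)  # roughly 8.4x fewer MACs for the separable form
```

The ratio approaches 1 / (1/c_out + 1/(n*m)), so for a 3x3 filter the separable form saves close to a factor of 9 when the channel count is large.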
  • the depthwise convolutions do not map well to matrix multiplication since they operate independently on each input data channel.
  • the depthwise convolution should be calculated with a different kind of HW.
  • there exists an ML HW accelerator designed and optimized for increasing the execution accuracy of depthwise convolution operations on computer devices.
  • this ML HW accelerator is not efficient in terms of power consumption, which is mainly caused by many memory operations (e.g., memory read operations) used in the depthwise convolution operations.
  • the technical solution disclosed herein relates to a technique for efficiently (in terms of a number of read operations) performing a depthwise convolution operation on a set of input data channels.
  • a depthwise convolution operation is independently performed for each input data channel of the set of input data channels by scanning the input data channel with a convolution filter.
  • At each scanning position of the convolution filter, at least one output data element of an output data channel is calculated; in addition, at least one partial convolution result that is required for calculating at least one other output data element of the output data channel at one or more next scanning positions of the convolution filter is determined and accumulated.
  • each input data element of each input data channel is read or loaded only once, all needed partial results that can be computed from that input data element are calculated as soon as possible (based on processor capabilities, e.g., a number of available processing units included in a processor) and then later summed or accumulated together with other partial results to produce the output data element(s). All of this allows reducing the number of power-expensive memory read operations during the depthwise convolution operation, thereby decreasing power consumption.
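To make the scheme concrete, the following Python sketch (illustrative only; the function and variable names are invented, and a square "valid" convolution is assumed) processes one channel so that each input element is read exactly once: all partial products it contributes to are computed immediately and accumulated into the affected outputs, rather than re-reading the element at later filter positions:

```python
import numpy as np

def depthwise_single_read(x, w):
    """One-channel depthwise convolution, scatter/accumulate style:
    every input element x[i, j] is read once, and each of its partial
    products is routed to the output accumulator that needs it."""
    H, W = x.shape
    K = w.shape[0]                       # square K x K filter assumed
    out = np.zeros((H - K + 1, W - K + 1))
    for i in range(H):
        for j in range(W):
            v = x[i, j]                  # the single read of this element
            for di in range(K):          # scatter v into every output
                for dj in range(K):      # whose filter window covers (i, j)
                    oi, oj = i - di, j - dj
                    if 0 <= oi <= H - K and 0 <= oj <= W - K:
                        out[oi, oj] += v * w[di, dj]
    return out
```

The result is identical to the usual gather-style convolution (each output summing its K x K window); only the order of memory accesses differs, which is exactly where the power saving comes from.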
  • FIG. 1 shows a block diagram of an apparatus 100 for performing a depthwise convolution operation in accordance with one example embodiment.
  • the apparatus 100 may be part of any electronic computing device or User Equipment (UE).
  • Some examples of the UE may include, but are not limited to, a mobile station, a mobile terminal, a mobile subscriber unit, a mobile phone, a cellular phone, a smart phone, a cordless phone, a personal digital assistant (PDA), a wireless communication device, a desktop computer, a laptop computer, a tablet computer, a gaming device, a netbook, a smartbook, an ultrabook, a medical mobile device or equipment, a biometric sensor, a wearable device (e.g., a smart watch, smart glasses, a smart wrist band, etc.), an entertainment device (e.g., an audio player, a video player, etc.), a vehicular component or sensor (e.g., a driver-assistance system), a smart meter/sensor, an unmanned vehicle, etc.
  • the apparatus 100 comprises a set of processors 102-1, 102-2, ..., 102-N and a memory 104.
  • the memory 104 stores processor-executable instructions 106 which, when executed by each of the set of processors 102-1, 102-2, ..., 102-N, cause the apparatus 100 to perform the aspects of the present disclosure, as will be described below in more detail.
  • the number, arrangement, and interconnection of the constructive elements constituting the apparatus 100, which are shown in FIG. 1, are not intended to be any limitation of the present disclosure, but merely used to provide a general idea of how the constructive elements may be implemented within the apparatus 100.
  • each or some of the set of processors 102-1, 102-2, ..., 102-N may be replaced with several processors, and the memory 104 may be replaced with several removable and/or fixed storage devices, depending on particular applications.
  • the apparatus 100 may also comprise a single processor (instead of the multiple processors 102-1, 102-2, ..., 102-N) which is configured to perform all the steps discussed below.
  • Each of the processors 102-1, 102-2, ..., 102-N may be implemented as a CPU, general-purpose processor, single-purpose processor, microcontroller, microprocessor, application specific integrated circuit (ASIC), field programmable gate array (FPGA), digital signal processor (DSP), complex programmable logic device, etc. It should be also noted that each of the processors 102-1, 102-2, ..., 102-N may be implemented as any combination of one or more of the aforesaid. As an example, each of the processors 102-1, 102-2, ..., 102-N may be a combination of two or more microprocessors.
  • the memory 104 may be implemented as a classical nonvolatile or volatile memory used in modern electronic computing machines.
  • the nonvolatile memory may include Read-Only Memory (ROM), ferroelectric Random-Access Memory (RAM), Programmable ROM (PROM), Electrically Erasable PROM (EEPROM), solid state drive (SSD), flash memory, magnetic disk storage (such as hard drives and magnetic tapes), optical disc storage (such as CD, DVD and Blu-ray discs), etc.
  • as for the volatile memory, examples thereof include Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Static RAM (SRAM), etc.
  • the processor-executable instructions 106 stored in the memory 104 may be configured as a computer-executable program code which causes each of the processors 102-1, 102-2, ..., 102-N to perform the aspects of the present disclosure.
  • the computer-executable program code for carrying out operations or steps for the aspects of the present disclosure may be written in any combination of one or more programming languages, such as Java, C++, or the like.
  • the computer-executable program code may be in the form of a high-level language or in a pre-compiled form and be generated by an interpreter (also pre-stored in the memory 104) on the fly.
  • FIG. 2 shows a block diagram of a processor 200 that may be used in the apparatus 100 in accordance with one example embodiment. More specifically, each of the processors 102-1, 102-2, ..., 102-N of the apparatus 100 may be implemented as the processor 200 or two or more parallel-connected processors 200.
  • the processor 200 comprises three processing units 202-1, 202-2, 202-3, which are coupled in parallel to each other.
  • the processing unit 202-1 comprises a multiplier 204-1 and a register 206-1
  • the processing unit 202-2 comprises a multiplier 204-2 and a register 206-2
  • the processing unit 202-3 comprises a multiplier 204-3 and a register 206-3.
  • Each of the multipliers 204-1, 204-2, 204-3 is configured to perform elementwise multiplication operations (which are part of the depthwise convolution) on input data from an input data channel by using a convolution filter.
  • Each of the registers 206-1, 206-2, 206-3 is configured to store filter coefficients for the convolution filter. Thus, the filter coefficients are stored locally in the processor 200.
  • the input data channel may be processed by one or more of the processing units 202-1, 202-2, 202-3. Which of them is used is decided by a selection logic 208. At the same time, the selection logic 208 is optional: the input data channel may be (e.g., evenly) divided into three sub-channels, each processed by one of the processing units 202-1, 202-2, 202-3.
  • the results of the elementwise multiplication operations are provided by the multipliers 204-1, 204-2, 204-3 to another selection logic 210 which is configured to decide which of summators or accumulators 212-1, 212-2, 212-3 should accumulate them.
  • Each of the accumulators 212-1, 212-2, 212-3 is a type of register for short-term, intermediate storage of arithmetic and logic data in the processor 200.
  • the processor 200 also comprises a program dispatching unit 214 which is configured to provide operation commands or programming parameters (e.g., the instructions 106) to each of the constructive elements of the processor 200 (e.g., the operation command for the selection logic 208 may cause the selection logic 208 to select the processing units 202-1 and 202-3 for further processing of the input data channel).
  • the number of the parallel processing units and/or the number of the accumulators may change depending on particular applications (e.g., the capabilities of the processor 200).
  • since the parallel processors 102-1, 102-2, ..., 102-N, each implemented as the processor 200, may operate at the same time, they may share the same programming parameters (e.g., the instructions 106), so that only one set of programming parameters is needed for all depthwise convolution operations. This significantly reduces parameter memory requirements, as well as the power consumption needed to read the programming parameters (e.g., the instructions 106), compared to a situation in which the processing of the input data channels is not performed in parallel or is performed by entirely independent processors, each having its own parameter memory.
  • FIG. 3 explains how two parallel-connected processors 200 may perform the depthwise convolution operation on two different input data channels in parallel by using the same operation commands or programming parameters.
  • the program dispatching units 214 of the two processors 200 may receive the same operation commands or programming parameters (e.g., the instructions 106) and cause the other constructive elements of the processors 200 to operate similarly based on these operation commands or programming parameters.
  • the number of the parallel-connected processors 200 may vary depending on particular applications, but all of them may share the same operation commands or programming parameters.
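The sharing of one parameter set across parallel channel engines can be sketched as follows. This is a purely illustrative software model (the class and parameter names are invented, not the patent's): each engine holds its filter coefficients in local "registers", while a single `params` object stands in for the shared operation commands broadcast by the program dispatching units:

```python
import numpy as np

class ChannelEngine:
    """Toy model of one per-channel convolution engine (cf. processor 200)."""
    def __init__(self, coeffs):
        self.coeffs = coeffs            # filter held locally, like a register

    def run(self, channel, params):
        k = params["kernel"]            # shared programming parameter
        H, W = channel.shape
        out = np.zeros((H - k + 1, W - k + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(channel[i:i + k, j:j + k] * self.coeffs)
        return out

params = {"kernel": 3}                  # ONE parameter set for all engines
channels = [np.random.rand(8, 8) for _ in range(4)]
engines = [ChannelEngine(np.ones((3, 3))) for _ in channels]
outs = [e.run(c, params) for e, c in zip(engines, channels)]
```

Because `params` is read once and reused by every engine, parameter storage and parameter-read traffic do not grow with the number of channels, mirroring the saving described above.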
  • FIG. 4 shows a flowchart of a method 400 for operating the apparatus 100 in accordance with one example embodiment.
  • the method 400 starts with a step S402, in which the apparatus 100 receives a set of input data channels. Then, the method 400 proceeds to a step S404, in which the apparatus 100 performs (by using the processors 102-1, 102-2, ..., 102-N) the depthwise convolution operation independently for each input data channel of the set of input data channels. More specifically, each of the processors 102-1, 102-2, ..., 102-N scans a different input data channel of the set of input data channels with a convolution filter (i.e., the convolution filter is moved across the input data channel).
  • the convolution filter may be defined based on a set of filter coefficients which may be locally stored in one or more processor registers (like the registers 206-1, 206-2, 206-3) of each of the processors 102-1, 102-2, ..., 102-N.
  • each of the processors 102-1, 102-2, ..., 102-N calculates (e.g., by using the multipliers 204-1, 204-2, 204-3) one or more output data elements of an output data channel and determines whether one or more partial convolution results obtained at the scanning position are required for calculating one or more next output data elements of the output data channel at one or more next scanning positions of the convolution filter.
  • each of the processors 102-1, 102-2, ..., 102-N accumulates the corresponding partial convolution result(s) (e.g., in one or more of the accumulators 212-1, 212-2, 212-3). It should be noted that the processors 102-1, 102-2, ..., 102-N may perform the step S404 for the whole set of input data channels in parallel.
  • each of the processors 102-1, 102-2, ..., 102-N may calculate the output data channel row by row during the depthwise convolution operation.
  • each of the processors 102-1, 102-2, ..., 102-N may calculate two or more rows of output data elements of the output data channel in parallel during the depthwise convolution operation (e.g., the multipliers 204-1, 204-2, 204-3 may calculate multiple rows in parallel and store the final results in the accumulators 212-1, 212-2, 212-3 where the final results are ready only after multiple cycles, i.e., when all partial convolution results for a given final result are obtained and summed together).
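The rows-in-parallel scheme with multi-cycle accumulators can be sketched in Python. This is an illustrative model under stated assumptions (square "valid" K x K filter, one input column arriving per cycle; the function name is invented): a bank of rows x K live accumulators is kept (6 for two rows with a 3x3 filter, 12 for four rows, matching the figures), and an output column is emitted only after all its partial convolution results have arrived, several cycles after it was started:

```python
import numpy as np

def stream_rows(x, w, rows):
    """Compute `rows` output rows in parallel by streaming input columns.
    Each freshly loaded element is multiplied into every live accumulator
    that needs it, so no input element is ever re-read."""
    K = w.shape[0]
    acc = np.zeros((rows, K))            # rolling bank: rows x K accumulators
    out = []
    for j in range(x.shape[1]):          # cycle j: input column j is read once
        for i in range(rows + K - 1):    # elements delivered this cycle
            v = x[i, j]
            for di in range(K):          # scatter into live accumulators
                oi = i - di
                if 0 <= oi < rows:
                    for dj in range(K):
                        oj = j - dj      # target output column
                        if 0 <= oj <= x.shape[1] - K:
                            acc[oi, oj % K] += v * w[di, dj]
        oj_done = j - K + 1              # this column is complete only now
        if oj_done >= 0:
            out.append(acc[:, oj_done % K].copy())
            acc[:, oj_done % K] = 0.0    # free the slot for a later column
    return np.stack(out, axis=1)         # shape: rows x (W - K + 1)
```

Note that the accumulator count is rows * K regardless of the image width, since only K output columns per row are ever in flight at once.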
  • upon reading a new input data element, the processor 102-1 will calculate all the partial convolution results where this input data element is used. By so doing, the processor 102-1 (and similarly the other processors 102-2, ..., 102-N) may read each input data element only once.
  • since the filter coefficients are fixed, it is easy to keep all of them in a small number of registers of each of the processors 102-1, 102-2, ..., 102-N (e.g., close to the processing units 202-1, 202-2, 202-3) and avoid memory read operations for the filter coefficients.
  • if there are as many processing units 202-1, 202-2, 202-3 (i.e., the multipliers 204-1, 204-2, 204-3) as the filter coefficients (for example, 9 for a 3x3 filter), then each of the processing units 202-1, 202-2, 202-3 requires only a single register for the filter coefficients.
  • the processing units 202-1, 202-2, 202-3 may have more registers to be able to support a larger maximum filter size, or more smaller filters in parallel if more simultaneous input data elements are available.
  • the filter size may be, for example, 3x3, while the input data may comprise, for example, 4096x4096 data elements. Even though filter sizes may be larger, e.g., 7x7 or even 11x11, the filter is still much smaller than the input data. Small filters are commonly used in almost all known CNNs since increasing the filter size does not usually improve the results.
  • FIGs. 5A and 5B schematically illustrate how the first and middle rows of output data elements of an output data channel are calculated based on the "row by row" approach by using the method 400.
  • the last row of output data elements may be calculated similarly to the first row of output data elements.
  • FIGs. 5A and 5B illustrate a total cycle count for one input data channel (e.g., processed by the processor 102-1 implemented as the processor 200) in accordance with the "row by row” approach where input transfer (reading or loading), elementwise multiplication and output processing (e.g., activation, quantization and output write) happen in parallel.
  • One or more other input data channels may be processed (e.g., by one or more of the other processors 102-2, ..., 102-N each implemented as the processor 200) in the same manner.
  • Black filled boxes are (final) output data elements calculated during the depthwise convolution operation, and they are assumed to be stored in the accumulator shown as bold, i.e., the bold font indicates which accumulator result is taken to the output processing.
  • Dot filled boxes correspond to zero padding or input data elements close to zero (without padding there are no extra zeros, and all rows are processed similarly to the middle rows). Boxes having a pattern fill in the form of diagonal stripes are new input data elements needed for a certain cycle of the depthwise convolution operation.
  • Boxes having a pattern fill in the form of horizontal stripes are old input data elements that have been used already, i.e., they represent the partial convolution results already stored in the accumulators.
  • the example shown in FIGs. 5A and 5B uses a 3x3 filter, and this already shows a 2 to 3 times reduction in data transfer compared to the number of elementwise multiplication operations.
  • the elementwise multiplication dominates time in this example. In the depthwise convolution, each input data channel is processed independently; thus, if there are, for example, 256 input data channels, the apparatus 100 (i.e., the processors 102-1, 102-2, ..., 102-N) may process all these channels in parallel.
  • FIGs. 6A and 6B schematically illustrate how rows of output data elements of an output data channel are calculated based on the "two rows in parallel" approach by using the method 400. More specifically, FIGs. 6A and 6B show a similar approach as in FIGs. 5A and 5B, but with two rows, i.e., two output data elements, calculated per column of input data elements. Black filled boxes are again (final) output data elements calculated during the depthwise convolution operation, and they are assumed to be stored in the accumulator shown as bold, i.e., the bold font indicates which accumulator result is taken to the output processing. A contour area indicates the input data elements needed to calculate the output data elements.
  • the calculation is based on the partial convolution results, so it is performed as soon as new input data elements (the last column, shown as a shaded box) arrive at the corresponding one of the processors 102-1, 102-2, ..., 102-N. This shows that, normally, 4 new input data elements enable 18 multiplications, i.e., a 4.5 times saving in data transfer.
  • FIG. 7 schematically illustrates how all rows of output data elements of an output data channel are calculated based on the "all rows in parallel" approach by using the method 400. More specifically, FIG. 7 shows a similar approach as in FIGs. 6A and 6B, but with four rows calculated in parallel. In this example, a small number of input data elements is used for convenience only; the number of input data elements is much bigger in reality, and thus there will be similar middle row calculations as in the previous example shown in FIGs. 6A and 6B. This example again needs more accumulators, i.e., 12 compared to 6 in the example shown in FIGs. 6A and 6B.
  • the benefit is also bigger, i.e., 30 multiplications based on 4 input data elements, i.e., a saving factor of 7.5. This demonstrates the input data saving approach.
  • the term "unit" refers to a parallel processing engine, i.e., existing HW pipes that operate in parallel.
  • the term "layer" refers to an input data channel like the one shown in FIG. 2, for example.
  • this 4-row calculation starts to be a good compromise when the filter size is 3x3. When the filter size is bigger, it is beneficial to have even more rows in parallel because each column of output data elements needs a larger set of input data elements.
  • FIG. 8 schematically illustrates another example of how to calculate all rows of output data elements of an output data channel based on the "all rows in parallel" approach by using the method 400. More specifically, FIG. 8 illustrates the practical HW solution for good throughput, where each input data element is read once and input data transmission is the practical limit of the throughput. Since each input data element is used only once, this is a highly efficient HW solution for the given example.
  • This solution needs multiple parallel multipliers to process the input data in parallel. In this example, 9 parallel multipliers are required, while the number of multipliers is in general a design parameter of the HW. However, the target is normally to match the input data transfer and multiplication capabilities, as shown in this example. This example also shows that in a real system there are multiple sub-operations.
  • the last step, in which there is no input data element transmission, is combined with the first cycle of the next data, where 8 input data elements are needed before any result is obtained. This leads to a solution in which, after the initial pipeline latency, the system can take in new input data elements in every cycle and produce a result in every cycle, which gives a predictable system throughput with the input data element transfer capability as the sizing factor.
  • each step or operation of the method 400 can be implemented by various means, such as hardware, firmware, and/or software.
  • one or more of the steps or operations described above can be embodied by processor executable instructions, data structures, program modules, and other suitable data representations.
  • the processor-executable instructions which embody the steps or operations described above can be stored on a corresponding data carrier and executed by any one or more of the processors 102-1, 102-2, ..., 102-N.
  • This data carrier can be implemented as any computer-readable storage medium configured to be readable by said at least one processor to execute the processor executable instructions.
  • Such computer-readable storage media can include both volatile and nonvolatile media, removable and non-removable media.
  • the computer-readable media comprise media implemented in any method or technology suitable for storing information.
  • the practical examples of the computer-readable media include, but are not limited to information-delivery media, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile discs (DVD), holographic media or other optical disc storage, magnetic tape, magnetic cassettes, magnetic disk storage, and other magnetic storage devices.
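The partial-result accumulation described in the examples above can be sketched in software. The following Python sketch is illustrative only (the function name, "valid" padding, and the single-channel scope are assumptions, not part of the disclosure): each input data element is read exactly once, multiplied by every filter coefficient that can use it, and the partial products are accumulated into the output data elements they contribute to.

```python
def depthwise_channel(inp, filt):
    """Single-channel depthwise convolution where each input element is read once."""
    ih, iw = len(inp), len(inp[0])
    fh, fw = len(filt), len(filt[0])
    oh, ow = ih - fh + 1, iw - fw + 1
    acc = [[0.0] * ow for _ in range(oh)]   # one accumulator per output data element
    for y in range(ih):
        for x in range(iw):
            v = inp[y][x]                   # the single read of this input element
            for fy in range(fh):
                for fx in range(fw):
                    oy, ox = y - fy, x - fx # output position this partial product feeds
                    if 0 <= oy < oh and 0 <= ox < ow:
                        acc[oy][ox] += v * filt[fy][fx]
    return acc

out = depthwise_channel([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1, 0], [0, 1]])
print(out)  # [[6.0, 8.0], [12.0, 14.0]]
```

A direct convolution over the same data yields the same result, confirming that scattering partial products into accumulators is mathematically equivalent to the usual gather formulation.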

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Neurology (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Image Processing (AREA)

Abstract

The present disclosure relates to a technique for performing a depthwise convolution operation on a set of input data channels. For this purpose, a depthwise convolution operation is independently performed for each input data channel by scanning it with a convolution filter. At each scanning position of the filter, at least one output data element of an output data channel is calculated, as well as at least one partial convolution result which is required for at least one next output data element of the output data channel at one or more next scanning positions of the filter is determined and accumulated. The accumulated partial convolution results may be further used together with new partial convolution results to produce the output data element(s) at the next scanning position(s) of the filter. This allows reducing the number of power-expensive memory read operations during the depthwise convolution operation, thereby decreasing power consumption.

Description

APPARATUS AND METHOD FOR PERFORMING A DEPTHWISE CONVOLUTION OPERATION
TECHNICAL FIELD
The present disclosure relates generally to the field of neural networks, and particularly to a technique for efficiently performing a depthwise convolution operation on a set of input data channels.
BACKGROUND
Machine Learning (ML) processing has proven to solve multiple problems better than traditional computational methods. Among different ML models, those based on Convolutional Neural Networks (CNN) are of particular interest, since they may be used for much-needed applications, such as image recognition and detection, speech recognition, very high data rate processing (e.g., physical layer or layer 1 processing in a wireless communication network), etc. Since a CNN requires many operations, reducing the number of parameters and the amount of computation of the CNN is constantly under discussion. A depthwise-separable CNN is a special CNN that reduces the amount of computation by decomposing the traditional 3D convolution into two convolution operations, namely depthwise convolution and pointwise convolution, while its calculation accuracy differs little from that of the traditional 3D convolution.
There are also different hardware (HW) architectures designed and optimized for increasing the speed, efficiency and accuracy of computer devices that are assumed to run the depthwise-separable CNN. One of them is an ML accelerator. However, the data transfer and memory operations (e.g., memory read operations) required for the execution of the depthwise convolution have a large effect on power consumption since the area of the ML accelerator is typically more than 50% memory.
Therefore, there is a need for a solution that allows reducing the amount of memory read operations and data transfer required for depthwise convolution processing such that its execution can be pipelined efficiently with the pointwise convolution running in parallel on a different HW unit.
SUMMARY
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features of the present disclosure, nor is it intended to be used to limit the scope of the present disclosure.
It is an objective of the present disclosure to provide a technical solution that allows reducing a number of power-expensive memory read operations during the depthwise convolution.
The objective above is achieved by the features of the independent claims in the appended claims. Further embodiments and examples are apparent from the dependent claims, the detailed description, and the accompanying drawings.
According to a first aspect, an apparatus for performing a depthwise convolution operation is provided. The apparatus comprises at least one processor and at least one memory. The at least one memory comprises a computer program code. The at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to operate at least as follows. At first, the apparatus receives a set of input data channels. Then, the apparatus performs the depthwise convolution operation independently for each input data channel of the set of input data channels. More specifically, the depthwise convolution operation is performed by scanning the input data channel with a convolution filter. At each scanning position of the convolution filter, the apparatus calculates at least one output data element of an output data channel, determines whether at least one partial convolution result obtained at the scanning position of the convolution filter is required for calculating at least one next output data element of the output data channel at one or more next scanning positions of the convolution filter, and accumulates the at least one partial result if it is determined that the at least one partial result is required for calculating the at least one next output data element of the output data channel at the one or more next scanning positions of the convolution filter. Thus, the partial convolution result(s) obtained at the previous scanning position of the convolution filter may be summed or accumulated together with one or more new partial convolution results obtained at the current scanning position of the convolution filter to produce the output data element(s) of the output data channel. 
In other words, the apparatus does not need to reload or reread each input data element from the input data channel at the current scanning position of the convolution filter if that input data element has been used at the previous scanning position of the convolution filter. By so doing, it is possible to reduce the total number of memory read operations needed for the execution of the depthwise convolution operation, while achieving a level of accuracy in the output data element(s) comparable to that achieved with the traditional 3D convolution. Furthermore, the apparatus thus configured may efficiently support different convolution parameters (kernel sizes, shapes, padding, dilation, etc.) for each input data channel - i.e., each input data element is loaded only once independently of the filter size.
In one example embodiment of the first aspect, the at least one processor comprises a processor register configured to store a set of filter coefficients. In this embodiment, the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to define the convolution filter based on the set of filter coefficients. By storing all filter coefficients in the processor register, it is possible to avoid memory read operations required to obtain them during the depthwise convolution operation.
In one example embodiment of the first aspect, the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the depthwise convolution operation for each input data channel of the set of input data channels in parallel. By so doing, it is possible to reduce the execution time of the depthwise convolution operation.
In one example embodiment of the first aspect, each of the input data channel and the output data channel is a 2D data array. In this embodiment, the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to calculate, during the depthwise convolution operation, the output data channel row by row. This "row by row" approach may be beneficial in some use scenarios (e.g., when it is impossible to implement the "at least two rows in parallel" approach discussed below).
In another example embodiment of the first aspect, the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to calculate, during the depthwise convolution operation, at least two rows of output data elements of the output data channel in parallel. This may also allow reducing the execution time of the depthwise convolution operation.
According to a second aspect, a method for performing a depthwise convolution operation is provided. The method starts with the step of receiving a set of input data channels. Then, the method proceeds to the step of performing the depthwise convolution operation independently for each input data channel of the set of input data channels. More specifically, the depthwise convolution operation involves scanning the input data channel with a convolution filter. At each scanning position of the convolution filter, at least one output data element of an output data channel is calculated, as well as it is determined whether at least one partial convolution result obtained at the scanning position of the convolution filter is required for calculating at least one next output data element of the output data channel at one or more next scanning positions of the convolution filter. The at least one partial result is then accumulated if it is determined that the at least one partial result is required for calculating the at least one next output data element of the output data channel at the one or more next scanning positions of the convolution filter. Thus, the partial convolution result(s) obtained at the previous scanning position of the convolution filter may be summed or accumulated together with one or more new partial convolution results obtained at the current scanning position of the convolution filter to produce the output data element(s) of the output data channel. In other words, there is no need to reload or reread each input data element from the input data channel at the current scanning position of the convolution filter if that input data element has been used at the previous scanning position of the convolution filter. 
By so doing, it is possible to reduce the total number of memory read operations needed for the execution of the depthwise convolution operation, while achieving a level of accuracy in the output data element(s) comparable to that achieved with the traditional 3D convolution. Furthermore, the method may efficiently support different convolution parameters (kernel sizes, shapes, padding, dilation, etc.) for each input data channel - i.e., each input data element is loaded only once independently of the filter size.
In one example embodiment of the second aspect, the method further comprises the steps of storing a set of filter coefficients in a processor register and defining the convolution filter based on the set of filter coefficients. By storing all filter coefficients in the processor register, it is possible to avoid memory read operations required to obtain them during the depthwise convolution operation.
In one example embodiment of the second aspect, the depthwise convolution operation is performed for each input data channel of the set of input data channels in parallel. By so doing, it is possible to reduce the execution time of the depthwise convolution operation.
In one example embodiment of the second aspect, each of the input data channel and the output data channel is a 2D data array. In this embodiment, the output data channel is calculated row by row during the depthwise convolution operation. This "row by row" approach may be beneficial in some use scenarios (e.g., when it is impossible to implement the "at least two rows in parallel" approach).
In another example embodiment of the second aspect, at least two rows of output data elements of the output data channel are calculated in parallel during the depthwise convolution operation. This may also allow reducing the execution time of the depthwise convolution operation.
According to a third aspect, a computer program product is provided. The computer program product comprises a computer-readable storage medium that stores a computer code. Being executed by at least one processor, the computer code causes the at least one processor to perform the method according to the second aspect. By using such a computer program product, it is possible to simplify the implementation of the method according to the second aspect in any computing device, like the apparatus according to the first aspect.
According to a fourth aspect, an apparatus for performing a depthwise convolution operation is provided. The apparatus comprises a means for receiving a set of input data channels and a means for performing the depthwise convolution operation independently for each input data channel of the set of input data channels. More specifically, the depthwise convolution operation is performed by scanning the input data channel with a convolution filter. At each scanning position of the convolution filter, at least one output data element of an output data channel is calculated, as well as it is determined whether at least one partial convolution result obtained at the scanning position of the convolution filter is required for calculating at least one next output data element of the output data channel at one or more next scanning positions of the convolution filter. The at least one partial result is then accumulated if it is determined that the at least one partial result is required for calculating the at least one next output data element of the output data channel at the one or more next scanning positions of the convolution filter. Thus, the partial convolution result(s) obtained at the previous scanning position of the convolution filter may be summed or accumulated together with one or more new partial convolution results obtained at the current scanning position of the convolution filter to produce the output data element(s) of the output data channel. In other words, the apparatus does not need to reload or reread each input data element from the input data channel at the current scanning position of the convolution filter if that input data element has been used at the previous scanning position of the convolution filter. 
By so doing, it is possible to reduce the total number of memory read operations needed for the execution of the depthwise convolution operation, while achieving a level of accuracy in the output data element(s) comparable to that achieved with the traditional 3D convolution. Furthermore, the apparatus thus configured may efficiently support different convolution parameters (kernel sizes, shapes, padding, dilation, etc.) for each input data channel - i.e., each input data element is loaded only once independently of the filter size.
Other features and advantages of the present disclosure will be apparent upon reading the following detailed description and reviewing the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
The present disclosure is explained below with reference to the accompanying drawings in which:
FIG. 1 shows a block diagram of an apparatus for performing a depthwise convolution operation in accordance with one example embodiment;
FIG. 2 shows a block diagram of a processor that may be used in the apparatus of FIG. 1 in accordance with one example embodiment;
FIG. 3 explains how two parallel-connected processors of FIG. 2 may perform the depthwise convolution operation on two different input data channels in parallel by using the same operation commands or programming parameters;
FIG. 4 shows a flowchart of a method for operating the apparatus of FIG. 1 in accordance with one example embodiment;
FIGs. 5A and 5B schematically illustrate how first and middle rows of output data elements of an output data channel are calculated based on the "row by row" approach by using the method of FIG. 4;
FIGs. 6A and 6B schematically illustrate how rows of output data elements of an output data channel are calculated based on the "two rows in parallel" approach by using the method of FIG. 4;
FIG. 7 schematically illustrates how all rows of output data elements of an output data channel are calculated based on the "all rows in parallel" approach by using the method of FIG. 4; and
FIG. 8 schematically illustrates another example of how to calculate all rows of output data elements of an output data channel based on the "all rows in parallel" approach by using the method of FIG. 4.
DETAILED DESCRIPTION
Various embodiments of the present disclosure are further described in more detail with reference to the accompanying drawings. However, the present disclosure can be embodied in many other forms and should not be construed as limited to any certain structure or function discussed in the following description. In contrast, these embodiments are provided to make the description of the present disclosure detailed and complete.
According to the detailed description, it will be apparent to the ones skilled in the art that the scope of the present disclosure encompasses any embodiment thereof, which is disclosed herein, irrespective of whether this embodiment is implemented independently or in concert with any other embodiment of the present disclosure. For example, the apparatus and method disclosed herein can be implemented in practice by using any numbers of the embodiments provided herein. Furthermore, it should be understood that any embodiment of the present disclosure can be implemented using one or more of the elements presented in the appended claims. Unless otherwise stated, any embodiment recited herein as "example embodiment" should not be construed as preferable or having an advantage over other embodiments.
Depthwise-separable convolutions are a well understood and widely used ML model architecture technique that reduces the total number of calculation operations needed to achieve a desired level of accuracy in an output result when compared to the traditional 3D convolution (also referred to as "full convolution") with a similar filter size. More specifically, the depthwise-separable convolution replaces one full convolution layer with a filter size N × M × C with the following two layers:
- a depthwise convolution layer with C × [N × M × 1] filters that operates independently on each input data channel; and
- a pointwise convolution layer with a filter size 1 × 1 × C, which has 1/(N × M) computational complexity compared to the full convolution.
While depthwise convolutions are not mathematically identical to the full convolution, they have been found in practice to give similar results with many fewer calculation operations.
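As a quick arithmetic check of the complexity claims above, the following Python sketch compares multiply-accumulate (MAC) counts for a full convolution and its depthwise-separable replacement. The 64 × 64 feature map, 3 × 3 filter, and 256 input/output channels are illustrative assumptions, not values taken from the disclosure.

```python
def full_conv_macs(h, w, n, m, c_in, c_out):
    # One N x M x C_in filter per output channel, applied at each output position.
    return h * w * n * m * c_in * c_out

def depthwise_separable_macs(h, w, n, m, c_in, c_out):
    depthwise = h * w * n * m * c_in   # C_in independent N x M x 1 filters
    pointwise = h * w * c_in * c_out   # C_out filters of size 1 x 1 x C_in
    return depthwise + pointwise

full = full_conv_macs(64, 64, 3, 3, 256, 256)
separable = depthwise_separable_macs(64, 64, 3, 3, 256, 256)
print(round(full / separable, 2))  # → 8.69
```

Note that the pointwise term alone is exactly 1/(N × M) of the full-convolution count, matching the complexity figure stated above; the depthwise term adds only a small further cost.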
The pointwise convolution can generally be computed with good efficiency using identical hardware (HW) (typically a matrix multiplier unit) as for the full convolution. However, the depthwise convolutions do not map well to matrix multiplication since they operate independently on each input data channel. Thus, the depthwise convolution should be calculated with a different kind of HW.
Currently, there is an ML HW accelerator designed and optimized for increasing the execution accuracy of depthwise convolution operations on computer devices. However, this ML HW accelerator is not efficient in terms of power consumption, which is mainly caused by many memory operations (e.g., memory read operations) used in the depthwise convolution operations.
The example embodiments disclosed herein provide a technical solution that allows mitigating or even eliminating the above-mentioned drawbacks peculiar to the prior art. In particular, the technical solution disclosed herein relates to a technique for efficiently (in terms of a number of read operations) performing a depthwise convolution operation on a set of input data channels. For this purpose, a depthwise convolution operation is independently performed for each input data channel of the set of input data channels by scanning the input data channel with a convolution filter. At each scanning position of the convolution filter, at least one output data element of an output data channel is calculated, as well as at least one partial convolution result which is required for calculating at least one other output data element of the output data channel at one or more next scanning positions of the convolution filter is determined and accumulated. Thus, each input data element of each input data channel is read or loaded only once, all needed partial results that can be computed from that input data element are calculated as soon as possible (based on processor capabilities, e.g., a number of available processing units included in a processor) and then later summed or accumulated together with other partial results to produce the output data element(s). All of this allows reducing the number of power-expensive memory read operations during the depthwise convolution operation, thereby decreasing power consumption.
FIG. 1 shows a block diagram of an apparatus 100 for performing a depthwise convolution operation in accordance with one example embodiment. The apparatus 100 may be part of any electronic computing device or User Equipment (UE). Some examples of the UE may include, but are not limited to, a mobile station, a mobile terminal, a mobile subscriber unit, a mobile phone, a cellular phone, a smart phone, a cordless phone, a personal digital assistant (PDA), a wireless communication device, a desktop computer, a laptop computer, a tablet computer, a gaming device, a netbook, a smartbook, an ultrabook, a medical mobile device or equipment, a biometric sensor, a wearable device (e.g., a smart watch, smart glasses, a smart wrist band, etc.), an entertainment device (e.g., an audio player, a video player, etc.), a vehicular component or sensor (e.g., a driver-assistance system), a smart meter/sensor, an unmanned vehicle (e.g., an industrial robot, a quadcopter, etc.) and its component (e.g., a self-driving car computer), industrial manufacturing equipment, a global positioning system (GPS) device, an Internet-of-Things (IoT) device, an Industrial IoT (IIoT) device, a machine-type communication (MTC) device, a group of Massive IoT (MIoT) or Massive MTC (mMTC) devices/sensors. In some embodiments, the UE may refer to at least two collocated and interconnected UEs thus defined.
As shown in FIG. 1, the apparatus 100 comprises a set of processors 102-1, 102-2, ..., 102-N and a memory 104. The memory 104 stores processor-executable instructions 106 which, when executed by each of the set of processors 102-1, 102-2, ..., 102-N, cause the apparatus 100 to perform the aspects of the present disclosure, as will be described below in more detail. It should be noted that the number, arrangement, and interconnection of the constructive elements constituting the apparatus 100, which are shown in FIG. 1, are not intended to be any limitation of the present disclosure, but merely used to provide a general idea of how the constructive elements may be implemented within the apparatus 100. For example, each or some of the set of processors 102-1, 102-2, ..., 102-N may be replaced with several processors, as well as the memory 104 may be replaced with several removable and/or fixed storage devices, depending on particular applications. The apparatus 100 may also comprise a single processor (instead of the multiple processors 102-1, 102-2, ..., 102-N) which is configured to perform all the steps discussed below.
Each of the processors 102-1, 102-2, ..., 102-N may be implemented as a CPU, general-purpose processor, single-purpose processor, microcontroller, microprocessor, application specific integrated circuit (ASIC), field programmable gate array (FPGA), digital signal processor (DSP), complex programmable logic device, etc. It should be also noted that each of the processors 102-1, 102-2, ..., 102-N may be implemented as any combination of one or more of the aforesaid. As an example, each of the processors 102-1, 102-2, ..., 102-N may be a combination of two or more microprocessors.
The memory 104 may be implemented as a classical nonvolatile or volatile memory used in the modern electronic computing machines. As an example, the nonvolatile memory may include Read-Only Memory (ROM), ferroelectric Random-Access Memory (RAM), Programmable ROM (PROM), Electrically Erasable PROM (EEPROM), solid state drive (SSD), flash memory, magnetic disk storage (such as hard drives and magnetic tapes), optical disc storage (such as CD, DVD and Blu-ray discs), etc. As for the volatile memory, examples thereof include Dynamic RAM, Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Static RAM, etc.
The processor-executable instructions 106 stored in the memory 104 may be configured as a computer-executable program code which causes each of the processors 102-1, 102-2, ..., 102-N to perform the aspects of the present disclosure. The computer-executable program code for carrying out operations or steps for the aspects of the present disclosure may be written in any combination of one or more programming languages, such as Java, C++, or the like. In some examples, the computer-executable program code may be in the form of a high-level language or in a pre-compiled form and be generated by an interpreter (also pre-stored in the memory 104) on the fly.
FIG. 2 shows a block diagram of a processor 200 that may be used in the apparatus 100 in accordance with one example embodiment. More specifically, each of the processors 102-1, 102-2, ..., 102-N of the apparatus 100 may be implemented as the processor 200 or two or more parallel-connected processors 200.
As shown in FIG. 2, the processor 200 comprises three processing units 202-1, 202-2, 202-3, which are coupled in parallel to each other. The processing unit 202-1 comprises a multiplier 204-1 and a register 206-1, the processing unit 202-2 comprises a multiplier 204-2 and a register 206-2, and the processing unit 202-3 comprises a multiplier 204-3 and a register 206-3. Each of the multipliers 204-1, 204-2, 204-3 is configured to perform elementwise multiplication operations (which are part of the depthwise convolution) on input data from an input data channel by using a convolution filter. Each of the registers 206-1, 206-2, 206-3 is configured to store filter coefficients for the convolution filter. Thus, the filter coefficients are stored locally in the processor 200. The input data channel may be processed by one or more of the processing units 202-1, 202-2, 202-3. This should be decided by a selection logic 208. At the same time, the selection logic 208 is optional, and the input data channel may be (e.g., evenly) divided into three sub-channels each processed by one of the processing units 202-1, 202-2, 202-3. The results of the elementwise multiplication operations are provided by the multipliers 204-1, 204-2, 204-3 to another selection logic 210 which is configured to decide which of summators or accumulators 212-1, 212-2, 212-3 should accumulate them. Each of the accumulators 212-1, 212-2, 212-3 is a type of a register for short-term, intermediate storage of arithmetic and logic data in the processor 200. The processor 200 also comprises a program dispatching unit 214 which is configured to provide operation commands or programming parameters (e.g., the instructions 106) to each of the constructive elements of the processor 200 (e.g., the operation command for the selection logic 208 may cause the selection logic 208 to select the processing units 202-1 and 202-3 for further processing of the input data channel).
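As a rough behavioural sketch of this datapath (not an implementation of the disclosed HW), the following Python model mimics how a partial product produced by one processing unit, using a filter coefficient held in its local register, is routed by the selection logic into a chosen accumulator. All class and method names here are assumptions made for illustration.

```python
class ProcessingUnit:
    """Models one of the processing units 202-x: a multiplier plus a coefficient register."""
    def __init__(self, coeffs):
        self.register = list(coeffs)        # filter coefficients held locally

    def multiply(self, data, idx):
        return data * self.register[idx]    # elementwise multiplication step

class Processor200:
    """Models the processor of FIG. 2 with three units and three accumulators."""
    def __init__(self, coeff_sets):
        self.units = [ProcessingUnit(c) for c in coeff_sets]
        self.accumulators = [0.0, 0.0, 0.0]

    def step(self, unit_id, data, coeff_idx, acc_id):
        # Stand-in for selection logic 210: route this partial product to accumulator acc_id.
        self.accumulators[acc_id] += self.units[unit_id].multiply(data, coeff_idx)

p = Processor200([[1, 2], [3, 4], [5, 6]])
p.step(0, 10, 0, 0)       # unit 0: 10 * 1 -> accumulator 0
p.step(1, 10, 1, 0)       # unit 1: 10 * 4 -> accumulator 0
print(p.accumulators[0])  # 50.0
```

The point of the sketch is the routing: several units can feed the same accumulator, so partial products for one output data element can be gathered across units and across cycles.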
It should be noted that the number of the parallel processing units and/or the number of the accumulators may change depending on particular applications (e.g., the capabilities of the processor 200). Since the parallel processors 102-1, 102-2, ..., 102-N each implemented as the processor 200 may process the same input data channel at the same time, they may share the same programming parameters (e.g., the instructions 106), so that only one set of programming parameters is needed for all depthwise convolution operations. This significantly reduces parameter memory requirements, as well as the power consumption needed to read the programming parameters (e.g., the instructions 106), compared to a situation in which the processing of the input data channel is not performed in parallel or is performed by entirely independent processors, each having its own parameter memory.
FIG. 3 explains how two parallel-connected processors 200 may perform the depthwise convolution operation on two different input data channels in parallel by using the same operation commands or programming parameters. As follows from FIG. 3, the program dispatching units 214 of the two processors 200 may receive the same operation commands or programming parameters (e.g., the instructions 106) and cause the other constructive elements of the processors 200 to operate similarly based on these operation commands or programming parameters. By so doing, it is possible to provide parallelism in the depthwise convolution operation on different input data channels as well as to reduce memory requirements and power consumption, since there is no need for each of the two processors 200 to have its own program memory. It should be noted that the number of the parallel-connected processors 200 may vary depending on particular applications, but all of them may share the same operation commands or programming parameters.
FIG. 4 shows a flowchart of a method 400 for operating the apparatus 100 in accordance with one example embodiment. The method 400 starts with a step S402, in which the apparatus 100 receives a set of input data channels. Then, the method 400 proceeds to a step S404, in which the apparatus 100 performs (by using the processors 102-1, 102-2, ..., 102-N) the depthwise convolution operation independently for each input data channel of the set of input data channels. More specifically, each of the processors 102-1, 102-2, ..., 102-N scans a different input data channel of the set of input data channels with a convolution filter (i.e., the convolution filter is moved across the input data channel). The convolution filter may be defined based on a set of filter coefficients which may be locally stored in one or more processor registers (like the registers 206-1, 206-2, 206-3) of each of the processors 102-1, 102-2, ..., 102-N. At each scanning position of the convolution filter, each of the processors 102-1, 102-2, ..., 102-N calculates (e.g., by using the multipliers 204-1, 204-2, 204-3) one or more output data elements of an output data channel and determines whether one or more partial convolution results obtained at the scanning position are required for calculating one or more next output data elements of the output data channel at one or more next scanning positions of the convolution filter. If it is determined that the partial convolution result(s) is (are) needed for the next scanning position(s) of the convolution filter, each of the processors 102-1, 102-2, ..., 102-N accumulates the corresponding partial convolution result(s) (e.g., in one or more of the accumulators 212-1, 212-2, 212-3). It should be noted that the processors 102-1, 102-2, ..., 102-N may perform the step S404 for the whole set of input data channels in parallel.
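As a point of reference for the method 400, a minimal Python sketch of a stride-1, valid-padding depthwise convolution is given below. It scans each input data channel independently with a convolution filter, as in steps S402/S404, but without the partial-result reuse that the hardware adds on top; the names and the single shared filter are illustrative assumptions only.

```python
def depthwise_conv(channels, filt):
    """Reference sketch: scan each input data channel independently with
    the convolution filter (stride 1, no padding)."""
    kh, kw = len(filt), len(filt[0])
    outputs = []
    for ch in channels:  # each channel is processed independently (step S404)
        h, w = len(ch), len(ch[0])
        out = [
            [sum(filt[i][j] * ch[r + i][c + j]
                 for i in range(kh) for j in range(kw))
             for c in range(w - kw + 1)]   # scanning positions along a row
            for r in range(h - kh + 1)     # scanning positions down the rows
        ]
        outputs.append(out)
    return outputs
```

Each inner `sum` corresponds to the elementwise multiplications performed at one scanning position of the filter; the hardware described above differs in that it splits these sums into partial convolution results and accumulates them across cycles.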
In one embodiment, if each of the input data channel and the output data channel is a 2D data array, each of the processors 102-1, 102-2, ..., 102-N may calculate the output data channel row by row during the depthwise convolution operation. In an alternative embodiment, each of the processors 102-1, 102-2, ..., 102-N may calculate two or more rows of output data elements of the output data channel in parallel during the depthwise convolution operation (e.g., the multipliers 204-1, 204-2, 204-3 may calculate multiple rows in parallel and store the final results in the accumulators 212-1, 212-2, 212-3, where the final results are ready only after multiple cycles, i.e., when all partial convolution results for a given final result are obtained and summed together).
For example, to obtain one output data element in a 3x3 filter case, nine input data elements are needed. Therefore, as soon as an input data element arrives, e.g., at the processor 102-1, the processor 102-1 calculates all the partial convolution results in which this input data element is used. By so doing, the processor 102-1 (and similarly the other processors 102-2, ..., 102-N) may read each input data element only once.
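This "read each input data element only once" behaviour can be expressed as an input-stationary accumulation: each arriving element is immediately multiplied by every filter coefficient under which it will ever appear, and the partial products are scattered into the accumulators of the corresponding output data elements. The following Python sketch (stride 1, no padding, names invented) is an illustration of the idea, not the HW design itself.

```python
def depthwise_conv_scatter(ch, filt):
    """Input-stationary sketch: each input element is read exactly once,
    and all partial convolution results it contributes to are computed
    immediately and accumulated per output element."""
    kh, kw = len(filt), len(filt[0])
    h, w = len(ch), len(ch[0])
    oh, ow = h - kh + 1, w - kw + 1
    acc = [[0] * ow for _ in range(oh)]  # one accumulator per output element
    for r in range(h):
        for c in range(w):
            x = ch[r][c]  # the single read of this input element
            # scatter x into every output whose filter window covers (r, c)
            for i in range(kh):
                for j in range(kw):
                    orow, ocol = r - i, c - j
                    if 0 <= orow < oh and 0 <= ocol < ow:
                        acc[orow][ocol] += filt[i][j] * x
    return acc
```

For a 3x3 filter, each interior input element feeds up to 9 accumulators, which is exactly why the element needs to be transferred only once even though it participates in 9 multiplications.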
By calculating multiple rows in parallel, it is possible to provide a scalable system in which the number of accumulators storing the partial convolution results serves as a scaling factor for balancing the accumulator count against the reduction in data transmission.
Moreover, since the filter coefficients are fixed, it is easy to keep all of them in a small number of registers of each of the processors 102-1, 102-2, ..., 102-N (e.g., close to the processing units 202-1, 202-2, 202-3) and avoid memory read operations for the filter coefficients. At a minimum, if the number of the processing units 202-1, 202-2, 202-3 (i.e., the multipliers 204-1, 204-2, 204-3) equals the number of the filter coefficients - for example, 9 for a 3x3 filter - then each of the processing units 202-1, 202-2, 202-3 requires only a single register for the filter coefficients. Alternatively, the processing units 202-1, 202-2, 202-3 may have more registers in order to support a larger maximum filter size, or several smaller filters in parallel if more simultaneous input data elements are available.
For example, the filter size may be 3x3, while the input data may comprise, e.g., 4096x4096 data elements. Even though filter sizes may be larger, e.g., 7x7 or even 11x11, the filter is still much smaller than the input data. Small filters are commonly used in almost all known CNNs since increasing the filter size does not usually improve the results.
FIGs. 5A and 5B schematically illustrate how the first and middle rows of output data elements of an output data channel are calculated based on the "row by row" approach by using the method 400. The last row of output data elements may be calculated similarly to the first row of output data elements. More specifically, FIGs. 5A and 5B illustrate a total cycle count for one input data channel (e.g., processed by the processor 102-1 implemented as the processor 200) in accordance with the "row by row" approach, where input transfer (reading or loading), elementwise multiplication and output processing (e.g., activation, quantization and output write) happen in parallel. One or more other input data channels may be processed (e.g., by one or more of the other processors 102-2, ..., 102-N each implemented as the processor 200) in the same manner. Black filled boxes are (final) output data elements calculated during the depthwise convolution operation, and they are assumed to be stored in the accumulator shown in bold, i.e., the bold font indicates which accumulator result is taken to the output processing. Dot filled boxes correspond to zero padding or input data elements close to zero (without padding, there are no extra zeros, and all rows are processed similarly to the middle rows). Boxes having a pattern fill in the form of diagonal stripes are new input data elements needed for a certain cycle of the depthwise convolution operation. Boxes having a pattern fill in the form of horizontal stripes are old input data elements that have been used already, i.e., they represent the partial convolution results already stored in the accumulators. The example shown in FIGs. 5A and 5B uses a 3x3 filter, and this already shows a 2-to-3-times reduction in data transfer compared to the number of elementwise multiplication operations. One can see that the elementwise multiplication dominates the processing time in this example.
Since in the depthwise convolution each input data channel is processed independently, if there are, for example, 256 input data channels, the apparatus 100 (i.e., the processors 102-1, 102-2, ..., 102-N) may process all these channels in parallel.
FIGs. 6A and 6B schematically illustrate how rows of output data elements of an output data channel are calculated based on the "two rows in parallel" approach by using the method 400. More specifically, FIGs. 6A and 6B show a similar approach as in FIGs. 5A and 5B, but with two rows, i.e., two output data elements, calculated in parallel based on the same input data elements. Black filled boxes are again (final) output data elements calculated during the depthwise convolution operation, and they are assumed to be stored in the accumulator shown in bold, i.e., the bold font indicates which accumulator result is taken to the output processing. A contour area indicates the input data elements needed to calculate the output data elements. The calculation is based on the partial convolution results, so the calculation is performed as soon as new input data elements (the last column in the shaded box) arrive at the corresponding one of the processors 102-1, 102-2, ..., 102-N. In the normal case, the benefit is that 4 new input data elements feed 18 multiplications, i.e., a 4.5-times saving.
FIG. 7 schematically illustrates how all rows of output data elements of an output data channel are calculated based on the "all rows in parallel" approach by using the method 400. More specifically, FIG. 7 shows a similar approach as in FIGs. 6A and 6B, but with four rows calculated in parallel. In this example, a small number of input data elements is used for convenience only; the number of input data elements is a lot bigger in reality, and thus there will be similar middle row calculations as in the previous example shown in FIGs. 6A and 6B. This example again needs more accumulators, i.e., 12 compared to 6 in the example shown in FIGs. 6A and 6B. However, the benefit is also bigger, i.e., 30 multiplications based on 4 input data elements, i.e., a 7.5-times saving. This demonstrates the input data saving approach. It should be noted that, in FIG. 7 (as well as in FIG. 8), the term "unit" refers to a parallel processing engine, i.e., existing HW pipes that operate in parallel, and the term "layer" refers to an input data channel like the one shown in FIG. 2, for example. Generally, this 4-row calculation starts to be a good compromise when the typical filter size is 3x3. Once the filter size is bigger, it is beneficial to have even more rows in parallel because each column of output data elements needs a larger set of input data elements. FIG. 8 schematically illustrates another example of how to calculate all rows of output data elements of an output data channel based on the "all rows in parallel" approach by using the method 400. More specifically, FIG. 8 illustrates a practical HW solution for good throughput, where each input data element is read once, and input data transmission will be the practical limit of the throughput. Since each input data element is used only once, this solution is close to optimal in HW for the given example. This solution needs multiple parallel multipliers to process input data in parallel.
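The savings figures quoted for the two-row and four-row examples can be checked with plain arithmetic; the helper function below is illustrative only, with the input/multiplication counts taken from FIGs. 6A/6B and 7.

```python
def multiplications_per_input(new_inputs, multiplications):
    """Multiplications performed per newly transferred input data element:
    the higher this ratio, the lower the data-transfer overhead."""
    return multiplications / new_inputs

# Two rows in parallel (FIGs. 6A/6B): 4 new inputs feed 18 multiplications.
two_rows_ratio = multiplications_per_input(4, 18)   # 4.5-times saving

# Four rows in parallel (FIG. 7): the same 4 new inputs feed 30 multiplications.
four_rows_ratio = multiplications_per_input(4, 30)  # 7.5-times saving
```

The trade-off named in the text is visible here: going from two to four parallel rows doubles the accumulator count (6 to 12) but raises the multiplications-per-transfer ratio from 4.5 to 7.5.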
In this example, 9 parallel multipliers are required, while the number of multipliers is, in general, a design parameter of the HW. However, the target is normally to match the input data transfer capability and the multiplication capability, as shown in this example. This example also shows that in a real system there are multiple sub-operations. The last step, in which no input data element is transmitted, is combined with the first cycle of the next data, where 8 input data elements are needed before any result is obtained. This leads to a solution in which, after the initial pipeline latency, the system can obtain new input data elements in every cycle and produce a result in every cycle, which leads to predictable system throughput with the input data element transfer capability as a sizing factor.
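The resulting steady-state behaviour can be captured by a toy cycle model: after an initial fill, one new input data element enters and one result leaves per cycle. The fill latency of 8 input data elements is taken from the text; everything else here is a sketch assumption, not a HW specification.

```python
def total_cycles(n_results, fill_latency=8):
    """Toy throughput model: after 'fill_latency' cycles spent loading the
    first input data elements, the pipeline produces one result per cycle,
    so n_results results take fill_latency + n_results cycles in total."""
    return fill_latency + n_results
```

For long rows the fill latency is amortized away, so throughput converges to one result per input-transfer cycle, which is exactly the "input data element transfer capability as a sizing factor" statement above.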
It should be noted that each step or operation of the method 400, or any combinations of the steps or operations, can be implemented by various means, such as hardware, firmware, and/or software. As an example, one or more of the steps or operations described above can be embodied by processor-executable instructions, data structures, program modules, and other suitable data representations. Furthermore, the processor-executable instructions which embody the steps or operations described above can be stored on a corresponding data carrier and executed by any one or more of the processors 102-1, 102-2, ..., 102-N. This data carrier can be implemented as any computer-readable storage medium configured to be readable by said at least one processor to execute the processor-executable instructions. Such computer-readable storage media can include both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, the computer-readable media comprise media implemented in any method or technology suitable for storing information. In more detail, the practical examples of the computer-readable media include, but are not limited to, information-delivery media, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile discs (DVD), holographic media or other optical disc storage, magnetic tape, magnetic cassettes, magnetic disk storage, and other magnetic storage devices.
Although the example embodiments of the present disclosure are described herein, it should be noted that various changes and modifications could be made in the embodiments of the present disclosure without departing from the scope of legal protection which is defined by the appended claims. In the appended claims, the word "comprising" does not exclude other elements or operations, and the indefinite article "a" or "an" does not exclude a plurality. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims

1. An apparatus for performing a depthwise convolution operation, comprising: at least one processor; and at least one memory including a computer program code; wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to: receive a set of input data channels; and perform the depthwise convolution operation independently for each input data channel of the set of input data channels by: scanning the input data channel with a convolution filter; and at each scanning position of the convolution filter: calculating at least one output data element of an output data channel; determining whether at least one partial convolution result obtained at the scanning position of the convolution filter is required for calculating at least one next output data element of the output data channel at one or more next scanning positions of the convolution filter; and if the at least one partial convolution result is required for calculating the at least one next output data element of the output data channel at the one or more next scanning positions of the convolution filter, accumulating the at least one partial convolution result.
2. The apparatus of claim 1, wherein the at least one processor comprises a processor register configured to store a set of filter coefficients, and wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to define the convolution filter based on the set of filter coefficients.
3. The apparatus of claim 1 or 2, wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the depthwise convolution operation for each input data channel of the set of input data channels in parallel.
4. The apparatus of any one of claims 1 to 3, wherein each of the input data channel and the output data channel is a 2D data array, and wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to calculate, during the depthwise convolution operation, the output data channel row by row.
5. The apparatus of any one of claims 1 to 3, wherein each of the input data channel and the output data channel is a 2D data array, and wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to calculate, during the depthwise convolution operation, at least two rows of output data elements of the output data channel in parallel.
6. A method for performing a depthwise convolution operation, comprising: receiving a set of input data channels; and performing the depthwise convolution operation independently for each input data channel of the set of input data channels by: scanning the input data channel with a convolution filter; and at each scanning position of the convolution filter: calculating at least one output data element of an output data channel; determining whether at least one partial convolution result obtained at the scanning position of the convolution filter is required for calculating at least one next output data element of the output data channel at one or more next scanning positions of the convolution filter; and if the at least one partial convolution result is required for calculating the at least one next output data element of the output data channel at the one or more next scanning positions of the convolution filter, accumulating the at least one partial convolution result.
7. The method of claim 6, further comprising: storing a set of filter coefficients in a processor register; and defining the convolution filter based on the set of filter coefficients.
8. The method of claim 6 or 7, wherein the depthwise convolution operation is performed for each input data channel of the set of input data channels in parallel.
9. The method of any one of claims 6 to 8, wherein each of the input data channel and the output data channel is a 2D data array, and wherein the output data channel is calculated row by row during the depthwise convolution operation.
10. The method of any one of claims 6 to 8, wherein each of the input data channel and the output data channel is a 2D data array, and wherein at least two rows of output data elements of the output data channel are calculated in parallel during the depthwise convolution operation.
11. A computer program product comprising a computer-readable storage medium, wherein the computer-readable storage medium stores a computer code which, when executed by at least one processor, causes the at least one processor to perform the method according to any one of claims 6 to 10.
PCT/EP2023/081278 2022-11-14 2023-11-09 Apparatus and method for performing a depthwise convolution operation WO2024104879A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20226024 2022-11-14


Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190197083A1 (en) * 2017-12-18 2019-06-27 Nanjing Horizon Robotics Technology Co., Ltd. Method and electronic device for convolution calculation in neutral network

