CN113449852B - Convolutional neural network computing method, system on chip and electronic device


Info

Publication number: CN113449852B
Application number: CN202110897011.4A
Authority: CN (China)
Prior art keywords: convolution, convolution kernel, data, kernels, input data
Legal status: Active (the listed status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN113449852A
Inventor: 孙伟昶
Current assignee: ARM Technology China Co Ltd (the listed assignee may be inaccurate)
Original assignee: ARM Technology China Co Ltd
Application CN202110897011.4A filed by ARM Technology China Co Ltd; published as CN113449852A; application granted and published as CN113449852B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/08 Learning methods
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package


Abstract

The application relates to the field of neural networks and discloses a computing method for a convolutional neural network, a system on chip, and an electronic device. The computing method for the convolutional neural network comprises the following steps: determining a segmentation mode for the convolution kernels according to the size of the parameter cache on a deep learning processing chip; segmenting each of the plurality of convolution kernels according to the segmentation mode, so that each convolution kernel is divided into N partial convolution kernels; grouping the partial convolution kernels to obtain N convolution kernel groups; loading the first to Nth convolution kernel groups into the parameter cache in turn, and performing a convolution operation between the input data and each convolution kernel group loaded into the parameter cache, to obtain N convolution operation results; and combining the N convolution operation results. In this way, when the storage space of the parameter cache is smaller than the storage space required by the plurality of convolution kernels, the convolution kernel data does not need to be loaded frequently, and the efficiency of the convolution operation can be improved.

Description

Convolutional neural network computing method, system on chip and electronic device
Technical Field
The present application relates to the field of neural networks, and in particular, to a method for computing a convolutional neural network, a system on a chip, and an electronic device.
Background
In recent years, with the rapid development of artificial intelligence (AI) technology, applications such as autonomous vehicles, unmanned aerial vehicles, and intelligent terminals that support AI have become increasingly widespread. AI processes data from various sensors in real time through neural network technology to perceive the external environment. Generally, to improve the processing performance of an AI application terminal, a dedicated hardware platform is used to implement specific operations; for example, the convolution operations involved in a convolutional neural network model ported to the AI application terminal are implemented by the dedicated hardware platform. To store the input data and the convolution kernel data of the convolution operation, the dedicated hardware platform provides a corresponding cache for each of them.
However, the capacity of the cache used to store convolution kernel data in some existing dedicated hardware platforms is limited. If the amount of convolution kernel data exceeds the capacity of this cache, the platform must load the convolution kernel data in batches while performing the convolution operation, which results in frequent transfers of convolution kernel data and therefore low computational efficiency when the amount of convolution kernel data is large. As a result, the applicable range of some existing dedicated hardware platforms is narrow, which hinders the popularization and application of such products.
Disclosure of Invention
Embodiments of the present application provide a computing method for a convolutional neural network, a system on chip, and an electronic device, which solve the problem in existing schemes that convolution kernel data must be loaded frequently, making the convolution operation inefficient, when the plurality of convolution kernels used in the convolution operation cannot all be cached at once.
In a first aspect, an embodiment of the present application provides a method for computing a convolutional neural network, where the method is used for a deep learning processing chip, and the method includes:
determining a segmentation mode for the convolution kernels according to the size of the parameter cache on the deep learning processing chip;
segmenting each convolution kernel of the plurality of convolution kernels according to the segmentation mode, so that each convolution kernel is divided into N partial convolution kernels, the N partial convolution kernels comprising a first to an Nth partial convolution kernel, where the plurality of convolution kernels are used to perform a convolution operation with the input data;
grouping the partial convolution kernels to obtain N convolution kernel groups, where the set of all first partial convolution kernels forms the first convolution kernel group and the set of all Nth partial convolution kernels forms the Nth convolution kernel group, and the storage space required by each of the first to Nth convolution kernel groups is smaller than the storage space of the parameter cache;
loading the first to Nth convolution kernel groups into the parameter cache in turn, and performing a convolution operation between the input data and each convolution kernel group loaded into the parameter cache, to obtain N convolution operation results;
and combining the N convolution operation results, and taking the combined result as the convolution operation result of the input data and the plurality of convolution kernels.
In a possible implementation of the first aspect, the segmentation mode of the convolution kernel includes a segmentation direction and a segmentation position, and the segmentation direction includes a height direction of the convolution kernel and a width direction of the convolution kernel.
In a possible implementation of the first aspect, the method further includes:
the segmentation position is determined according to the storage space of the parameter cache and the size of the convolution kernel, where the size of the convolution kernel includes the width, the height and the number of channels of the convolution kernel.
In one possible implementation of the first aspect, the first to nth partial convolution kernels have the same number of channels.
In one possible implementation of the first aspect described above, the value of N is 2.
In a possible implementation of the first aspect, loading the first to Nth convolution kernel groups into the parameter cache and performing a convolution operation between the input data and each convolution kernel group loaded into the parameter cache, to obtain N convolution operation results, includes:
loading the first convolution kernel group into a parameter cache, and performing convolution operation on input data and the first convolution kernel group in the parameter cache to obtain a first convolution operation result; and loading the second convolution kernel group into the parameter cache, and performing convolution operation on the input data and the second convolution kernel group in the parameter cache to obtain a second convolution operation result.
In a possible implementation of the first aspect, performing convolution operation on the input data and a first convolution kernel group in the parameter cache to obtain a first convolution operation result includes:
determining first input data according to the size of a first part of convolution kernels in the first convolution kernel group and the input data;
and carrying out convolution operation on the first input data and the first part of convolution kernels in the first convolution kernel group to obtain a first convolution operation result.
In a possible implementation of the first aspect, performing convolution operation on the input data and a second convolution kernel group in the parameter cache to obtain a second convolution operation result includes:
determining second input data according to the size of a second part of convolution kernels in the second convolution kernel group and the input data;
and carrying out convolution operation on the second input data and a second part of convolution kernels in the second convolution kernel group to obtain a second convolution operation result.
In one possible implementation of the first aspect, merging the N convolution operation results, and determining the merged result as a convolution operation result of the input data and the plurality of convolution kernels includes:
and adding the data at the corresponding position in the first convolution operation result and the second convolution operation result, and taking the added result as the convolution operation result of the input data and the plurality of convolution kernels.
In a second aspect, an embodiment of the present application provides a system on chip, including:
a memory to store instructions for execution by one or more processors of a system-on-chip;
a processor, being one of the processors of the system on chip, for performing the calculation method of the convolutional neural network in the first aspect and various possible implementations of the first aspect when the instructions are executed by the processor.
In a third aspect, an embodiment of the present application provides an electronic device, which includes the system on chip in the second aspect, a processor and a memory;
a memory to store instructions for execution by one or more processors of an electronic device;
a processor for performing the method of computing a convolutional neural network in the above-described first aspect and various possible implementations of the first aspect when the instructions are executed by one or more processors.
According to the computing method for the convolutional neural network provided by the present application, the plurality of convolution kernels that are to be convolved with the input data are each segmented into N partial convolution kernels, and the partial convolution kernels are grouped into N convolution kernel groups such that the storage space required by each convolution kernel group is smaller than the storage space of the parameter cache on the deep learning processing chip. Each convolution kernel group is then loaded into the parameter cache in turn and convolved with the input data to obtain N convolution operation results, and finally the N convolution operation results are combined to obtain the final result.
Drawings
Fig. 1 is a schematic diagram illustrating the operation process of a standard convolution operation in a related technical solution;
Fig. 2 (a) shows a schematic diagram of a multiplication circuit in a related technical solution;
Fig. 2 (b) shows a schematic diagram of the storage of one convolution kernel in a parameter cache;
Fig. 2 (c) shows a schematic diagram of multiple convolution kernels stored in a parameter cache;
Fig. 2 (d) shows data blocks of input data stored in an input data cache;
Fig. 3 illustrates a flow chart of a method of computing a convolutional neural network, according to some embodiments of the present application;
Fig. 4 illustrates a block diagram of the hardware architecture of a system on chip, according to some embodiments of the present application;
Fig. 5 illustrates a block diagram of an electronic device, according to some embodiments of the present application.
Detailed Description
Illustrative embodiments of the present application include, but are not limited to, a convolutional neural network computing method, system-on-chip, and electronic device.
To make the objects, technical solutions and advantages of the present application clearer, embodiments of the present application are described in further detail below with reference to the accompanying drawings.
The embodiments of the present application relate to the field of neural network technology. To better understand the solutions of the embodiments of the present application, the terms and concepts related to neural networks that may be involved are first introduced below.
Standard convolution operation: when input data having a plurality of data channels is convolved with one convolution kernel in a standard convolution operation, the convolution kernel convolves the data of all channels of the input data.
Fig. 1 exemplarily shows a schematic diagram of a standard convolution operation. For example, assume that the input data of the convolutional neural network are the data of the 3 channels of an image, and that the image has 5 pixels in both the vertical and the horizontal direction, so that the size of the input data can be represented as 5 × 5 × 3. When a standard convolution operation is performed on the 5 × 5 × 3 input data, a convolution kernel whose number of channels is also 3 must perform a convolution operation (i.e., a multiply-add operation) over all of the data in the input data to obtain the corresponding convolution result (also called a feature map).
For example, as shown in Fig. 1, the 5 × 5 × 3 input data is convolved with 4 convolution kernels of size 3 × 3 × 3, denoted convolution kernel K1, convolution kernel K2, convolution kernel K3 and convolution kernel K4. The data of the 3 channels of the 5 × 5 × 3 input data are denoted channel C1, channel C2 and channel C3. Convolution kernel K1 convolves all of the data of channels C1 to C3 in the input data to obtain feature map P1; convolution kernel K2 convolves all of the data of channels C1 to C3 to obtain feature map P2; convolution kernel K3 convolves all of the data of channels C1 to C3 to obtain feature map P3; and convolution kernel K4 convolves all of the data of channels C1 to C3 to obtain feature map P4.
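For illustration, the standard convolution described above can be written as a short NumPy sketch, assuming stride 1, no padding, and an (H, W, C) array layout; the function and variable names below are not part of the patent:

```python
import numpy as np

def conv2d_valid(x, k, stride=1):
    """Valid (no padding) convolution of one multi-channel input with one kernel.
    x: (H, W, C) input, k: (kh, kw, C) kernel -> (H_out, W_out) feature map."""
    H, W, C = x.shape
    kh, kw, kc = k.shape
    assert kc == C, "the kernel must span all input channels"
    H_out = (H - kh) // stride + 1
    W_out = (W - kw) // stride + 1
    out = np.zeros((H_out, W_out), dtype=np.float32)
    for i in range(H_out):
        for j in range(W_out):
            window = x[i * stride:i * stride + kh, j * stride:j * stride + kw, :]
            out[i, j] = np.sum(window * k)  # multiply-add over the full window
    return out

# 5 x 5 x 3 input and four 3 x 3 x 3 kernels K1..K4 -> four 3 x 3 feature maps P1..P4
x = np.random.randn(5, 5, 3).astype(np.float32)
kernels = [np.random.randn(3, 3, 3).astype(np.float32) for _ in range(4)]
feature_maps = [conv2d_valid(x, k) for k in kernels]  # each has shape (3, 3)
```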
In the following description of the embodiments, the technical solutions of the present application are described with reference to a convolutional neural network involving convolution operations (i.e., multiply-add operations). Besides the standard convolution operation, the convolution operations in the technical solutions of the present application may include other types of convolution operations, such as depthwise convolution, group convolution, dilated convolution, and deconvolution. It is understood that, in addition to convolutional neural networks, the technical solutions of the present application may also be applied to other neural networks involving multiply-add operations, which is not limited in the present application.
Furthermore, it should be understood that the above description of performing a standard convolution on data with 3 channels using a convolution kernel with 3 channels is only a simple example used to explain the general procedure of the standard convolution. In practical applications, the technical solutions of the present application do not limit the number of channels of the input data or of the convolution kernel data involved in the standard convolution operation. For example, in some embodiments of the convolution kernel segmentation method provided by the present application, the number of channels of the input data and of the convolution kernel data participating in a standard convolution operation may be an integer multiple of 32.
Fig. 2 (a) shows a multiplication circuit 200 in a related technical solution, which can implement the convolution operations (i.e., multiply-add operations) in a convolutional neural network model. The multiplication circuit 200 may be part of a deep learning processing chip, which is a hardware circuit dedicated to deep-learning-related processing and used to perform deep-learning-related calculations, such as the standard convolution operations in a convolutional neural network. Referring to Fig. 2 (a), the multiplication circuit 200 includes a direct memory access (DMA) control unit 201, a parameter buffer 202, an input data buffer 203, an output buffer 204, and a PE array 205 composed of a plurality of processing elements (PEs).
The DMA control unit 201 is configured to read the input data stored in the external storage space into the input data buffer 203. The parameter buffer 202 is used to store data of convolution kernels participating in convolution operation. The input data buffer 203 is used to store input data read by the DMA control unit 201 from an external memory space. The PE array 205 is composed of a plurality of PEs, each PE for performing a multiply-add operation on at least part of input data and convolution kernel data. The output buffer 204 is used for storing the result output by the PE array 205 when performing convolution operation.
It is to be understood that the PE array 205 shown in fig. 2 (a) having 16 PEs with 4 rows and 4 columns is merely one exemplary configuration of the PE array in the multiplication circuit 200. In practical application, the technical scheme of the application does not limit the number of processing units in each row and each column in the PE array. For example, in some embodiments, the number of processing elements in each row and column in the PE array may be 16.
The parameter cache 202 stores the data of a plurality of convolution kernels. Assuming that the parameter cache 202 needs to store 32 convolution kernels of size 5 × 5 × 32, the manner in which these 32 convolution kernels are stored in the parameter cache 202 is described in detail below with reference to Fig. 2 (b) to Fig. 2 (c).
The 32 convolution kernels are denoted convolution kernel K0, convolution kernel K1, convolution kernel K2, and so on up to convolution kernel K31. Fig. 2 (b) shows how one convolution kernel, K0, is stored in the parameter cache 202. As shown in Fig. 2 (b), the height of convolution kernel K0 is 5, its width is 5, and its number of channels is 32. The 5 height positions of the convolution kernel are denoted height H0, height H1, height H2, height H3 and height H4; the 5 width positions are denoted width W0, width W1, width W2, width W3 and width W4; and the 32 channels are denoted channel C0, channel C1, and so on up to channel C31. The position of a datum in convolution kernel K0 is represented by a triplet: <height H, width W, channel C>. Before storage, the convolution kernel is divided into 5 × 5 = 25 data blocks according to the product of the height and the width of convolution kernel K0; all data in the same data block share the same height and width in their position triplets and differ only in channel. All 25 data blocks are numbered in sequence starting from the data block defined by height H0 and width W0, giving data blocks numbered 1 to 25. For example, the positions of the data in data block 1 run from <H0, W0, C0>, <H0, W0, C1> up to <H0, W0, C31>; the positions of the data in data block 11 run from <H2, W0, C0>, <H2, W0, C1> up to <H2, W0, C31>; and the positions of the data in data block 25 run from <H4, W4, C0>, <H4, W4, C1> up to <H4, W4, C31>.
Convolution kernel K0 is stored in the parameter cache 202 in the order of its data block numbers: as shown in Fig. 2 (b), data block 1, data block 2, data block 3 and so on are stored in the parameter cache 202 in sequence, and the data within each data block are stored in the order of their positions. For example, the 32 data of data block 1 are stored in the order of the positions <H0, W0, C0>, <H0, W0, C1> … <H0, W0, C31>.
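As an aside, the data block numbering and cache ordering described above can be modelled with a simple NumPy reshape; this sketch assumes an (H, W, C) in-memory layout and one value per element, and none of the names come from the patent:

```python
import numpy as np

# One 5 x 5 x 32 kernel, assumed to be laid out as (height, width, channel).
k0 = np.arange(5 * 5 * 32, dtype=np.int32).reshape(5, 5, 32)

# Split into H * W = 25 data blocks of 32 channel values each, numbered 1..25
# starting from (H0, W0); block i holds the values <H, W, C0..C31> of one
# (height, width) position, enumerated row by row.
blocks = k0.reshape(5 * 5, 32)

# Data block 11 (index 10 when counting from 0) corresponds to <H2, W0, C0..C31>.
assert np.array_equal(blocks[10], k0[2, 0, :])

# The stream written to the parameter cache is simply the blocks concatenated in
# numbering order, with the data inside each block ordered by channel.
cache_image = blocks.reshape(-1)
```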
Fig. 2 (c) shows a storage manner of 32 convolution kernels in the parameter cache 202, and similarly, a plurality of convolution kernels are stored in sequence according to corresponding convolution kernel numbers, wherein data of each convolution kernel is stored continuously in the parameter cache 202, and storage areas of different convolution kernels in the parameter cache 202 do not intersect.
The input data is stored in the input data buffer 203 in a manner similar to the storage of a convolution kernel in the parameter cache 202: the input data can be divided into a plurality of data blocks according to the product of its height and width, the data blocks are numbered, and they are then stored in the order of their numbers. For example, if the size of the input data is 28 × 28 × 32, i.e., the height and width are both 28 and the number of channels is 32, the input data is divided into 28 × 28 = 784 data blocks numbered from 1 to 784, and the data blocks are stored in the input data buffer in the order of their numbers.
It can be understood that, when the number of channels of the input data or the convolution kernel data exceeds 32, the data blocks of the first 32 channels of the input data or the convolution kernel data may be divided and numbered first, then the data blocks of the remaining channels may be divided and numbered, and finally the data blocks may be sequentially stored according to the numbers of the data blocks.
It can be understood that, in the above example, the data block division of the input data and the convolution kernel data according to 32 channels is only an exemplary method of data block division, and in practical application, the number of channels according to which the data block division is performed may be an integer multiple of 32. For example, in some embodiments, the data block partitioning may be performed for input data or convolution kernel data according to 64 channel numbers.
The following describes in detail how the multiplication circuit 200 shown in Fig. 2 (a) performs a standard convolution operation.
Assume that the multiplication circuit 200 shown in Fig. 2 (a) performs a standard convolution operation on the input data stored in the input data buffer 203 (the size of the input data is 28 × 28 × 32, with channels C0 to C31) using 32 convolution kernels (i.e., K0 to K31) of size 5 × 5 × 32, so that each of the 32 convolution kernels of size 5 × 5 × 32 must convolve all of the data of channels C0 to C31 of the input data. The storage space of the parameter buffer 202 is 18 KB (kilobytes), and the storage space of the input data buffer 203 is large enough to hold the entire input data; the embodiments of the present application do not specifically limit the size of the input data buffer 203.
Specifically, the storage space required by the 32 convolution kernels of size 5 × 5 × 32 is 5 × 5 × 32 × 32 = 25 KB. Since the 25 KB required by the 32 convolution kernels is larger than the 18 KB of the parameter cache 202, the parameter cache 202 cannot store all 32 convolution kernels at once, and loading must be performed twice: the first load brings in the first half of the data of each of the 32 convolution kernels K0 to K31, about 18 KB in total, and the second load brings in the second half of the data of each of the 32 convolution kernels K0 to K31, about 7 KB in total. The input data is first convolved with the first-half data of K0 to K31; after that convolution is completed, the second-half data of K0 to K31 is loaded into the parameter cache 202 and convolved with the input data, yielding the convolution operation result of the input data with the 32 convolution kernels.
After the parameter cache 202 has loaded the convolution kernel data for the first time, it stores the first-half data of each of the 32 convolution kernels K0 to K31. The first-half data of each convolution kernel may be, for example, the first 15 data blocks in Fig. 2 (b), or the first 17 data blocks, which can be determined from the average storage space available per kernel when the parameter cache 202 stores the 32 convolution kernels. The input data buffer 203 stores the input data with 32 channels, denoted channel C0, channel C1, and so on up to channel C31. When the PE array 205 shown in Fig. 2 (a) convolves the input data obtained from the input data buffer 203 with the convolution kernel data obtained from the parameter buffer 202, each PE in the PE array 205 uses the first-half data of one of the convolution kernels K0 to K3 to convolve one data block of the 32-channel input data that has the same size as the convolution kernel, i.e., a 5 × 5 × 32 data block. It can be understood that, after the first-half data of convolution kernels K0 to K3 has been convolved with the data blocks, each PE in the PE array 205 obtains the next batch of first-half data, that of convolution kernels K4 to K7, from the parameter cache 202 and convolves the newly obtained first-half data with the data blocks, and so on until the first-half data of all 32 convolution kernels stored in the parameter cache 202 has been convolved with the data blocks. Here, a data block of the input data with the same size as the convolution kernel is obtained by sliding a window over the input data with a set stride and taking the data inside the sliding window. It can be appreciated that, since the input data has 32 data channels, each sliding window (i.e., each data block) on the input data also has 32 data channels.
For example, as shown in Fig. 2 (d), for input data of size 28 × 28 × 32, a sliding window of size 5 × 5 × 32 is slid over the input data starting from the first datum at the upper left of channel C0, with a stride of 1. The data block obtained when the sliding window is at the starting position is A1; sliding the window once to the right gives data block A2; sliding once more gives data block A3 (not shown); sliding once more gives data block A4 (not shown); and the window continues to slide to the right until it reaches the rightmost end of the input data, producing one data block per step. The window then moves down by one datum from data block A1 and again slides from left to right until the rightmost end of the input data, producing one data block per step. This sliding process continues until the sliding window reaches the bottom-right of the input data.
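A minimal sketch of this sliding-window extraction, assuming stride 1 and an (H, W, C) layout (names are illustrative only):

```python
import numpy as np

def sliding_blocks(x, kh, kw, stride=1):
    """Enumerate the data blocks A1, A2, ... obtained by sliding a kh x kw window
    over a (H, W, C) input with the given stride, left to right and then top to
    bottom; every data block keeps all C channels of the input."""
    H, W, C = x.shape
    for i in range(0, H - kh + 1, stride):
        for j in range(0, W - kw + 1, stride):
            yield x[i:i + kh, j:j + kw, :]

x = np.random.randn(28, 28, 32).astype(np.float32)
blocks = list(sliding_blocks(x, 5, 5))  # 24 * 24 = 576 data blocks of 5 x 5 x 32
```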
The PE array 205 obtains, in batches, the first-half data to be convolved from the first-half data of the 32 convolution kernels K0 to K31 stored in the parameter buffer 202. For example, the PE array 205 first obtains the first-half data of the first batch of convolution kernels K0, K1, K2 and K3 from the parameter buffer 202: in the PE array 205 shown in Fig. 2 (a), the kernel data participating in the convolution operations in PE10, PE20, PE30 and PE40 (the 4 PEs of the first column) is the first-half data of convolution kernel K0; the kernel data in PE11, PE21, PE31 and PE41 (the 4 PEs of the second column) is the first-half data of convolution kernel K1; the kernel data in PE12, PE22, PE32 and PE42 (the 4 PEs of the third column) is the first-half data of convolution kernel K2; and the kernel data in PE13, PE23, PE33 and PE43 (the 4 PEs of the fourth column) is the first-half data of convolution kernel K3. After the convolution operations with the first-half data of the first 4 convolution kernels are completed, the results are stored in the output buffer 204. The PE array 205 then obtains the first-half data of the second batch of convolution kernels K4, K5, K6 and K7 from the parameter cache 202 and performs the convolution operations with the corresponding PEs; after these are completed, it obtains the first-half data of the third batch of convolution kernels from the parameter cache 202 and performs the convolution operations, and this fetching and convolving continues until the first-half data of all convolution kernels in the parameter cache 202 has been convolved.
Because each data block of the input data must be convolved with all 32 complete convolution kernels, while the parameter cache 202 currently holds only the first-half data of the 32 convolution kernels, each data block still has to be convolved with the second-half data of the 32 convolution kernels. The parameter cache 202 therefore needs to load the second-half data of the 32 convolution kernels from memory, and the newly loaded second-half data overwrites part of the first-half data of the 32 convolution kernels stored in the parameter cache 202.
After the parameter cache 202 finishes loading the second-half data of the 32 convolution kernels, the PE array 205 obtains the second-half data of the 4 convolution kernels K0, K1, K2 and K3 from the parameter cache 202, convolves it with the data blocks A1, A2, A3 and A4 respectively, and adds the resulting convolution results to the convolution results of the first-half data of K0, K1, K2 and K3 with the data blocks A1, A2, A3 and A4 stored in the output buffer 204, thereby obtaining the convolution operation results of the data blocks A1, A2, A3 and A4 with the complete convolution kernels K0, K1, K2 and K3. The PE array 205 then obtains the second-half data of the 4 convolution kernels K4, K5, K6 and K7 from the parameter buffer 202, convolves it with the data blocks, and adds the results to the convolution results of the first-half data of K4, K5, K6 and K7 with the data blocks A1, A2, A3 and A4 stored in the output buffer 204, obtaining the convolution operation results of the data blocks A1, A2, A3 and A4 with K4, K5, K6 and K7. This continues until the convolution operation results of the data blocks A1, A2, A3 and A4 with the convolution kernels K28, K29, K30 and K31 are obtained.
Similarly, the PE array 205 obtains the data blocks to be convolved in batches from the input data stored in the input data buffer 203. For example, the PE array 205 obtains the first batch of data blocks A1, A2, A3 and A4 of the input data from the input data buffer 203: in the PE array 205 shown in Fig. 2 (a), the data blocks of the input data participating in the convolution operations in PE10, PE20, PE30 and PE40 (the 4 PEs of the first column) are data blocks A1, A2, A3 and A4 respectively; the data blocks in PE11, PE21, PE31 and PE41 (the 4 PEs of the second column) are data blocks A1, A2, A3 and A4 respectively; the data blocks in PE12, PE22, PE32 and PE42 (the 4 PEs of the third column) are data blocks A1, A2, A3 and A4 respectively; and the data blocks in PE13, PE23, PE33 and PE43 (the 4 PEs of the fourth column) are data blocks A1, A2, A3 and A4 respectively. After the convolution operations with the first batch of 4 data blocks of the input data are completed, the PE array 205 obtains the second batch of data blocks of the input data from the input data buffer 203 and performs the convolution operations with the corresponding PEs; after these are completed, it obtains the third batch of data blocks from the input data buffer 203 and performs the convolution operations, and this fetching and convolving continues until all data blocks of the input data in the input data buffer 203 have been convolved.
It should be understood that, in the PE array 205 shown in Fig. 2 (a), the PEs in the same column use the same convolution kernel data but different data blocks of the input data, while the PEs in the same row use different convolution kernel data but the same data block of the input data.
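The PE assignment just described (kernels shared down a column, data blocks shared across a row) can be sketched as a toy software model; this is purely illustrative and does not model the actual PE hardware or its timing:

```python
import numpy as np

def pe_array_batch(blocks4, kernels4):
    """Toy model of one batch on the 4 x 4 PE array: every PE in column c uses
    kernel kernels4[c] and every PE in row r uses data block blocks4[r], so
    PE(r, c) produces the multiply-add result of that data block with that kernel."""
    out = np.zeros((4, 4), dtype=np.float32)
    for r, block in enumerate(blocks4):        # PEs in the same row share the data block
        for c, kernel in enumerate(kernels4):  # PEs in the same column share the kernel
            out[r, c] = np.sum(block * kernel)
    return out

# Four 5 x 5 x 32 data blocks A1..A4 and four kernels (or kernel halves) K0..K3.
blocks4 = [np.random.randn(5, 5, 32).astype(np.float32) for _ in range(4)]
kernels4 = [np.random.randn(5, 5, 32).astype(np.float32) for _ in range(4)]
partial_sums = pe_array_batch(blocks4, kernels4)  # one result per PE
```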
As can be seen from the above description of how the multiplication circuit 200 performs the standard convolution operation, when the storage space of the parameter buffer 202 cannot hold all of the convolution kernel data, the convolution kernel data must be divided into first-half data and second-half data and loaded into the parameter buffer 202 in two passes. Convolving each data block of the input data with all of the convolution kernels therefore has to wait for both loads of convolution kernel data, and this frequent loading of data increases the latency of the multiplication circuit 200, which lowers the efficiency of the convolution operation.
To solve the above technical problem of the related art shown in Fig. 2, an embodiment of the present application provides a technical solution for implementing the standard convolution operation when the storage space of the parameter cache of the PE array is smaller than the storage space required by the plurality of convolution kernels. In contrast to the technical solution shown in Fig. 2, in the technical solution shown in Fig. 3 the plurality of convolution kernels are all divided into two parts in the same manner, such that the storage space required by the convolution kernel data of each part is smaller than the storage space of the parameter cache 202; the input data is then convolved with the two parts of the convolution kernels separately, and finally the separately obtained feature maps are combined. The combined feature map is the same as the feature map obtained by convolving the input data with the complete convolution kernels.
As shown in Fig. 3, the computing method of the convolutional neural network in some embodiments of the present application includes:
step S301: the storage space required by the plurality of convolution kernels is determined and compared to the size of the storage space of the parameter cache 202. Specifically, the storage space required for 32 convolution cores of size 5 × 32 is 5 × 32=25kb, the storage space of the parameter cache 202 is 18KB, and the storage space required for 32 convolution cores of size 5 × 32 is greater than the storage space of the parameter cache 202 because of 25kb > < 18kb. The plurality of convolution kernels are used for performing convolution operation with input data.
Step S302: under the condition that the storage space required by the convolution kernels is larger than that of the parameter cache 202, each convolution kernel is segmented into a plurality of partial convolution kernels, and the partial convolution kernels segmented by the convolution kernels are grouped to obtain a plurality of convolution kernel groups.
In the embodiments of the present application, the segmentation mode used to segment the plurality of convolution kernels may be determined according to the size of the parameter cache 202. The segmentation mode of a convolution kernel may include, but is not limited to, a segmentation direction and a segmentation position. The segmentation direction may be the width direction or the height direction of the convolution kernel, and the segmentation position indicates the height position (i.e., row number) or width position (i.e., column number) at which the convolution kernel is cut: if the segmentation direction is the width direction, the segmentation position indicates the column or columns at which the convolution kernel is cut; if the segmentation direction is the height direction, the segmentation position indicates the row or rows at which the convolution kernel is cut. The segmentation position may be determined according to the storage space of the parameter cache and the size of the convolution kernel. The size of a convolution kernel includes its width, height and number of channels, and when a convolution kernel is segmented, it is not cut along the channel dimension.
It is understood that the number of partial convolution kernels obtained by segmenting each convolution kernel may be 2 or more, and is determined by the storage space of the parameter cache 202 and the storage space required by the plurality of convolution kernels. For example, if the storage space of the parameter cache 202 is 10 KB and the storage space required by the plurality of convolution kernels is 15 KB, each convolution kernel may be segmented into 2 partial convolution kernels; if the storage space of the parameter cache 202 is 10 KB and the storage space required by the plurality of convolution kernels is 30 KB, each convolution kernel needs to be segmented into 3 partial convolution kernels.
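A hypothetical planning helper along these lines is sketched below; the function name, parameters and the one-byte-per-weight assumption are illustrative and not taken from the patent:

```python
import math

def plan_split(kernel_h, kernel_w, channels, num_kernels, bytes_per_elem, cache_bytes):
    """Hypothetical planning helper: choose N, the number of partial kernels each
    kernel is split into, so that one group of partial kernels fits the parameter
    cache, cutting along the width (column) direction as in the example above."""
    required = kernel_h * kernel_w * channels * num_kernels * bytes_per_elem
    n = math.ceil(required / cache_bytes)      # e.g. 25 KB needed vs 18 KB cache -> N = 2
    # Spread the kernel_w columns over the N parts as evenly as possible.
    base, extra = divmod(kernel_w, n)
    widths = [base + (1 if i < extra else 0) for i in range(n)]
    return n, widths

# 32 kernels of 5 x 5 x 32 one-byte weights against an 18 KB parameter cache:
print(plan_split(5, 5, 32, 32, 1, 18 * 1024))  # -> (2, [3, 2])
```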
After every convolution kernel of the plurality of convolution kernels has been segmented, all of the resulting partial convolution kernels are grouped; the number of groups equals the number of partial convolution kernels obtained from each convolution kernel, and each group forms one convolution kernel group. Segmenting each convolution kernel yields a plurality of partial convolution kernels, for example a first partial convolution kernel, a second partial convolution kernel, …, and an Nth partial convolution kernel, where N is an integer greater than or equal to 2. The partial convolution kernels in each convolution kernel group are all partial convolution kernels cut from the same part: for example, all partial convolution kernels in the first convolution kernel group are the first partial convolution kernels cut from the respective convolution kernels, and the relative position of the first partial convolution kernel within each convolution kernel is the same; all partial convolution kernels in the second convolution kernel group are the second partial convolution kernels cut from the respective convolution kernels, and the relative position of the second partial convolution kernel within each convolution kernel is likewise the same.
In addition, the storage space required by each of the plurality of convolution kernel groups is smaller than the storage space of the parameter cache 202, so that each convolution kernel group can be loaded into the parameter cache 202 all at once.
In the embodiments of the present application, two convolution kernel groups are taken as an example, namely a first convolution kernel group and a second convolution kernel group. In the following, the segmentation direction is taken to be the width direction and the segmentation position to be the 3rd column, as an example. Segmenting the convolution kernel K0 of size 5 × 5 × 32 yields a convolution kernel K0' of size 5 × 3 × 32 and a convolution kernel K0'' of size 5 × 2 × 32; segmenting the convolution kernel K1 of size 5 × 5 × 32 yields a convolution kernel K1' of size 5 × 3 × 32 and a convolution kernel K1'' of size 5 × 2 × 32. Similarly, the convolution kernels K2, K3, …, K31 are segmented in the same manner, yielding convolution kernels K2', K2'', K3', K3'', …, K31' and K31'', where the convolution kernels K2', K3', …, K31' are all of size 5 × 3 × 32 and the convolution kernels K2'', K3'', …, K31'' are all of size 5 × 2 × 32.
In addition, in some embodiments, a convolution kernel is segmented into two parts either in its width direction or in its height direction. For example, a convolution kernel of size 5 × 5 × 32 may be segmented into a convolution kernel of size 5 × 3 × 32 and a convolution kernel of size 5 × 2 × 32 in the width direction, or into a convolution kernel of size 3 × 5 × 32 and a convolution kernel of size 2 × 5 × 32 in the height direction.
Segmenting the 32 convolution kernels at the 3rd column in the width direction yields 64 partial convolution kernels; the partial convolution kernels of size 5 × 3 × 32 serve as first partial convolution kernels, and the partial convolution kernels of size 5 × 2 × 32 serve as second partial convolution kernels. The 32 partial convolution kernels of size 5 × 3 × 32 make up the first convolution kernel group, and the 32 partial convolution kernels of size 5 × 2 × 32 make up the second convolution kernel group. The partial convolution kernels in the first convolution kernel group and those in the second convolution kernel group are each convolved with the input data in the input data buffer 203, as sketched below.
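A minimal sketch of this segmentation, assuming an (H, W, C) layout and one-byte weights (all names are illustrative):

```python
import numpy as np

# 32 full convolution kernels of shape (5, 5, 32), assumed (height, width, channel).
kernels = [np.random.randn(5, 5, 32).astype(np.float32) for _ in range(32)]

# Cut every kernel at the same position along the width axis: columns 0..2 form
# the first partial kernel Ki', columns 3..4 form the second partial kernel Ki''.
first_group = [k[:, :3, :] for k in kernels]    # 32 partial kernels of size 5 x 3 x 32
second_group = [k[:, 3:, :] for k in kernels]   # 32 partial kernels of size 5 x 2 x 32

# With one-byte weights the groups need 15 KB and 10 KB, each below the 18 KB cache.
```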
Step S303: the first set of convolution kernels is loaded into parameter cache 202. The 32 partial convolution kernels in the first convolution kernel group are K0', K1', …, K31', where the size of each partial convolution kernel is 5 × 3 × 32, the storage space required by the first convolution kernel group is 5 × 3 × 32=15kb, and the storage space of the parameter cache 202 is 18kb,15kb ± 18kb, so that the parameter cache 202 can store all 32 partial convolution kernels of the size 5 × 3 × 32 in the first convolution kernel group.
Step S304: the first input data is determined according to the size of a part of convolution kernels in the first convolution kernel group and the input data in the input data buffer 203. Because the partial convolution kernels in the first convolution kernel group are only a part of the convolution kernels before slicing, part of the data in the input data cannot be subjected to convolution operation with the partial convolution kernels in the first convolution kernel group, for example, the partial convolution kernels in the first convolution kernel group are partial convolution kernels with the size of 5 × 3 × 32 sliced from the convolution kernels with the size of 5 × 32, the partial convolution kernels in the first convolution kernel group cannot be subjected to convolution operation with the data in the last 2 rows of the input data with the size of 28 × 32, and the data in the last 2 rows of the input data can only be subjected to convolution operation with the partial convolution kernels with the size of 5 × 2 × 32 in the second convolution kernel group. Therefore, the data which is not convolved with the partial convolution kernels in the first convolution kernel group is removed from the input data, so that the first input data can be obtained, the size of the first input data is 28 × 26 × 32, and 26 rows are rows which are left after the last 2 rows of the input data are removed.
Step S305: and carrying out convolution operation on the first input data and partial convolution kernels in the first convolution kernel group to obtain a first convolution result. The first input data and the partial convolution kernels in the first convolution kernel group are subjected to convolution operation in the above-mentioned multiplication circuit 200, the first input data is stored in the input data buffer 203, and the 32 partial convolution kernels in the first convolution kernel group are stored in the parameter buffer 202. The first input data is obtained into a plurality of data blocks according to the method for obtaining the data blocks through the sliding window, wherein the size of the sliding window is the same as that of the partial convolution kernels in the first convolution kernel group, namely 5 × 3 × 32. The PE array in the multiplication circuit 200 sequentially obtains 4 data blocks and 4 partial convolution kernels from the input data buffer 203 and the parameter buffer 202 to perform convolution operation until the first input data and the partial convolution kernels in the first convolution kernel group all complete convolution operation.
The first input data with the size of 28 × 26 × 32 is convolved with 32 partial convolution kernels with the size of 5 × 3 × 32 in the first convolution kernel group to obtain a first convolution result, and the size of the first convolution result is 24 × 32. In some embodiments, the resulting first convolution result is stored in output buffer 204.
Step S306: the second set of convolution kernels is loaded into parameter cache 202. The 32 partial convolution kernels in the second convolution kernel group are K0", K1", …, K31", where the size of each partial convolution kernel is 5 × 2 × 32, the storage space required for the second convolution kernel group is 5 × 2 × 32=10kb, and the storage space of the parameter cache 202 is 18kb,10kb, and 18kb, so that the parameter cache 202 can store all 32 partial convolution kernels of size 5 × 2 kb 32 in the second convolution kernel group.
Step S307: the second input data is determined based on the size of a portion of the convolution kernels in the second set of convolution kernels and the input data in the input data buffer 203. Since the partial convolution kernel in the second convolution kernel group is only a part of the pre-slicing convolution kernel, there is a part of data in the input data that will not be subjected to convolution operation with the partial convolution kernel in the second convolution kernel group, for example, the partial convolution kernel in the second convolution kernel group is a partial convolution kernel with a size of 5 × 2 × 32 that is sliced from the pre-slicing convolution kernel with a size of 5 × 32, so that the partial convolution kernel in the second convolution kernel group will not be subjected to convolution operation with the first 3 rows of data in the input data with a size of 28 × 32, and the first 3 rows of data in the input data can only be subjected to convolution operation with the partial convolution kernel with a size of 5 × 3 × 32 in the first convolution kernel group. Therefore, the data which is not convolved with the partial convolution kernel in the second convolution kernel group is removed from the input data, so that the second input data can be obtained, the size of the second input data is 28 × 25 × 32, and 25 rows are rows which are left after the first 3 rows are removed from the input data.
Step S308: and performing convolution operation on the second input data and part of convolution kernels in the second convolution kernel group to obtain a second convolution result. The second input data is convolved with the partial convolution kernels in the second convolution kernel group in the above-described multiplication circuit 200, the second input data is stored in the input data buffer 203, and the 32 partial convolution kernels in the second convolution kernel group are stored in the parameter buffer 202. The second input data is obtained into a plurality of data blocks according to the method for obtaining the data blocks through the sliding window, wherein the size of the sliding window is the same as that of the partial convolution kernels in the second convolution kernel group, namely 5 × 2 × 32. The PE array in the multiplication circuit 200 sequentially obtains 4 data blocks and 4 partial convolution kernels from the input data buffer 203 and the parameter buffer 202 to perform convolution operation until the second input data and the partial convolution kernels in the second convolution kernel group all complete convolution operation.
And performing convolution operation on the second input data with the size of 28 × 25 × 32 and 32 partial convolution kernels with the size of 5 × 2 × 32 in the second convolution kernel group to obtain a second convolution result, wherein the size of the second convolution result is also 24 × 32. In some embodiments, the resulting second convolution result is also stored in output buffer 204.
Step S309: and combining the first convolution result and the second convolution result to obtain a final convolution result. Because the obtained first convolution result and the second convolution result have the same size, the data on the corresponding positions of the first convolution result and the second convolution result are directly added, and the final convolution result can be obtained. For example, the first convolution result and the second convolution result are both 24 × 32, the data at the corresponding positions of the first convolution result and the second convolution result are added, i.e., the data corresponding to the position < H0, W0, C0> of the first convolution result is added to the data corresponding to the position < H0, W0, C0> of the second convolution result, the data corresponding to the position < H0, W0, C1> is added, … is added, and the corresponding data corresponding to the position < H23, W23, C31> are added, and the result of adding the corresponding data at all the positions is the final convolution result.
Therefore, with the computing method for the convolutional neural network provided by the present application, when the multiplication circuit 200 performs the standard convolution operation, even if the storage space of the parameter cache 202 cannot hold all of the convolution kernels, the convolution kernel data does not need to be loaded repeatedly during each convolution operation, so the latency of the multiplication circuit 200 can be reduced and the efficiency of the convolution operation can be improved.
A system on chip including the multiplication circuit 200 provided by the present application is described below. For example, as shown in Fig. 4, a system on chip (SOC) 400 includes the multiplication circuit 200, a master central processing unit (CPU) 410, a double data rate (DDR) memory 420, and an advanced extensible interface (AXI) bus 430. The multiplication circuit 200, the master CPU 410 and the DDR memory 420 communicate over the AXI bus 430. The structure and operating principle of the multiplication circuit 200 have been introduced above with reference to Fig. 2 (a) and are not repeated here.
DDR memory 420 may be used to load and store data and/or instructions. For example, in some embodiments, DDR memory 420 may be used to load or store convolution kernel data, input data, convolution result data output by multiplication circuit 200, and the like, involved in performing convolution operations by multiplication circuit 200.
Master CPU410 may include one or more single-core or multi-core processors. In some embodiments, master CPU410 may include any combination of general-purpose processors and special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In some embodiments, the master CPU410 may be configured to cause the multiplication circuit 200 to perform standard convolution operations under different application scenarios.
Fig. 5 provides a block diagram of an electronic device 100, according to some embodiments of the present application. As shown in fig. 5, the electronic device 100 includes a memory 110, an input-output device 120, a processor 140, a communication module 130, and a system-on-chip 400.
The multiplication circuit 200 is used to perform the standard convolution operation in different scenarios; its structure and working principle have been described above with reference to fig. 2 (a) and are not repeated here.
The processor 140 may include one or more processing units, for example, a processing module or processing circuit that may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Digital Signal Processor (DSP), a Microcontroller Unit (MCU), an Artificial Intelligence (AI) processor, a Field Programmable Gate Array (FPGA), or the like. In some embodiments, assuming that the electronic device 100 is an autonomous vehicle, the processor 140 is configured to control the driving state of the autonomous vehicle according to the image recognition result output by the multiplication circuit 200. For another example, in some embodiments, assuming that the electronic device 100 is an access control device that performs face recognition, the processor 140 is configured to determine whether to open the door according to the face recognition result output by the multiplication circuit 200.
The memory 110 may be used to store data, software programs, and modules. It may be a volatile memory, such as a Random-Access Memory (RAM); a non-volatile memory, such as a Read-Only Memory (ROM), a flash memory, a Hard Disk Drive (HDD), or a Solid-State Drive (SSD); a combination of the above types of memories; or a removable storage medium, such as a Secure Digital (SD) memory card. For example, the memory 110 stores the operation program of the multiplication circuit 200, convolution operation results output by the multiplication circuit 200, captured images, convolution kernel data involved in the convolution operations performed by the multiplication circuit 200, and the like.
The input/output device 120 may include a display screen, a touch screen, a speaker, and the like.
The communication module 130 may be, for example, a WIFI module, a Universal Serial Bus (USB) module, a 4G or 5G module, or the like, through which the electronic device 100 communicates with other electronic devices.
In the above embodiments, all or part of the implementation may be realized by software, hardware, firmware, or any combination thereof. When implemented in software, the implementation may take the form, in whole or in part, of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device. The computer instructions may be stored on a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., Digital Video Disc (DVD)), or a semiconductor medium (e.g., Solid State Drive (SSD)), among others.
Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of these implementations. Embodiments of the application may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices in a known manner. For purposes of this Application, a processing system includes any system having a Processor such as, for example, a Digital Signal Processor (DSP), a microcontroller, an Application Specific Integrated Circuit (ASIC), or a microprocessor.
The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. The program code can also be implemented in assembly or machine language, if desired. Indeed, the mechanisms described in this application are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.
In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed via a network or via other computer-readable media. Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy diskettes, optical disks, Compact Disc Read-Only Memories (CD-ROMs), magneto-optical disks, Read-Only Memories (ROMs), Random Access Memories (RAMs), Erasable Programmable Read-Only Memories (EPROMs), Electrically Erasable Programmable Read-Only Memories (EEPROMs), magnetic or optical cards, flash memory, or tangible machine-readable memories used to transmit information over the Internet in electrical, optical, acoustical, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). Thus, a machine-readable medium includes any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).
In the drawings, some features of the structures or methods may be shown in a particular arrangement and/or order. However, it is to be understood that such specific arrangement and/or ordering may not be required. Rather, in some embodiments, the features may be arranged in a manner and/or order different from that shown in the illustrative figures. In addition, the inclusion of a structural or methodical feature in a particular figure is not meant to imply that such feature is required in all embodiments, and in some embodiments, may not be included or may be combined with other features.
It should be noted that, in the apparatus embodiments of the present application, each unit/module is a logical unit/module. Physically, one logical unit/module may be one physical unit/module, may be part of one physical unit/module, or may be implemented by a combination of multiple physical units/modules; the physical implementation of the logical unit/module itself is not what matters most, and the combination of functions implemented by the logical units/modules is the key to solving the technical problem addressed by the present application. Furthermore, in order to highlight the innovative part of the present application, the above apparatus embodiments do not introduce units/modules that are less closely related to solving the technical problem addressed by the present application; this does not mean that no other units/modules exist in the above apparatus embodiments.
It is noted that, in the examples and descriptions of this patent, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between the entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
While the present application has been shown and described with reference to certain preferred embodiments thereof, it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present application.

Claims (8)

1. A calculation method of a convolutional neural network, used for a deep learning processing chip, characterized by comprising the following steps:
determining a segmentation mode of a convolution kernel according to the size of the parameter cache on the deep learning processing chip;
segmenting each convolution kernel in a plurality of convolution kernels according to the segmentation mode so as to divide each convolution kernel into N partial convolution kernels, wherein the N partial convolution kernels comprise first to Nth partial convolution kernels, and the convolution kernels are used for performing convolution operation on input data;
grouping the partial convolution kernels to obtain N convolution kernel groups, wherein the set of the plurality of first partial convolution kernels forms a first convolution kernel group, the set of the plurality of Nth partial convolution kernels forms an Nth convolution kernel group, and the storage space required by each of the first to Nth convolution kernel groups is smaller than the storage space of the parameter cache;
loading the first to Nth convolution kernel groups to the parameter cache respectively, and performing convolution operation on the input data and the convolution kernel groups loaded to the parameter cache respectively to obtain N convolution operation results;
merging the N convolution operation results, and determining the merged result as the convolution operation result of the input data and the plurality of convolution kernels, wherein the N convolution operation results are the same in size;
wherein the loading of the first to Nth convolution kernel groups to the parameter cache and the performing of the convolution operation on the input data and the convolution kernel groups loaded to the parameter cache, respectively, to obtain N convolution operation results comprises:
loading the first convolution kernel group into the parameter cache;
according to the size of the first partial convolution kernels in the first convolution kernel group, removing, from the input data, data that does not undergo convolution operation with the partial convolution kernels in the first convolution kernel group, and determining first input data;
performing convolution operation on the first input data and the first partial convolution kernels in the first convolution kernel group to obtain a first convolution operation result;
loading the second convolution kernel group into the parameter cache;
according to the size of the second partial convolution kernels in the second convolution kernel group, removing, from the input data, data that does not undergo convolution operation with the partial convolution kernels in the second convolution kernel group, and determining second input data; and
performing convolution operation on the second input data and the second partial convolution kernels in the second convolution kernel group to obtain a second convolution operation result.
2. The method of claim 1, wherein the segmentation mode of the convolution kernels comprises a segmentation direction and a segmentation position, and the segmentation direction comprises a height direction of the convolution kernel and a width direction of the convolution kernel.
3. The method of claim 2, further comprising:
the slicing position is determined according to the storage space of the parameter buffer and the size of the convolution kernel, an
The sizes of the convolution kernels include the width, height and channel number of the convolution kernels.
4. The method of claim 1, wherein the first through Nth partial convolution kernels have the same number of channels.
5. The method according to any one of claims 1 to 4, wherein N has a value of 2.
6. The method of claim 1, wherein combining the N convolution operation results and determining a combined result as a convolution operation result of the input data and the plurality of convolution kernels comprises:
adding the data at corresponding positions in the first convolution operation result and the second convolution operation result, and taking the added result as the convolution operation result of the input data and the plurality of convolution kernels.
7. A system on a chip, comprising:
a memory to store instructions for execution by one or more processors of a system-on-chip;
a processor, being one of the processors of a system on a chip, for performing the method of computing a convolutional neural network of any one of claims 1 to 6 when the instructions are executed by the processor.
8. An electronic device comprising the system-on-chip of claim 7, and a processor and a memory;
a memory to store instructions for execution by one or more processors of an electronic device;
a processor for performing the method of computing a convolutional neural network of any one of claims 1-6 when the instructions are executed by one or more processors.
CN202110897011.4A 2021-08-05 2021-08-05 Convolutional neural network computing method, system on chip and electronic device Active CN113449852B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110897011.4A CN113449852B (en) 2021-08-05 2021-08-05 Convolutional neural network computing method, system on chip and electronic device

Publications (2)

Publication Number Publication Date
CN113449852A CN113449852A (en) 2021-09-28
CN113449852B true CN113449852B (en) 2023-02-03

Family

ID=77818275

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110897011.4A Active CN113449852B (en) 2021-08-05 2021-08-05 Convolutional neural network computing method, system on chip and electronic device

Country Status (1)

Country Link
CN (1) CN113449852B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114202067A (en) * 2021-11-30 2022-03-18 山东产研鲲云人工智能研究院有限公司 Bandwidth optimization method for convolutional neural network accelerator and related equipment
CN114546484A (en) * 2022-02-21 2022-05-27 山东浪潮科学研究院有限公司 Deep convolution optimization method, system and device based on micro-architecture processor
CN114565501B (en) * 2022-02-21 2024-03-22 格兰菲智能科技有限公司 Data loading method and device for convolution operation
CN115292662B (en) * 2022-08-18 2023-09-22 上海燧原科技有限公司 Convolution acceleration operation method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304923A (en) * 2017-12-06 2018-07-20 腾讯科技(深圳)有限公司 Convolution algorithm processing method and Related product
CN110473137A (en) * 2019-04-24 2019-11-19 华为技术有限公司 Image processing method and device
CN111199273A (en) * 2019-12-31 2020-05-26 深圳云天励飞技术有限公司 Convolution calculation method, device, equipment and storage medium
CN112470138A (en) * 2019-11-29 2021-03-09 深圳市大疆创新科技有限公司 Computing device, method, processor and mobile equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109740732B (en) * 2018-12-27 2021-05-11 深圳云天励飞技术有限公司 Neural network processor, convolutional neural network data multiplexing method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant