CN112668708B - Convolution operation device for improving data utilization rate

Convolution operation device for improving data utilization rate

Info

Publication number: CN112668708B
Application number: CN202011578577.2A
Authority: CN (China)
Prior art keywords: data, multiply, input, convolution, feature map
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN112668708A
Inventors: 廖湘萍, 丁永林, 曹学成, 李炜
Current and original assignee: CETC 52 Research Institute
Application filed by CETC 52 Research Institute; priority to CN202011578577.2A
Publication of application: CN112668708A; application granted and published as CN112668708B

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a convolution operation device that improves data utilization. The multiply-accumulate units adopt a parallel structure, and the input feature map data is split into a number of data blocks according to the parallelism of the multiply-accumulate units. Convolution is computed block by block: the same data block is read and reused repeatedly until it has completed convolution with all weights of the current network layer. This reduces the number of times each data block must be re-fetched into the input cache, which improves data utilization, reduces the on-chip storage requirement, avoids re-reading feature map data from off-chip storage, and markedly improves the overall efficiency of the convolution operation.

Description

Convolution operation device for improving data utilization rate
Technical Field
The application belongs to the field of computers, and particularly relates to a convolution operation device for improving data utilization rate.
Background
In recent years, deep neural networks have been widely applied, especially in image processing, speech recognition, and text classification. Deep neural networks achieve excellent accuracy by analyzing huge amounts of data, and in general, the deeper the network, the stronger its modeling capability and the higher its inference accuracy.
Convolutional neural networks are a class of deep neural networks. Although they achieve high accuracy, the amount of computation grows enormously as the number of network layers increases, placing ever higher demands on data bandwidth and data storage. Hardware resources are limited and cannot simply grow with the computational load of the algorithm. Moreover, because a convolution kernel window must slide across the entire image, a large amount of data is reused during convolution. Since the on-chip storage of an arithmetic device is limited, data must be staged in external storage, so the computation becomes bound by data bandwidth, which hurts efficiency.
Disclosure of Invention
An object of the present application is to provide a convolution operation device that improves data utilization, reduces the on-chip storage requirement, and improves convolution efficiency.
To this end, the application adopts the following technical solution:
the utility model provides an improve convolution arithmetic device of data utilization ratio, including input bus, input data buffer array, multiply and accumulate unit array, output data buffer array and output bus, input data buffer array includes input characteristic map data buffer array and weight buffer array, multiply and accumulate unit array includes Q parallel multiply and accumulate unit, wherein:
the input feature map data cache array is used to sequentially acquire a number of data blocks from the input bus and cache them, the data blocks being obtained by dividing the input feature map data according to the parallelism Q of the multiply-accumulate units;
the weight cache array is used to acquire weight data from the input bus and cache it, the weight data comprising the weights of one or more convolution kernels;
the multiply-accumulate unit array is used to perform multiply-accumulate operations on the data blocks and the weight data, repeatedly reading the same data block and pairing it with each convolution kernel until the multiply-accumulate operations of all data blocks are completed, and to output the convolution result obtained from each data block with each convolution kernel;
the output data cache array is used to receive the convolution results output by the multiply-accumulate unit array, splice them into the output feature map data, and cache that data;
and the output bus is used to output the output feature map data.
Several preferred options are provided below, not as additional limitations on the general solution above but merely as further additions or preferences; absent technical or logical contradiction, each option may be combined with the general solution individually or together with other options.
Preferably, dividing the input feature map data according to the parallelism Q of the multiply-accumulate units includes:
splitting input feature map data of size R*K*N into J data blocks of size L*K*N, where R is the height of the input feature map data; K is the width of the data blocks, which equals the width of the input feature map data; N is the number of channels of the data blocks, which equals the number of channels of the input feature map data; and L is the height of the data blocks.
Preferably, each address space in the input feature map data cache array stores Q pixel values, and one data block is stored in H*E*N consecutive address spaces, where H is the height of the convolution kernel and E is the stride of the current network layer in the convolutional neural network.
Preferably, when the input feature map data is divided, the following constraints must be satisfied: the height L of the data blocks satisfies

L = (Q * H * E) / K

and the number of data blocks J, the height R of the input feature map data, and the block height L satisfy

R = L + (L - H + E) * (J - 1)

where H is the height of the convolution kernel, E is the stride of the current network layer in the convolutional neural network, and Q is the parallelism of the multiply-accumulate units.

Preferably, the weight data has size H*W*N*M; each address space of the weight cache array stores H*W weight values, and all the weight data of the current layer is stored in N*M consecutive address spaces, where H denotes the height of a convolution kernel, W the width, N the number of channels, and M the number of convolution kernels.
In the convolution operation device provided by the application, the multiply-accumulate units adopt a parallel structure, the input feature map data is split into data blocks according to the parallelism of the multiply-accumulate units, and the convolution is computed block by block. The same data block is read and reused until it has completed convolution with all weights of the current network layer, which reduces the number of times the block must be re-fetched into the input cache. This improves data utilization, reduces the on-chip storage requirement, avoids re-reading feature map data from off-chip storage, and markedly improves the overall efficiency of the convolution operation.
Drawings
FIG. 1 is a schematic structural diagram of a convolution operation apparatus for improving data utilization rate according to the present application;
FIG. 2 is a schematic diagram of multiply-accumulate operation of the convolution operation device for improving data utilization rate according to the present application;
FIG. 3 is a schematic diagram of how the input feature map data cache array of the present application caches data;
FIG. 4 is a diagram illustrating a weight cache array according to the present application;
FIG. 5 is a diagram illustrating a data block cache according to an embodiment of the present application;
FIG. 6 is a diagram illustrating weight data caching in an embodiment of the present application;
FIG. 7 is a diagram illustrating multiply-accumulate operations in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
In one embodiment, a convolution operation device for improving data utilization is provided to raise the convolution efficiency in each network layer of a convolutional neural network, so as to meet the computational requirements of deeper convolutional neural networks.
As shown in fig. 1, the convolution operation apparatus of this embodiment includes an input bus, an input data cache array, a multiply-accumulate unit array, an output data cache array, and an output bus, where the input data cache array includes an input feature map data cache array and a weight cache array, and the multiply-accumulate unit array includes Q parallel multiply-accumulate units (MACs); Q is determined by the available hardware resources, such as the number of DSPs.
The input feature map data cache array sequentially acquires a number of data blocks from the input bus and caches them; the data blocks are obtained by dividing the input feature map data according to the parallelism Q of the multiply-accumulate units.
The weight cache array acquires weight data from the input bus and caches it; the weight data comprises the weights of one or more convolution kernels.
The multiply-accumulate unit array performs multiply-accumulate operations on the data blocks and the weight data. As shown in fig. 2, the same data block is read repeatedly and paired with each convolution kernel in turn; once one data block has completed its multiply-accumulate operations with every convolution kernel, the next data block is read and processed in the same way, until all data blocks are finished. The array outputs the convolution result of each data block with each convolution kernel.
The output data cache array receives the convolution results output by the multiply-accumulate unit array, splices them into the output feature map data, and caches that data.
The output bus outputs the output feature map data.
In the convolution operation device of this embodiment, the multiply-accumulate units adopt a parallel structure: the input feature map data is split into data blocks according to the parallelism of the multiply-accumulate units, the convolution is computed block by block, and the same data block is read and reused until it has completed convolution with all weights of the current network layer. This reduces the number of times each data block is re-fetched into the input cache, which improves data utilization, reduces the on-chip storage requirement, avoids re-reading feature map data from off-chip storage, and markedly improves the overall efficiency of the convolution operation.
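As a rough illustration of the saving this block-stationary ordering buys (a sketch with hypothetical layer dimensions, not figures from the patent), compare the number of input-bus transfers with and without reuse:

```python
# Illustrative only: input-bus transfer counts for one network layer with
# J data blocks and M convolution kernels (hypothetical values).
J, M = 8, 64
fetches_without_reuse = J * M   # block re-fetched once per kernel
fetches_with_reuse = J          # block fetched once, reused for all M kernels
print(fetches_without_reuse, fetches_with_reuse)  # 512 vs 8: an M-fold saving
```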
It is easy to understand that bias data is usually added during convolution to improve accuracy. The convolution operation device of this embodiment may therefore further include a bias cache array in the input data cache array; it caches the bias data and can hold at least the largest single-layer bias among all network layers.
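A minimal sketch of that bias path, assuming one bias value per convolution kernel (output channel) broadcast over the output feature map; the exact layout of the bias cache array is not specified here:

```python
import numpy as np

# Hypothetical shapes: a 6*6 output feature map with M = 4 channels and one
# bias value per channel, added after the multiply-accumulate results.
conv_out = np.zeros((6, 6, 4))            # convolution results before bias
bias = np.array([0.1, -0.2, 0.3, 0.0])    # hypothetical per-kernel bias values
biased_out = conv_out + bias              # broadcast add over height and width
```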
It should be noted that the convolution operation device of this embodiment operates on one layer of the convolutional neural network at a time; that is, the input feature map data, weight data, and bias data are those of the current layer. Deploying several such devices, or running one device several times, completes all convolution operations of the whole convolutional neural network.
In this embodiment, dividing the input feature map data into data blocks according to the parallelism Q of the multiply-accumulate units proceeds as follows:
As shown in fig. 3, when the convolution kernel size is 1*1 and the stride is 1, the input feature map data of size R*K*N is split into J data blocks of size L*K*N, where R is the height of the input feature map data, K is the width of both the input feature map data and the data blocks, N is the number of channels of both, and L is the height of a data block. After the split, each address space in the input feature map data cache array stores Q pixel values, and one data block occupies N consecutive address spaces.
When the convolution kernel size is 3*3 and the stride is 1, the input feature map data of size R*K*N is likewise split into J data blocks of size L*K*N. After the split, each address space stores Q pixel values, and one data block occupies N*3 consecutive address spaces.
By analogy, if the convolution kernel size is H*W and the stride of the current network layer in the convolutional neural network is E, where H is the kernel height and W the kernel width, then after the input feature map data is split into data blocks, each address space in the input feature map data cache array stores Q pixel values, and one data block occupies H*E*N consecutive address spaces.
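A minimal NumPy sketch of this cache layout, using the dimensions of the worked example later in the text (the pixel ordering within an address word is an assumption; the text only fixes the word size Q and the word count H*E*N):

```python
import numpy as np

# One data block of L*K*N pixels fills H*E*N consecutive address spaces of
# Q pixels each, which is consistent exactly when L*K = Q*H*E.
L, K, N, H, E, Q = 3, 6, 3, 1, 1, 18
assert L * K == Q * H * E
block = np.arange(L * K * N).reshape(L, K, N)            # height x width x channels
words = block.transpose(2, 0, 1).reshape(H * E * N, Q)   # one channel plane per word
assert words.shape == (3, 18)  # matches Fig. 5: 18 pixels per address, 3 addresses
```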
When splitting the input feature map data in this embodiment, the blocking may be chosen empirically and the blocking information issued to the input data cache array, so that the input feature map data cache array can fetch data blocks over the input bus according to that information; alternatively, the blocking information may be solved from the specified constraints and then sent to the input data cache array.
To obtain data blocks of reasonable size, in one embodiment the following constraints are satisfied when dividing the input feature map data:
1) The height L of the data blocks satisfies

L = (Q * H * E) / K

2) The number of data blocks J, the height R of the input feature map data, and the block height L satisfy

R = L + (L - H + E) * (J - 1)

where H is the height of the convolution kernel and E is the stride of the current network layer in the convolutional neural network.
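A short sketch of how these two constraints fix the blocking; it assumes the divisions come out even, as they do in the patent's examples:

```python
import numpy as np

def split_blocks(fmap, Q, H, E):
    """Split an R*K*N feature map into J overlapping blocks of height L."""
    R, K, N = fmap.shape
    L = Q * H * E // K                # from L = (Q*H*E)/K
    J = (R - L) // (L - H + E) + 1    # from R = L + (L-H+E)*(J-1)
    step = L - H + E                  # consecutive blocks overlap by H - E rows
    return [fmap[j * step : j * step + L] for j in range(J)]

blocks = split_blocks(np.zeros((6, 6, 3)), Q=18, H=1, E=1)
assert len(blocks) == 2 and blocks[0].shape == (3, 6, 3)
```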
As shown in fig. 4, the weight data has size H*W*N*M (that is, all the weight data of a single network layer of the convolutional neural network); each address space of the weight cache array stores H*W weight values, and all the weight data of the current layer is stored in N*M consecutive address spaces, where H denotes the height of a convolution kernel, W the width, N the number of channels, and M the number of convolution kernels.
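A corresponding sketch of the weight layout (the ordering of kernels and channels across address spaces is an assumption consistent with the worked example in fig. 6):

```python
import numpy as np

# All weights of one layer, H*W*N*M in total, fill N*M consecutive address
# spaces of H*W weight values each.
H, W, N, M = 1, 1, 3, 4
layer_weights = np.arange(H * W * N * M).reshape(M, N, H, W)
words = layer_weights.reshape(N * M, H * W)   # 12 addresses of 1 weight each
assert words.shape == (12, 1)                 # matches Fig. 6
```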
After the multiply-accumulate operations of the J data blocks of size L*K*N with all the weight data H*W*N*M of the current layer are completed, J*M groups of convolution results of size Q are obtained, and these are spliced into the complete output feature map data. By sharing weights and biases and adopting a parallel computing structure for the multiply-accumulate operations, the invention reduces the number of times the same data block is repeatedly fetched and cached, improving data utilization and reducing the on-chip storage requirement.
To further illustrate the operation of the convolution device of the present application, a specific example is given below.
Taking an input feature map of size 6*6*3, 2 input feature map data blocks, convolution kernel size 1*1, 4 convolution kernels, stride 1, and 18 multiply-accumulate units as an example, the convolution operation device works as follows:
As shown in fig. 5, an input feature map data block of size 3*6*3 is obtained from the input bus and cached in the input feature map data cache array in the manner shown in fig. 5: each address space stores 3*6 = 18 pixel values, and the block occupies 3 consecutive address spaces.
It should be noted that, when caching data blocks, several blocks may be cached ahead of time as long as the input feature map data cache array is not full; the cached blocks are read and convolved in order, and blocks that have finished their convolution are evicted so that new blocks can be stored in their place. Caching, computation, and eviction of data blocks thus form a dynamic process that keeps the computation efficient.
Fig. 5 shows the state in which 2 data blocks have been stored; in actual operation, one data block is fetched and processed first, and when its computation finishes, the next data block is fetched, cached, and processed.
As shown in fig. 6, weight data of size 1*1*3*4 is obtained from the input bus and cached in the weight cache array in the manner shown in fig. 6: each address space stores 1*1 = 1 weight value, and the weights occupy 3*4 = 12 consecutive address spaces.
As shown in fig. 7, in each step the 3*6 = 18 pixel values of one address space and one 1*1 weight value are taken, and the 18 multiply-accumulate units operate on them in parallel. Fetching feature map data and weight data 3 times in succession completes the multiply-accumulate operation of one data block with one convolution kernel and yields one group of output feature map data of size 18.
At this point the weights in the weight cache array have not all been used: the 3 address spaces of the data block are re-read 3 more times while the remaining weight data is read in sequence, until the multiply-accumulate operations with all the weights are completed, yielding three more groups of output feature map data of size 18.
After the first data block finishes its computation, the feature map data of the next data block is read in the same manner and the weight data is read through once more; when its convolution completes, another four groups of output feature map data of size 18 are obtained.
In total, eight groups of output feature map data of size 18 are obtained; splicing them yields the complete output feature map of size 6*6 with 4 channels, completing the convolution operation of the current network layer.
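The worked example can be checked end to end in NumPy; the sketch below stands in for the multiply-accumulate array, uses random data since the text gives no concrete pixel or weight values, and confirms that splicing the eight groups reproduces a direct 1*1 convolution:

```python
import numpy as np

rng = np.random.default_rng(0)
fmap = rng.standard_normal((6, 6, 3))     # input feature map, R*K*N = 6*6*3
kernels = rng.standard_normal((4, 3))     # M = 4 kernels of 1*1*N weights

groups = []                               # J*M = 8 groups of 18 results each
for block in (fmap[0:3], fmap[3:6]):      # each block reused for all kernels
    for k in kernels:
        groups.append(np.tensordot(block, k, axes=([2], [0])).ravel())

# Splice: per block, stack its M groups as channels, then stack blocks by height.
out = np.concatenate(
    [np.stack([g.reshape(3, 6) for g in groups[b * 4:(b + 1) * 4]], axis=-1)
     for b in range(2)], axis=0)          # 6*6 output with 4 channels

direct = np.tensordot(fmap, kernels, axes=([2], [1]))   # direct 1*1 convolution
assert out.shape == (6, 6, 4) and np.allclose(out, direct)
```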
In the convolution operation device of this embodiment, the multiply-accumulate units adopt a parallel structure, the convolution is computed per input feature map data block, and each data block is reused until it has completed convolution with all weights of the current network layer; this reduces the number of times a data block is repeatedly fetched and cached, improves data utilization, and reduces the on-chip storage requirement.
The technical features of the embodiments described above may be combined arbitrarily; for brevity, not all possible combinations are described, but any combination that contains no contradiction should be considered within the scope of this specification.
The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these fall within the protection scope of the application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (3)

1. A convolution operation device for improving data utilization rate, comprising an input bus, an input data cache array, a multiply-accumulate unit array, an output data cache array, and an output bus, wherein the input data cache array comprises an input feature map data cache array and a weight cache array, and the multiply-accumulate unit array comprises Q parallel multiply-accumulate units, wherein:
the input feature map data cache array is used to sequentially acquire a number of data blocks from the input bus and cache them, the data blocks being obtained by dividing the input feature map data according to the parallelism Q of the multiply-accumulate units;
the weight cache array is used to acquire weight data from the input bus and cache it, the weight data comprising the weights of one or more convolution kernels;
the multiply-accumulate unit array is used to perform multiply-accumulate operations on the data blocks and the weight data, repeatedly reading the same data block and pairing it with each convolution kernel until the multiply-accumulate operations of all data blocks are completed, and to output the convolution result obtained from each data block with each convolution kernel;
the output data cache array is used to receive the convolution results output by the multiply-accumulate unit array, splice them into the output feature map data, and cache that data;
the output bus is used to output the output feature map data;
dividing the input feature map data according to the parallelism Q of the multiply-accumulate units comprises:
splitting input feature map data of size R*K*N into J data blocks of size L*K*N, where R is the height of the input feature map data; K is the width of the data blocks, which equals the width of the input feature map data; N is the number of channels of the data blocks, which equals the number of channels of the input feature map data; and L is the height of the data blocks;
when the input feature map data is divided, the following constraints must be satisfied: the height L of the data blocks satisfies

L = (Q * H * E) / K

and the number of data blocks J, the height R of the input feature map data, and the block height L satisfy

R = L + (L - H + E) * (J - 1)

where H is the height of the convolution kernel and E is the stride of the current network layer in the convolutional neural network.
2. The apparatus according to claim 1, wherein each address space in the input feature map data cache array stores Q pixel values, and one data block is stored in H*E*N consecutive address spaces, where H is the height of the convolution kernel and E is the stride of the current network layer in the convolutional neural network.
3. The apparatus according to claim 1, wherein the weight data has size H*W*N*M, each address space of the weight cache array stores H*W weight values, and all the weight data of the current layer is stored in N*M consecutive address spaces, where H denotes the height of a convolution kernel, W denotes the width of the convolution kernel, N denotes the number of channels of the convolution kernel, and M denotes the number of convolution kernels.
Application CN202011578577.2A, priority date 2020-12-28, filing date 2020-12-28: Convolution operation device for improving data utilization rate. Granted as CN112668708B (Active).

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202011578577.2A | 2020-12-28 | 2020-12-28 | Convolution operation device for improving data utilization rate


Publications (2)

Publication Number | Publication Date
CN112668708A | 2021-04-16
CN112668708B | 2022-10-14

Family

ID=75410653

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202011578577.2A (Active, CN112668708B) | Convolution operation device for improving data utilization rate | 2020-12-28 | 2020-12-28

Country Status (1)

Country | Link
CN | CN112668708B

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469350B (en) * 2021-07-07 2023-03-24 武汉魅瞳科技有限公司 Deep convolutional neural network acceleration method and system suitable for NPU
CN114089911B (en) * 2021-09-07 2024-01-05 上海新氦类脑智能科技有限公司 Block segmentation and splicing processing method, device, equipment and medium based on data multiplexing
CN113688069B (en) * 2021-09-10 2022-08-02 北京百度网讯科技有限公司 Data processing method, device, electronic equipment and medium
CN116721006A (en) * 2022-02-28 2023-09-08 格兰菲智能科技有限公司 Feature map processing method and device
CN116861149B (en) * 2023-09-05 2024-01-09 之江实验室 Convolution operation optimization method, device and processor


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200073636A1 (en) * 2018-08-31 2020-03-05 Qualcomm Incorporated Multiply-accumulate (mac) operations for convolutional neural networks

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10515135B1 (en) * 2017-10-17 2019-12-24 Xilinx, Inc. Data format suitable for fast massively parallel general matrix multiplication in a programmable IC
CN108280514A (en) * 2018-01-05 2018-07-13 中国科学技术大学 Sparse neural network acceleration system based on FPGA and design method
CN109598338A (en) * 2018-12-07 2019-04-09 东南大学 A kind of convolutional neural networks accelerator of the calculation optimization based on FPGA
CN109886400A (en) * 2019-02-19 2019-06-14 合肥工业大学 The convolutional neural networks hardware accelerator system and its calculation method split based on convolution kernel
CN110097174A (en) * 2019-04-22 2019-08-06 西安交通大学 Preferential convolutional neural networks implementation method, system and device are exported based on FPGA and row
CN110147252A (en) * 2019-04-28 2019-08-20 深兰科技(上海)有限公司 A kind of parallel calculating method and device of convolutional neural networks
CN111242277A (en) * 2019-12-27 2020-06-05 中国电子科技集团公司第五十二研究所 Convolutional neural network accelerator supporting sparse pruning and based on FPGA design

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Search-free Accelerator for Sparse Convolutional Neural Networks; Bosheng Liu et al.; 2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC); 2020-03-26 *
Design of an FPGA-based convolutional neural network acceleration module (基于FPGA的卷积神经网络加速模块设计); Mei Zhiwei et al.; Journal of Nanjing University (Natural Science), vol. 56, no. 4, pp. 581-590; July 2020 *
Progress and trends of deep-learning FPGA accelerators (深度学习FPGA加速器的进展与趋势); Wu Yanxia et al.; Chinese Journal of Computers, vol. 42, no. 11; November 2019 *

Also Published As

Publication number Publication date
CN112668708A (en) 2021-04-16

Similar Documents

Publication Publication Date Title
CN112668708B (en) Convolution operation device for improving data utilization rate
US20220012593A1 (en) Neural network accelerator and neural network acceleration method based on structured pruning and low-bit quantization
CN107145939B (en) Computer vision processing method and device of low-computing-capacity processing equipment
CN111199273B (en) Convolution calculation method, device, equipment and storage medium
US20180046895A1 (en) Device and method for implementing a sparse neural network
CN106846235B (en) Convolution optimization method and system accelerated by NVIDIA Kepler GPU assembly instruction
CN112465110B (en) Hardware accelerator for convolution neural network calculation optimization
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN108629406B (en) Arithmetic device for convolutional neural network
WO2022037257A1 (en) Convolution calculation engine, artificial intelligence chip, and data processing method
US11544542B2 (en) Computing device and method
CN110851779B (en) Systolic array architecture for sparse matrix operations
CN109993293B (en) Deep learning accelerator suitable for heap hourglass network
WO2022041188A1 (en) Accelerator for neural network, acceleration method and device, and computer storage medium
CN115186802A (en) Block sparse method and device based on convolutional neural network and processing unit
CN112906865A (en) Neural network architecture searching method and device, electronic equipment and storage medium
CN111008691B (en) Convolutional neural network accelerator architecture with weight and activation value both binarized
CN116720549A (en) FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache
CN113485750B (en) Data processing method and data processing device
CN112966807A (en) Convolutional neural network implementation method based on storage resource limited FPGA
CN113158132A (en) Convolution neural network acceleration system based on unstructured sparsity
CN113052299A (en) Neural network memory computing device based on lower communication bound and acceleration method
CN110766136B (en) Compression method of sparse matrix and vector
CN109902821B (en) Data processing method and device and related components
CN116090518A (en) Feature map processing method and device based on systolic operation array and storage medium

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant