CN112230884A - Target detection hardware accelerator and acceleration method - Google Patents


Info

Publication number
CN112230884A
Authority
CN
China
Prior art keywords
data
convolution
multiplication
result data
random access
Prior art date
Legal status
Granted
Application number
CN202011494636.8A
Other languages
Chinese (zh)
Other versions
CN112230884B (en)
Inventor
陈迟晓
张锦山
焦博
张立华
Current Assignee
Ji Hua Laboratory
Original Assignee
Ji Hua Laboratory
Priority date
Filing date
Publication date
Application filed by Ji Hua Laboratory filed Critical Ji Hua Laboratory
Priority to CN202011494636.8A priority Critical patent/CN112230884B/en
Publication of CN112230884A publication Critical patent/CN112230884A/en
Application granted granted Critical
Publication of CN112230884B publication Critical patent/CN112230884B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52Multiplying; Dividing
    • G06F7/523Multiplying only
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50Adding; Subtracting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Optimization (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Neurology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to the field of data processing and provides a target detection hardware accelerator and an acceleration method. The accelerator comprises a convolution operator integrating a multiplier and an adder: the convolution operator receives convolution weight data and a feature map pre-stored in a block random access memory, the multiplier multiplies the convolution weight data by the feature map to obtain multiplication result data and convolution offset data, and the adder performs shift-add summation on the multiplication result data and the convolution offset data to obtain multiply-accumulate result data; a pooling operation unit receives the multiply-accumulate result data, performs a pooling operation, and outputs pooled result data; and an RBR operation unit batch-normalizes and quantizes the pooled result data to obtain target feature data and stores the target feature data in the block random access memory. The invention reduces the time and power consumption the accelerator requires for data movement and improves its working efficiency.

Description

Target detection hardware accelerator and acceleration method
Technical Field
The invention relates to the field of data processing, in particular to a target detection hardware accelerator and an acceleration method.
Background
With the support of big data analysis and large-scale high-speed computing platforms, neural network technology has developed rapidly. On the one hand, neural network algorithms are continuously improved: after the CNN (Convolutional Neural Network), new network models such as the RNN (Recurrent Neural Network) and the GAN (Generative Adversarial Network) have emerged one after another. On the other hand, neural network algorithms are widely applied in embedded systems because of their outstanding performance in image recognition, speech analysis, natural language processing, and similar fields. An embedded system is a special system on chip with stringent requirements on performance and power consumption. Integrating neural network accelerators into a system on chip has therefore become a research hotspot.
With the proposal of various neural network algorithms, a variety of neural network accelerators have emerged. However, during data movement, most existing neural network accelerators read data from DRAM: the data must travel over a bus of limited bandwidth, and reading large amounts of data from DRAM incurs high power consumption and latency, which greatly reduces the accelerator's working efficiency. How to reduce the time and power consumption the accelerator spends on data movement and improve its working efficiency has therefore become an urgent technical problem.
Disclosure of Invention
The invention mainly aims to provide a target detection hardware accelerator and an acceleration method that reduce the time and power consumption the accelerator requires for data movement and improve its working efficiency.
To achieve the above object, the present invention provides an object detection hardware accelerator, including:
a convolution operator integrating a multiplier and an adder, wherein the convolution operator receives convolution weight data and a feature map pre-stored in a block random access memory, the multiplier multiplies the convolution weight data by the feature map to obtain multiplication result data and convolution offset data, and the adder performs shift-add summation on the multiplication result data and the convolution offset data to obtain multiply-accumulate result data;
a pooling operation unit configured to receive the multiply-accumulate result data, perform a pooling operation, and output pooled result data;
and an RBR operation unit configured to batch-normalize and quantize the pooled result data to obtain target feature data and store the target feature data in the block random access memory.
Preferably, the step of multiplying the convolution weight data by the feature map with the multiplier to obtain multiplication result data and convolution offset data includes:
converting the multiplication of the convolution weight data and the feature map from a parallel 3-bit × 3-bit multiplication into serial 3-bit × 1-bit multiplications.
Preferably, the accelerator further includes a main control module configured to generate a convolution operation instruction for the convolution operator, a pooling operation instruction for the pooling operation unit, and an RBR operation instruction for the RBR operation unit.
Preferably, the batch normalization and quantization comprises: rescaling, normalizing, and ReLU processing the pooled result data.
Preferably, the accelerator further includes a reorder buffer unit disposed between the RBR operation unit and the block random access memory and configured to sort the data units constituting the target feature data into convolution order and store them sequentially in the block random access memory.
Preferably, the multiplication operation is implemented with the LUT units of an FPGA.
Preferably, the shift-add summation is implemented with the DSP units of an FPGA.
In order to achieve the above object, the present invention further provides an acceleration method of a target detection hardware accelerator, including:
receiving convolution weight data and a feature map pre-stored in a block random access memory, multiplying the convolution weight data by the feature map to obtain multiplication result data and convolution offset data, and performing shift-add summation on the multiplication result data and the convolution offset data to obtain multiply-accumulate result data;
receiving the multiply-accumulate result data, performing pooling operation, and outputting pooled result data;
and batch-normalizing and quantizing the pooled result data to obtain target feature data, and storing the target feature data in the block random access memory.
Preferably, the step of multiplying the convolution weight data by the feature map to obtain multiplication result data and convolution offset data includes:
converting the multiplication of the convolution weight data and the feature map from a parallel 3-bit × 3-bit multiplication into serial 3-bit × 1-bit multiplications.
Preferably, the method further comprises a sorting step: sorting the data units constituting the target feature data into convolution order and then storing them sequentially in the block random access memory.
The invention provides a target detection hardware accelerator and an acceleration method. The accelerator comprises a convolution operator integrating a multiplier and an adder: the convolution operator receives convolution weight data and a feature map pre-stored in a block random access memory, the multiplier multiplies the convolution weight data by the feature map to obtain multiplication result data and convolution offset data, and the adder performs shift-add summation on the multiplication result data and the convolution offset data to obtain multiply-accumulate result data; a pooling operation unit receives the multiply-accumulate result data, performs a pooling operation, and outputs pooled result data; and an RBR operation unit batch-normalizes and quantizes the pooled result data to obtain target feature data and stores the target feature data in the block random access memory. The invention reduces the time and power consumption the accelerator requires for data movement and improves its working efficiency.
Drawings
FIG. 1 is a block diagram of a target detection hardware accelerator according to an embodiment of the present invention;
fig. 2 is a flowchart illustrating an acceleration method according to an embodiment of the present invention.
Description of reference numerals: 1. convolution operator; 2. multiplier; 3. adder; 4. pooling operation unit; 5. RBR operation unit; 6. main control module; 7. block random access memory; 8. reorder buffer unit.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that all directional indicators in the embodiments of the present invention (such as up, down, left, right, front, and rear) are only used to explain the relative positional relationship and movement of the components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indicators change accordingly.
It will also be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or intervening elements may also be present. When an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present.
In addition, descriptions involving "first", "second", and the like in the present invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features; a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. The technical solutions of the various embodiments may be combined with one another, provided the combination can be realized by a person of ordinary skill in the art; when technical solutions are contradictory or cannot be realized, the combination should be deemed not to exist and falls outside the protection scope of the present invention.
One aspect of the invention provides a target detection hardware accelerator.
Fig. 1 is a schematic diagram of the overall structure of the target detection hardware accelerator of the present invention. The accelerator includes: a convolution operator 1 integrating a multiplier 2 and an adder 3, wherein the convolution operator 1 receives convolution weight data and a feature map pre-stored in a block random access memory 7, the multiplier 2 multiplies the convolution weight data by the feature map to obtain multiplication result data and convolution offset data, and the adder 3 performs shift-add summation on the multiplication result data and the convolution offset data to obtain multiply-accumulate result data; a pooling operation unit 4 that receives the multiply-accumulate result data, performs a pooling operation, and outputs pooled result data; and an RBR operation unit 5 that batch-normalizes and quantizes the pooled result data to obtain target feature data and stores the target feature data in the block random access memory 7.
To address the time and power wasted on data movement in existing neural network accelerators (such as Eyeriss), the invention provides a new neural network accelerator. The data-movement and storage scheme of existing accelerators is improved: the convolution weight data and feature maps that would originally be stored in dynamic random access memory (DRAM) are instead stored in the block random access memory (BRAM) 7. Reading data from DRAM requires transfers over a bus of limited bandwidth, and reading large amounts of data from DRAM incurs high power consumption and latency. Reading from BRAM needs no bus: storage and computation are tightly coupled, power consumption is low, and access is fast, which improves the accelerator's computation speed.
In the present embodiment, the target detection hardware accelerator includes a convolution operator 1, a pooling operation unit 4, and an RBR operation unit 5.
Specifically, the convolution operator 1 receives the convolution weight data and the feature map of the features to be extracted, both pre-stored in the block random access memory 7. The convolution operator 1 integrates a multiplier 2 and an adder 3. The multiplier 2 multiplies the convolution weight data by the feature map to obtain multiplication result data and convolution offset data. In this embodiment, the multiplication is implemented with the LUT units of an FPGA. An FPGA is a semi-custom circuit, a programmable logic array, and using its LUTs here saves DSP (digital signal processor) resources: instead of the traditional approach of multiplying with DSPs, multiplication is performed with LUTs (the number of basic logic units in an FPGA far exceeds the number of DSPs), which overcomes the shortage of DSP blocks. However, because a LUT can implement only a simple logic function, in this embodiment the multiplication of the convolution weight data and the feature map is converted from a parallel 3-bit × 3-bit multiplication into serial 3-bit × 1-bit multiplications, whose processing logic is simpler.
The multiplication result data and the convolution offset data output by the multiplier 2 are then shift-added and summed by the adder 3 to obtain the multiply-accumulate result data. The shift-add summation is implemented with the DSP units of the FPGA.
It should be explained that multiplication is usually implemented with DSPs, but this is wasteful and the number of DSPs is small, so they cannot be used at scale. A 3-bit × 3-bit multiplication is therefore split into groups of 1-bit × 1-bit multiplications, each of which is essentially an AND operation and can be implemented very simply with a LUT. The partial results of the 1-bit × 1-bit groups are then shifted and added by a small number of DSPs to complete the accumulation; in this way both the multiplication and the multiply-accumulate operation are completed, and the limited DSPs are reserved for the higher-precision accumulation step.
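As a rough software model of this splitting (illustrative only, not the RTL mapping), the following sketch computes a 3-bit × 3-bit product entirely from 1-bit AND operations combined by shift-add accumulation:

```python
def and_multiply_3bit(a, b):
    """Multiply two 3-bit unsigned values using only 1-bit AND operations
    (LUT-friendly) plus shift-add accumulation (DSP-friendly).
    Software model of the scheme described above; names are illustrative."""
    assert 0 <= a < 8 and 0 <= b < 8
    acc = 0
    for i in range(3):              # bit position in a
        for j in range(3):          # bit position in b
            bit = ((a >> i) & 1) & ((b >> j) & 1)  # 1-bit x 1-bit = AND
            acc += bit << (i + j)                  # shift-add accumulate
    return acc
```

For example, `and_multiply_3bit(5, 7)` yields 35, matching the ordinary product; the nine AND results play the role of the LUT outputs, and the single running sum plays the role of the DSP accumulator.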
For convenience of understanding, the accelerator calculation process of the present invention is explained in detail by the following steps:
1) Fix the 1st row of the convolution kernel, input the elements of the 1st row of the input feature map one by one, and perform a one-dimensional convolution with the fixed convolution weights to obtain partial sums of the 1st-row elements of the output feature map, output one by one;
2) Fix the 2nd row of the convolution kernel, input the elements of the 2nd row of the input feature map one by one, perform a one-dimensional convolution with the fixed convolution weights to obtain further partial sums of the 1st-row elements of the output feature map, and add them one by one to the previously obtained partial sums to form new partial sums;
3) Fix the 3rd row of the convolution kernel, input the elements of the 3rd row of the input feature map one by one, perform a one-dimensional convolution with the fixed convolution weights to obtain the remaining partial sums of the 1st-row elements of the output feature map, and add them one by one to the accumulated partial sums to output the complete elements;
The remaining rows follow by analogy, and the feature-map result is computed and output row by row. For each output row, the three rows of the convolution kernel are fixed in turn, and the input feature vectors corresponding to that row of the convolution are fed in sequentially.
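The row-wise scheme of steps 1) to 3) can be sketched in software as follows; the function and variable names are illustrative, and only a single-channel 3 × 3 valid convolution is modeled:

```python
def conv2d_rowwise(fmap, kernel):
    """Row-stationary 3x3 valid convolution: each output row is formed by
    fixing one kernel row at a time, running a 1-D convolution over the
    corresponding input row, and accumulating the partial sums."""
    H, W = len(fmap), len(fmap[0])
    out = [[0] * (W - 2) for _ in range(H - 2)]
    for orow in range(H - 2):
        for krow in range(3):            # fix kernel row krow (steps 1-3)
            row = fmap[orow + krow]      # matching input row
            for ocol in range(W - 2):    # 1-D convolution, accumulate psum
                out[orow][ocol] += sum(kernel[krow][k] * row[ocol + k]
                                       for k in range(3))
    return out
```

With an all-ones 4 × 4 input and an all-ones 3 × 3 kernel this produces a 2 × 2 output of 9s, as expected for a valid convolution.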
The invention also uses a ping-pong buffering technique to allocate buffers for the input feature maps, saving buffer space and achieving seamless buffering and computation. For layer 1, cache A feeds the layer-1 input feature map to the computation acceleration module, and the resulting layer-2 input feature map is stored in cache B; when the layer-1 operation finishes and the layer-2 operation begins, cache B feeds the layer-2 input feature map to the computation acceleration module, and the layer-3 input feature map is stored back into cache A. Cache C serves as a supplementary buffer when the capacities of caches A and B are insufficient.
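A minimal behavioural sketch of the ping-pong buffering follows; the class and method names are assumptions, and cache C (the overflow buffer) is omitted for brevity:

```python
class PingPongBuffer:
    """Two caches (A, B) alternate roles: one feeds layer k to the compute
    module while the other collects the layer k+1 input feature map, then
    they swap - a sketch of the scheme described above."""
    def __init__(self, first_input):
        self.read_buf = list(first_input)   # cache A: feeds current layer
        self.write_buf = []                 # cache B: collects next input

    def run_layer(self, layer_fn):
        # compute module consumes read_buf and produces next feature map
        self.write_buf = layer_fn(self.read_buf)
        # layer done: swap roles so the next layer reads what was written
        self.read_buf, self.write_buf = self.write_buf, []
        return self.read_buf
```

Because the swap is just an exchange of references, buffering the next layer's input overlaps with computation and no third full-size copy is needed in the common case.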
The pooling operation unit 4 receives the multiply-accumulate result data, performs a pooling operation, and outputs pooled result data. The pooling operation unit 4 keeps the largest of the multiply-accumulate results produced by the preceding convolution and discards the others (max pooling).
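The max-pooling behaviour can be modeled as keeping the largest multiply-accumulate result in each pooling window and discarding the rest; the window size below is an assumption, since the text does not specify one:

```python
def max_pool(psums, window=4):
    """Max pooling over a stream of multiply-accumulate results: keep the
    largest value in each non-overlapping window, discard the others
    (illustrative model; window size is an assumption)."""
    return [max(psums[i:i + window]) for i in range(0, len(psums), window)]
```

For instance, `max_pool([1, 5, 2, 3, 9, 0, 0, 1])` keeps one value per window of four, yielding `[5, 9]`.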
Replacing the DRAM with BRAM improves the accelerator's operating speed, but the storage space of BRAM is far smaller than that of DRAM and may not suffice for large amounts of feature data. Therefore, in this embodiment the accelerator further includes an RBR operation unit 5 that batch-normalizes and quantizes the pooled result data to obtain the target feature data: feature data that would otherwise occupy large resources is reduced to 3-bit values occupying few resources before being stored in the block random access memory 7. The batch normalization and quantization comprises rescaling, normalizing, and ReLU processing of the pooled result data.
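A hedged sketch of the RBR processing chain (rescale, batch-normalize, ReLU, then quantize to the 3-bit storage format) is shown below; the concrete scaling parameters, rounding rule, and clipping range are assumptions, as the text names only the three steps and the 3-bit target width:

```python
import math

def rbr(x, scale, mean, var, eps=1e-5):
    """Rescale / batch-normalize / ReLU one pooled value, then quantize it
    to an unsigned 3-bit code (0..7) so it fits the narrow BRAM format.
    Parameter choices are illustrative assumptions."""
    y = (x * scale - mean) / math.sqrt(var + eps)  # rescale + normalize
    y = max(y, 0.0)                                # ReLU
    return min(int(round(y)), 7)                   # clamp to the 3-bit range
```

Negative activations quantize to 0 via the ReLU, and large activations saturate at 7, the largest 3-bit code.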
Further, an additional buffer block is inserted between the pooling operation unit 4 and the RBR operation unit 5. This buffer reduces the workload of the RBR unit by lowering the activity of the RBR block after the pooling layers.
In a further preferred embodiment of the present invention, as shown in fig. 1, the accelerator further includes a main control module 6, configured to generate a convolution operation instruction of the convolution operator 1, a pooling operation instruction of the pooling operation unit 4, and an RBR operation instruction of the RBR operation unit 5.
In a further preferred embodiment of the present invention, as shown in fig. 1, the accelerator further includes a reorder buffer unit 8, disposed between the RBR operation unit 5 and the block random access memory 7, for sorting the data units constituting the target feature data according to a convolution order and then sequentially storing the data units in the block random access memory 7.
In this embodiment, the target feature data output by the pooling operation unit 4 has 64 units, which represent 64 output channels. Because the accelerator's bandwidth is narrow, only 16 channels can be written to the block random access memory at a time, so the 64 channels are divided into 4 groups and stored separately. Since the BRAM's write bandwidth is limited and one cycle's results cannot all be stored in the buffer at once, a reorder buffer unit 8 is used to allow write-back over multiple cycles and to write the results back, in order, to the designated addresses.
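The multi-cycle write-back can be modeled as splitting the 64 output channels into groups of 16, one group per cycle; the grouping below is a behavioural sketch, and the actual address mapping is not given in the text:

```python
def reorder_writeback(channel_results, group=16):
    """64 output channels arrive together, but BRAM bandwidth allows only
    16 per cycle, so the reorder buffer splits them into 4 groups and
    writes them back over successive cycles in order (behavioural sketch;
    group size taken from the embodiment above)."""
    cycles = []
    for start in range(0, len(channel_results), group):
        cycles.append(channel_results[start:start + group])  # one write cycle
    return cycles
```

With 64 channel results, this yields 4 write-back cycles of 16 channels each, preserving the original channel order.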
Another aspect of the present invention provides an acceleration method using the target detection hardware accelerator as described above, as shown in fig. 2, the method includes the following steps:
s1: receiving convolution weight data and a characteristic diagram which are pre-stored in a block random access memory (7), multiplying the convolution weight data and the characteristic diagram to obtain multiplication result data and convolution offset data, and shifting, adding and summing the multiplication result data and the convolution offset data to obtain multiplication and accumulation result data;
s2: receiving the multiply-accumulate result data, performing pooling operation, and outputting pooled result data;
s3: and carrying out batch standardization and quantification on the pooling result data to obtain target characteristic data, and storing the target characteristic data in the block random access memory 7.
In the present embodiment, the accelerator includes: the convolution arithmetic unit 1 is integrated with a multiplier 2 and an adder 3, the convolution arithmetic unit 1 receives convolution weight data and a characteristic diagram which are stored in a block random access memory 7 in advance, the multiplier 2 performs multiplication operation on the convolution weight data and the characteristic diagram to obtain multiplication result data and convolution offset data, and the adder 3 performs shift addition summation processing on the multiplication result data and the convolution offset data to obtain multiplication accumulation result data; a pooling operation unit 4 for receiving the multiply-accumulate result data, performing pooling operation, and outputting pooled result data; and the RBR operation unit 5 is used for carrying out batch standardization and quantification on the pooling result data to obtain target characteristic data and storing the target characteristic data into the block random access memory 7.
In order to solve the problem that the existing neural network accelerator (such as eyeris) wastes time and power consumption during data transmission, the invention provides a novel neural network accelerator according to the technical problem. In the invention, the data handling and storage mode of the existing neural network accelerator is improved, and the convolution weight data and the characteristic diagram which are originally required to be stored in a Dynamic Random Access Memory (DRAM) are changed into those which are stored in a Block Random Access Memory (BRAM) 7, because the DRAM is adopted to read the data and needs to be transmitted through a bus, the bandwidth of the bus is limited, and the power consumption and the time delay for reading a large amount of data from the DRAM are large. The reading from the BRAM does not need to pass through a bus, the storage and the calculation are connected together, the power consumption is low, the speed is high, and therefore the calculation speed of the accelerator is improved.
In the present embodiment, the target detection hardware accelerator includes a convolution operator 1, a pooling operation unit 4, and an RBR operation unit 5.
Specifically, the convolution operator 1 is configured to receive convolution weight data and a feature map of features to be extracted, which are stored in the block random access memory 7 in advance. The convolution operator 1 is integrated with a multiplier 2 and an adder 3. The convolution weight data and the characteristic diagram are multiplied by the multiplier 2 to obtain multiplication result data and convolution offset data. In this embodiment, the multiplication operation is implemented based on LUT unit of FPGA, and the FPGA device belongs to a semi-custom circuit in the asic, and is a programmable logic array, which can effectively solve the problem and save DSP (digital signal processor) resources. The traditional method of adopting DSP to carry out multiplication operation is changed into the method of carrying out multiplication operation through LUT (the number of basic logic units of FPGA is far more than DSP), so that the problem that the number of gate circuits of the original DSP device is small is solved. However, since the LUT can handle only a simple logic function, in the present embodiment, the convolution weight data and the signature are converted from parallel multiplication of 3 bits × 3 bits by parallel multiplication to serial multiplication of 3 bits × 1bit, which is simpler in processing logic.
And then the multiplication result data and the convolution offset data output by the multiplier 2 are subjected to shift addition summation processing through an adder 3 to obtain multiplication and accumulation result data. The shift-add-sum processing is implemented based on an FPGA-based DSP unit.
It should be explained that the multiplication operation is usually implemented by DSP, but it is very wasteful and the number of DSP is very small, so it cannot be used on a large scale. Therefore, splitting 3 bits x 3 bits into groups of 1 bits x 1 bits, which is essentially and logic, can be implemented very simply by a LUT. The results of multiple groups of 1bit multiplied by 1bit can be shifted and added by few DSPs to complete accumulation, namely, multiplication operation and multiplication and accumulation operation are completed, and the DSP with limited number can be used in the multiplication and accumulation operation with higher operation precision.
For convenience of understanding, the accelerator calculation process of the present invention is explained in detail by the following steps:
1) fixing the 1 st row of the convolution kernel, inputting the 1 st row elements of the input feature map one by one, performing one-dimensional convolution with fixed convolution weight to obtain the 1 st row element part sum of the output feature map, and outputting one by one;
2) fixing the 2 nd line of the convolution kernel, inputting the 2 nd line elements of the input feature map one by one, performing one-dimensional convolution with fixed convolution weight to obtain the 1 st line element sum of the output feature map, and adding the 1 st line element sum and the previously obtained part sum one by one to form a new part sum output;
3) fixing the 3 rd row of the convolution kernel, inputting the 3 rd row elements of the input feature map one by one, performing one-dimensional convolution with fixed convolution weight to obtain the 1 st row element part sum of the output feature map, and adding the part sums obtained one by one to serve as complete element output;
and by analogy, calculating and outputting a feature diagram result according to rows. Each row of output is calculated, three rows of convolution kernels need to be fixed respectively in sequence, and input feature vectors corresponding to the row of convolution are input in sequence.
The invention also adopts a ping-pong technique to manage buffer allocation for the input feature map, which saves buffer space and achieves seamless buffering and computation of data. For layer 1, buffer A outputs the layer-1 input feature map to the computation acceleration module, and the resulting layer-2 input feature map is stored in buffer B; when the layer-1 operation finishes and the layer-2 operation starts, buffer B outputs the layer-2 input feature map to the computation acceleration module and the layer-3 input feature map is stored back in buffer A. Buffer C serves as a supplementary buffer when the capacities of buffers A and B are insufficient.
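The alternation of buffers A and B can be sketched as follows (class and method names are illustrative; the supplementary buffer C is omitted for brevity):

```python
class PingPongBuffers:
    """Two buffers alternate roles each layer: one feeds the compute
    module while the other collects its output, then they swap."""
    def __init__(self):
        self.bufs = [None, None]   # buffer A, buffer B
        self.read_idx = 0          # which buffer feeds the current layer

    def load(self, feature_map):
        """Store the first layer's input feature map into buffer A."""
        self.bufs[self.read_idx] = feature_map

    def run_layer(self, layer_fn):
        """Read from one buffer, write the result to the other,
        then swap roles for the next layer."""
        out = layer_fn(self.bufs[self.read_idx])
        write_idx = 1 - self.read_idx
        self.bufs[write_idx] = out
        self.read_idx = write_idx
        return out
```

Because reading layer n's input and writing layer (n+1)'s input use different buffers, buffering and computation overlap without conflict.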
And a pooling operation unit 4, which receives the multiply-accumulate result data, performs a pooling operation, and outputs pooled result data. The pooling operation unit 4 selects the largest of the multiply-accumulate results obtained by the preceding convolution calculation and discards the others.
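A minimal sketch of this max-pooling behavior, assuming a 2 × 2 window (the window size is not specified in the text):

```python
import numpy as np

def max_pool(x, win=2):
    """Keep only the largest multiply-accumulate result in each
    win x win window; the other values are discarded."""
    H, W = x.shape
    out = np.empty((H // win, W // win))
    for i in range(0, H - win + 1, win):
        for j in range(0, W - win + 1, win):
            out[i // win, j // win] = x[i:i + win, j:j + win].max()
    return out
```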
In the present invention, DRAM is replaced by BRAM. Although this improves the operating speed of the accelerator, the storage space of BRAM is far smaller than that of DRAM and may be insufficient to store a large amount of feature data. Therefore, in this embodiment, the accelerator further includes an RBR operation unit 5 for performing batch normalization and quantization on the pooled result data to obtain target feature data: the feature data that originally occupied large resources is reduced to 3-bit values occupying few resources before being stored in the block random access memory 7. Batch normalization and quantization include rescaling, normalization, and ReLU processing of the pooled result data.
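The RBR sequence can be sketched as follows (parameter names and the rounding scheme are assumptions; only the rescale → normalize → ReLU → 3-bit quantize order is taken from the text):

```python
import numpy as np

def rbr(pooled, scale, bn_mean, bn_std, n_bits=3):
    """Rescale -> batch-normalize -> ReLU, then quantize to n_bits-wide
    codes so the feature map fits in BRAM."""
    x = pooled * scale                      # rescale
    x = (x - bn_mean) / bn_std              # batch normalization
    x = np.maximum(x, 0)                    # ReLU
    # quantize to the unsigned n-bit range [0, 2^n - 1]
    q = np.clip(np.round(x), 0, 2 ** n_bits - 1).astype(np.uint8)
    return q
```

With n_bits = 3, every stored value occupies only 3 bits, which is what makes BRAM-only feature storage feasible.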
Further, an additional buffer block is inserted between the pooling operation unit 4 and the RBR operation unit 5. The buffer block reduces the workload of the RBR unit, lowering the activity of the RBR block for the layers that are followed by a pooling layer.
In a further preferred embodiment of the present invention, as shown in fig. 1, the accelerator further includes a main control module 6, configured to generate a convolution operation instruction of the convolution operator 1, a pooling operation instruction of the pooling operation unit 4, and an RBR operation instruction of the RBR operation unit 5.
In a further preferred embodiment of the present invention, as shown in fig. 1, the accelerator further includes a reorder buffer unit 8, disposed between the RBR operation unit 5 and the block random access memory 7, for sorting the data units constituting the target feature data according to a convolution order and then sequentially storing the data units in the block random access memory 7.
In this embodiment, the target feature data output by the pooling operation unit 4 has 64 units, representing 64 output channels. Because the bandwidth of the accelerator is narrow, only 16 channels can be stored in the block random access memory at a time, so the 64 channels are divided into 4 groups for separate storage. Since the write bandwidth of the BRAM is limited and the results of one cycle cannot all be stored at once, a reorder buffer unit 8 is used to allow write-back over multiple cycles and to write the results back in order to the designated addresses.
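The multi-cycle, reordered write-back can be sketched as follows (the group width and the (channel, value) layout are sketch assumptions):

```python
def reorder_write_back(cycle_groups):
    """Reorder buffer sketch: results arrive over several cycles as
    groups of (channel, value) pairs of limited width (e.g. 4 cycles
    of 16 channels each); they are buffered, then written back to
    BRAM address slots in channel (convolution) order."""
    buffer = []                          # reorder buffer collects all groups
    for group in cycle_groups:           # one group arrives per cycle
        buffer.extend(group)
    bram = [None] * len(buffer)          # one address slot per channel
    for channel, value in sorted(buffer):
        bram[channel] = value            # ordered write-back to its address
    return bram
```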
It should be noted that, for simplicity of description, the above-mentioned embodiments are described as a series of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the above-described division of the units is only one type of division of logical functions, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or communication connection may be an indirect coupling or communication connection between devices or units through some interfaces, and may be in a telecommunication or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above examples are only used to illustrate the technical solutions of the present invention, and do not limit the scope of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from these embodiments without making any inventive step, fall within the scope of the present invention. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art may still make various combinations, additions, deletions or other modifications of the features of the embodiments of the present invention according to the situation without conflict, so as to obtain different technical solutions without substantially departing from the spirit of the present invention, and these technical solutions also fall within the protection scope of the present invention.

Claims (10)

1. A target detection hardware accelerator, comprising:
a convolution operation unit integrated with a multiplier and an adder, which receives convolution weight data and a feature map pre-stored in a block random access memory, wherein the multiplier performs multiplication on the convolution weight data and the feature map to obtain multiplication result data and convolution offset data, and the adder performs shift-add summation processing on the multiplication result data and the convolution offset data to obtain multiply-accumulate result data;
the pooling operation unit is used for receiving the multiply-accumulate result data, performing pooling operation and outputting pooled result data;
and the RBR operation unit is used for carrying out batch standardization and quantification on the pooling result data to obtain target characteristic data and storing the target characteristic data into the block random access memory.
2. The target detection hardware accelerator of claim 1, wherein the step of multiplying the convolution weight data and the feature map by the multiplier to obtain multiplication result data and convolution offset data comprises:
converting the multiplication of the convolution weight data and the feature map from parallel 3-bit × 3-bit multiplication into serial 3-bit × 1-bit multiplication.
3. The target detection hardware accelerator of claim 1 further comprising a master control module to generate convolution operation instructions for the convolution operator, pooling operation instructions for a pooling operation unit, and RBR operation instructions for an RBR operation unit.
4. The target detection hardware accelerator of claim 1 wherein the batch normalization and quantization comprises: rescaling, normalizing and ReLU processing the pooled result data.
5. The target detection hardware accelerator of claim 1 further comprising a reorder buffer unit disposed between the RBR operation unit and the block random access memory for sorting data units constituting the target feature data in a convolution order and then sequentially storing the data units in the block random access memory.
6. The target detection hardware accelerator of claim 1, wherein the multiplication operation is implemented based on a LUT unit of an FPGA.
7. The target detection hardware accelerator of claim 1, wherein the shift-add summation processing is implemented based on a DSP unit of an FPGA.
8. An acceleration method using the target detection hardware accelerator of any of claims 1-7, comprising:
receiving convolution weight data and a feature map pre-stored in a block random access memory, performing multiplication on the convolution weight data and the feature map to obtain multiplication result data and convolution offset data, and performing shift-add summation processing on the multiplication result data and the convolution offset data to obtain multiply-accumulate result data;
receiving the multiply-accumulate result data, performing pooling operation, and outputting pooled result data;
and carrying out batch standardization and quantification on the pooling result data to obtain target characteristic data and storing the target characteristic data into the block random access memory.
9. The acceleration method according to claim 8, wherein the step of multiplying the convolution weight data and the feature map to obtain multiplication result data and convolution offset data comprises:
converting the multiplication of the convolution weight data and the feature map from parallel 3-bit × 3-bit multiplication into serial 3-bit × 1-bit multiplication.
10. An acceleration method according to claim 8, characterized in that it further comprises a sorting step of: and sequencing the data units forming the target characteristic data according to the convolution sequence, and then sequentially storing the data units to the block random access memory.
CN202011494636.8A 2020-12-17 2020-12-17 Target detection hardware accelerator and acceleration method Active CN112230884B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011494636.8A CN112230884B (en) 2020-12-17 2020-12-17 Target detection hardware accelerator and acceleration method

Publications (2)

Publication Number Publication Date
CN112230884A true CN112230884A (en) 2021-01-15
CN112230884B CN112230884B (en) 2021-04-20

Family

ID=74124781

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011494636.8A Active CN112230884B (en) 2020-12-17 2020-12-17 Target detection hardware accelerator and acceleration method

Country Status (1)

Country Link
CN (1) CN112230884B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106875012A (en) * 2017-02-09 2017-06-20 武汉魅瞳科技有限公司 A kind of streamlined acceleration system of the depth convolutional neural networks based on FPGA
CN109871949A (en) * 2017-12-22 2019-06-11 泓图睿语(北京)科技有限公司 Convolutional neural networks accelerator and accelerated method
CN109934339A (en) * 2019-03-06 2019-06-25 东南大学 A kind of general convolutional neural networks accelerator based on a dimension systolic array
US10637500B2 (en) * 2017-10-12 2020-04-28 British Cayman Islands Intelligo Technology Inc. Apparatus and method for accelerating multiplication with non-zero packets in artificial neuron
CN111340201A (en) * 2018-12-19 2020-06-26 北京地平线机器人技术研发有限公司 Convolutional neural network accelerator and method for performing convolutional operation thereof
WO2020190772A1 (en) * 2019-03-15 2020-09-24 Futurewei Technologies, Inc. Neural network model compression and optimization
CN111832719A (en) * 2020-07-28 2020-10-27 电子科技大学 Fixed point quantization convolution neural network accelerator calculation circuit

Also Published As

Publication number Publication date
CN112230884B (en) 2021-04-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant