CN112230884A - Target detection hardware accelerator and acceleration method - Google Patents


Info

Publication number
CN112230884A
Authority
CN
China
Prior art keywords
data
convolution
multiplication
result data
random access
Prior art date
Legal status
Granted
Application number
CN202011494636.8A
Other languages
Chinese (zh)
Other versions
CN112230884B (en)
Inventor
陈迟晓
张锦山
焦博
张立华
Current Assignee
Ji Hua Laboratory
Original Assignee
Ji Hua Laboratory
Priority date
Filing date
Publication date
Application filed by Ji Hua Laboratory filed Critical Ji Hua Laboratory
Priority to CN202011494636.8A priority Critical patent/CN112230884B/en
Publication of CN112230884A publication Critical patent/CN112230884A/en
Application granted granted Critical
Publication of CN112230884B publication Critical patent/CN112230884B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52Multiplying; Dividing
    • G06F7/523Multiplying only
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50Adding; Subtracting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Optimization (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Neurology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to the field of data processing and provides a target detection hardware accelerator and an acceleration method. The accelerator comprises a convolution operator integrating a multiplier and an adder: the convolution operator receives convolution weight data and a feature map pre-stored in a block random access memory, the multiplier multiplies the convolution weight data by the feature map to obtain multiplication result data and convolution offset data, and the adder performs shift-add summation on the multiplication result data and the convolution offset data to obtain multiply-accumulate result data; a pooling operation unit receives the multiply-accumulate result data, performs a pooling operation, and outputs pooled result data; and an RBR operation unit batch-normalizes and quantizes the pooled result data to obtain target feature data and stores the target feature data in the block random access memory. The invention reduces the time and power consumption the accelerator requires for data movement and improves its working efficiency.

Description

Target detection hardware accelerator and acceleration method
Technical Field
The invention relates to the field of data processing, in particular to a target detection hardware accelerator and an acceleration method.
Background
With the support of big data analysis and large-scale high-speed computing platforms, neural network technology has developed rapidly. On the one hand, neural network algorithms are continuously improved: after the CNN (Convolutional Neural Network), new network models such as the RNN (Recurrent Neural Network) and the GAN (Generative Adversarial Network) have emerged one after another. On the other hand, neural network algorithms are widely applied in embedded systems because of their outstanding performance in image recognition, speech analysis, natural language processing, and similar fields. An embedded system is a special system on chip with stringent requirements on performance and power consumption. Integrating neural network accelerators into a system on chip has therefore become a research hotspot.
With the proposal of various neural network algorithms, a variety of neural network accelerators have emerged. However, during data movement, most existing neural network accelerators read data from DRAM: the data must travel over a bus of limited bandwidth, and reading large amounts of data from DRAM incurs high power consumption and latency, which greatly reduces the accelerator's working efficiency. How to reduce the time and power consumption the accelerator spends on data movement and improve its working efficiency has therefore become an urgent technical problem.
Disclosure of Invention
The invention mainly aims to provide a target detection hardware accelerator and an acceleration method that reduce the time and power consumption the accelerator requires for data movement and improve its working efficiency.
To achieve the above object, the present invention provides an object detection hardware accelerator, including:
a convolution operator integrating a multiplier and an adder, wherein the convolution operator receives convolution weight data and a feature map pre-stored in a block random access memory, the multiplier multiplies the convolution weight data by the feature map to obtain multiplication result data and convolution offset data, and the adder performs shift-add summation on the multiplication result data and the convolution offset data to obtain multiply-accumulate result data;
a pooling operation unit configured to receive the multiply-accumulate result data, perform a pooling operation, and output pooled result data;
and an RBR operation unit configured to batch-normalize and quantize the pooled result data to obtain target feature data and store the target feature data in the block random access memory.
Preferably, the step of multiplying the convolution weight data by the feature map with the multiplier to obtain multiplication result data and convolution offset data includes:
converting the multiplication of the convolution weight data and the feature map from a parallel 3-bit × 3-bit multiplication into serial 3-bit × 1-bit multiplications.
Preferably, the accelerator further includes a main control module configured to generate a convolution operation instruction for the convolution operator, a pooling operation instruction for the pooling operation unit, and an RBR operation instruction for the RBR operation unit.
Preferably, the batch normalization and quantization comprises: rescaling, normalizing, and ReLU processing the pooled result data.
Preferably, the accelerator further includes a reorder buffer unit disposed between the RBR operation unit and the block random access memory and configured to sort the data units constituting the target feature data into convolution order and store them sequentially in the block random access memory.
Preferably, the multiplication operation is implemented with the LUT units of an FPGA.
Preferably, the shift-add summation is implemented with the DSP units of an FPGA.
In order to achieve the above object, the present invention further provides an acceleration method of a target detection hardware accelerator, including:
receiving convolution weight data and a feature map pre-stored in a block random access memory, multiplying the convolution weight data by the feature map to obtain multiplication result data and convolution offset data, and performing shift-add summation on the multiplication result data and the convolution offset data to obtain multiply-accumulate result data;
receiving the multiply-accumulate result data, performing pooling operation, and outputting pooled result data;
and batch-normalizing and quantizing the pooled result data to obtain target feature data, and storing the target feature data in the block random access memory.
Preferably, the step of multiplying the convolution weight data by the feature map to obtain multiplication result data and convolution offset data includes:
converting the multiplication of the convolution weight data and the feature map from a parallel 3-bit × 3-bit multiplication into serial 3-bit × 1-bit multiplications.
Preferably, the method further comprises a sorting step: sorting the data units constituting the target feature data into convolution order and then storing them sequentially in the block random access memory.
The invention provides a target detection hardware accelerator and an acceleration method. The accelerator comprises a convolution operator integrating a multiplier and an adder: the convolution operator receives convolution weight data and a feature map pre-stored in a block random access memory, the multiplier multiplies the convolution weight data by the feature map to obtain multiplication result data and convolution offset data, and the adder performs shift-add summation on the multiplication result data and the convolution offset data to obtain multiply-accumulate result data; a pooling operation unit receives the multiply-accumulate result data, performs a pooling operation, and outputs pooled result data; and an RBR operation unit batch-normalizes and quantizes the pooled result data to obtain target feature data and stores the target feature data in the block random access memory. The invention reduces the time and power consumption the accelerator requires for data movement and improves its working efficiency.
Drawings
FIG. 1 is a block diagram of a target detection hardware accelerator according to an embodiment of the present invention;
fig. 2 is a flowchart illustrating an acceleration method according to an embodiment of the present invention.
Description of reference numerals: 1. convolution operator; 2. multiplier; 3. adder; 4. pooling operation unit; 5. RBR operation unit; 6. main control module; 7. block random access memory; 8. reorder buffer unit.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that all directional indicators in the embodiments of the present invention (such as up, down, left, right, front, and rear) are only used to explain the relative positional relationship and movement of the components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indicators change accordingly.
It will also be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or intervening elements may also be present. When an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present.
In addition, descriptions involving "first", "second", and the like in the present invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features; a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. The technical solutions of the various embodiments may be combined with one another, provided the combination can be realized by a person of ordinary skill in the art; when technical solutions are contradictory or cannot be realized, the combination should be deemed not to exist and falls outside the protection scope of the present invention.
One aspect of the invention provides a target detection hardware accelerator.
Fig. 1 is a schematic diagram of the overall structure of the target detection hardware accelerator of the present invention. The accelerator includes: a convolution operator 1 integrating a multiplier 2 and an adder 3, wherein the convolution operator 1 receives convolution weight data and a feature map pre-stored in a block random access memory 7, the multiplier 2 multiplies the convolution weight data by the feature map to obtain multiplication result data and convolution offset data, and the adder 3 performs shift-add summation on the multiplication result data and the convolution offset data to obtain multiply-accumulate result data; a pooling operation unit 4 that receives the multiply-accumulate result data, performs a pooling operation, and outputs pooled result data; and an RBR operation unit 5 that batch-normalizes and quantizes the pooled result data to obtain target feature data and stores the target feature data in the block random access memory 7.
To address the time and power wasted on data movement in existing neural network accelerators (such as Eyeriss), the invention provides a new neural network accelerator. The data-movement and storage scheme of existing accelerators is improved: the convolution weight data and feature maps that would originally be stored in dynamic random access memory (DRAM) are instead stored in the block random access memory (BRAM) 7. Reading data from DRAM requires transfers over a bus of limited bandwidth, and reading large amounts of data from DRAM incurs high power consumption and latency. Reading from BRAM needs no bus: storage and computation are tightly coupled, power consumption is low, and access is fast, which improves the accelerator's computation speed.
In the present embodiment, the target detection hardware accelerator includes a convolution operator 1, a pooling operation unit 4, and an RBR operation unit 5.
Specifically, the convolution operator 1 receives the convolution weight data and the feature map of the features to be extracted, both pre-stored in the block random access memory 7. The convolution operator 1 integrates a multiplier 2 and an adder 3. The multiplier 2 multiplies the convolution weight data by the feature map to obtain multiplication result data and convolution offset data. In this embodiment, the multiplication is implemented with the LUT units of an FPGA. An FPGA is a semi-custom circuit, a programmable logic array, and using its LUTs here saves DSP (digital signal processor) resources: instead of the traditional approach of multiplying with DSPs, multiplication is performed with LUTs (the number of basic logic units in an FPGA far exceeds the number of DSPs), which overcomes the shortage of DSP blocks. However, because a LUT can implement only a simple logic function, in this embodiment the multiplication of the convolution weight data and the feature map is converted from a parallel 3-bit × 3-bit multiplication into serial 3-bit × 1-bit multiplications, whose processing logic is simpler.
The multiplication result data and the convolution offset data output by the multiplier 2 are then shift-added and summed by the adder 3 to obtain the multiply-accumulate result data. The shift-add summation is implemented with the DSP units of the FPGA.
It should be explained that multiplication is usually implemented with DSPs, but this is wasteful and the number of DSPs is small, so they cannot be used at scale. A 3-bit × 3-bit multiplication is therefore split into groups of 1-bit × 1-bit multiplications, each of which is essentially an AND operation and can be implemented very simply with a LUT. The partial results of the 1-bit × 1-bit groups are then shifted and added by a small number of DSPs to complete the accumulation; in this way both the multiplication and the multiply-accumulate operation are completed, and the limited DSPs are reserved for the higher-precision accumulation step.
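As a rough software model of this splitting (illustrative only, not the RTL mapping), the following sketch computes a 3-bit × 3-bit product entirely from 1-bit AND operations combined by shift-add accumulation:

```python
def and_multiply_3bit(a, b):
    """Multiply two 3-bit unsigned values using only 1-bit AND operations
    (LUT-friendly) plus shift-add accumulation (DSP-friendly).
    Software model of the scheme described above; names are illustrative."""
    assert 0 <= a < 8 and 0 <= b < 8
    acc = 0
    for i in range(3):              # bit position in a
        for j in range(3):          # bit position in b
            bit = ((a >> i) & 1) & ((b >> j) & 1)  # 1-bit x 1-bit = AND
            acc += bit << (i + j)                  # shift-add accumulate
    return acc
```

For example, `and_multiply_3bit(5, 7)` yields 35, matching the ordinary product; the nine AND results play the role of the LUT outputs, and the single running sum plays the role of the DSP accumulator.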
For convenience of understanding, the accelerator calculation process of the present invention is explained in detail by the following steps:
1) Fix the 1st row of the convolution kernel, input the elements of the 1st row of the input feature map one by one, and perform a one-dimensional convolution with the fixed convolution weights to obtain partial sums of the 1st-row elements of the output feature map, output one by one;
2) Fix the 2nd row of the convolution kernel, input the elements of the 2nd row of the input feature map one by one, perform a one-dimensional convolution with the fixed convolution weights to obtain further partial sums of the 1st-row elements of the output feature map, and add them one by one to the previously obtained partial sums to form new partial sums;
3) Fix the 3rd row of the convolution kernel, input the elements of the 3rd row of the input feature map one by one, perform a one-dimensional convolution with the fixed convolution weights to obtain the remaining partial sums of the 1st-row elements of the output feature map, and add them one by one to the accumulated partial sums to output the complete elements;
The remaining rows follow by analogy, and the feature-map result is computed and output row by row. For each output row, the three rows of the convolution kernel are fixed in turn, and the input feature vectors corresponding to that row of the convolution are fed in sequentially.
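The row-wise scheme of steps 1) to 3) can be sketched in software as follows; the function and variable names are illustrative, and only a single-channel 3 × 3 valid convolution is modeled:

```python
def conv2d_rowwise(fmap, kernel):
    """Row-stationary 3x3 valid convolution: each output row is formed by
    fixing one kernel row at a time, running a 1-D convolution over the
    corresponding input row, and accumulating the partial sums."""
    H, W = len(fmap), len(fmap[0])
    out = [[0] * (W - 2) for _ in range(H - 2)]
    for orow in range(H - 2):
        for krow in range(3):            # fix kernel row krow (steps 1-3)
            row = fmap[orow + krow]      # matching input row
            for ocol in range(W - 2):    # 1-D convolution, accumulate psum
                out[orow][ocol] += sum(kernel[krow][k] * row[ocol + k]
                                       for k in range(3))
    return out
```

With an all-ones 4 × 4 input and an all-ones 3 × 3 kernel this produces a 2 × 2 output of 9s, as expected for a valid convolution.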
The invention also uses a ping-pong buffering technique to allocate buffers for the input feature maps, saving buffer space and achieving seamless buffering and computation. For layer 1, cache A feeds the layer-1 input feature map to the computation acceleration module, and the resulting layer-2 input feature map is stored in cache B; when the layer-1 operation finishes and the layer-2 operation begins, cache B feeds the layer-2 input feature map to the computation acceleration module, and the layer-3 input feature map is stored back into cache A. Cache C serves as a supplementary buffer when the capacities of caches A and B are insufficient.
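A minimal behavioural sketch of the ping-pong buffering follows; the class and method names are assumptions, and cache C (the overflow buffer) is omitted for brevity:

```python
class PingPongBuffer:
    """Two caches (A, B) alternate roles: one feeds layer k to the compute
    module while the other collects the layer k+1 input feature map, then
    they swap - a sketch of the scheme described above."""
    def __init__(self, first_input):
        self.read_buf = list(first_input)   # cache A: feeds current layer
        self.write_buf = []                 # cache B: collects next input

    def run_layer(self, layer_fn):
        # compute module consumes read_buf and produces next feature map
        self.write_buf = layer_fn(self.read_buf)
        # layer done: swap roles so the next layer reads what was written
        self.read_buf, self.write_buf = self.write_buf, []
        return self.read_buf
```

Because the swap is just an exchange of references, buffering the next layer's input overlaps with computation and no third full-size copy is needed in the common case.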
The pooling operation unit 4 receives the multiply-accumulate result data, performs a pooling operation, and outputs pooled result data. The pooling operation unit 4 keeps the largest of the multiply-accumulate results produced by the preceding convolution and discards the others (max pooling).
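The max-pooling behaviour can be modeled as keeping the largest multiply-accumulate result in each pooling window and discarding the rest; the window size below is an assumption, since the text does not specify one:

```python
def max_pool(psums, window=4):
    """Max pooling over a stream of multiply-accumulate results: keep the
    largest value in each non-overlapping window, discard the others
    (illustrative model; window size is an assumption)."""
    return [max(psums[i:i + window]) for i in range(0, len(psums), window)]
```

For instance, `max_pool([1, 5, 2, 3, 9, 0, 0, 1])` keeps one value per window of four, yielding `[5, 9]`.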
Replacing the DRAM with BRAM improves the accelerator's operating speed, but the storage space of BRAM is far smaller than that of DRAM and may not suffice for large amounts of feature data. Therefore, in this embodiment the accelerator further includes an RBR operation unit 5 that batch-normalizes and quantizes the pooled result data to obtain the target feature data: feature data that would otherwise occupy large resources is reduced to 3-bit values occupying few resources before being stored in the block random access memory 7. The batch normalization and quantization comprises rescaling, normalizing, and ReLU processing of the pooled result data.
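A hedged sketch of the RBR processing chain (rescale, batch-normalize, ReLU, then quantize to the 3-bit storage format) is shown below; the concrete scaling parameters, rounding rule, and clipping range are assumptions, as the text names only the three steps and the 3-bit target width:

```python
import math

def rbr(x, scale, mean, var, eps=1e-5):
    """Rescale / batch-normalize / ReLU one pooled value, then quantize it
    to an unsigned 3-bit code (0..7) so it fits the narrow BRAM format.
    Parameter choices are illustrative assumptions."""
    y = (x * scale - mean) / math.sqrt(var + eps)  # rescale + normalize
    y = max(y, 0.0)                                # ReLU
    return min(int(round(y)), 7)                   # clamp to the 3-bit range
```

Negative activations quantize to 0 via the ReLU, and large activations saturate at 7, the largest 3-bit code.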
Further, an additional buffer block is inserted between the pooling operation unit 4 and the RBR operation unit 5. This buffer reduces the workload of the RBR unit by lowering the activity of the RBR block after the pooling layers.
In a further preferred embodiment of the present invention, as shown in fig. 1, the accelerator further includes a main control module 6, configured to generate a convolution operation instruction of the convolution operator 1, a pooling operation instruction of the pooling operation unit 4, and an RBR operation instruction of the RBR operation unit 5.
In a further preferred embodiment of the present invention, as shown in fig. 1, the accelerator further includes a reorder buffer unit 8, disposed between the RBR operation unit 5 and the block random access memory 7, for sorting the data units constituting the target feature data according to a convolution order and then sequentially storing the data units in the block random access memory 7.
In this embodiment, the target feature data output by the pooling operation unit 4 has 64 units, which represent 64 output channels. Because the accelerator's bandwidth is narrow, only 16 channels can be written to the block random access memory at a time, so the 64 channels are divided into 4 groups and stored separately. Since the BRAM's write bandwidth is limited and one cycle's results cannot all be stored in the buffer at once, a reorder buffer unit 8 is used to allow write-back over multiple cycles and to write the results back, in order, to the designated addresses.
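The multi-cycle write-back can be modeled as splitting the 64 output channels into groups of 16, one group per cycle; the grouping below is a behavioural sketch, and the actual address mapping is not given in the text:

```python
def reorder_writeback(channel_results, group=16):
    """64 output channels arrive together, but BRAM bandwidth allows only
    16 per cycle, so the reorder buffer splits them into 4 groups and
    writes them back over successive cycles in order (behavioural sketch;
    group size taken from the embodiment above)."""
    cycles = []
    for start in range(0, len(channel_results), group):
        cycles.append(channel_results[start:start + group])  # one write cycle
    return cycles
```

With 64 channel results, this yields 4 write-back cycles of 16 channels each, preserving the original channel order.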
Another aspect of the present invention provides an acceleration method using the target detection hardware accelerator as described above, as shown in fig. 2, the method includes the following steps:
s1: receiving convolution weight data and a characteristic diagram which are pre-stored in a block random access memory (7), multiplying the convolution weight data and the characteristic diagram to obtain multiplication result data and convolution offset data, and shifting, adding and summing the multiplication result data and the convolution offset data to obtain multiplication and accumulation result data;
s2: receiving the multiply-accumulate result data, performing pooling operation, and outputting pooled result data;
s3: and carrying out batch standardization and quantification on the pooling result data to obtain target characteristic data, and storing the target characteristic data in the block random access memory 7.
In the present embodiment, the accelerator includes: the convolution arithmetic unit 1 is integrated with a multiplier 2 and an adder 3, the convolution arithmetic unit 1 receives convolution weight data and a characteristic diagram which are stored in a block random access memory 7 in advance, the multiplier 2 performs multiplication operation on the convolution weight data and the characteristic diagram to obtain multiplication result data and convolution offset data, and the adder 3 performs shift addition summation processing on the multiplication result data and the convolution offset data to obtain multiplication accumulation result data; a pooling operation unit 4 for receiving the multiply-accumulate result data, performing pooling operation, and outputting pooled result data; and the RBR operation unit 5 is used for carrying out batch standardization and quantification on the pooling result data to obtain target characteristic data and storing the target characteristic data into the block random access memory 7.
In order to solve the problem that the existing neural network accelerator (such as eyeris) wastes time and power consumption during data transmission, the invention provides a novel neural network accelerator according to the technical problem. In the invention, the data handling and storage mode of the existing neural network accelerator is improved, and the convolution weight data and the characteristic diagram which are originally required to be stored in a Dynamic Random Access Memory (DRAM) are changed into those which are stored in a Block Random Access Memory (BRAM) 7, because the DRAM is adopted to read the data and needs to be transmitted through a bus, the bandwidth of the bus is limited, and the power consumption and the time delay for reading a large amount of data from the DRAM are large. The reading from the BRAM does not need to pass through a bus, the storage and the calculation are connected together, the power consumption is low, the speed is high, and therefore the calculation speed of the accelerator is improved.
In the present embodiment, the target detection hardware accelerator includes a convolution operator 1, a pooling operation unit 4, and an RBR operation unit 5.
Specifically, the convolution operator 1 is configured to receive convolution weight data and a feature map of features to be extracted, which are stored in the block random access memory 7 in advance. The convolution operator 1 is integrated with a multiplier 2 and an adder 3. The convolution weight data and the characteristic diagram are multiplied by the multiplier 2 to obtain multiplication result data and convolution offset data. In this embodiment, the multiplication operation is implemented based on LUT unit of FPGA, and the FPGA device belongs to a semi-custom circuit in the asic, and is a programmable logic array, which can effectively solve the problem and save DSP (digital signal processor) resources. The traditional method of adopting DSP to carry out multiplication operation is changed into the method of carrying out multiplication operation through LUT (the number of basic logic units of FPGA is far more than DSP), so that the problem that the number of gate circuits of the original DSP device is small is solved. However, since the LUT can handle only a simple logic function, in the present embodiment, the convolution weight data and the signature are converted from parallel multiplication of 3 bits × 3 bits by parallel multiplication to serial multiplication of 3 bits × 1bit, which is simpler in processing logic.
And then the multiplication result data and the convolution offset data output by the multiplier 2 are subjected to shift addition summation processing through an adder 3 to obtain multiplication and accumulation result data. The shift-add-sum processing is implemented based on an FPGA-based DSP unit.
It should be explained that the multiplication operation is usually implemented by DSP, but it is very wasteful and the number of DSP is very small, so it cannot be used on a large scale. Therefore, splitting 3 bits x 3 bits into groups of 1 bits x 1 bits, which is essentially and logic, can be implemented very simply by a LUT. The results of multiple groups of 1bit multiplied by 1bit can be shifted and added by few DSPs to complete accumulation, namely, multiplication operation and multiplication and accumulation operation are completed, and the DSP with limited number can be used in the multiplication and accumulation operation with higher operation precision.
For convenience of understanding, the accelerator calculation process of the present invention is explained in detail by the following steps:
1) fixing the 1 st row of the convolution kernel, inputting the 1 st row elements of the input feature map one by one, performing one-dimensional convolution with fixed convolution weight to obtain the 1 st row element part sum of the output feature map, and outputting one by one;
2) fixing the 2 nd line of the convolution kernel, inputting the 2 nd line elements of the input feature map one by one, performing one-dimensional convolution with fixed convolution weight to obtain the 1 st line element sum of the output feature map, and adding the 1 st line element sum and the previously obtained part sum one by one to form a new part sum output;
3) fixing the 3 rd row of the convolution kernel, inputting the 3 rd row elements of the input feature map one by one, performing one-dimensional convolution with fixed convolution weight to obtain the 1 st row element part sum of the output feature map, and adding the part sums obtained one by one to serve as complete element output;
and by analogy, calculating and outputting a feature diagram result according to rows. Each row of output is calculated, three rows of convolution kernels need to be fixed respectively in sequence, and input feature vectors corresponding to the row of convolution are input in sequence.
The invention also adopts a ping-pong technique to manage buffer allocation for the input feature map, which saves buffer space and achieves seamless buffering and computation of data. For layer 1, buffer A outputs the layer-1 input feature map to the computation acceleration module, and the resulting layer-2 input feature map is stored in buffer B; when the layer-1 operation finishes and the layer-2 operation starts, buffer B outputs the layer-2 input feature map to the computation acceleration module and the layer-3 input feature map is stored back in buffer A. Buffer C serves as a supplementary buffer when the capacities of buffers A and B are insufficient.
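The alternation of buffers A and B can be sketched as follows (class and method names are illustrative; the supplementary buffer C is omitted for brevity):

```python
class PingPongBuffers:
    """Two buffers alternate roles each layer: one feeds the compute
    module while the other collects its output, then they swap."""
    def __init__(self):
        self.bufs = [None, None]   # buffer A, buffer B
        self.read_idx = 0          # which buffer feeds the current layer

    def load(self, feature_map):
        """Store the first layer's input feature map into buffer A."""
        self.bufs[self.read_idx] = feature_map

    def run_layer(self, layer_fn):
        """Read from one buffer, write the result to the other,
        then swap roles for the next layer."""
        out = layer_fn(self.bufs[self.read_idx])
        write_idx = 1 - self.read_idx
        self.bufs[write_idx] = out
        self.read_idx = write_idx
        return out
```

Because reading layer n's input and writing layer (n+1)'s input use different buffers, buffering and computation overlap without conflict.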
And a pooling operation unit 4, which receives the multiply-accumulate result data, performs a pooling operation, and outputs pooled result data. The pooling operation unit 4 selects the largest of the multiply-accumulate results obtained by the preceding convolution calculation and discards the others.
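A minimal sketch of this max-pooling behavior, assuming a 2 × 2 window (the window size is not specified in the text):

```python
import numpy as np

def max_pool(x, win=2):
    """Keep only the largest multiply-accumulate result in each
    win x win window; the other values are discarded."""
    H, W = x.shape
    out = np.empty((H // win, W // win))
    for i in range(0, H - win + 1, win):
        for j in range(0, W - win + 1, win):
            out[i // win, j // win] = x[i:i + win, j:j + win].max()
    return out
```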
In the present invention, DRAM is replaced by BRAM. Although this improves the operating speed of the accelerator, the storage space of BRAM is far smaller than that of DRAM and may be insufficient to store a large amount of feature data. Therefore, in this embodiment, the accelerator further includes an RBR operation unit 5 for performing batch normalization and quantization on the pooled result data to obtain target feature data: the feature data that originally occupied large resources is reduced to 3-bit values occupying few resources before being stored in the block random access memory 7. Batch normalization and quantization include rescaling, normalization, and ReLU processing of the pooled result data.
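The RBR sequence can be sketched as follows (parameter names and the rounding scheme are assumptions; only the rescale → normalize → ReLU → 3-bit quantize order is taken from the text):

```python
import numpy as np

def rbr(pooled, scale, bn_mean, bn_std, n_bits=3):
    """Rescale -> batch-normalize -> ReLU, then quantize to n_bits-wide
    codes so the feature map fits in BRAM."""
    x = pooled * scale                      # rescale
    x = (x - bn_mean) / bn_std              # batch normalization
    x = np.maximum(x, 0)                    # ReLU
    # quantize to the unsigned n-bit range [0, 2^n - 1]
    q = np.clip(np.round(x), 0, 2 ** n_bits - 1).astype(np.uint8)
    return q
```

With n_bits = 3, every stored value occupies only 3 bits, which is what makes BRAM-only feature storage feasible.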
Further, an additional buffer block is inserted between the pooling operation unit 4 and the RBR operation unit 5. The buffer block reduces the workload of the RBR unit, lowering the activity of the RBR block for the layers that are followed by a pooling layer.
In a further preferred embodiment of the present invention, as shown in fig. 1, the accelerator further includes a main control module 6, configured to generate a convolution operation instruction of the convolution operator 1, a pooling operation instruction of the pooling operation unit 4, and an RBR operation instruction of the RBR operation unit 5.
In a further preferred embodiment of the present invention, as shown in fig. 1, the accelerator further includes a reorder buffer unit 8, disposed between the RBR operation unit 5 and the block random access memory 7, for sorting the data units constituting the target feature data according to a convolution order and then sequentially storing the data units in the block random access memory 7.
In this embodiment, the target feature data output by the pooling operation unit 4 has 64 units, representing 64 output channels. Because the bandwidth of the accelerator is narrow, only 16 channels can be stored in the block random access memory at a time, so the 64 channels are divided into 4 groups for separate storage. Since the write bandwidth of the BRAM is limited and the results of one cycle cannot all be stored at once, a reorder buffer unit 8 is used to allow write-back over multiple cycles and to write the results back in order to the designated addresses.
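The multi-cycle, reordered write-back can be sketched as follows (the group width and the (channel, value) layout are sketch assumptions):

```python
def reorder_write_back(cycle_groups):
    """Reorder buffer sketch: results arrive over several cycles as
    groups of (channel, value) pairs of limited width (e.g. 4 cycles
    of 16 channels each); they are buffered, then written back to
    BRAM address slots in channel (convolution) order."""
    buffer = []                          # reorder buffer collects all groups
    for group in cycle_groups:           # one group arrives per cycle
        buffer.extend(group)
    bram = [None] * len(buffer)          # one address slot per channel
    for channel, value in sorted(buffer):
        bram[channel] = value            # ordered write-back to its address
    return bram
```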
It should be noted that, for simplicity of description, the above-mentioned embodiments are described as a series of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the above-described division of the units is only one type of division of logical functions, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or communication connection may be an indirect coupling or communication connection between devices or units through some interfaces, and may be in a telecommunication or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above examples are only used to illustrate the technical solutions of the present invention, and do not limit the scope of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from these embodiments without making any inventive step, fall within the scope of the present invention. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art may still make various combinations, additions, deletions or other modifications of the features of the embodiments of the present invention according to the situation without conflict, so as to obtain different technical solutions without substantially departing from the spirit of the present invention, and these technical solutions also fall within the protection scope of the present invention.

Claims (10)

1. A target detection hardware accelerator, comprising:
a convolution operation unit integrated with a multiplier and an adder, which receives convolution weight data and a feature map pre-stored in a block random access memory, wherein the multiplier performs multiplication on the convolution weight data and the feature map to obtain multiplication result data and convolution offset data, and the adder performs shift-add summation processing on the multiplication result data and the convolution offset data to obtain multiply-accumulate result data;
the pooling operation unit is used for receiving the multiply-accumulate result data, performing pooling operation and outputting pooled result data;
and the RBR operation unit is used for carrying out batch standardization and quantification on the pooling result data to obtain target characteristic data and storing the target characteristic data into the block random access memory.
2. The target detection hardware accelerator of claim 1, wherein the step of multiplying the convolution weight data and the feature map by the multiplier to obtain multiplication result data and convolution offset data comprises:
converting the multiplication of the convolution weight data and the feature map from parallel 3-bit × 3-bit multiplication into serial 3-bit × 1-bit multiplication.
3. The target detection hardware accelerator of claim 1 further comprising a master control module to generate convolution operation instructions for the convolution operator, pooling operation instructions for a pooling operation unit, and RBR operation instructions for an RBR operation unit.
4. The target detection hardware accelerator of claim 1 wherein the batch normalization and quantization comprises: rescaling, normalizing and ReLU processing the pooled result data.
5. The target detection hardware accelerator of claim 1 further comprising a reorder buffer unit disposed between the RBR operation unit and the block random access memory for sorting data units constituting the target feature data in a convolution order and then sequentially storing the data units in the block random access memory.
6. The target detection hardware accelerator of claim 1, wherein the multiplication operation is implemented based on a LUT unit of an FPGA.
7. The target detection hardware accelerator of claim 1, wherein the shift-add summation processing is implemented based on a DSP unit of an FPGA.
8. An acceleration method using the target detection hardware accelerator of any of claims 1-7, comprising:
receiving convolution weight data and a feature map pre-stored in a block random access memory, performing multiplication on the convolution weight data and the feature map to obtain multiplication result data and convolution offset data, and performing shift-add summation processing on the multiplication result data and the convolution offset data to obtain multiply-accumulate result data;
receiving the multiply-accumulate result data, performing pooling operation, and outputting pooled result data;
and carrying out batch standardization and quantification on the pooling result data to obtain target characteristic data and storing the target characteristic data into the block random access memory.
9. The acceleration method according to claim 8, wherein the step of multiplying the convolution weight data and the feature map to obtain multiplication result data and convolution offset data comprises:
converting the multiplication of the convolution weight data and the feature map from parallel 3-bit × 3-bit multiplication into serial 3-bit × 1-bit multiplication.
10. An acceleration method according to claim 8, characterized in that it further comprises a sorting step of: and sequencing the data units forming the target characteristic data according to the convolution sequence, and then sequentially storing the data units to the block random access memory.
CN202011494636.8A 2020-12-17 2020-12-17 Target detection hardware accelerator and acceleration method Active CN112230884B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011494636.8A CN112230884B (en) 2020-12-17 2020-12-17 Target detection hardware accelerator and acceleration method

Publications (2)

Publication Number Publication Date
CN112230884A true CN112230884A (en) 2021-01-15
CN112230884B CN112230884B (en) 2021-04-20

Family

ID=74124781

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011494636.8A Active CN112230884B (en) 2020-12-17 2020-12-17 Target detection hardware accelerator and acceleration method

Country Status (1)

Country Link
CN (1) CN112230884B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106875012A (en) * 2017-02-09 2017-06-20 武汉魅瞳科技有限公司 A kind of streamlined acceleration system of the depth convolutional neural networks based on FPGA
CN109871949A (en) * 2017-12-22 2019-06-11 泓图睿语(北京)科技有限公司 Convolutional neural networks accelerator and accelerated method
CN109934339A (en) * 2019-03-06 2019-06-25 东南大学 A kind of general convolutional neural networks accelerator based on a dimension systolic array
US10637500B2 (en) * 2017-10-12 2020-04-28 British Cayman Islands Intelligo Technology Inc. Apparatus and method for accelerating multiplication with non-zero packets in artificial neuron
CN111340201A (en) * 2018-12-19 2020-06-26 北京地平线机器人技术研发有限公司 Convolutional neural network accelerator and method for performing convolutional operation thereof
WO2020190772A1 (en) * 2019-03-15 2020-09-24 Futurewei Technologies, Inc. Neural network model compression and optimization
CN111832719A (en) * 2020-07-28 2020-10-27 电子科技大学 Fixed point quantization convolution neural network accelerator calculation circuit

Also Published As

Publication number Publication date
CN112230884B (en) 2021-04-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant