CN114662681A - YOLO algorithm-oriented general hardware accelerator system platform capable of being deployed rapidly - Google Patents
- Publication number
- CN114662681A CN114662681A CN202210056834.9A CN202210056834A CN114662681A CN 114662681 A CN114662681 A CN 114662681A CN 202210056834 A CN202210056834 A CN 202210056834A CN 114662681 A CN114662681 A CN 114662681A
- Authority
- CN
- China
- Prior art keywords
- layer
- data
- module
- output
- input
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06N3/063 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- G06N3/045 — Combinations of networks
- G06N3/08 — Learning methods
Abstract
A rapidly deployable, general-purpose hardware accelerator system platform for the YOLO algorithm, belonging to the field of computer technology, addresses the deployment requirements of YOLO object-detection algorithms for speed, high performance and low power consumption, and has broad application scenarios. The platform consists of an ARM subsystem, an FPGA subsystem and an off-chip memory. The ARM subsystem is responsible for parameter initialization, image preprocessing, model-parameter preprocessing, data-segment address allocation, FPGA accelerator driving and image post-processing; the FPGA subsystem is responsible for the compute-intensive parts of the YOLO algorithm. After startup, the platform reads a YOLO algorithm configuration file to initialize the accelerator drive parameters; after preprocessing the image to be detected, it reads the model weight and bias data for quantization, fusion and reordering, drives the FPGA subsystem to run the model computation, and post-processes the result to obtain the object-detection image. The invention enables rapid deployment of the YOLO algorithm while maintaining high performance and low power consumption.
Description
Technical Field
The invention belongs to the field of computer technology and relates in particular to a rapidly deployable, general-purpose hardware accelerator system platform for the YOLO algorithm.
Background
With the growth of computing power and the development of big data, convolutional neural networks (CNNs) have become mainstream in computer vision. Object detection is an important CNN-based branch of the field, and the YOLO (You Only Look Once) family of algorithms in particular shows excellent speed and accuracy, with wide application in many areas such as robot vision, security monitoring, autonomous driving and virtual reality.
The YOLO algorithm keeps evolving toward heavier computation, larger data volumes and more complex structures, while version iterations come quickly, so deploying the algorithm keeps getting harder; at the same time, application scenarios such as unmanned detection place ever higher demands on low latency, low power consumption and rapid deployment.
Existing deployment schemes are generally based on CPU, GPU, FPGA or ASIC platforms. A general-purpose CPU cannot meet the performance requirement, the GPU suffers from high latency and high power consumption, and the ASIC has high development costs, so the FPGA, with its programmability, high parallelism and low power consumption, has attracted wide attention. However, as the YOLO algorithm grows more complex and iterates faster, current CNN FPGA accelerators increasingly suffer from high deployment difficulty and long development cycles.
Disclosure of Invention
Addressing the shortcomings of the prior art, the invention presents the design of a rapidly deployable, general-purpose hardware accelerator system platform for the YOLO algorithm that meets low-latency and low-power requirements while allowing the YOLO algorithm to be deployed quickly. The platform consists of an ARM subsystem, an FPGA subsystem and an off-chip memory. The ARM subsystem is mainly responsible for logic control and small-scale data processing and comprises a parameter initialization module, an image preprocessing module, a model parameter preprocessing module, a data segment address allocation module, an FPGA accelerator driving module and an image post-processing module. The FPGA subsystem is mainly responsible for the compute-intensive parts of the YOLO algorithm and comprises an input/output buffer module, a controller module, a data routing module and a YOLO operator module. The off-chip memory is mainly responsible for large-scale data storage. The software and hardware modules are briefly described below.
The parameter initialization module of the ARM subsystem reads the algorithm configuration file specified at platform startup, obtains the YOLO algorithm version, structure and per-layer parameter information, and completes the initialization of the drive-parameter structure variables; the module can adapt to the YOLOv1, YOLOv2, YOLOv2-tiny, YOLOv3, YOLOv3-tiny, YOLOv4 and YOLOv4-tiny algorithms. The image preprocessing module reads an image to be detected of arbitrary resolution, converts it to [0, 1] by dividing each pixel by 255, scales it to the size of the algorithm's first-layer input feature map while preserving the original aspect ratio, and stores it in the off-chip memory. The model parameter preprocessing module reads the algorithm's weight and bias data from the off-chip memory for dynamic quantization, converts 32-bit floating-point numbers to 16-bit fixed-point numbers, fuses the normalization layer into the convolution layer, and rearranges the weight data in off-chip memory from {K_W, K_H, N, M} order to {N, M, K_W, K_H} order, where K_W, K_H, N and M are the width, height, channel count and number of the convolution kernels respectively.
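As a concrete illustration of the reordering step, the following pure-Python sketch (our own hypothetical helper — the ARM subsystem presumably performs this in C) rewrites a flat weight array from {K_W, K_H, N, M} order to {N, M, K_W, K_H} order:

```python
# Illustrative sketch, not the patent's implementation: reorder a flat
# convolution-weight array from (K_W, K_H, N, M) layout to (N, M, K_W, K_H)
# layout, so that the values of one kernel become contiguous in memory.
def reorder_weights(w, kw, kh, n, m):
    """w: flat list in (K_W, K_H, N, M) order; returns a flat list in
    (N, M, K_W, K_H) order."""
    def src(i, j, c, o):
        # linear index into the original (K_W, K_H, N, M) layout
        return ((i * kh + j) * n + c) * m + o
    out = []
    for c in range(n):          # input channel
        for o in range(m):      # kernel number (output channel)
            for i in range(kw): # kernel width
                for j in range(kh):  # kernel height
                    out.append(w[src(i, j, c, o)])
    return out
```

Reading one whole kernel then becomes a single contiguous run, which is what allows the longer AXI bursts mentioned later in the description.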
When the FPGA subsystem is driven, if the current layer is a routing layer, the data segment neither enters the FPGA subsystem nor requires an allocated memory region for splicing or slicing data segments; instead, the routing layer is accounted for directly in this module's address management when each layer's output is stored off-chip. If the routing layer splits the output of layer A into two equal groups along the channel dimension and takes the second group as the input of layer C, the output feature map of layer A containing X_A elements, with output-segment first address OutptrA_start and last address OutptrA_end = OutptrA_start + X_A × 2, then the routing layer is skipped by initializing the C-layer input with first address InptrC_start = OutptrA_end ÷ 2 and last address InptrC_end = OutptrA_end. If the routing layer concatenates the output feature maps of layers D and E along the channel dimension as the input of layer G, layer D outputting X_D elements with output-segment first address OutptrD_start and last address OutptrD_end = OutptrD_start + X_D × 2, and layer E outputting X_E elements, then the E-layer output segment must start at OutptrE_start = OutptrD_end + 2 and end at OutptrE_end = OutptrE_start + X_E × 2, and the routing layer is skipped by giving the G-layer input the first address InptrG_start = OutptrD_start and last address InptrG_end = OutptrE_end. The memory copies and complex logic of routing layers are thus eliminated purely through storage-address management.
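The pointer arithmetic for the two routing cases can be summarized in a small sketch (illustrative Python using the patent's pointer names; addresses are byte offsets with 2-byte elements, and the ÷2 split formula assumes the segment starts at address 0, which is what the text's expressions imply):

```python
# Hedged sketch of the route-skipping address bookkeeping; not the ARM code.
def split_route(outA_start, xA):
    """Layer A's output (xA 16-bit elements) is split in half along the
    channel axis; the second half feeds layer C without any memory copy."""
    outA_end = outA_start + xA * 2
    inC_start = outA_end // 2   # OutptrA_end / 2, per the text (start assumed 0)
    inC_end = outA_end
    return outA_end, inC_start, inC_end

def concat_route(outD_start, xD, xE):
    """Layers D and E are stored back-to-back so their concatenation is
    already contiguous and can feed layer G directly."""
    outD_end = outD_start + xD * 2
    outE_start = outD_end + 2            # E must follow D immediately
    outE_end = outE_start + xE * 2
    inG_start, inG_end = outD_start, outE_end
    return outD_end, outE_start, outE_end, inG_start, inG_end
```

The design choice here is that concat is free only because the allocator forces E's segment to begin right after D's; any other placement would reintroduce the memory copy.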
The FPGA accelerator driving module drives the FPGA subsystem in a loop over the layer IDs of the YOLO algorithm, executing the algorithm's forward-inference process.
After the FPGA subsystem is driven, it is controlled by the controller module, and the data routing module generates the address offsets for reading the model data and the input feature-map data. The input/output buffer module splits the input data arriving on a 128-bit data interface into eight 16-bit values, stores the input feature-map, weight and bias block data, and reduces the number of off-chip memory accesses by reusing on-chip data. The YOLO operator module provides convolution, pooling, upsampling, reordering and shortcut operators. Convolution, as the operator with the largest computation load, is computed in parallel along 3 dimensions: the input feature-map channels, the output feature-map channels and the output feature-map columns. The 3-dimensional parallelism {Pif, Pof, Pox} is selected from the common factors of the corresponding dimensions of all convolution-layer feature maps as {Pif, Pof, Pox} = {cd(In_N_1, …, In_N_n), cd(Out_M_1, …, Out_M_n), cd(Out_W_1, …, Out_W_n)}, where cd denotes a common factor of a set of integers, In_N_i is the number of input feature-map channels of layer i, Out_M_i the number of output feature-map channels of layer i, Out_W_i the output feature-map width of layer i, and n the number of convolution layers. The choice must satisfy DSP_num < DSP_device, where DSP_num = Pif × Pof × Pox × Poy × Pkx × Pky is the number of DSP resources the convolution operator consumes, DSP_device is the number of DSP resources on the FPGA chip, and Poy, Pkx and Pky are the parallelism along the output feature-map rows, convolution kernel width and convolution kernel height respectively; setting Poy = Pkx = Pky = 1 guarantees the parallelism while improving hardware efficiency.
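A toy enumeration of the parallelism selection, with Poy = Pkx = Pky = 1 as stated in the text; the exhaustive search and the greedy "largest product" choice are our own illustrative assumptions, not the patent's procedure:

```python
from math import gcd
from functools import reduce

# cd(...) in the text: the common factors of a set of integers, i.e. the
# divisors of their gcd.
def common_factors(dims):
    g = reduce(gcd, dims)
    return [d for d in range(1, g + 1) if g % d == 0]

def pick_parallelism(in_n, out_m, out_w, dsp_device):
    """in_n/out_m/out_w: per-layer channel/channel/width dimensions.
    With Poy = Pkx = Pky = 1 the DSP constraint reduces to
    Pif * Pof * Pox < DSP_device."""
    best = (1, 1, 1)
    for pif in common_factors(in_n):
        for pof in common_factors(out_m):
            for pox in common_factors(out_w):
                p = pif * pof * pox
                if p < dsp_device and p > best[0] * best[1] * best[2]:
                    best = (pif, pof, pox)
    return best
```

Because each factor divides every layer's dimension, the same parallel hardware is fully utilized on every convolution layer without remainder handling.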
An operator-type signal from the control module selects the corresponding operator in the YOLO operator module for computation. After the final output feature-map data is obtained, under the control module's direction the data routing module generates the output address offsets, the input/output buffer module splices eight 16-bit output values into complete output data for the 128-bit data interface, and the data in the output buffer is stored to off-chip memory. After the FPGA accelerator driving module finishes executing, the post-processing module decodes the inference result and applies non-maximum suppression to obtain the best detection boxes and save the detection image.
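The 16-bit/128-bit splicing can be modelled behaviourally (the platform does this in FPGA buffer logic; this Python sketch only shows the bit layout, which we assume to be little-endian lane order):

```python
# Illustrative bit-layout model of the input/output buffer's bus packing,
# not the RTL: eight 16-bit lanes per 128-bit bus beat.
def pack128(words):
    """Pack eight 16-bit words into one 128-bit integer (word 0 in the
    lowest lane, by assumption)."""
    assert len(words) == 8
    v = 0
    for i, w in enumerate(words):
        v |= (w & 0xFFFF) << (16 * i)
    return v

def unpack128(v):
    """Split one 128-bit integer back into eight 16-bit words."""
    return [(v >> (16 * i)) & 0xFFFF for i in range(8)]
```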
The invention achieves high-performance, low-power model inference, eliminates the cumbersome step of processing the YOLO model data on a host machine before copying it into the accelerator system, is compatible with multiple YOLO algorithm versions, and enables rapid deployment, effectively addressing the high deployment difficulty and long development cycles caused by the YOLO algorithm's growing complexity and fast version updates.
Drawings
FIG. 1 is a schematic diagram of a system platform.
FIG. 2 is a schematic diagram of a configuration file format.
FIG. 3 is a schematic diagram of an FPGA subsystem.
FIG. 4 is a schematic diagram of a parallel circuit for convolution operations.
Detailed Description
The present invention will be described in further detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of the system platform of the present invention, which mainly comprises an ARM subsystem, an FPGA subsystem and an off-chip memory. When the system starts, the parameter initialization module of the ARM subsystem reads the algorithm's configuration file and initializes the drive structure variables. The configuration file is written in a fixed format, shown in Fig. 2, and contains the YOLO algorithm version and, for each layer of the model, the operator type, convolution kernel size and number, stride, activation-function type, input/output feature-map quantization Q values, layer ID, routing-layer group count, input-group IDs and shortcut-layer ID. The drive structure types are Network and Layer: Network comprises a Layer structure array and a member for the number of network layers, while Layer comprises members for the input/output feature-map size and channel count, convolution kernel size, operator type, stride, activation-function type, input/output feature-map quantization Q values, weight offset, bias, feature-map blocking coefficients, group count, input-group IDs and shortcut-layer ID. This reduces the difficulty of algorithm deployment: only the configuration file needs to be modified when deploying. The image preprocessing module reads an image to be detected of arbitrary resolution, converts it to [0, 1] by dividing each pixel by 255, scales it to the size of the algorithm's first-layer input feature map while preserving the original aspect ratio, and stores it in the off-chip memory.
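One possible shape for the Network and Layer drive structures, sketched as Python dataclasses; all field names are illustrative assumptions, since the patent lists the members but not their identifiers (the real implementation would be C structs on the ARM side):

```python
# Hypothetical sketch of the drive structures; field names are our own.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Layer:
    op_type: str                      # conv / pool / upsample / reorder / shortcut / route
    in_shape: Tuple[int, int, int]    # input feature-map (W, H, C)
    out_shape: Tuple[int, int, int]   # output feature-map (W, H, C)
    kernel: int = 0                   # convolution kernel size
    stride: int = 0
    activation: str = "linear"
    q_in: int = 0                     # input feature-map quantization Q value
    q_out: int = 0                    # output feature-map quantization Q value
    weight_offset: int = 0
    groups: int = 1                   # routing-layer group count
    input_ids: List[int] = field(default_factory=list)
    shortcut_id: int = -1

@dataclass
class Network:
    layers: List[Layer] = field(default_factory=list)

    @property
    def n_layers(self) -> int:        # the "number of network layers" member
        return len(self.layers)
```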
The model parameter preprocessing module reads the algorithm's weight and bias data and quantizes them dynamically, converting 32-bit floating-point numbers into 16-bit fixed-point numbers by Y_fixed = Σ_{i=0}^{BW−1} B_i · 2^i · 2^{−Q}, B_i ∈ {0, 1}, where BW is the bit width of the quantized fixed-point number and Q is the fractional length, which determines the representable range and precision of the BW-bit fixed-point number. The Q value is computed as Q = MIN{ Σ |Y_float32 − Y'_float32(Y_fixed(Y_float32, Q))| }, where Y_float32 is the floating-point value to be quantized and Y'_float32 is its value after quantization to fixed point and conversion back to 32-bit floating point; the optimal Q for each layer's weight and bias quantization is obtained by minimizing the accumulated error, which reduces data overflow. At the same time the normalization layer is fused into the convolution layer by Weight_new = γ · Weight / √(δ² + ε) and Bias_new = γ · (Bias − μ) / √(δ² + ε) + β, where Weight_new and Bias_new are the fused weight and bias, Weight and Bias the pre-fusion values, γ and β the scale and shift factors learned during training, μ and δ² the per-batch mean and variance from training, and ε a very small constant that keeps the denominator from being 0. The arrangement of the weight data in off-chip memory is converted from {K_W, K_H, N, M} to {N, M, K_W, K_H}, K_W, K_H, N and M being the width, height, channel count and number of the convolution kernels respectively; this reduces FPGA resource consumption, and the changed memory layout increases the AXI burst length and improves bandwidth utilization.
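The quantization and fusion formulas above can be exercised with a small sketch (illustrative Python; `to_fixed` uses round-to-nearest and saturation, details the patent does not specify):

```python
import math

# Hedged sketch of the dynamic quantization and BN fusion; rounding mode
# and saturation are our own assumptions.
def to_fixed(y, q, bw=16):
    """Quantize a float to a BW-bit two's-complement integer with Q
    fractional bits, saturating at the representable range."""
    scaled = int(round(y * (1 << q)))
    lo, hi = -(1 << (bw - 1)), (1 << (bw - 1)) - 1
    return max(lo, min(hi, scaled))

def to_float(y_fixed, q):
    """Convert the fixed-point integer back to a float."""
    return y_fixed / (1 << q)

def best_q(values, bw=16):
    """Pick Q minimizing the summed absolute reconstruction error,
    mirroring the patent's Q = MIN{ sum |Y - Y'| } formula."""
    return min(range(bw),
               key=lambda q: sum(abs(v - to_float(to_fixed(v, q, bw), q))
                                 for v in values))

def fuse_bn(weight, bias, gamma, beta, mu, var, eps=1e-5):
    """Fold a batch-normalization layer into the preceding convolution:
    W' = gamma * W / sqrt(var + eps); B' = gamma * (B - mu) / sqrt(var + eps) + beta."""
    s = gamma / math.sqrt(var + eps)
    return weight * s, (bias - mu) * s + beta
```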
The data segment address allocation module generates the drive values for the input/output feature-map data-segment addresses. When the FPGA subsystem is driven, a routing layer neither enters the FPGA subsystem nor requires an allocated memory region for splicing or slicing data segments; instead it is accounted for directly in this module's address management when each layer's output is stored off-chip. If the routing layer splits the output of layer A into two equal groups along the channel dimension and takes the second group as the input of layer C, the output feature map of layer A containing X_A elements, with output-segment first address OutptrA_start and last address OutptrA_end = OutptrA_start + X_A × 2, then the routing layer is skipped by giving the C-layer input the first address InptrC_start = OutptrA_end ÷ 2 and last address InptrC_end = OutptrA_end. If the routing layer concatenates the output feature maps of layers D and E along the channel dimension as the input of layer G, layer D outputting X_D elements with output-segment first address OutptrD_start and last address OutptrD_end = OutptrD_start + X_D × 2, and layer E outputting X_E elements, then the E-layer output segment must start at OutptrE_start = OutptrD_end + 2 and end at OutptrE_end = OutptrE_start + X_E × 2, and the routing layer is skipped by giving the G-layer input the first address InptrG_start = OutptrD_start and last address InptrG_end = OutptrE_end. The memory copies and complex logic of routing layers are thereby eliminated purely through storage-address management. The FPGA accelerator driving module then drives the FPGA subsystem in a loop over the algorithm's layer IDs to execute the forward-inference process. The drive signal consists of the input/output feature-map data-segment addresses, the initialized input/output feature-map size and channel count, convolution kernel size, stride, activation-function type, weight offset, bias, operator type, input/output feature-map quantization Q values and feature-map blocking coefficients; the loop count equals the number of layers in the algorithm, at which point the FPGA subsystem is started.
As shown in Fig. 3, after the FPGA subsystem starts, the control module driven over the AXI-Lite interface first determines the type of the current layer, the data routing module generates the address offsets for reading the model data and the input feature-map data, and the input/output buffer module reads the input block data from off-chip memory over the AXI interface, splits each 128-bit input word into eight 16-bit values, and stores them in the weight buffer and the input feature-map buffer, improving bandwidth utilization. The YOLO operator module provides convolution, pooling, upsampling, reordering and shortcut operators, the corresponding operator being selected for computation by the operator-type signal of the control module. Convolution, as the operator with the largest computation load, is computed in parallel along 3 dimensions: the input feature-map channels, the output feature-map channels and the output feature-map columns. The 3-dimensional parallelism {Pif, Pof, Pox} is selected from the common factors of the corresponding dimensions of all convolution-layer feature maps as {Pif, Pof, Pox} = {cd(In_N_1, …, In_N_n), cd(Out_M_1, …, Out_M_n), cd(Out_W_1, …, Out_W_n)}, where cd denotes a common factor of a set of integers, In_N_i is the number of input feature-map channels of layer i, Out_M_i the number of output feature-map channels of layer i, Out_W_i the output feature-map width of layer i, and n the number of convolution layers, subject to DSP_num < DSP_device, where DSP_num = Pif × Pof × Pox × Poy × Pkx × Pky is the number of DSP resources the convolution operator consumes, DSP_device the number of DSP resources on the FPGA chip, and Poy, Pkx, Pky the parallelism along the output feature-map rows, convolution kernel width and convolution kernel height respectively; Poy = Pkx = Pky = 1 guarantees the parallelism while improving hardware efficiency.
The convolution operator circuit is a multiply-add tree structure, shown in Fig. 4, consisting of Pif × Pof × Pox multipliers, Pof × Pox adder trees of depth log2(Pif), and Pof × Pox multiplexers and registers. During computation, in the first multiply-accumulate cycle of an output feature point the MUX selects the Bias to add to the current cycle's multiply-add result; otherwise the previous cycle's multiply-add result is accumulated with the current cycle's. After the final output feature-map result is obtained, the control module directs the data routing module to generate the output address offsets, the input/output buffer module splices eight 16-bit output values into one complete 128-bit output word, and the data in the output buffer is stored to off-chip memory. After the forward-inference computation finishes, the post-processing module in the ARM subsystem decodes the output to obtain detection boxes, applies non-maximum suppression to obtain the best boxes, and saves the detection image.
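A behavioural model of one accumulation cycle of the multiply-add tree (illustrative Python, not the RTL; one lane of the Pof × Pox parallel lanes is shown):

```python
# Sketch of one multiply-accumulate cycle for a single output feature point:
# Pif products are summed by the adder tree; on the first cycle the MUX
# injects the Bias, on later cycles it injects the running partial sum.
def mac_cycle(inputs, weights, partial, bias, first_cycle):
    """inputs/weights: the Pif parallel operands for this cycle;
    partial: previous cycle's multiply-add result; returns the new sum."""
    prod = sum(a * w for a, w in zip(inputs, weights))  # adder tree, depth log2(Pif)
    return prod + (bias if first_cycle else partial)    # the MUX choice
```

Iterating `mac_cycle` over the remaining input-channel and kernel positions accumulates the full convolution result for that output point.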
Claims (5)
1. A rapidly deployable, general-purpose hardware accelerator system platform for the YOLO algorithm, characterized in that the system platform consists of an ARM subsystem, an FPGA subsystem and an off-chip memory, wherein the ARM subsystem comprises a parameter initialization module, an image preprocessing module, a model parameter preprocessing module, a data segment address allocation module, an FPGA accelerator driving module and an image post-processing module; after the system starts, the parameter initialization module loads the algorithm configuration file and initializes the drive structure variables, the configuration file comprising the YOLO algorithm version and, for each layer of the model, the operator type, convolution kernel size and number, stride, activation-function type, input/output feature-map quantization Q values, layer ID, routing-layer group count, input-group IDs and shortcut-layer ID; the drive structure types are Network and Layer, where Network comprises a Layer structure array and a member for the number of network layers, and Layer comprises members for the input/output feature-map size and channel count, convolution kernel size, operator type, stride, activation-function type, input/output feature-map quantization Q values, weight offset, bias, feature-map blocking coefficients, group count, input-group IDs and shortcut-layer ID; the image preprocessing module reads an image to be detected of arbitrary resolution, converts it to [0, 1] by dividing each pixel by 255, scales it to the size of the algorithm's first-layer input feature map while preserving the original aspect ratio, and stores it in the off-chip memory; the model parameter preprocessing module loads the weight and bias data into the off-chip memory, converts them from 32-bit floating point to 16-bit fixed point, fuses the normalization layer into the convolution layer, and changes the arrangement of the weight data in memory from {K_W, K_H, N, M} to {N, M, K_W, K_H}, K_W, K_H, N and M being the width, height, channel count and number of the convolution kernels respectively; the data segment address allocation module generates the drive values for the input/output feature-map data-segment addresses: when the FPGA subsystem is driven, a routing layer neither enters the FPGA subsystem nor requires an allocated memory region for splicing or slicing data segments, being instead accounted for directly in this module's address management when each layer's output is stored off-chip; if the routing layer splits the output of layer A into two equal groups along the channel dimension and takes the second group as the input of layer C, the output feature map of layer A containing X_A elements with output-segment first address OutptrA_start and last address OutptrA_end = OutptrA_start + X_A × 2, the routing layer is skipped by giving the C-layer input the first address InptrC_start = OutptrA_end ÷ 2 and last address InptrC_end = OutptrA_end; if the routing layer concatenates the output feature maps of layers D and E along the channel dimension as the input of layer G, layer D outputting X_D elements with output-segment first address OutptrD_start and last address OutptrD_end = OutptrD_start + X_D × 2, and layer E outputting X_E elements, the E-layer output segment must start at OutptrE_start = OutptrD_end + 2 and end at OutptrE_end = OutptrE_start + X_E × 2, and the routing layer is skipped by giving the G-layer input the first address InptrG_start = OutptrD_start and last address InptrG_end = OutptrE_end, so that the memory copies and complex logic of routing layers are eliminated in the storage-address management; the FPGA accelerator driving module drives the FPGA subsystem in a loop over the algorithm's layer IDs to execute the forward-inference process; the FPGA subsystem comprises a controller module, a data routing module, an input/output buffer module and a YOLO operator module; once driven, the FPGA subsystem is controlled by the controller module, and the data routing module generates the address offsets for reading the model data and the input feature-map data; after the input/output buffer module reads the input block data from off-chip memory, it splits each 128-bit input word into eight 16-bit values and stores them in the weight buffer and the input feature-map buffer; the YOLO operator module provides convolution, pooling, upsampling, reordering and shortcut operators, selected for computation by the operator-type signal of the control module; convolution, as the operator with the largest computation load, is computed in parallel along 3 dimensions, namely the input feature-map channels, the output feature-map channels and the output feature-map columns, the 3-dimensional parallelism {Pif, Pof, Pox} being selected from the common factors of the corresponding dimensions of all convolution-layer feature maps as {Pif, Pof, Pox} = {cd(In_N_1, …, In_N_n), cd(Out_M_1, …, Out_M_n), cd(Out_W_1, …, Out_W_n)}, where cd denotes a common factor of a set of integers, In_N_i is the number of input feature-map channels of layer i, Out_M_i the number of output feature-map channels of layer i, Out_W_i the output feature-map width of layer i, and n the number of convolution layers, subject to DSP_num < DSP_device, where DSP_num = Pif × Pof × Pox × Poy × Pkx × Pky is the number of DSP resources the convolution operator consumes, DSP_device is the number of DSP resources on the FPGA chip, and Poy, Pkx, Pky are the parallelism along the output feature-map rows, convolution kernel width and convolution kernel height respectively, with Poy = Pkx = Pky = 1; after the final output feature-map data is obtained, the control module directs the data routing module to generate the output address offsets, the input/output buffer module splices eight 16-bit output values into complete output data for the 128-bit data interface, and the data in the output buffer is stored to the off-chip memory; after the FPGA accelerator driving module finishes executing, the post-processing module processes the detection result to obtain the best detection boxes and saves the detection image.
2. The YOLO algorithm-oriented rapidly deployable general hardware accelerator system platform as claimed in claim 1, wherein the model parameter preprocessing module converts the weight and bias data from 32-bit floating point numbers to 16-bit fixed point numbers; the conversion formula is Y_fixed = round(Y_float32 × 2^Q), where BW is the bit width of the fixed point number after quantization and Q is the fractional length, which together determine the representable range and precision of the BW-bit fixed point number; Q is obtained as Q = argmin_Q Σ |Y_float32 − Y'_float32(Y_fixed(Y_float32, Q))|, where Y_float32 is the floating point number to be quantized and Y'_float32 is the result of quantizing to a fixed point number and converting back to a 32-bit floating point number; the optimal Q value for quantizing each layer's weight and bias data is the one that minimizes the accumulated error.
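The per-layer Q search of claim 2 can be sketched as follows. This is an illustrative model, not the patent's implementation; the round-and-saturate scheme is an assumption (the claim fixes only BW and Q), and the sample weights are hypothetical:

```python
import numpy as np

BW = 16  # fixed-point bit width after quantization

def to_fixed(y, q):
    # Quantize float32 to a BW-bit fixed-point integer with Q fractional
    # bits; rounding and saturation behavior is assumed, not claimed.
    lo, hi = -2 ** (BW - 1), 2 ** (BW - 1) - 1
    return np.clip(np.round(y * 2 ** q), lo, hi).astype(np.int32)

def to_float(y_fixed, q):
    # Convert the fixed-point representation back to float32.
    return y_fixed.astype(np.float32) / 2 ** q

def best_q(weights):
    """Search the fractional length Q that minimizes the accumulated
    reconstruction error sum |Y - Y'|, as described in claim 2."""
    errors = {q: float(np.abs(weights - to_float(to_fixed(weights, q), q)).sum())
              for q in range(BW)}
    return min(errors, key=errors.get)

# Hypothetical layer weights, chosen to be exactly representable at Q = 2.
w = np.array([0.5, -0.25, 1.75, 3.0], dtype=np.float32)
q = best_q(w)
```

A larger Q gives finer precision but a smaller representable range, so the minimum-accumulated-error criterion balances rounding error against saturation error per layer.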
3. The YOLO algorithm-oriented rapidly deployable general hardware accelerator system platform as claimed in claim 1, wherein the drive signal of the FPGA accelerator driver module consists of the input/output feature map storage data segment addresses, the initialized input/output feature map sizes and channel numbers, the convolution kernel size, the stride, the activation function type, the weight offset, the bias, the operator type, the input/output feature map quantization Q values and the feature map blocking coefficient.
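The per-layer drive signal of claim 3 amounts to a fixed record of fields. A hypothetical sketch of that record as a Python dataclass follows; every field name here is invented for illustration and does not come from the patent:

```python
from dataclasses import dataclass

@dataclass
class LayerDriveSignal:
    # One command record per layer; field names are illustrative only.
    in_addr: int        # input feature map storage data segment address
    out_addr: int       # output feature map storage data segment address
    in_size: tuple      # (height, width) of the input feature map
    in_channels: int
    out_size: tuple
    out_channels: int
    kernel_size: int
    stride: int
    activation: str     # activation function type
    weight_offset: int
    bias_offset: int
    op_type: str        # e.g. "conv" | "pool" | "upsample" | "reorg" | "shortcut"
    q_in: int           # input feature map quantization Q value
    q_out: int          # output feature map quantization Q value
    tile_coeff: int     # feature map blocking (tiling) coefficient

# Example record for a hypothetical first convolutional layer.
sig = LayerDriveSignal(
    in_addr=0, out_addr=4096, in_size=(416, 416), in_channels=3,
    out_size=(208, 208), out_channels=16, kernel_size=3, stride=2,
    activation="leaky_relu", weight_offset=0, bias_offset=1024,
    op_type="conv", q_in=8, q_out=7, tile_coeff=4)
```

The driver module would emit one such record per layer ID while cyclically driving the FPGA subsystem.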
4. The YOLO algorithm-oriented rapidly deployable general hardware accelerator system platform as claimed in claim 1, wherein the convolution operator of the YOLO operator module is a multiply-add tree structure consisting of Pif × Pof × Pox multipliers, Pof × Pox addition trees with a depth of log2(Pif), and Pof × Pox multiplexers and registers; during calculation, if the current cycle is the first multiply-accumulate cycle of an output feature point, the MUX selects the Bias to be accumulated with the current cycle's multiply-add result; otherwise, the previous cycle's partial sum is accumulated with the current cycle's multiply-add result.
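The MUX behavior of claim 4 can be modeled for a single output feature point. This is a behavioral sketch, not the hardware: Pif products are summed per cycle by the adder tree, and the accumulator input is the bias on the first cycle and the previous partial sum afterwards. The input values are hypothetical:

```python
def conv_point_mac(inputs, weights, bias, pif):
    """Behavioral model of one output point on the multiply-add tree:
    each cycle sums Pif products; on the first cycle the MUX injects
    the bias, afterwards it feeds back the running partial sum."""
    acc = 0
    n_cycles = len(inputs) // pif
    for cycle in range(n_cycles):
        seg = slice(cycle * pif, (cycle + 1) * pif)
        # One pass through the depth-log2(Pif) adder tree.
        madd = sum(x * w for x, w in zip(inputs[seg], weights[seg]))
        # MUX: bias on the first cycle, previous partial sum otherwise.
        acc = (bias if cycle == 0 else acc) + madd
    return acc

# Hypothetical data: 8 input/weight pairs consumed 4 (Pif) at a time.
y = conv_point_mac(inputs=[1, 2, 3, 4, 5, 6, 7, 8],
                   weights=[1, 1, 1, 1, 2, 2, 2, 2], bias=10, pif=4)
```

Folding the bias into the first accumulation cycle avoids a separate bias-addition stage after the multiply-add tree.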
5. The YOLO algorithm-oriented rapidly deployable general hardware accelerator system platform as claimed in claim 1, wherein after the FPGA accelerator driver module finishes executing, the post-processing module decodes the calculation results to obtain detection boxes, performs non-maximum suppression to obtain the optimal detection boxes, and stores the detection image.
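The non-maximum suppression step of claim 5 can be sketched with the standard greedy algorithm. This is an illustrative implementation, not the patent's; the IoU threshold and the test boxes are assumptions:

```python
def iou(a, b):
    # Boxes are (x1, y1, x2, y2) corner coordinates.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop boxes overlapping it above the threshold, and repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep

# Hypothetical decoded boxes: the first two overlap heavily, the third is far away.
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
keep = nms(boxes, scores=[0.9, 0.8, 0.7])
```

The indices returned in `keep` identify the optimal detection boxes that survive suppression.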
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210056834.9A CN114662681B (en) | 2022-01-19 | 2022-01-19 | YOLO algorithm-oriented general hardware accelerator system platform capable of being rapidly deployed |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114662681A true CN114662681A (en) | 2022-06-24 |
CN114662681B CN114662681B (en) | 2024-05-28 |
Family
ID=82025644
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210056834.9A Active CN114662681B (en) | 2022-01-19 | 2022-01-19 | YOLO algorithm-oriented general hardware accelerator system platform capable of being rapidly deployed |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114662681B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115482421A (en) * | 2022-11-15 | 2022-12-16 | 苏州万店掌软件技术有限公司 | Target detection method, device, equipment and medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111414994A (en) * | 2020-03-03 | 2020-07-14 | 哈尔滨工业大学 | FPGA-based Yolov3 network computing acceleration system and acceleration method thereof |
CN111459877A (en) * | 2020-04-02 | 2020-07-28 | 北京工商大学 | FPGA (field programmable Gate array) acceleration-based Winograd YO L Ov2 target detection model method |
CN113051216A (en) * | 2021-04-22 | 2021-06-29 | 南京工业大学 | MobileNet-SSD target detection device and method based on FPGA acceleration |
CN113705803A (en) * | 2021-08-31 | 2021-11-26 | 南京大学 | Image hardware identification system based on convolutional neural network and deployment method |
CN113792621A (en) * | 2021-08-27 | 2021-12-14 | 杭州电子科技大学 | Target detection accelerator design method based on FPGA |
Non-Patent Citations (1)
Title |
---|
ANNE K. MADSEN等: "An Optimized FPGA-Based Hardware Accelerator for Physics-Based EKF for Battery Cell Management", 《IEEE》, 28 September 2020 (2020-09-28), pages 2158 - 1525 * |
Also Published As
Publication number | Publication date |
---|---|
CN114662681B (en) | 2024-05-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11907830B2 (en) | Neural network architecture using control logic determining convolution operation sequence | |
CN109063825B (en) | Convolutional neural network accelerator | |
Pestana et al. | A full featured configurable accelerator for object detection with YOLO | |
CN111414994B (en) | FPGA-based Yolov3 network computing acceleration system and acceleration method thereof | |
CN113792621B (en) | FPGA-based target detection accelerator design method | |
CN114118347A (en) | Fine-grained per-vector scaling for neural network quantization | |
CN109993293B (en) | Deep learning accelerator suitable for heap hourglass network | |
GB2568102A (en) | Exploiting sparsity in a neural network | |
CN113313247B (en) | Operation method of sparse neural network based on data flow architecture | |
CN114970803A (en) | Machine learning training in a logarithmic system | |
TW202138999A (en) | Data dividing method and processor for convolution operation | |
CN114662681A (en) | YOLO algorithm-oriented general hardware accelerator system platform capable of being deployed rapidly | |
CN111610963B (en) | Chip structure and multiply-add calculation engine thereof | |
JP7410961B2 (en) | arithmetic processing unit | |
CN114651249A (en) | Techniques to minimize the negative impact of cache conflicts caused by incompatible dominant dimensions in matrix multiplication and convolution kernels without dimension filling | |
CN115577747A (en) | High-parallelism heterogeneous convolutional neural network accelerator and acceleration method | |
CN115170381A (en) | Visual SLAM acceleration system and method based on deep learning | |
US20240045592A1 (en) | Computational storage device, storage system including the same and operation method therefor | |
US11442643B2 (en) | System and method for efficiently converting low-locality data into high-locality data | |
US20230252600A1 (en) | Image size adjustment structure, adjustment method, and image scaling method and device based on streaming architecture | |
US20230334289A1 (en) | Deep neural network accelerator with memory having two-level topology | |
US20210209462A1 (en) | Method and system for processing a neural network | |
CN116363480A (en) | Computing device and method for image pixel processing network | |
CN115423083A (en) | Neural network accelerator with double scheduling modes | |
CN114996646A (en) | Operation method, device, medium and electronic equipment based on lookup table |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |