CN114662681A - Rapidly deployable general-purpose hardware accelerator system platform for the YOLO algorithm

Rapidly deployable general-purpose hardware accelerator system platform for the YOLO algorithm

Info

Publication number
CN114662681A
CN114662681A
Authority
CN
China
Prior art keywords
layer
data
module
output
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210056834.9A
Other languages
Chinese (zh)
Other versions
CN114662681B (en)
Inventor
Xie Xuesong (谢雪松)
Wang Minghao (王明浩)
Zhang Xiaoling (张小玲)
Zhang Liang (张亮)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN202210056834.9A
Publication of CN114662681A
Application granted
Publication of CN114662681B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Logic Circuits (AREA)
  • Complex Calculations (AREA)

Abstract

A rapidly deployable general-purpose hardware accelerator system platform for the YOLO algorithm belongs to the field of computer technology. It addresses the YOLO object detection algorithm's deployment requirements of rapid deployment, high performance, and low power consumption, and has broad application scenarios. The platform consists of an ARM subsystem, an FPGA subsystem, and off-chip memory. The ARM subsystem is responsible for parameter initialization, image preprocessing, model parameter preprocessing, data segment address allocation, FPGA accelerator driving, and image post-processing; the FPGA subsystem is responsible for the computation-intensive parts of the YOLO algorithm. After the platform starts, it reads a YOLO algorithm configuration file to initialize the accelerator driver parameters; after preprocessing the image to be detected, it reads the model weight and bias data for quantization, fusion, and reordering, drives the FPGA subsystem to perform the model computation, and post-processes the computation result to obtain the object detection image. The invention enables rapid deployment of the YOLO algorithm while maintaining high performance and low power consumption.

Description

Rapidly deployable general-purpose hardware accelerator system platform for the YOLO algorithm
Technical Field
The invention belongs to the field of computer technology and specifically relates to a rapidly deployable general-purpose hardware accelerator system platform for the YOLO algorithm.
Background
With the growth of computing power and the development of big data, convolutional neural networks (CNNs) have become mainstream in the field of computer vision. Object detection is an important CNN-based branch of computer vision, and the YOLO (You Only Look Once) family of algorithms in particular has shown excellent speed and accuracy, with wide application in fields such as robot vision, security monitoring, autonomous driving, and virtual reality.
The YOLO algorithm continues to evolve toward denser computation, larger data volumes, and more complex structures, while its versions iterate rapidly. This keeps raising the difficulty of deploying the algorithm, while practical application scenarios such as unmanned detection place ever higher demands on low latency, low power consumption, and rapid deployment.
Existing deployment schemes are generally based on CPU, GPU, FPGA, or ASIC platforms. General-purpose CPUs cannot meet the high-performance requirement, GPUs suffer from high inference latency and large power consumption, and ASICs have high development costs, so FPGAs, with their programmability, high parallelism, and low power consumption, have attracted wide attention. However, as the YOLO algorithm grows ever more complex and its versions iterate ever faster, current FPGA accelerators for convolutional neural networks face prominent problems such as high deployment difficulty and long development cycles.
Disclosure of Invention
To address the deficiencies of the prior art, the invention provides the design of a rapidly deployable general-purpose hardware accelerator system platform for the YOLO algorithm. It meets the requirements of low latency and low power consumption while enabling rapid deployment of the YOLO algorithm. The platform consists of an ARM subsystem, an FPGA subsystem, and off-chip memory. The ARM subsystem is mainly responsible for logic control and small-scale data processing and consists of a parameter initialization module, an image preprocessing module, a model parameter preprocessing module, a data segment address allocation module, an FPGA accelerator driver module, and an image post-processing module. The FPGA subsystem is mainly responsible for the computation-intensive parts of the YOLO algorithm and consists of an input/output buffer module, a controller module, a data routing module, and a YOLO operator module. The off-chip memory is mainly responsible for large-scale data storage. The software and hardware modules are briefly described below.
The parameter initialization module of the ARM subsystem reads the algorithm configuration file specified at platform startup to obtain the YOLO algorithm version, structure, and per-layer parameter information, and completes the initialization of the driver parameter structure variables; the module supports the YOLOv1, YOLOv2, YOLOv2-tiny, YOLOv3, YOLOv3-tiny, YOLOv4, and YOLOv4-tiny algorithms. The image preprocessing module reads an image to be detected at any resolution, normalizes each pixel to [0, 1] by dividing it by 255, scales the image to the size of the deployed algorithm's first-layer input feature map while preserving the original aspect ratio, and stores it in off-chip memory. The model parameter preprocessing module reads the algorithm's weight and bias data from off-chip memory, performs dynamic quantization that converts 32-bit floating-point numbers to 16-bit fixed-point numbers, fuses the normalization layers into the convolution layers, and rearranges the weight data in off-chip memory from {K_W, K_H, N, M} order to {N, M, K_W, K_H} order, where K_W, K_H, N, and M are the width, height, channel count, and number of the convolution kernels, respectively. When the FPGA subsystem is driven, if the current layer is a routing layer, the data segments neither enter the FPGA subsystem nor require a dedicated memory region for splicing or slicing; instead, through this module's address management, the routing layer is already accounted for when each layer's output data is stored off-chip. If a routing layer splits the output of layer A into two equal groups along the channel direction and takes the second group as the input of layer C, and the layer A output feature map contains X_A data elements with output data segment start address OutptrA_start and end address OutptrA_end = OutptrA_start + X_A × 2, then initializing the layer C input data with start address InptrC_start = OutptrA_end ÷ 2 and end address InptrC_end = OutptrA_end allows the routing layer to be skipped. If a routing layer concatenates the output feature map data of layers D and E along the channel dimension as the input of layer G, with layer D producing X_D output elements stored from start address OutptrD_start to end address OutptrD_end = OutptrD_start + X_D × 2, and layer E producing X_E output elements, then the layer E output data segment must start at OutptrE_start = OutptrD_end + 2 and end at OutptrE_end = OutptrE_start + X_E × 2, and initializing the layer G input data with start address InptrG_start = OutptrD_start and end address InptrG_end = OutptrE_end allows the routing layer to be skipped. In this way, the memory copies and complex logic of routing layers are eliminated purely through storage address management. The FPGA accelerator driver module cyclically drives the FPGA subsystem according to the YOLO algorithm's layer ID numbers and executes the algorithm's forward inference process.
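For illustration, here is a minimal Python sketch of this address bookkeeping, assuming 16-bit (2-byte) elements, byte addressing, and the address formulas exactly as given above; the Segment type and the function names are hypothetical, not part of the patent:

```python
# Hypothetical model of the routing-layer address allocation described above.
from dataclasses import dataclass

@dataclass
class Segment:
    start: int  # first address of the data segment
    end: int    # last address of the data segment

def output_segment(start: int, num_elements: int) -> Segment:
    """A layer's output feature map: num_elements 16-bit (2-byte) values."""
    return Segment(start, start + num_elements * 2)

def split_input(out_a: Segment) -> Segment:
    """Routing layer that halves layer A's output along the channel direction
    and feeds the second group to layer C: C reads the second half of A's
    segment in place (InptrC_start = OutptrA_end / 2, as in the text, which
    assumes A's segment is allocated from address 0)."""
    return Segment(out_a.end // 2, out_a.end)

def concat_input(out_d: Segment, x_e: int) -> tuple[Segment, Segment]:
    """Routing layer that concatenates layers D and E for layer G: E's output
    is placed directly after D's (OutptrE_start = OutptrD_end + 2), so G's
    input is the contiguous span [OutptrD_start, OutptrE_end]."""
    out_e = Segment(out_d.end + 2, out_d.end + 2 + x_e * 2)
    in_g = Segment(out_d.start, out_e.end)
    return out_e, in_g

# Example: A outputs 1024 elements from address 0; D outputs 512 from 8192.
in_c = split_input(output_segment(0, 1024))           # C reads A's second half
out_e, in_g = concat_input(output_segment(8192, 512), x_e=512)
```

The point of the scheme is visible in the sketch: both routing patterns reduce to pointer arithmetic over segments that already exist, so no data is ever moved.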
After the FPGA subsystem is driven, it is controlled by the controller module, and the data routing module generates the address offsets for reading the model data and input feature map data. The input/output buffer module splits input data arriving on the 128-bit data interface into eight 16-bit values, stores the input feature map, weight, and bias block data, and reduces the number of interactions with off-chip memory by reusing on-chip data. The YOLO operator module provides convolution, pooling, upsampling, reordering, and shortcut operators. Convolution, the operator with the largest computation load, is computed in parallel along 3 dimensions: the input feature map channels, the output feature map channels, and the output feature map column direction. The 3-dimensional parallelism {Pif, Pof, Pox} is selected from the common factors of these 3 dimensions across all convolution layer feature maps, obtained as

{Pif, Pof, Pox} = {cd(In_N_1, ..., In_N_n), cd(Out_M_1, ..., Out_M_n), cd(Out_W_1, ..., Out_W_n)},

where cd denotes a common factor of a set of integers, In_N_i is the number of input feature map channels of the i-th layer, Out_M_i is the number of output feature map channels of the i-th layer, Out_W_i is the width of the output feature map of the i-th layer, and n is the number of convolution layers. The selection must satisfy DSP_num < DSP_device, where DSP_num = Pif × Pof × Pox × Poy × Pkx × Pky is the number of DSP resources consumed by the convolution operator and DSP_device is the number of DSP resources on the FPGA chip; Poy, Pkx, and Pky are the parallelism degrees along the output feature map row direction, the convolution kernel width, and the convolution kernel height, respectively, with Poy = Pkx = Pky = 1, which guarantees parallelism while improving hardware efficiency. The operator type signal from the controller module selects the corresponding operator in the YOLO operator module for computation. After the final output feature map data is obtained, under the controller module's control, the data routing module generates the output address offsets, the input/output buffer module splices eight 16-bit output values into one complete output word for the 128-bit data interface, and the data in the output buffer is stored to off-chip memory. After the FPGA accelerator driver module finishes execution, the post-processing module decodes the inference result and applies non-maximum suppression to obtain the optimal detection boxes and save the detection image.
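A minimal sketch of this selection rule, assuming an exhaustive search over the common factors with the DSP constraint as stated (the function names and example dimensions are illustrative):

```python
# Hypothetical search for {Pif, Pof, Pox}: enumerate common factors of the
# per-layer dimensions and keep the combination that uses the most DSPs
# while staying under the device budget.
from math import gcd
from functools import reduce
from itertools import product

def common_factors(values: list[int]) -> list[int]:
    """All common factors of a set of layer dimensions (divisors of the gcd)."""
    g = reduce(gcd, values)
    return [d for d in range(1, g + 1) if g % d == 0]

def select_parallelism(in_n, out_m, out_w, dsp_device, poy=1, pkx=1, pky=1):
    """DSP_num = Pif*Pof*Pox*Poy*Pkx*Pky must stay below dsp_device;
    Poy = Pkx = Pky = 1 as in the text."""
    best, best_dsp = (1, 1, 1), 0
    for pif, pof, pox in product(common_factors(in_n),
                                 common_factors(out_m),
                                 common_factors(out_w)):
        dsp_num = pif * pof * pox * poy * pkx * pky
        if dsp_num < dsp_device and dsp_num > best_dsp:
            best, best_dsp = (pif, pof, pox), dsp_num
    return best

# Illustrative numbers: three convolution layers and a 900-DSP device.
in_n  = [16, 32, 64]    # In_N_i: input channels per layer
out_m = [32, 64, 128]   # Out_M_i: output channels per layer
out_w = [208, 104, 52]  # Out_W_i: output feature map width per layer
print(select_parallelism(in_n, out_m, out_w, dsp_device=900))
```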
The invention enables high-performance, low-power model inference; eliminates the cumbersome step of processing the YOLO model data on a host computer before copying it into the accelerator system; is compatible with multiple YOLO algorithm versions; and enables rapid deployment of the YOLO algorithm, effectively solving the problems of high deployment difficulty and long development cycles caused by the YOLO algorithm's high complexity and fast version iteration.
Drawings
FIG. 1 is a schematic diagram of a system platform.
FIG. 2 is a schematic diagram of a configuration file format.
FIG. 3 is a schematic diagram of an FPGA subsystem.
FIG. 4 is a schematic diagram of a parallel circuit for convolution operations.
Detailed Description
The present invention will be described in further detail below with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of the system platform of the present invention, which mainly comprises an ARM subsystem, an FPGA subsystem, and off-chip memory. When the system starts, the parameter initialization module of the ARM subsystem reads the algorithm configuration file and completes the initialization of the driver parameter structure variables. The configuration file is written in a fixed format, as shown in FIG. 2, and contains the YOLO algorithm version and, for each layer of the model, the operator type, convolution kernel size and count, stride, activation function type, input/output feature map quantization Q values, layer ID, routing layer group count, input group IDs, and shortcut layer ID. The driver structure types are Network and Layer: Network contains a Layer structure array and a network layer count member, while Layer contains structure members for the network input/output feature map size and channel count, convolution kernel size, operator type, stride, activation function type, input/output feature map quantization Q values, weight offset, bias offset, feature map blocking coefficients, group count, input group IDs, and shortcut layer ID. This reduces the difficulty of algorithm deployment, since only the configuration file needs to be modified when deploying. The image preprocessing module reads an image to be detected at any resolution, normalizes each pixel to [0, 1] by dividing it by 255, scales the image to the size of the deployed algorithm's first-layer input feature map while preserving the original aspect ratio, and stores it in off-chip memory. The model parameter preprocessing module reads the algorithm's weight and bias data and performs dynamic quantization on them, converting 32-bit floating-point numbers to 16-bit fixed-point numbers with the conversion formula
Y_fixed(Y_float32, Q) = Σ_{i=0}^{BW-1} B_i · 2^(i-Q), B_i ∈ {0, 1},

where BW is the bit width of the quantized fixed-point number and Q is the fractional length, which determines the representation range and precision of the BW-bit fixed-point number. The Q value is calculated by Q = MIN_Q { Σ | Y_float32 - Y'_float32(Y_fixed(Y_float32, Q)) | }, where Y_float32 is the floating-point number to be quantized and Y'_float32 is the result of quantizing to fixed point and converting back to a 32-bit floating-point number. The optimal Q value for each layer's weight and bias quantization is obtained by minimizing the accumulated error, which reduces the occurrence of data overflow. At the same time, the normalization layers are fused into the convolution layers, with the fusion formulas
Weight_new = γ · Weight / √(δ² + ε), Bias_new = γ · (Bias - μ) / √(δ² + ε) + β,

where Weight_new and Bias_new are the fused weight and bias data, Weight and Bias are the weight and bias data before fusion, γ and β are the scale and shift factors (which are training parameters), μ and δ² are the per-batch mean and variance from training, and ε is a very small constant that prevents the denominator from being 0. The arrangement order of the weight data in off-chip memory is changed from {K_W, K_H, N, M} to {N, M, K_W, K_H}, where K_W, K_H, N, and M are the width, height, channel count, and number of the convolution kernels, respectively; this reduces FPGA resource consumption, and the changed memory layout increases the AXI bus burst length and improves bandwidth utilization. The data segment address allocation module generates the input/output feature map data segment address driver values. When the FPGA subsystem is driven, if the current layer is a routing layer, the data segments neither enter the FPGA subsystem nor require a dedicated memory region for splicing or slicing; instead, through this module's address management, the routing layer is already accounted for when each layer's output data is stored off-chip. If a routing layer splits the output of layer A into two equal groups along the channel direction and takes the second group as the input of layer C, with the layer A output feature map containing X_A data elements, output data segment start address OutptrA_start, and end address OutptrA_end = OutptrA_start + X_A × 2, then the routing layer can be skipped by setting the layer C input data start address to InptrC_start = OutptrA_end ÷ 2 and the end address to InptrC_end = OutptrA_end. If a routing layer concatenates the output feature map data of layers D and E along the channel dimension as the input of layer G, with layer D producing X_D output elements stored from OutptrD_start to OutptrD_end = OutptrD_start + X_D × 2 and layer E producing X_E output elements, then the layer E output data segment must start at OutptrE_start = OutptrD_end + 2 and end at OutptrE_end = OutptrE_start + X_E × 2, and the routing layer can be skipped by setting the layer G input data start address to InptrG_start = OutptrD_start and the end address to InptrG_end = OutptrE_end. Thus, the memory copies and complex logic of routing layers are eliminated by the management of storage addresses. According to the algorithm's layer ID numbers, the FPGA accelerator driver module cyclically drives the FPGA subsystem to execute the algorithm's forward inference. The driver signal consists of the input/output feature map data segment addresses, the initialized input/output feature map size and channel count, convolution kernel size, stride, activation function type, weight offset, bias offset, operator type, input/output feature map quantization Q values, and feature map blocking coefficients; the loop count equals the number of algorithm layers, at which point the FPGA subsystem is started.
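The following sketch illustrates the three preprocessing steps just described (per-layer Q search, batch-normalization fusion, and weight reordering) as a NumPy approximation; the rounding mode, the Q search range, and the array layout conventions are assumptions, not specified by the patent:

```python
# Hypothetical sketch of the per-layer dynamic quantization (Q search),
# batch-norm fusion, and weight reordering described above.
import numpy as np

BW = 16  # quantized fixed-point bit width

def to_fixed(y: np.ndarray, q: int) -> np.ndarray:
    """Quantize float32 to BW-bit fixed point with fractional length q."""
    lo, hi = -2 ** (BW - 1), 2 ** (BW - 1) - 1
    return np.clip(np.round(y * 2 ** q), lo, hi).astype(np.int16)

def to_float(y_fixed: np.ndarray, q: int) -> np.ndarray:
    """Convert fixed point back to float32 (Y'_float32 in the text)."""
    return y_fixed.astype(np.float32) / 2 ** q

def best_q(y: np.ndarray) -> int:
    """Q minimizing sum |Y_float32 - Y'_float32(Y_fixed(Y_float32, Q))|."""
    errors = {q: float(np.abs(y - to_float(to_fixed(y, q), q)).sum())
              for q in range(BW)}
    return min(errors, key=errors.get)

def fuse_bn(weight, bias, gamma, beta, mu, var, eps=1e-5):
    """Fold batch norm into the preceding convolution:
    Weight_new = gamma * Weight / sqrt(var + eps)
    Bias_new   = gamma * (Bias - mu) / sqrt(var + eps) + beta
    weight uses the pre-reordering layout {K_W, K_H, N, M} from the text,
    so the BN parameters (shape (M,)) broadcast over the last axis."""
    scale = gamma / np.sqrt(var + eps)
    return weight * scale, (bias - mu) * scale + beta

# Example: K_W = K_H = 3, N = 16 input channels, M = 32 kernels.
w = np.random.randn(3, 3, 16, 32).astype(np.float32)
b = np.zeros(32, dtype=np.float32)
g, bt, mu, var = (np.random.rand(32).astype(np.float32) for _ in range(4))
w_fused, b_fused = fuse_bn(w, b, g, bt, mu, var)
q = best_q(w_fused)
w_quant = to_fixed(w_fused, q)
# Reorder {K_W, K_H, N, M} -> {N, M, K_W, K_H} before writing to memory.
w_reordered = np.transpose(w_quant, (2, 3, 0, 1))
```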
As shown in FIG. 3, after the FPGA subsystem is started, the controller module, driven over the AXI-Lite interface, first determines the type of the current layer. The data routing module generates the address offsets for reading the model data and input feature map data, and the input/output buffer module reads the input block data from off-chip memory through the AXI interface, splits the 128-bit-wide input data into eight 16-bit values, and stores them in the weight buffer and the input feature map buffer, improving bandwidth utilization. The YOLO operator module provides convolution, pooling, upsampling, reordering, and shortcut operators, and the operator type signal from the controller module selects the corresponding operator for computation. Convolution, the operator with the largest computation load, is computed in parallel along 3 dimensions: the input feature map channels, the output feature map channels, and the output feature map column direction. The 3-dimensional parallelism {Pif, Pof, Pox} is selected from the common factors of these 3 dimensions across all convolution layer feature maps, obtained as

{Pif, Pof, Pox} = {cd(In_N_1, ..., In_N_n), cd(Out_M_1, ..., Out_M_n), cd(Out_W_1, ..., Out_W_n)},

where cd denotes a common factor of a set of integers, In_N_i is the number of input feature map channels of the i-th layer, Out_M_i is the number of output feature map channels of the i-th layer, Out_W_i is the width of the output feature map of the i-th layer, and n is the number of convolution layers; the selection must satisfy DSP_num < DSP_device, where DSP_num = Pif × Pof × Pox × Poy × Pkx × Pky is the number of DSP resources consumed by the convolution operator, DSP_device is the number of DSP resources on the FPGA chip, and Poy = Pkx = Pky = 1, which guarantees parallelism while improving hardware efficiency. The convolution operator circuit is a multiply-add tree structure whose circuit diagram is shown in FIG. 4; it comprises Pif × Pof × Pox multipliers, Pof × Pox adder trees of depth log2(Pif), and Pof × Pox multiplexers and registers. During computation, if the current cycle is the first multiply-accumulate cycle of an output feature point, the MUX selects the bias to be accumulated with the current cycle's multiply-add result; otherwise, the previous cycle's multiply-add result is accumulated with the current cycle's multiply-add result. After the final output feature map result is obtained, the controller module directs the data routing module to generate the output address offsets, the input/output buffer module splices eight 16-bit output values into one complete 128-bit output word, and the data in the output buffer is stored to off-chip memory. After the forward inference computation finishes, the post-processing module in the ARM subsystem decodes the output result to obtain detection boxes, performs non-maximum suppression to obtain the optimal detection boxes, and saves the detection image.
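A cycle-level Python model may make the MUX behavior concrete; this is a behavioral sketch under the structure described above, not the patent's circuit:

```python
# Behavioral sketch of the multiply-add tree in FIG. 4: each cycle, Pif
# products are reduced by an adder tree, and a MUX seeds the accumulator
# with the bias on the first cycle of every output feature point.

def conv_point(inputs_per_cycle, weights_per_cycle, bias):
    """Accumulate one output feature point over several cycles; each cycle
    supplies Pif input activations and Pif weights."""
    acc = 0.0
    for cycle, (xs, ws) in enumerate(zip(inputs_per_cycle, weights_per_cycle)):
        products = [x * w for x, w in zip(xs, ws)]  # Pif multipliers
        tree_sum = sum(products)                    # adder tree, depth log2(Pif)
        # MUX: first cycle adds the bias, later cycles the register value.
        acc = (bias if cycle == 0 else acc) + tree_sum
    return acc

# Example: Pif = 4, one output point accumulated over 3 cycles.
xs = [[1.0, 2.0, 3.0, 4.0]] * 3
ws = [[0.5, 0.5, 0.5, 0.5]] * 3
print(conv_point(xs, ws, bias=0.1))  # 0.1 + 3 * 5.0 = 15.1
```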

Claims (5)

1. A rapidly deployable general-purpose hardware accelerator system platform for the YOLO algorithm, characterized in that the system platform is composed of an ARM subsystem, an FPGA subsystem, and an off-chip memory, wherein the ARM subsystem comprises a parameter initialization module, an image preprocessing module, a model parameter preprocessing module, a data segment address allocation module, an FPGA accelerator driver module, and an image post-processing module; after the system starts, the parameter initialization module loads an algorithm configuration file and initializes the driver structure variables, the configuration file comprising the YOLO algorithm version and, for each layer of the model, the operator type, convolution kernel size and count, stride, activation function type, input/output feature map quantization Q values, layer ID, routing layer group count, input group IDs, and shortcut layer ID; the driver structure types are Network and Layer, wherein Network comprises a Layer structure array and a network layer count member, and Layer comprises structure members for the network input/output feature map size and channel count, convolution kernel size, operator type, stride, activation function type, input/output feature map quantization Q values, weight offset, bias offset, feature map blocking coefficients, group count, input group IDs, and shortcut layer ID; the image preprocessing module reads an image to be detected at any resolution, normalizes each pixel to [0, 1] by dividing it by 255, scales the image to the size of the deployed algorithm's first-layer input feature map while preserving the original aspect ratio, and stores it in the off-chip memory; the model parameter preprocessing module loads the weight and bias data from the off-chip memory, converts them from 32-bit floating-point numbers to 16-bit fixed-point numbers, fuses the normalization layers into the convolution layers, and changes the arrangement order of the weight data in memory from {K_W, K_H, N, M} to {N, M, K_W, K_H}, where K_W, K_H, N, and M are the width, height, channel count, and number of the convolution kernels, respectively; the data segment address allocation module generates the input/output feature map data segment address driver values; when the FPGA subsystem is driven, if the current layer is a routing layer, the data segments neither enter the FPGA subsystem nor require a dedicated memory region for splicing or slicing through address allocation, the routing layer instead being accounted for when each layer's output data is stored off-chip through this module's address management: if the routing layer splits the output of layer A into two equal groups along the channel direction and takes the second group as the input of layer C, with the layer A output feature map containing X_A data elements, output data segment start address OutptrA_start, and end address OutptrA_end = OutptrA_start + X_A × 2, then the routing layer can be skipped by setting the layer C input data start address to InptrC_start = OutptrA_end ÷ 2 and the end address to InptrC_end = OutptrA_end; if the routing layer concatenates the output feature map data of layers D and E along the channel dimension as the input of layer G, with layer D producing X_D output elements stored from OutptrD_start to OutptrD_end = OutptrD_start + X_D × 2 and layer E producing X_E output elements, the layer E output data segment starting at OutptrE_start = OutptrD_end + 2 and ending at OutptrE_end = OutptrE_start + X_E × 2, then the routing layer can be skipped by setting the layer G input data start address to InptrG_start = OutptrD_start and the end address to InptrG_end = OutptrE_end, so that the memory copies and complex logic of routing layers are eliminated by the management of storage addresses; according to the algorithm's layer ID numbers, the FPGA accelerator driver module cyclically drives the FPGA subsystem to execute the algorithm's forward inference; the FPGA subsystem comprises a controller module, a data routing module, an input/output buffer module, and a YOLO operator module; after being driven, the FPGA subsystem is controlled by the controller module, and the data routing module generates the address offsets for reading the model data and input feature map data; the input/output buffer module reads the input block data from the off-chip memory, splits the 128-bit-wide input data into eight 16-bit values, and stores them in the weight buffer and the input feature map buffer; the YOLO operator module provides convolution, pooling, upsampling, reordering, and shortcut operators, the operator type signal from the controller module selecting the operator for computation; convolution, the operator with the largest computation load, is computed in parallel along 3 dimensions, namely the input feature map channels, the output feature map channels, and the output feature map column direction, the 3-dimensional parallelism {Pif, Pof, Pox} being selected from the common factors of these 3 dimensions across all convolution layer feature maps, obtained as

{Pif, Pof, Pox} = {cd(In_N_1, ..., In_N_n), cd(Out_M_1, ..., Out_M_n), cd(Out_W_1, ..., Out_W_n)},

where cd denotes a common factor of a set of integers, In_N_i is the number of input feature map channels of the i-th layer, Out_M_i is the number of output feature map channels of the i-th layer, Out_W_i is the width of the output feature map of the i-th layer, and n is the number of convolution layers, subject to DSP_num < DSP_device, where DSP_num = Pif × Pof × Pox × Poy × Pkx × Pky is the number of DSP resources consumed by the convolution operator, DSP_device is the number of DSP resources on the FPGA chip, Poy, Pkx, and Pky are the parallelism degrees along the output feature map row direction, the convolution kernel width, and the convolution kernel height, respectively, and Poy = Pkx = Pky = 1; after the final output feature map data is obtained, the controller module directs the data routing module to generate the output address offsets, the input/output buffer module splices eight 16-bit output values into one complete output word for the 128-bit data interface, and the data in the output buffer is stored to the off-chip memory; after the FPGA accelerator driver module finishes execution, the post-processing module processes the detection result to obtain the optimal detection boxes and saves the detection image.
2. The rapidly deployable general-purpose hardware accelerator system platform for the YOLO algorithm according to claim 1, wherein the model parameter preprocessing module converts the weight and bias data from 32-bit floating-point numbers to 16-bit fixed-point numbers with the conversion formula

Y_fixed(Y_float32, Q) = Σ_{i=0}^{BW-1} B_i · 2^(i-Q), B_i ∈ {0, 1},

where BW is the bit width of the quantized fixed-point number and Q is the fractional length, which determines the representation range and precision of the BW-bit fixed-point number; the Q value is calculated by Q = MIN_Q { Σ | Y_float32 - Y'_float32(Y_fixed(Y_float32, Q)) | }, where Y_float32 is the floating-point number to be quantized and Y'_float32 is the result of quantizing to fixed point and converting back to a 32-bit floating-point number; the optimal Q value for each layer's weight and bias quantization is obtained by minimizing the accumulated error.
3. The rapidly deployable general-purpose hardware accelerator system platform for the YOLO algorithm according to claim 1, wherein the driver signal of the FPGA accelerator driver module consists of the input/output feature map data segment addresses, the initialized input/output feature map size and channel count, the convolution kernel size, stride, activation function type, weight offset, bias offset, operator type, input/output feature map quantization Q values, and feature map blocking coefficients.
4. The rapidly deployable general-purpose hardware accelerator system platform for the YOLO algorithm according to claim 1, wherein the convolution operator of the YOLO operator module is a multiply-add tree structure consisting of Pif × Pof × Pox multipliers, Pof × Pox adder trees of depth log2(Pif), and Pof × Pox multiplexers and registers; during computation, if the current cycle is the first multiply-accumulate cycle of an output feature point, the MUX selects the bias to be accumulated with the current cycle's multiply-add result; otherwise, the previous cycle's multiply-add result is accumulated with the current cycle's multiply-add result.
5. The rapidly deployable general-purpose hardware accelerator system platform for the YOLO algorithm according to claim 1, wherein after the FPGA accelerator driver module finishes execution, the post-processing module decodes the computation result to obtain detection boxes, performs non-maximum suppression to obtain the optimal detection boxes, and saves the detection image.
CN202210056834.9A 2022-01-19 2022-01-19 Rapidly deployable general-purpose hardware accelerator system platform for the YOLO algorithm Active CN114662681B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210056834.9A CN114662681B (en) 2022-01-19 2022-01-19 Rapidly deployable general-purpose hardware accelerator system platform for the YOLO algorithm

Publications (2)

Publication Number Publication Date
CN114662681A (en) 2022-06-24
CN114662681B CN114662681B (en) 2024-05-28

Family

ID=82025644

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210056834.9A Active CN114662681B (en) 2022-01-19 2022-01-19 Rapidly deployable general-purpose hardware accelerator system platform for the YOLO algorithm

Country Status (1)

Country Link
CN (1) CN114662681B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414994A (en) * 2020-03-03 2020-07-14 哈尔滨工业大学 FPGA-based Yolov3 network computing acceleration system and acceleration method thereof
CN111459877A (en) * 2020-04-02 2020-07-28 FPGA (Field Programmable Gate Array) acceleration-based Winograd YOLOv2 target detection model method
CN113051216A (en) * 2021-04-22 2021-06-29 南京工业大学 MobileNet-SSD target detection device and method based on FPGA acceleration
CN113792621A (en) * 2021-08-27 2021-12-14 杭州电子科技大学 Target detection accelerator design method based on FPGA
CN113705803A (en) * 2021-08-31 2021-11-26 南京大学 Image hardware identification system based on convolutional neural network and deployment method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ANNE K. MADSEN et al.: "An Optimized FPGA-Based Hardware Accelerator for Physics-Based EKF for Battery Cell Management", IEEE, 28 September 2020 (2020-09-28), pages 2158-1525 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115482421A (en) * 2022-11-15 2022-12-16 苏州万店掌软件技术有限公司 Target detection method, device, equipment and medium

Also Published As

Publication number Publication date
CN114662681B (en) 2024-05-28

Similar Documents

Publication Publication Date Title
US11907830B2 (en) Neural network architecture using control logic determining convolution operation sequence
CN109063825B (en) Convolutional neural network accelerator
Pestana et al. A full featured configurable accelerator for object detection with YOLO
CN111414994B (en) FPGA-based Yolov3 network computing acceleration system and acceleration method thereof
CN113792621B (en) FPGA-based target detection accelerator design method
CN114118347A (en) Fine-grained per-vector scaling for neural network quantization
CN109993293B (en) Deep learning accelerator suitable for heap hourglass network
GB2568102A (en) Exploiting sparsity in a neural network
CN113313247B (en) Operation method of sparse neural network based on data flow architecture
CN114970803A (en) Machine learning training in a logarithmic system
TW202138999A (en) Data dividing method and processor for convolution operation
CN114662681A (en) Rapidly deployable general-purpose hardware accelerator system platform for the YOLO algorithm
CN111610963B (en) Chip structure and multiply-add calculation engine thereof
JP7410961B2 (en) arithmetic processing unit
CN114651249A (en) Techniques to minimize the negative impact of cache conflicts caused by incompatible dominant dimensions in matrix multiplication and convolution kernels without dimension filling
CN115577747A (en) High-parallelism heterogeneous convolutional neural network accelerator and acceleration method
CN115170381A (en) Visual SLAM acceleration system and method based on deep learning
US20240045592A1 (en) Computational storage device, storage system including the same and operation method therefor
US11442643B2 (en) System and method for efficiently converting low-locality data into high-locality data
US20230252600A1 (en) Image size adjustment structure, adjustment method, and image scaling method and device based on streaming architecture
US20230334289A1 (en) Deep neural network accelerator with memory having two-level topology
US20210209462A1 (en) Method and system for processing a neural network
CN116363480A (en) Computing device and method for image pixel processing network
CN115423083A (en) Neural network accelerator with double scheduling modes
CN114996646A (en) Operation method, device, medium and electronic equipment based on lookup table

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant