CN114662681A - YOLO algorithm-oriented general hardware accelerator system platform capable of being deployed rapidly - Google Patents
- Publication number
- CN114662681A CN114662681A CN202210056834.9A CN202210056834A CN114662681A CN 114662681 A CN114662681 A CN 114662681A CN 202210056834 A CN202210056834 A CN 202210056834A CN 114662681 A CN114662681 A CN 114662681A
- Authority
- CN
- China
- Prior art keywords
- layer
- data
- module
- output
- input
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06N3/063 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- G06N3/045 — Combinations of networks
- G06N3/08 — Learning methods
Abstract
A rapidly deployable, general-purpose hardware accelerator system platform for the YOLO algorithm, belonging to the field of computer technology, addresses the deployment requirements of YOLO object-detection algorithms for speed, high performance and low power consumption, and has broad application scenarios. The platform consists of an ARM subsystem, an FPGA subsystem and an off-chip memory. The ARM subsystem is responsible for parameter initialization, image preprocessing, model-parameter preprocessing, data-segment address allocation, FPGA accelerator driving and image post-processing; the FPGA subsystem is responsible for the compute-intensive parts of the YOLO algorithm. After startup, the platform reads a YOLO algorithm configuration file to initialize the accelerator drive parameters; after preprocessing the image to be detected, it reads the model weight and bias data for quantization, fusion and reordering, drives the FPGA subsystem to run the model computation, and post-processes the result to obtain the object-detection image. The invention enables rapid deployment of the YOLO algorithm while maintaining high performance and low power consumption.
Description
Technical Field
The invention belongs to the field of computer technology and relates in particular to a rapidly deployable, general-purpose hardware accelerator system platform for the YOLO algorithm.
Background
With the growth of computing power and the development of big data, convolutional neural networks (CNNs) have become mainstream in computer vision. Object detection is an important CNN-based branch of the field, and the YOLO (You Only Look Once) family of algorithms in particular shows excellent speed and accuracy, with wide application in many areas such as robot vision, security monitoring, autonomous driving and virtual reality.
The YOLO algorithm keeps evolving toward heavier computation, larger data volumes and more complex structures, while version iterations come quickly, so deploying the algorithm keeps getting harder; at the same time, application scenarios such as unmanned detection place ever higher demands on low latency, low power consumption and rapid deployment.
Existing deployment schemes are generally based on CPU, GPU, FPGA or ASIC platforms. A general-purpose CPU cannot meet the performance requirement, the GPU suffers from high latency and high power consumption, and the ASIC has high development costs, so the FPGA, with its programmability, high parallelism and low power consumption, has attracted wide attention. However, as the YOLO algorithm grows more complex and iterates faster, current CNN FPGA accelerators increasingly suffer from high deployment difficulty and long development cycles.
Disclosure of Invention
Addressing the shortcomings of the prior art, the invention presents the design of a rapidly deployable, general-purpose hardware accelerator system platform for the YOLO algorithm that meets low-latency and low-power requirements while allowing the YOLO algorithm to be deployed quickly. The platform consists of an ARM subsystem, an FPGA subsystem and an off-chip memory. The ARM subsystem is mainly responsible for logic control and small-scale data processing and comprises a parameter initialization module, an image preprocessing module, a model parameter preprocessing module, a data segment address allocation module, an FPGA accelerator driving module and an image post-processing module. The FPGA subsystem is mainly responsible for the compute-intensive parts of the YOLO algorithm and comprises an input/output buffer module, a controller module, a data routing module and a YOLO operator module. The off-chip memory is mainly responsible for large-scale data storage. The software and hardware modules are briefly described below.
The parameter initialization module of the ARM subsystem reads the algorithm configuration file specified at platform startup, obtains the YOLO algorithm version, structure and per-layer parameter information, and completes the initialization of the drive-parameter structure variables; the module can adapt to the YOLOv1, YOLOv2, YOLOv2-tiny, YOLOv3, YOLOv3-tiny, YOLOv4 and YOLOv4-tiny algorithms. The image preprocessing module reads an image to be detected of arbitrary resolution, converts it to [0, 1] by dividing each pixel by 255, scales it to the size of the algorithm's first-layer input feature map while preserving the original aspect ratio, and stores it in the off-chip memory. The model parameter preprocessing module reads the algorithm's weight and bias data from the off-chip memory for dynamic quantization, converts 32-bit floating-point numbers to 16-bit fixed-point numbers, fuses the normalization layer into the convolution layer, and rearranges the weight data in off-chip memory from {K_W, K_H, N, M} order to {N, M, K_W, K_H} order, where K_W, K_H, N and M are the width, height, channel count and number of the convolution kernels respectively.
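As a concrete illustration of the reordering step, the following pure-Python sketch (our own hypothetical helper — the ARM subsystem presumably performs this in C) rewrites a flat weight array from {K_W, K_H, N, M} order to {N, M, K_W, K_H} order:

```python
# Illustrative sketch, not the patent's implementation: reorder a flat
# convolution-weight array from (K_W, K_H, N, M) layout to (N, M, K_W, K_H)
# layout, so that the values of one kernel become contiguous in memory.
def reorder_weights(w, kw, kh, n, m):
    """w: flat list in (K_W, K_H, N, M) order; returns a flat list in
    (N, M, K_W, K_H) order."""
    def src(i, j, c, o):
        # linear index into the original (K_W, K_H, N, M) layout
        return ((i * kh + j) * n + c) * m + o
    out = []
    for c in range(n):          # input channel
        for o in range(m):      # kernel number (output channel)
            for i in range(kw): # kernel width
                for j in range(kh):  # kernel height
                    out.append(w[src(i, j, c, o)])
    return out
```

Reading one whole kernel then becomes a single contiguous run, which is what allows the longer AXI bursts mentioned later in the description.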
When the FPGA subsystem is driven, if the current layer is a routing layer, the data segment neither enters the FPGA subsystem nor requires an allocated memory region for splicing or slicing data segments; instead, the routing layer is accounted for directly in this module's address management when each layer's output is stored off-chip. If the routing layer splits the output of layer A into two equal groups along the channel dimension and takes the second group as the input of layer C, the output feature map of layer A containing X_A elements, with output-segment first address OutptrA_start and last address OutptrA_end = OutptrA_start + X_A × 2, then the routing layer is skipped by initializing the C-layer input with first address InptrC_start = OutptrA_end ÷ 2 and last address InptrC_end = OutptrA_end. If the routing layer concatenates the output feature maps of layers D and E along the channel dimension as the input of layer G, layer D outputting X_D elements with output-segment first address OutptrD_start and last address OutptrD_end = OutptrD_start + X_D × 2, and layer E outputting X_E elements, then the E-layer output segment must start at OutptrE_start = OutptrD_end + 2 and end at OutptrE_end = OutptrE_start + X_E × 2, and the routing layer is skipped by giving the G-layer input the first address InptrG_start = OutptrD_start and last address InptrG_end = OutptrE_end. The memory copies and complex logic of routing layers are thus eliminated purely through storage-address management.
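The pointer arithmetic for the two routing cases can be summarized in a small sketch (illustrative Python using the patent's pointer names; addresses are byte offsets with 2-byte elements, and the ÷2 split formula assumes the segment starts at address 0, which is what the text's expressions imply):

```python
# Hedged sketch of the route-skipping address bookkeeping; not the ARM code.
def split_route(outA_start, xA):
    """Layer A's output (xA 16-bit elements) is split in half along the
    channel axis; the second half feeds layer C without any memory copy."""
    outA_end = outA_start + xA * 2
    inC_start = outA_end // 2   # OutptrA_end / 2, per the text (start assumed 0)
    inC_end = outA_end
    return outA_end, inC_start, inC_end

def concat_route(outD_start, xD, xE):
    """Layers D and E are stored back-to-back so their concatenation is
    already contiguous and can feed layer G directly."""
    outD_end = outD_start + xD * 2
    outE_start = outD_end + 2            # E must follow D immediately
    outE_end = outE_start + xE * 2
    inG_start, inG_end = outD_start, outE_end
    return outD_end, outE_start, outE_end, inG_start, inG_end
```

The design choice here is that concat is free only because the allocator forces E's segment to begin right after D's; any other placement would reintroduce the memory copy.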
The FPGA accelerator driving module drives the FPGA subsystem in a loop over the layer IDs of the YOLO algorithm, executing the algorithm's forward-inference process.
After the FPGA subsystem is driven, it is controlled by the controller module, and the data routing module generates the address offsets for reading the model data and the input feature-map data. The input/output buffer module splits the input data arriving on a 128-bit data interface into eight 16-bit values, stores the input feature-map, weight and bias block data, and reduces the number of off-chip memory accesses by reusing on-chip data. The YOLO operator module provides convolution, pooling, upsampling, reordering and shortcut operators. Convolution, as the operator with the largest computation load, is computed in parallel along 3 dimensions: the input feature-map channels, the output feature-map channels and the output feature-map columns. The 3-dimensional parallelism {Pif, Pof, Pox} is selected from the common factors of the corresponding dimensions of all convolution-layer feature maps as {Pif, Pof, Pox} = {cd(In_N_1, …, In_N_n), cd(Out_M_1, …, Out_M_n), cd(Out_W_1, …, Out_W_n)}, where cd denotes a common factor of a set of integers, In_N_i is the number of input feature-map channels of layer i, Out_M_i the number of output feature-map channels of layer i, Out_W_i the output feature-map width of layer i, and n the number of convolution layers. The choice must satisfy DSP_num < DSP_device, where DSP_num = Pif × Pof × Pox × Poy × Pkx × Pky is the number of DSP resources the convolution operator consumes, DSP_device is the number of DSP resources on the FPGA chip, and Poy, Pkx and Pky are the parallelism along the output feature-map rows, convolution kernel width and convolution kernel height respectively; setting Poy = Pkx = Pky = 1 guarantees the parallelism while improving hardware efficiency.
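A toy enumeration of the parallelism selection, with Poy = Pkx = Pky = 1 as stated in the text; the exhaustive search and the greedy "largest product" choice are our own illustrative assumptions, not the patent's procedure:

```python
from math import gcd
from functools import reduce

# cd(...) in the text: the common factors of a set of integers, i.e. the
# divisors of their gcd.
def common_factors(dims):
    g = reduce(gcd, dims)
    return [d for d in range(1, g + 1) if g % d == 0]

def pick_parallelism(in_n, out_m, out_w, dsp_device):
    """in_n/out_m/out_w: per-layer channel/channel/width dimensions.
    With Poy = Pkx = Pky = 1 the DSP constraint reduces to
    Pif * Pof * Pox < DSP_device."""
    best = (1, 1, 1)
    for pif in common_factors(in_n):
        for pof in common_factors(out_m):
            for pox in common_factors(out_w):
                p = pif * pof * pox
                if p < dsp_device and p > best[0] * best[1] * best[2]:
                    best = (pif, pof, pox)
    return best
```

Because each factor divides every layer's dimension, the same parallel hardware is fully utilized on every convolution layer without remainder handling.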
An operator-type signal from the control module selects the corresponding operator in the YOLO operator module for computation. After the final output feature-map data is obtained, under the control module's direction the data routing module generates the output address offsets, the input/output buffer module splices eight 16-bit output values into complete output data for the 128-bit data interface, and the data in the output buffer is stored to off-chip memory. After the FPGA accelerator driving module finishes executing, the post-processing module decodes the inference result and applies non-maximum suppression to obtain the best detection boxes and save the detection image.
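The 16-bit/128-bit splicing can be modelled behaviourally (the platform does this in FPGA buffer logic; this Python sketch only shows the bit layout, which we assume to be little-endian lane order):

```python
# Illustrative bit-layout model of the input/output buffer's bus packing,
# not the RTL: eight 16-bit lanes per 128-bit bus beat.
def pack128(words):
    """Pack eight 16-bit words into one 128-bit integer (word 0 in the
    lowest lane, by assumption)."""
    assert len(words) == 8
    v = 0
    for i, w in enumerate(words):
        v |= (w & 0xFFFF) << (16 * i)
    return v

def unpack128(v):
    """Split one 128-bit integer back into eight 16-bit words."""
    return [(v >> (16 * i)) & 0xFFFF for i in range(8)]
```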
The invention achieves high-performance, low-power model inference, eliminates the cumbersome step of processing the YOLO model data on a host machine before copying it into the accelerator system, is compatible with multiple YOLO algorithm versions, and enables rapid deployment, effectively addressing the high deployment difficulty and long development cycles caused by the YOLO algorithm's growing complexity and fast version updates.
Drawings
FIG. 1 is a schematic diagram of a system platform.
FIG. 2 is a schematic diagram of a configuration file format.
FIG. 3 is a schematic diagram of an FPGA subsystem.
FIG. 4 is a schematic diagram of a parallel circuit for convolution operations.
Detailed Description
The present invention will be described in further detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of the system platform of the present invention, which mainly comprises an ARM subsystem, an FPGA subsystem and an off-chip memory. When the system starts, the parameter initialization module of the ARM subsystem reads the algorithm's configuration file and initializes the drive structure variables. The configuration file is written in a fixed format, shown in Fig. 2, and contains the YOLO algorithm version and, for each layer of the model, the operator type, convolution kernel size and number, stride, activation-function type, input/output feature-map quantization Q values, layer ID, routing-layer group count, input-group IDs and shortcut-layer ID. The drive structure types are Network and Layer: Network comprises a Layer structure array and a member for the number of network layers, while Layer comprises members for the input/output feature-map size and channel count, convolution kernel size, operator type, stride, activation-function type, input/output feature-map quantization Q values, weight offset, bias, feature-map blocking coefficients, group count, input-group IDs and shortcut-layer ID. This reduces the difficulty of algorithm deployment: only the configuration file needs to be modified when deploying. The image preprocessing module reads an image to be detected of arbitrary resolution, converts it to [0, 1] by dividing each pixel by 255, scales it to the size of the algorithm's first-layer input feature map while preserving the original aspect ratio, and stores it in the off-chip memory.
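One possible shape for the Network and Layer drive structures, sketched as Python dataclasses; all field names are illustrative assumptions, since the patent lists the members but not their identifiers (the real implementation would be C structs on the ARM side):

```python
# Hypothetical sketch of the drive structures; field names are our own.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Layer:
    op_type: str                      # conv / pool / upsample / reorder / shortcut / route
    in_shape: Tuple[int, int, int]    # input feature-map (W, H, C)
    out_shape: Tuple[int, int, int]   # output feature-map (W, H, C)
    kernel: int = 0                   # convolution kernel size
    stride: int = 0
    activation: str = "linear"
    q_in: int = 0                     # input feature-map quantization Q value
    q_out: int = 0                    # output feature-map quantization Q value
    weight_offset: int = 0
    groups: int = 1                   # routing-layer group count
    input_ids: List[int] = field(default_factory=list)
    shortcut_id: int = -1

@dataclass
class Network:
    layers: List[Layer] = field(default_factory=list)

    @property
    def n_layers(self) -> int:        # the "number of network layers" member
        return len(self.layers)
```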
The model parameter preprocessing module reads the algorithm's weight and bias data and quantizes them dynamically, converting 32-bit floating-point numbers into 16-bit fixed-point numbers by Y_fixed = Σ_{i=0}^{BW−1} B_i · 2^i · 2^{−Q}, B_i ∈ {0, 1}, where BW is the bit width of the quantized fixed-point number and Q is the fractional length, which determines the representable range and precision of the BW-bit fixed-point number. The Q value is computed as Q = MIN{ Σ |Y_float32 − Y'_float32(Y_fixed(Y_float32, Q))| }, where Y_float32 is the floating-point value to be quantized and Y'_float32 is its value after quantization to fixed point and conversion back to 32-bit floating point; the optimal Q for each layer's weight and bias quantization is obtained by minimizing the accumulated error, which reduces data overflow. At the same time the normalization layer is fused into the convolution layer by Weight_new = γ · Weight / √(δ² + ε) and Bias_new = γ · (Bias − μ) / √(δ² + ε) + β, where Weight_new and Bias_new are the fused weight and bias, Weight and Bias the pre-fusion values, γ and β the scale and shift factors learned during training, μ and δ² the per-batch mean and variance from training, and ε a very small constant that keeps the denominator from being 0. The arrangement of the weight data in off-chip memory is converted from {K_W, K_H, N, M} to {N, M, K_W, K_H}, K_W, K_H, N and M being the width, height, channel count and number of the convolution kernels respectively; this reduces FPGA resource consumption, and the changed memory layout increases the AXI burst length and improves bandwidth utilization.
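The quantization and fusion formulas above can be exercised with a small sketch (illustrative Python; `to_fixed` uses round-to-nearest and saturation, details the patent does not specify):

```python
import math

# Hedged sketch of the dynamic quantization and BN fusion; rounding mode
# and saturation are our own assumptions.
def to_fixed(y, q, bw=16):
    """Quantize a float to a BW-bit two's-complement integer with Q
    fractional bits, saturating at the representable range."""
    scaled = int(round(y * (1 << q)))
    lo, hi = -(1 << (bw - 1)), (1 << (bw - 1)) - 1
    return max(lo, min(hi, scaled))

def to_float(y_fixed, q):
    """Convert the fixed-point integer back to a float."""
    return y_fixed / (1 << q)

def best_q(values, bw=16):
    """Pick Q minimizing the summed absolute reconstruction error,
    mirroring the patent's Q = MIN{ sum |Y - Y'| } formula."""
    return min(range(bw),
               key=lambda q: sum(abs(v - to_float(to_fixed(v, q, bw), q))
                                 for v in values))

def fuse_bn(weight, bias, gamma, beta, mu, var, eps=1e-5):
    """Fold a batch-normalization layer into the preceding convolution:
    W' = gamma * W / sqrt(var + eps); B' = gamma * (B - mu) / sqrt(var + eps) + beta."""
    s = gamma / math.sqrt(var + eps)
    return weight * s, (bias - mu) * s + beta
```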
The data segment address allocation module generates the drive values for the input/output feature-map data-segment addresses. When the FPGA subsystem is driven, a routing layer neither enters the FPGA subsystem nor requires an allocated memory region for splicing or slicing data segments; instead it is accounted for directly in this module's address management when each layer's output is stored off-chip. If the routing layer splits the output of layer A into two equal groups along the channel dimension and takes the second group as the input of layer C, the output feature map of layer A containing X_A elements, with output-segment first address OutptrA_start and last address OutptrA_end = OutptrA_start + X_A × 2, then the routing layer is skipped by giving the C-layer input the first address InptrC_start = OutptrA_end ÷ 2 and last address InptrC_end = OutptrA_end. If the routing layer concatenates the output feature maps of layers D and E along the channel dimension as the input of layer G, layer D outputting X_D elements with output-segment first address OutptrD_start and last address OutptrD_end = OutptrD_start + X_D × 2, and layer E outputting X_E elements, then the E-layer output segment must start at OutptrE_start = OutptrD_end + 2 and end at OutptrE_end = OutptrE_start + X_E × 2, and the routing layer is skipped by giving the G-layer input the first address InptrG_start = OutptrD_start and last address InptrG_end = OutptrE_end. The memory copies and complex logic of routing layers are thereby eliminated purely through storage-address management. The FPGA accelerator driving module then drives the FPGA subsystem in a loop over the algorithm's layer IDs to execute the forward-inference process. The drive signal consists of the input/output feature-map data-segment addresses, the initialized input/output feature-map size and channel count, convolution kernel size, stride, activation-function type, weight offset, bias, operator type, input/output feature-map quantization Q values and feature-map blocking coefficients; the loop count equals the number of layers in the algorithm, at which point the FPGA subsystem is started.
As shown in Fig. 3, after the FPGA subsystem starts, the control module driven over the AXI-Lite interface first determines the type of the current layer, the data routing module generates the address offsets for reading the model data and the input feature-map data, and the input/output buffer module reads the input block data from off-chip memory over the AXI interface, splits each 128-bit input word into eight 16-bit values, and stores them in the weight buffer and the input feature-map buffer, improving bandwidth utilization. The YOLO operator module provides convolution, pooling, upsampling, reordering and shortcut operators, the corresponding operator being selected for computation by the operator-type signal of the control module. Convolution, as the operator with the largest computation load, is computed in parallel along 3 dimensions: the input feature-map channels, the output feature-map channels and the output feature-map columns. The 3-dimensional parallelism {Pif, Pof, Pox} is selected from the common factors of the corresponding dimensions of all convolution-layer feature maps as {Pif, Pof, Pox} = {cd(In_N_1, …, In_N_n), cd(Out_M_1, …, Out_M_n), cd(Out_W_1, …, Out_W_n)}, where cd denotes a common factor of a set of integers, In_N_i is the number of input feature-map channels of layer i, Out_M_i the number of output feature-map channels of layer i, Out_W_i the output feature-map width of layer i, and n the number of convolution layers, subject to DSP_num < DSP_device, where DSP_num = Pif × Pof × Pox × Poy × Pkx × Pky is the number of DSP resources the convolution operator consumes, DSP_device the number of DSP resources on the FPGA chip, and Poy, Pkx, Pky the parallelism along the output feature-map rows, convolution kernel width and convolution kernel height respectively; Poy = Pkx = Pky = 1 guarantees the parallelism while improving hardware efficiency.
The convolution operator circuit is a multiply-add tree structure, shown in Fig. 4, consisting of Pif × Pof × Pox multipliers, Pof × Pox adder trees of depth log2(Pif), and Pof × Pox multiplexers and registers. During computation, in the first multiply-accumulate cycle of an output feature point the MUX selects the Bias to add to the current cycle's multiply-add result; otherwise the previous cycle's multiply-add result is accumulated with the current cycle's. After the final output feature-map result is obtained, the control module directs the data routing module to generate the output address offsets, the input/output buffer module splices eight 16-bit output values into one complete 128-bit output word, and the data in the output buffer is stored to off-chip memory. After the forward-inference computation finishes, the post-processing module in the ARM subsystem decodes the output to obtain detection boxes, applies non-maximum suppression to obtain the best boxes, and saves the detection image.
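A behavioural model of one accumulation cycle of the multiply-add tree (illustrative Python, not the RTL; one lane of the Pof × Pox parallel lanes is shown):

```python
# Sketch of one multiply-accumulate cycle for a single output feature point:
# Pif products are summed by the adder tree; on the first cycle the MUX
# injects the Bias, on later cycles it injects the running partial sum.
def mac_cycle(inputs, weights, partial, bias, first_cycle):
    """inputs/weights: the Pif parallel operands for this cycle;
    partial: previous cycle's multiply-add result; returns the new sum."""
    prod = sum(a * w for a, w in zip(inputs, weights))  # adder tree, depth log2(Pif)
    return prod + (bias if first_cycle else partial)    # the MUX choice
```

Iterating `mac_cycle` over the remaining input-channel and kernel positions accumulates the full convolution result for that output point.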
Claims (5)
1. A rapidly deployable, general-purpose hardware accelerator system platform for the YOLO algorithm, characterized in that the system platform consists of an ARM subsystem, an FPGA subsystem and an off-chip memory, wherein the ARM subsystem comprises a parameter initialization module, an image preprocessing module, a model parameter preprocessing module, a data segment address allocation module, an FPGA accelerator driving module and an image post-processing module; after the system starts, the parameter initialization module loads the algorithm configuration file and initializes the drive structure variables, the configuration file comprising the YOLO algorithm version and, for each layer of the model, the operator type, convolution kernel size and number, stride, activation-function type, input/output feature-map quantization Q values, layer ID, routing-layer group count, input-group IDs and shortcut-layer ID; the drive structure types are Network and Layer, where Network comprises a Layer structure array and a member for the number of network layers, and Layer comprises members for the input/output feature-map size and channel count, convolution kernel size, operator type, stride, activation-function type, input/output feature-map quantization Q values, weight offset, bias, feature-map blocking coefficients, group count, input-group IDs and shortcut-layer ID; the image preprocessing module reads an image to be detected of arbitrary resolution, converts it to [0, 1] by dividing each pixel by 255, scales it to the size of the algorithm's first-layer input feature map while preserving the original aspect ratio, and stores it in the off-chip memory; the model parameter preprocessing module loads the weight and bias data into the off-chip memory, converts them from 32-bit floating point to 16-bit fixed point, fuses the normalization layer into the convolution layer, and changes the arrangement of the weight data in memory from {K_W, K_H, N, M} to {N, M, K_W, K_H}, K_W, K_H, N and M being the width, height, channel count and number of the convolution kernels respectively; the data segment address allocation module generates the drive values for the input/output feature-map data-segment addresses: when the FPGA subsystem is driven, a routing layer neither enters the FPGA subsystem nor requires an allocated memory region for splicing or slicing data segments, being instead accounted for directly in this module's address management when each layer's output is stored off-chip; if the routing layer splits the output of layer A into two equal groups along the channel dimension and takes the second group as the input of layer C, the output feature map of layer A containing X_A elements with output-segment first address OutptrA_start and last address OutptrA_end = OutptrA_start + X_A × 2, the routing layer is skipped by giving the C-layer input the first address InptrC_start = OutptrA_end ÷ 2 and last address InptrC_end = OutptrA_end; if the routing layer concatenates the output feature maps of layers D and E along the channel dimension as the input of layer G, layer D outputting X_D elements with output-segment first address OutptrD_start and last address OutptrD_end = OutptrD_start + X_D × 2, and layer E outputting X_E elements, the E-layer output segment must start at OutptrE_start = OutptrD_end + 2 and end at OutptrE_end = OutptrE_start + X_E × 2, and the routing layer is skipped by giving the G-layer input the first address InptrG_start = OutptrD_start and last address InptrG_end = OutptrE_end, so that the memory copies and complex logic of routing layers are eliminated in the storage-address management; the FPGA accelerator driving module drives the FPGA subsystem in a loop over the algorithm's layer IDs to execute the forward-inference process; the FPGA subsystem comprises a controller module, a data routing module, an input/output buffer module and a YOLO operator module; once driven, the FPGA subsystem is controlled by the controller module, and the data routing module generates the address offsets for reading the model data and the input feature-map data; after the input/output buffer module reads the input block data from off-chip memory, it splits each 128-bit input word into eight 16-bit values and stores them in the weight buffer and the input feature-map buffer; the YOLO operator module provides convolution, pooling, upsampling, reordering and shortcut operators, selected for computation by the operator-type signal of the control module; convolution, as the operator with the largest computation load, is computed in parallel along 3 dimensions, namely the input feature-map channels, the output feature-map channels and the output feature-map columns, the 3-dimensional parallelism {Pif, Pof, Pox} being selected from the common factors of the corresponding dimensions of all convolution-layer feature maps as {Pif, Pof, Pox} = {cd(In_N_1, …, In_N_n), cd(Out_M_1, …, Out_M_n), cd(Out_W_1, …, Out_W_n)}, where cd denotes a common factor of a set of integers, In_N_i is the number of input feature-map channels of layer i, Out_M_i the number of output feature-map channels of layer i, Out_W_i the output feature-map width of layer i, and n the number of convolution layers, subject to DSP_num < DSP_device, where DSP_num = Pif × Pof × Pox × Poy × Pkx × Pky is the number of DSP resources the convolution operator consumes, DSP_device is the number of DSP resources on the FPGA chip, and Poy, Pkx, Pky are the parallelism along the output feature-map rows, convolution kernel width and convolution kernel height respectively, with Poy = Pkx = Pky = 1; after the final output feature-map data is obtained, the control module directs the data routing module to generate the output address offsets, the input/output buffer module splices eight 16-bit output values into complete output data for the 128-bit data interface, and the data in the output buffer is stored to the off-chip memory; after the FPGA accelerator driving module finishes executing, the post-processing module processes the detection result to obtain the best detection boxes and saves the detection image.
2. The YOLO algorithm-oriented rapidly deployable general hardware accelerator system platform as claimed in claim 1, wherein the model parameter preprocessing module converts the weight and bias data from 32-bit floating point numbers to 16-bit fixed point numbers; the conversion formula is Y_fixed = round(Y_float32 × 2^Q), where BW is the bit width of the fixed point number after quantization and Q is the fractional length, which together determine the representable range and precision of the BW-bit fixed point number; Q is obtained as Q = argmin_Q Σ |Y_float32 − Y'_float32(Y_fixed(Y_float32, Q))|, where Y_float32 is the floating point number to be quantized and Y'_float32 is the result of quantizing to a fixed point number and converting back to a 32-bit floating point number; the optimal Q value for quantizing each layer's weight and bias data is the one that minimizes the accumulated error.
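The per-layer Q search of claim 2 can be sketched as follows. This is an illustrative model, not the patent's implementation; the round-and-saturate scheme is an assumption (the claim fixes only BW and Q), and the sample weights are hypothetical:

```python
import numpy as np

BW = 16  # fixed-point bit width after quantization

def to_fixed(y, q):
    # Quantize float32 to a BW-bit fixed-point integer with Q fractional
    # bits; rounding and saturation behavior is assumed, not claimed.
    lo, hi = -2 ** (BW - 1), 2 ** (BW - 1) - 1
    return np.clip(np.round(y * 2 ** q), lo, hi).astype(np.int32)

def to_float(y_fixed, q):
    # Convert the fixed-point representation back to float32.
    return y_fixed.astype(np.float32) / 2 ** q

def best_q(weights):
    """Search the fractional length Q that minimizes the accumulated
    reconstruction error sum |Y - Y'|, as described in claim 2."""
    errors = {q: float(np.abs(weights - to_float(to_fixed(weights, q), q)).sum())
              for q in range(BW)}
    return min(errors, key=errors.get)

# Hypothetical layer weights, chosen to be exactly representable at Q = 2.
w = np.array([0.5, -0.25, 1.75, 3.0], dtype=np.float32)
q = best_q(w)
```

A larger Q gives finer precision but a smaller representable range, so the minimum-accumulated-error criterion balances rounding error against saturation error per layer.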
3. The YOLO algorithm-oriented rapidly deployable general hardware accelerator system platform as claimed in claim 1, wherein the drive signal of the FPGA accelerator driver module consists of the input/output feature map storage data segment addresses, the initialized input/output feature map sizes and channel numbers, the convolution kernel size, the stride, the activation function type, the weight offset, the bias, the operator type, the input/output feature map quantization Q values and the feature map blocking coefficient.
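The per-layer drive signal of claim 3 amounts to a fixed record of fields. A hypothetical sketch of that record as a Python dataclass follows; every field name here is invented for illustration and does not come from the patent:

```python
from dataclasses import dataclass

@dataclass
class LayerDriveSignal:
    # One command record per layer; field names are illustrative only.
    in_addr: int        # input feature map storage data segment address
    out_addr: int       # output feature map storage data segment address
    in_size: tuple      # (height, width) of the input feature map
    in_channels: int
    out_size: tuple
    out_channels: int
    kernel_size: int
    stride: int
    activation: str     # activation function type
    weight_offset: int
    bias_offset: int
    op_type: str        # e.g. "conv" | "pool" | "upsample" | "reorg" | "shortcut"
    q_in: int           # input feature map quantization Q value
    q_out: int          # output feature map quantization Q value
    tile_coeff: int     # feature map blocking (tiling) coefficient

# Example record for a hypothetical first convolutional layer.
sig = LayerDriveSignal(
    in_addr=0, out_addr=4096, in_size=(416, 416), in_channels=3,
    out_size=(208, 208), out_channels=16, kernel_size=3, stride=2,
    activation="leaky_relu", weight_offset=0, bias_offset=1024,
    op_type="conv", q_in=8, q_out=7, tile_coeff=4)
```

The driver module would emit one such record per layer ID while cyclically driving the FPGA subsystem.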
4. The YOLO algorithm-oriented rapidly deployable general hardware accelerator system platform as claimed in claim 1, wherein the convolution operator of the YOLO operator module is a multiply-add tree structure consisting of Pif × Pof × Pox multipliers, Pof × Pox addition trees with a depth of log2(Pif), and Pof × Pox multiplexers and registers; during calculation, if the current cycle is the first multiply-accumulate cycle of an output feature point, the MUX selects the Bias to be accumulated with the current cycle's multiply-add result; otherwise, the previous cycle's partial sum is accumulated with the current cycle's multiply-add result.
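The MUX behavior of claim 4 can be modeled for a single output feature point. This is a behavioral sketch, not the hardware: Pif products are summed per cycle by the adder tree, and the accumulator input is the bias on the first cycle and the previous partial sum afterwards. The input values are hypothetical:

```python
def conv_point_mac(inputs, weights, bias, pif):
    """Behavioral model of one output point on the multiply-add tree:
    each cycle sums Pif products; on the first cycle the MUX injects
    the bias, afterwards it feeds back the running partial sum."""
    acc = 0
    n_cycles = len(inputs) // pif
    for cycle in range(n_cycles):
        seg = slice(cycle * pif, (cycle + 1) * pif)
        # One pass through the depth-log2(Pif) adder tree.
        madd = sum(x * w for x, w in zip(inputs[seg], weights[seg]))
        # MUX: bias on the first cycle, previous partial sum otherwise.
        acc = (bias if cycle == 0 else acc) + madd
    return acc

# Hypothetical data: 8 input/weight pairs consumed 4 (Pif) at a time.
y = conv_point_mac(inputs=[1, 2, 3, 4, 5, 6, 7, 8],
                   weights=[1, 1, 1, 1, 2, 2, 2, 2], bias=10, pif=4)
```

Folding the bias into the first accumulation cycle avoids a separate bias-addition stage after the multiply-add tree.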
5. The YOLO algorithm-oriented rapidly deployable general hardware accelerator system platform as claimed in claim 1, wherein after the FPGA accelerator driver module finishes executing, the post-processing module decodes the calculation results to obtain detection boxes, performs non-maximum suppression to obtain the optimal detection boxes, and stores the detection image.
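The non-maximum suppression step of claim 5 can be sketched with the standard greedy algorithm. This is an illustrative implementation, not the patent's; the IoU threshold and the test boxes are assumptions:

```python
def iou(a, b):
    # Boxes are (x1, y1, x2, y2) corner coordinates.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop boxes overlapping it above the threshold, and repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep

# Hypothetical decoded boxes: the first two overlap heavily, the third is far away.
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
keep = nms(boxes, scores=[0.9, 0.8, 0.7])
```

The indices returned in `keep` identify the optimal detection boxes that survive suppression.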
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210056834.9A CN114662681B (en) | 2022-01-19 | 2022-01-19 | YOLO algorithm-oriented general hardware accelerator system platform capable of being rapidly deployed |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114662681A true CN114662681A (en) | 2022-06-24 |
CN114662681B CN114662681B (en) | 2024-05-28 |
Family
ID=82025644
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210056834.9A Active CN114662681B (en) | 2022-01-19 | 2022-01-19 | YOLO algorithm-oriented general hardware accelerator system platform capable of being rapidly deployed |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114662681B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115482421A (en) * | 2022-11-15 | 2022-12-16 | 苏州万店掌软件技术有限公司 | Target detection method, device, equipment and medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111414994A (en) * | 2020-03-03 | 2020-07-14 | 哈尔滨工业大学 | FPGA-based Yolov3 network computing acceleration system and acceleration method thereof |
CN111459877A (en) * | 2020-04-02 | 2020-07-28 | 北京工商大学 | FPGA (field programmable Gate array) acceleration-based Winograd YO L Ov2 target detection model method |
CN113051216A (en) * | 2021-04-22 | 2021-06-29 | 南京工业大学 | MobileNet-SSD target detection device and method based on FPGA acceleration |
CN113705803A (en) * | 2021-08-31 | 2021-11-26 | 南京大学 | Image hardware identification system based on convolutional neural network and deployment method |
CN113792621A (en) * | 2021-08-27 | 2021-12-14 | 杭州电子科技大学 | Target detection accelerator design method based on FPGA |
Non-Patent Citations (1)
Title |
---|
ANNE K. MADSEN等: "An Optimized FPGA-Based Hardware Accelerator for Physics-Based EKF for Battery Cell Management", 《IEEE》, 28 September 2020 (2020-09-28), pages 2158 - 1525 * |
Also Published As
Publication number | Publication date |
---|---|
CN114662681B (en) | 2024-05-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11907830B2 (en) | Neural network architecture using control logic determining convolution operation sequence | |
CN109063825B (en) | Convolutional neural network accelerator | |
Pestana et al. | A full featured configurable accelerator for object detection with YOLO | |
CN111414994B (en) | FPGA-based Yolov3 network computing acceleration system and acceleration method thereof | |
CN113792621B (en) | FPGA-based target detection accelerator design method | |
CN114118347A (en) | Fine-grained per-vector scaling for neural network quantization | |
CN109993293B (en) | Deep learning accelerator suitable for heap hourglass network | |
GB2568102A (en) | Exploiting sparsity in a neural network | |
CN113313247B (en) | Operation method of sparse neural network based on data flow architecture | |
CN114970803A (en) | Machine learning training in a logarithmic system | |
TW202138999A (en) | Data dividing method and processor for convolution operation | |
CN114662681A (en) | YOLO algorithm-oriented general hardware accelerator system platform capable of being deployed rapidly | |
CN111610963B (en) | Chip structure and multiply-add calculation engine thereof | |
JP7410961B2 (en) | arithmetic processing unit | |
CN114651249A (en) | Techniques to minimize the negative impact of cache conflicts caused by incompatible dominant dimensions in matrix multiplication and convolution kernels without dimension filling | |
CN115577747A (en) | High-parallelism heterogeneous convolutional neural network accelerator and acceleration method | |
CN115170381A (en) | Visual SLAM acceleration system and method based on deep learning | |
US20240045592A1 (en) | Computational storage device, storage system including the same and operation method therefor | |
US11442643B2 (en) | System and method for efficiently converting low-locality data into high-locality data | |
US20230252600A1 (en) | Image size adjustment structure, adjustment method, and image scaling method and device based on streaming architecture | |
US20230334289A1 (en) | Deep neural network accelerator with memory having two-level topology | |
US20210209462A1 (en) | Method and system for processing a neural network | |
CN116363480A (en) | Computing device and method for image pixel processing network | |
CN115423083A (en) | Neural network accelerator with double scheduling modes | |
CN114996646A (en) | Operation method, device, medium and electronic equipment based on lookup table |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |