CN108416422B - FPGA-based convolutional neural network implementation method and device - Google Patents


Info

Publication number
CN108416422B
CN108416422B
Authority
CN
China
Prior art keywords: data, input, module, neural network, processing
Prior art date
Legal status
Active
Application number
CN201810074941.8A
Other languages
Chinese (zh)
Other versions
CN108416422A (en)
Inventor
罗聪
万文涛
梁洁
Current Assignee
Nationz Technologies Inc
Original Assignee
Nationz Technologies Inc
Priority date
Filing date
Publication date
Application filed by Nationz Technologies Inc filed Critical Nationz Technologies Inc
Publication of CN108416422A publication Critical patent/CN108416422A/en
Application granted granted Critical
Publication of CN108416422B publication Critical patent/CN108416422B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides an FPGA-based convolutional neural network implementation method and device. The method comprises: initializing the editable resources of the FPGA to generate the functional modules required to implement the model; loading the weight data of each processing level of the convolutional neural network model into the memory of the FPGA and associating state registers of the FPGA with the processing levels; storing the data to be processed into the memory through the memory controller of the FPGA; and finally reading the parameters of the state registers, determining the processing level to be run, and having that level complete its processing of the data, until all processing levels of the convolutional neural network model to be implemented have run in sequence, whereupon the processing result corresponding to the data to be processed is output. Throughout this process the convolutional neural network is implemented by the FPGA hardware and no longer depends on software, which solves the problem that conventional convolutional neural network techniques rely on software implementations.

Description

FPGA-based convolutional neural network implementation method and device
Technical Field
The invention relates to the field of Field Programmable Gate Arrays (FPGAs), and in particular to an FPGA-based convolutional neural network implementation method and device.
Background
With the explosive growth of artificial intelligence, deep learning has become an effective means of extracting valuable information from massive amounts of data, and convolutional neural networks have attracted attention because their weights can be reused. Most convolutional neural networks are implemented in software: the data volume is large, the demands on hardware computing power are high, implementations depend on powerful cloud computing capacity, and power consumption is considerable.
Disclosure of Invention
The invention provides an FPGA (Field Programmable Gate Array)-based convolutional neural network implementation method and device, which solve the problem that conventional convolutional neural network techniques depend on software implementations.
To solve this technical problem, the invention adopts the following technical solution:
An FPGA-based convolutional neural network implementation method comprises the following steps:
initializing the editable resources of the FPGA to generate an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module and an operation control module;
loading the weight data of each processing level of the convolutional neural network model to be implemented into the memory of the FPGA, and associating the state registers of the FPGA with the processing levels;
storing the data to be processed into the memory through the memory controller of the FPGA;
the operation control module reads the parameters of the state register, determines the processing level to be run, and controls the input buffer module, the output buffer module, the input control module, the output control module, the neural network processing unit and the data reading module so that the processing level to be run completes its processing of the data, until all processing levels of the convolutional neural network model to be implemented have run in sequence, and then outputs the processing result corresponding to the data to be processed.
Further, when the processing level to be run is a convolution computation level, the operation control module controlling the input buffer module, the output buffer module, the input control module, the output control module, the neural network processing unit and the data reading module to complete that level's processing of the data comprises the following steps:
controlling the data reading module to read, through the memory controller, the weight data and input data corresponding to the convolution computation level stored in the memory, and to store them into the input buffer module;
controlling the input control module to input the weight data and input data stored in the input buffer module into the neural network processing unit;
controlling the neural network processing unit to compute on the input data using the weight data and to output a computation result;
controlling the output control module to store the computation result into the output buffer module;
and controlling the memory controller to read the computation result from the output buffer module and store it into the memory.
Further, when the processing level to be run is a pooling operation level, the operation control module controlling the input buffer module, the output buffer module, the input control module, the output control module, the neural network processing unit and the data reading module to complete that level's processing of the data comprises the following steps:
controlling the data reading module to read, through the memory controller, the input data corresponding to the pooling operation level stored in the memory, and to store it into the input buffer module;
controlling the input control module to divide the input data stored in the input buffer module into a plurality of pooling windows and to input the data from the pooling windows into the neural network processing unit in sequence;
controlling the neural network processing unit to perform max-pooling comparison on the input data and to output a comparison result;
controlling the output control module to store the comparison result into the output buffer module;
and controlling the memory controller to read the comparison result from the output buffer module and store it into the memory.
Further, when the processing level to be run is a connection operation level, the operation control module controlling the input buffer module, the output buffer module, the input control module, the output control module, the neural network processing unit and the data reading module to complete that level's processing of the data comprises the following steps:
determining the output data of the other processing layers that corresponds to the input data of the current processing layer;
configuring the storage addresses of the other processing layers' output data in the memory as the input addresses of the current processing layer's input data;
and controlling the data reading module to read, through the memory controller, the input data corresponding to the input addresses stored in the memory, and to store it into the input buffer module.
Further, when the processing level to be run is a reorganization operation level, the operation control module controlling the input buffer module, the output buffer module, the input control module, the output control module, the neural network processing unit and the data reading module to complete that level's processing of the data comprises the following steps:
controlling the data reading module to read, through the memory controller, the input data corresponding to the reorganization operation level stored in the memory, and to store it into the input buffer module;
controlling the input control module to input the input data stored in the input buffer module into the neural network processing unit;
controlling the neural network processing unit to perform the reorganization operation on the input data and to output a reorganization result;
controlling the output control module to store the reorganization result into the output buffer module;
controlling the memory controller to read the reorganization result from the output buffer module and store it into the memory;
and establishing a mapping between the storage address of the input data in the memory and the storage address of the reorganization result in the memory.
Further, when the processing level to be run is a classification operation level, the operation control module controlling the input buffer module, the output buffer module, the input control module, the output control module, the neural network processing unit and the data reading module to complete that level's processing of the data comprises the following steps:
controlling the data reading module to read, through the memory controller, the input data corresponding to the classification operation level stored in the memory, and to store it into the input buffer module;
controlling the input control module to input the input data stored in the input buffer module into the neural network processing unit as an input feature vector;
controlling the neural network processing unit to perform the classification computation on the input data and to output a detection result;
controlling the output control module to store the detection result into the output buffer module;
and controlling the memory controller to read the detection result from the output buffer module and output it.
Further, before the data to be processed is stored into the memory through the memory controller of the FPGA, the method further comprises:
judging whether the data to be processed meets the computational requirements of the convolutional neural network model to be implemented;
if not, performing normalization and/or bilinear interpolation on the data to be processed until the requirements are met;
and storing the processed data into the memory.
An FPGA-based convolutional neural network implementation apparatus, comprising:
the initialization module, configured to initialize the editable resources of the FPGA and generate an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module and an operation control module; the operation control module is configured to read the parameters of the state registers, determine the processing level to be run, and control the input buffer module, the output buffer module, the input control module, the output control module, the neural network processing unit and the data reading module so that the processing level to be run completes its processing of the data, until all processing levels of the convolutional neural network model to be implemented have run in sequence, and to output the processing result corresponding to the data to be processed;
the loading module, configured to load the weight data of each processing level of the convolutional neural network model into the memory of the FPGA, associate the state registers of the FPGA with the processing levels, and store the data to be processed into the memory through the memory controller of the FPGA.
Further, the neural network processing unit includes a plurality of processing units for processing data in parallel.
Further, the input buffer module comprises two input storage units, and the two input storage units are used for buffering input data and/or weight data of the neural network processing unit in a ping-pong double-buffering mode; and/or the output buffer module comprises two output storage units, and the two output storage units are used for buffering output data of the neural network processing unit in a ping-pong double-buffering mode.
Advantageous effects
The invention provides an FPGA-based convolutional neural network implementation method and device. The method comprises: initializing the editable resources of the FPGA to generate an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module and an operation control module; loading the weight data of each processing level of the convolutional neural network model to be implemented into the memory of the FPGA and associating state registers of the FPGA with the processing levels; storing the data to be processed into the memory through the memory controller of the FPGA; and finally having the operation control module read the parameters of the state registers, determine the processing level to be run, and control the input buffer module, the output buffer module, the input control module, the output control module, the neural network processing unit and the data reading module so that that level completes its processing of the data, until all processing levels of the convolutional neural network model to be implemented have run in sequence, whereupon the processing result corresponding to the data to be processed is output. Throughout this process the convolutional neural network is implemented by the FPGA hardware and no longer depends on software, which solves the problem that conventional convolutional neural network techniques rely on software implementations.
Drawings
Fig. 1 is a flowchart of a convolutional neural network implementation method according to a first embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a convolutional neural network implementation device according to a second embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a convolutional neural network implementation device according to a third embodiment of the present invention;
Fig. 4 is a schematic diagram of a convolutional neural network implementation method according to the third embodiment of the present invention;
Fig. 5 is a schematic diagram of a ping-pong double buffer according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a neural network processing unit according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of a max-pooling operation according to an embodiment of the present invention;
Fig. 8 is a schematic diagram of logic control according to an embodiment of the present invention;
Fig. 9 is a schematic diagram of data reorganization according to an embodiment of the present invention.
Detailed Description
The invention is applicable to all terminal devices equipped with FPGA chips, including PCs, mobile phones, tablets, self-service deposit machines and the like. The invention is described in further detail below with reference to the drawings by means of specific embodiments.
Embodiment 1:
Fig. 1 is a flowchart of the FPGA-based convolutional neural network implementation method according to the first embodiment of the present invention. Referring to fig. 1, the method of this embodiment comprises the following steps:
S101: and initializing editable resources of the FPGA to generate an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module and an operation control module.
The editable resources of the FPGA can be configured into any functional module as required. When the device is initialized, the editable resources of the FPGA are configured into the functional modules necessary to implement the convolutional neural network model, and the convolutional neural network function is then realized on this basis to process data.
In the invention, the input buffer module and the output buffer module both buffer data using a ping-pong double-buffering mechanism, and the neural network processing unit comprises a plurality of PEs (Processing Elements) that process data in parallel; these are described in Embodiment 3.
In the present invention, the neural network processing units are time-multiplexed, i.e. they play different roles at different processing levels.
S102: and loading weight data of each processing level in the convolutional neural network model to be realized into a memory storage of the FPGA, and associating a state register of the FPGA with the processing level.
A convolutional neural network model generally comprises a plurality of processing layers. The convolutional neural network model of the third embodiment of the invention comprises twenty-two convolution layers, five max-pooling layers, two connection layers, one reorganization layer, one classification layer and one preprocessing layer, thirty-two processing layers in total, which together process input picture data in real time and output detection results.
In order to identify the processing levels, the invention associates state registers with the processing levels. The association may be implemented either by providing a plurality of state registers, each corresponding to one processing level, or by providing a single state register that is updated in real time as the computation proceeds.
S103: and storing the data to be processed into a memory through a memory controller of the FPGA.
The data to be processed refers to data, such as image data, that is to be processed by the convolutional neural network.
Because the data that different convolutional neural network models can process is limited, before the data to be processed is stored into the memory through the memory controller of the FPGA, it must be judged whether the data meets the computational requirements of the convolutional neural network model to be implemented; if not, normalization and/or bilinear interpolation are applied to the data until the requirements are met, and the processed data is then stored into the memory.
S104: the operation control module reads the parameters of the state register, determines the to-be-operated processing level, controls the input buffer module, the output buffer module, the input control module, the output control module, the neural network processing unit and the data reading module to finish the processing of the to-be-operated processing level until the sequential operation of all the processing levels of the convolutional neural network model to be realized is finished, and outputs the processing result corresponding to the to-be-processed data.
The step mainly comprises the steps that an operation control module controls each functional module to process input data according to parameters of a state register, the convolutional neural network processing is completed, and specific implementation of the convolutional neural network processing is different for different kinds of processing layers.
When the processing level to be run is a convolution computation level, step S104 comprises:
controlling the data reading module to read, through the memory controller, the weight data and input data corresponding to the convolution computation level stored in the memory, and to store them into the input buffer module;
controlling the input control module to input the weight data and input data stored in the input buffer module into the neural network processing unit;
controlling the neural network processing unit to compute on the input data using the weight data and to output a computation result;
controlling the output control module to store the computation result into the output buffer module;
and controlling the memory controller to read the computation result from the output buffer module and store it into the memory.
When the processing level to be run is a pooling operation level, step S104 comprises:
controlling the data reading module to read, through the memory controller, the input data corresponding to the pooling operation level stored in the memory, and to store it into the input buffer module;
controlling the input control module to divide the input data stored in the input buffer module into a plurality of pooling windows and to input the data from the pooling windows into the neural network processing unit in sequence;
controlling the neural network processing unit to perform max-pooling comparison on the input data and to output a comparison result;
controlling the output control module to store the comparison result into the output buffer module;
and controlling the memory controller to read the comparison result from the output buffer module and store it into the memory.
When the processing level to be run is a connection operation level, step S104 comprises:
determining the output data of the other processing layers that corresponds to the input data of the current processing layer;
configuring the storage addresses of the other processing layers' output data in the memory as the input addresses of the current processing layer's input data;
and controlling the data reading module to read, through the memory controller, the input data corresponding to the input addresses stored in the memory, and to store it into the input buffer module.
When the processing level to be run is a reorganization operation level, step S104 comprises:
controlling the data reading module to read, through the memory controller, the input data corresponding to the reorganization operation level stored in the memory, and to store it into the input buffer module;
controlling the input control module to input the input data stored in the input buffer module into the neural network processing unit;
controlling the neural network processing unit to perform the reorganization operation on the input data and to output a reorganization result;
controlling the output control module to store the reorganization result into the output buffer module;
controlling the memory controller to read the reorganization result from the output buffer module and store it into the memory;
and establishing a mapping between the storage address of the input data in the memory and the storage address of the reorganization result in the memory.
When the processing level to be run is a classification operation level, step S104 comprises:
controlling the data reading module to read, through the memory controller, the input data corresponding to the classification operation level stored in the memory, and to store it into the input buffer module;
controlling the input control module to input the input data stored in the input buffer module into the neural network processing unit as an input feature vector;
controlling the neural network processing unit to perform the classification computation on the input data and to output a detection result;
controlling the output control module to store the detection result into the output buffer module;
and controlling the memory controller to read the detection result from the output buffer module and output it.
This embodiment provides an FPGA-based convolutional neural network implementation method. The method comprises: initializing the editable resources of the FPGA to generate an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module and an operation control module; loading the weight data of each processing level of the convolutional neural network model to be implemented into the memory of the FPGA and associating state registers of the FPGA with the processing levels; storing the data to be processed into the memory through the memory controller of the FPGA; and finally having the operation control module read the parameters of the state registers, determine the processing level to be run, and control the input buffer module, the output buffer module, the input control module, the output control module, the neural network processing unit and the data reading module so that that level completes its processing of the data, until all processing levels of the convolutional neural network model to be implemented have run in sequence, whereupon the processing result corresponding to the data to be processed is output. Throughout this process the convolutional neural network is implemented by the FPGA hardware and no longer depends on software, which solves the problem that conventional convolutional neural network techniques rely on software implementations.
Embodiment 2:
Fig. 2 is a schematic structural diagram of the convolutional neural network implementation device according to the second embodiment of the present invention. Referring to fig. 2, the convolutional neural network implementation device 2 of this embodiment comprises:
the initialization module 21, configured to initialize the editable resources of the FPGA and generate an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module and an operation control module; the operation control module is configured to read the parameters of the state registers, determine the processing level to be run, and control the input buffer module, the output buffer module, the input control module, the output control module, the neural network processing unit and the data reading module so that the processing level to be run completes its processing of the data, until all processing levels of the convolutional neural network model to be implemented have run in sequence, and to output the processing result corresponding to the data to be processed;
the loading module 22, configured to load the weight data of each processing level of the convolutional neural network model to be implemented into the memory of the FPGA, associate the state registers of the FPGA with the processing levels, and store the data to be processed into the memory through the memory controller of the FPGA.
Embodiment 3:
the present embodiment will be described taking an example of input data as a picture.
This embodiment implements a deep-learning convolutional neural network model in hardware; the platform is an SNPS-DX7 FPGA development board from Synopsys (Synopsys, Inc., USA). Specifically, the trained weight parameters of the convolutional neural network model are first loaded into the DDR (Double Data Rate synchronous dynamic random access memory) of the FPGA development board, and the input data is preprocessed in the preprocessing module and transferred into the DDR. The weight parameters and input data of the current network layer are then continuously fetched from the DDR by a DMA (Direct Memory Access) unit and delivered to the NPU (neural network processing unit) for parallel computation; the output data of each layer, which is the input of the next layer, is stored back into the DDR through the output buffer module. Finally, the data feature vectors for which all convolution operations have been completed are passed to the classification module to complete the feature classification computation.
Specifically, as shown in fig. 3, the apparatus provided in this embodiment comprises an input end A, an input end B, a preprocessing unit 301, a DDR controller 302 (i.e. the memory controller above), a DDR memory 303 (i.e. the memory above), a DMA unit 304 for reading and writing weight data, a buffer unit 305 for buffering weight data, a buffer unit 306 for buffering input data, a DMA unit 307 for reading and writing input data, an input control module 308, an NPU unit 309, an output control module 310, an output buffer module 311 for buffering output data, an operation control module 312 and a classification computation unit 313. The buffer unit 305 and the buffer unit 306 together form the input buffer module described above. The operation control module 312 comprises an instruction unit 3121, a decoder 3122 and a control logic unit 3123, where the instruction unit 3121 receives instruction data, the decoder 3122 decodes the instruction data, and the control logic unit 3123 outputs the corresponding control instructions according to the decoding result.
The DDR controller 302 controls the connection and data transfer between the DDR memory and the other modules, including storage control of the input data, read control when the DMA reads DDR data, storage control of the output data of the hardware computation, and read control of the final output feature-vector data.
The input buffer module and the output buffer module operate in ping-pong double-buffer mode. As shown in fig. 5, a buffer module comprises a first buffer unit 51, a second buffer unit 52 and a selection control unit 53. The selection control unit 53 selects which buffer unit input data is written to, controls which buffer unit outputs its buffered data, and outputs a flag signal that identifies the current input/output state of the two buffers. Specifically, the input buffer comprises a weight data buffer and an input data buffer. The weight data buffer uses ping-pong double buffering to cache the weight data, bias values and regularization parameters of the current layer: while the weight data of one buffer area participates in the computation, the DMA unit 304 loads data into the other buffer area, reducing the waiting time for data loading. Correspondingly, the input data buffer also uses ping-pong double buffering for the input data of the current layer: while one buffer area participates in the computation, the DMA unit 307 loads data into the other. The output data buffer likewise uses ping-pong double buffering: while the feature-map data computed by the NPU is being written into one buffer area, the other buffer area, already filled, writes its data into the DDR memory.
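To make the handshake concrete, the following minimal Python sketch models the two buffer areas and the flag signal of fig. 5; the class and method names are illustrative assumptions rather than identifiers used by the patent.

```python
class PingPongBuffer:
    """Minimal model of the ping-pong double buffer of fig. 5."""

    def __init__(self, depth):
        self.banks = [[None] * depth, [None] * depth]
        self.flag = 0  # flag signal: bank currently feeding the compute unit

    def load_bank(self):
        # Bank the DMA may fill while the other bank is being computed on.
        return self.banks[1 - self.flag]

    def compute_bank(self):
        # Bank currently read by the NPU.
        return self.banks[self.flag]

    def swap(self):
        # Issued when the NPU finishes a tile and the DMA load has completed.
        self.flag = 1 - self.flag
```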
As shown in fig. 6, the NPU unit 309 comprises P × P parallel processing elements PE (PE0 to PEn) that compute the multiply/add/subtract operations of the convolution process in parallel. The intermediate result of one channel's computation with the convolution kernel is stored in a temporary register; the result of the next channel's computation with the convolution kernel is added to the intermediate result and written back to the temporary register, and this repeats until all channels have been computed against the convolution kernel, after which the BN (Batch Normalization) operation is performed on the resulting data.
The batch-regularization (BN) operation follows the expression

y = γ · (x − μ) / √(σ² + ε) + β

where μ and σ² are the mean and variance of the input data, γ is a weight, β is a correction value, and ε is a constant that ensures numerical stability; the three parameters γ, β and ε are obtained through cloud training.
After the batch-regularized BN operation, the data passes through the activation function y = (x > 0) ? x : 0.1x; that is, y = x when x is greater than 0 and y = 0.1x otherwise, where x is the input of the NPU unit and y is its output.
Finally, the computed results, i.e. the new feature-map data, are stored into the DDR memory through the output buffer module.
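Taken together, the BN expression and the activation function above amount to the following element-wise computation, sketched here in Python with NumPy; the function names are assumptions for illustration.

```python
import numpy as np

def batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    # y = gamma * (x - mean) / sqrt(var + eps) + beta
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def leaky_activation(x):
    # y = (x > 0) ? x : 0.1x
    return np.where(x > 0.0, x, 0.1 * x)
```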
The input control module 308 and the output control module 310 route the data transfers between modules. Specifically, the input control module 308 rearranges the data in the data buffers according to the NPU unit's data-input interface and delivers the data to the correct input ports, and the output control module 310 rearranges the NPU's output data according to the output buffer's input interface and delivers it to the correct ports.
The operation control module 312 controls the logic state of the whole system: by reading the current state register it determines which level of the deep convolutional neural network is currently being computed, and it then executes the logic-control instructions of the corresponding state and steers the movement of the data.
As shown in fig. 4, the method provided in this embodiment includes the following steps:
s401: and acquiring weight data of the model, and loading the weight data into the DDR memory.
And acquiring weight data of the trained deep convolutional neural network from the cloud, and loading the weight data into the DDR memory of the FPGA development board through the USB.
Specifically, a yolo (a deep-learning detection algorithm) convolutional neural network is trained with GPU (Graphics Processing Unit) acceleration, the weight data of the trained face-detection model is obtained, and the weight data is loaded into the DDR memory of the FPGA development board through USB (Universal Serial Bus).
S402: the input data is preprocessed and stored in the DDR memory.
The method comprises the following steps: carrying out normalization processing on input data to enable the input data to meet calculation requirements; carrying out bilinear interpolation processing on input data to enable the picture size to meet the calculation requirement; the preprocessed input data is stored in the DDR memory.
Specifically, the input picture data is subjected to normalization preprocessing, the gray value is divided by 255 to be normalized to be between 0 and 1, the size of the input picture data is rearranged to 416 x 416 by adopting a bilinear interpolation method, the input picture size requirement of the yolo convolutional neural network is met, and then the input picture data is stored in a DDR memory.
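A minimal NumPy sketch of this preprocessing step for a single-channel picture follows; the function name and the edge-pixel handling are illustrative assumptions.

```python
import numpy as np

def preprocess(img, size=416):
    """Normalize gray values to [0, 1] and resize to size x size bilinearly."""
    img = img.astype(np.float32) / 255.0
    h, w = img.shape
    ys = np.linspace(0.0, h - 1.0, size)
    xs = np.linspace(0.0, w - 1.0, size)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    # Bilinear blend of the four neighbouring source pixels.
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy
```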
S403: the current status register is read and the corresponding processing hierarchy is determined.
And reading the current state register, judging which layer of the calculation depth convolutional neural network the current state is in, and executing a logic control instruction under the corresponding state to control the operation of data. Defining n state registers R0, R1, …, R (n-1), and Rn, wherein each register stores state data corresponding to the current layer, which means that the whole deep convolutional neural network needs to operate R0 to Rn common n layers of network, and control logic reads the registers according to the sequence, executes a logic control function of the corresponding layer, controls the flow direction of the whole hardware data, and completes the calculation of the deep convolutional neural network.
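The register-driven sequencing can be pictured with the short Python sketch below; the layer-state fields and the dispatch table are assumptions made for illustration, not the patent's instruction encoding.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class LayerState:
    layer_type: str        # "conv", "pool", "route", "reorg" or "classify"
    params: dict = field(default_factory=dict)  # per-layer configuration

def run_network(states: List[LayerState],
                dispatch: Dict[str, Callable[[LayerState], None]]) -> None:
    # Read the state registers R0..Rn in order and execute the control
    # logic of the corresponding layer, as in steps S403/S404.
    for state in states:
        dispatch[state.layer_type](state)
```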
S404: and calling the corresponding processing hierarchy to process the data.
When the current state register corresponds to convolution calculation, the convolution calculation operation is executed, and the step comprises the following steps: loading weight parameters of a convolution layer and input data into a parallel convolution processing unit PE, setting a floating point number (32 bit)/fixed point number (16 bit) matrix with the weight parameters of k, wherein the input data is a floating point number (32 bit)/fixed point number (16 bit) matrix with the weight parameters of a, the sliding step length is 1, and the number of the parallel convolution processing units PE is P, so that the convolution sum of P input data and the weight can be calculated simultaneously; the convolution layer calculation comprises multiplication and accumulation operation of weight and input data, batch regularization BN calculation operation, offset addition and activation function activation, one featuremap is obtained after the input data of one convolution kernel and a plurality of input channels are calculated, and then the next convolution kernel and the input data are calculated after the convolution kernel and the input data are stored in a memory until the calculation of one layer of the deep convolution neural network is completed.
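The following NumPy sketch is a sequential reference model of this convolution level (the hardware computes P windows in parallel, and BN is omitted here for brevity); the shapes and names are illustrative assumptions.

```python
import numpy as np

def conv_level(x, w, bias):
    """Reference model of one convolution level.
    x: (C, A, A) input, w: (F, C, k, k) kernels, stride 1, no padding."""
    C, A, _ = x.shape
    F, _, k, _ = w.shape
    n = A - k + 1
    out = np.empty((F, n, n), dtype=np.float32)
    for f in range(F):
        acc = np.zeros((n, n), dtype=np.float32)      # temporary register
        for c in range(C):                            # accumulate per channel
            for i in range(n):
                for j in range(n):
                    acc[i, j] += np.sum(x[c, i:i+k, j:j+k] * w[f, c])
        acc += bias[f]                                # bias addition
        out[f] = np.where(acc > 0, acc, 0.1 * acc)    # leaky activation
    return out
```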
When the current state register corresponds to a pooling operation, the max-pooling operation is executed. This step comprises: the input data is an A × A floating-point/fixed-point matrix, the sliding stride is s, and the number of parallel convolution processing elements is P × P; using the max-pooling operation, the input data is divided into (A/s) × (A/s) pooling windows, P × P input values at corresponding positions are loaded from the pooling windows each time, and after s × s cycles P × P max-pooling results are output.
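In NumPy, the same window maxima that the PE array produces in parallel can be computed as follows, assuming the window size and the sliding stride are both s as described.

```python
import numpy as np

def max_pool(x, s):
    """Divide an (A, A) input into (A//s) x (A//s) windows of size s x s
    and keep the maximum of each window."""
    n = x.shape[0] // s
    return x[:n * s, :n * s].reshape(n, s, n, s).max(axis=(1, 3))
```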
When the current state register corresponds to a connection operation, the connection-layer operation is executed. The connection layer takes the output data of one or two previously computed layers as the input data of the current layer, so it suffices to reload the DDR addresses of the earlier layers' output data as the input addresses of the current layer's input data to complete the connection-layer operation.
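Because the connection layer only re-points addresses, it reduces to a few lines; the dictionary-based address book in this Python sketch is an illustrative assumption.

```python
def route_layer(layer_outputs, source_ids):
    """Connection (route) level: no data is copied; the DDR addresses of
    earlier layers' outputs become the input addresses of the current layer.
    layer_outputs: dict mapping layer id -> DDR address of its output.
    source_ids: the one or two earlier layers this layer reads from."""
    return [layer_outputs[i] for i in source_ids]
```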
When the current state register corresponds to a reorganization-layer operation, the reorganization layer splits and reorganizes the current layer. With the original input data of size 2h × 2w × 2c and a stride of 2, the reorganization-layer operation yields an output feature map of h × w × 8c; an address-mapping unit must be added to map the original addresses to the new addresses at which the data is stored, and this data serves as the input data of the next layer.
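One space-to-depth mapping consistent with the 2h × 2w × 2c to h × w × 8c reshape is sketched below in NumPy; the exact ordering of the 2 × 2 blocks within the new channels is an assumption, since the patent fixes it through its address-mapping unit.

```python
import numpy as np

def reorg(x, stride=2):
    """Reorganization level: (2h, 2w, 2c) -> (h, w, 8c) for stride 2."""
    H, W, C = x.shape
    h, w = H // stride, W // stride
    x = x.reshape(h, stride, w, stride, C)
    x = x.transpose(0, 2, 1, 3, 4)   # gather each stride x stride block
    return x.reshape(h, w, stride * stride * C)
```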
S405: judging whether the network layer calculation is finished, if so, executing step S406, otherwise, returning to step S403.
When the current status register is the classified layer calculation, determining that the network layer calculation is finished, and executing step S406; when the current status register is not the classification layer calculation, it is determined that the network layer calculation is not ended, and step S403 is executed.
S406: and executing the classification layer computing operation and outputting a result.
The classification layer is calculated by taking the calculation results after the previous operations of each convolution layer, pooling layer, connecting layer, recombination layer and the like as the input feature vectors of the layer, obtaining the detection results through classification calculation and outputting the detection results.
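The patent describes this level only as a classification computation over the collected feature vectors; as a purely illustrative stand-in (not the actual detection head of the embodiment), a fully-connected projection followed by a softmax could look like this:

```python
import numpy as np

def classify(features, w, b):
    """features: concatenated feature vector; w: (classes, dim); b: (classes,)."""
    scores = w @ features + b
    e = np.exp(scores - scores.max())   # numerically stable softmax
    return e / e.sum()
```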
This embodiment implements a complex deep convolutional neural network in FPGA hardware, so that a deep convolutional neural network model that is highly dependent on powerful cloud computing can run on a local terminal; real-time data processing no longer depends on the network, which solves the problem that large, complex deep convolutional neural networks cannot run on hardware terminals. At the same time, the invention can process deep convolutional neural networks with more complex structures and more network layers, adapts to current deep-learning algorithms, and can process convolution, pooling, connection and reorganization layers. Compared with previous methods, the convolution layer can handle the batch-regularization BN operation and the leaky activation function, and the connection and reorganization layers are added, so the approach is at the forefront of the field.
Furthermore, this embodiment can process input graphic data and weight data in floating point (32-bit) or fixed point (16-bit); by switching the internal multiply, add and subtract units between floating-point and fixed-point operation, deep-learning algorithm models of different data types can be processed, giving high flexibility. Converting a deep-learning algorithm from the floating-point data type to the hardware fixed-point data type reduces the volume of weight and intermediate-result data without greatly changing the computational accuracy.
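One possible float-to-fixed conversion is sketched below; the Q8.8-style split between integer and fractional bits is an assumption, as the embodiment only specifies a 16-bit fixed-point word.

```python
import numpy as np

def to_fixed16(x, frac_bits=8):
    """Quantize float32 data to 16-bit fixed point with frac_bits fractional bits."""
    scaled = np.round(np.asarray(x, dtype=np.float32) * (1 << frac_bits))
    return np.clip(scaled, -32768, 32767).astype(np.int16)

def from_fixed16(q, frac_bits=8):
    """Recover an approximate float32 value from the fixed-point word."""
    return q.astype(np.float32) / (1 << frac_bits)
```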
Furthermore, in this embodiment the input image data or video-frame data is processed directly by the preprocessing module, the processed data is fed into the convolutional neural network for the layer-by-layer operations, the data produced by the network layers is input as the feature vectors of the classification layer, the final classification detection computation is performed, and the result is output, completing real-time detection of faces in images or video frames. The whole process runs in local FPGA hardware without networking; compared with traditional CPU and GPU schemes the power consumption is greatly reduced, and the configuration is more flexible in adapting to current deep-learning algorithms.
For example, a yolo convolutional neural network model with twenty-two convolution layers, five max-pooling layers, two connection layers, one reorganization layer, one classification layer and one preprocessing module performs real-time processing of input picture data and outputs detection results. The input picture is resized to 416 × 416 after preprocessing, the convolution kernel sizes are 3 × 3 and 1 × 1, and the pooling-layer stride is 2 × 2. The yolo convolutional neural network is used to train a face-detection model in the cloud with GPU acceleration, and picture data is input.
For the max-pooling operation, as shown in fig. 7: the input data is an A × A floating-point/fixed-point matrix, the sliding stride is s, and the number of parallel convolution processing elements PE is P × P. The max-pooling operation divides the input data into (A/s) × (A/s) pooling windows; P × P input values at corresponding positions are loaded from the pooling windows each time, and after s × s cycles P × P max-pooling results are output. This is the max-pooling procedure.
As shown in fig. 8, the logic control states of the operation control module include:
(1) read_reg, the state-register logic: the whole deep convolutional neural network runs the layers R0, R1, …, R(n-1), Rn; the state data of every network layer is stored in the state registers R0 through Rn, and the whole network is run by reading the current state-register values in sequence.
(2) conv, the convolution-operation control logic: the convolution operation is completed through the logic-control states data preparation (idle), data initialization (init), data operation (datamode), batch-regularization operation (BN), activation function (Active) and data output (output).
(3) pool, the max-pooling control logic: the max-pooling operation is completed through the logic-control states data preparation (idle), data initialization (init), maximum comparison (MAX), temporary-value write (write) and data output (output).
(4) route, the connection-layer control logic: through address data preparation (idle) and address loading (Load addr), the DDR address of an earlier layer's output data is used as the input address of the current layer's input data, completing the connection-layer operation.
(5) reorder, the reorganization-layer control logic: after address data preparation (idle), address computation (Count addr) and mapping (reorder), the data is rearranged; as shown in fig. 9, input data of 2h × 2w × 2c is mapped into new data of h × w × 8c as the input data of the next layer.
The present invention also provides a computer-readable storage medium storing one or more programs which, when executed, implement the steps of the methods provided by all embodiments of the present invention.
As can be seen from the implementation of the above embodiments, the present invention has the following advantages:
The invention provides an FPGA-based convolutional neural network implementation method and device. The method comprises: initializing the editable resources of the FPGA to generate an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module and an operation control module; loading the weight data of each processing level of the convolutional neural network model to be implemented into the memory of the FPGA and associating state registers of the FPGA with the processing levels; storing the data to be processed into the memory through the memory controller of the FPGA; and finally having the operation control module read the parameters of the state registers, determine the processing level to be run, and control the input buffer module, the output buffer module, the input control module, the output control module, the neural network processing unit and the data reading module so that that level completes its processing of the data, until all processing levels of the convolutional neural network model to be implemented have run in sequence, whereupon the processing result corresponding to the data to be processed is output. Throughout this process the convolutional neural network is implemented by the FPGA hardware and no longer depends on software, which solves the problem that conventional convolutional neural network techniques rely on software implementations.
The foregoing is a further detailed description of the invention in connection with specific embodiments, and the invention is not to be considered limited to this description. Those skilled in the art may make several simple deductions or substitutions without departing from the spirit of the invention, and these should be considered within the scope of the invention.

Claims (10)

1. An FPGA-based convolutional neural network implementation method, characterized by comprising the following steps:
initializing editable resources of the FPGA to generate an input cache module, an output cache module, an input control module, an output control module, a neural network processing unit, a data reading module and an operation control module;
loading weight data of each processing level in a convolutional neural network model to be realized into a memory storage of the FPGA, and associating a state register of the FPGA with the processing level;
storing data to be processed into the memory through a memory controller of the FPGA;
and the operation control module reads the parameters of the state register, determines a to-be-operated processing level, controls the input buffer module, the output buffer module, the input control module, the output control module, the neural network processing unit and the data reading module to finish the processing of the to-be-operated processing level on the data until the sequential operation of all the processing levels of the to-be-realized convolutional neural network model is finished, and outputs a processing result corresponding to the to-be-processed data.
2. The convolutional neural network implementation method of claim 1, wherein when the processing hierarchy to be operated is a convolutional computation hierarchy, the operation control module controlling the input buffer module, the output buffer module, the input control module, the output control module, the neural network processing unit, and the data reading module to complete the processing of the data by the processing hierarchy to be operated comprises:
the data reading module is controlled to read weight data and input data corresponding to the convolution computation level stored in the memory through the memory controller, and the weight data and the input data are stored in the input cache module;
the input control module is controlled to input the weight data and the input data stored by the input buffer module into the neural network processing unit;
controlling the neural network processing unit to calculate the input data by using the weight data, and outputting a calculation result;
controlling the output control module to store the calculation result into the output buffer module;
and controlling the memory controller to read the calculation result in the output buffer module and store the calculation result into the memory.
3. The convolutional neural network implementation method of claim 1, wherein when the processing hierarchy to be operated is a pooled operation hierarchy, the operation control module controlling the input buffer module, the output buffer module, the input control module, the output control module, the neural network processing unit, and the data reading module to complete the processing of the data by the processing hierarchy to be operated comprises:
controlling the data reading module to read the input data corresponding to the pooling operation level stored in the memory storage through the memory controller, and storing the input data into the input cache module;
the input control module is controlled to divide the input data stored by the input cache module into a plurality of pooling windows, and the input data are sequentially input into the neural network processing unit from the pooling windows;
controlling the neural network processing unit to carry out maximum pooling comparison on input data and outputting a comparison result;
controlling the output control module to store the comparison result into the output buffer module;
and controlling the memory controller to read the comparison result in the output buffer module and store the comparison result into the memory.
4. The convolutional neural network implementation method of claim 1, wherein when the processing hierarchy to be operated is a link operation hierarchy, the operation control module controls the input buffer module, the output buffer module, the input control module, the output control module, the neural network processing unit, and the data reading module to complete the processing of the data by the processing hierarchy to be operated, including:
determining output data of other processing layers corresponding to the input data of the current processing layer;
the storage addresses of the output data of the other processing layers in the memory are configured as the input addresses of the input data of the current processing layer;
and controlling the data reading module to read the input data corresponding to the input address stored in the memory through the memory controller, and storing the input data into the input cache module.
5. The convolutional neural network implementation method of claim 1, wherein when the processing hierarchy to be operated is a reorganization operation hierarchy, the operation control module controls the input buffer module, the output buffer module, the input control module, the output control module, the neural network processing unit, and the data reading module to complete the processing of the data by the processing hierarchy to be operated, including:
Controlling the data reading module to read the input data corresponding to the reorganization operation level stored in the memory storage through the memory controller, and storing the input data into the input cache module;
the input control module is controlled to input the input data stored by the input buffer module into the neural network processing unit;
controlling the neural network processing unit to carry out recombination operation on the input data and outputting a recombination result;
controlling the output control module to store the reorganization result into the output buffer module;
controlling the memory controller to read the reorganization result in the output buffer module and store the reorganization result into the memory;
and establishing a mapping between the storage address of the input data in the memory storage and the storage address of the recombination result in the memory storage.
6. The convolutional neural network implementation method of claim 1, wherein when the processing hierarchy to be operated is a classification operation hierarchy, the operation control module controlling the input buffer module, the output buffer module, the input control module, the output control module, the neural network processing unit, and the data reading module to complete the processing of the data by the processing hierarchy to be operated comprises:
Controlling the data reading module to read input data corresponding to the classification operation level stored in the memory storage through the memory controller, and storing the input data into the input cache module;
the input control module is controlled to input the input data stored by the input buffer module as an input characteristic vector into the neural network processing unit;
controlling the neural network processing unit to perform classification calculation on the input data and outputting a detection result;
controlling the output control module to store the detection result into the output buffer module;
and controlling the memory controller to read the detection result in the output buffer module and output the detection result.
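Note: claim 6 leaves the classification calculation abstract; a fully-connected layer followed by softmax is the standard realization, sketched below (the function name and shapes are assumptions):

    import numpy as np

    def classify(feature_vector: np.ndarray, weights: np.ndarray, bias: np.ndarray):
        """Fully-connected layer plus softmax; the argmax index serves as the
        detection result that the output control module stores and outputs."""
        logits = weights @ feature_vector + bias
        exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
        probs = exp / exp.sum()
        return int(np.argmax(probs)), probs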
7. The convolutional neural network implementation method of any one of claims 1-6, further comprising, before storing the data to be processed into the memory through the memory controller of the FPGA:
determining whether the data to be processed meets the computation requirements of the convolutional neural network model to be implemented;
if not, performing normalization and/or bilinear interpolation on the data to be processed until the computation requirements are met;
and storing the processed data into the memory.
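Note: claim 7's preprocessing can be read as "resize with bilinear interpolation, then normalize". A minimal Python sketch under that reading, for a single-channel image (the target size and the 255.0 scale are assumptions):

    import numpy as np

    def preprocess(image, target_hw, scale=255.0):
        """Bilinear-resize an (H, W) image to target_hw, then normalize to [0, 1]."""
        h, w = image.shape
        th, tw = target_hw
        ys, xs = np.linspace(0, h - 1, th), np.linspace(0, w - 1, tw)
        y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
        y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
        wy, wx = (ys - y0)[:, None], (xs - x0)[None, :]
        # Weighted average of the four neighbouring source pixels.
        top = image[y0][:, x0] * (1 - wx) + image[y0][:, x1] * wx
        bottom = image[y1][:, x0] * (1 - wx) + image[y1][:, x1] * wx
        return (top * (1 - wy) + bottom * wy) / scale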
8. An FPGA-based convolutional neural network implementation device, comprising:
an initialization module, configured to initialize the editable resources of the FPGA and generate an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module, and an operation control module;
a loading module, configured to load the weight data of each processing level in the convolutional neural network model to be implemented into the memory of the FPGA, associate a state register of the FPGA with the processing levels, and store the data to be processed into the memory through the memory controller of the FPGA;
wherein the operation control module is configured to read the parameter of the state register, determine the processing level to be executed, and control the input buffer module, the output buffer module, the input control module, the output control module, the neural network processing unit, and the data reading module to complete the processing of the data by that level, until all processing levels of the convolutional neural network model to be implemented have run in sequence, and to output a processing result corresponding to the data to be processed.
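Note: a hedged software model of the device of claim 8, showing how the operation control module could walk the state register through the processing levels; every name here is illustrative, not an interface from the patent:

    # Each level object wraps one processing level; the state register selects
    # the next level until the whole model has run, mirroring claim 8's control flow.
    def run_model(levels, state_register, memory):
        result = None
        while state_register.current < len(levels):
            level = levels[state_register.current]      # processing level to be executed
            inputs = level.fetch(memory)                # data reading module + input buffer
            outputs = level.compute(inputs)             # neural network processing unit
            result = level.write_back(memory, outputs)  # output buffer + memory controller
            state_register.current += 1                 # advance the state register
        return result                                   # result for the data to be processed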
9. The convolutional neural network implementation device of claim 8, wherein the neural network processing unit comprises a plurality of processing units configured to process data in parallel.
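Note: the parallelism of claim 9 can be modeled in software by sharding the convolution kernels across processing units; the thread pool below merely stands in for the FPGA's truly concurrent units, and the convolve callable is assumed:

    from concurrent.futures import ThreadPoolExecutor

    def parallel_convolve(kernel_shards, feature_map, convolve, n_units=8):
        """Each processing unit computes the output channels for its shard of
        kernels, shortening the channel loop by roughly a factor of n_units."""
        with ThreadPoolExecutor(max_workers=n_units) as pool:
            futures = [pool.submit(convolve, feature_map, k) for k in kernel_shards]
            return [f.result() for f in futures]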
10. The convolutional neural network implementation device of claim 8 or 9, wherein the input buffer module comprises two input storage units configured to buffer the input data and/or weight data of the neural network processing unit in a ping-pong double-buffering manner; and/or the output buffer module comprises two output storage units configured to buffer the output data of the neural network processing unit in a ping-pong double-buffering manner.
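Note: a minimal model of the ping-pong double buffering of claim 10. In hardware the fill and drain proceed concurrently; this sequential Python sketch shows only the role swapping between the two storage units:

    def ping_pong(tiles, load, compute):
        """Alternate two storage units: while the processing unit drains one,
        the memory controller fills the other with the next tile."""
        units = [None, None]
        fill, drain = 0, 1
        units[drain] = load(tiles[0])              # prime the first storage unit
        results = []
        for nxt in tiles[1:]:
            units[fill] = load(nxt)                # memory side fills one unit
            results.append(compute(units[drain]))  # PE side drains the other
            fill, drain = drain, fill              # swap the two roles
        results.append(compute(units[drain]))      # the final tile has no successor
        return results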
CN201810074941.8A 2017-12-29 2018-01-25 FPGA-based convolutional neural network implementation method and device Active CN108416422B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711485144 2017-12-29
CN2017114851440 2017-12-29

Publications (2)

Publication Number Publication Date
CN108416422A CN108416422A (en) 2018-08-17
CN108416422B true CN108416422B (en) 2024-03-01

Family

ID=63126240

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810074941.8A Active CN108416422B (en) 2017-12-29 2018-01-25 FPGA-based convolutional neural network implementation method and device

Country Status (2)

Country Link
CN (1) CN108416422B (en)
WO (1) WO2019127838A1 (en)

Families Citing this family (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110874605A (en) * 2018-08-31 2020-03-10 北京嘉楠捷思信息技术有限公司 Image recognition processing method and device
CN109214506B (en) * 2018-09-13 2022-04-15 深思考人工智能机器人科技(北京)有限公司 Convolutional neural network establishing device and method based on pixels
CN109272113B (en) * 2018-09-13 2022-04-19 深思考人工智能机器人科技(北京)有限公司 Convolutional neural network establishing device and method based on channel
CN110929855B (en) * 2018-09-20 2023-12-12 合肥君正科技有限公司 Data interaction method and device
CN109272109B (en) * 2018-10-30 2020-07-17 北京地平线机器人技术研发有限公司 Instruction scheduling method and device of neural network model
CN109446996B (en) * 2018-10-31 2021-01-22 智慧眼科技股份有限公司 Face recognition data processing device and method based on FPGA
CN109542513B (en) * 2018-11-21 2023-04-21 山东浪潮科学研究院有限公司 Convolutional neural network instruction data storage system and method
CN109740732B (en) * 2018-12-27 2021-05-11 深圳云天励飞技术有限公司 Neural network processor, convolutional neural network data multiplexing method and related equipment
CN109948789A (en) * 2019-03-21 2019-06-28 百度在线网络技术(北京)有限公司 Data load method and device for convolutional neural networks
CN110032374B (en) * 2019-03-21 2023-04-07 深兰科技(上海)有限公司 Parameter extraction method, device, equipment and medium
CN109919312B (en) * 2019-03-29 2021-04-23 北京智芯微电子科技有限公司 Operation method and device of convolutional neural network and DPU
CN110058943B (en) * 2019-04-12 2021-09-21 三星(中国)半导体有限公司 Memory optimization method and device for electronic device
CN110097174B (en) * 2019-04-22 2021-04-20 西安交通大学 Method, system and device for realizing convolutional neural network based on FPGA and row output priority
CN110110850A (en) * 2019-04-29 2019-08-09 山东浪潮人工智能研究院有限公司 FPGA-based implementation method for forward and backward reusable processing units
CN110378470B (en) * 2019-07-19 2023-08-18 Oppo广东移动通信有限公司 Optimization method and device for neural network model and computer storage medium
CN110636221A (en) * 2019-09-23 2019-12-31 天津天地人和企业管理咨询有限公司 System and method for super frame rate of sensor based on FPGA
CN110738317A (en) * 2019-10-17 2020-01-31 中国科学院上海高等研究院 FPGA-based deformable convolution network operation method, device and system
CN112749778B (en) * 2019-10-29 2023-11-28 北京灵汐科技有限公司 Neural network mapping method and device under strong synchronization
CN112784952B (en) * 2019-11-04 2024-03-19 珠海格力电器股份有限公司 Convolutional neural network operation system, method and equipment
CN110826507B (en) * 2019-11-11 2022-08-23 北京百度网讯科技有限公司 Face detection method, device, equipment and storage medium
CN112819022B (en) * 2019-11-18 2023-11-07 同方威视技术股份有限公司 Image recognition device and image recognition method based on neural network
CN111126309A (en) * 2019-12-26 2020-05-08 长沙海格北斗信息技术有限公司 Convolutional neural network architecture method based on FPGA and face recognition method thereof
CN113111995A (en) * 2020-01-09 2021-07-13 北京君正集成电路股份有限公司 Method for shortening model reasoning and model post-processing operation time
CN111260050B (en) * 2020-01-19 2023-03-07 中国电子科技集团公司信息科学研究院 Method and device for controlling convolutional neural network to process data
CN111416743B (en) * 2020-03-19 2021-09-03 华中科技大学 Convolutional network accelerator, configuration method and computer readable storage medium
CN111427838B (en) * 2020-03-30 2022-06-21 电子科技大学 Classification system and method for dynamically updating convolutional neural network based on ZYNQ
CN115380292A (en) * 2020-04-03 2022-11-22 北京希姆计算科技有限公司 Data storage management device and processing core
CN111445420B (en) * 2020-04-09 2023-06-06 北京爱芯科技有限公司 Image operation method and device of convolutional neural network and electronic equipment
CN113673664B (en) * 2020-05-14 2023-09-12 杭州海康威视数字技术股份有限公司 Data overflow detection method, device, equipment and storage medium
CN111783971B (en) * 2020-07-02 2024-04-09 上海赛昉科技有限公司 Highly flexibly configurable data post-processor for deep neural network
CN111931925B (en) * 2020-08-10 2024-02-09 西安电子科技大学 Acceleration system of binary neural network based on FPGA
CN112070217B (en) * 2020-10-15 2023-06-06 天津大学 Internal storage bandwidth optimization method of convolutional neural network accelerator
CN112270252A (en) * 2020-10-26 2021-01-26 西安工程大学 Multi-vehicle target identification method for improving YOLOv2 model
CN112434635B (en) * 2020-12-02 2024-02-09 深圳龙岗智能视听研究院 Convolutional neural network feature extraction method, system, embedded device and medium
CN112541583A (en) * 2020-12-16 2021-03-23 华中光电技术研究所(中国船舶重工集团公司第七一七研究所) Neural network accelerator
CN112766478B (en) * 2021-01-21 2024-04-12 中国电子科技集团公司信息科学研究院 FPGA (field programmable Gate array) pipeline structure oriented to convolutional neural network
CN113222107A (en) * 2021-03-09 2021-08-06 北京大学 Data processing method, device, equipment and storage medium
CN112990157B (en) * 2021-05-13 2021-08-20 南京广捷智能科技有限公司 Image target identification acceleration system based on FPGA
CN113379047B (en) * 2021-05-25 2024-04-05 北京微芯智通科技合伙企业(有限合伙) System and method for realizing convolutional neural network processing
CN117112452B (en) * 2023-08-24 2024-04-02 上海合芯数字科技有限公司 Register simulation configuration method, device, computer equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10942671B2 (en) * 2016-04-25 2021-03-09 Huawei Technologies Co., Ltd. Systems, methods and devices for a multistage sequential data process
CN106250939B (en) * 2016-07-30 2020-07-24 复旦大学 Handwritten character recognition method based on FPGA + ARM multilayer convolutional neural network
CN106875012B (en) * 2017-02-09 2019-09-20 武汉魅瞳科技有限公司 A kind of streamlined acceleration system of the depth convolutional neural networks based on FPGA

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017185418A1 (en) * 2016-04-29 2017-11-02 北京中科寒武纪科技有限公司 Device and method for performing neural network computation and matrix/vector computation
CN106228238A (en) * 2016-07-27 2016-12-14 中国科学技术大学苏州研究院 Method and system for accelerating deep learning algorithms on a field programmable gate array platform
CN106228240A (en) * 2016-07-30 2016-12-14 复旦大学 FPGA-based deep convolutional neural network implementation method
CN106940815A (en) * 2017-02-13 2017-07-11 西安交通大学 A programmable convolutional neural network crypto coprocessor IP core
CN106959937A (en) * 2017-03-30 2017-07-18 中国人民解放军国防科学技术大学 A vectorized implementation method for convolution matrices oriented to GPDSP

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A high performance FPGA-based accelerator for large-scale convolutional neural networks; Huimin Li et al.; 2016 26th International Conference on Field Programmable Logic and Applications; pp. 1-9 *
Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks; Chen Zhang et al.; Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays; pp. 161-170 *
FPGA Implementation of a Convolutional Code Encoder and Viterbi Decoder; Sun Lei; Information Technology; Vol. 27, No. 10; pp. 7-9, 22 *
FPGA-based Implementation of the Softmax Layer of a Convolutional Neural Network; Li Li et al.; Modern Computer (Professional Edition); pp. 21-24 *

Also Published As

Publication number Publication date
CN108416422A (en) 2018-08-17
WO2019127838A1 (en) 2019-07-04

Similar Documents

Publication Publication Date Title
CN108416422B (en) FPGA-based convolutional neural network implementation method and device
US11263007B2 (en) Convolutional neural network hardware acceleration device, convolutional calculation method, and storage medium
CN109543832B (en) Computing device and board card
CN107832843B (en) Information processing method and related product
CN110097174B (en) Method, system and device for realizing convolutional neural network based on FPGA and row output priority
US9411726B2 (en) Low power computation architecture
CN109522052B (en) Computing device and board card
CN107169563B (en) Processing system and method applied to two-value weight convolutional network
CN111310904B (en) Apparatus and method for performing convolutional neural network training
KR102470264B1 (en) Apparatus and method for performing reverse training of a fully-connected layer neural network
CN107341542B (en) Apparatus and method for performing recurrent neural networks and LSTM operations
CN108229671B (en) System and method for reducing storage bandwidth requirement of external data of accelerator
CN108629406B (en) Arithmetic device for convolutional neural network
CN107766079B (en) Processor and method for executing instructions on processor
CN111105023A (en) Data stream reconstruction method and reconfigurable data stream processor
US20230196113A1 (en) Neural network training under memory restraint
CN109711540B (en) Computing device and board card
CN110232665B (en) Maximum pooling method and device, computer equipment and storage medium
CN111488976B (en) Neural network computing device, neural network computing method and related products
CN111488963A (en) Neural network computing device and method
CN111368967B (en) Neural network computing device and method
US20220108203A1 (en) Machine learning hardware accelerator
CN115222028A (en) One-dimensional CNN-LSTM acceleration platform based on FPGA and implementation method
CN112396072B (en) Image classification acceleration method and device based on ASIC (application specific integrated circuit) and VGG16
CN111368987B (en) Neural network computing device and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant