CN118014022A - Deep learning-oriented FPGA universal heterogeneous acceleration method and equipment - Google Patents

Deep learning-oriented FPGA universal heterogeneous acceleration method and equipment

Info

Publication number
CN118014022A
CN118014022A (Application No. CN202410125333.0A)
Authority
CN
China
Prior art keywords
fpga
data
layer
neural network
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410125333.0A
Other languages
Chinese (zh)
Inventor
陈栋
田宗浩
石胜斌
陈凯
张晓龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PLA Army Academy of Artillery and Air Defense
Original Assignee
PLA Army Academy of Artillery and Air Defense
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by PLA Army Academy of Artillery and Air Defense filed Critical PLA Army Academy of Artillery and Air Defense
Priority to CN202410125333.0A priority Critical patent/CN118014022A/en
Publication of CN118014022A publication Critical patent/CN118014022A/en
Pending legal-status Critical Current

Landscapes

  • Advance Control (AREA)

Abstract

According to the deep learning-oriented FPGA universal heterogeneous acceleration method and device, a convolutional neural network is used as the basic structure, each layer operation is implemented as a hardware acceleration design on the FPGA, and these basic hardware acceleration operators are used to optimize the design of neural networks with different structures; each hardware acceleration operator is independent of the others and independently configured, and exchanges data and instructions with the memory and I/O peripherals through a standard AXI bus interface; for data layers with dependency relationships, a custom advanced operator is designed by fusing several operators, reducing the latency caused by data access; the method fully analyzes the computing characteristics of each link in the deep learning algorithm, constructs a simple and efficient heterogeneous system suitable for embedded hardware platforms, realizes heterogeneous partitioning of different tasks, efficiently exploits the computing advantages of the heterogeneous platform, meets the real-time, low-power and other requirements of embedded platforms, and reduces the difficulty of deploying software algorithms on hardware heterogeneous platforms.

Description

Deep learning-oriented FPGA universal heterogeneous acceleration method and equipment
Technical Field
The invention relates to the technical field of heterogeneous acceleration, in particular to a deep learning-oriented FPGA universal heterogeneous acceleration method and equipment.
Background
In recent years, deep learning algorithms have rapidly spread to different scenarios, but their performance depends on the high-performance computing capability of a host computer, and they often perform poorly on embedded devices such as small application platforms. With advances in semiconductor manufacturing processes, the computing capability of embedded hardware platforms has greatly increased, and heterogeneous processors bring new opportunities for deploying software algorithms at the edge; in particular, the FPGA hardware architecture offers high computing performance and flexible configurability, giving it unique advantages for embedded platforms with low-resource, low-power and high real-time requirements. However, FPGA heterogeneous design is oriented to circuit structures, which, for software engineers lacking hardware knowledge, increases the difficulty of porting software algorithms to a heterogeneous platform; especially for high-performance deep learning algorithms, this porting difficulty clearly raises the threshold for converting software algorithms into engineering applications and cannot keep pace with the rapid update and upgrade of deep learning algorithms.
Disclosure of Invention
The invention provides a deep learning-oriented FPGA universal heterogeneous acceleration method, equipment and a storage medium, which can at least solve one of the technical problems in the background technology.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
The FPGA universal heterogeneous acceleration method for deep learning takes a convolutional neural network as the basic structure, implements each layer operation as a hardware acceleration design on the FPGA, and uses basic hardware acceleration operators to optimize the design of neural networks with different structures;
the configuration space bus interface is a synchronous, low-bandwidth, low-power 32-bit control bus for the CPU to access the UDLA configuration registers; UDLA acts as a slave on the CSB interface, implementing a simple interface protocol that can be converted to AMBA, OCP or any other system bus through a shim layer;
The interrupt interface asserts an interrupt line when a task is completed or an error occurs, sending an interrupt to the management processor to report completion, after which the management processor starts the process again;
The command-execute-interrupt process is repeated until the inference of the entire network is completed;
The data backbone interface connects UDLA to the main system memory subsystem and is a synchronous, high-speed, highly configurable data bus that can be specified with different address sizes and data sizes and can issue requests of different sizes according to system requirements;
DBB is an AXI-like interface protocol used in AXI-compliant systems;
Each hardware acceleration operator is independent of the others and independently configured, and exchanges data and instructions with the memory and I/O peripherals through a standard AXI bus interface; for data layers with dependency relationships, a custom advanced operator is designed by fusing several operators, reducing the latency caused by data access;
when the hardware acceleration operators run independently, each functional block is configured according to its execution time and mode, and each functional block executes the task assigned to it;
Independent operation begins and ends with the assigned block performing memory-to-memory operations, into and out of main system memory or dedicated SRAM;
and in fused operation, some blocks are assembled into a pipeline, improving performance by bypassing memory data reads and writes.
Further, the method comprises the following steps,
S1, designing a hardware acceleration operator based on HLS;
s2, constructing a configurable deep learning accelerator IP core based on the step S1;
s3, designing a heterogeneous acceleration method facing deep learning based on S2;
s4, finally comprehensively designing a heterogeneous acceleration method.
In yet another aspect, the invention also discloses a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method as described above.
In yet another aspect, the invention also discloses a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method as above.
According to the technical scheme, the computing characteristics of each link in the deep learning algorithm are fully analyzed, a simple and efficient heterogeneous system suitable for the embedded hardware platform is constructed, heterogeneous division of different tasks is realized, a complex software algorithm with large computing capacity and high parallelism is transplanted to the heterogeneous platform for hardware acceleration design, the computing advantages of the heterogeneous platform are efficiently utilized, the requirements of the embedded platform on real-time performance, low power consumption and the like are met, and the difficulty of deployment of the software algorithm on the hardware heterogeneous platform is reduced.
The invention provides a deep learning-oriented general heterogeneous accelerator design method (Universal Deep Learning Acceleration, UDLA), which, through the modular design of each hardware operator in the accelerator, provides a standard method for porting different deep learning models to an FPGA (field programmable gate array), with strong extensibility and high configurability.
Deep learning algorithms are mostly data-intensive computations, and their hardware implementation consumes a large amount of DSP and BRAM resources. By analyzing the structural characteristics of the convolutional neural network, each unit is reasonably allocated to the PS and PL of the heterogeneous platform; HLS is used to optimize the design of hardware acceleration operators on the PL side; the running efficiency and resource occupation of each unit are analyzed; and a universal accelerator UDLA based on FPGA hardware acceleration is established. For different deep learning algorithms, the accelerator ports deep learning networks with different structures to the embedded heterogeneous platform by combining hardware acceleration operators, achieving an optimal design in terms of resources, timing and power consumption.
Specifically, the invention has the following advantages:
1) A general deep learning accelerator UDLA is designed; various hardware acceleration operators are designed using HLS, and flexible configuration of each operator in UDLA is achieved through optimization means such as parameter quantization, loop unrolling and pipelining;
2) Constructing a configurable deep learning accelerator IP core, providing a convenient method for the deep learning model to be rapidly deployed on an embedded heterogeneous platform, reducing the difficulty of transplanting a software algorithm to a hardware platform, and shortening the engineering conversion period;
3) Taking optimal speed and optimal resources as constraints, two neural network acceleration design schemes are provided, offering a scheme for dynamically balancing FPGA resource consumption and timing when deploying deep learning on a heterogeneous platform.
Drawings
FIG. 1 is a schematic diagram of a heterogeneous software/hardware system according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a convolutional neural network architecture according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a generic deep learning accelerator according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a loop-unrolling hardware structure according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of loop pipelining according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a convolutional HLS code according to an embodiment of the present invention;
FIG. 8 is a diagram illustrating ReLU activation function code according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a Softmax activation function code according to an embodiment of the present invention;
FIG. 10 is a diagram illustrating a UDLA code architecture according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of a deep learning FPGA heterogeneous acceleration design flow chart according to an embodiment of the invention;
FIG. 12 is a schematic diagram of a speed-optimized neural network according to an embodiment of the present invention;
Fig. 13 is a schematic diagram of a structure of a scale-optimized neural network according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention.
As shown in fig. 1, in the FPGA general heterogeneous acceleration method for deep learning according to this embodiment, a heterogeneous system is formed by combining computing/processing units with different architectures; each processing unit cooperatively completes computing tasks according to task attributes, so as to fully exploit the computing performance of each unit and improve task execution efficiency, thereby improving the computing performance of the system.
FIG. 1 shows a generic FPGA software/hardware heterogeneous system architecture. In the figure, the PL part targets the high-speed logic, arithmetic and data-flow subsystems in the system design, while the PS part mainly targets the system's application programs, control instructions, operating system, and functional interfaces that interact with the underlying hardware; PL and PS are interconnected through the high-bandwidth, low-latency Advanced eXtensible Interface (AXI), reducing interface data transmission overhead.
HLS (High Level Synthesis) converts functions written in high-level languages such as C, C++ or SystemC into RTL code without requiring attention to low-level RTL details. It provides complete AXI interfaces, making it convenient to insert an IP into the PL side of the heterogeneous platform and realize high-speed data and instruction interaction between PS and PL. Collaborative hardware/software development of the FPGA can thus be carried out at the system level, fixed-point and floating-point operations of arbitrary precision can be performed, algorithm performance can be tuned through algorithm optimization or tool directives, and throughput, latency and power consumption can be dynamically optimized at the architecture level.
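As a concrete illustration of this PS-PL interaction, the following minimal HLS sketch (an assumption for illustration, not code from the patent; function and port names are invented) shows how array arguments of a C function are mapped to an AXI4 master port and its scalar arguments to an AXI4-Lite control port, so the generated IP can be attached to the PS-PL interconnect:

```cpp
// Minimal HLS sketch (illustrative assumption, not the patent's code): array
// arguments become an AXI4 master port for bulk data; scalar arguments and the
// block-level control become an AXI4-Lite register interface driven by the PS.
extern "C" void vec_scale(const float *in, float *out, float gain, int len) {
#pragma HLS INTERFACE m_axi     port=in     offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi     port=out    offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=gain   bundle=control
#pragma HLS INTERFACE s_axilite port=len    bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
    for (int i = 0; i < len; ++i) {
        out[i] = gain * in[i];  // simple element-wise computation offloaded to the PL
    }
}
```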
The neural network is the basic structure of a deep learning algorithm and can be understood as connections between different layers (as shown in fig. 3): the output of the previous layer is taken as the input of the current layer, and the result of the current layer's calculation is taken as the input of the next layer; each layer has different parameters and operations, and layers are connected either sequentially head-to-tail or through skip connections.
The convolution layer, pooling layer, fully connected layer and activation function are the basic components of a convolutional neural network; different network architectures differ in convolution kernels, pooling modes, network depth, activation functions, and so on. The invention takes the convolutional neural network (CNN) as the basic structure, implements each layer operation as a hardware acceleration design on the FPGA, and uses these basic hardware acceleration operators to optimize neural networks with different structures.
Deploying a deep learning algorithm on an embedded heterogeneous platform only requires designing the neural network inference process, i.e., predicting or classifying input data with a trained neural network (offline model). Most of this work is mathematical operation, and some of its characteristics are particularly suitable for hardware implementation, such as convolution and pooling operations. The UDLA module uses HLS modular design to implement the operations suited to the FPGA, so that their latency and resource consumption are optimally configured, as shown in fig. 4;
In fig. 4, the configuration space bus (Configure Space Bus, CSB) interface is a synchronous, low-bandwidth, low-power, 32-bit control bus for the CPU to access the UDLA configuration registers. UDLA acts as a slave on the CSB interface, implementing a very simple interface protocol that can easily be converted to AMBA, OCP or any other system bus through a simple shim layer. The interrupt interface (Interrupt Interface) asserts the interrupt line when a task is completed or an error occurs and issues an interrupt to the management processor to report completion; the management processor then begins the process again. The command-execute-interrupt flow is repeated until the inference of the entire network is completed. The data backbone interface (Data BackBone, DBB) connects UDLA to the main system memory subsystem; it is a synchronous, high-speed and highly configurable data bus that can be specified with different address sizes and data sizes and can issue requests of different sizes according to system requirements. DBB is a simple interface protocol similar to AXI and can easily be used in AXI-compliant systems.
Each hardware acceleration operator is independent of the others and independently configured, and exchanges data and instructions with the memory and I/O peripherals through a standard AXI bus interface; for data layers with dependency relationships, a custom advanced operator is designed by fusing several operators, reducing the latency caused by data access. When the hardware acceleration operators run independently, each functional block is configured according to its execution time and mode, and each functional block executes the task assigned to it. Independent operation begins and ends with the assigned block performing memory-to-memory operations, into and out of main system memory or dedicated SRAM. Fused operation is similar to independent operation, but some blocks are assembled into a pipeline, improving performance by bypassing memory data reads and writes and instead having the blocks communicate with each other through small FIFOs; that is, the convolution acceleration core passes data to the pooling acceleration core, which passes data to the fully connected acceleration core, and so on.
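A minimal sketch of this fusion idea is given below (an assumption for illustration, not the patent's implementation; stage names and the FIFO depth are invented): the HLS DATAFLOW directive lets a convolution stage and a pooling stage run as a pipeline connected by a small on-chip FIFO, so the intermediate result bypasses external memory.

```cpp
#include <hls_stream.h>

// Placeholder stages (illustrative only): each reads/writes a small FIFO stream
// instead of external memory; real operators would perform convolution/pooling.
static void conv_stage(const float *in, hls::stream<float> &out, int n) {
    for (int i = 0; i < n; ++i) out.write(in[i] * 0.5f);   // stand-in for convolution
}
static void pool_stage(hls::stream<float> &in, float *out, int n) {
    for (int i = 0; i < n; ++i) out[i] = in.read();        // stand-in for pooling
}

void fused_conv_pool(const float *in, float *out, int n) {
#pragma HLS DATAFLOW
    hls::stream<float> fifo("conv_to_pool");
#pragma HLS STREAM variable=fifo depth=64   // small FIFO linking the two blocks
    conv_stage(in, fifo, n);
    pool_stage(fifo, out, n);
}
```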
As shown in fig. 2, the deep learning-oriented FPGA universal heterogeneous acceleration method in this embodiment specifically includes the following steps:
s1, designing a hardware acceleration operator based on HLS;
s2, constructing a configurable deep learning accelerator IP core based on the step S1;
s3, designing a heterogeneous acceleration method facing deep learning based on S2;
s4, finally comprehensively designing a heterogeneous acceleration method.
The following are respectively specified:
(1) Hardware acceleration operator design based on HLS
The optimal design of each hardware acceleration operator in the general deep learning accelerator UDLA is directly related to the running speed of the whole system, and the bottlenecks mainly lie in two aspects: computation amount and data transmission. Analysis of each network architecture shows that a trained network involves a large number of floating-point operations and that each layer contains a large number of loop operations, which bring large hardware resource consumption and latency; parameter quantization and loop pipelining are therefore two important measures for compressing the model size and reducing latency.
(I) Parameter quantization
The weights and activation values obtained from deep learning training are usually 32-bit single-precision floating-point data, while the internal resources of the FPGA heterogeneous platform are limited. Within an acceptable range of precision loss and task latency, parameter quantization can greatly reduce the huge resource consumption and power consumption caused by floating-point operations, reduce the occupation of registers and BRAM, and greatly improve the inference speed of the neural network.
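As an illustration (an assumption, not the patent's code), the sketch below uses the ap_fixed type from Vivado HLS to represent weights with 16 total bits; the chosen bit widths are example values that would in practice be selected after verifying the accuracy loss:

```cpp
#include <ap_fixed.h>   // Vivado HLS arbitrary-precision fixed-point types

// Illustrative 16-bit fixed-point weight type: 8 integer bits, 8 fractional bits
// (widths are assumptions chosen for this example, not values from the patent).
typedef ap_fixed<16, 8> weight_t;

void quantize_weights(const float *w_float, weight_t *w_fixed, int n) {
    for (int i = 0; i < n; ++i) {
        // conversion follows the type's quantization/overflow modes
        // (truncation and wrap-around by default)
        w_fixed[i] = (weight_t)w_float[i];
    }
}
```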
(II) Loop unrolling
The calculation processes of the convolution layer, pooling layer, fully connected layer and the like are essentially nested multi-loop structures whose innermost level consists of multiply-accumulate operations, so optimizing the loop structure inside the FPGA is very important. Loop unrolling allocates N copies of the corresponding computing resources for the operations in the loop and expands the computation into N parallel instances, improving the data processing capacity of a single clock cycle. If the number of loop iterations equals N, the loop is said to be fully unrolled, and every iteration of the loop starts and ends at the same time, so the running speed is greatly increased at the cost of more system resources, as shown in fig. 5;
In fig. 5, the loop is fully unrolled and the convolution operation is expanded from serial to parallel execution, greatly reducing the time spent on data storage and access and improving execution efficiency, but consuming more hardware resources.
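A minimal HLS sketch of loop unrolling is shown below (an illustrative assumption; N and the operation are example choices). The UNROLL directive replicates the multiply-accumulate hardware so all N iterations execute in parallel, mirroring the fully unrolled convolution of fig. 5:

```cpp
#define N 8   // number of loop iterations / unroll copies (example value)

// Fully unrolled multiply-accumulate: HLS instantiates N multipliers and an
// adder tree, so all iterations start and end together at the cost of resources.
float mac_unrolled(const float a[N], const float b[N]) {
#pragma HLS ARRAY_PARTITION variable=a complete
#pragma HLS ARRAY_PARTITION variable=b complete
    float acc = 0.0f;
    for (int i = 0; i < N; ++i) {
#pragma HLS UNROLL
        acc += a[i] * b[i];
    }
    return acc;
}
```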
(III) pipeline design
Pipelining can be applied to loops and functions: the next operation does not need to wait for the previous operation to complete, but can start as soon as the previous operation has freed its initial resources. One cycle of the pipelined process is shown in fig. 6.
As shown in fig. 6, the most important role of loop pipelining is to reduce the initiation interval (II), the number of clock cycles a function or loop must wait before it can accept new input data. Without pipelining, each new loop iteration must wait three clock cycles before starting (II = 3), so three iterations take 9 clock cycles in total; after pipelining is added, each new iteration only needs to wait one clock cycle (II = 1), and the time to execute three iterations is reduced to 4 clock cycles. Loop pipelining introduces pipeline registers and reuses the units executing operations, so its resource consumption is higher than without pipelining but far lower than that of loop unrolling, while its latency is also lower than the non-pipelined case.
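The sketch below illustrates the directive form of this idea (an assumption, not the patent's code): PIPELINE II=1 asks the tool to start a new loop iteration every clock cycle, overlapping the read, add and write stages instead of running iterations back to back:

```cpp
// Loop pipelining sketch: without the pragma, iteration i+1 waits for iteration
// i to finish; with PIPELINE II=1 a new iteration is launched every clock cycle.
void add_offset(const float *in, float *out, float offset, int len) {
    for (int i = 0; i < len; ++i) {
#pragma HLS PIPELINE II=1   // target initiation interval of one clock cycle
        out[i] = in[i] + offset;
    }
}
```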
Therefore, when designing an acceleration operator, the relationship between operation speed and resource consumption is balanced by setting loop unrolling and pipelining reasonably, greatly improving the performance of the hardware acceleration operator. Because different networks differ greatly in data size and speed requirements, the interface of each hardware acceleration core consists of input data, output data and configuration parameters; when designing UDLA, the invention treats loop unrolling and pipelining as configurable options, so that they can be configured reasonably according to hardware platform resources and task timing requirements during hardware acceleration design.
(IV) convolutional layer hardware acceleration operator design
Considering the universality of the convolution layer, the interface parameters of the hardware acceleration operator are designed generically using HLS to enhance operator reusability; the interface variables are shown in Table 1:
TABLE 1 convolutional layer acceleration operator interface configurable parameters
The HLS code of the convolution layer is shown in fig. 7. In fig. 7, the red box marks the variable information of the convolution layer operation, which can be obtained by parsing the trained model; the blue box marks the convolution kernel operation; and the green box marks the convolution layer function. weight holds the specific values of the convolution kernel, input is the input original image (without padding), output is the output image of the convolution operation, and bias is the bias coefficient. The convolution kernel weights are copied into the weight_buf buffer using the memcpy function, and the convolution layer calculation is completed by calling the convolution kernel calculation function conv1.
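Since fig. 7 is reproduced only as an image, the following hedged sketch reconstructs the general pattern described above (all sizes, and every name other than weight, input, output, bias, weight_buf, memcpy and conv1, are assumptions): the wrapper stages the kernel weights on chip with memcpy and then calls the kernel computation function.

```cpp
#include <cstring>   // memcpy

#define K     3      // convolution kernel size (assumed)
#define IN_H  32     // input image height (assumed)
#define IN_W  32     // input image width  (assumed)
#define OUT_H (IN_H - K + 1)
#define OUT_W (IN_W - K + 1)

// Kernel computation: single channel, stride 1, no padding (assumptions).
static void conv1(const float input[IN_H][IN_W], float weight_buf[K][K],
                  float bias, float output[OUT_H][OUT_W]) {
    for (int r = 0; r < OUT_H; ++r) {
        for (int c = 0; c < OUT_W; ++c) {
#pragma HLS PIPELINE II=1
            float acc = bias;
            for (int i = 0; i < K; ++i)
                for (int j = 0; j < K; ++j)
#pragma HLS UNROLL
                    acc += weight_buf[i][j] * input[r + i][c + j];
            output[r][c] = acc;
        }
    }
}

// Layer wrapper: stage the kernel weights on chip, then run the kernel function.
void conv_layer(const float *weight, const float input[IN_H][IN_W],
                float bias, float output[OUT_H][OUT_W]) {
    float weight_buf[K][K];
    std::memcpy(weight_buf, weight, sizeof(weight_buf));   // copy weights on chip
    conv1(input, weight_buf, bias, output);
}
```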
(V) Fully connected layer hardware acceleration operator design
When the previous layer is a fully connected layer, the fully connected calculation can be converted into a convolution with a 1×1 kernel; when the previous layer is a convolution layer, the fully connected calculation can be converted into a global convolution with an h×w kernel, where h and w are the height and width of the previous layer's convolution result, respectively. The core operation of the fully connected layer is the matrix-vector product:
y=W×x (1)
Considering the universality of the fully connected layer, the interface parameters to be designed are shown in Table 2:
table 2 pooling layer interface configurable parameters
The weight data of the fully connected layer accounts for 95% of the storage of the entire neural network; for this reason, the weight data must be specified to be stored in BRAM instead of in registers, which increases the network data reading delay because data in BRAM must be read by addressing.
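A hedged sketch of this operator is given below (sizes, names and the specific RESOURCE directive used to request BRAM storage are assumptions for illustration): the weights are staged into an on-chip buffer bound to BRAM, and the layer then computes the matrix-vector product y = W×x plus bias.

```cpp
#define FC_IN  128   // input vector length  (assumed)
#define FC_OUT 10    // output vector length (assumed)

void fc_layer(const float x[FC_IN], const float w_ext[FC_OUT * FC_IN],
              const float bias[FC_OUT], float y[FC_OUT]) {
    float W[FC_OUT][FC_IN];
#pragma HLS RESOURCE variable=W core=RAM_2P_BRAM   // keep weights in BRAM, not registers

    // stage the weight matrix into on-chip BRAM (read by addressing, as noted above)
    for (int o = 0; o < FC_OUT; ++o)
        for (int i = 0; i < FC_IN; ++i)
            W[o][i] = w_ext[o * FC_IN + i];

    // matrix-vector product y = W*x + bias
    for (int o = 0; o < FC_OUT; ++o) {
        float acc = bias[o];
        for (int i = 0; i < FC_IN; ++i) {
#pragma HLS PIPELINE
            acc += W[o][i] * x[i];
        }
        y[o] = acc;
    }
}
```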
(VI) Activation function hardware acceleration operator design
Different activation functions used in CNNs consume different hardware resources in the FPGA implementation. For example, the ReLU function f(x) = max(0, x) is very simple to implement; the input and output have the same data structure, and the acceleration kernel code is shown in fig. 8;
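For completeness, a minimal version of such a ReLU acceleration kernel is sketched below (illustrative; the pointer interface and loop form are assumptions), keeping the input and output data layout identical:

```cpp
// ReLU acceleration kernel sketch: f(x) = max(0, x), applied element by element.
void relu(const float *in, float *out, int len) {
    for (int i = 0; i < len; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = (in[i] > 0.0f) ? in[i] : 0.0f;
    }
}
```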
The Softmax function, widely used in multinomial logistic regression and linear discriminant analysis, takes as input the results of K different linear functions; the probability that the sample vector x belongs to the j-th class is:
P(y=j|x) = exp(x^T w_j) / Σ_{k=1}^{K} exp(x^T w_k) (2)
The input and output of the function have the same data structure, and the hardware acceleration design is shown in fig. 9;
The Softmax activation function involves multiple exponential calculations, which consume significant hardware resources in the FPGA. Therefore, the exponential results are pre-computed into a look-up table (LUT) during network initialization, and during function calculation the corresponding exponential value is looked up directly in the table according to the input value, so that only one look-up is needed in the FPGA. In addition, since the Softmax function is far less computationally intensive than the convolution and fully connected layers, it can also be processed on the ARM side where heterogeneous support allows, saving memory resources on the FPGA.
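The following sketch illustrates the look-up-table idea (table size, input range and the final normalization step are assumptions for illustration, not the patent's parameters): exp() is evaluated once over a quantized input range at initialization, and inference then performs only table look-ups.

```cpp
#include <cmath>

#define LUT_SIZE 1024        // number of table entries (assumed)
#define X_MIN    (-8.0f)     // covered input range     (assumed)
#define X_MAX    (8.0f)

static float exp_lut[LUT_SIZE];

// Run once at network initialization: precompute exp() over the input range.
void init_exp_lut() {
    for (int i = 0; i < LUT_SIZE; ++i) {
        float x = X_MIN + (X_MAX - X_MIN) * i / (LUT_SIZE - 1);
        exp_lut[i] = std::exp(x);
    }
}

// On the FPGA only this single table look-up is performed per input value.
static inline float exp_approx(float x) {
    if (x < X_MIN) x = X_MIN;
    if (x > X_MAX) x = X_MAX;
    int idx = (int)((x - X_MIN) / (X_MAX - X_MIN) * (LUT_SIZE - 1));
    return exp_lut[idx];
}

// Softmax over k scores using the precomputed table.
void softmax_lut(const float *in, float *out, int k) {
    float sum = 0.0f;
    for (int j = 0; j < k; ++j) { out[j] = exp_approx(in[j]); sum += out[j]; }
    for (int j = 0; j < k; ++j) out[j] /= sum;
}
```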
In addition to the ReLU and Softmax functions, activation and normalization functions come in many other forms, such as Sigmoid, Tanh, Leaky-ReLU, P-ReLU, R-ReLU and log-likelihood; these activation functions can be added to Vivado HLS as basic units of the neural network.
In order to make UDLA better suited to different network architectures, the hardware acceleration operator library in UDLA can be enriched with the basic units found in different networks, meeting UDLA's requirements of universality, flexibility and extensibility, for example: LRN layer acceleration cores, tensor transforms, residual unit acceleration cores, batch normalization, and so on.
(2) Building configurable deep learning accelerator IP cores
The configurable deep learning accelerator IP core is the core link for flexibly accelerating a deep learning algorithm on the FPGA heterogeneous platform. When the running speed of the hardware acceleration IP core corresponding to the network cannot meet the requirement, the network is optimized with loop unrolling and pipelining; if the optimized running speed still does not meet the requirement, a further reduction of the weight parameter data length must be considered. After the speed of the model meets the requirement, it is checked whether the FPGA resources occupied by the model are reasonable. If the current use of FPGA resources needs to be reduced, the weight parameter data length must be reduced and the accuracy of the neural network model re-verified. Once the resource occupation and running speed of the network meet the requirements, the converted RTL program can be simulated and verified through Vivado HLS, and the corresponding neural network IP acceleration core is generated. The HLS code structure of the configurable deep learning accelerator IP core is shown in fig. 10;
In fig. 10, nn_project.cpp: this is the main program; its main function is to describe the composition of the neural network, i.e., the basic units used by the network and how they are connected.
definition.h: this file is only used to define the most important global parameters, which specify the data type of each network layer and the data sizes of the input and output layers of the neural network.
nn_utils: this file is the implementation of the basic units that make up the neural network. Part of the code is common code and does not change with the neural network structure to be ported; the versatility of the overall architecture can be increased by continuously adding basic network units that implement different functions.
parameters.h: the configuration of each layer in the neural network is stored in this file. In currently popular neural network frameworks such as Keras, ONNX and TensorFlow, the configuration of each layer is recorded in a fixed format in a JSON-like file, so a script that automatically converts a neural network description file from these frameworks into this file is necessary.
weights: this is a set of header files defining the weight parameters of each layer. Similar to parameters.h, popular open-source neural network frameworks typically use h5 files to record weight parameters. Because the weight parameters often involve a large amount of data, a corresponding automatic conversion script can effectively reduce the workload of neural network migration.
(3) Deep learning-oriented heterogeneous acceleration design
UDLA provides a simple, flexible and powerful inference acceleration solution, allowing many high-performance deep learning algorithms to be conveniently deployed on an FPGA heterogeneous platform. Each acceleration operator in the UDLA accelerator is configured independently; for example, a system that does not require pooling can completely remove the pooling acceleration core, and a system that requires other convolution capabilities can expand the capabilities of the convolution unit without modifying the other units in the accelerator. The scheduling of each unit is delegated to a coprocessor or CPU, which operates on a very fine-grained scheduling boundary and issues the configuration of one hardware acceleration core together with an activate command; at the same time, a corresponding double buffer is set for each acceleration core according to whether there is a dependency between data, ensuring a parallel pipeline design among the hardware acceleration cores, as shown in fig. 11;
After the deep learning algorithm is trained and compressed, a high-precision offline model is obtained, and the function and structure of the network are determined by analyzing the offline model. In order to reduce the resource pressure on the FPGA heterogeneous platform, the weight parameters generally need to be quantized first, converting the 32-bit single-precision floating-point data produced by network training into 16-bit or 8-bit fixed-point numbers, and the compressed neural network model is then verified to ensure that the network precision meets the requirement. Next, the network characteristics of the offline model are fully analyzed, the performance of each hardware acceleration operator in UDLA is taken into account, the hardware acceleration IP core of the neural network model is built in Vivado HLS, and the composition and invocation timing of each hardware acceleration operator are controlled through the PS.
(4) Heterogeneous acceleration scheme design
When optimizing the FPGA implementation of a neural network, the key is to optimize the matrix multiplications that account for more than 99% of the computation and storage resources of the whole network. The invention designs two configurable heterogeneous neural network acceleration schemes: a speed-optimal neural network architecture and a scale-optimal neural network architecture.
(I) Speed-optimal neural network
When FPGA resources are sufficient, the speed-maximized neural network architecture performs all calculations in the FPGA, with parameters and data cached in registers on the FPGA. The FPGA does not exchange data with other peripherals. Each layer in the neural network is instantiated separately with its own parameters, with no multiplexing; the network architecture is shown in fig. 12;
Matrix operations in the speed-optimal neural network structure use expanded vector inner-product operations; all matrix multiply and add operations of each layer's acceleration operator are implemented with the hardware resources of the on-chip PL side, giving short latency but high hardware resource consumption, which suits the acceleration design of small-scale neural networks.
(II) Scale-optimal neural network
In order for the FPGA to run as large a neural network as possible, a large-scale matrix multiplication network is designed around a systolic array computation process, followed by operations such as pooling and activation functions, with the specific algorithm selected under PS control. The weight parameters are sent by the PS to the matrix multiplication module via BRAM. Inter-layer data enters the DDR through DMA for caching. The final Softmax obtains data from the DDR via the PS; the network architecture is shown in fig. 13;
In fig. 13, the matrix operation adopts the systolic array idea, combining matrix multiplication and accumulation into one module, which is also very flexible to implement on the FPGA. Depending on their size, the weight parameters may be stored in DDR DRAM or BRAM and are passed to the matrix multiplication computation unit by the PS through an AXI-FIFO, or by instantiating BRAM as dual-port RAM. When BRAM resources are plentiful, the dual-port RAM implemented with BRAM can satisfy most inter-layer buffering needs, while DDR DRAM can use DMA scatter/gather mode to avoid PS involvement, achieving more efficient data throughput and making the operation of large-scale neural networks possible.
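As a simplified illustration of reusing a single multiply-accumulate module in this scheme (an assumption for illustration; a full systolic array with explicit PE-to-PE data movement is more involved), the sketch below shows a fixed-size tile multiplier that can be invoked repeatedly for every layer, with operands staged into partitioned on-chip buffers:

```cpp
#define TILE 16   // tile dimension of the reusable matrix-multiply block (assumed)

// One fixed-size matrix-multiply-accumulate tile: C += A * B. The same block is
// reused for every layer, with the PS streaming in the appropriate A/B tiles.
void matmul_tile(const float A[TILE][TILE], const float B[TILE][TILE],
                 float C[TILE][TILE]) {
#pragma HLS ARRAY_PARTITION variable=A complete dim=2   // expose a full row of A
#pragma HLS ARRAY_PARTITION variable=B complete dim=1   // expose a full column of B
    for (int i = 0; i < TILE; ++i) {
        for (int j = 0; j < TILE; ++j) {
#pragma HLS PIPELINE II=1
            float acc = 0.0f;
            for (int k = 0; k < TILE; ++k) {
#pragma HLS UNROLL                                      // TILE parallel MACs
                acc += A[i][k] * B[k][j];
            }
            C[i][j] += acc;                             // accumulate into the output tile
        }
    }
}
```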
When a heterogeneous platform deploys a specific neural network, the implementation of each layer can be flexibly configured as needed. For layers with a large computation scale, a matrix multiplication module containing a large number of multipliers is synthesized; resource consumption is reduced by reusing this module, raising the maximum scale of neural network the FPGA can run. For layers with a small computation scale, the weight parameters are stored directly in FPGA registers, and multiple registers are accessed within one clock cycle to achieve highly concurrent multiplication, greatly improving the network running speed. Combining the two modes achieves optimal utilization of FPGA resources.
In yet another aspect, the invention also discloses a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method as described above.
In yet another aspect, the invention also discloses a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method as above.
In yet another embodiment of the present application, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform any of the deep learning oriented FPGA generic heterogeneous acceleration methods of the above embodiments.
It may be understood that the system provided by the embodiment of the present invention corresponds to the method provided by the embodiment of the present invention, and explanation, examples and beneficial effects of the related content may refer to corresponding parts in the above method.
The embodiment of the application also provides an electronic device, which comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus,
A memory for storing a computer program;
and the processor is used for realizing the FPGA universal heterogeneous acceleration method facing the deep learning when executing the programs stored in the memory.
The communication bus mentioned for the above electronic device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus may be classified as an address bus, a data bus, a control bus, and so on.
The communication interface is used for communication between the electronic device and other devices.
The memory may include Random Access Memory (RAM) or Non-Volatile Memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In the above embodiments, the implementation may be in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center containing an integration of one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), etc.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. The FPGA universal heterogeneous acceleration method for deep learning is characterized in that a convolutional neural network is taken as the basic structure, each layer operation is implemented as a hardware acceleration design on the FPGA, and basic hardware acceleration operators are used to optimize the design of neural networks with different structures;
the configuration space bus interface is a synchronous, low-bandwidth, low-power 32-bit control bus for the CPU to access the UDLA configuration registers; UDLA acts as a slave on the CSB interface, implementing a simple interface protocol that can be converted to AMBA, OCP or any other system bus through a shim layer;
The interrupt interface asserts an interrupt line when a task is completed or an error occurs, sending an interrupt to the management processor to report completion, after which the management processor starts the process again;
The command-execute-interrupt process is repeated until the inference of the entire network is completed;
The data backbone interface connects UDLA to the main system memory subsystem and is a synchronous, high-speed, highly configurable data bus that can be specified with different address sizes and data sizes and can issue requests of different sizes according to system requirements;
DBB is an AXI-like interface protocol used in AXI-compliant systems;
Each hardware acceleration operator is independent of the others and independently configured, and exchanges data and instructions with the memory and I/O peripherals through a standard AXI bus interface; for data layers with dependency relationships, a custom advanced operator is designed by fusing several operators, reducing the latency caused by data access;
when the hardware acceleration operators run independently, each functional block is configured according to its execution time and mode, and each functional block executes the task assigned to it;
Independent operation begins and ends with the assigned block performing memory-to-memory operations, into and out of main system memory or dedicated SRAM;
and in fused operation, some blocks are assembled into a pipeline, improving performance by bypassing memory data reads and writes.
2. The deep learning-oriented FPGA generic heterogeneous acceleration method according to claim 1, characterized in that: comprises the steps of,
S1, designing a hardware acceleration operator based on HLS;
s2, constructing a configurable deep learning accelerator IP core based on the step S1;
s3, designing a heterogeneous acceleration method facing deep learning based on S2;
s4, finally comprehensively designing a heterogeneous acceleration method.
3. The deep learning-oriented FPGA generic heterogeneous acceleration method according to claim 2, characterized in that: the S1, designing a hardware acceleration operator based on HLS, comprising the following steps,
S11, quantifying parameters;
S12, loop unrolling; N copies of the corresponding computing resources are allocated for the operations in the loop and the computation is expanded into N parallel instances, improving the data processing capacity of a single clock cycle; if the number of loop iterations equals N, the loop is said to be fully unrolled, and every iteration of the loop starts and ends at the same time, so the running speed is greatly increased at the cost of more system resources;
s13, pipeline design; pipelining is applied to loops and functions: the next operation can start without waiting for the previous operation to complete, beginning as soon as the previous operation has freed its initial resources;
The initiation interval is the number of clock cycles a function or loop must wait before accepting new input data; after pipelining is added, each new loop iteration only needs to wait one clock cycle, and the time to execute three iterations is reduced to 4 clock cycles; loop pipelining introduces pipeline registers and reuses the units executing operations;
When UDLA is designed, loop unrolling and pipelining are used as configurable options;
s14, designing a convolution layer hardware acceleration operator;
Considering the universality of the convolution layer, the interface parameters of the hardware acceleration operator are designed generically using HLS, and the interface variables are shown in Table 1:
TABLE 1 convolutional layer acceleration operator interface configurable parameters
S15, designing a fully connected layer hardware acceleration operator; when the previous layer is a fully connected layer, the fully connected calculation can be converted into a convolution with a 1×1 kernel; when the previous layer is a convolution layer, the fully connected calculation can be converted into a global convolution with an h×w kernel, where h and w are respectively the height and width of the previous layer's convolution result; the core operation of the fully connected layer is the matrix-vector product:
y=W×x (1)
According to the consideration of the universality of the full connection layer, the interface parameters of the required design are shown in the table 2:
table 2 pooling layer interface configurable parameters
Designating that weight data be stored in BRAM instead of on registers;
S16, designing an activation function hardware acceleration operator;
The method comprises adopting a Softmax activation function, wherein the Softmax activation function involves multiple exponential calculations that occupy a large amount of hardware resources when computed in the FPGA; the exponential results are generated in advance into a look-up table LUT at network initialization, and during function calculation the corresponding exponential value is looked up directly in the table according to the input value, so that only one look-up is required in the FPGA.
4. The deep learning-oriented FPGA generic heterogeneous acceleration method of claim 3, wherein: s2, constructing a configurable deep learning accelerator IP core based on the step S1, wherein the method specifically comprises,
When the running speed of the hardware acceleration IP core corresponding to the network cannot meet the requirement, the network is optimized by loop unrolling and pipelining, and if the optimized running speed still does not meet the requirement, the weight parameter data length needs to be further reduced;
After the speed of the model meets the requirement, checking whether the internal resources of the FPGA system occupied by the model are reasonable or not; if the current use of FPGA resources needs to be reduced, the length of the weight parameter data needs to be reduced, and the accuracy of the neural network model needs to be re-verified;
After the resource occupation and the running speed of the network meet the requirements, the converted RTL program can be simulated and verified through Vivado HLS to generate a corresponding neural network IP acceleration core, which comprises,
nn_project.cpp: this is the main program; it describes the composition of the neural network, namely the basic units used by the network and their connection modes;
definition.h: this file is only used to define the most important global parameters, which specify the data type of each network layer and the data sizes of the input and output layers of the neural network;
nn_utils: this file is the implementation of the basic units that make up the neural network; part of the code is common code and does not change with the neural network structure to be ported; the versatility of the whole framework can be improved by continuously adding basic network units that implement different functions;
parameters.h: the configuration of each layer in the neural network is stored in this file, with the configuration of each layer recorded in a fixed format in a JSON-like file;
weights: this is a set of header files defining the weight parameters of each layer.
5. The deep learning-oriented FPGA generic heterogeneous acceleration method according to claim 2, characterized in that: the S3, the heterogeneous acceleration method facing deep learning is designed based on S2, comprises,
Each acceleration operator in UDLA accelerators is configured independently;
The scheduling of each unit is delegated to a coprocessor or CPU, which operates on a very fine-grained scheduling boundary and issues the configuration of a hardware acceleration core together with an activate command; meanwhile, a corresponding double buffer is set for each acceleration core according to whether a dependency relationship exists between data, ensuring a parallel pipeline design among the hardware acceleration cores;
after training and compressing the deep learning algorithm, obtaining an offline model with higher precision, and determining the function and structure of the network by analyzing the offline model;
The method comprises the steps of firstly quantizing weight parameters, converting 32-bit single-precision floating point data used after network training into 16-bit or 8-bit fixed point numbers, and then verifying a compressed neural network model to ensure that network precision meets requirements;
And then, fully analyzing the network characteristics of the offline model, combining the performances of each hardware acceleration operator UDLA, establishing a hardware acceleration IP core of the neural network model in Vivado HLS, and controlling the composition and the application time of each hardware acceleration operator through PS.
6. The deep learning-oriented FPGA generic heterogeneous acceleration method of claim 5, wherein: the S4, final comprehensive design heterogeneous acceleration method comprises,
S41, a speed optimal neural network;
Under the condition of sufficient FPGA resources, the neural network architecture with maximized speed carries out all calculation in the FPGA, and the parameter and data caching are realized in a register form on the FPGA; the FPGA does not generate data exchange with other peripheral devices; each layer in the neural network is instantiated separately for parameters, and multiplexing is not generated;
matrix operation in the speed optimal neural network structure adopts vector inner product expansion operation, all matrix multiplication and addition operations are replaced by hardware resources of an on-chip PL end by each layer of acceleration operators, time delay is short, hardware resources are consumed more, and the method is suitable for acceleration design of small-scale neural networks.
7. The deep learning-oriented FPGA generic heterogeneous acceleration method of claim 6, wherein: s4, the final comprehensive design heterogeneous acceleration method further comprises the steps of,
S42, a scale optimal neural network;
Designing a large-scale matrix multiplication network around a systolic array calculation process, followed by pooling and activation function operations, with the specific algorithm selected under PS control; the weight parameters are sent by the PS to the matrix multiplication module via BRAM; inter-layer data enters the DDR through DMA for caching; the final Softmax obtains data from the DDR via the PS;
The matrix operation adopts the systolic array idea, combining the matrix multiplication and accumulation modules into one module, which is very flexible to implement on the FPGA; depending on their scale, the weight parameters can be stored in DDR DRAM or BRAM and are transferred to the matrix multiplication calculation unit by the PS through an AXI-FIFO, or by instantiating BRAM as dual-port RAM; when BRAM resources are plentiful, the dual-port RAM implemented with BRAM satisfies most inter-layer buffering requirements, while DDR DRAM adopts a DMA scatter/gather mode to avoid PS participation, achieving more efficient data throughput and making the operation of large-scale neural networks possible.
8. The deep learning-oriented FPGA generic heterogeneous acceleration method of claim 7, wherein:
When a heterogeneous platform deploys a specific neural network, the implementation mode of each layer can be flexibly configured according to the needs; for a layer with large calculation scale, a matrix multiplication module containing a large number of multipliers is synthesized, the resource consumption is reduced by multiplexing the module, and the maximum scale of the neural network which can be operated by the FPGA is improved; for layers with small calculation scale, the weight parameters are directly stored in the registers of the FPGA, and a plurality of registers are accessed in one clock period, so that high concurrent multiplication calculation is realized, and the network running speed is greatly improved; combining the two modes can achieve optimal utilization of FPGA resources.
9. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1 to 8.
CN202410125333.0A 2024-01-29 2024-01-29 Deep learning-oriented FPGA universal heterogeneous acceleration method and equipment Pending CN118014022A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410125333.0A CN118014022A (en) 2024-01-29 2024-01-29 Deep learning-oriented FPGA universal heterogeneous acceleration method and equipment

Publications (1)

Publication Number Publication Date
CN118014022A true CN118014022A (en) 2024-05-10

Family

ID=90951275

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410125333.0A Pending CN118014022A (en) 2024-01-29 2024-01-29 Deep learning-oriented FPGA universal heterogeneous acceleration method and equipment

Country Status (1)

Country Link
CN (1) CN118014022A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination