CN117114055B - FPGA binary neural network acceleration method for industrial application scene

Info

Publication number: CN117114055B
Authority: CN (China)
Prior art keywords: data, target data, network layer, model, BNN
Legal status: Active
Application number: CN202311376912.4A
Other languages: Chinese (zh)
Other versions: CN117114055A
Inventors: 任磊, 乔一铭
Current Assignee: Beihang University
Original Assignee: Beihang University
Events: application filed by Beihang University; priority to CN202311376912.4A; publication of CN117114055A; application granted; publication of CN117114055B


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N 3/08: Learning methods

Abstract

The embodiment of the application provides an FPGA binary neural network acceleration method for industrial application scenarios, which relates to the field of neural network models and comprises the following steps: when it is determined that a binary neural network (BNN) model is to be deployed on an FPGA, acquiring first target data and the network layers of the BNN model; storing the first target data after a first preprocessing, and allocating corresponding computing resources to the different network layers of the BNN model; in response to a processing request for second target data, storing the second target data after a second preprocessing; and reading the first target data and the second target data and, using the computing resources, running the BNN model with the first target data to process the second target data, so as to obtain the life regression data output by the BNN model, whereby the computing efficiency of the model can be improved.

Description

FPGA binary neural network acceleration method for industrial application scene
Technical Field
The application relates to the field of neural network models, in particular to an FPGA binary neural network acceleration method for industrial application scenes.
Background
The development of the industrial internet enables a large number of sensors, devices and production lines to realize high-level data intercommunication, and simultaneously brings about massive data processing requirements. Conventional central processing units and graphics processors may face performance bottlenecks when handling large-scale data.
The deployment of binary neural networks (Binary Neural Networks, BNN) using field programmable gate arrays (Field Programmable Gate Array, FPGA) becomes a new solution.
At present, existing FPGA computing systems for deploying BNN models have the drawback that their system performance is easily limited by the corresponding hardware resources, resulting in poor computing efficiency.
Disclosure of Invention
The embodiment of the application provides an FPGA binary neural network acceleration method for industrial application scenes, which can improve the calculation efficiency.
In a first aspect, an embodiment of the present application provides an acceleration method for an FPGA binary neural network facing an industrial application scenario, including:
when a binary neural network BNN model is determined to be deployed in an FPGA, first target data are acquired, and a network layer of the BNN model is determined; the first target data comprises weight parameters of the BNN model and network structure parameters of the BNN model;
storing the first target data after first preprocessing, and distributing corresponding computing resources for different network layers of the BNN model;
responding to a processing request for second target data, and storing the second target data after a second preprocessing; the second target data includes multivariate time-series data for predicting a lifetime of an industrial device;
and reading the first target data and the second target data, and, using the computing resources, running the BNN model with the first target data to process the second target data, so as to obtain life regression data output by the BNN model.
Optionally, the first preprocessing the first target data includes:
acquiring the type of the first target data;
if the type of the first target data is a weight parameter, processing the weight parameter in a 1-bit quantization mode;
if the type of the first target data is a network structure parameter, processing the network structure parameter in a static processing mode;
the second preprocessing of the second target data includes:
and determining the duty ratio of sign bits, integer bits and decimal bits of the second target data according to the distribution range of the second target data, and processing the second target data according to the duty ratio.
Optionally, the processing of the weight parameter in a 1-bit quantization mode includes:
judging the positive and negative values of the weight parameters;
according to the positive and negative values, obtaining exclusive-or parameters corresponding to the weight parameters;
And performing exclusive-or operation on the weight parameters according to the exclusive-or parameters to obtain the processed weight parameters.
Optionally, the allocating corresponding computing resources for different network layers of the BNN model includes:
acquiring the storage overhead of a network layer in the BNN model and the complexity of the computing operation of the network layer;
determining target computing resources required by the network layer according to the storage overhead and the complexity of the computing operation;
and distributing corresponding computing resources for the network layer according to the target computing resources and the computing resources of the FPGA.
Optionally, the acquiring storage overhead of the network layer in the BNN model includes:
obtaining output data of the network layer according to the calculation data corresponding to the network layer; the calculated data includes at least one of input data, convolution kernel size, and output channels;
and obtaining the storage overhead of the network layer according to the sum of the storage space required by the weight parameters, the output data and the input data.
Optionally, after the allocating the corresponding computing resource for the network layer, the method further includes:
acquiring control logic corresponding to the network layer, so that the network layer performs data processing by utilizing the computing resources according to the control logic; the control logic is implemented by at least one of the following control signals:
A start signal, a stop signal, an update signal, a data read signal, and a data write signal.
Optionally, the method further comprises:
and performing time sequence optimization on the serial-parallel computing logic of the network layer in the BNN model.
Optionally, the performing timing optimization on the serial-parallel computation logic of the network layer in the BNN model includes:
acquiring the time complexity of a network layer in the BNN model;
determining a time sequence optimization scheme corresponding to the network layer according to the time complexity, the computing resources corresponding to the network layer and the type of the network layer; the timing optimization scheme includes at least one of a serial pipeline, a serial-parallel addition tree, and a parallel multiplication allocation.
Optionally, the serial pipeline includes:
dividing the calculation process of the network layer into M calculation stages according to the calculation resources of the network layer; m is a positive integer;
the computing operations of the different computing stages are performed in different clock cycles.
Optionally, after the reading the first target data and the second target data, the method further includes:
writing the first target data and the second target data into a buffer register; the buffer register is in a ping-pong buffer structure.
In a second aspect, an embodiment of the present application provides an FPGA binary neural network acceleration apparatus for an industrial application scenario, including:
the determining module is used for determining that when the binary neural network BNN model is deployed in the FPGA, first target data are obtained, and a network layer of the BNN model is obtained; the first target data comprises weight parameters of the BNN model and network structure parameters of the BNN model;
the processing module is used for storing the first target data after first preprocessing and distributing corresponding computing resources for different network layers of the BNN model;
the response module is used for responding to a processing request for second target data, performing a second preprocessing on the second target data and then storing the second target data; the second target data includes multivariate time-series data for predicting a lifetime of an industrial device;
and the computing module is used for reading the first target data and the second target data, using the computing resources, and operating the BNN model by the first target data to process the second target data so as to obtain life regression data output by the BNN model.
In a third aspect, the present application provides an electronic device, comprising: a memory and a processor;
The memory is used for storing computer instructions; the processor is configured to execute the computer instructions stored in the memory to implement the method of any one of the first aspects.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program for execution by a processor to implement the method of any one of the first aspects.
In a fifth aspect, the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the method of any one of the first aspects.
According to the FPGA binary neural network acceleration method for the industrial application scene, the computation efficiency of the BNN model can be improved.
Drawings
Fig. 1 is a schematic flow chart of an acceleration method of an FPGA binary neural network for an industrial application scenario provided in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a framework of a model data access module according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a model dynamic quantization module framework according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a modular binary computing unit deployment module according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a hardware-adapted computing acceleration timing optimization module framework provided in an embodiment of the present application;
FIG. 6 is a timing diagram of a serial pipeline according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an FPGA binary neural network acceleration device for an industrial application scenario provided in an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
In the embodiments of the present application, the words "first", "second", etc. are used to distinguish identical items or similar items having substantially the same function and action, and the order of them is not limited. It will be appreciated by those of skill in the art that the words "first," "second," and the like do not limit the amount and order of execution, and that the words "first," "second," and the like do not necessarily differ.
It should be noted that, in the embodiments of the present application, words such as "exemplary" or "such as" are used to denote examples, illustrations, or descriptions. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
In the current industrial field, the industrial Internet, artificial intelligence, edge computing and the like are driving great changes in production and manufacturing. As manufacturing advances toward intelligence and automation, how to efficiently process massive data, make real-time decisions and reduce energy consumption has become a challenge to be solved. Deploying a BNN model on an FPGA has therefore become a new solution.
In particular, the development of the industrial Internet has enabled a high degree of data intercommunication between a large number of sensors, equipment and production lines. However, this also brings massive data processing requirements. Conventional CPUs and graphics processors may face performance bottlenecks when processing large-scale data, and their energy efficiency may fail to meet the actual demands of industry.
Artificial intelligence, particularly a deep learning model, has obvious advantages in solving the problem of large-scale data prediction processing, but the computational complexity is not neglected. The FPGA is used as a flexible programmable hardware platform, and can realize the efficient execution of the deep learning model on a hardware level.
In this context, the BNN model has significant advantages as a special type of deep learning model. The BNN uses the binarized weight and the activation function, so that the number of multiplication and addition operations in the model is greatly reduced, and the computational complexity is reduced. The characteristics complement the parallel computing capability of the FPGA, so that the FPGA can efficiently execute the BNN reasoning task.
However, the system performance of existing FPGA computing systems for deploying the BNN model is easily limited by the corresponding hardware resources, so that their computing efficiency is low.
In view of this, the application provides an FPGA binary neural network acceleration method for industrial application scenarios, which balances the relationship between the calculation accuracy and the performance by performing quantization processing on data in the BNN model calculation process, and allocates appropriate calculation resources to each network layer of the BNN model, so as to improve the calculation efficiency.
The following describes the technical solutions of the present application and how the technical solutions of the present application solve the above technical problems in detail with specific embodiments. The following embodiments may be implemented independently or combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments.
Fig. 1 is a flow chart of an acceleration method of an FPGA binary neural network for an industrial application scenario provided in an embodiment of the present application, as shown in fig. 1, including:
s101, when a binary neural network BNN model is determined to be deployed in an FPGA, first target data are acquired, and a network layer of the BNN model is determined.
The execution body of the embodiment of the application may be the Processing System (PS) of the FPGA.
The first target data includes weight parameters of the BNN model and network structure parameters of the BNN model. Wherein the network structure parameters may be parameters characterizing model features, such as activation functions, batch normalization (Batch Normalization, BN) parameters, etc.
In some embodiments, the first target data may be obtained by analyzing the BNN model. Or, the first target data may be written into the storage space of the FPGA for an external user, and obtained by performing data reading.
In the embodiment of the present application, the network layer of the BNN model may be each neural network layer included in the BNN model, for example, a pooling layer, a convolution layer, an attention layer, a full connection layer, and the like.
By parsing the BNN model, the types of the network layers included in the BNN model may be obtained.
In some embodiments, the network layer of the BNN model may also write into the storage space of the FPGA for an external user, and obtain the data by reading the data.
S102, storing the first target data after first preprocessing, and distributing corresponding computing resources for different network layers of the BNN model.
In the embodiment of the application, the first preprocessing may be dynamic quantization of the first target data, adjusting the representation precision of the first target data so as to reduce the computational overhead of the BNN model during operation and improve the computing efficiency of the model.
In some embodiments, different quantization processing modes may be employed according to the type of the first target data.
Illustratively, the type of the first target data is acquired; if the type of the first target data is a weight parameter, processing the weight parameter by adopting a 1-bit quantization mode and an exclusive-or operation; and if the type of the first target data is a network structure parameter, processing the network structure parameter in a static processing mode.
For the weight parameters, since the weights in a BNN model are generally limited to plus or minus 1, the weight parameters can be processed in a 1-bit quantization mode.
The network structure parameters are not generally affected by data distribution, so that the network structure parameters can be processed in a static processing manner. Static processing may refer to processing data with a fixed number of bits, for example, 32 bits.
In this embodiment of the present application, after the first target data is processed, the first target data may be stored. For example, the first target data is stored in an on-chip memory Block RAM (BRAM) of the FPGA.
In the embodiment of the present application, when determining each network layer included in the BNN model, computing resources may be allocated according to computing and data processing requirements corresponding to each network layer.
Illustratively, the storage overhead of the network layer in the BNN model is obtained, and the complexity of the computing operation of the network layer is obtained; determining target computing resources required by the network layer according to the storage overhead and the complexity of the computing operation; and distributing corresponding computing resources for the network layer according to the target computing resources and the computing resources of the FPGA.
The storage overhead of each network layer in the BNN model can be obtained according to the analysis of the storage requirements of weight parameters, input data, intermediate calculation results and the like. The complexity of the computing operation of the network layer can be obtained by adopting a corresponding algorithm according to the internal structure in the network layer and the structure for processing the data.
By analyzing the complexity of the computing operation, and the computing and storage resources of the FPGA, the resource requirements of each computing unit can be derived.
S103, responding to a processing request of second target data, and storing the second target data after second preprocessing.
In an embodiment of the present application, the second target data includes multivariate time-series data for predicting the lifetime of an industrial device. Taking an industrial device as an example, the multivariate time-series data may comprise operating parameters, vibration, temperature and the like at multiple time instants. Industrial devices may include devices such as lithium batteries and bearings; the embodiments of the present application do not limit the type of industrial device.
The second preprocessing may be dynamic quantization processing.
Illustratively, the duty ratio of sign bits, integer bits and decimal bits of the second target data is determined according to the distribution range of the second target data, and the second target data is processed according to the duty ratio.
For example, dynamic quantization determines the sign bit based on the sign of the data, determines the position of the decimal point based on the maximum and minimum values of the data, and then assigns the proportions of integer and fractional bits based on the data range. This dynamic quantization mode effectively reduces the number of bits required to represent the data while maintaining a certain precision, thereby reducing storage and computation costs. By taking the actual distribution of the data into account during quantization, dynamic quantization can achieve a more compact data representation to accommodate variations in the different input and output data.
Illustratively, for a segment of data that requires dynamic quantization, let its range be [a, b]. Let Q = max{abs(a), abs(b)}; the number of binary digits of Q is taken as the number of integer bits of the dynamic quantization, one bit is reserved as the sign bit, and all remaining bits are fractional bits.
Taking 16-bit second target data as an example: if the data range is [-1, 1], then Q = 1, one binary digit is needed for the integer part, and the fractional precision is 2^-14; if the data range is [-3, 3], then Q = 3 (binary 2'b11), two bits are needed for the integer part, and the fractional precision is 2^-13.
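For illustration, the following Python sketch (the function names are hypothetical and not part of the patent) derives the fixed-point format from a data range in the manner just described; with 16-bit data it reproduces the [-1, 1] and [-3, 3] examples.

```python
def dynamic_quant_format(a: float, b: float, total_bits: int = 16):
    """Derive (sign_bits, integer_bits, fraction_bits) from a data range [a, b].

    Q = max(|a|, |b|); the number of binary digits of the integer part of Q gives
    the integer bits, one bit is reserved for the sign, and the rest are fractional.
    """
    q = max(abs(a), abs(b))
    int_part = int(q)
    int_bits = int_part.bit_length() if int_part > 0 else 0  # Q=1 -> 1 bit, Q=3 -> 2 bits
    frac_bits = total_bits - 1 - int_bits                    # 1 sign bit
    return 1, int_bits, frac_bits


def quantize(x: float, frac_bits: int) -> int:
    """Round a real value to the nearest representable fixed-point value (scaled integer)."""
    return round(x * (1 << frac_bits))


# A 16-bit word for data in [-1, 1] keeps 14 fractional bits (precision 2^-14);
# for data in [-3, 3] it keeps 13 fractional bits (precision 2^-13).
print(dynamic_quant_format(-1, 1))   # (1, 1, 14)
print(dynamic_quant_format(-3, 3))   # (1, 2, 13)
print(quantize(0.5, 14))             # 8192
```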
In this embodiment of the present application, after the second target data is processed, the second target data may be stored. For example, the second target data is stored in an external Double Data Rate (DDR) memory of the FPGA.
S104, reading the first target data and the second target data, and using the computing resources, wherein the first target data operates the BNN model to process the second target data, so that life regression data output by the BNN model is obtained.
In this embodiment of the present application, when an instruction for processing the second target data is received, the second target data is preprocessed and stored; the first target data and the second target data are then read from the corresponding memories, and the computing resource allocation result is applied to the BNN model, so that the BNN model uses the computing resources and processes the second target data with the first target data to obtain the corresponding lifetime prediction regression result.
According to the FPGA binary neural network acceleration method for industrial application scenarios provided in the embodiment of the application, when it is determined that the binary neural network BNN model is to be deployed in the FPGA, the first target data and the network layers of the BNN model are acquired; the first target data is stored after a first preprocessing, and corresponding computing resources are allocated to the different network layers of the BNN model; in response to a processing request for second target data, the second target data is stored after a second preprocessing; and the first target data and the second target data are read and, using the computing resources, the BNN model is run with the first target data to process the second target data, so as to obtain the life regression data output by the BNN model. By quantizing the data in the BNN model computation process, the trade-off between computation precision and performance is balanced, and appropriate computing resources are allocated to each network layer of the BNN model, thereby improving the computing efficiency.
On the basis of the foregoing embodiments, the acceleration method provided in the embodiments of the present application may further perform 1-bit quantization processing on the weight data according to the following manner.
Illustratively, determining the positive and negative values of the weight parameters; according to the positive and negative values, obtaining exclusive-or parameters corresponding to the weight parameters; and performing exclusive-or operation on the weight parameters according to the exclusive-or parameters to obtain the processed weight parameters.
The weight in BNN is usually limited to plus or minus 1, and the exclusive or parameter corresponding to the weight parameter is obtained to represent plus or minus 1 by judging the plus or minus of the weight, so that the number of bits required by data representation is reduced during storage.
Taking the 16-bit data 16'h4a8f as an example: when the stored weight is 1, the exclusive-or parameter is 16'h0000 and the calculation result is 16'h4a8f ^ 16'h0000; when the stored weight is 0, i.e., when the sign of the data needs to be flipped, the exclusive-or parameter is 16'h8000 and the output data is 16'h4a8f ^ 16'h8000. On the premise of preserving the weight accuracy, this quantization mode greatly reduces the storage cost of the weight parameters and improves the computing efficiency of the model; replacing multiplication by the exclusive-or operation reduces the number of multipliers used, realizing the same function with logic elements (LEs), while also saving one clock for each multiplication.
It should be understood that the exclusive or parameter may be set according to actual requirements, which is not limited in the embodiments of the present application.
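As a rough illustration of this exclusive-or scheme, the following minimal Python sketch assumes a 16-bit representation whose most significant bit is the sign bit (the function names are hypothetical, not from the patent):

```python
def xor_mask(weight_bit: int) -> int:
    """Map a stored 1-bit weight to its 16-bit exclusive-or parameter.

    weight_bit == 1 encodes +1 (mask 0x0000, the data passes through unchanged);
    weight_bit == 0 encodes -1 (mask 0x8000, the sign bit is flipped).
    """
    return 0x0000 if weight_bit == 1 else 0x8000


def binary_multiply(data16: int, weight_bit: int) -> int:
    """Replace data * (+/-1) by a single XOR on the 16-bit word.

    Assumes a sign-magnitude-style format in which flipping the MSB negates the value.
    """
    return (data16 ^ xor_mask(weight_bit)) & 0xFFFF


# Example from the text: 16'h4a8f with weight 1 stays 16'h4a8f; with weight 0 it becomes 16'hca8f.
assert binary_multiply(0x4A8F, 1) == 0x4A8F
assert binary_multiply(0x4A8F, 0) == 0xCA8F
```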
According to the acceleration method provided by the embodiment of the application, when the computing resources of each network layer are allocated, the storage overhead of the network layer in the BNN model can be obtained according to the following mode:
illustratively, according to the calculation data corresponding to the network layer, obtaining output data of the network layer; the calculated data includes at least one of input data, convolution kernel size, and output channels; and obtaining the storage overhead of a network layer according to the weight parameter, the output data and the sum of the storage space required by the input data.
After determining the network layer type, a storage analysis is performed for each layer. This involves storage demand analysis of weight parameters, input data, intermediate calculation results, etc. For binary computation, the binary representation of the data and the corresponding storage mode need to be considered for storage analysis, so that the storage cost is reduced to the greatest extent, and the storage structure is optimized.
Illustratively, industrial multivariate time series data of the (n, c, t) dimension is taken as an example, where n represents the variable class, c represents the variable dimension, and t represents the time window.
When the convolution kernel size kernel_size is (1, k), the dilation rates are (1, k, k^2), the number of output channels is m and there is no padding, the output data dimension may be expressed as (n, m, t/k^3).
The required storage overhead is the sum of the storage required by the input data, the weight parameters and the output data: 16*(n*c*t) + 16*(n*m*t/k^3) + (k + k^2 + k^3).
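A small Python sketch of this storage estimate, under the assumption that input and output activations are 16-bit fixed point and the (k + k^2 + k^3) term corresponds to the binarized weights (the function name is illustrative only):

```python
def storage_overhead_bits(n: int, c: int, t: int, m: int, k: int,
                          data_bits: int = 16) -> int:
    """Storage overhead (in bits) of the dilated temporal-convolution example above.

    Matches the expression 16*(n*c*t) + 16*(n*m*t/k^3) + (k + k^2 + k^3):
    16-bit input of shape (n, c, t), 16-bit output of shape (n, m, t/k^3),
    plus 1-bit weights contributing k + k^2 + k^3 bits.
    """
    input_bits = data_bits * n * c * t
    output_bits = data_bits * n * m * (t // k**3)   # integer division models t/k^3
    weight_bits = k + k**2 + k**3
    return input_bits + output_bits + weight_bits


# Example: n=1 variable group, c=16 channels, t=30 time steps, m=32 outputs, k=3.
print(storage_overhead_bits(1, 16, 30, 32, 3))
```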
When the storage expense is determined, the resource condition required by each calculation unit can be obtained by analyzing the complexity of calculation operation and the calculation and storage resources of the FPGA, so that the resource configuration and optimization can be performed in a targeted manner.
For example, for the time-series convolution operation described above, the required number of multiplications is n*c*m*k*(1+k+k^2) and the number of additions is n*c*m*(1+k+k^2); multiplication modules and operation modules are allocated according to the expected hardware resources. For the three sequential convolutional layers with dilation rates of (1, k, k^2), x, k*x and k^2*x logic cells are allocated layer by layer, so that the time required by each layer is n*c*m*k/x clocks; for the additions, k^3-input, k^2-input and k-input addition trees are instantiated layer by layer, so as to use the hardware resources as fully as possible, save computation clocks and improve computing efficiency.
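The operation counts and the layer-by-layer allocation just described can be sketched as follows (an illustrative estimate under the stated assumptions, not the patented implementation; the function and parameter names are hypothetical):

```python
def temporal_conv_workload(n: int, c: int, m: int, k: int, x: int):
    """Operation counts and clock estimate for the three-layer temporal convolution above.

    x is the number of logic cells granted to the first layer; the second and third
    layers receive k*x and k^2*x cells so that, as described in the text, all three
    layers finish in the same number of clocks, n*c*m*k/x.
    """
    multiplications = n * c * m * k * (1 + k + k**2)
    additions = n * c * m * (1 + k + k**2)
    cells_per_layer = [x, k * x, k**2 * x]
    clocks_per_layer = n * c * m * k // x
    return multiplications, additions, cells_per_layer, clocks_per_layer


# Example with n=1, c=16, m=32, k=3 and x=16 logic cells for the first layer.
print(temporal_conv_workload(1, 16, 32, 3, 16))
```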
In the embodiment of the present application, when allocating corresponding computing resources to each network layer of the model, in order to ensure normal execution of the model, a corresponding control logic needs to be designed.
Illustratively, control logic corresponding to the network layer is obtained, so that the network layer performs data processing by using the computing resource according to the control logic; the control logic is implemented by at least one of the following control signals:
a start signal, a stop signal, an update signal, a data read signal, and a data write signal.
Wherein, the start signal: the calculation of this layer is started and at the same time the input data read is set to 1 and the output data write buffer is set to 1.
Stop signal: the calculation of this layer is stopped and at the same time the input data read is set to 0 and the output data write buffer is set to 0.
Update signal: a new round of computation begins and the output data buffer layer is cleaned up.
Reading data: the input data is read from the input data buffer.
And (3) data writing: the calculated data is stored into an output data buffer.
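A toy software model of this per-layer control logic is sketched below; it is purely illustrative (the real design is hardware control logic on the FPGA, and the class and method names are hypothetical):

```python
class LayerController:
    """Toy model of the per-layer control signals described above.

    start  -> enable computation, set the input-read and output-write flags to 1
    stop   -> disable computation, clear the input-read and output-write flags
    update -> begin a new round and clear the output data buffer
    """

    def __init__(self):
        self.running = False
        self.read_enable = 0
        self.write_enable = 0
        self.output_buffer = []

    def start(self):
        self.running = True
        self.read_enable = 1
        self.write_enable = 1

    def stop(self):
        self.running = False
        self.read_enable = 0
        self.write_enable = 0

    def update(self):
        self.output_buffer.clear()

    def read_data(self, input_buffer):
        # data read: fetch input data only while the read flag is set
        return input_buffer if self.read_enable else None

    def write_data(self, value):
        # data write: store a calculated result into the output buffer
        if self.write_enable:
            self.output_buffer.append(value)
```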
By the allocation of computing resources and the design of the control logic, a data path will be established. This means that the computation unit, the storage unit and the control logic will be connected together to form an efficient computation flow, ensuring correct flow and synchronization of data, and minimizing delays, improving computation efficiency.
In some embodiments, to further improve the computation efficiency of the model, the serial-parallel computation logic of the network layer in the BNN model may also be time-sequentially optimized.
Illustratively, acquiring a time complexity of a network layer in the BNN model; determining a time sequence optimization scheme corresponding to the network layer according to the time complexity, the computing resources corresponding to the network layer and the type of the network layer; the timing optimization scheme includes at least one of a serial pipeline, a serial-parallel addition tree, and a parallel multiplication allocation.
In the embodiment of the application, three key contents of serial pipeline, serial-parallel addition tree and parallel multiplication allocation are realized through the allocation of the computing resources and the time complexity analysis, and simultaneously, the parallel multiplication allocation is optimized in finer granularity, including convolution channel parallelism, window parallelism and attention layer parallelism, so that the computing efficiency of a model is improved.
Different network layers can correspond to different optimization schemes, for example, time sequence optimization is performed between layers through a serial pipeline, multiple addition operations can be performed in parallel through the design of a serial-parallel addition tree in each layer, and then the results are added in series according to a hierarchical structure, so that more efficient addition operation is achieved. For network layers with multiple multiplication operations, such as convolution layers, weight connection layers, etc., the multiplication operations are distributed to multiple computing units for parallel execution. This allocation allows the multiplication to be completed in a shorter time, thereby improving the computational efficiency.
In one possible implementation, the serial pipeline includes:
dividing the calculation process of the network layer into M calculation stages according to the calculation resources of the network layer; m is a positive integer; the computing operations of the different computing stages are performed in different clock cycles.
Where M may be determined based on a priori knowledge, which is not limited by the embodiments of the present application.
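The overlap obtained by an M-stage serial pipeline can be illustrated with the following sketch, which simply simulates which item occupies which stage in each clock cycle (a hypothetical helper, not part of the patent):

```python
def pipeline_schedule(num_stages: int, num_items: int):
    """Clock-cycle schedule of an M-stage serial pipeline.

    Returns, for each clock cycle, the list of (item, stage) pairs that are active,
    showing how operations of different stages overlap in time.
    """
    total_cycles = num_items + num_stages - 1
    schedule = []
    for cycle in range(total_cycles):
        active = [(item, cycle - item) for item in range(num_items)
                  if 0 <= cycle - item < num_stages]
        schedule.append(active)
    return schedule


# With M = 3 stages, item 1 enters stage 0 while item 0 is already in stage 1.
for cycle, active in enumerate(pipeline_schedule(3, 4)):
    print(cycle, active)
```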
In some embodiments, after reading the first target data and the second target data, the first target data and the second target data may be written into a buffer register; the buffer register is in a ping-pong buffer structure.
The buffer register plays a role in buffering and adjusting in the process of data transmission, ensures the synchronization of data among different clock domains, and reduces the possibility of data loss and time sequence conflict.
On the basis of the FPGA binary neural network acceleration method facing the industrial application scene, the embodiment of the application also provides an FPGA binary neural network acceleration system facing the industrial application scene.
In some embodiments, the FPGA binary neural network acceleration system for industrial application scenarios provided in the embodiments of the present application includes: the system comprises a model data access module, a model dynamic quantization module, a modularized calculation unit deployment module and a hardware-adaptive calculation acceleration time sequence optimization module.
The model data access module takes multivariate time-series data in the industrial Internet field as the data, the deep neural network as the model, and the on-chip and off-chip memories of the FPGA as the hardware resources; it plans the storage and read modes of the various types of data, defines the storage addresses and storage space of the data, and sets the access control signal for each type of data.
The model dynamic quantization module dynamically quantizes the trained binary neural network into a data format that can be deployed directly on the FPGA, balances hardware storage, computing resources and precision requirements, dynamically adjusts the allocation of integer and fractional bits, and performs dynamic quantization for different types of network modules and data types.
The modular computing unit deployment module divides the deep binary neural network into different parts according to the functional layers, and for each network layer derives a reasonable data path, resource allocation and performance analysis by combining the expected hardware resources with the network structure parameters.
The hardware-adapted computation acceleration timing optimization module optimizes the serial-parallel computation in the timing for the designated FPGA device and hardware resources, balances the load of the pipeline process, and allocates parallel computing resources.
The specific implementation of the above modules is described in turn below.
FIG. 2 is a schematic structural diagram of a model data access module framework, as shown in FIG. 2: including various ways of data storage.
In order to achieve efficient DNN acceleration on an FPGA, the first key step in the task is to design the model data access module. This module transfers data between different layers of the network to internal Block RAM (BRAM) and external Double Data Rate (DDR) memory to support the BNN reasoning process. The access of data is the basis of the whole reasoning process, which affects the efficiency and performance of the computation. In particular, this module involves the following key steps:
the BRAM is configured to store weight parameters, activation functions, and Batch Normalization (BN) parameters. These parameters are critical to the reasoning of DNN, they are shared and passed between the different layers of the model. By appropriate control logic signals and clock signals, the model data access module will read these parameters directly from the BRAM, providing the necessary data for the reasoning process.
DDR is used to store multivariable input data, output data, and interlayer calculation results. These data need to be read and written during the reasoning process. The model data access module realizes high-speed data transmission with the DDR through the AXI Interface by the control signal from the PS (Processing System) end and the system clock signal, ensures the correct flow of data between the PS end and the DDR, and provides necessary data for the reasoning process.
When the data to be used is read, the data is fed into a Buffer Register (Buffer Register). The buffer register plays a role in buffering and adjusting in the process of data transmission, ensures the synchronization of data among different clock domains, and reduces the possibility of data loss and time sequence conflict.
The buffer register is set into two buffer areas based on ping-pong design, and can be called as an A buffer and a B buffer. The two buffer areas have the same size and structure. In the initial state, one buffer (e.g., a buffer) is used to read or write data, while the other buffer (B buffer) is in the idle state. When one buffer (A buffer) is full or the reading is finished, the system is switched to the other buffer (B buffer) for data transmission. This alternate switching is continued to achieve continuous transmission of data. Due to the alternate use of the buffer areas, the ping-pong buffer can realize gapless data transmission. When one buffer is used for transmitting data, the other buffer can be simultaneously ready for data to be transmitted next, thereby reducing the time of transmission interruption. The ping-pong buffer design enables data transmission and calculation to be more tightly combined, reduces the time for waiting for data transmission, and improves the overall calculation efficiency. The method can help optimize the multivariate data transmission in the industrial Internet scene, so that the calculation and the data transmission are more coordinated and efficient.
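A minimal software analogue of the ping-pong (double) buffer described above, for illustration only (the real buffer is a hardware register structure; the class name is hypothetical):

```python
class PingPongBuffer:
    """Two equally sized buffers used alternately: while one buffer is being read by
    the compute units, the other is being filled with the next batch of data."""

    def __init__(self, depth: int):
        self.buffers = [[None] * depth, [None] * depth]
        self.write_sel = 0          # index of the buffer currently being written
        self.depth = depth
        self.write_ptr = 0

    def write(self, value):
        self.buffers[self.write_sel][self.write_ptr] = value
        self.write_ptr += 1
        if self.write_ptr == self.depth:   # buffer full: swap roles
            self.write_ptr = 0
            self.write_sel ^= 1

    def read_buffer(self):
        """The buffer not currently being written is safe to read without conflicts."""
        return self.buffers[self.write_sel ^ 1]
```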
Fig. 3 is a schematic structural diagram of a model dynamic quantization module framework, as shown in fig. 3: different data adopts different quantization modes.
In the application of BNN models in the industrial field, dynamic quantization, as an optimization strategy, is important for improving the efficiency and performance of the model. During model inference, dynamic quantization flexibly adjusts the representation precision of the data according to its distribution and range, so as to reduce computation and storage overhead while preserving the prediction performance of the model as much as possible.
The dynamic quantization process of input and output data is to determine the duty ratio of sign bits, integer bits, and decimal bits based on the distribution range of the data.
For the weight parameters, a more compact quantization approach can be used, since the weights in BNNs are typically limited to plus or minus 1. For example, 1-bit quantization is employed to store the weight parameters.
Network structure parameters, such as activation functions and Batch Normalization (BN) parameters, are generally not affected by the data distribution and therefore static 32-bit storage may be employed. The method reserves high precision of parameters and ensures stability and accuracy of network structure parameters in the reasoning process.
Three quantization methods in the binary model dynamic quantization module adopt a flexible strategy to adjust the expression accuracy of data aiming at different types of data. Dynamic quantization enables data to be stored in a more compact manner, reducing storage and computational overhead, and improving inference efficiency without affecting model performance.
FIG. 4 is a schematic diagram of a modular binary unit deployment module, as shown in FIG. 4, comprising: model network layer decomposition, network layer type determination, computing resource allocation, and the like.
In order to achieve efficient binary computation acceleration on an FPGA, the design of a general binary computation unit deployment module becomes a key link. The module aims at decomposing the BNN model into various layers, carrying out corresponding optimized deployment according to the network layer type, establishing a data path and carrying out integrated test on the basis of carrying out calculation resource allocation (storage analysis and performance resource analysis) and control logic.
Network decomposition is the basis for the deployment of general binary computing units. This step breaks down the BNN model into different layers, such as convolution, pooling, batch normalization, attention layer, full connection layer, and the like. Each layer type has different computational and data processing requirements, so the decomposition process helps to provide a clear target for subsequent deployment and optimization.
After determining the network layer type, a storage analysis will be performed for each network layer. This involves storage demand analysis of weight parameters, input data, intermediate calculation results, etc. For binary computation, the binary representation of the data and the corresponding storage mode need to be considered for storage analysis, so that the storage cost is reduced to the greatest extent, and the storage structure is optimized.
Performance resource analysis is performed based on the determination of storage requirements. By analyzing the complexity of the computing operation and the computing and storage resources of the FPGA, the resource condition required by each computing unit can be obtained, so that the resource configuration and optimization can be performed in a targeted manner.
At the same time, the design of the control logic is an indispensable loop. The control logic needs to ensure that the computing units start and stop in the correct order and that the data flows correctly. This includes the steps of generation, transmission and resolution of control signals to ensure smooth operation of the overall calculation process.
Based on these analyses and optimizations, a data path will be established. This means that the computation unit, the storage unit and the control logic will be connected together to form an efficient computation flow, ensuring correct flow and synchronization of data, and minimizing delays, improving computation efficiency.
Finally, the integrated test will verify the entire general binary unit deployment module. This will ensure that the various parts of the module can work together, enabling efficient BNN reasoning. The integration test will involve various input data, various network layer types, and various computational loads to verify the stability, accuracy, and performance of the module.
In some embodiments, by decomposing the network layer of the model, if the model structure is modified, the key steps such as computing resource allocation, control logic design and the like can be updated only for the modified network, so that the complexity of FPGA programming is reduced.
Taking an example model for realizing residual life prediction in an industrial application scene by using FPGA deployment BNN as an example. The data set used in the experiment is a C-MAPSS data set, and is a data set for predicting the residual life of the aeroengine. The hardware board used in the experiment is Zynq UltraScale+MPSoC ZCU104. The neural network is shown in the following table in terms of functional hierarchy, data size, operands, hardware allocation and number of inferred clocks.
In the above table, taking the convolutional layer (quan_conv1d) as an example, the input data is (1, 16, 30), the output data is (1, 32, 30), 18 Kb of BRAM is allocated to store the weight parameters, 16 convolution kernels are computed in parallel with a kernel_size of (1, 3), 16 logic units are allocated, and the computation takes 30 clocks.
When modifying the neural network layer parameters, for example changing the output channels from 32 to 64, the method proposed in the general binary computing unit deployment module is applied as follows. The number of convolution kernels in the convolutional layer is modified to 64. First, the network layer type is identified as a one-dimensional convolutional layer. Then a storage analysis is performed: the input data remains (1, 16, 30), the output changes from (1, 32, 30) to (1, 64, 30), the data storage resource changes from 15360 bit to 30720 bit, and the weight storage resource changes from 2048 bit to 4096 bit. Next, a performance resource analysis is performed. If the number of computation clocks is kept at 30, the data partitioning and access order is adjusted so that the newly added data is computed in parallel under the existing computation timing, and the number of allocated logic units changes from 16 to 32; the start and stop signals of the control logic remain unchanged, the data update, read and write signals add the newly allocated memory addresses, and the data access addresses of the subsequent layers are modified in sequence. If the number of logic units is kept at 16, the number of computation clocks doubles to 60 and the newly added data is processed serially under the existing computation timing; the start signal remains unchanged, the stop, update and data read signals are delayed by 30 clocks, and the start/stop signals and data access signals of the subsequent network layers are delayed in sequence. A brief illustrative sketch of this resource scaling is given after the numbered list below.
1. Storage resource update: and calculating the data storage resources and the weight storage resource quantity which need to be modified according to the modified neural network layer parameters. Modifying addresses for accessing data and addresses for accessing weights, modifying data and weight access signals.
2. Multiplication and addition updates: the layer operand to be modified is calculated according to the modified neural network layer parameters. If the calculation time is considered preferentially and the hardware resources are sufficient, the number of logic units and the number of addition trees are increased proportionally, the newly added multiplication and addition operation is calculated in parallel with the original network, and the total clock number is ensured to be unchanged. If the hardware resources are limited, the hardware resource allocation quantity is not modified, and the new multiplication and addition operation and the original network serial calculation are performed.
3. Adjusting the control signal update according to the clock number: five control signals of the neural network layer are modified according to the modified neural network layer parameter calculation and the new hardware resource allocation condition. If the calculation time is prioritized, the start signal is unchanged, the update, read and write signals of the data of the layer and the next layer add new memory addresses, and if the hardware resource is prioritized, the start signal of the layer is unchanged, and the stop, data update, data read and data write signals delay clocks. All five signals of the next layer are sequentially delayed.
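As promised above, the resource scaling of this example (32 to 64 output channels) can be sketched as follows; the helper and its parameters are hypothetical and simply reproduce the arithmetic of the two strategies (time-first vs resource-first) quoted in the text:

```python
def update_layer_resources(out_channels_old: int, out_channels_new: int,
                           data_bits_old: int, weight_bits_old: int,
                           logic_units_old: int, clocks_old: int,
                           prefer_time: bool = True):
    """Scale storage, logic-unit and clock budgets when a layer's output channels change."""
    scale = out_channels_new / out_channels_old
    data_bits = int(data_bits_old * scale)      # e.g. 15360 -> 30720
    weight_bits = int(weight_bits_old * scale)  # e.g. 2048  -> 4096
    if prefer_time:                             # parallel: more logic units, same clocks
        return data_bits, weight_bits, int(logic_units_old * scale), clocks_old
    # serial: same logic units, more clocks (downstream signals delayed accordingly)
    return data_bits, weight_bits, logic_units_old, int(clocks_old * scale)


print(update_layer_resources(32, 64, 15360, 2048, 16, 30, prefer_time=True))   # (30720, 4096, 32, 30)
print(update_layer_resources(32, 64, 15360, 2048, 16, 30, prefer_time=False))  # (30720, 4096, 16, 60)
```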
FIG. 5 is a schematic diagram of a hardware-adapted computational acceleration timing optimization module framework, as shown in FIG. 5, with a timing design including a serial pipeline, a serial-parallel addition tree, and parallel multiplication assignments.
The design of the module is based on the performance resource analysis, combines time complexity analysis, and aims to realize three key contents of serial pipeline, serial-parallel addition tree and parallel multiplication allocation through reasonable time sequence design, and simultaneously optimize the parallel multiplication allocation with finer granularity, including convolution channel parallelism, window parallelism and attention layer parallelism.
Serial pipeline: for computing operations of different layers, the computing process is divided into a plurality of stages through pipelining design, and different computing operations are executed in different clock cycles. Thus, overlapping execution of a plurality of computing operations can be realized, delay of a single operation is reduced, and overall computing throughput is improved.
As shown in Fig. 6, the inference operations of different rounds are executed in parallel: the first cell of each row represents the first-layer network inference; when the first inference round proceeds to the second-layer network inference, the first-layer network inference of the second round can be performed simultaneously, and the output buffer of the first-layer network is updated.
Serial-parallel addition tree: in some computing operations, a large number of addition operations are involved. Through the design of the serial-parallel addition tree, a plurality of addition operations can be executed in parallel, and then the results are added in series according to a hierarchical structure, so that more efficient addition operation is realized.
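A behavioural sketch of the serial-parallel addition tree is given below; it is illustrative only, and the `width` parameter (the number of physical adders available per level) is an assumption not stated in the text:

```python
def serial_parallel_sum(values, width: int):
    """Sum a list with an addition tree.

    Within each level, up to `width` independent pairwise additions run in parallel
    (each batch costing one clock); the per-level partial results are then combined
    level by level until a single sum remains.
    """
    level = list(values)
    clocks = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] + level[i + 1])   # independent pairs -> parallel adds
        if len(level) % 2:
            nxt.append(level[-1])                 # odd element carried to next level
        pairs = len(level) // 2
        clocks += -(-pairs // width)              # ceil(pairs / width) clocks per level
        level = nxt
    return level[0], clocks


total, clocks = serial_parallel_sum(list(range(16)), width=4)
print(total, clocks)   # 120 in 2+1+1+1 = 5 clocks with 4 physical adders
```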
Parallel multiplication allocation: for multiplication operations, it may be considered to distribute the multiplication operations to multiple computing units for execution in parallel. This allocation allows the multiplication to be completed in a shorter time, thereby improving the computational efficiency. The parallel multiplication allocation may be further optimized, including the following three parts:
convolution pooling channel parallelism: for convolution operations, the input channels and output channels may be divided into multiple groups, each of which is assigned to a set of hardware computing resources, enabling channel-level parallel computation.
Window parallelism: in the convolution operation, the data calculation in the windows are independent, the data in one window can be distributed to different calculation units, and the convolution result is calculated in parallel and then sent into an addition tree to obtain the result.
Full connection layer parallelism: in the fully connected layer, the degree of parallelism between tensor dimensions can be controlled, with each row or column being assigned to a fixed computational unit.
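As an illustration of channel-level parallel allocation, the following hypothetical helper partitions channels into groups; in hardware, each group would be mapped to its own set of multipliers (the grouping policy shown is an assumption, not the patented mapping):

```python
def split_channels(num_channels: int, num_groups: int):
    """Partition convolution channels into num_groups groups, each intended to be
    mapped to one set of hardware computing resources and computed in parallel."""
    base, extra = divmod(num_channels, num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        size = base + (1 if g < extra else 0)   # distribute any remainder evenly
        groups.append(list(range(start, start + size)))
        start += size
    return groups


# 32 output channels spread over 4 multiplier groups -> 8 channels per group.
print(split_channels(32, 4))
```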
In summary, the acceleration effect of the FPGA on the BNN model can be effectively improved through the data storage design, the dynamic data quantization design, the computing resource allocation design and the timing optimization design.
The embodiment of the application also provides an FPGA binary neural network accelerating device facing the industrial application scene.
Fig. 7 is a schematic structural diagram of an FPGA binary neural network acceleration device 70 for an industrial application scenario provided in an embodiment of the present application, as shown in fig. 7, including:
the determining module 701 is configured to determine that, when the binary neural network BNN model is deployed in the FPGA, first target data is obtained, and a network layer of the BNN model; the first target data includes weight parameters of the BNN model and network structure parameters of the BNN model.
The processing module 702 is configured to store the first target data after performing a first preprocessing, and allocate corresponding computing resources for different network layers of the BNN model.
A response module 703, configured to store second target data after performing a second preprocessing on the second target data in response to a processing request for the second target data; the second target data includes multivariate time-series data for predicting a lifetime of an industrial device.
And a computing module 704, configured to read the first target data and the second target data, and use the computing resources, where the first target data runs the BNN model to process the second target data, so as to obtain life regression data output by the BNN model.
Optionally, the processing module 702 is further configured to obtain a type of the first target data; if the type of the first target data is a weight parameter, processing the weight parameter in a 1-bit quantization mode; and if the type of the first target data is a network structure parameter, processing the network structure parameter in a static processing mode.
Optionally, the response module 703 is further configured to determine a duty ratio of sign bits, integer bits, and decimal bits of the second target data according to the distribution range of the second target data, and process the second target data according to the duty ratio.
Optionally, the processing module 702 is further configured to determine a positive value and a negative value of the weight parameter; according to the positive and negative values, obtaining exclusive-or parameters corresponding to the weight parameters; and performing exclusive-or operation on the weight parameters according to the exclusive-or parameters to obtain the processed weight parameters.
Optionally, the processing module 702 is further configured to obtain a storage overhead of the network layer in the BNN model, and complexity of a computing operation of the network layer; determining target computing resources required by the network layer according to the storage overhead and the complexity of the computing operation; and distributing corresponding computing resources for the network layer according to the target computing resources and the computing resources of the FPGA.
Optionally, the processing module 702 is further configured to obtain output data of the network layer according to the calculation data corresponding to the network layer, where the calculation data includes at least one of input data, convolution kernel size and output channels; and to obtain the storage overhead of the network layer according to the sum of the storage space required by the weight parameters, the output data and the input data.
Optionally, the computing module 704 is further configured to obtain control logic corresponding to the network layer, so that the network layer performs data processing by using the computing resource according to the control logic; the control logic is implemented by at least one of the following control signals:
a start signal, a stop signal, an update signal, a data read signal, and a data write signal.
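A minimal sketch of such control logic is given below, modeled as a software state machine driven by start/stop and done flags; the state names, and the omission of a separate update path, are simplifying assumptions for illustration only.

```python
from enum import Enum, auto

class LayerState(Enum):
    IDLE = auto()
    READ = auto()
    COMPUTE = auto()
    WRITE = auto()

def next_state(state: LayerState, start: bool, stop: bool,
               read_done: bool, compute_done: bool, write_done: bool) -> LayerState:
    """Layer controller: a start signal triggers a read-compute-write cycle; stop returns to idle."""
    if stop:
        return LayerState.IDLE
    if state is LayerState.IDLE and start:
        return LayerState.READ
    if state is LayerState.READ and read_done:
        return LayerState.COMPUTE
    if state is LayerState.COMPUTE and compute_done:
        return LayerState.WRITE
    if state is LayerState.WRITE and write_done:
        return LayerState.IDLE
    return state

# Example: one transition out of the idle state when start is asserted.
s = next_state(LayerState.IDLE, start=True, stop=False,
               read_done=False, compute_done=False, write_done=False)
print(s)  # LayerState.READ
```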
Optionally, the calculating module 704 is further configured to perform timing optimization on the serial-parallel calculating logic of the network layer in the BNN model.
Optionally, the calculating module 704 is further configured to obtain the time complexity of a network layer in the BNN model; and determine a timing optimization scheme corresponding to the network layer according to the time complexity, the computing resources corresponding to the network layer, and the type of the network layer; the timing optimization scheme includes at least one of a serial pipeline, a serial-parallel addition tree, and a parallel multiplication allocation.
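By way of illustration, a serial-parallel addition tree reduces partial sums pairwise, and each level of the tree could be mapped to one clock stage; the sketch below assumes integer partial products and is not a definitive implementation.

```python
def adder_tree_sum(values):
    """Reduce a list of partial products with a balanced adder tree;
    each level of pairwise additions could be scheduled in one clock stage."""
    level = list(values)
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # carry the odd element forward to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0] if level else 0

print(adder_tree_sum([1, -1, 1, 1, -1, 1, 1]))  # 3
```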
Optionally, the computing module 704 is further configured to divide the computing process of the network layer into M computing phases according to the computing resources of the network layer; m is a positive integer; the computing operations of the different computing stages are performed in different clock cycles.
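A hedged, cycle-level sketch of such an M-stage division follows, where each stage processes a different input in the same clock cycle; the three example stage functions are arbitrary placeholders, not operations defined by this application.

```python
def simulate_pipeline(inputs, stages):
    """Cycle-level sketch of an M-stage pipeline: in each clock cycle every stage
    works on a different input, so throughput approaches one result per cycle."""
    m = len(stages)
    regs = [None] * m                     # pipeline registers between stages
    results = []
    stream = list(inputs) + [None] * m    # extra cycles to drain the pipeline
    for x in stream:
        if regs[-1] is not None:
            results.append(regs[-1])
        # shift from the last stage backwards so each stage reads the previous cycle's value
        for i in range(m - 1, 0, -1):
            regs[i] = stages[i](regs[i - 1]) if regs[i - 1] is not None else None
        regs[0] = stages[0](x) if x is not None else None
    return results

stages = [lambda v: v * 2, lambda v: v + 1, lambda v: v * v]   # M = 3 computing stages
print(simulate_pipeline([1, 2, 3], stages))  # [9, 25, 49]
```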
Optionally, the calculating module 704 is further configured to write the first target data and the second target data into a buffer register; the buffer register uses a ping-pong buffer structure.
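For illustration, a ping-pong buffer alternates two banks so that writing new data and reading data for computation can overlap; the class and method names below are assumptions and not part of the claimed structure.

```python
class PingPongBuffer:
    """Double buffer: while the compute side reads one bank, new data is
    written into the other; swap() exchanges the two roles each round."""
    def __init__(self, depth: int):
        self.banks = [[0] * depth, [0] * depth]
        self.write_bank = 0                     # bank currently being filled

    def write(self, data):
        bank = self.banks[self.write_bank]
        for i, v in enumerate(data[:len(bank)]):
            bank[i] = v

    def read(self):
        return list(self.banks[1 - self.write_bank])   # bank visible to the compute side

    def swap(self):
        self.write_bank = 1 - self.write_bank

buf = PingPongBuffer(depth=4)
buf.write([1, 2, 3, 4])    # fill bank 0 while bank 1 is available for reading
buf.swap()
print(buf.read())          # [1, 2, 3, 4] now visible to the compute side
```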
The FPGA binary neural network acceleration device for industrial application scenarios provided in this embodiment can execute the FPGA binary neural network acceleration method for industrial application scenarios provided in any of the foregoing embodiments; its implementation principle and technical effect are similar and are not repeated here.
An embodiment of the application also provides an electronic device.
Fig. 8 is a schematic structural diagram of an electronic device 80 according to an embodiment of the present application, as shown in fig. 8, including:
a processor 801.
A memory 802 for storing executable instructions of the terminal device.
In particular, the program may include program code comprising computer operating instructions. The memory 802 may comprise high-speed RAM and may also include non-volatile memory, such as at least one disk storage.
The processor 801 is configured to execute computer-executable instructions stored in the memory 802, so as to implement the technical solutions described in the foregoing method embodiments.
The processor 801 may be a central processing unit (Central Processing Unit, abbreviated as CPU), or an application specific integrated circuit (Application Specific Integrated Circuit, abbreviated as ASIC), or one or more integrated circuits configured to implement embodiments of the present application.
Optionally, the electronic device 80 may further comprise a communication interface 803, so that communication interaction with an external device, such as a user terminal (e.g., a mobile phone or tablet), can be performed through the communication interface 803. In a specific implementation, if the communication interface 803, the memory 802, and the processor 801 are implemented independently, they may be connected to each other and communicate with each other through a bus.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, among others. A bus may be divided into an address bus, a data bus, a control bus, and the like, which does not mean that there is only one bus or one type of bus.
Alternatively, in a specific implementation, if the communication interface 803, the memory 802, and the processor 801 are implemented on a single chip, the communication interface 803, the memory 802, and the processor 801 may complete communication through internal interfaces.
An embodiment of the application further provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the technical solution of the foregoing embodiments of the FPGA binary neural network acceleration method for industrial application scenarios is implemented; the implementation principle and technical effect are similar and are not repeated here.
In one possible implementation, the computer-readable medium may include random access memory (RAM), read-only memory (ROM), compact disc read-only memory (CD-ROM) or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
An embodiment of the application also provides a computer program product comprising a computer program. When the computer program is executed by a processor, the technical solution of the foregoing embodiments of the FPGA binary neural network acceleration method for industrial application scenarios is implemented; the implementation principle and technical effect are similar and are not repeated here.
In a specific implementation of the terminal device or the server, it should be understood that the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of a method disclosed in connection with the embodiments of the present application may be embodied directly in a hardware processor for execution, or in a combination of hardware and software modules in the processor for execution.
Those skilled in the art will appreciate that all or part of the steps of any of the method embodiments described above may be accomplished by hardware associated with program instructions. The foregoing program may be stored in a computer readable storage medium, which when executed, performs all or part of the steps of the method embodiments described above.
The technical solution of the present application, if implemented in the form of software and sold or used as a product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the technical solutions of the present application may be embodied in the form of a software product stored in a storage medium comprising a computer program or several instructions. The computer software product causes a computer device (which may be a personal computer, a server, a network device, or similar electronic device) to perform all or part of the steps of the methods described in embodiments of the present application.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all alternative embodiments, and that the acts and modules referred to are not necessarily required in the present application.
It should be further noted that, although the steps in the flowchart are shown sequentially as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowchart may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different moments; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with at least some of the sub-steps or stages of other steps.
It should be understood that the above-described device embodiments are merely illustrative, and that the device of the present application may be implemented in other ways. For example, the division of the units/modules in the above embodiments is merely a logic function division, and there may be another division manner in actual implementation. For example, multiple units, modules, or components may be combined, or may be integrated into another system, or some features may be omitted or not performed.
In addition, each functional unit/module in each embodiment of the present application may be integrated into one unit/module, or each unit/module may exist alone physically, or two or more units/modules may be integrated together, unless otherwise specified. The integrated units/modules described above may be implemented either in hardware or in software program modules.
The integrated units/modules may be stored in a computer readable memory if implemented in the form of software program modules and sold or used as a stand-alone product. It is thus understood that the technical solutions of the present application, or parts contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product stored in a memory, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. And the aforementioned memory includes: a U-disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Finally, it should be noted that: the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them; although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents; and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims (8)

1. An FPGA binary neural network acceleration method for industrial application scenes is characterized by comprising the following steps:
when it is determined that a binary neural network BNN model is to be deployed in an FPGA, acquiring first target data and determining the network layers of the BNN model; the first target data comprises weight parameters of the BNN model and network structure parameters of the BNN model;
storing the first target data after first preprocessing, and distributing corresponding computing resources for different network layers of the BNN model, wherein the first preprocessing is related to the type of the first target data, and is a 1-bit quantization mode when the type is a weight parameter, and is a static processing mode when the type is a network structure parameter;
in response to a processing request for second target data, determining the proportion of sign bits, integer bits, and fractional bits of the second target data according to the distribution range of the second target data, and processing and storing the second target data according to that proportion; the second target data comprises multivariate time-series data used to predict the life of industrial equipment;
reading the first target data and the second target data, and, using the computing resources, running the BNN model with the first target data to process the second target data, so as to obtain the life regression data output by the BNN model;
the allocating corresponding computing resources for different network layers of the BNN model comprises the following steps:
acquiring the storage overhead of a network layer in the BNN model and the complexity of the computing operation of the network layer;
determining target computing resources required by the network layer according to the storage overhead and the complexity of the computing operation;
and distributing corresponding computing resources for the network layer according to the target computing resources and the computing resources of the FPGA.
2. The method of claim 1, wherein processing the weight parameters in a 1-bit quantization manner comprises:
determining whether each weight parameter is positive or negative;
obtaining, according to the sign, the exclusive-OR parameter corresponding to the weight parameter;
and performing an exclusive-OR operation on the weight parameter according to the exclusive-OR parameter to obtain the processed weight parameter.
3. The method according to claim 1, wherein said obtaining a storage overhead of a network layer in said BNN model comprises:
obtaining the output data of the network layer according to the calculation data corresponding to the network layer, the calculation data including at least one of the input data, the convolution kernel size, and the output channels;
and obtaining the storage overhead of the network layer as the sum of the storage space required by the weight parameters, the output data, and the input data.
4. A method according to claim 3, wherein after said allocating corresponding computing resources to said network layer, the method further comprises:
acquiring control logic corresponding to the network layer, so that the network layer performs data processing by utilizing the computing resources according to the control logic; the control logic is implemented by at least one of the following control signals:
A start signal, a stop signal, an update signal, a data read signal, and a data write signal.
5. The method according to claim 1, wherein the method further comprises:
and performing time sequence optimization on the serial-parallel computing logic of the network layer in the BNN model.
6. The method of claim 5, wherein said performing timing optimization on the network layer's serial-parallel computation logic in the BNN model comprises:
acquiring the time complexity of a network layer in the BNN model;
determining a timing optimization scheme corresponding to the network layer according to the time complexity, the computing resources corresponding to the network layer, and the type of the network layer; the timing optimization scheme includes at least one of a serial pipeline, a serial-parallel addition tree, and a parallel multiplication allocation.
7. The method of claim 6, wherein the serial pipeline comprises:
dividing the calculation process of the network layer into M calculation stages according to the calculation resources of the network layer; m is a positive integer;
the computing operations of the different computing stages are performed in different clock cycles.
8. The method of claim 1, wherein after the reading the first target data and the second target data, the method further comprises:
Writing the first target data and the second target data into a buffer register; the buffer register is in a ping-pong buffer structure.
CN202311376912.4A 2023-10-24 2023-10-24 FPGA binary neural network acceleration method for industrial application scene Active CN117114055B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311376912.4A CN117114055B (en) 2023-10-24 2023-10-24 FPGA binary neural network acceleration method for industrial application scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311376912.4A CN117114055B (en) 2023-10-24 2023-10-24 FPGA binary neural network acceleration method for industrial application scene

Publications (2)

Publication Number Publication Date
CN117114055A CN117114055A (en) 2023-11-24
CN117114055B true CN117114055B (en) 2024-04-09

Family

ID=88809536

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311376912.4A Active CN117114055B (en) 2023-10-24 2023-10-24 FPGA binary neural network acceleration method for industrial application scene

Country Status (1)

Country Link
CN (1) CN117114055B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110348574A (en) * 2019-07-17 2019-10-18 哈尔滨理工大学 A kind of general convolutional neural networks accelerating structure and design method based on ZYNQ
CN110458279A (en) * 2019-07-15 2019-11-15 武汉魅瞳科技有限公司 A kind of binary neural network accelerated method and system based on FPGA
CN112988229A (en) * 2019-12-12 2021-06-18 上海大学 Convolutional neural network resource optimization configuration method based on heterogeneous computation
CN116048811A (en) * 2023-02-14 2023-05-02 山东大学 Fully homomorphic encryption neural network reasoning acceleration method and system based on resource multiplexing
CN116720549A (en) * 2023-07-03 2023-09-08 北京航空航天大学 FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11687771B2 (en) * 2019-01-23 2023-06-27 Samsung Electronics Co., Ltd. Platform for concurrent execution of GPU operations

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110458279A (en) * 2019-07-15 2019-11-15 武汉魅瞳科技有限公司 A kind of binary neural network accelerated method and system based on FPGA
CN110348574A (en) * 2019-07-17 2019-10-18 哈尔滨理工大学 A kind of general convolutional neural networks accelerating structure and design method based on ZYNQ
CN112988229A (en) * 2019-12-12 2021-06-18 上海大学 Convolutional neural network resource optimization configuration method based on heterogeneous computation
CN116048811A (en) * 2023-02-14 2023-05-02 山东大学 Fully homomorphic encryption neural network reasoning acceleration method and system based on resource multiplexing
CN116720549A (en) * 2023-07-03 2023-09-08 北京航空航天大学 FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache

Also Published As

Publication number Publication date
CN117114055A (en) 2023-11-24

Similar Documents

Publication Publication Date Title
CN110998570B (en) Hardware node with matrix vector unit with block floating point processing
US10824939B2 (en) Device for implementing artificial neural network with flexible buffer pool structure
US20230076850A1 (en) Computation of neural network node with large input values
US10902315B2 (en) Device for implementing artificial neural network with separate computation units
US10282659B2 (en) Device for implementing artificial neural network with multiple instruction units
US11669443B2 (en) Data layout optimization on processing in memory architecture for executing neural network model
US20070169001A1 (en) Methods and apparatus for supporting agile run-time network systems via identification and execution of most efficient application code in view of changing network traffic conditions
US11017290B2 (en) Signal processing module, especially for a neural network and a neuronal circuit
CN111199275B (en) System on chip for neural network
EP3884435A1 (en) System and method for automated precision configuration for deep neural networks
CN109472361B (en) Neural network optimization method
CN111142938A (en) Task processing method and task processing device of heterogeneous chip and electronic equipment
US20200226458A1 (en) Optimizing artificial neural network computations based on automatic determination of a batch size
Biookaghazadeh et al. Toward multi-fpga acceleration of the neural networks
CN113220630A (en) Reconfigurable array optimization method and automatic tuning method of hardware accelerator
CN114005458A (en) Voice noise reduction method and system based on pipeline architecture and storage medium
WO2021115082A1 (en) Job scheduling method and job scheduling apparatus
CN111831582B (en) Memory management device and method for intelligent processor and electronic equipment
CN117114055B (en) FPGA binary neural network acceleration method for industrial application scene
Gerogiannis et al. Deep reinforcement learning acceleration for real-time edge computing mixed integer programming problems
CN111831333B (en) Instruction decomposition method and device for intelligent processor and electronic equipment
US20210150311A1 (en) Data layout conscious processing in memory architecture for executing neural network model
WO2020051918A1 (en) Neuronal circuit, chip, system and method therefor, and storage medium
Uscumlic et al. Design space exploration with deterministic latency guarantees for crossbar mpsoc architectures
CN113377546B (en) Communication avoidance method, apparatus, electronic device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant