CN116245150A - Neural network reconfigurable configuration mapping method for FPGA (field programmable Gate array) resources - Google Patents
- Publication number: CN116245150A
- Application number: CN202310174934.6A
- Authority
- CN
- China
- Prior art keywords
- mapping
- fpga
- resources
- neural network
- resource
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Neurology (AREA)
- Complex Calculations (AREA)
Abstract
The invention discloses a neural network reconfigurable configuration mapping method oriented to FPGA resources, relating to the technical field of embedded artificial intelligence systems, and comprising the following steps: adjusting the FPGA platform according to different task requirements and constructing multiple resource-constraint-oriented mapping schemes; establishing a mapping evaluation model by analyzing the resource characteristics of different FPGA platforms, evaluating the mapping schemes, and analyzing both the differences between mapping schemes and their performance differences across FPGA platforms; and deploying different neural network algorithms according to the application field and constructing a partially reconfigurable mapping adjustment scheme. By analyzing the resource characteristics of the FPGA platform, the mapping space is effectively reduced through resource constraints. When a particular resource is limited, the performance boundary of the accelerator is further explored by substituting different resources that implement the same logic. The method supports configuration mapping under partial reconfiguration of the FPGA: by analyzing the resource changes after partial reconfiguration of the target platform and adjusting parameters accordingly, a new mapping scheme can be completed rapidly.
Description
Technical Field
The invention relates to the technical field of embedded artificial intelligence systems, and in particular to a neural network reconfigurable configuration mapping method oriented to FPGA resources.
Background
Because the FPGA platform has limited bandwidth, there is a mismatch with the high data throughput of neural network algorithms. Since FPGA resources and bandwidth are not fully exploited during the mapping from algorithm to hardware structure, existing methods cannot make full use of the available resources to achieve the best performance. Neural network algorithms iterate rapidly and bring higher complexity and specificity. The large data volume of neural networks imposes high storage requirements, and the data-flow scheduling method must match the local mapping from algorithm to hardware. In addition, owing to the complex variability of neural network structures and the diversity of deployment environments, rapid design space exploration should be performed with the resources of the target FPGA platform as constraints, and the optimal mapping method should be found by combining the adaptability between the algorithm and the hardware.
Performance is strongly affected by the mapping choice: implementing two different convolutional layers of ResNet-50 on a single design instance corresponds to different mapping choices, and for the same convolutional layer on a fixed accelerator structure there may be a ten-fold performance gap between different mapping methods. It is therefore necessary to find, by algorithmic means, the best mapping method specific to a given FPGA platform and a given neural network algorithm.
A specific FPGA platform provides limited preset on-chip memory and limited off-chip bandwidth. The resource and bandwidth constraints differ between FPGA platforms, so the available parallelism differs when the same deep neural network (DNN) model is deployed on different platforms, and the on-chip resources of the FPGA cannot be fully utilized. On the other hand, because the memory footprint of a DNN is large and the operation count and model size vary greatly between DNN models, a fixed mapping method cannot make full use of the limited FPGA resources for every DNN model. Therefore, the accelerator hardware and the software mapping method must be optimized together to overcome the impact of the FPGA's limited on-chip memory on each DNN model.
Disclosure of Invention
The embodiment of the invention provides a neural network reconfigurable configuration mapping method for FPGA resources, which can solve the problems in the prior art.
The invention provides a neural network reconfigurable configuration mapping method oriented to FPGA resources, which comprises the following steps:
adjusting the FPGA platform according to different task demands, and constructing a plurality of mapping schemes oriented to resource constraint;
establishing a mapping evaluation model by analyzing the resource characteristics of different FPGA platforms, evaluating a plurality of mapping schemes, and analyzing the differences between the different mapping schemes and the performance differences on the different FPGA platforms;
and deploying different neural network algorithms according to the application field, and constructing a part of reconfigurable mapping adjustment scheme.
Compared with the prior art, the invention has the beneficial effects that:
(1) The FPGA platform resource characteristics are analyzed, and the mapping space is effectively reduced through resource constraints.
(2) When a particular resource is limited, the performance boundary of the accelerator is further explored by substituting different resources that implement the same logic.
(3) Configuration mapping under partial reconfiguration of the FPGA is supported: by analyzing the resource changes after partial reconfiguration of the target platform and adjusting parameters accordingly, a new mapping scheme can be completed rapidly.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the neural network reconfigurable configuration mapping method for FPGA resources;
FIG. 2 is a diagram of the difference between NN model size and memory cell size on an FPGA;
FIG. 3 is a graph of the inference latency error of the performance prediction method;
FIG. 4 is a graph of the accelerator power consumption error of the performance prediction method;
FIG. 5 is a schematic diagram of the algorithm for mapping a multi-dimensional array to linear storage space;
FIG. 6 is a schematic diagram of the reconfigurable configuration mapping scheme search algorithm.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to FIGS. 1-6, the invention provides a neural network reconfigurable configuration mapping method oriented to FPGA resources, comprising the following steps:
The first step: adjusting the FPGA platform according to different task requirements and constructing multiple resource-constraint-oriented mapping schemes, which specifically comprises the following steps:
(1) Extracting accelerator structure parameters and runtime parameters with common characteristics by combining FPGA resource characteristics, and providing support for the construction of a resource constraint evaluation method and a performance prediction model;
(2) Combining the calculated data dependency relationship and the neural network structure parameters, and carrying out calculation system performance prediction and hardware structure adaptability evaluation aiming at the accelerator system architecture;
(3) And selecting and determining the optimal intelligent computing system architecture and a hardware mapping scheme thereof according to the evaluation result.
The second step: establishing a mapping evaluation model by analyzing the resource characteristics of different FPGA platforms, evaluating the multiple mapping schemes, and analyzing the differences between the mapping schemes and their performance differences on different FPGA platforms.
The third step: deploying different neural network algorithms according to the application field and constructing a partially reconfigurable mapping adjustment scheme.
In the resource-adaptive architecture design process, FPGA resources are used as constraints to reduce the design space, and the mapping method is selected and optimized through a performance prediction evaluation model.
According to the determined array scale and the data reuse method in the FPGA, the on-chip data bandwidth requirement can be determined, which constrains the output bandwidth of the memory banks. Meanwhile, the parallel computing method further limits the mapping method, so that the storage resource cost caused by data caching is constrained by the BRAM resources in the FPGA, and the computing resource cost caused by the computing tasks to be executed is constrained by the DSP resources in the FPGA.
When the storage resources are limited, the task allocation strategies of each organization level of the computing components need to be adjusted to meet the storage resource constraint. If the constraint still cannot be met after adjusting the computing-task allocation strategy, the scale of the computing components is reduced and the architecture search is restarted.
When the computing resources are constrained, additional computing components need to be built according to the computing tasks, and the DSP resource gap is filled with efficient multiply-adders built in advance from LUT resources.
In addition to computing and storage resources, the mapping scheme is also limited by the FPGA interface resources. When the compute array scale is large, the data supply requirement is high, which causes frequent updates of the data in the on-chip buffer. Off-chip memory access must then be performed through the bus interface, and when the bus transmission rate cannot keep up with the computation requirement, the mapping method needs to be adjusted so that large data volumes are transferred in a single burst as far as possible. In this case, according to the mapping scheme, an off-chip bandwidth optimization scheme and an on-chip buffer access optimization scheme are used simultaneously to further relieve the communication bandwidth pressure.
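As an illustration of the constraint-driven adjustment loop described above, the following Python sketch checks a candidate mapping against DSP, BRAM and interface-bandwidth budgets and applies the fallback actions in order; all class names, fields and numeric factors are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass

@dataclass
class FpgaBudget:
    dsp: int            # available DSP slices
    bram_kb: int        # available BRAM capacity in KB
    bus_mbps: float     # sustainable off-chip bus bandwidth

@dataclass
class Mapping:
    dsp_cost: int       # DSPs required by the compute array
    buffer_kb: int      # on-chip buffering implied by the tiling
    demand_mbps: float  # off-chip traffic implied by the tiling
    array_scale: int    # compute-array scale (e.g. RN*FN*PN)

def adjust_until_feasible(m: Mapping, budget: FpgaBudget, lut_mul_adders: int) -> Mapping:
    """Shrink or adjust a candidate mapping until it fits the FPGA budget."""
    # Computing resources: LUT-based multiply-adders fill the DSP gap first,
    # otherwise the compute-component scale is reduced.
    while m.dsp_cost > budget.dsp + lut_mul_adders and m.array_scale > 1:
        m.array_scale -= 1
        m.dsp_cost = m.array_scale * 2          # illustrative re-estimate of DSP need
    # Storage resources: re-tile (here simply halve the buffers) to meet the BRAM
    # constraint; if that fails, the caller restarts the architecture search.
    while m.buffer_kb > budget.bram_kb and m.buffer_kb > 1:
        m.buffer_kb //= 2
    # Interface resources: when the bus cannot keep up, prefer long bursts
    # (modelled here as a simple efficiency factor on the demanded bandwidth).
    if m.demand_mbps > budget.bus_mbps:
        m.demand_mbps *= 0.7                    # assumed gain from burst optimisation
    return m
```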
The performance prediction model accepts parameter information such as DNN structure (number of layers, layer structure, precision, etc.), accelerator hardware architecture (memory hierarchy, number of PEs, etc.), hardware mapping, unit energy/delay cost of MAC operation, and memory access energy consumption for various memory hierarchies. The performance prediction model outputs estimated energy consumption, delay, and resource consumption when executing DNN in a target accelerator defined by a given neural network topology, hardware architecture, and hardware map. By means of coarse-grained fast evaluation, performance prediction is performed without considering the data channels of each module.
Constructing the partially reconfigurable mapping adjustment scheme specifically comprises the following steps:
determining internal adjustment content of the neural network accelerator, including loading, unloading and modifying of the neural network operator acceleration unit, calculating array scale adjustment and storage array scale adjustment; determining a resource constraint limit on the adjusted accelerator; the mapping scheme is redetermined in conjunction with a neural network algorithm.
The accelerator adjustment content is used to update the mapping model parameter list, including algorithm parameters such as convolution kernel size, feature map size, input/output channels, sliding stride, and activation method, and architecture parameters such as compute array scale, storage array scale, and interface bandwidth. The updated parameter information is used as a new constraint for re-determining the mapping scheme; based on the multiple resource-constraint-oriented mapping schemes, the mapping scheme is determined within the updated mapping space boundary after the FPGA has been partially reconfigured.
Example 1
In the mapping method of the present invention, the parameter information used for adjustment is determined as follows.
1. Determining the parameters of the neural network. The basic operators of the neural network include convolution (conv), fully connected (fc), pooling (pool), batch normalization (bn), nonlinear activation (act), residual (res), and the like. Each operator has its own structural information.
(1) conv: the basic structural information of the convolutional layer includes the number of input channels (Chin), the number of output channels (Chout), the feature map size (H/W), the padding size (pad), the sliding stride (stride), and the convolution kernel size (K).
(2) fc: the basic structural information of the fully connected layer includes the number of input channels (Chin), the number of output channels (Chout), and the feature map size (H/W).
(3) pool: the basic structural information of the pooling layer includes the pooling type (pool_type), the pooling window size (pool_size), the feature map size (H/W), and the number of channels (Ch).
(4) bn: the basic structural information of the batch normalization layer includes the feature map size (H/W) and the number of channels (Ch).
(5) act: the basic structural information of the nonlinear activation layer includes the activation type (act_type), the feature map size (H/W), and the number of channels (Ch).
(6) res: the residual layer includes the splice data source (res_src) information.
2. Determining the accelerator structure parameters. To determine the accelerator architecture parameters, the hardware organization must be specified, i.e., the topology of the interconnect between compute units (PEs) and storage units, as well as the mapping space constraints that limit the set of mappings the hardware allows. By building an abstract template of the architecture with sufficient parameterization capability, various architectures of interest can be modeled. For each storage level, the number of banks MN, the number of entries I in each bank, the number of bits W in each entry, the memory bandwidth B, and various other microarchitectural attributes can be specified. The interconnect network topology can be inferred automatically from the storage hierarchy specification, and other microarchitectural attributes can be specified explicitly. For the computing units, the number of coarse processing units (RCUs) RN, the number of fine processing units (FCUs) FN in each RCU, the number of processing elements (PEs) PN within each FCU, and the data quantization bit width QW need to be determined. For the data interface, the data bit width DW and the burst length BL need to be determined.
3. After the neural network structure parameters and the accelerator structure parameters are determined, the runtime parameters need to be determined according to the mapping method and the scheduling method. The number of memory accesses MA, the average resource utilization RA, the on-chip buffer throughput TR, and so on all need to be considered by the evaluation model as runtime parameters. In addition, because the neural network structure is tiled during the mapping process, the four additional dimensions TI, TO, TW, TH introduced by tiling the channels and the feature map are also used as runtime parameters for determining the parallel computation scale, the data scheduling method, and so on.
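To keep the three parameter groups above in view, the following sketch collects them as plain data containers; only the abbreviations (Chin, Chout, MN, RN, FN, PN, TI, TO, TW, TH, ...) follow the text, while the grouping into dataclasses is an illustrative assumption.

```python
from dataclasses import dataclass

@dataclass
class ConvLayerParams:      # network parameters of a conv operator
    Chin: int               # input channels
    Chout: int              # output channels
    H: int                  # feature map height
    W: int                  # feature map width
    pad: int                # padding size
    stride: int             # sliding stride
    K: int                  # convolution kernel size

@dataclass
class AcceleratorParams:    # accelerator structure parameters
    MN: int                 # memory banks per storage level
    I: int                  # entries per bank
    Wbits: int              # bits per entry (called W in the text)
    B: float                # memory bandwidth
    RN: int                 # coarse processing units (RCUs)
    FN: int                 # fine processing units (FCUs) per RCU
    PN: int                 # processing elements (PEs) per FCU
    QW: int                 # data quantization bit width
    DW: int                 # data interface bit width
    BL: int                 # burst length

@dataclass
class RuntimeParams:        # fixed by the mapping and scheduling choice
    MA: int                 # number of memory accesses
    RA: float               # average resource utilization
    TR: float               # on-chip buffer throughput
    TI: int                 # tiled input channels
    TO: int                 # tiled output channels
    TW: int                 # tiled feature map width
    TH: int                 # tiled feature map height
```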
Example 2
In the mapping method of the invention, FPGA resources are used as constraints to determine the boundary of the mapping space.
First, 70% of the DSPs are used as the base value of the convolution computation array, and the rest are reserved for the units that accelerate the computation of other operators. The parallel computation scale of the data is organized along three dimensions, corresponding to RN, FN, PN in the structure parameters and to TI, TO, K in the runtime and network parameters, respectively. After the compute array scale is determined, the adjustment of computation parallelism and data reuse caused by the array scale changes the data requirement. Without a data reuse method, the total data requirement is RN * FN * PN * 2, which leads to huge on-chip memory access and data transmission overhead. The data reuse method reduces the total on-chip data transmission: only part of the data needs to be accessed, and the data are sent to the inputs of the operation units by broadcast and multicast. Therefore, the on-chip data bandwidth requirement can be determined from the determined compute array scale and data reuse method, which constrains the output bandwidth of the memory banks. Meanwhile, the parallel computing method further limits the mapping method, so that the storage resource cost caused by data caching is limited by the BRAM resources in the FPGA. When the storage resources are limited, the task allocation strategies of the various organization levels of the computing components need to be adjusted to meet the storage resource constraints. If the constraints still cannot be met after adjusting the computing-task allocation strategy, the scale of the computing components is reduced and the architecture search is restarted.
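The sizing rule above (70% of the DSPs for the convolution array, a raw data demand of RN * FN * PN * 2 operands when no reuse is applied) can be sketched as follows; the reuse factor and the example numbers are hypothetical.

```python
def size_conv_array(total_dsp: int, dsp_per_mac: float, reuse_factor: float) -> dict:
    conv_budget = int(total_dsp * 0.7)            # base value for the convolution array
    macs = int(conv_budget / dsp_per_mac)         # parallel MACs (roughly RN*FN*PN)
    raw_demand = macs * 2                         # operands per cycle with no data reuse
    effective_demand = raw_demand / reuse_factor  # after broadcast/multicast reuse
    return {"conv_dsp": conv_budget,
            "parallel_macs": macs,
            "onchip_words_per_cycle": effective_demand}

# Example: 2000 DSPs, fixed16 arithmetic (1 DSP per MAC, Table 1), 8x data reuse.
print(size_conv_array(2000, dsp_per_mac=1.0, reuse_factor=8.0))
```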
In the network construction process, three sets of parameters are required: network splitting method, accelerator parameters, layer tasks assigned to accelerators. To determine the optimal parameter configuration given the resource constraints, an accurate resource and performance model needs to be built for each layer of accelerators.
Adjusting different accelerator parameters has different impacts on resource consumption. An exact formulation relating resource cost to parameter settings is critical for system performance optimization. In neural network accelerator design, LUTs and FFs are not the bottleneck of accelerator system generation, while on-chip DSPs and BRAMs are significant limiting factors and therefore need to be evaluated carefully in the modeling process. The main purpose of DSP blocks in a convolutional accelerator is to build multipliers and adders, and DSP consumption is related to the type of data being processed. Thus, the DSP consumption of a convolutional accelerator can be expressed by equation (1):
N_dsp = DSP_data_type * TI * TO (1)
The number of DSPs used for the multipliers and adders of different data types is shown in Table 1. Furthermore, the max-pooling acceleration process does not consume DSP resources, while the average-pooling accelerator requires DSP modules to compute the average output.
TABLE 1 Data types and DSP consumption

| Data type | float | fixed32 | fixed16 | fixed8 |
|---|---|---|---|---|
| DSP consumption | 5 | 4 | 1 | 0.5 |
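Equation (1) and Table 1 translate directly into a small helper; the function name is illustrative, while the coefficients are those of Table 1.

```python
# Table 1: DSPs per multiplier/adder pair for each data type.
DSP_PER_MAC = {"float": 5, "fixed32": 4, "fixed16": 1, "fixed8": 0.5}

def conv_dsp_consumption(data_type: str, TI: int, TO: int) -> float:
    """Equation (1): N_dsp = DSP_data_type * TI * TO."""
    return DSP_PER_MAC[data_type] * TI * TO

# A 16x16 tile of fixed16 MACs needs 256 DSPs; max pooling would add none.
assert conv_dsp_consumption("fixed16", 16, 16) == 256
```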
Considering the three resources (DSP, BRAM and Logic) in the FPGA platform, resource utilization is compared to show the design preference for hardware resources. In a comparison of DSP and BRAM, the accelerator design favors using more DSP resources, and a similar preference is seen for Logic. This suggests that current FPGA designs are more likely to be limited by computational resources. Therefore, for an FPGA platform with relatively limited DSP resources, Logic and DSPs can be used together to complete computing tasks and maintain high parallelism, while for an FPGA platform with sufficient DSP resources, only DSPs may be used to ensure a higher operating frequency.
Table 2 shows the resource consumption of building up different precision multipliers or adders using different FPGA resources. When a certain resource of the FPGA is insufficient, the resource allocation can be properly adjusted to maximize the utilization of the resource, so that higher calculation efficiency is realized.
Table 2 comparison of FPGA resource consumption of multipliers and adders for different types of data
The accelerators require BRAM resources to buffer data between and within accelerators. Although current large FPGAs additionally provide Ultra RAM (URAM) and the like as on-chip storage, its consumption model is similar to that of BRAM, so the RAM usage of the accelerator system can be estimated with the BRAM model. In addition to the raw input data and weight data, the output data between the computation blocks is stored as much as possible in on-chip BRAM. BRAM consumption is calculated from two aspects: 1) the BRAM consumed by buffering input data inside the accelerator; 2) the BRAM consumed by data buffering between different modules of the accelerator. The internal BRAM used by the accelerator is determined by the tile size after slicing and by the stride and padding (pad) of the layers in the input DNN model. The weight buffer size depends on the maximum kernel size of the layers allocated to the accelerator. A single BRAM block is limited to one read port and one write port. RAM in the FPGA is organized as BRAM blocks with fixed memory capacity, each BRAM block having a capacity of 18 Kb (URAM: 288 Kb). Thus, in the BRAM usage approximation, the buffer of each partition occupies at least one BRAM block. The approximate RAM consumption of the accelerator system is shown in equation (2), where BRAM_depth refers to the BRAM blocks of 1K depth in the platform. When the buffer size is small (less than 16), for example the weights of most network models, it is not counted as BRAM and can be implemented using LUTRAM resources in the FPGA chip.
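Since equation (2) itself is not reproduced in this text, the following is only a plausible sketch of the counting rule described above (at least one 18 Kb block per partitioned buffer, very small buffers moved to LUTRAM); the function signature is an assumption.

```python
import math

def approx_bram_blocks(buffers, block_kbits=18):
    """buffers: list of (depth_in_words, word_width_bits) per partitioned buffer."""
    blocks = 0
    for depth, width in buffers:
        if depth < 16:        # small buffers are assumed to go to LUTRAM, not BRAM
            continue
        bits = depth * width
        blocks += max(1, math.ceil(bits / (block_kbits * 1024)))  # >= 1 block per partition
    return blocks
```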
Example 3
The mapping method of the invention uses an analytical model to complete the early performance evaluation of a mapping scheme.
The performance prediction model accepts parameter information such as DNN structure (number of layers, layer structure, precision, etc.), accelerator hardware architecture (memory hierarchy, number of PEs, etc.), hardware mapping, unit energy/delay cost of MAC operation, and memory access energy consumption for various memory hierarchies. The performance prediction model outputs estimated energy consumption, delay, and resource consumption when executing DNN in a target accelerator defined by a given neural network topology, hardware architecture, and hardware map. By means of coarse-grained fast evaluation, performance prediction is performed without considering the data channels of each module.
The performance evaluation uses an analytical model that describes the energy, delay and resource consumption of the accelerator with equations, according to the DNN model and the hardware design description. Using a system-level modeling approach, the total energy and latency of all modules are estimated from analytical equations and the built-in attributes of each module. The invention adopts a layer-wise sequential computation mode, so the computation and energy consumption prediction are completed by modeling layer by layer and then accumulating over all layers. The computation delay is mainly influenced by the parallel computation method and the computation scale; since each granularity of computing component in the compute array is responsible for the computation of one dimension, when the mapping method is poor or the algorithm size parameters are badly mismatched with the compute array size parameters, some computing components are idle, and the execution period is therefore rounded up. For each layer, regardless of its type, the energy consumption can be divided into three parts: computation energy, on-chip memory access energy, and off-chip memory access energy; the off-chip memory access energy is the largest.
The delay performance is mainly influenced by the mapping method, the parallel computation method, the data-flow scheduling method, and the off-chip memory access bandwidth. The optimal network slicing scheme can also be explored through performance evaluation. The 4-dimensional slicing is divided into two types: channel slicing constrained by computational resources, and feature map slicing constrained by on-chip storage resources. Assuming that a convolutional layer is divided into p computation blocks by the channel parameters and the internal structure parameters of each block are [TI, TO, TW, TH], the layer delay can be obtained by superposing the computation block delays. The delay mainly consists of the memory access delay and the computation delay in the computation process. The access delay can be expressed by formulas (3) and (4),
where TW and TH denote the width and height of the computation block and Bandwidth denotes the off-chip memory access bandwidth. When multiple data interfaces are used for off-chip memory access, the feature map data and the weight data can be moved to the on-chip buffer simultaneously, so only the larger of the two delays is taken as the access delay, and formula (3) is used. When only one data interface is available for off-chip memory access, the input feature map, weight, and output feature map data must be time-division multiplexed to complete the data transfer, so formula (4) is used.
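Formulas (3) and (4) are referenced but not reproduced above, so the following sketch only follows the verbal description: per-tile traffic divided by the off-chip bandwidth, taking the larger of the feature-map and weight transfers when separate interfaces are available, and their sum (including the output feature map) when a single interface is time-multiplexed. The byte-count expressions are illustrative assumptions.

```python
def tile_access_delay(TI, TO, TW, TH, K, bytes_per_word, bandwidth_bytes_per_s,
                      separate_interfaces=True):
    # Per-tile data volumes (illustrative; output handling for the multi-interface
    # case is an assumption of this sketch).
    ifmap_bytes  = TI * TW * TH * bytes_per_word
    weight_bytes = TI * TO * K * K * bytes_per_word
    ofmap_bytes  = TO * TW * TH * bytes_per_word
    if separate_interfaces:        # cf. formula (3): transfers overlap, take the larger
        traffic = max(ifmap_bytes, weight_bytes)
    else:                          # cf. formula (4): one interface is time-multiplexed
        traffic = ifmap_bytes + weight_bytes + ofmap_bytes
    return traffic / bandwidth_bytes_per_s
```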
The computation delay is mainly influenced by a parallel computation method and a computation scale, and since each granularity computation component in the computation array is responsible for computation of one dimension, when the mapping method is poor or the algorithm size parameter is extremely unmatched with the computation array size parameter, part of the computation components are idle, and therefore the execution period is rounded up.
In addition, since off-chip memory access and parallel data computation can be performed simultaneously, the portion with the larger latency is taken as the critical path, and the time consumed by the other type of operation is masked.
Therefore, with the optimization target as expressed in formula (7), the optimal slicing scheme of each layer is found under the limits of the on-chip storage resource (BRAM) and the computing resource (MAC), and the minimum delay accumulated layer by layer is the delay performance optimization index.
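A minimal sketch of the search implied by formula (7), assuming the analytical cost models are available as callbacks: enumerate candidate [TI, TO, TW, TH] tilings, discard those that exceed the BRAM or MAC budget, and keep the tiling with the smallest overlapped delay. Candidate values and callback names are hypothetical.

```python
from itertools import product

def best_tiling(layer, budgets, mem_delay, comp_delay, bram_cost, mac_cost,
                candidates=(1, 2, 4, 8, 16, 32, 64)):
    """Return the [TI, TO, TW, TH] tiling with the minimum overlapped delay."""
    best, best_delay = None, float("inf")
    for tile in product(candidates, repeat=4):          # tile = (TI, TO, TW, TH)
        if bram_cost(layer, tile) > budgets["bram"]:    # on-chip storage constraint
            continue
        if mac_cost(layer, tile) > budgets["mac"]:      # computing resource constraint
            continue
        # Off-chip access and computation overlap, so the critical path dominates.
        delay = max(mem_delay(layer, tile), comp_delay(layer, tile))
        if delay < best_delay:
            best, best_delay = tile, delay
    return best, best_delay
```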
Because a layer-wise sequential computation mode is adopted, the energy consumption prediction is completed by modeling layer by layer and then accumulating over all layers. For each layer, regardless of its type, the energy consumption can be divided into three parts: computation energy, on-chip memory access energy, and off-chip memory access energy; the off-chip memory access energy is the largest.
Based on the above preconditions, the total network energy consumption is the sum of the per-layer energy consumptions (equation (8)).
the energy consumption of the i-th layer can be modeled as:
Energy_i = E_C * C_i + E_BRAM * N_buffer + E_DRAM * N_DRAM (9)
where E_C is the energy coefficient of a single operation, C_i is the computation amount of the layer, N_buffer is the number of on-chip memory accesses of the layer, N_DRAM is the number of off-chip memory accesses of the layer, and E_BRAM and E_DRAM are the energy coefficients for accessing on-chip BRAM and off-chip DRAM, respectively.
For the convolutional layer,
C_conv = Co*Ci*H*W*K^2 (MUL) + Co*H*W*(Ci*K^2 - 1) (ADD) (10)
Equation (10) can be approximated as:
C_conv ≈ 2*Co*Ci*H*W*K^2 (11)
Since a multiplication is approximately equal in cost to an addition, the energy coefficient is obtained by measuring the total energy consumption of M multiply-accumulate operations and dividing the result by M.
For the activation layer,
C_act = Co*H*W (12)
and the energy coefficient E_C = E_COM.
For the pooling layer (taking average pooling as an example), the computation amount is given by equation (13), and the energy coefficient E_C = E_ADD.
For the bn layer,
C_bn = Co*H*W (14)
and the energy coefficient E_C = 2*(E_MUL + E_ADD).
The memory access energy is divided into on-chip and off-chip parts, namely the energy consumed by accessing on-chip memory and by accessing off-chip DRAM. During convolution computation, the total number of on-chip memory accesses is
NB_conv = 2*Co*Ci*H*W*K^2 + Co*H*W (15)
For the activation layer,
NB_act = 2*Co*H*W (16)
Since the data is pre-stored in the on-chip buffer, this portion of the data is entirely available from on-chip memory.
The number of off-chip memory accesses is
ND_conv = a*Ci*H*W + b*Co*Ci*K^2 + Co*H*W (17)
Because different data-flow scheduling methods lead to different computation orders, the input data and the weight data are not necessarily read from off-chip only once, so the corresponding scaling coefficients a and b are obtained according to the specific mapping and scheduling methods. Because a computation fusion method is used, multiple operators are organized in a computation layer in a pipelined manner; for the on-chip memory access part, the number of memory accesses is still the sum of the results of formulas (15) and (16),
NB = NB_conv + NB_act = 2*Co*Ci*H*W*K^2 + 3*Co*H*W (18)
For the off-chip memory access, the calculation still follows formula (17).
Furthermore, the evaluation is further constrained using the roofline model. Since the input/output/weight data volume differs in each tile, different burst lengths and access patterns result in different effective bandwidths. Thus, different designs have different final rooflines, which makes the original roofline-based approach very inaccurate for predicting bandwidth-intensive applications (e.g., fully connected layers). It is therefore desirable to normalize the DRAM traffic of input/output/weight accesses to the maximum effective bandwidth using a normalization factor.
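A minimal sketch of that normalization, assuming the per-stream effective bandwidths are known; names are illustrative.

```python
def normalized_dram_traffic(traffic_bytes, eff_bw, max_eff_bw):
    """traffic_bytes and eff_bw are dicts keyed by 'input', 'output', 'weight'."""
    total = 0.0
    for stream, moved in traffic_bytes.items():
        total += moved * (max_eff_bw / eff_bw[stream])  # penalise poorly bursting streams
    return total  # total / max_eff_bw then serves as the roofline memory-time estimate
```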
Example 4
The mapping method of the invention is combined with an optimization method to determine the optimal mapping scheme.
Before performing the mapping, the following principles are specified: (1) The loop boundary for each tile level determines the tile size for each data space at that level, with the size of the data space slices constrained by the buffer size for each level. (2) Parallel loops represent the division of tile space between instances of one level, this particular mapping resulting in duplication of some input data between adjacent tiles. (3) The order of the loops within a slice level determines the order in which sub-slices are transferred from that level to the internal level during execution. This representation produces a strictly inclusive sliced hierarchy, but this may not be optimal. When one data space is less reused at a level, it is allowed to extend capacity through that level to other data spaces, thus achieving a larger tile size and potentially resulting in a more optimized mapping. The mapping specification is used to specify which data spaces are allowed to reside at each level. This unified mapping representation allows reasoning about the possible mapping space of the architecture in a structured, programmed way, eventually finding the optimal mapping method.
The mapping space is the set of valid mappings of a neural network layer onto the architecture. A valid mapping must meet three requirements. First, it must be compatible with the hardware data flow; for example, a mapping for a weight-stationary data flow cannot function properly on output-stationary hardware. Second, the mapping must match the available computing resources; for example, if there are only four PEs in the system, a mapping that requires eight PEs cannot run. Finally, it must meet the memory constraints of the different levels of the hierarchy; the buffer sizes chosen at design time set an upper limit on the tile sizes that can be executed on the hardware. Given the architecture specification and the network layer dimensions, the mapper generates valid mappings by enumerating all possible factorizations of the layer dimensions and the compatible loop orders that satisfy the constraints above. The mapping space can be very large, so it is often impractical to scan all possible mappings. To solve this problem, a mapping space pruning method needs to be designed.
Design space pruning addresses the architecture parameters that must be determined under the FPGA platform resource constraints. The mapper prunes the space by applying two user-defined constraints to optimize energy efficiency or performance. The first constraint is the reuse factor, which determines the minimum amount of temporal reuse for different data types; lower data reuse rates result in more off-chip memory accesses. The reuse factor provides a threshold for trimming smaller tile sizes, which are less energy efficient due to lower memory reuse rates. The second constraint is resource utilization, which determines the number of PEs used in the system. Higher utilization means a larger PE array size and fewer idle resources, which usually yields better computational power efficiency within the PEs, but overall performance may drop if fewer PEs are used. Higher utilization helps to improve system performance but may not be energy optimal. The constraints are only used to trim easily identifiable low-quality points in the design space, such as mappings of significantly smaller scale that under-utilize resources or reuse data poorly, and large-scale accelerator mappings that exceed the FPGA resource budget.
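The following sketch illustrates the enumerate-then-prune idea for a single layer dimension; the reuse and utilization proxies are deliberately crude stand-ins for the models described above, and all thresholds are hypothetical.

```python
def factor_pairs(n):
    """All (spatial, temporal) factorizations of a layer dimension n."""
    return [(f, n // f) for f in range(1, n + 1) if n % f == 0]

def enumerate_mappings(layer_dim, pe_count, buffer_words, min_reuse=2.0, min_util=0.5):
    valid = []
    for spatial, temporal in factor_pairs(layer_dim):
        if spatial > pe_count:          # must match the available compute resources
            continue
        if temporal > buffer_words:     # the tile must fit the buffer at this level
            continue
        reuse = temporal                # crude stand-in for temporal data reuse
        util = spatial / pe_count
        if reuse < min_reuse or util < min_util:
            continue                    # user-defined pruning constraints
        valid.append({"spatial": spatial, "temporal": temporal,
                      "reuse": reuse, "utilization": util})
    return valid
```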
Example 5
The steps of the partially reconfigurable mapping adjustment scheme are as follows: (1) determining the internal adjustment content of the neural network accelerator, including loading, unloading and modification of the neural network operator acceleration units, compute array scale adjustment, and storage array scale adjustment; (2) determining the resource constraint limits, i.e., the mapping boundary, on the adjusted accelerator; (3) re-determining the mapping scheme in combination with the neural network algorithm.
The accelerator adjustment content is used to update the mapping model parameter list, including algorithm parameters such as convolution kernel size, feature map size, input/output channels, sliding stride, and activation method, and architecture parameters such as compute array scale, storage array scale, and interface bandwidth. The updated parameter information is used as a new constraint for re-determining the mapping scheme. Based on the resource-constraint-oriented mapping schemes, the mapping scheme is determined within the updated mapping space boundary after the FPGA has been partially reconfigured.
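A minimal sketch of this re-mapping step, assuming a search_mapping routine that implements the resource-constraint-oriented search described earlier (the routine itself and the parameter keys are hypothetical):

```python
def remap_after_partial_reconfig(param_list: dict, updates: dict, search_mapping):
    """Re-run the mapping search with the post-reconfiguration parameters as constraints."""
    param_list = {**param_list, **updates}   # e.g. new array scales, interface bandwidth
    constraints = {
        "dsp":  param_list["compute_array_scale"],
        "bram": param_list["storage_array_scale"],
        "bw":   param_list["interface_bandwidth"],
    }
    return search_mapping(param_list, constraints)  # new scheme inside the updated boundary

# Usage (hypothetical): remap_after_partial_reconfig(params, {"compute_array_scale": 512}, search_mapping)
```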
The mapping method of the invention is mainly aimed at neural network acceleration applications and completes the mapping process from algorithm software to accelerator hardware. It is oriented to multiple types of FPGA platforms and is applicable to multiple neural network algorithms. The method explores the mapping space under FPGA platform resource constraints, predicts the performance of mapping schemes, and determines the optimal mapping scheme subject to the resource constraints of the target FPGA platform. By analyzing the characteristics of different resources and combining the resource preferences observed in the accelerator design process, a suitable resource substitution method is constructed to obtain better acceleration performance.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Claims (10)
1. The neural network reconfigurable configuration mapping method facing FPGA resources is characterized by comprising the following steps of:
adjusting the FPGA platform according to different task demands, and constructing a plurality of mapping schemes oriented to resource constraint;
establishing a mapping evaluation model by analyzing the resource characteristics of different FPGA platforms, evaluating a plurality of mapping schemes, and analyzing the differences between the different mapping schemes and the performance differences on the different FPGA platforms;
and deploying different neural network algorithms according to the application field, and constructing a part of reconfigurable mapping adjustment scheme.
2. The method for mapping reconfigurable configuration of a neural network facing FPGA resources according to claim 1, wherein constructing the plurality of mapping schemes facing resource constraints specifically comprises the following steps:
extracting accelerator structure parameters and operation parameters with common characteristics by combining with the resource characteristics of the FPGA platform, and providing support for the construction of a resource constraint evaluation method and a performance prediction model;
combining the calculated data dependency relationship and the neural network structure parameters, performing resource self-adaptive architecture design, and performing calculation system performance prediction and hardware structure adaptability evaluation aiming at an accelerator system architecture; and selecting and determining the optimal intelligent computing system architecture and a hardware mapping scheme thereof according to the evaluation result.
3. The method for reconfigurable configuration mapping of a neural network for FPGA resources according to claim 2, wherein in the process of performing resource-adaptive architecture design, FPGA resources are used as constraints to reduce a design space, and optimization selection of the mapping method is performed by an evaluation model for performance prediction;
according to the determined array scale and the data multiplexing method in the FPGA, the on-chip data bandwidth requirement can be determined, so that the output bandwidth of the memory bank is restrained; the cost of storage resources caused by the need of data caching is constrained by BRAM resources in the FPGA, and the cost of calculation resources caused by the need of calculation tasks to be executed is constrained by DSP resources in the FPGA.
4. The method for mapping reconfigurable configuration of neural network for FPGA resources according to claim 3, wherein after the storage resources are limited, task allocation policies of each organization level of the computing unit need to be adjusted to satisfy storage resource constraints; if the constraint condition can not be met after the calculation task allocation strategy is adjusted, the scale of the calculation component is reduced and the architecture search is restarted.
5. The method of claim 3, wherein after the computing resources are constrained, additional computing components are required to be constructed according to the computing tasks, and the necessary DSP resource gaps are replaced by efficient multiplication adders constructed by using LUT resources in advance.
6. The method for mapping reconfigurable configuration of neural network facing FPGA resources according to claim 3, wherein the mapping scheme is limited by FPGA interface resources besides the computing resources and storage resources, and when the computing array scale is large, off-chip storage access is required through a bus interface; when the bus transmission rate is difficult to meet the calculation requirement, an off-chip bandwidth optimization scheme and an on-chip buffer access optimization scheme are simultaneously used.
7. The method for mapping reconfigurable configuration of neural network for FPGA resources according to claim 2, wherein the performance prediction model accepts DNN structure, accelerator hardware architecture, hardware mapping, unit energy/delay cost of MAC operation, and memory access energy consumption of various memory hierarchies with various parameter information; the performance prediction model outputs estimated energy consumption, delay, and resource consumption.
8. The method for mapping reconfigurable configuration of a neural network for FPGA resources according to claim 1, wherein the constructing part of the reconfigurable mapping adjustment scheme specifically includes the following steps:
determining internal adjustment content of the neural network accelerator, including loading, unloading and modifying of the neural network operator acceleration unit, calculating array scale adjustment and storage array scale adjustment;
determining a resource constraint limit on the adjusted accelerator;
the mapping scheme is redetermined in conjunction with a neural network algorithm.
9. The method for mapping reconfigurable configuration of FPGA resources of claim 8, wherein the accelerator adjustment content is used to update a mapping model parameter list including algorithm parameters and architecture parameters.
10. The method for mapping reconfigurable configuration of FPGA resources-oriented neural network according to claim 9, wherein the updated parameter information is used as a new constraint condition to re-determine the mapping scheme, and the determination of the mapping scheme is completed in the updated mapping space boundary after the FPGA is locally reconfigured according to the plurality of mapping schemes oriented to the resource constraint.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310174934.6A | 2023-02-28 | 2023-02-28 | Neural network reconfigurable configuration mapping method for FPGA (field programmable Gate array) resources |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN116245150A | 2023-06-09 |
Family

- Family ID: 86632762
- Family application: CN202310174934.6A (CN116245150A, filed 2023-02-28, status: Pending)
- Country status: CN — CN116245150A (en)
Legal Events

- PB01 — Publication
- SE01 — Entry into force of request for substantive examination