CN113592088A - Parallelism determination method and system based on fine-grained convolution calculation structure - Google Patents
- Publication number: CN113592088A
- Application number: CN202110888610.XA
- Authority
- CN
- China
- Prior art keywords: dsp, parallelism, total, para, cycle
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06N3/082 — Computing arrangements based on biological models; neural networks; learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
- G06N3/045 — Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; combinations of networks
Abstract
A parallelism determination method and system based on a fine-grained convolution computing structure, which can remarkably improve the utilization rate of computing resources, finally achieve a prominent CNN acceleration effect, and guarantee that the parallelism configuration scheme with the optimal computing-resource utilization under the resource limitation can be found. The parallelism determination method based on the fine-grained convolution calculation structure comprises the following steps: (1) constructing a problem model; (2) enumerating algorithm constraints; (3) fine-grained traversal solution.
Description
Technical Field
The invention relates to the technical field of FPGA hardware acceleration design of a convolutional neural network, in particular to a parallelism determination method based on a fine-grained convolutional computing structure and a parallelism determination system based on the fine-grained convolutional computing structure.
Background
The convolutional neural network (CNN) is one of the representative algorithms of deep learning. Owing to its excellent performance in the field of artificial intelligence, CNN has attracted wide attention and is applied in high-tech areas such as image classification, speech recognition, face recognition, autonomous driving, and medical imaging.
A field-programmable gate array (FPGA) is a chip with excellent programming flexibility and a high performance-to-power ratio. At present, many CNN forward-inference accelerators that pursue low development cost, a short development cycle, and low application power consumption adopt an FPGA-based acceleration scheme.
Because CNN is a computation-intensive structure, the accelerator needs to fully exploit the computing power of the FPGA chip; therefore, the core topic of accelerator design is how to utilize the on-chip computing resources efficiently. Many classical FPGA CNN accelerators focus on optimizing the convolution operation structure. For example, one document proposes a fine-grained convolution computation structure with flexibility in both temporal and spatial granularity, allowing limited computational resources to deliver greater effective computing power in hardware.
The design of an FPGA CNN accelerator is a systems project: besides hardware support, the chip computing power of the FPGA can be brought into full play only with the support of a corresponding parallelism determination algorithm. Otherwise, without a suitable parallelism configuration, the deployment of the convolution computation structure can hardly reach the ideal state, and uneven resource distribution wastes on-chip computing power. Therefore, a parallelism determination system that can be matched with the fine-grained convolution computation structure has broad application prospects.
Among methods related to FPGA CNN accelerator design, descriptions of parallelism determination fall into two categories.
The first category: the literature lacks a systematic description of the parallelism determination method. Either the determination process of the parallelism is not touched on at all and only the final accelerator's parallelism configuration parameters are given; or only the constraints limiting the parallelism parameter values are listed, without a concrete description of how the parallelism is determined under those constraints. As a result, the parameter selection process depends heavily on the engineering experience of accelerator designers, an optimal selection result cannot be guaranteed, and the opaque selection process offers no reference value to other accelerator designers.
The second category: the literature gives a systematic parallelism determination method, but the adjustment space of the parallelism is limited by the flexibility of the convolution computation structure. For example, each accelerator layer has only two adjustable dimensions, the input parallelism (Para_in) and the output parallelism (Para_out), and the values of Para_in and Para_out are strictly limited to integer powers of 2, which makes the granularity of parallelism increase or decrease too large to allow small-amplitude adjustment. Such a coarse-grained parallelism determination algorithm is unfriendly in two main respects:
1. As the size of the original input image grows in the application scenario, the feature map size of each CNN layer grows accordingly. If the computing-resource allocation is adjusted only through Para_in and Para_out, the per-image computation time granularity and the resource granularity become too large, reducing the utilization efficiency of on-chip computing resources.
2. Parallelism parameters limited to integer powers of 2 are not suitable for all CNN networks. For example, the numbers of input/output feature maps N_in/N_out of many convolution layers in AlexNet (e.g., 3, 96, 384, 192) are not integer powers of 2, and the mismatch between the number of feature maps and the parallelism reduces the efficiency of on-chip computing-resource utilization.
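The mismatch described in point 2 can be made concrete with a small, purely illustrative calculation (the layer width 96 comes from the AlexNet figures above; the lane budget of 50 and the utilization model are assumptions, not from the patent). With power-of-2 parallelism the best feasible choice is 32 lanes, whereas a step-size-1 search can pick 48, which divides 96 exactly:

```python
import math

def effective_parallelism(n_maps: int, para: int) -> float:
    """Useful output channels processed per pass when n_maps channels are
    handled para lanes at a time (the last pass may leave lanes idle)."""
    return n_maps / math.ceil(n_maps / para)

n_maps = 96      # an AlexNet-like layer width mentioned above
budget = 50      # hypothetical lane budget set by the DSP resources

# power-of-2 candidates within the budget vs. fine-grained step-1 candidates
pow2_choices = [p for p in (1, 2, 4, 8, 16, 32) if p <= budget]
fine_choices = range(1, budget + 1)

best_pow2 = max(pow2_choices, key=lambda p: effective_parallelism(n_maps, p))
best_fine = max(fine_choices, key=lambda p: effective_parallelism(n_maps, p))

print(best_pow2, effective_parallelism(n_maps, best_pow2))  # 32 32.0
print(best_fine, effective_parallelism(n_maps, best_fine))  # 48 48.0
```

Under this toy model, fine-grained adjustment recovers 48 effective lanes against 32 for the power-of-2 scheme, illustrating why step-size-1 parameter tuning matters.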
Based on the above analysis, to solve the problem of wasted computing resources, a parallelism determination system with a higher exploration dimension, flexible parameter adjustment, and optimal search results is urgently needed, helping accelerator designers conveniently and efficiently obtain a parallelism configuration scheme matched with the fine-grained convolution structure.
Disclosure of Invention
In order to overcome the defects of the prior art, the technical problem to be solved by the invention is to provide a parallelism determination method based on a fine-grained convolution computing structure, which can remarkably improve the utilization rate of computing resources, finally realize a prominent CNN acceleration effect, and ensure that a parallelism configuration scheme with the optimal utilization rate of the computing resources under the resource limitation can be found.
The technical scheme of the invention is as follows: the parallelism determination method based on the fine-grained convolution calculation structure comprises the following steps:
(1) constructing a problem model: to determine the accelerator's optimal parallelism configuration parameters (Para_in, Para_out, Para_seg), an optimal parallelism search algorithm is proposed, whose design target is: traverse all feasible parallelism combination schemes at the finest granularity within the value intervals, and screen out the parallelism configuration parameters with the highest computing-resource utilization, where Para_in is the input parallelism, Para_out is the output parallelism, and Para_seg is the segmentation parallelism;
(2) enumerating the algorithmic constraints:
Constraint 1. To ensure the rationality of resource allocation, the ratio of the single-layer resource usage #DSP_i to the total number of available DSPs on chip, #DSP_total, should be close to the percentage of the convolution layer's computation amount #OP_i in the total network computation amount #OP_total;
Constraint 2. The throughput of a fully pipelined accelerator is limited by the maximum number of cycles required by a single layer, #cycle_i; to increase throughput, max{#cycle_i} is minimized;
Constraint 3. Σ#DSP_i must not exceed the total number of available DSP resources on chip, #DSP_total;
Constraint 4. Σ#BRAM_i must not exceed the total number of available storage resources on chip, #BRAM_total, where #BRAM_i is the single-layer storage-resource usage;
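The four constraints above can be sketched as executable checks. This is a minimal illustration under assumptions not in the patent: each layer's plan is a dict with its DSP usage, BRAM usage, cycle count, and computation amount, and "close to" in constraint 1 is modeled by a hypothetical relative tolerance:

```python
def satisfies_constraints(layers, dsp_total, bram_total, tolerance=0.2):
    """Check constraints 1, 3, 4 over per-layer plans (hedged sketch)."""
    op_total = sum(l["op"] for l in layers)
    # Constraint 3: total DSP usage within the on-chip budget
    if sum(l["dsp"] for l in layers) > dsp_total:
        return False
    # Constraint 4: total BRAM usage within the on-chip budget
    if sum(l["bram"] for l in layers) > bram_total:
        return False
    # Constraint 1: each layer's DSP share close to its share of #OP_total
    for l in layers:
        share_dsp = l["dsp"] / dsp_total
        share_op = l["op"] / op_total
        if abs(share_dsp - share_op) > tolerance * share_op:
            return False
    return True

def objective(layers):
    # Constraint 2: throughput is limited by the slowest layer, so
    # minimizing max{#cycle_i} is the search objective
    return max(l["cycle"] for l in layers)

example = [{"op": 100, "dsp": 25, "bram": 10, "cycle": 40},
           {"op": 300, "dsp": 75, "bram": 20, "cycle": 50}]
print(satisfies_constraints(example, dsp_total=100, bram_total=64))  # True
print(objective(example))  # 50
```

Constraints 3 and 4 act as hard feasibility filters, while constraint 2 supplies the quantity the later traversal minimizes.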
(3) fine-grained traversal solution
In the process of traversing and searching for the optimal parallelism scheme, Para_seg is uniquely determined by ROW_out, where ROW_out denotes the number of rows of the output feature map segment obtained after convolving a ROW_in-row input feature map segment. Para_in, Para_out, and ROW_out take values in the ranges [1, N_in], [1, N_out], and [1, SIZE_out] respectively, and all three can be increased or decreased finely with a minimum step size of 1, where N_in is the number of input feature maps of the convolution layer, N_out is the number of output feature maps, and SIZE_out denotes the size of the output feature map. The number of DSP resources required by the i-th convolution layer is given by formula (2), the required number of cycles by formula (3), and the required number of BRAM storage resources by formula (4), where SIZE_ker denotes the convolution kernel size, #cycle_cntl denotes the number of cycles required by the control logic, N_pad denotes the zero-padding size, and C_BRAM denotes the data storage capacity of one BRAM
#DSP_i = ROW_out · SIZE_ker · Para_in · Para_out    (2)
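Formula (2) can be transcribed directly; formulas (3) and (4) are referenced but not reproduced in this text, so only (2) is sketched here (the example values are illustrative, not from the patent):

```python
def dsp_count(row_out: int, size_ker: int, para_in: int, para_out: int) -> int:
    """#DSP_i per formula (2): ROW_out * SIZE_ker * Para_in * Para_out."""
    return row_out * size_ker * para_in * para_out

# e.g. 4 output rows per segment, a 3-wide kernel, 3 input and 8 output lanes
print(dsp_count(4, 3, 3, 8))  # 288
```

Note how every step-size-1 increment of ROW_out, Para_in, or Para_out changes the DSP count multiplicatively, which is what gives the search its fine resource granularity.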
Compared with traditional parallelism determination methods, the invention has a higher exploration dimension, allows fine-grained parameter adjustment, and yields better search results on an FPGA chip. Moreover, it can be applied to accelerator designs on different FPGA platforms and for different CNN network structures, has good universality, and with its help the computing power of an FPGA chip can be brought into full play to achieve a good acceleration effect.
There is also provided a parallelism determination system based on a fine-grained convolution computation structure, comprising:
a building module configured to construct a problem model: to determine the accelerator's optimal parallelism configuration parameters (Para_in, Para_out, Para_seg), an optimal parallelism search algorithm is proposed, whose design target is: traverse all feasible parallelism combination schemes at the finest granularity within the value intervals, and screen out the parallelism configuration parameters with the highest computing-resource utilization;
a constraint module configured to enumerate the algorithm constraints:
Constraint 1. To ensure the rationality of resource allocation, the ratio of the single-layer resource usage #DSP_i to the total number of available DSPs on chip, #DSP_total, should be close to the percentage of the convolution layer's computation amount #OP_i in the total network computation amount #OP_total;
Constraint 2. The throughput of a fully pipelined accelerator is limited by the maximum #cycle_i; to increase throughput, max{#cycle_i} can be reduced;
Constraint 3. Σ#DSP_i must not exceed the total number of available DSP resources on chip, #DSP_total;
Constraint 4. Σ#BRAM_i must not exceed the total number of available storage resources on chip, #BRAM_total;
a traversal module configured such that, in the process of traversing and searching for the optimal parallelism scheme, Para_seg is uniquely determined by ROW_out; Para_in, Para_out, and ROW_out take values in the ranges [1, N_in], [1, N_out], and [1, SIZE_out] respectively, and all three can be increased or decreased finely with a minimum step size of 1, where SIZE_out denotes the size of the output feature map. The number of DSP resources required by the i-th convolution layer is given by formula (2), the required number of cycles by formula (3), and the required number of BRAM storage resources by formula (4), where SIZE_ker denotes the convolution kernel size, #cycle_cntl denotes the number of cycles required by the control logic, N_pad denotes the zero-padding size, and C_BRAM denotes the data storage capacity of one BRAM
#DSP_i = ROW_out · SIZE_ker · Para_in · Para_out    (2)
Drawings
Fig. 1 shows a flow chart of a parallelism determination method based on a fine-grained convolution computation structure according to the invention.
Detailed Description
As shown in fig. 1, the parallelism determination method based on the fine-grained convolution calculation structure includes the following steps:
(1) constructing a problem model: to determine the accelerator's optimal parallelism configuration parameters (Para_in, Para_out, Para_seg), an optimal parallelism search algorithm is proposed, whose design target is: traverse all feasible parallelism combination schemes at the finest granularity within the value intervals, and screen out the parallelism configuration parameters with the highest computing-resource utilization, where Para_in is the input parallelism, Para_out is the output parallelism, and Para_seg is the segmentation parallelism;
(2) enumerating the algorithmic constraints:
Constraint 1. To ensure the rationality of resource allocation, the ratio of the single-layer resource usage #DSP_i to the total number of available DSPs on chip, #DSP_total, should be close to the percentage of the convolution layer's computation amount #OP_i in the total network computation amount #OP_total;
Constraint 2. The throughput of a fully pipelined accelerator is limited by the maximum number of cycles required by a single layer, #cycle_i; to increase throughput, max{#cycle_i} is minimized;
Constraint 3. Σ#DSP_i must not exceed the total number of available DSP resources on chip, #DSP_total;
Constraint 4. Σ#BRAM_i must not exceed the total number of available storage resources on chip, #BRAM_total, where #BRAM_i is the single-layer storage-resource usage;
(3) fine-grained traversal solution
In the process of traversing and searching for the optimal parallelism scheme, Para_seg is uniquely determined by ROW_out, where ROW_out denotes the number of rows of the output feature map segment obtained after convolving a ROW_in-row input feature map segment. Para_in, Para_out, and ROW_out take values in the ranges [1, N_in], [1, N_out], and [1, SIZE_out] respectively, and all three can be increased or decreased finely with a minimum step size of 1, where N_in is the number of input feature maps of the convolution layer, N_out is the number of output feature maps, and SIZE_out denotes the size of the output feature map. The number of DSP resources required by the i-th convolution layer is given by formula (2), the required number of cycles by formula (3), and the required number of BRAM storage resources by formula (4), where SIZE_ker denotes the convolution kernel size, #cycle_cntl denotes the number of cycles required by the control logic, N_pad denotes the zero-padding size, and C_BRAM denotes the data storage capacity of one BRAM
#DSP_i = ROW_out · SIZE_ker · Para_in · Para_out    (2)
Compared with traditional parallelism determination methods, the invention has a higher exploration dimension, allows fine-grained parameter adjustment, and yields better search results on an FPGA chip. Moreover, it can be applied to accelerator designs on different FPGA platforms and for different CNN network structures, has good universality, and with its help the computing power of an FPGA chip can be brought into full play to achieve a good acceleration effect.
Preferably, in step (1), the computation process of the matrix convolution is divided by rows into Para_seg smaller matrix convolutions that are performed in turn. ROW_in, representing the number of rows of the input feature map segment that must be stored after segmentation, is determined by formula (1); the function ξ equals 1 if and only if Para_seg equals 1, and equals 0 otherwise; ROW_out represents the number of rows of the output feature map segment obtained after convolving a ROW_in-row input feature map segment; and Stride represents the convolution stride
ROW_in = SIZE_ker + Stride · (ROW_out − 1) − 2 · N_pad · ξ    (1).
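Formula (1) is short enough to transcribe directly. The sketch below follows the reading above (ξ = 1 only in the unsegmented case Para_seg = 1, where the 2·N_pad padding rows are not stored); the example values are illustrative, not from the patent:

```python
def row_in(size_ker: int, stride: int, row_out: int, n_pad: int,
           para_seg: int) -> int:
    """ROW_in per formula (1); xi = 1 iff Para_seg == 1."""
    xi = 1 if para_seg == 1 else 0
    return size_ker + stride * (row_out - 1) - 2 * n_pad * xi

# segmented case (Para_seg > 1): padding rows are not subtracted
print(row_in(size_ker=3, stride=1, row_out=4, n_pad=1, para_seg=2))  # 6
# unsegmented case (Para_seg == 1): xi = 1 subtracts the 2*N_pad rows
print(row_in(size_ker=3, stride=1, row_out=4, n_pad=1, para_seg=1))  # 4
```

Because ROW_in follows deterministically from ROW_out, the search only needs to enumerate ROW_out, as the traversal step states.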
Para_seg is proposed to improve the computing-resource utilization in two respects.
1. When an oversized feature map makes the granularity of convolution computing resources too large, the convolution of the whole feature map can be completed with a smaller resource granularity by cutting the map and processing the feature map segments sequentially.
2. The first-layer input feature map of most CNN networks has three RGB channels, i.e., N_in equals 3. Such a small N_in compresses the value space of Para_in: in pursuit of higher resource utilization, Para_in can only take the value 1 or 3, which severely limits the resource-planning space of the first layer. Introducing Para_seg is equivalent to extending the value range of Para_in from the integer domain (1 or 3) to the fractional domain (such as 1/2, 1/3, 3/5, ...), greatly expanding the resource-planning space.
Preferably, in step (2), constraints 1 and 2 apply to the single-layer computing-resource usage #DSP_i and the number of cycles required for single-layer calculation, #cycle_i; limited by the number of available resources, constraints 3 and 4 apply to #DSP_i and the single-layer storage-resource usage #BRAM_i.
Preferably, the step (3) comprises the following substeps:
(3.1) data preparation: specify #DSP_total and pre-allocate DSP resources according to the percentage of #OP_i in #OP_total, the number of DSP resources allocated to the i-th layer being #DSP_alloc_i; specify #BRAM_total; and determine the theoretical minimum number of computation cycles #cycle_baseline from #OP_total and #DSP_total;
(3.2) in the i-th layer, traverse in turn, with step size 1, all effective values of the tuple (Para_in, Para_out, ROW_out), obtaining the set S_i of single-layer accelerator time/resource overheads under the various parallelism combinations. According to constraint 1, the selection rule for elements of S_i rests on the following basic assumption: the farther #cycle_i,j and #DSP_i,j deviate from #cycle_baseline and #DSP_alloc_i, the less likely the element is to be an optimal parallelism solution. With α the computation-cycle floating factor and β the DSP-allocation floating factor, any element A_i,j of S_i satisfies the constraints that #cycle_i,j/#cycle_baseline falls within the interval [1−α, 1+α] and #DSP_i,j/#DSP_alloc_i falls within the interval [1−β, 1+β], where #cycle_i,j denotes the number of computation cycles corresponding to element j of S_i, and #DSP_i,j denotes the number of DSP resources occupied by element j of S_i;
(3.3) to obtain the accelerator's time/resource overhead, first compute the Cartesian product S = S_1 × S_2 × … × S_5 of the sets S_i (i = 1–5); each element of S corresponds to a cross-layer combination scheme. Traverse the set S, compute max{#cycle_i} (i = 1–5) for all elements satisfying the resource constraints, and sort these values in ascending order; the element corresponding to min{max{#cycle_i}} is the parallelism allocation scheme the accelerator adopts to obtain the best performance/resource utilization.
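Sub-steps (3.2) and (3.3) amount to a per-layer screening pass followed by a cross-layer min-max search over the Cartesian product. The following is a hedged sketch (the dict-based candidate layout and the toy numbers are assumptions, not from the patent):

```python
import itertools

def screen(candidates, cycle_baseline, dsp_alloc, alpha, beta):
    """Keep candidates whose cycle and DSP counts stay within the
    alpha/beta floating-factor bands around the baselines (sub-step 3.2)."""
    return [c for c in candidates
            if 1 - alpha <= c["cycle"] / cycle_baseline <= 1 + alpha
            and 1 - beta <= c["dsp"] / dsp_alloc <= 1 + beta]

def best_scheme(per_layer_sets, dsp_total, bram_total):
    """min over S of max{#cycle_i}, subject to constraints 3 and 4 (3.3)."""
    best, best_cycle = None, float("inf")
    for combo in itertools.product(*per_layer_sets):
        if sum(c["dsp"] for c in combo) > dsp_total:
            continue
        if sum(c["bram"] for c in combo) > bram_total:
            continue
        worst = max(c["cycle"] for c in combo)
        if worst < best_cycle:
            best, best_cycle = combo, worst
    return best, best_cycle

kept = screen([{"cycle": 95, "dsp": 52}, {"cycle": 150, "dsp": 52}],
              cycle_baseline=100, dsp_alloc=50, alpha=0.1, beta=0.1)
print(len(kept))  # 1

s1 = [{"cycle": 100, "dsp": 10, "bram": 4}, {"cycle": 80, "dsp": 20, "bram": 6}]
s2 = [{"cycle": 90, "dsp": 15, "bram": 5}, {"cycle": 60, "dsp": 40, "bram": 8}]
scheme, worst = best_scheme([s1, s2], dsp_total=40, bram_total=20)
print(worst)  # 90
```

The α/β screening keeps the per-layer sets small, so the Cartesian product in `best_scheme` stays tractable even across five convolution layers.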
Preferably, the step (3) comprises the following substeps:
(I) calculate the ratio γ_i of each layer's computation amount #OP_i to the network's total computation amount #OP_total;
(II) allocate the on-chip available DSPs to each layer in proportion to the computation-amount distribution, the number of DSPs allocated to each layer being #DSP_alloc_i ← γ_i · #DSP_total;
(III) calculate the theoretical minimum number of computation cycles #cycle_baseline from the total computation amount and the total computing resources;
(IV) for layer i, traverse the feasible values of Para_in, Para_out, and ROW_out, i.e., the Cartesian product of the three domains, generating the fully combined parallelism parameter configuration set S_i^0, and calculate the corresponding #cycle_i, #BRAM_i, and #DSP_i;
(V) screen out the data set S_i satisfying the α, β constraints;
(VI) over all convolution layers, traverse all possible combinations of the elements of S_i, i.e., the Cartesian product S of the domains S_i (i = 1–5), and calculate max{#cycle_i} for all elements satisfying the resource constraints;
(VII) sort the values max{#cycle_i} (i = 1–5) in ascending order, select the parallelism element corresponding to min{max{#cycle_i}}, and output the parameter information of the optimal parallelism under the constraints.
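The preparation steps (I)–(III) can be sketched as follows. Note one assumption: the text says only that #cycle_baseline is derived from #OP_total and #DSP_total, so the floor of one operation per DSP per cycle used here is illustrative, not the patent's formula:

```python
import math

def preallocate(op_per_layer, dsp_total):
    """Steps (I)-(III): gamma_i, #DSP_alloc_i, and a cycle-count floor."""
    op_total = sum(op_per_layer)
    gammas = [op / op_total for op in op_per_layer]           # step (I)
    dsp_alloc = [g * dsp_total for g in gammas]               # step (II)
    # step (III): assumed floor of one operation per DSP per cycle
    cycle_baseline = math.ceil(op_total / dsp_total)
    return gammas, dsp_alloc, cycle_baseline

print(preallocate([100, 300], 200))
# ([0.25, 0.75], [50.0, 150.0], 2)
```

Allocating DSPs in proportion to per-layer work is what lets every pipeline stage finish in roughly the same number of cycles, which steps (IV)–(VII) then refine by exhaustive fine-grained search.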
It will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, comprises the steps of the methods of the above embodiments; the storage medium may be ROM/RAM, a magnetic disk, an optical disk, a memory card, or the like. Therefore, corresponding to the method of the present invention, the present invention also includes a parallelism determination system based on a fine-grained convolution computation structure, which is generally expressed as functional modules corresponding to the steps of the method. The system comprises:
a building module configured to construct a problem model: to determine the accelerator's optimal parallelism configuration parameters (Para_in, Para_out, Para_seg), an optimal parallelism search algorithm is proposed, whose design target is: traverse all feasible parallelism combination schemes at the finest granularity within the value intervals, and screen out the parallelism configuration parameters with the highest computing-resource utilization;
a constraint module configured to enumerate algorithmic constraints
Constraint 1. To ensure the rationality of resource allocation, the ratio of #DSP_i to the total number of available DSPs on chip, #DSP_total, should be close to the percentage of the convolution layer's computation amount #OP_i in the total network computation amount #OP_total;
Constraint 2. The throughput of a fully pipelined accelerator is limited by the maximum #cycle_i; to increase throughput, max{#cycle_i} is minimized;
Constraint 3. Σ#DSP_i must not exceed the total number of available DSP resources on chip, #DSP_total;
Constraint 4. Σ#BRAM_i must not exceed the total number of available storage resources on chip, #BRAM_total;
a traversal module configured such that, in the process of traversing and searching for the optimal parallelism scheme, Para_seg is uniquely determined by ROW_out; Para_in, Para_out, and ROW_out take values in the ranges [1, N_in], [1, N_out], and [1, SIZE_out] respectively, and all three can be increased or decreased finely with a minimum step size of 1, where SIZE_out denotes the size of the output feature map. The number of DSP resources required by the i-th convolution layer is given by formula (2), the required number of cycles by formula (3), and the required number of BRAM storage resources by formula (4), where SIZE_ker denotes the convolution kernel size, #cycle_cntl denotes the number of cycles required by the control logic, N_pad denotes the zero-padding size, and C_BRAM denotes the data storage capacity of one BRAM
#DSP_i = ROW_out · SIZE_ker · Para_in · Para_out    (2)
Preferably, in the building module, the computation process of the matrix convolution is divided by rows into Para_seg smaller matrix convolutions that are performed in turn. ROW_in, representing the number of rows of the input feature map segment that must be stored after segmentation, is determined by formula (1); the function ξ equals 1 if and only if Para_seg equals 1, and equals 0 otherwise; ROW_out represents the number of rows of the output feature map segment obtained after convolving a ROW_in-row input feature map segment; and Stride represents the convolution stride
ROW_in = SIZE_ker + Stride · (ROW_out − 1) − 2 · N_pad · ξ    (1).
Preferably, in the constraint module, constraints 1 and 2 apply to the single-layer computing-resource usage #DSP_i and the number of cycles required for single-layer calculation, #cycle_i; limited by the number of available resources, constraints 3 and 4 apply to #DSP_i and the single-layer storage-resource usage #BRAM_i.
Preferably, the traversal module performs the following sub-steps:
(3.1) data preparation: specify #DSP_total and pre-allocate DSP resources according to the percentage of #OP_i in #OP_total, the number of DSP resources allocated to the i-th layer being #DSP_alloc_i; specify #BRAM_total; and determine the theoretical minimum number of computation cycles #cycle_baseline from #OP_total and #DSP_total;
(3.2) in the i-th layer, traverse in turn, with step size 1, all effective values of the tuple (Para_in, Para_out, ROW_out), obtaining the set S_i of single-layer accelerator time/resource overheads under the various parallelism combinations. According to constraint 1, the selection rule for elements of S_i rests on the following basic assumption: the farther #cycle_i,j and #DSP_i,j deviate from #cycle_baseline and #DSP_alloc_i, the less likely the element is to be an optimal parallelism solution. With α the computation-cycle floating factor and β the DSP-allocation floating factor, any element A_i,j of S_i satisfies the constraints that #cycle_i,j/#cycle_baseline falls within the interval [1−α, 1+α] and #DSP_i,j/#DSP_alloc_i falls within the interval [1−β, 1+β], where #cycle_i,j denotes the number of computation cycles corresponding to element j of S_i, and #DSP_i,j denotes the number of DSP resources occupied by element j of S_i;
(3.3) to obtain the accelerator's time/resource overhead, first compute the Cartesian product S = S_1 × S_2 × … × S_5 of the sets S_i (i = 1–5); each element of S corresponds to a cross-layer combination scheme. Traverse the set S, compute max{#cycle_i} (i = 1–5) for all elements satisfying the resource constraints, and sort these values in ascending order; the element corresponding to min{max{#cycle_i}} is the parallelism allocation scheme the accelerator adopts to obtain the best performance/resource utilization.
Preferably, the traversal module performs the following sub-steps:
(I) calculate the ratio γ_i of each layer's computation amount #OP_i to the network's total computation amount #OP_total;
(II) allocate the on-chip available DSPs to each layer in proportion to the computation-amount distribution, the number of DSPs allocated to each layer being #DSP_alloc_i ← γ_i · #DSP_total;
(III) calculate the theoretical minimum number of computation cycles #cycle_baseline from the total computation amount and the total computing resources;
(IV) for layer i, traverse the feasible values of Para_in, Para_out, and ROW_out, i.e., the Cartesian product of the three domains, generating the fully combined parallelism parameter configuration set S_i^0, and calculate the corresponding #cycle_i, #BRAM_i, and #DSP_i;
(V) screen out the data set S_i satisfying the α, β constraints;
(VI) over all convolution layers, traverse all possible combinations of the elements of S_i, i.e., the Cartesian product S of the domains S_i (i = 1–5), and calculate max{#cycle_i} for all elements satisfying the resource constraints;
(VII) sort the values max{#cycle_i} (i = 1–5) in ascending order, select the parallelism element corresponding to min{max{#cycle_i}}, and output the parameter information of the optimal parallelism under the constraints.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention in any way; all simple modifications, equivalent variations and adaptations made to the above embodiment in accordance with the technical spirit of the present invention still fall within the protection scope of the technical solution of the present invention.
Claims (10)
1. A parallelism determination method based on a fine-grained convolution calculation structure, characterized in that it comprises the following steps:
(1) Constructing a problem model: for the configuration parameters (Para_in, Para_out, Para_seg) that determine the optimal parallelism of the accelerator, an optimal-parallelism search algorithm is proposed with the following design goal: traverse, at the finest granularity, all feasible parallelism combination schemes within the value interval and screen out the parallelism configuration parameters with the highest computational resource utilization, where Para_in is the input parallelism, Para_out is the output parallelism, and Para_seg is the segmentation parallelism;
(2) Enumerating the algorithmic constraints:
Constraint 1. To ensure the rationality of resource allocation, the ratio of the single-layer resource usage #DSP_i to the total number of available on-chip DSPs #DSP_total is close to the percentage of the convolutional layer's computation amount #OP_i in the total network computation amount #OP_total;
Constraint 2. The throughput of a fully pipelined accelerator is limited by the maximum number of cycles #cycle_i required by any single layer; to increase throughput, minimize max{#cycle_i};
Constraint 3. Σ#DSP_i does not exceed the total number of available on-chip DSP resources #DSP_total;
Constraint 4. Σ#BRAM_i does not exceed the total number of available on-chip storage resources #BRAM_total, where #BRAM_i is the single-layer storage resource usage;
(3) Fine-grained traversal solution:
In the process of traversing and searching for the optimal parallelism scheme, Para_seg is uniquely determined by ROW_out, where ROW_out represents the number of rows of the output feature map segment obtained by convolving a ROW_in-row input feature map segment; Para_in, Para_out and ROW_out take values in the ranges [1, N_in], [1, N_out] and [1, SIZE_out] respectively, and all three can be finely increased or decreased with a minimum step size of 1, where N_in is the number of input feature maps of the convolutional layer, N_out is the number of output feature maps of the convolutional layer, and SIZE_out denotes the size of the output feature map; the number of DSP resources required by the i-th convolutional layer is given by formula (2), the required number of cycles by formula (3), and the required number of BRAM storage resources by formula (4), where SIZE_ker denotes the convolution kernel size, #cycle_cntl denotes the number of cycles required by the control logic, N_pad denotes the zero-padding size, and C_BRAM denotes the data storage capacity of one BRAM:
#DSP_i = ROW_out · SIZE_ker · Para_in · Para_out (2)
2. The parallelism determination method based on a fine-grained convolution calculation structure according to claim 1, characterized in that: in step (1), the calculation process of a matrix convolution is divided by rows into Para_seg smaller matrices, which are convolved in turn, where ROW_in, the number of rows of the input feature map segment to be stored after segmentation, is determined by formula (1); the function ξ equals 1 if and only if Para_seg equals 1 and equals 0 otherwise; ROW_out represents the number of rows of the output feature map segment obtained by convolving the ROW_in-row input feature map segment; and Stride represents the convolution stride:
ROW_in = SIZE_ker + Stride · (ROW_out − 1) − 2 · N_pad · ξ (1).
3. The parallelism determination method based on a fine-grained convolution calculation structure according to claim 2, characterized in that: in step (2), constraints 1 and 2 apply to the single-layer computing resource usage #DSP_i and the number of cycles #cycle_i required by a single-layer calculation; limited by the number of available resources, constraints 3 and 4 apply to #DSP_i and the single-layer storage resource usage #BRAM_i.
4. The parallelism determination method based on a fine-grained convolution calculation structure according to claim 3, characterized in that step (3) comprises the following sub-steps:
(3.1) Data preparation: specify #DSP_total and pre-allocate DSP resources according to the percentage of #OP_i in #OP_total, the number of DSP resources allocated to the i-th layer being #DSP_alloc,i; specify #BRAM_total; determine the theoretical minimum number of computation cycles #cycle_baseline from #OP_total and #DSP_total;
(3.2) In the i-th layer, traverse in sequence, with a step size of 1, all valid values of the tuple (Para_in, Para_out, ROW_out) to obtain the set S_i of single-layer accelerator time/resource overheads under the various parallelism combinations; according to constraint 1, the selection rule for the elements of S_i is based on the following basic assumption: the farther #cycle_i,j and #DSP_i,j deviate from #cycle_baseline and #DSP_alloc,i, the less likely the candidate is to be the optimal parallelism solution; α is the computation-cycle floating factor and β is the DSP-allocation floating factor; every element A_i,j of S_i satisfies the constraints that (#cycle_i,j / #cycle_baseline) falls within the interval [1−α, 1+α] and (#DSP_i,j / #DSP_alloc,i) falls within the interval [1−β, 1+β]; here #cycle_i,j denotes the number of computation cycles corresponding to element j of S_i, and #DSP_i,j denotes the number of DSP resources occupied by element j of S_i;
(3.3) To obtain the accelerator time/resource overhead, first compute the Cartesian product S = S_1 × S_2 × … × S_5 of the sets S_i (i = 1–5); each element of S corresponds to one cross-layer combination scheme; traverse the set S, compute max{#cycle_i} (i = 1–5) for every element that satisfies the resource constraints, and sort the results in ascending order; the element corresponding to min{max{#cycle_i}} is the parallelism allocation scheme with which the accelerator obtains the best performance/resource utilization.
5. The parallelism determination method based on a fine-grained convolution calculation structure according to claim 3, characterized in that step (3) comprises the following sub-steps:
(I) Compute the ratio γ_i of each layer's computation amount #OP_i to the total network computation amount #OP_total;
(II) Distribute the available on-chip DSPs to the layers in proportion to their computation amounts, allocating to each layer the DSP count #DSP_alloc,i ← γ_i · #DSP_total;
(III) Compute the theoretical minimum number of computation cycles #cycle_baseline from the total computation amount and the total computing resources;
(IV) For layer i, traverse the feasible values of Para_in, Para_out and ROW_out, i.e. the Cartesian product of their three domains, to generate the fully combined parallelism parameter configuration set S_i^0, and compute the corresponding #cycle_i, #BRAM_i and #DSP_i;
(V) Screen out the data set S_i that satisfies the α, β constraints;
(VI) Across all convolutional layers, traverse all possible combinations of the elements of the sets S_i, i.e. the Cartesian product S of the domains S_i (i = 1–5), and compute max{#cycle_i} (i = 1–5) for all elements that satisfy the resource constraints;
(VII) Arrange the values max{#cycle_i} (i = 1–5) in ascending order, select the parallelism element corresponding to min{max{#cycle_i}}, and output the parameter information of the optimal parallelism under the constraint conditions.
6. A parallelism determination system based on a fine-grained convolution calculation structure, characterized in that it comprises:
a construction module configured to construct a problem model: for the configuration parameters (Para_in, Para_out, Para_seg) that determine the optimal parallelism of the accelerator, an optimal-parallelism search algorithm is proposed with the following design goal: traverse, at the finest granularity, all feasible parallelism combination schemes within the value interval and screen out the parallelism configuration parameters with the highest computational resource utilization;
a constraint module configured to enumerate the algorithmic constraints:
Constraint 1. To ensure the rationality of resource allocation, the ratio of #DSP_i to the total number of available on-chip DSPs #DSP_total is close to the percentage of the convolutional layer's computation amount #OP_i in the total network computation amount #OP_total;
Constraint 2. The throughput of a fully pipelined accelerator is limited by the maximum #cycle_i; to increase throughput, minimize max{#cycle_i};
Constraint 3. Σ#DSP_i does not exceed the total number of available on-chip DSP resources #DSP_total;
Constraint 4. Σ#BRAM_i does not exceed the total number of available on-chip storage resources #BRAM_total;
a traversal module configured such that, in searching for the optimal parallelism scheme, Para_seg is uniquely determined by ROW_out; Para_in, Para_out and ROW_out take values in the ranges [1, N_in], [1, N_out] and [1, SIZE_out] respectively, and all three can be finely increased or decreased with a minimum step size of 1, where SIZE_out denotes the size of the output feature map; the number of DSP resources required by the i-th convolutional layer is given by formula (2), the required number of cycles by formula (3), and the required number of BRAM storage resources by formula (4), where SIZE_ker denotes the convolution kernel size, #cycle_cntl denotes the number of cycles required by the control logic, N_pad denotes the zero-padding size, and C_BRAM denotes the data storage capacity of one BRAM:
#DSP_i = ROW_out · SIZE_ker · Para_in · Para_out (2)
7. The parallelism determination system based on a fine-grained convolution calculation structure according to claim 6, characterized in that: in the construction module, the calculation process of a matrix convolution is divided by rows into Para_seg smaller matrices, which are convolved in turn, where ROW_in, the number of rows of the input feature map segment to be stored after segmentation, is determined by formula (1); the function ξ equals 1 if and only if Para_seg equals 1 and equals 0 otherwise; ROW_out represents the number of rows of the output feature map segment obtained by convolving the ROW_in-row input feature map segment; and Stride represents the convolution stride:
ROW_in = SIZE_ker + Stride · (ROW_out − 1) − 2 · N_pad · ξ (1).
8. The parallelism determination system based on a fine-grained convolution calculation structure according to claim 7, characterized in that: in the constraint module, constraints 1 and 2 apply to the single-layer computing resource usage #DSP_i and the number of cycles #cycle_i required by a single-layer calculation; limited by the number of available resources, constraints 3 and 4 apply to #DSP_i and the single-layer storage resource usage #BRAM_i.
9. The parallelism determination system based on a fine-grained convolution calculation structure according to claim 8, characterized in that the traversal module executes the following sub-steps:
(3.1) Data preparation: specify #DSP_total and pre-allocate DSP resources according to the percentage of #OP_i in #OP_total, the number of DSP resources allocated to the i-th layer being #DSP_alloc,i; specify #BRAM_total; determine the theoretical minimum number of computation cycles #cycle_baseline from #OP_total and #DSP_total;
(3.2) In the i-th layer, traverse in sequence, with a step size of 1, all valid values of the tuple (Para_in, Para_out, ROW_out) to obtain the set S_i of single-layer accelerator time/resource overheads under the various parallelism combinations; according to constraint 1, the selection rule for the elements of S_i is based on the following basic assumption: the farther #cycle_i,j and #DSP_i,j deviate from #cycle_baseline and #DSP_alloc,i, the less likely the candidate is to be the optimal parallelism solution; α is the computation-cycle floating factor and β is the DSP-allocation floating factor; every element A_i,j of S_i satisfies the constraints that (#cycle_i,j / #cycle_baseline) falls within the interval [1−α, 1+α] and (#DSP_i,j / #DSP_alloc,i) falls within the interval [1−β, 1+β]; here #cycle_i,j denotes the number of computation cycles corresponding to element j of S_i, and #DSP_i,j denotes the number of DSP resources occupied by element j of S_i;
(3.3) To obtain the accelerator time/resource overhead, first compute the Cartesian product S = S_1 × S_2 × … × S_5 of the sets S_i (i = 1–5); each element of S corresponds to one cross-layer combination scheme; traverse the set S, compute max{#cycle_i} (i = 1–5) for every element that satisfies the resource constraints, and sort the results in ascending order; the element corresponding to min{max{#cycle_i}} is the parallelism allocation scheme with which the accelerator obtains the best performance/resource utilization.
10. The parallelism determination system based on a fine-grained convolution calculation structure according to claim 8, characterized in that the traversal module executes the following sub-steps:
(I) Compute the ratio γ_i of each layer's computation amount #OP_i to the total network computation amount #OP_total;
(II) Distribute the available on-chip DSPs to the layers in proportion to their computation amounts, allocating to each layer the DSP count #DSP_alloc,i ← γ_i · #DSP_total;
(III) Compute the theoretical minimum number of computation cycles #cycle_baseline from the total computation amount and the total computing resources;
(IV) For layer i, traverse the feasible values of Para_in, Para_out and ROW_out, i.e. the Cartesian product of their three domains, to generate the fully combined parallelism parameter configuration set S_i^0, and compute the corresponding #cycle_i, #BRAM_i and #DSP_i;
(V) Screen out the data set S_i that satisfies the α, β constraints;
(VI) Across all convolutional layers, traverse all possible combinations of the elements of the sets S_i, i.e. the Cartesian product S of the domains S_i (i = 1–5), and compute max{#cycle_i} (i = 1–5) for all elements that satisfy the resource constraints;
(VII) Arrange the values max{#cycle_i} (i = 1–5) in ascending order, select the parallelism element corresponding to min{max{#cycle_i}}, and output the parameter information of the optimal parallelism under the constraint conditions.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110888610.XA CN113592088B (en) | 2021-07-30 | 2021-07-30 | Parallelism determination method and system based on fine-granularity convolution computing structure |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113592088A true CN113592088A (en) | 2021-11-02 |
CN113592088B CN113592088B (en) | 2024-05-28 |
Family
ID=78254703
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110888610.XA Active CN113592088B (en) | 2021-07-30 | 2021-07-30 | Parallelism determination method and system based on fine-granularity convolution computing structure |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113592088B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107392308A (en) * | 2017-06-20 | 2017-11-24 | 中国科学院计算技术研究所 | A kind of convolutional neural networks accelerated method and system based on programming device |
CN107993186A (en) * | 2017-12-14 | 2018-05-04 | 中国人民解放军国防科技大学 | 3D CNN acceleration method and system based on Winograd algorithm |
CN108280514A (en) * | 2018-01-05 | 2018-07-13 | 中国科学技术大学 | Sparse neural network acceleration system based on FPGA and design method |
GB201913353D0 (en) * | 2019-09-16 | 2019-10-30 | Samsung Electronics Co Ltd | Method for designing accelerator hardware |
CN110516801A (en) * | 2019-08-05 | 2019-11-29 | 西安交通大学 | A kind of dynamic reconfigurable convolutional neural networks accelerator architecture of high-throughput |
CN112001492A (en) * | 2020-08-07 | 2020-11-27 | 中山大学 | Mixed flow type acceleration framework and acceleration method for binary weight Densenet model |
CN112116084A (en) * | 2020-09-15 | 2020-12-22 | 中国科学技术大学 | Convolution neural network hardware accelerator capable of solidifying full network layer on reconfigurable platform |
- 2021-07-30: CN application CN202110888610.XA filed, granted as patent CN113592088B (active)
Non-Patent Citations (5)
Title |
---|
X. QU等: "Cheetah: An Accurate Assessment Mechanism and a High-Throughput Acceleration Architecture Oriented Toward Resource Efficiency", 《IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS》, vol. 40, no. 5, pages 878 - 891, XP011850205, DOI: 10.1109/TCAD.2020.3011650 * |
X. ZHANG等: "DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs", 《IEEE/ACM INTERNATIONAL CONFERENCE ON COMPUTER-AIDED DESIGN (ICCAD)》, pages 1 - 8 * |
SUN MING et al.: "Hardware Accelerator Design Method for Convolutional Neural Networks", 《Computer Engineering and Applications》, vol. 57, no. 13, pages 77 - 84 *
GONG LEI: "Research on Heterogeneous Multi-core Acceleration Methods for Convolutional Neural Networks on Reconfigurable Platforms", 《China Doctoral Dissertations Full-text Database: Information Science and Technology》, no. 8, pages 1 - 119 *
QU XINYUAN et al.: "A Parallelism Optimization Search Algorithm Based on a 3D Transformable CNN Acceleration Structure", 《Journal of Electronics & Information Technology》, vol. 44, no. 4, pages 1503 - 1512 *
Also Published As
Publication number | Publication date |
---|---|
CN113592088B (en) | 2024-05-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhang et al. | DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs | |
EP3757901A1 (en) | Schedule-aware tensor distribution module | |
CN106919769B (en) | Hierarchical FPGA (field programmable Gate array) layout and wiring method based on multi-level method and empowerment hypergraph | |
EP4036810A1 (en) | Neural network processing method and apparatus, computer device and storage medium | |
Jiao et al. | 7.2 A 12nm programmable convolution-efficient neural-processing-unit chip achieving 825TOPS | |
WO2020119318A1 (en) | Self-adaptive selection and design method for convolutional-layer hardware accelerator | |
Chen et al. | Zara: A novel zero-free dataflow accelerator for generative adversarial networks in 3d reram | |
CN109472361B (en) | Neural network optimization method | |
US11886979B1 (en) | Shifting input values within input buffer of neural network inference circuit | |
CN110516316B (en) | GPU acceleration method for solving Euler equation by interrupted Galerkin method | |
CN115755954B (en) | Routing inspection path planning method, system, computer equipment and storage medium | |
US11645512B2 (en) | Memory layouts and conversion to improve neural network inference performance | |
US20210191733A1 (en) | Flexible accelerator for sparse tensors (fast) in machine learning | |
CN112183015B (en) | Chip layout planning method for deep neural network | |
CN115860081A (en) | Core particle algorithm scheduling method and system, electronic equipment and storage medium | |
CN106484532B (en) | GPGPU parallel calculating method towards SPH fluid simulation | |
WO2021244045A1 (en) | Neural network data processing method and apparatus | |
CN113052306B (en) | Online learning chip based on heap width learning model | |
CN112183001B (en) | Hypergraph-based multistage clustering method for integrated circuits | |
CN113592088A (en) | Parallelism determination method and system based on fine-grained convolution calculation structure | |
CN116245150A (en) | Neural network reconfigurable configuration mapping method for FPGA (field programmable Gate array) resources | |
Guan et al. | Crane: Mitigating accelerator under-utilization caused by sparsity irregularities in cnns | |
CN113986816B (en) | Reconfigurable computing chip | |
WO2022266888A1 (en) | Congestion prediction model training method, image processing method and apparatus | |
CN110415162B (en) | Adaptive graph partitioning method facing heterogeneous fusion processor in big data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||