CN113592088A - Parallelism determination method and system based on fine-grained convolution calculation structure

Publication number: CN113592088A (application CN202110888610.XA; granted as CN113592088B)
Inventors: 屈心媛, 黄志洪, 蔡刚
Applicant/assignee: Ehiway Microelectronic Science And Technology Suzhou Co ltd
Legal status: Active (granted)

Classifications

    • G06N 3/082 — Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N 3/045 — Combinations of networks


Abstract

A parallelism determination method and system based on a fine-grained convolution computation structure, which can markedly improve the utilization of computing resources, ultimately achieve an outstanding CNN acceleration effect, and guarantee that the parallelism configuration scheme with the best computing-resource utilization under the given resource limits can be found. The parallelism determination method based on the fine-grained convolution computation structure comprises the following steps: (1) constructing a problem model; (2) enumerating the algorithm constraints; (3) fine-grained traversal solution.

Description

Parallelism determination method and system based on fine-grained convolution calculation structure
Technical Field
The invention relates to the technical field of FPGA hardware acceleration design of a convolutional neural network, in particular to a parallelism determination method based on a fine-grained convolutional computing structure and a parallelism determination system based on the fine-grained convolutional computing structure.
Background
The Convolutional Neural Network (CNN) is one of the representative algorithms of deep learning. Owing to its excellent performance in the field of artificial intelligence, CNN receives wide attention and is applied in high-tech areas such as image classification, speech recognition, face recognition, autonomous driving, and medical imaging.
The Field Programmable Gate Array (FPGA) is a chip with excellent programming flexibility and a high performance-to-power ratio. At present, many CNN forward-inference accelerators that pursue low development cost, short development cycles and low application power consumption adopt FPGA-based acceleration schemes.
Because CNN is a computation-intensive structure, an accelerator needs to fully exploit the computing power of the FPGA chip; the core topic of accelerator design is therefore how to use the on-chip computing resources efficiently. Many classical FPGA CNN accelerators focus on optimizing the convolution operation structure. For example, one publication proposes a fine-grained convolution computation structure with flexibility in both temporal and spatial granularity, allowing limited computational resources, with hardware support, to deliver greater effective computing power.
The design of an FPGA CNN accelerator is a systems project: besides hardware support, it needs a matching parallelism determination algorithm to fully exert the FPGA's chip computing power. Otherwise, without a suitable parallelism configuration, the deployment of the convolution computation structure can hardly reach the ideal condition; resources are distributed unevenly and on-chip computing power is wasted. A parallelism determination system that can be matched with the fine-grained convolution computation structure therefore has broad application prospects.
Among methods related to FPGA CNN accelerator design, descriptions of parallelism determination fall into two categories.
The first category lacks a systematic description of a parallelism determination method. Some literature does not touch on the determination process at all and only gives the final accelerator's parallelism configuration parameters; other work merely lists the constraints limiting the parallelism parameter values, but gives no concrete description of how the parallelism is determined under those constraints. The choice of parallelism parameters therefore depends heavily on the designer's engineering experience, an optimal choice cannot be guaranteed, and the opaque selection process offers no reference value to other accelerator designers.
The second category gives a systematic parallelism determination method, but the adjustment space of the parallelism is limited by the flexibility of the convolution computation structure. For example, accelerator layers may have only two adjustable dimensions, the input parallelism (Para_in) and the output parallelism (Para_out), with the values of Para_in and Para_out strictly limited to integer powers of 2, which makes the granularity of parallelism changes too coarse for small-amplitude adjustment. Such a coarse-grained parallelism determination algorithm is unfriendly, mainly in the following two respects:
1. As the size of the original input image grows with the application scenario, the feature maps of every CNN layer grow too; adjusting the computing-resource allocation only through Para_in and Para_out makes the single-image computation time granularity and the resource granularity too large, reducing the utilization efficiency of on-chip computing resources.
2. Parallelism parameters limited to integer powers of 2 do not suit all CNN networks. For example, the numbers of input/output feature maps N_in / N_out of many convolution layers in AlexNet (e.g., 3, 96, 384, 192) are not integer powers of 2, and the mismatch between feature-map counts and parallelism reduces the efficiency of on-chip computing-resource utilization.
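The channel-count mismatch described above can be made concrete with a small illustrative calculation (not from the patent; the `utilization` helper and the numbers are assumptions for illustration only):

```python
import math

def utilization(n_maps: int, para: int) -> float:
    """Fraction of parallel lanes doing useful work, assuming the
    n_maps channels are processed in ceil(n_maps / para) passes."""
    passes = math.ceil(n_maps / para)
    return n_maps / (passes * para)

# Power-of-two parallelism wastes lanes on AlexNet's 3-channel first layer:
assert utilization(3, 4) == 0.75   # 1 pass of 4 lanes, only 3 useful
# A fine-grained choice matches exactly:
assert utilization(3, 3) == 1.0
# A deeper layer (e.g. 96 maps) also mismatches a power-of-two lane count:
assert utilization(96, 64) == 0.75  # 2 passes of 64 lanes, 96 useful
```

This is the inefficiency the fine-grained parameter space is meant to remove.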
Based on the analysis, in order to solve the problem of computing resource waste, a parallelism determining system with higher exploration dimension, flexible parameter change and optimal search result is urgently needed, and an accelerator designer can be helped to conveniently and efficiently obtain a parallelism configuration scheme matched with a fine-grained convolution structure.
Disclosure of Invention
To overcome the shortcomings of the prior art, the technical problem solved by the invention is to provide a parallelism determination method based on a fine-grained convolution computation structure that markedly improves the utilization of computing resources, ultimately achieves an outstanding CNN acceleration effect, and guarantees that the parallelism configuration scheme with the best computing-resource utilization under the given resource limits can be found.
The technical scheme of the invention is as follows: the parallelism determination method based on the fine-grained convolution calculation structure comprises the following steps:
(1) Constructing a problem model: to determine the accelerator's optimal parallelism configuration parameters (Para_in, Para_out, Para_seg), an optimal-parallelism search algorithm is proposed. Its design target is: traverse, at the finest granularity, all feasible parallelism combination schemes within the value intervals, and screen out the parallelism configuration parameters with the highest computing-resource utilization, where Para_in is the input parallelism, Para_out is the output parallelism, and Para_seg is the segmentation parallelism;
(2) enumerating the algorithmic constraints:
Constraint 1. To ensure the rationality of resource allocation, the ratio of the single-layer resource usage #DSP_i to the total number of available DSPs on chip, #DSP_total, should be close to the percentage that the convolution layer's computation amount #OP_i accounts for of the total network computation amount #OP_total;
Constraint 2. The throughput of a fully pipelined accelerator is limited by the largest single-layer cycle count #cycle_i; to increase throughput, minimize max{#cycle_i};
Constraint 3. Σ#DSP_i must not exceed the total number of available DSP resources on chip, #DSP_total;
Constraint 4. Σ#BRAM_i must not exceed the total number of available storage resources on chip, #BRAM_total, where #BRAM_i is the single-layer storage resource usage;
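The resource constraints and the throughput objective can be sketched as small helpers; all names, signatures and numbers below are illustrative assumptions, since the patent states the constraints only in prose:

```python
def feasible(dsp_per_layer, bram_per_layer, dsp_total, bram_total):
    """Constraints 3 and 4: summed per-layer resource usage must fit on chip."""
    return (sum(dsp_per_layer) <= dsp_total
            and sum(bram_per_layer) <= bram_total)

def pipeline_cycles(cycles_per_layer):
    """Constraint 2: a fully pipelined accelerator is paced by its slowest layer."""
    return max(cycles_per_layer)

dsp = [120, 300, 260]    # made-up per-layer #DSP_i
bram = [40, 90, 70]      # made-up per-layer #BRAM_i
assert feasible(dsp, bram, dsp_total=900, bram_total=280)
assert pipeline_cycles([5000, 8000, 6500]) == 8000
```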
(3) Fine-grained traversal solution:
In the process of traversing and searching for the optimal parallelism scheme, Para_seg is uniquely determined by ROW_out, where ROW_out denotes the number of rows of the output feature map segment obtained by convolving an input feature map segment of ROW_in rows. Para_in, Para_out and ROW_out take values in the ranges [1, N_in], [1, N_out] and [1, SIZE_out] respectively, and all three can be increased or decreased finely with a minimum step size of 1, where N_in is the number of input feature maps of the convolution layer, N_out is the number of output feature maps, and SIZE_out is the size of the output feature map. The number of DSP resources required by the i-th convolution layer is given by formula (2), the required cycle count by formula (3), and the required number of BRAM storage resources by formula (4), where SIZE_ker is the convolution kernel size, #cycle_cntl is the number of cycles required by the control logic, N_pad is the zero-padding size, and C_BRAM is the data storage capacity of one BRAM:
#DSP_i = ROW_out · SIZE_ker · Para_in · Para_out    (2)
[Formulas (3) (#cycle_i) and (4) (#BRAM_i) appear only as equation images in the source.]
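Formula (2) from the text, #DSP_i = ROW_out · SIZE_ker · Para_in · Para_out, can be written as a small helper. (Formulas (3) and (4) survive only as images in the source and are not reproduced; the example values below are illustrative, not from the patent.)

```python
def dsp_count(row_out: int, size_ker: int, para_in: int, para_out: int) -> int:
    """DSP resources required by one convolution layer, per formula (2)."""
    return row_out * size_ker * para_in * para_out

# Illustrative values: 3x3 kernel, 2 output rows per segment,
# Para_in = 3, Para_out = 4.
assert dsp_count(row_out=2, size_ker=3, para_in=3, para_out=4) == 72
```

Because each factor can step by 1, the achievable #DSP_i values form a much denser set than with power-of-two-only parallelism.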
Compared with traditional parallelism determination methods, the invention offers a higher exploration dimension, fine-grained parameter adjustment, and a better search result on the FPGA chip. At the same time, it can be applied to accelerator designs on different FPGA platforms and for different CNN network structures, and has good universality; with its help, the computing power of the FPGA chip can be fully developed and a good acceleration effect achieved.
There is also provided a parallelism determination system based on a fine-grained convolution computation structure, comprising:
a building module configured to construct a problem model: to determine the accelerator's optimal parallelism configuration parameters (Para_in, Para_out, Para_seg), an optimal-parallelism search algorithm is proposed, whose design target is: traverse, at the finest granularity, all feasible parallelism combination schemes within the value intervals, and screen out the parallelism configuration parameters with the highest computing-resource utilization;
a constraint module configured to enumerate the algorithm constraints:
Constraint 1. To ensure the rationality of resource allocation, the ratio of #DSP_i to the total number of available DSPs on chip, #DSP_total, should be close to the percentage that the convolution layer's computation amount #OP_i accounts for of the total network computation amount #OP_total;
Constraint 2. The throughput of a fully pipelined accelerator is limited by the largest single-layer cycle count #cycle_i; to increase throughput, minimize max{#cycle_i};
Constraint 3. Σ#DSP_i must not exceed the total number of available DSP resources on chip, #DSP_total;
Constraint 4. Σ#BRAM_i must not exceed the total number of available storage resources on chip, #BRAM_total;
a traversal module configured such that, in the process of traversing and searching for the optimal parallelism scheme, Para_seg is uniquely determined by ROW_out; Para_in, Para_out and ROW_out take values in the ranges [1, N_in], [1, N_out] and [1, SIZE_out] respectively, and all three can be increased or decreased finely with a minimum step size of 1, where SIZE_out is the size of the output feature map. The number of DSP resources required by the i-th convolution layer is given by formula (2), the required cycle count by formula (3), and the required number of BRAM storage resources by formula (4), where SIZE_ker is the convolution kernel size, #cycle_cntl is the number of cycles required by the control logic, N_pad is the zero-padding size, and C_BRAM is the data storage capacity of one BRAM:
#DSP_i = ROW_out · SIZE_ker · Para_in · Para_out    (2)
[Formulas (3) and (4) appear only as equation images in the source.]
Drawings
Fig. 1 shows a flow chart of a parallelism determination method based on a fine-grained convolution computation structure according to the invention.
Detailed Description
As shown in fig. 1, the parallelism determination method based on the fine-grained convolution calculation structure includes the following steps:
(1) Constructing a problem model: to determine the accelerator's optimal parallelism configuration parameters (Para_in, Para_out, Para_seg), an optimal-parallelism search algorithm is proposed. Its design target is: traverse, at the finest granularity, all feasible parallelism combination schemes within the value intervals, and screen out the parallelism configuration parameters with the highest computing-resource utilization, where Para_in is the input parallelism, Para_out is the output parallelism, and Para_seg is the segmentation parallelism;
(2) Enumerating the algorithm constraints:
Constraint 1. To ensure the rationality of resource allocation, the ratio of the single-layer resource usage #DSP_i to the total number of available DSPs on chip, #DSP_total, should be close to the percentage that the convolution layer's computation amount #OP_i accounts for of the total network computation amount #OP_total;
Constraint 2. The throughput of a fully pipelined accelerator is limited by the largest single-layer cycle count #cycle_i; to increase throughput, minimize max{#cycle_i};
Constraint 3. Σ#DSP_i must not exceed the total number of available DSP resources on chip, #DSP_total;
Constraint 4. Σ#BRAM_i must not exceed the total number of available storage resources on chip, #BRAM_total, where #BRAM_i is the single-layer storage resource usage;
(3) Fine-grained traversal solution:
In the process of traversing and searching for the optimal parallelism scheme, Para_seg is uniquely determined by ROW_out, where ROW_out denotes the number of rows of the output feature map segment obtained by convolving an input feature map segment of ROW_in rows. Para_in, Para_out and ROW_out take values in the ranges [1, N_in], [1, N_out] and [1, SIZE_out] respectively, and all three can be increased or decreased finely with a minimum step size of 1, where N_in is the number of input feature maps of the convolution layer, N_out is the number of output feature maps, and SIZE_out is the size of the output feature map. The number of DSP resources required by the i-th convolution layer is given by formula (2), the required cycle count by formula (3), and the required number of BRAM storage resources by formula (4), where SIZE_ker is the convolution kernel size, #cycle_cntl is the number of cycles required by the control logic, N_pad is the zero-padding size, and C_BRAM is the data storage capacity of one BRAM:
#DSP_i = ROW_out · SIZE_ker · Para_in · Para_out    (2)
[Formulas (3) (#cycle_i) and (4) (#BRAM_i) appear only as equation images in the source.]
Compared with traditional parallelism determination methods, the invention offers a higher exploration dimension, fine-grained parameter adjustment, and a better search result on the FPGA chip. At the same time, it can be applied to accelerator designs on different FPGA platforms and for different CNN network structures, and has good universality; with its help, the computing power of the FPGA chip can be fully developed and a good acceleration effect achieved.
Preferably, in step (1), the computation process of the matrix convolution is divided by rows into Para_seg smaller matrices, which are convolved in turn, where Para_seg is given by an equation that appears only as an image in the source.
ROW_in denotes the number of rows of the input feature map segment to be stored after segmentation and is determined by formula (1); the function ξ equals 1 if and only if Para_seg equals 1, otherwise 0; ROW_out denotes the number of rows of the output feature map segment obtained by convolving an input feature map segment of ROW_in rows; and Stride denotes the convolution stride:
ROW_in = SIZE_ker + Stride · (ROW_out − 1) − 2 · N_pad · ξ    (1).
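Formula (1) can be sketched directly; the helper name and the example values are illustrative assumptions:

```python
def row_in(size_ker: int, stride: int, row_out: int, n_pad: int,
           para_seg_is_one: bool) -> int:
    """Formula (1): rows of the input feature map segment that must be stored.
    xi = 1 iff Para_seg == 1 (the whole map is a single segment), else 0."""
    xi = 1 if para_seg_is_one else 0
    return size_ker + stride * (row_out - 1) - 2 * n_pad * xi

# 3x3 kernel, stride 1, padding 1: a segmented map (xi = 0) needs extra halo
# rows per segment, while the unsegmented case (xi = 1) offsets them against
# the zero padding.
assert row_in(3, 1, 4, 1, para_seg_is_one=False) == 6
assert row_in(3, 1, 4, 1, para_seg_is_one=True) == 4
```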
Para_seg is proposed to raise the computing-resource utilization in two respects.
1. Where an oversized feature map would make the granularity of the convolution computing resources too large, the convolution of the whole feature map can be completed with a smaller resource granularity by cutting it and processing the feature-map segments in turn.
2. The first-layer input feature map of most CNN networks is RGB three-channel, i.e., N_in equals 3. The small N_in compresses the value space of Para_in: in pursuit of higher resource utilization, Para_in can only be 1 or 3. This severely limits the resource planning space of the first layer. Introducing Para_seg is equivalent to extending the value range of Para_in from the integer domain (1 or 3) to the fractional domain (such as 1/2, 1/3, 3/5 ...), greatly expanding the resource planning space.
Preferably, in step (2), constraints 1 and 2 apply to the single-layer computing-resource usage #DSP_i and the single-layer cycle count #cycle_i; limited by the amount of available resources, constraints 3 and 4 apply to #DSP_i and the single-layer storage-resource usage #BRAM_i.
Preferably, the step (3) comprises the following substeps:
(3.1) Data preparation: specify #DSP_total and pre-allocate DSP resources according to the percentage of #OP_i in #OP_total; the number of DSP resources allocated to the i-th layer is #DSP_alloc_i; specify #BRAM_total; determine the theoretical minimum computation cycle count #cycle_baseline from #OP_total and #DSP_total;
(3.2) In the i-th layer, traverse in sequence, with step size 1, all valid values of the tuple (Para_in, Para_out, ROW_out), and compute the set S_i of single-layer accelerator time/resource overheads under the various parallelism combinations. Following constraint 1, the selection rule for elements of S_i rests on the basic assumption that the farther #cycle_{i,j} and #DSP_{i,j} deviate from #cycle_baseline and #DSP_alloc_i, the less likely the element is to belong to the optimal parallelism scheme. With α the computation-cycle floating factor and β the DSP-allocation floating factor, any element A_{i,j} of S_i must satisfy the constraints that (#cycle_{i,j} / #cycle_baseline) falls within the interval [1−α, 1+α] and (#DSP_{i,j} / #DSP_alloc_i) falls within the interval [1−β, 1+β], where #cycle_{i,j} denotes the computation cycle count of element j in S_i and #DSP_{i,j} the number of DSP resources it occupies;
(3.3) To obtain the accelerator time/resource overhead, first compute the Cartesian product S = S_1 × S_2 × … × S_5 of the sets S_i (i = 1–5); each element of S corresponds to one cross-layer combination scheme. Traverse the set S, compute max{#cycle_i} (i = 1–5) for every element that satisfies the resource constraints, and sort in ascending order; the element corresponding to min{max{#cycle_i}} is the parallelism allocation scheme with which the accelerator obtains the best performance/resource utilization.
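Substep (3.3) can be sketched as a min-max search over the cross-layer Cartesian product. This is a hedged illustration, not the patent's implementation: candidates are reduced to (cycles, dsp) pairs, only the DSP budget of constraint 3 is checked, and all numbers are made up; real candidates would come from the screening of substep (3.2).

```python
from itertools import product

def best_combination(per_layer_candidates, dsp_total):
    """Pick the cross-layer combo minimizing max per-layer cycles
    among combos that fit the DSP budget."""
    best, best_cost = None, None
    for combo in product(*per_layer_candidates):   # Cartesian product S
        if sum(d for _, d in combo) > dsp_total:   # constraint 3
            continue
        cost = max(c for c, _ in combo)            # pipeline bottleneck
        if best_cost is None or cost < best_cost:  # min over max{#cycle_i}
            best, best_cost = combo, cost
    return best, best_cost

layers = [
    [(1000, 200), (600, 320)],   # layer 1 candidates (cycles, dsp)
    [(900, 250), (700, 400)],    # layer 2 candidates
]
combo, cost = best_combination(layers, dsp_total=700)
assert cost == 900 and combo == ((600, 320), (900, 250))
```

Note how the fastest per-layer choices, (600, 320) and (700, 400), are rejected together because their summed DSP usage exceeds the budget; the search trades one layer's speed for chip-wide feasibility.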
Preferably, the step (3) comprises the following substeps:
(I) Compute the ratio γ_i of each layer's computation amount #OP_i to the total network computation amount #OP_total;
(II) Distribute the DSPs available on chip to each layer in proportion to the computation amounts; the number of DSPs allocated to each layer is #DSP_alloc_i ← γ_i · #DSP_total;
(III) Compute the theoretical minimum computation cycle count #cycle_baseline from the total computation amount and the total computing resources;
(IV) For layer i, traverse the feasible values of Para_in, Para_out and ROW_out, i.e. the Cartesian product of the three domains, generating the fully combined parallelism parameter configuration set S0_i, and compute the corresponding #cycle_i, #BRAM_i and #DSP_i;
(V) Screen out the data set S_i satisfying the α, β constraints;
(VI) Across all convolution layers, traverse all possible combinations of the elements of S_i, i.e. the Cartesian product S of the domains S_i (i = 1–5), and compute max{#cycle_i} for all elements satisfying the resource constraints;
(VII) Sort max{#cycle_i} (i = 1–5) in ascending order, select the parallelism element corresponding to min{max{#cycle_i}}, and output the parameter information of the optimal parallelism under the constraints.
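Substeps (I)-(III) can be sketched as a proportional pre-allocation; this is an assumption-laden illustration: the patent does not state the rounding rule (floored integer division is assumed here) nor the operations-per-DSP-per-cycle factor (taken as 1), and the operation counts are made up.

```python
def preallocate(ops_per_layer, dsp_total):
    """Steps (I)-(III): gamma_i ratios, proportional DSP pre-allocation
    (floored), and a baseline cycle count assuming 1 op per DSP per cycle."""
    total_ops = sum(ops_per_layer)
    gammas = [op / total_ops for op in ops_per_layer]                # step (I)
    alloc = [op * dsp_total // total_ops for op in ops_per_layer]    # step (II)
    cycle_baseline = total_ops / dsp_total                           # step (III)
    return gammas, alloc, cycle_baseline

ops = [200_000, 600_000, 200_000]   # made-up per-layer #OP_i
gammas, alloc, baseline = preallocate(ops, dsp_total=1000)
assert alloc == [200, 600, 200]
assert baseline == 1000.0
```

The α/β screening of step (V) then keeps only per-layer candidates whose #cycle_i and #DSP_i stay within the floating-factor bands around this baseline and allocation.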
It will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the methods of the above embodiments; the storage medium may be ROM/RAM, a magnetic disk, an optical disk, a memory card, or the like. Therefore, corresponding to the method of the present invention, the present invention also includes a parallelism determination system based on a fine-grained convolution computation structure, generally expressed as functional modules corresponding to the steps of the method. The system comprises:
a building module configured to construct a problem model: to determine the accelerator's optimal parallelism configuration parameters (Para_in, Para_out, Para_seg), an optimal-parallelism search algorithm is proposed, whose design target is: traverse, at the finest granularity, all feasible parallelism combination schemes within the value intervals, and screen out the parallelism configuration parameters with the highest computing-resource utilization;
a constraint module configured to enumerate the algorithm constraints:
Constraint 1. To ensure the rationality of resource allocation, the ratio of #DSP_i to the total number of available DSPs on chip, #DSP_total, should be close to the percentage that the convolution layer's computation amount #OP_i accounts for of the total network computation amount #OP_total;
Constraint 2. The throughput of a fully pipelined accelerator is limited by the largest single-layer cycle count #cycle_i; to increase throughput, minimize max{#cycle_i};
Constraint 3. Σ#DSP_i must not exceed the total number of available DSP resources on chip, #DSP_total;
Constraint 4. Σ#BRAM_i must not exceed the total number of available storage resources on chip, #BRAM_total;
a traversal module configured such that, in the process of traversing and searching for the optimal parallelism scheme, Para_seg is uniquely determined by ROW_out; Para_in, Para_out and ROW_out take values in the ranges [1, N_in], [1, N_out] and [1, SIZE_out] respectively, and all three can be increased or decreased finely with a minimum step size of 1, where SIZE_out is the size of the output feature map. The number of DSP resources required by the i-th convolution layer is given by formula (2), the required cycle count by formula (3), and the required number of BRAM storage resources by formula (4), where SIZE_ker is the convolution kernel size, #cycle_cntl is the number of cycles required by the control logic, N_pad is the zero-padding size, and C_BRAM is the data storage capacity of one BRAM:
#DSP_i = ROW_out · SIZE_ker · Para_in · Para_out    (2)
[Formulas (3) and (4) appear only as equation images in the source.]
Preferably, in the building module, the computation process of the matrix convolution is divided by rows into Para_seg smaller matrices, which are convolved in turn, where Para_seg is given by an equation that appears only as an image in the source.
ROW_in denotes the number of rows of the input feature map segment to be stored after segmentation and is determined by formula (1); the function ξ equals 1 if and only if Para_seg equals 1, otherwise 0; ROW_out denotes the number of rows of the output feature map segment obtained by convolving an input feature map segment of ROW_in rows; and Stride denotes the convolution stride:
ROW_in = SIZE_ker + Stride · (ROW_out − 1) − 2 · N_pad · ξ    (1).
Preferably, in the constraint module, constraints 1 and 2 apply to the single-layer computing-resource usage #DSP_i and the single-layer cycle count #cycle_i; limited by the amount of available resources, constraints 3 and 4 apply to #DSP_i and the single-layer storage-resource usage #BRAM_i.
Preferably, the traversal module performs the following sub-steps:
(3.1) Data preparation: specify #DSP_total and pre-allocate DSP resources according to the percentage of #OP_i in #OP_total; the number of DSP resources allocated to the i-th layer is #DSP_alloc_i; specify #BRAM_total; determine the theoretical minimum computation cycle count #cycle_baseline from #OP_total and #DSP_total;
(3.2) In the i-th layer, traverse in sequence, with step size 1, all valid values of the tuple (Para_in, Para_out, ROW_out), and compute the set S_i of single-layer accelerator time/resource overheads under the various parallelism combinations. Following constraint 1, the selection rule for elements of S_i rests on the basic assumption that the farther #cycle_{i,j} and #DSP_{i,j} deviate from #cycle_baseline and #DSP_alloc_i, the less likely the element is to belong to the optimal parallelism scheme. With α the computation-cycle floating factor and β the DSP-allocation floating factor, any element A_{i,j} of S_i must satisfy the constraints that (#cycle_{i,j} / #cycle_baseline) falls within the interval [1−α, 1+α] and (#DSP_{i,j} / #DSP_alloc_i) falls within the interval [1−β, 1+β], where #cycle_{i,j} denotes the computation cycle count of element j in S_i and #DSP_{i,j} the number of DSP resources it occupies;
(3.3) To obtain the accelerator time/resource overhead, first compute the Cartesian product S = S_1 × S_2 × … × S_5 of the sets S_i (i = 1–5); each element of S corresponds to one cross-layer combination scheme. Traverse the set S, compute max{#cycle_i} (i = 1–5) for every element that satisfies the resource constraints, and sort in ascending order; the element corresponding to min{max{#cycle_i}} is the parallelism allocation scheme with which the accelerator obtains the best performance/resource utilization.
Preferably, the traversal module performs the following sub-steps:
(I) Compute the ratio γ_i of each layer's computation amount #OP_i to the total network computation amount #OP_total;
(II) Distribute the DSPs available on chip to each layer in proportion to the computation amounts; the number of DSPs allocated to each layer is #DSP_alloc_i ← γ_i · #DSP_total;
(III) Compute the theoretical minimum computation cycle count #cycle_baseline from the total computation amount and the total computing resources;
(IV) For layer i, traverse the feasible values of Para_in, Para_out and ROW_out, i.e. the Cartesian product of the three domains, generating the fully combined parallelism parameter configuration set S0_i, and compute the corresponding #cycle_i, #BRAM_i and #DSP_i;
(V) Screen out the data set S_i satisfying the α, β constraints;
(VI) Across all convolution layers, traverse all possible combinations of the elements of S_i, i.e. the Cartesian product S of the domains S_i (i = 1–5), and compute max{#cycle_i} for all elements satisfying the resource constraints;
(VII) Sort max{#cycle_i} (i = 1–5) in ascending order, select the parallelism element corresponding to min{max{#cycle_i}}, and output the parameter information of the optimal parallelism under the constraints.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications, equivalent variations and modifications made to the above embodiment according to the technical spirit of the present invention still belong to the protection scope of the technical solution of the present invention.

Claims (10)

1. The parallelism determination method based on the fine-grained convolution calculation structure is characterized by comprising the following steps: which comprises the following steps:
(1) constructing a problem model: configuring parameters (Para) for determining optimal parallelism of an acceleratorin,Paraout,Paraseg) The optimal parallelism search algorithm is provided, and the design target is as follows: traversing all feasible parallelism combination schemes with the finest granularity in the value interval, and screening out the parallelism configuration parameters with the highest computational resource utilization rate, wherein ParainIs the input parallelism, ParaoutIs the degree of output parallelism, ParasegIs the segmentation parallelism;
(2) enumerating the algorithmic constraints:
Constraint 1: to ensure rational resource allocation, the ratio of the single-layer resource usage #DSP_i to the total number of available on-chip DSPs #DSP_total should be close to the percentage of the layer's computation amount #OP_i in the total network computation amount #OP_total;
Constraint 2: the throughput of a fully pipelined accelerator is limited by the maximum single-layer cycle count #cycle_i; to increase throughput, max{#cycle_i} is minimized;
Constraint 3: Σ#DSP_i must not exceed the total number of available on-chip DSP resources #DSP_total;
Constraint 4: Σ#BRAM_i must not exceed the total number of available on-chip storage resources #BRAM_total, where #BRAM_i is the single-layer storage resource usage;
(3) fine-grained traversal solution
In the process of traversing and searching for the optimal parallelism scheme, Para_seg is uniquely determined by ROW_out, where ROW_out denotes the number of rows of the output feature map segment obtained after convolving a ROW_in-row input feature map segment. Para_in, Para_out and ROW_out take values in the ranges [1, N_in], [1, N_out] and [1, SIZE_out] respectively, and each can be finely increased or decreased with a minimum step size of 1, where N_in is the number of input feature maps of the convolutional layer, N_out is the number of output feature maps of the convolutional layer, and SIZE_out denotes the size of the output feature map. The number of DSP resources required by the i-th convolutional layer is given by formula (2), the required cycle count by formula (3), and the required number of BRAM storage resources by formula (4), where SIZE_ker denotes the convolution kernel size, #cycle_cntl the number of cycles required by the control logic, N_pad the zero-padding size, and C_BRAM the data storage capacity of one BRAM:
#DSP_i = ROW_out · SIZE_ker · Para_in · Para_out (2)
[formula (3): the expression for #cycle_i, rendered only as an image in the original source]
[formula (4): the expression for #BRAM_i, rendered only as an image in the original source]
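Formula (2) translates directly into code; a minimal Python sketch (function and parameter names are illustrative, not from the source):

```python
def dsp_count(row_out, size_ker, para_in, para_out):
    # formula (2): #DSP_i = ROW_out * SIZE_ker * Para_in * Para_out
    # one multiplier per kernel row element, per input/output channel pair
    return row_out * size_ker * para_in * para_out
```

For example, ROW_out = 6 output rows with a 3-wide kernel, 4 input channels and 8 output channels in parallel would need 6 · 3 · 4 · 8 = 576 DSPs.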
2. The parallelism determination method based on the fine-grained convolution calculation structure according to claim 1, characterized in that: in step (1), the calculation process of one matrix convolution is divided by rows into Para_seg convolutions of smaller matrices performed in turn, where Para_seg is equal to
[the Para_seg expression, rendered only as an image in the original source]
where ROW_in, the number of rows of the input feature map segment to be stored after segmentation, is determined by formula (1); the function ξ equals 1 if and only if Para_seg equals 1 and equals 0 otherwise; ROW_out denotes the number of rows of the output feature map segment obtained after convolving the ROW_in-row input feature map segment; and Stride denotes the convolution stride:
ROW_in = SIZE_ker + Stride · (ROW_out − 1) − 2 · N_pad · ξ (1).
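Formula (1) can likewise be checked in code. A small sketch assuming integer inputs; note that with ξ = 1 (no segmentation) it reproduces the usual relation between padded input and output heights:

```python
def row_in(size_ker, stride, row_out, n_pad, para_seg):
    # xi = 1 iff Para_seg == 1 (unsegmented case), else 0 -- per the claim
    xi = 1 if para_seg == 1 else 0
    # formula (1): ROW_in = SIZE_ker + Stride*(ROW_out - 1) - 2*N_pad*xi
    return size_ker + stride * (row_out - 1) - 2 * n_pad * xi
```

For a 3×3 kernel, stride 1, one row of zero padding and ROW_out = 224, this gives ROW_in = 224, matching the unsegmented "same"-padding case; with Para_seg = 4 the padding term drops out because interior segments carry no zero rows.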
3. The parallelism determination method based on the fine-grained convolution calculation structure according to claim 2, characterized in that: in step (2), constraints 1 and 2 concern the single-layer resource usage #DSP_i and the single-layer calculation cycle count #cycle_i; limited by the number of available resources, constraints 3 and 4 concern #DSP_i and the single-layer storage resource usage #BRAM_i.
4. The fine-grained convolution computation structure-based parallelism determination method according to claim 3, characterized in that: the step (3) comprises the following sub-steps:
(3.1) data preparation: specify #DSP_total and pre-allocate DSP resources according to the percentage of #OP_i in #OP_total, the number of DSP resources allocated to the i-th layer being #DSP_i^alloc; specify #BRAM_total; determine the theoretical minimum calculation cycle count #cycle_baseline from #OP_total and #DSP_total;
(3.2) in the i-th layer, traverse in sequence, with a step size of 1, all valid values of the tuple (Para_in, Para_out, ROW_out) to obtain the set S_i of single-layer accelerator time/resource overheads under the various parallelism combinations; according to constraint 1, the selection rule for elements of S_i rests on the following basic assumption: the farther #cycle_i,j and #DSP_i,j deviate from #cycle_baseline and #DSP_i^alloc, the less likely they are to form the optimal parallelism solution; α is the calculation-cycle floating factor and β is the DSP-allocation floating factor; every element A_i,j of S_i satisfies the constraints that #cycle_i,j / #cycle_baseline falls within the interval [1−α, 1+α] and #DSP_i,j / #DSP_i^alloc falls within the interval [1−β, 1+β], where #cycle_i,j denotes the calculation cycle count corresponding to element j of S_i and #DSP_i,j denotes the number of DSP resources occupied by element j of S_i;
(3.3) to obtain the accelerator time/resource overhead, first compute the Cartesian product S = S_1 × S_2 × … × S_5 of the sets S_i (i = 1~5), each element of which corresponds to one cross-layer combination scheme; traverse the set S, calculate max{#cycle_i} (i = 1~5) for all elements satisfying the resource constraints, and sort the results in ascending order; the element corresponding to min{max{#cycle_i}} is the parallelism allocation scheme the accelerator adopts to obtain the best performance/resource utilization.
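The α/β screening rule of step (3.2) reduces to a two-sided window test on the two ratios; a sketch with illustrative names:

```python
def passes_screening(cycle_ij, dsp_ij, cycle_base, dsp_alloc_i, alpha, beta):
    # keep element j of S_i iff both ratios fall inside the floating windows
    # [1 - alpha, 1 + alpha] for cycles and [1 - beta, 1 + beta] for DSPs
    return (1 - alpha <= cycle_ij / cycle_base <= 1 + alpha
            and 1 - beta <= dsp_ij / dsp_alloc_i <= 1 + beta)
```

With α = 0.2 and β = 0.1, a candidate at 110% of the cycle baseline and 95% of its DSP allocation passes, while one at 130% of the cycle baseline is discarded before the cross-layer product is formed.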
5. The fine-grained convolution computation structure-based parallelism determination method according to claim 3, characterized in that: the step (3) comprises the following sub-steps:
(I) calculate the ratio γ_i of each layer's computation amount #OP_i to the total network computation amount #OP_total;
(II) distribute the available on-chip DSPs to the layers in proportion to their computation amounts, the number of DSPs allocated to each layer being #DSP_i^alloc ← γ_i · #DSP_total;
(III) calculate the theoretical minimum calculation cycle count #cycle_baseline from the total computation amount and the total computing resources;
(IV) for layer i, traverse the feasible values of Para_in, Para_out and ROW_out, i.e. the Cartesian product of their three domains, to generate the fully combined parallelism parameter configuration set S_i^0, and calculate the corresponding #cycle_i, #BRAM_i and #DSP_i;
(V) screen out the subset S_i whose elements satisfy the α, β constraints;
(VI) across all convolutional layers, traverse all possible combinations of the elements of S_i, i.e. the Cartesian product S of the domains S_i (i = 1~5), and calculate max{#cycle_i} (i = 1~5) for every element that satisfies the resource constraints;
(VII) sort max{#cycle_i} (i = 1~5) in ascending order, select the parallelism element corresponding to min{max{#cycle_i}}, and output the parameter information of the optimal parallelism under the constraint conditions.
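Steps (I)–(III) amount to a proportional split of the DSP budget followed by a roofline-style cycle floor. A hedged sketch: the source does not state how many operations one DSP completes per cycle, so that factor is exposed as a parameter rather than assumed.

```python
def preallocate_dsp(layer_ops, dsp_total):
    # steps (I)-(II): gamma_i = #OP_i / #OP_total,
    # then #DSP_i^alloc = gamma_i * #DSP_total for each layer
    op_total = sum(layer_ops)
    return [ops / op_total * dsp_total for ops in layer_ops]

def cycle_baseline(op_total, dsp_total, ops_per_dsp_per_cycle=1):
    # step (III): theoretical floor = total work / peak work per cycle.
    # How many ops one DSP retires per cycle is an assumption here.
    return op_total / (dsp_total * ops_per_dsp_per_cycle)
```

For instance, two layers with 25% and 75% of the total operations split a 100-DSP budget 25/75, and 1000 total operations on 10 DSPs cannot finish in fewer than 100 cycles under the one-op-per-cycle assumption.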
6. A parallelism determination system based on a fine-grained convolution calculation structure, characterized in that it comprises:
a construction module configured to construct a problem model: to determine the optimal parallelism configuration parameters (Para_in, Para_out, Para_seg) of the accelerator, an optimal parallelism search algorithm is proposed, whose design target is to traverse, at the finest granularity, all feasible parallelism combination schemes within the value intervals and screen out the parallelism configuration parameters with the highest computational resource utilization;
a constraint module configured to enumerate the algorithmic constraints:
Constraint 1: to ensure rational resource allocation, the ratio of #DSP_i to the total number of available on-chip DSPs #DSP_total should be close to the percentage of the layer's computation amount #OP_i in the total network computation amount #OP_total;
Constraint 2: the throughput of a fully pipelined accelerator is limited by the maximum #cycle_i; to increase throughput, max{#cycle_i} is minimized;
Constraint 3: Σ#DSP_i must not exceed the total number of available on-chip DSP resources #DSP_total;
Constraint 4: Σ#BRAM_i must not exceed the total number of available on-chip storage resources #BRAM_total;
a traversal module configured such that, in traversing and searching for the optimal parallelism scheme, Para_seg is uniquely determined by ROW_out; Para_in, Para_out and ROW_out take values in the ranges [1, N_in], [1, N_out] and [1, SIZE_out] respectively, and each can be finely increased or decreased with a minimum step size of 1, where SIZE_out denotes the size of the output feature map; the number of DSP resources required by the i-th convolutional layer is given by formula (2), the required cycle count by formula (3), and the required number of BRAM storage resources by formula (4), where SIZE_ker denotes the convolution kernel size, #cycle_cntl the number of cycles required by the control logic, N_pad the zero-padding size, and C_BRAM the data storage capacity of one BRAM:
#DSP_i = ROW_out · SIZE_ker · Para_in · Para_out (2)
[formula (3): the expression for #cycle_i, rendered only as an image in the original source]
[formula (4): the expression for #BRAM_i, rendered only as an image in the original source]
7. The parallelism determination system based on the fine-grained convolution calculation structure according to claim 6, characterized in that: in the construction module, the calculation process of one matrix convolution is divided by rows into Para_seg convolutions of smaller matrices performed in turn, where Para_seg is equal to
[the Para_seg expression, rendered only as an image in the original source]
where ROW_in, the number of rows of the input feature map segment to be stored after segmentation, is determined by formula (1); the function ξ equals 1 if and only if Para_seg equals 1 and equals 0 otherwise; ROW_out denotes the number of rows of the output feature map segment obtained after convolving the ROW_in-row input feature map segment; and Stride denotes the convolution stride:
ROW_in = SIZE_ker + Stride · (ROW_out − 1) − 2 · N_pad · ξ (1).
8. The parallelism determination system based on the fine-grained convolution calculation structure according to claim 7, characterized in that: in the constraint module, constraints 1 and 2 concern the single-layer computing resource usage #DSP_i and the single-layer calculation cycle count #cycle_i; limited by the number of available resources, constraints 3 and 4 concern #DSP_i and the single-layer storage resource usage #BRAM_i.
9. The fine-grained convolution computation structure-based parallelism determination system of claim 8, characterized in that: the traversal module executes the following sub-steps:
(3.1) data preparation: specify #DSP_total and pre-allocate DSP resources according to the percentage of #OP_i in #OP_total, the number of DSP resources allocated to the i-th layer being #DSP_i^alloc; specify #BRAM_total; determine the theoretical minimum calculation cycle count #cycle_baseline from #OP_total and #DSP_total;
(3.2) in the i-th layer, traverse in sequence, with a step size of 1, all valid values of the tuple (Para_in, Para_out, ROW_out) to obtain the set S_i of single-layer accelerator time/resource overheads under the various parallelism combinations; according to constraint 1, the selection rule for elements of S_i rests on the following basic assumption: the farther #cycle_i,j and #DSP_i,j deviate from #cycle_baseline and #DSP_i^alloc, the less likely they are to form the optimal parallelism solution; α is the calculation-cycle floating factor and β is the DSP-allocation floating factor; every element A_i,j of S_i satisfies the constraints that #cycle_i,j / #cycle_baseline falls within the interval [1−α, 1+α] and #DSP_i,j / #DSP_i^alloc falls within the interval [1−β, 1+β], where #cycle_i,j denotes the calculation cycle count corresponding to element j of S_i and #DSP_i,j denotes the number of DSP resources occupied by element j of S_i;
(3.3) to obtain the accelerator time/resource overhead, first compute the Cartesian product S = S_1 × S_2 × … × S_5 of the sets S_i (i = 1~5), each element of which corresponds to one cross-layer combination scheme; traverse the set S, calculate max{#cycle_i} (i = 1~5) for all elements satisfying the resource constraints, and sort the results in ascending order; the element corresponding to min{max{#cycle_i}} is the parallelism allocation scheme the accelerator adopts to obtain the best performance/resource utilization.
10. The fine-grained convolution computation structure-based parallelism determination system of claim 8, characterized in that: the traversal module executes the following sub-steps:
(I) calculate the ratio γ_i of each layer's computation amount #OP_i to the total network computation amount #OP_total;
(II) distribute the available on-chip DSPs to the layers in proportion to their computation amounts, the number of DSPs allocated to each layer being #DSP_i^alloc ← γ_i · #DSP_total;
(III) calculate the theoretical minimum calculation cycle count #cycle_baseline from the total computation amount and the total computing resources;
(IV) for layer i, traverse the feasible values of Para_in, Para_out and ROW_out, i.e. the Cartesian product of their three domains, to generate the fully combined parallelism parameter configuration set S_i^0, and calculate the corresponding #cycle_i, #BRAM_i and #DSP_i;
(V) screen out the subset S_i whose elements satisfy the α, β constraints;
(VI) across all convolutional layers, traverse all possible combinations of the elements of S_i, i.e. the Cartesian product S of the domains S_i (i = 1~5), and calculate max{#cycle_i} (i = 1~5) for every element that satisfies the resource constraints;
(VII) sort max{#cycle_i} (i = 1~5) in ascending order, select the parallelism element corresponding to min{max{#cycle_i}}, and output the parameter information of the optimal parallelism under the constraint conditions.
CN202110888610.XA 2021-07-30 2021-07-30 Parallelism determination method and system based on fine-granularity convolution computing structure Active CN113592088B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110888610.XA CN113592088B (en) 2021-07-30 2021-07-30 Parallelism determination method and system based on fine-granularity convolution computing structure


Publications (2)

Publication Number Publication Date
CN113592088A true CN113592088A (en) 2021-11-02
CN113592088B CN113592088B (en) 2024-05-28

Family

ID=78254703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110888610.XA Active CN113592088B (en) 2021-07-30 2021-07-30 Parallelism determination method and system based on fine-granularity convolution computing structure

Country Status (1)

Country Link
CN (1) CN113592088B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107392308A (en) * 2017-06-20 2017-11-24 中国科学院计算技术研究所 A kind of convolutional neural networks accelerated method and system based on programming device
CN107993186A (en) * 2017-12-14 2018-05-04 中国人民解放军国防科技大学 3D CNN acceleration method and system based on Winograd algorithm
CN108280514A (en) * 2018-01-05 2018-07-13 中国科学技术大学 Sparse neural network acceleration system based on FPGA and design method
GB201913353D0 (en) * 2019-09-16 2019-10-30 Samsung Electronics Co Ltd Method for designing accelerator hardware
CN110516801A (en) * 2019-08-05 2019-11-29 西安交通大学 A kind of dynamic reconfigurable convolutional neural networks accelerator architecture of high-throughput
CN112001492A (en) * 2020-08-07 2020-11-27 中山大学 Mixed flow type acceleration framework and acceleration method for binary weight Densenet model
CN112116084A (en) * 2020-09-15 2020-12-22 中国科学技术大学 Convolution neural network hardware accelerator capable of solidifying full network layer on reconfigurable platform


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
X. Qu et al.: "Cheetah: An Accurate Assessment Mechanism and a High-Throughput Acceleration Architecture Oriented Toward Resource Efficiency", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 40, no. 5, pages 878-891, XP011850205, DOI: 10.1109/TCAD.2020.3011650 *
X. Zhang et al.: "DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs", IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 1-8 *
Sun Ming et al.: "Design method of hardware accelerators for convolutional neural networks", Computer Engineering and Applications, vol. 57, no. 13, pages 77-84 *
Gong Lei: "Research on heterogeneous multi-core acceleration methods for convolutional neural networks on reconfigurable platforms", China Doctoral Dissertations Full-text Database: Information Science and Technology, no. 8, pages 1-119 *
Qu Xinyuan et al.: "A parallelism optimization search algorithm based on a 3-D transformable CNN acceleration structure", Journal of Electronics & Information Technology, vol. 44, no. 4, pages 1503-1512 *

Also Published As

Publication number Publication date
CN113592088B (en) 2024-05-28

Similar Documents

Publication Publication Date Title
Zhang et al. DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs
EP3757901A1 (en) Schedule-aware tensor distribution module
CN106919769B (en) Hierarchical FPGA (field programmable Gate array) layout and wiring method based on multi-level method and empowerment hypergraph
EP4036810A1 (en) Neural network processing method and apparatus, computer device and storage medium
Jiao et al. 7.2 A 12nm programmable convolution-efficient neural-processing-unit chip achieving 825TOPS
WO2020119318A1 (en) Self-adaptive selection and design method for convolutional-layer hardware accelerator
Chen et al. Zara: A novel zero-free dataflow accelerator for generative adversarial networks in 3d reram
CN109472361B (en) Neural network optimization method
US11886979B1 (en) Shifting input values within input buffer of neural network inference circuit
CN110516316B (en) GPU acceleration method for solving Euler equation by interrupted Galerkin method
CN115755954B (en) Routing inspection path planning method, system, computer equipment and storage medium
US11645512B2 (en) Memory layouts and conversion to improve neural network inference performance
US20210191733A1 (en) Flexible accelerator for sparse tensors (fast) in machine learning
CN112183015B (en) Chip layout planning method for deep neural network
CN115860081A (en) Core particle algorithm scheduling method and system, electronic equipment and storage medium
CN106484532B (en) GPGPU parallel calculating method towards SPH fluid simulation
WO2021244045A1 (en) Neural network data processing method and apparatus
CN113052306B (en) Online learning chip based on heap width learning model
CN112183001B (en) Hypergraph-based multistage clustering method for integrated circuits
CN113592088A (en) Parallelism determination method and system based on fine-grained convolution calculation structure
CN116245150A (en) Neural network reconfigurable configuration mapping method for FPGA (field programmable Gate array) resources
Guan et al. Crane: Mitigating accelerator under-utilization caused by sparsity irregularities in cnns
CN113986816B (en) Reconfigurable computing chip
WO2022266888A1 (en) Congestion prediction model training method, image processing method and apparatus
CN110415162B (en) Adaptive graph partitioning method facing heterogeneous fusion processor in big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant