CN113592088A - Parallelism determination method and system based on fine-grained convolution calculation structure

Publication number: CN113592088A (application CN202110888610.XA; granted as CN113592088B)
Inventors: 屈心媛, 黄志洪, 蔡刚
Applicant/assignee: Ehiway Microelectronic Science And Technology Suzhou Co ltd
Legal status: Active (granted)

Classifications

    • G06N 3/082 — Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N 3/045 — Combinations of networks


Abstract

A parallelism determination method and system based on a fine-grained convolution computation structure, which can markedly improve the utilization of computing resources, ultimately achieve an outstanding CNN acceleration effect, and guarantee that the parallelism configuration scheme with the best computing-resource utilization under the given resource limits can be found. The parallelism determination method based on the fine-grained convolution computation structure comprises the following steps: (1) constructing a problem model; (2) enumerating the algorithm constraints; (3) fine-grained traversal solution.

Description

Parallelism determination method and system based on fine-grained convolution calculation structure
Technical Field
The invention relates to the technical field of FPGA hardware acceleration design of a convolutional neural network, in particular to a parallelism determination method based on a fine-grained convolutional computing structure and a parallelism determination system based on the fine-grained convolutional computing structure.
Background
The Convolutional Neural Network (CNN) is one of the representative algorithms of deep learning. Owing to its excellent performance in the field of artificial intelligence, CNN receives wide attention and is applied in high-tech areas such as image classification, speech recognition, face recognition, autonomous driving, and medical imaging.
The Field Programmable Gate Array (FPGA) is a chip with excellent programming flexibility and a high performance-to-power ratio. At present, many CNN forward-inference accelerators that pursue low development cost, short development cycles and low application power consumption adopt FPGA-based acceleration schemes.
Because CNN is a computation-intensive structure, an accelerator needs to fully exploit the computing power of the FPGA chip; the core topic of accelerator design is therefore how to use the on-chip computing resources efficiently. Many classical FPGA CNN accelerators focus on optimizing the convolution operation structure. For example, one publication proposes a fine-grained convolution computation structure with flexibility in both temporal and spatial granularity, allowing limited computational resources, with hardware support, to deliver greater effective computing power.
The design of an FPGA CNN accelerator is a systems project: besides hardware support, it needs a matching parallelism determination algorithm to fully exert the FPGA's chip computing power. Otherwise, without a suitable parallelism configuration, the deployment of the convolution computation structure can hardly reach the ideal condition; resources are distributed unevenly and on-chip computing power is wasted. A parallelism determination system that can be matched with the fine-grained convolution computation structure therefore has broad application prospects.
Among methods related to FPGA CNN accelerator design, descriptions of parallelism determination fall into two categories.
The first category lacks a systematic description of a parallelism determination method. Some literature does not touch on the determination process at all and only gives the final accelerator's parallelism configuration parameters; other work merely lists the constraints limiting the parallelism parameter values, but gives no concrete description of how the parallelism is determined under those constraints. The choice of parallelism parameters therefore depends heavily on the designer's engineering experience, an optimal choice cannot be guaranteed, and the opaque selection process offers no reference value to other accelerator designers.
The second category gives a systematic parallelism determination method, but the adjustment space of the parallelism is limited by the flexibility of the convolution computation structure. For example, accelerator layers may have only two adjustable dimensions, the input parallelism (Para_in) and the output parallelism (Para_out), with the values of Para_in and Para_out strictly limited to integer powers of 2, which makes the granularity of parallelism changes too coarse for small-amplitude adjustment. Such a coarse-grained parallelism determination algorithm is unfriendly, mainly in the following two respects:
1. As the size of the original input image grows with the application scenario, the feature maps of every CNN layer grow too; adjusting the computing-resource allocation only through Para_in and Para_out makes the single-image computation time granularity and the resource granularity too large, reducing the utilization efficiency of on-chip computing resources.
2. Parallelism parameters limited to integer powers of 2 do not suit all CNN networks. For example, the numbers of input/output feature maps N_in / N_out of many convolution layers in AlexNet (e.g., 3, 96, 384, 192) are not integer powers of 2, and the mismatch between feature-map counts and parallelism reduces the efficiency of on-chip computing-resource utilization.
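The channel-count mismatch described above can be made concrete with a small illustrative calculation (not from the patent; the `utilization` helper and the numbers are assumptions for illustration only):

```python
import math

def utilization(n_maps: int, para: int) -> float:
    """Fraction of parallel lanes doing useful work, assuming the
    n_maps channels are processed in ceil(n_maps / para) passes."""
    passes = math.ceil(n_maps / para)
    return n_maps / (passes * para)

# Power-of-two parallelism wastes lanes on AlexNet's 3-channel first layer:
assert utilization(3, 4) == 0.75   # 1 pass of 4 lanes, only 3 useful
# A fine-grained choice matches exactly:
assert utilization(3, 3) == 1.0
# A deeper layer (e.g. 96 maps) also mismatches a power-of-two lane count:
assert utilization(96, 64) == 0.75  # 2 passes of 64 lanes, 96 useful
```

This is the inefficiency the fine-grained parameter space is meant to remove.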
Based on the analysis, in order to solve the problem of computing resource waste, a parallelism determining system with higher exploration dimension, flexible parameter change and optimal search result is urgently needed, and an accelerator designer can be helped to conveniently and efficiently obtain a parallelism configuration scheme matched with a fine-grained convolution structure.
Disclosure of Invention
To overcome the shortcomings of the prior art, the technical problem solved by the invention is to provide a parallelism determination method based on a fine-grained convolution computation structure that markedly improves the utilization of computing resources, ultimately achieves an outstanding CNN acceleration effect, and guarantees that the parallelism configuration scheme with the best computing-resource utilization under the given resource limits can be found.
The technical scheme of the invention is as follows: the parallelism determination method based on the fine-grained convolution calculation structure comprises the following steps:
(1) Constructing a problem model: to determine the accelerator's optimal parallelism configuration parameters (Para_in, Para_out, Para_seg), an optimal-parallelism search algorithm is proposed. Its design target is: traverse, at the finest granularity, all feasible parallelism combination schemes within the value intervals, and screen out the parallelism configuration parameters with the highest computing-resource utilization, where Para_in is the input parallelism, Para_out is the output parallelism, and Para_seg is the segmentation parallelism;
(2) enumerating the algorithmic constraints:
Constraint 1. To ensure the rationality of resource allocation, the ratio of the single-layer resource usage #DSP_i to the total number of available DSPs on chip, #DSP_total, should be close to the percentage that the convolution layer's computation amount #OP_i accounts for of the total network computation amount #OP_total;
Constraint 2. The throughput of a fully pipelined accelerator is limited by the largest single-layer cycle count #cycle_i; to increase throughput, minimize max{#cycle_i};
Constraint 3. Σ#DSP_i must not exceed the total number of available DSP resources on chip, #DSP_total;
Constraint 4. Σ#BRAM_i must not exceed the total number of available storage resources on chip, #BRAM_total, where #BRAM_i is the single-layer storage resource usage;
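The resource constraints and the throughput objective can be sketched as small helpers; all names, signatures and numbers below are illustrative assumptions, since the patent states the constraints only in prose:

```python
def feasible(dsp_per_layer, bram_per_layer, dsp_total, bram_total):
    """Constraints 3 and 4: summed per-layer resource usage must fit on chip."""
    return (sum(dsp_per_layer) <= dsp_total
            and sum(bram_per_layer) <= bram_total)

def pipeline_cycles(cycles_per_layer):
    """Constraint 2: a fully pipelined accelerator is paced by its slowest layer."""
    return max(cycles_per_layer)

dsp = [120, 300, 260]    # made-up per-layer #DSP_i
bram = [40, 90, 70]      # made-up per-layer #BRAM_i
assert feasible(dsp, bram, dsp_total=900, bram_total=280)
assert pipeline_cycles([5000, 8000, 6500]) == 8000
```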
(3) Fine-grained traversal solution:
In the process of traversing and searching for the optimal parallelism scheme, Para_seg is uniquely determined by ROW_out, where ROW_out denotes the number of rows of the output feature map segment obtained by convolving an input feature map segment of ROW_in rows. Para_in, Para_out and ROW_out take values in the ranges [1, N_in], [1, N_out] and [1, SIZE_out] respectively, and all three can be increased or decreased finely with a minimum step size of 1, where N_in is the number of input feature maps of the convolution layer, N_out is the number of output feature maps, and SIZE_out is the size of the output feature map. The number of DSP resources required by the i-th convolution layer is given by formula (2), the required cycle count by formula (3), and the required number of BRAM storage resources by formula (4), where SIZE_ker is the convolution kernel size, #cycle_cntl is the number of cycles required by the control logic, N_pad is the zero-padding size, and C_BRAM is the data storage capacity of one BRAM:
#DSP_i = ROW_out · SIZE_ker · Para_in · Para_out    (2)
[Formulas (3) (#cycle_i) and (4) (#BRAM_i) appear only as equation images in the source.]
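Formula (2) from the text, #DSP_i = ROW_out · SIZE_ker · Para_in · Para_out, can be written as a small helper. (Formulas (3) and (4) survive only as images in the source and are not reproduced; the example values below are illustrative, not from the patent.)

```python
def dsp_count(row_out: int, size_ker: int, para_in: int, para_out: int) -> int:
    """DSP resources required by one convolution layer, per formula (2)."""
    return row_out * size_ker * para_in * para_out

# Illustrative values: 3x3 kernel, 2 output rows per segment,
# Para_in = 3, Para_out = 4.
assert dsp_count(row_out=2, size_ker=3, para_in=3, para_out=4) == 72
```

Because each factor can step by 1, the achievable #DSP_i values form a much denser set than with power-of-two-only parallelism.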
Compared with traditional parallelism determination methods, the invention offers a higher exploration dimension, fine-grained parameter adjustment, and a better search result on the FPGA chip. At the same time, it can be applied to accelerator designs on different FPGA platforms and for different CNN network structures, and has good universality; with its help, the computing power of the FPGA chip can be fully developed and a good acceleration effect achieved.
There is also provided a parallelism determination system based on a fine-grained convolution computation structure, comprising:
a building module configured to construct a problem model: to determine the accelerator's optimal parallelism configuration parameters (Para_in, Para_out, Para_seg), an optimal-parallelism search algorithm is proposed, whose design target is: traverse, at the finest granularity, all feasible parallelism combination schemes within the value intervals, and screen out the parallelism configuration parameters with the highest computing-resource utilization;
a constraint module configured to enumerate the algorithm constraints:
Constraint 1. To ensure the rationality of resource allocation, the ratio of #DSP_i to the total number of available DSPs on chip, #DSP_total, should be close to the percentage that the convolution layer's computation amount #OP_i accounts for of the total network computation amount #OP_total;
Constraint 2. The throughput of a fully pipelined accelerator is limited by the largest single-layer cycle count #cycle_i; to increase throughput, minimize max{#cycle_i};
Constraint 3. Σ#DSP_i must not exceed the total number of available DSP resources on chip, #DSP_total;
Constraint 4. Σ#BRAM_i must not exceed the total number of available storage resources on chip, #BRAM_total;
a traversal module configured such that, in the process of traversing and searching for the optimal parallelism scheme, Para_seg is uniquely determined by ROW_out; Para_in, Para_out and ROW_out take values in the ranges [1, N_in], [1, N_out] and [1, SIZE_out] respectively, and all three can be increased or decreased finely with a minimum step size of 1, where SIZE_out is the size of the output feature map. The number of DSP resources required by the i-th convolution layer is given by formula (2), the required cycle count by formula (3), and the required number of BRAM storage resources by formula (4), where SIZE_ker is the convolution kernel size, #cycle_cntl is the number of cycles required by the control logic, N_pad is the zero-padding size, and C_BRAM is the data storage capacity of one BRAM:
#DSP_i = ROW_out · SIZE_ker · Para_in · Para_out    (2)
[Formulas (3) and (4) appear only as equation images in the source.]
Drawings
Fig. 1 shows a flow chart of a parallelism determination method based on a fine-grained convolution computation structure according to the invention.
Detailed Description
As shown in fig. 1, the parallelism determination method based on the fine-grained convolution calculation structure includes the following steps:
(1) Constructing a problem model: to determine the accelerator's optimal parallelism configuration parameters (Para_in, Para_out, Para_seg), an optimal-parallelism search algorithm is proposed. Its design target is: traverse, at the finest granularity, all feasible parallelism combination schemes within the value intervals, and screen out the parallelism configuration parameters with the highest computing-resource utilization, where Para_in is the input parallelism, Para_out is the output parallelism, and Para_seg is the segmentation parallelism;
(2) Enumerating the algorithm constraints:
Constraint 1. To ensure the rationality of resource allocation, the ratio of the single-layer resource usage #DSP_i to the total number of available DSPs on chip, #DSP_total, should be close to the percentage that the convolution layer's computation amount #OP_i accounts for of the total network computation amount #OP_total;
Constraint 2. The throughput of a fully pipelined accelerator is limited by the largest single-layer cycle count #cycle_i; to increase throughput, minimize max{#cycle_i};
Constraint 3. Σ#DSP_i must not exceed the total number of available DSP resources on chip, #DSP_total;
Constraint 4. Σ#BRAM_i must not exceed the total number of available storage resources on chip, #BRAM_total, where #BRAM_i is the single-layer storage resource usage;
(3) Fine-grained traversal solution:
In the process of traversing and searching for the optimal parallelism scheme, Para_seg is uniquely determined by ROW_out, where ROW_out denotes the number of rows of the output feature map segment obtained by convolving an input feature map segment of ROW_in rows. Para_in, Para_out and ROW_out take values in the ranges [1, N_in], [1, N_out] and [1, SIZE_out] respectively, and all three can be increased or decreased finely with a minimum step size of 1, where N_in is the number of input feature maps of the convolution layer, N_out is the number of output feature maps, and SIZE_out is the size of the output feature map. The number of DSP resources required by the i-th convolution layer is given by formula (2), the required cycle count by formula (3), and the required number of BRAM storage resources by formula (4), where SIZE_ker is the convolution kernel size, #cycle_cntl is the number of cycles required by the control logic, N_pad is the zero-padding size, and C_BRAM is the data storage capacity of one BRAM:
#DSP_i = ROW_out · SIZE_ker · Para_in · Para_out    (2)
[Formulas (3) (#cycle_i) and (4) (#BRAM_i) appear only as equation images in the source.]
Compared with traditional parallelism determination methods, the invention offers a higher exploration dimension, fine-grained parameter adjustment, and a better search result on the FPGA chip. At the same time, it can be applied to accelerator designs on different FPGA platforms and for different CNN network structures, and has good universality; with its help, the computing power of the FPGA chip can be fully developed and a good acceleration effect achieved.
Preferably, in step (1), the computation process of the matrix convolution is divided by rows into Para_seg smaller matrices, which are convolved in turn, where Para_seg is given by an equation that appears only as an image in the source.
ROW_in denotes the number of rows of the input feature map segment to be stored after segmentation and is determined by formula (1); the function ξ equals 1 if and only if Para_seg equals 1, otherwise 0; ROW_out denotes the number of rows of the output feature map segment obtained by convolving an input feature map segment of ROW_in rows; and Stride denotes the convolution stride:
ROW_in = SIZE_ker + Stride · (ROW_out − 1) − 2 · N_pad · ξ    (1).
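Formula (1) can be sketched directly; the helper name and the example values are illustrative assumptions:

```python
def row_in(size_ker: int, stride: int, row_out: int, n_pad: int,
           para_seg_is_one: bool) -> int:
    """Formula (1): rows of the input feature map segment that must be stored.
    xi = 1 iff Para_seg == 1 (the whole map is a single segment), else 0."""
    xi = 1 if para_seg_is_one else 0
    return size_ker + stride * (row_out - 1) - 2 * n_pad * xi

# 3x3 kernel, stride 1, padding 1: a segmented map (xi = 0) needs extra halo
# rows per segment, while the unsegmented case (xi = 1) offsets them against
# the zero padding.
assert row_in(3, 1, 4, 1, para_seg_is_one=False) == 6
assert row_in(3, 1, 4, 1, para_seg_is_one=True) == 4
```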
Para_seg is proposed to raise the computing-resource utilization in two respects.
1. Where an oversized feature map would make the granularity of the convolution computing resources too large, the convolution of the whole feature map can be completed with a smaller resource granularity by cutting it and processing the feature-map segments in turn.
2. The first-layer input feature map of most CNN networks is RGB three-channel, i.e., N_in equals 3. The small N_in compresses the value space of Para_in: in pursuit of higher resource utilization, Para_in can only be 1 or 3. This severely limits the resource planning space of the first layer. Introducing Para_seg is equivalent to extending the value range of Para_in from the integer domain (1 or 3) to the fractional domain (such as 1/2, 1/3, 3/5 ...), greatly expanding the resource planning space.
Preferably, in step (2), constraints 1 and 2 apply to the single-layer computing-resource usage #DSP_i and the single-layer cycle count #cycle_i; limited by the amount of available resources, constraints 3 and 4 apply to #DSP_i and the single-layer storage-resource usage #BRAM_i.
Preferably, the step (3) comprises the following substeps:
(3.1) Data preparation: specify #DSP_total and pre-allocate DSP resources according to the percentage of #OP_i in #OP_total; the number of DSP resources allocated to the i-th layer is #DSP_alloc_i; specify #BRAM_total; determine the theoretical minimum computation cycle count #cycle_baseline from #OP_total and #DSP_total;
(3.2) In the i-th layer, traverse in sequence, with step size 1, all valid values of the tuple (Para_in, Para_out, ROW_out), and compute the set S_i of single-layer accelerator time/resource overheads under the various parallelism combinations. Following constraint 1, the selection rule for elements of S_i rests on the basic assumption that the farther #cycle_{i,j} and #DSP_{i,j} deviate from #cycle_baseline and #DSP_alloc_i, the less likely the element is to belong to the optimal parallelism scheme. With α the computation-cycle floating factor and β the DSP-allocation floating factor, any element A_{i,j} of S_i must satisfy the constraints that (#cycle_{i,j} / #cycle_baseline) falls within the interval [1−α, 1+α] and (#DSP_{i,j} / #DSP_alloc_i) falls within the interval [1−β, 1+β], where #cycle_{i,j} denotes the computation cycle count of element j in S_i and #DSP_{i,j} the number of DSP resources it occupies;
(3.3) To obtain the accelerator time/resource overhead, first compute the Cartesian product S = S_1 × S_2 × … × S_5 of the sets S_i (i = 1–5); each element of S corresponds to one cross-layer combination scheme. Traverse the set S, compute max{#cycle_i} (i = 1–5) for every element that satisfies the resource constraints, and sort in ascending order; the element corresponding to min{max{#cycle_i}} is the parallelism allocation scheme with which the accelerator obtains the best performance/resource utilization.
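Substep (3.3) can be sketched as a min-max search over the cross-layer Cartesian product. This is a hedged illustration, not the patent's implementation: candidates are reduced to (cycles, dsp) pairs, only the DSP budget of constraint 3 is checked, and all numbers are made up; real candidates would come from the screening of substep (3.2).

```python
from itertools import product

def best_combination(per_layer_candidates, dsp_total):
    """Pick the cross-layer combo minimizing max per-layer cycles
    among combos that fit the DSP budget."""
    best, best_cost = None, None
    for combo in product(*per_layer_candidates):   # Cartesian product S
        if sum(d for _, d in combo) > dsp_total:   # constraint 3
            continue
        cost = max(c for c, _ in combo)            # pipeline bottleneck
        if best_cost is None or cost < best_cost:  # min over max{#cycle_i}
            best, best_cost = combo, cost
    return best, best_cost

layers = [
    [(1000, 200), (600, 320)],   # layer 1 candidates (cycles, dsp)
    [(900, 250), (700, 400)],    # layer 2 candidates
]
combo, cost = best_combination(layers, dsp_total=700)
assert cost == 900 and combo == ((600, 320), (900, 250))
```

Note how the fastest per-layer choices, (600, 320) and (700, 400), are rejected together because their summed DSP usage exceeds the budget; the search trades one layer's speed for chip-wide feasibility.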
Preferably, the step (3) comprises the following substeps:
(I) Compute the ratio γ_i of each layer's computation amount #OP_i to the total network computation amount #OP_total;
(II) Distribute the DSPs available on chip to each layer in proportion to the computation amounts; the number of DSPs allocated to each layer is #DSP_alloc_i ← γ_i · #DSP_total;
(III) Compute the theoretical minimum computation cycle count #cycle_baseline from the total computation amount and the total computing resources;
(IV) For layer i, traverse the feasible values of Para_in, Para_out and ROW_out, i.e. the Cartesian product of the three domains, generating the fully combined parallelism parameter configuration set S0_i, and compute the corresponding #cycle_i, #BRAM_i and #DSP_i;
(V) Screen out the data set S_i satisfying the α, β constraints;
(VI) Across all convolution layers, traverse all possible combinations of the elements of S_i, i.e. the Cartesian product S of the domains S_i (i = 1–5), and compute max{#cycle_i} for all elements satisfying the resource constraints;
(VII) Sort max{#cycle_i} (i = 1–5) in ascending order, select the parallelism element corresponding to min{max{#cycle_i}}, and output the parameter information of the optimal parallelism under the constraints.
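Substeps (I)-(III) can be sketched as a proportional pre-allocation; this is an assumption-laden illustration: the patent does not state the rounding rule (floored integer division is assumed here) nor the operations-per-DSP-per-cycle factor (taken as 1), and the operation counts are made up.

```python
def preallocate(ops_per_layer, dsp_total):
    """Steps (I)-(III): gamma_i ratios, proportional DSP pre-allocation
    (floored), and a baseline cycle count assuming 1 op per DSP per cycle."""
    total_ops = sum(ops_per_layer)
    gammas = [op / total_ops for op in ops_per_layer]                # step (I)
    alloc = [op * dsp_total // total_ops for op in ops_per_layer]    # step (II)
    cycle_baseline = total_ops / dsp_total                           # step (III)
    return gammas, alloc, cycle_baseline

ops = [200_000, 600_000, 200_000]   # made-up per-layer #OP_i
gammas, alloc, baseline = preallocate(ops, dsp_total=1000)
assert alloc == [200, 600, 200]
assert baseline == 1000.0
```

The α/β screening of step (V) then keeps only per-layer candidates whose #cycle_i and #DSP_i stay within the floating-factor bands around this baseline and allocation.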
It will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the methods of the above embodiments; the storage medium may be ROM/RAM, a magnetic disk, an optical disk, a memory card, or the like. Therefore, corresponding to the method of the present invention, the present invention also includes a parallelism determination system based on a fine-grained convolution computation structure, generally expressed as functional modules corresponding to the steps of the method. The system comprises:
a building module configured to construct a problem model: to determine the accelerator's optimal parallelism configuration parameters (Para_in, Para_out, Para_seg), an optimal-parallelism search algorithm is proposed, whose design target is: traverse, at the finest granularity, all feasible parallelism combination schemes within the value intervals, and screen out the parallelism configuration parameters with the highest computing-resource utilization;
a constraint module configured to enumerate the algorithm constraints:
Constraint 1. To ensure the rationality of resource allocation, the ratio of #DSP_i to the total number of available DSPs on chip, #DSP_total, should be close to the percentage that the convolution layer's computation amount #OP_i accounts for of the total network computation amount #OP_total;
Constraint 2. The throughput of a fully pipelined accelerator is limited by the largest single-layer cycle count #cycle_i; to increase throughput, minimize max{#cycle_i};
Constraint 3. Σ#DSP_i must not exceed the total number of available DSP resources on chip, #DSP_total;
Constraint 4. Σ#BRAM_i must not exceed the total number of available storage resources on chip, #BRAM_total;
a traversal module configured such that, in the process of traversing and searching for the optimal parallelism scheme, Para_seg is uniquely determined by ROW_out; Para_in, Para_out and ROW_out take values in the ranges [1, N_in], [1, N_out] and [1, SIZE_out] respectively, and all three can be increased or decreased finely with a minimum step size of 1, where SIZE_out is the size of the output feature map. The number of DSP resources required by the i-th convolution layer is given by formula (2), the required cycle count by formula (3), and the required number of BRAM storage resources by formula (4), where SIZE_ker is the convolution kernel size, #cycle_cntl is the number of cycles required by the control logic, N_pad is the zero-padding size, and C_BRAM is the data storage capacity of one BRAM:
#DSP_i = ROW_out · SIZE_ker · Para_in · Para_out    (2)
[Formulas (3) and (4) appear only as equation images in the source.]
Preferably, in the building module, the computation process of the matrix convolution is divided by rows into Para_seg smaller matrices, which are convolved in turn, where Para_seg is given by an equation that appears only as an image in the source.
ROW_in denotes the number of rows of the input feature map segment to be stored after segmentation and is determined by formula (1); the function ξ equals 1 if and only if Para_seg equals 1, otherwise 0; ROW_out denotes the number of rows of the output feature map segment obtained by convolving an input feature map segment of ROW_in rows; and Stride denotes the convolution stride:
ROW_in = SIZE_ker + Stride · (ROW_out − 1) − 2 · N_pad · ξ    (1).
Preferably, in the constraint module, constraints 1 and 2 apply to the single-layer computing-resource usage #DSP_i and the single-layer cycle count #cycle_i; limited by the amount of available resources, constraints 3 and 4 apply to #DSP_i and the single-layer storage-resource usage #BRAM_i.
Preferably, the traversal module performs the following sub-steps:
(3.1) Data preparation: specify #DSP_total and pre-allocate DSP resources according to the percentage of #OP_i in #OP_total; the number of DSP resources allocated to the i-th layer is #DSP_alloc_i; specify #BRAM_total; determine the theoretical minimum computation cycle count #cycle_baseline from #OP_total and #DSP_total;
(3.2) In the i-th layer, traverse in sequence, with step size 1, all valid values of the tuple (Para_in, Para_out, ROW_out), and compute the set S_i of single-layer accelerator time/resource overheads under the various parallelism combinations. Following constraint 1, the selection rule for elements of S_i rests on the basic assumption that the farther #cycle_{i,j} and #DSP_{i,j} deviate from #cycle_baseline and #DSP_alloc_i, the less likely the element is to belong to the optimal parallelism scheme. With α the computation-cycle floating factor and β the DSP-allocation floating factor, any element A_{i,j} of S_i must satisfy the constraints that (#cycle_{i,j} / #cycle_baseline) falls within the interval [1−α, 1+α] and (#DSP_{i,j} / #DSP_alloc_i) falls within the interval [1−β, 1+β], where #cycle_{i,j} denotes the computation cycle count of element j in S_i and #DSP_{i,j} the number of DSP resources it occupies;
(3.3) To obtain the accelerator time/resource overhead, first compute the Cartesian product S = S_1 × S_2 × … × S_5 of the sets S_i (i = 1–5); each element of S corresponds to one cross-layer combination scheme. Traverse the set S, compute max{#cycle_i} (i = 1–5) for every element that satisfies the resource constraints, and sort in ascending order; the element corresponding to min{max{#cycle_i}} is the parallelism allocation scheme with which the accelerator obtains the best performance/resource utilization.
Preferably, the traversal module performs the following sub-steps:
(I) Compute the ratio γ_i of each layer's computation amount #OP_i to the total network computation amount #OP_total;
(II) Distribute the DSPs available on chip to each layer in proportion to the computation amounts; the number of DSPs allocated to each layer is #DSP_alloc_i ← γ_i · #DSP_total;
(III) Compute the theoretical minimum computation cycle count #cycle_baseline from the total computation amount and the total computing resources;
(IV) For layer i, traverse the feasible values of Para_in, Para_out and ROW_out, i.e. the Cartesian product of the three domains, generating the fully combined parallelism parameter configuration set S0_i, and compute the corresponding #cycle_i, #BRAM_i and #DSP_i;
(V) Screen out the data set S_i satisfying the α, β constraints;
(VI) Across all convolution layers, traverse all possible combinations of the elements of S_i, i.e. the Cartesian product S of the domains S_i (i = 1–5), and compute max{#cycle_i} for all elements satisfying the resource constraints;
(VII) Sort max{#cycle_i} (i = 1–5) in ascending order, select the parallelism element corresponding to min{max{#cycle_i}}, and output the parameter information of the optimal parallelism under the constraints.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications, equivalent variations and modifications made to the above embodiment according to the technical spirit of the present invention still belong to the protection scope of the technical solution of the present invention.

Claims (10)

1. The parallelism determination method based on the fine-grained convolution calculation structure is characterized by comprising the following steps: which comprises the following steps:
(1) constructing a problem model: configuring parameters (Para) for determining optimal parallelism of an acceleratorin,Paraout,Paraseg) The optimal parallelism search algorithm is provided, and the design target is as follows: traversing all feasible parallelism combination schemes with the finest granularity in the value interval, and screening out the parallelism configuration parameters with the highest computational resource utilization rate, wherein ParainIs the input parallelism, ParaoutIs the degree of output parallelism, ParasegIs the segmentation parallelism;
(2) enumerating the algorithmic constraints:
Constraint 1: to ensure rational resource allocation, the ratio of the single-layer resource usage #DSP_i to the total number of available on-chip DSPs #DSP_total should be close to the percentage of the layer's computation amount #OP_i in the total network computation amount #OP_total;
Constraint 2: the throughput of a fully pipelined accelerator is limited by the maximum single-layer cycle count #cycle_i; to increase throughput, max{#cycle_i} is minimized;
Constraint 3: Σ#DSP_i must not exceed the total number of available on-chip DSP resources #DSP_total;
Constraint 4: Σ#BRAM_i must not exceed the total number of available on-chip storage resources #BRAM_total, where #BRAM_i is the single-layer storage resource usage;
(3) fine-grained traversal solution
In the process of traversing and searching for the optimal parallelism scheme, Para_seg is uniquely determined by ROW_out, where ROW_out denotes the number of rows of the output feature map segment obtained after convolving a ROW_in-row input feature map segment. Para_in, Para_out and ROW_out take values in the ranges [1, N_in], [1, N_out] and [1, SIZE_out] respectively, and each can be finely increased or decreased with a minimum step size of 1, where N_in is the number of input feature maps of the convolutional layer, N_out is the number of output feature maps of the convolutional layer, and SIZE_out denotes the size of the output feature map. The number of DSP resources required by the i-th convolutional layer is given by formula (2), the required cycle count by formula (3), and the required number of BRAM storage resources by formula (4), where SIZE_ker denotes the convolution kernel size, #cycle_cntl the number of cycles required by the control logic, N_pad the zero-padding size, and C_BRAM the data storage capacity of one BRAM:
#DSP_i = ROW_out · SIZE_ker · Para_in · Para_out (2)
[formula (3): the expression for #cycle_i, rendered only as an image in the original source]
[formula (4): the expression for #BRAM_i, rendered only as an image in the original source]
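Formula (2) translates directly into code; a minimal Python sketch (function and parameter names are illustrative, not from the source):

```python
def dsp_count(row_out, size_ker, para_in, para_out):
    # formula (2): #DSP_i = ROW_out * SIZE_ker * Para_in * Para_out
    # one multiplier per kernel row element, per input/output channel pair
    return row_out * size_ker * para_in * para_out
```

For example, ROW_out = 6 output rows with a 3-wide kernel, 4 input channels and 8 output channels in parallel would need 6 · 3 · 4 · 8 = 576 DSPs.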
2. The parallelism determination method based on the fine-grained convolution calculation structure according to claim 1, characterized in that: in step (1), the calculation process of one matrix convolution is divided by rows into Para_seg convolutions of smaller matrices performed in turn, where Para_seg is equal to
[the Para_seg expression, rendered only as an image in the original source]
where ROW_in, the number of rows of the input feature map segment to be stored after segmentation, is determined by formula (1); the function ξ equals 1 if and only if Para_seg equals 1 and equals 0 otherwise; ROW_out denotes the number of rows of the output feature map segment obtained after convolving the ROW_in-row input feature map segment; and Stride denotes the convolution stride:
ROW_in = SIZE_ker + Stride · (ROW_out − 1) − 2 · N_pad · ξ (1).
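Formula (1) can likewise be checked in code. A small sketch assuming integer inputs; note that with ξ = 1 (no segmentation) it reproduces the usual relation between padded input and output heights:

```python
def row_in(size_ker, stride, row_out, n_pad, para_seg):
    # xi = 1 iff Para_seg == 1 (unsegmented case), else 0 -- per the claim
    xi = 1 if para_seg == 1 else 0
    # formula (1): ROW_in = SIZE_ker + Stride*(ROW_out - 1) - 2*N_pad*xi
    return size_ker + stride * (row_out - 1) - 2 * n_pad * xi
```

For a 3×3 kernel, stride 1, one row of zero padding and ROW_out = 224, this gives ROW_in = 224, matching the unsegmented "same"-padding case; with Para_seg = 4 the padding term drops out because interior segments carry no zero rows.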
3. The parallelism determination method based on the fine-grained convolution calculation structure according to claim 2, characterized in that: in step (2), constraints 1 and 2 concern the single-layer resource usage #DSP_i and the single-layer calculation cycle count #cycle_i; limited by the number of available resources, constraints 3 and 4 concern #DSP_i and the single-layer storage resource usage #BRAM_i.
4. The fine-grained convolution computation structure-based parallelism determination method according to claim 3, characterized in that: the step (3) comprises the following sub-steps:
(3.1) data preparation: specify #DSP_total and pre-allocate DSP resources according to the percentage of #OP_i in #OP_total, the number of DSP resources allocated to the i-th layer being #DSP_i^alloc; specify #BRAM_total; determine the theoretical minimum calculation cycle count #cycle_baseline from #OP_total and #DSP_total;
(3.2) in the i-th layer, traverse in sequence, with a step size of 1, all valid values of the tuple (Para_in, Para_out, ROW_out) to obtain the set S_i of single-layer accelerator time/resource overheads under the various parallelism combinations; according to constraint 1, the selection rule for elements of S_i rests on the following basic assumption: the farther #cycle_i,j and #DSP_i,j deviate from #cycle_baseline and #DSP_i^alloc, the less likely they are to form the optimal parallelism solution; α is the calculation-cycle floating factor and β is the DSP-allocation floating factor; every element A_i,j of S_i satisfies the constraints that #cycle_i,j / #cycle_baseline falls within the interval [1−α, 1+α] and #DSP_i,j / #DSP_i^alloc falls within the interval [1−β, 1+β], where #cycle_i,j denotes the calculation cycle count corresponding to element j of S_i and #DSP_i,j denotes the number of DSP resources occupied by element j of S_i;
(3.3) to obtain the accelerator time/resource overhead, first compute the Cartesian product S = S_1 × S_2 × … × S_5 of the sets S_i (i = 1~5), each element of which corresponds to one cross-layer combination scheme; traverse the set S, calculate max{#cycle_i} (i = 1~5) for all elements satisfying the resource constraints, and sort the results in ascending order; the element corresponding to min{max{#cycle_i}} is the parallelism allocation scheme the accelerator adopts to obtain the best performance/resource utilization.
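The α/β screening rule of step (3.2) reduces to a two-sided window test on the two ratios; a sketch with illustrative names:

```python
def passes_screening(cycle_ij, dsp_ij, cycle_base, dsp_alloc_i, alpha, beta):
    # keep element j of S_i iff both ratios fall inside the floating windows
    # [1 - alpha, 1 + alpha] for cycles and [1 - beta, 1 + beta] for DSPs
    return (1 - alpha <= cycle_ij / cycle_base <= 1 + alpha
            and 1 - beta <= dsp_ij / dsp_alloc_i <= 1 + beta)
```

With α = 0.2 and β = 0.1, a candidate at 110% of the cycle baseline and 95% of its DSP allocation passes, while one at 130% of the cycle baseline is discarded before the cross-layer product is formed.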
5. The fine-grained convolution computation structure-based parallelism determination method according to claim 3, characterized in that: the step (3) comprises the following sub-steps:
(I) calculate the ratio γ_i of each layer's computation amount #OP_i to the total network computation amount #OP_total;
(II) distribute the available on-chip DSPs to the layers in proportion to their computation amounts, the number of DSPs allocated to each layer being #DSP_i^alloc ← γ_i · #DSP_total;
(III) calculate the theoretical minimum calculation cycle count #cycle_baseline from the total computation amount and the total computing resources;
(IV) for layer i, traverse the feasible values of Para_in, Para_out and ROW_out, i.e. the Cartesian product of their three domains, to generate the fully combined parallelism parameter configuration set S_i^0, and calculate the corresponding #cycle_i, #BRAM_i and #DSP_i;
(V) screen out the subset S_i whose elements satisfy the α, β constraints;
(VI) across all convolutional layers, traverse all possible combinations of the elements of S_i, i.e. the Cartesian product S of the domains S_i (i = 1~5), and calculate max{#cycle_i} (i = 1~5) for every element that satisfies the resource constraints;
(VII) sort max{#cycle_i} (i = 1~5) in ascending order, select the parallelism element corresponding to min{max{#cycle_i}}, and output the parameter information of the optimal parallelism under the constraint conditions.
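Steps (I)–(III) amount to a proportional split of the DSP budget followed by a roofline-style cycle floor. A hedged sketch: the source does not state how many operations one DSP completes per cycle, so that factor is exposed as a parameter rather than assumed.

```python
def preallocate_dsp(layer_ops, dsp_total):
    # steps (I)-(II): gamma_i = #OP_i / #OP_total,
    # then #DSP_i^alloc = gamma_i * #DSP_total for each layer
    op_total = sum(layer_ops)
    return [ops / op_total * dsp_total for ops in layer_ops]

def cycle_baseline(op_total, dsp_total, ops_per_dsp_per_cycle=1):
    # step (III): theoretical floor = total work / peak work per cycle.
    # How many ops one DSP retires per cycle is an assumption here.
    return op_total / (dsp_total * ops_per_dsp_per_cycle)
```

For instance, two layers with 25% and 75% of the total operations split a 100-DSP budget 25/75, and 1000 total operations on 10 DSPs cannot finish in fewer than 100 cycles under the one-op-per-cycle assumption.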
6. A parallelism determination system based on a fine-grained convolution calculation structure, characterized in that it comprises:
a construction module configured to construct a problem model: to determine the optimal parallelism configuration parameters (Para_in, Para_out, Para_seg) of the accelerator, an optimal parallelism search algorithm is proposed, whose design target is to traverse, at the finest granularity, all feasible parallelism combination schemes within the value intervals and screen out the parallelism configuration parameters with the highest computational resource utilization;
a constraint module configured to enumerate the algorithmic constraints:
Constraint 1: to ensure rational resource allocation, the ratio of #DSP_i to the total number of available on-chip DSPs #DSP_total should be close to the percentage of the layer's computation amount #OP_i in the total network computation amount #OP_total;
Constraint 2: the throughput of a fully pipelined accelerator is limited by the maximum #cycle_i; to increase throughput, max{#cycle_i} is minimized;
Constraint 3: Σ#DSP_i must not exceed the total number of available on-chip DSP resources #DSP_total;
Constraint 4: Σ#BRAM_i must not exceed the total number of available on-chip storage resources #BRAM_total;
a traversal module configured such that, in traversing and searching for the optimal parallelism scheme, Para_seg is uniquely determined by ROW_out; Para_in, Para_out and ROW_out take values in the ranges [1, N_in], [1, N_out] and [1, SIZE_out] respectively, and each can be finely increased or decreased with a minimum step size of 1, where SIZE_out denotes the size of the output feature map; the number of DSP resources required by the i-th convolutional layer is given by formula (2), the required cycle count by formula (3), and the required number of BRAM storage resources by formula (4), where SIZE_ker denotes the convolution kernel size, #cycle_cntl the number of cycles required by the control logic, N_pad the zero-padding size, and C_BRAM the data storage capacity of one BRAM:
#DSP_i = ROW_out · SIZE_ker · Para_in · Para_out (2)
[formula (3): the expression for #cycle_i, rendered only as an image in the original source]
[formula (4): the expression for #BRAM_i, rendered only as an image in the original source]
7. The parallelism determination system based on the fine-grained convolution calculation structure according to claim 6, characterized in that: in the construction module, the calculation process of one matrix convolution is divided by rows into Para_seg convolutions of smaller matrices performed in turn, where Para_seg is equal to
[the Para_seg expression, rendered only as an image in the original source]
where ROW_in, the number of rows of the input feature map segment to be stored after segmentation, is determined by formula (1); the function ξ equals 1 if and only if Para_seg equals 1 and equals 0 otherwise; ROW_out denotes the number of rows of the output feature map segment obtained after convolving the ROW_in-row input feature map segment; and Stride denotes the convolution stride:
ROW_in = SIZE_ker + Stride · (ROW_out − 1) − 2 · N_pad · ξ (1).
8. The parallelism determination system based on the fine-grained convolution calculation structure according to claim 7, characterized in that: in the constraint module, constraints 1 and 2 concern the single-layer computing resource usage #DSP_i and the single-layer calculation cycle count #cycle_i; limited by the number of available resources, constraints 3 and 4 concern #DSP_i and the single-layer storage resource usage #BRAM_i.
9. The fine-grained convolution computation structure-based parallelism determination system of claim 8, characterized in that: the traversal module executes the following sub-steps:
(3.1) data preparation: specify #DSP_total and pre-allocate DSP resources according to the percentage of #OP_i in #OP_total, the number of DSP resources allocated to the i-th layer being #DSP_i^alloc; specify #BRAM_total; determine the theoretical minimum calculation cycle count #cycle_baseline from #OP_total and #DSP_total;
(3.2) in the i-th layer, traverse in sequence, with a step size of 1, all valid values of the tuple (Para_in, Para_out, ROW_out) to obtain the set S_i of single-layer accelerator time/resource overheads under the various parallelism combinations; according to constraint 1, the selection rule for elements of S_i rests on the following basic assumption: the farther #cycle_i,j and #DSP_i,j deviate from #cycle_baseline and #DSP_i^alloc, the less likely they are to form the optimal parallelism solution; α is the calculation-cycle floating factor and β is the DSP-allocation floating factor; every element A_i,j of S_i satisfies the constraints that #cycle_i,j / #cycle_baseline falls within the interval [1−α, 1+α] and #DSP_i,j / #DSP_i^alloc falls within the interval [1−β, 1+β], where #cycle_i,j denotes the calculation cycle count corresponding to element j of S_i and #DSP_i,j denotes the number of DSP resources occupied by element j of S_i;
(3.3) to obtain the accelerator time/resource overhead, first compute the Cartesian product S = S_1 × S_2 × … × S_5 of the sets S_i (i = 1~5), each element of which corresponds to one cross-layer combination scheme; traverse the set S, calculate max{#cycle_i} (i = 1~5) for all elements satisfying the resource constraints, and sort the results in ascending order; the element corresponding to min{max{#cycle_i}} is the parallelism allocation scheme the accelerator adopts to obtain the best performance/resource utilization.
10. The fine-grained convolution computation structure-based parallelism determination system of claim 8, characterized in that: the traversal module executes the following sub-steps:
(I) calculate the ratio γ_i of each layer's computation amount #OP_i to the total network computation amount #OP_total;
(II) distribute the available on-chip DSPs to the layers in proportion to their computation amounts, the number of DSPs allocated to each layer being #DSP_i^alloc ← γ_i · #DSP_total;
(III) calculate the theoretical minimum calculation cycle count #cycle_baseline from the total computation amount and the total computing resources;
(IV) for layer i, traverse the feasible values of Para_in, Para_out and ROW_out, i.e. the Cartesian product of their three domains, to generate the fully combined parallelism parameter configuration set S_i^0, and calculate the corresponding #cycle_i, #BRAM_i and #DSP_i;
(V) screen out the subset S_i whose elements satisfy the α, β constraints;
(VI) across all convolutional layers, traverse all possible combinations of the elements of S_i, i.e. the Cartesian product S of the domains S_i (i = 1~5), and calculate max{#cycle_i} (i = 1~5) for every element that satisfies the resource constraints;
(VII) sort max{#cycle_i} (i = 1~5) in ascending order, select the parallelism element corresponding to min{max{#cycle_i}}, and output the parameter information of the optimal parallelism under the constraint conditions.
CN202110888610.XA 2021-07-30 2021-07-30 Parallelism determination method and system based on fine-granularity convolution computing structure Active CN113592088B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110888610.XA CN113592088B (en) 2021-07-30 2021-07-30 Parallelism determination method and system based on fine-granularity convolution computing structure


Publications (2)

Publication Number Publication Date
CN113592088A true CN113592088A (en) 2021-11-02
CN113592088B CN113592088B (en) 2024-05-28

Family

ID=78254703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110888610.XA Active CN113592088B (en) 2021-07-30 2021-07-30 Parallelism determination method and system based on fine-granularity convolution computing structure

Country Status (1)

Country Link
CN (1) CN113592088B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107392308A (en) * 2017-06-20 2017-11-24 中国科学院计算技术研究所 A kind of convolutional neural networks accelerated method and system based on programming device
CN107993186A (en) * 2017-12-14 2018-05-04 中国人民解放军国防科技大学 3D CNN acceleration method and system based on Winograd algorithm
CN108280514A (en) * 2018-01-05 2018-07-13 中国科学技术大学 Sparse neural network acceleration system based on FPGA and design method
GB201913353D0 (en) * 2019-09-16 2019-10-30 Samsung Electronics Co Ltd Method for designing accelerator hardware
CN110516801A (en) * 2019-08-05 2019-11-29 西安交通大学 A kind of dynamic reconfigurable convolutional neural networks accelerator architecture of high-throughput
CN112001492A (en) * 2020-08-07 2020-11-27 中山大学 Mixed flow type acceleration framework and acceleration method for binary weight Densenet model
CN112116084A (en) * 2020-09-15 2020-12-22 中国科学技术大学 Convolution neural network hardware accelerator capable of solidifying full network layer on reconfigurable platform


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
X. Qu et al.: "Cheetah: An Accurate Assessment Mechanism and a High-Throughput Acceleration Architecture Oriented Toward Resource Efficiency", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 40, no. 5, pages 878-891, XP011850205, DOI: 10.1109/TCAD.2020.3011650 *
X. Zhang et al.: "DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs", IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 1-8 *
Sun Ming et al.: "Design method of hardware accelerators for convolutional neural networks", Computer Engineering and Applications, vol. 57, no. 13, pages 77-84 *
Gong Lei: "Research on heterogeneous multi-core acceleration methods for convolutional neural networks on reconfigurable platforms", China Doctoral Dissertations Full-text Database: Information Science and Technology, no. 8, pages 1-119 *
Qu Xinyuan et al.: "A parallelism optimization search algorithm based on a 3-D transformable CNN acceleration structure", Journal of Electronics & Information Technology, vol. 44, no. 4, pages 1503-1512 *

Also Published As

Publication number Publication date
CN113592088B (en) 2024-05-28

Similar Documents

Publication Publication Date Title
Zhang et al. DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs
EP3757901A1 (en) Schedule-aware tensor distribution module
CN106919769B (en) Hierarchical FPGA (field programmable Gate array) layout and wiring method based on multi-level method and empowerment hypergraph
EP4036810A1 (en) Neural network processing method and apparatus, computer device and storage medium
Jiao et al. 7.2 A 12nm programmable convolution-efficient neural-processing-unit chip achieving 825TOPS
WO2020119318A1 (en) Self-adaptive selection and design method for convolutional-layer hardware accelerator
Chen et al. Zara: A novel zero-free dataflow accelerator for generative adversarial networks in 3d reram
CN109472361B (en) Neural network optimization method
US11886979B1 (en) Shifting input values within input buffer of neural network inference circuit
CN110516316B (en) GPU acceleration method for solving Euler equation by interrupted Galerkin method
CN115755954B (en) Routing inspection path planning method, system, computer equipment and storage medium
US11645512B2 (en) Memory layouts and conversion to improve neural network inference performance
US20210191733A1 (en) Flexible accelerator for sparse tensors (fast) in machine learning
CN112183015B (en) Chip layout planning method for deep neural network
CN115860081A (en) Core particle algorithm scheduling method and system, electronic equipment and storage medium
CN106484532B (en) GPGPU parallel calculating method towards SPH fluid simulation
WO2021244045A1 (en) Neural network data processing method and apparatus
CN113052306B (en) Online learning chip based on heap width learning model
CN112183001B (en) Hypergraph-based multistage clustering method for integrated circuits
CN113592088A (en) Parallelism determination method and system based on fine-grained convolution calculation structure
CN116245150A (en) Neural network reconfigurable configuration mapping method for FPGA (field programmable Gate array) resources
Guan et al. Crane: Mitigating accelerator under-utilization caused by sparsity irregularities in cnns
CN113986816B (en) Reconfigurable computing chip
WO2022266888A1 (en) Congestion prediction model training method, image processing method and apparatus
CN110415162B (en) Adaptive graph partitioning method facing heterogeneous fusion processor in big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant