CN109583006B

CN109583006B - Dynamic optimization method of field programmable gate array convolution layer based on cyclic cutting and rearrangement

Info

Publication number: CN109583006B
Application number: CN201811201717.7A
Authority: CN
Inventors: 陈朋; 陈庆清; 王海霞; 赵�智; 刘义鹏; 梁荣华
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2018-10-16
Filing date: 2018-10-16
Publication date: 2023-07-21
Anticipated expiration: 2038-10-16
Also published as: CN109583006A

Abstract

A dynamic optimization method of a field programmable gate array convolution layer based on cyclic cutting and rearrangement uses a high-level synthesis tool to develop on a field programmable gate array platform, optimizes the convolution layer by cutting and rearranging the convolution loops, adjusts the resource occupation and processing performance of the convolution layer, makes full use of the parallel processing capability of the field programmable gate array, and improves the performance of a convolutional neural network. The method can greatly improve the internal calculation speed and efficiency, thereby shortening the calculation time.

Description

Dynamic optimization method of field programmable gate array convolution layer based on cyclic cutting and rearrangement

Technical Field

The invention belongs to the technical field of digital image processing and pattern recognition, and in particular relates to a dynamic optimization method of a field programmable gate array convolution layer based on cyclic cutting and rearrangement, directed at the design of the convolution layer, which is the core part of a convolutional neural network algorithm.

Background

The convolutional neural network is a multi-layer perceptron developed on the basis of the artificial neural network. It adapts well to image deformations such as translation, scaling and rotation, is sensitive in extracting image features, achieves high accuracy by imitating the behavior of the optic nerve in living beings, and is widely applied in fields such as machine vision, pattern recognition, video surveillance and image search. The convolutional neural network is a computationally intensive structure: as model complexity increases, the number of model parameters, the model size and the number of floating-point operations required all grow, which places higher demands on hardware resources and hinders deployment on devices with limited storage space and battery life.

At present, most convolutional neural network systems are implemented on GPUs. Although the GPU has high parallel computing capability and can address the problem of computing speed, GPU-based convolutional neural network accelerators often suffer from high power consumption, large volume and high cost.

Compared with the GPU, a field programmable gate array chip, with its large number of array logic and arithmetic units, has outstanding advantages in size, power consumption and parallel operation. A convolutional neural network implemented with the rich logic resources of the field programmable gate array, together with resources such as dedicated multipliers and digital signal processing units, can execute the large number of repeated and mutually independent multiply-add operations in the algorithm with high parallelism, reducing power consumption as far as possible while maintaining computing capability.

The traditional way of building a convolutional neural network on a field programmable gate array is to design it in a register transfer level (RTL) description language, which suffers from a complex flow, a long development cycle and little room for optimization. In particular, such methods lack an effective characteristic analysis of convolutional neural network implementations on the field programmable gate array, while convolution calculation places high demands on hardware resources. Therefore, in an edge computing environment, designing a method for constructing a convolutional neural network becomes particularly important. For a convolutional neural network accelerator based on a field programmable gate array, development with a high-level synthesis tool offers good scalability and requires only a short design time. High-level synthesis is a cross-level design method in which the algorithm is designed in a high-level programming language and converted, through compilation, semantic conversion, mapping, placement and routing and similar processes, into a register transfer level description usable for field programmable gate array design. A circuit designed by high-level synthesis can achieve good performance when logic resources are sufficient, but for complex device types and tight resources the design method and theory still need further exploration.

Disclosure of Invention

In order to overcome the defect that the convolution layer in the prior art takes too long to compute, the invention provides a dynamic optimization method of a field programmable gate array convolution layer based on cyclic cutting and rearrangement, which can greatly improve the internal calculation speed and efficiency and thereby shorten the calculation time.

The technical scheme adopted for solving the technical problems is as follows:

A dynamic optimization method of a field programmable gate array convolution layer based on cyclic cutting and rearrangement comprises the following steps:

1) Obtaining the calculation formula of the convolution layer from the calculation process of the convolution operation;

2) Setting the corresponding segmentation parameters, and cutting the loops of the convolution layer calculation formula obtained in step 1) to form two sub-loops;

3) Analyzing the data sharing relations of the loop parameters for the convolution layer calculation formula obtained in step 1) and the sub-loops obtained in step 2);

4) According to the data sharing relations obtained by the analysis in step 3), rearranging and unrolling the sub-loops obtained in step 2) in the high-level synthesis tool by inserting compiler directives during the conversion process;

5) Generating a synthesis report with the simulation tool of the high-level synthesis tool, the synthesis report including the proportion of resources used by the calculation process; comparing the reported resource proportion with the resource constraints and judging whether the result is optimal under the current resource constraints; if not, modifying the segmentation parameters or the rearrangement order and repeating steps 2), 3) and 4);

6) Instantiating the convolution operation produced by step 5) with the high-level synthesis tool, converting the C language into the Verilog language, generating a register transfer level circuit, and generating the corresponding convolution layer functional module.

Further, in step 1), the convolution layer receives N feature maps of size W×H as input; each input feature map is convolved with a sliding window of size K×K whose stride S is typically smaller than K, and the N input feature maps together produce M output feature maps of size R×C. The formula is as follows:

where OUT represents the set of output feature maps, IN represents the set of input feature maps, and W represents the set of weights.
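Under the definitions above, the calculation of such a convolution layer is conventionally written as the following sum. This is a reconstruction from the stated symbols, assuming zero-based indices and no padding, and is offered only as a reading aid:

```latex
\mathrm{OUT}[m][r][c] \;=\; \sum_{n=0}^{N-1} \sum_{i=0}^{K-1} \sum_{j=0}^{K-1}
  W[m][n][i][j] \cdot \mathrm{IN}[n][S \cdot r + i][S \cdot c + j],
\qquad 0 \le m < M,\; 0 \le r < R,\; 0 \le c < C .
```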

In step 2), the calculation process of the convolution layer is divided into two sub-loops, one of which is shown in the following formula:

The combination <Tm, Tn, Tr, Tc> is the set of segmentation parameters, where Tm, Tn, Tr and Tc are the segment sizes for the output feature map depth, the input feature map depth, and the output feature map width and length, respectively. The other sub-loop is shown in the following formula:

In step 3), according to the convolution calculation formula obtained in step 1), the data sharing relation between different loop iterations falls into three types: irrelevant, independent and dependent;

I) Irrelevant: if loop iterator i_k does not appear in any access function of array A, then the corresponding loop dimension is irrelevant to array A;

II) Independent: if the union of the data space accessed on array A is completely separable along loop dimension i_k, or, for any two distinct values p₁ and p₂, the data accessed with i_k = p₁ is disjoint from the data accessed with i_k = p₂, then loop dimension i_k is independent of array A;

III) Dependent: if the union of the data space accessed on array A cannot be separated along loop dimension i_k, then loop dimension i_k is dependent on array A;

the data sharing relations are shown in the following table:

From a hardware implementation perspective, an independent data sharing relation generates direct connections between buffers and computing modules, an irrelevant data sharing relation generates broadcast connections, and a dependent data sharing relation generates interconnections with complex topologies.

In step 4), the generated hardware structure is optimized, wherein one optimization technique is loop unrolling and another key optimization technique is loop pipelining, which overlaps the execution of operations from different loop iterations;

the sub-loops obtained after the segmentation in step 2) are optimized as follows: the inner loops are first rearranged according to the data sharing relations obtained from the analysis, the innermost loops are then unrolled, and loop pipelining is added at the same time to improve the throughput of the system. The optimized calculation process is shown in the following formula:

where F(x) represents loop unrolling and L(x) represents loop pipelining.

The beneficial effects of the invention are as follows: a high-level synthesis tool is used to optimize the convolution layer with a dynamic optimization method based on cyclic cutting and rearrangement, which improves the calculation efficiency of the convolutional neural network on the field programmable gate array. The optimization method is widely applicable: it can be extended to other structures such as pooling layers and fully connected layers, and can also be applied to different convolutional neural network models.

Drawings

FIG. 1 is a flow chart of the dynamic optimization method of a field programmable gate array convolution layer based on cyclic cutting and rearrangement;

FIG. 2 is a diagram of a calculation process of a convolution layer;

Detailed Description

The invention is further described below with reference to the accompanying drawings:

Referring to FIG. 1 and FIG. 2, a dynamic optimization method of a field programmable gate array convolution layer based on cyclic cutting and rearrangement includes the following steps:

1) The calculation process of the convolution layer is shown in FIG. 2. The convolution layer receives N feature maps of size W×H as input; each input feature map is convolved with a sliding window of size K×K whose stride S is generally smaller than K, and the N input feature maps together produce M output feature maps of size R×C. The formula is as follows:

where OUT represents the set of output feature maps, IN represents the set of input feature maps, and W represents the set of weights;
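As a minimal sketch of this calculation, the following Python reference model evaluates the layer with six nested loops. The function name, the list-of-lists data layout and the no-padding relation R = (H - K) / S + 1 are assumptions made for illustration; the method itself targets C code for a high-level synthesis tool.

```python
def conv_layer_reference(inp, weights, stride):
    """Direct (uncut) convolution layer, written as six nested loops.

    inp:     N input feature maps, indexed inp[n][h][w]
    weights: M x N kernels of size K x K, indexed weights[m][n][i][j]
    stride:  window stride S
    Returns M output feature maps, indexed out[m][r][c].
    """
    n_in = len(inp)
    height, width = len(inp[0]), len(inp[0][0])
    m_out = len(weights)
    k = len(weights[0][0])
    # Assumption: no padding, so the output size follows from H, W, K and S.
    rows = (height - k) // stride + 1
    cols = (width - k) // stride + 1
    out = [[[0] * cols for _ in range(rows)] for _ in range(m_out)]
    for r in range(rows):
        for c in range(cols):
            for m in range(m_out):
                for n in range(n_in):
                    for i in range(k):
                        for j in range(k):
                            out[m][r][c] += (
                                weights[m][n][i][j]
                                * inp[n][stride * r + i][stride * c + j]
                            )
    return out
```

Such a model is mainly useful as a golden reference against which the cut and rearranged versions of the loops can be checked.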

2) Since the resources of the field programmable gate array are limited, the loops cannot be fully unrolled for calculation, so the calculation process of the convolution layer is divided into two sub-loops, one of which is shown in the following formula:

The combination <Tm, Tn, Tr, Tc> is the set of segmentation parameters, where Tm, Tn, Tr and Tc are the segment sizes for the output feature map depth, the input feature map depth, and the output feature map width and length, respectively. The other sub-loop is shown in the following formula:
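The two sub-loops can be sketched as follows, assuming the outer sub-loop steps through the layer in segments of size Tr, Tc, Tm and Tn and the inner sub-loop computes one segment. The inner loop names trr, tcc, too, tii, i and j are those used in the data sharing table; the function name, the outer iterator names and the Python form are illustrative assumptions.

```python
def conv_layer_tiled(inp, weights, stride, rows, cols, tm, tn, tr, tc):
    """Convolution layer with its loops cut into two sub-loops.

    rows, cols: output feature map size R x C
    tm, tn:     segment sizes for output depth M and input depth N
    tr, tc:     segment sizes for the two output spatial dimensions
    """
    n_in, m_out, k = len(inp), len(weights), len(weights[0][0])
    out = [[[0] * cols for _ in range(rows)] for _ in range(m_out)]
    # First sub-loop: walk over the segments.
    for row in range(0, rows, tr):
        for col in range(0, cols, tc):
            for to in range(0, m_out, tm):
                for ti in range(0, n_in, tn):
                    # Second sub-loop: compute one segment. min() handles
                    # segments that do not divide the dimension evenly.
                    for trr in range(row, min(row + tr, rows)):
                        for tcc in range(col, min(col + tc, cols)):
                            for too in range(to, min(to + tm, m_out)):
                                for tii in range(ti, min(ti + tn, n_in)):
                                    for i in range(k):
                                        for j in range(k):
                                            out[too][trr][tcc] += (
                                                weights[too][tii][i][j]
                                                * inp[tii][stride * trr + i][stride * tcc + j]
                                            )
    return out
```

For any choice of segment sizes the result equals that of the uncut loop nest, because cutting only changes the order in which the same multiply-add operations are visited.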

3) According to the convolution calculation formula obtained in step 1), the data sharing relation between different loop iterations falls into three types: irrelevant, independent and dependent.

I) Irrelevant. If loop iterator i_k does not appear in any access function of array A, the corresponding loop dimension is irrelevant to array A.

II) Independent. If the union of the data space accessed on array A is completely separable along loop dimension i_k, or, for any two distinct values p₁ and p₂, the data accessed with i_k = p₁ is disjoint from the data accessed with i_k = p₂, then loop dimension i_k is independent of array A.

III) Dependent. If the union of the data space accessed on array A cannot be separated along loop dimension i_k, then loop dimension i_k is dependent on array A.

The data sharing relations are shown in the table below.

Loop	Input IN	Weight W	Output OUT
trr	dependent	irrelevant	independent
tcc	dependent	irrelevant	independent
too	irrelevant	independent	independent
tii	independent	independent	irrelevant
i	dependent	independent	irrelevant
j	dependent	independent	irrelevant

From a hardware implementation perspective, an independent data sharing relation generates direct connections between buffers and computing modules, an irrelevant data sharing relation generates broadcast connections, and a dependent data sharing relation generates interconnections with complex topologies;
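The three relations can be checked by brute force on a small example. The sketch below labels a loop iterator as irrelevant when changing it never changes the data touched (a numerical stand-in for "does not appear in the access function"), independent when different values touch disjoint data, and dependent otherwise. The access functions IN[tii][S·trr+i][S·tcc+j], W[too][tii][i][j] and OUT[too][trr][tcc], the values S = 1 and K = 3, the loop ranges and all Python names are assumptions chosen for illustration.

```python
from itertools import product


def classify_sharing(access, loop_ranges, iterator):
    """Classify `iterator` against the array addressed by `access`.

    access:      maps a dict of iterator values to an array index tuple
    loop_ranges: dict of iterator name -> range of values
    Returns "irrelevant", "independent" or "dependent".
    """
    others = [name for name in loop_ranges if name != iterator]
    footprints = []
    for value in loop_ranges[iterator]:
        touched = set()
        for combo in product(*(loop_ranges[name] for name in others)):
            point = dict(zip(others, combo))
            point[iterator] = value
            touched.add(access(point))
        footprints.append(touched)
    if all(fp == footprints[0] for fp in footprints):
        return "irrelevant"  # the iterator never changes the data touched
    pairs = [(a, b) for idx, a in enumerate(footprints) for b in footprints[idx + 1:]]
    if all(a.isdisjoint(b) for a, b in pairs):
        return "independent"  # different values touch disjoint data
    return "dependent"


S, K = 1, 3  # assumed stride and kernel size (S smaller than K)
RANGES = {"trr": range(4), "tcc": range(4), "too": range(2),
          "tii": range(2), "i": range(K), "j": range(K)}
ACCESS = {
    "IN": lambda p: (p["tii"], S * p["trr"] + p["i"], S * p["tcc"] + p["j"]),
    "W": lambda p: (p["too"], p["tii"], p["i"], p["j"]),
    "OUT": lambda p: (p["too"], p["trr"], p["tcc"]),
}
TABLE = {it: {arr: classify_sharing(fn, RANGES, it) for arr, fn in ACCESS.items()}
         for it in RANGES}

for it, row in TABLE.items():
    print(it, row["IN"], row["W"], row["OUT"])
```

With a stride smaller than the kernel size, neighbouring windows overlap, which is why trr, tcc, i and j come out as dependent on the input array in this example.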

4) The high-level synthesis tool can optimize the generated hardware structure through compiler directives inserted during the conversion process. One such optimization technique is loop unrolling, which converts sequentially executed loop iterations into parallel operations and thereby increases the operation speed. Another key optimization technique, loop pipelining, improves system throughput by overlapping the execution of operations from different loop iterations.

The sub-loops obtained after the segmentation in step 2) are optimized as follows: the inner loops are first rearranged according to the data sharing relations obtained from the analysis, the innermost loops are then unrolled, and loop pipelining is added at the same time to improve the throughput of the system. The optimized calculation process is shown as follows:

where F(x) represents loop unrolling and L(x) represents loop pipelining;
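That the inner loops may be rearranged without changing the result can be illustrated with a small model of one segment. In the sketch below the nesting order is a parameter; the rearranged order places too and tii innermost, where an unrolling directive could be applied in a C-based high-level synthesis flow. That particular order, the function name and the data layout are assumptions for illustration and are not taken from the optimized formula.

```python
from itertools import product

LOOPS = ("trr", "tcc", "too", "tii", "i", "j")        # order after cutting
REORDERED = ("i", "j", "trr", "tcc", "too", "tii")    # assumed rearrangement


def tile_compute(inp, weights, stride, sizes, order=LOOPS):
    """Compute one segment with the inner loops nested in `order`.

    sizes: dict of loop name -> trip count (Tr, Tc, Tm, Tn, K, K)
    """
    out = [[[0] * sizes["tcc"] for _ in range(sizes["trr"])]
           for _ in range(sizes["too"])]
    # product() varies its last argument fastest, so order[-1] plays the
    # role of the innermost loop and order[0] that of the outermost loop.
    for values in product(*(range(sizes[name]) for name in order)):
        p = dict(zip(order, values))
        out[p["too"]][p["trr"]][p["tcc"]] += (
            weights[p["too"]][p["tii"]][p["i"]][p["j"]]
            * inp[p["tii"]][stride * p["trr"] + p["i"]][stride * p["tcc"] + p["j"]]
        )
    return out
```

With integer data the two orders give identical outputs, since each output element receives the same set of products; with floating-point data the results may differ in the last bits because the summation order changes.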

5) Because the resources of the field programmable gate array are limited, the resource usage of the optimized calculation process must be evaluated after the segmentation and rearrangement of steps 2) to 4). The simulation tool of the high-level synthesis tool is used to generate a synthesis report, which includes the proportion of resources used by the calculation process;

the reported resource proportion is compared with the resource constraints to judge whether the result is optimal under the current resource constraints; if not, the segmentation parameters or the rearrangement order are modified, and steps 2), 3) and 4) are repeated;
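The iteration over segmentation parameters can be sketched as a small search. In the method the resource figures come from the synthesis report; the sketch below substitutes a hypothetical analytical model in which unrolling the too and tii loops costs Tm × Tn multiply-accumulate units and the remaining loops run one iteration per cycle. The function name, the `dsp_per_mac` parameter and both estimates are assumptions for illustration only.

```python
from math import ceil


def explore_tiles(m_out, n_in, rows, cols, k, dsp_budget, dsp_per_mac=1):
    """Pick (Tm, Tn) with the fewest estimated cycles within a DSP budget.

    Returns (cycles, dsp, tm, tn), or None when no candidate fits.
    """
    best = None
    for tm in range(1, m_out + 1):
        for tn in range(1, n_in + 1):
            # Hypothetical resource model: one multiply-accumulate unit
            # per unrolled (too, tii) pair.
            dsp = tm * tn * dsp_per_mac
            if dsp > dsp_budget:
                continue  # violates the resource constraint
            # Hypothetical cycle model: the unrolled loops take one cycle,
            # every other loop iteration takes one pipelined cycle.
            cycles = ceil(m_out / tm) * ceil(n_in / tn) * rows * cols * k * k
            candidate = (cycles, dsp, tm, tn)
            if best is None or candidate < best:
                best = candidate
    return best


print(explore_tiles(m_out=4, n_in=2, rows=3, cols=3, k=3, dsp_budget=4))
```

In an actual flow, each candidate would be synthesized and the estimate replaced by the reported resource proportion before the comparison with the constraint.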

6) Instantiating the convolution operation produced by step 5) with the high-level synthesis tool, converting the C language into the Verilog language, generating a register transfer level circuit, and generating the corresponding functional module.

The foregoing embodiments are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to these examples; any other modification, substitution, combination or simplification made without departing from the spirit and principles of the present invention shall be regarded as an equivalent replacement and is included in the scope of protection of the present invention.

Claims

1. A method for dynamically optimizing a field programmable gate array convolutional layer based on cyclic cutting and reordering, the method comprising the steps of:

6) Instantiating the convolution operation produced by step 5) with the high-level synthesis tool, converting the C language into the Verilog language, generating a register transfer level circuit, and generating the corresponding convolution layer functional module;

2. The method for dynamically optimizing a convolutional layer of a field programmable gate array based on cyclic cutting and rearrangement according to claim 1, wherein in step 1), the convolution layer receives N feature maps of size W×H as input, each input feature map is convolved with a sliding window of size K×K whose stride S is smaller than K, and the N input feature maps together produce M output feature maps of size R×C, the formula being as follows:

3. The method for dynamically optimizing a field programmable gate array convolution layer based on cyclic cutting and rearrangement according to claim 1 or 2, wherein in step 3), according to the convolution calculation formula obtained in step 1), the data sharing relation between different loop iterations falls into three types: irrelevant, independent and dependent;

II) Independent: if the union of the data space accessed on array A is completely separable along loop dimension i_k, or, for any two distinct values p₁ and p₂, the data accessed with i_k = p₁ is disjoint from the data accessed with i_k = p₂, then loop dimension i_k is independent of array A;

the data sharing relation between trr and the input IN, the weight W and the output OUT is respectively dependent, irrelevant and independent;

the data sharing relation between tcc and the input IN, the weight W and the output OUT is respectively dependent, irrelevant and independent;

the data sharing relation between too and the input IN, the weight W and the output OUT is respectively irrelevant, independent and independent;

the data sharing relation between tii and the input IN, the weight W and the output OUT is respectively independent, independent and irrelevant;

the data sharing relation between i and the input IN, the weight W and the output OUT is respectively dependent, independent and irrelevant;

the data sharing relation between j and the input IN, the weight W and the output OUT is respectively dependent, independent and irrelevant;

4. The method for dynamically optimizing a convolutional layer of a field programmable gate array based on cyclic cutting and rearrangement according to claim 1 or 2, wherein in step 4), the generated hardware structure is optimized, one optimization technique being loop unrolling and another key optimization technique being loop pipelining, which overlaps the execution of operations from different loop iterations;

where F(x) represents loop unrolling and L(x) represents loop pipelining.