CN111242295B - Method and circuit capable of configuring pooling operator - Google Patents

Method and circuit capable of configuring pooling operator

Info

Publication number
CN111242295B
Authority
CN
China
Prior art keywords
pooling
calculation
data
inputs
average
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010067775.6A
Other languages
Chinese (zh)
Other versions
CN111242295A (en)
Inventor
何虎
张坤宁
赵烁
邓宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202010067775.6A priority Critical patent/CN111242295B/en
Publication of CN111242295A publication Critical patent/CN111242295A/en
Application granted granted Critical
Publication of CN111242295B publication Critical patent/CN111242295B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a method for configuring pooling operators in which an on-chip pooling buffer is placed before the pooling computation: data are written into the pooling buffer in the order in which they are produced by the convolution computation, and are then read out, in the order required by the pooling computation, from the corresponding positions of the buffer. The invention not only saves the time otherwise spent computing pooling on a processor and improves accelerator performance, but also offers good generality.

Description

Method and circuit capable of configuring pooling operator
Technical Field
The invention belongs to the technical field of artificial intelligence, relates to pooling calculation in a deep learning technology, and particularly relates to a method and a circuit capable of configuring pooling operators.
Background
In recent years, CNNs (convolutional neural networks) have played an increasingly important role in computer vision. However, with the development of deep learning and increasingly diverse application scenarios, the number of CNN parameters keeps growing, making it harder to port these networks to hardware. Accelerating CNN computation has therefore become an important research topic.
A mainstream CNN contains convolution layers, batch-normalization layers, activation layers, pooling layers, fully connected layers and so on. The convolution and fully connected layers, which are dominated by multiply-add operations, account for most of the network's computation, so current acceleration work focuses on dedicated hardware structures for multiply-add, shortening the computation time of these two layer types. A CNN accelerator typically partitions the input data into blocks stored in on-chip buffers and computes over them in a loop, combining data reuse with parallel computation to raise the efficiency of the CNN computation. On top of this design idea, accelerator performance can be optimized further.
In image processing, pooling divides the image feature map into blocks of size N x N and applies an operation to each block to produce one output, thereby reducing the scale of the image data. Pooling is mainly divided into maximum pooling and average pooling: maximum pooling outputs the largest value in the block, while average pooling outputs the mean of all values in the block. Average pooling reduces the increase in estimation variance caused by the limited neighborhood size and preserves more of the image background; maximum pooling reduces the shift of the estimated mean caused by convolution-layer parameter errors and preserves more texture information. The two pooling modes are therefore often used together when designing a convolutional neural network. A common application of average pooling is global average pooling, which replaces the fully connected layer: it performs dimensionality reduction directly and, more importantly, greatly reduces the number of network parameters.
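As a concrete illustration of the two pooling modes, the following is a minimal Python sketch (illustrative only, not part of the claimed hardware), assuming a square feature map whose side length is divisible by the block size N:

    def pool(feature_map, n, mode="max"):
        """Divide a square feature map into n x n blocks and pool each block.

        feature_map: list of row lists; its side length must be divisible by n.
        mode: "max" for maximum pooling, "avg" for average pooling.
        """
        size = len(feature_map)
        out = []
        for r in range(0, size, n):
            out_row = []
            for c in range(0, size, n):
                block = [feature_map[r + i][c + j] for i in range(n) for j in range(n)]
                out_row.append(max(block) if mode == "max" else sum(block) / len(block))
            out.append(out_row)
        return out

    # Example: a 4 x 4 map pooled with 2 x 2 blocks gives a 2 x 2 output
    fm = [[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12],
          [13, 14, 15, 16]]
    print(pool(fm, 2, "max"))   # [[6, 8], [14, 16]]
    print(pool(fm, 2, "avg"))   # [[3.5, 5.5], [11.5, 13.5]]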
Pooling layers in a CNN progressively reduce the spatial size of the data, so the parameter count and computation also drop, which suppresses overfitting to some extent. Current CNN accelerators concentrate on convolution and leave pooling to the processor; but as the convolution time is optimized, pooling accounts for an ever higher share of the total computation time.
Disclosure of Invention
To overcome the drawbacks of the prior art, the present invention provides a method and a circuit with a configurable pooling operator. By designing a hardware structure dedicated to pooling, the invention saves the time spent computing pooling on a processor and improves accelerator performance; it also offers good generality, supporting both maximum pooling and average pooling as well as different block sizes, which plays an important role in improving the overall performance of a CNN accelerator.
To achieve this purpose, the invention adopts the following technical solution:
a method capable of configuring pooling operators is characterized in that a pooling cache is arranged to convert the data from the arrangement sequence of convolution calculation into the sequence convenient for pooling calculation, and a basic pooling calculation module is designed and multiplexed to support pooling calculation under any block size.
Specifically, a pooling buffer on a slice may be set before pooling calculation, data is stored into the pooling buffer according to the arrangement order in the convolution calculation, and then data is sequentially fetched from the corresponding positions of the pooling buffer according to the order of the pooling calculation for calculation, wherein the result of the convolution calculation is arranged in rows or columns, and the data of one pooling calculation is from blocks spanning different rows and columns.
For maximum pooling, a number of four-input maximum-pooling modules matching the actual computation parallelism are deployed. Each module contains three two-input comparators: the four inputs of the module feed two comparators, whose two outputs feed the third comparator, thus taking the maximum of the four inputs. Data are sent to the modules in the order they are fetched, completing the parallel pooling of several blocks;
in the maximum-pooling computation, the comparator compares two signed numbers as follows: first check whether the sign bits agree; if they do, directly compare the remaining magnitude bits; if they do not, output the number whose sign bit is 0;
for average pooling, a number of four-input average-pooling modules matching the actual computation parallelism are deployed. Each module contains three two-input adders: the four inputs of the module feed two adders, whose two outputs feed the third adder, thus summing the four inputs; the output of the third adder is sent to a shifter, and shifting two bits to the right produces the average.
Further, in the average-pooling computation, the average-pooling module is reused several times and an accumulator is placed between the output of the third adder and the input of the shifter, which realizes global average pooling; global average pooling of any size is realized by changing the number of times the module is reused.
Further, several pieces of pooled data are spliced into one piece of data of the same size as before pooling and then stored into the output buffer, ensuring that the input block size is the same for every layer of the network.
Further, according to the actual computation requirement, the data output by the preceding computation module can be stored into the output buffer and simultaneously enter the pooling buffer for pooling, so that pooled and non-pooled results are obtained at the same time.
Further, the controller of the pooling unit stores the convolution results into the pooling buffer in order and then outputs, one pooling block at a time, the address of the data at the top-left position of the block in the pooling buffer; after the pooling computation finishes, it performs data splicing and gives the address of the result data in the output buffer.
The invention also provides a circuit capable of configuring the pooling operator, which comprises:
an on-chip pooling buffer, which stores data in the order arranged by the convolution computation and then reads out the data at the corresponding positions in the order of the pooling computation;
and a pooling computation module, which receives the data output by the pooling buffer and performs the computation.
The pooling calculation module includes a maximum pooling calculation module and an average pooling calculation module, wherein:
the maximum pooling computing modules are the same as the pooling blocks in number, each maximum pooling computing module is provided with four inputs, each maximum pooling computing module comprises three comparators with two inputs, the four inputs of the maximum pooling computing module are used as the four inputs of the two comparators, the two outputs of the two comparators are used as the inputs of the third comparator, the operation of taking the maximum value from the four inputs is completed, data are sent to each maximum pooling computing module according to the sequence of the taken data, and the parallel pooling computing of the blocks is completed;
the average pooling calculation module has the same number with the pooling blocks, each average pooling calculation module has four inputs, each average pooling calculation module comprises three adders with two inputs, the four inputs of the average pooling calculation module are used as the four inputs of the two adders, the two outputs of the two adders are used as the inputs of the third adder to complete the addition operation of the four inputs, the output of the third adder is sent to a shifter, and the operation of taking the average number is realized by shifting the output of the third adder by two bits to the right.
Further, the average-pooling module is reused several times and an accumulator is placed between the output of the third adder and the input of the shifter, which realizes global average pooling; global average pooling of any size is realized by changing the number of times the average-pooling module is reused.
Further, the circuit with a configurable pooling operator of the present invention also comprises:
a controller, which stores the convolution results into the pooling buffer in order, then outputs, one pooling block at a time, the address of the data at the top-left position of the block in the pooling buffer, and, after the pooling computation finishes, performs data splicing and gives the address of the result data in the output buffer; the controller also gives the signals indicating whether pooling is performed and the type of pooling.
Compared with the prior art, the invention has the following beneficial effects:
the pooling calculation module designed above is used in the CNN accelerator, so that the time for pooling calculation by using a processor can be saved. Under the condition of no pooling module, a plurality of clock cycles are spent to store the data which completes the convolution and the activation operation into an output buffer; after the pooling module is added, the data is stored in the pooling cache first, and the time spent on storing the data in the pooling cache is the same as that spent on storing the data in the output cache. Under the pipeline design of the accelerator, compared with the original method of directly storing the pooled calculation into the output buffer, the pooled calculation has almost no additional time overhead. In addition, taking the pooled block size of 2 × 2 as an example, compared with the result obtained without pooling, although the data volume transferred to the memory by the output buffer is equal under the stitching policy, the transfer times are one fourth of the original times, thereby reducing the data transfer time. This improves the overall performance of the CNN accelerator to some extent.
Moreover, in terms of generality, maximum pooling with any block size can be realized simply by reusing the four-input maximum module several times, and global average pooling of any size can be realized by reusing the four-input average module with an added accumulator. The pooling module is therefore flexible and general and suits the computation of today's mainstream pooling layers.
Drawings
Fig. 1 is a schematic diagram of the computation order of convolution and pooling.
Fig. 2 is a schematic diagram of where the convolution results are stored in the buffer and how data are fetched before the pooling computation.
Fig. 3 is a schematic diagram of the structure of the maximum-pooling computation module.
Fig. 4 is a schematic diagram of the structure of the average-pooling computation module.
Fig. 5 is a schematic diagram of how the average-pooling computation module is reused.
Fig. 6 is a schematic diagram of the pooling splicing strategy.
Fig. 7 is a schematic diagram of the overall structure of the pooling module in an embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the drawings and examples.
As shown in Fig. 1, the scheme and principle of the method and circuit with a configurable pooling operator of the present invention are as follows:
A basic strategy in CNN accelerated computation is to block the data: both the input data transferred from memory to the input buffer and the results transferred back from the output buffer to memory are handled in blocks. Pooling follows the convolution and activation computations, but the block of data that one pooling operation needs is not contiguous in the convolution output order. Fig. 1 shows the computation order of convolution and pooling respectively (with a block size of 10 x 10 and a pooling size of 2 x 2 as an example). The convolution results are arranged by rows, whereas the data of one pooling computation come from a block that spans different rows, and this mismatch is the first problem the pooling hardware design must solve.
The solution is to set an on-chip buffer as the pooling buffer before the pooling computation, store data into it in the order arranged by the convolution computation, and then fetch data from the corresponding positions in the order of the pooling computation. Fig. 2 shows where the convolution results are stored in the buffer and how the data are fetched before the pooling computation. The controller of the pooling unit outputs, one pooling block at a time, the address of the data at the top-left position of the block, i.e. 0, 2, 4, 6, 8, 20, …, 80, 82, 84, 86, 88 in the order shown in the figure; the on-chip buffer read-write module then receives each address and outputs the addresses of the four data of the block, e.g. 0, 1 (0 + 1), 10 (0 + 10) and 11 (0 + 11), ensuring that the data required by each pooling computation are output accurately and in order.
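As a concrete illustration of this address sequence, the following is a behavioural Python sketch under the 10 x 10 block / 2 x 2 pooling assumption of Fig. 2 (an illustration only, not the hardware itself):

    BLOCK_SIDE = 10   # side length of one convolution output block (assumed, as in Fig. 2)
    POOL_SIDE = 2     # side length of one pooling block

    def top_left_addresses():
        """Addresses of the top-left element of every pooling block, in pooling order."""
        for row in range(0, BLOCK_SIDE, POOL_SIDE):
            for col in range(0, BLOCK_SIDE, POOL_SIDE):
                yield row * BLOCK_SIDE + col

    def block_addresses(top_left):
        """The four buffer addresses read for one 2 x 2 pooling block."""
        return [top_left, top_left + 1, top_left + BLOCK_SIDE, top_left + BLOCK_SIDE + 1]

    print(list(top_left_addresses())[:6])   # [0, 2, 4, 6, 8, 20]
    print(block_addresses(0))               # [0, 1, 10, 11]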
Pooling computation takes two forms: average pooling and maximum pooling.
The most common form of maximum pooling today divides the image into 2 x 2 blocks every 2 elements and takes the maximum of the 4 numbers in each block. The invention therefore uses a two-input comparator to build a four-input maximum-pooling module: it instantiates three two-input comparators and takes the maximum of 4 numbers. Several such modules can then be deployed according to the actual requirements, and data are sent to the pooling modules in fetch order, completing the parallel pooling of several blocks. The hardware structure is shown in Fig. 3.
The comparator compares two signed numbers as follows: first check whether the sign bits agree; if they do, directly compare the remaining magnitude bits; if they do not, output the number whose sign bit is 0.
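The following Python sketch models this comparison logic and the three-comparator maximum module of Fig. 3 (illustrative only; the 16-bit two's-complement data width is an assumption, not stated in the description):

    WIDTH = 16  # assumed data width in bits (not specified in the description)

    def max2(a, b):
        """Two-input comparator: compare sign bits first, then the remaining bits.

        a, b are WIDTH-bit two's-complement bit patterns (non-negative ints < 2**WIDTH).
        """
        sign_a, sign_b = a >> (WIDTH - 1), b >> (WIDTH - 1)
        if sign_a == sign_b:
            # Same sign: the larger pattern of the remaining bits is the larger value
            return a if (a & (2**(WIDTH - 1) - 1)) >= (b & (2**(WIDTH - 1) - 1)) else b
        # Different signs: the number whose sign bit is 0 (non-negative) is the maximum
        return a if sign_a == 0 else b

    def max4(x0, x1, x2, x3):
        """Four-input maximum built from three two-input comparators."""
        return max2(max2(x0, x1), max2(x2, x3))

    def to_bits(v):
        """Encode a Python int as a WIDTH-bit two's-complement pattern (helper for the demo)."""
        return v & (2**WIDTH - 1)

    print(max4(to_bits(-7), to_bits(3), to_bits(-1), to_bits(2)) == to_bits(3))  # True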
Average pooling divides the image into N x N blocks every N elements and then averages the N² numbers in each block. The case N = 2 is taken as the basic average-pooling module: a two-input adder is designed first, and the average-pooling module instantiates three such adders; the sum of the four numbers is finally sent to a shifter, and shifting two bits to the right produces the average. Fig. 4 shows the hardware structure of the average-pooling module.
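A corresponding behavioural sketch in Python of the three-adder, shift-by-two average module of Fig. 4 (again illustrative; real hardware would use fixed-width adders and an arithmetic right shift):

    def avg4(x0, x1, x2, x3):
        """Four-input average: two adders feed a third adder, then shift right by two bits."""
        s01 = x0 + x1          # first two-input adder
        s23 = x2 + x3          # second two-input adder
        total = s01 + s23      # third adder
        return total >> 2      # divide by 4 via a two-bit right shift

    print(avg4(4, 8, 12, 16))  # 10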
Global average pooling is often used in CNNs. The pooling block in this mode is usually large: for an input image of size 224 x 224 x 3, the global average pooling size is typically 7 x 7. This case can be handled by reusing the average-pooling module above several times and adding an accumulator; Fig. 5 shows the reuse scheme. If the input feature-map size changes, the global-average-pooling size changes as well, and global average pooling of any size is realized simply by changing the number of times the basic module is reused.
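A simplified Python sketch of this reuse-plus-accumulator scheme follows. The grouping of inputs into fours and the final division over the whole window are assumptions made for illustration; the description itself only specifies the accumulator placed between the third adder and the shifter:

    def global_average_pool(values):
        """Global average pooling by reusing the four-input adder tree.

        values: flat list of all elements of one feature-map channel.  The adder tree
        of the 2 x 2 average module is reused once per group of four inputs, and an
        accumulator keeps the running sum.  The length is assumed to be a multiple of
        four, and the final division over the full window is a simplification.
        """
        acc = 0
        for i in range(0, len(values), 4):
            x0, x1, x2, x3 = values[i:i + 4]
            acc += (x0 + x1) + (x2 + x3)   # one pass through the reused adder tree
        return acc // len(values)          # final averaging over the whole window

    # A 4 x 4 channel pooled down to a single value
    channel = list(range(16))              # 0..15, mean 7.5 -> 7 with integer division
    print(global_average_pool(channel))    # 7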
In addition, the pooled result is smaller than before the computation. With a block size of 2 x 2 and a stride of 2, for example, the amount of data after pooling becomes one quarter of the original. If this result were stored directly into the output buffer, the input block of the next network layer would be one quarter the size of the previous layer's input block; since a CNN often has several pooling layers, this is clearly unfavourable for subsequent computation. An integration strategy is therefore proposed: four pieces of pooled data are spliced into one piece of data of the same size as before pooling and then stored into the output buffer, which guarantees that the input block size of every network layer is the same. Fig. 6 shows the specific splicing method.
Fig. 6 uses block data of size 10 x 10 as an example; the numbers in the boxes indicate the correspondence between each block before pooling and the address at which its result is stored in the output buffer after pooling. As the figure shows, the 25 results of the first block correspond to addresses 0-4, 10-14, 20-24, 30-34 and 40-44 in the output buffer, those of the second block to addresses 5-9, 15-19, 25-29, 35-39 and 45-49, and so on; the four pooled results of size 25 are thus re-assembled into one block of size 100 for the next layer of the network to compute on, avoiding the problem of the next layer's input size being changed by the pooling operation. The corresponding addresses are output by the controller of the pooling unit.
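A Python sketch of this address mapping follows. The placement of the third and fourth pooled blocks in the lower half of the output buffer is inferred by analogy, since the description lists only the first two blocks explicitly:

    OUT_SIDE = 10    # side of the re-assembled (spliced) block
    POOLED_SIDE = 5  # side of one pooled result (a 10 x 10 block pooled with 2 x 2 blocks)

    def spliced_address(block_index, row, col):
        """Output-buffer address of pooled element (row, col) of pooled block block_index.

        block_index 0..3 selects the quadrant: 0 top-left, 1 top-right,
        2 bottom-left, 3 bottom-right (quadrants 2 and 3 are inferred, see text).
        """
        base = (block_index // 2) * OUT_SIDE * POOLED_SIDE + (block_index % 2) * POOLED_SIDE
        return base + row * OUT_SIDE + col

    # First pooled block occupies addresses 0-4, 10-14, ..., 40-44
    print([spliced_address(0, 0, c) for c in range(5)])   # [0, 1, 2, 3, 4]
    # Second pooled block occupies addresses 5-9, 15-19, ..., 45-49
    print([spliced_address(1, 0, c) for c in range(5)])   # [5, 6, 7, 8, 9]
    print(spliced_address(1, 4, 4))                        # 49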
In the CNN accelerator design, a selector is placed in front of the pooling module to control whether the output of the preceding computation module needs pooling. The design also supports computing pooled and non-pooled results of the same data simultaneously: in that mode, the data output by the preceding computation module is stored into one on-chip output buffer and at the same time enters the pooling unit for computation. The two identical output buffers, which originally work in ping-pong fashion, are then used differently: one temporarily holds the non-pooled data and the other holds the pooled results, i.e. they no longer operate in ping-pong mode.
None of the above can work without the controller's control of data reads and writes and of the computation modules. The controller first stores the convolution results into the pooling buffer in order, then outputs, one pooling block at a time, the address of the data at the top-left position of the block in the pooling buffer; after the pooling computation finishes, it gives the address of the result data in the output buffer according to the splicing strategy. The controller also gives the signals indicating whether pooling is performed and the type of pooling, and is therefore key to ensuring that the pooling computation proceeds in order.
Fig. 7 is the complete structural diagram of the whole pooling module. In a specific embodiment of the present invention, an FPGA development board is used as the implementation platform of the CNN accelerator; the complete VGG16 and SSD300 networks are run, and the accelerator's performance is compared between computing pooling in software on the processor and computing it with the hardware described here. Both networks have five pooling layers, essentially all with a block size of 2 x 2 and a stride of 2, so their pooling computations can be completed entirely by the pooling module above.
In this embodiment, the input block sizes of each layer of the VGG16 and SSD300 networks are set to 9 x 16 and 12 x 16, respectively. Completing the pooling computations of the two networks on the ARM processor of the FPGA board takes about 23 ms and 35 ms respectively; performing them with the pooling module saves this time entirely and also reduces some data-transfer time. The theoretical computation time the accelerator needs for one inference pass of each network is about 200 ms and 400 ms, so using the pooling module shortens the inference time of one image by about 11.5% and 8.75% respectively. As the computing power of the convolution unit and the accelerator technology keep improving, the optimization brought by the pooling module will become ever more evident, so the hardware implementation of pooling plays a very visible role in improving accelerator performance.

Claims (5)

1. A method with a configurable pooling operator, characterized in that a pooling buffer is provided to convert the data from the order produced by the convolution computation into an order convenient for the pooling computation, and a basic pooling computation module is designed and reused to support pooling computation with any block size;
for maximum pooling, a number of four-input maximum-pooling computation modules matching the actual computation parallelism are deployed, each containing three two-input comparators, wherein the four inputs of the module serve as the four inputs of two comparators and the two outputs of those comparators serve as the inputs of the third comparator, completing the operation of taking the maximum of the four inputs; data are sent into each maximum-pooling computation module in the order they are fetched, completing the parallel pooling of several blocks;
for average pooling, a number of four-input average-pooling computation modules matching the actual computation parallelism are deployed, each containing three two-input adders, wherein the four inputs of the module serve as the four inputs of two adders and the two outputs of those adders serve as the inputs of the third adder, completing the addition of the four inputs; the output of the third adder is sent to a shifter, and shifting two bits to the right produces the average;
in the maximum-pooling computation, the comparator compares two signed numbers by first checking whether the sign bits agree; if they agree, the remaining magnitude bits are compared directly; if they do not agree, the number whose sign bit is 0 is taken as the output;
in the average-pooling computation, the average-pooling computation module is reused several times and an accumulator is provided between the output of the third adder and the input of the shifter, realizing global average pooling; global average pooling of any size is realized by changing the number of times the average-pooling computation module is reused;
and several pieces of pooled data are spliced into one piece of data of the same size as before pooling and stored into the output buffer, ensuring that the input block size of every layer of the network is the same.
2. The method of claim 1, characterized in that it supports computing pooled and non-pooled results of the data simultaneously: the data output by the preceding computation module is stored into the output buffer and also enters the pooling buffer for pooling computation.
3. The method of claim 1, characterized in that the controller of the pooling unit first stores the convolution calculation results into the pooling buffer in order and then outputs, one pooling block at a time, the address of the data at the top-left position of the block in the pooling buffer; after the pooling computation finishes, data splicing is performed and the address of the result data in the output buffer is given.
4. A circuit with a configurable pooling operator, comprising:
an on-chip pooling buffer, which stores data in the order arranged by the convolution computation and then reads out the data at the corresponding positions in the order of the pooling computation;
a pooling computation module, which receives the data output by the pooling buffer and performs the computation;
the pooling computation module includes a maximum-pooling computation module and an average-pooling computation module, wherein:
the number of instantiated maximum-pooling computation modules matches the computation parallelism; each maximum-pooling computation module has four inputs and contains three two-input comparators, the four inputs of the module serving as the four inputs of two comparators and the two outputs of those comparators serving as the inputs of the third comparator, completing the operation of taking the maximum of the four inputs; data are sent to each maximum-pooling computation module in the order they are fetched, completing the parallel pooling of several blocks;
the number of instantiated average-pooling computation modules matches the computation parallelism; each average-pooling computation module has four inputs and contains three two-input adders, the four inputs of the module serving as the four inputs of two adders and the two outputs of those adders serving as the inputs of the third adder, completing the addition of the four inputs; the output of the third adder is sent to a shifter, and shifting two bits to the right produces the average;
in the maximum-pooling computation module, the comparator compares two signed numbers by first checking whether the sign bits agree; if they agree, the remaining magnitude bits are compared directly; if they do not agree, the number whose sign bit is 0 is taken as the output;
in the average-pooling computation module, the average-pooling computation module is reused several times and an accumulator is provided between the output of the third adder and the input of the shifter, realizing global average pooling; global average pooling of any size is realized by changing the number of times the average-pooling computation module is reused;
and several pieces of pooled data are spliced into one piece of data of the same size as before pooling and stored into the output buffer, ensuring that the input block size of every layer of the network is the same.
5. The circuit with a configurable pooling operator of claim 4, further comprising:
a controller, which stores the convolution results into the pooling buffer in order, then outputs, one pooling block at a time, the address of the data at the top-left position of the block in the pooling buffer, and, after the pooling computation finishes, performs data splicing and gives the address of the result data in the output buffer; the controller also gives signals indicating whether pooling is performed and the type of pooling.
CN202010067775.6A 2020-01-20 2020-01-20 Method and circuit capable of configuring pooling operator Active CN111242295B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010067775.6A CN111242295B (en) 2020-01-20 2020-01-20 Method and circuit capable of configuring pooling operator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010067775.6A CN111242295B (en) 2020-01-20 2020-01-20 Method and circuit capable of configuring pooling operator

Publications (2)

Publication Number Publication Date
CN111242295A CN111242295A (en) 2020-06-05
CN111242295B true CN111242295B (en) 2022-11-25

Family

ID=70872899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010067775.6A Active CN111242295B (en) 2020-01-20 2020-01-20 Method and circuit capable of configuring pooling operator

Country Status (1)

Country Link
CN (1) CN111242295B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753974B (en) * 2020-06-22 2024-10-15 深圳鲲云信息科技有限公司 Neural network accelerator
CN113255897B (en) * 2021-06-11 2023-07-07 西安微电子技术研究所 Pooling calculation unit of convolutional neural network

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator
CN106875012B (en) * 2017-02-09 2019-09-20 武汉魅瞳科技有限公司 A kind of streamlined acceleration system of the depth convolutional neural networks based on FPGA
US20190205738A1 (en) * 2018-01-04 2019-07-04 Tesla, Inc. Systems and methods for hardware-based pooling
CN109032781A (en) * 2018-07-13 2018-12-18 重庆邮电大学 A kind of FPGA parallel system of convolutional neural networks algorithm
CN109784489B (en) * 2019-01-16 2021-07-30 北京大学软件与微电子学院 Convolutional neural network IP core based on FPGA
CN110390385B (en) * 2019-06-28 2021-09-28 东南大学 BNRP-based configurable parallel general convolutional neural network accelerator

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RAPIDNN: In-Memory Deep Neural Network Acceleration Framework; Mohsen Imani et al.; arXiv:1806.05794v4; 2019-04-11; entire document *
Design of an FPGA accelerator architecture for convolutional neural networks (面向卷积神经网络的FPGA加速器架构设计); Li Bingjian et al.; Journal of Frontiers of Computer Science and Technology (计算机科学与探索); 2019-06-17; pp. 437-448 *

Also Published As

Publication number Publication date
CN111242295A (en) 2020-06-05

Similar Documents

Publication Publication Date Title
CN111667051B (en) Neural network accelerator applicable to edge equipment and neural network acceleration calculation method
CN111242289B (en) Convolutional neural network acceleration system and method with expandable scale
EP3757901A1 (en) Schedule-aware tensor distribution module
CN107301455B (en) Hybrid cube storage system for convolutional neural network and accelerated computing method
CN111242295B (en) Method and circuit capable of configuring pooling operator
CN109409510B (en) Neuron circuit, chip, system and method thereof, and storage medium
CN111178519A (en) Convolutional neural network acceleration engine, convolutional neural network acceleration system and method
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN110222818B (en) Multi-bank row-column interleaving read-write method for convolutional neural network data storage
CN111915001A (en) Convolution calculation engine, artificial intelligence chip and data processing method
CN109993293B (en) Deep learning accelerator suitable for heap hourglass network
CN111768458A (en) Sparse image processing method based on convolutional neural network
CN116167424B (en) CIM-based neural network accelerator, CIM-based neural network accelerator method, CIM-based neural network storage processing system and CIM-based neural network storage processing equipment
CN118012628B (en) Data processing method, device and storage medium
CN115238879A (en) Architecture search method of deep neural network and hardware accelerator
CN115982418B (en) Method for improving super-division operation performance of AI (advanced technology attachment) computing chip
CN112988621A (en) Data loading device and method for tensor data
CN112001492A (en) Mixed flow type acceleration framework and acceleration method for binary weight Densenet model
Hu et al. High-performance reconfigurable DNN accelerator on a bandwidth-limited embedded system
CN113128688B (en) General AI parallel reasoning acceleration structure and reasoning equipment
CN113448624B (en) Data access method, device, system and AI accelerator
CN112862079A (en) Design method of flow type convolution calculation architecture and residual error network acceleration system
CN113095024A (en) Regional parallel loading device and method for tensor data
CN110766150A (en) Regional parallel data loading device and method in deep convolutional neural network hardware accelerator
CN114936636A (en) General lightweight convolutional neural network acceleration method based on FPGA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant