CN111191780B - Averaging pooling accumulation circuit, device and method

Info

Publication number: CN111191780B
Application number: CN202010006439.0A
Authority: CN
Inventors: 郑旭标
Original assignee: Zhuhai Eeasy Electronic Tech Co ltd
Current assignee: Zhuhai Eeasy Electronic Tech Co ltd
Priority date: 2020-01-03
Filing date: 2020-01-03
Publication date: 2024-03-19
Anticipated expiration: 2040-01-03
Also published as: CN111191780A

Abstract

The invention discloses an average pooling accumulation circuit, device and method. The average pooling accumulation circuit comprises a dual-port buffer, a write control circuit, a read control circuit, a MUX, an addition circuit, a subtraction circuit, an output control circuit and an accumulation buffer. The average pooling method comprises the following steps: dividing the input feature data into BLK blocks; defining two buffer arrays in the internal buffer, and storing the first-dimension accumulation result in one of the buffer arrays according to a ping-pong storage strategy; reading the first-dimension accumulation result from the buffer array and performing the second-dimension accumulation; and outputting the second-dimension accumulation result to a mean division circuit, which performs the division by SRAM table lookup. The beneficial effect of the invention is that the second-dimension accumulation of a two-dimensional window can be computed with the same one-dimensional accumulation circuit, so that the accumulation circuit is general-purpose.

Description

Averaging pooling accumulation circuit, device and method

Technical Field

The invention relates to the technical field of computer vision and artificial intelligence, and in particular to an average pooling accumulation circuit, device and method.

Background

Convolutional neural networks (CNNs) are increasingly used in image classification and image recognition. A convolutional neural network typically comprises multiple groups of neural network layers such as convolutional layers and pooling layers. A convolutional layer extracts local features of the data, while a pooling layer reduces the number of parameters and the amount of computation in the network. Pooling layers typically use one of two operations: max pooling and average pooling.

Average pooling (also called mean pooling) is generally run on an AI chip to improve operation speed, and different manufacturers have introduced a variety of AI chip software and hardware architectures. However, the average pooling function of these architectures is not very general and cannot adapt to increasingly complex artificial intelligence algorithms.

Disclosure of Invention

To address the above problem, the invention provides an average pooling accumulation circuit, device and method, which mainly solve the problem that the average pooling function of AI chips lacks generality.

To solve the above technical problem, the technical solution of the invention is as follows:

An average pooling accumulation circuit comprises a dual-port buffer, a write control circuit, a read control circuit, a MUX, an addition circuit, a subtraction circuit, an output control unit and an accumulation buffer;

the dual-port buffer is used for buffering the input feature data of the current cycle;

the write control circuit is used for controlling the writing of feature data into the dual-port buffer;

the read control circuit is used for controlling the reading of the feature data stored in the dual-port buffer and for controlling the MUX;

the MUX is used for selecting between the feature data output by the dual-port buffer and the padding data;

the addition circuit is used for receiving the feature data input from the input control unit in each clock cycle and the temporary result of the subtraction circuit, and for inputting the current feature data into the accumulation buffer;

the subtraction circuit is used for performing the subtraction between the feature data output by the MUX and the accumulation buffer;

the output control unit is used for controlling the valid output of the accumulation buffer;

and the accumulation buffer is used for buffering and outputting the accumulated feature data.

An average pooling device is provided, comprising a top-level control circuit, an input control circuit, an output control circuit, the above average pooling accumulation circuit, a BLK unit control circuit and a mean division circuit;

the top-level control circuit is used for control interaction with the system and for controlling the internal average pooling circuits;

the input control circuit is used for receiving the input feature data size and the input data address from the top-level control circuit, and for controlling the input of information such as feature data from external storage and online modules;

the output control circuit is used for receiving the output feature data size and the input data address from the top-level control circuit, and for controlling the output of information such as feature data to external storage and online modules;

the average pooling accumulation circuit is used for obtaining the accumulated feature data;

the BLK unit control circuit is used for BLK block control of the average pooling;

and the mean division circuit is configured with a lookup table for the precision range and performs the division according to the lookup table.

In some embodiments, there are two average pooling accumulation circuits, one for the first-dimension accumulation and the other for the second-dimension accumulation.

An average pooling method, used with the above average pooling device, comprises the following steps:

step one, dividing the input feature data into BLK blocks;

step two, defining two buffer arrays in the internal buffer, and storing the first-dimension accumulation result of the current BLK block in one of the buffer arrays according to a ping-pong storage strategy;

step three, reading the first-dimension accumulation result from the buffer array and performing the second-dimension accumulation;

and step four, outputting the second-dimension accumulation result to the mean division circuit, which performs the division by SRAM table lookup.

In some embodiments, step one is specifically: the channel dimension is partitioned according to a preset cblk_size; the size of the current output BLK is determined to be blk_wout × blk_hout × cblk_size according to the internal buffer size; and the first and second dimensions are partitioned for output according to blk_wout × blk_hout.

In some embodiments, step two is specifically: the buffer arrays use single-port SRAM; the data bit width of the first-dimension (kh or kw) accumulated data is defined as d_size; the depth of each SRAM is blk_wout; the number of accumulated data of consecutive Hout stored for the same Wout is defined as h_div; and the size of a single-port SRAM is defined as cblk_size × h_div × d_size × blk_wout.

In some embodiments, step three is specifically: according to the mapping format in which the first-dimension accumulated data are stored in the buffer array, cblk_size × h_div feature data are read in each clock cycle t0 to tn and the second-dimension accumulation is performed, yielding the accumulation result of each kw × kh window.

In some embodiments, step four is specifically: the table lookup is performed according to the current kernel size (kw × kh), and the lookup range is tied to the maximum kernel size supported by the pooling device.

The beneficial effects of the invention are as follows: by combining the average pooling accumulation circuit with the internal storage method, the second-dimension accumulation of a two-dimensional pooling window can be computed with the same one-dimensional accumulation circuit, so that the accumulation circuit, and hence the average pooling function of the chip, is general-purpose.

Drawings

FIG. 1 is a schematic diagram of an average pooling accumulation circuit according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of an average pooling device according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the block partitioning method for average pooling in an embodiment of the present invention;

FIG. 4 is a schematic diagram of the internal buffer structure for average pooling in an embodiment of the present invention;

FIG. 5 is a schematic diagram of the output pixels of the average pooling accumulation circuit according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of the block pipeline processing of average pooling in an embodiment of the present invention;

FIG. 7 is a schematic flow diagram of the first-dimension BLK output processing of average pooling in an embodiment of the present invention;

FIG. 8 is a schematic diagram of the operation flow of the second-dimension accumulation circuit of average pooling in an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the drawings and the detailed description below, in order to make the objects, technical solutions and advantages of the present invention more clear and distinct. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the matters related to the present invention are shown in the accompanying drawings.

Example 1

Referring to FIG. 1, this embodiment provides an average pooling accumulation circuit 204, which includes a dual-port buffer 101, a write control circuit 102, a read control circuit 103, a MUX 104, an addition circuit 105, a subtraction circuit 106, an output control unit 107 and an accumulation buffer 108;

the dual-port buffer 101 is used for buffering the input feature data of the current cycle;

the write control circuit 102 is used for controlling the writing of feature data into the dual-port buffer 101;

the read control circuit 103 is used for controlling the reading of the feature data stored in the dual-port buffer 101 and for controlling the MUX 104;

the MUX 104 is used for selecting between the feature data output by the dual-port buffer 101 and the padding data;

the addition circuit 105 is used for receiving the feature data input from the input control unit in each clock cycle and the temporary result of the subtraction circuit 106, and for inputting the current feature data into the accumulation buffer 108;

the subtraction circuit 106 is used for performing the subtraction between the feature data output by the MUX 104 and the accumulation buffer 108;

the output control unit 107 is used for controlling the valid output of the accumulation buffer 108;

and the accumulation buffer 108 is used for buffering and outputting the accumulated feature data.

The average pooling accumulation circuit 204 uses a difference-like circuit structure in place of the conventional parallel-adder approach, which saves a large number of adders.
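The saving can be illustrated in software. The following is a minimal sketch, not the circuit itself: it assumes that the difference-like structure behaves as a running sum in which each new sample is added and the sample leaving the window is subtracted. The function names are hypothetical and padding and stride are ignored here.

```python
def window_sums_parallel(data, kernel):
    """Reference: recompute each window from scratch (kernel - 1 additions per output)."""
    return [sum(data[i:i + kernel]) for i in range(len(data) - kernel + 1)]


def window_sums_running(data, kernel):
    """Running sum: one addition and at most one subtraction per input sample."""
    acc = 0          # plays the role of the accumulation buffer 108
    outputs = []
    for i, value in enumerate(data):
        acc += value                    # addition circuit 105: add the new sample
        if i >= kernel:
            acc -= data[i - kernel]     # subtraction circuit 106: drop the oldest sample
        if i >= kernel - 1:
            outputs.append(acc)         # a full window has been accumulated
    return outputs


samples = [3, 1, 4, 1, 5, 9, 2, 6]
assert window_sums_running(samples, 3) == window_sums_parallel(samples, 3)
```

Both functions return the same window sums; the running version needs a constant amount of arithmetic per sample regardless of the kernel size, which is the property the text attributes to the circuit.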

Example 2

Referring to FIG. 2, an average pooling device includes a top-level control circuit 201, an input control circuit 202, an output control circuit 203, an average pooling accumulation circuit 204, a BLK unit control circuit 205 and a mean division circuit 206.

The top-level control circuit 201 is used for control interaction with the system and for controlling the internal average pooling circuits. After receiving the system start signal and the configuration information, the top-level control circuit parses the current configuration information and passes on the parameters required by the average pooling device. These include the input and output information of the current feature data: win (input width), hin (input height), cin (input channel length), wout (output width), hout (output height) and cout (output channel length), as well as the current average pooling kernel size, stride size, pad size, and so on. The top-level control circuit 201 performs control interaction with the input control circuit 202, the output control circuit 203 and the BLK unit control circuit 205 according to the average pooling BLK partitioning method described below, and passes them the start information and the parameter information of the current feature data.

The input control circuit 202 is used for receiving the input feature data size and the input data address from the top-level control circuit, and for controlling the input of information such as feature data from external storage and online modules.

The output control circuit 203 is used for receiving the output feature data size and the input data address from the top-level control circuit, and for controlling the output of information such as feature data to external storage and online modules.

The input control circuit 202 and the output control circuit 203 input and output data in the form of BLK blocks, under the control of the BLK unit control circuit 205.

The average pooling accumulation circuit 204 is used for obtaining the accumulated feature data. There are two average pooling accumulation circuits: one performs the first-dimension accumulation and the other performs the second-dimension accumulation.

The BLK unit control circuit 205 is used for BLK block control of the average pooling.

The mean division circuit 206 is configured with a lookup table for the precision range and performs the division according to the lookup table.
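The parameters handled by the top-level control circuit 201 can be pictured as a small configuration record. The sketch below is illustrative only: the class name PoolConfig is hypothetical, and the output-size relation is the common floor convention for pooling layers, which is an assumption here because the text does not state how wout and hout are derived.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolConfig:
    """Hypothetical container for the parameters named in the description."""
    win: int   # input width
    hin: int   # input height
    cin: int   # input channel length
    kw: int    # kernel width
    kh: int    # kernel height
    sw: int    # stride in the width direction
    sh: int    # stride in the height direction
    pw: int    # padding in the width direction
    ph: int    # padding in the height direction

    @property
    def wout(self) -> int:
        # Assumed convention: floor((in + 2 * pad - kernel) / stride) + 1
        return (self.win + 2 * self.pw - self.kw) // self.sw + 1

    @property
    def hout(self) -> int:
        return (self.hin + 2 * self.ph - self.kh) // self.sh + 1

    @property
    def cout(self) -> int:
        # Pooling acts on each channel independently, so the channel count is unchanged.
        return self.cin


cfg = PoolConfig(win=8, hin=6, cin=16, kw=3, kh=3, sw=1, sh=1, pw=1, ph=1)
assert (cfg.wout, cfg.hout, cfg.cout) == (8, 6, 16)
```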

Example 3

Referring to FIG. 3, an average pooling method, used with the above average pooling device, comprises the following steps:

Step one, the input feature data are divided into BLK blocks. Specifically: as in step 301 in FIG. 3, the channel dimension is partitioned according to a preset cblk_size, where cblk_size >= 1. When cblk_size = 1, the BLK unit processed by the average pooling device of the invention is a planar unit; when cblk_size > 1, the BLK unit is a three-dimensional unit. Step 301 in FIG. 3 indicates that the tensor feature data currently being pooled has width Win and height Hin. As in step 303 in FIG. 3, the parameter information of the pooling operation in this embodiment includes the pooling kernel size kw/kh, the stride sw/sh and the padding size pw/ph in the width and height directions. As in step 304 in FIG. 3, the size of the current output BLK is determined to be blk_wout × blk_hout × cblk_size according to the internal buffer size, and the first and second dimensions are partitioned for output according to blk_wout × blk_hout.
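The partitioning in step one can be sketched as a loop over output blocks. This is a minimal illustration under stated assumptions: the function names are hypothetical, the loop order (channels, then height, then width) is a choice made for the example, and the input span is computed with the usual stride and padding relation, which the text does not spell out.

```python
def blk_blocks(wout, hout, cin, blk_wout, blk_hout, cblk_size):
    """Yield (c0, c_len, h0, h_len, w0, w_len) for every output BLK.

    Blocks on the right and bottom edges, and the last channel group,
    may be smaller than the nominal block size.
    """
    for c0 in range(0, cin, cblk_size):
        c_len = min(cblk_size, cin - c0)
        for h0 in range(0, hout, blk_hout):
            h_len = min(blk_hout, hout - h0)
            for w0 in range(0, wout, blk_wout):
                w_len = min(blk_wout, wout - w0)
                yield (c0, c_len, h0, h_len, w0, w_len)


def input_span(out_start, out_len, kernel, stride, pad):
    """Input index range [start, stop) feeding out_len outputs from out_start.

    Indices are in unpadded coordinates, so values below 0 or beyond the
    input size correspond to padding positions.
    """
    start = out_start * stride - pad
    stop = (out_start + out_len - 1) * stride - pad + kernel
    return start, stop


blocks = list(blk_blocks(wout=5, hout=4, cin=3, blk_wout=4, blk_hout=4, cblk_size=2))
assert len(blocks) == 4
```

Neighbouring blocks overlap in their input spans whenever the kernel is larger than the stride, which is the redundant boundary read that larger blocks reduce.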

Step two, as shown in FIG. 4, two buffer arrays, Group 0 (401) and Group 1 (402), are defined in the internal buffer, and the first-dimension accumulation result of the current BLK block is stored in one of the buffer arrays according to a ping-pong storage strategy. As shown at 403 in FIG. 4, step two is specifically: the buffer arrays use single-port SRAM; the data bit width of the first-dimension (kh or kw) accumulated data is defined as d_size; the depth of each SRAM is blk_wout; the number of accumulated data of consecutive Hout stored for the same Wout is defined as h_div; and the size of a single-port SRAM is defined as cblk_size × h_div × d_size × blk_wout. The size of the current output BLK is determined as blk_wout × blk_hout × cblk_size. If the current output block size blk_wout × blk_hout is smaller than the current tensor output wout × hout, the BLK unit control circuit 205 performs the block control operation of the pooling operation multiple times.
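The SRAM sizing and the ping-pong strategy of step two can be sketched as follows. This is a simplified software model under assumptions: the class and function names are hypothetical, the size is taken to be in bits because d_size is a bit width, and the ping-pong behaviour is modelled as simply alternating the write target between the two groups.

```python
def sram_bits(cblk_size, h_div, d_size, blk_wout):
    """Size of one SRAM: word width (cblk_size * h_div * d_size) times depth (blk_wout)."""
    return cblk_size * h_div * d_size * blk_wout


class PingPongBuffer:
    """Two groups: one is written by the first-dimension pass while the
    other is read by the second-dimension pass."""

    def __init__(self):
        self.groups = [None, None]   # Group 0 and Group 1
        self.write_sel = 0           # group that receives the next BLK

    def write_block(self, block):
        self.groups[self.write_sel] = block
        self.write_sel ^= 1          # swap roles for the next BLK

    def read_block(self):
        """Return the most recently completed block."""
        return self.groups[self.write_sel ^ 1]


buf = PingPongBuffer()
buf.write_block("BLK(0) first-dimension sums")
assert buf.read_block() == "BLK(0) first-dimension sums"
buf.write_block("BLK(1) first-dimension sums")
assert buf.read_block() == "BLK(1) first-dimension sums"
assert sram_bits(cblk_size=4, h_div=2, d_size=16, blk_wout=32) == 4096
```

With two groups, the first-dimension pass for BLK(n+1) can fill one group while the second-dimension pass is still reading BLK(n) from the other.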

FIG. 4 above shows one embodiment of the BLK internal storage of the invention. The internal buffer can store the feature accumulation data of the current output BLK of size blk_wout × blk_hout × cblk_size, and is designed according to the storage method described with FIG. 7 below. This embodiment does not mean that the device of the invention must be implemented in this way: the first-dimension direction in the flow may be either the H direction or the W direction, cblk_size is configurable, the feature data may be block data or planar data, and each internal buffer may be implemented with media such as single-port SRAM, dual-port SRAM or a register file.

The handling of kernel pixels is described below with reference to FIG. 5:

after kernel pixels have been accumulated consecutively, the output control unit 107 outputs the first accumulated feature result and thereafter outputs at an interval of stride cycles, as shown by sum_out_0 at 501 in FIG. 5;

after the first feature data has been output, the read control circuit 103 reads out, in real time and in sequence, the feature data starting from the first address of the dual-port buffer 101; the subtraction between the feature data output by the MUX 104 and the accumulation buffer 108 is performed, and the temporary result is passed to the addition circuit 105. In effect, the start of the read control circuit 103 is the write control enable delayed by kernel cycles, as shown by sum_out_1 at 502 in FIG. 5;

after Hout or Wout feature data have been output, as shown at 503 in FIG. 5, the first-dimension data are written into the buffer array, while the second-dimension output is sent directly to the mean division circuit 206.

The read control circuit 103 of the average pooling accumulation circuit 204 also handles boundary padding pixels: when padding is enabled, the MUX 104 outputs the value 0.
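The sequence described with FIG. 5 can be modelled one sample per cycle. The sketch below is a behavioural approximation, not a description of the actual hardware timing: the function name is hypothetical, the dual-port buffer 101 is modelled as a FIFO, and it is assumed that padding positions contribute 0 on both the add side and the subtract side.

```python
from collections import deque


def accumulate_1d(row, kernel, stride, pad):
    """Stream one row through a software model of circuit 204 and return
    the window sums, including zero padding on both ends."""
    fifo = deque()    # dual-port buffer 101, filled by the write control 102
    acc = 0           # accumulation buffer 108
    outputs = []
    n = len(row)
    for pos in range(n + 2 * pad):           # position in the padded row
        is_real = pad <= pos < pad + n
        incoming = row[pos - pad] if is_real else 0
        if is_real:
            fifo.append(incoming)            # only real samples are buffered
        acc += incoming                      # addition circuit 105
        if pos >= kernel:
            # Read side lags the write side by `kernel` cycles.
            old = pos - kernel
            old_is_real = pad <= old < pad + n
            # MUX 104: buffered sample, or 0 when the position is padding.
            outgoing = fifo.popleft() if old_is_real else 0
            acc -= outgoing                  # subtraction circuit 106
        # Output control 107: first result after `kernel` samples,
        # then one result every `stride` cycles.
        if pos >= kernel - 1 and (pos - (kernel - 1)) % stride == 0:
            outputs.append(acc)
    return outputs


assert accumulate_1d([1, 2, 3, 4], kernel=3, stride=1, pad=1) == [3, 6, 9, 7]
assert accumulate_1d([1, 2, 3, 4], kernel=3, stride=2, pad=1) == [3, 9]
```

The same function serves both pooling directions, since it only sees a one-dimensional stream of samples.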

Step three, the first-dimension accumulation result is read from the buffer array and the second-dimension accumulation is performed. Specifically: according to the mapping format in which the first-dimension accumulated data are stored in the buffer array, cblk_size × h_div feature data are read in each clock cycle t0 to tn and the second-dimension accumulation is performed, yielding the accumulation result of each kw × kh window.

FIG. 6 is a schematic diagram of the block pipeline processing of average pooling in the invention. The pooled tensor feature data are first partitioned into cblk blocks according to cblk_size and then, as described above, into BLK(n) units as in the example of FIG. 4. Processing of the current cblk is divided into three pipeline stages: Fetch in, H_AVE and W_AVE. Fetch in is the data input control and reads data according to the configuration information assembled by the BLK unit control circuit 205. After a delay of t0 (601), the H-direction average accumulation of the input data starts. After H_AVE has finished accumulating the average pooling data of one BLK, W_AVE, after a time t1 (602), performs the W-dimension pooling accumulation and, at the same time, the lookup-table division. In other embodiments the order of H_AVE and W_AVE may be swapped.

In step three, the average pooling accumulation circuit 204 works together with the internal storage method so that the second-dimension accumulation of a two-dimensional pooling window can be computed with the same one-dimensional accumulation circuit; the accumulation circuit, and hence the average pooling function of the chip, is therefore general-purpose.
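The reuse of one one-dimensional accumulator for both directions rests on the fact that a kh × kw window sum is separable: summing down the columns first and then across the stored column sums gives the same result as summing the whole window at once. The sketch below checks this on a small plane; the function names are hypothetical and padding is omitted for brevity.

```python
def window_sums(values, kernel, stride):
    """One-dimensional running window sums (add the new sample, subtract the old one)."""
    acc, out = 0, []
    for i, v in enumerate(values):
        acc += v
        if i >= kernel:
            acc -= values[i - kernel]
        if i >= kernel - 1 and (i - (kernel - 1)) % stride == 0:
            out.append(acc)
    return out


def pool_sums_2d(plane, kh, kw, sh, sw):
    """H-direction pass first, then W-direction pass over the stored results."""
    hin, win = len(plane), len(plane[0])
    # First dimension: cols[w][ho] holds the kh-tall column sums.
    cols = [window_sums([plane[h][w] for h in range(hin)], kh, sh) for w in range(win)]
    hout = len(cols[0])
    # Second dimension: the same 1-D routine runs across the columns.
    return [window_sums([cols[w][ho] for w in range(win)], kw, sw) for ho in range(hout)]


def pool_sums_direct(plane, kh, kw, sh, sw):
    """Reference: sum every kh x kw window directly."""
    hin, win = len(plane), len(plane[0])
    return [[sum(plane[h][w] for h in range(h0, h0 + kh) for w in range(w0, w0 + kw))
             for w0 in range(0, win - kw + 1, sw)]
            for h0 in range(0, hin - kh + 1, sh)]


plane = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
assert pool_sums_2d(plane, 2, 2, 1, 1) == pool_sums_direct(plane, 2, 2, 1, 1)
```

Swapping the two passes (W first, then H) gives the same sums, which matches the statement that the order of H_AVE and W_AVE is selectable.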

FIG. 7 is a flow diagram of the BLK output processing in a first-dimension pooling embodiment of the invention.

701 in FIG. 7 illustrates the storage flow of the pooled accumulation results of the current BLK block for w=0: the data of the current block column pass through the average pooling accumulation circuit 204, which outputs cblk_size × 1 feature accumulation data per clock cycle according to the kh/sh/ph parameters; h_div consecutive feature data are stored at the first address of one of the dual-port buffers 101, and the data of one column are output consecutively and stored at the first row address of the dual-port buffers 101;

702 in FIG. 7 illustrates the storage flow of the pooled accumulation results of the current BLK block for w=1: the blk_out accumulation results output for the current column according to the kh/sh/ph parameters are written to the second row address of the dual-port buffers 101;

703 in FIG. 7 illustrates the storage flow of the pooled accumulation results of the current BLK block for w=(blk_out-1): the blk_out accumulation results output for the current column according to the kh/sh/ph parameters are written to the last row address of the dual-port buffers 101.

FIG. 7 is an example of the first-dimension average pooling operation and storage. In this embodiment the H dimension is used as the operation direction; the W dimension may equally be selected as the first-dimension operation direction, and BLK block operation and storage in either the H or the W direction fall within the protection scope of the invention.
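One way to read the storage flow of FIG. 7 is as an address mapping from an output position inside the BLK to a location in the buffer array. The sketch below is an interpretation, not a statement of the actual layout: it assumes that the row address is the column index w, that each word holds h_div consecutive results in the H direction, and that successive groups of h_div results go to successive buffers. The function names are hypothetical.

```python
def first_dim_location(w, h, h_div):
    """Map output position (w, h) inside a BLK to (buffer_index, row_address, slot).

    Assumed layout: one row address per column w, h_div results per word.
    """
    buffer_index = h // h_div    # which buffer holds this group of h_div results
    row_address = w              # w = 0 goes to the first row address, and so on
    slot = h % h_div             # position inside the word
    return buffer_index, row_address, slot


def buffers_needed(blk_hout, h_div):
    """Number of buffers required to hold blk_hout results per column (ceiling division)."""
    return -(-blk_hout // h_div)


assert first_dim_location(w=3, h=5, h_div=4) == (1, 3, 1)
assert buffers_needed(blk_hout=10, h_div=4) == 3
```

Under this reading, a single read at row address w returns h_div results for that column from each buffer, which is what allows the second-dimension pass to consume cblk_size × h_div values per cycle.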

In the BLK storage method of the average pooling accumulation circuit of the invention, the SRAM array can be changed dynamically according to the current block size. The size of a single dual-port buffer 101 is defined as cblk_size × h_div × d_size × blk_wout. When the block size blk_hout is small, SRAMs may optionally be stacked to increase the block size blk_wout; likewise, if the W direction is selected as the first-dimension operation direction of the average pooling, the block size blk_hout can be increased in the same way, which reduces redundant data reads and operations at block boundaries.

Step four, the second-dimension accumulation result is output to the mean division circuit, which performs the division by SRAM table lookup. Specifically: the table lookup is performed according to the current kernel size (kw × kh), and the lookup range is tied to the maximum kernel size supported by the pooling device. FIG. 8 is a flow diagram of an embodiment of the second-dimension average pooling operation. According to the mapping format in which the first-dimension accumulated data are stored in the internal buffer, cblk_size × h_div feature data are read in each clock cycle t0 to tn and the second-dimension accumulation is performed, yielding the accumulation result of each kw × kh window. The second-dimension accumulation invokes the average pooling accumulation circuit 204, and the accumulation result is passed at the same time to the mean division circuit 206, which is implemented as a configurable lookup table. The table lookup avoids the poor timing and large area of a conventional division circuit; a configurable lookup-table divider can support arbitrary kernel sizes with variable precision, and is therefore general with respect to both kernel size and data precision.
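A common way to realise division by table lookup is to store fixed-point reciprocals and multiply. The sketch below follows that approach as an assumption: the text only states that the division uses a configurable lookup table indexed by the kernel size, so the reciprocal encoding, the rounding and the names used here (build_reciprocal_table, lut_mean, frac_bits) are illustrative choices.

```python
def build_reciprocal_table(max_kw, max_kh, frac_bits):
    """table[n] approximates 2**frac_bits / n for n = 1 .. max_kw * max_kh.

    The table length follows the maximum supported kernel size, and
    frac_bits sets the precision of the stored reciprocals.
    """
    size = max_kw * max_kh
    table = [0] * (size + 1)                          # index 0 is unused
    for n in range(1, size + 1):
        table[n] = ((1 << frac_bits) + n // 2) // n   # rounded to nearest
    return table


def lut_mean(acc, kw, kh, table, frac_bits):
    """Divide the window sum by kw * kh using a multiply and a shift (frac_bits >= 1)."""
    recip = table[kw * kh]
    return (acc * recip + (1 << (frac_bits - 1))) >> frac_bits


table = build_reciprocal_table(max_kw=3, max_kh=3, frac_bits=16)
assert lut_mean(45, 3, 3, table, 16) == 5    # 45 / 9
assert lut_mean(28, 2, 2, table, 16) == 7    # 28 / 4
```

Changing frac_bits trades table width against accuracy, and changing max_kw and max_kh changes the table depth, which corresponds to the configurable precision and kernel range mentioned in the text.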

The above embodiments are only for illustrating the technical concept and features of the present invention, and are intended to enable those skilled in the art to understand the content of the present invention and implement the same, and are not intended to limit the scope of the present invention. All equivalent changes or modifications made in accordance with the essence of the present invention are intended to be included within the scope of the present invention.

Claims

1. An average pooling method, characterized in that it is used with an average pooling device, wherein the average pooling device comprises a top-level control circuit, an input control circuit, an output control circuit, an average pooling accumulation circuit, a BLK unit control circuit and a mean division circuit; the top-level control circuit is used for control interaction with a system and for controlling the internal average pooling circuits; the input control circuit is used for receiving the input feature data size and the input data address from the top-level control circuit and for controlling the input of feature data from external storage and online modules; the output control circuit is used for receiving the output feature data size and the input data address from the top-level control circuit and for controlling the output of feature data to external storage and online modules; the average pooling accumulation circuit is used for obtaining the accumulated feature data; there are two average pooling accumulation circuits, one of which is used for the first-dimension accumulation and the other for the second-dimension accumulation; the BLK unit control circuit is used for BLK block control of the average pooling; the mean division circuit is configured with a lookup table for the precision range and performs the division according to the lookup table;

the average pooling accumulation circuit comprises a dual-port buffer, a write control circuit, a read control circuit, a MUX, an addition circuit, a subtraction circuit, an output control circuit and an accumulation buffer; the dual-port buffer is used for buffering the input feature data of the current cycle; the write control circuit is used for controlling the writing of feature data into the dual-port buffer; the read control circuit is used for controlling the reading of the feature data stored in the dual-port buffer and for controlling the MUX; the MUX is used for selecting between the feature data output by the dual-port buffer and the padding data; the addition circuit is used for receiving the feature data input from the input control unit in each clock cycle and the temporary result of the subtraction circuit, and for inputting the current feature data into the accumulation buffer; the subtraction circuit is used for performing the subtraction between the feature data output by the MUX and the accumulation buffer; the output control unit is used for controlling the valid output of the accumulation buffer; the accumulation buffer is used for buffering and outputting the accumulated feature data;

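the method comprises the following steps:

step one, dividing the input feature data into BLK blocks;

step two, defining two buffer arrays in the internal buffer, and storing the first-dimension accumulation result of the current BLK block in one of the buffer arrays according to a ping-pong storage strategy;

step three, reading the first-dimension accumulation result from the buffer array and performing the second-dimension accumulation;

step four, outputting the second-dimension accumulation result to the mean division circuit, and performing the division by SRAM table lookup;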

step one is specifically: the channel dimension is partitioned according to a preset cblk_size; the size of the current output BLK is determined to be blk_wout × blk_hout × cblk_size according to the internal buffer size; and the first and second dimensions are partitioned for output according to blk_wout × blk_hout;

step two is specifically: the buffer arrays use single-port SRAM; the data bit width of the first-dimension (kh or kw) accumulated data is defined as d_size; the depth of each SRAM is blk_wout; the number of accumulated data of consecutive Hout stored for the same Wout is defined as h_div; and the size of a single-port SRAM is defined as cblk_size × h_div × d_size × blk_wout;

step three is specifically: according to the mapping format in which the first-dimension accumulated data are stored in the buffer array, cblk_size × h_div feature data are read in each clock cycle t0 to tn and the second-dimension accumulation is performed, yielding the accumulation result of each kw × kh window;

step four is specifically: the table lookup is performed according to the current kernel size (kw × kh), and the lookup range is tied to the maximum kernel size supported by the pooling device.