CN112633462A - Block type inference method and system for memory optimization of convolution neural network - Google Patents


Info

Publication number: CN112633462A
Authority: CN (China)
Prior art keywords: block, layer, input, data, feature
Legal status: Pending (the status listed is an assumption and is not a legal conclusion)
Application number: CN202010922472.8A
Other languages: Chinese (zh)
Inventor: 黄朝宗
Current Assignee / Original Assignee: Individual
Application filed by: Individual
Publication of: CN112633462A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 General purpose image data processing
    • G06T 1/60 Memory management

Abstract

The invention provides a block-wise inference method and system for memory optimization of a convolutional neural network. The block inference step drives an arithmetic processing unit to perform a multi-layer convolution operation on each input block data to generate output block data. The block inference step selects i-th layer recalculation features along the scan line-feed direction according to the position of the output block data, and selects i-th layer reuse features along the block scanning direction according to the i-th layer recalculation input feature block data. The convolution operation step performs the convolution operation according to the i-th layer recalculation features and the i-th layer reuse features. Therefore, by using calculation modes with different characteristics in different directions, the external memory bandwidth requirement can be greatly reduced without increasing too much computation or internal block register capacity.

Description

Block type inference method and system for memory optimization of convolution neural network
Technical Field
The present invention relates to a block inference method and system, and more particularly, to a block inference method and system for memory optimization of a convolutional neural network.
Background
When a convolutional neural network is used in image processing applications, the bandwidth requirement of its external memory can be quite high, and using a block-wise inference procedure can greatly reduce this requirement. However, there are overlapping features between blocks, and two different ways of handling them are known: one is recalculation and the other is reuse. The former increases the computation amount and reduces the output pixel rate, while the latter requires a large block register to store the reused features. Therefore, a block inference method and system for memory optimization of a convolutional neural network that can greatly reduce the external memory bandwidth requirement without adding too much computation or block register capacity is not yet available, and related practitioners are seeking a solution to this problem.
Disclosure of Invention
Therefore, the present invention provides a block inference method and system for memory optimization of a convolutional neural network: when block-wise inference is performed, the already-computed features are reused along the block forward direction, and a recalculation mode is used in the other direction, so that block-wise inference can still greatly reduce the external memory bandwidth requirement without increasing too much computation or block register capacity.
One embodiment of a method according to the present invention provides a memory-optimized block-wise inference method for a convolutional neural network, which is used for processing an input image. The block-wise inference method for memory optimization of the convolutional neural network comprises a parameter setting step, a dividing step, a block inference step and a temporary storage step. The parameter setting step sets an inference parameter set, and the inference parameter set comprises a convolution depth, a block width, a block height and multi-layer convolution kernel sizes. The dividing step drives an arithmetic processing unit to divide the input image into a plurality of input block data according to the convolution depth, the block width, the block height and the multi-layer convolution kernel sizes, wherein each input block data has an input block size. The block inference step drives the arithmetic processing unit to perform a multi-layer convolution operation on each input block data to generate output block data, and the multi-layer convolution operation comprises a first direction data selection step, a second direction data selection step and a convolution operation step. The first direction data selection step selects a plurality of i-th layer recalculation features along a scan line-feed direction according to a position of the output block data, and then selects i-th layer recalculation input feature block data according to the position of the output block data and the i-th layer recalculation features, wherein i is one of the positive integers from 1 to the convolution depth. The second direction data selection step selects a plurality of i-th layer reuse features along a block scanning direction according to the i-th layer recalculation input feature block data, and combines the i-th layer recalculation input feature block data and the i-th layer reuse features to generate i-th layer reuse input feature block data. In addition, the convolution operation step selects a plurality of i-th layer sub-block input feature groups from the i-th layer reuse input feature block data according to an i-th layer convolution kernel size, then performs a convolution operation on each i-th layer sub-block input feature group and a convolution parameter set to generate an i-th layer sub-block output feature, and combines the i-th layer sub-block output features corresponding to the i-th layer sub-block input feature groups to form i-th layer output feature block data. The temporary storage step drives a block register to temporarily store the i-th layer output feature block data and the i-th layer reuse features.
Therefore, the block-wise inference method for memory optimization of the convolutional neural network uses calculation modes with different characteristics in different directions, so that block-wise inference can still greatly reduce the external memory bandwidth requirement without increasing too much computation or block register capacity.
Other examples of the foregoing embodiments are as follows: when i is equal to 1, the i-th layer recalculation input feature block data is equal to each input block data; when i is equal to the convolution depth, the i-th layer output feature block data is equal to the output block data.
Other examples of the foregoing embodiments are as follows: the i-th layer recalculated input feature block data has an i-th layer recalculated input feature block size and an i-th layer recalculated input feature block channel number, and the i-th layer output feature block data has an i-th layer output feature block size and an i-th layer output feature block channel number. The size of the ith layer output feature block is larger than that of the ith layer recalculated input feature block, and the number of channels of the ith layer recalculated input feature block is equal to that of the ith layer output feature block.
Other examples of the foregoing embodiments are as follows: the block scanning direction is perpendicular to the scan line-feed direction, the block width is greater than the block height, and the extending direction of the block height is parallel to the block scanning direction.
Other examples of the foregoing embodiments are as follows: the convolution depth, the block width and the block height are positive integers, and the i-th layer convolution kernel size is k_Wi × k_Hi. The i-th layer reuse features have a reuse feature number along the block scanning direction, and the reuse feature number is equal to k_Hi − 1.
Other examples of the foregoing embodiments are as follows: the block width is denoted as B_W, the convolution depth is denoted as D, and the block height is denoted as B_H. The input block size is equal to B_W × B_H. The output block data has an output block size equal to (B_W − 2D) × B_H. The i-th layer recalculation input feature block data has an i-th layer recalculation input feature block size equal to (B_W − 2i + 2) × B_H. The i-th layer reuse input feature block data has an i-th layer reuse input feature block size equal to (B_W − 2i + 2) × (B_H + 2). The i-th layer output feature block data has an i-th layer output feature block size equal to (B_W − 2i) × B_H. The convolution depth is less than half of the block width.
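To make the size bookkeeping concrete, the following sketch (our own illustration, not part of the patent, assuming k_Wi = k_Hi = 3 for every layer) evaluates the above formulas layer by layer:

```python
# Illustrative sketch (assumes 3x3 kernels in every layer).
def layer_block_sizes(b_w: int, b_h: int, depth: int):
    """Return (recalc_input, reuse_input, output) sizes, width x height, per layer."""
    assert depth < b_w / 2, "convolution depth must be less than half the block width"
    sizes = []
    for i in range(1, depth + 1):
        recalc_input = (b_w - 2 * i + 2, b_h)      # (B_W - 2i + 2) x B_H
        reuse_input = (b_w - 2 * i + 2, b_h + 2)   # (B_W - 2i + 2) x (B_H + 2)
        output = (b_w - 2 * i, b_h)                # (B_W - 2i) x B_H
        sizes.append((recalc_input, reuse_input, output))
    return sizes

# Example matching the detailed description: B_W = 10, B_H = 4, D = 3.
for i, (rc, ru, out) in enumerate(layer_block_sizes(10, 4, 3), start=1):
    print(f"layer {i}: recalc {rc}, reuse {ru}, output {out}")
# layer 1: recalc (10, 4), reuse (10, 6), output (8, 4)
# layer 2: recalc (8, 4), reuse (8, 6), output (6, 4)
# layer 3: recalc (6, 4), reuse (6, 6), output (4, 4)
```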
Other examples of the foregoing embodiments are as follows: when at least one of the input features of an i-th layer sub-block input feature group is located in an outer region of the i-th layer reuse input feature block data, the input features of the i-th layer sub-block input feature group include a plurality of outer block features and a plurality of first inner block features; the outer block features represent already-computed features, and the first inner block features represent not-yet-computed features. Furthermore, when the input features of an i-th layer sub-block input feature group are all located in an inner region of the i-th layer reuse input feature block data, the input features of the i-th layer sub-block input feature group include only a plurality of second inner block features. The i-th layer reuse input feature block data is arranged in the order of the outer region followed by the inner region along the block scanning direction.
Other examples of the foregoing embodiments are as follows: the outer block features are stored in a block register, and the block register has a temporary storage space. The temporary storage space is obtained from the width of the i-th layer recalculation input feature block data, the convolution depth, the layer number, the channel number and the i-th layer convolution kernel size. The width of the i-th layer recalculation input feature block data is denoted as B_Wi, the convolution depth is denoted as D, the layer number is denoted as i, the channel number is denoted as C, and the i-th layer convolution kernel size is k_Wi × k_Hi. The temporary storage space is denoted as LBS and conforms to the following equation:

LBS = Σ_{i=1}^{D} B_Wi × (k_Hi − 1) × C
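The following sketch computes LBS under our reading of this equation, i.e., each layer keeps the bottom k_Hi − 1 rows of its B_Wi-wide recalculation input block over C channels; the substitution B_Wi = B_W − 2i + 2 is taken from the size formulas above and should be read as an assumption:

```python
# Hedged sketch of the line-buffer-size formula (names are ours).
def line_buffer_size(b_w: int, depth: int, channels: int, k_h: int = 3) -> int:
    """LBS = sum over i = 1..D of B_Wi * (k_Hi - 1) * C."""
    total = 0
    for i in range(1, depth + 1):
        b_wi = b_w - 2 * i + 2  # assumed width of the layer-i recalculation input block
        total += b_wi * (k_h - 1) * channels
    return total
```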
according to an embodiment of the present invention, a memory-optimized block-wise inference system for a convolutional neural network is provided for processing an input image. The memory-optimized block-wise inference system of the convolutional neural network includes a block register and an arithmetic processing unit. The block register is used for accessing i-th layer output feature block data and a plurality of i-th layer reuse features. The arithmetic processing unit is electrically connected to the block register, receives the input image and is configured to perform operations comprising a parameter setting step, a dividing step and a block inference step. The parameter setting step sets an inference parameter set, wherein the inference parameter set comprises a convolution depth, a block width, a block height and multi-layer convolution kernel sizes. The dividing step divides the input image into a plurality of input block data according to the convolution depth, the block width, the block height and the multi-layer convolution kernel sizes, wherein each input block data has an input block size. In addition, the block inference step performs a multi-layer convolution operation on each input block data to generate output block data, and the multi-layer convolution operation includes a first direction data selection step, a second direction data selection step and a convolution operation step. The first direction data selection step selects a plurality of i-th layer recalculation features along the scan line-feed direction according to the position of the output block data, and then selects i-th layer recalculation input feature block data according to the position of the output block data and the i-th layer recalculation features, wherein i is one of the positive integers from 1 to the convolution depth. The second direction data selection step selects the i-th layer reuse features along the block scanning direction according to the i-th layer recalculation input feature block data, and combines the i-th layer recalculation input feature block data and the i-th layer reuse features to generate i-th layer reuse input feature block data. The convolution operation step selects a plurality of i-th layer sub-block input feature groups from the i-th layer reuse input feature block data according to the i-th layer convolution kernel size, then performs a convolution operation on each i-th layer sub-block input feature group and a convolution parameter set to generate an i-th layer sub-block output feature, and combines the i-th layer sub-block output features corresponding to the i-th layer sub-block input feature groups to form the i-th layer output feature block data.
Therefore, the block-wise inference system for memory optimization of the convolutional neural network uses calculation modes with different characteristics in different directions, so that block-wise inference can still greatly reduce the external memory bandwidth requirement without increasing too much computation or block register capacity.
Other examples of the foregoing embodiments are as follows: when i is equal to 1, the i-th layer recalculation input feature block data is equal to each input block data; and when i is equal to the convolution depth, the i-th layer output feature block data is equal to the output block data.
Other examples of the foregoing embodiments are as follows: the i-th layer recalculated input feature block data comprises i-th layer recalculated input feature block size and i-th layer recalculated input feature block channel number, and the i-th layer output feature block data comprises i-th layer output feature block size and i-th layer output feature block channel number. The size of the ith layer output feature block is larger than that of the ith layer recalculated input feature block, and the number of channels of the ith layer recalculated input feature block is equal to that of the ith layer output feature block.
Other examples of the foregoing embodiments are as follows: the block scanning direction is perpendicular to the scan line-feed direction, the block width is greater than the block height, and the extending direction of the block height is parallel to the block scanning direction.
Other examples of the foregoing embodiments are as follows: the convolution depth, the block width and the block height are positive integers, the i-th layer convolution kernel size is k_Wi × k_Hi, the i-th layer reuse features have a reuse feature number along the block scanning direction, and the reuse feature number is equal to k_Hi − 1.
Other examples of the foregoing embodiments are as follows: the block width is denoted as B_W, the convolution depth is denoted as D, and the block height is denoted as B_H. The input block size is equal to B_W × B_H. The output block data has an output block size equal to (B_W − 2D) × B_H. The i-th layer recalculation input feature block data has an i-th layer recalculation input feature block size equal to (B_W − 2i + 2) × B_H. The i-th layer reuse input feature block data has an i-th layer reuse input feature block size equal to (B_W − 2i + 2) × (B_H + 2). The i-th layer output feature block data has an i-th layer output feature block size equal to (B_W − 2i) × B_H. The convolution depth is less than half of the block width.
Other examples of the foregoing embodiments are as follows: when at least one of the input features of an i-th layer sub-block input feature group is located in an outer region of the i-th layer reuse input feature block data, the input features of the i-th layer sub-block input feature group include a plurality of outer block features and a plurality of first inner block features; the outer block features represent already-computed features, and the first inner block features represent not-yet-computed features. Furthermore, when the input features of an i-th layer sub-block input feature group are all located in an inner region of the i-th layer reuse input feature block data, the input features of the i-th layer sub-block input feature group include only a plurality of second inner block features. The i-th layer reuse input feature block data is arranged in the order of the outer region followed by the inner region along the block scanning direction.
Other examples of the foregoing embodiments are as follows: the outer block features are stored in a block register, and the block register has a temporary storage space. The temporary storage space is obtained from the width of the i-th layer recalculation input feature block data, the convolution depth, the layer number, the channel number and the i-th layer convolution kernel size. The width of the i-th layer recalculation input feature block data is denoted as B_Wi, the convolution depth is denoted as D, the layer number is denoted as i, the channel number is denoted as C, and the i-th layer convolution kernel size is k_Wi × k_Hi. The temporary storage space is denoted as LBS and conforms to the following equation:

LBS = Σ_{i=1}^{D} B_Wi × (k_Hi − 1) × C
drawings
FIG. 1 is a flow diagram illustrating a block-wise inference method of memory optimization for convolutional neural networks of a first embodiment of the present invention;
FIG. 2 is a schematic diagram showing the segmentation step of FIG. 1;
FIG. 3 is a schematic perspective view of input block data and output block data of a multi-layer convolution operation of the block inference step of FIG. 1;
FIG. 4 is a schematic diagram illustrating a first direction data selection step of FIG. 1;
FIG. 5 is a diagram illustrating a second direction data selection step of FIG. 1;
FIG. 6 is a diagram illustrating the layer 1 reuse of input feature block data of FIG. 3;
FIG. 7 is a schematic diagram showing channel shuffle according to a second embodiment of the present invention;
FIG. 8 is a block diagram illustrating a memory-optimized block-wise inference system for convolutional neural networks of a third embodiment of the present invention;
FIG. 9 is a flowchart showing a multi-layer convolution operation with 3 × 3 filters according to a fourth embodiment of the present invention; and
FIG. 10 is a diagram showing the simulation comparison results of recalculation, reuse, and recalculation-and-reuse of the present invention.
Description of reference numerals:
100: block type inference method for memory optimization of convolution neural network
S02: parameter setting step
S04: step of dividing
S06: block inference procedure
S062: first direction data selection step
S064: second direction data selection step
S066: convolution operation step
S08: temporary storage step
110: output image
200: memory optimized block-based inference system for convolutional neural networks
212: inference parameter set
214: convolution parameter set
220: block register
230: arithmetic processing unit
232: convolution engine
B_W, W1, W2, W3: block width
B_H, H1, H2, H3: block height
C1: i-th layer reuse input feature block channel number
C2: i-th layer intermediate feature block channel number
C3: i-th layer output feature block channel number
D: convolution depth
Dmax: maximum supported convolution depth
D1: scan line-feed direction
D2: block scanning direction
FC: recalculation
FU: reuse
FCFU: recalculation and reuse
IB: input block data
IR: inner region
k−1: number of reused features
L1: layer 1
L1FC: layer 1 recalculation features
L1FC_I: layer 1 recalculation input feature block data
L1FU: layer 1 reuse features
L1FU_I: layer 1 reuse input feature block data
L1_O: layer 1 output feature block data
L2: layer 2
L2FC: layer 2 recalculation features
L2FC_I: layer 2 recalculation input feature block data
L2FU: layer 2 reuse features
L2FU_I: layer 2 reuse input feature block data
L2_O: layer 2 output feature block data
L3: layer 3
L3FC: layer 3 recalculation features
L3FC_I: layer 3 recalculation input feature block data
L3FU: layer 3 reuse features
L3FU_I: layer 3 reuse input feature block data
L3_O: layer 3 output feature block data
LD: layer D
LiFU_I: i-th layer reuse input feature block data
Li_M: i-th layer intermediate feature block data
Li_O: i-th layer output feature block data
NTR: normalized throughput rate
OB: output block data
OR: outer region
S: block register size limit
SBG1, SBG11, SBG12: layer 1 sub-block input feature groups
SBG2: layer 2 sub-block input feature group
SBG3: layer 3 sub-block input feature group
Detailed Description
Various embodiments of the present invention will be described below with reference to the accompanying drawings. For the purpose of clarity, numerous implementation details are set forth in the following description. It should be understood, however, that these implementation details are not to be interpreted as limiting the invention. That is, in some embodiments of the invention, these implementation details are not necessary. In addition, some conventional structures and elements are shown in simplified schematic form in the drawings for the sake of simplifying the drawings; and repeated elements will likely be referred to using the same reference numerals.
In addition, when an element (or a unit or a module, etc.) is "connected" to another element, it can mean that the element is directly connected to the other element, or that the element is indirectly connected to the other element, i.e., another element is interposed between the two. When an element is explicitly described as being "directly connected" to another element, no other element is interposed between them. The terms first, second, third and the like are used only for describing different elements and do not limit the elements themselves, so a first element may also be called a second element. Moreover, the combinations of elements/units/circuits herein are not commonly known, conventional or existing combinations in this field, and whether such a combination could be easily accomplished by a person skilled in the art cannot be determined merely from whether the individual elements/units/circuits are themselves existing.
Referring to fig. 1, fig. 1 is a flowchart illustrating a block-based inference method 100 for memory optimization of a convolutional neural network according to a first embodiment of the present invention. The block inference method 100 for memory optimization of convolutional neural network is used for processing an input image to generate an output image, and includes a parameter setting step S02, a segmentation step S04, a block inference step S06, and a temporary storage step S08.
In the parameter setting step S02, an inference parameter set is set, which includes a convolution depth (depth), a block width, a block height, and multi-layer convolution kernel sizes (kernel sizes). The number of layers of the multi-layer convolution kernel sizes is equal to the convolution depth.
In the dividing step S04, the arithmetic processing unit is driven to divide the input image into a plurality of input block data according to the convolution depth, the block width, the block height and the multi-layer convolution kernel sizes, wherein each input block data has an input block size.
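As a concrete illustration of this dividing step, the following sketch (our own simplification, assuming 3 × 3 kernels; boundary padding and ragged edge blocks are ignored) splits an image into B_H × B_W input blocks whose column strips overlap by 2D pixels along the scan line-feed direction:

```python
import numpy as np

def split_into_blocks(image: np.ndarray, b_w: int, b_h: int, depth: int):
    """Yield input blocks: down each column strip (block scanning direction D2),
    then step right to the next strip (scan line-feed direction D1). Adjacent
    strips overlap by 2*depth columns, so each block can yield a
    (b_w - 2*depth)-wide output strip after `depth` 3x3 convolution layers."""
    height, width = image.shape[:2]
    stride_w = b_w - 2 * depth
    for x in range(0, width - 2 * depth, stride_w):  # D1: next column strip
        for y in range(0, height, b_h):              # D2: down the strip
            yield image[y:y + b_h, x:x + b_w]
```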
The block inference step S06 is to drive the arithmetic processing unit to perform a multi-layer convolution operation on each input block data to generate output block data, wherein the multi-layer convolution operation includes a first direction data selection step S062, a second direction data selection step S064, and a convolution operation step S066. In the first direction data selecting step S062, a plurality of i-th layer recalculation features are selected along the scan line feed direction according to the position of the output block data, and then an i-th layer recalculation input feature block data is selected according to the position of the output block data and the i-th layer recalculation features, where i is one of a plurality of positive integers from 1 to the convolution depth. In addition, the second direction data selecting step S064 selects a plurality of i-th layer recycling features along the block scanning direction according to the i-th layer recalculation input feature block data, and combines the i-th layer recalculation input feature block data and the i-th layer recycling features to generate an i-th layer recycling input feature block data. In addition, in the convolution operation step S066, a plurality of i-th layer sub-block input feature groups are selected from the i-th layer repeated use input feature block data according to the i-th layer convolution kernel size, then convolution operation is performed on each i-th layer sub-block input feature group and the convolution parameter group to generate i-th layer sub-block output features, and the i-th layer sub-block output features corresponding to the i-th layer sub-block input feature groups are combined to form i-th layer output feature block data. The convolution parameter set includes a weight parameter (weight parameter) and a bias parameter (bias parameter).
In the temporary storage step S08, a Block buffer bank (Block buffer bank) is used to temporarily store the i-th layer output feature Block data and the i-th layer recycling features.
Therefore, the block inference method 100 for memory optimization of convolutional neural network of the present invention uses different feature calculation modes in different directions, so that the block inference can still greatly reduce the bandwidth requirement of the external memory without increasing too much calculation amount and block register. The details of the above steps will be explained below by means of more detailed examples.
Referring to fig. 1 to 6, fig. 2 is a schematic diagram illustrating the dividing step S04 of fig. 1; fig. 3 is a schematic perspective view illustrating the input block data IB and the output block data OB of the multi-layer convolution operation of the block inference step S06 of fig. 1; fig. 4 is a schematic view illustrating the first direction data selection step S062 of fig. 1; fig. 5 is a schematic view illustrating the second direction data selection step S064 of fig. 1; and fig. 6 is a diagram illustrating the layer 1 reuse input feature block data L1FU_I of fig. 3. As shown, in this embodiment, the first direction data selection step S062, the second direction data selection step S064 and the convolution operation step S066 are performed for each layer (i.e., i of the i-th layer is 1 to D). The convolution depth D, the block width B_W and the block height B_H are all positive integers. The i-th layer convolution kernel size is k_Wi × k_Hi, where k_Wi and k_Hi are both positive integers. The scan line-feed direction D1 is horizontal, and the block scanning direction D2 is vertical; in other words, the block scanning direction D2 is perpendicular to the scan line-feed direction D1. The block width B_W is greater than the block height B_H, and the extending direction of the block height B_H is parallel to the block scanning direction D2. The input block size is equal to B_W × B_H. The output block data OB has an output block size equal to (B_W − 2D) × B_H. The i-th layer recalculation input feature block data has an i-th layer recalculation input feature block size equal to (B_W − 2i + 2) × B_H. The i-th layer reuse input feature block data has an i-th layer reuse input feature block size equal to (B_W − 2i + 2) × (B_H + 2). The i-th layer output feature block data has an i-th layer output feature block size equal to (B_W − 2i) × B_H. The i-th layer output feature block data represents the output features of the i-th layer after the convolution operation is performed, and is used for the recalculation of the next layer (the (i+1)-th layer) of the same block. The convolution depth D is less than half of the block width B_W. Furthermore, the i-th layer reuse features have a reuse feature number along the block scanning direction D2, and the reuse feature number is equal to k_Hi − 1 (i.e., k − 1). The i-th layer reuse features are reused by the same layer (the i-th layer) for the next block. When i is equal to 1, the i-th layer recalculation input feature block data is equal to each input block data IB; when i is equal to the convolution depth D, the i-th layer output feature block data is equal to the output block data OB.
In fig. 3 to 6, the convolution depth D is 3, the block width B_W is 10, and the block height B_H is 4. The i-th layer convolution kernel size is 3 × 3, i.e., k_Wi = k_Hi = k = 3. A convolution depth D of 3 represents a 3-layer convolution operation, so the multi-layer convolution operation includes a layer 1 convolution operation, a layer 2 convolution operation and a layer 3 convolution operation (i.e., i is 1, 2 and 3).
The layer 1 convolution operation (i = 1) includes the first direction data selection step S062, the second direction data selection step S064 and the convolution operation step S066. The first direction data selection step S062 selects 6 layer 1 recalculation features L1FC (i.e., (D − i + 1) × (k − 1) features) along the scan line-feed direction D1 according to the position of the output block data OB (i.e., the layer 3 output feature block data L3_O), and then selects layer 1 recalculation input feature block data L1FC_I according to the position of the output block data OB and the layer 1 recalculation features L1FC. The layer 1 recalculation input feature block data L1FC_I is equal to the input block data IB, and the input block size of the input block data IB is equal to the layer 1 recalculation input feature block size of L1FC_I, namely (B_W − 2i + 2) × B_H = (10 − 2 + 2) × 4 = 10 × 4, as shown in layer 1 (L1) of fig. 3, layer 1 (L1) of fig. 4 and fig. 6. Furthermore, the second direction data selection step S064 selects 2 layer 1 reuse features L1FU along the block scanning direction D2 according to the layer 1 recalculation input feature block data L1FC_I, and combines the layer 1 recalculation input feature block data L1FC_I and the layer 1 reuse features L1FU to generate layer 1 reuse input feature block data L1FU_I. The layer 1 reuse input feature block size of L1FU_I is equal to (B_W − 2i + 2) × (B_H + 2) = (10 − 2 + 2) × (4 + 2) = 10 × 6, as shown in layer 1 (L1) of fig. 3, layer 1 (L1) of fig. 5 and fig. 6. In addition, the convolution operation step S066 selects a plurality of layer 1 sub-block input feature groups SBG1 (each of 3 × 3 features) from the layer 1 reuse input feature block data L1FU_I according to the i-th layer convolution kernel size (i.e., 3 × 3), then performs the convolution operation on each layer 1 sub-block input feature group SBG1 and the convolution parameter set to generate a layer 1 sub-block output feature, and combines the layer 1 sub-block output features corresponding to the layer 1 sub-block input feature groups SBG1 to form layer 1 output feature block data L1_O. The layer 1 output feature block size of L1_O is equal to (B_W − 2i) × B_H = (10 − 2) × 4 = 8 × 4, as shown in layer 1 (L1) of fig. 3 and fig. 5.
The layer 2 convolution operation (i = 2) includes the first direction data selection step S062, the second direction data selection step S064 and the convolution operation step S066. The first direction data selection step S062 selects 4 layer 2 recalculation features L2FC (i.e., (D − i + 1) × (k − 1) features) along the scan line-feed direction D1 according to the position of the output block data OB (i.e., the layer 3 output feature block data L3_O), and then selects layer 2 recalculation input feature block data L2FC_I according to the position of the output block data OB and the layer 2 recalculation features L2FC. The layer 2 recalculation input feature block data L2FC_I is equal to the layer 1 output feature block data L1_O. The layer 2 recalculation input feature block size of L2FC_I is equal to (B_W − 2i + 2) × B_H = (10 − 4 + 2) × 4 = 8 × 4, as shown in layer 2 (L2) of fig. 3 and fig. 4. Furthermore, the second direction data selection step S064 selects 2 layer 2 reuse features L2FU along the block scanning direction D2 according to the layer 2 recalculation input feature block data L2FC_I, and combines the layer 2 recalculation input feature block data L2FC_I and the layer 2 reuse features L2FU to generate layer 2 reuse input feature block data L2FU_I. The layer 2 reuse input feature block size of L2FU_I is equal to (B_W − 2i + 2) × (B_H + 2) = (10 − 4 + 2) × (4 + 2) = 8 × 6, as shown in layer 2 (L2) of fig. 3 and fig. 5. In addition, the convolution operation step S066 selects a plurality of layer 2 sub-block input feature groups SBG2 (each of 3 × 3 features) from the layer 2 reuse input feature block data L2FU_I according to the i-th layer convolution kernel size (i.e., 3 × 3), then performs the convolution operation on each layer 2 sub-block input feature group SBG2 and the convolution parameter set to generate a layer 2 sub-block output feature, and combines the layer 2 sub-block output features corresponding to the layer 2 sub-block input feature groups SBG2 to form layer 2 output feature block data L2_O. The layer 2 output feature block size of L2_O is equal to (B_W − 2i) × B_H = (10 − 4) × 4 = 6 × 4, as shown in layer 2 (L2) of fig. 3 and fig. 5.
The layer 3 convolution operation (i = 3) includes the first direction data selection step S062, the second direction data selection step S064 and the convolution operation step S066. The first direction data selection step S062 selects 2 layer 3 recalculation features L3FC (i.e., (D − i + 1) × (k − 1) features) along the scan line-feed direction D1 according to the position of the output block data OB (i.e., the layer 3 output feature block data L3_O), and then selects layer 3 recalculation input feature block data L3FC_I according to the position of the output block data OB and the layer 3 recalculation features L3FC. The layer 3 recalculation input feature block data L3FC_I is equal to the layer 2 output feature block data L2_O. The layer 3 recalculation input feature block size of L3FC_I is equal to (B_W − 2i + 2) × B_H = (10 − 6 + 2) × 4 = 6 × 4, as shown in layer 3 (L3) of fig. 3 and fig. 4. Furthermore, the second direction data selection step S064 selects 2 layer 3 reuse features L3FU along the block scanning direction D2 according to the layer 3 recalculation input feature block data L3FC_I, and combines the layer 3 recalculation input feature block data L3FC_I and the layer 3 reuse features L3FU to generate layer 3 reuse input feature block data L3FU_I. The layer 3 reuse input feature block size of L3FU_I is equal to (B_W − 2i + 2) × (B_H + 2) = (10 − 6 + 2) × (4 + 2) = 6 × 6, as shown in layer 3 (L3) of fig. 3 and fig. 5. In addition, the convolution operation step S066 selects a plurality of layer 3 sub-block input feature groups SBG3 (each of 3 × 3 features) from the layer 3 reuse input feature block data L3FU_I according to the i-th layer convolution kernel size (i.e., 3 × 3), then performs the convolution operation on each layer 3 sub-block input feature group SBG3 and the convolution parameter set to generate a layer 3 sub-block output feature, and combines the layer 3 sub-block output features corresponding to the layer 3 sub-block input feature groups SBG3 to form layer 3 output feature block data L3_O. The layer 3 output feature block data L3_O is equal to the output block data OB. The layer 3 output feature block size of L3_O is equal to (B_W − 2i) × B_H = (10 − 6) × 4 = 4 × 4, and the output block size of the output block data OB is equal to (B_W − 2D) × B_H = (10 − 6) × 4 = 4 × 4, as shown in layer 3 (L3) of fig. 3 and fig. 5.
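The size chain of this three-layer example can be verified with a short numpy sketch (ours, not the patent's reference implementation) using single-channel data and valid 3 × 3 convolutions; shapes are printed as width × height:

```python
import numpy as np

def conv2d_valid(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Valid 2-D convolution; the output shrinks by (kernel size - 1) per dimension."""
    kh, kw = k.shape
    h, w = x.shape
    return sum(k[a, b] * x[a:h - kh + 1 + a, b:w - kw + 1 + b]
               for a in range(kh) for b in range(kw))

kernel = np.ones((3, 3))
x = np.random.rand(6, 10)  # layer 1 reuse input: B_H + 2 = 6 rows, B_W = 10 columns
for layer in (1, 2, 3):
    y = conv2d_valid(x, kernel)
    print(f"layer {layer}: in {x.shape[::-1]} -> out {y.shape[::-1]}")
    if layer < 3:
        reused = np.random.rand(2, y.shape[1])  # 2 rows that would come from the block register
        x = np.vstack([reused, y])              # next layer's reuse input feature block
# layer 1: in (10, 6) -> out (8, 4)
# layer 2: in (8, 6) -> out (6, 4)
# layer 3: in (6, 6) -> out (4, 4)
```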
In the memory-optimized block-wise inference method 100 of the convolutional neural network of the present invention, when at least one of the input features of an i-th layer sub-block input feature group is located in the outer region of the i-th layer reuse input feature block data, the input features of the i-th layer sub-block input feature group include a plurality of outer block features and a plurality of first inner block features. The outer block features represent features already computed for the previous block, and the first inner block features represent features of the current block that have not yet been computed. In addition, when the input features of an i-th layer sub-block input feature group are all located in the inner region of the i-th layer reuse input feature block data, the input features of the i-th layer sub-block input feature group include only a plurality of second inner block features, which represent not-yet-computed features of the current block. The i-th layer reuse input feature block data is arranged in the order of the outer region followed by the inner region along the block scanning direction D2. For example, referring to fig. 6, when 6 of the 9 input features of the layer 1 sub-block input feature group SBG11 are located in the outer region OR of the layer 1 reuse input feature block data L1FU_I, the 9 input features of the layer 1 sub-block input feature group SBG11 include 6 outer block features and 3 inner block features. The outer block features represent computed features and are located in the outer region OR, while the inner block features represent not-yet-computed features and are located in the inner region IR. In addition, when the 9 input features of the layer 1 sub-block input feature group SBG12 are all located in the inner region IR of the layer 1 reuse input feature block data L1FU_I, the 9 input features of the layer 1 sub-block input feature group SBG12 include only 9 inner block features; that is, all 9 input features are inner block features. Moreover, the layer 1 reuse input feature block data L1FU_I is arranged in the order of the outer region OR followed by the inner region IR along the block scanning direction D2.
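For illustration only, a small helper (our assumption, consistent with the SBG11/SBG12 example above for k = 3) counts how many rows of a 3 × 3 sub-block window fall in the outer region OR:

```python
# Hypothetical helper: rows 0 .. k-2 of an i-th layer reuse input feature block
# form the outer region OR (already-computed features); the rest form IR.
def rows_in_outer_region(window_top_row: int, k_h: int = 3) -> int:
    """Rows of a k_h-tall window starting at window_top_row that lie in OR."""
    return max(0, (k_h - 1) - window_top_row)

print(rows_in_outer_region(0))  # 2 -> 2 rows x 3 cols = 6 outer features, as in SBG11
print(rows_in_outer_region(2))  # 0 -> the window lies entirely in IR, as in SBG12
```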
It should be noted that in the temporary storage step S08, the bottom k_Hi − 1 rows of the i-th layer recalculation input feature block data LiFC_I are stored into the block register for the next block, becoming the i-th layer reuse features LiFU of the next block. For example, after the layer 1 convolution operation of the block inference step S06 is performed, the temporary storage step S08 is performed, in which the bottom k_Hi − 1 rows of the layer 1 recalculation input feature block data L1FC_I are stored into the block register for the next block, i.e., they become the layer 1 reuse features L1FU of the next block. After the layer 2 convolution operation of the block inference step S06 is performed, the temporary storage step S08 stores the bottom k_Hi − 1 rows of the layer 2 recalculation input feature block data L2FC_I into the block register for the next block, i.e., they become the layer 2 reuse features L2FU of the next block. After the layer 3 convolution operation of the block inference step S06 is performed, the temporary storage step S08 stores the bottom k_Hi − 1 rows of the layer 3 recalculation input feature block data L3FC_I into the block register for the next block, i.e., they become the layer 3 reuse features L3FU of the next block. Therefore, the computation amount can be greatly reduced.
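A minimal sketch of this temporary storage step is given below; the names block_register, stash_reuse_rows and fetch_reuse_rows are hypothetical, with the block register modeled as a per-layer dict:

```python
import numpy as np

def stash_reuse_rows(block_register: dict, layer: int,
                     recalc_input: np.ndarray, k_h: int = 3) -> None:
    """Keep the bottom k_h - 1 rows of this layer's recalculation input block."""
    block_register[layer] = recalc_input[-(k_h - 1):, :].copy()

def fetch_reuse_rows(block_register: dict, layer: int) -> np.ndarray:
    """Return the rows stashed by the previous block in the same column strip."""
    return block_register[layer]
```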
Referring to fig. 1 to 7, fig. 7 is a schematic diagram illustrating channel shuffle according to a second embodiment of the present invention. The inference process of the present invention is applicable to the channel shuffle operation. The i-th layer reuse input feature block data LiFU_I has an i-th layer reuse input feature block size W1 × H1 and an i-th layer reuse input feature block channel number C1. The i-th layer intermediate feature block data Li_M has an i-th layer intermediate feature block size W2 × H2 and an i-th layer intermediate feature block channel number C2. The i-th layer output feature block data Li_O has an i-th layer output feature block size W3 × H3 and an i-th layer output feature block channel number C3. The i-th layer output feature block size W3 × H3 is larger than the i-th layer reuse input feature block size W1 × H1, and the i-th layer reuse input feature block size W1 × H1 is larger than the i-th layer intermediate feature block size W2 × H2, where W1, W2 and W3 are block widths and H1, H2 and H3 are block heights. In addition, the i-th layer reuse input feature block channel number C1 is equal to the i-th layer output feature block channel number C3, and the i-th layer intermediate feature block channel number C2 is greater than the i-th layer reuse input feature block channel number C1. For example, the i-th layer reuse input feature block size W1 × H1, the i-th layer intermediate feature block size W2 × H2 and the i-th layer output feature block size W3 × H3 may be 10 × 10, 8 × 8 and 16 × 16, respectively, and the i-th layer reuse input feature block channel number C1, the i-th layer intermediate feature block channel number C2 and the i-th layer output feature block channel number C3 may be 32, 128 and 32, respectively, but the invention is not limited thereto.
Therefore, the present invention can realize specific multi-layer convolution operations: when block-wise inference is performed, the already-computed features are reused along the block forward direction (i.e., the block scanning direction D2), and a recalculation mode is adopted in the other direction (i.e., the scan line-feed direction D1), so that block-wise inference can still greatly reduce the external memory bandwidth requirement without increasing too much computation or block register capacity.
Referring to fig. 1, fig. 2, fig. 8 and fig. 9, fig. 8 is a block diagram illustrating a block-wise inference system 200 for memory optimization of a convolutional neural network according to a third embodiment of the present invention; and fig. 9 is a flowchart illustrating a multi-layer convolution operation with 3 × 3 filters according to a fourth embodiment of the present invention. As shown, the memory-optimized block-wise inference system 200 of the convolutional neural network is used for processing an input image to generate an output image 110, and includes a block register 220 and an arithmetic processing unit 230. The input block data IB, the inference parameter set 212 and the convolution parameter set 214 are input to the arithmetic processing unit 230, and the output block data OB is output to form the output image 110. The block register 220 is used to access the i-th layer output feature block data and the i-th layer reuse features, and these two types of data are temporarily stored at different locations in the block register 220. In addition, the arithmetic processing unit 230 is electrically connected to the block register 220; the arithmetic processing unit 230 receives the input image and is configured to implement the block-wise inference method 100 for memory optimization of the convolutional neural network of fig. 1. The arithmetic processing unit 230 includes a convolution engine 232 (Convolution Engine), and the convolution engine 232 is used for performing the convolution operations. The arithmetic processing unit 230 can be a microprocessor, a central processing unit or an image processor, but the invention is not limited thereto. L1, L2 and LD represent the 1st, 2nd and D-th layers, respectively, and the layers L1 through LD are all operated by the convolution engine 232 of the arithmetic processing unit 230. In addition, the block register 220 can store the outer block features, and the block register 220 has a temporary storage space, which can be calculated from the width B_Wi of the i-th layer recalculation input feature block data, the convolution depth D, the layer number i, the channel number C and the i-th layer convolution kernel size k_Wi × k_Hi. The temporary storage space is denoted LBS (Line Buffer Size) and conforms to the following formula (1):
LBS = Σ_{i=1}^{D} B_Wi × (k_Hi − 1) × C    (1)
for example, if the first direction data selection step S062, the second direction data selection step S064 and the convolution operation step S066 are performed for each layer (i.e., i of the i-th layer is 1 to D), and k_Wi = k_Hi = k = 3, the temporary storage space conforms to the following formula (2):
LBS = Σ_{i=1}^{D} 2 × C × B_Wi = 2 × C × D × (B_W − D + 1)    (2)
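As a usage note for the line_buffer_size sketch given earlier, the closed form 2 × C × D × (B_W − D + 1) in formula (2) is our own simplification (an assumption, obtained by summing B_Wi = B_W − 2i + 2 over i = 1..D) and can be checked numerically:

```python
# Hypothetical check with B_W = 10, D = 3, C = 32, k = 3:
# 2*C*B_Wi summed over i = 1..3 gives 640 + 512 + 384 = 1536,
# matching the closed form 2*C*D*(B_W - D + 1) = 2*32*3*8 = 1536.
assert line_buffer_size(b_w=10, depth=3, channels=32, k_h=3) == 2 * 32 * 3 * (10 - 3 + 1)
```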
therefore, the block type inference system 200 for memory optimization of convolutional neural network of the present invention uses different feature calculation modes in different directions, so that the block type inference can still greatly reduce the bandwidth requirement of the external memory on the input block data IB and the output block data OB without increasing too much calculation amount and the block register 220.
Referring to fig. 1 and 10, fig. 10 is a schematic diagram showing the simulation comparison results of recalculation (FC), reuse (FU) and recalculation-and-reuse (FCFU) of the present invention. The parameter setting conditions are that the product value A is set to 64², the size of the output image 110 is 960 × 540, and k_Wi = k_Hi = k. The product value A is the minimum value of the block width B_W multiplied by the block height B_H. The multi-layer convolution operation of the present invention has a normalized throughput rate (NTR) obtained from the convolution depth D and the normalized computation rate (NCR), where the NCR is calculated from the block width B_W, the block height B_H, the convolution depth D and a variable h. The normalized throughput rate NTR and the normalized computation rate NCR of the present invention conform to the following formulas (3) and (4), respectively:
[Formula (3): normalized throughput rate NTR — equation image in the original]
[Formula (4): normalized computation rate NCR — equation image in the original]
as can be seen from FIG. 10, if there is a block register size limit S for the block register 220, the maximum supported convolution depth D that the FU can support is reusedmaxThe shallowest of the three; in contrast, recalculating FC can support a wide range of model convolution depths, but requires a high computational complexity, resulting in a significant reduction in the normalized throughput NTR. The recalculated and recycled FCFU of the present invention not only supports a wider range of model convolution depths than recycled FUs, but also provides a better normalized throughput NTR than recalculated FCs.
As can be seen from the above embodiments, the present invention has the following advantages. First, the block-wise inference method for memory optimization of the convolutional neural network uses calculation modes with different characteristics in different directions, so that block-wise inference can still greatly reduce the external memory bandwidth requirement without increasing too much computation or block register capacity. Second, the block-wise inference system for memory optimization of the convolutional neural network likewise uses calculation modes with different characteristics in different directions to achieve the same reduction of the external memory bandwidth requirement. Third, the recalculation-and-reuse of the present invention not only supports a wider range of model convolution depths than reuse, but also provides a better normalized throughput rate than recalculation.
Although the present invention has been described with reference to the above embodiments, it should be understood that various changes and modifications can be made therein by those skilled in the art without departing from the spirit and scope of the invention.

Claims (16)

1. A memory optimized block-wise inference method of a convolutional neural network for processing an input image, the memory optimized block-wise inference method of the convolutional neural network comprising the steps of:
a parameter setting step, setting an inference parameter group, wherein the inference parameter group comprises a convolution depth, a block width, a block height and a multilayer convolution kernel size;
a dividing step of driving an arithmetic processing unit to divide the input image into a plurality of input block data according to the convolution depth, the block width, the block height and the size of the multi-layer convolution kernel, wherein each input block data has an input block size;
a block inference step of driving the arithmetic processing unit to perform a multi-layer convolution operation on each input block data to generate an output block data, wherein the multi-layer convolution operation includes:
a first direction data selection step, selecting a plurality of i-th layer recalculation characteristics along a scanning line-feed direction according to a position of the output block data, and then selecting an i-th layer recalculation input characteristic block data according to the position of the output block data and the i-th layer recalculation characteristics, wherein i is one of a plurality of positive integers from 1 to the convolution depth;
a second direction data selection step of selecting a plurality of i-th layer reuse characteristics along a block scanning direction according to the i-th layer recalculated input characteristic block data, and combining the i-th layer recalculated input characteristic block data and the i-th layer reuse characteristics to generate i-th layer reuse input characteristic block data; and
a convolution operation step, selecting a plurality of i-th layer sub-block input feature groups from the i-th layer repeated utilization input feature block data according to the size of an i-th layer convolution kernel, then executing convolution operation on each i-th layer sub-block input feature group and a convolution parameter group to generate an i-th layer sub-block output feature, and combining the i-th layer sub-block output features corresponding to the i-th layer sub-block input feature groups to form i-th layer output feature block data; and
a temporary storage step, driving a block temporary storage to temporarily store the ith layer output characteristic block data and the ith layer recycling characteristics.
2. The memory-optimized, block-wise inference method of convolutional neural networks of claim 1,
when i is equal to 1, the i-th layer recalculated input feature block data is equal to each input block data; and
when i is equal to the convolution depth, the i-th layer output feature block data is equal to the output block data.
3. The method of claim 1, wherein the i-th layer recalculated input feature block data has an i-th layer recalculated input feature block size and an i-th layer recalculated input feature block channel number, the i-th layer output feature block data has an i-th layer output feature block size and an i-th layer output feature block channel number, the i-th layer output feature block size is larger than the i-th layer recalculated input feature block size, and the i-th layer recalculated input feature block channel number is equal to the i-th layer output feature block channel number.
4. The method of claim 1, wherein the block scan direction is perpendicular to the scan line-feed direction, the block width is greater than the block height, and an extension direction of the block height is parallel to the block scan direction.
5. The method of claim 1, wherein the convolution depth, the block width, and the block height are positive integers, the i-th layer convolution kernel size is k_Wi × k_Hi, the i-th layer reuse features have a reuse feature number along the block scanning direction, and the reuse feature number is equal to k_Hi − 1.
6. The memory-optimized block-wise inference method of the convolutional neural network of claim 1, wherein the block width is denoted as B_W, the convolution depth is denoted as D, and the block height is denoted as B_H;
the input block size is equal to B_W × B_H;
the output block data has an output block size equal to (B_W − 2D) × B_H;
the i-th layer recalculated input feature block data has an i-th layer recalculated input feature block size equal to (B_W − 2i + 2) × B_H;
the i-th layer reuse input feature block data has an i-th layer reuse input feature block size equal to (B_W − 2i + 2) × (B_H + 2);
the i-th layer output feature block data has an i-th layer output feature block size equal to (B_W − 2i) × B_H; and
the convolution depth is less than half of the block width.
7. The memory-optimized, block-wise inference method of convolutional neural networks of claim 1,
when at least one of the input features of the i-th layer sub-block input feature group is located in an outer region of the i-th layer recycling input feature block data, the input features of the i-th layer sub-block input feature group comprise a plurality of outer block features and a plurality of first inner block features, the outer block features represent operated features, and the first inner block features represent non-operated features;
when the input features of one of the i-th layer sub-block input feature groups are all located in an inner region of the i-th layer recycling input feature block data, the input features of the i-th layer sub-block input feature group only comprise a plurality of second inner region block features; and
the ith layer reuses input feature block data in the arrangement sequence of the outer area and the inner area along the block scanning direction.
8. The method of claim 7, wherein the plurality of outer block features are stored in a block register, the block register has a temporary storage space obtained from the i-th layer recalculated input feature block data width, the convolution depth, a layer number, a channel number and the i-th layer convolution kernel size, the i-th layer recalculated input feature block data width is denoted B_Wi, the convolution depth is denoted D, the layer number is denoted i, the channel number is denoted C, and the i-th layer convolution kernel size is k_Wi × k_Hi, the temporary storage space is denoted LBS and conforms to the following equation:
LBS = C × Σ_{i=1}^{D} (k_Hi - 1) × B_Wi
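The sketch below evaluates this scratch-space bound. It is hedged: the original equation image did not survive extraction, so the code assumes the reconstruction LBS = C × Σ (k_Hi - 1) × B_Wi given above, with B_Wi = B_W - 2i + 2 taken from claim 6; the function name and example values are illustrative.

```python
# Hedged sketch of the claim-8 scratch space, assuming
# LBS = C * sum_{i=1..D} (k_Hi - 1) * B_Wi with B_Wi = B_W - 2i + 2.
def scratch_space(B_W: int, C: int, kernel_sizes: list) -> int:
    """Per layer, keep (k_Hi - 1) reused rows of width B_Wi across C channels."""
    total = 0
    for i, (k_W, k_H) in enumerate(kernel_sizes, start=1):
        B_Wi = B_W - 2 * i + 2  # i-th layer recalculated input block width (claim 6)
        total += (k_H - 1) * B_Wi * C
    return total

# Example: a depth-4 network of 3x3 kernels, 64-wide blocks, 32 channels:
# 2 * (64 + 62 + 60 + 58) * 32 = 15616 stored features.
print(scratch_space(64, 32, [(3, 3)] * 4))
```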
9. A block-based inference system for memory optimization of a convolutional neural network, for processing an input image, the block-based inference system comprising:
a block register for accessing an i-th layer output feature block data and a plurality of i-th layer reuse features; and
an arithmetic processing unit electrically connected to the block register, the arithmetic processing unit receiving the input image and configured to perform operations comprising:
a parameter setting step of setting an inference parameter group, wherein the inference parameter group comprises a convolution depth, a block width, a block height and a plurality of multi-layer convolution kernel sizes;
a dividing step of dividing the input image into a plurality of input block data according to the convolution depth, the block width, the block height and the multi-layer convolution kernel sizes, wherein each of the input block data has an input block size; and
a block inference step of performing a multi-layer convolution operation on each of the input block data to generate an output block data, wherein the multi-layer convolution operation comprises:
a first-direction data selection step of selecting a plurality of i-th layer recalculated features along a scanning line-feed direction according to a position of the output block data, and then selecting an i-th layer recalculated input feature block data according to the position of the output block data and the i-th layer recalculated features, wherein i is a positive integer from 1 to the convolution depth;
a second-direction data selection step of selecting the i-th layer reuse features along a block scanning direction according to the i-th layer recalculated input feature block data, and combining the i-th layer recalculated input feature block data and the i-th layer reuse features to generate an i-th layer reuse input feature block data; and
a convolution operation step of selecting a plurality of i-th layer sub-block input feature groups from the i-th layer reuse input feature block data according to an i-th layer convolution kernel size, then performing a convolution operation on each i-th layer sub-block input feature group and a convolution parameter group to generate an i-th layer sub-block output feature, and combining the i-th layer sub-block output features corresponding to the i-th layer sub-block input feature groups into the i-th layer output feature block data.
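To make the two data-selection steps concrete, here is a minimal single-channel Python sketch of the claimed flow: recomputation along the scanning line-feed direction shows up as the 'valid' convolution shrinking the block width, while reuse along the block scanning direction shows up as prepending the stored rows. All names and shapes are assumptions for illustration; this is a sketch of the idea, not the patented implementation.

```python
# Hedged single-channel sketch of the block inference step, assuming 3x3 kernels.
# Axis 0 is the block width (line-feed direction, recomputed features);
# axis 1 is the block height (block scanning direction, reused features).
import numpy as np
from scipy.signal import convolve2d

def block_inference(block, reuse_rows, kernels):
    """block: (B_W, B_H) layer-1 recalculated input; reuse_rows[i]: the
    k_H - 1 = 2 rows kept from the previous block for layer i + 1."""
    x = block
    new_reuse = []
    for i, k in enumerate(kernels):
        # Second-direction data selection: prepend the reused rows to form
        # the reuse input feature block data for this layer.
        x_reused = np.concatenate([reuse_rows[i], x], axis=1)
        # Keep the last k_H - 1 rows of this layer's input for the next block.
        new_reuse.append(x_reused[:, -(k.shape[1] - 1):])
        # 'valid' convolution: the width shrinks by 2 (recomputed features);
        # the height returns to B_H (covered by the reused rows).
        x = convolve2d(x_reused, k, mode="valid")
    return x, new_reuse

# First block in a scan column: nothing to reuse yet, so seed with zeros.
B_W, B_H, D = 64, 8, 4
kernels = [np.random.randn(3, 3) for _ in range(D)]
reuse = [np.zeros((B_W - 2 * i, 2)) for i in range(D)]
out, reuse = block_inference(np.random.randn(B_W, B_H), reuse, kernels)
print(out.shape)  # (56, 8), i.e. (B_W - 2D) x B_H
```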
10. The system of claim 9, wherein:
when i is equal to 1, the i-th layer recalculated input feature block data is equal to the respective input block data; and
when i is equal to the convolution depth, the i-th layer output feature block data is equal to the output block data.
11. The system of claim 9, wherein the i-th layer recalculated input feature block data has an i-th layer recalculated input feature block size and an i-th layer recalculated input feature block channel number, the i-th layer output feature block data has an i-th layer output feature block size and an i-th layer output feature block channel number, the i-th layer output feature block size is smaller than the i-th layer recalculated input feature block size, and the i-th layer recalculated input feature block channel number is equal to the i-th layer output feature block channel number.
12. The system of claim 9, wherein the block scanning direction is perpendicular to the scanning line-feed direction, the block width is greater than the block height, and the extension direction of the block height is parallel to the block scanning direction.
13. The system of claim 9, wherein the convolution depth, the block width, and the block height are positive integers, the i-th layer convolution kernel size is k_Wi × k_Hi, and the i-th layer reuse features have a reuse feature number along the block scanning direction equal to k_Hi - 1.
14. The system of claim 9, wherein the block width is denoted B_W, the convolution depth is denoted D, and the block height is denoted B_H;
the input block size is equal to B_W × B_H;
the output block data has an output block size equal to (B_W - 2D) × B_H;
the i-th layer recalculated input feature block data has an i-th layer recalculated input feature block size equal to (B_W - 2i + 2) × B_H;
the i-th layer reuse input feature block data has an i-th layer reuse input feature block size equal to (B_W - 2i + 2) × (B_H + 2);
the i-th layer output feature block data has an i-th layer output feature block size equal to (B_W - 2i) × B_H; and
the convolution depth is less than half of the block width.
15. The system of claim 9, wherein:
when at least one of the input features of one of the i-th layer sub-block input feature groups is located in an outer region of the i-th layer reuse input feature block data, the input features of that i-th layer sub-block input feature group comprise a plurality of outer block features and a plurality of first inner block features, the outer block features representing already-computed features and the first inner block features representing not-yet-computed features;
when the input features of one of the i-th layer sub-block input feature groups are all located in an inner region of the i-th layer reuse input feature block data, the input features of that i-th layer sub-block input feature group comprise only a plurality of second inner block features; and
the i-th layer reuse input feature block data is arranged in the order of the outer region followed by the inner region along the block scanning direction.
16. The system of claim 15, wherein the plurality of outer block features are stored in the block register, the block register has a temporary storage space obtained from the i-th layer recalculated input feature block data width, the convolution depth, a layer number, a channel number and the i-th layer convolution kernel size, the i-th layer recalculated input feature block data width is denoted B_Wi, the convolution depth is denoted D, the layer number is denoted i, the channel number is denoted C, and the i-th layer convolution kernel size is k_Wi × k_Hi, the temporary storage space is denoted LBS and conforms to the following equation:
LBS = C × Σ_{i=1}^{D} (k_Hi - 1) × B_Wi
CN202010922472.8A 2019-10-08 2020-09-04 Block type inference method and system for memory optimization of convolution neural network Pending CN112633462A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962912630P 2019-10-08 2019-10-08
US62/912,630 2019-10-08

Publications (1)

Publication Number Publication Date
CN112633462A (en) 2021-04-09

Family

ID=75300104

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010922472.8A Pending CN112633462A (en) 2019-10-08 2020-09-04 Block type inference method and system for memory optimization of convolution neural network

Country Status (2)

Country Link
CN (1) CN112633462A (en)
TW (1) TWI765336B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114118389B (en) * 2022-01-28 2022-05-10 深圳鲲云信息科技有限公司 Neural network data processing method, device and storage medium

Patent Citations (12)

Publication number Priority date Publication date Assignee Title
WO2017015649A1 (en) * 2015-07-23 2017-01-26 Mireplica Technology, Llc Performance enhancement for two-dimensional array processor
CN106779146A (en) * 2016-11-15 2017-05-31 广州铁路职业技术学院 A kind of tourism service system for providing recommendation tourism route
CN107437110A (en) * 2017-07-11 2017-12-05 中国科学院自动化研究所 The piecemeal convolution optimization method and device of convolutional neural networks
US20180096249A1 (en) * 2016-10-04 2018-04-05 Electronics And Telecommunications Research Institute Convolutional neural network system using adaptive pruning and weight sharing and operation method thereof
US20180095632A1 (en) * 2016-10-04 2018-04-05 Sas Institute Inc. Interactive visualizations of a convolutional neural network
US20180131946A1 (en) * 2016-11-07 2018-05-10 Electronics And Telecommunications Research Institute Convolution neural network system and method for compressing synapse data of convolution neural network
KR101847874B1 (en) * 2017-06-28 2018-05-25 서경대학교 산학협력단 Image recognition method using convolution neural network and recording medium thereof
CN108415881A (en) * 2017-02-10 2018-08-17 耐能股份有限公司 The arithmetic unit and method of convolutional neural networks
WO2019015144A1 (en) * 2017-07-21 2019-01-24 北京市商汤科技开发有限公司 Image processing method and system, storage medium, and computing device
US20190147319A1 (en) * 2017-11-14 2019-05-16 Samsung Electronics Co., Ltd. Device and method for processing convolution operation using kernel
US20190147332A1 (en) * 2017-11-14 2019-05-16 Advanced Micro Devices, Inc. Memory bandwidth reduction techniques for low power convolutional neural network inference applications
US20190188240A1 (en) * 2017-12-18 2019-06-20 International Business Machines Corporation Processor and memory transparent convolutional lowering and auto zero padding for deep neural network implementations

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US10083395B2 (en) * 2015-05-21 2018-09-25 Google Llc Batch processing in a neural network processor
US10878273B2 (en) * 2017-07-06 2020-12-29 Texas Instruments Incorporated Dynamic quantization for deep neural network inference system and method
CN110135553B (en) * 2018-02-09 2021-09-03 宏达国际电子股份有限公司 Convolutional neural network adjusting method and electronic device
CN110175636A (en) * 2019-05-08 2019-08-27 深圳欧翼思特科技有限公司 A kind of Internet of Things deep neural network distribution differentiation inference system and method

Non-Patent Citations (6)

Title
CHEN PENG et al.: "Finger vein recognition based on scattering convolution network", Journal of Zhejiang University of Technology, vol. 46, no. 1, 5 February 2018, pages 56-60 *
G. LI et al.: "Block Convolution: Toward Memory-Efficient Inference of Large-Scale CNNs on FPGA", 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), 23 April 2018, pages 1163-1166 *
XIAOCONG LIAN et al.: "High-Performance FPGA-Based CNN Accelerator With Block-Floating-Point Arithmetic", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 8, 16 May 2019, pages 1874, XP011736689, DOI: 10.1109/TVLSI.2019.2913958 *
YAN, J. L. et al.: "GNA: Reconfigurable and Efficient Architecture for Generative Network Acceleration", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 11, 1 November 2018, pages 2519-2529, XP011692601, DOI: 10.1109/TCAD.2018.2857258 *
XIA SONG: "Implementation and Research of an Associative Memory Based on Hopfield Neural Networks", China Master's Theses Full-text Database (Information Science and Technology), no. 2010, 15 October 2010, pages 140-54 *
LU ZHIJIAN: "Research on FPGA-Based Parallel Architectures for Convolutional Neural Networks", China Doctoral Dissertations Full-text Database (Information Science and Technology), no. 2014, 15 April 2014, pages 140-12 *

Also Published As

Publication number Publication date
TW202115624A (en) 2021-04-16
TWI765336B (en) 2022-05-21

Similar Documents

Publication Publication Date Title
US20180268234A1 (en) Object Detection And Recognition Apparatus Based On CNN Based Integrated Circuits
US10402628B2 (en) Image classification systems based on CNN based IC and light-weight classifier
US20180189595A1 (en) Implementation Of MobileNet In A CNN Based Digital Integrated Circuit
US10339445B2 (en) Implementation of ResNet in a CNN based digital integrated circuit
US20190087725A1 (en) Approximating Fully-Connected Layers With Multiple Arrays Of 3x3 Convolutional Filter Kernels In A CNN Based Integrated Circuit
US9305329B2 (en) Low memory content aware fill
US20180157940A1 (en) Convolution Layers Used Directly For Feature Extraction With A CNN Based Integrated Circuit
US5845017A (en) Digital image processing method for degraining of film images using distance weighted averaging of target pixel code values
US10387772B1 (en) Ensemble learning based image classification systems
JP6927320B2 (en) Inference device, convolution operation execution method and program
US11526723B2 (en) Apparatus and methods of obtaining multi-scale feature vector using CNN based integrated circuits
US11694069B2 (en) Methods for processing data in an efficient convolutional engine with partitioned columns of convolver units
WO2022206556A1 (en) Matrix operation method and apparatus for image data, device, and storage medium
JP2007266699A (en) Image data conversion processing apparatus and method
CN112633462A (en) Block type inference method and system for memory optimization of convolution neural network
US20190318226A1 (en) Deep Learning Image Processing Systems Using Modularly Connected CNN Based Integrated Circuits
US20210103793A1 (en) Block-based inference method for memory-efficient convolutional neural network implementation and system thereof
CN110991627A (en) Information processing apparatus, information processing method, and computer program
Vaksman et al. Patch ordering as a regularization for inverse problems in image processing
US20190347812A1 (en) Apparatus and method for image-distance transformation using bi-directional scans
CN107111878B (en) Data processing method, apparatus and system
KR102453370B1 (en) Method and Apparatus for High-Speed Low-Power Processing in Large-Scale Deep Neural Network
CN113657587B (en) Deformable convolution acceleration method and device based on FPGA
CN111832585B (en) Image processing method and device
CN114662647A (en) Processing data for layers of a neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination