WO2020029767A1 - Basic computing unit for convolutional neural network, and computing method


Info

Publication number
WO2020029767A1
Authority
WO
WIPO (PCT)
Prior art keywords
image data
unit
convolution operation
basic
units
Prior art date
Application number
PCT/CN2019/096750
Other languages
French (fr)
Chinese (zh)
Inventor
李朋
赵鑫鑫
姜凯
于治楼
Original Assignee
山东浪潮人工智能研究院有限公司
Priority date
Filing date
Publication date
Application filed by 山东浪潮人工智能研究院有限公司
Publication of WO2020029767A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Definitions

  • the present invention relates to the field of computer technology, and in particular, to a basic calculation unit and calculation method of a convolutional neural network.
  • CNN: Convolutional Neural Network
  • the CNN algorithm can be implemented based on a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit).
  • CPU and GPU implementations realize the algorithm in software.
  • the invention provides a basic calculation unit and calculation method of a convolutional neural network, which can implement an algorithm based on hardware so that the completion time of the algorithm can be controlled.
  • the present invention provides a basic computing unit of a convolutional neural network, including:
  • a controller, an addition tree, an input buffer, at least one computing unit, and an output buffer;
  • the calculation unit includes a block random access memory, at least one convolution operation unit, an internal adder, and an activation pooling unit; and the block random access memory and the internal adder are connected to each of the convolution operation units, respectively.
  • the internal adder and the activation pooling unit are both connected to the addition tree;
  • the input-side buffer is configured to execute, for each of the computing units: based on the control of the controller, loading the buffered image data of a first line number of the at least one feature map to be processed into the current computing unit, where the current computing unit corresponds to the first line number;
  • the block random access memory is used for buffering each line of image data loaded by the input-side buffer, and, for each of the convolution operation units, executing: based on the control of the controller and on each line of image data cached by itself, sending the number of valid lines and the number of starting lines to the current convolution operation unit;
  • the convolution operation unit is configured to obtain the image data of a second line number from all the image data buffered in the block random access memory according to the number of valid lines and the number of starting lines issued, where the second line number is the number of lines it requires to perform a basic convolution operation; the basic convolution operation is performed on the image data of the second line number and the result is sent to the internal adder;
  • the internal adder is configured to process the image data sent by the convolution operation unit and send the image data to the addition tree;
  • the addition tree is used to process the image data sent by each of the internal adders based on the control of the controller and send the processed image data to an activation pooling unit;
  • the activation pooling unit is configured to process the image data sent from the addition tree based on the control of the controller, and send the processed image data to the output buffer.
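The pipeline claimed above (input buffer, block RAM, convolution operation units, internal adder, addition tree, activation pooling unit, output buffer) can be sketched in software. This is an illustrative model only, assuming 3 × 3 kernels, one feature map per convolution operation unit, and a ReLU standing in for the activation pooling unit; the function names are ours, not the patent's.

```python
import numpy as np

def conv3x3_valid(rows, kernel):
    """One basic convolution over a band of rows ('valid' padding)."""
    h, w = rows.shape
    out = np.zeros((h - 2, w - 2))
    for r in range(h - 2):
        for c in range(w - 2):
            out[r, c] = np.sum(rows[r:r + 3, c:c + 3] * kernel)
    return out

def basic_unit(feature_maps, kernels):
    """One calculation unit: each conv unit processes one map, the
    internal adder / addition tree sum the partial results, and a
    ReLU stands in for the activation pooling unit."""
    partial = [conv3x3_valid(f, k) for f, k in zip(feature_maps, kernels)]
    summed = np.sum(partial, axis=0)   # internal adder + addition tree
    return np.maximum(summed, 0.0)     # activation (ReLU)

maps = [np.ones((5, 5)) for _ in range(4)]       # 4 conv units, one map each
kernels = [np.full((3, 3), 0.25) for _ in range(4)]
out = basic_unit(maps, kernels)
print(out.shape)   # (3, 3); each value is 4 * 9 * 0.25 = 9.0
```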
  • the activation pooling unit includes: an activation module, a first selector, a pooling module, a second selector, and a third selector;
  • One input terminal of the first selector is connected to the input terminal of the activation module, the other input terminal is connected to the output terminal of the activation module, and the output terminal is connected to the input terminal of the pooling module;
  • One input terminal of the second selector is connected to the input terminal of the pooling module, the other input terminal is connected to the output terminal of the pooling module, and the output terminal is connected to the first input terminal of the third selector.
  • the first input terminal of the third selector is connected to the output terminal of the second selector, the second input terminal is connected to other computing units in the at least one computing unit, the third input terminal is connected to the input buffer, the first output terminal is connected to the output buffer, and the second output terminal is connected to the input terminal of the activation module;
  • the activation pooling unit is configured to control whether the activation module and / or the pooling module work based on the control of the controller;
  • the third selector is configured, based on the control of the controller, to control which one of the first input terminal, the second input terminal, and the third input terminal works, and to control whether the first output terminal or the second output terminal works.
  • the activation pooling unit further includes: an offset processing unit;
  • the offset processing unit is respectively connected to the addition tree and the activation module
  • the second output terminal is connected to the input terminal of the activation module through the offset processing unit.
  • the current calculation unit includes 4 convolution operation units
  • the number of lines required by any of the four convolution operation units to perform a basic convolution operation is 3, and the step size is 1;
  • the first number of rows corresponding to the current computing unit is 16 rows.
  • the input buffer is further configured to determine the first line number corresponding to each of the calculation units according to formula 1, formula 2, and formula 3, where the step size of any of the convolution operation units when performing a basic convolution operation is 1;
  • the formula one includes:
  • the formula two includes:
  • the formula three includes:
  • X_ij is the number of lines required by the j-th convolution operation unit in the i-th calculation unit of the at least one calculation unit to perform a basic convolution operation;
  • a_i is a coefficient corresponding to the i-th calculation unit, and a_i is an integer;
  • m is the number of convolution operation units in the i-th calculation unit;
  • x_i is an optimization variable corresponding to the i-th calculation unit;
  • y_i is an intermediate value corresponding to the i-th calculation unit;
  • D is the amount of storage resources occupied by one line of image data, in bytes;
  • n is the number of the at least one calculation unit;
  • N is the amount of storage resources the external chip can provide to the basic calculation unit of the convolutional neural network, in bytes;
  • T is the amount of storage resources used by other modules in the basic calculation unit of the convolutional neural network, in bytes;
  • Y_i is the first line number corresponding to the i-th calculation unit;
  • max{} takes the maximum value.
  • the input buffer is further configured to determine the first line number corresponding to each of the calculation units according to formulas 4 and 5, where the number of lines required by different convolution operation units to perform the basic convolution operation is the same, the step size is 1, and the number of convolution operation units included in different calculation units is the same;
  • the formula four includes:
  • the formula five includes:
  • A is the number of lines required by any of the convolution operation units to perform a basic convolution operation;
  • X is the number of convolution operation units included in any of the calculation units;
  • k is an integer;
  • N is the amount of storage resources the external chip can provide, in bytes;
  • T is the amount of storage resources used by other modules in the basic calculation unit of the convolutional neural network, in bytes;
  • D is the amount of storage resources occupied by one line of image data, in bytes;
  • n is the number of the at least one calculation unit;
  • Y is the first line number corresponding to each of the calculation units;
  • max{} takes the maximum value.
  • the external chip includes: an FPGA (Field-Programmable Gate Array) chip or a custom chip;
  • FPGA Field-Programmable Gate Array
  • the other modules include any one or more of a pooling module included in the activation pooling unit, the output buffer, and the addition tree.
  • when the number of lines required by any of the convolution operation units to perform a basic convolution operation is 3 and the step size is 1, for any of the block random access memories: of the image data of any two adjacent caches, the second-to-last line of image data cached earlier is the same as the first line of image data cached later, and the last line of image data cached earlier is the same as the second line of image data cached later;
  • the input buffer is further configured to execute, for each feature map to be processed: when the controller sets the PAD of the current feature map to be processed to 1, performing peripheral zero-padding so that the numbers of rows and columns of the current feature map are each increased by 2; the image data loading operation is then performed based on the zero-padded feature map.
  • the present invention provides a calculation method based on a basic calculation unit of any of the convolutional neural networks, including:
  • each line of image data loaded by the input buffer is buffered through the block random access memory; for each of the convolution operation units, based on the control of the controller and on each line of image data cached by itself, the number of valid lines and the number of starting lines is sent to the current convolution operation unit;
  • the image data sent from the addition tree is processed and sent to the output buffer.
  • the invention provides a basic calculation unit and calculation method of a convolutional neural network.
  • the basic calculation unit includes a controller, an addition tree, an input buffer, a plurality of calculation units, and an output buffer.
  • the calculation unit includes a block random access memory, several convolution operation units, an internal adder, and an activation pooling unit.
  • the input buffer loads the corresponding number of lines of image data into each calculation unit, and the block random access memory sends the number of valid lines and the number of starting lines to each convolution operation unit so that it obtains the corresponding number of lines of image data;
  • the convolution operation unit processes the image data and sends it to the addition tree via an internal adder;
  • the addition tree processes the image data sent by each internal adder and sends it to an activation pooling unit;
  • the activation pooling unit processes the image data and sends it to the output buffer.
  • the invention can implement the algorithm based on hardware so that the completion time of the algorithm can be controlled.
  • FIG. 1 is a schematic diagram of a basic calculation unit of a convolutional neural network according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of a computing unit in a convolutional neural network according to an embodiment of the present invention
  • FIG. 3 is a flowchart of a calculation method of a basic calculation unit based on a convolutional neural network according to an embodiment of the present invention.
  • an embodiment of the present invention provides a basic calculation unit of a convolutional neural network, which may include:
  • a controller 101, an addition tree 102, an input buffer 103, at least one calculation unit 104, and an output buffer 105;
  • the calculation unit 104 includes a block random memory 1041, at least one convolution operation unit 1042, an internal adder 1043, and an activation pooling unit 1044.
  • the block random access memory 1041 and the internal adder 1043 are respectively connected to each of the convolution operation units 1042, and the internal adder 1043 and the activation pooling unit 1044 are connected to the addition tree 102;
  • the input-side buffer 103 is configured to execute, for each of the computing units 104: based on the control of the controller 101, loading the buffered image data of the first line number of the at least one to-be-processed feature map into a current calculation unit, where the current calculation unit corresponds to the first line number;
  • the block random access memory 1041 is configured to buffer each line of image data loaded by the input-side buffer 103 and, for each of the convolution operation units 1042, execute: based on the control of the controller 101 and on each line of image data cached by itself, sending the number of valid lines and the number of starting lines to the current convolution operation unit;
  • the convolution operation unit 1042 is configured to obtain the image data of the second line number from all the image data buffered in the block random access memory 1041 according to the number of valid lines and the number of starting lines sent, where the second line number is the number of lines it requires to perform a basic convolution operation; the basic convolution operation is performed on the image data of the second line number and the result is sent to the internal adder 1043;
  • the internal adder 1043 is configured to process the image data sent by the convolution operation unit 1042 and send the image data to the addition tree 102;
  • the addition tree 102 is configured to process, based on the control of the controller 101, image data sent by each of the internal adders 1043 and send the image data to an activation pooling unit 1044;
  • the activation pooling unit 1044 is configured to process the image data sent from the addition tree 102 based on the control of the controller 101 and send the processed image data to the output buffer 105.
  • An embodiment of the present invention provides a basic calculation unit of a convolutional neural network, including a controller, an addition tree, an input buffer, a plurality of calculation units, and an output buffer.
  • the calculation unit includes a block random access memory, a number of convolution operation units, an internal adder, and an activation pooling unit.
  • the input buffer loads the corresponding number of lines of image data into each calculation unit, and the block random access memory sends the number of valid lines and the number of starting lines to each convolution operation unit so that it obtains the corresponding number of lines of image data;
  • the convolution operation unit processes the image data and sends it to the addition tree via an internal adder;
  • the addition tree processes the image data sent by each internal adder and sends it to an activation pooling unit;
  • the activation pooling unit processes the image data and sends it to the output buffer.
  • the embodiment of the present invention can implement an algorithm based on hardware, so that the completion time of the algorithm can be controlled.
  • the effective number of rows may be the number of rows currently actually loaded in the block RAM.
  • loading the image data from the input buffer to the block random memory, the number of rows to be loaded, and which row to load from can be controlled by the controller according to the upper-layer command.
  • the controller can control each of the other unit modules in the basic computing unit according to upper-layer commands. For example, after the convolution operation is completed, addition is performed through the addition tree module, data is then processed according to the activation algorithm, and pooling is then performed; all of these steps are coordinated by the controller. For example, the controller controls whether activation or pooling is performed. For example, the controller sends the parameters for data loading, data storage, and data calculation to the other unit modules, and the other unit modules report back whether they have completed.
  • the controller is implemented in Xilinx FPGA PL, and is connected to the ARM (Xilinx FPGA PS) via the AXI (Advanced eXtensible Interface) bus.
  • ARM: Xilinx FPGA PS
  • PS: Processing System
  • SoC: System on Chip
  • PL: Programmable Logic
  • through a reasonable combination of the convolution, pooling, addition tree, data storage, and control modules, the design of the basic calculation unit of the convolutional neural network proposed by the embodiment of the present invention is flexible and universal; being general-purpose, it can realize many convolutional neural network algorithms and therefore has broad application prospects.
  • the controller may be represented by Controller;
  • the input buffer may be represented by IN BUFFER;
  • the output buffer may be represented by OUT BUFFER;
  • the addition tree may be represented by ADDER TREE;
  • the at least one calculation unit may be represented by CUA (Calculate Unit Array), and a single calculation unit by CU;
  • the block random access memory may be represented by Block RAM;
  • the convolution operation unit may be represented by CONV;
  • the internal adder may be represented by Inter Adder;
  • the activation module may be represented by RELU (Rectified Linear Unit);
  • the pooling module may be represented by POOL;
  • a selector may be represented by MUX (multiplexer).
  • the processing results of the internal adder of each calculation unit may all be output to the addition tree.
  • After the addition tree aggregates the processing results, it sends all of them, according to the control information issued by the controller, to the activation pooling unit of the calculation unit designated by that control information for subsequent activation, pooling, and other processing.
  • the activation pooling unit 1044 includes: an activation module 10441, a first selector 10442, a pooling module 10443, a second selector 10444, and a third selector 10445.
  • One input terminal of the first selector 10442 is connected to the input terminal of the activation module 10441, the other input terminal is connected to the output terminal of the activation module 10441, and the output terminal is connected to the input terminal of the pooling module 10443;
  • One input terminal of the second selector 10444 is connected to the input terminal of the pooling module 10443, the other input terminal is connected to the output terminal of the pooling module 10443, and the output terminal is connected to the first input terminal of the third selector 10445;
  • the first input terminal of the third selector 10445 is connected to the output terminal of the second selector 10444, the second input terminal is connected to other computing units 104 in the at least one computing unit 104, and the third input A terminal is connected to the input buffer 103, a first output is connected to the output buffer 105, and a second output is connected to the input of the activation module 10441;
  • the activation pooling unit 1044 is configured to control whether the activation module 10441 and / or the pooling module 10443 work based on the control of the controller 101;
  • the third selector 10445 is configured, based on the control of the controller 101, to control which one of the first input terminal, the second input terminal, and the third input terminal works, and to control whether the first output terminal or the second output terminal works.
  • By controlling which input terminal of the third selector works and which output terminal works, the processing flow of the image data in the calculation unit can be determined.
  • The second input terminal of the third selector is connected to another calculation unit; this other calculation unit may be any calculation unit other than the one in which this third selector is located.
  • the activation pooling unit 1044 further includes: an offset processing unit 10446;
  • the offset processing unit 10446 is connected to the addition tree 102 and the activation module 10441, respectively;
  • the second output terminal is connected to the input terminal of the activation module 10441 through the offset processing unit 10446.
  • For example, the offset processing may increase each value by 1 before the subsequent activation and pooling operations are performed.
  • the block random access memory is responsible for buffering the input data required during the convolution operation to ensure that each connected convolution operation unit can operate normally. For example, suppose a convolution operation unit uses a 3 × 3 convolution operation as the basic convolution unit to implement convolution operations of other dimensions. The convolution operation unit then requires 3 lines of input data, so the block random access memory needs to cache at least these 3 lines.
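The 3-line buffering requirement above can be sketched with a small ring buffer. This is an illustrative model only, not the patent's block RAM implementation; the class name is ours.

```python
from collections import deque

class LineBuffer:
    """Minimal sketch of the block RAM's role: hold the last
    `kernel_rows` image rows so a 3x3 convolution unit always has
    a complete 3-row band to work on."""
    def __init__(self, kernel_rows=3):
        self.rows = deque(maxlen=kernel_rows)

    def push(self, row):
        self.rows.append(row)  # oldest row is evicted automatically

    def window(self):
        # Returns the current band once enough rows have arrived.
        if len(self.rows) == self.rows.maxlen:
            return list(self.rows)
        return None

buf = LineBuffer()
for i in range(4):          # stream rows 0..3
    buf.push(i)
print(buf.window())          # rows 1, 2, 3 are currently buffered
```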
  • the current calculation unit includes four convolution operation units 1042;
  • any one of the four convolution operation units 1042 needs 3 lines and a step size of 1 to perform a basic convolution operation
  • the first number of rows corresponding to the current computing unit is 16 rows.
  • each convolution operation unit needs 3 rows to perform basic convolution operations.
  • during image processing, the following situations can exist:
  • the image data processed by each convolution operation unit may come from different feature maps to be processed.
  • the data buffered in the block random memory can come from the same feature map or from multiple feature maps, and the specific implementation can be controlled according to the controller.
  • the block random memory needs to store at least three lines of image data, that is, the number of first lines corresponding to the current calculation unit is not less than three.
  • the block random memory needs to store at least 12 lines of image data, that is, the number of first lines corresponding to the current calculation unit is not less than 12.
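The two minimums above (3 lines when all convolution operation units share one feature map, 12 lines when each processes a different map) can be checked with a small helper. The function is illustrative, assuming a 3-line kernel and 4 convolution operation units as in the example.

```python
def min_buffered_rows(kernel_rows, conv_units, same_map):
    """Lower bound on the first line number of a calculation unit:
    kernel_rows if all conv units read the same feature map,
    kernel_rows * conv_units if each reads a different map."""
    return kernel_rows if same_map else kernel_rows * conv_units

print(min_buffered_rows(3, 4, same_map=True))    # 3
print(min_buffered_rows(3, 4, same_map=False))   # 12
```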
  • the feature maps to be processed are mostly 128 ⁇ 128, 256 ⁇ 256, 512 ⁇ 512, etc.
  • the last load is usually not 12 rows. For example, when the feature map is 128 × 128, it needs to be loaded 11 times, and only 8 rows are loaded the 11th time.
  • When the feature map is 64 × 64, it needs to be loaded 6 times, and only 6 rows are loaded the 6th time. In this way, resources are wasted during the last load.
  • the number of rows loaded into the block RAM at a time can also be a power of two not less than 12; for example, 16 rows can be selected.
  • When more resources are available in the block random access memory, 32 rows or even 64 rows can also be selected.
  • When the number of rows of the feature maps to be processed is usually large, the number of rows selected should not be too small; when it is usually small, the number of rows selected should not be too large.
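The load-count arithmetic above can be verified with a short helper. This is an illustrative sketch that assumes no overlap between consecutive loads; the function name is ours.

```python
import math

def loads(map_rows, rows_per_load):
    """Number of loads needed for a feature map, and the size of the
    final (possibly partial) load."""
    n = math.ceil(map_rows / rows_per_load)
    last = map_rows - (n - 1) * rows_per_load
    return n, last

print(loads(128, 12))   # (11, 8): 11 loads, only 8 rows the 11th time
print(loads(128, 16))   # (8, 16): 8 full loads, no waste at the end
```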
  • the input buffer 103 is further configured to determine the first line number corresponding to each of the calculation units 104 according to the following formula (1) to formula (3), where the step size of any convolution operation unit 1042 when performing a basic convolution operation is 1;
  • X_ij is the number of lines required by the j-th convolution operation unit in the i-th calculation unit of the at least one calculation unit to perform a basic convolution operation;
  • a_i is a coefficient corresponding to the i-th calculation unit, and a_i is an integer;
  • m is the number of convolution operation units in the i-th calculation unit;
  • x_i is an optimization variable corresponding to the i-th calculation unit;
  • y_i is an intermediate value corresponding to the i-th calculation unit;
  • D is the amount of storage resources occupied by one line of image data, in bytes;
  • n is the number of the at least one calculation unit;
  • N is the amount of storage resources the external chip can provide to the basic calculation unit of the convolutional neural network, in bytes;
  • T is the amount of storage resources used by other modules in the basic calculation unit of the convolutional neural network, in bytes;
  • Y_i is the first line number corresponding to the i-th calculation unit;
  • max{} takes the maximum value.
  • this design is particularly suitable for application scenarios where the number of rows of the feature map to be processed is usually large.
  • the value of a i can be flexibly set according to the control information of the controller.
  • when a_i is 0, in the above case 3, the rows loaded by the block random access memory in one load can be acquired only once by each convolution operation unit, which causes the block random access memory to repeatedly load image data from the input buffer.
  • when the step size is 1: when a_i is 1, in the above case 3, the rows loaded by the block random access memory at a time can support each convolution operation unit in acquiring them twice consecutively; when a_i is 2, three consecutive acquisitions are supported; when a_i is 3, four consecutive acquisitions are supported; and so on.
  • since Y_i is not necessarily a multiple of 2, this scheme is better suited to cases where the feature map to be processed requires peripheral zero-padding. Since Y_i takes the maximum value, repeated loading of image data from the input buffer by the block random access memory is reduced as much as possible, thus improving the overall processing efficiency.
  • the input buffer 103 is further configured to determine the first line number corresponding to each of the calculation units 104 according to the following formula (4) and formula (5), where the number of lines required by different convolution operation units 1042 to perform the basic convolution operation is the same, the step size is 1, and the number of convolution operation units 1042 included in different calculation units 104 is the same;
  • A is the number of lines required by any of the convolution operation units to perform a basic convolution operation;
  • X is the number of convolution operation units included in any of the calculation units;
  • k is an integer;
  • N is the amount of storage resources the external chip can provide, in bytes;
  • T is the amount of storage resources used by other modules in the basic calculation unit of the convolutional neural network, in bytes;
  • D is the amount of storage resources occupied by one line of image data, in bytes;
  • n is the number of the at least one calculation unit;
  • Y is the first line number corresponding to each of the calculation units;
  • max{} takes the maximum value.
  • the external chip includes: an FPGA chip or a custom chip;
  • the other modules include any one or more of a pooling module 10443 included in the activation pooling unit 1044, the output buffer 105, and the addition tree 102.
  • when the number of lines required by any of the convolution operation units 1042 to perform a basic convolution operation is 3 and the step size is 1, for any of the block random access memories 1041: of the image data of any two adjacent caches, the second-to-last line of image data cached earlier is the same as the first line of image data cached later, and the last line of image data cached earlier is the same as the second line of image data cached later.
  • the block random memory P is currently loaded with 16 lines of image data, and the 4 convolution operation units connected all process the 16 lines of image data, and each uses a 3 ⁇ 3 convolution operation as the basic convolution unit.
  • the number of lines required to perform the basic convolution operation is 3, and the step size is set to 1.
  • when the controller controls the block random access memory P to send the number of valid rows 16 and the starting row 1 to the convolution operation unit Q, Q obtains lines 1 to 3 of the 16 rows. When the controller controls the block random access memory P to send the number of valid rows 16 and the starting row 2, the convolution operation unit Q obtains lines 2 to 4 of the 16 rows. This loop repeats until the convolution operation unit Q obtains lines 14 to 16 of the 16 rows, completing the processing of the 16 rows of image data.
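The sweep of starting rows described above can be sketched as follows. This is an illustrative helper (1-based rows, 3-line windows, step size 1 as in the example), not the controller's actual logic.

```python
def row_windows(valid_rows, kernel_rows=3, stride=1):
    """Start/end rows the controller issues so a conv unit sweeps
    every 3-row band of the currently loaded block (1-based)."""
    return [(s, s + kernel_rows - 1)
            for s in range(1, valid_rows - kernel_rows + 2, stride)]

windows = row_windows(16)
print(windows[0], windows[1], windows[-1])  # (1, 3) (2, 4) (14, 16)
print(len(windows))                          # 14 basic convolutions per block
```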
  • the input buffer 103 is further configured to execute, for each feature map to be processed: when the controller 101 sets the PAD of the current feature map to be processed to 1, performing zero-padding on the periphery of the current feature map so that its numbers of rows and columns are each increased by 2, and then performing the image data loading operation based on the zero-padded feature map.
  • the size of the feature map to be processed is 512 ⁇ 512, that is, it has 512 rows of image data and 512 columns of image data.
  • the first to third rows are processed, then the second to fourth rows are processed, then the third to fifth rows are processed, and so on. It ends after processing lines 510 to 512. It can be seen that it needs to be processed 510 times, so the size of the result obtained is 510 ⁇ 510, which is two rows less than the original image and two columns less in the number of columns.
  • since the convolution processing operation needs to be executed many times in a loop, the feature map shrinks every time it is executed.
  • therefore, PAD can be set to 1: before the convolution processing operation is performed, the periphery of the feature map is supplemented with a ring of zero values, so that its size changes from 512 × 512 to 514 × 514. In this way, after one convolution processing operation, an image of size 514 × 514 is processed into an image of size 512 × 512. Repeating this keeps the feature map size constant regardless of the number of convolution processing operations.
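The size bookkeeping above can be checked with the standard convolution output-size formula. This is a sketch under the example's assumptions (3 × 3 kernel, stride 1); the function name is ours.

```python
def output_size(rows, cols, kernel=3, pad=1, stride=1):
    """Output size of one convolution pass with `pad` rings of
    peripheral zero-padding on each side."""
    return ((rows + 2 * pad - kernel) // stride + 1,
            (cols + 2 * pad - kernel) // stride + 1)

print(output_size(512, 512, pad=0))   # (510, 510): shrinks without padding
print(output_size(512, 512, pad=1))   # (512, 512): PAD=1 keeps size constant
```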
  • the value of PAD is equivalent to the number of rings of zero-padding around the periphery of the feature map.
  • zero padding can be performed either in the input buffer or in the block random access memory.
  • in one approach, zero padding is performed first, and the image data is then loaded based on the zero-padded feature map.
  • for example, the feature map has 32 rows before zero padding and 34 rows after zero padding.
  • the zero-padded feature map has zero values in its first row, last row, first column, and last column. Assuming 16 rows are loaded at a time, the first load takes rows 1 to 16 of the 34 rows, the second load takes rows 15 to 30, and the third load takes rows 29 to 34.
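The overlapping load schedule described above — 16 rows per load, with the last two rows of each load re-read so a 3-row, stride-1 window never straddles a load boundary — can be sketched as follows (the function is our illustration, not part of the patent):

```python
def load_windows(total_rows, rows_per_load=16, overlap=2):
    # Each load re-reads the last `overlap` rows of the previous load,
    # so every 3-row sliding window falls entirely within some load.
    windows, start = [], 1
    while True:
        end = min(start + rows_per_load - 1, total_rows)
        windows.append((start, end))
        if end == total_rows:
            return windows
        start = end - overlap + 1

print(load_windows(34))  # [(1, 16), (15, 30), (29, 34)]
```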
  • alternatively, the image data is loaded first, and zero padding is then applied to the loaded data. For example, the feature map has 32 rows and 16 rows are loaded at a time. The first load takes rows 1 to 15 of the 32 rows; since these 15 rows are the top of the picture, the loaded rows are zero-padded in the first row, first column, and last column. The second load takes rows 14 to 29; since these rows are neither the top nor the bottom of the picture, only the first and last columns of the loaded rows are zero-padded. The third load takes rows 28 to 32; since these 5 rows are the bottom of the picture, the loaded rows are zero-padded in the last row, first column, and last column.
  • if the feature map has only 8 rows in total and 16 rows are loaded at a time, all 8 rows are taken in the first load. Since these 8 rows are the whole picture, peripheral zero padding is applied to them once, that is, to the first row, last row, first column, and last column.
  • the CNN algorithm can be implemented based on hardware.
  • the CNN algorithm hardware implementation technology has the advantages of controllable algorithm completion time, high computing power, and low power consumption, and can be applied to various small device terminals, such as mobile phones.
  • an embodiment of the present invention provides a computing method based on the basic computing unit of any convolutional neural network described above, which specifically includes the following steps:
  • Step 301: through the input buffer, execute for each computing unit: based on the control of the controller, load the cached image data of the first row count from at least one feature map to be processed into the current computing unit, the first row count corresponding to the current computing unit.
  • Step 302: through the block random access memory, cache each row of image data loaded from the input buffer; and execute for each convolution operation unit: based on the control of the controller, for the rows of image data in its own cache, send the effective row count and starting row number to the current convolution operation unit.
  • Step 303: through the convolution operation unit, according to the effective row count and starting row number received, obtain the image data of the second row count from all the image data cached in the block random access memory, the second row count being the number of rows the unit needs to perform one basic convolution operation; perform the basic convolution operation on that image data and send the result to the internal adder.
  • Step 304: through the internal adder, process the image data sent by the convolution operation units and send it to the addition tree.
  • Step 305: through the addition tree, based on the control of the controller, process the image data sent by each internal adder and send it to an activation pooling unit.
  • Step 306: through the activation pooling unit, based on the control of the controller, process the image data sent by the addition tree and send it to the output buffer.
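Steps 301 to 306 can be summarized as a minimal functional model of the dataflow. This is only an illustrative software sketch of the hardware pipeline — the 3×3 kernels, ReLU activation, and 2×2 max pooling are assumptions consistent with the examples elsewhere in the document, not a definitive implementation:

```python
import numpy as np

def convolve2d_valid(image, kernel):
    # Basic (valid, stride-1) convolution performed by one convolution
    # operation unit over its assigned rows.
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def basic_compute_unit(feature_maps, kernels, use_relu=True, use_pool=True):
    # Steps 303-306: each convolution operation unit convolves one feature
    # map; the internal adder / addition tree sum the partial results;
    # then optional activation (ReLU) and 2x2 max pooling are applied.
    partial = [convolve2d_valid(fm, k) for fm, k in zip(feature_maps, kernels)]
    summed = np.sum(partial, axis=0)       # internal adder + addition tree
    if use_relu:
        summed = np.maximum(summed, 0.0)   # activation module
    if use_pool:                           # pooling module (2x2 max)
        h, w = summed.shape
        h2, w2 = h - h % 2, w - w % 2
        summed = summed[:h2, :w2].reshape(h2 // 2, 2, w2 // 2, 2).max(axis=(1, 3))
    return summed
```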
  • the basic computing unit of the convolutional neural network includes a controller, an addition tree, an input buffer, several computing units, and an output buffer; each computing unit includes a block random access memory, several convolution operation units, an internal adder, and an activation pooling unit.
  • based on the control of the controller, the input buffer loads the corresponding number of rows of image data into each computing unit, and the block random access memory sends the effective row count and starting row number to each convolution operation unit so that it obtains the corresponding rows of image data;
  • the convolution operation units process the image data and send it through the internal adders to the addition tree;
  • the addition tree processes the image data sent by each internal adder and sends it to an activation pooling unit;
  • the activation pooling unit processes the image data and sends it to the output buffer.
  • the embodiment of the present invention can implement an algorithm based on hardware, so that the completion time of the algorithm can be controlled.
  • the proposed design of the basic computational unit of the convolutional neural network has the characteristics of flexibility and universality through a reasonable combination of convolution, pooling, addition tree, data storage, and control modules.
  • numerous convolutional neural network algorithms can be realized, so the design has broad application prospects.
  • the foregoing program may be stored in a computer-readable storage medium.
  • when executed, the program performs the steps of the foregoing method embodiment.
  • the foregoing storage medium includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Image Processing (AREA)

Abstract

The present invention provides a basic computing unit for a convolutional neural network, and a computing method, the basic computing unit comprising a controller, an adder tree, an input buffer, several computing units, and an output buffer; each of the computing units comprises a block random access memory, several convolution operation units, an internal adder, and an activation and pooling unit. On the basis of the control of the controller, the input buffer loads, to the computing units, a corresponding number of lines of image data, and the block random access memory delivers an effective number of lines and a starting line number to the convolution operation units, so as to enable the convolution operation units to acquire image data of corresponding line numbers; the convolution operation units process the image data, and send same to the adder tree by means of the internal adders; the adder tree processes the image data sent from the internal adders, and sends same to an activation and pooling unit; and the activation and pooling unit processes the image data, and sends same to the output buffer. The present solution is able to implement an algorithm on the basis of hardware, making completion time of an algorithm controllable.

Description

Basic computing unit and computing method of a convolutional neural network

Technical Field

The present invention relates to the field of computer technology, and in particular, to a basic computing unit and computing method of a convolutional neural network.
Background Art

Convolutional Neural Networks (CNNs) are an efficient recognition method that has developed rapidly and attracted widespread attention in recent years.

At present, the CNN algorithm can be implemented based on a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit). Both the CPU and GPU implementations realize the algorithm in software.

Because the existing implementations realize the algorithm in software, the completion time of the algorithm is uncontrollable.
Summary of the Invention

The present invention provides a basic computing unit and computing method of a convolutional neural network, which can implement the algorithm in hardware so that the algorithm completion time is controllable.

In order to achieve the above object, the present invention is implemented by the following technical solutions:

In one aspect, the present invention provides a basic computing unit of a convolutional neural network, including:

a controller, an addition tree, an input buffer, at least one computing unit, and an output buffer;
the computing unit includes a block random access memory, at least one convolution operation unit, an internal adder, and an activation pooling unit; the block random access memory and the internal adder are each connected to every convolution operation unit, and the internal adder and the activation pooling unit are both connected to the addition tree;

the input buffer is configured to execute, for each computing unit: based on the control of the controller, load the cached image data of a first row count from at least one feature map to be processed into the current computing unit, the first row count corresponding to the current computing unit;

the block random access memory is configured to cache each row of image data loaded from the input buffer, and to execute, for each convolution operation unit: based on the control of the controller, for the rows of image data in its own cache, send an effective row count and a starting row number to the current convolution operation unit;

the convolution operation unit is configured to obtain, according to the effective row count and starting row number it receives, image data of a second row count from all the image data cached in the block random access memory, the second row count being the number of rows the unit needs to perform one basic convolution operation; and to perform the basic convolution operation on that image data and send the result to the internal adder;

the internal adder is configured to process the image data sent by the convolution operation units and send it to the addition tree;

the addition tree is configured to, based on the control of the controller, process the image data sent by each internal adder and send it to an activation pooling unit;

the activation pooling unit is configured to, based on the control of the controller, process the image data sent by the addition tree and send it to the output buffer.
Further, the activation pooling unit includes an activation module, a first selector, a pooling module, a second selector, and a third selector;

one input of the first selector is connected to the input of the activation module, its other input is connected to the output of the activation module, and its output is connected to the input of the pooling module;

one input of the second selector is connected to the input of the pooling module, its other input is connected to the output of the pooling module, and its output is connected to the first input of the third selector;

the first input of the third selector is connected to the output of the second selector, its second input is connected to the other computing units among the at least one computing unit, its third input is connected to the input buffer, its first output is connected to the output buffer, and its second output is connected to the input of the activation module;

the activation pooling unit is configured to control, based on the control of the controller, whether the activation module and/or the pooling module operates;

the third selector is configured to, based on the control of the controller, select one of the first input, the second input, and the third input to operate, and select the first output or the second output to operate.
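The first and second selectors described above act as optional bypasses around the activation and pooling stages. A behavioral sketch of that routing (the toy stage implementations here are stand-ins for illustration, not the patent's modules):

```python
def activation_pooling_unit(data, relu_enabled, pool_enabled, relu, pool):
    # First selector (MUX): feed the pooling stage either the activation
    # module's output or the raw input, as the controller directs.
    x = relu(data) if relu_enabled else data
    # Second selector (MUX): emit either the pooling module's output or
    # its bypassed input.
    return pool(x) if pool_enabled else x

# Toy stage implementations (hypothetical, for illustration only):
relu = lambda v: [max(e, 0) for e in v]                              # ReLU
pool = lambda v: [max(v[i], v[i + 1]) for i in range(0, len(v) - 1, 2)]  # 1D max pool

print(activation_pooling_unit([-1, 3, 2, -5], True, True, relu, pool))  # [3, 2]
```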
Further, the activation pooling unit also includes a bias processing unit;

the bias processing unit is connected to the addition tree and to the activation module;

the second output is connected to the input of the activation module through the bias processing unit.

Further, the current computing unit includes 4 convolution operation units;

each of the 4 convolution operation units needs 3 rows to perform a basic convolution operation, with a stride of 1;

the first row count corresponding to the current computing unit is 16 rows.
Further, the input buffer is also configured to determine the first row count corresponding to each computing unit according to Formula 1, Formula 2, and Formula 3, where the stride of any convolution operation unit performing a basic convolution operation is 1;

Formula 1 is:
[Formula 1 is rendered as an image (PCTCN2019096750-appb-000001) in the original publication.]
Formula 2 is:
[Formula 2 is rendered as an image (PCTCN2019096750-appb-000002) in the original publication.]
Formula 3 is:

Y_i = max{y_i}
where X_ij is the number of rows the j-th convolution operation unit in the i-th computing unit (of the at least one computing unit) needs to perform a basic convolution operation, a_i is the preset optimization value corresponding to the i-th computing unit (an integer), m is the number of convolution operation units in the i-th computing unit, x_i is the optimization variable corresponding to the i-th computing unit, y_i is the intermediate value corresponding to the i-th computing unit, D is the amount of storage needed for one row of image data in bytes, n is the number of computing units, N is the amount of storage the external chip can provide in bytes, T is the amount of storage used by the other modules in the basic computing unit of the convolutional neural network in bytes, Y_i is the first row count corresponding to the i-th computing unit, and max{} takes the maximum value.
Further, the input buffer is also configured to determine the first row count corresponding to each computing unit according to Formula 4 and Formula 5, where all convolution operation units need the same number of rows to perform a basic convolution operation with a stride of 1, and all computing units include the same number of convolution operation units;

Formula 4 is:
[Formula 4 is rendered as an image (PCTCN2019096750-appb-000003) in the original publication.]
Formula 5 is:

Y = max{2^k}
where A is the number of rows any convolution operation unit needs to perform a basic convolution operation, X is the number of convolution operation units included in any computing unit, k is an integer, N is the amount of storage the external chip can provide in bytes, T is the amount of storage used by the other modules in the basic computing unit of the convolutional neural network in bytes, D is the amount of storage needed for one row of image data in bytes, n is the number of computing units, Y is the first row count corresponding to every computing unit, and max{} takes the maximum value.
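Formula 4 itself survives only as an image in the source, so its exact form is unknown here. One plausible reading, given the variable definitions, is that Y is the largest power of two for which n block RAMs, each holding Y rows of D bytes, fit into the N − T bytes left after the other modules take their share. The sketch below encodes that assumed inequality and should be treated as an illustration only, not the patent's formula:

```python
def first_row_count(N, T, D, n):
    # Largest Y = 2**k such that n buffers of Y rows of D bytes each
    # fit in the N - T bytes available for image rows.
    # (Assumed constraint: n * Y * D <= N - T; Formula 4 is an image
    # in the source, so this reading is a guess.)
    budget = N - T
    k = 0
    while n * (2 ** (k + 1)) * D <= budget:
        k += 1
    return 2 ** k

print(first_row_count(N=4 * 1024 * 1024, T=1024 * 1024, D=512, n=8))  # 512
```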
Further, the external chip includes an FPGA (Field-Programmable Gate Array) chip or a custom chip;

the other modules include any one or more of: the pooling module included in the activation pooling unit, the output buffer, and the addition tree.
Further, when any convolution operation unit needs 3 rows to perform a basic convolution operation with a stride of 1, then for any block random access memory: between any two consecutive cachings of image data, the second-to-last row of the earlier caching is the same as the first row of the later caching, and the last row of the earlier caching is the same as the second row of the later caching;

for any convolution operation unit: between any two consecutive acquisitions of 3 rows of image data, the second row of the earlier acquisition is the same as the first row of the later acquisition, and the third row of the earlier acquisition is the same as the second row of the later acquisition.
Further, the input buffer is also configured to execute, for each feature map to be processed: when the controller sets the PAD of the current feature map to 1, zero-pad the periphery of the current feature map so that its row count and column count each increase by 2; and to perform the image data load operation based on the zero-padded current feature map.
In another aspect, the present invention provides a computing method based on the basic computing unit of any of the convolutional neural networks described above, including:

through the input buffer, executing for each computing unit: based on the control of the controller, loading the cached image data of a first row count from at least one feature map to be processed into the current computing unit, the first row count corresponding to the current computing unit;

through the block random access memory, caching each row of image data loaded from the input buffer, and executing for each convolution operation unit: based on the control of the controller, for the rows of image data in its own cache, sending an effective row count and a starting row number to the current convolution operation unit;

through the convolution operation unit, obtaining, according to the effective row count and starting row number received, image data of a second row count from all the image data cached in the block random access memory, the second row count being the number of rows the unit needs to perform one basic convolution operation; performing the basic convolution operation on that image data and sending the result to the internal adder;

through the internal adder, processing the image data sent by the convolution operation units and sending it to the addition tree;

through the addition tree, based on the control of the controller, processing the image data sent by each internal adder and sending it to an activation pooling unit;

through the activation pooling unit, based on the control of the controller, processing the image data sent by the addition tree and sending it to the output buffer.
The present invention provides a basic computing unit and computing method of a convolutional neural network. The basic computing unit includes a controller, an addition tree, an input buffer, several computing units, and an output buffer; each computing unit includes a block random access memory, several convolution operation units, an internal adder, and an activation pooling unit. Based on the control of the controller, the input buffer loads the corresponding number of rows of image data into each computing unit, and the block random access memory sends the effective row count and starting row number to each convolution operation unit so that it obtains the corresponding rows of image data; the convolution operation units process the image data and send it through the internal adders to the addition tree; the addition tree processes the image data sent by each internal adder and sends it to an activation pooling unit; the activation pooling unit processes the image data and sends it to the output buffer. The present invention can implement the algorithm in hardware so that the algorithm completion time is controllable.
BRIEF DESCRIPTION OF THE DRAWINGS

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of a basic computing unit of a convolutional neural network according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a computing unit in a convolutional neural network according to an embodiment of the present invention;

FIG. 3 is a flowchart of a computing method based on the basic computing unit of a convolutional neural network according to an embodiment of the present invention.
Detailed Description

In order to make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
As shown in FIG. 1, an embodiment of the present invention provides a basic computing unit of a convolutional neural network, which may include:

a controller 101, an addition tree 102, an input buffer 103, at least one computing unit 104, and an output buffer 105;

the computing unit 104 includes a block random access memory 1041, at least one convolution operation unit 1042, an internal adder 1043, and an activation pooling unit 1044;

the block random access memory 1041 and the internal adder 1043 are each connected to every convolution operation unit 1042, and the internal adder 1043 and the activation pooling unit 1044 are both connected to the addition tree 102;

the input buffer 103 is configured to execute, for each computing unit 104: based on the control of the controller 101, load the cached image data of a first row count from at least one feature map to be processed into the current computing unit, the first row count corresponding to the current computing unit;

the block random access memory 1041 is configured to cache each row of image data loaded from the input buffer 103, and to execute, for each convolution operation unit 1042: based on the control of the controller 101, for the rows of image data in its own cache, send an effective row count and a starting row number to the current convolution operation unit;

the convolution operation unit 1042 is configured to obtain, according to the effective row count and starting row number it receives, image data of a second row count from all the image data cached in the block random access memory 1041, the second row count being the number of rows the unit needs to perform one basic convolution operation; and to perform the basic convolution operation on that image data and send the result to the internal adder 1043;

the internal adder 1043 is configured to process the image data sent by the convolution operation units 1042 and send it to the addition tree 102;

the addition tree 102 is configured to, based on the control of the controller 101, process the image data sent by each internal adder 1043 and send it to an activation pooling unit 1044;

the activation pooling unit 1044 is configured to, based on the control of the controller 101, process the image data sent by the addition tree 102 and send it to the output buffer 105.
An embodiment of the present invention provides a basic computing unit of a convolutional neural network, including a controller, an addition tree, an input buffer, several computing units, and an output buffer. Each computing unit includes a block random access memory, several convolution operation units, an internal adder, and an activation pooling unit. Based on the control of the controller, the input buffer loads the corresponding number of rows of image data into each computing unit, and the block random access memory sends the effective row count and starting row number to each convolution operation unit so that it obtains the corresponding rows of image data; the convolution operation units process the image data and send it through the internal adders to the addition tree; the addition tree processes the image data sent by each internal adder and sends it to an activation pooling unit; the activation pooling unit processes the image data and sends it to the output buffer. The embodiment of the present invention can implement the algorithm in hardware so that the algorithm completion time is controllable.
In detail, the effective row count may be the number of rows currently actually loaded into the block random access memory.

In detail, the loading of image data from the input buffer to the block random access memory, the number of rows to load, and which rows to load, may be controlled by the controller according to upper-layer commands.

In detail, according to upper-layer commands the controller can control every other unit module in the basic computing unit. For example, after the convolution operation is completed, addition is performed by the addition tree module, data is then processed according to the activation algorithm, and pooling is then performed; all of these processes are orchestrated by the controller. For example, the controller controls whether activation or pooling may be performed. For example, the controller sends parameters for data loading, data storage, and data computation to the other unit modules, and the other unit modules report back whether they have completed.
In one embodiment of the present invention, the controller is implemented in Xilinx FPGA-PL and is connected to an ARM core (Xilinx FPGA-PS) via an AXI (Advanced eXtensible Interface) bus. For the ARM core (Xilinx FPGA-PS), the PS (Processing System) is the part of the ARM SoC (System on Chip) that is independent of the FPGA fabric in the chip, and the PL (Programmable Logic) is the FPGA fabric part of the chip.
本发明实施例所提出的这一卷积神经网络基本计算单元的设计,通过将卷积、池化、加法树、数据存储及控制模块等合理结合,具有灵活通用的特点,以达到通用的目的,可以实现众多卷积神经网络算法,故具有广阔的应用前景。The design of the basic computational unit of the convolutional neural network proposed by the embodiment of the present invention has the characteristics of flexibility and universality through reasonable combination of convolution, pooling, addition tree, data storage, and control modules, so as to achieve the general purpose , Can realize many convolutional neural network algorithms, so it has broad application prospects.
本发明一个实施例中,控制器可以controller表示,输入端缓存器可以IN BUFFER表示,输出端缓存器可以OUT BUFFER表示,加法树可以ADDER TREE Module表示,至少一个计算单元可以CUA(Calculate Unit Array,计 算单元阵列)表示,计算单元即为CU,块随机存储器可以Block RAM表示,卷积运算单元可以CONV表示,内部加法器可以Inter Adder表示,激活模块可以RELU(Rectified Linear Unit)表示,池化模块可以POOL表示,选择器可以MUX(multiplexer)表示。In one embodiment of the present invention, the controller may be represented by a controller, the input buffer may be represented by IN BUFFER, the output buffer may be represented by OUT BUFFER, the addition tree may be represented by ADDER TREE, and at least one calculation unit may be CUA (Calculate Unit Array Computing unit array) indicates that the computing unit is CU, block random memory can be represented by Block RAM, convolution operation unit can be represented by CONV, internal adder can be represented by Inter Adder, activation module can be represented by RELU (Rectified Linear Unit), pooling module It can be represented by POOL, and the selector can be represented by MUX (multiplexer).
在本发明一个实施例中,各个计算单元的内部加法器的处理结果,可以均输出至加法树,加法树将处理结果汇聚后,根据控制器发来的控制信息,将汇聚的全部处理结果发送给该控制信息所指定计算单元的激活池化单元,以进行后续的激活、池化等处理。In an embodiment of the present invention, the processing results of the internal adder of each calculation unit may be all output to the addition tree. After the addition tree aggregates the processing results, according to the control information sent by the controller, all the aggregated processing results are sent. The activation pooling unit of the computing unit designated by the control information is used for subsequent activation, pooling, and other processing.
In one embodiment of the present invention, referring to FIG. 2, the activation-pooling unit 1044 comprises an activation module 10441, a first selector 10442, a pooling module 10443, a second selector 10444, and a third selector 10445.
One input terminal of the first selector 10442 is connected to the input terminal of the activation module 10441, its other input terminal is connected to the output terminal of the activation module 10441, and its output terminal is connected to the input terminal of the pooling module 10443.
One input terminal of the second selector 10444 is connected to the input terminal of the pooling module 10443, its other input terminal is connected to the output terminal of the pooling module 10443, and its output terminal is connected to the first input terminal of the third selector 10445.
The first input terminal of the third selector 10445 is connected to the output terminal of the second selector 10444; its second input terminal is connected to the other computing units 104 among the at least one computing unit 104; its third input terminal is connected to the input buffer 103; its first output terminal is connected to the output buffer 105; and its second output terminal is connected to the input terminal of the activation module 10441.
The activation-pooling unit 1044 is configured to control, under the control of the controller 101, whether the activation module 10441 and/or the pooling module 10443 operate.
The third selector 10445 is configured to select, under the control of the controller 101, which one of the first, second, and third input terminals operates, and whether the first or the second output terminal operates.
In detail, referring to FIG. 2, the processing flow of the image data within a computing unit can be determined.
In detail, based on the control information from the controller, it can be determined which input terminal of the first selector operates, thereby determining whether activation processing is required. Likewise, it can be determined which input terminal of the second selector operates, thereby determining whether pooling processing is required.
In detail, based on the control information from the controller, it can be determined which input terminal of the third selector operates and which output terminal operates. The second input terminal of the third selector is connected to another computing unit, which may be any computing unit among the at least one computing unit other than the one in which the selector itself resides.
In one embodiment of the present invention, the activation-pooling unit 1044 further comprises an offset processing unit 10446.
The offset processing unit 10446 is connected to the addition tree 102 and to the activation module 10441, respectively.
The second output terminal is connected to the input terminal of the activation module 10441 through the offset processing unit 10446.
For example, when the offset is 1, 1 is added to every value of the image data coming from the addition tree. Only after the offset processing are the subsequent activation and pooling operations performed.
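The dataflow through the activation-pooling unit (offset processing, then optionally the RELU module, then optionally the POOL module, with the first and second selectors acting as bypasses) can be sketched as a behavioral model. This is an illustrative sketch only; the 2×2 max-pooling window is an assumption, not a detail stated in the source.

```python
import numpy as np

def activation_pooling_unit(rows, offset=0.0, use_relu=True, use_pool=True):
    """Behavioral model of the activation-pooling unit.

    `rows` holds convolution sums from the addition tree.  The offset
    unit adds a bias to every value; the first selector chooses whether
    the RELU module runs; the second selector chooses whether the POOL
    module runs (modeled here as 2x2 max pooling, an assumed window).
    """
    data = np.asarray(rows, dtype=float) + offset            # offset processing unit
    if use_relu:                                             # first selector: RELU path
        data = np.maximum(data, 0.0)
    if use_pool:                                             # second selector: POOL path
        h, w = data.shape
        data = data[:h - h % 2, :w - w % 2]                  # trim to even size
        data = data.reshape(data.shape[0] // 2, 2,
                            data.shape[1] // 2, 2).max(axis=(1, 3))
    return data
```

Either module can be bypassed independently, mirroring the controller deciding whether activation or pooling is performed.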
In the embodiments of the present invention, the block RAM is responsible for buffering the input data needed during the convolution operation, so as to ensure that each connected convolution operation unit can run normally. For example, suppose a convolution operation unit uses a 3×3 convolution as the basic convolution unit to implement convolutions of other dimensions; that convolution operation unit then requires 3 rows of input data, so the block RAM must buffer at least those 3 rows of input data.
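A minimal software model of one basic convolution pass may clarify why exactly 3 rows must be buffered: the unit slides a 3×3 window across the 3 rows served by the block RAM. The NumPy implementation below is an illustrative sketch, not the hardware design.

```python
import numpy as np

def conv3x3_over_rows(rows, kernel):
    """One basic 3x3 convolution pass over exactly 3 buffered rows.

    `rows` is a 3 x W slice served by the block RAM; the unit slides a
    3x3 window across it with stride 1, producing one output row of
    width W - 2 (no padding).
    """
    rows = np.asarray(rows, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    assert rows.shape[0] == 3 and kernel.shape == (3, 3)
    w = rows.shape[1]
    return np.array([(rows[:, c:c + 3] * kernel).sum() for c in range(w - 2)])
```

Each call consumes one 3-row window; producing the next output row requires the block RAM to serve rows shifted down by the stride.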
Based on the foregoing, in one embodiment of the present invention, the current computing unit comprises four convolution operation units 1042;
the number of rows required by any one of the four convolution operation units 1042 to perform a basic convolution operation is 3, with a stride of 1;
the first row count corresponding to the current computing unit is 16 rows.
For example, referring to FIG. 2, assume that the current computing unit comprises four identical convolution operation units, each of which requires 3 rows to perform a basic convolution operation. For image processing, the following two cases may exist:
Case 1: the convolution operation units in the same computing unit all process the same image data;
Case 2: the convolution operation units in the same computing unit all process different image data.
Within case 2 there may further be a case 3: the image data processed by the respective convolution operation units may come from different feature maps to be processed. Thus, the data buffered in the block RAM may come from a single feature map or from multiple feature maps, as controlled by the controller.
In case 1, if every convolution operation unit processes the same 3 rows of image data, the block RAM needs to store at least 3 rows of image data, i.e. the first row count corresponding to the current computing unit is not less than 3.
In case 2, if the convolution operation units simultaneously process different sets of 3 rows of image data, the block RAM needs to store at least 12 rows of image data, i.e. the first row count corresponding to the current computing unit is not less than 12.
Considering that feature maps to be processed are mostly 128×128, 256×256, 512×512, and sometimes 64×64, 32×32, or 16×16, the number of rows is usually a power of two. If no peripheral zero-padding of the feature map is required and the block RAM loads 12 rows at a time, the final load is usually fewer than 12 rows: a 128×128 feature map requires 11 loads, the 11th loading only 8 rows, and a 64×64 feature map requires 6 loads, the 6th loading only 4 rows. The final load therefore wastes resources.
Accordingly, the number of rows loaded by the block RAM at a time can likewise be made a power of two that is not less than 12; for example, 16 rows may be chosen.
Based on the same principle, in other embodiments of the present invention, when the block RAM has more resources available, 32 rows or even 64 rows may be chosen.
In general, when the block RAM has ample resources available, the chosen row count should not be too small if the feature maps to be processed usually have many rows, and should not be too large if they usually have few rows.
Taking 16 rows per load as an example: even in case 2 above, these 16 rows guarantee that each convolution operation unit can fetch image data at least twice in succession, correspondingly reducing the number of loads from the input buffer and the time delay caused by repeatedly loading image data. This design amounts to batch processing: when one row of computation finishes, the unit automatically moves to the next row for the convolution operation, eliminating the time redundancy between rows and the performance loss incurred when a convolution operation wraps to a new row. Operation efficiency is thereby improved and high-speed operation achieved.
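The load-count arithmetic above (11 loads of 12 rows for a 128-row map with only 8 rows in the last load, versus exactly 8 full loads of 16 rows) can be checked with a short helper; the function is an illustrative model of non-overlapping loads with no padding.

```python
def load_schedule(total_rows, rows_per_load):
    """Number of loads, and size of the final load, when the block RAM
    fetches `rows_per_load` rows at a time (no overlap, no padding)."""
    n_loads = -(-total_rows // rows_per_load)          # ceiling division
    last = total_rows - (n_loads - 1) * rows_per_load  # rows left for the final load
    return n_loads, last
```

A power-of-two load size divides power-of-two feature-map heights evenly, so the final load is always full.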
In one embodiment of the present invention, the input buffer 103 is further configured to determine the first row count corresponding to each computing unit 104 according to the following formulas (1) to (3), where the stride used by any convolution operation unit 1042 when performing a basic convolution operation is 1:
[Formulas (1) and (2) appear in the original document only as images (PCTCN2019096750-appb-000004 and -000005) and are not reproduced here.]
Y_i = max{y_i}    (3)
where X_ij is the number of rows required by the j-th convolution operation unit in the i-th computing unit of the at least one computing unit to perform a basic convolution operation; a_i is a preset optimization value corresponding to the i-th computing unit, a_i being an integer; m is the number of convolution operation units in the i-th computing unit; x_i is an optimization variable corresponding to the i-th computing unit; y_i is an intermediate value corresponding to the i-th computing unit; D is the amount of storage, in bytes, occupied by one row of image data; n is the number of the at least one computing unit; N is the amount of storage, in bytes, that the external chip can provide; T is the amount of storage, in bytes, used by the other modules in the basic computing unit of the convolutional neural network; Y_i is the first row count corresponding to the i-th computing unit; and max{} denotes taking the maximum.
In detail, this design is especially suitable for application scenarios in which the feature maps to be processed usually have a large number of rows.
In detail, the value of a_i can be set flexibly according to the control information from the controller. When a_i is 0, in case 3 above the rows loaded by the block RAM at a time support only a single fetch by each convolution operation unit, which causes the block RAM to repeatedly reload image data from the input buffer.
With a stride of 1, when a_i is 1, in case 3 above the rows loaded by the block RAM at a time support two consecutive fetches by each convolution operation unit; when a_i is 2, three consecutive fetches; when a_i is 3, four consecutive fetches; and so on.
Since Y_i is not necessarily a power of two, this scheme is better suited to the case where the feature maps to be processed require peripheral zero-padding. Since Y_i takes the maximum value, the number of times the block RAM repeatedly loads image data from the input buffer is minimized, improving overall processing efficiency.
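Because formulas (1) and (2) survive only as images in the source, the following sketch rests on an assumed reading of the variable definitions: each computing unit i buffers y_i = Σ_j (X_ij + a_i) rows (the a_i extra rows allowing a_i additional stride-1 window shifts per convolution unit), subject to the assumed memory constraint D·Σ_i y_i + T ≤ N, with Y_i the largest feasible value. If the original formulas differ, this sketch must be adjusted accordingly.

```python
def first_row_counts(X, D, N, T):
    """Hedged reconstruction of formulas (1)-(3): grow a common extra-row
    count `a` while the assumed memory constraint D * sum(y) + T <= N
    still holds, where unit i needs y_i = sum_j X[i][j] + m_i * a rows.
    """
    n = len(X)
    base = [sum(rows) for rows in X]   # rows needed when a = 0
    m = [len(rows) for rows in X]      # convolution units per computing unit
    a = 0
    while True:                        # try one more consecutive fetch per unit
        y = [base[i] + m[i] * (a + 1) for i in range(n)]
        if D * sum(y) + T > N:
            break
        a += 1
    return [base[i] + m[i] * a for i in range(n)]
```

With one computing unit of four 3-row convolution units and 16 rows' worth of memory, this reproduces the 16-row example from the text.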
Based on the foregoing, in another embodiment of the present invention, the input buffer 103 is further configured to determine the first row count corresponding to each computing unit 104 according to the following formulas (4) and (5), where all the convolution operation units 1042 require the same number of rows to perform a basic convolution operation with a stride of 1, and all the computing units 104 comprise the same number of convolution operation units 1042:
[Formula (4) appears in the original document only as an image (PCTCN2019096750-appb-000006) and is not reproduced here.]
Y = max{2^k}    (5)
where A is the number of rows required by any one of the convolution operation units to perform a basic convolution operation; X is the number of convolution operation units comprised in any one of the computing units; k is an integer; N is the amount of storage, in bytes, that the external chip can provide; T is the amount of storage, in bytes, used by the other modules in the basic computing unit of the convolutional neural network; D is the amount of storage, in bytes, occupied by one row of image data; n is the number of the at least one computing unit; Y is the first row count corresponding to each computing unit; and max{} denotes taking the maximum.
In detail, since Y is a power of two, this scheme is better suited to the case where the number of rows of the feature maps to be processed is also a power of two and no peripheral zero-padding is needed.
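Formula (4) is likewise only an image in the source. The sketch below assumes it selects the largest power of two 2^k satisfying the memory constraint n·D·2^k + T ≤ N while covering the A·X rows the X convolution units jointly need; that reading matches the 16-row example for A = 3, X = 4, but it is an assumption.

```python
def power_of_two_row_count(A, X, D, N, T, n):
    """Hedged reconstruction of formulas (4)-(5): Y is the largest power
    of two that fits the assumed memory budget n * D * Y + T <= N and is
    at least A * X (one distinct A-row window per convolution unit)."""
    k = 0
    while 2 ** (k + 1) * D * n + T <= N:   # grow while the next power still fits
        k += 1
    y = 2 ** k
    if y < A * X:
        raise ValueError("available block RAM cannot hold one row set per unit")
    return y
```

A power-of-two Y divides power-of-two feature-map heights evenly, avoiding the undersized final load discussed above.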
In one embodiment of the present invention, the external chip comprises an FPGA chip or a custom chip;
the other modules comprise any one or more of the pooling module 10443 comprised in the activation-pooling unit 1044, the output buffer 105, and the addition tree 102.
In one embodiment of the present invention, when the number of rows required by any convolution operation unit 1042 to perform a basic convolution operation is 3 and the stride is 1, then for any block RAM 1041: in any two adjacently buffered batches of image data, the second-to-last row of the earlier batch is the same as the first row of the later batch, and the last row of the earlier batch is the same as the second row of the later batch.
For any convolution operation unit 1042: in any two adjacently fetched sets of 3 rows of image data, the second row of the earlier set is the same as the first row of the later set, and the third row of the earlier set is the same as the second row of the later set.
For example, assume the block RAM P has currently loaded 16 rows of image data, the four connected convolution operation units all process these 16 rows, each uses a 3×3 convolution as the basic convolution unit to implement the convolution operation (i.e. the number of rows required for a basic convolution operation is 3), and the stride is set to 1 in all cases.
Assume the block RAM P issues a valid row count of 16 and a starting row number of 1 to convolution operation unit Q among the four units; Q then fetches rows 1 to 3 of the 16 rows. After Q finishes processing rows 1 to 3, it can notify the controller, and the controller can control the block RAM P to issue to Q a valid row count of 16 and a starting row number of 2, whereupon Q fetches rows 2 to 4 of the 16 rows. This cycle repeats until Q fetches rows 14 to 16 of the 16 rows, completing the processing of these 16 rows of image data.
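The starting row numbers issued in this cycle can be enumerated directly: for a 16-row buffer and a 3-row window at stride 1, the starts run from 1 to 14, the last window covering rows 14 to 16. The helper below is an illustrative model of that schedule.

```python
def window_starts(valid_rows, window_rows=3, stride=1):
    """Starting row numbers issued to one convolution unit as it walks a
    `valid_rows`-row buffer with a `window_rows`-row window at `stride`.
    1-based, matching the row numbering used in the example."""
    return list(range(1, valid_rows - window_rows + 2, stride))
```

Each start corresponds to one (valid row count, starting row number) pair issued by the block RAM under controller control.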
In one embodiment of the present invention, the input buffer 103 is further configured to execute, for each feature map to be processed: when the controller 101 sets the PAD of the current feature map to 1, zero-padding the periphery of the current feature map so that both its row count and its column count increase by 2; and the input buffer 103 is specifically configured to perform the image-data loading operation based on the zero-padded current feature map.
For example, assume the feature map to be processed has a size of 512×512, i.e. 512 rows and 512 columns of image data. With a 3×3 basic convolution unit and a stride of 1, rows 1 to 3 are processed first, then rows 2 to 4, then rows 3 to 5, and so on, ending after rows 510 to 512 are processed. As can be seen, 510 passes are needed, so the result is 510×510, two rows and two columns smaller than the original image. When the convolution operation must be executed in a loop multiple times, the image shrinks with every pass.
On this basis, to keep the image size unchanged after the convolution operation, PAD can be set to 1, padding one ring of zeros around the image before the convolution so that its size grows from 512×512 to 514×514; one convolution pass then turns the 514×514 image into a 512×512 image. Cycling in this way, the image size remains constant no matter how many convolution passes are performed.
By the same principle, the value of PAD equals the number of rings of zeros padded around the image.
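The size arithmetic above follows the standard convolution output-size formula, which can be written down directly; this is a general identity rather than anything specific to the patent's hardware.

```python
def conv_output_size(size, kernel=3, pad=0, stride=1):
    """Spatial size after one convolution pass.  Each unit of `pad` adds
    one ring of zeros, i.e. 2 * pad extra rows and columns."""
    return (size + 2 * pad - kernel) // stride + 1
```

With kernel 3 and stride 1, pad = 1 exactly cancels the 2-row, 2-column shrinkage, which is why the image size stays constant across passes.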
In detail, the zero-padding may be performed either in the input buffer or in the block RAM.
When zero-padding in the input buffer, the padding is done first and the image data is then loaded based on the padded feature map. For example, a feature map has 32 rows; after padding it has 34 rows, with the first row, last row, first column, and last column of the padded feature map all zero. Assuming 16 rows are loaded at a time, the first load takes rows 1 to 16 of the 34 rows, the second load takes rows 15 to 30, and the third load takes rows 29 to 34.
When zero-padding in the block RAM, the image data is loaded first and zeros are then padded based on the loaded data. For example, a feature map has 32 rows and 16 rows are loaded at a time. The first load takes rows 1 to 15 of the 32 rows; since these 15 rows are the start of the image, the loaded 15 rows can be padded with zeros on the first row, first column, and last column. The second load takes rows 14 to 29; since these 15 rows are neither the start nor the end of the image, only the first and last columns of the loaded 15 rows are zero-padded. The third load takes rows 28 to 32; since these 5 rows are the end of the image, the last row, first column, and last column of the loaded 5 rows are zero-padded.
In addition, by the same principle, when the feature map has 8 rows in total and 16 rows are loaded at a time, the first load takes all 8 rows. Since these 8 rows are the entire image, a full peripheral zero-padding can be applied to them at once, i.e. the first row, last row, first column, and last column are padded.
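The overlapping load windows in the input-buffer example (rows 1-16, 15-30, 29-34 of the 34-row padded map) follow from adjacent loads sharing 2 rows, consistent with the 3-row window at stride 1. A small helper reproduces them; note the 2-row overlap is inferred from the example rather than stated as a general rule in the source.

```python
def overlapped_loads(total_rows, rows_per_load=16, overlap=2):
    """1-based (start, end) row windows the input buffer issues when
    adjacent loads must share `overlap` rows (kernel 3, stride 1)."""
    loads, start = [], 1
    while True:
        end = min(start + rows_per_load - 1, total_rows)
        loads.append((start, end))
        if end == total_rows:          # final load reaches the last row
            return loads
        start = end - overlap + 1      # re-fetch the shared boundary rows
```

The shared rows let the sliding window cross load boundaries without gaps in the output.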
In the embodiments of the present invention, the CNN algorithm can be implemented in hardware. Hardware implementation of CNN algorithms offers controllable algorithm completion time, high computing power, and low power consumption, and can be applied to various small device terminals, for example in fields such as mobile phones.
As shown in FIG. 3, an embodiment of the present invention provides a computing method based on the basic computing unit of a convolutional neural network according to any of the above, specifically comprising the following steps:
Step 301: through the input buffer, executing for each computing unit: under the control of the controller, loading buffered image data of a first row count from at least one feature map to be processed into the current computing unit, the first row count corresponding to the current computing unit.
Step 302: through the block RAM, buffering each row of image data loaded from the input buffer; and executing for each convolution operation unit: under the control of the controller, issuing, for each row of image data buffered by the block RAM itself, a valid row count and a starting row number to the current convolution operation unit.
Step 303: through the convolution operation unit, according to the issued valid row count and starting row number, fetching image data of a second row count from all the image data buffered by the block RAM, the second row count being the number of rows the unit itself requires to perform a basic convolution operation; performing the basic convolution operation on the image data of the second row count, and sending the result to the internal adder.
Step 304: through the internal adder, processing the image data sent by the convolution operation unit and sending the result to the addition tree.
Step 305: through the addition tree, under the control of the controller, processing the image data sent by each internal adder and sending the result to one activation-pooling unit.
Step 306: through the activation-pooling unit, under the control of the controller, processing the image data sent by the addition tree and sending the result to the output buffer.
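The sequence of steps above can be condensed into a single software sketch for one computing unit. The single-unit, ReLU-only configuration and the NumPy types are illustrative simplifications, not the patent's full multi-unit dataflow.

```python
import numpy as np

def basic_unit_forward(feature_map, kernels, offset=0.0):
    """End-to-end sketch of one computing unit: rows are served from the
    buffered feature map, each 3x3 convolution unit produces partial sums
    (one kernel per CONV unit), the partial sums are accumulated as the
    internal adder / addition tree would, and offset + ReLU stands in
    for the activation-pooling stage."""
    fm = np.asarray(feature_map, dtype=float)
    h, w = fm.shape
    out = np.zeros((h - 2, w - 2))
    for k in kernels:                       # one CONV unit per kernel
        k = np.asarray(k, dtype=float)
        for r in range(h - 2):              # slide the 3-row window down
            for c in range(w - 2):
                out[r, c] += (fm[r:r + 3, c:c + 3] * k).sum()  # accumulate
    return np.maximum(out + offset, 0.0)    # offset, then RELU
```

The triple loop mirrors the hardware's row-by-row walk; pooling and the selector bypasses are omitted for brevity.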
Since the information exchange and execution flow among the units, modules, and components in the above method are based on the same concept as the product embodiments of the present invention, reference may be made to the description of the product embodiments for details, which are not repeated here.
In summary, the embodiments of the present invention have at least the following beneficial effects:
1. In the embodiments of the present invention, the basic computing unit of the convolutional neural network comprises a controller, an addition tree, an input buffer, a number of computing units, and an output buffer; each computing unit comprises a block RAM, a number of convolution operation units, an internal adder, and an activation-pooling unit. Under the control of the controller, the input buffer loads the corresponding number of rows of image data into each computing unit, and the block RAM issues a valid row count and a starting row number to each convolution operation unit so that it can fetch the corresponding rows of image data; each convolution operation unit processes the image data and sends it to the addition tree via the internal adder; the addition tree processes the image data from each internal adder and sends it to an activation-pooling unit; the activation-pooling unit processes the image data and sends it to the output buffer. The embodiments of the present invention implement the algorithm in hardware, so that the algorithm completion time is controllable.
2. In the embodiments of the present invention, the proposed design of this basic computing unit of a convolutional neural network, by suitably combining the convolution, pooling, addition-tree, data-storage, and control modules, is flexible and general-purpose: it can implement a wide range of convolutional neural network algorithms and therefore has broad application prospects.
3. In the embodiments of the present invention, when one row of computation finishes, the unit automatically moves to the next row for the convolution operation, eliminating the time redundancy between rows and the performance loss incurred when a convolution operation wraps to a new row. Operation efficiency is thereby improved and high-speed operation achieved.
4. In the embodiments of the present invention, the CNN algorithm can be implemented in hardware. Hardware implementation of CNN algorithms offers controllable algorithm completion time, high computing power, and low power consumption, and can be applied to various small device terminals, for example in fields such as mobile phones.
需要说明的是,在本文中,诸如第一和第二之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个······”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同因素。It should be noted that in this article, relational terms such as first and second are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply that between these entities or operations There is any such actual relationship or order. Moreover, the terms "including", "comprising", or any other variation thereof are intended to encompass non-exclusive inclusion, such that a process, method, article, or device that includes a series of elements includes not only those elements but also those that are not explicitly listed Or other elements inherent to such a process, method, article, or device. Without more restrictions, the elements defined by the sentence "including one ..." do not exclude the existence of other identical factors in the process, method, article, or equipment that includes the elements.
本领域普通技术人员可以理解：实现上述方法实施例的全部或部分步骤 可以通过程序指令相关的硬件来完成，前述的程序可以存储在计算机可读取的存储介质中，该程序在执行时，执行包括上述方法实施例的步骤；而前述的存储介质包括：ROM、RAM、磁碟或者光盘等各种可以存储程序代码的介质中。A person of ordinary skill in the art can understand that all or part of the steps of the foregoing method embodiments may be implemented by hardware instructed by a program; the foregoing program may be stored in a computer-readable storage medium, and when executed, the program performs the steps of the foregoing method embodiments; the foregoing storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
最后需要说明的是：以上所述仅为本发明的较佳实施例，仅用于说明本发明的技术方案，并非用于限定本发明的保护范围。凡在本发明的精神和原则之内所做的任何修改、等同替换、改进等，均包含在本发明的保护范围内。Finally, it should be noted that the above are merely preferred embodiments of the present invention, intended only to illustrate the technical solution of the present invention, not to limit its protection scope. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within the protection scope of the present invention.

Claims (10)

  1. 一种卷积神经网络的基本计算单元，其特征在于，包括：A basic computing unit of a convolutional neural network, characterized by comprising:
    控制器、加法树、输入端缓存器、至少一个计算单元、输出端缓存器;A controller, an addition tree, an input buffer, at least one computing unit, and an output buffer;
    所述计算单元包括块随机存储器、至少一个卷积运算单元、内部加法器、激活池化单元；所述块随机存储器和所述内部加法器均分别与每一个所述卷积运算单元相连，所述内部加法器和所述激活池化单元均与所述加法树相连；The computing unit comprises a block random access memory, at least one convolution operation unit, an internal adder, and an activation pooling unit; the block random access memory and the internal adder are each connected to every convolution operation unit, and the internal adder and the activation pooling unit are both connected to the addition tree;
    所述输入端缓存器，用于针对每一个所述计算单元均执行：基于所述控制器的控制，将缓存的至少一个待处理特征图中的第一行数的图像数据加载至当前计算单元，所述当前计算单元对应有所述第一行数；The input buffer is configured to execute, for each computing unit: under the control of the controller, loading image data of a first number of rows of at least one buffered feature map to be processed into the current computing unit, the current computing unit corresponding to the first number of rows;
    所述块随机存储器，用于缓存所述输入端缓存器加载来的每一行图像数据；针对每一个所述卷积运算单元均执行：基于所述控制器的控制，针对自身缓存的每一行图像数据，向当前卷积运算单元下发有效行数和起始行数；The block random access memory is configured to buffer each row of image data loaded from the input buffer, and to execute, for each convolution operation unit: under the control of the controller, issuing a valid row count and a starting row number to the current convolution operation unit for each row of image data buffered by the block random access memory;
    所述卷积运算单元，用于根据下发来的有效行数和起始行数，从所述块随机存储器缓存的全部图像数据中，获取第二行数的图像数据，所述第二行数为自身执行基本卷积运算所需的行数；对所述第二行数的图像数据进行基本卷积运算后发送给所述内部加法器；The convolution operation unit is configured to obtain, according to the issued valid row count and starting row number, image data of a second number of rows from all the image data buffered in the block random access memory, the second number of rows being the number of rows the convolution operation unit requires to perform a basic convolution operation, and to perform the basic convolution operation on the image data of the second number of rows and send the result to the internal adder;
    所述内部加法器,用于对所述卷积运算单元发来的图像数据进行处理后发送给所述加法树;The internal adder is configured to process the image data sent by the convolution operation unit and send the image data to the addition tree;
    所述加法树,用于基于所述控制器的控制,对每一个所述内部加法器发来的图像数据进行处理后发送给一所述激活池化单元;The addition tree is used to process the image data sent by each of the internal adders based on the control of the controller and send the processed image data to an activation pooling unit;
    所述激活池化单元,用于基于所述控制器的控制,对所述加法树发来的图像数据进行处理后发送给所述输出端缓存器。The activation pooling unit is configured to process the image data sent from the addition tree based on the control of the controller, and send the processed image data to the output buffer.
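For orientation, the dataflow recited in claim 1 can be sketched in software. The sketch below is illustrative only: the 1-D "basic convolution", the ReLU stand-in for the activation module, the width-2 max pool for the pooling module, and all row/kernel sizes are assumptions, not taken from the patent.

```python
# Illustrative software sketch of the claim-1 dataflow: input buffer rows ->
# convolution operation units -> internal adder -> activation + pooling.
# All names and sizes here are assumptions for demonstration.

def conv_rows(rows, kernel):
    """Stand-in for a basic convolution: a 1-D valid convolution applied to
    each row in the window, with the per-row results summed."""
    k, width = len(kernel), len(rows[0])
    out = []
    for col in range(width - k + 1):
        acc = 0
        for row in rows:
            for i in range(k):
                acc += row[col + i] * kernel[i]
        out.append(acc)
    return out

def compute_unit(block_ram_rows, kernel, window=3, stride=1):
    """One computing unit: each convolution operation unit reads `window` rows
    starting at its issued start row; the internal adder then sums the
    partial results element-wise."""
    partials = [conv_rows(block_ram_rows[s:s + window], kernel)
                for s in range(0, len(block_ram_rows) - window + 1, stride)]
    return [sum(vals) for vals in zip(*partials)]  # internal adder

def relu(xs):   # activation module stand-in
    return [max(0, x) for x in xs]

def pool2(xs):  # pooling module stand-in: non-overlapping max pool of size 2
    return [max(xs[i], xs[i + 1]) for i in range(0, len(xs) - 1, 2)]

# the input buffer loads 4 rows of a feature map into the computing unit
rows = [[1, 2, 3, 4], [0, 1, 0, 1], [2, 2, 2, 2], [1, 0, 1, 0]]
kernel = [1, -1]
summed = compute_unit(rows, kernel)  # -> [-2, 0, -2]
out = pool2(relu(summed))            # activation + pooling -> [0]
```

The addition tree, which combines results across several such computing units, is omitted here for brevity; it would element-wise sum the outputs of multiple `compute_unit` calls before the activation stage.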
  2. 根据权利要求1所述的卷积神经网络的基本计算单元,其特征在于,The basic computing unit of the convolutional neural network according to claim 1, wherein:
    所述激活池化单元包括:激活模块、第一选择器、池化模块、第二选择器、第三选择器;The activation pooling unit includes: an activation module, a first selector, a pooling module, a second selector, and a third selector;
    所述第一选择器的一输入端与所述激活模块的输入端相连、另一输入端 与所述激活模块的输出端相连、输出端与所述池化模块的输入端相连;One input terminal of the first selector is connected to the input terminal of the activation module, the other input terminal is connected to the output terminal of the activation module, and the output terminal is connected to the input terminal of the pooling module;
    所述第二选择器的一输入端与所述池化模块的输入端相连、另一输入端与所述池化模块的输出端相连、输出端与所述第三选择器的第一输入端相连；One input terminal of the second selector is connected to the input terminal of the pooling module, the other input terminal is connected to the output terminal of the pooling module, and the output terminal is connected to the first input terminal of the third selector;
    所述第三选择器的所述第一输入端与所述第二选择器的输出端相连、第二输入端与所述至少一个计算单元中的其他计算单元相连、第三输入端与所述输入端缓存器相连、第一输出端与所述输出端缓存器相连，第二输出端与所述激活模块的输入端相连；The first input terminal of the third selector is connected to the output terminal of the second selector, the second input terminal is connected to other computing units among the at least one computing unit, the third input terminal is connected to the input buffer, the first output terminal is connected to the output buffer, and the second output terminal is connected to the input terminal of the activation module;
    所述激活池化单元，用于基于所述控制器的控制，控制所述激活模块和/或所述池化模块是否工作；The activation pooling unit is configured to control, under the control of the controller, whether the activation module and/or the pooling module operates;
    所述第三选择器，用于基于所述控制器的控制，控制所述第一输入端、所述第二输入端和所述第三输入端中的一个输入端工作，控制所述第一输出端或所述第二输出端工作。The third selector is configured to, under the control of the controller, enable one of the first input terminal, the second input terminal, and the third input terminal, and enable either the first output terminal or the second output terminal.
  3. 根据权利要求2所述的卷积神经网络的基本计算单元,其特征在于,The basic computing unit of the convolutional neural network according to claim 2, wherein:
    所述激活池化单元还包括:偏置量处理单元;The activation pooling unit further includes: an offset processing unit;
    所述偏置量处理单元分别与所述加法树和所述激活模块相连;The offset processing unit is respectively connected to the addition tree and the activation module;
    所述第二输出端通过所述偏置量处理单元与所述激活模块的输入端相连。The second output terminal is connected to the input terminal of the activation module through the offset processing unit.
  4. 根据权利要求1所述的卷积神经网络的基本计算单元,其特征在于,The basic computing unit of the convolutional neural network according to claim 1, wherein:
    所述当前计算单元包括4个卷积运算单元;The current calculation unit includes 4 convolution operation units;
    所述4个卷积运算单元中任一卷积运算单元执行基本卷积运算所需的行数为3行,步长为1;The number of lines required by any of the four convolution operation units to perform a basic convolution operation is 3, and the step size is 1;
    所述当前计算单元对应的第一行数为16行。The first number of rows corresponding to the current computing unit is 16 rows.
  5. 根据权利要求1所述的卷积神经网络的基本计算单元,其特征在于,The basic computing unit of the convolutional neural network according to claim 1, wherein:
    所述输入端缓存器，还用于根据公式一、公式二和公式三，确定每一个所述计算单元对应的第一行数，其中，任一所述卷积运算单元执行基本卷积运算时的步长为1；The input buffer is further configured to determine, according to Formula 1, Formula 2, and Formula 3, the first number of rows corresponding to each computing unit, where the step size when any convolution operation unit performs a basic convolution operation is 1;
    所述公式一包括:The formula one includes:
    Figure PCTCN2019096750-appb-100001
    所述公式二包括:The formula two includes:
    Figure PCTCN2019096750-appb-100002
    所述公式三包括:The formula three includes:
    Y_i = max{y_i}
    其中，X ij为所述至少一个计算单元中的第i个计算单元中的第j个卷积运算单元执行基本卷积运算所需的行数，a i为所述第i个计算单元对应的预设优化值，a i为整数，m为所述第i个计算单元中的卷积运算单元的个数，x i为所述第i个计算单元对应的优化变量，y i为所述第i个计算单元对应的中间值，D为一行图像数据的存储资源量且单位为字节，n为所述至少一个计算单元的个数，N为外部芯片所能提供的存储资源量且单位为字节，T为所述卷积神经网络的基本计算单元中其他模块所使用的存储资源量且单位为字节，Y i为所述第i个计算单元对应的第一行数，max{ }为取最大值。where X_ij is the number of rows required by the j-th convolution operation unit in the i-th computing unit among the at least one computing unit to perform a basic convolution operation; a_i is a preset optimization value corresponding to the i-th computing unit, a_i being an integer; m is the number of convolution operation units in the i-th computing unit; x_i is an optimization variable corresponding to the i-th computing unit; y_i is an intermediate value corresponding to the i-th computing unit; D is the amount of storage resources occupied by one row of image data, in bytes; n is the number of the at least one computing unit; N is the amount of storage resources the external chip can provide, in bytes; T is the amount of storage resources used by other modules in the basic computing unit of the convolutional neural network, in bytes; Y_i is the first number of rows corresponding to the i-th computing unit; and max{ } denotes taking the maximum value.
  6. 根据权利要求1所述的卷积神经网络的基本计算单元,其特征在于,The basic computing unit of the convolutional neural network according to claim 1, wherein:
    所述输入端缓存器，还用于根据公式四和公式五，确定每一个所述计算单元对应的第一行数，其中，不同所述卷积运算单元执行基本卷积运算时所需的行数相同且步长为1，不同计算单元包括的卷积运算单元个数相同；The input buffer is further configured to determine, according to Formula 4 and Formula 5, the first number of rows corresponding to each computing unit, where different convolution operation units require the same number of rows to perform a basic convolution operation with a step size of 1, and different computing units include the same number of convolution operation units;
    所述公式四包括:The formula four includes:
    Figure PCTCN2019096750-appb-100003
    所述公式五包括:The formula five includes:
    Y = max{2^k}
    其中，A为任一所述卷积运算单元执行基本卷积运算时所需的行数，X为任一所述计算单元包括的卷积运算单元个数，k为整数，N为外部芯片所能 提供的存储资源量且单位为字节，T为所述卷积神经网络的基本计算单元中其他模块所使用的存储资源量且单位为字节，D为一行图像数据的存储资源量且单位为字节，n为所述至少一个计算单元的个数，Y为每一个所述计算单元对应的第一行数，max{ }为取最大值。where A is the number of rows required by any convolution operation unit to perform a basic convolution operation; X is the number of convolution operation units included in any computing unit; k is an integer; N is the amount of storage resources the external chip can provide, in bytes; T is the amount of storage resources used by other modules in the basic computing unit of the convolutional neural network, in bytes; D is the amount of storage resources occupied by one row of image data, in bytes; n is the number of the at least one computing unit; Y is the first number of rows corresponding to each computing unit; and max{ } denotes taking the maximum value.
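Formula 4 in this claim is preserved only as an image reference in this text, so its exact form is not recoverable here. The sketch below illustrates the shape of the computation: choosing Y as the largest power of two that fits the available on-chip memory. The constraint `n * Y * D + T <= N` is an assumed stand-in, not the patent's formula.

```python
# Hypothetical sketch of claim 6: pick the first row count Y = max{2**k} as the
# largest power of two that fits the external chip's memory. The constraint
# n * Y * D + T <= N is an assumption standing in for Formula 4, which is only
# available as an image in this text.

def first_row_count(n, D, N, T):
    """n: number of computing units; D: bytes per row of image data;
    N: bytes the external chip provides; T: bytes used by other modules."""
    k = 0
    while n * (2 ** (k + 1)) * D + T <= N:
        k += 1
    return 2 ** k  # Y = max{2**k} under the assumed constraint

# e.g. 4 computing units, 64-byte rows, 4096 bytes on chip, 512 bytes reserved
Y = first_row_count(4, 64, 4096, 512)  # -> 8
```

Whatever the true form of Formula 4, the structure is the same: the row count per computing unit is bounded by the memory left after the other modules' usage T is subtracted from the chip's capacity N.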
  7. 根据权利要求5或6所述的卷积神经网络的基本计算单元,其特征在于,The basic computing unit of the convolutional neural network according to claim 5 or 6, wherein:
    所述外部芯片包括：现场可编程逻辑门阵列FPGA芯片或定制芯片；The external chip includes a field-programmable gate array (FPGA) chip or a custom chip;
    所述其他模块包括：所述激活池化单元中包括的池化模块、所述输出端缓存器、所述加法树中任意一种或多种。The other modules include any one or more of the pooling module included in the activation pooling unit, the output buffer, and the addition tree.
  8. 根据权利要求1至6中任一所述的卷积神经网络的基本计算单元,其特征在于,The basic calculation unit of the convolutional neural network according to any one of claims 1 to 6, wherein:
    任一所述卷积运算单元执行基本卷积运算所需的行数为3行且步长为1时，对于任一所述块随机存储器：任意相邻两次缓存的图像数据中，在前缓存的末第2行图像数据与在后缓存的第1行图像数据相同，在前缓存的末第1行图像数据与在后缓存的第2行图像数据相同；When the number of rows required by any convolution operation unit to perform a basic convolution operation is 3 and the step size is 1, for any block random access memory: among the image data of any two adjacent buffering operations, the second-to-last row of image data buffered earlier is the same as the first row of image data buffered later, and the last row of image data buffered earlier is the same as the second row of image data buffered later;
    对于任一所述卷积运算单元：任意相邻两次获取的3行图像数据中，在前缓存的第2行图像数据与在后缓存的第1行图像数据相同，在前缓存的第3行图像数据与在后缓存的第2行图像数据相同。For any convolution operation unit: among the 3 rows of image data obtained in any two adjacent fetches, the second row of image data fetched earlier is the same as the first row of image data fetched later, and the third row of image data fetched earlier is the same as the second row of image data fetched later.
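The overlap claimed here (3-row window, stride 1, two rows shared between consecutive fetches) can be checked with a short sketch; the row indices below are illustrative.

```python
# Sketch of the row reuse in claim 8: with a 3-row window and stride 1, each
# fetch shares its first two rows with the last two rows of the previous
# fetch, so only one new row must be loaded per step.

def windows(num_rows, window=3, stride=1):
    """Row indices fetched by consecutive basic convolution operations."""
    return [list(range(s, s + window))
            for s in range(0, num_rows - window + 1, stride)]

w = windows(6)  # -> [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]]
overlap_ok = all(prev[1:] == cur[:-1] for prev, cur in zip(w, w[1:]))
```

This overlap is what lets the block random access memory keep its last two buffered rows in place and load only one fresh row per convolution step.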
  9. 根据权利要求1至6中任一所述的卷积神经网络的基本计算单元,其特征在于,The basic calculation unit of the convolutional neural network according to any one of claims 1 to 6, wherein:
    所述输入端缓存器，还用于针对每一个所述待处理特征图均执行：所述控制器控制当前待处理特征图的PAD为1时，在所述当前待处理特征图的外围进行补零，以使所述当前待处理特征图的行数和列数均增加2；以及具体用于基于补零后的所述当前待处理特征图执行图像数据加载操作。The input buffer is further configured to execute, for each feature map to be processed: when the controller controls the PAD of the current feature map to be processed to be 1, performing zero padding around the periphery of the current feature map to be processed so that both the number of rows and the number of columns of the current feature map to be processed increase by 2; and specifically to perform the image data loading operation based on the zero-padded current feature map to be processed.
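The PAD = 1 behavior of claim 9 can be sketched directly: a one-element border of zeros around the feature map increases both the row and column counts by 2. The 2×2 feature map below is an arbitrary example.

```python
# Sketch of claim 9's PAD = 1 zero padding: surround the feature map with a
# one-element border of zeros, increasing both the row count and the column
# count by 2.

def pad1(fmap):
    width = len(fmap[0])
    zero_row = [0] * (width + 2)
    return [zero_row] + [[0] + row + [0] for row in fmap] + [zero_row]

padded = pad1([[1, 2], [3, 4]])
# -> [[0, 0, 0, 0], [0, 1, 2, 0], [0, 3, 4, 0], [0, 0, 0, 0]]
```

Padding before loading keeps the convolution output the same size as the input for a 3-row window with stride 1, which is why the input buffer performs it before the image data loading operation.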
  10. 一种基于权利要求1至9中任一所述卷积神经网络的基本计算单元的计算方法,其特征在于,包括:A calculation method based on a basic calculation unit of a convolutional neural network according to any one of claims 1 to 9, characterized in that it comprises:
    通过所述输入端缓存器，针对每一个所述计算单元均执行：基于所述控制器的控制，将缓存的至少一个待处理特征图中的第一行数的图像数据加载至当前计算单元，所述第一行数与所述当前计算单元相对应；Executing, by the input buffer, for each computing unit: under the control of the controller, loading image data of a first number of rows of at least one buffered feature map to be processed into the current computing unit, the first number of rows corresponding to the current computing unit;
    通过所述块随机存储器，缓存所述输入端缓存器加载来的每一行图像数据；针对每一个所述卷积运算单元均执行：基于所述控制器的控制，针对自身缓存的每一行图像数据，向当前卷积运算单元下发有效行数和起始行数；Buffering, by the block random access memory, each row of image data loaded from the input buffer, and executing, for each convolution operation unit: under the control of the controller, issuing a valid row count and a starting row number to the current convolution operation unit for each row of image data buffered by the block random access memory;
    通过所述卷积运算单元，根据下发来的有效行数和起始行数，从所述块随机存储器缓存的全部图像数据中，获取第二行数的图像数据，所述第二行数为自身执行基本卷积运算所需的行数；对所述第二行数的图像数据进行基本卷积运算后发送给所述内部加法器；Obtaining, by the convolution operation unit, according to the issued valid row count and starting row number, image data of a second number of rows from all the image data buffered in the block random access memory, the second number of rows being the number of rows the convolution operation unit requires to perform a basic convolution operation, and performing the basic convolution operation on the image data of the second number of rows and sending the result to the internal adder;
    通过所述内部加法器，对所述卷积运算单元发来的图像数据进行处理后发送给所述加法树；Processing, by the internal adder, the image data sent from the convolution operation unit and sending the result to the addition tree;
    通过所述加法树，基于所述控制器的控制，对每一个所述内部加法器发来的图像数据进行处理后发送给一所述激活池化单元；Processing, by the addition tree, under the control of the controller, the image data sent from each internal adder and sending the result to an activation pooling unit;
    通过所述激活池化单元，基于所述控制器的控制，对所述加法树发来的图像数据进行处理后发送给所述输出端缓存器。Processing, by the activation pooling unit, under the control of the controller, the image data sent from the addition tree and sending the result to the output buffer.
PCT/CN2019/096750 2018-08-06 2019-07-19 Basic computing unit for convolutional neural network, and computing method WO2020029767A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810884476.4A CN109165728B (en) 2018-08-06 2018-08-06 Basic computing unit and computing method of convolutional neural network
CN201810884476.4 2018-08-06

Publications (1)

Publication Number Publication Date
WO2020029767A1

Family


Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/096750 WO2020029767A1 (en) 2018-08-06 2019-07-19 Basic computing unit for convolutional neural network, and computing method

Country Status (2)

Country Link
CN (1) CN109165728B (en)
WO (1) WO2020029767A1 (en)


Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165728B (en) * 2018-08-06 2020-12-18 浪潮集团有限公司 Basic computing unit and computing method of convolutional neural network
CN111832713B (en) * 2019-04-19 2024-06-18 北京灵汐科技有限公司 Parallel computing method and computing device based on line buffer Linebuffer
CN110503193B (en) * 2019-07-25 2022-02-22 瑞芯微电子股份有限公司 ROI-based pooling operation method and circuit
WO2021022441A1 (en) * 2019-08-05 2021-02-11 华为技术有限公司 Data transmission method and device, electronic device and readable storage medium
CN110597756B (en) * 2019-08-26 2023-07-25 光子算数(北京)科技有限责任公司 Calculation circuit and data operation method
CN114090470B (en) * 2020-07-29 2023-02-17 深圳市中科元物芯科技有限公司 Data preloading device and preloading method thereof, storage medium and computer equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590535A (en) * 2017-09-08 2018-01-16 西安电子科技大学 Programmable neural network processor
CN107704922A (en) * 2017-04-19 2018-02-16 北京深鉴科技有限公司 Artificial neural network processing unit
US20180174031A1 (en) * 2016-10-10 2018-06-21 Gyrfalcon Technology Inc. Implementation Of ResNet In A CNN Based Digital Integrated Circuit
CN109165728A (en) * 2018-08-06 2019-01-08 济南浪潮高新科技投资发展有限公司 A kind of basic computational ele- ment and calculation method of convolutional neural networks

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101398758B (en) * 2008-10-30 2012-04-25 北京航空航天大学 Detection method of code copy
CN105243399B (en) * 2015-09-08 2018-09-25 浪潮(北京)电子信息产业有限公司 A kind of method and apparatus that realizing image convolution, the method and apparatus for realizing caching
US10585848B2 (en) * 2015-10-08 2020-03-10 Via Alliance Semiconductor Co., Ltd. Processor with hybrid coprocessor/execution unit neural network unit
CN205726177U (en) * 2016-07-01 2016-11-23 浪潮集团有限公司 A kind of safety defense monitoring system based on convolutional neural networks chip
CN106203621B (en) * 2016-07-11 2019-04-30 北京深鉴智能科技有限公司 The processor calculated for convolutional neural networks
CN106875011B (en) * 2017-01-12 2020-04-17 南京风兴科技有限公司 Hardware architecture of binary weight convolution neural network accelerator and calculation flow thereof
CN106779060B (en) * 2017-02-09 2019-03-08 武汉魅瞳科技有限公司 A kind of calculation method for the depth convolutional neural networks realized suitable for hardware design
CN107578014B (en) * 2017-09-06 2020-11-03 上海寒武纪信息科技有限公司 Information processing apparatus and method
CN107832845A (en) * 2017-10-30 2018-03-23 上海寒武纪信息科技有限公司 A kind of information processing method and Related product


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KIM, CHAN: "A Deep Learning Convolution Architecture for Simple Embedded Applications", 2017 IEEE 7TH INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS - BERLIN (ICCE- BERLIN, 31 December 2017 (2017-12-31), pages 74 - 78, XP033278119 *
WAN PEIQI ET AL.: "Comparison among different numeric representations in deep convolution neural networks", JOURNAL OF COMPUTER RESEARCH AND DEVELOPMENT, 31 December 2017 (2017-12-31) *
ZHANG BANG ET AL.: "Design and implementation of a FPGA-based accelerator for convolutional neural networks", JOURNAL OF FUDAN UNIVERSITY NATURAL SCIENCE, 30 April 2018 (2018-04-30) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914996A (en) * 2020-06-30 2020-11-10 华为技术有限公司 Method for extracting data features and related device
CN113592067A (en) * 2021-07-16 2021-11-02 华中科技大学 Configurable convolution calculation circuit for convolution neural network
CN113592067B (en) * 2021-07-16 2024-02-06 华中科技大学 Configurable convolution calculation circuit for convolution neural network
CN116152307A (en) * 2023-04-04 2023-05-23 西安电子科技大学 SAR image registration preprocessing device based on FPGA
CN116152307B (en) * 2023-04-04 2023-07-21 西安电子科技大学 SAR image registration preprocessing device based on FPGA

Also Published As

Publication number Publication date
CN109165728B (en) 2020-12-18
CN109165728A (en) 2019-01-08


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19847686

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19847686

Country of ref document: EP

Kind code of ref document: A1