WO2020029767A1 - Basic computing unit for convolutional neural network, and computing method


Info

Publication number
WO2020029767A1
Authority
WO
WIPO (PCT)
Prior art keywords
image data
unit
convolution operation
basic
units
Prior art date
Application number
PCT/CN2019/096750
Other languages
French (fr)
Chinese (zh)
Inventor
李朋
赵鑫鑫
姜凯
于治楼
Original Assignee
山东浪潮人工智能研究院有限公司
Priority date
Filing date
Publication date
Application filed by 山东浪潮人工智能研究院有限公司
Publication of WO2020029767A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Definitions

  • the present invention relates to the field of computer technology, and in particular, to a basic calculation unit and calculation method of a convolutional neural network.
  • CNN: Convolutional Neural Network
  • the CNN algorithm can be implemented based on a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit).
  • CPU and GPU implementations realize the algorithm in software.
  • the invention provides a basic calculation unit and calculation method of a convolutional neural network, which can implement an algorithm based on hardware so that the completion time of the algorithm can be controlled.
  • the present invention provides a basic computing unit of a convolutional neural network, including:
  • a controller, an addition tree, an input buffer, at least one computing unit, and an output buffer;
  • the calculation unit includes a block random access memory, at least one convolution operation unit, an internal adder, and an activation pooling unit; and the block random access memory and the internal adder are connected to each of the convolution operation units, respectively.
  • the internal adder and the activation pooling unit are both connected to the addition tree;
  • the input-side buffer is configured to execute, for each of the computing units: based on the control of the controller, loading the buffered image data of a first line number of the at least one feature map to be processed into the current computing unit, where the current computing unit corresponds to the first line number;
  • the block random access memory is used for buffering each line of image data loaded by the input-side buffer, and, for each of the convolution operation units, executing: based on the control of the controller and on each line of image data cached by itself, sending the number of valid lines and the number of starting lines to the current convolution operation unit;
  • the convolution operation unit is configured to obtain the image data of a second line number from all the image data buffered in the block random access memory according to the number of valid lines and the number of starting lines issued, where the second line number is the number of lines it requires to perform a basic convolution operation; the basic convolution operation is performed on the image data of the second line number and the result is sent to the internal adder;
  • the internal adder is configured to process the image data sent by the convolution operation unit and send the image data to the addition tree;
  • the addition tree is used to process the image data sent by each of the internal adders based on the control of the controller and send the processed image data to an activation pooling unit;
  • the activation pooling unit is configured to process the image data sent from the addition tree based on the control of the controller, and send the processed image data to the output buffer.
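The pipeline claimed above (input buffer, block RAM, convolution operation units, internal adder, addition tree, activation pooling unit, output buffer) can be sketched in software. This is an illustrative model only, assuming 3 × 3 kernels, one feature map per convolution operation unit, and a ReLU standing in for the activation pooling unit; the function names are ours, not the patent's.

```python
import numpy as np

def conv3x3_valid(rows, kernel):
    """One basic convolution over a band of rows ('valid' padding)."""
    h, w = rows.shape
    out = np.zeros((h - 2, w - 2))
    for r in range(h - 2):
        for c in range(w - 2):
            out[r, c] = np.sum(rows[r:r + 3, c:c + 3] * kernel)
    return out

def basic_unit(feature_maps, kernels):
    """One calculation unit: each conv unit processes one map, the
    internal adder / addition tree sum the partial results, and a
    ReLU stands in for the activation pooling unit."""
    partial = [conv3x3_valid(f, k) for f, k in zip(feature_maps, kernels)]
    summed = np.sum(partial, axis=0)   # internal adder + addition tree
    return np.maximum(summed, 0.0)     # activation (ReLU)

maps = [np.ones((5, 5)) for _ in range(4)]       # 4 conv units, one map each
kernels = [np.full((3, 3), 0.25) for _ in range(4)]
out = basic_unit(maps, kernels)
print(out.shape)   # (3, 3); each value is 4 * 9 * 0.25 = 9.0
```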
  • the activation pooling unit includes: an activation module, a first selector, a pooling module, a second selector, and a third selector;
  • One input terminal of the first selector is connected to the input terminal of the activation module, the other input terminal is connected to the output terminal of the activation module, and the output terminal is connected to the input terminal of the pooling module;
  • One input terminal of the second selector is connected to the input terminal of the pooling module, the other input terminal is connected to the output terminal of the pooling module, and the output terminal is connected to the first input terminal of the third selector.
  • the first input terminal of the third selector is connected to the output terminal of the second selector, the second input terminal is connected to other computing units in the at least one computing unit, the third input terminal is connected to the input buffer, the first output terminal is connected to the output buffer, and the second output terminal is connected to the input terminal of the activation module;
  • the activation pooling unit is configured to control whether the activation module and / or the pooling module work based on the control of the controller;
  • the third selector is configured, based on the control of the controller, to control which one of the first input terminal, the second input terminal, and the third input terminal works, and to control whether the first output terminal or the second output terminal works.
  • the activation pooling unit further includes: an offset processing unit;
  • the offset processing unit is respectively connected to the addition tree and the activation module
  • the second output terminal is connected to the input terminal of the activation module through the offset processing unit.
  • the current calculation unit includes 4 convolution operation units
  • the number of lines required by any of the four convolution operation units to perform a basic convolution operation is 3, and the step size is 1;
  • the first number of rows corresponding to the current computing unit is 16 rows.
  • the input buffer is further configured to determine the first line number corresponding to each of the calculation units according to formula 1, formula 2, and formula 3, where the step size of any of the convolution operation units when performing a basic convolution operation is 1;
  • the formula one includes:
  • the formula two includes:
  • the formula three includes:
  • X_ij is the number of lines required by the j-th convolution operation unit in the i-th calculation unit of the at least one calculation unit to perform a basic convolution operation;
  • a_i is a coefficient corresponding to the i-th calculation unit, and a_i is an integer;
  • m is the number of convolution operation units in the i-th calculation unit;
  • x_i is an optimization variable corresponding to the i-th calculation unit;
  • y_i is an intermediate value corresponding to the i-th calculation unit;
  • D is the amount of storage resources occupied by one line of image data, in bytes;
  • n is the number of the at least one calculation unit;
  • N is the amount of storage resources the external chip can provide to the basic calculation unit of the convolutional neural network, in bytes;
  • T is the amount of storage resources used by other modules in the basic calculation unit of the convolutional neural network, in bytes;
  • Y_i is the first line number corresponding to the i-th calculation unit;
  • max{} takes the maximum value.
  • the input buffer is further configured to determine the first line number corresponding to each of the calculation units according to formulas 4 and 5, where the number of lines required by different convolution operation units to perform the basic convolution operation is the same, the step size is 1, and the number of convolution operation units included in different calculation units is the same;
  • the formula four includes:
  • the formula five includes:
  • A is the number of lines required by any of the convolution operation units to perform a basic convolution operation;
  • X is the number of convolution operation units included in any of the calculation units;
  • k is an integer;
  • N is the amount of storage resources the external chip can provide, in bytes;
  • T is the amount of storage resources used by other modules in the basic calculation unit of the convolutional neural network, in bytes;
  • D is the amount of storage resources occupied by one line of image data, in bytes;
  • n is the number of the at least one calculation unit;
  • Y is the first line number corresponding to each of the calculation units;
  • max{} takes the maximum value.
  • the external chip includes: an FPGA (Field-Programmable Gate Array) chip or a custom chip;
  • FPGA Field-Programmable Gate Array
  • the other modules include any one or more of a pooling module included in the activation pooling unit, the output buffer, and the addition tree.
  • when the number of lines required by any of the convolution operation units to perform a basic convolution operation is 3 and the step size is 1, for any of the block random access memories: of the image data of any two adjacent caches, the second-to-last line of image data cached earlier is the same as the first line of image data cached later, and the last line of image data cached earlier is the same as the second line of image data cached later;
  • the input buffer is further configured to execute, for each feature map to be processed: when the controller sets the PAD of the current feature map to be processed to 1, performing peripheral zero-padding so that the numbers of rows and columns of the current feature map are each increased by 2; the image data loading operation is then performed based on the zero-padded feature map.
  • the present invention provides a calculation method based on a basic calculation unit of any of the convolutional neural networks, including:
  • each line of image data loaded by the input buffer is buffered through the block random access memory; for each of the convolution operation units, based on the control of the controller and on each line of image data cached by itself, the number of valid lines and the number of starting lines is sent to the current convolution operation unit;
  • the image data sent from the addition tree is processed and sent to the output buffer.
  • the invention provides a basic calculation unit and calculation method of a convolutional neural network.
  • the basic calculation unit includes a controller, an addition tree, an input buffer, a plurality of calculation units, and an output buffer.
  • the calculation unit includes a block random access memory, several convolution operation units, an internal adder, and an activation pooling unit.
  • the input buffer loads the corresponding number of lines of image data into each calculation unit, and the block random access memory sends the number of valid lines and the number of starting lines to each convolution operation unit so that it obtains the corresponding number of lines of image data;
  • the convolution operation unit processes the image data and sends it to the addition tree via an internal adder;
  • the addition tree processes the image data sent by each internal adder and sends it to an activation pooling unit;
  • the activation pooling unit processes the image data and sends it to the output buffer.
  • the invention can implement the algorithm based on hardware so that the completion time of the algorithm can be controlled.
  • FIG. 1 is a schematic diagram of a basic calculation unit of a convolutional neural network according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of a computing unit in a convolutional neural network according to an embodiment of the present invention
  • FIG. 3 is a flowchart of a calculation method of a basic calculation unit based on a convolutional neural network according to an embodiment of the present invention.
  • an embodiment of the present invention provides a basic calculation unit of a convolutional neural network, which may include:
  • a controller 101, an addition tree 102, an input buffer 103, at least one calculation unit 104, and an output buffer 105;
  • the calculation unit 104 includes a block random memory 1041, at least one convolution operation unit 1042, an internal adder 1043, and an activation pooling unit 1044.
  • the block random access memory 1041 and the internal adder 1043 are respectively connected to each of the convolution operation units 1042, and the internal adder 1043 and the activation pooling unit 1044 are connected to the addition tree 102;
  • the input-side buffer 103 is configured to execute, for each of the computing units 104: based on the control of the controller 101, loading the buffered image data of the first line number of the at least one to-be-processed feature map into a current calculation unit, where the current calculation unit corresponds to the first line number;
  • the block random access memory 1041 is configured to buffer each line of image data loaded by the input-side buffer 103 and, for each of the convolution operation units 1042, execute: based on the control of the controller 101 and on each line of image data cached by itself, sending the number of valid lines and the number of starting lines to the current convolution operation unit;
  • the convolution operation unit 1042 is configured to obtain the image data of the second line number from all the image data buffered in the block random access memory 1041 according to the number of valid lines and the number of starting lines sent, where the second line number is the number of lines it requires to perform a basic convolution operation; the basic convolution operation is performed on the image data of the second line number and the result is sent to the internal adder 1043;
  • the internal adder 1043 is configured to process the image data sent by the convolution operation unit 1042 and send the image data to the addition tree 102;
  • the addition tree 102 is configured to process, based on the control of the controller 101, image data sent by each of the internal adders 1043 and send the image data to an activation pooling unit 1044;
  • the activation pooling unit 1044 is configured to process the image data sent from the addition tree 102 based on the control of the controller 101 and send the processed image data to the output buffer 105.
  • An embodiment of the present invention provides a basic calculation unit of a convolutional neural network, including a controller, an addition tree, an input buffer, a plurality of calculation units, and an output buffer.
  • the calculation unit includes a block random access memory, a number of convolution operation units, an internal adder, and an activation pooling unit.
  • the input buffer loads the corresponding number of lines of image data into each calculation unit, and the block random access memory sends the number of valid lines and the number of starting lines to each convolution operation unit so that it obtains the corresponding number of lines of image data;
  • the convolution operation unit processes the image data and sends it to the addition tree via an internal adder;
  • the addition tree processes the image data sent by each internal adder and sends it to an activation pooling unit;
  • the activation pooling unit processes the image data and sends it to the output buffer.
  • the embodiment of the present invention can implement an algorithm based on hardware, so that the completion time of the algorithm can be controlled.
  • the effective number of rows may be the number of rows currently actually loaded in the block RAM.
  • loading the image data from the input buffer to the block random memory, the number of rows to be loaded, and which row to load from can be controlled by the controller according to the upper-layer command.
  • the controller can control each of the other unit modules in the basic computing unit according to upper-layer commands. For example, after the convolution operation is completed, addition is performed through the addition tree module, data is then processed according to the activation algorithm, and pooling is then performed; all of these steps are coordinated by the controller. For example, the controller controls whether activation or pooling is performed. For example, the controller sends the parameters for data loading, data storage, and data calculation to the other unit modules, and the other unit modules report back whether they have completed.
  • the controller is implemented in Xilinx FPGA PL, and is connected to the ARM (Xilinx FPGA PS) via the AXI (Advanced eXtensible Interface) bus.
  • ARM: Xilinx FPGA PS
  • PS: Processing System
  • SoC: System on Chip
  • PL: Programmable Logic
  • through a reasonable combination of the convolution, pooling, addition tree, data storage, and control modules, the design of the basic calculation unit of the convolutional neural network proposed by the embodiment of the present invention is flexible and universal; being general-purpose, it can realize many convolutional neural network algorithms and therefore has broad application prospects.
  • the controller may be represented by Controller;
  • the input buffer may be represented by IN BUFFER;
  • the output buffer may be represented by OUT BUFFER;
  • the addition tree may be represented by ADDER TREE;
  • the at least one calculation unit may be represented by CUA (Calculate Unit Array), and a single calculation unit by CU;
  • the block random access memory may be represented by Block RAM;
  • the convolution operation unit may be represented by CONV;
  • the internal adder may be represented by Inter Adder;
  • the activation module may be represented by RELU (Rectified Linear Unit);
  • the pooling module may be represented by POOL;
  • a selector may be represented by MUX (multiplexer).
  • the processing results of the internal adder of each calculation unit may all be output to the addition tree.
  • After the addition tree aggregates the processing results, it sends all of them, according to the control information issued by the controller, to the activation pooling unit of the calculation unit designated by that control information for subsequent activation, pooling, and other processing.
  • the activation pooling unit 1044 includes: an activation module 10441, a first selector 10442, a pooling module 10443, a second selector 10444, and a third selector 10445.
  • One input terminal of the first selector 10442 is connected to the input terminal of the activation module 10441, the other input terminal is connected to the output terminal of the activation module 10441, and the output terminal is connected to the input terminal of the pooling module 10443;
  • One input terminal of the second selector 10444 is connected to the input terminal of the pooling module 10443, the other input terminal is connected to the output terminal of the pooling module 10443, and the output terminal is connected to the first input terminal of the third selector 10445;
  • the first input terminal of the third selector 10445 is connected to the output terminal of the second selector 10444, the second input terminal is connected to other computing units 104 in the at least one computing unit 104, and the third input A terminal is connected to the input buffer 103, a first output is connected to the output buffer 105, and a second output is connected to the input of the activation module 10441;
  • the activation pooling unit 1044 is configured to control whether the activation module 10441 and / or the pooling module 10443 work based on the control of the controller 101;
  • the third selector 10445 is configured, based on the control of the controller 101, to control which one of the first input terminal, the second input terminal, and the third input terminal works, and to control whether the first output terminal or the second output terminal works.
  • By controlling which input terminal of the third selector works and which output terminal works, the processing flow of the image data in the calculation unit can be determined.
  • The second input terminal of the third selector is connected to another calculation unit; this other calculation unit may be any calculation unit other than the one in which this third selector is located.
  • the activation pooling unit 1044 further includes: an offset processing unit 10446;
  • the offset processing unit 10446 is connected to the addition tree 102 and the activation module 10441, respectively;
  • the second output terminal is connected to the input terminal of the activation module 10441 through the offset processing unit 10446.
  • For example, the offset processing may increase each value by 1 before the subsequent activation and pooling operations are performed.
  • the block random access memory is responsible for buffering the input data required during the convolution operation to ensure that each connected convolution operation unit can operate normally. For example, suppose a convolution operation unit uses a 3 × 3 convolution operation as the basic convolution unit to implement convolution operations of other dimensions. The convolution operation unit then requires 3 lines of input data, so the block random access memory needs to cache at least these 3 lines.
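The 3-line buffering requirement above can be sketched with a small ring buffer. This is an illustrative model only, not the patent's block RAM implementation; the class name is ours.

```python
from collections import deque

class LineBuffer:
    """Minimal sketch of the block RAM's role: hold the last
    `kernel_rows` image rows so a 3x3 convolution unit always has
    a complete 3-row band to work on."""
    def __init__(self, kernel_rows=3):
        self.rows = deque(maxlen=kernel_rows)

    def push(self, row):
        self.rows.append(row)  # oldest row is evicted automatically

    def window(self):
        # Returns the current band once enough rows have arrived.
        if len(self.rows) == self.rows.maxlen:
            return list(self.rows)
        return None

buf = LineBuffer()
for i in range(4):          # stream rows 0..3
    buf.push(i)
print(buf.window())          # rows 1, 2, 3 are currently buffered
```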
  • the current calculation unit includes four convolution operation units 1042;
  • any one of the four convolution operation units 1042 needs 3 lines and a step size of 1 to perform a basic convolution operation
  • the first number of rows corresponding to the current computing unit is 16 rows.
  • each convolution operation unit needs 3 rows to perform basic convolution operations.
  • during image processing, the following situations can exist:
  • the image data processed by each convolution operation unit may come from different feature maps to be processed.
  • the data buffered in the block random memory can come from the same feature map or from multiple feature maps, and the specific implementation can be controlled according to the controller.
  • the block random memory needs to store at least three lines of image data, that is, the number of first lines corresponding to the current calculation unit is not less than three.
  • the block random memory needs to store at least 12 lines of image data, that is, the number of first lines corresponding to the current calculation unit is not less than 12.
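The two minimums above (3 lines when all convolution operation units share one feature map, 12 lines when each processes a different map) can be checked with a small helper. The function is illustrative, assuming a 3-line kernel and 4 convolution operation units as in the example.

```python
def min_buffered_rows(kernel_rows, conv_units, same_map):
    """Lower bound on the first line number of a calculation unit:
    kernel_rows if all conv units read the same feature map,
    kernel_rows * conv_units if each reads a different map."""
    return kernel_rows if same_map else kernel_rows * conv_units

print(min_buffered_rows(3, 4, same_map=True))    # 3
print(min_buffered_rows(3, 4, same_map=False))   # 12
```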
  • the feature maps to be processed are mostly 128 ⁇ 128, 256 ⁇ 256, 512 ⁇ 512, etc.
  • the last load is usually not 12 rows. For example, when the feature map is 128 × 128, it needs to be loaded 11 times, and only 8 rows are loaded the 11th time.
  • When the feature map is 64 × 64, it needs to be loaded 6 times, and only 6 rows are loaded the 6th time. In this way, resources are wasted during the last load.
  • the number of rows loaded into the block RAM at a time can also be a power of two not less than 12; for example, 16 rows can be selected.
  • When more resources are available in the block random access memory, 32 rows or even 64 rows can also be selected.
  • When the number of rows of the feature maps to be processed is usually large, the number of rows selected should not be too small; when it is usually small, the number of rows selected should not be too large.
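The load-count arithmetic above can be verified with a short helper. This is an illustrative sketch that assumes no overlap between consecutive loads; the function name is ours.

```python
import math

def loads(map_rows, rows_per_load):
    """Number of loads needed for a feature map, and the size of the
    final (possibly partial) load."""
    n = math.ceil(map_rows / rows_per_load)
    last = map_rows - (n - 1) * rows_per_load
    return n, last

print(loads(128, 12))   # (11, 8): 11 loads, only 8 rows the 11th time
print(loads(128, 16))   # (8, 16): 8 full loads, no waste at the end
```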
  • the input buffer 103 is further configured to determine the first line number corresponding to each of the calculation units 104 according to the following formula (1) to formula (3), where the step size of any convolution operation unit 1042 when performing a basic convolution operation is 1;
  • X_ij is the number of lines required by the j-th convolution operation unit in the i-th calculation unit of the at least one calculation unit to perform a basic convolution operation;
  • a_i is a coefficient corresponding to the i-th calculation unit, and a_i is an integer;
  • m is the number of convolution operation units in the i-th calculation unit;
  • x_i is an optimization variable corresponding to the i-th calculation unit;
  • y_i is an intermediate value corresponding to the i-th calculation unit;
  • D is the amount of storage resources occupied by one line of image data, in bytes;
  • n is the number of the at least one calculation unit;
  • N is the amount of storage resources the external chip can provide to the basic calculation unit of the convolutional neural network, in bytes;
  • T is the amount of storage resources used by other modules in the basic calculation unit of the convolutional neural network, in bytes;
  • Y_i is the first line number corresponding to the i-th calculation unit;
  • max{} takes the maximum value.
  • this design is particularly suitable for application scenarios where the number of rows of the feature map to be processed is usually large.
  • the value of a i can be flexibly set according to the control information of the controller.
  • when a_i is 0, in the above case 3, the rows loaded by the block random access memory in one load can be acquired only once by each convolution operation unit, which causes the block random access memory to repeatedly load image data from the input buffer.
  • when the step size is 1: when a_i is 1, in the above case 3, the rows loaded by the block random access memory at a time can support each convolution operation unit in acquiring them twice consecutively; when a_i is 2, three consecutive acquisitions are supported; when a_i is 3, four consecutive acquisitions are supported; and so on.
  • since Y_i is not necessarily a multiple of 2, this scheme is better suited to cases where the feature map to be processed requires peripheral zero-padding. Since Y_i takes the maximum value, repeated loading of image data from the input buffer by the block random access memory is reduced as much as possible, thus improving the overall processing efficiency.
  • the input buffer 103 is further configured to determine the first line number corresponding to each of the calculation units 104 according to the following formula (4) and formula (5), where the number of lines required by different convolution operation units 1042 to perform the basic convolution operation is the same, the step size is 1, and the number of convolution operation units 1042 included in different calculation units 104 is the same;
  • A is the number of lines required by any of the convolution operation units to perform a basic convolution operation;
  • X is the number of convolution operation units included in any of the calculation units;
  • k is an integer;
  • N is the amount of storage resources the external chip can provide, in bytes;
  • T is the amount of storage resources used by other modules in the basic calculation unit of the convolutional neural network, in bytes;
  • D is the amount of storage resources occupied by one line of image data, in bytes;
  • n is the number of the at least one calculation unit;
  • Y is the first line number corresponding to each of the calculation units;
  • max{} takes the maximum value.
  • the external chip includes: an FPGA chip or a custom chip;
  • the other modules include any one or more of a pooling module 10443 included in the activation pooling unit 1044, the output buffer 105, and the addition tree 102.
  • when the number of lines required by any of the convolution operation units 1042 to perform a basic convolution operation is 3 and the step size is 1, for any of the block random access memories 1041: of the image data of any two adjacent caches, the second-to-last line of image data cached earlier is the same as the first line of image data cached later, and the last line of image data cached earlier is the same as the second line of image data cached later.
  • the block random memory P is currently loaded with 16 lines of image data, and the 4 convolution operation units connected all process the 16 lines of image data, and each uses a 3 ⁇ 3 convolution operation as the basic convolution unit.
  • the number of lines required to perform the basic convolution operation is 3, and the step size is set to 1.
  • when the controller controls the block random access memory P to send the number of valid rows 16 and the starting row 1 to the convolution operation unit Q, Q obtains lines 1 to 3 of the 16 rows. When the controller controls the block random access memory P to send the number of valid rows 16 and the starting row 2, the convolution operation unit Q obtains lines 2 to 4 of the 16 rows. This loop repeats until the convolution operation unit Q obtains lines 14 to 16 of the 16 rows, completing the processing of the 16 rows of image data.
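The sweep of starting rows described above can be sketched as follows. This is an illustrative helper (1-based rows, 3-line windows, step size 1 as in the example), not the controller's actual logic.

```python
def row_windows(valid_rows, kernel_rows=3, stride=1):
    """Start/end rows the controller issues so a conv unit sweeps
    every 3-row band of the currently loaded block (1-based)."""
    return [(s, s + kernel_rows - 1)
            for s in range(1, valid_rows - kernel_rows + 2, stride)]

windows = row_windows(16)
print(windows[0], windows[1], windows[-1])  # (1, 3) (2, 4) (14, 16)
print(len(windows))                          # 14 basic convolutions per block
```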
  • the input buffer 103 is further configured to execute, for each feature map to be processed: when the controller 101 sets the PAD of the current feature map to be processed to 1, performing zero-padding on the periphery of the current feature map so that its numbers of rows and columns are each increased by 2, and then performing the image data loading operation based on the zero-padded feature map.
  • the size of the feature map to be processed is 512 ⁇ 512, that is, it has 512 rows of image data and 512 columns of image data.
  • the first to third rows are processed, then the second to fourth rows are processed, then the third to fifth rows are processed, and so on. It ends after processing lines 510 to 512. It can be seen that it needs to be processed 510 times, so the size of the result obtained is 510 ⁇ 510, which is two rows less than the original image and two columns less in the number of columns.
  • since the convolution processing operation needs to be executed many times in a loop, the feature map shrinks every time it is executed.
  • therefore, PAD can be set to 1: before the convolution processing operation is performed, the periphery of the feature map is supplemented with a ring of zero values, so that its size changes from 512 × 512 to 514 × 514. In this way, after one convolution processing operation, an image of size 514 × 514 is processed into an image of size 512 × 512. Repeating this keeps the feature map size constant regardless of the number of convolution processing operations.
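The size bookkeeping above can be checked with the standard convolution output-size formula. This is a sketch under the example's assumptions (3 × 3 kernel, stride 1); the function name is ours.

```python
def output_size(rows, cols, kernel=3, pad=1, stride=1):
    """Output size of one convolution pass with `pad` rings of
    peripheral zero-padding on each side."""
    return ((rows + 2 * pad - kernel) // stride + 1,
            (cols + 2 * pad - kernel) // stride + 1)

print(output_size(512, 512, pad=0))   # (510, 510): shrinks without padding
print(output_size(512, 512, pad=1))   # (512, 512): PAD=1 keeps size constant
```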
  • the value of PAD is equivalent to the number of rings of zero-padding around the periphery of the feature map.
  • zero padding can be performed either in the input buffer or in the block random access memory.
  • in one approach, zero padding is performed first, and the image data is then loaded based on the zero-padded feature map.
  • for example, the feature map has 32 rows before zero padding and 34 rows after zero padding.
  • the zero-padded feature map has zero values in its first row, last row, first column, and last column. Assuming 16 rows are loaded at a time, the first load takes rows 1 to 16 of the 34 rows, the second load takes rows 15 to 30, and the third load takes rows 29 to 34.
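The overlapping load schedule described above — 16 rows per load, with the last two rows of each load re-read so a 3-row, stride-1 window never straddles a load boundary — can be sketched as follows (the function is our illustration, not part of the patent):

```python
def load_windows(total_rows, rows_per_load=16, overlap=2):
    # Each load re-reads the last `overlap` rows of the previous load,
    # so every 3-row sliding window falls entirely within some load.
    windows, start = [], 1
    while True:
        end = min(start + rows_per_load - 1, total_rows)
        windows.append((start, end))
        if end == total_rows:
            return windows
        start = end - overlap + 1

print(load_windows(34))  # [(1, 16), (15, 30), (29, 34)]
```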
  • alternatively, the image data is loaded first, and zero padding is then applied to the loaded data. For example, the feature map has 32 rows and 16 rows are loaded at a time. The first load takes rows 1 to 15 of the 32 rows; since these 15 rows are the top of the picture, the loaded rows are zero-padded in the first row, first column, and last column. The second load takes rows 14 to 29; since these rows are neither the top nor the bottom of the picture, only the first and last columns of the loaded rows are zero-padded. The third load takes rows 28 to 32; since these 5 rows are the bottom of the picture, the loaded rows are zero-padded in the last row, first column, and last column.
  • if the feature map has only 8 rows in total and 16 rows are loaded at a time, all 8 rows are taken in the first load. Since these 8 rows are the whole picture, peripheral zero padding is applied to them once, that is, to the first row, last row, first column, and last column.
  • the CNN algorithm can be implemented based on hardware.
  • the CNN algorithm hardware implementation technology has the advantages of controllable algorithm completion time, high computing power, and low power consumption, and can be applied to various small device terminals, such as mobile phones.
  • an embodiment of the present invention provides a computing method based on the basic computing unit of any convolutional neural network described above, which specifically includes the following steps:
  • Step 301: through the input buffer, execute for each computing unit: based on the control of the controller, load the cached image data of the first row count from at least one feature map to be processed into the current computing unit, the first row count corresponding to the current computing unit.
  • Step 302: through the block random access memory, cache each row of image data loaded from the input buffer; and execute for each convolution operation unit: based on the control of the controller, for the rows of image data in its own cache, send the effective row count and starting row number to the current convolution operation unit.
  • Step 303: through the convolution operation unit, according to the effective row count and starting row number received, obtain the image data of the second row count from all the image data cached in the block random access memory, the second row count being the number of rows the unit needs to perform one basic convolution operation; perform the basic convolution operation on that image data and send the result to the internal adder.
  • Step 304: through the internal adder, process the image data sent by the convolution operation units and send it to the addition tree.
  • Step 305: through the addition tree, based on the control of the controller, process the image data sent by each internal adder and send it to an activation pooling unit.
  • Step 306: through the activation pooling unit, based on the control of the controller, process the image data sent by the addition tree and send it to the output buffer.
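Steps 301 to 306 can be summarized as a minimal functional model of the dataflow. This is only an illustrative software sketch of the hardware pipeline — the 3×3 kernels, ReLU activation, and 2×2 max pooling are assumptions consistent with the examples elsewhere in the document, not a definitive implementation:

```python
import numpy as np

def convolve2d_valid(image, kernel):
    # Basic (valid, stride-1) convolution performed by one convolution
    # operation unit over its assigned rows.
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def basic_compute_unit(feature_maps, kernels, use_relu=True, use_pool=True):
    # Steps 303-306: each convolution operation unit convolves one feature
    # map; the internal adder / addition tree sum the partial results;
    # then optional activation (ReLU) and 2x2 max pooling are applied.
    partial = [convolve2d_valid(fm, k) for fm, k in zip(feature_maps, kernels)]
    summed = np.sum(partial, axis=0)       # internal adder + addition tree
    if use_relu:
        summed = np.maximum(summed, 0.0)   # activation module
    if use_pool:                           # pooling module (2x2 max)
        h, w = summed.shape
        h2, w2 = h - h % 2, w - w % 2
        summed = summed[:h2, :w2].reshape(h2 // 2, 2, w2 // 2, 2).max(axis=(1, 3))
    return summed
```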
  • the basic computing unit of the convolutional neural network includes a controller, an addition tree, an input buffer, several computing units, and an output buffer; each computing unit includes a block random access memory, several convolution operation units, an internal adder, and an activation pooling unit.
  • based on the control of the controller, the input buffer loads the corresponding number of rows of image data into each computing unit, and the block random access memory sends the effective row count and starting row number to each convolution operation unit so that it obtains the corresponding rows of image data;
  • the convolution operation units process the image data and send it through the internal adders to the addition tree;
  • the addition tree processes the image data sent by each internal adder and sends it to an activation pooling unit;
  • the activation pooling unit processes the image data and sends it to the output buffer.
  • the embodiment of the present invention can implement an algorithm based on hardware, so that the completion time of the algorithm can be controlled.
  • the proposed design of the basic computational unit of the convolutional neural network has the characteristics of flexibility and universality through a reasonable combination of convolution, pooling, addition tree, data storage, and control modules.
  • numerous convolutional neural network algorithms can be realized, so the design has broad application prospects.
  • the foregoing program may be stored in a computer-readable storage medium.
  • when executed, the program performs the steps of the foregoing method embodiment.
  • the foregoing storage medium includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Image Processing (AREA)

Abstract

The present invention provides a basic computing unit for a convolutional neural network, and a computing method, the basic computing unit comprising a controller, an adder tree, an input buffer, several computing units, and an output buffer; each of the computing units comprises a block random access memory, several convolution operation units, an internal adder, and an activation and pooling unit. On the basis of the control of the controller, the input buffer loads, to the computing units, a corresponding number of lines of image data, and the block random access memory delivers an effective number of lines and a starting line number to the convolution operation units, so as to enable the convolution operation units to acquire image data of corresponding line numbers; the convolution operation units process the image data, and send same to the adder tree by means of the internal adders; the adder tree processes the image data sent from the internal adders, and sends same to an activation and pooling unit; and the activation and pooling unit processes the image data, and sends same to the output buffer. The present solution is able to implement an algorithm on the basis of hardware, making completion time of an algorithm controllable.

Description

Basic computing unit and computing method of a convolutional neural network

Technical Field

The present invention relates to the field of computer technology, and in particular, to a basic computing unit and computing method of a convolutional neural network.
Background Art

Convolutional Neural Networks (CNNs) are an efficient recognition method that has developed rapidly and attracted widespread attention in recent years.

At present, the CNN algorithm can be implemented based on a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit). Both the CPU and GPU implementations realize the algorithm in software.

Because the existing implementations realize the algorithm in software, the completion time of the algorithm is uncontrollable.
Summary of the Invention

The present invention provides a basic computing unit and computing method of a convolutional neural network, which can implement the algorithm in hardware so that the algorithm completion time is controllable.

In order to achieve the above object, the present invention is implemented by the following technical solutions:

In one aspect, the present invention provides a basic computing unit of a convolutional neural network, including:

a controller, an addition tree, an input buffer, at least one computing unit, and an output buffer;
the computing unit includes a block random access memory, at least one convolution operation unit, an internal adder, and an activation pooling unit; the block random access memory and the internal adder are each connected to every convolution operation unit, and the internal adder and the activation pooling unit are both connected to the addition tree;

the input buffer is configured to execute, for each computing unit: based on the control of the controller, load the cached image data of a first row count from at least one feature map to be processed into the current computing unit, the first row count corresponding to the current computing unit;

the block random access memory is configured to cache each row of image data loaded from the input buffer, and to execute, for each convolution operation unit: based on the control of the controller, for the rows of image data in its own cache, send an effective row count and a starting row number to the current convolution operation unit;

the convolution operation unit is configured to obtain, according to the effective row count and starting row number it receives, image data of a second row count from all the image data cached in the block random access memory, the second row count being the number of rows the unit needs to perform one basic convolution operation; and to perform the basic convolution operation on that image data and send the result to the internal adder;

the internal adder is configured to process the image data sent by the convolution operation units and send it to the addition tree;

the addition tree is configured to, based on the control of the controller, process the image data sent by each internal adder and send it to an activation pooling unit;

the activation pooling unit is configured to, based on the control of the controller, process the image data sent by the addition tree and send it to the output buffer.
Further, the activation pooling unit includes an activation module, a first selector, a pooling module, a second selector, and a third selector;

one input of the first selector is connected to the input of the activation module, its other input is connected to the output of the activation module, and its output is connected to the input of the pooling module;

one input of the second selector is connected to the input of the pooling module, its other input is connected to the output of the pooling module, and its output is connected to the first input of the third selector;

the first input of the third selector is connected to the output of the second selector, its second input is connected to the other computing units among the at least one computing unit, its third input is connected to the input buffer, its first output is connected to the output buffer, and its second output is connected to the input of the activation module;

the activation pooling unit is configured to control, based on the control of the controller, whether the activation module and/or the pooling module operates;

the third selector is configured to, based on the control of the controller, select one of the first input, the second input, and the third input to operate, and select the first output or the second output to operate.
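The first and second selectors described above act as optional bypasses around the activation and pooling stages. A behavioral sketch of that routing (the toy stage implementations here are stand-ins for illustration, not the patent's modules):

```python
def activation_pooling_unit(data, relu_enabled, pool_enabled, relu, pool):
    # First selector (MUX): feed the pooling stage either the activation
    # module's output or the raw input, as the controller directs.
    x = relu(data) if relu_enabled else data
    # Second selector (MUX): emit either the pooling module's output or
    # its bypassed input.
    return pool(x) if pool_enabled else x

# Toy stage implementations (hypothetical, for illustration only):
relu = lambda v: [max(e, 0) for e in v]                              # ReLU
pool = lambda v: [max(v[i], v[i + 1]) for i in range(0, len(v) - 1, 2)]  # 1D max pool

print(activation_pooling_unit([-1, 3, 2, -5], True, True, relu, pool))  # [3, 2]
```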
Further, the activation pooling unit also includes a bias processing unit;

the bias processing unit is connected to the addition tree and to the activation module;

the second output is connected to the input of the activation module through the bias processing unit.

Further, the current computing unit includes 4 convolution operation units;

each of the 4 convolution operation units needs 3 rows to perform a basic convolution operation, with a stride of 1;

the first row count corresponding to the current computing unit is 16 rows.
Further, the input buffer is also configured to determine the first row count corresponding to each computing unit according to Formula 1, Formula 2, and Formula 3, where the stride of any convolution operation unit performing a basic convolution operation is 1;

Formula 1 is:
[Formula 1 is rendered as an image (PCTCN2019096750-appb-000001) in the original publication.]
Formula 2 is:
[Formula 2 is rendered as an image (PCTCN2019096750-appb-000002) in the original publication.]
Formula 3 is:

Y_i = max{y_i}
where X_ij is the number of rows the j-th convolution operation unit in the i-th computing unit (of the at least one computing unit) needs to perform a basic convolution operation, a_i is the preset optimization value corresponding to the i-th computing unit (an integer), m is the number of convolution operation units in the i-th computing unit, x_i is the optimization variable corresponding to the i-th computing unit, y_i is the intermediate value corresponding to the i-th computing unit, D is the amount of storage needed for one row of image data in bytes, n is the number of computing units, N is the amount of storage the external chip can provide in bytes, T is the amount of storage used by the other modules in the basic computing unit of the convolutional neural network in bytes, Y_i is the first row count corresponding to the i-th computing unit, and max{} takes the maximum value.
Further, the input buffer is also configured to determine the first row count corresponding to each computing unit according to Formula 4 and Formula 5, where all convolution operation units need the same number of rows to perform a basic convolution operation with a stride of 1, and all computing units include the same number of convolution operation units;

Formula 4 is:
[Formula 4 is rendered as an image (PCTCN2019096750-appb-000003) in the original publication.]
Formula 5 is:

Y = max{2^k}
where A is the number of rows any convolution operation unit needs to perform a basic convolution operation, X is the number of convolution operation units included in any computing unit, k is an integer, N is the amount of storage the external chip can provide in bytes, T is the amount of storage used by the other modules in the basic computing unit of the convolutional neural network in bytes, D is the amount of storage needed for one row of image data in bytes, n is the number of computing units, Y is the first row count corresponding to every computing unit, and max{} takes the maximum value.
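Formula 4 itself survives only as an image in the source, so its exact form is unknown here. One plausible reading, given the variable definitions, is that Y is the largest power of two for which n block RAMs, each holding Y rows of D bytes, fit into the N − T bytes left after the other modules take their share. The sketch below encodes that assumed inequality and should be treated as an illustration only, not the patent's formula:

```python
def first_row_count(N, T, D, n):
    # Largest Y = 2**k such that n buffers of Y rows of D bytes each
    # fit in the N - T bytes available for image rows.
    # (Assumed constraint: n * Y * D <= N - T; Formula 4 is an image
    # in the source, so this reading is a guess.)
    budget = N - T
    k = 0
    while n * (2 ** (k + 1)) * D <= budget:
        k += 1
    return 2 ** k

print(first_row_count(N=4 * 1024 * 1024, T=1024 * 1024, D=512, n=8))  # 512
```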
Further, the external chip includes an FPGA (Field-Programmable Gate Array) chip or a custom chip;

the other modules include any one or more of: the pooling module included in the activation pooling unit, the output buffer, and the addition tree.
Further, when any convolution operation unit needs 3 rows to perform a basic convolution operation with a stride of 1, then for any block random access memory: between any two consecutive cachings of image data, the second-to-last row of the earlier caching is the same as the first row of the later caching, and the last row of the earlier caching is the same as the second row of the later caching;

for any convolution operation unit: between any two consecutive acquisitions of 3 rows of image data, the second row of the earlier acquisition is the same as the first row of the later acquisition, and the third row of the earlier acquisition is the same as the second row of the later acquisition.
Further, the input buffer is also configured to execute, for each feature map to be processed: when the controller sets the PAD of the current feature map to 1, zero-pad the periphery of the current feature map so that its row count and column count each increase by 2; and to perform the image data load operation based on the zero-padded current feature map.
In another aspect, the present invention provides a computing method based on the basic computing unit of any of the convolutional neural networks described above, including:

through the input buffer, executing for each computing unit: based on the control of the controller, loading the cached image data of a first row count from at least one feature map to be processed into the current computing unit, the first row count corresponding to the current computing unit;

through the block random access memory, caching each row of image data loaded from the input buffer, and executing for each convolution operation unit: based on the control of the controller, for the rows of image data in its own cache, sending an effective row count and a starting row number to the current convolution operation unit;

through the convolution operation unit, obtaining, according to the effective row count and starting row number received, image data of a second row count from all the image data cached in the block random access memory, the second row count being the number of rows the unit needs to perform one basic convolution operation; performing the basic convolution operation on that image data and sending the result to the internal adder;

through the internal adder, processing the image data sent by the convolution operation units and sending it to the addition tree;

through the addition tree, based on the control of the controller, processing the image data sent by each internal adder and sending it to an activation pooling unit;

through the activation pooling unit, based on the control of the controller, processing the image data sent by the addition tree and sending it to the output buffer.
The present invention provides a basic computing unit and computing method of a convolutional neural network. The basic computing unit includes a controller, an addition tree, an input buffer, several computing units, and an output buffer; each computing unit includes a block random access memory, several convolution operation units, an internal adder, and an activation pooling unit. Based on the control of the controller, the input buffer loads the corresponding number of rows of image data into each computing unit, and the block random access memory sends the effective row count and starting row number to each convolution operation unit so that it obtains the corresponding rows of image data; the convolution operation units process the image data and send it through the internal adders to the addition tree; the addition tree processes the image data sent by each internal adder and sends it to an activation pooling unit; the activation pooling unit processes the image data and sends it to the output buffer. The present invention can implement the algorithm in hardware so that the algorithm completion time is controllable.
BRIEF DESCRIPTION OF THE DRAWINGS

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of a basic computing unit of a convolutional neural network according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a computing unit in a convolutional neural network according to an embodiment of the present invention;

FIG. 3 is a flowchart of a computing method based on the basic computing unit of a convolutional neural network according to an embodiment of the present invention.
Detailed Description

In order to make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
As shown in FIG. 1, an embodiment of the present invention provides a basic computing unit of a convolutional neural network, which may include:

a controller 101, an addition tree 102, an input buffer 103, at least one computing unit 104, and an output buffer 105;

the computing unit 104 includes a block random access memory 1041, at least one convolution operation unit 1042, an internal adder 1043, and an activation pooling unit 1044;

the block random access memory 1041 and the internal adder 1043 are each connected to every convolution operation unit 1042, and the internal adder 1043 and the activation pooling unit 1044 are both connected to the addition tree 102;

the input buffer 103 is configured to execute, for each computing unit 104: based on the control of the controller 101, load the cached image data of a first row count from at least one feature map to be processed into the current computing unit, the first row count corresponding to the current computing unit;

the block random access memory 1041 is configured to cache each row of image data loaded from the input buffer 103, and to execute, for each convolution operation unit 1042: based on the control of the controller 101, for the rows of image data in its own cache, send an effective row count and a starting row number to the current convolution operation unit;

the convolution operation unit 1042 is configured to obtain, according to the effective row count and starting row number it receives, image data of a second row count from all the image data cached in the block random access memory 1041, the second row count being the number of rows the unit needs to perform one basic convolution operation; and to perform the basic convolution operation on that image data and send the result to the internal adder 1043;

the internal adder 1043 is configured to process the image data sent by the convolution operation units 1042 and send it to the addition tree 102;

the addition tree 102 is configured to, based on the control of the controller 101, process the image data sent by each internal adder 1043 and send it to an activation pooling unit 1044;

the activation pooling unit 1044 is configured to, based on the control of the controller 101, process the image data sent by the addition tree 102 and send it to the output buffer 105.
An embodiment of the present invention provides a basic computing unit of a convolutional neural network, including a controller, an addition tree, an input buffer, several computing units, and an output buffer. Each computing unit includes a block random access memory, several convolution operation units, an internal adder, and an activation pooling unit. Based on the control of the controller, the input buffer loads the corresponding number of rows of image data into each computing unit, and the block random access memory sends the effective row count and starting row number to each convolution operation unit so that it obtains the corresponding rows of image data; the convolution operation units process the image data and send it through the internal adders to the addition tree; the addition tree processes the image data sent by each internal adder and sends it to an activation pooling unit; the activation pooling unit processes the image data and sends it to the output buffer. The embodiment of the present invention can implement the algorithm in hardware so that the algorithm completion time is controllable.
In detail, the effective row count may be the number of rows currently actually loaded into the block random access memory.

In detail, the loading of image data from the input buffer to the block random access memory, the number of rows to load, and which rows to load, may be controlled by the controller according to upper-layer commands.

In detail, according to upper-layer commands the controller can control every other unit module in the basic computing unit. For example, after the convolution operation is completed, addition is performed by the addition tree module, data is then processed according to the activation algorithm, and pooling is then performed; all of these processes are orchestrated by the controller. For example, the controller controls whether activation or pooling may be performed. For example, the controller sends parameters for data loading, data storage, and data computation to the other unit modules, and the other unit modules report back whether they have completed.
In one embodiment of the present invention, the controller is implemented in Xilinx FPGA-PL and is connected to an ARM core (Xilinx FPGA-PS) via an AXI (Advanced eXtensible Interface) bus. For the ARM core (Xilinx FPGA-PS), the PS (Processing System) is the part of the ARM SoC (System on Chip) that is independent of the FPGA fabric in the chip, and the PL (Programmable Logic) is the FPGA fabric part of the chip.
本发明实施例所提出的这一卷积神经网络基本计算单元的设计,通过将卷积、池化、加法树、数据存储及控制模块等合理结合,具有灵活通用的特点,以达到通用的目的,可以实现众多卷积神经网络算法,故具有广阔的应用前景。The design of the basic computational unit of the convolutional neural network proposed by the embodiment of the present invention has the characteristics of flexibility and universality through reasonable combination of convolution, pooling, addition tree, data storage, and control modules, so as to achieve the general purpose , Can realize many convolutional neural network algorithms, so it has broad application prospects.
本发明一个实施例中,控制器可以controller表示,输入端缓存器可以IN BUFFER表示,输出端缓存器可以OUT BUFFER表示,加法树可以ADDER TREE Module表示,至少一个计算单元可以CUA(Calculate Unit Array,计 算单元阵列)表示,计算单元即为CU,块随机存储器可以Block RAM表示,卷积运算单元可以CONV表示,内部加法器可以Inter Adder表示,激活模块可以RELU(Rectified Linear Unit)表示,池化模块可以POOL表示,选择器可以MUX(multiplexer)表示。In one embodiment of the present invention, the controller may be represented by a controller, the input buffer may be represented by IN BUFFER, the output buffer may be represented by OUT BUFFER, the addition tree may be represented by ADDER TREE, and at least one calculation unit may be CUA (Calculate Unit Array Computing unit array) indicates that the computing unit is CU, block random memory can be represented by Block RAM, convolution operation unit can be represented by CONV, internal adder can be represented by Inter Adder, activation module can be represented by RELU (Rectified Linear Unit), pooling module It can be represented by POOL, and the selector can be represented by MUX (multiplexer).
在本发明一个实施例中,各个计算单元的内部加法器的处理结果,可以均输出至加法树,加法树将处理结果汇聚后,根据控制器发来的控制信息,将汇聚的全部处理结果发送给该控制信息所指定计算单元的激活池化单元,以进行后续的激活、池化等处理。In an embodiment of the present invention, the processing results of the internal adder of each calculation unit may be all output to the addition tree. After the addition tree aggregates the processing results, according to the control information sent by the controller, all the aggregated processing results are sent. The activation pooling unit of the computing unit designated by the control information is used for subsequent activation, pooling, and other processing.
In one embodiment of the present invention, referring to FIG. 2, the activation-pooling unit 1044 comprises an activation module 10441, a first selector 10442, a pooling module 10443, a second selector 10444, and a third selector 10445.
One input terminal of the first selector 10442 is connected to the input terminal of the activation module 10441, its other input terminal is connected to the output terminal of the activation module 10441, and its output terminal is connected to the input terminal of the pooling module 10443.
One input terminal of the second selector 10444 is connected to the input terminal of the pooling module 10443, its other input terminal is connected to the output terminal of the pooling module 10443, and its output terminal is connected to the first input terminal of the third selector 10445.
The first input terminal of the third selector 10445 is connected to the output terminal of the second selector 10444; its second input terminal is connected to the other computing units 104 among the at least one computing unit 104; its third input terminal is connected to the input buffer 103; its first output terminal is connected to the output buffer 105; and its second output terminal is connected to the input terminal of the activation module 10441.
The activation-pooling unit 1044 is configured to control, under the control of the controller 101, whether the activation module 10441 and/or the pooling module 10443 operate.
The third selector 10445 is configured to select, under the control of the controller 101, which one of the first, second, and third input terminals operates, and whether the first or the second output terminal operates.
In detail, referring to FIG. 2, the processing flow of the image data within a computing unit can be determined.
In detail, based on the control information from the controller, it can be determined which input terminal of the first selector operates, thereby determining whether activation processing is required. Likewise, it can be determined which input terminal of the second selector operates, thereby determining whether pooling processing is required.
In detail, based on the control information from the controller, it can be determined which input terminal of the third selector operates and which output terminal operates. The second input terminal of the third selector is connected to another computing unit, which may be any computing unit among the at least one computing unit other than the one in which the selector itself resides.
In one embodiment of the present invention, the activation-pooling unit 1044 further comprises an offset processing unit 10446.
The offset processing unit 10446 is connected to the addition tree 102 and to the activation module 10441, respectively.
The second output terminal is connected to the input terminal of the activation module 10441 through the offset processing unit 10446.
For example, when the offset is 1, 1 is added to every value of the image data coming from the addition tree. Only after the offset processing are the subsequent activation and pooling operations performed.
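The dataflow through the activation-pooling unit (offset processing, then optionally the RELU module, then optionally the POOL module, with the first and second selectors acting as bypasses) can be sketched as a behavioral model. This is an illustrative sketch only; the 2×2 max-pooling window is an assumption, not a detail stated in the source.

```python
import numpy as np

def activation_pooling_unit(rows, offset=0.0, use_relu=True, use_pool=True):
    """Behavioral model of the activation-pooling unit.

    `rows` holds convolution sums from the addition tree.  The offset
    unit adds a bias to every value; the first selector chooses whether
    the RELU module runs; the second selector chooses whether the POOL
    module runs (modeled here as 2x2 max pooling, an assumed window).
    """
    data = np.asarray(rows, dtype=float) + offset            # offset processing unit
    if use_relu:                                             # first selector: RELU path
        data = np.maximum(data, 0.0)
    if use_pool:                                             # second selector: POOL path
        h, w = data.shape
        data = data[:h - h % 2, :w - w % 2]                  # trim to even size
        data = data.reshape(data.shape[0] // 2, 2,
                            data.shape[1] // 2, 2).max(axis=(1, 3))
    return data
```

Either module can be bypassed independently, mirroring the controller deciding whether activation or pooling is performed.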
In the embodiments of the present invention, the block RAM is responsible for buffering the input data needed during the convolution operation, so as to ensure that each connected convolution operation unit can run normally. For example, suppose a convolution operation unit uses a 3×3 convolution as the basic convolution unit to implement convolutions of other dimensions; that convolution operation unit then requires 3 rows of input data, so the block RAM must buffer at least those 3 rows of input data.
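A minimal software model of one basic convolution pass may clarify why exactly 3 rows must be buffered: the unit slides a 3×3 window across the 3 rows served by the block RAM. The NumPy implementation below is an illustrative sketch, not the hardware design.

```python
import numpy as np

def conv3x3_over_rows(rows, kernel):
    """One basic 3x3 convolution pass over exactly 3 buffered rows.

    `rows` is a 3 x W slice served by the block RAM; the unit slides a
    3x3 window across it with stride 1, producing one output row of
    width W - 2 (no padding).
    """
    rows = np.asarray(rows, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    assert rows.shape[0] == 3 and kernel.shape == (3, 3)
    w = rows.shape[1]
    return np.array([(rows[:, c:c + 3] * kernel).sum() for c in range(w - 2)])
```

Each call consumes one 3-row window; producing the next output row requires the block RAM to serve rows shifted down by the stride.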
Based on the foregoing, in one embodiment of the present invention, the current computing unit comprises four convolution operation units 1042;
the number of rows required by any one of the four convolution operation units 1042 to perform a basic convolution operation is 3, with a stride of 1;
the first row count corresponding to the current computing unit is 16 rows.
For example, referring to FIG. 2, assume that the current computing unit comprises four identical convolution operation units, each of which requires 3 rows to perform a basic convolution operation. For image processing, the following two cases may exist:
Case 1: the convolution operation units in the same computing unit all process the same image data;
Case 2: the convolution operation units in the same computing unit all process different image data.
Within case 2 there may further be a case 3: the image data processed by the respective convolution operation units may come from different feature maps to be processed. Thus, the data buffered in the block RAM may come from a single feature map or from multiple feature maps, as controlled by the controller.
In case 1, if every convolution operation unit processes the same 3 rows of image data, the block RAM needs to store at least 3 rows of image data, i.e. the first row count corresponding to the current computing unit is not less than 3.
In case 2, if the convolution operation units simultaneously process different sets of 3 rows of image data, the block RAM needs to store at least 12 rows of image data, i.e. the first row count corresponding to the current computing unit is not less than 12.
Considering that feature maps to be processed are mostly 128×128, 256×256, 512×512, and sometimes 64×64, 32×32, or 16×16, the number of rows is usually a power of two. If no peripheral zero-padding of the feature map is required and the block RAM loads 12 rows at a time, the final load is usually fewer than 12 rows: a 128×128 feature map requires 11 loads, the 11th loading only 8 rows, and a 64×64 feature map requires 6 loads, the 6th loading only 4 rows. The final load therefore wastes resources.
Accordingly, the number of rows loaded by the block RAM at a time can likewise be made a power of two that is not less than 12; for example, 16 rows may be chosen.
Based on the same principle, in other embodiments of the present invention, when the block RAM has more resources available, 32 rows or even 64 rows may be chosen.
In general, when the block RAM has ample resources available, the chosen row count should not be too small if the feature maps to be processed usually have many rows, and should not be too large if they usually have few rows.
Taking 16 rows per load as an example: even in case 2 above, these 16 rows guarantee that each convolution operation unit can fetch image data at least twice in succession, correspondingly reducing the number of loads from the input buffer and the time delay caused by repeatedly loading image data. This design amounts to batch processing: when one row of computation finishes, the unit automatically moves to the next row for the convolution operation, eliminating the time redundancy between rows and the performance loss incurred when a convolution operation wraps to a new row. Operation efficiency is thereby improved and high-speed operation achieved.
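The load-count arithmetic above (11 loads of 12 rows for a 128-row map with only 8 rows in the last load, versus exactly 8 full loads of 16 rows) can be checked with a short helper; the function is an illustrative model of non-overlapping loads with no padding.

```python
def load_schedule(total_rows, rows_per_load):
    """Number of loads, and size of the final load, when the block RAM
    fetches `rows_per_load` rows at a time (no overlap, no padding)."""
    n_loads = -(-total_rows // rows_per_load)          # ceiling division
    last = total_rows - (n_loads - 1) * rows_per_load  # rows left for the final load
    return n_loads, last
```

A power-of-two load size divides power-of-two feature-map heights evenly, so the final load is always full.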
In one embodiment of the present invention, the input buffer 103 is further configured to determine the first row count corresponding to each computing unit 104 according to the following formulas (1) to (3), where the stride used by any convolution operation unit 1042 when performing a basic convolution operation is 1:
[Formulas (1) and (2) appear in the original document only as images (PCTCN2019096750-appb-000004 and -000005) and are not reproduced here.]
Y_i = max{y_i}    (3)
where X_ij is the number of rows required by the j-th convolution operation unit in the i-th computing unit of the at least one computing unit to perform a basic convolution operation; a_i is a preset optimization value corresponding to the i-th computing unit, a_i being an integer; m is the number of convolution operation units in the i-th computing unit; x_i is an optimization variable corresponding to the i-th computing unit; y_i is an intermediate value corresponding to the i-th computing unit; D is the amount of storage, in bytes, occupied by one row of image data; n is the number of the at least one computing unit; N is the amount of storage, in bytes, that the external chip can provide; T is the amount of storage, in bytes, used by the other modules in the basic computing unit of the convolutional neural network; Y_i is the first row count corresponding to the i-th computing unit; and max{} denotes taking the maximum.
In detail, this design is especially suitable for application scenarios in which the feature maps to be processed usually have a large number of rows.
In detail, the value of a_i can be set flexibly according to the control information from the controller. When a_i is 0, in case 3 above the rows loaded by the block RAM at a time support only a single fetch by each convolution operation unit, which causes the block RAM to repeatedly reload image data from the input buffer.
With a stride of 1, when a_i is 1, in case 3 above the rows loaded by the block RAM at a time support two consecutive fetches by each convolution operation unit; when a_i is 2, three consecutive fetches; when a_i is 3, four consecutive fetches; and so on.
Since Y_i is not necessarily a power of two, this scheme is better suited to the case where the feature maps to be processed require peripheral zero-padding. Since Y_i takes the maximum value, the number of times the block RAM repeatedly loads image data from the input buffer is minimized, improving overall processing efficiency.
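Because formulas (1) and (2) survive only as images in the source, the following sketch rests on an assumed reading of the variable definitions: each computing unit i buffers y_i = Σ_j (X_ij + a_i) rows (the a_i extra rows allowing a_i additional stride-1 window shifts per convolution unit), subject to the assumed memory constraint D·Σ_i y_i + T ≤ N, with Y_i the largest feasible value. If the original formulas differ, this sketch must be adjusted accordingly.

```python
def first_row_counts(X, D, N, T):
    """Hedged reconstruction of formulas (1)-(3): grow a common extra-row
    count `a` while the assumed memory constraint D * sum(y) + T <= N
    still holds, where unit i needs y_i = sum_j X[i][j] + m_i * a rows.
    """
    n = len(X)
    base = [sum(rows) for rows in X]   # rows needed when a = 0
    m = [len(rows) for rows in X]      # convolution units per computing unit
    a = 0
    while True:                        # try one more consecutive fetch per unit
        y = [base[i] + m[i] * (a + 1) for i in range(n)]
        if D * sum(y) + T > N:
            break
        a += 1
    return [base[i] + m[i] * a for i in range(n)]
```

With one computing unit of four 3-row convolution units and 16 rows' worth of memory, this reproduces the 16-row example from the text.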
Based on the foregoing, in another embodiment of the present invention, the input buffer 103 is further configured to determine the first row count corresponding to each computing unit 104 according to the following formulas (4) and (5), where all the convolution operation units 1042 require the same number of rows to perform a basic convolution operation with a stride of 1, and all the computing units 104 comprise the same number of convolution operation units 1042:
[Formula (4) appears in the original document only as an image (PCTCN2019096750-appb-000006) and is not reproduced here.]
Y = max{2^k}    (5)
where A is the number of rows required by any one of the convolution operation units to perform a basic convolution operation; X is the number of convolution operation units comprised in any one of the computing units; k is an integer; N is the amount of storage, in bytes, that the external chip can provide; T is the amount of storage, in bytes, used by the other modules in the basic computing unit of the convolutional neural network; D is the amount of storage, in bytes, occupied by one row of image data; n is the number of the at least one computing unit; Y is the first row count corresponding to each computing unit; and max{} denotes taking the maximum.
In detail, since Y is a power of two, this scheme is better suited to the case where the number of rows of the feature maps to be processed is also a power of two and no peripheral zero-padding is needed.
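Formula (4) is likewise only an image in the source. The sketch below assumes it selects the largest power of two 2^k satisfying the memory constraint n·D·2^k + T ≤ N while covering the A·X rows the X convolution units jointly need; that reading matches the 16-row example for A = 3, X = 4, but it is an assumption.

```python
def power_of_two_row_count(A, X, D, N, T, n):
    """Hedged reconstruction of formulas (4)-(5): Y is the largest power
    of two that fits the assumed memory budget n * D * Y + T <= N and is
    at least A * X (one distinct A-row window per convolution unit)."""
    k = 0
    while 2 ** (k + 1) * D * n + T <= N:   # grow while the next power still fits
        k += 1
    y = 2 ** k
    if y < A * X:
        raise ValueError("available block RAM cannot hold one row set per unit")
    return y
```

A power-of-two Y divides power-of-two feature-map heights evenly, avoiding the undersized final load discussed above.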
In one embodiment of the present invention, the external chip comprises an FPGA chip or a custom chip;
the other modules comprise any one or more of the pooling module 10443 comprised in the activation-pooling unit 1044, the output buffer 105, and the addition tree 102.
In one embodiment of the present invention, when the number of rows required by any convolution operation unit 1042 to perform a basic convolution operation is 3 and the stride is 1, then for any block RAM 1041: in any two adjacently buffered batches of image data, the second-to-last row of the earlier batch is the same as the first row of the later batch, and the last row of the earlier batch is the same as the second row of the later batch.
For any convolution operation unit 1042: in any two adjacently fetched sets of 3 rows of image data, the second row of the earlier set is the same as the first row of the later set, and the third row of the earlier set is the same as the second row of the later set.
For example, assume the block RAM P has currently loaded 16 rows of image data, the four connected convolution operation units all process these 16 rows, each uses a 3×3 convolution as the basic convolution unit to implement the convolution operation (i.e. the number of rows required for a basic convolution operation is 3), and the stride is set to 1 in all cases.
Assume the block RAM P issues a valid row count of 16 and a starting row number of 1 to convolution operation unit Q among the four units; Q then fetches rows 1 to 3 of the 16 rows. After Q finishes processing rows 1 to 3, it can notify the controller, and the controller can control the block RAM P to issue to Q a valid row count of 16 and a starting row number of 2, whereupon Q fetches rows 2 to 4 of the 16 rows. This cycle repeats until Q fetches rows 14 to 16 of the 16 rows, completing the processing of these 16 rows of image data.
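The starting row numbers issued in this cycle can be enumerated directly: for a 16-row buffer and a 3-row window at stride 1, the starts run from 1 to 14, the last window covering rows 14 to 16. The helper below is an illustrative model of that schedule.

```python
def window_starts(valid_rows, window_rows=3, stride=1):
    """Starting row numbers issued to one convolution unit as it walks a
    `valid_rows`-row buffer with a `window_rows`-row window at `stride`.
    1-based, matching the row numbering used in the example."""
    return list(range(1, valid_rows - window_rows + 2, stride))
```

Each start corresponds to one (valid row count, starting row number) pair issued by the block RAM under controller control.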
In one embodiment of the present invention, the input buffer 103 is further configured to execute, for each feature map to be processed: when the controller 101 sets the PAD of the current feature map to 1, zero-padding the periphery of the current feature map so that both its row count and its column count increase by 2; and the input buffer 103 is specifically configured to perform the image-data loading operation based on the zero-padded current feature map.
For example, assume the feature map to be processed has a size of 512×512, i.e. 512 rows and 512 columns of image data. With a 3×3 basic convolution unit and a stride of 1, rows 1 to 3 are processed first, then rows 2 to 4, then rows 3 to 5, and so on, ending after rows 510 to 512 are processed. As can be seen, 510 passes are needed, so the result is 510×510, two rows and two columns smaller than the original image. When the convolution operation must be executed in a loop multiple times, the image shrinks with every pass.
On this basis, to keep the image size unchanged after the convolution operation, PAD can be set to 1, padding one ring of zeros around the image before the convolution so that its size grows from 512×512 to 514×514; one convolution pass then turns the 514×514 image into a 512×512 image. Cycling in this way, the image size remains constant no matter how many convolution passes are performed.
By the same principle, the value of PAD equals the number of rings of zeros padded around the image.
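The size arithmetic above follows the standard convolution output-size formula, which can be written down directly; this is a general identity rather than anything specific to the patent's hardware.

```python
def conv_output_size(size, kernel=3, pad=0, stride=1):
    """Spatial size after one convolution pass.  Each unit of `pad` adds
    one ring of zeros, i.e. 2 * pad extra rows and columns."""
    return (size + 2 * pad - kernel) // stride + 1
```

With kernel 3 and stride 1, pad = 1 exactly cancels the 2-row, 2-column shrinkage, which is why the image size stays constant across passes.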
In detail, the zero-padding may be performed either in the input buffer or in the block RAM.
When zero-padding in the input buffer, the padding is done first and the image data is then loaded based on the padded feature map. For example, a feature map has 32 rows; after padding it has 34 rows, with the first row, last row, first column, and last column of the padded feature map all zero. Assuming 16 rows are loaded at a time, the first load takes rows 1 to 16 of the 34 rows, the second load takes rows 15 to 30, and the third load takes rows 29 to 34.
When zero-padding in the block RAM, the image data is loaded first and zeros are then padded based on the loaded data. For example, a feature map has 32 rows and 16 rows are loaded at a time. The first load takes rows 1 to 15 of the 32 rows; since these 15 rows are the start of the image, the loaded 15 rows can be padded with zeros on the first row, first column, and last column. The second load takes rows 14 to 29; since these 15 rows are neither the start nor the end of the image, only the first and last columns of the loaded 15 rows are zero-padded. The third load takes rows 28 to 32; since these 5 rows are the end of the image, the last row, first column, and last column of the loaded 5 rows are zero-padded.
In addition, by the same principle, when the feature map has 8 rows in total and 16 rows are loaded at a time, the first load takes all 8 rows. Since these 8 rows are the entire image, a full peripheral zero-padding can be applied to them at once, i.e. the first row, last row, first column, and last column are padded.
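The overlapping load windows in the input-buffer example (rows 1-16, 15-30, 29-34 of the 34-row padded map) follow from adjacent loads sharing 2 rows, consistent with the 3-row window at stride 1. A small helper reproduces them; note the 2-row overlap is inferred from the example rather than stated as a general rule in the source.

```python
def overlapped_loads(total_rows, rows_per_load=16, overlap=2):
    """1-based (start, end) row windows the input buffer issues when
    adjacent loads must share `overlap` rows (kernel 3, stride 1)."""
    loads, start = [], 1
    while True:
        end = min(start + rows_per_load - 1, total_rows)
        loads.append((start, end))
        if end == total_rows:          # final load reaches the last row
            return loads
        start = end - overlap + 1      # re-fetch the shared boundary rows
```

The shared rows let the sliding window cross load boundaries without gaps in the output.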
In the embodiments of the present invention, the CNN algorithm can be implemented in hardware. Hardware implementation of CNN algorithms offers controllable algorithm completion time, high computing power, and low power consumption, and can be applied to various small device terminals, for example in fields such as mobile phones.
As shown in FIG. 3, an embodiment of the present invention provides a computing method based on the basic computing unit of a convolutional neural network according to any of the above, specifically comprising the following steps:
Step 301: through the input buffer, executing for each computing unit: under the control of the controller, loading buffered image data of a first row count from at least one feature map to be processed into the current computing unit, the first row count corresponding to the current computing unit.
Step 302: through the block RAM, buffering each row of image data loaded from the input buffer; and executing for each convolution operation unit: under the control of the controller, issuing, for each row of image data buffered by the block RAM itself, a valid row count and a starting row number to the current convolution operation unit.
Step 303: through the convolution operation unit, according to the issued valid row count and starting row number, fetching image data of a second row count from all the image data buffered by the block RAM, the second row count being the number of rows the unit itself requires to perform a basic convolution operation; performing the basic convolution operation on the image data of the second row count, and sending the result to the internal adder.
Step 304: through the internal adder, processing the image data sent by the convolution operation unit and sending the result to the addition tree.
Step 305: through the addition tree, under the control of the controller, processing the image data sent by each internal adder and sending the result to one activation-pooling unit.
Step 306: through the activation-pooling unit, under the control of the controller, processing the image data sent by the addition tree and sending the result to the output buffer.
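The sequence of steps above can be condensed into a single software sketch for one computing unit. The single-unit, ReLU-only configuration and the NumPy types are illustrative simplifications, not the patent's full multi-unit dataflow.

```python
import numpy as np

def basic_unit_forward(feature_map, kernels, offset=0.0):
    """End-to-end sketch of one computing unit: rows are served from the
    buffered feature map, each 3x3 convolution unit produces partial sums
    (one kernel per CONV unit), the partial sums are accumulated as the
    internal adder / addition tree would, and offset + ReLU stands in
    for the activation-pooling stage."""
    fm = np.asarray(feature_map, dtype=float)
    h, w = fm.shape
    out = np.zeros((h - 2, w - 2))
    for k in kernels:                       # one CONV unit per kernel
        k = np.asarray(k, dtype=float)
        for r in range(h - 2):              # slide the 3-row window down
            for c in range(w - 2):
                out[r, c] += (fm[r:r + 3, c:c + 3] * k).sum()  # accumulate
    return np.maximum(out + offset, 0.0)    # offset, then RELU
```

The triple loop mirrors the hardware's row-by-row walk; pooling and the selector bypasses are omitted for brevity.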
Since the information exchange and execution flow among the units, modules, and components in the above method are based on the same concept as the product embodiments of the present invention, reference may be made to the description of the product embodiments for details, which are not repeated here.
In summary, the embodiments of the present invention have at least the following beneficial effects:
1. In the embodiments of the present invention, the basic computing unit of the convolutional neural network comprises a controller, an addition tree, an input buffer, a number of computing units, and an output buffer; each computing unit comprises a block RAM, a number of convolution operation units, an internal adder, and an activation-pooling unit. Under the control of the controller, the input buffer loads the corresponding number of rows of image data into each computing unit, and the block RAM issues a valid row count and a starting row number to each convolution operation unit so that it can fetch the corresponding rows of image data; each convolution operation unit processes the image data and sends it to the addition tree via the internal adder; the addition tree processes the image data from each internal adder and sends it to an activation-pooling unit; the activation-pooling unit processes the image data and sends it to the output buffer. The embodiments of the present invention implement the algorithm in hardware, so that the algorithm completion time is controllable.
2. In the embodiments of the present invention, the proposed design of this basic computing unit of a convolutional neural network, by suitably combining the convolution, pooling, addition-tree, data-storage, and control modules, is flexible and general-purpose: it can implement a wide range of convolutional neural network algorithms and therefore has broad application prospects.
3. In the embodiments of the present invention, when one row of computation finishes, the unit automatically moves to the next row for the convolution operation, eliminating the time redundancy between rows and the performance loss incurred when a convolution operation wraps to a new row. Operation efficiency is thereby improved and high-speed operation achieved.
4. In the embodiments of the present invention, the CNN algorithm can be implemented in hardware. Hardware implementation of CNN algorithms offers controllable algorithm completion time, high computing power, and low power consumption, and can be applied to various small device terminals, for example in fields such as mobile phones.
需要说明的是,在本文中,诸如第一和第二之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个······”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同因素。It should be noted that in this article, relational terms such as first and second are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply that between these entities or operations There is any such actual relationship or order. Moreover, the terms "including", "comprising", or any other variation thereof are intended to encompass non-exclusive inclusion, such that a process, method, article, or device that includes a series of elements includes not only those elements but also those that are not explicitly listed Or other elements inherent to such a process, method, article, or device. Without more restrictions, the elements defined by the sentence "including one ..." do not exclude the existence of other identical factors in the process, method, article, or equipment that includes the elements.
本领域普通技术人员可以理解：实现上述方法实施例的全部或部分步骤 可以通过程序指令相关的硬件来完成，前述的程序可以存储在计算机可读取的存储介质中，该程序在执行时，执行包括上述方法实施例的步骤；而前述的存储介质包括：ROM、RAM、磁碟或者光盘等各种可以存储程序代码的介质中。A person of ordinary skill in the art can understand that all or part of the steps of the foregoing method embodiments may be implemented by hardware instructed by a program; the foregoing program may be stored in a computer-readable storage medium, and when executed, the program performs the steps of the foregoing method embodiments; the foregoing storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
最后需要说明的是：以上所述仅为本发明的较佳实施例，仅用于说明本发明的技术方案，并非用于限定本发明的保护范围。凡在本发明的精神和原则之内所做的任何修改、等同替换、改进等，均包含在本发明的保护范围内。Finally, it should be noted that the above are merely preferred embodiments of the present invention, intended only to illustrate the technical solution of the present invention, not to limit its protection scope. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within the protection scope of the present invention.

Claims (10)

  1. 一种卷积神经网络的基本计算单元，其特征在于，包括：A basic computing unit of a convolutional neural network, characterized by comprising:
    控制器、加法树、输入端缓存器、至少一个计算单元、输出端缓存器;A controller, an addition tree, an input buffer, at least one computing unit, and an output buffer;
    所述计算单元包括块随机存储器、至少一个卷积运算单元、内部加法器、激活池化单元；所述块随机存储器和所述内部加法器均分别与每一个所述卷积运算单元相连，所述内部加法器和所述激活池化单元均与所述加法树相连；The computing unit comprises a block random access memory, at least one convolution operation unit, an internal adder, and an activation pooling unit; the block random access memory and the internal adder are each connected to every convolution operation unit, and the internal adder and the activation pooling unit are both connected to the addition tree;
    所述输入端缓存器，用于针对每一个所述计算单元均执行：基于所述控制器的控制，将缓存的至少一个待处理特征图中的第一行数的图像数据加载至当前计算单元，所述当前计算单元对应有所述第一行数；The input buffer is configured to execute, for each computing unit: under the control of the controller, loading image data of a first number of rows of at least one buffered feature map to be processed into the current computing unit, the current computing unit corresponding to the first number of rows;
    所述块随机存储器，用于缓存所述输入端缓存器加载来的每一行图像数据；针对每一个所述卷积运算单元均执行：基于所述控制器的控制，针对自身缓存的每一行图像数据，向当前卷积运算单元下发有效行数和起始行数；The block random access memory is configured to buffer each row of image data loaded from the input buffer, and to execute, for each convolution operation unit: under the control of the controller, issuing a valid row count and a starting row number to the current convolution operation unit for each row of image data buffered by the block random access memory;
    所述卷积运算单元，用于根据下发来的有效行数和起始行数，从所述块随机存储器缓存的全部图像数据中，获取第二行数的图像数据，所述第二行数为自身执行基本卷积运算所需的行数；对所述第二行数的图像数据进行基本卷积运算后发送给所述内部加法器；The convolution operation unit is configured to obtain, according to the issued valid row count and starting row number, image data of a second number of rows from all the image data buffered in the block random access memory, the second number of rows being the number of rows the convolution operation unit requires to perform a basic convolution operation, and to perform the basic convolution operation on the image data of the second number of rows and send the result to the internal adder;
    所述内部加法器,用于对所述卷积运算单元发来的图像数据进行处理后发送给所述加法树;The internal adder is configured to process the image data sent by the convolution operation unit and send the image data to the addition tree;
    所述加法树,用于基于所述控制器的控制,对每一个所述内部加法器发来的图像数据进行处理后发送给一所述激活池化单元;The addition tree is used to process the image data sent by each of the internal adders based on the control of the controller and send the processed image data to an activation pooling unit;
    所述激活池化单元,用于基于所述控制器的控制,对所述加法树发来的图像数据进行处理后发送给所述输出端缓存器。The activation pooling unit is configured to process the image data sent from the addition tree based on the control of the controller, and send the processed image data to the output buffer.
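For orientation, the dataflow recited in claim 1 can be sketched in software. The sketch below is illustrative only: the 1-D "basic convolution", the ReLU stand-in for the activation module, the width-2 max pool for the pooling module, and all row/kernel sizes are assumptions, not taken from the patent.

```python
# Illustrative software sketch of the claim-1 dataflow: input buffer rows ->
# convolution operation units -> internal adder -> activation + pooling.
# All names and sizes here are assumptions for demonstration.

def conv_rows(rows, kernel):
    """Stand-in for a basic convolution: a 1-D valid convolution applied to
    each row in the window, with the per-row results summed."""
    k, width = len(kernel), len(rows[0])
    out = []
    for col in range(width - k + 1):
        acc = 0
        for row in rows:
            for i in range(k):
                acc += row[col + i] * kernel[i]
        out.append(acc)
    return out

def compute_unit(block_ram_rows, kernel, window=3, stride=1):
    """One computing unit: each convolution operation unit reads `window` rows
    starting at its issued start row; the internal adder then sums the
    partial results element-wise."""
    partials = [conv_rows(block_ram_rows[s:s + window], kernel)
                for s in range(0, len(block_ram_rows) - window + 1, stride)]
    return [sum(vals) for vals in zip(*partials)]  # internal adder

def relu(xs):   # activation module stand-in
    return [max(0, x) for x in xs]

def pool2(xs):  # pooling module stand-in: non-overlapping max pool of size 2
    return [max(xs[i], xs[i + 1]) for i in range(0, len(xs) - 1, 2)]

# the input buffer loads 4 rows of a feature map into the computing unit
rows = [[1, 2, 3, 4], [0, 1, 0, 1], [2, 2, 2, 2], [1, 0, 1, 0]]
kernel = [1, -1]
summed = compute_unit(rows, kernel)  # -> [-2, 0, -2]
out = pool2(relu(summed))            # activation + pooling -> [0]
```

The addition tree, which combines results across several such computing units, is omitted here for brevity; it would element-wise sum the outputs of multiple `compute_unit` calls before the activation stage.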
  2. 根据权利要求1所述的卷积神经网络的基本计算单元,其特征在于,The basic computing unit of the convolutional neural network according to claim 1, wherein:
    所述激活池化单元包括:激活模块、第一选择器、池化模块、第二选择器、第三选择器;The activation pooling unit includes: an activation module, a first selector, a pooling module, a second selector, and a third selector;
    所述第一选择器的一输入端与所述激活模块的输入端相连、另一输入端 与所述激活模块的输出端相连、输出端与所述池化模块的输入端相连;One input terminal of the first selector is connected to the input terminal of the activation module, the other input terminal is connected to the output terminal of the activation module, and the output terminal is connected to the input terminal of the pooling module;
    所述第二选择器的一输入端与所述池化模块的输入端相连、另一输入端与所述池化模块的输出端相连、输出端与所述第三选择器的第一输入端相连；One input terminal of the second selector is connected to the input terminal of the pooling module, the other input terminal is connected to the output terminal of the pooling module, and the output terminal is connected to the first input terminal of the third selector;
    所述第三选择器的所述第一输入端与所述第二选择器的输出端相连、第二输入端与所述至少一个计算单元中的其他计算单元相连、第三输入端与所述输入端缓存器相连、第一输出端与所述输出端缓存器相连，第二输出端与所述激活模块的输入端相连；The first input terminal of the third selector is connected to the output terminal of the second selector, the second input terminal is connected to other computing units among the at least one computing unit, the third input terminal is connected to the input buffer, the first output terminal is connected to the output buffer, and the second output terminal is connected to the input terminal of the activation module;
    所述激活池化单元，用于基于所述控制器的控制，控制所述激活模块和/或所述池化模块是否工作；The activation pooling unit is configured to control, under the control of the controller, whether the activation module and/or the pooling module operates;
    所述第三选择器，用于基于所述控制器的控制，控制所述第一输入端、所述第二输入端和所述第三输入端中的一个输入端工作，控制所述第一输出端或所述第二输出端工作。The third selector is configured to, under the control of the controller, enable one of the first input terminal, the second input terminal, and the third input terminal, and enable either the first output terminal or the second output terminal.
  3. 根据权利要求2所述的卷积神经网络的基本计算单元,其特征在于,The basic computing unit of the convolutional neural network according to claim 2, wherein:
    所述激活池化单元还包括:偏置量处理单元;The activation pooling unit further includes: an offset processing unit;
    所述偏置量处理单元分别与所述加法树和所述激活模块相连;The offset processing unit is respectively connected to the addition tree and the activation module;
    所述第二输出端通过所述偏置量处理单元与所述激活模块的输入端相连。The second output terminal is connected to the input terminal of the activation module through the offset processing unit.
  4. 根据权利要求1所述的卷积神经网络的基本计算单元,其特征在于,The basic computing unit of the convolutional neural network according to claim 1, wherein:
    所述当前计算单元包括4个卷积运算单元;The current calculation unit includes 4 convolution operation units;
    所述4个卷积运算单元中任一卷积运算单元执行基本卷积运算所需的行数为3行,步长为1;The number of lines required by any of the four convolution operation units to perform a basic convolution operation is 3, and the step size is 1;
    所述当前计算单元对应的第一行数为16行。The first number of rows corresponding to the current computing unit is 16 rows.
  5. 根据权利要求1所述的卷积神经网络的基本计算单元,其特征在于,The basic computing unit of the convolutional neural network according to claim 1, wherein:
    所述输入端缓存器，还用于根据公式一、公式二和公式三，确定每一个所述计算单元对应的第一行数，其中，任一所述卷积运算单元执行基本卷积运算时的步长为1；The input buffer is further configured to determine, according to Formula 1, Formula 2, and Formula 3, the first number of rows corresponding to each computing unit, where the step size when any convolution operation unit performs a basic convolution operation is 1;
    所述公式一包括:The formula one includes:
    Figure PCTCN2019096750-appb-100001
    所述公式二包括:The formula two includes:
    Figure PCTCN2019096750-appb-100002
    所述公式三包括:The formula three includes:
    Y_i = max{y_i}
    其中，X ij为所述至少一个计算单元中的第i个计算单元中的第j个卷积运算单元执行基本卷积运算所需的行数，a i为所述第i个计算单元对应的预设优化值，a i为整数，m为所述第i个计算单元中的卷积运算单元的个数，x i为所述第i个计算单元对应的优化变量，y i为所述第i个计算单元对应的中间值，D为一行图像数据的存储资源量且单位为字节，n为所述至少一个计算单元的个数，N为外部芯片所能提供的存储资源量且单位为字节，T为所述卷积神经网络的基本计算单元中其他模块所使用的存储资源量且单位为字节，Y i为所述第i个计算单元对应的第一行数，max{ }为取最大值。where X_ij is the number of rows required by the j-th convolution operation unit in the i-th computing unit among the at least one computing unit to perform a basic convolution operation; a_i is a preset optimization value corresponding to the i-th computing unit, a_i being an integer; m is the number of convolution operation units in the i-th computing unit; x_i is an optimization variable corresponding to the i-th computing unit; y_i is an intermediate value corresponding to the i-th computing unit; D is the amount of storage resources occupied by one row of image data, in bytes; n is the number of the at least one computing unit; N is the amount of storage resources the external chip can provide, in bytes; T is the amount of storage resources used by other modules in the basic computing unit of the convolutional neural network, in bytes; Y_i is the first number of rows corresponding to the i-th computing unit; and max{ } denotes taking the maximum value.
  6. 根据权利要求1所述的卷积神经网络的基本计算单元,其特征在于,The basic computing unit of the convolutional neural network according to claim 1, wherein:
    所述输入端缓存器，还用于根据公式四和公式五，确定每一个所述计算单元对应的第一行数，其中，不同所述卷积运算单元执行基本卷积运算时所需的行数相同且步长为1，不同计算单元包括的卷积运算单元个数相同；The input buffer is further configured to determine, according to Formula 4 and Formula 5, the first number of rows corresponding to each computing unit, where different convolution operation units require the same number of rows to perform a basic convolution operation with a step size of 1, and different computing units include the same number of convolution operation units;
    所述公式四包括:The formula four includes:
    Figure PCTCN2019096750-appb-100003
    所述公式五包括:The formula five includes:
    Y = max{2^k}
    其中，A为任一所述卷积运算单元执行基本卷积运算时所需的行数，X为任一所述计算单元包括的卷积运算单元个数，k为整数，N为外部芯片所能 提供的存储资源量且单位为字节，T为所述卷积神经网络的基本计算单元中其他模块所使用的存储资源量且单位为字节，D为一行图像数据的存储资源量且单位为字节，n为所述至少一个计算单元的个数，Y为每一个所述计算单元对应的第一行数，max{ }为取最大值。where A is the number of rows required by any convolution operation unit to perform a basic convolution operation; X is the number of convolution operation units included in any computing unit; k is an integer; N is the amount of storage resources the external chip can provide, in bytes; T is the amount of storage resources used by other modules in the basic computing unit of the convolutional neural network, in bytes; D is the amount of storage resources occupied by one row of image data, in bytes; n is the number of the at least one computing unit; Y is the first number of rows corresponding to each computing unit; and max{ } denotes taking the maximum value.
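Formula 4 in this claim is preserved only as an image reference in this text, so its exact form is not recoverable here. The sketch below illustrates the shape of the computation: choosing Y as the largest power of two that fits the available on-chip memory. The constraint `n * Y * D + T <= N` is an assumed stand-in, not the patent's formula.

```python
# Hypothetical sketch of claim 6: pick the first row count Y = max{2**k} as the
# largest power of two that fits the external chip's memory. The constraint
# n * Y * D + T <= N is an assumption standing in for Formula 4, which is only
# available as an image in this text.

def first_row_count(n, D, N, T):
    """n: number of computing units; D: bytes per row of image data;
    N: bytes the external chip provides; T: bytes used by other modules."""
    k = 0
    while n * (2 ** (k + 1)) * D + T <= N:
        k += 1
    return 2 ** k  # Y = max{2**k} under the assumed constraint

# e.g. 4 computing units, 64-byte rows, 4096 bytes on chip, 512 bytes reserved
Y = first_row_count(4, 64, 4096, 512)  # -> 8
```

Whatever the true form of Formula 4, the structure is the same: the row count per computing unit is bounded by the memory left after the other modules' usage T is subtracted from the chip's capacity N.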
  7. 根据权利要求5或6所述的卷积神经网络的基本计算单元,其特征在于,The basic computing unit of the convolutional neural network according to claim 5 or 6, wherein:
    所述外部芯片包括：现场可编程逻辑门阵列FPGA芯片或定制芯片；The external chip includes a field-programmable gate array (FPGA) chip or a custom chip;
    所述其他模块包括：所述激活池化单元中包括的池化模块、所述输出端缓存器、所述加法树中任意一种或多种。The other modules include any one or more of the pooling module included in the activation pooling unit, the output buffer, and the addition tree.
  8. 根据权利要求1至6中任一所述的卷积神经网络的基本计算单元,其特征在于,The basic calculation unit of the convolutional neural network according to any one of claims 1 to 6, wherein:
    任一所述卷积运算单元执行基本卷积运算所需的行数为3行且步长为1时，对于任一所述块随机存储器：任意相邻两次缓存的图像数据中，在前缓存的末第2行图像数据与在后缓存的第1行图像数据相同，在前缓存的末第1行图像数据与在后缓存的第2行图像数据相同；When the number of rows required by any convolution operation unit to perform a basic convolution operation is 3 and the step size is 1, for any block random access memory: among the image data of any two adjacent buffering operations, the second-to-last row of image data buffered earlier is the same as the first row of image data buffered later, and the last row of image data buffered earlier is the same as the second row of image data buffered later;
    对于任一所述卷积运算单元：任意相邻两次获取的3行图像数据中，在前缓存的第2行图像数据与在后缓存的第1行图像数据相同，在前缓存的第3行图像数据与在后缓存的第2行图像数据相同。For any convolution operation unit: among the 3 rows of image data obtained in any two adjacent fetches, the second row of image data fetched earlier is the same as the first row of image data fetched later, and the third row of image data fetched earlier is the same as the second row of image data fetched later.
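The overlap claimed here (3-row window, stride 1, two rows shared between consecutive fetches) can be checked with a short sketch; the row indices below are illustrative.

```python
# Sketch of the row reuse in claim 8: with a 3-row window and stride 1, each
# fetch shares its first two rows with the last two rows of the previous
# fetch, so only one new row must be loaded per step.

def windows(num_rows, window=3, stride=1):
    """Row indices fetched by consecutive basic convolution operations."""
    return [list(range(s, s + window))
            for s in range(0, num_rows - window + 1, stride)]

w = windows(6)  # -> [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]]
overlap_ok = all(prev[1:] == cur[:-1] for prev, cur in zip(w, w[1:]))
```

This overlap is what lets the block random access memory keep its last two buffered rows in place and load only one fresh row per convolution step.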
  9. 根据权利要求1至6中任一所述的卷积神经网络的基本计算单元,其特征在于,The basic calculation unit of the convolutional neural network according to any one of claims 1 to 6, wherein:
    所述输入端缓存器，还用于针对每一个所述待处理特征图均执行：所述控制器控制当前待处理特征图的PAD为1时，在所述当前待处理特征图的外围进行补零，以使所述当前待处理特征图的行数和列数均增加2；以及具体用于基于补零后的所述当前待处理特征图执行图像数据加载操作。The input buffer is further configured to execute, for each feature map to be processed: when the controller controls the PAD of the current feature map to be processed to be 1, performing zero padding around the periphery of the current feature map to be processed so that both the number of rows and the number of columns of the current feature map to be processed increase by 2; and specifically to perform the image data loading operation based on the zero-padded current feature map to be processed.
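The PAD = 1 behavior of claim 9 can be sketched directly: a one-element border of zeros around the feature map increases both the row and column counts by 2. The 2×2 feature map below is an arbitrary example.

```python
# Sketch of claim 9's PAD = 1 zero padding: surround the feature map with a
# one-element border of zeros, increasing both the row count and the column
# count by 2.

def pad1(fmap):
    width = len(fmap[0])
    zero_row = [0] * (width + 2)
    return [zero_row] + [[0] + row + [0] for row in fmap] + [zero_row]

padded = pad1([[1, 2], [3, 4]])
# -> [[0, 0, 0, 0], [0, 1, 2, 0], [0, 3, 4, 0], [0, 0, 0, 0]]
```

Padding before loading keeps the convolution output the same size as the input for a 3-row window with stride 1, which is why the input buffer performs it before the image data loading operation.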
  10. 一种基于权利要求1至9中任一所述卷积神经网络的基本计算单元的计算方法,其特征在于,包括:A calculation method based on a basic calculation unit of a convolutional neural network according to any one of claims 1 to 9, characterized in that it comprises:
    通过所述输入端缓存器，针对每一个所述计算单元均执行：基于所述控制器的控制，将缓存的至少一个待处理特征图中的第一行数的图像数据加载至当前计算单元，所述第一行数与所述当前计算单元相对应；Executing, by the input buffer, for each computing unit: under the control of the controller, loading image data of a first number of rows of at least one buffered feature map to be processed into the current computing unit, the first number of rows corresponding to the current computing unit;
    通过所述块随机存储器，缓存所述输入端缓存器加载来的每一行图像数据；针对每一个所述卷积运算单元均执行：基于所述控制器的控制，针对自身缓存的每一行图像数据，向当前卷积运算单元下发有效行数和起始行数；Buffering, by the block random access memory, each row of image data loaded from the input buffer, and executing, for each convolution operation unit: under the control of the controller, issuing a valid row count and a starting row number to the current convolution operation unit for each row of image data buffered by the block random access memory;
    通过所述卷积运算单元，根据下发来的有效行数和起始行数，从所述块随机存储器缓存的全部图像数据中，获取第二行数的图像数据，所述第二行数为自身执行基本卷积运算所需的行数；对所述第二行数的图像数据进行基本卷积运算后发送给所述内部加法器；Obtaining, by the convolution operation unit, according to the issued valid row count and starting row number, image data of a second number of rows from all the image data buffered in the block random access memory, the second number of rows being the number of rows the convolution operation unit requires to perform a basic convolution operation, and performing the basic convolution operation on the image data of the second number of rows and sending the result to the internal adder;
    通过所述内部加法器，对所述卷积运算单元发来的图像数据进行处理后发送给所述加法树；Processing, by the internal adder, the image data sent from the convolution operation unit and sending the result to the addition tree;
    通过所述加法树，基于所述控制器的控制，对每一个所述内部加法器发来的图像数据进行处理后发送给一所述激活池化单元；Processing, by the addition tree, under the control of the controller, the image data sent from each internal adder and sending the result to an activation pooling unit;
    通过所述激活池化单元，基于所述控制器的控制，对所述加法树发来的图像数据进行处理后发送给所述输出端缓存器。Processing, by the activation pooling unit, under the control of the controller, the image data sent from the addition tree and sending the result to the output buffer.
PCT/CN2019/096750 2018-08-06 2019-07-19 Basic computing unit for convolutional neural network, and computing method WO2020029767A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810884476.4A CN109165728B (en) 2018-08-06 2018-08-06 Basic computing unit and computing method of convolutional neural network
CN201810884476.4 2018-08-06

Publications (1)

Publication Number Publication Date
WO2020029767A1

Family


Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/096750 WO2020029767A1 (en) 2018-08-06 2019-07-19 Basic computing unit for convolutional neural network, and computing method

Country Status (2)

Country Link
CN (1) CN109165728B (en)
WO (1) WO2020029767A1 (en)


Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165728B (en) * 2018-08-06 2020-12-18 浪潮集团有限公司 Basic computing unit and computing method of convolutional neural network
CN111832713B (en) * 2019-04-19 2024-06-18 北京灵汐科技有限公司 Parallel computing method and computing device based on line buffer Linebuffer
CN110503193B (en) * 2019-07-25 2022-02-22 瑞芯微电子股份有限公司 ROI-based pooling operation method and circuit
WO2021022441A1 (en) * 2019-08-05 2021-02-11 华为技术有限公司 Data transmission method and device, electronic device and readable storage medium
CN110597756B (en) * 2019-08-26 2023-07-25 光子算数(北京)科技有限责任公司 Calculation circuit and data operation method
CN114090470B (en) * 2020-07-29 2023-02-17 深圳市中科元物芯科技有限公司 Data preloading device and preloading method thereof, storage medium and computer equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590535A (en) * 2017-09-08 2018-01-16 西安电子科技大学 Programmable neural network processor
CN107704922A (en) * 2017-04-19 2018-02-16 北京深鉴科技有限公司 Artificial neural network processing unit
US20180174031A1 (en) * 2016-10-10 2018-06-21 Gyrfalcon Technology Inc. Implementation Of ResNet In A CNN Based Digital Integrated Circuit
CN109165728A (en) * 2018-08-06 2019-01-08 济南浪潮高新科技投资发展有限公司 A kind of basic computational ele- ment and calculation method of convolutional neural networks

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101398758B (en) * 2008-10-30 2012-04-25 北京航空航天大学 Detection method of code copy
CN105243399B (en) * 2015-09-08 2018-09-25 浪潮(北京)电子信息产业有限公司 A kind of method and apparatus that realizing image convolution, the method and apparatus for realizing caching
US10585848B2 (en) * 2015-10-08 2020-03-10 Via Alliance Semiconductor Co., Ltd. Processor with hybrid coprocessor/execution unit neural network unit
CN205726177U (en) * 2016-07-01 2016-11-23 浪潮集团有限公司 A kind of safety defense monitoring system based on convolutional neural networks chip
CN106203621B (en) * 2016-07-11 2019-04-30 北京深鉴智能科技有限公司 The processor calculated for convolutional neural networks
CN106875011B (en) * 2017-01-12 2020-04-17 南京风兴科技有限公司 Hardware architecture of binary weight convolution neural network accelerator and calculation flow thereof
CN106779060B (en) * 2017-02-09 2019-03-08 武汉魅瞳科技有限公司 A kind of calculation method for the depth convolutional neural networks realized suitable for hardware design
CN107578014B (en) * 2017-09-06 2020-11-03 上海寒武纪信息科技有限公司 Information processing apparatus and method
CN107832845A (en) * 2017-10-30 2018-03-23 上海寒武纪信息科技有限公司 A kind of information processing method and Related product


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KIM, CHAN: "A Deep Learning Convolution Architecture for Simple Embedded Applications", 2017 IEEE 7TH INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS - BERLIN (ICCE- BERLIN, 31 December 2017 (2017-12-31), pages 74 - 78, XP033278119 *
WAN PEIQI ET AL.: "Comparison among different numeric representations in deep convolution neural networks", JOURNAL OF COMPUTER RESEARCH AND DEVELOPMENT, 31 December 2017 (2017-12-31) *
ZHANG BANG ET AL.: "Design and implementation of a FPGA-based accelerator for convolutional neural networks", JOURNAL OF FUDAN UNIVERSITY NATURAL SCIENCE, 30 April 2018 (2018-04-30) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914996A (en) * 2020-06-30 2020-11-10 华为技术有限公司 Method for extracting data features and related device
CN113592067A (en) * 2021-07-16 2021-11-02 华中科技大学 Configurable convolution calculation circuit for convolution neural network
CN113592067B (en) * 2021-07-16 2024-02-06 华中科技大学 Configurable convolution calculation circuit for convolution neural network
CN116152307A (en) * 2023-04-04 2023-05-23 西安电子科技大学 SAR image registration preprocessing device based on FPGA
CN116152307B (en) * 2023-04-04 2023-07-21 西安电子科技大学 SAR image registration preprocessing device based on FPGA

Also Published As

Publication number Publication date
CN109165728B (en) 2020-12-18
CN109165728A (en) 2019-01-08


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19847686

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19847686

Country of ref document: EP

Kind code of ref document: A1