CN111279364A - Convolution calculation device, convolution calculation method, convolution calculation processor and mobile equipment - Google Patents

Info

Publication number: CN111279364A
Application number: CN201980005258.1A
Authority: CN (China)
Prior art keywords: unit, multiply, result, multiplication, specific
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 韩峰, 杨康, 谷骞
Current assignee: SZ DJI Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: SZ DJI Technology Co Ltd
Application filed by SZ DJI Technology Co Ltd
Publication of CN111279364A

Classifications

    • G: Physics
    • G06: Computing; Calculating or Counting
    • G06N: Computing arrangements based on specific computational models
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

An apparatus (400), method, processor, and mobile device for convolution calculation. The apparatus (400) comprises: a multiply-add unit array (410) including M rows and N columns of multiply-add units, wherein a specific multiply-add unit among them is configured to multiply the feature value input to it by its corresponding weight value, add the product to the previous multiply-add output result, and output the sum as its own output result; and an accumulation unit array (420) comprising one row of N accumulation units corresponding to the N columns of the multiply-add unit array (410), wherein a specific accumulation unit is configured to add the output result of the last multiply-add unit in its corresponding column to the previous accumulation output result and output the sum as its own output result. Through this technical scheme, the efficiency of convolution calculation can be improved.

Description

Convolution calculation device, convolution calculation method, convolution calculation processor and mobile equipment
Copyright declaration
The disclosure of this patent document contains material which is subject to copyright protection. The copyright is owned by the copyright owner. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the official records of the patent and trademark office.
Technical Field
The present application relates to the field of information technology, and more particularly, to an apparatus, method, processor, and mobile device for convolution calculation.
Background
A convolutional neural network (CNN) is a machine learning algorithm widely applied to computer vision tasks such as target recognition, target detection, and semantic segmentation of images.
In the convolution calculation process of a conventional convolutional neural network, either data-processing parallelism is low or many data-movement operations are required, so efficiency is low. How to improve the efficiency of convolution calculation has therefore become an urgent technical problem in the design of convolutional neural networks.
Disclosure of Invention
The embodiments of the present application provide a convolution calculation apparatus, method, processor, and mobile device, which can improve the efficiency of convolution calculation.
In a first aspect, an apparatus for convolution calculation is provided, including: a multiply-add unit array including M rows and N columns of multiply-add units, wherein a specific multiply-add unit among them is configured to multiply the feature value input to it by its corresponding weight value, add the product to the previous multiply-add output result, and output the sum as its own output result, the specific multiply-add unit being any one of the M × N multiply-add units, the previous multiply-add output result being the output result of the preceding multiply-add unit in the same column, or zero, and M and N being positive integers; and an accumulation unit array including one row of N accumulation units corresponding to the N columns of the multiply-add unit array, wherein a specific accumulation unit among them is configured to add the output result of the last multiply-add unit in its corresponding column to the previous accumulation output result and output the sum as its own output result, the previous accumulation output result being the output result of the preceding accumulation unit, or zero.
In a second aspect, a method of convolution calculation is provided, including: inputting weight values to a multiply-add unit array including M rows and N columns of multiply-add units; inputting feature values to the multiply-add unit array; multiplying, by a specific multiply-add unit among the M rows and N columns of multiply-add units, the feature value input to it by its corresponding weight value, adding the product to the previous multiply-add output result, and outputting the sum as its output result, wherein the specific multiply-add unit is any one of the M × N multiply-add units, the previous multiply-add output result is the output result of the preceding multiply-add unit in the same column, or zero, and M and N are positive integers; and adding, by a specific accumulation unit in an accumulation unit array, the output result of the last multiply-add unit in its corresponding column to the previous accumulation output result and outputting the sum as its output result, wherein the accumulation unit array includes one row of N accumulation units corresponding to the N columns of the multiply-add unit array, and the previous accumulation output result is the output result of the preceding accumulation unit, or zero.
In a third aspect, a processor is provided that includes the apparatus for convolution calculation of the first aspect.
In a fourth aspect, a mobile device is provided, comprising the apparatus for convolution calculation of the first aspect, or the processor of the third aspect.
In a fifth aspect, a computer storage medium is provided, in which program code is stored, the program code being operable to instruct execution of the method of the second aspect.
According to the technical solutions of the embodiments of the present application, all calculations involving the weight value assigned to a multiply-add unit can be completed within that single unit, which reduces data movement during calculation and the bandwidth of input and output data, thereby improving the efficiency of convolution calculation.
Drawings
Fig. 1 is a schematic diagram of a convolution operation process of a convolutional neural network according to an embodiment of the present application.
Fig. 2 is an architecture diagram of a solution to which an embodiment of the present application is applied.
Fig. 3 is a schematic architecture diagram of a mobile device of an embodiment of the present application.
FIG. 4 is a diagram of an apparatus for convolution calculation according to an embodiment of the present application.
Fig. 5 is a schematic diagram of an apparatus for convolution calculation according to another embodiment of the present application.
FIG. 6 is a schematic diagram of a convolution kernel map according to an embodiment of the present application.
Fig. 7 is a schematic diagram of a multiply-add unit according to an embodiment of the present application.
Fig. 8 is a schematic diagram of an accumulation unit according to an embodiment of the present application.
FIG. 9 is a schematic flow chart diagram of a method of convolution calculation according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings.
It should be understood that the specific examples are provided herein only to assist those skilled in the art in better understanding the embodiments of the present application and are not intended to limit the scope of the embodiments of the present application.
It should also be understood that, in the various embodiments of the present application, the sequence numbers of the processes do not mean the execution sequence, and the execution sequence of the processes should be determined by the functions and the inherent logic of the processes, and should not constitute any limitation to the implementation process of the embodiments of the present application.
It should also be understood that the various embodiments described in this specification can be implemented individually or in combination, and the examples in this application are not limited thereto.
The technical solution of the embodiment of the present application may be applied to various deep learning algorithms, such as a convolutional neural network, but the embodiment of the present application does not limit this.
Fig. 1 shows a schematic diagram of the convolution operation process of a convolutional neural network.
As shown in fig. 1, the convolution operation of a convolutional neural network takes an input feature map (IFM) and a set of weight values and produces an output feature map (OFM). The input weight values are called filters or convolution kernels. The input feature map is the output feature map of the previous layer, and the output feature map is obtained by the current layer's operation on the input feature map. The convolution kernel and the input and output feature maps can each be represented as a multi-dimensional matrix; one convolution operation of a convolutional layer performs an inner product between at least part of the feature values (data units) of the input feature matrix and the weight values of the convolution kernel matrix.
The convolution operation of a convolutional layer can use a sliding window: starting from the upper-left corner of the input feature matrix, with the convolution kernel size as the window, the window slides toward the lower-right corner to generate a complete two-dimensional output feature matrix. After each slide, the convolution calculation device extracts one window's worth of input feature values from the input feature matrix and performs an inner product with the convolution kernel to generate one output feature value. After all the two-dimensional output feature matrices are generated in this way, the three-dimensional output feature matrix of the convolutional layer is obtained.
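The sliding-window procedure described above can be sketched in a few lines of Python (an illustrative model only; the function name and list-of-lists representation are not part of the patent):

```python
def conv2d_valid(ifm, kernel):
    """Slide a Kh x Kw window over the input feature matrix, from the
    upper-left corner toward the lower-right, taking an inner product
    with the convolution kernel at each position."""
    ih, iw = len(ifm), len(ifm[0])
    kh, kw = len(kernel), len(kernel[0])
    ofm = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            # inner product of the current window with the kernel
            row.append(sum(ifm[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        ofm.append(row)
    return ofm
```

Each window position contributes one output feature value; sliding over the whole input yields one two-dimensional output feature matrix.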
Fig. 2 is an architecture diagram of a solution to which an embodiment of the present application is applied.
As shown in fig. 2, system 200 may include convolution calculation device 210 and memory 220.
The memory 220 is used for storing data to be processed, such as input feature maps and weight values, and storing processed data, such as output feature maps. The memory 220 may be a Static Random Access Memory (SRAM).
The convolution calculation device 210 includes a multiply-accumulate unit (MAU) 211, an IFM input module 212, a weight value input module 213, and an OFM storage module 214. The weight value input module 213 reads the weight values from the memory 220 and sends them to the MAU 211 in a specific format. The IFM input module 212 reads the input feature map data from the memory 220 and sends it to the MAU 211 for convolution. The MAU 211 may include a systolic array and a buffer for intermediate calculation results. During a convolution operation, the MAU 211 first loads the weight values from the weight value input module 213 into the systolic array; then, as input feature map data arrives from the IFM input module 212, the systolic array multiplies it by the previously loaded weight values. If an intermediate result is already buffered in the MAU 211, the systolic array's output is further accumulated with that buffered intermediate result. If the result of the multiply-accumulate operation is still an intermediate result of the convolution, it is stored back into the MAU's buffer; otherwise, it is output to the downstream OFM storage module 214 for subsequent processing. The OFM storage module 214 assembles the convolution results output by the MAU 211 into the data format used by the memory 220 and writes them to the memory 220.
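The buffering decision described for the MAU 211 can be modeled behaviorally as follows (a hypothetical sketch; `mau_step`, the dictionary buffer, and the position key are illustrative, not the patent's implementation):

```python
def mau_step(systolic_out, buffer, key, is_final):
    """One MAU output event: the systolic array's result is accumulated
    with any intermediate result buffered for the same output position.
    A still-intermediate result goes back into the buffer; a final result
    is handed to the downstream OFM storage module."""
    total = systolic_out + buffer.pop(key, 0)
    if is_final:
        return total          # emit to OFM storage
    buffer[key] = total       # still an intermediate convolution result
    return None
```

A usage sequence: two partial results for the same output position are accumulated across calls, and only the final call returns a value.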
The embodiments of the present application exploit the weight-sharing property of convolutional neural networks to reduce data movement during calculation, thereby reducing the bandwidth of input and output data and improving the efficiency of convolution calculation.
In some embodiments, the technical solutions of the embodiments of the present application may be applied to a mobile device. The mobile device may be an unmanned aerial vehicle, an unmanned ship, an autonomous vehicle, or a robot, and the embodiments of the present application are not limited thereto.
Fig. 3 is a schematic architecture diagram of a mobile device 300 of an embodiment of the present application.
As shown in FIG. 3, the mobile device 300 may include a power system 310, a control system 320, a sensing system 330, and a processing system 340.
A power system 310 is used to power the mobile device 300.
Taking an unmanned aerial vehicle as an example, its power system may include electronic speed controllers (ESCs), propellers, and motors corresponding to the propellers. Each motor is connected between an electronic speed controller and a propeller, and the motor and propeller are mounted on the corresponding arm. The electronic speed controller receives a driving signal generated by the control system and provides a driving current to the motor according to the driving signal, so as to control the motor's rotating speed. The motors drive the propellers to rotate, providing power for the flight of the unmanned aerial vehicle.
The sensing system 330 may be used to measure attitude information of the mobile device 300, i.e., position information and state information of the mobile device 300 in space, such as three-dimensional position, three-dimensional angle, three-dimensional velocity, three-dimensional acceleration, three-dimensional angular velocity, and the like. The sensing System 330 may include, for example, at least one of a gyroscope, an electronic compass, an Inertial Measurement Unit (IMU), a vision sensor, a Global Positioning System (GPS), a barometer, an airspeed meter, and the like.
The sensing system 330 may also be used for capturing images, i.e. the sensing system 330 comprises a sensor, such as a camera or the like, for capturing images.
The control system 320 is used to control the movement of the mobile device 300. The control system 320 may control the mobile device 300 according to preset program instructions. For example, the control system 320 may control the movement of the mobile device 300 based on the attitude information measured by the sensing system 330. The control system 320 may also control the mobile device 300 based on control signals from a remote control. For example, for a drone, the control system 320 may be a flight control system, or a control circuit within the flight controller.
The processing system 340 may process the images acquired by the sensing system 330. For example, the processing system 340 may be an image signal processing (ISP) chip.
Processing system 340 may be system 200 in fig. 2, or processing system 340 may include system 200 in fig. 2.
It should be understood that the above-described division and naming of the various components of the mobile device 300 is merely exemplary and should not be construed as a limitation of the embodiments of the present application.
It should also be understood that the mobile device 300 may include other components not shown in fig. 3, which are not limited by the embodiments of the present application.
FIG. 4 is a diagram illustrating an apparatus 400 for convolution calculation according to an embodiment of the present application. The apparatus 400 may be the MAU 211 of FIG. 2.
As shown in fig. 4, the apparatus 400 may include a multiply-add cell array 410 and an accumulate cell array 420.
The multiply-add unit array 410 includes M rows and N columns of multiply-add cells (MC), where M and N are positive integers.
A specific multiply-add unit among the M rows and N columns of multiply-add units is configured to multiply the feature value input to it by its corresponding weight value, add the product to the previous multiply-add output result, and output the sum as its own output result, where the specific multiply-add unit is any one of the M × N multiply-add units and the previous multiply-add output result is the output result of the preceding multiply-add unit in the same column, or zero.
For example, for a multiply-add unit in the first row, the previous multiply-add output result is zero: the unit multiplies its feature value by its weight value and passes the result downward, i.e., to the next multiply-add unit in its column. For a multiply-add unit in any other row, the previous multiply-add output result is the output of the unit above it in the same column: the unit multiplies its feature value by its weight value, adds the product to that previous result, and passes the sum downward. Units in the last row output to the accumulation unit of their column; all other units output to the next multiply-add unit in their column.
The accumulation unit array includes one row of N accumulation units (ACC) corresponding to the N columns of the multiply-add unit array. A specific accumulation unit among them is configured to add the output result of the last multiply-add unit in its corresponding column to the previous accumulation output result and output the sum as its own output result, where the previous accumulation output result is the output result of the preceding accumulation unit, or zero.
For example, if an accumulation unit is the first one corresponding to a convolution kernel, the previous accumulation output result is zero, and the unit simply passes the output of the last multiply-add unit in its column to the next accumulation unit. If an accumulation unit is not the first one for its kernel, the previous accumulation output result is the output of the preceding accumulation unit; the unit adds the output of the last multiply-add unit in its column to that result and outputs the sum.
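Taken together, the multiply-add columns and the accumulation row compute a full inner product for one kernel; a behavioral sketch (names and list layout are illustrative only):

```python
def systolic_convolution_point(feature_cols, weight_cols):
    """Each column computes its multiply-add chain top to bottom, then
    the accumulation row chains the column sums left to right, so the
    kernel's last accumulation unit emits the complete inner product."""
    acc = 0  # previous accumulation output for the kernel's first column
    for feats, wts in zip(feature_cols, weight_cols):
        col_sum = 0
        for f, w in zip(feats, wts):      # partial sum flows down the column
            col_sum = f * w + col_sum
        acc = col_sum + acc               # accumulation unit adds to the left sum
    return acc
```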
It should be understood that the various units or modules in the embodiments of the present application may be implemented as circuits; for example, the multiply-add unit may be a multiply-add circuit. However, the embodiments of the present application are not limited thereto, and they may also be implemented in other ways.
Optionally, in an embodiment of the present application, as shown in fig. 5, the apparatus 400 may further include: a weight value injection module 430 and a feature value injection module 440.
A weight value injection module 430 for inputting weight values to the multiply-add cell array 410;
and a feature value injection module 440, configured to input a feature value to the multiply-add cell array 410.
The weight value injection module 430 is connected to the first row of multiply-add units of the multiply-add unit array 410. For each column, weight values are passed downward from the first-row multiply-add unit to the corresponding multiply-add units; once every weight value has reached its unit, each unit in the column latches its weight value.
Specifically, the weight value injection module 430 has only one interface to each column of the multiply-add unit array 410 (the interface to the first-row multiply-add unit), and this interface can transmit only one weight value per clock cycle. Weight input is divided into a shift stage and a load stage. In the shift stage, the weight value injection module 430 sends the weight values required by the multiply-add units of a column into the array one per cycle through this interface, and within the array each received weight value is passed downward from the unit at the interface. In the load stage, the multiply-add units of the column simultaneously load the weight values into their respective registers. The weight value injection module 430 delays the weight input of each column by one clock cycle relative to the adjacent column.
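The two-stage weight input can be modeled behaviorally as follows (an assumption-laden sketch: `shift_and_load` and its list representation are illustrative, and the bottom-row-first send order is inferred from the pass-down behavior described above):

```python
def shift_and_load(column_weights):
    """column_weights[i] is the weight destined for row i (row 0 at the
    interface).  Shift stage: one weight enters per clock cycle through
    the single interface and every buffered value moves down one row, so
    the first weight sent ends up deepest.  Load stage: all rows latch
    their shift registers into the weight registers simultaneously."""
    m = len(column_weights)
    shift_regs = [None] * m
    for w in reversed(column_weights):      # deepest destination enters first
        shift_regs = [w] + shift_regs[:-1]  # one clock cycle of shifting
    return list(shift_regs)                 # load stage: latch all at once
```

After M shift cycles, each row holds exactly the weight destined for it.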
The feature value injection module 440 is connected to the first column of multiply-add units of the multiply-add unit array 410; for each row, feature values are passed in turn from the first-column multiply-add unit to the multiply-add unit of the next column.
Specifically, the feature value injection module 440 has only one interface to each row of the multiply-add unit array 410 (the interface to the first-column multiply-add unit), and this interface can transmit only one feature value per clock cycle. Within the array, each received feature value is passed rightward from the unit at the interface until the last multiply-add unit. The feature value injection module 440 delays the feature input of each row by one clock cycle relative to the adjacent row.
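The one-cycle-per-row delay produces the skewed (diagonal-wavefront) schedule typical of systolic arrays; a minimal illustration (hypothetical helper, using `None` for idle cycles):

```python
def skewed_schedule(row_streams):
    """Row r's feature stream is delayed by r clock cycles relative to
    row 0, so data arrival lines up with the diagonal wavefront of
    partial sums moving through the array."""
    return [[None] * r + stream for r, stream in enumerate(row_streams)]
```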
Optionally, in an embodiment of the present application, the multiply-add unit array 410 may correspond to the weight values of a convolution kernel as follows:
for a convolution kernel of size Kd × Kh × Kw, the Kd two-dimensional Kh × Kw maps unfolded along the depth direction are mapped to the corresponding multiply-add units along the column direction of the multiply-add unit array, with one weight value per multiply-add unit, where Kd, Kh, and Kw are positive integers representing the depth, height, and width of the convolution kernel;
and multiple convolution kernels are mapped along the row direction of the multiply-add unit array.
Alternatively, when Kd × Kh > M, one convolution kernel is mapped onto the multiply-add unit array multiple times.
For example, FIG. 6 shows a schematic diagram of a convolution kernel mapping, in which Kh and Kw are both 3. Convolution kernel 0 and convolution kernel 1 are each unfolded into Kd 3 × 3 two-dimensional maps; the maps of convolution kernel 0 are mapped in order onto columns 0-2 of the multiply-add unit array, and the maps of convolution kernel 1 onto columns 3-5 of the systolic array. If the height M of the multiply-add unit array equals Kd × 3, the first two-dimensional map of convolution kernel 0 is mapped to the multiply-add units in rows 0-2 of columns 0-2 of the systolic array, and its Kd-th map to the last three rows of columns 0-2. Similarly, the first two-dimensional map of convolution kernel 1 is mapped to rows 0-2 of columns 3-5, and its Kd-th map to the last three rows of columns 3-5.
The accumulation unit array is mapped in a manner similar to the multiply-add unit array. As shown in FIG. 6, convolution kernel 0 maps to the accumulation units of columns 0-2 and convolution kernel 1 to those of columns 3-5, with the calculation result of convolution kernel 0 output from the column-2 accumulation unit and the result of convolution kernel 1 from the column-5 accumulation unit.
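The mapping in FIG. 6 corresponds to a simple coordinate formula; the following sketch (an illustrative reading of the example, assuming Kd × Kh ≤ M) computes the array position of the weight at depth d, height h, width w of kernel k:

```python
def map_weight(kernel_idx, d, h, w, Kh, Kw):
    """Row: the d-th unfolded Kh x Kw slice occupies rows d*Kh .. d*Kh+Kh-1
    along the column direction.  Column: kernel k occupies columns
    k*Kw .. k*Kw+Kw-1 along the row direction."""
    return d * Kh + h, kernel_idx * Kw + w
```

With Kh = Kw = 3, kernel 0's first slice lands in rows 0-2 of columns 0-2 and kernel 1's first slice in rows 0-2 of columns 3-5, matching the example.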
Fig. 7 shows a schematic diagram of a multiply-add unit according to an embodiment of the present application.
As shown in fig. 7, a particular multiply-add unit may include:
a weight value shift register 701, configured to buffer and pass weight values down the column of the specific multiply-add unit, and to latch the weight value corresponding to the specific multiply-add unit into the weight value register 702;
a weight value register 702, configured to buffer the weight value corresponding to the specific multiply-add unit;
a feature value shift register 703, configured to buffer and pass feature values along the row of the specific multiply-add unit, and to latch the feature value into the feature value register 704;
a feature value register 704, configured to buffer the feature value;
a multiplication circuit 705, configured to multiply the weight value in the weight value register 702 by the feature value in the feature value register 704 and output the product to the product register 706;
a product register 706, configured to buffer the product produced by the multiplication circuit 705;
an addition circuit 707, configured to add the product in the product register 706 to the previous multiply-add output result and output the sum down the column of the specific multiply-add unit.
Specifically, the weight value shift register 701 buffers the weight value sent from the weight value injection module 430 or from the previous multiply-add unit. In the shift stage of weight input, the weight value buffered in the shift register 701 is passed to the next multiply-add unit; in the load stage, it is latched into the weight value register 702. The feature value shift register 703 buffers the feature value sent from the feature value injection module 440 or from the multiply-add unit to its left; the buffered feature value is latched into the feature value register 704 and also sent to the multiply-add unit on the right. The multiplication circuit 705 multiplies the weight value and feature value buffered in registers 702 and 704 and sends the product to the product register 706. The addition circuit 707 adds the product in the product register 706 to the previous multiply-add output result arriving from above and passes the sum downward.
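The per-cycle behavior of the registers and circuits in Fig. 7 can be modeled as follows (a behavioral sketch; the class and method names are illustrative, not the patent's circuit):

```python
class MultiplyAddCell:
    """One-cycle model of the unit in Fig. 7: a latched weight (702),
    a feature latched and passed to the right (703/704), and a product
    (705/706) added to the partial sum arriving from above (707)."""

    def __init__(self):
        self.weight = 0   # weight value register 702
        self.feature = 0  # feature value register 704

    def load_weight(self, w):
        # load stage: the shift register's value is latched into 702
        self.weight = w

    def step(self, feature_in, partial_in):
        self.feature = feature_in
        product = self.weight * self.feature          # multiplication circuit 705
        return feature_in, product + partial_in       # (to the right, downward)
```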
FIG. 8 shows a schematic diagram of an accumulation unit of one embodiment of the present application.
As shown in fig. 8, a specific accumulation unit may include:
a filter circuit 801, configured to filter the output result of the last multiply-add unit in the column corresponding to the specific accumulation unit according to the stride value of the convolution calculation, and output the filtered result to a multiply-add unit result register 802;
a multiply-add unit result register 802, for caching the result filtered by the filter circuit 801;
a delay circuit 803, configured to delay the previous accumulation output result according to the dilation value of the convolution calculation, and output the delayed result to an accumulation unit result register 804;
an accumulation unit result register 804, for buffering the result delayed by the delay circuit 803;
a first-stage addition circuit 805, configured to add the result in the multiply-add unit result register 802 and the result in the accumulation unit result register 804, and to output the added sum to a sum register 806;
and a sum register 806, for buffering the sum added by the first-stage addition circuit 805.
Optionally, the specific accumulation unit may further include: a second stage addition circuit 807.
In this case, the sum register 806 is configured to output the sum in the sum register 806 to a next accumulation unit when the column corresponding to the specific accumulation unit is not the last column corresponding to the specific convolution kernel; when the column corresponding to the specific accumulation unit is the last column corresponding to the specific convolution kernel, outputting the sum in the sum register 806 to the second-stage addition circuit 807;
the second-stage adding circuit 807 is configured to add the sum in the sum register 806 and the intermediate result of the specific convolution kernel in the intermediate result buffering module, and output the added sum to the result processing module, when the column corresponding to the specific accumulation unit is the last column corresponding to the specific convolution kernel.
Specifically, the results calculated by the multiply-add units may include redundant (invalid) results. Taking the 1st column of multiply-add units in fig. 6 as an example, the product of the weight values in the 1st column of multiply-add units and the eigenvalues in the 1st column of the eigenvalue matrix is a valid result, while the product with the eigenvalues in the 0th column of the eigenvalue matrix is a redundant result that needs to be filtered out. Therefore, the filter circuit 801 may filter out the redundant results output by the multiply-add unit array 410 according to the parameter stride (Stride) value input during the convolution calculation, and send the filtered results to the multiply-add unit result register 802.
On the other hand, a valid result output by each column of multiply-add units needs to be accumulated with the valid result output by its corresponding column, and the times at which the two columns output valid results may be separated by several clock cycles. Therefore, the delay circuit 803 needs to delay the result output by the accumulation unit on the left by a specified number of clock cycles before sending it to the accumulation unit result register 804. The number of delayed clock cycles is calculated from the parameter dilation (Dilation) value input during the convolution calculation.
The first-stage addition circuit 805 adds the data buffered in the multiply-add unit result register 802 and the accumulation unit result register 804, and feeds the sum into the sum register 806.
When the convolution kernel of the convolution operation is mapped, the continuous Kw accumulation units are mapped to the same convolution kernel, and the size of Kw is the same as the width of the convolution kernel. Of the Kw accumulation units, the first accumulation unit does not need to receive the result output from the left accumulation unit, and the last accumulation unit does not output the result buffered by the sum register 806 to the right multiply-add unit, but only accumulates the result buffered by the sum register 806 and the intermediate result read back from the intermediate result buffer module in the second-stage addition circuit 807 and outputs the result to the result processing module.
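The stride filtering and dilation delay performed inside the accumulation unit can be sketched as follows. This is a behavioral approximation only: the exact spacing of redundant results (here, every `stride`-th output being valid) and the delay length equalling the dilation value are interpretations of the description, not a specification from the embodiment.

```python
from collections import deque

def filter_by_stride(column_outputs, stride, offset=0):
    """Filter circuit 801 (sketch): keep every stride-th column output,
    discarding the redundant (invalid) results in between."""
    return [v for i, v in enumerate(column_outputs) if (i - offset) % stride == 0]

def delay_by_dilation(left_results, dilation):
    """Delay circuit 803 (sketch): delay the left accumulation unit's output
    stream by `dilation` clock cycles before it joins this column's results."""
    pipeline = deque([0] * dilation)  # registers holding in-flight results
    delayed = []
    for r in left_results:
        pipeline.append(r)
        delayed.append(pipeline.popleft())
    return delayed

print(filter_by_stride([10, 11, 12, 13, 14, 15], stride=2))  # [10, 12, 14]
print(delay_by_dilation([1, 2, 3, 4], dilation=2))           # [0, 0, 1, 2]
```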
Optionally, in an embodiment of the present application, as shown in fig. 5, the apparatus 400 may further include: results processing module 450.
The result processing module 450 is configured to process the results output by the accumulation unit array 420.
Optionally, the apparatus 400 may further include: an intermediate result caching module 460.
In this case, the result processing module 450 outputs the result output by the accumulation unit array 420 when that result is the final result of a specific convolution kernel; when the result output by the accumulation unit array 420 is an intermediate result of the specific convolution kernel, the result processing module 450 buffers it into the intermediate result buffer module 460.
Specifically, in the case of Kd × Kh > M, one convolution kernel is mapped onto the multiply-add unit array 410 multiple times. In this case, the calculation result of the convolution kernel is composed of the calculation results obtained from the multiple mappings. That is, the calculation results obtained from the earlier mappings are intermediate results, which are added to the calculation result obtained from the next mapping until the final result of the convolution kernel is obtained. Therefore, when the result output by the accumulation unit array 420 is an intermediate result of a specific convolution kernel, the result processing module 450 buffers the output result into the intermediate result buffer module 460 so that it can be added to the next output result; when the result output by the accumulation unit array 420 is the final result of a specific convolution kernel, that result is output.
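The intermediate-result handling for Kd × Kh > M can be sketched as: split the kernel's unfolded rows into chunks of at most M, compute a partial sum per mapping, and accumulate across mappings. The function name and the flat row layout are illustrative, not from the embodiment.

```python
import numpy as np

def convolve_with_remapping(weights, features, M):
    """Sketch: when a kernel's unfolded height (Kd * Kh) exceeds the array
    height M, map it in chunks of at most M rows; each mapping yields an
    intermediate result that is buffered and added to the next mapping's
    result (cf. intermediate result buffer module 460)."""
    rows = len(weights)
    intermediate = np.zeros(features.shape[1])  # buffered intermediate result
    for start in range(0, rows, M):
        part = slice(start, min(start + M, rows))
        intermediate = intermediate + weights[part] @ features[part]
    return intermediate  # final result after the last mapping

w = np.arange(6.0)   # Kd * Kh = 6 weight rows, array height M = 4
x = np.ones((6, 3))  # three feature positions per row
print(convolve_with_remapping(w, x, M=4))  # [15. 15. 15.], same as w @ x
```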
The intermediate result buffer module 460 is configured to buffer the intermediate results of each convolution kernel. Optionally, the intermediate result caching module may include: N First-In First-Out queues (FIFOs).
For convolution kernels of Kd × Kh × Kw size, every Kw FIFOs of the N FIFOs make up a group for buffering the intermediate results of one convolution kernel.
Grouping every Kw FIFOs to cache the intermediate results of one convolution kernel makes full use of the FIFO resources and improves the utilization rate of the FIFOs.
Optionally, in an embodiment of the present application, as shown in fig. 5, the apparatus 400 may further include: a control module 470.
The control module 470 may be used to control the input of weight values and feature values to the multiply-add cell array 410, and to control the computations of the multiply-add cell array 410 and the accumulate cell array 420.
Specifically, the control module 470 may be used to control the processing of the modules in the apparatus 400 to obtain the calculation results. For example, the control module 470 may first control the weight value injection module 430 to load the weight values fed by the weight value input module into the multiply-add unit array 410, then control the eigenvalue injection module 440 to input the eigenvalues fed by the IFM input module into the multiply-add unit array 410, and control the multiply-add unit array 410 and the accumulation unit array 420 to perform the convolution operation. After all the feature map data have been sent to the multiply-add unit array 410 and their convolution operation is completed, the above process is repeated in sequence until all convolution operations are completed.
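The control sequence described above — load a tile of weights, stream the corresponding feature values, collect the results, and repeat — may be sketched as follows. `SimpleArray` and its method names are illustrative stand-ins for the arrays and injection modules of Fig. 5, not part of the embodiment.

```python
class SimpleArray:
    """Toy stand-in for the multiply-add / accumulation arrays (410/420)."""
    def __init__(self):
        self.weight = None
        self.acc = 0

    def load_weights(self, weight):   # role of weight value injection module 430
        self.weight = weight
        self.acc = 0

    def stream_feature(self, value):  # role of eigenvalue injection module 440
        self.acc += self.weight * value

    def collect(self):                # gathered by result processing module 450
        return self.acc

def run_convolution(weight_tiles, feature_tiles, array):
    """Sketch of control module 470: alternate weight loading and feature
    streaming until every tile has been processed."""
    results = []
    for weights, features in zip(weight_tiles, feature_tiles):
        array.load_weights(weights)
        for value in features:
            array.stream_feature(value)
        results.append(array.collect())
    return results

print(run_convolution([2, 3], [[1, 2], [1, 1]], SimpleArray()))  # [6, 6]
```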
According to the technical solution of the embodiments of the present application, all calculations involving the weight value assigned to a multiply-add unit can be completed within that single multiply-add unit, which reduces data movement during the calculation and the bandwidth of input and output data, thereby improving the efficiency of convolution calculation.
Having described the apparatus for convolution calculation of the embodiments of the present application, the method of convolution calculation of the embodiments of the present application is described below. The method is implemented by the apparatus for convolution calculation described above, or by a device including that apparatus; for related descriptions, reference may be made to the foregoing embodiments, which are not repeated here for brevity.
FIG. 9 shows a schematic flow chart diagram of a method 900 of convolution calculation of an embodiment of the present application.
As shown in fig. 9, the method 900 includes:
910, inputting a weight value to a multiplication and addition unit array, wherein the multiplication and addition unit array comprises M rows and N columns of multiplication and addition units;
920, inputting a characteristic value to the multiplication and addition unit array;
930, multiplying, by a specific multiply-add unit of the multiply-add units in the M rows and the N columns, a feature value input to the specific multiply-add unit and a weight value corresponding to the specific multiply-add unit, adding a product after the multiplication to a previous multiply-add output result, and outputting the added sum as an output result of the specific multiply-add unit, wherein the specific multiply-add unit is any one of the multiply-add units in the M rows and the N columns, the previous multiply-add output result is an output result of a previous multiply-add unit of the specific multiply-add unit in the column where the specific multiply-add unit is located or zero, and M and N are both positive integers;
940, the output result of the last multiply-add unit in the column corresponding to the specific accumulation unit and the previous accumulation output result are added through the specific accumulation unit in the accumulation unit array, and the added sum is output as the output result of the specific accumulation unit, wherein the accumulation unit array comprises 1 row of N accumulation units, the N accumulation units respectively correspond to N columns of the multiply-add unit array, and the previous accumulation output result is the output result of the previous accumulation unit of the specific accumulation unit or zero.
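As a sanity check on the dataflow of steps 910-940, an untimed reference model may be written as follows. It ignores the shift timing and, for simplicity, assumes all N columns feed one accumulation chain (in general a kernel occupies only Kw consecutive columns); it verifies only the arithmetic, not the pipelining.

```python
import numpy as np

def systolic_reference(weights, features):
    """Untimed reference for steps 910-940: each multiply-add unit forms
    w * f and adds the partial sum from the unit above (step 930); each
    accumulation unit adds its column's final sum to the previous
    accumulation output (step 940)."""
    column_sums = (weights * features).sum(axis=0)  # per-column multiply-add chains
    return np.cumsum(column_sums)                   # row of N accumulation units

w = np.array([[1.0, 2.0], [3.0, 4.0]])  # M=2 rows, N=2 columns of weights
f = np.array([[5.0, 6.0], [7.0, 8.0]])  # feature values latched in the units
print(systolic_reference(w, f))          # column sums [26, 44] -> [26, 70]
```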
Optionally, in an embodiment of the present application, the inputting a weight value to the multiply-add unit array includes: and inputting a weight value through a first row of multiply-add units of the multiply-add unit array, wherein the weight value is transferred from the multiply-add unit of the first row to the corresponding multiply-add unit for each column of multiply-add units.
Optionally, in an embodiment of the present application, after the weight values are transferred to the corresponding multiply-add units, each column of multiply-add units latches its weight values simultaneously.
Optionally, in an embodiment of the present application, the inputting the characteristic value to the multiplication and addition unit array includes: inputting a characteristic value through a first column of multiplication and addition units of the multiplication and addition unit array, wherein, for each row of multiplication and addition units, the characteristic value is sequentially transmitted from the multiplication and addition unit of the first column to the multiplication and addition unit of the next column.
Optionally, in an embodiment of the present application, the multiply-add unit array corresponds to a weight value of a convolution kernel as follows:
for a convolution kernel with the size of Kd multiplied by Kh multiplied by Kw, the Kd two-dimensional maps of Kh × Kw weights, unfolded in the depth direction, are mapped to corresponding multiplication and addition units along the column direction of the multiplication and addition unit array, wherein one weight value corresponds to one multiplication and addition unit, and Kd, Kh and Kw are positive integers respectively representing the depth, height and width of the convolution kernel;
and a plurality of convolution kernels are mapped along the row direction of the multiplication and addition unit array.
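One plausible index mapping consistent with the description — each array column holding the Kd × Kh weights of one width position, with Kw consecutive columns per kernel — may be sketched as follows. The formulas are an interpretation, since the embodiment gives no explicit index equations.

```python
def map_kernel_to_array(Kd, Kh, Kw, kernel_index):
    """Sketch: return {(row, col): (d, h, w)} placing the Kd*Kh weights of
    each width position w into one column; each kernel occupies Kw
    adjacent columns of the multiply-add unit array."""
    placement = {}
    for w in range(Kw):
        col = kernel_index * Kw + w     # kernels laid out along the row direction
        for d in range(Kd):
            for h in range(Kh):
                row = d * Kh + h        # depth-direction unfolding down the column
                placement[(row, col)] = (d, h, w)
    return placement

m = map_kernel_to_array(Kd=2, Kh=3, Kw=3, kernel_index=1)
print(m[(0, 3)])  # first weight of kernel 1's first column -> (0, 0, 0)
print(m[(5, 5)])  # last weight of its last column -> (1, 2, 2)
```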
Alternatively, in an embodiment of the present application, one convolution kernel is mapped onto the multiply-add unit array multiple times with Kd × Kh > M.
Optionally, in an embodiment of the present application, the method further includes: and processing the result output by the accumulation unit array.
Optionally, in an embodiment of the present application, the processing a result output by the accumulation unit array includes: when the result output by the accumulation unit array is the final result of the specific convolution kernel, outputting the result output by the accumulation unit array; and when the result output by the accumulation unit array is the intermediate result of the specific convolution kernel, caching the result output by the accumulation unit array into an intermediate result caching module.
Optionally, in an embodiment of the present application, the intermediate result caching module includes: N first-in first-out queues (FIFOs), wherein, for convolution kernels of Kd × Kh × Kw size, every Kw FIFOs in the N FIFOs form a group for buffering the intermediate results of one convolution kernel.
Optionally, in an embodiment of the present application, the specific multiply-add unit includes: a weight value shift register, a weight value register, a characteristic value shift register, a characteristic value register, a multiplication circuit, a product register and an addition circuit; wherein the weight value is cached by the weight value shift register and transferred along the column where the specific multiply-add unit is located, and the weight value corresponding to the specific multiply-add unit is latched into the weight value register; the characteristic value is cached by the characteristic value shift register and transferred along the row where the specific multiply-add unit is located, and the characteristic value is latched into the characteristic value register; the weight value in the weight value register and the characteristic value in the characteristic value register are multiplied by the multiplication circuit, and the product is output to the product register; and the product in the product register and the previous multiply-add output result are added by the addition circuit, and the added sum is output downward along the column where the specific multiply-add unit is located.
Optionally, in an embodiment of the present application, the specific accumulation unit includes: a filter circuit, a multiply-add unit result register, a delay circuit, an accumulation unit result register, a first-stage addition circuit and a sum register; wherein the output result of the last multiply-add unit in the column corresponding to the specific accumulation unit is filtered by the filter circuit according to the stride value of the convolution calculation, and the filtered result is output to the multiply-add unit result register; the previous accumulation output result is delayed by the delay circuit according to the dilation value of the convolution calculation, and the delayed result is output to the accumulation unit result register; and the result in the multiply-add unit result register and the result in the accumulation unit result register are added by the first-stage addition circuit, and the added sum is output to the sum register.
Optionally, in an embodiment of the present application, the specific accumulation unit further includes: a second stage addition circuit; when the column corresponding to the specific accumulation unit is not the last column corresponding to the specific convolution kernel, outputting the sum in the sum register to the next accumulation unit; when the column corresponding to the specific accumulation unit is the last column corresponding to the specific convolution kernel, outputting the sum in the sum register to the second-stage addition circuit; and adding, by the second-stage addition circuit, the sum in the sum register and the intermediate result of the particular convolution kernel in the intermediate result cache module when the column corresponding to the particular accumulation unit is the last column corresponding to the particular convolution kernel.
Optionally, in an embodiment of the present application, the method further includes: controlling input of weight values and feature values to the multiply-add unit array, and controlling calculation of the multiply-add unit array and the accumulation unit array.
The embodiment of the present application further provides a processor, where the processor includes the apparatus for convolution calculation in the embodiment of the present application.
For example, the processor may be the convolution calculation device 210 in fig. 2, wherein the MAU 211 may be a device for convolution calculation according to an embodiment of the present application.
The embodiment of the present application further provides a mobile device, which may include the apparatus for convolution calculation in the embodiment of the present application; or, include the processors of the embodiments of the present application described above.
Embodiments of the present application further provide a computer storage medium having a program code stored therein, where the program code may be used to instruct a method for performing convolution calculation according to the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described in a functional general in the foregoing description for the purpose of illustrating clearly the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiments of the present application.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially or partially contributed by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (29)

1. An apparatus for convolution computation, comprising:
a specific multiplication and addition unit in the multiplication and addition units in M rows and N columns, wherein the specific multiplication and addition unit is used for multiplying the characteristic value input to the specific multiplication and addition unit and a weight value corresponding to the specific multiplication and addition unit, adding a product obtained after multiplication and a previous multiplication and addition output result, and outputting the added sum as an output result of the specific multiplication and addition unit, the specific multiplication and addition unit is any one of the multiplication and addition units in the M rows and the N columns, the previous multiplication and addition output result is an output result of a previous multiplication and addition unit of the specific multiplication and addition unit in the column where the specific multiplication and addition unit is located or zero, and M and N are positive integers;
and the accumulation unit array comprises 1 row of N accumulation units, the N accumulation units respectively correspond to N columns of the multiplication and accumulation unit array, and a specific accumulation unit in the N accumulation units is used for adding an output result of the last multiplication and accumulation unit in the column corresponding to the specific accumulation unit with a previous accumulation output result and outputting the added sum as an output result of the specific accumulation unit, wherein the previous accumulation output result is the output result of the previous accumulation unit of the specific accumulation unit or zero.
2. The apparatus of claim 1, further comprising:
a weight value injection module for inputting weight values to the multiply-add unit array;
and the characteristic value injection module is used for inputting the characteristic value to the multiplication and addition unit array.
3. The apparatus according to claim 2, wherein the weight value injection module is connected to a first row of the multiply-add unit array, wherein for each column of multiply-add units, a weight value is passed from the multiply-add unit of the first row to the corresponding multiply-add unit.
4. The apparatus of claim 3, wherein each column multiply-add unit is configured to latch the weight values after the weight values are passed to the corresponding multiply-add unit.
5. The apparatus according to any one of claims 2 to 4, wherein the eigenvalue injection module is connected to the first column of multiply-add units of the multiply-add unit array, wherein for each row of multiply-add units, eigenvalues are passed from the multiply-add unit of the first column to the multiply-add unit of the next column in turn.
6. The apparatus according to any one of claims 1 to 5, wherein the array of multiply-add units is configured to correspond to weight values of convolution kernels as follows:
for a convolution kernel with the size of Kd multiplied by Kh multiplied by Kw, the Kd two-dimensional maps of Kh × Kw weights, unfolded in the depth direction, are mapped to corresponding multiplication and addition units along the column direction of the multiplication and addition unit array, wherein one weight value corresponds to one multiplication and addition unit, and Kd, Kh and Kw are positive integers respectively representing the depth, height and width of the convolution kernel;
and a plurality of convolution kernels are mapped along the row direction of the multiplication and addition unit array.
7. The apparatus of claim 6, wherein one convolution kernel is mapped onto the multiply-add unit array multiple times with Kd x Kh > M.
8. The apparatus of any one of claims 1 to 7, further comprising:
and the result processing module is used for processing the result output by the accumulation unit array.
9. The apparatus of claim 8, further comprising:
an intermediate result caching module;
the result processing module is used for outputting the result output by the accumulation unit array when the result output by the accumulation unit array is the final result of the specific convolution kernel; and when the result output by the accumulation unit array is the intermediate result of the specific convolution kernel, caching the result output by the accumulation unit array to the intermediate result caching module.
10. The apparatus of claim 9, wherein the intermediate result caching module comprises:
N first-in first-out queues (FIFOs), wherein, for convolution kernels of Kd × Kh × Kw size, every Kw FIFOs in the N FIFOs form a group for buffering the intermediate results of one convolution kernel.
11. The apparatus according to any one of claims 1 to 10, wherein the specific multiply-add unit comprises:
the weight value shift register is used for caching and transmitting weight values along the column where the specific multiplication and addition unit is located, and latching the weight values corresponding to the specific multiplication and addition unit to the weight value register;
the weight value register is used for caching the weight value corresponding to the specific multiplication and addition unit;
the characteristic value shift register is used for caching and transmitting the characteristic value along the row where the specific multiplication and addition unit is located, and latching the characteristic value into the characteristic value register;
the characteristic value register is used for caching the characteristic value;
the multiplication circuit is used for multiplying the weight value in the weight value register and the characteristic value in the characteristic value register and outputting the multiplied product to the product register;
a product register for buffering the product multiplied by the multiplication circuit;
and the addition circuit is used for adding the product in the product register and the previous multiplication and addition output result and outputting the added sum downwards along the column of the specific multiplication and addition unit.
12. The apparatus according to any one of claims 1 to 11, wherein the specific accumulation unit comprises:
the filter circuit is used for filtering the output result of the last multiply-add unit in the column corresponding to the specific accumulation unit according to the stride value of the convolution calculation, and outputting the filtered result to a multiply-add unit result register;
the result register of the multiply-add unit is used for caching the result filtered by the filter circuit;
the delay circuit is used for delaying the previous accumulation output result according to the dilation value of the convolution calculation, and outputting the delayed result to an accumulation unit result register;
the accumulation unit result register is used for caching the result delayed by the delay circuit;
a first-stage addition circuit for adding a result in the multiply-add unit result register and a result in the accumulation unit result register and outputting the added sum to a sum register;
and the sum register is used for buffering the sum added by the first-stage addition circuit.
13. The apparatus of claim 12, wherein the particular accumulation unit further comprises:
a second stage addition circuit;
the sum register is used for outputting the sum in the sum register to the next accumulation unit when the column corresponding to the specific accumulation unit is not the last column corresponding to the specific convolution kernel; when the column corresponding to the specific accumulation unit is the last column corresponding to the specific convolution kernel, outputting the sum in the sum register to the second-stage addition circuit;
and the second-stage adding circuit is used for adding the sum in the sum register and the intermediate result of the specific convolution kernel in the intermediate result cache module when the column corresponding to the specific accumulation unit is the last column corresponding to the specific convolution kernel, and outputting the added sum to the result processing module.
14. The apparatus of any one of claims 1 to 13, further comprising:
and the control module is used for controlling the input of the weight values and the characteristic values to the multiplication and addition unit array and controlling the calculation of the multiplication and addition unit array and the accumulation unit array.
15. A method of convolution computation, comprising:
inputting a weight value to a multiplication and addition unit array, wherein the multiplication and addition unit array comprises M rows and N columns of multiplication and addition units;
inputting a characteristic value to the multiplication and addition unit array;
multiplying, by a specific multiply-add unit of the multiply-add units in the M rows and the N columns, a feature value input to the specific multiply-add unit and a weight value corresponding to the specific multiply-add unit, adding a product after the multiplication to a previous multiply-add output result, and outputting the added sum as an output result of the specific multiply-add unit, wherein the specific multiply-add unit is any one of the multiply-add units in the M rows and the N columns, the previous multiply-add output result is an output result of a previous multiply-add unit of the specific multiply-add unit in the column where the specific multiply-add unit is located or zero, and M and N are both positive integers;
and adding the output result of the last multiply-add unit in the column corresponding to the specific accumulation unit with the previous accumulation output result through the specific accumulation unit in an accumulation unit array, and outputting the added sum as the output result of the specific accumulation unit, wherein the accumulation unit array comprises 1 row of N accumulation units, the N accumulation units respectively correspond to N columns of the multiply-add unit array, and the previous accumulation output result is the output result of the previous accumulation unit of the specific accumulation unit or zero.
16. The method of claim 15, wherein the inputting weight values to the multiply-add unit array comprises:
inputting the weight values through the first row of multiply-add units of the multiply-add unit array, wherein, for each column of multiply-add units, the weight values are passed from the multiply-add unit of the first row down to the corresponding multiply-add units.
17. The method of claim 16, wherein all columns of multiply-add units latch their weight values simultaneously after the weight values have been passed to the corresponding multiply-add units.
18. The method according to any one of claims 15 to 17, wherein the inputting feature values to the multiply-add unit array comprises:
inputting the feature values through the first column of multiply-add units of the multiply-add unit array, wherein, for each row of multiply-add units, the feature values are passed successively from the multiply-add unit of the first column to the multiply-add units of the subsequent columns.
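The column-wise weight shifting of claims 16 and 17 can be sketched as below; the feeding order (last row first, one row per cycle) is an assumption made so that the example is concrete:

```python
def load_weights(weight_block):
    """Sketch of claims 16-17: each cycle, one row of weights enters through
    the first row of the array while previously entered rows shift one step
    down their columns.  weight_block is an M x N nested list; it is fed last
    row first so that, after M cycles, array row m holds weight_block[m].
    Per claim 17, all units would then latch their weights simultaneously.
    """
    M, N = len(weight_block), len(weight_block[0])
    array = [[0] * N for _ in range(M)]
    for row in reversed(weight_block):      # feed the last row first
        for m in range(M - 1, 0, -1):       # shift every column down one step
            array[m] = array[m - 1][:]
        array[0] = row[:]                   # new row enters through row 0
    return array
```

After M shift cycles the array holds the weight block in its original row order, which is what makes a single simultaneous latch (claim 17) sufficient.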
19. The method according to any one of claims 15 to 18, wherein the multiply-add unit array corresponds to the weight values of convolution kernels as follows:
for a convolution kernel of size Kd × Kh × Kw, the Kd two-dimensional Kh × Kw maps obtained by unfolding the kernel in the depth direction are mapped to the corresponding multiply-add units along the column direction of the multiply-add unit array, with one weight value corresponding to one multiply-add unit, where Kd, Kh and Kw are positive integers respectively representing the depth, height and width of the convolution kernel;
and a plurality of convolution kernels are mapped along the row direction of the multiply-add unit array.
20. The method of claim 19, wherein, when Kd × Kh > M, one convolution kernel is mapped onto the multiply-add unit array in multiple passes.
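The mapping of claims 19 and 20 amounts to flattening the kernel into a (Kd·Kh) × Kw weight block and splitting it by the array height M. A minimal sketch, with the nested-list kernel layout as an assumption:

```python
def map_kernel_to_columns(kernel, M):
    """Sketch of claims 19-20: a kernel given as a Kd x Kh x Kw nested list is
    unfolded in the depth direction into Kd two-dimensional Kh x Kw maps;
    stacking these along the column direction yields a (Kd*Kh) x Kw weight
    block, one weight per multiply-add unit, so one kernel occupies Kw
    columns.  When Kd*Kh > M, the block does not fit in one pass and is split
    into ceil(Kd*Kh / M) mappings (claim 20).
    """
    block = [row for depth_slice in kernel for row in depth_slice]  # (Kd*Kh) x Kw
    return [block[i:i + M] for i in range(0, len(block), M)]
```

For a 2 × 3 × 3 kernel on an array with M = 4 rows, Kd·Kh = 6 > 4, so the kernel is mapped in two passes of 4 and 2 rows of weights.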
21. The method according to any one of claims 15 to 20, further comprising:
and processing the result output by the accumulation unit array.
22. The method of claim 21, wherein the processing the results output by the accumulation unit array comprises:
when a result output by the accumulation unit array is the final result of a specific convolution kernel, outputting the result output by the accumulation unit array; and when a result output by the accumulation unit array is an intermediate result of the specific convolution kernel, buffering the result output by the accumulation unit array in an intermediate result buffer module.
23. The method of claim 22, wherein the intermediate result buffer module comprises:
N first-in first-out (FIFO) queues, wherein, for a convolution kernel of size Kd × Kh × Kw, every Kw FIFOs among the N FIFOs form a group for buffering the intermediate results of one convolution kernel.
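The grouping in claim 23 can be sketched with Python deques standing in for hardware FIFOs; assuming, for illustration, that N is a multiple of Kw:

```python
from collections import deque

def make_fifo_groups(N, Kw):
    """Sketch of the intermediate result buffer of claim 23: N FIFO queues,
    grouped Kw at a time, so that each group buffers the intermediate results
    of one convolution kernel (which occupies Kw columns of the array).
    deque gives FIFO order via append() / popleft()."""
    fifos = [deque() for _ in range(N)]
    groups = [fifos[i:i + Kw] for i in range(0, N, Kw)]
    return fifos, groups
```

For N = 6 columns and kernels of width Kw = 3, this yields two groups of three FIFOs, one group per kernel.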
24. The method according to any one of claims 15 to 23, wherein the specific multiply-add unit comprises: a weight value shift register, a weight value register, a feature value shift register, a feature value register, a multiplication circuit, a product register and an addition circuit;
the weight value is buffered by the weight value shift register and passed along the column in which the specific multiply-add unit is located, and the weight value corresponding to the specific multiply-add unit is latched into the weight value register;
the feature value is buffered by the feature value shift register and passed along the row in which the specific multiply-add unit is located, and the feature value is latched into the feature value register;
the multiplication circuit multiplies the weight value in the weight value register by the feature value in the feature value register, and outputs the product to the product register;
and the addition circuit adds the product in the product register to the previous multiply-add output result, and outputs the sum downward along the column in which the specific multiply-add unit is located.
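A behavioral sketch of one such unit follows. The patent names the components (the two shift registers, the two operand registers, the multiplication circuit, the product register, the addition circuit), but the method names, interface, and single-step timing here are illustrative assumptions:

```python
class MultiplyAddUnit:
    """Behavioral sketch of the multiply-add unit of claim 24 (interface and
    timing are assumptions, not from the patent)."""

    def __init__(self):
        self.weight_shift = 0   # weight value shift register (path down the column)
        self.weight = 0         # weight value register (latched operand)
        self.feature_shift = 0  # feature value shift register (path along the row)
        self.feature = 0        # feature value register
        self.product = 0        # product register

    def shift_weight(self, w_in, latch):
        """Pass a weight down the column; latch it if it belongs to this unit."""
        w_out = self.weight_shift
        self.weight_shift = w_in
        if latch:
            self.weight = w_in
        return w_out

    def step(self, feature_in, partial_sum_in):
        """One compute step: forward the buffered feature along the row, latch
        the incoming feature, multiply (multiplication circuit), and add the
        partial sum coming down the column (addition circuit)."""
        feature_out = self.feature_shift
        self.feature_shift = feature_in
        self.feature = feature_in
        self.product = self.weight * self.feature
        return feature_out, self.product + partial_sum_in
```

With weight 3 latched, a feature of 2 and an incoming partial sum of 4 produce an output partial sum of 3·2 + 4 = 10.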
25. The method according to any one of claims 15 to 24, wherein the specific accumulation unit comprises: a filter circuit, a multiply-add unit result register, a delay circuit, an accumulation unit result register, a first-stage addition circuit and a sum register;
the filter circuit filters the output result of the last multiply-add unit in the column corresponding to the specific accumulation unit according to the stride value of the convolution calculation, and outputs the filtered result to the multiply-add unit result register;
the delay circuit delays the previous accumulation output result according to the dilation value of the convolution calculation, and outputs the delayed result to the accumulation unit result register;
and the first-stage addition circuit adds the result in the multiply-add unit result register to the result in the accumulation unit result register, and outputs the sum to the sum register.
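One plausible reading of the filter circuit in claim 25 is that, of the stream of results leaving the last multiply-add unit in a column, only those aligned with the convolution stride are kept. This positional interpretation is an assumption; a minimal sketch:

```python
def stride_filter(column_outputs, stride):
    """Illustrative sketch of the filter circuit of claim 25: keep every
    stride-th result from a column's output stream and drop the positions in
    between (interpretation assumed, not stated in the claim)."""
    return [v for i, v in enumerate(column_outputs) if i % stride == 0]
```

A stride of 1 passes everything through unchanged; a stride of 2 keeps every other result.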
26. The method of claim 25, wherein the specific accumulation unit further comprises a second-stage addition circuit;
when the column corresponding to the specific accumulation unit is not the last column corresponding to a specific convolution kernel, the sum in the sum register is output to the next accumulation unit; when the column corresponding to the specific accumulation unit is the last column corresponding to the specific convolution kernel, the sum in the sum register is output to the second-stage addition circuit;
and when the column corresponding to the specific accumulation unit is the last column corresponding to the specific convolution kernel, the second-stage addition circuit adds the sum in the sum register to the intermediate result of the specific convolution kernel held in the intermediate result buffer module.
27. The method of any one of claims 15 to 26, further comprising:
controlling input of weight values and feature values to the multiply-add unit array, and controlling calculation of the multiply-add unit array and the accumulation unit array.
28. A processor, comprising the apparatus for convolution calculation according to any one of claims 1 to 14.
29. A mobile device, comprising:
the apparatus for convolution calculation according to any one of claims 1 to 14; or,
the processor of claim 28.
CN201980005258.1A 2019-01-31 2019-01-31 Convolution calculation device, convolution calculation method, convolution calculation processor and mobile equipment Pending CN111279364A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/074249 WO2020155044A1 (en) 2019-01-31 2019-01-31 Convolution calculation device and method, processor and movable device

Publications (1)

Publication Number Publication Date
CN111279364A true CN111279364A (en) 2020-06-12

Family

ID=71002822

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980005258.1A Pending CN111279364A (en) 2019-01-31 2019-01-31 Convolution calculation device, convolution calculation method, convolution calculation processor and mobile equipment

Country Status (2)

Country Link
CN (1) CN111279364A (en)
WO (1) WO2020155044A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112836803A (en) * 2021-02-04 2021-05-25 珠海亿智电子科技有限公司 Data placement method for improving convolution operation efficiency
CN114120082A (en) * 2021-11-23 2022-03-01 西南交通大学 Image acceleration convolution calculation method, system, equipment and readable storage medium

Citations (3)

Publication number Priority date Publication date Assignee Title
CN107918794A (en) * 2017-11-15 2018-04-17 中国科学院计算技术研究所 Neural network processor based on computing array
WO2018213628A1 (en) * 2017-05-17 2018-11-22 Google Llc Low latency matrix multiply unit
CN109284821A (en) * 2017-07-19 2019-01-29 华为技术有限公司 A kind of neural network computing device

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US9940520B2 (en) * 2015-05-01 2018-04-10 Applied Research LLC. Automatic target recognition system with online machine learning capability
CN108537330B (en) * 2018-03-09 2020-09-01 中国科学院自动化研究所 Convolution computing device and method applied to neural network
CN108520297B (en) * 2018-04-02 2020-09-04 周军 Programmable deep neural network processor

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
WO2018213628A1 (en) * 2017-05-17 2018-11-22 Google Llc Low latency matrix multiply unit
CN109284821A (en) * 2017-07-19 2019-01-29 华为技术有限公司 A kind of neural network computing device
CN107918794A (en) * 2017-11-15 2018-04-17 中国科学院计算技术研究所 Neural network processor based on computing array

Also Published As

Publication number Publication date
WO2020155044A1 (en) 2020-08-06

Similar Documents

Publication Publication Date Title
JP6987860B2 (en) Performing kernel strides in hardware
JP6961011B2 (en) Systems and methods for data management
CN111310904B (en) Apparatus and method for performing convolutional neural network training
EP3710995B1 (en) Deep neural network processor with interleaved backpropagation
US11244028B2 (en) Neural network processor and convolution operation method thereof
CN110892373A (en) Data access method, processor, computer system and removable device
US11024005B2 (en) Optical flow tracking device and method
CN108717571B (en) Acceleration method and device for artificial intelligence
EP3093757B1 (en) Multi-dimensional sliding window operation for a vector processor
KR20200061164A (en) Neural network device for neural network operation, operating method of neural network device and application processor comprising neural network device
CN109903350B (en) Image compression method and related device
WO2018218481A1 (en) Neural network training method and device, computer system and mobile device
EP3839832A1 (en) Method and apparatus with neural network convolution operation
JP2020107338A (en) Method and apparatus for processing convolution operation in neural network
CN111279364A (en) Convolution calculation device, convolution calculation method, convolution calculation processor and mobile equipment
CN111932616A (en) Binocular vision inertial odometer method for accelerating by utilizing parallel computing
US20230289601A1 (en) Integrated circuit that extracts data, neural network processor including the integrated circuit, and neural network
CN112005251A (en) Arithmetic processing device
CN112470138A (en) Computing device, method, processor and mobile equipment
WO2021232422A1 (en) Neural network arithmetic device and control method thereof
WO2019173135A1 (en) A machine perception and dense algorithm integrated circuit
WO2018165812A1 (en) Image processing method, chip, processor, computer system, and mobile device
CN112639836A (en) Data processing device, electronic equipment and data processing method
CN116090518A (en) Feature map processing method and device based on systolic operation array and storage medium
JP7387017B2 (en) Address generation method and unit, deep learning processor, chip, electronic equipment and computer program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200612