CN112639836A - Data processing device, electronic equipment and data processing method - Google Patents


Info

Publication number: CN112639836A
Application number: CN202080004607.0A
Authority: CN (China)
Legal status: Pending (an assumption; Google has not performed a legal analysis)
Original language: Chinese (zh)
Prior art keywords: result, bits, output, input, matrix
Inventors: 杨康, 韩峰
Applicant and current assignee: SZ DJI Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/15 Correlation function computation including computation of convolution operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/544 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F 7/5443 Sum of products
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

A data processing apparatus, an electronic device, and a data processing method. The apparatus includes: an input module (1) for acquiring an input feature value matrix and an n-bit or 2n-bit weight value matrix; a calculation module (2) for performing a convolution operation on the input feature value matrix and the n-bit or 2n-bit weight value matrix to obtain an output feature value matrix; and an output module (3) for outputting the output feature value matrix, where n is a positive integer. The apparatus can perform convolution operations on data of two lengths, improving the achievable precision of deep convolutional neural networks and accommodating the design requirements of deep convolutional neural networks of different precisions.

Description

Data processing device, electronic equipment and data processing method
Technical Field
Embodiments of the invention relate to the technical field of data processing, and in particular to a data processing apparatus, an electronic device, and a data processing method.
Background
A deep convolutional neural network is a machine learning algorithm widely applied to computer vision tasks such as object recognition, object detection, and semantic segmentation of images.
Most of the operations in a deep convolutional neural network are convolution operations, and designing dedicated hardware circuits to accelerate the convolutions of the convolutional layers can greatly reduce the network's computation time. However, conventional convolution devices support operands of only a single fixed-point width, such as 8-bit fixed-point numbers, so they cannot process the data of deep convolutional neural networks with higher precision requirements and can hardly meet the ever-increasing precision demands on such networks.
Disclosure of Invention
Embodiments of the invention provide a data processing apparatus, an electronic device, and a data processing method, aiming to solve the technical problem that prior-art convolution devices can hardly meet the precision requirements of deep convolutional neural networks.
A first aspect of an embodiment of the present invention provides a data processing apparatus, including:
an input module for acquiring an input feature value matrix and an n-bit or 2n-bit weight value matrix;
a calculation module for performing a convolution operation on the input feature value matrix and the n-bit or 2n-bit weight value matrix to obtain an output feature value matrix;
an output module for outputting the output feature value matrix;
wherein n is a positive integer.
A second aspect of an embodiment of the present invention provides an electronic device, including the data processing apparatus according to the first aspect.
A third aspect of the embodiments of the present invention provides a data processing method, including:
acquiring an input feature value matrix and an n-bit or 2n-bit weight value matrix;
performing a convolution operation on the input feature value matrix and the n-bit or 2n-bit weight value matrix to obtain an output feature value matrix;
outputting the output feature value matrix;
wherein n is a positive integer.
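As a rough software illustration of the three-step method, a direct sliding-window convolution can be sketched in plain Python. This is only an illustrative model of the operation being claimed, not the patented hardware; the function name and shapes are assumptions:

```python
def convolve2d(x, w):
    """Direct (valid) 2-D convolution: slide w over x and take inner products.

    x: input feature value matrix (list of lists), H rows x W cols
    w: weight value matrix (list of lists), R rows x S cols
    Returns the output feature value matrix of size (H-R+1) x (W-S+1).
    """
    H, W = len(x), len(x[0])
    R, S = len(w), len(w[0])
    out = []
    for i in range(H - R + 1):
        row = []
        for j in range(W - S + 1):
            acc = 0  # accumulate the inner product over the R x S window
            for r in range(R):
                for s in range(S):
                    acc += x[i + r][j + s] * w[r][s]
            row.append(acc)
        out.append(row)
    return out
```

The bit widths (n or 2n) do not appear here because Python integers are unbounded; in the hardware they determine the operand registers and datapath widths.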
The data processing apparatus, electronic device, and data processing method provided by embodiments of the invention can perform convolution operations on data of two lengths, improving the precision of deep convolutional neural networks and accommodating the design requirements of convolutional neural networks of different depths.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
fig. 1 is a schematic diagram of an application scenario according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a convolution operation in the application scenario of FIG. 1;
fig. 3 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention;
fig. 4 is a schematic diagram illustrating a principle of performing convolution operation by a data processing apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a data processing apparatus according to a second embodiment of the present invention;
fig. 6 is a schematic structural diagram of a systolic unit in a data processing apparatus according to a third embodiment of the present invention;
fig. 7 is a schematic structural diagram of an accumulator in a data processing apparatus according to a third embodiment of the present invention;
fig. 8 is a schematic diagram illustrating a convolution operation process of n-bit data performed by the data processing apparatus according to the third embodiment of the present invention;
fig. 9 is a schematic diagram illustrating a convolution operation process of 2n-bit data performed by the data processing apparatus according to the third embodiment of the present invention;
fig. 10 is a schematic structural diagram of a data processing apparatus according to a fourth embodiment of the present invention;
fig. 11 is a schematic diagram illustrating a storage format when the data processing apparatus stores n-bit data according to a fourth embodiment of the present invention;
fig. 12 is a schematic diagram illustrating a storage format when a data processing apparatus according to a fourth embodiment of the present invention stores 2n-bit data;
fig. 13 is a flowchart illustrating a data processing method according to a fifth embodiment of the present invention.
Reference numerals:
1-input module 2-calculation module
3-output module 4-memory
11-weight value loading module 12-input feature value loading module
21-systolic unit 22-accumulator
23-control unit 24-weight value injection unit
25-input feature value injection unit 26-result output unit
27-result storage unit 211-weight value register
212-input feature value register 213-multiplication circuit
214-adder 215-weight value shift register
216-input feature value shift register 217-multiplication result register
221-multiply-accumulate result register 222-pre-multiply-accumulate result register
223-vertical adder circuit 224-first-stage adder circuit
225-filter circuit 226-accumulator result register
227-sum register 228-delay circuit
229-second-stage adder circuit
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the invention clearer, the technical solutions in the embodiments of the invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments given here without creative effort fall within the protection scope of the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Fig. 1 is a schematic diagram of an application scenario according to an embodiment of the present invention. The data processing apparatus and data processing method provided by embodiments of the invention can be applied to any scenario requiring convolution operations, such as a deep convolutional neural network.
As shown in fig. 1, a deep convolutional neural network to which an embodiment of the present invention can be applied includes an input, an output, and hidden layers. Each layer in the network shown in fig. 1 may have one input and one output; in an actual deep convolutional neural network, each layer may have multiple inputs or multiple outputs.
The hidden layers of the deep convolutional neural network consist of a series of cascaded feature maps and operations. Hidden-layer operations include convolution, pooling, activation, and so on. Each hidden-layer feature map is generated from the previous layer's feature map by these operations. In general, the layers of a convolutional neural network are named after the operation they perform; for example, a layer performing convolution is called a convolutional layer, and a layer performing pooling is called a pooling layer.
The convolution operation of a convolutional layer proceeds as follows: a vector inner product is computed between a set of weight values and the input feature maps, producing a set of output feature maps. The input weight values are also called filters or convolution kernels.
The weight values and the input and output feature maps can each be represented as a multi-dimensional matrix. An input feature map can be represented as an input feature value matrix whose elements are called input feature values; an output feature map can be represented as an output feature value matrix whose elements are called output feature values.
Fig. 2 is a schematic diagram of the convolution operation process in the application scenario shown in fig. 1. As shown in fig. 2, convolving an R × N weight value matrix with an H × N input feature value matrix yields an E × N output feature value matrix. Each output feature value in the output feature value matrix is obtained as the inner product of some of the input feature values in the input feature value matrix with the weight values of the weight value matrix.
The technical solutions provided by embodiments of the invention can support convolution operations on n-bit or 2n-bit data. They are described below with reference to the accompanying drawings.
Example one
The embodiment of the invention provides a data processing device. Fig. 3 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention. As shown in fig. 3, the data processing apparatus in this embodiment may include:
the input module 1 is configured to acquire an input feature value matrix and an n-bit or 2n-bit weight value matrix, where n is a positive integer;
the calculation module 2 is configured to perform a convolution operation on the input feature value matrix and the n-bit or 2n-bit weight value matrix to obtain an output feature value matrix;
and the output module 3 is configured to output the output feature value matrix.
Specifically, the input module 1 may be connected to a memory or to another module to acquire the input feature value matrix and the weight value matrix to be convolved. Optionally, a connection described in embodiments of the invention may be a physical connection or a communication connection.
The weight value matrix may be an n-bit weight value matrix or a 2n-bit weight value matrix: an n-bit weight value matrix is one whose weight values are n bits long, and a 2n-bit weight value matrix is one whose weight values are 2n bits long. Optionally, the input feature values in the input feature value matrix may have the same length as the weight values in the weight value matrix; when the weight value matrix is 2n bits, the input feature value matrix may also be 2n bits, which ensures that the two matrices can be convolved directly and improves operation efficiency and accuracy.
The calculation module 2 may be connected to the input module 1 to acquire the input feature value matrix and the weight value matrix and perform the convolution operation. Specifically, a portion of the input feature values in the input feature value matrix may be multiplied by the corresponding weight values in the weight value matrix and accumulated to obtain the corresponding output feature value.
Fig. 4 is a schematic diagram illustrating the principle of the convolution operation performed by a data processing apparatus according to an embodiment of the present invention. As shown in fig. 4, the input feature value matrix is:
X00 X01 X02 X03 X04
X10 X11 X12 X13 X14
X20 X21 X22 X23 X24
the weight value matrix is:
W00 W01
W10 W11
the output eigenvalue matrix is:
Y00 Y01 Y02 Y03
Y10 Y11 Y12 Y13
where Xij is the input feature value in row i, column j of the input feature value matrix, Wij is the weight value in row i, column j of the weight value matrix, and Yij is the output feature value in row i, column j of the output feature value matrix. The weight value matrix contains 2 × 2 weight values; it traverses every 2 × 2 region of the input feature value matrix and computes an inner product with each region to obtain the corresponding output feature value, namely:
Yij = Xij*W00 + Xi(j+1)*W01 + X(i+1)j*W10 + X(i+1)(j+1)*W11
As shown in fig. 4, the weight value matrix is first computed against the 2 × 2 region enclosed by the bold frame at the upper left corner of the input feature value matrix, giving the output feature value Y00 = X00*W00 + X01*W01 + X10*W10 + X11*W11; the bold frame then moves one column to the right, and the weight value matrix is computed against the next 2 × 2 region, giving the corresponding output feature value Y01 = X01*W00 + X02*W01 + X11*W10 + X12*W11; and so on, until every 2 × 2 region has been traversed and all output feature values obtained.
Each matrix shown in fig. 4 is two-dimensional; in practice the weight value matrix, the input feature value matrix, and the output feature value matrix may be two-dimensional or three-dimensional. The convolution of three-dimensional matrices is similar in principle to that of two-dimensional matrices and is not repeated here.
If the acquired input feature value matrix and weight value matrix are n bits, the output feature values in the output feature value matrix may also be n bits long; if they are 2n bits, the output feature values may also be 2n bits long.
The output module 3 may be connected to the calculation module 2 to acquire the output feature value matrix computed by the calculation module 2 and output it. The output can take various forms; for example, the output feature value matrix can be displayed to the user or passed to the next convolutional layer for the next level of convolution.
In practical applications, the apparatus can simultaneously support convolution on data of two lengths, n bits and 2n bits, for example 8-bit and 16-bit fixed-point numbers. When the convolutional neural network is quantized to fixed-point numbers of length n and the network precision meets the design requirement, the apparatus can perform convolution with n-bit fixed-point numbers, and the same hardware resources provide higher convolution concurrency. If quantizing the network to n-bit fixed-point numbers loses too much precision to meet the design requirement, the apparatus can switch to convolution with 2n-bit fixed-point numbers, and the network can be quantized with 2n-bit fixed-point numbers, reducing the precision loss after quantization.
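The trade-off between n-bit and 2n-bit fixed-point representations can be illustrated with a toy quantization function. The bit splits and test value chosen here are arbitrary assumptions, not the patent's quantization scheme:

```python
def quantize(value, bits, frac_bits):
    """Round value to a signed fixed-point number with `bits` total bits,
    `frac_bits` of which are fractional, and return the representable value."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    q = max(lo, min(hi, round(value * scale)))  # saturate to the signed range
    return q / scale

x = 0.123456
err8 = abs(x - quantize(x, 8, 6))     # 8-bit: step size 1/64
err16 = abs(x - quantize(x, 16, 14))  # 16-bit: step size 1/16384
```

Doubling the operand width shrinks the quantization step, which is why a network that loses too much accuracy at n bits can often meet its target at 2n bits.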
The data processing apparatus provided by this embodiment includes an input module 1, a calculation module 2, and an output module 3. The input module 1 acquires an n-bit or 2n-bit weight value matrix and an input feature value matrix; the calculation module 2 convolves them to obtain an n-bit or 2n-bit output feature value matrix; and the output module 3 outputs that matrix. Convolution on data of two lengths is thus realized: when high precision is required, the convolution can be performed on 2n-bit data, improving the precision of the deep convolutional neural network and accommodating the design requirements of convolutional neural networks of different depths.
Example two
The second embodiment of the invention provides a data processing apparatus. In this embodiment, on the basis of the technical solutions provided above, the convolution operation is implemented with a systolic array, an accumulator array, and related units. Fig. 5 is a schematic structural diagram of a data processing apparatus according to the second embodiment of the present invention. As shown in fig. 5, the data processing apparatus in this embodiment may include:
an input module for acquiring an n-bit or 2n-bit weight value matrix and an n-bit or 2n-bit input feature value matrix; the input module may specifically include a weight value loading module 11 for acquiring the n-bit or 2n-bit weight value matrix and an input feature value loading module 12 for acquiring the n-bit or 2n-bit input feature value matrix;
a calculation module 2 for performing a convolution operation on the input feature value matrix and the weight value matrix to obtain an output feature value matrix;
and an output module 3 for outputting the output feature value matrix.
The calculation module 2 may include:
a systolic array for performing the multiply-accumulate operations between the n-bit or 2n-bit weight values in the weight value matrix and the corresponding input feature values;
and an accumulator array for computing the output feature value matrix from the multiply-accumulate results produced by the systolic array.
Specifically, the systolic array may compute a multiply-accumulate result for each column of weight values in the weight value matrix, and the accumulator array adds the per-column results to obtain an output feature value; alternatively, the systolic array may compute a multiply-accumulate result for each row of weight values, and the accumulator array adds the per-row results to obtain an output feature value.
Taking the matrices shown in fig. 4 as an example, when computing the output corresponding to the weight value matrix and the input feature values in the bold frame at the upper left corner, the systolic array may compute the multiply-accumulate result of each column of weight values with the corresponding input feature values: the first column of weight values contains W00 and W10, and multiplying and accumulating them with the corresponding input feature values gives X00*W00 + X10*W10; the second column contains W01 and W11, and the corresponding result is X01*W01 + X11*W11. The accumulator array then adds the per-column multiply-accumulate results to obtain the output feature value Y00 = X00*W00 + X01*W01 + X10*W10 + X11*W11.
Alternatively, the systolic array may compute the multiply-accumulate result of each row of weight values with the corresponding input feature values: the first row of weight values contains W00 and W01, and multiplying and accumulating them with the corresponding input feature values gives X00*W00 + X01*W01; the second row contains W10 and W11, and the corresponding result is X10*W10 + X11*W11. The accumulator array then adds the per-row multiply-accumulate results to obtain the same output feature value Y00 = X00*W00 + X01*W01 + X10*W10 + X11*W11.
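Either decomposition yields the same output feature value, since addition is associative. A quick numeric check with made-up values:

```python
X = [[1, 2], [3, 4]]  # 2x2 window of input feature values (illustrative)
W = [[5, 6], [7, 8]]  # 2x2 weight value matrix (illustrative)

# Column-wise partial sums (one per column of W), then accumulate
col_sums = [sum(X[i][j] * W[i][j] for i in range(2)) for j in range(2)]
# Row-wise partial sums (one per row of W), then accumulate
row_sums = [sum(X[i][j] * W[i][j] for j in range(2)) for i in range(2)]

y_from_cols = sum(col_sums)
y_from_rows = sum(row_sums)
```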
In the figures provided by embodiments of the invention, MC denotes a systolic unit and ACC denotes an accumulator. As shown in fig. 5, the systolic array may include multiple columns of systolic units 21; each column of systolic units 21 may load a column of weight values and multiply-accumulate the loaded weight values with the corresponding input feature values to obtain the multiply-accumulate result for that loaded column.
The number of columns of systolic units 21 used in the computation may equal the number of columns of the weight value matrix, with each column of systolic units 21 loading one column of weight values. Alternatively, the number of columns of systolic units 21 may equal the number of rows of the weight value matrix, with each column of systolic units 21 loading one row of weight values. For ease of description, embodiments of the invention assume that a column of systolic units 21 loads a column of weight values.
Each systolic unit 21 in a column may be loaded with one weight value; it receives an input feature value, multiplies it by the loaded weight value, adds the product to the output of the previous systolic unit 21 in the column, and outputs the sum. The result output by the last systolic unit 21 of each column is the multiply-accumulate result for that column.
The accumulator array may include a plurality of accumulators 22, the number of which equals the number of columns of systolic units 21, each accumulator 22 being connected to one column of systolic units 21 in one-to-one correspondence. Specifically, if the number of accumulators 22 and the number of columns of systolic units 21 are both k, the i-th accumulator 22 is connected to the i-th column of systolic units 21, where k is a natural number greater than 1 and i = 1, 2, ..., k.
Here, an accumulator 22 being connected to a column of systolic units 21 may mean being connected to the last systolic unit 21 in that column.
The accumulator 22 is configured to obtain the output of its corresponding column of systolic units 21, add it to the output of the previous accumulator 22, and pass the sum to the next accumulator 22, thereby accumulating the outputs of all columns of systolic units 21.
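The chained multiply-add within one column can be modeled behaviorally; `systolic_column` below is a software sketch of the dataflow, not a description of the circuit:

```python
def systolic_column(weights, features):
    """Model one column of systolic units: unit k computes
    features[k] * weights[k] + (partial sum from unit k - 1).
    The last unit's output is the column's multiply-accumulate result."""
    partial = 0  # the output fed into the first unit of the column
    for w, x in zip(weights, features):
        partial = x * w + partial  # each unit adds its product to the chain
    return partial

# e.g. a column loaded with [W00, W10] = [5, 7] seeing [X00, X10] = [1, 3]
mac = systolic_column([5, 7], [1, 3])
```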
Optionally, the calculation module 2 may further include a result output unit 26 and a result storage unit 27. When the weight value matrix has more rows than the systolic array, the systolic array can load part of the weight values of the weight value matrix at a time, and the result storage unit 27 stores the intermediate result, i.e., the result of operating on that part of the weight values.
After the outputs of the columns of systolic units 21 have been accumulated, if an intermediate result is cached in the result storage unit 27, the accumulated result is further added to that intermediate result. If the sum is still an intermediate result of the convolution, the result output unit 26 stores it in the result storage unit 27; if it is the final result of the convolution, i.e., the result of operating on all weight values in the weight value matrix, the result output unit 26 outputs it to the output module 3 for subsequent processing.
With the result output unit 26 and the result storage unit 27, when the weight value matrix is larger than the systolic array, an intermediate convolution result can be computed by loading one part of the weight values into the systolic array, and the computation then continues with another part until the final result is obtained and output. A small systolic array can thus complete the operation of a large weight value matrix, effectively reducing the size and cost of the device.
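This intermediate-result mechanism amounts to tiled accumulation. In the sketch below the result storage unit is modeled as a plain variable, an illustrative assumption rather than the actual circuit:

```python
def tiled_mac(weights, features, tile_rows):
    """Multiply-accumulate a long weight column against the matching input
    feature values, processing `tile_rows` rows per pass and carrying the
    intermediate result between passes (the role of the result storage unit)."""
    intermediate = 0
    for start in range(0, len(weights), tile_rows):
        w_tile = weights[start:start + tile_rows]   # part loaded this pass
        x_tile = features[start:start + tile_rows]
        intermediate += sum(w * x for w, x in zip(w_tile, x_tile))
    return intermediate  # final result after the last pass

ws, xs = [1, 2, 3, 4, 5], [6, 7, 8, 9, 10]
full = sum(w * x for w, x in zip(ws, xs))   # single-pass reference
tiled = tiled_mac(ws, xs, tile_rows=2)      # 2 rows of the array per pass
```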
To feed the weight values and input feature values into the systolic array, the data processing apparatus in this embodiment may further include a weight value injection unit 24 and an input feature value injection unit 25.
The input of the weight value injection unit 24 may be connected to the weight value loading module 11, and its output may be connected to the systolic array, specifically to individual systolic units 21, so as to feed weight values to the corresponding systolic units 21.
Similarly, the input of the input feature value injection unit 25 may be connected to the input feature value loading module 12, and its output may be connected to the systolic array, specifically to individual systolic units 21, so as to feed input feature values to them.
The weight value injection unit 24 and the input feature value injection unit 25 can buffer the weight values and input feature values before sending them to the systolic array, improving the stability of the device.
Optionally, the weight value injection unit 24 may be connected directly to every systolic unit 21 or, as shown in fig. 5, directly to only the first row of systolic units 21 and to the remaining units through intermediate systolic units 21, which pass the weight values along.
Similarly, the input feature value injection unit 25 may be connected directly to every systolic unit 21 or, as shown in fig. 5, directly to only the first column of systolic units 21, with intermediate systolic units 21 passing the input feature values along.
The connection scheme shown in fig. 5 between the weight value injection unit 24 or input feature value injection unit 25 and the systolic units 21 effectively saves wiring and reduces the volume of the device.
In order to realize the convolution operation, the whole convolution calculation process can be divided into a weight value loading stage and a calculation stage. In the weight value loading stage, loading the weight values in the weight value matrix into the systolic unit 21 of the systolic array; and in the calculation stage, the input characteristic values in the input characteristic value matrix are input into the pulse array, and calculation is carried out according to the weight values and the input characteristic values.
The data processing apparatus in this embodiment may further include: a control unit 23. The control unit 23 is used for controlling the other modules in the computing module 2 to work.
Specifically, the control unit 23 may control the weight value injection unit 24 to load the weight value obtained from the weight value loading module 11 to the systolic array, and then control the input feature value injection unit 25 to feed the input feature value obtained from the input feature value loading module 12 to the systolic array, and control the systolic array and the accumulator array to perform convolution operation.
Optionally, the input feature values may be reused when performing the convolution operation. The control unit 23 may specifically be configured to: in the weight value loading stage, control the weight values in the weight value matrix to be sequentially loaded into the systolic units 21 of the systolic array; in the calculation stage, control the input feature values in the input feature value matrix to be sequentially transferred to the right within the systolic array, and control the systolic units 21 to perform calculation according to the loaded weight values and the transferred input feature values.
Thus, in the calculation stage, an input feature value enters through the single interface of a row of systolic units 21 and passes in turn through each systolic unit 21 of that row from left to right, and each systolic unit 21 can use it in its operation, so that the input feature value is reused and the data access bandwidth required by the convolution operation is reduced.
Optionally, in the weight value loading stage, the control unit 23 may specifically be configured to: in the shift phase of the weight value loading stage, for each column of systolic units 21, sequentially send the weight values to be loaded by that column into the systolic array through the first systolic unit 21 of the column, the received weight values being passed downwards in turn from that first systolic unit 21; in the load phase of the weight value loading stage, control the systolic units 21 in the systolic array to store their corresponding weight values.
Specifically, the weight value injection unit 24 is responsible for caching the weight values sent by the weight value loading module 11 and loading them into the systolic array under the control of the control unit 23. The weight value injection unit 24 has only one interface with each column of systolic units 21, which can transmit only one weight value per clock cycle. The weight value loading stage can be divided into a shift phase and a load phase. In the shift phase, the weight value injection unit 24 sequentially sends the weight values required by one column of systolic units 21 into the systolic array through the same interface; within the systolic array, the received weight values are passed down in sequence from the systolic unit 21 at the interface. In the load phase, the systolic units 21 of the same column simultaneously load their buffered weight values into their respective registers for use in the subsequent multiply-accumulate process. The weight value injection unit 24 may load the weight values for two adjacent columns of systolic units 21 with a delay of one clock cycle.
The input feature value injection unit 25 is responsible for buffering the input feature values sent by the input feature value loading module 12 and sending them into the systolic array under the control of the control unit 23. The input feature value injection unit 25 has only one interface with each row of systolic units 21, which can transfer only one input feature value per clock cycle. Within the systolic array, the received input feature values are passed to the right in sequence from the systolic unit 21 at the interface until the last systolic unit 21. The input feature value injection unit 25 may feed the input feature values to two adjacent rows of systolic units 21 with a delay of one clock cycle.
In the systolic array, the input feature values are transmitted from left to right and the weight values from top to bottom, and data may take one clock cycle to pass through one row or one column of systolic units 21. Loading data into two adjacent rows or two adjacent columns with a one-clock-cycle delay therefore ensures that the loading of the weight values, and the operations between the weight values and the corresponding input feature values, are carried out correctly.
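The timing relationship above can be illustrated with a toy model (not taken from the patent): if each hop between neighbouring systolic units costs one clock cycle, a value injected at the array edge reaches unit (row, col) after row + col cycles, which is exactly why adjacent rows and columns are fed with a one-cycle offset.

```python
# Toy timing model: one clock cycle per row/column hop through the array.
# A value injected at cycle `inject_cycle` for position (row, col) arrives
# at that systolic unit after `row + col` additional cycles.
def arrival_cycle(row, col, inject_cycle=0):
    return inject_cycle + row + col

# The unit at the injection corner sees its data immediately...
assert arrival_cycle(0, 0) == 0
# ...while a unit two rows down and three columns right sees it 5 cycles later.
assert arrival_cycle(2, 3) == 5
```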
In practical applications, the control unit 23 may obtain the length of the weight values in the weight value matrix or the length of the input feature values in the input feature value matrix, and control components such as the systolic array and the accumulator array to implement the convolution operation according to that length.
For example, when the length is n bits, n-bit data may be loaded into the systolic array; when the length is 2n bits, 2n-bit data may be loaded into the systolic array, so that data of different precisions can be calculated.
Optionally, the control unit 23 may control each unit to implement the convolution operation through a hardware circuit such as a state machine. For example, configuration information indicating the length of the data on which the convolution operation is to be performed may be stored in a register or carried in an instruction, and the control unit 23 may generate control signals according to this configuration information, controlling components such as the systolic array and the accumulator array to switch between the n-bit and 2n-bit convolution operation modes.
In the data processing apparatus provided in this embodiment, the calculation module 2 may include a systolic array and an accumulator array, which together implement the convolution operation. The systolic array may be configured to perform the multiply-accumulate operation between the n-bit or 2n-bit weight values in the weight value matrix and the corresponding input feature values, and the accumulator array may be configured to calculate the output feature value matrix from the multiply-accumulate results obtained by the systolic array. The convolution operation is thus split into multiply-accumulate operations and accumulate operations, so that the convolution result of the weight value matrix and the input feature value matrix is calculated accurately; moreover, data reuse within the convolution operation can effectively reduce the required data access bandwidth and save resources.
Embodiment Three
The third embodiment of the invention provides a data processing apparatus. The present embodiment provides a specific implementation of the systolic unit and the accumulator on the basis of the technical solutions provided by the above embodiments. Fig. 5 can be seen as a schematic diagram of the overall configuration of the data processing apparatus in this embodiment. Fig. 6 is a schematic structural diagram of a systolic unit in a data processing apparatus according to the third embodiment of the present invention. Fig. 7 is a schematic structural diagram of an accumulator in a data processing apparatus according to the third embodiment of the present invention.
As shown in fig. 6, the systolic unit 21 may include:
a weight value register 211 for storing a weight value;
an input feature value register 212 for storing an input feature value;
a multiplication circuit 213, which may be connected to the weight value register 211 and the input feature value register 212, respectively, and configured to obtain the product of the weight value stored in the weight value register 211 and the input feature value stored in the input feature value register 212;
an addition circuit 214, which may be connected to the multiplication circuit 213 and configured to add the product obtained by the multiplication circuit 213 to the output of the systolic unit 21 in the previous row. When there is no systolic unit 21 in the previous row, the addition circuit 214 may directly output the result obtained from the multiplication circuit 213.
Through the above components, the systolic unit 21 can load a weight value, acquire an input feature value, multiply the input feature value by the loaded weight value, add the resulting product to the output of the systolic unit 21 above it, and output the sum. The result output by the addition circuit 214 may be sent to the next systolic unit 21.
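The per-cycle behaviour of one systolic unit described above can be sketched as follows; this is a minimal behavioural model, not the patent's circuit, and the function name is chosen for illustration.

```python
# Minimal sketch of one systolic unit's step: multiply the stored weight by
# the stored input feature value (multiplication circuit 213) and add the
# output arriving from the unit in the row above (addition circuit 214).
def systolic_unit_step(weight, feature, above_in=0):
    """`above_in` is 0 for a first-row unit, which has no predecessor."""
    product = weight * feature        # multiplication circuit 213
    return product + above_in         # addition circuit 214

# First-row unit: no predecessor, so the product is passed down unchanged.
assert systolic_unit_step(3, 4) == 12
# Second-row unit adds the partial sum received from above.
assert systolic_unit_step(2, 5, above_in=12) == 22
```

Chaining such steps down a column yields the column's multiply-accumulate result that the accumulator later consumes.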
Optionally, the systolic unit 21 may further include:
a weight value shift register 215 for passing the weight value to the systolic unit 21 in the next row;
an input feature value shift register 216 for passing the input feature value to the systolic unit 21 in the next column.
Specifically, the weight value shift register 215 may buffer the weight value sent from the weight value injection unit 24 or from the systolic unit 21 above. In the shift phase of weight value loading, the weight value buffered in the weight value shift register 215 is passed down to the systolic unit 21 below. In the load phase of weight value loading, the weight value buffered in the weight value shift register 215 is latched into the weight value register 211.
In the process of calculating according to the weight value in the weight value register 211, the weight value shift register 215 can be used to load the next weight value, so that the calculation efficiency corresponding to the whole weight value matrix can be effectively improved.
The input feature value shift register 216 buffers the input feature value sent from the input feature value injection unit 25 or from the systolic unit 21 to its left. The input feature value buffered in the input feature value shift register 216 is latched into the input feature value register 212 and is also provided to the systolic unit 21 to its right.
In the process of performing calculation according to the input feature value in the input feature value register 212, the input feature value shift register 216 may be used to load the next input feature value, which can effectively improve the calculation efficiency corresponding to the whole input feature value matrix.
Optionally, the systolic unit 21 may further include a multiplication result register 217. The connection between the addition circuit 214 and the multiplication circuit 213 can be realized through the multiplication result register 217. The multiplication result register 217 stores the product of the weight value loaded by the systolic unit 21 and the input feature value, so that the product can conveniently be added to the output of the systolic unit 21 above, improving the stability of the device.
In the embodiment of the present invention, each systolic unit 21 may optionally perform an n-bit by n-bit multiply-accumulate operation. Specifically, the length of the weight value that each systolic unit 21 can load may be n bits; when the length of the weight values in the weight value matrix is 2n bits, each column of systolic units 21 loads either the upper n bits or the lower n bits of the weight values.
Two columns of systolic units 21 are loaded with the upper n bits and the lower n bits of the weight values, respectively, so that an n-bit device can calculate 2n-bit data.
Correspondingly, when the length of the input feature values in the input feature value matrix is 2n bits, the input feature value acquired by a systolic unit 21 each time may be the upper n bits or the lower n bits of an input feature value in the matrix.
Further, the upper n bits and the lower n bits of a column of weight values may be loaded into two adjacent columns of systolic units 21, and the upper n bits of an input feature value may be fed into the systolic array immediately after its lower n bits and transferred in turn from the first systolic unit 21 of the row to the last, so that the accumulator 22 can further compute the multiply-accumulate result and the complexity of the accumulator 22 can be reduced.
As shown in fig. 7, the accumulator 22 in the present embodiment may include:
a multiply-accumulate result register 221, which may be connected to the last systolic unit 21 of the corresponding column and is used to obtain the output result of that last systolic unit 21;
a previous multiply-accumulate result register 222, which may be connected to the multiply-accumulate result register 221 and is configured, when the input feature value is 2n bits, to obtain an output result from the multiply-accumulate result register 221 every other clock cycle;
a vertical addition circuit 223, which may be connected to the multiply-accumulate result register 221 and the previous multiply-accumulate result register 222, respectively, and is configured to send the output result of the multiply-accumulate result register 221 to the first-stage addition circuit 224 when the input feature value is n bits, or to send the sum of the output results of the multiply-accumulate result register 221 and the previous multiply-accumulate result register 222 to the first-stage addition circuit 224 when the input feature value is 2n bits;
a first-stage addition circuit 224, which may be connected to the vertical addition circuit 223 and the previous-stage accumulator 22, respectively, for adding the result output by the vertical addition circuit 223 to the result output by the previous-stage accumulator 22.
Through the above components, the accumulator 22 can obtain the output results of its corresponding column of systolic units 21, add the output result of the previous-stage accumulator 22, and output the sum to the next-stage accumulator 22.
It should be understood that adding data in the embodiments of the present invention may mean adding two values directly, or adding them after converting them into a certain format. For example, data in different formats may first be converted into the same format before being added; before adding upper-n-bit data to lower-n-bit data, the upper-n-bit data may be shifted left by n bits so that the two addends are aligned, after which they can be added.
Optionally, when the input feature value is 2n bits, the accumulator 22 may store, in one register, the output result corresponding to the upper n bits of the input feature value acquired from the systolic units 21, and store, in another register, the output result corresponding to its lower n bits; obtain the output result corresponding to the input feature value from the upper-n-bit and lower-n-bit output results; add it to the output result of the previous-stage accumulator 22; and output the sum to the next-stage accumulator 22.
The upper-n-bit and lower-n-bit output results of an input feature value may be two adjacent output results obtained from the systolic units 21.
Specifically, if the systolic array column corresponding to the accumulator 22 is loaded with the lower n bits of the weight values, then when obtaining the output result corresponding to the input feature value from its upper-n-bit and lower-n-bit output results, the accumulator 22 may be specifically configured to: shift the upper-n-bit output result of the input feature value left by n bits and add the lower-n-bit output result, obtaining the output result corresponding to the input feature value.
If the systolic array column corresponding to the accumulator 22 is loaded with the upper n bits of the weight values, then when obtaining the output result corresponding to the input feature value from its upper-n-bit and lower-n-bit output results, the accumulator 22 may be specifically configured to: shift the upper-n-bit output result of the input feature value left by n bits, add the lower-n-bit output result, and shift the sum left by n bits again, obtaining the output result corresponding to the input feature value.
The shift operation can be implemented in the vertical addition circuit 223; shifting the upper-n-bit data left by n bits restores the upper-n-bit output result to the actual multiply-accumulate result and ensures the accuracy of the result.
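The two recombination rules above can be checked arithmetically. The sketch below is an illustration under the embodiment's scheme, with hypothetical function names; `s_hi` and `s_lo` stand for the column results produced from the input's upper and lower n bits respectively, and the check multiplies a single 2n-bit weight by a single 2n-bit input.

```python
n = 8  # n-bit systolic units; operands are 2n = 16 bits wide

def combine_low_weight_column(s_hi, s_lo):
    # Column loaded with the LOWER n bits of the weight: shift the result
    # for the input's upper n bits left by n, then add.
    return (s_hi << n) + s_lo

def combine_high_weight_column(s_hi, s_lo):
    # Column loaded with the UPPER n bits of the weight: additionally shift
    # the combined sum left by n to account for the weight's bit position.
    return ((s_hi << n) + s_lo) << n

# Check against a direct 2n-by-2n multiplication for one weight/input pair.
w, x = 0x1234, 0x5678
w_hi, w_lo = w >> n, w & 0xFF
x_hi, x_lo = x >> n, x & 0xFF
low_col = combine_low_weight_column(x_hi * w_lo, x_lo * w_lo)
high_col = combine_high_weight_column(x_hi * w_hi, x_lo * w_hi)
assert low_col + high_col == w * x
```

Summing the low-weight column's result with the high-weight column's result recovers the exact 2n-bit product, which is what the accumulator chain does across adjacent columns.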
Optionally, the accumulator 22 may further include a filter circuit 225; the connection between the multiply-accumulate result register 221 and the last systolic unit 21 of the corresponding column may be implemented through the filter circuit 225. The filter circuit 225 may be configured to filter out the redundant multiply-accumulate results output by the systolic array according to the stride value of the convolution operation; the results that are not filtered out are sent by the filter circuit 225 to the multiply-accumulate result register 221.
By arranging the filter circuit 225 to filter out redundant data, the correctness of the convolution operation under different stride requirements can be ensured, meeting the stride requirements of different applications and expanding the application range of the device.
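The filtering rule can be pictured as follows; this is a hedged sketch of the idea (with stride s, only every s-th column result lines up with an output position), not the filter circuit's actual implementation.

```python
# Sketch of the filter circuit's role: with stride `stride`, only every
# stride-th multiply-accumulate result corresponds to a real output
# position; the rest are redundant and are dropped before they reach the
# multiply-accumulate result register.
def filter_by_stride(results, stride):
    return [r for i, r in enumerate(results) if i % stride == 0]

# With stride 2, every other result survives.
assert filter_by_stride([10, 11, 12, 13, 14, 15], stride=2) == [10, 12, 14]
# Stride 1 keeps everything.
assert filter_by_stride([10, 11, 12], stride=1) == [10, 11, 12]
```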
Optionally, the accumulator 22 may further include an accumulator result register 226; the first-stage addition circuit 224 may be connected to the previous-stage accumulator 22 through the accumulator result register 226, and the accumulator result register 226 may be used to obtain the result output by the previous-stage accumulator 22 and send it to the first-stage addition circuit 224.
Optionally, the accumulator 22 may further include a sum register 227, connected to the first-stage addition circuit 224, for storing the result output by the first-stage addition circuit 224 and outputting it to the next-stage accumulator 22.
The result output by the previous-stage accumulator 22 and the result output by the first-stage addition circuit 224 can thus be stored in the accumulator result register 226 and the sum register 227, respectively, ensuring that the calculation proceeds smoothly.
Optionally, the accumulator 22 may further include a delay circuit 228. The accumulator result register 226 may be connected to the previous-stage accumulator 22 through the delay circuit 228. The delay circuit 228 may be configured to delay the result output by the previous-stage accumulator 22 by a corresponding number of clock cycles before sending it to the accumulator result register 226; the number of clock cycles of the delay is determined by the dilation value of the convolution operation.
Arranging the delay circuit 228 to delay the result output by the previous-stage accumulator 22 ensures the correctness of the convolution operation under different dilation requirements, meeting the dilation requirements of different applications and expanding the application range of the device.
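A delay of d clock cycles can be modelled as a short shift line; the following is a behavioural toy model of the delay circuit under that assumption, not its hardware implementation.

```python
from collections import deque

# Toy model of the delay circuit: values pushed in emerge `cycles` clock
# cycles later; before anything has propagated through, the output is None.
def make_delay(cycles):
    line = deque([None] * cycles)
    def step(value):
        line.append(value)
        return line.popleft()
    return step

delay = make_delay(2)                       # dilation value of 2 cycles
outputs = [delay(v) for v in [1, 2, 3, 4]]  # feed one value per cycle
assert outputs == [None, None, 1, 2]        # each value appears 2 cycles late
```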
Optionally, the accumulator 22 may further include a second-stage addition circuit 229; the second-stage addition circuit 229 may be connected to the sum register 227, and the result output unit 26 may be connected to the second-stage addition circuit 229.
The second-stage addition circuit 229 of the last-stage accumulator 22 adds the result in the sum register 227 to the intermediate result read from the result storage unit 27 by the result output unit 26, and outputs the sum to the result output unit 26.
When the weight value matrix of the convolution operation is mapped onto the systolic array, N successive accumulators 22 map to the same weight value matrix, where N may equal the width of the weight value matrix. Among these N accumulators 22, the first does not need to receive a result from an accumulator 22 on its left, and the last does not pass the result buffered in its sum register 227 to an accumulator 22 on its right; instead, it only adds, in the second-stage addition circuit 229, the result buffered in the sum register 227 to the intermediate result read back from the result storage unit 27, and outputs the sum to the result output unit 26.
Each stage of accumulator 22 is connected to the result output unit 26, and the width of the weight value matrix determines which stage outputs the accumulated result to the result output unit 26: for example, if the width of the weight value matrix is 3, the third-stage accumulator 22 outputs the accumulated result; if the width is 4, the fourth-stage accumulator 22 does.
When the result output by the second-stage addition circuit 229 of an accumulator 22 is a final result, the result output unit 26 may send it to the output module 3; when it is an intermediate result, the result output unit 26 sends it to the result storage unit 27.
Optionally, the result storage unit 27 may include a plurality of FIFO (First In First Out) storage units, and the result output unit 26 may send each intermediate result to the corresponding FIFO storage unit in the result storage unit 27.
Specifically, each stage of accumulator 22 may correspond to one FIFO storage unit, and each FIFO storage unit can perform read and write operations simultaneously. When the convolution operation is performed, the N FIFO storage units can be divided into groups according to the size of the weight value matrix, with different groups caching the intermediate results of different weight value matrixes.
As described above, the width of the weight value matrix determines which stage of accumulator 22 outputs the accumulated result to the result output unit 26, so the FIFO storage units corresponding to accumulators 22 that do not output results can also be put to use. In this embodiment, a group of FIFO storage units may comprise the FIFO storage units corresponding to the accumulator 22 that outputs the accumulated result to the result output unit 26 and to all accumulators 22 preceding it, and the entire buffer capacity of the group may be used by that outputting accumulator 22.
For example, if the width of the weight value matrix is 3, the third-stage accumulator 22 outputs the accumulated result to the result output unit 26, so the FIFO storage units corresponding to the first- to third-stage accumulators 22 can form one group for caching that accumulated result, effectively utilizing otherwise idle FIFO storage units and improving the storage efficiency of the accumulated results.
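Under the grouping rule just described, and assuming (as an illustration, not a claim about the actual hardware) that each weight value matrix of width k consumes the next k FIFO units in order, the partition can be sketched as:

```python
# Sketch of the assumed FIFO grouping rule: with one FIFO per accumulator
# stage, a weight matrix of width k uses stage k as its output stage and
# pools the FIFOs of stages 1..k as that output's intermediate buffer.
def group_fifos(num_fifos, kernel_widths):
    groups, start = [], 0
    for k in kernel_widths:
        groups.append(list(range(start, start + k)))  # pooled FIFO indices
        start += k
    assert start <= num_fifos, "not enough FIFO units for these matrices"
    return groups

# Two 3-wide weight matrices mapped onto 6 accumulator stages/FIFOs.
assert group_fifos(6, [3, 3]) == [[0, 1, 2], [3, 4, 5]]
```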
In practical applications, when the convolution operation uses fixed-point numbers of length n bits, the vertical addition circuit 223 directly forwards the systolic-unit output result obtained by the multiply-accumulate result register 221 to the first-stage addition circuit 224 for accumulation.
When the convolution operation uses fixed-point numbers of length 2n bits, two consecutive output results of the systolic array need to be accumulated in the vertical addition circuit 223. The first output result received by the multiply-accumulate result register 221 is buffered into the previous multiply-accumulate result register 222 on the next clock cycle. When the multiply-accumulate result register 221 receives the second output result, the values buffered in the two registers are accumulated in the vertical addition circuit 223, and the accumulated result is sent on to the first-stage addition circuit 224, thereby realizing the 2n-bit fixed-point convolution operation.
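This pairing behaviour can be modelled as below. It is a hedged sketch of the flow described above: consecutive column outputs arrive as (lower-part, upper-part) pairs, the first is parked in the previous multiply-accumulate result register, and when the second arrives the vertical addition circuit shifts it left by n bits (it corresponds to the input's upper n bits) and adds the pair.

```python
n = 8  # width of one systolic-unit operand

# Model of the 2n-bit vertical accumulation: combine each consecutive pair
# of column outputs, shifting the second (upper-part) result left by n bits.
def vertical_add_2n(stream):
    combined, previous = [], None
    for result in stream:
        if previous is None:
            previous = result                          # first of pair: buffer
        else:
            combined.append((result << n) + previous)  # second: shift & add
            previous = None
    return combined

# Lower-part result 6 followed by upper-part result 2 -> 2*2^8 + 6 = 518.
assert vertical_add_2n([6, 2, 10, 1]) == [518, 266]
```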
In the data processing apparatus provided in this embodiment, the multiply-accumulate result register 221 and the previous multiply-accumulate result register 222 of the accumulator 22 can store two adjacent output results of the corresponding column of systolic units 21, and the multiply-accumulate result corresponding to a 2n-bit input feature value can be determined from the data stored in these two registers. The convolution operation on 2n-bit input feature values can therefore be realized by sending n bits of the input feature value to the systolic units 21 at a time, without increasing the storage space of the systolic units 21. This balances device cost against calculation efficiency and so has high application value.
Fig. 8 is a schematic diagram illustrating the convolution operation process of n-bit data performed by the data processing apparatus according to the third embodiment of the present invention, where the size of the weight value matrix is 3 x 3. As shown in fig. 8, KhaDb is the b-th value in the a-th row of the input feature value matrix; Kwc is the weight value vector of the c-th column of the weight value matrix, which is deployed to the corresponding column of systolic units at the beginning of the convolution operation; KwcDd is the d-th multiply-accumulate result of the c-th column of the weight value matrix contributing to the output feature values; Bias is the bias value input to the convolution operation; and SxTy is the accumulation result output by the x-th stage accumulator at time Ty.
When the convolution operation starts, the weight value vectors Kwc of the weight value matrix are sent into the systolic array over three clock cycles, and each systolic unit loads the weight value at the corresponding position of the 3 x 3 weight value matrix. After the weights are loaded, the input feature values are fed into the systolic array in the order shown in fig. 8 and are multiply-accumulated with the weight values there; the results output by the systolic array over time are shown in fig. 8.
The results output by the systolic array are sent to the corresponding accumulators for accumulation; the calculation performed by each accumulator at each moment is shown in fig. 8, and the final output feature values are obtained after the third-stage accumulator completes its accumulation.
Through the process shown in fig. 8, the convolution operation on n-bit data is realized. Each input feature value is multiplied by a whole row of weight values, i.e. one input feature value takes part in multiple multiply-accumulate operations, so data reuse is achieved and the data access bandwidth required by the convolution operation is reduced.
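What this flow computes for one output position can be checked against a plain convolution. The sketch below assumes (for illustration) a single 3 x 3 window: each systolic column multiply-accumulates one weight column against the inputs, and the accumulator chain sums the per-column results plus the bias.

```python
# Sketch of the Fig. 8 dataflow for one output position: per-column MACs
# (what each systolic column produces) followed by a cross-column sum plus
# bias (what the accumulator chain produces).
def conv3x3_at(inputs, weights, bias):
    col_macs = [sum(inputs[r][c] * weights[r][c] for r in range(3))
                for c in range(3)]    # one multiply-accumulate per column
    return sum(col_macs) + bias       # accumulator chain + bias

x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
w = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # identity kernel picks the diagonal
assert conv3x3_at(x, w, bias=10) == 1 + 5 + 9 + 10
```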
Fig. 9 is a schematic diagram illustrating the convolution operation process of 2n-bit data performed by the data processing apparatus according to the third embodiment of the present invention, where 2n = 16 and the size of the weight value matrix is 3 x 3. As shown in fig. 9, KhaDb_LSB is the lower n bits of the b-th value in the a-th row of the input feature value matrix, and KhaDb_MSB is its upper n bits. Kwc_LSB is the lower n bits of the c-th column weight value vector of the weight value matrix, and Kwc_MSB is its upper n bits; both are deployed to the corresponding systolic units at the beginning of the convolution operation.
KwcDd_LL is the first part of the d-th multiply-accumulate result of the c-th column of the weight value matrix contributing to the output feature values, obtained by multiply-accumulating the lower n bits of the input feature values with the lower n bits of the weight values; KwcDd_ML is the second part, obtained by multiply-accumulating the upper n bits of the input feature values with the lower n bits of the weight values; KwcDd_LM is the third part, obtained by multiply-accumulating the lower n bits of the input feature values with the upper n bits of the weight values; KwcDd_MM is the fourth part, obtained by multiply-accumulating the upper n bits of the input feature values with the upper n bits of the weight values; Bias is the bias value input to the convolution operation; and SxTy is the accumulation result output by the x-th stage accumulator at time Ty.
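The four parts LL, ML, LM and MM are exactly the four partial products of a 2n-by-2n multiplication written in n-bit halves. The following illustration (an arithmetic check, not the patent's circuit) shows that summing them with the appropriate left shifts recovers the full product:

```python
# Each 2n-bit operand v can be written as hi*2^n + lo; the full product is
# then the sum of four n-by-n partial products with matching left shifts.
n = 8

def split(v):
    return v >> n, v & ((1 << n) - 1)   # (upper n bits, lower n bits)

w, x = 0xBEEF, 0x1234                    # two 16-bit (2n-bit) operands
w_hi, w_lo = split(w)
x_hi, x_lo = split(x)

product = ((x_lo * w_lo)                 # LL part: no shift
           + ((x_hi * w_lo) << n)        # ML part: shifted left by n
           + ((x_lo * w_hi) << n)        # LM part: shifted left by n
           + ((x_hi * w_hi) << 2 * n))   # MM part: shifted left by 2n
assert product == w * x
```

The shifts in the vertical addition circuit and in the upper-weight accumulator correspond exactly to these `<< n` and `<< 2n` factors.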
When the convolution operation starts, the upper-n-bit and lower-n-bit vectors of the weight values in the weight value matrix, Kwc_MSB and Kwc_LSB, are sent into the systolic array over three clock cycles, and each systolic unit loads the corresponding n bits of the weight value at its position. After the weights are loaded, the input feature values are fed into the systolic array in the order shown in fig. 9 and are multiply-accumulated with the weight values there; the results output by the systolic array over time are shown in fig. 9.
The results output by the systolic array are fed to the corresponding accumulators for further accumulation; the calculation performed by each accumulator at each moment is shown in fig. 9. The vertical addition circuit of each accumulator shifts the second of the two output results left by n bits before accumulating them. An accumulator whose column holds the upper n bits of the weight values additionally shifts the sum of the two output results left by n bits. Each accumulator passes an accumulation result to the next accumulator every two clock cycles, and the final output feature values are obtained after the last-stage accumulator completes its accumulation.
In practical applications, the device can support calculation with data of both lengths. Performing the convolution operation with n-bit data provides higher concurrency; performing it with 2n-bit data effectively improves network precision.
Note that multiple time axes appear in fig. 8 and fig. 9; each time axis only assists in showing the output order on its own timeline, and T0 on different time axes does not denote the same moment.
Embodiment Four
The fourth embodiment of the invention provides a data processing apparatus. In this embodiment, on the basis of the technical solutions provided in the foregoing embodiments, a memory is added to store data. Fig. 10 is a schematic structural diagram of a data processing apparatus according to the fourth embodiment of the present invention. As shown in fig. 10, the data processing apparatus in this embodiment may include:
an input module for acquiring an n-bit or 2n-bit weight value matrix and an n-bit or 2n-bit input feature value matrix; the input module specifically includes a weight value loading module 11 for acquiring the n-bit or 2n-bit weight value matrix and an input feature value loading module 12 for acquiring the n-bit or 2n-bit input feature value matrix;
the calculation module 2 is configured to perform convolution operation on the input eigenvalue matrix and the weight value matrix to obtain an output eigenvalue matrix;
the output module 3 is used for outputting the output characteristic value matrix;
a memory 4 for storing at least one of: the system comprises an input eigenvalue matrix, an output eigenvalue matrix and a weight value matrix.
Optionally, the memory 4 may be a Static Random-Access Memory (SRAM). The weight value loading module 11 may be connected to the memory 4, read weight values from the memory 4, and send them to the calculation module 2 in a specific format. The input characteristic value loading module 12 may read input characteristic values from the memory 4 and send them to the calculation module 2 for the convolution operation.
The calculation module 2 can output one output characteristic value of the output characteristic value matrix per clock cycle, and the output module 3 writes the output characteristic values into the memory 4. Optionally, there may be format requirements when the output characteristic values are stored in the memory 4; for example, the output characteristic values may need to be 32-bit aligned, that is, the start address of the first byte of an output characteristic value is an integer multiple of 32. The output module 3 may assemble the output characteristic values into the corresponding format and send them to the memory 4 for storage.
Optionally, when the length of the data stored in the memory 4 is n bits, the memory 4 may sequentially store m data through n × m bit storage spaces. When the length of the data stored in the memory 4 is 2n bits, the memory 4 can store m data through a storage space of 2n × m bits, and the upper n bits and the lower n bits of each data are adjacently stored; and n and m are both positive integers.
Fig. 11 is a schematic diagram of a storage format when the data processing apparatus stores n-bit data according to a fourth embodiment of the present invention. As shown in fig. 11, each box represents a memory space of n bits, the numbers on the boxes represent the serial numbers of the memory spaces, and the numbers within the boxes represent the serial numbers of the stored data. Fig. 11 shows 2m n-bit storage spaces, and the ith n-bit storage space stores the ith data.
Fig. 12 is a schematic diagram of the storage format when the data processing apparatus according to the fourth embodiment of the present invention stores 2n-bit data. As shown in fig. 12, each box represents an n-bit storage space, the number on a box is the serial number of the storage space, i_LSB within a box denotes the lower n bits of the ith data, and i_MSB denotes its upper n bits. Fig. 12 shows 2m n-bit storage spaces, with the 2i-th space storing the lower n bits of the ith data and the (2i+1)-th space storing the upper n bits of the ith data.
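The two storage formats of fig. 11 and fig. 12 can be sketched in software as follows (an illustrative Python model with a hypothetical function name, not the memory hardware itself): n-bit data occupies one cell per value, while 2n-bit data occupies two adjacent cells, lower half first.

```python
def pack(values, width_bits, n):
    """Lay out values in n-bit storage cells as described for fig. 11/12.

    n-bit data: one cell per value.  2n-bit data: two adjacent cells per
    value, the lower n bits at cell 2i and the upper n bits at cell 2i+1.
    """
    mask = (1 << n) - 1
    cells = []
    for v in values:
        if width_bits == n:
            cells.append(v & mask)
        else:  # width_bits == 2 * n
            cells.append(v & mask)         # i_LSB stored at cell 2i
            cells.append((v >> n) & mask)  # i_MSB stored at cell 2i+1
    return cells

# hypothetical example with n = 8 (8-bit cells)
assert pack([0xAB, 0xCD], 8, 8) == [0xAB, 0xCD]
assert pack([0x1234], 16, 8) == [0x34, 0x12]
```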
The data processing apparatus provided in this embodiment can store, through the memory 4, at least one of the input characteristic value matrix, the output characteristic value matrix, and the weight value matrix. When the length of the stored data is 2n bits, the memory 4 stores m data in a 2n×m-bit storage space, with the upper n bits and the lower n bits of each data stored adjacently. This makes it convenient to feed the characteristic values and weight values into the systolic array in order, improving the efficiency of the convolution operation.
Embodiment Five
The fifth embodiment of the invention provides a data processing method. Fig. 13 is a flowchart illustrating a data processing method according to a fifth embodiment of the present invention. As shown in fig. 13, the data processing method in this embodiment may include:
Step 1301: acquire an input characteristic value matrix and an n-bit or 2n-bit weight value matrix.
Step 1302: perform a convolution operation on the input characteristic value matrix and the n-bit or 2n-bit weight value matrix to obtain an output characteristic value matrix.
Step 1303: output the output characteristic value matrix.
Wherein n is a positive integer.
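The three steps above can be sketched as a plain software reference model (illustrative Python only; the function name is hypothetical, and stride 1 with no padding is assumed, unlike the systolic-array hardware of the embodiments):

```python
def conv2d(feature, weight):
    """Steps 1301-1303 in miniature: slide the weight matrix over the
    input feature matrix and multiply-accumulate at each position."""
    fh, fw = len(feature), len(feature[0])
    kh, kw = len(weight), len(weight[0])
    return [[sum(feature[r + i][c + j] * weight[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(fw - kw + 1)]
            for r in range(fh - kh + 1)]

assert conv2d([[1, 2], [3, 4]], [[1]]) == [[1, 2], [3, 4]]
assert conv2d([[1, 2], [3, 4]], [[1, 0], [0, 1]]) == [[5]]
```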
The data processing method shown in fig. 13 can be implemented based on the apparatuses of the embodiments shown in fig. 1 to 12; for the implementation principles, execution process, and technical effects of this technical solution, refer to the descriptions of those embodiments, which are not repeated here.
In one implementable manner, the weight values in the n-bit weight value matrix are n bits long, and the weight values in the 2n-bit weight value matrix are 2n bits long;
the length of the input eigenvalue in the input eigenvalue matrix is the same as the length of the weight value in the weight value matrix.
In one implementable manner, the method further comprises:
storing data in a matrix, wherein the matrix is at least one of an input eigenvalue matrix, an output eigenvalue matrix and a weighted value matrix;
when the length of the stored data is 2n bits, m data are stored through a storage space with 2n x m bits, and the upper n bits and the lower n bits of each data are adjacently stored; and m is a positive integer.
In an implementable manner, performing convolution operation on the input eigenvalue matrix and the n-bit or 2 n-bit weight value matrix to obtain an output eigenvalue matrix, including:
multiplying and accumulating the weight values of n bits or 2n bits in the weight value matrix and the corresponding input characteristic values;
and calculating an output characteristic value matrix according to the multiply-accumulate result obtained by the multiply-accumulate operation.
In one implementable manner, performing a multiply-accumulate operation on the n-bit or 2n-bit weight values in the weight value matrix and the corresponding input feature values comprises:
loading weight values in a weight value matrix in a systolic array;
performing a multiply-accumulate operation on the weight values loaded in each column of systolic units in the systolic array and the corresponding input characteristic values to obtain a multiply-accumulate result corresponding to each column of weight values;
wherein the weight value is n-bit or 2 n-bit weight value.
In one implementable manner, the length of the weight value that a systolic unit can load is n bits;
when the length of the weight values in the weight value matrix is 2n bits, each column of systolic units loads the high n bits or the low n bits of the weight values in the weight value matrix.
In one implementable manner, the upper n bits and the lower n bits of a column of weight values are loaded into two adjacent columns of systolic units, respectively.
In one implementable manner, when the input eigenvalue length in the input eigenvalue matrix is 2n bits, the input eigenvalue acquired by the systolic unit each time is the upper n bits or the lower n bits of an input eigenvalue in the input eigenvalue matrix.
In one implementable manner, the upper n bits or the lower n bits of the input feature value are passed in sequence from the first column of systolic units to the last column of systolic units.
In an implementation manner, the step of performing multiply-accumulate operation on the weight value loaded by each row of pulse units in the pulse array and the corresponding input characteristic value to obtain a multiply-accumulate result corresponding to each row of weight values includes:
and sequentially transmitting the input eigenvalues in the input eigenvalue matrix to the right in the pulsation array, and performing multiplication and accumulation operation on the loaded weight values and the transmitted input eigenvalues through each row of pulsation units to obtain a multiplication and accumulation result corresponding to each row of weight values.
In one implementable manner, loading the weight values of the weight value matrix into the systolic array comprises:
in the shift phase of the weight value loading stage, for each column of systolic units, sequentially feeding the weight values to be loaded by that column into the systolic array through the first systolic unit of the column, the received weight values being passed downward in turn from the first systolic unit;
in the load phase of the weight value loading stage, storing the corresponding weight value in each systolic unit of the systolic array.
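The shift phase can be modeled in software as follows (an illustrative Python sketch with hypothetical names; it assumes one weight enters the column per cycle and every unit passes its current value downward each cycle):

```python
def load_column(column_weights, rows):
    """Shift-phase sketch: weights enter through the first systolic unit
    of a column and are pushed down one row per cycle.  Feeding the
    weights in reverse order leaves row i holding column_weights[i]."""
    stored = [None] * rows
    for w in reversed(column_weights):
        stored = [w] + stored[:-1]   # every unit passes its value down
    return stored

assert load_column([1, 2, 3], 3) == [1, 2, 3]
```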
In one implementable manner, obtaining the multiply-accumulate result corresponding to each column of weight values by performing a multiply-accumulate operation on the loaded weight values and the passed-in input characteristic values through each column of systolic units comprises:
each column of systolic units performs the following operations: each systolic unit of the column acquires an input characteristic value, multiplies the acquired input characteristic value by the weight value it has loaded, adds the product to the output of the previous systolic unit, and outputs the sum; the output result of the last systolic unit is the multiply-accumulate result corresponding to the column.
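A single column's behavior reduces to a running multiply-accumulate, which can be sketched as follows (illustrative Python only; the function name is hypothetical):

```python
def column_mac(weights, features):
    """One column of systolic units: each unit multiplies its loaded
    weight by the feature value it receives and adds the product to the
    output of the unit above; the last unit's output is the column's
    multiply-accumulate result."""
    acc = 0
    for w, x in zip(weights, features):
        acc = acc + w * x  # unit output = product + previous unit's output
    return acc

assert column_mac([1, 2, 3], [4, 5, 6]) == 32  # 4 + 10 + 18
```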
In one implementable manner, an accumulator is provided for each column of systolic units; calculating the output characteristic value matrix from the multiply-accumulate results obtained by the multiply-accumulate operations comprises:
acquiring, through each accumulator, the output results of the corresponding column of systolic units, and adding them to the output result of the previous-stage accumulator to obtain the accumulator's output result; and determining the output characteristic value from the output result of the last-stage accumulator.
In one implementable manner, if the input feature value is 2n bits, acquiring the output results of the corresponding column of systolic units comprises:
storing, in one register, the output result corresponding to the upper n bits of the input characteristic value acquired from the systolic unit, and storing, in another register, the output result corresponding to the lower n bits of the input characteristic value;
and obtaining the output result corresponding to the input characteristic value from the output result for its upper n bits and the output result for its lower n bits.
In one implementable manner, if the systolic array column corresponding to the accumulator is loaded with the lower n bits of the weight value, obtaining the output result corresponding to the input feature value from the output results for its upper n bits and lower n bits comprises:
shifting the output result for the upper n bits of the input characteristic value left by n bits and adding the output result for the lower n bits, to obtain the output result corresponding to the input characteristic value.
In one implementable manner, if the systolic array column corresponding to the accumulator is loaded with the upper n bits of the weight value, obtaining the output result corresponding to the input feature value from the output results for its upper n bits and lower n bits comprises:
shifting the output result for the upper n bits of the input characteristic value left by n bits, adding the output result for the lower n bits, and shifting the resulting sum left by n bits, to obtain the output result corresponding to the input characteristic value.
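The two recombination rules above differ only in one extra left shift, which the following sketch makes explicit (illustrative Python; the function name and the "low"/"high" parameter are hypothetical):

```python
def combine_halves(out_hi, out_lo, n, weight_half):
    """Recombine the outputs for the upper and lower n bits of a 2n-bit
    input feature value.  A column loaded with the weight's lower n bits
    yields (out_hi << n) + out_lo; a column loaded with the weight's
    upper n bits shifts that sum left by n once more."""
    s = (out_hi << n) + out_lo
    return s << n if weight_half == "high" else s

assert combine_halves(0x12, 0x34, 8, "low") == 0x1234
assert combine_halves(0x12, 0x34, 8, "high") == 0x123400
```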
In one implementable manner, the output result for the upper n bits and the output result for the lower n bits of the input feature value are two adjacent output results obtained from the systolic unit.
In one implementable manner, storing the output result corresponding to the upper n bits of the input feature value acquired from the systolic unit in one register, and storing the output result corresponding to the lower n bits of the input feature value in another register, comprises:
acquiring output results from the systolic unit through the multiply-accumulate result register, and every other clock cycle sending the output result corresponding to the lower n bits from the multiply-accumulate result register to the previous multiply-accumulate result register, so that the multiply-accumulate result register holds the output result for the upper n bits and the previous multiply-accumulate result register holds the output result for the lower n bits;
or acquiring output results from the systolic unit through the multiply-accumulate result register, and every other clock cycle sending the output result corresponding to the upper n bits from the multiply-accumulate result register to the previous multiply-accumulate result register, so that the multiply-accumulate result register holds the output result for the lower n bits and the previous multiply-accumulate result register holds the output result for the upper n bits.
In one implementable manner, the method further comprises:
filtering out redundant multiply-accumulate results output by the systolic array according to the stride value of the convolution operation.
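The filtering step can be sketched as follows (illustrative Python only; the function name is hypothetical, and it assumes the systolic array emits one multiply-accumulate result per cycle of which only every stride-th belongs to the output feature map):

```python
def filter_by_stride(mac_results, stride):
    """Filter-circuit sketch: keep only every `stride`-th
    multiply-accumulate result and drop the redundant ones."""
    return mac_results[::stride]

assert filter_by_stride([10, 11, 12, 13, 14, 15], 2) == [10, 12, 14]
```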
In one implementable manner, acquiring the output results of the corresponding column of systolic units and adding them to the output result of the previous-stage accumulator comprises:
delaying the result output by the previous-stage accumulator by the corresponding number of clock cycles according to the dilation value of the convolution operation, and then adding it to the output result acquired from the corresponding column of systolic units.
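The delay-and-add behavior can be modeled with a simple delay line (illustrative Python only; the function name is hypothetical and a zero-initialized delay line is assumed):

```python
from collections import deque

def delayed_add(prev_results, col_results, dilation):
    """Delay the previous-stage accumulator's outputs by `dilation`
    clock cycles before adding them to this column's results (a sketch
    of the delay circuit used for dilated convolution)."""
    line = deque([0] * dilation)  # delay line, initially empty (zeros)
    out = []
    for prev, cur in zip(prev_results, col_results):
        line.append(prev)
        out.append(line.popleft() + cur)  # value delayed by `dilation` cycles
    return out

assert delayed_add([1, 2, 3], [10, 20, 30], 1) == [10, 21, 32]
```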
In one implementable manner, loading the weight values of the weight value matrix into the systolic array comprises:
if the weight value matrix has more rows than the systolic array, loading a part of the weight values of the weight value matrix into the systolic array each time;
correspondingly, determining the output characteristic value from the accumulation result of the last-stage accumulator comprises:
judging whether an intermediate result of the output characteristic value is stored:
if not, storing the accumulation result of the last-stage accumulator as the intermediate result;
if so, adding the accumulation result of the last-stage accumulator to the stored intermediate result; if the sum is the final result of the output characteristic value, sending the final result to the output module; if the sum is not the final result, updating the stored intermediate result to the sum;
where the intermediate result is the result obtained after a part of the weight values of the weight value matrix have been processed, and the final result is the result obtained after all of the weight values of the weight value matrix have been processed.
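The intermediate-result handling above amounts to a running sum over per-pass partial results, which can be sketched as follows (illustrative Python only; the function name is hypothetical):

```python
def accumulate_tiles(tile_results):
    """When the weight matrix has more rows than the systolic array,
    each pass yields a partial result.  Store the first as the
    intermediate result, add later ones to it, and return the final
    output feature value after the last pass."""
    intermediate = None
    for partial in tile_results:
        if intermediate is None:
            intermediate = partial      # first pass: store as intermediate
        else:
            intermediate += partial     # later passes: add to stored value
    return intermediate                 # final result after all passes

assert accumulate_tiles([5, 7, 9]) == 21
```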
An embodiment of the present invention further provides an electronic device, including the data processing apparatus according to any of the above embodiments. The electronic device may be any device that may use convolution operations, such as a computer, an unmanned aerial vehicle, a handheld device, and the like.
The implementation principle of the electronic device may refer to the relevant description in the embodiments shown in fig. 1 to 12, and the corresponding execution process and technical effect refer to the description in the embodiments shown in fig. 1 to 12, which are not described herein again.
The technical solutions and technical features in the above embodiments may be used alone or in combination as long as they do not conflict, and all such embodiments fall within the scope of the present invention provided they do not exceed the scope recognized by those skilled in the art.
In the embodiments provided by the present invention, it should be understood that the disclosed devices and methods can be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into modules or units is only a logical division, and other divisions are possible in practice; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, devices, or units, and may be electrical, mechanical, or in another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented as a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied as a software product stored in a storage medium, including instructions for causing a computer processor to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes performed by the present specification and drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (48)

1. A data processing apparatus, comprising:
the input module is used for acquiring an input characteristic value matrix and an n-bit or 2 n-bit weight value matrix;
the calculation module is used for carrying out convolution operation on the input eigenvalue matrix and the weight value matrix of n bits or 2n bits to obtain an output eigenvalue matrix;
the output module is used for outputting the output characteristic value matrix;
wherein n is a positive integer.
2. The apparatus of claim 1, wherein the weight values in the n-bit weight value matrix are n bits long; the weight value length in the 2 n-bit weight value matrix is 2n bits;
the length of the input eigenvalue in the input eigenvalue matrix is the same as the length of the weight value in the weight value matrix.
3. The apparatus of claim 1, further comprising: a memory;
the memory is configured to store at least one of: inputting a characteristic value matrix, outputting a characteristic value matrix and a weight value matrix;
when the length of the data stored in the memory is 2n bits, the memory stores m data through a 2n x m bit storage space, and the high n bits and the low n bits of each data are adjacently stored; and m is a positive integer.
4. The apparatus of claim 1, wherein the computing module comprises:
the systolic array is used for performing the multiply-accumulate operation on the n-bit or 2n-bit weight values in the weight value matrix and the corresponding input characteristic values;
and the accumulator array is used for calculating an output characteristic value matrix according to the multiply-accumulate results obtained by the systolic array.
5. The apparatus of claim 4, wherein the computing module further comprises:
and the control unit is used for acquiring the length of the weight values in the weight value matrix and controlling the systolic array and the accumulator array to realize the convolution operation according to the length of the weight values.
6. The apparatus of claim 5, wherein the systolic array comprises: a plurality of columns of systolic units;
each column of systolic units is used for loading weight values, and for multiplying and accumulating the loaded weight values with the corresponding input characteristic values to obtain a multiply-accumulate result corresponding to each column of loaded weight values.
7. The apparatus of claim 6, wherein the weight value loadable by the systolic unit is n bits in length;
when the length of the weight values in the weight value matrix is 2n bits, each column of systolic units loads the high n bits or the low n bits of the weight values in the weight value matrix.
8. The apparatus of claim 7, wherein the upper n bits and the lower n bits of a column of weight values are loaded in two adjacent columns of systolic units respectively.
9. The apparatus according to claim 6, wherein when the length of the input eigenvalue in the input eigenvalue matrix is 2n bits, the input eigenvalue acquired by the systolic unit each time is n upper bits or n lower bits of the input eigenvalue in the input eigenvalue matrix.
10. The apparatus of claim 9, wherein the upper n bits or the lower n bits of the input eigenvalue are sequentially passed from the first column of systolic units to the last column of systolic units.
11. The apparatus according to claim 6, wherein the control unit is specifically configured to:
in a weight value loading stage, controlling the weight values in the weight value matrix to be sequentially loaded into the systolic units of the systolic array;
in a calculation stage, controlling the input eigenvalues in the input eigenvalue matrix to be sequentially transmitted to the right through the systolic array, and controlling the systolic units to calculate according to the loaded weight values and the transmitted input eigenvalues.
12. The apparatus according to claim 11, wherein in the weight value loading phase, the control unit is specifically configured to:
in a shift phase of the weight value loading stage, for each column of systolic units, sequentially sending the weight values to be loaded by that column into the systolic array through the first systolic unit of the column, the received weight values being transmitted downward in turn from the first systolic unit;
and in a load phase of the weight value loading stage, controlling the systolic units in the systolic array to store the corresponding weight values.
13. The apparatus of claim 6, wherein each column of systolic units comprises a plurality of systolic units;
the systolic unit is used for loading a weight value, acquiring an input characteristic value, multiplying the input characteristic value by the loaded weight value, adding the obtained product to the output of the systolic unit in the previous row, and outputting the sum.
14. The apparatus of claim 13, wherein the systolic unit comprises:
the weight value register is used for storing weight values;
an input feature value register for storing an input feature value;
the multiplication circuit is used for obtaining the product of the weight value and the input characteristic value according to the weight value stored in the weight value register and the input characteristic value stored in the input characteristic value register;
and the addition circuit is used for adding the product obtained by the multiplication circuit to the output of the systolic unit in the previous row.
15. The apparatus of claim 14, wherein the systolic unit further comprises:
the weight value shift register is used for transmitting the weight value to the systolic unit in the next row;
and the input characteristic value shift register is used for transmitting the input characteristic value to the systolic unit in the next column.
16. The apparatus of claim 6, wherein the accumulator array comprises a plurality of accumulators, the number of accumulators and the number of columns of systolic units are both k, and the ith accumulator corresponds to the ith column of systolic units, where k is a natural number greater than 1 and i = 1, 2, …, k;
the accumulator is used for acquiring the output results of the corresponding column of systolic units, adding them to the output result of the previous-stage accumulator, and outputting the sum to the next-stage accumulator.
17. The apparatus of claim 16, wherein when the input eigenvalue is 2n bits, the accumulator is specifically configured to:
storing, in one register, the output result corresponding to the upper n bits of the input characteristic value acquired from the systolic unit, and storing, in another register, the output result corresponding to the lower n bits of the input characteristic value;
and obtaining an output result corresponding to the input characteristic value according to the output result of the high n bits and the output result of the low n bits of the input characteristic value, adding the output result of the input characteristic value and the output result of the previous-stage accumulator, and outputting the added result to the next-stage accumulator.
18. The apparatus of claim 17, wherein if the systolic array corresponding to the accumulator is loaded with n lower bits of the weight value, the accumulator is specifically configured to, when obtaining the output result corresponding to the input feature value according to the output result of the n higher bits and the output result of the n lower bits of the input feature value:
and shifting the output result of the high n bits of the input characteristic value to the left by n bits, and adding the output result of the low n bits to obtain an output result corresponding to the input characteristic value.
19. The apparatus of claim 17, wherein if the systolic array corresponding to the accumulator is loaded with n higher bits of the weight value, the accumulator is specifically configured to, when obtaining the output result corresponding to the input feature value according to the output result of the n higher bits and the output result of the n lower bits of the input feature value:
and left shifting the output result of the high n bits of the input characteristic value by n bits, adding the output result of the low n bits, and left shifting the result obtained by the addition by n bits to obtain the output result corresponding to the input characteristic value.
20. The apparatus according to claim 17, wherein the output result for the upper n bits and the output result for the lower n bits of the input eigenvalue are two adjacent output results obtained from the systolic unit.
21. The apparatus of claim 16, wherein the accumulator comprises:
a multiply-accumulate result register for acquiring the output result of the last systolic unit in the corresponding column;
the pre-multiplication accumulation result register is used for acquiring an output result from the multiplication accumulation result register every other clock period when the input characteristic value is 2n bits;
a vertical adder circuit configured to send an output result of the multiply-accumulate result register to a first-stage adder circuit when the input eigenvalue is n bits, or send a sum of an output result of the multiply-accumulate result register and an output result of the previous multiply-accumulate result register to the first-stage adder circuit when the input eigenvalue is 2n bits;
a first-stage addition circuit for adding a result output from the vertical addition circuit to a result output from the previous-stage accumulator.
22. The apparatus of claim 21, wherein the accumulator further comprises: a filter circuit;
and the filter circuit is used for filtering out redundant multiply-accumulate results output by the systolic array according to the stride value of the convolution operation.
23. The apparatus of claim 21, wherein the accumulator further comprises: an accumulator result register;
the accumulator result register is used for acquiring the result output by the previous-stage accumulator and sending the result to the first-stage addition circuit.
24. The apparatus of claim 23, wherein the accumulator further comprises: a delay circuit;
and the delay circuit is used for delaying the result output by the previous-stage accumulator by the corresponding number of clock cycles according to the dilation value of the convolution operation before transmitting it to the accumulator result register.
25. The apparatus of claim 21, wherein the accumulator further comprises:
and the sum register is used for storing the result output by the first-stage addition circuit and outputting the result to the next-stage accumulator.
26. The apparatus of claim 25, further comprising: a result output unit and a result storage unit; the accumulator further comprises: a second-stage addition circuit;
when the number of rows of the weight value matrix is larger than that of the systolic array, the systolic array loads a part of the weight values in the weight value matrix each time; the result storage unit is used for storing an intermediate result, wherein the intermediate result is the result corresponding to a part of the weight values in the weight value matrix after operation;
the second-stage addition circuit of the last-stage accumulator is used for adding the result in the sum register and the intermediate result read by the result output unit from the result storage unit and outputting the result to the result output unit;
the result output unit is used for sending the result obtained from the second-stage addition circuit to an output module when the result output by the second-stage addition circuit is a final result; when the result output by the second-stage addition circuit is an intermediate result, the obtained result is sent to a result storage unit;
and the final result is a result corresponding to all the weighted values in the weighted value matrix after operation.
27. An electronic device, characterized in that it comprises a data processing apparatus according to any one of claims 1-26.
28. A data processing method, comprising:
acquiring an input characteristic value matrix and an n-bit or 2 n-bit weight value matrix;
carrying out convolution operation on the input eigenvalue matrix and the weighted value matrix of n bits or 2n bits to obtain an output eigenvalue matrix;
outputting the output eigenvalue matrix;
wherein n is a positive integer.
29. The method of claim 28, wherein the weight values in the n-bit weight value matrix are n bits long; the weight value length in the 2 n-bit weight value matrix is 2n bits;
the length of the input eigenvalue in the input eigenvalue matrix is the same as the length of the weight value in the weight value matrix.
30. The method of claim 28, further comprising:
storing data of a matrix, wherein the matrix is at least one of the input feature value matrix, the output feature value matrix, and the weight value matrix;
when the length of the stored data is 2n bits, m data are stored in a storage space of 2n×m bits, with the upper n bits and the lower n bits of each datum stored adjacently; and m is a positive integer.
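Illustratively (this sketch is not part of the claims, and the function names and the choice of n = 8 below are hypothetical), the adjacent storage of the upper and lower n-bit halves described in claim 30 can be modeled as:

```python
def store_2n_bit(values, n):
    """Pack each 2n-bit value as two adjacent n-bit words:
    [hi_0, lo_0, hi_1, lo_1, ...], occupying 2n*m bits for m values."""
    mask = (1 << n) - 1
    words = []
    for v in values:
        words.append((v >> n) & mask)  # upper n bits
        words.append(v & mask)         # lower n bits, stored adjacently
    return words

def load_2n_bit(words, n):
    """Reassemble each adjacent (hi, lo) pair into one 2n-bit value."""
    return [(words[i] << n) | words[i + 1] for i in range(0, len(words), 2)]
```

For example, with n = 8, `store_2n_bit([0x1234], 8)` yields `[0x12, 0x34]`, and `load_2n_bit` inverts it.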
31. The method of claim 28, wherein convolving the input feature value matrix with the n-bit or 2n-bit weight value matrix to obtain the output feature value matrix comprises:
performing multiply-accumulate operations on the n-bit or 2n-bit weight values of the weight value matrix and the corresponding input feature values;
and calculating the output feature value matrix from the multiply-accumulate results of those operations.
32. The method of claim 31, wherein multiplying and accumulating the n-bit or 2n-bit weight values of the weight value matrix with the corresponding input feature values comprises:
loading the weight values of the weight value matrix into a systolic array;
performing a multiply-accumulate operation on the weight values loaded in each row of systolic units of the systolic array and the corresponding input feature values, to obtain the multiply-accumulate result corresponding to each row of weight values;
wherein each weight value is an n-bit or 2n-bit weight value.
33. The method of claim 32, wherein the weight values a systolic unit can load are n bits long;
when the weight values in the weight value matrix are 2n bits long, each row of systolic units loads the upper n bits or the lower n bits of the weight values of the weight value matrix.
34. The method of claim 33, wherein the upper n bits and the lower n bits of a row of weight values are loaded into two adjacent rows of systolic units, respectively.
35. The method of claim 32, wherein, when the input feature values in the input feature value matrix are 2n bits long, the input feature value acquired by a systolic unit each time is the upper n bits or the lower n bits of an input feature value of the input feature value matrix.
36. The method of claim 35, wherein the upper n bits or the lower n bits of the input feature value are passed sequentially from the first column of systolic units to the last column of systolic units.
37. The method of claim 32, wherein performing a multiply-accumulate operation on the weight values loaded in each row of systolic units of the systolic array and the corresponding input feature values to obtain the multiply-accumulate result corresponding to each row of weight values comprises:
sequentially transmitting the input feature values of the input feature value matrix rightward through the systolic array, and performing the multiply-accumulate operation on the loaded weight values and the transmitted input feature values in each row of systolic units, to obtain the multiply-accumulate result corresponding to each row of weight values.
38. The method of claim 32, wherein loading the weight values of the weight value matrix into the systolic array comprises:
in the shifting phase of the weight value loading stage, for each column of systolic units, sequentially feeding the weight values to be loaded by that column into the systolic array through the first systolic unit of the column, each received weight value being passed downward in turn from the first systolic unit;
in the loading phase of the weight value loading stage, storing the corresponding weight value in each systolic unit of the systolic array.
39. The method of claim 37, wherein performing the multiply-accumulate operation on the loaded weight values and the transmitted input feature values to obtain the multiply-accumulate result corresponding to each row of weight values comprises:
performing the following operations in each column of systolic units: acquiring an input feature value in each systolic unit of the column, multiplying the acquired input feature value by the weight value loaded in that systolic unit, adding the product to the output of the preceding systolic unit, and outputting the sum; the output of the last systolic unit of the column is the multiply-accumulate result corresponding to that column.
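Functionally (an illustrative sketch, not the claimed hardware; the function name is hypothetical), each column computes a chained multiply-accumulate in which every unit adds its own product to its predecessor's output:

```python
def column_mac(weights, features):
    """Model of one column of systolic units: each unit multiplies its
    loaded weight by the feature value it receives and adds the product
    to the output of the preceding unit; the last unit's output is the
    column's multiply-accumulate result."""
    acc = 0  # the unit before the first contributes zero
    for w, x in zip(weights, features):
        acc = acc + w * x
    return acc
```

For example, `column_mac([1, 2, 3], [4, 5, 6])` returns `1*4 + 2*5 + 3*6 = 32`.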
40. The method of claim 32, wherein an accumulator is provided for each column of systolic units, and calculating the output feature value matrix from the multiply-accumulate results comprises:
acquiring, in each accumulator, the output results of the corresponding column of systolic units and adding them to the output results of the previous-stage accumulator to obtain the output results of that accumulator; and determining an output feature value from the output result of the last-stage accumulator.
41. The method of claim 40, wherein, if the input feature value is 2n bits, acquiring the output results of the corresponding column of systolic units comprises:
storing the output result corresponding to the upper n bits of the input feature value acquired from the systolic units in one register, and storing the output result corresponding to the lower n bits of the input feature value in another register;
and obtaining the output result corresponding to the input feature value from the output result of its upper n bits and the output result of its lower n bits.
42. The method of claim 41, wherein, if the systolic array corresponding to the accumulator is loaded with the lower n bits of the weight value, obtaining the output result corresponding to the input feature value from the output results of its upper n bits and lower n bits comprises:
shifting the output result of the upper n bits of the input feature value left by n bits and adding the output result of the lower n bits, to obtain the output result corresponding to the input feature value.
43. The method of claim 41, wherein, if the systolic array corresponding to the accumulator is loaded with the upper n bits of the weight value, obtaining the output result corresponding to the input feature value from the output results of its upper n bits and lower n bits comprises:
shifting the output result of the upper n bits of the input feature value left by n bits, adding the output result of the lower n bits, and shifting the sum left by a further n bits, to obtain the output result corresponding to the input feature value.
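The shift-and-add rules of claims 42 and 43 recover a full 2n-bit-by-2n-bit product from n-bit partial products, since x = (x_hi << n) + x_lo and w = (w_hi << n) + w_lo. A minimal arithmetic sketch (not part of the claims; function names are hypothetical, and unsigned operands are assumed):

```python
def combine_lower_weight(out_hi, out_lo, n):
    # Column loaded with the lower n bits of the weight (claim 42):
    # shift the upper-half result left by n bits, add the lower-half result.
    return (out_hi << n) + out_lo

def combine_upper_weight(out_hi, out_lo, n):
    # Column loaded with the upper n bits of the weight (claim 43):
    # same combination, shifted left by a further n bits.
    return ((out_hi << n) + out_lo) << n

# Summing the two columns' combined results reconstructs x * w:
# x*w = x_hi*w_hi*2^(2n) + (x_hi*w_lo + x_lo*w_hi)*2^n + x_lo*w_lo
```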
44. The method of claim 41, wherein the output result of the upper n bits and the output result of the lower n bits of the input feature value are two adjacent output results acquired from the systolic units.
45. The method of claim 41, wherein storing the output result corresponding to the upper n bits of the input feature value acquired from the systolic units in one register and storing the output result corresponding to the lower n bits in another register comprises:
acquiring the output results from the systolic units through a multiply-accumulate result register, and every other clock cycle sending the output result corresponding to the lower n bits from the multiply-accumulate result register to a preceding multiply-accumulate result register, so that the output result corresponding to the upper n bits is stored in the multiply-accumulate result register and the output result corresponding to the lower n bits is stored in the preceding multiply-accumulate result register;
or, acquiring the output results from the systolic units through the multiply-accumulate result register, and every other clock cycle sending the output result corresponding to the upper n bits from the multiply-accumulate result register to the preceding multiply-accumulate result register, so that the output result corresponding to the lower n bits is stored in the multiply-accumulate result register and the output result corresponding to the upper n bits is stored in the preceding multiply-accumulate result register.
46. The method of claim 41, further comprising:
filtering out the redundant multiply-accumulate results output by the systolic array according to the stride value of the convolution operation.
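Illustratively (not part of the claims; the function name is hypothetical), a systolic array that emits one multiply-accumulate result per step produces redundant results when the convolution stride exceeds 1, and claim 46's filtering keeps only every stride-th result:

```python
def filter_by_stride(mac_results, stride):
    """Keep only the results at the convolution's stride positions;
    the intervening results are redundant systolic-array outputs."""
    return mac_results[::stride]
```

With stride 2, `filter_by_stride([10, 11, 12, 13, 14], 2)` keeps `[10, 12, 14]`.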
47. The method of claim 41, wherein acquiring the output results of the corresponding column of systolic units and adding them to the output results of the previous-stage accumulator comprises:
delaying the result output by the previous-stage accumulator by the corresponding number of clock cycles according to the dilation value of the convolution operation, and adding it to the output result acquired from the corresponding column of systolic units.
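A behavioral sketch of this delayed addition (not part of the claims; the function name is hypothetical, and the delay line is assumed to start filled with zeros):

```python
from collections import deque

def delayed_add(prev_stage_outputs, column_outputs, dilation):
    """Delay the previous-stage accumulator's stream of results by
    `dilation` clock cycles before adding it to this column's results."""
    delay_line = deque([0] * dilation)  # models `dilation` cycles of latency
    out = []
    for prev, cur in zip(prev_stage_outputs, column_outputs):
        delay_line.append(prev)
        out.append(delay_line.popleft() + cur)
    return out
```

With a dilation of 0 the streams add directly; with a dilation of 1 each previous-stage result is paired with the column output one cycle later.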
48. The method of claim 40, wherein loading the weight values of the weight value matrix into the systolic array comprises:
if the number of rows of the weight value matrix is greater than the number of rows of the systolic array, loading a portion of the weight values of the weight value matrix into the systolic array each time;
and, correspondingly, determining the output feature value from the accumulation result of the last-stage accumulator comprises:
judging whether an intermediate result of the output feature value is stored:
if not, storing the accumulation result of the last-stage accumulator as the intermediate result;
if so, adding the accumulation result of the last-stage accumulator to the stored intermediate result; if the sum is the final result of the output feature value, sending the final result to an output module; if the sum is not the final result of the output feature value, updating the intermediate result to the sum and storing it;
wherein the intermediate result is the result corresponding to a portion of the weight values of the weight value matrix after operation, and the final result is the result corresponding to all the weight values of the weight value matrix after operation.
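As an illustration of the intermediate/final bookkeeping in claim 48 (not part of the claims; the function name is hypothetical): each loaded portion of the weight value matrix yields one accumulation result from the last-stage accumulator, and these are summed until all portions have been processed:

```python
def accumulate_tiles(partial_results):
    """partial_results: last-stage accumulator results, one per loaded
    portion ("tile") of the weight value matrix."""
    stored = None
    for r in partial_results:
        if stored is None:
            stored = r       # no intermediate result yet: store this one
        else:
            stored += r      # add to the stored intermediate result
    return stored            # after the last tile, this is the final result
```

For example, three partial loads producing 3, 4, and 5 combine into the final result 12.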
CN202080004607.0A 2020-02-25 2020-02-25 Data processing device, electronic equipment and data processing method Pending CN112639836A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/076556 WO2021168644A1 (en) 2020-02-25 2020-02-25 Data processing apparatus, electronic device, and data processing method

Publications (1)

Publication Number Publication Date
CN112639836A true CN112639836A (en) 2021-04-09

Family

ID=75291163

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080004607.0A Pending CN112639836A (en) 2020-02-25 2020-02-25 Data processing device, electronic equipment and data processing method

Country Status (2)

Country Link
CN (1) CN112639836A (en)
WO (1) WO2021168644A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180217962A1 (en) * 2017-02-02 2018-08-02 Fujitsu Limited Operation processing apparatus and operation processing method
CN108885714A (en) * 2017-11-30 2018-11-23 深圳市大疆创新科技有限公司 The control method of computing unit, computing system and computing unit
CN109104876A (en) * 2017-04-20 2018-12-28 上海寒武纪信息科技有限公司 A kind of arithmetic unit and Related product
CN110531955A (en) * 2018-05-23 2019-12-03 倍加科技股份有限公司 Used in the index operation method, computer installation, recording medium of deep neural network
CN110785778A (en) * 2018-08-14 2020-02-11 深圳市大疆创新科技有限公司 Neural network processing device based on pulse array

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423816B (en) * 2017-03-24 2021-10-12 中国科学院计算技术研究所 Multi-calculation-precision neural network processing method and system
US10643297B2 (en) * 2017-05-05 2020-05-05 Intel Corporation Dynamic precision management for integer deep learning primitives
US20200034699A1 (en) * 2018-07-24 2020-01-30 SK Hynix Inc. Accelerating appratus of neural network and operating method thereof
CN110458277B (en) * 2019-04-17 2021-11-16 上海酷芯微电子有限公司 Configurable precision convolution hardware architecture suitable for deep learning hardware accelerator

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113344183A (en) * 2021-06-03 2021-09-03 沐曦集成电路(上海)有限公司 Method for realizing convolution operation in computing system and computing system
CN114237551A (en) * 2021-11-26 2022-03-25 南方科技大学 Multi-precision accelerator based on pulse array and data processing method thereof
CN114237551B (en) * 2021-11-26 2022-11-11 南方科技大学 Multi-precision accelerator based on pulse array and data processing method thereof

Also Published As

Publication number Publication date
WO2021168644A1 (en) 2021-09-02

Similar Documents

Publication Publication Date Title
AU2008202591B2 (en) High speed and efficient matrix multiplication hardware module
US10824934B2 (en) Methods and apparatus for matrix processing in a convolutional neural network
CN116522058A (en) Reconfigurable matrix multiplier system and method
CN108629406B (en) Arithmetic device for convolutional neural network
CN108717571B (en) Acceleration method and device for artificial intelligence
CN110580519B (en) Convolution operation device and method thereof
CN112703511B (en) Operation accelerator and data processing method
CN110705703A (en) Sparse neural network processor based on systolic array
CN111985602A (en) Neural network computing device, method and computing device
CN112639836A (en) Data processing device, electronic equipment and data processing method
US11341400B1 (en) Systems and methods for high-throughput computations in a deep neural network
US20200104669A1 (en) Methods and Apparatus for Constructing Digital Circuits for Performing Matrix Operations
US5021987A (en) Chain-serial matrix multipliers
CN109993293B (en) Deep learning accelerator suitable for heap hourglass network
US5422836A (en) Circuit arrangement for calculating matrix operations in signal processing
WO2021232422A1 (en) Neural network arithmetic device and control method thereof
CN108764182B (en) Optimized acceleration method and device for artificial intelligence
CN111222090B (en) Convolution calculation module, neural network processor, chip and electronic equipment
CN113052299A (en) Neural network memory computing device based on lower communication bound and acceleration method
CN110766136B (en) Compression method of sparse matrix and vector
CN116090518A (en) Feature map processing method and device based on systolic operation array and storage medium
CN115310037A (en) Matrix multiplication computing unit, acceleration unit, computing system and related method
CN116167419A (en) Architecture compatible with N-M sparse transducer accelerator and acceleration method
CN114003198A (en) Inner product processing unit, arbitrary precision calculation device, method, and readable storage medium
US20200371785A1 (en) Computing device and neural network processor incorporating the same

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination