CN110659445B

CN110659445B - Arithmetic device and processing method thereof

Info

Publication number: CN110659445B
Application number: CN201810712308.7A
Authority: CN
Inventors: 谭弘泽; 章隆兵; 李文青; 肖俊华; 王剑
Original assignee: Loongson Technology Corp Ltd
Current assignee: Loongson Technology Corp Ltd
Priority date: 2018-06-29
Filing date: 2018-06-29
Publication date: 2022-12-30
Anticipated expiration: 2038-06-29
Also published as: CN110659445A

Abstract

The embodiment of the invention provides an arithmetic device and a processing method thereof. The arithmetic device comprises a data line, and the data line comprises n first broadcast lines and m second broadcast lines. One end of each first broadcast line is connected with the first register group, the other end is connected with a row of processing units in the processing unit array, and the first broadcast line is used for transmitting first data output by the first register group to that row of processing units. One end of each second broadcast line is connected with the second register group, the other end is connected with a column of processing units in the processing unit array, and the second broadcast line is used for transmitting second data output by the second register group to that column of processing units. Each processing unit in the processing unit array is used for carrying out an operation according to the first data and the second data and outputting an operation result. The embodiment of the invention avoids the overhead caused by the large amount of repeated data otherwise generated during the operation, and thus reduces the operation overhead.

Description

Arithmetic device and processing method thereof

Technical Field

The present invention relates to the field of computer technologies, and in particular, to an arithmetic device and a processing method for the arithmetic device.

Background

With the rapid development of artificial intelligence technology, machine learning methods represented by deep neural networks have found practical application in fields such as computer vision, speech recognition, and the game of Go, and have become a research hotspot.

At present, the wide application of deep learning methods such as Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) brings a larger workload to the data center. For example, in the implementation of a convolutional layer and a fully-connected layer of a deep neural network, convolutional operation is a key operation that is indispensable and takes a lot of time, and needs to occupy a lot of memory access and computing resources.

The prior art generally reformulates the convolution operation purely in software so that it can be accelerated with the techniques used for matrix multiplication. Specifically, because a convolution operation is a series of multiply-accumulate operations, software can convert it into a matrix operation of higher dimensionality, and the convolution can then be completed as a matrix operation.

However, reformulating the convolution operation in software increases the overall dimensionality, causes a large amount of data movement during the reformulation, and greatly increases the data size after the reformulation. Specifically, to express the convolution in matrix form, elements of the convolution operation must be stored repeatedly. The large amount of repeated data generated in this process causes a large number of repeated memory accesses and cache operations during computation, which increases overhead.
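To make the duplication concrete, the following minimal sketch unrolls a one-dimensional convolution into matrix form and counts the stored elements. It is not taken from the patent: the function name `im2col_1d`, the stride of 1, and the example sizes are illustrative assumptions.

```python
def im2col_1d(signal, kernel_size):
    """Unroll a 1-D signal into rows of overlapping windows (assumed stride 1)."""
    return [signal[i:i + kernel_size]
            for i in range(len(signal) - kernel_size + 1)]

signal = [1, 2, 3, 4, 5, 6]
kernel = [1, 0, -1]

windows = im2col_1d(signal, len(kernel))
# A matrix-vector product over the unrolled windows reproduces the
# sliding-window multiply-accumulate (cross-correlation form).
output = [sum(w * k for w, k in zip(row, kernel)) for row in windows]

stored = sum(len(row) for row in windows)  # elements held after unrolling
duplication = stored / len(signal)         # growth factor versus the raw input
```

In this small case six input elements become twelve stored elements, because interior samples appear in several windows. The growth factor approaches the kernel size as the signal gets longer.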

Disclosure of Invention

In view of the above problems, embodiments of the present invention are proposed to provide an arithmetic device and a processing method of the arithmetic device that overcome, or at least partially solve, the above problems and thereby reduce overhead.

In order to solve the above problem, an embodiment of the present invention discloses an arithmetic device, including: a data line comprising n first broadcast lines and m second broadcast lines, where m and n are both natural numbers greater than or equal to 1; and a processing unit array comprising n processing unit rows and m processing unit columns. One end of each first broadcast line is connected with the first register group, the other end is connected with a row of processing units in the processing unit array, and the first broadcast line is used for transmitting first data output by the first register group to that row of processing units, wherein different first broadcast lines are connected to processing unit rows with different row numbers. One end of each second broadcast line is connected with the second register group, the other end is connected with a column of processing units in the processing unit array, and the second broadcast line is used for transmitting second data output by the second register group to that column of processing units, wherein different second broadcast lines are connected to processing unit columns with different column numbers. Each processing unit in the processing unit array is used for carrying out an operation according to the first data and the second data and outputting an operation result.

Optionally, the arithmetic device further includes a controller, and the processing unit includes a multiplier-adder, an accumulation variable register, a result register and a selector. The multiplier-adder is used for receiving the first data and the second data, multiplying the first data and the second data to obtain a product result, and adding the product result and an accumulation variable output by the selector to obtain an accumulation result. The input end of the accumulation variable register is connected with the multiplier-adder and is used for storing the accumulation result output by the multiplier-adder and transmitting the accumulation result to the selector. The selector is connected with the controller and is used, according to a control signal output by the controller, either for transmitting the accumulation result as an accumulation variable to the multiplier-adder or for transmitting the accumulation result as an operation result to the result register. The result register is used for outputting the operation result.

Optionally, the selector includes a first selector and a second selector, and the output end of the accumulation variable register is connected to the first selector and the second selector respectively, and the multiplier-adder includes a multiplier and an adder; the output end of the multiplier is connected with the adder and is used for receiving the first data and the second data, multiplying the first data and the second data and outputting a product result to the adder; the output end of the adder is connected with the accumulation variable register and is used for adding the product result and the accumulation variable output by the first selector to obtain an accumulation result and outputting the accumulation result to the accumulation variable register; the output end of the first selector is connected with the adder and is used for receiving the accumulation result and the zero clearing signal corresponding to the accumulation variable register and outputting an accumulation variable to the adder according to the accumulation result and/or the zero clearing signal; and the output end of the second selector is connected with the result register and is used for receiving the control signal, taking the accumulated result as an operation result according to the control signal and transmitting the operation result to the result register.

Optionally, the processing unit array includes at least one row of processing units, where the row of processing units includes at least two processing units connected in sequence, and the two processing units are divided into a first processing unit and a second processing unit; the output end of the result register of the first processing unit is connected with the second selector of the second processing unit and used for transmitting the operation result output by the first processing unit to the second selector of the second processing unit; and the second selector of the second processing unit is also used for receiving the operation result output by the first processing unit and transmitting the operation result to a result register in the second processing unit.

Optionally, the first register bank includes at least one first cache register, each first cache register including at least one output, and the second register bank includes at least one second cache register, each second cache register including at least one output. Each output end of the first cache register is connected with each processing unit in a row of processing units through the first broadcast line, and is used for caching the received first data and transmitting the first data to each processing unit in the row of processing units through the first broadcast line. Each output end of the second cache register is connected with each processing unit in a column of processing units through the second broadcast line, and is used for caching the received second data and transmitting the second data to each processing unit in the column of processing units through the second broadcast line.

Optionally, the system further comprises a timing adjustment module; the timing adjustment module includes: the first timing adjustment module and/or the second timing adjustment module; the first buffer register is connected with the first timing adjustment module and used for outputting the buffered first data according to the first timing signal output by the first timing adjustment module; the second buffer register is connected with the second timing adjustment module and is used for outputting the buffered second data according to the second timing signal output by the second timing adjustment module.

Optionally, when the first buffer register is a convolution register, the first buffer register is further configured to output the buffered first data according to a first feedback signal output by the processing unit array; and/or when the second buffer register is a convolution register, the second buffer register is further configured to output the buffered second data according to a second feedback signal output by the processing unit array.

The embodiment of the invention also discloses a processing method of an arithmetic device, where the arithmetic device is any one of the arithmetic devices described above, and the method comprises the following steps:

transmitting the first data output by the first register group to each row of processing units in the processing unit array through a first broadcast line in the arithmetic device;

transmitting the second data output by the second register group to each column of processing units in the processing unit array through a second broadcast line;

in each processing unit in the processing unit array, performing operation according to the received first data and second data respectively to obtain an operation result corresponding to each processing unit;

and outputting the operation results corresponding to the processing units.

Optionally, the method further comprises: caching the received first data into the first register group according to the row number of the processing unit array; and caching the received second data into the second register group according to the number of the columns of the processing unit array.

Optionally, the performing, in each processing unit in the processing unit array, an operation according to the received first data and second data to obtain an operation result corresponding to each processing unit includes:

in each processing unit, multiplying received first data and second data by a multiplier-adder to obtain a product result, and adding the product result and an accumulation variable output by a selector to obtain an accumulation result, wherein the accumulation variable is output by the selector according to a first accumulation result stored in an accumulation variable register;

and transmitting the accumulation result to a result register as an operation result according to a control signal output by the controller.

Optionally, the multiplying the received first data and the second data by a multiplier-adder to obtain a product result, and adding the product result to an accumulation variable output by the selector to obtain an accumulation result, includes:

multiplying the received first data and the second data by a multiplier in the multiplier-adder to obtain a corresponding product result;

transmitting the product result to an adder, and triggering the adder to add the product result and the accumulation variable to obtain an accumulation result;

before the accumulated result is transmitted to the result register according to the control signal output by the controller, the method further includes: and transmitting the accumulated result to the accumulation variable register so as to update the first accumulated result stored in the accumulation variable register.

The embodiment of the invention has the following advantages:

The arithmetic device in the embodiment of the invention can transmit the first data output by the first register group to each row of processing units in the processing unit array through the first broadcast lines, and transmit the second data output by the second register group to each column of processing units in the processing unit array through the second broadcast lines. In other words, a register row-column cross broadcast cooperates with the operation of the processing unit array, so that each processing unit in the processing unit array can operate according to the received first data and second data. The convolution operation result is thus produced through an improvement of the hardware device, and the convolution operation does not need to be reformulated in software. This avoids the overhead caused by the large amount of repeated data otherwise generated during the operation, reduces the operation overhead, and improves concurrency and operation efficiency.

Drawings

FIG. 1 is a schematic diagram of an embodiment of an arithmetic device according to the present invention;

FIG. 2 is a schematic diagram of an arithmetic device in accordance with an alternative embodiment of the present invention;

FIG. 3 is a schematic diagram of a processing unit in an alternative embodiment of the invention;

FIG. 4 is a schematic diagram of a processing unit in another alternative embodiment of the invention;

FIG. 5 is a schematic diagram of a computing device according to another alternative embodiment of the invention;

FIG. 6 is a schematic diagram of a computing device according to an alternative embodiment of the invention;

FIG. 7 is a schematic diagram of a computing device according to an alternative embodiment of the present invention;

FIG. 8 is a flowchart illustrating a method of processing a computing device according to an embodiment of the present invention;

FIG. 9 is a schematic diagram of two convolution registers connected in series in one example of the invention;

FIG. 10 is a schematic diagram of the structure of a convolution register in an example of the present invention;

FIG. 11 is a schematic diagram of a computing device according to an exemplary embodiment of the present invention calculating n rows and m columns of submatrices;

FIG. 12 is a diagram showing a result of m convolution operations in which an arithmetic device according to an example of the present invention obtains n convolution kernels.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

One of the core concepts of the embodiments of the present invention is to provide a new computing device, which uses a register row-column cross broadcast manner to cooperate with the computing operation of a processing unit array, so as to avoid the problem of high computing overhead caused by a large amount of repeated data generated during the computing process, reduce the computing overhead, and improve the computing efficiency.

Referring to fig. 1, a schematic structural diagram of an embodiment of an arithmetic device according to the present invention is shown.

In an embodiment of the present invention, the operation device may include: a register set 110, a processing unit array 120, and a data line 130, the register set 110 including a first register set 111 and a second register set 112, the data line 130 including n first broadcast lines 131 and m second broadcast lines 132. Both m and n are natural numbers of 1 or more. The processing unit array comprises n processing unit rows and m processing unit columns.

In a specific implementation, one end of each first broadcast line 131 is connected to the first register set 111, and the other end is connected to a row of processing units in the processing unit array 120; the first broadcast line is configured to transmit the first data output by the first register set 111 to that row of processing units. Different first broadcast lines are connected to processing unit rows with different row numbers. One end of each second broadcast line 132 is connected to the second register set 112, and the other end is connected to a column of processing units in the processing unit array 120; the second broadcast line is configured to transmit the second data output by the second register set 112 to that column of processing units. Different second broadcast lines are connected to processing unit columns with different column numbers. Each processing unit in the processing unit array 120 is configured to perform an operation according to the first data and the second data, and output an operation result.

During operation, the arithmetic device according to the embodiment of the invention may transmit the first data output by the first register group 111 to each row of processing units of the processing unit array through the first broadcast lines 131, that is, broadcast the first data to every processing unit in each row. Likewise, it may transmit the second data output by the second register group 112 to each column of processing units of the processing unit array through the second broadcast lines 132, that is, broadcast the second data to every processing unit in each column. This register row-column cross broadcast cooperates with the operation of the processing unit array 120: each processing unit in the processing unit array 120 operates on the first data and second data it receives. A large amount of repeated data therefore need not be generated during the operation, which avoids the associated overhead, reduces the operation overhead, and improves concurrency and operation efficiency.
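The row-column cross broadcast can be sketched in software as follows. This is a behavioral illustration only, not the patented hardware. It assumes that in each cycle the first register group emits one column of a matrix A (one value per row of the array) and the second register group emits one row of a matrix B (one value per column), and that every processing unit multiply-accumulates the pair it receives. The function names are hypothetical.

```python
def cross_broadcast_step(acc, first_data, second_data):
    """One cycle: row i receives first_data[i], column j receives second_data[j]."""
    for i, a in enumerate(first_data):
        for j, b in enumerate(second_data):
            acc[i][j] += a * b  # the unit at (i, j) multiply-accumulates its pair
    return acc

def matmul_by_broadcast(A, B):
    """Accumulate A @ B on an n x m array, one broadcast cycle per inner index."""
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]
    for t in range(k):
        first = [A[i][t] for i in range(n)]  # column t of A, one value per row
        second = B[t]                        # row t of B, one value per column
        cross_broadcast_step(acc, first, second)
    return acc

product = matmul_by_broadcast([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Under these assumptions each input value is read from a register once per cycle and shared by a whole row or column, so no operand needs to be stored more than once.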

In an alternative embodiment of the present invention, as shown in fig. 2, the computing device may further include a controller 140. Each processing unit in the processing unit array 120 may specifically include: a multiplier-adder 121, an accumulation variable register 122, a result register 123, and a selector 124.

The multiplier-adder 121 may be connected to the output terminal of the first register bank 111 through a first broadcast line 131, and may be connected to the output terminal of the second register bank 112 through a second broadcast line 132, and may be configured to receive the first data and the second data, multiply the first data and the second data to obtain a product result, and add the product result to the accumulation variable output by the selector 124 to obtain an accumulation result.

The input end of the accumulation variable register 122 may be connected to the multiplier-adder 121, and is configured to store the accumulation result output by the multiplier-adder 121, and transmit the accumulation result to the selector 124, so as to transmit the accumulation result as an accumulation variable to the multiplier-adder 121 through the selector 124 for accumulation. Specifically, the selector 124 is connected to the controller 140, and configured to transmit the accumulation result as an accumulation variable to the multiplier-adder 121 according to a control signal output by the controller 140; alternatively, the accumulated result is transmitted to the result register 123 as an operation result. The result register 123 is configured to output the operation result.

As an example of the present invention, when the control signal output by the controller 140 is a preset accumulation valid signal, for example, when the value of the control signal output by the controller 140 is "1", the selector 124 may transmit the accumulation result stored in the accumulation variable register 122 as an accumulation variable to the multiplier-adder 121 based on the accumulation valid signal, so as to trigger the multiplier-adder 121 to add the calculated product result and the accumulation variable to obtain a new accumulation result. The multiplier-adder 121 can then output the new accumulation result to the accumulation variable register 122 to update the accumulation result stored there. When the control signal output by the controller 140 is a preset accumulation end signal, for example, when the value of the control signal output by the controller 140 is "0", the selector 124 may transmit the accumulation result stored in the accumulation variable register 122 as the operation result to the result register 123 based on the accumulation end signal, so that the operation result is output through the result register 123.
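The control-signal behavior in this example can be modeled with a small sketch. The class and method names are hypothetical, and the model assumes that no multiply-add takes place in the cycle that carries the accumulation end signal; the source does not specify this detail.

```python
class ProcessingUnit:
    """Behavioral model of one processing unit (illustrative, not the circuit)."""

    def __init__(self):
        self.acc_register = 0        # accumulation variable register
        self.result_register = None  # result register

    def step(self, first, second, control):
        if control == 1:
            # Accumulation valid: feed the stored result back as the
            # accumulation variable and add the new product to it.
            self.acc_register = first * second + self.acc_register
        else:
            # Accumulation end: latch the stored result as the operation result.
            self.result_register = self.acc_register
        return self.result_register

unit = ProcessingUnit()
for a, b in [(1, 2), (3, 4)]:
    unit.step(a, b, control=1)
final = unit.step(0, 0, control=0)
```

After two accumulation cycles and one end cycle, `final` holds 1*2 + 3*4.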

It can be seen that, in the operation process of the processing unit in the embodiment of the present invention, the accumulated result output by the multiplier-adder 121 may be stored by the accumulated variable register 122, and the accumulated result stored in the accumulated variable register 122 may be used as an accumulated variable by the selector 124 and fed back to the multiplier-adder 121 in the processing unit for accumulation, that is, the update of the accumulated result is implemented in the same processing unit, so that the operation device may simultaneously execute the operation operations of multiple processing units in the processing unit array, thereby improving the concurrency of the operation operations, further implementing the acceleration of operations such as matrix and convolution, and improving the operation efficiency.

In the embodiment of the present invention, the selector 124 may optionally include a first selector and a second selector, and the output terminal of the accumulation variable register 122 may be connected to the first selector and the second selector, respectively. The multiplier-adder 121 may specifically include a multiplier and an adder.

Referring to fig. 3, a schematic diagram of a processing unit in an alternative embodiment of the invention is shown.

As shown in fig. 3, the output terminal of the multiplier 310 is connected to the adder 320, and is configured to receive the first data and the second data, multiply the first data and the second data, and output a multiplication result to the adder 320.

The output of the adder 320 may be connected to the accumulation variable register 122, and is configured to add the product result and the accumulation variable output by the first selector 330 to obtain an accumulation result, and output the accumulation result to the accumulation variable register 122. Thus, the accumulated result may be received by the accumulated variable register 122 for storage by the accumulated variable register 122.

The output end of the first selector 330 may be connected to the adder 320, and is configured to receive the accumulation result and the clear signal corresponding to the accumulation variable register 122, and output an accumulation variable to the adder 320 according to the accumulation result and/or the clear signal. Therefore, the adder 320 can accumulate the multiplication result output by the multiplier 310 and the accumulation variable, i.e. the processing unit can accumulate according to the accumulation result stored in the processing unit during the operation. It should be noted that the clear signal may be generated based on a manual operation of a user, or may be automatically generated by the controller according to a preset operation rule, which is not limited in this embodiment of the present invention.

The output end of the second selector 340 is connected to the result register 123, and is configured to receive the control signal, use the accumulated result as an operation result according to the control signal, and transmit the operation result to the result register 123. Therefore, the operation result can be output through the result register 123 to complete the operation corresponding to a single operation result, and the concurrency and the operation efficiency of the operation are further improved.

In the embodiment of the present invention, the first selector 330 may be a multiplexer, such as a two-to-one selector, in which the clear signal has a higher priority than the accumulation result. Specifically, when the first selector 330 receives the clear signal corresponding to the accumulation variable register 122, it may transmit the clear signal to the accumulation variable register 122, so that the clear signal triggers the accumulation variable register 122 to clear its currently stored value, for example, by setting the value stored in the accumulation variable register 122 to the initial data "0". If the first selector 330 does not receive the clear signal corresponding to the accumulation variable register 122, it may transmit the accumulation result stored in the accumulation variable register 122 to the adder 320. The adder 320 may then add the product result output by the multiplier 310 and the accumulation result stored in the accumulation variable register 122 to obtain a new accumulation result, and output the new accumulation result to the accumulation variable register 122 to update the accumulation result stored there.
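The priority of the clear signal can be illustrated with a minimal sketch. It simplifies the description by modeling the clear as the selector supplying 0 as the accumulation variable, rather than as a separate write to the register. The function names are hypothetical.

```python
def first_selector(accumulation_result, clear):
    """Two-to-one selection: the clear signal takes priority over the stored result."""
    return 0 if clear else accumulation_result

def adder_step(product, accumulation_result, clear=False):
    """Add the product to whatever accumulation variable the selector outputs."""
    return product + first_selector(accumulation_result, clear)

continued = adder_step(6, 10)              # no clear: accumulate onto 10
restarted = adder_step(6, 10, clear=True)  # clear: start a fresh sum from 0
```

With the clear asserted, the stored value 10 is ignored and a new accumulation begins from the current product.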

In an optional embodiment of the present invention, the processing unit array includes at least one row of processing units, and the row includes at least two processing units connected in sequence; for example, it may include 2, 3, or 4 processing units connected in series. Any two directly connected processing units may be designated a first processing unit and a second processing unit, that is, one of the two serves as the first processing unit and the other serves as the second processing unit connected to it. For example, if a row includes 3 processing units connected in sequence, unit 1 of the row serves as the first processing unit relative to unit 2, which serves as the second processing unit; unit 2 in turn serves as the first processing unit relative to unit 3, which serves as the second processing unit.

In a specific implementation, the output end of the first processing unit may be connected to the second selector of the second processing unit, so that the operation result output by the first processing unit is transmitted to the second processing unit, which may receive and output it. As shown in fig. 4, an output end of the result register of the first processing unit 410 may be connected to the second selector of the second processing unit 420, and is configured to transmit the operation result output by the first processing unit 410 to the second selector of the second processing unit 420. The second selector of the second processing unit 420 may be further configured to receive the operation result output by the first processing unit 410, and transmit the operation result to the result register in the second processing unit 420. Therefore, after the processing units in a row have each computed their operation results, the results can be output through the last processing unit in that row.
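The chained result registers behave much like a shift register, which the following sketch illustrates. It assumes that results move one unit per cycle and that the last unit drives the row output; the source does not state the exact timing. The function name is hypothetical.

```python
def shift_out_row(result_registers):
    """Drain one row: the last unit outputs, then every result moves one unit onward."""
    regs = list(result_registers)
    outputs = []
    for _ in range(len(regs)):
        outputs.append(regs[-1])   # the last unit in the row drives the output
        regs = [None] + regs[:-1]  # each result register loads its predecessor's value
    return outputs

drained = shift_out_row([10, 20, 30])
```

Under these assumptions the result nearest the output leaves first, so a row of three units is drained in three cycles through a single output port.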

It should be noted that a row of processing units in the embodiment of the present invention may include at least one first processing unit and at least one second processing unit, where the number of the first processing units may be equal to the number of the second processing units.

As an example of the present invention, when the register sets are disposed on the left side and the upper side of the processing unit array, such as the first register set is disposed on the left side of the processing unit array, and the second register set is disposed on the upper side of the processing unit array, the processing unit of the last column in the processing unit array may be respectively used as the last processing unit in each row of the processing units, that is, the processing unit of the rightmost column in each row may be used as the last processing unit, so that the operation result of each processing unit of each row is respectively output by the last processing unit of each row.

Of course, the first register set may also be disposed at other positions of the processing unit array, for example, when the first register set is disposed at the right side of the processing unit array, the processing unit in the leftmost column of each row may be used as the last processing unit, so as to output the operation result of each processing unit in each row through the leftmost processing unit in each row, and the like, which is not limited in this embodiment of the present invention. In addition, the second register set may also be disposed at other positions of the processing unit array, for example, may be disposed on the upper side of the processing unit array, which is not limited in this embodiment of the present invention.

In a specific implementation, the processing unit array may include n rows and m columns of processing units, where n and m are both positive integers. The number of the output terminals of the first register set may be the same as the number of the rows of the processing unit array, that is, the first register set may have n output terminals, and the first data output by each output terminal may be broadcast to the processing units in each column included in one row of the processing units, that is, the same first data output by one output terminal of the first register set may be transmitted to m processing units belonging to the same row. Correspondingly, the number of the output terminals of the second register set may be the same as the number of the columns of the processing unit array, that is, the second register set may have m output terminals, and the second data output by each output terminal may be broadcast to each row of processing units included in one column of processing units, that is, the same second data output by one output terminal of the second register set may be transmitted to n processing units belonging to the same column.
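The fan-out described above implies a simple ratio between register reads and arithmetic work, sketched below. The calculation assumes that every processing unit performs one multiply-accumulate per cycle. The function name is hypothetical.

```python
def per_cycle_traffic(n, m):
    """Register outputs read per cycle versus multiply-accumulates performed."""
    values_read = n + m   # n first-group outputs plus m second-group outputs
    macs = n * m          # one multiply-accumulate per processing unit
    return values_read, macs, macs / values_read

reads, macs, reuse = per_cycle_traffic(4, 4)
```

For a 4 x 4 array, 8 register outputs feed 16 multiply-accumulates, so each value read is used by several units without being stored again.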

In this embodiment of the present invention, optionally, the first register set includes at least one first buffer register, each first buffer register includes at least one output terminal, the second register set includes at least one second buffer register, and each second buffer register includes at least one output terminal. As shown in fig. 5, each output terminal of the first buffer register 510 may be connected to each processing unit in a row of processing units via a first broadcast line, and the first buffer register 510 is configured to buffer the received first data and transmit the first data to each processing unit in that row via the first broadcast line. Each output terminal of the second buffer register 520 may be connected to each processing unit in a column of processing units via a second broadcast line, and the second buffer register 520 is configured to buffer the received second data and transmit the second data to each processing unit in that column via the second broadcast line.

In particular, the first register set may include one or more first buffer registers, and the first buffer registers may be used to buffer first data to be operated on. Each first buffer register may include one or more output terminals, and the product of the number of output terminals of each first buffer register and the number of first buffer registers included in the first register set may be equal to the number of rows of the processing unit array. The second register set may include one or more second buffer registers, and the second buffer registers may be used to buffer second data to be operated on. Each second buffer register may include one or more output terminals, and the product of the number of output terminals of each second buffer register and the number of second buffer registers included in the second register set may be equal to the number of columns of the processing unit array.
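The sizing relation stated above can be checked with a short Python sketch. The helper names `covers_array` and `locate_output` are hypothetical. The mapping from a row or column index to a (register, output terminal) pair is also only an assumption made for illustration, since the description fixes the product but not the wiring order.

```python
def covers_array(num_registers, outputs_per_register, dimension):
    """True when the register set has exactly one output per row (or column)."""
    return num_registers * outputs_per_register == dimension


def locate_output(index, outputs_per_register):
    """Assumed mapping: consecutive rows (or columns) are served by consecutive outputs.

    Returns (register index, output terminal index), both zero-based.
    """
    return divmod(index, outputs_per_register)


# Two buffer registers with 4 output terminals each serve an 8-row array.
print(covers_array(2, 4, 8))
print(locate_output(5, 4))
```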

The controller 140 may be configured to control the first buffer register 510 to output the first data to be operated on buffered in the first buffer register 510 through the output terminal of the first buffer register 510, and control the second buffer register 520 to output the second data to be operated on buffered in the second buffer register 520 through the output terminal of the second buffer register 520, so that each processing unit in the processing unit array 120 may operate according to the received first data and the received second data.

In an optional embodiment of the present invention, the arithmetic device may further include at least one timing adjustment module. The timing adjustment module may be connected to the register set and is specifically used to adjust delay or timing and to output the timing signal generated after adjustment to the register set, so that the register set can output the first data and/or the second data to each processing unit in the processing unit array according to the timing signal. This avoids the increase in clock period that arises when the data output by the register set depends on a feedback signal from the processing unit array, and thereby avoids a possible reduction in the processing speed of the arithmetic device or an increase in its energy consumption.

Referring to fig. 6, a connection diagram of an arithmetic device in an alternative embodiment of the invention is shown.

In a specific implementation, the timing adjustment module may include: the first timing adjustment module 610 and/or the second timing adjustment module 620. The first buffer register 510 is connected to the first timing adjustment module 610, and is configured to output the buffered first data according to the first timing signal output by the first timing adjustment module 610. The second buffer register 520 is connected to the second timing adjustment module 620, and is configured to output the buffered second data according to the second timing signal output by the second timing adjustment module 620.

Of course, the register set may also output the buffered first data and/or second data to each processing unit in the processing unit array according to the feedback signal output by the processing unit array and in a predetermined execution order, for example, in First-In First-Out (FIFO) queue order. The feedback signal may include a signal that is automatically generated by a processing unit in the processing unit array and fed back to the register set each time the processing unit completes an operation. The feedback signal may be used to trigger the buffer registers in the register set to output the next data to be operated on to the processing unit array, for example, to trigger the first buffer register in the first register set to output the next first data to be operated on, and/or to trigger the second buffer register in the second register set to output the next second data to be operated on, and so on.

In a specific implementation, the feedback signal output by the processing unit array may be divided into a first feedback signal and a second feedback signal. The first feedback signal may be used to trigger a first cache register in the first register set to output a first data to be operated next; the second feedback signal may be used to trigger a second buffer register in the second register set to output second data to be operated next. For example, when receiving a first feedback signal, a first cache register in a first register set may output first data to be operated next according to a preset operation rule; similarly, when receiving the second feedback signal, the second buffer register in the second register group may output the second data to be operated next according to the preset operation rule.
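The feedback-triggered output can be sketched in Python as a simple FIFO model. The class name `FifoBufferRegister` and the method `on_feedback` are hypothetical, and the "preset operation rule" is assumed here to be plain first-in first-out order.

```python
from collections import deque


class FifoBufferRegister:
    """Toy model of a buffer register that releases data on a feedback signal."""

    def __init__(self, data):
        self._queue = deque(data)  # buffered data, oldest first

    def on_feedback(self):
        """Called when the processing unit array signals completion.

        Returns the next datum to be operated on, or None if nothing is buffered.
        """
        return self._queue.popleft() if self._queue else None


first_register = FifoBufferRegister([3, 1, 2])
print(first_register.on_feedback())  # first feedback signal releases the oldest datum
print(first_register.on_feedback())
```

A first feedback signal would drive one such object for the first register set and a second feedback signal another one for the second register set.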

In another optional embodiment of the present invention, when the first buffer register 510 is a convolution register, the first buffer register 510 is further configured to output the buffered first data according to a first feedback signal output by the processing unit array 120; and/or, when the second buffer register is a convolution register, the second buffer register 520 is further configured to output the buffered second data according to the second feedback signal output by the processing unit array.

It should be noted that the convolution register in the embodiment of the present invention has the following functions: in matrix mode, it can be used as an independent First-In First-Out (FIFO) queue for each row or each column; in convolution mode, after data has been buffered, it can use the subsequently input aligned data to generate multiple sets of output data according to the convolution pattern, and so on.

For example, when the convolution register is used as the first buffer register in the embodiment of the present invention, the convolution register may be used as an independent FIFO for each row during a matrix operation, and is configured to buffer each row of first data and output each buffered row of first data to the processing units in the processing unit array; during a convolution operation, one or more groups of first data output to the processing unit array may be generated from the buffered first data according to a convolution operation rule. Taking the case in which 1, 2, 3, 4, 5, 6, 7, 8 is stored in the peripheral memory of the convolution register as an example, after 1, 2, 3, 4 is buffered in the convolution register, the convolution register can output a first group of first data (1, 2, 3, 4); after 5, 6, 7, 8 is buffered in the convolution register, the convolution register may sequentially output a second group of first data (2, 3, 4, 5), a third group of first data (3, 4, 5, 6), a fourth group of first data (4, 5, 6, 7), and a fifth group of first data (5, 6, 7, 8). It can be seen that the convolution register in this example can generate multiple groups of output data with overlapping contents from the buffered data; that is, the last buffer register of the same register set can sequentially receive the 1st, 2nd, 3rd, 4th, ... row inputs of the next input data, and the remaining registers can receive the output data of the next row.
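The numeric example above can be reproduced with a minimal Python sketch. It models only the observable behaviour (overlapping groups generated from buffered data), not the internal register wiring; the function name `convolution_groups` and the parameter `width` are assumptions introduced here.

```python
def convolution_groups(blocks, width=4):
    """For each buffered input block, list the output groups that become available.

    blocks -- successive input blocks written into the convolution register
    width  -- number of output terminals (size of each output group)
    """
    buffered = []   # everything buffered so far
    emitted = 0     # number of groups already output
    per_block = []
    for block in blocks:
        buffered.extend(block)
        new_groups = []
        # A group starting at position `emitted` is ready once `width` values exist.
        while emitted + width <= len(buffered):
            new_groups.append(tuple(buffered[emitted:emitted + width]))
            emitted += 1
        per_block.append(new_groups)
    return per_block


for groups in convolution_groups([[1, 2, 3, 4], [5, 6, 7, 8]]):
    print(groups)
```

Eight stored values yield five output groups, so the repeated elements are produced inside the register instead of being read from the peripheral memory again.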

Of course, when the convolution register is used as the second buffer register in the embodiment of the present invention, the convolution register may be used as an independent FIFO for each column during a matrix operation, and is configured to buffer each column of second data and output each buffered column of second data to the processing units in the processing unit array; during a convolution operation, one or more groups of second data output to the processing unit array can be generated from the buffered second data according to a convolution operation rule. That is, the last buffer register of the same register set can likewise sequentially receive the 1st, 2nd, 3rd, 4th, ... column inputs of the next input data, and the remaining registers can receive the output data of the next column.

Referring to fig. 7, a schematic diagram of a connection of an arithmetic device according to an alternative example of the present invention is shown.

In this example, the first buffer register 510 and the second buffer register 520 may each be a convolution register, and each convolution register may include 4 input terminals and 4 output terminals. As shown in fig. 7, each first buffer register 510 may receive the first data to be operated on through its 4 input terminals, buffer the received first data, and transmit the buffered first data to the processing units in the processing unit array through its 4 output terminals. Similarly, each second buffer register 520 may receive the second data to be operated on through its 4 input terminals, buffer the received second data, and transmit the buffered second data to the processing units in the processing unit array through its 4 output terminals. As can be seen, in this example the first data to be operated on is buffered by the first buffer register and the second data to be operated on is buffered by the second buffer register, so that in each data operation the arithmetic device can trigger the processing units in the processing unit array to execute the operation using the buffered first data and second data, without fetching the data again, thereby improving the data operation efficiency.

Referring to fig. 8, a flow chart of steps of an embodiment of a processing method of an arithmetic device of the present invention is shown. The processing method can be applied to the computing device mentioned in the above embodiment, and specifically includes the following steps:

Step 801, transmitting the first data output by the first register set to each row of processing units in the processing unit array through a first broadcast line in the arithmetic device.

Step 802, transmitting the second data output by the second register set to each column of processing units in the processing unit array through a second broadcast line.

In a specific implementation, the arithmetic device may broadcast the first data output by each output terminal of the first register set to each row of processing units in the processing unit array through the first broadcast line, so that each processing unit in the processing unit array may receive the first data required by its operation. Similarly, the arithmetic device may broadcast the second data output by each output terminal of the second register set to each column of processing units in the processing unit array through the second broadcast line, so that each processing unit in the processing unit array may receive the second data required by its operation.

It should be noted that, during operation, the arithmetic device may execute step 801 and step 802 simultaneously. Step 801 and step 802 may also be executed alternately: for example, step 801 is executed first, then step 802, and then step 801 again; or step 802 is executed first, then step 801, and then step 802 again. This is not specifically limited in this embodiment of the present invention.

Step 803, in each processing unit in the processing unit array, performing an operation according to the received first data and second data, respectively, to obtain an operation result corresponding to each processing unit.

In this embodiment of the present invention, after receiving the first data and the second data, each processing unit in the processing unit array may perform a multiplication operation according to the received first data and second data to obtain a corresponding product result, and may store the calculated product result in its own accumulation variable register, so that a value stored in the accumulation variable register may be subsequently used as an accumulation variable to be accumulated with a product result of the next received first data and second data until a single operation is completed, and then, the value stored in the accumulation variable register may be used as an operation result corresponding to the processing unit.
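Steps 801 to 803 can be sketched in Python as follows. This is an illustrative software model under stated assumptions, not the hardware itself: one list of n first data values and one list of m second data values is assumed to be broadcast per beat, and the function name `run_array` is hypothetical.

```python
def run_array(first_beats, second_beats):
    """Accumulate products in an n x m array over successive broadcast beats.

    first_beats[t]  -- the n first data values broadcast to the rows at beat t
    second_beats[t] -- the m second data values broadcast to the columns at beat t
    """
    n, m = len(first_beats[0]), len(second_beats[0])
    acc = [[0] * m for _ in range(n)]  # one accumulation variable register per unit
    for a_vec, b_vec in zip(first_beats, second_beats):
        for i in range(n):
            for j in range(m):
                # Unit (i, j) multiplies its row datum by its column datum
                # and adds the product to its stored accumulation variable.
                acc[i][j] += a_vec[i] * b_vec[j]
    return acc


print(run_array([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

After the last beat, the value held in each accumulation variable register is the operation result of that processing unit.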

Step 804, outputting the operation results corresponding to the processing units.

Specifically, the arithmetic device may output the operation results of the processing units in each row through the last processing unit of that row of the processing unit array, so as to obtain the operation result corresponding to each processing unit in each row.

In summary, the arithmetic device in the embodiment of the invention may transmit the first data output by the first register set to each processing unit in each row of processing units through the first broadcast line, and may transmit the second data output by the second register set to each processing unit in each column of processing units through the second broadcast line. That is, a cross broadcast of the register sets is realized through the first broadcast line and the second broadcast line. In cooperation with the operation of each processing unit in the processing unit array, each processing unit can receive the first data and the second data simultaneously and can thus be triggered to operate on the received first data and second data. This avoids the operation overhead caused by the large amount of repeated data generated during operation, reduces the operation overhead, and improves concurrency and operation efficiency.

In a specific implementation, the arithmetic device may buffer the received first data in the first register set, that is, in the first buffer register of the first register set, so that the first data need not be fetched again during operation; likewise, the received second data may be buffered in the second register set, that is, in the second buffer register of the second register set, so that the second data need not be fetched again during operation. In an optional embodiment of the present invention, the processing method of the arithmetic device may further include: buffering the received first data into the first register set according to the number of rows of the processing unit array; and buffering the received second data into the second register set according to the number of columns of the processing unit array.

After buffering the first data in the first register set, the arithmetic device in the embodiment of the invention can broadcast the first data buffered in the first register set to each row of processing units in the processing unit array through the first broadcast line. For example, the arithmetic device can broadcast the first data a1 output by the 1st output terminal of the first register set to each processing unit in the first row of the processing unit array, broadcast the first data a2 output by the 2nd output terminal of the first register set to each processing unit in the second row of the processing unit array, broadcast the first data a3 output by the 3rd output terminal of the first register set to each processing unit in the third row of the processing unit array, ..., and so on, and broadcast the first data an output by the nth output terminal of the first register set to each processing unit in the nth row of the processing unit array. Similarly, after the second data is buffered in the second register set, the second data buffered in the second register set may be broadcast to each column of processing units in the processing unit array through the second broadcast line. For example, the second data b1 output by the 1st output terminal of the second register set may be broadcast to each processing unit in the first column of the processing unit array, the second data b2 output by the 2nd output terminal of the second register set may be broadcast to each processing unit in the second column of the processing unit array, the second data b3 output by the 3rd output terminal of the second register set may be broadcast to each processing unit in the third column of the processing unit array, ..., and so on, and the second data bm output by the mth output terminal of the second register set may be broadcast to each processing unit in the mth column of the processing unit array. That is, the first data and the second data may be transmitted to each processing unit in the processing unit array by register cross broadcast, which avoids the large operation overhead otherwise generated during operation and thereby increases efficiency.

In the embodiment of the present invention, after receiving the first data on the first broadcast line and the second data on the second broadcast line, the processing unit may multiply the currently received first data and the second data by the multiplier-adder to obtain a product result, and may accumulate the product result obtained by the calculation to the accumulation variable register for storage. The embodiment of the invention can take the accumulated result stored by the accumulated variable register as the first accumulated result, so that the first accumulated result stored by the accumulated variable register can be taken as the accumulated variable to be added with the product result obtained by the calculation of the multiplier in the following process, thereby realizing the accumulation of each product result obtained by the calculation of the processing unit, and further leading the processing unit to update the accumulated result in real time according to the received first data and the second data.

In an optional embodiment of the present invention, in each processing unit in the processing unit array, performing an operation according to the received first data and second data, respectively, to obtain an operation result corresponding to each processing unit, specifically, the method may include: in each processing unit, multiplying received first data and second data by a multiplier-adder to obtain a product result, and adding the product result and an accumulation variable output by a selector to obtain an accumulation result, wherein the accumulation variable is output by the selector according to a first accumulation result stored in an accumulation variable register; and transmitting the accumulation result to a result register as an operation result according to a control signal output by the controller.

In a specific implementation, the multiplier-adder in the processing unit may include a multiplier and an adder. Optionally, the multiplying the received first data and the second data by the multiplier-adder to obtain a product result, and adding the product result to the accumulation variable output by the selector to obtain an accumulation result, specifically including: multiplying the received first data and the second data by a multiplier in the multiplier-adder to obtain a corresponding product result; and transmitting the product result to an adder, and triggering the adder to add the product result and the accumulation variable to obtain an accumulation result. The adder may output the accumulation result to the accumulation variable register to store the accumulation result as a first accumulation result through the accumulation variable register. Optionally, before transmitting the accumulated result to the result register according to the control signal output by the controller, the method further includes: and transmitting the accumulated result to the accumulation variable register to update the first accumulated result stored in the accumulation variable register.

For example, in the processing unit, after the multiplier receives the first data and the second data, it multiplies the first data by the second data; whether a new value is currently being calculated may be determined by whether the first selector receives a clear signal. If the first selector receives the clear signal, that is, when a new value is being calculated, the first accumulation result stored in the accumulation variable register may be set to the initial value 0 based on the clear signal, the product result obtained by the multiplication may then be added to the initial value 0 to obtain the accumulation result, and the accumulation result may be stored in the accumulation variable register. If the first selector does not receive the clear signal, the first selector can transmit the first accumulation result stored in the accumulation variable register to the adder as the accumulation variable, so that the adder can add the accumulation variable and the product result obtained by the current multiplication to obtain a new accumulation result, and can transmit the new accumulation result to the accumulation variable register. If the new accumulation result obtained by the current calculation is the last result to be accumulated by the processing unit, the first accumulation result stored in the accumulation variable register can be taken as the operation result p according to the control signal output by the controller and transmitted to the result register to wait for output. If the processing unit currently has an operation result p awaiting output, that is, when the result register stores an operation result p awaiting output, the operation result p may be output from the output port c of the processing unit. If the result register in the processing unit does not store an operation result p to be output, that is, the processing unit has no operation result p that needs to be output in the next beat, the operation result p' output by the first processing unit adjacent to the processing unit may be received through the input port c' and saved to the result register to wait for output.
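The behaviour of a single processing unit described above can be sketched in Python. This is a simplified software model under stated assumptions: the class name `ProcessingUnit`, the method names `step` and `beat`, and the use of `None` to mean "no result waiting" are all hypothetical, and the clear signal and the controller's control signal are modelled as boolean arguments.

```python
class ProcessingUnit:
    """Toy model: multiplier, selector, adder, accumulation and result registers."""

    def __init__(self):
        self.accumulator = 0   # accumulation variable register (first accumulation result)
        self.result = None     # result register; None means nothing awaits output

    def step(self, first, second, clear=False, last=False):
        # Selector: a clear signal selects the initial value 0,
        # otherwise the stored first accumulation result is selected.
        accumulation_variable = 0 if clear else self.accumulator
        self.accumulator = first * second + accumulation_variable
        if last:
            # Control signal: move the finished value to the result register.
            self.result = self.accumulator

    def beat(self, incoming=None):
        """One output beat: emit a waiting result (port c), else latch p' (port c')."""
        if self.result is not None:
            out, self.result = self.result, None
            return out
        self.result = incoming
        return None


pu = ProcessingUnit()
pu.step(2, 3, clear=True)   # start a new value: 0 + 2*3
pu.step(4, 5, last=True)    # accumulate 4*5 and mark the value as final
print(pu.beat())            # own result leaves through port c
print(pu.beat(incoming=7))  # nothing waiting, so the neighbour's result is latched
print(pu.beat())            # the latched neighbour result is forwarded
```

Chaining such units along a row lets every result reach the last processing unit of the row, which is how the row output described earlier could be modelled.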

For the case that the multiply-add operation needs to be performed for multiple cycles, the arithmetic device in the embodiment of the present invention may broadcast the first data and the second data cached in a time-sharing manner using a plurality of cache registers included in the register set, so that a plurality of operations may be performed simultaneously on the multiply-add operation pipeline, thereby improving the concurrency of the operations and further improving the operation efficiency. It should be noted that the multiply-add operation pipeline can be divided into a multiply operation pipeline and an add operation pipeline.

In a specific implementation, the register set in the embodiment of the present invention may include buffer registers connected in series. For example, the first register set may include two or more first buffer registers connected in series, so as to transmit one or more first data to the respective rows of processing units through the output terminals of the serially connected first buffer registers; for another example, the second register set may include two or more second buffer registers connected in series, so as to transmit one or more second data to the respective columns of processing units through the output terminals of the serially connected second buffer registers, and so on.

As an example of the present invention, where the first register set includes two convolution registers connected in series laterally, as shown in fig. 9, the lateral output terminal of the second convolution register may be connected to the lateral input terminal of the first convolution register. Of course, when the first register set includes more than two convolution registers connected in series laterally, the lateral output terminal of the third convolution register may be connected to the lateral input terminal of the second convolution register, the lateral output terminal of the fourth convolution register may be connected to the lateral input terminal of the third convolution register, ..., and so on, and the lateral output terminal of the Nth convolution register may be connected to the lateral input terminal of the (N-1)th convolution register, where N may be used to indicate the number of first buffer registers included in the first register set.

Of course, the respective convolution registers included in the register group may be independent of each other. Each convolution register may receive data to be buffered via an input and may transmit the buffered data to a processing unit in the array of processing units via an output. For example, as shown in fig. 10, the convolution register may receive data through 4 inputs and may buffer the received data, and may transmit the buffered data to the processing unit through 4 outputs.

In embodiments of the invention, the arithmetic device may have one or more broadcast domains. For example, in the case of an arithmetic device having X+1 rows and Y+1 columns of broadcast domains, the first row of broadcast domains may include a controller and Y convolution registers, and each subsequent row of broadcast domains may contain a leftmost convolution register and Y single-broadcast-domain processing unit arrays. If the uppermost broadcast domain is labeled as row 1 and the leftmost broadcast domain is labeled as column 1, the input of the leftmost data of the broadcast domains in row M can be delayed by M beats, that is, the control flow can be allowed M beats to propagate from the controller. A datum input at row M, column 1 then takes N-1 beats to reach the broadcast domain at row M, column N. Similarly, the input of the data at the top of the broadcast domains in column N may be delayed by N beats, that is, the control flow may be allowed N beats to propagate from the controller, and a datum input at row 1, column N takes M-1 beats to reach the broadcast domain at row M, column N. It can be seen that, in the broadcast domain at any row M and column N, the horizontally propagated data and the vertically propagated data always arrive together at beat M+N-1, without either having to wait for the other. Even if circuit delay makes it difficult to complete a broadcast within one beat, the embodiment of the invention can divide the broadcast domains by introducing delay registers, control the delay precisely, and complete the broadcast over multiple beats without introducing unnecessary waiting that would cause throughput loss, thereby reducing the operation overhead and improving the operation efficiency.
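The beat arithmetic above can be checked with a short Python sketch. The function names are hypothetical, and the model simply encodes the delays stated in the text (an initial delay equal to the row or column label, then one beat per broadcast domain crossed).

```python
def horizontal_arrival(row, col):
    """Beat at which row data reaches domain (row, col).

    The leftmost input of row M is delayed by M beats and then needs
    N - 1 further beats to travel to column N.
    """
    return row + (col - 1)


def vertical_arrival(row, col):
    """Beat at which column data reaches domain (row, col).

    The top input of column N is delayed by N beats and then needs
    M - 1 further beats to travel to row M.
    """
    return col + (row - 1)


# Both data streams meet at beat M + N - 1 in every broadcast domain.
for M in range(1, 4):
    print([(horizontal_arrival(M, N), vertical_arrival(M, N)) for N in range(1, 4)])
```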

For convenience of understanding, the processing method of the arithmetic device according to the embodiment of the present invention is described below with reference to specific examples.

As an example of the present invention, for a matrix operation, the arithmetic device may operate on the first input matrix Aik and the second input matrix Bkj based on the row number n and the column number m of the processing unit array to obtain the result output matrix Cij, that is, Cij = Σ Aik × Bkj, where the sum is taken over k. The row number i of the first input matrix Aik may range from 0 to n-1; the column number j of the second input matrix Bkj may range from 0 to m-1; k may be used to characterize the column number of the first input matrix Aik and the row number of the second input matrix Bkj, and may range from 0 to n8-1. It should be noted that a matrix element a in the first input matrix Aik may be used as the first data in the embodiment of the present invention, and a matrix element b in the second input matrix Bkj may be used as the second data in the embodiment of the present invention. n8 may be used to characterize the number of calculations performed on the matrix elements a of the first input matrix Aik and the matrix elements b of the second input matrix Bkj in each operation.

When the number of rows of the first input matrix Aik exceeds the number of rows n of the processing unit array, the first input matrix Aik may be divided according to the number of rows n of the processing unit array. For example, the matrix elements of the 1st row to the nth row of the first input matrix Aik may be divided into a first row input sub-matrix, the matrix elements of the (n+1)th row to the 2nth row may be divided into a second row input sub-matrix, the matrix elements of the (2n+1)th row to the 3nth row may be divided into a third row input sub-matrix, ..., and so on, until the first input matrix Aik has been fully divided. Subsequently, based on a row input sub-matrix, n rows of first data may be input to the first register set according to the row number n of the processing unit array, so that the first register set can buffer the n rows of first data simultaneously, and the n rows of first data buffered by the first register set may be transmitted to the n rows of processing units in the processing unit array through the first broadcast line, so that the processing unit array can operate on the n rows of first data simultaneously. Of course, when the number of rows of the first input matrix Aik is less than or equal to the number of rows n of the processing unit array, each row of first data included in the first input matrix Aik may be buffered directly in the first register set and transmitted to the processing unit array through the first register set and the first broadcast line. For example, the 1st row of first data may be transmitted to the first row of processing units of the processing unit array, the 2nd row of first data may be transmitted to the second row of processing units, the 3rd row of first data may be transmitted to the third row of processing units, ..., and so on, and the ith row of first data may be transmitted to the ith row of processing units, so that the processing unit array can be triggered to operate simultaneously on every row of first data included in the first input matrix Aik.

Similarly, when the number of columns of the second input matrix Bkj exceeds the number m of columns of the processing unit array, the second input matrix Bkj may be divided according to the number m of columns of the processing unit array. For example, the matrix elements of the 1st column to the mth column of the second input matrix Bkj may be divided into a first column input sub-matrix, the matrix elements of the (m+1)th column to the 2mth column may be divided into a second column input sub-matrix, the matrix elements of the (2m+1)th column to the 3mth column may be divided into a third column input sub-matrix, ..., and so on, until the second input matrix Bkj has been fully divided.

Subsequently, based on the column input sub-matrices, m columns of second data may be input to the second register group according to the number m of columns of the processing unit array, so that the second register group may simultaneously buffer the m columns of second data, and the m columns of second data buffered by the second register group may be respectively transmitted to the m columns of processing units in the processing unit array through the second broadcast lines, so that the processing unit array may simultaneously perform operations on the m columns of second data. Of course, when the number of columns of the second input matrix Bkj is less than or equal to the number m of columns of the processing unit array, each column of second data included in the second input matrix Bkj may be directly cached in the second register set and transmitted to the processing unit array through the second register set and the second broadcast lines. For example, the 1st column of second data may be transmitted to the first column of processing units of the processing unit array, the 2nd column of second data may be transmitted to the second column of processing units, the 3rd column of second data may be transmitted to the third column of processing units, and so on.
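
The partitioning described above can be sketched in a few lines of Python. This is only an illustrative sketch: the helper names `split_rows` and `split_cols` and the small matrices are hypothetical and do not come from the embodiment, and the last block is simply allowed to be smaller when the dimension is not a multiple of n or m.

```python
def split_rows(matrix, n):
    """Split a matrix (list of rows) into row blocks of at most n rows each."""
    return [matrix[r:r + n] for r in range(0, len(matrix), n)]


def split_cols(matrix, m):
    """Split a matrix (list of rows) into column blocks of at most m columns each."""
    width = len(matrix[0])
    return [[row[c:c + m] for row in matrix] for c in range(0, width, m)]


# Hypothetical 5 x 2 first input matrix and 2 x 5 second input matrix,
# partitioned for a 2 x 2 processing unit array (n = 2, m = 2).
A = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
B = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]

row_blocks = split_rows(A, 2)   # three row input sub-matrices (2, 2 and 1 rows)
col_blocks = split_cols(B, 2)   # three column input sub-matrices (2, 2 and 1 columns)
print(len(row_blocks), len(col_blocks))  # 3 3
```

Each pair of one row block and one column block then corresponds to one sub-matrix of the result output matrix.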

In a specific implementation, when the number i of rows of the first input matrix Aik exceeds the number n of rows of the processing unit array, and the number j of columns of the second input matrix Bkj exceeds the number m of columns of the processing unit array, each operation pass of the arithmetic device in this example may complete the operations corresponding to one sub-matrix of n rows and m columns in the result output matrix Cij.

After the calculation of the sub-matrix with Ci1j1 as the starting point in the result output matrix Cij is started, the current calculation count i8 may be set to 0, and whether data needs to be written into the first register group and the second register group may be determined by judging whether the current calculation count i8 is smaller than the preset calculation count n8. Ci1j1 may be used to represent the operation result of the i1-th row and the j1-th column in the result output matrix Cij. If the current calculation count i8 is less than the preset calculation count n8, n rows of first data starting from row i1 in the i8-th column of the first input matrix Aik may be written into the first buffer registers in the first register group, and m columns of second data starting from column j1 in the i8-th row of the second input matrix Bkj may be written into the second buffer registers in the second register group. Then, the content cached by each row of first buffer registers in the first register set may be broadcast to each row of processing units in the processing unit array through the first broadcast lines, that is, each first data cached in the first register set is broadcast to the m processing units in the corresponding row of processing units; at the same time, the content cached by each column of second buffer registers in the second register set may be broadcast to each column of processing units in the processing unit array through the second broadcast lines, that is, each second data cached in the second register set is broadcast to the n processing units in the corresponding column of processing units. Each processing unit in the processing unit array may thereby be triggered to perform multiplication calculation and accumulation calculation according to the received first data and second data, so as to obtain n rows and m columns of accumulation results. Then 1 may be added to the current calculation count i8 to obtain a new calculation count i8, and it may be determined whether the new calculation count i8 is smaller than the preset calculation count n8, so as to determine whether to continue writing data into the first register set and the second register set. If the calculation count i8 is equal to the preset calculation count n8, it can be determined that the sub-matrix operation is completed, and the operation results of n rows and m columns can be obtained.
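
The counting loop described above may be easier to follow as code. The following is a minimal software sketch, not the hardware itself: the function name `compute_submatrix` and the variable `acc` are hypothetical, and it is assumed that the preset calculation count n8 equals the shared dimension k of the two input matrices and that the requested block lies fully inside both matrices.

```python
def compute_submatrix(A, B, i1, j1, n, m):
    """Accumulate the n x m block of C = A x B whose top-left element is C[i1][j1]."""
    n8 = len(B)  # assumed preset calculation count: the shared dimension k
    acc = [[0] * m for _ in range(n)]  # one accumulation variable per processing unit
    i8 = 0
    while i8 < n8:
        # First register group: column i8 of A, rows i1 .. i1+n-1.
        first = [A[i1 + r][i8] for r in range(n)]
        # Second register group: row i8 of B, columns j1 .. j1+m-1.
        second = [B[i8][j1 + c] for c in range(m)]
        # Broadcast: each first datum reaches a whole row of units,
        # each second datum reaches a whole column of units.
        for r in range(n):
            for c in range(m):
                acc[r][c] += first[r] * second[c]
        i8 += 1
    return acc


# Hypothetical 2 x 3 and 3 x 2 inputs.
A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
print(compute_submatrix(A, B, 0, 0, 2, 2))  # [[58, 64], [139, 154]]
```

In this sketch each loop iteration adds one outer product of a column of A and a row of B, which is why every first datum and every second datum is fetched only once per pass.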

For example, when Ci1j1 is C00, the sub-matrix formed by the resulting n rows and m columns of operation results may be as shown in fig. 11. The value of the operation result C00 of the 1st row and 1st column in the sub-matrix may be the accumulated result obtained by multiplying each first data of the 1st row of the first input matrix Aik by the corresponding second data of the 1st column of the second input matrix Bkj, that is, C00 = a00 × b00 + a01 × b10 + a02 × b20 + ... + a0k × bk0; the value of the operation result C10 of the 2nd row and 1st column in the sub-matrix may be the accumulated result obtained by multiplying each first data of the 2nd row of the first input matrix Aik by the corresponding second data of the 1st column of the second input matrix Bkj, that is, C10 = a10 × b00 + a11 × b10 + a12 × b20 + ... + a1k × bk0; the value of the operation result C01 of the 1st row and 2nd column in the sub-matrix may be the accumulated result obtained by multiplying each first data of the 1st row of the first input matrix Aik by the corresponding second data of the 2nd column of the second input matrix Bkj, that is, C01 = a00 × b01 + a01 × b11 + a02 × b21 + ... + a0k × bk1; the value of the operation result C11 of the 2nd row and 2nd column in the sub-matrix may be the accumulated result obtained by multiplying each first data of the 2nd row of the first input matrix Aik by the corresponding second data of the 2nd column of the second input matrix Bkj, that is, C11 = a10 × b01 + a11 × b11 + a12 × b21 + ... + a1k × bk1; and so on, until the value of the operation result C(n-1)(m-1) of the n-th row and m-th column in the sub-matrix, which may be the accumulated result obtained by multiplying each first data of the n-th row of the first input matrix Aik by the corresponding second data of the m-th column of the second input matrix Bkj, that is, C(n-1)(m-1) = a(n-1)0 × b0(m-1) + a(n-1)1 × b1(m-1) + a(n-1)2 × b2(m-1) + ... + a(n-1)k × bk(m-1).

For example, calculating the product of two 4 × 4 input matrices involves 64 multiplication calculations in total. The arithmetic device in this example may perform the operation through a 4 × 4 processing unit array: passing in the data of the input matrices may require 1 beat, the multiplication delay may require 2 beats, the accumulation calculation may require 4 beats, and outputting the operation results may require 4 beats, that is, 11 beats in total. Here, a beat may be used to represent the time required by the arithmetic device to execute one operation step.
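
The two totals quoted above can be checked with simple arithmetic. In the sketch below the stage names in `stage_beats` are hypothetical labels for the four stages listed in the text; the beat counts themselves are the ones given there.

```python
# Stage latencies (in beats) quoted in the text for a 4 x 4 by 4 x 4 product.
stage_beats = {"pass_in": 1, "multiply_delay": 2, "accumulate": 4, "output": 4}
total_beats = sum(stage_beats.values())

# Each of the 4 x 4 result elements needs 4 multiplications.
rows, cols, inner = 4, 4, 4
multiplications = rows * cols * inner

print(total_beats, multiplications)  # 11 64
```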

For example, the first data included in the first input matrix A may be as shown in Table 1 below, and the second data included in the second input matrix B may have the specific values shown in Table 2 below:

First input matrix A	Column 1	Column 2	Column 3	Column 4
Row 1	1	3	5	7
Row 2	9	11	13	15
Row 3	17	19	21	23
Row 4	25	27	29	31

Table 1

Table 2

In the first beat, the arithmetic device may buffer 4 rows of first data included in the 1 st column of the first input matrix a into the first register group, that is, four first data of 1,9, 17, and 25 may be written into the first register group, and may buffer 4 columns of second data included in the 1 st row of the second input matrix B into the second register group, that is, four second data of 2,4,6, and 8 may be written into the second register group.

In the second beat, the arithmetic device may write 3, 11, 19 and 27 in the 2nd column of the first input matrix A into the first register group, and may write 10, 12, 14 and 16 in the 2nd row of the second input matrix B into the second register group. At the same time, 1, 9, 17 and 25 output by the first register group may be respectively transmitted to the rows of processing units in the processing unit array through the first broadcast lines; for example, 1 may be transmitted to the first row of processing units, 9 to the second row of processing units, 17 to the third row of processing units, and 25 to the fourth row of processing units. Likewise, 2, 4, 6 and 8 output by the second register set may be respectively transmitted to the columns of processing units in the processing unit array through the second broadcast lines; for example, 2 may be transmitted to the fourth column of processing units, 4 to the third column of processing units, 6 to the second column of processing units, and 8 to the first column of processing units. Each processing unit in the processing unit array can thus receive its corresponding first data and second data at the same time; for example, the first row and first column processing unit may receive first data 1 and second data 8, the first row and second column processing unit may receive first data 1 and second data 6, and the fourth row and fourth column processing unit may receive first data 25 and second data 2.

In the third beat, each processing unit in the processing unit array may perform a multiplication operation according to the received first data and second data to obtain a corresponding product result, and may receive the first data that the first register set continues to output and the second data that the second register set continues to output. For example, the first row and first column processing unit may multiply first data 1 by second data 8 to obtain the corresponding product result 8, and may receive 3 output by the first register set and 16 output by the second register set; for another example, the fourth row and fourth column processing unit may multiply first data 25 by second data 2 to obtain the corresponding product result 50, and may receive 27 output by the first register set and 10 output by the second register set.

In the fourth beat, each processing unit in the processing unit array may store the product result calculated in the previous beat in its accumulation variable register, may perform a multiplication operation on the first data and the second data received in the previous beat to obtain a new product result, and may receive the first data that the first register set continues to output and the second data that the second register set continues to output. For example, the first row and first column processing unit may store the product result 8 calculated in the third beat in its own accumulation variable register, may multiply the first data 3 received in the third beat by the second data 16 to obtain a new product result 48, and may receive 5 output by the first register set and 24 output by the second register set, and so on.

In the fifth beat, each processing unit in the processing unit array may add the product result calculated in the previous beat to the first accumulation result (i.e., the accumulation variable) stored in the accumulation variable register to obtain an accumulation result, and store the accumulation result in the accumulation variable register as the new first accumulation result; it may also perform a multiplication operation on the first data and the second data received in the previous beat to obtain a new product result, and may receive the first data that the first register set continues to output and the second data that the second register set continues to output. For example, the first row and first column processing unit may add the product result 48 calculated in the fourth beat to the first accumulation result 8 stored in the accumulation variable register, store the resulting accumulation result 56 in its own accumulation variable register, multiply the first data 5 received in the fourth beat by the second data 24 to obtain a new product result 120, and receive 7 output by the first register set and 32 output by the second register set, and so on.
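
The beat-by-beat behaviour of a single processing unit can be traced with a small simulation. This is a hedged sketch rather than a description of the actual circuit: the function `simulate_unit` and its three-stage model (receive in one beat, multiply in the next, accumulate in the one after) are assumptions chosen to match the walkthrough, and the operand pairs are the ones the text gives for the first row and first column processing unit.

```python
def simulate_unit(pairs, first_receive_beat=2):
    """Trace one processing unit: receive in beat t, multiply in t+1, accumulate in t+2."""
    received = None   # operand pair latched in the previous beat
    product = None    # product computed in the previous beat
    acc = None        # accumulation variable register (empty until the first store)
    trace = []
    feed = list(pairs)
    beat = first_receive_beat
    while feed or received is not None or product is not None:
        # Store or accumulate the product computed in the previous beat.
        if product is not None:
            acc = product if acc is None else acc + product
        # Multiply the operands received in the previous beat.
        product = received[0] * received[1] if received is not None else None
        # Receive the next operand pair, if any remain.
        received = feed.pop(0) if feed else None
        trace.append((beat, product, acc))
        beat += 1
    return trace


# Operands seen by the first row and first column processing unit.
trace = simulate_unit([(1, 8), (3, 16), (5, 24), (7, 32)])
for beat, product, acc in trace:
    print(beat, product, acc)
```

Under these assumptions the trace shows product 8 in beat 3, accumulator 8 and product 48 in beat 4, accumulator 56 and product 120 in beat 5, and a final accumulator value of 400 in beat 7.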

In the sixth beat, each processing unit in the processing unit array may add the product result calculated in the fifth beat to the first accumulation result stored in the accumulation variable register to obtain an accumulation result, and store the accumulation result in the accumulation variable register as the new first accumulation result; it may also perform a multiplication operation on the first data and the second data received in the fifth beat to obtain a new product result, and may receive the first data that the first register group continues to output and the second data that the second register group continues to output; and so on, until the multiplication of the first input matrix A and the second input matrix B is completed and the result output matrix C is obtained. The result output matrix may include operation results in 4 rows and 4 columns, as shown in Table 3 below:

Result output matrix C	Column 1	Column 2	Column 3	Column 4
Row 1	304	336	368	400
Row 2	752	848	944	1040
Row 3	1200	1360	1520	1680
Row 4	1648	1872	2096	2320

Table 3

The operation result in the first row and first column of the result output matrix C may be calculated by the first row and first column processing unit in the processing unit array, the operation result in the second row and first column of the result output matrix C may be calculated by the second row and first column processing unit, the operation result in the first row and second column of the result output matrix C may be calculated by the first row and second column processing unit, and so on, with the operation result in the fourth row and fourth column of the result output matrix C calculated by the fourth row and fourth column processing unit in the processing unit array.
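
The values of the result output matrix can be cross-checked in software. In the sketch below, rows 1 and 2 of `B` are the values quoted in the first and second beats; rows 3 and 4 are assumptions, since only the values 24 and 32 from those rows appear in the walkthrough. The assumed rows were chosen to continue the same even-number pattern, and with them the product reproduces every entry of the result output matrix listed above.

```python
A = [
    [1, 3, 5, 7],
    [9, 11, 13, 15],
    [17, 19, 21, 23],
    [25, 27, 29, 31],
]
B = [
    [2, 4, 6, 8],       # row 1, quoted in the first beat
    [10, 12, 14, 16],   # row 2, quoted in the second beat
    [18, 20, 22, 24],   # row 3, assumed (only the value 24 appears in the text)
    [26, 28, 30, 32],   # row 4, assumed (only the value 32 appears in the text)
]

# Plain row-by-column product: C[i][j] = sum over k of A[i][k] * B[k][j].
C = [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)] for i in range(4)]

for row in C:
    print(row)
```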

In the operation result output stage, the arithmetic device may pass the completed operation results along through the result register p in each processing unit, so that the operation results calculated by the processing units in each row are output by the last processing unit of that row in the processing unit array. In the above example, the operation results calculated by the processing units in each row may be output by the fourth column processing unit of that row. Specifically, the operation results calculated by the processing units in the first row may be output through the first row and fourth column processing unit in the processing unit array; the operation results calculated by the processing units in the second row may be output through the second row and fourth column processing unit; the operation results calculated by the processing units in the third row may be output through the third row and fourth column processing unit; and the operation results calculated by the processing units in the fourth row may be output through the fourth row and fourth column processing unit.

As another example of the present invention, for the multi-convolution-kernel convolution operation, the arithmetic device may, based on the number of rows n and the number of columns m of the processing unit array, operate on the convolution sequence D(i+k) and the second input matrix Bkj to obtain the result output matrix Cij, that is, Cij = D(i+k) × Bkj.

It should be noted that the convolution sequence D(i+k) can be converted into the corresponding first input matrix Aik, i.e., D(i+k) = Aik. The length of the convolution sequence D(i+k) may be n1 + n6 - 1; there may be m1 convolution kernels in total, and the width of each convolution kernel (i.e., the number of columns of the convolution kernel) is n6. The result output matrix Cij may be a matrix C composed of m1 result sequences, and the length of each result sequence may be n1. The row number i of the first input matrix Aik may range from 0 to n1-1; the column number j of the second input matrix Bkj may range from 0 to m1-1; and k may range from 0 to n6-1.
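
The conversion D(i+k) = Aik can be illustrated as follows. The helper name `sequence_to_matrix` and the sample sequence are hypothetical; the sketch only assumes the index ranges stated above (i from 0 to n1-1, k from 0 to n6-1).

```python
def sequence_to_matrix(d, n1, n6):
    """Build the n1 x n6 matrix A with A[i][k] = D[i + k]."""
    if len(d) < n1 + n6 - 1:
        raise ValueError("sequence must contain at least n1 + n6 - 1 elements")
    return [[d[i + k] for k in range(n6)] for i in range(n1)]


# Hypothetical sequence of length n1 + n6 - 1 = 7 with n1 = 4 and n6 = 4.
D = [1, 2, 3, 4, 5, 6, 7]
A = sequence_to_matrix(D, 4, 4)
for row in A:
    print(row)
```

Here 7 sequence elements expand into a matrix of 16 entries, because each row is the previous row shifted by one position.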

A convolution kernel element D in the convolution sequence D (i + k) may be used as the first data in the embodiment of the present invention, and may be used to characterize the numerical information of the input signal, such as the numerical information of the input signal in a finite impulse response model; the matrix element b in the second input matrix Bkj may be used as second data in the embodiment of the present invention, and may be used to characterize the numerical information of the filter, such as the numerical information of the filter in the finite impulse response model.

In a specific implementation, when the number of convolution kernels of the convolution sequence D(i+k) exceeds the number n of rows of the processing unit array, and the number of columns of the second input matrix Bkj exceeds the number m of columns of the processing unit array, each operation pass of the arithmetic device in this example may complete the convolution operations corresponding to m convolution operation results of n convolution kernels in the result output matrix Cij.

Specifically, after the convolution operation is started, m first data starting from the i-th number of the convolution sequence D(i+k) may be written into the first register group, and the current operation count i6 may be set to 0, so that whether the corresponding data needs to be broadcast to the processing unit array may be determined by judging whether the operation count i6 is smaller than the preset convolution kernel width n6. If the current operation count i6 is smaller than the convolution kernel width n6, the first data output by each output end of the first register group may be broadcast to each row of processing units of the processing unit array through the first broadcast lines according to the number of rows n of the processing unit array, that is, each first data cached in the first register group is broadcast to the m columns of processing units in the corresponding row of processing units; meanwhile, the second data output by each output end of the second register group may be transmitted to each column of processing units of the processing unit array through the second broadcast lines, that is, each second data output by the second register group is broadcast to the n rows of processing units in the corresponding column of processing units. Each processing unit in the processing unit array may thereby be triggered to perform multiplication calculation and accumulation calculation according to the received first data and second data, so as to obtain n rows and m columns of accumulation results. Then 1 may be added to the operation count i6 to obtain a new operation count i6, and it may be determined whether the new operation count i6 is smaller than the convolution kernel width n6, so as to determine whether to continue broadcasting the first data output by the first register group and the second data output by the second register group to the processing unit array.
If the operation count i6 is equal to the convolution kernel width n6, it can be determined that the convolution operations corresponding to the m convolution operation results of the n convolution kernels are completed, that is, the m convolution operation results of the n convolution kernels can be obtained. The first register set and the second register set can each automatically read in the data they need to cache. Optionally, after the first register set and/or the second register set finishes reading in the last data to be cached, the preset initial data 0 is automatically read in, to ensure the correctness of the operation result.
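
The loop over the operation count i6, including the automatic read-in of 0 once the sequence is exhausted, might be modelled as below. The function `convolve_block`, the list `window` standing in for the convolution register, and the layout `kernels[j][k]` (coefficient k of kernel j) are all assumptions made for illustration; the sketch also assumes the register advances by exactly one sequence element per step.

```python
def convolve_block(d, kernels, start, n):
    """Compute n outputs per kernel, beginning at sequence position `start`.

    kernels[j][k] is the k-th coefficient of kernel j; all kernels share width n6.
    """
    n6 = len(kernels[0])
    m = len(kernels)
    # Convolution register: n consecutive sequence values, zero-filled past the end.
    window = [d[start + r] if start + r < len(d) else 0 for r in range(n)]
    next_index = start + n
    acc = [[0] * m for _ in range(n)]
    i6 = 0
    while i6 < n6:
        second = [kernels[j][i6] for j in range(m)]   # one coefficient per column
        for r in range(n):
            for j in range(m):
                acc[r][j] += window[r] * second[j]
        # Shift the register by one position and read in the next value (or 0).
        incoming = d[next_index] if next_index < len(d) else 0
        window = window[1:] + [incoming]
        next_index += 1
        i6 += 1
    return acc


# Hypothetical 3-element sequence and a single kernel of width 2.
print(convolve_block([1, 2, 3], [[1, 1]], 0, 3))  # [[3], [5], [3]]
```

The last output, 3, is 3 × 1 + 0 × 1: the zero read in after the final sequence element keeps the out-of-range term from corrupting the result.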

For example, when Ci1j1 is the convolution operation result C00 of the 1st row and 1st column in the result output matrix, the resulting m convolution operation results of the n convolution kernels may be as shown in fig. 12. The value of the convolution operation result C00 of the 1st row and 1st column may be the accumulated result obtained by multiplying the m first data starting from the 1st number of the convolution sequence D(i+k) by the respective second data in the 1st column of the second input matrix Bkj, that is, C00 = D0 × b00 + D1 × b10 + D2 × b20 + ... + D(m-1) × bk0; the value of the convolution operation result C10 of the 2nd row and 1st column may be the accumulated result obtained by multiplying the m first data starting from the 2nd number of the convolution sequence D(i+k) by the respective second data in the 1st column of the second input matrix Bkj, that is, C10 = D1 × b00 + D2 × b10 + D3 × b20 + ... + Dm × bk0; the value of the convolution operation result C01 of the 1st row and 2nd column may be the accumulated result obtained by multiplying the m first data starting from the 1st number of the convolution sequence D(i+k) by the respective second data in the 2nd column of the second input matrix Bkj, that is, C01 = D0 × b01 + D1 × b11 + D2 × b21 + ... + D(m-1) × bk1; the value of the convolution operation result C11 of the 2nd row and 2nd column may be the accumulated result obtained by multiplying the m first data starting from the 2nd number of the convolution sequence D(i+k) by the respective second data in the 2nd column of the second input matrix Bkj, that is, C11 = D1 × b01 + D2 × b11 + D3 × b21 + ... + Dm × bk1; and so on, until the value of the convolution operation result C(n-1)(m-1) of the n-th row and m-th column, which may be the accumulated result obtained by multiplying the m first data starting from the n-th number of the convolution sequence D(i+k) by the respective second data in the m-th column of the second input matrix Bkj, that is, C(n-1)(m-1) = D(n-1) × b0(m-1) + Dn × b1(m-1) + D(n+1) × b2(m-1) + ... + D(n+m-2) × bk(m-1).

For example, suppose the input convolution sequence D(i+k) is (1, 2, 3, 4, 5, 6, 7, 8), and the second data contained in the input second input matrix are as shown in Table 4 below:

B1	8	4	2	1
B2	8	-4	2	-1
B3	8	-4	-2	1
B4	8	6	4	2

Table 4

It should be noted that the impulse response function B1 in table four may include: second data of the first row in the second input matrix, 8, 4, 2 and 1, respectively; the impulse response function B2 may include: second data of a second row in the second input matrix are respectively 8, -4, 2 and-1; the impulse response function B3 may include: the second data of the third row in the second input matrix are respectively 8, -4, -2 and 1; and, the impulse response function B4 may include: the second data of the fourth row in the second input matrix are 8, 6, 4 and 2, respectively.

If the number of rows n and the number of columns m of the processing unit array are both 4, and the first register group is a convolution register, then in the first beat the arithmetic device may buffer the first 4 first data of the convolution sequence D(i+k) into the convolution register, that is, write 1, 2, 3 and 4 of the convolution sequence D(i+k) into the first register group, and may at the same time input the 4 second data contained in the first column of the second input matrix into the second register group, that is, write 8, 8, 8 and 8 into the second register group. In the second beat, 5, 6, 7 and 8 of the convolution sequence D(i+k) may be buffered in the convolution register, and 1, 2, 3 and 4 output by the convolution register may be respectively transmitted to the rows of processing units in the processing unit array through the first broadcast lines; for example, 1 output by the convolution register is transmitted to the 4 processing units in the first row of processing units, 2 output by the convolution register is transmitted to the 4 processing units in the second row of processing units, 3 output by the convolution register is transmitted to the 4 processing units in the third row of processing units, and 4 output by the convolution register is transmitted to the 4 processing units in the fourth row of processing units. Meanwhile, the 4 second data 4, -4, -4 and 6 contained in the second column of the second input matrix may be buffered in the second register group, and the 4 second data 8, 8, 8 and 8 output by the second register group may be respectively transmitted to the columns of processing units in the processing unit array through the second broadcast lines; for example, the second data 8 output from the first output terminal of the second register group is transmitted to the 4 processing units of the first column of processing units, the second data 8 output from the second output terminal of the second register group is transmitted to the 4 processing units of the second column of processing units, the second data 8 output from the third output terminal of the second register group is transmitted to the 4 processing units of the third column of processing units, and the second data 8 output from the fourth output terminal of the second register group is transmitted to the 4 processing units of the fourth column of processing units.

In the third beat, after processing by the convolution register, the arithmetic device can transmit 2 output by the convolution register to the 4 processing units in the first row of processing units, transmit 3 output by the convolution register to the 4 processing units in the second row of processing units, transmit 4 output by the convolution register to the 4 processing units in the third row of processing units, and transmit 5 output by the convolution register to the 4 processing units in the fourth row of processing units. Meanwhile, the 4 second data 2, 2, -2 and 4 contained in the third column of the second input matrix may be buffered in the second register group, and 4, -4, -4 and 6 output by the second register group may be respectively transmitted to the corresponding columns of processing units through the second broadcast lines. In addition, each processing unit in the processing unit array may multiply the first data and the second data transmitted to it in the second beat to obtain the corresponding product result.

Similarly, in the fourth beat, the arithmetic device may transmit, through the first broadcast lines, 3 output by the convolution register to the 4 processing units in the first row of processing units, 4 output by the convolution register to the 4 processing units in the second row of processing units, 5 output by the convolution register to the 4 processing units in the third row of processing units, and 6 output by the convolution register to the 4 processing units in the fourth row of processing units. Meanwhile, the 4 second data 1, -1, 1 and 2 contained in the fourth column of the second input matrix may be input into the second register group, and 2, 2, -2 and 4 output by the second register group may be transmitted to the corresponding columns of processing units through the second broadcast lines. In addition, each processing unit in the processing unit array may take the result it calculated in the third beat as its accumulation variable to obtain the corresponding accumulation result, and at the same time multiply the first data and the second data transmitted to it in the third beat to obtain a new product result.

In the fifth beat, the arithmetic device may transmit 4 output by the convolution register to the 4 processing units in the first row of processing units, 5 output by the convolution register to the 4 processing units in the second row of processing units, 6 output by the convolution register to the 4 processing units in the third row of processing units, and 7 output by the convolution register to the 4 processing units in the fourth row of processing units; at the same time, 1, -1, 1 and 2 output by the second register group may be respectively transmitted to the corresponding columns of processing units through the second broadcast lines. In addition, each processing unit in the processing unit array may take the accumulation result it calculated in the previous beat as its accumulation variable and accumulate the product result calculated in the previous beat onto it to obtain the corresponding accumulation result, and at the same time multiply the first data and the second data transmitted to it in the previous beat to obtain a new product result.
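
The data fed to the array in beats two to five, and the values the array should eventually hold, can be tabulated in software. The sketch below is an illustration under stated assumptions: the names `KERNELS`, `schedule` and `C` are hypothetical, each impulse response function B1 to B4 is treated as one convolution kernel, and the convolution register is assumed to advance by one sequence element per beat. The final results are computed directly from Cij = D(i+k) × Bkj; they are not quoted in the text.

```python
D = [1, 2, 3, 4, 5, 6, 7, 8]
KERNELS = [
    [8, 4, 2, 1],     # B1
    [8, -4, 2, -1],   # B2
    [8, -4, -2, 1],   # B3
    [8, 6, 4, 2],     # B4
]
N = 4    # rows of the processing unit array
N6 = 4   # convolution kernel width

# What each broadcast line carries in beats 2..5.
schedule = []
for step in range(N6):
    beat = step + 2
    first = D[step:step + N]                       # one value per row of units
    second = [kernel[step] for kernel in KERNELS]  # one value per column of units
    schedule.append((beat, first, second))

# Accumulated results: C[i][j] = sum over k of D[i + k] * KERNELS[j][k].
C = [[sum(D[i + k] * KERNELS[j][k] for k in range(N6)) for j in range(len(KERNELS))]
     for i in range(N)]

for entry in schedule:
    print(entry)
for row in C:
    print(row)
```

With these inputs the first output of kernel B1 is 1 × 8 + 2 × 4 + 3 × 2 + 4 × 1 = 26, and the full block is [[26, 2, -2, 40], [41, 7, 1, 60], [56, 12, 4, 80], [71, 17, 7, 100]].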

By analogy, in the sixth beat, each processing unit in the processing unit array may take the accumulation result it calculated in the fifth beat as its accumulation variable and accumulate onto it to obtain a new accumulation result, and may multiply the first data and the second data transmitted to it in the previous beat to obtain a new product result. In the seventh beat, each processing unit in the processing unit array may take the accumulation result it calculated in the sixth beat as its accumulation variable and accumulate it with the product result calculated in the sixth beat to obtain the final accumulation result, so that the final accumulation result can be transmitted to the result register as the operation result in the next beat. It is apparent that the arithmetic device can begin outputting the operation results calculated by the processing units in the processing unit array from the eighth beat. For example, in the eighth beat, the operation results calculated by the processing units in the fourth column of the processing unit array are respectively output, and the operation results calculated by the processing units in the third column are respectively transmitted to the result registers of the fourth column processing units in the same rows, so that the operation results calculated by the processing units in the third column can be output by the processing units in the fourth column in the next beat.
Similarly, after the operation results calculated by the processing units in the third column are transmitted to the processing units in the fourth column, the operation results calculated by the processing units in the second column may be transmitted to the processing units in the third column, so that they can be passed on through the processing units in the third column to the processing units in the fourth column for output; and after the operation results calculated by the processing units in the second column are transmitted to the processing units in the third column, the operation results calculated by the processing units in the first column can be transmitted to the processing units in the second column, so that they can be passed on through the processing units in the second column and the third column to the processing units in the fourth column for output.
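
The output stage behaves like a shift chain along each row of processing units. The following sketch models one row only; the function name `drain_row` is hypothetical, and it assumes that in each beat the last unit of the row emits the content of its result register while every other unit passes its value one position toward the last unit.

```python
def drain_row(results):
    """Shift one row of results toward the last unit, emitting one value per beat."""
    regs = list(results)   # result register of each unit, first unit to last unit
    emitted = []
    for _ in range(len(regs)):
        emitted.append(regs[-1])      # the last unit outputs its register
        regs = [None] + regs[:-1]     # every other value moves one unit onward
    return emitted


# First row of the result output matrix from the 4 x 4 matrix example.
print(drain_row([304, 336, 368, 400]))  # [400, 368, 336, 304]
```

Draining a row of four results therefore takes four beats, which agrees with the 4 beats allotted to result output in the earlier beat count.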

In summary, the arithmetic device in the embodiment of the invention may broadcast the first data output by the first register set and the second data output by the second register set to each processing unit in the processing unit array through the data lines, so as to trigger each processing unit to perform an operation on the received first data and second data. The product result calculated by each processing unit can be stored in that processing unit's own accumulation variable register, so that each processing unit can accumulate its own product results locally, which avoids having to transmit the product results to a separate accumulator. This improves the concurrency of the operation, further reduces the number of memory accesses made by the arithmetic device, reduces the operation overhead, and improves the operation efficiency.

In addition, the register set in the embodiment of the present invention can convert an input sequence without repetition into an output sequence with repetition. For example, the first register set in the above example can convert an input sequence without repetition into the repeated output sequence D(i + k) required for convolution, thereby reducing the amount of input data and saving storage.
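One way to picture this conversion is the Python sketch below. It assumes, for illustration only, that the register set holds a window of n values, shifts the window by one position per beat and loads a single new input value, so that row i receives D(i + k) at beat k. The function name `convolution_register_outputs` and the parameter `taps` (the number of beats) are hypothetical and are not taken from the embodiment.

```python
def convolution_register_outputs(d, n, taps):
    """Turn a non-repeating input sequence into repeated outputs D(i + k).

    d    : input sequence, each element read only once
    n    : number of rows fed by the register set
    taps : number of beats (assumed parameter for this sketch)
    Returns a list whose k-th entry holds the n values [D(0 + k), ..., D(n - 1 + k)].
    """
    needed = n + taps - 1
    if len(d) < needed:
        raise ValueError("input sequence too short: need %d values" % needed)
    window = list(d[:n])       # initial register contents D(0) .. D(n - 1)
    outputs = []
    for k in range(taps):
        outputs.append(list(window))
        if k + 1 < taps:
            # Shift by one position and load exactly one new input value.
            window = window[1:] + [d[n + k]]
    return outputs


if __name__ == "__main__":
    for beat in convolution_register_outputs([10, 11, 12, 13, 14], n=3, taps=3):
        print(beat)
```

In this example, 5 input values produce 9 output values, since most inputs are reused on several beats instead of being read from storage again.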

It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those of skill in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the embodiments of the invention.

The embodiments in the present specification are all described in a progressive manner, and each embodiment focuses on differences from other embodiments, and portions that are the same and similar between the embodiments may be referred to each other.

Embodiments of the present invention may be provided as methods, apparatus, or computer program products. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.

Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all changes and modifications that fall within the true scope of the embodiments of the present invention.

Finally, it should also be noted that, in this document, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "include", "including" or any other variations thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal device including a series of elements includes not only those elements but also other elements not explicitly listed or inherent to such process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or terminal device that comprises the element.

The arithmetic device and the processing method thereof provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and the implementation of the present invention, and the above description of the embodiments is only intended to help understand the method and the core idea of the present invention. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. An arithmetic device, comprising: a first register group, a second register group, a data line and a processing unit array, wherein the data line comprises n first broadcast lines and m second broadcast lines, and m and n are both natural numbers greater than or equal to 1; the processing unit array comprises n processing unit rows and m processing unit columns;

one end of each first broadcast line is connected with the first register group, the other end of each first broadcast line is connected with a row of processing units in the processing unit array, and the first broadcast lines are used for transmitting first data output by the first register group to the row of processing units, wherein the row numbers of the processing units connected with different first broadcast lines are different;

one end of each second broadcast line is connected with the second register group, and the other end of each second broadcast line is connected with a column of processing units in the processing unit array and used for transmitting the second data output by the second register group to the column of processing units, wherein the column numbers of the processing units connected with different second broadcast lines are different;

each processing unit in the processing unit array is used for carrying out operation according to the first data and the second data and outputting an operation result.

2. The arithmetic device of claim 1, further comprising a controller;

the processing unit includes: a multiplier-adder, an accumulation variable register, a result register and a selector;

the multiplier-adder is used for receiving the first data and the second data, multiplying the first data and the second data to obtain a product result, and adding the product result and the accumulation variable output by the selector to obtain an accumulation result;

the input end of the accumulation variable register is connected with the multiplier-adder and is used for storing an accumulation result output by the multiplier-adder and transmitting the accumulation result to the selector;

the selector is connected with the controller and used for transmitting the accumulation result as an accumulation variable to the multiplier-adder according to a control signal output by the controller; or, the accumulated result is used as an operation result and is transmitted to the result register;

and the result register is used for outputting the operation result.

3. The arithmetic device according to claim 2, wherein the selector comprises a first selector and a second selector, and the output terminal of the accumulation variable register is connected to the first selector and the second selector, respectively, and the multiplier-adder comprises a multiplier and an adder;

the output end of the multiplier is connected with the adder and is used for receiving the first data and the second data, multiplying the first data and the second data and outputting a product result to the adder;

the output end of the adder is connected with the accumulation variable register and is used for adding the product result and the accumulation variable output by the first selector to obtain an accumulation result and outputting the accumulation result to the accumulation variable register;

the output end of the first selector is connected with the adder and is used for receiving the accumulation result and the zero clearing signal corresponding to the accumulation variable register and outputting an accumulation variable to the adder according to the accumulation result and/or the zero clearing signal;

and the output end of the second selector is connected with the result register and is used for receiving the control signal, taking the accumulated result as an operation result according to the control signal and transmitting the operation result to the result register.

4. The arithmetic device according to claim 3, wherein the processing unit array includes at least one row of processing units, the row of processing units includes at least two processing units connected in sequence, and the two processing units are divided into a first processing unit and a second processing unit;

the output end of the result register of the first processing unit is connected with the second selector of the second processing unit and used for transmitting the operation result output by the first processing unit to the second selector of the second processing unit;

and the second selector of the second processing unit is also used for receiving the operation result output by the first processing unit and transmitting the operation result to a result register in the second processing unit.

5. The arithmetic device of any one of claims 1 to 4, wherein the first register bank comprises at least one first buffer register, each first buffer register comprising at least one output, wherein the second register bank comprises at least one second buffer register, each second buffer register comprising at least one output;

each output end of the first buffer register is connected with each processing unit in a row of processing units through the first broadcast line, and the first buffer register is used for buffering the received first data and transmitting the first data to each processing unit in the row of processing units through the first broadcast line;

each output end of the second buffer register is connected with each processing unit in a column of processing units through the second broadcast line, and the second buffer register is used for buffering the received second data and transmitting the second data to each processing unit in the column of processing units through the second broadcast line.

6. The arithmetic device of claim 5, further comprising a timing adjustment module;

the timing adjustment module includes: the first timing adjustment module and/or the second timing adjustment module;

the first buffer register is connected with the first timing adjustment module and used for outputting the buffered first data according to the first timing signal output by the first timing adjustment module;

the second buffer register is connected with the second timing adjustment module and is used for outputting the buffered second data according to the second timing signal output by the second timing adjustment module.

7. The arithmetic device of claim 5,

when the first buffer register is a convolution register, the first buffer register is further configured to output the buffered first data according to a first feedback signal output by the processing unit array; and/or,

when the second buffer register is a convolution register, the second buffer register is further configured to output the buffered second data according to a second feedback signal output by the processing unit array.

8. A processing method for an arithmetic device, applied to the arithmetic device according to any one of claims 1 to 7, the method comprising:

transmitting first data output by the first register group to each row of processing units in the processing unit array through a first broadcast line in the arithmetic device;

and outputting the operation results corresponding to the processing units.

9. The method of claim 8, further comprising:

caching the received first data into the first register group according to the row number of the processing unit array;

and caching the received second data into the second register group according to the number of the columns of the processing unit array.

10. The method according to claim 8 or 9, wherein performing an operation in each processing unit in the processing unit array according to the received first data and second data respectively to obtain an operation result corresponding to each processing unit comprises:

in each processing unit, multiplying the received first data and the second data by a multiplier-adder to obtain a product result, and adding the product result and an accumulation variable output by a selector to obtain an accumulation result, wherein the accumulation variable is output by the selector according to a first accumulation result stored in an accumulation variable register;

11. The method of claim 10, wherein multiplying the received first data and second data by a multiplier-adder to obtain a product result, and adding the product result to an accumulation variable output by a selector to obtain an accumulation result, comprises:

before the accumulated result is transmitted to the result register according to the control signal output by the controller, the method further includes: and transmitting the accumulated result to the accumulation variable register to update the first accumulated result stored in the accumulation variable register.