CN110399976B - Computing device and computing method - Google Patents


Info

Publication number
CN110399976B
CN110399976B (application CN201810376471.0A)
Authority
CN
China
Prior art keywords
data
register
convolution
less
register groups
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810376471.0A
Other languages
Chinese (zh)
Other versions
CN110399976A (en)
Inventor
梁晓峣
景乃锋
崔晓松
陈云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN201810376471.0A priority Critical patent/CN110399976B/en
Priority to PCT/CN2019/084011 priority patent/WO2019206162A1/en
Publication of CN110399976A publication Critical patent/CN110399976A/en
Application granted granted Critical
Publication of CN110399976B publication Critical patent/CN110399976B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The application provides a computing device and a computing method, which can improve the data utilization rate. The computing device includes a plurality of operation units for performing a plurality of convolution operations on a plurality of data, the plurality of operation units including a first operation unit and a second operation unit. The first operation unit is configured to: in the i-th convolution operation of the plurality of convolution operations, multiply first weight data by first data in the plurality of data, where i is an integer greater than 0; receive second data sent by the second operation unit, the second data being data on which the second operation unit performs the convolution in the i-th convolution operation; and in the (i + 1)-th convolution operation of the plurality of convolution operations, multiply the first weight data by the second data.

Description

Computing device and computing method
Technical Field
The present application relates to the field of data processing, and in particular, to a computing apparatus and a computing method.
Background
Convolution operations are the most common computational operations in convolutional neural networks, and are mainly applied in convolutional layers. In a convolution operation, each weight data in the convolution kernel is multiplied by the corresponding input data, and the dot-product results are then added to obtain the output result of one convolution operation. Then, according to the stride setting of the convolutional layer, the convolution kernel is slid and the convolution operation is repeated. Convolutional neural networks typically process large amounts of data, and high data throughput requires high access bandwidth. For example, a computing device for convolution operations typically includes a plurality of operation units that process data in parallel and a register file for storing the data on which the convolution operations are to be performed. In each convolution operation, the operation units need to read the data required for that operation from the register file; since the amount of data processed by convolution operations is usually huge, the operation units must read a large amount of data from the register file.
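The dot-product-and-slide procedure described above can be sketched in a few lines of Python (an illustrative sketch, not the patented hardware; the input values and kernel are made up):

```python
def conv2d(data, kernel, stride=1):
    """Slide a K x K kernel over the input matrix: each weight is multiplied
    by the corresponding input data, and the products are summed."""
    K = len(kernel)
    rows, cols = len(data), len(data[0])
    out = []
    for r in range(0, rows - K + 1, stride):
        row = []
        for c in range(0, cols - K + 1, stride):
            # One convolution operation: K*K multiplications, then accumulation.
            row.append(sum(data[r + i][c + j] * kernel[i][j]
                           for i in range(K) for j in range(K)))
        out.append(row)
    return out

data = [[10, 27, 69],
        [35, 82, 75],
        [11, 22, 33]]
kernel = [[1, 0],
          [0, 1]]  # a 2 x 2 convolution kernel, stride 1
print(conv2d(data, kernel))  # [[92, 102], [57, 115]]
```

Note how, with stride 1, adjacent windows overlap: this overlap is exactly the repeated data the application is concerned with.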
Disclosure of Invention
The application provides a computing device and a computing method, which can improve the data utilization rate.
In a first aspect, a computing device is provided, comprising: a plurality of arithmetic units for performing a plurality of convolution operations on a plurality of data, the plurality of arithmetic units including a first arithmetic unit and a second arithmetic unit; the first arithmetic unit is configured to: performing multiplication operation on first weight data and first data in the plurality of data in the ith convolution operation in the plurality of convolution operations, wherein i is an integer greater than 0; receiving second data sent by the second operation unit, wherein the second data is data for the second operation unit to perform convolution operation in the ith convolution operation; and in the (i + 1) th convolution operation in the plurality of convolution operations, multiplying the first weight data and the second data.
In this embodiment of the application, the first operation unit can obtain the data that is repeated between the two convolution operations from the second operation unit, and does not need to read that repeated data from the register file. This improves the utilization rate of the data, reduces the amount of data exchanged between the operation units and the register file, and lowers the access-bandwidth overhead between the operation units and the register file.
In one possible implementation manner, the method further includes: a plurality of register sets for storing the plurality of data; the plurality of operation units are further configured to acquire the plurality of data required in the plurality of convolution operations from the plurality of register groups, wherein each of the plurality of operation units is capable of acquiring data stored in any one of the plurality of register groups.
In this embodiment of the application, any one of the plurality of operation units can access any one of the register groups, so the register groups only need to read and store the data to be convolved once, rather than repeatedly reading a large amount of duplicated convolution data from the memory. This saves storage space in the register groups, relieves the pressure on the access bandwidth between the register groups and the memory, and improves the parallelism of data access.
In one possible implementation, the first arithmetic unit is configured to: obtaining the first data from a first register bank of the plurality of register banks in the ith convolution operation; the second arithmetic unit is configured to: in the ith convolution operation, the second data is obtained from a second register set of the plurality of register sets, where the first register set and the second register set are different register sets.
In one possible implementation, during the i-th convolution operation, data read by the plurality of arithmetic units from the plurality of register groups is located in different register groups of the plurality of register groups.
In this embodiment of the application, a flexible data arrangement is adopted across the plurality of register groups, so that as much data as possible can be read out in one clock cycle, data-read conflicts are avoided, and a non-blocking convolution operation process is achieved.
In one possible implementation, the convolution kernel corresponding to the convolution operations has a size of K × K, the plurality of register groups includes K × K register groups, the plurality of data includes M rows and N columns of data, and the distribution of the plurality of data across the K × K register groups satisfies the following condition: the data in row x, column s of the plurality of data is stored in the b-th register group, and the data in row x + a, column s is stored in the [(b + a × K) mod (K × K)]-th register group, where mod denotes the modulo operation, 1 ≤ x ≤ M, 1 ≤ s ≤ N, 1 ≤ b ≤ K × K, K is an integer not less than 1, and M, N, x, s, b and a are positive integers.
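One concrete mapping consistent with this condition is b(x, s) = ((x − 1)·K + (s − 1)) mod K² + 1 (an illustrative choice; the condition only constrains the row-wise shift, not the full mapping). Under it, moving down a rows shifts the register group by a × K modulo K × K, and every K × K window touches K × K distinct register groups, so a whole window can be read without conflict:

```python
def register_group(x, s, K):
    """Register group (1-based) holding the data in row x, column s,
    for K*K register groups. Illustrative mapping, not the patent's exact one."""
    return ((x - 1) * K + (s - 1)) % (K * K) + 1

K = 3
# Stated condition: row x + a maps to [(b - 1 + a*K) mod (K*K)] + 1.
assert all(register_group(x + a, s, K)
           == (register_group(x, s, K) - 1 + a * K) % (K * K) + 1
           for x in range(1, 6) for s in range(1, 6) for a in range(1, 4))
# Conflict freedom: any K x K window hits K*K distinct register groups.
for x0 in range(1, 5):
    for s0 in range(1, 5):
        groups = {register_group(x0 + i, s0 + j, K)
                  for i in range(K) for j in range(K)}
        assert len(groups) == K * K
```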
In a second aspect, a computing method is provided. The computing apparatus to which the method is applied includes a plurality of operation units for performing a plurality of convolution operations on a plurality of data, the plurality of operation units including a first operation unit and a second operation unit. The method includes: in the i-th convolution operation of the plurality of convolution operations, the first operation unit multiplies first weight data by first data in the plurality of data, where i is an integer greater than 0; the first operation unit receives second data sent by the second operation unit, the second data being data on which the second operation unit performs the convolution in the i-th convolution operation; and in the (i + 1)-th convolution operation of the plurality of convolution operations, the first operation unit multiplies the first weight data by the second data.
In this embodiment of the application, the first operation unit can obtain the data that is repeated between the two convolution operations from the second operation unit, and does not need to read that repeated data from the register file. This improves the utilization rate of the data, reduces the amount of data exchanged between the operation units and the register file, and lowers the access-bandwidth overhead between the operation units and the register file.
In one possible implementation, the computing apparatus further includes: a plurality of register groups for storing the plurality of data, where each of the plurality of operation units is capable of acquiring data stored in any one of the plurality of register groups.
In this embodiment of the application, any one of the plurality of operation units can access any one of the register groups, so the register groups only need to read and store the data to be convolved once, rather than repeatedly reading a large amount of duplicated convolution data from the memory. This saves storage space in the register groups, relieves the pressure on the access bandwidth between the register groups and the memory, and improves the parallelism of data access.
In one possible implementation manner, the method further includes: in the i-th convolution operation, the first operation unit acquires the first data from a first register group of the plurality of register groups; in the ith convolution operation, the second arithmetic unit obtains the second data from a second register group of the plurality of register groups, where the first register group and the second register group are different register groups.
In one possible implementation, during the i-th convolution operation, data read by the plurality of arithmetic units from the plurality of register groups is located in different register groups of the plurality of register groups.
In this embodiment of the application, a flexible data arrangement is adopted across the plurality of register groups, so that as much data as possible can be read out in one clock cycle, data-read conflicts are avoided, and a non-blocking convolution operation process is achieved.
In one possible implementation, the convolution kernel corresponding to the convolution operations has a size of K × K, the plurality of register groups includes K × K register groups, the plurality of data includes M rows and N columns of data, and the distribution of the plurality of data across the K × K register groups satisfies the following condition: the data in row x, column s of the plurality of data is stored in the r-th register group, and the data in row x + a, column s is stored in the [(r + a × K) mod (K × K)]-th register group, where mod denotes the modulo operation, 1 ≤ x ≤ M, 1 ≤ s ≤ N, 1 ≤ r ≤ K × K, K is an integer not less than 1, and M, N, x, s, r and a are positive integers.
In a third aspect, a computing device is provided, which includes a plurality of arithmetic units and a plurality of register sets, wherein each arithmetic unit is respectively connected with the plurality of register sets; the register groups are used for storing a plurality of data to be subjected to a plurality of convolution operations; the plurality of operation units are configured to acquire the plurality of data from the plurality of register groups and perform a plurality of convolution operations on the plurality of data, where each of the plurality of operation units is configured to acquire data stored in any one of the plurality of register groups.
In this embodiment of the application, any one of the plurality of operation units can access any one of the register groups, so the register groups only need to read and store the data to be convolved once, rather than repeatedly reading a large amount of duplicated convolution data from the memory. This saves storage space in the register groups, relieves the pressure on the access bandwidth between the register groups and the memory, and improves the parallelism of data access.
In one possible implementation, during an i-th convolution operation of the plurality of convolution operations, data read by the plurality of operation units from the plurality of register groups is located in a different register group of the plurality of register groups, where i is an integer greater than 0.
In this embodiment of the application, a flexible data arrangement is adopted across the plurality of register groups, so that as much data as possible can be read out in one clock cycle, data-read conflicts are avoided, and a non-blocking convolution operation process is achieved.
In one possible implementation, the convolution kernel corresponding to the convolution operations has a size of K × K, the plurality of register groups includes K × K register groups, the plurality of data includes M rows and N columns of data, and the distribution of the plurality of data across the K × K register groups satisfies the following condition: the data in row x, column s of the plurality of data is stored in the b-th register group, and the data in row x + a, column s is stored in the [(b + a × K) mod (K × K)]-th register group, where mod denotes the modulo operation, 1 ≤ x ≤ M, 1 ≤ s ≤ N, 1 ≤ b ≤ K × K, K is an integer not less than 1, and M, N, x, s, b and a are positive integers.
In one possible implementation, the plurality of arithmetic units include a first arithmetic unit and a second arithmetic unit, the first arithmetic unit is configured to: performing multiplication operation on first weight data and first data in the plurality of data in the ith convolution operation in the plurality of convolution operations, wherein i is an integer greater than 0; receiving second data sent by the second operation unit, wherein the second data is data for the second operation unit to perform convolution operation in the ith convolution operation; and in the (i + 1) th convolution operation in the plurality of convolution operations, multiplying the first weight data and the second data.
In this embodiment of the application, the first operation unit can obtain the data that is repeated between the two convolution operations from the second operation unit, and does not need to read that repeated data from the register file. This improves the utilization rate of the data, reduces the amount of data exchanged between the operation units and the register file, and lowers the access-bandwidth overhead between the operation units and the register file.
In a fourth aspect, there is provided a chip on which a computing device as described in the first aspect or any one of the possible implementations of the first aspect is arranged, or on which a computing device as described in the third aspect or any one of the possible implementations of the third aspect is arranged.
In a fifth aspect, there is provided a computer storage medium storing program code for execution by a computing device, the program code comprising instructions for performing the method of the second aspect.
Drawings
Fig. 1 is a schematic structural diagram of a computing device according to an embodiment of the present application.
FIG. 2 is a schematic diagram of the convolution operations performed between the convolution kernel and the data to be convolved in different periods.
Fig. 3 is a schematic structural diagram of a computing device according to an embodiment of the present application.
Fig. 4 is a schematic structural diagram of a computing device according to another embodiment of the present application.
FIG. 5 is a diagram illustrating a connection between a register set and an arithmetic unit according to an embodiment of the present application.
FIG. 6 is a diagram illustrating a connection between a register group and an operation unit according to another embodiment of the present application.
Fig. 7 is a schematic diagram of an arrangement manner of data to be convolved in a plurality of register groups in this embodiment of the application.
Fig. 8 is a schematic process diagram of performing a convolution operation in the embodiment of the present application.
Fig. 9 is a flowchart illustrating a calculation method in the embodiment of the present application.
Detailed Description
The technical solution in the present application will be described below with reference to the accompanying drawings.
For ease of understanding, the concept of convolution operations and related techniques will be described in detail.
Convolution operations are commonly applied in convolutional neural networks, which may be used to process image data. The data on which the convolution calculation is to be performed may be referred to as the input matrix or input data; the input data may be image data. A convolution operation corresponds to a convolution kernel, which is a matrix of size K × K, K being an integer greater than or equal to 1. Each element in the convolution kernel is one weight data. In the convolution process, the convolution kernel slides over the input matrix according to the stride; the sliding window divides the input matrix into a plurality of sub-matrices of the same size as the convolution kernel, each sub-matrix is dot-multiplied with the convolution kernel, and the dot-product results are accumulated to obtain the operation result of one convolution operation. It should be noted that the width and length of the convolution kernel may be equal or unequal. In this embodiment of the application, the description assumes that the width and length of the convolution kernel are equal, that is, that the size of the convolution kernel is K × K. Those skilled in the art will understand that the computing apparatus and computing method in the embodiments of the application can also be applied when the width and length of the convolution kernel are not equal.
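The decomposition into sub-matrices can be made explicit: the sliding window cuts the input matrix into K × K sub-matrices, and each sub-matrix, dot-multiplied with the kernel and accumulated, yields one output element (illustrative Python; stride 1, square kernel, made-up values):

```python
def windows(data, K, stride=1):
    """All K x K sub-matrices cut out by sliding the window over the input."""
    rows, cols = len(data), len(data[0])
    return [[[data[r + i][c + j] for j in range(K)] for i in range(K)]
            for r in range(0, rows - K + 1, stride)
            for c in range(0, cols - K + 1, stride)]

def dot_accumulate(sub, kernel):
    """Dot-multiply one sub-matrix with the kernel and accumulate the results."""
    K = len(kernel)
    return sum(sub[i][j] * kernel[i][j] for i in range(K) for j in range(K))

data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 1], [1, 1]]
print([dot_accumulate(w, kernel) for w in windows(data, 2)])  # [12, 16, 24, 28]
```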
Fig. 1 is a schematic structural diagram of a computing apparatus 100 for convolution operations according to an embodiment of the present application. As shown in FIG. 1, the computing apparatus 100 generally includes a convolution processing unit 110, a register file 120, a control unit 130, and a memory 140. The convolution processing unit 110 may include a plurality of operation units, each of which performs the multiplication between the data to be convolved and the weight data. The convolution processing unit 110 may also include an adder tree unit, which can accumulate the output results of the operation units according to the rules of the convolution calculation to obtain the operation result of the convolution operation. The memory 140 is used to store data, and the register file 120 reads the data to be convolved from the memory 140 through the bus and stores it. The control unit 130 may be used to control the convolution processing unit 110 and the register file 120 to implement the operations described above. In the related art, the operation units operate independently of one another; that is, there are no connections between them. Each operation unit needs to read the data to be convolved from the register file 120 separately during operation. There is a large amount of duplicated data among the data processed in multiple convolution operations, yet this duplicated data still needs to be repeatedly read from the register file 120, so the utilization rate of the data is low.
For ease of understanding, the process of the convolution operation is illustrated below in conjunction with FIG. 2.
FIG. 2 is a schematic diagram of the convolution operations performed between the convolution kernel and the data to be convolved in different periods. As shown in FIG. 2, the convolution kernel is a 2 × 2 weight matrix, and the data to be convolved is data of multiple rows and columns, which may also be referred to as the input data or input matrix. FIG. 2 shows the data processed in two adjacent convolution operations. The convolution kernel is slid over the input matrix according to the stride, and the data covered by the hatched area in FIG. 2 is the data repeated in the two operations, which illustrates the reusability in convolution operations. However, in the related art, a computing apparatus that performs convolution operations generally processes data in parallel using a plurality of operation units that are independent of one another and cannot transmit data to one another. Therefore, in each convolution operation the operation units need to read the data required by that operation from the register file, and data repeated in two convolution operations is also repeatedly read. For example, assume that a first operation unit and a second operation unit among the plurality of operation units perform a first convolution operation and a second convolution operation, respectively, the two operations being adjacent. The first operation unit reads the data to be convolved 10, 27, 35, and 82 from the register file during the first convolution operation, and the second operation unit reads the data to be convolved 27, 69, 82, and 75 from the register file during the second convolution operation. Here the data 27 and 82 are repeatedly read from the register file. When the amount of data processed by the convolution operations is large, a large amount of data is read repeatedly, which increases the pressure on the access bandwidth.
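The repetition illustrated in FIG. 2 can be reproduced with the example values above (a sketch assuming a 2 × 2 kernel and stride 1; the input rows are arranged so the two adjacent windows match the example's values):

```python
def window(data, r, c, K):
    """Flattened K x K window of the input, top-left corner at (r, c)."""
    return [data[r + i][c + j] for i in range(K) for j in range(K)]

data = [[10, 27, 69],
        [35, 82, 75]]
first = window(data, 0, 0, 2)   # read for the first convolution operation
second = window(data, 0, 1, 2)  # read for the adjacent convolution operation
shared = [v for v in second if v in first]
print(first, second, shared)  # [10, 27, 35, 82] [27, 69, 82, 75] [27, 82]
```

The repeated values 27 and 82 are exactly what this application proposes to pass between operation units instead of re-reading from the register file.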
An embodiment of the application provides a computing apparatus that can reduce repeated reads from the register file and reduce the access bandwidth, and that can load weight data or data to be convolved into every operation unit simultaneously in each period.
Fig. 3 is a schematic structural diagram of a computing apparatus 300 according to an embodiment of the present application. As shown in fig. 3, the computing apparatus 300 includes:
a plurality of operation units 310 for performing a plurality of convolution operations on a plurality of data, the plurality of operation units 310 including a first operation unit 311 and a second operation unit 312. The first operation unit 311 is configured to: in the i-th convolution operation of the plurality of convolution operations, multiply first weight data by first data in the plurality of data, where i is an integer greater than 0; receive second data sent by the second operation unit 312, the second data being data on which the second operation unit 312 performs the convolution in the i-th convolution operation; and in the (i + 1)-th convolution operation of the plurality of convolution operations, multiply the first weight data by the second data.
Optionally, receiving the second data sent by the second operation unit 312 may mean that the first operation unit 311 obtains the second data directly through a connection with the second operation unit 312, or that it obtains the second data through an indirect connection with the second operation unit 312. For example, the first operation unit 311 is connected to a third operation unit, the third operation unit is connected to the second operation unit 312, the second operation unit 312 sends the second data to the third operation unit, and the third operation unit sends the second data to the first operation unit 311.
The operation units 310 are in one-to-one correspondence with the weight data in the convolution kernel, and each is configured to perform the multiplication corresponding to its weight data in one convolution operation. For example, the first weight data is the weight data corresponding to the first operation unit 311, and the second weight data is the weight data corresponding to the second operation unit 312.
In the embodiment of the application, the first operation unit can acquire the repeated data in the two convolution operations from the second operation unit without reading the repeated data to be convolved from the register file, so that the utilization rate of the data is improved, the data quantity interacted between the operation unit and the register file is reduced, and the expense of the access bandwidth between the operation unit and the register file is reduced.
Optionally, the computing apparatus 300 may further include a plurality of register groups 320 configured to store the plurality of data for the plurality of convolution operations; the operation units 310 are further configured to obtain the data processed in the convolution operations from the register groups 320. For example, some of the operation units 310 may obtain from the register groups 320 the data of the current convolution operation that does not overlap with the previous convolution operation, while the remaining operation units 310 obtain the overlapping data from other operation units 310. Alternatively, in some convolution operations there is no repeated data between the current and the previous convolution operation, so all the operation units 310 need to obtain the data to be convolved from the register groups 320. For example, in the first convolution operation, when the convolution kernel moves along the row direction to a new row, or when it moves along the column direction to a new column, all the operation units 310 need to obtain the data to be convolved from the register groups 320.
The plurality of data includes the first data and the second data. In one example, assume that the first data is stored in a first register group of the plurality of register groups and the second data is stored in a second, different register group. If the data processed in the (i − 1)-th convolution operation includes neither the first data nor the second data, then in the i-th convolution operation the first operation unit acquires the first data from the first register group and the second operation unit acquires the second data from the second register group.
Optionally, the plurality of register groups 320 may also be referred to as a register file. Each register group includes a plurality of registers, each of which stores one data. Each register group may be provided with one read port; that is, each of the operation units can read one data from a register group in each clock cycle.
In some examples, the plurality of operation units 310 perform the multiplications of one convolution operation in each convolution period. For example, assume the convolution kernel corresponding to the convolution operation includes K × K weight data, where K is an integer greater than 1. The plurality of operation units 310 may then include K × K operation units, used respectively to perform the K × K multiplications corresponding to the K × K weight data.
The K × K operation units may be divided into K groups, each containing K operation units. In one example, the K groups correspond to the K columns of weight data in the K × K weight data, and the K operation units in each group correspond one to one, in order, to the K weight data in that column. Alternatively, the K groups may correspond to the K rows of weight data, with the K operation units in each group corresponding one to one, in order, to the K weight data in that row. In this embodiment of the application, the description assumes that the convolution kernel moves along the row direction of the input matrix. Those skilled in the art will appreciate that the scheme in which the convolution kernel moves along the column direction of the input matrix is similar and is not described in detail here.
The K groups of operation units are operable to perform, in the above-mentioned i-th convolution operation, the K × K multiplications corresponding to the K × K weight data. After the i-th convolution operation, the (K − t)-th to K-th groups of operation units obtain the data required by the (i + 1)-th convolution operation from the plurality of register groups 320, where t is the sliding stride of the convolution operation and t is an integer greater than or equal to 1. The y-th group of operation units reads the second data from the (y + t)-th group, where y is a positive integer and 1 ≤ y ≤ K − t; specifically, the d-th operation unit in the y-th group reads the second data from the d-th operation unit in the (y + t)-th group. The second data is the data to be convolved that is repeated between the i-th and (i + 1)-th convolution operations.
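The sourcing rule above can be tabulated (a sketch under the assumption that a group slides data in from the group t positions ahead whenever such a group exists, and otherwise fetches fresh data from the register groups):

```python
def data_sources(K, t):
    """For each of the K groups (1-based), where it obtains the data for the
    (i+1)-th convolution operation, given sliding stride t."""
    return {y: f"slide from group {y + t}" if y + t <= K
               else "fetch from register groups"
            for y in range(1, K + 1)}

print(data_sources(3, 1))
# {1: 'slide from group 2', 2: 'slide from group 3', 3: 'fetch from register groups'}
```

With a larger stride t, more groups fall past the end and must fetch fresh data, which matches the intuition that a bigger slide leaves less overlap to reuse.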
Optionally, the first arithmetic unit in fig. 3 may be any one of the operation units in the y-th group, and the second arithmetic unit in fig. 3 may be the corresponding operation unit in the (y + t)-th group.
In some examples, adjacent operation units in each row of the K × K operation units are connected to each other. This connection manner supports data sliding between the operation units: some of the operation units acquire the data that is not shared between two convolution operations from the plurality of register groups, while the remaining operation units acquire the data shared by two successive convolution operations from their adjacent units. Repeatedly reading the same data from the plurality of register groups is thereby avoided, reducing the amount of data exchanged between the operation units and the register file.
The connection of the K × K arithmetic units is not limited to linking adjacent units within each row; other connection manners may be used. For example, adjacent units within each column of operation units may be connected to each other, or all the operation units within a row may be fully connected. Any connection manner is acceptable as long as it enables the first arithmetic unit 311 to acquire the second data from the second arithmetic unit 312.
As an example, fig. 4 illustrates a detailed block diagram of a computing device 400 in one embodiment of the application. As shown in fig. 4, the computing device 400 includes a convolution processing unit 31 and a plurality of register groups 320. Assume that the convolution kernel includes K × K weight data. The convolution processing unit 31 includes K × K arithmetic units 310 and an adder tree unit 316. The K × K arithmetic units 310 correspond one-to-one to the K × K weight data and respectively perform the multiplications corresponding to the K × K weight data. The K × K arithmetic units 310 are arranged in K rows and K columns, with adjacent arithmetic units in each row connected. Specifically, each arithmetic unit 310 may include a first register 3101, a second register 3102 and a multiplier 3103, the multiplier 3103 being connected to the first register 3101 and the second register 3102. The first register 3101 stores the weight data corresponding to the arithmetic unit 310, and the second register 3102 stores the data to be convolved corresponding to the arithmetic unit 310 in one convolution operation. The multiplier 3103 multiplies the data stored in the first register 3101 and the second register 3102. The adder tree unit 316 receives the multiplication results from the K × K arithmetic units 310 and accumulates them to obtain the result of one convolution operation.
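A rough software analogue of one operation unit and the adder tree may help fix ideas. The class below is illustrative only — the patent describes hardware registers, a multiplier and an adder tree, not objects — and the weight and data values are made up.

```python
class OperationUnit:
    """Software stand-in for one operation unit 310."""
    def __init__(self, weight):
        self.first_register = weight     # weight datum, loaded once
        self.second_register = None      # datum to be convolved

    def multiply(self):
        return self.first_register * self.second_register

def convolve_once(units, window):
    """Load one K x K window into the units, multiply, then accumulate
    all products as the adder tree unit 316 would."""
    for unit, datum in zip(units, window):
        unit.second_register = datum
    return sum(u.multiply() for u in units)

units = [OperationUnit(w) for w in (1, 0, -1, 1, 0, -1, 1, 0, -1)]
result = convolve_once(units, [10, 20, 30, 40, 50, 60, 70, 80, 90])
# result == -60: (10 - 30) + (40 - 60) + (70 - 90)
```

The weights stay resident in the first registers across convolutions; only the second registers are reloaded each step.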
Optionally, the computing device 400 may further include a memory 330 and a control unit 340. Wherein the memory 330 may be connected to the plurality of register sets 320 through a bus. The memory 330 is used to store data to be convolved and input the data to be convolved to a plurality of register groups through a bus. The control unit 340 may be used to control the convolution processing unit 31 and the plurality of register sets 320 to implement the above-described operations.
The embodiment of the present application does not limit the type of the memory. For example, the Memory may be a Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM).
Optionally, in the embodiment of the present application, the storage manner of the data to be convolved in the plurality of register groups 320 is not limited. For example, the plurality of register groups 320 may be partitioned into areas, each area corresponding to one arithmetic unit 310; each arithmetic unit 310 reads only the required data to be convolved from its own area and does not read data from other areas.
As an example, fig. 5 shows a connection manner of a plurality of arithmetic units 310 and a plurality of register groups 320 in the embodiment of the present application. The plurality of arithmetic units 310 correspond to the plurality of register groups 320 one-to-one, and each arithmetic unit is connected to one register group. Each register group can be used for storing data to be convolved corresponding to one operation unit. Each arithmetic unit reads in one data from the corresponding register set in each clock cycle.
However, in the partitioned storage manner, since each arithmetic unit can only read data from its corresponding area, if the data to be convolved by several arithmetic units overlap, the corresponding partitions in the plurality of register groups 320 must each read and store that data from the memory 330. This occupies storage space in the register groups 320 and also increases the bandwidth required for the register groups 320 to read data from the memory. For example, with continued reference to fig. 1, if in one convolution operation the first arithmetic unit 311 reads in data 10 and the second arithmetic unit 312 reads in data 27, then in the convolution operation performed after the convolution kernel slides right by one step, the first arithmetic unit 311 needs to read in data 27 and the second arithmetic unit 312 needs to read in data 69. The data 27 is thus read and stored twice, once by the partition of the first arithmetic unit 311 and once by that of the second arithmetic unit 312.
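The amount of duplication is easy to quantify. As a back-of-the-envelope check (illustrative code, not part of the apparatus), a K × K window and the next window slid right by step t share K × (K − t) data, each of which the partitioned layout would store and fetch twice:

```python
def window(r, c, K):
    """Set of (row, col) coordinates covered by a K x K window at (r, c)."""
    return {(r + i, c + j) for i in range(K) for j in range(K)}

K, t = 3, 1
w_i = window(0, 0, K)        # i-th convolution window
w_next = window(0, t, K)     # (i + 1)-th window, slid right by step t
shared = w_i & w_next
assert len(shared) == K * (K - t)   # 6 of the 9 data are shared
```

For K = 3 and step 1, two thirds of every window is redundant traffic under partitioning, which is what the sliding connections eliminate.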
Alternatively, the plurality of register sets 320 may not be partitioned, and each arithmetic unit may access data stored in any of the register sets. In a non-partitioned manner, each of the plurality of arithmetic units 310 is configured to obtain data stored in any one of the plurality of register sets 320.
As an example, fig. 6 shows a connection manner of a plurality of arithmetic units 310 and a plurality of register sets 320 in a further embodiment of the present application. It should be noted that the connection manner or structure shown in fig. 6 may be applied to the computing device provided in the present application, for example, the computing device in fig. 1, fig. 3, or fig. 4. The method can also be applied to any other computing device, and the embodiment of the application is not limited to this.
As shown in fig. 6, a fully-connected switching network may be provided between the plurality of arithmetic units 310 and the plurality of register groups 320. The fully-connected switching network implements a full connection between the arithmetic units 310 and the register groups 320: through it, any arithmetic unit 310 can be connected to any one of the plurality of register groups 320. In other words, each of the plurality of arithmetic units can acquire data stored in any one of the plurality of register groups.
For example, in the ith convolution operation, the first operation unit 311 may obtain the first data in a first register group of the plurality of register groups 320, and in the jth convolution operation, the first operation unit 311 may obtain the third data in a second register group of the plurality of register groups 320. Wherein i and j are positive integers, and i is not equal to j.
In the embodiment of the present application, the multiple operation units 310 and the multiple register groups 320 are connected by using a fully-connected switching network, so that any operation unit 310 in the multiple operation units 310 can access any register group, and therefore, the multiple register groups 320 only need to read and store data to be convolved once, and do not need to read from the memory 330 multiple times and store a large amount of repeated data in multiple convolution operations, so that the storage space of the multiple register groups 320 can be saved, the pressure of access bandwidth between the multiple register groups 320 and the memory 330 is reduced, and the parallelism of data access is improved.
However, in the non-partitioned storage manner, each register group 320 has only one read port, so each arithmetic unit 310 can read only one datum from a given register group 320 per clock cycle. If several data to be convolved in one convolution operation are stored in the same register group 320, a data read conflict arises, and the plurality of arithmetic units 310 require several clock cycles to read out all the required data. Such a storage manner may limit the processing speed of the arithmetic units 310.
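The conflict can be made concrete with a small check. Suppose, as a deliberately naive layout (not the arrangement proposed by the embodiment), the M × N data were streamed into the K × K register groups in plain row-major cyclic order; for K = 3 and N = 16 the very first window would already need two data from the same group:

```python
K, N = 3, 16

def naive_group(x, s):
    """Row-major cyclic fill: the datum at row x, column s (0-based)
    lands in group (x * N + s) mod (K * K)."""
    return (x * N + s) % (K * K)

window_groups = [naive_group(x, s) for x in range(K) for s in range(K)]
# Fewer distinct groups than data means some group must serve two reads:
assert len(set(window_groups)) < K * K
```

Here the nine window data fall into only seven distinct groups, so two single-ported groups are each asked for two data in the same cycle.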
In order to solve the above problem, the embodiment of the present application further provides a way of storing data to be convolved in a plurality of register groups 320. The manner in which the plurality of register sets 320 store data to be convolved will be described continuously below.
In some embodiments, the plurality of data to be convolved in the plurality of convolution operations may be arranged such that, during the i-th convolution operation, the data read by the plurality of arithmetic units 310 from the plurality of register groups 320 are located in different register groups of the plurality of register groups 320.
The above-mentioned i-th convolution operation may be any one of a plurality of convolution operations. The data read from the register sets 320 by the operation units 310 may refer to the data required to be read from the register sets 320 in the i-th convolution operation. The data that needs to be read from the plurality of register sets 320 in the ith convolution operation may be all data processed in the ith convolution operation, or may be partial data processed in the ith convolution operation. For example, if there is partially duplicated data in the i-th convolution operation and the i-1-th convolution operation, duplicated data may be obtained from the inside of the plurality of operation units 310, and non-duplicated data may be obtained from the plurality of register groups 320. The data that needs to be read from the plurality of register sets 320 in the i-th convolution operation is the non-repetitive data. If there is no repeated data in the ith convolution operation and the (i-1) th convolution operation, the data that needs to be read from the plurality of register sets 320 in the ith convolution operation is all the data to be processed in the ith convolution operation.
In the embodiment of the application, a flexible data arrangement mode is adopted in a plurality of register groups, so that data can be read out as much as possible in one clock cycle, data reading conflict is avoided, and a non-blocking convolution operation process is realized.
There are various specific ways to arrange the plurality of data to be convolved. In one example, the convolution kernel size corresponding to the convolution operation is K × K, the plurality of register groups includes K × K register groups, the plurality of data includes M rows and N columns of data, and the distribution of the plurality of data in the K × K register groups satisfies the following condition:

the datum in the x-th row and s-th column of the plurality of data is stored in the b-th register group, and the datum in the (x + a)-th row and s-th column is stored in the [b + (a × K)] mod (K × K)-th register group, where K and a are integers not less than 1, mod denotes the remainder operation, M ≥ x ≥ 1, N ≥ s ≥ 1, K × K ≥ b ≥ 1, and M, N, x, s, b and a are positive integers.
Alternatively, on the basis of satisfying the above condition, each of the M rows of data may be arranged sequentially and cyclically across the plurality of register groups. For example, the first row of data may be placed in order from the first register group to the (K × K)-th register group and then continue cyclically from the first register group.
Optionally, the plurality of data may be arranged such that any K consecutive data in each row of the M rows of data are stored in different register groups, and/or any K consecutive data in each column of the N columns of data are stored in different register groups.
In some embodiments, the above arrangement is only an example, and the arrangement of the plurality of data may also be modified and converted with reference to the above formula. For example, the plurality of register groups may not be limited to K × K register groups, as long as data to be convolved in one convolution operation is stored in different register groups. Alternatively, after the rows and columns of the plurality of data are inverted, the data may still be stored according to the above arrangement.
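Read with 0-based indices, the condition can be checked mechanically. The helper below is one illustrative reading of the rule (assuming, in addition, the cyclic within-row arrangement described above, so that consecutive data in a row occupy consecutive register groups); under it every K × K window touches all K × K register groups exactly once:

```python
def group_of(x, s, K):
    """0-based reading of the placement rule: row x, column s lands in
    register group (x * K + s) mod (K * K).  Moving down a rows adds
    a * K, matching [b + (a * K)] mod (K * K) in the text."""
    return (x * K + s) % (K * K)

K, M, N = 3, 8, 16
for x0 in range(M - K + 1):
    for s0 in range(N - K + 1):
        groups = {group_of(x0 + i, s0 + j, K)
                  for i in range(K) for j in range(K)}
        assert len(groups) == K * K   # every window reads K*K distinct groups
```

The distinctness follows because the within-window offsets i × K + j run over all residues 0 … K² − 1 modulo K², so no two data of one window can collide in the same group.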
Fig. 7 is a specific schematic diagram of an arrangement of a plurality of data to be convolved in a plurality of register groups in this embodiment. For ease of understanding, a specific arrangement is described below with reference to fig. 7. As shown in fig. 7, the convolution kernel is a 3 × 3 weight matrix, the plurality of arithmetic units includes 3 × 3 arithmetic units, and the plurality of register groups includes 3 × 3 register groups. The arithmetic units are connected to the register groups through a fully-connected switching network. The plurality of register groups may constitute a register array, in which each row represents one register group and each register group includes a number of registers. The columns of the register array are numbered r1, r2, … rn, and the rows (i.e. the register groups) are numbered b1, b2, … b9. Fig. 7 also shows the plurality of data to be convolved required for the plurality of convolution operations. The plurality of data includes multiple rows and multiple columns of data; for ease of illustration, the data are labelled 1, 2, 3, …, 256.
The plurality of data may be arranged in the plurality of register groups in the manner described above. For example, the first row of data 1, 2, …, 16 may be stored sequentially and cyclically in the first register group b1 through the ninth register group b9. If the first datum of the first row is stored in the b-th register group, the first datum of the (1 + a)-th row is stored in the [b + (a × K)] mod (K × K)-th register group. For example, the first datum 1 of the first row is located in the first register group, i.e. b = 1; by the above rule, the first datum 17 of the second row is stored in the [1 + (1 × 3)] mod 9 = 4-th register group, and the first datum 33 of the third row in the [1 + (2 × 3)] mod 9 = 7-th register group. The second row of data may then be stored sequentially and cyclically in the plurality of register groups starting from the 4th register group, and the third row starting from the 7th register group. With this arrangement, the plurality of arithmetic units can read all the data required for one convolution operation in one clock cycle. In addition, as shown in fig. 7, assume the 3 × 3 weight data are q1 to q9; the weight data may likewise be stored in different register groups of the plurality of register groups. With this arrangement, before the plurality of convolution operations begin, the plurality of arithmetic units can read all the weight data in one clock cycle and then read the data to be convolved.
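The fig. 7 numbers can be reproduced with the same rule in 1-indexed form (the helper name is illustrative):

```python
def group_1indexed(x, s, K=3):
    """Register group (b1..b9) holding the datum in row x, column s
    of the fig. 7 arrangement, with rows, columns and groups from 1."""
    return ((x - 1) * K + (s - 1)) % (K * K) + 1

assert group_1indexed(1, 1) == 1   # datum 1  -> b1
assert group_1indexed(2, 1) == 4   # datum 17 -> b4
assert group_1indexed(3, 1) == 7   # datum 33 -> b7
# The row-to-row relation [b + (a * K)] mod (K * K) also checks out:
b = group_1indexed(1, 1)
assert (b + 1 * 3) % 9 == group_1indexed(2, 1)
assert (b + 2 * 3) % 9 == group_1indexed(3, 1)
```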
For ease of understanding, the following describes in detail the processing procedure of a convolution operation performed by the computing apparatus in the embodiment of the present application with reference to fig. 8. Fig. 8 is a diagram of a specific example of a convolution operation in a further embodiment of the present application. The convolution kernel in fig. 8 is assumed to be 3 × 3; the example employs the computing apparatus shown in fig. 3 or fig. 4 and the data arrangement of fig. 7. The convolution operation proceeds as follows:
S801, the weight data corresponding to the convolution kernel are loaded from the memory 330 into the plurality of register groups 320, with the 3 × 3 weight data stored in different register groups 320.
S802, the plurality of data corresponding to the plurality of convolution operations are loaded into the plurality of register groups 320, arranged according to the rule shown in fig. 7.
S803, the weight data corresponding to the 3 × 3 arithmetic units 310 are read from the register groups 320 and stored in the first register 3101 of each arithmetic unit 310. Since the weight data are stored in order across different register groups, they can be loaded into the arithmetic units 310 without read conflicts.
S804, in the first convolution operation, the first 3 data of the first 3 rows of the data to be convolved (3 × 3 data in total) are read into the plurality of arithmetic units 310. Owing to the arrangement rule of the data to be convolved, these 3 × 3 data come from different register groups among the 3 × 3 register groups 320, so all data required by the first convolution operation can be read in one clock cycle. The 3 × 3 arithmetic units 310 then perform the multiplications corresponding one-to-one to the 3 × 3 weight data and output the results.
S805, the 3 × 3 arithmetic units 310 send their output results to the adder tree unit 316 (accumulation unit), which produces the result of the first convolution operation; the result may be written back to the register file. The register file may include the above-mentioned plurality of register groups 320 and possibly other register groups. When the result is written back, its storage position does not conflict with the positions of the previously stored data to be convolved.
S806, before the second convolution operation, the third-column arithmetic units 310 read the 4th data of the first 3 rows of the plurality of data from the plurality of register groups 320.
S807, the third-column arithmetic units 310 transfer the data they processed in the first convolution operation to the second-column arithmetic units 310, and the second-column units transfer theirs to the first-column units. Each arithmetic unit 310 then performs its multiplication for the second convolution operation.
S808, the arithmetic units 310 send the outputs of the 3 × 3 units to the adder tree unit 316 (accumulation unit) to obtain the result of the second convolution operation, which is written back to the plurality of register groups 320. Subsequent convolution operations repeat the above process.
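For illustration, steps S804 to S808 can be walked through end-to-end in software for a step-1 slide along one row. The sketch is a behavioural model under stated assumptions (3 × 3 kernel, a single output row, made-up data values), not the hardware flow itself.

```python
def conv_row_sliding(inp, kernel):
    """One output row of a no-padding convolution with step 1: the K
    column groups of units slide left each step (S807) and only the
    rightmost group loads a fresh column from storage (S806)."""
    K = len(kernel)
    cols = [[inp[i][j] for i in range(K)] for j in range(K)]      # S804

    def accumulate():   # multiply and sum, as the adder tree would (S805/S808)
        return sum(kernel[i][j] * cols[j][i]
                   for i in range(K) for j in range(K))

    out = [accumulate()]
    for s in range(K, len(inp[0])):
        cols = cols[1:] + [[inp[i][s] for i in range(K)]]         # S806/S807
        out.append(accumulate())
    return out

inp = [[r * 8 + c + 1 for c in range(8)] for r in range(3)]   # 3 x 8 data
kernel = [[1, 0, -1]] * 3
out = conv_row_sliding(inp, kernel)
# Cross-check against a direct window-by-window computation:
ref = [sum(kernel[i][j] * inp[i][s + j]
           for i in range(3) for j in range(3)) for s in range(6)]
assert out == ref
```

Each loop iteration fetches exactly one fresh column from storage, mirroring the one-read-per-group-per-cycle constraint of the register groups.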
Method embodiments of the present application are described below, which correspond to apparatus embodiments, and therefore reference may be made to the foregoing apparatus embodiments for those parts not described in detail.
Fig. 9 is a schematic flowchart of a calculation method according to an embodiment of the present application. The calculation apparatus to which the calculation method is applied includes a plurality of operation units for performing a plurality of convolution operations on a plurality of data, the plurality of operation units including a first operation unit and a second operation unit;
the method comprises the following steps:
S901, in the i-th convolution operation of the plurality of convolution operations, the first arithmetic unit performs a multiplication operation on first weight data and first data in the plurality of data, where i is an integer greater than 0;
S902, the first arithmetic unit receives second data sent by the second arithmetic unit, where the second data is data on which the second arithmetic unit performs a convolution operation in the i-th convolution operation;
S903, in the (i + 1)-th convolution operation of the plurality of convolution operations, the first arithmetic unit performs a multiplication operation on the first weight data and the second data.
Optionally, the computing device in fig. 9 further comprises a plurality of register sets for storing the plurality of data; the calculation method in fig. 9 further includes: controlling the plurality of arithmetic units to acquire the plurality of data required in the plurality of convolution operations from a plurality of register groups of the computing device.
Alternatively, in the method of fig. 9, each of the plurality of arithmetic units is configured to obtain data stored in any one of the plurality of register sets.
Optionally, in the method of fig. 9, during the i-th convolution operation, data read from the plurality of register groups by the plurality of operation units is located in different register groups of the plurality of register groups.
Optionally, in the method in fig. 9, the convolution kernel size corresponding to the convolution operation is K × K, the plurality of register groups includes K × K register groups, the plurality of data includes M rows and N columns of data, and the distribution of the plurality of data in the K × K register groups satisfies the following condition: the datum in the x-th row and s-th column of the plurality of data is stored in the r-th register group, and the datum in the (x + a)-th row and s-th column is stored in the [r + (a × K)] mod (K × K)-th register group, where K and a are integers not less than 1, mod denotes the remainder operation, M ≥ x ≥ 1, N ≥ s ≥ 1, K × K ≥ r ≥ 1, and M, N, x, s, r and a are positive integers.
In the embodiment of the application, a computing device supporting sliding of data to be convolved among computing units is provided, which can support a plurality of computing units to read data to be convolved from adjacent computing units, so as to avoid reading a large amount of repeated data from a plurality of register groups, reduce the interactive data amount between the computing units and the plurality of register groups, reduce data access conflicts, and improve the computing performance of the architecture. Furthermore, the plurality of operation units and the plurality of register groups can be connected by adopting a full-connection switching network, so that any operation unit in the plurality of operation units can access any register group, therefore, the plurality of register groups only need to read and store data to be convolved once, and do not need to read from the memory for multiple times and store a large amount of repeated data in multiple convolution operations, so that the storage space of the plurality of register groups can be saved, the pressure of access bandwidth between the plurality of register groups and the memory is reduced, and the parallelism of data access is improved. In the embodiment of the application, a flexible data arrangement mode can be adopted in a plurality of register groups, and data can be read out as much as possible in one clock period, so that data reading conflict is avoided, and a non-blocking convolution operation process is realized.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the portions thereof that substantially contribute to the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (6)

1. A computing device, comprising:
a plurality of register groups for storing a plurality of data;
a plurality of operation units for acquiring the plurality of data required in a plurality of convolution operations from the plurality of register groups, wherein each of the plurality of operation units is capable of acquiring data stored in any one of the plurality of register groups;
the plurality of operation units are further used for performing convolution operation on the plurality of data for a plurality of times, and the plurality of operation units comprise a first operation unit and a second operation unit;
the first arithmetic unit is configured to:
performing multiplication operation on first weight data and first data in the plurality of data in the ith convolution operation of the plurality of convolution operations, wherein i is an integer greater than 0;
receiving second data sent by the second operation unit, wherein the second data is data for the second operation unit to perform convolution operation in the ith convolution operation;
performing multiplication operation on the first weight data and the second data in the (i + 1) th convolution operation of the plurality of times of convolution operations;
wherein, during the ith convolution operation, data read from the plurality of register groups by the plurality of arithmetic units is located in different register groups of the plurality of register groups;
the convolution kernel size corresponding to the convolution operation is K x K, the plurality of register groups comprise K x K register groups, the plurality of data comprise M rows and N columns of data, and the distribution of the plurality of data in the K x K register groups meets the following conditions:
the datum in the x-th row and s-th column of the plurality of data is stored in the b-th register group, and the datum in the (x + a)-th row and s-th column is stored in the [b + (a × K)] mod (K × K)-th register group, wherein K is an integer not less than 1, mod represents the remainder operation, M ≥ x ≥ 1, N ≥ s ≥ 1, K × K ≥ b ≥ 1, and M, N, x, s, b and a are positive integers.
2. The apparatus of claim 1,
the first arithmetic unit is specifically configured to, in the ith convolution operation, obtain the first data from a first register set of the plurality of register sets;
the second arithmetic unit is specifically configured to, in the ith convolution operation, obtain the second data from a second register set of the plurality of register sets, where the first register set and the second register set are different register sets.
3. A calculation method characterized in that a calculation apparatus to which the calculation method is applied includes a plurality of operation units for performing a plurality of convolution operations on a plurality of data, the plurality of operation units including a first operation unit and a second operation unit, and a plurality of register groups for storing the plurality of data, wherein each of the plurality of operation units is capable of acquiring data stored in any one of the plurality of register groups;
the method comprises the following steps:
in the ith convolution operation of the multiple convolution operations, the first operation unit performs multiplication operation on first weight data and first data in the multiple data, wherein i is an integer greater than 0;
the first arithmetic unit receives second data sent by the second arithmetic unit, wherein the second data is data for the second arithmetic unit to execute convolution operation in the ith convolution operation;
in the (i + 1) th convolution operation of the plurality of convolution operations, the first operation unit performs multiplication operation on the first weight data and the second data;
wherein, during the ith convolution operation, data read from the plurality of register groups by the plurality of arithmetic units is located in different register groups of the plurality of register groups;
the convolution kernel size corresponding to the convolution operation is K x K, the plurality of register groups comprise K x K register groups, the plurality of data comprise M rows and N columns of data, and the distribution of the plurality of data in the K x K register groups meets the following conditions:
the datum in the x-th row and s-th column of the plurality of data is stored in the r-th register group, and the datum in the (x + a)-th row and s-th column is stored in the [r + (a × K)] mod (K × K)-th register group, wherein K is an integer not less than 1, mod represents the remainder operation, M ≥ x ≥ 1, N ≥ s ≥ 1, K × K ≥ r ≥ 1, and M, N, x, s, r and a are positive integers.
4. The computing method according to claim 3, wherein the method further comprises:
in the ith convolution operation, the first operation unit acquires the first data from a first register group of the plurality of register groups;
in the ith convolution operation, the second operation unit acquires the second data from a second register group of the plurality of register groups, wherein the first register group and the second register group are different register groups.
5. A computing device, characterized by comprising a plurality of operation units and a plurality of register groups, wherein each operation unit is connected to the plurality of register groups;
the plurality of register groups are configured to store a plurality of data on which a plurality of convolution operations are to be performed;
the plurality of operation units are configured to acquire the plurality of data from the plurality of register groups and to perform the plurality of convolution operations on the plurality of data, wherein each of the plurality of operation units is capable of acquiring data stored in any one of the plurality of register groups;
during the ith convolution operation of the plurality of convolution operations, the data read from the plurality of register groups by the plurality of operation units are located in different register groups of the plurality of register groups, wherein i is an integer greater than 0;
the convolution kernel corresponding to the convolution operations has a size of K × K, the plurality of register groups comprise K × K register groups, the plurality of data comprise M rows and N columns of data, and the distribution of the plurality of data across the K × K register groups satisfies the following condition:
the data in row x and column s of the plurality of data is stored in register group b, and the data in row x + a and column s is stored in register group [b + (a × K)] mod (K × K), wherein K is an integer not less than 1, mod denotes the modulo operation, M ≥ x ≥ 1, N ≥ s ≥ 1, K × K ≥ b ≥ 1, and M, N, x, s, b and a are positive integers.
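The shift relation stated in claim 5, that moving down a rows in the same column advances the register-group index by a × K modulo K × K, can be checked against a concrete mapping. The mapping below is an illustrative assumption (0-indexed groups rather than the claim's 1-indexed numbering; the shift property is unaffected by that offset):

```python
# Assumed 0-indexed mapping; the claim numbers groups from 1, but the
# shift-by-(a*K) relation is independent of the numbering offset.
def register_group(x, s, K):
    return ((x - 1) % K) * K + ((s - 1) % K)

K = 3
for x in range(1, 10):
    for s in range(1, 10):
        for a in range(1, 5):
            b = register_group(x, s, K)
            # row x+a, same column -> group index advances by a*K, modulo K*K
            assert register_group(x + a, s, K) == (b + a * K) % (K * K)
```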
6. The device according to claim 5, wherein the plurality of operation units comprise a first operation unit and a second operation unit,
the first operation unit being configured to:
in the ith convolution operation of the plurality of convolution operations, multiply first weight data by first data of the plurality of data, wherein i is an integer greater than 0;
receive second data sent by the second operation unit, wherein the second data is the data on which the second operation unit performs the convolution operation in the ith convolution operation; and
in the (i + 1)th convolution operation of the plurality of convolution operations, multiply the first weight data by the second data.
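The data reuse described in claims 3 and 6, where the first operation unit keeps its weight and, in operation i + 1, multiplies it by the operand the second operation unit used in operation i, resembles a shift chain between neighbouring units. The 1-D sketch below is an illustration under assumed names, not the patented circuit: each unit u holds its weight w[u] fixed, and between operations each unit receives its neighbour's operand instead of issuing a new register-group read.

```python
def conv1d_with_forwarding(data, weights):
    """1-D convolution in which operands are forwarded between operation units."""
    K = len(weights)
    n_out = len(data) - K + 1
    out = []
    regs = list(data[:K])                  # unit u starts holding data[u]
    for i in range(n_out):                 # the i-th convolution operation
        # each unit multiplies its fixed weight by the operand it holds
        out.append(sum(weights[u] * regs[u] for u in range(K)))
        if i + 1 < n_out:
            # unit u receives the operand unit u+1 used in this operation;
            # only the last unit fetches one fresh element from storage
            regs = regs[1:] + [data[K + i]]
    return out

# matches a direct sliding-window convolution
assert conv1d_with_forwarding([1, 2, 3, 4, 5], [1, 2, 3]) == [14, 20, 26]
```

Only one new element per convolution operation crosses the register-group interface; the other K − 1 operands arrive through inter-unit forwarding, which is the bandwidth saving the claims aim at.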
CN201810376471.0A 2018-04-25 2018-04-25 Computing device and computing method Active CN110399976B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810376471.0A CN110399976B (en) 2018-04-25 2018-04-25 Computing device and computing method
PCT/CN2019/084011 WO2019206162A1 (en) 2018-04-25 2019-04-24 Computing device and computing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810376471.0A CN110399976B (en) 2018-04-25 2018-04-25 Computing device and computing method

Publications (2)

Publication Number Publication Date
CN110399976A CN110399976A (en) 2019-11-01
CN110399976B true CN110399976B (en) 2022-04-05

Family

ID=68293769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810376471.0A Active CN110399976B (en) 2018-04-25 2018-04-25 Computing device and computing method

Country Status (2)

Country Link
CN (1) CN110399976B (en)
WO (1) WO2019206162A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113536220A (en) * 2020-04-21 2021-10-22 中科寒武纪科技股份有限公司 Operation method, processor and related product
CN113807506B (en) * 2020-06-11 2023-03-24 杭州知存智能科技有限公司 Data loading circuit and method

Citations (4)

Publication number Priority date Publication date Assignee Title
CN101841730A (en) * 2010-05-28 2010-09-22 浙江大学 Real-time stereoscopic vision implementation method based on FPGA
CN103092767A (en) * 2013-01-25 2013-05-08 浪潮电子信息产业股份有限公司 Management method for cloud computing interior physical machine information memory pool
CN106844294A (en) * 2016-12-29 2017-06-13 华为机器有限公司 Convolution algorithm chip and communication equipment
CN107832843A (en) * 2017-10-30 2018-03-23 上海寒武纪信息科技有限公司 A kind of information processing method and Related product

Family Cites Families (11)

Publication number Priority date Publication date Assignee Title
US6023753A (en) * 1997-06-30 2000-02-08 Billion Of Operations Per Second, Inc. Manifold array processor
US8321637B2 (en) * 2007-05-14 2012-11-27 International Business Machines Corporation Computing system with optimized support for transactional memory
CN101237240B (en) * 2008-02-26 2011-07-20 北京海尔集成电路设计有限公司 A method and device for realizing cirrocumulus interweaving/de-interweaving based on external memory
GB2501791B (en) * 2013-01-24 2014-06-11 Imagination Tech Ltd Register file having a plurality of sub-register files
CN104035750A (en) * 2014-06-11 2014-09-10 西安电子科技大学 Field programmable gate array (FPGA)-based real-time template convolution implementing method
CN104932994B (en) * 2015-06-17 2018-12-07 青岛海信电器股份有限公司 A kind of data processing method and device
US10671564B2 (en) * 2015-10-08 2020-06-02 Via Alliance Semiconductor Co., Ltd. Neural network unit that performs convolutions using collective shift register among array of neural processing units
CN106228240B (en) * 2016-07-30 2020-09-01 复旦大学 Deep convolution neural network implementation method based on FPGA
CN106951395B (en) * 2017-02-13 2018-08-17 上海客鹭信息技术有限公司 Parallel convolution operations method and device towards compression convolutional neural networks
CN107679620B (en) * 2017-04-19 2020-05-26 赛灵思公司 Artificial neural network processing device
CN107844826B (en) * 2017-10-30 2020-07-31 中国科学院计算技术研究所 Neural network processing unit and processing system comprising same

Non-Patent Citations (1)

Title
Design and Optimization of Register Files in a Multithreaded Environment; Wen Liangyong; China Master's Theses Full-Text Database, Information Science and Technology; 2010-05-15 (No. 05); see pages I137-11 *

Also Published As

Publication number Publication date
CN110399976A (en) 2019-11-01
WO2019206162A1 (en) 2019-10-31

Similar Documents

Publication Publication Date Title
CN109034373B (en) Parallel processor and processing method of convolutional neural network
CN108304922B (en) Computing device and computing method for neural network computing
KR102492477B1 (en) Matrix multiplier
US11816572B2 (en) Hardware accelerated machine learning
US20200202198A1 (en) Neural network processor
US10346507B2 (en) Symmetric block sparse matrix-vector multiplication
CN111897579A (en) Image data processing method, image data processing device, computer equipment and storage medium
EP3093757B1 (en) Multi-dimensional sliding window operation for a vector processor
CN110188869B (en) Method and system for integrated circuit accelerated calculation based on convolutional neural network algorithm
CN108073549B (en) Convolution operation device and method
KR20070039490A (en) A bit serial processing element for a simd array processor
CN110399976B (en) Computing device and computing method
CN110580519B (en) Convolution operation device and method thereof
CN110807170A (en) Multi-sample multi-channel convolution neural network Same convolution vectorization implementation method
CN113673701A (en) Method for operating neural network model, readable medium and electronic device
CN110377874B (en) Convolution operation method and system
CN110050259B (en) Vector processor and control method thereof
JP6906622B2 (en) Arithmetic circuit and arithmetic method
JP2002529814A (en) Memory device with vector access
CN113592075B (en) Convolution operation device, method and chip
CN115829000A (en) Data processing method and device, electronic equipment and storage medium
WO2006120620A2 (en) Image processing circuit with block accessible buffer memory
JP2000029864A (en) Arithmetic operation method for vector computer, and recording medium
WO2020059156A1 (en) Data processing system, method, and program
CN117252245A (en) Computing device, method and storage medium for performing convolution operation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant