CN112632464B - Processing device for processing data - Google Patents


Info

Publication number: CN112632464B
Application number: CN202011577665.0A
Authority: CN (China)
Prior art keywords: matrix, data, sub, stage, matrices
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other versions: CN112632464A (Chinese)
Inventor: not disclosed
Current and original assignee: Shanghai Biren Intelligent Technology Co Ltd (the listed assignees may be inaccurate)
Application filed by Shanghai Biren Intelligent Technology Co Ltd
Priority to CN202011577665.0A
Application granted; publication of CN112632464B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

Described is a processing apparatus for processing data, comprising: a systolic array comprising m × m first-stage computing units arranged in an m × m array, where m is a positive integer greater than or equal to 2, wherein each of the m × m first-stage computing units comprises n × n second-stage computing units arranged in an n × n array, where n is a positive integer greater than or equal to 2, each of the n × n second-stage computing units being configured to perform a dot-product operation of a first vector and a second vector, the first and second vectors each comprising n data. The processing apparatus for processing data can significantly improve data computation density, reduce data input delay, and improve hardware utilization.

Description

Processing device for processing data
Technical Field
Embodiments of the present disclosure generally relate to an information processing apparatus, and more particularly, to a processing apparatus for processing data.
Background
In recent years, deep learning has been increasingly used in fields such as image recognition, speech processing, and the like. However, with the increasing depth of networks, the demands on computing power, memory access bandwidth and the like required in the training and inference of deep neural networks are gradually becoming difficult for traditional computing platforms to meet. Therefore, replacing software computation with hardware accelerators is considered an effective way to improve the computational efficiency of neural networks, for example neural network processors implemented with general-purpose graphics processors, dedicated processor chips, and field-programmable gate arrays. Processing devices based on the systolic array architecture have attracted great attention in industry and academia due to their high concurrency and low bandwidth requirements. However, data processing devices of conventional design generally suffer from long delay and waste of hardware resources.
Disclosure of Invention
The present disclosure provides a processing apparatus for processing data, which can significantly improve data computation density, reduce data input delay, and improve hardware utilization.
According to an aspect of the present disclosure, there is provided a processing apparatus for processing data, comprising: a systolic array comprising m × m first-stage computing units arranged in an m × m array, where m is a positive integer equal to or greater than 2, wherein each of the m × m first-stage computing units comprises n × n second-stage computing units arranged in an n × n array, where n is a positive integer equal to or greater than 2, each of the n × n second-stage computing units being configured to perform a dot-product operation of a first vector and a second vector, the first and second vectors each comprising n data.
According to an exemplary embodiment of the present disclosure, each of the n data may be of any of the following types: 4-bit integer, 8-bit integer, 16-bit floating point, 24-bit floating point, or 32-bit floating point.
According to an exemplary embodiment of the present disclosure, data from n first vectors of a first matrix and data from n second vectors of a second matrix may be broadcast into the n × n second-stage computing units simultaneously.
According to an exemplary embodiment of the disclosure, the processing apparatus may be adapted to multiply a first matrix by a second matrix, wherein the first matrix is an M × K matrix and the second matrix is a K × N matrix, where M denotes the number of rows of the first matrix and K denotes the number of columns of the first matrix (equal to the number of rows of the second matrix), where M = mn, K is an integer multiple of n, and N denotes the number of columns of the second matrix, N being a positive integer.
According to an exemplary embodiment of the present disclosure, the first matrix may include K/n first-matrix primary sub-matrices of M × n type, each first-matrix primary sub-matrix of M × n type including M/n first-matrix secondary sub-matrices of n × n type, and the second matrix may include K/n second-matrix primary sub-matrices of n × N type, each second-matrix primary sub-matrix of n × N type including N/n second-matrix secondary sub-matrices of n × n type.
According to an exemplary embodiment of the present disclosure, the n × n second-stage computing units in each first-stage computing unit may be configured to: under each clock signal, receive n first vectors, being rows of a first-matrix secondary sub-matrix, and n second vectors, being columns of a second-matrix secondary sub-matrix, to obtain n × n dot-product operation results.
According to an exemplary embodiment of the present disclosure, each first-stage computing unit may include: a first group of registers for storing the K/n first-matrix primary sub-matrices of M × n type from the first matrix, a second group of registers for storing the K/n second-matrix primary sub-matrices of n × N type from the second matrix, and a third group of registers for storing the obtained dot-product operation results.
According to an exemplary embodiment of the present disclosure, the second-stage computing unit in the i-th row and j-th column of the n × n second-stage computing units may be configured to: receive a first vector of n data from the i-th row of the corresponding first-matrix secondary sub-matrix and a second vector of n data from the j-th column of the corresponding second-matrix secondary sub-matrix, where i = 1, 2, …, n and j = 1, 2, …, n.
According to an exemplary embodiment of the present disclosure, in the case M = N = K, the processing apparatus may be configured to input all the data of the first matrix and the second matrix into the processing apparatus within m cycles.
According to an example embodiment of the present disclosure, the processing apparatus may be configured so that, in the first cycle of data input, the k-th first-stage computing unit in the first column of m first-stage computing units receives: the k-th first-matrix secondary sub-matrix from the first first-matrix primary sub-matrix, and the k-th second-matrix secondary sub-matrix from the first second-matrix primary sub-matrix, where k = 1, 2, …, m.
According to an example embodiment of the present disclosure, the processing apparatus may be configured so that, in the 2nd to m-th cycles of data input, the k-th first-matrix secondary sub-matrix is transferred along a straight-line path to the k-th first-stage computing unit of the m first-stage computing units in the 2nd to m-th columns, respectively, and the k-th second-matrix secondary sub-matrix is transferred in diagonal form to the (k-1)-th first-stage computing unit of the m first-stage computing units in the 2nd to m-th columns.
According to an exemplary embodiment of the present disclosure, in the case where k - 1 equals 0, the k-th second-matrix secondary sub-matrix may be transmitted to the m-th first-stage computing unit in the next column of m first-stage computing units.
According to an example embodiment of the present disclosure, the processing apparatus may be configured so that, in the s-th cycle of data input, the k-th first-stage computing unit in the first column of m first-stage computing units receives: the k-th first-matrix secondary sub-matrix from the s-th first-matrix primary sub-matrix, and the k-th second-matrix secondary sub-matrix from the s-th second-matrix primary sub-matrix, where s = 2, 3, …, m and k = 1, 2, …, m.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be considered limiting of the present application.
Fig. 1 is a schematic diagram of a systolic array based processing device according to the prior art.
Fig. 2 is a schematic diagram of a processing device for processing data according to one embodiment of the present disclosure.
Fig. 3 is a schematic diagram of a processing device having a two-stage structure showing a matrix input into each first-stage computing unit according to one embodiment of the present disclosure.
FIG. 4 is a schematic diagram of a second level computing unit, according to one embodiment of the present disclosure.
Fig. 5 is a schematic illustration of a first matrix processed by a processing device according to one embodiment of the present disclosure.
Fig. 6 is a schematic illustration of a second matrix processed by a processing device according to one embodiment of the present disclosure.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In describing embodiments of the present disclosure, the terms "include" and "comprise," and similar language, are to be construed as open-ended, i.e., "including but not limited to. The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first," "second," and the like may refer to different or the same objects. Other explicit and implicit definitions are also possible below.
The structure of a conventional processing apparatus will be described below with reference to fig. 1 by taking a TPU (tensor processing unit) based on a systolic array as an example.
The TPU in a conventional design usually includes a plurality of basic operation units arranged as a systolic array. Under the control of a clock signal, weight queues (matrices) and data queues (matrices) are propagated into the systolic array, and the whole matrix multiplication is implemented by using control signals to make each basic operation unit continuously multiply and accumulate the received weights and data. In the systolic array, an accumulation register for storing and transmitting intermediate operation results is arranged between the basic operation units in each row and each column, and this accumulation register also needs a control signal to perform its storage operation. In a conventional systolic-array-based operation, only one data element is input to an operation unit per clock signal, and only one multiply-add operation is performed in each operation unit. Since only one data element of the matrix is input to the systolic array under one clock signal, and the systolic array usually includes many computing units, a conventional systolic-array-based computing device usually has high latency. Furthermore, since the source flip-flop (or register) and the destination flip-flop (or register) are both used for a single multiply-accumulate operation, a large number of source or destination flip-flops are required when processing a large amount of data, which is very expensive.
In conventional designs, a flip-flop or register typically has a storage capacity of 4 B (bytes). To store 64 kB of multiply-add results, 16 × 1024 (i.e., 16k) arithmetic units are usually required; for a systolic array with a two-dimensional structure, this means an array of 128 × 128 arithmetic units. Further, since each row and each column contain many computing units, it takes several hundred cycles to propagate all the data to the computing units, so the data-input delay is long.
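As an illustration of the latency just described (our own sketch, not an implementation from the patent), the following models an output-stationary systolic array in which each unit performs exactly one multiply-add per clock and operands arrive skewed; for p × p units multiplying p × p matrices, the last unit finishes only at clock 3p - 2, which is the kind of delay the disclosure aims to reduce:

```python
# Hypothetical model of a conventional systolic array:
# unit (i, j) receives its k-th pair of operands at clock t = i + j + k,
# so it performs a single multiply-add (MAC) per clock.
def systolic_matmul(A, B):
    p = len(A)
    C = [[0] * p for _ in range(p)]
    clocks = 0
    for t in range(3 * p - 2):              # one clock per iteration
        for i in range(p):
            for j in range(p):
                k = t - i - j               # skewed operand arrival
                if 0 <= k < p:
                    C[i][j] += A[i][k] * B[k][j]   # one MAC per unit per clock
        clocks += 1
    return C, clocks

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C, clocks = systolic_matmul(A, B)
print(C, clocks)   # [[19, 22], [43, 50]] 4
```

Even this tiny 2 × 2 example needs 3p - 2 = 4 clocks; a 128 × 128 array needs several hundred, matching the delay estimate above.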
In view of the various problems in the prior-art solutions mentioned above, embodiments of the present disclosure provide a processing apparatus for processing data, aiming to at least partially solve the above problems; the processing of data is, for example, matrix multiplication. In the scheme of the present disclosure, the processing apparatus has a two-stage structure: on the basis of an m × m systolic array of computing units serving as first-stage computing units, a plurality of second-stage computing units are further provided in each first-stage computing unit, arranged in an n × n array, where m and n are positive integers greater than or equal to 2. In each second-stage computing unit, a dot-product operation (dpn) of a first vector a = [a1, a2, …, an] and a second vector b = [b1, b2, …, bn] is performed: a · b = a1·b1 + a2·b2 + … + an·bn. The processing apparatus performs a multiplication of a first matrix and a second matrix; the first vector may come from the first matrix and the second vector may come from the second matrix, each comprising n data, and the dot product of the first vector and the second vector is called dpn.
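The dpn operation performed by each second-stage computing unit can be sketched as follows (a minimal illustration of the formula above; the function name `dpn` is ours, taken from the disclosure's shorthand):

```python
# dpn dot product: a . b = a1*b1 + a2*b2 + ... + an*bn
def dpn(a, b):
    """Dot product of two vectors of equal length n."""
    assert len(a) == len(b)
    return sum(x * y for x, y in zip(a, b))

a = [1.0] * 8          # first vector, n = 8 data
b = [2.0] * 8          # second vector, n = 8 data
print(dpn(a, b))       # 16.0
```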
In the embodiments of the present disclosure, a processing apparatus with a two-stage structure is employed, and all second-stage computing units perform the dot-product operation of the first and second vectors at the same time, so the computation density of the processing apparatus is high. In addition, since the systolic array includes fewer first-stage computing units, the propagation path of the data is shorter, and thus the delay is shorter.
Embodiments of the present disclosure will be described below in conjunction with fig. 2 to 6.
Fig. 2 schematically shows a processing apparatus for processing data according to an embodiment of the present disclosure. Fig. 3 is a schematic diagram of a processing device having a two-stage structure showing a matrix input into each first-stage computing unit according to one embodiment of the present disclosure. FIG. 4 is a schematic diagram of a second level computing unit, according to one embodiment of the present disclosure. Fig. 5 is a schematic diagram of a first matrix processed by a processing device according to one embodiment of the present disclosure. Fig. 6 is a schematic illustration of a second matrix processed by a processing device according to one embodiment of the present disclosure.
The processing apparatus for processing data performs a matrix multiplication operation, for example a multiplication of a first matrix and a second matrix. The first matrix is an M × K matrix and the second matrix is a K × N matrix, where M denotes the number of rows of the first matrix and K denotes the number of columns of the first matrix (equal to the number of rows of the second matrix), with M = mn, K an integer multiple of n, and N the number of columns of the second matrix, N being a positive integer. Here m denotes the number of rows and columns of the systolic array, and n denotes the number of rows and columns of second-stage computing units in each first-stage computing unit.
In order to more clearly explain a processing device having a two-stage structure according to an embodiment of the present disclosure, an exemplary embodiment of the present disclosure will be explained below by taking K =32, M =32, N =32, M =4, and N =8 as examples. It will be understood by those skilled in the art that the numerical values of K, M, N, M and N are not limited to the above values and may be any positive integer as desired.
In the field of chip design, the larger the value of n, the more wiring is required and the more complicated the design becomes. With existing chip manufacturing processes, manufacturing second-stage computing units arranged in a 16 × 16 array is difficult; therefore, in actual chip designs, the value of n is generally less than 16.
In the exemplary embodiment shown in fig. 2, systolic array 200 is a 4 × 4 array comprising 16 first-stage computing units 201; for clarity of the drawing, only the four first-stage computing units in the last column are labeled with reference numeral 201. As shown in fig. 4, the second-stage computing units make up an 8 × 8 array 300, which in the illustrated embodiment includes a total of 64 second-stage computing units 301; only the eight second-stage computing units in the last column are labeled with reference numeral 301. As shown in fig. 5, the first matrix A is a 32 × 32 matrix, with each data element in the matrix being a 16-bit floating point number. As shown in fig. 6, the second matrix B is also a 32 × 32 matrix, with each data element being a 16-bit floating point number. Thus, the data sizes of matrices A and B are both 16 kb. It will be understood by those skilled in the art that the type of the data in the matrices is not limited to 16-bit floating point numbers, but may be any of the 4-bit integer, 8-bit integer, 16-bit floating point, 24-bit floating point, and 32-bit floating point types.
As shown in fig. 5, the data in the rectangular frame in matrix A represent the first primary sub-matrix A-1 of matrix A; the first primary sub-matrix A-1 includes 32 rows and 8 columns of data, namely the 1st to 8th columns of matrix A. Thus, matrix A includes 32/8 (i.e., K/n = 4) primary sub-matrices A-1, A-2, A-3 and A-4, each with a data size of 4 kb, where the second primary sub-matrix A-2 is the 9th to 16th columns of data of matrix A, the third primary sub-matrix A-3 is the 17th to 24th columns, and the fourth primary sub-matrix A-4 is the 25th to 32nd columns of matrix A. Although only the first primary sub-matrix A-1 is shown in fig. 5, the data included in the other primary sub-matrices can be clearly understood by those skilled in the art from the above description.
As shown in fig. 6, the data in the rectangular frame in matrix B represent the first primary sub-matrix B-1 of matrix B; the first primary sub-matrix B-1 includes 8 rows and 32 columns of data, namely the 1st to 8th rows of matrix B. Thus, matrix B includes 32/8 (i.e., K/n = 4) primary sub-matrices B-1, B-2, B-3 and B-4, each with a data size of 4 kb, where the second primary sub-matrix B-2 is the 9th to 16th rows of data of matrix B, the third primary sub-matrix B-3 is the 17th to 24th rows, and the fourth primary sub-matrix B-4 is the 25th to 32nd rows of matrix B. Although only the first primary sub-matrix B-1 is shown in fig. 6, the data included in the other primary sub-matrices can be clearly understood by those skilled in the art from the above description.
Referring again to FIG. 5, the first primary sub-matrix A-1 of matrix A includes four secondary sub-matrices A1,1, A1,2, A1,3 and A1,4, where the secondary sub-matrix A1,1 is the 1st to 8th rows of data of the primary sub-matrix A-1, the secondary sub-matrix A1,2 is the 9th to 16th rows, the secondary sub-matrix A1,3 is the 17th to 24th rows, and the secondary sub-matrix A1,4 is the 25th to 32nd rows of the primary sub-matrix A-1. Therefore, the secondary sub-matrices A1,1, A1,2, A1,3 and A1,4 are each 8 × 8 matrices, with a data size of 8 × 8 × 16 b = 1 kb each. Table 1 below presents the data contained by each primary sub-matrix and each secondary sub-matrix of matrix A.
Table 1: Matrix A

  Primary sub-matrix (columns of A)   Secondary sub-matrices (rows of A: 1-8 / 9-16 / 17-24 / 25-32)
  A-1 (columns 1-8)                   A1,1 / A1,2 / A1,3 / A1,4
  A-2 (columns 9-16)                  A2,1 / A2,2 / A2,3 / A2,4
  A-3 (columns 17-24)                 A3,1 / A3,2 / A3,3 / A3,4
  A-4 (columns 25-32)                 A4,1 / A4,2 / A4,3 / A4,4

(Each secondary sub-matrix As,k covers rows (k-1)·8+1 to k·8 and columns (s-1)·8+1 to s·8 of matrix A.)
Referring again to fig. 6, the first primary sub-matrix B-1 of matrix B includes four secondary sub-matrices B1,1, B1,2, B1,3 and B1,4, where the secondary sub-matrix B1,1 is the 1st to 8th columns of data of the primary sub-matrix B-1, the secondary sub-matrix B1,2 is the 9th to 16th columns, the secondary sub-matrix B1,3 is the 17th to 24th columns, and the secondary sub-matrix B1,4 is the 25th to 32nd columns of the primary sub-matrix B-1. Therefore, the secondary sub-matrices B1,1, B1,2, B1,3 and B1,4 are each 8 × 8 matrices, with a data size of 8 × 8 × 16 b = 1 kb each. Table 2 below presents the data contained by each primary sub-matrix and each secondary sub-matrix of matrix B.
Table 2: Matrix B

  Primary sub-matrix (rows of B)   Secondary sub-matrices (columns of B: 1-8 / 9-16 / 17-24 / 25-32)
  B-1 (rows 1-8)                   B1,1 / B1,2 / B1,3 / B1,4
  B-2 (rows 9-16)                  B2,1 / B2,2 / B2,3 / B2,4
  B-3 (rows 17-24)                 B3,1 / B3,2 / B3,3 / B3,4
  B-4 (rows 25-32)                 B4,1 / B4,2 / B4,3 / B4,4

(Each secondary sub-matrix Bs,k covers rows (s-1)·8+1 to s·8 and columns (k-1)·8+1 to k·8 of matrix B.)
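The partitioning of matrices A and B into secondary sub-matrices can be sketched as follows (our own illustration; the helper names `secondary_A` and `secondary_B` and the test matrices are assumptions, not from the patent):

```python
n = 8
# Hypothetical 32 x 32 test matrices with distinct entries
A = [[32 * i + j for j in range(32)] for i in range(32)]
B = [[32 * i + j for j in range(32)] for i in range(32)]

def secondary_A(s, k):
    """k-th secondary sub-matrix of A's s-th primary sub-matrix:
    rows (k-1)*n .. k*n-1, columns (s-1)*n .. s*n-1 of A."""
    return [row[(s - 1) * n:s * n] for row in A[(k - 1) * n:k * n]]

def secondary_B(s, k):
    """k-th secondary sub-matrix of B's s-th primary sub-matrix:
    rows (s-1)*n .. s*n-1, columns (k-1)*n .. k*n-1 of B."""
    return [row[(k - 1) * n:k * n] for row in B[(s - 1) * n:s * n]]

print(len(secondary_A(1, 1)), len(secondary_A(1, 1)[0]))   # 8 8
```

Note the asymmetry: A is cut first into column blocks (primary) and then row blocks (secondary), while B is cut first into row blocks and then column blocks, matching Tables 1 and 2.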
As shown in fig. 2, the first-stage computing units 201 constitute a 4 × 4 systolic array. In each of the 16 first-stage computing units 201, 8 × 8 second-stage computing units 301 as shown in fig. 4 are provided. As shown in fig. 4, each second-stage computing unit 301 performs a dot-product operation dp8 of a first vector (comprising 8 data) and a second vector (comprising 8 data). The 8 data of each first vector are broadcast from the left side of the second-stage computing-unit array 300 into each of the 8 second-stage computing units 301 of a row, and the 8 data of each second vector are broadcast from the upper side of the array 300 into each of the 8 second-stage computing units 301 of a column. By "broadcast" it is meant that all data of a vector are input simultaneously, under the same clock signal, to each of the plurality of second-stage computing units 301, rather than only one data element of a vector being input to a computing unit under a clock signal, as in the data-input manner of systolic array 200. In the embodiment shown in fig. 4, the 8 first vectors may come from any of the 16 secondary sub-matrices of matrix A shown in Table 1 above, and the 8 second vectors may come from any of the 16 secondary sub-matrices of matrix B shown in Table 2 above.
As shown in fig. 4, the 8 data of the 1st first vector are broadcast into each of the 8 second-stage computing units 301 in row 1 under the same clock signal; likewise, the 8 data of the 2nd to 8th first vectors are broadcast into each of the 8 second-stage computing units 301 in rows 2 to 8 under that clock signal. Furthermore, the 8 data of the 1st second vector are broadcast into each of the 8 second-stage computing units 301 in column 1 under the same clock signal, and likewise the 8 data of the 2nd to 8th second vectors into each of the 8 second-stage computing units 301 in columns 2 to 8. Therefore, in this exemplary embodiment, the second-stage computing units 301 perform a total of 64 dot-product operations dp8 under the same clock signal, and one dp8 comprises 8 multiply-add operations. In a prior-art systolic-array-based processing device, each computing unit performs only one multiply-add operation under one clock signal, and because of the lag of data input not all computing units are even performing multiply-add operations; therefore, compared with a prior-art systolic-array-based processing device, the computation density in the exemplary embodiment of the present disclosure is significantly increased and the delay time is significantly reduced.
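The effect of one clock on the 8 × 8 second-stage array can be sketched as follows (our own illustration; in one clock, unit (i, j) computes dp8 of row i of an A secondary sub-matrix with column j of a B secondary sub-matrix, i.e. 64 dp8 = 512 multiply-adds at once, equivalent to an 8 × 8 block matrix product):

```python
def dp8(a, b):
    """dp8: dot product of two 8-element vectors."""
    return sum(x * y for x, y in zip(a, b))

def one_clock(A_sub, B_sub):
    """A_sub, B_sub: 8 x 8 secondary sub-matrices (lists of rows).
    Under one clock, all 64 units compute simultaneously; the result
    is the 8 x 8 block product A_sub * B_sub."""
    n = 8
    # column vectors of B_sub, broadcast from the top of the array
    cols = [[B_sub[r][j] for r in range(n)] for j in range(n)]
    # row vectors of A_sub, broadcast from the left of the array
    return [[dp8(A_sub[i], cols[j]) for j in range(n)] for i in range(n)]

I8 = [[1 if i == j else 0 for j in range(8)] for i in range(8)]
print(one_clock(I8, I8) == I8)   # True: identity times identity
```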
As shown in fig. 2, the processing apparatus includes 4 × 4 first-stage computing units 201, and each first-stage computing unit 201 receives operands A1 to A4 from matrix A and operands B1 to B4 from matrix B. In an exemplary embodiment, as shown in fig. 3, operand A1 may be the 1st secondary sub-matrix A1,1, A2,1, A3,1 or A4,1 from one of the four primary sub-matrices of matrix A, and operand B1 may be the 1st secondary sub-matrix B1,1, B2,1, B3,1 or B4,1 from one of the primary sub-matrices of matrix B; operand A2 may be the 2nd secondary sub-matrix A1,2, A2,2, A3,2 or A4,2 from one of the four primary sub-matrices of matrix A, and operand B2 may be the 2nd secondary sub-matrix B1,2, B2,2, B3,2 or B4,2 from one of the primary sub-matrices of matrix B; operand A3 may be the 3rd secondary sub-matrix A1,3, A2,3, A3,3 or A4,3 from one of the four primary sub-matrices of matrix A, and operand B3 may be the 3rd secondary sub-matrix B1,3, B2,3, B3,3 or B4,3 from one of the primary sub-matrices of matrix B; and operand A4 may be the 4th secondary sub-matrix A1,4, A2,4, A3,4 or A4,4 from one of the four primary sub-matrices of matrix A, and operand B4 may be the 4th secondary sub-matrix B1,4, B2,4, B3,4 or B4,4 from one of the primary sub-matrices of matrix B.
How the data of the matrix a and the matrix B are input to the processing device will be explained below.
When data input is performed, in the first cycle, all of the 1st primary sub-matrix A-1 of matrix A and the 1st primary sub-matrix B-1 of matrix B, as shown in figs. 5 and 6, are input into the processing apparatus. Specifically, under the first clock signal, the four secondary sub-matrices A1,1, A1,2, A1,3 and A1,4 of the 1st primary sub-matrix A-1 are respectively input into the four first-stage computing units 201 in the 1st column of systolic array 200, while the four secondary sub-matrices B1,1, B1,2, B1,3 and B1,4 of the 1st primary sub-matrix B-1 are also respectively input into those four first-stage computing units 201. Since the data size of each secondary sub-matrix is 1 kb, the sizes of operand A and operand B input to each first-stage computing unit are both 1 kb. Since each primary sub-matrix has a data size of 4 kb and matrices A and B each have a data size of 16 kb, only four cycles (four clock signals) are required to input all the data of matrices A and B into the processing apparatus. Therefore, the number of cycles required to input data into the processing apparatus according to the exemplary embodiment of the present disclosure is small and the input time is short, so the delay time is short.
In the second cycle (second clock signal) of data input, the 2nd primary sub-matrix A-2 of matrix A and the 2nd primary sub-matrix B-2 of matrix B are input in full into the processing apparatus; specifically, the four secondary sub-matrices A2,1, A2,2, A2,3 and A2,4 of the 2nd primary sub-matrix of matrix A and the four secondary sub-matrices B2,1, B2,2, B2,3 and B2,4 of the 2nd primary sub-matrix of matrix B are respectively input into the four first-stage computing units 201 in the 1st column of systolic array 200. Further, in this second cycle, the data in the four secondary sub-matrices A1,1, A1,2, A1,3 and A1,4 of primary sub-matrix A-1 are propagated to the four first-stage computing units 201 of the 2nd column. The data of matrix B are propagated in a different way: in the second cycle (second clock signal), the data in the four secondary sub-matrices B1,1, B1,2, B1,3 and B1,4 are transmitted to the first-stage computing units 201 of the next column in diagonal form. For example, the data of secondary sub-matrix B1,1 are propagated to the last first-stage computing unit 201 in the 2nd column, the data of B1,2 to the 1st first-stage computing unit 201 in the 2nd column, the data of B1,3 to the 2nd first-stage computing unit 201 in the 2nd column, and the data of B1,4 to the 3rd first-stage computing unit 201 in the 2nd column. It should be noted, however, that one skilled in the art, based on the teachings of the present disclosure, may envision propagation paths for the data of matrices A and B different from those shown in fig. 2, and all such variations are included within the scope of the present disclosure.
In the third cycle (third clock signal) of data input, the 3rd primary sub-matrix A-3 of matrix A and the 3rd primary sub-matrix B-3 of matrix B are input in full into the processing apparatus; specifically, the four secondary sub-matrices A3,1, A3,2, A3,3 and A3,4 of the 3rd primary sub-matrix of matrix A and the four secondary sub-matrices B3,1, B3,2, B3,3 and B3,4 of the 3rd primary sub-matrix of matrix B are respectively input into the four first-stage computing units 201 in the 1st column of systolic array 200. In the fourth cycle (fourth clock signal) of data input, the 4th primary sub-matrix A-4 of matrix A and the 4th primary sub-matrix B-4 of matrix B are input in full; specifically, the four secondary sub-matrices A4,1, A4,2, A4,3 and A4,4 of the 4th primary sub-matrix of matrix A and the four secondary sub-matrices B4,1, B4,2, B4,3 and B4,4 of the 4th primary sub-matrix of matrix B are respectively input into the four first-stage computing units 201 in the 1st column. At this point, all the data of matrices A and B have been input into the processing apparatus for operation in four cycles, i.e., four clock signals.
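The four-cycle input schedule above can be summarized in a small sketch (our own illustration with m = 4; the label strings and helper names are assumptions): in cycle s, column 1 receives the four secondary sub-matrices of A's and B's s-th primary sub-matrices; A data then move straight along their row, while B data move to the next column in diagonal form.

```python
m = 4

def column1_inputs(s):
    """Labels of the (A, B) secondary sub-matrices entering the four
    first-stage units of column 1 in input cycle s."""
    return [(f"A{s},{k}", f"B{s},{k}") for k in range(1, m + 1)]

def next_column_B(k):
    """Diagonal propagation of B data: the sub-matrix at row k of one
    column moves to row k-1 of the next column; row 1 wraps to row m."""
    return m if k - 1 == 0 else k - 1

print(column1_inputs(1))
```

For example, `next_column_B(1)` returns 4: the B sub-matrix at the top of one column moves to the bottom of the next, matching the B1,1 example in the text above.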
Each first-stage computing unit 201 shown in fig. 2 includes: a first set of registers (e.g., a first set of source registers) (not shown) that stores the secondary sub-matrices from matrix A; a second set of registers (e.g., a second set of source registers) (not shown) that stores the secondary sub-matrices from matrix B; and a third set of registers (e.g., destination registers) (not shown) that stores the results of the n × n dpn calculations. In an exemplary embodiment according to the present disclosure, each first-stage computing unit 201 generates a total of 64 dp8 calculation results (as shown in fig. 4) in each cycle or clock, and 64 destination registers (typically, each register has a storage capacity of 4 B) are required to store the 64 dp8 calculation results. Therefore, the processing apparatus with the two-stage structure (comprising 16 first-stage computing units 201 in total) includes 16 × 64 (i.e., 1k) registers with a total storage capacity of 4 kB.
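The register budget quoted above follows from simple arithmetic on the values given in the text; the variable names below are our own:

```python
# Destination-register budget of the two-stage processing apparatus
# (all numbers taken from the text above).
n = 8                          # each second-stage unit computes a dp8
results_per_unit = n * n       # 64 dp8 results per first-stage unit per cycle
first_stage_units = 4 * 4      # m x m = 16 first-stage computing units
bytes_per_register = 4         # each destination register holds 4 B

total_registers = first_stage_units * results_per_unit          # 16 * 64 = 1024 (1k)
total_capacity_kb = total_registers * bytes_per_register / 1024  # 4.0 kB

print(total_registers, total_capacity_kb)  # 1024 4.0
```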
In one embodiment according to the present disclosure, the same data processing instruction may require multiple clocks (or multiple cycles) to complete, or only one clock (or one cycle). Suppose hardware limitations allow only 256 channels of data to be processed under one clock signal. If an artificial intelligence application needs 1024 channels of data processed, four clocks are required, and the 64 dp8 calculation results obtained in the previous clock must be added to the multiply-add result of the current clock. However, if the data to be processed in the artificial intelligence application needs only 128 channels, a single instruction can process the data completely, and the data processed under the current instruction may come from different applications, so the multiply-add result obtained in the previous clock need not be added to the current operation result. In the exemplary embodiment, four cycles are required to input all the data of the matrices A and B into the processing apparatus, and thus the multiply-add result of the previous cycle or clock needs to be added to the operation result of the next cycle or clock.
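The cross-clock accumulation described above can be checked with a minimal numerical sketch (our illustration, not the patent's circuitry): splitting a 1024-channel dot product into four 256-channel chunks and accumulating the partial results into the destination register gives the same answer as a single full dot product.

```python
import random

random.seed(0)
a = [random.randint(-4, 3) for _ in range(1024)]
b = [random.randint(-4, 3) for _ in range(1024)]

def dot(x, y):
    """Plain multiply-accumulate over two equal-length vectors."""
    return sum(p * q for p, q in zip(x, y))

# The hardware handles 256 channels per clock, so 1024 channels take 4 clocks;
# each clock's partial multiply-add result is added into the destination
# register rather than overwriting it.
acc = 0
for clk in range(4):
    lo, hi = clk * 256, (clk + 1) * 256
    acc += dot(a[lo:hi], b[lo:hi])

assert acc == dot(a, b)  # chunked accumulation matches the full dot product
```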
In order to process data of similar size with a similar number of operations as the processing apparatus of the two-stage structure described above, a systolic array of conventional design, in which each computing unit performs one multiply-add operation under one clock signal, would typically require approximately 128 × 128 computing units. Since each accumulation result register serves a single multiply-accumulate operation, the total number of registers required is about 128 × 128 = 16k, with a required storage capacity of 64 kB. The processing apparatus according to the exemplary embodiment of the present disclosure therefore requires only 1/16 of the register storage space (or register count) of a prior-art systolic-array-based TPU, and the utilization of hardware (such as destination registers and source registers) is significantly increased.
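The 1/16 figure can be verified directly from the numbers in the text (variable names are ours):

```python
# Register comparison: conventional 128 x 128 systolic array vs. the
# two-stage apparatus described above (numbers from the text).
conventional_regs = 128 * 128                     # one accumulator per MAC unit
conventional_kb = conventional_regs * 4 / 1024    # 4 B each -> 64 kB

proposed_regs = 16 * 64                           # 16 first-stage units x 64 dp8 results
proposed_kb = proposed_regs * 4 / 1024            # 4 kB

print(conventional_regs // proposed_regs)  # 16 -> the proposed design needs 1/16
```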
Finally, in exemplary embodiments according to the present disclosure, since there are m (in one example, m = 4) first-stage computing units 201 receiving input and processing data, operations on large amounts of data can be implemented using only 8 × 8 second-stage computing units 301, without employing 16 × 16 second-stage computing units 301 to process the data. As described above, in the field of chip manufacturing, fabricating 16 × 16 second-stage computing units 301 requires a large number of wirings, and the manufacturing process is more complicated and costly. In the exemplary embodiment according to the present disclosure, similar data processing capability can be achieved with a processing apparatus having the two-stage structure using only 8 × 8 second-stage computing units 301, which reduces the wiring difficulty, design difficulty, and manufacturing cost of chip fabrication.
Without prejudice to the underlying principles, the details and the embodiments may vary, even significantly, with respect to what has been described by way of example only, without departing from the scope of protection. The various embodiments described above can be combined to provide further embodiments. Aspects of the embodiments can be modified, if necessary to employ concepts of the various patents, applications and publications to provide yet further embodiments.
These and other changes can be made to the embodiments in light of the above-detailed description. In general, in the following claims, the terms used should not be construed to limit the claims to the specific embodiments disclosed in the specification and the claims, but should be construed to include all possible embodiments along with the full scope of equivalents to which such claims are entitled. Accordingly, the claims are not limited by the disclosure.

Claims (12)

1. A processing apparatus for processing data, comprising:
a systolic array comprising m x m first-stage computing units arranged in an m x m array, where m is a positive integer greater than or equal to 2,
wherein each of the m x m first-stage computing units comprises n x n second-stage computing units arranged in an n x n array, where n is a positive integer greater than or equal to 2,
each of the n x n second stage computing units configured to perform a dot product operation of a first vector and a second vector, the first vector and the second vector each comprising n data,
wherein the processing apparatus is configured to process a multiplication of a first matrix by a second matrix, n first vectors of data from the first matrix and n second vectors of data from the second matrix being broadcast simultaneously to the n x n second-stage computing units.
2. The processing device of claim 1, wherein each of the n data is of a type of any one of 4-bit integer type, 8-bit integer type, 16-bit floating point type, 24-bit floating point type, and 32-bit floating point type.
3. The processing device of claim 1, wherein the first matrix is of type M x K and the second matrix is of type K x N, wherein M represents a number of rows of the first matrix, K represents a number of columns of the first matrix or a number of rows of the second matrix, wherein M = m x n, K is an integer multiple of n, and N represents a number of columns of the second matrix, wherein N is a positive integer.
4. The processing device of claim 3, wherein
said first matrix comprises K/n first primary sub-matrices of type M x n, each said first primary sub-matrix of type M x n comprising M/n first secondary sub-matrices of type n x n, and
said second matrix comprises K/n second primary sub-matrices of type n x N, each said second primary sub-matrix comprising N/n second secondary sub-matrices of type n x n.
5. The processing apparatus of claim 4, wherein the n x n second-stage computing units in each first-stage computing unit are configured to: receive, under each clock signal, the n first vectors formed by the n rows of a first secondary sub-matrix of the first matrix and the n second vectors formed by the n columns of a second secondary sub-matrix of the second matrix, to obtain n x n dot product operation results.
6. The processing device of claim 4, wherein each of the first level computing units comprises:
a first set of registers for storing the K/n first primary sub-matrices of type M x n from said first matrix,
a second set of registers for storing the K/n second primary sub-matrices of type n x N from said second matrix, and
a third set of registers for storing the obtained dot product operation results.
7. The processing apparatus of claim 5, wherein the second-stage computing unit of the ith row and jth column of the n x n second-stage computing units is configured to: receive the first vector of n data elements from the ith row of the corresponding first secondary sub-matrix and the second vector of n data elements from the jth column of the corresponding second secondary sub-matrix, where i = 1, 2, …, n and j = 1, 2, …, n.
8. The processing device of claim 4, wherein M = N = K, the processing device configured to:
and inputting all data of the first matrix and the second matrix into the processing device in m periods.
9. The processing apparatus of claim 8, wherein the processing apparatus is configured to: in a first cycle of input data, the kth first-stage computing unit of the first column of m first-stage computing units receives: the kth first secondary sub-matrix from the first of the first primary sub-matrices of the first matrix, and the kth second secondary sub-matrix from the first of the second primary sub-matrices of the second matrix, where k = 1, 2, …, m.
10. The processing apparatus of claim 9, wherein the processing apparatus is configured to: in the 2nd to mth cycles of the input data, transfer the kth first secondary sub-matrix of the first matrix along a straight-line path to the kth first-stage computing unit of the m first-stage computing units of the 2nd to mth columns, respectively, and
transfer the kth second secondary sub-matrix of the second matrix in a diagonal manner to the (k-1)th first-stage computing unit of the m first-stage computing units of the 2nd to mth columns.
11. The processing apparatus of claim 10, wherein
in the case where k-1 is equal to 0, the kth second secondary sub-matrix of the second matrix is transmitted to the mth first-stage computing unit in the next column of m first-stage computing units.
12. The processing apparatus of claim 9, wherein the processing apparatus is configured to:
in the sth cycle of the input data, the kth first-stage computing unit in the first column of m first-stage computing units receives: the kth first secondary sub-matrix from the sth of the first primary sub-matrices of the first matrix, and the kth second secondary sub-matrix from the sth of the second primary sub-matrices of the second matrix, where s = 2, 3, …, m and k = 1, 2, …, m.
CN202011577665.0A 2020-12-28 2020-12-28 Processing device for processing data Active CN112632464B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011577665.0A CN112632464B (en) 2020-12-28 2020-12-28 Processing device for processing data


Publications (2)

Publication Number Publication Date
CN112632464A CN112632464A (en) 2021-04-09
CN112632464B true CN112632464B (en) 2022-11-29

Family

ID=75326077





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant