CN111753253B - Data processing method and device - Google Patents


Info

Publication number
CN111753253B
CN111753253B
Authority
CN
China
Prior art keywords
matrix
computing unit
register
computing
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010595866.7A
Other languages
Chinese (zh)
Other versions
CN111753253A (en)
Inventor
曹文慧
姚猛
周昱
邹玥
Current Assignee
Horizon Shanghai Artificial Intelligence Technology Co Ltd
Original Assignee
Horizon Shanghai Artificial Intelligence Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Horizon Shanghai Artificial Intelligence Technology Co Ltd filed Critical Horizon Shanghai Artificial Intelligence Technology Co Ltd
Priority to CN202010595866.7A priority Critical patent/CN111753253B/en
Publication of CN111753253A publication Critical patent/CN111753253A/en
Application granted granted Critical
Publication of CN111753253B publication Critical patent/CN111753253B/en


Classifications

    • G06F17/16 — Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F9/30036 — Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G06F9/30101 — Special purpose registers
    • G06F9/3013 — Organisation of register space according to data content, e.g. floating-point registers, address registers
    • G06N3/045 — Combinations of networks


Abstract

A data processing method and device are disclosed. The method comprises the following steps: while a computing unit array executes a matrix operation on a first matrix and a second matrix, controlling the matrix elements of a plurality of third matrices to be cached, along a first preset direction, in the first registers of a plurality of computing units of the array; when an operation between a vector and the third matrices is to be executed, performing a first operation between the matrix elements cached in the first registers of the computing units and the vector elements transmitted along a second preset direction, performing a second operation between the result of the first operation and the output of the corresponding computing unit in the previous row, and obtaining the operation result of the vector and the third matrices from the second-operation results output by the computing units in the last row of the computing unit array. By interleaving matrix-matrix operations with vector-matrix operations, the method and device help improve the computational efficiency of a data processing chip and the real-time performance of a neural network.

Description

Data processing method and device
Technical Field
The present disclosure relates to integrated circuit technology, and more particularly, to a data processing method, a data processing apparatus, a storage medium, and an electronic device.
Background
Neural networks tend to be computationally intensive. For example, their computation often involves a large number of matrix-matrix operations and vector-matrix operations. Reducing the time consumed by these operations has a large influence on the computational efficiency of a neural network, and therefore on its real-time performance.
How to reduce the time cost of matrix-matrix and vector-matrix operations is thus a technical problem worth attention.
Disclosure of Invention
The present disclosure has been made in order to solve the above technical problems. The embodiment of the disclosure provides a data processing method, a data processing device, a storage medium and electronic equipment.
According to a first aspect of embodiments of the present disclosure, there is provided a data processing method comprising: while a matrix operation on a first matrix and a second matrix is executed by a computing unit array, controlling the matrix elements of a plurality of third matrices to be cached, along a first preset direction, in the first registers of a plurality of computing units of the array, wherein the first registers of the computing units in one row along the first preset direction cache the matrix elements at the same position in the plurality of third matrices; when an operation between at least one vector and the plurality of third matrices is to be executed, performing a first operation between the matrix elements cached in the first registers of the computing units and the vector elements transmitted along a second preset direction, performing a second operation between the result of the first operation and the output of the corresponding computing unit in the previous row, and outputting the result of the second operation; and obtaining the operation results of the at least one vector and the plurality of third matrices from the second-operation results output by the computing units in the last row of the computing unit array.
According to a second aspect of embodiments of the present disclosure, there is provided a data processing method comprising: caching a first matrix block of a first matrix to be operated of an N-th layer of a neural network in the second register of each computing unit of a computing unit array, and sequentially providing m second matrix blocks of a second matrix to be operated of the N-th layer to the array, to obtain the operation results of the first matrix block with each of the m second matrix blocks, where N and m are integers greater than 1; during the operation of the first matrix block with the m second matrix blocks, buffering m third matrix blocks of a third matrix to be operated of an (N-1)-th layer of the neural network in the first registers of the computing units of the array; and, once the matrix operation results of the first matrix block and the m second matrix blocks are obtained, providing m vector blocks to the computing unit array to obtain the operation results of the m vector blocks with the m third matrix blocks.
According to a third aspect of embodiments of the present disclosure, there is provided a data processing apparatus comprising: a first control module, configured to control, while a matrix operation on a first matrix and a second matrix is executed by a computing unit array, the matrix elements of a plurality of third matrices to be cached, along a first preset direction, in the first registers of a plurality of computing units of the array, wherein the first registers of the computing units in one row along the first preset direction cache the matrix elements at the same position in the plurality of third matrices; a second control module, configured to, when an operation between at least one vector and the plurality of third matrices is to be executed, perform a first operation between the matrix elements cached in the first registers of the computing units and the vector elements transmitted along a second preset direction, perform a second operation between the result of the first operation and the output of the corresponding computing unit in the previous row, and output the result of the second operation; and a processing module, configured to obtain the operation results of the at least one vector and the plurality of third matrices from the second-operation results output by the computing units in the last row of the computing unit array.
According to a fourth aspect of embodiments of the present disclosure, there is provided a data processing apparatus comprising: a third control module, configured to cache a first matrix block of a first matrix to be operated of an N-th layer of a neural network in the second register of each computing unit of a computing unit array, and to sequentially provide m second matrix blocks of a second matrix to be operated of the N-th layer to the array, to obtain the operation results of the first matrix block with each of the m second matrix blocks, where N and m are integers greater than 1; a fourth control module, configured to buffer, during the operation of the first matrix block with the m second matrix blocks, m third matrix blocks of a third matrix to be operated of an (N-1)-th layer of the neural network in the first registers of the computing units of the array; and a fifth control module, configured to, once the matrix operation results of the first matrix block and the m second matrix blocks are obtained, provide m vector blocks to the computing unit array to obtain the operation results of the m vector blocks with the m third matrix blocks.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for implementing the above method.
According to a sixth aspect of embodiments of the present disclosure, there is provided an electronic device including: a processor; a memory for storing the processor-executable instructions; the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method described above.
According to the data processing method and device provided by the embodiments of the present disclosure, because the matrix elements of the plurality of third matrices are cached in the first registers of the computing units while the computing unit array performs the matrix operation of the first matrix and the second matrix, caching the third matrices does not occupy any clock cycles of its own; that is, the clock cycles spent on the matrix operation of the first and second matrices simultaneously accomplish the storage of the third-matrix elements in the computing unit array. When the vector-matrix operation is subsequently needed, the operations of a plurality of vectors with a plurality of third matrices are completed within a single pass, which avoids executing the vector-matrix operation many times over many clock cycles with only part of the array utilized. This effectively improves the efficiency of the computational resources and helps to fully utilize them.
The technical scheme of the present disclosure is described in further detail below through the accompanying drawings and examples.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing embodiments thereof in more detail with reference to the accompanying drawings. The accompanying drawings are included to provide a further understanding of embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure, not to limit the disclosure. In the drawings, like reference numerals generally refer to like parts or steps.
FIG. 1 is a schematic illustration of a scenario in which the present disclosure is applicable;
FIG. 2 is a flow chart of one embodiment of a data processing method of the present disclosure;
FIG. 3 is a flow chart of one embodiment of a matrix operation implementing a first matrix and a second matrix of the present disclosure;
FIG. 4 is a schematic diagram of a portion of an array of computing cells of the present disclosure;
FIG. 5 is a schematic diagram of another portion of the computational cell array of the present disclosure;
FIG. 6 is a flow chart of an embodiment of controlling each matrix element to be buffered in a second register of a computation unit according to the present disclosure;
FIG. 7 is a schematic diagram of a configuration of one embodiment of a computing unit of the present disclosure;
FIG. 8 is a schematic diagram of another embodiment of a computing unit of the present disclosure;
FIG. 9 is a flow chart of one embodiment of controlling each matrix element to be cached in a first register of a computing unit according to the present disclosure;
FIG. 10 is a flowchart illustrating an embodiment of a data processing method of the present disclosure suitable for neural network operations;
FIG. 11 is a schematic diagram of a two-layer embodiment of a neural network of the present disclosure;
FIG. 12 is a schematic diagram of an embodiment of a data processing apparatus of the present disclosure;
FIG. 13 is a schematic diagram of another embodiment of a data processing apparatus of the present disclosure;
fig. 14 is a block diagram of an electronic device according to an exemplary embodiment of the present application.
Detailed Description
Example embodiments according to the present disclosure will be described in detail below with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present disclosure and not all of the embodiments of the present disclosure, and that the present disclosure is not limited by the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise.
It will be appreciated by those of skill in the art that the terms "first," "second," etc. in embodiments of the present disclosure are used merely to distinguish between different steps, devices or modules, etc., and do not represent any particular technical meaning nor necessarily logical order between them.
It should also be understood that in embodiments of the present disclosure, "plurality" may refer to two or more, and "at least one" may refer to one, two or more.
It should also be appreciated that any component, data, or structure referred to in the presently disclosed embodiments may be generally understood as one or more without explicit limitation or the contrary in the context.
In addition, the term "and/or" in this disclosure is merely an association relationship describing an association object, and indicates that three relationships may exist, such as a and/or B, and may indicate: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" in the present disclosure generally indicates that the front and rear association objects are an or relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and that the same or similar features may be referred to each other, and for brevity, will not be described in detail.
Meanwhile, it should be understood that the sizes of the respective parts shown in the drawings are not drawn in actual scale for convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
Embodiments of the present disclosure are applicable to electronic devices such as terminal devices, computer systems, servers, etc., which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with the terminal device, computer system, or server, include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above systems, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc., that perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment. In a distributed cloud computing environment, tasks may be performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computing system storage media including memory storage devices.
Summary of the disclosure
In practicing the present disclosure, the inventors discovered that many computations in a neural network can be converted into matrix-matrix operations; for example, convolution can be converted into a matrix-matrix operation. At present, to improve the real-time performance of a neural network, matrix-matrix operations are generally optimized to improve their efficiency.
Owing to factors such as lightweight neural network design, more vector-matrix operations appear in neural network computation; for example, fully-connected-layer computations with a batch size of 1 and depthwise convolution computations can both be converted into vector-matrix operations. Existing optimizations for matrix-matrix operations cannot be applied to vector-matrix operations. If vector-matrix operations could be executed during the matrix-matrix operation process, the time cost of vector-matrix operations would be reduced and the real-time performance of the neural network improved.
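To illustrate why a batch-size-1 fully-connected layer reduces to a vector-matrix operation, here is a small NumPy check; the shapes and random data are illustrative, not taken from the patent.

```python
import numpy as np

# A fully-connected layer with batch size 1 is exactly a vector-matrix
# product: the (1, in_features) activation row times the
# (in_features, out_features) weight matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8))   # hypothetical weight matrix
x = rng.standard_normal(16)        # single-sample activation vector

fc_as_matmul = (x.reshape(1, -1) @ W).ravel()  # batch-of-1 matrix form
fc_as_vecmat = x @ W                           # plain vector-matrix form

assert np.allclose(fc_as_matmul, fc_as_vecmat)
```

The same reduction applies per channel to depthwise convolution once its input patches are flattened.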
Exemplary overview
The data processing method of the present disclosure can realize mixed matrix-matrix and vector-matrix operations. An example is shown in FIG. 1.
In fig. 1, it is assumed that an apparatus includes: the data processing system comprises a computing unit array 100, a storage module 101, a data rearrangement module 102, a post-processing module 103, a first buffer module 104, a second buffer module 105, a third buffer module 106, a fourth buffer module 107 and a control module 108.
The storage module 101 may be a storage device such as a cache memory or a DRAM (Dynamic Random Access Memory). It stores the data to be operated on, for example the matrices and vectors to be operated on.
The data rearrangement module 102 is configured to read data to be operated from the storage module 101 according to control of the control module 108, and rearrange the data to be operated according to a structure of the computing unit array 100, so as to form data that is convenient to provide to the computing unit array 100, and store the data in a corresponding buffer module.
For example, the data rearrangement module 102 may divide the first matrix to be operated read from the storage module 101 according to the control of the control module 108 to form a plurality of first matrix blocks, where each first matrix block may be regarded as a matrix a; the data rearrangement module 102 may divide the second matrix to be operated read from the storage module 101 according to the control of the control module 108 to form a plurality of second matrix blocks, where each second matrix block may be considered as a matrix b; the data rearrangement module 102 may divide the third matrix to be operated read from the storage module 101 according to the control of the control module 108 to form a plurality of third matrix blocks, where each third matrix block may be regarded as a matrix c; the data rearrangement module 102 may divide the vector to be operated read from the storage module 101 according to the control of the control module 108 to form a plurality of vector blocks, where each vector block may be regarded as a vector d.
Under the control of the control module 108, the data rearrangement module 102 may buffer the plurality of matrices a in the first buffer module 104, the matrices b in the second buffer module 105, the matrices c in the third buffer module 106, and the vectors d in the fourth buffer module 107, each according to its corresponding storage format.
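As a rough illustration of the blocking performed by the data rearrangement module 102, the following Python sketch divides a matrix into 16×16 blocks before buffering; the function name and the zero-padding policy are assumptions for illustration, not details from the patent.

```python
import numpy as np

def split_into_blocks(mat, bs=16):
    """Divide a matrix into bs x bs blocks, row-major over blocks.

    Sketch of the data rearrangement step; zero-padding the ragged
    edges is an assumed policy so every block has the full size.
    """
    rows, cols = mat.shape
    pr = (-rows) % bs          # padding so both dimensions divide evenly
    pc = (-cols) % bs
    padded = np.pad(mat, ((0, pr), (0, pc)))
    r, c = padded.shape
    return [padded[i:i + bs, j:j + bs]
            for i in range(0, r, bs)
            for j in range(0, c, bs)]

# A 32x48 matrix splits into (32/16) * (48/16) = 6 blocks of 16x16.
blocks = split_into_blocks(np.arange(32 * 48).reshape(32, 48))
```

Each returned block plays the role of one matrix a, b or c in the example below.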
Assume that the computing unit array 100 is 16×16, that matrix a, matrix b and matrix c are each 16×16, and that vector d includes 16 vector elements.
Suppose that in the i-th clock cycle the control module 108 provides the 1st matrix a buffered in the first buffer module 104 to the computing unit array 100, so that the 16×16 matrix elements of the 1st matrix a are buffered in the second registers of the respective computing units of the array. Here i is a positive integer.
In the (i+1)-th clock cycle, on the one hand, the control module 108 provides the 1st matrix b in the second buffer module 105 to the computing unit array 100, implementing the matrix multiplication of the 1st matrix a with the 1st matrix b; on the other hand, the control module 108 controls the matrix elements at row 0, column 0 of the 1st through 16th matrices c in the third buffer module 106 (i.e., 16 matrix elements) to be stored in the first registers of the 16 computing units along the first preset direction of the array.
In the (i+2)-th clock cycle, on the one hand, the control module 108 provides the 2nd matrix b in the second buffer module 105 to the computing unit array 100, implementing the matrix multiplication of the 1st matrix a with the 2nd matrix b; on the other hand, the control module 108 controls the matrix elements at row 0, column 1 of the 1st through 16th matrices c in the third buffer module 106 (i.e., 16 matrix elements) to be stored in the first registers of the 16 computing units along the first preset direction of the array.
……
In the (i+16)-th clock cycle, on the one hand, the control module 108 provides the 16th matrix b in the second buffer module 105 to the computing unit array 100, implementing the matrix multiplication of the 1st matrix a with the 16th matrix b; on the other hand, the control module 108 controls the matrix elements at the corresponding (16th scheduled) position of the 1st matrix c through the 16th matrix c in the third buffer module 106 (i.e., 16 matrix elements) to be stored in the first registers of the 16 computing units along the first preset direction of the array.
In the (i+17)-th clock cycle, the control module 108 provides the first vector d in the fourth buffer module 107 to the computing unit array 100, thereby implementing the multiplication of the first vector d with the 16 matrices c simultaneously.
In addition, in the (i+17) th clock cycle, the control module 108 may control the 2 nd matrix a in the first buffer module 104 to be provided to the computing unit array 100, so that the 16×16 matrix elements in the 2 nd matrix a are respectively buffered in the second registers of the computing units of the computing unit array 100.
……
In the above manner, the present disclosure can alternately carry out the matrix-matrix multiplication process and the vector-matrix multiplication process.
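The interleaved schedule described in the example above can be sketched as follows; the cycle numbering follows the 16×16 example, while the function and the event strings are illustrative assumptions rather than anything specified in the patent.

```python
# While the array multiplies matrix a by the k-th matrix b (one matrix b
# per cycle), the same cycle also loads one element position of all 16
# matrices c into the first registers, so loading the matrices c costs
# no extra cycles of its own.

def schedule(num_b=16, i=0):
    events = [(i, "load matrix a into second registers")]
    for k in range(num_b):
        events.append((i + 1 + k,
                       f"multiply a x b[{k}]  |  cache position {k} "
                       f"of c[0..15] into first registers"))
    events.append((i + 1 + num_b, "multiply vector d with c[0..15]"))
    return events

for cycle, what in schedule():
    print(cycle, what)
```

Note how the vector pass at cycle i+17 is the only cycle not shared with a matrix-matrix multiplication.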
The matrix multiplication results of the plurality of matrices a and matrices b, output by the computing unit array 100 over multiple passes, are arranged (e.g., via addition and similar processing) by the post-processing module 103 into the final operation result of the first and second matrices to be operated, and the post-processing module 103 may store this final result in the storage module 101. Likewise, the vector-matrix multiplication results of the plurality of matrices c and the plurality of vectors, output by the array over multiple passes, are arranged (e.g., sequentially) by the post-processing module 103 into the final operation result of the third matrix to be operated and the vector to be operated, which is also stored in the storage module 101.
It should be noted that, when the data processing method of the present disclosure is applied in a neural network operation environment, since convolution operations and the like in the neural network can be converted into matrix-matrix and vector-matrix operations, the data rearrangement module 102 should be able to perform this conversion, so that the convolution operations of the neural network can be conveniently implemented using the data processing method of the present disclosure.
Exemplary method
FIG. 2 is a flow chart of one embodiment of a method of implementing matrix multiplication and vector multiplication using a computing unit array of the present disclosure. The method shown in FIG. 2 includes steps S200, S201 and S202, which are described in turn below.
S200: in the process of executing the matrix operation of the first matrix and the second matrix through the computing unit array, control each matrix element in the plurality of third matrices to be buffered in the first registers of a plurality of computing units of the computing unit array along a first preset direction.
The present disclosure may use a plurality of clock cycles to perform the matrix operation of the first matrix and the second matrix and, during those same clock cycles, control the matrix elements of the plurality of third matrices to be buffered in the first registers of the computing units along the first preset direction. That is, the clock cycles occupied by the matrix operation of the first and second matrices realize not only that matrix operation but also the storage of the third-matrix elements in the computing unit array.
The computing unit array in the present disclosure may refer to an array formed by arranging n computing units, where n is an integer; in particular, n may be large (e.g., on the order of hundreds or thousands), and its value determines the size of the array. In one example, n = l×w×h, where l, w and h are integers greater than 1; that is, the computing unit array can be considered a cuboid-shaped array with l computing units in the length direction, w computing units in the width direction and h computing units in the height direction. Typically, l, w and h are equal.
The first matrix and the second matrix in the present disclosure are two-dimensional matrices that need to undergo a matrix operation. The matrix operation may be a matrix multiplication operation or the like. In one example, the first matrix, the second matrix, and all third matrices may be matrix blocks in matrices to be operated on in the same layer of a neural network; that is, the first matrix may be a matrix block in a first matrix to be operated on in the nth layer of the neural network, the second matrix may be a matrix block in a second matrix to be operated on in the nth layer, and each third matrix may be a matrix block in a third matrix to be operated on in the nth layer. The first matrix, the second matrix, and each third matrix are typically the same size, i.e., they comprise the same number of matrix elements. For example, the first matrix, the second matrix, and all third matrices may be of size L × W, i.e., each comprises L × W matrix elements. In the usual case, L and W are equal.
The first preset direction of the present disclosure is generally an arrangement direction of the computing units in the computing unit array, and thus the first preset direction may be referred to as a first arrangement direction of the computing units. For example, the first preset direction may be the length, width, or height direction of the computing unit array.
Each computing unit in the computing unit array in the present disclosure includes a first register. The present disclosure may buffer the matrix elements at the same position in the plurality of third matrices in the first registers of the same row of computing units of the computing unit array, along the first preset direction. For example, assuming there are n1 third matrices (where n1 is an integer greater than 1), the matrix elements in the ith row and jth column of the n1 third matrices are respectively buffered in the first registers of the computing units of the corresponding kth row of the computing unit array.
S201, when at least one vector and the operations of the plurality of third matrixes are required to be executed, performing a first operation on matrix elements cached by the first registers of the plurality of computing units and vector elements transmitted according to a second preset direction, performing a second operation on a result of the first operation and outputs of corresponding computing units in the previous row, and outputting a result of the second operation.
The vector and each third matrix in the present disclosure require vector-matrix operations. The vector-matrix operation may be a multiplication of a vector and a matrix, etc. In one example, all vectors and all third matrices may be a plurality of vector blocks in a vector to be operated on and a plurality of matrix blocks in a matrix to be operated on based on the same layer of the neural network, i.e., each vector may be a vector block in a vector to be operated on of an nth layer of the neural network, and each third matrix may be a matrix block in a third matrix to be operated on of an nth layer of the neural network.
The size of all vectors in the present disclosure is typically the same, i.e., all vectors have the same dimension, and this dimension is typically equal to the number of matrix elements the third matrix comprises in one direction. For example, assuming that the size of the third matrix is L × W, the dimension of every vector is W.
The second preset direction of the present disclosure is generally an arrangement direction of the computing units in the computing unit array, and thus the second preset direction may be referred to as a second arrangement direction of the computing units. For example, the second preset direction may be the length, width, or height direction of the computing unit array. In addition, the first preset direction and the second preset direction are generally different directions.
The operation of the vector and the plurality of third matrices in the present disclosure may refer to an operation between a vector and a matrix. The first operation in the present disclosure may be a multiplication operation, and the second operation may be an addition operation. Assuming that the result of the first operation is computed by the ith computing unit of the current row, the corresponding computing unit in the previous row refers to the ith computing unit of the previous row. The current row and the previous row are generally determined by the specific structural design of the computing unit array. In one example, the current row and the previous row may each be a row in the length direction of the computing unit array. In another example, they may each be a row in the width direction of the computing unit array. In yet another example, they may each be a row in the height direction of the computing unit array.
S202, according to the results of the second operation respectively output by the plurality of computing units in the last row of the computing unit array, obtaining the operation results of the at least one vector and the plurality of third matrixes.
The computing unit array in the present disclosure has a plurality of last rows. For example, assuming that a row in the present disclosure is a row in the length direction of the computing unit array, the number of last rows may equal the width of the computing unit array. For another example, assuming that a row is a row in the width direction, the number of last rows may equal the length of the computing unit array. For yet another example, assuming that a row is a row in the height direction, the number of last rows may equal the width of the computing unit array. The operation result of the at least one vector and the plurality of third matrices obtained in the present disclosure may be the result of multiplying the at least one vector by the plurality of third matrices.
In the present disclosure, each matrix element of the plurality of third matrices is buffered in the first registers of the computing units of the computing unit array while the array performs the matrix operation of the first matrix and the second matrix; therefore, buffering the plurality of third matrices need not occupy any clock cycles of its own. That is, the clock cycles occupied by the matrix operation of the first matrix and the second matrix simultaneously accomplish that matrix operation and the storage of each matrix element of the third matrices in the computing unit array. When the vectors and the third matrices need to be operated on, the operations of the plurality of vectors with the plurality of third matrices are completed within a single time period, which avoids executing those operations over multiple time periods with the computing unit array only partially utilized in each; this effectively improves the efficiency of the computing resources and facilitates their full utilization.
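The cycle-count advantage described here can be sketched numerically; the concrete cycle counts below are illustrative assumptions for a 4 × 4 × 4 array, not figures stated in the disclosure:

```python
# Hypothetical cycle accounting. In a separate scheme, loading the third
# matrices costs its own clock cycles; in the interleaved scheme of the
# disclosure, that loading rides along with the matrix-matrix operation.
def cycles_separate(load_cycles=4, matmul_cycles=4, matvec_cycles=1):
    return load_cycles + matmul_cycles + matvec_cycles

def cycles_interleaved(matmul_cycles=4, matvec_cycles=1):
    # third-matrix loading overlaps the matmul cycles, so it adds nothing
    return matmul_cycles + matvec_cycles
```

Under these assumed counts the interleaved scheme spends 5 cycles where the separate scheme spends 9.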
In one alternative example, each computing unit in the computing unit array in the present disclosure generally includes two registers: a first register and a second register. The present disclosure generally performs the matrix operation of the first matrix and the second matrix based on the second register. In the case where the matrix operation of the first matrix and the second matrix is a matrix multiplication operation, the first register in each computing unit may be an sReg (shift register), and the second register may be an mReg (multiplication register). One specific process of performing the matrix operation of the first matrix and the second matrix in the present disclosure is shown in fig. 3.
In fig. 3, S300, each matrix element in the first matrix is controlled to be cached in the second register of each computing unit according to the first preset direction.
Optionally, the second registers of the computing units in a row of the computing unit array along the first preset direction buffer the same matrix element of the first matrix.
The computing unit array is assumed to be a three-dimensional structure, i.e., the computing unit array includes an x direction, a y direction, and a z direction. Assuming that the size of the computing unit array is 4 × 4 × 4, for clarity, the present disclosure shows the computing unit array across FIGS. 4 and 5.
In fig. 4 and 5, each of the small squares represents one calculation unit. The first preset direction in the present disclosure may be the x direction. It is assumed that the computing unit array includes 4 sets of computing units in total in the z direction, a first set of computing units and a fourth set of computing units being shown in fig. 4, and a second set of computing units and a third set of computing units being shown in fig. 5.
The first set of computing units comprises 16 computing units, namely m000, m001, m002, m003, m010, m011, m012, m013, m020, m021, m022, m023, m030, m031, m032 and m033;
The second set of computing units comprises 16 computing units, namely m100, m101, m102, m103, m110, m111, m112, m113, m120, m121, m122, m123, m130, m131, m132 and m133;
the third group of computing units comprises 16 computing units, namely m200, m201, m202, m203, m210, m211, m212, m213, m220, m221, m222, m223, m230, m231, m232 and m233;
The fourth set of computing units comprises 16 computing units, namely m300, m301, m302, m303, m310, m311, m312, m313, m320, m321, m322, m323, m330, m331, m332 and m333.
It is assumed that matrix operations of the first matrix and the second matrix are performed using the 4 × 4 × 4 computing unit array shown in FIGS. 4 and 5, and that the first matrix a may be expressed in the form:

a = | a00 a01 a02 a03 |
    | a10 a11 a12 a13 |
    | a20 a21 a22 a23 |
    | a30 a31 a32 a33 |

The present disclosure may store the 16 data a00 through a33 in the computing units of the computing unit array as follows:
a00 is stored in a second register of the calculation units m000, m001, m002 and m 003;
a01 is stored in the second registers of the calculation units m010, m011, m012, and m 013;
a02 is stored in the second registers of the calculation units m020, m021, m022 and m 023;
a03 is stored in a second register of the calculation units m030, m031, m032 and m 033;
a10 is stored in the second registers of the computing units m100, m101, m102 and m 103;
a11 is stored in the second registers of the calculation units m110, m111, m112 and m 113;
a12 is stored in the second registers of the computing units m120, m121, m122 and m 123;
a13 is stored in the second registers of the calculation units m130, m131, m132, and m 133;
a20 is stored in the second registers of the computing units m200, m201, m202 and m 203;
a21 is stored in the second registers of the calculation units m210, m211, m212, and m 213;
a22 is stored in the second registers of the calculation units m220, m221, m222, and m 223;
a23 is stored in the second registers of the calculation units m230, m231, m232, and m233;
a30 is stored in the second registers of the computing units m300, m301, m302 and m 303;
a31 is stored in the second registers of the computing units m310, m311, m312 and m 313;
a32 is stored in the second registers of the computing units m320, m321, m322 and m 323;
a33 is stored in the second registers of the calculation units m330, m331, m332, and m 333.
The present disclosure may buffer all matrix elements in the first matrix in the second register of each compute unit with one clock cycle.
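The placement of a00 through a33 above follows one rule: a[i][j] is replicated across all units of group i, row j, along the x direction. A minimal Python sketch of this layout (the m[z][y][x] indexing of unit mzyx is an assumption made for illustration):

```python
# Sketch of S300: cache a[i][j] in the second register of every unit m i j x,
# i.e. group i, row j, all positions x along the first preset direction.
N = 4
a = [[i * N + j for j in range(N)] for i in range(N)]  # stand-in values for a00..a33

# second_reg[z][y][x] models the second register of unit m{z}{y}{x}
second_reg = [[[a[z][y] for x in range(N)] for y in range(N)] for z in range(N)]
```

For example, `second_reg[0][1]` models m010 through m013, all of which hold a01.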
S301, providing each matrix element in the second matrix to each computing unit according to a second preset direction.
Optionally, the second preset direction in the present disclosure may be the z direction. It is assumed that the second matrix b can be expressed as follows:

b = | b00 b01 b02 b03 |
    | b10 b11 b12 b13 |
    | b20 b21 b22 b23 |
    | b30 b31 b32 b33 |

The 16 data b00 through b33 are supplied to the computing units shown in FIGS. 4 and 5 as follows:
b00 is provided to the computing units m000, m100, m200 and m300;
b01 is provided to the computing units m001, m101, m201 and m301;
b02 is provided to the computing units m002, m102, m202 and m302;
b03 is provided to the computing units m003, m103, m203 and m303;
b10 is provided to the computing units m010, m110, m210 and m310;
b11 is provided to the computing units m011, m111, m211 and m311;
b12 is provided to the computing units m012, m112, m212 and m312;
b13 is provided to the computing units m013, m113, m213 and m313;
b20 is provided to the computing units m020, m120, m220 and m320;
b21 is provided to the computing units m021, m121, m221 and m321;
b22 is provided to the computing units m022, m122, m222 and m322;
b23 is provided to the computing units m023, m123, m223 and m323;
b30 is provided to the computing units m030, m130, m230 and m330;
b31 is provided to the computing units m031, m131, m231 and m331;
b32 is provided to the computing units m032, m132, m232 and m332;
b33 is provided to the computing units m033, m133, m233 and m333.
The present disclosure may provide all matrix elements in the second matrix to each computing unit separately using one clock cycle.
S302, performing a third operation on the matrix element in the second register of each computing unit and the matrix element transmitted according to the second preset direction, performing a fourth operation on the result of the third operation and the output of the corresponding computing unit in the previous row, and outputting the result of the fourth operation.
Optionally, the present disclosure may perform a multiplication operation on the matrix element in the second register of each computing unit and the matrix element transmitted according to the second preset direction, so that each computing unit obtains a multiplication result. Each computing unit may then perform an addition operation on its multiplication result and the output of the corresponding computing unit in its previous row (if a computing unit has no previous row, it may directly output its multiplication result), so that the present disclosure obtains the matrix multiplication result of the first matrix and the second matrix.
Alternatively, m000, m001, m002, m003, m100, m101, m102, m103, m200, m201, m202, m203, m300, m301, m302 and m303 in fig. 4 and 5 are all calculation units located in the first row, and the multiplication result may be directly output without performing the addition operation.
m010 should add its multiplication result to the multiplication result output by m000, and take the sum as the output of m010.
m020 should add its multiplication result to the output of m010, and take the sum as the output of m020.
m030 should add its multiplication result to the output of m020, and take the sum as the output of m030.
……
m313 should add its multiplication result to the multiplication result output by m303, and take the sum as the output of m313.
m323 should add its multiplication result to the output of m313, and take the sum as the output of m323.
m333 should add its multiplication result to the output of m323, and take the sum as the output of m333.
S303, obtaining matrix operation results of the first matrix and the second matrix according to fourth operation results output by each calculation unit in the last row of the calculation unit array.
Alternatively, the present disclosure may use the outputs of the computing units m030, m031, m032, m033, m130, m131, m132, m133, m230, m231, m232, m233, m330, m331, m332, and m333 in figs. 4 and 5 as the matrix operation result of the first matrix a and the second matrix b.
In the previous example, assuming that the matrix operation of the first matrix a and the second matrix b is a matrix multiplication operation, then:
The output of the calculation unit m030 is: a00×b00+a01×b10+a02×b20+a03×b30;
the output of the calculation unit m031 is: a00×b01+a01×b11+a02×b21+a03×b31;
the output of the calculation unit m032 is: a00×b02+a01×b12+a02×b22+a03×b32;
The output of the calculation unit m033 is: a00×b03+a01×b13+a02×b23+a03×b33;
the output of the calculation unit m130 is: a10×b00+a11×b10+a12×b20+a13×b30;
the output of the calculation unit m131 is: a10×b01+a11×b11+a12×b21+a13×b31;
the output of the calculation unit m132 is: a10×b02+a11×b12+a12×b22+a13×b32;
the output of the calculation unit m133 is: a10×b03+a11×b13+a12×b23+a13×b33;
the output of the calculation unit m230 is: a20×b00+a21×b10+a22×b20+a23×b30;
the output of the calculation unit m231 is: a20×b01+a21×b11+a22×b21+a23×b31;
The output of the calculation unit m232 is: a20×b02+a21×b12+a22×b22+a23×b32;
The output of the calculation unit m233 is: a20×b03+a21×b13+a22×b23+a23×b33;
the output of the calculation unit m330 is: a30×b00+a31×b10+a32×b20+a33×b30;
the output of the calculation unit m331 is: a30×b01+a31×b11+a32×b21+a33×b31;
The output of the calculation unit m332 is: a30×b02+a31×b12+a32×b22+a33×b32;
the output of the calculation unit m333 is: a30×b03+a31×b13+a32×b23+a33×b33.
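The sixteen output expressions above can be checked against an ordinary matrix product with a small software model of the array (an illustrative model of the dataflow, not the hardware itself): unit (z, y, x) multiplies its cached a[z][y] by the streamed b[y][x] and adds the output of unit (z, y − 1, x), so the last-row unit (z, 3, x) emits (a · b)[z][x].

```python
# Software model of S300-S303 on the 4x4x4 array.
N = 4

def array_matmul(a, b):
    out = [[0] * N for _ in range(N)]
    for z in range(N):          # groups along the z direction
        for x in range(N):      # positions along the x direction
            acc = 0             # partial sum chained from row y-1 to row y
            for y in range(N):
                acc += a[z][y] * b[y][x]
            out[z][x] = acc     # output of the last-row unit m{z}3{x}
    return out

def plain_matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

a = [[(3 * i + j) % 7 for j in range(N)] for i in range(N)]
b = [[(i + 2 * j) % 5 for j in range(N)] for i in range(N)]
```

In particular, the model's out[0][0] reproduces the m030 expression a00×b00 + a01×b10 + a02×b20 + a03×b30.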
According to the present disclosure, by providing the first register and the second register in the computing unit and buffering each matrix element of the first matrix in the second register, the matrix operation of the first matrix and the second matrix can be completed in one or two clock cycles; moreover, while the matrix operation (e.g., matrix multiplication) is performed on the matrix elements buffered in the second registers and the matrix elements transmitted in the second preset direction, the matrix elements of the plurality of third matrices can be sequentially buffered in the first registers of the computing units, thereby providing a feasible and convenient implementation for saving the read/write time of the third matrices.
In an alternative example, the present disclosure may cause each matrix element in the first matrix to be cached in the second register of each computing unit in accordance with the first preset direction by the control signal. An example of the control of each matrix element of the present disclosure to be buffered in the second register of the computation unit is shown in fig. 6.
In fig. 6, S600, a first control signal is sent to each computing unit based on the control input terminal of each computing unit.
Alternatively, the computing unit array in the present disclosure may be provided with a control line, and the control line is connected to each computing unit individually. The present disclosure may send the first control signal to the corresponding computing unit via the control line. The transmission direction of the first control signal in the control line may be a first preset direction. In one example, the control lines in the compute cell array may be the lines identified with cmd in FIGS. 4 and 5. In addition, the control line may take the form of a bus to which each computing unit is coupled, that is, the computing units of fig. 4 and 5 may not be serially coupled.
S601, according to the first control signal, each matrix element in the first matrix is cached in a second register of each calculation unit according to a first preset direction.
Optionally, each computing unit in the present disclosure may be provided with a multiplexer. For any computing unit, the multiplexer is connected to the first register and the second register of that computing unit; the first control signal in the present disclosure can control the gating state of the multiplexer so that the second register is connected to the input of the first preset direction, whereby each matrix element of the first matrix transmitted according to the first preset direction can be buffered in the second register of the corresponding computing unit.
Optionally, an example of a configuration of the computing unit including a multiplexer, a first register and a second register in the present disclosure is shown in fig. 7.
In fig. 7, the calculation unit may include: a control unit 700, a multiplexer 701, a first register 702, a second register 703, a multiplication unit 704, and an addition unit 705. There is a data path between the first register 702 and the second register 703, i.e. the data buffered in the first register 702 can be transferred directly into the second register 703.
The first control signal is transmitted in the control line, and the control unit 700 in the calculation unit controls the multiplexer 701 to communicate with the second register 703 after receiving the first control signal. In addition, the control unit 700 may also transmit an enable signal to the second register 703, so that a matrix element in the first matrix is stored in the second register 703 through the multiplexer 701 as data to be stored.
Optionally, another example of the structure of the computing unit including a multiplexer, a first register and a second register in the present disclosure is shown in fig. 8.
In fig. 8, the calculation unit may include: a control unit 800, a multiplexer 801, a first register 802, a second register 803, a multiplication unit 804, and an addition unit 805. There is no data path between the first register 802 and the second register 803; i.e., the data buffered in the first register 802 cannot be transferred directly into the second register 803, but needs to be transferred into the second register 803 via the multiplexer 801.
The first control signal is transmitted in the control line, and the control unit 800 in the calculation unit controls the multiplexer 801 to communicate with the second register 803 after receiving the first control signal. In addition, the control unit 800 may also transmit an enable signal to the second register 803, so that a matrix element in the first matrix is stored in the second register 803 through the multiplexer 801 as data to be stored.
By using the first control signal and providing a multiplexer in the computing unit, the present disclosure conveniently realizes storage control of the first register and the second register in the computing unit.
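The gating behavior of FIGS. 7 and 8 can be modeled as follows (class and signal names are assumptions made for illustration; the hardware uses a control unit, a multiplexer, and an enable signal rather than Python objects):

```python
# Sketch: the control signal steers the multiplexer so that data arriving
# along the first preset direction lands in either the second register
# (first control signal, S600/S601) or the first register (first-register
# storage indication of the second control signal, S900/S901).
class ComputeUnit:
    def __init__(self):
        self.first_reg = None   # sReg: caches a third-matrix element
        self.second_reg = None  # mReg: caches a first-matrix element

    def on_data(self, value, signal):
        if signal == "store_second":   # hypothetical encoding of the first control signal
            self.second_reg = value
        elif signal == "store_first":  # hypothetical encoding of the storage indication
            self.first_reg = value

u = ComputeUnit()
u.on_data(7, "store_second")  # a first-matrix element arrives
u.on_data(9, "store_first")   # a third-matrix element arrives
```

Either store leaves the other register untouched, mirroring how the multiplexer connects the incoming data line to exactly one register at a time.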
In an alternative example, the present disclosure may cause each matrix element in the plurality of third matrices to be buffered in the first register of each computing unit according to the first preset direction by the control signal. The flow of one example of the control of each matrix element buffering in the first register of the computation unit of the present disclosure is shown in fig. 9.
In fig. 9, S900, a second control signal is sent to each computing unit based on the control input terminal of each computing unit.
Alternatively, the control input of the computing unit in the present disclosure may refer to the connection of the computing unit to the control line. Each computing unit may receive a control signal, e.g. a first control signal or a second control signal, etc., transmitted in the control line via its control input.
In one example, the second control signal in the present disclosure may include: a first register storage indication and a position identification of a computing unit. The first register storage indication informs the computing unit that the currently transmitted data needs to be stored in the first register. The position identification indicates which computing unit the currently transmitted data corresponds to.
In another example, the second control signal in the present disclosure may include: a shift indication. The shift indication is used to notify the computing unit to perform shift processing on the matrix element stored in its first register, so that the matrix element stored in the first register of the computing unit is shifted out of that first register and transferred into the first register of another computing unit.
S901, according to a second control signal, each matrix element in the plurality of third matrices is respectively cached in a first register of each calculation unit according to a first preset direction.
Optionally, according to different control signals, the present disclosure may buffer each matrix element of the third matrices in the first registers of the computing units over a plurality of clock cycles in different manners. The specific implementation of buffering each matrix element of the third matrices is described below by taking the two forms of the second control signal described above as examples.
In a first example, when a computing unit in the present disclosure receives the first register storage indication and a position identification, it learns from the first register storage indication that data needs to be stored in a first register. Each computing unit then determines whether the received position identification is its own. If a computing unit determines that the received position identification is its own, it buffers the data transmitted according to the first preset direction in its first register; if a computing unit determines that the received position identification is not its own, it does not perform the first-register buffering operation for that data.
In a second example, when receiving the shift indication, a computing unit in the present disclosure may perform shift processing on the matrix element cached in its first register, so that the cached matrix element is shifted into the first register of the downstream computing unit that is in the same row as, and adjacent to, this computing unit, while the matrix element transmitted according to the first preset direction is cached in the first register of this computing unit. The same row in the present disclosure may refer to a row along the first preset direction, and downstream may refer to downstream along the first preset direction. For example, in fig. 4, the computing units m030, m031, m032, and m033 are located in the same row; m031 is the downstream computing unit of m030, m032 is the downstream computing unit of m031, and m033 is the downstream computing unit of m032.
Optionally, the number of clock cycles the present disclosure requires to store each matrix element of all third matrices in the first registers of the computing unit array is generally related to the size of the computing unit array. For example, if the size of the computing unit array is m1 × m1 × m1, it typically takes m1 clock cycles to store all matrix elements of all third matrices in the computing unit array.
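The shift indication turns each row of first registers into a shift chain along the first preset direction, which is what makes the m1-cycle estimate work: one new element enters the row per clock cycle while earlier elements move downstream. A sketch (illustrative; the streaming order is an assumption):

```python
# One row of N first registers loaded by N shift indications: each shift moves
# every cached element one unit downstream (x -> x+1) and inserts the newly
# transmitted element at x = 0.
N = 4

def shift_in(row, element):
    return [element] + row[:-1]

row = [None] * N
# streaming the later matrices' elements first leaves c0's element at x = 0
for element in ["c3", "c2", "c1", "c0"]:
    row = shift_in(row, element)
```

After four shifts the row reads ["c0", "c1", "c2", "c3"], i.e. each of the four third matrices ends up occupying its own position along the row.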
Alternatively, it is assumed that matrix-vector operations of four third matrices and four vectors are performed using the 4 × 4 × 4 computing unit array shown in FIGS. 4 and 5, and that the four third matrices c0, c1, c2, and c3 may be represented as follows:

c0 = | c000 c001 c002 c003 |      c1 = | c100 c101 c102 c103 |
     | c010 c011 c012 c013 |           | c110 c111 c112 c113 |
     | c020 c021 c022 c023 |           | c120 c121 c122 c123 |
     | c030 c031 c032 c033 |           | c130 c131 c132 c133 |

c2 = | c200 c201 c202 c203 |      c3 = | c300 c301 c302 c303 |
     | c210 c211 c212 c213 |           | c310 c311 c312 c313 |
     | c220 c221 c222 c223 |           | c320 c321 c322 c323 |
     | c230 c231 c232 c233 |           | c330 c331 c332 c333 |

The present disclosure can store these 64 data, c000 through c033, c100 through c133, c200 through c233, and c300 through c333, in the computing units of the computing unit array shown in FIGS. 4 and 5 over four clock cycles, as follows:
c000, c100, c200, and c300 are stored in the first registers of the calculation units m000, m001, m002, and m003, respectively;
c001, c101, c201, and c301 are stored in the first registers of the calculation units m010, m011, m012, and m013, respectively;
c002, c102, c202 and c302 are stored in the first registers of the calculation units m020, m021, m022 and m023, respectively;
c003, c103, c203, and c303 are stored in the first registers of the calculation units m030, m031, m032, and m033, respectively;
c010, c110, c210 and c310 are stored in first registers of the computing units m100, m101, m102 and m103, respectively;
c011, c111, c211, and c311 are stored in the first registers of the calculation units m110, m111, m112, and m113, respectively;
c012, c112, c212, and c312 are stored in the first registers of the calculation units m120, m121, m122, and m123, respectively;
c013, c113, c213, and c313 are stored in the first registers of the computing units m130, m131, m132, and m133, respectively;
c020, c120, c220 and c320 are stored in the first registers of the computing units m200, m201, m202 and m203, respectively;
c021, c121, c221 and c321 are stored in the first registers of the calculation units m210, m211, m212 and m213, respectively;
c022, c122, c222, and c322 are stored in the first registers of the calculation units m220, m221, m222, and m223, respectively;
c023, c123, c223, and c323 are stored in the first registers of the calculation units m230, m231, m232, and m233, respectively;
c030, c130, c230 and c330 are stored in first registers of the computing units m300, m301, m302 and m303, respectively;
c031, c131, c231, and c331 are stored in the first registers of the calculation units m310, m311, m312, and m313, respectively;
c032, c132, c232 and c332 are stored in the first registers of the calculation units m320, m321, m322 and m323, respectively;
c033, c133, c233 and c333 are stored in the first registers of the calculation units m330, m331, m332 and m333, respectively.
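The 64-element placement above follows one rule: element (i, j) of the t-th third matrix lands in unit m i j t, so same-position elements of c0 through c3 line up along the x direction of one row, consistent with the description of S200. A sketch using the m[z][y][x] indexing assumed earlier:

```python
# Place c_t[i][j] in the register of unit (z = i, y = j, x = t); e.g. c000,
# c100, c200, c300 -- position (0,0) of c0..c3 -- fill m000..m003.
N = 4
c = [[[f"c{t}{i}{j}" for j in range(N)] for i in range(N)] for t in range(N)]

# reg[z][y][x] models the register of unit m{z}{y}{x} that caches the third matrices
reg = [[[c[x][z][y] for x in range(N)] for y in range(N)] for z in range(N)]
```

For example, `reg[1][0]` models m100 through m103, holding c010, c110, c210, and c310 as listed above.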
When matrix-vector multiplication is required, the present disclosure may provide all vector elements of the four vectors to the computing unit array along the second preset direction (i.e., the z direction shown in FIGS. 4 and 5) in one clock cycle. Assume that the four vectors are d0, d1, d2, and d3, respectively, where d0 may be denoted as (d00, d01, d02, d03), d1 as (d10, d11, d12, d13), d2 as (d20, d21, d22, d23), and d3 as (d30, d31, d32, d33). Under the above assumption:
Vector element d00 is provided to m000, m100, m200, and m300;
vector element d10 is provided to m001, m101, m201, and m301;
vector element d20 is provided to m002, m102, m202, and m302;
vector element d30 is provided to m003, m103, m203, and m303;
vector element d01 is provided to m010, m110, m210, and m310;
vector element d11 is provided to m011, m111, m211, and m311;
vector element d21 is provided to m012, m112, m212, and m312;
vector element d31 is provided to m013, m113, m213, and m313;
vector element d02 is provided to m020, m120, m220, and m320;
vector element d12 is provided to m021, m121, m221, and m321;
vector element d22 is provided to m022, m122, m222, and m322;
vector element d32 is provided to m023, m123, m223, and m323;
vector element d03 is provided to m030, m130, m230, and m330;
vector element d13 is provided to m031, m131, m231, and m331;
vector element d23 is provided to m032, m132, m232, and m332;
vector element d33 is provided to m033, m133, m233, and m333.
Thus, the present disclosure can complete, in one clock cycle, the matrix-vector multiplication of matrix c0 with vector d0, of matrix c1 with vector d1, of matrix c2 with vector d2, and of matrix c3 with vector d3.
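The parallel operation described above can be checked with a small numerical sketch. This is an illustrative model only: NumPy stands in for the hardware array, and the routing of individual elements to individual computing units is abstracted away.

```python
import numpy as np

# Illustrative sketch, not the patent's circuit: four 4x4 matrices
# c0..c3 are pre-loaded, one per plane of a 4x4x4 compute-unit array,
# and four vectors d0..d3 are broadcast in one pass. Each unit multiplies
# its stored element by the vector element it receives; accumulating
# along each row yields four independent matrix-vector products.
c = np.arange(64, dtype=float).reshape(4, 4, 4)  # c[z] plays the role of matrix cz
d = np.arange(16, dtype=float).reshape(4, 4)     # d[z] plays the role of vector dz

# one "clock cycle": per-unit multiply, then row-wise accumulation
out = np.einsum('zij,zj->zi', c, d)

for z in range(4):
    assert np.allclose(out[z], c[z] @ d[z])      # out[z] equals cz @ dz
```

The single `einsum` call models the fact that all four products complete together, rather than in four sequential passes.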
By using the second control signal, each matrix element of one matrix in the matrix-matrix operation and each matrix element of one matrix in the matrix-vector operation can be stored sequentially in the computing units, so that interleaved matrix-matrix and matrix-vector operations can be completed in the least time. By adopting the position identification, each matrix element in the third matrix can be stored at one time in the first register of the computing unit at the corresponding position of the computing unit array; by adopting the shift indication, each matrix element in the third matrix can be moved stepwise into the first register of the computing unit at the corresponding position of the computing unit array. Both approaches help improve the flexibility with which the first register is loaded.
In an alternative example, a multiplexer may be provided in each computing unit of the present disclosure, and the multiplexer may be used to implement the buffering of matrix elements. Specifically, according to the received control signal, a computing unit may gate, through its multiplexer, the connection between its first register and the input in the first preset direction, that is, the connection between its first register and a data line of the computing unit array, so that each matrix element of the third matrix transmitted along that data line in the first preset direction is buffered in the first register of the corresponding computing unit. The multiplexer thus conveniently allows matrix elements to be stored in different registers of a computing unit.
In an alternative example, the second register in each computing unit of the computing unit array of the present disclosure may be a register used for performing matrix-matrix and matrix-vector operations, while the first register may be a register used for caching. That is, for a given computing unit, the second register is connected to the unit that performs operations in the computing unit (e.g., the multiplying unit), whereas the first register may not be. In this case, if operations of at least one vector with a plurality of third matrices need to be performed, the present disclosure may control the matrix elements cached in the first register of each computing unit to be transferred to the second register of that computing unit, and then perform the first operation on the matrix elements in the second register and the vector elements transmitted in the second preset direction.
Alternatively, the present disclosure may use a third control signal to cause the matrix element cached in the first register of each computing unit to be transferred to its second register. In a specific example, the present disclosure may first send the third control signal to each computing unit via its control input; the computing unit then, according to the received third control signal, gates the connection between the first register and the second register through its multiplexer and transfers the matrix element buffered in the first register to the second register.
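A rough software model of this control flow is sketched below. The class, signal names, and method names are invented for illustration and do not appear in the patent; the multiplexer is modeled simply as branching on the received control signal.

```python
# Hypothetical model of one compute unit: a first (cache) register, a
# second (operand) register feeding the multiplier, and a multiplexer
# modeled as branching on the received control signal. All names are
# illustrative; the patent does not prescribe this interface.
class ComputeUnit:
    def __init__(self):
        self.first_reg = None    # caching register
        self.second_reg = None   # operand register wired to the multiplier

    def on_control(self, signal, data=None):
        if signal == "LOAD_SECOND":   # first control signal: load operand register
            self.second_reg = data
        elif signal == "LOAD_FIRST":  # second control signal: cache a matrix element
            self.first_reg = data
        elif signal == "TRANSFER":    # third control signal: mux gates the
            self.second_reg = self.first_reg  # first-to-second register path

    def multiply(self, vector_element):
        # the multiplying unit always reads the second register
        return self.second_reg * vector_element

unit = ComputeUnit()
unit.on_control("LOAD_FIRST", 7)      # cached during matrix-matrix work
unit.on_control("TRANSFER")           # one cycle to move it into place
assert unit.multiply(3) == 21         # ready for the matrix-vector phase
```

The key point modeled here is that the multiplier only ever reads the second register, so the cached element must be transferred before the matrix-vector phase begins.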
It should be specifically noted, however, that the first register and the second register in a computing unit of the present disclosure may both be registers used for performing matrix-matrix and matrix-vector operations. For example, both registers may be connected through a multiplexer to the unit that performs operations in the computing unit: when the matrix operation of the first matrix and the second matrix needs to be performed, the multiplexer gates the connection between the second register and that unit; when matrix-vector operations of a plurality of third matrices and vectors need to be performed, the multiplexer gates the connection between the first register and that unit. In this case, the present disclosure need not transfer the matrix elements cached in the first register into the second register before operating; the matrix-vector operation is performed directly from the first register.
Transferring the matrix elements cached in the first register into the second register before performing the first operation does occupy one clock cycle, but a pipelined design allows that cycle to be hidden within the vector and matrix operations. This helps realize matrix-vector operations conveniently while keeping the structure of the computing unit simple.
The data processing method provided in the present disclosure is applicable to the operation of a neural network; an example is shown in fig. 10.
In fig. 10, at S1000, a first matrix block in a first matrix to be operated of the N-th layer of a neural network is cached in the second register of each computing unit of the computing unit array, and m second matrix blocks in a second matrix to be operated of the N-th layer are sequentially provided to the computing unit array, so as to obtain the operation results of the first matrix block with each of the m second matrix blocks.
Optionally, N and m in the present disclosure are integers greater than 1. The neural network in the present disclosure may be a neural network including convolution operations; for example, it may belong to the lightweight MobileNet (mobile-end neural network) series. As shown in fig. 11, the N-th layer in the present disclosure may be a Pointwise Conv (point-wise convolution) layer in the neural network.
Alternatively, as described above, convolution operations in a neural network may be converted into matrix-matrix operations and matrix-vector operations. The present disclosure may employ existing conversion methods for this purpose, and the specific conversion process is not described in detail herein.
Alternatively, the first matrix to be operated on in the present disclosure may be the input F2 of the nth layer, and the second matrix to be operated on may be the weight matrix W2 of the nth layer of the neural network (as shown in fig. 11).
In one example, the first matrix to be operated F2 may be a three-dimensional tensor of 16×16×64, and the second matrix to be operated W2 may be a four-dimensional tensor of 256×1×1×64. The present disclosure may divide F2 and W2 into a plurality of matrix blocks respectively; for example, F2 may be divided into 64 first matrix blocks of size 16×16 each, and W2 into 64 second matrix blocks of size 16×16 each. The present disclosure may then perform addition and similar processing on the operation results of all matrix blocks to form the matrix multiplication result F3 of W2 and F2. The output of the N-th layer is this result F3, which may be a three-dimensional tensor of 16×16×256.
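The equivalence behind this blocking scheme can be sketched numerically. The shapes follow the example above, but the tiling here splits only the shared 64-channel axis; it is a simplification of the two-dimensional blocking described, intended only to show why the per-block results can be added to form the full result.

```python
import numpy as np

# A 1x1 ("pointwise") convolution of a 16x16x64 input with 256 filters
# reduces to one matrix multiplication over the channel axis, and that
# product can be computed as a sum of 16-wide tile products -- the same
# idea as accumulating the per-block results described above.
F2 = np.random.rand(16 * 16, 64)   # input, flattened to (pixels, channels)
W2 = np.random.rand(64, 256)       # 256 pointwise filters, 64 channels each

full = F2 @ W2                     # reference result (forms F3 after reshaping)
tiled = sum(F2[:, k:k + 16] @ W2[k:k + 16, :] for k in range(0, 64, 16))
assert np.allclose(full, tiled)    # blockwise accumulation matches
```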
Alternatively, the present disclosure may use one clock cycle to buffer each matrix element of a first matrix block in the second register of each computing unit of a 16×16 computing unit array.
At S1001, during the operation of the first matrix block with the m second matrix blocks, m third matrix blocks in a third matrix to be operated of the (N-1)-th layer of the neural network are respectively buffered in the first registers of the computing units of the computing unit array.
Alternatively, in the case where the neural network in the present disclosure is a lightweight MobileNet-series network, the (N-1)-th layer may be a Depthwise Conv (depthwise convolution) layer in the neural network.
Alternatively, the third matrix to be operated in the present disclosure may be the input F1 of the (N-1)-th layer (as shown in fig. 11), and the vector to be operated may be the weight matrix W1 of the (N-1)-th layer of the neural network.
In one example, the third matrix to be operated may be a three-dimensional tensor of 16×16×64, and the present disclosure may divide F1 into a plurality of third matrix blocks, for example, 64 third matrix blocks of size 16×16 each. If the computing unit array is 16×16 in size, the present disclosure may use 16 clock cycles to buffer all matrix elements of 16 third matrix blocks in the first registers of the computing units. These 16 clock cycles are exactly the time needed for one first matrix block to perform matrix operations with 16 second matrix blocks.
At S1002, in the case where the matrix operation results of the first matrix block and the m second matrix blocks are obtained, m vector blocks are respectively provided to the computing unit array to obtain the operation results of the m vector blocks with the m third matrix blocks.
Alternatively, the vector to be operated in the present disclosure may be the weight matrix W1 of the (N-1)-th layer (as shown in fig. 11). In one example, the vector to be operated W1 may be a three-dimensional tensor of 4×4×64, and the present disclosure may divide W1 into a plurality of vector blocks, for example into 64 vector blocks of 16 vector elements each. The present disclosure may sequentially provide the vector blocks to the computing unit array over a plurality of consecutive clock cycles, so that the array completes multiple vector-matrix operations; the operation results of all vectors and matrices may then be added and otherwise processed to form the matrix-vector multiplication result F2 of W1 and F1. The output F2 of the (N-1)-th layer may be a three-dimensional tensor of 16×16×64.
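Under the reading above (64 third matrix blocks of 16×16 and 64 vector blocks of 16 elements each), the per-cycle work can be sketched as one matrix-vector product per block pair. The shapes are illustrative only; the mapping from depthwise convolution to these blocks is abstracted away.

```python
import numpy as np

# Illustrative only: one 16x16 third matrix block and one 16-element
# vector block per channel, consumed over 64 consecutive cycles. The
# conversion of the depthwise convolution into these blocks is assumed
# to have been done already, as described in the text.
rng = np.random.default_rng(0)
blocks = rng.random((64, 16, 16))     # third matrix blocks (from F1)
kernels = rng.random((64, 16))        # vector blocks (from W1)

partials = [B @ v for B, v in zip(blocks, kernels)]  # one matvec per cycle
assert len(partials) == 64 and partials[0].shape == (16,)
```

The 64 partial results would then be combined (added and reshaped) into the 16×16×64 output F2, as the text describes.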
By alternately executing matrix-matrix operations on the matrix blocks of the N-th layer and matrix-vector operations on the matrix blocks of the (N-1)-th layer, the present disclosure can finally complete the operations of the N-th layer and the (N-1)-th layer efficiently.
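The alternation can be pictured as a simple schedule in which the caching of layer-(N-1) blocks is hidden under the layer-N matrix-matrix cycles. The function and event names here are invented for illustration; this only demonstrates the cycle accounting, not the hardware.

```python
# Schematic schedule: m matrix-matrix cycles for layer N, during which
# the m third matrix blocks of layer N-1 are cached in the same cycles,
# followed by m matrix-vector cycles for layer N-1. Event names are
# illustrative, not from the patent.
def interleaved_schedule(m):
    events = []
    for cycle in range(m):
        events.append(("matmul", cycle))   # layer-N block operation
        events.append(("cache", cycle))    # layer-(N-1) block load, same cycle
    for cycle in range(m):
        events.append(("matvec", cycle))   # layer-(N-1) block operation
    return events

sched = interleaved_schedule(16)
compute = [e for e in sched if e[0] in ("matmul", "matvec")]
assert len(compute) == 32                            # 16 + 16 compute cycles total
assert all(("cache", c) in sched for c in range(16)) # loads fully hidden
```

The point of the assertion is that the 16 cache loads add no cycles of their own: the total is 16 + 16 compute cycles rather than 16 + 16 + 16.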
It should be specifically noted that, when dividing the matrices to be operated and the vector to be operated, the present disclosure should avoid data dependencies across the operations of different layers as far as possible.
Exemplary apparatus
FIG. 12 is a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure. The apparatus of this embodiment may be used to implement the corresponding method embodiments of the present disclosure. The apparatus as shown in fig. 12 includes: a first control module 1200, a second control module 1201, and a processing module 1202.
The first control module 1200 is configured to control each matrix element in the plurality of third matrices to be buffered in a first register of a plurality of computing units of the computing unit array according to a first preset direction during a matrix operation of the first matrix and the second matrix performed by the computing unit array. The first registers of a plurality of computing units in a row of computing units in a first preset direction are respectively cached in matrix elements at the same position in a plurality of third matrices.
Optionally, the first control module 1200 in the present disclosure is further configured to control each matrix element in the first matrix to be cached in the second register of each computing unit according to the first preset direction, and to provide each matrix element in the second matrix to each computing unit according to the second preset direction. Each computing unit performs a third operation on the matrix element in its second register and the matrix element transmitted in the second preset direction, performs a fourth operation on the third operation result and the output of the corresponding computing unit in the previous row, and outputs the fourth operation result. The first control module 1200 may obtain the matrix operation results of the first matrix and the second matrix according to the fourth operation results output by the computing units in the last row of the computing unit array. The second registers of the computing units in a row of computing units in the first preset direction cache the same matrix element of the first matrix.
Alternatively, the first control module 1200 may send a first control signal to each computing unit based on the control input end of each computing unit, where the first control module 1200 buffers each matrix element in the first matrix in the second register of each computing unit according to the first preset direction according to the first control signal.
Optionally, the first control module 1200 may gate the connection between the second register and the first preset direction input through the multiplexer in the computing unit according to the first control signal, and buffer each matrix element in the first matrix in the second register of each computing unit through the first preset direction of each computing unit.
Alternatively, the first control module 1200 may send a second control signal to each computing unit based on the control input end of each computing unit, where the first control module 1200 buffers each matrix element in the plurality of third matrices in the first register of each computing unit according to the first preset direction according to the second control signal.
Optionally, the second control signal may include: the first register stores an indication, and a location identification of the computing unit. For any computing unit, when it is determined that the computing unit receives the first register storage indication and the location identifier, and the location identifier is the location identifier of the computing unit, the first control module 1200 buffers the matrix element transmitted according to the first preset direction in the first register of the computing unit.
Optionally, the second control signal may include: a shift indication. For any computing unit, when it is determined that the computing unit has received the shift indication, the first control module 1200 may shift the matrix element stored in the first register of that computing unit to the first register of the downstream computing unit that is in the same row as and adjacent to it, and buffer the matrix element transmitted in the first preset direction in the first register of that computing unit.
Optionally, the first control module 1200 gates, through a multiplexer in the computing unit, connection between the first register of the computing unit and the first preset direction input, and buffers the matrix element transmitted according to the first preset direction in the first register of the computing unit.
The second control module 1201 is configured to, when performing operations of at least one vector and the plurality of third matrices, perform a first operation on matrix elements buffered in the first registers of the plurality of computing units and vector elements transmitted according to a second preset direction, perform a second operation on a result of the first operation and an output of a corresponding computing unit in a previous row, and output a result of the second operation.
Optionally, the second control module 1201 may control the matrix elements in the first register of each computing unit to be cached in the second register of each computing unit, and perform the first operation on the matrix elements in the second register of each computing unit and the vector elements transmitted according to the second preset direction.
Optionally, the second control module 1201 sends a third control signal to each computing unit based on the control input terminal of each computing unit; the second control module 1201 gates the connection of the first register and the second register through a multiplexer in the calculation unit according to the third control signal, and buffers the matrix element in the first register in the second register.
The processing module 1202 is configured to obtain an operation result of at least one vector and a plurality of third matrices according to the second operation result respectively output by a plurality of computing units in a plurality of last rows in the computing unit array.
Fig. 13 is a schematic structural view of an embodiment of a data processing apparatus of the present disclosure. The apparatus of this embodiment may be used to implement the corresponding method embodiments of the present disclosure. The apparatus as shown in fig. 13 includes: the third control module 1300, the fourth control module 1301, and the fifth control module 1302.
The third control module 1300 is configured to buffer a first matrix block in a first matrix to be operated of an nth layer based on the neural network in a second register of each computing unit of the computing unit array, and sequentially provide m second matrix blocks in a second matrix to be operated of the nth layer to the computing unit array, so as to obtain operation results of the first matrix block and the m second matrix blocks. Wherein N and m are integers greater than 1.
The fourth control module 1301 is configured to buffer m third matrix blocks in the third to-be-operated matrix of the N-1 th layer based on the neural network in the first registers of the computing units of the computing unit array during the operation of the first matrix block and the m second matrix blocks.
The fifth control module 1302 is configured to, when obtaining the matrix operation results of the first matrix block and the m second matrix blocks, provide the m vector blocks to the computing unit array, respectively, to obtain the operation results of the m vector blocks and the m third matrix blocks, respectively.
Exemplary electronic device
An electronic device according to an embodiment of the present disclosure is described below with reference to fig. 14. Fig. 14 shows a block diagram of an electronic device according to an embodiment of the disclosure. As shown in fig. 14, electronic device 141 includes one or more processors 1411 and memory 1412.
Processor 1411 may be a Central Processing Unit (CPU) or other form of processing unit having data processing and/or instruction execution capabilities, and may control other components in electronic device 141 to perform desired functions.
Memory 1412 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example: random Access Memory (RAM) and/or cache, etc. The nonvolatile memory may include, for example: read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer readable storage medium that can be executed by the processor 1411 to implement the data processing methods and/or other desired functions of the various embodiments of the present disclosure described above. Various contents such as an input signal, a signal component, a noise component, and the like may also be stored in the computer-readable storage medium.
In one example, the electronic device 141 may further include an input device 1413 and an output device 1414, which are interconnected by a bus system and/or another form of connection mechanism (not shown). The input device 1413 may include, for example, a keyboard, a mouse, and the like. The output device 1414 may output various information to the outside and may include, for example, a display, speakers, a printer, and a communication network and the remote output devices connected thereto.
Of course, for simplicity, only some of the components of the electronic device 141 relevant to the present disclosure are shown in fig. 14; components such as buses and input/output interfaces are omitted. In addition, the electronic device 141 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer readable storage Medium
In addition to the methods and apparatus described above, embodiments of the present disclosure may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform steps in a data processing method according to various embodiments of the present disclosure described in the "exemplary methods" section of the present description.
The computer program product may write program code for performing the operations of embodiments of the present disclosure in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium, having stored thereon computer program instructions, which when executed by a processor, cause the processor to perform steps in a data processing method according to various embodiments of the present disclosure described in the above "exemplary method" section of the present disclosure.
The computer readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium may include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present disclosure have been described above in connection with specific embodiments, but it should be noted that the advantages, benefits, effects, etc. mentioned in the present disclosure are merely examples and not limiting, and these advantages, benefits, effects, etc. are not to be considered as necessarily possessed by the various embodiments of the present disclosure. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, since the disclosure is not necessarily limited to practice with the specific details described.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different manner from other embodiments, so that the same or similar parts between the embodiments are mutually referred to. For system embodiments, the description is relatively simple as it essentially corresponds to method embodiments, and reference should be made to the description of method embodiments for relevant points.
The block diagrams of the devices, apparatuses, equipment, and systems referred to in this disclosure are merely illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As will be appreciated by one of skill in the art, these devices, apparatuses, equipment, and systems may be connected, arranged, and configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words meaning "including but not limited to," and may be used interchangeably therewith. The terms "or" and "and" as used herein refer to, and are used interchangeably with, the term "and/or," unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as but not limited to."
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, firmware. The above-described sequence of steps for the method is for illustration only, and the steps of the method of the present disclosure are not limited to the sequence specifically described above unless specifically stated otherwise. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the apparatus, devices and methods of the present disclosure, components or steps may be disassembled and/or assembled. Such decomposition and/or recombination should be considered equivalent to the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects, and the like, will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, changes, additions, and sub-combinations thereof.

Claims (10)

1. A data processing method, comprising:
In the process of executing matrix operation of a first matrix and a second matrix through a computing unit array, controlling each matrix element in a plurality of third matrices to be respectively cached in first registers of a plurality of computing units of the computing unit array according to a first preset direction, wherein the first registers of a plurality of computing units in a row of computing units in the first preset direction respectively cache the matrix elements in the same position in the plurality of third matrices;
when the operation of at least one vector and the plurality of third matrixes is required to be executed, respectively carrying out first operation on matrix elements cached by a first register of the plurality of calculation units and vector elements transmitted according to a second preset direction, carrying out second operation on the result of the first operation and the output of a corresponding calculation unit in the previous row, and outputting the result of the second operation;
and obtaining the operation results of the at least one vector and the plurality of third matrixes according to the results of the second operation respectively output by the plurality of calculation units in the last rows of the calculation unit arrays.
2. The method of claim 1, wherein the performing, by the computing unit array, the matrix operation of the first matrix and the second matrix comprises:
Controlling each matrix element in the first matrix to be cached in a second register of each computing unit according to the first preset direction;
Providing each matrix element in the second matrix to each calculation unit according to a second preset direction;
Performing third operation on matrix elements in a second register of each calculation unit and the matrix elements transmitted according to a second preset direction, performing fourth operation on the third operation result and the output of the corresponding calculation unit in the previous row, and outputting a fourth operation result;
obtaining matrix operation results of the first matrix and the second matrix according to fourth operation results output by each calculation unit in the last row of the calculation unit array;
wherein the second register of each computing unit in the row of computing units in the first preset direction caches the same matrix element in the first matrix.
3. The method of claim 2, wherein the controlling each matrix element in the first matrix to be cached in the second register of each computing unit according to the first preset direction comprises:
transmitting a first control signal to each computing unit based on a control input of each computing unit;
and according to the first control signal, each matrix element in the first matrix is cached in a second register of each calculation unit according to a first preset direction.
4. A method according to any one of claims 1 to 3, wherein controlling each matrix element in the plurality of third matrices to be buffered in a first register of each computing unit according to a first preset direction, respectively, comprises:
Transmitting a second control signal to each computing unit based on the control input of each computing unit;
and according to the second control signals, each matrix element in the third matrixes is respectively cached in a first register of each computing unit according to a first preset direction.
5. The method of claim 4, wherein the second control signal comprises: the first register stores an indication, and a location identity of the computing unit;
According to the second control signal, each matrix element in the plurality of third matrices is respectively cached in a first register of each calculation unit according to a first preset direction, and the method comprises the steps of:
For any computing unit, when the computing unit is determined to receive the first register storage indication and the position identification, and the position identification is the position identification of the computing unit, the matrix element transmitted according to the first preset direction is cached in the first register of the computing unit.
6. The method of claim 4, wherein the second control signal comprises: a shift indication;
According to the second control signal, each matrix element in the plurality of third matrices is respectively cached in a first register of each calculation unit according to a first preset direction, and the method comprises the steps of;
For any computing unit, when the computing unit is determined to receive a shift instruction, shifting matrix elements stored in a first register of the computing unit to a first register of a downstream computing unit which is in the same row as the computing unit and is adjacent to the computing unit, and buffering the matrix elements transmitted according to a first preset direction in the first register of the computing unit.
7. A data processing method, comprising:
Caching a first matrix block in a first matrix to be operated of an N-th layer based on a neural network in a second register of each calculation unit of a calculation unit array, and sequentially providing m second matrix blocks in a second matrix to be operated of the N-th layer for the calculation unit array to obtain operation results of the first matrix block and the m second matrix blocks respectively; wherein N and m are integers greater than 1;
In the operation process of the first matrix block and the m second matrix blocks, respectively buffering m third matrix blocks in a third matrix to be operated of an N-1 th layer based on a neural network in a first register of each calculation unit of the calculation unit array;
And under the condition that the matrix operation results of the first matrix block and the m second matrix blocks are obtained, respectively, providing m vector blocks for the calculation unit array to obtain operation results of the m vector blocks and the m third matrix blocks.
8. A data processing apparatus comprising:
The first control module is used for controlling each matrix element in the plurality of third matrices to be respectively cached in first registers of a plurality of computing units of the computing unit array according to a first preset direction in the process of executing matrix operation of the first matrix and the second matrix through the computing unit array, wherein the first registers of a plurality of computing units in a row of computing units in the first preset direction respectively cache the matrix elements in the same position in the plurality of third matrices;
The second control module is used for respectively carrying out first operation on matrix elements cached by the first registers of the plurality of computing units and vector elements transmitted according to a second preset direction when operation of at least one vector and the plurality of third matrices is required to be executed, carrying out second operation on the result of the first operation and the output of the corresponding computing unit in the previous row, and outputting the result of the second operation;
And the processing module is used for obtaining the operation results of the at least one vector and the plurality of third matrixes according to the results of the second operation respectively output by the plurality of calculation units in the last rows of the plurality of calculation unit arrays.
9. A computer readable storage medium storing a computer program for performing the method of any one of the preceding claims 1-7.
10. An electronic device, the electronic device comprising:
A processor;
a memory for storing the processor-executable instructions;
The processor being configured to read the executable instructions from the memory and execute the instructions to implement the method of any of the preceding claims 1-7.
CN202010595866.7A 2020-06-28 2020-06-28 Data processing method and device Active CN111753253B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010595866.7A CN111753253B (en) 2020-06-28 2020-06-28 Data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010595866.7A CN111753253B (en) 2020-06-28 2020-06-28 Data processing method and device

Publications (2)

Publication Number Publication Date
CN111753253A CN111753253A (en) 2020-10-09
CN111753253B true CN111753253B (en) 2024-05-28

Family

ID=72677416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010595866.7A Active CN111753253B (en) 2020-06-28 2020-06-28 Data processing method and device

Country Status (1)

Country Link
CN (1) CN111753253B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006120664A2 (en) * 2005-05-13 2006-11-16 Provost Fellows And Scholars Of The College Of The Holy And Undivided Trinity Of Queen Elizabeth Near Dublin A data processing system and method
CN106775599A (en) * 2017-01-09 2017-05-31 南京工业大学 Multi-computing-unit coarse-grained reconfigurable system and method for recurrent neural network
CN108304922A (en) * 2017-01-13 2018-07-20 华为技术有限公司 Computing device and computational methods for neural computing
CN109190756A (en) * 2018-09-10 2019-01-11 中国科学院计算技术研究所 Arithmetic unit based on Winograd convolution and the neural network processor comprising the device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8577950B2 (en) * 2009-08-17 2013-11-05 International Business Machines Corporation Matrix multiplication operations with data pre-conditioning in a high performance computing architecture
US10528321B2 (en) * 2016-12-07 2020-01-07 Microsoft Technology Licensing, Llc Block floating point for neural network implementations
JP6907700B2 (en) * 2017-05-23 2021-07-21 富士通株式会社 Information processing device, multi-thread matrix operation method, and multi-thread matrix operation program
GB2563878B (en) * 2017-06-28 2019-11-20 Advanced Risc Mach Ltd Register-based matrix multiplication

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006120664A2 (en) * 2005-05-13 2006-11-16 Provost Fellows And Scholars Of The College Of The Holy And Undivided Trinity Of Queen Elizabeth Near Dublin A data processing system and method
CN106775599A (en) * 2017-01-09 2017-05-31 南京工业大学 Multi-computing-unit coarse-grained reconfigurable system and method for recurrent neural network
CN108304922A (en) * 2017-01-13 2018-07-20 华为技术有限公司 Computing device and computational methods for neural computing
CN109190756A (en) * 2018-09-10 2019-01-11 中国科学院计算技术研究所 Arithmetic unit based on Winograd convolution and the neural network processor comprising the device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Register-Aware Optimizations for Parallel Sparse Matrix-Matrix Multiplication;Junhong liu;International Journal of Parallel Programming;20190101;403-417 *
一种基于可编程逻辑器件的卷积神经网络协处理器设计;杨一晨;梁峰;张国和;何平;吴斌;高震霆;;西安交通大学学报;20180710(第07期);153-159 *
一种简洁高效的加速卷积神经网络的方法;刘进锋;;科学技术与工程;20141128(第33期);240-244 *
面向多核向量处理器的矩阵乘法向量化方法;刘仲;田希;计算机学报;20170630;第41卷(第10期);2251-2264 *

Also Published As

Publication number Publication date
CN111753253A (en) 2020-10-09

Similar Documents

Publication Publication Date Title
US11442786B2 (en) Computation method and product thereof
CN103440121B (en) A kind of triangular matrix multiplication vectorization method of vector processor-oriented
KR102443546B1 (en) matrix multiplier
CN112214727B (en) Operation accelerator
CN106846235B (en) Convolution optimization method and system accelerated by NVIDIA Kepler GPU assembly instruction
CN110415157B (en) Matrix multiplication calculation method and device
CN108170640B (en) Neural network operation device and operation method using same
CN109472361B (en) Neural network optimization method
CN110210615B (en) Systolic array system for executing neural network calculation
US10706353B2 (en) Integrated circuit
WO2022022362A1 (en) Data processing method and device, and storage medium
US20240119114A1 (en) Matrix Multiplier and Matrix Multiplier Control Method
CN110727911A (en) Matrix operation method and device, storage medium and terminal
US10754818B2 (en) Multiprocessor device for executing vector processing commands
CN114385972A (en) Parallel computing method for directly solving structured triangular sparse linear equation set
CN115310037A (en) Matrix multiplication computing unit, acceleration unit, computing system and related method
CN111753253B (en) Data processing method and device
CN111160541A (en) Integrated circuit chip device and related product
CN116484929A (en) Point cloud target detection neural network accelerator based on FPGA and acceleration method
CN111047037A (en) Data processing method, device, equipment and storage medium
CN112346704B (en) Full-streamline type multiply-add unit array circuit for convolutional neural network
CN112418417B (en) Convolutional neural network acceleration device and method based on SIMD technology
CN111091189A (en) Integrated circuit chip device and related product
CN115081600A (en) Conversion unit for executing Winograd convolution, integrated circuit device and board card
CN113627587A (en) Multichannel convolutional neural network acceleration method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant