CN111125628A - Method and apparatus for processing two-dimensional data matrix by artificial intelligence processor


Info

Publication number
CN111125628A
Authority
CN
China
Prior art keywords
matrix
sub
dimensional data
matrices
data matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201911349779.7A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN201911349779.7A
Publication of CN111125628A
Legal status: Withdrawn

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Multi Processors (AREA)

Abstract

The present disclosure describes a method, an electronic device, and a computing device for an artificial intelligence processor to process a two-dimensional data matrix, wherein the computing device may be included in a combined processing device, which may further comprise a universal interconnect interface and other processing devices. The computing device interacts with the other processing devices to jointly complete computing operations specified by a user. The combined processing device may further comprise a storage device connected to the computing device and the other processing devices, respectively, for storing data of the computing device and the other processing devices.

Description

Method and apparatus for processing two-dimensional data matrix by artificial intelligence processor
Technical Field
The present invention relates to the field of data processing, and more particularly to the field of matrix operations on artificial intelligence processors.
Background
In computer vision, a video consists of a sequence of image frames, and each image is built from signals; the processing of signals, images, and video is therefore increasingly important. In practice, each image is a two-dimensional data matrix, so processing an image means processing a rather large two-dimensional data matrix. Matrix transposition is a fundamental and important operation in matrix processing. Consequently, fields such as signal processing, image processing, and video analysis generate a large volume of matrix transposition operations and place high demands on transposition performance.
Outside the field of deep learning, where pictures of smaller sizes are used, images in fields such as signal processing, image processing, and video analysis are generally large; the most common resolutions are 720p, 1080p, 4K, and even 8K, which gives rise to a large demand for large-scale matrix transposition. However, the on-chip resources (on-chip RAM) of the processors involved in the computation are generally limited and insufficient to cache such large-scale matrix data. Meanwhile, although the capacity of the off-chip memory is much larger than that of the on-chip memory, operating in off-chip memory is much slower than operating in on-chip memory, and the many accesses to off-chip memory that would be required degrade the performance of the algorithm.
Disclosure of Invention
The present invention aims to overcome the defects of the prior art when transposing a large two-dimensional data matrix, namely the large number of memory accesses and the low operation speed.
According to a first aspect of the present disclosure, there is provided a method of processing a two-dimensional data matrix by an artificial intelligence processor, comprising: splitting the two-dimensional data matrix into at least two sub-matrices according to the capacity of an on-chip storage unit on the artificial intelligence processor and the size of the two-dimensional data matrix, wherein all elements of each sub-matrix can be stored in the on-chip storage unit; loading a sub-matrix into the on-chip storage unit of the artificial intelligence processor; performing, by the artificial intelligence processor, a matrix transposition operation on the sub-matrix to obtain an operation result; and transmitting, by the artificial intelligence processor, the operation result to an off-chip storage unit for storage.
According to a second aspect of the present disclosure, there is provided an electronic device comprising: one or more processors; and a memory having stored therein computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method as described above.
According to a third aspect of the present disclosure, there is provided a computer-readable storage medium comprising computer-executable instructions which, when executed by one or more processors, perform the method as described above.
At least one beneficial effect of the technical scheme of the present disclosure is that the on-chip resources of the processor can be fully utilized, overcoming the defect that large data cannot be processed because those on-chip resources are limited.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. In the drawings, several embodiments of the disclosure are illustrated by way of example and not by way of limitation, and like or corresponding reference numerals indicate like or corresponding parts and in which:
FIG. 1a shows a schematic diagram of the internal structure of a processor group to which the method of the present disclosure may be applied;
FIG. 1b shows a schematic diagram of an artificial intelligence processor to which the method of the present disclosure can be applied;
FIG. 2 illustrates a method of matrix transposing a two-dimensional data matrix in an off-chip storage unit according to one embodiment of the present disclosure;
FIG. 3 illustrates a flow diagram of a method of dynamically resizing the sub-matrices, in accordance with one embodiment of the present disclosure;
FIG. 4a shows a schematic diagram of expanding a two-dimensional matrix according to one embodiment of the present disclosure;
FIG. 4b shows a schematic diagram of expanding a two-dimensional matrix according to another embodiment of the present disclosure;
FIG. 5 illustrates a flow chart of a method of partitioning and adjusting a two-dimensional data matrix according to one embodiment of the present disclosure;
FIG. 6 shows a schematic diagram of sub-matrices with overlapping portions, according to one embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a parallel matrix transpose operation performed on a plurality of sub-matrices by a plurality of respective artificial intelligence processors;
FIG. 8 illustrates an apparatus for matrix transposing a two-dimensional matrix of data in an off-chip storage unit in accordance with another aspect of the disclosure;
FIG. 9 shows a schematic diagram of a combined processing device according to the present disclosure; and
fig. 10 shows a schematic block diagram of a board card according to the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, description, and drawings of the present disclosure are used to distinguish between different objects and are not used to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
Embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
FIG. 1a shows a schematic diagram of the internal structure of a processor group to which the method of the present disclosure may be applied.
An Artificial Intelligence (AI) chip accelerates data computation and reduces memory access delay. The AI chip adopts a multi-core processor architecture, supports up to 16-core parallel computation, and adds a storage unit core (also called an on-chip storage unit) to accelerate data reading, thereby alleviating the memory access bottleneck between the processor cores of the AI chip and the DDR (also called the off-chip storage unit). It provides stronger computing capability for users in scenarios such as deep learning and network computing.
The AI chip has 16 processor cores in total for executing the calculation task. Every 4 processor cores constitute one processor group, i.e. 4 processor groups in total. There is a memory unit core within each processor group. The memory unit core is mainly used for data exchange between the shared memory unit inside the processor group and the processor core and between the processor groups. When the memory core and the processor core simultaneously access the DDR, only one group of buses is guaranteed to access the DDR after the arbitration of the multiplexer.
FIG. 1b shows a schematic diagram of an artificial intelligence processor to which the method of the present disclosure can be applied.
The DDR of the AI chip adopts a Non-Uniform Memory Access (NUMA) architecture; each processor group can access different DDR channels through NOC0, but the access delay differs from channel to channel. Each processor group corresponds to one DDR channel with the lowest access delay, while its access delay to the other channels is relatively long. As shown in the structure diagram of the processor groups and the DDR in FIG. 1b, the delay is lowest when processor group 0, processor group 1, processor group 2, and processor group 3 access the corresponding DDR0, DDR1, DDR2, and DDR3, respectively. That is, each processor core accesses the DDR channel with the lowest access delay for its own processor group.
Because the access bandwidth inside the processor group is higher than the access bandwidth between the processor core and the DDR, the AI chip can internally access the shared memory unit by adopting the processor group to reduce the direct access of the processor core to the DDR, thereby improving the data throughput.
When 4-core parallel computing is required, the storage unit core may broadcast data from the shared storage unit to the 4 processor cores within the processor group simultaneously (via NOC1) for data computation. Compared with having every processor core read data through the DDR, this arrangement reduces memory access delay and optimizes computing performance.
As computing demands increase, the 16 processor cores may need to handle multiple computing tasks simultaneously. Direct access by the processor cores to the DDR inevitably causes data access delay, leading to problems such as low computing speed. The AI chip avoids direct communication between the 16 processor cores and the DDR by exchanging data between the processor groups, thereby reducing the delay of data access.
For a large two-dimensional data matrix, such as a high-definition picture, the structure of the AI chip can be fully utilized to reduce data exchange or data access with an external storage unit, and improve data processing speed and data transmission throughput.
FIG. 2 illustrates a method of an artificial intelligence processor processing a two-dimensional data matrix, the method comprising: in operation S210, splitting the two-dimensional data matrix into at least two sub-matrices according to the capacity of an on-chip storage unit on the artificial intelligence processor and the size of the two-dimensional data matrix, where all elements of each sub-matrix can be stored in the on-chip storage unit; in operation S220, loading a sub-matrix into the on-chip storage unit of the artificial intelligence processor; in operation S230, performing, by the artificial intelligence processor, a matrix transposition operation on the sub-matrix to obtain an operation result; and in operation S240, transmitting, by the artificial intelligence processor, the operation result to the off-chip storage unit for storage.
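By way of illustration only (the patent prescribes no host code), the following Python sketch mirrors operations S210 to S240 using NumPy: array slices stand in for the data transfers between the off-chip and on-chip storage units, and `.T` stands in for the hardware transpose interface discussed below.

```python
import numpy as np

def process_matrix(M_offchip, Sr, Sc):
    # S210: split the H x W matrix into Sr x Sc sub-matrices, each small
    # enough to fit in the on-chip storage unit.
    H, W = M_offchip.shape
    result_offchip = np.empty((W, H), dtype=M_offchip.dtype)
    for r in range(0, H, Sr):
        for c in range(0, W, Sc):
            onchip = M_offchip[r:r + Sr, c:c + Sc].copy()  # S220: load
            onchip_t = onchip.T                            # S230: transpose
            result_offchip[c:c + Sc, r:r + Sr] = onchip_t  # S240: store
    return result_offchip
```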
For large two-dimensional data matrices, such as high-resolution pictures, the capacity of the storage unit on an artificial intelligence processor may not be sufficient to hold all the data of the two-dimensional data matrix at once, and if operations on these matrices were executed in the off-chip storage unit (e.g., DDR memory), a large number of data access operations would be required, resulting in inefficiency.
In the present disclosure, the large two-dimensional data matrix may be divided into a plurality of small matrices or arrays, and the size of each small matrix or array is not larger than the capacity of the on-chip storage unit, so that the large two-dimensional data matrix may be read into the on-chip storage unit in blocks or batches.
In the present disclosure, the block data that has been read in may be transposed, that is, the rows of the matrix become columns and the columns become rows. Most CPUs and artificial intelligence processing chips, such as Cambricon MLU series boards, provide interfaces for fast matrix transposition based on the underlying hardware design, so the partitioned matrices can be sent to these interfaces to complete the matrix operation. It should be understood that the transpose operation described herein includes transposing a matrix and also includes the inverse transform of the transposed matrix.
It should be noted that the artificial intelligence chip described above may be any general or special purpose processor or combination of processors, and the present disclosure is not limited to any type of specific chip or processor.
After the received block data is processed, the operation result may be transferred to an off-chip memory, and then a new block data is read for the next round of operation and processing.
In the above, the size of a matrix is defined by its length and width. For example, a 4 × 16 matrix and an 8 × 8 matrix each contain 64 elements; if these two-dimensional data matrices are converted to one-dimensional arrays, the 64 elements occupy the same space, but in this disclosure the 8 × 8 matrix and the 4 × 16 matrix are considered to be of different sizes. The following illustrates, by way of example, the splitting of a large two-dimensional data matrix.
Let a large two-dimensional data matrix be M, with height H and width W; the matrix can be expressed as equation 1:
$$M=\begin{bmatrix}m_{0,0}&m_{0,1}&\cdots&m_{0,W-1}\\m_{1,0}&m_{1,1}&\cdots&m_{1,W-1}\\\vdots&\vdots&\ddots&\vdots\\m_{H-1,0}&m_{H-1,1}&\cdots&m_{H-1,W-1}\end{bmatrix}\qquad\text{equation (1)}$$
where m denotes each element in the two-dimensional data matrix, the first subscript denotes the row in which the element is located, and the second subscript denotes the column in which the element is located.
The matrix M may be split into Rn × Cn sub-matrices SM(h, w), each sub-matrix having a size of Sr × Sc, where Rn = H/Sr and Cn = W/Sc, and h and w denote the subscript indices of the sub-matrices, satisfying 0 ≤ h ≤ Rn-1 and 0 ≤ w ≤ Cn-1. Thus, the large two-dimensional data matrix M described above can also be expressed as equation 2:
$$M=\begin{bmatrix}SM_{0,0}&SM_{0,1}&\cdots&SM_{0,Cn-1}\\SM_{1,0}&SM_{1,1}&\cdots&SM_{1,Cn-1}\\\vdots&\vdots&\ddots&\vdots\\SM_{Rn-1,0}&SM_{Rn-1,1}&\cdots&SM_{Rn-1,Cn-1}\end{bmatrix}\qquad\text{equation (2)}$$
where SM represents a split sub-matrix; the sub-matrix SM00 can be expressed as equation 3.1, and the sub-matrix SM01 can be expressed as equation 3.2:
$$SM_{0,0}=\begin{bmatrix}m_{0,0}&\cdots&m_{0,Sc-1}\\\vdots&\ddots&\vdots\\m_{Sr-1,0}&\cdots&m_{Sr-1,Sc-1}\end{bmatrix}\qquad\text{equation (3.1)}$$
$$SM_{0,1}=\begin{bmatrix}m_{0,Sc}&\cdots&m_{0,2Sc-1}\\\vdots&\ddots&\vdots\\m_{Sr-1,Sc}&\cdots&m_{Sr-1,2Sc-1}\end{bmatrix}\qquad\text{equation (3.2)}$$
other sub-matrices may be represented in a similar manner and will not be exhaustive here.
Thus, each time an operation is performed, the submatrices SM can be read into the on-chip memory cells in batches, so that the operation can be performed at high speed in the on-chip memory cells.
After the two-dimensional data matrix is divided into a plurality of sub-matrices, the sub-matrices may be stored in the off-chip storage unit row by row or column by column according to coordinates of the sub-matrices in the two-dimensional data matrix, so as to form a one-dimensional array.
Equation 4 below shows the way storage is done row by row.
M = {SM00, SM01, ..., SM0(Cn-1), SM10, ..., SM1(Cn-1), ..., SM(Rn-1)0, ..., SM(Rn-1)(Cn-1)}    equation (4)
Equation 5 shows the way of storing column by column:
M = {SM00, SM10, ..., SM(Rn-1)0, SM01, ..., SM(Rn-1)1, ..., SM0(Cn-1), ..., SM(Rn-1)(Cn-1)}    equation (5)
In the above storing process, the elements within each sub-matrix may also be stored row by row or column by column; that is, when storing row by row, the zeroth row (or column) of the zeroth sub-matrix SM00 is stored first, followed by the first row (or column) of SM00. After all the elements of the current sub-matrix SM00 are stored, the next sub-matrix is stored (the first sub-matrix counted by rows being SM01, or counted by columns being SM10), and so on until all the split sub-matrices are stored.
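By way of illustration only (this sketch is not part of the patent text), the row-by-row layout of equation 4, with each sub-matrix itself flattened row by row, can be reproduced in Python with NumPy:

```python
import numpy as np

def flatten_blocks_row_major(M, Sr, Sc):
    # Walk the sub-matrices in row-major order (equation 4) and flatten
    # each one row by row into a single one-dimensional array.
    H, W = M.shape
    parts = [M[r:r + Sr, c:c + Sc].ravel()
             for r in range(0, H, Sr)
             for c in range(0, W, Sc)]
    return np.concatenate(parts)
```

Swapping the two for-clauses yields the column-by-column block order of equation 5.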
The position of an element in the one-dimensional array may be calculated from the number of the sub-matrix in which the element is located, the size of the sub-matrix, and the coordinates of the element in the sub-matrix.
According to an embodiment of the present disclosure, calculating the position of the element in the one-dimensional array may include: multiplying the number of the sub-matrix by the size of the sub-matrix, plus the absolute position of the element in the sub-matrix.
The absolute position in the above context refers to the corresponding position of the elements in the matrix in the array when the matrix is converted into a one-dimensional array.
Taking row-by-row storage as an example, assume that the number of each sub-matrix is idx_block, the size of each sub-matrix is Sr × Sc, and the absolute position of an element in its sub-matrix is idx_element; the position of the element in the whole one-dimensional array is then given by equation 6:
idx = idx_block × (Sr × Sc) + idx_element    equation (6)
where the number idx_block of the sub-matrix may be calculated according to equation 7:
idx_block = Floor(i / Sr) × Cn + Floor(j / Sc)    equation (7)
The absolute position idx_element of each element in the sub-matrix can be calculated according to equation 8:
idx_element = (i % Sr) × Sc + (j % Sc)    equation (8)
where the operator Floor denotes rounding down and the symbol "%" denotes the remainder operation.
Equations 6-8 above are illustrated with an example of a 16 × 16 two-dimensional data matrix with a sub-matrix size of 8 × 8 (i.e., Sr = 8 and Sc = 8) and a division into four sub-matrices (i.e., Rn = 2 and Cn = 2).
For example, the coordinates of an element are (3,3), then
idx_block=Floor(3/8)*2+Floor(3/8)=0
idx_element=(3%8)*8+(3%8)=3*8+3=27
idx=0*(8*8)+27=27
From the above calculation results, it can be seen that the element is located in the zeroth sub-matrix, at position 27 within that sub-matrix, and also at position 27 in the one-dimensional array formed from the entire two-dimensional data matrix.
In yet another example, for example, where the coordinates of an element are (8,9), then
idx_block=Floor(8/8)*2+Floor(9/8)=3
idx_element=(8%8)*8+(9%8)=0*8+1=1
idx=3*(8*8)+1=193。
In the above example, the element (8,9) is in the third sub-matrix, at position 1 within that sub-matrix, and at position 193 in the one-dimensional array formed from the entire two-dimensional data matrix.
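By way of illustration only, a minimal Python rendering of equations 6 to 8 reproduces both worked examples above (here `//` is the Floor operation and `%` the remainder):

```python
def element_index(i, j, Sr, Sc, Cn):
    idx_block = (i // Sr) * Cn + (j // Sc)      # equation (7)
    idx_element = (i % Sr) * Sc + (j % Sc)      # equation (8)
    return idx_block * (Sr * Sc) + idx_element  # equation (6)

# 16 x 16 matrix split into 8 x 8 sub-matrices (Sr = Sc = 8, Cn = 2):
assert element_index(3, 3, 8, 8, 2) == 27
assert element_index(8, 9, 8, 8, 2) == 193
```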
It is to be understood that the splitting of the large two-dimensional data matrix does not depend entirely on the capacity of the on-chip storage unit, but also depends on the size of the two-dimensional data matrix. According to an embodiment of the present disclosure, splitting the two-dimensional data matrix into at least two sub-matrices may include: dynamically adjusting the size of the sub-matrices according to the size of the two-dimensional data matrix.
It should be understood that when describing the size of the two-dimensional data matrix herein, the units of the matrix and the memory cell space (e.g., K, M, etc.) are omitted, and only the unitless data is schematically represented. It is to be understood that this unitless representation of the present disclosure is merely for ease of understanding, and the principles thereof may be applied to any size matrix and memory cell.
For example, consider a 64 × 64 two-dimensional data matrix, containing 4096 elements in total, and an on-chip storage unit with space for 64 elements. The 64 × 64 matrix may be divided into 8 × 8 sub-matrices, each with 64 elements, so it splits exactly into 64 sub-matrices; alternatively, it may be divided into 4 × 16 sub-matrices, again yielding 64 sub-matrices; it may also be split into 2 × 32 sub-matrices or 1 × 64 matrices (or arrays).
For another example, for a 4 × 1024 two-dimensional data matrix (e.g., a long, narrow picture), which also contains 4096 elements in total, the matrix is preferably split into 4 × 16 sub-matrices, or into 2 × 32 sub-matrices, or into 1 × 64 matrices or arrays.
Thus, preferably, for each input two-dimensional data matrix, the splitting manner can be dynamically adjusted, so that the size of the split sub-matrix can be dynamically adjusted.
Fig. 3 shows a flow chart of a method of dynamically adjusting the size of the sub-matrices according to an embodiment of the present disclosure.
As shown in fig. 3, dynamically resizing the sub-matrices may include the following operations: in operation S2110, setting an initial row number and an initial column number of the submatrix; determining a first ratio between a number of rows of the two-dimensional data matrix and the initial number of rows in operation S2120; determining a second ratio between the number of columns of the two-dimensional data matrix and the initial number of columns in operation S2130; and, in operation S2140, adjusting the number of rows and the number of columns of the sub-matrix such that the first ratio and the second ratio are both integers.
In the above operation, an initial number of rows and an initial number of columns may first be set according to the capacity of the on-chip storage unit. For example, for a storage unit of size 64, the initial value may be set to 8 × 8, 1 × 64, 2 × 32, 4 × 16, or the like. The initial value may be an empirical value, for example determined based on the sizes of the majority of two-dimensional data matrices received. It should be understood that the above initial values are only examples; shapes that are not integer powers of 2, such as 5 × 5 or 9 × 9, may also be adopted.
Next, the number of sub-matrices into which the two-dimensional data matrix can be split may be determined from the initial values; that is, a first ratio of the number of rows of the two-dimensional data matrix to the initial number of rows and a second ratio of the number of columns to the initial number of columns are determined, and the shape is adjusted continuously until both ratios are integers. This means the splitting of the two-dimensional data matrix achieves a good result. The smaller the first ratio and the second ratio, the smaller the number of sub-matrices produced, which helps reduce the number of times the processor accesses the off-chip storage unit.
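By way of illustration only, one possible realization of the adjustment in FIG. 3 is sketched below; the disclosure leaves the concrete search strategy open, so the simple shrink-until-divisible loop here is merely an assumption:

```python
def adjust_submatrix_shape(H, W, capacity, init_rows=8, init_cols=8):
    # S2110: start from empirical initial values; S2120-S2140: adjust the
    # shape until the first ratio (H / Sr) and the second ratio (W / Sc)
    # are both integers, keeping the tile within the on-chip capacity.
    Sr, Sc = init_rows, init_cols
    while Sr > 1 and H % Sr != 0:
        Sr -= 1
    while Sc > 1 and W % Sc != 0:
        Sc -= 1
    if Sr * Sc > capacity:
        raise ValueError("adjusted sub-matrix no longer fits on chip")
    return Sr, Sc

adjust_submatrix_shape(64, 64, 64)    # -> (8, 8)
adjust_submatrix_shape(4, 1024, 64)   # -> (4, 8)
```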
In practice, two-dimensional data matrices come in many shapes, so the rows and/or columns cannot always be divided seamlessly into an exact integer number of sub-matrices; cases where exact division is impossible may arise.
There are various embodiments to address this situation.
According to one embodiment of the present disclosure, when the first ratio and/or the second ratio is a non-integer, the rows and/or the columns of the two-dimensional data matrix are expanded so that the first ratio and/or the second ratio is an integer.
Fig. 4a shows a schematic diagram of expanding a two-dimensional matrix according to one embodiment of the present disclosure.
As shown in FIG. 4a, the dark color represents the original content of the two-dimensional data matrix and the light color represents the expanded portion. The original size of the two-dimensional data matrix is 16 × 15, i.e., 16 rows and 15 columns, and the size of the on-chip storage unit is 64; in this case the two-dimensional data matrix can be split into several 8 × 8 sub-matrices, namely a zeroth, a first, a second, and a third sub-matrix. As shown in FIG. 4a, the 15 columns cannot be divided evenly, so one column may be appended to the columns of the two-dimensional matrix to form a 16 × 16 matrix. The expanded column may consist entirely of 0 values, and the column-expanded two-dimensional data matrix can then be split into four sub-matrices, where the first and third sub-matrices each include one expanded column. The size of each sub-matrix is 8 × 8. In addition, the expanded column need not be padded with 0; it may instead be filled according to the data of an adjacent column, for example with the same data as the last column. For scenery pictures, this subtle padding generally does not substantially affect human interpretation of the picture or its aesthetic appeal.
Fig. 4b shows a schematic diagram of expanding a two-dimensional matrix according to another embodiment of the present disclosure.
The expansion of the rows and/or columns of the two-dimensional data matrix need not be performed at the end of a row or column; an expanded row or column may also be inserted in the middle of the two-dimensional matrix.
In FIG. 4b, an expanded column is inserted between the 7th and 8th columns of the two-dimensional data matrix. The expanded column may be filled entirely with 0, or filled with values from an adjacent column, for example the data of the previous or the following column, or with the average of the adjacent columns. Filling with an average value helps form a smoother transition. Unlike in FIG. 4a, the expanded columns included in the first and third sub-matrices are located in the zeroth column of each sub-matrix.
It is to be understood that only two embodiments of extended rows/extended columns are shown in fig. 4a and 4b, and in other embodiments (not shown in the drawings), an extended row or an extended column may also be inserted before the first row or before the first column of the two-dimensional data matrix, or may also be inserted between any row or any column of the two-dimensional data matrix.
According to one embodiment of the present disclosure, the sub-matrices may be disposed immediately adjacent to each other. In the embodiments shown in FIGS. 4a and 4b, no overlap occurs between the sub-matrices, so they may be described as "immediately adjacent" in the present disclosure. In the case of FIG. 4b, although an expanded column is inserted between the 7th and 8th columns, no overlap occurs between the sub-matrices, so this case is also referred to as "immediately adjacent".
Since the position of an expanded row or column is known, when the sub-matrices are restored to the two-dimensional data matrix, the expanded row or column can be deleted, yielding a matrix of the same size as the original without adding any extra data content.
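By way of illustration only, NumPy's padding modes cover both fill strategies described above for end-of-matrix expansion as in FIG. 4a (zeros, or repeating the last row/column):

```python
import numpy as np

def pad_to_multiple(matrix, Sr, Sc, mode="edge"):
    # Append rows/columns at the end so the shape becomes an exact multiple
    # of (Sr, Sc); mode="constant" pads with zeros, while mode="edge"
    # repeats the data of the last row/column, as suggested above.
    H, W = matrix.shape
    pad_h = (-H) % Sr
    pad_w = (-W) % Sc
    return np.pad(matrix, ((0, pad_h), (0, pad_w)), mode=mode)

m = np.arange(16 * 15).reshape(16, 15)
assert pad_to_multiple(m, 8, 8).shape == (16, 16)
```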
Given the situation above where the sub-matrices are dynamically sized, according to another embodiment of the present disclosure, the size of the sub-matrices may be fixed, having a predetermined number of rows and a predetermined number of columns.
For example, for an on-chip memory cell of size 64, the size of the matrix may be fixedly set to 8 × 8. It is to be understood that the fixed size is not necessarily equal to the capacity of the on-chip memory cell, but may also be smaller than the capacity of the on-chip memory cell. In addition, 8 × 8 is just an example, and the size of the matrix may be fixedly set to other sizes such as 1 × 64, 2 × 32, and 4 × 16.
By adopting the sub-matrix with fixed size, the sub-matrix does not need to be adjusted at all, thereby saving the corresponding adjusting time.
FIG. 5 illustrates a flow chart of a method of partitioning and adjusting a two-dimensional data matrix according to one embodiment of the present disclosure.
As shown in fig. 5, in the case that the sub-matrix is a fixed size, how to divide the two-dimensional data matrix can be determined as follows: determining a third ratio between the number of rows of the two-dimensional data matrix and the predetermined number of rows in operation S5110; determining a fourth ratio between the number of columns of the two-dimensional data matrix and the predetermined number of columns in operation S5120; and, in operation S5130, when the third ratio and/or the fourth ratio is a non-integer, expanding the rows and/or columns of the two-dimensional data matrix such that the third ratio and/or the fourth ratio is an integer.
In the flow shown in FIG. 5, when the rows of the two-dimensional data matrix cannot be divided exactly by the predetermined number of rows of the sub-matrix, or when the columns of the two-dimensional data matrix cannot be divided exactly by the predetermined number of columns of the sub-matrix, the rows or columns of the two-dimensional data matrix may be expanded until they can be divided exactly by the predetermined number of rows or columns.
The embodiment of expanding the rows and/or columns of the two-dimensional data matrix has been described above with reference to fig. 4a and 4b, and will not be described here again.
According to another embodiment of the present disclosure, at least two of the sub-matrices have an overlapping portion.
Fig. 6 illustrates a schematic diagram of sub-matrices with overlapping portions according to an embodiment of the present disclosure.
As shown in FIG. 6, the size of the two-dimensional data matrix is 16 × 14 and the predetermined size of the sub-matrix is 8 × 8; in this case the two-dimensional data matrix may be split into four 8 × 8 sub-matrices, namely a zeroth, a first, a second, and a third sub-matrix. Since the number of columns of the two-dimensional data matrix (14) is not evenly divisible by the predetermined number of columns of a sub-matrix (8), at least two sub-matrices may be given an overlapping portion. In FIG. 6, the grid-patterned part is the overlapping part of two sub-matrices, i.e., the overlap of the zeroth and first sub-matrices and the overlap of the second and third sub-matrices. The overlapping part contains both the elements of the last two columns of the zeroth and second sub-matrices and the elements of the first two columns of the first and third sub-matrices.
Including overlapping portions in the sub-matrices eliminates the need to insert additional rows or columns into the matrix when the size of the two-dimensional data matrix is not evenly divisible by the size of the sub-matrix. It is to be understood that FIG. 6 is only an example: the size of the overlapping portion may be arbitrary, and overlapping and non-overlapping sub-matrices may be used in combination. For example, if the two-dimensional data matrix is 24 × 23 and the sub-matrix is 8 × 8, the two-dimensional data matrix may be split into 3 sub-matrices in the column direction (a zeroth, a first, and a second sub-matrix); the zeroth and first sub-matrices may be adjacent without overlap, while the first and second sub-matrices may share a one-column overlapping portion.
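By way of illustration only, and under the assumption that only the last tile in a dimension is shifted back to create the overlap (as in FIG. 6), the tile start offsets can be computed as follows:

```python
def tile_starts(total, tile):
    # Start offsets of tiles of width `tile` covering `total` rows or
    # columns; the final tile is moved back so it stays in range,
    # producing an overlap instead of requiring padding.
    starts = list(range(0, max(total - tile, 0) + 1, tile))
    if starts[-1] + tile < total:
        starts.append(total - tile)
    return starts

assert tile_starts(14, 8) == [0, 6]   # the 7th-8th columns (1-based) overlap
assert tile_starts(16, 8) == [0, 8]   # evenly divisible: no overlap
```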
According to the above description, an overlapping portion may exist when the two-dimensional matrix is split. When an overlapping portion is present, it can be identified so that it can be located later.
Still taking FIG. 6 as an example: the size of the two-dimensional data matrix is 16 × 14 and the size of the sub-matrix is 8 × 8, so the overlapping portion can be found to be the 7th and 8th columns. The overlapping portion may be recorded in the information about the sub-matrices to facilitate subsequent positioning and processing. The information about a sub-matrix may include, for example, its size, its position in the two-dimensional data matrix, whether it overlaps other sub-matrices, and the position of any overlapping portion.
After each sub-matrix is processed (e.g., transposed or otherwise operated on) in the on-chip storage unit, the resulting sub-matrix may be stored back in the off-chip storage unit. When storing these processed sub-matrices, if there is an overlapping portion between two sub-matrices, the overlap may be handled in various ways: for example, the data of one processed sub-matrix may overwrite the data of the other, or the overlapping portions of the two processed sub-matrices may be averaged to obtain the final stored result. It is to be understood that any treatment of the overlapping portions is within the scope of the present disclosure.
Transposing a matrix, i.e., changing a row to a column and a column to a row, can be represented by the following equation 9:
M′(j, i) = M(i, j)    equation (9)
That is, the element in the ith row and jth column of the original matrix becomes, after transposition, the element in the jth row and ith column. Most CPUs and artificial intelligence processing chips, such as Cambricon MLU series boards, provide interfaces for fast matrix transposition based on the underlying hardware design, so the partitioned matrices can be sent to these interfaces to complete the matrix operation.
According to one embodiment of the present disclosure, the transmitting, by the artificial intelligence processor, the operation result to the off-chip storage unit for storage includes: and storing the transposed plurality of sub-matrices into an off-chip storage unit, so that the plurality of transposed sub-matrices constitute a transposed two-dimensional data matrix.
In this embodiment, each time the transposition of one sub-matrix is completed, the transposed sub-matrix may be transmitted to the off-chip storage unit for storage; as subsequent transpose operations proceed, all the sub-matrices are transposed and stored in the off-chip storage unit, forming a large two-dimensional transpose matrix. This two-dimensional transpose matrix and the original two-dimensional data matrix are transposes of each other and conform to the relationship expressed by equation 9.
It should be understood that during the transposing process, the transposition is actually performed on data in the form of a one-dimensional array, and the result obtained is also a one-dimensional array; the one-dimensional array and the two-dimensional matrix correspond to each other through equations 6-8. Those skilled in the art can use equations 6 through 8 to obtain the position idx of each element in the large two-dimensional data matrix, thereby obtaining the final two-dimensional transpose matrix, i.e., M(i, j) = M_idx.
Therefore, the technical scheme of the present disclosure adopts two-stage transposition: the sub-matrices are transposed individually, and then the large two-dimensional data matrix as a whole is transposed as it is stored into the off-chip storage unit. This strategy performs as much of the computation as possible at the hardware level, further reducing the time complexity and improving the performance of the transposition algorithm.
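By way of illustration only, the second stage can be made concrete as follows: the transposed sub-matrix that occupied block position (h, w) in M is simply written to block position (w, h) of the output, so the large-matrix transposition comes free with the store:

```python
import numpy as np

def store_transposed_block(out, block_t, h, w, Sr, Sc):
    # block_t is the transposed (Sc x Sr) sub-matrix SM(h, w); writing it
    # at block position (w, h) realizes the large-matrix transposition
    # during the store to the off-chip storage unit.
    out[w * Sc:(w + 1) * Sc, h * Sr:(h + 1) * Sr] = block_t

M = np.arange(16 * 16).reshape(16, 16)
out = np.empty((16, 16), dtype=M.dtype)
for h in range(2):
    for w in range(2):
        store_transposed_block(out, M[h*8:(h+1)*8, w*8:(w+1)*8].T, h, w, 8, 8)
assert np.array_equal(out, M.T)
```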
According to the foregoing, it may be detected at this point whether each sub-matrix overlaps any other sub-matrix; if so, the contents of the overlapping portion of the two sub-matrices may be averaged.
According to another embodiment of the present disclosure, the performing, by the artificial intelligence processor, a matrix transposition operation on the submatrix to obtain an operation result includes: and respectively carrying out parallel matrix transposition operation on the plurality of sub-matrixes through a plurality of artificial intelligence processors.
Fig. 7 is a schematic diagram illustrating a matrix transpose operation performed in parallel on a plurality of sub-matrices by a plurality of artificial intelligence processors.
As shown in fig. 7, the two-dimensional data matrix has a size of 16 × 16 and may be divided into four sub-matrices, i.e., a zeroth, a first, a second, and a third sub-matrix, each of size 8 × 8. Four processors may be employed to process the four sub-matrices in parallel: processor 0 processes the zeroth sub-matrix, processor 1 processes the first sub-matrix, processor 2 processes the second sub-matrix, and processor 3 processes the third sub-matrix.
It should be understood that processors 0-3 are merely a generic term for processors, and each processor may be a separate processor core (as shown in FIG. 1 a) or a group of processors (as shown in FIG. 1 b). The plurality of processors can respectively read the corresponding sub-matrixes from the off-chip storage unit into the on-chip storage unit, and perform the transposition operation on the corresponding sub-matrixes in the on-chip storage unit.
There may be multiple on-chip storage units, each storing one sub-matrix, so that the processors process the sub-matrices in different on-chip storage units in parallel; alternatively, the on-chip storage unit may be a single shared storage unit that can hold multiple sub-matrices at once, with the multiple processors processing different sub-matrices within this single shared storage unit in parallel.
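By way of illustration only (Python threads stand in for the processor cores or processor groups, and a shared NumPy array for the storage units), the parallel scheme of FIG. 7 might look like this:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def parallel_tiled_transpose(M, Sr, Sc, workers=4):
    # Each worker plays the role of one processor: it transposes its own
    # sub-matrices independently and writes to disjoint regions of `out`.
    H, W = M.shape
    out = np.empty((W, H), dtype=M.dtype)
    tiles = [(r, c) for r in range(0, H, Sr) for c in range(0, W, Sc)]

    def work(tile):
        r, c = tile
        out[c:c + Sc, r:r + Sr] = M[r:r + Sr, c:c + Sc].T

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, tiles))
    return out

M = np.arange(16 * 16).reshape(16, 16)
assert np.array_equal(parallel_tiled_transpose(M, 8, 8), M.T)
```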
By means of the multiple processors, the operation speed of the large-scale two-dimensional data matrix can be greatly improved, and the advantages of the artificial intelligent processor in the aspect of the multi-core processor are fully exerted.
In summary, compared with the prior art, the technical scheme disclosed by the invention has the following advantages:
the technical scheme of the invention can fully utilize the on-chip resources of the processor and solve the defect that the large data cannot be processed due to the limited on-chip resources of the processor.
The technical scheme of the invention carries out the operation of block transfer to the original large matrix, and can solve the problem of low access efficiency caused by a large number of random address hopping operations.
The technical scheme disclosed by the invention can operate the data of the whole sub-matrix size at one time, and can avoid frequent operation of a scalar with low efficiency on off-chip storage, so that the calculation efficiency of transposition can be improved.
The technical scheme disclosed by the invention adopts two stages of transposition, namely sub-matrix transposition and large-scale matrix transposition, so that the designed strategy can be calculated by using a hardware bottom layer as much as possible, the time complexity is further reduced, and the performance of a transposition algorithm is improved.
Therefore, the technical scheme disclosed by the invention can fully utilize hardware resources and solve the problem that the hardware resources are limited and are not enough to support large-scale data operation; the time consumption of caching between memories is reduced, so that the memory access efficiency is improved, and the performance of the algorithm is further improved.
FIG. 8 illustrates an apparatus for matrix transposing a two-dimensional matrix of data in an off-chip storage unit according to another aspect of the disclosure, comprising: a first device M810, configured to split the two-dimensional data matrix into at least two sub-matrices according to a capacity of an on-chip storage unit on the artificial intelligence processor and a size of the two-dimensional data matrix, where all elements of each sub-matrix can be stored in the on-chip storage unit; a second means M820 for loading the sub-matrices into on-chip storage units of the artificial intelligence processor; a third device M830, configured to perform a matrix transpose operation on the sub-matrix through the artificial intelligence processor to obtain an operation result;
a fourth means M840 for transmitting the operation result to the off-chip storage unit for storage by the artificial intelligence processor.
The present disclosure also provides an electronic device, including: one or more processors; and a memory having stored therein computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method as described above.
The present disclosure also provides a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method as described above.
The technical scheme disclosed by the invention can be applied to the field of artificial intelligence and is realized or realized in an artificial intelligence chip. The chip may exist alone or may be included in a computing device.
Fig. 9 illustrates a combined processing device 900 that includes the computing device 902 described above, a universal interconnect interface 904, and other processing devices 906. The computing device according to the present disclosure interacts with the other processing devices to collectively perform operations specified by a user. Fig. 9 is a schematic view of this combined processing device.
The other processing devices include one or more general-purpose or special-purpose processors such as central processing units (CPUs), graphics processing units (GPUs), and neural network processors; the number of processors they include is not limited. The other processing devices serve as the interface between the machine learning computing device and external data and control, performing data transfer and basic control of the machine learning computing device such as starting and stopping; they may also cooperate with the machine learning computing device to complete computing tasks.
The universal interconnect interface transfers data and control instructions between the computing device (including, for example, a machine learning computing device) and the other processing devices. The computing device acquires the required input data from the other processing devices and writes it into an on-chip storage device of the computing device; it can obtain control instructions from the other processing devices and write them into an on-chip control cache; it can also read data from a storage module of the computing device and transmit it to the other processing devices.
Optionally, the architecture may further comprise a storage device 908, connected to the computing device and the other processing devices, respectively. The storage device stores data of the computing device and the other processing devices, and is particularly suitable for data that cannot be fully held in the internal storage of the computing device or the other processing devices.
The combined processing device can serve as an SOC (system-on-chip) for equipment such as mobile phones, robots, unmanned aerial vehicles, and video monitoring equipment, effectively reducing the core area of the control part, increasing the processing speed, and reducing the overall power consumption. In this case, the universal interconnect interface of the combined processing device is connected to certain components of the apparatus, such as a camera, a display, a mouse, a keyboard, a network card, or a WiFi interface.
In some embodiments, the disclosure also discloses a chip packaging structure, which includes the chip.
In some embodiments, the disclosure also discloses a board card comprising the chip packaging structure. Referring to fig. 10, an exemplary board card is provided that may include other kits in addition to the chip 1002, including but not limited to: a memory device 1004, an interface device 1006, and a control device 1008.
The memory device is connected with the chip in the chip packaging structure through a bus and is used for storing data. The memory device may include multiple groups of storage units 1010, each group connected with the chip through a bus. It is understood that each group of the storage units may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
DDR can double the speed of SDRAM without increasing the clock frequency, as it allows data to be read on both the rising and falling edges of the clock pulse; DDR is thus twice as fast as standard SDRAM. In one embodiment, the memory device may include 4 groups of the storage units, and each group may include a plurality of DDR4 chips. In one embodiment, the chip may internally include four 72-bit DDR4 controllers, of which 64 bits are used for data transmission and 8 bits for ECC checking. In one embodiment, each group of the storage units includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel; DDR can transfer data twice in one clock cycle. A controller for the DDR is provided in the chip to control the data transmission and data storage of each storage unit.
The interface device is electrically connected with the chip in the chip packaging structure and is used for data transfer between the chip and an external device 1012, such as a server or a computer. For example, in one embodiment, the interface device may be a standard PCIE interface, and the data to be processed is transmitted from the server to the chip through the standard PCIE interface to implement the data transfer. In another embodiment, the interface device may be another interface; the present disclosure does not limit the concrete form of such other interfaces, as long as the interface unit can implement the switching function. In addition, the computation results of the chip are transmitted back to the external device (e.g., a server) by the interface device.
The control device is electrically connected with the chip and is used for monitoring the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a single-chip microcomputer (MCU). The chip may include a plurality of processing chips, processing cores, or processing circuits and may drive multiple loads, so it can be in different working states such as multi-load and light-load. The control device can regulate and control the working states of the plurality of processing chips, processing cores, and/or processing circuits in the chip.
In some embodiments, the present disclosure also discloses an electronic device or apparatus, which includes the above board card.
Electronic devices or apparatuses include data processing apparatuses, robots, computers, printers, scanners, tablets, smart terminals, cell phones, automobile data recorders, navigators, sensors, cameras, servers, cloud servers, cameras, video cameras, projectors, watches, headsets, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices.
The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
It is noted that while for simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders and concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are exemplary embodiments and that acts and modules referred to are not necessarily required by the disclosure.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, optical, acoustic, magnetic or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.
The integrated units, if implemented in the form of software program modules and sold or used as stand-alone products, may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present disclosure may be embodied in the form of a software product stored in a memory and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. The aforementioned memory includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.
The foregoing detailed description of the embodiments of the present disclosure has been presented for purposes of illustration and description; it is intended to be exemplary only and is not intended to be exhaustive or to limit the invention to the precise forms disclosed. For the person skilled in the art, based on the idea of the present disclosure, there may be variations in the specific embodiments and the application scope; in summary, the contents of this specification should not be construed as limiting the present disclosure.

Claims (17)

1. A method of processing a two-dimensional data matrix by an artificial intelligence processor, comprising:
splitting the two-dimensional data matrix into at least two sub-matrices according to the capacity of an on-chip storage unit on an artificial intelligence processor and the size of the two-dimensional data matrix, wherein all elements of each sub-matrix can be stored in the on-chip storage unit;
loading the sub-matrix into an on-chip storage unit of the artificial intelligence processor;
the artificial intelligence processor performs matrix transposition operation on the sub-matrix to obtain an operation result; and
and the artificial intelligence processor transmits the operation result to the off-chip storage unit for storage.
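(Illustrative note, not part of the claims. The Python/NumPy sketch below emulates the claimed flow on a host CPU; the tile-shape arguments and the in-loop "load"/"store" comments are hypothetical stand-ins for the on-chip buffer and the processor's transposition instruction, which the claims do not specify.)

import numpy as np

def tiled_transpose(matrix: np.ndarray, tile_rows: int, tile_cols: int) -> np.ndarray:
    """Transpose `matrix` by splitting it into sub-matrices that each fit
    in a (hypothetical) on-chip buffer, transposing every sub-matrix, and
    writing each result to its transposed position off-chip."""
    rows, cols = matrix.shape
    # Claims 3-4 and 7 handle the case where the tiles do not divide evenly.
    assert rows % tile_rows == 0 and cols % tile_cols == 0
    out = np.empty((cols, rows), dtype=matrix.dtype)  # stands in for off-chip storage
    for r in range(0, rows, tile_rows):
        for c in range(0, cols, tile_cols):
            tile = matrix[r:r + tile_rows, c:c + tile_cols]  # "load" a sub-matrix
            out[c:c + tile_cols, r:r + tile_rows] = tile.T   # transpose, "store" back
    return out

a = np.arange(24).reshape(4, 6)
assert np.array_equal(tiled_transpose(a, 2, 3), a.T)

The property the sketch illustrates is that a sub-matrix at block position (i, j) of the input lands, transposed, at block position (j, i) of the output, so per-tile transpositions compose into a transposition of the whole matrix.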
2. The method of claim 1, wherein splitting the two-dimensional data matrix into at least two sub-matrices comprises:
dynamically adjusting the size of the sub-matrix according to the size of the two-dimensional data matrix.
3. The method of claim 2, wherein dynamically adjusting the size of the sub-matrix comprises:
setting an initial row number and an initial column number of the sub-matrix;
determining a first ratio between the number of rows of the two-dimensional data matrix and the initial number of rows;
determining a second ratio between the number of columns of the two-dimensional data matrix and the initial number of columns;
adjusting the number of rows and the number of columns of the sub-matrix so that both the first ratio and the second ratio are integers.
4. The method of claim 3, wherein, when the first ratio and/or the second ratio is a non-integer, the rows and/or columns of the two-dimensional data matrix are expanded such that the first ratio and/or the second ratio becomes an integer.
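(Illustrative note, not part of the claims. A minimal sketch of the expansion in claim 4, assuming zero padding, which the claims do not fix; `pad_to_tiles` is a hypothetical helper name.)

import numpy as np

def pad_to_tiles(matrix: np.ndarray, init_rows: int, init_cols: int) -> np.ndarray:
    """Expand the matrix with extra rows/columns (zeros here) so that its
    dimensions become integer multiples of the initial sub-matrix shape."""
    rows, cols = matrix.shape
    pad_r = (-rows) % init_rows  # rows to add so the first ratio is an integer
    pad_c = (-cols) % init_cols  # columns to add so the second ratio is an integer
    return np.pad(matrix, ((0, pad_r), (0, pad_c)))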
5. The method according to any of claims 2-4, wherein the sub-matrices are arranged next to each other.
6. The method of claim 1, wherein the sub-matrix has a fixed size, with a predetermined number of rows and a predetermined number of columns.
7. The method of claim 6, further comprising:
determining a third ratio between the number of rows of the two-dimensional data matrix and the predetermined number of rows;
determining a fourth ratio between the number of columns of the two-dimensional data matrix and the predetermined number of columns; and
when the third ratio and/or the fourth ratio is a non-integer, expanding the rows and/or columns of the two-dimensional data matrix so that the third ratio and/or the fourth ratio becomes an integer.
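(Illustrative note, not part of the claims: with a 1000 × 700 matrix and a predetermined 64 × 64 sub-matrix, the third and fourth ratios are 1000/64 = 15.625 and 700/64 ≈ 10.94, so the matrix would be expanded to 1024 × 704, making the ratios the integers 16 and 11.)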
8. The method of claim 6, wherein at least two of the sub-matrices have overlapping portions.
9. The method of claim 8, further comprising: identifying the overlapping portion so that the overlapping portion can be located.
10. The method of any of claims 1-9, further comprising: storing, in the off-chip storage unit, the sub-matrices row by row or column by column according to the coordinates of the sub-matrices in the two-dimensional data matrix, so as to form a one-dimensional array.
11. The method of claim 10, wherein the position of an element in the one-dimensional array is calculated from the sequence number of the sub-matrix, the size of the sub-matrix, and the absolute position of the element within the sub-matrix.
12. The method of claim 11, wherein calculating the position of the element in the one-dimensional array comprises:
multiplying the sequence number of the sub-matrix by the size of the sub-matrix, and adding the absolute position of the element within the sub-matrix.
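(Illustrative note, not part of the claims. The formula of claim 12 in code form, assuming row-major storage inside each sub-matrix; claim 10 equally allows column-by-column storage, which would swap the roles of rows and columns in the offset.)

def flat_index(tile_number: int, tile_rows: int, tile_cols: int,
               row_in_tile: int, col_in_tile: int) -> int:
    """Position of an element in the one-dimensional off-chip array:
    (sequence number of the sub-matrix) * (size of the sub-matrix)
    + (absolute position of the element within the sub-matrix)."""
    tile_size = tile_rows * tile_cols
    offset = row_in_tile * tile_cols + col_in_tile  # row-major within the tile
    return tile_number * tile_size + offset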
13. The method of any of claims 1-12, wherein transmitting, by the artificial intelligence processor, the operation result to the off-chip storage unit for storage comprises:
storing the plurality of transposed sub-matrices into the off-chip storage unit, so that the transposed sub-matrices constitute a transposed two-dimensional data matrix.
14. The method of claim 13, wherein, if there is an overlapping portion among the plurality of sub-matrices, averaging is performed on the overlapping portion.
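(Illustrative note, not part of the claims. Claim 14 only states that averaging is performed; the sum-and-count scheme below is one straightforward way to realize it, with hypothetical argument names.)

import numpy as np

def merge_with_averaging(out_shape, tiles, positions):
    """Write possibly overlapping transposed sub-matrices into the result,
    averaging wherever two or more of them cover the same element."""
    acc = np.zeros(out_shape)                   # running sums
    cnt = np.zeros(out_shape, dtype=np.int64)   # how many tiles wrote each cell
    for tile, (r, c) in zip(tiles, positions):  # (r, c): top-left corner in output
        h, w = tile.shape
        acc[r:r + h, c:c + w] += tile
        cnt[r:r + h, c:c + w] += 1
    return acc / np.maximum(cnt, 1)             # avoid dividing uncovered cells by 0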
15. The method of any one of claims 1-14, wherein performing, by the artificial intelligence processor, a matrix transposition operation on the sub-matrix to obtain an operation result comprises:
performing parallel matrix transposition operations on the plurality of sub-matrices by a plurality of artificial intelligence processors, respectively.
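(Illustrative note, not part of the claims. A thread pool here merely stands in for the plurality of artificial intelligence processors; on real hardware each sub-matrix would be dispatched to a separate processor or core.)

from concurrent.futures import ThreadPoolExecutor
import numpy as np

def parallel_tiled_transpose(tiles):
    """Transpose many sub-matrices concurrently, one task per sub-matrix."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(np.transpose, tiles))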
16. An electronic device, comprising:
one or more processors; and
memory having stored therein computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method of any of claims 1-15.
17. A computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the method of any one of claims 1-15.
CN201911349779.7A 2019-12-24 2019-12-24 Method and apparatus for processing two-dimensional data matrix by artificial intelligence processor Withdrawn CN111125628A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911349779.7A CN111125628A (en) 2019-12-24 2019-12-24 Method and apparatus for processing two-dimensional data matrix by artificial intelligence processor

Publications (1)

Publication Number Publication Date
CN111125628A (en) 2020-05-08

Family ID=70502034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911349779.7A Withdrawn CN111125628A (en) 2019-12-24 2019-12-24 Method and apparatus for processing two-dimensional data matrix by artificial intelligence processor

Country Status (1)

Country Link
CN (1) CN111125628A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5751616A * 1995-11-29 1998-05-12 Fujitsu Limited Memory-distributed parallel computer and method for fast Fourier transformation
US20110055306A1 * 2009-08-27 2011-03-03 USA As Represented By The Administrator Of The National Aeronautics And Space Administration Optimal padding for the two-dimensional fast Fourier transform
CN104391820A (en) * 2014-11-25 2015-03-04 清华大学 Universal floating point matrix processor hardware structure based on FPGA (field programmable gate array)
CN111028136A (en) * 2019-12-24 2020-04-17 上海寒武纪信息科技有限公司 Method and equipment for processing two-dimensional complex matrix by artificial intelligence processor

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112416433A (en) * 2020-11-24 2021-02-26 中科寒武纪科技股份有限公司 Data processing device, data processing method and related product
CN112416433B (en) * 2020-11-24 2023-01-17 中科寒武纪科技股份有限公司 Data processing device, data processing method and related product
CN112788291A (en) * 2020-12-28 2021-05-11 安徽寒武纪信息科技有限公司 Method for presenting dual-channel image and related product
CN112788291B (en) * 2020-12-28 2023-03-28 安徽寒武纪信息科技有限公司 Method for presenting dual-channel image and related product
CN115221101A (en) * 2021-04-16 2022-10-21 中科寒武纪科技股份有限公司 Method for optimizing matrix multiplication operations for a system-on-chip and related product
CN115221101B (en) * 2021-04-16 2023-12-19 中科寒武纪科技股份有限公司 Method for optimizing matrix multiplication operations of a system-on-chip and related products

Similar Documents

Publication Publication Date Title
CN111125628A (en) Method and apparatus for processing two-dimensional data matrix by artificial intelligence processor
CN109685201B (en) Operation method, device and related product
CN111028136B (en) Method and equipment for processing two-dimensional complex matrix by artificial intelligence processor
CN112633490B (en) Data processing device, method and related product for executing neural network model
CN111124995A (en) Method and apparatus for processing a one-dimensional complex array by an artificial intelligence processor
CN110059797B (en) Computing device and related product
US20200242468A1 (en) Neural network computation device, neural network computation method and related products
US11775808B2 (en) Neural network computation device and method
CN112416433A (en) Data processing device, data processing method and related product
CN111209244B (en) Data processing device and related product
CN111143766A (en) Method and apparatus for processing two-dimensional complex matrix by artificial intelligence processor
CN115221102A (en) Method for optimizing convolution operation of system on chip and related product
CN109740730B (en) Operation method, device and related product
US20230039892A1 (en) Operation apparatus
CN111382852B (en) Data processing device, method, chip and electronic equipment
CN114154112A (en) Data processing device, chip and board card
CN111124996A (en) Method and apparatus for processing a one-dimensional complex array by an artificial intelligence processor
CN111125627A (en) Method for pooling multi-dimensional matrices and related products
CN113807489B (en) Method for performing deconvolution operation, board card and computing device thereof
US20230305840A1 (en) Computing apparatus, integrated circuit chip, board card, device and computing method
CN113469333B (en) Artificial intelligence processor, method and related products for executing neural network model
CN214504452U (en) Heterogeneous system for neural network reasoning
CN112232498B (en) Data processing device, integrated circuit chip, electronic equipment, board card and method
CN109445852B (en) Method and system for improving memory access efficiency in multi-core processor
WO2023087814A1 (en) Computing apparatus, method for implementing convolution operation by using computing apparatus, and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200508