CN111143766A - Method and apparatus for processing two-dimensional complex matrix by artificial intelligence processor - Google Patents
- Publication number
- CN111143766A (application number CN201911349747.7A)
- Authority
- CN
- China
- Prior art keywords
- matrix
- row
- column
- coefficient matrix
- coefficient
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/14—Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
- G06F17/141—Discrete Fourier transforms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/76—Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data
- G06F7/78—Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data for changing the order of data flow, e.g. matrix transposition or LIFO buffers; Overflow or underflow handling therefor
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Mathematical Analysis (AREA)
- Computational Mathematics (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Discrete Mathematics (AREA)
- Computing Systems (AREA)
- Complex Calculations (AREA)
Abstract
The present disclosure describes a method, an electronic device and a computing device for processing a two-dimensional complex matrix by an artificial intelligence processor, wherein the computing device may be included in a combined processing device, which may also include a universal interconnect interface and other processing devices. The computing device interacts with the other processing devices to jointly complete computing operations specified by a user. The combined processing device may further comprise a storage device connected to the computing device and the other processing devices, respectively, for storing data of the computing device and the other processing devices.
Description
Technical Field
The present invention relates to the field of data processing, and more particularly to the field of matrix operations on artificial intelligence processors.
Background
The discrete Fourier transform and the inverse discrete Fourier transform are widely used in fields such as digital image processing and computer vision, so a fast engineering implementation of the discrete Fourier transform is of great significance. With the development of artificial intelligence technology and the demands of more advanced fields, the requirements on image and video processing algorithms, and on computing performance across applications, keep rising. Only scalar calculation can be carried out on a CPU, and the calculation time grows rapidly as the data scale increases; computing performance can therefore be improved significantly if a group of data can be computed directly, that is, by tensor calculation. The discrete Fourier transform is usually computed in scalar form by the fast Fourier transform; although this reduces the complexity of the algorithm, the computation amount of the fast Fourier transform is still large and depends heavily on the performance of the system. In the past, the matrix multiplication method was not adopted directly, because neither the CPU nor the GPU could directly carry out matrix multiplication, and the algorithm had to be designed in upper-layer development. At present, some artificial intelligence chips support tensor calculation, are well suited to convolution operations, and at the same time allow the underlying hardware to be operated flexibly.
Disclosure of Invention
The present disclosure aims to overcome the inability of the prior art to perform tensor calculation, and provides a method capable of processing a two-dimensional complex matrix in an on-chip storage unit.
According to a first aspect of the present disclosure, there is provided a method of processing a two-dimensional complex matrix by an artificial intelligence processor, wherein the size of the two-dimensional complex matrix is N × M; the size of a row coefficient matrix corresponding to the two-dimensional complex matrix is M × M, and the size of a column coefficient matrix corresponding to the two-dimensional complex matrix is N × N, the method comprising: loading the two-dimensional complex matrix into a first storage area of an on-chip storage unit of the artificial intelligence processor; loading the row coefficient matrix into a second storage area of the on-chip storage unit; loading the column coefficient matrix into a third storage area of the on-chip storage unit; the artificial intelligence processor performs Fourier transform by using the two-dimensional complex matrix, the row coefficient matrix and the column coefficient matrix in the on-chip storage unit to obtain an operation result; and the artificial intelligence processor transmits the operation result to the off-chip storage unit for storage.
According to a second aspect of the present disclosure, there is provided an electronic device comprising: one or more processors; and a memory having stored therein computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method as described above.
According to a third aspect of the present disclosure, there is provided a computer-readable storage medium comprising computer-executable instructions which, when executed by one or more processors, perform the method as described above.
According to the technical solution of the present disclosure, hardware resources can be fully utilized: the data is loaded once, computed at high speed in the on-chip memory, and the result is stored in the off-chip memory. This reduces the time spent moving data between the memories, which improves memory access efficiency and the performance of the algorithm.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. In the drawings, several embodiments of the disclosure are illustrated by way of example and not by way of limitation, and like or corresponding reference numerals indicate like or corresponding parts and in which:
FIG. 1a shows a schematic diagram of the internal structure of a processor group to which the method of the present disclosure may be applied;
FIG. 1b shows a schematic diagram of an artificial intelligence processor to which the method of the present disclosure can be applied;
FIG. 2 illustrates a method of processing a two-dimensional complex matrix by an artificial intelligence processor according to one embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of a two-dimensional complex matrix according to one embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of converting a coefficient matrix into a one-dimensional array, according to one embodiment of the present disclosure;
FIG. 5 illustrates a flow chart of a method for Fourier transforming from the two-dimensional complex matrix, the row coefficient matrix and the column coefficient matrix according to one embodiment of the present disclosure;
FIG. 6 shows a schematic diagram of a Fourier transform of a two-dimensional matrix of complex numbers and a corresponding matrix of row coefficients, according to one embodiment of the present disclosure;
FIG. 7 shows a schematic diagram of a Fourier transform of a two-dimensional matrix of complex numbers and a corresponding matrix of column coefficients, according to one embodiment of the present disclosure;
FIGS. 8a and 8b show the cases where different elements are located in different columns and in different rows, respectively;
FIG. 9 shows a schematic diagram of a combined processing device according to the present disclosure; and
FIG. 10 shows a schematic block diagram of a board card according to the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, description, and drawings of the present disclosure are used to distinguish between different objects and are not used to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon", "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [described condition or event]" or "in response to detecting [described condition or event]".
Embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
FIG. 1a shows a schematic diagram of the internal structure of a processor group to which the method of the present disclosure may be applied.
An artificial intelligence (AI) chip increases data computing capacity and reduces memory access delay. The AI chip adopts a multi-core processor architecture, supports parallel computation on up to 16 cores, and adds a memory unit core (also called an on-chip storage unit) to accelerate data reading, thereby relieving the memory access bottleneck between the processor cores of the AI chip and the DDR (also called an off-chip storage unit). This provides users with stronger computing capability in scenarios such as deep learning and network computing.
The AI chip has 16 processor cores in total for executing calculation tasks. Every 4 processor cores constitute one processor group, giving 4 processor groups in total. Each processor group contains one memory unit core, which is mainly used for data exchange between the shared memory unit inside the processor group and the processor cores, and between processor groups. When the memory unit core and a processor core access the DDR simultaneously, arbitration by the multiplexer ensures that only one group of buses accesses the DDR.
FIG. 1b shows a schematic diagram of an artificial intelligence processor to which the method of the present disclosure can be applied.
The DDR of the AI chip adopts a Non-Uniform Memory Access (NUMA) architecture: each processor group can access different DDR channels through NOC0, but with different delays. Each processor group corresponds to one DDR channel with the lowest access delay, and its access delay to the other channels is relatively long. As shown in the structure diagram of the processor groups and the DDR in fig. 1b, the delay is lowest when processor group 0, processor group 1, processor group 2 and processor group 3 access DDR0, DDR1, DDR2 and DDR3, respectively. That is, each processor core accesses the DDR channel with the lowest access delay for its own processor group.
Because the access bandwidth inside a processor group is higher than the access bandwidth between a processor core and the DDR, the AI chip can use the shared memory unit inside each processor group to reduce direct access by the processor cores to the DDR, thereby improving data throughput.
When 4-core parallel computing is required, the memory unit core may broadcast data from the shared memory unit (via NOC1) to the 4 processor cores within the processor group simultaneously for computation. Compared with having every processor core read the data from the DDR, this reduces memory access delay and improves computing performance.
As computing demands increase, the 16 processor cores may need to process multiple computing tasks simultaneously. Direct access to the DDR by the processor cores inevitably introduces data access delay and leads to problems such as low computing speed. The AI chip avoids direct communication between the 16 processor cores and the DDR by exchanging data among the processor groups, thereby reducing data access delay.
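The grouping and channel affinity described above can be sketched in a few lines of Python. This is only an illustration: the consecutive numbering of cores 0 to 15 and the function names `group_of` and `preferred_ddr_channel` are assumptions, not part of the disclosure.

```python
CORES_PER_GROUP = 4   # every 4 processor cores form one processor group
N_GROUPS = 4          # 16 cores in total


def group_of(core_id):
    """Processor group of a core (cores assumed to be numbered 0..15 in group order)."""
    if not 0 <= core_id < CORES_PER_GROUP * N_GROUPS:
        raise ValueError("core_id out of range")
    return core_id // CORES_PER_GROUP


def preferred_ddr_channel(core_id):
    """Lowest-delay DDR channel for a core: processor group i maps to DDR i."""
    return group_of(core_id)


channels = [preferred_ddr_channel(core) for core in range(16)]
```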
For a large two-dimensional data matrix, such as a high-definition picture, the structure of the AI chip can be fully utilized to reduce data exchange or data access with an external storage unit, and improve data processing speed and data transmission throughput.
FIG. 2 illustrates a method of processing a two-dimensional complex matrix by an artificial intelligence processor according to one embodiment of the present disclosure, wherein the size of the two-dimensional complex matrix is N × M, the size of a row coefficient matrix corresponding to the two-dimensional complex matrix is M × M, and the size of a column coefficient matrix corresponding to the two-dimensional complex matrix is N × N. The method comprises: in operation S210, loading the two-dimensional complex matrix into a first storage area of an on-chip storage unit of the artificial intelligence processor; in operation S220, loading the row coefficient matrix into a second storage area of the on-chip storage unit; in operation S230, loading the column coefficient matrix into a third storage area of the on-chip storage unit; in operation S240, performing, by the artificial intelligence processor, a Fourier transform using the two-dimensional complex matrix, the row coefficient matrix and the column coefficient matrix in the on-chip storage unit to obtain an operation result; and, in operation S250, transmitting, by the artificial intelligence processor, the operation result to the off-chip storage unit for storage.
It should be explained that the first storage area, the second storage area and the third storage area may be three different storage areas in the same physical memory, and each storage area is used for storing corresponding matrix data; or three separate physical memories, each for storing corresponding matrix data.
In the present disclosure, the storage area of the on-chip memory unit is large enough to store the corresponding two-dimensional complex matrix at once for subsequent operations.
The two-dimensional complex matrix herein is a mathematical representation, and in actual storage, the two-dimensional complex matrix may include two matrices: a real matrix and an imaginary matrix. Fig. 3 shows a schematic diagram of a two-dimensional complex matrix according to an embodiment of the present disclosure.
As shown in fig. 3, the two-dimensional complex matrix exemplarily has size 2 × 4 and includes 8 elements: $a_{00}+jb_{00}$, $a_{01}+jb_{01}$, $a_{02}+jb_{02}$, $a_{03}+jb_{03}$, $a_{10}+jb_{10}$, $a_{11}+jb_{11}$, $a_{12}+jb_{12}$ and $a_{13}+jb_{13}$. It can be split into a real part matrix and an imaginary part matrix. The real part matrix holds the real part of each complex number, namely $a_{00}$, $a_{01}$, $a_{02}$, $a_{03}$, $a_{10}$, $a_{11}$, $a_{12}$ and $a_{13}$; the imaginary part matrix holds the imaginary part of each complex number, namely $b_{00}$, $b_{01}$, $b_{02}$, $b_{03}$, $b_{10}$, $b_{11}$, $b_{12}$ and $b_{13}$. Together, the real part matrix and the imaginary part matrix express the two-dimensional complex matrix.
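The split into a real part matrix and an imaginary part matrix can be illustrated with a minimal Python sketch. The helper names `split_complex` and `merge_complex` and the numeric values are illustrative assumptions; the disclosure does not prescribe a storage API.

```python
def split_complex(matrix):
    """Split a 2-D list of complex numbers into (real, imag) 2-D lists."""
    real = [[z.real for z in row] for row in matrix]
    imag = [[z.imag for z in row] for row in matrix]
    return real, imag


def merge_complex(real, imag):
    """Recombine a real part matrix and an imaginary part matrix."""
    return [[complex(a, b) for a, b in zip(r_row, i_row)]
            for r_row, i_row in zip(real, imag)]


# A 2 x 4 matrix with the same shape as the example of fig. 3 (values are made up).
x = [[1 + 2j, 3 + 4j, 5 + 6j, 7 + 8j],
     [9 + 1j, 2 + 3j, 4 + 5j, 6 + 7j]]
x_real, x_imag = split_complex(x)
```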
The coefficient matrices of the two-dimensional discrete Fourier transform depend only on the size of the input data, so only that size needs to be known before the coefficient matrices are calculated. The row coefficient matrix and the column coefficient matrix of the two-dimensional Fourier transform are calculated in sequence from the height and width (the number of rows and columns) of the input data:
wherein f_row_rr corresponds to the row transform from real-part input to real-part output; f_row_ri corresponds to the row transform from real-part input to imaginary-part output; f_row_ir corresponds to the row transform from imaginary-part input to real-part output; f_row_ii corresponds to the row transform from imaginary-part input to imaginary-part output; $M_{cols}$ represents the length of each row of the input data, i.e. the number of columns of the original two-dimensional matrix; and j and k range over $0 \le j < M_{cols}$ and $0 \le k < M_{cols}$.
Thus, for a complex matrix of size N × M, the row coefficient matrix has size M × M and includes: a first row coefficient matrix for storing the first row coefficient f_row_rr, which converts a real part to a real part; a second row coefficient matrix for storing the second row coefficient f_row_ri, which converts a real part to an imaginary part; a third row coefficient matrix for storing the third row coefficient f_row_ir, which converts an imaginary part to a real part; and a fourth row coefficient matrix for storing the fourth row coefficient f_row_ii, which converts an imaginary part to an imaginary part.
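The row coefficient equation itself is not reproduced in this text. As a hedged reconstruction, assuming the standard forward DFT kernel $e^{-\mathrm{i}\,2\pi jk/M_{cols}}$ (the imaginary unit is written $\mathrm{i}$ here to avoid clashing with the index $j$), multiplying an input $a + \mathrm{i}b$ by $\cos\theta - \mathrm{i}\sin\theta$ gives $(a\cos\theta + b\sin\theta) + \mathrm{i}(b\cos\theta - a\sin\theta)$. The four coefficients that match the names above, and the additive form $Real_{row} = RR_{row} + IR_{row}$, $Imag_{row} = RI_{row} + II_{row}$ used later in the description, would then be:

```latex
\begin{aligned}
f\_row\_rr(j,k) &= \cos\!\left(\frac{2\pi jk}{M_{cols}}\right), &
f\_row\_ri(j,k) &= -\sin\!\left(\frac{2\pi jk}{M_{cols}}\right), \\
f\_row\_ir(j,k) &= \sin\!\left(\frac{2\pi jk}{M_{cols}}\right), &
f\_row\_ii(j,k) &= \cos\!\left(\frac{2\pi jk}{M_{cols}}\right),
\end{aligned}
\qquad 0 \le j,k < M_{cols}.
```

The sign convention and the absence of a scale factor are assumptions; the original figures may define them differently.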
Similarly, the column coefficient matrix is calculated by the following equation:
wherein f_cols_rr corresponds to the column transform from real-part input to real-part output; f_cols_ri corresponds to the column transform from real-part input to imaginary-part output; f_cols_ir corresponds to the column transform from imaginary-part input to real-part output; f_cols_ii corresponds to the column transform from imaginary-part input to imaginary-part output; $N_{rows}$ represents the length of each row of the input data after transposition, i.e. the number of rows of the original two-dimensional matrix; and j and k range over $0 \le j < N_{rows}$ and $0 \le k < N_{rows}$.
Thus, for a complex matrix of size N × M, the column coefficient matrix has size N × N and includes: a first column coefficient matrix for storing the first column coefficient f_cols_rr, which converts a real part to a real part; a second column coefficient matrix for storing the second column coefficient f_cols_ri, which converts a real part to an imaginary part; a third column coefficient matrix for storing the third column coefficient f_cols_ir, which converts an imaginary part to a real part; and a fourth column coefficient matrix for storing the fourth column coefficient f_cols_ii, which converts an imaginary part to an imaginary part.
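A minimal Python sketch of how the four coefficient matrices could be generated from the input size alone is given below. It assumes the standard forward DFT kernel exp(-i*2*pi*j*k/size), which the text does not spell out; the function name `forward_coefficients` is hypothetical.

```python
import math


def forward_coefficients(size):
    """Return the four size x size matrices (rr, ri, ir, ii).

    Assumed kernel: exp(-i*t) = cos(t) - i*sin(t), with t = 2*pi*j*k/size.
    """
    rng = range(size)
    cos_m = [[math.cos(2 * math.pi * j * k / size) for k in rng] for j in rng]
    sin_m = [[math.sin(2 * math.pi * j * k / size) for k in rng] for j in rng]
    neg_sin_m = [[-s for s in row] for row in sin_m]
    rr = cos_m        # real input      -> real output
    ri = neg_sin_m    # real input      -> imaginary output
    ir = sin_m        # imaginary input -> real output
    ii = cos_m        # imaginary input -> imaginary output
    return rr, ri, ir, ii


n_rows, m_cols = 2, 4                        # an N x M = 2 x 4 input
row_coeffs = forward_coefficients(m_cols)    # four M x M = 4 x 4 matrices
col_coeffs = forward_coefficients(n_rows)    # four N x N = 2 x 2 matrices
```

Because the matrices depend only on N and M, they can be computed once and reused for every input of the same size.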
Similarly, the row coefficients of the inverse Fourier transform are calculated by the following equation:
wherein b_row_rr corresponds to the inverse row transform from real-part input to real-part output; b_row_ri corresponds to the inverse row transform from real-part input to imaginary-part output; b_row_ir corresponds to the inverse row transform from imaginary-part input to real-part output; and b_row_ii corresponds to the inverse row transform from imaginary-part input to imaginary-part output.
Similarly, the column coefficients of the inverse Fourier transform are calculated by the following equation:
wherein b_cols_rr corresponds to the inverse column transform from real-part input to real-part output; b_cols_ri corresponds to the inverse column transform from real-part input to imaginary-part output; b_cols_ir corresponds to the inverse column transform from imaginary-part input to real-part output; and b_cols_ii corresponds to the inverse column transform from imaginary-part input to imaginary-part output.
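The inverse coefficient equations are likewise not reproduced here. The sketch below assumes the usual convention: forward kernel exp(-i*t), inverse kernel exp(+i*t) scaled by 1/size. Under that assumption, applying the forward and then the inverse coefficients recovers the input. The names `coefficients` and `apply_coefficients` are illustrative only.

```python
import math


def coefficients(size, inverse=False):
    """Return (rr, ri, ir, ii) for a length-`size` transform.

    Assumed forward kernel: exp(-i*t); assumed inverse kernel: exp(+i*t) / size,
    with t = 2*pi*j*k/size.
    """
    sign = 1.0 if inverse else -1.0
    scale = 1.0 / size if inverse else 1.0
    rng = range(size)
    cos_m = [[scale * math.cos(2 * math.pi * j * k / size) for k in rng] for j in rng]
    sin_m = [[scale * math.sin(2 * math.pi * j * k / size) for k in rng] for j in rng]
    ri = [[sign * s for s in row] for row in sin_m]     # real -> imaginary
    ir = [[-sign * s for s in row] for row in sin_m]    # imaginary -> real
    return cos_m, ri, ir, cos_m


def apply_coefficients(real, imag, coeffs):
    """Transform one vector: out[j] = sum over k of in[k] * coeff[j][k]."""
    rr, ri, ir, ii = coeffs
    n = len(real)
    out_r = [sum(real[k] * rr[j][k] + imag[k] * ir[j][k] for k in range(n))
             for j in range(n)]
    out_i = [sum(real[k] * ri[j][k] + imag[k] * ii[j][k] for k in range(n))
             for j in range(n)]
    return out_r, out_i


x_re, x_im = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.0, 2.0]
f_re, f_im = apply_coefficients(x_re, x_im, coefficients(4))
b_re, b_im = apply_coefficients(f_re, f_im, coefficients(4, inverse=True))
```

Only the coefficient values change between the two directions; the multiply-and-accumulate routine is shared.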
According to one embodiment of the present disclosure, the row coefficient matrix may be converted into a one-dimensional array by the artificial intelligence processor to be loaded into the second storage area; and converting the column coefficient matrix into a one-dimensional array to be loaded to the third storage area.
Fig. 4 shows a schematic diagram of converting a coefficient matrix into a one-dimensional array according to one embodiment of the present disclosure.
As shown in fig. 4, assuming that the size of one coefficient matrix (row coefficient matrix or column coefficient matrix) is 4 × 4, the coefficient matrix can be converted into a 1 × 16 one-dimensional array. Storing the two-dimensional matrix as a one-dimensional array facilitates data access and the transform calculation.
Similarly, a two-dimensional complex matrix can also be converted into a one-dimensional array: a two-dimensional complex matrix of size N × M (including a real part matrix and an imaginary part matrix) has size 1 × (N × M) when converted into a one-dimensional array. For tensor calculation, the one-dimensional array may be stored in the first storage area as a multi-dimensional matrix of size 1 × 1 × 1 × (N × M), where the number of data sets is 1, the height is 1, the width is 1, and the depth is N × M.
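The conversion to a one-dimensional array can be sketched as follows. Row-major order is an assumption here (fig. 4 is not reproduced in this text), and the helper names are hypothetical.

```python
def flatten(matrix):
    """Row-major flattening of a 2-D list into a 1-D list."""
    return [value for row in matrix for value in row]


def element(flat, n_cols, row, col):
    """Read element (row, col) back from the row-major 1-D array."""
    return flat[row * n_cols + col]


coeff = [[r * 4 + c for c in range(4)] for r in range(4)]   # 4 x 4 placeholder values
flat = flatten(coeff)                                       # 1 x 16 array
shape_4d = (1, 1, 1, len(flat))    # (data sets, height, width, depth)
```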
Fig. 5 shows a flowchart of a method for performing fourier transform on the two-dimensional complex matrix, the row coefficient matrix and the column coefficient matrix according to an embodiment of the present disclosure.
As shown in fig. 5, the method of performing fourier transform includes: in operation S2410, performing convolution operation on each row of the two-dimensional complex matrix and all rows of the row coefficient matrix to obtain a first intermediate result; in operation S2420, adding the same row elements of the first intermediate result to obtain a first operation result; in operation S2430, transposing the first operation result to obtain a transposed result, and performing convolution operation on each row of the transposed result and all rows of the column coefficient matrix to obtain a second intermediate result; and adding the same row elements of the second intermediate result to obtain an operation result in operation S2440.
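Operations S2410 to S2440 can be sketched end to end in plain Python and checked against a direct evaluation of the two-dimensional DFT definition. The sketch reads the "convolution" of a data row with a coefficient row as an element-wise product followed by a sum, and assumes the forward kernel exp(-i*2*pi*j*k/size); both are interpretations, and all function names are illustrative.

```python
import cmath
import math


def coeffs(size):
    """Forward (rr, ri, ir, ii) matrices under the assumed kernel exp(-i*2*pi*j*k/size)."""
    rng = range(size)
    c = [[math.cos(2 * math.pi * j * k / size) for k in rng] for j in rng]
    s = [[math.sin(2 * math.pi * j * k / size) for k in rng] for j in rng]
    return c, [[-v for v in row] for row in s], s, c


def stage(real, imag, coeff):
    """Multiply each data row with every coefficient row, then sum each product row."""
    rr, ri, ir, ii = coeff
    out_r, out_i = [], []
    for x_r, x_i in zip(real, imag):
        out_r.append([sum(a * p + b * q for a, b, p, q in zip(x_r, x_i, c_rr, c_ir))
                      for c_rr, c_ir in zip(rr, ir)])
        out_i.append([sum(a * p + b * q for a, b, p, q in zip(x_r, x_i, c_ri, c_ii))
                      for c_ri, c_ii in zip(ri, ii)])
    return out_r, out_i


def transpose(m):
    return [list(col) for col in zip(*m)]


def dft2(real, imag):
    n, m = len(real), len(real[0])
    r1, i1 = stage(real, imag, coeffs(m))                     # S2410 + S2420, N x M
    r2, i2 = stage(transpose(r1), transpose(i1), coeffs(n))   # S2430 + S2440, M x N
    return transpose(r2), transpose(i2)                       # back to N x M


def reference(x, u, v):
    """Direct evaluation of the 2-D DFT definition for output element (u, v)."""
    n, m = len(x), len(x[0])
    return sum(x[a][b] * cmath.exp(-2j * math.pi * (u * a / n + v * b / m))
               for a in range(n) for b in range(m))


x = [[1 + 2j, 3 - 1j, 0 + 1j, 2 + 0j], [4 + 0j, -1 + 1j, 2 + 2j, 1 - 3j]]
out_r, out_i = dft2([[z.real for z in r] for r in x], [[z.imag for z in r] for r in x])
```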
First, the basic concept of performing a Fourier transform on complex numbers is described. For a complex input, the output of the Fourier transform is also complex. The general formula for complex multiplication is:
$(A_1 + jB_1)(A_2 + jB_2) = (A_1A_2 - B_1B_2) + j(A_1B_2 + B_1A_2)$
wherein $A_1$ and $A_2$ are the real parts of the two complex data, and $B_1$ and $B_2$ are the imaginary parts of the two complex data.
Based on the above general formula, the Fourier transform of the two-dimensional complex matrix with the row coefficient matrix can be calculated by the following equations:
$RR_{row} = inp_{real} * f\_row\_rr$
$IR_{row} = inp_{imag} * f\_row\_ir$
$RI_{row} = inp_{real} * f\_row\_ri$
$II_{row} = inp_{imag} * f\_row\_ii$
$Real_{row} = RR_{row} + IR_{row}$
$Imag_{row} = RI_{row} + II_{row}$
wherein $inp_{real}$ and $inp_{imag}$ are respectively the real and imaginary parts of the input data; $RR_{row}$ denotes the term with real-part input and real-part output; $RI_{row}$ denotes the term with real-part input and imaginary-part output; $IR_{row}$ denotes the term with imaginary-part input and real-part output; $II_{row}$ denotes the term with imaginary-part input and imaginary-part output; $Real_{row}$ denotes the real part after the Fourier transform and $Imag_{row}$ the imaginary part after the Fourier transform. The subscript "row" indicates that the above operations use the row coefficient matrix.
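The six equations can be checked by hand for a single row of length 4. The sketch below reads "*" as a matrix-vector product and assumes the forward kernel exp(-i*2*pi*j*k/4), for which the cosine and sine coefficients take only the values -1, 0 and 1; the input values are made up.

```python
# Exact forward coefficients for a row length of 4 under the assumed kernel.
COS = [[1, 1, 1, 1], [1, 0, -1, 0], [1, -1, 1, -1], [1, 0, -1, 0]]
SIN = [[0, 0, 0, 0], [0, 1, 0, -1], [0, 0, 0, 0], [0, -1, 0, 1]]
f_row_rr = COS
f_row_ri = [[-s for s in row] for row in SIN]
f_row_ir = SIN
f_row_ii = COS


def mat_vec(coeff, vec):
    """out[j] = sum over k of vec[k] * coeff[j][k]."""
    return [sum(v * c for v, c in zip(vec, row)) for row in coeff]


inp_real = [1, 2, 3, 4]
inp_imag = [4, 3, 2, 1]

rr_row = mat_vec(f_row_rr, inp_real)   # real input      -> real output
ir_row = mat_vec(f_row_ir, inp_imag)   # imaginary input -> real output
ri_row = mat_vec(f_row_ri, inp_real)   # real input      -> imaginary output
ii_row = mat_vec(f_row_ii, inp_imag)   # imaginary input -> imaginary output

real_row = [a + b for a, b in zip(rr_row, ir_row)]
imag_row = [a + b for a, b in zip(ri_row, ii_row)]
```

With these values, `real_row` is `[10, 0, -2, -4]` and `imag_row` is `[10, 4, 2, 0]`, which matches the DFT of the sequence 1+4j, 2+3j, 3+2j, 4+1j computed directly from the definition.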
Fig. 6 shows a schematic diagram of a fourier transform of a two-dimensional matrix of complex numbers and a corresponding matrix of row coefficients, according to an embodiment of the present disclosure.
As shown in fig. 6, the two-dimensional complex matrix is 2 × 4, and the data of its first row are exemplarily a1, b1, c1 and d1; the row coefficient matrix is 4 × 4, and the data of its first row are exemplarily a2, b2, c2 and d2. The data of the first row of the two-dimensional complex matrix are convolved with the row coefficient matrix to obtain a first intermediate result, in which the convolution of a1 and a2 gives a3, the convolution of b1 and b2 gives b3, the convolution of c1 and c2 gives c3, and the convolution of d1 and d2 gives d3. In the first operation result, the result A is a3 + b3 + c3 + d3, i.e., the elements in the same row of the first intermediate result are added. Further, although not shown in fig. 6, the data of the first row of the two-dimensional complex matrix are convolved with the data of the second row of the row coefficient matrix, and the results are added to obtain the first operation result B. The first operation results C and D are obtained in the same way. After the first row of the two-dimensional complex matrix has been processed, the second row is processed in the same manner. Thus, the first operation result has the same size as the two-dimensional complex matrix, namely 2 × 4.
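The walk-through of fig. 6 amounts to an element-wise product followed by a row sum. A minimal sketch with made-up real values follows; it shows a single real-valued product, whereas the full transform applies the same pattern to each of the four real/imaginary combinations.

```python
def first_intermediate(data_row, coeff_row):
    """Element-wise products of one data row with one coefficient row."""
    return [x * c for x, c in zip(data_row, coeff_row)]


def first_operation_result(data, coeff):
    """Sum each product row; the output has the same shape as `data`."""
    return [[sum(first_intermediate(row, c_row)) for c_row in coeff]
            for row in data]


data = [[1, 2, 3, 4],          # a1, b1, c1, d1
        [5, 6, 7, 8]]
coeff = [[1, 0, 0, 0],         # placeholder 4 x 4 coefficient values
         [0, 1, 0, 0],
         [1, 1, 1, 1],
         [1, -1, 1, -1]]

products = first_intermediate(data[0], coeff[2])   # a3, b3, c3, d3 for one coefficient row
result = first_operation_result(data, coeff)       # 2 x 4, same shape as the input
```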
The Fourier transform with the column coefficient matrix can be calculated by the following equations:
$RR_{col} = temp_{real} * f\_cols\_rr$
$IR_{col} = temp_{imag} * f\_cols\_ir$
$RI_{col} = temp_{real} * f\_cols\_ri$
$II_{col} = temp_{imag} * f\_cols\_ii$
$Real_{col} = RR_{col} + IR_{col}$
$Imag_{col} = RI_{col} + II_{col}$
wherein $temp_{real}$ and $temp_{imag}$ are respectively the real and imaginary parts of the transposed first operation result calculated as in fig. 6; $RR_{col}$ denotes the term with real-part input and real-part output; $RI_{col}$ denotes the term with real-part input and imaginary-part output; $IR_{col}$ denotes the term with imaginary-part input and real-part output; $II_{col}$ denotes the term with imaginary-part input and imaginary-part output; $Real_{col}$ denotes the real part after the Fourier transform and $Imag_{col}$ the imaginary part after the Fourier transform. The subscript "col" indicates that the above operations use the column coefficient matrix.
Fig. 7 shows a schematic diagram of fourier transforming a two-dimensional matrix of complex numbers and a corresponding matrix of column coefficients, according to one embodiment of the present disclosure.
As shown in fig. 7, the first operation result has size 2 × 4 and, after transposition, size 4 × 2; the column coefficient matrix therefore has size 2 × 2. The operation result obtained from the column transform has size 4 × 2, and transposing it yields the final operation result.
It should be understood that the above calculation first uses the two-dimensional complex matrix and the row coefficient matrix to calculate the first operation result, and then uses the transpose of the first operation result and the column coefficient matrix to calculate the operation result. According to another embodiment of the present disclosure, the first operation result may instead be calculated from the transpose of the two-dimensional complex matrix and the column coefficient matrix, and the operation result then calculated from the transpose of the first operation result and the row coefficient matrix.
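That the two orders give the same result can be checked numerically. For brevity this sketch uses Python's built-in complex type instead of separate real and imaginary matrices, and assumes the standard forward kernel; the function names are illustrative.

```python
import cmath


def dft_rows(x):
    """1-D DFT of every row of a complex matrix (direct evaluation)."""
    m = len(x[0])
    w = [[cmath.exp(-2j * cmath.pi * j * k / m) for k in range(m)] for j in range(m)]
    return [[sum(v * c for v, c in zip(row, w_row)) for w_row in w] for row in x]


def transpose(x):
    return [list(col) for col in zip(*x)]


def rows_then_cols(x):
    """Row transform, transpose, column transform, transpose back."""
    return transpose(dft_rows(transpose(dft_rows(x))))


def cols_then_rows(x):
    """Transpose, column transform, transpose back, row transform."""
    return dft_rows(transpose(dft_rows(transpose(x))))


x = [[1 + 2j, 3 - 1j, 1j, 2], [4, -1 + 1j, 2 + 2j, 1 - 3j]]
a = rows_then_cols(x)
b = cols_then_rows(x)
max_diff = max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))
```

The difference `max_diff` is at the level of floating-point rounding, since the row and column transforms act on independent indices.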
According to one embodiment of the present disclosure, the fourier transforms are performed in parallel.
As shown in fig. 1a and 1b, there may be a plurality of processor cores and a plurality of processor groups, so that after data is read from the off-chip memory unit, the data may be processed in parallel to increase the processing speed of the data.
According to an embodiment of the present disclosure, wherein fourier transforming, by the artificial intelligence processor, the two-dimensional complex matrix, the row coefficient matrix and the column coefficient matrix comprises: and respectively carrying out Fourier transform on different elements in each two-dimensional complex matrix and coefficient elements in a row coefficient matrix and a column coefficient corresponding to the elements by a plurality of artificial intelligence processors.
The different elements described herein refer to elements located at different positions in the two-dimensional complex matrix, and according to one embodiment of the present disclosure, each processor may be responsible for elements at fixed positions, for example, the 0 th, 2 th, and 4 th elements in each row in the two-dimensional complex matrix may be executed by the zeroth processor core, the 1 st, 3 th, and 5 th elements in each row in the two-dimensional complex matrix may be executed by the first processor core, and so on.
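The fixed-position assignment in this example is a strided (interleaved) partition of the element indices. A minimal sketch, with a hypothetical helper name:

```python
def assign_interleaved(n_elements, n_cores):
    """Map each core id to the element indices it handles in every row."""
    return {core: list(range(core, n_elements, n_cores)) for core in range(n_cores)}


# Two cores sharing rows of six elements, as in the example above.
assignment = assign_interleaved(n_elements=6, n_cores=2)
```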
Figs. 8a and 8b show the cases in which the different elements are located in different columns and in different rows, respectively.
In Fig. 8a, the two-dimensional complex matrix may be, for example, 2 × 4, with processor 0 responsible for column 0, processor 1 for column 1, processor 2 for column 2, and processor 3 for column 3. In this case, each processor reads its corresponding elements from the on-chip memory unit and the processors operate in parallel, so that the processing speed can be increased.
In Fig. 8b, the two-dimensional complex matrix may be, for example, 2 × 4, with processor 0 responsible for row 0 and processor 1 responsible for row 1. In this case, processor 0 may convolve all elements of row 0 with the row coefficient matrix, and processor 1 may convolve all elements of row 1 with the row coefficient matrix. Each row is thus handled by a different processor, which speeds up the operation.
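The row-wise partitioning of Fig. 8b can be sketched in Python as follows. The sketch assumes standard DFT coefficients and models the per-row operation as a vector-matrix product standing in for the convolution described above; a thread pool is used only to show how the rows are divided among workers and is not expected to reproduce the speed-up of dedicated processor cores. All names are hypothetical.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor

def dft_matrix(n):
    """n x n coefficient matrix, W[j][k] = exp(-2*pi*i*j*k/n) (assumed standard DFT)."""
    return [[cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n)] for j in range(n)]

def transform_row(row, coeff):
    """Work of one processor: combine every element of its row with the coefficients."""
    m = len(row)
    return [sum(row[t] * coeff[t][k] for t in range(m)) for k in range(m)]

def row_pass_parallel(x, num_workers):
    """Hand each row to a worker; pool.map returns results in input order."""
    coeff = dft_matrix(len(x[0]))
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(lambda row: transform_row(row, coeff), x))

if __name__ == "__main__":
    x = [[1, 2, 3, 4], [0, 1, 0, -1]]  # 2 x 4 matrix, one row per worker
    for row in row_pass_parallel(x, num_workers=2):
        print(row)
```

Because each row depends only on its own elements and on the shared coefficient matrix, the workers need no communication during the row pass.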
It should be understood that "processor" is used herein as a generic term and may refer to a processor core or a processor group. The present disclosure does not limit the type of processor.
With the technical solution of the present disclosure, hardware resources can be fully utilized: the data is loaded once, computed at high speed in the on-chip memory, and stored to the off-chip memory, which reduces the time spent moving data between the memories and thereby improves the memory access efficiency and the performance of the algorithm.
Furthermore, it should be understood that, although the above description and illustration use only the Fourier transform as an example, the aspects of the present disclosure apply equally to the inverse Fourier transform. The two differ only in the elements of the row coefficient matrix and the column coefficient matrix and are equivalent in terms of the overall operation; the scope of protection of the present disclosure therefore also encompasses the inverse Fourier transform.
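A minimal Python sketch of this equivalence is given below. It assumes the usual convention in which the inverse coefficients are the complex conjugates of the forward coefficients scaled by 1/n; the disclosure does not state where the scaling is applied, so that placement, like the function names, is an assumption of the sketch.

```python
import cmath

def coefficient_matrix(n, inverse=False):
    """Forward: exp(-2*pi*i*j*k/n). Inverse (assumed): exp(+2*pi*i*j*k/n) / n."""
    sign = 1 if inverse else -1
    scale = 1.0 / n if inverse else 1.0
    return [[scale * cmath.exp(sign * 2j * cmath.pi * j * k / n) for k in range(n)]
            for j in range(n)]

def matmul(a, b):
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def transform2d(x, inverse=False):
    """Same sequence of operations in both directions; only the coefficients change."""
    n, m = len(x), len(x[0])
    first = matmul(x, coefficient_matrix(m, inverse))                    # row pass
    second = matmul(transpose(first), coefficient_matrix(n, inverse))    # column pass
    return transpose(second)

if __name__ == "__main__":
    x = [[1 + 2j, 2 - 1j, 0.5j, -3], [4, -1j, 2 + 2j, 1 - 1j]]
    restored = transform2d(transform2d(x), inverse=True)
    print(max(abs(p - q) for rx, rr in zip(x, restored) for p, q in zip(rx, rr)))
```

Applying the forward and then the inverse transform returns the original matrix up to floating-point rounding.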
The present disclosure also provides an electronic device, including: one or more processors; and a memory having stored therein computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method as described above.
The present disclosure also provides a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method as described above.
The technical solution of the present disclosure can be applied in the field of artificial intelligence and implemented in an artificial intelligence chip. The chip may exist alone or may be included in a computing device.
Fig. 9 is a schematic diagram of a combined processing device 900, which includes the computing device 902 described above, a universal interconnect interface 904, and other processing devices 906. The computing device according to the present disclosure interacts with the other processing devices to jointly perform operations specified by a user.
The other processing devices include one or more types of general-purpose or special-purpose processors, such as central processing units (CPUs), graphics processing units (GPUs), and neural network processors. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning computing device and external data and control, handling data transfer and basic control of the machine learning computing device such as starting and stopping; the other processing devices may also cooperate with the machine learning computing device to perform computing tasks.
The universal interconnect interface is used to transfer data and control instructions between the computing device (including, for example, a machine learning computing device) and the other processing devices. The computing device obtains the required input data from the other processing devices and writes it into an on-chip storage device of the computing device; it can obtain control instructions from the other processing devices and write them into an on-chip control cache of the computing device; it can also read the data in a memory module of the computing device and transmit it to the other processing devices.
Optionally, the architecture may further comprise a storage device 908, which is connected to the computing device and to the other processing devices, respectively. The storage device is used for storing data of the computing device and the other processing devices, and is particularly suitable for data that cannot be held entirely in the internal storage of the computing device or the other processing devices.
The combined processing device can serve as the SoC (system on chip) of equipment such as a mobile phone, a robot, an unmanned aerial vehicle, or video monitoring equipment, effectively reducing the core area of the control part, increasing the processing speed, and reducing the overall power consumption. In this case, the universal interconnect interface of the combined processing device is connected to certain components of the equipment, such as a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface.
In some embodiments, the present disclosure also provides a chip package structure that includes the chip described above.
In some embodiments, the present disclosure also provides a board card comprising the chip package structure. Referring to Fig. 10, an exemplary board card is provided that may include, in addition to the chip 1002, other supporting components, including but not limited to: a memory device 1004, an interface device 1006, and a control device 1008.
The memory device is connected via a bus to the chip in the chip package structure and is used for storing data. The memory device may include multiple groups of memory units 1010. Each group of memory units is connected to the chip via a bus. It is understood that each group of memory units may be DDR SDRAM (Double Data Rate SDRAM).
DDR doubles the data rate of SDRAM without increasing the clock frequency, because data is transferred on both the rising and falling edges of the clock pulse, that is, twice per clock cycle. In one embodiment, the memory device may include four groups of memory units. Each group of memory units may include a plurality of DDR4 chips arranged in parallel. In one embodiment, the chip may internally include four 72-bit DDR4 controllers, in each of which 64 bits are used for data transmission and 8 bits are used for ECC checking. A controller for the DDR is provided in the chip to control the data transmission and data storage of each memory unit.
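The effect of the double data rate and of the 64-bit data path can be illustrated with a short calculation in Python. The clock frequency of 1200 MHz (as used by DDR4-2400 parts) is an assumption chosen only for this example, since the disclosure does not specify a speed grade; the function names are likewise hypothetical.

```python
def peak_data_bandwidth(num_controllers, data_bits, clock_hz, transfers_per_clock=2):
    """Theoretical peak payload bandwidth in bytes per second.

    transfers_per_clock=2 reflects DDR: one transfer on each clock edge.
    """
    return num_controllers * (data_bits // 8) * transfers_per_clock * clock_hz

def ecc_fraction(total_bits, data_bits):
    """Share of the controller width spent on ECC check bits."""
    return (total_bits - data_bits) / total_bits

if __name__ == "__main__":
    clock_hz = 1_200_000_000  # assumed clock frequency, not taken from the disclosure
    ddr = peak_data_bandwidth(4, 64, clock_hz)
    sdr = peak_data_bandwidth(4, 64, clock_hz, transfers_per_clock=1)
    print(ddr / 1e9, "GB/s with DDR;", sdr / 1e9, "GB/s at single data rate")
    print("ECC share of a 72-bit controller:", ecc_fraction(72, 64))
```

Under this assumed clock, four controllers with 64 data bits each give a theoretical peak of 76.8 GB/s, twice the single-data-rate figure; real throughput would be lower.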
The interface device is electrically connected to the chip in the chip package structure. The interface device is used for data transfer between the chip and an external device 1012, such as a server or a computer. For example, in one embodiment, the interface device may be a standard PCIe interface, through which the server transmits the data to be processed to the chip. In another embodiment, the interface device may be another interface; the present disclosure does not limit the specific form of such other interfaces, provided that the interface unit can implement the transfer function. In addition, the calculation result of the chip is transmitted back to the external device (e.g., a server) by the interface device.
The control device is electrically connected to the chip and is used for monitoring the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller unit (MCU). The chip may include a plurality of processing chips, a plurality of processing cores, or a plurality of processing circuits, and may drive a plurality of loads. The chip can therefore be in different working states, such as multi-load and light-load. The control device can regulate the working states of the plurality of processing chips, processing cores and/or processing circuits in the chip.
In some embodiments, the present disclosure also discloses an electronic device or apparatus, which includes the above board card.
Electronic devices or apparatuses include data processing apparatuses, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, automobile data recorders, navigators, sensors, cameras, servers, cloud servers, video cameras, projectors, watches, headsets, mobile storage, wearable devices, means of transportation, household appliances, and/or medical devices.
The means of transportation include an airplane, a ship and/or a vehicle; the household appliances include a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical devices include a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus and/or an electrocardiograph.
It is noted that, for simplicity of explanation, the foregoing method embodiments are described as a series of acts or combinations of acts, but those skilled in the art will appreciate that the present disclosure is not limited by the described order of acts, since according to the present disclosure some steps may be performed in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules involved are not necessarily required by the present disclosure.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection between devices or units through some interfaces, and may be electrical, optical, acoustic, magnetic, or of another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.
The integrated units, if implemented in the form of software program modules and sold or used as stand-alone products, may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present disclosure may be embodied in the form of a software product that is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present disclosure. The aforementioned memory includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.
The embodiments of the present disclosure have been described in detail above for purposes of illustration and description; the description is exemplary only and is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Meanwhile, those skilled in the art may, based on the idea of the present disclosure, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present disclosure.
Claims (12)
1. A method of processing a two-dimensional complex matrix by an artificial intelligence processor, wherein the size of the two-dimensional complex matrix is N × M, the size of a row coefficient matrix corresponding to the two-dimensional complex matrix is M × M, and the size of a column coefficient matrix corresponding to the two-dimensional complex matrix is N × N, the method comprising:
loading the two-dimensional complex matrix into a first storage area of an on-chip storage unit of the artificial intelligence processor;
loading the row coefficient matrix into a second storage area of the on-chip storage unit;
loading the column coefficient matrix into a third storage area of the on-chip storage unit;
the artificial intelligence processor performs Fourier transform by using the two-dimensional complex matrix, the row coefficient matrix and the column coefficient matrix in the on-chip storage unit to obtain an operation result; and
the artificial intelligence processor transmits the operation result to an off-chip storage unit for storage.
2. The method of claim 1, wherein the row coefficient matrix is converted into a one-dimensional array by the artificial intelligence processor for loading into the second storage area; and converting the column coefficient matrix into a one-dimensional array to be loaded to the third storage area.
3. The method according to claim 1 or 2, wherein the two-dimensional complex matrix comprises a real matrix and an imaginary matrix.
4. The method of claim 3, wherein the row coefficient matrix comprises:
a first row coefficient matrix for storing first row coefficients for the conversion from the real part to the real part;
a second row coefficient matrix for storing second row coefficients for the conversion from the real part to the imaginary part;
a third row coefficient matrix for storing third row coefficients for the conversion from the imaginary part to the real part; and
a fourth row coefficient matrix for storing fourth row coefficients for the conversion from the imaginary part to the imaginary part.
5. The method of claim 3, wherein the column coefficient matrix comprises:
a first column coefficient matrix for storing first column coefficients for the conversion from the real part to the real part;
a second column coefficient matrix for storing second column coefficients for the conversion from the real part to the imaginary part;
a third column coefficient matrix for storing third column coefficients for the conversion from the imaginary part to the real part; and
a fourth column coefficient matrix for storing fourth column coefficients for the conversion from the imaginary part to the imaginary part.
6. The method of any one of claims 1-5, wherein Fourier transforming the two-dimensional complex matrix, the row coefficient matrix and the column coefficient matrix by an artificial intelligence processor to obtain a result of the operation comprises:
performing convolution operation on each row of the two-dimensional complex matrix and all rows of the row coefficient matrix to obtain a first intermediate result;
adding the same row elements of the first intermediate result to obtain a first operation result;
transposing the first operation result to obtain a transposed result, and performing convolution operation on each row of the transposed result and all rows of the column coefficient matrix to obtain a second intermediate result;
and adding the same row elements of the second intermediate result to obtain an operation result.
7. The method according to any one of claims 1-6, wherein the two-dimensional complex matrix is stored in the first storage area as a multi-dimensional matrix of size 1 × (N × M).
8. The method of any of claims 1-7, wherein the Fourier transform is performed in parallel.
9. The method of claim 8, wherein performing the Fourier transform by the artificial intelligence processor using the two-dimensional complex matrix, the row coefficient matrix and the column coefficient matrix comprises:
performing, by a plurality of artificial intelligence processors, the Fourier transform respectively on different elements of each two-dimensional complex matrix and on the coefficient elements of the row coefficient matrix and the column coefficient matrix corresponding to those elements.
10. The method of claim 9, wherein the different elements are located in different rows or different columns.
11. An electronic device, comprising:
one or more processors; and
memory having stored therein computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method of any of claims 1-10.
12. A computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method of any one of claims 1-10.
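By way of illustration only, the real-part and imaginary-part decomposition recited in claims 3 to 5 can be sketched in Python. The sketch assumes standard DFT coefficients cos θ − i·sin θ with θ = 2πjk/M, so that the four real-valued row coefficient matrices are cos, −sin, sin and cos; the actual coefficient values and all function names are assumptions and are not taken from the claims.

```python
import math

def row_coefficient_matrices(m):
    """Four real M x M matrices: real->real, real->imag, imag->real, imag->imag."""
    cos = [[math.cos(2 * math.pi * j * k / m) for k in range(m)] for j in range(m)]
    sin = [[math.sin(2 * math.pi * j * k / m) for k in range(m)] for j in range(m)]
    neg_sin = [[-v for v in row] for row in sin]
    return cos, neg_sin, sin, cos

def matmul(a, b):
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def add(a, b):
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def row_pass_real(re, im):
    """Row pass on separate real and imaginary matrices using only real arithmetic."""
    rr, ri, ir, ii = row_coefficient_matrices(len(re[0]))
    out_re = add(matmul(re, rr), matmul(im, ir))   # contributions to the real part
    out_im = add(matmul(re, ri), matmul(im, ii))   # contributions to the imaginary part
    return out_re, out_im

if __name__ == "__main__":
    # One row holding the complex values 1, 2+i, 3, 4-i.
    out_re, out_im = row_pass_real([[1, 2, 3, 4]], [[0, 1, 0, -1]])
    print(out_re[0])
    print(out_im[0])
```

Splitting the complex product in this way lets hardware that supports only real-valued multiply-accumulate operations carry out the transform.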
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911349747.7A CN111143766A (en) | 2019-12-24 | 2019-12-24 | Method and apparatus for processing two-dimensional complex matrix by artificial intelligence processor |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911349747.7A CN111143766A (en) | 2019-12-24 | 2019-12-24 | Method and apparatus for processing two-dimensional complex matrix by artificial intelligence processor |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111143766A (en) | 2020-05-12 |
Family
ID=70519787
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911349747.7A Withdrawn CN111143766A (en) | 2019-12-24 | 2019-12-24 | Method and apparatus for processing two-dimensional complex matrix by artificial intelligence processor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111143766A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112486872A (en) * | 2020-11-27 | 2021-03-12 | 维沃移动通信有限公司 | Data processing method and device |
CN112486872B (en) * | 2020-11-27 | 2024-07-19 | 维沃移动通信有限公司 | Data processing method and device |
CN113655986A (en) * | 2021-08-27 | 2021-11-16 | 中国人民解放军国防科技大学 | FFT convolution algorithm parallel implementation method and system based on NUMA affinity |
CN113655986B (en) * | 2021-08-27 | 2023-06-30 | 中国人民解放军国防科技大学 | FFT convolution algorithm parallel implementation method and system based on NUMA affinity |
CN113655986B9 (en) * | 2021-08-27 | 2023-10-10 | 中国人民解放军国防科技大学 | FFT convolution algorithm parallel implementation method and system based on NUMA affinity |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109522052B (en) | Computing device and board card | |
CN111028136B (en) | Method and equipment for processing two-dimensional complex matrix by artificial intelligence processor | |
CN111124995A (en) | Method and apparatus for processing a one-dimensional complex array by an artificial intelligence processor | |
WO2021083101A1 (en) | Data processing method and apparatus, and related product | |
CN111488976B (en) | Neural network computing device, neural network computing method and related products | |
CN110059797B (en) | Computing device and related product | |
CN112686379B (en) | Integrated circuit device, electronic apparatus, board and computing method | |
CN111125628A (en) | Method and apparatus for processing two-dimensional data matrix by artificial intelligence processor | |
CN112416433A (en) | Data processing device, data processing method and related product | |
CN111143766A (en) | Method and apparatus for processing two-dimensional complex matrix by artificial intelligence processor | |
CN111488963B (en) | Neural network computing device and method | |
CN109711540B (en) | Computing device and board card | |
CN110059809B (en) | Computing device and related product | |
CN112084023A (en) | Data parallel processing method, electronic equipment and computer readable storage medium | |
CN109740730B (en) | Operation method, device and related product | |
CN111061507A (en) | Operation method, operation device, computer equipment and storage medium | |
WO2021082723A1 (en) | Operation apparatus | |
CN113469333B (en) | Artificial intelligence processor, method and related products for executing neural network model | |
US20240303295A1 (en) | Operation apparatus and related product | |
CN111382852B (en) | Data processing device, method, chip and electronic equipment | |
CN111382853B (en) | Data processing device, method, chip and electronic equipment | |
CN111368987B (en) | Neural network computing device and method | |
CN111124996A (en) | Method and apparatus for processing a one-dimensional complex array by an artificial intelligence processor | |
CN111783954A (en) | Method and equipment for determining performance of neural network | |
CN113807489B (en) | Method for performing deconvolution operation, board card and computing device thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WW01 | Invention patent application withdrawn after publication | Application publication date: 20200512 |