CN117216466A - Data processing method, device, system and storage medium - Google Patents

Data processing method, device, system and storage medium

Info

Publication number
CN117216466A
CN117216466A (application CN202311111058.9A)
Authority
CN
China
Prior art keywords
matrix
precision
preset
triangular
equation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311111058.9A
Other languages
Chinese (zh)
Inventor
杨凯
范登栋
徐鹏翔
刘勇翔
田永鸿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peng Cheng Laboratory
Original Assignee
Peng Cheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peng Cheng Laboratory filed Critical Peng Cheng Laboratory
Priority to CN202311111058.9A priority Critical patent/CN117216466A/en
Publication of CN117216466A publication Critical patent/CN117216466A/en

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Complex Calculations (AREA)

Abstract

The application discloses a data processing method, device, system and storage medium, relating to the technical field of artificial intelligence chips, and comprising the following steps: obtaining a matrix to be solved, wherein the matrix to be solved, a triangular matrix and a preset matrix are constructed into a triangular matrix equation, with the matrix to be solved and the triangular matrix on one side of the equation and the preset matrix on the other side; performing inverse transformation on the triangular matrix to obtain an inverse triangular matrix; performing precision processing on the inverse triangular matrix and the preset matrix according to the first floating point number precision of matrix operations of the artificial intelligence chip and the operation precision required by the matrix to be solved, to obtain a first matrix corresponding to the inverse triangular matrix and a second matrix corresponding to the preset matrix; and inputting the first matrix and the second matrix into a matrix calculation unit to obtain a matrix multiplication result, wherein the matrix multiplication result represents the matrix to be solved. The application can significantly improve data processing efficiency.

Description

Data processing method, device, system and storage medium
Technical Field
The present application relates to the field of artificial intelligence chips, and in particular, to a data processing method, apparatus, system, and storage medium.
Background
Triangular matrix equations arise widely in application fields such as biological computing, signal processing, statistics and nuclear physics, and play an important role in the computation processes of these scientific fields. Since triangular linear equations appear in large numbers in these fields, the computational performance of solving triangular matrix equations has become one of the problems of greatest interest.
In recent years, with the continuous progress of technology and the continuous development of applications, data processing methods for triangular matrix equations based on artificial intelligence chips (AI chips) have been widely used. At present, triangular matrix equation solving based on AI chips mainly utilizes the vector computing capability of the AI chip, but this solving method cannot fully utilize the computing capability of the AI chip, so the solving efficiency of triangular matrix equations is low. Therefore, how to fully utilize the computing capability of AI chips to solve triangular matrix equations has become a technical problem to be solved urgently.
Disclosure of Invention
The present application aims to solve at least one of the technical problems existing in the prior art. Therefore, the application provides a data processing method, device, system and storage medium, which can fully utilize the matrix computing capability of an artificial intelligence chip to solve the matrix, thereby remarkably improving data processing efficiency.
The data processing method according to the embodiment of the first aspect of the present application is applied to an artificial intelligence chip that includes a matrix calculation unit, and comprises:
obtaining a matrix to be solved, wherein the matrix to be solved, a triangular matrix and a preset matrix are constructed into a triangular matrix equation, the matrix to be solved and the triangular matrix are positioned on one side of the equation, and the preset matrix is positioned on the other side of the equation;
performing inverse transformation on the triangular matrix to obtain an inverse triangular matrix;
according to the first floating point number precision of matrix operations of the artificial intelligence chip and the operation precision required by the matrix to be solved, performing precision processing on the inverse triangular matrix and the preset matrix to obtain a first matrix corresponding to the inverse triangular matrix and a second matrix corresponding to the preset matrix;
and inputting the first matrix and the second matrix into the matrix calculation unit to obtain a matrix multiplication result, wherein the matrix multiplication result is used for representing the matrix to be solved.
The data processing method according to the embodiment of the application has at least the following beneficial effects: firstly, a matrix to be solved is obtained, wherein the matrix to be solved, a triangular matrix and a preset matrix are constructed into a triangular matrix equation; secondly, inverse transformation is performed on the triangular matrix to obtain an inverse triangular matrix; then, according to the first floating point number precision of matrix operations of the artificial intelligence chip and the operation precision required by the matrix to be solved, precision processing is performed on the inverse triangular matrix and the preset matrix to obtain a first matrix corresponding to the inverse triangular matrix and a second matrix corresponding to the preset matrix; finally, the first matrix and the second matrix are input into a matrix calculation unit to obtain a matrix multiplication result, wherein the matrix multiplication result represents the matrix to be solved. In this data processing method, the triangular matrix in the triangular matrix equation is inverted to obtain the inverse triangular matrix, so that the row-by-row vector solving problem represented by the triangular matrix equation is converted into a matrix multiplication problem that can be computed with the matrix computing capability of the artificial intelligence chip, finally yielding the matrix to be solved as represented by the matrix multiplication result. The matrix computing capability of an artificial intelligence chip is far stronger than its vector computing capability; on this basis, performing matrix computation greatly improves the utilization efficiency of the artificial intelligence chip, so the efficiency of data processing based on triangular matrix equation solving can be remarkably improved.
Therefore, the data processing method can fully utilize the matrix computing capability of the artificial intelligence chip to solve the matrix, thereby remarkably improving data processing efficiency.
According to some embodiments of the application, before the obtaining the matrix to be solved, the method comprises:
setting the preset matrix and the triangular matrix according to the solving problem;
and constructing a triangular matrix equation from the matrix to be solved, the triangular matrix, the preset matrix and a preset constant, wherein the multiplied matrix to be solved and triangular matrix are positioned on one side of the equation, and the multiplied preset matrix and preset constant are positioned on the other side of the equation.
According to some embodiments of the application, the data processing method further comprises:
determining an initial matrix to be decomposed;
dividing the initial matrix into submatrices with equal length and width;
starting from the first column from the left and the first row from the top, the submatrices on the diagonal line are sequentially subjected to LU decomposition, panel decomposition and triangular matrix solving to obtain the target matrix.
According to some embodiments of the present application, the sequentially performing LU decomposition, panel decomposition and triangular matrix solving on the submatrices on the diagonal, starting from the first column from the left and the first row from the top, to obtain the target matrix includes:
Step 400, performing LU decomposition on the A11 submatrix in the first column from the left and the first row from the top to obtain an L11 matrix and a U11 matrix;
step 410, combining the U11 matrix, performing panel decomposition on all submatrices in the same column as the A11 submatrix to obtain L21 to Lm1 matrices, where m represents the number of rows of the initial matrix and these submatrices do not include the A11 submatrix;
step 420, combining the L11 matrix, performing triangular matrix solving on all submatrices in the same row as the A11 submatrix to obtain U12 to U1n matrices, where n represents the number of columns of the initial matrix;
step 430, starting from the A22 submatrix in the second column from the left and the second row from the top, performing a gemm update on the submatrices other than those in the first row and the first column;
step 440, repeating the calculation manners of step 400 to step 430 based on the updated A22 submatrix until the Lmm matrix and Unn matrix of the Amn submatrix are obtained, where m is equal to n.
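Steps 400 to 440 can be sketched as an unpivoted blocked LU decomposition in NumPy. This is an illustrative sketch only: it omits pivoting, the panel and row updates use explicit triangular inversion in the spirit of the inversion-based method described in this application, and the block size and matrix values are assumptions.

```python
import numpy as np

def blocked_lu(A, nb):
    """Unpivoted blocked LU sketch of steps 400-440: per diagonal block,
    LU-factor it, panel-update the column below (L21..Lm1), triangular-solve
    the row to the right (U12..U1n), then gemm-update the trailing matrix."""
    A = A.copy()
    n = A.shape[0]
    for j in range(0, n, nb):
        e = min(j + nb, n)
        # step 400: LU-decompose the diagonal block into L11 (unit lower) and U11
        for k in range(j, e - 1):
            A[k+1:e, k] /= A[k, k]
            A[k+1:e, k+1:e] -= np.outer(A[k+1:e, k], A[k, k+1:e])
        L11 = np.tril(A[j:e, j:e], -1) + np.eye(e - j)
        U11 = np.triu(A[j:e, j:e])
        # step 410: panel decomposition of the blocks below the diagonal block
        A[e:, j:e] = A[e:, j:e] @ np.linalg.inv(U11)
        # step 420: triangular solve for the blocks to the right
        A[j:e, e:] = np.linalg.inv(L11) @ A[j:e, e:]
        # step 430: gemm update of the trailing submatrix
        A[e:, e:] -= A[e:, j:e] @ A[j:e, e:]
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return L, U

rng = np.random.default_rng(4)
M = rng.standard_normal((8, 8)) + 8 * np.eye(8)  # diagonally dominant: safe without pivoting
L, U = blocked_lu(M, nb=4)
```

After the loop finishes, `L @ U` reproduces the original matrix, confirming that the per-block panel, trsm and gemm updates together compute the full factorization.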
According to some embodiments of the present application, the performing precision processing on the inverse triangular matrix and the preset matrix according to the first floating point number precision of matrix operations of the artificial intelligence chip and the operation precision required by the matrix to be solved, to obtain a first matrix corresponding to the inverse triangular matrix and a second matrix corresponding to the preset matrix, includes:
when the first floating point number precision is inconsistent with the second floating point number precision of the matrix to be solved, performing numerical scaling on the inverse triangular matrix and the preset matrix to obtain the first matrix corresponding to the inverse triangular matrix and the second matrix corresponding to the preset matrix, wherein a first numerical range of the first matrix and a second numerical range of the second matrix are within the range representable by the first floating point number precision.
According to some embodiments of the present application, the performing numerical scaling on the inverse triangular matrix and the preset matrix to obtain the first matrix corresponding to the inverse triangular matrix and the second matrix corresponding to the preset matrix includes:
determining a first numerical range and third floating point number precision of the inverse triangular matrix, and determining a second numerical range and fourth floating point number precision of the preset matrix;
determining a first scaling preset constant of the inverse triangular matrix according to the first numerical range, the third floating point number precision and the first floating point number precision;
determining a second scaling preset constant of the preset matrix according to the second numerical range, the fourth floating point number precision and the first floating point number precision;
and performing precision conversion on the inverse triangular matrix according to the first scaling preset constant to obtain the first matrix, and performing precision conversion on the preset matrix according to the second scaling preset constant to obtain the second matrix.
According to some embodiments of the application, the performing precision conversion on the inverse triangular matrix according to the first scaling preset constant to obtain the first matrix, and performing precision conversion on the preset matrix according to the second scaling preset constant to obtain the second matrix includes:
multiplying the first scaling preset constant by the inverse triangular matrix to obtain a first product, and performing precision conversion on the first product according to the precision of the first floating point number to obtain the first matrix;
multiplying the second scaling preset constant by the preset matrix to obtain a second product, and performing precision conversion on the second product according to the precision of the first floating point number to obtain the second matrix.
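The scaling and precision-conversion steps above can be sketched numerically. The sketch below assumes FP16 as the chip's first floating point number precision, and picks a power-of-two scaling preset constant from each matrix's numerical range; this particular constant-selection rule is an illustrative assumption, not the patent's prescribed formula.

```python
import numpy as np

def scale_to_fp16(M):
    """Scale M by a power-of-two constant so its values fit the FP16 range,
    then cast; return the low-precision matrix and the scaling constant."""
    fp16_max = float(np.finfo(np.float16).max)   # range representable at the (assumed) chip precision
    peak = float(np.max(np.abs(M)))
    c = 1.0 if peak == 0 else 2.0 ** np.floor(np.log2(fp16_max / (2 * peak)))
    return (c * M).astype(np.float16), c

rng = np.random.default_rng(3)
A_inv = np.tril(rng.standard_normal((4, 4)))   # stands in for the inverse triangular matrix (FP64)
B = 1e3 * rng.standard_normal((4, 3))          # stands in for the preset matrix, wider value range

M1, c1 = scale_to_fp16(A_inv)   # first matrix and first scaling preset constant
M2, c2 = scale_to_fp16(B)       # second matrix and second scaling preset constant

# The matrix unit multiplies the low-precision operands; dividing by the
# product of the scaling constants recovers the result at the original scale.
X = (M1.astype(np.float64) @ M2.astype(np.float64)) / (c1 * c2)
rel_err = np.linalg.norm(X - A_inv @ B) / np.linalg.norm(A_inv @ B)
```

Choosing a power of two keeps the scaling exact in floating point, so only the FP16 cast itself introduces rounding error.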
A data processing apparatus according to an embodiment of a second aspect of the present application includes:
the acquisition module is used for acquiring a matrix to be solved, wherein the matrix to be solved, the triangular matrix and a preset matrix are constructed into a triangular matrix equation, the matrix to be solved and the triangular matrix are positioned on one side of the equation, and the preset matrix is positioned on the other side of the equation;
the inversion module is used for performing inverse transformation on the triangular matrix to obtain an inverse triangular matrix;
the precision processing module is used for performing precision processing on the inverse triangular matrix and the preset matrix according to the first floating point number precision of matrix operations of the artificial intelligence chip and the operation precision required by the matrix to be solved, to obtain a first matrix corresponding to the inverse triangular matrix and a second matrix corresponding to the preset matrix;
and the solving module is used for inputting the first matrix and the second matrix into a matrix calculation unit of the artificial intelligence chip to obtain a matrix multiplication result, wherein the matrix multiplication result represents the matrix to be solved.
An embodiment of a data processing system according to a third aspect of the present application comprises:
at least one memory;
at least one processor;
at least one program;
the program is stored in the memory, and the processor executes at least one of the programs to implement the data processing method according to the embodiment of the first aspect.
A computer-readable storage medium according to an embodiment of a fourth aspect of the present application stores computer-executable instructions for causing a computer to perform the data processing method according to the embodiment of the first aspect.
Additional aspects and advantages of the application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
Drawings
The application is further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flow chart of a data processing method according to an embodiment of the present application;
FIG. 2 is a matrix diagram of a triangular matrix equation solving process according to one embodiment of the present application;
FIG. 3 is a matrix diagram of a triangular matrix equation solving process according to another embodiment of the present application;
FIG. 4 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a data processing system according to an embodiment of the present application.
Reference numerals:
the acquisition module 100, the inversion module 110, the precision processing module 120, the solving module 130, the memory 200 and the processor 300.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the application.
It should be noted that although functional modules are divided in the schematic diagrams and a logical sequence is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from that shown. The terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and are not necessarily used to describe a particular sequence or chronological order.
In the description of the present application, "several" means one or more and "a plurality" means two or more; "greater than", "less than", "exceeding", etc. are understood to exclude the stated number, while "above", "below", "within", etc. are understood to include the stated number. The descriptions "first" and "second" are only for distinguishing technical features and should not be construed as indicating or implying relative importance, implicitly indicating the number of technical features indicated, or implicitly indicating the precedence of the technical features indicated.
In the description of the present application, unless explicitly defined otherwise, terms such as arrangement, installation, connection, etc. should be construed broadly and the specific meaning of the terms in the present application can be reasonably determined by a person skilled in the art in combination with the specific contents of the technical scheme.
In the description of the present application, the descriptions of the terms "one embodiment," "some embodiments," "illustrative embodiments," "examples," "specific examples," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Explanation of terms:
Triangular matrix equation: triangular matrix equations are a class of mathematical equations that exist widely in different fields; their mathematical form is op(A)×X = α×B or X×op(A) = α×B, where the matrix A is an upper triangular or lower triangular matrix, op(A) denotes A itself or its transpose, B is a given matrix (also called the preset matrix), the matrices A and B are set according to the actual problem to be solved, α is a preset constant, and X is the matrix to be solved.
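As a concrete illustration of the form op(A)×X = α×B defined above, the following NumPy sketch builds a small lower-triangular system with op(A) = A itself and solves for X; the sizes and values are illustrative assumptions.

```python
import numpy as np

# A small instance of op(A) @ X = alpha * B with op(A) = A (lower triangular).
rng = np.random.default_rng(0)
n, k = 4, 3
A = np.tril(rng.standard_normal((n, n))) + n * np.eye(n)  # lower triangular, well conditioned
B = rng.standard_normal((n, k))                           # the preset matrix B
alpha = 2.0                                               # the preset constant alpha
X = np.linalg.solve(A, alpha * B)                         # the matrix to be solved X
```

Substituting X back into A @ X reproduces alpha * B, which is exactly the equation's defining relation.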
Artificial Intelligence (AI) chip:
With the continuous progress of technology and the continuous development of applications, massively heterogeneous parallel computing systems are gradually becoming a trend in the development of today's and future supercomputers. This type of computing system is widely used in various fields such as scientific computing and data analysis, with the advantages of high performance, high energy efficiency and low cost. Heterogeneous computing architectures adopt different types of processors/accelerators, such as GPUs, TPUs and NPUs, and achieve higher computing performance through efficient parallelization techniques. In particular, artificial intelligence has developed rapidly in recent years; such applications often involve a large number of matrix computations, and various hardware vendors have begun to incorporate new hardware units into accelerators, enabling artificial intelligence chips to provide higher-performance matrix computing capabilities.
Based on this, an AI chip may provide matrix compute at one or more precisions; for example, the NVIDIA A100 chip provides matrix compute at five precisions: FP64, TF32, BFLOAT16, FP16 and INT8. The Ascend chip currently provides matrix computing power at FP16 and INT8 precision. While some AI chips can provide high-precision FP64 matrix computing power, its peak performance is much lower than the peak performance of low-precision matrix compute.
GPU:
GPU is an abbreviation for Graphics Processing Unit, a hardware device specifically designed for parallel computation. GPUs were originally designed for graphics rendering, but due to their powerful parallel computing capabilities they are increasingly applied in wider computing fields such as machine learning, data science and scientific computing. GPUs have more cores and threads than CPUs and are better able to handle parallel processing of large amounts of data, and are therefore widely used in many computing fields.
A Tensor Core is a hardware unit in the GPU specifically designed for deep learning. It can perform efficient matrix multiplication operations and increase computation speed, which is particularly important for deep learning tasks. Compared with an ordinary CUDA core, the Tensor Core implements more efficient matrix multiplication, so neural network training can be faster. Tensor Cores are widely used by many deep learning frameworks such as TensorFlow and PyTorch.
Ascend AI processor (Ascend chip):
The computing core of the Ascend AI processor mainly consists of the AI Core, which, from the control perspective, can be seen as a relatively simplified version of a modern microprocessor's basic architecture. The AI Core architecture is essentially designed to accommodate common applications and algorithms in a particular field, and is commonly referred to as a "domain-specific architecture" (Domain Specific Architecture, DSA). It includes three basic computing resources: a matrix computing unit (Cube unit), a vector computing unit (Vector unit) and a scalar computing unit (Scalar unit). These three computing units form three independent execution pipelines and cooperate under the unified scheduling of system software to achieve optimal computing efficiency. In addition, the matrix computing unit and the vector computing unit internally provide computation modes of different precisions and types.
In recent years, artificial intelligence has rapidly developed, and artificial intelligence applications generally include a large amount of matrix computation, and various hardware manufacturers have also begun to add new hardware units into accelerators to provide matrix computation capabilities with higher performance.
In existing schemes, solving triangular matrix equations typically calls the BLAS library. BLAS is an application programming interface standard that normalizes numerical libraries of basic linear algebra operations, and it provides the interface for solving triangular matrix equations. It was first published in 1979 and is used to build larger numerical packages (e.g., LAPACK). To improve performance, software and hardware manufacturers optimize the BLAS interface for their products, such as Intel's MKL and AMD's ACML; GotoBLAS and ATLAS are versions optimized by non-hardware manufacturers, and cuBLAS is implemented with GPU computing technology. By selecting the corresponding linear algebra library on different computing hardware, the computing capability of the hardware can be utilized.
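The BLAS interface for triangular matrix equations mentioned above is the trsm routine. A minimal sketch using SciPy's BLAS wrapper follows; the array sizes and values are assumptions, and vendor libraries (MKL, cuBLAS, etc.) expose the same routine under their own bindings.

```python
import numpy as np
from scipy.linalg.blas import dtrsm  # BLAS DTRSM: triangular solve, multiple right-hand sides

# dtrsm's defaults solve A @ X = alpha * B from the left with A upper triangular.
rng = np.random.default_rng(1)
n, k = 5, 2
A = np.triu(rng.standard_normal((n, n))) + n * np.eye(n)  # upper triangular, well conditioned
B = rng.standard_normal((n, k))
alpha = 1.5
X = dtrsm(alpha, A, B)  # X solves A @ X = alpha * B
```

Internally such routines substitute row by row, which is the vector-style computation the application contrasts with its matrix-multiplication reformulation.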
Moreover, the method adopted in the prior art for solving triangular matrix equations solves them row by row, can only utilize the vector computing capability of the AI chip, and therefore has low performance.
The application provides a data processing method which can effectively utilize the matrix compute of various AI chips, significantly improve the solving efficiency of triangular matrix equations, and solve the problem that triangular matrix equations cannot be solved efficiently on AI chips.
A data processing method according to an embodiment of the present application is described below with reference to fig. 1 to 3.
It will be appreciated that, as shown in FIG. 1, there is provided a data processing method applied to an artificial intelligence chip including a matrix calculation unit, the method comprising:
step S100, a matrix to be solved is obtained, wherein the matrix to be solved, a triangular matrix and a preset matrix are constructed into a triangular matrix equation, the matrix to be solved and the triangular matrix are positioned on one side of the equation, and the preset matrix is positioned on the other side of the equation;
step S110, carrying out inverse transformation on the triangular matrix to obtain an inverse triangular matrix;
step S120, performing precision processing on the inverse triangular matrix and the preset matrix according to the first floating point number precision of matrix operations of the artificial intelligence chip and the operation precision required by the matrix to be solved, to obtain a first matrix corresponding to the inverse triangular matrix and a second matrix corresponding to the preset matrix;
step S130, inputting the first matrix and the second matrix into a matrix calculation unit to obtain a matrix multiplication result, wherein the matrix multiplication result is used for representing the matrix to be solved.
Firstly, a matrix to be solved is obtained, wherein the matrix to be solved, a triangular matrix and a preset matrix are constructed into a triangular matrix equation; secondly, inverse transformation is performed on the triangular matrix to obtain an inverse triangular matrix; then, according to the first floating point number precision of matrix operations of the artificial intelligence chip and the operation precision required by the matrix to be solved, precision processing is performed on the inverse triangular matrix and the preset matrix to obtain a first matrix corresponding to the inverse triangular matrix and a second matrix corresponding to the preset matrix; finally, the first matrix and the second matrix are input into a matrix calculation unit to obtain a matrix multiplication result, wherein the matrix multiplication result represents the matrix to be solved. In this data processing method, the triangular matrix in the triangular matrix equation is inverted to obtain the inverse triangular matrix, so that the row-by-row vector solving problem represented by the triangular matrix equation is converted into a matrix multiplication problem that can be computed with the matrix computing capability of the artificial intelligence chip, finally yielding the matrix to be solved as represented by the matrix multiplication result. The matrix computing capability of an artificial intelligence chip is far stronger than its vector computing capability; on this basis, performing matrix computation greatly improves the utilization efficiency of the artificial intelligence chip, so the efficiency of data processing based on triangular matrix equation solving can be remarkably improved. Therefore, the data processing method can fully utilize the matrix computing capability of the artificial intelligence chip to solve the matrix, thereby remarkably improving data processing efficiency.
It should be noted that the matrix to be solved is X and the triangular matrix is op(A). Inverting the matrix op(A) in the triangular matrix equation op(A)×X = α×B or X×op(A) = α×B gives the inverse matrix op(A)^(-1), which converts the original row-by-row vector solve of op(A)×X = α×B or X×op(A) = α×B into the matrix multiplication problem X = α×op(A)^(-1)×B or X = α×B×op(A)^(-1). The reason for converting the row-by-row vector solving problem of the triangular matrix equation into a matrix multiplication problem is that the matrix computing capability of an AI chip is far stronger than its vector computing capability, and the matrix multiplication problem can be computed using the matrix computing capability of the AI chip.
Note that X = α×op(A)^(-1)×B or X = α×B×op(A)^(-1) can also be converted into solving:
X^T = α×B^T×op(A)^(-T) or X^T = α×op(A)^(-T)×B^T, where T denotes the matrix transpose.
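The conversion described here can be sketched numerically: invert the triangular matrix once, and the solve collapses into a single matrix multiplication, which is what a chip's matrix unit accelerates. A minimal NumPy illustration with assumed sizes:

```python
import numpy as np

# Convert the triangular solve op(A) @ X = alpha * B into the matrix
# multiplication X = alpha * op(A)^(-1) @ B.
rng = np.random.default_rng(2)
n, k = 6, 4
A = np.tril(rng.standard_normal((n, n))) + n * np.eye(n)  # triangular op(A)
B = rng.standard_normal((n, k))
alpha = 0.5

A_inv = np.linalg.inv(A)   # inverse triangular matrix (itself lower triangular)
X = alpha * (A_inv @ B)    # one matmul instead of row-by-row vector solves

# The transposed form X^T = alpha * B^T @ op(A)^(-T) gives the same result.
X_t = alpha * (B.T @ A_inv.T)
```

Note that the inverse of a triangular matrix is itself triangular, so the inversion preserves the structure of op(A).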
It can be appreciated that before obtaining the matrix to be solved, it includes:
setting a preset matrix and a triangular matrix according to the solving problem;
and constructing and obtaining a triangular matrix equation through the matrix to be solved, the triangular matrix, the preset matrix and the preset constant, wherein the triangular matrix equation is an equation, the multiplied matrix to be solved and the triangular matrix are positioned on one side of the equation, and the multiplied preset matrix and the multiplied preset constant are positioned on the other side of the equation.
It should be noted that, the preset matrix is B, and the preset constant is α.
It can be appreciated that performing an inverse transformation on the triangular matrix to obtain an inverse triangular matrix includes:
performing inverse transformation on the triangular matrix to obtain an inverse triangular matrix;
and constructing and obtaining an equation to be solved through an inverse triangular matrix, a preset matrix and a preset constant in the triangular matrix equation, wherein the equation to be solved is an equation, the matrix to be solved is positioned on one side of the equation, and the multiplied inverse triangular matrix, the preset matrix and the preset constant are positioned on the other side of the equation.
The inverse triangular matrix is op(A)^(-1).
It will be appreciated that the data processing method further comprises:
determining an initial matrix to be decomposed;
dividing the initial matrix into submatrices with equal length and width;
starting from the first column from the left and the first row from the top, the submatrices on the diagonal line are sequentially subjected to LU decomposition, panel decomposition and triangular matrix solving to obtain the target matrix.
It can be understood that starting from the first column from the left and the first row from the top, LU decomposition, panel decomposition and triangular matrix solution are sequentially performed on the submatrices on the diagonal to obtain the target matrix, including:
Step 400, performing LU decomposition on the A11 submatrix in the first column from the left and the first row from the top to obtain an L11 matrix and a U11 matrix;
step 410, combining the U11 matrix, performing panel decomposition on all submatrices in the same column as the A11 submatrix to obtain L21 to Lm1 matrices, wherein m represents the number of submatrix rows of the initial matrix, and the submatrices in question do not include the A11 submatrix;
step 420, combining the L11 matrix, performing triangular matrix solving on all submatrices in the same row as the A11 submatrix to obtain U12 to U1n matrices, wherein n represents the number of submatrix columns of the initial matrix;
step 430, starting from the A22 submatrix in the second column from the left and the second row from the top, performing a gemm update on the submatrices other than those in the first row and the first column;
step 440, repeating the calculations of step 400 to step 430 on the updated A22 submatrix until the Lmm matrix and the Unn matrix of the Amn submatrix are obtained, where m is equal to n.
It should be noted that, in order to verify the practical effect of the data processing method based on the triangular matrix equation and the AI chip provided by the present application, the present application applies this technique in the HPL-AI benchmark test based on the Ascend AI chip, and the results show that the calculation performance of the HPL-AI program is significantly improved. Specifically:
Referring to FIG. 2, the algorithmic core of HPL-AI is LU decomposition. LU decomposition is a method of matrix decomposition, which can decompose a matrix into a product of a lower triangular matrix and an upper triangular matrix.
In fig. 2 (a), the initial matrix to be LU-decomposed is partitioned into small matrices, each NB in length and width;
in FIG. 2 (b), the small matrix in the upper left corner is decomposed separately into L11 and U11; panel decomposition (panel factorization) and triangular matrix solving (trsm update) are performed on the remaining small matrix blocks in the leftmost column and the topmost row respectively, and the remaining submatrices are updated (general matrix multiplication, gemm) to obtain the submatrix to be decomposed;
in fig. 2 (c) to 2 (e), the above steps convert the problem solved by the input matrix into a decomposition problem on the remaining submatrix, and subsequent iterations complete the solution of the whole problem by repeatedly applying the above three steps to the remaining submatrix. The problem solved by the input matrix can be understood as the matrix to be solved.
Further, a specific calculation case is provided:
step 1: performing LU decomposition on the small matrix block a11 to obtain L11 and U11, where a11=l11×u11;
step 2: taking U11 and a21 as inputs of the panel decomposition, obtaining L21, wherein a21=l21×u11+l22×0=l21×u11; in this step, a21 may be understood as a B matrix in the triangular matrix equation, and U11 may be understood as an a matrix in the triangular matrix equation.
Step 3: taking L11 and a12 as inputs of trsm solution, obtaining U12, wherein a12=l11×u12+0×u22=l11×u12; in this step, a12 may be understood as the B matrix in the triangular matrix equation, and L11 may be understood as the a matrix in the triangular matrix equation.
Step 4: taking L21, U12 and a22 as input of gemm, obtaining updated a22, wherein a22new=a22-l21×u12;
each execution of steps 1 to 4 decomposes the matrix blocks in the leftmost column and the topmost row once; after the update, steps 1 to 4 are repeatedly applied to the remaining lower-right submatrix until all matrix blocks are decomposed.
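The four steps above can be sketched as a blocked LU decomposition in NumPy (a hypothetical illustration of the scheme, not the patent's code; the function name and the no-pivoting assumption are ours, and explicit inverses stand in for the panel/trsm triangular solves):

```python
import numpy as np

def blocked_lu(A, nb):
    """Blocked LU without pivoting, following steps 1-4: factor the diagonal
    block, panel-update the column below, trsm the row to the right, then
    gemm-update the trailing submatrix."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # Step 1: unblocked LU of the diagonal block, A11 = L11 @ U11 (in place)
        for j in range(k, e):
            A[j+1:e, j] /= A[j, j]
            A[j+1:e, j+1:e] -= np.outer(A[j+1:e, j], A[j, j+1:e])
        L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
        U11 = np.triu(A[k:e, k:e])
        # Step 2 (panel): A21 = L21 @ U11  ->  L21 = A21 @ inv(U11)
        A[e:, k:e] = A[e:, k:e] @ np.linalg.inv(U11)
        # Step 3 (trsm):  A12 = L11 @ U12  ->  U12 = inv(L11) @ A12
        A[k:e, e:] = np.linalg.inv(L11) @ A[k:e, e:]
        # Step 4 (gemm):  A22 <- A22 - L21 @ U12
        A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return L, U

rng = np.random.default_rng(1)
M = rng.uniform(size=(8, 8)) + 8 * np.eye(8)  # diagonally dominant, so no pivoting needed
L, U = blocked_lu(M, nb=2)
assert np.allclose(L @ U, M)
```

In a production kernel, steps 2 and 3 would be performed as triangular solves (or, as in the application, as low-precision matrix multiplies against the pre-inverted triangular factors).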
The present application performs performance tests of the mixed-precision algorithm for solving the triangular matrix equation on the Ascend 910A AI chip; the theoretical peak half-precision matrix computing power of a single Ascend 910A card is known to be 256 TFLOPS. For input matrices of different sizes, the floating point computing performance of the triangular matrix equation solving can reach 119 to 217 TFLOPS. Experiments show that the technical scheme of the application can effectively utilize the performance advantages of the AI accelerator and improve the solving efficiency of the triangular matrix equation.
It can be understood that, according to the first floating point number precision of the matrix operation of the artificial intelligence chip and the operation precision required by the matrix to be solved, performing precision processing on the inverse triangular matrix and the preset matrix to obtain a first matrix corresponding to the inverse triangular matrix and a second matrix corresponding to the preset matrix includes:
When the first floating point number precision is inconsistent with the second floating point number precision of the matrix to be solved, performing numerical scaling on the inverse triangular matrix and the preset matrix to obtain a first matrix corresponding to the inverse triangular matrix and a second matrix corresponding to the preset matrix, wherein a first numerical range of the first matrix and a second numerical range of the second matrix are in a range which can be represented by the first floating point number precision.
It can be understood that performing numerical scaling on the inverse triangular matrix and the preset matrix to obtain a first matrix corresponding to the inverse triangular matrix and a second matrix corresponding to the preset matrix, including:
determining a first numerical range and third floating point number precision of an inverse triangular matrix, and determining a second numerical range and fourth floating point number precision of a preset matrix;
determining a first scaling preset constant of the inverse triangular matrix according to the first numerical range, the third floating point number precision and the first floating point number precision;
determining a second scaling preset constant of the preset matrix according to the second numerical range, the fourth floating point number precision and the first floating point number precision;
and performing precision conversion on the inverse triangular matrix according to the first scaling preset constant to obtain a first matrix, and performing precision conversion on the preset matrix according to the second scaling preset constant to obtain a second matrix.
It can be understood that performing precision conversion on the inverse triangular matrix according to a first scaling preset constant to obtain a first matrix, and performing precision conversion on the preset matrix according to a second scaling preset constant to obtain a second matrix, including:
multiplying a first scaling preset constant by an inverse triangular matrix to obtain a first product, and performing precision conversion on the first product according to the precision of the first floating point number to obtain a first matrix;
and multiplying the second scaling preset constant by a preset matrix to obtain a second product, and performing precision conversion on the second product according to the precision of the first floating point number to obtain a second matrix.
It should be noted that if the floating point number precision of the initial problem (which can be understood as the floating point number precision of the matrix to be solved) is inconsistent with the precision supported by the matrix computing power of the AI chip, numerical scaling and precision conversion need to be performed on op(A)^(-1) and B.
Specifically, suppose op(A) and B are FP32 single precision, while some AI chips support only FP16 half-precision matrix multiplication. The exponent bit lengths of the two precisions are not the same, and the representation range of FP32 is larger than that of FP16, i.e., some numbers representable in FP32 cannot be represented in FP16, so numerical scaling is required. The scaling process multiplies the op(A)^(-1) and B matrices by scaling factors s1 and s2 respectively, obtaining s1·op(A)^(-1) and s2·B. The scaling factors s1 and s2 are determined according to the numerical ranges of the op(A)^(-1) and B matrices, the floating point number precision of the op(A)^(-1) and B matrices, and the floating point number precision of the matrix computing power of the AI chip, so as to avoid floating point underflow and floating point overflow of the values after precision conversion. After scaling is completed, precision conversion is performed: s1·op(A)^(-1) and s2·B are converted to the low precision supported by the matrix computing power of the AI chip, giving (s1·op(A)^(-1))_low and (s2·B)_low respectively. The matrix computing unit of the AI chip then computes the matrix multiplication result (s1·op(A)^(-1))_low × (s2·B)_low or (s2·B)_low × (s1·op(A)^(-1))_low. If precision recovery is required, (s1·op(A)^(-1))_low × (s2·B)_low or (s2·B)_low × (s1·op(A)^(-1))_low is converted back to the floating point number precision of the original problem and the scaling is restored, and finally the matrix to be solved is obtained by calculation: X = (s1^(-1) × s2^(-1) × α) × (s1·op(A)^(-1))_low × (s2·B)_low, or X = (s1^(-1) × s2^(-1) × α) × (s2·B)_low × (s1·op(A)^(-1))_low.
The reasons for matrix scaling and the choice of scaling factors are further described below.
First, it is necessary to clarify the floating point number representation of a computer, the representation range of each precision of a floating point number, and the meaning of floating point overflow.
The floating point number representation mode of the computer comprises: half precision (FP 16), single precision (FP 32), double precision (FP 64).
Double precision (FP64) is laid out as: 1-bit sign, 11-bit exponent, 52-bit mantissa.
Single precision (FP32) is laid out as: 1-bit sign, 8-bit exponent, 23-bit mantissa.
Half precision (FP16) is laid out as: 1-bit sign, 5-bit exponent, 10-bit mantissa.
In the single-precision 32-bit format, 1 bit is used to indicate whether the number is positive or negative. The exponent is reserved 8 bits; because the representation is binary, it is a power of 2. The remaining 23 bits are used to represent the digits that make up the number, referred to as the significand.
Under double precision, the exponent is reserved 11 bits and the significand is 52 bits, greatly expanding the range and magnitude of the numbers it can represent. Half precision has a smaller representation range: the exponent is only 5 bits and the significand only 10 bits.
The half-precision format is similar to the single-precision format: the leftmost bit is still the sign bit, the exponent is 5 bits wide and stored in excess-15 (bias-15) form, and the mantissa is 10 bits wide but has an implicit leading 1.
Sign is the sign bit: 0 indicates the floating point number is positive, and 1 indicates it is negative.
Fraction is the mantissa, 10 bits long, with a hidden leading 1. The mantissa can be understood as the digits after the binary point of the floating point number: for example, for 1.11, the stored mantissa is 1100000000 (with the hidden leading 1 in front). The hidden 1 mainly participates in calculation and may produce a carry.
Exponent is the exponent field, 5 bits long; the specific values it expresses are as follows:
When the exponent bits are all 0 s and the mantissa bits are also all 0 s, a 0 is indicated.
When the exponent bits are all 0 and the mantissa bits are not all 0, it represents a denormal (subnormal) floating point number, a very small number.
When the exponent bits are all 1 and the mantissa bits are all 0, infinity is indicated, and at this time, if the sign bit is 0, positive infinity is indicated, and the sign bit is 1, negative infinity is indicated.
When the exponent bits are all 1 and the mantissa bits are not all 0, it represents NaN (not a number).
In the remaining cases, the value of the exponent bits minus 15 is the exponent it represents; for example, 11110 represents 30 - 15 = 15.
Therefore, the value of a half-precision floating point number can be calculated as (-1)^sign × 2^(exponent bits - 15) × (1 + 0.mantissa bits), where 0.mantissa bits means, for example, for mantissa bits 0001110001, the binary fraction 0.0001110001.
Maximum value that half precision can represent:
0 11110 1111111111 is calculated as: (-1)^0 × 2^(30 - 15) × 1.1111111111 = 1.1111111111(b) × 2^15 = 1.9990234375(d) × 2^15 = 65504.
Minimum value that half precision can represent (excluding subnormal values):
0 00001 0000000000 is calculated as: (-1)^0 × 2^(1 - 15) = 2^(-14), approximately equal to the decimal 6.104 × 10^(-5).
Conversely, take -1.5625 × 10^(-1) as an example: -0.15625 = -0.00101 (decimal to binary) = -1.01 × 2^(-3), so the sign bit is 1, the exponent is -3 + 15 = 12, so the exponent bits are 01100, and the mantissa bits are 0100000000. Therefore, -1.5625 × 10^(-1) is represented as the half-precision floating point number 1 01100 0100000000.
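The bit-level walk-through above can be checked mechanically (an illustrative sketch; the helper `fp16_bits` is ours, not part of the application):

```python
import numpy as np

def fp16_bits(x):
    """Return the 16 raw bits of a half-precision float as 'sign exponent mantissa'."""
    u = int(np.array(x, dtype=np.float16).view(np.uint16))  # reinterpret the 2 bytes
    b = f"{u:016b}"
    return f"{b[0]} {b[1:6]} {b[6:]}"

assert fp16_bits(-0.15625) == "1 01100 0100000000"    # the worked example above
assert fp16_bits(65504.0) == "0 11110 1111111111"     # largest half-precision value
assert fp16_bits(2.0 ** -14) == "0 00001 0000000000"  # smallest normal value
```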
The representation range of each precision of the floating point number:
FP32 is a single-precision floating point number, using 8 bits for the exponent and 23 bits for the fraction, occupying 4 bytes;
FP16 is a half-precision floating point number, using 5 bits for the exponent and 10 bits for the fraction, occupying 2 bytes;
INT8 is an 8-bit integer occupying 1 byte. INT8 is a fixed-point type representing integer operations, typically quantized from floating point operations. In binary, a "0" or "1" is one bit, and INT8 means a number is represented with 8 bits. Therefore, although INT8 has lower precision than FP16, its data size is small, its energy consumption is low, and its calculation speed is relatively faster, which better suits the characteristics of edge-side computation;
Mixed precision: simply put, FP16 is used for multiplication and storage, and FP32 is used only for the addition (accumulation) operations, avoiding accumulation error;
The dynamic range of FP16 (6 × 10^(-8) to 65504) is much narrower than the dynamic range of FP32 (1.4 × 10^(-45) to 3.4 × 10^(38)); the precision of FP16 (2^(-10)) is far coarser than the precision of FP32 (2^(-23)).
As these representations show, FP32 and BF16 cover the same integer (exponent) range but differ in the fractional part, so rounding error exists; FP32 and FP16 cover different data ranges, and in large-value calculations FP16 runs the risk of overflow.
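Under the stated assumption that NumPy's `finfo` reflects the IEEE-754 FP16/FP32 formats, these ranges and precisions can be confirmed as follows (illustrative only):

```python
import numpy as np

fp16, fp32 = np.finfo(np.float16), np.finfo(np.float32)

assert fp16.max == 65504.0                 # upper bound of the FP16 dynamic range
assert abs(fp16.tiny - 6.104e-5) < 1e-7    # smallest normal FP16 value, ~6.104e-5
assert fp16.eps == 2.0 ** -10              # FP16 precision
assert fp32.eps == np.float32(2.0 ** -23)  # FP32 precision
assert np.isinf(np.float16(100000.0))      # overflow: too large for FP16
assert np.float16(1e-8) == 0.0             # underflow: flushed to zero in FP16
```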
The following is an explanation of why the int8 range is-128 to 127:
int8 takes 1 byte, 1 byte takes 8 bits;
wherein the most significant bit represents the sign bit: 1 for negative, 0 for positive;
the binary of the maximum value is then:
0 1 1 1 1 1 1 1;
converting to decimal, calculated from the low-order bit to the high-order bit:
0 1 1 1 1 1 1 1
=0*2^7+1*2^6+1*2^5+1*2^4+1*2^3+1*2^2+1*2^1+1*2^0
=0+64+32+16+8+4+2+1
=127;
the binary of the minimum value is the opposite pattern of the maximum value:
1 0 0 0 0 0 0 0;
converting to decimal: in two's complement, the sign (high) bit carries the weight -2^7, so:
1 0 0 0 0 0 0 0
=1*(-2^7)+0*2^6+0*2^5+0*2^4+0*2^3+0*2^2+0*2^1+0*2^0
=-128+0+0+0+0+0+0+0
=-128.
there is also an easily understood explanation:
int8 is 1 byte, i.e. 8 binary bits (bits);
2. each binary bit can store two numbers of 0 and 1, and there are 2^8 =256 combinations of 8 binary bits (256 numbers can be stored);
int8 is signed, so positive and negative numbers will bisect 256 numbers, 256/2=128;
4. there are 128 negative numbers, so the minimum negative value is -128;
5. there are 128 non-negative numbers; since 0 is one of them, the maximum value is +127;
if uint8 (8-bit unsigned, no negatives), there are 2^8 = 256 values; 0 is one of them, so the maximum is 255.
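A quick check of these ranges (an illustrative sketch, not part of the application):

```python
import numpy as np

i8, u8 = np.iinfo(np.int8), np.iinfo(np.uint8)
assert (i8.min, i8.max) == (-128, 127)  # signed: two's complement splits 256 values
assert (u8.min, u8.max) == (0, 255)     # unsigned: all 256 values are non-negative

# The two's-complement bit pattern 1000 0000 is -128:
assert np.uint8(0b10000000).astype(np.int8) == np.int8(-128)
```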
Based on the above description of the representation range of each floating point precision, each floating point precision has its own dynamic range: if the absolute value of a number exceeds the upper bound of the dynamic range, floating point overflow occurs; if the absolute value is below the lower bound of the dynamic range, floating point underflow occurs.
The meaning of floating point overflow is explained below.
Overflow of fixed point number/floating point number is divided into: overflow and underflow.
Arithmetic overflow (arithmetic overflow) refers to a computer performing an arithmetic operation that produces results that are beyond what the machine can represent.
Overflow is further divided into overflow and underflow, and the concepts of overflow and underflow are not exactly the same in fixed-point and floating-point computers.
In a fixed point computer:
exceeding the representable range of the number in the positive direction is called overflow; exceeding the representable range of the number in the negative direction is called underflow.
in a floating point computer:
the representation range of floating point numbers is mainly determined by the exponent code.
Regardless of whether the sign of the number is positive or negative, if the exponent code exceeds its representable range in the positive direction, it is called overflow;
if the exponent code exceeds its representable range in the negative direction, or the mantissa is "0", it is collectively called underflow.
In general, if a floating point number underflows, the computer automatically treats it as "0" and outputs no error information (a loss of precision);
if a floating point number overflows, the computer generates an "overflow interrupt", outputs overflow error information, and may even stop the program from running.
Simple overflow judgment rule:
overflow may occur only when two numbers of the same sign are added or two numbers of different signs are subtracted. For example, when two positive numbers are added, the sign bit of the result may become 1 (the result appears negative); when a positive number is subtracted from a negative number, the sign bit of the result may become 0 (the result appears positive). When overflow occurs in fixed-point addition or subtraction, the operation result is wrong.
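The judgment rule can be illustrated with a small simulated 8-bit adder (a hypothetical sketch; `add_int8` is our own helper, not part of the application):

```python
def add_int8(a, b):
    """Wrap-around 8-bit two's-complement addition, plus an overflow flag."""
    r = (a + b) & 0xFF               # keep the low 8 bits
    r = r - 256 if r >= 128 else r   # reinterpret as signed int8
    # Overflow occurs iff the operands share a sign and the result's sign differs:
    overflow = ((a >= 0) == (b >= 0)) and ((r >= 0) != (a >= 0))
    return r, overflow

assert add_int8(100, 100) == (-56, True)    # positive + positive -> "negative": overflow
assert add_int8(-100, -100) == (56, True)   # negative + negative -> "positive": overflow
assert add_int8(100, -50) == (50, False)    # opposite signs added: no overflow possible
```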
It should be noted that, the exponent bit lengths of the respective accuracies are not uniform, for example, the representation range of FP32 is larger than the representation range of FP16, that is, the number that part of FP32 can represent cannot be represented by FP16, and it is necessary to perform numerical scaling. Numerical scaling is used to avoid floating point overflows. The scaling factor is selected in relation to the floating point number accuracy of the matrix, the floating point range of the matrix, and the floating point number accuracy of the AI chip matrix calculation power.
Secondly, taking the value of the scaling factor as an example:
assuming that matrix a uses FP32 single precision floating point representation, if the value of the a matrix is greater than the upper dynamic range bound 65504.0 of FP16, then FP16 precision may not be representative of the a matrix. The a matrix needs to be scaled by a factor such that the value of the a matrix is smaller than the upper dynamic range bound of FP 16. If the value of the A matrix is less than 5.96 x 10-8, the A matrix is scaled by multiplying a factor such that the value of the A matrix is greater than the lower dynamic range of FP 16. In particular, the description of "the value of the a matrix is greater than" is only a coarse-shallow criterion, since the a matrix contains many elements, and the selection of a specific scaling factor requires a specific problem-specific analysis.
It should be noted that, as shown in fig. 3, for the problem of solving the large-size triangular matrix equation, the solution and the matrix multiplication of the small-size triangular matrix equation can be converted by using the block technique.
It should be noted that, in application scenarios with high precision requirements, a matrix multiplication precision repair method, for example the precision restoration method described above, may be introduced on the basis of the above method: if precision restoration is required, (s1·op(A)^(-1))_low × (s2·B)_low or (s2·B)_low × (s1·op(A)^(-1))_low is converted to the floating point number precision of the original problem and the scaling is restored, and finally the matrix to be solved is obtained by calculation: X = (s1^(-1) × s2^(-1) × α) × (s1·op(A)^(-1))_low × (s2·B)_low, or X = (s1^(-1) × s2^(-1) × α) × (s2·B)_low × (s1·op(A)^(-1))_low.
It should be noted that, if the accuracy of the AI chip matrix calculation force support is lower than the floating point number accuracy of the matrix to be solved, converting the triangular matrix and the preset matrix into the low accuracy of the AI chip matrix calculation force support may have two effects on the calculation result:
1. if the numerical range of the matrix cannot be represented at low precision, i.e., numerical overflow occurs (including overflow and underflow), an erroneous calculation result is obtained; the solution is to scale the values of the matrix into a range representable at low precision using the numerical scaling method described above;
2. converting the matrix to low precision introduces a small error in the calculation result. If the precision requirement is not high, no modification is needed; if the error cannot be tolerated, the precision repair scheme described above can be used.
It should be noted that the method for determining the scaling factor is as follows:
knowing the floating point number precision of the triangular matrix equation, the numerical range which can be expressed by the floating point number precision can be determined according to the floating point number expression mode of the computer and the expression range of the floating point number precision;
Knowing the floating point number precision supported by the AI chip matrix calculation force, the numerical range which can be expressed by the floating point number precision can be determined according to the floating point number representation mode of the computer and the representation range of the floating point number precision, and the upper bound and the lower bound of the range are mainly determined;
Knowing op(A)^(-1) and B in the triangular matrix equation, judge whether numerical overflow would occur when op(A)^(-1) and B are converted to the precision supported by the matrix computing power of the AI chip. If op(A)^(-1) is outside the low-precision representation range, select a scaling preset constant s1 such that the values of s1·op(A)^(-1) do not exceed the low-precision representation range; if B is outside the low-precision representation range, select a scaling preset constant s2 such that the values of s2·B do not exceed the low-precision representation range.
Specific examples:
In the problem to be solved X = α × op(A)^(-1) × B, the matrices are represented in single precision, while the AI chip supports only half-precision matrix multiplication. Suppose each element of the op(A)^(-1) matrix of this problem is a random number between -1.0 × 10^(-8) and 1.0 × 10^(-8), and each element of the B matrix is a random number between 10000.0 and 100000.0.
The lower bound of the half-precision floating point representation range is known to be 5.96 × 10^(-8). Each element of the op(A)^(-1) matrix is a random number between -1.0 × 10^(-8) and 1.0 × 10^(-8), which cannot be represented in half precision: direct conversion to half precision yields 0. A scaling preset constant s1 can be selected here, such as 1.0 × 10^8, so that the values of s1·op(A)^(-1) range from -1.0 to 1.0 and can be fully represented in half precision.
The upper bound of the half-precision floating point representation range is 65504. Each element of the B matrix is a random number between 10000.0 and 100000.0, and the part exceeding the upper bound cannot be represented in half precision. A scaling preset constant s2 can be selected here, such as 1.0 × 10^(-4), so that the values of s2·B range from 1.0 to 10.0 and can be fully represented in half precision.
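This example can be reproduced end to end in NumPy (an illustrative sketch with an assumed matrix size and random seed; the FP16 matrix multiply stands in for the AI chip's matrix computing unit):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 64
alpha = np.float32(1.0)

# FP32 inputs matching the example above: op(A)^(-1) tiny, B huge.
A_inv = rng.uniform(-1e-8, 1e-8, (n, n)).astype(np.float32)
B = rng.uniform(1e4, 1e5, (n, n)).astype(np.float32)

# Direct conversion to FP16 fails in both directions:
assert np.all(A_inv.astype(np.float16) == 0.0)  # underflow: every element flushes to 0
assert np.any(np.isinf(B.astype(np.float16)))   # overflow: elements above 65504 -> inf

s1, s2 = np.float32(1e8), np.float32(1e-4)      # scaling preset constants from the text
A_low = (s1 * A_inv).astype(np.float16)         # values now in [-1.0, 1.0]
B_low = (s2 * B).astype(np.float16)             # values now in [1.0, 10.0]

# Low-precision multiply, then restore scale and precision:
# X = (s1^-1 * s2^-1 * alpha) * (s1*A_inv)_low @ (s2*B)_low
X = (alpha / (s1 * s2)) * (A_low @ B_low).astype(np.float32)

X_ref = alpha * (A_inv.astype(np.float64) @ B.astype(np.float64))
rel_err = np.abs(X - X_ref).max() / np.abs(X_ref).max()
assert rel_err < 5e-2  # half-precision rounding error, but no overflow/underflow
```

With the scaling in place, the half-precision product loses only rounding accuracy instead of collapsing to zeros and infinities.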
In the data processing method, the triangular matrix equation is preprocessed so that the equation-solving problem can be handled by the matrix computing unit of the artificial intelligence chip; that is, the triangular matrix equation solving problem is converted into a matrix multiplication problem through the matrix-inverse-transformation data processing manner, so that the matrix computing power of the artificial intelligence chip can be mobilized. Therefore, the essence of the application is how to obtain data meeting the requirements through data processing and to mobilize the matrix computing power of the artificial intelligence chip through that data; the application is not merely a scheme for solving the triangular matrix equation.
It will be appreciated that as shown in fig. 4, the present application also provides a data processing apparatus, including:
the obtaining module 100 is configured to obtain a matrix to be solved, wherein the matrix to be solved, a triangular matrix and a preset matrix construct the triangular matrix equation; the matrix to be solved and the triangular matrix are located on one side of the equation, and the preset matrix is located on the other side of the equation;
an inversion module 110, configured to perform inverse transformation on the triangular matrix to obtain an inverse triangular matrix;
the precision processing module 120 is configured to perform precision processing on the inverse triangular matrix and the preset matrix according to the first floating point number precision of the matrix operation of the artificial intelligent chip and the operation precision required by the matrix to be solved, so as to obtain a first matrix corresponding to the inverse triangular matrix and a second matrix corresponding to the preset matrix;
the solving module 130 is configured to input the first matrix and the second matrix to a matrix computing unit of the artificial intelligence chip, so as to obtain a matrix multiplication result, where the matrix multiplication result is used to represent a matrix to be solved.
With reference now to FIG. 5, a data processing system is depicted in accordance with an embodiment of the present application.
It will be appreciated that as shown in FIG. 5, a data processing system, comprising:
at least one memory 200;
At least one processor 300;
at least one program;
the programs are stored in the memory 200, and the processor 300 executes at least one program to implement the data processing method described above. Fig. 5 illustrates a processor 300.
The processor 300 and the memory 200 may be connected by a bus or other means, fig. 5 being an example of a connection via a bus.
Memory 200 serves as a non-transitory computer readable storage medium that may be used to store non-transitory software programs, non-transitory computer-executable programs, and signals, such as program instructions/signals corresponding to a data processing system in an embodiment of the present application. The processor 300 performs various functional applications and data processing, i.e., implements the data processing method of the above-described method embodiments, by running non-transitory software programs, instructions, and signals stored in the memory 200.
Memory 200 may include a storage program area that may store an operating system, at least one application program required for functions, and a storage data area; the storage data area may store related data of the above-described data processing method, and the like. In addition, memory 200 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 200 may optionally include memory located remotely from processor 300, which may be connected to the data processing system via a network. Examples of such networks include, but are not limited to, the internet of things, software defined networks, sensor networks, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
One or more signals are stored in memory 200 that, when executed by one or more processors 300, perform the data processing method of any of the method embodiments described above. For example, the method of fig. 1 described above is performed.
A computer-readable storage medium according to an embodiment of the present application is described below with reference to fig. 5.
As shown in fig. 5, the computer-readable storage medium stores computer-executable instructions that are executed by one or more processors 300, for example, by one of the processors 300 in fig. 5, which may cause the one or more processors 300 to perform the data processing method in the method embodiment described above. For example, the method of fig. 1 described above is performed.
The system embodiments described above are merely illustrative; units illustrated as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
From the description of the embodiments above, those skilled in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media and communication media. The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable signals, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and may include any information delivery media.
The embodiments of the present application have been described in detail with reference to the accompanying drawings, but the present application is not limited to the above embodiments, and various changes can be made within the knowledge of one of ordinary skill in the art without departing from the spirit of the present application. Furthermore, embodiments of the application and features of the embodiments may be combined with each other without conflict.

Claims (10)

1. The data processing method is characterized by being applied to an artificial intelligent chip, wherein the artificial intelligent chip comprises a matrix computing unit and comprises the following steps:
obtaining a matrix to be solved, wherein the matrix to be solved, a triangular matrix, and a preset matrix together form a triangular matrix equation, with the matrix to be solved and the triangular matrix on one side of the equation and the preset matrix on the other side;
performing inverse transformation on the triangular matrix to obtain an inverse triangular matrix;
performing precision processing on the inverse triangular matrix and the preset matrix according to the first floating-point precision of the matrix operation of the artificial intelligence chip and the operation precision required by the matrix to be solved, to obtain a first matrix corresponding to the inverse triangular matrix and a second matrix corresponding to the preset matrix; and
inputting the first matrix and the second matrix into the matrix computing unit to obtain a matrix multiplication result, wherein the matrix multiplication result represents the matrix to be solved.
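As an illustrative aside (not part of the claims): the flow of claim 1 can be sketched in NumPy, with float16 standing in for the chip's matrix-unit precision. The function name and the use of `numpy.linalg.inv` are assumptions made for the sketch, not the patented implementation, which would run on the chip's matrix computing unit.

```python
import numpy as np

def solve_triangular_by_matmul(U, B, matmul_dtype=np.float16):
    """Sketch of claim 1: solve X @ U = B for X by inverting the
    triangular matrix, casting both operands to the matrix unit's
    working precision (float16 here), and multiplying."""
    U_inv = np.linalg.inv(U)          # inverse triangular matrix
    first = U_inv.astype(matmul_dtype)   # "first matrix" after precision processing
    second = B.astype(matmul_dtype)      # "second matrix" after precision processing
    # The matrix multiplication result represents the matrix to be solved.
    return (second @ first).astype(np.float64)

rng = np.random.default_rng(0)
U = np.triu(rng.random((4, 4))) + 4 * np.eye(4)  # well-conditioned upper triangular
X_true = rng.random((4, 4))
B = X_true @ U                                   # preset matrix (right-hand side)
X = solve_triangular_by_matmul(U, B)
print(np.max(np.abs(X - X_true)))                # small residual, bounded by float16
```

The residual here is limited by float16 rounding; claims 5 to 7 address keeping the operands within the representable range of that lower precision.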
2. The data processing method according to claim 1, further comprising, before obtaining the matrix to be solved:
setting the preset matrix and the triangular matrix according to the problem to be solved; and
constructing the triangular matrix equation from the matrix to be solved, the triangular matrix, the preset matrix, and a preset constant, wherein the triangular matrix equation is an equation in which the product of the matrix to be solved and the triangular matrix is on one side and the product of the preset matrix and the preset constant is on the other side.
3. The data processing method according to claim 1, further comprising:
determining an initial matrix to be decomposed;
dividing the initial matrix into submatrices of equal width and height; and
starting from the submatrix in the first column from the left and the first row from the top, sequentially performing LU decomposition, panel decomposition, and triangular matrix solving on the submatrices along the diagonal to obtain a target matrix.
4. The data processing method according to claim 3, wherein sequentially performing LU decomposition, panel decomposition, and triangular matrix solving on the submatrices along the diagonal, starting from the first column from the left and the first row from the top, to obtain the target matrix comprises:
step 400: performing LU decomposition on the A11 submatrix in the first column from the left and the first row from the top to obtain an L11 matrix and a U11 matrix;
step 410: using the U11 matrix, performing panel decomposition on all submatrices in the same column as the A11 submatrix, excluding the A11 submatrix itself, to obtain L21 to Lm1 matrices, where m represents the number of submatrix rows of the initial matrix;
step 420: using the L11 matrix, performing triangular matrix solving on all submatrices in the same row as the A11 submatrix to obtain U12 to U1n matrices, where n represents the number of submatrix columns of the initial matrix;
step 430: starting from the A22 submatrix in the second column from the left and the second row from the top, performing a GEMM update on the submatrices outside the first row and the first column; and
step 440: repeating steps 400 to 430 on the updated A22 submatrix until the Lmm and Unn matrices of the Amm submatrix are obtained, where m equals n.
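As an illustrative aside: steps 400 to 430 describe a right-looking blocked LU factorization. A minimal NumPy sketch of that loop is shown below, assuming no pivoting (so the test matrix is made diagonally dominant); the function name and block layout are assumptions for the sketch, not the claimed implementation.

```python
import numpy as np

def blocked_lu(A, bs):
    """Right-looking blocked LU without pivoting, mirroring steps 400-440:
    factor the diagonal block (L11, U11), panel-decompose the column blocks
    (L21..Lm1), triangular-solve the row blocks (U12..U1n), GEMM-update the
    trailing submatrix, then repeat on the updated trailing block."""
    A = A.astype(np.float64).copy()
    n = A.shape[0]
    for k in range(0, n, bs):
        e = min(k + bs, n)
        # Step 400: unblocked LU of the diagonal block -> L11, U11
        for j in range(k, e):
            A[j + 1:e, j] /= A[j, j]
            A[j + 1:e, j + 1:e] -= np.outer(A[j + 1:e, j], A[j, j + 1:e])
        L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
        U11 = np.triu(A[k:e, k:e])
        # Step 410: panel decomposition, solve L21 @ U11 = A21 for L21..Lm1
        A[e:, k:e] = np.linalg.solve(U11.T, A[e:, k:e].T).T
        # Step 420: triangular solve, solve L11 @ U12 = A12 for U12..U1n
        A[k:e, e:] = np.linalg.solve(L11, A[k:e, e:])
        # Step 430: GEMM update of the trailing submatrix
        A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return np.tril(A, -1) + np.eye(n), np.triu(A)

M = np.random.default_rng(1).random((8, 8)) + 8 * np.eye(8)
L, U = blocked_lu(M, bs=4)
print(np.allclose(L @ U, M))  # True
```

The GEMM update of step 430 is exactly the dense matrix multiplication that the claimed matrix computing unit accelerates, which is why the factorization is organized around blocks.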
5. The data processing method according to claim 1, wherein performing precision processing on the inverse triangular matrix and the preset matrix according to the first floating-point precision of the matrix operation of the artificial intelligence chip and the operation precision required by the matrix to be solved, to obtain a first matrix corresponding to the inverse triangular matrix and a second matrix corresponding to the preset matrix, comprises:
when the first floating-point precision is inconsistent with the second floating-point precision of the matrix to be solved, numerically scaling the inverse triangular matrix and the preset matrix to obtain the first matrix corresponding to the inverse triangular matrix and the second matrix corresponding to the preset matrix, wherein a first numerical range of the first matrix and a second numerical range of the second matrix both lie within the range representable at the first floating-point precision.
6. The data processing method according to claim 5, wherein numerically scaling the inverse triangular matrix and the preset matrix to obtain the first matrix corresponding to the inverse triangular matrix and the second matrix corresponding to the preset matrix comprises:
determining a first numerical range and a third floating-point precision of the inverse triangular matrix, and determining a second numerical range and a fourth floating-point precision of the preset matrix;
determining a first preset scaling constant for the inverse triangular matrix according to the first numerical range, the third floating-point precision, and the first floating-point precision;
determining a second preset scaling constant for the preset matrix according to the second numerical range, the fourth floating-point precision, and the first floating-point precision; and
performing precision conversion on the inverse triangular matrix according to the first preset scaling constant to obtain the first matrix, and performing precision conversion on the preset matrix according to the second preset scaling constant to obtain the second matrix.
7. The data processing method according to claim 6, wherein performing precision conversion on the inverse triangular matrix according to the first preset scaling constant to obtain the first matrix, and performing precision conversion on the preset matrix according to the second preset scaling constant to obtain the second matrix, comprises:
multiplying the first preset scaling constant by the inverse triangular matrix to obtain a first product, and converting the first product to the first floating-point precision to obtain the first matrix; and
multiplying the second preset scaling constant by the preset matrix to obtain a second product, and converting the second product to the first floating-point precision to obtain the second matrix.
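As an illustrative aside: the scaling of claims 6 and 7 can be sketched as follows, with float16 standing in for the first floating-point precision. The choice of a power-of-two scaling constant is an assumption made for the sketch (it keeps the scaling itself exact apart from rounding); the claims only require that the scaled range be representable at the target precision.

```python
import numpy as np

def scale_and_cast(M, target_dtype=np.float16):
    """Sketch of claims 6-7: choose a scaling constant from the matrix's
    numerical range and the target precision's representable range, multiply
    by it, then convert precision. Returns the converted matrix and the
    constant needed to undo the scaling afterwards."""
    fmax = np.finfo(target_dtype).max    # largest finite value at target precision
    amax = np.max(np.abs(M))             # the matrix's numerical range
    # Power-of-two constant placing the scaled values safely below fmax.
    c = 2.0 ** np.floor(np.log2(0.5 * fmax / amax)) if amax > 0 else 1.0
    return (c * M).astype(target_dtype), c

A = np.array([[1.0e6, -3.0e5], [7.0e5, 2.0e6]])  # would overflow float16 directly
A16, c = scale_and_cast(A)
assert np.all(np.isfinite(A16))                  # no overflow after scaling
A_back = A16.astype(np.float64) / c              # undo the scaling
print(np.max(np.abs(A_back - A) / np.abs(A)))    # small relative error
```

Without the scaling step, casting `A` straight to float16 would produce infinities, since its entries exceed the roughly 6.5e4 maximum of that format.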
8. A data processing apparatus, comprising:
an acquisition module configured to acquire a matrix to be solved, wherein the matrix to be solved, a triangular matrix, and a preset matrix together form a triangular matrix equation, with the matrix to be solved and the triangular matrix on one side of the equation and the preset matrix on the other side;
an inversion module configured to perform inverse transformation on the triangular matrix to obtain an inverse triangular matrix;
a precision processing module configured to perform precision processing on the inverse triangular matrix and the preset matrix according to the first floating-point precision of the matrix operation of an artificial intelligence chip and the operation precision required by the matrix to be solved, to obtain a first matrix corresponding to the inverse triangular matrix and a second matrix corresponding to the preset matrix; and
a solving module configured to input the first matrix and the second matrix into a matrix computing unit of the artificial intelligence chip to obtain a matrix multiplication result, wherein the matrix multiplication result represents the matrix to be solved.
9. A data processing system, comprising:
at least one memory;
at least one processor;
at least one program;
wherein the program is stored in the memory, and the processor executes the at least one program to implement the data processing method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing computer-executable instructions for causing a computer to perform the data processing method according to any one of claims 1 to 7.
CN202311111058.9A 2023-08-30 2023-08-30 Data processing method, device, system and storage medium Pending CN117216466A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311111058.9A CN117216466A (en) 2023-08-30 2023-08-30 Data processing method, device, system and storage medium


Publications (1)

Publication Number Publication Date
CN117216466A true CN117216466A (en) 2023-12-12

Family

ID=89043432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311111058.9A Pending CN117216466A (en) 2023-08-30 2023-08-30 Data processing method, device, system and storage medium

Country Status (1)

Country Link
CN (1) CN117216466A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117974417A (en) * 2024-03-28 2024-05-03 腾讯科技(深圳)有限公司 AI chip, electronic device, and image processing method


Similar Documents

Publication Publication Date Title
Higham et al. Squeezing a matrix into half precision, with an application to solving linear systems
CN107451658B (en) Fixed-point method and system for floating-point operation
CN107077416B (en) Apparatus and method for vector processing in selective rounding mode
US8280939B2 (en) Methods and apparatus for automatic accuracy-sustaining scaling of block-floating-point operands
EP4258182A2 (en) Accelerated mathematical engine
CN107340993B (en) Arithmetic device and method
CN108139885B (en) Floating point number rounding
CN110235099B (en) Apparatus and method for processing input operand values
CN117216466A (en) Data processing method, device, system and storage medium
US11853897B2 (en) Neural network training with decreased memory consumption and processor utilization
CN114418057A (en) Operation method of convolutional neural network and related equipment
CN113222102A (en) Optimization method for neural network model quantification
CN111428192A (en) Method and system for optimizing high performance computational architecture sparse matrix vector multiplication
CN113902109A (en) Compression method and device for regular bit serial computation of neural network
CN116842304A (en) Method and system for calculating irregular sparse matrix
CN116451769A (en) Quantization method of language model and electronic equipment
CN111796797B (en) Method and device for realizing loop polynomial multiplication calculation acceleration by using AI accelerator
CN115730653A (en) Quantitative neural network training and reasoning
US8924447B2 (en) Double precision approximation of a single precision operation
CN116502028B (en) Large-scale FFT (fast Fourier transform) implementation method and device based on floating point number compression technology
CN117908835B (en) Method for accelerating SM2 cryptographic algorithm based on floating point number computing capability
CN115291834B (en) N-system floating point addition operation system and method
Jean et al. A parallel algorithm for dot product over word-size finite field using floating-point arithmetic
Graillat et al. Numerical validation in quadruple precision using stochastic arithmetic
CN114491404B (en) Mixed precision SpMV optimization system and method applied to computing equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination