CN114329325A - Batch matrix multiplier optimization method based on Shengteng AI processor - Google Patents

Batch matrix multiplier optimization method based on Shengteng AI processor

Info

Publication number
CN114329325A
CN114329325A
Authority
CN
China
Prior art keywords
input data
data
input
loading
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111374829.4A
Other languages
Chinese (zh)
Other versions
CN114329325B (en)
Inventor
马银萍
李若淼
董昊森
樊春
杨宏辉
龙汀汀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Peng Cheng Laboratory
Original Assignee
Peking University
Peng Cheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University, Peng Cheng Laboratory filed Critical Peking University
Priority to CN202111374829.4A priority Critical patent/CN114329325B/en
Publication of CN114329325A publication Critical patent/CN114329325A/en
Application granted granted Critical
Publication of CN114329325B publication Critical patent/CN114329325B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a method for optimizing a batch matrix multiplication operator based on an Ascend AI processor, which comprises the following steps: acquiring first input data and second input data, and transporting the first input data and the second input data to an AI Core; acquiring the number of loading lines of the second input data, and dividing the first input data and the second input data according to the number of loading lines and a double-buffer mechanism of a preset buffer area; loading the divided first input data and second input data into the buffer area for calculation to obtain output data; and transporting the output data to an external storage for output. By exploiting the double-buffer mechanism, the method shortens the time needed to multiply the first matrix by the second matrix and thereby improves the efficiency of data processing.

Description

Batch matrix multiplier optimization method based on Shengteng AI processor
Technical Field
The invention relates to the field of data optimization, and in particular to a method for optimizing a batch matrix multiplication operator based on an Ascend (Shengteng) AI processor.
Background
The Ascend processor is an AI processor built on the Da Vinci architecture. It aims to provide a chip with higher computing power and lower energy consumption for deep learning research, development and deployment, and is currently a leading AI processor. The Da Vinci architecture is essentially a "Domain Specific Architecture" (DSA) chip adapted to common applications and algorithms in a specific field. The existing Ascend processor uses a conventional batch matrix multiplication operator during computation and does not make use of the double-buffer mechanism of the Ascend processor, which results in low actual operation efficiency.
Accordingly, the prior art is yet to be improved and developed.
Disclosure of Invention
The present invention provides a method for optimizing a batch matrix multiplication operator based on the Ascend AI processor, aiming to improve the efficiency of data processing by using acceleration mechanisms such as double buffering.
The technical scheme adopted by the invention for solving the technical problem is as follows:
In a first aspect, the present invention provides a method for optimizing a batch matrix multiplication operator based on an Ascend AI processor, wherein the method includes:
acquiring first input data and second input data, and transporting the first input data and the second input data to an AI Core;
acquiring the number of loading lines of the second input data, and dividing the first input data and the second input data according to the number of loading lines and a double-buffer mechanism of a preset buffer area;
loading the divided first input data and second input data into the buffer area for calculation to obtain output data;
and carrying the output data to an external storage for output.
In one implementation, the obtaining first input data and second input data and the transporting the first input data and the second input data to an AI Core includes:
acquiring data types of the first input data and the second input data;
transposing the first input data and the second input data when the data type is float16, float32, or int32;
and calling a data move instruction to carry the transposed first input data and second input data to the AI Core.
In one implementation, when the data type is float16, float32, or int32, transposing the first input data and second input data includes:
acquiring a first transpose flag corresponding to the first input data and a second transpose flag corresponding to the second input data;
acquiring the value states of the first transpose flag and the second transpose flag;
and transposing the first input data and the second input data according to the value states.
In an implementation manner, the obtaining a number of loading lines of the second input data, and dividing the first input data and the second input data according to the number of loading lines and a double-buffer mechanism of a preset buffer area includes:
when the number of loading lines is greater than 1,
dividing the second input data into two portions according to the double-buffer mechanism.
In one implementation, the loading the divided first input data and second input data into the buffer for calculation to obtain output data includes:
loading each portion of the second input data into a respective buffer area;
loading the first input data into a buffer area and multiplying the first input data by the second input data of the one portion to obtain an intermediate result;
and adding the intermediate results of the two buffers to obtain output data.
In one implementation, said loading said first input data into a buffer and multiplying said first input data with said one portion of second input data to obtain an intermediate result comprises:
loading a line of said first input data into one of said buffers;
copying one line of the first input data according to the number of loading lines to obtain copied data;
multiplying the replicated data with the second input data of the one portion to obtain an intermediate result.
In one implementation, before the obtaining first input data and second input data and the transporting of the first input data and the second input data to the AI Core, the method includes:
determining the multiple to be expanded according to the data types of the first input data and the second input data and a preset data moving instruction;
expanding the last dimension of the first input data and the second input data according to the multiple when the dimension of the last dimension of the first input data and the second input data is not equal to 1;
and when the second dimension of the second input data is not equal to 1, expanding the second dimension of the second input data upward according to the multiple.
In one implementation manner, the loading the divided first input data and second input data into the buffer for calculation to obtain output data further includes:
if the second dimension of the second input data is expanded;
loading the divided first input data and second input data into the buffer area for calculation to obtain intermediate data;
and carrying out slicing dimensionality reduction on the intermediate data to obtain output data.
In one implementation, the invoking the data move instruction to carry the transposed first input data and the transposed second input data to an AI Core includes:
acquiring the number of all AI cores;
and carrying a batch of the first input data and the second input data to an AI Core.
In one implementation, the moving the output data to an external storage for output includes:
acquiring output data in each AI Core;
and carrying the output data in each AI Core to an external storage for outputting.
In a second aspect, an apparatus for optimizing a batch matrix multiplication operator based on an Ascend AI processor is provided, wherein the apparatus comprises:
the acquisition module is used for acquiring first input data and second input data and transporting the first input data and the second input data to an AI Core;
the dividing module is used for acquiring the number of loading lines of the second input data and dividing the first input data and the second input data according to the number of loading lines and a double-buffer mechanism of a preset buffer area;
the calculation module is used for loading the divided first input data and second input data into the buffer area for calculation to obtain output data;
and the output module is used for transporting the output data to an external storage for output.
In a third aspect, an embodiment of the present invention further provides a terminal device, where the terminal device includes: a processor and a storage medium communicatively coupled to the processor, the storage medium being adapted to store a plurality of instructions; the processor is adapted to call instructions in the storage medium to perform the method for optimizing a batch matrix multiplication operator based on an Ascend AI processor according to any of the above schemes.
In a fourth aspect, the present invention further provides a computer-readable storage medium, wherein the computer-readable storage medium stores one or more programs which are executable by one or more processors to implement the method for optimizing a batch matrix multiplication operator based on an Ascend AI processor according to any one of the above schemes.
The invention has the beneficial effects that: compared with the prior art, the invention provides a method for optimizing a batch matrix multiplication operator based on an Ascend AI processor. First input data and second input data are acquired and transported to an AI Core, which makes it convenient to use the multi-core mechanism of the AI processor to allocate the calculation of each batch to one AI Core, thereby shortening the operation time and improving operation efficiency. The first input data and the second input data are then divided according to the number of loading lines of the second input data, and the divided data are loaded into two buffers by the double-buffer mechanism of the AI processor, so that while one buffer is being used for calculation the next data can be transported into the other buffer; the buffers are thus not idle during transport, which further improves the efficiency of data processing.
Drawings
FIG. 1 is a block diagram of the Da Vinci AI Core in the method for optimizing a batch matrix multiplication operator based on an Ascend AI processor according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of batch matrix multiplication in the method for optimizing a batch matrix multiplication operator based on the Ascend AI processor according to an embodiment of the present invention.
FIG. 3 is a flowchart of an embodiment of the method for optimizing a batch matrix multiplication operator based on the Ascend AI processor according to the present invention.
FIG. 4 is a flowchart illustrating the transportation of the first input data and the second input data to the AI Core in the method for optimizing a batch matrix multiplication operator based on the Ascend AI processor according to an embodiment of the present invention.
FIG. 5 is a block diagram of the data storage of input_y in the batch matrix multiplication in the method for optimizing a batch matrix multiplication operator based on the Ascend AI processor according to an embodiment of the present invention.
FIG. 6 is a schematic diagram of batch matrix multiplication after the second input data is transposed in the method for optimizing a batch matrix multiplication operator based on the Ascend AI processor according to an embodiment of the present invention.
FIG. 7 is a flowchart of obtaining output data in the method for optimizing a batch matrix multiplication operator based on the Ascend AI processor according to an embodiment of the present invention.
FIG. 8 is a schematic diagram of the optimized batch matrix multiplication in the method for optimizing a batch matrix multiplication operator based on an Ascend AI processor according to an embodiment of the present invention.
FIG. 9 shows the test data results of the method for optimizing a batch matrix multiplication operator based on the Ascend AI processor according to an embodiment of the present invention.
FIG. 10 is a schematic block diagram of an apparatus for optimizing a batch matrix multiplication operator based on an Ascend AI processor according to an embodiment of the present invention.
Fig. 11 is a schematic block diagram of an internal structure of a terminal device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer and clearer, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The Ascend processor is an AI processor built on the Da Vinci architecture. It aims to provide a chip with higher computing power and lower energy consumption for deep learning research, development and deployment, and is currently a leading AI processor. The Ascend AI processor mainly consists of the AI Core, the AI CPU (ARM), the L2 Buffer/Cache, DDR/HBM (DDR controller/HBM memory particles), the I/O controller, and so on. The AI Core is the computational core of the Ascend AI processor and is responsible for executing the computation-intensive operators related to vectors and tensors. The storage external to the AI Core, namely L2/HBM/DDR, is collectively referred to as Global Memory, GM for short.
The AI Core adopts the Da Vinci architecture, whose basic structure is shown in fig. 1. From the control point of view it can be seen as a relatively simplified basic architecture of a modern microprocessor containing three basic computing resources: a matrix computation unit (Cube Unit), a vector computation unit (Vector Unit) and a scalar computation unit (Scalar Unit). The three computation units correspond respectively to the three common computation modes of tensor, vector and scalar; each unit plays its own role in actual computation and forms an independent execution pipeline, and the three units cooperate under the unified scheduling of system software to achieve optimized computation efficiency.
CANN is a heterogeneous computing architecture proposed for existing AI scenarios; by providing a multi-level programming interface it supports users in quickly building AI applications and services on the Ascend platform. Deep learning algorithms are composed of computing units called operators; matrix multiplication, convolution and pooling are all single operators. CANN provides three operator development modes to achieve the best balance between efficiency and performance. The first is TBE-DSL (Tensor Boost Engine - Domain Specific Language), in which data partitioning and scheduling are realized on the basis of the syntax rules of the DSL; the developer only needs to care about the expression of the computation, which improves development efficiency. The second is the TVM primitive development mode, in which the developer needs knowledge of neural networks, TVM primitives and the Da Vinci architecture hardware and can control the scheduling (Schedule) flow of operators without mastering the use of the Da Vinci instructions. The third is TBE-TIK (Tensor Iterator Kernel), which requires the developer to have instruction-level programming and tuning capability, including data partitioning and computation expression; compared with the DSL and TVM development modes, the TIK development mode can better exploit the extreme performance of the chip. CANN opens the source code of its preset operator library and supports user-defined modification of operators, and operators designed by users can be quickly applied to neural network models.
Batch matrix multiplication (BatchMatMul) is an operator that performs matrix multiplication in batches and is an important operator in many deep learning models; its computation speed greatly affects the training and inference speed of those models, so optimizing the batch matrix multiplication operator is of real significance. As shown in fig. 2, batch matrix multiplication multiplies each row of input_x of each batch element-wise with the corresponding elements of each column of input_y of the corresponding matrix and sums the products to obtain one element at the corresponding position of output. If there are two third-order tensors with dimensions [b, m, n] and [b, n, k] respectively, where the first dimension is the batch size, multiplying these two tensors yields a result with dimension [b, m, k].
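For illustration only (this sketch is not part of the patent text), the BatchMatMul semantics described above can be written out in NumPy; the function name batch_matmul_reference is an assumption made here, and the final assertion simply checks the loop version against numpy.matmul.

import numpy as np

def batch_matmul_reference(input_x, input_y):
    # For each batch, multiply each row of input_x element-wise with each
    # column of input_y and sum the products (one output element per pair).
    b, m, n = input_x.shape
    _, n2, k = input_y.shape
    assert n == n2, "inner dimensions must match"
    output = np.zeros((b, m, k), dtype=input_x.dtype)
    for bi in range(b):
        for mi in range(m):
            for ki in range(k):
                output[bi, mi, ki] = np.sum(input_x[bi, mi, :] * input_y[bi, :, ki])
    return output

# [b, m, n] multiplied by [b, n, k] yields [b, m, k]
x = np.random.rand(32, 4, 8).astype(np.float32)
y = np.random.rand(32, 8, 8).astype(np.float32)
out = batch_matmul_reference(x, y)
assert out.shape == (32, 4, 8)
assert np.allclose(out, np.matmul(x, y), atol=1e-5)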
It has been found that an Ascend AI processor usually contains multiple AI Cores, so performance can be optimized in a core-parallel manner; the upper limit on the number of cores that can be used is 65535, each AI Core performs its data operations independently and in parallel, and Global Memory is visible to every AI Core. Another common performance optimization is the double-buffer mechanism (double buffer). The Unified Buffer (UB) stores the inputs and outputs of the Vector and Scalar computation units, and the instruction queues executed on the AI Core mainly include the Vector instruction queue, the Matrix instruction queue and the memory-move instruction queues (MTE2, MTE3); different instruction queues are independent of each other and can execute in parallel. The double-buffer mechanism divides the Unified Buffer into two parts: while the Vector unit reads and computes on the data in one buffer, the memory-move instructions can move the next data into the other buffer, so data transport in and out is carried out in parallel with Vector computation, the Vector unit is not idle during data transport, and the utilization of the Vector unit is improved. The existing CANN provides a TVM primitive implementation of the batch matrix multiplication operator which performs no scheduling or optimization for the Da Vinci instructions and architecture of the Ascend AI processor, and it does not use the multi-core and double-buffer mechanisms or the vector computation unit to improve data processing efficiency, so the utilization of the Ascend AI processor is low.
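As a framework-free illustration of the ping-pong pattern that the double-buffer mechanism relies on (a minimal sketch, not Ascend or TIK code; the function name double_buffered_sum and the use of a thread pool as a stand-in "move queue" are assumptions made here), the following Python snippet stages the next chunk into the idle buffer slot while the current slot is being computed on.

from concurrent.futures import ThreadPoolExecutor
import numpy as np

def double_buffered_sum(chunks):
    # Two buffer slots are used alternately: while the "compute" step
    # (np.sum) works on one slot, the "move" step (np.copy, submitted to a
    # worker thread) stages the next chunk into the other slot.
    buffers = [None, None]
    total = 0.0
    with ThreadPoolExecutor(max_workers=1) as mover:
        pending = mover.submit(np.copy, chunks[0])              # stage chunk 0
        for i in range(len(chunks)):
            slot = i % 2
            buffers[slot] = pending.result()                    # wait for the move
            if i + 1 < len(chunks):
                pending = mover.submit(np.copy, chunks[i + 1])  # stage next chunk early
            total += float(np.sum(buffers[slot]))               # compute overlaps the move
    return total

data = np.arange(32, dtype=np.float32).reshape(4, 8)
assert double_buffered_sum(list(data)) == float(data.sum())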
In order to solve the problems of the prior art, the present embodiment provides a method for optimizing a batch matrix multiplication operator based on an Ascend AI processor. The method acquires first input data and second input data and transports them to an AI Core, so that the multi-core mechanism of the Ascend AI processor can allocate the calculation of each batch to one AI Core, shortening the computation time and improving computation efficiency. The first input data and the second input data are then divided according to the number of loading lines of the second input data, and the divided data are loaded into two buffers through the double-buffer mechanism of the Ascend AI processor, so that while one buffer is being used for calculation the next data can be moved into the other buffer; the buffers are therefore not idle during transport, which further improves the efficiency of data processing.
Exemplary method
The method for optimizing a batch matrix multiplication operator based on an Ascend AI processor in this embodiment can be applied to a terminal device, such as a computer, in which an Ascend AI processor is installed and the TIK development mode is adopted. In practice, as shown in FIG. 3, the method for optimizing a batch matrix multiplication operator based on the Ascend AI processor in this embodiment includes the following steps:
step S100, acquiring first input data and second input data, and transporting the first input data and the second input data to an AI Core.
Since the present embodiment processes the first input data and the second input data, it is convenient to improve the efficiency of obtaining the output data; and because the storage inside the AI Core is separate from the external storage GM, computing data in the AI Core requires the data to be first transported into the AI Core, calculated there, and finally transported back to the GM.
In one implementation, as shown in fig. 4, the step S100 includes the following steps:
S101, acquiring data types of the first input data and the second input data;
S102, when the data type is float16, float32 or int32, transposing the first input data and the second input data;
S103, calling a data move instruction to transport the transposed first input data and second input data to the AI Core.
In a specific implementation, the input data types supported by the BatchMatMul operator registered in MindSpore version 1.1 are float16, float32 and int32; data of other types must first call a data type conversion operator (Cast) before calling BatchMatMul, so the embodiment of the present application takes first input data and second input data of type float16, float32 or int32 as examples. The first input data and the second input data need to be carried into the AI Core for computation, and a data move instruction is used during the carrying; in the embodiment of the present application the data_move instruction is used. The offset addresses of the source operand and the destination operand of this instruction must be 32-Byte aligned, and a single float16, float32 or int32 occupies 2 Bytes, 4 Bytes and 4 Bytes in memory respectively, so one data_move call must carry at least 16 float16 values, 8 float32 values or 8 int32 values. The BatchMatMul operator multiplies a row of input_x with a column of input_y, and when input_y has more than one column, a column of input_y is not contiguous in memory. As shown in fig. 5, if input_y is a float32 matrix of size [32, 8], it is stored in memory row by row: as indicated by the arrows in the figure, the end of each row is immediately followed by the head of the next row, so a column used in the matrix operation is discontinuous in memory. Since the data_move instruction carries at least 32 Bytes per call, one call carries 8 values, i.e. one row of input_y, but only the single value at the head of that row is actually used in the matrix multiplication; the limited input buffer space is therefore wasted during data carrying, which greatly affects the computation efficiency of the batch matrix multiplication operator. Therefore, in the embodiment of the present application, a first transpose flag corresponding to the first input data and a second transpose flag corresponding to the second input data are obtained, and the first input data and the second input data are then transposed (Transpose) according to the value states of the two flags. Specifically, with the first input data denoted input_x, the second input data denoted input_y, the first transpose flag transpose_a and the second transpose flag transpose_b, the embodiment of the present application further optimizes the following computation:
output = BatchMatMul(input_x, input_y, transpose_a, transpose_b)
Input_y is transposed first, and the batch matrix multiplication between input_x and input_y then multiplies the corresponding elements of each row of input_x and each row of the transposed input_y and sums the products to obtain the element at the corresponding position of output, as shown in fig. 6. At this point every row of data used in the calculation of input_x and the transposed input_y is contiguous and can be carried into the Unified Buffer (UB) with a single data_move instruction, and the final result is obtained with one vector multiplication (vmul) followed by a summation. According to the value states of the first transpose flag and the second transpose flag, the batch matrix multiplication falls into the following four cases:
1. If input_x is not transposed and input_y is transposed, that is, transpose_a is False and transpose_b is True, the corresponding elements of each row of input_x and input_y can be multiplied directly to obtain the final result, which is denoted CusBatchMatMul(input_x, input_y, False, True);
2. If neither input_x nor input_y is transposed, that is, transpose_a and transpose_b are both False, input_y can be transposed to obtain Transpose(input_y), and CusBatchMatMul(input_x, Transpose(input_y), False, True) is called to obtain the final result; the calculation process is as follows:
input_y=mindspore.ops.Transpose()(input_y,(0,2,1))
CusBatchMatMul(input_x,input_y,False,True);
3. If input_x is transposed and input_y is not transposed, that is, transpose_a is True and transpose_b is False, both input_x and input_y can be transposed to obtain Transpose(input_x) and Transpose(input_y), and CusBatchMatMul(Transpose(input_x), Transpose(input_y), False, True) is called to obtain the final result; the calculation process is as follows:
input_x=mindspore.ops.Transpose()(input_x,(0,2,1))
input_y=mindspore.ops.Transpose()(input_y,(0,2,1))
CusBatchMatMul(input_x,input_y,False,True);
4. If both input_x and input_y are transposed, that is, transpose_a and transpose_b are both True, input_x is transposed to obtain Transpose(input_x), and CusBatchMatMul(Transpose(input_x), input_y, False, True) is called to obtain the final result; the calculation process is as follows:
input_x=mindspore.ops.Transpose()(input_x,(0,2,1))
CusBatchMatMul(input_x,input_y,False,True)。
In summary, any case can be converted into the CusBatchMatMul(input_x, input_y, False, True) case for calculation, so only this case needs to be optimized below.
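The four cases above can be collected into one small helper, sketched below; the wrapper name to_canonical_case is an assumption made for illustration, the Transpose call is the one used in the cases above, and CusBatchMatMul denotes the custom operator of this embodiment (shown only as a comment because its implementation follows later).

import mindspore

def to_canonical_case(input_x, input_y, transpose_a, transpose_b):
    # Reduce any (transpose_a, transpose_b) combination to the
    # CusBatchMatMul(input_x, input_y, False, True) case by transposing
    # the last two dimensions where needed.
    transpose = mindspore.ops.Transpose()
    if transpose_a:          # input_x arrives transposed, so undo it
        input_x = transpose(input_x, (0, 2, 1))
    if not transpose_b:      # the canonical case expects input_y transposed
        input_y = transpose(input_y, (0, 2, 1))
    return input_x, input_y

# x2, y2 = to_canonical_case(input_x, input_y, transpose_a, transpose_b)
# output = CusBatchMatMul(x2, y2, False, True)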
Preferably, because the data_move instruction is used when carrying the first input data and the second input data to the AI Core, and it requires the offset addresses of the source and destination operands to be 32-Byte aligned, 32 Bytes hold 8 float32 or int32 values or 16 float16 values. The data_move instruction therefore carries 8 float32/int32 values or 16 float16 values at a time, and if the source operand to be carried is smaller than 32 Bytes it must be padded up to 32 Bytes to guarantee the correctness of the data_move result. Consequently, so that every row of the first input data, the second input data and the output data can be carried by one or more data_move calls, the last dimension of the first input data and the second input data must first be expanded before the data are moved into the UB, and the second dimension of the second input data must also be expanded so that the matrix calculation result, a one-dimensional array whose length equals that second dimension, can be carried directly back to the GM by a data_move call. For example, when n != 1, the last dimension n of input_x and input_y is expanded upward to the multiple of 8 given by ceil(n/8) x 8; similarly, when k != 1, the second dimension k of input_y is expanded upward to ceil(k/8) x 8. For the float16 data type, n and k are instead expanded to multiples of 16. The expansion of the last dimension of input_x and input_y does not affect the dimensions of the final result; if the second dimension of input_y has been expanded, the result must be sliced back down to the original dimensions [b, m, k] after the CusBatchMatMul calculation is completed.
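The padding rule just described can be sketched as follows, assuming the second input has already been transposed to the shape [b, k, n]; the helper names aligned_dim and padded_shapes are assumptions made for illustration and are not interfaces of the patent.

import math

def aligned_dim(dim, dtype):
    # One data_move transfer is 32 Bytes: 8 elements for float32/int32,
    # 16 elements for float16, so round the dimension up to that unit.
    unit = 16 if dtype == "float16" else 8
    return math.ceil(dim / unit) * unit

def padded_shapes(shape_x, shape_y_transposed, dtype):
    # Pad the last dimension n of both inputs and the second dimension k of
    # input_y so every row can be carried by whole data_move transfers.
    b, m, n = shape_x
    _, k, n2 = shape_y_transposed
    assert n == n2, "inner dimensions must match"
    n_pad = aligned_dim(n, dtype) if n != 1 else n
    k_pad = aligned_dim(k, dtype) if k != 1 else k
    return (b, m, n_pad), (b, k_pad, n_pad)

# float32 inputs [32, 4, 7] and [32, 5, 7] are padded to [32, 4, 8] and [32, 8, 8]
print(padded_shapes((32, 4, 7), (32, 5, 7), "float32"))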
S200, acquiring the number of loading lines of the second input data, and dividing the first input data and the second input data according to the number of loading lines and the double-buffer mechanism of a preset buffer area.
Because the Vector computation unit in the embodiment of the application can only process a limited amount of data at a time, the parts of input_y and input_x to be calculated must be loaded into the Unified Buffer (UB) when the matrix multiplication is performed; by dividing the first input data and the second input data, the input_y data of one batch can be loaded into the UB for calculation.
In a specific implementation, when the number of loading lines of the second input data is greater than 1, the second input data is divided into two parts according to the double-buffer mechanism.
S300, loading the divided first input data and second input data into the buffer area for calculation to obtain output data.
After the first input data and the second input data are divided, the divided first input data and the divided second input data are loaded into the buffer area to be calculated, and then the output data are obtained.
In one implementation, as shown in fig. 7, the step S300 includes the following steps:
S301, loading the second input data of each part into a buffer area respectively;
S302, loading the first input data into a buffer area, and multiplying the first input data by the second input data of one part to obtain an intermediate result;
S303, adding the intermediate results of the two buffers to obtain output data.
In a specific implementation, because a double-buffer mechanism is used inside the UB, after the second input data has been divided in two, one line of the first input data is loaded into one buffer area and copied according to the number of loading lines to obtain copied data; the copied data is multiplied with one part of the second input data, the vcadd instruction is then called to accumulate the data of each line of the result to obtain an intermediate result, and finally the intermediate results of the two buffer areas are combined to obtain the output data. The above steps describe the calculation inside one AI Core; the multi-core mechanism can allocate the calculation of each batch to one AI Core. Specifically, when the first input data and the second input data are transported, the number of all AI Cores is obtained first and one batch of the first input data and the second input data is carried to each AI Core, so when the data are output, the output data in each AI Core must likewise be obtained and carried to the GM for output. For example, as shown in fig. 8, input_x has size [32, 4, 8] and input_y has size [32, 8, 8], the input data type is float32, and it is assumed that the input buffer can only store 140 float32 values and the Vector computation unit can only process 4 x 8 = 32 float32 values at a time. When the matrix multiplication is performed, input_y is divided in two with the double-buffer mechanism and each buffer processes 4 rows of input_y data; one row of input_x is then loaded into the UB and copied into four rows to form one vector, so the multiplication of input_x with all four rows of input_y in each buffer can be completed by a single vmul vector multiplication instruction instead of multiplying each row of input_x with each row of input_y. A triple loop is thus avoided and fewer vmul instructions are called; the 8 products in each row of the result are then accumulated to give the value for that row of input_y, and the final calculation result is moved back to the GM by data_move, which greatly improves operation efficiency.
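The per-core flow of fig. 8 can be simulated functionally in NumPy as below; this is a sketch of the data movement and of the vmul/vcadd pattern only, under the assumption that each half of the double buffer produces the corresponding half of an output row, and it is not actual TIK code.

import numpy as np

def simulate_core_batch(x_b, y_b):
    # x_b: [m, n] rows of input_x for one batch; y_b: [n, k] for the same batch.
    # input_y is transposed so each of its rows is contiguous, then split in
    # half to mimic the two halves of the double buffer.
    m, n = x_b.shape
    k = y_b.shape[1]
    y_t = y_b.T                              # [k, n]; rows are original columns
    half = k // 2
    buffers = (y_t[:half], y_t[half:])       # double-buffer split of input_y
    output = np.empty((m, k), dtype=x_b.dtype)
    for mi in range(m):
        row = x_b[mi]
        col = 0
        for buf in buffers:                  # on hardware the two halves overlap
            copied = np.tile(row, (buf.shape[0], 1))  # copy one row into several
            res_vmul = copied * buf                   # vmul analogue
            res = res_vmul.sum(axis=1)                # vcadd analogue
            output[mi, col:col + buf.shape[0]] = res  # write back toward GM
            col += buf.shape[0]
    return output

x = np.random.rand(4, 8).astype(np.float32)   # one batch of input_x [32, 4, 8]
y = np.random.rand(8, 8).astype(np.float32)   # one batch of input_y [32, 8, 8]
assert np.allclose(simulate_core_batch(x, y), x @ y, atol=1e-5)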
And S400, transporting the output data to an external storage for output.
In a specific implementation, since the data are processed inside the AI Core in this embodiment, after the output data are obtained they need to be transported to an external storage for output. Specifically, this embodiment calls a data move instruction to move the output data back to the GM; preferably, the data move instruction in this embodiment is the data_move instruction.
In summary, the invention develops the operator on the basis of the existing MindSpore framework; the MindSpore deep learning framework provides a Python API through which the CusBatchMatMul operator for the Ascend AI processor can be extended conveniently and quickly. The specific calculation process of the CusBatchMatMul operator is as follows:
1. Obtain the types of the first input data and the second input data. Let the number of bytes occupied by one value be bits, the minimum number of values moved by one data_move instruction be move_mini_size, and the number of values the vector computation unit can process at a time be vector_cal_size. When the input data is float32 or int32, bits = 4, move_mini_size = 8 and vector_cal_size = 64; when the input data is float16, bits = 2, move_mini_size = 16 and vector_cal_size = 128. The two variables move_mini_size and vector_cal_size are needed by the subsequent data_move instruction, the vector multiplication vmul instruction and the vector addition vcadd instruction;
2. Obtain the size of the Unified Buffer through the get_unified_buffer_size function; the size is in Bytes and is recorded as ub_size, and the total number of values that can be loaded into the buffer is ub_load_data_size = ub_size/bits;
3. Perform multi-core parallel partitioning according to the dimensions of the first input data input_x and the second input data input_y, which mainly includes the following steps (a tiling sketch in Python follows step (d) below):
(a) perform multi-core division according to the batches and allocate each batch to one AI Core for calculation;
(b) when n <= ub_load_data_size/2.2, at least one row of input_y (of length n) fits into the UB, and input_y_lines >= 1. In theory the data of input_x and input_y in the UB should each occupy half of the buffer for the multiplication, but the actual calculation needs some auxiliary variables that bring extra storage overhead; dividing by 2.2 instead of 2 leaves headroom for this overhead, at the cost of loading slightly less data per calculation. The number of rows of input_y loaded is at most ub_load_data_size/2.2 // n // 2 * 2; the final "// 2 * 2" makes the number of rows loaded into the UB even, so the rows can be distributed evenly between the two buffers of the double buffer. When input_y_lines >= k, the input_y of one batch can be loaded into the UB in its entirety for the matrix calculation; when input_y_lines < k, input_y is processed in a loop of k/input_y_lines iterations, computing input_y_lines rows each time;
(c) when n <= ub_load_data_size/2.2, input_x is divided by rows. If input_y_lines is greater than 2k, two rows of input_x can be moved into the UB for calculation at a time, with each buffer of the double buffer processing one row of input_x; otherwise one row of input_x is moved into the UB and the following operations are executed in parallel in the double buffer: first the input_x row is copied into input_y_lines/2 rows, then input_y_lines/2 rows of input_y are moved in and each is multiplied element-wise (vmul) with the copied input_x to obtain the vector multiplication result res_vmul; vcadd is then called once or several times on each row of res_vmul to accumulate it, the final accumulated value of each row is stored in a one-dimensional vector res, and res is moved directly back to the corresponding position of output in Global Memory with a data_move instruction;
(d) when n > ub_load_data_size/2.2, input_y_lines < 1; each row of input_y is divided evenly into p segments, p = 1/input_y_lines, and 1/p of a row of input_y is processed at a time. The corresponding row of input_x is divided evenly into p segments as well, each pair of segments of equal length is carried into the UB and multiplied with the vmul instruction, and the vcadd instruction then sums the products of the segment to obtain the intermediate result of one value. This intermediate value has to be expanded to the size move_mini_size that the data_move instruction can carry; for example, when move_mini_size is 8, the computed value of the segment is placed at the head of an 8-element array before being carried out of the UB. The intermediate calculation results of the whole input_y and the whole input_x are stored in an intermediate matrix inner_output of size [b, k] in the manner described above, in which position x*p (x = 0, 1, 2, 3, ...) of each row holds the result of multiplying and summing the corresponding 1/p segments of a row of input_x and a row of input_y. Finally the inner_output matrix is processed: each of its rows is sent into the UB, all the intermediate result values of the matrix multiplication are taken out, and the results are finally written to the output data. Preferably, on an Ascend 910 machine with MindSpore 1.1.1, the results of testing several groups of data are shown in fig. 9, where the CusBatchMatMul operator time includes the time for transposing and slicing the input data and the results. The test results clearly show that CusBatchMatMul is a significant improvement over the BatchMatMul operator in the existing CANN library; for the case where input_x is [32, 128, 128], input_y is [32, 128, 128], transpose_a is True and transpose_b is True, CusBatchMatMul improves the performance of the BatchMatMul operator by a factor of about 3861. The BatchMatMul operator is commonly used in deep learning neural networks, and the markedly better performance of the CusBatchMatMul operator can raise the speed of network training and inference and thus further improve operation efficiency.
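The tiling arithmetic of steps 1 to 3 can be sketched as follows; the per-type constants come from step 1, the divisor 2.2 reserves the headroom explained in step (b), and the function names dtype_params and plan_tiling, as well as the example buffer size, are assumptions made for illustration rather than interfaces of the patent (on a real device ub_size would come from get_unified_buffer_size).

def dtype_params(dtype):
    # Constants from step 1: bytes per element, minimum data_move element
    # count, and elements per vector instruction.
    if dtype == "float16":
        return {"bits": 2, "move_mini_size": 16, "vector_cal_size": 128}
    if dtype in ("float32", "int32"):
        return {"bits": 4, "move_mini_size": 8, "vector_cal_size": 64}
    raise ValueError("unsupported dtype: " + dtype)

def plan_tiling(n, k, ub_size_bytes, dtype):
    # Decide how many rows of the transposed input_y fit in the Unified
    # Buffer, mirroring cases (b) to (d).
    params = dtype_params(dtype)
    ub_load_data_size = ub_size_bytes // params["bits"]
    budget = ub_load_data_size / 2.2           # headroom for auxiliary data
    if n <= budget:
        # an even row count so the double buffer gets two equal halves
        input_y_lines = max(1, int(budget // n // 2) * 2)
        loops = -(-k // input_y_lines)         # ceil(k / input_y_lines)
        return {"input_y_lines": input_y_lines, "loops": loops, "split_rows": False}
    # case (d): a single row no longer fits, so split each row into segments
    segments = -(-n // int(budget))            # roughly p = 1 / input_y_lines
    return {"segments": segments, "split_rows": True}

# e.g. n=8, k=128 with a hypothetical 4 KB buffer for float32
print(plan_tiling(n=8, k=128, ub_size_bytes=4096, dtype="float32"))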
In summary, in this embodiment, first input data and second input data are obtained, and then the first input data and the second input data are transported to an AI Core, so that it is convenient to allocate each batch of calculation processes to one AI Core by using a multi-Core mechanism of an AI processor, thereby shortening operation time and improving operation efficiency.
Exemplary devices
As shown in FIG. 10, the present embodiment further provides an apparatus for optimizing a batch matrix multiplication operator based on an Ascend AI processor, the apparatus comprising: an acquisition module 10, a dividing module 20, a calculation module 30 and an output module 40. Specifically, the acquisition module 10 is configured to acquire first input data and second input data and transport the first input data and the second input data to an AI Core. The dividing module 20 is configured to acquire the number of loading lines of the second input data and divide the first input data and the second input data according to the number of loading lines and the double-buffer mechanism of a preset buffer area. The calculation module 30 is configured to load the divided first input data and second input data into the buffer area for calculation to obtain output data. The output module 40 is configured to transport the output data to an external storage for output.
In one implementation, the obtaining module 10 includes:
an obtaining unit, configured to obtain data types of the first input data and the second input data;
a transpose unit configured to transpose the first input data and the second input data when the data type is float16, float32, or int32;
and a carrying unit configured to call a data move instruction to carry the transposed first input data and second input data to the AI Core.
In one implementation, the calculation module 30 includes:
the loading unit is used for loading the second input data of each part into a buffer area respectively;
an intermediate result obtaining unit, configured to load the first input data into a buffer, and multiply the first input data by the second input data of the one portion to obtain an intermediate result;
and the output result acquisition unit is used for adding the intermediate results of the two buffers to obtain output data.
In one implementation, the intermediate result obtaining unit includes:
a loading subunit, configured to load a line of the first input data into one of the buffer areas;
the replication sub-unit is used for replicating one row of the first input data according to the number of the loading rows to obtain replicated data;
an intermediate result obtaining subunit, configured to multiply the copied data with the second input data of the one portion, to obtain an intermediate result.
Based on the above embodiments, the present invention further provides a terminal device, a schematic block diagram of which may be as shown in fig. 11. The terminal device comprises a processor, a memory, a network interface, a display screen and a temperature sensor which are connected through a system bus. The processor of the terminal device is configured to provide computing and control capabilities. The memory of the terminal device comprises a nonvolatile storage medium and an internal memory. The nonvolatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the nonvolatile storage medium. The network interface of the terminal device is used for connecting and communicating with an external terminal through a network. The computer program is executed by the processor to implement the method for optimizing a batch matrix multiplication operator based on an Ascend AI processor. The display screen of the terminal device can be a liquid crystal display screen or an electronic ink display screen, and the temperature sensor of the terminal device is preset in the terminal device and used for detecting the operating temperature of internal components.
It will be understood by those skilled in the art that the block diagram of fig. 11 is only a block diagram of a part of the structure related to the solution of the present invention, and does not constitute a limitation to the terminal device to which the solution of the present invention is applied, and a specific terminal device may include more or less components than those shown in the figure, or may combine some components, or have different arrangements of components.
In one embodiment, a terminal device is provided. The terminal device includes a memory, a processor, and a program for optimizing a batch matrix multiplication operator based on an Ascend AI processor that is stored in the memory and executable on the processor, and the processor implements the following operation instructions when executing the program:
acquiring first input data and second input data, and transporting the first input data and the second input data to an AI Core;
acquiring the number of loading lines of the second input data, and dividing the first input data and the second input data according to the number of loading lines and a double-buffer mechanism of a preset buffer area;
loading the divided first input data and second input data into the buffer area for calculation to obtain output data;
and carrying the output data to an external storage for output.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing related hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, databases, or other media used in embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
In summary, the present invention provides a method for optimizing a batch matrix multiplication operator based on an Ascend AI processor. The method acquires first input data and second input data and transports them to an AI Core, so that the multi-core mechanism of the Ascend AI processor can allocate the calculation of each batch to one AI Core, reducing computation time and improving computation efficiency. The first input data and the second input data are then divided according to the number of loading lines of the second input data, and the divided data are loaded into two buffers through the double-buffer mechanism of the Ascend AI processor, so that while one buffer is being used for calculation the next data can be moved into the other buffer; the buffers are therefore not idle during transport, which further improves operation efficiency.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may be modified or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (13)

1. A method for optimizing a batch matrix multiplication operator based on an Ascend AI processor, the method comprising:
acquiring first input data and second input data, and transporting the first input data and the second input data to an AI Core;
acquiring the number of loading lines of the second input data, and dividing the first input data and the second input data according to the number of loading lines and a double-buffer mechanism of a preset buffer area;
loading the divided first input data and second input data into the buffer area for calculation to obtain output data;
and carrying the output data to an external storage for output.
2. The method of claim 1, wherein the obtaining the first input data and the second input data and the transporting the first input data and the second input data to an AI Core comprises:
acquiring data types of the first input data and the second input data;
transposing the first input data and the second input data when the data type is float16, float32, or int32;
and calling a data move instruction to carry the transposed first input data and second input data to the AI Core.
3. The method as claimed in claim 2, wherein transposing the first input data and the second input data when the data type is float16, float32 or int32 comprises:
acquiring a first transpose flag corresponding to the first input data and a second transpose flag corresponding to the second input data;
acquiring the value states of the first transpose flag and the second transpose flag;
and transposing the first input data and the second input data according to the value states.
4. The method as claimed in claim 1, wherein the obtaining the number of loading lines of the second input data and dividing the first input data and the second input data according to the number of loading lines and a double-buffer mechanism of a predetermined buffer comprises:
when the number of loading lines is greater than 1,
dividing the second input data into two portions according to the double-buffer mechanism.
5. The method as claimed in claim 4, wherein the loading the divided first and second input data into the buffer for calculation to obtain the output data comprises:
loading the second input data of each part into a buffer area respectively;
loading the first input data into a buffer and multiplying the first input data by the second input data of the one portion to obtain an intermediate result;
and adding the intermediate results of the two buffers to obtain output data.
6. The method of claim 5, wherein loading the first input data into a buffer and multiplying the first input data by the portion of the second input data to obtain an intermediate result comprises:
loading a line of said first input data into one of said buffers;
copying one line in the first input data according to the loading line number to obtain copied data;
multiplying the replicated data with the second input data of the one portion to obtain an intermediate result.
7. The method as claimed in claim 1, wherein before the obtaining the first input data and the second input data and transporting the first input data and the second input data to the AI Core, the method comprises:
determining a multiple needing to be expanded according to the data types of the first input data and the second input data and a preset data moving instruction;
expanding the last dimension of the first input data and the second input data according to the multiple when the dimension of the last dimension of the first input data and the second input data is not equal to 1;
and when the second dimension of the second input data is not equal to 1, expanding the second dimension of the second input data upward according to the multiple.
8. The method as claimed in claim 7, wherein the step of loading the divided first and second input data into the buffer for calculation to obtain the output data further comprises:
if the second dimension of the second input data is expanded;
loading the divided first input data and second input data into the buffer area for calculation to obtain intermediate data;
and carrying out slicing dimensionality reduction on the intermediate data to obtain output data.
9. The method as claimed in claim 2, wherein the calling the data move instruction to transport the transposed first input data and second input data to the AI Core comprises:
acquiring the number of all AI cores;
and carrying a batch of the first input data and the second input data to an AI Core.
10. The method as claimed in claim 9, wherein the transferring the output data to an external storage for output comprises:
acquiring output data in each AI Core;
and carrying the output data in each AI Core to an external storage for outputting.
11. An apparatus for optimizing a batch matrix multiplication operator based on an Ascend AI processor, the apparatus comprising:
the acquisition module is used for acquiring first input data and second input data and transporting the first input data and the second input data to an AI Core;
the dividing module is used for acquiring the number of loading lines of the second input data and dividing the first input data and the second input data according to the number of loading lines and a double-buffer mechanism of a preset buffer area;
the calculation module is used for loading the divided first input data and second input data into the buffer area for calculation to obtain output data;
and the output module is used for transporting the output data to an external storage for output.
12. A terminal device, characterized in that the terminal device comprises: a processor and a storage medium communicatively coupled to the processor, the storage medium being adapted to store a plurality of instructions; the processor is adapted to call the instructions in the storage medium to perform the method for optimizing a batch matrix multiplication operator based on an Ascend AI processor according to any one of claims 1-10.
13. A computer-readable storage medium, storing one or more programs which are executable by one or more processors to implement the method for optimizing a batch matrix multiplication operator based on an Ascend AI processor according to any one of claims 1-10.
CN202111374829.4A 2021-11-19 2021-11-19 Optimization method of batch matrix multiplication operator based on Ascend AI processor Active CN114329325B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111374829.4A CN114329325B (en) 2021-11-19 2021-11-19 Optimization method of batch matrix multiplication operator based on Ascend AI processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111374829.4A CN114329325B (en) 2021-11-19 2021-11-19 Optimization method of batch matrix multiplication operator based on Ascend AI processor

Publications (2)

Publication Number Publication Date
CN114329325A true CN114329325A (en) 2022-04-12
CN114329325B CN114329325B (en) 2024-09-24

Family

ID=81047257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111374829.4A Active CN114329325B (en) 2021-11-19 2021-11-19 Optimization method of batch matrix multiplication operator based on Ascend AI processor

Country Status (1)

Country Link
CN (1) CN114329325B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114968612A (en) * 2021-07-14 2022-08-30 华为技术有限公司 Data processing method, system and related equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019173104A1 (en) * 2018-03-06 2019-09-12 DinoplusAI Holdings Limited Ai accelerator virtualization
WO2020027386A1 (en) * 2018-07-30 2020-02-06 부산대학교 산학협력단 Mass encryption matrix calculation-optimization processing method in power device environment
WO2020062299A1 (en) * 2018-09-30 2020-04-02 华为技术有限公司 Neural network processor, data processing method and related device
US20200104669A1 (en) * 2018-10-01 2020-04-02 Expedera, Inc. Methods and Apparatus for Constructing Digital Circuits for Performing Matrix Operations
GB202109342D0 (en) * 2021-06-29 2021-08-11 Imagination Tech Ltd Neural network comprising matrix multiplication

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019173104A1 (en) * 2018-03-06 2019-09-12 DinoplusAI Holdings Limited Ai accelerator virtualization
WO2020027386A1 (en) * 2018-07-30 2020-02-06 부산대학교 산학협력단 Mass encryption matrix calculation-optimization processing method in power device environment
WO2020062299A1 (en) * 2018-09-30 2020-04-02 华为技术有限公司 Neural network processor, data processing method and related device
CN112789627A (en) * 2018-09-30 2021-05-11 华为技术有限公司 Neural network processor, data processing method and related equipment
US20200104669A1 (en) * 2018-10-01 2020-04-02 Expedera, Inc. Methods and Apparatus for Constructing Digital Circuits for Performing Matrix Operations
GB202109342D0 (en) * 2021-06-29 2021-08-11 Imagination Tech Ltd Neural network comprising matrix multiplication

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIU Zhong; TIAN Xi: "Vectorization Method of Matrix Multiplication for Multi-core Vector Processors", Chinese Journal of Computers, no. 10, 30 June 2017 (2017-06-30), pages 79 - 92 *
连理O: "Ascend AI Processor: the Da Vinci Architecture", Retrieved from the Internet <URL:https://blog.csdn.net> *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114968612A (en) * 2021-07-14 2022-08-30 华为技术有限公司 Data processing method, system and related equipment

Also Published As

Publication number Publication date
CN114329325B (en) 2024-09-24

Similar Documents

Publication Publication Date Title
US20210142178A1 (en) Tensor-based optimization method for memory management of a deep-learning gpu and system thereof
US11669443B2 (en) Data layout optimization on processing in memory architecture for executing neural network model
US9910714B2 (en) Scriptable dynamic load balancing in computer systems
US11694075B2 (en) Partitioning control dependency edge in computation graph
US20210319289A1 (en) Frequency domain neural network accelerator
US11921814B2 (en) Method and device for matrix multiplication optimization using vector registers
WO2021000281A1 (en) Instructions for operating accelerator circuit
US20230004365A1 (en) Multistage compiler architecture
US11467811B1 (en) Method and apparatus for generating metadata by a compiler
CN112633505B (en) RISC-V based artificial intelligence reasoning method and system
WO2022134307A1 (en) Memory-coupled compiling method and system for re-configurable chip
CN114329325A (en) Batch matrix multiplier optimization method based on Shengteng AI processor
US9280382B1 (en) Parallel processing of multidimensional arrays
CN110782009A (en) Computing kernel optimization method based on ARMv8 system
Rauber et al. General purpose GPU programming
CN117908861A (en) Unified BLAS algorithm library method and system for heterogeneous hardware
CN111522776B (en) Computing architecture
US20210150311A1 (en) Data layout conscious processing in memory architecture for executing neural network model
CN113780539A (en) Neural network data processing method, device, equipment and storage medium
US20230196124A1 (en) Runtime predictors for neural network computation reduction
US12073317B2 (en) Method and system for processing a neural network
US7447722B2 (en) Low latency computation in real time utilizing a DSP processor
CN117421044A (en) Instruction generation method and device and electronic equipment
CN115185587A (en) AI processor-based general matrix multiplier processing method and device
TW202416185A (en) Deep fusion of kernel execution

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant