CN114035936B - Multi-dimensional parallel processing method, system, equipment and readable storage medium based on artificial intelligence - Google Patents

Multi-dimensional parallel processing method, system, equipment and readable storage medium based on artificial intelligence

Info

Publication number
CN114035936B
CN114035936B
Authority
CN
China
Prior art keywords
data
parallel
model
processors
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111203399.XA
Other languages
Chinese (zh)
Other versions
CN114035936A (en)
Inventor
卞正达
李永彬
柳泓鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Luchen Technology Co ltd
Original Assignee
Beijing Luchen Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Luchen Technology Co ltd filed Critical Beijing Luchen Technology Co ltd
Priority to CN202111203399.XA priority Critical patent/CN114035936B/en
Publication of CN114035936A publication Critical patent/CN114035936A/en
Application granted granted Critical
Publication of CN114035936B publication Critical patent/CN114035936B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The application belongs to the field of artificial intelligence and relates to a multidimensional parallel processing system and method based on artificial intelligence. In the training process, data parallelism automatically manages the data to be processed and distributes it to the hardware processors; sequence parallelism further splits the data and places each piece of data to be processed onto a plurality of processors; pipeline parallelism splits the model into several stages, places the stages on different hardware processors and connects them in series in model order; multidimensional model parallelism performs grid model division on the training model of the data to be processed that is scheduled to the processors and schedules the training model to a plurality of the processors; an optimizer updates the model parameters to complete the training process. Resource scheduling and multidimensional parallelization are likewise adopted in the inference process. By introducing multidimensional parallel processing into AI model training and inference, the consumption of computing resources by AI is reduced, the efficiency of artificial intelligence deployment is improved, and the deployment cost is minimized.

Description

Multi-dimensional parallel processing method, system, equipment and readable storage medium based on artificial intelligence
Technical Field
The invention belongs to the field of artificial intelligence deep learning, and particularly relates to an artificial intelligence-based multidimensional parallel processing method, system, device and readable storage medium.
Background
In recent years, the AI training market has reached an inflection point in demand: the computing-power market is expanding rapidly and the efficiency of computing-power usage must be improved. Large-scale algorithms have made explosive breakthroughs in the last two years, new algorithms and new models keep emerging, and the market's demand for computing power keeps growing. A large model cannot be trained by a single GPU, because the model parameters are too large to fit into the video memory of a single GPU; even if they could fit, the training time would be unacceptable. The growth of hardware computing power falls far short of the models' demand for computing power, so more hardware (chips) must be used to compensate for this gap.
In an enterprise scenario, large-scale deployment involves a large number of factors, including latency, throughput, cost, load balancing and the like. The main difficulties are that communication bottlenecks make it hard to improve computing efficiency: the peak utilization of GPU computing power in existing training is only about 30%; computing, storage and network resources must be shared among different tasks, which raises isolation and scheduling problems; and different tasks need different multidimensional parallel processing solutions and hardware, causing additional software and hardware cost.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an efficient, low-energy-consumption multidimensional parallel processing method and system suitable for large AI models, helping enterprises maximize artificial intelligence deployment efficiency while minimizing deployment cost.
The embodiment of the application provides a multidimensional parallel processing method, a system, equipment and a medium based on artificial intelligence.
In a first aspect, an embodiment of the present application provides an artificial intelligence based multidimensional parallel processing method for a hardware processor, where the method is executed on a software platform and uses a machine learning library;
Characterized in that the method comprises the steps of:
Data parallelism, automatically managing data to be processed from a user request, and distributing the data to be processed to each hardware processor;
Sequence parallelism, further splitting long-sequence data in the data to be processed, and dividing each piece of data to be processed along the sequence dimension and placing it onto a plurality of processors;
The multi-dimensional model is parallel, grid model division is executed for a training model of the data to be processed, which is scheduled to the processors, and the training model is scheduled to a plurality of the processors;
the data to be processed comprises a picture processing task and/or a natural language processing task;
The multi-dimensional model parallelism comprises 2-dimensional and/or 2.5-dimensional and/or 3-dimensional grid parallelism.
In a possible implementation manner of the first aspect, the step of automatically managing data to be processed from a user request, and distributing the data to be processed to each of the hardware processors further includes:
The data in data parallelism is partitioned; each node or process holds a copy of the model; each node takes a different batch (batchsize) of data and completes the forward and backward computation to obtain gradients; the processes used for training are workers; besides the workers there are parameter servers (ps servers); the workers send the computed gradients to the ps server, the ps server performs the update operation, and the updated model is transmitted back to each node;
data parallelism can expand the equivalent batchsize, i.e. the equivalent batch size, which equals the number of parallel processors multiplied by the per-processor batchsize, thereby speeding up the calculation.
In a possible implementation manner of the first aspect, the further splitting long sequence data in the data to be processed, and performing sequence division on each data to be processed in a plurality of processors specifically includes:
Sequence parallelism extends the length of the data that a Transformer model can accept, handling long text in NLP and high-resolution pictures in CV tasks, i.e. large pictures and/or video, where a picture can be cut into small blocks and all the small blocks arranged in order form a sequence; a video is itself a sequence of pictures, and each picture can be cut again;
after the computing resources are acquired, the picture processing task and/or the feature data of the pictures are processed, and the processed data are distributed through data parallelism to the processors, including but not limited to GPUs/CPUs; the data can be further split and distributed through sequence parallelism;
if the length of a single piece of data is greater than the threshold, a single processor cannot process it; after sequence-parallel splitting, one piece of data is placed onto a plurality of processors;
through communication, the computation is made equivalent to directly processing the whole complete piece of data.
In a possible implementation manner of the first aspect, the performing mesh model partitioning for the training model of the data to be processed, which is scheduled to the processor, schedules the training model to a plurality of processors, specifically includes:
The 2-dimensional grid parallelism adopts the scalable universal matrix multiplication algorithm (SUMMA) and its matrix forms, using an efficient, extensible model-parallel scheme based on two-dimensional matrix partitioning;
the 2.5-dimensional grid parallel design is a novel quantized deep-learning model-parallel architecture that minimizes the expensive transmission loss between graphics processors and provides a flexible and efficient architecture, further improving model-parallel speed and efficiency;
the 3-dimensional grid parallelism adopts 3D parallel matrix multiplication: each matrix is divided into a number of small blocks by rows and columns, the large matrix multiplication is split into multiplications of many small matrices, and the matrix storage is spread evenly across all the processors.
In a second aspect, an embodiment of the present application provides an artificial intelligence based multidimensional parallel processing system for a hardware processor, where the system is implemented on a software platform and uses a machine learning library;
The data parallel module is used for automatically managing the data to be processed from the user request and distributing the data to be processed to each hardware processor;
The sequence parallel module is used for further segmenting long sequence data in the data to be processed, and dividing each data to be processed into sequences and putting the sequences into a plurality of processors;
The multi-dimensional model parallel module is used for performing grid model division on the training model of the data to be processed that is scheduled to the processors, and scheduling the training model to a plurality of the processors;
the data to be processed comprises a picture processing task and/or a natural language processing task;
The multi-dimensional model parallelism comprises 2-dimensional and/or 2.5-dimensional and/or 3-dimensional grid parallelism.
In a possible implementation manner of the second aspect, the data parallel module automatically manages data to be processed from a user request, and allocates the data to be processed to each of the hardware processors further includes:
The data in data parallelism is partitioned; each node or process holds a copy of the model; each node takes a different batch (batchsize) of data and completes the forward and backward computation to obtain gradients; the processes used for training are workers; besides the workers there are parameter servers (ps servers); the workers send the computed gradients to the ps server, the ps server performs the update operation, and the updated model is transmitted back to each node;
data parallelism can expand the equivalent batchsize, i.e. the equivalent batch size, which equals the number of parallel processors multiplied by the per-processor batchsize, thereby speeding up the calculation.
In one possible implementation manner of the second aspect, the sequence parallel module further performs segmentation on long sequence data in data to be processed, and performs sequence division on each data to be processed to put the data to be processed into a plurality of processors, where the method specifically includes:
Sequence parallelism extends the length of the data that a Transformer model can accept, handling long text in NLP and high-resolution pictures in CV tasks, i.e. large pictures and/or video, where a picture can be cut into small blocks and all the small blocks arranged in order form a sequence; a video is itself a sequence of pictures, and each picture can be cut again;
after the computing resources are acquired, the picture processing task and/or the feature data of the pictures are processed, and the processed data are distributed through data parallelism to the processors, including but not limited to GPUs/CPUs; the data can be further split and distributed through sequence parallelism;
if the length of a single piece of data is greater than the threshold, a single processor cannot process it; after sequence-parallel splitting, one piece of data is placed onto a plurality of processors;
through communication, the computation is made equivalent to directly processing the whole complete piece of data.
In one possible implementation manner of the second aspect, the multidimensional model parallel module performs mesh model partitioning for a training model of the data to be processed, which is scheduled to the processor, and schedules the training model to a plurality of processors, and specifically includes:
The 2-dimensional grid parallelism adopts the scalable universal matrix multiplication algorithm (SUMMA) and its matrix forms, using an efficient, extensible model-parallel scheme based on two-dimensional matrix partitioning;
the 2.5-dimensional grid parallel design is a novel quantized deep-learning model-parallel architecture that minimizes the expensive transmission loss between graphics processors and provides a flexible and efficient architecture, further improving model-parallel speed and efficiency;
the 3-dimensional grid parallelism adopts 3D parallel matrix multiplication: each matrix is divided into a number of small blocks by rows and columns, the large matrix multiplication is split into multiplications of many small matrices, and the matrix storage is spread evenly across all the processors.
In a third aspect, an embodiment of the present application provides an artificial intelligence based multidimensional parallel processing apparatus, which is characterized by comprising:
A memory for storing instructions for execution by one or more processors of the system, and
The processor is one of the processors of the system and is used for executing the instructions to implement the multi-dimensional parallel processing method based on artificial intelligence.
In a fourth aspect, an embodiment of the present application provides a computer readable storage medium encoded with a computer program, where the computer readable storage medium has instructions stored thereon, where the instructions when executed on a computer cause the computer to perform the multi-dimensional parallel processing method based on artificial intelligence.
Compared with the prior art, the application has the following effects:
The scheme adopted by the invention partitions the model through multidimensional parallelism and improves distributed AI training and inference efficiency. Taking inference as an example, it achieves about a 70% improvement in response speed, reducing response time from 30 seconds to 17-18 seconds; training speed/parallel efficiency is high, user training time is reduced, and the maximum performance of the existing computing power (GPUs) is exploited; the largest model supported per processor on average is raised from a scale of 1 billion parameters to 12 billion parameters; the number of GPUs required for large-model inference is reduced, cutting cost while improving model availability and product performance. The method is easy to use: users need neither extensive code learning nor manual tuning, so the deployment cost is lower.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 illustrates a workflow diagram of an artificial intelligence based multidimensional parallel processing method in accordance with some embodiments of the present application;
FIG. 2 is an application scenario diagram illustrating an artificial intelligence based multi-dimensional parallel processing method according to some embodiments of the application;
FIG. 3 illustrates a block diagram of the hardware architecture of an artificial intelligence based multidimensional parallel processing system in accordance with some embodiments of the present application;
FIG. 4 illustrates a schematic diagram of a SUMMA algorithm module of an artificial intelligence based multi-dimensional parallel processing method, according to some embodiments of the application;
FIG. 5 illustrates a structural layout of a 2.5-dimensional grid parallel scheme for an artificial intelligence based multi-dimensional parallel processing method, in accordance with some embodiments of the present application;
FIG. 6 illustrates a SUMMA 2.5 algorithm block diagram of an artificial intelligence based multi-dimensional parallel processing method, according to some embodiments of the application;
FIG. 7 illustrates a matrix-vector parameter equalization architecture for an artificial intelligence based multidimensional parallel processing method in accordance with some embodiments of the present application;
FIG. 8 illustrates a weak scaling efficiency comparison schematic of an artificial intelligence based multidimensional parallel processing method in accordance with some embodiments of the present application;
FIG. 9 illustrates a strong scaling efficiency comparison schematic of an artificial intelligence based multidimensional parallel processing method in accordance with some embodiments of the present application;
FIG. 10 illustrates a statistical graph of experimental results of a LAMB algorithm based on an artificial intelligence multi-dimensional parallel processing method, according to some embodiments of the application;
FIG. 11 illustrates a workflow diagram of a La-Lars algorithm for an artificial intelligence based multidimensional parallel processing method, in accordance with some embodiments of the present application;
FIG. 12 illustrates a block diagram of a multi-dimensional parallel processing system based on artificial intelligence, according to some embodiments of the application.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Illustrative embodiments of the application include, but are not limited to, an artificial intelligence based multidimensional parallel processing method, system, apparatus, device and medium.
It is to be appreciated that the multidimensional parallel processing method provided by the present application can be implemented on a variety of electronic devices, including, but not limited to, servers, distributed server clusters composed of multiple servers, cell phones, tablet computers, laptop computers, desktop computers, wearable devices, head-mounted displays, mobile email devices, portable gaming devices, portable music players, reader devices, personal digital assistants, virtual reality or augmented reality devices, televisions with one or more processors embedded or coupled therein, and the like.
It is to be appreciated that in various embodiments of the application, the processor may be a microprocessor, a digital signal processor, a microcontroller, or the like, and/or any combination thereof. According to another aspect, the processor may be a single core processor, a multi-core processor, or the like, and/or any combination thereof.
The inventive concept of the embodiments of the present application will be briefly described below.
From the perspective of the computing-power market, the supply of computing power currently falls short of demand. By accelerating large-scale multidimensional parallel processing, the system hopes to reduce AI's demand for computing resources; efficient multidimensional parallel processing is an indispensable function of an AI infrastructure platform, so an efficient training scheme like this system will be a hard requirement of the future AI market. From the perspective of AI model application scenarios, a large number of application scenarios will create a great demand for efficient parallel training, and many existing frontier models cannot be applied widely enough because of computing-power constraints, so improving computing-power efficiency opens up more markets; deployment under the prior art is relatively difficult; for example, NeRF (an application of deep learning to three-dimensional rendering), which appeared in 2019, has not been widely adopted because of the limitation of computation speed.
In addition, the threshold and cost of multidimensional parallel processing and deployment are high. Taking PyTorch's built-in schemes as an example, code related to process groups, in-group collective communication, data sets and parallel models needs to be written, and the back-end interface must be adjusted according to the hardware used (CPU/GPU). A multidimensional parallel processing deployment engineer needs to understand algorithms (parallel strategies), systems (training architecture, synchronization methods), AI frameworks and training methods, communication programming, resource scheduling software, big-data platforms, low-level software programming and so on at the same time; the required talent quality is extremely high, and the corresponding hiring cost for enterprises is also high. Different tasks require different multidimensional parallel processing solutions and hardware, with additional software and hardware costs. Existing training schemes are generally based on the vendor's own hardware and are customized solutions directly integrated with that hardware, so they struggle with new hardware/model architectures, and a universal, standardized parallel training scheme is urgently needed. The prior art often seeks breakthroughs on the algorithm side, but on the one hand algorithmic breakthroughs are difficult, and on the other hand algorithms can hardly solve the problem of limited multidimensional parallel processing efficiency completely; for example, fields such as medical care and security may require data confidentiality or models with special structures. In the short term, training can still be achieved through manual parameter tuning and deployment, but in the long term a universal, automated parallel training approach is required, so that rapidly iterating algorithms can be accommodated, the cost of AI applications can be reduced and AI applications can be popularized.
In view of this, FIG. 1 provides an artificial intelligence based multidimensional parallel processing method for a hardware processor, the method being implemented in a software platform using a machine learning library, according to a first embodiment of the present application;
Characterized in that the method comprises the steps of:
Data parallelism, automatically managing data to be processed from a user request, and distributing the data to be processed to each hardware processor;
Sequence parallelism, further splitting long-sequence data in the data to be processed, and dividing each piece of data to be processed along the sequence dimension and placing it onto a plurality of processors;
Pipeline parallelism, splitting the model into multiple stages, deploying each stage on a different hardware processor, and connecting the stages in series in model order, wherein the output of the former stage serves as the input of the latter stage;
The multi-dimensional model is parallel, grid model division is executed for a training model of the data to be processed, which is scheduled to the processors, and the training model is scheduled to a plurality of the processors;
the data to be processed comprises a picture processing task and/or a natural language processing task;
The technical scheme provided by the embodiment of the application is suitable for multimedia content recommendation scenes such as characters, pictures (including static pictures in jpeg format and dynamic pictures in gif format), videos and the like, and is mainly exemplified by corpus vector training in natural language processing. Wherein the corpus vector in the natural language processing is from a network corpus such as Wikipedia. FIG. 2 illustrates a scene graph of an artificial intelligence based multi-dimensional parallel processing method, according to some embodiments of the application. Specifically, the scenario includes a terminal 101, a server, and a network 103.
The terminal 101 may be a desktop terminal or a mobile terminal, which may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, portable wearable devices, etc. The terminal 101 may be installed with an application that may perform natural language processing corpus training data set collection. The application related to the embodiment of the application can be a software client, a webpage, an applet and other clients, and if the application is a webpage, an applet and other clients, the background server is a background server corresponding to the software, the webpage, the applet and the like, and the specific type of the client is not limited. The user can log in the user on the application, so that the data set is collected.
The server may be a background server corresponding to an application installed on the terminal 101, for example, may be an independent physical server or a server cluster or a distributed system formed by a plurality of servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and basic cloud computing services such as big data and an artificial intelligence platform, but is not limited thereto.
The server(s) can include one or more processors 1021, memory 1022, and I/O interfaces 1023 for interaction with terminals, etc. In addition, the server may also configure a database 1024, which database 1024 may be used to store the natural language processing corpus training data set submitted by the user. The memory 1022 of the server may further store program instructions such as a machine learning library and an optimizer provided by the embodiments of the present application, where the program instructions, when executed by the processor 1021, may be used to implement the steps of determining the multidimensional parallel processing method provided by the embodiments of the present application, so as to perform multidimensional parallel processing on data to be trained, which is input by a user, and further push the trained content to a target user, so as to be used in a subsequent artificial intelligence interactive application in the terminal 101.
The terminal 101 and the server are connected through a network 103, where the network 103 includes one or more and may include various connection types, such as a wired, wireless communication link, cloud, or optical fiber cable, etc., and the specific examples of the above-mentioned network may include the internet provided by a communication provider of the terminal 101.
First, the processor 1021 reads a training data set of a natural language processing corpus submitted by a user corresponding to the terminal 101 and stored in the database 1024 through the I/O interface 1023 interacting with the terminal 101, and then the memory 1022 pushes the training data set to the terminal 101 through the I/O interface 1023 interacting with the terminal after the training is completed by executing the program instructions of the stored multidimensional parallel processing method, and displays the training data set to the user.
The multi-dimensional model parallelism comprises 2-dimensional and/or 2.5-dimensional and/or 3-dimensional grid parallelism.
FIG. 3 illustrates a block diagram of the hardware architecture of an artificial intelligence based multidimensional parallel processing system in accordance with some embodiments of the present application. Specifically, as shown in FIG. 3, it includes one or more processors, system control logic coupled to at least one of the processors, system memory coupled to the system control logic, non-volatile memory (NVM) coupled to the system control logic, and a network interface coupled to the system control logic.
In some embodiments, the processor may include one or more single-core or multi-core processors. In some embodiments, the processor may include any combination of general-purpose and special-purpose processors (e.g., graphics processor, application processor, baseband processor, etc.). In embodiments where the multidimensional parallel processing system employs an eNB (EvolvedNodeB, enhanced base station) or RAN (radio access network) controller, the processor may be configured to perform various conforming embodiments.
In some embodiments, the processor includes a GPU, a CPU, an FPGA, and a TPU. And performing resource scheduling of the processor based on the data set condition of the training task to be processed, migrating the GPU task to other non-GPU processors, and then performing corresponding control logic processing on the training task to be processed on the processor based on the computing resources of each processor.
In some embodiments, the system control logic may include any suitable interface controller to provide any suitable interface to at least one of the processors and/or any suitable device or component in communication with the system control logic.
In some embodiments, the system control logic may include one or more memory controllers to provide an interface to system memory. The system memory may be used to load and store data and/or instructions. In some embodiments the memory of the multidimensional parallel processing system may comprise any suitable volatile memory, such as a suitable Dynamic Random Access Memory (DRAM). In some embodiments, system memory may be used to load or store instructions that implement the multidimensional parallel processing described above, or system memory may be used to load or store instructions that implement an application that performs multidimensional parallel processing using the multidimensional parallel processing method described above.
The NVM/memory may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions. In some embodiments, the NVM/memory may include any suitable nonvolatile memory such as flash memory and/or any suitable nonvolatile storage device, such as at least one of an HDD (hard disk drive), a CD (compact disc) drive, or a DVD (digital versatile disc) drive. The NVM/memory may also be used to store the training models used in the multidimensional parallel processing set forth above.
The NVM/memory may include a portion of the memory resources on the device on which the multidimensional parallel processing system is installed, or it may be accessed by, but not necessarily part of, the device. For example, the NVM/memory may be accessed over a network via a network interface.
In particular, the system memory and NVM/storage may each include: a temporary copy and a permanent copy of the instruction. The instructions may include: instructions that when executed by at least one of the processors cause the multi-dimensional parallel processing system to implement the multi-dimensional parallel processing method of the present application. In some embodiments, instructions, hardware, firmware, and/or software components thereof may additionally/alternatively be disposed in system control logic, network interfaces, and/or processors.
The network interface may include a transceiver to provide a radio interface for the multidimensional parallel processing system to communicate with any other suitable device (e.g., front-end module, antenna, etc.) via one or more networks. In some embodiments, the network interface may be integrated with other components of the multidimensional parallel processing system. For example, the network interface may be integrated into at least one of a processor, a system memory, an NVM/storage, and a firmware device (not shown) having instructions that, when executed by at least one of the processors, implement the multidimensional parallel processing method of the present application.
The network interface may further include any suitable hardware and/or firmware to provide a multiple-input multiple-output radio interface. For example, the network interface may be a network adapter, a wireless network adapter, a telephone modem, and/or a wireless modem. The network interface is also used for being in communication connection with the cloud application to realize cloud data processing.
In some embodiments, at least one of the processors may be packaged together with logic for one or more controllers of the system control logic to form a System In Package (SiP). In some embodiments, at least one of the processors may be integrated on the same die with logic for one or more controllers of the system control logic to form a system on a chip (SoC).
The multidimensional parallel processing system may further include: input/output (I/O) devices. The I/O device may include a user interface to enable a user to interact with the multidimensional parallel processing system; the design of the peripheral component interface enables the peripheral component to also interact with the multidimensional parallel processing system.
In a possible implementation manner of the first aspect, the step of data parallelizing automatically manages data to be processed from a user request, distributes the data to be processed to each of the hardware processors, and further includes:
The data in data parallelism is partitioned; each node or process holds a copy of the model; each node takes a different batch (batchsize) of data and completes the forward and backward computation to obtain gradients; the processes used for training are workers; besides the workers there are parameter servers (ps servers); the workers send the computed gradients to the ps server, the ps server performs the update operation, and the updated model is transmitted back to each node;
data parallelism can expand the equivalent batchsize, i.e. the equivalent batch size, which equals the number of parallel processors multiplied by the per-processor batchsize, thereby speeding up the calculation;
and different processors in data parallelism use different data, with the parameters updated synchronously across them.
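By way of a non-limiting illustration of the data parallelism described above, the following minimal Python sketch (the names and the toy linear model are illustrative assumptions, not the claimed implementation) simulates several workers computing gradients on different data shards while a parameter server averages the gradients, updates the parameters and returns them to the workers; the equivalent batch size equals the number of workers multiplied by the per-worker batch size.

```python
# A minimal sketch (not the patented implementation) of parameter-server style
# data parallelism: each worker holds a model copy, computes a gradient on its
# own mini-batch, the parameter server averages the gradients, updates the
# parameters, and sends them back. All names are illustrative.
import numpy as np

def worker_gradient(params, batch_x, batch_y):
    # toy linear model: loss = mean((x @ w - y)^2); gradient w.r.t. w
    pred = batch_x @ params
    return 2.0 * batch_x.T @ (pred - batch_y) / len(batch_x)

def parameter_server_step(params, grads, lr=0.1):
    # ps server: average gradients from all workers, update once, return new params
    return params - lr * np.mean(grads, axis=0)

num_workers, per_worker_batch, dim = 4, 8, 3
equivalent_batch = num_workers * per_worker_batch   # equivalent batch size
rng = np.random.default_rng(0)
w_true = rng.normal(size=dim)
params = np.zeros(dim)

for step in range(100):
    grads = []
    for _ in range(num_workers):                     # each worker: its own data shard
        x = rng.normal(size=(per_worker_batch, dim))
        y = x @ w_true
        grads.append(worker_gradient(params, x, y))
    params = parameter_server_step(params, grads)    # updated model sent back to workers

print("learned:", np.round(params, 3), "target:", np.round(w_true, 3))
```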
In a possible implementation manner of the first aspect, the sequence parallel further performs segmentation on long sequence data in data to be processed, and performs sequence division on each data to be processed to a plurality of processors, which specifically includes:
Sequence parallelism extends the length of the data that a Transformer model can accept, handling long text in NLP and high-resolution pictures in CV tasks, i.e. large pictures and/or video, where a picture can be cut into small blocks and all the small blocks arranged in order form a sequence; a video is itself a sequence of pictures, and each picture can be cut again;
after the computing resources are acquired, the picture processing task and/or the feature data of the pictures are processed, and the processed data are distributed through data parallelism to the processors, including but not limited to GPUs/CPUs; the data can be further split and distributed through sequence parallelism;
if the length of a single piece of data is greater than the threshold, a single processor cannot process it; after sequence-parallel splitting, one piece of data is placed onto a plurality of processors;
through communication, the computation is made equivalent to directly processing the whole complete piece of data.
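As a non-limiting illustration of the sequence parallelism described above, the following sketch (a single-process simulation under assumed shapes, not the patented code) splits one long data item along the sequence dimension across several simulated processors and shows that combining only partial results via communication is equivalent to processing the whole sequence directly.

```python
# A minimal sketch: a long sequence that would not fit on one processor is split
# along the sequence dimension; each "processor" (a list entry here) computes a
# partial result, and exchanging only the partial results reproduces the value
# that would be obtained on the full sequence.
import numpy as np

seq_len, hidden, num_procs = 16, 4, 4
x = np.random.default_rng(0).normal(size=(seq_len, hidden))   # one long data item

chunks = np.split(x, num_procs, axis=0)           # sequence parallel: split one item

# each processor computes a partial sum over its own chunk
partial_sums = [c.sum(axis=0) for c in chunks]
partial_counts = [c.shape[0] for c in chunks]

# "communication": combine partial results -> same as the mean over the full sequence
seq_mean_parallel = np.sum(partial_sums, axis=0) / np.sum(partial_counts)
assert np.allclose(seq_mean_parallel, x.mean(axis=0))
```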
In one possible implementation of the first embodiment, the multidimensional model is parallel, performs mesh model partitioning for a training model of the data to be processed, which is scheduled to the processor, and schedules the training model to a plurality of processors, and specifically includes:
The 2-dimensional grid parallelism adopts the scalable universal matrix multiplication algorithm (SUMMA) and its matrix forms, using an efficient, extensible model-parallel scheme based on two-dimensional matrix partitioning;
the 2.5-dimensional grid parallel design is a novel quantized deep-learning model-parallel architecture that minimizes the expensive transmission loss between graphics processors and provides a flexible and efficient architecture, further improving model-parallel speed and efficiency;
the 3-dimensional grid parallelism adopts 3D parallel matrix multiplication: each matrix is divided into a number of small blocks by rows and columns, the large matrix multiplication is split into multiplications of many small matrices, and the matrix storage is spread evenly across all the processors.
The 2-dimensional grid parallelism adopts the scalable universal matrix multiplication algorithm SUMMA (Scalable Universal Matrix Multiplication Algorithm); it uses the three matrix forms of the SUMMA algorithm, an efficient, extensible model-parallel scheme based on two-dimensional matrix partitioning: C = AB, C = AB^T, C = A^T B. Based on the input model data, the following are defined:
the batch size (batchsize) is variable b, the sequence length is variable s, the hidden size is variable h, the number of attention heads is variable N, the vocabulary size is variable v, the number of partitions is variable p, the SUMMA dimension is variable q, and the number of Transformer layers is variable N.
Algorithm 1: C = AB;
Input: A_ij, B_ij;
Output: C_ij;
for l ∈ (0 … q-1): broadcast A_il within each row, broadcast B_lj within each column,
C_ij = C_ij + A_il B_lj;
return C_ij;
Algorithm 2: C = AB^T;
Input: A_ij, B_ij;
Output: C_ij;
for l ∈ (0 … q-1): broadcast B_lj within each column;
A_ij B_lj^T is reduced to C_il within each row;
return C_ij;
Algorithm 3: C = A^T B;
Input: A_ij, B_ij;
Output: C_ij;
for l ∈ (0 … q-1): broadcast A_il within each row;
A_il^T B_ij is reduced to C_lj within each column;
return C_ij;
The main steps of the SUMMA algorithm partition the p processors into a √p × √p grid, and matrices A and B are each divided into p parts. After the partitions of matrices A and B are sent to the corresponding processors, the SUMMA algorithm runs on the processors in parallel. At the end of the run, the algorithm returns the result matrix C, which is distributed across the processors in the same way as the partitioning of the A and B matrices.
The specific algorithm comprises the following steps:
Input: matrix A[a, b], matrix B[b, c];
Output: matrix C[a, c] = A × B;
divide A and B into p parts to match the shape of the processor grid;
store A_ij, B_ij in p_ij in order;
for i, j ∈ {0, …, √p − 1}, compute C_ij in parallel: for each t ∈ {0, …, √p − 1}, broadcast A_it in p_it to p_ij, broadcast B_tj in p_tj to p_ij, C_ij = C_ij + A_it * B_tj;
all C_ij are combined to obtain matrix C.
Fig. 4 is a schematic diagram of an implementation of Algorithm 1, in which a 4×4 grid is adopted and different colors represent different devices. First, each device holds sub-blocks of matrices A and B; then the outer-product term A_2B_2 is computed: each device in the second column broadcasts its sub-block of matrix A along its row, each device in the second row broadcasts its sub-block of matrix B along its column, and each device performs a local matrix computation with the broadcast sub-blocks and adds the result to the final result;
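For reference, the following single-process simulation (an illustrative sketch, not the patented implementation; block sizes and names are assumptions) reproduces the broadcast-and-accumulate pattern of Algorithm 1 (C = AB) on a q×q grid and checks that the block-wise result equals the full matrix product.

```python
# Single-process simulation of SUMMA Algorithm 1 (C = AB) on a q x q grid:
# at step l, A_il is broadcast along each row, B_lj along each column, and every
# "processor" (i, j) accumulates C_ij += A_il @ B_lj.
import numpy as np

q, blk = 4, 3                                   # 4x4 grid, 3x3 blocks
rng = np.random.default_rng(0)
A = rng.normal(size=(q * blk, q * blk))
B = rng.normal(size=(q * blk, q * blk))

def block(M, i, j):
    return M[i*blk:(i+1)*blk, j*blk:(j+1)*blk]

C = np.zeros_like(A)
for l in range(q):                              # one broadcast round per l
    for i in range(q):
        for j in range(q):
            # A_il arrives via the row broadcast, B_lj via the column broadcast
            C[i*blk:(i+1)*blk, j*blk:(j+1)*blk] += block(A, i, l) @ block(B, l, j)

assert np.allclose(C, A @ B)                    # distributed result equals the full product
```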
FIG. 5 is a structural layout of a 2.5-dimensional grid parallel scheme employing the SUMMA 2.5 algorithm, in which the processors are arranged in a 2.5-dimensional layout [p, p, d], where p is the side length of each processor layer and d is the depth.
The 2.5-dimensional grid splits a matrix A of size [a, b] and a matrix B of size [b, c], and then merges the result of size [a, c]; the following algorithm is executed:
where q represents the dimension, b represents the batchsize (batch size), h represents the hidden size, and s represents the sequence length;
FIG. 6 is a diagram of matrix partitioning and merging using the SUMMA 2.5 algorithm. Assume that in the structural layout of the processors p = 2, q = 2, d = 2; the dark area indicates one processor layer of the q = 2 structure. Matrix A[a, b] is partitioned into dq² partition matrices of shape [a/(qd), b/q], and a [q, q] grid of partition matrices is stored in each layer; matrix B[b, c] is partitioned into q² partition matrices of shape [b/q, c/q], and these [q, q] partition matrices are stored in each layer; the dq² partition matrices of shape [a/(qd), c/q] are merged into matrix C of shape [a, c].
Input: a matrix A [ a, B ], a matrix B [ B, c ];
And (3) outputting: matrix C [ a, C ] =a×b;
Divide matrices A and B into partition matrices shaped [a/(qd), b/q] and [b/q, c/q], respectively;
for i ∈ {0, …, qd−1}, j ∈ {0, …, q−1}: compute h = i % p, k = i // p, store A_ij in p_kjh, set C_ij = 0, and store C_ij in p_kjh;
for i ∈ {0, …, p−1}, j ∈ {0, …, p−1}, k ∈ {0, …, d−1}: store B_ij in p_kjh;
for i, j ∈ {0, …, p−1}, k ∈ {0, …, d−1}, in parallel: for each t ∈ {0, …, p−1}, broadcast A_itk in p_itk to p_ijk and B_tjk in p_tjk to p_ijk, C_ijk = C_ijk + A_itk * B_tjk;
all C_ijk are combined to obtain matrix C.
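The partitioning used by the SUMMA 2.5 algorithm above can be illustrated with the following sketch (the shapes and the (layer, row, column) indexing are assumptions for illustration only, not the patented code): matrix A[a, b] is cut into dq² blocks of shape [a/(qd), b/q], while the q² blocks of B[b, c] are replicated on every layer.

```python
# Illustrative partitioning for the 2.5-dimensional scheme (q = grid side, d = depth).
import numpy as np

q, d = 2, 2
a, b, c = 8, 4, 6
A = np.arange(a * b, dtype=float).reshape(a, b)
B = np.arange(b * c, dtype=float).reshape(b, c)

# A blocks: indexed by (layer k, row i, col j) -> shape [a/(q*d), b/q]
A_blocks = {
    (k, i, j): A[(k*q + i)*(a//(q*d)):(k*q + i + 1)*(a//(q*d)),
                 j*(b//q):(j + 1)*(b//q)]
    for k in range(d) for i in range(q) for j in range(q)
}
# B blocks: shape [b/q, c/q]; the same q*q blocks are stored on every layer
B_blocks = {(i, j): B[i*(b//q):(i+1)*(b//q), j*(c//q):(j+1)*(c//q)]
            for i in range(q) for j in range(q)}

print(len(A_blocks), "A blocks of shape", A_blocks[(0, 0, 0)].shape)  # 8 blocks, (2, 2)
print(len(B_blocks), "B blocks of shape", B_blocks[(0, 0)].shape)     # 4 blocks, (2, 3)
```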
The 3-dimensional grid parallel adopts 3D parallel matrix multiplication, each matrix is divided into a plurality of small blocks according to rows and columns, and the large matrix multiplication is split into the multiplication of a plurality of small matrices;
In the original version of three-dimensional matrix multiplication, each matrix is stored on only one face (a subset of the GPUs), which wastes storage resources.
Fig. 7 shows a matrix-vector parameter balancing structure according to an embodiment of the invention; in the following algorithm, load-balancing optimization is adopted for operations between a matrix and a vector: the vector b is stored evenly along the diagonal (i, l, j) of the face holding B, and C = A + b is computed;
Scaling from 8 GPUs to 64 GPUs with a fixed parameter size on each GPU, the 3D method takes the least time compared with the 1D and 2D methods (0.672 seconds for 3D, 1.560 seconds for 1D, 1.052 seconds for 2D); with the overall parameter scale fixed, the 3D method is 2.3 and 1.6 times faster than the 1D and 2D methods respectively. Figs. 8 and 9 are schematic comparisons of weak scaling efficiency and strong scaling efficiency, respectively. In the weak scaling comparison, the problem scale (amount of computation) grows with the number of processors, i.e. the parameter scale on each GPU is fixed while the number of GPUs increases; in the strong scaling comparison, the problem scale stays fixed while the number of processors increases, in order to find the most suitable number of processors for the problem, i.e. the time consumed is as short as possible without incurring excessive overhead. The result is that 3-dimensional model parallelism has the lowest average time consumption, being 2.32 and 1.57 times faster than the 1-dimensional and 2-dimensional versions respectively.
The data parallel, sequential parallel+2/2.5/3-dimensional meshing (2/2.5/3-dimensional model parallel) can constitute 4/4.5/5-dimensional parallel, which can be further recombined with the pipeline parallel into 5/5.5/6-dimensional parallel.
The specific dimension selection among the 2/2.5/3-dimensional models of multi-dimensional grid parallelism is determined by the properties of the processors. Specifically, 2-dimensional model parallelism requires a×a processors, for example 2×2=4, 3×3=9, 4×4=16; 2.5-dimensional model parallelism requires a×a×d processors, for example 2×2×1=4, 2×2×2=8, 2×2×3=12; 3-dimensional model parallelism requires a×a×a processors, for example 2×2×2=8, 3×3×3=27.
Even when the number of processors is 8 in both cases, the concrete operation of 2.5-dimensional model parallelism differs from that of 3-dimensional model parallelism; likewise, with 4 processors, the concrete operation of 2.5-dimensional model parallelism differs from that of 2-dimensional model parallelism.
Some processor counts are compatible with several of these model-parallel schemes; for example, with 64 processors all three are applicable. Which one to select must be determined by the actual running performance (speed), because different running environments differ in processor performance, memory, communication bandwidth and processor network topology, and the models and data used by different tasks also differ greatly.
The model for the data to be processed is parallelized through 2/2.5/3-dimensional model parallelism, and the model parameters are decomposed across the processors. Because the capacity of a single machine is limited, after decomposition the combined capacity of all machines holds the model, so a larger overall model can be accommodated and communication of the parameters during computation is reduced.
The data to be processed, such as pictures/sentences, are fed into the model, and the processors communicate with each other during the forward computation, which is equivalent to computing with the complete long-sequence data. The forward computation produces an output result, which is compared with the training data labels to obtain the loss function value; the gradient is then computed backward and used to update the model parameters in the next step. Both the forward and the backward computation can be parallelized through the 2/2.5/3-dimensional model parallelism, which speeds up the computation.
The multi-dimensional parallel processing method can be further combined with the pipeline parallel into 5/5.5/6-dimensional parallel.
In pipeline parallelism, the model is split into multiple stages, each stage is deployed on a different device, and the stages are connected in series, with the output of the former stage serving as the input of the latter stage; pipeline parallelism is essentially cross-layer model parallelism.
In pipeline parallelism, each device is responsible for the forward and the corresponding backward computation of a part of the layers; in a training scenario, each device has bubble waiting, because the next step can only be performed after the backward pass of the current step has finished; this bubble waiting keeps the utilization of pipeline-parallel devices from being high; the device utilization can be improved by increasing the batch size of each training step and cutting it into several small micro-batches, as illustrated in the sketch below.
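The micro-batch scheduling idea can be illustrated with the following minimal sketch (the stage functions and the schedule are illustrative assumptions, not the patented scheduler): the model is split into three stages, a batch is cut into micro-batches, and at each tick a stage works on the micro-batch handed over by the previous stage, so several micro-batches are in flight at once and the bubbles shrink.

```python
# Illustrative pipeline-parallel forward pass with micro-batches (assumed stages).
def stage1(x): return x + 1          # stand-in for the first block of layers
def stage2(x): return x * 2          # ... second block
def stage3(x): return x - 3          # ... third block
stages = [stage1, stage2, stage3]

batch = list(range(8))
micro_batches = [batch[i:i+2] for i in range(0, len(batch), 2)]   # 4 micro-batches

# schedule: at tick t, stage s works on micro-batch (t - s) if it exists
in_flight = [None] * len(stages)
outputs = []
schedule_len = len(micro_batches) + len(stages) - 1
for t in range(schedule_len):
    for s in reversed(range(len(stages))):            # later stages consume first
        mb_index = t - s
        if 0 <= mb_index < len(micro_batches):
            data = micro_batches[mb_index] if s == 0 else in_flight[s - 1]
            in_flight[s] = [stages[s](v) for v in data]
    if t >= len(stages) - 1:
        outputs.append(in_flight[-1])

print(outputs)   # each micro-batch has passed through all three stages in order
```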
In a possible implementation of the first embodiment, the multidimensional parallel processing method further includes, after data parallelism, sequence parallelism or pipeline parallelism and multidimensional model parallelism, selecting among a plurality of optimizers according to the attributes of the data to be processed and the system running environment;
the plurality of optimization algorithms comprise a LAMB optimizer and/or a LARS optimizer and/or a ConAdv optimizer and/or a La-Lars optimizer;
the LAMB, LARS and ConAdv optimizers are suitable for large-batch training,
The LARS is used for processing the data to be processed related to the computer vision;
the LAMB is used for processing data to be processed related to natural language processing;
ConAdv is suitable for processing the data to be processed with high speed requirement and low precision requirement;
the La-Lars is suitable for processing the data to be processed, which has narrow communication bandwidth and high network communication cost.
Although data parallelism can accelerate training by increasing the (equivalent) batchsize, it can make optimization difficult, and an optimizer designed for large batches must be used to ensure good convergence. LAMB/LARS/ConAdv are all suitable for large-batch training: LARS is best suited to computer-vision related tasks (extending the CV task batchsize to 32K), LAMB is best suited to natural-language-processing related tasks (extending the NLP task batchsize to 64K), and ConAdv is suitable for CV tasks pursuing extreme speed with slightly lower accuracy requirements (extending the CV task batchsize to 96K at a slight loss of accuracy).
Furthermore, in data parallelism the gradients need to be transferred through communication and the model parameters need to be updated synchronously, so the communication traffic is extremely large (proportional to the model size, i.e. the number of model parameters), especially for today's ever larger models. Therefore, if the communication bandwidth (the amount of data that can be transmitted simultaneously) of the system is small, the running speed is severely slowed down, and an optimizer for large batches with small communication traffic must be selected.
The LAMB optimizer and/or LARS optimizer and/or ConAdv optimizer and/or La-Lars optimizer are all extensible large-scale optimizers required for training large AI models, and different optimizers can be selected as needed: LAMB/LARS/ConAdv are all suitable for large-batch training, LARS is best suited to computer-vision related tasks, LAMB is best suited to natural-language-processing related tasks, and ConAdv further extends the maximum batch of computer-vision training.
APS and La-Lars are suitable for situations where the communication bandwidth is relatively narrow and the network communication cost becomes a bottleneck: APS mainly uses low-precision gradients, while La-Lars mainly uses gradient compression. APS can require only about 1/4 of the traffic with almost no loss of accuracy. La-Lars further compresses the traffic to about one thousandth to accommodate a narrow communication bandwidth, although accuracy is slightly lost.
Fig. 10 is a statistical graph of experimental results of the LAMB algorithm: ADAMW cannot converge under training with a mixed batchsize (64k/32k), while LAMB reaches an acceleration ratio of 101.8% (a 65.2-fold improvement in computation speed with 64 times the computing resources).
La-Lars is a gradient sparsification algorithm, see FIG. 11, i.e., only important gradients are sent each time the gradients are exchanged. The remaining gradients will accumulate locally and be sent in the future.
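A minimal sketch of this gradient sparsification idea follows (an illustration of top-k selection with local accumulation under assumed sizes, not the patented La-Lars code): only the largest-magnitude entries of the accumulated gradient are communicated, and the remainder is kept locally for later steps.

```python
# Gradient sparsification with local accumulation (illustrative only).
import numpy as np

def sparsify_and_accumulate(grad, residual, keep_ratio=0.01):
    acc = residual + grad                        # add new gradient to local accumulation
    k = max(1, int(keep_ratio * acc.size))
    idx = np.argpartition(np.abs(acc), -k)[-k:]  # indices of the k largest magnitudes
    sent = np.zeros_like(acc)
    sent[idx] = acc[idx]                         # only these values are communicated
    new_residual = acc - sent                    # the rest stays local for future steps
    return sent, new_residual

rng = np.random.default_rng(0)
residual = np.zeros(1000)
for step in range(5):
    grad = rng.normal(size=1000)
    sent, residual = sparsify_and_accumulate(grad, residual)
    print(f"step {step}: sent {np.count_nonzero(sent)} of {grad.size} entries")
```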
In order to speed up training, one of the simplest methods is to increase the number of compute nodes. But when the number of nodes is large, the network communication cost becomes a bottleneck. Meanwhile, when batchsize exceeds a certain size, the generalization performance of the neural network may be deteriorated.
LARS addresses the performance degradation caused by large-scale deep-learning training. It is a layer-wise adaptive rate scaling optimizer that can scale the batch size to 32K without loss of performance. However, because of the sparse representation of gradients and local gradient accumulation, it is difficult to simply use DGC and LARS together, as this leads to gradient staleness problems.
The present scheme provides the LA-LARS algorithm, which converges faster and loses less performance than using DGC and LARS together directly. With a 0.1% compression ratio, LA-LARS outperforms the other baseline optimizers on the MNIST and CIFAR-10 datasets. On the ImageNet dataset, it needs only 60%-70% of the training time to achieve performance similar to the baseline optimizers.
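For reference, the layer-wise adaptive rate scaling used by LARS (and combined with sparsification in LA-LARS) can be sketched as follows; this is a common formulation written down as an assumption for illustration, not necessarily the exact update rule of the patent.

```python
# Layer-wise adaptive rate scaling (LARS) update, common momentum-free formulation:
# each layer's step size is scaled by the ratio of its weight norm to its gradient norm.
import numpy as np

def lars_update(weights, grad, base_lr=0.01, trust_coef=0.001, weight_decay=1e-4, eps=1e-9):
    w_norm = np.linalg.norm(weights)
    g_norm = np.linalg.norm(grad + weight_decay * weights)
    # layer-wise local learning rate: large weights / small gradients -> larger steps
    local_lr = trust_coef * w_norm / (g_norm + eps) if w_norm > 0 else 1.0
    return weights - base_lr * local_lr * (grad + weight_decay * weights)

w = np.random.default_rng(0).normal(size=256)
g = np.random.default_rng(1).normal(size=256)
w_new = lars_update(w, g)
print("update norm:", np.linalg.norm(w_new - w))
```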
Second embodiment referring to fig. 12, an embodiment of the present application provides an artificial intelligence based multidimensional parallel processing system for a hardware processor, the system being implemented in a software platform using a machine learning library;
The data parallel module is used for automatically managing the data to be processed from the user request and distributing the data to be processed to each hardware processor;
The sequence parallel module is used for further segmenting long sequence data in the data to be processed, and dividing each data to be processed into sequences and putting the sequences into a plurality of processors;
The pipeline parallel module is used for splitting the model into a plurality of sections, arranging each section on different hardware processors, and connecting the sections in series according to the sequence of the model, wherein the output of the former section is used as the input of the latter section;
The multidimensional model parallel module is used for performing grid model division on the training model for the data to be processed that is scheduled to the processors, and scheduling the training model to a plurality of the processors;
the data to be processed comprises a picture processing task and/or a natural language processing task;
The multi-dimensional model parallelism comprises 2-dimensional and/or 2.5-dimensional and/or 3-dimensional grid parallelism.
The multidimensional parallel processing system based on artificial intelligence runs on a cloud end and performs communication interaction with local data;
the multidimensional parallel processing system based on artificial intelligence is executed on a software platform, wherein the software platform comprises, but is not limited to, CUDA and ROCm;
The artificial intelligence based multidimensional parallel processing system uses a machine learning library including, but not limited to, TensorFlow, Keras and PyTorch.
In a possible implementation of the second embodiment, the data parallel module automatically manages data to be processed from a user request, distributes the data to be processed to each of the hardware processors, and further includes:
The data in data parallelism is partitioned: each node or process holds a copy of the model, each node takes a batch of different data and then completes forward and backward computation respectively to obtain gradients. The processes used for training are workers; besides the workers there is also a parameter server (ps server). The workers send the computed gradients to the parameter server, the parameter server performs the update operation, and the updated model is transmitted back to each node;
data parallelism can expand the equivalent batch size, i.e. the equivalent batch size equals the number of parallel processors multiplied by the per-processor batch size, thereby speeding up the computation.
Different processors in data parallelism use different data and synchronously update the model parameters.
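To make the worker / parameter-server flow above concrete, the following single-process Python sketch (assuming PyTorch; the helper worker_gradients and all sizes are illustrative, not the patented implementation) lets several simulated workers compute gradients on different data shards, averages them at the parameter server and applies one update:

import torch

model = torch.nn.Linear(16, 4)                               # replicated model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)      # update applied at the ps server
num_workers = 4
per_worker_batch = 8   # equivalent batch size = num_workers * per_worker_batch = 32

def worker_gradients(data, labels):
    # each worker runs forward and backward on its own shard and returns gradients
    model.zero_grad()
    loss = torch.nn.functional.mse_loss(model(data), labels)
    loss.backward()
    return [p.grad.clone() for p in model.parameters()]

grads_from_workers = [
    worker_gradients(torch.randn(per_worker_batch, 16), torch.randn(per_worker_batch, 4))
    for _ in range(num_workers)
]

# the parameter server averages the gradients, applies the update,
# and the updated model is (conceptually) sent back to every worker
for p, *worker_grads in zip(model.parameters(), *grads_from_workers):
    p.grad = torch.stack(worker_grads).mean(dim=0)
optimizer.step()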
In one possible implementation of the second embodiment, the sequence parallel module further segments long-sequence data in the data to be processed and divides each piece of data to be processed into sequence segments placed on a plurality of processors, which specifically includes:
Sequence parallelism extends the length of the data that a Transformer model can receive, handling long text in NLP and high-resolution pictures in CV tasks, i.e. large pictures and/or video. A picture can be cut into small blocks, and all the small blocks arranged in order form a sequence; a video is itself a sequence of pictures, and each picture can again be cut into blocks;
After the computing resources are acquired, the picture processing task and/or the feature data of the pictures are processed and distributed to each processor through data parallelism, where the processors include but are not limited to GPUs and CPUs; the data can then be further segmented and distributed through sequence parallelism;
if the length of a single piece of data exceeds a threshold, a single processor cannot process it; after the sequence is split in parallel, one piece of data is placed on multiple processors;
through communication, the computation is equivalent to directly processing the whole complete piece of data.
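A minimal sketch of this sequence splitting (assuming PyTorch; the helper split_sequence and the device labels are illustrative, and the inter-processor communication performed during the forward pass is omitted):

import torch

def split_sequence(sequence: torch.Tensor, num_devices: int):
    # split one long sequence [seq_len, hidden] into contiguous chunks,
    # one chunk per processor, as in sequence parallelism
    chunks = torch.chunk(sequence, num_devices, dim=0)
    # in a real system each chunk would be placed on a different GPU/CPU/TPU;
    # here we only record the intended placement
    return {f"device_{i}": chunk for i, chunk in enumerate(chunks)}

long_sequence = torch.randn(65536, 512)      # one sample too long for a single processor
placement = split_sequence(long_sequence, num_devices=8)
for device, chunk in placement.items():
    print(device, tuple(chunk.shape))         # each device holds an 8192 x 512 slice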
In one possible implementation of the second embodiment, the performing mesh model division on the training model of the data to be processed, which is scheduled to the processor, schedules the training model to a plurality of processors specifically includes:
The 2-dimensional grid parallelism adopts the scalable dense matrix multiplication algorithm SUMMA and its matrix forms, an efficient and extensible model-parallel approach based on two-dimensional matrix partitioning;
the 2.5-dimensional grid parallel design is a novel quantized deep-learning model-parallel architecture that minimizes expensive transmission losses between graphics processors and provides a flexible and efficient architecture, further improving model-parallel speed and efficiency;
the 3-dimensional grid parallelism adopts 3D parallel matrix multiplication: each matrix is divided into a number of small blocks by rows and columns, a large matrix multiplication is split into multiplications of several small matrices, and matrix storage is flattened across all the processors.
The 2-dimensional grid parallelism adopts the Scalable Universal Matrix Multiplication Algorithm (SUMMA) for dense matrices in its three matrix forms, C = AB, C = AB^T and C = A^T B, an efficient and extensible model-parallel approach based on two-dimensional matrix partitioning. Based on the input model data, the variables are defined as follows:
the batch size is variable b, the sequence length is variable s, the hidden size is variable h, the number of attention heads is variable n, the vocabulary size is variable v, the number of partitions is variable p, the SUMMA dimension is variable q, and the number of Transformer layers is variable N.
Algorithm 1: C = AB;
Input: A_ij, B_ij;
Output: C_ij;
For l ∈ (0 … q−1): broadcast A_il within each row and broadcast B_lj within each column,
C_ij = C_ij + A_il · B_lj;
Return C_ij;
Algorithm 2: C = AB^T;
Input: A_ij, B_ij;
Output: C_ij;
For l ∈ (0 … q−1): broadcast B_lj within each column;
reduce the local products within each row to C_il;
Return C_ij;
Algorithm 3: C = A^T B;
Input: A_ij, B_ij;
Output: C_ij;
For l ∈ (0 … q−1): broadcast A_il within each row;
reduce C_ij within each column to C_lj;
Return C_ij;
Algorithms 1–3 define the three forms of matrix C.
Main steps of the SUMMA algorithm: the p processors are divided into a √p × √p grid, and matrices A and B are each divided into p parts. After the partitions of matrices A and B are sent to the corresponding processors, the SUMMA algorithm runs on the processors in parallel. At the end of the run the algorithm returns the result matrix C, which is distributed across the processors in a way similar to the partitioning of the A and B matrices.
The specific algorithm comprises the following steps:
Input: matrix A[a, b], matrix B[b, c]
Output: matrix C[a, c] = A × B
Divide A and B into p parts to match the shape of the processor grid;
Store A_ij and B_ij in processor p_ij in turn;
For i, j ∈ {0, …, p^(1/2) − 1}, compute C_ij in parallel: for each t, broadcast A_it held by p_it to p_ij and broadcast B_tj held by p_tj to p_ij, then C_ij = C_ij + A_it · B_tj;
Combine all C_ij to obtain matrix C.
The implementation of Algorithm 1 uses a 4×4 grid, with different colors representing different devices. First, each device holds sub-blocks of matrices A and B; then the outer-product contribution A_2·B_2 is computed: each device in the second column broadcasts its sub-block of matrix A along its row, each device in the second row broadcasts its sub-block of matrix B along its column, and every device performs a local matrix multiplication with the broadcast sub-blocks and adds the result into the final output;
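The following NumPy sketch simulates Algorithm 1 (C = AB) on a q×q logical grid; the broadcasts are simulated by indexing the block arrays, so it illustrates the block schedule rather than real inter-processor communication:

import numpy as np

def summa_2d(A: np.ndarray, B: np.ndarray, q: int) -> np.ndarray:
    # Simulate SUMMA C = A @ B on a q x q logical processor grid:
    # at step l every row broadcasts A_il, every column broadcasts B_lj,
    # and each processor accumulates C_ij += A_il @ B_lj.
    a, b = A.shape
    _, c = B.shape
    assert a % q == 0 and b % q == 0 and c % q == 0
    A_blocks = [[A[i*a//q:(i+1)*a//q, j*b//q:(j+1)*b//q] for j in range(q)] for i in range(q)]
    B_blocks = [[B[i*b//q:(i+1)*b//q, j*c//q:(j+1)*c//q] for j in range(q)] for i in range(q)]
    C_blocks = [[np.zeros((a//q, c//q)) for _ in range(q)] for _ in range(q)]
    for l in range(q):                                    # one broadcast round per step l
        for i in range(q):
            for j in range(q):
                C_blocks[i][j] += A_blocks[i][l] @ B_blocks[l][j]
    return np.block(C_blocks)

A = np.random.rand(8, 8)
B = np.random.rand(8, 8)
assert np.allclose(summa_2d(A, B, q=4), A @ B)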
The structural layout of the 2.5-dimensional grid parallel scheme adopts the SUMMA-2.5 algorithm: according to the number of processors, the processors are arranged in a 2.5-dimensional layout of [p, p, d], where d is the depth.
The 2.5-dimensional grid partitions a matrix A of size [a, b] and a matrix B of size [b, c] and then merges the results into a matrix C of size [a, c]; the following algorithm is executed:
where q represents the grid dimension, b the batch size, h the hidden size and s the sequence length;
Matrix partitioning and combination use the SUMMA-2.5 algorithm. Assume that in the processor layout p = 2, q = 2 and d = 2, and that the dark area indicates one layer of the q = 2 structure built by the processors. Matrix A[a, b] is partitioned into d·q² partition matrices of shape [a/(qd), b/q], with [q, q] partition matrices stored in each layer; matrix B[b, c] is partitioned into q² partition matrices of shape [b/q, c/q], with [q, q] partition matrices stored in each layer; the d·q² resulting partition matrices are combined into matrix C of shape [a, c].
Input: matrix A[a, b], matrix B[b, c];
Output: matrix C[a, c] = A × B;
Divide matrices A and B into partition matrices shaped [a/(qd), b/q] and [b/q, c/q] respectively;
For i ∈ {0, …, qd−1}, j ∈ {0, …, q−1}: compute h = i % p and k = i // p, store A_ij in p_kjh, set C_ij = 0 and store C_ij in p_kjh;
For i ∈ {0, …, p−1}, j ∈ {0, …, p−1}, k ∈ {0, …, d−1}: store B_ij in p_kjh;
For i, j ∈ {0, …, p−1}, k ∈ {0, …, d−1}, in parallel: for each t ∈ {0, …, p−1}, broadcast A_itk held by p_itk to p_ijk and B_tjk held by p_tjk to p_ijk, then C_ijk = C_ijk + A_itk · B_tjk;
Combine all C_ijk to obtain matrix C.
The 3-dimensional grid parallelism adopts 3D parallel matrix multiplication: each matrix is divided into a number of small blocks by rows and columns, and a large matrix multiplication is split into multiplications of several small matrices;
in the original version of three-dimensional matrix multiplication, each matrix is stored on only one face of the processor cube (a subset of the GPUs), which wastes storage resources.
In the matrix-vector parameter-balanced structure of this embodiment, the following algorithm applies load-balancing optimization to operations between a matrix and a vector: the vector b is stored evenly along the diagonal (i, l, j) of the face that holds b, and C = A + b is computed;
Scaling from 8 GPUs to 64 GPUs with a fixed parameter size per GPU, the 3D method takes the least time compared with 1D and 2D (0.672 seconds for 3D versus 1.560 seconds for 1D and 1.052 seconds for 2D); with the overall parameter scale fixed, the 3D method is 2.3 and 1.6 times faster than the 1D and 2D methods respectively. Weak scaling efficiency means the problem size (amount of computation) grows with the number of processors, i.e. the parameter size per GPU is fixed while the number of GPUs increases; strong scaling efficiency keeps the problem size unchanged while increasing the number of processors, in order to find the number of processors best suited to the problem, i.e. taking as little time as possible without introducing too much overhead. Overall, 3-dimensional model parallelism has the lowest average time consumption, being 2.32 and 1.57 times faster than the 1- and 2-dimensional versions respectively.
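A minimal NumPy sketch of the 3D idea, under the assumption that logical processor (i, j, k) computes the block product A_ik·B_kj and the partial results are reduced over the depth index k; the processor cube is simulated in a single process:

import numpy as np

def matmul_3d(A: np.ndarray, B: np.ndarray, q: int) -> np.ndarray:
    # Simulate 3D parallel matrix multiplication on a q x q x q logical cube:
    # processor (i, j, k) computes A_ik @ B_kj, and the partial results are
    # summed (reduced) over k to form block C_ij.
    a, b = A.shape
    _, c = B.shape
    assert a % q == 0 and b % q == 0 and c % q == 0
    A_blk = lambda i, k: A[i*a//q:(i+1)*a//q, k*b//q:(k+1)*b//q]
    B_blk = lambda k, j: B[k*b//q:(k+1)*b//q, j*c//q:(j+1)*c//q]
    C_blocks = [[sum(A_blk(i, k) @ B_blk(k, j) for k in range(q))   # reduction along depth
                 for j in range(q)] for i in range(q)]
    return np.block(C_blocks)

A = np.random.rand(6, 6)
B = np.random.rand(6, 6)
assert np.allclose(matmul_3d(A, B, q=3), A @ B)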
Data parallelism and sequence parallelism combined with 2/2.5/3-dimensional grid parallelism (2/2.5/3-dimensional model parallelism) constitute 4/4.5/5-dimensional parallelism, which can be further combined with pipeline parallelism into 5/5.5/6-dimensional parallelism.
The specific dimensionality of the 2/2.5/3-dimensional model parallelism in multi-dimensional grid parallelism is chosen according to the properties of the processors. Specifically, 2-dimensional model parallelism requires a×a processors, e.g. 2×2=4, 3×3=9, 4×4=16; 2.5-dimensional model parallelism requires a×a×d processors, e.g. 2×2×1=4, 2×2×2=8, 2×2×3=12; 3-dimensional model parallelism requires a×a×a processors, e.g. 2×2×2=8, 3×3×3=27.
Even with the same 8 processors, the concrete operation of 2.5-dimensional model parallelism differs from that of 3-dimensional model parallelism; likewise, with the same 4 processors, the concrete operation of 2.5-dimensional model parallelism also differs from that of 2-dimensional model parallelism.
Some numbers of parallel processors are compatible with several model-parallel configurations; with 64 processors, for example, all three are feasible. Which one to select must be decided according to the actual running performance (speed), because different running environments bring differences in processor performance, memory, communication bandwidth, processor network topology and so on, and the models and data used by different tasks also differ considerably.
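As an illustration of this selection constraint, the hypothetical helper below enumerates which a×a, a×a×d and a×a×a grid shapes exactly use a given processor count; the final choice among the feasible shapes would still be made by benchmarking, as stated above:

def grid_options(num_processors: int):
    # list 2D (a*a), 2.5D (a*a*d) and 3D (a*a*a) grid shapes that exactly use
    # the given number of processors; illustrative helper only
    options = {"2d": [], "2.5d": [], "3d": []}
    for a in range(2, num_processors + 1):
        if a * a == num_processors:
            options["2d"].append((a, a))
        if a * a * a == num_processors:
            options["3d"].append((a, a, a))
        for d in range(1, num_processors + 1):
            if a * a * d == num_processors:
                options["2.5d"].append((a, a, d))
    return options

print(grid_options(64))   # 64 processors fit 8x8, 4x4x4 and several a*a*d layouts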
The model for the data to be processed is run through 2/2.5/3-dimensional model parallelism: the model parameters are decomposed across the processors, and because the capacity of a single machine is limited, after decomposition the combined capacity of all machines is used to hold the model. This allows a larger model to be accommodated as a whole and reduces the communication of parameters during computation.
The data to be processed, such as pictures or sentences, is input into the model, and the processors communicate with each other during the forward computation, which is equivalent to computing on the complete long-sequence data. The forward computation produces an output, which is compared with the training data label to obtain the loss function value; gradients are then computed backward and used to update the model parameters in the next step. Both the forward and the backward computation can use 2/2.5/3-dimensional model parallelism to accelerate the calculation.
The multidimensional parallel processing system can be further combined with pipeline parallelism into 5/5.5/6-dimensional parallelism.
In pipeline parallelism, the model is split into multiple stages, each stage is deployed on a different device and the stages are connected in series, with the output of the previous stage serving as the input of the next; pipeline parallelism is a form of cross-layer model parallelism.
In pipeline parallelism, each device is responsible for the forward and corresponding backward computation of a subset of layers. In the training scenario, each device incurs bubble (idle) time because the next step can only start after the backward pass of the current step finishes; this bubble waiting keeps pipeline-parallel device utilization low. Utilization can be improved by enlarging the batch size of each training step and cutting it into several micro-batches.
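A schematic sketch of the micro-batch idea (assuming PyTorch; the stages are plain layers standing in for devices, and the loop runs sequentially where a real pipeline schedule would overlap the iterations to shrink the bubbles):

import torch

# the model is split into stages that would live on different devices;
# the batch is cut into micro-batches so that the stages can overlap work
stages = [torch.nn.Linear(32, 32), torch.nn.Linear(32, 32), torch.nn.Linear(32, 8)]

def pipeline_forward(batch: torch.Tensor, num_micro_batches: int = 4):
    micro_batches = torch.chunk(batch, num_micro_batches, dim=0)
    outputs = []
    for mb in micro_batches:          # in a real pipeline these iterations overlap across devices
        activation = mb
        for stage in stages:          # the output of one stage feeds the next stage
            activation = stage(activation)
        outputs.append(activation)
    return torch.cat(outputs, dim=0)

out = pipeline_forward(torch.randn(64, 32))
print(out.shape)                      # torch.Size([64, 8])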
In a possible implementation of the second embodiment, the multidimensional parallel processing system further selects among a plurality of optimizers according to the attributes of the data to be processed, after the data parallelism, the sequence parallelism or the pipeline parallelism and the multidimensional model parallelism;
the plurality of optimization algorithms comprise the LAMB optimizer and/or the LARS optimizer and/or the ConAdv optimizer and/or the La-Lars optimizer;
the LAMB, LARS and ConAdv optimizers are suitable for large-batch training;
the LARS optimizer is used for processing the data to be processed related to computer vision;
the LAMB optimizer is used for processing the data to be processed related to natural language processing;
the ConAdv optimizer is suited to data to be processed that demands very high speed and tolerates slightly lower accuracy;
the La-Lars optimizer is suited to data to be processed where the communication bandwidth is narrow and the network communication cost is high.
Although data parallelism can accelerate training by increasing the (equivalent) batch size, it can make optimization difficult, and optimizers designed for large batches must be used to ensure good convergence. LAMB, LARS and ConAdv are all suitable for large-batch training: LARS is best suited to computer-vision-related tasks (extending the CV task batch size to 32K), LAMB is best suited to natural-language-processing-related tasks (extending the NLP task batch size to 64K), and ConAdv is suited to CV tasks pursuing extreme speed with slightly lower accuracy requirements (extending the CV task batch size to 96K at a slight loss of accuracy).
Furthermore, under data parallelism the gradients need to be transferred through communication and the model parameters need to be updated synchronously, so the communication traffic is extremely large (proportional to the model size, i.e. the number of model parameters), especially for today's ever-larger models. Therefore, if the communication bandwidth of the system (the amount of data that can be transmitted simultaneously) is small, the running speed is severely slowed down, and a large-batch optimizer with small communication traffic needs to be selected in this case.
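As a rough illustration of the layer-wise adaptive scaling behind LARS-type large-batch optimizers (the published LARS rule in simplified form, assuming PyTorch; this is not the patent's optimizer code), each layer's update is scaled by a trust ratio derived from the weight and gradient norms:

import torch

def lars_step(params, lr=0.1, weight_decay=1e-4, trust_coeff=0.001):
    # simplified layer-wise adaptive rate scaling (LARS) update:
    # each layer gets a local learning rate proportional to ||w|| / ||g||,
    # which is what allows very large batch sizes without divergence
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            g = p.grad + weight_decay * p
            w_norm, g_norm = p.norm(), g.norm()
            local_lr = trust_coeff * w_norm / (g_norm + 1e-12) if w_norm > 0 else 1.0
            p.add_(g, alpha=-lr * float(local_lr))

layer = torch.nn.Linear(16, 16)
loss = layer(torch.randn(4, 16)).pow(2).mean()
loss.backward()
lars_step(layer.parameters())        # one LARS-style parameter update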
A third embodiment of the present application provides an artificial intelligence based multidimensional parallel processing apparatus, which is characterized by comprising:
A memory for storing instructions for execution by one or more processors of the system, and
A processor, one of the processors of the system, for executing the instructions to implement any one of the possible artificial intelligence based multidimensional parallel processing methods of the first aspect described above.
A fourth embodiment of the present application provides a computer readable storage medium encoded with a computer program, wherein the computer readable storage medium has instructions stored thereon, the instructions when executed on a computer cause the computer to perform any one of the possible artificial intelligence based multidimensional parallel processing methods of the first aspect.
It should be noted that each method embodiment of the present application may be implemented in software, hardware, firmware, etc. Regardless of whether the application is implemented in software, hardware, or firmware, the instruction code may be stored in any type of computer-accessible memory (e.g., permanent or modifiable, volatile or non-volatile, solid or non-solid, fixed or removable media, etc.). Likewise, the memory may be, for example, Programmable Array Logic ("PAL"), Random Access Memory (RAM), Programmable Read-Only Memory ("PROM"), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory ("EEPROM"), a magnetic disk, an optical disc, a Digital Versatile Disc ("DVD"), and the like.
It should be noted that, in the embodiments of the present application, each unit/module mentioned in each device is a logic unit/module, and in physical terms, one logic unit may be a physical unit, or may be a part of a physical unit, or may be implemented by a combination of multiple physical units, where the physical implementation manner of the logic unit itself is not the most important, and the combination of functions implemented by the logic units is the key to solve the technical problem posed by the present application. Furthermore, in order to highlight the innovative part of the present application, the above-described device embodiments of the present application do not introduce units that are less closely related to solving the technical problem posed by the present application, which does not indicate that the above-described device embodiments do not have other units.
It should be noted that in the claims and the description of this patent, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
While the application has been shown and described with reference to certain preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the application.

Claims (4)

1. An artificial intelligence-based multidimensional parallel processing method is used for a hardware processor, and is executed on a software platform and uses a machine learning library;
Characterized in that the method comprises the steps of:
Data parallelism: automatically managing the data to be processed from a user request and distributing the data to be processed to a plurality of processors;
sequence parallelism: further splitting long-sequence data in the data to be processed, dividing each piece of data to be processed into sequence segments and placing them on a plurality of processors;
pipeline parallelism: splitting the model into multiple stages, deploying each stage on a different processor and connecting the stages in series in model order, with the output of the previous stage serving as the input of the next;
multidimensional model parallelism: performing grid model division on the training model for the data to be processed that is scheduled to a processor, and scheduling the training model to a plurality of processors;
the data to be processed comprises a picture processing task and/or a natural language processing task;
the multi-dimensional model parallelism comprises 2-dimensional and/or 2.5-dimensional and/or 3-dimensional grid parallelism;
the data parallelism, automatic management from user request wait to process data, will wait to process data distribute to a plurality of processors still include:
The data in data parallelism is partitioned: each node or process holds a copy of the model, each node takes a batch of different data and completes forward and backward computation respectively to obtain gradients; the processes used for training are workers, and besides the workers there is also a parameter server (ps server); the workers send the computed gradients to the parameter server, the parameter server performs the update operation, and the updated model is returned to each node;
the data parallelism can enlarge the equivalent batch size, namely the number of parallel processors multiplied by the single-processor batch size, thereby accelerating the calculation;
the sequence parallel further segments long sequence data in the data to be processed, and each data to be processed is subjected to sequence division and is put into a plurality of processors, and the method specifically comprises the following steps:
The sequence parallelism extends the length of the data that a Transformer model can receive, processing long texts in NLP and large pictures and/or videos in CV tasks, wherein a picture is cut into small blocks and all the small blocks arranged in order form a sequence; a video is itself a sequence of pictures, and each picture is cut again; after the computing resources are acquired, the picture processing task and/or the feature data of the pictures are processed and distributed to each processor through data parallelism, the processors including but not limited to GPU, CPU and TPU, and the data can be further segmented and distributed through sequence parallelism;
If the single data length is greater than the threshold value, the single processor cannot process the single data, and after the sequence is segmented in parallel, one data is put into a plurality of processors;
the multi-dimensional model parallelism, aiming at the training model of the data to be processed which is scheduled to the processor, executing grid model division, and scheduling the training model to a plurality of processors, wherein the method specifically comprises the following steps:
2-dimensional grid parallelism adopts the scalable dense matrix multiplication algorithm SUMMA and its matrix forms, an efficient and extensible model-parallel approach based on two-dimensional matrix partitioning;
2.5D grid parallel design a quantized deep learning model parallel architecture, minimize the transmission loss between graphic processors;
3D parallel matrix multiplication is adopted in 3D grid parallel, each matrix is divided into a plurality of small blocks according to rows and columns, large matrix multiplication is split into multiplication of a plurality of small matrices, and matrix storage is flattened on a plurality of processors.
2. An artificial intelligence-based multidimensional parallel processing system comprises a hardware processor, wherein the system is executed on a software platform and uses a machine learning library;
The data parallel module is used for automatically managing the data to be processed from the user request and distributing the data to be processed to a plurality of processors;
the sequence parallel module is used for further segmenting long sequence data in the data to be processed, and dividing each data to be processed into sequences and putting the sequences into a plurality of processors;
The pipeline parallel module is used for splitting the model into a plurality of sections, arranging each section on a plurality of different processors, and connecting the sections in series according to the sequence of the model, wherein the output of the former section is used as the input of the latter section;
the multidimensional model parallel module is used for executing grid model division on a training model of the data to be processed, which is scheduled to the processors, and scheduling the training model to the plurality of processors;
the data to be processed comprises a picture processing task and/or a natural language processing task;
the multi-dimensional model parallelism comprises 2-dimensional and/or 2.5-dimensional and/or 3-dimensional grid parallelism;
The data parallel module automatically manages data to be processed from a user request, and distributes the data to be processed to a plurality of processors, and the data parallel module further comprises:
The data in data parallelism is partitioned: each node or process holds a copy of the model, each node takes a batch of different data and completes forward and backward computation respectively to obtain gradients; the processes used for training are workers, and besides the workers there is also a parameter server (ps server); the workers send the computed gradients to the parameter server, the parameter server performs the update operation, and the updated model is returned to each node;
the data parallelism can enlarge the equivalent batch size, namely the number of parallel processors multiplied by the single-processor batch size, thereby accelerating the calculation;
The sequence parallel module further segments long sequence data in the data to be processed, and divides each data to be processed into sequences and places the sequences into a plurality of processors, and the method specifically comprises the following steps:
The sequence parallelism extends the length of the data that a Transformer model can receive, processing long texts in NLP and large pictures and/or videos in CV tasks, wherein a picture is cut into small blocks and all the small blocks arranged in order form a sequence; a video is itself a sequence of pictures, and each picture is cut again;
after the computing resources are acquired, the picture processing task and/or the feature data of the pictures are processed and distributed to each processor through data parallelism, the processors including but not limited to GPU, CPU and TPU, and the data can be further segmented and distributed through sequence parallelism;
If the single data length is greater than the threshold value, the single processor cannot process the single data, and after the sequence is segmented in parallel, one data is put into a plurality of processors;
The multidimensional model parallel module comprises 2-dimensional and/or 2.5-dimensional and/or 3-dimensional grid parallelism, specifically: 2-dimensional grid parallelism adopts the scalable dense matrix multiplication algorithm SUMMA and its matrix forms, an efficient and extensible model-parallel approach based on two-dimensional matrix partitioning;
2.5D grid parallel design a quantized deep learning model parallel architecture, minimize the transmission loss between graphic processors;
3D parallel matrix multiplication is adopted in 3D grid parallel, each matrix is divided into a plurality of small blocks according to rows and columns, large matrix multiplication is split into multiplication of a plurality of small matrices, and matrix storage is flattened on a plurality of processors.
3. An artificial intelligence based multidimensional parallel processing apparatus comprising:
A memory for storing instructions for execution by the processor, an
A processor for executing the instructions to implement an artificial intelligence based multidimensional parallel processing method in accordance with claim 1.
4. A computer readable storage medium encoded with a computer program, wherein the computer readable storage medium has instructions stored thereon, which when executed on a computer cause the computer to perform an artificial intelligence based multidimensional parallel processing method as claimed in claim 1.
CN202111203399.XA 2021-10-15 2021-10-15 Multi-dimensional parallel processing method, system, equipment and readable storage medium based on artificial intelligence Active CN114035936B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111203399.XA CN114035936B (en) 2021-10-15 2021-10-15 Multi-dimensional parallel processing method, system, equipment and readable storage medium based on artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111203399.XA CN114035936B (en) 2021-10-15 2021-10-15 Multi-dimensional parallel processing method, system, equipment and readable storage medium based on artificial intelligence

Publications (2)

Publication Number Publication Date
CN114035936A CN114035936A (en) 2022-02-11
CN114035936B true CN114035936B (en) 2024-05-17

Family

ID=80135037

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111203399.XA Active CN114035936B (en) 2021-10-15 2021-10-15 Multi-dimensional parallel processing method, system, equipment and readable storage medium based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN114035936B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114611697B (en) * 2022-05-11 2022-09-09 上海登临科技有限公司 Neural network quantification and deployment method, system, electronic device and storage medium
CN116739090B (en) * 2023-05-12 2023-11-28 北京大学 Deep neural network reasoning measurement method and device based on Web browser
CN116684437B (en) * 2023-08-04 2023-10-03 江苏量界数据科技有限公司 Distributed data management method based on natural language analysis
CN116991483B (en) * 2023-09-25 2024-04-05 粤港澳大湾区数字经济研究院(福田) Pipeline parallel method and device for language model calculation
CN117831771B (en) * 2024-03-05 2024-05-17 凯斯艾生物科技(苏州)有限公司 Disease risk prediction model construction method and system based on deep learning

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460457A (en) * 2018-03-30 2018-08-28 苏州纳智天地智能科技有限公司 A kind of more asynchronous training methods of card hybrid parallel of multimachine towards convolutional neural networks
CN109754060A (en) * 2017-11-06 2019-05-14 阿里巴巴集团控股有限公司 A kind of training method and device of neural network machine learning model
CN111381966A (en) * 2020-03-08 2020-07-07 苏州浪潮智能科技有限公司 Distributed parallel training method, device and readable medium
CN111984414A (en) * 2020-08-21 2020-11-24 苏州浪潮智能科技有限公司 Data processing method, system, equipment and readable storage medium
CN112464784A (en) * 2020-11-25 2021-03-09 西安烽火软件科技有限公司 Distributed training method based on hybrid parallel
CN112559673A (en) * 2019-09-06 2021-03-26 阿里巴巴集团控股有限公司 Language processing model training method and device, electronic equipment and storage medium
CN112784968A (en) * 2021-01-29 2021-05-11 东南大学 Hybrid pipeline parallel method for accelerating distributed deep neural network training

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3563304B1 (en) * 2016-12-30 2022-03-09 Intel Corporation Deep learning hardware
US11556450B2 (en) * 2019-10-11 2023-01-17 International Business Machines Corporation Hybrid data-model parallelism for efficient deep learning

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109754060A (en) * 2017-11-06 2019-05-14 阿里巴巴集团控股有限公司 A kind of training method and device of neural network machine learning model
CN108460457A (en) * 2018-03-30 2018-08-28 苏州纳智天地智能科技有限公司 A kind of more asynchronous training methods of card hybrid parallel of multimachine towards convolutional neural networks
CN112559673A (en) * 2019-09-06 2021-03-26 阿里巴巴集团控股有限公司 Language processing model training method and device, electronic equipment and storage medium
CN111381966A (en) * 2020-03-08 2020-07-07 苏州浪潮智能科技有限公司 Distributed parallel training method, device and readable medium
CN111984414A (en) * 2020-08-21 2020-11-24 苏州浪潮智能科技有限公司 Data processing method, system, equipment and readable storage medium
CN112464784A (en) * 2020-11-25 2021-03-09 西安烽火软件科技有限公司 Distributed training method based on hybrid parallel
CN112784968A (en) * 2021-01-29 2021-05-11 东南大学 Hybrid pipeline parallel method for accelerating distributed deep neural network training

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Maximizing Parallelism in Distributed Training for Huge Neural Networks";Zhengda Bian et al.;《https://arxiv.org/abs/2105.14450》;20210530;第1-11页 *

Also Published As

Publication number Publication date
CN114035936A (en) 2022-02-11

Similar Documents

Publication Publication Date Title
CN114035936B (en) Multi-dimensional parallel processing method, system, equipment and readable storage medium based on artificial intelligence
CN114035937A (en) Distributed training and reasoning method, system, equipment and readable storage medium based on artificial intelligence
EP4036724A1 (en) Method for splitting neural network model by using multi-core processor, and related product
Acun et al. Understanding training efficiency of deep learning recommendation models at scale
US10296556B2 (en) System and method for efficient sparse matrix processing
US10997176B2 (en) Massive time series correlation similarity computation
EP3979143A1 (en) Method of performing splitting in neural network model by means of multi-core processor, and related product
US8959138B2 (en) Distributed data scalable adaptive map-reduce framework
US20230026006A1 (en) Convolution computation engine, artificial intelligence chip, and data processing method
Han et al. Signal processing and networking for big data applications
CN111709493B (en) Object classification method, training device, object classification equipment and storage medium
CN113994350A (en) Generating parallel computing schemes for neural networks
US11630986B2 (en) Graph conversion method
CN110826708A (en) Method for realizing neural network model splitting by using multi-core processor and related product
US11893691B2 (en) Point cloud geometry upsampling
CN108334532B (en) Spark-based Eclat parallelization method, system and device
CN117786412A (en) Elastic training method, cluster system, product and medium for large language model
CN114817845B (en) Data processing method, device, electronic equipment and storage medium
CN116957041A (en) Method, device and computing equipment for compressing neural network model
CN113760380A (en) Method, device, equipment and storage medium for determining running code of network model
CN116755714B (en) Method, device, equipment and storage medium for operating deep neural network model
US20240086719A1 (en) Sparse encoding and decoding at mixture-of-experts layer
US11809849B1 (en) Global modulo allocation in neural network compilation
US20240160906A1 (en) Collective communication phases at mixture-of-experts layer
CN116489678A (en) Communication optimization method and device of deep learning model and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant