CN116991483A - Pipeline parallel method and device for language model calculation - Google Patents

Pipeline parallel method and device for language model calculation

Info

Publication number
CN116991483A
CN116991483A (application CN202311237774.1A)
Authority
CN
China
Prior art keywords
data
computing node
model
language model
parallel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311237774.1A
Other languages
Chinese (zh)
Other versions
CN116991483B (en)
Inventor
杨海钦
刘力铭
叶俊鹏
梁健豪
杨杰
幺宝刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Digital Economy Academy IDEA
Original Assignee
International Digital Economy Academy IDEA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Digital Economy Academy IDEA filed Critical International Digital Economy Academy IDEA
Priority to CN202311237774.1A priority Critical patent/CN116991483B/en
Publication of CN116991483A publication Critical patent/CN116991483A/en
Application granted granted Critical
Publication of CN116991483B publication Critical patent/CN116991483B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3818Decoding for concurrent execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Multi Processors (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a pipeline parallel method and device for language model calculation. The model stages of a language model are deployed on different computing nodes, and the data to be calculated is divided into a preset number of data groups; each data group is then read in parallel by the computing nodes, and each time the most upstream computing node finishes reading a data group, the output data corresponding to that data group is immediately transmitted to the most downstream computing node; finally, the generation process is executed in parallel by all computing nodes. Pipeline parallelism between different data groups in the reading stage and the generation stage can thus be realized, the utilization of computing node resources is improved, and the computing efficiency of the language model can be improved.

Description

Pipeline parallel method and device for language model calculation
Technical Field
The application relates to the technical field of neural network application, in particular to a pipeline parallel method and device for language model calculation.
Background
With the rapid development of language models, models have become deeper and deeper, so that their parameter counts have grown ever larger, even reaching the trillion level; the memory occupied by such a language model is therefore so large that it is increasingly difficult for a single device to run it.
To solve this problem, existing methods generally divide a language model into a plurality of model stages and execute the different model stages on different computing nodes, where each batch of data is calculated sequentially by each computing node as the plurality of computing nodes jointly execute the computation of the language model. Although this approach allows large language models (including trillion-parameter language models) to be run, the generation process is only performed after all data groups have been read, so idle computing nodes exist during computation, the utilization of computing node resources is low, and the computing efficiency of the language model is further affected.
There is thus a need for improvement and development in the art.
Disclosure of Invention
In view of the above defects of the prior art, the technical problem to be solved by the present application is to provide a pipeline parallel method and device for language model calculation.
To solve the above technical problem, a first aspect of an embodiment of the present application provides a pipeline parallel method for language model computation, where the method includes:
dividing a language model into a preset number of model stages, and respectively deploying the model stages on different computing nodes, wherein each model stage in the preset number of model stages comprises one or more layers in the language model;
Dividing data to be calculated into a preset number of data groups;
controlling all computing nodes deployed with model stages to read the data groups in parallel, wherein each time the most upstream computing node finishes reading one of the preset number of data groups, the most upstream computing node is controlled to transmit output data corresponding to that data group to the most downstream computing node, wherein the most upstream computing node is the computing node deployed with the last model stage, and the most downstream computing node is the computing node deployed with the first model stage;
and controlling all the computing nodes to execute the generation process of each data group in parallel.
According to the pipeline parallel method for language model calculation, the data to be calculated included in each of the preset number of data groups are different from each other, and the data lengths of the calculation data included in the data groups are equal.
According to the pipeline parallel method for language model calculation, the time steps required by each computing node to execute a reading operation on a data group are the same, and the time steps required by each computing node to execute a generation operation on a data group are the same.
The pipeline parallel method for language model calculation, wherein the generating process for controlling all calculation nodes to execute each data group in parallel specifically comprises the following steps:
Controlling each computing node to generate output data based on the received input data, and transmitting the output data to an upstream computing node of the computing node, wherein the input data is output data of a downstream computing node of the computing node;
when the most upstream computing node generates output data, the output data is transmitted to the most downstream computing node, so that the generation processes of different data groups are parallel in different computing nodes, wherein the time step of generating the output data by the most upstream computing node is adjacent to the time step of generating the output data by the most downstream computing node by taking the output data as input data.
According to the pipeline parallel method for language model calculation, the generation processes of different data groups being parallel in different computing nodes specifically means:
for each time step, the input data executed in each computing node is output data of a different generation stage of a different data set.
In the pipeline parallel method for language model computation, when the most upstream computing node reads one of the preset number of data sets, the controlling the most upstream computing node to transmit output data corresponding to the data set to the most downstream computing node specifically includes:
Detecting the working state of the most downstream computing node every time the most upstream computing node finishes reading one of the preset number of data sets;
and when the working state of the most downstream computing node is an idle state, controlling the most upstream computing node to transmit the output data corresponding to the data set to the most downstream computing node.
In the pipeline parallel method for language model calculation, the time step of executing the reading operation on the data set by the most upstream computing node is adjacent to the time step of executing the generating operation by taking the output data corresponding to the data set as the input data by the most downstream computing node.
A second aspect of an embodiment of the present application provides a pipelined parallel device for language model computation, the device including:
the deployment module is used for dividing the language model into a preset number of model stages and deploying the model stages on different computing nodes respectively, wherein each model stage in the preset number of model stages comprises one or more layers in the language model;
the dividing module is used for dividing the data to be calculated into a preset number of data groups;
the reading parallel module is used for controlling all computing nodes deployed with model stages to read the data groups in parallel, wherein each time the most upstream computing node finishes reading one of the preset number of data groups, the most upstream computing node is controlled to transmit output data corresponding to that data group to the most downstream computing node;
And the generation parallel module is used for controlling all the computing nodes to execute the generation process of each data group in parallel.
A third aspect of the embodiments of the present application provides a computer-readable storage medium storing one or more programs executable by one or more processors to implement steps in a pipelined parallel method for language model computation as described in any one of the above.
A fourth aspect of an embodiment of the present application provides a terminal device, including: a processor and a memory;
the memory has stored thereon a computer readable program executable by the processor;
the processor, when executing the computer readable program, implements the steps in a pipelined parallel method for language model computation as described in any one of the above.
The beneficial effects are that: compared with the prior art, the application provides a pipeline parallel method and device for language model calculation. The method includes dividing a language model into a preset number of model stages and deploying the model stages on different computing nodes; dividing data to be calculated into a preset number of data groups; controlling all computing nodes deployed with model stages to read the data groups in parallel; each time the most upstream computing node finishes reading one of the preset number of data groups, controlling the most upstream computing node to transmit output data corresponding to that data group to the most downstream computing node; and controlling all computing nodes to execute the generation process of each data group in parallel. In the application, the model stages of the language model are deployed on different computing nodes and the data to be calculated is divided into a preset number of data groups; each data group is then read in parallel by the computing nodes, and once the most upstream computing node has read a data group, the output data corresponding to that data group is immediately transmitted to the most downstream computing node, after which the generation process is executed in parallel by all computing nodes. Pipeline parallelism between different data groups in the reading stage and the generation stage can thus be realized, the utilization of computing node resources is improved, and the computing efficiency of the language model can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without creative effort for a person of ordinary skill in the art.
FIG. 1 is a schematic diagram of a prior art process.
FIG. 2 is a flow chart of a pipeline parallel method for language model computation provided by the application.
FIG. 3 is a timing diagram of pipeline parallelism for a language model calculation process according to the present application.
Fig. 4 is a timing diagram of application example 1.
Fig. 5 is a timing diagram of application example 2.
Fig. 6 is a timing diagram of application example 3.
FIG. 7 is a schematic structural diagram of a pipeline parallel device for language model computation according to the present application.
Fig. 8 is a schematic structural diagram of a terminal device provided by the present application.
Detailed Description
The application provides a pipeline parallel method and a device for language model calculation, which are used for making the purposes, technical schemes and effects of the application clearer and more definite, and the application is further described in detail below by referring to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as will be understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
It should be understood that the sequence numbers and sizes of the steps in this embodiment do not imply an order of execution; the execution order of each process is determined by its function and internal logic, and should not be construed as limiting the implementation of the embodiments of the present application.
According to research, with the rapid development of language models, models have become deeper and deeper, so that their parameter counts have grown ever larger, even reaching the billion level, and the memory occupied by such a language model is so large that it is increasingly difficult for a single device to run it.
To solve this problem, existing methods generally divide a language model into a plurality of model stages and execute the different model stages on different computing nodes, where each batch of data is calculated sequentially by each computing node as the plurality of computing nodes jointly execute the computation of the language model. For example, as shown in fig. 1, the language model is divided into three model stages deployed on GPU0, GPU1 and GPU2 respectively; the parallel process of the language model is to pass one batch of data (block No. 0 in the figure) through GPU0, GPU1 and GPU2 in sequence, with GPU0, GPU1 and GPU2 sequentially executing the text reading stage and the autoregressive generation stage for that batch. After the autoregressive generation stage has been completed for that batch, the next batch of data passes through GPU0, GPU1 and GPU2 in sequence in the same way, and so on until all batches of data have been processed. Although this method solves the problem that a single device cannot run a large language model, the utilization of the computing resources of the computing nodes is low, which affects the computing efficiency of the language model.
Based on the above, in the embodiment of the application, the language model is divided into a preset number of model stages, and each model stage is deployed at different computing nodes respectively; dividing data to be calculated into a preset number of data groups; controlling all computing nodes deployed with model stages to read all data sets in parallel; when the most upstream computing node finishes reading one data set in a preset number of data sets, controlling the most upstream computing node to transmit output data corresponding to the data set to the most downstream computing node; and controlling all the computing nodes to execute the generation process of each data group in parallel. According to the application, each model stage of the language model is deployed on different computing nodes, the data to be computed is divided into a preset number of data groups, then each data group is read in parallel through the computing nodes, after one data group is read by the most upstream computing node, output data corresponding to the data group is immediately transmitted to the most downstream computing node, and the generating process is executed in parallel through all the computing nodes, so that pipeline parallelism between different data groups in the reading stage and the generating stage can be realized, the utilization rate of computing node resources is improved, and the computing efficiency of the language model can be improved.
The application will be further described by the description of embodiments with reference to the accompanying drawings.
The present embodiment provides a pipeline parallel method for language model calculation, as shown in fig. 2, where the method includes:
s10, dividing the language model into a preset number of model stages, and respectively deploying the model stages on different computing nodes.
In particular, the language model includes several network layers, and the layer types may differ from one another, e.g., a convolution layer, a fully connected layer and a pooling layer. The language model is divided in units of layers, so that each model stage resulting from the division includes one or more layers. For example, if the language model includes 12 layers and is divided into three model stages, one stage may include layers 1-4, one stage layers 5-8 and one stage layers 9-12.
The preset number may be determined according to the number of computing nodes used to execute the language model, or according to the computing power that a single computing node can supply. When the computing nodes for executing the language model are preset, the number of those computing nodes is taken directly as the preset number; when the preset number is determined from the computing power a single computing node can supply, the total computing power required by the language model is obtained, and the preset number is then determined from the ratio of that total computing power to the computing power supplied by a single computing node.
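For illustration only (not part of the original disclosure; the FLOPs-based estimate and all names below are assumptions), the preset number could be derived either from a fixed node count or from the ratio of the model's total compute requirement to the compute a single node can supply:

```python
import math

def preset_number(total_model_flops: float,
                  per_node_flops: float,
                  fixed_node_count: int | None = None) -> int:
    """Estimate the preset number of model stages / data groups.

    If the computing nodes for executing the language model are preset,
    their count is used directly; otherwise the number is estimated from
    the total computing power the model requires divided by the computing
    power a single node can supply (a hypothetical FLOPs-based proxy).
    """
    if fixed_node_count is not None:
        return fixed_node_count
    return max(1, math.ceil(total_model_flops / per_node_flops))

# e.g. a model needing ~3.0e15 FLOPs per step on nodes supplying ~1.0e15 FLOPs -> 3 stages
print(preset_number(3.0e15, 1.0e15))
```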
Further, when the language model is divided into the preset number of model stages, the language model may be divided directly, in units of layers, into the preset number of model stages; alternatively, the number of parameters of each layer and the computing power supplied by each computing node may be obtained, and the language model divided into the model stages according to the per-layer parameter counts and the supplied computing power, so that the running time of each computing node when running its deployed model stage is essentially the same. This avoids waiting between computing nodes and improves the computing efficiency of the language model.
After the language model is divided into model stages, the model stages are deployed on different computing nodes, that is, each model stage runs on one computing node, and the computing nodes running the model stages are different. For example, assume that the language model includes 12 network layers, and the computing nodes for executing the language model are GPUs, for a total of 3, denoted GPU0, GPU1, and GPU2, respectively. Then, the language model is divided into 3 model stages, which are respectively denoted as a first model stage, a second model stage and a third model stage, wherein the first model stage comprises 1-4 layers, the second model stage comprises 5-8 layers, the third model stage comprises 9-12 layers, the first model stage is deployed at GPU0, the second model stage is deployed at GPU1, and the third model stage is deployed at GPU2.
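As a minimal sketch of step S10 (helper and device names are illustrative; an even split into contiguous stages is assumed), a 12-layer model can be divided into three stages of four layers each and each stage bound to its own computing node:

```python
def split_into_stages(layers: list, num_stages: int) -> list:
    """Divide the model's layers into `num_stages` contiguous model stages
    (assumes the layer count is evenly divisible by the stage count)."""
    per_stage = len(layers) // num_stages
    return [layers[i * per_stage:(i + 1) * per_stage] for i in range(num_stages)]

# 12 placeholder layers standing in for the language model's network layers.
layers = [f"layer_{i + 1}" for i in range(12)]
stages = split_into_stages(layers, num_stages=3)

# One model stage per computing node: GPU0 <- layers 1-4, GPU1 <- layers 5-8, GPU2 <- layers 9-12.
deployment = {f"GPU{i}": stage for i, stage in enumerate(stages)}
print(deployment)
```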
S20, dividing the data to be calculated into a preset number of data groups.
Specifically, the data to be calculated is the data to be run through the language model; it may be input data to be translated, image data to be described, or the like. For example, the data to be calculated may include ["Today is Monday", "Nice to meet you", "How much is this cell phone", "The lunch today tastes average", "What time is the meeting today", "What is the plan for the evening", ...], or the data to be calculated may include [picture 01, picture 02, ..., picture 11, picture 12, ..., picture 21, picture 22, ...].
The number of data groups and the number of computing nodes may be the same or different; when they are different, the number of data groups may be greater than the number of computing nodes, so that the data groups can satisfy the parallel requirements of the computing nodes. In one implementation of the embodiment of the present application, the number of data groups is determined based on the number of model stages of the language model, the number of data groups being equal to the number of model stages. On this basis, when dividing the data to be calculated, the data to be calculated may be divided into the preset number of data groups, so that the number of data groups and the number of model stages are both the preset number. In the embodiment of the present application, the data to be calculated and the language model are both divided according to the preset number, and the computation of the language model is then run in parallel on the preset number of computing nodes, so that the situation in which data to be calculated is left waiting can be avoided.
Further, the data to be calculated included in the data groups are different from each other, and the data lengths of the calculation data included in the data groups are equal; for example, when dividing the data to be calculated into the preset number of data groups, the data to be calculated may be divided equally into the preset number of data groups. Keeping the data lengths of the calculation data in the data groups equal makes the running time of each data group essentially the same, so that the data groups to be calculated stay in a parallel state to the greatest extent, thereby improving the utilization of the computing nodes.
Illustrating: assume that the language model is divided into three model stages deployed on three computing nodes, and that the data to be calculated is ["Today is Monday", "Nice to meet you", "How much is this cell phone", "The lunch today tastes average", "What time is the meeting today", "What is the plan for the evening", ...]. The data to be calculated is then divided into three data groups, namely ["Today is Monday", "Nice to meet you", ...], ["How much is this cell phone", "The lunch today tastes average", ...] and ["What time is the meeting today", "What is the plan for the evening", ...], where the data lengths included in the data groups are equal.
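A corresponding sketch of step S20 (illustrative helper name; an evenly divisible batch is assumed), which splits the six example sentences above into three data groups of equal length:

```python
def split_into_groups(data_to_calculate: list, num_groups: int) -> list:
    """Divide the data to be calculated into `num_groups` data groups of equal length."""
    group_len = len(data_to_calculate) // num_groups
    return [data_to_calculate[i * group_len:(i + 1) * group_len] for i in range(num_groups)]

data_to_calculate = ["Today is Monday", "Nice to meet you",
                     "How much is this cell phone", "The lunch today tastes average",
                     "What time is the meeting today", "What is the plan for the evening"]
groups = split_into_groups(data_to_calculate, num_groups=3)
print(groups)   # three data groups of two sentences each
```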
S30, controlling all computing nodes deployed with model stages to read all data sets in parallel, wherein each time the most upstream computing node reads one of the data sets with a preset number, the most upstream computing node is controlled to transmit output data corresponding to the data set to the most downstream computing node.
Specifically, each computing node deployed with a model stage performs data transmission with its upstream computing node and its downstream computing node, and the most upstream computing node performs data transmission with the most downstream computing node, where upstream and downstream are determined by the positional relationship between the model stages deployed on the computing nodes. The model stage deployed on the upstream computing node of a computing node is adjacent to the model stage deployed on that computing node, and the output data of the model stage deployed on the computing node is the input data of the model stage deployed on its upstream computing node. Similarly, the model stage deployed on the downstream computing node of a computing node is adjacent to the model stage deployed on that computing node, and the output data of the model stage deployed on the downstream computing node is the input data of the model stage deployed on the computing node. The most upstream computing node refers to the computing node deployed with the last model stage, and the most downstream computing node refers to the computing node deployed with the first model stage. For example, the language model is divided, in model execution order, into a first model stage, a second model stage and a third model stage, with the first model stage deployed on GPU0, the second model stage on GPU1 and the third model stage on GPU2; then GPU1 is an upstream computing node of GPU0, GPU2 is an upstream computing node of GPU1, GPU0 is a downstream computing node of GPU1, GPU1 is a downstream computing node of GPU2, GPU2 is the most upstream computing node, and GPU0 is the most downstream computing node.
The processing of the language model is divided into a text reading phase, denoted the reading process, and an autoregressive generation phase, denoted the generation process. The computing nodes execute the reading process of the data groups in parallel, and the time steps required by the computing nodes to execute a reading operation on a data group are the same; that is, the model stages deployed on the computing nodes are configured so that each computing node needs the same number of time steps to read a data group, which keeps the computing nodes continuously working and improves their efficiency.
Illustrating: assume that the first model stage is deployed on GPU0, the second model stage on GPU1 and the third model stage on GPU2, that the preset number of data groups comprises data group 0, data group 1 and data group 2, and that GPU0, GPU1 and GPU2 each require one time step to read one data group; then, as shown in fig. 3:
Time step 1: data group 0 is read by layers 1-4 of the model on GPU0.
Time step 2: having finished the layer 1-4 reading on GPU0, data group 0 transmits its reading result to GPU1 and is read by layers 5-8 of the language model; meanwhile, data group 1 completes its read on GPU0.
Time step 3: having been read by layers 5-8, data group 0 is transmitted to GPU2 to finish the model reading of layers 9-12; data group 1 is transmitted from GPU0 to GPU1 to finish reading; data group 2 completes its read on GPU0.
Time step 4: data group 1 is transmitted to layers 9-12 on GPU2 for reading; data group 2 is transmitted from layers 1-4 on GPU0 to layers 5-8 on GPU1 for reading.
Time step 5: data group 2 completes its read at layers 9-12 of GPU2.
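The staggered reading schedule above can be reproduced with a small simulation; the following sketch is illustrative only (whole-step accounting, names assumed) and is not the patented implementation:

```python
def read_schedule(num_groups: int, num_nodes: int) -> dict:
    """Simulate the pipelined reading phase: data group g is read on node
    (t - g - 1) at time step t, provided that index is a valid node."""
    schedule = {}
    for step in range(1, num_groups + num_nodes):
        events = []
        for group in range(num_groups):
            node = step - group - 1
            if 0 <= node < num_nodes:
                events.append(f"data group {group} read on GPU{node}")
        schedule[step] = events
    return schedule

for step, events in read_schedule(num_groups=3, num_nodes=3).items():
    print(f"time step {step}: " + "; ".join(events))
```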
Each time the most upstream computing node finishes reading one of the preset number of data groups, the most upstream computing node is controlled to transmit the output data corresponding to that data group to the most downstream computing node. In this way, as soon as a data group has been read, its corresponding output data is transmitted directly to the most downstream computing node, and the most downstream computing node does not need to wait until all data groups have been read before starting generation for the first-read data group. The computing nodes can therefore process the reading process and the generation process in parallel, idle computing nodes caused by waiting between the reading process and the generation process are reduced, the utilization of the computing nodes is improved, and the computing efficiency of the language model is improved.
In an implementation manner of the embodiment of the present application, each time the most upstream computing node finishes reading one of the preset number of data sets, controlling the most upstream computing node to transmit output data corresponding to the data set to the most downstream computing node specifically includes:
detecting the working state of the most downstream computing node every time the most upstream computing node finishes reading one of the preset number of data sets;
and when the working state of the most downstream computing node is an idle state, controlling the most upstream computing node to transmit the output data corresponding to the data set to the most downstream computing node.
Specifically, the working states include an idle state and an occupied state; the idle state means that a computing node is not executing any task, and the occupied state means that the computing node is executing a task. Each time the most upstream computing node finishes reading a data group, whether the output data corresponding to that data group can be transmitted to the most downstream computing node is determined by detecting the working state of the most downstream computing node, which avoids task queuing at the most downstream computing node and further improves computing efficiency. In addition, to increase the utilization of the computing nodes, the most upstream computing node transmits the output data to the most downstream computing node immediately upon detecting that the most downstream node is in the idle state.
Further, the working state of the most downstream computing node may be detected in the following ways: the most downstream computing node may reply to a query instruction issued by the most upstream computing node, or the most downstream computing node may automatically synchronize its state when entering the idle state; that is, each downstream computing node may send its working state to its upstream computing node in real time, and since, in the cyclic process, the most downstream computing node serves as the upstream computing node of the most upstream computing node, the most downstream computing node can likewise synchronize its working state to the most upstream computing node in real time. Of course, in practical applications, the working state of the most downstream computing node may also be detected in other ways, which are not described here one by one.
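One possible realization of the idle-state check (a sketch under the assumption of a shared, thread-safe status flag; the transmit call is a hypothetical placeholder) is for the most downstream computing node to publish its working state and for the most upstream computing node to wait on it before transmitting:

```python
import threading

class NodeStatus:
    """Working state of a computing node, visible to the node that feeds it."""

    def __init__(self):
        self._idle = threading.Event()
        self._idle.set()  # a node starts out idle

    def mark_busy(self):
        self._idle.clear()

    def mark_idle(self):
        self._idle.set()

    def wait_until_idle(self, timeout=None):
        """Block until the node reports the idle state (or the timeout expires)."""
        return self._idle.wait(timeout)

# The most downstream node updates `downstream_status`; the most upstream node
# polls it after finishing the read of a data group and only then transmits.
downstream_status = NodeStatus()
if downstream_status.wait_until_idle(timeout=1.0):
    pass  # transmit_output(output_data)  # hypothetical send of the read output
```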
In an exemplary implementation of the embodiment of the present application, each data group includes a plurality of calculation data; each calculation data in a data group corresponds to one generation operation, and the data group as a whole corresponds to one reading operation. The time step required for a computing node to execute a reading operation may be the same as or different from the time step required to execute one generation operation; for example, a reading operation may be configured to take 1 time step and a generation operation 0.5 time steps. Thus, when the most upstream computing node finishes a reading operation, the most downstream computing node has finished the reading operation or generation operation it was executing, and the time step in which the most upstream computing node executes the reading operation on a data group is adjacent to the time step in which the most downstream computing node executes a generation operation taking the output data corresponding to that data group as input data.
By way of illustration, based on the above example and assuming that a read operation requires 1 time step and a generation operation requires 0.5 time steps, as shown in fig. 3:
Time step 3.5: after data group 0 finishes reading through the full model, the read output content is transmitted to layers 1-4 of GPU0 to generate the 0th token (denoted "00"); data group 1 is transmitted to layers 9-12 of GPU2 for reading; data group 2 is transmitted from layers 1-4 of GPU0 to layers 5-8 of GPU1 for reading.
Time step 4: data group 1 finishes reading at layers 9-12 of GPU2; data group 2 finishes reading at layers 5-8 of GPU1; GPU0 is not occupied by any computation at this time step;
Time step 4.5: "00" is transmitted to layers 5-8 of GPU1 to continue generation; at the same time, data group 1 completes the reading phase and is transmitted to layers 1-4 of GPU0 to generate "10".
Time step 5: "10" is transmitted to layers 5-8 of GPU1 to continue generation; data group 2 finishes reading at layers 9-12 of GPU2; GPU0 is not occupied by any computation.
Time step 5.5: data group 2 completes the reading phase and is transmitted to GPU0, where layers 1-4 generate "20"; data group 0, having completed generation at layers 5-8, is transmitted from GPU1 to GPU2; GPU1 is not occupied by any computation at this time step.
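For illustration (function names are assumptions; the 1-step read and 0.5-step generation durations are taken from the example above), the slot in which the most downstream node can start generating for a data group follows directly from when the most upstream node finishes reading it:

```python
READ_STEPS = 1.0   # time steps per reading operation (as configured in the example)
GEN_STEPS = 0.5    # time steps per generation operation (as configured in the example)

def most_upstream_read_step(group: int, num_nodes: int) -> float:
    """Time step in which the most upstream node performs the read of data group
    `group` (0-indexed groups, perfectly pipelined reads)."""
    return (group + num_nodes) * READ_STEPS

def first_generation_slot(group: int, num_nodes: int) -> float:
    """Adjacent slot in which the most downstream node can start generating,
    taking the output data corresponding to the data group as input."""
    return most_upstream_read_step(group, num_nodes) + GEN_STEPS

# Data group 0 on a 3-node pipeline: read on the most upstream node in step 3,
# first token generated in the adjacent 3.5 slot, matching the walk-through above.
print(most_upstream_read_step(0, 3), first_generation_slot(0, 3))
```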
S40, controlling all the computing nodes to execute the generation process of each data set in parallel.
Specifically, the generation process includes the generation of each piece of sub-data in each data group. For example, if a data group includes ["Today is Monday", "Nice to meet you", ...] and the generation task is to translate the data group into English, the generation process includes the translation of each token of the data group: for instance, the first tokens of the data group are ["today", "nice", ...], the second tokens are ["is", "to", ...], and so on.
The parallel execution of the generation process of each data set refers to that for each time step, the input data executed in each computing node is the output data of different generation stages of different data sets. In one implementation manner of the embodiment of the present application, the process of controlling all computing nodes to execute the generation process of each data set in parallel specifically includes:
controlling each computing node to generate output data based on the received input data, and transmitting the output data to an upstream computing node thereof;
when the most upstream computing node generates output data, the output data is transmitted to the most downstream computing node, so that the generation processes of different data groups are parallel in different computing nodes, wherein the time step of generating the output data by the most upstream computing node is adjacent to the time step of generating the output data by the most downstream computing node by taking the output data as input data.
Specifically, the input data is output data of a downstream computing node of the computing node, that is, after the computing node performs the generating operation to obtain the output data, the computing node transmits the output data to an upstream computing node thereof, and the upstream computing node uses the output data as the input data to perform the generating operation. Meanwhile, after forming output data based on the input data, the computing node transmits the output data to its upstream computing node.
After the most upstream computing node obtains output data, it transmits the output data to the most downstream computing node, so that the most downstream computing node performs a generation operation based on that output data; in this way, at the same time step the input data in each computing node comes from a different data group. For example, as shown in fig. 3, at time step 6 the input data in GPU0 is derived from data group 0, the input data in GPU1 is derived from data group 1, and the input data in GPU2 is derived from data group 2. The embodiment of the present application thus lets each computing node run data of different batches and different stages in parallel at the same time step, avoiding computing nodes sitting idle while waiting for input data of the same data group, which improves the utilization of every computing node. It should be noted that during the transition from the reading process to the generation process the computing nodes are idle at some time steps, but once the transition is complete, every computing node is in an operating state at every time step.
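A sketch of the steady-state generation schedule of step S40 (the slot indexing is a simplifying assumption and the number of data groups is taken equal to the number of nodes): at every slot each computing node holds a token state from a different data group, and a group's state advances one stage per slot, wrapping from the most upstream node back to the most downstream node:

```python
def steady_state_schedule(num_nodes: int, num_slots: int) -> None:
    """Print which data group each computing node works on in each generation slot.

    With as many data groups as nodes, node i processes data group
    (i - slot) mod num_nodes, so no node is idle once the pipeline is full and
    the output of the most upstream node feeds the most downstream node next slot.
    """
    for slot in range(num_slots):
        assignment = {f"GPU{node}": f"data group {(node - slot) % num_nodes}"
                      for node in range(num_nodes)}
        print(f"slot {slot}: {assignment}")

steady_state_schedule(num_nodes=3, num_slots=4)
```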
In summary, the present embodiment provides a pipeline parallel method for language model calculation, which includes dividing a language model into a preset number of model stages and deploying the model stages on different computing nodes; dividing data to be calculated into a preset number of data groups; controlling all computing nodes deployed with model stages to read the data groups in parallel; each time the most upstream computing node finishes reading one of the preset number of data groups, controlling the most upstream computing node to transmit output data corresponding to that data group to the most downstream computing node; and controlling all computing nodes to execute the generation process of each data group in parallel. In the application, the model stages of the language model are deployed on different computing nodes and the data to be calculated is divided into a preset number of data groups; each data group is then read in parallel by the computing nodes, and once the most upstream computing node has read a data group, the output data corresponding to that data group is immediately transmitted to the most downstream computing node, after which the generation process is executed in parallel by all computing nodes. Pipeline parallelism between different data groups in the reading stage and the generation stage can thus be realized, the utilization of computing node resources is improved, and the computing efficiency of the language model can be improved.
In addition, in order to further illustrate the pipeline parallel method for language model computation provided by the embodiment of the present application, several specific application examples are given below.
Application example 1: language model for machine translation
Assume that the computing nodes for running the language model in parallel are three GPUs, namely GPU0, GPU1 and GPU2; the language model includes 12 network layers, with layers 1-4 deployed on GPU0, layers 5-8 on GPU1 and layers 9-12 on GPU2. The data x1 to be calculated includes ["Today is Monday", "Nice to meet you", "How much is this cell phone", "The lunch today tastes average", "What time is the meeting today", "What is the plan for the evening", ...] and is divided into three data groups: data group 0 = ["Today is Monday", "Nice to meet you", ...], data group 1 = ["How much is this cell phone", "The lunch today tastes average", ...], and data group 2 = ["What time is the meeting today", "What is the plan for the evening", ...]; the data lengths of the three data groups are all n, where n is a positive integer. A data reading operation on each GPU requires one time step and a generation operation requires 0.5 time steps. As shown in fig. 4, the specific procedure of the pipeline parallel method for language model calculation may be:
Time step 1: data group 0 is read by GPU0;
Time step 2: data group 0 is read by GPU1, and data group 1 is read by GPU0;
Time step 3: data group 0 is read by GPU2 and generation of 00 is started; data group 1 is read by GPU1; data group 2 is read by GPU0;
Time step 3.5: data group 0 completes, on GPU2, the translation output of the first token ["today", "nice", ...], denoted 00, and 00 is read by GPU0;
Time step 4: data group 1 is read by GPU2 and generation of 10 is started; data group 2 is read by GPU1;
Time step 4.5: 00 is read by GPU1; data group 1 completes, on GPU2, the translation output of the first token ["how", "the", ...], denoted 10, and 10 is read by GPU0;
Time step 5: 10 is read by GPU1; data group 2 is read by GPU2 and generation of 20 is started;
Time step 5.5: 00 is read by GPU2 and 01 is generated; data group 2 completes, on GPU2, the translation output of the first token ["what", "what's", ...], denoted 20, and 20 is read by GPU0;
Time step 6: 01 completes generation ["is", "to", ...] and is read by GPU0; 10 is read by GPU2 and generation of 11 is started; 20 is read by GPU1;
Time step 6.5: 01 is read by GPU1; 11 completes generation ["much", "taste", ...]; 20 is read by GPU2 and generation of 21 is started;
Time step 7: 01 is read by GPU2 and generation of 02 is started; 11 is read by GPU1; 21 completes generation ["time", "the", ...];
And so on, until the generation steps of every data group are completed, obtaining the output data corresponding to the data to be calculated, where the output data output: [["today is Monday", "nice to meet you", ...], ["how much is this phone", "the taste of lunch today is average", ...], ["what time is the meeting today", "what's the plan for the evening", ...]].
Application example 2: language model for image description
Assuming that the computing nodes for running the language model in parallel are three GPUs, namely GPU0, GPU1 and GPU2, the language model comprises 12 layers of network layers, 1-4 layers are deployed on GPU0, 5-8 layers are deployed on GPU1, and 9-12 layers are deployed on GPU 2. The input image data included in the data x1 to be calculated are divided into three sets of input image data, each set of input image data includes n pieces of input image data, n is a positive integer, wherein the three sets of input image data are respectively denoted as 0 image data, 1 image data and 2 image data. As shown in fig. 5, the specific procedure of the pipeline parallel method for language model calculation may be:
Time step 1: the 0 image data is read by GPU 0;
time step 2: the 0 image data is read by GPU1, and the 1 image data is read by GPU 0;
time step 3: the 0 image data is read by GPU2 and generation of 00 is started; the 1 image data is read by GPU1; the 2 image data is read by GPU0;
time step 3.5: the 0 image data completes, on GPU2, the first token output of the description ["one", "two", ...], denoted 00, and 00 is read by GPU0;
time step 4:1 the image data is completely read by the GPU2 and starts to generate 10;2, the image data is read by the GPU 1;
time step 4.5:00 data is read by GPU 1; 1 image data is finished by GPU2 to describe the first token output [ "this is", "man",.], indicated by 10, 10 is finished by GPU0 to read;
time step 5:10 data is read by GPU1, 2 image data is read by GPU2 and generation 20 is started;
time step 5.5:00 data is read by GPU2 and 01 is generated; 2 image data is output by GPU2 to complete the first token of the description [ "restaurant", "one",.], indicated at 20, 20 is completed by GPU0 to read;
time step 6:01 the second token output of the data completion description [ "cat", "woman", ".], and is read by GPU 0; 10 data is read by GPU2 and generation 11 is started; 20 data is read by GPU 1;
Time step 6.5:01 data is read by GPU 1; 11 data complete the second token output of the description [ "one field", "in", ]; the data is read by GPU2 and generation 21 is started;
time step 7:01 the data is read by GPU2 and starts to generate 02;11 data is read by GPU 1; 21 data complete the second token output of the description [ "being", "wearing" ];
And so on, until the generation steps of every data group are completed, obtaining the output data corresponding to the data to be calculated, where the output data output: [["a cat sleeps on a bookshelf", "two women laugh at a mobile phone", ...], ["this is a baseball game", "a man is eating bread", ...], ["the restaurant is bustling with business", "a person wearing red clothes is skiing", ...]].
Application example 3: the language model is used for multi-modal VQA applications:
Assume that the computing nodes for running the language model in parallel are three GPUs, namely GPU0, GPU1 and GPU2; the language model includes 12 network layers, with layers 1-4 deployed on GPU0, layers 5-8 on GPU1 and layers 9-12 on GPU2. The data to be calculated includes input image data x1 and text question data x2; x1 is equally divided into three input image data groups and x2 is equally divided into three text question data groups. The first input image data group of x1 together with the first text question data group of x2 is denoted 0 data, the second image group together with the second question group is denoted 1 data, and the third image group together with the third question group is denoted 2 data; each input image data group and each text question data group includes n data elements, n is a positive integer, and the input image data in each data group corresponds one-to-one to the text question data. As shown in fig. 6, the specific procedure of the pipeline parallel method for language model calculation may be:
Time step 1: the 0 data is read by GPU 0;
time step 2: the data 0 is read by GPU1, and the data 1 is read by GPU 0;
time step 3: the 0 data is read by GPU2 and generation of 00 is started; the 1 data is read by GPU1; the 2 data is read by GPU0;
time step 3.5: the 0 data is outputted by the first token of the GPU2 completion answer [ "sleep", "mobile", ], denoted by 00, 00 is read by GPU 0;
time step 4:1 data is read by GPU2 and generation 10 is started; 2, the data is read by the GPU 1;
time step 4.5:00 data is read by GPU 1; 1 data is output [ "baseball", "bread", ], denoted 10, by the first token of the GPU2 complete answer, 10 is complete read by GPU 0;
time step 5:10 data is read by GPU1, 2 data is read by GPU2 and generation 20 is started;
time step 5.5: 00 is read by GPU2 and generates ["EOS", ...], where EOS represents an end symbol, so the answer for the 0 data is complete; the 2 data completes, on GPU2, the first token output of the answer ["four", "red", ...], denoted 20, and 20 is read by GPU0;
time step 6: 10 is read by GPU2 and generates ["EOS", ...], representing the end symbol, so the answer for the 1 data is complete; 20 is read by GPU1;
time step 6.5: 20 is read by GPU2 and generates ["EOS", ...], representing the end symbol, so the answer for the 2 data is complete;
And so on, until the generation steps of every data group are completed, obtaining the output data corresponding to the data to be calculated, where the output data output: [["sleep", "cell phone", ...], ["baseball", "bread", ...], ["four", "red", ...]].
Based on the above-mentioned pipeline parallel method for language model calculation, the present embodiment provides a pipeline parallel device for language model calculation, as shown in fig. 7, the device includes:
the deployment module 100 is configured to divide a language model into a preset number of model stages, and deploy each model stage to different computing nodes, where each model stage in the preset number of model stages includes one or more layers in the language model;
the dividing module 200 is configured to divide data to be calculated into a preset number of data sets;
the parallel reading module 300 is configured to control all computing nodes deployed with model stages to read each data set in parallel, where each time the most upstream computing node reads one of the preset number of data sets, the most upstream computing node is controlled to transmit output data corresponding to the data set to the most downstream computing node;
The generating parallel module 400 is configured to control all the computing nodes to execute the generating process of each data set in parallel.
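For orientation only, a skeletal rendering of the four modules (class and method names are illustrative, not from the disclosure; the read/generate bodies are placeholders that would reuse the scheduling sketches given earlier):

```python
class PipelineParallelDevice:
    """Skeleton mirroring the deployment, dividing, reading-parallel and
    generation-parallel modules of the described device."""

    def __init__(self, preset_number: int):
        self.preset_number = preset_number

    def deploy(self, layers: list) -> list:
        """Deployment module: divide the language model into model stages."""
        per_stage = len(layers) // self.preset_number
        return [layers[i * per_stage:(i + 1) * per_stage] for i in range(self.preset_number)]

    def divide(self, data: list) -> list:
        """Dividing module: divide the data to be calculated into data groups."""
        per_group = len(data) // self.preset_number
        return [data[i * per_group:(i + 1) * per_group] for i in range(self.preset_number)]

    def read_parallel(self, stages, groups):
        """Reading parallel module: pipeline the reading of all data groups (placeholder)."""
        raise NotImplementedError

    def generate_parallel(self, stages, groups):
        """Generation parallel module: run the generation of all data groups in parallel (placeholder)."""
        raise NotImplementedError
```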
Based on the above-described pipeline parallel method for language model computation, the present embodiment provides a computer-readable storage medium storing one or more programs executable by one or more processors to implement the steps in the pipeline parallel method for language model computation as described in the above-described embodiment.
Based on the above pipeline parallel method for language model calculation, the present application also provides a terminal device, as shown in fig. 8, which includes at least one processor (processor) 20; a display screen 21; and a memory (memory) 22, which may also include a communication interface (Communications Interface) 23 and a bus 24. Wherein the processor 20, the display 21, the memory 22 and the communication interface 23 may communicate with each other via a bus 24. The display screen 21 is configured to display a user guidance interface preset in the initial setting mode. The communication interface 23 may transmit information. The processor 20 may invoke logic instructions in the memory 22 to perform the methods of the embodiments described above.
Further, the logic instructions in the memory 22 described above may be implemented in the form of software functional units and stored in a computer readable storage medium when sold or used as a stand alone product.
The memory 22, as a computer readable storage medium, may be configured to store a software program, a computer executable program, such as program instructions or modules corresponding to the methods in the embodiments of the present disclosure. The processor 20 performs functional applications and data processing, i.e. implements the methods of the embodiments described above, by running software programs, instructions or modules stored in the memory 22.
The memory 22 may include a program storage area, which may store an operating system and at least one application program required for the functions, and a data storage area, which may store data created according to the use of the terminal device, and so on. In addition, the memory 22 may include high-speed random access memory and may also include nonvolatile memory. For example, various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, may be used, or a transitory storage medium may be used.
In addition, the specific processes that the instructions in the storage medium and the processors in the terminal device load and execute have been described in detail in the method above and are not repeated here.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A pipelined parallel method for language model computation, the method comprising:
dividing a language model into a preset number of model stages, and respectively deploying the model stages on different computing nodes, wherein each model stage in the preset number of model stages comprises one or more layers in the language model;
dividing data to be calculated into a preset number of data groups;
controlling all computing nodes deployed with model stages to read the data groups in parallel, wherein each time the most upstream computing node finishes reading one of the preset number of data groups, the most upstream computing node is controlled to transmit output data corresponding to that data group to the most downstream computing node, wherein the most upstream computing node is the computing node deployed with the last model stage, and the most downstream computing node is the computing node deployed with the first model stage;
And controlling all the computing nodes to execute the generation process of each data group in parallel.
2. The pipeline parallel method for language model computation according to claim 1, wherein each data group of the preset number of data groups includes data to be computed different from each other, and the data lengths of the computation data included in each data group are equal.
3. The pipelined parallel method for language model computation of claim 1, wherein the time steps required for each compute node to perform a read operation on each data set are the same and the time steps required for each compute node to perform a generate operation on each data set are the same.
4. The method for pipeline parallelism of language model computation according to claim 1, wherein the controlling all computing nodes to execute the generation process of each data group in parallel specifically comprises:
controlling each computing node to generate output data based on the received input data, and transmitting the output data to an upstream computing node of the computing node, wherein the input data is output data of a downstream computing node of the computing node;
when the most upstream computing node generates output data, the output data is transmitted to the most downstream computing node, so that the generation processes of different data groups are parallel in different computing nodes, wherein the time step of generating the output data by the most upstream computing node is adjacent to the time step of generating the output data by the most downstream computing node by taking the output data as input data.
5. The pipeline parallel method for language model computation according to claim 4, wherein the generation processes of the different data groups being parallel in different computing nodes specifically comprises:
for each time step, the input data processed in each computing node is output data of a different generation stage of a different data group.
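Claims 4 and 5 describe a generation-phase schedule in which each computing node passes its output to its upstream neighbour while the most upstream node wraps around to the most downstream node, so that at any time step different nodes work on different generation stages of different data groups. The short, time-stepped simulation below is one possible illustration of that schedule, under the assumption that nodes are indexed 0 (most upstream) to N-1 (most downstream); the function simulate_generation and its parameters are hypothetical names for this sketch only.

def simulate_generation(num_nodes: int, num_groups: int, steps: int) -> None:
    # slots[i] holds (data_group, generation_stage) being processed by node i
    # at the current time step, or None if that node is idle.
    slots = [None] * num_nodes
    slots[num_nodes - 1] = (0, 0)   # the most downstream node starts data group 0
    next_group = 1
    for t in range(steps):
        state = ", ".join(
            f"node{i}:" + ("idle" if s is None else f"group{s[0]}/stage{s[1]}")
            for i, s in enumerate(slots))
        print(f"t={t}: {state}")
        new_slots = [None] * num_nodes
        for i, s in enumerate(slots):
            if s is None:
                continue
            group, stage = s
            # Each node's output becomes the input of its upstream neighbour;
            # the most upstream node (i == 0) wraps to the most downstream one.
            target = i - 1 if i > 0 else num_nodes - 1
            new_slots[target] = (group, stage + 1)
        # The most downstream node admits a fresh data group whenever it is
        # idle, mirroring the hand-off from the read phase.
        if new_slots[num_nodes - 1] is None and next_group < num_groups:
            new_slots[num_nodes - 1] = (next_group, 0)
            next_group += 1
        slots = new_slots

simulate_generation(num_nodes=4, num_groups=4, steps=8)

Once the pipeline is full, each printed time step shows every node occupied by a different data group at a different generation stage, which illustrates the property recited in claim 5.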
6. The pipeline parallel method for language model computation according to claim 1, wherein, each time the most upstream computing node finishes reading one of the preset number of data groups, controlling the most upstream computing node to transmit the output data corresponding to the data group to the most downstream computing node specifically comprises:
detecting the working state of the most downstream computing node each time the most upstream computing node finishes reading one of the preset number of data groups;
and when the working state of the most downstream computing node is an idle state, controlling the most upstream computing node to transmit the output data corresponding to the data set to the most downstream computing node.
7. The pipeline parallel method for language model computation according to claim 1 or 6, wherein the time step in which the most upstream computing node performs a read operation on a data group is adjacent to the time step in which the most downstream computing node performs a generate operation with the output data corresponding to that data group as input data.
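As one possible reading of claims 6 and 7, the hand-off from the most upstream node can be guarded by a check of the most downstream node's working state, so that output data is only transmitted once that node is idle. The sketch below models this guard with a worker thread, a one-slot queue, and an event flag; DownstreamNode, hand_off, the polling interval, and the use of threads in place of separate computing nodes are illustrative assumptions rather than the claimed implementation.

import queue
import threading
import time

class DownstreamNode:
    def __init__(self):
        self.inbox = queue.Queue(maxsize=1)  # at most one pending data group
        self._busy = threading.Event()

    def is_idle(self) -> bool:
        # Working state of the most downstream computing node.
        return not self._busy.is_set()

    def run_once(self):
        group = self.inbox.get()
        self._busy.set()
        time.sleep(0.01)          # stand-in for one generate operation
        print(f"downstream node generated output for {group}")
        self._busy.clear()

def hand_off(downstream: DownstreamNode, output, poll_s: float = 0.001):
    # Detect the working state; transmit only when the node is idle.
    while not downstream.is_idle():
        time.sleep(poll_s)
    downstream.inbox.put(output)

if __name__ == "__main__":
    node = DownstreamNode()
    worker = threading.Thread(target=lambda: [node.run_once() for _ in range(3)])
    worker.start()
    for g in range(3):
        hand_off(node, f"output of data group {g}")   # read-phase hand-off
    worker.join()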
8. A pipeline parallel apparatus for language model computation, the apparatus comprising:
the deployment module is used for dividing the language model into a preset number of model stages and deploying the model stages on different computing nodes respectively, wherein each model stage in the preset number of model stages comprises one or more layers in the language model;
the dividing module is used for dividing the data to be calculated into a preset number of data groups;
the reading parallel module is used for controlling all computing nodes on which model stages are deployed to read all data groups in parallel, wherein each time the most upstream computing node finishes reading one of the preset number of data groups, the most upstream computing node is controlled to transmit the output data corresponding to the data group to the most downstream computing node;
and the generation parallel module is used for controlling all the computing nodes to execute the generation process of each data group in parallel.
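Purely as a hypothetical illustration, the class below maps the four modules of claim 8 onto methods of a single Python class; the class name, method names, and the single-process realisation are assumptions for readability and do not reflect how the apparatus would actually be distributed across computing nodes.

from typing import Any, List, Tuple

class PipelineParallelApparatus:
    def deploy(self, layers: List[Any], num_stages: int) -> List[List[Any]]:
        # Deployment module: divide the language model into model stages,
        # each comprising one or more layers, one stage per computing node.
        per = (len(layers) + num_stages - 1) // num_stages
        return [layers[i * per:(i + 1) * per] for i in range(num_stages)]

    def divide(self, data: List[Any], num_groups: int) -> List[List[Any]]:
        # Dividing module: divide the data to be calculated into data groups.
        per = (len(data) + num_groups - 1) // num_groups
        return [data[i * per:(i + 1) * per] for i in range(num_groups)]

    def read_parallel(self, num_stages: int, num_groups: int) -> List[Tuple[int, int]]:
        # Reading parallel module: record, for each data group read by the most
        # upstream node (index 0), the hand-off to the most downstream node.
        return [(g, num_stages - 1) for g in range(num_groups)]

    def generate_parallel(self, num_stages: int, num_groups: int) -> List[str]:
        # Generation parallel module: placeholder for the time-stepped schedule
        # sketched after claim 5, one generation process per data group.
        return [f"generation process started for data group {g}" for g in range(num_groups)]

apparatus = PipelineParallelApparatus()
stages = apparatus.deploy([f"layer{i}" for i in range(8)], num_stages=4)
groups = apparatus.divide([f"sample{i}" for i in range(8)], num_groups=4)
print(apparatus.read_parallel(len(stages), len(groups)))
print(apparatus.generate_parallel(len(stages), len(groups)))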
9. A computer readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps in the pipeline parallel method for language model computation according to any one of claims 1-7.
10. A terminal device, comprising: a processor and a memory;
the memory has stored thereon a computer readable program executable by the processor;
the processor, when executing the computer readable program, implements the steps in the pipeline parallel method for language model computation according to any one of claims 1-7.
CN202311237774.1A 2023-09-25 2023-09-25 Pipeline parallel method and device for language model calculation Active CN116991483B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311237774.1A CN116991483B (en) 2023-09-25 2023-09-25 Pipeline parallel method and device for language model calculation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311237774.1A CN116991483B (en) 2023-09-25 2023-09-25 Pipeline parallel method and device for language model calculation

Publications (2)

Publication Number Publication Date
CN116991483A true CN116991483A (en) 2023-11-03
CN116991483B CN116991483B (en) 2024-04-05

Family

ID=88528647

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311237774.1A Active CN116991483B (en) 2023-09-25 2023-09-25 Pipeline parallel method and device for language model calculation

Country Status (1)

Country Link
CN (1) CN116991483B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190251654A1 (en) * 2016-12-14 2019-08-15 Tencent Technology (Shenzhen) Company Limited Data processing method, apparatus, and electronic device
US20190362227A1 (en) * 2018-05-23 2019-11-28 Microsoft Technology Licensing, Llc Highly performant pipeline parallel deep neural network training
CN112154462A (en) * 2018-05-23 2020-12-29 微软技术许可有限责任公司 High performance pipeline parallel deep neural network training
JP2020140438A (en) * 2019-02-28 2020-09-03 国立研究開発法人情報通信研究機構 Hybrid parallel information processing system, hybrid parallel information processing method, and program
CN110533183A (en) * 2019-08-30 2019-12-03 东南大学 The model partition and task laying method of heterogeneous network perception in a kind of assembly line distribution deep learning
CN112818663A (en) * 2021-01-15 2021-05-18 北京有竹居网络技术有限公司 Processing method for language model, text generation method, text generation device and medium
CN113312178A (en) * 2021-05-24 2021-08-27 河海大学 Assembly line parallel training task allocation method based on deep reinforcement learning
CN114035936A (en) * 2021-10-15 2022-02-11 北京潞晨科技有限公司 Multidimensional parallel processing method, system and equipment based on artificial intelligence and readable storage medium
CN114356578A (en) * 2022-01-11 2022-04-15 中国人民解放军国防科技大学 Parallel computing method, device, equipment and medium for natural language processing model
CN115600673A (en) * 2022-11-07 2023-01-13 中国人民解放军国防科技大学(Cn) Method and system for parallel training DNN model for multi-machine multi-card computing system
CN115952856A (en) * 2022-12-04 2023-04-11 河海大学 Neural network production line parallel training method and system based on bidirectional segmentation
CN115994567A (en) * 2022-12-28 2023-04-21 兰州交通大学 Asynchronous scheduling method for parallel computing tasks of deep neural network model
CN116339942A (en) * 2023-03-07 2023-06-27 杭州电子科技大学 Self-adaptive scheduling method of distributed training task based on reinforcement learning

Also Published As

Publication number Publication date
CN116991483B (en) 2024-04-05

Similar Documents

Publication Publication Date Title
CN110610236A (en) Device for executing neural network operation
US20210357816A1 (en) System with hybrid communication strategy for large-scale distributed deep learning
WO2020106509A1 (en) Mitigating communication bottlenecks during parameter exchange in data-parallel dnn training
US9274831B2 (en) Information processing apparatus, information processing method, and storage medium
CN105183698A (en) Control processing system and method based on multi-kernel DSP
TW202029064A (en) Multipath neural network, method to allocate resources and multipath neural network analyzer
CN114356578B (en) Parallel computing method, device, equipment and medium for natural language processing model
CN116991560B (en) Parallel scheduling method, device, equipment and storage medium for language model
WO2021108122A1 (en) Hierarchical partitioning of operators
CN103577161A (en) Big data frequency parallel-processing method
CN116991483B (en) Pipeline parallel method and device for language model calculation
EP2620876A1 (en) Method and apparatus for data processing, pci-e bus system and server
CN110119375B (en) Control method for linking multiple scalar cores into single-core vector processing array
US20230083565A1 (en) Image data processing method and apparatus, storage medium, and electronic device
CN103745399A (en) Auction processing system and method
EP4142217A1 (en) Inter-node communication method and device based on multiple processing nodes
KR20130011805A (en) Simulation apparatus and simulation method thereof
CN115904681A (en) Task scheduling method and device and related products
CN112261023A (en) Data transmission method and device of convolutional neural network
CN113742266B (en) Integrated circuit device, electronic apparatus, board and computing method
CN113791996B (en) Integrated circuit device, electronic apparatus, board and computing method
CN105393211B (en) The system and method for asynchronous processor with pipeline system arithmetic logic unit
CN112866041B (en) Adaptive network system training method
US20230305848A1 (en) Schedule Instructions of a Program of Data Flows for Execution in Tiles of a Coarse Grained Reconfigurable Array
US20220019487A1 (en) Communication Between Host and Accelerator Over Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant