CN118195033A - Language model training method and related device
- Publication number: CN118195033A
- Application number: CN202410623693.3A
- Authority
- CN
- China
- Prior art keywords
- block
- subsequence
- matrix
- training
- training sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
The application discloses a language model training method and a related device, which relate to the technical field of model training. The training method comprises the following steps: deploying the language model on a plurality of computing devices, each computing device deploying one or more layers of the language model; acquiring a training sample set, wherein the training sample set comprises A training sequences with the length of S, which are acquired from the training sequence set; dividing the training sample set to obtain a plurality of training sample subsets, wherein each training sample subset comprises B training sequences with the length of S, and B is smaller than A; splitting the plurality of training sample subsets in the sequence dimension respectively to obtain sub-sequence block sets corresponding to the plurality of training sample subsets respectively; and controlling the plurality of computing devices to perform model training in a pipeline-parallel manner by using each sub-sequence block in the sub-sequence block sets respectively corresponding to the plurality of training sample subsets. The language model training method disclosed by the application has a lower memory requirement and a lower pipeline bubble rate.
Description
Technical Field
The application relates to the technical field of model training, in particular to a language model training method and a related device.
Background
In recent years, with the great increase of computing power and data volume, very large-scale language models based on the decoder-only transformer architecture have developed rapidly and are widely applied to many natural language processing tasks. Meanwhile, research shows that language models based on the decoder-only transformer architecture follow a scaling law: the larger the model parameter count and the larger the dataset, the higher the performance of the model. As a result, the parameter counts of language models have grown exponentially in recent years; for example, GPT-3, OPT and BLOOM reach 175B+ parameters, and PaLM and MT-NLG reach 530B+ parameters, which makes training such models increasingly challenging.
Over the past few years, a variety of parallel techniques have been proposed to train very large-scale language models, and pipeline parallelism is one of them. Pipeline parallelism refers to splitting a model by layers into several parts that are connected front to back, deploying the split parts on a plurality of computing devices, and controlling the computation order of the plurality of computing devices with a reasonable scheduling strategy so as to complete the training task of the model. At present, when training is performed in a pipeline-parallel manner, the training sample set for each round of training is generally split into a plurality of training sample subsets, and the plurality of computing devices are then controlled to perform model training with these training sample subsets according to a preset scheduling strategy (such as 1F1B (One Forward pass followed by One Backward pass), ZB (Zero Bubble), etc.).
The training process of the model comprises a forward computation stage and a backward computation stage. In the forward computation stage, the computing device caches the activation values generated by forward computation in memory (the computing device uses these activation values when performing backward computation). The current pipeline-parallel training method has a large memory requirement, which means that a computing device may run out of memory, and insufficient memory seriously affects model training.
Disclosure of Invention
In view of the above, the present application provides a language model training method and a related device, which are used to solve the problem that the current pipeline-parallel training method has a large memory requirement. The technical scheme is as follows:
The first aspect of the present application provides a language model training method, including:
splitting a language model into a plurality of model blocks according to layers, and deploying the plurality of model blocks on a plurality of computing devices, wherein each model block obtained by splitting comprises one layer or a plurality of continuous layers in the language model, and one or a plurality of model blocks in the plurality of model blocks are deployed on each computing device;
Obtaining a training sample set, wherein the training sample set comprises A training sequences with the length of S, which are obtained from the training sequence set, and both A and S are integers larger than 1;
dividing the training sample set to obtain a plurality of training sample subsets, wherein each training sample subset comprises B training sequences with the length of S, and B is an integer smaller than A;
Dividing the plurality of training sample subsets in a sequence dimension respectively to obtain subsequence block sets corresponding to the plurality of training sample subsets respectively, wherein each subsequence block in the subsequence block sets comprises B subsequences;
And controlling the plurality of computing devices to perform model training in a pipeline-parallel manner by using each sub-sequence block in the sub-sequence block sets respectively corresponding to the plurality of training sample subsets.
In one possible implementation, the slicing the language model into a plurality of model blocks by layers and deploying the plurality of model blocks on a plurality of computing devices includes:
Dividing a language model into N model blocks according to layers, and deploying the N model blocks on N computing devices, wherein each computing device is provided with one model block in the N model blocks, and N is an integer greater than 1;
Or slicing the language model into M model blocks according to layers, and deploying the M model blocks on N computing devices, wherein each computing device is provided with one model block or a plurality of discontinuous model blocks in the M model blocks, and M is an integer larger than N.
In one possible implementation, the splitting the plurality of training sample subsets in the sequence dimension to obtain sub-sequence block sets corresponding to the plurality of training sample subsets respectively includes:
for each training sample subset, dividing the training sample subset into equal-length C parts in a sequence dimension, and combining the C sub-sequence blocks obtained by dividing into sub-sequence block sets corresponding to the training sample subset, wherein C is an integer greater than 1.
In one possible implementation, the controlling the multiple computing devices to perform model training by using each sub-sequence block in the sub-sequence block set corresponding to each of the multiple training sample subsets in a pipelined parallel training manner includes:
Inputting each subsequence block in the subsequence block set corresponding to each training sample subset into a computing operation pipeline formed by a plurality of computing devices one by one in sequence for computing, wherein the computing operation pipeline executes forward computing and backward computing for each subsequence block which is input;
And after controlling the plurality of computing devices to complete forward computation and backward computation for each sub-sequence block in the sub-sequence block sets respectively corresponding to the plurality of training sample subsets, controlling the plurality of computing devices to update the model parameters according to the gradient data obtained by performing backward computation on each sub-sequence block in the sub-sequence block sets respectively corresponding to the plurality of training sample subsets.
In one possible implementation, the computing process of each computing device on the computing operation pipeline includes a first computing stage, a second computing stage, and a third computing stage;
Each computing device on the computing operation pipeline performs only forward computation in the first computing stage, alternately performs forward computation and backward computation in the second computing stage, and performs only backward computation in the third computing stage.
The forward calculations performed by each computing device on the computing operation pipeline include causal attention calculations;
In one possible implementation, for a sub-sequence block in a set of sub-sequence blocks, a computing device on the computing operation pipeline performs causal attention computation for the sub-sequence block, comprising:
determining a Q matrix, a K matrix and a V matrix of the sub-sequence block according to input data;
splicing the K matrices of the forward sub-sequence blocks of the sub-sequence block with the K matrix of the sub-sequence block, the spliced matrix being used as the target K matrix corresponding to the sub-sequence block, and splicing the V matrices of the forward sub-sequence blocks of the sub-sequence block with the V matrix of the sub-sequence block, the spliced matrix being used as the target V matrix corresponding to the sub-sequence block, wherein a forward sub-sequence block of the sub-sequence block is a sub-sequence block located before the sub-sequence block in the sub-sequence block set to which the sub-sequence block belongs;
And carrying out causal attention calculation on the Q matrix of the sub-sequence block, the target K matrix corresponding to the sub-sequence block and the target V matrix corresponding to the sub-sequence block.
The backward computation executed by each computing device on the computing operation pipeline comprises computation of target gradients respectively corresponding to a Q matrix, a K matrix and a V matrix;
In one possible implementation, the process of determining, by a computing device on the computing operation pipeline, the target gradients respectively corresponding to the Q matrix, the K matrix, and the V matrix of a sub-sequence block includes:
Calculating the gradient corresponding to the Q matrix of the subsequence block, wherein the obtained gradient is used as a target gradient corresponding to the Q matrix of the subsequence block;
calculating the gradient corresponding to the target K matrix corresponding to the sub-sequence block, wherein the gradient corresponding to the target K matrix corresponding to the sub-sequence block comprises the gradient corresponding to the K matrix of the sub-sequence block and the gradients corresponding to the K matrices of the forward sub-sequence blocks of the sub-sequence block; and summing the gradient corresponding to the K matrix of the sub-sequence block within the gradient corresponding to the target K matrix of the sub-sequence block with the gradients corresponding to the K matrix of the sub-sequence block within the gradients corresponding to the target K matrices of the backward sub-sequence blocks of the sub-sequence block, the summed gradient being used as the target gradient corresponding to the K matrix of the sub-sequence block, wherein a backward sub-sequence block of the sub-sequence block is a sub-sequence block located after the sub-sequence block in the sub-sequence block set to which the sub-sequence block belongs;
calculating the gradient corresponding to the target V matrix corresponding to the sub-sequence block, wherein the gradient corresponding to the target V matrix corresponding to the sub-sequence block comprises the gradient corresponding to the V matrix of the sub-sequence block and the gradients corresponding to the V matrices of the forward sub-sequence blocks of the sub-sequence block; and summing the gradient corresponding to the V matrix of the sub-sequence block within the gradient corresponding to the target V matrix of the sub-sequence block with the gradients corresponding to the V matrix of the sub-sequence block within the gradients corresponding to the target V matrices of the backward sub-sequence blocks of the sub-sequence block, the summed gradient being used as the target gradient corresponding to the V matrix of the sub-sequence block.
A second aspect of the present application provides a language model training apparatus, comprising: the system comprises a model segmentation and deployment module, a training sample set acquisition module, a training sample set division module, a training sample subset segmentation module and a model training module;
the model segmentation and deployment module is used for segmenting a language model into a plurality of model blocks according to layers and deploying the model blocks on a plurality of computing devices, wherein each model block obtained by segmentation comprises one layer or a plurality of continuous layers in the language model, and one or a plurality of model blocks in the model blocks are deployed on each computing device;
The training sample set acquisition module is used for acquiring a training sample set, wherein the training sample set comprises A training sequences with the length of S, which are acquired from the training sequence set, and both A and S are integers larger than 1;
the training sample set dividing module is used for dividing the training sample set to obtain a plurality of training sample subsets, wherein each training sample subset comprises B training sequences with the length of S, and B is an integer smaller than A;
The training sample subset segmentation module is used for respectively segmenting the plurality of training sample subsets in a sequence dimension to obtain a subsequence block set respectively corresponding to the plurality of training sample subsets, wherein each subsequence block in the subsequence block set comprises B subsequences;
The model training module is used for controlling the plurality of computing devices to perform model training in a pipeline-parallel manner by using each sub-sequence block in the sub-sequence block sets respectively corresponding to the plurality of training sample subsets.
A third aspect of the application provides an electronic device comprising at least one processor and a memory coupled to the processor, wherein:
the memory is used for storing a computer program;
The processor is configured to execute the computer program to enable the electronic device to implement the language model training method of any one of the above.
A fourth aspect of the present application provides a computer storage medium carrying one or more computer programs which, when executed by an electronic device, enable the electronic device to implement a language model training method as described in any one of the preceding claims.
A fifth aspect of the application provides a computer program product comprising computer readable instructions which, when run on an electronic device, cause the electronic device to implement the language model training method of any one of the preceding claims.
By means of the above technical scheme, the language model training method provided by the application first splits a language model into a plurality of model blocks by layers and deploys the plurality of model blocks on a plurality of computing devices, then obtains a training sample set, divides the training sample set to obtain a plurality of training sample subsets, splits the plurality of training sample subsets in the sequence dimension respectively to obtain sub-sequence block sets respectively corresponding to the plurality of training sample subsets, and finally controls the plurality of computing devices to perform model training in a pipeline-parallel manner by using each sub-sequence block in the sub-sequence block sets respectively corresponding to the plurality of training sample subsets. According to the language model training method provided by the application, after the plurality of training sample subsets are obtained, the training sample subsets are not used directly for model training but are further split in the sequence dimension, so that one training sample subset is split into a plurality of sub-sequence blocks and the model is then trained with the sub-sequence blocks obtained by splitting. In the forward computation stage, since a computing device performs forward computation on a sub-sequence block rather than on an entire training sample subset, the amount of activation data generated by forward computation on a sub-sequence block is greatly reduced compared with the amount of activation data generated by forward computation on a training sample subset, and the memory requirement is therefore greatly reduced. In addition, using the sub-sequence blocks obtained by splitting the training sample subsets for model training not only effectively reduces the memory requirement but also effectively reduces the pipeline bubble rate.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required by the embodiments or the description of the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained from the provided drawings by a person skilled in the art without inventive effort.
FIG. 1 is a schematic flow chart of a language model training method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of controlling a plurality of computing devices to perform model training in a pipeline-parallel manner by using each sub-sequence block in the sub-sequence block sets respectively corresponding to a plurality of training sample subsets according to an embodiment of the present application;
FIG. 3 is a schematic diagram of model training using a 1F1B-I scheduling strategy for each sub-sequence block in a sub-sequence block set corresponding to a training sample subset according to an embodiment of the present application;
FIG. 4 is a schematic diagram of model training using ZB-V scheduling strategy according to each sub-sequence block in the sub-sequence block set corresponding to the training sample subset provided in the embodiment of the present application;
FIG. 5 is a schematic diagram of a language model training apparatus according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In view of the large memory requirement of the current pipeline-parallel training method, an initial idea is to adopt a recomputation technique: the computing device caches only the input data in memory and does not cache the activation values generated in the forward computation stage; when performing backward computation in the backward computation stage, the computing device performs forward computation again on the cached input data to obtain the activation values, and then performs gradient computation according to the recomputed activation values.
The above idea does not need to cache the activation values generated in the forward computation stage, so the memory requirement is greatly reduced. However, research on this idea found that, although it greatly reduces the memory requirement, it brings a new problem, namely a large amount of additional computation.
The inventors further studied the problems of the above idea, and finally, through continuous research, proposed a language model training method with good effect.
The language model training method provided by the application is described by the following examples.
Referring to fig. 1, a flow chart of a language model training method provided by an embodiment of the present application is shown, where the method may include:
step S101: the language model is sliced into a plurality of model blocks by layer and the plurality of model blocks are deployed onto a plurality of computing devices.
Each model block obtained by splitting comprises one layer or a plurality of continuous layers of the language model; one model block or a plurality of model blocks (which may be continuous or discontinuous) among the plurality of model blocks are deployed on each computing device, so that one layer, a plurality of continuous layers, or a plurality of discontinuous layers of the language model are deployed on each computing device.
After the plurality of model blocks are deployed onto the plurality of computing devices, parameter initialization (e.g., random initialization) may be performed on the model blocks deployed on the plurality of computing devices.
Step S102: a training sample set is obtained.
The training sample set comprises A training sequences with the length of S, which are obtained from a training sequence set, the training sequence set comprises a large number of training sequences with the length of S, and both A and S are integers larger than 1.
Step S103: and dividing the training sample set to obtain a plurality of training sample subsets.
Wherein each training sample subset includes B training sequences of length S, B being an integer less than a.
For example, if the training sample set includes 100 training sequences of length S, then this step may divide the training sample set into 5 training sample subsets, each training sample subset including 20 training sequences of length S.
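As an illustration only, the division in the example above (100 training sequences split into 5 training sample subsets of 20 sequences each) can be sketched in Python as follows; the tensor shapes, the assumed sequence length and the use of torch.split are illustrative assumptions, not part of the claimed method.

```python
import torch

A, B, S = 100, 20, 6144                       # S = 6144 is an assumed sequence length
sample_set = torch.randint(0, 50000, (A, S))  # A training sequences of length S

# divide the training sample set into A // B training sample subsets,
# each containing B training sequences of length S
subsets = torch.split(sample_set, B, dim=0)
assert len(subsets) == 5 and subsets[0].shape == (B, S)
```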
Step S104: and respectively segmenting the plurality of training sample subsets in the sequence dimension to obtain sub-sequence block sets respectively corresponding to the plurality of training sample subsets.
Splitting a training sample subset in the sequence dimension means cutting each training sequence contained in the training sample subset into a plurality of segments. Since the training sample subset comprises a plurality of training sequences, a plurality of sub-sequence blocks are obtained after the training sample subset is split in the sequence dimension, and each sub-sequence block comprises B sub-sequences.
Step S105: and controlling the plurality of computing devices to perform model training in a pipeline-parallel manner by using each sub-sequence block in the sub-sequence block sets respectively corresponding to the plurality of training sample subsets.
According to the language model training method provided by the embodiment of the application, a language model is first split into a plurality of model blocks by layers and the plurality of model blocks are deployed on a plurality of computing devices; a training sample set is then obtained and divided to obtain a plurality of training sample subsets; the plurality of training sample subsets are respectively split in the sequence dimension to obtain sub-sequence block sets respectively corresponding to the plurality of training sample subsets; and finally, the plurality of computing devices are controlled to perform model training in a pipeline-parallel manner by using each sub-sequence block in these sub-sequence block sets. After the plurality of training sample subsets are obtained, they are not used directly for model training but are further split in the sequence dimension, so that each training sample subset is split into a plurality of sub-sequence blocks and the model is trained with the sub-sequence blocks obtained by splitting. In the forward computation stage, since a computing device performs forward computation on a sub-sequence block rather than on an entire training sample subset, the amount of activation data generated by forward computation is greatly reduced, and the memory requirement is therefore greatly reduced. In addition, using the sub-sequence blocks obtained by splitting the training sample subsets for model training not only effectively reduces the memory requirement but also effectively reduces the pipeline bubble rate.
In another embodiment of the present application, for "step S101" in the above embodiment: the specific implementation of splitting a language model into multiple model blocks by layer and deploying the multiple model blocks onto multiple computing devices is described.
The language model is split into a plurality of model blocks by layers, and the plurality of model blocks are deployed on a plurality of computing devices in various manners, and the following two manners are provided in this embodiment.
The first implementation mode: the language model is divided into N model blocks according to layers, the N model blocks are deployed on N computing devices, one model block in the N model blocks is deployed on each computing device, and N is an integer greater than 1.
In order to obtain a better training effect, the language model is preferably split uniformly, that is, each model block obtained by splitting contains the same number of layers.
For example, if the language model includes 8 layers (layer 1 to layer 8) and there are 4 computing devices (computing device 1 to computing device 4), the language model may be split into 4 model blocks, each comprising 2 continuous layers: the 1st model block includes layers 1 and 2, the 2nd model block includes layers 3 and 4, the 3rd model block includes layers 5 and 6, and the 4th model block includes layers 7 and 8. The 4 model blocks obtained by splitting are deployed on the 4 computing devices: the 1st model block is deployed on computing device 1, the 2nd model block on computing device 2, the 3rd model block on computing device 3, and the 4th model block on computing device 4. It should be noted that, since the 4 model blocks have a dependency relationship, the 4 computing devices correspondingly have a dependency relationship.
The second implementation mode: the language model is divided into M model blocks according to layers, the M model blocks are deployed on N computing devices, wherein one model block or a plurality of discontinuous model blocks in the M model blocks are deployed on each computing device, and M is an integer larger than N.
In order to obtain a better training effect, the language model is preferably split uniformly and the same number of model blocks is deployed on each computing device; if a plurality of model blocks are deployed on a computing device, discontinuous model blocks are preferably deployed.
For example, if the language model includes 8 layers (layer 1 to layer 8) and there are 4 computing devices (computing device 1 to computing device 4), the language model may be split into 8 model blocks, each being 1 layer of the language model, that is, the 1st model block is layer 1, the 2nd model block is layer 2, the 3rd model block is layer 3, ..., the 7th model block is layer 7, and the 8th model block is layer 8. The 8 model blocks obtained by splitting are deployed on the 4 computing devices: layers 1 and 5 of the language model are deployed on computing device 1, layers 2 and 6 on computing device 2, layers 3 and 7 on computing device 3, and layers 4 and 8 on computing device 4.
Compared with deploying a plurality of continuous layers on a computing device, deploying a plurality of discontinuous layers on a computing device can reduce the amount of computation and the computation time of each forward computation performed by the computing device, and can also reduce the memory footprint of the activation values generated by each forward computation.
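A minimal sketch of the interleaved deployment described in this example is given below; the 0-indexed numbering and the round-robin assignment are assumptions made for illustration.

```python
# 8 layers, 4 devices, 2 non-contiguous single-layer model blocks per device
num_layers, num_devices = 8, 4

assignment = {d: [] for d in range(num_devices)}
for layer in range(num_layers):
    device = layer % num_devices       # round-robin -> non-contiguous blocks per device
    assignment[device].append(layer)

# assignment == {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]},
# i.e. device 0 holds layers 1 and 5 in the 1-indexed numbering used above
```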
In addition, it should be noted that, if the language model is split and then deployed on N computing devices, the pipeline parallelism of the pipeline-parallel training is N. If a plurality of discontinuous model blocks are deployed on each computing device, there is also a virtual pipeline parallelism; for example, if 2 discontinuous model blocks are deployed on each computing device, the virtual pipeline parallelism of the pipeline-parallel training is 2.
In another embodiment of the present application, for "step S104" in the above embodiment: and respectively segmenting the plurality of training sample subsets in a sequence dimension to obtain a specific implementation process of a sub-sequence block set corresponding to the plurality of training sample subsets.
In one possible implementation manner, the process of dividing the plurality of training sample subsets in the sequence dimension to obtain the sub-sequence block sets corresponding to the plurality of training sample subsets respectively may include:
For each training sample subset, the training sample subset is segmented into equal-length C pieces in the sequence dimension, and the segmented C subsequence blocks form a subsequence block set corresponding to the training sample subset.
Wherein C is an integer greater than 1. Splitting the training sample subset into C equal-length parts in the sequence dimension yields C sub-sequence blocks of length S/C, and these C sub-sequence blocks of length S/C form the sub-sequence block set corresponding to the training sample subset; each sub-sequence block in the sub-sequence block set corresponding to the training sample subset comprises B sub-sequences of length S/C.
If the n-th training sample subset of the plurality of training sample subsets is denoted x_n, splitting x_n into C equal-length parts in the sequence dimension yields the sub-sequence blocks x_n^0, x_n^1, …, x_n^i, …, x_n^(C-1), where each sub-sequence block x_n^i has length S' = S/C (assuming C divides S).
In addition to splitting the training sample subset into C equal parts in the sequence dimension as described above, the splitting may be performed in other manners, for example, splitting the training sample subset into C parts of different lengths in the sequence dimension. In order to obtain higher training efficiency and a better training effect, the embodiment of the application preferably splits the training sample subset into C equal-length parts in the sequence dimension.
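For illustration, the equal-length splitting in the sequence dimension can be sketched as follows; the shapes and the use of torch.chunk are assumptions of the sketch.

```python
import torch

B, S, C = 4, 6144, 4                      # C equal-length sub-sequence blocks
x_n = torch.randint(0, 50000, (B, S))     # training sample subset x_n

# split x_n in the sequence dimension into sub-sequence blocks x_n^0 .. x_n^(C-1),
# each containing B sub-sequences of length S' = S / C
sub_blocks = torch.chunk(x_n, C, dim=1)
assert all(blk.shape == (B, S // C) for blk in sub_blocks)
```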
In another embodiment of the present application, for "step S105" in the above embodiment: and controlling a plurality of computing devices to perform model training by utilizing each subsequence block in the subsequence block set respectively corresponding to the plurality of training sample subsets and adopting a running water parallel training mode to introduce the implementation process.
Referring to fig. 2, a flow chart for controlling a plurality of computing devices to perform model training by using each sub-sequence block in a sub-sequence block set respectively corresponding to a plurality of training sample subsets in a pipelined parallel training manner is shown, which may include:
step S201: and inputting each subsequence block in the subsequence block set corresponding to each training sample subset into a computing operation pipeline formed by a plurality of computing devices one by one in sequence for computing.
Wherein the computing operation pipeline formed by the plurality of computing devices performs forward computation and backward computation for each input sub-sequence block.
When performing forward computation for a sub-sequence block, each computing device on the computing operation pipeline caches the computed activation values in memory for use when performing backward computation for that sub-sequence block. After performing backward computation for a sub-sequence block, each computing device on the computing operation pipeline clears the activation values that were cached during the forward computation for that sub-sequence block, so as to release memory space and make room for caching the activation values obtained when performing forward computation for the next sub-sequence block. It should be noted that each computing device generates a plurality of activation values in the process of forward computation; the intermediate activation values are cached in memory, and the last activation value is output to the next computing device as the final forward computation result of the computing device.
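A minimal sketch (an assumed data structure, not the patent's implementation) of the per-sub-sequence-block activation handling described above:

```python
class ActivationCache:
    """Cache activations during forward computation of a sub-sequence block and
    release them once the backward computation for that block has used them."""

    def __init__(self):
        self._store = {}                      # block_id -> cached activations

    def save(self, block_id, activations):
        # called while performing forward computation for the sub-sequence block
        self._store[block_id] = activations

    def pop(self, block_id):
        # called when performing backward computation for the sub-sequence block;
        # removing the entry releases memory for the next block's activations
        return self._store.pop(block_id)
```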
Next, the process of performing forward computation and backward computation for the i-th sub-sequence block x_n^i of a sub-sequence block set by the computing operation pipeline composed of a plurality of computing devices is given.
Forward computation stage: the 1st computing device on the computing operation pipeline performs forward computation on the sub-sequence block x_n^i to obtain its forward computation result rf_n^1 for x_n^i, and inputs rf_n^1 into the 2nd computing device; the 2nd computing device performs forward computation on the input rf_n^1 to obtain its own forward computation result rf_n^2 for x_n^i; …; the forward computation result rf_n^(N-1) of the (N-1)-th computing device for x_n^i is input into the N-th computing device, and the N-th computing device performs forward computation on the input rf_n^(N-1) to obtain its own forward computation result rf_n^N for x_n^i.
Backward computation stage: the N-th computing device calculates a loss function according to its forward computation result rf_n^N for x_n^i, and then performs gradient computation according to the loss function and the activation values it cached for x_n^i (i.e., the activation values generated by forward computation on rf_n^(N-1)). The computed gradients comprise the gradient with respect to the input data and the gradients with respect to the model parameters (weights). The gradient with respect to the input data computed by the N-th computing device is input into the (N-1)-th computing device; the (N-1)-th computing device performs gradient computation according to the received gradient and the activation values it cached for x_n^i (i.e., the activation values generated by forward computation on rf_n^(N-2)), and the gradients it computes likewise comprise the gradient with respect to the input data and the gradients with respect to the model parameters; the gradient with respect to the input data computed by the (N-1)-th computing device is input into the (N-2)-th computing device; …; the gradient with respect to the input data computed by the 2nd computing device is input into the 1st computing device, and the 1st computing device performs gradient computation according to the received gradient and the activation values it cached for x_n^i (i.e., the activation values generated by its forward computation on the sub-sequence block x_n^i).
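The following simplified, single-process Python sketch illustrates the relay described above: each stage stands in for the model block on one computing device, the forward result is passed stage to stage, and in the backward pass each stage receives the gradient with respect to its input from the next stage. The nn.Linear stages, tensor shapes and MSE loss are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

# four stages standing in for the model blocks on four computing devices
stages = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])

def forward_backward(x, target):
    inputs, outputs = [], []
    h = x
    # forward relay: each stage detaches its input so that the gradient
    # received later from the next stage can be injected at this boundary
    for stage in stages:
        h = h.detach().requires_grad_(True)
        inputs.append(h)
        h = stage(h)
        outputs.append(h)
    loss = nn.functional.mse_loss(h, target)

    # backward relay: the gradient w.r.t. each stage's input is exactly what
    # the previous stage receives as its upstream gradient
    grad = None
    for inp, out in reversed(list(zip(inputs, outputs))):
        if grad is None:
            loss.backward()       # last stage: backward starts from the loss
        else:
            out.backward(grad)    # other stages: backward from the received gradient
        grad = inp.grad           # gradient passed on to the previous stage
    return loss

loss = forward_backward(torch.randn(8, 16), torch.randn(8, 16))
```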
It should be noted that the application controls the plurality of computing devices in a pipeline-parallel manner to compute each sub-sequence block in the sub-sequence block sets corresponding to the training sample subsets; at any moment, multiple computing devices can be computing on different sub-sequence blocks, so the training efficiency of the language model can be greatly improved.
Step S202: and controlling the computing equipment to perform backward calculation on each sub-sequence block in the sub-sequence block set corresponding to the training sample subsets according to gradient data obtained by performing backward calculation on each sub-sequence block in the sub-sequence block set corresponding to the training sample subsets by the computing equipment after controlling the computing equipment to complete forward calculation and backward calculation on each sub-sequence block in the sub-sequence block set corresponding to the training sample subsets.
When pipeline-parallel training is performed with each sub-sequence block in the sub-sequence block sets respectively corresponding to the plurality of training sample subsets, each computing device may perform model training according to a preset scheduling strategy.
Optionally, the preset scheduling strategy may be a 1F1B scheduling strategy or a ZB scheduling strategy. Backward computation includes gradient computation with respect to the input data and gradient computation with respect to the model parameters. The 1F1B scheduling strategy treats the input gradient and the parameter gradient as a whole and computes the two gradients at the same time, while the ZB scheduling strategy decouples the input gradient from the parameter gradient and computes the two gradients at different moments, so as to reduce the pipeline bubble rate.
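As a rough illustration of the decoupling in the ZB strategies (not the patent's exact schedule), the input gradient and the parameter gradients of one layer can be computed in separate calls; the nn.Linear layer and tensor shapes below are assumptions of the sketch.

```python
import torch
import torch.nn as nn

layer = nn.Linear(16, 16)
x = torch.randn(8, 16, requires_grad=True)
upstream = torch.randn(8, 16)          # gradient received from the next stage
out = layer(x)

# "B" step: gradient with respect to the input, passed to the previous stage
d_input, = torch.autograd.grad(out, x, upstream, retain_graph=True)

# "W" step: gradients with respect to the parameters, which the schedule may
# defer to a later moment to reduce the pipeline bubble rate
d_weight, d_bias = torch.autograd.grad(out, (layer.weight, layer.bias), upstream)
```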
The 1F1B scheduling strategies include the non-interleaved 1F1B scheduling strategy and the interleaved 1F1B scheduling strategy (i.e., the 1F1B-I scheduling strategy, in which forward computation and backward computation are performed in an interleaved manner). If a 1F1B scheduling strategy is adopted for model training, either the non-interleaved 1F1B scheduling strategy or the interleaved 1F1B-I scheduling strategy may be adopted. In view of the better effect of the 1F1B-I scheduling strategy (compared with the non-interleaved 1F1B scheduling strategy, the 1F1B-I scheduling strategy has a smaller pipeline bubble rate and higher model training efficiency), the 1F1B-I scheduling strategy is preferably adopted for model training.
The ZB scheduling strategies include the ZB-H1 scheduling strategy, the ZB-H2 scheduling strategy, the ZB-V scheduling strategy, and so on. If a ZB scheduling strategy is adopted for model training, the ZB-H1, ZB-H2 or ZB-V scheduling strategy may be adopted. In view of the better effect of the ZB-V scheduling strategy, the ZB-V scheduling strategy is preferably adopted for model training.
It should be noted that the 1F1B-I scheduling strategy and the ZB-V scheduling strategy require that discontinuous layers be deployed on each computing device. For example, if the language model includes 8 layers and there are 4 computing devices, layers 1 and 5 of the language model are deployed on computing device 1, layers 2 and 6 on computing device 2, layers 3 and 7 on computing device 3, and layers 4 and 8 on computing device 4.
Referring to fig. 3 and fig. 4, fig. 3 shows a schematic diagram of model training using a 1F1B-I scheduling strategy for each sub-sequence block in a sub-sequence block set corresponding to a training sample subset, and fig. 4 shows a schematic diagram of model training using a ZB-V scheduling strategy for each sub-sequence block in a sub-sequence block set corresponding to a training sample subset.
Devices 0-3 in FIG. 3 and FIG. 4 represent the 4 computing devices that form the computing operation pipeline, and each computing device is deployed with 2 discontinuous model blocks, i.e., the pipeline parallelism is 4 and the virtual pipeline parallelism is 2. In FIG. 3 and FIG. 4, 0a and 0b represent the sequence numbers of the two sub-sequence blocks in the sub-sequence block set corresponding to the training sample subset with sequence number 0 (each training sample subset is split into 2 equal-length parts, i.e., each training sample subset comprises two sub-sequence blocks), 1a and 1b represent the sequence numbers of the two sub-sequence blocks in the sub-sequence block set corresponding to the training sample subset with sequence number 1, and so on. Chunk0-F and Chunk1-F in FIG. 3 and FIG. 4 represent, in order, the forward operation of the 1st model block on the computing device for the sub-sequence block with sequence number 0a and the forward operation of the 2nd model block on the computing device for the sub-sequence block with sequence number 0a. Chunk1-BW and Chunk0-BW in FIG. 3 represent, in order, the backward operation (computation of both gradients) of the 2nd model block on the computing device for the sub-sequence block with sequence number 0a and the backward operation (computation of both gradients) of the 1st model block on the computing device for the same sub-sequence block. In FIG. 4, Chunk1-B and Chunk1-W represent the backward operations of the 2nd model block on the computing device (Chunk1-B the computation of the 2nd model block's gradient with respect to the input data, Chunk1-W the computation of the 2nd model block's gradient with respect to the model parameters), and Chunk0-B and Chunk0-W represent the corresponding backward operations of the 1st model block (Chunk0-B for the gradient with respect to the input data, Chunk0-W for the gradient with respect to the model parameters). The Optimizer step in FIG. 3 and FIG. 4 represents the model optimization step, i.e., the model parameter update process.
As shown in FIG. 3 and FIG. 4, when the 1F1B-I scheduling strategy or the ZB-V scheduling strategy is adopted, the computation process of each computing device includes a first computing stage (which may be referred to as the warm-up stage), a second computing stage (which may be referred to as the steady stage), and a third computing stage (which may be referred to as the cool-down stage). The computing device performs only forward computation in the warm-up stage, performs forward computation and backward computation alternately in the steady stage, and performs only backward computation in the cool-down stage. Alternating forward and backward computation in the steady stage can reduce the memory footprint of the device. As shown in FIG. 3, for device 0 the warm-up stage spans 0a to 3a, the steady stage spans from 3b to the 4b before the 7th bubble, and the cool-down stage spans from the 4a after the 7th bubble to the Optimizer step.
Because the language model is based on the decoder-only transformer architecture, the computing operation pipeline composed of a plurality of computing devices performs causal attention computation when performing forward computation on an input sub-sequence block. When performing causal attention computation, attention must be paid not only to the current sub-sequence block but also to the sub-sequence blocks before it. The causal attention computation process is given next.
For the i-th sub-sequence block x_n^i of the sub-sequence block set corresponding to the training sample subset x_n, the process by which the k-th computing device on the computing operation pipeline performs causal attention computation for the sub-sequence block may include:
Step a1, determining the Q matrix, the K matrix and the V matrix of the sub-sequence block x_n^i according to the input data.
If the k-th computing device is the first computing device on the computing operation pipeline, the "input data" in step a1 is the sub-sequence block x_n^i itself; if the k-th computing device is not the first computing device on the computing operation pipeline, the "input data" in step a1 is the forward computation result of the (k-1)-th computing device for the sub-sequence block x_n^i.
Step a2, splicing the K matrices of the forward sub-sequence blocks of the sub-sequence block x_n^i with the K matrix of the sub-sequence block x_n^i, the spliced matrix being used as the target K matrix corresponding to x_n^i, and splicing the V matrices of the forward sub-sequence blocks of x_n^i with the V matrix of x_n^i, the spliced matrix being used as the target V matrix corresponding to x_n^i.
The forward sub-sequence blocks of the sub-sequence block x_n^i are the sub-sequence blocks located before x_n^i in the sub-sequence block set to which x_n^i belongs, namely x_n^0, x_n^1, …, x_n^(i-1).
If the sub-sequence block x_n^i is the first sub-sequence block in the sub-sequence block set, the K matrix of x_n^i is directly used as the target K matrix corresponding to x_n^i, and the V matrix of x_n^i is directly used as the target V matrix corresponding to x_n^i.
Step a3, performing causal attention computation on the Q matrix of the sub-sequence block x_n^i, the target K matrix corresponding to x_n^i, and the target V matrix corresponding to x_n^i.
The causal attention computation for the sub-sequence block x_n^i in the forward computation can be expressed as:
o_n^i = CausalAttention(Q_n^i, {K_n^0, K_n^1, …, K_n^i}, {V_n^0, V_n^1, …, V_n^i})    (1)
where Q_n^i denotes the Q matrix of the sub-sequence block x_n^i; K_n^0 denotes the K matrix of the sub-sequence block x_n^0, K_n^1 denotes the K matrix of x_n^1, …, K_n^i denotes the K matrix of x_n^i, and {K_n^0, K_n^1, …, K_n^i} denotes the target K matrix corresponding to x_n^i; V_n^0 denotes the V matrix of x_n^0, V_n^1 denotes the V matrix of x_n^1, …, V_n^i denotes the V matrix of x_n^i, and {V_n^0, V_n^1, …, V_n^i} denotes the target V matrix corresponding to x_n^i; o_n^i denotes the result of the causal attention computation for the sub-sequence block x_n^i.
The above procedure is described below with a specific example. Suppose the sub-sequence block set corresponding to the training sample subset x_n comprises 3 sub-sequence blocks, namely x_n^0, x_n^1 and x_n^2, and the k-th computing device currently needs to perform causal attention computation for the sub-sequence block x_n^2. The Q, K and V matrices of x_n^2 are first computed from the input data, giving Q_n^2, K_n^2 and V_n^2. Then K_n^2 of x_n^2, K_n^1 of x_n^1 and K_n^0 of x_n^0 are spliced, and the spliced matrix is used as the target K matrix corresponding to x_n^2; similarly, V_n^2 of x_n^2, V_n^1 of x_n^1 and V_n^0 of x_n^0 are spliced, and the spliced matrix is used as the target V matrix corresponding to x_n^2. After Q_n^2 of x_n^2, the target K matrix {K_n^0, K_n^1, K_n^2} corresponding to x_n^2 and the target V matrix {V_n^0, V_n^1, V_n^2} corresponding to x_n^2 are obtained, causal attention computation may be performed on Q_n^2, {K_n^0, K_n^1, K_n^2} and {V_n^0, V_n^1, V_n^2}. It should be noted that the k-th computing device computes K_n^1 and V_n^1 of x_n^1 when performing causal attention computation for the sub-sequence block x_n^1, and computes K_n^0 and V_n^0 of x_n^0 when performing causal attention computation for the sub-sequence block x_n^0.
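A minimal PyTorch sketch of this blockwise causal attention is given below; the tensor layout (batch, heads, positions, head dimension), the explicit mask construction and the function name are illustrative assumptions, not the patent's implementation.

```python
import math
import torch
import torch.nn.functional as F

def block_causal_attention(q_i, k_blocks, v_blocks, block_index, block_len):
    """Causal attention for the i-th sub-sequence block.

    q_i:      (B, H, block_len, d)  Q matrix of the current block
    k_blocks: K matrices of blocks 0..i, each (B, H, block_len, d)
    v_blocks: V matrices of blocks 0..i, each (B, H, block_len, d)
    """
    k_cat = torch.cat(k_blocks, dim=2)   # target K matrix (spliced over positions)
    v_cat = torch.cat(v_blocks, dim=2)   # target V matrix
    d = q_i.size(-1)

    # global positions of the queries in this block and of all spliced keys
    offset = block_index * block_len
    q_pos = torch.arange(offset, offset + block_len, device=q_i.device)
    k_pos = torch.arange(k_cat.size(2), device=q_i.device)
    # causal mask: a query may only attend to keys at positions <= its own position
    mask = k_pos[None, :] <= q_pos[:, None]            # (block_len, total_len)

    scores = q_i @ k_cat.transpose(-1, -2) / math.sqrt(d)
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v_cat           # o_n^i
```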
When performing backward computation, each computing device on the computing operation pipeline computes the target gradients respectively corresponding to the Q matrix, the K matrix and the V matrix. On the basis of the causal attention computation procedure given in the above embodiment, the computation process of the target gradients respectively corresponding to the Q matrix, the K matrix and the V matrix is given next.
For the i-th sub-sequence block x_n^i in the sub-sequence block set corresponding to the training sample subset x_n, the process by which the k-th computing device on the computing operation pipeline computes the target gradients corresponding to the Q matrix, the K matrix and the V matrix of the sub-sequence block may include:
Step b-a, computing the gradient DQ_n^i corresponding to the Q matrix of the sub-sequence block x_n^i; the obtained gradient is used as the target gradient corresponding to the Q matrix of x_n^i.
Step b-b1, computing the gradient corresponding to the target K matrix corresponding to the sub-sequence block x_n^i.
The gradient corresponding to the target K matrix corresponding to the sub-sequence block x_n^i includes the gradient DK_n^(ii) corresponding to the K matrix of x_n^i and the gradients corresponding to the K matrices of the forward sub-sequence blocks of x_n^i (i.e., x_n^(i-1), …, x_n^1, x_n^0), namely DK_n^((i-1)i), …, DK_n^(1i), DK_n^(0i).
The gradient corresponding to the target K matrix corresponding to the sub-sequence block x_n^i can be expressed as:
{DK_n^(0i), DK_n^(1i), …, DK_n^((i-1)i), DK_n^(ii)}    (2)
Step b-b2, summing the gradient corresponding to the K matrix of x_n^i within the gradient corresponding to the target K matrix of x_n^i (i.e., DK_n^(ii)) with the gradients corresponding to the K matrix of x_n^i within the gradients corresponding to the target K matrices of the backward sub-sequence blocks of x_n^i; the summed gradient is used as the target gradient corresponding to the K matrix of x_n^i.
The backward sub-sequence blocks of the sub-sequence block x_n^i are the sub-sequence blocks located after x_n^i in the sub-sequence block set to which x_n^i belongs, namely x_n^(i+1), x_n^(i+2), …, x_n^(C-1).
The target gradient corresponding to the K matrix of the sub-sequence block x_n^i can be expressed as:
DK_n^(ii) + DK_n^(i(i+1)) + … + DK_n^(i(C-1))    (3)
If the sub-sequence block x_n^i is the last sub-sequence block in the sub-sequence block set, the gradient corresponding to the K matrix of x_n^i is directly used as the target gradient corresponding to the K matrix of x_n^i.
Illustratively, suppose the training sample subset x_n comprises 3 sub-sequence blocks, namely x_n^0, x_n^1 and x_n^2. Computing the gradient corresponding to the target K matrix corresponding to x_n^2 yields DK_n^(02), DK_n^(12) and DK_n^(22), where DK_n^(22) is used as the target gradient corresponding to the K matrix of x_n^2; computing the gradient corresponding to the target K matrix corresponding to x_n^1 yields DK_n^(01) and DK_n^(11), and DK_n^(11) and DK_n^(12) are summed to obtain the target gradient corresponding to the K matrix of x_n^1; computing the gradient corresponding to the target K matrix corresponding to x_n^0 yields DK_n^(00), and DK_n^(00), DK_n^(01) and DK_n^(02) are summed to obtain the target gradient corresponding to the K matrix of x_n^0.
Step b-c1, computing the gradient corresponding to the target V matrix corresponding to the sub-sequence block x_n^i.
The gradient corresponding to the target V matrix corresponding to the sub-sequence block x_n^i includes the gradient DV_n^(ii) corresponding to the V matrix of x_n^i and the gradients corresponding to the V matrices of the forward sub-sequence blocks of x_n^i (i.e., x_n^(i-1), …, x_n^1, x_n^0), namely DV_n^((i-1)i), …, DV_n^(1i), DV_n^(0i).
Step b-c2, summing the gradient corresponding to the V matrix of x_n^i within the gradient corresponding to the target V matrix of x_n^i (i.e., DV_n^(ii)) with the gradients corresponding to the V matrix of x_n^i within the gradients corresponding to the target V matrices of the backward sub-sequence blocks of x_n^i; the summed gradient is used as the target gradient corresponding to the V matrix of x_n^i.
The target gradient corresponding to the V matrix of the sub-sequence block x_n^i can be expressed as:
DV_n^(ii) + DV_n^(i(i+1)) + … + DV_n^(i(C-1))    (4)
If the sub-sequence block x_n^i is the last sub-sequence block in the sub-sequence block set, the gradient corresponding to the V matrix of x_n^i is directly used as the target gradient corresponding to the V matrix of x_n^i.
Illustratively, suppose the training sample subset x_n comprises 3 sub-sequence blocks, namely x_n^0, x_n^1 and x_n^2. Computing the gradient corresponding to the target V matrix corresponding to x_n^2 yields DV_n^(02), DV_n^(12) and DV_n^(22), where DV_n^(22) is used as the target gradient corresponding to the V matrix of x_n^2; computing the gradient corresponding to the target V matrix corresponding to x_n^1 yields DV_n^(01) and DV_n^(11), and DV_n^(11) and DV_n^(12) are summed to obtain the target gradient corresponding to the V matrix of x_n^1; computing the gradient corresponding to the target V matrix corresponding to x_n^0 yields DV_n^(00), and DV_n^(00), DV_n^(01) and DV_n^(02) are summed to obtain the target gradient corresponding to the V matrix of x_n^0.
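For illustration, the accumulation described in equations (3) and (4) can be sketched as follows; the dictionary layout of the per-block gradient contributions is an assumption of the sketch.

```python
import torch

def accumulate_kv_grads(contrib, C):
    """contrib[(j, i)] holds the gradient contribution to the K (or V) matrix of
    forward block j obtained while back-propagating through block i (i >= j).
    Returns, for each block j, the target gradient: the contribution from block j
    itself plus the contributions from all of its backward blocks."""
    return [
        torch.stack([contrib[(j, i)] for i in range(j, C)]).sum(dim=0)
        for j in range(C)
    ]

# usage with the 3-block example above (random tensors standing in for DK_n^(ji)):
C = 3
contrib = {(j, i): torch.randn(2, 4) for i in range(C) for j in range(i + 1)}
target_grads = accumulate_kv_grads(contrib, C)   # target gradients for blocks 0, 1, 2
```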
Next, taking GPT-3 as an example language model, the activation-value memory occupancy and the pipeline bubble rate of the language model training method provided by the application are described.
Let the hidden-layer dimension of GPT-3 be denoted as h, the number of transformer decoder layers included in GPT-3 as L, the pipeline parallelism as P (P = N), and the virtual pipeline parallelism as P'. The number of transformer decoder layers deployed on each computing device is l (l = L/P), and the number of training sample subsets is m. Then:
With the existing language model training method (each training sample subset is directly input into the computing operation pipeline formed by the N computing devices for computation), the peak memory occupied by the activation values that each computing device needs to cache (hereinafter simply referred to as the "activation-value memory occupancy peak") is:
M_peak,existing = W × l × M_B    (5)

where W is the number of forward computations performed by the first computing device in the warm-up phase, and M_B is the activation-value memory occupied by a single transformer decoder layer for one complete training sample subset.
With the language model training method provided by the application (each training sample subset is split into C equal-length parts in the sequence dimension to obtain the sub-sequence block set corresponding to that training sample subset, and each sub-sequence block in the set is input in turn into the computing operation pipeline formed by the N computing devices for computation), because the training sample subset is split into C equal-length parts in the sequence dimension, the number of forward computations performed by the first computing device in the warm-up phase increases by C − 1 compared with the case without splitting, while the activation-value memory occupied by a single transformer decoder layer in one forward computation becomes M_B/C. Taking this into account, the activation-value memory occupancy peak of each computing device is:
M_peak,proposed = (W + C − 1) × l × M_B / C    (6)
Compared with the existing language model training method, the memory saved by the language model training method provided by the application is:

ΔM = M_peak,existing − M_peak,proposed = (W − 1) × (C − 1) × l × M_B / C    (7)
Because C is greater than 1 and, in practice, the first computing device performs more than one forward computation in the warm-up phase (W > 1), the memory ΔM saved by the language model training method provided by the application is greater than 0.
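A minimal numeric sketch of formulas (5) to (7); the function name and the concrete values of W, l and M_B below are illustrative placeholders rather than figures from the application:

```python
def activation_memory_peaks(W, l, M_B, C):
    """Peak activation memory per computing device, following formulas (5)-(7):
    W warm-up forward computations on the first device, l decoder layers per device,
    M_B activation memory of one layer for a full training sample subset,
    C-way splitting in the sequence dimension."""
    existing = W * l * M_B                    # formula (5): no sequence splitting
    proposed = (W + C - 1) * l * M_B / C      # formula (6): C sub-sequence blocks
    saved = existing - proposed               # formula (7)
    return existing, proposed, saved

# Illustrative placeholder values only (M_B in GB).
ex, pr, sv = activation_memory_peaks(W=12, l=8, M_B=2.0, C=4)
print(f"existing peak: {ex:.1f} GB, proposed peak: {pr:.1f} GB, saved: {sv:.1f} GB")
```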
TABLE 1 comparison of the effects of the language model training method provided by the present application and the existing language model training method
Rows 2 and 3 of the table show the number of forward computations performed by the computing devices in the warm-up phase, the activation-value memory occupancy peak of the computing devices, and the pipeline bubble rate when the existing language model training method uses the 1F1B-I scheduling strategy and the ZB-V scheduling strategy respectively; rows 4 and 5 show the same quantities when the language model training method provided by the application uses the 1F1B-I and ZB-V scheduling strategies respectively.
Taking the GPT3-175B model as an example (h = 12288, L = 96), when P = 12, T = 8, P' = 8, C = 4 and S = 6144, training with the 1F1B-I scheduling strategy using the language model training method provided by the application reduces the theoretical activation-value memory occupancy of each computing device by 24.65 GB compared with the existing language model training method; with the ZB-V scheduling strategy, the reduction is 20.61 GB. It should be noted that, when model training is performed, the computing devices perform model training in a pipeline-parallel manner, and each computing device may additionally perform its own computation in a tensor-parallel manner, where T is the tensor parallelism.
Compared with the existing language model training method, when training with the language model training method provided by the application, the number of pipeline bubbles is essentially unchanged, but the absolute duration of each bubble becomes 1/C of the original, so the overall bubble rate is lower. Because each training sample subset is split into C parts, the number of parameter-gradient accumulations increases by C − 1 (each computing device accumulates the parameter gradients computed for each sub-sequence block and updates the parameters of its deployed model blocks according to the accumulated gradients), that is, an extra multiply-add amount of (C−1)×(12h^2+13h) is introduced; however, this is far smaller than the extra computation introduced by a full recomputation method, namely 24BSh^2+12BSh. Therefore, compared with the existing language model training method, the language model training method provided by the application can effectively reduce the memory occupancy and the pipeline bubble rate while adding only a small amount of extra computation, and can maintain a high MFU (Model FLOPs Utilization).
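As a rough comparison of the two extra-computation terms quoted above, using the expressions exactly as stated; the value of B below is an assumed microbatch size, not a figure from the application:

```python
def extra_multiply_adds(C, h, B, S):
    """Extra multiply-adds from C-way gradient accumulation, (C-1)*(12h^2+13h),
    versus the extra multiply-adds of full recomputation, 24*B*S*h^2 + 12*B*S*h."""
    grad_accum = (C - 1) * (12 * h ** 2 + 13 * h)
    recompute = 24 * B * S * h ** 2 + 12 * B * S * h
    return grad_accum, recompute

# GPT3-175B-scale hidden size with the S from the example above and an assumed B.
ga, rc = extra_multiply_adds(C=4, h=12288, B=1, S=6144)
print(f"gradient accumulation: {ga:.2e} MACs  vs  full recomputation: {rc:.2e} MACs")
```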
The embodiment of the application also provides a language model training device, as shown in fig. 5, the language model training device may include: a model segmentation and deployment module 501, a training sample set acquisition module 502, a training sample set partitioning module 503, a training sample subset segmentation module 504 and a model training module 505.
The model segmentation and deployment module 501 is configured to segment the language model into a plurality of model blocks according to a layer, and deploy the plurality of model blocks on a plurality of computing devices.
Wherein each model block obtained by segmentation comprises one layer or a plurality of continuous layers in the language model, and one or a plurality of model blocks in a plurality of model blocks are deployed on each computing device.
The training sample set obtaining module 502 is configured to obtain a training sample set.
The training sample set comprises A training sequences with the length of S, wherein the A training sequences are obtained from the training sequence set, and both A and S are integers larger than 1.
The training sample set dividing module 503 is configured to divide the training sample set to obtain a plurality of training sample subsets.
Wherein each training sample subset includes B training sequences of length S, B being an integer less than a.
The training sample subset segmentation module 504 is configured to segment the plurality of training sample subsets in the sequence dimension respectively, so as to obtain the sub-sequence block sets corresponding to the plurality of training sample subsets respectively.
Wherein each sub-sequence block in the set of sub-sequence blocks comprises B sub-sequences.
The model training module 505 is configured to control a plurality of computing devices to perform model training by using each sub-sequence block in the sub-sequence block set corresponding to each of the plurality of training sample subsets in a pipelined parallel training manner.
In one possible implementation, the model segmentation and deployment module 501 is specifically configured to, when segmenting a language model into a plurality of model blocks by layer and deploying the plurality of model blocks on a plurality of computing devices:
The language model is divided into N model blocks according to layers, the N model blocks are deployed on N computing devices, one model block in the N model blocks is deployed on each computing device, and N is an integer greater than 1.
In another possible implementation, the model segmentation and deployment module 501 is specifically configured to, when segmenting the language model into a plurality of model blocks by layer and deploying the plurality of model blocks on a plurality of computing devices:
The language model is divided into M model blocks according to layers, and the M model blocks are deployed on N computing devices, wherein one model block or a plurality of discontinuous model blocks of the M model blocks are deployed on each computing device, and M is an integer larger than N.
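A minimal sketch of one way to realize this layer-wise splitting and deployment: a round-robin assignment of num_devices × blocks_per_device contiguous layer blocks, so each device holds either one block or several non-adjacent blocks. The function and its arguments are illustrative assumptions, not the application's exact assignment rule.

```python
def assign_blocks_to_devices(num_layers: int, num_devices: int, blocks_per_device: int = 1):
    """Split num_layers decoder layers into num_devices * blocks_per_device contiguous
    blocks and assign block b to device b % num_devices (round-robin)."""
    num_blocks = num_devices * blocks_per_device
    assert num_layers % num_blocks == 0, "layers must divide evenly into blocks"
    per_block = num_layers // num_blocks
    blocks = [list(range(b * per_block, (b + 1) * per_block)) for b in range(num_blocks)]
    return {d: [blocks[b] for b in range(num_blocks) if b % num_devices == d]
            for d in range(num_devices)}

# One block per device: each device holds one contiguous block of layers.
print(assign_blocks_to_devices(num_layers=8, num_devices=2, blocks_per_device=1))
# M > N (two blocks per device): each device holds two non-adjacent blocks.
print(assign_blocks_to_devices(num_layers=8, num_devices=2, blocks_per_device=2))
```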
In one possible implementation manner, the training sample subset segmentation module 504 is specifically configured to, when segmenting the plurality of training sample subsets in the sequence dimension to obtain sub-sequence block sets corresponding to the plurality of training sample subsets respectively:
for each training sample subset, dividing the training sample subset into equal-length C parts in a sequence dimension, and combining the C sub-sequence blocks obtained by dividing into sub-sequence block sets corresponding to the training sample subset, wherein C is an integer greater than 1.
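A minimal sketch of this sequence-dimension splitting, assuming each training sample subset is a [B, S] token tensor; the names and shapes are illustrative assumptions:

```python
import torch

def split_subset_into_chunks(subset: torch.Tensor, C: int):
    """Split a [B, S] training sample subset into C equal-length sub-sequence
    blocks of shape [B, S // C] along the sequence dimension."""
    B, S = subset.shape
    assert S % C == 0, "S must be divisible by C for equal-length blocks"
    return list(subset.split(S // C, dim=1))

# Example: B = 2 sequences of length S = 8, split into C = 4 blocks of length 2.
subset = torch.arange(16).reshape(2, 8)
chunks = split_subset_into_chunks(subset, C=4)   # 4 tensors of shape [2, 2]
```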
In one possible implementation manner, when controlling the plurality of computing devices to perform model training by using each sub-sequence block in the sub-sequence block sets corresponding to the plurality of training sample subsets in a pipeline-parallel training manner, the model training module 505 is specifically configured to:
Inputting each subsequence block in the subsequence block set corresponding to each training sample subset into a computing operation pipeline formed by a plurality of computing devices one by one in sequence for computing, wherein the computing operation pipeline executes forward computing and backward computing for each subsequence block which is input;
And, after controlling the plurality of computing devices to complete the forward computation and the backward computation for each sub-sequence block in the sub-sequence block sets corresponding to the plurality of training sample subsets, controlling the plurality of computing devices to update the parameters of the model blocks deployed on them according to the gradient data obtained by performing the backward computation on each sub-sequence block in the sub-sequence block sets corresponding to the plurality of training sample subsets.
In one possible implementation, the computing process of each computing device on the computing operation pipeline includes a first computing stage, a second computing stage and a third computing stage; each computing device on the computing operation pipeline performs only forward computation in the first computing stage, alternately performs forward computation and backward computation in the second computing stage, and performs only backward computation in the third computing stage.
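A minimal sketch of such a per-device three-stage schedule (warm-up forwards, alternating one-forward-one-backward, backward-only drain), assuming a plain non-interleaved 1F1B pipeline; the 1F1B-I and ZB-V strategies mentioned above follow the same three-stage pattern with different warm-up counts. The function name and arguments are illustrative assumptions.

```python
def one_f_one_b_schedule(num_stages: int, stage_id: int, num_microbatches: int):
    """Return the ordered list of ("F", idx) / ("B", idx) operations for one pipeline
    stage: forward-only warm-up, alternating 1F1B steady state, backward-only drain."""
    warmup = min(num_stages - stage_id - 1, num_microbatches)
    ops, fwd, bwd = [], 0, 0
    for _ in range(warmup):                          # first computing stage
        ops.append(("F", fwd)); fwd += 1
    for _ in range(num_microbatches - warmup):       # second computing stage
        ops.append(("F", fwd)); fwd += 1
        ops.append(("B", bwd)); bwd += 1
    for _ in range(warmup):                          # third computing stage
        ops.append(("B", bwd)); bwd += 1
    return ops

# Stage 0 of a 4-stage pipeline with 8 micro-chunks: 3 warm-up forwards, then 1F1B, then 3 drain backwards.
print(one_f_one_b_schedule(num_stages=4, stage_id=0, num_microbatches=8))
```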
The forward calculations performed by each computing device on the computing operation pipeline include causal attention calculations;
In one possible implementation, for a sub-sequence block in a sub-sequence block set, the causal attention computation that a computing device on the computing operation pipeline performs for the sub-sequence block comprises:
determining a Q matrix, a K matrix and a V matrix of the sub-sequence block according to input data;
Concatenating the K matrices of the forward sub-sequence blocks of the sub-sequence block with the K matrix of the sub-sequence block, the concatenated matrix serving as the target K matrix corresponding to the sub-sequence block, and concatenating the V matrices of the forward sub-sequence blocks of the sub-sequence block with the V matrix of the sub-sequence block, the concatenated matrix serving as the target V matrix corresponding to the sub-sequence block, wherein a forward sub-sequence block of the sub-sequence block is a sub-sequence block located before the sub-sequence block in the sub-sequence block set to which the sub-sequence block belongs;
And performing causal attention computation on the Q matrix of the sub-sequence block, the target K matrix corresponding to the sub-sequence block, and the target V matrix corresponding to the sub-sequence block (an illustrative sketch of this chunked causal attention follows this description).
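A minimal sketch of this chunked causal attention, assuming single-head attention and a [B, s, d] tensor layout per sub-sequence block; the cache lists, mask construction and function name are assumptions introduced for illustration, not the application's implementation:

```python
import math
import torch
import torch.nn.functional as F

def chunked_causal_attention(q_i, k_i, v_i, k_cache, v_cache):
    """Causal attention for sub-sequence block i: the K/V matrices of the forward
    blocks (k_cache / v_cache) are concatenated with this block's K/V to form the
    target K/V, and the mask lets each query attend to all earlier positions."""
    B, s, d = q_i.shape
    k_tgt = torch.cat(k_cache + [k_i], dim=1)             # target K: [B, (i+1)*s, d]
    v_tgt = torch.cat(v_cache + [v_i], dim=1)             # target V: [B, (i+1)*s, d]
    scores = q_i @ k_tgt.transpose(1, 2) / math.sqrt(d)   # [B, s, (i+1)*s]
    total = k_tgt.shape[1]
    q_pos = torch.arange(total - s, total).unsqueeze(1)   # global positions of this block's queries
    k_pos = torch.arange(total).unsqueeze(0)              # global positions of all keys so far
    scores = scores.masked_fill(k_pos > q_pos, float("-inf"))
    return F.softmax(scores, dim=-1) @ v_tgt              # [B, s, d]

# Iterate over the C sub-sequence blocks of one training sample subset (toy sizes).
B, s, d, C = 2, 4, 8, 3
k_cache, v_cache, outputs = [], [], []
for i in range(C):
    q_i, k_i, v_i = (torch.randn(B, s, d) for _ in range(3))  # stand-ins for the block's Q/K/V
    outputs.append(chunked_causal_attention(q_i, k_i, v_i, k_cache, v_cache))
    k_cache.append(k_i); v_cache.append(v_i)
```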
The backward computation executed by each computing device on the computing operation pipeline comprises computation of target gradients respectively corresponding to a Q matrix, a K matrix and a V matrix;
In one possible implementation, the process of determining, by a computing device on a computing operation pipeline, target gradients for Q, K, and V matrices of a sub-sequence block, respectively, includes:
Calculating the gradient corresponding to the Q matrix of the subsequence block, wherein the obtained gradient is used as a target gradient corresponding to the Q matrix of the subsequence block;
Calculating the gradients corresponding to the target K matrix corresponding to the sub-sequence block, wherein the gradients corresponding to the target K matrix of the sub-sequence block include the gradient corresponding to the K matrix of the sub-sequence block and the gradients corresponding to the K matrices of the forward sub-sequence blocks of the sub-sequence block; summing the gradient corresponding to the K matrix of the sub-sequence block among the gradients of the target K matrix of the sub-sequence block with the gradients corresponding to the K matrix of the sub-sequence block among the gradients of the target K matrices of the backward sub-sequence blocks of the sub-sequence block, the summed gradient serving as the target gradient corresponding to the K matrix of the sub-sequence block; a backward sub-sequence block of the sub-sequence block is a sub-sequence block located after the sub-sequence block in the sub-sequence block set to which the sub-sequence block belongs;
Calculating the gradients corresponding to the target V matrix corresponding to the sub-sequence block, wherein the gradients corresponding to the target V matrix of the sub-sequence block include the gradient corresponding to the V matrix of the sub-sequence block and the gradients corresponding to the V matrices of the forward sub-sequence blocks of the sub-sequence block; and summing the gradient corresponding to the V matrix of the sub-sequence block among the gradients of the target V matrix of the sub-sequence block with the gradients corresponding to the V matrix of the sub-sequence block among the gradients of the target V matrices of the backward sub-sequence blocks of the sub-sequence block, the summed gradient serving as the target gradient corresponding to the V matrix of the sub-sequence block.
In the language model training device provided by the embodiment of the application, after the plurality of training sample subsets are obtained, they are not used directly for model training; instead, each training sample subset is further split in the sequence dimension into a plurality of sub-sequence blocks, and the model is then trained with the sub-sequence blocks obtained by the splitting. In the forward computing stage, since each computing device performs forward computation on sub-sequence blocks rather than on whole training sample subsets, the amount of activation-value data generated by a forward computation on a sub-sequence block is much smaller than that generated by a forward computation on a training sample subset, so the memory requirement is greatly reduced. In addition, training the model with the sub-sequence blocks obtained by splitting the training sample subsets not only effectively reduces the memory requirement but also effectively reduces the pipeline bubble rate.
The embodiment of the application also provides an electronic device, referring to fig. 6, which shows a schematic structural diagram of the electronic device, the electronic device may include: at least one processor 601, at least one communication interface 602, at least one memory 603 and at least one communication bus 604.
In the embodiment of the present application, the number of the processor 601, the communication interface 602, the memory 603 and the communication bus 604 is at least one, and the processor 601, the communication interface 602 and the memory 603 complete communication with each other through the communication bus 604;
The processor 601 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application;
The memory 603 may include a high-speed RAM, and may further include a non-volatile memory, such as at least one disk memory;
The memory stores a program, and the processor may call the program stored in the memory, where the program is used to implement the language model training method provided in the foregoing embodiment.
The embodiment of the application also provides a computer-readable storage medium, which stores a computer program/instructions suitable for execution by a processor; when executed by the processor, the computer program/instructions implement the language model training method provided in the foregoing embodiments.
The embodiment of the application also provides a computer program product comprising computer-readable instructions that, when run on an electronic device, enable the electronic device to implement the language model training method provided in the foregoing embodiments.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the present specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (11)
1. A method for training a language model, comprising:
splitting a language model into a plurality of model blocks according to layers, and deploying the plurality of model blocks on a plurality of computing devices, wherein each model block obtained by splitting comprises one layer or a plurality of continuous layers in the language model, and one or a plurality of model blocks in the plurality of model blocks are deployed on each computing device;
Obtaining a training sample set, wherein the training sample set comprises A training sequences with the length of S, which are obtained from the training sequence set, and both A and S are integers larger than 1;
dividing the training sample set to obtain a plurality of training sample subsets, wherein each training sample subset comprises B training sequences with the length of S, and B is an integer smaller than A;
Dividing the plurality of training sample subsets in a sequence dimension respectively to obtain subsequence block sets corresponding to the plurality of training sample subsets respectively, wherein each subsequence block in the subsequence block sets comprises B subsequences;
And controlling the plurality of computing devices to perform model training by utilizing each sub-sequence block in the sub-sequence block sets respectively corresponding to the plurality of training sample subsets and adopting a pipeline-parallel training mode.
2. The language model training method of claim 1, wherein the slicing the language model into a plurality of model blocks by layers and deploying the plurality of model blocks on a plurality of computing devices comprises:
Dividing a language model into N model blocks according to layers, and deploying the N model blocks on N computing devices, wherein each computing device is provided with one model block in the N model blocks, and N is an integer greater than 1;
Or slicing the language model into M model blocks according to layers, and deploying the M model blocks on N computing devices, wherein each computing device is provided with one model block or a plurality of discontinuous model blocks in the M model blocks, and M is an integer larger than N.
3. The language model training method of claim 1, wherein the segmenting the plurality of training sample subsets in the sequence dimension to obtain sub-sequence block sets respectively corresponding to the plurality of training sample subsets comprises:
for each training sample subset, dividing the training sample subset into equal-length C parts in a sequence dimension, and combining the C sub-sequence blocks obtained by dividing into sub-sequence block sets corresponding to the training sample subset, wherein C is an integer greater than 1.
4. The language model training method according to claim 1, wherein the controlling the plurality of computing devices to perform model training by using each sub-sequence block in the sub-sequence block set respectively corresponding to the plurality of training sample subsets in a pipelined parallel training manner comprises:
Inputting each subsequence block in the subsequence block set corresponding to each training sample subset into a computing operation pipeline formed by a plurality of computing devices one by one in sequence for computing, wherein the computing operation pipeline executes forward computing and backward computing for each subsequence block which is input;
And, after controlling the plurality of computing devices to complete the forward computation and the backward computation for each sub-sequence block in the sub-sequence block sets corresponding to the plurality of training sample subsets, controlling the plurality of computing devices to update the parameters of the model blocks deployed on them according to the gradient data obtained by performing the backward computation on each sub-sequence block in the sub-sequence block sets corresponding to the plurality of training sample subsets.
5. The language model training method of claim 4, wherein the computing process of each computing device on the computing operation pipeline comprises a first computing stage, a second computing stage, and a third computing stage;
Each computing device on the computing operation pipeline performs only forward computation in the first computing stage, alternately performs forward computation and backward computation in the second computing stage, and performs only backward computation in the third computing stage.
6. The language model training method of claim 4, wherein the forward calculations performed by each computing device on the computing operation pipeline comprise causal attention calculations;
For a sub-sequence block in a set of sub-sequence blocks, a computing device on the computing operation pipeline performs causal attention computation for the sub-sequence block, comprising:
determining a Q matrix, a K matrix and a V matrix of the sub-sequence block according to input data;
Concatenating the K matrices of the forward sub-sequence blocks of the sub-sequence block with the K matrix of the sub-sequence block, the concatenated matrix serving as the target K matrix corresponding to the sub-sequence block, and concatenating the V matrices of the forward sub-sequence blocks of the sub-sequence block with the V matrix of the sub-sequence block, the concatenated matrix serving as the target V matrix corresponding to the sub-sequence block, wherein a forward sub-sequence block of the sub-sequence block is a sub-sequence block located before the sub-sequence block in the sub-sequence block set to which the sub-sequence block belongs;
And carrying out causal attention calculation on the Q matrix of the sub-sequence block, the target K matrix corresponding to the sub-sequence block and the target V matrix corresponding to the sub-sequence block.
7. The language model training method of claim 6, wherein the backward computation performed by each computing device on the computing operation pipeline comprises computation of a target gradient corresponding to a Q matrix, a K matrix, and a V matrix, respectively;
the process of determining the target gradients respectively corresponding to the Q matrix, the K matrix and the V matrix of a subsequence block by a computing device on the computing operation pipeline comprises the following steps:
Calculating the gradient corresponding to the Q matrix of the subsequence block, wherein the obtained gradient is used as a target gradient corresponding to the Q matrix of the subsequence block;
Calculating the gradients corresponding to the target K matrix corresponding to the sub-sequence block, wherein the gradients corresponding to the target K matrix of the sub-sequence block include the gradient corresponding to the K matrix of the sub-sequence block and the gradients corresponding to the K matrices of the forward sub-sequence blocks of the sub-sequence block; summing the gradient corresponding to the K matrix of the sub-sequence block among the gradients of the target K matrix of the sub-sequence block with the gradients corresponding to the K matrix of the sub-sequence block among the gradients of the target K matrices of the backward sub-sequence blocks of the sub-sequence block, the summed gradient serving as the target gradient corresponding to the K matrix of the sub-sequence block; a backward sub-sequence block of the sub-sequence block is a sub-sequence block located after the sub-sequence block in the sub-sequence block set to which the sub-sequence block belongs;
Calculating the gradients corresponding to the target V matrix corresponding to the sub-sequence block, wherein the gradients corresponding to the target V matrix of the sub-sequence block include the gradient corresponding to the V matrix of the sub-sequence block and the gradients corresponding to the V matrices of the forward sub-sequence blocks of the sub-sequence block; and summing the gradient corresponding to the V matrix of the sub-sequence block among the gradients of the target V matrix of the sub-sequence block with the gradients corresponding to the V matrix of the sub-sequence block among the gradients of the target V matrices of the backward sub-sequence blocks of the sub-sequence block, the summed gradient serving as the target gradient corresponding to the V matrix of the sub-sequence block.
8. A language model training apparatus, comprising: the system comprises a model segmentation and deployment module, a training sample set acquisition module, a training sample set division module, a training sample subset segmentation module and a model training module;
the model segmentation and deployment module is used for segmenting a language model into a plurality of model blocks according to layers and deploying the model blocks on a plurality of computing devices, wherein each model block obtained by segmentation comprises one layer or a plurality of continuous layers in the language model, and one or a plurality of model blocks in the model blocks are deployed on each computing device;
The training sample set acquisition module is used for acquiring a training sample set, wherein the training sample set comprises A training sequences with the length of S, which are acquired from the training sequence set, and both A and S are integers larger than 1;
the training sample set dividing module is used for dividing the training sample set to obtain a plurality of training sample subsets, wherein each training sample subset comprises B training sequences with the length of S, and B is an integer smaller than A;
The training sample subset segmentation module is used for respectively segmenting the plurality of training sample subsets in a sequence dimension to obtain a subsequence block set respectively corresponding to the plurality of training sample subsets, wherein each subsequence block in the subsequence block set comprises B subsequences;
The model training module is used for controlling the plurality of computing devices to perform model training by utilizing each sub-sequence block in the sub-sequence block sets respectively corresponding to the plurality of training sample subsets and adopting a pipeline-parallel training mode.
9. An electronic device comprising at least one processor and a memory coupled to the processor, wherein:
the memory is used for storing a computer program;
the processor is configured to execute the computer program to enable the electronic device to implement the language model training method of any one of claims 1 to 7.
10. A computer storage medium carrying one or more computer programs which, when executed by an electronic device, enable the electronic device to implement the language model training method of any one of claims 1 to 7.
11. A computer program product comprising computer readable instructions which, when run on an electronic device, cause the electronic device to implement the language model training method of any one of claims 1 to 7.