CN112862662A - Method and equipment for distributed training of transformer-xl language model - Google Patents

Method and equipment for distributed training of transformer-xl language model

Info

Publication number
CN112862662A
Authority
CN
China
Prior art keywords
training
data
sequence
sub
language model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110264864.4A
Other languages
Chinese (zh)
Inventor
沈华东
李轶杰
梁家恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd and Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority to CN202110264864.4A
Publication of CN112862662A
Legal status: Pending (Current)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method and equipment for distributed training of a transformer-xl language model, applied to the process of training the transformer-xl language model with DDP. The method comprises the following steps: acquiring text corpus data for training the transformer-xl language model; sorting all the text corpus data in context order; partitioning the sorted text corpus data into blocks according to the number of GPUs (graphics processing units), so as to divide the text corpus data into a plurality of pieces of sub-data; assigning different pieces of sub-data to different GPUs, where the order in which the GPUs train the assigned sub-data is consistent with the context order; and training the sub-data sequentially on the GPUs to train the transformer-xl language model. In this scheme, the sampling method in DDP is reconstructed and the reconstructed DDP is used to train transformer-xl, which accelerates training on massive text corpora, solves the efficiency problem, and preserves the historical information of the transformer-xl model.

Description

Method and equipment for distributed training of transformer-xl language model
Technical Field
The invention relates to the technical field of machine translation, and in particular to a method and equipment for distributed training of a transformer-xl language model.
Background
Training a language model uses large-scale text corpora, and processing with only a single GPU (Graphics Processing Unit) is slow, so multiple GPUs are often used to train together. Common training modes include DP (data parallel) and DDP (distributed data parallel).
As shown in FIG. 2A, the DP mode slices the data of one batch, allocates the slices to a plurality of GPUs for computation, and then synchronizes each GPU's results to one master GPU for the parameter update. DDP adopts an all-reduce scheme, distributing the batch data to each GPU according to a certain sampling mode, and each GPU updates its own parameters; compared with DP, DDP is much faster. At present, the transformer-xl model is generally trained on multiple cards in DP mode, because the context of the transformer-xl model is correlated and the batches must be trained in order; the sampling in DDP disturbs this order, so the transformer-xl model loses its historical information. As a result, training massive text corpora with DP is slow, while training with DDP loses the historical information of the transformer-xl model.
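For context, the snippet below is a minimal sketch of a conventional PyTorch DDP training setup, not code from the patent; the model, dataset, and hyperparameters are illustrative assumptions. It shows where the problem arises: the stock DistributedSampler shuffles and stripes indices across GPUs, which is exactly what breaks the context order that the transformer-xl recurrence depends on.

```python
# Illustrative baseline only -- standard PyTorch DDP wiring, not the patent's method.
# The default DistributedSampler shuffles and stripes indices across ranks, so the
# batches one GPU sees are not contiguous in context order.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def baseline_ddp_training(model, segments, local_rank, epochs=3, batch_size=4):
    dist.init_process_group(backend="nccl")              # one process per GPU
    torch.cuda.set_device(local_rank)
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])

    dataset = TensorDataset(segments)                    # hypothetical tokenized segments
    sampler = DistributedSampler(dataset)                # shuffle=True by default
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for epoch in range(epochs):
        sampler.set_epoch(epoch)                         # reshuffles every epoch
        for (batch,) in loader:
            loss = model(batch.cuda(local_rank)).mean()  # placeholder loss for the sketch
            optimizer.zero_grad()
            loss.backward()                              # DDP all-reduces gradients
            optimizer.step()
```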
Thus, there is a need for a better solution to the problems of the prior art.
Disclosure of Invention
The invention provides a method and equipment for distributed training of a transformer-xl language model, which can solve the technical problem in the prior art that the transformer-xl model loses historical information when trained in DDP (distributed data parallel) mode.
The technical scheme for solving the technical problems is as follows:
the embodiment of the invention provides a method for training a transform-xl language model in a distributed mode, which is applied to the process of training the transform-xl language model by DPP, and comprises the following steps:
acquiring text corpus data for training a tranformer xl language model;
sequencing all the text corpus data according to a context sequence;
partitioning the sequenced text corpus data into blocks according to the number of GPUs (graphics processing units) so as to divide the text corpus data into a plurality of subdata;
distributing different sub-data to different GPUs, wherein the GPU trains the sequence of the distributed sub-data to be consistent with the context sequence;
and sequentially training the sub data through the GPUs to realize training of the transform-xl language model.
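The sketch that follows is a rough illustration of these five steps under stated assumptions (plain Python over a toy corpus), not the patented implementation: the corpus stays in context order, it is cut into one contiguous block per GPU, and block i goes to the GPU at position i in the training order.

```python
# Minimal sketch of the reconstructed sampling flow described above (illustrative only).
# `corpus` is assumed to be a list of training segments already sorted in context order.
from typing import Dict, List, Sequence

def split_in_context_order(corpus: Sequence, num_gpus: int) -> List[Sequence]:
    """Partition the ordered corpus into num_gpus contiguous blocks (blocking step)."""
    block_size = (len(corpus) + num_gpus - 1) // num_gpus        # ceiling division
    return [corpus[i * block_size:(i + 1) * block_size] for i in range(num_gpus)]

def assign_blocks(blocks: List[Sequence], gpu_ranks: List[int]) -> Dict[int, Sequence]:
    """Give the i-th block to the i-th GPU in training order (assignment step)."""
    assert len(blocks) == len(gpu_ranks)
    return {rank: block for rank, block in zip(gpu_ranks, blocks)}

# Each GPU then iterates its own block front to back, so every rank sees text in
# context order and the transformer-xl recurrence memory remains valid.
assignment = assign_blocks(split_in_context_order(list(range(12)), 3), [0, 1, 2])
# -> {0: [0, 1, 2, 3], 1: [4, 5, 6, 7], 2: [8, 9, 10, 11]}
```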
In a specific embodiment, assigning the different pieces of sub-data to the different GPUs includes:
sorting the sub-data in context order to generate a first sequence, and sorting the GPUs in training order to generate a second sequence;
for each piece of sub-data, determining the rank of that sub-data in the first sequence, determining the GPU at the corresponding rank in the second sequence, and assigning the sub-data to the determined GPU.
In a specific embodiment, the "training the sub-data sequentially by each GPU to implement training of a transform-xl language model" includes:
sequentially training the sub data once through each GPU to realize training of one pair of transform-xl language models;
and finally training the transformer-xl language model by using a plurality of pairs of training wheels for the transformer-xl language model.
In a specific embodiment, in the training process of each pair of transform-xl language models, the iteration sequence of each subdata is the same.
In a specific embodiment, the same iteration sequence of each sub-data is realized by closing the parameter configuration of the shuffle of the DDP.
The embodiment of the invention also provides a device for distributed training of a transformer-xl language model, applied to the process of training the transformer-xl language model with DDP, the device comprising:
an acquisition module, configured to acquire text corpus data for training the transformer-xl language model;
a sorting module, configured to sort all the text corpus data in context order;
a blocking module, configured to partition the sorted text corpus data into blocks according to the number of GPUs, so as to divide the text corpus data into a plurality of pieces of sub-data;
an allocation module, configured to assign different pieces of sub-data to different GPUs, where the order in which the GPUs train the assigned sub-data is consistent with the context order;
and a training module, configured to train the sub-data sequentially on the GPUs to train the transformer-xl language model.
In a specific embodiment, the allocation module includes:
a generating module, configured to sort the sub-data in context order to generate a first sequence, and to sort the GPUs in training order to generate a second sequence;
a processing module, configured to determine, for each piece of sub-data, the rank of that sub-data in the first sequence, determine the GPU at the corresponding rank in the second sequence, and assign the sub-data to the determined GPU.
In a specific embodiment, the training module includes:
a single-round module, configured to train the sub-data once, sequentially on each GPU, to complete one round of training of the transformer-xl language model;
and a multi-round module, configured to complete the final training of the transformer-xl language model through multiple rounds of training.
In a specific embodiment, the iteration order of the sub-data is the same in every round of training of the transformer-xl language model.
In a specific embodiment, the same iteration order of the sub-data is achieved by turning off the shuffle parameter in the DDP configuration.
The invention has the beneficial effects that:
the embodiment of the invention provides a method and equipment for training a transform-xl language model in a distributed mode, which are applied to the process of training the transform-xl language model by DPP, and the method comprises the following steps: acquiring text corpus data for training a tranformer xl language model; sequencing all the text corpus data according to a context sequence; partitioning the sequenced text corpus data into blocks according to the number of GPUs (graphics processing units) so as to divide the text corpus data into a plurality of subdata; distributing different sub-data to different GPUs, wherein the GPU trains the sequence of the distributed sub-data to be consistent with the context sequence; and sequentially training the sub data through the GPUs to realize training of the transform-xl language model. According to the scheme, the sampling method in the DDP is reconstructed, the reconstructed DDP is used for training the tranformer xl, the training speed of massive text corpora is accelerated, the efficiency problem is solved, and the historical information of the tranformer xl model is kept.
Drawings
FIG. 1 is a flowchart of a method for distributed training of a transformer-xl language model according to an embodiment of the present invention;
FIG. 2A is a schematic diagram of data sampling in the prior art;
FIG. 2B is a schematic diagram of data sampling in a method for distributed training of a transformer-xl language model according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a device for distributed training of a transformer-xl language model according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of the allocation module in a device for distributed training of a transformer-xl language model according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of the training module in a device for distributed training of a transformer-xl language model according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
Example 1
Embodiment 1 of the invention discloses a method for distributed training of a transformer-xl language model, applied to the process of training the transformer-xl language model with DDP (distributed data parallel); as shown in FIG. 1, the method comprises the following steps:
Step 101, acquiring text corpus data for training the transformer-xl language model;
First, text corpus data needs to be acquired for the subsequent sampling.
Step 102, sorting all the text corpus data in context order;
As shown in FIG. 2B, the text corpus data is sorted in context order, denoted 1 … N in FIG. 2B.
Step 103, partitioning the sorted text corpus data into blocks according to the number of GPUs (graphics processing units), so as to divide the text corpus data into a plurality of pieces of sub-data;
Further, as shown in FIG. 2B, there are 3 GPUs, namely GPU1, GPU2, and GPU3; the sorted text corpus is accordingly divided into 3 contiguous blocks, namely 1 … N/3, N/3 … 2N/3, and 2N/3 … N.
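To make the contrast with stock DDP sampling concrete, the snippet below compares the two index layouts for a toy case of N = 12 segments on 3 GPUs; it is an illustrative comparison under assumptions, not text from the patent. Even without shuffling, a stock distributed sampler stripes indices across ranks, whereas the blocking of step 103 keeps each rank's slice contiguous.

```python
# Index layouts for N = 12 ordered segments on 3 GPUs -- illustrative comparison only.
N, world_size = 12, 3
indices = list(range(N))                        # already sorted in context order

# Stock striped assignment (what a default distributed sampler does, even unshuffled):
striped = {rank: indices[rank::world_size] for rank in range(world_size)}
# -> {0: [0, 3, 6, 9], 1: [1, 4, 7, 10], 2: [2, 5, 8, 11]}   context order broken per rank

# Blocked assignment of step 103: contiguous ranges 1 ... N/3, N/3 ... 2N/3, 2N/3 ... N.
block = N // world_size
contiguous = {rank: indices[rank * block:(rank + 1) * block] for rank in range(world_size)}
# -> {0: [0, 1, 2, 3], 1: [4, 5, 6, 7], 2: [8, 9, 10, 11]}   context order kept per rank
```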
Step 104, assigning different pieces of sub-data to different GPUs, where the order in which the GPUs train the assigned sub-data is consistent with the context order;
Specifically, in step 104, assigning the different pieces of sub-data to the different GPUs includes:
sorting the sub-data in context order to generate a first sequence, and sorting the GPUs in training order to generate a second sequence;
for each piece of sub-data, determining the rank of that sub-data in the first sequence, determining the GPU at the corresponding rank in the second sequence, and assigning the sub-data to the determined GPU.
Specifically, as shown in FIG. 2B, the training order is GPU1, GPU2, GPU3, so the block 1 … N/3 is assigned to GPU1, the block N/3 … 2N/3 is assigned to GPU2, and the block 2N/3 … N is assigned to GPU3.
Step 105, training the sub-data sequentially on the GPUs to train the transformer-xl language model.
Specifically, "training the sub-data sequentially on each GPU to train the transformer-xl language model" in step 105 includes:
training the sub-data once, sequentially on each GPU, to complete one round of training of the transformer-xl language model;
and completing the final training of the transformer-xl language model through multiple such rounds of training.
Specifically, taking FIG. 2B as an example, GPU1, GPU2, and GPU3 execute in order, so the sub-data assigned to GPU1, GPU2, and GPU3 are trained in order, completing one round of training. The complete training comprises multiple such rounds, and the iteration order of the sub-data is the same in every round.
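A rough sketch of what one rank's loop could look like under this scheme is given below; it is hypothetical code written under assumptions (in particular, `model(batch, mems)` is assumed to return a loss and the updated recurrence memory) and is not quoted from the patent. The point it illustrates is that each GPU walks its contiguous block front to back, carries the transformer-xl memory from batch to batch, and replays the same order in every round.

```python
# Hypothetical per-rank loop (a sketch, not the patent's code): contiguous block,
# fixed iteration order every round, recurrence memory carried across batches.
def train_rank_block(model, optimizer, block_batches, num_rounds=3, device="cuda"):
    """`block_batches` is this rank's contiguous sub-data, already in context order;
    `model(batch, mems)` is an assumed interface returning (loss, new_mems)."""
    for _ in range(num_rounds):                      # same iteration order every round
        mems = None                                  # recurrence memory starts empty
        for batch in block_batches:                  # front to back, never shuffled
            loss, mems = model(batch.to(device), mems)
            optimizer.zero_grad()
            loss.backward()                          # DDP all-reduces gradients here
            optimizer.step()
            mems = [m.detach() for m in mems]        # detach memory between segments
```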
Specifically, the iteration order of the sub-data is the same in every round of training of the transformer-xl language model.
Further, the same iteration order of the sub-data is achieved by turning off the shuffle parameter in the DDP configuration.
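In stock PyTorch, the "shuffle off" part of this configuration corresponds roughly to the sampler setup sketched below (an illustrative assumption with a hypothetical `segment_dataset`, rather than the patent's reconstructed sampler); note that disabling shuffling alone does not restore per-rank context order, which is why the contiguous blocking of steps 103 and 104 is also needed.

```python
# Illustrative sketch of the "shuffle off" configuration in stock PyTorch.
from torch.utils.data import DataLoader, DistributedSampler

sampler = DistributedSampler(segment_dataset, shuffle=False)  # same order every round
loader = DataLoader(segment_dataset, batch_size=1, sampler=sampler)

# Caveat: even with shuffle=False this stock sampler stripes indices across ranks
# (rank, rank + world_size, ...); the contiguous per-GPU blocking described above is
# what keeps each rank's data in context order.
```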
Example 2
Embodiment 2 of the present invention further discloses a device for distributed training of a transformer-xl language model, applied to the process of training the transformer-xl language model with DDP; as shown in FIG. 3, the device includes:
an acquisition module 201, configured to acquire text corpus data for training the transformer-xl language model;
a sorting module 202, configured to sort all the text corpus data in context order;
a blocking module 203, configured to partition the sorted text corpus data into blocks according to the number of GPUs, so as to divide the text corpus data into a plurality of pieces of sub-data;
an allocation module 204, configured to assign different pieces of sub-data to different GPUs, where the order in which the GPUs train the assigned sub-data is consistent with the context order;
and a training module 205, configured to train the sub-data sequentially on the GPUs to train the transformer-xl language model.
In a specific embodiment, as shown in FIG. 4, the allocation module 204 includes:
a generating module 2031, configured to sort the sub-data in context order to generate a first sequence, and to sort the GPUs in training order to generate a second sequence;
a processing module 2032, configured to determine, for each piece of sub-data, the rank of that sub-data in the first sequence, determine the GPU at the corresponding rank in the second sequence, and assign the sub-data to the determined GPU.
In a specific embodiment, as shown in FIG. 5, the training module 205 includes:
a single-round module 2051, configured to train the sub-data once, sequentially on each GPU, to complete one round of training of the transformer-xl language model;
and a multi-round module 2052, configured to complete the final training of the transformer-xl language model through multiple rounds of training.
In a specific embodiment, the iteration order of the sub-data is the same in every round of training of the transformer-xl language model.
In a specific embodiment, the same iteration order of the sub-data is achieved by turning off the shuffle parameter in the DDP configuration.
The embodiment of the invention provides a method and equipment for distributed training of a transformer-xl language model, applied to the process of training the transformer-xl language model with DDP. The method comprises: acquiring text corpus data for training the transformer-xl language model; sorting all the text corpus data in context order; partitioning the sorted text corpus data into blocks according to the number of GPUs (graphics processing units), so as to divide the text corpus data into a plurality of pieces of sub-data; assigning different pieces of sub-data to different GPUs, where the order in which the GPUs train the assigned sub-data is consistent with the context order; and training the sub-data sequentially on the GPUs to train the transformer-xl language model. In this scheme, the sampling method in DDP is reconstructed and the reconstructed DDP is used to train transformer-xl, which accelerates training on massive text corpora, solves the efficiency problem, and preserves the historical information of the transformer-xl model.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for distributed training of a transformer-xl language model, applied to a process of training the transformer-xl language model with DDP, the method comprising:
acquiring text corpus data for training the transformer-xl language model;
sorting all the text corpus data in context order;
partitioning the sorted text corpus data into blocks according to the number of GPUs (graphics processing units), so as to divide the text corpus data into a plurality of pieces of sub-data;
assigning different pieces of sub-data to different GPUs, wherein the order in which the GPUs train the assigned sub-data is consistent with the context order;
and training the sub-data sequentially on the GPUs to train the transformer-xl language model.
2. The method of claim 1, wherein assigning the different pieces of sub-data to the different GPUs comprises:
sorting the sub-data in context order to generate a first sequence, and sorting the GPUs in training order to generate a second sequence;
for each piece of sub-data, determining the rank of that sub-data in the first sequence, determining the GPU at the corresponding rank in the second sequence, and assigning the sub-data to the determined GPU.
3. The method of claim 1, wherein training the sub-data sequentially on the GPUs to train the transformer-xl language model comprises:
training the sub-data once, sequentially on each GPU, to complete one round of training of the transformer-xl language model;
and completing the final training of the transformer-xl language model through multiple such rounds of training.
4. The method of claim 3, wherein the iteration order of the sub-data is the same in every round of training of the transformer-xl language model.
5. The method of claim 4, wherein the same iteration order of the sub-data is achieved by turning off the shuffle parameter in the DDP configuration.
6. An apparatus for distributed training of a transformer-xl language model, applied to a process of training the transformer-xl language model with DDP, the apparatus comprising:
an acquisition module, configured to acquire text corpus data for training the transformer-xl language model;
a sorting module, configured to sort all the text corpus data in context order;
a blocking module, configured to partition the sorted text corpus data into blocks according to the number of GPUs, so as to divide the text corpus data into a plurality of pieces of sub-data;
an allocation module, configured to assign different pieces of sub-data to different GPUs, wherein the order in which the GPUs train the assigned sub-data is consistent with the context order;
and a training module, configured to train the sub-data sequentially on the GPUs to train the transformer-xl language model.
7. The apparatus of claim 6, wherein the allocation module comprises:
a generating module, configured to sort the sub-data in context order to generate a first sequence, and to sort the GPUs in training order to generate a second sequence;
a processing module, configured to determine, for each piece of sub-data, the rank of that sub-data in the first sequence, determine the GPU at the corresponding rank in the second sequence, and assign the sub-data to the determined GPU.
8. The apparatus of claim 6, wherein the training module comprises:
a single-round module, configured to train the sub-data once, sequentially on each GPU, to complete one round of training of the transformer-xl language model;
and a multi-round module, configured to complete the final training of the transformer-xl language model through multiple rounds of training.
9. The apparatus of claim 8, wherein the iteration order of the sub-data is the same in every round of training of the transformer-xl language model.
10. The apparatus of claim 9, wherein the same iteration order of the sub-data is achieved by turning off the shuffle parameter in the DDP configuration.
CN202110264864.4A 2021-03-12 2021-03-12 Method and equipment for distributed training of transformer-xl language model Pending CN112862662A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110264864.4A CN112862662A (en) 2021-03-12 2021-03-12 Method and equipment for distributed training of transformer-xl language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110264864.4A CN112862662A (en) 2021-03-12 2021-03-12 Method and equipment for distributed training of transformer-xl language model

Publications (1)

Publication Number Publication Date
CN112862662A true CN112862662A (en) 2021-05-28

Family

ID=75994045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110264864.4A Pending CN112862662A (en) Method and equipment for distributed training of transformer-xl language model

Country Status (1)

Country Link
CN (1) CN112862662A (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102509549A (en) * 2011-09-28 2012-06-20 盛乐信息技术(上海)有限公司 Language model training method and system
CN105786782A (en) * 2016-03-25 2016-07-20 北京搜狗科技发展有限公司 Word vector training method and device
US20190156215A1 (en) * 2017-11-18 2019-05-23 Neuralmagic Inc. Systems and methods for exchange of data in distributed training of machine learning algorithms
CN110503194A (en) * 2019-08-09 2019-11-26 苏州浪潮智能科技有限公司 A kind of method and system of distributed parallel training
CN110379416A (en) * 2019-08-15 2019-10-25 腾讯科技(深圳)有限公司 A kind of neural network language model training method, device, equipment and storage medium
CN110705705A (en) * 2019-09-25 2020-01-17 浪潮电子信息产业股份有限公司 Convolutional neural network model synchronous training method, cluster and readable storage medium
CN111381966A (en) * 2020-03-08 2020-07-07 苏州浪潮智能科技有限公司 Distributed parallel training method, device and readable medium
CN111159416A (en) * 2020-04-02 2020-05-15 腾讯科技(深圳)有限公司 Language task model training method and device, electronic equipment and storage medium
CN111832292A (en) * 2020-06-03 2020-10-27 北京百度网讯科技有限公司 Text recognition processing method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LITTLEWHITE: "The most complete PyTorch distributed tutorial is here!" ["最全PyTorch分布式教程"来了!], pages 1-7, Retrieved from the Internet <URL:https://zhuanlan.zhihu.com/p/267157806> *
ZHU, Xianglei [朱祥磊]: "Research and practice on accelerating distributed AI training" [加速AI分布式训练研究和实践], Telecommunications Technology [电信技术], no. 12, 25 December 2019 (2019-12-25), pages 28-31 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023241312A1 (en) * 2022-06-16 2023-12-21 北京火山引擎科技有限公司 Model training method and apparatus
CN115589446A (en) * 2022-09-26 2023-01-10 黑盒科技(广州)有限公司 Meeting abstract generation method and system based on pre-training and prompting

Similar Documents

Publication Publication Date Title
CN112862662A (en) Method and equipment for distributed training of transformer-xl language model
CN111553484A (en) Method, device and system for federal learning
DE102017008956A1 (en) Method for using a computer unit
CN104408192B (en) The compression processing method and device of character string type row
CN115204413A (en) Intelligent learning data processing method based on artificial intelligence
CN109948632A (en) Data training method, apparatus and system, computer equipment
CN106548674A (en) For the method for dynamically processing and system of net marking
CN117575044A (en) Data forgetting learning method, device, data processing system and storage medium
CN112199885B (en) Distributed model training system and application method
WO2001018642A3 (en) Method and apparatus for synchronizing function values in a multiple protocol system
CN111695689B (en) Natural language processing method, device, equipment and readable storage medium
CN113326329A (en) Method for pushing account-keeping person based on block chain consensus
CN111695701B (en) System for realizing data set construction processing based on federal learning and construction generation method thereof
CN110189465B (en) Random lottery method and equipment by means of block chain
CN111949786A (en) Intelligent question-answer model optimization method and device
CN106203632A (en) A kind of limited knowledge collection recombinant is also distributed the study of extraction and application system method
CN104750560B (en) A kind of information processing method and electronic equipment
CN103714591B (en) Rail vehicle operation data storage method and data recording equipment
CN110825453B (en) Data processing method and device based on big data platform
EP4325357A3 (en) Clustering processes using traffic data
CN103473374B (en) Patient data partitioning system and patient data partitioning method
CN107133639A (en) Merge the adaptively sampled method of non-equilibrium data of Boost models
CN106354581A (en) Cyclic redundancy check method and multi-core processor
CN109548153B (en) Resource allocation method based on modularity SC-FDMA system
CN112861549A (en) Method and equipment for training translation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination