CN118171108A - Data preprocessing method and system for accelerating training of large language model - Google Patents
Data preprocessing method and system for accelerating training of large language model
- Publication number
- CN118171108A (application number CN202410501460.6A)
- Authority
- CN
- China
- Prior art keywords
- data set
- data
- training
- language model
- large language
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/10—Pre-processing; Data cleansing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Health & Medical Sciences (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a data preprocessing method and system for accelerating the training of a large language model, addressing the problem that, in the existing training process of large language models, completely scattering the data leads to long training time while the randomness of the data still needs to be preserved. By scattering the data used for training the large language model, dividing it into sub-blocks and sorting each sub-block according to the length of its text data, training efficiency is greatly improved while a certain degree of randomness in the data set is retained, and the time cost of training is further reduced.
Description
Technical Field
The invention relates to the field of large language models, in particular to a data preprocessing method and system for accelerating training of a large language model.
Background
In recent years, with the rapid development of artificial intelligence technology, large language models have shown excellent performance and broad application potential in the field of natural language processing. Large language models typically have hundreds of millions or even billions of parameters and can handle a wide range of natural language processing tasks such as text generation, text classification, semantic understanding, machine translation, and dialog systems. However, due to their enormous parameter scale and the need for massive training data, the model training process tends to be time-consuming and computationally intensive.
Data preprocessing is an important step in machine learning and deep learning, and especially so for large language models. Advanced data preprocessing techniques can not only effectively reduce invalid computation and storage requirements, but also significantly improve model convergence speed by providing high-quality training samples, thereby optimizing the training efficiency of a large language model.
Disclosure of Invention
The invention provides a data preprocessing method and a system for accelerating training of a large language model, which are used for solving the problems that the training time of the large language model is long and that the randomness of the training data of the large language model cannot be ensured.
The first aspect of the invention provides a data preprocessing method for accelerating training of a large language model, which comprises the following steps:
Setting related parameters in the data preprocessing process;
acquiring a DATA set DATA1 for training a large language model;
Randomly scattering the DATA set DATA1 to form a DATA set DATA2;
dividing the DATA set DATA2 into b DATA set sub-blocks, each DATA set sub-block comprising one or more pieces of text DATA;
Uniformly arranging text data in each data set sub-block in ascending order or descending order according to the length of the text data;
splicing the sequenced DATA set sub-blocks according to the sequence to form a DATA set DATA3;
ordered reading of the DATA set DATA3 is accomplished by a large language model training device.
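A minimal Python sketch of the preprocessing steps above (DATA1 → DATA2 → DATA3) is given below for illustration only; the function name `preprocess`, the `seed` argument and the in-memory list representation are assumptions and not part of the claimed method.

```python
import random

def preprocess(texts, a, c, d, seed=0):
    """Shuffle, split into sub-blocks of q = a*c*d texts, sort each sub-block
    by length, and splice the sorted sub-blocks (DATA1 -> DATA2 -> DATA3)."""
    data2 = list(texts)
    random.Random(seed).shuffle(data2)        # DATA1 -> DATA2 (random scattering)
    q = a * c * d                             # pieces of text per data set sub-block
    sub_blocks = [data2[i:i + q] for i in range(0, len(data2), q)]
    for block in sub_blocks:                  # sort each sub-block by text length
        block.sort(key=len)                   # ascending; use reverse=True for descending
    data3 = [t for block in sub_blocks for t in block]   # splice end to end
    return data3
```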
Preferably, the relevant parameters specifically include the following:
The number a of large language model training devices; the number b of data set sub-blocks; the number c of pieces of text data input by a single device per input; the training coefficient d; the total number n of pieces of text data input by all training devices at once, where n=a×c; the number q of pieces of text data in a single data set sub-block, where q=a×c×d; and the total number K of pieces of text data in the data set, where K=b×q;
The parameters a, b, c, d, n, q and K are all positive integers. When the training coefficient d is smaller, the number q of pieces of text data in a data set sub-block is smaller, the overall randomness of the data is higher and the training effect of the large language model is better, but the training time of the large language model is longer; when the training coefficient d is larger, the number q of pieces of text data in a data set sub-block is larger, the overall randomness of the data is lower and the training time of the large language model is shorter, but the training effect of the large language model is poorer.
More preferably, the training coefficient d ∈ [500, 2000].
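As a worked check of the relations n=a×c, q=a×c×d and K=b×q, the snippet below uses, for concreteness, the parameter values that appear in the embodiment described later (a=3 devices, c=1 text per device per input, d=2000, K=6,000,000 texts); the variable names simply mirror the patent's symbols.

```python
a, c, d, K = 3, 1, 2000, 6_000_000
n = a * c          # pieces of text input by all devices at once: 3
q = a * c * d      # pieces of text per data set sub-block: 6000
b = K // q         # number of data set sub-blocks: 1000
assert K == b * q  # K is an integer multiple of q in this example
```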
Preferably, the b data set sub-blocks are specifically: each data set sub-block in the b data set sub-blocks is provided with a positive integer number between 1 and b, and the numbers are not repeated.
Preferably, the sorting manner is positive order (ascending) or reverse order (descending).
Preferably, the splicing according to the sequence specifically means that the data set sub-blocks are spliced end to end in the order of their numbers.
Preferably, each of the large language model training devices is provided with a positive integer number between 1 and a, and the numbers are not repeated; the large language model training devices read the DATA set DATA3 in order, specifically: each device reads c pieces of text DATA from DATA3 in turn according to its device number for training, and so on until the reading of the large language model training DATA is completed.
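A minimal sketch of this ordered, per-device reading, assuming the list `data3` produced by the earlier sketch and a hypothetical helper name `device_batches`:

```python
def device_batches(data3, a, c, m):
    """Yield the batches of c texts read by device m (1-based) from DATA3:
    within each consecutive group of n = a*c texts, device m takes its slice."""
    n = a * c
    for start in range(0, len(data3), n):
        group = data3[start:start + n]
        yield group[(m - 1) * c : m * c]
```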
In a second aspect of the present invention, a data preprocessing system for accelerating training of a large language model is provided, which specifically includes the following modules:
the parameter setting module is used for setting related parameters in the data preprocessing process;
the DATA set acquisition module is connected with the parameter setting module and is used for acquiring a DATA set DATA1 for training a large language model;
The DATA set scattering module is connected with the DATA set acquisition module and used for randomly scattering the DATA set DATA1 to form a DATA set DATA2;
The DATA set sub-block dividing module is connected with the DATA set scattering module and is used for dividing the scattered DATA set DATA2 into b DATA set sub-blocks, and each DATA set sub-block comprises one or more pieces of text DATA;
the sorting module is connected with the data set sub-block dividing module and is used for uniformly arranging text data in each data set sub-block in ascending order or descending order according to the length of the text data;
and the splicing module is connected with the ordering module and is used for splicing the ordered DATA set sub-blocks according to the sequence to form a DATA set DATA3.
And the large language model training DATA reading module is connected with the splicing module and is used for finishing orderly reading of the DATA set DATA3 through large language model training equipment.
Preferably, the relevant parameters set by the parameter setting module specifically include the following:
The number a of large language model training devices; the number b of data set sub-blocks; the number c of pieces of text data input by a single device per input; the training coefficient d; the total number n of pieces of text data input by all training devices at once, where n=a×c; the number q of pieces of text data in a single data set sub-block, where q=a×c×d; and the total number K of pieces of text data in the data set, where K=b×q;
The parameters a, b, c, d, n, q and K are all positive integers. When the training coefficient d is smaller, the number q of pieces of text data in a data set sub-block is smaller, the overall randomness of the data is higher and the training effect of the large language model is better, but the training time of the large language model is longer; when the training coefficient d is larger, the number q of pieces of text data in a data set sub-block is larger, the overall randomness of the data is lower and the training time of the large language model is shorter, but the training effect of the large language model is poorer.
More preferably, the training coefficient d ∈ [500, 2000].
Preferably, in the b data set sub-blocks formed by the data set sub-block dividing module, each data set sub-block is provided with a positive integer number between 1 and b, and the numbers are not repeated.
Preferably, the sorting module sorts the text data within each data set sub-block in a positive or reverse order.
Preferably, the splicing module performs end-to-end splicing according to the sequence of the numbers of the sub-blocks of each data set.
Preferably, in the large language model training DATA reading module, each numbered large language model training device sequentially reads c pieces of text from DATA3 according to its device number, and so on, to complete the reading of the large language model training DATA.
Compared with the prior art, the invention has the following prominent substantive features and notable advantages:
To address the problem that, in the existing large language model training process, completely scattering the data results in long training time while the randomness of the data still needs to be preserved, the invention provides the above data preprocessing method and system for accelerating large language model training: the training data are scattered, divided into sub-blocks and sorted within each sub-block by text length, which balances the amount of text read by each training device, greatly improves training efficiency and reduces the time cost of training while retaining a certain degree of randomness in the data set.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention. In the drawings:
FIG. 1 is a flow chart of a data preprocessing method for accelerating training of a large language model according to a preferred embodiment of the present invention.
FIG. 2 is a schematic diagram of a data preprocessing system for accelerating training of large language models in accordance with a preferred embodiment of the present invention.
Detailed Description
The invention provides a data preprocessing method and system for accelerating training of a large language model. In order to make the purposes, technical solutions and effects of the invention clearer and more definite, the invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
It is noted that the terms "first," "second," and the like in the description and claims of the present invention and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order, and it is to be understood that the data so used may be interchanged where appropriate. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1:
As shown in fig. 1, the data preprocessing method for accelerating training of a large language model according to the embodiment specifically includes the following steps:
step S1: setting related parameters in the data preprocessing process.
In a specific implementation of the present invention, the relevant parameters specifically include the following:
The number a of large language model training devices; the number b of data set sub-blocks; the number c of pieces of text data input by a single device per input; the training coefficient d; the total number n of pieces of text data input by all training devices at once, where n=a×c; the number q of pieces of text data in a single data set sub-block, where q=a×c×d; and the total number K of pieces of text data in the data set, where K=b×q;
The parameters a, b, c, d, n, q and K are all positive integers. When the training coefficient d is smaller, the number q of pieces of text data in a data set sub-block is smaller, the overall randomness of the data is higher and the training effect of the large language model is better, but the training time of the large language model is longer; when the training coefficient d is larger, the number q of pieces of text data in a data set sub-block is larger, the overall randomness of the data is lower and the training time of the large language model is shorter, but the training effect of the large language model is poorer.
In a specific implementation of the invention, the training coefficient d ∈ [500, 2000].
Step S2: a DATA set DATA1 for training a large language model is acquired.
Step S3: the DATA set DATA1 is randomly scattered to form a DATA set DATA2.
Step S4: the DATA set DATA2 is divided into b DATA set sub-blocks, each DATA set sub-block comprising one or more pieces of text DATA.
The b data set sub-blocks in step S4 specifically are: each data set sub-block in the b data set sub-blocks is provided with a positive integer number between 1 and b, and the numbers are not repeated.
Step S5: uniformly arranging the text data in each data set sub-block in ascending order or descending order according to the length of the text data.
In a specific implementation of the present invention, the sorting manner in step S5 is positive order (ascending) or reverse order (descending).
Step S6: and splicing the sequenced DATA set sub-blocks according to the sequence to form a DATA set DATA3.
In a specific implementation of the present invention, the splicing according to the sequence in step S6 specifically means that the data set sub-blocks are spliced end to end in the order of their numbers.
Step S7: ordered reading of the DATA set DATA3 is accomplished by the large language model training devices.
In a specific implementation of the present invention, each of the large language model training devices in step S7 is provided with a positive integer number between 1 and a, and the numbers are not repeated; the large language model training devices read the DATA set DATA3 in order, specifically: each device reads c pieces of text DATA from DATA3 in turn according to its device number for training, and so on until the reading of the large language model training DATA is completed.
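Putting steps S1–S7 together, a hedged end-to-end usage of the illustrative `preprocess` and `device_batches` sketches given earlier might look as follows; the toy corpus and all names below are assumptions for demonstration only.

```python
# Toy run of steps S1-S7 using the illustrative sketches above.
texts = [f"sample text {i} " * (i % 7 + 1) for i in range(60_000)]  # stand-in for DATA1
a, c, d = 3, 1, 2000                        # S1: parameters (d chosen within [500, 2000])
data3 = preprocess(texts, a, c, d)          # S2-S6: scatter, split, sort, splice
for m in range(1, a + 1):                   # S7: each numbered device reads in order
    first_batch = next(device_batches(data3, a, c, m))
    print(m, len(first_batch))              # each device receives c texts per read
```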
Example 2:
As shown in fig. 2, a data preprocessing system for accelerating training of a large language model specifically includes:
and the parameter setting module is used for setting related parameters in the data preprocessing process.
In a specific implementation of the present invention, the relevant parameters set by the parameter setting module specifically include the following:
The number a of large language model training devices; the number b of data set sub-blocks; the number c of pieces of text data input by a single device per input; the training coefficient d; the total number n of pieces of text data input by all training devices at once, where n=a×c; the number q of pieces of text data in a single data set sub-block, where q=a×c×d; and the total number K of pieces of text data in the data set, where K=b×q.
The parameters a, b, c, d, n, q and K are all positive integers. When the training coefficient d is smaller, the number q of pieces of text data in a data set sub-block is smaller, the overall randomness of the data is higher and the training effect of the large language model is better, but the training time of the large language model is longer; when the training coefficient d is larger, the number q of pieces of text data in a data set sub-block is larger, the overall randomness of the data is lower and the training time of the large language model is shorter, but the training effect of the large language model is poorer.
In a specific implementation of the invention, the training coefficient d ∈ [500, 2000].
And the DATA set acquisition module is connected with the parameter setting module and is used for acquiring a DATA set DATA1 for training the large language model.
And the DATA set scattering module is connected with the DATA set acquisition module and is used for randomly scattering the DATA set DATA1 to form a DATA set DATA2.
The DATA set sub-block dividing module is connected with the DATA set scattering module and is used for dividing the scattered DATA set DATA2 into b DATA set sub-blocks, and each DATA set sub-block comprises one or more pieces of text DATA.
In the specific implementation of the invention, in the b data set sub-blocks formed by the data set sub-block dividing module, each data set sub-block is provided with a positive integer number between 1 and b, and the numbers are not repeated.
And the sequencing module is connected with the data set sub-block dividing module and is used for uniformly sequencing the text data in each data set sub-block in ascending order or descending order according to the length of the text data.
In a specific implementation of the invention, the sorting module sorts the text data in each data set sub-block in a positive sequence or a reverse sequence.
And the splicing module is connected with the ordering module and is used for splicing the ordered DATA set sub-blocks according to the sequence to form a DATA set DATA3.
In the specific implementation of the invention, the splicing module performs head-to-tail splicing according to the sequence of the numbers of the sub-blocks of each data set.
And the large language model training DATA reading module is connected with the splicing module and is used for finishing orderly reading of the DATA set DATA3 through large language model training equipment.
In a specific implementation of the invention, in the large language model training DATA reading module, each numbered large language model training device reads c pieces of text from DATA3 in turn according to its device number, and so on, until the reading of the large language model training DATA is finished.
Related parameters in the data preprocessing process are set through the parameter setting module, including: the number a of large language model training devices; the number b of data set sub-blocks; the number c of pieces of text data input by a single device per input; the training coefficient d; the total number n of pieces of text data input by all training devices at once; the number q of pieces of text data in a single data set sub-block; and the total number K of pieces of text data in the data set, wherein the parameters a, b, c, d, n, q and K are positive integers and d ∈ [500, 2000]. After the DATA set DATA1 for training the large language model is obtained by the DATA set acquisition module, DATA1 is randomly scattered by the DATA set scattering module to form the DATA set DATA2, and the scattered DATA set DATA2 is divided by the DATA set sub-block dividing module into b DATA set sub-blocks, each containing one or more pieces of text DATA. After the DATA set sub-blocks are divided, the text DATA in each DATA set sub-block are uniformly arranged in ascending or descending order of text length by the sorting module; after each DATA set sub-block is sorted, the splicing module splices the DATA set sub-blocks end to end to form the DATA set DATA3; finally, the large language model training DATA reading module completes the ordered reading of the DATA set DATA3 by the large language model training devices.
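As a rough illustration of how these modules could map onto code, the object-oriented sketch below pairs one method with each module; the class and method names are assumptions, not the patent's terminology.

```python
import random

class PreprocessingSystem:
    def __init__(self, a, c, d, seed=0):          # parameter setting module
        self.a, self.c, self.d = a, c, d
        self.q = a * c * d                        # pieces of text per data set sub-block
        self.rng = random.Random(seed)

    def scatter(self, data1):                     # data set scattering module
        data2 = list(data1)
        self.rng.shuffle(data2)                   # DATA1 -> DATA2
        return data2

    def split_sort_splice(self, data2):           # dividing, sorting and splicing modules
        blocks = [sorted(data2[i:i + self.q], key=len)
                  for i in range(0, len(data2), self.q)]
        return [t for block in blocks for t in block]   # DATA3

    def read(self, data3, device_number):         # training data reading module
        n = self.a * self.c
        for start in range(0, len(data3), n):
            group = data3[start:start + n]
            yield group[(device_number - 1) * self.c : device_number * self.c]
```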
Example 3:
In the distributed training process in the data parallel mode, the load on a single device is reduced by distributing the data to be read across multiple devices, so that a larger overall throughput is obtained.
The data parallel steps are as follows:
1) Reading different data on different devices; 2) executing the same computation logic; 3) aggregating gradients across devices; 4) updating the model in the backward pass using the aggregated gradient information.
In step 1), since the data read by the different devices are different, the computation time may be inconsistent across devices when the same computation logic is executed in step 2). The cross-device gradient aggregation in step 3) can only be performed after every device has completed its computation.
Therefore, in distributed training in the data parallel mode, the training speed of the model is determined by the device that requires the longest computation time.
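A trivial numerical illustration of this point (the per-iteration timings below are arbitrary, assumed values, not measurements from the patent):

```python
# With synchronized gradient aggregation, one iteration takes about as long as the
# slowest device; the other devices idle for the difference.
device_compute_times = [1.2, 0.3, 1.8]              # assumed per-iteration times
iteration_time = max(device_compute_times)          # 1.8 -- set by the slowest device
idle_times = [iteration_time - t for t in device_compute_times]   # [0.6, 1.5, 0.0]
```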
In this embodiment, assuming that the DATA set DATA1 contains K=6000000 pieces of DATA, 6 pieces of text DATA T1, T2, T3, T4, T5 and T6 are taken from the randomly scattered DATA to form a DATA set sub-block A1; their character lengths are 3, 2, 10, 9, 1 and 8, respectively. The DATA set sub-block A1 information is shown in Table 1:
TABLE 1 Data set sub-block A1 information

Text data | T1 | T2 | T3 | T4 | T5 | T6
---|---|---|---|---|---|---
Character length | 3 | 2 | 10 | 9 | 1 | 8
Given 3 existing large language model training devices M1, M2 and M3 with the same performance, and assuming that each device needs to input 1 piece of data for training in a single iteration, then under sequential loading of the data M1 reads T1 and T4, M2 reads T2 and T5, and M3 reads T3 and T6; the total length read by M1 is 3+9=12, the total length read by M2 is 2+1=3, and the total length read by M3 is 10+8=18. Since the data read by device M3 is the longest and differs greatly in length from the data read by the other devices (M2 in particular), the waiting time of those devices is too long, which affects the training efficiency of the large language model.
In order to solve the problem of excessive waiting time in the training process of the large language model, this embodiment makes improvements in the training data preparation stage, comprising the following steps:
(1) The DATA set DATA1 is first randomly scattered to form the DATA set DATA2.
(2) The randomly scattered DATA set DATA2 is then divided into DATA set sub-blocks.
With the above a=3 devices and c=1 piece of text DATA input by a single device per input, all devices together input n=a×c=3×1=3 pieces of text DATA at a time. In this embodiment, taking the training coefficient d=2000, a single DATA set sub-block contains q=a×c×d=3×1×2000=6000 pieces of text DATA, and DATA2 is divided into b=K/q=6000000/6000=1000 sub-blocks.
(3) And sequencing the text data in each data set sub-block from short to long according to the text length.
(4) The sorted DATA set sub-blocks are spliced end to end in sequence to form a DATA set DATA3, and DATA3 is used for training the large language model.
In this embodiment, it is assumed that after random scattering the six pieces of text data above are still allocated to the same sub-block and located close to one another. The text data in data set sub-block A1 are sorted in positive order (ascending order of length), so that data set sub-block A1 is reordered into data set sub-block B1; the data set sub-block B1 information is shown in Table 2:
TABLE 2 Data set sub-block B1 information

Text data | R1 | R2 | R3 | R4 | R5 | R6
---|---|---|---|---|---|---
Character length | 1 | 2 | 3 | 8 | 9 | 10
In the case of sequentially loading data, M1 reads R1 and R4, M2 reads R2 and R5, M3 reads R3 and R6, the total length of M1 reads is 1+8=9, the total length of M2 reads is 2+9=11, and the total length of M3 reads is 3+10=13. The length difference of the data to be read among the devices is small, so that the waiting time of the devices is reduced, and the training speed of the large language model is improved.
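The balance improvement in this worked example can be checked with a few lines of Python; this is an illustrative script, not part of the patent, and the round-robin assignment simply follows the sequential loading order described above.

```python
# Per-device total read lengths before sorting (sub-block A1) and after ascending
# sort (sub-block B1), with a=3 devices and c=1 text per device per iteration.
lengths_a1 = [3, 2, 10, 9, 1, 8]       # T1..T6
lengths_b1 = sorted(lengths_a1)        # R1..R6 = [1, 2, 3, 8, 9, 10]

def per_device_totals(lengths, a=3, c=1):
    totals = [0] * a
    n = a * c
    for start in range(0, len(lengths), n):
        group = lengths[start:start + n]
        for m in range(a):
            totals[m] += sum(group[m * c:(m + 1) * c])
    return totals

print(per_device_totals(lengths_a1))   # [12, 3, 18] -> large imbalance, long waits
print(per_device_totals(lengths_b1))   # [9, 11, 13] -> much more balanced reads
```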
Example 4:
In the specific implementation process of the present invention, the technical effect of the present invention is further verified by a comparative experiment, and the experimental configuration information is shown in table 3:
Table 3 experimental configuration information
The results of the comparative experiments are shown in Table 4:
Table 4 results of comparative experiments
The comparative experiment and its results show that, with only a slight reduction in the training effect of the large model (validation set accuracy), the invention can greatly improve the model training speed, by (2.16-1.52)/2.16 = 29.63%.
The above description of the specific embodiments of the present invention has been given by way of example only, and the present invention is not limited to the above described specific embodiments. Any equivalent modifications and substitutions for the present invention will occur to those skilled in the art, and are also within the scope of the present invention. Accordingly, equivalent changes and modifications are intended to be included within the scope of the present invention without departing from the spirit and scope thereof.
Claims (10)
1. The data preprocessing method for accelerating the training of the large language model is characterized by comprising the following steps of:
Setting related parameters in the data preprocessing process;
acquiring a DATA set DATA1 for training a large language model;
Randomly scattering the DATA set DATA1 to form a DATA set DATA2;
dividing the DATA set DATA2 into b DATA set sub-blocks, each DATA set sub-block comprising one or more pieces of text DATA;
Uniformly arranging text data in each data set sub-block in ascending order or descending order according to the length of the text data;
splicing the sequenced DATA set sub-blocks according to the sequence to form a DATA set DATA3;
ordered reading of the DATA set DATA3 is accomplished by a large language model training device.
2. The method for preprocessing data for accelerating training of a large language model according to claim 1, wherein the relevant parameters specifically include the following:
The number a of large language model training devices; the number b of data set sub-blocks; the number c of pieces of text data input by a single device per input; the training coefficient d; the total number n of pieces of text data input by all training devices at once; the number q of pieces of text data in a single data set sub-block; and the total number K of pieces of text data in the data set.
3. The data preprocessing method for accelerating training of a large language model according to claim 2, wherein the parameters a, b, c, d, n, q, K are positive integers, when the training coefficient d is smaller, the number q of text data in a data set sub-block is smaller, the randomness degree of the whole data is higher, the training effect of the large language model is better, but the training time of the large language model is longer; when the training coefficient d is larger, the number q of text data in the data set sub-block is larger, the randomness degree of the whole data is lower, the training time of the large language model is shorter, and the training effect of the large language model is poorer.
4. A method of data preprocessing for accelerating training of large language models as recited in claim 3, wherein said training coefficient d ∈ [500, 2000].
5. The data preprocessing method for accelerating training of large language models according to claim 2, wherein each of the large language model training devices is provided with a positive integer number between 1 and a, and the numbers are not repeated; the large language model training devices read the DATA set DATA3 in order, specifically: each device reads c pieces of text DATA from DATA3 in turn according to its device number for training, and so on until the reading of the large language model training DATA is completed.
6. The data preprocessing method for accelerating training of a large language model according to claim 1, wherein the b data set sub-blocks specifically are:
Each data set sub-block in the b data set sub-blocks is provided with a positive integer number between 1 and b, and the numbers are not repeated.
7. The data preprocessing method for accelerating training of a large language model according to claim 1, wherein the splicing is performed according to the sequence, specifically, the end-to-end splicing is performed according to the sequence of the numbers of the sub-blocks of each data set.
8. The data preprocessing system for accelerating training of the large language model is characterized by comprising the following modules:
the parameter setting module is used for setting related parameters in the data preprocessing process;
the DATA set acquisition module is connected with the parameter setting module and is used for acquiring a DATA set DATA1 for training a large language model;
The DATA set scattering module is connected with the DATA set acquisition module and used for randomly scattering the DATA set DATA1 to form a DATA set DATA2;
The DATA set sub-block dividing module is connected with the DATA set scattering module and is used for dividing the scattered DATA set DATA2 into b DATA set sub-blocks, and each DATA set sub-block comprises one or more pieces of text DATA;
the sorting module is connected with the data set sub-block dividing module and is used for uniformly arranging text data in each data set sub-block in ascending order or descending order according to the length of the text data;
the splicing module is connected with the sorting module and used for splicing the sorted DATA set sub-blocks according to the sequence to form a DATA set DATA3;
and the large language model training DATA reading module is connected with the splicing module and is used for finishing orderly reading of the DATA set DATA3 through large language model training equipment.
9. A computer readable storage medium, characterized in that the computer readable storage medium has a computer program which, when executed by a processor, implements a data preprocessing method of accelerating training of a large language model according to any one of claims 1 to 7.
10. An apparatus comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing a data preprocessing method for accelerating training of a large language model according to any one of claims 1 to 7 when executing the computer program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410501460.6A CN118171108B (en) | 2024-04-25 | 2024-04-25 | Data preprocessing method and system for accelerating training of large language model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410501460.6A CN118171108B (en) | 2024-04-25 | 2024-04-25 | Data preprocessing method and system for accelerating training of large language model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118171108A true CN118171108A (en) | 2024-06-11 |
CN118171108B CN118171108B (en) | 2024-08-13 |
Family
ID=91353232
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410501460.6A Active CN118171108B (en) | 2024-04-25 | 2024-04-25 | Data preprocessing method and system for accelerating training of large language model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118171108B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111859982A (en) * | 2020-06-19 | 2020-10-30 | 北京百度网讯科技有限公司 | Language model training method and device, electronic equipment and readable storage medium |
CN111857991A (en) * | 2020-06-23 | 2020-10-30 | 中国平安人寿保险股份有限公司 | Data sorting method and device and computer equipment |
CN116245197A (en) * | 2023-02-21 | 2023-06-09 | 北京数美时代科技有限公司 | Method, system, medium and equipment for improving training rate of language model |
CN116468039A (en) * | 2023-03-14 | 2023-07-21 | 杭州市第七人民医院 | Training data determining method and device and computer equipment |
WO2023192676A1 (en) * | 2022-04-01 | 2023-10-05 | Google Llc | Deterministic training of machine learning models |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111859982A (en) * | 2020-06-19 | 2020-10-30 | 北京百度网讯科技有限公司 | Language model training method and device, electronic equipment and readable storage medium |
CN111857991A (en) * | 2020-06-23 | 2020-10-30 | 中国平安人寿保险股份有限公司 | Data sorting method and device and computer equipment |
WO2023192676A1 (en) * | 2022-04-01 | 2023-10-05 | Google Llc | Deterministic training of machine learning models |
CN116245197A (en) * | 2023-02-21 | 2023-06-09 | 北京数美时代科技有限公司 | Method, system, medium and equipment for improving training rate of language model |
CN116468039A (en) * | 2023-03-14 | 2023-07-21 | 杭州市第七人民医院 | Training data determining method and device and computer equipment |
Non-Patent Citations (1)
Title |
---|
LI Xiong et al.: "Research on Text Semantic Tag Extraction Based on Term Clustering" (基于词项聚类的文本语义标签抽取研究), Computer Science (计算机科学), vol. 45, no. 11, 30 November 2018 (2018-11-30), pages 427-431 *
Also Published As
Publication number | Publication date |
---|---|
CN118171108B (en) | 2024-08-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10902318B2 (en) | Methods and systems for improved transforms in convolutional neural networks | |
WO2022068663A1 (en) | Memory allocation method, related device, and computer readable storage medium | |
KR102038390B1 (en) | Artificial neural network module and scheduling method thereof for highly effective parallel processing | |
CN106778079A (en) | A kind of DNA sequence dna k mer frequency statistics methods based on MapReduce | |
EP4375844A1 (en) | Neural network on-chip mapping method and device based on tabu search algorithm | |
CN104020983A (en) | KNN-GPU acceleration method based on OpenCL | |
CN103995827B (en) | High-performance sort method in MapReduce Computational frames | |
CN106295670A (en) | Data processing method and data processing equipment | |
CN103064991A (en) | Mass data clustering method | |
CN106802787A (en) | MapReduce optimization methods based on GPU sequences | |
CN118171108B (en) | Data preprocessing method and system for accelerating training of large language model | |
CN105830160A (en) | Apparatuses and methods for writing masked data to buffer | |
JP4310500B2 (en) | Important component priority calculation method and equipment | |
Langr et al. | Storing sparse matrices to files in the adaptive-blocking hierarchical storage format | |
Lima et al. | Descent search approaches applied to the minimization of open stacks | |
Harada et al. | Introduction to GPU radix sort | |
CN109800891A (en) | A kind of machine learning redundant data delet method and system | |
CN111427857B (en) | FPGA configuration file compression and decompression method based on partition reference technology | |
US20220343145A1 (en) | Method and system for graph neural network acceleration | |
Salah et al. | Lazy-Merge: A Novel Implementation for Indexed Parallel $ K $-Way In-Place Merging | |
Hwang et al. | Multi-attractor gene reordering for graph bisection | |
CN113255270B (en) | Jacobian template calculation acceleration method, system, medium and storage device | |
AlMasri | Accelerating graph pattern mining algorithms on modern graphics processing units | |
Shao et al. | Blockgraphchi: Enabling block update in out-of-core graph processing | |
CN113392124B (en) | Structured language-based data query method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||