CN118171108A - Data preprocessing method and system for accelerating training of large language model - Google Patents

Data preprocessing method and system for accelerating training of large language model

Info

Publication number
CN118171108A
Authority
CN
China
Prior art keywords
data set
data
training
language model
large language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410501460.6A
Other languages
Chinese (zh)
Other versions
CN118171108B (en)
Inventor
李多海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Rock Core Digital Intelligence Artificial Intelligence Technology Co ltd
Original Assignee
Shanghai Rock Core Digital Intelligence Artificial Intelligence Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Rock Core Digital Intelligence Artificial Intelligence Technology Co ltd
Priority to CN202410501460.6A priority Critical patent/CN118171108B/en
Publication of CN118171108A publication Critical patent/CN118171108A/en
Application granted granted Critical
Publication of CN118171108B publication Critical patent/CN118171108B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/10Pre-processing; Data cleansing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a data preprocessing method and system for accelerating the training of a large language model, which address the problem that in existing large language model training the data is completely scattered to preserve randomness, leaving the text lengths read by different devices unbalanced. By scattering the data used for training the large language model, dividing it into sub-blocks, and sorting each sub-block by text length, training efficiency is greatly improved while a degree of randomness in the data set is preserved, thereby reducing the time cost of training.

Description

Data preprocessing method and system for accelerating training of large language model
Technical Field
The invention relates to the field of large language models, in particular to a data preprocessing method and system for accelerating training of a large language model.
Background
In recent years, with the rapid development of artificial intelligence technology, large language models have shown excellent performance and broad application potential in the field of natural language processing. Large language models typically have hundreds of millions or even billions of parameters and can handle a wide range of natural language processing tasks such as text generation, text classification, semantic understanding, machine translation and dialogue systems. However, due to their enormous parameter scale and their need for massive training data, the model training process tends to be time-consuming and computationally intensive.
Data preprocessing is an important element of machine learning, and in particular of deep learning, especially for large language models. Advanced data preprocessing techniques can not only effectively reduce invalid computation and storage requirements, but also markedly improve model convergence speed by providing high-quality training samples, thereby optimizing the training efficiency of a large language model.
Disclosure of Invention
The invention provides a data preprocessing method and system for accelerating the training of a large language model, which are used to solve the problem that the training time of a large language model is long and the randomness of its training data cannot be ensured.
The first aspect of the invention provides a data preprocessing method for accelerating training of a large language model, which comprises the following steps:
setting relevant parameters in the data preprocessing process;
acquiring a DATA set DATA1 for training a large language model;
randomly scattering the DATA set DATA1 to form a DATA set DATA2;
dividing the DATA set DATA2 into b data set sub-blocks, each data set sub-block comprising one or more pieces of text data;
uniformly arranging the text data in each data set sub-block in ascending or descending order according to the length of the text data;
splicing the sorted data set sub-blocks in sequence to form a DATA set DATA3;
completing the ordered reading of the DATA set DATA3 by the large language model training devices.
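The steps above amount to a shuffle-divide-sort-concatenate pipeline. The following is a minimal Python sketch of that pipeline under the assumption that the data set is simply a list of text strings; the names preprocess_dataset, texts and seed are illustrative and do not come from the patent.

```python
import random

def preprocess_dataset(texts, a, c, d, seed=0):
    """Sketch: shuffle the data, split it into sub-blocks of q = a*c*d items,
    sort each sub-block by text length, and concatenate the sorted sub-blocks."""
    # DATA1 -> DATA2: random global shuffle (the "scattering" step)
    data2 = list(texts)
    random.Random(seed).shuffle(data2)

    # DATA2 -> b data set sub-blocks, each holding q = a * c * d pieces of text data
    q = a * c * d
    sub_blocks = [data2[i:i + q] for i in range(0, len(data2), q)]

    # sort the text inside every sub-block by length (ascending here; descending is equally valid)
    sorted_blocks = [sorted(block, key=len) for block in sub_blocks]

    # splice the sorted sub-blocks end to end in their block order -> DATA3
    return [text for block in sorted_blocks for text in block]
```

With a = 3, c = 1 and d = 2000 this reproduces the sub-block size q = 6000 used in the embodiment described later.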
Preferably, the relevant parameters specifically include the following:
the number a of large language model training devices; the number b of data set sub-blocks; the number c of pieces of text data input by a single device per input; a training coefficient d; the total number n of pieces of text data input by all training devices at a time, where n = a × c; the number q of pieces of text data in a single data set sub-block, where q = a × c × d; and the total number K of pieces of text data in the data set, where K = b × q;
The parameters a, b, c, d, n, q and K are all positive integers. A smaller training coefficient d means a smaller number q of pieces of text data per data set sub-block, a higher degree of randomness in the data as a whole, a better training effect for the large language model, and a longer training time; a larger training coefficient d means a larger q, a lower degree of randomness, a shorter training time, and a poorer training effect.
More preferably, the training coefficient d ∈ [500, 2000].
Preferably, the b data set sub-blocks are specifically such that each of the b data set sub-blocks is assigned a positive integer number between 1 and b, and no number is repeated.
Preferably, the sorting order is positive (ascending) or reverse (descending).
Preferably, the splicing in sequence is specifically end-to-end splicing according to the numbering of the data set sub-blocks.
Preferably, each of the large language model training devices is assigned a positive integer number between 1 and a, and no number is repeated; the ordered reading of the DATA set DATA3 by the large language model training devices specifically means that each device, in the order of its device number, sequentially reads c pieces of text data from DATA3 for training, and so on until the reading of the large language model training data is complete.
In a second aspect of the present invention, a data preprocessing system for accelerating training of a large language model is provided, which specifically includes the following modules:
the parameter setting module is used for setting related parameters in the data preprocessing process;
the DATA set acquisition module is connected with the parameter setting module and is used for acquiring a DATA set DATA1 for training a large language model;
The DATA set scattering module is connected with the DATA set acquisition module and used for randomly scattering the DATA set DATA1 to form a DATA set DATA2;
The DATA set sub-block dividing module is connected with the DATA set scattering module and is used for dividing the scattered DATA set DATA2 into b DATA set sub-blocks, and each DATA set sub-block comprises one or more pieces of text DATA;
the sorting module is connected with the data set sub-block dividing module and is used for uniformly arranging text data in each data set sub-block in ascending order or descending order according to the length of the text data;
the splicing module is connected with the sorting module and is used for splicing the sorted DATA set sub-blocks in sequence to form a DATA set DATA3;
and the large language model training DATA reading module is connected with the splicing module and is used for completing the ordered reading of the DATA set DATA3 by the large language model training devices.
Preferably, the relevant parameters set by the parameter setting module specifically include the following:
the number a of large language model training devices; the number b of data set sub-blocks; the number c of pieces of text data input by a single device per input; a training coefficient d; the total number n of pieces of text data input by all training devices at a time, where n = a × c; the number q of pieces of text data in a single data set sub-block, where q = a × c × d; and the total number K of pieces of text data in the data set, where K = b × q;
The parameters a, b, c, d, n, q and K are all positive integers. A smaller training coefficient d means a smaller number q of pieces of text data per data set sub-block, a higher degree of randomness in the data as a whole, a better training effect for the large language model, and a longer training time; a larger training coefficient d means a larger q, a lower degree of randomness, a shorter training time, and a poorer training effect.
More preferably, the training coefficient d ∈ [500, 2000].
Preferably, in the b data set sub-blocks formed by the data set sub-block dividing module, each data set sub-block is assigned a positive integer number between 1 and b, and no number is repeated.
Preferably, the sorting module sorts the text data within each data set sub-block in positive (ascending) or reverse (descending) order.
Preferably, the splicing module performs end-to-end splicing according to the numbering of the data set sub-blocks.
Preferably, in the large language model training DATA reading module, the numbered large language model training devices each sequentially read c pieces of text from DATA3 in the order of their device numbers, and so on, to complete the reading of the large language model training data.
Compared with the prior art, the invention has the following prominent substantive features and notable advantages:
In the existing large language model training process the data is completely scattered, which preserves randomness but leaves the text lengths read by different devices unbalanced and slows training. To solve this problem, the invention provides a data preprocessing method and system for accelerating large language model training: the training data is scattered, divided into sub-blocks, and sorted by text length within each sub-block, so that training efficiency is greatly improved while a degree of randomness in the data set is retained, thereby reducing the time cost of training.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention. In the drawings:
FIG. 1 is a flow chart of a data preprocessing method for accelerating training of a large language model according to a preferred embodiment of the present invention.
FIG. 2 is a schematic diagram of a data preprocessing system for accelerating training of large language models in accordance with a preferred embodiment of the present invention.
Detailed Description
In order to make the purposes, technical solutions and effects of the invention clearer and more definite, the data preprocessing method and system for accelerating the training of a large language model provided by the invention are further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
It is noted that the terms "first," "second," and the like in the description and claims of the present invention and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order, and it is to be understood that the data so used may be interchanged where appropriate. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1:
As shown in fig. 1, the data preprocessing method for accelerating training of a large language model according to the embodiment specifically includes the following steps:
step S1: setting related parameters in the data preprocessing process.
In a specific implementation of the present invention, the relevant parameters specifically include the following:
the number a of large language model training devices; the number b of data set sub-blocks; the number c of pieces of text data input by a single device per input; a training coefficient d; the total number n of pieces of text data input by all training devices at a time, where n = a × c; the number q of pieces of text data in a single data set sub-block, where q = a × c × d; and the total number K of pieces of text data in the data set, where K = b × q;
The parameters a, b, c, d, n, q and K are all positive integers. A smaller training coefficient d means a smaller number q of pieces of text data per data set sub-block, a higher degree of randomness in the data as a whole, a better training effect for the large language model, and a longer training time; a larger training coefficient d means a larger q, a lower degree of randomness, a shorter training time, and a poorer training effect.
In a specific implementation of the invention, the training coefficient d ∈ [500, 2000].
Step S2: a DATA set DATA1 for training a large language model is acquired.
Step S3: the DATA set DATA1 is randomly scattered to form a DATA set DATA2.
Step S4: the DATA set DATA2 is divided into b DATA set sub-blocks, each DATA set sub-block comprising one or more pieces of text DATA.
The b data set sub-blocks in step S4 are specifically such that each of the b data set sub-blocks is assigned a positive integer number between 1 and b, and no number is repeated.
Step S5: the text data in each data set sub-block are uniformly arranged in ascending or descending order according to the length of the text data.
In a specific implementation of the present invention, the sorting order in step S5 is positive (ascending) or reverse (descending).
Step S6: and splicing the sequenced DATA set sub-blocks according to the sequence to form a DATA set DATA3.
In a specific implementation of the present invention, the splicing in sequence in step S6 is specifically end-to-end splicing according to the numbering of the data set sub-blocks.
Step S7: ordered reading of the DATA set DATA3 is accomplished by training the device with a large language model.
In the specific implementation of the present invention, each large language model training device in the large language model training devices in step S7 is provided with a positive integer number between 1 and a, and the numbers are not repeated; the large language model training device orderly reads the DATA set DATA3, specifically: and c text DATA are sequentially read from the DATA3 according to the number of the large language model training equipment for training, and the reading of the large language model training DATA is completed by analogy.
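For concreteness, the ordered reading of step S7 can be sketched as follows, assuming DATA3 is a Python list in concatenated order; read_for_device and device_id are illustrative names, not terms from the patent.

```python
def read_for_device(data3, device_id, a, c):
    """Yield the successive batches of c pieces of text that the device numbered
    device_id (1..a) reads from DATA3: within every round of a*c items, device i
    takes the i-th slice of c items, and so on until DATA3 is exhausted."""
    round_size = a * c                      # items consumed by all devices per round
    offset = (device_id - 1) * c            # where this device starts inside a round
    for start in range(offset, len(data3), round_size):
        batch = data3[start:start + c]
        if batch:
            yield batch
```

For example, with a = 3 and c = 1, device 2 reads the 2nd, 5th, 8th, ... pieces of DATA3.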
Example 2:
As shown in fig. 2, a data preprocessing system for accelerating training of a large language model specifically includes:
and the parameter setting module is used for setting related parameters in the data preprocessing process.
In a specific implementation of the present invention, the relevant parameters set by the parameter setting module specifically include the following:
the number a of large language model training devices; the number b of data set sub-blocks; the number c of pieces of text data input by a single device per input; a training coefficient d; the total number n of pieces of text data input by all training devices at a time, where n = a × c; the number q of pieces of text data in a single data set sub-block, where q = a × c × d; and the total number K of pieces of text data in the data set, where K = b × q.
The parameters a, b, c, d, n, q and K are all positive integers. A smaller training coefficient d means a smaller number q of pieces of text data per data set sub-block, a higher degree of randomness in the data as a whole, a better training effect for the large language model, and a longer training time; a larger training coefficient d means a larger q, a lower degree of randomness, a shorter training time, and a poorer training effect.
In a specific implementation of the invention, the training coefficient d ∈ [500, 2000].
And the DATA set acquisition module is connected with the parameter setting module and is used for acquiring a DATA set DATA1 for training the large language model.
And the DATA set scattering module is connected with the DATA set acquisition module and is used for randomly scattering the DATA set DATA1 to form a DATA set DATA2.
The DATA set sub-block dividing module is connected with the DATA set scattering module and is used for dividing the scattered DATA set DATA2 into b DATA set sub-blocks, and each DATA set sub-block comprises one or more pieces of text DATA.
In the specific implementation of the invention, in the b data set sub-blocks formed by the data set sub-block dividing module, each data set sub-block is provided with a positive integer number between 1 and b, and the numbers are not repeated.
And the sequencing module is connected with the data set sub-block dividing module and is used for uniformly sequencing the text data in each data set sub-block in ascending order or descending order according to the length of the text data.
In a specific implementation of the invention, the sorting module sorts the text data in each data set sub-block in positive (ascending) or reverse (descending) order.
And the splicing module is connected with the ordering module and is used for splicing the ordered DATA set sub-blocks according to the sequence to form a DATA set DATA3.
In the specific implementation of the invention, the splicing module performs head-to-tail splicing according to the sequence of the numbers of the sub-blocks of each data set.
And the large language model training DATA reading module is connected with the splicing module and is used for finishing orderly reading of the DATA set DATA3 through large language model training equipment.
In a specific implementation of the invention, the numbered large language model training devices in the large language model training DATA reading module each read c pieces of text from DATA3 in turn according to their device numbers, and so on, to complete the reading of the large language model training data.
The parameter setting module sets the relevant parameters of the data preprocessing process, including: the number a of large language model training devices; the number b of data set sub-blocks; the number c of pieces of text data input by a single device per input; a training coefficient d; the total number n of pieces of text data input by all training devices at a time; the number q of pieces of text data in a single data set sub-block; and the total number K of pieces of text data in the data set, where a, b, c, d, n, q and K are positive integers and d ∈ [500, 2000]. After the data set acquisition module obtains the DATA set DATA1 for training the large language model, the data set scattering module randomly scatters DATA1 to form a DATA set DATA2, and the data set sub-block dividing module divides the scattered DATA2 into b data set sub-blocks, each comprising one or more pieces of text data. After the data set sub-blocks are divided, the sorting module uniformly arranges the text data within each data set sub-block in ascending or descending order of text length; after each data set sub-block is sorted, the splicing module splices the sub-blocks end to end to form a DATA set DATA3; finally, the large language model training data reading module completes the ordered reading of DATA3 by the large language model training devices.
Example 3:
In the distributed training process in the data parallel mode, the data to be read is distributed across a plurality of devices, which reduces the load on a single device and yields a larger overall throughput.
The data parallel steps are as follows:
1) Different data are read on different devices; 2) the same computational logic is executed; 3) gradients are aggregated across devices; 4) the model is updated in the backward pass using the aggregated gradient information.
In step 1), since the data read by different devices differ, the computation times of different devices may be inconsistent when the same computational logic is executed. The cross-device gradient aggregation in step 3) can only be performed after every device has finished its computation.
Therefore, in distributed training in the data-parallel mode, the speed of model training is determined by the device requiring the longest computation time.
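This gating effect can be illustrated with a toy model; the assumption that a device's compute time is proportional to the total character length it reads is made purely for illustration.

```python
def iteration_time(per_device_batches, cost_per_char=1.0):
    """Toy model of one data-parallel iteration: each device's compute time is taken
    to be proportional to the total character length it reads, and the gradient
    aggregation of step 3) starts only after the slowest device has finished."""
    device_times = [cost_per_char * sum(len(text) for text in batch)
                    for batch in per_device_batches]
    return max(device_times)
```

For the per-device reads discussed below, this gives max(12, 3, 18) = 18 before sorting and max(9, 11, 13) = 13 after sorting.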
In this embodiment, assume that the DATA set DATA1 contains K = 6000000 pieces of data, and that 6 pieces of text data T1, T2, T3, T4, T5 and T6, with character lengths 3, 2, 10, 9, 1 and 8 respectively, are taken from the randomly scattered DATA1 to form a data set sub-block A1. The data set sub-block A1 information is shown in Table 1:
TABLE 1 Data set sub-block A1 information
Text data:        T1  T2  T3  T4  T5  T6
Character length:  3   2  10   9   1   8
Suppose there are 3 large language model training devices M1, M2 and M3 with identical performance, and that each device needs to read 1 piece of data per iteration. When the data is loaded sequentially, M1 reads T1 and T4, M2 reads T2 and T5, and M3 reads T3 and T6; the total length read by M1 is 3+9=12, by M2 is 2+1=3, and by M3 is 10+8=18. Since device M3 reads the longest total length and the gap to the other devices is large, the faster devices are left waiting too long, which affects the training efficiency of the large language model.
In order to solve the problem of excessive waiting time during large language model training, this embodiment improves and optimizes the training data preparation stage with the following steps:
(1) The DATA set DATA1 is first randomly scattered to form the DATA set DATA2.
(2) The randomly scattered DATA set DATA2 is then divided into data set sub-blocks.
With a = 3 devices and c = 1 piece of text data input by a single device per input, all devices together input n = a × c = 3 × 1 = 3 pieces of text data at a time. Taking the training coefficient d = 2000 in this embodiment, a single data set sub-block contains q = a × c × d = 3 × 1 × 2000 = 6000 pieces of text data, and the data set is divided into b = K/q = 6000000/6000 = 1000 sub-blocks.
(3) The text data within each data set sub-block are sorted from short to long by text length.
(4) The sorted data set sub-blocks are spliced end to end in sequence to form a DATA set DATA3, and DATA3 is used for training the large language model.
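As a quick check of the block sizing in step (2), the derived quantities can be recomputed from the parameter relations using the values given in this embodiment; the snippet below is only a numerical sanity check.

```python
a, c, d = 3, 1, 2000      # devices, pieces per device per input, training coefficient
K = 6_000_000             # total pieces of text data in DATA1

n = a * c                 # pieces consumed by all devices per input   -> 3
q = a * c * d             # pieces of text data per data set sub-block -> 6000
b = K // q                # number of data set sub-blocks              -> 1000

assert (n, q, b) == (3, 6000, 1000)
```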
In this embodiment, the text data in data set sub-block A1 are sorted in positive (ascending) order; for illustration it is assumed that these pieces of text data are still allocated to the same sub-block and remain adjacent to one another after scattering. Reordering the data set sub-block A1 in this way yields a data set sub-block B1, whose information is shown in Table 2:
TABLE 2 Data set sub-block B1 information
Text data:        R1  R2  R3  R4  R5  R6
Character length:  1   2   3   8   9  10
When the data is loaded sequentially, M1 reads R1 and R4, M2 reads R2 and R5, and M3 reads R3 and R6; the total length read by M1 is 1+8=9, by M2 is 2+9=11, and by M3 is 3+10=13. The differences in the lengths read by the devices are small, so device waiting time is reduced and the training speed of the large language model is improved.
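The per-device totals quoted above can be reproduced with a short sketch; per_device_totals is an illustrative helper, and the ascending sort mirrors the positive-order sorting used in this embodiment.

```python
# character lengths in sub-block A1 (T1..T6) and, after ascending sort, in B1 (R1..R6)
lengths_A1 = [3, 2, 10, 9, 1, 8]
lengths_B1 = sorted(lengths_A1)   # [1, 2, 3, 8, 9, 10]

def per_device_totals(lengths, a=3, c=1):
    """Total characters each of a devices reads when consecutive batches of c items
    are handed out to devices 1..a in list order, round after round."""
    totals = [0] * a
    for start in range(0, len(lengths), a * c):
        for dev in range(a):
            totals[dev] += sum(lengths[start + dev * c:start + (dev + 1) * c])
    return totals

print(per_device_totals(lengths_A1))  # [12, 3, 18] -> iteration gated by 18
print(per_device_totals(lengths_B1))  # [9, 11, 13] -> iteration gated by 13
```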
Example 4:
In a specific implementation of the present invention, the technical effect of the invention is further verified by a comparative experiment; the experimental configuration information is shown in Table 3:
Table 3 experimental configuration information
The results of the comparative experiments are shown in Table 4:
Table 4 results of comparative experiments
The comparative experiment and its results show that the invention improves the model training speed by (2.16-1.52)/2.16=29.63% while the training effect of the large model (validation set accuracy) decreases only slightly.
The above description of the specific embodiments of the present invention has been given by way of example only, and the present invention is not limited to the above described specific embodiments. Any equivalent modifications and substitutions for the present invention will occur to those skilled in the art, and are also within the scope of the present invention. Accordingly, equivalent changes and modifications are intended to be included within the scope of the present invention without departing from the spirit and scope thereof.

Claims (10)

1. A data preprocessing method for accelerating the training of a large language model, characterized by comprising the following steps:
setting relevant parameters in the data preprocessing process;
acquiring a DATA set DATA1 for training a large language model;
randomly scattering the DATA set DATA1 to form a DATA set DATA2;
dividing the DATA set DATA2 into b data set sub-blocks, each data set sub-block comprising one or more pieces of text data;
uniformly arranging the text data in each data set sub-block in ascending or descending order according to the length of the text data;
splicing the sorted data set sub-blocks in sequence to form a DATA set DATA3;
completing the ordered reading of the DATA set DATA3 by the large language model training devices.
2. The method for preprocessing data for accelerating training of a large language model according to claim 1, wherein the relevant parameters specifically include the following:
the number a of large language model training devices; the number b of data set sub-blocks; the number c of pieces of text data input by a single device per input; a training coefficient d; the total number n of pieces of text data input by all training devices at a time; the number q of pieces of text data in a single data set sub-block; and the total number K of pieces of text data in the data set.
3. The data preprocessing method for accelerating training of a large language model according to claim 2, wherein the parameters a, b, c, d, n, q and K are positive integers; a smaller training coefficient d means a smaller number q of pieces of text data per data set sub-block, a higher degree of randomness in the data as a whole, a better training effect for the large language model, but a longer training time; a larger training coefficient d means a larger q, a lower degree of randomness, a shorter training time, and a poorer training effect.
4. The data preprocessing method for accelerating training of a large language model according to claim 3, wherein the training coefficient d ∈ [500, 2000].
5. The data preprocessing method for accelerating training of a large language model according to claim 2, wherein each of the large language model training devices is assigned a positive integer number between 1 and a, and no number is repeated; the ordered reading of the DATA set DATA3 by the large language model training devices specifically means that each device, in the order of its device number, sequentially reads c pieces of text data from DATA3 for training, and so on until the reading of the large language model training data is complete.
6. The data preprocessing method for accelerating training of a large language model according to claim 1, wherein the b data set sub-blocks specifically are:
each of the b data set sub-blocks is assigned a positive integer number between 1 and b, and no number is repeated.
7. The data preprocessing method for accelerating training of a large language model according to claim 1, wherein the splicing in sequence is specifically end-to-end splicing according to the numbering of the data set sub-blocks.
8. A data preprocessing system for accelerating the training of a large language model, characterized by comprising the following modules:
the parameter setting module is used for setting related parameters in the data preprocessing process;
the DATA set acquisition module is connected with the parameter setting module and is used for acquiring a DATA set DATA1 for training a large language model;
The DATA set scattering module is connected with the DATA set acquisition module and used for randomly scattering the DATA set DATA1 to form a DATA set DATA2;
The DATA set sub-block dividing module is connected with the DATA set scattering module and is used for dividing the scattered DATA set DATA2 into b DATA set sub-blocks, and each DATA set sub-block comprises one or more pieces of text DATA;
the sorting module is connected with the data set sub-block dividing module and is used for uniformly arranging text data in each data set sub-block in ascending order or descending order according to the length of the text data;
the splicing module is connected with the sorting module and used for splicing the sorted DATA set sub-blocks according to the sequence to form a DATA set DATA3;
and the large language model training DATA reading module is connected with the splicing module and is used for finishing orderly reading of the DATA set DATA3 through large language model training equipment.
9. A computer readable storage medium, characterized in that the computer readable storage medium has a computer program which, when executed by a processor, implements a data preprocessing method of accelerating training of a large language model according to any one of claims 1 to 7.
10. An apparatus comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing a data preprocessing method for accelerating training of a large language model according to any one of claims 1 to 7 when executing the computer program.
CN202410501460.6A 2024-04-25 2024-04-25 Data preprocessing method and system for accelerating training of large language model Active CN118171108B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410501460.6A CN118171108B (en) 2024-04-25 2024-04-25 Data preprocessing method and system for accelerating training of large language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410501460.6A CN118171108B (en) 2024-04-25 2024-04-25 Data preprocessing method and system for accelerating training of large language model

Publications (2)

Publication Number Publication Date
CN118171108A true CN118171108A (en) 2024-06-11
CN118171108B CN118171108B (en) 2024-08-13

Family

ID=91353232

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410501460.6A Active CN118171108B (en) 2024-04-25 2024-04-25 Data preprocessing method and system for accelerating training of large language model

Country Status (1)

Country Link
CN (1) CN118171108B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859982A (en) * 2020-06-19 2020-10-30 北京百度网讯科技有限公司 Language model training method and device, electronic equipment and readable storage medium
CN111857991A (en) * 2020-06-23 2020-10-30 中国平安人寿保险股份有限公司 Data sorting method and device and computer equipment
WO2023192676A1 (en) * 2022-04-01 2023-10-05 Google Llc Deterministic training of machine learning models
CN116245197A (en) * 2023-02-21 2023-06-09 北京数美时代科技有限公司 Method, system, medium and equipment for improving training rate of language model
CN116468039A (en) * 2023-03-14 2023-07-21 杭州市第七人民医院 Training data determining method and device and computer equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI Xiong et al., "Research on Text Semantic Label Extraction Based on Term Clustering" [基于词项聚类的文本语义标签抽取研究], Computer Science (计算机科学), vol. 45, no. 11, 30 November 2018 (2018-11-30), pages 427-431 *

Also Published As

Publication number Publication date
CN118171108B (en) 2024-08-13

Similar Documents

Publication Publication Date Title
US10902318B2 (en) Methods and systems for improved transforms in convolutional neural networks
WO2022068663A1 (en) Memory allocation method, related device, and computer readable storage medium
KR102038390B1 (en) Artificial neural network module and scheduling method thereof for highly effective parallel processing
CN106778079A (en) A kind of DNA sequence dna k mer frequency statistics methods based on MapReduce
EP4375844A1 (en) Neural network on-chip mapping method and device based on tabu search algorithm
CN104020983A (en) KNN-GPU acceleration method based on OpenCL
CN103995827B (en) High-performance sort method in MapReduce Computational frames
CN106295670A (en) Data processing method and data processing equipment
CN103064991A (en) Mass data clustering method
CN106802787A (en) MapReduce optimization methods based on GPU sequences
CN118171108B (en) Data preprocessing method and system for accelerating training of large language model
CN105830160A (en) Apparatuses and methods for writing masked data to buffer
JP4310500B2 (en) Important component priority calculation method and equipment
Langr et al. Storing sparse matrices to files in the adaptive-blocking hierarchical storage format
Lima et al. Descent search approaches applied to the minimization of open stacks
Harada et al. Introduction to GPU radix sort
CN109800891A (en) A kind of machine learning redundant data delet method and system
CN111427857B (en) FPGA configuration file compression and decompression method based on partition reference technology
US20220343145A1 (en) Method and system for graph neural network acceleration
Salah et al. Lazy-Merge: A Novel Implementation for Indexed Parallel $ K $-Way In-Place Merging
Hwang et al. Multi-attractor gene reordering for graph bisection
CN113255270B (en) Jacobian template calculation acceleration method, system, medium and storage device
AlMasri Accelerating graph pattern mining algorithms on modern graphics processing units
Shao et al. Blockgraphchi: Enabling block update in out-of-core graph processing
CN113392124B (en) Structured language-based data query method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant