CN116627659B - Model checkpoint file storage method, device, equipment and storage medium - Google Patents

Model checkpoint file storage method, device, equipment and storage medium

Info

Publication number
CN116627659B
Authority
CN
China
Prior art keywords
model
checkpoint file
node
nodes
equipment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310899664.5A
Other languages
Chinese (zh)
Other versions
CN116627659A (en)
Inventor
潘青华
张海俊
胡文龙
汪锦想
于振华
胡国平
刘聪
魏思
王士进
刘权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202310899664.5A priority Critical patent/CN116627659B/en
Publication of CN116627659A publication Critical patent/CN116627659A/en
Application granted granted Critical
Publication of CN116627659B publication Critical patent/CN116627659B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5083 - Techniques for rebalancing the load in a distributed system
    • G06F 9/5088 - Techniques for rebalancing the load in a distributed system involving task migration
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 - Interfaces specially adapted for storage systems
    • G06F 3/0602 - Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/061 - Improving I/O performance
    • G06F 3/0613 - Improving I/O performance in relation to throughput
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 - Interfaces specially adapted for storage systems
    • G06F 3/0628 - Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0638 - Organizing or formatting or addressing of data
    • G06F 3/0643 - Management of files
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Retry When Errors Occur (AREA)

Abstract

The application discloses a method, an apparatus, a device and a storage medium for saving model checkpoint files. When it is determined that checkpoint files need to be saved, a load balancing mechanism is introduced to avoid congestion of a single node's network card or disk IO: the checkpoint file saving tasks of the parts obtained after model partitioning are distributed across a plurality of different device nodes, and the different device nodes are controlled to execute their checkpoint file saving tasks in a parallel processing manner. The resources of all device nodes can thus be fully utilized, congestion of a single node's network card or disk IO is avoided, and saving efficiency is improved.

Description

Model checkpoint file storage method, device, equipment and storage medium
Technical Field
The application relates to the technical field of large-scale model training, and in particular to a method, an apparatus, a device and a storage medium for saving model checkpoint files.
Background
Model pre-training is widely used in various services, including but not limited to the fields of natural language processing, image recognition, speech recognition and the like. In the natural language processing direction in particular, training ever larger pre-trained models (large artificial intelligence models) and fine-tuning them on downstream tasks has become a common way to improve natural language processing (NLP) applications. In these applications, model sizes have grown from a few hundred million parameters at first to hundreds of billions and even trillions of parameters today, so the memory and computing power of a single GPU can no longer meet the model training requirements, and the parameters must be partitioned across more GPUs by model parallel techniques.
Meanwhile, saving checkpoints of very large models also faces many challenges (checkpoint is the term for the model state saved after each round of training, in Chinese 检查点; the saved model state includes the model network parameters, the optimizer state and so on, and is collectively referred to as the checkpoint file). A very large model needs to be partitioned into several parts, which are deployed onto multiple GPUs for computation. FIG. 1 illustrates a typical very large model training scenario. In the figure, D_n denotes the data parallel group rank number, P_l denotes the pipeline parallel group rank number, T_m denotes the tensor parallel group rank number, and M_i denotes the model parallel group rank number. The model is split into 16 parts, with data parallelism D = 8, tensor parallelism T = 4 and pipeline parallelism P = 4, i.e. model parallelism M = T × P = 16 (pipeline parallelism is typically split across nodes because its communication traffic is small, while tensor parallelism is typically split within a node so as to fully utilize the high intra-node bandwidth). Training with this parallel strategy, the 16 GPUs in every 4 machines form a model parallel group, and 16 checkpoint files need to be saved when the model is saved. Conventional model saving methods typically let only the master node within each parallel group (i.e. the node with D_n = 0) save the model, such as the nodes where the processes marked with bold boxes in FIG. 1 are located. The network card or disk IO of these master nodes then becomes congested, and model saving efficiency drops.
Disclosure of Invention
In view of the above problems, the present application provides a model checkpoint file saving method, apparatus, device and storage medium, so as to avoid the problem that only the master node is used to save the checkpoint files of all parts of the model, which congests the master node's network card or disk IO and reduces model saving efficiency. The specific scheme is as follows:
in a first aspect, a method for saving a model checkpoint file is provided, where the model is divided into a plurality of parts, and the parts are deployed on different equipment nodes to train, and the method includes:
when the checkpoint files are determined to be stored, distributing checkpoint file storage tasks of all the partial models to a plurality of different equipment nodes through a load balancing mechanism;
and controlling different equipment nodes, and executing the checkpoint file storage task of the equipment nodes in a parallel processing mode.
Preferably, the distributing the task of storing the checkpoint file of each part model to a plurality of different device nodes through a load balancing mechanism includes:
selecting an unassigned part from the parts after model segmentation;
determining candidate device nodes having the unassigned portion among all device nodes;
selecting a target candidate equipment node with the minimum load from the candidate equipment nodes, distributing the checkpoint file storage task of the unassigned part to the target candidate equipment node, adding a set value to the load state of the target candidate equipment node, and returning to execute the step of selecting an unassigned part from the parts after model segmentation until the unassigned model part does not exist.
Preferably, the model is trained by adopting a pipeline parallel strategy, a tensor parallel strategy and a data parallel strategy, and then the task of storing the checkpoint files of each part model is dispersed to a plurality of different equipment nodes through a load balancing mechanism, including:
selecting the device node process that satisfies the following formulas to save the checkpoint file of the model part M_i:
P_l × T + T_m == M_i
(M_i × max{1, D/T}) mod D == D_n
wherein P_l denotes the pipeline parallel group rank number, T_m denotes the tensor parallel group rank number, D_n denotes the data parallel group rank number, M_i denotes the model parallel group rank number, D denotes the data parallelism, T denotes the tensor parallelism, mod denotes the remainder operation, and == is the relational operator indicating whether the left and right values are the same.
Preferably, the process of controlling each device node to perform a checkpoint file save task includes:
offloading the checkpoint file to be saved to the central processing unit (CPU) through the training process of the device node, and storing it into a preset save queue;
and asynchronously reading and executing the checkpoint file save task from the save queue through a background save process of the device node.
Preferably, the method further comprises:
and in the process of saving the checkpoint file, if an abnormality of the training process is detected, waiting until the checkpoint file save tasks in the save queue have finished executing before exiting the process.
Preferably, the method further comprises:
if the save process of the current device node is abnormal, reassigning the checkpoint file save task allocated to the current device node to other available device nodes, which then perform the save.
In a second aspect, a model checkpoint file storage device is provided, where the model is divided into a plurality of parts, and the parts are deployed on different equipment nodes to train, and the device includes:
the file dispersing unit is used for dispersing the checkpoint file storage tasks of the partial models to a plurality of different equipment nodes through a load balancing mechanism when the checkpoint file needs to be stored;
and the parallel storage unit is used for controlling different equipment nodes and executing the checkpoint file storage task of the equipment nodes in a parallel processing mode.
Preferably, the parallel saving unit controls each device node to execute a process of a checkpoint file saving task of the device node, including:
offloading the checkpoint file to be saved to the central processing unit (CPU) through the training process of the device node, and storing it into a preset save queue;
and asynchronously reading and executing the checkpoint file save task from the save queue through a background save process of the device node.
In a third aspect, there is provided a model checkpoint file storage device comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the model checkpoint file storage method as described above.
In a fourth aspect, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the model checkpoint file preservation method as previously described.
By means of the above technical scheme, when it is determined that checkpoint files need to be saved, a load balancing mechanism is introduced to avoid congestion of a single node's network card or disk IO: the checkpoint file saving tasks of the parts obtained after model partitioning are distributed across a plurality of different device nodes, and the different device nodes are controlled to execute their checkpoint file saving tasks in a parallel processing manner. The resources of all device nodes can thus be fully utilized, congestion of a single node's network card or disk IO is avoided, and saving efficiency is improved.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 illustrates a typical oversized model training scenario;
FIG. 2 is a flow chart illustrating a method for saving a model checkpoint file according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an oversized model training scenario provided in an embodiment of the present application;
FIG. 4 is a flowchart of a method for distributing a checkpoint file storage task of each portion model to different device nodes through a load balancing mechanism according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a device for storing a model checkpoint file according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a model checkpoint file storage device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The application provides a model checkpoint file saving scheme that is applicable to saving checkpoint files produced during the training of various models, in particular models with large-scale parameters such as large artificial intelligence models, and can effectively improve saving efficiency.
The model check point file storage method is applied to a model training cluster, wherein the cluster comprises a plurality of equipment nodes, and the equipment nodes are divided into a plurality of communication groups. In this embodiment, a large-scale model training task is taken as an example for explanation, and in the large-scale model training task, due to the oversized model, the model is usually required to be cut through a multi-dimensional parallel scheme and deployed in different equipment nodes. Common technical means include pipeline parallel technology, tensor parallel technology and data parallel technology for cluster expansion, and a three-dimensional parallel overall solution is constructed.
The device nodes in the cluster can be implemented by adopting a terminal with data processing capability, and the terminal can be a server, a server cluster, a cloud end and the like.
Next, as described in connection with fig. 2, the method for saving a model checkpoint file of the present application may include the following steps:
step S100, when it is determined that the checkpoint files need to be saved, the checkpoint file saving tasks of each part model are dispersed to a plurality of different device nodes through a load balancing mechanism.
Referring to FIG. 3, under data parallel training each part of the partitioned model is replicated onto a plurality of different device nodes for parallel training. To avoid the prior-art problem of network card or disk IO congestion caused by a single device node saving all checkpoint files, a load balancing mechanism may be introduced in this step so that the checkpoint file saving tasks of the parts of the model are distributed across a plurality of different device nodes.
Step S110, controlling different equipment nodes, and executing a checkpoint file storage task of the equipment nodes in a parallel processing mode.
Specifically, different equipment nodes can execute the checkpoint file storage tasks of the equipment nodes in parallel, so that system resources are fully utilized, and storage efficiency is improved.
According to the model checkpoint file saving method provided above, when it is determined that checkpoint files need to be saved, a load balancing mechanism is introduced to avoid congestion of a single node's network card or disk IO: the checkpoint file saving tasks of the parts obtained after model partitioning are distributed across a plurality of different device nodes, and the different device nodes are controlled to execute their checkpoint file saving tasks in a parallel processing manner. The resources of all device nodes can thus be fully utilized, congestion of a single node's network card or disk IO is avoided, and saving efficiency is improved.
In some embodiments of the present application, several different implementations are provided for the above-mentioned process of distributing the checkpoint file save tasks of each portion model to a plurality of different device nodes through a load balancing mechanism in step S100.
First, a greedy strategy may be employed for allocation.
Referring to fig. 4, the method specifically comprises the following steps:
step 200, selecting an unassigned part from the parts after model segmentation.
Step S210, determining candidate device nodes having the unassigned portion among all the device nodes.
Specifically, in the parallel training mode, each part of the model after being segmented can be copied to different equipment nodes for parallel training. To implement the storing of the checkpoint file of each part model, the checkpoint file storing task of each part model may be selected to be allocated to the device node having the part model, so that in this step, the candidate device node having the unallocated part is first determined among all the device nodes.
Step S220, selecting a target candidate device node with the smallest load from the candidate device nodes, allocating the checkpoint file storage task of the unallocated portion to the target candidate device node, and adding a set value to the load state of the target candidate device node.
Specifically, in order to achieve load balancing as far as possible, the target candidate device node with the smallest load may be selected from the candidate device nodes, and the checkpoint file save task of the unassigned part may be allocated to that target candidate device node. After the allocation, a set value may be added to the load state of the target candidate device node, for example load state + 1, to update its load state.
Step S230, judging whether an unassigned model part exists, if so, returning to execute the step S200, and if not, ending.
Specifically, the above steps are repeated until the checkpoint file storage tasks of the respective portions after the model is divided are allocated.
According to the method provided by the embodiment, the checkpoint file storage tasks of each part of the model can be dispersed to a plurality of different equipment nodes, the load balance of each equipment node is ensured as much as possible, and the overall storage efficiency is improved.
The above method is exemplarily described with reference to fig. 3:
The model is divided into 16 parts, with rank numbers M_0 to M_15.
Checkpoint file save tasks are allocated to the parts in turn in the order M_0 to M_15. For M_0, the candidate device nodes holding the M_0 part are first determined, comprising: process 0 and process 4 of node 0, process 0 and process 4 of node 4, process 0 and process 4 of node 8, and process 0 and process 4 of node 12.
In the initial state the load of every process on every device node is the same, so a target candidate device node can be selected at random from the candidates; assume process 0 of node 0 is selected to execute the checkpoint file save task of the M_0 part. At the same time, the load state of node 0 is increased by 1.
For M_1, the candidate device nodes holding the M_1 part are first determined, comprising: process 1 and process 5 of node 0, process 1 and process 5 of node 4, process 1 and process 5 of node 8, and process 1 and process 5 of node 12.
The least loaded node is then selected from among nodes 4, 8 and 12, e.g. process 1 of node 4 is selected to execute the checkpoint file save task of the M_1 part. At the same time, the load state of node 4 is increased by 1.
And so on, until the checkpoint file save task of the M_15 part has been allocated. The allocation results for the M_0 to M_15 parts can be seen in FIG. 3 from the processes marked with bold boxes.
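The greedy allocation described above can be illustrated with a short Python sketch. It is only a sketch under assumed data structures: the mapping from each model part to the device nodes holding a copy of it, and the node names, are examples rather than anything prescribed by this embodiment, and ties are broken by iteration order here whereas the description above breaks them arbitrarily.

# Illustrative sketch of the greedy allocation in steps S200 to S230.
def assign_save_tasks(part_to_candidates):
    """part_to_candidates maps each model part to the device nodes that hold a copy of it.
    Returns a mapping from each model part to the node chosen to save its checkpoint file."""
    load = {}        # node -> number of save tasks already assigned
    assignment = {}  # model part -> chosen node
    for part, candidates in part_to_candidates.items():       # one unassigned part at a time
        # pick the candidate node with the smallest current load
        target = min(candidates, key=lambda node: load.get(node, 0))
        assignment[part] = target
        load[target] = load.get(target, 0) + 1                 # load state of the target node + 1
    return assignment

# Toy usage (hypothetical node names): four parts, each replicated on two of three nodes.
parts = {
    "M0": ["node0", "node1"],
    "M1": ["node0", "node2"],
    "M2": ["node1", "node2"],
    "M3": ["node0", "node1"],
}
print(assign_save_tasks(parts))   # {'M0': 'node0', 'M1': 'node2', 'M2': 'node1', 'M3': 'node0'}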
In the second approach, where the model is trained using pipeline parallel, tensor parallel and data parallel strategies, another implementation of step S100 described above is provided.
Specifically, the device node process that satisfies the following exemplary formulas may be selected to save the checkpoint file of the model part M_i:
P_l × T + T_m == M_i
(M_i × max{1, D/T}) mod D == D_n
wherein P_l denotes the pipeline parallel group rank number, T_m denotes the tensor parallel group rank number, D_n denotes the data parallel group rank number, M_i denotes the model parallel group rank number, D denotes the data parallelism, T denotes the tensor parallelism, mod denotes the remainder operation, and == is the relational operator indicating whether the left and right values are the same.
The first formula, P_l × T + T_m == M_i, is used to guarantee that the selected device node process holds the M_i part of the model. For example, for part 0 after model partitioning, its rank number is M_i = 0. For process 0 of node 0, P_l = 0, T = 4 and T_m = 0, so P_l × T + T_m = 0 × 4 + 0 == M_i.
The second formula, (M_i × max{1, D/T}) mod D == D_n, is used to hash the different save tasks within a tensor parallel group onto different model parallel groups. Since pipeline parallelism is typically split across nodes, it need not be considered in this formula.
For example, still taking part 0 after model partitioning, its rank number is M_i = 0. For process 0 of node 0, D = 8, T = 4 and D_n = 0; the term max{1, D/T} ensures a minimum value of 1 when D is smaller than T. The left side of the formula is (0 × max{1, 2}) mod 8 = 0 == D_n.
Clearly, through these two formulas, process 0 of node 0 is selected to save the checkpoint file of the M_0 part of the model.
For the remaining parts of the model, the above formulas can likewise be used to allocate each part's checkpoint file save task. The allocation results for the M_0 to M_15 parts can be seen in FIG. 3 from the processes marked with bold boxes.
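A minimal Python sketch of this formula-based selection is given below. It is illustrative only: the argument names (pipeline_rank, tensor_rank, data_rank for P_l, T_m, D_n) are assumptions for readability, integer parallelism degrees are assumed so that D/T is exact, and the function simply evaluates the two conditions above for one process.

# Illustrative sketch: decide whether this process should save the checkpoint file of part M_i.
def should_save(part_index, pipeline_rank, tensor_rank, data_rank, D, T):
    holds_part = pipeline_rank * T + tensor_rank == part_index      # P_l * T + T_m == M_i
    spread = max(1, D // T)                                         # at least 1 when D is smaller than T
    chosen_replica = (part_index * spread) % D == data_rank         # (M_i * max{1, D/T}) mod D == D_n
    return holds_part and chosen_replica

# Worked example from the description: part M_0 on process 0 of node 0, with D = 8 and T = 4.
print(should_save(0, pipeline_rank=0, tensor_rank=0, data_rank=0, D=8, T=4))   # True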
By adopting the distribution method provided by the embodiment, the task of storing the checkpoint files of each part of the model can be dispersed to a plurality of different equipment nodes, the load balance of each equipment node is ensured as much as possible, and the overall storage efficiency is improved.
In some embodiments of the present application, it is considered that the conventional mode of synchronously saving the model may block the model training process, that is, the training process cannot simultaneously execute the model training task when executing the checkpoint file saving task, thereby affecting the training process and causing unnecessary pause time.
For this purpose, an asynchronous save scheme is provided in this embodiment. Specifically:
in the step S110, the process of controlling each device node to execute the checkpoint file save task may include:
and unloading the checkpoint file to be saved to a Central Processing Unit (CPU) through a training process of the equipment node, and storing the checkpoint file into a set save queue.
And asynchronously reading and executing the checkpoint file save task from the save queue through a background save process of the device node.
It can thus be seen that in this embodiment the training process of a device node only needs to offload the checkpoint file to be saved to the CPU and place it in the designated save queue, while the background save process asynchronously reads and executes the checkpoint file save tasks from that queue. The training process does not have to wait for the checkpoint file to be fully saved before continuing model training; training proceeds continuously and is never blocked by a synchronous wait for the checkpoint save, which greatly improves training efficiency.
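A minimal sketch of this asynchronous save flow is shown below. It uses a background thread and Python's standard queue as stand-ins for the background save process and the preset save queue of the embodiment, and assumes the state to be saved is already host-resident; names such as save_checkpoint_async and graceful_exit are illustrative only (graceful_exit corresponds to the exception handling described in the next section).

import pickle
import queue
import threading

save_queue = queue.Queue()   # the "preset save queue"

def background_saver():
    # Asynchronously read and execute checkpoint file save tasks from the queue.
    while True:
        task = save_queue.get()
        if task is None:                 # sentinel: no more tasks, exit
            break
        path, state = task
        with open(path, "wb") as f:      # persist the CPU copy to disk
            pickle.dump(state, f)
        save_queue.task_done()

saver = threading.Thread(target=background_saver, daemon=True)
saver.start()

def save_checkpoint_async(path, state):
    # "Offload to CPU": in a real trainer the tensors would be copied off the
    # accelerator here; this sketch assumes state already lives in host memory.
    save_queue.put((path, dict(state)))  # training continues without waiting

def graceful_exit():
    # On an abnormal training exit, wait until the queued save tasks have finished
    # executing before the process exits, then stop the background saver.
    save_queue.join()
    save_queue.put(None)
    saver.join()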
In some embodiments of the present application, an exception handling mechanism is further provided for handling exceptions that may occur during the saving of a checkpoint file.
During checkpoint file saving, if an abnormality occurs in the training process, for example a network interruption, a node failure or a code logic error, the process may be configured to wait until the checkpoint file save tasks in the save queue have finished executing before exiting. This avoids the problem that, if the process exited before the save tasks were executed, the checkpoint file might not be successfully written to disk, so that the model state could not be recovered when training is restarted and training time would be wasted.
In addition, the save process of a device node may itself become abnormal. For example, if the hardware of the device node fails, both its training process and its save process may become abnormal, and the current device node can no longer perform its checkpoint file save task. In this case, the application can reassign the checkpoint file save task allocated to the current device node to other available device nodes, which then perform the save.
The reassignment of the checkpoint file save task may follow the allocation manner described in the foregoing embodiments, except that the current device node is excluded from the allocation; details can be found in the foregoing description and are not repeated here.
In the embodiment, by setting the exception handling mechanism, fault recovery can be automatically performed, and the stability and reliability of the system are ensured.
The following describes a model checkpoint file storage device provided by an embodiment of the present application, where the model checkpoint file storage device described below and the model checkpoint file storage method described above may be referred to correspondingly.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a model checkpoint file storage device according to an embodiment of the present application.
As shown in fig. 5, the apparatus may include:
a file distributing unit 11, configured to distribute, when it is determined that the checkpoint file needs to be saved, a checkpoint file saving task of each part model to a plurality of different device nodes through a load balancing mechanism;
and the parallel saving unit 12 is used for controlling different equipment nodes and executing the checkpoint file saving task of the equipment node in a parallel processing mode.
Optionally, the process of the file distributing unit to distribute the checkpoint file storing task of each part model to a plurality of different device nodes through a load balancing mechanism may include:
selecting an unassigned part from the parts after model segmentation;
determining candidate device nodes having the unassigned portion among all device nodes;
selecting a target candidate equipment node with the minimum load from the candidate equipment nodes, distributing the checkpoint file storage task of the unassigned part to the target candidate equipment node, adding a set value to the load state of the target candidate equipment node, and returning to execute the step of selecting an unassigned part from the parts after model segmentation until the unassigned model part does not exist.
In another optional case, when the model is trained by adopting the pipeline parallel, tensor parallel and data parallel strategies, the process of dispersing the checkpoint file storage task of each part of the model to a plurality of different device nodes by the file dispersing unit through a load balancing mechanism may include:
selecting the device node process that satisfies the following formulas to save the checkpoint file of the model part M_i:
P_l × T + T_m == M_i
(M_i × max{1, D/T}) mod D == D_n
wherein P_l denotes the pipeline parallel group rank number, T_m denotes the tensor parallel group rank number, D_n denotes the data parallel group rank number, M_i denotes the model parallel group rank number, D denotes the data parallelism, T denotes the tensor parallelism, mod denotes the remainder operation, and == is the relational operator indicating whether the left and right values are the same.
Optionally, the process of controlling each device node to execute the checkpoint file storing task of the device node by using the parallel storing unit may include:
offloading the checkpoint file to be saved to the central processing unit (CPU) through the training process of the device node, and storing it into a preset save queue;
and asynchronously reading and executing the checkpoint file save task from the save queue through a background save process of the device node.
Optionally, the apparatus of the present application may further include:
and the first exception handling unit is configured to, if an abnormality of the training process is detected during checkpoint file saving, wait until the checkpoint file save tasks in the save queue have finished executing before the process exits.
Optionally, the apparatus of the present application may further include:
and the second exception handling unit is configured to, when the save process of the current device node is abnormal, distribute the checkpoint file save task allocated to the current device node to other available device nodes, so that the other device nodes perform the save.
The model checkpoint file storage apparatus provided by the embodiment of the application can be applied to a model checkpoint file storage device, which may be a training device in a model training cluster, such as a server, a server cluster, a cloud, or the like. Optionally, FIG. 6 shows a block diagram of the hardware structure of a model checkpoint file storage device; referring to FIG. 6, the hardware structure of the device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete the communication with each other through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application, etc.;
the memory 3 may comprise a high-speed RAM memory, and may further comprise a non-volatile memory (non-volatile memory) or the like, such as at least one magnetic disk memory;
wherein the memory stores a program, the processor is operable to invoke the program stored in the memory, the program operable to:
when the checkpoint files are determined to be stored, distributing checkpoint file storage tasks of all the partial models to a plurality of different equipment nodes through a load balancing mechanism;
and controlling different equipment nodes, and executing the checkpoint file storage task of the equipment nodes in a parallel processing mode.
Alternatively, the refinement function and the extension function of the program may be described with reference to the above.
The embodiment of the present application also provides a storage medium storing a program adapted to be executed by a processor, the program being configured to:
when the checkpoint files are determined to be stored, distributing checkpoint file storage tasks of all the partial models to a plurality of different equipment nodes through a load balancing mechanism;
and controlling different equipment nodes, and executing the checkpoint file storage task of the equipment nodes in a parallel processing mode.
Alternatively, the refinement function and the extension function of the program may be described with reference to the above.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the present specification, each embodiment is described in a progressive manner, and each embodiment focuses on the difference from other embodiments, and may be combined according to needs, and the same similar parts may be referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A model checkpoint file storage method is characterized in that the model is divided into a plurality of parts, and the parts are respectively deployed on different equipment nodes for training, and the method comprises the following steps:
when the checkpoint files are determined to be stored, distributing checkpoint file storage tasks of all the partial models to a plurality of different equipment nodes through a load balancing mechanism;
controlling different equipment nodes, and executing a checkpoint file storage task of the equipment node in a parallel processing mode;
the method for distributing the checkpoint file storage tasks of each part model to a plurality of different equipment nodes through a load balancing mechanism comprises the following steps:
selecting an unassigned part from the parts after model segmentation;
determining candidate device nodes having the unassigned portion among all device nodes;
selecting a target candidate equipment node with the minimum load from the candidate equipment nodes, distributing the checkpoint file storage task of the unassigned part to the target candidate equipment node, adding a set value to the load state of the target candidate equipment node, and returning to execute the step of selecting an unassigned part from the parts after model segmentation until the unassigned model part does not exist.
2. The method of claim 1, wherein the model is trained using pipeline parallel, tensor parallel, and data parallel strategies, and the distributing the checkpoint file save tasks of each partial model to a plurality of different device nodes via a load balancing mechanism comprises:
selecting the device node process that satisfies the following formulas to save the checkpoint file of the model part M_i:
P_l × T + T_m == M_i
(M_i × max{1, D/T}) mod D == D_n
wherein P_l represents the pipeline parallel group rank number, T_m represents the tensor parallel group rank number, D_n represents the data parallel group rank number, M_i represents the model parallel group rank number, D represents the data parallelism, T represents the tensor parallelism, mod represents the remainder operation, and == is the relational operator indicating whether the left and right values are the same.
3. The method of claim 1, wherein controlling each device node to perform a checkpoint file save task comprises:
offloading the checkpoint file to be saved to the central processing unit (CPU) through the training process of the device node, and storing it into a preset save queue;
and asynchronously reading and executing the checkpoint file save task from the save queue through a background save process of the device node.
4. A method according to claim 3, further comprising:
and in the process of saving the checkpoint file, if an abnormality of the training process is detected, waiting until the checkpoint file save tasks in the save queue have finished executing before exiting the process.
5. A method according to claim 3, further comprising:
if the save process of the current device node is abnormal, reassigning the checkpoint file save task allocated to the current device node to other available device nodes, which then perform the save.
6. A model checkpoint file storage device, wherein the model is divided into a plurality of parts, and the parts are deployed on different equipment nodes for training, the device comprising:
the file dispersing unit is used for dispersing the checkpoint file storage tasks of the partial models to a plurality of different equipment nodes through a load balancing mechanism when the checkpoint file needs to be stored;
the parallel storage unit is used for controlling different equipment nodes and executing a checkpoint file storage task of the equipment nodes in a parallel processing mode;
the method for distributing the checkpoint file storage tasks of each part model to a plurality of different equipment nodes through a load balancing mechanism comprises the following steps:
selecting an unassigned part from the parts after model segmentation;
determining candidate device nodes having the unassigned portion among all device nodes;
selecting a target candidate equipment node with the minimum load from the candidate equipment nodes, distributing the checkpoint file storage task of the unassigned part to the target candidate equipment node, adding a set value to the load state of the target candidate equipment node, and returning to execute the step of selecting an unassigned part from the parts after model segmentation until the unassigned model part does not exist.
7. The apparatus according to claim 6, wherein the parallel save unit controls each device node to perform a process of saving a checkpoint file of the device node, including:
offloading the checkpoint file to be saved to the central processing unit (CPU) through the training process of the device node, and storing it into a preset save queue;
and asynchronously reading and executing the checkpoint file save task from the save queue through a background save process of the device node.
8. A model checkpoint file save apparatus comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the model checkpoint file storing method according to any one of claims 1 to 5.
9. A storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the model checkpoint file preservation method as claimed in any one of claims 1 to 5.
CN202310899664.5A 2023-07-21 2023-07-21 Model check point file storage method, device, equipment and storage medium Active CN116627659B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310899664.5A CN116627659B (en) 2023-07-21 2023-07-21 Model check point file storage method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310899664.5A CN116627659B (en) 2023-07-21 2023-07-21 Model check point file storage method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116627659A CN116627659A (en) 2023-08-22
CN116627659B true CN116627659B (en) 2023-12-01

Family

ID=87602896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310899664.5A Active CN116627659B (en) 2023-07-21 2023-07-21 Model check point file storage method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116627659B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117873789B (en) * 2024-03-13 2024-05-10 之江实验室 Checkpoint writing method and device based on segmentation quantization

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103188346A (en) * 2013-03-05 2013-07-03 北京航空航天大学 Distributed decision making supporting massive high-concurrency access I/O (Input/output) server load balancing system
US9158540B1 (en) * 2011-11-14 2015-10-13 Emc Corporation Method and apparatus for offloading compute resources to a flash co-processing appliance
WO2016122596A1 (en) * 2015-01-30 2016-08-04 Hewlett Packard Enterprise Development Lp Checkpoint-based scheduling in cluster
CN106027647A (en) * 2016-05-20 2016-10-12 云南云电同方科技有限公司 LXPFS (Linux XProgram File System) cluster distributed file storage system
CN109819057A (en) * 2019-04-08 2019-05-28 科大讯飞股份有限公司 A kind of load-balancing method and system
CN111258824A (en) * 2020-01-18 2020-06-09 重庆邮电大学 Increment check point fault tolerance method based on artificial potential field in cloud computing
CN114787833A (en) * 2019-09-23 2022-07-22 普雷萨根私人有限公司 Distributed Artificial Intelligence (AI)/machine learning training system
CN116185623A (en) * 2023-02-07 2023-05-30 北京百分点科技集团股份有限公司 Task allocation method and device, electronic equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9804798B2 (en) * 2012-12-14 2017-10-31 Vmware, Inc. Storing checkpoint file in high performance storage device for rapid virtual machine suspend and resume
US10268744B2 (en) * 2015-09-22 2019-04-23 Walmart Apollo, Llc System for maintaining consistency across a decentralized database cluster and method therefor

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9158540B1 (en) * 2011-11-14 2015-10-13 Emc Corporation Method and apparatus for offloading compute resources to a flash co-processing appliance
CN103188346A (en) * 2013-03-05 2013-07-03 北京航空航天大学 Distributed decision making supporting massive high-concurrency access I/O (Input/output) server load balancing system
WO2016122596A1 (en) * 2015-01-30 2016-08-04 Hewlett Packard Enterprise Development Lp Checkpoint-based scheduling in cluster
CN106027647A (en) * 2016-05-20 2016-10-12 云南云电同方科技有限公司 LXPFS (Linux XProgram File System) cluster distributed file storage system
CN109819057A (en) * 2019-04-08 2019-05-28 科大讯飞股份有限公司 A kind of load-balancing method and system
CN114787833A (en) * 2019-09-23 2022-07-22 普雷萨根私人有限公司 Distributed Artificial Intelligence (AI)/machine learning training system
CN111258824A (en) * 2020-01-18 2020-06-09 重庆邮电大学 Increment check point fault tolerance method based on artificial potential field in cloud computing
CN116185623A (en) * 2023-02-07 2023-05-30 北京百分点科技集团股份有限公司 Task allocation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN116627659A (en) 2023-08-22

Similar Documents

Publication Publication Date Title
US20230169351A1 (en) Distributed training method based on end-to-end adaption, and device
CN112416585B (en) Deep learning-oriented GPU resource management and intelligent scheduling method
CN116627659B (en) Model check point file storage method, device, equipment and storage medium
CN112114973B (en) Data processing method and device
US20210390405A1 (en) Microservice-based training systems in heterogeneous graphic processor unit (gpu) cluster and operating method thereof
CN112732444A (en) Distributed machine learning-oriented data partitioning method
CN109739614A (en) Virtual machine rebuilding method, device and equipment
CN116340005B (en) Container cluster scheduling method, device, equipment and storage medium
CN111209106B (en) Flow chart dividing method and system based on caching mechanism
CN115237580A (en) Intelligent calculation-oriented flow parallel training self-adaptive adjustment system and method
CN111049900B (en) Internet of things flow calculation scheduling method and device and electronic equipment
US20230325235A1 (en) Training task queuing cause analysis method and system, device and medium
CN112256441B (en) Memory allocation method and device for neural network inference
CN115951845B (en) Disk management method, device, equipment and storage medium
CN113626173A (en) Scheduling method, device and storage medium
CN117687774A (en) Task model training method for computing power scheduling and computing power scheduling method and system
CN113821174B (en) Storage processing method, storage processing device, network card equipment and storage medium
CN109542601B (en) Policy compiling method and device, electronic equipment and computer storage medium
CN113094175A (en) Load balancing method and device
KR101916809B1 (en) Apparatus for placing virtual cluster and method for providing the same
CN113094168A (en) Distributed training method, device and system of model
CN112631743B (en) Task scheduling method, device and storage medium
CN115145714B (en) Scheduling method, device and system for container instance
US20240220794A1 (en) Training systems and operating method thereof
CN117234749A (en) Method, apparatus, device, storage medium and program product for grouping computing tasks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant