CN116627659B - Model checkpoint file storage method, device, equipment and storage medium - Google Patents

Model checkpoint file storage method, device, equipment and storage medium

Info

Publication number
CN116627659B
Authority
CN
China
Prior art keywords
model
checkpoint file
node
nodes
equipment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310899664.5A
Other languages
Chinese (zh)
Other versions
CN116627659A (en)
Inventor
潘青华
张海俊
胡文龙
汪锦想
于振华
胡国平
刘聪
魏思
王士进
刘权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202310899664.5A priority Critical patent/CN116627659B/en
Publication of CN116627659A publication Critical patent/CN116627659A/en
Application granted granted Critical
Publication of CN116627659B publication Critical patent/CN116627659B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5083 - Techniques for rebalancing the load in a distributed system
    • G06F 9/5088 - Techniques for rebalancing the load in a distributed system involving task migration
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 - Interfaces specially adapted for storage systems
    • G06F 3/0602 - Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/061 - Improving I/O performance
    • G06F 3/0613 - Improving I/O performance in relation to throughput
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 - Interfaces specially adapted for storage systems
    • G06F 3/0628 - Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0638 - Organizing or formatting or addressing of data
    • G06F 3/0643 - Management of files
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Retry When Errors Occur (AREA)

Abstract

The application discloses a method, an apparatus, a device and a storage medium for saving model checkpoint files. When it is determined that checkpoint files need to be saved, a load balancing mechanism is introduced to avoid congestion of a single node's network card or disk IO: the checkpoint file saving tasks of the parts obtained after model partitioning are distributed across a plurality of different device nodes, and the different device nodes are controlled to execute their checkpoint file saving tasks in a parallel processing manner. The resources of all device nodes can thus be fully utilized, congestion of a single node's network card or disk IO is avoided, and saving efficiency is improved.

Description

Model checkpoint file storage method, device, equipment and storage medium
Technical Field
The application relates to the technical field of large-scale model training, and in particular to a method, an apparatus, a device and a storage medium for saving model checkpoint files.
Background
Model pre-training is widely used in various services, including but not limited to the fields of natural language processing, image recognition, speech recognition and the like. In the natural language processing direction in particular, training ever larger pre-trained models (large artificial intelligence models) and fine-tuning them on downstream tasks has become a common way to improve natural language processing (NLP) applications. In these applications, model sizes have grown from a few hundred million parameters at first to hundreds of billions and even trillions of parameters today, so the memory and computing power of a single GPU can no longer meet the model training requirements, and the parameters must be partitioned across more GPUs by model parallel techniques.
Meanwhile, saving checkpoints of very large models also faces many challenges (checkpoint is the term for the model state saved after each round of training, in Chinese 检查点; the saved model state includes the model network parameters, the optimizer state and so on, and is collectively referred to as the checkpoint file). A very large model needs to be partitioned into several parts, which are deployed onto multiple GPUs for computation. FIG. 1 illustrates a typical very large model training scenario. In the figure, D_n denotes the data parallel group rank number, P_l denotes the pipeline parallel group rank number, T_m denotes the tensor parallel group rank number, and M_i denotes the model parallel group rank number. The model is split into 16 parts, with data parallelism D = 8, tensor parallelism T = 4 and pipeline parallelism P = 4, i.e. model parallelism M = T × P = 16 (pipeline parallelism is typically split across nodes because its communication traffic is small, while tensor parallelism is typically split within a node so as to fully utilize the high intra-node bandwidth). Training with this parallel strategy, the 16 GPUs in every 4 machines form a model parallel group, and 16 checkpoint files need to be saved when the model is saved. Conventional model saving methods typically let only the master node within each parallel group (i.e. the node with D_n = 0) save the model, such as the nodes where the processes marked with bold boxes in FIG. 1 are located. The network card or disk IO of these master nodes then becomes congested, and model saving efficiency drops.
Disclosure of Invention
In view of the above problems, the present application provides a model checkpoint file saving method, apparatus, device and storage medium, so as to avoid the problem that only the master node is used to save the checkpoint files of all parts of the model, which congests the master node's network card or disk IO and reduces model saving efficiency. The specific scheme is as follows:
in a first aspect, a method for saving a model checkpoint file is provided, where the model is divided into a plurality of parts, and the parts are deployed on different equipment nodes to train, and the method includes:
when the checkpoint files are determined to be stored, distributing checkpoint file storage tasks of all the partial models to a plurality of different equipment nodes through a load balancing mechanism;
and controlling different equipment nodes, and executing the checkpoint file storage task of the equipment nodes in a parallel processing mode.
Preferably, the distributing the task of storing the checkpoint file of each part model to a plurality of different device nodes through a load balancing mechanism includes:
selecting an unassigned part from the parts after model segmentation;
determining candidate device nodes having the unassigned portion among all device nodes;
selecting a target candidate equipment node with the minimum load from the candidate equipment nodes, distributing the checkpoint file storage task of the unassigned part to the target candidate equipment node, adding a set value to the load state of the target candidate equipment node, and returning to execute the step of selecting an unassigned part from the parts after model segmentation until the unassigned model part does not exist.
Preferably, the model is trained by adopting a pipeline parallel strategy, a tensor parallel strategy and a data parallel strategy, and then the task of storing the checkpoint files of each part model is dispersed to a plurality of different equipment nodes through a load balancing mechanism, including:
selecting the device node process that satisfies the following formulas to save the checkpoint file of the model part M_i:
P_l × T + T_m == M_i
(M_i × max{1, D/T}) mod D == D_n
wherein P_l denotes the pipeline parallel group rank number, T_m denotes the tensor parallel group rank number, D_n denotes the data parallel group rank number, M_i denotes the model parallel group rank number, D denotes the data parallelism, T denotes the tensor parallelism, mod denotes the remainder operation, and == is the relational operator indicating whether the left and right values are the same.
Preferably, the process of controlling each device node to perform a checkpoint file save task includes:
offloading the checkpoint file to be saved to the central processing unit (CPU) through the training process of the device node, and storing it into a preset save queue;
and asynchronously reading and executing the checkpoint file save task from the save queue through a background save process of the device node.
Preferably, the method further comprises:
and in the process of saving the checkpoint file, if an abnormality of the training process is detected, waiting until the checkpoint file save tasks in the save queue have finished executing before exiting the process.
Preferably, the method further comprises:
if the save process of the current device node is abnormal, reassigning the checkpoint file save task allocated to the current device node to other available device nodes, which then perform the save.
In a second aspect, a model checkpoint file storage device is provided, where the model is divided into a plurality of parts, and the parts are deployed on different equipment nodes to train, and the device includes:
the file dispersing unit is used for dispersing the checkpoint file storage tasks of the partial models to a plurality of different equipment nodes through a load balancing mechanism when the checkpoint file needs to be stored;
and the parallel storage unit is used for controlling different equipment nodes and executing the checkpoint file storage task of the equipment nodes in a parallel processing mode.
Preferably, the parallel saving unit controls each device node to execute a process of a checkpoint file saving task of the device node, including:
offloading the checkpoint file to be saved to the central processing unit (CPU) through the training process of the device node, and storing it into a preset save queue;
and asynchronously reading and executing the checkpoint file save task from the save queue through a background save process of the device node.
In a third aspect, there is provided a model checkpoint file storage device comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the model checkpoint file storage method as described above.
In a fourth aspect, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the model checkpoint file preservation method as previously described.
By means of the above technical scheme, when it is determined that checkpoint files need to be saved, a load balancing mechanism is introduced to avoid congestion of a single node's network card or disk IO: the checkpoint file saving tasks of the parts obtained after model partitioning are distributed across a plurality of different device nodes, and the different device nodes are controlled to execute their checkpoint file saving tasks in a parallel processing manner. The resources of all device nodes can thus be fully utilized, congestion of a single node's network card or disk IO is avoided, and saving efficiency is improved.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 illustrates a typical oversized model training scenario;
FIG. 2 is a flow chart illustrating a method for saving a model checkpoint file according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an oversized model training scenario provided in an embodiment of the present application;
FIG. 4 is a flowchart of a method for distributing a checkpoint file storage task of each portion model to different device nodes through a load balancing mechanism according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a device for storing a model checkpoint file according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a model checkpoint file storage device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The application provides a model checkpoint file saving scheme that is applicable to saving checkpoint files produced during the training of various models, in particular models with large-scale parameters such as large artificial intelligence models, and can effectively improve saving efficiency.
The model check point file storage method is applied to a model training cluster, wherein the cluster comprises a plurality of equipment nodes, and the equipment nodes are divided into a plurality of communication groups. In this embodiment, a large-scale model training task is taken as an example for explanation, and in the large-scale model training task, due to the oversized model, the model is usually required to be cut through a multi-dimensional parallel scheme and deployed in different equipment nodes. Common technical means include pipeline parallel technology, tensor parallel technology and data parallel technology for cluster expansion, and a three-dimensional parallel overall solution is constructed.
The device nodes in the cluster can be implemented by adopting a terminal with data processing capability, and the terminal can be a server, a server cluster, a cloud end and the like.
Next, as described in connection with fig. 2, the method for saving a model checkpoint file of the present application may include the following steps:
step S100, when it is determined that the checkpoint files need to be saved, the checkpoint file saving tasks of each part model are dispersed to a plurality of different device nodes through a load balancing mechanism.
Referring to FIG. 3, under data parallel training each part of the partitioned model is replicated onto a plurality of different device nodes for parallel training. To avoid the prior-art problem of network card or disk IO congestion caused by a single device node saving all checkpoint files, a load balancing mechanism may be introduced in this step so that the checkpoint file saving tasks of the parts of the model are distributed across a plurality of different device nodes.
Step S110, controlling different equipment nodes, and executing a checkpoint file storage task of the equipment nodes in a parallel processing mode.
Specifically, different equipment nodes can execute the checkpoint file storage tasks of the equipment nodes in parallel, so that system resources are fully utilized, and storage efficiency is improved.
According to the model checkpoint file saving method provided above, when it is determined that checkpoint files need to be saved, a load balancing mechanism is introduced to avoid congestion of a single node's network card or disk IO: the checkpoint file saving tasks of the parts obtained after model partitioning are distributed across a plurality of different device nodes, and the different device nodes are controlled to execute their checkpoint file saving tasks in a parallel processing manner. The resources of all device nodes can thus be fully utilized, congestion of a single node's network card or disk IO is avoided, and saving efficiency is improved.
In some embodiments of the present application, several different implementations are provided for the above-mentioned process of distributing the checkpoint file save tasks of each portion model to a plurality of different device nodes through a load balancing mechanism in step S100.
First, a greedy strategy may be employed for allocation.
Referring to fig. 4, the method specifically comprises the following steps:
step 200, selecting an unassigned part from the parts after model segmentation.
Step S210, determining candidate device nodes having the unassigned portion among all the device nodes.
Specifically, in the parallel training mode, each part of the model after being segmented can be copied to different equipment nodes for parallel training. To implement the storing of the checkpoint file of each part model, the checkpoint file storing task of each part model may be selected to be allocated to the device node having the part model, so that in this step, the candidate device node having the unallocated part is first determined among all the device nodes.
Step S220, selecting a target candidate device node with the smallest load from the candidate device nodes, allocating the checkpoint file storage task of the unallocated portion to the target candidate device node, and adding a set value to the load state of the target candidate device node.
Specifically, in order to achieve load balancing as far as possible, the target candidate device node with the smallest load may be selected from the candidate device nodes, and the checkpoint file save task of the unassigned part may be allocated to that target candidate device node. After the allocation, a set value may be added to the load state of the target candidate device node, for example load state + 1, to update its load state.
Step S230, judging whether an unassigned model part exists, if so, returning to execute the step S200, and if not, ending.
Specifically, the above steps are repeated until the checkpoint file storage tasks of the respective portions after the model is divided are allocated.
According to the method provided by the embodiment, the checkpoint file storage tasks of each part of the model can be dispersed to a plurality of different equipment nodes, the load balance of each equipment node is ensured as much as possible, and the overall storage efficiency is improved.
The above method is exemplarily described with reference to fig. 3:
The model is divided into 16 parts, with rank numbers M_0 to M_15.
Checkpoint file save tasks are allocated to the parts in turn in the order M_0 to M_15. For M_0, the candidate device nodes holding the M_0 part are first determined, comprising: process 0 and process 4 of node 0, process 0 and process 4 of node 4, process 0 and process 4 of node 8, and process 0 and process 4 of node 12.
In the initial state the load of every process on every device node is the same, so a target candidate device node can be selected at random from the candidates; assume process 0 of node 0 is selected to execute the checkpoint file save task of the M_0 part. At the same time, the load state of node 0 is increased by 1.
For M_1, the candidate device nodes holding the M_1 part are first determined, comprising: process 1 and process 5 of node 0, process 1 and process 5 of node 4, process 1 and process 5 of node 8, and process 1 and process 5 of node 12.
The least loaded node is then selected from among nodes 4, 8 and 12, e.g. process 1 of node 4 is selected to execute the checkpoint file save task of the M_1 part. At the same time, the load state of node 4 is increased by 1.
And so on, until the checkpoint file save task of the M_15 part has been allocated. The allocation results for the M_0 to M_15 parts can be seen in FIG. 3 from the processes marked with bold boxes.
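The greedy allocation described above can be illustrated with a short Python sketch. It is only a sketch under assumed data structures: the mapping from each model part to the device nodes holding a copy of it, and the node names, are examples rather than anything prescribed by this embodiment, and ties are broken by iteration order here whereas the description above breaks them arbitrarily.

# Illustrative sketch of the greedy allocation in steps S200 to S230.
def assign_save_tasks(part_to_candidates):
    """part_to_candidates maps each model part to the device nodes that hold a copy of it.
    Returns a mapping from each model part to the node chosen to save its checkpoint file."""
    load = {}        # node -> number of save tasks already assigned
    assignment = {}  # model part -> chosen node
    for part, candidates in part_to_candidates.items():       # one unassigned part at a time
        # pick the candidate node with the smallest current load
        target = min(candidates, key=lambda node: load.get(node, 0))
        assignment[part] = target
        load[target] = load.get(target, 0) + 1                 # load state of the target node + 1
    return assignment

# Toy usage (hypothetical node names): four parts, each replicated on two of three nodes.
parts = {
    "M0": ["node0", "node1"],
    "M1": ["node0", "node2"],
    "M2": ["node1", "node2"],
    "M3": ["node0", "node1"],
}
print(assign_save_tasks(parts))   # {'M0': 'node0', 'M1': 'node2', 'M2': 'node1', 'M3': 'node0'}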
In the second approach, where the model is trained using pipeline parallel, tensor parallel and data parallel strategies, another implementation of step S100 described above is provided.
Specifically, the device node process that satisfies the following exemplary formulas may be selected to save the checkpoint file of the model part M_i:
P_l × T + T_m == M_i
(M_i × max{1, D/T}) mod D == D_n
wherein P_l denotes the pipeline parallel group rank number, T_m denotes the tensor parallel group rank number, D_n denotes the data parallel group rank number, M_i denotes the model parallel group rank number, D denotes the data parallelism, T denotes the tensor parallelism, mod denotes the remainder operation, and == is the relational operator indicating whether the left and right values are the same.
The first formula, P_l × T + T_m == M_i, is used to guarantee that the selected device node process holds the M_i part of the model. For example, for part 0 after model partitioning, its rank number is M_i = 0. For process 0 of node 0, P_l = 0, T = 4 and T_m = 0, so P_l × T + T_m = 0 × 4 + 0 == M_i.
The second formula, (M_i × max{1, D/T}) mod D == D_n, is used to hash the different save tasks within a tensor parallel group onto different model parallel groups. Since pipeline parallelism is typically split across nodes, it need not be considered in this formula.
For example, still taking part 0 after model partitioning, its rank number is M_i = 0. For process 0 of node 0, D = 8, T = 4 and D_n = 0; the term max{1, D/T} ensures a minimum value of 1 when D is smaller than T. The left side of the formula is (0 × max{1, 2}) mod 8 = 0 == D_n.
Clearly, through these two formulas, process 0 of node 0 is selected to save the checkpoint file of the M_0 part of the model.
For the remaining parts of the model, the above formulas can likewise be used to allocate each part's checkpoint file save task. The allocation results for the M_0 to M_15 parts can be seen in FIG. 3 from the processes marked with bold boxes.
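A minimal Python sketch of this formula-based selection is given below. It is illustrative only: the argument names (pipeline_rank, tensor_rank, data_rank for P_l, T_m, D_n) are assumptions for readability, integer parallelism degrees are assumed so that D/T is exact, and the function simply evaluates the two conditions above for one process.

# Illustrative sketch: decide whether this process should save the checkpoint file of part M_i.
def should_save(part_index, pipeline_rank, tensor_rank, data_rank, D, T):
    holds_part = pipeline_rank * T + tensor_rank == part_index      # P_l * T + T_m == M_i
    spread = max(1, D // T)                                         # at least 1 when D is smaller than T
    chosen_replica = (part_index * spread) % D == data_rank         # (M_i * max{1, D/T}) mod D == D_n
    return holds_part and chosen_replica

# Worked example from the description: part M_0 on process 0 of node 0, with D = 8 and T = 4.
print(should_save(0, pipeline_rank=0, tensor_rank=0, data_rank=0, D=8, T=4))   # True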
By adopting the distribution method provided by the embodiment, the task of storing the checkpoint files of each part of the model can be dispersed to a plurality of different equipment nodes, the load balance of each equipment node is ensured as much as possible, and the overall storage efficiency is improved.
In some embodiments of the present application, it is considered that the conventional mode of synchronously saving the model may block the model training process, that is, the training process cannot simultaneously execute the model training task when executing the checkpoint file saving task, thereby affecting the training process and causing unnecessary pause time.
For this purpose, an asynchronous save scheme is provided in this embodiment. Specifically:
in the step S110, the process of controlling each device node to execute the checkpoint file save task may include:
and unloading the checkpoint file to be saved to a Central Processing Unit (CPU) through a training process of the equipment node, and storing the checkpoint file into a set save queue.
And asynchronously reading and executing the checkpoint file save task from the save queue through a background save process of the device node.
It can thus be seen that in this embodiment the training process of a device node only needs to offload the checkpoint file to be saved to the CPU and place it in the designated save queue, while the background save process asynchronously reads and executes the checkpoint file save tasks from that queue. The training process does not have to wait for the checkpoint file to be fully saved before continuing model training; training proceeds continuously and is never blocked by a synchronous wait for the checkpoint save, which greatly improves training efficiency.
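A minimal sketch of this asynchronous save flow is shown below. It uses a background thread and Python's standard queue as stand-ins for the background save process and the preset save queue of the embodiment, and assumes the state to be saved is already host-resident; names such as save_checkpoint_async and graceful_exit are illustrative only (graceful_exit corresponds to the exception handling described in the next section).

import pickle
import queue
import threading

save_queue = queue.Queue()   # the "preset save queue"

def background_saver():
    # Asynchronously read and execute checkpoint file save tasks from the queue.
    while True:
        task = save_queue.get()
        if task is None:                 # sentinel: no more tasks, exit
            break
        path, state = task
        with open(path, "wb") as f:      # persist the CPU copy to disk
            pickle.dump(state, f)
        save_queue.task_done()

saver = threading.Thread(target=background_saver, daemon=True)
saver.start()

def save_checkpoint_async(path, state):
    # "Offload to CPU": in a real trainer the tensors would be copied off the
    # accelerator here; this sketch assumes state already lives in host memory.
    save_queue.put((path, dict(state)))  # training continues without waiting

def graceful_exit():
    # On an abnormal training exit, wait until the queued save tasks have finished
    # executing before the process exits, then stop the background saver.
    save_queue.join()
    save_queue.put(None)
    saver.join()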
In some embodiments of the present application, an exception handling mechanism is further provided for handling exceptions that may occur during the saving of a checkpoint file.
During checkpoint file saving, if an abnormality occurs in the training process, for example a network interruption, a node failure or a code logic error, the process may be configured to wait until the checkpoint file save tasks in the save queue have finished executing before exiting. This avoids the problem that, if the process exited before the save tasks were executed, the checkpoint file might not be successfully written to disk, so that the model state could not be recovered when training is restarted and training time would be wasted.
In addition, the save process of a device node may itself become abnormal. For example, if the hardware of the device node fails, both its training process and its save process may become abnormal, and the current device node can no longer perform its checkpoint file save task. In this case, the application can reassign the checkpoint file save task allocated to the current device node to other available device nodes, which then perform the save.
The reassignment of the checkpoint file save task may follow the allocation manner described in the foregoing embodiments, except that the current device node is excluded from the allocation; details can be found in the foregoing description and are not repeated here.
In the embodiment, by setting the exception handling mechanism, fault recovery can be automatically performed, and the stability and reliability of the system are ensured.
The following describes a model checkpoint file storage device provided by an embodiment of the present application, where the model checkpoint file storage device described below and the model checkpoint file storage method described above may be referred to correspondingly.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a model checkpoint file storage device according to an embodiment of the present application.
As shown in fig. 5, the apparatus may include:
a file distributing unit 11, configured to distribute, when it is determined that the checkpoint file needs to be saved, a checkpoint file saving task of each part model to a plurality of different device nodes through a load balancing mechanism;
and the parallel saving unit 12 is used for controlling different equipment nodes and executing the checkpoint file saving task of the equipment node in a parallel processing mode.
Optionally, the process of the file distributing unit to distribute the checkpoint file storing task of each part model to a plurality of different device nodes through a load balancing mechanism may include:
selecting an unassigned part from the parts after model segmentation;
determining candidate device nodes having the unassigned portion among all device nodes;
selecting a target candidate equipment node with the minimum load from the candidate equipment nodes, distributing the checkpoint file storage task of the unassigned part to the target candidate equipment node, adding a set value to the load state of the target candidate equipment node, and returning to execute the step of selecting an unassigned part from the parts after model segmentation until the unassigned model part does not exist.
In another optional case, when the model is trained by adopting the pipeline parallel, tensor parallel and data parallel strategies, the process of dispersing the checkpoint file storage task of each part of the model to a plurality of different device nodes by the file dispersing unit through a load balancing mechanism may include:
selecting the device node process that satisfies the following formulas to save the checkpoint file of the model part M_i:
P_l × T + T_m == M_i
(M_i × max{1, D/T}) mod D == D_n
wherein P_l denotes the pipeline parallel group rank number, T_m denotes the tensor parallel group rank number, D_n denotes the data parallel group rank number, M_i denotes the model parallel group rank number, D denotes the data parallelism, T denotes the tensor parallelism, mod denotes the remainder operation, and == is the relational operator indicating whether the left and right values are the same.
Optionally, the process of controlling each device node to execute the checkpoint file storing task of the device node by using the parallel storing unit may include:
offloading the checkpoint file to be saved to the central processing unit (CPU) through the training process of the device node, and storing it into a preset save queue;
and asynchronously reading and executing the checkpoint file save task from the save queue through a background save process of the device node.
Optionally, the apparatus of the present application may further include:
and the first exception handling unit is configured to, if an abnormality of the training process is detected during checkpoint file saving, wait until the checkpoint file save tasks in the save queue have finished executing before the process exits.
Optionally, the apparatus of the present application may further include:
and the second exception handling unit is configured to, when the save process of the current device node is abnormal, distribute the checkpoint file save task allocated to the current device node to other available device nodes, so that the other device nodes perform the save.
The model checkpoint file storage apparatus provided by the embodiment of the application can be applied to a model checkpoint file storage device, which may be a training device in a model training cluster, such as a server, a server cluster, a cloud, or the like. Optionally, FIG. 6 shows a block diagram of the hardware structure of a model checkpoint file storage device; referring to FIG. 6, the hardware structure of the device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete the communication with each other through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application, etc.;
the memory 3 may comprise a high-speed RAM memory, and may further comprise a non-volatile memory (non-volatile memory) or the like, such as at least one magnetic disk memory;
wherein the memory stores a program, the processor is operable to invoke the program stored in the memory, the program operable to:
when the checkpoint files are determined to be stored, distributing checkpoint file storage tasks of all the partial models to a plurality of different equipment nodes through a load balancing mechanism;
and controlling different equipment nodes, and executing the checkpoint file storage task of the equipment nodes in a parallel processing mode.
Alternatively, the refinement function and the extension function of the program may be described with reference to the above.
The embodiment of the present application also provides a storage medium storing a program adapted to be executed by a processor, the program being configured to:
when the checkpoint files are determined to be stored, distributing checkpoint file storage tasks of all the partial models to a plurality of different equipment nodes through a load balancing mechanism;
and controlling different equipment nodes, and executing the checkpoint file storage task of the equipment nodes in a parallel processing mode.
Alternatively, the refinement function and the extension function of the program may be described with reference to the above.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the present specification, each embodiment is described in a progressive manner, and each embodiment focuses on the difference from other embodiments, and may be combined according to needs, and the same similar parts may be referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A model checkpoint file storage method is characterized in that the model is divided into a plurality of parts, and the parts are respectively deployed on different equipment nodes for training, and the method comprises the following steps:
when the checkpoint files are determined to be stored, distributing checkpoint file storage tasks of all the partial models to a plurality of different equipment nodes through a load balancing mechanism;
controlling different equipment nodes, and executing a checkpoint file storage task of the equipment node in a parallel processing mode;
the method for distributing the checkpoint file storage tasks of each part model to a plurality of different equipment nodes through a load balancing mechanism comprises the following steps:
selecting an unassigned part from the parts after model segmentation;
determining candidate device nodes having the unassigned portion among all device nodes;
selecting a target candidate equipment node with the minimum load from the candidate equipment nodes, distributing the checkpoint file storage task of the unassigned part to the target candidate equipment node, adding a set value to the load state of the target candidate equipment node, and returning to execute the step of selecting an unassigned part from the parts after model segmentation until the unassigned model part does not exist.
2. The method of claim 1, wherein the model is trained using pipeline parallel, tensor parallel, and data parallel strategies, and the distributing the checkpoint file save tasks of each partial model to a plurality of different device nodes via a load balancing mechanism comprises:
selecting the device node process that satisfies the following formulas to save the checkpoint file of the model part M_i:
P_l × T + T_m == M_i
(M_i × max{1, D/T}) mod D == D_n
wherein P_l represents the pipeline parallel group rank number, T_m represents the tensor parallel group rank number, D_n represents the data parallel group rank number, M_i represents the model parallel group rank number, D represents the data parallelism, T represents the tensor parallelism, mod represents the remainder operation, and == is the relational operator indicating whether the left and right values are the same.
3. The method of claim 1, wherein controlling each device node to perform a checkpoint file save task comprises:
offloading the checkpoint file to be saved to the central processing unit (CPU) through the training process of the device node, and storing it into a preset save queue;
and asynchronously reading and executing the checkpoint file save task from the save queue through a background save process of the device node.
4. A method according to claim 3, further comprising:
and in the process of saving the checkpoint file, if an abnormality of the training process is detected, waiting until the checkpoint file save tasks in the save queue have finished executing before exiting the process.
5. A method according to claim 3, further comprising:
if the save process of the current device node is abnormal, reassigning the checkpoint file save task allocated to the current device node to other available device nodes, which then perform the save.
6. A model checkpoint file storage device, wherein the model is divided into a plurality of parts, and the parts are deployed on different equipment nodes for training, the device comprising:
the file dispersing unit is used for dispersing the checkpoint file storage tasks of the partial models to a plurality of different equipment nodes through a load balancing mechanism when the checkpoint file needs to be stored;
the parallel storage unit is used for controlling different equipment nodes and executing a checkpoint file storage task of the equipment nodes in a parallel processing mode;
the method for distributing the checkpoint file storage tasks of each part model to a plurality of different equipment nodes through a load balancing mechanism comprises the following steps:
selecting an unassigned part from the parts after model segmentation;
determining candidate device nodes having the unassigned portion among all device nodes;
selecting a target candidate equipment node with the minimum load from the candidate equipment nodes, distributing the checkpoint file storage task of the unassigned part to the target candidate equipment node, adding a set value to the load state of the target candidate equipment node, and returning to execute the step of selecting an unassigned part from the parts after model segmentation until the unassigned model part does not exist.
7. The apparatus according to claim 6, wherein the parallel save unit controls each device node to perform a process of saving a checkpoint file of the device node, including:
offloading the checkpoint file to be saved to the central processing unit (CPU) through the training process of the device node, and storing it into a preset save queue;
and asynchronously reading and executing the checkpoint file save task from the save queue through a background save process of the device node.
8. A model checkpoint file save apparatus comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the model checkpoint file storing method according to any one of claims 1 to 5.
9. A storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the model checkpoint file preservation method as claimed in any one of claims 1 to 5.
CN202310899664.5A 2023-07-21 2023-07-21 Model check point file storage method, device, equipment and storage medium Active CN116627659B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310899664.5A CN116627659B (en) 2023-07-21 2023-07-21 Model check point file storage method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310899664.5A CN116627659B (en) 2023-07-21 2023-07-21 Model check point file storage method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116627659A CN116627659A (en) 2023-08-22
CN116627659B true CN116627659B (en) 2023-12-01

Family

ID=87602896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310899664.5A Active CN116627659B (en) 2023-07-21 2023-07-21 Model check point file storage method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116627659B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117873789B (en) * 2024-03-13 2024-05-10 之江实验室 Checkpoint writing method and device based on segmentation quantization

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103188346A (en) * 2013-03-05 2013-07-03 北京航空航天大学 Distributed decision making supporting massive high-concurrency access I/O (Input/output) server load balancing system
US9158540B1 (en) * 2011-11-14 2015-10-13 Emc Corporation Method and apparatus for offloading compute resources to a flash co-processing appliance
WO2016122596A1 (en) * 2015-01-30 2016-08-04 Hewlett Packard Enterprise Development Lp Checkpoint-based scheduling in cluster
CN106027647A (en) * 2016-05-20 2016-10-12 云南云电同方科技有限公司 LXPFS (Linux XProgram File System) cluster distributed file storage system
CN109819057A (en) * 2019-04-08 2019-05-28 科大讯飞股份有限公司 A kind of load-balancing method and system
CN111258824A (en) * 2020-01-18 2020-06-09 重庆邮电大学 Increment check point fault tolerance method based on artificial potential field in cloud computing
CN114787833A (en) * 2019-09-23 2022-07-22 普雷萨根私人有限公司 Distributed Artificial Intelligence (AI)/machine learning training system
CN116185623A (en) * 2023-02-07 2023-05-30 北京百分点科技集团股份有限公司 Task allocation method and device, electronic equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9804798B2 (en) * 2012-12-14 2017-10-31 Vmware, Inc. Storing checkpoint file in high performance storage device for rapid virtual machine suspend and resume
US10268744B2 (en) * 2015-09-22 2019-04-23 Walmart Apollo, Llc System for maintaining consistency across a decentralized database cluster and method therefor

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9158540B1 (en) * 2011-11-14 2015-10-13 Emc Corporation Method and apparatus for offloading compute resources to a flash co-processing appliance
CN103188346A (en) * 2013-03-05 2013-07-03 北京航空航天大学 Distributed decision making supporting massive high-concurrency access I/O (Input/output) server load balancing system
WO2016122596A1 (en) * 2015-01-30 2016-08-04 Hewlett Packard Enterprise Development Lp Checkpoint-based scheduling in cluster
CN106027647A (en) * 2016-05-20 2016-10-12 云南云电同方科技有限公司 LXPFS (Linux XProgram File System) cluster distributed file storage system
CN109819057A (en) * 2019-04-08 2019-05-28 科大讯飞股份有限公司 A kind of load-balancing method and system
CN114787833A (en) * 2019-09-23 2022-07-22 普雷萨根私人有限公司 Distributed Artificial Intelligence (AI)/machine learning training system
CN111258824A (en) * 2020-01-18 2020-06-09 重庆邮电大学 Increment check point fault tolerance method based on artificial potential field in cloud computing
CN116185623A (en) * 2023-02-07 2023-05-30 北京百分点科技集团股份有限公司 Task allocation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN116627659A (en) 2023-08-22

Similar Documents

Publication Publication Date Title
US20230169351A1 (en) Distributed training method based on end-to-end adaption, and device
CN112416585B (en) Deep learning-oriented GPU resource management and intelligent scheduling method
CN116627659B (en) Model check point file storage method, device, equipment and storage medium
CN112114973B (en) Data processing method and device
US20210390405A1 (en) Microservice-based training systems in heterogeneous graphic processor unit (gpu) cluster and operating method thereof
CN112732444A (en) Distributed machine learning-oriented data partitioning method
CN109739614A (en) Virtual machine rebuilding method, device and equipment
CN116340005B (en) Container cluster scheduling method, device, equipment and storage medium
CN111209106B (en) Flow chart dividing method and system based on caching mechanism
CN115237580A (en) Intelligent calculation-oriented flow parallel training self-adaptive adjustment system and method
CN111049900B (en) Internet of things flow calculation scheduling method and device and electronic equipment
US20230325235A1 (en) Training task queuing cause analysis method and system, device and medium
CN112256441B (en) Memory allocation method and device for neural network inference
CN115951845B (en) Disk management method, device, equipment and storage medium
CN113626173A (en) Scheduling method, device and storage medium
CN117687774A (en) Task model training method for computing power scheduling and computing power scheduling method and system
CN113821174B (en) Storage processing method, storage processing device, network card equipment and storage medium
CN109542601B (en) Policy compiling method and device, electronic equipment and computer storage medium
CN113094175A (en) Load balancing method and device
KR101916809B1 (en) Apparatus for placing virtual cluster and method for providing the same
CN113094168A (en) Distributed training method, device and system of model
CN112631743B (en) Task scheduling method, device and storage medium
CN115145714B (en) Scheduling method, device and system for container instance
US20240220794A1 (en) Training systems and operating method thereof
CN117234749A (en) Method, apparatus, device, storage medium and program product for grouping computing tasks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant