WO2023165512A1 - Fault file saving method and related apparatus - Google Patents

Fault file saving method and related apparatus

Info

Publication number
WO2023165512A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
fault
management node
task
node
Prior art date
Application number
PCT/CN2023/078980
Other languages
English (en)
French (fr)
Inventor
郝日佩
王义彬
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2023165512A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/172Caching, prefetching or hoarding of files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • the present application relates to the technical field of artificial intelligence (AI), and in particular to a fault file storage method, device, management node, distributed system, computer-readable storage medium, and computer program product.
  • AI models (for ease of description, sometimes simply referred to as models) are increasingly used in practice. For example, more and more merchants use AI customer service based on AI models, instead of manual customer service, to provide pre-sales and after-sales consulting services.
  • As another example, platforms use AI models instead of manually reviewing content posted by users, so as to save labor costs.
  • AI model refers to a mathematical model built based on AI technology and used to predict unknown data.
  • the AI model can be a target detection model or an image classification model built on the basis of a neural network.
  • AI models usually need to be trained with large amounts of data.
  • distributed training methods have emerged to meet this need.
  • In the so-called distributed training method, the training task is distributed to multiple training nodes for execution, and the multiple training nodes train the model in parallel.
  • The training task is the process of using a data set to train the model and obtain the weights of the model.
  • the task types of training tasks can be divided into data parallel training types and model parallel training types.
  • the data parallel training type refers to distributing the data in the dataset to multiple training nodes for training
  • the model parallel training type refers to distributing different parts of the model to multiple training nodes for training.
  • Multiple training nodes can use the synchronous update mechanism to update the parameters of the model.
  • the synchronous update mechanism means that the gradients obtained by each training node are accumulated to calculate the mean value, and the parameters of the model are updated based on the mean value.
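  • The following is a minimal sketch, in Python with NumPy, of the synchronous update mechanism described above; it only illustrates gradient averaging and a single SGD step, and is not the implementation of any particular framework.

```python
import numpy as np

def synchronous_update(params, local_gradients, learning_rate=0.01):
    """Average the gradients reported by the training nodes and apply one SGD step."""
    mean_grad = np.mean(np.stack(local_gradients), axis=0)  # accumulate, then average
    return params - learning_rate * mean_grad

# Example: 4 training nodes each report a gradient for the same 3 weights.
weights = np.zeros(3)
grads = [np.array([0.2, -0.1, 0.4]) for _ in range(4)]
weights = synchronous_update(weights, grads)
```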
  • This application provides a method for saving fault files, which detects faults in real time and saves a fault file once a fault is detected, so that the training results of the iteration round in which the fault occurs are retained and the iteration round does not have to be retrained from a large amount of sample data, ensuring training efficiency.
  • the present application also provides a device, a management node, a distributed system, a computer-readable storage medium, and a computer program product corresponding to the above method.
  • the present application provides a method for saving a fault file.
  • This method is applied to distributed systems.
  • the distributed system includes a management node and a plurality of training nodes, and the plurality of training nodes are used for cooperatively executing training tasks.
  • the method can be performed by a management node in a distributed system.
  • the management node obtains a real-time signal from at least one training node among the plurality of training nodes, where the real-time signal is used to characterize the state of the at least one training node; the management node then performs fault detection according to the real-time signal, and after detecting a fault, the management node saves a fault file, which is used to resume the training task.
  • The management node performs real-time fault detection based on the real-time signal obtained from the training node, and after detecting a fault, it saves the training results of the iteration round of the training task in which the fault occurred.
  • In this way, when the training task is rescheduled to a new training node, the new training node can continue training based on the training results of the iteration round in which the fault occurred, without repeating training on a large amount of sample data, which improves training efficiency.
  • Moreover, the method does not need to save files frequently, which avoids the performance challenge of periodically backing up files at a high frequency (i.e., with a short period).
  • The training node may generate signals when performing the training task.
  • When the management node triggers signal collection, the signal collected from the training node in real time is the real-time signal.
  • the management node can set the collection time window, and the management node obtains the signal of the training node within the collection time window from the training node, so as to obtain the real-time signal.
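  • As an illustration only, the sketch below shows how a management node might gather signals within a fixed collection time window; poll_node_signals is a hypothetical callback standing in for the actual signal source.

```python
import time

def collect_realtime_signals(node_ids, poll_node_signals, window_seconds=5.0):
    """Gather every signal reported by the given nodes during one collection window."""
    deadline = time.monotonic() + window_seconds
    collected = {node_id: [] for node_id in node_ids}
    while time.monotonic() < deadline:
        for node_id in node_ids:
            collected[node_id].extend(poll_node_signals(node_id))
        time.sleep(0.1)  # short pause between polls to avoid busy-waiting
    return collected
```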
  • the real-time signals may include the aforementioned cluster communication signals.
  • the training node when the training node executes the training task, it can perform compilation operations (such as compiling to obtain a calculation graph, also called graph compilation result), running operations (such as running the graph compilation result), and generate compilation signals and running signals.
  • When the training node executes the running operation, it may also generate a running manager signal. Therefore, the real-time signal may also include one or more of a compilation signal, a running signal, and a running manager signal.
  • In this way, the management node can realize real-time fault detection by obtaining one or more of the real-time signals, such as the cluster communication signal, compilation signal, running signal, and running manager signal, and then save the training results of the iteration round in which a fault occurs.
  • the management node may first perform fault prediction on the training task through a fault early warning algorithm to obtain a prediction result. Then the management node acquires a real-time signal from at least one training node among the plurality of training nodes according to the prediction result, so as to perform fault detection according to the real-time signal.
  • When the prediction result indicates that a fault is about to occur within a certain period of time, the management node can obtain real-time signals at the predicted time point (or within a period of time before and after that time point) and then perform fault detection based on the real-time signals, which can improve efficiency and reduce resource usage.
  • When the prediction result indicates that no fault will occur within a certain period of time, the management node can still obtain real-time signals during this period of time and perform fault detection based on them, so as to avoid missed detections caused by the accuracy of the fault early warning algorithm not reaching 100%, thereby improving the accuracy of fault detection.
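  • The sketch below illustrates, under assumptions, how the prediction result could gate one round of signal collection and detection; the prediction dictionary and the collect_realtime_signals and detect callables are hypothetical stand-ins for illustration only, not interfaces defined by this application.

```python
import time

def fault_detection_round(prediction, collect_realtime_signals, detect, node_ids):
    """One round of fault detection, gated by the early-warning prediction result."""
    if prediction.get("fault_expected"):
        # A fault is predicted: wait until shortly before the predicted time point,
        # then collect signals around that point.
        wait = prediction["predicted_time"] - time.time() - 5.0
        if wait > 0:
            time.sleep(wait)
    # In both cases real-time signals are collected and checked, so that an
    # inaccurate prediction does not turn into a missed detection.
    signals = collect_realtime_signals(node_ids)
    return detect(signals)
```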
  • the management node can determine the task type of the training task, and then the management node can save the fault file according to the storage strategy corresponding to the task type. In this way, personalized fault files can be saved to meet the needs of different application scenarios.
  • each of the multiple training nodes includes at least one accelerator card.
  • the storage strategy can be to save the fault file on any non-faulty card among the at least one accelerator card, which avoids repeated saving and reduces storage resource usage.
  • Alternatively, the saving strategy can be to save the fault files on multiple non-faulty cards among the at least one accelerator card, for example on all non-faulty cards, so that the training results of the iteration round in which the fault occurs are preserved as comprehensively as possible and the iteration round does not have to be retrained from a large amount of sample data.
  • the management node may further determine whether the target accelerator card used for aggregate communication among the multiple training nodes is a non-faulty card.
  • the saving strategy can be further refined as follows: when the target accelerator card used for aggregated communication among the multiple training nodes is a non-faulty card, the fault file is saved on the target accelerator card, without first having to exchange data among the training nodes to ensure data consistency, which shortens the saving time; when the target accelerator card used for aggregated communication among the multiple training nodes is a faulty card, the fault file is saved on an accelerator card of the non-faulty node with the largest network bandwidth among the multiple training nodes, which increases the saving rate of the fault file.
  • After the fault file is saved, the management node can reschedule the training task, for example to a new training node that has not failed, and then load the fault file, so that the training node can continue training from the iteration round in which the fault occurred, without repeating training on a large amount of sample data.
  • the management node may load the fault file according to the recovery strategy corresponding to the task type of the training task.
  • For example, when the task type is the data parallel training type, the recovery strategy can be to recover based on a single fault file, such as a ckpt file; when the task type is the model parallel training type, the recovery strategy can be to recover based on multiple fault files (for example, the ckpt files on all non-faulty cards). In this way, fault files can be selectively loaded according to the task type, and the training task can be resumed to meet the needs of different application scenarios.
  • the fault file includes the following information: iteration rounds, weights, losses and hyperparameters.
  • Based on information such as the iteration round, weights, losses, and hyperparameters (such as the learning rate and optimizer), it is possible to continue training from the iteration round in which the fault occurred, which meets service requirements.
  • the fault file further includes a graph compilation result of the training task.
  • the graph compilation result refers to the computational graph corresponding to the method implementing the model, and can usually be obtained by compiling that method.
  • The management node may identify the deep learning framework adopted by the training task and obtain an identification result.
  • When the identification result is a framework that supports static compilation, such as the TensorFlow framework or the MindSpore framework, the management node can save the graph compilation result, so that when the training task is resumed later the graph compilation result can be reused directly, improving the efficiency of recovering the training task.
  • the present application provides a fault file storage device.
  • the device for storing fault files is applied to a distributed system, the distributed system includes a management node and a plurality of training nodes, the plurality of training nodes are used for cooperatively executing training tasks, the device is deployed on the management node, and the device includes:
  • a communication module configured to obtain a real-time signal from at least one training node among the plurality of training nodes, the real-time signal being used to represent the state of the at least one training node;
  • a detection module configured to perform fault detection according to the real-time signal
  • the saving module is used to save the fault file after the fault is detected, and the fault file is used to restore the training task.
  • the real-time signal includes one or more of the following:
  • a cluster communication signal, a compilation signal, a running signal, and a running manager signal.
  • the device further includes:
  • a prediction module configured to perform fault prediction on the training task through a fault early warning algorithm to obtain a prediction result
  • the detection module is specifically used for:
  • the saving module is specifically configured to:
  • each of the multiple training nodes includes at least one accelerator card
  • the saving strategy is to save the fault file on any non-faulty card among the at least one accelerator card;
  • the storage strategy is to save the fault files on a plurality of non-faulty cards in the at least one accelerator card.
  • the storage strategy is:
  • when the target accelerator card used for aggregated communication among the plurality of training nodes is a non-faulty card, save the fault file on the target accelerator card;
  • when the target accelerator card used for aggregated communication among the plurality of training nodes is a faulty card, save the fault file on an accelerator card of the node with the largest network bandwidth among the non-faulty nodes of the plurality of training nodes.
  • the device further includes:
  • a recovery module configured to reschedule the training task and load the fault file after the fault file is saved.
  • the recovery module is specifically configured to:
  • the fault file includes the following information: iteration rounds, weights, losses and hyperparameters.
  • the fault file further includes a graph compilation result of the training task.
  • the present application provides a management node.
  • the management node includes at least one processor and at least one memory.
  • the at least one processor and the at least one memory communicate with each other.
  • the at least one processor is configured to execute instructions stored in the at least one memory, so that the management node executes the method in the first aspect or any implementation manner of the first aspect.
  • the present application provides a distributed system.
  • the distributed system includes: a management node and multiple training nodes.
  • the multiple training nodes are used to coordinately execute training tasks
  • the management node is configured to acquire a real-time signal from at least one training node among the plurality of training nodes, the real-time signal is used to characterize the state of the at least one training node, perform fault detection according to the real-time signal, and detect After a fault occurs, the fault file is saved, and the fault file is used to restore the training task.
  • the present application provides a computer-readable storage medium. Instructions are stored in the computer-readable storage medium, and the instructions instruct the management node to execute the method described in the first aspect or any implementation manner of the first aspect.
  • the present application provides a computer program product including instructions.
  • When the computer program product runs on the management node, the management node is caused to execute the method described in the first aspect or any implementation manner of the first aspect.
  • FIG. 1 is a schematic diagram of a distributed system architecture provided by an embodiment of the present application
  • FIG. 2 is a hardware structural diagram of a server provided by an embodiment of the present application.
  • FIG. 3A is a framework diagram of software deployed on a server provided in an embodiment of the present application.
  • FIG. 3B is a call relationship diagram of software deployed on a server provided in an embodiment of the present application.
  • Fig. 4 is a flow chart of a method for saving a fault file provided in an embodiment of the present application
  • FIG. 5 is a flowchart of a fault detection provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a fault file storage field provided by an embodiment of the present application.
  • FIG. 7 is a flow chart of a method for saving a fault file provided in an embodiment of the present application.
  • FIG. 8 is a signaling flow chart of a fault file saving method provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a fault file storage device provided in an embodiment of the present application.
  • FIG. 10 is a hardware structural diagram of a management node provided by an embodiment of the present application.
  • The terms "first" and "second" in the embodiments of the present application are used for description purposes only, and cannot be interpreted as indicating or implying relative importance or implicitly indicating the quantity of the indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of these features.
  • Artificial intelligence (AI), also called machine intelligence, specifically refers to the intelligence shown by machines (such as computers) by imitating human thinking and behavior (such as learning, reasoning, thinking, and planning).
  • Artificial intelligence usually imitates human thinking and behavior based on knowledge to achieve specific goals or complete specific tasks. Among them, knowledge can come from experience or data.
  • Deep learning (DL), as a branch of AI, specifically uses a deep neural network model (also called a deep learning model; for ease of description, sometimes simply referred to as a model) to process massive data, so as to learn knowledge from the massive data and analyze data based on this knowledge.
  • the trained deep learning model can be applied to scenarios such as perception and decision-making in the field of AI, such as image recognition, speech recognition, natural language translation, computer games and other scenarios.
  • the number of parameters of a deep learning model is usually very large, reaching hundreds of billions or even trillions.
  • the parameters of large models in the field of natural language processing (NLP) can reach hundreds of billions.
  • Such large-scale deep learning models usually require huge datasets for training.
  • a typical training method is distributed training.
  • the distributed system 10 includes a management node 100 and multiple training nodes 200 .
  • the management node 100 is also called a master node (master node)
  • the training node 200 is also called a worker node (worker node).
  • the management node 100 is specifically configured to maintain meta-information, and perform task scheduling according to the meta-information.
  • the training node 200 is specifically used to execute the task scheduled by the management node 100 .
  • the meta information includes one or more of the number of training nodes 200 in the distributed system 10 and the load of each training node 200 .
  • the management node 100 may distribute the training task to multiple training nodes 200 based on the above meta-information, and the multiple training nodes 200 train the model in parallel.
  • any node in the distributed system 10 can be both a management node 100 and a training node 200, that is, any node in the distributed system 10 can have both a management function and a training function.
  • Parallel strategies for distributed training can include data parallel training strategies and model parallel training strategies.
  • the data parallel training strategy refers to dividing the data set into multiple parts and distributing them to different training nodes 200.
  • Each training node 200 trains a model with the same structure based on different data in the data set, and parameters are passed between the multiple training nodes 200, which solves the problem that the data set is too large to be trained efficiently on a single machine.
  • The model parallel training strategy refers to dividing a model (such as a deep learning model) into multiple parts and deploying the parts on different training nodes 200, and the different training nodes 200 use the data set to train the parts of the model in parallel, which solves the problem that large-scale deep learning models are difficult to run on a single training node 200 due to the limitation of video memory.
  • the distributed system 10 can use an iterative method to update the parameters of the model to implement model training.
  • Each iteration updates the parameters of the model; an iteration can also be called a training step (train step), or step for short.
  • the sample size used for each iteration is called the batch size.
  • a process in which all sample data in a data set (for example, a training set) is used once is called an epoch.
  • the training set includes 1000 sample data
  • the batch size can be 100, so each iteration uses 100 sample data, and 10 iterations over the 1000 sample data in the training set complete the training of one epoch.
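  • As a quick check of this arithmetic, in Python:

```python
dataset_size = 1000  # sample data items in the training set
batch_size = 100     # samples consumed per iteration (per step)
steps_per_epoch = dataset_size // batch_size
assert steps_per_epoch == 10  # 10 iterations use every sample once, i.e. one epoch
```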
  • multiple training nodes 200 may use a synchronous update mechanism to update model parameters.
  • the synchronous update mechanism refers to accumulating the gradients obtained by each training node 200 to calculate the mean value, and updating the parameters of the model based on the mean value.
  • the synchronous update mechanism can ensure that the loss (loss) decreases relatively stably and avoids large jitter.
  • When any training node 200 fails during distributed training, the entire distributed training task will be interrupted, and as the number of training nodes 200 increases, the probability of interruption becomes higher and higher.
  • related technologies provide a mechanism for regularly backing up checkpoint files so that when a training task fails, the training task can be restored based on the checkpoint file.
  • Specifically, the management node 100 can reschedule the training task according to the last saved checkpoint file to recover from the failure.
  • If the backup period is short, the management node 100 needs to occupy a large amount of memory to save a large number of parameters, which affects performance.
  • Therefore, a longer backup period is usually set.
  • However, when the backup period is long, the training results of the iteration round in which the fault occurs are lost, and the training nodes 200 need to repeat training based on a large amount of sample data, which affects training efficiency.
  • an embodiment of the present application provides a method for saving a fault file.
  • the method is applied to a distributed system 10 as shown in FIG. 1 .
  • multiple training nodes 200 are used for cooperatively executing training tasks.
  • the management node 100 can obtain a real-time signal from at least one training node 200 among the plurality of training nodes 200, where the real-time signal is used to characterize the state of the at least one training node 200; the management node 100 can perform fault detection according to the real-time signal, and when the management node 100 detects a fault, it saves the fault file.
  • the management node 100 performs real-time fault detection based on the real-time signal obtained from the training node 200. After detecting a fault, it saves the training results of the iterative rounds of the training task when the fault occurs.
  • In this way, when the training task is rescheduled, the new training node 200 can continue training based on the training results of the iteration round in which the fault occurred, without repeating training based on a large amount of sample data, which improves training efficiency.
  • the fault file saving method of the embodiment of the present application can be applied to various distributed training scenarios.
  • For example, the fault file saving method can be used in the scenario of distributed training of an image recognition model. When the distributed training task of the image recognition model is interrupted, for example because a training node 200 fails, the management node 100 can save a fault file.
  • The fault file includes the training results of the iteration round in which the fault occurred, and the management node 100 can reschedule the training task based on the fault file including these training results, so as to continue training from the iteration round in which the fault occurred, without repeating training based on a large amount of sample data, which improves training efficiency.
  • the method for saving fault files can also be used in the scenario of distributed training of speech recognition models.
  • Similarly, when the distributed training task of the speech recognition model is interrupted, the management node 100 saves, in the fault file, the training results of the iteration round at the time the fault occurs, and resumes the training task based on those training results, so that training continues from the iteration round in which the fault occurred, improving training efficiency.
  • the fault file saving method in the embodiment of the present application can be packaged as a functional component, and the functional component can be integrated in a distributed deep learning framework for use by users.
  • the method for saving a fault file in the embodiment of the present application may also be packaged as an independent application for use by users.
  • the above functional components or applications may be collectively referred to as a device for storing fault files.
  • the fault file saving method is packaged as a functional component for illustration below.
  • FIG. 1 exemplifies the system architecture of the distributed system 10.
  • the distributed system 10 can be constructed first. The following describes how to configure the server as a management node 100 and a training node 200 by combining the hardware structure diagram of the server, the framework diagram of the software deployed on the server, and the call relationship diagram, so as to construct the distributed system 10 .
  • a user may purchase or lease a server, which may be a cloud server or a physical server.
  • the server 20 includes a host (host) 22 and at least one device (device) 24 .
  • host 22 is connected with at least one device 24.
  • the host 22 includes a processor and a memory
  • the processor may be a central processing unit (central processing unit, CPU)
  • the memory may be a dual in-line memory module (Dual In-line Memory Module, DIMM).
  • the DIMM may specifically be a double data rate (double data rate, DDR) type, for example, the memory may be a DDR4 DIMM.
  • host 22 includes 4 CPUs and 4 DDR4 DIMM groups, each CPU is connected to 1 DDR4 DIMM group, and each DDR4 DIMM group includes 8 DDR4 DIMMs. Multiple CPUs of host 22 can be connected to form a hydra mesh.
  • the host 22 also includes interfaces, such as one or more of a Serial Advanced Technology Attachment (SATA) interface, a Non-Volatile Memory Express (NVMe) interface, and a Gigabit Ethernet (GE) interface.
  • Device 24 includes a processor, which is typically an accelerator card.
  • the processor included in device24 may be a neural network processor (Neural-network Processing Unit, NPU).
  • Figure 2 uses device 24 including 8 NPUs as an example. In other possible implementations of the embodiment of the present application, device 24 may also include more accelerator cards.
  • the user can install firmware 302 and driver 304 on the server 20 .
  • the firmware 302 is usually a program written in the read-only memory, which can directly control the hardware, interact with the hardware, and check whether there is any error in the hardware.
  • the driver 304 is specifically a small piece of code added to the operating system, which contains information about the hardware. When a computer program requests to interact with a piece of hardware, a driver 304 may act as a converter of instructions between the hardware and the program that uses it.
  • the firmware 302 can control the device 24, interact with the device 24, and check the device 24 for any errors, and the driver 304 can act as a converter of instructions between the device 24 and the programs that use it.
  • the heterogeneous computing framework 306 can be the Compute Architecture for Neural Networks (CANN), a heterogeneous computing framework for neural networks.
  • CANN can support users in quickly building AI applications by providing multi-level programming interfaces.
  • the AI application refers to an application constructed based on a trained model.
  • the heterogeneous computing framework 306 is an optional framework, and the fault file saving method of the embodiment of the present application can also be executed without installing the above framework on the server 20.
  • the function of the above framework is to improve the efficiency of building AI applications.
  • the user may then install the deep learning framework 308 on the server 20 .
  • the deep learning framework 308 is used to construct a large-scale computational graph (computational graph) by compiling the method for realizing the model, and automatically implement gradient calculation in the computational graph.
  • the calculation graph is also called the graph compilation result.
  • the result of graph compilation can be executed to perform calculations related to distributed training.
  • deep learning frameworks can be divided into frameworks that support static compilation and frameworks that support dynamic compilation.
  • the frameworks that support static compilation include one or more of the MindSpore framework and the Tensorflow framework
  • the frameworks that support dynamic compilation include the PyTorch framework.
  • the deep learning framework 308 may also not be installed on the server 20; in this case, the server 20 may use a programming language such as Python to implement the model from scratch.
  • The installation package of the fault file storage device 310 can be packaged in the installation package of the deep learning framework 308, so that the fault file storage device 310 is installed on the server 20 together with the deep learning framework 308.
  • the user can also install the scheduling device 312 on the server 20 .
  • the scheduling device 312 is used to schedule training tasks to realize distributed training.
  • the scheduling device 312 may be a distributed scheduling component or an AI development platform.
  • the distributed scheduling component can be a MindX DL component, or other third-party distributed scheduling components
  • the AI development platform can be a development platform such as Model Arts.
  • multiple servers 20 can determine one server 20 as the management node 100 through voting or election, and the remaining servers 20 can be used as the training nodes 200 . It should be noted that when one or more training nodes 200 fail, the management node 100 can reschedule the training task to a new training node 200; when the management node 100 fails, the remaining servers 20 can re-vote or elect to determine the new management node 100.
  • multiple servers 20 may not perform voting or election, for example, a server 20 may serve as both the management node 100 and the training node 200 .
  • When the server 20 serves as the management node 100, it can perform fault detection on other servers 20, thereby realizing management.
  • When the server 20 serves as the training node 200, it can also be detected by other servers 20 and accept the management of other servers 20.
  • the fault file storage device 310 includes a fault detection (fault detect) component 3102, a control engine (control engine) 3104 and a repair management (restore manager) component 3106.
  • Scheduling device 312 (such as MindX DL component or a third-party distributed scheduling component) can call the above-mentioned fault detection component 3102, control engine 3104 and repair management component 3106 to execute the fault file saving method of the embodiment of the present application.
  • the fault detection component 3102 can call an Ascend Computing Language (ACL) functional component in the heterogeneous computing framework 306 to find faults.
  • ACL functional components include ACL operator compilation and execution (aclopCompileAndExecute) and a collective communication library, such as Huawei Collective Communication Library (Huawei Collective Communication Library, HCCL).
  • aclopCompileAndExecute is used to compile and execute specified operators
  • the collective communication library can provide data-parallel or model-parallel high-performance collective communication solutions for multi-machine and multi-card training.
  • the control engine 3104 can call the ACL function component to specify the saving policy and restoring policy.
  • the repair management component 3106 is used to call the deep learning framework 308 and the ACL function component to restore the fault of the training task.
  • the training task is a deep learning training task, such as a deep learning training task in scenarios such as computer vision and NLP.
  • the method comprises the steps:
  • S402 The management node 100 acquires a real-time signal from at least one training node 200 among the plurality of training nodes 200.
  • the training node 200 may generate a signal when performing a training task.
  • When the management node 100 triggers signal collection, the signal collected from the training node 200 in real time is the real-time signal.
  • the management node 100 may set a collection time window, and the management node 100 obtains a signal of the training node 200 within the collection time window from the training node 200, thereby obtaining a real-time signal.
  • the window length of the acquisition time window may be set according to empirical values, for example, the window length of the acquisition time window may be set to 5 seconds. It should be noted that there is often a time delay from when the management node 100 triggers signal acquisition to when the management node 100 starts to acquire signals within the acquisition time window.
  • For example, the management node 100 may collect the current round of signals at 9:00:25; even if this signal is delayed relative to the time at which signal collection was triggered, the delay is less than a set value and can be ignored, so this signal is also called a real-time signal.
  • the real-time signal is used to characterize the state of at least one training node 200 .
  • This status may be the health status of the training node 200 .
  • the real-time signals may include the aforementioned cluster communication signals.
  • the training node 200 executes the training task, it can perform compiling operations (such as compiling to obtain a calculation graph, also called graph compiling result), running operations (such as running the compiling result of the graph), and generate compiling signals and running signals.
  • the training node 200 can execute aclopCompileAndExecute to generate compiling signals and running signals.
  • When running the graph compilation result, the training node 200 may also generate running manager signals, such as a runtime error. Therefore, the real-time signals can also include one or more of a compilation signal, a running signal, and a running manager signal.
  • the management node 100 may provide a system management library (system management library, SMI) command tool.
  • For example, the management node 100 can provide the npu-smi command tool.
  • the management node 100 may execute a query command in the npu-smi command tool, such as execute an ascend-dmi command, to collect real-time signals of at least one training node 200 .
  • S404 The management node 100 performs fault detection according to the real-time signal.
  • the real-time signal collected by the management node 100 from the training node 200 may represent the status of the training node 200 .
  • the training node 200 includes a host 22 and at least one device 24. Different types of real-time signals can reflect the status of different hardware in the training node 200 .
  • the host 22 can compile the method of implementing the model into a calculation graph through the deep learning framework, and sink the calculation graph to the device 24.
  • the device 24 can call the ACL functional components to execute the calculation graph to perform related calculations; for example, when training an image recognition model, calculations such as convolution and pooling on images can be performed.
  • the device 24 can calculate the gradient and return the gradient to the host 22, so that the host 22 aggregates and communicates the gradients returned by multiple devices 24, such as performing a reduction (reduce) operation, thereby obtaining the average gradient.
  • the compilation signal and the aggregation communication signal can reflect the status of the host 22, and the running signal and the running manager signal can reflect the status of the device 24.
  • The management node 100 can determine whether the host 22 is faulty based on at least one of the compilation signal or the aggregated communication signal, and determine whether the device 24 is faulty based on at least one of the running signal or the running manager signal, thereby realizing fault detection. For example, the management node 100 can determine that the device 24 has failed according to a running manager signal such as a runtime error.
  • The management node 100 may also detect whether the network connected to the training node 200 is faulty. Specifically, the management node 100 can periodically send a heartbeat signal to the training node 200 and receive the training node 200's response to the heartbeat signal. When the management node 100 does not receive a response from the training node 200 for N consecutive cycles, it indicates that the network between the training node 200 and the management node 100 has failed, or that the training node 200 has failed. When the management node 100 rules out a failure of the training node 200 itself based on information such as logs, it can determine that the network between the training node 200 and the management node 100 has failed.
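  • A minimal sketch of the heartbeat rule described above (N consecutive unanswered heartbeats), assuming a hypothetical send_heartbeat callable that returns True when the training node answers in time:

```python
import time

def monitor_node(send_heartbeat, n_missed=3, period_s=1.0):
    """Flag a suspected node/network fault after N consecutive unanswered heartbeats."""
    missed = 0
    while True:
        if send_heartbeat():
            missed = 0
        else:
            missed += 1
            if missed >= n_missed:
                return "fault_suspected"  # node failure or network failure; check logs to tell apart
        time.sleep(period_s)
```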
  • the management node 100 itself may also fail.
  • The training node 200 can perform fault detection on the management node 100. Specifically, the training node 200 can detect a failure of the management node 100 based on the heartbeat signal of the management node 100. When multiple training nodes 200 do not receive the heartbeat signal from the management node 100 for N consecutive cycles, the confidence that the management node 100 has failed is relatively high; when the confidence is greater than a confidence threshold, the training nodes 200 may determine that the management node 100 has failed. The training nodes 200 may then re-determine the management node 100 through a voting or election mechanism.
  • the management node 100 may first perform fault prediction based on a fault early warning algorithm.
  • The fault early warning algorithm can be an algorithm based on prior knowledge or an expert knowledge base, such as the autoregressive integrated moving average (ARIMA) algorithm, the time-series prediction algorithm Prophet, or a time-aware convolutional neural network (time-aware CNN) algorithm.
  • the management node 100 may acquire a real-time signal from at least one training node 200 among the plurality of training nodes 200 according to the prediction result, and perform fault detection according to the real-time signal.
  • the management node 100 may perform fault detection according to the real-time signal at the point in time when the fault will occur.
  • When the prediction result indicates that no fault will occur, the management node 100 can still start real-time fault detection, specifically capturing real-time signals such as cluster communication signals, compilation signals, running signals, and running manager signals, and then performing fault detection based on these real-time signals, so that missed detections caused by the early warning algorithm can be avoided.
  • When the management node 100 detects a fault, S406 may be executed to save the fault file; when no fault is detected, the next round of fault detection continues.
  • S406 The management node 100 saves the fault file.
  • the fault file is a file used to resume the training task.
  • the failure file holds the training results of the iteration round when the failure occurs.
  • The fault file includes the following information: iteration rounds, weights, losses, and hyperparameters. The iteration rounds include the epoch and/or step at which the fault occurred. The hyperparameters can include one or more of the learning rate (LR) and the optimizer. Further, the fault file may also include hidden states. Referring to the schematic diagram of the fault file shown in FIG. 6, the embodiment of the present application can thus retain the training results of the epoch and/or step in which the fault occurred.
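  • As a hedged illustration of these storage fields, the sketch below collects them into a checkpoint-style dictionary and serializes it with pickle; the field names and the file format are assumptions for illustration only, not the exact ckpt layout of any framework.

```python
import pickle

def build_fault_file(epoch, step, weights, loss, lr, optimizer, hidden_states=None):
    """Collect the storage fields of the fault file into one dictionary."""
    return {
        "epoch": epoch,            # iteration round (epoch) at the time of the fault
        "step": step,              # iteration round (step) at the time of the fault
        "weights": weights,        # model weights of that round
        "loss": loss,              # loss of that round
        "lr": lr,                  # hyperparameter: learning rate
        "optimizer": optimizer,    # hyperparameter: optimizer name/state
        "hidden_states": hidden_states,
    }

def save_fault_file(fault_file, path="fault.ckpt"):
    with open(path, "wb") as f:
        pickle.dump(fault_file, f)
```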
  • training tasks can be divided into different types based on parallelism strategies.
  • the task type of the training task can be a data parallel training type or a model parallel training type.
  • the data parallel training type is to use different data in the data set to train the same model in parallel
  • the model parallel training type refers to using the same data in the data set to train multiple parts of the model in parallel. Therefore, when saving the fault file, the management node 100 can adopt different storage strategies for training tasks of different task types.
  • the management node 100 may determine the task type of the training task, and then save the fault file according to the storage policy corresponding to the task type.
  • When the task type is the data parallel training type, the storage strategy can be to save the fault file on any non-faulty card among the at least one accelerator card.
  • When the task type is the model parallel training type, the storage strategy can be to save the fault files on multiple non-faulty cards among the at least one accelerator card, for example on all non-faulty cards.
  • the management node 100 may further determine whether the target accelerator card used for aggregation communication among the multiple training nodes 200 is a non-faulty card.
  • the target accelerator card may be an accelerator card with a rank_id of 0 among the multiple training nodes 200 .
  • When the target accelerator card is a non-faulty card, the management node 100 can save the fault file on the target accelerator card, without having to wait until the multiple training nodes 200 perform data exchange to ensure data consistency, which shortens the saving time.
  • When the target accelerator card is a faulty card, the management node 100 can save the fault file on an accelerator card of the node with the largest network bandwidth (also called the nearest node) among the non-faulty nodes of the multiple training nodes 200, thereby improving the saving rate of the fault file.
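  • The sketch below illustrates, under assumed data structures for cards and nodes, the save-target selection logic described above for the two task types; it is an interpretation for illustration, not the patented implementation.

```python
def choose_save_targets(task_type, target_card, cards, nodes):
    """Pick the accelerator card(s) on which to save the fault file.

    cards: dicts like {"id": "card-0", "faulty": False}
    nodes: dicts like {"faulty": False, "bandwidth": 100.0, "cards": [...]}
    """
    non_faulty = [c for c in cards if not c["faulty"]]
    if task_type == "model_parallel":
        return non_faulty                     # e.g. all non-faulty cards
    # Data parallel: one copy suffices; prefer the card used for aggregated communication.
    if not target_card["faulty"]:
        return [target_card]                  # no extra data exchange needed
    best = max((n for n in nodes if not n["faulty"]), key=lambda n: n["bandwidth"])
    return best["cards"][:1]                  # fall back to the highest-bandwidth non-faulty node
```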
  • the fault file may include a checkpoint file, that is, a ckpt file.
  • field values of storage fields such as iteration rounds, weights, losses, and hyperparameters can be written into a ckpt file, and then the fault file can be saved by saving the ckpt file.
  • the field value can form a key-value pair with the field name, and the key-value pair can be written into a ckpt file for saving.
  • The deep learning frameworks 308 adopted by different training tasks may be different. For example, when the deep learning framework adopted by a training task is a framework that supports static compilation, the graph compilation result compiled based on this framework can be reused, and the management node 100 can also save the graph compilation result.
  • the fault file may also include graph compilation results.
  • the management node 100 may determine the deep learning framework 308 adopted by the training task.
  • the deep learning framework 308 is a framework that supports static compilation
  • the management node 100 may write the graph compilation result into a fault file and save the fault file.
  • frameworks that support static compilation include but are not limited to MindSpore framework and Tensorflow framework.
  • When the deep learning framework 308 is a framework that supports dynamic compilation, the management node 100 may not save the graph compilation result, because the graph compilation result cannot be reused.
  • Frameworks that support dynamic compilation include but are not limited to the PyTorch framework.
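  • A minimal sketch of this framework check, assuming the fault file is represented as a dictionary as in the earlier sketch; the framework-name strings are illustrative assumptions.

```python
STATIC_COMPILE_FRAMEWORKS = {"mindspore", "tensorflow"}

def maybe_attach_graph(fault_file, framework_name, graph_compilation_result):
    """Attach the graph compilation result only for statically compiled frameworks."""
    if framework_name.lower() in STATIC_COMPILE_FRAMEWORKS:
        fault_file["graph"] = graph_compilation_result  # reusable when the task is resumed
    return fault_file
```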
  • When the management node 100 detects that a fault occurs in a training task, it first determines whether the task type of the training task is the data parallel training type or the model parallel training type. Then, for different task types, different storage strategies are adopted, as follows:
  • When the task type is the model parallel training type, the management node 100 may save ckpt files on multiple non-faulty cards. For example, the management node 100 may save the ckpt files on the non-faulty cards of the faulty node and of all non-faulty nodes. The management node 100 may also save a strategy file, so as to restore the training task based on the strategy file.
  • When the task type is the data parallel training type, the management node 100 can determine whether the target accelerator card is a faulty card by capturing a real-time signal from the target accelerator card.
  • When the target accelerator card is a non-faulty card, the management node 100 can determine to save the fault file on the target accelerator card.
  • When the target accelerator card is a faulty card, the management node 100 can determine the node with the largest network bandwidth among the non-faulty nodes, and then determine to save the fault file on an accelerator card of that node.
  • the management node 100 can determine whether the deep learning framework 308 is a framework that supports static compilation.
  • If so, the management node 100 can also save the graph compilation result; in this case, the fault file also includes the graph compilation result.
  • If not, the management node 100 saves the ckpt file without the graph compilation result.
  • the management node 100 may also back up the faulty file, so as to ensure the safety of the faulty file. For example, the management node 100 can back up the ckpt file to ensure reliability and avoid data loss. Specifically, the management node 100 may save the ckpt file in a high-performance unified buffer (High-performance Unified Buffer, HUB), thereby realizing reliable backup. Similarly, the management node 100 may also store graph compilation results or policy files in a high-performance unified cache, so as to realize reliability backup.
  • The management node 100 may start the recovery process of the training task based on the fault file after the fault file is saved. The management node 100 may reschedule the training task to a new training node 200, where the new training node does not include the failed training node 200. Then the management node 100 loads the fault file, for example loading the training results of the iteration round in which the fault occurred, such as the iteration round, weights, losses, and hyperparameters, so that the training node 200 can continue training from the iteration round in which the fault occurred, without repeating training based on a large amount of sample data.
  • the management node 100 may determine a recovery policy corresponding to the task type according to the task type of the training task, and then recover the training task according to the recovery policy according to the fault file.
  • For example, when the task type is the data parallel training type, the recovery strategy can be to recover based on a single ckpt file; when the task type is the model parallel training type, the recovery strategy can be to recover based on multiple ckpt files (for example, the ckpt files on all non-faulty cards).
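  • An illustrative sketch of this task-type-dependent selection, assuming the saved ckpt files of the non-faulty cards sit in one directory; the directory layout is an assumption for illustration.

```python
import glob

def select_ckpt_files(task_type, ckpt_dir="ckpts"):
    """Choose which saved ckpt files to load when resuming the training task."""
    files = sorted(glob.glob(f"{ckpt_dir}/*.ckpt"))
    if task_type == "data_parallel":
        return files[:1]   # a single ckpt file is enough
    return files           # model parallel: load the ckpt files of all non-faulty cards
```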
  • When the deep learning framework 308 used by the training task is a framework that supports dynamic compilation, such as the PyTorch framework, the management node 100 can restore the training task based on the ckpt file.
  • When the deep learning framework 308 used by the training task is a framework that supports static compilation, such as the TensorFlow framework or the MindSpore framework, the management node 100 can also perform fault recovery based on the graph compilation result.
  • the management node 100 can obtain the ckpt file, for example, the management node 100 can obtain the ckpt file from the HUB, and then load the ckpt file, so as to restore the data and model for continuous training based on the field values of the corresponding fields in the ckpt file.
  • the management node 100 can restore the continued training data based on the epoch and step in the ckpt file, and restore the model based on the weights, learning rate LR, and optimizer in the ckpt file, so as to realize the restoration training task.
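  • The sketch below illustrates this recovery step under assumptions; rebuild_model and data_loader.skip_to are hypothetical helpers standing in for framework-specific restore calls, and the ckpt fields follow the earlier illustrative dictionary.

```python
import pickle

def resume_training(path, rebuild_model, data_loader):
    """Reload the fault file and resume training from the failed iteration round."""
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    # Restore the model from the saved weights, optimizer and learning rate.
    model, optimizer = rebuild_model(ckpt["weights"], ckpt["optimizer"], ckpt["lr"])
    # Restore the data pipeline so training continues from the failed epoch/step.
    data_loader.skip_to(epoch=ckpt["epoch"], step=ckpt["step"])
    return model, optimizer, data_loader
```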
  • When the management node 100 resumes the training task, it may also load the strategy file and resume the training task according to the strategy file.
  • S408 is an optional step in the embodiment of the present application, and S408 may not be executed in the method for saving a fault file in the embodiment of the present application.
  • In that case, the management node 100 may directly use the weights and other information in the fault file for model inference.
  • the embodiment of the present application provides a method for saving a fault file.
  • the management node 100 performs real-time fault detection based on the real-time signal obtained from the training node 200. After detecting a fault, it saves the training results of the iterative rounds of the training task when the fault occurs.
  • In this way, when the training task is rescheduled, the new training node 200 can continue training based on the training results of the iteration round in which the fault occurred, without repeating training based on a large amount of sample data, which improves training efficiency.
  • the method includes:
  • S802 The user triggers an operation of creating a training task.
  • S804 The MindX DL component in the management node 100 invokes the driver and the interface to perform service plane fault detection. When a fault is detected, execute S812.
  • S806 The fault detect component in the management node 100 performs prediction through a fault early warning algorithm, and obtains a prediction result.
  • When the prediction result is that a fault will occur within the preset time period, execute S807; when the prediction result is that no fault will occur within the preset time period, execute S808.
  • S807 The fault detect component in the management node 100 captures one or more of the cluster communication signal, compilation signal, running signal, and running manager signal at the predicted time point, performs fault detection according to the captured signals, and obtains a fault detection result.
  • S808 The fault detect component in the management node 100 captures one or more of cluster communication signals, compilation signals, running signals, and running manager signals, performs fault detection according to the captured signals, and obtains a fault detection result.
  • When the fault detection result is that the training task is faulty, execute S810.
  • S812 The MindX DL component in the management node 100 sends a first notification message to the control engine.
  • the first notification message is used to notify that a failure occurs in the training task. Further, the first notification message may also carry the task type of the training task, so that the control engine can perceive the task type of the training task.
  • S814 The control engine in the management node 100 sends a saving policy to the restore manager according to the task type of the training task.
  • S816 The restore manager in the management node 100 saves the fault file according to the saving policy.
  • the restore manager can shield the difference of the underlying deep learning framework 308, and save the fault file according to the saving strategy.
  • the restore manager can obtain the epoch and step when the training task fails, and obtain the weight and hyperparameters of the model, and write the above epoch, step, weight, hyperparameter and other information into the ckpt file for storage.
  • the restore manager can also save the graph compilation result according to the saving strategy, so that the restore manager can combine the graph compilation result to perform fault recovery later.
  • S818 The restore manager in the management node 100 uses a distributed storage method to write the fault file into the HUB, so that reliable backup of the fault file is realized and the probability of losing the fault file is reduced.
  • S818 may not be executed when executing the fault file saving method of this embodiment.
  • the restore manager can back up the faulty files locally, or back them up in other ways.
  • S820 The restore manager in the management node 100 returns a backup completion notification to the control engine.
  • the restore manager returns a backup completion notification so that the control engine can start subsequent processes. Further, the restore manager can also instruct the MindX DL component in the management node 100 to reschedule the training task. In some embodiments, the restore manager may not perform the above steps.
  • the MindX DL component can reschedule training tasks based on instructions from the restore manager.
  • the MindX DL component can also determine whether the backup of the faulty file is completed by polling, and when it is determined that the backup of the faulty file is complete, reschedule the training task.
  • S824 The MindX DL component in the management node 100 sends a second notification message to the control engine.
  • the second notification message is used to notify that the training task is rescheduled by the MindX DL component. Further, the second notification message may also carry the task type of the training task.
  • this embodiment uses the MindX DL component as an example for illustration.
  • the management node 100 may also install other types of distributed scheduling components
  • the training task may also be scheduled by other types of distributed scheduling components.
  • S826 The control engine in the management node 100 instructs the restore manager to perform fault recovery.
  • The control engine can send a restore command to the restore manager, so that the restore manager executes the subsequent process and resumes the training task according to the restore command.
  • S828 The restore manager in the management node 100 obtains the fault file from the HUB.
  • When the task type is the data parallel training type, the recovery strategy can be to trigger recovery based on a single ckpt file; when the task type is the model parallel training type, the recovery strategy can be to trigger recovery based on multiple ckpt files (for example, the stored ckpt files).
  • the restore manager in the management node 100 can obtain the ckpt file on demand according to the task type.
  • further, when the task type is the data parallel training type and the deep learning framework 308 used by the training task supports static compilation, the restore manager in the management node 100 can also obtain the graph compilation result from the HUB.
  • when the fault file is backed up locally or in another way, the restore manager can also obtain the fault file from the local storage or from that other location.
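  • The sketch below illustrates the on-demand retrieval of S828 by task type; the task-type names, the HUB client and the key layout are illustrative assumptions.

```python
def fetch_fault_files(hub_client, task_type, card_ids, static_graph=False):
    """Fetch the ckpt file(s), and optionally the graph compilation result, needed for recovery."""
    if task_type == "data_parallel":
        files = [hub_client.get("fault.ckpt")]           # a single ckpt file is enough
    else:  # model parallel
        files = [hub_client.get(f"fault_rank{c}.ckpt")   # one ckpt per saved non-faulty card
                 for c in card_ids]
    graph = hub_client.get("graph_compile_result") if static_graph else None
    return files, graph
```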
  • S830 The restore manager in the management node 100 restores the training task according to the fault file.
  • specifically, the fault file includes a ckpt file in which information such as the epoch, step, weights, loss, optimizer and LR is written. The restore manager in the management node 100 can load the ckpt file, read the field values of the corresponding fields in the ckpt file, and, based on those field values, restore the data and the model for continued training, thereby resuming the training task.
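  • As a rough illustration of this resume step, the sketch below reloads the saved fields and restores the model and optimizer; it assumes the PyTorch-style ckpt layout used in the earlier sketch, and the caller's training loop is expected to fast-forward its data pipeline to the returned epoch/step.

```python
import torch


def resume_training(model, optimizer, ckpt_path="fault.ckpt"):
    """Reload the fault-time ckpt so training can continue from the saved epoch/step."""
    ckpt = torch.load(ckpt_path)
    model.load_state_dict(ckpt["weights"])        # restore model weights
    optimizer.load_state_dict(ckpt["optimizer"])  # restore optimizer state
    for group in optimizer.param_groups:
        group["lr"] = ckpt["LR"]                  # restore the saved learning rate
    return ckpt["epoch"], ckpt["step"], ckpt["loss"]
```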
  • in this method, the management node 100 can implement fault detection at the training-task level and save the fault file after detecting that the training task has failed. The fault file includes the epoch and/or step at which the training task failed, and the management node 100 restores the training task based on that epoch and/or step. This avoids losing the iteration result of the step or epoch in which the fault occurred, so the distributed system 10 does not need to repeat training based on a large amount of sample data, which improves training efficiency.
  • moreover, this method decouples the deep learning framework 308 from the fault file saving device 310: for training tasks that use different deep learning frameworks 308, the fault file saving device 310 can be used to perform fault recovery of the training task, which provides better compatibility.
  • in addition, this method also decouples the saving and recovery mechanisms of different task types, so users do not need to pay attention to the task type; the interface is friendly and the user's cost of use is reduced.
  • the device 310 may be a software device, and the software device may be deployed in the management node 100.
  • the device 310 includes:
  • a communication module 902, configured to acquire a real-time signal from at least one training node 200 of the plurality of training nodes 200, the real-time signal being used to characterize the state of the at least one training node 200;
  • a detection module 904, configured to perform fault detection according to the real-time signal;
  • a saving module 906, configured to save a fault file after a fault is detected, the fault file being used to restore the training task.
  • FIG. 9 and FIG. 3B divide the fault file saving device 310 from different perspectives: the communication module 902 and the detection module 904 may correspond to the fault detection component 3102 in FIG. 3B, and the saving module 906 may correspond to the repair management component 3106 in FIG. 3B.
  • the real-time signal includes one or more of the following:
  • a cluster communication signal, a compile signal, a run signal, and a run manager signal.
  • the device 310 further includes:
  • a prediction module 908 configured to perform fault prediction on the training task through a fault early warning algorithm, and obtain a prediction result
  • the detection module 904 is specifically used for:
  • acquiring, according to the prediction result, a real-time signal from at least one training node 200 among the plurality of training nodes 200.
  • the prediction module 908 and the detection module 904 may jointly correspond to the fault detection component 3102 in FIG. 3B, so as to implement fault detection on the training node 200 (including host 22 and device 24).
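  • For illustration, the sketch below shows how prediction-gated detection might be organized: the fault early-warning prediction decides when real-time signals are pulled, and the signals are then checked for anomalies. The predictor, the signal source and the anomaly check are placeholders, not interfaces defined here.

```python
import time


def detection_round(predict_fault_time, collect_signals, signals_indicate_fault, window_s=5.0):
    """One round of fault detection driven by a fault early-warning prediction."""
    predicted = predict_fault_time()                   # placeholder: timestamp or None
    if predicted is not None:
        time.sleep(max(0.0, predicted - time.time()))  # wait until the predicted fault time
    signals = collect_signals(window_s)                # real-time signals in the collection window
    return signals_indicate_fault(signals)             # True -> trigger fault file saving
```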
  • the saving module 906 is specifically used for:
  • determining the task type of the training task;
  • saving the fault file according to the saving policy corresponding to the task type.
  • the saving module 906 may correspond to the control engine 3104 and the repair management component 3106 in FIG. 3B, so as to save the fault file according to the saving policy.
  • each of the plurality of training nodes 200 includes at least one accelerator card. When the task type of the training task is the data parallel training type, the saving policy is to save the fault file on any non-faulty card among the at least one accelerator card; when the task type of the training task is the model parallel training type, the saving policy is to save the fault files on a plurality of non-faulty cards among the at least one accelerator card.
  • when the task type of the training task is the data parallel training type, the saving policy is:
  • when the target accelerator card used for aggregated communication in the plurality of training nodes 200 is a non-faulty card, save the fault file on the target accelerator card;
  • when the target accelerator card used for aggregated communication in the plurality of training nodes 200 is a faulty card, save the fault file on an accelerator card of the node with the largest network bandwidth among the non-faulty nodes in the plurality of training nodes 200.
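  • A compact sketch of this card-selection logic is given below; the node/card records, the rank_id convention and the bandwidth field are illustrative assumptions.

```python
def pick_save_targets(task_type, nodes):
    """Choose which accelerator card(s) the fault file should be saved from.

    `nodes` is assumed to be a list of dicts such as
    {"bandwidth": 100.0, "cards": [{"rank_id": 0, "faulty": False}, ...]}.
    """
    all_cards = [c for n in nodes for c in n["cards"]]
    healthy = [c for c in all_cards if not c["faulty"]]
    if task_type == "model_parallel":
        return healthy                                   # save from every non-faulty card
    # data parallel: prefer the card used for aggregated communication if it is non-faulty
    target = next((c for c in all_cards if c["rank_id"] == 0), None)
    if target is not None and not target["faulty"]:
        return [target]
    # otherwise pick a non-faulty card on the non-faulty node with the largest network bandwidth
    healthy_nodes = [n for n in nodes if any(not c["faulty"] for c in n["cards"])]
    best = max(healthy_nodes, key=lambda n: n["bandwidth"])
    return [next(c for c in best["cards"] if not c["faulty"])]
```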
  • the device 310 further includes:
  • the recovery module 909 is configured to reschedule the training task and load the fault file after the fault file is saved.
  • the recovery module 909 may correspond to the repair management component 3106 in FIG. 3B, so as to reschedule the training task to a new training node 200 and load the fault file, thereby restoring the training task.
  • the recovery module 909 is specifically configured to:
  • load the fault file according to the recovery policy corresponding to the task type of the training task.
  • the fault file includes the following information: the iteration round, the weights, the loss and the hyperparameters.
  • the fault file further includes a graph compilation result of the training task.
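  • Purely as an illustration, these fields can be grouped as in the sketch below; the class name and the field types are assumptions, not part of the embodiment.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class FaultFile:
    """Information saved when a fault is detected, used to restore the training task."""
    epoch: int                                    # iteration round: epoch at fault time
    step: int                                     # iteration round: step at fault time
    weights: Dict[str, Any]                       # model weights
    loss: float
    hyperparams: Dict[str, Any] = field(default_factory=dict)   # e.g. LR, optimizer settings
    graph_compile_result: Optional[bytes] = None  # only when the framework supports static compilation
```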
  • the fault file saving device 310 according to this embodiment of the present application may correspond to performing the methods described in the embodiments of the present application, and the above-mentioned and other operations and/or functions of the modules/units of the fault file saving device 310 are respectively intended to implement the corresponding flows of those methods; for the sake of brevity, details are not repeated here.
  • the embodiment of the present application also provides a management node 100 .
  • the management node 100 may be a server, such as a cloud server or a physical server.
  • a cloud server refers to a computing device in a cloud environment.
  • the cloud environment indicates the cluster of central computing equipment owned by the cloud service provider and used to provide computing, storage, and communication resources.
  • the physical server may be an independent server, and the configuration and performance of the physical server are usually exclusively used by the user.
  • the management node 100 may also be a terminal, including but not limited to a desktop computer, a notebook computer or a smart phone.
  • the management node 100 is specifically used to realize the function of the fault file saving device 310 in the embodiment shown in FIG. 9 .
  • FIG. 10 provides a hardware structure diagram of the management node 100. As shown in FIG. 10, the management node 100 includes a bus 1001, a central processing unit 1002, a communication interface 1003, a memory 1004 and a plurality of accelerator cards 1005. The processor 1002, the memory 1004, the communication interface 1003 and the accelerator cards 1005 communicate through the bus 1001.
  • the bus 1001 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like.
  • the bus can be divided into an address bus, a data bus, a control bus and so on. For ease of representation, only one thick line is used in FIG. 10, but this does not mean that there is only one bus or one type of bus.
  • the central processing unit 1002 may be a CPU with an x86 architecture, or a CPU with an Advanced RISC Machine (ARM) architecture, or a CPU with other architectures, which is not limited in this embodiment.
  • the communication interface 1003 is used for communicating with the outside.
  • for example, the communication interface 1003 is used to obtain a real-time signal from at least one training node 200 of the plurality of training nodes 200, and, after a fault is detected, to obtain the fault file and save it, and so on.
  • the memory 1004 may include a volatile memory (volatile memory), such as a random access memory (random access memory, RAM).
  • the memory 1004 can also include a non-volatile memory (non-volatile memory), such as a read-only memory (read-only memory, ROM), a flash memory, a hard disk drive (hard disk drive, HDD) or a solid state drive (solid state drive, SSD).
  • the accelerator card 1005 may include an NPU or a GPU. The above descriptions all take the accelerator card 1005 including an NPU as an example. It should be noted that, when the management node 100 is a node dedicated to performing management functions, the management node 100 may not include the accelerator card 1005 described above.
  • Computer-readable instructions are stored in the memory 1004 , and the processor 1002 executes the computer-readable instructions, so that the management node 100 executes the aforementioned fault file storage method (or realizes the function of the aforementioned fault file storage device 310 ).
  • the software or program code required to execute the functions of each module in FIG. 9 may be stored in at least one memory 1004 in the management node 100 .
  • At least one processor 1002 executes the program code stored in the memory 1004, so that the management node 100 executes the aforementioned method for saving a fault file.
  • the embodiment of the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be any available medium that a computing device can store, or a data storage device such as a data center that includes one or more available media.
  • the available media may be magnetic media (e.g., floppy disk, hard disk, magnetic tape), optical media (e.g., DVD), or semiconductor media (e.g., solid state drive), etc.
  • the computer-readable storage medium includes instructions, and the instructions instruct the management node 100 to execute the foregoing fault file saving method.
  • the embodiment of the present application also provides a computer program product.
  • the computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computing device such as the management node 100, the processes or functions according to the embodiments of the present application are generated in whole or in part.
  • the computer instructions may be stored in one computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, from a website, computing device, or data center to another website, computing device, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave, etc.).

Abstract

The present application provides a fault file saving method, applied to distributed training scenarios in the field of artificial intelligence (AI). A distributed system includes a management node and a plurality of training nodes, and the plurality of training nodes are used to cooperatively execute a training task. The method includes: the management node acquires a real-time signal from at least one of the plurality of training nodes, the real-time signal being used to characterize the state of the at least one training node; the management node performs fault detection according to the real-time signal; and, after a fault is detected, the management node saves a fault file, the fault file being used to restore the training task. By performing fault detection in real time and saving the fault file after a fault is detected, the method preserves the training result of the iteration round in which the fault occurred, avoids restarting the training of that iteration round based on a large amount of sample data, and thus safeguards training efficiency.

Description

一种故障文件保存方法及相关装置
本申请要求于2022年3月1日提交中国专利局、申请号为202210197961.0、发明名称为“一种故障文件保存方法及相关装置”的中国专利申请的优先权,所述专利申请的全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能(artificial intelligence,AI)技术领域,尤其涉及一种故障文件保存方法、装置以及管理节点、分布式系统、计算机可读存储介质、计算机程序产品。
背景技术
随着AI技术的不断发展,越来越多的行业和领域采用AI模型(为了便于描述,有些情况下也简称为模型)实现业务的智能化、自动化。例如,电子商务行业中,越来越多商家采用基于AI模型构建的AI客服代替人工客服,提供售前、售后咨询服务。又例如,社交网络中,平台采用AI模型代替人工审核用户发布的内容,以节省人力成本。
AI模型是指基于AI技术构建的、用于对未知数据进行预测的数学模型。例如,AI模型可以是基于神经网络构建的目标检测模型、图像分类模型。AI模型通常需要通过大量数据进行训练。为了提高AI模型的训练效率,分布式训练方法应运而生。所谓分布式训练方法是将训练任务分散到多个训练节点执行,多个训练节点并行训练模型。其中,训练任务是利用数据集训练模型,获得模型的权重的过程。训练任务的任务类型可以分为数据并行训练类型和模型并行训练类型。数据并行训练类型是指将数据集中的数据分散到多个训练节点进行训练,模型并行训练类型是指将模型的不同部分分散到多个训练节点进行训练。多个训练节点可以采用同步更新机制更新模型的参数。同步更新机制是指将各个训练节点获得的梯度进行累加计算均值,基于该均值更新模型的参数。当个别训练节点、训练算法或者网络出现故障时,整个分布式训练任务就会中断。随着训练节点的增加,中断可能性越来越高,因此,需要提供一种故障文件保存机制,以便基于故障文件恢复训练任务。
目前,业界主要采用定时备份检查点(checkpoint,ckpt)文件的方式,以实现故障文件保存。当故障发生时,基于最近一次保存的checkpoint文件进行故障恢复,该方式会导致丢失故障发生时的迭代轮次的训练结果,训练节点需要基于大量样本数据重新启动该迭代轮次的训练,影响了训练效率。
发明内容
本申请提供了一种故障文件保存方法,该方法通过实时进行故障检测,并在检测到故障后进行故障文件保存,由此可以保留故障发生时的迭代轮次的训练结果,避免基于大量样本数据重新启动该迭代轮次的训练,保障了训练效率。本申请还提供了上述方法对应的装置、管理节点、分布式系统、计算机可读存储介质以及计算机程序产品。
第一方面,本申请提供了一种故障文件保存方法。该方法应用于分布式系统。其中,分布式系统包括管理节点和多个训练节点,多个训练节点用于协同执行训练任务。该方法可以由分布式系统中的管理节点执行。
具体地,管理节点从多个训练节点中的至少一个训练节点获取实时信号,该实时信号用于表征至少一个训练节点的状态,然后管理节点根据实时信号进行故障检测,接着管理节点检测到故障后进行故障文件保存,该故障文件用于恢复训练任务。
在该方法中,管理节点基于从训练节点获取的实时信号,进行实时地故障检测,在检测到故障后,保存训练任务在故障发生时的迭代轮次的训练结果,当训练任务被重调度至新的训练节点时,新的训练节点可以基于故障发生时的迭代轮次的训练结果继续训练,无需基于大量样本数据重复训练,提高了训练效率。而且,该方法也无需频繁地进行故障文件保存,避免了以较高的频率(较短的周期)定时备份故障文件对性能的挑战。
在一些可能的实现方式中,训练节点在执行训练任务时可以产生信号。当管理节点触发信号采集时,从训练节点实时采集的信号即为实时信号。管理节点可以设置采集时间窗,管理节点从训练节点获取该训练节点在采集时间窗内的信号,从而获得实时信号。
考虑到多个训练节点在协同执行训练任务时,可以执行集合通信操作,产生集群通信信号,实时信号可以包括上述集群通信信号。此外,训练节点在执行训练任务时,可以执行编译操作(如编译得到计算图,也称作图编译结果)、运行操作(如运行图编译结果),产生编译信号、运行信号。其中,训练节点执行运行操作时,还可以产生运行管理器信号,因此,实时信号也可以包括编译信号、运行信号和运行管理器信号中的一种或多种。
在该方法中,管理节点通过获取集群通信信号、编译信号、运行信号、运行管理器信号等实时信号中的一种或多种,可以实现实时地故障检测,进而为保存故障发生时的迭代轮次的训练结果奠定基础。
在一些可能的实现方式中,管理节点还可以先通过故障预警算法对训练任务进行故障预测,得到预测结果。然后管理节点根据所述预测结果,从所述多个训练节点中的至少一个训练节点获取实时信号,以根据该实时信号进行故障检测。具体地,预测结果表征在某一时间段将要发生故障时,管理节点可以在预测的时间点(或者该时间点前后一段时间)进行获取实时信号,进而根据该实时信号进行故障检测,如此可以提高效率,并且减少对资源的占用。预测结果表征在某一时间段不发生故障时,管理节点可以在该时间段获取实时信号,进而根据实时信号进行故障检测,如此可以避免故障预警算法的准确度不能达到100%导致的漏报现象发生,提高故障检测的准确度。
在一些可能的实现方式中,管理节点可以确定训练任务的任务类型,然后管理节点可以按照与任务类型对应的保存策略,进行故障文件保存。如此可以个性化的故障文件保存,满足不同应用场景的需求。
在一些可能的实现方式中,多个训练节点中的每个训练节点包括至少一个加速卡。训练任务的任务类型为数据并行训练类型时,由于各加速卡会进行数据交换,以保持数据一致性,因此,保存策略可以为保存至少一个加速卡中任一个非故障卡上的故障文件,如此可以避免重复保存,减少存储资源占用。训练任务的任务类型为模型并行训练类型时,由于各加速卡是对模型的不同部分进行训练,因此,保存策略可以为保存所述至少一个加速卡中多个非故障卡上的故障文件,例如是保存所有非故障卡上的故障文件,如此可以实现尽可能全面地保留故障发生时的迭代轮次的训练结果,避免基于大量样本数据重新启动该迭代轮次的训练。
在一些可能的实现方式中,所述训练任务的任务类型为数据并行训练类型时,管理节点还可以进一步确定多个训练节点中用于聚合通信的目标加速卡是否为非故障卡。相应地,保存策略可以进一步细化为:当多个训练节点中用于聚合通信的目标加速卡为非故障卡时,保存所述目标加速卡上的故障文件,而不必在多个训练节点进行数据交换保证数据一致性后,再保存故障文件,缩短了保存时间;当多个训练节点中用于聚合通信的目标加速卡为故障卡时,保存所述多个训练节点中的非故障节点中网络带宽最大的节点的加速卡上的故障文件,由此提高故障文件的保存速率。
在一些可能的实现方式中,在故障文件保存完毕后,管理节点可以重调度训练任务,例如重调度训练任务至新的训练节点,该新的训练节点是未发生故障的节点,然后加载故障文件,从而使得训练节点可以从故障发生时的迭代轮次继续进行训练,而无需根据大量样本数据进行重复训练。
在一些可能的实现方式中,管理节点可以根据与训练任务的任务类型对应的恢复策略,加载故障文件。具体地,任务类型为数据并行训练类型时,恢复策略可以为基于单个故障文件如ckpt文件进行恢复;任务类型为模型并行训练类型时,恢复策略可以为基于多个故障文件(例如是所有非故障卡上的ckpt文件)进行恢复。如此可以实现根据任务类型选择性地加载故障文件,恢复训练任务,满足不同应用场景的需求。
在一些可能的实现方式中,所述故障文件包括以下信息:迭代轮次、权重、损失和超参数。通过上述迭代轮次、权重、损失以及学习率、优化器等超参数,可以实现从故障发生时的迭代轮次继续训练,满足了业务需求。
在一些可能的实现方式中,所述故障文件还包括所述训练任务的图编译结果。该图编译结果是指根据模型使用的方法对应的计算图,通常可以通过对模型使用的方法进行编译得到。
在该方法中,通过保存图编译结果,可以实现图编译结果的复用,提高训练任务恢复的效率。
在一些可能的实现方式中,管理节点可以识别所述训练任务采用的深度学习框架,获得识别结果。所述识别结果为支持静态编译的框架,例如为TensorFlow框架或者MindSpore框架时,管理节点可以保存该图编译结果,以便后续恢复训练任务时,直接复用该图编译结果,由此提高训练任务恢复的效率。
第二方面,本申请提供一种故障文件保存装置。该故障文件保存装置应用于分布式系统,所述分布式系统包括管理节点和多个训练节点,所述多个训练节点用于协同执行训练任务,所述装置部署于所述管理节点,所述装置包括:
通信模块,用于从所述多个训练节点中的至少一个训练节点获取实时信号,所述实时信号用于表征所述至少一个训练节点的状态;
检测模块,用于根据所述实时信号进行故障检测;
保存模块,用于检测到故障后进行故障文件保存,所述故障文件用于恢复所述训练任务。
在一些可能的实现方式中,所述实时信号包括以下一种或多种:
集群通信信号、编译信号、运行信号、运行管理器信号。
在一些可能的实现方式中,所述装置还包括:
预测模块,用于通过故障预警算法对所述训练任务进行故障预测,得到预测结果;
所述检测模块具体用于:
根据所述预测结果,从所述多个训练节点中的至少一个训练节点获取实时信号。
在一些可能的实现方式中,所述保存模块具体用于:
确定所述训练任务的任务类型;
按照与所述任务类型对应的保存策略,进行故障文件保存。
在一些可能的实现方式中,所述多个训练节点中的每个训练节点包括至少一个加速卡,所述训练任务的任务类型为数据并行训练类型时,所述保存策略为保存所述至少一个加速卡中任一个非故障卡上的故障文件;所述训练任务的任务类型为模型并行训练类型时,所述保存策略为保存所述至少一个加速卡中多个非故障卡上的故障文件。
在一些可能的实现方式中,所述训练任务的任务类型为数据并行训练类型时,所述保存策略为:
当所述多个训练节点中用于聚合通信的目标加速卡为非故障卡时,保存所述目标加速卡上的故障文件;
当所述多个训练节点中用于聚合通信的目标加速卡为故障卡时,保存所述多个训练节点中的非故障节点中网络带宽最大的节点的加速卡上的故障文件。
在一些可能的实现方式中,所述装置还包括:
恢复模块,用于在所述故障文件保存完毕后,重调度所述训练任务,加载所述故障文件。
在一些可能的实现方式中,所述恢复模块具体用于:
根据与所述训练任务的任务类型对应的恢复策略,加载所述故障文件。
在一些可能的实现方式中,所述故障文件包括以下信息:迭代轮次、权重、损失和超参数。
在一些可能的实现方式中,所述故障文件还包括所述训练任务的图编译结果。
第三方面,本申请提供一种管理节点。所述管理节点包括至少一个处理器和至少一个存储器。所述至少一个处理器、所述至少一个存储器进行相互的通信。所述至少一个处理器用于执行所述至少一个存储器中存储的指令,以使得管理节点执行如第一方面或第一方面的任一种实现方式中的方法。
第四方面,本申请提供一种分布式系统。所述分布式系统包括:管理节点和多个训练节点。
所述多个训练节点,用于协同执行训练任务;
所述管理节点,用于从所述多个训练节点中的至少一个训练节点获取实时信号,所述实时信号用于表征所述至少一个训练节点的状态,根据所述实时信号进行故障检测,检测到故障后进行故障文件保存,所述故障文件用于恢复所述训练任务。
第五方面,本申请提供一种计算机可读存储介质。所述计算机可读存储介质中存储有指令,所述指令指示管理节点执行上述第一方面或第一方面的任一种实现方式所述的方法。
第六方面,本申请提供了一种包含指令的计算机程序产品。当其在管理节点上运行时,使得管理节点执行上述第一方面或第一方面的任一种实现方式所述的方法。
本申请在上述各方面提供的实现方式的基础上,还可以进行进一步组合以提供更多实现方式。
附图说明
为了更清楚地说明本申请实施例的技术方法,下面将对实施例中所需使用的附图作以简单地介绍。
图1为本申请实施例提供的一种分布式系统的架构示意图;
图2为本申请实施例提供的一种服务器的硬件结构图;
图3A为本申请实施例提供的一种服务器上部署的软件的框架图;
图3B为本申请实施例提供的一种服务器上部署的软件的调用关系图;
图4为本申请实施例提供的一种故障文件保存方法的流程图;
图5为本申请实施例提供的一种故障检测的流程图;
图6为本申请实施例提供的一种故障文件保存字段的示意图;
图7为本申请实施例提供的一种故障文件保存方法的流程图;
图8为本申请实施例提供的一种故障文件保存方法的信令流程图;
图9为本申请实施例提供的一种故障文件保存装置的结构示意图;
图10为本申请实施例提供的一种管理节点的硬件结构图。
具体实施方式
本申请实施例中的术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个该特征。
首先对本申请实施例中所涉及到的一些技术术语进行介绍。
人工智能(artificial intelligence,AI),也称作机器智能,具体是指由机器(如计算机)通过模仿人类思维和行为(如学习、推理、思考、规划等)所表现出来的智能。人工智能通常是基于知识模仿人类思维和行为,以实现特定目标或者完成特定任务。其中,知识可以来源于经验或数据。
深度学习(deep learning,DL),作为AI的一个分支,具体是使用深层次神经网络模型(也称作深度学习模型,为了便于描述,一些情况下也简称为模型)对海量的数据进行处理,以从海量的数据中学习知识,并基于该知识对数据进行分析。经过训练的深度学习模型可以应用于AI领域的感知、决策等场景,例如应用于图像识别、语音识别、自然语言翻译、计算机博弈等场景。
深度学习模型的参数量较高,通常可以达到千亿、万亿级别。例如,自然语言处理(natural language processing,NLP)领域的大模型的参数量可以达到千亿级别。这种大规模的深度学习模型通常需要庞大的数据集进行训练。一种典型的训练方式为分布式训练。
分布式训练可以由分布式系统执行。参见图1所示的分布式系统的架构图,分布式系统10包括管理节点100和多个训练节点200。其中,管理节点100也称作主节点(master node),训练节点200也称作工作节点(worker node)。管理节点100具体用于维护元信息,根据元信息进行任务调度。训练节点200具体用于执行管理节点100调度的任务。
在分布式训练场景中,元信息包括分布式系统10中训练节点200的数量、每个训练节点200的负载中的一种或多种。管理节点100可以基于上述元信息,将训练任务分散到多个训练节点200,多个训练节点200并行训练模型。
需要说明的是,图1所示的系统架构仅是示例性的,在另一些情况下,也可以不设置专用于执行管理功能的管理节点100。参与分布式训练的训练节点200也可以具有管理功能。例如,分布式系统10中的任一个节点可以既是管理节点100,又是训练节点200,也即分布式系统10中的任一个节点可以同时具备管理功能和训练功能。
分布式训练的并行策略可以包括数据并行训练策略和模型并行训练策略。数据并行训练策略是指将数据集切分为多个部分,并分发到不同训练节点200,每个训练节点200基于数据集中的不同数据对相同结构的模型进行训练,并在多个训练节点200之间传递参数,如此可以解决数据集过大导致无法在单机高效率训练的问题。模型并行训练策略是指将模型(例如深度学习模型)分割为多个部分,并将模型的多个部分部署在不同训练节点200上,不同训练节点200采用数据集并行训练模型的多个部分,如此可以解决显存限制导致大规模的深度学习模型难以在单个训练节点200上运行的问题。
分布式系统10可以采用迭代法更新模型的参数,以实现模型训练。每次迭代(iteration)则更新一次模型的参数,其中,一次迭代也可以称作一次训练步骤train step,简称为step。每次迭代所使用的样本量称为批尺寸(batch size)。在训练过程中,将数据集(例如是训练集)中的样本数据均被使用一次的过程称为一个时期(epoch)。为了便于理解,下面以一个示例进行说明。在该示例中,训练集包括1000个样本数据,batch size可以为100,则每次可以使用100个样本数据进行一次迭代,训练集中的1000个样本数据进行10次迭代即完成一个epoch的训练。
在分布式系统10中,多个训练节点200可以采用同步更新机制更新模型的参数。其中,同步更新机制是指将各个训练节点200获得的梯度进行累加计算均值,基于该均值更新模型的参数。相较于异步更新机制,即各个训练节点200基于各自的梯度更新模型的参数,同步更新机制可以保障损失(loss)的下降比较稳定,避免出现较大的抖动。在采用同步更新机制的情况下,当个别训练节点200、训练算法或者网络出现故障时,整个分布式训练任务就会中断。随着训练节点200的增加,中断可能性越来越高。为此,相关技术提供了一种定时备份checkpoint文件,以便于训练任务发生故障时能够基于该checkpoint文件恢复训练任务的机制。然而,当故障发生时,管理节点100可以根据最近一次保存的checkpoint文件重调度 训练任务,从而进行故障恢复。当备份周期较短时,管理节点100需要占用大量的内存,以保存大量的参数,由此影响了性能。出于性能的考虑,通常会设置较长的备份周期。当备份周期较长时,将会导致丢失故障发生时的迭代轮次的训练结果,训练节点200需要基于大量样本数据重复训练,影响了训练效率。
有鉴于此,本申请实施例提供了一种故障文件保存方法。该方法应用于如图1所示的分布式系统10。其中,多个训练节点200用于协同执行训练任务。在训练时,管理节点100可以从多个训练节点200中的至少一个训练节点200获取实时信号,该实时信号用于表征至少一个训练节点200的状态,管理节点100可以根据该实时信号进行故障检测,当管理节点100检测到故障后进行故障文件保存。
在该方法中,管理节点100基于从训练节点200获取的实时信号,进行实时地故障检测,在检测到故障后,保存训练任务在故障发生时的迭代轮次的训练结果,当训练任务被重调度至新的训练节点200时,新的训练节点200可以基于故障发生时的迭代轮次的训练结果继续训练,无需基于大量样本数据重复训练,提高了训练效率。
本申请实施例的故障文件保存方法可以应用于各种分布式训练的场景。例如,该故障文件保存方法可以用于分布式训练图像识别模型的场景,当训练图像识别模型的分布式训练任务因训练节点200等发生故障而中断时,管理节点100可以保存故障文件,该故障文件包括故障发生时的迭代轮次的训练结果,管理节点100可以基于包括上述训练结果的故障文件重调度训练任务,实现从故障发生时的迭代轮次继续训练,无需根据大量样本数据进行重复训练,提高了训练效率。又例如,该故障文件保存方法也可以用于分布式训练语音识别模型的场景,当训练语音识别模型的分布式训练任务因训练节点200等发生故障而中断时,管理节点100通过保存故障发生时所在迭代轮次的训练结果,并基于该训练结果恢复训练任务,可以实现从故障发生时的迭代轮次继续训练,提高训练效率。
需要说明的是,本申请实施例的故障文件保存方法可以被封装为功能组件,该功能组件可以集成在分布式深度学习框架中,以供用户使用。在一些可能的实现方式中,本申请实施例的故障文件保存方法也可以被封装为独立的应用,以供用户使用。上述功能组件或应用可以统称为故障文件保存装置。为了便于描述,下面以故障文件保存方法被封装为功能组件进行示例说明。
图1对分布式系统10的系统架构进行示例说明,为了实现分布式训练过程中进行故障文件保存,以便基于保存的故障文件恢复训练任务,可以先构建分布式系统10。下面结合服务器的硬件结构图以及服务器上部署的软件的框架图、调用关系图,将服务器配置为管理节点100和训练节点200,从而构建分布式系统10进行说明。
具体地,用户可以购买或租赁服务器,该服务器可以是云服务器,或者是物理服务器。参见图2所示的服务器的硬件结构图,该服务器20包括主机(host)22和至少一个设备(device)24。其中,host 22和至少一个device 24连接。
其中,host 22包括处理器和内存,该处理器可以是中央处理器(central processing unit,CPU),该内存可以是双列直插式存储器模块(Dual In-line Memory Module,DIMM)。其中,DIMM具体可以是双倍数据率(double data rate,DDR)类型,例如内存可以是DDR4 DIMM。在图2的示例中,host 22包括4个CPU和4个DDR4 DIMM组,每个CPU连接1个DDR4 DIMM组,每个DDR4 DIMM组包括8个DDR4 DIMM。host 22的多个CPU可以连接,形成hydra mesh。
可选地,host 22还包括接口,例如是串行高级技术附件(Serial Advanced Technology Attachment,SATA)接口、新一代非易失性内存(Non-Volatile Memory express,NVMe)接口以及千兆以太网(Gigabit Ethernet,GE)接口中的一种或多种。
device 24包括处理器,该处理器通常是加速卡。在图2的示例中,device24包括的处理器可以为神经网络处理器(Neural-network Processing Unit,NPU)。图2以device 24包括8张NPU进行示例说明。在本申请实施例其他可能的实现方式中,device 24也可以包括更多的加速卡。
然后,参见图3A所示的服务器上部署的软件的框架图,用户可以在服务器20上安装固件302和驱动304。其中,固件302通常是写入只读存储器中的程序,可以直接控制硬件、与硬件交互,并检查硬件是否有任何错误。驱动304具体是添加到操作系统中的一小块代码,其中包含有关硬件的信息。当计算机程序请求与某个硬件交互时,驱动304可以充当硬件与使用它的程序之间指令的转换器。例如,固件302可以控制device 24、与device 24交互,并检查device 24是否有任何错误,驱动304可以充当device 24与使用它的程序之间指令的转换器。
进一步地,服务器20的硬件架构采用异构计算架构(包括使用不同类型指令集的计算单元的计算架构)时,用户还可以在服务器20上安装异构计算框架306。在分布式训练场景中,异构计算框架306可以是针对神经网络的异构计算框架(Compute Architecture for Neuro Net,CANN)RCANN可以通过提供多层次的编程接口,支持用户快速构建AI应用。其中,AI应用是指基于训练得到的模型构建的应用。需要说明的是,异构计算框架306为可选框架,服务器20上不安装上述框架,也可以执行本申请实施例的故障文件保存方法,上述框架的作用在于提高构建AI应用的效率。
然后,用户可以在服务器20上安装深度学习框架308。深度学习框架308用于通过对实现模型的方法进行编译,构建大规模的计算图(computational graph),以及自动实现计算图中的梯度计算。其中,计算图也称作图编译结果。如此,在进行分布式训练时,可以执行图编译结果,以进行分布式训练的相关计算。根据编译方式不同,深度学习框架可以分为支持静态编译的框架和支持动态编译的框架。其中,支持静态编译的框架包括MindSpore框架和Tensorflow框架中的一种或多种,支持动态编译的框架包括PyTorch框架。用户可以根据业务需求,选择在服务器20上安装一个或多个深度学习框架308。在一些实施例中,服务器20上也可以不安装深度学习框架308,此时,服务器20可以采用编程语言如Python从头开始实现模型。
在本申请实施例中,故障文件保存装置310的安装包可以封装在深度学习框架308的安装包内,当用户在服务器20上安装深度学习框架308时,故障文件保存装置310也可以随着深度学习框架308安装在服务器20上。
用户还可以在服务器20上安装调度装置312。该调度装置312用于调度训练任务,以实现分布式训练。其中,调度装置312可以是分布式调度组件或者AI开发平台。在一些实施例中,分布式调度组件可以是MindX DL组件,或者是其他第三方分布式调度组件,AI开发平台可以是Model Arts等开发平台。
在完成上述准备工作后,多个服务器20可以通过投票或选举方式,确定一个服务器20为管理节点100,剩余的服务器20可以作为训练节点200。需要说明的是,当一个或多个训练节点200故障时,管理节点100可以重调度训练任务至新的训练节点200;当管理节点100故障时,剩余的服务器20可以重新投票或选举,确定新的管理节点100。
在一些可能的实现方式中,多个服务器20也可以不进行投票或选举,例如服务器20可以既作为管理节点100,又作为训练节点200。具体地,当一个服务器20作为管理节点100时,该服务器20可以对其他的服务器20进行故障检测,进而实现管理,此外,该服务器20作为训练节点200时,也可以被其他的服务器20检测,并接受其他的服务器20的管理。
接下来,参见图3B所示的软件的调用关系图,故障文件保存装置310包括故障检测(fault detect)组件3102、控制引擎(control engine)3104和修复管理(restore manager)组件3106。调度装置312(例如MindX DL组件或者第三方分布式调度组件)可以调用上述故障检测组件3102、控制引擎3104和修复管理组件3106,以执行本申请实施例的故障文件保存方法。
具体地,故障检测组件3102可以调用异构计算框架306中的昇腾计算语言(Ascend Computing Language,ACL)功能组件,以发现故障。其中,ACL功能组件包括ACL算子编译并执行(aclopCompileAndExecute)和集合通信库,如华为集合通信库(Huawei Collective Communication Library,HCCL)。aclopCompileAndExecute用于编译并执行指定的算子,集合通信库可以为多机多卡训练提供数据并行或模型并行的高性能集合通信方案。控制引擎3104可以调用ACL功能组件,指定保存策略和恢复策略。修复管理组件3106用于调用深度学习框架308和ACL功能组件,以对训练任务进行故障恢复。在该示例中,训练任务为深度学习训练任务,例如是计算机视觉(Computer Vision)、NLP等场景中的深度学习训练任务。
接下来,将从管理节点100的角度,结合附图对本申请实施例提供的故障文件保存方法进行详细说明。
参见图4所示的故障文件保存方法的流程图,该方法包括如下步骤:
S402:管理节点100从多个训练节点200中的至少一个训练节点200获取实时信号。
训练节点200在执行训练任务时可以产生信号。当管理节点100触发信号采集时,从训练节点200实时采集的信号即为实时信号。其中,管理节点100可以设置采集时间窗,管理节点100从训练节点200获取该训练节点200在采集时间窗内的信号,从而获得实时信号。采集时间窗的窗口长度可以根据经验值设置,例如采集时间窗的窗口长度可以设置为5秒。需要说明的是,从管理节点100触发信号采集,到管理节点100开始获取采集时间窗内的信号,往往存在时延,例如,管理节点100在9点0分0秒触发信号采集,在9点0分20秒开始获取采集时间窗内的信号,管理节点100可以在9点0分25秒采集完本轮的信号,即使该信号相对于触发信号采集的时间有所延迟,但该延迟小于设定值,可以忽略不计,因此,该信号也称作实时信号。
实时信号用于表征至少一个训练节点200的状态。该状态可以是训练节点200的健康状态。考虑到多个训练节点200在协同执行训练任务时,可以执行集合通信操作,如HCCL op,产生集群通信信号,实时信号可以包括上述集群通信信号。此外,训练节点200在执行训练任务时,可以执行编译操作(如编译得到计算图,也称作图编译结果)、运行操作(如运行图编译结果),产生编译信号、运行信号。例如,训练节点200可以执行aclopCompileAndExecute,产生编译信号、运行信号,在运行过程中,还可以产生运行管理器信号,如运行时错误(Runtime Error),因此,实时信号也可以包括编译信号、运行信号和运行管理器信号中的一种或多种。
在实际应用时,管理节点100可以提供系统管理库(system management library,SMI)命令工具。例如,训练节点200的device为NPU时,管理节点100可以提供npu-smi命令工 具。管理节点100可以执行npu-smi命令工具中的查询命令,如执行ascend-dmi命令,以采集至少一个训练节点200的实时信号。
S404:管理节点100根据实时信号进行故障检测。
管理节点100从训练节点200采集的实时信号可以表征该训练节点200的状态。其中,训练节点200包括host 22和至少一个device 24。不同类型的实时信号可以反映训练节点200中不同硬件的状态。
具体地,host 22可以通过深度学习框架将实现模型的方法编译成计算图,并将计算图下沉至device 24,device 24可以调用ACL功能组件,执行计算图,以进行相关计算,例如在训练图像识别模型时,可以执行对图像的卷积、池化等计算。进一步地,device 24可以计算梯度,并向host 22返回梯度,以便于host 22对多个device 24返回的梯度进行聚合通信,如进行归约(reduce)运算,从而得到平均梯度。基于此,编译信号、聚合通信信号可以反映host 22的状态,运行信号、运行管理器信号可以反映device 24的状态。管理节点100可以基于编译信号或聚合通信信号中的至少一种,确定host 22是否故障,基于运行信号或运行管理器信号中的至少一种确定device 24是否故障,由此实现故障检测。例如,管理节点100可以根据运行管理器信号如runtime error,确定device 24发生故障。
在一些可能的实现方式中,管理节点100还可以对连接训练节点200的网络是否故障进行检测。具体地,管理节点100可以周期性地向训练节点200发送心跳信号,并接收训练节点200对心跳信号的响应,当管理节点100连续N个周期未接收到训练节点200的响应,则表明训练节点200与该管理节点100之间的网络发生故障,或者训练节点200发生故障。当管理节点100结合日志等信息,排除训练节点200发生故障时,可以确定训练节点200与管理节点100之间的网络发生故障。
进一步地,管理节点100自身也可能发生故障。训练节点200可以对管理节点100进行故障检测。具体地,训练节点200可以基于管理节点100的心跳信号对管理节点100进行故障检测,当多个训练节点200在连续N个周期,未接收到来自管理节点100的心跳信号,则该管理节点100发生故障的置信度较高,当置信度大于置信度阈值时,训练节点200可以确定该管理节点100发生故障。训练节点200可以通过投票或选举机制,重新确定管理节点100。
在一些可能的实现方式中,参见图5所示的故障检测的流程图,管理节点100可以先基于故障预警算法进行故障预测。其中,故障预警算法可以是基于先验知识或专家知识库的算法,例如差分整合移动平均自回归(autoregressive integrated moving average,ARIMA)算法、时间序列预测算法Prophet或者时间感知卷积神经网络算法(time-aware CNN algorithm)。上述故障预警算法可以用于预测训练任务在预设时间段,如未来1小时内是否发生故障。
进一步地,一些故障预警算法还可以预测上述预设时间段中发生故障时的时间点。管理节点100可以根据预测结果,从多个训练节点200中的至少一个训练节点200中获取实时信号,根据该实时信号进行故障检测。
具体地,当预测结果表征将要发生故障,管理节点100可以在故障将要发生的时间点,根据该实时信号进行故障检测。当预测结果表征预设时间段不存在故障时,管理节点100可以启动实时故障检测,具体是捕捉集群通信信号、编译信号、运行信号、运行管理器信号等实时信号,然后基于该实时信号进行故障检测,如此可以避免算法导致的漏检。当管理节点100检测到故障后,可以执行S406,进行故障文件保存。当管理节点100未检测到故障,可以继续进行下一轮的故障检测。
S406:管理节点100进行故障文件保存。
具体地,故障文件为用于恢复训练任务的文件。故障文件保存故障发生时迭代轮次的训练结果。在一些实施例中,故障文件包括以下信息:迭代轮次、权重、损失和超参数。迭代轮次包括故障发生时所在epoch和/或step。超参数可以包括学习率(learning rate,LR)和优化器(optimizer)中的一种或多种。进一步地,故障文件还可以包括隐藏状态hidden states。参见图6所示的故障文件的示意图,本申请实施例通过新增“LR”、“Epoch”、“Step”、“Loss”、“optimizer”等保存字段,可以实现恢复故障发生时所在epoch或step的训练结果。
在一些可能的实现方式中,训练任务可以基于并行策略分为不同类型。例如,训练任务的任务类型可以为数据并行训练类型或模型并行训练类型。其中,数据并行训练类型是利用数据集中的不同数据并行训练相同模型,模型并行训练类型是指利用数据集中的相同数据并行训练模型的多个部分,因此,管理节点100在进行故障文件保存时,可以针对不同任务类型的训练任务,采用不同的保存策略。
具体地,管理节点100可以确定所述训练任务的任务类型,然后按照与所述任务类型对应的保存策略,进行故障文件保存。在一些实施例中,训练任务的任务类型为数据并行训练类型时,由于各加速卡会进行数据交换,以保持数据一致性,因此,保存策略可以为保存至少一个加速卡中任一个非故障卡上的故障文件。在另一些实施例中,所述训练任务的任务类型为模型并行训练类型时,由于各加速卡是对模型的不同部分进行训练,因此,保存策略可以为保存至少一个加速卡中多个非故障卡上的故障文件,例如是保存所有非故障卡上的故障文件。
其中,训练任务的任务类型为数据并行训练类型时,管理节点100还可以进一步确定多个训练节点200中用于聚合通信的目标加速卡是否为非故障卡。其中,目标加速卡可以是多个训练节点200中rank_id为0的加速卡。当目标加速卡为非故障卡时,管理节点100可以保存该目标加速卡上的故障文件,而不必在多个训练节点200进行数据交换保证数据一致性后,再保存故障文件,缩短了保存时间。当多个训练节点200中用于聚合通信的目标加速卡为故障卡时,管理节点100可以保存多个训练节点200中的非故障节点中网络带宽最大的节点(也称作最近的节点)的加速卡上的故障文件,由此可以提高故障文件的保存速率。
在本实施例中,故障文件可以包括checkpoint文件也即ckpt文件。具体地,迭代轮次、权重、损失和超参数等保存字段的字段值可以写入ckpt文件,然后通过保存该ckpt文件以保存故障文件。其中,字段值可以和字段名形成键值对,该键值对可以被写入ckpt文件进行保存。进一步地,由于不同训练任务采用的深度学习框架308可以是不同的,例如,一些训练任务采用的深度学习框架为支持静态编译的框架,基于该框架编译所得的图编译结果可以被复用,管理节点100还可以将图编译结果也进行保存。也即故障文件还可以包括图编译结果。具体实现时,管理节点100可以判断训练任务采用的深度学习框架308。当深度学习框架308为支持静态编译的框架时,管理节点100可以将图编译结果写入故障文件,并进行故障文件保存。其中,支持静态编译的框架包括但不限于MindSpore框架和Tensorflow框架。当深度学习框架308为支持动态编译的框架时,由于图编译结果并不能被复用,管理节点100可以不保存上述图编译结果。其中,支持动态编译的框架包括但不限于pytorch框架。
为了便于理解,下面结合一具体示例说明。
参见图7所示的保存故障文件的流程图,管理节点100检测到训练任务发生故障时,先判断训练任务的任务类型为数据并行训练类型或模型并行训练类型。然后针对不同任务类型,分别采用不同保存策略,具体如下:
当任务类型为模型并行训练类型时,管理节点100可以保存多个非故障卡上的ckpt文件。例如,管理节点100可以保存故障节点和所有非故障节点的非故障卡上的ckpt文件。其中,管理节点100还可以保存策略(strategy)文件,以便基于该策略文件恢复训练任务。
当任务类型为数据并行训练类型时,管理节点100可以进一步判断目标加速卡是否发生故障。具体地,多个训练节点200中的每个训练节点200包括至少一个加速卡,其中,一个加速卡具有聚合通信功能,该加速卡即为目标加速卡(rank_id=0),管理节点100还可以判断上述目标加速卡是否发生故障,也即目标加速卡是否为故障卡。例如,管理节点100可以通过捕获来自于目标加速卡的实时信号,从而判断目标加速卡是否为故障卡。当目标加速卡为非故障卡时,也即rank_id=0的加速卡未发生故障时,管理节点100可以确定保存该目标加速卡上的故障文件,当目标加速卡为故障卡时,管理节点100可以确定非故障节点中网络带宽最大的节点,然后确定保存该网络带宽最大的节点的加速卡上的故障文件。
接着,管理节点100可以判断深度学习框架308是否为支持静态编译的框架。当深度学习框架308为支持静态编译的框架,例如TensorFlow框架或者MindSpore框架时,管理节点100还可以保存图编译结果。故障文件还包括图编译结果。当深度学习框架308为支持动态编译的框架,例如pytorch框架时,管理节点100可以保存ckpt文件。
在一些可能的实现方式中,管理节点100还可以备份故障文件,从而保证故障文件的安全性。例如,管理节点100可以将ckpt文件进行备份,以保证可靠性,避免数据丢失。具体地,管理节点100可以将ckpt文件保存在高性能统一缓存(High-performance Unified Buffer,HUB),从而实现可靠性备份。类似地,管理节点100也可以将图编译结果或者策略文件等保存在高性能统一缓存,从而实现可靠性备份。
S408:在故障文件保存完毕后,管理节点100重调度训练任务,加载故障文件。
具体地,管理节点100可以在故障文件保存完毕后,基于该故障文件启动对训练任务的恢复流程。其中,管理节点100可以重调度训练任务至新的训练节点200,新的训练节点不包括发生故障的训练节点200。然后管理节点100通过加载故障文件,例如加载故障发生时迭代轮次的训练结果,如迭代轮次、权重、损失和超参数等,从而使得训练节点200可以从故障发生时的迭代轮次继续进行训练,而无需根据大量样本数据进行重复训练。
其中,管理节点100可以根据训练任务的任务类型确定与该任务类型对应的恢复策略,然后根据所述故障文件,按照该恢复策略,恢复所述训练任务。
在一些可能的实现方式中,任务类型为数据并行训练类型时,恢复策略可以为基于单个ckpt文件进行恢复;任务类型为模型并行训练类型时,恢复策略可以为基于多个ckpt文件(例如是所有非故障卡上的ckpt文件)进行恢复。
进一步地,任务类型为数据并行训练类型的情况下,训练任务所采用的深度学习框架308为支持动态编译的框架,如pytorch框架时,管理节点100可以基于ckpt文件恢复训练任务,训练任务所采用的深度学习框架308为支持静态编译的框架,例如是TensorFlow框架或者MindSpore框架时,管理节点100还可以结合图编译结果进行故障恢复。
具体地,管理节点100可以获取ckpt文件,例如管理节点100可以从HUB获取ckpt文件,然后加载该ckpt文件,从而基于该ckpt文件中的相应字段的字段值恢复续训的数据和模型。例如,管理节点100可以基于ckpt文件中的epoch和step恢复续训的数据,基于ckpt文件中的权重weights、学习率LR、优化器optimizer恢复模型,如此实现恢复训练任务。其中,管理节点100在恢复训练任务时,可以加载策略文件,按照该策略文件恢复训练任务。
需要说明的时,上述S408为本申请实施例的可选步骤,执行本申请实施例的故障文件保存方法也可以不执行S408。例如,管理节点100可以直接将上述故障文件中的权重等用于模型推理。
基于上述内容描述,本申请实施例提供了一种故障文件保存方法。在该方法中,管理节点100基于从训练节点200获取的实时信号,进行实时地故障检测,在检测到故障后,保存训练任务在故障发生时的迭代轮次的训练结果,当训练任务被重调度至新的训练节点200时,新的训练节点200可以基于故障发生时的迭代轮次的训练结果继续训练,无需基于大量样本数据重复训练,提高了训练效率。
下面以NLP领域的pangu_alpha模型为例,从管理节点100的角度,介绍该模型的故障文件保存以及基于故障文件的故障恢复过程。
参见图8所示的故障文件保存及故障恢复方法的信令流程图,该方法包括:
S802:用户触发创建训练任务的操作。
S804:管理节点100中的MindX DL组件调用驱动和接口进行业务面故障检测。当检测到故障后,执行S812。
S806:管理节点100中的fault detect组件通过故障预警算法进行预测,获得预测结果。当预测结果为预设时间段发生故障时,执行S807;当预测结果为预设时间段不发生故障时,执行S808。
S807:管理节点100中的fault detect组件在预测的时间点,通过捕获集群通信信号、编译信号、运行信号、运行管理器信号中的一种或多种,根据捕获的上述信号进行故障检测,获得故障检测结果。
S808:管理节点100中的fault detect组件捕获集群通信信号、编译信号、运行信号、运行管理器信号中的一种或多种,根据捕获的上述信号进行故障检测,获得故障检测结果。当故障检测结果为训练任务发生故障时,执行S810。
S810:管理节点100中的fault detect组件向MindX DL组件上报告警消息。
S812:管理节点100中的MindX DL组件向control engine发送第一通知消息。
该第一通知消息用于通知训练任务发生故障。进一步地,第一通知消息还可以携带训练任务的任务类型,以便于control engine能够感知训练任务的任务类型。
S814:管理节点100中的control engine根据训练任务的任务类型向restore manager发送保存策略。
S816:管理节点100中的restore manager按照保存策略,保存故障文件。
在该方法中,restore manager可以屏蔽底层深度学习框架308的差异,按照保存策略保存故障文件。例如,restore manager可以获取训练任务发生故障时的epoch和step,并获取模型的权重、超参数,将上述epoch、step、权重、超参数等信息写入ckpt文件进行保存。
其中,训练任务采用支持动态编译的框架时,restore manager还可以按照保存策略,保存图编译结果,以便于restore manager后续结合该图编译结果进行故障恢复。
S818:管理节点100中的restore manager将故障文件写入HUB。
具体地,restore manager可以采用分布式存储方式,将故障文件写入HUB,如此可以实现故障文件的可靠性备份,降低故障文件丢失概率。
需要说明的是,执行本实施例的故障文件保存方法也可以不执行S818。例如,restore manager可以将故障文件在本地进行备份,或者通过其他方式进行备份。
S820:管理节点100中的restore manager向control engine返回备份完成通知。
可选地,restore manager返回备份完成通知,以便于control engine启动后续流程。进一步地,restore manager还可以指示管理节点100中的MindX DL组件重调度训练任务。在一些实施例中,restore manager也可以不执行上述步骤。
S822:管理节点100中的MindX DL组件重调度训练任务。
具体地,MindX DL组件可以基于restore manager的指示,重调度训练任务。在一些实施例中,MindX DL组件也可以通过轮询确定故障文件是否备份完成,当确定故障文件备份完成时,重调度训练任务。
S824:管理节点100中的MindX DL组件向control engine发送第二通知消息。
该第二通知消息用于通知训练任务被MindX DL组件重新调度。进一步地,第二通知消息还可以携带训练任务的任务类型。
需要说明的是,本实施例是以MindX DL组件进行示例说明。在本申请实施例其他可能的实现方式中,管理节点100也可以安装其他类型的分布式调度组件时,该训练任务也可以被其他类型的分布式调度组件所调度。
S826:管理节点100中的control engine指示restore manager进行故障恢复。
具体地,control engine可以向restore manager发送恢复指令,以使restore manager根据该恢复指令,执行后续流程,恢复训练任务。
S828:管理节点100中的restore manager从HUB获取故障文件。
其中,任务类型为数据并行训练类型时,恢复策略可以是触发单个ckpt文件进行恢复;任务类型为模型并行训练类型时,恢复策略可以为触发多个ckpt文件(例如是保存的所有非故障卡的ckpt文件)进行恢复。管理节点100中的restore manager可以根据任务类型,按需获取ckpt文件。
进一步地,任务类型为数据并行训练类型,且训练任务采用的深度学习框架308为支持静态编译的框架时,管理节点100中的restore manager还可以从HUB中获取图编译结果。
其中,故障文件在本地备份,或者通过其他方式备份时,restore manager也可以从本地或其他位置获取故障文件。
S830:管理节点100中的restore manager根据所述故障文件,恢复所述训练任务。
具体地,故障文件包括ckpt文件,该ckpt文件中写入有epoch、step、weights、loss、optimizer和LR等信息,管理节点100中的restore manager可以加载ckpt文件,并读取ckpt文件中相应字段的字段值,基于该字段值,恢复续训的数据和模型,从而恢复训练任务。
在该方法中,管理节点100可以实现训练任务级别的故障检测,并在检测到训练任务发生故障后进行故障文件保存,该故障文件包括训练任务发生故障时的epoch和/或step,管理节点100基于该epoch和/或step恢复训练任务,可以避免故障发生时所在step或epoch的迭代结果丢失,分布式系统10无需基于大量样本数据重复训练,提高了训练效率。
而且,该方法解耦了深度学习框架308和故障文件保存装置310,针对采用不同深度学习框架308的训练任务,故障文件保存装置310均可以用于对该训练任务进行故障恢复,具有较好的兼容性。此外,该方法还解耦了不同任务类型的保存及恢复机制,用户无需关注任务类型,接口友好,降低了用户的使用成本。
上文结合图1至图8对本申请实施例提供的故障文件保存方法进行了详细介绍,下面将结合附图对本申请实施例提供的装置进行介绍。
参见图9所示的故障文件保存装置310的结构示意图,该装置310可以是软件装置,该软件装置可以部署在管理节点100中,该装置310包括:
通信模块902,用于从所述多个训练节点200中的至少一个训练节点200获取实时信号,所述实时信号用于表征所述至少一个训练节点200的状态;
检测模块904,用于根据所述实时信号进行故障检测;
保存模块906,用于检测到故障后进行故障文件保存,所述故障文件用于恢复所述训练任务。
其中,图9和图3B是从不同角度对故障文件保存装置310进行了划分,例如通信模块902和检测模块904可以对应于图3B中的故障检测组件3102,保存模块906可以对应于图3B中的修复管理组件3106。
在一些可能的实现方式中,所述实时信号包括以下一种或多种:
集群通信信号、编译信号、运行信号、运行管理器信号。
在一些可能的实现方式中,所述装置310还包括:
预测模块908,用于通过故障预警算法对所述训练任务进行故障预测,得到预测结果;
所述检测模块904具体用于:
根据所述预测结果,从所述多个训练节点200中的至少一个训练节点200获取实时信号。
其中,预测模块908和检测模块904可以共同对应于图3B中的故障检测组件3102,以实现对训练节点200(包括host 22和device 24)进行故障检测。
在一些可能的实现方式中,所述保存模块906具体用于:
确定所述训练任务的任务类型;
按照与所述任务类型对应的保存策略,进行故障文件保存。
其中,保存模块906可以对应于图3B中的控制引擎3104和修复管理组件3106,以实现按照保存策略,进行故障文件保存。
在一些可能的实现方式中,所述多个训练节点200中的每个训练节点200包括至少一个加速卡,所述训练任务的任务类型为数据并行训练类型时,所述保存策略为保存所述至少一个加速卡中任一个非故障卡上的故障文件;所述训练任务的任务类型为模型并行训练类型时,所述保存策略为保存所述至少一个加速卡中多个非故障卡上的故障文件。
在一些可能的实现方式中,所述训练任务的任务类型为数据并行训练类型时,所述保存策略为:
当所述多个训练节点200中用于聚合通信的目标加速卡为非故障卡时,保存所述目标加速卡上的故障文件;
当所述多个训练节点200中用于聚合通信的目标加速卡为故障卡时,保存所述多个训练节点200中的非故障节点中网络带宽最大的节点的加速卡上的故障文件。
在一些可能的实现方式中,所述装置310还包括:
恢复模块909,用于在所述故障文件保存完毕后,重调度所述训练任务,加载所述故障文件。
其中,恢复模块902可以对应于图3B中的修复管理组件3106,以实现重调度训练任务至新的训练节点200,并加载故障文件,从而恢复训练任务。
在一些可能的实现方式中,所述恢复模块909具体用于:
根据与所述训练任务的任务类型对应的恢复策略,加载所述故障文件。
在一些可能的实现方式中,所述故障文件包括以下信息:迭代轮次、权重、损失和超参数。
在一些可能的实现方式中,所述故障文件还包括所述训练任务的图编译结果。
根据本申请实施例的故障文件保存装置310可对应于执行本申请实施例中描述的方法,并且故障文件保存装置310的各个模块/单元的上述和其它操作和/或功能分别为了实现图3所示实施例中的各个方法的相应流程,为了简洁,在此不再赘述。
本申请实施例还提供了一种管理节点100。该管理节点100可以是服务器,例如是云服务器或者物理服务器。其中,云服务器是指云环境中的计算设备。云环境指示云服务提供商拥有的,用于提供计算、存储、通信资源的中心计算设备集群。物理服务器具体可以是独立服务器,该物理服务器的配置和性能通常被使用者独享。在一些实施例中,管理节点100也可以是终端,包括但不限于台式机、笔记本电脑或者智能手机。该管理节点100具体用于实现如图9所示实施例中故障文件保存装置310的功能。
图10提供了一种管理节点100的硬件结构图,如图10所示,管理节点100包括总线1001、中央处理器1002、通信接口1003、存储器1004和多个加速卡1005。处理器1002、存储器1004、通信接口1003、加速卡1005之间通过总线1001通信。
总线1001可以是外设部件互连标准(peripheral component interconnect,PCI)总线或扩展工业标准结构(extended industry standard architecture,EISA)总线等。总线可以分为地址总线、数据总线、控制总线等。为便于表示,图10中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
中央处理器1002可以为x86架构的CPU,也可以为高级精简指令集机器(Advanced RISC Machine,ARM)架构的CPU,或者是其他架构的CPU,本实施例对此不作限制。
通信接口1003用于与外部通信。例如,通信接口1003用于从多个训练节点200中的至少一个训练节点200获取实时信号,以及在检测到故障后,获取故障文件进行故障文件保存 等等。
存储器1004可以包括易失性存储器(volatile memory),例如随机存取存储器(random access memory,RAM)。存储器1004还可以包括非易失性存储器(non-volatile memory),例如只读存储器(read-only memory,ROM),快闪存储器,硬盘驱动器(hard disk drive,HDD)或固态驱动器(solid state drive,SSD)。
加速卡1005可以包括NPU或者GPU。上文均以加速卡105包括NPU进行示例说明。需要说明的是,当管理节点100为专用于执行管理功能的节点时,管理节点100也可以不包括上述加速卡1005。
存储器1004中存储有计算机可读指令,处理器1002执行该计算机可读指令,以使得管理节点100执行前述故障文件保存方法(或实现前述故障文件保存装置310的功能)。
具体地,在实现图9所示系统的实施例的情况下,且图9中所描述的故障文件保存装置310的各模块如通信模块902、检测模块904、保存模块906、预测模块908和恢复模块909的功能为通过软件实现的情况下,执行图9中各模块的功能所需的软件或程序代码可以存储在管理节点100中的至少一个存储器1004中。至少一个处理器1002执行存储器1004中存储的程序代码,以使得管理节点100执行前述故障文件保存方法。
本申请实施例还提供了一种计算机可读存储介质。所述计算机可读存储介质可以是计算设备能够存储的任何可用介质或者是包含一个或多个可用介质的数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘)等。该计算机可读存储介质包括指令,所述指令指示管理节点100执行上故障文件保存方法。
本申请实施例还提供了一种计算机程序产品。所述计算机程序产品包括一个或多个计算机指令。在计算设备如管理节点100上加载和执行所述计算机指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算设备或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算设备或数据中心进行传输。所述计算机程序产品可以为一个软件安装包,在需要使用前述故障文件保存方法的任一方法的情况下,可以下载该计算机程序产品并在管理节点100上执行该计算机程序产品。
上述各个附图对应的流程或结构的描述各有侧重,某个流程或结构中没有详述的部分,可以参见其他流程或结构的相关描述。

Claims (24)

  1. 一种故障文件保存方法,其特征在于,应用于分布式系统,所述分布式系统包括管理节点和多个训练节点,所述多个训练节点用于协同执行训练任务,所述方法包括:
    所述管理节点从所述多个训练节点中的至少一个训练节点获取实时信号,所述实时信号用于表征所述至少一个训练节点的状态;
    所述管理节点根据所述实时信号进行故障检测;
    所述管理节点检测到故障后进行故障文件保存,所述故障文件用于恢复所述训练任务。
  2. 根据权利要求1所述的方法,其特征在于,所述实时信号包括以下一种或多种:
    集群通信信号、编译信号、运行信号、运行管理器信号。
  3. 根据权利要求1或2所述的方法,其特征在于,所述方法还包括:
    所述管理节点通过故障预警算法对所述训练任务进行故障预测,得到预测结果;
    所述管理节点从所述多个训练节点中的至少一个训练节点获取实时信号,包括:
    所述管理节点根据所述预测结果,从所述多个训练节点中的至少一个训练节点获取实时信号。
  4. 根据权利要求1至3任一项所述的方法,其特征在于,所述管理节点进行故障文件保存,包括:
    所述管理节点确定所述训练任务的任务类型;
    所述管理节点按照与所述任务类型对应的保存策略,进行故障文件保存。
  5. 根据权利要求4所述的方法,其特征在于,所述多个训练节点中的每个训练节点包括至少一个加速卡,所述训练任务的任务类型为数据并行训练类型时,所述保存策略为保存所述至少一个加速卡中任一个非故障卡上的故障文件;所述训练任务的任务类型为模型并行训练类型时,所述保存策略为保存所述至少一个加速卡中多个非故障卡上的故障文件。
  6. 根据权利要求5所述的方法,其特征在于,所述训练任务的任务类型为数据并行训练类型时,所述保存所述至少一个加速卡中任一个非故障卡上的故障文件,包括:
    当所述多个训练节点中用于聚合通信的目标加速卡为非故障卡时,保存所述目标加速卡上的故障文件;
    当所述多个训练节点中用于聚合通信的目标加速卡为故障卡时,保存所述多个训练节点中的非故障节点中网络带宽最大的节点的加速卡上的故障文件。
  7. 根据权利要求1至6任一项所述的方法,其特征在于,所述方法还包括:
    在所述故障文件保存完毕后,所述管理节点重调度所述训练任务,加载所述故障文件。
  8. 根据权利要求7所述的方法,其特征在于,所述管理节点加载所述故障文件,包括:
    所述管理节点根据与所述训练任务的任务类型对应的恢复策略,加载所述故障文件。
  9. 根据权利要求1至8任一项所述的方法,其特征在于,所述故障文件包括以下信息:迭代轮次、权重、损失和超参数。
  10. 根据权利要求9所述的方法,其特征在于,所述故障文件还包括所述训练任务的图编译结果。
  11. 一种故障文件保存装置,其特征在于,应用于分布式系统,所述分布式系统包括管理节点和多个训练节点,所述多个训练节点用于协同执行训练任务,所述装置部署于所述管理节点,所述装置包括:
    通信模块,用于从所述多个训练节点中的至少一个训练节点获取实时信号,所述实时信 号用于表征所述至少一个训练节点的状态;
    检测模块,用于根据所述实时信号进行故障检测;
    保存模块,用于检测到故障后进行故障文件保存,所述故障文件用于恢复所述训练任务。
  12. 根据权利要求11所述的装置,其特征在于,所述实时信号包括以下一种或多种:
    集群通信信号、编译信号、运行信号、运行管理器信号。
  13. 根据权利要求11或12所述的装置,其特征在于,所述装置还包括:
    预测模块,用于通过故障预警算法对所述训练任务进行故障预测,得到预测结果;
    所述检测模块具体用于:
    根据所述预测结果,从所述多个训练节点中的至少一个训练节点获取实时信号。
  14. 根据权利要求11至13任一项所述的装置,其特征在于,所述保存模块具体用于:
    确定所述训练任务的任务类型;
    按照与所述任务类型对应的保存策略,进行故障文件保存。
  15. 根据权利要求14所述的装置,其特征在于,所述多个训练节点中的每个训练节点包括至少一个加速卡,所述训练任务的任务类型为数据并行训练类型时,所述保存策略为保存所述至少一个加速卡中任一个非故障卡上的故障文件;所述训练任务的任务类型为模型并行训练类型时,所述保存策略为保存所述至少一个加速卡中多个非故障卡上的故障文件。
  16. 根据权利要求15所述的装置,其特征在于,所述训练任务的任务类型为数据并行训练类型时,所述保存策略为:
    当所述多个训练节点中用于聚合通信的目标加速卡为非故障卡时,保存所述目标加速卡上的故障文件;
    当所述多个训练节点中用于聚合通信的目标加速卡为故障卡时,保存所述多个训练节点中的非故障节点中网络带宽最大的节点的加速卡上的故障文件。
  17. 根据权利要求11至16任一项所述的装置,其特征在于,所述装置还包括:
    恢复模块,用于在所述故障文件保存完毕后,重调度所述训练任务,加载所述故障文件。
  18. 根据权利要求17所述的装置,其特征在于,所述恢复模块具体用于:
    根据与所述训练任务的任务类型对应的恢复策略,加载所述故障文件。
  19. 根据权利要求11至18任一项所述的装置,其特征在于,所述故障文件包括以下信息:迭代轮次、权重、损失和超参数。
  20. 根据权利要求19所述的装置,其特征在于,所述故障文件还包括所述训练任务的图编译结果。
  21. 一种管理节点,其特征在于,应用于分布式系统,所述分布式系统包括所述管理节点和多个训练节点,所述多个训练节点用于协同执行训练任务,所述管理节点包括至少一个处理器和至少一个存储器,所述至少一个存储器中存储有计算机可读指令,所述至少一个处理器执行所述计算机可读指令,使得所述管理节点执行如权利要求1至10任一项所述的方法。
  22. 一种分布式系统,其特征在于,所述分布式系统包括管理节点和多个训练节点;
    所述多个训练节点,用于协同执行训练任务;
    所述管理节点,用于从所述多个训练节点中的至少一个训练节点获取实时信号,所述实时信号用于表征所述至少一个训练节点的状态,根据所述实时信号进行故障检测,检测到故障后进行故障文件保存,所述故障文件用于恢复所述训练任务。
  23. 一种计算机可读存储介质,其特征在于,包括计算机可读指令,当所述计算机可读指令在管理节点上运行时,使得所述管理节点执行如权利要求1至10任一项所述的方法。
  24. 一种计算机程序产品,其特征在于,包括计算机可读指令,当所述计算机可读指令在管理节点上运行时,使得所述管理节点执行如权利要求1至10任一项所述的方法。
PCT/CN2023/078980 2022-03-01 2023-03-01 一种故障文件保存方法及相关装置 WO2023165512A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210197961.0 2022-03-01
CN202210197961.0A CN114968947B (zh) 2022-03-01 2022-03-01 一种故障文件保存方法及相关装置

Publications (1)

Publication Number Publication Date
WO2023165512A1 true WO2023165512A1 (zh) 2023-09-07

Family

ID=82976197

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/078980 WO2023165512A1 (zh) 2022-03-01 2023-03-01 一种故障文件保存方法及相关装置

Country Status (2)

Country Link
CN (1) CN114968947B (zh)
WO (1) WO2023165512A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114968947B (zh) * 2022-03-01 2023-05-09 华为技术有限公司 一种故障文件保存方法及相关装置

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107909035B (zh) * 2017-11-16 2021-05-28 国家电网公司 一种准实时故障录波文件的读取及分析系统
CN111679953B (zh) * 2020-06-09 2022-04-12 平安科技(深圳)有限公司 基于人工智能的故障节点识别方法、装置、设备和介质
CN113505014B (zh) * 2021-06-09 2022-05-27 荣耀终端有限公司 一种故障诊断文件获取方法及装置

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095008A (zh) * 2015-08-25 2015-11-25 国电南瑞科技股份有限公司 一种适用于集群系统的分布式任务故障冗余方法
CN105515847A (zh) * 2015-12-02 2016-04-20 深圳Tcl数字技术有限公司 终端故障处理方法、装置及系统
US20190114537A1 (en) * 2017-10-16 2019-04-18 Facebook, Inc. Distributed training and prediction using elastic resources
CN113569987A (zh) * 2021-08-19 2021-10-29 北京沃东天骏信息技术有限公司 模型训练方法和装置
CN114968947A (zh) * 2022-03-01 2022-08-30 华为技术有限公司 一种故障文件保存方法及相关装置

Also Published As

Publication number Publication date
CN114968947B (zh) 2023-05-09
CN114968947A (zh) 2022-08-30

Similar Documents

Publication Publication Date Title
US9680893B2 (en) Method and system for event state management in stream processing
US11301307B2 (en) Predictive analysis for migration schedulers
US10652119B2 (en) Automatic recovery engine with continuous recovery state machine and remote workflows
US10467048B2 (en) Techniques for virtual machine migration
US10073683B2 (en) System and method for providing software build violation detection and self-healing
US20180246751A1 (en) Techniques to select virtual machines for migration
CN107016480B (zh) 任务调度方法、装置及系统
US9122595B2 (en) Fault tolerance for complex distributed computing operations
TW201610703A (zh) 雲端中之分散式串流處理
US11144330B2 (en) Algorithm program loading method and related apparatus
US9229839B2 (en) Implementing rate controls to limit timeout-based faults
US10055307B2 (en) Workflows for series of snapshots
CN107479944B (zh) 混合云模式下的虚拟机内存自适应热迁移调度方法及系统
WO2023165512A1 (zh) 一种故障文件保存方法及相关装置
CN115004156A (zh) 实时多租户工作负载跟踪和自动节流
WO2023020355A1 (zh) Ai模型的分布式训练方法和相关设备
CN107992354B (zh) 用于降低内存负载的方法以及装置
US20190317849A1 (en) Automatic correcting of computing cluster execution failure
CN104216683A (zh) 利用同步多线程进行数据处理的方法及其系统
US20210124492A1 (en) Efficient utilization of storage resources on data recovery sites using machine learning
CN103235754B (zh) 分布式文件系统中请求的处理方法和装置
Bawankule et al. Early straggler tasks detection by recurrent neural network in a heterogeneous environment
CN112580816A (zh) 机器学习训练资源管理
US20220019461A1 (en) Platform health engine in infrastructure processing unit
US9880893B2 (en) Failure interval determination

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23762910

Country of ref document: EP

Kind code of ref document: A1