CN114968636A - Fault processing method and device

Info

Publication number: CN114968636A
Application number: CN202210543996.5A
Authority: CN
Inventors: 裴周宇; 付海涛
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Priority date: 2022-05-19
Filing date: 2022-05-19
Publication date: 2022-08-30

Abstract

The invention discloses a fault processing method and device, and relates to the technical field of computers. One embodiment of the method comprises: monitoring the running state of a target task manager; and, in response to monitoring an exception signal indicating that the running state is abnormal, sending exception information indicating that the target task manager is abnormal to a job manager corresponding to the target task manager, so that the job manager triggers a fault recovery policy for the exception information. In this embodiment, the exception signal indicating the abnormal running state of the task manager is obtained in real time and the corresponding job manager is notified promptly, so that the job manager triggers fault recovery without delay, which speeds up task recovery and improves service processing efficiency.

Description

Fault processing method and device

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for processing a fault.

Background

With the development of big data technology, businesses place ever higher demands on real-time performance, and more and more of them use real-time computing to accelerate business development. Cloud-native real-time computing products are increasingly widely used, and the Flink on K8s architecture has become mainstream.

Under the Flink on K8s architecture, detecting that a task manager (TaskManager) process is abnormal depends on a heartbeat mechanism between the task manager and a resource manager (ResourceManager). The default heartbeat timeout is generally more than 60 seconds, and a fault recovery task is triggered only after the timeout expires. Because the heartbeat timeout is long, fault recovery takes a long time and cannot meet the real-time requirements of service computation.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method and an apparatus for fault handling, which can obtain an abnormal signal for a task manager in real time and notify a corresponding job manager in time, so that the job manager triggers fault recovery in time, thereby greatly increasing a task recovery speed and improving service handling efficiency.

To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a method of fault handling, including:

monitoring the running state of a target task manager;

in response to monitoring an exception signal indicating that the running state is abnormal, sending exception information indicating that the target task manager is abnormal to a job manager corresponding to the target task manager, so that the job manager triggers a fault recovery policy for the exception information.

Optionally, the exception signal is a stop signal sent by the kubelet process.

Optionally, the exception signal is an exit signal of the target task manager.

Optionally, sending exception information indicating that the target task manager is abnormal to a job manager corresponding to the target task manager includes:

acquiring the address information of the job manager from a cache, and sending the exception information to the job manager according to the address information.

Optionally, before acquiring the address information of the job manager from the cache, the method includes:

monitoring the address information of the job manager corresponding to the target task manager, and storing the monitored address information of the job manager in the cache in association with the identifier of the target task manager.

Optionally, the job manager triggering a fault recovery policy for the exception information includes:

the job manager sends an exception notification indicating that the target task manager is abnormal to a resource manager; the resource manager sends the exception notification to a heartbeat manager; after receiving the exception notification, the heartbeat manager cancels heartbeat monitoring of the target task manager and calls back the resource manager, so that the resource manager cancels the registration of the target task manager to trigger the fault recovery policy.

Optionally, the job manager triggering a fault recovery policy for the exception information further includes:

the resource manager calls back the job manager so that the job manager acquires the tasks that failed to execute on the target task manager;

and the job manager reassigns the failed tasks to a newly started task manager to achieve fault recovery.

According to still another aspect of the embodiments of the present invention, there is provided an apparatus for fault handling, including:

a monitoring module, configured to monitor the running state of a target task manager;

and a sending module, configured to, in response to monitoring an exception signal indicating that the running state is abnormal, send exception information indicating that the target task manager is abnormal to a job manager corresponding to the target task manager, so that the job manager triggers a fault recovery policy for the exception information.

According to another aspect of an embodiment of the present invention, there is provided an electronic apparatus including:

one or more processors;

a storage device for storing one or more programs,

when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method for fault handling provided by the present invention.

According to a further aspect of the embodiments of the present invention, there is provided a computer-readable medium on which a computer program is stored, the program, when executed by a processor, implementing the method of fault handling provided by the present invention.

One embodiment of the above invention has the following advantages or benefits: by monitoring the running state of the target task manager, an exception signal indicating an abnormal running state can be detected promptly, so that exception information is sent to the corresponding job manager without delay, the job manager triggers the fault recovery task promptly, and the job is recovered promptly. The method speeds up task recovery and improves the efficiency of service processing.

Further effects of the optional implementations mentioned above are described below in connection with the specific embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a schematic diagram of the main flow of a method of fault handling according to an embodiment of the invention;

FIG. 2 is a schematic diagram of a Flink on K8s architecture according to an embodiment of the invention;

FIG. 3 is a schematic main flow diagram of another method of fault handling according to an embodiment of the invention;

FIG. 4 is a flow diagram of a method of fault handling in accordance with an embodiment of the present invention;

FIG. 5 is a schematic diagram of the main modules of a fault handling apparatus according to an embodiment of the present invention;

FIG. 6 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;

FIG. 7 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.

Detailed Description

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

Fig. 1 is a schematic diagram of a main flow of a method of fault handling according to an embodiment of the present invention, as shown in fig. 1, the method including the steps of:

step S101: monitoring the running state of a target task manager;

step S102: judging whether an exception signal indicating that the running state is abnormal has been monitored; if yes, executing step S103; otherwise, returning to step S101;

step S103: sending exception information indicating that the target task manager is abnormal to a job manager corresponding to the target task manager, so that the job manager triggers a fault recovery policy for the exception information.
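The loop formed by steps S101 to S103 can be sketched as follows. This is only an illustrative sketch in Python, not the implementation described by the embodiment: the function name `run_fault_monitor`, the use of an iterable of observations to stand in for the monitored running state, and the message format are all assumptions made for the example.

```python
from typing import Callable, Iterable, Optional


def run_fault_monitor(
    signals: Iterable[Optional[str]],
    notify_job_manager: Callable[[str], None],
    task_manager_id: str,
) -> bool:
    """Walk through steps S101-S103 over a stream of observations.

    Each item in `signals` is None while the task manager looks healthy,
    or the name of an exception signal once one is observed (hypothetical
    representation). Returns True if exception information was sent.
    """
    for observed in signals:      # S101: monitor the running state
        if observed is None:      # S102: nothing abnormal, keep monitoring
            continue
        # S103: tell the job manager that this task manager is abnormal
        notify_job_manager(f"task manager {task_manager_id} abnormal: {observed}")
        return True
    return False


sent = []
run_fault_monitor([None, None, "TERM"], sent.append, "tm-1")
print(sent)  # ['task manager tm-1 abnormal: TERM']
```

Passing the notification function in as a parameter keeps the loop independent of how the job manager is actually reached.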

The fault handling method is implemented on the Flink on K8s architecture, where Flink is a stateful computing engine for data streams and K8s is short for Kubernetes, a container orchestration and scheduling engine.

Fig. 2 is a schematic diagram of a Flink on K8s architecture according to an embodiment of the present invention. A user may send a request to create a Flink cluster to the K8s cluster through a K8s client on a Web service (e.g., the jrc platform). The request is a command to create K8s Deployments (the K8s object used for Pod management). The K8s Master (the master node of the K8s cluster) creates the K8s Deployments according to the Deployment information in the request and pulls image information from a Docker Registry (the image repository of the application container engine), yielding the K8s Pods corresponding to each K8s Deployment, namely the K8s Pods of the JobManager (job manager) and of the TaskManager (task manager). The TaskManager obtains the address information of the JobManager by monitoring the ZK cluster (a distributed coordination service) and, after acquiring the address information, registers with the JobManager, which completes the creation of the Flink cluster on K8s. The JobManager and the TaskManager store data processing results in back-end storage, HDFS (Hadoop Distributed File System) or OSS (Object Storage Service). The user can then submit jobs to the Flink cluster through the K8s client for the Flink cluster to execute.

In the embodiment of the invention, monitoring the running state of the target task manager reveals whether the target task manager is operating normally. If the running state is normal, monitoring continues. If an exception signal indicating that the running state is abnormal is monitored, the target task manager is about to become abnormal or has already become abnormal, and can no longer continue executing tasks. The exception signal may be a stop signal (TERM) sent by the kubelet process (a process on a worker node of the K8s cluster that handles the tasks issued to that node by the K8s master node and manages Pods and the containers in them), which indicates that the K8s Pod corresponding to the target task manager is about to stop running. The exception signal may also be an exit signal (CHLD) of the target task manager, that is, the exit signal of the process corresponding to the target task manager.
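As a rough illustration of how the two exception signals might be caught, the sketch below uses Python's standard `signal` module on a POSIX system. It assumes that the monitoring code runs as the parent of the task manager process inside the Pod, so that it receives SIGTERM when the Pod is stopped and SIGCHLD when the task manager process exits. The names `EXCEPTION_SIGNALS`, `describe_signal` and `install_handlers` are hypothetical.

```python
import signal
from typing import Callable, Dict

# Assumed mapping from POSIX signals to the two cases named in the text.
EXCEPTION_SIGNALS: Dict[int, str] = {
    signal.SIGTERM: "stop signal: the Pod of the task manager is about to stop",
    signal.SIGCHLD: "exit signal: the task manager process has exited",
}


def describe_signal(signum: int) -> str:
    """Translate a signal number into a description of the exception."""
    return EXCEPTION_SIGNALS.get(signum, "not an exception signal")


def install_handlers(on_exception: Callable[[str], None]) -> None:
    """Register one handler for both signals.

    signal.signal() may only be called from the main thread of the main
    interpreter, so this should be invoked at start-up.
    """
    def handler(signum, frame):
        on_exception(describe_signal(signum))

    for signum in EXCEPTION_SIGNALS:
        signal.signal(signum, handler)
```

In a real deployment `on_exception` would be the routine that sends the exception information to the job manager; here it is left as a parameter so the sketch stays self-contained.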

In an optional implementation of the embodiment of the present invention, after the exception signal is monitored, sending exception information indicating that the target task manager is abnormal to the job manager corresponding to the target task manager includes: acquiring the address information of the job manager from the cache, and sending the exception information to the job manager according to the address information.

In the embodiment of the present invention, as shown in fig. 3, before acquiring the address information of the job manager from the cache, the method includes:

step S301: monitoring address information of a job manager corresponding to a target task manager;

step S302: and correspondingly storing the monitored address information of the job manager and the identifier of the target task manager into a cache.

In the embodiment of the invention, the job manager is the master node of the Flink cluster and coordinates stream processing jobs. Each master node may correspond to one or more slave nodes, which are the task managers; that is, each job manager may correspond to one or more task managers, which perform the specific tasks assigned by the job manager, i.e. execute the specific data stream processing logic.

The target task manager may obtain the address information of its corresponding job manager, that is, a server address (e.g., a WebMonitor address for receiving HTTP requests), by monitoring the ZK cluster. It then stores the monitored address information of the job manager in the cache in association with the identifier of the target task manager. The cache may be a temporary file system, in which case the identifier of the target task manager and the address information of its job manager are written to a storage file (e.g., a wmaari.ts file) in the temporary file system.

In the embodiment of the invention, the address information of the job manager corresponding to the target task manager may change. When the target task manager detects such a change, it acquires the new address information and stores it, updating the address recorded in the cache for the identifier of the target task manager. Because the address is kept in the temporary file system of the cache, it can be updated promptly without affecting the local disk.
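The caching behaviour described above (store the job manager address under the task manager identifier, and overwrite it when the address changes) might look like the following sketch. The class `AddressCache`, the one-file-per-task-manager layout, and the JSON record format are assumptions for illustration; the text does not specify the format of the storage file. A temporary directory stands in for the temporary file system.

```python
import json
import tempfile
from pathlib import Path


class AddressCache:
    """Keeps one small storage file per task manager (hypothetical layout)."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def _path(self, task_manager_id: str) -> Path:
        return self.root / f"{task_manager_id}.json"

    def store(self, task_manager_id: str, job_manager_address: str) -> None:
        """Create or overwrite the record for this task manager."""
        record = {
            "task_manager_id": task_manager_id,
            "job_manager_address": job_manager_address,
        }
        target = self._path(task_manager_id)
        scratch = target.with_suffix(".tmp")
        scratch.write_text(json.dumps(record))
        # Rename into place so a reader never sees a half-written file.
        scratch.replace(target)

    def lookup(self, task_manager_id: str) -> str:
        """Return the cached job manager address for this task manager."""
        record = json.loads(self._path(task_manager_id).read_text())
        return record["job_manager_address"]


with tempfile.TemporaryDirectory() as directory:
    cache = AddressCache(Path(directory))
    cache.store("tm-1", "http://jm-a.example:8081")
    cache.store("tm-1", "http://jm-b.example:8081")  # address changed
    print(cache.lookup("tm-1"))  # http://jm-b.example:8081
```

Writing to a scratch file and renaming it is a design choice of this sketch, made so that the signal-handling path can read the file at any moment.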

When an exception signal indicating that the running state of the target task manager is abnormal is monitored, the identifier of the target task manager is obtained, the storage file is fetched from the temporary file system according to that identifier, and the storage file is parsed to obtain the address information of the job manager corresponding to the target task manager. Exception information indicating that the target task manager is abnormal, namely information indicating that the process of the target task manager has exited, is then sent to the job manager according to the address information, notifying the job manager that its target task manager is abnormal.
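A minimal sketch of this read, parse and send sequence is given below. It assumes the storage file holds a JSON record and that the job manager accepts an HTTP POST; the endpoint path `/taskmanager-exception`, the payload fields, and the function names are hypothetical and are not taken from Flink or from the embodiment. The transport is injected so that the example runs without a network.

```python
import json
import tempfile
import urllib.request
from pathlib import Path
from typing import Callable


def build_exception_request(address: str, task_manager_id: str) -> urllib.request.Request:
    """Build the HTTP request carrying the exception information."""
    body = json.dumps(
        {"task_manager_id": task_manager_id, "event": "process_exited"}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{address}/taskmanager-exception",  # hypothetical endpoint
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def report_exception(
    storage_file: Path,
    task_manager_id: str,
    send: Callable[[urllib.request.Request], object],
) -> None:
    """Parse the storage file, then send exception information to the job manager."""
    record = json.loads(storage_file.read_text())
    if record["task_manager_id"] != task_manager_id:
        raise ValueError("storage file belongs to a different task manager")
    send(build_exception_request(record["job_manager_address"], task_manager_id))


with tempfile.TemporaryDirectory() as directory:
    storage_file = Path(directory) / "tm-1.json"
    storage_file.write_text(json.dumps(
        {"task_manager_id": "tm-1", "job_manager_address": "http://jobmanager.example:8081"}
    ))
    outbox = []
    report_exception(storage_file, "tm-1", send=outbox.append)
    print(outbox[0].full_url)  # http://jobmanager.example:8081/taskmanager-exception
```

To actually transmit the request, `urllib.request.urlopen` could be passed as `send`.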

In the embodiment of the present invention, after the job manager receives the exception information, it triggers a fault recovery policy for the exception information. For example, if the exception signal is an exit signal of the target task manager process, the process may be restarted to achieve fault recovery; alternatively, the tasks that failed to execute on the target task manager are acquired and assigned to another task manager corresponding to the job manager, so that the tasks are redeployed and the fault is recovered quickly.

In an embodiment of the present invention, after the job manager receives the exception information, the job manager triggering a fault recovery policy for the exception information includes: the job manager sends an exception notification indicating that the target task manager is abnormal to the resource manager; the resource manager sends the exception notification to the heartbeat manager; after receiving the exception notification, the heartbeat manager cancels heartbeat monitoring of the target task manager and calls back the resource manager, so that the resource manager cancels the registration of the target task manager to trigger the fault recovery policy.

In the embodiment of the present invention, the resource manager serves the job manager and the task managers in the Flink cluster and is the unit responsible for resource scheduling in the cluster. The heartbeat manager performs heartbeat monitoring on the task managers to judge whether their running state is normal. The job manager sends the exception notification to the resource manager by calling an interface of the resource manager (e.g., a task manager end signal interface, a killsignaltab manager interface) through an RPC request. When the heartbeat manager cancels heartbeat monitoring of the target task manager, it clears information such as the identifier of the target task manager. Likewise, when the resource manager cancels the registration of the target task manager, it clears information such as the identifier of the target task manager.
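The notification chain between the job manager, the resource manager and the heartbeat manager can be modelled with a few plain classes. This is a toy model under stated assumptions: the class and method names are invented for the example and are not Flink's classes, and the RPC request is replaced by a direct method call.

```python
from typing import Callable, List, Set


class HeartbeatManager:
    def __init__(self) -> None:
        self.monitored: Set[str] = set()

    def monitor(self, task_manager_id: str) -> None:
        self.monitored.add(task_manager_id)

    def on_exception(self, task_manager_id: str, callback: Callable[[str], None]) -> None:
        # Cancel heartbeat monitoring by clearing the identifier,
        # then call back the resource manager.
        self.monitored.discard(task_manager_id)
        callback(task_manager_id)


class ResourceManager:
    def __init__(self, heartbeat: HeartbeatManager) -> None:
        self.heartbeat = heartbeat
        self.registered: Set[str] = set()
        self.recovery_triggered: List[str] = []

    def register(self, task_manager_id: str) -> None:
        self.registered.add(task_manager_id)
        self.heartbeat.monitor(task_manager_id)

    def notify_task_manager_exception(self, task_manager_id: str) -> None:
        # Forward the exception notification to the heartbeat manager.
        self.heartbeat.on_exception(task_manager_id, self._unregister)

    def _unregister(self, task_manager_id: str) -> None:
        # Cancelling the registration is what triggers fault recovery.
        self.registered.discard(task_manager_id)
        self.recovery_triggered.append(task_manager_id)


class JobManager:
    def __init__(self, resource_manager: ResourceManager) -> None:
        self.resource_manager = resource_manager

    def on_exception_information(self, task_manager_id: str) -> None:
        # Stand-in for the RPC request to the resource manager.
        self.resource_manager.notify_task_manager_exception(task_manager_id)


heartbeat = HeartbeatManager()
resources = ResourceManager(heartbeat)
jobs = JobManager(resources)
resources.register("tm-1")
jobs.on_exception_information("tm-1")
print(resources.recovery_triggered)  # ['tm-1']
```

The point of the model is the ordering: the registration is cancelled as soon as the notification arrives, without waiting for a heartbeat timeout.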

In this embodiment of the present invention, the job manager triggering a fault recovery policy for the exception information further includes: the resource manager calls back the job manager so that the job manager acquires the tasks that failed to execute on the target task manager; and the job manager reassigns the failed tasks to a newly started task manager to achieve fault recovery. That is, the job manager creates a newly started task manager and assigns the tasks that failed on the target task manager to it for processing; alternatively, the failed tasks may be assigned to another task manager corresponding to the job manager.
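The reassignment step can be illustrated with a small pure function. This is a sketch only: the dictionary representation of task assignments and the name `reassign_failed_tasks` are assumptions, and the receiving task manager may be either a newly started one or an existing one.

```python
from typing import Dict, List


def reassign_failed_tasks(
    assignments: Dict[str, List[str]],
    failed_task_manager: str,
    receiving_task_manager: str,
) -> Dict[str, List[str]]:
    """Return new assignments with the failed task manager's tasks moved.

    `assignments` maps a task manager identifier to the tasks it runs.
    The input mapping is left unmodified.
    """
    result = {
        tm: list(tasks)
        for tm, tasks in assignments.items()
        if tm != failed_task_manager
    }
    failed_tasks = assignments.get(failed_task_manager, [])
    # Works for a newly started task manager and for an existing one.
    result.setdefault(receiving_task_manager, []).extend(failed_tasks)
    return result


before = {"tm-1": ["map-0", "map-1"], "tm-2": ["sink-0"]}
print(reassign_failed_tasks(before, "tm-1", "tm-3"))
# {'tm-2': ['sink-0'], 'tm-3': ['map-0', 'map-1']}
```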

Fig. 4 is a schematic flowchart of a method for fault handling according to an embodiment of the present invention. The method is implemented on the Flink on K8s architecture and proceeds as follows. The task manager monitors the address information of its corresponding job manager and stores the obtained address information, together with the identifier of the task manager, in a storage file of a temporary file system. The running state of the task manager is monitored by watching for a stop signal sent by the kubelet process and for an exit signal of the task manager process. When an exit signal of the task manager is monitored, the address information of the corresponding job manager is read and parsed from the Wmaari.ts file of the temporary file system according to the identifier of the task manager, and exception information indicating that the task manager is abnormal is sent to the job manager according to the address information. After receiving the exception information, the job manager sends an exception notification indicating that the task manager is abnormal to the resource manager. After receiving the exception notification, the resource manager forwards it to the heartbeat manager. After receiving the exception notification, the heartbeat manager cancels heartbeat monitoring of the task manager and calls back the resource manager. The resource manager then cancels the registration of the task manager, so that the job manager redeploys the tasks that failed on that task manager and assigns them to other normally running task managers corresponding to the job manager, achieving rapid fault recovery.

The fault handling method provided by the embodiment of the invention is implemented on the Flink on K8s architecture. By monitoring the running state of the target task manager, it obtains the exception signal indicating an abnormal running state promptly. After the exception signal is monitored, the address information of the corresponding job manager is read directly from the temporary file system and the exception information of the target task manager is sent to the job manager, so that the job manager triggers a fault recovery policy according to the exception information, redeploys the tasks that failed on the target task manager, and recovers from the fault quickly. Because the exception information reaches the job manager promptly after the fault occurs, the method improves the efficiency of fault handling and therefore of service processing, meets the requirements of real-time service computation, and overcomes the poor timeliness of fault recovery that results from relying on a heartbeat mechanism to detect and recover from faults in the prior art.

As shown in fig. 5, there is further provided an apparatus 500 for fault handling according to an embodiment of the present invention, including:

the monitoring module 501 monitors the running state of the target task manager;

the sending module 502, in response to monitoring an exception signal indicating that the running state is abnormal, sends exception information indicating that the target task manager is abnormal to a job manager corresponding to the target task manager, so that the job manager triggers a fault recovery policy for the exception information.

In one implementation of the embodiment of the present invention, the exception signal is a stop signal sent by the kubelet process.

In another implementation manner of the embodiment of the present invention, the exception signal is an exit signal of the target task manager.

In this embodiment of the present invention, the sending module 502 is further configured to: acquire the address information of the job manager from the cache, and send the exception information to the job manager according to the address information.

In this embodiment of the present invention, the sending module 502 is further configured to: before acquiring the address information of the job manager from the cache, monitor the address information of the job manager corresponding to the target task manager, and store the monitored address information of the job manager in the cache in association with the identifier of the target task manager.

In this embodiment of the present invention, the sending module 502 is further configured to: cause the job manager to send an exception notification indicating that the target task manager is abnormal to the resource manager, cause the resource manager to send the exception notification to the heartbeat manager, cause the heartbeat manager to cancel heartbeat monitoring of the target task manager after receiving the exception notification, and cause the heartbeat manager to call back the resource manager, so that the resource manager cancels the registration of the target task manager to trigger a fault recovery policy.

In this embodiment of the present invention, the sending module 502 is further configured to: cause the resource manager to call back the job manager so that the job manager acquires the tasks that failed to execute on the target task manager; and cause the job manager to reassign the failed tasks to a newly started task manager to achieve fault recovery.

According to another aspect of an embodiment of the present invention, there is provided an electronic apparatus including: one or more processors; the storage device is used for storing one or more programs, and when the one or more programs are executed by one or more processors, the one or more processors realize the fault processing method provided by the invention.

Fig. 6 shows an exemplary system architecture 600 of a fault handling apparatus or method to which embodiments of the invention may be applied.

As shown in fig. 6, the system architecture 600 may include terminal devices 601, 602, 603, a network 604, and a server 605. The network 604 serves to provide a medium for communication links between the terminal devices 601, 602, 603 and the server 605. Network 604 may include various types of connections, such as wired or wireless communication links, or fiber optic cables.

A user may use the terminal devices 601, 602, 603 to interact with the server 605 via the network 604 to receive or send messages or the like. The terminal devices 601, 602, 603 may have installed thereon various communication client applications, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).

The terminal devices 601, 602, 603 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.

The server 605 may be a server that provides various services, such as a background management server or a cloud server (just an example) that supports shopping websites browsed by users using the terminal devices 601, 602, 603. The background management server or the cloud server may analyze and otherwise process data such as a received data query request, and feed back a processing result (for example, a data query result, by way of example only) to the terminal device.

It should be noted that the method for handling the fault provided by the embodiment of the present invention is generally executed by the server 605, and accordingly, the device for handling the fault is generally disposed in the server 605.

It should be understood that the number of terminal devices, networks, and servers in fig. 6 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the use range of the embodiment of the present invention.

As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU)701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.

The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is mounted into the storage section 708 as necessary.

In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 701.

It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present invention, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor, which may be described as: a processor including a monitoring module and a sending module. In some cases the names of these modules do not limit the modules themselves; for example, the monitoring module may also be described as a "module that monitors the running state of a target task manager".

As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to perform: monitoring the running state of a target task manager; and, in response to monitoring an exception signal indicating that the running state is abnormal, sending exception information indicating that the target task manager is abnormal to a job manager corresponding to the target task manager, so that the job manager triggers a fault recovery policy for the exception information.

According to the technical solution of the embodiment of the present invention, the fault handling method is implemented on a Flink on K8s architecture. By monitoring the running state of the target task manager, an exception signal indicating that the running state is abnormal can be obtained promptly. After the exception signal is detected, the address information of the corresponding job manager is obtained directly from the temporary file system, and the exception information of the target task manager is sent to that job manager. The job manager then triggers a fault recovery policy according to the exception information and redeploys the tasks that failed in the target task manager, so that the fault is recovered quickly. Because the exception information is sent to the job manager promptly after the fault occurs, fault handling is more efficient, which in turn improves service processing efficiency and meets the real-time requirements of service computation. This overcomes the poor fault recovery timeliness of the prior art, which relies on a heartbeat mechanism to detect and recover from faults.
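The flow summarized above (detect an exception signal, look up the job manager address in a file, send the exception information) can be illustrated with a minimal Python sketch. It is not the implementation of the invention and does not use any Flink or Kubernetes API. The class `TaskManagerWatcher`, the one-line address file, the message fields, and the injected `send` callable are all hypothetical names introduced for illustration, and `SIGTERM` is assumed here as a stand-in for the stop signal.

```python
import signal
import tempfile
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional

# Hypothetical message shape; the source does not define the fields.
ExceptionInfo = Dict[str, str]
# Hypothetical transport: (job_manager_address, exception_info) -> None.
Sender = Callable[[str, ExceptionInfo], None]


def read_job_manager_address(address_file: Path) -> str:
    """Read the cached job manager address (assumed to be one line of text)."""
    return address_file.read_text(encoding="utf-8").strip()


class TaskManagerWatcher:
    """Reports an abnormal task manager to its job manager when signalled."""

    def __init__(self, task_manager_id: str, address_file: Path, send: Sender) -> None:
        self.task_manager_id = task_manager_id
        self.address_file = address_file
        self.send = send

    def on_signal(self, signum: int, frame: Optional[object] = None) -> None:
        """Signal handler: look up the job manager and send exception info."""
        address = read_job_manager_address(self.address_file)
        info: ExceptionInfo = {
            "task_manager_id": self.task_manager_id,
            "signal": signal.Signals(signum).name,
            "state": "abnormal",
        }
        self.send(address, info)

    def install(self, signals: Iterable[int] = (signal.SIGTERM,)) -> None:
        """Register the handler; must be called from the main thread."""
        for signum in signals:
            signal.signal(signum, self.on_signal)


if __name__ == "__main__":
    # Demo with an in-memory sender instead of a real network call.
    with tempfile.TemporaryDirectory() as tmp:
        address_file = Path(tmp) / "job_manager_address"
        address_file.write_text("jobmanager.example:6123\n", encoding="utf-8")
        sent = []
        watcher = TaskManagerWatcher(
            "tm-1", address_file, lambda addr, info: sent.append((addr, info))
        )
        watcher.on_signal(signal.SIGTERM)  # simulate receiving the stop signal
        print(sent)
```

Passing the sender in as a callable keeps the sketch independent of any particular transport; in a real deployment the call would presumably go to the job manager over whatever RPC channel the framework provides.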

The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method of fault handling, comprising:

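monitoring the running state of a target task manager; and

in response to monitoring an exception signal indicating that the running state is abnormal, sending exception information indicating that the target task manager is abnormal to a job manager corresponding to the target task manager, so that the job manager triggers a fault recovery policy for the exception information.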

2. The method of claim 1, wherein the exception signal is a stop signal sent by a kubelet process.

3. The method of claim 1, wherein the exception signal is an exit signal of the target task manager.

4. The method of claim 1, wherein sending exception information indicating that the target task manager is abnormal to a job manager corresponding to the target task manager comprises:

5. The method of claim 4, further comprising, prior to obtaining the address information of the job manager from the cache:

6. The method of claim 1, wherein the job manager triggering a fault recovery policy for the exception information comprises:

7. The method of claim 6, wherein the job manager triggering a fault recovery policy for the exception information further comprises:

8. An apparatus for fault handling, comprising:

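a monitoring module configured to monitor the running state of a target task manager; and

a sending module configured to, in response to monitoring an exception signal indicating that the running state is abnormal, send exception information indicating that the target task manager is abnormal to a job manager corresponding to the target task manager, so that the job manager triggers a fault recovery policy for the exception information.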

9. An electronic device, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.

10. A computer-readable medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-7.