CN114968636A - Fault processing method and device - Google Patents


Info

Publication number
CN114968636A
Authority
CN
China
Prior art keywords
manager
target task
abnormal
job
task manager
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210543996.5A
Other languages
Chinese (zh)
Inventor
裴周宇
付海涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN202210543996.5A
Publication of CN114968636A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions

Abstract

The invention discloses a fault processing method and device, and relates to the technical field of computers. One embodiment of the method comprises: monitoring the running state of a target task manager; in response to monitoring an exception signal indicating that the running state is abnormal, sending exception information indicating that the target task manager is abnormal to a job manager corresponding to the target task manager, so that the job manager triggers a fault recovery policy for the exception information. According to the embodiment, an exception signal indicating that the running state of the task manager is abnormal is obtained in real time and the corresponding job manager is notified in time, so that the job manager triggers fault recovery in time, which greatly increases the task recovery speed and improves service processing efficiency.

Description

Fault processing method and device
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for processing a fault.
Background
With the development of big data technology, businesses place ever higher demands on real-time performance, and more and more of them use real-time computing to accelerate their development. Cloud-native real-time computing products are increasingly widely used, and the Flink on K8s architecture has become mainstream.
Under the Flink on K8s architecture, detecting that a task manager (TaskManager) process is abnormal depends on the heartbeat mechanism between the task manager and the resource manager (ResourceManager). The default heartbeat timeout is generally more than 60 seconds, and a fault recovery task is triggered only after the timeout expires. Because the heartbeat timeout is long, fault recovery takes a long time and cannot meet the real-time requirements of service computation.
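To make the latency concern concrete, the following minimal sketch models timeout-based failure detection. The class `HeartbeatMonitor`, its method names, and the 60-second value are illustrative assumptions taken from the description above; this is not Flink's actual implementation or configuration.

```python
class HeartbeatMonitor:
    """Toy model of timeout-based failure detection (hypothetical, not Flink's API)."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen = {}  # task manager id -> time of last heartbeat

    def heartbeat(self, tm_id: str, now: float) -> None:
        # Record that the task manager was alive at time `now`.
        self.last_seen[tm_id] = now

    def timed_out(self, now: float) -> list:
        # A task manager is declared failed only after the full timeout elapses.
        return [tm for tm, seen in self.last_seen.items()
                if now - seen > self.timeout_s]


monitor = HeartbeatMonitor(timeout_s=60.0)  # assumed timeout, per the text above
monitor.heartbeat("tm-1", now=0.0)          # last heartbeat, then the process dies
print(monitor.timed_out(now=30.0))          # the failure is not yet visible
print(monitor.timed_out(now=61.0))          # detected only after the timeout
```

In this model a process that dies right after a heartbeat stays undetected for the whole timeout window, which is the delay the method described below aims to avoid.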
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for fault handling, which can obtain an abnormal signal for a task manager in real time and notify a corresponding job manager in time, so that the job manager triggers fault recovery in time, thereby greatly increasing a task recovery speed and improving service handling efficiency.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a method of fault handling, including:
monitoring the running state of a target task manager;
in response to monitoring an exception signal indicating that the running state is abnormal, sending exception information indicating that the target task manager is abnormal to a job manager corresponding to the target task manager, so that the job manager triggers a fault recovery policy for the exception information.
Optionally, the exception signal is a stop signal sent by the kubelet process.
Optionally, the exception signal is an exit signal of the target task manager.
Optionally, sending exception information indicating that the target task manager is abnormal to a job manager corresponding to the target task manager, including:
and acquiring the address information of the job manager from a cache, and sending the abnormal information to the job manager according to the address information.
Optionally, before acquiring the address information of the job manager from the cache, the method includes:
and monitoring the address information of the job manager corresponding to the target task manager, and correspondingly storing the monitored address information of the job manager and the identifier of the target task manager into the cache.
Optionally, the job manager triggers a failure recovery policy for the exception information, including:
the job manager sends an exception notification indicating that the target task manager is abnormal to a resource manager, the resource manager sends the exception notification to a heartbeat manager, the heartbeat manager cancels heartbeat monitoring of the target task manager after receiving the exception notification, and the heartbeat manager calls back the resource manager so that the resource manager cancels registration of the target task manager to trigger the fault recovery policy.
Optionally, the job manager triggers a failure recovery policy for the exception information, further comprising:
the resource manager calls back the job manager to enable the job manager to acquire the task which fails to be executed in the target task manager;
and the job manager redistributes the task with failed execution to a newly started task manager to realize fault recovery.
According to still another aspect of the embodiments of the present invention, there is provided an apparatus for fault handling, including:
the monitoring module monitors the running state of the target task manager;
and the sending module is used for responding to an abnormal signal which indicates that the running state is abnormal and sending abnormal information which indicates that the target task manager is abnormal to a job manager corresponding to the target task manager so as to enable the job manager to trigger a fault recovery strategy aiming at the abnormal information.
According to another aspect of an embodiment of the present invention, there is provided an electronic apparatus including:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method for fault handling provided by the present invention.
According to a further aspect of the embodiments of the present invention, there is provided a computer-readable medium on which a computer program is stored, the program, when executed by a processor, implementing the method of fault handling provided by the present invention.
One embodiment of the above invention has the following advantages or benefits: by monitoring the running state of the target task manager, an exception signal indicating an abnormal running state can be detected in time, so that exception information can be sent to the corresponding job manager in time, the job manager triggers a fault recovery task in time, and the job is recovered in time. The method can accelerate task recovery and improve the efficiency of service processing.
Further effects of the above optional, non-conventional implementations are described below in connection with the specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of the main flow of a method of fault handling according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a Flink on K8s architecture according to an embodiment of the invention;
FIG. 3 is a schematic main flow diagram of another method of fault handling according to an embodiment of the invention;
FIG. 4 is a flow diagram of a method of fault handling in accordance with an embodiment of the present invention;
FIG. 5 is a schematic diagram of the main modules of a fault handling apparatus according to an embodiment of the present invention;
FIG. 6 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 7 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
FIG. 1 is a schematic diagram of the main flow of a method of fault handling according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
step S101: monitoring the running state of a target task manager;
step S102: judging whether an abnormal signal indicating that the running state is abnormal has been monitored; if yes, executing step S103; otherwise, returning to step S101;
step S103: and sending exception information indicating that the target task manager is abnormal to a job manager corresponding to the target task manager so that the job manager triggers a fault recovery strategy aiming at the exception information.
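Steps S101 to S103 can be sketched as a simple loop. This is a minimal illustration under stated assumptions: the function name `monitor_task_manager`, the shape of the exception information, and the use of `None` to mean "running normally" are hypothetical and not taken from the patent.

```python
from typing import Callable, Iterable, Optional


def monitor_task_manager(
    states: Iterable[Optional[str]],
    notify_job_manager: Callable[[dict], None],
    tm_id: str,
) -> bool:
    """Walk through observed states; None means 'running normally'.

    Returns True if an abnormal signal was seen and reported.
    """
    for signal_name in states:       # S101: monitor the running state
        if signal_name is None:      # S102: no abnormal signal, keep monitoring
            continue
        # S103: report the abnormality to the corresponding job manager
        notify_job_manager({"task_manager": tm_id, "signal": signal_name})
        return True
    return False


sent = []
monitor_task_manager([None, None, "TERM"], sent.append, tm_id="tm-1")
print(sent)
```

Passing the notifier in as a callable keeps the loop independent of how the job manager is actually reached.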
The fault handling method is implemented on the Flink on K8s architecture, where Flink is a stateful computing engine based on data streams and K8s is short for Kubernetes, a container orchestration and scheduling engine. FIG. 2 is a schematic diagram of a Flink on K8s architecture according to an embodiment of the present invention. A user may send a request for creating a Flink cluster to the K8s cluster through a K8s client on a Web service (e.g., the jrc platform). The request is a command for creating K8s Deployments (a Deployment is the K8s object used for Pod management). The K8s Master (the master node of the K8s cluster) creates the K8s Deployments according to the Deployment information in the request, pulls the image information from the Docker Registry (the image repository of the application container engine), and obtains the K8s Pods corresponding to each K8s Deployment, namely the Pods corresponding to the JobManager (job manager) and the TaskManager (task manager), respectively. The TaskManager obtains the address information of the JobManager through a distributed coordination service and, after acquiring the address information, registers with the JobManager, which completes the creation of the Flink cluster on K8s. The JobManager and the TaskManager store data processing results in the back-end storage HDFS (Hadoop Distributed File System)/OSS (Object Storage Service). The user can then submit a job to the Flink cluster through the K8s client so that the Flink cluster executes the job.
In the embodiment of the present invention, monitoring the running state of the target task manager reveals whether it is running normally. If the running state is normal, monitoring continues; if an abnormal signal indicating that the running state is abnormal is monitored, the target task manager is about to become or has already become abnormal and cannot continue executing tasks. The exception signal may be a stop signal (TERM) sent by the kubelet process (a process on a slave node of the K8s cluster that handles tasks issued to the slave node by the K8s master node and manages Pods and the containers in them), which indicates that the K8s Pod corresponding to the target task manager is about to stop running. The exception signal may also be an exit signal (CHILD) of the target task manager, that is, an exit signal of the process corresponding to the target task manager.
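A minimal sketch of how the two abnormal signals might be recognized on a POSIX system follows. It assumes that the TERM and CHILD signals named above correspond to `SIGTERM` and `SIGCHLD`, and that monitoring runs in a wrapper process that receives them; the names `describe_signal` and `install_handlers` are hypothetical.

```python
import signal

# Assumed mapping from POSIX signals to the two abnormal signals named in the text.
ABNORMAL_SIGNALS = {
    signal.SIGTERM: "stop signal from kubelet (TERM)",
    signal.SIGCHLD: "exit signal of the task manager process (CHILD)",
}


def describe_signal(signum: int):
    """Return exception information for an abnormal signal, or None otherwise."""
    reason = ABNORMAL_SIGNALS.get(signum)
    if reason is None:
        return None
    return {"signal": signal.Signals(signum).name, "reason": reason}


def install_handlers(on_abnormal) -> None:
    """Register handlers so that on_abnormal(info) runs when either signal arrives."""
    def handler(signum, frame):
        info = describe_signal(signum)
        if info is not None:
            on_abnormal(info)

    for signum in ABNORMAL_SIGNALS:
        signal.signal(signum, handler)


print(describe_signal(signal.SIGTERM))
```

`install_handlers` is defined but not called here, so running the sketch does not change the signal handling of the current process; a wrapper would call it once at startup with a function that notifies the job manager.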
In an optional implementation of the embodiment of the present invention, after the exception signal is monitored, sending exception information indicating that the target task manager is abnormal to a job manager corresponding to the target task manager includes: acquiring the address information of the job manager from the cache, and sending the exception information to the job manager according to the address information.
In the embodiment of the present invention, as shown in fig. 3, before acquiring the address information of the job manager from the cache, the method includes:
step S301: monitoring address information of a job manager corresponding to a target task manager;
step S302: and correspondingly storing the monitored address information of the job manager and the identifier of the target task manager into a cache.
In the embodiment of the invention, the job manager is a main node of a Flink cluster and is used for coordinating stream processing jobs; each master node may correspond to one or more slave nodes, which are task managers, i.e. each job manager may correspond to one or more task managers for performing specific tasks assigned by the job manager, i.e. for executing specific data stream processing logic.
The target task manager may obtain the address information of its corresponding job manager, that is, a server address (e.g., a WebMonitor address for receiving HTTP requests), by monitoring the ZK cluster. It then stores the monitored address information of the job manager in the cache in correspondence with the identifier of the target task manager. The address information and the identifier may further be stored in a temporary file system serving as the cache, that is, the temporary file system holds a storage file (e.g., a wmaari.ts file) containing the identifier of the target task manager and the address information of the corresponding job manager.
In the embodiment of the invention, the address information of the job manager corresponding to the target task manager may change. When the target task manager detects such a change, it acquires and stores the changed address information, updating the address information associated with its identifier in the cache. Because the cache is kept in the temporary file system, the address information can be updated promptly without affecting the local disk.
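The caching behaviour described above can be sketched as follows. The JSON layout, the per-task-manager file name, and the function names are assumptions made for illustration; the patent does not specify the format of the storage file, and a temporary directory stands in for the temporary file system.

```python
import json
import tempfile
from pathlib import Path


def store_job_manager_address(cache_dir: Path, tm_id: str, jm_address: str) -> Path:
    """Write (or overwrite) the storage file holding the task manager id and JM address."""
    path = cache_dir / f"{tm_id}.json"  # hypothetical file name and format
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"task_manager": tm_id, "job_manager": jm_address}))
    tmp.replace(path)                   # rename so readers never see a partial file
    return path


def load_job_manager_address(cache_dir: Path, tm_id: str) -> str:
    """Parse the storage file and return the cached job manager address."""
    record = json.loads((cache_dir / f"{tm_id}.json").read_text())
    return record["job_manager"]


with tempfile.TemporaryDirectory() as d:  # stands in for the temporary file system
    cache = Path(d)
    store_job_manager_address(cache, "tm-1", "http://jm-a:8081")
    store_job_manager_address(cache, "tm-1", "http://jm-b:8081")  # address changed
    print(load_job_manager_address(cache, "tm-1"))
```

Writing to a side file and renaming it is one way to keep an update from being observed half-written when the abnormal signal arrives during a refresh.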
When an abnormal signal indicating that the running state of the target task manager is abnormal is monitored, the identifier of the target task manager is obtained, the storage file is retrieved from the temporary file system according to that identifier, and the storage file is parsed to obtain the address information of the job manager corresponding to the target task manager. Exception information indicating that the target task manager is abnormal, that is, information indicating that the target task manager process has exited, is then sent to the job manager according to the address information, notifying the job manager that its corresponding target task manager is abnormal.
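Since the job manager address is described as a WebMonitor address that receives HTTP requests, the notification could be built roughly as below. The URL path, the JSON body, and the function name `build_exception_request` are hypothetical; they are not a documented Flink REST route or the patent's actual message format.

```python
import json
import urllib.request


def build_exception_request(jm_address: str, tm_id: str, signal_name: str) -> urllib.request.Request:
    """Build (but do not send) an HTTP POST telling the job manager that tm_id exited."""
    # The path below is a hypothetical endpoint, not a documented Flink REST route.
    url = f"{jm_address.rstrip('/')}/taskmanagers/{tm_id}/abnormal"
    body = json.dumps({"task_manager": tm_id, "signal": signal_name}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_exception_request("http://jm-a:8081/", "tm-1", "TERM")
print(req.get_method(), req.full_url)
# Sending would be: urllib.request.urlopen(req, timeout=2)
```

A short timeout on the actual send would be a sensible choice, because the notifying process may itself be about to terminate.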
In the embodiment of the present invention, after the job manager receives the exception information, it triggers a fault recovery policy for the exception information. For example, if the exception signal is an exit signal of the target task manager process, the process may be restarted to achieve fault recovery; alternatively, the tasks that failed on the target task manager may be acquired and allocated to another task manager corresponding to the job manager for execution, so that the tasks are redeployed and the fault is recovered quickly.
In an embodiment of the present invention, after the job manager receives the exception information, the job manager triggers a failure recovery policy for the exception information, including: the job manager sends an exception notification indicating that the target task manager is abnormal to the resource manager, the resource manager sends the exception notification to the heartbeat manager, the heartbeat manager cancels heartbeat monitoring on the target task manager after receiving the exception notification, and the heartbeat manager calls back the resource manager so that the resource manager cancels registration on the target task manager to trigger a fault recovery strategy.
In the embodiment of the present invention, the resource manager serves the job manager and the task managers in the Flink cluster and is the unit responsible for resource scheduling in the Flink cluster. The heartbeat manager performs heartbeat monitoring on the task managers to determine whether their running state is normal. The job manager calls an interface of the resource manager (e.g., a task manager end signal interface, a killsignaltab manager interface) through an RPC request to send the exception notification to the resource manager. When the heartbeat manager cancels heartbeat monitoring of the target task manager, it cleans up information such as the identifier of the target task manager. When the resource manager cancels the registration of the target task manager, it likewise cleans up information such as the identifier of the target task manager.
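The notification chain among the job manager, resource manager, and heartbeat manager can be sketched with plain objects. All class and method names below are hypothetical stand-ins, and direct method calls take the place of the RPC request; this is not Flink's internal API.

```python
class HeartbeatManager:
    def __init__(self):
        self.monitored = set()

    def on_exception(self, tm_id, callback):
        # Stop heartbeat monitoring and clean up the task manager's identifier,
        # then call back the resource manager.
        self.monitored.discard(tm_id)
        callback(tm_id)


class ResourceManager:
    def __init__(self, heartbeat_manager):
        self.heartbeats = heartbeat_manager
        self.registered = set()
        self.events = []

    def register(self, tm_id):
        self.registered.add(tm_id)
        self.heartbeats.monitored.add(tm_id)

    def notify_task_manager_abnormal(self, tm_id):
        # Entry point the job manager would reach via RPC (hypothetical name).
        self.events.append(("notified", tm_id))
        self.heartbeats.on_exception(tm_id, self._unregister)

    def _unregister(self, tm_id):
        # Cancelling the registration is what triggers fault recovery.
        self.registered.discard(tm_id)
        self.events.append(("unregistered", tm_id))


class JobManager:
    def __init__(self, resource_manager):
        self.rm = resource_manager

    def on_exception_info(self, info):
        self.rm.notify_task_manager_abnormal(info["task_manager"])


hb = HeartbeatManager()
rm = ResourceManager(hb)
rm.register("tm-1")
JobManager(rm).on_exception_info({"task_manager": "tm-1", "signal": "TERM"})
print(rm.events, rm.registered, hb.monitored)
```

The sketch reuses the same deregistration step a heartbeat timeout would reach, but enters it as soon as the exception information arrives.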
In this embodiment of the present invention, the triggering, by the job manager, of a fault recovery policy for the exception information further includes: the resource manager calls back the job manager so that the job manager acquires the tasks that failed to execute on the target task manager; and the job manager reassigns the failed tasks to a newly started task manager to achieve fault recovery. That is, the job manager creates a newly started task manager and assigns the tasks that failed on the target task manager to it for processing; alternatively, the failed tasks may be assigned to another task manager corresponding to the job manager.
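The reassignment step can be illustrated with a small pure function. The data layout (a mapping from task manager identifier to task names) and the name `reassign_failed_tasks` are assumptions for illustration only.

```python
from typing import Dict, List


def reassign_failed_tasks(
    assignments: Dict[str, List[str]],
    failed_tm: str,
    new_tm: str,
) -> Dict[str, List[str]]:
    """Return a new assignment map with the failed task manager's tasks moved to new_tm."""
    # Copy every surviving task manager's task list, dropping the failed one.
    updated = {tm: list(tasks) for tm, tasks in assignments.items() if tm != failed_tm}
    failed_tasks = assignments.get(failed_tm, [])
    # new_tm may be freshly started (no entry yet) or an existing task manager.
    updated.setdefault(new_tm, []).extend(failed_tasks)
    return updated


before = {"tm-1": ["map-0", "map-1"], "tm-2": ["sink-0"]}
print(reassign_failed_tasks(before, failed_tm="tm-1", new_tm="tm-3"))
```

Passing an existing identifier as `new_tm` covers the alternative in which the failed tasks go to another task manager that is already running.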
FIG. 4 is a schematic flowchart of a method for handling a fault according to an embodiment of the present invention, where the method is implemented on the Flink on K8s architecture. A task manager monitors the address information of its corresponding job manager and stores the obtained address information, together with the identifier of the task manager, in a storage file of a temporary file system. The running state of the task manager is monitored by watching for a stop signal sent by the kubelet process to the task manager and for an exit signal of the task manager process. When the exit signal of the task manager is monitored, the address information of the corresponding job manager is read and parsed from the Wmaari.ts file of the temporary file system according to the identifier of the task manager, and exception information indicating that the task manager is abnormal is sent to the job manager according to the address information. After receiving the exception information, the job manager sends an exception notification indicating that the task manager is abnormal to the resource manager. After receiving the exception notification, the resource manager sends it to the heartbeat manager. After receiving the exception notification, the heartbeat manager cancels heartbeat monitoring of the task manager and calls back the resource manager. The resource manager cancels the registration of the task manager, so that the job manager redeploys the tasks that failed on the task manager and assigns them to other normally running task managers corresponding to the job manager, thereby achieving rapid recovery from the fault.
The fault handling method provided by the embodiment of the invention is implemented on the Flink on K8s architecture. By monitoring the running state of a target task manager, it can promptly obtain an abnormal signal indicating an abnormal running state and, after the abnormal signal is monitored, directly obtain the address information of the corresponding job manager from the temporary file system in order to send the exception information of the target task manager to the job manager. The job manager then triggers a fault recovery policy according to the exception information and redeploys the tasks that failed on the target task manager, achieving quick recovery from the fault. The method can send the exception information to the job manager promptly after a fault occurs to trigger the fault recovery policy, which improves the efficiency of fault handling and of service processing, meets the requirements of real-time service computation, and overcomes the poor timeliness of fault recovery caused by relying on a heartbeat mechanism to detect and recover from faults in the prior art.
As shown in fig. 5, there is further provided an apparatus 500 for fault handling according to an embodiment of the present invention, including:
the monitoring module 501 monitors the running state of the target task manager;
the sending module 502, in response to monitoring an exception signal indicating that the running state is abnormal, sends exception information indicating that the target task manager is abnormal to a job manager corresponding to the target task manager, so that the job manager triggers a fault recovery policy for the exception information.
In one implementation of the embodiment of the present invention, the exception signal is a stop signal sent by the kubelet process.
In another implementation manner of the embodiment of the present invention, the exception signal is an exit signal of the target task manager.
In this embodiment of the present invention, the sending module 502 is further configured to: and acquiring the address information of the job manager from the cache, and sending exception information to the job manager according to the address information.
In this embodiment of the present invention, the sending module 502 is further configured to: before the address information of the job manager is acquired from the cache, the address information of the job manager corresponding to the target task manager is monitored, and the monitored address information of the job manager and the identifier of the target task manager are correspondingly stored in the cache.
In this embodiment of the present invention, the sending module 502 is further configured to: cause the job manager to send an exception notification indicating that the target task manager is abnormal to the resource manager, cause the resource manager to send the exception notification to the heartbeat manager, cause the heartbeat manager to cancel heartbeat monitoring of the target task manager after receiving the exception notification, and cause the heartbeat manager to call back the resource manager so that the resource manager cancels the registration of the target task manager to trigger a fault recovery policy.
In this embodiment of the present invention, the sending module 502 is further configured to: cause the resource manager to call back the job manager so that the job manager acquires the tasks that failed to execute on the target task manager, and cause the job manager to reassign the failed tasks to a newly started task manager to achieve fault recovery.
According to another aspect of an embodiment of the present invention, there is provided an electronic apparatus including: one or more processors; the storage device is used for storing one or more programs, and when the one or more programs are executed by one or more processors, the one or more processors realize the fault processing method provided by the invention.
According to a further aspect of the embodiments of the present invention, there is provided a computer-readable medium on which a computer program is stored, the program, when executed by a processor, implementing the method of fault handling provided by the present invention.
Fig. 6 shows an exemplary system architecture 600 of a fault handling apparatus or method to which embodiments of the invention may be applied.
As shown in FIG. 6, the system architecture 600 may include terminal devices 601, 602, 603, a network 604, and a server 605. The network 604 serves as a medium providing communication links between the terminal devices 601, 602, 603 and the server 605. The network 604 may include various types of connections, such as wired or wireless communication links, or fiber optic cables.
A user may use the terminal devices 601, 602, 603 to interact with the server 605 via the network 604 to receive or send messages or the like. The terminal devices 601, 602, 603 may have installed thereon various communication client applications, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 601, 602, 603 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 605 may be a server that provides various services, such as a background management server or a cloud server (by way of example only) that supports shopping websites browsed by users using the terminal devices 601, 602, 603. The background management server or the cloud server may analyze and otherwise process data such as a received data query request, and feed back a processing result (for example, a data query result, by way of example only) to the terminal device.
It should be noted that the method for handling the fault provided by the embodiment of the present invention is generally executed by the server 605, and accordingly, the device for handling the fault is generally disposed in the server 605.
It should be understood that the number of terminal devices, networks, and servers in fig. 6 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the use range of the embodiment of the present invention.
As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU)701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is installed into the storage section 708 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 701.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present invention, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a monitoring module and a sending module. The names of these modules do not, in some cases, constitute a limitation on the modules themselves; for example, the monitoring module may also be described as a "module that monitors the running state of a target task manager".
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to perform: monitoring the running state of a target task manager; and in response to monitoring an exception signal indicating that the running state is abnormal, sending exception information indicating that the target task manager is abnormal to a job manager corresponding to the target task manager, so that the job manager triggers a fault recovery policy for the exception information.
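As a non-limiting illustration, the two steps carried by the program (monitoring the running state, then sending exception information when an exception signal is observed) can be sketched as follows. The names `ExceptionInfo`, `TaskManagerMonitor`, and the `send` callback are hypothetical and do not come from the disclosure; the sketch also assumes that the stop signal reaches the task manager process as `SIGTERM`.

```python
import signal
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ExceptionInfo:
    """Hypothetical payload telling the job manager which task manager failed."""
    task_manager_id: str
    reason: str


@dataclass
class TaskManagerMonitor:
    """Watches one task manager process and reports exception signals."""
    task_manager_id: str
    send: Callable[[ExceptionInfo], None]  # hypothetical transport to the job manager
    reported: List[ExceptionInfo] = field(default_factory=list)

    def on_signal(self, signum: int, frame=None) -> None:
        # Assumption: a stop signal delivered to the process means the
        # running state of the task manager is abnormal.
        info = ExceptionInfo(self.task_manager_id, signal.Signals(signum).name)
        self.reported.append(info)
        self.send(info)

    def install(self) -> None:
        # Register the handler; this must be called from the main thread.
        signal.signal(signal.SIGTERM, self.on_signal)
```

In this sketch the report is pushed as soon as the signal handler runs, rather than waiting for a heartbeat timeout to expire on the job manager side.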
According to the technical solution of the embodiment of the present invention, the fault handling method is implemented on a Flink on K8s architecture. By monitoring the running state of the target task manager, an abnormal signal indicating that the running state is abnormal can be obtained in time. After the abnormal signal is detected, the address information of the corresponding job manager is obtained directly from the temporary file system, and the abnormal information of the target task manager is sent to the job manager. The job manager then triggers a fault recovery policy according to the abnormal information and redeploys the tasks that failed to execute in the target task manager, so that the fault is recovered quickly. Because the abnormal information is sent to the job manager promptly after the fault occurs, the method improves the efficiency of fault handling and therefore of service processing, meets the requirements of real-time service computation, and overcomes the poor fault recovery timeliness of the prior art, in which faults are detected and recovered through a heartbeat mechanism.
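The address lookup described above can be illustrated with a minimal sketch in which the job manager address is stored, keyed by the task manager identifier, in a file on a temporary file system and read back when a fault is detected. The class `AddressCache`, the JSON layout, and the file name are assumptions made for illustration only; the disclosure does not specify a storage format.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Optional


def default_cache_path() -> Path:
    # Hypothetical location on the temporary file system.
    return Path(tempfile.gettempdir()) / "job_manager_addresses.json"


class AddressCache:
    """Maps a task manager identifier to its job manager address."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def _load(self) -> Dict[str, str]:
        # An absent file simply means nothing has been cached yet.
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text(encoding="utf-8"))

    def store(self, task_manager_id: str, job_manager_address: str) -> None:
        # Called whenever a (new) job manager address is observed.
        entries = self._load()
        entries[task_manager_id] = job_manager_address
        self.path.write_text(json.dumps(entries), encoding="utf-8")

    def lookup(self, task_manager_id: str) -> Optional[str]:
        # Called on the fault path; returns None if no address is cached.
        return self._load().get(task_manager_id)
```

Keeping the address in a local file means the fault path needs no remote lookup before the exception information is sent.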
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of fault handling, comprising:
monitoring the running state of a target task manager;
in response to monitoring an exception signal indicating that the running state is abnormal, sending exception information indicating that the target task manager is abnormal to a job manager corresponding to the target task manager, so that the job manager triggers a fault recovery policy for the exception information.
2. The method of claim 1, wherein the exception signal is a stop signal sent by a kubelet process.
3. The method of claim 1, wherein the exception signal is an exit signal of the target task manager.
4. The method of claim 1, wherein sending exception information indicating that the target task manager is abnormal to a job manager corresponding to the target task manager comprises:
acquiring the address information of the job manager from a cache, and sending the exception information to the job manager according to the address information.
5. The method of claim 4, further comprising, prior to acquiring the address information of the job manager from the cache:
monitoring the address information of the job manager corresponding to the target task manager, and storing the monitored address information of the job manager in the cache in association with the identifier of the target task manager.
6. The method of claim 1, wherein the job manager triggering a fault recovery policy for the exception information comprises:
the job manager sends an exception notification indicating that the target task manager is abnormal to a resource manager, the resource manager sends the exception notification to a heartbeat manager, the heartbeat manager cancels heartbeat monitoring of the target task manager after receiving the exception notification, and the heartbeat manager calls back the resource manager so that the resource manager cancels the registration of the target task manager to trigger the fault recovery policy.
7. The method of claim 6, wherein the job manager triggering a fault recovery policy for the exception information further comprises:
the resource manager calls back the job manager so that the job manager acquires the tasks that failed to execute in the target task manager;
and the job manager reassigns the tasks that failed to execute to a newly started task manager to achieve fault recovery.
8. An apparatus for fault handling, comprising:
a monitoring module configured to monitor the running state of a target task manager;
and a sending module configured to, in response to monitoring an exception signal indicating that the running state is abnormal, send exception information indicating that the target task manager is abnormal to a job manager corresponding to the target task manager, so that the job manager triggers a fault recovery policy for the exception information.
9. An electronic device, comprising:
one or more processors;
a storage device storing one or more programs which,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.
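By way of illustration only, and without limiting the claims, the recovery sequence recited in claims 6 and 7 can be sketched as a chain of callbacks. The classes below are simplified stand-ins written for this example and are not the Flink implementations; the `replacement_id` parameter assumes that a newly started task manager is already known.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class HeartbeatManager:
    monitored: Set[str] = field(default_factory=set)

    def on_exception(self, tm_id: str, callback: Callable[[str], None]) -> None:
        # Cancel heartbeat monitoring, then call back the resource manager.
        self.monitored.discard(tm_id)
        callback(tm_id)


@dataclass
class ResourceManager:
    heartbeats: HeartbeatManager
    registered: Set[str] = field(default_factory=set)

    def on_exception(self, tm_id: str, notify_job_manager: Callable[[str], None]) -> None:
        def unregister(failed_id: str) -> None:
            # Cancel the registration, then call back the job manager.
            self.registered.discard(failed_id)
            notify_job_manager(failed_id)

        self.heartbeats.on_exception(tm_id, unregister)


@dataclass
class JobManager:
    resources: ResourceManager
    assignments: Dict[str, List[str]] = field(default_factory=dict)

    def handle_exception(self, tm_id: str, replacement_id: str) -> None:
        def reassign(failed_id: str) -> None:
            # Move the failed tasks to the newly started task manager.
            failed_tasks = self.assignments.pop(failed_id, [])
            self.assignments.setdefault(replacement_id, []).extend(failed_tasks)

        self.resources.on_exception(tm_id, reassign)
```

The ordering mirrors the claims: heartbeat monitoring is cancelled before the registration is cancelled, and the tasks are reassigned only after both steps have completed.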
CN202210543996.5A 2022-05-19 2022-05-19 Fault processing method and device Pending CN114968636A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210543996.5A CN114968636A (en) 2022-05-19 2022-05-19 Fault processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210543996.5A CN114968636A (en) 2022-05-19 2022-05-19 Fault processing method and device

Publications (1)

Publication Number Publication Date
CN114968636A true CN114968636A (en) 2022-08-30

Family

ID=82985674

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210543996.5A Pending CN114968636A (en) 2022-05-19 2022-05-19 Fault processing method and device

Country Status (1)

Country Link
CN (1) CN114968636A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114419756A (en) * 2022-01-30 2022-04-29 重庆长安汽车股份有限公司 Method and system for dynamically capturing abnormal scene of whole vehicle
CN114419756B (en) * 2022-01-30 2023-05-16 重庆长安汽车股份有限公司 Method and system for dynamically capturing abnormal scene of whole vehicle

Similar Documents

Publication Publication Date Title
CN108737270B (en) Resource management method and device for server cluster
US9742651B2 (en) Client-side fault tolerance in a publish-subscribe system
CN108733461B (en) Distributed task scheduling method and device
CN108681777B (en) Method and device for running machine learning program based on distributed system
US10491687B2 (en) Method and system for flexible node composition on local or distributed computer systems
US20140149352A1 (en) High availability for cloud servers
CN107729176B (en) Disaster recovery method and disaster recovery system for configuration file management system
CN109783151B (en) Method and device for rule change
CN109245908B (en) Method and device for switching master cluster and slave cluster
US20130283291A1 (en) Managing Business Process Messaging
CN111427701A (en) Workflow engine system and business processing method
CN113900834B (en) Data processing method, device, equipment and storage medium based on Internet of things technology
CN111526049B (en) Operation and maintenance system, operation and maintenance method, electronic device and storage medium
CN114968636A (en) Fault processing method and device
CN113220433B (en) Agent program operation management method and system
CN113079098B (en) Method, device, equipment and computer readable medium for updating route
CN111240760B (en) Application publishing method, system, storage medium and equipment based on registry
US20210149709A1 (en) Method and apparatus for processing transaction
US11381665B2 (en) Tracking client sessions in publish and subscribe systems using a shared repository
CN114785861B (en) Service request forwarding system, method, computer equipment and storage medium
CN111831503A (en) Monitoring method based on monitoring agent and monitoring agent device
US20230093004A1 (en) System and method for asynchronous backend processing of expensive command line interface commands
CN111301789A (en) Application software packaging method and device
CN108833147B (en) Configuration information updating method and device
CN112202605A (en) Service configuration method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination