CN111930565A - Process fault self-healing method, device and equipment for components in distributed management system

Info

Publication number: CN111930565A
Application number: CN202010703292.0A
Authority: CN
Inventors: 高永伟
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Cloud Computing Beijing Co Ltd
Priority date: 2020-07-21
Filing date: 2020-07-21
Publication date: 2020-11-13
Anticipated expiration: 2040-07-21
Also published as: CN111930565B

Abstract

The application discloses a process fault self-healing method, device and equipment for a component in a distributed management system. The method comprises the following steps: acquiring configuration information comprising an application program interface address of a distributed management server in the distributed management system and a metadata database address of a metadata database in the distributed management system; acquiring the current running state of the component from the distributed management server by using the application program interface address; acquiring metadata from the metadata database by using the metadata database address; performing a fault check on the component according to the current running state and the metadata; and, when a faulty component is detected, sending a process restart task of the faulty component to the distributed management server by using the application program interface. With the technical scheme provided by the embodiments of the application, cross-version compatibility can be achieved, the self-healing process is visible, the service integration code is not intruded upon, and process fault self-healing of components in the distributed management system is simple and efficient.

Description

Process fault self-healing method, device and equipment for components in distributed management system

Technical Field

The present application relates to the field of internet communication technologies, and in particular, to a method, an apparatus, and a device for process fault self-healing of a component in a distributed management system.

Background

With the development of internet communication technology, some large-scale internet service systems adopt distributed cluster management because of service complexity and other reasons, and a large number of service management systems for distributed cluster management, such as Apache Ambari, have emerged. However, as the number of nodes in a single managed distributed cluster gradually increases, various component failures caused by hardware and software occur. Some common failures, such as process termination caused by insufficient memory, network jitter, or disk IO overload, can usually be self-healed simply by starting the process again.

Existing fault self-healing schemes mainly fall into two types. In the first, the component global configuration file of the distributed management system is modified and the service is restarted, after which an agent node in the distributed management system handles process fault self-healing of the component. In the second, the former global configuration is abolished, the fault self-healing related information is defined by modifying the service integration code, and whether to enable fault self-healing is configured in the web interface of the distributed management system. However, both schemes suffer from version incompatibility. In addition, when a component process terminates abnormally, the process is recovered silently in the background of the distributed management system, so the self-healing process can only be followed by logging in to the server and checking the log of the agent node; the self-healing process is therefore a black box. Furthermore, the first scheme requires modifying the configuration file of the component and then restarting, while in the second scheme the fault self-healing capability of each service is disabled by default and must be enabled manually one service at a time, and the fault self-healing related information must be defined by modifying the service integration code, so the operation is complex and the maintenance cost of the service integration code is high. Therefore, a more reliable or efficient solution is needed.

Disclosure of Invention

The application provides a process fault self-healing method, device and equipment for a component in a distributed management system, which can achieve cross-version compatibility and a visible self-healing process without intruding on the service integration code, and which are simple and efficient.

In one aspect, the present application provides a process fault self-healing method for a component in a distributed management system, where the method includes:

acquiring configuration information, wherein the configuration information comprises an application program interface address of a distributed management server in a distributed management system and a metadata database address of a metadata database in the distributed management system;

acquiring the current running state of the component under the service corresponding to the distributed management system from the distributed management server by using the application program interface address;

acquiring metadata of the distributed management system from the metadata database by using the metadata database address, wherein the metadata represents services corresponding to the distributed management system and working states of components under the services;

performing fault checking on the component according to the current operating state and the metadata;

and when the existence of the fault component is detected, sending a process restart task of the fault component to the distributed management server by using the application program interface.

Another aspect provides a process fault self-healing device for a component in a distributed management system, the device including:

a configuration information acquisition module, configured to acquire configuration information, wherein the configuration information comprises an application program interface address of a distributed management server in a distributed management system and a metadata database address of a metadata database in the distributed management system;

a current operation state obtaining module, configured to obtain, from the distributed management server, a current operation state of a component under a service corresponding to the distributed management system by using the application program interface address;

a metadata acquisition module, configured to acquire metadata of the distributed management system from the metadata database by using the metadata database address, wherein the metadata represents the services corresponding to the distributed management system and the working states of the components under the services;

a fault checking module, configured to perform a fault check on the component according to the current running state and the metadata;

and a process restart task triggering module, configured to send a process restart task of the faulty component to the distributed management server by using the application program interface when a faulty component is detected.

Another aspect provides a process fault self-healing device for a component in a distributed management system, where the device includes a processor and a memory, where the memory stores computer instructions, and the computer instructions are loaded and executed by the processor to implement the process fault self-healing method for a component in a distributed management system as described above.

The method, the device and the equipment for process fault self-healing of the components in the distributed management system have the following technical effects:

By means of the configuration information, which comprises the application program interface address of the distributed management server and the metadata database address of the metadata database in the distributed management system, the current running state of the components under the services corresponding to the distributed management system can be acquired from the distributed management server in real time, and metadata comprising the services corresponding to the distributed management system and the working states of the components under those services can be acquired from the metadata database. A fault check can then be performed on the components according to the current running state and the metadata. When a faulty component is detected, the process restart task of the faulty component is automatically triggered through the application program interface of the distributed management server, so that process fault self-healing of components in the distributed management system is achieved, and fault self-healing of the service to which the component belongs and of the node where the component is located is further ensured. Because the current running state of the component and the metadata of the distributed management system are obtained in the background through the application program interface address of the distributed management server and the metadata database address of the metadata database, a user can directly observe, through entry points such as the web UI, the API, and command line tools, when a component failed and when it recovered, together with the complete output of the recovery process. The self-healing process is thus visible, and the black-box nature of existing self-healing processes is avoided.
Because the process restart task is issued directly through the application program interface, the scheme is compatible with the original fault self-healing technology of the distributed management system, provides a fault self-healing capability that is universal across versions, and preserves the original user experience. Because the distributed management server does not need to be modified or restarted, modification of the service integration code running on the distributed management system is avoided; the scheme is therefore non-intrusive to the service integration code, and fault self-healing is simpler and more efficient.

Drawings

In order to more clearly illustrate the technical solutions and advantages of the embodiments of the present application or the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of an application environment provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of a process fault self-healing method for a component in a distributed management system according to an embodiment of the present disclosure;

FIG. 3 is a schematic flow chart illustrating a fault checking of the component according to the current operating state and the metadata according to an embodiment of the present application;

FIG. 4 is a schematic flow chart illustrating a fault checking process performed on the component according to the current operating state and the metadata according to an embodiment of the present application;

FIG. 5 is a schematic flowchart of another method for process fault self-healing of a component in a distributed management system according to an embodiment of the present application;

FIG. 6 is a schematic flowchart of another method for process fault self-healing of a component in a distributed management system according to an embodiment of the present application;

FIG. 7 is a schematic flowchart of another method for process fault self-healing of a component in a distributed management system according to an embodiment of the present application;

FIG. 8 is a schematic flowchart of another method for process fault self-healing of a component in a distributed management system according to an embodiment of the present application;

FIG. 9 is a schematic flowchart of another method for process fault self-healing of a component in a distributed management system according to an embodiment of the present application;

FIG. 10 is a schematic structural diagram of a process fault self-healing apparatus of a component in a distributed management system according to an embodiment of the present application;

FIG. 11 is a hardware block diagram of an apparatus for performing a process fault self-healing method for a component in a distributed management system according to an embodiment of the present disclosure.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Referring to fig. 1, fig. 1 is a schematic diagram of an application environment according to an embodiment of the present disclosure, and as shown in fig. 1, the application environment at least includes a fault self-healing control end 100, a distributed management system 200, and a distributed cluster 300.

In this embodiment of the present specification, the fault self-healing control end 100 may be configured to perform fault self-healing control on components under the services corresponding to the distributed management system. Specifically, the fault self-healing control end 100 may include a client, and may also include a server.

In this embodiment, the distributed management system 200 may include a distributed management server 201, a metadata database 202, and a plurality of agent nodes 203. Specifically, the distributed management system 200 may be an integrated cluster management system used for service definition, management, monitoring, and the like of a distributed cluster and its ecosystem components. In this embodiment of the present specification, the distributed management system 200 may provide a simple, easy-to-use web UI (User Interface) and a set of standard RESTful APIs (Application Programming Interfaces).

In this embodiment, the distributed cluster 300 may include a plurality of nodes, and the plurality of nodes may include clients or servers.

In this embodiment, the plurality of agent nodes 203 may be deployed on a plurality of nodes of the distributed cluster 300; for example, one agent node is deployed on each node of the distributed cluster 300. Accordingly, the distributed management server 201 may collect, through the plurality of agent nodes 203, the working states of the services and of the components under the services on the plurality of nodes of the distributed cluster 300, and store these working states in the metadata database. In addition, the distributed management server 201 may also collect the working states of the services on the plurality of agent nodes 203 and of the components under those services, and store the working states of the plurality of agent nodes 203 in the metadata database.

In this embodiment of the present disclosure, the fault self-healing control end 100, the distributed management system 200, and the distributed cluster 300 may be directly or indirectly connected through wired or wireless communication, which is not limited herein.

In a specific embodiment, the fault self-healing control end 100 may establish a communication connection with the distributed management server 201 through the application program interface of the distributed management server in the distributed management system 200.

In embodiments of the present description, the client may include, but is not limited to, physical devices such as a smart phone, a desktop computer, a tablet computer, a notebook computer, a smart speaker, a digital assistant, an Augmented Reality (AR)/Virtual Reality (VR) device, and a smart wearable device, and may also include software running on such physical devices. The operating system running on the physical device in the embodiments of the present application may include, but is not limited to, Android, iOS, Linux, Windows, and the like.

In this embodiment, the server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a CDN (Content Delivery Network), and big data and artificial intelligence platforms.

In a specific embodiment, when the distributed management system 200 or the distributed cluster 300 is applied to a blockchain system, the system may be formed by a plurality of nodes (any form of computing devices in an access network, such as servers and user terminals), a Peer-to-Peer (P2P) network is formed between the nodes, and the P2P Protocol is an application layer Protocol running on top of a Transmission Control Protocol (TCP). In a distributed system, any machine, such as a server or a terminal, can join to become a node, and the node comprises a hardware layer, a middle layer, an operating system layer and an application layer. Specifically, the functions of each node in the blockchain system may include:

1) Routing, a basic function of a node, used to support communication between nodes.

Besides the routing function, the node may also have the following functions:

2) Application, which is deployed in the blockchain to implement specific services according to actual service requirements, records data related to the implemented functions to form record data, carries a digital signature in the record data to indicate the source of the task data, and sends the record data to other nodes in the blockchain system, so that the other nodes add the record data to a temporary block when the source and integrity of the record data are successfully verified.

An embodiment of a process fault self-healing method for a component in a distributed management system according to the present application is described below. FIG. 2 is a flowchart illustrating a process fault self-healing method for a component in a distributed management system according to an embodiment of the present application. The present specification provides the method operation steps described in the embodiment or the flowchart, but more or fewer operation steps may be included based on conventional or non-inventive effort. The order of steps recited in the embodiments is merely one of many possible execution orders and does not represent the only order of execution. In actual implementation, the system or client product may execute the steps sequentially or in parallel (for example, in the context of parallel processors or multi-threaded processing) according to the embodiments or the methods shown in the figures. Specifically, as shown in FIG. 2, the method may include:

S201: acquiring configuration information.

In this embodiment of the present description, the configuration information of the distributed management system whose components require process fault self-healing control may be configured in advance, and may be loaded locally when process fault self-healing control is performed. Specifically, the configuration information may include an application program interface address of the distributed management server in the distributed management system and a metadata database address of the metadata database in the distributed management system.

In practical application, the fault self-healing control end may create a daemon process, and the daemon process may be used to load configuration information, so as to obtain relevant information for performing process fault self-healing control on the component by combining an application program interface address and a metadata base address in the configuration information.
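As an illustration of step S201, the following is a minimal sketch of how a daemon process might load the two addresses from a local configuration. The JSON format, the key names `api_address` and `metadata_db_address`, and the example addresses are assumptions made for this sketch and are not specified by the application.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SelfHealingConfig:
    api_address: str          # API address of the distributed management server
    metadata_db_address: str  # address of the metadata database


def load_config(text: str) -> SelfHealingConfig:
    """Parse locally stored configuration text (JSON is an assumed format)."""
    raw = json.loads(text)
    # Both addresses are required by the method; reject incomplete configuration.
    missing = [k for k in ("api_address", "metadata_db_address") if not raw.get(k)]
    if missing:
        raise ValueError(f"missing configuration keys: {missing}")
    return SelfHealingConfig(raw["api_address"], raw["metadata_db_address"])


# Hypothetical configuration content, for illustration only.
EXAMPLE_CONFIG = json.dumps({
    "api_address": "http://mgmt.example:8080/api/v1",
    "metadata_db_address": "db.example:5432/metadata",
})
config = load_config(EXAMPLE_CONFIG)
```

Failing fast on a missing address keeps the daemon from starting a check round that could not complete steps S203 or S205.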

S203: acquiring the current running state of the component under the corresponding service of the distributed management system from the distributed management server by using the application program interface address.

In this embodiment, the distributed management server may provide a general application program interface. Accordingly, the fault self-healing control end may use the address information of the application program interface (the application program interface address) to obtain, from the distributed management server, the current running state of the components under the services corresponding to the distributed management system. Specifically, the daemon process of the fault self-healing control end may obtain this current running state from the distributed management server based on the application program interface address. The current running state of a component may include a normal operation state or a stop operation state.

In this embodiment, the component under the service corresponding to the distributed management system may include a plurality of agent nodes included in the distributed management system itself and a component under the service provided by the managed distributed cluster.
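Step S203 can be sketched as parsing the reply of the management server into a per-component state map. The reply shape (an `items` list with `node`, `component`, and `state` fields) and the state labels are hypothetical; a real distributed management server defines its own response format, and the network call itself is omitted here.

```python
import json
from typing import Dict, Tuple

# Hypothetical state labels; the application only distinguishes a normal
# operation state from a stop operation state.
RUNNING, STOPPED = "RUNNING", "STOPPED"


def parse_current_states(payload: str) -> Dict[Tuple[str, str], str]:
    """Map (node, component) to the current running state found in a JSON reply."""
    states = {}
    for item in json.loads(payload).get("items", []):
        states[(item["node"], item["component"])] = item["state"]
    return states


# Stand-in for the body returned by the management server's API.
SAMPLE_REPLY = json.dumps({"items": [
    {"node": "node-1", "component": "DATANODE", "state": RUNNING},
    {"node": "node-2", "component": "DATANODE", "state": STOPPED},
]})
current_states = parse_current_states(SAMPLE_REPLY)
```

Keying the map by node and component keeps instances of the same component on different nodes distinct, which matters because a restart task targets one instance.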

S205: acquiring the metadata of the distributed management system from the metadata database by using the metadata database address.

In this embodiment, the daemon process of the fault self-healing control end may obtain the metadata of the distributed management system from the metadata database based on the metadata database address. Specifically, the metadata may represent the services corresponding to the distributed management system and the working states of the components under those services.

In this embodiment, the services corresponding to the distributed management system may include the plurality of agent nodes included in the distributed management system itself and the services provided by the managed distributed cluster.
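Step S205 can be sketched as a plain query against the metadata database. The example below uses an in-memory SQLite database purely as a stand-in; the table name `component_metadata`, its columns, and the sample row are hypothetical, and a real deployment would connect to the metadata database address from the configuration with the appropriate driver.

```python
import sqlite3
from typing import Dict, List


def fetch_metadata(conn: sqlite3.Connection) -> List[Dict[str, str]]:
    """Read one metadata row per component (hypothetical table and columns)."""
    conn.row_factory = sqlite3.Row
    rows = conn.execute(
        "SELECT node, service, component, target_state, "
        "component_mode, service_mode, node_mode FROM component_metadata"
    ).fetchall()
    return [dict(row) for row in rows]


# In-memory stand-in for the metadata database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE component_metadata (node TEXT, service TEXT, component TEXT, "
    "target_state TEXT, component_mode TEXT, service_mode TEXT, node_mode TEXT)"
)
conn.execute(
    "INSERT INTO component_metadata VALUES "
    "('node-1', 'HDFS', 'DATANODE', 'STARTED', 'OFF', 'OFF', 'OFF')"
)
metadata = fetch_metadata(conn)
```

The daemon only reads from the metadata database in this sketch, which is consistent with the stated goal of not modifying the distributed management server.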

S207: performing a fault check on the component according to the current running state and the metadata.

In a particular embodiment, the metadata may include a target operation state of the component, mode information of the component, mode information of the service to which the component belongs, and mode information of the node where the component is located. Accordingly, as shown in FIG. 3, performing a fault check on the component according to the current running state and the metadata may include:

S301: determining, according to the current operation state and the target operation state of the component, a component to be self-healed whose current operation state is the stop operation state and whose target operation state is the start operation state;

S303: performing a maintenance mode check on the component to be self-healed according to the mode information of the component to be self-healed, the mode information of the service to which the component to be self-healed belongs, and the mode information of the node where the component to be self-healed is located;

S305: when the mode information of the component to be self-healed, the mode information of the service to which it belongs, and the mode information of the node where it is located all indicate the non-maintenance mode, determining that the component to be self-healed is a faulty component.

In this embodiment of the present description, the target operation state of the component may include a start operation state or a stop operation state, and may be set according to the actual application requirements of the component. In practical applications, the target operation state of most components is the start operation state; when a component needs to be maintained, or some conflict requires the component to stop operating, the target operation state of the component may be the stop operation state.

In the embodiment of the present specification, a component that needs process fault self-healing may be a component whose current operation state is the stop operation state and whose target operation state is the start operation state.

In the embodiment of the present specification, the mode information may include a maintenance mode and a non-maintenance mode. In practical application, when the mode information of any one of a component, the service to which the component belongs, and the node where the component is located indicates the maintenance mode, the component is usually required to be in the stop operation state. Accordingly, to avoid restarting the process of a component that must remain stopped because of the maintenance mode, a maintenance mode check can be performed on each component to be self-healed, that is, each component whose current operation state is the stop operation state and whose target operation state is the start operation state. When the mode information of the component to be self-healed, the mode information of the service to which it belongs, and the mode information of the node where it is located all indicate the non-maintenance mode, the component to be self-healed is determined to be a faulty component.

Further, when any one of the mode information of the component to be self-healed, the mode information of the service to which it belongs, and the mode information of the node where it is located indicates the maintenance mode, the daemon process performs no further fault self-healing operation, and instead again acquires the current running state of the component from the distributed management server and the metadata of the distributed management system from the metadata database.
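The check in steps S301 to S305 can be sketched as a filter over the metadata. The state and mode labels and the `ComponentMeta` record are hypothetical names introduced for this sketch only.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical labels for states and modes.
STARTED, STOPPED = "STARTED", "STOPPED"
MAINTENANCE, NON_MAINTENANCE = "MAINTENANCE", "NON_MAINTENANCE"


@dataclass(frozen=True)
class ComponentMeta:
    name: str
    target_state: str
    component_mode: str
    service_mode: str
    node_mode: str


def find_faulty(current: Dict[str, str], metadata: List[ComponentMeta]) -> List[str]:
    """Return names of components that are stopped, should be started,
    and are not covered by any maintenance mode."""
    faulty = []
    for meta in metadata:
        # S301: candidate only if currently stopped while the target is started.
        if current.get(meta.name) != STOPPED or meta.target_state != STARTED:
            continue
        # S303/S305: component, service, and node must all be non-maintenance.
        modes = (meta.component_mode, meta.service_mode, meta.node_mode)
        if all(mode == NON_MAINTENANCE for mode in modes):
            faulty.append(meta.name)
    return faulty


current = {"a": STOPPED, "b": STOPPED, "c": STARTED, "d": STOPPED}
metadata = [
    ComponentMeta("a", STARTED, NON_MAINTENANCE, NON_MAINTENANCE, NON_MAINTENANCE),
    ComponentMeta("b", STARTED, NON_MAINTENANCE, MAINTENANCE, NON_MAINTENANCE),
    ComponentMeta("c", STARTED, NON_MAINTENANCE, NON_MAINTENANCE, NON_MAINTENANCE),
    ComponentMeta("d", STOPPED, NON_MAINTENANCE, NON_MAINTENANCE, NON_MAINTENANCE),
]
faulty = find_faulty(current, metadata)
```

In this example only component `a` qualifies: `b` is excluded because its service is in the maintenance mode, `c` is running, and `d` is intentionally stopped.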

In other embodiments, to guard against an abnormality in the distributed management system itself, the metadata may also include the current running state of the component. Accordingly, as shown in FIG. 4, performing the fault check on the component according to the current running state and the metadata may include:

S401: performing consistency verification between the current running state of the component in the metadata and the current running state of the component acquired from the distributed management server;

S403: when the consistency verification passes, determining, according to the current operation state and the target operation state of the component, a component to be self-healed whose current operation state is the stop operation state and whose target operation state is the start operation state;

S405: performing a maintenance mode check on the component to be self-healed according to the mode information of the component to be self-healed, the mode information of the service to which the component to be self-healed belongs, and the mode information of the node where the component to be self-healed is located;

S407: when the mode information of the component to be self-healed, the mode information of the service to which it belongs, and the mode information of the node where it is located all indicate the non-maintenance mode, determining that the component to be self-healed is a faulty component.

In the embodiment of the present specification, the consistency verification passes when the current running state of a component in the metadata is consistent with the current running state of the component acquired from the distributed management server; otherwise, the consistency verification fails. A failed consistency verification indicates that the distributed management system itself is abnormal. In that case, the daemon process performs no further fault self-healing operation, and instead again acquires the current running state of the component from the distributed management server and the metadata of the distributed management system from the metadata database.

Furthermore, when the consistency verification fails, related prompt information can be fed back, so that the relevant personnel can check the distributed management system for abnormalities and restore it in time.
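The consistency verification of step S401 can be sketched as a comparison of two state maps. Treating a single mismatch as a failure of the whole round, and returning the mismatching component names so they can be included in the prompt information, are assumptions of this sketch; the function names are hypothetical.

```python
from typing import Dict, List


def inconsistent_components(api_states: Dict[str, str],
                            metadata_states: Dict[str, str]) -> List[str]:
    """List components whose state from the API differs from the state
    recorded in the metadata, including components present on one side only."""
    names = set(api_states) | set(metadata_states)
    return sorted(n for n in names if api_states.get(n) != metadata_states.get(n))


def consistency_passed(api_states: Dict[str, str],
                       metadata_states: Dict[str, str]) -> bool:
    """S401: verification passes only when no component disagrees."""
    return not inconsistent_components(api_states, metadata_states)


api_states = {"a": "STOPPED", "b": "STARTED"}
metadata_states = {"a": "STOPPED", "b": "STOPPED"}
mismatches = inconsistent_components(api_states, metadata_states)
```

When `consistency_passed` returns false, the daemon would skip the remaining steps of the round and could report `mismatches` to the operators.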

S209: when a faulty component is detected, sending a process restart task of the faulty component to the distributed management server by using the application program interface.

In this embodiment of the present description, when a faulty component exists among the components under the services corresponding to the distributed management system, the application program interface may be used to send a process restart task of the faulty component to the distributed management server, so that the distributed management server restarts the process of the faulty component and process fault self-healing of the component is achieved.
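Step S209 can be sketched as building an HTTP request for the management server. The `/tasks` path and the JSON body fields are hypothetical, since the application does not specify the interface; the sketch only constructs the request and does not send it.

```python
import json
import urllib.request


def build_restart_request(api_address: str, node: str,
                          component: str) -> urllib.request.Request:
    """Build a POST request asking the management server to restart one
    component process (the endpoint path and body fields are assumptions)."""
    body = json.dumps(
        {"task": "RESTART", "node": node, "component": component}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{api_address.rstrip('/')}/tasks",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_restart_request("http://mgmt.example:8080/api/v1/", "node-2", "DATANODE")
```

In a running daemon the request would be submitted with `urllib.request.urlopen(request)`, typically with a timeout and any credentials the management server requires. Because the restart is issued as an ordinary task of the management server, it appears in the same task history as manually triggered operations.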

In other embodiments, because a component failure may be caused by a hardware failure or a power failure of the node, in order to avoid a futile process restart, as shown in FIG. 5, before sending a process restart task of the faulty component to the distributed management server by using the application program interface, the method further includes:

S211: when a faulty component is detected, determining the connectivity of the node where the faulty component is located.

Correspondingly, step S209 is replaced by: when the node where the faulty component is located is connected, sending the process restart task of the faulty component to the distributed management server by using the application program interface.

In this embodiment of the present description, the connectivity of the node where the faulty component is located may be determined by attempting to connect to a TCP port of that node. Specifically, if the connection succeeds, the node where the faulty component is located is determined to be connected, and the process restart task of the faulty component is triggered; otherwise, if the connection fails, the node is determined to be not connected, no further fault self-healing operation is performed, and the daemon process again acquires the current running state of the component from the distributed management server and the metadata of the distributed management system from the metadata database.

Further, when it is determined that the node where the faulty component is located is not connected, prompt information that the node where the faulty component is located is not connected can be fed back, so that relevant workers can timely recover the node.
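As an illustrative, non-limiting sketch, the connectivity determination described above could be performed with a plain TCP connection attempt. The function name `node_is_connected`, the timeout value, and the choice of port are assumptions made for illustration; the specification only states that a TCP port of the node is connected to.

```python
import socket


def node_is_connected(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    Hypothetical helper: the port to probe and the timeout are not specified
    in the source and must be chosen by the deployment.
    """
    try:
        # create_connection raises OSError on refusal, unreachable host, or timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

When the function returns False, the daemon process would skip the restart for this cycle and may feed back prompt information about the unreachable node.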

In other embodiments, before sending the process restart task of the faulty component to the distributed management server by using the application program interface, as shown in fig. 6, the method may further include:

S213: when a faulty component is detected, performing task conflict detection on the faulty component;

correspondingly, S209 is replaced by sending the process restart task of the failed component to the distributed management server by using the application program interface when the result of the task conflict detection indicates that no other task exists for the failed component.

In practical application, if the faulty component already has another operation in progress, for example a previously issued restart task, it does not need to be restarted again at this time. To avoid task conflicts or task accumulation, task conflict detection may be performed on the faulty component, and the process restart task of the faulty component is triggered only when the faulty component has no other task.

In this embodiment of the present description, when a result of the task conflict detection indicates that the faulty component has other tasks, no further fault self-healing operation is performed, and the daemon process obtains the current operating state of the component from the distributed management server and obtains the metadata of the distributed management system from the metadata database again.
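By way of a hedged example, task conflict detection can be sketched as a check over the tasks already known for a component. The task record fields (`component`, `host`, `status`) and the set of status values treated as unfinished are hypothetical; the specification does not define how tasks are represented.

```python
from typing import Iterable, Mapping

# Hypothetical status values that mean a task has not finished yet.
ACTIVE_STATES = {"PENDING", "QUEUED", "IN_PROGRESS"}


def has_conflicting_task(
    component: str, host: str, tasks: Iterable[Mapping[str, str]]
) -> bool:
    """Return True if the component on the given host already has an unfinished task."""
    return any(
        task.get("component") == component
        and task.get("host") == host
        and task.get("status") in ACTIVE_STATES
        for task in tasks
    )
```

A restart task would be sent only when this check returns False for the faulty component.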

In other embodiments, the configuration information may further include a service blacklist, and before performing the fault check on the component according to the current operating state and the metadata, as shown in fig. 7, the method further includes:

S215: filtering the components under the services of the distributed management system according to the service blacklist to obtain target components.

Correspondingly, the step S207 of performing fault checking on the component according to the current operating state and the metadata includes: and carrying out fault check on the target component according to the current running state and the metadata.

In practical application, the fault self-healing control can be performed on all the components under the corresponding service of the distributed management system, and the fault self-healing control of part of the components can also be performed by combining a service blacklist.

In a specific embodiment, when fault self-healing control needs to be performed on all components under the services of the distributed management system, the configuration information may be configured as follows:

[services]
include=all
exclude=

In another specific embodiment, when fault self-healing control needs to be performed on only part of the components in combination with the service blacklist, the configuration information may be configured as follows:

[services]
include=all
exclude=YARN,ZOOKEEPER,HDFS

That is, the components under the services YARN, ZOOKEEPER, and HDFS are excluded. Here, YARN (Yet Another Resource Negotiator) is a Hadoop resource manager; ZooKeeper is an open-source coordination service for distributed applications; and HDFS (Hadoop Distributed File System) is a distributed file system designed to run on commodity hardware.

In practical application, the service blacklist may include services that are still being adjusted and have not been formally launched; accordingly, the components under the services in the service blacklist are filtered out to obtain the target components.

In this embodiment of the present specification, when a service to which a certain component belongs is in a service blacklist, no further fault self-healing operation is performed, and the daemon process obtains the current operating state of the component from the distributed management server and obtains the metadata of the distributed management system from the metadata database again.

In addition, it should be noted that, in practical application, a service white list may also be set, and accordingly, a component under the service in the service white list may be used as a target component.
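The screening described above can be sketched as follows, purely as an illustration. The sketch parses the `[services]` section shown earlier with Python's standard `configparser`. Treating a non-`all` value of `include` as a service white list, and the `service` and `name` fields of a component record, are assumptions and are not stated in the specification.

```python
import configparser

CONFIG_TEXT = """
[services]
include = all
exclude = YARN,ZOOKEEPER,HDFS
"""


def _split(value: str) -> set:
    """Split a comma-separated option value into upper-case service names."""
    return {item.strip().upper() for item in value.split(",") if item.strip()}


def select_target_components(components: list, config_text: str) -> list:
    """Drop components whose service is blacklisted; keep the rest as targets."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    include = _split(parser.get("services", "include", fallback="all"))
    exclude = _split(parser.get("services", "exclude", fallback=""))
    targets = []
    for comp in components:
        service = comp["service"].upper()
        if service in exclude:
            continue  # service is on the blacklist
        # Assumption: "all" selects every service, otherwise include is a white list.
        if "ALL" in include or service in include:
            targets.append(comp)
    return targets
```

With the configuration above, a component under HDFS is skipped, while a component under a service that is not listed in `exclude` remains a target component.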

In further embodiments, after sending the process restart task for the failed component to the distributed management server using the application program interface, the method further comprises:

recording the process restart accumulated times of the fault component;

correspondingly, the configuration information further comprises a preset restart upper limit; before sending the process restart task of the faulty component to the distributed management server by using the application program interface, as shown in fig. 8, the method may further include:

S217: when a faulty component is detected, determining the accumulated number of process restarts of the faulty component.

Correspondingly, step S209 is replaced by sending the process restart task of the failed component to the distributed management server by using the application program interface when the accumulated number of process restart times of the failed component is less than or equal to the preset restart upper limit.

In practical application, a component may remain in the stopped state even after multiple restarts. To avoid wasting the resources of the whole system on excessive ineffective restarts, the accumulated number of process restarts of the faulty component may be recorded each time its process restart task is triggered, and the process restart task is no longer triggered after the preset restart upper limit is reached. Correspondingly, the daemon process again acquires the current operating state of the component from the distributed management server and the metadata of the distributed management system from the metadata database.

Furthermore, after the preset restart upper limit is reached, corresponding prompt information can be fed back so that relevant workers can perform further inspection and recovery.

Further, if the restart is successful within the preset restart upper limit, the accumulated number of the process restart of the failed component may be cleared.
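A minimal sketch of the restart count bookkeeping is given below for illustration. The class name `RestartTracker` and its method names are hypothetical, and keeping the counts in memory is an assumption. The comparison follows the "less than or equal to" condition stated for the replaced step S209.

```python
from collections import defaultdict


class RestartTracker:
    """Tracks the accumulated number of process restarts per component."""

    def __init__(self, upper_limit: int) -> None:
        self.upper_limit = upper_limit
        self._counts = defaultdict(int)

    def may_restart(self, component_id: str) -> bool:
        # A restart is allowed while the accumulated count is <= the preset upper limit.
        return self._counts[component_id] <= self.upper_limit

    def record_restart(self, component_id: str) -> int:
        # Called after a process restart task has been sent.
        self._counts[component_id] += 1
        return self._counts[component_id]

    def clear(self, component_id: str) -> None:
        # Called once the component is observed running again.
        self._counts.pop(component_id, None)
```

In use, `may_restart` would be consulted before a restart task is sent, `record_restart` after it is sent, and `clear` once a restart has succeeded.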

As can be seen from the technical solutions provided in the embodiments of the present specification, by means of configuration information that includes the application program interface address of the distributed management server and the metadata database address of the metadata database in the distributed management system, the current operating state of the components under the services of the distributed management system can be acquired in real time from the distributed management server, and metadata representing the services of the distributed management system and the working states of the components under those services can be acquired from the metadata database. Fault checking can then be performed on the components according to the current operating state and the metadata. When a faulty component is detected, its process restart task is triggered automatically through the application program interface of the distributed management server, thereby achieving process fault self-healing of the components in the distributed management system and, in turn, supporting fault self-healing of the service to which the component belongs and of the node where the component is located. In addition, because the current operating state of the component and the metadata of the distributed management system are obtained in the background through the application program interface address of the distributed management server and the metadata database address of the metadata database, a user can observe, through entry points such as the web UI, the API, and command line tools, when a component failed and when it recovered, together with the complete output of the recovery process. The self-healing process is therefore visible, and the black-box nature of existing self-healing processes is avoided.

Furthermore, because the process restart task is sent directly through the application program interface, the solution is compatible with the original fault self-healing mechanism of the distributed management system, provides a fault self-healing capability that is general across versions, and preserves the original user experience. The distributed management server does not need to be modified or restarted, and the service integration code running on the distributed management system does not need to be modified; the solution is therefore non-intrusive to the service integration code and makes fault self-healing simpler and more efficient.

In addition, it should be noted that, in practical application, different fault self-healing conditions may be set according to the actual process fault self-healing requirements for the components, and different conditions may be combined. The embodiments of the present specification are not limited to the conditions described above, namely consistency verification of the current operating state, the conditions on the current operating state and the target operating state, the maintenance mode check, the connectivity check of the node where the faulty component is located, task conflict detection, blacklist screening, and the restart count limit.

In a particular embodiment, the present specification provides an embodiment of a method for self-healing process faults of components in a distributed management system. The present specification provides method steps as described in the examples or flowcharts, but may include more or fewer steps based on routine or non-inventive labor. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. In actual implementation, the system or client product may execute sequentially or in parallel (e.g., in the context of parallel processors or multi-threaded processing) according to the embodiments or methods shown in the figures. Specifically, as shown in fig. 9, the method may include:

S901: acquiring configuration information, wherein the configuration information comprises an application program interface address of a distributed management server in a distributed management system and a metadata database address of a metadata database in the distributed management system;

S903: acquiring the current operating state of the components under the services of the distributed management system from the distributed management server by using the application program interface address, and acquiring metadata of the distributed management system from the metadata database by using the metadata database address, wherein the metadata represents the services of the distributed management system and the working states of the components under the services.

S905: and screening the blacklist of the components under the corresponding services of the distributed management system according to the service blacklist to obtain a target component.

In this embodiment of the present specification, when a service to which a component belongs is in a service blacklist, the process returns to step S903.

S907: and performing consistency verification on the current running state of the component in the metadata and the current running state of the target component acquired from the distributed management server.

S909: when the consistency verification passes, determining, according to the current operating state and the target operating state of the target component, a component to be self-healed whose current operating state is a stopped state and whose target operating state is a started state.

In the embodiment of the present specification, when the consistency verification fails, the process may return to step S903.

S911: and performing maintenance mode check on the component to be self-healed according to the mode information of the component to be self-healed, the mode information of the service of the component to be self-healed and the mode information of the node where the component to be self-healed is located.

S913: and when the mode information of the component to be self-healed, the mode information of the service of the component to be self-healed and the mode information of the node where the component to be self-healed is located are all non-maintenance modes, determining that the component to be self-healed is a fault component.

In this embodiment of the present specification, when any one of the mode information of the component to be self-healed, the mode information of the service to which the component to be self-healed belongs, and the mode information of the node where the component to be self-healed is a maintenance mode, the step S903 may be returned to.

S915: and determining the connectivity of the node where the fault component is located.

S917: when the node where the faulty component is located is connected, determining the accumulated number of process restarts of the faulty component.

In this embodiment of the present specification, when the node where the failed component is located is not connected, the step S903 may be returned to.

S919: and when the accumulated process restart times of the fault component is less than or equal to a preset restart upper limit, performing task conflict detection on the fault component.

In this embodiment of the present description, when the accumulated number of restart times of the process of the failed component is greater than the preset restart upper limit, the process may return to step S903.

S921: and when the task conflict detection result indicates that other tasks do not exist in the fault component, the application program interface is utilized to send the process restart task of the fault component to the distributed management server.

In this embodiment of the present specification, when the result of the task conflict detection indicates that there are other tasks in the faulty component, the process may return to step S903.
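The order of the checks in steps S905 to S921 can be summarized by the following sketch, offered only as an illustration. The `ComponentCheck` record, its field names, and the state labels `STOPPED` and `STARTED` are hypothetical; each field stands for the outcome of one of the checks described above, and every negative outcome corresponds to returning to step S903.

```python
from dataclasses import dataclass


@dataclass
class ComponentCheck:
    """Pre-computed outcomes of the individual checks for one component (hypothetical)."""
    service_blacklisted: bool = False
    states_consistent: bool = True
    current_state: str = "STOPPED"
    target_state: str = "STARTED"
    in_maintenance: bool = False  # component, its service, or its node is in maintenance mode
    node_connected: bool = True
    restart_count: int = 0
    has_other_task: bool = False


def decide(c: ComponentCheck, restart_upper_limit: int) -> tuple:
    """Return (send_restart, reason), applying the checks in the order S905..S921."""
    if c.service_blacklisted:
        return (False, "S905: service is blacklisted")
    if not c.states_consistent:
        return (False, "S907: consistency verification failed")
    if not (c.current_state == "STOPPED" and c.target_state == "STARTED"):
        return (False, "S909: component does not need self-healing")
    if c.in_maintenance:
        return (False, "S911: maintenance mode")
    if not c.node_connected:
        return (False, "S915: node not connected")
    if c.restart_count > restart_upper_limit:
        return (False, "S917: restart upper limit exceeded")
    if c.has_other_task:
        return (False, "S919: other task in progress")
    return (True, "S921: send process restart task")
```

Because the steps may be reordered or combined in actual implementation, this ordering is only one of the possible arrangements.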

An embodiment of the present application further provides a process fault self-healing device for a component in a distributed management system, as shown in fig. 10, the device includes:

a configuration information obtaining module 1010, configured to obtain configuration information, where the configuration information includes an application program interface address of a distributed management server in a distributed management system and a metadata base address of a metadata base in the distributed management system;

a current operating state obtaining module 1020, configured to obtain, from the distributed management server, a current operating state of a component under a service corresponding to the distributed management system by using the application program interface address;

a metadata obtaining module 1030, configured to obtain metadata of the distributed management system from the metadata database by using the metadata database address, where the metadata represents a service corresponding to the distributed management system and a working state of a component under the service;

a failure checking module 1040, configured to perform failure checking on the component according to the current operating state and the metadata;

the process restart task triggering module 1050 may be configured to send, to the distributed management server, a process restart task of the failed component by using the application program interface when it is checked that the failed component exists.

In some embodiments, the metadata may include a target operating state of the component, mode information of a service to which the component belongs, and mode information of a node on which the service component is located;

the fault checking module 1040 may include:

a to-be-self-healed component determining unit, configured to determine, according to the current operating state and the target operating state of the component, a component to be self-healed whose current operating state is a stopped state and whose target operating state is a started state;

the maintenance mode checking unit is used for carrying out maintenance mode checking on the component to be self-healed according to the mode information of the component to be self-healed, the mode information of the service of the component to be self-healed and the mode information of the node where the component to be self-healed is located;

and the fault component determining unit is used for determining that the component to be self-healed is a fault component when the mode information of the component to be self-healed, the mode information of the service of the component to be self-healed and the mode information of the node where the component to be self-healed are all in a non-maintenance mode.

In some embodiments, the metadata may also include a current operating state of the component;

correspondingly, the fault checking module 1040 may further include:

a consistency verification unit, configured to perform consistency verification on a current operation state of a component in the metadata and a current operation state of the component acquired in the distributed management server;

correspondingly, when the consistency verification passes, the to-be-self-healed component determining unit determines, according to the current operating state and the target operating state of the component, the component to be self-healed whose current operating state is a stopped state and whose target operating state is a started state.
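As a hedged illustration of how the units of the fault checking module could cooperate, the sketch below applies the consistency verification, the stopped-versus-started comparison, and the maintenance mode check to plain dictionaries. The dictionary keys and the state labels are assumptions; the specification does not prescribe a data layout for the metadata.

```python
STOPPED, STARTED = "STOPPED", "STARTED"  # hypothetical state labels


def find_faulty_components(live_states: dict, metadata: dict) -> list:
    """Return the ids of components judged faulty.

    live_states: {component_id: state reported by the distributed management server}
    metadata:    {component_id: record read from the metadata database}
    """
    faulty = []
    for comp_id, meta in metadata.items():
        live = live_states.get(comp_id)
        # Consistency verification: both sources must report the same current state.
        if live is None or live != meta["current_state"]:
            continue
        # Component to be self-healed: currently stopped, but expected to be started.
        if not (live == STOPPED and meta["target_state"] == STARTED):
            continue
        # Maintenance mode check: component, service, and node must all be non-maintenance.
        if (
            meta["component_maintenance"]
            or meta["service_maintenance"]
            or meta["node_maintenance"]
        ):
            continue
        faulty.append(comp_id)
    return faulty
```

Components that fail any of the three checks are simply skipped until the next acquisition cycle.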

In some embodiments, the apparatus further comprises the following unit, which operates before the process restart task of the faulty component is sent to the distributed management server by using the application program interface:

a node connectivity determining unit, configured to determine connectivity of a node where the failed component is located;

correspondingly, when the node where the failed component is located is connected, the process restart task triggering module 1050 sends the process restart task of the failed component to the distributed management server by using the application program interface.

in some embodiments, the apparatus further comprises:

the task conflict detection module is used for detecting task conflicts of the fault components;

correspondingly, when the result of the task conflict detection indicates that no other task exists in the failed component, the process restart task triggering module 1050 sends the process restart task of the failed component to the distributed management server by using the application program interface.

In some embodiments, the configuration information further includes a service blacklist, the apparatus further comprising:

the blacklist screening module is used for screening a blacklist of components under the corresponding service of the distributed management system according to the service blacklist to obtain a target component;

correspondingly, the fault checking module 1040 is further configured to perform fault checking on the target component according to the current operating state and the metadata.

In some embodiments, the apparatus further comprises:

a process restart accumulated count recording module, configured to record the accumulated number of process restarts of the faulty component after the process restart task of the faulty component is sent to the distributed management server by using the application program interface.

in some embodiments, the configuration information further includes a preset restart upper limit;

correspondingly, the device further comprises:

a process restart accumulated count determining module, configured to determine the accumulated number of process restarts of the faulty component;

correspondingly, when the accumulated number of restart times of the process of the failed component is less than or equal to the preset restart upper limit, the process restart task triggering module 1050 sends the process restart task of the failed component to the distributed management server by using the application program interface.

The device embodiments and the method embodiments are based on the same inventive concept.

Embodiments of the present specification also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and executes the computer instructions, so that the computer device executes the process fault self-healing method for the component in the distributed management system provided in the above various optional implementation modes.

The embodiment of the application provides a process fault self-healing device for a component in a distributed management system, and the process fault self-healing device for the component in the distributed management system comprises a processor and a memory, wherein a computer instruction is stored in the memory, and the computer instruction is loaded and executed by the processor to implement the process fault self-healing method for the component in the distributed management system provided by the embodiment of the method.

Embodiments of the present application further provide a computer-readable storage medium, where the storage medium may be disposed in a device to store computer instructions related to implementing a process fault self-healing method for a component in a distributed management system in the method embodiments, and the computer instructions are loaded and executed by the processor to implement the process fault self-healing method for a component in a distributed management system provided in the method embodiments.

The memory may be used to store software programs and modules, and the processor may execute various functional applications and data processing by operating the software programs and modules stored in the memory. The memory can mainly comprise a program storage area and a data storage area, wherein the program storage area can store an operating system, application programs needed by functions and the like; the storage data area may store data created according to use of the apparatus, and the like. Further, the memory may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory may also include a memory controller to provide the processor access to the memory.

Alternatively, in the embodiments of the present specification, the storage medium may be located in at least one network server of a plurality of network servers of a computer network. Alternatively, in the embodiments of the present specification, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk, which can store program codes.

The method provided by the embodiment of the application can be executed in a client (a mobile terminal, a computer terminal), a server or a similar operation device. For example, as shown in fig. 11, the client may include RF (Radio Frequency) circuitry 1110, a memory 1120 including one or more computer-readable storage media, an input unit 1130, a display unit 1140, a sensor 1150, audio circuitry 1160, a WiFi (wireless fidelity) module 1170, a processor 1180 including one or more processing cores, and a power supply 1190. Those skilled in the art will appreciate that the client architecture shown in fig. 11 does not constitute a limitation on the client, and may include more or fewer components than shown, or some components in combination, or a different arrangement of components. Wherein:

RF circuit 1110 may be used for receiving and transmitting signals during message transmission or communication. In particular, it receives downlink messages from a base station and passes them to one or more processors 1180 for processing, and it transmits uplink data to the base station. In general, RF circuit 1110 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, an LNA (Low Noise Amplifier), a duplexer, and the like. In addition, RF circuit 1110 may also communicate with networks and other clients via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to GSM (Global System for Mobile communications), GPRS (General Packet Radio Service), CDMA (Code Division Multiple Access), WCDMA (Wideband Code Division Multiple Access), LTE (Long Term Evolution), email, SMS (Short Messaging Service), and the like.

The memory 1120 may be used to store software programs and modules, and the processor 1180 may execute various functional applications and data processing by operating the software programs and modules stored in the memory 1120. The memory 1120 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, application programs required for functions, and the like; the storage data area may store data created according to the use of the client, and the like. Further, the memory 1120 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 1120 may also include a memory controller to provide the processor 1180 and the input unit 1130 access to the memory 1120.

The input unit 1130 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. In particular, input unit 1130 may include a touch-sensitive surface 1131 as well as other input devices 1132. Touch-sensitive surface 1131, also referred to as a touch display screen or a touch pad, may collect touch operations by a user on or near the touch-sensitive surface 1131 (e.g., operations by a user on or near the touch-sensitive surface 1131 using a finger, a stylus, or any other suitable object or attachment), and drive the corresponding connection device according to a preset program. Optionally, touch-sensitive surface 1131 may include two portions, a touch detection device and a touch controller. The touch detection device detects the touch position of a user, detects a signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 1180, and can receive and execute commands sent by the processor 1180. Additionally, touch-sensitive surface 1131 may be implemented using various technologies, such as resistive, capacitive, infrared, and surface acoustic wave technologies. The input unit 1130 may include other input devices 1132 in addition to the touch-sensitive surface 1131. In particular, other input devices 1132 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.

The display unit 1140 may be used to display information input by or provided to the user as well as various graphical user interfaces of the client, which may be made up of graphics, text, icons, video, and any combination thereof. The Display unit 1140 may include a Display panel 1141, and optionally, the Display panel 1141 may be configured in the form of an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode), or the like. Further, touch-sensitive surface 1131 may cover display panel 1141, and when touch operation is detected on or near touch-sensitive surface 1131, the touch operation is transmitted to processor 1180 to determine the type of touch event, and processor 1180 then provides corresponding visual output on display panel 1141 according to the type of touch event. Touch-sensitive surface 1131 and display panel 1141 may be implemented as two separate components for input and output functions, although touch-sensitive surface 1131 and display panel 1141 may be integrated for input and output functions in some embodiments.

The client may also include at least one sensor 1150, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that adjusts the brightness of the display panel 1141 according to the brightness of ambient light, and a proximity sensor that turns off the display panel 1141 and/or the backlight when the client moves to the ear. As one of the motion sensors, the gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when the device is stationary, and can be used for applications (such as horizontal and vertical screen switching, related games, magnetometer attitude calibration) for identifying client gestures, and related functions (such as pedometer and tapping) for vibration identification; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which may be further configured at the client, detailed description is omitted here.

Audio circuitry 1160, speaker 1161, and microphone 1162 may provide an audio interface between a user and the client. The audio circuit 1160 may convert received audio data into an electrical signal and transmit it to the speaker 1161, which converts the electrical signal into a sound signal for output; on the other hand, the microphone 1162 converts collected sound signals into electrical signals, which are received by the audio circuit 1160 and converted into audio data; the audio data is then output to the processor 1180 for processing and sent via the RF circuit 1110 to, for example, another client, or output to the memory 1120 for further processing. Audio circuitry 1160 may also include an earbud jack to provide communication between peripheral headphones and the client.

WiFi belongs to short distance wireless transmission technology, and the client can help the user send and receive e-mail, browse web page and access streaming media, etc. through WiFi module 1170, which provides wireless broadband internet access for the user. Although fig. 11 shows the WiFi module 1170, it is understood that it does not belong to the essential constitution of the client and can be omitted entirely as needed within the scope not changing the essence of the invention.

The processor 1180 is a control center of the client, connects various parts of the whole client by using various interfaces and lines, and executes various functions and processes data of the client by running or executing software programs and/or modules stored in the memory 1120 and calling data stored in the memory 1120, thereby performing overall monitoring of the client. Optionally, processor 1180 may include one or more processing cores; preferably, the processor 1180 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated within processor 1180.

The client further includes a power supply 1190 (such as a battery) for supplying power to the various components. Preferably, the power supply is logically connected to the processor 1180 through a power management system, so that charging, discharging, and power consumption are managed through the power management system. The power supply 1190 may also include any of one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.

Although not shown, the client may further include a camera, a Bluetooth module, and the like, which are not described here. Specifically, in this embodiment, the display unit of the client is a touch screen display, and the client further includes a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors to carry out the instructions of the method embodiments of the present invention.

As can be seen from the embodiments of the process fault self-healing method, apparatus, device, and storage medium for a component in a distributed management system provided by the present application, the configuration information includes an application program interface address of a distributed management server in the distributed management system and a metadata database address of a metadata database in the distributed management system. Using this configuration information, the current running state of each component under the services corresponding to the distributed management system can be obtained from the distributed management server in real time, and metadata, including the services corresponding to the distributed management system and the working states of the components under those services, can be obtained from the metadata database. Fault detection can then be performed on the components according to the current running state and the metadata. When a faulty component is detected, a process restart task for the faulty component is automatically triggered through the application program interface of the distributed management server, which achieves process fault self-healing of the component in the distributed management system and thereby further ensures fault self-healing of the component's service and of the node where the component is located.
Because the current running state of the component and the metadata of the distributed management system are obtained from the backend through the application program interface address of the distributed management server and the metadata database address of the metadata database, a user can see, through entry points such as the web UI, the API, and command-line tools, when a component failed and when it recovered, together with the complete output of the recovery process. The self-healing process is therefore visible, which avoids the black-box nature of existing self-healing processes. Because the process restart task is issued directly through the application program interface, the approach is compatible with the original fault self-healing technology of the distributed management system, provides fault self-healing that works across versions, and preserves the original user experience. Because the distributed management server does not need to be modified or restarted, the service integration code running on the distributed management system does not need to be modified either; the approach is non-intrusive to the service integration code and makes fault self-healing simpler and more efficient.
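
The flow summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the function name `self_heal_pass`, the three injected callables, and the state strings "STOPPED" and "STARTED" are hypothetical stand-ins for the calls made against the application program interface address and the metadata database address.

```python
from typing import Callable, Dict, List


def self_heal_pass(
    fetch_current_states: Callable[[], Dict[str, str]],
    fetch_target_states: Callable[[], Dict[str, str]],
    submit_restart: Callable[[str], None],
) -> List[str]:
    """Run one self-healing pass and return the components that were restarted.

    All three callables are hypothetical: the first stands in for a query to
    the distributed management server, the second for a query to the metadata
    database, and the third for sending a process restart task through the API.
    """
    current = fetch_current_states()   # component name -> current running state
    target = fetch_target_states()     # component name -> target state from metadata

    # A component is treated as faulty when it is stopped but should be started.
    faulty = [
        name
        for name, state in sorted(current.items())
        if state == "STOPPED" and target.get(name) == "STARTED"
    ]

    for name in faulty:
        submit_restart(name)
    return faulty
```

Passing the data sources in as callables keeps the sketch independent of any particular server or database client, which mirrors the non-intrusive character described above.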

It should be noted that the order of the embodiments of the present application is for description only and does not indicate the relative merits of the embodiments. Specific embodiments have been described above; other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or a sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

The embodiments in the present specification are described in a progressive manner: the same or similar parts of the embodiments may be cross-referenced, and each embodiment focuses on its differences from the other embodiments. In particular, because the apparatus, device, and storage medium embodiments are substantially similar to the method embodiments, they are described relatively briefly, and the corresponding descriptions of the method embodiments may be consulted for the relevant points.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware to implement the above embodiments, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.

The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims

1. A process fault self-healing method for a component in a distributed management system is characterized by comprising the following steps:

2. The method of claim 1, wherein the metadata comprises a target operating state of the component, mode information of a service to which the component belongs, and mode information of a node in which the service component is located;

said performing a fault check on said component based on said current operating state and said metadata comprises:

determining, according to the current operating state and the target operating state of the component, a component to be self-healed whose current operating state is a stopped state and whose target operating state is a started state;

performing maintenance mode check on the component to be self-healed according to the mode information of the component to be self-healed, the mode information of the service of the component to be self-healed and the mode information of the node where the component to be self-healed is located;

and when the mode information of the component to be self-healed, the mode information of the service of the component to be self-healed and the mode information of the node where the component to be self-healed is located are all non-maintenance modes, determining that the component to be self-healed is a fault component.
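
As an informal illustration of the check recited in claim 2 (not part of the claims), the sketch below combines the state comparison with the three maintenance-mode checks. The record type `ComponentRecord`, its field names, and the state strings "STOPPED" and "STARTED" are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentRecord:
    """Hypothetical view of one component assembled from server state and metadata."""
    name: str
    current_state: str              # reported by the distributed management server
    target_state: str               # target operating state from the metadata
    component_in_maintenance: bool  # mode information of the component
    service_in_maintenance: bool    # mode information of the component's service
    node_in_maintenance: bool       # mode information of the component's node


def is_faulty(c: ComponentRecord) -> bool:
    """Return True only for a stopped-but-should-run component outside maintenance."""
    to_be_self_healed = c.current_state == "STOPPED" and c.target_state == "STARTED"
    if not to_be_self_healed:
        return False
    # All three levels must be in a non-maintenance mode.
    return not (
        c.component_in_maintenance
        or c.service_in_maintenance
        or c.node_in_maintenance
    )
```

For instance, a stopped component whose node is in maintenance mode is not reported as faulty, so an operator's deliberate shutdown is not undone.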

3. The method of claim 2, wherein the metadata further comprises a current operating state of the component;

correspondingly, the performing fault checking on the component according to the current operating state and the metadata further includes:

performing consistency verification on the current running state of the component in the metadata and the current running state of the component acquired from the distributed management server;

and when the consistency verification passes, determining, according to the current operating state and the target operating state of the component, the component to be self-healed whose current operating state is a stopped state and whose target operating state is a started state.
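
The consistency verification of claim 3 might look like the following sketch (an illustration only, not part of the claims). The function name and state strings are hypothetical, and returning `False` when the two sources disagree is an assumption: the claim only states what happens when the verification passes.

```python
def candidate_for_self_healing(
    server_state: str,
    metadata_state: str,
    target_state: str,
) -> bool:
    """Decide whether a component is a self-healing candidate.

    server_state   -- current state obtained from the distributed management server
    metadata_state -- current state recorded in the metadata database
    target_state   -- target state recorded in the metadata database
    """
    # Consistency verification: both sources must agree on the current state.
    if server_state != metadata_state:
        return False  # assumption: skip the component when the sources disagree
    return server_state == "STOPPED" and target_state == "STARTED"
```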

4. The method of any of claims 1 to 3, wherein prior to the sending the process restart task for the failed component to the distributed management server using the application program interface, the method further comprises:

determining connectivity of a node where the failed component is located;

and when the node where the failed component is located is reachable, sending the process restart task of the failed component to the distributed management server by using the application program interface.

5. The method of any of claims 1 to 3, wherein prior to the sending the process restart task for the failed component to the distributed management server using the application program interface, the method further comprises:

performing task conflict detection on the failed component;

and when the task conflict detection result indicates that no other task exists for the failed component, sending the process restart task of the failed component to the distributed management server by using the application program interface.
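
The pre-restart checks of claims 4 and 5 can be illustrated together (informally, not as part of the claims). Applying both checks in one function is an assumption made for brevity, since each claim recites its check separately; the function and parameter names are hypothetical.

```python
from typing import Callable


def may_send_restart(
    node_is_reachable: Callable[[], bool],
    has_other_task: Callable[[], bool],
) -> bool:
    """Return True when a restart task may be sent for a failed component.

    node_is_reachable -- hypothetical probe of the node hosting the component
    has_other_task    -- hypothetical query for tasks already pending on the component
    """
    # Connectivity check: restarting a process on an unreachable node cannot succeed.
    if not node_is_reachable():
        return False
    # Task conflict detection: do not interfere with a task that is already pending.
    return not has_other_task()
```

Checking connectivity first avoids querying for pending tasks on a node that cannot be restarted anyway.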

6. The method of any of claims 1 to 3, wherein the configuration information further comprises a service blacklist, and wherein, prior to performing a failure check on the component based on the current operating state and the metadata, the method further comprises:

filtering the components under the service corresponding to the distributed management system against the service blacklist to obtain a target component;

correspondingly, the performing fault checking on the component according to the current operating state and the metadata includes:

performing the fault check on the target component according to the current operating state and the metadata.
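
The blacklist filtering of claim 6 could be sketched as below (an illustration only, not part of the claims). The mapping from component name to service name and the function name are assumptions introduced for the example.

```python
from typing import Dict, Iterable, List


def filter_by_service_blacklist(
    component_services: Dict[str, str],
    service_blacklist: Iterable[str],
) -> List[str]:
    """Return the target components whose service is not blacklisted.

    component_services -- hypothetical mapping: component name -> service name
    service_blacklist  -- service names taken from the configuration information
    """
    blocked = set(service_blacklist)
    return sorted(
        name
        for name, service in component_services.items()
        if service not in blocked
    )
```

Only the returned target components would then go through the fault check, so services listed in the blacklist are never restarted automatically.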

7. The method of any of claims 1 to 3, wherein after sending the process restart task for the failed component to the distributed management server using the application program interface, the method further comprises:

recording the cumulative number of process restarts of the failed component.

8. The method of claim 7, wherein the configuration information further comprises a preset restart ceiling;

correspondingly, before sending the process restart task of the failed component to the distributed management server by using the application program interface, the method further comprises:

determining the cumulative number of process restarts of the failed component;

and when the cumulative number of process restarts of the failed component is less than or equal to the preset restart upper limit, sending the process restart task of the failed component to the distributed management server by using the application program interface.
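
The restart counting of claims 7 and 8 can be illustrated with the following sketch (informal, not part of the claims). The class name `RestartBudget` and its methods are hypothetical; the comparison follows the claim wording literally, allowing a restart while the recorded count is less than or equal to the upper limit.

```python
from collections import Counter


class RestartBudget:
    """Track cumulative restarts per component against a preset upper limit."""

    def __init__(self, restart_upper_limit: int) -> None:
        self.restart_upper_limit = restart_upper_limit
        self._counts: Counter = Counter()

    def allows(self, component: str) -> bool:
        # Per the claim wording: send while the count is <= the upper limit.
        return self._counts[component] <= self.restart_upper_limit

    def record(self, component: str) -> None:
        # Called after a restart task has been sent for the component.
        self._counts[component] += 1

    def count(self, component: str) -> int:
        return self._counts[component]
```

With this literal reading, an upper limit of 2 permits restarts at counts 0, 1, and 2; a stricter `<` comparison would be a design choice outside what the claim states.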

9. A process fault self-healing device for a component in a distributed management system, the device comprising:

10. A process fault self-healing apparatus for a component in a distributed management system, the apparatus comprising a processor and a memory, the memory having stored therein computer instructions, the computer instructions being loaded and executed by the processor to implement the process fault self-healing method for a component in a distributed management system according to any one of claims 1 to 8.