CN114064438A - Database fault processing method and device - Google Patents

Database fault processing method and device

Info

Publication number
CN114064438A
Authority
CN
China
Prior art keywords
state
node
component
computer cluster
monitoring data
Prior art date
Legal status
Pending
Application number
CN202111406121.2A
Other languages
Chinese (zh)
Inventor
邓宇
Current Assignee
CCB Finetech Co Ltd
Original Assignee
CCB Finetech Co Ltd
Priority date
Filing date
Publication date
Application filed by CCB Finetech Co Ltd filed Critical CCB Finetech Co Ltd
Priority to CN202111406121.2A
Publication of CN114064438A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F 11/0706 Error or fault processing not based on redundancy, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F 11/0709 Error or fault processing not based on redundancy, the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G06F 11/0793 Remedial or corrective actions
    • G06F 11/14 Error detection or correction of the data by redundancy in operation
    • G06F 11/1402 Saving, restoring, recovering or retrying
    • G06F 11/1415 Saving, restoring, recovering or retrying at system level
    • G06F 11/1441 Resetting or repowering
    • G06F 11/30 Monitoring
    • G06F 11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F 11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G06F 11/302 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system
    • G06F 11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G06F 11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F 11/3466 Performance evaluation by tracing or monitoring
    • G06F 11/3476 Data logging
    • G06F 11/3495 Performance evaluation by tracing or monitoring for systems
    • G06F 2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F 2201/80 Database-specific techniques

Abstract

The invention discloses a database fault processing method and device, relating to the technical field of cloud computing. In one embodiment, the method comprises: obtaining operation monitoring data of each computer cluster through a first type of monitoring service deployed in advance for at least one computer cluster supporting a database, obtaining operation monitoring data of each component through a second type of monitoring service deployed in advance for at least one component contained in each computer cluster, and obtaining operation monitoring data of each node through a third type of monitoring service deployed in advance for at least one node contained in each component; determining the operating state of the computer cluster, the components and/or the nodes according to the obtained operation monitoring data; and when the operating state is not a healthy state, executing a preset recovery action corresponding to the operating state so as to restore it to the healthy state. This embodiment can improve the security and stability of the database.

Description

Database fault processing method and device
Technical Field
The invention relates to the technical field of cloud computing, in particular to a database fault processing method and device.
Background
Cloud computing has given rise to a group of cloud service providers. The cloud services they offer can be divided into IaaS (Infrastructure as a Service), PaaS (Platform as a Service) and SaaS (Software as a Service). Cloud computing provides nearly unlimited low-cost storage and good scalability, allowing a user to initialize a cloud cluster on demand and build secure, stable applications. In IaaS, an MPP (Massively Parallel Processing) database is generally used to aggregate and process large-scale data sets. In an age of exponential data growth, a larger database cluster is required to maintain execution efficiency over massive data, but as the number of cluster nodes grows, the probability that a single node causes a database failure increases significantly. How to minimize the impact of such failures on the entire cluster and recover service quickly has become an urgent problem.
Disclosure of Invention
In view of this, embodiments of the present invention provide a database fault processing method and device that monitor the operating states of computer clusters, components and nodes in real time through pre-deployed monitoring services and, when needed, automatically execute preset recovery actions to quickly clear faults, thereby improving the security and stability of the database.
To achieve the above object, according to one aspect of the present invention, a database fault handling method is provided.
The database fault processing method of the embodiment of the invention comprises: obtaining operation monitoring data of each computer cluster through a first type of monitoring service deployed in advance for at least one computer cluster supporting a database, obtaining operation monitoring data of each component through a second type of monitoring service deployed in advance for at least one component contained in each computer cluster, and obtaining operation monitoring data of each node through a third type of monitoring service deployed in advance for at least one node contained in each component; determining the operating state of the computer cluster, the components and/or the nodes according to the obtained operation monitoring data, wherein the operating state comprises a healthy state; and when the operating state is not the healthy state, executing a preset recovery action corresponding to the operating state so as to restore it to the healthy state.
Optionally, the obtaining the operation monitoring data of each computer cluster through a first type of monitoring service deployed in advance for at least one computer cluster supporting the target database includes: based on the first type of monitoring service, periodically sending a data operation instruction to each computer cluster, acquiring an instruction execution result, and generating operation monitoring data of each computer cluster in the current monitoring period according to the instruction execution result.
Optionally, the obtaining operation monitoring data of the component by a second type monitoring service deployed in advance for at least one component included in each computer cluster includes: and periodically sending a component state query instruction to each component based on the second type of monitoring service, acquiring a query result, and generating operation monitoring data of each component in the current monitoring period according to the query result.
Optionally, a third type of monitoring service is deployed on a physical host of the node; and the obtaining the operation monitoring data of each node through a third type monitoring service deployed for at least one node included in each component in advance comprises: and periodically acquiring the resource utilization information of any node based on the third type of monitoring service deployed in the node, and generating the running monitoring data of each node in the current monitoring period according to the resource utilization information.
Optionally, the generating operation monitoring data of each node in the current monitoring period according to the resource utilization information includes: and periodically acquiring hardware fault information of any node based on a third type of monitoring service deployed in the node, and generating operation monitoring data of each node in the current monitoring period according to the hardware fault information and the resource utilization information.
Optionally, the generating operation monitoring data of each node in the current monitoring period according to the hardware fault information and the resource utilization information includes: and periodically acquiring the service process survival state of any node based on the third type of monitoring service deployed in the node, and generating operation monitoring data of each node in the current monitoring period according to the service process survival state, the hardware fault information and the resource utilization information.
Optionally, the generating, according to the service process survival status, the hardware failure information, and the resource utilization information, operation monitoring data of each node in a current monitoring period includes: and periodically acquiring a response time index of any node based on a third type of monitoring service deployed at the node, and generating operation monitoring data of each node in the current monitoring period according to the response time index, the service process survival state, the hardware fault information and the resource utilization information.
Optionally, the operating state of the node further includes a fault state; and determining the operating state of the computer cluster, the component and/or the node according to the acquired operation monitoring data includes: determining the operating state of the node as a healthy state when the response duration index is not greater than a preset response duration threshold, the service process survival state shows that the service process is running, the hardware fault information shows that no hardware fault exists, and the resource utilization information is not greater than a preset utilization threshold; and determining the operating state of the node as a fault state when the response duration index is greater than the response duration threshold, or the service process survival state shows that the service process has stopped, or the hardware fault information shows that a hardware fault exists, or the resource utilization information is greater than the utilization threshold.
Optionally, the operating state of the component further includes unhealthy and fault states; and determining the operating state of the computer cluster, the component and/or the node according to the acquired operation monitoring data includes: determining the operating state of the component as a healthy state when the query result shows that the component is serving normally and no node in the component is in a fault state; determining the operating state of the component as an unhealthy state when the query result shows that the component is serving normally but at least one node in the component is currently in a fault state; and determining the operating state of the component as a fault state when the query result shows that the component's service is abnormal.
Optionally, the operating state of the computer cluster further includes unhealthy and fault states; and determining the operating state of the computer cluster, the component and/or the node according to the acquired operation monitoring data includes: determining the operating state of the computer cluster as a healthy state when the instruction execution result shows that the instruction was executed successfully and all components in the computer cluster are currently in a healthy state; determining the operating state of the computer cluster as an unhealthy state when the instruction execution result shows that the instruction was executed successfully but the computer cluster currently contains a component in an unhealthy state; and determining the operating state of the computer cluster as a fault state when the instruction execution result shows that the instruction execution failed.
Optionally, when the operating state is not a healthy state, executing a preset recovery action corresponding to the operating state includes: and when the component is in an unhealthy state or a fault state, determining a node in the component in the fault state, and executing a preset recovery action corresponding to the unhealthy state or the fault state on the node.
Optionally, the database is a massively parallel processing (MPP) database; the recovery action comprises at least one of: pulling up a service, restarting a node, excluding a node, and replacing a node; the resource utilization information includes at least one of: CPU utilization, memory utilization, disk utilization, and network bandwidth; and the response duration index is a top percentile (TP) index of the response duration.
To achieve the above object, according to another aspect of the present invention, there is provided a database fault handling apparatus.
The database fault processing device of the embodiment of the invention can comprise: a monitoring unit for: the method comprises the steps that operation monitoring data of each computer cluster are obtained through a first type of monitoring service which is deployed for at least one computer cluster supporting a database in advance, operation monitoring data of components are obtained through a second type of monitoring service which is deployed for at least one component contained in each computer cluster in advance, and operation monitoring data of each node are obtained through a third type of monitoring service which is deployed for at least one node contained in each component in advance; a state judgment unit for: judging the running state of the computer cluster, the components and/or the nodes according to the obtained running monitoring data; wherein the operating state comprises a healthy state; a failure recovery unit to: and when the running state is not the healthy state, executing a preset recovery action corresponding to the running state so as to recover the running state to the healthy state.
To achieve the above object, according to still another aspect of the present invention, there is provided an electronic apparatus.
An electronic device according to the present invention includes: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the database fault processing method provided by the present invention.
To achieve the above object, according to still another aspect of the present invention, there is provided a computer-readable storage medium.
A computer-readable storage medium of the present invention has stored thereon a computer program, which when executed by a processor implements the database fault handling method provided by the present invention.
According to the technical scheme of the invention, the embodiment of the invention has the following advantages or beneficial effects:
The operation monitoring data of each computer cluster is obtained through a first type of monitoring service deployed in advance for at least one computer cluster supporting a target database, the operation monitoring data of each component is obtained through a second type of monitoring service deployed in advance for at least one component contained in each computer cluster, and the operation monitoring data of each node is obtained through a third type of monitoring service deployed in advance for at least one node contained in each component; the operating state of the computer cluster, the components and/or the nodes is determined according to the obtained operation monitoring data, wherein the operating state comprises a healthy state; and when the operating state is not the healthy state, a preset recovery action corresponding to the operating state is executed so as to restore it to the healthy state. Through these steps, the embodiment of the invention deploys monitoring services at the three layers of computer cluster, component, and node to acquire operation monitoring data more accurately and quickly, locates the fault type, and finally executes the corresponding recovery action automatically with the fastest response time to restore the database to a normal working state. This realizes fully automatic operation and maintenance, avoids the delay and error cost of manual fault handling, minimizes the impact of machine faults, network faults and the like on operation, ensures stable processing of massive data, and greatly improves the security and stability of the database system.
Further effects of the above optional implementations are described below in connection with specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram illustrating the main steps of a database fault handling method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating operational state switching of components in an embodiment of the present invention;
FIG. 3 is a schematic diagram of an example of a database fault handling method in an embodiment of the invention;
FIG. 4 is a schematic diagram of the components of the database fault handling apparatus in the embodiment of the present invention;
FIG. 5 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
fig. 6 is a schematic structural diagram of an electronic device for implementing the database fault handling method in the embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that the embodiments of the present invention and the technical features of the embodiments may be combined with each other without conflict.
Fig. 1 is a schematic diagram of main steps of a database fault handling method according to an embodiment of the present invention.
As shown in fig. 1, the database fault processing method according to the embodiment of the present invention may specifically be executed according to the following steps:
step S101: the method comprises the steps of obtaining operation monitoring data of each computer cluster through a first type of monitoring service which is deployed for at least one computer cluster supporting a database in advance, obtaining operation monitoring data of components through a second type of monitoring service which is deployed for at least one component contained in each computer cluster in advance, and obtaining operation monitoring data of each node through a third type of monitoring service which is deployed for at least one node contained in each component in advance.
In an embodiment of the present invention, the database may be an MPP database deployed in an IaaS environment. In the technical field of cloud computing, IaaS is one of the three widely accepted cloud service models (the other two being Platform as a Service, PaaS, and Software as a Service, SaaS); it lets users enjoy all the advantages of local computing resources without the associated overhead. In the IaaS model, users are responsible for applications, data, operating systems, middleware, and operations, while the IaaS provider supplies virtualization, storage, networking, and servers. In this way, the user does not have to own an internal data center or worry about updating or maintaining these components. In most cases, the IaaS user can fully control the infrastructure through an application programming interface (API) or a control panel. As the most flexible cloud model, IaaS makes it easy for users to scale, upgrade, and add resources (e.g., cloud storage) without having to predict future needs and pay in advance.
An MPP database is optimized for analytical workloads and is mainly used to aggregate and process large data sets. MPP databases tend to be columnar: instead of storing each row of a table as an object, they typically store each column as an object, an architecture that allows complex analytical queries to be processed more quickly and efficiently. It should be understood that the database fault handling method of the embodiment of the present invention may also be applied to any other database system and is not limited to MPP databases.
In practice, the above database may run with the support of at least one computer cluster, and each computer cluster may comprise at least one component, where one component typically provides a specific service. Each component may include at least one node, and each node may correspond to a physical server. For example, an MPP database may include a metadata cluster and a compute cluster; the metadata cluster may include a FoundationDB component, a Catalog component, and an ETCD component, and the compute cluster may include a Master component and a Segment component. The FoundationDB component may include 5 nodes, and the Catalog component and the ETCD component may each include 3 nodes. In particular, the FoundationDB, Catalog, and ETCD components are all multi-active systems with redundancy mechanisms; taking the FoundationDB component as an example, as long as no fewer than three of its 5 nodes work normally, the component can still provide services externally based on its high-availability mechanism.
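For illustration only, the example topology described above could be written down as a plain data structure like the following (the component names come from the text; the node names and the Python representation itself are assumptions, not part of the patent):

```python
# Illustrative sketch only: the example topology from the text as plain Python.
# Component names come from the description; node names are hypothetical.
EXAMPLE_TOPOLOGY = {
    "metadata_cluster": {
        "FoundationDB": ["fdb-node-1", "fdb-node-2", "fdb-node-3", "fdb-node-4", "fdb-node-5"],
        "Catalog":      ["catalog-node-1", "catalog-node-2", "catalog-node-3"],
        "ETCD":         ["etcd-node-1", "etcd-node-2", "etcd-node-3"],
    },
    "compute_cluster": {
        "Master":  ["master-node-1"],
        "Segment": ["segment-node-1", "segment-node-2"],
    },
}
```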
In order to detect server anomalies immediately and to locate and analyze problems quickly, the following monitoring mechanism may be used: a first type of monitoring service is deployed in advance for each computer cluster, a second type of monitoring service for each component, and a third type of monitoring service for each node, where the third type of monitoring service may be deployed on the physical host of the corresponding node so as to monitor the node's working state directly. The first, second, and third types of monitoring services can each monitor their computer clusters, components, or nodes independently and periodically, as sketched below. After deployment, the operation monitoring data of each computer cluster can be obtained through the first type of monitoring service, that of each component through the second type of monitoring service, and that of each node through the third type of monitoring service.
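As a minimal sketch of this three-tier arrangement, each monitoring service could simply run its own periodic collection loop. The function names monitor_cluster, monitor_component, and monitor_node in the commented usage are hypothetical placeholders for the probes described in the following paragraphs:

```python
import threading
import time

def run_periodically(monitor_fn, target, period_seconds: float) -> None:
    """Run one monitoring service independently on its own schedule."""
    def loop():
        while True:
            monitor_fn(target)          # collect this tier's operation monitoring data
            time.sleep(period_seconds)  # wait for the next monitoring period
    threading.Thread(target=loop, daemon=True).start()

# One first-type service per cluster, one second-type per component, one
# third-type per node (the latter deployed on the node's physical host), e.g.:
# run_periodically(monitor_cluster, "metadata_cluster", 30)
# run_periodically(monitor_component, "FoundationDB", 15)
# run_periodically(monitor_node, "fdb-node-1", 5)
```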
The operation monitoring data is used to determine the operating states of the computer clusters, components, or nodes. In general, the operating state of a node may be either a healthy state or a fault state: a node in the healthy state can operate normally, while a node in the fault state cannot. The operating state of a component may be a healthy state, an unhealthy state, or a fault state: a component in the healthy state can provide services normally and all of its nodes are healthy; a component in the unhealthy state can still provide services normally although at least one of its nodes is in the fault state; and a component in the fault state cannot provide services normally, in which case at least one of its nodes is necessarily in the fault state. The operating state of a computer cluster may likewise be a healthy state, an unhealthy state, or a fault state: a cluster in the healthy state operates normally and all of its components are healthy; a cluster in the unhealthy state operates normally while at least one of its components is unhealthy (but none is in the fault state, since in general a cluster is considered faulty as soon as any of its components is faulty); and a cluster in the fault state cannot operate normally, in which case at least one of its components is necessarily in the fault state.
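The three operating states recur throughout the rest of the description, so a small enumeration is sketched here for reference in the later examples (an illustrative convention, not terminology mandated by the patent):

```python
from enum import Enum

class State(Enum):
    HEALTHY   = "healthy"    # operating / serving normally, all children healthy
    UNHEALTHY = "unhealthy"  # still serving, but at least one child is in a fault state
    FAULT     = "fault"      # cannot operate or serve normally
```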
Specifically, for each computer cluster, based on the first type of monitoring service, a data operation instruction (e.g., a data read instruction and/or a data write instruction) may be periodically issued to each computer cluster, an instruction execution result may be obtained, and the operation monitoring data of each computer cluster in the current monitoring period may be generated according to the instruction execution result. Generally, the above instruction execution result shows whether the instruction is successfully executed and whether the computer cluster currently has components in an unhealthy state or a fault state, and in some scenarios, also shows the number of components in an unhealthy state or a fault state, and the above operation monitoring data of the computer cluster generally includes the above instruction execution result.
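A first-type probe of this kind might look as follows; the use of psycopg2 and a simple SELECT 1 as the "data operation instruction" is an assumption chosen for illustration, since the patent does not prescribe a particular client or statement:

```python
import psycopg2  # third-party PostgreSQL-protocol client; an assumption, any client would do

def probe_cluster(cluster_dsn: str) -> dict:
    """First-type probe (sketch): issue a lightweight data operation instruction
    against the cluster and record whether it succeeds."""
    record = {"instruction_ok": False}
    try:
        with psycopg2.connect(cluster_dsn) as conn, conn.cursor() as cur:
            cur.execute("SELECT 1")                 # data read instruction
            record["instruction_ok"] = cur.fetchone() == (1,)
    except Exception as exc:                        # connection refused, timeout, ...
        record["error"] = str(exc)
    return record
```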
For each component, a component state query instruction can be periodically sent to each component based on the second type of monitoring service, a query result is obtained, and operation monitoring data of each component in the current monitoring period is generated according to the query result. Generally, the query result shows whether the component is normally serviced and whether the component currently has a node in a failure state, and in some scenarios, the number of nodes in the failure state is also shown, and the above operation monitoring data of the component generally includes the query result.
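A second-type probe could issue a per-component status command in a similar way; the etcdctl and fdbcli commands shown are illustrative assumptions about what a "component state query instruction" might be for those components:

```python
import subprocess

def query_component_status(component: str) -> dict:
    """Second-type probe (sketch): send a status query to the component and
    record whether it reports normal service."""
    # Illustrative per-component commands; real deployments would configure these.
    status_commands = {
        "ETCD":         ["etcdctl", "endpoint", "health", "--cluster"],
        "FoundationDB": ["fdbcli", "--exec", "status minimal"],
    }
    cmd = status_commands.get(component)
    if cmd is None:
        return {"component": component, "query_ok": False, "detail": "no probe configured"}
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
        return {"component": component,
                "query_ok": result.returncode == 0,
                "detail": result.stdout.strip()}
    except (OSError, subprocess.TimeoutExpired) as exc:
        return {"component": component, "query_ok": False, "detail": str(exc)}
```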
For each node, the following four methods can be used to obtain the operation monitoring data of the node.
In the first method, the resource utilization information of any node may be periodically obtained based on a third type of monitoring service deployed in the node, and the operation monitoring data of each node in the current monitoring period may be generated according to the resource utilization information. The above resource utilization information includes at least one of: CPU utilization, memory utilization, disk utilization, and network bandwidth. The above operation monitoring data of the node generally includes the above resource utilization information.
In the second method, based on the third type of monitoring service deployed in any node, hardware fault information of the node (the information indicates whether a hardware fault exists) may be periodically obtained, and operation monitoring data of each node in the current monitoring period may be generated according to the hardware fault information. The above operation monitoring data of the node generally includes the above hardware fault information.
In a third method, based on a third type of monitoring service deployed in any node, a service process survival status of the node may be periodically obtained, and operation monitoring data of each node in a current monitoring period may be generated according to the service process survival status (which characterizes whether a current service process is in operation or has stopped). The above operation monitoring data of the node generally includes the above service process survival status.
In the fourth method, the response duration index of the node may be periodically obtained based on the third type of monitoring service deployed on the node, and the operation monitoring data of each node in the current monitoring period may be generated according to this index. The response duration index may be a TP (Top Percentile) index of the response duration, such as TP90 or TP50. TP90 means: sort all the response durations collected within the statistical period in ascending order and take the value at the 90th percentile position; similarly, TP50 is the value at the 50th percentile position. The operation monitoring data of the node generally includes this TP index.
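A small worked example of the TP index under the nearest-rank convention (the convention itself is an assumption; the patent only defines the percentile informally):

```python
def top_percentile(response_times_ms: list[float], p: float) -> float:
    """TP index as described above: sort ascending and take the value at the
    p-th percentile position (nearest-rank convention assumed)."""
    ordered = sorted(response_times_ms)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

samples = [12, 15, 18, 20, 22, 25, 30, 41, 55, 90]   # ten sampled response times (ms)
print(top_percentile(samples, 90))   # TP90 -> 55: nine of the ten samples finished within 55 ms
print(top_percentile(samples, 50))   # TP50 -> 22: the median under nearest-rank
```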
In practical applications, any of the above 4 methods may also be combined to generate the operation monitoring data of the node. For example, 4 methods may be combined together, based on the third type of monitoring service deployed in any node, the response duration index, the service process survival status, the hardware fault information, and the resource utilization information of the node are periodically obtained, and the operation monitoring data of each node in the current monitoring period is jointly generated according to the response duration index, the service process survival status, the hardware fault information, and the resource utilization information. It is understood that the above operation monitoring data may include resource utilization information, hardware failure information, service process survival status, and response duration TP index. The following description will mainly use this monitoring method as an example.
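A combined third-type probe might gather these four kinds of data into one record per monitoring period. psutil is used here as one common way to read utilisation figures, and the hardware-fault and TP fields are stubbed because their sources (e.g. IPMI logs, query logs) are deployment-specific assumptions:

```python
import psutil  # third-party host-metrics library; an illustrative choice

def collect_node_metrics(service_process_name: str) -> dict:
    """Third-type probe (sketch): one monitoring record combining resource
    utilization, service-process liveness, hardware-fault info and the TP index."""
    process_alive = any(
        p.info["name"] == service_process_name
        for p in psutil.process_iter(attrs=["name"])
    )
    return {
        "cpu_percent":    psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent":   psutil.disk_usage("/").percent,
        "process_alive":  process_alive,
        # Hardware-fault detection and the response-time TP index would come from
        # other sources (e.g. hardware logs and the query log); stubbed here.
        "hardware_fault": False,
        "tp90_ms":        None,
    }
```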
Step S102: and judging the running state of the computer cluster, the components and/or the nodes according to the obtained running monitoring data.
In this step, the operation state of the computer cluster may be determined in a preset state machine according to the operation monitoring data of the computer cluster acquired in step S101, the operation state of the component may be determined according to the operation monitoring data of the component acquired in step S101, and the operation state of the node may be determined according to the operation monitoring data of the node acquired in step S101.
Specifically, for each node, when the response duration index is not greater than a preset response duration threshold, the service process survival state shows that the service process is running, the hardware fault information shows that no hardware fault exists, and the resource utilization information is not greater than a preset utilization rate threshold, the running state of the node may be determined as a healthy state; when the response duration index is greater than the response duration threshold, or the service process survival state shows that the service process has stopped, or the hardware fault information shows that a hardware fault exists, or the resource utilization information is greater than the utilization rate threshold, the operating state of the node may be determined as a fault state.
For each component, when the query result shows that the component's service is normal and no node in the component is in a fault state, the operating state of the component is determined as a healthy state; when the query result shows that the component's service is normal but a node in the component is currently in a fault state, the operating state of the component is determined as an unhealthy state; and when the query result shows that the component's service is abnormal, the operating state of the component is determined as a fault state.
For each computer cluster, determining the running state of the computer cluster as a healthy state when the instruction execution result shows that the instruction is successfully executed and all the components in the computer cluster are in the healthy state currently; when the instruction execution result shows that the instruction execution is successful and the computer cluster has the components in the unhealthy state currently, determining the running state of the computer cluster as the unhealthy state; and when the instruction execution result shows that the instruction execution fails, determining the running state of the computer cluster as a fault state.
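Condensing the three judgment rules above into code, and reusing the State enum and the node-metrics record from the earlier sketches (thresholds are deployment-specific inputs, not values given by the patent):

```python
def node_state(m: dict, tp_threshold_ms: float, util_threshold: float) -> State:
    """Node-level rule from the paragraph above."""
    over_utilised = max(m["cpu_percent"], m["memory_percent"], m["disk_percent"]) > util_threshold
    too_slow = m["tp90_ms"] is not None and m["tp90_ms"] > tp_threshold_ms
    if too_slow or not m["process_alive"] or m["hardware_fault"] or over_utilised:
        return State.FAULT
    return State.HEALTHY

def component_state(query_ok: bool, node_states: list[State]) -> State:
    """Component-level rule: fault if its service is abnormal, unhealthy if any node is faulty."""
    if not query_ok:
        return State.FAULT
    if State.FAULT in node_states:
        return State.UNHEALTHY
    return State.HEALTHY

def cluster_state(instruction_ok: bool, component_states: list[State]) -> State:
    """Cluster-level rule: fault if the probe instruction failed or any component is faulty."""
    if not instruction_ok or State.FAULT in component_states:
        return State.FAULT
    if State.UNHEALTHY in component_states:
        return State.UNHEALTHY
    return State.HEALTHY
```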
Through the above steps, the operating state of each computer cluster, component, and node can be determined and updated into the state machine. Troubleshooting can then be performed at the computer-cluster level, the component level, and the node level according to these operating states; the following description mainly takes the component level as an example.
Step S103: and when the running state is not the healthy state, executing a preset recovery action corresponding to the running state so as to recover the running state to the healthy state.
The following description takes troubleshooting at the component level as an example, and specifically the case where a component is in an unhealthy state. In this case, the specific node in the component that is in a fault state is determined according to the node-level operating-state judgment, and then the recovery action preset for the "unhealthy state" is executed. In practical applications, the recovery action may include at least one of the following: pulling up the service, restarting the node, excluding the node (i.e., removing the faulty node from the component), and replacing the node (i.e., adding the same number of new nodes to the component after removing the faulty ones), or a combination of these, such as first attempting to pull up the service, restarting the node if that fails, and excluding the node if it is still in a fault state after the restart.
In practical applications, the recovery action corresponding to the "unhealthy state" may be defined in finer detail as required. For example, if the unhealthy state is caused by the failure of one node, the preset recovery action may be "pull up the service", while if it is caused by the failure of two nodes, the preset recovery action may be "restart the node". As another example, in the case of an unhealthy state caused by the failure of two nodes, if the two nodes have a hardware problem (a hardware fault exists or the resource utilization information exceeds the utilization threshold), the preset recovery action may be "exclude the node", and if the two nodes have a software problem (the response duration index exceeds the response duration threshold or the service process has stopped), the preset recovery action may be "restart the node". The recovery actions corresponding to the "fault state" are configured similarly. In short, the recovery actions can be configured flexibly for the various operating states, and the present invention does not limit this.
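One possible escalation of these recovery actions is sketched below; the concrete commands (ssh plus systemctl, the service name "mpp-segment") and the helper functions are illustrative assumptions rather than anything specified by the patent:

```python
import subprocess

def exclude_node(node: str) -> None:
    print(f"[recovery] excluding {node} from its component")        # placeholder

def add_replacement_node(component: str) -> None:
    print(f"[recovery] adding a replacement node to {component}")   # placeholder

def recover(node: str, component: str, software_issue: bool) -> None:
    """Escalating recovery sketch following the paragraph above."""
    if software_issue:
        # 1. try to pull the service up
        if subprocess.run(["ssh", node, "systemctl", "restart", "mpp-segment"]).returncode == 0:
            return
        # 2. otherwise restart the node and let the node-level monitor re-check it
        subprocess.run(["ssh", node, "sudo", "reboot"])
    else:
        # hardware problem: take the node out of the component and replace it
        exclude_node(node)
        add_replacement_node(component)
```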
Through the above steps, faults at each of the computer-cluster, component, and node layers can be cleared automatically so that the cluster, component, or node returns to a healthy state; the operating state of each layer in the state machine is updated accordingly, and a judgment result in report form can be generated automatically. It can be understood that the state machine records the triggering and switching of the operating states of each computer cluster, component, and node over any period of time. As shown in Fig. 2, a component in a healthy state may switch to an unhealthy state due to a single point of failure (i.e., a small number of failed nodes) and then to a fault state if the service goes down; a healthy component may also switch directly to a fault state when its service goes down; afterwards, a component in an unhealthy or fault state can be restored to a healthy state by executing the preset recovery action. In practical applications, as each monitoring service runs periodically, judgment results are generated continuously at each moment, and the above steps keep the computer clusters, components, and nodes at every layer in a healthy state.
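The switching relationships of Fig. 2 can be summarised as a small transition table, reusing the State enum from the earlier sketch; the event labels are illustrative names, not terms from the patent:

```python
# Component-state transitions implied by Fig. 2 (sketch).
COMPONENT_TRANSITIONS = {
    (State.HEALTHY,   "single_point_failure"): State.UNHEALTHY,
    (State.HEALTHY,   "service_down"):         State.FAULT,
    (State.UNHEALTHY, "service_down"):         State.FAULT,
    (State.UNHEALTHY, "recovery_action"):      State.HEALTHY,
    (State.FAULT,     "recovery_action"):      State.HEALTHY,
}
```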
Fig. 3 is an exemplary schematic diagram of a database fault processing method in an embodiment of the present invention. As shown in Fig. 3, after the operating-state judgment, the Master component and the Segment component in the compute cluster are both in a healthy state, while the metadata cluster is in a fault state: within it, the FoundationDB component is in an unhealthy state, the Catalog component is in a fault state, and the ETCD component is in a healthy state. A judgment result containing these operating states may then be generated, and the corresponding recovery actions are executed on the relevant nodes so that each component and the computer cluster return to a healthy state.
In the technical solution of the embodiment of the present invention, the operation monitoring data of each computer cluster is obtained through a first type of monitoring service deployed in advance for at least one computer cluster supporting a target database, the operation monitoring data of each component is obtained through a second type of monitoring service deployed in advance for at least one component contained in each computer cluster, and the operation monitoring data of each node is obtained through a third type of monitoring service deployed in advance for at least one node contained in each component; the operating state of the computer cluster, the components and/or the nodes is determined according to the obtained operation monitoring data, wherein the operating state comprises a healthy state; and when the operating state is not the healthy state, a preset recovery action corresponding to the operating state is executed so as to restore it to the healthy state. Through these steps, the embodiment of the invention deploys monitoring services at the three layers of computer cluster, component, and node to acquire operation monitoring data more accurately and quickly, locates the fault type, and finally executes the corresponding recovery action automatically with the fastest response time to restore the database to a normal working state. This realizes fully automatic operation and maintenance, avoids the delay and error cost of manual fault handling, minimizes the impact of machine faults, network faults and the like on operation, ensures stable processing of massive data, and greatly improves the security and stability of the database system.
It should be noted that, for the convenience of description, the foregoing method embodiments are described as a series of acts, but those skilled in the art will appreciate that the present invention is not limited by the order of acts described, and that some steps may in fact be performed in other orders or concurrently. Moreover, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no acts or modules are necessarily required to implement the invention.
To facilitate a better implementation of the above-described aspects of embodiments of the present invention, the following also provides relevant means for implementing the above-described aspects.
Referring to fig. 4, a database fault handling apparatus 400 according to an embodiment of the present invention may include: a monitoring unit 401, a state judgment unit 402 and a failure recovery unit 403.
Wherein the monitoring unit 401 may be configured to: the method comprises the steps that operation monitoring data of each computer cluster are obtained through a first type of monitoring service which is deployed for at least one computer cluster supporting a database in advance, operation monitoring data of components are obtained through a second type of monitoring service which is deployed for at least one component contained in each computer cluster in advance, and operation monitoring data of each node are obtained through a third type of monitoring service which is deployed for at least one node contained in each component in advance; the state determination unit 402 may be configured to: judging the running state of the computer cluster, the components and/or the nodes according to the obtained running monitoring data; wherein the operating state comprises a healthy state; the failure recovery unit 403 may be configured to: and when the running state is not the healthy state, executing a preset recovery action corresponding to the running state so as to recover the running state to the healthy state.
In this embodiment of the present invention, the monitoring unit 401 may further be configured to: based on the first type of monitoring service, periodically sending a data operation instruction to each computer cluster, acquiring an instruction execution result, and generating operation monitoring data of each computer cluster in the current monitoring period according to the instruction execution result.
As a preferred solution, the monitoring unit 401 may further be configured to: and periodically sending a component state query instruction to each component based on the second type of monitoring service, acquiring a query result, and generating operation monitoring data of each component in the current monitoring period according to the query result.
Preferably, the third type of monitoring service is deployed on a physical host of the node; the monitoring unit 401 may further be configured to: and periodically acquiring the resource utilization information of any node based on the third type of monitoring service deployed in the node, and generating the running monitoring data of each node in the current monitoring period according to the resource utilization information.
In a specific application, the monitoring unit 401 may further be configured to: and periodically acquiring hardware fault information of any node based on a third type of monitoring service deployed in the node, and generating operation monitoring data of each node in the current monitoring period according to the hardware fault information and the resource utilization information.
In practical applications, the monitoring unit 401 may further be configured to: and periodically acquiring the service process survival state of any node based on the third type of monitoring service deployed in the node, and generating operation monitoring data of each node in the current monitoring period according to the service process survival state, the hardware fault information and the resource utilization information.
In one embodiment, the monitoring unit 401 may further be configured to: and periodically acquiring a response time index of any node based on a third type of monitoring service deployed at the node, and generating operation monitoring data of each node in the current monitoring period according to the response time index, the service process survival state, the hardware fault information and the resource utilization information.
In an optional technical solution, the operating state of the node further includes a fault state; the state determination unit 402 may be further configured to: determine the operating state of the node as a healthy state when the response duration index is not greater than a preset response duration threshold, the service process survival state shows that the service process is running, the hardware fault information shows that no hardware fault exists, and the resource utilization information is not greater than a preset utilization threshold; and determine the operating state of the node as a fault state when the response duration index is greater than the response duration threshold, or the service process survival state shows that the service process has stopped, or the hardware fault information shows that a hardware fault exists, or the resource utilization information is greater than the utilization threshold.
In an alternative implementation, the operating state of the component further includes unhealthy and fault states; the state determination unit 402 may be further configured to: determine the operating state of the component as a healthy state when the query result shows that the component is serving normally and no node in the component is in a fault state; determine the operating state of the component as an unhealthy state when the query result shows that the component is serving normally but at least one node in the component is currently in a fault state; and determine the operating state of the component as a fault state when the query result shows that the component's service is abnormal.
Preferably, the operating state of the computer cluster further includes unhealthy and fault states; the state determination unit 402 may be further configured to: determine the operating state of the computer cluster as a healthy state when the instruction execution result shows that the instruction was executed successfully and all components in the computer cluster are currently in a healthy state; determine the operating state of the computer cluster as an unhealthy state when the instruction execution result shows that the instruction was executed successfully but the computer cluster currently contains a component in an unhealthy state; and determine the operating state of the computer cluster as a fault state when the instruction execution result shows that the instruction execution failed.
In some embodiments, the failure recovery unit 403 may be configured to: and when the component is in an unhealthy state or a fault state, determining a node in the component in the fault state, and executing a preset recovery action corresponding to the unhealthy state or the fault state on the node.
In addition, in the embodiment of the present invention, the database is a massively parallel processing (MPP) database; the recovery action comprises at least one of: pulling up a service, restarting a node, excluding a node, and replacing a node; the resource utilization information includes at least one of: CPU utilization, memory utilization, disk utilization, and network bandwidth; and the response duration index is a top percentile (TP) index of the response duration.
In the technical solution of the embodiment of the present invention, the operation monitoring data of each computer cluster is obtained through a first type of monitoring service deployed in advance for at least one computer cluster supporting a target database, the operation monitoring data of each component is obtained through a second type of monitoring service deployed in advance for at least one component contained in each computer cluster, and the operation monitoring data of each node is obtained through a third type of monitoring service deployed in advance for at least one node contained in each component; the operating state of the computer cluster, the components and/or the nodes is determined according to the obtained operation monitoring data, wherein the operating state comprises a healthy state; and when the operating state is not the healthy state, a preset recovery action corresponding to the operating state is executed so as to restore it to the healthy state. Through these steps, the embodiment of the invention deploys monitoring services at the three layers of computer cluster, component, and node to acquire operation monitoring data more accurately and quickly, locates the fault type, and finally executes the corresponding recovery action automatically with the fastest response time to restore the database to a normal working state. This realizes fully automatic operation and maintenance, avoids the delay and error cost of manual fault handling, minimizes the impact of machine faults, network faults and the like on operation, ensures stable processing of massive data, and greatly improves the security and stability of the database system.
Fig. 5 illustrates an exemplary system architecture 500 to which the database fault handling method or the database fault handling apparatus of the embodiments of the present invention may be applied.
As shown in fig. 5, the system architecture 500 may include terminal devices 501, 502, 503, a network 504, and a server 505 (this architecture is merely an example, and the components included in a particular architecture may be adapted according to application specific circumstances). The network 504 serves to provide a medium for communication links between the terminal devices 501, 502, 503 and the server 505. Network 504 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 501, 502, 503 to interact with a server 505 over a network 504 to receive or send messages or the like. The terminal devices 501, 502, 503 may have various client applications installed thereon, such as a database fault handling application, etc. (by way of example only).
The terminal devices 501, 502, 503 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 505 may be a server providing various services, for example an operation and maintenance server (by way of example only) that supports a database fault handling application operated by users on the terminal devices 501, 502, 503. The operation and maintenance server may process a received fault handling request and feed back a handling result (e.g., whether the fault has been cleared; by way of example only) to the terminal devices 501, 502, 503.
It should be noted that the database failure processing method provided by the embodiment of the present invention is generally executed by the server 505, and accordingly, the database failure processing apparatus is generally disposed in the server 505.
It should be understood that the number of terminal devices, networks, and servers in fig. 5 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The invention further provides an electronic device. The electronic device of the embodiment of the invention includes: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the database fault processing method provided by the invention.
Referring now to FIG. 6, shown is a block diagram of a computer system 600 suitable for use with the electronic device implementing an embodiment of the present invention. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU) 601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the computer system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read out therefrom is installed into the storage section 608 as necessary.
In particular, according to embodiments of the present disclosure, the processes described in the main step diagrams above may be implemented as computer software programs. For example, embodiments of the invention include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the main step diagram. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609 and/or installed from the removable medium 611. When executed by the central processing unit 601, the computer program performs the above-described functions defined in the system of the present invention.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present invention may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor comprising a monitoring unit, a state judgment unit, and a failure recovery unit. The names of these units do not, in some cases, constitute a limitation on the units themselves; for example, the monitoring unit may also be described as a "unit that supplies operation monitoring data to the state judgment unit".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to perform steps comprising: obtaining operation monitoring data of each computer cluster through a first type of monitoring service deployed in advance for at least one computer cluster supporting a database, obtaining operation monitoring data of each component through a second type of monitoring service deployed in advance for at least one component contained in each computer cluster, and obtaining operation monitoring data of each node through a third type of monitoring service deployed in advance for at least one node contained in each component; judging the running state of the computer cluster, the components and/or the nodes according to the obtained operation monitoring data, wherein the running state comprises a healthy state; and when the running state is not the healthy state, executing a preset recovery action corresponding to the running state so as to restore the running state to the healthy state.
In the technical solution of the embodiment of the present invention, operation monitoring data of each computer cluster are obtained through a first type of monitoring service deployed in advance for at least one computer cluster supporting a target database, operation monitoring data of each component are obtained through a second type of monitoring service deployed in advance for at least one component contained in each computer cluster, and operation monitoring data of each node are obtained through a third type of monitoring service deployed in advance for at least one node contained in each component; the running state of the computer cluster, the components and/or the nodes is judged according to the obtained operation monitoring data, wherein the running state comprises a healthy state; and when the running state is not the healthy state, a preset recovery action corresponding to the running state is executed so as to restore the running state to the healthy state. Through these steps, the embodiment of the present invention deploys monitoring services at the three layers of computer cluster, component, and node, so that operation monitoring data can be acquired more accurately and quickly and the fault type can be located. The corresponding recovery action is then executed automatically with the fastest possible response time to restore the database to a normal working state. This realizes fully automatic operation and maintenance, avoids the delay and error cost of manual fault handling, minimizes the impact of machine faults, network faults and the like on operation, ensures stable processing of massive data, and greatly improves the safety and stability of the database system.
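To make the three-layer deployment concrete, the following is a minimal sketch, in Python, of how a first-, second- and third-type monitoring service might collect per-cluster, per-component and per-node operation monitoring data. The psql and curl probe commands, the psutil-based host metrics, the function names and the placeholder values are assumptions introduced for illustration only and are not part of the disclosure.

```python
# Minimal sketch of the three monitoring tiers; all commands, endpoints and
# placeholder values are illustrative assumptions, not the patented implementation.
import subprocess

import psutil  # third-party package, assumed available on the monitored host


def cluster_monitor(cluster_dsn: str) -> dict:
    """First-type service: periodically send a data operation instruction to the
    cluster and record whether it executed successfully."""
    try:
        result = subprocess.run(
            ["psql", cluster_dsn, "-c", "SELECT 1;"],  # assumed probe statement
            capture_output=True,
            timeout=10,
        )
        return {"instruction_ok": result.returncode == 0}
    except subprocess.TimeoutExpired:
        return {"instruction_ok": False}


def component_monitor(status_url: str) -> dict:
    """Second-type service: periodically send a component state query and record
    whether the component reports normal service."""
    try:
        result = subprocess.run(
            ["curl", "-sf", status_url],  # assumed status interface of the component
            capture_output=True,
            timeout=10,
        )
        return {"service_ok": result.returncode == 0}
    except subprocess.TimeoutExpired:
        return {"service_ok": False}


def node_monitor(db_process_name: str = "postgres") -> dict:
    """Third-type service, deployed on the node's physical host: collect resource
    utilization, service process survival, a hardware fault flag and a response
    duration sample for the current monitoring period."""
    return {
        "cpu": psutil.cpu_percent(interval=1) / 100.0,
        "memory": psutil.virtual_memory().percent / 100.0,
        "disk": psutil.disk_usage("/").percent / 100.0,
        "db_process_alive": any(
            (p.info.get("name") or "") == db_process_name
            for p in psutil.process_iter(["name"])
        ),
        "hardware_fault": False,  # placeholder: real checks might read IPMI or SMART data
        "response_ms": 42.0,      # placeholder: a real probe would time a test query
    }
```

In such a deployment each collector would run on its own monitoring period and report its data to the state judgment unit; the threshold comparisons themselves are illustrated in the sketch that follows the claims below.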
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (15)

1. A database fault handling method is characterized by comprising the following steps:
the method comprises the steps that operation monitoring data of each computer cluster are obtained through a first type of monitoring service which is deployed for at least one computer cluster supporting a database in advance, operation monitoring data of components are obtained through a second type of monitoring service which is deployed for at least one component contained in each computer cluster in advance, and operation monitoring data of each node are obtained through a third type of monitoring service which is deployed for at least one node contained in each component in advance;
judging the running state of the computer cluster, the components and/or the nodes according to the obtained running monitoring data; wherein the operating state comprises a healthy state;
and when the running state is not the healthy state, executing a preset recovery action corresponding to the running state so as to recover the running state to the healthy state.
2. The method of claim 1, wherein obtaining operational monitoring data for each computer cluster via a first type of monitoring service previously deployed for at least one computer cluster supporting a target database comprises:
based on the first type of monitoring service, periodically sending a data operation instruction to each computer cluster, acquiring an instruction execution result, and generating operation monitoring data of each computer cluster in the current monitoring period according to the instruction execution result.
3. The method according to claim 2, wherein the obtaining operation monitoring data of the components through a second type monitoring service deployed in advance for at least one component included in each computer cluster comprises:
and periodically sending a component state query instruction to each component based on the second type of monitoring service, acquiring a query result, and generating operation monitoring data of each component in the current monitoring period according to the query result.
4. The method of claim 3, wherein a third type of monitoring service is deployed on a physical host of the node; and the obtaining the operation monitoring data of each node through a third type monitoring service deployed for at least one node included in each component in advance comprises:
and periodically acquiring the resource utilization information of any node based on the third type of monitoring service deployed in the node, and generating the running monitoring data of each node in the current monitoring period according to the resource utilization information.
5. The method of claim 4, wherein the generating the operation monitoring data of each node in the current monitoring period according to the resource utilization information comprises:
and periodically acquiring hardware fault information of any node based on a third type of monitoring service deployed in the node, and generating operation monitoring data of each node in the current monitoring period according to the hardware fault information and the resource utilization information.
6. The method according to claim 5, wherein the generating operation monitoring data of each node in a current monitoring period according to the hardware fault information and the resource utilization information comprises:
and periodically acquiring the service process survival state of any node based on the third type of monitoring service deployed in the node, and generating operation monitoring data of each node in the current monitoring period according to the service process survival state, the hardware fault information and the resource utilization information.
7. The method of claim 6, wherein the generating the operation monitoring data of each node in the current monitoring period according to the service process survival state, the hardware fault information and the resource utilization information comprises:
and periodically acquiring a response time index of any node based on a third type of monitoring service deployed at the node, and generating operation monitoring data of each node in the current monitoring period according to the response time index, the service process survival state, the hardware fault information and the resource utilization information.
8. The method of claim 7, wherein the operating state of the node further comprises: a fault state; and judging the operating state of the computer cluster, the component and/or the node according to the acquired operation monitoring data, including:
when the response time index is not greater than a preset response time threshold, the service process survival state shows that a service process is running, the hardware fault information shows that no hardware fault exists, and the resource utilization information is not greater than a preset utilization rate threshold, determining the running state of the node as a healthy state;
and when the response time index is greater than the response time threshold, or the service process survival state shows that the service process is stopped, or the hardware fault information shows that a hardware fault exists, or the resource utilization information is greater than the utilization rate threshold, determining the running state of the node as a fault state.
9. The method of claim 8, wherein the operating state of the component further comprises: an unhealthy state and a fault state; and judging the operating state of the computer cluster, the component and/or the node according to the acquired operation monitoring data, including:
when the query result shows that the component is normally served and no node in a fault state exists in the component, determining the running state of the component as a healthy state;
when the query result shows that the component is normally served and a node in a fault state currently exists in the component, determining the running state of the component as an unhealthy state;
and when the query result shows that the service of the component is abnormal, determining the running state of the component as a fault state.
10. The method of claim 9, wherein the operating state of the computer cluster further comprises: an unhealthy state and a fault state; and judging the operating state of the computer cluster, the component and/or the node according to the acquired operation monitoring data, including:
when the instruction execution result shows that the instruction execution is successful and all the components in the computer cluster are currently in a healthy state, determining the running state of the computer cluster as a healthy state;
when the instruction execution result shows that the instruction execution is successful and the computer cluster currently has components in unhealthy states, determining the running state of the computer cluster as the unhealthy state;
and when the instruction execution result shows that the instruction execution fails, determining the running state of the computer cluster as a fault state.
11. The method according to claim 9, wherein when the operating state is not a healthy state, performing a preset recovery action corresponding to the operating state comprises:
and when the component is in an unhealthy state or a fault state, determining a node in the component in the fault state, and executing a preset recovery action corresponding to the unhealthy state or the fault state on the node.
12. The method of any of claims 7-11, wherein the database is a Massively Parallel Processing (MPP) database;
the recovery action comprises at least one of: pulling up a service, restarting a node, removing a node, and replacing a node;
the resource utilization information includes at least one of: CPU utilization rate, memory utilization rate, disk utilization rate and network bandwidth;
the response time index is a percentile (TP) index of the response time.
13. A database fault handling apparatus, comprising:
a monitoring unit for: the method comprises the steps that operation monitoring data of each computer cluster are obtained through a first type of monitoring service which is deployed for at least one computer cluster supporting a database in advance, operation monitoring data of components are obtained through a second type of monitoring service which is deployed for at least one component contained in each computer cluster in advance, and operation monitoring data of each node are obtained through a third type of monitoring service which is deployed for at least one node contained in each component in advance;
a state judgment unit for: judging the running state of the computer cluster, the components and/or the nodes according to the obtained running monitoring data; wherein the operating state comprises a healthy state;
a failure recovery unit to: and when the running state is not the healthy state, executing a preset recovery action corresponding to the running state so as to recover the running state to the healthy state.
14. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs which,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-12.
15. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the method according to any one of claims 1-12.
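For illustration only, the sketch below restates the state determination rules of claims 8 to 10 and the recovery dispatch of claims 11 and 12 in Python. The State enum, the threshold parameters, the recovery action names and the print-based dispatch are assumptions introduced for demonstration and are not part of the claimed subject matter; the node dictionaries are the ones produced by the monitoring sketch given earlier.

```python
# Illustrative restatement of the claimed state rules; not the patented implementation.
from enum import Enum


class State(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    FAULT = "fault"


def judge_node(node: dict, rt_threshold_ms: float, util_threshold: float) -> State:
    """Claim 8: the node is healthy only when every indicator is within bounds;
    otherwise it is in a fault state. CPU, memory and disk stand in for the
    resource utilization information here."""
    healthy = (
        node["response_ms"] <= rt_threshold_ms
        and node["db_process_alive"]
        and not node["hardware_fault"]
        and node["cpu"] <= util_threshold
        and node["memory"] <= util_threshold
        and node["disk"] <= util_threshold
    )
    return State.HEALTHY if healthy else State.FAULT


def judge_component(service_ok: bool, node_states: list) -> State:
    """Claim 9: abnormal service means a fault state; normal service with a node
    in a fault state means an unhealthy state; otherwise the component is healthy."""
    if not service_ok:
        return State.FAULT
    return State.UNHEALTHY if State.FAULT in node_states else State.HEALTHY


def judge_cluster(instruction_ok: bool, component_states: list) -> State:
    """Claim 10: a failed probe instruction means a fault state; a successful probe
    with all components healthy means a healthy state; any other case is treated
    as unhealthy in this sketch."""
    if not instruction_ok:
        return State.FAULT
    if all(s is State.HEALTHY for s in component_states):
        return State.HEALTHY
    return State.UNHEALTHY


# Claim 12 lists the recovery actions; mapping them to states is one possible policy.
RECOVERY_ACTIONS = {
    State.UNHEALTHY: ["pull_up_service", "restart_node"],
    State.FAULT: ["remove_node", "replace_node"],
}


def recover(component_state: State, faulty_node_ids: list) -> None:
    """Claim 11: locate the node(s) in a fault state within the component and
    execute the preset recovery action(s) on them."""
    for node_id in faulty_node_ids:
        for action in RECOVERY_ACTIONS.get(component_state, []):
            print(f"executing {action} on node {node_id}")  # placeholder side effect
```

Evaluating judge_node for each node, then judge_component, then judge_cluster reproduces the bottom-up order in which the claims combine node, component and cluster states.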
CN202111406121.2A 2021-11-24 2021-11-24 Database fault processing method and device Pending CN114064438A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111406121.2A CN114064438A (en) 2021-11-24 2021-11-24 Database fault processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111406121.2A CN114064438A (en) 2021-11-24 2021-11-24 Database fault processing method and device

Publications (1)

Publication Number Publication Date
CN114064438A true CN114064438A (en) 2022-02-18

Family

ID=80276687

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111406121.2A Pending CN114064438A (en) 2021-11-24 2021-11-24 Database fault processing method and device

Country Status (1)

Country Link
CN (1) CN114064438A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115766405A (en) * 2023-01-09 2023-03-07 苏州浪潮智能科技有限公司 Fault processing method, device, equipment and storage medium
CN117349128A (en) * 2023-12-05 2024-01-05 杭州沃趣科技股份有限公司 Fault monitoring method, device and equipment of server cluster and storage medium
CN117349128B (en) * 2023-12-05 2024-03-22 杭州沃趣科技股份有限公司 Fault monitoring method, device and equipment of server cluster and storage medium

Similar Documents

Publication Publication Date Title
CN109120678B (en) Method and apparatus for service hosting of distributed storage system
CN107016480B (en) Task scheduling method, device and system
CN113742031B (en) Node state information acquisition method and device, electronic equipment and readable storage medium
CN109245908B (en) Method and device for switching master cluster and slave cluster
CN113569987A (en) Model training method and device
CN114064438A (en) Database fault processing method and device
US9342390B2 (en) Cluster management in a shared nothing cluster
CN109871384B (en) Method, system, equipment and storage medium for container migration based on PaaS platform
CN113886089B (en) Task processing method, device, system, equipment and medium
CN111880934A (en) Resource management method, device, equipment and readable storage medium
CN111338834B (en) Data storage method and device
CN111666134A (en) Method and system for scheduling distributed tasks
US9832137B1 (en) Provisioning system and method for a distributed computing environment using a map reduce process
CN111897697A (en) Server hardware fault repairing method and device
CN115964153A (en) Asynchronous task processing method, device, equipment and storage medium
CN107818027B (en) Method and device for switching main name node and standby name node and distributed system
CN107526838B (en) Method and device for database cluster capacity expansion
CN112181942A (en) Time sequence database system and data processing method and device
CN115437766A (en) Task processing method and device
CN114265605A (en) Version rollback method and device for functional component of business system
CN114816866A (en) Fault processing method and device, electronic equipment and storage medium
CN113760469A (en) Distributed computing method and device
CN113656239A (en) Monitoring method and device for middleware and computer program product
CN112799879A (en) Node fault processing method, device, equipment and storage medium
CN109814911A (en) Method, apparatus, computer equipment and storage medium for Manage Scripts program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination