CN113328899A

CN113328899A - Fault processing method and system for cluster nodes

Info

Publication number: CN113328899A
Application number: CN202110888888.7A
Authority: CN
Inventors: 李二明; 李世杰
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2021-08-04
Filing date: 2021-08-04
Publication date: 2021-08-31
Anticipated expiration: 2041-08-04
Also published as: CN113328899B

Abstract

The invention provides a fault processing method and system for cluster nodes. The method comprises the following steps: adding a node information database to the cluster, acquiring information on all nodes in the cluster and storing it in the node information database; after the cluster is started and clients are connected to the cluster nodes, updating the node information database periodically, and determining the node health ranking by applying a sorting algorithm to the data stored in the node information database; determining the label of the current fault processing node according to the ranking result, and storing the label in the node information database; when a node in the cluster fails, the cluster directly reads the current fault processing node label recorded in the node information database and notifies the corresponding node to perform fault recovery. Because the fault processing node is selected in advance from the information in the database, it can handle the fault promptly when a service node in the cluster fails, thereby ensuring service continuity.

Description

Fault processing method and system for cluster nodes

Technical Field

The present invention relates to the field of communications technologies, and in particular, to a method and a system for handling a failure of a cluster node.

Background

A computer cluster, or cluster for short, is a computer system in which a set of loosely coupled computers, connected through software and/or hardware, cooperate closely to perform computing tasks. In a sense, a cluster may be regarded as a single computer. The individual computers in a cluster, usually called nodes, are typically connected by a local area network, although other connections are possible. Clusters are often used to improve on the computing speed and/or reliability of a single computer. Typically, clusters are much more cost effective than individual computers such as workstations or supercomputers.

A cluster is a group of nodes that provides services to clients. To cope with unplanned faults, once a service node in the cluster fails, a fault processing node is designated within the cluster to handle the fault. With prior-art methods of generating the fault processing node, split-brain often occurs, leaving the cluster unable to provide services externally and seriously affecting client services.

Disclosure of Invention

In view of the above problems, an object of the present invention is to provide a method and a system for handling a fault of a cluster node, in which a database is provided, a fault handling node is selected according to information in the database, and when a service node in the cluster fails, the selected fault handling node performs fault handling in time, so as to ensure service continuity.

To achieve this purpose, the invention is realized by the following technical scheme: a fault handling method for a cluster node, comprising the following steps:

adding a node information database to the cluster, acquiring information on all nodes in the cluster and storing it in the node information database;

after the cluster is started and clients are connected to the cluster nodes, updating the node information database periodically, and determining the node health ranking by applying a sorting algorithm to the data stored in the node information database;

determining the label of the current fault processing node according to the ranking result, and storing the label in the node information database;

when a node in the cluster fails, the cluster directly reads the current fault processing node label recorded in the node information database and notifies the corresponding node to perform fault recovery.

Further, the data stored in the node information database includes: node service information, node information, cluster information and a fault handling node bar;

the node service information comprises a service name, service starting time and service state information;

the node information comprises node starting time, the CPU occupancy rate of the node and the number of clients connected to the node;

the cluster information comprises the number of nodes in the cluster, node state information and cluster state information;

and the fault processing node bar is used for storing the current fault processing node label.
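The four groups of stored data can be pictured as a small record layout. The following is only a minimal sketch: the class names, field names, and the use of epoch seconds for times are assumptions made for illustration, since the source does not specify a storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ServiceInfo:
    name: str            # service name
    start_time: float    # service starting time (epoch seconds, assumed unit)
    healthy: bool        # service state information


@dataclass
class NodeInfo:
    start_time: float    # node starting time (epoch seconds, assumed unit)
    cpu_percent: float   # CPU occupancy rate of the node
    client_count: int    # number of clients connected to the node


@dataclass
class ClusterInfo:
    node_count: int                 # number of nodes in the cluster
    node_healthy: Dict[int, bool]   # node state information, keyed by node number
    cluster_healthy: bool           # cluster state information


@dataclass
class NodeInfoDatabase:
    services: Dict[int, ServiceInfo] = field(default_factory=dict)
    nodes: Dict[int, NodeInfo] = field(default_factory=dict)
    cluster: Optional[ClusterInfo] = None
    fault_handler_label: Optional[int] = None   # the "fault processing node bar"


db = NodeInfoDatabase()
db.nodes[0] = NodeInfo(start_time=100.0, cpu_percent=35.0, client_count=4)
db.services[0] = ServiceInfo(name="NFS", start_time=120.0, healthy=True)
db.cluster = ClusterInfo(node_count=1, node_healthy={0: True}, cluster_healthy=True)
db.fault_handler_label = 0
```

Keeping the fault processing node label as a single field alongside the collected data reflects the idea that a failure is handled by one read, not by a fresh computation.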

Further, the step of determining the node health condition sequence by using a sequencing algorithm according to the data stored in the node information database comprises the following steps:

step 1: determining the state score of each node according to the node state information;

step 2: determining the starting time score of each node according to the starting time of the nodes;

step 3: determining the service state score of each node according to the service state information;

step 4: determining the client connection number score of each node according to the number of clients connected to the node;

step 5: determining the CPU occupancy rate score of each node according to the CPU occupancy rate of the node;

step 6: determining the service starting time score of each node according to the service starting time;

step 7: adding the state score, the starting time score, the service state score, the client connection number score, the CPU occupancy rate score and the service starting time score of each node to obtain a health state score of each node;

step 8: filling the node number of the node with the highest health state score into the fault processing node bar, and storing the health state score of each node into the node information database in descending order.
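The eight steps can be sketched in a few lines of Python. This is an illustrative reading, not the patented implementation: the dictionary keys, the function names, and the tie-break on node number are assumptions. Rank-based scores are taken as the 1-based position after sorting from highest to lowest value, so the node with the latest start, most clients, or highest CPU occupancy receives 1.

```python
from typing import Dict, List, Tuple


def rank_scores(values: Dict[int, float]) -> Dict[int, int]:
    """Sort node numbers by value, highest first; the 1-based position is the score."""
    # Tie-break on node number (an assumption; the source does not cover ties).
    ordered = sorted(values, key=lambda n: (-values[n], n))
    return {node: pos for pos, node in enumerate(ordered, start=1)}


def health_ranking(nodes: Dict[int, dict]) -> Tuple[int, List[Tuple[int, int]]]:
    """Return (fault processing node number, [(node, health score)] in descending order)."""
    start = rank_scores({n: d["node_start"] for n, d in nodes.items()})         # step 2
    clients = rank_scores({n: d["clients"] for n, d in nodes.items()})          # step 4
    cpu = rank_scores({n: d["cpu"] for n, d in nodes.items()})                  # step 5
    svc_start = rank_scores({n: d["service_start"] for n, d in nodes.items()})  # step 6

    totals = {}
    for n, d in nodes.items():
        state = 1 if d["node_ok"] else 0        # step 1
        service = 1 if d["service_ok"] else 0   # step 3
        totals[n] = state + start[n] + service + clients[n] + cpu[n] + svc_start[n]  # step 7

    ranking = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))  # step 8
    return ranking[0][0], ranking


# Hypothetical sample data for three nodes.
nodes = {
    0: {"node_ok": True, "node_start": 100, "service_ok": True,
        "clients": 5, "cpu": 40.0, "service_start": 110},
    1: {"node_ok": True, "node_start": 200, "service_ok": True,
        "clients": 9, "cpu": 75.0, "service_start": 210},
    2: {"node_ok": True, "node_start": 300, "service_ok": False,
        "clients": 2, "cpu": 20.0, "service_start": 310},
}
handler, ranking = health_ranking(nodes)
print(handler, ranking)  # 0 [(0, 12), (2, 9), (1, 8)]
```

In this sample, node 0 started earliest and carries a moderate load, so it collects the highest total and would be written to the fault processing node bar.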

Further, the step 1 comprises:

if the node state information of the node indicates that the node is normal, the state score of the node is 1; if it indicates that the node is abnormal, the state score of the node is 0.

Further, the step 2 comprises:

reading the starting time of the nodes, arranging the nodes in descending order of starting time, and taking the position of each node after the ordering as the starting time score of the node.

Further, the step 3 comprises:

if the service state information of the node indicates that the service state is normal, the service state score of the node is 1; if it indicates that the service state is abnormal, the service state score of the node is 0.

Further, the step 4 comprises:

reading the number of clients connected to each node, arranging the nodes in descending order of the number of connected clients, from most to fewest, and taking the rank of each node after the ordering as the client connection number score of the node.

Further, the step 5 comprises:

reading the CPU occupancy rate of each node, arranging the nodes in descending order of CPU occupancy rate, from high to low, and taking the rank of each node after the ordering as the CPU occupancy rate score of the node.

Further, the step 6 comprises:

reading the service starting time of the nodes, arranging the nodes in descending order of service starting time, and taking the position of each node after the ordering as the service starting time score of the node.

Correspondingly, the invention also discloses a system for processing the fault of the cluster node, which comprises:

the database building unit is used for adding a node information database in the cluster, acquiring the information of all nodes in the cluster and storing the information into the node information database;

the sequencing unit is used for updating the node information database at regular time after the cluster is started and the client is connected to the cluster nodes, and determining the health condition sequencing of the nodes by adopting a sequencing algorithm according to data stored in the node information database;

the storage unit is used for determining the label of the current fault processing node according to the sequencing result and storing the label into the node information database;

and the node selection unit is used for directly reading the current fault processing node label recorded in the node information database by the cluster and informing the corresponding node of fault recovery after a node in the cluster fails.

Further, the sorting unit includes:

the first scoring module is used for determining the state score of each node according to the node state information;

the second scoring module is used for determining the starting time score of each node according to the starting time of the nodes;

the third scoring module is used for determining the service state score of each node according to the service state information;

the fourth scoring module is used for determining the client connection number score of each node according to the number of the clients connected to the node;

the fifth scoring module is used for determining the CPU occupancy rate score of each node according to the CPU occupancy rate of the node;

the sixth scoring module is used for determining the service starting time score of each node according to the service starting time;

the summarizing module is used for summing up the state score, the starting time score, the service state score, the client connection number score, the CPU occupancy rate score and the service starting time score of each node to obtain the health state score of each node;

and the screening module is used for filling the node number of the node with the highest health state score into the fault processing node bar and storing the health state score of each node into the node information database according to a descending order.

Further, the first scoring module is specifically configured to: if the node state information of the node indicates that the node is normal, set the state score of the node to 1; if it indicates that the node is abnormal, set the state score of the node to 0.

Further, the second scoring module is specifically configured to: read the starting time of the nodes, arrange the nodes in descending order of starting time, and take the position of each node after the ordering as the starting time score of the node.

Further, the third scoring module is specifically configured to: if the service state information of the node indicates that the service state is normal, set the service state score of the node to 1; if it indicates that the service state is abnormal, set the service state score of the node to 0.

Further, the fourth scoring module is specifically configured to: read the number of clients connected to each node, arrange the nodes in descending order of the number of connected clients, from most to fewest, and take the rank of each node after the ordering as the client connection number score of the node.

Further, the fifth scoring module is specifically configured to: read the CPU occupancy rate of each node, arrange the nodes in descending order of CPU occupancy rate, from high to low, and take the rank of each node after the ordering as the CPU occupancy rate score of the node.

Further, the sixth scoring module is specifically configured to: read the service starting time of the nodes, arrange the nodes in descending order of service starting time, and take the position of each node after the ordering as the service starting time score of the node.

Correspondingly, the invention discloses a fault processing device of a cluster node, which comprises:

a memory for storing a fault handling program of the cluster node;

a processor for implementing the steps of the fault handling method of the cluster node as described in any one of the above when executing the fault handling program of the cluster node.

Accordingly, the present invention discloses a readable storage medium, on which a fault handling program of a cluster node is stored, which when executed by a processor implements the steps of the fault handling method of the cluster node according to any one of the above.

Compared with the prior art, the invention has the beneficial effects that:

1. According to the invention, a node information database is established for the cluster. After a node fault occurs, no election is needed: the database in the cluster is read to directly obtain the number of the node responsible for fault processing, and that node completes fault recovery. This avoids the risk of split-brain, minimizes the impact of the fault on client services, and further improves the stability and scenario adaptability of the cluster.

2. The node information database can effectively collect the node information in the cluster, and determines the node health condition sequencing according to the sequencing algorithm to determine the node most suitable for fault processing.

Therefore, compared with the prior art, the invention has prominent substantive features and remarkable progress, and the beneficial effects of the implementation are also obvious.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a system block diagram of the present invention.

In the figure, 1 is a database building unit; 2 is a sorting unit; 3 is a storage unit; and 4 is a node selection unit.

Detailed Description

The core of the invention is to provide a fault processing method for cluster nodes. With prior-art methods of generating the fault processing node, split-brain often occurs, leaving the cluster unable to provide services externally and seriously affecting client services.

According to the fault processing method for cluster nodes provided by the invention, after the nodes are started, a node information database is added to the cluster, and a timed event is started to acquire the information of each node and store it in the database. After the cluster is started and clients are connected to the cluster nodes, the node information is written to the database periodically, the node health ranking is determined by a sorting algorithm, and the nodes are ranked from top to bottom, starting with the node most suitable for fault processing. Finally, the individual scores are summed to give each node's health state score, and the number of the node with the highest score is filled into the fault processing node bar in the database. The results are saved in the database from high to low for use by the cluster. When a node in the cluster fails, the cluster directly reads the node recorded in the fault processing node bar of the database and notifies that node to perform fault recovery. Because the fault processing node is selected in advance from the information in the database, it can handle the fault promptly when a service node in the cluster fails, so that service continuity is ensured.

In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example one:

As shown in FIG. 1, the present embodiment provides a method for processing a fault of a cluster node, including the following steps:

S1: Adding a node information database to the cluster, acquiring the information of all nodes in the cluster and storing it in the node information database.

Wherein, the data stored in the node information database comprises: node service information, node information, cluster information, and a fault handling node bar. The node service information comprises a service name, service starting time and service state information; the node information comprises node starting time, the CPU occupancy rate of the node and the number of clients connected to the node; the cluster information comprises the number of nodes in the cluster, node state information and cluster state information; and the fault processing node bar is used for storing the current fault processing node label.

S2: After the cluster is started and clients are connected to the cluster nodes, updating the node information database periodically, and determining the node health ranking by applying a sorting algorithm to the data stored in the node information database.

Determining the node health ranking from the data stored in the node information database is mainly realized by the following 8 steps:

Step 1: Determining the state score of each node according to the node state information. If the node state information indicates that the node is normal, the state score of the node is 1; if it indicates that the node is abnormal, the state score of the node is 0.

Step 2: Determining the starting time score of each node according to the starting time of the nodes. The starting time of the nodes is read, the nodes are arranged in descending order of starting time, and the position of each node after the ordering is taken as the starting time score of the node. That is, the earlier a node's starting time, the higher its starting time score.

Step 3: Determining the service state score of each node according to the service state information. If the service state information indicates that the service state is normal, the service state score of the node is 1; if it indicates that the service state is abnormal, the service state score of the node is 0.

Step 4: Determining the client connection number score of each node according to the number of clients connected to the node. The number of clients connected to each node is read, the nodes are arranged in descending order of the number of connected clients, from most to fewest, and the rank of each node after the ordering is taken as the client connection number score of the node. That is, the fewer clients are connected to a node, the higher its client connection number score.

Step 5: Determining the CPU occupancy rate score of each node according to the CPU occupancy rate of the node. The CPU occupancy rate of each node is read, the nodes are arranged in descending order of CPU occupancy rate, from high to low, and the rank of each node after the ordering is taken as the CPU occupancy rate score of the node. That is, the lower the CPU occupancy rate of a node, the higher its CPU occupancy rate score.

Step 6: Determining the service starting time score of each node according to the service starting time. The service starting time of the nodes is read, the nodes are arranged in descending order of service starting time, and the position of each node after the ordering is taken as the service starting time score of the node. That is, the earlier a node's service started, the higher its service starting time score.

Step 7: Adding the state score, the starting time score, the service state score, the client connection number score, the CPU occupancy rate score and the service starting time score of each node to obtain the health state score of each node. The higher the health state score of a node, the stronger its data processing capacity.

Step 8: Filling the node number of the node with the highest health state score into the fault processing node bar, and storing the health state score of each node into the node information database in descending order.
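The summation of the six scores can be written compactly. The symbols below are introduced here purely for illustration and do not appear in the source; the bounds assume that the rank-based scores take distinct values from 1 to N for a cluster of N nodes.

```latex
H_i = S_i + T_i + V_i + C_i + U_i + W_i,
\qquad
i^{*} = \arg\max_{i} H_i,
\qquad
4 \le H_i \le 4N + 2
```

Here S_i and V_i are the node state score and service state score (each 0 or 1), T_i, C_i, U_i and W_i are the starting time, client connection number, CPU occupancy rate and service starting time scores (each between 1 and N), and i* is the node with the highest health state score, whose number is written to the fault processing node bar. The lower bound corresponds to 0 + 1 + 0 + 1 + 1 + 1 and the upper bound to 1 + N + 1 + N + N + N.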

S3: Determining the label of the current fault processing node according to the ranking result, and storing the label in the node information database.

S4: when a node in the cluster fails, the cluster directly reads the current fault processing node label recorded in the node information database and informs the corresponding node of fault recovery.
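The failure path itself reduces to a lookup followed by a notification. The sketch below assumes a dictionary stands in for the node information database and that `notify` is a caller-supplied callback; both names are hypothetical.

```python
from typing import Callable, Dict, Optional


def handle_node_failure(
    db: Dict[str, Optional[int]],
    failed_node: int,
    notify: Callable[[int, int], None],
) -> Optional[int]:
    """Read the stored fault processing node label and notify that node.

    No election takes place here: the label was written earlier, when the
    health ranking was last refreshed.
    """
    handler = db.get("fault_handler_label")
    if handler is None:
        # Nothing recorded yet; the source does not specify this case.
        return None
    notify(handler, failed_node)
    return handler


sent = []
db = {"fault_handler_label": 2}
result = handle_node_failure(db, failed_node=0, notify=lambda h, f: sent.append((h, f)))
print(result, sent)  # 2 [(2, 0)]
```

The source does not say what happens when the recorded node is itself the one that failed, so the sketch leaves that case untouched.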

This embodiment provides a fault processing method for cluster nodes in which a node information database is created for the cluster. After a node fault occurs, no election is needed: the database in the cluster is read to directly obtain the number of the node responsible for fault processing, and that node completes fault recovery. This avoids the risk of split-brain, minimizes the impact of the fault on client services, and further improves the stability and scenario adaptability of the cluster.

Example two:

This embodiment also provides a method for processing a fault of a cluster node, which includes:

1. After the nodes are started, a node information database is added to the cluster, and a timed event is started to acquire the information of each node and store it in the node information database.
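One way to realize such a timed event is a self-rescheduling timer. The sketch below uses `threading.Timer` from the Python standard library; the class name, the `collect` callback and the 5-second interval are assumptions for illustration, since the source does not state how the timer is implemented or how often it fires.

```python
import threading
from typing import Callable, Dict


class PeriodicCollector:
    """Re-run `collect` every `interval` seconds and store the result in `db`."""

    def __init__(self, collect: Callable[[], Dict[int, dict]], db: dict, interval: float):
        self.collect = collect
        self.db = db
        self.interval = interval
        self._timer = None
        self._stopped = False

    def refresh_once(self) -> None:
        # Overwrite the stored snapshot with freshly collected node information.
        self.db["nodes"] = self.collect()

    def _tick(self) -> None:
        if self._stopped:
            return
        self.refresh_once()
        self._schedule()

    def _schedule(self) -> None:
        self._timer = threading.Timer(self.interval, self._tick)
        self._timer.daemon = True
        self._timer.start()

    def start(self) -> None:
        self._stopped = False
        self.refresh_once()   # take an initial snapshot immediately
        self._schedule()

    def stop(self) -> None:
        self._stopped = True
        if self._timer is not None:
            self._timer.cancel()


db = {}
collector = PeriodicCollector(lambda: {0: {"cpu": 12.5, "clients": 3}}, db, interval=5.0)
collector.start()
collector.stop()
print(db["nodes"])  # {0: {'cpu': 12.5, 'clients': 3}}
```

Taking one snapshot at start-up means the database is never empty while the first interval elapses.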

The node information database stores the following information:

node service information: service name, service start time, service state information;

node information: node starting time, CPU occupancy rate and client connection number;

cluster information: the number of nodes, the state of each node and the cluster state;

fault processing node bar: used for storing the label of the fault recovery node.
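The source does not name a database engine. Purely as an illustration, the four groups of information could be laid out as tables in an in-memory SQLite database via Python's standard `sqlite3` module; every table and column name below is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE node_service (node_id INTEGER, name TEXT, start_time REAL, state TEXT);
    CREATE TABLE node_info (node_id INTEGER PRIMARY KEY, start_time REAL,
                            cpu_percent REAL, client_count INTEGER);
    CREATE TABLE cluster_info (node_count INTEGER, cluster_state TEXT);
    CREATE TABLE node_state (node_id INTEGER PRIMARY KEY, state TEXT);
    -- Single-row table standing in for the "fault processing node bar".
    CREATE TABLE fault_handler (id INTEGER PRIMARY KEY CHECK (id = 1), node_id INTEGER);
    """
)


def set_fault_handler(conn: sqlite3.Connection, node_id: int) -> None:
    # INSERT OR REPLACE keeps exactly one row, so readers always see one label.
    conn.execute("INSERT OR REPLACE INTO fault_handler (id, node_id) VALUES (1, ?)", (node_id,))
    conn.commit()


def get_fault_handler(conn: sqlite3.Connection):
    row = conn.execute("SELECT node_id FROM fault_handler WHERE id = 1").fetchone()
    return None if row is None else row[0]


set_fault_handler(conn, 2)
set_fault_handler(conn, 0)   # a later refresh overwrites the previous label
print(get_fault_handler(conn))  # 0
```

Constraining the label table to a single row is one simple way to guarantee that a reader never sees two competing fault processing nodes.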

2. After the cluster is started and clients are connected to the cluster nodes, the information is written to the node information database periodically, the node health ranking is determined by a sorting algorithm, and the nodes are ranked from top to bottom, starting with the node most suitable for fault processing.

The sorting algorithm rules are as follows:

Step 1: Node state score. Node state normal: the score is 1; node state abnormal: the score is 0.

Step 2: Node starting time score: the nodes are ordered by starting time from earliest to latest. For example, if node 0, node 1 and node 2 are ordered as node 2, node 0 and node 1, the weights are 1, 2 and 3, and the score is 1 for node 0, 2 for node 1 and 3 for node 2.

Step 3: State score of the service provided to clients. Service state normal: the score is 1; service state abnormal: the score is 0.

Step 4: Client connection number score: the scores are 1, 2, ..., N. For example, if node 0, node 1 and node 2, sorted by connection number from high to low, are in the order node 2, node 0 and node 1, the scores of these nodes are 1, 2 and 3 respectively.

Step 5: CPU occupancy rate score: the nodes are sorted by CPU occupancy rate from high to low, and the nodes are scored 1, 2, ..., N according to the sorting.

Step 6: Service starting time score: the nodes are sorted in descending order of the time at which each node started providing service to clients, and the nodes are scored 1, 2, ..., N according to the sorting.
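The client connection example in step 4 (order node 2, node 0, node 1 with scores 1, 2, 3) can be reproduced with a short helper. The connection counts below are hypothetical values chosen only so that the order matches the example, and the function name is likewise an assumption.

```python
def position_scores(values):
    """Sort node numbers by value from high to low; score = 1-based position."""
    ordered = sorted(values, key=values.get, reverse=True)
    return ordered, {node: pos for pos, node in enumerate(ordered, start=1)}


# Hypothetical connection counts: node 2 has the most clients, node 1 the fewest.
connections = {0: 7, 1: 3, 2: 12}
order, scores = position_scores(connections)
print(order)   # [2, 0, 1]
print(scores)  # {2: 1, 0: 2, 1: 3}
```

The node with the fewest connections ends up with the highest score, which is what lets a lightly loaded node rise in the overall ranking.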

3. Finally, the scores are summed to give each node's health state score, and the number of the node with the highest score is filled into the fault processing node bar in the node information database. The results are saved in the database from high to low for use by the cluster. Here, the service name includes but is not limited to SAMBA, NFS, FTP, HTTP, ISCSI and RGW; the service state information includes but is not limited to the number of processes and the state of each process; the client connections counted include but are not limited to mounts through a domain name, through a virtual IP and through a physical IP; the state information of each node includes but is not limited to the state of the network card and whether the network card is congested when sending and receiving packets.

4. When a node in the cluster fails, the cluster directly reads the node recorded in the fault processing node bar of the node information database and notifies that node to perform fault recovery.

Example three:

Based on the first embodiment, as shown in FIG. 2, the present invention further discloses a system for handling a failure of a cluster node, including: a database building unit 1, a sorting unit 2, a storage unit 3 and a node selection unit 4.

The database building unit 1 is used for adding a node information database to the cluster, acquiring the information of all nodes in the cluster and storing it in the node information database.

The sorting unit 2 is used for updating the node information database periodically after the cluster is started and clients are connected to the cluster nodes, and determining the node health ranking by applying a sorting algorithm to the data stored in the node information database.

Wherein, the sorting unit 2 specifically includes the first to sixth scoring modules, the summarizing module and the screening module described above.

The storage unit 3 is used for determining the current fault processing node label according to the ranking result and storing it in the node information database.

The node selection unit 4 is used for, after a node in the cluster fails, directly reading the current fault processing node label recorded in the node information database and notifying the corresponding node to perform fault recovery.
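The division into four units can be sketched as one class with a method per unit. This only illustrates the wiring: the class, the callbacks and `by_idle_cpu` are hypothetical, and `by_idle_cpu` is a deliberately simplified stand-in ranking, not the six-score algorithm of the sorting unit.

```python
from typing import Callable, Dict, List, Optional, Tuple

Ranking = List[Tuple[int, float]]


class FaultHandlingSystem:
    def __init__(
        self,
        collect: Callable[[], Dict[int, dict]],
        rank: Callable[[Dict[int, dict]], Ranking],
        notify: Callable[[int, int], None],
    ) -> None:
        self.collect, self.rank, self.notify = collect, rank, notify
        self.db: dict = {"nodes": {}, "ranking": [], "fault_handler_label": None}

    def build_database(self) -> None:
        """Database building unit 1: store information on all nodes."""
        self.db["nodes"] = self.collect()

    def sort_nodes(self) -> None:
        """Sorting unit 2: refresh the data and rank nodes, best first."""
        self.db["nodes"] = self.collect()
        self.db["ranking"] = self.rank(self.db["nodes"])

    def store_label(self) -> None:
        """Storage unit 3: record the top-ranked node as fault processing node."""
        self.db["fault_handler_label"] = self.db["ranking"][0][0]

    def on_failure(self, failed_node: int) -> Optional[int]:
        """Node selection unit 4: read the stored label and notify that node."""
        handler = self.db["fault_handler_label"]
        if handler is not None:
            self.notify(handler, failed_node)
        return handler


def by_idle_cpu(nodes: Dict[int, dict]) -> Ranking:
    # Stand-in ranking (NOT the six-score algorithm): least-loaded node first.
    return sorted(((n, 100.0 - d["cpu"]) for n, d in nodes.items()), key=lambda kv: -kv[1])


notified = []
system = FaultHandlingSystem(
    collect=lambda: {0: {"cpu": 80.0}, 1: {"cpu": 10.0}},
    rank=by_idle_cpu,
    notify=lambda handler, failed: notified.append((handler, failed)),
)
system.build_database()
system.sort_nodes()
system.store_label()
print(system.on_failure(failed_node=0), notified)  # 1 [(1, 0)]
```

Because ranking and label storage run before any failure, `on_failure` performs no computation beyond the lookup.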

The embodiment provides a fault handling system of a cluster node, which selects a fault handling node according to information in a database, and when a service node in a cluster fails, the selected fault handling node timely handles the fault, thereby ensuring service continuity.

Example four:

This embodiment discloses a fault processing device for a cluster node, which comprises a processor and a memory; when executing the fault processing program of the cluster node stored in the memory, the processor implements the following steps:

1. Adding a node information database to the cluster, acquiring the information of all nodes in the cluster and storing it in the node information database.

2. After the cluster is started and clients are connected to the cluster nodes, updating the node information database periodically, and determining the node health ranking by applying a sorting algorithm to the data stored in the node information database.

3. Determining the label of the current fault processing node according to the ranking result, and storing the label in the node information database.

4. When a node in the cluster fails, the cluster directly reads the current fault processing node label recorded in the node information database and notifies the corresponding node to perform fault recovery.

Further, the fault handling apparatus of a cluster node in this embodiment may further include:

and the input interface is used for acquiring a fault processing program of the cluster node imported from the outside, storing the acquired fault processing program of the cluster node into the memory, and also used for acquiring various instructions and parameters transmitted by external terminal equipment and transmitting the instructions and parameters to the processor, so that the processor performs corresponding processing by using the instructions and the parameters. In this embodiment, the input interface may specifically include, but is not limited to, a USB interface, a serial interface, a voice input interface, a fingerprint input interface, a hard disk reading interface, and the like.

And the output interface is used for outputting various data generated by the processor to the terminal equipment connected with the output interface, so that other terminal equipment connected with the output interface can acquire various data generated by the processor. In this embodiment, the output interface may specifically include, but is not limited to, a USB interface, a serial interface, and the like.

And the communication unit is used for establishing remote communication connection between the fault processing device of the cluster node and the external server so that the fault processing device of the cluster node can mount the mirror image file into the external server. In this embodiment, the communication unit may specifically include, but is not limited to, a remote communication unit based on a wireless communication technology or a wired communication technology.

And the keyboard is used for acquiring various parameter data or instructions entered by a user through keystrokes in real time.

And the display is used for displaying, in real time, relevant information from the fault processing of the cluster nodes.

The mouse can be used for assisting a user in inputting data and simplifying the operation of the user.

Example five:

The present embodiment also discloses a readable storage medium, where the readable storage medium includes Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art. The readable storage medium has stored therein a fault handling program of a cluster node, which when executed by a processor implements the steps of the fault handling method of the cluster node described in any of the above embodiments.

In summary, the node information database is created for the cluster, after a node failure occurs, election is not needed, the database in the cluster is read to directly obtain the node number for failure processing, the node completes failure recovery, the risk of split brain is avoided, the influence of the failure on the service of the client is minimized, and the stability and the scene adaptability of the cluster are further improved.

The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The method disclosed by the embodiment corresponds to the system disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

In the embodiments provided by the present invention, it should be understood that the disclosed system, device and method can be implemented in other ways. The system embodiments described above are merely illustrative. For example, the division into units is only one division by logical function, and other divisions are possible in practice: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, systems or units, and may be electrical, mechanical or of another form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each module may exist alone physically, or two or more modules are integrated into one unit.

Similarly, the processing units in the embodiments of the present invention may be integrated into one functional module, or each processing unit may exist alone physically, or two or more processing units may be integrated into one functional module.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Finally, it should also be noted that relational terms such as "first" and "second" are used herein solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The method, system, device and readable storage medium for processing the fault of the cluster node provided by the invention are described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

Claims

1. A fault handling method for a cluster node is characterized by comprising the following steps:

2. The method of fault handling for cluster nodes of claim 1, wherein the data stored in the node information database comprises: node service information, node information, cluster information and the fault handling node label;

3. The method for processing the faults of the cluster nodes according to claim 2, wherein the step of determining the health condition sequence of the nodes by using a sequencing algorithm according to the data stored in the node information database comprises the following steps:

4. The method for processing the fault of the cluster node according to claim 3, wherein the step 1 comprises:

5. The method for processing the fault of the cluster node according to claim 3, wherein the step 2 comprises:

6. The method for processing the fault of the cluster node according to claim 3, wherein the step 3 comprises:

7. The method for processing the fault of the cluster node according to claim 3, wherein the step 4 comprises:

8. The method for handling the fault of the cluster node according to claim 3, wherein the step 5 comprises:

9. The method for handling the fault of the cluster node according to claim 3, wherein the step 6 comprises:

10. A system for handling a failure of a cluster node, comprising: