CN107026762B

CN107026762B - Disaster recovery system and method based on distributed cluster

Info

Publication number: CN107026762B
Application number: CN201710372773.6A
Authority: CN
Inventors: 张大帅
Original assignee: Zhengzhou Yunhai Information Technology Co Ltd
Current assignee: Zhengzhou Yunhai Information Technology Co Ltd
Priority date: 2017-05-24
Filing date: 2017-05-24
Publication date: 2020-07-03
Anticipated expiration: 2037-05-24
Also published as: CN107026762A

Abstract

The invention discloses a disaster recovery system based on a distributed cluster, which comprises a state detection module and a plurality of data nodes. The data nodes are all provided with management system processes; a data node running the management system process is a management node, and the management node is used for managing all the data nodes. When the data node serving as the management node fails, the state detection module selects any non-failed data node and takes the selected non-failed data node as the current management node. The disclosed system greatly increases the redundancy of management node disaster recovery and helps keep enterprise services running as smoothly as possible. The invention also discloses a disaster recovery method based on the disaster recovery system, and the method has the same beneficial effects.

Description

Disaster recovery system and method based on distributed cluster

Technical Field

The invention relates to the technical field of network communication, in particular to a disaster recovery system based on a distributed cluster; the invention also relates to a disaster recovery method based on the distributed cluster.

Background

Currently, with the development of network communication technology and the continuous expansion of enterprise size, enterprise business depends more and more on the network. However, natural disasters and human accidents can interrupt the business of an enterprise and bring it huge property loss. Therefore, modern enterprises need a complete disaster recovery system to ensure the normal operation of their business.

In the present society, the business of an enterprise is usually operated in a system formed by distributed clusters, and at this time, the enterprise needs to perform disaster recovery construction on the distributed clusters to ensure normal operation of the business of the enterprise.

In a distributed cluster, there is usually one management node dedicated to managing other nodes. In the prior art, a standby node is usually provided, which can also function as a management node, but the management processes in the standby node are usually switched off. The management node and the standby node are connected by heartbeat to judge the survival states of each other. When the management node fails, the standby node takes over the management node to provide management service.

However, in the prior art it can happen that neither the management node nor the standby node is available. In that case the whole system is affected and the business of the whole enterprise is interrupted, which may cause serious property loss to the enterprise.

Disclosure of Invention

In view of this, the main objective of the present invention is to provide a disaster recovery system based on distributed clusters, which can greatly increase redundancy of disaster recovery of management nodes; another objective of the present invention is to provide a disaster recovery method based on distributed clusters, which can effectively increase redundancy of disaster recovery of management nodes, so that enterprise services can run smoothly.

In order to solve the above problem, the present invention provides a disaster recovery system based on a distributed cluster, the system comprising:

a state detection module and a plurality of data nodes;

the data nodes are all provided with management system processes, the data nodes running the management system processes are management nodes, and the management nodes are used for managing all the data nodes;

when a data node serving as a management node fails, the state detection module is used for selecting any non-failure data node and taking the selected non-failure data node as a current management node.

Optionally, the state detection module is further configured to:

measuring a load state value of each data node;

and when the load state value of the data node serving as the management node exceeds a preset threshold value, closing the management system process corresponding to the data node, selecting any one of the non-fault data nodes, and taking the selected non-fault data node as the current management node.

Optionally, the non-failure data node is a data node with a minimum load state value.

Optionally, the management node is further configured to provide a common management platform, and the platform is configured to display the state parameters of the data nodes.

Optionally, the state detection module is further configured to:

and when the data node serving as the management node fails, pushing failure information to the public management platform.

The invention also provides a disaster tolerance method based on the distributed cluster, which comprises the following steps:

when a management node fails, a state detection module acquires fault information of the management node, wherein the management node is a data node which is running a management system process and is used for managing all the data nodes;

the state detection module selects any non-fault data node;

starting the management system process in the non-failed data node.

Optionally, the method further comprises:

the state detection module measures a load state value of the data node;

when the load state value of the management node exceeds a preset threshold value, the state detection module closes the management system process of the management node;

the state detection module selects any one of the non-failed data nodes;

and the state detection module starts a management system process in the non-fault data node.

Optionally, the selecting any one of the non-failed data nodes includes:

and selecting the data node with the minimum load state value.

Optionally, the method further comprises:

when the management node fails, the state detection module pushes failure information to a public management platform, and the public management platform is provided by the management node and used for displaying the state parameters of the data nodes.

The disaster recovery system based on the distributed cluster comprises a plurality of data nodes, each data node is provided with a management system process, the data node running the management system process is a management node, and the management node is used for managing all the data nodes.

When the management node that provides management service for all the nodes fails, every other data node is capable of managing all the data nodes, and one of them is selected to provide the management service for all the other nodes. The disaster recovery system provided by the invention greatly increases the redundancy of management node disaster recovery and helps keep enterprise services running as smoothly as possible. The invention also provides a disaster recovery method based on the distributed cluster, which has the same beneficial effects; these are not repeated herein.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art according to the drawings.

Fig. 1 is a schematic structural diagram of a first disaster recovery system according to an embodiment of the present invention;

fig. 2 is a schematic structural diagram of a second disaster recovery system according to an embodiment of the present invention;

fig. 3 is a schematic structural diagram of a third disaster recovery system according to an embodiment of the present invention;

fig. 4 is a flowchart of a first disaster recovery method according to an embodiment of the present invention;

fig. 5 is a flowchart of a second disaster recovery method according to an embodiment of the present invention.

Detailed Description

In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention relates to a disaster recovery system based on a distributed cluster. In the prior art, because usually only one management node and one standby node are provided, it sometimes happens that both the management node and the standby node are unavailable; the whole system is then affected, the business of the whole enterprise is interrupted, and serious property loss is caused to the enterprise. The reason is that the redundancy of the management node in the disaster recovery systems of the prior art is still insufficient.

In the disaster recovery system provided by the present invention, each data node is provided with a management system process, that is, each data node has the ability to manage all data nodes. Compared with the prior art, the disaster recovery system provided by the invention greatly increases the redundancy of the management node in the disaster recovery system, can effectively avoid the condition that the business of an enterprise is interrupted, and effectively reduces the property loss of the enterprise caused by the failure of the management node.

The present invention will be described in detail below with reference to the accompanying drawings.

Referring to fig. 1, fig. 1 is a schematic structural diagram of a first disaster recovery system according to an embodiment of the present invention, where the system includes:

a state detection module 101 and a plurality of data nodes 102; the data nodes 102 are all provided with management system processes, the data node 102 running the management system processes is a management node 103, and the management node 103 is used for managing all the data nodes 102;

In the embodiment of the present invention, a plurality of data nodes 102 are provided. Each data node 102 may provide a data service, and each data node 102 is provided with a management system process. Normally, the management system process is in a closed state in most of the data nodes 102 and in an open state in only one data node 102. The data node 102 running the management system process is the management node 103, and the management node 103 manages all the data nodes 102. This specifically includes control of the ongoing traffic of each data node 102, management of the on or off times of each data node 102, detection of the data generated by each data node 102, and so on. In general, one of the tasks of the management node 103 is to manage all nodes.

When the management system process of a certain data node 102 starts to run, the data node 102 may continue to provide the original data service, or may only perform the management service, that is, only manage all the data nodes 102, and no longer continue to provide the original data service.

In the embodiment of the present invention, there may be only one state detection module 101 or several. Normally, a stat_check process is set in each data node 102, and the stat_check processes communicate with each other through UDP (User Datagram Protocol); the stat_check process is used because it incurs little system overhead and has no large influence on the load of each data node 102. Of course, processes or modules other than the stat_check process may also be set, and their number is not specifically limited; all of these processes or modules together constitute the state detection module 101.
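As an illustration only, the exchange of node state between stat_check processes could look like the following Python sketch. The patent does not specify a packet layout, so the JSON payload, its field names, and the functions `encode_status`, `decode_status`, and `send_status` are all assumptions made for this example.

```python
import json
import socket


def encode_status(node_id: int, is_manager: bool, load: float) -> bytes:
    """Serialize one node's state into a small datagram payload (assumed JSON layout)."""
    message = {"node": node_id, "manager": is_manager, "load": load}
    return json.dumps(message).encode("utf-8")


def decode_status(payload: bytes) -> dict:
    """Parse a payload produced by encode_status."""
    return json.loads(payload.decode("utf-8"))


def send_status(payload: bytes, peers: list) -> None:
    """Send the payload to every (host, port) peer over UDP, without waiting for replies."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for host, port in peers:
            sock.sendto(payload, (host, port))
```

UDP is connectionless, so a sender in this sketch never learns whether a peer received the datagram; detecting a dead peer is left to the receiving side, which notices that expected packets have stopped arriving.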

When a data node 102 serving as a management node 103 fails, the state detection module 101 is configured to select any non-failed data node 102, and use the selected non-failed data node 102 as a current management node 103.

In the embodiment of the present invention, the management service performed by the management node 103 is usually very complex, its workload is very large, and it imposes a very large overhead on the management node 103, so the management node 103 is normally the node most likely to fail first.

When a data node 102 serving as a management node 103 fails, the state detection module 101 may first acquire failure information of the management node 103. There are many ways to acquire the failure information. For example, some processes or hardware devices in the management node 103 may fail so that it can no longer perform management services, while the management node 103 is still able to send out failure information; when the state detection module 101 receives this failure information, it may select any non-failed data node 102 and use the selected non-failed data node 102 as the current management node 103.

Alternatively, the management node 103 may suffer a serious failure, that is, one in which it cannot send any information to the outside. To cover this case, the state detection module 101 may scan each data node 102 at a certain frequency and determine that the management node 103 has failed when several consecutive scans obtain no information from it. For example, the state detection module 101 may scan each data node 102 once per minute; when three consecutive scans obtain no information from the management node 103, it is determined that the management node 103 has failed. The state detection module 101 may then select any non-failed data node 102 and use the selected non-failed data node 102 as the current management node 103.

When the state detection module 101 is composed of stat_check processes arranged in each data node 102, the scanning is performed by those processes: the stat_check processes communicate with each other at a certain frequency, each sending the state of its own data node 102 to the other data nodes 102 over UDP in some form, for example as a packet. When the other data nodes 102 have not received the information sent by the management node 103 for a preset number of times, it is determined that the management node 103 has failed, and the state detection module 101 then selects any non-failed data node 102 and takes the selected non-failed data node 102 as the current management node 103.

In practice, other kinds of failure may also be encountered, and the determination is not limited to the two methods above; whichever method is used to determine that the management node 103 has failed, the implementation of the present invention is not affected.

Normally, the two methods are used in combination: when the state detection module 101 either receives failure information sent by the management node 103 or fails to obtain any information from the management node 103 over multiple consecutive scans, it selects any non-failed data node 102 and uses the selected non-failed data node 102 as the current management node 103. Of course, only one of the two methods may be used; this is not particularly limited in the embodiment of the present invention.
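A minimal sketch of combining the two detection methods is given below, assuming the example figure of three consecutive unanswered scans. The class `ManagerFailureDetector` and its method names are hypothetical and are not taken from the patent.

```python
class ManagerFailureDetector:
    """Declares the management node failed on an explicit fault report
    or after a run of consecutive unanswered scans (hypothetical helper)."""

    def __init__(self, max_missed_scans: int = 3):
        self.max_missed_scans = max_missed_scans  # example value from the text
        self.missed = 0
        self.fault_reported = False

    def record_scan(self, got_reply: bool) -> None:
        # Any reply resets the run, so only consecutive misses are counted.
        self.missed = 0 if got_reply else self.missed + 1

    def record_fault_report(self) -> None:
        # The management node itself announced that it can no longer manage.
        self.fault_reported = True

    def manager_failed(self) -> bool:
        return self.fault_reported or self.missed >= self.max_missed_scans
```

Resetting the counter on every reply is a design choice of this sketch: it keeps occasional lost packets from adding up to a false failure verdict.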

In the embodiment of the present invention, the state detection module 101 selects any one of the non-failed data nodes 102, which may be a randomly selected non-failed data node 102, or may be a selected data node 102 meeting a specific requirement, and details will be described in the following embodiments and will not be expanded herein.

In the embodiment of the present invention, the selected non-failed data node 102 is used as the current management node 103, specifically, after the state detection module 101 selects one non-failed data node 102, the management system process of the data node 102 is opened, the data node 102 running the management system process is the new management node 103 at this time, and the management node 103 starts to provide management services for all the data nodes 102.
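The takeover described above can be sketched as follows. This is only an illustration under simplifying assumptions: the `DataNode` record and the `fail_over` function are hypothetical, and "opening the management system process" is modeled as setting a flag rather than starting a real process.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataNode:
    node_id: int
    failed: bool = False
    management_running: bool = False  # stands in for the management system process


def fail_over(nodes: List[DataNode], rng: Optional[random.Random] = None) -> DataNode:
    """Pick any non-failed node at random and make it the only management node."""
    rng = rng or random.Random()
    candidates = [n for n in nodes if not n.failed]
    if not candidates:
        raise RuntimeError("no non-failed data node is available")
    for n in nodes:
        n.management_running = False  # the old manager no longer counts as running
    chosen = rng.choice(candidates)
    chosen.management_running = True  # "open" the management process on the new node
    return chosen
```

Random choice corresponds to the "any non-failed data node" option; a selection rule based on a specific requirement could replace `rng.choice` without changing the rest of the sketch.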

The disaster recovery system based on the distributed cluster provided by the embodiment of the invention comprises a plurality of data nodes 102, and each data node 102 is provided with a management system process. When the management node 103 that provides management service for all the nodes fails, every other data node 102 is capable of managing all the data nodes 102, and one of them is selected to provide management service for all the other nodes. The disaster recovery system provided by the invention greatly increases the redundancy of management node 103 disaster recovery and helps keep enterprise services running as smoothly as possible.

In the above embodiment of the invention, when the management node 103 fails, another management node 103 is started to continue providing management services to all data nodes 102. When some state parameters of the management node 103 are relatively high, for example, the load or the temperature is relatively high, the management node 103 is also prone to failure, and at this time, the current management node 103 may be turned off, and another management node 103 may be turned on, so as to balance the utilization of resources in the system and prolong the service time of each data node 102.

Referring to fig. 2, fig. 2 is a schematic structural diagram of a second disaster recovery system according to an embodiment of the present invention, and the difference between the system according to the embodiment of the present invention and the system according to the previous embodiment of the present invention is that in the system according to the embodiment of the present invention, the state detection module 101 may further measure a state value of the data node 102.

In an embodiment of the present invention, the state detection module 101 is further configured to:

measuring a load status value of each of the data nodes 102;

In the embodiment of the present invention, the state detection module 101 may measure the state parameters of each data node 102. These may include a load state value and, of course, may also include a temperature value, a humidity value, and the like. Measuring the temperature value requires a temperature sensor to be added to each data node 102, and measuring the humidity value requires a humidity sensor to be added to each data node 102.

When the load state value of the data node 102 as the management node 103 exceeds a preset threshold, the management system process corresponding to the data node 102 is closed, any one of the non-failure data nodes 102 is selected, and the selected non-failure data node 102 is used as the current management node 103.

When the measured status parameter of the data node 102 is a load status value, the corresponding preset threshold is a threshold of the load status. Of course, in the last step, a plurality of state parameters may also be measured, and at this time, a plurality of preset thresholds may correspond to the measured state parameters. In the embodiment of the present invention, the load status value is only one example. When the load status value of the data node 102 is relatively high, the data node 102 is prone to failure, so when the load status value of the data node 102 as the management node 103 exceeds a predetermined threshold, for example, when the load status value of the management node 103 exceeds 80%, the status detection module 101 may shut down the management system process of the management node 103 at that time, select any one of the non-failed data nodes 102, and turn on the management system process in the non-failed data node 102. The data node 102 that is running the management system process at this time is a new management node 103 for providing management services to all the data nodes 102.

In addition to being prone to failure when the load status value is high, the data node 102 is also prone to failure when other status parameters are high, such as: the data node 102 is also prone to failure when the temperature of the operating environment of the data node 102 is relatively high. Correspondingly, when the temperature value of the data node 102 serving as the management node 103 exceeds a preset threshold, for example, when the temperature value of the management node 103 exceeds 70 ℃, the state detection module 101 may close the management system process of the management node 103 at this time, select any one of the non-failed data nodes 102, and open the management system process in the non-failed data node 102.
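The threshold checks can be illustrated with a short sketch that uses the two example limits from the text, a load of 80% and a temperature of 70 ℃. The dictionary keys and the function names `exceeded_parameters` and `should_hand_over` are assumptions made for this example.

```python
# Example limits taken from the text; the key names are hypothetical.
THRESHOLDS = {"load": 0.80, "temperature_c": 70.0}


def exceeded_parameters(state: dict, thresholds: dict = THRESHOLDS) -> list:
    """Return the names of the measured parameters that exceed their preset threshold."""
    return [
        name
        for name, limit in thresholds.items()
        if state.get(name, float("-inf")) > limit  # unmeasured parameters never trigger
    ]


def should_hand_over(state: dict) -> bool:
    """True when at least one state parameter of the management node is too high."""
    return bool(exceeded_parameters(state))
```

In this sketch "exceeds" is read as strictly greater than the threshold, so a load of exactly 80% does not trigger a handover.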

Further, the state detection module 101 may select any non-failed data node 102, and may randomly select a non-failed data node 102, or may select a data node 102 meeting a specific requirement.

In this embodiment of the present invention, the non-faulty data node 102 is the data node 102 with the smallest load status value, that is, when the status detection module 101 finds that the management node 103 is down, or when the status detection module 101 detects that the load status value of the current management node 103 does not meet the requirement, the data node 102 with the smallest load status value in the entire system may be selected, and the management system process of the data node 102 is opened, where the data node 102 is the management node 103 and is used to provide management services for all the data nodes 102.

Because the resources consumed by running the management system process are more, the load state value of the data node 102 is greatly increased, and at this time, the data node 102 with the smallest load state value in the whole system is taken as the management node 103 to provide management services for all the data nodes 102, so that the resources of the whole system can be well balanced, and the resource utilization rate of the whole system is obviously improved.

Several data nodes 102 in the system may share the minimum load state value. Since a specific serial number is assigned to each data node 102 in order to distinguish the data nodes 102 in the system, the data node 102 with the smallest serial number among those sharing the minimum load state value may be selected as the management node 103. Of course, the data node 102 with the largest serial number may be selected instead, or any one of these data nodes 102 may be selected by some other method, which is not particularly limited herein.
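One way to express this selection rule, choosing the smallest-serial-number option for ties, is sketched below. The function `select_manager` and its argument layout (a mapping from serial number to load state value, plus a set of failed serial numbers) are assumptions for illustration.

```python
def select_manager(loads: dict, failed: set) -> int:
    """Return the serial number of the non-failed node with the minimum load,
    breaking ties in favour of the smallest serial number."""
    candidates = [
        (load, serial) for serial, load in loads.items() if serial not in failed
    ]
    if not candidates:
        raise RuntimeError("no non-failed data node is available")
    # Tuples compare element by element: load first, then serial number.
    return min(candidates)[1]
```

For example, `select_manager({1: 0.2, 2: 0.1, 3: 0.1}, failed=set())` returns `2`, because nodes 2 and 3 tie on load and 2 is the smaller serial number.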

In the disaster recovery system based on the distributed cluster provided in the embodiment of the present invention, when the state parameter of the management node 103, for example, the load state value is relatively high, in order to prevent the management node 103 from failing, another data node 102 may be replaced to provide a management service for the entire system. Furthermore, when the management node 103 needs to be replaced, the data node 102 with the smallest load state value in the current system can be used as the management node 103, so as to better balance the resources of the whole system.

In order to facilitate the operator's grasp of the current state of the whole system, the management node 103 may further provide a common management platform.

Referring to fig. 3, fig. 3 is a schematic structural diagram of a third disaster recovery system according to an embodiment of the present invention, and the difference between the system according to the embodiment of the present invention and the systems according to the first two embodiments of the present invention is that the management node 103 provides a common management platform 301, which is used for displaying the status parameters of each data node 102.

After measuring the status parameters of each data node 102, the status detection module 101 uploads the status parameters to the common management platform 301 provided by the management node 103, so as to display the status parameters of each data node 102.

Through the common management platform 301, staff may view the state parameters of each data node 102 and may further manage each data node 102, for example by controlling the ongoing service of each data node 102, managing the time at which each data node 102 is turned on or off, and so on. The common management platform 301 may have other functions besides these, which are not limited in detail herein. The purpose of the common management platform 301 is simply to make it convenient for staff to manage the whole system directly.

When the data node 102 serving as the management node 103 fails, the state detection module 101 further pushes failure information to the common management platform 301.

When the management node 103 fails, the state detection module 101 may further push fault information to the public management platform 301, where the fault information may include a serial number of the failed node, a fault time, a fault reason, and the like, so as to remind a worker to repair the failed data node 102 as soon as possible.
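The fault information named above (serial number, fault time, fault reason) could be modeled as a small record pushed to the platform. The classes `FaultInfo` and `CommonManagementPlatform` below are hypothetical stand-ins; the patent does not describe the platform's interface.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FaultInfo:
    serial_number: int    # serial number of the failed node
    fault_time: datetime  # when the failure was determined
    reason: str           # fault reason, as far as it is known


class CommonManagementPlatform:
    """In-memory stand-in for the platform that displays state and fault information."""

    def __init__(self):
        self.fault_log = []

    def push_fault(self, info: FaultInfo) -> None:
        self.fault_log.append(info)

    def pending_repairs(self) -> list:
        """Serial numbers of nodes with at least one reported fault, in ascending order."""
        return sorted({info.serial_number for info in self.fault_log})


platform = CommonManagementPlatform()
platform.push_fault(
    FaultInfo(7, datetime(2017, 5, 24, tzinfo=timezone.utc), "no reply to three scans")
)
```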

When the state parameter of the management node 103 is too high, for example, the load state value exceeds a preset threshold value, the information of replacing the management node 103 may also be pushed to the common management platform 301, so that the worker can know the state of the current system in time.

In the disaster recovery system based on the distributed cluster provided by the embodiment of the present invention, the management node 103 provides the common management platform 301. Through the common management platform 301, staff can learn the current state of the whole system in time, which makes it convenient for them to manage the whole system.

Referring to fig. 4, fig. 4 is a flowchart of a first disaster recovery method according to an embodiment of the present invention.

The disaster recovery method provided by the embodiment of the present invention is applied to the disaster recovery system based on the distributed cluster described in any of the above embodiments. The system has been described in detail in the above embodiments and is not described again here; for specifics, refer to the above embodiments.

The disaster recovery method provided by the embodiment of the invention specifically comprises the following steps:

Step 101: when a management node fails, a state detection module acquires fault information of the management node, wherein the management node is a data node which is running a management system process, and the management node is used for managing all the data nodes.

In this step, there are many ways for the status detection module to obtain the fault information of the management node, which have been described in detail in the above embodiments of the present invention and are not specifically developed here.

Step 102: the state detection module selects any one of the non-failed data nodes.

Step 103: the state detection module starts the management system process in the non-failed data node.

After the management system process is started, the data node running the management system process is a new management node, and the new management node is used for managing all the data nodes.

According to the disaster recovery method based on the distributed cluster provided by the embodiment of the invention, when the management node that provides the management service for all the nodes fails, every other data node is capable of managing all the data nodes, and one of them is selected to provide the management service for all the other nodes. The disaster recovery method provided by the invention greatly increases the redundancy of management node disaster recovery and helps keep enterprise services running as smoothly as possible.

In the embodiment of the invention, when the management node fails, another management node is started to continue to provide management services for all the data nodes. When some state parameters of the management node are relatively high, for example, the load or the temperature is relatively high, the management node is also prone to failure, and at this time, the current management node may be turned off, and another management node may be turned on, so as to balance the utilization of resources in the system and prolong the service time of each data node.

Referring to fig. 5, fig. 5 is a flowchart of a second disaster recovery method according to an embodiment of the present invention.

Step 201: the state detection module measures a load state value of the data node.

In the embodiment of the present invention, the status detection module may measure the status parameters of each data node, where there may be a load status value, and of course, there may also be a temperature value, a humidity value, and the like. The specific situation has been described in detail in the above embodiments, and is not described herein again.

Step 202: and when the load state value of the management node exceeds a preset threshold value, the state detection module closes the management system process of the management node.

When the measured state parameter of the data node is a load state value, the corresponding preset threshold is a threshold of the load state. Of course, in step 201, a plurality of state parameters may be measured, and a plurality of preset thresholds may correspond to the measured state parameters.

Step 203: the state detection module selects any one of the non-failed data nodes.

Furthermore, the state detection module may select any one of the non-failed data nodes randomly or according to a specific requirement.

When the state detection module finds that the management node is down, or when the state detection module detects that the load state value of the current management node does not meet the requirement, the data node with the minimum load state value in the whole system at the moment can be selected, the management system process of the data node is opened, and the data node is the management node at the moment and is used for providing management services for all the data nodes. The specific situation has been described in detail in the above embodiments, and is not described herein again.

Step 204: and the state detection module starts a management system process in the non-fault data node.

In the embodiment of the present invention, step 205 may be further included.

Step 205: when the management node fails, the state detection module pushes fault information to a public management platform; the public management platform is provided by the management node and is used to display the state parameters of the data nodes.

When the management node fails, the state detection module may further push fault information to the public management platform. The fault information may include the serial number of the failed node, the time of the fault, the cause of the fault, and the like, so that staff are reminded to repair the failed data node as soon as possible. The details have been described in the above embodiments and are not repeated here.
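A fault record carrying the three fields listed above might look like the following sketch. The class names, the field names, and the in-memory `PublicManagementPlatform` stand-in are all hypothetical; the embodiment does not specify the platform's interface or transport.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class FaultInfo:
    node_serial: str  # serial number of the failed node
    fault_time: str   # time of the fault, ISO 8601 (assumed format)
    reason: str       # cause of the fault


class PublicManagementPlatform:
    """In-memory stand-in for the public management platform (hypothetical)."""

    def __init__(self) -> None:
        self.faults: List[Dict[str, str]] = []

    def push_fault(self, info: FaultInfo) -> None:
        # A real platform would presumably receive this over the network.
        self.faults.append(asdict(info))


def report_fault(
    platform: PublicManagementPlatform, node_serial: str, reason: str
) -> FaultInfo:
    """Build a fault record stamped with the current UTC time and push it."""
    info = FaultInfo(
        node_serial=node_serial,
        fault_time=datetime.now(timezone.utc).isoformat(),
        reason=reason,
    )
    platform.push_fault(info)
    return info
```

A call such as `report_fault(platform, "node-1", "node down")` appends one record that staff could then read from the platform.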

In the disaster recovery method based on the distributed cluster provided by this embodiment, when a state parameter of the management node, for example its load state value, is relatively high, another data node may be used to provide the management service to the entire system in order to prevent the management node from failing. Furthermore, when the management node needs to be replaced, the data node with the smallest load state value in the current system may be used as the management node, so that the resources of the whole system are better balanced. The method also lets staff learn the current state of the whole system in time through the public management platform, which makes it easier for them to manage the whole system.
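Steps 201 to 205 can be combined into a single pass, sketched below under stated assumptions. The `Node` type, the `failover_once` function, the fault-log format, and the choice to exclude the outgoing management node from the candidates are illustrative assumptions rather than details fixed by this embodiment.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    node_id: str
    load: float               # step 201: measured load state value
    failed: bool = False      # True if the node is down
    is_manager: bool = False  # True if the management system process is running


def failover_once(
    nodes: List[Node],
    load_threshold: float,
    fault_log: List[Dict[str, str]],
) -> Optional[Node]:
    """Run one pass of steps 202-205 and return the resulting management node.

    Returns None if a replacement is needed but no candidate exists.
    """
    manager = next((n for n in nodes if n.is_manager), None)
    if manager is not None and not manager.failed and manager.load <= load_threshold:
        return manager  # current management node is healthy; nothing to do

    if manager is not None:
        # Step 202: close the management system process on the old manager.
        manager.is_manager = False
        # Step 205: push fault information only when the node actually failed.
        if manager.failed:
            fault_log.append({"node": manager.node_id, "reason": "node down"})

    # Step 203: select the non-failed node with the smallest load
    # (assumption: the outgoing manager is not a candidate).
    candidates = [n for n in nodes if not n.failed and n is not manager]
    if not candidates:
        return None
    new_manager = min(candidates, key=lambda n: n.load)

    # Step 204: start the management system process on the selected node.
    new_manager.is_manager = True
    return new_manager
```

Calling this function periodically from the state detection module would give the behavior summarized above: an overloaded or failed management node is replaced by the least-loaded healthy node.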

The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims

1. A distributed cluster-based disaster recovery system, the system comprising:

the system comprises a state detection module and a plurality of data nodes;

when a data node serving as a management node fails, the state detection module is used for selecting any non-failed data node and taking the selected non-failed data node as a current management node;

the management node is also used for providing a public management platform, and the platform is used for displaying the state parameters of the data nodes;

the state detection module is used for measuring the state parameters of the data nodes and uploading the state parameters to the public management platform;

the state detection module is further configured to:

2. The system of claim 1, wherein the state detection module is further configured to:

measuring a load state value of each data node;

3. The system of claim 2, wherein the non-failed data node is the data node with the smallest load status value.

4. A disaster recovery method based on a distributed cluster, the method comprising:

the state detection module selects any non-failed data node;

starting the management system process in the non-failed data node;

the method further comprises:

when the management node fails, the state detection module pushes failure information to a public management platform; the public management platform is provided by the management node and is used for displaying the state parameters of each data node;

and the state detection module measures the state parameters of the data nodes and uploads the state parameters to the public management platform.

5. The method of claim 4, further comprising:

the state detection module measures a load state value of the data node;

the state detection module selects any one of the non-failed data nodes;

6. The method of claim 5, wherein said selecting any one of said non-failed data nodes comprises:

and selecting the data node with the minimum load state value.