CN117640349A - Fault recovery method and device for network attached storage system cluster, and host device

Fault recovery method and device for network attached storage system cluster, and host device

Info

Publication number
CN117640349A
Authority
CN
China
Prior art keywords
node
cluster
recoverable
storage system
offline
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311609643.1A
Other languages
Chinese (zh)
Inventor
储欣媛
侯胜伟
马桂杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Yunhai Information Technology Co Ltd filed Critical Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN202311609643.1A priority Critical patent/CN117640349A/en
Publication of CN117640349A publication Critical patent/CN117640349A/en
Pending legal-status Critical Current

Abstract

The invention discloses a fault recovery method and device for a network attached storage (NAS) system cluster, and a host device, and relates to the technical field of network attached storage systems. The method comprises: when the current node is the master node of the NAS cluster, detecting offline nodes in the NAS cluster, where the current node is any node in the NAS cluster; determining recoverable nodes among the offline nodes according to the communication state between the current node and each offline node and the host state of each offline node, where the communication state between a recoverable node and the current node is communicable and the host state of the recoverable node is normal; and controlling each recoverable node to configure a cluster configuration file, so as to recover each recoverable node to the NAS cluster. The invention can detect and identify recoverable offline nodes in the NAS cluster and automatically recover them to the NAS cluster, reducing the operating pressure on the nodes and the risk of failure.

Description

Fault recovery method and device for network attached storage system cluster, and host device
Technical Field
The present invention relates to the field of network attached storage systems, and in particular, to a method and apparatus for recovering a failure of a network attached storage system cluster, a host device, and a computer readable storage medium.
Background
Currently, in a NAS (Network Attached Storage) cluster, when a node fails, the cluster information of that node is lost; when the virtual machine of the node returns to normal, it holds no cluster information, so it cannot rejoin the NAS cluster and the NAS service of the node cannot be restored. Taking a four-node NAS cluster as an example: the cluster management system migrates the NAS service of the failed node to a normal node, which then provides that service on behalf of the failed node; when two or three nodes fail, the NAS services originally distributed across four nodes are concentrated on one or two nodes, which increases the operating pressure on each remaining node and also increases the risk of failure.
Therefore, how to realize the automatic recovery of the fault node in the NAS cluster, reduce the operation pressure of the node and reduce the fault risk is an urgent problem to be solved nowadays.
Disclosure of Invention
The invention aims to provide a fault recovery method, a fault recovery device, a host device and a computer readable storage medium of a network attached storage system cluster, so as to realize automatic recovery of fault nodes in an NAS cluster, reduce the operation pressure of the nodes and reduce the fault risk.
In order to solve the above technical problems, the present invention provides a method for recovering a failure of a network attached storage system cluster, including:
when the current node is a master node of a network attached storage system cluster, detecting an offline node in the network attached storage system cluster; the current node is any node in the network attached storage system cluster;
determining recoverable nodes among the offline nodes according to the communication state between the current node and each offline node and the host state of each offline node; the communication state between a recoverable node and the current node is communicable, and the host state of the recoverable node is normal;
controlling each recoverable node to configure a cluster configuration file, and recovering each recoverable node to the network attached storage system cluster; the cluster configuration file comprises cluster configuration information of the node and token information of all nodes in the network attached storage system cluster.
In some embodiments, the detecting an offline node in the network attached storage system cluster when the current node is a master node of the network attached storage system cluster includes:
monitoring whether the current node is the master node;
if the node is the master node, detecting the offline node;
and if the node is not the master node, ending.
In some embodiments, the monitoring whether the current node is the master node includes:
monitoring the node type of the current node and the starting condition of the recovery process by using a monitoring process;
if the node type is the master node, starting the recovery process when the recovery process is in a non-started state, and detecting the offline node by using the recovery process; and when the recovery process is already in a started state, returning to the step of monitoring the node type of the current node and the starting condition of the recovery process by using the monitoring process;
and if the node type is the slave node, ending the recovery process.
In some embodiments, the method further comprises:
when the current node is a slave node of the network attached storage system cluster, if the current node is the offline node, recovering the cluster configuration file according to the control of the master node;
Restarting a cluster management system of the network attached storage system cluster by using the restored cluster configuration file;
and after the cluster management system is restarted successfully, sending an authentication instruction to the master node, so that the master node, after successfully authenticating the current node according to the authentication instruction, rejoins the current node to the network attached storage system cluster.
In some embodiments, the determining a recoverable node in the offline node according to the communication state of the current node and the offline node and the host state of the offline node includes:
determining the recoverable node among the offline nodes according to the communication state between the current node and the offline node, the host state of the offline node, the cluster information condition and the recovery mark file condition; the cluster information condition of a recoverable node is either that no cluster information exists or that cluster information exists, and when the cluster information condition of a current recoverable node is that cluster information exists, the recovery mark file condition of the current recoverable node is that no recovery mark file exists; the current recoverable node is any recoverable node;
correspondingly, the controlling each recoverable node to configure a cluster configuration file and recovering each recoverable node to the network attached storage system cluster includes:
controlling each recoverable node to configure the cluster configuration file and a recovery mark file, and recovering the recoverable node to the network attached storage system cluster.
In some embodiments, the determining the recoverable node in the offline node according to the communication state of the current node and the offline node, the host state of the offline node, the cluster information condition and the recovery mark file condition includes:
judging whether the offline node exists in the network additional storage system cluster;
if the offline node exists, judging whether a reset node exists in the offline node according to the recovery mark file condition and the cluster information condition of the offline node; the recovery mark file condition of the reset node is that the recovery mark file exists and the cluster information condition is that cluster information exists;
if the reset node exists, controlling the reset node to perform node reset so as to clear cluster information in the reset node;
if the reset node does not exist, judging whether a recoverable node exists in the offline node;
if the recoverable node exists, executing the step of controlling each recoverable node to configure the cluster configuration file and the recovery mark file, and recovering the recoverable node to the network attached storage system cluster;
Correspondingly, the controlling each recoverable node to configure the cluster configuration file and the recovery mark file, and recovering the recoverable node to the network attached storage system cluster includes:
controlling each recoverable node to respectively configure the recovery mark file;
sending respective corresponding cluster configuration files to each recoverable node, so that each recoverable node restarts a cluster management system of the network attached storage system cluster by using the respective received cluster configuration file;
authenticating the recoverable node which is successfully restarted by the cluster management system, recovering the recoverable node which is successfully authenticated into the network attached storage system cluster, and controlling the recoverable node which is successfully authenticated to delete the respective recovery mark file.
In some embodiments, the controlling the reset node to perform node reset includes:
and sending a reset instruction to each reset node through a secure shell protocol so as to control each reset node to reset the node.
The invention also provides a fault recovery device for a network attached storage system cluster, which is applied to the current node and comprises:
a detection module, configured to detect offline nodes in the network attached storage system cluster when the current node is a master node of the network attached storage system cluster; the current node is any node in the network attached storage system cluster;
a determining module, configured to determine recoverable nodes among the offline nodes according to the communication state between the current node and each offline node and the host state of each offline node; the communication state between a recoverable node and the current node is communicable, and the host state of the recoverable node is normal;
a recovery module, configured to control each recoverable node to configure a cluster configuration file, and to recover each recoverable node to the network attached storage system cluster; the cluster configuration file comprises cluster configuration information of the node and token information of all nodes in the network attached storage system cluster.
The present invention also provides a host device including:
a memory for storing a computer program;
and a processor, configured to implement the steps of the method for recovering a failure of a network attached storage system cluster as described above when executing the computer program.
The present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method for recovering from a failure of a network attached storage system cluster as described above.
The invention provides a fault recovery method for a network attached storage system cluster, which comprises the following steps: when the current node is a master node of the network attached storage system cluster, detecting an offline node in the network attached storage system cluster; the current node is any node in the network attached storage system cluster; determining recoverable nodes among the offline nodes according to the communication state between the current node and each offline node and the host state of each offline node; the communication state between a recoverable node and the current node is communicable, and the host state of the recoverable node is normal; controlling each recoverable node to configure a cluster configuration file, and recovering each recoverable node to the network attached storage system cluster; the cluster configuration file comprises cluster configuration information of the node and token information of all nodes in the network attached storage system cluster;
therefore, by determining the recoverable nodes among the offline nodes according to the communication state between the current node and each offline node and the host state of each offline node, the invention can detect and identify the recoverable offline nodes in the NAS cluster; by controlling each recoverable node to configure the cluster configuration file and recovering each recoverable node to the network attached storage system cluster, the recoverable offline nodes are recovered automatically, which solves the prior-art problem that a node taken offline by a failure cannot rejoin the NAS cluster, and reduces both the operating pressure on the nodes and the risk of failure; moreover, the recovery of offline nodes is not limited by the number of nodes, so that high availability of the NAS cluster is ensured. In addition, the invention also provides a fault recovery device for a network attached storage system cluster, a host device and a computer readable storage medium, which have the same beneficial effects.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for recovering a failure of a network attached storage system cluster according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating another method for recovering failures of a network attached storage system cluster according to an embodiment of the present invention;
FIG. 3 is an interactive schematic diagram of another method for recovering failures of a network attached storage system cluster according to an embodiment of the present invention;
FIG. 4 is a block diagram illustrating a failure recovery apparatus for a network attached storage system cluster according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a simplified structure of a host device according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a specific structure of a host device according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a computer readable storage medium according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, fig. 1 is a flowchart of a fault recovery method for a network attached storage system cluster according to an embodiment of the present invention. The method may include:
Step 101: detecting an offline node in the network attached storage system cluster when the current node is a master node of the network attached storage system cluster; the current node is any node in the network attached storage system cluster.
It is understood that the current node in this embodiment may be any node in a NAS (network attached storage) cluster, such as a KVM (Kernel-based Virtual Machine, an open-source system virtualization module) virtual machine. For example, the storage system corresponding to the NAS cluster in this embodiment may include a plurality of (i.e. two or more) host device (HOST) nodes; a cluster management system is deployed on the host device nodes to manage them, and each host device node runs a KVM virtual machine that provides NAS services externally; all the KVM virtual machines can communicate with each other and are managed through the NAS cluster.
Correspondingly, the fault recovery method of the network attached storage system cluster provided by the embodiment can be applied to the host equipment (i.e. the host) where the current node is located, that is, the processor of the host equipment can execute the method provided by the embodiment, so as to realize the automatic recovery of the recoverable offline node (i.e. the recoverable node) in the NAS cluster.
Note that the nodes in the NAS cluster may be divided into a master node and slave nodes, i.e. all nodes other than the master node. Taking a four-controller NAS cluster as an example, each node in the NAS cluster can be configured with a unique numeric node name; for example, the node names of the four nodes can be 1, 2, 3 and 4 respectively. The cluster management system can implement an arbitration mechanism based on the node names and select, among all nodes, the node with the smallest node-name number as the master node; that is, the node named 1 is the master node and the other three nodes are slave nodes. When a node in the NAS cluster fails, the cluster management system can identify and report the offline state of the failed node; if the master node of the NAS cluster fails, the cluster management system selects a new master node from the remaining nodes.
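The arbitration rule described above (the online node with the smallest node-name number becomes the master) can be illustrated with a minimal Python sketch. The function name `elect_master` and the representation of node names as numeric strings are assumptions made for illustration; the patent does not specify how the cluster management system implements arbitration.

```python
def elect_master(online_node_names):
    """Return the online node with the smallest numeric name, or None.

    Hypothetical helper: node names are assumed to be numeric strings
    such as "1", "2", "3", "4".
    """
    if not online_node_names:
        return None
    # Compare by numeric value so that "10" does not sort before "2".
    return min(online_node_names, key=int)


# With all four nodes online, node "1" is selected.
master = elect_master(["1", "2", "3", "4"])

# If node "1" goes offline, the next-smallest name takes over.
new_master = elect_master(["2", "3", "4"])
```

Comparing by `int` rather than by string keeps the rule correct for clusters with ten or more nodes.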
Correspondingly, in this step, the host device where the current node is located may detect an offline node (i.e., a node in an offline state) in the network attached storage system cluster when the current node is the master node of the network attached storage system cluster. For example, a recovery process may be set up in the current node to detect offline nodes in the NAS cluster and automatically rejoin the recoverable offline nodes to the NAS cluster. The specific manner in which the current node (i.e., the master node) detects the offline node in this step may be chosen by the designer according to the practical scenario and user requirements; for example, it may be implemented in the same or a similar manner as prior-art methods for identifying the offline state of a node in a NAS cluster, which is not limited in this embodiment.
Furthermore, in order to ensure that only one node (namely the master node) in the NAS cluster recovers the offline nodes, and to avoid several nodes recovering the same recoverable offline node at the same time, in this step the current node can monitor whether it is the master node; if yes, it detects offline nodes in the network attached storage system cluster; if not, the process can end, or the current node can continue monitoring whether it is the master node, so that a slave node never performs recovery of an offline node.
Step 102: determining recoverable nodes in the offline nodes according to the communication states of the current nodes and the offline nodes and the host states of the offline nodes; the communication state of the recoverable node and the current node is communicable, and the host state of the recoverable node is normal.
It can be understood that, in this step, when offline nodes exist in the NAS cluster, the current node determines, according to the communication state between the current node and each offline node and the host state of each offline node, each offline node whose host device state (i.e., host state) is normal and whose communication state with the current node is communicable as an offline node that can be recovered automatically (i.e., a recoverable node).
Correspondingly, as to the specific manner of determining the recoverable nodes in this step, the current node may, for example, use the recovery process to directly determine as a recoverable node any offline node whose host state is normal (i.e. the host device where the offline node is located is normal) and whose communication state with the current node is communicable (i.e. the network between the offline node and the current node is reachable).
In order to avoid repeated, futile recovery attempts on a recoverable node whose recovery has failed, in this step the current node can determine the recoverable nodes among the offline nodes according to the communication state between the current node and each offline node, the host state of each offline node and the recovery mark file condition; the communication state between a recoverable node and the current node is communicable, the host state of the recoverable node is normal, and the recovery mark file condition of the recoverable node is that no recovery mark file exists. Correspondingly, in step 103, the current node may control each recoverable node to configure a cluster configuration file and a recovery mark file, and recover the recoverable node to the network attached storage system cluster; after the recoverable node is recovered to the network attached storage system cluster, the current node controls the recoverable node to delete the configured recovery mark file. That is, in this embodiment, a recovery mark file configured in an offline node marks that node as being under recovery, so that the current node does not repeatedly recover an offline node whose recovery has failed.
Further, in this step, the current node may also determine the recoverable nodes and the reset nodes among the offline nodes according to the communication state between the current node and each offline node, the host state of each offline node and the recovery mark file condition; the communication state between a reset node and the current node is communicable, the host state of the reset node is normal, and the recovery mark file condition of the reset node is that a recovery mark file exists. Correspondingly, in this embodiment, the current node may further control the reset node to perform a node reset, so that the reset offline node can be determined as a recoverable node again and recovered to the NAS cluster in a subsequent run of the recovery procedure.
For example, the current node may determine the recoverable nodes and the reset nodes among the offline nodes according to the communication state between the current node and each offline node, the host state of each offline node and the recovery mark file condition, where the communication state between a reset node and the current node is communicable, the host state of the reset node is normal, and the recovery mark file condition of the reset node is that a recovery mark file exists; the current node then controls the reset node to perform a node reset and deletes the recovery mark file in the reset node.
Correspondingly, the current node can also determine the recoverable nodes among the offline nodes according to the communication state between the current node and each offline node, the host state of each offline node, the cluster information condition and the recovery mark file condition. The cluster information condition of a recoverable node is either that no cluster information (namely configuration information of the NAS cluster) exists or that cluster information exists; when the cluster information condition of a current recoverable node is that cluster information exists, its recovery mark file condition is that no recovery mark file exists, the current recoverable node being any recoverable node. An offline node that is communicable with the current node, whose host state is normal, which has a recovery mark file and which has cluster information is determined as a reset node, and the current node controls the reset node to perform a node reset so as to clear the cluster information in the reset node, such as the cluster information configured by using the cluster configuration file.
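The classification rules in the preceding paragraphs can be summarised as a small decision function. The Python sketch below is only an illustration under stated assumptions: the `OfflineNode` fields, the `Action` enum and the `classify` function are hypothetical names, and the four boolean inputs stand in for checks (connectivity, host health, presence of cluster information, presence of the recovery mark file) whose concrete implementation the patent leaves open.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RECOVER = "recover"  # treat as a recoverable node
    RESET = "reset"      # treat as a reset node
    SKIP = "skip"        # unreachable or host abnormal: leave alone


@dataclass
class OfflineNode:
    name: str
    reachable: bool          # communication state with the master
    host_ok: bool            # host state is normal
    has_cluster_info: bool   # cluster information condition
    has_recovery_flag: bool  # recovery mark file condition


def classify(node: OfflineNode) -> Action:
    """Apply the recoverable/reset rules to one offline node."""
    if not (node.reachable and node.host_ok):
        return Action.SKIP
    # A leftover mark file together with cluster information indicates a
    # previous recovery attempt that did not complete: reset first.
    if node.has_cluster_info and node.has_recovery_flag:
        return Action.RESET
    # Otherwise: no cluster information, or cluster information without
    # a mark file. Both cases count as recoverable.
    return Action.RECOVER
```

Because a reset clears the cluster information, a node classified as `RESET` in one pass would be classified as `RECOVER` in a later pass, which matches the two-pass behaviour described above.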
Step 103: controlling each recoverable node to configure a cluster configuration file, and recovering each recoverable node to the network attached storage system cluster; the cluster configuration file comprises cluster configuration information of the node and token information of all the nodes in the network attached storage system cluster.
It should be noted that, in this embodiment, the current node may control each recoverable node to configure the cluster configuration file, so that each recoverable node can restart the cluster management system of the network attached storage system cluster by using the restored cluster configuration file; the current node then authenticates each recoverable node whose cluster management system has restarted successfully, and recovers each successfully authenticated recoverable node to the network attached storage system cluster, thereby recovering the recoverable nodes automatically. Correspondingly, when a recovery mark file has been set in a recoverable node, the current node can also control each successfully authenticated recoverable node to delete its recovery mark file.
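The ordering of the master-side recovery steps (set the recovery mark file, push the cluster configuration file, restart the cluster management system, authenticate, then delete the mark file) can be sketched as follows. This is a simulation, not the patented implementation: `SimNode` is a hypothetical in-memory stand-in for a remote node, and its `restart_ok` and `auth_ok` fields simply simulate the outcome of operations that a real system would perform remotely.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SimNode:
    """Hypothetical in-memory stand-in for a recoverable node."""
    name: str
    restart_ok: bool = True   # simulated result of restarting the manager
    auth_ok: bool = True      # simulated result of authentication
    has_flag: bool = False    # recovery mark file present?
    config: Optional[dict] = None
    in_cluster: bool = False
    steps: list = field(default_factory=list)


def recover(node: SimNode, cluster_config: dict) -> bool:
    """Run the recovery sequence for one node; return True on success."""
    node.has_flag = True                  # 1. configure the mark file
    node.steps.append("set_flag")
    node.config = dict(cluster_config)    # 2. push the cluster config file
    node.steps.append("push_config")
    node.steps.append("restart_manager")  # 3. restart cluster management
    if not node.restart_ok:
        return False                      # mark file stays in place
    node.steps.append("authenticate")     # 4. master authenticates node
    if not node.auth_ok:
        return False                      # mark file stays in place
    node.in_cluster = True                # 5. node rejoins the cluster
    node.has_flag = False                 # 6. delete the mark file
    node.steps.append("delete_flag")
    return True
```

Note that on failure the mark file is deliberately left behind, so a later detection pass can recognise the node as one whose recovery did not complete.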
Correspondingly, when the current node in this embodiment is a slave node of the network attached storage system cluster and is an offline node, it restores the cluster configuration file under the control of the master node, for example by directly receiving the cluster configuration file sent by the master node; it restarts the cluster management system of the network attached storage system cluster by using the restored cluster configuration file; and after the cluster management system has restarted successfully, it sends an authentication instruction to the master node, so that the master node, after successfully authenticating the current node according to the authentication instruction, rejoins the current node to the network attached storage system cluster.
The specific manner in which the current node controls each recoverable node to configure the cluster configuration file and recovers each recoverable node to the network attached storage system cluster can be chosen by the designer according to the practical scenario and user requirements. For example, the current node can control the recoverable nodes one after another to configure the cluster configuration file and recover them to the cluster in turn; the current node can also control the recoverable nodes to configure their cluster configuration files in parallel and recover them to the cluster individually, so that offline nodes are recovered in parallel and the recovery speed of the NAS cluster is improved.
In the embodiment of the invention, the recoverable nodes among the offline nodes are determined according to the communication state between the current node and each offline node and the host state of each offline node, so that the recoverable offline nodes in the NAS cluster can be detected and identified; by controlling each recoverable node to configure the cluster configuration file and recovering each recoverable node to the network attached storage system cluster, the recoverable offline nodes are recovered automatically, which solves the prior-art problem that a node taken offline by a failure cannot rejoin the NAS cluster, and reduces both the operating pressure on the nodes and the risk of failure; moreover, the recovery of offline nodes is not limited by the number of nodes, so that high availability of the NAS cluster is ensured.
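The choice between sequential and parallel recovery can be sketched with Python's standard `concurrent.futures` module. The names `recover_one` and `recover_all` are hypothetical, and `recover_one` is a placeholder that always reports success; a real recovery routine would carry out the configuration, restart and authentication steps for the named node.

```python
from concurrent.futures import ThreadPoolExecutor


def recover_one(node_name):
    """Placeholder for recovering a single node; always succeeds here."""
    return node_name, True


def recover_all(node_names, recover_fn=recover_one, parallel=True):
    """Recover every node and return a {node_name: success} mapping."""
    names = list(node_names)
    if not names:
        return {}
    if not parallel:
        # Sequential variant: recover the nodes one after another.
        return dict(recover_fn(n) for n in names)
    # Parallel variant: one worker thread per recoverable node.
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        return dict(pool.map(recover_fn, names))


results = recover_all(["2", "3", "4"])
```

Threads are a reasonable fit in this sketch because each recovery would mostly wait on network operations; whether parallel recovery is appropriate in practice depends on the designer's scenario, as the text notes.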
Based on the above embodiment, the embodiment of the present invention further provides another method for recovering failures of a network attached storage system cluster, so as to avoid a situation of repeated recovery. Specifically, referring to fig. 2, fig. 2 is a flowchart of another method for recovering failures of a network attached storage system cluster according to an embodiment of the present invention. The method may include:
Step 201: monitoring whether the current node is a master node of a network attached storage system cluster; if yes, go to step 202.
It is understood that the current node in this embodiment may be any node in a NAS (network attached storage) cluster. In this step, the host device where the current node is located can monitor whether the current node is the master node of the network attached storage system cluster; if it is the master node, the method proceeds to step 202 to perform the offline node recovery procedure; if it is not the master node, the process can end, and if the current node later goes offline because of a failure, it is restored to the NAS cluster under the control of the master node.
Correspondingly, the specific manner in which the current node monitors whether it is the master node of the NAS cluster may be set by the designer. For example, when the current node is provided with a monitoring process for monitoring its node type and a recovery process for recovering offline nodes, the current node may use the monitoring process in this step to monitor its node type and the start state of the recovery process. If the node type is master node (that is, the current node is the master node) and the recovery process is not started, the recovery process is started and used to detect offline nodes; if the recovery process is already started, the process ends or returns to the step of monitoring the node type and the start state of the recovery process with the monitoring process. If the node type is slave node, the recovery process is ended, that is, it is not started, or an already started recovery process is terminated.
As shown in fig. 3, each node in each NAS cluster may run a monitoring process that cyclically checks whether the node is the master node. When the node is the master node, the monitoring process checks whether the recovery process on the node is started and, if not, pulls it up. When the node is not the master node, the recovery process may be killed, which ensures that only one node in a NAS cluster runs the recovery process. Therefore, when the master role drifts to another node, the monitoring processes quickly pull up the recovery process on the new master node and kill the one on the original master node, which avoids the case where recovery processes started on several nodes at the same time recover the same offline node.
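The monitoring logic described above can be sketched as a single loop iteration. This is only an illustrative sketch: the names `MonitorState`, `monitor_step`, `start_recovery` and `kill_recovery` are assumptions introduced here and do not appear in the embodiment, and how the master role is queried is left to the caller.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MonitorState:
    # Tracks whether this node's recovery process is currently started.
    recovery_running: bool = False


def monitor_step(
    is_master: bool,
    state: MonitorState,
    start_recovery: Callable[[], None],  # hypothetical hook: pull up the recovery process
    kill_recovery: Callable[[], None],   # hypothetical hook: kill the recovery process
) -> str:
    """Run one iteration of the monitoring process on the current node."""
    if is_master:
        if not state.recovery_running:
            # Master node without a recovery process: pull it up.
            start_recovery()
            state.recovery_running = True
            return "started"
        # Master node with a running recovery process: nothing to do.
        return "already-running"
    if state.recovery_running:
        # The node lost the master role: kill its recovery process so that
        # only the new master runs one.
        kill_recovery()
        state.recovery_running = False
        return "killed"
    return "idle"
```

Calling `monitor_step` in a loop on every node gives the behaviour described for master drift: the new master returns "started" on its next iteration and the former master returns "killed".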
Step 202: off-line nodes of a network attached storage system cluster are detected.
In this step, when the current node is the master node of the NAS cluster, an offline node in the NAS cluster may be detected, that is, the offline node may be a slave node of the NAS cluster, such as a failed slave node or a previous master node. For example, the current node may detect offline nodes in the NAS cluster at preset time intervals using the initiated recovery process.
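Detection at preset intervals might look like the following sketch. The embodiment does not say how membership or liveness is obtained, so `get_members`, `get_online` and `handle` are hypothetical callbacks, and the 30-second default interval is an arbitrary placeholder.

```python
import time
from typing import Callable, Iterable, List


def find_offline_nodes(members: Iterable[str], online: Iterable[str]) -> List[str]:
    """Return the cluster members that are not currently online, in member order."""
    online_set = set(online)
    return [m for m in members if m not in online_set]


def poll_offline_nodes(
    get_members: Callable[[], Iterable[str]],  # hypothetical: configured cluster members
    get_online: Callable[[], Iterable[str]],   # hypothetical: members currently online
    handle: Callable[[List[str]], None],       # hypothetical: passes offline nodes to step 203
    interval_s: float = 30.0,                  # placeholder for the preset time interval
    rounds: int = 1,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Detect offline nodes `rounds` times, waiting `interval_s` between rounds."""
    for i in range(rounds):
        offline = find_offline_nodes(get_members(), get_online())
        if offline:
            handle(offline)
        if i + 1 < rounds:
            sleep(interval_s)
```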
Step 203: determining the recoverable nodes and the reset nodes among the offline nodes according to the communication state between the current node and the offline nodes, the host state of the offline nodes, the cluster information condition and the recovery mark file condition.
A recoverable node is communicable with the current node, its host state is normal, and its cluster information condition is either that no cluster information exists, or that cluster information exists while no recovery mark file exists; this applies to every recoverable node. A reset node is communicable with the current node, its host state is normal, its recovery mark file condition is that a recovery mark file exists, and its cluster information condition is that cluster information exists.
It can be understood that, as shown in fig. 3, when offline nodes exist, the current node (i.e., the master node) may use the started recovery process in this step to determine each offline node that meets the condition for automatically joining the NAS cluster as a recoverable node. The condition for automatically joining the NAS cluster may be either of the following: the node is communicable with the current node, its host state is normal and no cluster information exists on it; or the node is communicable with the current node, its host state is normal, cluster information exists on it (i.e., cluster residue) and no recovery mark file exists on it. An offline node that does not meet this condition but is communicable with the current node, has a normal host state, and holds both cluster information and a recovery mark file is determined as a reset node.
For example, the current node in this step may judge whether an offline node exists in the network attached storage system cluster. If no offline node exists, the flow ends and waits for the next offline node detection. If an offline node exists, the current node judges whether a reset node exists among the offline nodes according to the recovery mark file condition and the cluster information condition of the offline nodes; the recovery mark file condition of a reset node is that a recovery mark file exists, and its cluster information condition is that cluster information exists. If a reset node exists, step 204 is entered. If no reset node exists, the current node judges whether a recoverable node exists among the offline nodes; if a recoverable node exists, step 205 is entered; if no recoverable node exists, the flow ends.
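The classification rules and the branch order of step 203 can be written as a small decision function. This is a sketch under assumptions: the `OfflineNode` fields, the `Role` enum and the step labels are names chosen for illustration, and how each state is probed on a real node is outside its scope.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Role(Enum):
    RECOVERABLE = "recoverable"
    RESET = "reset"
    SKIP = "skip"  # unreachable node or abnormal host: cannot be handled automatically


@dataclass(frozen=True)
class OfflineNode:
    name: str
    reachable: bool          # communication state with the current (master) node
    host_ok: bool            # host state is normal
    has_cluster_info: bool   # cluster information (residue) exists on the node
    has_recovery_flag: bool  # recovery mark file exists on the node


def classify(node: OfflineNode) -> Role:
    """Apply the recoverable/reset conditions of step 203 to one offline node."""
    if not (node.reachable and node.host_ok):
        return Role.SKIP
    if node.has_cluster_info and node.has_recovery_flag:
        # A previous recovery attempt left residue behind: reset first.
        return Role.RESET
    # Either no cluster information, or cluster information without a mark file.
    return Role.RECOVERABLE


def next_step(offline: List[OfflineNode]) -> Tuple[str, List[str]]:
    """Pick the next step: reset nodes are handled before recoverable ones."""
    resets = [n.name for n in offline if classify(n) is Role.RESET]
    if resets:
        return ("step_204_reset", resets)
    recoverable = [n.name for n in offline if classify(n) is Role.RECOVERABLE]
    if recoverable:
        return ("step_205_recover", recoverable)
    return ("end", [])
```

Because a reset clears the cluster information, a node classified as `RESET` in one detection round satisfies the `RECOVERABLE` condition in a later round.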
Step 204: controlling each reset node to perform node reset.
In this step, the current node may communicate with each reset node and control it to perform node reset, so that each reset node can subsequently be recovered to the NAS cluster. Accordingly, during the reset, each reset node may clear its own cluster information, such as the cluster information set when the cluster configuration file was configured, so that after the reset is completed the node can be determined as a recoverable node by the current node.
Correspondingly, the embodiment is not limited to a specific manner in which the current node controls each reset node to perform node reset, for example, the current node may send a reset instruction to each reset node through a Secure Shell (SSH) protocol, so as to control each reset node to perform node reset.
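One way to send a reset instruction over SSH is to invoke the OpenSSH client from the master node, as in the sketch below. The reset script path and the `root` login are purely hypothetical placeholders; the embodiment does not specify what the reset instruction is or how the nodes authenticate to each other.

```python
import subprocess
from typing import List

# Hypothetical path of a reset script on each node; not given in the source.
RESET_COMMAND = "/usr/local/bin/nas-node-reset"


def build_reset_command(host: str, user: str = "root",
                        reset_command: str = RESET_COMMAND) -> List[str]:
    """Build the argument list for an OpenSSH client call that runs the reset."""
    return [
        "ssh",
        "-o", "BatchMode=yes",       # never prompt for a password
        "-o", "ConnectTimeout=10",   # give up quickly on unreachable hosts
        f"{user}@{host}",
        reset_command,
    ]


def send_reset(host: str, timeout_s: float = 60.0) -> bool:
    """Send the reset instruction to one reset node; True if the command exited with 0."""
    try:
        result = subprocess.run(
            build_reset_command(host), capture_output=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or the ssh client is not installed on this host.
        return False
    return result.returncode == 0
```

Passing the command as an argument list rather than a shell string avoids quoting problems with host names.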
Step 205: controlling each recoverable node to configure the cluster configuration file and the recovery mark file, recovering each recoverable node to the network attached storage system cluster, and controlling each successfully recovered node to delete its recovery mark file.
In this step, the current node may communicate with each recoverable node, control it to configure a cluster configuration file and a recovery mark file, recover it to the NAS cluster, and control each successfully recovered node to delete its recovery mark file, so that a leftover recovery mark file does not affect subsequent failure recovery.
Correspondingly, the embodiment is not limited to the specific process of step 205, for example, the current node may control each recoverable node to configure the recovery mark file respectively; sending respective corresponding cluster configuration files to each recoverable node, so that each recoverable node restarts a cluster management system of the network attached storage system cluster by using the respective received cluster configuration file; authenticating the recoverable node which is successfully restarted by the cluster management system, recovering the recoverable node which is successfully authenticated into the network attached storage system cluster, and controlling the recoverable node which is successfully authenticated to delete the respective recovery mark file.
For example, in this step, the current node may add the determined recoverable nodes to a recovery queue and create a recovery mark file on every recoverable node in the queue. The mark file prevents the abnormal case in which a node that failed recovery is recovered again without being reset, so that automatic recovery cannot proceed. The current node then copies the cluster configuration file, which contains the cluster configuration information of the nodes in the NAS cluster and the token information of all nodes, to every recoverable node, so that each recoverable node can use it to configure the cluster information of the NAS cluster; this keeps the cluster information synchronized across all nodes in the NAS cluster and prevents split-brain. Finally, the current node authenticates and starts the recoverable nodes on which the cluster management system of the NAS cluster restarted successfully, and deletes the recovery mark file on each successfully recovered node, which completes the automatic recovery. Correspondingly, when recovery of a recoverable node fails, the node may be removed from the recovery queue; it is later detected as offline again and reset, so that after the reset it can be recovered into the NAS cluster by the recovery process as a recoverable node.
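The recovery queue handling of step 205 can be sketched as follows. The `ops` object and its five method names are assumptions made for illustration; each stands for a remote operation that the master node performs on a recoverable node, and the source does not define such an interface.

```python
from collections import deque
from typing import Iterable, List, Tuple


def recover_all(nodes: Iterable[str], ops) -> Tuple[List[str], List[str]]:
    """Recover every node in the queue; return (recovered, failed) node names.

    `ops` is a hypothetical object with these methods, each taking a node name:
      create_flag     - create the recovery mark file on the node
      push_config     - copy the cluster configuration file; True on success
      restart_manager - restart the cluster management system; True on success
      authenticate    - authenticate the restarted node; True on success
      delete_flag     - delete the recovery mark file on the node
    """
    queue = deque(nodes)
    # Mark every queued node before touching its configuration.
    for node in queue:
        ops.create_flag(node)

    recovered: List[str] = []
    failed: List[str] = []
    while queue:
        node = queue.popleft()
        ok = (
            ops.push_config(node)
            and ops.restart_manager(node)
            and ops.authenticate(node)
        )
        if ok:
            # Success: remove the mark file so later recoveries are unaffected.
            ops.delete_flag(node)
            recovered.append(node)
        else:
            # Failure: the node leaves the queue and keeps its mark file.
            failed.append(node)
    return recovered, failed
```

Leaving the mark file on a failed node is the point of the design: if the failed attempt also left cluster information behind, the next detection round sees both on that node and treats it as a reset node instead of recovering it again directly.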
In this embodiment, the recoverable nodes and the reset nodes among the offline nodes are determined according to the communication state between the current node and the offline nodes, the host state of the offline nodes, the cluster information condition and the recovery mark file condition, so that an offline node whose recovery failed can be detected and reset, and repeated recovery is avoided.
Corresponding to the above method embodiments, the present invention further provides a failure recovery device for a network attached storage system cluster. The failure recovery device described below and the failure recovery method described above may be referred to in correspondence with each other.
Referring to fig. 4, fig. 4 is a block diagram illustrating a failure recovery apparatus for a network attached storage system cluster according to an embodiment of the present invention. The apparatus may include:
a detection module 10, configured to detect an offline node in the network attached storage system cluster when the current node is a master node of the network attached storage system cluster; the current node is any node in the network attached storage system cluster;
a determining module 20, configured to determine the recoverable nodes among the offline nodes according to the communication state between the current node and the offline nodes and the host state of the offline nodes; a recoverable node is communicable with the current node, and its host state is normal;
a recovery module 30, configured to control each recoverable node to configure a cluster configuration file, and recover each recoverable node to the network attached storage system cluster; the cluster configuration file comprises the cluster configuration information of the nodes and the token information of all nodes in the network attached storage system cluster.
In some embodiments, the detection module 10 may be specifically configured to monitor whether the current node is the master node; if the node is the master node, detecting the offline node; and if the node is not the master node, ending.
In some embodiments, the detection module 10 may include:
the monitoring sub-module is used for monitoring the node type of the current node and the starting condition of the recovery process by using the monitoring process;
the detection sub-module is used for, if the node type is the master node, starting the recovery process when the recovery process is in a non-started state, and detecting the offline node by using the recovery process; and for sending a start signal to the monitoring sub-module when the recovery process is in a started state;
and the ending submodule is used for ending the recovery process if the node type is the slave node.
In some embodiments, the apparatus further comprises:
a configuration module, configured to configure the cluster configuration file under the control of the master node if the current node is the offline node, when the current node is a slave node of the network attached storage system cluster;
a restarting module, configured to restart the cluster management system of the network attached storage system cluster by using the configured cluster configuration file;
an authentication module, configured to send an authentication instruction to the master node after the cluster management system is restarted successfully, so that the current node rejoins the network attached storage system cluster after the master node authenticates it successfully according to the authentication instruction.
In some embodiments, the determining module 20 may be specifically configured to determine the recoverable nodes among the offline nodes according to the communication state between the current node and the offline nodes, the host state of the offline nodes, the cluster information condition and the recovery mark file condition; the cluster information condition of a recoverable node is either that no cluster information exists or that cluster information exists, and when cluster information exists on a recoverable node, the recovery mark file condition of that node is that no recovery mark file exists;
Correspondingly, the recovery module 30 may be specifically configured to control each recoverable node to configure the cluster configuration file and the recovery mark file, and recover the recoverable node to the network attached storage system cluster.
In some embodiments, the determination module 20 may include:
the off-line judging sub-module is used for judging whether the off-line node exists in the network additional storage system cluster;
the reset judging sub-module is used for judging whether a reset node exists in the offline node according to the condition of the recovery mark file and the condition of cluster information of the offline node if the offline node exists; the recovery mark file condition of the reset node is that the recovery mark file exists and the cluster information condition is that cluster information exists;
a reset sub-module, configured to control the reset node to perform node reset if the reset node exists, so as to clear cluster information in the reset node;
a restoration judging sub-module, configured to judge whether a restorable node exists in the offline node if the reset node does not exist; if the recoverable node exists, sending a start signal to a recovery module 30;
Correspondingly, the recovery module 30 may include:
the mark configuration submodule is used for controlling each recoverable node to respectively configure the recovery mark files;
the cluster configuration submodule is used for sending respective corresponding cluster configuration files to each recoverable node so that each recoverable node can restart a cluster management system of the network additional storage system cluster by utilizing the respective received cluster configuration files;
and the cluster authentication sub-module is used for authenticating the recoverable nodes which are successfully restarted by the cluster management system, recovering the recoverable nodes which are successfully authenticated into the network attached storage system cluster, and controlling the recoverable nodes which are successfully authenticated to delete the respective recovery mark files.
In some embodiments, the reset sub-module may be specifically configured to send a reset instruction to each of the reset nodes through a secure shell protocol, so as to control each of the reset nodes to perform node reset.
In this embodiment, the determining module 20 determines the recoverable nodes among the offline nodes according to the communication state between the current node and the offline nodes and the host state of the offline nodes, so that the recoverable offline nodes in the NAS cluster are detected and identified. The recovery module 30 controls each recoverable node to configure the cluster configuration file and recovers it to the network attached storage system cluster, so that recoverable offline nodes are recovered automatically. This solves the prior-art problem that a node taken offline by a failure cannot rejoin the NAS cluster, and reduces the operating pressure on the remaining nodes and the risk of failure. In addition, recovery of offline nodes is not limited by the number of nodes, which ensures high availability of the NAS cluster.
Corresponding to the above method embodiments, the present invention further provides a host device. The host device described below and the failure recovery method for a network attached storage system cluster described above may be referred to in correspondence with each other.
Referring to fig. 5, fig. 5 is a simplified schematic diagram of a host device according to an embodiment of the invention. The host device may include:
a memory D1 for storing a computer program;
and a processor D2, configured to implement the steps of the failure recovery method for a network attached storage system cluster provided in the above method embodiments when executing the computer program.
Accordingly, referring to fig. 6, fig. 6 is a schematic diagram of a specific structure of a host device according to an embodiment of the present invention. The host device 310 may vary considerably with configuration or performance, and may include one or more central processing units (CPUs) 322, a memory 332, and one or more storage media 330 (e.g., one or more mass storage devices) storing application programs 342 or data 344. The memory 332 and the storage medium 330 may be transitory or persistent storage. The program stored on the storage medium 330 may include one or more modules (not shown), each of which may include a series of instruction operations. Still further, the central processing unit 322 may be configured to communicate with the storage medium 330 and execute, on the host device 310, the series of instruction operations in the storage medium 330.
The host device 310 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341.
The host device provided in this embodiment may be specifically a computer or a server where a node of the NAS cluster is located.
The steps in the method for failure recovery of a network attached storage system cluster described above may be implemented by the structure of the host device.
Corresponding to the above method embodiments, the embodiments of the present invention further provide a computer readable storage medium. The computer readable storage medium described below and the failure recovery method for a network attached storage system cluster described above may be referred to in correspondence with each other.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a computer readable storage medium according to an embodiment of the invention. The computer readable storage medium 40 has stored thereon a computer program 41, which when executed by a processor, implements the steps of the method for recovering from a failure of a network attached storage system cluster as provided by the method embodiments described above.
The computer readable storage medium 40 may be a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium that can store program code.
In this description, the embodiments are described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to each other. The apparatus, the host device and the computer readable storage medium disclosed in the embodiments correspond to the methods disclosed in the embodiments, so their descriptions are relatively brief; for the relevant points, refer to the description of the method section.
The failure recovery method and device for a network attached storage system cluster, the host device and the computer readable storage medium provided by the present invention have been described in detail above. The principles and embodiments of the present invention have been described herein with reference to specific examples, and the description of the embodiments is intended only to facilitate understanding of the method of the present invention and its core ideas. It should be noted that those skilled in the art can make various modifications and adaptations to the invention without departing from its principles, and these modifications and adaptations also fall within the scope of the invention as defined in the following claims.

Claims (10)

1. A method for recovering a failure of a network attached storage system cluster, comprising:
when the current node is a master node of a network attached storage system cluster, detecting an offline node in the network attached storage system cluster; the current node is any node in the network attached storage system cluster;
determining recoverable nodes among the offline nodes according to the communication state between the current node and the offline nodes and the host states of the offline nodes; the communication state between a recoverable node and the current node is communicable, and the host state of the recoverable node is normal;
controlling each recoverable node to configure a cluster configuration file, and recovering each recoverable node to the network attached storage system cluster; the cluster configuration file comprises cluster configuration information of the nodes and token information of all nodes in the network attached storage system cluster.
2. The method for recovering from a failure of a network attached storage system cluster according to claim 1, wherein detecting an offline node in the network attached storage system cluster when the current node is a master node of the network attached storage system cluster comprises:
monitoring whether the current node is the master node;
If the node is the master node, detecting the offline node;
and if the node is not the master node, ending.
3. The method for recovering from a failure of a network attached storage system cluster of claim 2, wherein said monitoring whether a current node is the master node comprises:
monitoring the node type of the current node and the starting condition of the recovery process by using a monitoring process;
if the node type is the master node, starting the recovery process when the recovery process is in a non-started state, and detecting the offline node by using the recovery process; and when the recovery process is in a started state, executing the step of monitoring the node type of the current node and the starting condition of the recovery process by using the monitoring process;
and if the node type is the slave node, ending the recovery process.
4. The method for recovering from a failure of a network attached storage system cluster of claim 1, further comprising:
when the current node is a slave node of the network attached storage system cluster, if the current node is the offline node, configuring the cluster configuration file under the control of the master node;
restarting the cluster management system of the network attached storage system cluster by using the configured cluster configuration file;
and after the cluster management system is restarted successfully, sending an authentication instruction to the master node, so that the current node rejoins the network attached storage system cluster after the master node authenticates the current node successfully according to the authentication instruction.
5. The method for recovering from a failure of a network attached storage system cluster according to any one of claims 1 to 4, wherein said determining a recoverable node among said offline nodes based on a current node's communication state with said offline nodes and a host state of said offline nodes comprises:
determining the recoverable node in the offline node according to the communication state of the current node and the offline node, the host state of the offline node, the cluster information condition and the recovery mark file condition; the cluster information of the recoverable node is no cluster information or cluster information exists, and when the cluster information of the current recoverable node is the cluster information exists, the recovery mark file of the current recoverable node is no recovery mark file, and the current recoverable node is any recoverable node;
correspondingly, the controlling each recoverable node to configure a cluster configuration file, and recovering each recoverable node to the network attached storage system cluster, includes:
controlling each recoverable node to configure the cluster configuration file and a recovery mark file, and recovering the recoverable node to the network attached storage system cluster.
6. The method according to claim 5, wherein determining the recoverable node in the offline node according to the communication state of the current node and the offline node, the host state of the offline node, the cluster information condition, and the recovery flag file condition, comprises:
judging whether the offline node exists in the network additional storage system cluster;
if the offline node exists, judging whether a reset node exists in the offline node according to the recovery mark file condition and the cluster information condition of the offline node; the recovery mark file condition of the reset node is that the recovery mark file exists and the cluster information condition is that cluster information exists;
if the reset node exists, controlling the reset node to perform node reset so as to clear cluster information in the reset node;
If the reset node does not exist, judging whether a recoverable node exists in the offline node;
if the recoverable node exists, executing the step of controlling each recoverable node to configure the cluster configuration file and the recovery mark file, and recovering the recoverable node to the network attached storage system cluster;
correspondingly, the controlling each recoverable node to configure the cluster configuration file and the recovery mark file, and recovering the recoverable node to the network attached storage system cluster, includes:
controlling each recoverable node to respectively configure the recovery mark file;
sending respective corresponding cluster configuration files to each recoverable node, so that each recoverable node restarts a cluster management system of the network attached storage system cluster by using the respective received cluster configuration file;
authenticating the recoverable node which is successfully restarted by the cluster management system, recovering the recoverable node which is successfully authenticated into the network attached storage system cluster, and controlling the recoverable node which is successfully authenticated to delete the respective recovery mark file.
7. The method for recovering from a failure of a network attached storage system cluster of claim 6, wherein said controlling said reset node to perform a node reset comprises:
And sending a reset instruction to each reset node through a secure shell protocol so as to control each reset node to reset the node.
8. A failure recovery apparatus for a network attached storage system cluster, applied to a current node, comprising:
the detection module is used for detecting offline nodes in the network attached storage system cluster when the current node is a master node of the network attached storage system cluster; the current node is any node in the network attached storage system cluster;
the determining module is used for determining recoverable nodes among the offline nodes according to the communication state between the current node and the offline nodes and the host states of the offline nodes; the communication state between a recoverable node and the current node is communicable, and the host state of the recoverable node is normal;
the recovery module is used for controlling each recoverable node to configure a cluster configuration file, and recovering each recoverable node to the network attached storage system cluster; the cluster configuration file comprises cluster configuration information of the nodes and token information of all nodes in the network attached storage system cluster.
9. A host device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method for recovering from a failure of a network attached storage system cluster according to any one of claims 1 to 7 when executing said computer program.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the method for recovering from a failure of a network attached storage system cluster according to any of claims 1 to 7.
CN202311609643.1A 2023-11-28 2023-11-28 Fault recovery method and device for network additional storage system cluster and host equipment Pending CN117640349A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311609643.1A CN117640349A (en) 2023-11-28 2023-11-28 Fault recovery method and device for network additional storage system cluster and host equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311609643.1A CN117640349A (en) 2023-11-28 2023-11-28 Fault recovery method and device for network additional storage system cluster and host equipment

Publications (1)

Publication Number Publication Date
CN117640349A true CN117640349A (en) 2024-03-01

Family

ID=90015817

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311609643.1A Pending CN117640349A (en) 2023-11-28 2023-11-28 Fault recovery method and device for network additional storage system cluster and host equipment

Country Status (1)

Country Link
CN (1) CN117640349A (en)

Similar Documents

Publication Publication Date Title
CN108847982B (en) Distributed storage cluster and node fault switching method and device thereof
US10895996B2 (en) Data synchronization method, system, and apparatus using a work log for synchronizing data greater than a threshold value
US20180077230A1 (en) Method and apparatus for switching between servers in server cluster
CN105933407B (en) method and system for realizing high availability of Redis cluster
US9087005B2 (en) Increasing resiliency of a distributed computing system through lifeboat monitoring
CN111953566B (en) Distributed fault monitoring-based method and virtual machine high-availability system
CN112506702B (en) Disaster recovery method, device, equipment and storage medium for data center
CN102394914A (en) Cluster brain-split processing method and device
CN109391691B (en) Method and related device for recovering NAS service under single-node fault
CN110377456A (en) A kind of management method and device of virtual platform disaster tolerance
CN111130879A (en) PBFT algorithm-based cluster exception recovery method
CN107124305A (en) node device operation method and node device
CN114138732A (en) Data processing method and device
CN114554593A (en) Data processing method and device
JP6421516B2 (en) Server device, redundant server system, information takeover program, and information takeover method
CN113438111A (en) Method for restoring RabbitMQ network partition based on Raft distribution and application
CN107181608B (en) Method for recovering service and improving performance and operation and maintenance management system
CN114124803B (en) Device management method and device, electronic device and storage medium
CN117640349A (en) Fault recovery method and device for network additional storage system cluster and host equipment
CN111090537A (en) Cluster starting method and device, electronic equipment and readable storage medium
CN113961398A (en) Business processing method, device, system, equipment, storage medium and product
CN112491633B (en) Fault recovery method, system and related components of multi-node cluster
CN112433968B (en) Controller sharing synchronization method and device
US10909002B2 (en) Fault tolerance method and system for virtual machine group
CN115145782A (en) Server switching method, mooseFS system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination