CN116112500B

CN116112500B - NFS high availability system and method based on fault detection and routing strategy

Info

Publication number: CN116112500B
Application number: CN202310082854.8A
Authority: CN
Inventors: 陈奇; 徐文豪; 王弘毅; 张凯
Original assignee: SmartX Inc
Current assignee: SmartX Inc
Priority date: 2023-02-08
Filing date: 2023-02-08
Publication date: 2023-08-15
Anticipated expiration: 2043-02-08
Also published as: CN116112500A

Abstract

The invention discloses an NFS high availability system and method based on fault detection and routing strategies. A subnet dedicated to NFS mode connections is established; a fixed NFS server node virtual IP is then set on the current NFS client node; the NFS server node virtual IP is routed to an NFS server node; the connection state of the NFS mode connection and the running state of the current NFS server node are detected in real time; and, according to the detection result of the monitoring module, a failed NFS mode connection is switched to a normal NFS server node, or a failed NFS server node is replaced by a normal NFS server node, in real time.

Description

NFS high availability system and method based on fault detection and routing strategy

Technical Field

The invention relates to the field of hyper-converged storage data processing, and in particular to an NFS high availability system and method based on fault detection and routing strategies.

Background

In a hyper-converged infrastructure, the computing workload (applications/virtual machines) runs on the same set of physical servers as its associated data. However, unlike traditional storage, where a device is attached directly to the application that uses it, a hyper-converged system does not expose storage resources (hard disks, AEP, or other new storage media) directly to applications. Instead, all storage resources in the hyper-converged cluster are first pooled, and virtual storage services (virtual disks, virtual file systems, and the like) are then delivered to applications. The data accessed by each application may be distributed over all nodes of the hyper-converged system, so when a single storage server, disk, or other storage medium fails and data is lost, redundant or backup data can be obtained from other healthy disks or servers to reconstruct it.

In addition, in a hyper-converged infrastructure, once the hyper-converged distributed storage cluster has pooled the storage resources, they may be exposed to server hosts in a variety of ways, NFS being one of them. However, using NFS directly has some problems, the most significant being that a single point of failure can render storage unavailable; that is, NFS itself has no high availability capability.

To address this problem, the prior art generally uses either a DNS-based or a VIP-based NFS high availability scheme.

In a common DNS-based high availability solution, the NFS client no longer accesses the NFS server directly but goes through a domain name. The domain name can be mapped to multiple IPs; an available NFS server is found by simple polling, and the NFS request is forwarded to that server. To ensure high availability, multiple NFS servers must be maintained; these servers belong to the same cluster, represent the same set of data, and synchronize data among themselves.

The workflow of the VIP-based NFS high availability solution is similar to that of the DNS-based solution, except that the client sees a single VIP (virtual IP) address. The VIP is kept highly available within the storage cluster (i.e., only one primary server holds the VIP address at any given time). When the primary server becomes abnormal, the other servers in the cluster detect this through the cluster policy, elect a new primary server, and automatically configure the VIP on it to continue serving requests. The client always connects to this VIP address.

Failover in the VIP-based solution may be faster than in the DNS-based solution. DNS resolution and probing are normally used on external public networks, so by design they tolerate higher anomaly thresholds than the IP protocol used mainly on internal networks, in order to avoid unnecessary switchovers caused by the high latency and jitter common on public networks. However, in the VIP-based scheme the VIP maps to exactly one NFS server at any one time. The conventional NFS high availability schemes therefore have a single-point problem: only one NFS server actually provides service at any time, while the other NFS servers act only as hot backups and serve no requests. This prevents the cluster's performance from being fully exploited.

Disclosure of Invention

The invention aims to provide an NFS high availability system and method based on fault detection and routing strategies, which solve the technical problems pointed out in the prior art.

The invention provides an NFS high availability system based on fault detection and routing strategy, which comprises a distributed storage server cluster, an NFS system, a plurality of NFS clients and a monitoring module, wherein the NFS client is connected with the distributed storage server cluster;

the distributed storage server cluster comprises a plurality of NFS servers;

the NFS system comprises a heartbeat module, a network detection module and a fault detection module;

the heartbeat module is used for asynchronously collecting the connection state of the NFS mode connection and the running state of the current NFS server node at preset time intervals, which may vary from one collection to the next, and then reporting them to the monitoring module;

the network detection module is used for acquiring node information of the distributed storage cluster, then sending a data packet to each NFS server node to check whether that node is IP accessible, selecting the IP-accessible NFS servers, and establishing an IP accessible node list ordered by NFS server number;

the fault detection module is used for detecting in real time, according to the IP accessible node list, whether the destination of the virtual IP of the current NFS server node has failed; if the NFS server node has failed, the NFS server virtual IP is actively switched to the next normal NFS server node in the cluster according to the ordering of the IP accessible node list; meanwhile, if the original NFS server node returns to normal, the route of the NFS server virtual IP is actively switched back to that original NFS server node;

each NFS server is connected with the NFS clients through the NFS system;

the NFS client is used for sending an access signal to the NFS server through the NFS system;

the NFS server is used for receiving the access signal of the NFS client through the NFS system and returning a response to it.

Accordingly, the invention provides an NFS high availability method based on fault detection and routing strategy, comprising the following operation steps:

establishing, on the currently running server, a subnet dedicated to NFS mode connections, so as to ensure that the current NFS client and the virtual IP of the current NFS server node are in the same network segment; then setting a fixed NFS server node virtual IP on the current NFS client node;

initializing node information of the current distributed storage cluster, recording the IP information of the distributed storage cluster, and recording the IP information of the NFS server nodes in the distributed storage cluster; and routing the NFS server node virtual IP to a normal NFS server node;

asynchronously collecting the connection state of the NFS mode connection and the running state of the current NFS server node at preset time intervals, which may vary from one collection to the next, and reporting them to a monitoring module;

acquiring node information of the distributed storage cluster; then sending a data packet to each NFS server node to check whether that node is IP accessible, selecting the IP-accessible NFS servers, and establishing an IP accessible node list ordered by NFS server number;

detecting in real time, according to the IP accessible node list, whether the destination of the virtual IP of the current NFS server node has failed; if the NFS server node has failed, actively switching the NFS server virtual IP to the next normal NFS server node in the cluster according to the ordering of the IP accessible node list; meanwhile, if the original NFS server node returns to normal, actively switching the virtual IP route of the NFS client back to that original NFS server node.

Preferably, in one embodiment, the distributed storage cluster node information is distributed storage server cluster node information, which comprises attribute information of all NFS server nodes in the server cluster.

Preferably, in one embodiment, the fault detection module detects in real time, according to the IP accessible node list, whether the destination of the virtual IP of the current NFS server node has failed; if the NFS server node has failed, the NFS server virtual IP is actively switched to the next normal NFS server node in the cluster according to the ordering of the IP accessible node list; meanwhile, if the original NFS server node returns to normal, the virtual IP route of the NFS client is actively switched back to it. This specifically includes the following steps:

accessing the distributed storage server cluster, and updating the NFS server node IP list of the distributed storage server cluster;

checking whether the route of the virtual IP of the current NFS server node has already been configured; if not, preferentially routing the virtual IP to any normal NFS server node in the distributed storage NFS server cluster; if the selected NFS server node fails, selecting the next non-failed NFS server according to the ordering of the IP accessible node list, and routing the virtual IP to that non-failed NFS server node.

Preferably, in one embodiment, the subnet comprises an NFS server, an NFS client and an NFS system; the NFS mode connection refers to the connection between an NFS server and an NFS client realized through the NFS system.

Preferably, in one embodiment, the currently running server is the server on which the current NFS client accesses its corresponding NFS server through the NFS connection.

Compared with the prior art, the embodiment of the invention has at least the following technical advantages:

in the technical scheme adopted by the embodiment of the invention, a subnet dedicated to NFS mode connections is established, which ensures that the current NFS client and the current NFS server node virtual IP are in the same network segment and that no IP conflict arises even though the NFS server virtual IP is the same on every node; a fixed NFS server node virtual IP is then set on the current NFS client node;

the IP information of the NFS server nodes is acquired and recorded, and the NFS server node virtual IP is routed to an NFS server node; each NFS client has its own corresponding NFS server, so access signals initiated by different NFS clients are sent to different NFS servers, which disperses the access pressure and makes full use of the capacity of the multiple NFS servers;

the monitoring module is used to detect in real time the connection state of the NFS mode connection and the running state of the current NFS server node; based on its detection result, whether the current NFS server has failed is determined in real time, and if it has, the NFS server node virtual IP is routed to another normal NFS server in real time.

Analysis of the NFS high availability system and method based on fault detection and routing strategy provided by the invention shows that, in a specific application, the NFS client first accesses the NFS server node virtual IP; by abstracting the server behind a virtual IP, the configuration on the NFS client side can remain unchanged, which shields it from changes in the back-end system;

compared with the traditional polling approach, problems are discovered faster and routes are switched proactively; in addition, because requests are routed directly to the target NFS server without a polling or similar strategy, access requests are faster;

access exceptions are handled by switching routes, which is much faster than changing the IP through domain name resolution;

access is kept local as far as possible: once the local NFS server node corresponding to the NFS client returns to normal, the route is actively switched back to it even if no access abnormality has occurred, which makes client access faster and more accurate, reduces latency, and improves access efficiency;

the method also scales automatically: after nodes are added to or removed from the distributed storage cluster, the latest node list is acquired and updated automatically, without manual configuration.
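The load-dispersal advantage listed above can be illustrated with a small sketch. It assumes each client node's virtual IP route target is known; the client names, the server addresses, and the `clients_per_server` helper are hypothetical and do not come from the patent.

```python
from collections import Counter
from typing import Dict


def clients_per_server(vip_routes: Dict[str, str]) -> Counter:
    """Count how many clients each NFS server node serves.

    vip_routes maps a client node name to the server node IP that the
    client's (identical) virtual IP is currently routed to.
    """
    return Counter(vip_routes.values())


# Hypothetical cluster: every client routes the same virtual IP to its own
# local server node, so no single server carries all of the traffic.
routes = {
    "client-a": "10.0.0.11",
    "client-b": "10.0.0.12",
    "client-c": "10.0.0.13",
}
load = clients_per_server(routes)
```

In a conventional single-VIP scheme every entry in `routes` would name the same server, so one counter would hold all the clients.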

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly described below. The drawings described below show some embodiments of the present invention; a person skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a schematic diagram of an NFS high availability system architecture based on a fault detection and routing policy according to a first embodiment of the present invention;

Fig. 2 is a schematic operation flow diagram of an NFS high availability method based on a fault detection and routing policy according to a second embodiment of the present invention;

Fig. 3 is a schematic diagram of the fault detection flow in an NFS high availability method based on a fault detection and routing policy according to a second embodiment of the present invention.

Reference numerals: distributed storage server cluster 10; NFS system 20; NFS client 30; monitoring module 40; NFS server 11; heartbeat module 21; network detection module 22; fault detection module 23.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.

The invention will now be described in further detail with reference to specific examples thereof in connection with the accompanying drawings.

Example 1

As shown in fig. 1, the present invention proposes an NFS high availability system based on fault detection and routing policy, which includes a distributed storage server cluster 10, an NFS system 20, a plurality of NFS clients 30, and a monitoring module 40;

the distributed storage server cluster 10 includes a plurality of NFS servers 11;

the NFS system comprises a heartbeat module 21, a network detection module 22 and a fault detection module 23;

the heartbeat module 21 is configured to collect, at each preset time interval, the connection state of the NFS connection and the running state of the current NFS server node, and then report them to the monitoring module 40;

the network detection module 22 is configured to obtain node information of the distributed storage cluster 10; then, sending a data packet to each NFS server node to check whether each NFS server node is IP accessible, selecting an NFS server with IP accessible, and establishing an IP accessible node list according to the serial number of the NFS server;

the fault detection module 23 is configured to detect, in real time, whether a current NFS server node has a fault according to the IP accessible node list, and if it is detected that the current NFS server node has a fault, actively switch the NFS server virtual IP to a next normal NFS server node in the cluster according to the ordering of the IP accessible node list; meanwhile, if the NFS server node is recovered to be normal, actively switching the connection of the virtual IP of the NFS client back to the original NFS server node;

the NFS server 11 is connected with the NFS client 30 through the NFS system 20;

the NFS client 30 is configured to send, through the NFS system 20, an access signal to the NFS server 11;

the NFS server 11 is configured to receive, through the NFS system 20, the access signal of the NFS client 30 and to return a response to it.

In summary, in the above NFS high availability system based on fault detection and routing policy, when the current server first starts, a subnet dedicated to NFS connections is established and a fixed NFS server node virtual IP is set on the current NFS client node. The IP information of the distributed storage cluster and of the NFS server nodes in it is recorded, and the NFS server node virtual IP is routed to the current NFS server node. The heartbeat module detects the connection state of the NFS mode connection and the running state of the current NFS server node and sends them to the monitoring module. The network detection module checks whether each NFS server node is IP accessible, selects the IP-accessible NFS servers, and establishes an IP accessible node list ordered by NFS server number. The fault detection module then obtains the connection state and running state from the monitoring module and the IP accessible node list from the network detection module, and detects in real time whether the current NFS server node has failed; if it has, the failed NFS mode connection is switched to a normal NFS server node, or the failed NFS server node is replaced by a normal NFS server node.

Example two

As shown in fig. 2, correspondingly, the invention further provides an NFS high availability method based on fault detection and routing policy, which comprises the following operation steps:

Step S10: establishing, on the currently running server, a subnet dedicated to NFS mode connections, so as to ensure that the current NFS client and the current NFS server node virtual IP are in the same network segment; the subnet comprises an NFS server, an NFS client and an NFS system; the NFS mode connection is the connection between an NFS server and an NFS client realized through the NFS system; then setting a fixed NFS server node virtual IP on the current NFS client node (because the NFS server and the NFS client have been placed in the same dedicated subnet as described above, the current NFS client and the NFS server node virtual IP of the current NFS server are guaranteed to be on the same network segment);

after the subnet for NFS mode connections and the fixed NFS server node virtual IP have been set, initializing the node information of the current distributed storage cluster (the distributed storage cluster node information is distributed storage server cluster node information, which comprises attribute information of all NFS server nodes in the server cluster), recording the IP information of the distributed storage cluster, and recording the IP information of the NFS server nodes in the distributed storage cluster (this IP information is not the NFS server node virtual IP); and routing the NFS server node virtual IP to the current NFS server node;

the currently running server is the server on which the current NFS client accesses its corresponding NFS server through the NFS mode connection;

it should be noted that a subnet is planned in advance on the node where an instance of the method runs (that is, the server on which the current NFS client accesses its corresponding NFS server through the NFS connection), a fixed NFS server node virtual IP is set on the current NFS client node, and the NFS server node virtual IP and the current NFS client node IP are ensured to be in the same network segment. Because the current NFS client and the NFS server node virtual IP are in the same subnet, no IP conflict arises even if the NFS server node virtual IP accessed on all NFS clients is the same;

the node information of the distributed storage cluster is initialized and the IP information of the storage cluster is recorded. In addition, the IP information of the local storage node in the distributed storage cluster is recorded, and the NFS server virtual IP is then routed to the IP of the local NFS server node, where the local NFS server node is the NFS server node closest to the current NFS client node. The reason is that the method aims to make NFS not only highly available but also IO friendly, so whenever the local NFS server node is healthy, the method routes the NFS server node virtual IP back to the local NFS server node as far as possible.
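The initialization in step S10 can be sketched as follows. This is only an illustration under stated assumptions: the patent does not name the mechanism used to route the virtual IP, so the use of a Linux host route (`ip route replace`), the example addresses, and the helper names `same_subnet` and `build_route_command` are all hypothetical. The command is built but deliberately not executed.

```python
import ipaddress
from typing import List


def same_subnet(client_ip: str, vip: str, subnet: str) -> bool:
    """Return True if both the client IP and the virtual IP lie inside the NFS subnet."""
    net = ipaddress.ip_network(subnet)
    return (ipaddress.ip_address(client_ip) in net
            and ipaddress.ip_address(vip) in net)


def build_route_command(vip: str, target_node_ip: str) -> List[str]:
    """Build (but do not run) a command that points the virtual IP at one server node.

    Assumption: a /32 host route installed with iproute2 is the routing mechanism.
    """
    ipaddress.ip_address(vip)             # raises ValueError on a malformed address
    ipaddress.ip_address(target_node_ip)
    return ["ip", "route", "replace", f"{vip}/32", "via", target_node_ip]


# Hypothetical values: dedicated NFS subnet, fixed virtual IP, local server node.
command = build_route_command("192.168.33.2", "10.0.0.12")
```

Using `replace` rather than `add` means the same command can later repoint the virtual IP at a different node without first deleting the old route.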

Step S20: asynchronously collecting the connection state of the NFS mode connection and the running state of the current NFS server node at preset time intervals, which may vary (or differ from one another), and then reporting them to the monitoring module. Here "asynchronously" means that the collection is not performed in lockstep with the other steps but is carried out in real time by a separate module; the asynchronous operation is not affected by the connection state of the current NFS mode connection or the running state of the current NFS server node, and is therefore more efficient;

it should be noted that, in the above technical solution of the embodiment of the present invention, the current service state and the routing state are checked continuously and reported to the monitoring module (the "connection state of the NFS mode connection" is the state of the connection between an NFS client and an NFS server; the "current service state" is the state of the current NFS server);

the heartbeat module is created and started; it is an independent module that runs on its own once created and started;

the heartbeat module asynchronously collects the routing state and running state of the method at each preset time interval and then reports this information to the monitoring module.
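A minimal sketch of the heartbeat behaviour in step S20 follows. The `collect` and `report` callables stand in for the state collection and the monitoring module, which the patent does not specify in detail; the function name, the state dictionary, and the interval values are hypothetical.

```python
import itertools
import time
from typing import Callable, Dict, Iterable


def run_heartbeat(
    collect: Callable[[], Dict[str, str]],
    report: Callable[[Dict[str, str]], None],
    intervals: Iterable[float],
    beats: int,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Collect and report state `beats` times.

    `intervals` must be non-empty; it is cycled, so the wait between beats
    may differ from one beat to the next.
    """
    waits = itertools.cycle(intervals)
    for _ in range(beats):
        report(collect())      # push route/service state to the monitor
        sleep(next(waits))     # wait the next (possibly different) interval
```

To keep the heartbeat independent of the other steps, it could be started in its own thread, for example `threading.Thread(target=run_heartbeat, args=(collect, report, [1.0, 2.0], 100), daemon=True).start()`.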

Step S30: creating and starting a network detection module; acquiring node information of the distributed storage cluster; then sending a data packet (a detection data packet) to each NFS server node to check whether that node is IP accessible, selecting the IP-accessible NFS servers, and establishing an IP accessible node list ordered by NFS server number;

it should be noted that the network detection module records the cluster nodes that are IP accessible, but IP accessibility does not mean that the NFS service is working normally, so the fault detection module goes on to perform fault detection. The network detection module is created and started as an independent module that runs on its own once started. It acquires the node information of the distributed storage cluster, sends data packets to each node to check whether that node is IP accessible, and finally updates and maintains the IP accessible node list;
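The list building in step S30 might look like the sketch below. The probe is injected as a callable because the patent only says that a detection data packet is sent and does not fix its type (an ICMP echo would be one possibility); the node numbers, addresses, and the `build_accessible_list` name are hypothetical.

```python
from typing import Callable, Dict, List


def build_accessible_list(
    nodes: Dict[int, str],
    probe: Callable[[str], bool],
) -> List[str]:
    """Return the IPs of nodes that answer the probe, ordered by server number.

    nodes maps an NFS server number to that server node's IP address.
    """
    return [ip for _number, ip in sorted(nodes.items()) if probe(ip)]


# Hypothetical cluster in which node 2 does not answer the probe.
cluster = {3: "10.0.0.13", 1: "10.0.0.11", 2: "10.0.0.12"}
accessible = build_accessible_list(cluster, lambda ip: ip != "10.0.0.12")
```

Re-running the function against freshly fetched node information is enough to pick up nodes that were added to or removed from the cluster.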

Step S40: the fault detection module detects in real time, according to the IP accessible node list, whether the current NFS server node has failed (that is, whether the destination of the virtual IP of the current NFS server node, namely the NFS server to which the NFS client is connected, has failed); if the NFS server node has failed, the NFS server virtual IP is actively switched to the next normal NFS server node in the cluster according to the ordering of the IP accessible node list; meanwhile, if the original NFS server node returns to normal, the virtual IP route (or the NFS connection) of the NFS client is actively switched back to it. Here the original NFS server node is the NFS server node closest in network terms to the NFS client in the initial state, i.e. the NFS server node corresponding to the NFS client; the NFS server node to which the NFS client is connected during initialization is the local NFS server node.

Specifically, as shown in fig. 3, in step S40, if it is detected that the NFS server node fails, the NFS server virtual IP is actively switched to the next normal NFS server node in the cluster according to the ordering of the IP accessible node list; meanwhile, if the NFS server node returns to normal, the virtual IP route of the NFS client is actively switched back to the original NFS server node, which includes the following steps:

Step S41: accessing the distributed storage server cluster, and updating the NFS server node IP list of the distributed storage server cluster;

it should be noted that the distributed storage cluster (distributed storage server cluster) may dynamically add or delete nodes (NFS server nodes), so the node information is updated to ensure that the latest node information is always maintained. The storage nodes of the distributed storage cluster can be understood here as the list of NFS servers;

Step S42: checking whether the route of the virtual IP of the current NFS server node has already been configured (that is, whether the NFS server node virtual IP on the current NFS client node has already been routed to the current NFS server node); if not, preferentially routing the NFS server node virtual IP to any normal NFS server node in the distributed storage NFS server cluster; if the NFS server node fails (namely, the current NFS server fails), selecting the next non-failed NFS server according to the ordering of the IP accessible node list and routing the NFS server node virtual IP to it;

it should be noted that, for the fault detection module, an NFS server node being "normal" means that the node is IP accessible and that its NFS service can also provide service; whether the node's IP is accessible is determined from the IP accessible node list maintained by the network detection module.
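The selection logic of steps S40 to S42 can be sketched as a single decision function. This is one possible reading of the text rather than the patented implementation: the `nfs_ok` callable is a hypothetical stand-in for the NFS service check, and starting from the head of the list when the failed node is no longer IP accessible is an assumption.

```python
from typing import Callable, List, Optional


def choose_target(
    current: Optional[str],
    local: str,
    accessible: List[str],
    nfs_ok: Callable[[str], bool],
) -> Optional[str]:
    """Pick the node the virtual IP should be routed to, or None if none is normal.

    A node counts as normal only if it is in the IP accessible node list
    and its NFS service also responds.
    """
    def normal(ip: str) -> bool:
        return ip in accessible and nfs_ok(ip)

    # Initial configuration and failback: prefer the local node when healthy.
    if normal(local):
        return local
    # Keep the current target for as long as it stays healthy.
    if current is not None and normal(current):
        return current
    # Otherwise walk the list in order, starting after the failed node.
    start = accessible.index(current) + 1 if current in accessible else 0
    for ip in accessible[start:] + accessible[:start]:
        if nfs_ok(ip):
            return ip
    return None
```

Because the local node is tested first on every call, the same function covers the initial route configuration, failover to the next normal node, and the active switch back once the local node recovers.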

In summary, in the NFS high availability system and method based on the fault detection and routing policy provided by the embodiments of the present invention, a subnet dedicated to NFS connections is established, which ensures that the current NFS client and the current NFS server node virtual IP are in the same network segment and that no IP conflict occurs even though the NFS server virtual IP is the same on every node; a fixed NFS server node virtual IP is then set on the current NFS client node;

the IP information of the NFS server nodes is acquired and recorded, and the NFS server node virtual IP is routed to an NFS server node; each NFS client has its own corresponding NFS server, so access signals initiated by different NFS clients are sent to different NFS servers, which disperses the access pressure and makes full use of the capacity of the multiple NFS servers;

the monitoring module is used to detect in real time the connection state of the NFS mode connection and the running state of the current NFS server node, and, according to its detection result, a failed NFS mode connection is switched to a normal NFS server node, or a failed NFS server node is replaced by a normal NFS server node, in real time;

the NFS client first accesses the NFS server node virtual IP; by abstracting the server behind a virtual IP, the configuration on the NFS client side can remain unchanged, which shields it from changes in the back-end system;

access is kept local as far as possible: once the NFS server node corresponding to the NFS client returns to normal, the route is actively switched back to the local storage node even if no access abnormality has occurred, which keeps access IO friendly and improves access efficiency.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; modifications of the technical solutions described in the foregoing embodiments, or equivalent substitutions of some or all of the technical features thereof, may be made by those of ordinary skill in the art; such modifications and substitutions do not depart from the spirit of the invention.

Claims

1. An NFS high availability system based on fault detection and routing strategy comprises a distributed storage server cluster, an NFS system, a plurality of NFS clients and a monitoring module;

the distributed storage server cluster comprises a plurality of NFS servers;

the NFS system is used for establishing a subnet special for NFS mode connection on a currently operated server so as to ensure that a current NFS client and a current NFS server node virtual IP are in the same network segment; then setting fixed NFS service end node virtual IP on the current NFS client node; initializing node information of a current distributed storage cluster, recording IP information of the distributed storage cluster, and recording IP information of NFS server nodes in the distributed storage cluster; virtual IP routing of the NFS server end node to a normal NFS server end node;

the network detection module is used for acquiring node information of the distributed storage cluster; then sending a data packet to each NFS server node to check whether each NFS server node is IP accessible; selecting an NFS server which can be accessed by the IP, and establishing an IP accessible node list according to the number of the NFS server;

the fault detection module is used for detecting whether the current NFS server node has a fault in real time according to the IP accessible node list, and if the NFS server node is detected to have the fault, actively switching the virtual IP of the NFS server node to the next normal NFS server node in the cluster according to the ordering of the IP accessible node list; meanwhile, if the NFS server node is recovered to be normal, actively switching the connection of the virtual IP of the NFS server back to the original NFS server node;

2. An NFS high availability method based on fault detection and routing policy, comprising the following steps:

initializing node information of a current distributed storage cluster, recording IP information of the distributed storage cluster, and recording IP information of NFS server nodes in the distributed storage cluster; virtual IP routing of the NFS server end node to a normal NFS server end node;

detecting whether the current service end node fails or not in real time according to the IP accessible node list, and if the NFS service end node fails, actively switching the virtual IP of the NFS service end node to the next normal NFS service end node in the cluster according to the ordering of the IP accessible node list; meanwhile, if the NFS server node returns to normal, the connection of the virtual IP of the NFS client is actively switched back to the original NFS server node.

3. The NFS high availability method based on fault detection and routing policy according to claim 2, wherein the distributed storage cluster node information is distributed storage server cluster node information, and the distributed storage server cluster node information includes attribute information of all NFS server nodes in a server cluster.

4. The method for high availability of NFS based on fault detection and routing policy according to claim 3, wherein detecting whether the current service end node fails according to the list of IP accessible nodes in real time, if the NFS service end node fails, actively switching the virtual IP of the NFS service end node to the next normal NFS service end node in the cluster according to the ordering of the list of IP accessible nodes; meanwhile, if the NFS server node returns to normal, the connection of the virtual IP of the NFS client is actively switched back to the original NFS server node, which specifically includes the following steps:

checking whether the connection route of the virtual IP of the current NFS server has been configured, if not, preferentially selecting to route the virtual IP to any NFS server node in the distributed storage NFS server cluster, and if the local NFS server node is not available, selecting a normal NFS server node; if the selected current NFS service end node fails, selecting the next non-failure NFS service end according to the ordering of the IP accessible node list, and virtually routing the NFS service end node to the non-failure NFS service end.

5. The NFS high availability method based on fault detection and routing policy of claim 4, wherein the subnetwork comprises an NFS server and an NFS client and an NFS system; the NFS mode connection refers to a mode of realizing connection between an NFS server and an NFS client by using an NFS system.

6. The method for NFS high availability based on fault detection and routing policy according to claim 5, wherein the currently running server is the server on which the current NFS client accesses its corresponding NFS server through the NFS connection.