CN108847982B - Distributed storage cluster and node fault switching method and device thereof - Google Patents

Info

Publication number
CN108847982B
Authority
CN
China
Prior art keywords
node
service
nodes
distributed storage
storage cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810668234.1A
Other languages
Chinese (zh)
Other versions
CN108847982A (en)
Inventor
孙业宽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201810668234.1A
Publication of CN108847982A
Application granted
Publication of CN108847982B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0663 Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/10 Active monitoring, e.g. heartbeat, ping or trace-route
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1034 Reaction to server failures by a load balancer
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Abstract

The invention discloses a distributed storage cluster node power-off switching method and device, applied to a master node of a distributed storage cluster. The method comprises: detecting the state of each node in the cluster by CTDB heartbeat detection; after a powered-off node is detected, acquiring the service information of that node; and sending the service information to normal nodes in the distributed storage cluster that have the corresponding service functions, so that each normal node receiving the service information performs service drift and service recovery according to it. The method shortens the detection and recovery time for a powered-off node from the minute level to the second level, speeds up the cluster's return to normal and the restoration of access to the powered-off node's services, and improves cluster reliability. The invention also discloses a distributed storage cluster based on the method.

Description

Distributed storage cluster and node fault switching method and device thereof
Technical Field
The invention relates to the technical field of distributed cluster high availability, in particular to a distributed storage cluster node power-off switching method and a distributed storage cluster node power-off switching device. The invention also relates to a distributed storage cluster.
Background
A distributed storage cluster is a cluster formed by a plurality of storage node servers. It supports storing a piece of data on multiple nodes; each node can obtain the complete data through inter-node communication, and when a node goes down the complete data can be recovered according to a configured policy. The distributed storage cluster comprises service modules such as a monitoring module, a storage pool module and a metadata management module.
When a distributed storage cluster is running, some nodes may lose power because of faults such as a loose or unplugged power cable. If the number of powered-off nodes is within the number the cluster tolerates (that is, the node redundancy of the cluster), the cluster can return to normal and continue to provide service access, but doing so takes time on the order of minutes. The reason is that, at present, each service module determines whether a node is powered off through its own heartbeat detection, and the heartbeat detection precision of the service modules is at the minute level, that is, 60 s or more (a heartbeat detection precision below 60 s causes the cluster to oscillate). Consequently, more than 60 s currently elapse before a node power failure is confirmed, and only then do cluster recovery and recovery of the powered-off node's services begin.
Therefore, in the current node power-off detection and recovery process, the cluster cannot detect the power-off fault quickly, and so cannot quickly recover the cluster or restore access to the services on the powered-off node. The service interruption time is long and cluster reliability is poor.
How to provide a highly reliable distributed storage cluster node power-off switching method, a device thereof, and a distributed storage cluster is therefore a problem that those skilled in the art currently need to solve.
Disclosure of Invention
The invention aims to provide a distributed storage cluster node power-off switching method and a device thereof, which shorten the detection recovery process time of a power-off node from the original minute level to the second level, accelerate the speed of recovering the cluster to be normal and recovering the service of the power-off node to access, and improve the reliability of the cluster; another object of the present invention is to provide a distributed storage cluster based on the above method.
In order to solve the above technical problem, the present invention provides a distributed storage cluster node power-off switching method, which is applied to a master node of a distributed storage cluster, and the method includes:
detecting the state of each node in the cluster according to a heartbeat detection mode of a CTDB lightweight cluster database;
after detecting that a node is powered off, acquiring service information of the powered-off node;
and sending the service information to normal nodes with corresponding service functions in the distributed storage cluster, so that each normal node receiving the service information can perform service drift and service recovery according to the service information.
Preferably, after detecting that there is a node that is powered off, before acquiring service information of the powered-off node, the method further includes:
and judging whether the power-off node is obtained through heartbeat detection, and if so, acquiring service information of the power-off node.
Preferably, the service information includes a virtual IP.
Preferably, the service information further includes service cache data.
Preferably, the process of sending the service information to the normal node with the corresponding service function in the distributed storage cluster specifically includes:
calling a failover program in the distributed storage cluster;
selecting normal nodes containing each service function;
and sending the service information to the selected node.
Preferably, the service functions include a monitoring function, a storage pool function, and a metadata management function.
Preferably, the process of detecting the node state according to the CTDB heartbeat detection method specifically includes:
sending a plurality of heartbeat packets to each node in the distributed storage cluster in each heartbeat detection period;
and judging whether responses returned by all the nodes are received within preset time, and if the nodes which do not return responses exist, determining the nodes which do not return responses to be power-off nodes.
In order to solve the above technical problem, the present invention further provides a distributed storage cluster node power-off switching apparatus, which is applied to a master node of a distributed storage cluster, and the apparatus includes:
the state monitoring module is used for detecting the state of each node in the cluster according to the CTDB heartbeat detection mode;
the information acquisition module is used for acquiring the service information of the power-off node after detecting that the node is powered off;
and the sending module is used for sending the service information to normal nodes with corresponding service functions in the distributed storage cluster, and carrying out service drifting and service recovery by each normal node receiving the service information according to the service information.
In order to solve the technical problem, the invention also provides a distributed storage cluster, which comprises a plurality of nodes with CTDB functions, wherein one node is selected from the plurality of nodes as a main node; the master node includes:
a memory for storing a computer program;
a processor for implementing the steps of the distributed storage cluster node power-down switching method as described in any one of the above when executing the computer program.
Preferably, the nodes other than the master node are specifically configured to:
perform, in parallel, their own service recovery operation and the service drift operation according to the service information.
The invention provides a distributed storage cluster node power-off switching method and device, which use CTDB heartbeat detection to detect whether a powered-off node exists in the cluster. If a powered-off node is detected, its service information is acquired and sent to normal nodes that have the corresponding service functions (which can also be understood as having the corresponding service modules), so that the normal nodes receiving the service information perform service drift and restore access to the services of the powered-off node. Because the time precision of CTDB heartbeat detection is at the second level, usually a few seconds, CTDB heartbeat detection can quickly detect a node power-off and send the service information of the powered-off node to the normal nodes, which can then perform data recovery and service drift in time. The detection and recovery time for a powered-off node is thus shortened from the minute level to the second level, which speeds up the cluster's return to normal and the restoration of access to the powered-off node's services, keeps the service interruption time as short as possible, and improves cluster reliability. The invention also provides a distributed storage cluster based on the method, which has the same advantages.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for the prior art and the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a distributed storage cluster node power-off switching method according to the present invention;
Fig. 2 is a schematic structural diagram of a distributed storage cluster node power-off switching apparatus provided in the present invention.
Detailed Description
The core of the invention is to provide a distributed storage cluster node power-off switching method and a device thereof, which shorten the detection recovery process time of a power-off node from the original minute level to the second level, accelerate the speed of recovering the cluster to be normal and recovering the service of the power-off node to access, and improve the reliability of the cluster; the other core of the invention is to provide a distributed storage cluster based on the method.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a distributed storage cluster node power-off switching method, applied to a master node of a distributed storage cluster. Referring to Fig. 1, which is a flowchart of the distributed storage cluster node power-off switching method provided by the invention, the method comprises the following steps:
Step S1: detecting the state of each node in the cluster according to the CTDB heartbeat detection mode;
It is understood that CTDB (Clustered Trivial Database) is high-availability management software for clusters, used for monitoring cluster node status and allocating services. Generally, in a distributed storage cluster with the CTDB function, CTDB software is installed on every cluster node, so that each node can perform heartbeat detection through CTDB and the nodes exchange their detection results. All nodes in such a cluster elect one master node, and fault recovery operations (such as virtual IP allocation) are performed only by the master node.
In current distributed storage clusters, CTDB already has a heartbeat detection function whose time precision is at the second level, but its detection result is not used for cluster and service recovery after a node powers off. In the prior art, whether to recover the cluster and the services of the powered-off node is decided from the detection results of the service modules in the nodes, whose time precision is at the minute level, so the process is slow and inefficient. In other words, CTDB heartbeat detection and service recovery for a powered-off node are two unrelated processes in the prior art. In the invention, the two are linked: the subsequent recovery of the cluster and of the powered-off node's services is driven by the CTDB heartbeat detection result, which shortens the recovery time to the second level and improves cluster recovery efficiency and reliability.
Step S2: after the power-off node is detected, acquiring service information of the power-off node;
After a node powers off, the services on it are necessarily interrupted. To restore normal access to those services as soon as possible, they must be migrated (or switched) to other normal nodes, so the information about the services that were running on the powered-off node must be determined in order to select suitable nodes and migrate the services. Because the powered-off node itself is unavailable at this point, its service information is usually obtained from the master node: the master node is responsible for service allocation and therefore stores information about the services running on each node.
Step S3: sending the service information to normal nodes with corresponding service functions in the distributed storage cluster, so that each normal node receiving the service information performs service drift and service recovery according to the service information.
It is understood that a distributed storage cluster includes a plurality of nodes that cooperate to process services, so different nodes may have different functions; that is, different nodes may contain the same service modules or different ones. When performing service drift, to ensure that the service can run normally afterwards, the service functions that the service needs must be determined first. The service information is then distributed to normal nodes that have those service functions, and the services of the powered-off node drift to those normal nodes. Once those nodes have returned to normal, they can jointly serve access to the service according to the service information.
In addition, service recovery can also be understood as node recovery. Because the nodes in the distributed storage cluster are associated with one another, once one node fails and powers off, the other nodes are also affected and cannot work normally. To run the services of the powered-off node on other nodes, the services must be migrated to those nodes and the configuration of those nodes must be adjusted so that they return to a normal working state. Moreover, if the service to be migrated has special requirements, the configuration data of the nodes must be adjusted during the recovery operation so that they can support the migrated service.
A service can run only after both service drift and service recovery have completed, so the time at which a service has been successfully switched to other nodes is the time at which both are complete. If service drift and service recovery are performed serially, the service switching time equals the sum of the two; if they are performed in parallel, the service switching time is determined by the longer of the two.
Experiments show that, with the above operations, power-off detection can generally be kept within 10 seconds, and service drift and node service recovery take roughly 10 seconds, so the overall service switching time is kept within 30 seconds. This is shorter than the current minute-level service switching and improves the reliability and stability of the cluster.
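The timing argument above can be checked with a small calculation. The sketch below assumes, for illustration only, that detection takes 10 seconds and that service drift and service recovery each take about 10 seconds; the function name `switch_time` is hypothetical and not part of the patent.

```python
def switch_time(detect_s, drift_s, recover_s, parallel):
    """Seconds from node power-off until the service is usable again."""
    if parallel:
        # Drift and recovery overlap, so the longer one dominates.
        work_s = max(drift_s, recover_s)
    else:
        # Drift and recovery run one after the other, so their times add.
        work_s = drift_s + recover_s
    return detect_s + work_s


# Assumed illustrative figures: 10 s detection, 10 s drift, 10 s recovery.
serial_total = switch_time(10, 10, 10, parallel=False)
parallel_total = switch_time(10, 10, 10, parallel=True)
print(serial_total, parallel_total)
```

Under these assumed figures the serial case matches the 30-second bound stated above, and running the two operations in parallel lowers the total further.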
Wherein the service functions include a monitoring function, a storage pool function, and a metadata management function.
It can be understood that, for the cluster to operate normally, the storage and metadata management functions are indispensable, and a monitoring function is also needed to monitor how services are running so that problems are found in time. Of course, the distributed storage cluster may also include other service functions, which the invention does not limit.
In a preferred embodiment, after the power-off node is detected, before the service information of the power-off node is acquired, the method further includes:
and judging whether the power failure node is obtained through heartbeat detection, and if so, acquiring service information of the power failure node.
It can be understood that, although the invention uses CTDB heartbeat detection to detect whether a node has powered off, a node reported as powered off may not actually have been found by heartbeat detection: when a stop or restart command is executed, the CTDB may also mark the node on which it runs as a powered-off node, which is clearly incorrect. To distinguish this case, before acquiring the service information of the powered-off node, it must first be determined whether the node was found by heartbeat detection. Only a failed node detected by the heartbeat detection function is treated as a powered-off node; otherwise the node is not processed. In a specific implementation, a flag bit is added to the identifier of each powered-off node detected by heartbeat, and the two cases are then distinguished by checking whether the identifier of the detected node contains the flag bit. Of course, this is only one implementation; whether a node is a powered-off node may also be determined in other ways, which the invention does not limit.
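The flag-bit check described above can be sketched as a simple filter. This is a minimal illustration, not the patented implementation: the `FailureReport` record and its `from_heartbeat` field are hypothetical names standing in for the node identifier and its flag bit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureReport:
    node_id: str
    # Hypothetical flag bit: set only when heartbeat detection reported the node.
    from_heartbeat: bool


def nodes_to_fail_over(reports):
    """Keep only nodes whose failure was found by heartbeat detection."""
    return [r.node_id for r in reports if r.from_heartbeat]


reports = [
    FailureReport("node-2", from_heartbeat=True),   # real power-off
    FailureReport("node-3", from_heartbeat=False),  # stop/restart command
]
handled = nodes_to_fail_over(reports)
print(handled)
```

Nodes reported without the flag, for example after an administrative stop or restart, are simply left unprocessed.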
In a particular embodiment, the service information includes a virtual IP.
It is understood that, in a distributed storage cluster, virtual IPs correspond one-to-one to services, and the CTDB master node is responsible for allocating virtual IPs. When a cluster node fails, to ensure normal access to the services on that node, the virtual IP allocated to the node is migrated to another node, and the node's services migrate to that node along with the virtual IP, which ensures high availability of the cluster.
Of course, for most distributed storage clusters service drift needs only the virtual IP, but in some cases service drift may also be implemented according to other parameters, which the invention does not limit.
In addition, the service information may also include service cache data. In some cases, continued access to a service requires earlier data, and the virtual IP alone is not enough to continue the service, so the service information needs to include the service cache data. Of course, the service information may also include the host number of the powered-off node; the invention does not limit the specific content of the service information.
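As a rough illustration of the service information and of virtual IP drift, the sketch below models one record per service and assigns each virtual IP to a healthy node. The `ServiceInfo` fields, the `plan_drift` function and the round-robin assignment are all assumptions made for the example; the text does not prescribe a particular allocation policy.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceInfo:
    virtual_ip: str                            # one virtual IP per service
    cache: dict = field(default_factory=dict)  # optional service cache data
    host_id: str = ""                          # optional host number of the failed node


def plan_drift(infos, healthy_nodes):
    """Map each virtual IP of the powered-off node to a healthy node.

    Round-robin over the sorted node names is an assumed policy.
    """
    if not healthy_nodes:
        raise ValueError("no healthy node available to take over")
    targets = sorted(healthy_nodes)
    return {
        info.virtual_ip: targets[i % len(targets)]
        for i, info in enumerate(infos)
    }


infos = [
    ServiceInfo("10.0.0.12", cache={"open_files": 3}, host_id="node-2"),
    ServiceInfo("10.0.0.13", host_id="node-2"),
]
plan = plan_drift(infos, ["node-3", "node-1"])
print(plan)
```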
In a specific embodiment, the process of sending the service information to the normal node having the corresponding service function in the distributed storage cluster specifically includes:
calling a failover program in the distributed storage cluster;
selecting normal nodes containing each service function;
and sending the service information to the selected node.
It can be understood that, in current distributed storage clusters, a failover program is usually provided on every node. Because the nodes in the cluster exchange data, once one node calls its own failover program, the programs on the other nodes also start running in preparation for service switching. Therefore, after acquiring the service information, the master node directly calls its own failover program; the failover programs on the other nodes are then started and made ready for service switching. The failover program also has a node selection function: after being called by the master node, it selects suitable nodes by itself, and the master node sends the service information to the selected nodes. Since the failover program on each selected node has already been started, the node can begin service switching (service drift, service recovery, etc.) as soon as it receives the service information.
Of course, if the cluster has no failover program with the above functions, the master node may analyze the configuration of each node in the cluster by itself and then select the corresponding nodes. The invention does not limit which way is used to select the nodes that receive the service information.
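The node selection step can be sketched as follows for the case where the master node analyzes the node configuration itself. The function name `select_targets`, the function labels and the rule of taking the first matching node in name order are assumptions for illustration; a real failover program may choose differently.

```python
def select_targets(node_functions, failed_node, required):
    """For each required service function, pick one normal node that has it."""
    selection = {}
    for func in required:
        candidates = sorted(
            node
            for node, funcs in node_functions.items()
            if node != failed_node and func in funcs
        )
        if not candidates:
            raise LookupError(f"no normal node provides {func!r}")
        # Assumed policy: take the first candidate in name order.
        selection[func] = candidates[0]
    return selection


# Hypothetical cluster layout: which service functions each node provides.
node_functions = {
    "node-1": {"monitor", "storage_pool"},
    "node-2": {"monitor", "metadata"},
    "node-3": {"storage_pool", "metadata"},
}
targets = select_targets(
    node_functions,
    failed_node="node-2",
    required=["monitor", "storage_pool", "metadata"],
)
print(targets)
```

The master node would then send the service information to each selected node.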
The process of detecting the node state according to the CTDB heartbeat detection mode specifically comprises the following steps:
sending a plurality of heartbeat packets to each node in the distributed storage cluster in each heartbeat detection period;
and judging whether responses returned by all the nodes are received within preset time, and if the nodes which do not return responses exist, determining the nodes which do not return responses as power-off nodes.
The preset time generally corresponds to the heartbeat detection period, but since sending and receiving signals takes time, the preset time is preferably slightly longer than the heartbeat detection period.
For example, assume each heartbeat detection period is 4 seconds (the invention does not limit the interval between two heartbeat periods) and a heartbeat packet is sent every 2 seconds, 2 times in total; if no response is received from the peer node within the 4 seconds, that node is considered faulty. Alternatively, the heartbeat detection period may be 8 seconds, that is, a heartbeat packet is sent every 2 seconds, 4 times in total, which avoids misjudging a fault because the heartbeat period is too short. Of course, the invention limits neither the length of the heartbeat detection period nor the sending frequency of the heartbeat packets.
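One heartbeat detection period can be sketched as below. This is a simulation under stated assumptions rather than CTDB's actual mechanism: `probe` is a hypothetical callable that sends one heartbeat packet and returns True if a response arrives in time, and a node is treated as powered off only if it answers none of the packets in the period.

```python
def detect_powered_off(nodes, probe, packets_per_period=2):
    """Return the nodes that answered no heartbeat packet in one period."""
    silent = []
    for node in nodes:
        # Send every packet of the period, then look at the responses.
        responses = [probe(node, seq) for seq in range(packets_per_period)]
        if not any(responses):
            silent.append(node)
    return silent


def period_seconds(packets_per_period, interval_s):
    """Length of one detection period, e.g. 2 packets x 2 s = 4 s."""
    return packets_per_period * interval_s


# Simulated cluster: node-2 has lost power and never responds.
alive = {"node-1", "node-3"}
powered_off = detect_powered_off(
    ["node-1", "node-2", "node-3"],
    probe=lambda node, seq: node in alive,
)
print(powered_off, period_seconds(2, 2), period_seconds(4, 2))
```

Sending more packets per period lengthens detection but makes a single lost packet less likely to cause a false power-off verdict.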
The invention provides a distributed storage cluster node power-off switching method that uses CTDB heartbeat detection to detect whether a powered-off node exists in the cluster. If a powered-off node is detected, its service information is acquired and sent to normal nodes that have the corresponding service functions (which can also be understood as having the corresponding service modules), so that the normal nodes receiving the service information perform service drift and restore access to the services of the powered-off node. Because the time precision of CTDB heartbeat detection is at the second level, usually a few seconds, CTDB heartbeat detection can quickly detect a node power-off and send the service information of the powered-off node to the normal nodes, which can then perform data recovery and service drift in time. The detection and recovery time for a powered-off node is thus shortened from the minute level to the second level, which speeds up the cluster's return to normal and the restoration of access to the powered-off node's services, keeps the service interruption time as short as possible, and improves cluster reliability.
The invention further provides a distributed storage cluster node power-off switching device, which is applied to a master node of a distributed storage cluster, and as shown in fig. 2, fig. 2 is a schematic structural diagram of the distributed storage cluster node power-off switching device provided by the invention.
The device includes:
the state monitoring module 1 is used for detecting the state of each node in the cluster according to the CTDB heartbeat detection mode;
the information acquisition module 2 is used for acquiring the service information of the power-off node after the power-off node is detected;
and the sending module 3 is used for sending the service information to normal nodes with corresponding service functions in the distributed storage cluster, so that each normal node receiving the service information can perform service drift and service recovery according to the service information.
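The three modules can be pictured as one small pipeline on the master node. The sketch below is only an illustration of how they might fit together: the class name `PowerOffSwitcher`, its constructor arguments and the `targets_for` callback are hypothetical, and the callables stand in for the real detection and transport mechanisms.

```python
class PowerOffSwitcher:
    """Toy composition of the monitoring, acquisition and sending modules."""

    def __init__(self, detect, service_table, send):
        self.detect = detect                # state monitoring module (callable)
        self.service_table = service_table  # per-node service info kept by the master
        self.send = send                    # sending module (callable)

    def run_once(self, targets_for):
        """Handle one detection round; return (target, failed_node) pairs sent."""
        sent = []
        for failed in self.detect():
            info = self.service_table.get(failed)  # information acquisition module
            if info is None:
                continue  # nothing was running on that node
            for target in targets_for(failed):
                self.send(target, info)
                sent.append((target, failed))
        return sent


outbox = []
switcher = PowerOffSwitcher(
    detect=lambda: ["node-2"],
    service_table={"node-2": {"virtual_ip": "10.0.0.12"}},
    send=lambda target, info: outbox.append((target, info)),
)
sent = switcher.run_once(targets_for=lambda failed: ["node-1", "node-3"])
print(sent)
```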
The invention also provides a distributed storage cluster which comprises a plurality of nodes with CTDB functions, wherein one node is selected from the nodes as a main node; the master node includes:
a memory for storing a computer program;
a processor for implementing the steps of the distributed storage cluster node power-off switching method as described in any one of the above when executing the computer program.
In a preferred embodiment, the nodes other than the master node are specifically configured to:
performing, in parallel, their own service recovery operation and the service drift operation according to the service information.
It will be appreciated that, since the service recovery operation and the service drift operation do not interfere with each other, running them in parallel shortens the service switching time as much as possible compared with running them serially.
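A minimal sketch of running the two operations in parallel on a receiving node is shown below, using Python's standard `concurrent.futures` thread pool. The functions `recover_service` and `drift_service` are placeholders whose short sleeps stand in for the real work; they are assumptions for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def recover_service():
    time.sleep(0.1)  # placeholder for the node's own service recovery
    return "recovered"


def drift_service():
    time.sleep(0.1)  # placeholder for taking over the migrated service
    return "drifted"


def switch_in_parallel():
    """Start both operations at once and wait for both to finish."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        recovery = pool.submit(recover_service)
        drift = pool.submit(drift_service)
        return recovery.result(), drift.result()


results = switch_in_parallel()
print(results)
```

Because both tasks are submitted before either result is awaited, the total wait is governed by the slower task rather than by the sum of the two.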
The above embodiments are only preferred embodiments of the present invention, and the above embodiments can be combined arbitrarily, and the combined embodiments are also within the scope of the present invention. It should be noted that other modifications and variations that may suggest themselves to persons skilled in the art without departing from the spirit and scope of the invention are intended to be included within the scope of the invention as defined by the appended claims.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (4)

1. A power-off switching method for nodes of a distributed storage cluster, applied to a master node of the distributed storage cluster, comprising the following steps:
detecting the state of each node in the cluster by means of the heartbeat detection of CTDB, a lightweight cluster database;
after a node power-off is detected, determining whether the powered-off node was identified through heartbeat detection; if so, acquiring service information of the powered-off node; if not, performing no processing;
sending the service information to normal nodes in the distributed storage cluster that have the corresponding service functions, so that each normal node receiving the service information performs service drift and service recovery according to the service information;
wherein sending the service information to the normal nodes having the corresponding service functions in the distributed storage cluster specifically comprises:
calling a failover program in the distributed storage cluster;
selecting normal nodes that provide each of the service functions;
sending the service information to the selected nodes;
wherein detecting the node state by means of CTDB heartbeat detection specifically comprises:
sending a plurality of heartbeat packets to each node in the distributed storage cluster in each heartbeat detection period;
determining whether responses from all nodes are received within a preset time; if any node does not return a response, that node is a powered-off node;
wherein the failover program is a program that is pre-deployed on each node and has a node-selection function; the service information comprises a virtual IP and service cache data; and the service functions comprise a monitoring function, a storage pool function, and a metadata management function.
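The heartbeat detection step recited in claim 1 can be sketched in Python as follows. This is only an illustrative sketch and not CTDB's actual interface: the `send_heartbeat` callback, the parameter names, and the default values are hypothetical, and the sketch assumes that a node is treated as powered off only when none of the packets sent to it in the period is answered within the preset time.

```python
from typing import Callable, Iterable, List


def detect_powered_off_nodes(
    nodes: Iterable[str],
    send_heartbeat: Callable[[str, float], bool],
    packets_per_period: int = 3,
    timeout_s: float = 1.0,
) -> List[str]:
    """Run one heartbeat detection period and return the nodes that never responded.

    send_heartbeat(node, timeout_s) is a hypothetical transport callback: it sends
    one heartbeat packet and returns True if a response arrives within timeout_s.
    """
    powered_off = []
    for node in nodes:
        # Send up to packets_per_period packets; any() stops at the first response.
        responded = any(
            send_heartbeat(node, timeout_s) for _ in range(packets_per_period)
        )
        if not responded:
            powered_off.append(node)
    return powered_off


if __name__ == "__main__":
    alive = {"node1", "node2"}  # hypothetical cluster state for the demo
    fake_transport = lambda node, timeout_s: node in alive
    print(detect_powered_off_nodes(["node1", "node2", "node3"], fake_transport))
```

Sending several packets per period, rather than one, keeps a single lost packet from being mistaken for a power-off.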
2. A power-off switching apparatus for nodes of a distributed storage cluster, applied to a master node of the distributed storage cluster, the apparatus comprising:
a state monitoring module, configured to detect the state of each node in the cluster by means of CTDB heartbeat detection, which specifically comprises: sending a plurality of heartbeat packets to each node in the distributed storage cluster in each heartbeat detection period; and determining whether responses from all nodes are received within a preset time, wherein, if any node does not return a response, that node is a powered-off node;
an information acquisition module, configured to determine, after a node power-off is detected, whether the powered-off node was identified through heartbeat detection; if so, to acquire service information of the powered-off node; if not, to perform no processing;
a sending module, configured to send the service information to normal nodes in the distributed storage cluster that have the corresponding service functions, so that each normal node receiving the service information performs service drift and service recovery according to the service information, wherein the sending specifically comprises: calling a failover program in the distributed storage cluster; selecting normal nodes that provide each of the service functions; and sending the service information to the selected nodes;
wherein the failover program is a program that is pre-deployed on each node and has a node-selection function; the service information comprises a virtual IP and service cache data; and the service functions comprise a monitoring function, a storage pool function, and a metadata management function.
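The node selection and sending performed by the sending module can be sketched in Python as follows. The `ServiceInfo` class, the function names, the assumption of one service information record per service function, and the policy of picking the first candidate in sorted order are all hypothetical; the claim only requires that a normal node providing each service function be selected.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Set


@dataclass
class ServiceInfo:
    """Service information of the powered-off node: a virtual IP and service cache data."""
    virtual_ip: str
    cache_data: Dict[str, bytes] = field(default_factory=dict)


def select_nodes(
    failed_functions: Iterable[str], normal_nodes: Dict[str, Set[str]]
) -> Dict[str, str]:
    """For each service function of the powered-off node, pick a normal node providing it."""
    selected = {}
    for function in failed_functions:
        candidates = sorted(
            node for node, functions in normal_nodes.items() if function in functions
        )
        if not candidates:
            raise LookupError(f"no normal node provides {function!r}")
        selected[function] = candidates[0]  # assumed policy: first candidate by name
    return selected


def dispatch(
    service_info: Dict[str, ServiceInfo],
    selection: Dict[str, str],
    send: Callable[[str, str, ServiceInfo], None],
) -> None:
    """Send each function's service information to the node selected for it."""
    for function, node in selection.items():
        send(node, function, service_info[function])
```

A real failover program would likely weigh load or locality when choosing among candidates; the fixed ordering here only keeps the sketch deterministic.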
3. A distributed storage cluster, comprising a plurality of nodes having a CTDB function, wherein one of the plurality of nodes is selected as a master node, and the master node comprises:
a memory for storing a computer program;
a processor for implementing, when executing the computer program, the steps of the distributed storage cluster node power-off switching method of claim 1.
4. The distributed storage cluster of claim 3, wherein the nodes other than the master node are specifically configured to:
perform their own service recovery operation and the service drift operation in parallel according to the service information.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810668234.1A CN108847982B (en) 2018-06-26 2018-06-26 Distributed storage cluster and node fault switching method and device thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810668234.1A CN108847982B (en) 2018-06-26 2018-06-26 Distributed storage cluster and node fault switching method and device thereof

Publications (2)

Publication Number Publication Date
CN108847982A CN108847982A (en) 2018-11-20
CN108847982B true CN108847982B (en) 2021-11-19

Family

ID=64203566

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810668234.1A Active CN108847982B (en) 2018-06-26 2018-06-26 Distributed storage cluster and node fault switching method and device thereof

Country Status (1)

Country Link
CN (1) CN108847982B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109783264A (en) * 2018-12-29 2019-05-21 南京富士通南大软件技术有限公司 A kind of High Availability solution of database
CN111865632A (en) * 2019-04-28 2020-10-30 阿里巴巴集团控股有限公司 Switching method of distributed data storage cluster and switching instruction sending method and device
CN110286732B (en) * 2019-06-27 2021-01-12 华云数据控股集团有限公司 Method, device and equipment for automatically recovering power failure of high-availability cluster and storage medium
CN110611603B (en) * 2019-09-09 2021-08-31 苏州浪潮智能科技有限公司 Cluster network card monitoring method and device
CN110740064A (en) * 2019-10-25 2020-01-31 北京浪潮数据技术有限公司 Distributed cluster node fault processing method, device, equipment and storage medium
CN110855504A (en) * 2019-11-22 2020-02-28 苏州浪潮智能科技有限公司 Method, system and related device for recovering faults of cloud platform management nodes
CN112711632A (en) * 2019-12-27 2021-04-27 山东鲁能软件技术有限公司 Asynchronous data stream replication method and system for high-availability cluster
CN111212127A (en) * 2019-12-29 2020-05-29 浪潮电子信息产业股份有限公司 Storage cluster, service data maintenance method, device and storage medium
CN111756573A (en) * 2020-05-28 2020-10-09 浪潮电子信息产业股份有限公司 CTDB double-network-card fault monitoring method in distributed cluster and related equipment
CN112035326A (en) * 2020-09-03 2020-12-04 中国银行股份有限公司 Abnormal node task processing method and device based on cluster node mutual detection
CN112866408B (en) * 2021-02-09 2022-08-09 山东英信计算机技术有限公司 Service switching method, device, equipment and storage medium in cluster
CN113162797B (en) * 2021-03-03 2023-03-21 山东英信计算机技术有限公司 Method, system and medium for switching master node fault of distributed cluster
CN113794595A (en) * 2021-09-15 2021-12-14 领云悠逸(北京)科技有限公司 IoT (Internet of things) equipment high-availability method based on industrial Internet
CN114584489A (en) * 2022-03-08 2022-06-03 浪潮云信息技术股份公司 Ssh channel-based remote environment information and configuration detection method and system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102394791A (en) * 2011-10-26 2012-03-28 浪潮(北京)电子信息产业有限公司 Downtime recovery method and system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8539285B2 (en) * 2010-06-22 2013-09-17 International Business Machines Corporation Systems for agile error determination and reporting and methods thereof
CN103607297B (en) * 2013-11-07 2017-02-08 上海爱数信息技术股份有限公司 Fault processing method of computer cluster system
CN108009045B (en) * 2016-10-31 2020-11-06 杭州海康威视数字技术股份有限公司 Method and device for processing faults of main and standby databases
CN106933693A (en) * 2017-03-15 2017-07-07 郑州云海信息技术有限公司 A kind of data-base cluster node failure self-repairing method and system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102394791A (en) * 2011-10-26 2012-03-28 浪潮(北京)电子信息产业有限公司 Downtime recovery method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Building a High-Performance Cluster NAS System Based on Open-Source Software; Liu Aigui (刘爱贵); CSDN: https://blog.csdn.net/liuaigui/article/details/7163482; 2011-12-29; main text *

Also Published As

Publication number Publication date
CN108847982A (en) 2018-11-20

Similar Documents

Publication Publication Date Title
CN108847982B (en) Distributed storage cluster and node fault switching method and device thereof
US10095576B2 (en) Anomaly recovery method for virtual machine in distributed environment
US5875290A (en) Method and program product for synchronizing operator initiated commands with a failover process in a distributed processing system
US6012150A (en) Apparatus for synchronizing operator initiated commands with a failover process in a distributed processing system
CN105933407B (en) Method and system for realizing high availability of Redis cluster
CN107404522B (en) Cross-node virtual machine cluster high-availability implementation method and device
JPH10214199A (en) Process restarting method, and system for realizing process restart
CN102394914A (en) Cluster brain-split processing method and device
CN103152419A (en) High availability cluster management method for cloud computing platform
CN112506702B (en) Disaster recovery method, device, equipment and storage medium for data center
CN110351313B (en) Data caching method, device, equipment and storage medium
CN114116912A (en) Method for realizing high availability of database based on Keepalived
CN114064217B (en) OpenStack-based node virtual machine migration method and device
CN106506278B (en) Service availability monitoring method and device
CN111198662A (en) Data storage method and device and computer readable storage medium
CN112787918B (en) Data center addressing and master-slave switching method based on service routing tree
CN111342986A (en) Distributed node management method and device, distributed system and storage medium
CN108009045B (en) Method and device for processing faults of main and standby databases
CN113765690A (en) Cluster switching method, system, device, terminal, server and storage medium
CN111880947A (en) Data transmission method and device
JP5285044B2 (en) Cluster system recovery method, server, and program
CN104158843A (en) Storage unit invalidation detecting method and device for distributed file storage system
CN116633766A (en) Fault processing method and device, electronic equipment and storage medium
CN115314361B (en) Server cluster management method and related components thereof
CN112491633B (en) Fault recovery method, system and related components of multi-node cluster

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant