CN111488247A - High-availability method and device for multiple fault tolerance of management and control nodes

Info

Publication number
CN111488247A
Authority
CN
China
Prior art keywords: control node, management, control, node, failed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010277503.9A
Other languages
Chinese (zh)
Other versions
CN111488247B (en)
Inventor
赵胜龑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Zstack Information Technology Co ltd
Original Assignee
Shanghai Zstack Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Zstack Information Technology Co ltd
Priority to CN202010277503.9A
Publication of CN111488247A
Application granted
Publication of CN111488247B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Hardware Redundancy (AREA)

Abstract

The method comprises: establishing a management and control service system according to a master management and control node and a slave management and control node on an application layer, wherein the master management and control node and the slave management and control node each comprise a pair of management and control nodes protected by FT (Fault Tolerance), and each pair of FT management and control nodes comprises a primary FT management and control node and a secondary FT management and control node; determining the failed FT management and control node in the management and control service system and the FT management and control node where the virtual access address is located; and performing fault tolerance processing on the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located. With the method, the recovery time of the externally provided service can be kept at the second level even under multiple fault tolerance, so that the requirements of a guaranteed recovery time and multiple fault tolerance are satisfied at the same time.

Description

High-availability method and device for multiple fault tolerance of management and control nodes
Technical Field
The application relates to the field of computers, and in particular to a high-availability method and device for multiple fault tolerance of management and control nodes.
Background
The management and control node of a cloud management platform is the central node that distributes and manages all kinds of cloud resources, so its availability is extremely important. A traditional management and control node usually runs on a single server and therefore suffers from a single point of failure: when that server fails (for example, through a power or network failure), there is a risk that the management and control node becomes inaccessible.
In a production environment, the larger the cluster, the higher the requirement on the high availability of the cloud management nodes; in some special fields such as finance, where high-frequency operation is required, even higher requirements are placed on the high availability of the management and control nodes. The solutions currently adopted in the industry address the high-availability requirement to some extent but still have drawbacks. One class of schemes spends extra time on heartbeat detection, virtual machine operating system startup and management node startup, and these delays accumulate; during this period the management node cannot provide external access, and it usually takes several minutes to restore access. Another class of schemes cannot, owing to the logic of the database synchronization mechanism, allow more than 2 nodes to act as master nodes at the same time; only 2 nodes can therefore be online simultaneously, so such a scheme can tolerate only 1 failure.
Current solutions therefore either guarantee multiple fault tolerance at the cost of recovery time, or guarantee recovery time at the cost of multiple fault tolerance; it is difficult to satisfy both requirements, a guaranteed recovery time and multiple fault tolerance, at the same time.
Disclosure of Invention
An object of the present application is to provide a high-availability method and apparatus for multiple fault tolerance of management and control nodes, which solve the prior-art problem that a management and control node can hardly guarantee the recovery time and provide multiple fault tolerance at the same time.
According to one aspect of the application, a high-availability method for multiple fault tolerance of management and control nodes is provided, the method comprising the following steps:
establishing a management and control service system according to a master management and control node and a slave management and control node on an application layer, wherein the master management and control node and the slave management and control node each comprise a pair of management and control nodes protected by FT (Fault Tolerance), and each pair of FT management and control nodes comprises a primary FT management and control node and a secondary FT management and control node;
determining failed FT control nodes in the control service system and FT control nodes where virtual access addresses are located;
and carrying out fault tolerance processing on the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located.
Further, the primary FT management and control node and the secondary FT management and control node contain the same data content and the corresponding databases are encapsulated in the respective virtual machines.
Further, determining the failed FT management and control node in the management and control service system and the FT management and control node where the virtual access address is located includes:
positioning a failed physical host in the management and control service system, and determining that a virtual machine on the failed physical host is a failed FT management and control node;
determining the position of a main control node on an application layer, and determining an FT control node where a virtual access address in the control service system is located according to the position of the main control node.
Further, performing fault tolerance processing on the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located includes:
if the primary FT control node in the master control node on the application layer is a failed FT control node, switching the secondary FT control node protected by the same FT as the primary FT control node to be the primary FT control node, and meanwhile taking the failed FT control node offline;
automatically docking through a network card protecting the FT outer layer of the failed FT control node, and forwarding a data packet to the virtual access address through the network card;
and searching a physical machine meeting the condition in the cluster where the main control node is located by protecting the FT of the failed FT control node, so as to create a new secondary FT control node on the physical machine meeting the condition.
Further, performing fault tolerance processing on the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located includes:
if a failed FT control node exists in each pair of FT control nodes protected by FT, judging whether the failed FT control node is a primary FT control node, if so, switching a secondary FT control node protected by the same FT as the primary FT control node into a primary FT control node, and meanwhile, taking the failed FT control node off line;
automatically docking a network card on the outer layer of the FT corresponding to the FT control node where the virtual access address is located, and forwarding a data packet to the virtual access address through the network card;
and searching a physical machine meeting the condition in the cluster where the main control node is located by protecting the FT of the failed FT control node, so as to create a new secondary FT control node on the physical machine meeting the condition.
Further, performing fault tolerance processing on the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located includes:
if the primary FT control node and the secondary FT control node in the slave control node on the application layer are both failed FT control nodes, continuing to complete the fault-tolerant processing of the control service system through the primary FT control node and the secondary FT control node in the master control node on the application layer.
Further, performing fault tolerance processing on the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located includes:
if the primary FT control node and the secondary FT control node on the master control node are both failed FT control nodes, switching the virtual access address to the primary FT control node in the slave control node;
and continuing to complete the fault tolerance processing of the management and control service system through the primary FT management and control node and the secondary FT management and control node in the slave management and control node.
Further, performing fault tolerance processing on the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located includes:
if the primary FT control node and the secondary FT control node on the master control node have both failed and one FT control node in the slave control node has failed, switching the virtual access address to the non-failed FT control node in the slave control node, and continuing the fault tolerance processing of the control service system through the non-failed FT control node where the virtual access address is now located.
Further, performing fault tolerance processing on the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located includes:
and if the primary FT control node and the secondary FT control node on the slave control node are both failed and one FT control node in the master control node is failed, continuing the fault tolerance processing of the control service system through the rest non-failed FT control nodes.
According to another aspect of the present application, there is also provided a high-availability apparatus for multiple fault tolerance of management and control nodes, the apparatus including:
the system comprises a building device and a service management and control device, wherein the building device is used for building a management and control service system according to a master management and control node and a slave management and control node on an application layer, the master management and control node and the slave management and control node respectively comprise a pair of FT management and control nodes protected by FT, and each pair of FT management and control nodes comprises a master FT management and control node and a secondary FT management and control node;
the determining device is used for determining the failed FT control node in the control service system and the FT control node where the virtual access address is located;
and the fault-tolerant processing device is used for carrying out fault-tolerant processing on the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located.
According to another aspect of the present application, there is also provided a high-availability apparatus for multiple fault tolerance of management and control nodes, the apparatus including:
one or more processors; and
a memory storing computer readable instructions that, when executed, cause the processor to perform the operations of the method as previously described.
According to yet another aspect of the present application, there is also provided a computer readable medium having computer readable instructions stored thereon, the computer readable instructions being executable by a processor to implement the method as described above.
Compared with the prior art, the present application establishes a management and control service system according to a master management and control node and a slave management and control node on an application layer, wherein the master management and control node and the slave management and control node each comprise a pair of management and control nodes protected by FT, and each pair of FT management and control nodes comprises a primary FT management and control node and a secondary FT management and control node; determines the failed FT management and control node in the management and control service system and the FT management and control node where the virtual access address is located; and performs fault tolerance processing on the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located. In this way, the recovery time of the externally provided service can be kept at the second level even under multiple fault tolerance, so that the requirements of a guaranteed recovery time and multiple fault tolerance are satisfied at the same time.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 illustrates a flow diagram of a high-availability method for multiple fault tolerance of management and control nodes provided in accordance with an aspect of the present application;
fig. 2 is a schematic diagram illustrating an architecture of a management node service system according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating a failure condition of 1 physical node in one embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a first case where 2 physical nodes fail in one embodiment of the present application;
FIG. 5 is a diagram illustrating a second scenario in which 2 physical nodes fail in one embodiment of the present application;
FIG. 6 is a diagram illustrating a third scenario in which 2 physical nodes fail in one embodiment of the present application;
FIG. 7 is a diagram illustrating a first scenario in which 3 physical nodes fail in one embodiment of the present application;
FIG. 8 is a diagram illustrating a second scenario in which 3 physical nodes fail in one embodiment of the present application;
fig. 9 is a schematic structural diagram of a high-availability device for multiple fault tolerance of management and control nodes according to another aspect of the present application.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present application is described in further detail below with reference to the attached figures.
In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (e.g., Central Processing Units (CPUs)), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change RAM (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassette tape, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory media, such as modulated data signals and carrier waves.
Fig. 1 is a flow chart illustrating a high-availability method for multiple fault tolerance of management and control nodes according to an aspect of the present application, the method comprising step S11, step S12 and step S13.
in step S11, a management and control service system is established according to a master management and control node and a slave management and control node on an application layer, where the master management and control node and the slave management and control node each include a pair of FT management and control nodes protected by FT, and each pair of FT management and control nodes includes a master FT management and control node and a slave FT management and control node; here, the management and control node service (mn services) is composed of a plurality of internal sub-services, the service is encapsulated into a mirror image of a virtual machine, and then a pair of virtual machines protected by FT is created respectively by using the mirror image, thereby forming a management and control node service commonly supported by 4 virtual machines online at the same time; meanwhile, in order to disperse risks, 4 virtual machines are deployed on 4 physical nodes, and FT is Fault Tolerance (Fault Tolerance). The primary FT control node and the secondary FT control node contain the same data content, and corresponding databases are packaged in corresponding virtual machines. Specifically, 4 virtual machines are deployed, one pair of each virtual machine is a master pair and a slave pair, as shown in fig. 2, mn1, mn2, mn3 and mn4 all serve as management and control node services, and are encapsulated into the respective virtual machines together with respective database services, and each pair of mn and a corresponding database constitute a management and control node service for external support access. The PVM and the SVM respectively represent a primary virtual machine (a primary VM) and a secondary virtual machine (a secondary VM) protected by FT, and the contents of a pair of the primary VM and the secondary VM protected by FT always keep the same, that is, the contents of each pair of FT management nodes including the primary FT management node and the secondary FT management node always keep the same. It should be noted that, there are 2 layers of synchronization mechanisms in the management and control service system to ensure 4 nodes to be synchronized, one layer is an FT mechanism (bottom layer virtualization) and the other layer is a master-slave mechanism of a database (application layer), as shown in fig. 2, mn1 and mn2 are a pair of FT virtual machines, and mn3 and mn4 are another pair of FT virtual machines; for the application layer, the current 2 nodes are mn2 and mn4, and mn2 also has a master-slave relationship mn1 in the virtualization layer, and mn3 also has a master-slave relationship mn4 in the virtualization layer.
In step S12, the failed FT management and control node in the management and control service system and the FT management and control node where the virtual access address is located are determined. Here, the virtual access address (vip) is the entry IP through which the management and control node is accessed externally. At the application layer the system sees only the nodes formed by the 2 PVMs; a service inside the system provides the vip, calculates which PVM node is the master node, and configures the vip on that node. The FT management and control node where the vip is located is therefore determined by calculating which management and control node at the application layer is the master management and control node. The failed FT management and control node is a node that has failed, and may be any one, or any combination, of the two FT-protected pairs of primary and secondary FT management and control nodes, that is, any one or any combination of mn1, mn2, mn3 and mn4.
In step S13, fault tolerance processing of the management and control service system is performed according to the failed FT management and control node and the FT management and control node where the virtual access address is located. Here, subsequent fault-tolerant processing can be performed according to the failed FT management and control node that was identified and the management and control node where the vip was determined to be. The fault-tolerant processing depends on the number and location of the failed FT management and control nodes and on the location of the vip. For example, if exactly 1 FT management and control node fails and it is the PVM on the master management and control node where the vip resides, the system can recover by itself through FT and restore its original fault-tolerance capacity, or, with 3 nodes still working simultaneously, it can still tolerate 2 further failures. In other words, after a failure there is at most one vip switch in the worst case; all services are already prepared after the switch and no restart of the system is needed, and even when a vip switch does occur it takes only about 1 s, which is almost imperceptible to the user.
With this method, nodes can be checked and recovered automatically through FT: each failed node automatically finds a node that satisfies the revival conditions, restarts there and recovers, and the whole process is imperceptible to the management and control node. Using this method, at least 3 failures can be tolerated, and if only 1 node fails, or only 1 node of an active/standby pair fails, FT can find an opportunity to automatically restore the full complement of 4 "lives".
In an embodiment of the present application, in step S12, the failed physical host in the management and control service system is located, and the virtual machine on the failed physical host is determined to be a failed FT management and control node; the position of the master management and control node on the application layer is determined, and the FT management and control node where the virtual access address in the management and control service system is located is determined according to the position of the master management and control node. Here, with reference to Fig. 2, the number and positions of the failed physical nodes in the system are determined, for example 1, 2 or 3 failed physical nodes; for instance, with 1 failed physical node it is determined whether the failure is the PVM on the master management and control node. It is calculated whether the application-layer management and control node formed by mn1 and mn2, or the one formed by mn3 and mn4, is the master management and control node, and the FT management and control node where the vip is located is determined accordingly; for example, if the node formed by mn1 and mn2 is calculated to be the master management and control node, the FT management and control node where the vip is located is determined to be mn2. Dbsync denotes the data-synchronization process between virtual machines: the management and control nodes on the 2 PVMs act as master and standby nodes of each other's databases, but at any given time only one master node provides external access, namely the node where the virtual access address (vip) is located.
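As a hedged illustration of step S12 (reusing the hypothetical data model sketched above, not the patent's actual implementation), failure detection and vip localization could be expressed roughly as follows; failed_hosts stands in for whatever host-health monitoring the platform actually provides.

```python
def detect_failed_nodes(system: ControlServiceSystem, failed_hosts: set) -> list:
    """Locate failed physical hosts and mark the VMs deployed on them as failed FT nodes."""
    failed = []
    for pair in (system.master_pair, system.slave_pair):
        for node in (pair.primary, pair.secondary):
            if node.host in failed_hosts and not node.failed:
                node.failed = True
                failed.append(node)
    return failed

def locate_vip(system: ControlServiceSystem) -> FTNode:
    """The vip is configured on the PVM of whichever application-layer node is master."""
    system.vip_node = system.master_pair.primary
    return system.vip_node

# Example: host2 (running mn2, the PVM holding the vip) goes down.
failed = detect_failed_nodes(system, {"host2"})
print([n.name for n in failed], locate_vip(system).name)   # ['mn2'] mn2
```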
In an embodiment of the application, in step S13, if the primary FT management and control node in the master management and control node on the application layer is a failed FT management and control node, the secondary FT management and control node protected by the same FT as that primary FT management and control node is switched to be the primary FT management and control node, and the failed FT management and control node is taken offline; the network card at the outer FT layer of the failed FT management and control node re-attaches automatically, and data packets are forwarded to the virtual access address through that network card; and the FT protecting the failed FT management and control node searches the cluster where the master management and control node is located for a physical machine that satisfies the conditions, so as to create a new secondary FT management and control node on that physical machine. Here, when 1 physical node in the system fails and it hosts the primary FT management and control node in the master management and control node on the application layer, that is, the PVM (mn2) node where the vip is located fails, as shown in Fig. 3, the node where mn2 is located fails; mn1, which was originally the slave node, is then switched from SVM to PVM, and the original PVM goes offline at the same time. Both the vip and the IP are configured inside the virtual machine; of a pair of FT-protected virtual machines, only the PVM provides access at any given time, so when the PVM is switched, the network card at the outer FT layer re-attaches automatically, the adaptation happens at the virtualization layer, and the application layer does not notice it. The new PVM (i.e. the original mn1) keeps the internal network configuration of the original PVM (mn2), so network packets are still forwarded directly to the vip through the FT network card; an external user accesses the management and control node through the vip, and the switch at this moment is imperceptible to the user. In the scenario where 1 node fails, the FT back end searches in the background for a physical node that satisfies the FT conditions; if a healthy physical node satisfying the conditions exists in the system, a new FT slave node is created again, so the FT pair recovers by itself. If not, the environment still has 3 nodes working simultaneously and can tolerate 2 further failures. The recovery from a single node failure needs only the FT service switching time; FT switching does not require any restart, and the service is prepared from the start, so the recovery time is at the second level. The search for a node satisfying the conditions and the rebuilding of the FT pair are entirely background operations, unrelated to user access to the management and control node service, so the user perceives nothing at the application layer. The FT virtual machines are created, deleted and so on within the same cluster, and physical machines with the same configuration can be added to the same cluster. A virtual machine needs a physical machine as its host; if 1 node of an FT-protected virtual machine pair fails, FT tries to automatically re-create a candidate SVM on another physical machine that satisfies the conditions for creating an SVM. These conditions include, but are not limited to, being another physical machine in the cluster with sufficient computing resources (such as CPU and memory) on which a new SVM can be created automatically.
For example, if the same cluster contains another physical machine with the same configuration as the failed one, its CPU and memory resources are sufficient, and it has already been added to the management and control node, the management and control node automatically arranges for the SVM to be created on that physical machine.
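Purely as an illustration under the same hypothetical data model (not the patent's code), the single-failure path described above (promote the surviving SVM, take the failed PVM offline, keep the vip reachable, and let FT rebuild an SVM on an eligible host in the same cluster) might be sketched as follows; host_has_capacity is a hypothetical predicate standing in for the FT back end's resource check.

```python
def handle_single_failure(system: ControlServiceSystem, failed: FTNode,
                          cluster_hosts: list, host_has_capacity) -> None:
    """Recover from one failed FT management and control node (the scenario of Fig. 3)."""
    pair = (system.master_pair
            if failed in (system.master_pair.primary, system.master_pair.secondary)
            else system.slave_pair)
    survivor = pair.surviving_node()
    if survivor is None:
        return  # both members of the pair are gone; handled by the multi-failure logic

    # 1. If the failed VM was the PVM, switch its FT partner to PVM and take the
    #    failed node offline; the FT-layer network card re-attaches automatically,
    #    so packets sent to the vip reach the new PVM without restarting anything.
    if failed.is_primary:
        survivor.is_primary, failed.is_primary = True, False
        pair.primary, pair.secondary = survivor, failed
        if system.vip_node is failed:
            system.vip_node = survivor

    # 2. In the background, FT searches the same cluster for a physical machine with
    #    sufficient CPU/memory and rebuilds a fresh SVM there, restoring the FT pair.
    for host in cluster_hosts:
        if host != pair.primary.host and host_has_capacity(host):
            pair.secondary = FTNode(name=failed.name, host=host, is_primary=False)
            break
    # If no eligible host exists, 3 nodes keep working and 2 more failures can
    # still be tolerated; this background search is invisible to users of the vip.
```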
In an embodiment of the application, in step S13, if each pair of FT-protected management and control nodes contains one failed FT management and control node, it is determined whether the failed FT management and control node is a primary FT management and control node; if so, the secondary FT management and control node protected by the same FT is switched to be the primary FT management and control node and the failed FT management and control node is taken offline; the network card at the outer FT layer corresponding to the FT management and control node where the virtual access address is located re-attaches automatically, and data packets are forwarded to the virtual access address through that network card; and the FT protecting the failed FT management and control node searches the cluster where the master management and control node is located for a physical machine that satisfies the conditions, so as to create a new secondary FT management and control node on it. Here, when 2 physical nodes fail (the second fault-tolerance scenario), there are 3 cases. In the first case, each FT-protected pair contains one failed FT management and control node, that is, each of the 2 FT-protected pairs loses 1 virtual machine; the second fault tolerance occurs when the node that failed at the first fault tolerance was the primary FT management and control node where the vip is located and another primary FT management and control node now fails. As shown in Fig. 4, mn2 and mn4 fail; mn1 is switched to PVM and the original PVM goes offline, and FT automatically recovers to 3 or 4 nodes by searching the same cluster for physical machines that satisfy the conditions, or the situation reduces to the single-failure case described above, with the services originally deployed on the failed physical machines redeployed into the virtual machines. The recovery needs only the FT service switching time, and FT switching does not require any restart; this scenario can still tolerate 1 further failure, and the recovery time, which involves only the FT switch, stays at the second level. (A consolidated sketch of these multi-failure decisions is given after the three-node-failure discussion below.)
Continuing with the above embodiment, in the second case of 2 failed physical nodes, if the primary FT management and control node and the secondary FT management and control node in the slave management and control node on the application layer are both failed FT management and control nodes, the fault tolerance processing of the management and control service system continues to be completed through the primary FT management and control node and the secondary FT management and control node in the master management and control node on the application layer. Here, the slave management and control node on the application layer is the node where the vip is not located. When the physical machines corresponding to the pair of virtual machines not holding the vip fail, as shown in Fig. 5, mn3 and mn4 fail, and the original vip is not switched. Because the failed nodes are a complete FT-protected pair whose 2 members have both failed, FT does not recover by itself in this case; the vip does not need to be switched, the node where the vip is located is still protected by FT, and one further failure can be tolerated. The recovery involves neither an FT switch nor a vip switch, and the original network connections are not interrupted.
In an embodiment of the application, in the third case of 2 failed physical nodes, if the primary FT management and control node and the secondary FT management and control node on the master management and control node are both failed FT management and control nodes, the virtual access address is switched to the primary FT management and control node in the slave management and control node, and the fault tolerance processing of the management and control service system continues to be completed through the primary FT management and control node and the secondary FT management and control node in the slave management and control node. Here, the master management and control node is the node where the vip is located. When the physical machines corresponding to the pair of virtual machines holding the vip fail, the primary FT management and control node and the secondary FT management and control node on the master management and control node have both failed; as shown in Fig. 6, mn1 and mn2 fail, and the vip must be switched. mn3 is the PVM of the other FT-protected pair, so when mn1 and mn2 have both failed, the vip is switched to the PVM in the slave management and control node, that is, to mn3. Because the failed nodes are a complete FT-protected pair whose 2 members have both failed, FT does not recover by itself in this case, but the other pair of virtual machines is still protected by FT and can still tolerate 1 further failure; the recovery time involves only the vip switch and is at the second level. In summary, across all second-fault-tolerance cases at most one vip switch occurs, and a vip switch itself completes within seconds, so the recovery time for the second fault tolerance remains at the second level.
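A hedged sketch, under the same hypothetical model as above, of the vip switch when the entire FT pair holding the vip has failed (the case of Fig. 6):

```python
def handle_master_pair_failure(system: ControlServiceSystem) -> None:
    """Both VMs of the pair holding the vip have failed (the scenario of Fig. 6).

    FT cannot self-recover a pair that has lost both members, so the vip is moved
    to the surviving pair; only this single vip switch (about one second) is
    visible externally, and one further failure can still be tolerated.
    """
    new_home = system.slave_pair.primary
    if new_home.failed:
        new_home = system.slave_pair.surviving_node()  # fall back to the pair's SVM
    if new_home is None:
        return  # the other pair is gone too; no switch target remains
    system.vip_node = new_home
    # Bookkeeping only: the surviving pair now plays the master role at the application layer.
    system.master_pair, system.slave_pair = system.slave_pair, system.master_pair
```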
In an embodiment of the application, in step S13, if the primary FT management and control node and the secondary FT management and control node on the master management and control node have both failed and one FT management and control node in the slave management and control node has also failed, the virtual access address is switched to the non-failed FT management and control node in the slave management and control node, and the fault tolerance processing of the management and control service system continues through the non-failed FT management and control node where the virtual access address is now located. Here, once an entire FT pair has failed, only database (application-layer) synchronization still provides protection. With 3 failed physical nodes there are two cases. In the first, the primary and secondary FT management and control nodes on the master management and control node both fail and one FT management and control node in the slave management and control node also fails, that is, the failed nodes include the node where the vip is located: as shown in Fig. 7, mn1, mn2 and mn4 fail and the vip must be switched; since the recovery involves only the vip switch, the recovery time is of the order of seconds.
Continuing with the above embodiment, in the second case of 3 failed physical nodes, the primary FT management and control node and the secondary FT management and control node on the slave management and control node have both failed and one FT management and control node in the master management and control node has also failed; the fault tolerance processing of the management and control service system then continues through the remaining non-failed FT management and control node. Here the failed nodes do not include the node where the vip is located: as shown in Fig. 8, with mn2 having failed first, mn3 and mn4 then fail, so no vip switch is required; although no further fault tolerance remains, the service can still be accessed externally.
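The individual scenarios of Figs. 3 to 8 all reduce to one decision: does the vip have to move, and if so, where to. The following consolidated sketch of that decision is purely illustrative, reuses the hypothetical data model from above, and is not the patent's implementation.

```python
def plan_failover(system: ControlServiceSystem) -> str:
    """Summarize where the vip should live after any combination of failures.

    At most one vip switch is ever needed, and the service stays reachable as
    long as at least one of the four FT management and control nodes survives,
    i.e. up to 3 failures can be tolerated.
    """
    if not system.vip_node.failed:
        return "no vip switch needed; vip stays on " + system.vip_node.name

    vip_pair = (system.master_pair
                if system.vip_node in (system.master_pair.primary, system.master_pair.secondary)
                else system.slave_pair)
    other_pair = system.slave_pair if vip_pair is system.master_pair else system.master_pair

    # Prefer the FT partner of the failed node, then the other pair (its PVM first).
    for candidate in (vip_pair.surviving_node(),
                      None if other_pair.primary.failed else other_pair.primary,
                      other_pair.surviving_node()):
        if candidate is not None:
            system.vip_node = candidate
            return "switch vip to " + candidate.name + " (single switch, about 1 s)"
    return "all four FT management and control nodes have failed; service unavailable"
```

The ordering of the candidates encodes the preference described in the scenarios above: the FT partner of the failed node is tried first (that switch is handled by the FT layer itself), and only when an entire pair has been lost does the vip move to the other pair.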
It should be noted that, in all the failure situations above, the offline of any node triggers a corresponding warning to the user layer; the user can configure a receiving end and will receive the warning notification sent by the system whenever any node goes offline. Through the design of this application, FT-protected virtual machines carry multiple database-synchronized management and control nodes, so that the cloud management and control node can tolerate at least 3 failures and can recover by itself when the conditions are satisfied; meanwhile, throughout the fault tolerance process, the recovery time of the externally provided service can be kept at the second level.
In addition, an embodiment of the present application further provides a computer-readable medium, on which computer-readable instructions are stored, and the computer-readable instructions can be executed by a processor to implement the aforementioned high-availability method for multiple fault tolerance of management and control nodes.
In correspondence with the method described above, the present application also provides a terminal, which includes modules or units capable of executing the steps of the method described in Fig. 1 or in any of the embodiments; these modules or units can be implemented by hardware, software or a combination of hardware and software, and the present application is not limited in this respect. For example, in an embodiment of the present application, there is also provided an apparatus for the high-availability method for multiple fault tolerance of management and control nodes, the apparatus including:
one or more processors; and
a memory storing computer readable instructions that, when executed, cause the processor to perform the operations of the method as previously described.
For example, the computer readable instructions, when executed, cause the one or more processors to:
establishing a management and control service system according to a master management and control node and a slave management and control node on an application layer, wherein the master management and control node and the slave management and control node each comprise a pair of management and control nodes protected by FT (Fault Tolerance), and each pair of FT management and control nodes comprises a primary FT management and control node and a secondary FT management and control node;
determining failed FT control nodes in the control service system and FT control nodes where virtual access addresses are located;
and carrying out fault tolerance processing on the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located.
Fig. 9 is a schematic structural diagram of a high-availability device for multiple fault tolerance of management and control nodes according to another aspect of the present application, where the device includes: a construction device 11, a determination device 12 and a fault-tolerant processing device 13. The construction device 11 is configured to construct a management and control service system according to a master management and control node and a slave management and control node on an application layer, wherein the master management and control node and the slave management and control node each comprise a pair of FT-protected management and control nodes, and each pair of FT management and control nodes comprises a primary FT management and control node and a secondary FT management and control node; the determination device 12 is configured to determine the failed FT management and control node in the management and control service system and the FT management and control node where the virtual access address is located; and the fault-tolerant processing device 13 is configured to perform fault-tolerant processing on the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located.
It should be noted that the content executed by the building device 11, the determining device 12 and the fault-tolerant processing device 13 is the same as or corresponding to the content in the above steps S11, S12 and S13, and for brevity, the description is omitted here.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Program instructions which invoke the methods of the present application may be stored on a fixed or removable recording medium and/or transmitted via a data stream on a broadcast or other signal-bearing medium and/or stored within a working memory of a computer device operating in accordance with the program instructions. An embodiment according to the present application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform a method and/or a solution according to the aforementioned embodiments of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (12)

1. A high-availability method for multiple fault tolerance of management and control nodes, the method comprising:
establishing a management and control service system according to a master management and control node and a slave management and control node on an application layer, wherein the master management and control node and the slave management and control node each comprise a pair of management and control nodes protected by FT (Fault Tolerance), and each pair of FT management and control nodes comprises a primary FT management and control node and a secondary FT management and control node;
determining failed FT control nodes in the control service system and FT control nodes where virtual access addresses are located;
and carrying out fault tolerance processing on the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located.
2. The method of claim 1, wherein the primary FT management and control node and the secondary FT management and control node contain the same data content and the corresponding databases are encapsulated in the respective virtual machines.
3. The method according to claim 1 or 2, wherein determining the failed FT management and control node in the management and control service system and the FT management and control node where the virtual access address is located comprises:
positioning a failed physical host in the management and control service system, and determining that a virtual machine on the failed physical host is a failed FT management and control node;
determining the position of a main control node on an application layer, and determining an FT control node where a virtual access address in the control service system is located according to the position of the main control node.
4. The method according to claim 3, wherein performing fault tolerance processing of the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located includes:
if the primary FT control node in the master control node on the application layer is a failed FT control node, switching the secondary FT control node protected by the same FT as the primary FT control node to be the primary FT control node, and meanwhile taking the failed FT control node offline;
automatically docking through a network card protecting the FT outer layer of the failed FT control node, and forwarding a data packet to the virtual access address through the network card;
and searching a physical machine meeting the condition in the cluster where the main control node is located by protecting the FT of the failed FT control node, so as to create a new secondary FT control node on the physical machine meeting the condition.
5. The method according to claim 3, wherein performing fault tolerance processing of the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located includes:
if a failed FT control node exists in each pair of FT control nodes protected by FT, judging whether the failed FT control node is a primary FT control node, if so, switching a secondary FT control node protected by the same FT as the primary FT control node into a primary FT control node, and meanwhile, taking the failed FT control node off line;
automatically docking a network card on the outer layer of the FT corresponding to the FT control node where the virtual access address is located, and forwarding a data packet to the virtual access address through the network card;
and searching a physical machine meeting the condition in the cluster where the main control node is located by protecting the FT of the failed FT control node, so as to create a new secondary FT control node on the physical machine meeting the condition.
6. The method according to claim 3, wherein performing fault tolerance processing of the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located includes:
if the primary FT control node and the secondary FT control node in the slave control node on the application layer are both failed FT control nodes, continuing to complete the fault-tolerant processing of the control service system through the primary FT control node and the secondary FT control node in the master control node on the application layer.
7. The method according to claim 3, wherein performing fault tolerance processing of the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located includes:
if the primary FT control node and the secondary FT control node on the master control node are both failed FT control nodes, switching the virtual access address to the primary FT control node in the slave control node;
and continuing to complete the fault tolerance processing of the management and control service system through the primary FT management and control node and the secondary FT management and control node in the slave management and control node.
8. The method according to claim 3, wherein performing fault tolerance processing of the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located includes:
if the primary FT control node and the secondary FT control node on the master control node have both failed and one FT control node in the slave control node has failed, switching the virtual access address to the non-failed FT control node in the slave control node, and continuing the fault tolerance processing of the control service system through the non-failed FT control node where the virtual access address is now located.
9. The method according to claim 3, wherein performing fault tolerance processing of the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located includes:
and if the primary FT control node and the secondary FT control node on the slave control node are both failed and one FT control node in the master control node is failed, continuing the fault tolerance processing of the control service system through the rest non-failed FT control nodes.
10. A high-availability apparatus for multiple fault tolerance of management and control nodes, the apparatus comprising:
the system comprises a building device and a service management and control device, wherein the building device is used for building a management and control service system according to a master management and control node and a slave management and control node on an application layer, the master management and control node and the slave management and control node respectively comprise a pair of FT management and control nodes protected by FT, and each pair of FT management and control nodes comprises a master FT management and control node and a secondary FT management and control node;
the determining device is used for determining the failed FT control node in the control service system and the FT control node where the virtual access address is located;
and the fault-tolerant processing device is used for carrying out fault-tolerant processing on the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located.
11. A high-availability apparatus for multiple fault tolerance of management and control nodes, the apparatus comprising:
one or more processors; and
a memory storing computer readable instructions that, when executed, cause the processor to perform the operations of the method of any of claims 1 to 9.
12. A computer readable medium having computer readable instructions stored thereon which are executable by a processor to implement the method of any one of claims 1 to 9.
CN202010277503.9A 2020-04-08 2020-04-08 High availability method and equipment for managing and controlling multiple fault tolerance of nodes Active CN111488247B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010277503.9A CN111488247B (en) 2020-04-08 2020-04-08 High availability method and equipment for managing and controlling multiple fault tolerance of nodes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010277503.9A CN111488247B (en) 2020-04-08 2020-04-08 High availability method and equipment for managing and controlling multiple fault tolerance of nodes

Publications (2)

Publication Number Publication Date
CN111488247A true CN111488247A (en) 2020-08-04
CN111488247B CN111488247B (en) 2023-07-25

Family

ID=71797869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010277503.9A Active CN111488247B (en) 2020-04-08 2020-04-08 High availability method and equipment for managing and controlling multiple fault tolerance of nodes

Country Status (1)

Country Link
CN (1) CN111488247B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101859263A (en) * 2010-06-12 2010-10-13 中国人民解放军国防科学技术大学 Quick communication method between virtual machines supporting online migration
CN103778031A (en) * 2014-01-15 2014-05-07 华中科技大学 Distributed system multilevel fault tolerance method under cloud environment
CN104360943A (en) * 2014-11-11 2015-02-18 浪潮电子信息产业股份有限公司 Resource guarantee model of service-oriented architecture
CN104536842A (en) * 2014-12-17 2015-04-22 中电科华云信息技术有限公司 Virtual machine fault-tolerant method based on KVM virtualization
CN105743995A (en) * 2016-04-05 2016-07-06 北京轻元科技有限公司 Transplantable high-available container cluster deploying and managing system and method
CN107992351A (en) * 2016-10-26 2018-05-04 阿里巴巴集团控股有限公司 A kind of hardware resource distribution method and device, electronic equipment
US20190102265A1 (en) * 2017-03-23 2019-04-04 Dh2I Company Highly available stateful containers in a cluster environment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
周利敏; 傅妍芳; 高武奇; 高祥; 程兵: "Research on high availability technology based on a cloud simulation platform" (基于云仿真平台的高可用技术研究), no. 04 *
王伟成; 罗宇: "Fault-tolerant technology for spaceborne parallel computers based on a distributed architecture" (基于分布式架构的星载并行计算机容错技术), 计算机工程与科学 (Computer Engineering & Science), no. 03 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113157393A (en) * 2021-04-09 2021-07-23 上海云轴信息科技有限公司 Method and device for managing high availability of nodes
CN113595899A (en) * 2021-06-30 2021-11-02 上海云轴信息科技有限公司 Method and system for realizing multi-node point cloud routing

Also Published As

Publication number Publication date
CN111488247B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
US11163653B2 (en) Storage cluster failure detection
RU2751551C1 (en) Method and apparatus for restoring disrupted operating ability of a unit, electronic apparatus and data storage medium
WO2017219857A1 (en) Data processing method and device
TW523656B (en) Method and apparatus for building and managing multi-clustered computer systems
CN109005045B (en) Main/standby service system and main node fault recovery method
CN107241430A (en) A kind of enterprise-level disaster tolerance system and disaster tolerant control method based on distributed storage
US8943082B2 (en) Self-assignment of node identifier in a cluster system
CN109446169B (en) Double-control disk array shared file system
US20110219263A1 (en) Fast cluster failure detection
CN109918360A (en) Database platform system, creation method, management method, equipment and storage medium
RU2643642C2 (en) Use of cache memory and another type of memory in distributed memory system
CN111488247B (en) High availability method and equipment for managing and controlling multiple fault tolerance of nodes
US8015432B1 (en) Method and apparatus for providing computer failover to a virtualized environment
CN110557413A (en) Business service system and method for providing business service
CN112307045A (en) Data synchronization method and system
US20120143829A1 (en) Notification of configuration updates in a cluster system
CN112256477A (en) Virtualization fault-tolerant method and device
CN114443768A (en) Main/standby switching method and device of distributed database and readable storage medium
CN107528703B (en) Method and equipment for managing node equipment in distributed system
CN116389233B (en) Container cloud management platform active-standby switching system, method and device and computer equipment
CN113596195B (en) Public IP address management method, device, main node and storage medium
US20190124145A1 (en) Method and apparatus for availability management
CN113157392B (en) High-availability method and equipment for mirror image warehouse
WO2012072644A1 (en) Validation of access to a shared data record subject to read and write access by multiple requesters
CN112202601B (en) Application method of two physical node mongo clusters operated in duplicate set mode

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant