CN111488247B - High availability method and equipment for managing and controlling multiple fault tolerance of nodes - Google Patents

High availability method and equipment for managing and controlling multiple fault tolerance of nodes

Info

Publication number
CN111488247B
Authority
CN
China
Prior art keywords
management
control node
control
node
failed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010277503.9A
Other languages
Chinese (zh)
Other versions
CN111488247A (en
Inventor
赵胜龑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yunzhou Information Technology Co ltd
Original Assignee
Shanghai Yunzhou Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yunzhou Information Technology Co ltd filed Critical Shanghai Yunzhou Information Technology Co ltd
Priority to CN202010277503.9A priority Critical patent/CN111488247B/en
Publication of CN111488247A publication Critical patent/CN111488247A/en
Application granted granted Critical
Publication of CN111488247B publication Critical patent/CN111488247B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Hardware Redundancy (AREA)

Abstract

The application provides a high availability method and device with multiple fault tolerance for management and control nodes. A management and control service system is constructed from a master management and control node and a slave management and control node on the application layer, wherein the master management and control node and the slave management and control node each comprise a pair of FT management and control nodes protected by FT, and each pair of FT management and control nodes comprises a primary FT management and control node and a secondary FT management and control node; a failed FT management and control node and the FT management and control node where a virtual access address is located are determined in the management and control service system; and fault tolerance processing of the management and control service system is performed according to the failed FT management and control node and the FT management and control node where the virtual access address is located. The recovery time of the external service can thus be kept at the second level while multiple fault tolerance is achieved, satisfying both the recovery time requirement and the multiple fault tolerance requirement.

Description

High availability method and equipment for managing and controlling multiple fault tolerance of nodes
Technical Field
The present application relates to the field of computers, and in particular to a high availability method and device with multiple fault tolerance for management and control nodes.
Background
The management and control node of a cloud management platform is the central node for allocating and managing cloud resources, so its availability is extremely important. Conventional management and control nodes often run on a single server and therefore suffer from a single point of failure: when the server fails (for example through power failure or network failure), the management and control node may become inaccessible.
In production environments, the larger the cluster, the higher the high availability requirement on the cloud management node; in particular, in finance and other fields that require high-frequency operation, even higher demands are placed on the high availability of management and control nodes. The solutions currently adopted in the industry address the high availability requirement to some extent but still have drawbacks. Some schemes require additional time for heartbeat detection, virtual machine operating system start-up and management and control node start-up, and these times add up; during this period the management and control node cannot provide external access, and it typically takes several minutes for access to resume. Other schemes are limited by the logic of the database synchronization mechanism, which cannot support more than 2 nodes being master nodes at the same time; with such an architecture at most 2 nodes are online simultaneously, so the scheme can tolerate only 1 fault.
Existing solutions either guarantee multiple fault tolerance at the expense of recovery time, or guarantee recovery time at the expense of multiple fault tolerance; it is difficult to satisfy both requirements at once.
Disclosure of Invention
An object of the present application is to provide a method and an apparatus for managing and controlling multiple fault tolerance of a node, which solve the problem in the prior art that it is difficult for the managing and controlling node to simultaneously satisfy the requirements of ensuring recovery time and having multiple fault tolerance.
According to one aspect of the present application, there is provided a high availability method of managing multiple fault tolerance of a node, the method comprising:
constructing a management and control service system according to a master management and control node and a slave management and control node on an application layer, wherein the master management and control node and the slave management and control node both comprise a pair of FT management and control nodes protected by FT, and each pair of FT management and control nodes comprises a master FT management and control node and a secondary FT management and control node;
determining a failed FT management and control node and the FT management and control node where a virtual access address is located in the management and control service system;
and carrying out fault tolerance processing of the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located.
Further, the primary FT management and control node and the secondary FT management and control node contain the same data content and the corresponding databases are packaged in the respective corresponding virtual machines.
Further, determining the failed FT management and control node and the FT management and control node where the virtual access address is located in the management and control service system includes:
positioning a failed physical host in the management and control service system, and determining a virtual machine on the failed physical host as a failed FT management and control node;
and determining the position of a master control node on an application layer, and determining the FT control node where the virtual access address is located in the control service system according to the position of the master control node.
Further, performing fault-tolerant processing of the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located, including:
if the primary FT management and control node in the master management and control node on the application layer is a failed FT management and control node, switching the secondary FT management and control node protected by the same FT as that primary FT management and control node to be the primary FT management and control node, and taking the failed FT management and control node offline;
automatically connecting the network card at the outer layer of the FT protecting the failed FT management and control node, so as to forward data packets to the virtual access address through the network card;
searching, by the FT protecting the failed FT management and control node, the cluster where the master management and control node is located for a physical machine meeting the conditions, so as to create a new secondary FT management and control node on the physical machine meeting the conditions.
Further, performing fault-tolerant processing of the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located, including:
if one failed FT management and control node exists in each pair of FT management and control nodes protected by FT, judging whether the failed FT management and control node is a primary FT management and control node, and if so, switching the secondary FT management and control node protected by the same FT as that primary FT management and control node to be the primary FT management and control node, and taking the failed FT management and control node offline;
automatically connecting the network card at the outer layer of the FT corresponding to the FT management and control node where the virtual access address is located, so as to forward data packets to the virtual access address through the network card;
searching, by the FT protecting the failed FT management and control node, the cluster where the master management and control node is located for a physical machine meeting the conditions, so as to create a new secondary FT management and control node on the physical machine meeting the conditions.
Further, performing fault-tolerant processing of the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located, including:
if the primary FT management and control node and the secondary FT management and control node in the slave management and control node on the application layer are both failed FT management and control nodes, continuing fault tolerance processing of the management and control service system through the primary FT management and control node and the secondary FT management and control node in the master management and control node on the application layer.
Further, performing fault-tolerant processing of the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located, including:
if the primary FT management and control node and the secondary FT management and control node in the master management and control node are both failed FT management and control nodes, switching the virtual access address to the primary FT management and control node in the slave management and control node;
and continuing fault tolerance processing of the management and control service system through the primary FT management and control node and the secondary FT management and control node in the slave management and control node.
Further, performing fault-tolerant processing of the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located, including:
if the primary FT management and control node and the secondary FT management and control node in the master management and control node have both failed and one FT management and control node in the slave management and control node has failed, switching the virtual access address to the FT management and control node in the slave management and control node that has not failed, and continuing fault tolerance processing of the management and control service system through the non-failed FT management and control node where the virtual access address is newly located.
Further, performing fault-tolerant processing of the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located, including:
if the primary FT management and control node and the secondary FT management and control node in the slave management and control node have both failed and one FT management and control node in the master management and control node has failed, continuing fault tolerance processing of the management and control service system through the remaining non-failed FT management and control node.
According to another aspect of the present application, there is also provided a high availability device for managing multiple fault tolerance of a node, the device comprising:
a construction device, configured to construct a management and control service system from a master management and control node and a slave management and control node on the application layer, wherein the master management and control node and the slave management and control node each comprise a pair of FT management and control nodes protected by FT, and each pair of FT management and control nodes comprises a primary FT management and control node and a secondary FT management and control node;
the determining device is used for determining the failed FT control node and the FT control node where the virtual access address is located in the control service system;
and the fault-tolerant processing device is used for carrying out fault-tolerant processing on the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located.
According to yet another aspect of the present application, there is also provided a highly available device for managing multiple fault tolerance of a node, the device comprising:
one or more processors; and
a memory storing computer readable instructions that, when executed, cause the processor to perform operations of the method as described above.
According to yet another aspect of the present application, there is also provided a computer readable medium having stored thereon computer readable instructions executable by a processor to implement a method as described above.
Compared with the prior art, in the present application a management and control service system is constructed from a master management and control node and a slave management and control node on the application layer, wherein the master management and control node and the slave management and control node each comprise a pair of FT management and control nodes protected by FT, and each pair of FT management and control nodes comprises a primary FT management and control node and a secondary FT management and control node; a failed FT management and control node and the FT management and control node where a virtual access address is located are determined in the management and control service system; and fault tolerance processing of the management and control service system is performed according to the failed FT management and control node and the FT management and control node where the virtual access address is located. The recovery time of the external service can thus be kept at the second level while multiple fault tolerance is achieved, satisfying both the recovery time requirement and the multiple fault tolerance requirement.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings, in which:
FIG. 1 illustrates a flow diagram of a highly available method of managing multiple fault tolerance of a node according to one aspect of the present application;
FIG. 2 is a schematic diagram of a management node service system according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating a failure condition of 1 physical node in one embodiment of the present application;
FIG. 4 is a schematic diagram of a first case where there are 2 physical node failures in an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a second case where there are 2 physical node failures in an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating a third scenario in which 2 physical nodes fail in an embodiment of the present application;
FIG. 7 is a schematic diagram of a first case where there are 3 physical nodes failing in an embodiment of the present application;
FIG. 8 is a schematic diagram illustrating a second scenario in which there are 3 physical node failures in an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a high availability device for managing multiple fault tolerance of a node according to another aspect of the present application.
The same or similar reference numbers in the drawings refer to the same or similar parts.
Detailed Description
The present application is described in further detail below with reference to the accompanying drawings.
In one typical configuration of the present application, the terminal, the devices of the service network, and the trusted party each include one or more processors (e.g., central processing units (Central Processing Unit, CPU)), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change RAM (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
FIG. 1 is a flow diagram of a high availability method with multiple fault tolerance for management and control nodes according to one aspect of the present application. The method comprises steps S11 to S13:
In step S11, a management and control service system is constructed from a master management and control node and a slave management and control node on the application layer, wherein the master management and control node and the slave management and control node each comprise a pair of FT management and control nodes protected by FT, and each pair of FT management and control nodes comprises a primary FT management and control node and a secondary FT management and control node. Here, FT stands for Fault Tolerance. The management and control node service (mn service) is composed of a plurality of internal sub-services; these services are packaged into a virtual machine image, and the image is used to create a pair of FT-protected virtual machines for each of the master and slave management and control nodes, so that 4 virtual-machine-backed management and control node services are online simultaneously. To spread the risk, the 4 virtual machines are deployed on 4 physical nodes. The primary FT management and control node and the secondary FT management and control node contain the same data content, and the corresponding databases are packaged in the respective virtual machines. Specifically, the 4 virtual machines form two pairs, a master pair and a slave pair. As shown in FIG. 2, mn1, mn2, mn3 and mn4 are all management and control node services, each packaged into its own virtual machine together with its database service, and each mn together with its database forms a management and control node service that is accessible from outside.
PVM and SVM denote the primary virtual machine (primary VM) and the secondary virtual machine (secondary VM) protected by FT, respectively. The contents of a PVM and an SVM protected by the same FT are always kept consistent; that is, the contents of the primary and secondary FT management and control nodes in each pair are always consistent. It should be noted that the management and control service system has 2 layers of synchronization mechanisms to keep the 4 nodes synchronized: one layer is the FT mechanism (underlying virtualization), and the other is the master-slave mechanism of the database (application layer). As shown in FIG. 2, mn1 and mn2 are one pair of FT virtual machines, and mn3 and mn4 are another pair of FT virtual machines. For the application layer, the current 2 nodes are mn2 and mn4; at the virtualization layer, mn2 is in a primary-secondary relationship with mn1, and mn3 is in a primary-secondary relationship with mn4.
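The two-layer structure described above can be pictured as a small data model. The following Python sketch is illustrative only: the class names, the `host-N` labels and the choice of mn3 as the PVM of the second pair (following the later description of FIG. 6) are assumptions, not part of the application.

```python
from dataclasses import dataclass


@dataclass
class FTNode:
    name: str   # management and control node service, e.g. "mn2"
    host: str   # hypothetical physical host label
    role: str   # "PVM" or "SVM" at the virtualization (FT) layer


@dataclass
class FTPair:
    app_role: str  # "master" or "slave" at the application (database) layer
    pvm: FTNode
    svm: FTNode

    def nodes(self):
        return [self.pvm, self.svm]


def build_system():
    """Two FT pairs, four VMs, one VM per physical host (assumed layout)."""
    master = FTPair("master", FTNode("mn2", "host-2", "PVM"), FTNode("mn1", "host-1", "SVM"))
    slave = FTPair("slave", FTNode("mn3", "host-3", "PVM"), FTNode("mn4", "host-4", "SVM"))
    return [master, slave]


def one_vm_per_host(pairs):
    """Check the risk-spreading rule: no two VMs share a physical host."""
    hosts = [node.host for pair in pairs for node in pair.nodes()]
    return len(hosts) == len(set(hosts))
```

Calling `one_vm_per_host(build_system())` checks that the 4 virtual machines are spread over 4 distinct physical nodes, as the embodiment requires.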
In step S12, the failed FT management and control node and the FT management and control node where the virtual access address is located are determined in the management and control service system. Here, the virtual access address (vip) is the entry IP through which the management and control node is accessed from outside. From the point of view of the application layer, the system consists of only 2 nodes, namely the 2 PVMs, which provide the service; the vip is configured on whichever PVM node is calculated to be the master node. The FT management and control node where the vip is located is therefore determined by calculating which management and control node on the application layer is the master management and control node. A failed FT management and control node is a node on which a fault has occurred; it may be any one or any combination of the primary and secondary FT management and control nodes in the two FT-protected pairs, that is, any one or any combination of mn1, mn2, mn3 and mn4.
In step S13, fault tolerance processing of the management and control service system is performed according to the failed FT management and control node and the FT management and control node where the virtual access address is located. The subsequent fault tolerance processing depends on the number and positions of the failed FT management and control nodes and on the position of the vip. For example, if there is 1 failed FT management and control node and it is the PVM of the master management and control node, on which the vip is located, the system can either recover by itself through FT and regain its original fault tolerance count, or continue with 3 nodes working simultaneously and tolerate 2 further faults. That is, after a failure at most one vip switch is needed; all services are ready after the switch and nothing needs to be restarted. Even when the vip is switched, the switch takes only about 1 s, so the user hardly notices it.
With this method, failed nodes can be detected and recovered by FT: for each failed node, FT can automatically find a node that meets the conditions for re-creation and start recovery, and the management and control node is completely unaware of this process. With this method, at least 3 faults can be tolerated, and if only 1 failure occurs at a time, FT has the opportunity to restore the system to 4 nodes by itself.
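The fault tolerance counts quoted above (4 nodes tolerate at least 3 faults, 3 nodes tolerate 2) are consistent with a simple "online nodes minus one" reading. The sketch below encodes that reading; the function name is hypothetical and the rule is an interpretation of the text, not a formula stated in the application.

```python
def remaining_fault_tolerance(online_nodes: int) -> int:
    """Faults that can still be tolerated if FT does not rebuild any node.

    Assumes the service stays reachable as long as one node is online.
    """
    return max(online_nodes - 1, 0)
```

Under this reading, each successful self-recovery by FT raises the online node count again and with it the number of faults that can still be tolerated.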
In an embodiment of the present application, in step S12, the failed physical host in the management and control service system is located, and the virtual machine on the failed physical host is determined to be the failed FT management and control node; the position of the master management and control node on the application layer is determined, and the FT management and control node where the virtual access address is located is determined according to the position of the master management and control node. With continued reference to FIG. 2, the number and positions of failed physical nodes in the system are determined, for example whether 1, 2 or 3 physical nodes have failed, and, when 1 physical node has failed, whether the failed node is, for example, the PVM of the master node. It is then calculated whether the application-layer management and control node formed by mn1 and mn2 or the one formed by mn3 and mn4 is the master management and control node, which determines the FT management and control node where the vip is located. For example, if the application-layer management and control node formed by mn1 and mn2 is calculated to be the master management and control node, the FT management and control node where the vip is located is mn2. "Db sync" means that data is synchronized between the virtual machines: the management and control nodes on the 2 PVMs act as the master and slave nodes of the database, but at any given time only one master node provides external access, namely the node where the virtual access address (vip) is located.
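Step S12 amounts to two lookups: mapping failed physical hosts to the virtual machines placed on them, and mapping the application-layer master to its PVM. A minimal Python sketch follows; the `PLACEMENT` and `PAIRS` tables, the host labels and the pair names are hypothetical and only mirror the mn1 to mn4 example.

```python
# Hypothetical placement of the four VMs on four physical hosts.
PLACEMENT = {"mn1": "host-1", "mn2": "host-2", "mn3": "host-3", "mn4": "host-4"}

# Hypothetical FT pairs; mn3 as PVM of the second pair is an assumption.
PAIRS = {
    "pair-a": {"pvm": "mn2", "svm": "mn1"},
    "pair-b": {"pvm": "mn3", "svm": "mn4"},
}


def failed_ft_nodes(placement, failed_hosts):
    """Every VM placed on a failed physical host is a failed FT node."""
    return sorted(vm for vm, host in placement.items() if host in failed_hosts)


def vip_node(pairs, app_master):
    """The vip sits on the PVM of the pair acting as application-layer master."""
    return pairs[app_master]["pvm"]
```

For instance, `failed_ft_nodes(PLACEMENT, {"host-2"})` yields `["mn2"]`, and `vip_node(PAIRS, "pair-a")` yields `"mn2"`, matching the example in which the pair formed by mn1 and mn2 is the master.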
In an embodiment of the present application, in step S13, if the primary FT management and control node in the master management and control node on the application layer is the failed FT management and control node, the secondary FT management and control node protected by the same FT is switched to be the primary FT management and control node, and the failed FT management and control node goes offline; the network card at the outer layer of the FT protecting the failed FT management and control node is automatically connected, so that data packets are forwarded to the virtual access address through the network card; and the FT protecting the failed FT management and control node searches the cluster where the master management and control node is located for a physical machine that meets the conditions, so as to create a new secondary FT management and control node on that physical machine. Here, when 1 physical node in the system fails and it hosts the primary FT management and control node of the master management and control node on the application layer, that is, the PVM (mn2) where the vip is located, as shown in FIG. 3, then mn1, which was originally the secondary node, changes from SVM to PVM, and the original PVM goes offline. For a pair of FT-protected virtual machines, only the PVM provides external access at any given time. When the PVM is switched, the network card at the FT outer layer is connected automatically; the adaptation happens at the virtualization layer, so the application layer is unaware of it. The new PVM (the original mn1) has the same internal network configuration as the original PVM (mn2), so network packets are forwarded directly to the vip through the FT network card. External users access the management and control node through the vip, so the switch is imperceptible to them.
In the scenario where 1 node fails, the FT back end searches in the background for a physical node that meets the FT conditions. If a healthy physical node meeting the conditions exists in the system, a new FT secondary node is created, and the FT pair is restored automatically; if not, 3 nodes in the current environment still work simultaneously, and 2 further faults can be tolerated. Recovery from a 1-node failure requires only the FT switching time; FT switching involves no restart, and the services are ready from the start, so the recovery time is at the second level. Searching for a qualifying node and rebuilding the FT pair run entirely in the background and do not affect user access to the management and control node service, so the user perceives nothing at the application layer. FT virtual machines are created and deleted within the same cluster, and physical machines with exactly the same configuration can be added to that cluster. If 1 node of an FT-protected virtual machine pair fails, FT attempts to automatically create a replacement SVM on another physical machine that meets the conditions for creating an SVM; these conditions include, but are not limited to, the physical machine being in the same cluster and having sufficient computing resources (such as CPU and memory) for the new SVM. For example, if another physical machine with the same configuration as the failed one and with sufficient CPU and memory resources exists in the same cluster and has been added to the management node, the management node automatically schedules an attempt to create the SVM on it.
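The single-failure handling just described (promote the SVM, then look in the background for a qualifying physical machine) can be sketched as follows. This is a simplified illustration: the `Host` fields, the resource thresholds and the `new-svm@` naming are assumptions, and the real eligibility conditions are only partly enumerated in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    name: str
    healthy: bool
    hosts_ft_vm: bool   # already runs one of the FT control VMs
    free_cpus: int
    free_mem_gb: int


def pick_host(cluster: List[Host], cpus: int, mem_gb: int) -> Optional[Host]:
    """Return the first healthy, unused host with enough CPU and memory."""
    for host in cluster:
        if (host.healthy and not host.hosts_ft_vm
                and host.free_cpus >= cpus and host.free_mem_gb >= mem_gb):
            return host
    return None


def recover_from_pvm_failure(pair: dict, cluster: List[Host],
                             cpus: int = 4, mem_gb: int = 8) -> dict:
    """Promote the SVM to PVM, then try to rebuild a new SVM in the background."""
    recovered = {"pvm": pair["svm"], "svm": None}   # FT switch, no restart
    host = pick_host(cluster, cpus, mem_gb)
    if host is not None:
        recovered["svm"] = "new-svm@" + host.name   # FT pair restored
    return recovered
```

If no host qualifies, the pair keeps running with a PVM only, which corresponds to the case in which 3 nodes continue to work simultaneously.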
In an embodiment of the present application, in step S13, if each pair of FT-protected FT management and control nodes contains one failed FT management and control node, it is determined whether the failed FT management and control node is a primary FT management and control node. If so, the secondary FT management and control node protected by the same FT as that primary FT management and control node is switched to be the primary FT management and control node, and the failed FT management and control node goes offline. The network card of the FT outer layer corresponding to the FT management and control node where the virtual access address is located connects automatically, so that data packets are forwarded to the virtual access address through that network card. The FT protecting the failed FT management and control node searches the cluster where the master management and control node is located for a physical machine that meets the conditions, so that a new secondary FT management and control node is created on that physical machine. Here, when 2 physical nodes have failed (the second fault-tolerance scenario), there are 3 cases. In the first case, each pair of FT-protected virtual machines contains 1 failed FT management and control node: the node that failed during the first fault tolerance is the primary FT management and control node where the VIP is located, and the second fault tolerance occurs when the other primary FT management and control node also fails. As shown in Fig. 4, mn2 and mn4 fail; mn1 is then switched to PVM and the original PVM goes offline. By searching the same cluster for physical machines that meet the conditions, FT is automatically restored to 3 or 4 nodes, or the system degrades to the above case of 1 failed physical node. The service originally deployed on a physical machine is deployed in the virtual machine.
During recovery from these node failures, only the FT service switching time is needed, because FT switching requires no restart. In this scenario 1 more failure can still be tolerated, and the recovery time involves only the FT switching time, which is on the order of seconds.
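The per-pair switching rule described above (promote the secondary when the primary fails, take the failed node offline) can be sketched as below. The `FTPair` class, the `apply_failures` function, and the returned status strings are hypothetical; the initial roles of mn1 to mn4 are also an assumption, chosen only so that both branches of the rule are exercised.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class FTPair:
    """One FT-protected pair: a primary VM (PVM) and a secondary VM (SVM)."""
    pvm: Optional[str]
    svm: Optional[str]


def apply_failures(pair: FTPair, failed: Set[str]) -> str:
    """Update `pair` in place after the nodes in `failed` go offline and
    report what happened to the pair."""
    pvm_ok = pair.pvm is not None and pair.pvm not in failed
    svm_ok = pair.svm is not None and pair.svm not in failed
    if pvm_ok and svm_ok:
        return "unchanged"
    if pvm_ok:
        pair.svm = None                      # secondary offline, primary keeps serving
        return "degraded"
    if svm_ok:
        pair.pvm, pair.svm = pair.svm, None  # FT switch: secondary becomes primary
        return "promoted"
    pair.pvm = pair.svm = None               # both nodes of the pair are gone
    return "pair-lost"


# Assumed roles for illustration: mn2 and mn3 start as primaries.
pair_a = FTPair(pvm="mn2", svm="mn1")
pair_b = FTPair(pvm="mn3", svm="mn4")
failed = {"mn2", "mn4"}
result_a = apply_failures(pair_a, failed)  # primary lost, mn1 is promoted
result_b = apply_failures(pair_b, failed)  # secondary lost, mn3 keeps serving
```

No restart step appears anywhere in the sketch, which mirrors the statement that recovery involves only the FT switching time.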
Continuing the above embodiment, in the second case of 2 failed physical nodes, if the primary FT management and control node and the secondary FT management and control node in the slave management and control node on the application layer are both failed FT management and control nodes, fault tolerance processing of the management and control service system continues through the primary FT management and control node and the secondary FT management and control node in the master management and control node on the application layer. Here, the slave management and control node on the application layer is the node where the VIP is not located. When the physical machines corresponding to the pair of virtual machines without the VIP fail, as shown in Fig. 5 where mn3 and mn4 fail, the VIP is not switched. Because the failed nodes are one FT-protected pair of virtual machines and both nodes have failed, and FT has not recovered by itself, the VIP does not need to be switched; the node where the VIP is located is still protected by FT and can tolerate 1 more failure. Recovery involves neither FT switching nor VIP switching, and the original network connections are not interrupted.
In an embodiment of the present application, in the third case of 2 failed physical nodes, if the primary FT management and control node and the secondary FT management and control node on the master management and control node are both failed FT management and control nodes, the virtual access address is switched to the primary FT management and control node in the slave management and control node, and fault tolerance processing of the management and control service system continues through the primary FT management and control node and the secondary FT management and control node in the slave management and control node. Here, when the pair of virtual machines where the master management and control node is located fails, both the primary and the secondary FT management and control node on the master management and control node have failed. As shown in Fig. 6, mn1 and mn2 fail, and the VIP must be switched. mn3 is the PVM of the other FT-protected pair of virtual machines, so when mn1 and mn2 fail the VIP is switched to the PVM in the slave management and control node, that is, to mn3. Because the failed nodes are one FT-protected pair of virtual machines and both nodes have failed, and FT has not recovered automatically, the other pair of virtual machines is still protected by FT and can tolerate 1 more failure; the recovery time depends only on VIP switching and is on the order of seconds. In summary, across all cases of the second fault tolerance, the worst case involves a single VIP switch, which itself completes within seconds, so the recovery time of the second fault tolerance remains on the order of seconds.
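The VIP placement across the two pairs can be sketched as a small decision function. This is an illustrative reading of the cases above, not the application's implementation: `place_vip` and `surviving_primary` are hypothetical names, each pair is written as a `(PVM, SVM)` tuple, and the roles (mn1 and mn3 as primaries) are assumed.

```python
from typing import Optional, Set, Tuple

Pair = Tuple[str, str]  # (primary VM, secondary VM), roles assumed for illustration


def surviving_primary(pair: Pair, failed: Set[str]) -> Optional[str]:
    """Node that acts as primary after any FT switch inside the pair,
    or None when both nodes of the pair have failed."""
    pvm, svm = pair
    if pvm not in failed:
        return pvm
    if svm not in failed:
        return svm  # the FT switch promotes the secondary
    return None


def place_vip(master: Pair, slave: Pair, failed: Set[str]) -> Tuple[Optional[str], bool]:
    """Return (node holding the VIP, whether an application-layer VIP switch
    from the master pair to the slave pair was needed)."""
    node = surviving_primary(master, failed)
    if node is not None:
        return node, False  # VIP stays inside the master pair
    return surviving_primary(slave, failed), True  # whole master pair lost


master, slave = ("mn1", "mn2"), ("mn3", "mn4")
fig5 = place_vip(master, slave, {"mn3", "mn4"})  # slave pair lost
fig6 = place_vip(master, slave, {"mn1", "mn2"})  # master pair lost
```

Under these assumptions the scenario of Fig. 5 keeps the VIP where it is, and the scenario of Fig. 6 is the only two-failure case that needs a VIP switch, which is consistent with the worst-case remark in the text.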
In an embodiment of the present application, in step S13, if both the primary FT management and control node and the secondary FT management and control node on the master management and control node fail and one FT management and control node in the slave management and control node fails, the virtual access address is switched to the FT management and control node in the slave management and control node that has not failed, and fault tolerance processing of the management and control service system continues through the non-failed FT management and control node where the virtual access address is newly located. Here, the situation considered is that a whole FT pair has failed, so that only the database synchronization protection at the application layer remains. In this situation, 3 failed physical nodes in the system give rise to two cases. In the first case, both the primary and the secondary FT management and control node on the master management and control node fail and one FT management and control node in the slave management and control node fails, that is, the failed nodes include the node where the VIP is located. As shown in Fig. 7, mn1, mn2 and mn4 fail and the VIP must be switched; because only the VIP is switched, the recovery time is on the order of seconds.
Continuing the above embodiment, in the second case of 3 failed physical nodes in the system, if both the primary FT management and control node and the secondary FT management and control node on the slave management and control node fail and one FT management and control node in the master management and control node fails, fault tolerance processing of the management and control service system continues through the remaining non-failed FT management and control node. Here, the nodes where the VIP is not located fail. As shown in Fig. 8, the first failure is mn2, and mn3 and mn4 then fail; no VIP switching is required, and although no further failure can be tolerated, access can still be provided externally.
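The two three-failure cases can be checked by enumerating every way 3 of the 4 nodes can fail. The sketch below is an interpretation under assumptions: the grouping of mn1/mn2 as the master pair and mn3/mn4 as the slave pair, the function name `three_failure_outcome`, and the result keys are all hypothetical.

```python
from itertools import combinations
from typing import Dict, Iterable

MASTER = ("mn1", "mn2")  # assumed: pair that holds the VIP
SLAVE = ("mn3", "mn4")   # assumed: pair without the VIP
NODES = MASTER + SLAVE


def three_failure_outcome(failed: Iterable[str]) -> Dict[str, object]:
    """Describe the system after exactly 3 of the 4 nodes have failed."""
    failed_set = set(failed)
    survivors = [n for n in NODES if n not in failed_set]
    if len(failed_set) != 3 or len(survivors) != 1:
        raise ValueError("expected exactly 3 failed nodes out of the 4 known nodes")
    survivor = survivors[0]
    return {
        "serving_node": survivor,
        # The VIP leaves the master pair only when that whole pair is gone.
        "vip_switch": survivor in SLAVE,
        # A single survivor cannot absorb another failure.
        "can_tolerate_more": False,
    }


if __name__ == "__main__":
    for failed in combinations(NODES, 3):
        print(failed, three_failure_outcome(failed))
```

In this reading, a VIP switch is needed exactly when the single survivor belongs to the slave pair, which covers the Fig. 7 scenario, while the Fig. 8 scenario leaves the VIP in the master pair.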
It should be noted that, under all of the above failure conditions, any node going offline triggers a corresponding warning to the user layer. The user may configure a receiving end and will then receive the warning notification sent by the system whenever a node goes offline. With the design of the present application, by hosting the database-synchronized management and control nodes in FT virtual machines, the cloud management and control nodes can tolerate at least 3 failures and can recover automatically when the conditions are met, while the recovery time of the externally provided service is kept on the order of seconds throughout the fault-tolerance process.
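The "at least 3 times" figure follows from simple counting, shown here as a hedged sketch. It assumes 4 management and control nodes, no automatic recovery between failures, and that a single surviving node is enough to keep serving; `remaining_tolerance` is a hypothetical helper name.

```python
def remaining_tolerance(total_nodes: int, failed_nodes: int) -> int:
    """Number of further node failures the service can absorb, assuming one
    surviving node suffices and no failed node is recreated in the meantime."""
    if not 0 <= failed_nodes <= total_nodes:
        raise ValueError("failed_nodes must be between 0 and total_nodes")
    survivors = total_nodes - failed_nodes
    return max(survivors - 1, 0)


# With 4 nodes: 3 tolerable failures at the start, then 2, 1 and 0.
budget = [remaining_tolerance(4, failed) for failed in range(5)]
```

Automatic recreation of a secondary node, when a qualifying physical machine exists, would raise the count again; the sketch deliberately ignores that to show the lower bound.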
In addition, the embodiment of the application further provides a computer readable medium, on which computer readable instructions are stored, the computer readable instructions being executable by a processor to implement the aforementioned high availability method for managing multiple fault tolerance of a node.
Corresponding to the method described above, the present application further provides a terminal that includes modules or units capable of performing the steps of the method described in Fig. 1 or in the respective embodiments. These modules or units may be implemented in hardware, in software, or in a combination of the two, and the present application is not limited in this respect. For example, in an embodiment of the present application, there is further provided a device for high availability with multiple fault tolerance of management and control nodes, the device including:
one or more processors; and
a memory storing computer readable instructions that, when executed, cause the processor to perform operations of the method as described above.
For example, the computer readable instructions, when executed, cause the one or more processors to:
construct a management and control service system from a master management and control node and a slave management and control node on an application layer, wherein the master management and control node and the slave management and control node each comprise a pair of FT management and control nodes protected by FT, and each pair of FT management and control nodes comprises a primary FT management and control node and a secondary FT management and control node;
determine the failed FT management and control node and the FT management and control node where the virtual access address is located in the management and control service system;
and perform fault tolerance processing of the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located.
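The determining step can be sketched as follows, using the refinement stated in claim 1: failed FT nodes are the virtual machines on failed physical hosts, and the VIP location is derived from the position of the master management and control node. The `FTNode` record, its field values, and `determine_failed_and_vip` are hypothetical, and the sketch assumes the VIP sits on the primary FT node of the master pair before any failure.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class FTNode:
    """Hypothetical record for one FT management and control node (a VM)."""
    name: str
    host: str   # physical machine that runs the VM
    pair: str   # "master" or "slave" at the application layer
    role: str   # "primary" or "secondary" inside the FT pair


def determine_failed_and_vip(
    nodes: Iterable[FTNode], failed_hosts: Set[str]
) -> Tuple[Set[str], Optional[str]]:
    """Return (names of failed FT nodes, name of the FT node holding the VIP)."""
    node_list = list(nodes)
    # A VM is treated as failed when its physical host has failed.
    failed = {n.name for n in node_list if n.host in failed_hosts}
    # Assumption: the VIP sits on the primary FT node of the master pair.
    vip = next(
        (n.name for n in node_list if n.pair == "master" and n.role == "primary"),
        None,
    )
    return failed, vip


nodes = [
    FTNode("mn1", "host1", "master", "primary"),
    FTNode("mn2", "host2", "master", "secondary"),
    FTNode("mn3", "host3", "slave", "primary"),
    FTNode("mn4", "host4", "slave", "secondary"),
]
failed_nodes, vip_node = determine_failed_and_vip(nodes, {"host2", "host4"})
```

The two returned values are exactly the inputs that the fault tolerance processing step is described as using.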
Fig. 9 is a schematic structural diagram of a device for high availability with multiple fault tolerance of management and control nodes according to another aspect of the present application. The device includes a constructing device 11, a determining device 12 and a fault-tolerant processing device 13. The constructing device 11 is configured to construct a management and control service system from a master management and control node and a slave management and control node on an application layer, wherein the master management and control node and the slave management and control node each comprise a pair of FT management and control nodes protected by FT, and each pair of FT management and control nodes comprises a primary FT management and control node and a secondary FT management and control node. The determining device 12 is configured to determine the failed FT management and control node and the FT management and control node where the virtual access address is located in the management and control service system. The fault-tolerant processing device 13 is configured to perform fault tolerance processing of the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located.
It should be noted that the operations performed by the constructing device 11, the determining device 12 and the fault-tolerant processing device 13 are the same as or correspond to those in steps S11, S12 and S13, and are not described again here for brevity.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, using Application Specific Integrated Circuits (ASIC), a general purpose computer or any other similar hardware device. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions as described above. Likewise, the software programs of the present application (including associated data structures) may be stored on a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. In addition, some steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
Furthermore, portions of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application by way of operation of the computer. Program instructions for invoking the methods of the present application may be stored in fixed or removable recording media and/or transmitted via a data stream in a broadcast or other signal bearing medium and/or stored within a working memory of a computer device operating according to the program instructions. An embodiment according to the present application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to operate a method and/or a solution according to the embodiments of the present application as described above.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the apparatus claims can also be implemented by means of one unit or means in software or hardware. The terms first, second, etc. are used to denote a name, but not any particular order.

Claims (11)

1. A high availability method of managing multiple fault tolerance of a node, the method comprising:
constructing a management and control service system according to a master management and control node and a slave management and control node on an application layer, wherein the master management and control node and the slave management and control node each comprise a pair of FT management and control nodes protected by FT, and each pair of FT management and control nodes comprises a primary FT management and control node and a secondary FT management and control node, wherein FT is fault tolerance;
locating a failed physical host in the management and control service system, and determining a virtual machine on the failed physical host as a failed FT management and control node;
determining the position of the master management and control node on the application layer, and determining the FT management and control node where a virtual access address is located in the management and control service system according to the position of the master management and control node;
and carrying out fault tolerance processing of the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located.
2. The method of claim 1, wherein the primary FT management node and secondary FT management node contain the same data content and the corresponding databases are encapsulated in respective corresponding virtual machines.
3. The method of claim 1, wherein performing fault tolerant processing of the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located, comprises:
if the primary FT management and control node in the master management and control node on the application layer is a failed FT management and control node, switching the secondary FT management and control node protected by the same FT as the primary FT management and control node to be the primary FT management and control node, and at the same time taking the failed FT management and control node offline;
automatically connecting through the network card of the FT outer layer protecting the failed FT management and control node, so as to forward data packets to the virtual access address through the network card;
searching, by the FT protecting the failed FT management and control node, for a physical machine meeting the conditions in the cluster where the master management and control node is located, so as to create a new secondary FT management and control node on the physical machine meeting the conditions.
4. The method of claim 1, wherein performing fault tolerant processing of the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located, comprises:
if one failed FT management and control node exists in each pair of FT management and control nodes protected by FT, determining whether the failed FT management and control node is a primary FT management and control node, and if so, switching the secondary FT management and control node protected by the same FT as the primary FT management and control node to be the primary FT management and control node, and at the same time taking the failed FT management and control node offline;
automatically connecting through the network card of the FT outer layer corresponding to the FT management and control node where the virtual access address is located, so as to forward data packets to the virtual access address through the network card;
searching, by the FT protecting the failed FT management and control node, for a physical machine meeting the conditions in the cluster where the master management and control node is located, so as to create a new secondary FT management and control node on the physical machine meeting the conditions.
5. The method of claim 1, wherein performing fault tolerant processing of the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located, comprises:
if the primary FT management and control node and the secondary FT management and control node in the slave management and control node on the application layer are both failed FT management and control nodes, continuing fault tolerance processing of the management and control service system through the primary FT management and control node and the secondary FT management and control node in the master management and control node on the application layer.
6. The method of claim 1, wherein performing fault tolerant processing of the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located, comprises:
if the primary FT management and control node and the secondary FT management and control node on the master management and control node are both failed FT management and control nodes, switching the virtual access address to the primary FT management and control node in the slave management and control node;
and continuing fault tolerance processing of the management and control service system through the primary FT management and control node and the secondary FT management and control node in the slave management and control node.
7. The method of claim 1, wherein performing fault tolerant processing of the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located, comprises:
if both the primary FT management and control node and the secondary FT management and control node on the master management and control node fail and one FT management and control node in the slave management and control node fails, switching the virtual access address to the non-failed FT management and control node in the slave management and control node, and continuing fault tolerance processing of the management and control service system through the non-failed FT management and control node where the virtual access address is newly located.
8. The method of claim 1, wherein performing fault tolerant processing of the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located, comprises:
if both the primary FT management and control node and the secondary FT management and control node on the slave management and control node fail and one FT management and control node in the master management and control node fails, continuing fault tolerance processing of the management and control service system through the remaining non-failed FT management and control node.
9. A highly available device for managing multiple fault tolerance of a node, the device comprising:
a constructing device, configured to construct a management and control service system according to a master management and control node and a slave management and control node on an application layer, wherein the master management and control node and the slave management and control node each comprise a pair of FT management and control nodes protected by FT, each pair of FT management and control nodes comprises a primary FT management and control node and a secondary FT management and control node, and FT is fault tolerance;
a determining device, configured to locate a failed physical host in the management and control service system, determine a virtual machine on the failed physical host as a failed FT management and control node, determine the position of the master management and control node on the application layer, and determine the FT management and control node where a virtual access address is located in the management and control service system according to the position of the master management and control node;
and the fault-tolerant processing device is used for carrying out fault-tolerant processing on the management and control service system according to the failed FT management and control node and the FT management and control node where the virtual access address is located.
10. A highly available device for managing multiple fault tolerance of a node, the device comprising:
one or more processors; and
a memory storing computer readable instructions that, when executed, cause the processor to perform the operations of the method of any one of claims 1 to 8.
11. A computer readable medium having stored thereon computer readable instructions executable by a processor to implement the method of any of claims 1 to 8.
CN202010277503.9A 2020-04-08 2020-04-08 High availability method and equipment for managing and controlling multiple fault tolerance of nodes Active CN111488247B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010277503.9A CN111488247B (en) 2020-04-08 2020-04-08 High availability method and equipment for managing and controlling multiple fault tolerance of nodes

Publications (2)

Publication Number Publication Date
CN111488247A CN111488247A (en) 2020-08-04
CN111488247B true CN111488247B (en) 2023-07-25

Family

ID=71797869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010277503.9A Active CN111488247B (en) 2020-04-08 2020-04-08 High availability method and equipment for managing and controlling multiple fault tolerance of nodes

Country Status (1)

Country Link
CN (1) CN111488247B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113157393A (en) * 2021-04-09 2021-07-23 上海云轴信息科技有限公司 Method and device for managing high availability of nodes
CN113595899A (en) * 2021-06-30 2021-11-02 上海云轴信息科技有限公司 Method and system for realizing multi-node point cloud routing

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101859263A (en) * 2010-06-12 2010-10-13 中国人民解放军国防科学技术大学 Quick communication method between virtual machines supporting online migration
CN103778031A (en) * 2014-01-15 2014-05-07 华中科技大学 Distributed system multilevel fault tolerance method under cloud environment
CN104360943A (en) * 2014-11-11 2015-02-18 浪潮电子信息产业股份有限公司 Resource guarantee model of service-oriented architecture
CN104536842A (en) * 2014-12-17 2015-04-22 中电科华云信息技术有限公司 Virtual machine fault-tolerant method based on KVM virtualization
CN105743995A (en) * 2016-04-05 2016-07-06 北京轻元科技有限公司 Transplantable high-available container cluster deploying and managing system and method
CN107992351A (en) * 2016-10-26 2018-05-04 阿里巴巴集团控股有限公司 A kind of hardware resource distribution method and device, electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10558536B2 (en) * 2017-03-23 2020-02-11 Dh2I Company Highly available stateful containers in a cluster environment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
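Fault-tolerance technology for spaceborne parallel computers based on a distributed architecture; Wang Weicheng; Luo Yu; Computer Engineering & Science (No. 03); full text *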

Also Published As

Publication number Publication date
CN111488247A (en) 2020-08-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant