CN116668269A - Arbitration method, device and system for dual-activity data center - Google Patents


Info

Publication number
CN116668269A
CN116668269A
Authority
CN
China
Prior art keywords
data center
arbitration
node
component
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210157330.6A
Other languages
Chinese (zh)
Inventor
张晓磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Cloud Computing Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Cloud Computing Technologies Co Ltd filed Critical Huawei Cloud Computing Technologies Co Ltd
Priority to CN202210157330.6A
Publication of CN116668269A
Legal status: Pending


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/06 Management of faults, events, alarms or notifications
    • H04L 41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L 41/0663 Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network

Abstract

Embodiments of this application disclose an arbitration method, device and system for a dual-active data center, relating to the technical field of cloud computing. The dual-active data center includes a first data center and a second data center, and the method comprises: determining a fault object in the first data center when a fault event occurs; when the fault object includes the master node of the first data center, selecting a target node from a plurality of candidate nodes to provide service in place of the master node, where the plurality of candidate nodes include candidate nodes of the first data center and/or candidate nodes of the second data center; and when the fault object does not include the master node of the first data center, letting the first data center continue to provide service. The method helps provide an efficient and reliable arbitration mechanism in an active-active disaster recovery scenario.

Description

Arbitration method, device and system for a dual-active data center
Technical Field
The present application relates to the field of cloud computing technologies, and in particular, to an arbitration method, device and system for a dual-active data center.
Background
A data center provides resources for software applications, such as memory, processors and network bandwidth. For disaster recovery, at least two data centers are typically built: some of them carry the user's traffic while the others back up data, configuration, services, and so on. A dual-active data center means that two data centers carry traffic at the same time and back each other up, so as to improve the overall service capacity and system resource utilization of the two data centers.
The two data centers in a dual-active pair send heartbeat packets to each other at a set interval; if no heartbeat packet is received from the peer within the set time, backup is interrupted. At this point, if both data centers continue to carry traffic, data inconsistency occurs. An arbitration (ARB) mechanism is one of the current means of avoiding data inconsistency, and it works as follows: the two data centers each send an arbitration request to an arbitration device that is independent of both data centers; the arbitration device determines the winning data center according to the arbitration requests; the winning data center continues to provide service (i.e., carry traffic), and the losing data center stops providing service.
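The heartbeat-plus-arbitration flow described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the class names and the timeout value are assumptions:

```python
import time

HEARTBEAT_TIMEOUT = 3.0  # seconds without a peer heartbeat before arbitration (assumed value)

class Arbiter:
    """Third-party arbitration device: grants the service to exactly one data center."""
    def __init__(self):
        self.winner = None

    def request(self, dc_name):
        # First requester wins; every later requester loses and must stop serving,
        # which is what prevents both sides carrying traffic during a partition.
        if self.winner is None:
            self.winner = dc_name
        return self.winner == dc_name

class DataCenter:
    def __init__(self, name, arbiter):
        self.name = name
        self.arbiter = arbiter
        self.last_peer_heartbeat = time.monotonic()
        self.serving = True

    def on_heartbeat(self):
        self.last_peer_heartbeat = time.monotonic()

    def check_peer(self):
        # If the peer's heartbeats stopped, ask the arbiter rather than assuming
        # the peer is down: the fault may only be a broken inter-DC link.
        if time.monotonic() - self.last_peer_heartbeat > HEARTBEAT_TIMEOUT:
            self.serving = self.arbiter.request(self.name)
```

In this sketch, whichever data center reaches the arbiter first keeps serving; the other stops, so the pair never serves conflicting writes.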
With the development of cloud computing and big data, more and more enterprise users process applications, data and systems centrally, and this centralization of data also brings risks. As the scale of cloud resources keeps expanding, the traffic carried keeps increasing and reliability requirements grow ever higher, so how to arbitrate efficiently and reliably in order to guarantee high reliability remains an important problem to be solved.
Disclosure of Invention
Embodiments of this application provide an arbitration method, an arbitration device and an arbitration system for a dual-active data center, which help provide an efficient and reliable arbitration mechanism in an active-active disaster recovery scenario.
In a first aspect, an embodiment of this application provides an arbitration method for a dual-active data center, where the dual-active data center may include a first data center and a second data center, and the arbitration method may be implemented by an arbitration device. The method may include: determining a fault object in the first data center when a fault event occurs; when the fault object includes the master node of the first data center, selecting a target node from a plurality of candidate nodes to provide service in place of the failed master node, where the plurality of candidate nodes include candidate nodes of the first data center and/or candidate nodes of the second data center; and when the fault object does not include the master node of the first data center, letting the first data center continue to provide service.
With this method, when a fault event occurs the arbitration device can use a standby node to provide service in place of the failed node, ensuring service continuity through failover.
It should be noted that, in embodiments of this application, possible forms of the fault object may include, but are not limited to, a data center, a node in a data center, or a component deployed on a node of a data center. Accordingly, the candidate object that replaces a failed object to provide service when a fault event occurs may take the same form as the failed object, including but not limited to a data center, a node in a data center, or a component deployed on a node of a data center. In embodiments of this application, arbitration management can be performed according to the specific form of the fault object and the specific fault situation involved in the arbitration process.
For example, at node granularity within a data center, the first data center may include a master node and a standby node associated with the same service, and so may the second data center. When a fault event occurs, the arbitration method may be, as described above, to select a target node from a plurality of candidate nodes to provide service in place of the failed master node, where the plurality of candidate nodes include candidate nodes of the first data center and/or candidate nodes of the second data center; when the fault object does not include a master node of the first data center, the node-replacement step above need not be performed.
For another example, at data center granularity, the first data center may be the primary data center associated with a service and the second data center the standby data center. When a fault event occurs, the arbitration method may include: when the fault object includes the first data center, providing service with the second data center in place of the first data center; and when the fault object does not include the first data center, letting the first data center continue to provide service. Similarly, when the second data center is the primary data center, the first data center can provide service in its place when a fault event occurs in the second data center.
For another example, at the granularity of components deployed on nodes, a component in the primary state (such as the active state) and components in the standby state of the same service may be deployed on the same node or on different nodes (nodes of the same data center or of different data centers). When a fault event occurs, the arbitration method may include: when the fault object includes a component of the first or second data center that is in the primary state, selecting a target component from a plurality of standby components of the failed component to provide service in its place, where the plurality of standby components may include standby components of the first data center and/or standby components of the second data center; and when the fault object does not include a component in the primary state, letting the component in the primary state continue to provide service.
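The three granularity examples above apply the same decision rule to different kinds of fault object. A minimal sketch of that shared rule (the type names and return values are assumptions for illustration, not the patent's API):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FaultObject:
    granularity: str   # "data_center", "node" or "component"
    name: str
    is_primary: bool   # whether the failed object currently holds the active role

def arbitrate(fault: Optional[FaultObject], candidates: List[str]) -> str:
    """Decide who serves after a fault event; the same rule applies at every granularity."""
    if fault is None or not fault.is_primary:
        return "keep-current"   # the active object is unaffected: no switchover needed
    if not candidates:
        return "alert-tenant"   # nothing to fail over to: raise an alert to the tenant
    return candidates[0]        # promote the first (highest-priority) candidate
```

Whether the fault object is a data center, a node or a component only changes what populates `candidates`; the arbitration decision itself is identical.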
In embodiments of this application, the tenant can autonomously select and configure the detection granularity of fault events, and the arbitration mechanism corresponding to fault objects at different granularities, according to the tenant's own application or service scenario, so that reliable and efficient management of the dual-active data center can be implemented at different levels, improving the overall reliability of the arbitration system.
With reference to the first aspect, in an optional implementation manner, the determining, in the first data center, a fault object when a fault event occurs includes: when the failure event occurs, the failure object is determined in the first data center, the second data center, and an arbitration device.
With reference to the first aspect, in an optional implementation manner, each candidate node among the plurality of candidate nodes meets at least one of the following conditions: the data synchronization rate of the candidate node is greater than or equal to a first threshold, where the data synchronization rate characterizes the data synchronization state between the candidate node and the corresponding master node; or the load of the candidate node is less than or equal to a second threshold.
With reference to the first aspect, in an optional implementation manner, the load of the candidate node includes at least one of: CPU load, memory load, disk read/write load, network delay or packet loss rate.
With reference to the first aspect, in an optional implementation manner, the priority of the candidate node of the first data center is higher than the priority of the candidate node of the second data center.
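Putting the two preceding conditions and the data-center priority together, target-node selection could be sketched as follows. The threshold values and the candidate-record fields are illustrative assumptions; the patent only states the conditions, not concrete values:

```python
SYNC_RATE_MIN = 0.99   # "first threshold" (assumed value)
LOAD_MAX = 0.80        # "second threshold" (assumed value)

def select_target(candidates, local_dc):
    """Pick a replacement node: keep candidates meeting at least one of the two
    conditions, prefer the failed node's own data center, then the most
    up-to-date replica."""
    eligible = [n for n in candidates
                if n["sync_rate"] >= SYNC_RATE_MIN or n["load"] <= LOAD_MAX]
    if not eligible:
        return None   # no candidate available: the caller alerts the tenant
    # Local-DC candidates sort first (False < True), then by highest sync rate.
    eligible.sort(key=lambda n: (n["dc"] != local_dc, -n["sync_rate"]))
    return eligible[0]["name"]
```

Preferring the first data center's own candidates avoids a cross-site switchover (and the extra latency it brings) whenever a local node is fit to take over.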
With reference to the first aspect, in an optional implementation manner, the method further includes: when no candidate node is available, sending first alert information to the tenant and letting the first data center continue to provide service; and when the fault object further includes the arbitration device, sending second alert information to the tenant.
With reference to the first aspect, in an optional implementation manner, the method further includes: receiving registration information from a user equipment (UE) of a tenant, where the registration information indicates components to be deployed to the first data center and/or the second data center; and registering the components with the arbitration device according to the registration information, and deploying the components on the corresponding nodes of the first data center and/or the second data center.
With reference to the first aspect, in an optional implementation manner, the registration information includes at least one of the following information about the component: type, service name, component name, parent node, associated management account information, associated script information, identifier of the node to be deployed to, communication address, and the active/standby state of the site or component involved.
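As an illustrative sketch of what such registration information might look like in practice (all field names and values below are hypothetical, modelled on the list above, not defined by the patent):

```python
# Hypothetical registration payload for one component.
registration = {
    "type": "database",
    "service_name": "order-service",
    "component_name": "mysql-primary",
    "parent_node": "node-az1-01",
    "management_account": "ops-admin",
    "scripts": {"start": "start.sh", "stop": "stop.sh", "switch": "switch.sh"},
    "deploy_node_id": "node-az1-01",       # identifier of the node to deploy to
    "address": "10.0.1.15:3306",           # communication address
    "role": "active",                      # active/standby state of the component
}

def validate_registration(reg: dict) -> bool:
    """Check that the fields the arbitration device needs are present and sane."""
    required = {"type", "service_name", "component_name", "deploy_node_id", "role"}
    return required.issubset(reg) and reg["role"] in ("active", "standby")
```

A validation step like this lets the arbitration device reject malformed registrations before attempting deployment to a node.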
With reference to the first aspect, in an optional implementation manner, the method further includes: monitoring data from a client is received, wherein the monitoring data is used for indicating whether the fault event occurs to a component deployed at a node to which the client belongs.
With reference to the first aspect, in an optional implementation manner, the method further includes: receiving a query request from a User Equipment (UE) of a tenant, wherein the query request is used for indicating a target component to be queried; and feeding back the state information and/or the monitoring index data of the target component to the UE according to the query request, wherein the state information and/or the monitoring index data of the target component are used for indicating whether the fault event occurs.
With reference to the first aspect, in an optional implementation manner, the method further includes: receiving a change request from a UE of a tenant, the change request being for indicating a change in state of a target component deployed at the first data center and/or the second data center; and sending a change instruction to the client according to the change request, wherein the change instruction is used for indicating to execute state change operation on the target component.
With reference to the first aspect, in an optional implementation manner, the state change operation includes at least one of: starting, stopping and switching.
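As an illustrative sketch (not the patent's implementation; the state representation is an assumption), a change request carrying one of the three operations could be handled as:

```python
VALID_OPS = {"start", "stop", "switch"}

def handle_change_request(component_state: dict, op: str) -> dict:
    """Translate a tenant's change request into the component's new state."""
    if op not in VALID_OPS:
        raise ValueError(f"unsupported state-change operation: {op}")
    state = dict(component_state)  # do not mutate the caller's record
    if op == "start":
        state["running"] = True
    elif op == "stop":
        state["running"] = False
    else:  # "switch": swap the active/standby role
        state["role"] = "standby" if state["role"] == "active" else "active"
    return state
```

In the patent's flow, the arbitration device would turn the resulting state into a change instruction sent to the client on the target node.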
With reference to the first aspect, in an optional implementation manner, the arbitration device is a cloud server.
In a second aspect, an embodiment of this application provides an arbitration method for a dual-active data center, where the method may be implemented by a user device, and the user device may be any electronic device on the user side, including but not limited to a smartphone, a laptop computer, a desktop computer, and so on. The method may include: displaying an arbitration management interface; acquiring the tenant's registration information through the arbitration management interface, where the registration information indicates components to be deployed to the first data center and/or the second data center; and sending the registration information to the arbitration device.
With reference to the second aspect, in an optional implementation manner, the registration information includes at least one of the following information about the component: type, service name, component name, parent node, associated management account information, associated script information, identifier of the node to be deployed to, communication address, and the active/standby state of the site or component involved.
With reference to the second aspect, in an optional implementation manner, the arbitration method further includes: acquiring a query request of the tenant through the arbitration management interface, wherein the query request is used for indicating a target component to be queried; sending the query request to the arbitration device; receiving status information and/or monitoring indicator data from the arbitration device in response to the query request; and displaying the state information and/or the monitoring index data.
With reference to the second aspect, in an optional implementation manner, the arbitration method further includes: acquiring a change request of the tenant through the arbitration management interface, wherein the change request is used for indicating state change of a target component deployed in the first data center and/or the second data center; and sending the change request to the arbitration device so that the arbitration device performs a state change operation on the target component.
With reference to the second aspect, in an optional implementation manner, the state change operation includes at least one of: starting, stopping and switching.
With reference to the second aspect, in an optional implementation manner, the arbitration device is a cloud server.
In a third aspect, an embodiment of this application provides an arbitration device for a dual-active data center, where the dual-active data center includes a first data center and a second data center, and the arbitration device includes: a determining unit, configured to determine a fault object in the first data center when a fault event occurs; and an arbitration unit, configured to: when the fault object includes the master node of the first data center, select a target node from a plurality of candidate nodes to provide service in place of the master node, where the plurality of candidate nodes include candidate nodes of the first data center and/or candidate nodes of the second data center; and when the fault object does not include the master node of the first data center, let the first data center continue to provide service.
With reference to the third aspect, in an optional implementation manner, the determining unit is configured to: when the failure event occurs, the failure object is determined in the first data center, the second data center, and an arbitration device.
With reference to the third aspect, in an optional implementation manner, each candidate node among the plurality of candidate nodes meets at least one of the following conditions: the data synchronization rate of the candidate node is greater than or equal to a first threshold, where the data synchronization rate characterizes the data synchronization state between the candidate node and the corresponding master node; or the load of the candidate node is less than or equal to a second threshold.
With reference to the third aspect, in an optional implementation manner, the load of the candidate node includes at least one of: CPU load, memory load, disk read/write load, network delay or packet loss rate.
With reference to the third aspect, in an optional implementation manner, the priority of the candidate node of the first data center is higher than the priority of the candidate node of the second data center.
With reference to the third aspect, in an optional implementation manner, the arbitration unit is further configured to: when no candidate node is currently available, send first alert information to the tenant and let the first data center continue to provide service; and when the fault object further includes the arbitration device, send second alert information to the tenant.
With reference to the third aspect, in an optional implementation manner, the apparatus further includes: a communication unit for receiving registration information of user equipment UE from a tenant, the registration information being for indicating components to be deployed to the first data center and/or the second data center; and the deployment unit is used for registering the component in the arbitration equipment according to the registration information and deploying the component in the corresponding node of the first data center and/or the corresponding node of the second data center.
With reference to the third aspect, in an optional implementation manner, the registration information includes at least one of the following information about the component: type, service name, component name, parent node, associated management account information, associated script information, identifier of the node to be deployed to, communication address, and the active/standby state of the site or component involved.
With reference to the third aspect, in an optional implementation manner, the apparatus further includes: and the communication unit is used for receiving monitoring data from the client, wherein the monitoring data is used for indicating whether the fault event occurs to a component deployed at a node to which the client belongs.
With reference to the third aspect, in an optional implementation manner, the arbitration unit is configured to: receiving a query request from a User Equipment (UE) of a tenant through the communication unit, wherein the query request is used for indicating a target component to be queried; and according to the query request, feeding back the state information and/or monitoring index data of the target component to the UE through the communication unit, wherein the state information and/or monitoring index data of the target component are used for indicating whether the fault event occurs or not.
With reference to the third aspect, in an optional implementation manner, the arbitration unit is configured to: receiving, by a communication unit, a change request from a UE of a tenant, the change request being for indicating a change in state of a target component deployed at the first data center and/or the second data center; and sending a change instruction to the client through the communication unit according to the change request, wherein the change instruction is used for indicating to execute state change operation on the target component.
With reference to the third aspect, in an optional implementation manner, the state change operation includes at least one of: starting, stopping and switching.
With reference to the third aspect, in an optional implementation manner, the arbitration device is a cloud server.
In a fourth aspect, an embodiment of the present application provides an arbitration device for a dual active data center, the arbitration device including:
the display unit is used for displaying the arbitration management interface;
the acquiring unit is used for acquiring registration information of the tenant through the arbitration management interface, wherein the registration information is used for indicating components which need to be deployed to the first data center and/or the second data center;
and the communication unit is used for sending the registration information to the arbitration device.
With reference to the fourth aspect, in an optional implementation manner, the registration information includes at least one of the following information about the component: type, service name, component name, parent node, associated management account information, associated script information, identifier of the node to be deployed to, communication address, and the active/standby state of the site or component involved.
With reference to the fourth aspect, in an optional implementation manner, the acquiring unit is further configured to: acquiring a query request of the tenant through the arbitration management interface, wherein the query request is used for indicating a target component to be queried; the communication unit is further configured to send the query request to the arbitration device; receiving status information and/or monitoring indicator data from the arbitration device in response to the query request; the display unit is also used for displaying the state information and/or the monitoring index data.
With reference to the fourth aspect, in an optional implementation manner, the acquiring unit is further configured to: acquiring a change request of the tenant through the arbitration management interface, wherein the change request is used for indicating state change of a target component deployed in the first data center and/or the second data center; the communication unit is further configured to send the change request to the arbitration device, so that the arbitration device performs a state change operation on the target component.
With reference to the fourth aspect, in an optional implementation manner, the state change operation includes at least one of: starting, stopping and switching.
With reference to the fourth aspect, in an optional implementation manner, the arbitration device is a cloud server.
In a fifth aspect, an embodiment of this application provides an arbitration system for a dual-active data center, where the arbitration system includes an arbitration device and a user device; the arbitration device is configured to implement the arbitration method of the first aspect or any possible design of the first aspect, and the user device is configured to implement the arbitration method of the second aspect or any possible design of the second aspect.
In a sixth aspect, embodiments of this application provide a computer-readable medium storing a computer program, where the computer program comprises instructions for executing the arbitration method of the first aspect or any possible design of the first aspect, or instructions for executing the arbitration method of the second aspect or any possible design of the second aspect.
Based on the implementations provided in the above aspects, further combinations may be made in this application to provide further implementations.
Drawings
Fig. 1 is a schematic diagram illustrating an application scenario of an arbitration method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an arbitration system according to an embodiment of the present application;
FIG. 3 is a flow chart of an arbitration method according to an embodiment of the present application;
FIGS. 4a-4d are schematic diagrams illustrating a console management interface according to embodiments of the present application;
FIG. 5 shows a schematic diagram of various fault scenarios of an embodiment of the present application;
FIG. 6 shows a method flow diagram of an embodiment of the present application;
FIG. 7 shows a schematic diagram of a communication device according to an embodiment of the application;
fig. 8 shows a schematic diagram of a communication device according to an embodiment of the application;
fig. 9 shows a schematic diagram of a communication device according to an embodiment of the application.
Detailed Description
The following first explains some terms related to the embodiments of the present application.
1. Dual-active data center:
A data center is a facility that uses complex network, computing and storage systems to provide shared access to applications and data; it can be pictured as a "building" that houses information technology (IT) equipment for communication and data storage.
For disaster recovery, users typically build two (or more) data centers: a primary data center that carries the user's traffic, and a standby data center that backs up the primary data center's data, configuration, traffic, and so on. Three backup modes are generally used between the primary and standby data centers: hot standby, cold standby and dual-active. Specifically:
(1) Hot standby: only when the main data center bears the business of the user, the standby data center backs up the main data center in real time. The backup data center may automatically take over the service of the primary data center and not interrupt the service of the user, thus making the switching of the data center imperceptible.
(2) Cold backup: also, the primary data center is responsible for the business only, while the backup data center will not perform real-time backup of the primary data center. In this case, it may be that the backup is performed periodically or not at all. If the main data center fails, the user service will be interrupted, and at this time, the user needs to manually switch to the standby data center.
(3) Double living: in order to avoid resource waste of the standby data centers, the main data center and the standby data center are simultaneously charged with the service of the user, and at the moment, the main data center and the standby data center are mutually backed up and are backed up in real time. In general, the load of the primary data center may be more, such as sharing 60-70% of the traffic, and the backup data center only sharing 40% -30% of the traffic. That is, the main data center and the standby data center all bear business and are mutually backed up and synchronized in real time.
Among other things, dual-active data centers have two advantages. First, resources are fully utilized: the waste of keeping one data center idle all year round is avoided, and through resource integration the service capacity of the dual-active pair is doubled. Second, if one data center goes offline, the other is still running, imperceptibly to the user. On the basis of dual-active data centers, a high-reliability scheme is generally adopted and disaster-recovery deployment is performed for the key components of the system, achieving a dual-active-site disaster recovery effect.
2. Region and Availability Zone (AZ):
Regions and AZs describe the location of a data center; a user can create resources in a particular region or AZ, which may also be referred to as a site.
Wherein (1) region: from the geographic location and the network latency dimension, public services such as Elastic computing, block storage, object storage, virtual private cloud (Virtual Private Cloud, VPC) networks, elastic public IP (EIP), mirroring, etc. are shared within the same Region. Regions are divided into general regions and exclusive regions, and general regions refer to regions for providing general cloud services for public tenants; dedicated Region refers to a dedicated Region that only carries the same class of traffic or provides traffic services only for a specific tenant. In general, the same area may deploy one or more hosts (i.e., nodes) for implementing at least one service (or referred to as a business).
(2) AZ: an AZ is a collection of one or more physical data centers, and has independent wind, fire, water and electricity, and resources such as calculation, network, storage and the like are logically divided into a plurality of clusters in the AZ. Multiple AZ in a Region are connected through high-speed optical fibers so as to meet the requirement of a user for building a high-performance system across AZ. In general, the same AZ may deploy one or more hosts (i.e., nodes) for implementing at least one service (or referred to as a business).
3. Component:
A component is a simple encapsulation of data and methods; in short, a component is an object. A component may have its own properties and methods, where a property is a simple accessor of the component's data and a method is some simple, visible function of the component. Components deployed in a data center may include, for example, servers and racks, energy systems, network connections, security systems, automated management tools, cooling systems and compliance management policies, all of which help maintain the data center's efficiency. In embodiments of this application, components can be deployed on nodes of a data center to implement the corresponding services.
4. Active-standby mode and active-standby state:
For components, active-standby mode means that one component is in the Active state for a given service while another component is in the Standby state for that service; the Active-Standby state is abbreviated as the active-standby state. When the component in the Active state of a service fails, the component in the Standby state of that service can take over to provide the service, ensuring service continuity through failover (failure-point transfer).
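The failure-point transfer described above can be sketched as a minimal state machine. This is only an illustration; the class names and the health flag are invented for the sketch and are not part of the patent's implementation:

```python
from enum import Enum

class Role(Enum):
    ACTIVE = "active"    # currently serving the service
    STANDBY = "standby"  # ready to take over

class Component:
    def __init__(self, name: str, role: Role, healthy: bool = True):
        self.name = name
        self.role = role
        self.healthy = healthy

def failover(active: Component, standby: Component) -> Component:
    """If the Active component has failed and the Standby component is
    healthy, swap their roles (failure-point transfer) and return the
    component now providing the service."""
    if not active.healthy and standby.healthy:
        active.role = Role.STANDBY
        standby.role = Role.ACTIVE
        return standby
    return active
```

The swap is symmetric by design: the failed component is demoted rather than removed, so it can rejoin as a Standby after recovery.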
In the embodiments of this application, the component assigned the Active state may also be referred to as the primary component; accordingly, the AZ to which the Active state belongs is called the priority AZ, and the AZ to which the Standby state belongs is called the standby AZ.
The embodiments of this application provide an arbitration method, apparatus, and system for a dual-active data center. For the various failure events in dual-active (or multi-active) disaster recovery scenarios, they provide a common active/standby arbitration capability for different failure objects, so that the arbitration system delivers highly reliable dual-active disaster recovery. A component to be deployed can obtain this capability simply by connecting to the arbitration system, writing only a small number of script plug-ins against the defined docking interface. The arbitration system also integrates visual display of component deployment state, component lifecycle management, and monitoring and alarming on key node OS metrics and key component metrics, providing one-stop management capability. The method and the apparatus are based on the same technical concept; because they solve the problem on similar principles, their implementations can refer to each other, and repeated parts are not described again.
It should be noted that the embodiments of this application describe the arbitration scheme using a dual-active data center only as an example and do not limit its application scenario; in other embodiments, the scheme may also be applied to a multi-active data center, which is not limited here. In the embodiments of this application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates three possible relationships; for example, A and/or B may indicate: A alone, both A and B, or B alone, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the adjacent objects. "At least one of" the following items means any combination of those items, including any combination of single or plural items. For example, at least one of a, b, or c may represent: a; b; c; a and b; a and c; b and c; or a, b, and c, where a, b, and c may each be singular or plural.
Unless otherwise specified, ordinal terms such as "first" and "second" in the embodiments of this application are used to distinguish multiple objects and do not limit the priority or importance of those objects. For example, the first data center, the second data center, and the third data center merely distinguish different data centers and do not indicate differences in their priority or importance.
The present application will be described in detail with reference to the accompanying drawings and examples.
Fig. 1 shows a schematic diagram of an application scenario of an arbitration method according to an embodiment of the present application.
As shown in fig. 1, the application scenario may include a first data center 11 and a second data center 12, which are dual-active data centers: they provide services simultaneously and back each other up. The first data center 11 and the second data center 12 may be connected by optical fiber or network cable, over which they back up data and send heartbeat packets to each other at a set interval (for example, 1 second) to determine whether the connection between them has been broken.
It will be appreciated that, in practice, data backup between the first data center 11 and the second data center 12 may use synchronous replication. To realize synchronous replication, backup data can be transmitted over a high-speed link such as optical fiber, and the distance between the two data centers can be kept within a set distance (for example, 100 km), for instance by deploying both in the same city.
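The heartbeat-based link check can be illustrated with a small sketch. The 1-second interval comes from the example above; the three-missed-heartbeats rule is an assumption made for this illustration, not a value stated in the text:

```python
def link_down(last_heartbeat: float, now: float,
              interval: float = 1.0, max_missed: int = 3) -> bool:
    """Declare the inter-center link broken when more than `max_missed`
    heartbeat intervals have elapsed since the last heartbeat was
    received over the fiber/cable connection."""
    return (now - last_heartbeat) > interval * max_missed
```

A real deployment would typically combine this timeout with an arbitration check before acting, so that a slow link is not mistaken for a failed peer.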
In an alternative implementation, the application scenario shown in fig. 1 may also include an arbitration device 13, a device configured independently of the first data center 11 and the second data center 12 to implement the arbitration mechanism of the embodiments of this application. The arbitration device 13 may be connected to the first data center 11 and the second data center 12 respectively, and the arbitration method provided by the embodiments of this application may run on the arbitration device 13 to perform arbitration management on failure objects in the first data center 11 and/or the second data center 12.
Specifically, the first data center 11 and the second data center 12 may each include a storage layer, an application layer, and a network layer, and the nodes of a given cluster (such as a database (DB) cluster at the application layer) may be distributed across the same layer (storage, application, or network layer) of both data centers.
It can be understood that, in the embodiments of this application, the arbitration device 13 may be a device specifically configured to implement the arbitration method provided herein, or may be located at the same site as one of the data centers; the embodiments of this application do not limit the product form of the arbitration device 13. The application scenario shown in fig. 1 is merely an example, and the embodiments of this application are not limited thereto.
FIG. 2 is a schematic diagram of an arbitration system according to an embodiment of this application. The arbitration system can support highly available active-standby arbitration management for multiple types of stateful components, including cluster management and cluster monitoring. The main functions of cluster management may include: registration and docking of stateful components with the system, updating of docking configuration, and control of the active-standby state of docked components. The main functions of cluster monitoring include: display of cluster state, key monitoring metrics of cluster node OSes, key business metrics of clusters, and alarms when a normal threshold is exceeded.
As shown in fig. 2, the arbitration system can be divided into a client (client) side and a server (server) side.
A client logically consists mainly of an agent node (agent) and the components to be managed. Each component may be a stateful component, in either the Active state or the Standby state of a service. The agent's main roles may include: collecting and reporting to the server the status information of the components it manages, which may include, for example, active-standby state information of the component, key monitoring information of the component, and OS load information of the node; and executing commands from the server to start, stop, or switch the currently managed components.
The server side logically consists mainly of at least one server, a console management interface associated with the server, and a back-end store. The console management interface provides tenants with visualized cluster monitoring and management capabilities; the back-end store serves as the storage back end of the arbitration system and stores the system's state information. The server acts as the brain of the system, managing the different components of the whole system based on the state information in the back-end store.
In an alternative implementation, the arbitration system may further include a User Equipment (UE) of the tenant.
Based on the interface definition for service registration, the tenant can use the user equipment and the corresponding console management interface to register with the arbitration system those components that implement a service and need disaster recovery arbitration capability. After service registration is completed at the server side, the server can issue a deployment instruction to the component's agent node according to the registration information; the corresponding agent node initializes the component to be deployed to the dual-active (or multi-active) data center, and can report the deployed component's status information to the server.
Further, the tenant can view the component's active-standby state, key monitoring metrics, and so on through the management interface provided by the console, and can manage the component's lifecycle as required, including but not limited to starting, stopping, and switching the component. When a failure event occurs, the server can act as the arbitration device and perform arbitration management on the failure object according to the failure object involved in the failure event, the corresponding failure scenario, and the arbitration policy, so as to ensure the high availability of the dual-active (or multi-active) data center.
The arbitration scheme of the embodiment of the application is described below by taking a dual active data center as an example.
FIG. 3 is a flow chart of an arbitration method for a dual active data center according to an embodiment of the present application. The method may be implemented by the arbitration system shown in fig. 2. As shown in fig. 3, from the node level, the arbitration method may include the steps of:
S310: When a failure event occurs, the arbitration device determines the failure object in the first data center.
S320: When the failure object includes a master node of the first data center, the arbitration device selects a target node from a plurality of candidate nodes to provide service in place of the master node of the first data center, where the plurality of candidate nodes include candidate nodes of the first data center and/or candidate nodes of the second data center; when the failure object does not include a master node of the first data center, the arbitration device lets the first data center continue to provide service.
It should be noted that, in the embodiments of this application, possible forms of the failure object may include, but are not limited to, a data center, a node in a data center, or a component deployed on a node of a data center. Accordingly, the alternative object that replaces a failed object to provide service when a failure event occurs may take the same form as the failed object: a data center, a node in a data center, or a component deployed on a node of a data center. During arbitration, arbitration management can be performed according to the specific form of the failure object involved and the specific fault situation.
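Steps S310 and S320 above can be sketched as follows. The node names and the first-eligible tie-breaking rule are illustrative assumptions; later sections of the patent refine how the target node is actually chosen:

```python
def arbitrate(failed_objects: set, master: str, candidates: list) -> str:
    """S310: locate the failure; S320: only a failed master triggers
    selection of a target node, drawn from the candidate nodes of the
    first and/or second data center."""
    if master not in failed_objects:
        return master                # master intact: keep serving as-is
    healthy = [n for n in failed_objects.symmetric_difference(candidates)
               if n in candidates]   # candidates not themselves failed
    return healthy[0] if healthy else master  # no candidate: keep serving, alarm
```

When no healthy candidate remains, the sketch keeps the failed master in place, matching the no-candidate policy described later (keep the current state and alert the tenant).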
For example, at node granularity within a data center, the first data center may include a master node and a standby node associated with the same service, and so may the second data center. When a failure event occurs, the arbitration method may be as described above: a target node is selected from a plurality of candidate nodes to provide service in place of the failed master node, where the candidate nodes include candidate nodes of the first data center and/or candidate nodes of the second data center; when the failure object does not include a master node of the first data center, the node replacement step need not be performed.
For another example, at data center granularity, the first data center may be the primary data center associated with a service and the second data center the standby data center. When a failure event occurs, the arbitration method may include: when the failure object includes the first data center, letting the second data center provide service in place of the first data center; and when the failure object does not include the first data center, letting the first data center continue to provide service. Similarly, when the second data center is the primary data center, the first data center can provide service in its place if the second data center experiences a failure event.
For another example, at the granularity of components deployed on nodes, the component in the primary (Active) state and the components in the standby state of the same service may be deployed on the same or different nodes (nodes of the same data center or of different data centers). When a failure event occurs, the arbitration method may include: when the failure object includes a component of the first/second data center that is in the primary state, selecting a target component from the failed component's standby components (which may include standby components in the first data center and/or the second data center) to provide the service in its place; and when the failure object does not include a component in the primary state, letting the component in the primary state continue to provide the service.
In the embodiments of this application, the tenant can independently select and configure the detection granularity of failure events and the arbitration mechanisms for failure objects at different granularities according to the tenant's own application or service scenario, so that reliable and efficient management of the dual-active data center can be implemented at different levels, improving the overall reliability of the arbitration system.
For ease of understanding, the method steps shown in FIG. 3, as well as other relevant steps for implementing the arbitration method, are described in greater detail below in terms of a component registration process, a fault arbitration process, and the like.
(1) Component registration process
In the embodiments of this application, the data centers of the arbitration system may form a dual-active or a multi-active deployment; for ease of distinction, a dual-active deployment comprises a first data center and a second data center, and a multi-active deployment comprises a first, second, third data center, and so on. Based on the interface definition for service registration, users can register in advance the stateful components that need disaster recovery arbitration capability with the different data centers of the arbitration system, and the arbitration system can then perform cluster management (also called component management) and cluster monitoring on the components of the different data centers.
Illustratively, the component registration format provided by the arbitration system may be as follows:
the arbitration system may provide a console management interface for configuration by the tenant. In the component registration process, the user equipment of the tenant can display a corresponding console management interface, and the tenant can input or select parameters similar to the component registration information in the relevant attribute configuration items of the console management interface according to the use requirement of the tenant on the component so as to complete the component registration.
For example, as shown in fig. 4a, the console management interface involved in component registration may be used for a "component docking" function. The interface may include attribute configuration items for entering or selecting the service name, component name, component deployment state information, management plug-in, start timeout, stop timeout, switch timeout, status check interval, abnormal-switchover sensitivity, and so on; the tenant can enter or select the corresponding parameters in the relevant configuration items and click the "dock (also called register)" button to complete the parameter configuration required for component registration.
The registration information of the component may include the parameters entered or selected in the relevant attribute configuration items, including but not limited to at least one of the following: the component's type, service name, component name, parent node, associated management account information, associated script information, identification of the node to be deployed on, communication address, and the active-standby status of the site or component involved.
The tenant's user equipment can send the component's registration information to the server through the console management interface. The server receives the registration information, which indicates the component to be deployed to the dual-active (or multi-active) data center, and may also store it in the back-end store as the component's initial state information. Based on the registration information, the server can register the component with the arbitration system and deploy it on the corresponding node of the dual-active (or multi-active) data center. For example, the server may issue the component's registration information to the agent node, and the agent node may perform an initialization operation on the component according to that information to complete the deployment.
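A registration payload matching the attribute items of fig. 4a might look like the sketch below. The field names and values are assumptions for illustration only; the patent does not fix a wire format:

```python
# Hypothetical registration record mirroring the fig. 4a configuration items.
registration = {
    "service_name": "GaussDB",              # service the component implements
    "component_name": "gaussdb-0",
    "deploy_state": "standby",              # initial active-standby state
    "management_plugin": "gaussdb_ctl.sh",  # tenant-written script plug-in
    "start_timeout_s": 60,
    "stop_timeout_s": 30,
    "switch_timeout_s": 30,
    "status_check_interval_s": 5,
    "failover_sensitivity": "high",
}

def validate_registration(reg: dict) -> bool:
    """Server-side sanity check before persisting the registration as the
    component's initial state information in the back-end store."""
    required = {"service_name", "component_name", "management_plugin"}
    missing = required - reg.keys()
    if missing:
        raise ValueError(f"missing registration fields: {sorted(missing)}")
    return True
```

Validating before persisting keeps the back-end store free of half-configured components that the agent node could not initialize.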
After the components are deployed, the tenant can change the configuration parameters of a deployed component through the "configuration management" function of the management interface shown in fig. 4b, and manage the component's lifecycle, including starting, stopping, restarting, and switching the component. The agent node can also report the deployed component's status information to the server; the server can update the state information stored at the back end accordingly and perform state management and cluster monitoring based on the component's real-time state. For example, based on the management plug-in and the various timeout parameters configured at deployment, the agent node can actively monitor the component's state and report it to the server, so that the server learns of component failure events in time. For another example, taking services such as the Gaussian database (GaussDB), RabbitMQ, and Redis as examples, the tenant can view/change a component's active-standby state in the "state management" function of the management interface in fig. 4c, and manually manage the component's lifecycle by checking the different flag symbols under the "operation" configuration item (only "setting" is shown in the figure; the flags for the different operations are not limited there), including starting, stopping, restarting, and switching the component. For another example, the tenant can perform cluster monitoring through the management interface shown in fig. 4d and view the key monitoring metrics.
The tenant's user equipment can send the updated configuration parameters of a managed component to the server through the console management interface. The server updates the component state information stored in the back-end store and, according to the updated parameters, sends a change instruction to the agent node, which then manages the component's state, lifecycle, and so on accordingly.
It should be understood that the above functions of component registration, component management, and cluster monitoring through the console management interface are merely examples, and are not limited to specific implementation manners of these functions, which are not described herein.
(2) Fault arbitration process
In the embodiments of this application, each data center in a dual-active (or multi-active) deployment may include a master node (e.g., in the priority AZ) and candidate nodes (e.g., in the standby AZ); the master node may host components of a service in the Active state, and a candidate node may host components of that service in the Standby state. When a failure event occurs, the arbitration device first needs to locate the failure object, i.e., the failure point, and then perform arbitration and switchover according to the arbitration policy corresponding to the failure object, so as to ensure service continuity.
For example, if the first data center is the primary data center and the second data center is the standby data center, then when the failure object is an entire data center, such as the first data center, the arbitration device may let the standby second data center provide service in place of the first data center.
For another example, when the failure object includes a master node of a data center, the arbitration device may select a target node to provide service in place of the failed master node from the master node's candidate nodes, which include all candidate nodes in the dual-active (or multi-active) deployment, such as the candidate nodes of the first data center and/or the second data center. When the failure object does not include a master node, the components of the master node continue to provide service. That is, at the node level, the arbitration device is triggered to perform arbitration management on the failure object only when the failure point is located at the master node.
For another example, when the failure object includes a component of a service in the primary state (simply, a primary component), the arbitration device may select a target component to provide the service in place of the primary component from the standby components of the same service associated with the failed primary component, which include all standby components in the dual-active (or multi-active) deployment, whether in the first data center and/or the second data center. When the failure object does not include the primary component, the primary component continues to provide service. That is, at the component level, the arbitration device is triggered to perform arbitration management on the failure object only when the failure point is located at the primary component.
In the dual-active disaster recovery deployment mode, the fault situations may be as shown in fig. 5 and include the following cases:
1. Arbitration device failure:
In this case, the cluster state is unchanged, the service is normal, and no active-standby switchover of components is required.
2. Standby AZ failure / communication failure between the two AZs:
In this case, the standby-AZ component fails on its own, the service is normal, and no active-standby switchover is required.
3. Priority AZ failure:
In this case, the failed components need an active-standby switchover: each failed component is switched from the Active state to the Standby state and the target component in the standby AZ is switched from the Standby state to the Active state; during the switchover, service resumes after a brief interruption (for example, 30 s).
4. After the arbitration site fails, the standby AZ also fails:
In this case, the Active-state components are normal, the service is normal, and no active-standby switchover is required.
5. After the standby AZ fails, the arbitration site also fails:
In this case, the Active-state components are normal, the service is normal, and no active-standby switchover is required.
6. After the arbitration site fails, the priority AZ also fails:
In this case, the Active-state components fail and the service is abnormal; the failed components need an active-standby switchover: each failed component is switched from the Active state to the Standby state and the target component in the standby AZ is switched from the Standby state to the Active state; during the switchover, service resumes after a brief interruption (for example, 30 s).
7. After the priority AZ fails, the arbitration site also fails:
In this case, the Active-state components fail and the service is abnormal; the failed components need an active-standby switchover: each failed component is switched from the Active state to the Standby state and the target component in the standby AZ is switched from the Standby state to the Active state; during the switchover, service resumes after a brief interruption (for example, 30 s).
8. After the priority AZ fails, the standby AZ also fails:
In this case, the Active-state components fail and the service is abnormal; the failed components are switched from the Active state to the Standby state, and the standby-AZ target components are switched from the Standby state to the Active state.
9. After the standby AZ fails, the priority AZ also fails:
In this case, the Active-state components fail and the service is abnormal; the failed components are switched from the Active state to the Standby state, and the standby-AZ target components are switched from the Standby state to the Active state.
10. After the priority AZ and the arbitration site fail, the standby AZ also fails:
In this case, the Active-state components fail and the service is abnormal; the failed components are switched from the Active state to the Standby state, and the standby-AZ target components are switched from the Standby state to the Active state.
11. After the standby AZ and the arbitration site fail, the priority AZ also fails:
In this case, the Active-state components fail and the service is abnormal; the failed components are switched from the Active state to the Standby state, and the standby-AZ target components are switched from the Standby state to the Active state.
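Across the eleven situations above, a switchover is required exactly when the priority AZ has failed (cases 3 and 6-11); a failed arbitration site or standby AZ alone (cases 1, 2, 4, 5) leaves the Active component serving. That observation can be captured compactly; this is a sketch of the decision rule only, not of the full preset policy table:

```python
def needs_switchover(priority_az_failed: bool,
                     standby_az_failed: bool,
                     arbiter_failed: bool) -> bool:
    """Cases 1, 2, 4, 5: the Active component is normal, no switch.
    Cases 3, 6-11: the priority AZ failed, so the failed component goes
    Active -> Standby and the standby-AZ target goes Standby -> Active."""
    return priority_az_failed
```

In other words, the standby-AZ and arbiter states change only whether the switchover can be coordinated and whether service resumes quickly, not whether a switchover is needed.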
The operator of the arbitration device may preset and store arbitration policies for the different fault situations above. When a failure event occurs, the arbitration device (for example, the server shown in fig. 2) can perform arbitration management on the failure object according to the fault situation and the corresponding arbitration policy, switching the failed component from the Active state to the Standby state and the selected target component from the Standby state to the Active state, so that the target component replaces the failed component and continues to provide service.
When performing active-standby arbitration, the arbitration device can use the multidimensional information reported by the agent nodes as the arbitration basis to determine the candidate nodes in the arbitration system currently able to replace the failed master node, and select a target node from them. Illustratively, the arbitration basis may include at least one of: the data synchronization rate of the candidate node; the load state of the candidate node, such as CPU load, memory load, disk read/write load, network latency, or packet loss rate; and the location of the site to which the candidate node belongs.
Based on the above arbitration basis, when a component in the Active state of a service fails, the arbitration device may, for example, select as candidate nodes, from among the nodes with deployed components in the Standby state of that service, those satisfying at least one of the following conditions: (1) the candidate node's data synchronization rate is greater than or equal to a first threshold; (2) the candidate node's load (including but not limited to at least one of CPU load, memory load, disk read/write load, network latency, or packet loss rate) is less than or equal to a second threshold. It is understood that the evaluation thresholds for different parameters may differ; the embodiments of this application do not limit their specific values.
In an alternative implementation, the arbitration device may maintain a list of candidate promote-to-master nodes that summarizes the arbitration entry information of the different components and the nodes they belong to. When any of fault situations 1-11 occurs, the arbitration device may, on the one hand, check the data synchronization rate of each current candidate promote-to-master component; if a candidate's rate is below the first threshold, it is removed from the list, i.e., temporarily not used as a candidate node. On the other hand, the arbitration device may check the load state of each candidate node; if its CPU load, memory load, disk read/write load, network latency, or packet loss rate exceeds the corresponding second threshold, it is likewise removed from the list. Further, the arbitration device may select the target node from the remaining candidate nodes at random or according to a preset rule.
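The pruning of the candidate promote-to-master list can be sketched as below. The node record layout and metric names (`cpu`, `mem`) are assumptions made for this illustration:

```python
def prune_candidates(nodes: list, min_sync_rate: float, max_load: dict) -> list:
    """Keep only candidates whose data synchronization rate meets the
    first threshold and whose load metrics all stay within the
    per-metric second thresholds."""
    kept = []
    for node in nodes:
        if node["sync_rate"] < min_sync_rate:
            continue  # too far behind to be promoted safely
        if any(node["load"].get(metric, 0) > limit
               for metric, limit in max_load.items()):
            continue  # an overloaded node would degrade service after promotion
        kept.append(node)
    return kept
```

The target node is then picked from the pruned list at random or by a preset rule, such as the site-preference rule described next.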
Illustratively, the arbitration policy corresponding to the fault object may include: preferentially selecting the candidate node located at the same site as the currently failed master node as the target node.
That is, active/standby switchover is performed preferentially within the same site, and across sites only as a second choice.
In an alternative implementation, priorities may be assigned to the candidate nodes, and the target node may be selected according to these priorities. For example, candidate nodes at the same site as the failed master node have a higher priority than candidate nodes at a different site. If the same site has no candidate node, a candidate node that is closer to the failed master node or has better network connectivity to it may be selected as the target node, based on spatial distance, network state, and the like. The embodiments of the present application do not limit the selection rule for the target node.
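The same-site-first selection rule can be sketched as follows, assuming a hypothetical candidate record with `node_id` and `site` fields and an optional distance table; these names are illustrative assumptions:

```python
def pick_target(candidates, failed_master_site, distance_km=None):
    """Prefer a candidate at the failed master's site; otherwise fall back to
    the nearest cross-site candidate (distance_km maps node_id -> distance)."""
    same_site = [c for c in candidates if c["site"] == failed_master_site]
    if same_site:
        return same_site[0]          # same-site switchover is preferred
    if not candidates:
        return None                  # no candidate: caller keeps state and alarms
    if distance_km:
        # cross-site fallback: choose the spatially closest candidate
        return min(candidates,
                   key=lambda c: distance_km.get(c["node_id"], float("inf")))
    return candidates[0]
```

Other preset rules (e.g., ranking by measured network quality instead of distance) fit the same shape, consistent with the text leaving the selection rule open.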
Illustratively, the arbitration policy corresponding to the fault object may further include: when no candidate node currently exists, sending first alarm information to the tenant while the failed master node continues to provide service; and when the fault object further includes the arbitration device, sending second alarm information to the tenant.
That is, if there is no candidate node in the current list, the arbitration device may keep the current active/standby state unchanged, and ensure the high reliability of the arbitration system by alerting the tenant so that the tenant can intervene manually to handle the fault event.
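This fallback behavior can be sketched as follows; the alarm labels, fault-object names, and the `notify` callback are hypothetical names used only for illustration:

```python
def arbitrate_no_candidate(fault_objects, notify):
    """Fallback when the candidate list is empty: keep the current
    active/standby state unchanged and alert the tenant.

    fault_objects: set of faulty entities determined for this event
    notify: assumed callback that delivers an alarm to the tenant
    """
    notify("first_alarm", "no candidate node available; failed master keeps serving")
    if "arbitration_device" in fault_objects:
        # the arbitration device itself is also faulty: escalate
        notify("second_alarm", "arbitration device faulty; manual intervention needed")
    return "keep_current_state"
```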
When an active/standby switchover is required, the server, acting as the arbitration device, may issue an arbitration instruction to the proxy node, instructing it to switch the failed master node from the active state to the standby state and to switch the target node from the standby state to the active state, so that the target node continues to provide service in place of the failed master node, thereby realizing service continuity.
FIG. 6 shows a flowchart of an arbitration method according to an embodiment of the application. The arbitration scheme described above may be divided into the following phases according to the service scenario:
Component registration phase:
S11: The tenant registers the component through the console management interface presented on its user equipment.
S12: The console management interface acquires the tenant's registration information and sends the registration information of the component to the server.
S13: the server stores the registration information of the component as initial state information in a back-end memory, and distributes a deployment instruction to the proxy node according to the registration information of the component, wherein the deployment instruction can comprise a registration script of the component.
S14: the proxy node deploys the components at the respective nodes in response to deployment instructions from the ARB server. Meanwhile, the proxy node can pull up the deployed component and collect the state information of the component.
S15: the proxy node may report the status information of the component to the server. It should be appreciated that for deployed components, the proxy node may subsequently pull up the component and gather status information of the component in real-time or periodically at various stages as described below.
S16: the server sends the registration result of the component to the user equipment of the tenant.
S17: the console management interface displays the registration result to the tenant.
Component status query phase:
S21: The tenant initiates a request to query the state information of the component through the console management interface.
S22: The console management interface sends the query request to the server.
S23: the server obtains the state information of the component from the back-end memory, and returns the query result to the console management interface through the console management interface.
S24: the console management interface displays the query results to the tenant.
Component monitoring information query phase:
S31: The tenant initiates a request to query the monitoring information of the component through the console management interface.
S32: The console management interface sends the query request to the server.
S33: the server acquires the monitoring information of the component from the back-end memory, and returns a query result to the console management interface through the console management interface.
S34: the console management interface displays the query results to the tenant.
Tenant-initiated cluster state change phase:
S41: The tenant initiates a request to change the state of a cluster (including one or more components) through the console management interface.
S42: The console management interface sends the change request, including the configuration update parameters of one or more components in the cluster, to the server.
S43: the server issues cluster operation commands to the proxy node according to configuration update parameters of one or more components.
S44: the proxy node invokes the registration script to perform state change operations of one or more components, such as start, stop, switch, etc. of the components.
S45: and the proxy node returns an execution result to the server.
S46: the server sends the execution result to the console management interface through the console management interface.
S47: the console management node displays the status change result of the cluster to the tenant.
Fault arbitration phase:
s51: the arbitration system experiences a failure event.
S52: the server acts as an arbitration device to determine that a failure event has occurred, e.g., that the master node status information has timed out and not been reported. It will be appreciated that the fault event may be triggered by a failure of the master node, or may be triggered by a communication failure between the master node and the node of the arbitration device, and the triggering situation of the fault event is not limited by the embodiment of the present application).
S53: the server starts an automatic arbitration mechanism, and determines whether to start a primary and standby state switching process according to whether the fault object comprises a primary node or not according to the arbitration strategies respectively corresponding to the 11 situations.
S54: in the case that the fault object includes the master node, selecting a target node from the current multiple candidate nodes according to the arbitration policy, and performing a state switching instruction to the proxy node so as to switch the target node from the standby state to the active state, namely, the candidate component is up to the master.
S55: the proxy node responds to a state switching instruction from the server and calls a registration script of the target node and the fault master node so as to enable the standby state of the target node to be switched into an active state, namely the standby component is lifted, and the fault master node is switched into the standby state from the active state, namely the main standby.
S56: and the proxy node reports the state change result to the server.
According to the above arbitration scheme, taking the various fault situations in a dual-active disaster recovery scenario as examples, a common active/standby arbitration capability is provided: a service only needs to write a few script plug-ins according to the interface definitions to integrate with the arbitration system and obtain highly reliable dual-active disaster recovery capability. The system also integrates visual display of component deployment states, component lifecycle management, and monitoring and alarming on key indicators of the nodes' operating system (OS) and of the components, and can provide one-stop management capability.
In combination with the above method embodiments, the embodiments of the present application further provide a communication apparatus, where the communication apparatus performs the methods performed by the arbitration device (e.g., a server) and the tenant's user equipment in the above method embodiments.
As shown in fig. 7, the communication apparatus 700 may be an arbitration device, including: a determining unit 701 configured to determine a fault object in the first data center when a fault event occurs; an arbitration unit 702, configured to select, when the fault object includes a master node of the first data center, a target node from a plurality of alternative nodes to provide services in place of the master node in the first data center, where the plurality of alternative nodes includes an alternative node of the first data center and/or an alternative node of the second data center; and when the fault object does not comprise the main node of the first data center, enabling the first data center to continue to provide service.
In an alternative implementation, the determining a fault object in the first data center when a fault event occurs includes: when the fault event occurs, determining the fault object from among the first data center, the second data center, and the arbitration device.
In an alternative implementation, each of the plurality of alternative nodes satisfies at least one of the following conditions: the data synchronization rate of the alternative node is greater than or equal to a first threshold, wherein the data synchronization rate is used for representing the data synchronization state between the alternative node and the corresponding main node; alternatively, the load of the candidate node is less than or equal to a second threshold.
In an alternative implementation, the load of the alternative node includes at least one of: CPU load, memory load, disk read/write load, network delay or packet loss rate.
In an alternative implementation, the priority of the candidate nodes of the first data center is higher than the priority of the candidate nodes of the second data center.
In an optional implementation manner, the arbitration policy corresponding to the fault object further includes: when no alternative node exists, first alarm information is sent to the tenant, and the first data center continues to provide service; and when the fault object further comprises the arbitration device, sending second alarm information to the tenant.
In an alternative implementation, the apparatus further includes: a communication unit for receiving registration information of user equipment UE from a tenant, the registration information being for indicating components to be deployed to the first data center and/or the second data center; and the deployment unit is used for registering the component in the arbitration equipment according to the registration information and deploying the component in the corresponding node of the first data center and/or the corresponding node of the second data center.
In an alternative implementation, the registration information includes at least one of the following information of the component: the type, service name, component name, parent node, associated management account information, associated script information, identifier of the node to be deployed, communication address, or active/standby status of the site or component involved.
In an alternative implementation, the apparatus further includes: and the communication unit is used for receiving monitoring data from the client, wherein the monitoring data is used for indicating whether the fault event occurs to a component deployed at a node to which the client belongs.
In an alternative implementation, the arbitration unit is configured to: receiving a query request from a User Equipment (UE) of a tenant through the communication unit, wherein the query request is used for indicating a target component to be queried; and according to the query request, feeding back the state information and/or monitoring index data of the target component to the UE through the communication unit, wherein the state information and/or monitoring index data of the target component are used for indicating whether the fault event occurs or not.
In an alternative implementation, the arbitration unit is configured to: receiving, by a communication unit, a change request from a UE of a tenant, the change request being for indicating a change in state of a target component deployed at the first data center and/or the second data center; and sending a change instruction to the client through the communication unit according to the change request, wherein the change instruction is used for indicating to execute state change operation on the target component.
In an alternative implementation, the state change operation includes at least one of: starting, stopping and switching.
In an alternative implementation, the arbitration device is a cloud server.
As shown in fig. 8, the communication apparatus 800 may be a user equipment of a tenant, including: a display unit 801 for displaying an arbitration management interface; an obtaining unit 802, configured to obtain, through the arbitration management interface, registration information of a tenant, where the registration information is used to indicate a component that needs to be deployed to a first data center and/or a second data center; a communication unit 803 for sending the registration information to the arbitration device.
In an alternative implementation, the registration information includes at least one of the following information of the component: the type, service name, component name, parent node, associated management account information, associated script information, identifier of the node to be deployed, communication address, or active/standby status of the site or component involved.
In an alternative implementation, the obtaining unit is further configured to: acquiring a query request of the tenant through the arbitration management interface, wherein the query request is used for indicating a target component to be queried; the communication unit is further configured to send the query request to the arbitration device; receiving status information and/or monitoring indicator data from the arbitration device in response to the query request; the display unit is also used for displaying the state information and/or the monitoring index data.
In an alternative implementation, the obtaining unit is further configured to: acquiring a change request of the tenant through the arbitration management interface, wherein the change request is used for indicating state change of a target component deployed in the first data center and/or the second data center; the communication unit is further configured to send the change request to the arbitration device, so that the arbitration device performs a state change operation on the target component.
In an alternative implementation, the state change operation includes at least one of: starting, stopping and switching.
In an alternative implementation, the arbitration device is a cloud server.
It should be noted that, in the embodiment of the present application, the division of the units is schematic, which is merely a logic function division, and other division manners may be implemented in actual practice. The functional units in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In a simple embodiment, one skilled in the art will appreciate that either the arbitration device or the user device in the above embodiments may take the form shown in fig. 9. The communication device 900 as shown in fig. 9 includes at least one processor 910, a memory 920, and optionally a communication interface 930.
Memory 920 may be a volatile memory such as a random access memory; it may also be a non-volatile memory such as, but not limited to, a read-only memory, a flash memory, a hard disk drive (HDD), or a solid state drive (SSD); or memory 920 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Memory 920 may also be a combination of the above.
The specific connection medium between the processor 910 and the memory 920 is not limited in the embodiment of the present application.
In the apparatus as in fig. 9, a communication interface 930 is further included, and the processor 910 may perform data transmission through the communication interface 930 when communicating with other devices.
When the arbitration device takes the form shown in fig. 9, the processor 910 in fig. 9 may cause the apparatus 900 to perform the method performed by the arbitration device in any of the method embodiments described above by invoking computer-executable instructions stored in the memory 920.
When the tenant-side user device takes the form shown in fig. 9, the processor 910 in fig. 9 may cause the device 900 to perform the method performed by the tenant-side user device in any of the above-described method embodiments by invoking computer-executable instructions stored in the memory 920.
The embodiments of the present application also relate to a chip system, which includes a processor configured to invoke a computer program or computer instructions stored in a memory, so that the processor performs the methods in the above method embodiments.
In one possible implementation, the processor is coupled to the memory through an interface.
In one possible implementation, the system on a chip further includes a memory having a computer program or computer instructions stored therein.
The embodiments of the present application also relate to a processor for invoking a computer program or computer instructions stored in a memory to cause the processor to perform the above-described method embodiments.
The processor mentioned in any of the above may be a general-purpose central processing unit, a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the program of the method in the embodiment shown in fig. 9. The memory mentioned in any of the above may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM), etc.
It should be appreciated that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer storage media (including, but not limited to, magnetic disk storage, CD-ROM, optical storage, etc.) having computer program code embodied therein.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made to the embodiments of the present application without departing from the scope of the embodiments of the application. Thus, if such modifications and variations of the embodiments of the present application fall within the scope of the claims and the equivalents thereof, the present application is also intended to include such modifications and variations.

Claims (40)

1. An arbitration method for a dual-active data center, wherein the dual-active data center comprises a first data center and a second data center, the method comprising:
determining a fault object in the first data center when a fault event occurs;
selecting a target node from a plurality of alternative nodes to replace a master node in the first data center to provide service when the fault object comprises the master node of the first data center, wherein the plurality of alternative nodes comprise alternative nodes of the first data center and/or alternative nodes of the second data center;
and when the fault object does not comprise the main node of the first data center, enabling the first data center to continue to provide service.
2. The method of claim 1, wherein said determining a fault object in said first data center when a fault event occurs comprises:
when the fault event occurs, determining the fault object from among the first data center, the second data center, and an arbitration device.
3. The method of claim 2, wherein each of the plurality of candidate nodes satisfies at least one of the following conditions:
the data synchronization rate of the alternative node is greater than or equal to a first threshold, wherein the data synchronization rate is used to represent the data synchronization state between the alternative node and the corresponding master node; or
the load of the alternative node is less than or equal to a second threshold.
4. The method according to claim 3, wherein the load of the alternative node comprises at least one of: CPU load, memory load, disk read/write load, network delay, or packet loss rate.
5. The method of any of claims 1-4, wherein the priority of the candidate nodes of the first data center is higher than the priority of the candidate nodes of the second data center.
6. The method according to any one of claims 1-5, further comprising:
when no alternative node exists, first alarm information is sent to the tenant, and the first data center continues to provide service;
And when the fault object further comprises the arbitration device, sending second alarm information to the tenant.
7. The method according to any one of claims 1-6, further comprising:
receiving registration information of User Equipment (UE) from a tenant, wherein the registration information is used for indicating components to be deployed to the first data center and/or the second data center;
and registering the component in the arbitration device according to the registration information, and deploying the component in the corresponding node of the first data center and/or the corresponding node of the second data center.
8. The method of claim 7, wherein the registration information includes at least one of the following information for the component:
the type, service name, component name, parent node, associated management account information, associated script information, identifier of the node to be deployed, communication address, or active/standby status of the site or component involved.
9. The method according to any one of claims 1-8, further comprising:
monitoring data from a client is received, wherein the monitoring data is used for indicating whether the fault event occurs to a component deployed at a node to which the client belongs.
10. The method according to claim 9, wherein the method further comprises:
receiving a query request from a User Equipment (UE) of a tenant, wherein the query request is used for indicating a target component to be queried;
and feeding back the state information and/or the monitoring index data of the target component to the UE according to the query request, wherein the state information and/or the monitoring index data of the target component are used for indicating whether the fault event occurs.
11. The method according to any one of claims 1-10, further comprising:
receiving a change request from a UE of a tenant, the change request being for indicating a change in state of a target component deployed at the first data center and/or the second data center;
and sending a change instruction to the client according to the change request, wherein the change instruction is used for indicating to execute state change operation on the target component.
12. The method of claim 11, wherein the state change operation comprises at least one of: starting, stopping and switching.
13. The method according to any one of claims 2-12, wherein the arbitration device is a cloud server.
14. An arbitration method for a dual-active data center, the arbitration method comprising:
displaying an arbitration management interface;
acquiring registration information of a tenant through the arbitration management interface, wherein the registration information is used to indicate components that need to be deployed to a first data center and/or a second data center;
and sending the registration information to an arbitration device.
15. The method of claim 14, wherein the registration information includes at least one of the following information for the component:
the type, service name, component name, parent node, associated management account information, associated script information, identifier of the node to be deployed, communication address, or active/standby status of the site or component involved.
16. The method according to claim 14 or 15, wherein the arbitration method further comprises:
acquiring a query request of the tenant through the arbitration management interface, wherein the query request is used for indicating a target component to be queried;
sending the query request to the arbitration device;
receiving status information and/or monitoring indicator data from the arbitration device in response to the query request;
And displaying the state information and/or the monitoring index data.
17. The method according to any one of claims 14-16, wherein the arbitration method further comprises:
acquiring a change request of the tenant through the arbitration management interface, wherein the change request is used for indicating state change of a target component deployed in the first data center and/or the second data center;
and sending the change request to the arbitration device so that the arbitration device performs a state change operation on the target component.
18. The method of claim 17, wherein the state change operation comprises at least one of: starting, stopping and switching.
19. The method of any one of claims 14-18, wherein the arbitration device is a cloud server.
20. An arbitration device for a dual-active data center, the dual-active data center comprising a first data center and a second data center, the arbitration device comprising:
a determining unit configured to determine a fault object in the first data center when a fault event occurs;
an arbitration unit, configured to select a target node from a plurality of candidate nodes to provide a service instead of a master node in the first data center when the fault object includes the master node of the first data center, where the plurality of candidate nodes include a candidate node of the first data center and/or a candidate node of the second data center; and when the fault object does not comprise the main node of the first data center, enabling the first data center to continue to provide service.
21. The apparatus of claim 20, wherein the determining unit is configured to:
when the fault event occurs, determining the fault object from among the first data center, the second data center, and an arbitration device.
22. The apparatus of claim 21, wherein each candidate node of the plurality of candidate nodes satisfies at least one of:
the data synchronization rate of the alternative node is greater than or equal to a first threshold, wherein the data synchronization rate is used to represent the data synchronization state between the alternative node and the corresponding master node; or
the load of the alternative node is less than or equal to a second threshold.
23. The apparatus of claim 22, wherein the load of the candidate node comprises at least one of: CPU load, memory load, disk read/write load, network delay or packet loss rate.
24. The apparatus of any of claims 20-23, wherein the priority of the candidate node of the first data center is higher than the priority of the candidate node of the second data center.
25. The apparatus according to any one of claims 20-24, wherein the arbitration unit is further configured to:
When no alternative node exists, first alarm information is sent to the tenant, and the first data center continues to provide service;
and when the fault object further comprises the arbitration device, sending second alarm information to the tenant.
26. The apparatus according to any one of claims 20-25, wherein the apparatus further comprises:
a communication unit for receiving registration information of user equipment UE from a tenant, the registration information being for indicating components to be deployed to the first data center and/or the second data center;
and the deployment unit is used for registering the component in the arbitration equipment according to the registration information and deploying the component in the corresponding node of the first data center and/or the corresponding node of the second data center.
27. The apparatus of claim 26, wherein the registration information comprises at least one of the following information for the component:
the type, service name, component name, parent node, associated management account information, associated script information, identifier of the node to be deployed, communication address, or active/standby status of the site or component involved.
28. The apparatus according to any one of claims 20-27, further comprising:
And the communication unit is used for receiving monitoring data from the client, wherein the monitoring data is used for indicating whether the fault event occurs to a component deployed at a node to which the client belongs.
29. The apparatus of claim 28, wherein the arbitration unit is configured to:
receiving a query request from a User Equipment (UE) of a tenant through the communication unit, wherein the query request is used for indicating a target component to be queried;
and feeding back, according to the query request, state information and/or monitoring index data of the target component to the UE through the communication unit, wherein the state information and/or monitoring index data of the target component indicate whether the fault event has occurred.
30. The apparatus according to any one of claims 20-29, wherein the arbitration unit is configured to:
receiving, through the communication unit, a change request from a UE of a tenant, the change request indicating a state change of a target component deployed at the first data center and/or the second data center;
and sending, according to the change request, a change instruction to the client through the communication unit, the change instruction instructing that a state change operation be performed on the target component.
31. The apparatus of claim 30, wherein the state change operation comprises at least one of: starting, stopping and switching.
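The change-request handling of claims 30-31 can be sketched as a small dispatch step on the arbitration device: a tenant's change request is validated against the allowed state change operations and turned into the change instruction sent to the client. The function and field names are assumptions of this sketch.

```python
# State change operations enumerated in claim 31: starting, stopping, switching.
VALID_OPS = {"start", "stop", "switch"}

def build_change_instruction(target_component: str, operation: str) -> dict:
    """Map a tenant's change request onto the change instruction of claim 30."""
    if operation not in VALID_OPS:
        raise ValueError(f"unsupported state change operation: {operation}")
    # The instruction is what the communication unit would send to the client,
    # which then executes the state change operation on the target component.
    return {"component": target_component, "op": operation}
```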
32. The apparatus of any one of claims 21-31, wherein the arbitration device is a cloud server.
33. An arbitration device for a dual active data center, the arbitration device comprising:
the display unit is used for displaying the arbitration management interface;
the acquiring unit is used for acquiring registration information of the tenant through the arbitration management interface, wherein the registration information is used for indicating components which need to be deployed to the first data center and/or the second data center;
and the communication unit is used for sending the registration information to the arbitration device.
34. The apparatus of claim 33, wherein the registration information comprises at least one of the following information for the component:
the type, service name, component name, parent node, associated management account information, associated script information, identifier of the node to be deployed, communication address, or master/slave status of the site or component involved.
35. The apparatus according to claim 33 or 34, wherein the acquisition unit is further configured to:
acquiring a query request of the tenant through the arbitration management interface, wherein the query request is used for indicating a target component to be queried;
the communication unit is further configured to send the query request to the arbitration device, and to receive state information and/or monitoring index data from the arbitration device in response to the query request;
the display unit is also used for displaying the state information and/or the monitoring index data.
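The query flow of claim 35 on the management-interface side (forward the tenant's query to the arbitration device, then display the returned state information and monitoring index data) can be sketched as below. The `send_query` and `display` callables stand in for the communication unit and display unit, and the response shape is an assumption of this sketch.

```python
def query_and_display(target_component: str, send_query, display) -> None:
    # Forward the query request to the arbitration device via the communication unit.
    response = send_query({"query": target_component})
    # Hand the returned state information and monitoring index data to the display unit.
    display(response.get("state"), response.get("metrics"))
```

In a real system `send_query` would wrap the network call to the arbitration device and `display` would render the arbitration management interface; here both are injected so the flow itself is testable.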
36. The apparatus according to any one of claims 33-35, wherein the acquisition unit is further configured to:
acquiring a change request of the tenant through the arbitration management interface, wherein the change request is used for indicating a state change of a target component deployed in the first data center and/or the second data center;
the communication unit is further configured to send the change request to the arbitration device, so that the arbitration device performs a state change operation on the target component.
37. The apparatus of claim 36, wherein the state change operation comprises at least one of: starting, stopping and switching.
38. The apparatus of any one of claims 33-37, wherein the arbitration device is a cloud server.
39. An arbitration system for a dual active data center, characterized in that the arbitration system comprises an arbitration device for implementing the arbitration method according to any of claims 1-13 and a user device for implementing the arbitration method according to any of claims 14-19.
40. A computer readable medium storing a computer program comprising instructions for performing the arbitration method of any one of claims 1-13 or instructions for performing the arbitration method of any one of claims 14-19.
CN202210157330.6A 2022-02-21 2022-02-21 Arbitration method, device and system for dual-activity data center Pending CN116668269A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210157330.6A CN116668269A (en) 2022-02-21 2022-02-21 Arbitration method, device and system for dual-activity data center

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210157330.6A CN116668269A (en) 2022-02-21 2022-02-21 Arbitration method, device and system for dual-activity data center

Publications (1)

Publication Number Publication Date
CN116668269A true CN116668269A (en) 2023-08-29

Family

ID=87724786

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210157330.6A Pending CN116668269A (en) 2022-02-21 2022-02-21 Arbitration method, device and system for dual-activity data center

Country Status (1)

Country Link
CN (1) CN116668269A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117614805A (en) * 2023-11-21 2024-02-27 杭州沃趣科技股份有限公司 Data processing system for monitoring state of data center


Legal Events

Date Code Title Description
PB01 Publication