CN115348156A - Method, equipment and storage medium for processing double-master fault - Google Patents


Info

Publication number
CN115348156A
Authority
CN
China
Legal status
Pending
Application number
CN202210768261.2A
Other languages
Chinese (zh)
Inventor
陈玉炎
Current Assignee
XFusion Digital Technologies Co Ltd
Original Assignee
XFusion Digital Technologies Co Ltd
Application filed by XFusion Digital Technologies Co Ltd
Priority to CN202210768261.2A
Publication of CN115348156A

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06: Management of faults, events, alarms or notifications
    • H04L41/0654: Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0668: Management of faults, events, alarms or notifications using network fault recovery by dynamic selection of recovery network elements, e.g. replacement by the most appropriate element after failure

Abstract

The embodiments of this application provide a method, a device, and a storage medium for processing a dual-master fault. They relate to the Internet field and can improve system stability. The method is applied to an active/standby system that includes a first device and a second device, where a first link between the first device and the second device has failed; the first link is used by the first device and the second device to synchronize data with each other. The method includes the following steps: the first device obtains the health state of the second device, where the health state of a device represents the amount of available resources of the device, and a better health state indicates more available resources; the first device determines its own health state; the first device determines a target device, which is whichever of the first device and the second device is in worse health; and when the target device is the first device, the first device closes the interfaces on the first device that are used for processing services.

Description

Method, equipment and storage medium for processing double-master fault
Technical Field
The embodiments of this application relate to the Internet field, and in particular to a method, a device, and a storage medium for processing a dual-master fault.
Background
As data center traffic grows year after year and network reliability requirements rise, switch virtualization technologies have come into use, and stacking, the representative virtualization technology, is widely deployed in data centers. In essence, these technologies have the master and standby switches back each other up, which improves reliability and robustness and doubles the access capacity. However, when the interconnection port between the master and standby switches fails and the former standby switch also becomes a master switch (i.e., a dual-master fault), a loop forms and services are disrupted.
The conventional way to handle a dual-master fault is as follows. Assume that a first switch and a second switch have a master/standby relationship, with the first switch as the master device and the second switch as the standby device. When a dual-master fault occurs between them (i.e., both switches are masters), the system randomly selects one of the devices to exit the current system, which yields low reliability and stability.
Disclosure of Invention
The embodiments of this application provide a method, a device, and a storage medium for processing a dual-master fault, which can improve the reliability and stability of a system.
To achieve the above objective, the embodiments of this application adopt the following technical solutions:
In a first aspect, an embodiment of this application provides a method for processing a dual-master fault. The method is applied to an active/standby system that includes a first device and a second device, where a first link between the first device and the second device has failed and the first link is used to synchronize data between the first device and the second device. The method includes: the first device obtains the health state of the second device, where the health state of a device represents the amount of available resources of the device, and a better health state indicates more available resources; the first device determines the health state of the first device; the first device determines a target device, which is whichever of the first device and the second device is in worse health; and when the target device is the first device, the first device closes the interfaces on the first device that are used for processing services.
With the method for processing a dual-master fault provided in this embodiment, which is applied to a first device that has a master/standby relationship with a second device, when a dual-master fault exists between the first device and the second device, the device in worse health autonomously closes its service-processing interfaces, that is, it exits the current system, so that the healthier device continues to work in the current system. Compared with the conventional approach, the method therefore uses the health states of the master and standby devices as the basis for deciding which device exits the current system: the device in worse health exits and the device in better health keeps working, which improves the reliability and stability of the current system.
In a possible implementation, the first device determining the health state of the first device includes: the first device calculates a first health value based on the values of target features of the first device, where a target feature is a feature that characterizes available resources, the first health value characterizes the health state of the first device, and a higher first health value indicates a better health state.
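A minimal Python sketch of the first-aspect decision follows. It is an illustration under stated assumptions, not the applicant's implementation: `DeviceHealth`, `choose_target`, `handle_dual_master`, and the `close_service_interfaces` callback are hypothetical names, and the tie-break for equal health values is an assumption because the text does not cover that case.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DeviceHealth:
    name: str      # hypothetical device identifier
    health: float  # higher value = more available resources


def choose_target(first: DeviceHealth, second: DeviceHealth) -> DeviceHealth:
    """Return the device in worse health, i.e. the one that should exit."""
    if first.health != second.health:
        return first if first.health < second.health else second
    # Assumed tie-break (not specified in the text): the larger name exits.
    return first if first.name > second.name else second


def handle_dual_master(local: DeviceHealth, peer: DeviceHealth,
                       close_service_interfaces: Callable[[], None]) -> bool:
    """Close the local service interfaces if the local device is the target."""
    if choose_target(local, peer) is local:
        close_service_interfaces()
        return True
    return False
```

For example, `handle_dual_master(DeviceHealth("A", 0.4), DeviceHealth("B", 0.9), close)` calls `close()` and returns `True`, because device A is in worse health; passing the callback in keeps the decision logic separate from whatever mechanism a real device uses to shut its ports.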
In a possible implementation, the first device obtaining the health state of the second device includes: the first device receives, over a second link, a second health value sent by the second device, where the second health value is calculated by the second device based on the values of the target features of the second device, and the second link is different from the first link.
In this method, the first device receives the second health value from the second device; the second device determines this value from its own target features, and the value represents the health state of the second device. Therefore, when the target features comprise multiple features, the first device needs to receive only a single health value rather than every target feature of the second device, which saves the bandwidth the second device uses to report its health state to the first device.
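The bandwidth argument can be made concrete with a sketch of the message sent over the second link. The wire format (one big-endian IEEE-754 double per value) and the function names are assumptions for illustration only; the text does not specify an encoding.

```python
import struct
from typing import Sequence

# Assumed wire format: one big-endian IEEE-754 double per value.
_VALUE = "d"


def encode_health(value: float) -> bytes:
    """Pack a single health value for the second link (8 bytes)."""
    return struct.pack("!" + _VALUE, value)


def decode_health(payload: bytes) -> float:
    """Unpack a health value received from the peer."""
    (value,) = struct.unpack("!" + _VALUE, payload)
    return value


def encode_raw_features(features: Sequence[float]) -> bytes:
    """For comparison: sending every target feature instead (8 bytes each)."""
    return struct.pack("!" + _VALUE * len(features), *features)
```

Under this assumed format, a single health value occupies 8 bytes, whereas sending six raw forwarding-device features would take 48 bytes.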
In a possible implementation, when the first device is a forwarding device, the target features include the amount of available resources and a device initialization feature, where the device initialization feature indicates whether the device has completed initialization. The amount of available resources includes at least one of the following: the number of non-faulty transistors in the device's processor, the number of non-faulty transistors in the device's LAN switch (LSW) chip, the number of available uplink ports on the device, the idle rate of the device's processor, or the idle rate of the device's memory.
In a possible implementation, when the first device is a computing device, the target features include at least one of the following: the idle rate of the CPU, the idle rate of the memory, and the amount of idle computing power.
In a possible implementation, when the first device is a storage device, the target features include the amount of available storage resources and the number of available transmission channels for carrying read or write data.
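The feature sets for the three device types can be written down as a lookup table, as in the sketch below. The feature names are hypothetical labels for the features listed above, and the sketch lists every candidate even though the forwarding and computing cases only require at least one of them.

```python
from enum import Enum
from typing import Dict, List, Mapping, Tuple


class DeviceType(Enum):
    FORWARDING = "forwarding"
    COMPUTING = "computing"
    STORAGE = "storage"


# Hypothetical feature names standing in for the features listed in the text.
TARGET_FEATURES: Dict[DeviceType, Tuple[str, ...]] = {
    DeviceType.FORWARDING: (
        "initialized",             # device initialization feature (0 or 1)
        "cpu_good_transistors",    # non-faulty transistors in the processor
        "lsw_good_transistors",    # non-faulty transistors in the LSW chip
        "available_uplink_ports",
        "cpu_idle_rate",
        "memory_idle_rate",
    ),
    DeviceType.COMPUTING: ("cpu_idle_rate", "memory_idle_rate", "idle_compute_power"),
    DeviceType.STORAGE: ("available_storage", "available_io_channels"),
}


def missing_features(device_type: DeviceType, sample: Mapping[str, float]) -> List[str]:
    """List the target features of this device type that a sample lacks."""
    return [name for name in TARGET_FEATURES[device_type] if name not in sample]
```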
In a possible implementation, the first device calculating the first health value based on the values of its target features includes: the first device calculates the first health value based on the values of its target features and the weight of each target feature, where a feature that has a greater influence on service stability is given a larger weight.
In a possible implementation, the method further includes: the second device obtains the health state of the first device; the second device determines the health state of the second device; and when the target device is the second device, the second device exits the system in which the first device and the second device are located.
Because the method is executed on both the first device and the second device, whichever device is the target device, whether the first or the second, can exit the current system immediately, which improves the reliability and stability of the current system.
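A weighted health value could be computed as in the following sketch. Two points are assumptions rather than statements from the text: feature values are pre-scaled to [0, 1] so that features with different units can be combined, and a forwarding device that has not completed initialization (the device initialization feature) scores 0. The weights themselves are left to the caller.

```python
from typing import Mapping


def health_value(features: Mapping[str, float],
                 weights: Mapping[str, float],
                 initialized: bool = True) -> float:
    """Weighted average of normalized feature values.

    Assumptions (not from the text): each feature value is already scaled to
    [0, 1], and a device that has not finished initialization scores 0.
    """
    if not initialized:
        return 0.0
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Missing features count as 0, i.e. no available resource of that kind.
    score = sum(w * features.get(name, 0.0) for name, w in weights.items())
    return score / total_weight
```

With weights `{"cpu_idle_rate": 3.0, "memory_idle_rate": 1.0}` and feature values 0.5 and 1.0, the health value is (3.0 * 0.5 + 1.0 * 1.0) / 4.0 = 0.625, so the more heavily weighted CPU feature dominates the result.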
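Running the same rule on both sides works because, given the same pair of health values, the comparison is antisymmetric: exactly one device concludes that it is the target. The sketch below simulates this; `should_exit` and the identifier-based tie-break are assumptions, and without some tie-break two equally healthy devices would both remain master.

```python
from typing import Tuple


def should_exit(local_id: str, local_health: float,
                peer_id: str, peer_health: float) -> bool:
    """Decision each device runs on its own side after a dual-master fault."""
    if local_health != peer_health:
        return local_health < peer_health
    # Assumed tie-break: the device with the larger identifier exits.
    return local_id > peer_id


def run_on_both_sides(health_a: float, health_b: float) -> Tuple[bool, bool]:
    """Simulate devices A and B each evaluating the same rule independently."""
    a_exits = should_exit("A", health_a, "B", health_b)
    b_exits = should_exit("B", health_b, "A", health_a)
    return a_exits, b_exits
```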
In a second aspect, an embodiment of this application provides an apparatus for processing a dual-master fault. The apparatus includes an acquisition module, a determining module, and a processing module. The acquisition module is configured to obtain the health state of the second device, where the health state of a device represents the amount of available resources of the device, and a better health state indicates more available resources. The determining module is configured to determine the health state of the first device, and is further configured to determine a target device, which is whichever of the first device and the second device is in worse health. The processing module is configured to close the interfaces on the first device that are used for processing services when the target device is the first device.
In a possible implementation, the apparatus for processing a dual-master fault further includes a calculation module, configured to calculate a first health value based on the values of target features of the first device, where a target feature is a feature that characterizes available resources, the first health value characterizes the health state of the first device, and a higher first health value indicates a better health state.
In a possible implementation, the apparatus for processing a dual-master fault includes a transceiver module, configured to receive, over a second link, a second health value sent by the second device, where the second health value is calculated by the second device based on the values of the target features of the second device, and the second link is different from the first link.
In a possible implementation, when the first device is a forwarding device, the target features include the amount of available resources and a device initialization feature, where the device initialization feature indicates whether the device has completed initialization. The amount of available resources includes at least one of the following: the number of non-faulty transistors in the device's processor, the number of non-faulty transistors in the device's LAN switch (LSW) chip, the number of available uplink ports on the device, the idle rate of the device's processor, or the idle rate of the device's memory.
In a possible implementation, when the first device is a computing device, the target features include at least one of the following: the idle rate of the CPU, the idle rate of the memory, and the amount of idle computing power.
In a possible implementation, when the first device is a storage device, the target features include the amount of available storage resources and the number of available transmission channels for carrying read or write data.
In a possible implementation, the calculation module is further configured to calculate the first health value based on the values of the target features of the first device and the weight of each target feature, where a feature that has a greater influence on service stability is given a larger weight.
In a third aspect, an embodiment of this application provides a device for processing a dual-master fault, including a memory and a processor coupled to the memory. The memory is configured to store computer program code that includes computer instructions. When the processor executes the computer instructions, the device performs the method according to the first aspect or any possible implementation thereof.
In a fourth aspect, an embodiment of this application provides a computer-readable storage medium storing computer instructions. When the computer instructions run on a computing device, the computing device performs the method according to the first aspect or any possible implementation thereof.
In a fifth aspect, an embodiment of this application provides a computer program product that, when run on a computer, causes the computer to perform the method according to the first aspect or any possible implementation thereof.
For the technical effects of the technical solutions in the second to fifth aspects and their possible implementations, refer to the technical effects of the first aspect and its corresponding possible implementations; details are not repeated here.
Drawings
Fig. 1 is a schematic diagram of an Internet system with a master/standby relationship according to an embodiment of this application;
Fig. 2 (A) is a first schematic diagram of a data forwarding system according to an embodiment of this application;
Fig. 2 (B) is a second schematic diagram of a data storage system according to an embodiment of this application;
Fig. 2 (C) is a schematic diagram of a data computing system according to an embodiment of this application;
Fig. 3 is a schematic diagram of the hardware structure of a device for processing a dual-master fault according to an embodiment of this application;
Fig. 4 is a first schematic flowchart of a method for processing a dual-master fault according to an embodiment of this application;
Fig. 5 is a schematic flowchart of a method for processing a dual-master fault according to an embodiment of this application;
Fig. 6 is a schematic flowchart of a method for processing a dual-master fault according to an embodiment of this application;
Fig. 7 is a schematic structural diagram of an apparatus for processing a dual-master fault according to an embodiment of this application.
Detailed Description
The term "and/or" in this specification describes only an association between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate that only A exists, that both A and B exist, or that only B exists.
The terms "first" and "second," and the like, in the description and in the claims of the embodiments of the present application are used for distinguishing between different objects and not for describing a particular order of the objects. For example, the first device and the second device, etc. are for distinguishing different devices, and are not for describing a particular order of the devices.
In the embodiments of the present application, words such as "exemplary" or "for example" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "such as" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present relevant concepts in a concrete fashion.
First, some concepts involved in the method, device, and storage medium for processing a dual-master fault provided in the embodiments of this application are explained.
Dual-master fault: a fault in which the first link connecting a master device and a standby device fails, so that both devices become master devices. The first link is used for mutual data synchronization between the master device and the standby device.
For example, in the system shown in Fig. 1, device A and device B have a master/standby relationship and synchronize data with each other over a first link. When the first link fails, the original standby device (device B) switches to master, so that device A and device B are both master devices. As a result, when the server triggers a task to upload data a to the Internet, device A and device B each forward data a to the Internet, and data a is uploaded more than once.
The method for processing a dual-master fault provided in the embodiments of this application can be applied to systems that contain master and standby devices, such as a data forwarding system, a data storage system, or a data computing system.
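The Fig. 1 example can be reproduced with a toy model. The `Device` class and the rule that every device acting as master forwards each upload are simplifications chosen for illustration, not the forwarding logic of a real switch.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class Device:
    name: str
    is_master: bool
    uploaded: List[str] = field(default_factory=list)


def on_first_link_failure(standby: Device) -> None:
    """The standby no longer hears the master over the first link and promotes itself."""
    standby.is_master = True


def upload(data: str, devices: Iterable[Device]) -> int:
    """Every device acting as master forwards the data; return how many did."""
    count = 0
    for device in devices:
        if device.is_master:
            device.uploaded.append(data)
            count += 1
    return count
```

Before the first link fails, `upload("data a", [a, b])` returns 1; after `on_first_link_failure(b)`, the same call returns 2, which is the duplicate upload described above.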
When the method for processing a dual-master fault provided in the embodiments of this application is applied to the data forwarding system shown in Fig. 2 (A), the data forwarding system includes a task initiating device 201, a primary forwarding device 202, a standby forwarding device 203, and a task receiving device 204. A specific implementation of the data forwarding system is shown in Fig. 1.
The task initiating device 201 is configured to trigger a task, and may be any device capable of triggering tasks, such as a server, a desktop computer, or a notebook computer.
The primary forwarding device 202 and the standby forwarding device 203 are configured to forward tasks triggered by the task initiating device 201. When the task initiating device 201 triggers a task, one of the two forwarding devices forwards it, and the load balancing policy in the system determines which one, so that the task ultimately reaches the task receiving device 204. A first link between the primary forwarding device 202 and the standby forwarding device 203 is used to synchronize data between them. A second link between them is used to detect and handle a dual-master fault when one occurs between the primary forwarding device 202 and the standby forwarding device 203; for the detection and handling procedure, refer to the method described below. The primary forwarding device 202 and the standby forwarding device 203 may be devices with a task forwarding function, such as switches or routers.
The task receiving device 204 is configured to receive a task that the task initiating device 201 sends through the primary forwarding device 202 or the standby forwarding device 203. The task receiving device 204 may be any device capable of receiving tasks, such as a server, a desktop computer, or a notebook computer.
When the method for processing a dual-master fault provided in the embodiments of this application is applied to the data storage system shown in Fig. 2 (B), the data storage system includes a task triggering device 205, a primary storage device 206, and a standby storage device 207.
The task triggering device 205 is configured to trigger data read/write tasks, and may be any device capable of triggering such tasks, such as a server, a desktop computer, or a notebook computer.
The primary storage device 206 and the standby storage device 207 are configured to process data read/write tasks triggered by the task triggering device 205. When the task triggering device 205 triggers a data read/write task, one of the two storage devices processes it, as determined by the load balancing policy in the system. A first link between the primary storage device 206 and the standby storage device 207 is used to synchronize data between them. A second link between them is used to monitor and handle a dual-master fault when one occurs between the primary storage device 206 and the standby storage device 207; for the handling procedure, refer to the method section below. The primary storage device 206 and the standby storage device 207 may be storage devices with data read/write capability, for example relational databases (such as MySQL, Oracle, or SQL Server) or non-relational databases (such as Redis, MongoDB, or HBase).
When the method for processing a dual-master fault provided in the embodiments of this application is applied to the data computing system shown in Fig. 2 (C), the data computing system includes a task triggering device 208, a primary computing device 209, and a standby computing device 210.
The task triggering device 208 is configured to trigger data computing tasks, and may be any device capable of triggering such tasks, such as a server, a desktop computer, or a notebook computer.
The primary computing device 209 and the standby computing device 210 are configured to process data computing tasks triggered by the task triggering device 208. As in Fig. 2 (B), when the task triggering device 208 triggers a data computing task, one of the primary computing device 209 and the standby computing device 210 processes it, as determined by the load balancing policy in the system. A first link between the primary computing device 209 and the standby computing device 210 is used to synchronize data between them. A second link between them is used to monitor and handle a dual-master fault when one occurs between the primary computing device 209 and the standby computing device 210; for the handling procedure, refer to the method section below. The primary computing device 209 and the standby computing device 210 may be devices with data computing capability, such as servers.
For example, Fig. 3 is a schematic hardware diagram of a device for processing a dual-master fault (such as the primary forwarding device 202, the standby forwarding device 203, the primary storage device 206, or the standby storage device 207 described above) according to an embodiment of this application. The device includes a processor 301, a memory 302, and a network interface 303.
The processor 301 includes one or more CPUs. Each CPU may be a single-core CPU (single-CPU) or a multi-core CPU (multi-CPU). A CPU contains many transistors, and the more transistors per unit area, the stronger the processing capability of the CPU.
The memory 302 includes, but is not limited to, RAM, ROM, EPROM, flash memory, or optical memory.
Optionally, the processor 301 implements the method for processing a dual-master fault provided in the embodiments of this application by reading instructions stored in the memory 302, or by using instructions stored internally. When the processor 301 implements the method by reading instructions from the memory 302, the memory 302 stores the instructions for implementing the method.
The network interface 303 is a wired interface (port), such as a Fiber Distributed Data Interface (FDDI) or Gigabit Ethernet (GE) interface, or a wireless interface. The network interface 303 includes multiple physical ports and is used to receive data, forward data, obtain the health state of the peer device, and so on.
Optionally, the device further includes a bus 304; the processor 301, the memory 302, and the network interface 303 are usually interconnected through the bus 304, or in other ways.
An embodiment of this application provides a method for processing a dual-master fault. The method is applied to a first device that has a master/standby relationship with a second device. As shown in Fig. 4, the method may include S400 to S402.
S400: The first device monitors for a dual-master fault between the first device and the second device.
It should be noted that the first device may be either the master device or the standby device: when the first device is the master device, the second device is the standby device, and vice versa. The first device and the second device may both be forwarding devices as in Fig. 2 (A), both be storage devices as in Fig. 2 (B), or both be computing devices as in Fig. 2 (C). The embodiments of this application do not limit the master/standby role or the device type of the first device.
A dual-master fault is a fault in which the first link between the first device and the second device fails, so that both the first device and the second device become master devices. The first link is used to synchronize data between the first device and the second device, so that if one of the two devices in a master/standby relationship goes down, the other can take over its work.
It should be understood that, when the first device and the second device have a master/standby relationship, they present a common physical address (referred to as the external MAC address) to the other devices in the current system, derived from their respective physical (MAC) addresses; the external MAC address is the MAC address of whichever of the two is the master device. When there is no dual-master fault between the first device and the second device, they send each other their original MAC addresses over the second link. When a dual-master fault exists, the standby device also becomes a master device, and the first device and the second device then both send the external MAC address to each other over the second link.
Illustratively, assume that the first device is the master device, with MAC address abc.. 1, and the second device is the standby device, with MAC address ABC. The external MAC address of the first device and the second device is then abc.. 1. When there is no dual-master fault between the first device and the second device, the second device sends its own MAC address ABC. When a dual-master fault exists between the first device and the second device, the MAC address used by the second device becomes the external MAC address, so the second device sends the MAC address abc.
Specifically, in S400 the first device makes the determination based on its own MAC address and the MAC address of the second device: if the two MAC addresses differ, there is no dual-master fault between the first device and the second device; if they are the same, a dual-master fault exists.
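The MAC exchange on the second link can be modelled as follows. The addresses in the example above are truncated in the source, so the sketch uses placeholder addresses from the documentation range 00-00-5E-00-53-xx; the rule that a master advertises the external MAC and a standby advertises its own is a reading of the description, not a confirmed protocol detail.

```python
def advertised_mac(own_mac: str, external_mac: str, is_master: bool) -> str:
    """MAC address a device sends to its peer over the second link.

    Assumed model: a master presents the external MAC address (which is the
    original master's own address); a standby sends its own address.
    """
    return external_mac if is_master else own_mac


EXTERNAL = "00:00:5e:00:53:01"  # placeholder documentation-range addresses
MAC_A = "00:00:5e:00:53:01"     # device A: original master, so its MAC is external
MAC_B = "00:00:5e:00:53:02"     # device B: standby

# Normal operation: A is master, B is standby, so the two addresses differ.
normal = (advertised_mac(MAC_A, EXTERNAL, True), advertised_mac(MAC_B, EXTERNAL, False))
# Dual-master fault: B has promoted itself, so both send the external address.
fault = (advertised_mac(MAC_A, EXTERNAL, True), advertised_mac(MAC_B, EXTERNAL, True))
```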
As shown in fig. 5, S400 specifically includes S400a and S400b.
S400a, the first device receives the second MAC address sent by the second device.
The second MAC address is a MAC address of the second device.
It should be noted that, because the dual-master failure is caused by a failure of the first link between the first device and the second device, the first device receives the second MAC address over the second link. The first link is the link over which the first device and the second device synchronize data with each other, and the second link is any link between the two devices other than the first link.
S400b, the first device determines whether double-master failure exists between the first device and the second device according to the first MAC address and the second MAC address.
The first MAC address is a MAC address of the first device.
When the first MAC address is the same as the second MAC address, a dual-master failure exists between the first device and the second device; when the first MAC address differs from the second MAC address, no dual-master failure exists.
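A minimal sketch of S400b and of the periodic loop around it follows. The function names are hypothetical, and the normalization of case and separators is an assumption added so that `00-00-5E-...` and `00:00:5e:...` compare equal; the source only states that the two addresses are compared.

```python
def normalize_mac(mac: str) -> str:
    """Assumed normalization: ignore case and '-' versus ':' separators."""
    return mac.strip().lower().replace("-", ":")


def has_dual_master(first_mac: str, second_mac: str) -> bool:
    """S400b: a dual-master failure exists when the MAC address received from
    the peer over the second link equals the local MAC address."""
    return normalize_mac(first_mac) == normalize_mac(second_mac)


def next_step(first_mac: str, received_mac: str) -> str:
    """Step the first device runs next: repeat S400a (periodic monitoring)
    while no failure is seen, otherwise continue with S401."""
    return "S401" if has_dual_master(first_mac, received_mac) else "S400a"


print(next_step("00:00:5e:00:53:01", "00-00-5E-00-53-01"))  # S401
print(next_step("00:00:5e:00:53:01", "00:00:5e:00:53:02"))  # S400a
```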
When there is no dual master failure between the first device and the second device, the first device periodically performs the above S400a-S400b.
When there is a double master failure between the first device and the second device, the first device performs the following S401 to S402.
Optionally, the above dual-master failure monitoring may also be performed on both the first device and the second device; the specific implementation is similar to S400a-S400b and is not described herein again.
S401, when a dual-master failure exists between the first device and the second device, the first device determines a target device.
The target device is whichever of the first device and the second device has the poorer health status; a device with a better health status has more available resources.
For example, assume that the processors (CPUs) of the first device and the second device have the same total computing power, the idle computing power of the first device is 80%, and the idle computing power of the second device is 50%. After obtaining the idle computing power of the second device, the first device compares it with its own idle computing power and determines the second device, which has less idle computing power (that is, the poorer health status), as the target device.
The first device obtains the health status of the second device from the second device over the second link. For example, the first device may send an acquisition request to the second device, and the second device returns its health status after receiving the request; alternatively, the second device may proactively send its health status to the first device.
The first device determines its own health status; the specific manner is described in S601 below and is not repeated here.
It should be noted that the health status of a device may be represented by at least one characteristic or by a health value of the device; the specific form of the health status is not limited in the present application.
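The comparison in this example can be sketched as below. It assumes, as the example does, that both CPUs have the same total computing power, so idle fractions are directly comparable. The function name is hypothetical, and the tie case is not addressed by the source; this sketch simply returns `None` there.

```python
from typing import Optional


def target_by_idle_power(first_idle: float, second_idle: float) -> Optional[str]:
    """Return the device that should exit: the one with less idle computing power.

    Assumes equal total CPU computing power on both devices. Ties are not
    covered by the source, so this sketch returns None in that case.
    """
    if first_idle < second_idle:
        return "first"
    if second_idle < first_idle:
        return "second"
    return None


print(target_by_idle_power(0.80, 0.50))  # second
```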
S402, when the target device is the first device, the first device exits the system in which the first device and the second device are located.
Exiting the system in which the first device and the second device are located includes: the first device shuts down, or the first device closes its service interfaces (for example, all of them) so that it no longer processes services. A service interface is an interface used for processing services on the first device; it may be some or all of the network interface 303 in fig. 3, or a virtual interface, and its specific form is not limited in the present application.
The service interface may be an interface for forwarding data, an interface for reading and writing data, or an interface for processing data (for example, converting data format).
For example, in the system shown in fig. 1, when a dual-master failure exists between device A and device B and device A is determined to be the target device, device A shuts down and stops forwarding data, so that device B forwards data in place of device A; alternatively, device A closes its interface for forwarding data to the Internet and its service interface for receiving tasks from the server, so that device B forwards data in place of device A.
It should be noted that when the target device is the second device, the first device ends the method after S401 is executed.
Optionally, in the method for processing a dual-master failure provided in this embodiment of the present application, when the target device is the second device, the first device sends an exit instruction to the second device after completing S401, and after receiving the exit instruction, the second device exits the system in which the first device and the second device are located.
The embodiment of the present application provides a method for processing a dual-master failure, applied to a first device that has a master/standby relationship with a second device. When a dual-master failure exists between the first device and the second device, the device in poorer health autonomously closes its interfaces for processing services, that is, it exits the current system, so that the device in better health continues to work in the current system. Compared with conventional methods for handling dual-master failures, this method uses the health status of the master and standby devices to decide which of them exits the current system: the device in poorer health exits and the device in better health keeps working, which improves the reliability and stability of the current system.
Optionally, the method for processing a dual-master failure provided in the embodiment of the present application may also be executed on both a first device and a second device that have a master/standby relationship. The first device and the second device may both be forwarding devices in diagram (a) of fig. 2, both be storage devices in diagram (b) of fig. 2, or both be computing devices (for example, servers) in diagram (c) of fig. 2. Taking the health value as an example, as shown in fig. 6, a specific implementation of the method may include S601-S608.
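The exit behaviour of S402 and the optional exit instruction can be modelled roughly as follows. Everything here is a hypothetical model for illustration: the `Device` class, its interface names, the `"EXIT"` message, and `handle_target` are not defined by the source, and only the interface-closing variant of exiting (not shutdown) is modelled.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Device:
    """Hypothetical model of one device in the master/standby pair."""
    name: str
    service_interfaces: List[str]
    open_interfaces: Set[str] = field(init=False)
    outbox: List[str] = field(default_factory=list)  # messages sent over the second link

    def __post_init__(self) -> None:
        # All service interfaces start open: the device is processing services.
        self.open_interfaces = set(self.service_interfaces)

    def exit_system(self) -> None:
        # S402, interface-closing variant: stop processing services without powering off.
        self.open_interfaces.clear()


def handle_target(local: Device, target_is_local: bool, send_exit_instruction: bool = False) -> None:
    """Act on the target device chosen in S401."""
    if target_is_local:
        local.exit_system()
    elif send_exit_instruction:
        # Optional variant: instruct the peer to exit instead of just ending the method.
        local.outbox.append("EXIT")


a = Device("A", ["internet-uplink", "server-tasks"])
handle_target(a, target_is_local=True)
print(a.open_interfaces)  # set()
```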
S601, the first device determines a first health value according to the value of the target feature of the first device.
The higher the first health value, the better the health status of the first device; that is, the higher the first health value, the greater the amount of available resources on the first device.
It should be noted that the target feature is a feature that characterizes the available resources on the first device. For example, when the first device is a computing device, the features of its available resources include at least one of: the CPU non-utilization rate, the memory non-utilization rate, and the amount of idle computing power.
For example, assume that the target features are the CPU non-utilization rate and the memory non-utilization rate, and that the health value is the sum of these two rates multiplied by 100. If the CPU non-utilization rate of the first device is 30% and its memory non-utilization rate is 50%, the first health value is (30% + 50%) × 100 = 80.
It should be noted that, in this embodiment of the present application, the first device and the second device may each be a switch, a router, a storage device, or the like; the specific type of the first device and the second device is not limited in this embodiment of the present application.
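The formula in this example can be written as a one-line function. This is a sketch of the example only; the function name is hypothetical, and rates are taken as fractions in [0, 1].

```python
def simple_health_value(cpu_idle_rate: float, mem_idle_rate: float) -> float:
    """Health value from the example: (CPU non-utilization + memory non-utilization) x 100.

    Rates are fractions in [0, 1]; a higher result means more available resources.
    """
    return (cpu_idle_rate + mem_idle_rate) * 100


print(simple_health_value(0.30, 0.50))  # first device in the example
```

Since the result can carry floating-point rounding, comparisons between devices should rely on ordering rather than exact equality with a literal.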
When the first device is a forwarding device, the target feature includes: the amount of available resources and device initialization features.
The device initialization feature is used to indicate whether the device completes initialization, for example, when the value of the device initialization feature is 0, it indicates that the device does not complete initialization; when the value of the device initialization feature is 1, it indicates that the device has completed initialization.
The amount of available resources includes at least one of: the number of non-failed transistors in the device's processor (CPU), the number of non-failed transistors in the device's LAN switch (LSW) chip, the number of available upstream ports of the device, the non-utilization rate of the device's processor, or the non-utilization rate of the device's memory. The LSW chip of the device is used for data forwarding or data switching, and its processing capacity is determined by the number of transistors in it; the upstream ports of the device are the ports connected to the Internet or an aggregation network.
When the first device is a storage device, the target feature includes: the number of available storage resources and the number of available transmission channels for carrying read or write data.
Optionally, when the first device and the second device are forwarding devices, S601 specifically includes: the first device calculates the first health value according to the value of each target feature of the first device and the weight value of that feature.
The greater a target feature's influence on service stability, the larger its weight value.
For example, when the service is a data forwarding service, the target features in descending order of weight may be: the device initialization feature, the number of available upstream ports of the device, the number of non-failed transistors in the device's LSW chip, the number of non-failed transistors in the device's processor, the non-utilization rate of the device's processor, and the non-utilization rate of the device's memory. Whether the first device has completed initialization determines whether it can operate normally at all, so the initialization feature has the largest weight; the number of available upstream ports determines whether received data can be forwarded, so it also has a large weight.
For another example, when the service is a data format conversion service, the features with weighted values from large to small in the target features may be: initialization characteristics of the device, the number of non-failing transistors in the device processor, the non-utilization rate of the device processor, the number of non-failing transistors in the device LSW chip, the non-utilization rate of the device memory, and the available number of device upstream ports.
In one implementation of calculating the first health value from the values and weights of the target features, the health value is the sum, over all target features, of the value of each feature of the first device multiplied by that feature's weight.
For example, assume that the service is a data forwarding service and that the weights are: 300 for the initialization feature, 10 for the number of available upstream ports, 8 for the number of non-failed transistors in the LSW chip, 6 for the number of non-failed transistors in the processor, 4 for the processor non-utilization rate, and 2 for the memory non-utilization rate. For the first device, the initialization feature value is 1, the number of available upstream ports is 20, the number of non-failed transistors in the LSW chip is 30, the number of non-failed transistors in the processor is 40, the processor non-utilization rate is 80%, and the memory non-utilization rate is 60%. Then the first health value = 300 × 1 + 10 × 20 + 8 × 30 + 6 × 40 + 4 × 0.8 + 2 × 0.6 = 300 + 200 + 240 + 240 + 3.2 + 1.2 = 984.4.
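The weighted sum for a forwarding device can be sketched as follows, using the weights and feature values from this example. The feature keys are hypothetical names chosen for illustration, and the check for missing features is an added assumption rather than something the source specifies. Evaluating the products term by term gives 300 + 200 + 240 + 240 + 3.2 + 1.2 = 984.4.

```python
from typing import Dict

# Weights for a data-forwarding service, as given in the example (hypothetical key names).
FORWARDING_WEIGHTS: Dict[str, float] = {
    "initialized": 300,
    "available_uplink_ports": 10,
    "lsw_non_failed_transistors": 8,
    "cpu_non_failed_transistors": 6,
    "cpu_non_utilization": 4,
    "memory_non_utilization": 2,
}


def weighted_health_value(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Sum over all target features of (feature value x feature weight)."""
    missing = weights.keys() - features.keys()
    if missing:
        raise ValueError(f"missing feature values: {sorted(missing)}")
    return sum(features[name] * weight for name, weight in weights.items())


first_device = {
    "initialized": 1,                  # 1 = initialization completed, 0 = not completed
    "available_uplink_ports": 20,
    "lsw_non_failed_transistors": 30,
    "cpu_non_failed_transistors": 40,
    "cpu_non_utilization": 0.8,
    "memory_non_utilization": 0.6,
}
print(weighted_health_value(first_device, FORWARDING_WEIGHTS))  # approximately 984.4
```

A different service (for example, data format conversion) would only need a different weight table with the same keys.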
S602, the first device sends the first health value to the second device through the second link.
The second link is the link between the first device and the second device that is used for dual-master failure detection.
S603, the second device calculates a second health value according to the value of the target feature of the second device.
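The source does not define a message format for sending the health value over the second link. As a hypothetical illustration only, the value could be carried as a single big-endian IEEE 754 double using Python's standard `struct` module; the format string and function names below are assumptions.

```python
import struct

# Hypothetical wire format: one big-endian IEEE 754 double holding the health value.
HEALTH_MSG = struct.Struct("!d")


def encode_health_value(value: float) -> bytes:
    """Serialize a health value for transmission over the second link."""
    return HEALTH_MSG.pack(value)


def decode_health_value(payload: bytes) -> float:
    """Parse a health value received over the second link."""
    (value,) = HEALTH_MSG.unpack(payload)
    return value


msg = encode_health_value(984.4)
print(len(msg), decode_health_value(msg))   # 8 984.4
# For comparison, sending all six forwarding-device features as doubles:
print(struct.calcsize("!6d"))               # 48
```

Under this assumed encoding, sending one aggregated value instead of every feature is what keeps the per-exchange payload small.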
The second health value is calculated by the second device according to the value of the target feature of the second device, wherein the first health value and the second health value are calculated in the same manner.
Illustratively, based on the example in S601, assume that the CPU non-utilization rate of the second device is 40% and its memory non-utilization rate is 50%. The second device then calculates the second health value as (40% + 50%) × 100 = 90.
It should be noted that the specific implementation of S603 is similar to S601, and for the specific description of S603, reference may be made to the related description of S601, which is not described herein again.
S604, the second device sends the second health value to the first device through the second link.
It should be noted that the specific implementation of S604 is similar to S602, and for the specific description of S604, reference may be made to the related description of S602, which is not described herein again.
S605, the first device determines the target device according to the first health value and the second health value.
The S605 specifically includes: the first device compares the first health value with the second health value, and determines the device with smaller health value as the target device.
Illustratively, based on the example of S603, the first device determines the device corresponding to the first health value (i.e., the first device) as the target device by comparing the first health value 80 with the second health value 90.
S606, when the target device is the first device, the first device exits the system in which the first device and the second device are located.
It should be noted that the specific implementation of S606 is similar to S402, and for the specific description of S606, reference may be made to the related description of S402, which is not described herein again.
S607, the second device determines the target device according to the first health value and the second health value.
It should be noted that the target device is the same device as the target device in S605.
It should be noted that the specific implementation of S607 is similar to S605, and for the specific description of S607, reference may be made to the related description of S605, which is not described herein again.
S608, when the target device is the second device, the second device exits the system in which the first device and the second device are located.
It should be noted that the specific implementation of S608 is similar to S606, and for the specific description of S608, reference may be made to the related description of S606, and details are not repeated here.
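The symmetric flow S601-S608 can be simulated end to end as below, reusing the CPU/memory example values. This is a sketch, not the patent's implementation: the function names are hypothetical, the message exchange of S602/S604 is modelled as both sides simply reading a shared dictionary, and the tie case (which the source does not cover) is handled by letting neither device exit.

```python
from typing import Dict, Tuple


def health_value(cpu_idle: float, mem_idle: float) -> float:
    # Both devices must use the same formula (S601 and S603); this reuses the
    # (CPU non-utilization + memory non-utilization) x 100 example.
    return (cpu_idle + mem_idle) * 100


def run_both_sides(metrics: Dict[str, Tuple[float, float]]) -> Dict[str, bool]:
    """Simulate S601-S608 on devices "first" and "second".

    metrics maps each device name to its (cpu_idle, mem_idle) rates. Returns,
    per device, whether that device decides to exit the system.
    """
    # S601 / S603: each device computes its own health value.
    values = {name: health_value(*rates) for name, rates in metrics.items()}
    decisions: Dict[str, bool] = {}
    for local, peer in (("first", "second"), ("second", "first")):
        # S602 / S604: after the exchange, each side holds both values.
        mine, theirs = values[local], values[peer]
        # S605 / S607: the smaller value marks the target device.
        # S606 / S608: a device exits only if it is the target; on a tie
        # (not covered by the source) neither device exits in this sketch.
        decisions[local] = mine < theirs
    return decisions


print(run_both_sides({"first": (0.30, 0.50), "second": (0.40, 0.50)}))
# {'first': True, 'second': False}
```

Because both sides apply the same formula and the same comparison to the same pair of values, they reach the same conclusion independently, which is why exactly one device exits when the values differ.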
In this method for processing a dual-master failure, the first device determines the target device from its own first health value and the second health value sent by the second device. Therefore, when the target feature comprises multiple features, the first device only needs to receive a single health value from the second device rather than the value of every target feature of the second device, which saves the bandwidth the second device uses to report its health status to the first device.
In addition, because the method for processing a dual-master failure provided in the embodiment of the present application is executed on both the first device and the second device, whichever device is determined to be the target device exits the current system immediately, which improves the reliability and stability of the current system.
The solutions provided in the embodiments of the present application are described above mainly from the perspective of the method. To implement the above functions, the solutions include corresponding hardware structures and/or software modules for performing each function. Those skilled in the art will readily appreciate that the example units and algorithm steps described in connection with the embodiments disclosed herein may be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementations should not be considered beyond the scope of the present application.
In the embodiments of the present application, the dual-master failure processing apparatus may be divided into functional modules according to the above method example; for example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in hardware or as a software functional module. It should be noted that the division of modules in the embodiments of the present application is illustrative and is merely a logical function division; other divisions are possible in actual implementation.
Fig. 7 shows a schematic diagram of a possible structure of the dual-master failure processing apparatus in the above embodiment when functional modules are divided by function. As shown in fig. 7, the dual-master failure processing apparatus includes: an obtaining module 701, a determining module 702, and a processing module 703.
The obtaining module 701 is configured to obtain a health status of the second device.
The determining module 702 is configured to determine a health status of the first device and determine a target device; for example, step S401 in the above-described method embodiment is performed.
The processing module 703 is configured to quit a system where the first device and the second device are located when the target device is the first device; for example, step S402 in the above-described method embodiment is performed.
Optionally, the apparatus for processing dual master failures further includes: a calculation module 704.
The calculating module 704 is configured to calculate a first health value according to the value of the target feature of the first device; for example, step S601 in the above-described method embodiment is performed.
Optionally, the apparatus for processing dual master failure further includes: a transceiver module 705.
The transceiver module 705 is configured to receive, over a second link, a second health value sent by the second device; for example, it performs step S604 in the above method embodiment.
Optionally, the calculating module 704 is configured to calculate the first health value according to the value of the target feature of the first device and the weight value of each feature in the target feature.
Each module of the above dual-master failure processing device may also be configured to execute other actions in the above method embodiment, and all relevant contents of each step related to the above method embodiment may be referred to the functional description of the corresponding functional module, which is not described herein again.
Some or all of the steps of the determining module 702, the processing module 703, and the calculating module 704 may be implemented by the processor 301 in fig. 3 executing the code in the memory 302. Some or all of the steps of the obtaining module 701 and the transceiver module 705 may be implemented by the network interface 303 or the bus 304 in fig. 3.
The above embodiments may be implemented wholly or partially by software, hardware, firmware, or any combination thereof. When implemented using a software program, they may be implemented wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are wholly or partially produced. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, wireless, microwave). The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device, such as a server or data center, that integrates one or more usable media. The usable medium may be a magnetic medium (e.g., floppy disk, magnetic tape), an optical medium (e.g., digital video disc (DVD)), or a semiconductor medium (e.g., solid state drive (SSD)), among others.
Through the above description of the embodiments, it is clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions. For the specific working processes of the system, the apparatus and the unit described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not described here again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: flash memory, a removable hard disk, read-only memory, random access memory, a magnetic disk, an optical disc, and the like.
The above description is only an embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope disclosed in the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (8)

1. A method for processing a dual-master failure, characterized in that the method is applied to a master/standby system comprising a first device and a second device, a first link between the first device and the second device has failed, and the first link is used by the first device and the second device to synchronize data with each other; the method comprises:
the first device acquires the health state of the second device, wherein the health state of the device is used for representing the amount of available resources of the device, and the better the health state of the device is, the more available resources of the device are represented;
the first device determining a health status of the first device;
the first device determines a target device, wherein the target device is whichever of the first device and the second device has the poorer health status;
and when the target device is the first device, the first device closes an interface used for processing services on the first device.
2. The method of claim 1, wherein the first device determining the health status of the first device comprises:
the first device calculates a first health value according to a value of a target feature of the first device, wherein the target feature is a feature for characterizing available resources, and the first health value is used for characterizing the health status, and the higher the first health value is, the better the health status of the first device is.
3. The method of claim 2, wherein the first device obtaining the health status of the second device comprises:
the first device receives a second health value sent by the second device through a second link, the second health value is a health value calculated by the second device according to the value of the target feature of the second device, and the first link is different from the second link.
4. The method according to any one of claims 1 to 3,
when the first device is a forwarding device, the target feature includes: an amount of available resources and a device initialization feature, wherein the device initialization feature indicates whether the device has completed initialization; the amount of available resources includes at least one of: the number of non-failed transistors in the device processor, the number of non-failed transistors in the device's LAN switch (LSW) chip, the number of available upstream ports of the device, the non-utilization rate of the device processor, or the non-utilization rate of the device memory.
5. The method according to any one of claims 2-4, wherein the first device calculates a first health value from a value of a target feature of the first device, comprising:
the first device calculates the first health value according to the value of the target feature of the first device and the weight value of each feature in the target feature, wherein a feature that has a greater influence on service stability has a larger weight value.
6. The method according to any one of claims 1-5, further comprising:
the second device acquires the health status of the first device;
the second device determines the health status of the second device;
and when the target device is the second device, the second device exits the system in which the first device and the second device are located.
7. A dual master failure processing device comprising a memory and a processor, the memory coupled to the processor; the memory for storing computer program code, the computer program code comprising computer instructions; the computer instructions, when executed by the processor, cause the processor to perform the method of any of claims 1 to 6.
8. A computer storage medium comprising computer instructions that, when executed on a computing device, cause the computing device to perform the method of any of claims 1 to 6.
CN202210768261.2A 2022-07-01 2022-07-01 Method, equipment and storage medium for processing double-master fault Pending CN115348156A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210768261.2A CN115348156A (en) 2022-07-01 2022-07-01 Method, equipment and storage medium for processing double-master fault

Publications (1)

Publication Number Publication Date
CN115348156A 2022-11-15

Family

ID=83948053

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102347867A * 2011-11-14 2012-02-08 杭州华三通信技术有限公司 Processing method and equipment for stacking splitting detection
CN104158707A * 2014-08-29 2014-11-19 杭州华三通信技术有限公司 Method and device for detecting and processing split-brain in a cluster
CN105281951A * 2015-09-29 2016-01-27 北京星网锐捷网络技术有限公司 Double-main-device conflict detection method for VSU system, and network equipment
CN106534399A * 2016-11-22 2017-03-22 杭州迪普科技股份有限公司 Virtual switch matrix (VSM) splitting detection methods and apparatuses
US20180176073A1 * 2016-12-21 2018-06-21 Nicira, Inc. Dynamic recovery from a split-brain failure in edge nodes
CN109067934A * 2018-08-10 2018-12-21 新华三技术有限公司 Address conflict processing method and processing device
US20190052520A1 * 2017-08-14 2019-02-14 Nicira, Inc. Cooperative active-standby failover between network systems
CN111585838A * 2020-04-29 2020-08-25 杭州迪普科技股份有限公司 Splitting detection method and device based on VSM system
CN112104548A * 2020-08-28 2020-12-18 新华三信息安全技术有限公司 Communication method and device
US20210067403A1 * 2019-08-30 2021-03-04 Versa Networks, Inc. Method and apparatus for split-brain avoidance in sub-secondary high availability systems
CN112787960A * 2020-11-30 2021-05-11 北京东土军悦科技有限公司 Stack splitting processing method, device and equipment and storage medium
CN113438105A * 2021-06-21 2021-09-24 新华三技术有限公司 Method, device and equipment for assisting multi-IRF (Intelligent Resilient Framework) split detection by MAD (Multi-Active Detection)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination