CN114490152A - Method for establishing dual-computer complete machine level hot standby system - Google Patents

Publication number
CN114490152A
Authority
CN
China
Prior art keywords
machine
switching
state
slave
computer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111681040.3A
Other languages
Chinese (zh)
Inventor
姜寅啸
姜姗姗
唐学术
郭照峰
张腾
朱瓅
陈韬
魏巍
郑佳盈
荆翰谊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Aerospace Measurement and Control Technology Co Ltd
Original Assignee
Beijing Aerospace Measurement and Control Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Aerospace Measurement and Control Technology Co Ltd filed Critical Beijing Aerospace Measurement and Control Technology Co Ltd
Priority to CN202111681040.3A priority Critical patent/CN114490152A/en
Publication of CN114490152A publication Critical patent/CN114490152A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751: Error or fault detection not based on redundancy
    • G06F11/0754: Error or fault detection not based on redundancy by exceeding limits
    • G06F11/0757: Error or fault detection not based on redundancy by exceeding limits by exceeding a time limit, i.e. time-out, e.g. watchdogs
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16: Error detection or correction of the data by redundancy in hardware
    • G06F11/20: Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202: Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023: Failover techniques
    • G06F11/2033: Failover techniques switching over of hardware resources

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention provides a method for establishing a dual-computer complete-machine-level hot standby system. By exchanging heartbeat information between the two computers, the method implements master-slave identification at system initialization, intelligent dual-computer switching, and dual-computer information synchronization. Intelligent switching consists of four parts: acquisition of detection parameter states, single-machine fault diagnosis, switching interpretation, and switching control. The method gives the dual-computer system self-diagnosis and intelligent switching, which improves its degree of intelligence and its reliability.

Description

Method for establishing dual-computer complete machine level hot standby system
Technical Field
The invention relates to the technical field of server database backup, in particular to a method for establishing a dual-computer complete-machine-level hot standby system.
Background
Dual-computer hot standby means that two computers back each other up and jointly execute the same task. When one computer fails, the other takes over the service, so the system continues to provide service automatically, without manual intervention.
Existing dual-server hot standby has two typical modes. In the first, called the sharing mode, two servers achieve hot standby through a shared storage device (generally a shared disk array or a storage area network, SAN) together with installed dual-server software. The second is implemented purely in software and is commonly called the pure software mode or the mirror mode (Mirror).
In the sharing mode, the two servers provide services externally through a virtual IP address, and each service request is handled by one of the servers according to the working mode. Each server also monitors the other over a heartbeat line (usually a private network); when one server fails, the other detects this from the heartbeat, switches over, and takes over the service. This mode requires a shared storage device in the dual-computer system, which increases hardware cost and makes the shared storage device a single point of failure.
In the pure software mode, mirroring-capable dual-computer software copies data to the other server in real time, so that both servers hold the same data. If one server fails, a relatively complex data recovery and synchronization must be carried out after it recovers, and the system is unprotected during that period.
Both the sharing mode and the mirror mode have three major problems: 1. the dual-computer system lacks intelligent initialization logic, so the master and the slave must be specified manually after power-on; 2. the system has no single-machine fault diagnosis function and switches only "after the fact", that is, master-slave switching takes place only once the master has failed; 3. after a dual-computer switchover, data recovery and synchronization take a long time, which easily produces a large time "breakpoint".
Disclosure of Invention
In view of this, the present invention provides a method for establishing a dual-computer complete-machine-level hot standby system that implements intelligent initialization and single-machine fault diagnosis and keeps the data recovery and synchronization time short after a dual-computer switchover.
To achieve this purpose, the technical scheme of the invention is as follows:
The method for establishing a dual-computer complete-machine-level hot standby system comprises three parts: master-slave identification at dual-computer system initialization, intelligent dual-computer switching, and dual-computer information synchronization, as follows:
(1) for master-slave identification at initialization, the dual-computer system adopts a dual arbitration strategy: the machine that feeds the watchdog first becomes the master, and, failing that, the machine with the smaller IP address becomes the master. This covers both the case where the two machines start one after the other and the case where they start at almost the same time;
(2) intelligent dual-computer switching is implemented through acquisition of detection parameter states, single-machine fault diagnosis, switching interpretation, and switching control;
(3) for dual-computer information synchronization, a full-duplex communication mode and a time-slot division principle are used to synchronize the master and the slave bidirectionally.
The master-slave identification at dual-computer system initialization comprises the following:
(1) a minimum watchdog-feeding interval T is set for the first layer of arbitration; if the interval between the two machines' first watchdog feeds is smaller than T, master-slave identification cannot be achieved by the first layer of arbitration alone;
(2) heartbeat information is transmitted between the two machines in full-duplex mode;
(3) each machine starts sending heartbeat information from its first watchdog feed;
(4) each machine starts its timer from its first watchdog feed;
(5) after power-on, both machines are initially in the shutdown state, and the machine that feeds the watchdog first waits. Let T' be the actual interval between the two machines' first watchdog feeds. If T' > T, that is, the actual interval is greater than the set minimum interval, then when the timer of the first-feeding machine reaches T it has not received any heartbeat from the peer machine; it sets its "no peer signal received" flag bit to 1 and, according to this flag bit, sets its own working state to the standalone state;
(6) through the exchange of heartbeat information, the second-feeding machine sees that the working state of the first-feeding machine is standalone and sets its own working state to slave; the first-feeding machine then sees that the working state of the second-feeding machine is slave and sets its own working state to master. This completes initialization for the case where the actual feeding interval is greater than the set minimum interval;
(7) if the actual feeding interval is smaller than the set minimum interval, then when the timer of the first-feeding machine reaches T it has already received heartbeat information from the peer machine, so its "no peer signal received" flag bit remains 0. According to this flag bit, the first-feeding machine does not set its working state to standalone but stays in the shutdown state; master-slave identification cannot proceed, and the second layer of arbitration is required;
(8) the two machines exchange heartbeat information and compare the last two digits of their IP addresses. The machine whose last two digits are smaller than the peer's sets its working state to standalone; the machine with the larger IP address sees that the machine with the smaller IP address is standalone and sets its own working state to slave; the machine with the smaller IP address then sees that the machine with the larger IP address is slave and sets its own working state to master. This completes initialization for the case where the actual feeding interval is smaller than the set minimum interval.
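The two arbitration layers above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the names `Machine`, `ip_key`, and `arbitrate` are hypothetical, the state codes are the ones given later in the detailed description, "last two digits" is read as the last two decimal digits of the final octet, and the case T' = T (which the text does not cover) is sent to the second layer.

```python
from dataclasses import dataclass
from typing import Tuple

# Working-state codes as listed in the detailed description.
SHUTDOWN, STANDALONE, MASTER, SLAVE = 0b0000, 0b0001, 0b0010, 0b0011


@dataclass
class Machine:
    ip: str
    first_feed: float          # time of the first watchdog feed (hypothetical unit: seconds)
    state: int = SHUTDOWN      # both machines start in the shutdown state


def ip_key(ip: str) -> int:
    """Assumption: compare the last two decimal digits of the final octet."""
    return int(ip.rsplit(".", 1)[-1]) % 100


def arbitrate(a: Machine, b: Machine, t_min: float) -> Tuple[Machine, Machine]:
    """Return (master, slave) after initialization."""
    first, second = sorted((a, b), key=lambda m: m.first_feed)
    gap = second.first_feed - first.first_feed   # T' in the text
    if gap > t_min:
        # Layer 1: the first feeder's timer reaches T with no peer heartbeat.
        winner, loser = first, second
    else:
        # Layer 2: the machine with the smaller IP key wins.
        ka, kb = ip_key(a.ip), ip_key(b.ip)
        if ka == kb:
            raise ValueError("equal IP keys are not covered by the description")
        winner, loser = (a, b) if ka < kb else (b, a)
    winner.state = STANDALONE   # winner first declares itself standalone
    loser.state = SLAVE         # peer sees a standalone machine and becomes slave
    winner.state = MASTER       # winner sees a slave and becomes master
    return winner, loser
```

Calling `arbitrate(Machine("192.168.1.20", 0.0), Machine("192.168.1.10", 5.0), t_min=1.0)` resolves in the first layer, whereas the same pair with feed times 0.0 and 0.2 falls through to the IP comparison.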
The intelligent dual-computer switching comprises the following:
(1) acquisition of detection parameter states, covering the crash state and the states of key parameters. For the crash state, application-layer software periodically feeds the watchdog of the fault diagnosis module, supplying it with a periodic high-level pulse signal, from which the crash state is judged; for the key parameters, the application layer periodically writes their health states into the fault diagnosis module;
(2) single-machine fault diagnosis: after the health states of the detected items are obtained, the fault diagnosis module diagnoses the single machine using the priority configuration of the detected items and a diagnosis algorithm, yielding the current overall health state of the local machine;
(3) switching interpretation, which comprises manual switching interpretation and automatic switching interpretation. Once the health state of the local machine is known, manual switching interpretation decides whether switching is needed from the health state of the local machine, the health state of the peer machine, the working state of the local machine, and the manual switching instruction, and outputs a switching flag bit; automatic switching interpretation decides whether switching is needed from the health state of the local machine, the health state of the peer machine, and the working state of the local machine, and outputs a switching flag bit;
(4) switching control, which determines the current working state of the local machine from the peer machine's switching response, switching completion, and working state.
The acquisition of detection parameter states comprises the following:
(1) the detection parameter states comprise the crash state and the health states of key parameters; the crash state has 2 values, crashed and not crashed, and the health state of a key parameter has 3 values: severely abnormal, slightly abnormal, and normal;
(2) after the controller software is initialized, the application layer periodically sends watchdog-feeding pulses to the redundancy control module through the software interface, timed by the operating system clock;
(3) a watchdog timer is designed into the redundancy control module; each detected watchdog-feeding pulse resets the watchdog timer, and if no watchdog-feeding signal is detected for 10 consecutive operating system clock cycles, the computer is judged to have crashed.
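The crash judgment in step (3) can be sketched as a counter of consecutive clock cycles without a feed pulse. The class name `WatchdogTimer` and its `tick` method are hypothetical, and latching the crash verdict once it has been reached is an assumption; the text only gives the threshold of 10 cycles.

```python
class WatchdogTimer:
    """Counts consecutive OS clock cycles in which no feed pulse was seen."""

    def __init__(self, timeout_cycles: int = 10):
        self.timeout_cycles = timeout_cycles   # 10 cycles, per the description
        self.missed = 0
        self.crashed = False

    def tick(self, fed: bool) -> bool:
        """Call once per OS clock cycle; returns True once a crash is judged."""
        if fed:
            self.missed = 0                    # a feed pulse resets the timer
        else:
            self.missed += 1
        if self.missed >= self.timeout_cycles:
            self.crashed = True                # assumption: the verdict is latched
        return self.crashed
```

A caller would invoke `tick(True)` in every cycle where the application layer delivered its pulse and `tick(False)` otherwise; the tenth consecutive `tick(False)` is the first call to return `True`.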
The switching interpretation comprises the following:
(1) whether it is the slave or the master that suffers a crash fault, the crashed machine is automatically isolated and the other machine becomes a standalone machine;
(2) when neither machine has crashed, both manual and automatic switching are initiated by the slave: the slave receives the instruction and changes its working state first, and the master then changes its working state according to the slave's new state and the switching flag bit given by the slave; if the master receives the instruction, it makes no response;
(3) when the health state of the slave is lower than or equal to that of the master, automatic switching is not performed;
(4) when the health state of the slave is severely abnormal, a non-crash fault that does not require isolation, or a non-crash fault that requires isolation, manual switching is not responded to, that is, the switching flag bit is 0.
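The four conventions above can be read as a small decision function. The sketch below is illustrative only: Table 2 is not reproduced in this text, so the function assumes that a switch is requested in every case the conventions do not rule out, and the ranking of health grades from worst to best is likewise an assumption. The names `Health` and `switch_flag` are hypothetical.

```python
from enum import IntEnum


class Health(IntEnum):
    # Assumed ranking, worst (0) to best (5).
    CRASH = 0
    FAULT_ISOLATE = 1       # non-crash fault requiring isolation
    FAULT_NO_ISOLATE = 2    # non-crash fault not requiring isolation
    SEVERE = 3
    SLIGHT = 4
    NORMAL = 5


def switch_flag(local_role: str, local_health: Health,
                peer_health: Health, manual_cmd: bool = False) -> int:
    """Return the switching flag bit (1 = request a switch) for the local machine."""
    # Convention 2: only the slave initiates; the master makes no response.
    if local_role != "slave":
        return 0
    # Convention 1: a crash is handled by isolation, not by a negotiated switch.
    if Health.CRASH in (local_health, peer_health):
        return 0
    if manual_cmd:
        # Convention 4: a slave that is severely abnormal or worse ignores manual commands.
        return 0 if local_health <= Health.SEVERE else 1
    # Convention 3: no automatic switch unless the slave is healthier than the master.
    return 1 if local_health > peer_health else 0
```

For example, `switch_flag("slave", Health.NORMAL, Health.SLIGHT)` requests an automatic switch, while the same call with equal health grades on both sides does not.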
The switching control comprises the following steps:
(1) the local machine first changes its own state according to the switching interpretation result. Under the switching interpretation according to claim 6, switching can be performed only when the local machine is the slave and a switching condition of Table 2 is met, so step 1 of switching control is always the slave being promoted to master;
(2) once the slave has been promoted to master, it sets its switching response bit to 1 and sends it to the peer machine in the heartbeat information packet;
(3) when the original master finds that the peer machine has been promoted to master and that the peer's switching response bit is 1, it demotes its own working state to slave, sets its switching completion bit to 1, and sends its working state and switching completion bit to the newly promoted master in the heartbeat information packet;
(4) when the newly promoted master finds that the peer machine has been demoted from master to slave and that the peer's switching completion bit is 1, it resets its own switching response bit to 0 and sends its working state and switching response bit to the peer machine in the heartbeat;
(5) when the newly demoted slave finds that the peer machine's working state is master and that the peer's switching response bit is 0, it resets its own switching completion bit to 0.
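The five steps form a handshake over two bits, which the following sketch simulates. The names `Node`, `promote`, and `on_heartbeat` are hypothetical, and each heartbeat is modelled simply as one machine reading the other's current fields; real packets, timing, and loss handling are outside the scope of this sketch.

```python
from dataclasses import dataclass

MASTER, SLAVE = 0b0010, 0b0011   # working-state codes from the description


@dataclass
class Node:
    state: int
    switch_response: int = 0     # set by the newly promoted master
    switch_done: int = 0         # set by the newly demoted slave


def promote(slave: Node) -> None:
    """Steps 1-2: the slave promotes itself and raises its response bit."""
    slave.state = MASTER
    slave.switch_response = 1


def on_heartbeat(me: Node, peer: Node) -> None:
    """React to the peer's state as carried in its heartbeat packet."""
    if (me.state == MASTER and me.switch_response == 0
            and peer.state == MASTER and peer.switch_response == 1):
        me.state = SLAVE             # step 3: original master demotes itself
        me.switch_done = 1
    elif (me.state == MASTER and me.switch_response == 1
            and peer.state == SLAVE and peer.switch_done == 1):
        me.switch_response = 0       # step 4: new master clears its response bit
    elif (me.state == SLAVE and me.switch_done == 1
            and peer.state == MASTER and peer.switch_response == 0):
        me.switch_done = 0           # step 5: new slave clears its completion bit


# Usage: a starts as slave, b as master; three heartbeats complete the switch.
a, b = Node(SLAVE), Node(MASTER)
promote(a)
on_heartbeat(b, a)
on_heartbeat(a, b)
on_heartbeat(b, a)
```

After the last call both bits are back to 0 on both machines, so the same handshake can run again for a later switch.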
The dual-computer information synchronization comprises the following:
(1) while the two machines are working, they exchange information; the synchronized data comprise heartbeat information packets and instruction data. The heartbeat information packet is used for master-slave identification at initialization and for intelligent switching of working states. Instruction data are synchronized for the following case: if the master crashes suddenly while its application layer is executing a flow instruction, the slave, after becoming a standalone machine, needs to find out which instruction the original master was executing before the crash, notify its own application layer, and continue executing from that instruction;
(2) information is synchronized between the two machines in full-duplex mode;
(3) the heartbeat information packets and the instruction data are assigned to separate time slots and synchronized bidirectionally and periodically, that is, the master synchronizes information to the slave and the slave synchronizes information to the master; the synchronization period and the timing can be chosen flexibly.
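One way to picture the time-slot division is a repeating period in which the first part carries the heartbeat packet and the remainder carries instruction data. The slot layout, the function name `slot_payload`, and the millisecond values are all hypothetical; the text states only that the two kinds of data are separated in time and that the period and timing are configurable. Because the link is full duplex, both machines could follow the same schedule at once.

```python
def slot_payload(t_ms: int, period_ms: int, heartbeat_ms: int) -> str:
    """Return which kind of data occupies the link at time t_ms (integer milliseconds)."""
    if not 0 < heartbeat_ms < period_ms:
        raise ValueError("heartbeat slot must be shorter than the period")
    phase = t_ms % period_ms          # position inside the current period
    return "heartbeat" if phase < heartbeat_ms else "instruction"
```

With a 10 ms period and a 2 ms heartbeat slot, `slot_payload(5, 10, 2)` falls in the instruction-data slot and `slot_payload(11, 10, 2)` falls in the heartbeat slot of the next period.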
Advantageous effects
A dual-computer system established by this method initializes intelligently, so there is no need to specify manually which machine is the master and which is the slave after power-on; it provides single-machine fault diagnosis; and after a dual-computer switchover, master-slave data are synchronized at high speed.
Drawings
FIG. 1 is a diagram of a dual-computer hot-standby system architecture according to the present invention;
FIG. 2 is a schematic diagram of initialization identification when the dual-computer watchdog-feeding interval is greater than the set minimum time interval;
FIG. 3 is a schematic diagram showing that initialization identification cannot be performed when the dual-computer watchdog-feeding interval is smaller than the set minimum time interval;
FIG. 4 is a schematic diagram of the determination of the crash state according to the present invention;
FIG. 5 is a schematic diagram of dual-computer information synchronization according to the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
To address the problems of existing dual-computer hot standby systems, the invention provides a method for establishing a dual-computer complete-machine-level hot standby system, as follows.
(1) Existing dual-computer hot standby systems lack intelligent initialization logic. For initialization, the invention therefore adopts a dual arbitration strategy in which the machine that feeds the watchdog first becomes the master and, failing that, the machine with the smaller IP address becomes the master. This handles master-slave identification both when the two machines start one after the other and when they start almost simultaneously, and comprises the following:
A minimum watchdog-feeding interval T is set for the first layer of arbitration; if the interval between the two machines' first watchdog feeds is smaller than T, master-slave identification cannot be achieved by the first layer of arbitration alone.
Dual-computer system initialization takes place in the "initialization" state machine.
Heartbeat information is transmitted between the two machines in full-duplex mode.
Each machine starts sending heartbeat information from its first watchdog feed.
Each machine starts its timer from its first watchdog feed.
After power-on, both machines are initially in the shutdown state 0000.
Let T' be the actual interval between the two machines' first watchdog feeds. If T' > T, that is, the actual interval is greater than the set minimum interval, then when the timer of the first-feeding machine reaches T it has not received any heartbeat from the peer machine; it sets its "no peer signal received" flag bit to 1 and, according to this flag bit, sets its working state to the standalone state 0001.
Through the exchange of heartbeat information, the second-feeding machine sees that the working state of the first-feeding machine is the standalone state 0001 and sets its own working state to the slave state 0011; the first-feeding machine then sees that the working state of the second-feeding machine is the slave state 0011 and sets its own working state to the master state 0010. This completes initialization for the case where the actual feeding interval is greater than the set minimum interval.
If the actual feeding interval is smaller than the set minimum interval, then when the timer of the first-feeding machine reaches T it has already received heartbeat information from the peer machine, so its "no peer signal received" flag bit remains 0. According to this flag bit, the first-feeding machine does not set its working state to the standalone state 0001 but stays in the shutdown state 0000; master-slave identification cannot proceed, and the second layer of arbitration is required.
The two machines exchange heartbeat information and compare the last two digits of their IP addresses. The machine whose last two digits are smaller than the peer's sets its working state to standalone 0001; the machine with the larger IP address sees that the machine with the smaller IP address is standalone 0001 and sets its own working state to slave 0011; the machine with the smaller IP address then sees that the machine with the larger IP address is slave 0011 and sets its own working state to master 0010. This completes initialization for the case where the actual feeding interval is smaller than the set minimum interval.
(2) Existing dual-computer hot standby systems have no single-machine fault diagnosis function and can only switch after the fact. The invention therefore provides intelligent switching of the dual-computer system, which comprises the following:
Acquisition of detection parameter states, covering the crash state and the states of key parameters. For the crash state, application-layer software periodically feeds the watchdog of the fault diagnosis module, supplying it with a periodic high-level pulse signal, from which the crash state is judged; for the key parameters, the application layer periodically writes their health states into the fault diagnosis module.
Single-machine fault diagnosis: after the health states of the detected items are obtained, the fault diagnosis module diagnoses the single machine using the priority configuration of the detected items and a diagnosis algorithm, yielding the current overall health state of the local machine.
Switching interpretation, which comprises manual switching interpretation and automatic switching interpretation. Once the health state of the local machine is known, manual switching interpretation decides whether switching is needed from the health state of the local machine, the health state of the peer machine, the working state of the local machine, and the manual switching instruction, and outputs a switching flag bit; automatic switching interpretation decides whether switching is needed from the health state of the local machine, the health state of the peer machine, and the working state of the local machine, and outputs a switching flag bit.
Switching control, which determines the current working state of the local machine from the peer machine's switching response, switching completion, and working state.
(3) After a dual-computer switchover in existing systems, data recovery and synchronization take a long time and easily produce a large time "breakpoint". The invention therefore provides bidirectional, high-speed, real-time synchronization between the two machines, which comprises the following:
Information is synchronized between the two machines in full-duplex mode.
The heartbeat information packets and the instruction data are assigned to separate time slots and synchronized bidirectionally and periodically, that is, the master synchronizes information to the slave and the slave synchronizes information to the master; the synchronization period and the timing can be chosen flexibly.
Specifically, the master-slave identification at dual-computer system initialization is as follows:
(1) A minimum watchdog-feeding interval T is set for the first layer of arbitration; if the interval between the two machines' first watchdog feeds is smaller than T, master-slave identification cannot be achieved by the first layer of arbitration alone.
(2) Heartbeat information is transmitted between the two machines in full-duplex mode.
(3) Each machine starts sending heartbeat information from its first watchdog feed.
(4) Each machine starts its timer from its first watchdog feed.
(5) After power-on, both machines are initially in the shutdown state (0000).
(6) The machine that feeds the watchdog first waits. Let T' be the actual interval between the two machines' first watchdog feeds. If T' > T, that is, the actual interval is greater than the set minimum interval, then when the timer of the first-feeding machine reaches T it has not received any heartbeat from the peer machine; it sets its "no peer signal received" flag bit to 1 and, according to this flag bit, sets its working state to the standalone state (0001).
(7) Through the exchange of heartbeat information, the second-feeding machine sees that the working state of the first-feeding machine is the standalone state (0001) and sets its own working state to the slave state (0011); the first-feeding machine then sees that the working state of the second-feeding machine is the slave state (0011) and sets its own working state to the master state (0010). This completes initialization for the case where the actual feeding interval is greater than the set minimum interval.
(8) If the actual feeding interval is smaller than the set minimum interval, then when the timer of the first-feeding machine reaches T it has already received heartbeat information from the peer machine, so its "no peer signal received" flag bit remains 0. According to this flag bit, the first-feeding machine does not set its working state to the standalone state (0001) but stays in the shutdown state (0000); master-slave identification cannot proceed, and the second layer of arbitration is required.
(9) The two machines exchange heartbeat information and compare the last two digits of their IP addresses. The machine whose last two digits are smaller than the peer's sets its working state to standalone (0001); the machine with the larger IP address sees that the machine with the smaller IP address is standalone (0001) and sets its own working state to slave (0011); the machine with the smaller IP address then sees that the machine with the larger IP address is slave (0011) and sets its own working state to master (0010). This completes initialization for the case where the actual feeding interval is smaller than the set minimum interval.
2. Intelligent dual-computer switching
(1) The detection parameter states comprise the crash state and the health states of key parameters; the crash state has 2 values, crashed and not crashed, and the health state of a key parameter has 3 values: severely abnormal, slightly abnormal, and normal.
(2) Detection parameters are selected and prioritized. The detection parameters comprise crash, remaining hard disk space, network communication time, backplane voltage, CPU occupancy, and CPU temperature; the priority levels can be configured flexibly according to actual requirements.
(3) Single-machine health states are graded into 6 types: crash fault, non-crash fault requiring isolation, non-crash fault not requiring isolation, severely abnormal, slightly abnormal, and normal. In practice the grading need not be this detailed; this patent gives a fairly comprehensive grading for the purpose of explanation.
(4) After the system is powered on and the controller software is initialized, the application layer periodically writes watchdog-feeding pulses and the health states of the monitored parameters into the redundancy control module through the software interface, timed by the operating system clock (for example, the clock cycle of the VxWorks system is 16.67 ms).
(5) A watchdog timer is designed into the redundancy control module; each detected watchdog-feeding pulse resets the watchdog timer, and if no watchdog-feeding signal is detected for 10 consecutive operating system clock cycles, the computer is judged to have crashed (see FIG. 4). Single-machine fault diagnosis is then performed according to Table 1 by combining the crash state with the health states of the detection parameters.
Table 1. Single-machine fault diagnosis method (X denotes any health state)
[Table 1 appears as an image in the original publication: Figure RE-GDA0003538854190000111]
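The crash detection rule in step (5) can be sketched as a counter of consecutive missed dog feedings. This is a minimal illustration under stated assumptions: the class name `Watchdog` and the helper `timeout_ms` are hypothetical, and latching the crash verdict once it is reached is an assumption, since the text does not say whether a later feeding clears it. With the VxWorks example cycle of 16.67 ms, 10 missed cycles correspond to about 166.7 ms.

```python
class Watchdog:
    """Counts consecutive OS clock cycles without a dog feeding pulse."""

    def __init__(self, limit: int = 10):
        self.limit = limit      # 10 consecutive cycles, per the description
        self.missed = 0         # consecutive cycles without a feeding pulse
        self.crashed = False    # latched verdict (latching is an assumption)

    def tick(self, fed: bool) -> bool:
        """Advance one clock cycle; return True once a crash is judged."""
        if fed:
            self.missed = 0     # a feeding pulse resets the timer
        else:
            self.missed += 1
            if self.missed >= self.limit:
                self.crashed = True
        return self.crashed


def timeout_ms(limit: int = 10, tick_ms: float = 16.67) -> float:
    """Detection latency implied by the cycle count and the clock period."""
    return limit * tick_ms


wd = Watchdog()
for _ in range(9):
    wd.tick(False)              # nine misses: not yet judged as crashed
wd.tick(True)                   # one feeding pulse resets the count
for _ in range(10):
    wd.tick(False)              # ten consecutive misses: crash is judged
print(wd.crashed, round(timeout_ms(), 1))
```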
(6) Based on the single-machine fault diagnosis results of both machines, switching interpretation is performed according to Table 2. The following conventions apply when using Table 2:
Whether it is the slave or the master that crashes, the crashed machine is automatically isolated and the opposite machine becomes a single machine.
When neither machine has crashed, both manual switching and automatic switching are initiated by the slave: the slave receives the instruction and changes its working state first, and the master then changes its working state according to the slave's new state and the switching flag bit given by the slave. If the master receives the instruction, it does not respond.
When the health state of the slave is lower than or equal to that of the master, automatic switching is not performed, that is, the switching flag bit is 0.
When the health state of the slave is serious abnormality, non-crash fault not requiring isolation, or non-crash fault requiring isolation, manual switching is not responded to, that is, the switching flag bit is 0.
Table 2. Dual-machine switching condition interpretation (note: all switching is initiated by the slave)
[Table 2 appears as images in the original publication: Figures RE-GDA0003538854190000121 and RE-GDA0003538854190000131]
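The four conventions stated for Table 2 can be sketched as a small decision function. Because Table 2 itself is available only as an image, this sketch encodes the stated conventions and nothing else; any condition that exists only in the table is not modelled. The names `Health` and `switch_flag` are hypothetical, and the numeric ordering of health levels (higher value means healthier) is an assumption based on the order in which the levels are listed.

```python
from enum import IntEnum


class Health(IntEnum):
    # Assumed ordering: larger value = healthier.
    CRASH = 0
    FAULT_ISOLATE = 1       # non-crash fault requiring isolation
    FAULT_NO_ISOLATE = 2    # non-crash fault not requiring isolation
    SEVERE = 3              # serious abnormality
    SLIGHT = 4              # slight abnormality
    NORMAL = 5


def switch_flag(is_slave: bool, slave_health: Health, master_health: Health,
                manual_request: bool = False) -> int:
    """Return the switching flag bit (1 = switch, 0 = do not switch)."""
    if not is_slave:
        return 0  # only the slave initiates switching; the master does not respond
    if Health.CRASH in (slave_health, master_health):
        return 0  # a crash is handled by isolation, not by the switching handshake
    # Automatic switching only when the slave is strictly healthier than the master.
    automatic = slave_health > master_health
    # Manual switching is refused when the slave is SEVERE or worse.
    manual = manual_request and slave_health > Health.SEVERE
    return 1 if (automatic or manual) else 0


print(switch_flag(True, Health.NORMAL, Health.SLIGHT))        # slave healthier
print(switch_flag(True, Health.SLIGHT, Health.SLIGHT))        # equal health
print(switch_flag(True, Health.SEVERE, Health.NORMAL, True))  # manual refused
```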
(7) The local machine first changes its own state according to the switching interpretation result. Under the conventions above, switching can occur only when the local machine is the slave and a switching condition in Table 2 is met, so switching control always begins by promoting the slave to master.
(8) When the slave is promoted to master, it sets its switching response bit to 1 (change_ack = 1) and transmits its working state and switching response bit to the opposite machine through a heartbeat information packet.
(9) When the original master finds that the opposite machine has been promoted to master and that the opposite machine's switching response bit is 1 (change_ack = 1), it demotes its own working state to slave, sets its switching state completion bit to 1 (change_done = 1), and transmits its working state and switching state completion bit to the newly promoted master through a heartbeat information packet.
(10) When the newly promoted master finds that the opposite machine has been demoted from master to slave and that the opposite machine's switching state completion bit is 1 (change_done = 1), it resets its own switching response bit to 0 (change_ack = 0) and transmits its working state and switching response bit to the opposite machine through the heartbeat.
(11) When the newly demoted slave finds that the working state of the opposite machine is master and that the opposite machine's switching response bit is 0 (change_ack = 0), it resets its own switching state completion bit to 0 (change_done = 0). The dual-machine switching is then complete.
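The handshake in steps (7) to (11) can be sketched as two nodes reacting to each other's heartbeat contents. This is a simplified illustration: the names `Node`, `start_switch`, `on_heartbeat`, and `run_switch` are hypothetical, the flag names follow the `change_ack` and `change_done` bits in the text, and the heartbeats are delivered one after another rather than over a real full-duplex link.

```python
from dataclasses import dataclass

MASTER, SLAVE = "0010", "0011"   # working-state codes from the description


@dataclass
class Node:
    state: str
    change_ack: int = 0    # switching response bit
    change_done: int = 0   # switching state completion bit

    def start_switch(self) -> None:
        """Steps (7)-(8): the slave promotes itself and raises change_ack."""
        if self.state != SLAVE:
            raise ValueError("only the slave may initiate switching")
        self.state = MASTER
        self.change_ack = 1

    def on_heartbeat(self, peer: "Node") -> None:
        """React to the peer's working state and flag bits."""
        if (self.state == MASTER and self.change_ack == 0
                and peer.state == MASTER and peer.change_ack == 1):
            self.state = SLAVE          # step (9): original master demotes
            self.change_done = 1
        elif (self.state == MASTER and self.change_ack == 1
                and peer.state == SLAVE and peer.change_done == 1):
            self.change_ack = 0         # step (10): new master clears its ack
        elif (self.state == SLAVE and self.change_done == 1
                and peer.state == MASTER and peer.change_ack == 0):
            self.change_done = 0        # step (11): new slave clears done


def run_switch(old_master: Node, old_slave: Node, rounds: int = 3) -> list:
    """Drive the handshake and return a per-round log of both nodes."""
    old_slave.start_switch()
    log = []
    for _ in range(rounds):
        old_master.on_heartbeat(old_slave)
        old_slave.on_heartbeat(old_master)
        log.append((old_master.state, old_master.change_done,
                    old_slave.state, old_slave.change_ack))
    return log


a, b = Node(MASTER), Node(SLAVE)
for entry in run_switch(a, b):
    print(entry)
```

Both flag bits return to 0 at the end, so the pair is ready for a later switch in the opposite direction.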
3. Dual-computer information synchronization
(1) Information synchronization between the two machines uses a full-duplex communication mode.
(2) Heartbeat information packets and instruction data are separated into time slots and synchronized bidirectionally at a fixed period: the master synchronizes information to the slave, and the slave synchronizes information to the master. The synchronization period and the time sequence can be chosen flexibly.
(3) A sending buffer FIFO and a receiving buffer FIFO are set in the redundancy control module. In each system clock cycle (Tick), the module checks whether the sending FIFO is empty and, if it is not, sends the data in the sending FIFO, up to an upper limit of 8 packets per Tick. Each packet comprises 4 bytes of valid data, a 1-byte frame header, a 1-byte frame tail, a 1-byte checksum, and a 1-byte frame count. Any packets left over wait to be sent in the next Tick.
(4) The RS-422 serial port runs at 115200 baud, so one packet takes roughly 1 ms to send, and the 8 packets that form the per-Tick sending limit can be fully transmitted within 8 ms.
(5) The receiving FIFO triggers an interrupt as soon as it has received 4 bytes of valid data, and the interrupt is cleared only after all data in the receiving FIFO has been read out.
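The packet layout, the per-Tick sending limit, and the timing budget above can be checked with a short sketch. Several details are assumptions, because the text gives only the field sizes: the header and tail byte values, the field order, the additive checksum, and the 10 bits per byte on the wire (one start bit, eight data bits, one stop bit) are all hypothetical, as are the names `build_packet`, `drain_tick`, and `packet_time_ms`.

```python
from collections import deque

HEAD, TAIL = 0xAA, 0x55          # hypothetical frame header / tail values
MAX_PACKETS_PER_TICK = 8         # sending upper limit per Tick, per the text


def build_packet(payload: bytes, count: int) -> bytes:
    """Build one 8-byte packet: head, frame count, 4 data bytes, checksum, tail."""
    if len(payload) != 4:
        raise ValueError("payload must be exactly 4 bytes")
    count &= 0xFF
    checksum = (sum(payload) + count) & 0xFF   # assumed additive checksum
    return bytes([HEAD, count]) + payload + bytes([checksum, TAIL])


def drain_tick(tx_fifo: deque) -> list:
    """Send at most 8 packets this Tick; the rest wait for the next Tick."""
    sent = []
    while tx_fifo and len(sent) < MAX_PACKETS_PER_TICK:
        sent.append(tx_fifo.popleft())
    return sent


def packet_time_ms(baud: int = 115200, packet_bytes: int = 8,
                   bits_per_byte: int = 10) -> float:
    """Wire time of one packet in milliseconds under the assumed framing."""
    return packet_bytes * bits_per_byte / baud * 1000


fifo = deque(build_packet(bytes([i, 0, 0, 0]), i) for i in range(10))
first = drain_tick(fifo)
print(len(first), len(fifo))                      # packets sent, packets left
print(round(packet_time_ms(), 3))                 # per-packet time in ms
print(round(MAX_PACKETS_PER_TICK * packet_time_ms(), 2))  # full burst in ms
```

Under these assumptions one packet occupies the line for about 0.69 ms and a full burst of 8 packets for about 5.6 ms, which is consistent with the 8 ms figure in the text and fits within a 16.67 ms Tick.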
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (7)

1. A method for establishing a dual-computer complete machine level hot standby system, characterized in that the method comprises three parts: master-slave identification at dual-computer system initialization, dual-computer intelligent switching, and dual-computer information synchronization, as follows:
(1) for master-slave identification at dual-computer system initialization, the dual-computer system adopts a dual arbitration strategy in which the machine that feeds the dog first becomes the master and, failing that, the machine with the smaller IP address becomes the master, so as to handle initialization master-slave identification both when the two machines start in sequence and when they start at almost the same time;
(2) for dual-computer intelligent switching, intelligent switching is realized through state acquisition of detection parameters, single-machine fault diagnosis, switching interpretation, and switching control;
(3) for dual-computer information synchronization, a full-duplex communication mode and a time-slot division principle are adopted to carry out bidirectional synchronization between the master and the slave.
2. The method of claim 1, wherein the master-slave identification at dual-computer system initialization comprises the following:
(1) setting a minimum dog feeding time interval T for first-layer arbitration, such that when the interval between the two machines' dog feeding is smaller than T, master-slave identification cannot be achieved by first-layer arbitration alone;
(2) heartbeat information is transmitted between the two machines in a full-duplex communication mode;
(3) each machine starts sending heartbeat information from its first dog feeding;
(4) each machine starts its timer from its first dog feeding;
(5) the initial working state of both machines after power-on is the halt state, in which the machine that feeds the dog first waits; the actual interval between the two machines' dog feeding is T'; if T' is greater than T, that is, the actual dog feeding interval is greater than the set minimum time interval, the machine that fed the dog first has not received the heartbeat information of the opposite machine when its timer reaches T, so it sets its "no opposite machine signal" flag bit to 1 and, according to this flag bit, sets its working state to the single-machine state;
(6) through the mutual transmission of heartbeat information, the machine that feeds the dog second receives the single-machine state of the machine that fed first and sets its own working state to slave; the machine that fed first receives the slave state of the machine that fed second and sets its own working state to master; this completes dual-computer initialization for the case where the actual dog feeding interval is greater than the set minimum time interval;
(7) if the actual dog feeding interval of the two machines is smaller than the set minimum time interval, the machine that fed the dog first has already received the heartbeat information of the opposite machine when its timer reaches T; its "no opposite machine signal" flag bit remains 0, so it does not set its working state to the single-machine state and stays in the halt state; master-slave identification cannot proceed, and second-layer arbitration is required;
(8) the two machines exchange heartbeat information and compare the last two digits of their IP addresses; the machine whose last two digits are smaller sets its working state to single machine; on receiving the single-machine state from the machine with the smaller IP address, the machine with the larger IP address sets its working state to slave; on receiving the slave state from the machine with the larger IP address, the machine with the smaller IP address sets its working state to master; this completes dual-computer initialization for the case where the actual dog feeding interval is smaller than the set minimum time interval.
3. The method of claim 1, wherein the dual-computer intelligent switching comprises:
(1) state acquisition of detection parameters, comprising extraction of the crash state and of the key parameter states, wherein the crash state is judged from the periodic dog feeding operation that application layer software performs on the fault diagnosis module by giving it a periodic high-level pulse signal, and the key parameter states are extracted by the application layer periodically writing the health states of the key parameters into the fault diagnosis module;
(2) single-machine fault diagnosis, wherein after the health states of the detected items are obtained, the fault diagnosis module diagnoses the single machine using the priority configuration of the detected items and the diagnosis algorithm to obtain the current comprehensive health state of the local machine;
(3) switching interpretation, wherein the switching interpretation module comprises manual switching interpretation and automatic switching interpretation; after the health state of the local machine is obtained, the manual switching interpretation judges whether switching is needed according to the health state of the local machine, the health state of the opposite machine, the working state of the local machine, and a manual switching instruction, and gives a switching flag bit; the automatic switching interpretation judges whether switching is needed according to the health state of the local machine, the health state of the opposite machine, and the working state of the local machine, and gives a switching flag bit;
(4) switching control, wherein the current working state of the machine is determined according to the switching response bit, the switching state completion bit, and the working state of the machine.
4. The method of claim 3, wherein the state acquisition of the detection parameters comprises:
(1) the detection parameter states comprise a crash state and key parameter health states, wherein the crash state has 2 values, crashed and not crashed, and the health state of each key parameter is divided into 3 levels: serious abnormality, slight abnormality, and normal;
(2) after the controller software is initialized, the application layer periodically sends dog feeding pulses to the redundancy control module through the software interface, timed by the operating system clock;
(3) a watchdog timer is designed in the redundancy control module; each detected dog feeding pulse resets the watchdog timer, and if no dog feeding signal is detected for 10 consecutive operating system clock cycles, the machine is judged to have crashed.
5. The method of claim 3, wherein the switching interpretation comprises:
(1) whether it is the slave or the master that crashes, the crashed machine is automatically isolated and the opposite machine becomes a single machine;
(2) when neither machine has crashed, both manual switching and automatic switching are initiated by the slave, that is, the slave receives the instruction and changes its working state first, and the master then changes its working state according to the slave's new state and the switching flag bit given by the slave; if the master receives the instruction, it does not respond;
(3) when the health state of the slave is lower than or equal to that of the master, automatic switching is not performed;
(4) when the health state of the slave is serious abnormality, non-crash fault not requiring isolation, or non-crash fault requiring isolation, manual switching is not responded to, that is, the switching flag bit is 0.
6. The method according to claim 3, wherein the switching control comprises the following steps:
(1) the local machine first changes its own state according to the switching interpretation result; under the switching interpretation according to claim 6, switching can occur only when the local machine is the slave and a switching condition of Table 2 is met, so the first step of switching control is always to promote the slave to master;
(2) when the slave is promoted to master, it sets its switching response bit to 1 and sends it to the opposite machine through a heartbeat information packet;
(3) when the original master finds that the opposite machine has been promoted to master and that the opposite machine's switching response bit is 1, it demotes its own working state to slave, sets its switching state completion bit to 1, and transmits its working state and switching state completion bit to the newly promoted master through a heartbeat information packet;
(4) when the newly promoted master finds that the opposite machine has been demoted from master to slave and that the opposite machine's switching state completion bit is 1, it resets its own switching response bit to 0 and transmits its working state and switching response bit to the opposite machine through the heartbeat;
(5) when the newly demoted slave finds that the working state of the opposite machine is master and that the opposite machine's switching response bit is 0, it resets its own switching state completion bit to 0.
7. The method of claim 1, wherein the dual-computer information synchronization comprises the following:
(1) when the two machines are working, information is exchanged between them; the synchronized data comprise heartbeat information packets and instruction data; the heartbeat information packets are used for initialization master-slave identification and intelligent switching of working states; instruction data synchronization addresses the case where the master crashes suddenly while its application layer is executing a flow instruction: after the slave becomes a single machine, it queries which instruction the original master was executing before the crash, informs its own application layer, and continues execution from that instruction;
(2) information synchronization between the two machines uses a full-duplex communication mode;
(3) heartbeat information packets and instruction data are separated into time slots and synchronized bidirectionally at a fixed period, that is, the master synchronizes information to the slave and the slave synchronizes information to the master; the synchronization period and the time sequence can be chosen flexibly.
CN202111681040.3A 2021-12-31 2021-12-31 Method for establishing dual-computer complete machine level hot standby system Pending CN114490152A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111681040.3A CN114490152A (en) 2021-12-31 2021-12-31 Method for establishing dual-computer complete machine level hot standby system

Publications (1)

Publication Number Publication Date
CN114490152A true CN114490152A (en) 2022-05-13

Family

ID=81510913

Country Status (1)

Country Link
CN (1) CN114490152A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116010329A (en) * 2023-03-27 2023-04-25 厦门立林科技有限公司 USB communication synchronous control method, terminal, intelligent lock and medium
CN116909123A (en) * 2023-09-15 2023-10-20 西北工业大学 Self-monitoring method for motor controller of aviation dual-redundancy electromechanical actuating system
CN116909123B (en) * 2023-09-15 2023-12-19 西北工业大学 Self-monitoring method for motor controller of aviation dual-redundancy electromechanical actuating system

Legal Events

Date Code Title Description
PB01 Publication