CN114296875A - Data synchronization method, system and computer readable medium based on fault tolerant system - Google Patents

Info

Publication number
CN114296875A
Authority
CN
China
Prior art keywords
virtual machine
register
standby
data synchronization
gits
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111623520.4A
Other languages
Chinese (zh)
Inventor
王海东
俞建群
陈利
周浩波
周晓
徐力群
雷雳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huayun Data Holding Group Co ltd
Orient Securities Co ltd
Original Assignee
Huayun Data Holding Group Co ltd
Orient Securities Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huayun Data Holding Group Co ltd and Orient Securities Co ltd

Abstract

The invention provides a data synchronization method, a system and a computer readable medium based on a fault tolerant system, wherein the data synchronization method comprises the following steps: periodically determining the resource data synchronization state of a primary virtual machine and a standby virtual machine contained in a fault-tolerant system in the current state; at least before the main virtual machine and the standby virtual machine execute resource data synchronization, resetting a plurality of target registers which are responsible for interrupt management in a register group contained in the standby virtual machine in the current state to be 0. In the invention, at least before the main virtual machine and the standby virtual machine execute resource data synchronization, a plurality of target registers which are responsible for interrupt management and are contained in the standby virtual machine in the current state are reset to be 0, so that the technical problem of data synchronization failure of the standby virtual machine caused in the data synchronization process between the main virtual machine and the standby virtual machine is solved, and the strong consistency of resource data in the synchronization process is ensured.

Description

Data synchronization method, system and computer readable medium based on fault tolerant system
Technical Field
The present invention relates to the field of fault tolerance technologies, and in particular, to a data synchronization method and system based on a fault tolerance system, and a computer readable medium.
Background
A fault-tolerant system is a system built with fault-tolerance technology that can automatically eliminate non-fatal faults. It is a high-availability mode: redundant resources give the computer system the ability to tolerate system faults, so that even if the primary system providing services to the outside fails, services can be switched to a standby system and continue to run, which ensures the stability and continuity of the services (or applications) that run in the fault-tolerant system. The fault-tolerant system comprises a pair of virtual machines (namely a primary virtual machine and a standby virtual machine) and a storage system logically mounted at the back end. The primary virtual machine provides services to the outside and periodically synchronizes its data and state to the standby virtual machine. The standby virtual machine is equivalent to a copy of the primary virtual machine; when the primary virtual machine fails, the standby virtual machine takes over as the primary virtual machine and provides services to the outside. Data and state therefore need to be synchronized periodically to ensure the fault tolerance of the system and the consistency of data and state between the primary and standby virtual machines.
In an example scenario of a domestic cloud operating system based on the ARM architecture (for example, the ARMv8-A architecture), when the hypervisor process (qemu-kvm) to which the standby virtual machine belongs receives a synchronization request from the primary virtual machine for the first time, the data of the primary virtual machine is successfully written into the memory of the standby virtual machine. When that hypervisor process receives a synchronization request for the second or a subsequent time, however, the write into the memory of the standby virtual machine fails, because the kernel (here the host kernel, not the kernel of the virtual machine) returns an error when the CPU register information included in the synchronization data is synchronized. As a result, once the fault-tolerant system is deployed on a domestic cloud operating system based on the ARM architecture, memory data synchronization between the primary virtual machine and the standby virtual machine fails, which in turn prevents the fault-tolerant system from being reliably deployed and operated on such an operating system.
In view of the above, there is a need to improve the data synchronization method based on the fault tolerant system in the prior art to solve the above problems.
Disclosure of Invention
The invention aims to disclose a data synchronization method and a data synchronization system based on a fault-tolerant system, which are used for solving the technical problem of data synchronization failure between a primary virtual machine and a standby virtual machine in the data synchronization process between the primary virtual machine and the standby virtual machine which form the fault-tolerant system so as to ensure the fault-tolerant performance of the fault-tolerant system.
In order to achieve one of the above objects, the present invention provides a data synchronization method based on a fault tolerant system, which comprises the following steps:
S1, periodically determining resource data synchronization states of a primary virtual machine and a standby virtual machine contained in a fault tolerant system in the current state;
S2, resetting a plurality of target registers in charge of interrupt management in a register group contained in the standby virtual machine in the current state to 0 at least before the primary virtual machine and the standby virtual machine execute resource data synchronization.
As a further improvement of the present invention, a primary system and a standby system are deployed in a server cluster constructed on the same platform or on heterogeneous platforms, the primary virtual machine is deployed in the primary system, and the standby virtual machine is deployed in the standby system.
As a further improvement of the present invention, after resetting to 0 a plurality of target registers responsible for interrupt management in a register set included in a standby virtual machine of the fault tolerant system in the current state, the method further includes:
releasing the mapping relationship, established under the ITS mechanism, between the devices of the standby virtual machine and their interrupt allocation.
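The release step can be pictured with a small sketch. Everything here is an illustrative assumption rather than the patent's implementation: the `ItsMappings` dictionary is a hypothetical in-memory stand-in for the device/interrupt mapping state of one standby virtual machine, and `release_its_mappings` is a hypothetical helper name.

```python
from typing import Dict, Tuple

# Hypothetical stand-in for the ITS mapping state of one virtual machine:
# (device_id, event_id) -> interrupt number (INTID).
ItsMappings = Dict[Tuple[int, int], int]


def release_its_mappings(mappings: ItsMappings) -> int:
    """Drop every device/interrupt mapping and return how many were released."""
    released = len(mappings)
    mappings.clear()
    return released


# Example state: two events of device 0x10 and one event of device 0x20.
standby_mappings: ItsMappings = {(0x10, 0): 8192, (0x10, 1): 8193, (0x20, 0): 8194}
released_count = release_its_mappings(standby_mappings)
```

After the call the standby side holds no stale mappings, so the next synchronization can rebuild them from the primary virtual machine's state.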
As a further improvement of the present invention, the resetting to 0 of a plurality of target registers responsible for interrupt management in a register set included in a standby virtual machine of the fault tolerant system in the current state is implemented by clearing logic preset in the standby virtual machine, or in response to a clearing instruction initiated by an external device;
the target registers include: a GITS_CBASER register, a GITS_CREADR register, a GITS_CWRITER register, and a GITS_CTLR register;
in step S2, resetting to 0 a plurality of target registers responsible for interrupt management in a register set included in the standby virtual machine of the fault tolerant system in the current state specifically comprises:
resetting to 0 the GITS_CBASER register, the GITS_CREADR register, and the GITS_CWRITER register in the register set included in the standby virtual machine in the current state, and resetting the enable attribute of the GITS_CTLR register to 0.
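A minimal sketch of this reset, assuming the register set is modelled as a plain dictionary and that the enable attribute of GITS_CTLR is bit 0 (the bit position, the `reset_target_registers` name, and the sample register values are assumptions made for illustration, not taken from the text above):

```python
# Assumed position of the enable attribute within GITS_CTLR (bit 0).
GITS_CTLR_ENABLE = 1 << 0

# Registers that are cleared completely.
FULLY_RESET_REGISTERS = ("GITS_CBASER", "GITS_CREADR", "GITS_CWRITER")


def reset_target_registers(regs: dict) -> dict:
    """Reset the three command-queue registers to 0 and clear only the
    enable attribute of GITS_CTLR, leaving its other bits untouched."""
    for name in FULLY_RESET_REGISTERS:
        regs[name] = 0
    regs["GITS_CTLR"] = regs.get("GITS_CTLR", 0) & ~GITS_CTLR_ENABLE
    return regs


# Illustrative (made-up) register contents left over from a previous sync.
standby_regs = {
    "GITS_CBASER": 0x8000_0000_4000_0107,
    "GITS_CREADR": 0x40,
    "GITS_CWRITER": 0x60,
    "GITS_CTLR": 0x8000_0001,
}
reset_target_registers(standby_regs)
```

Clearing only the enable attribute, rather than the whole GITS_CTLR register, mirrors the distinction the text draws between the three fully reset registers and GITS_CTLR.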
As a further improvement of the present invention, the clearing logic is preconfigured in the standby virtual machine by a user or a robot program, in a manually loaded or automatically loaded manner.
As a further improvement of the present invention, the timestamp of resetting to 0 a plurality of target registers responsible for interrupt management in a register group included in the standby virtual machine in the current state is earlier than the timestamp of performing resource data synchronization between the primary virtual machine and the standby virtual machine.
As a further improvement of the present invention, after resetting to 0 a plurality of target registers responsible for interrupt management in a register set included in the standby virtual machine in the current state, the method further includes:
and releasing the resource data stored in the temporary storage of the standby virtual machine by the interrupt analysis service of the standby system, wherein the temporary storage comprises a virtual memory, a database or a part of storage space logically positioned in the storage systems at the rear ends of the main system and the standby system.
As a further improvement of the present invention, the resource data includes one or more of CPU data, memory data, disk data, configuration data, and plug-in.
Based on the same invention idea, the invention also discloses a data synchronization system based on the fault tolerant system, which comprises:
the device comprises a state acquisition module and a reset module;
the state acquisition module periodically determines resource data synchronization states of a primary virtual machine and a standby virtual machine included in a fault tolerant system in the current state, and, at least before the primary virtual machine and the standby virtual machine perform resource data synchronization, resets to 0 through the reset module a plurality of target registers in charge of interrupt management in a register group included in the standby virtual machine in the current state.
As a further improvement of the present invention, the target registers include: a GITS_CBASER register, a GITS_CREADR register, a GITS_CWRITER register, and a GITS_CTLR register;
the resetting to 0, by the reset module, of a plurality of target registers in charge of interrupt management in a register group included in a standby virtual machine of the fault tolerant system in the current state specifically includes: resetting to 0 the GITS_CBASER register, the GITS_CREADR register, and the GITS_CWRITER register in the register group included in the standby virtual machine in the current state, and resetting the enable attribute of the GITS_CTLR register to 0.
As a further improvement of the present invention, the data synchronization system further includes: a clearing module, configured to release the mapping relationship, established under the ITS mechanism, between the devices of the standby virtual machine and their interrupt allocation.
Based on the same inventive concept, the present invention further discloses a computer readable medium, in which computer program instructions are stored, and the computer program instructions are read and executed by a processor to perform the steps of the data synchronization method based on the fault tolerant system as described in any one of the above inventions.
Compared with the prior art, the invention has the beneficial effects that:
in the invention, at least before the primary virtual machine and the standby virtual machine execute resource data synchronization, a plurality of target registers which are responsible for interrupt management in a register group contained in the standby virtual machine in the current state are reset to be 0, so that the technical problem of data synchronization failure to the standby virtual machine caused in the process of executing data synchronization between the primary virtual machine and the standby virtual machine is solved, and the strong consistency of resource data in the process of executing periodic or random synchronization between the primary virtual machine and the standby virtual machine in the fault-tolerant system is ensured.
Drawings
FIG. 1 is a topology diagram of a fault tolerant system consisting of a primary virtual machine and a standby virtual machine;
FIG. 2 is a flow diagram of data synchronization between a primary virtual machine and a standby virtual machine when the primary virtual machine of a fault tolerant system is not failing;
FIG. 3 is a flowchart illustrating that a standby virtual machine replaces a primary virtual machine to continue providing external services when the primary virtual machine of the fault tolerant system fails;
FIG. 4 is a topological diagram of resource data for executing synchronization operation during the process of executing resource data synchronization between a primary system and a secondary system according to the data synchronization method of the fault tolerant system of the present invention;
FIG. 5 is a schematic diagram of the GICv3 interrupt controller sending an interrupt instruction to a register set and independently controlling the primary virtual machine and the standby virtual machine in the process of performing resource data synchronization between the primary virtual machine and the standby virtual machine;
FIG. 6 is a diagram illustrating Qemu-kvm in the standby system resetting to 0 a number of target registers responsible for interrupt management in a register set included in the standby system;
FIG. 7 is a general flow chart of a data synchronization method based on a fault tolerant system according to the present invention;
FIG. 8 is a detailed flowchart of responding to a resource data synchronization request initiated by the primary virtual machine and writing to the standby disk of the standby virtual machine, depending on whether the target registers of the GICv3 in the standby virtual machine are being written for the first time, in the data synchronization method based on a fault tolerant system according to the present invention;
FIG. 9 is a topology diagram of a data synchronization system based on a fault tolerant system according to the present invention;
FIG. 10 is a topology diagram of a computer readable medium of the present invention.
Detailed Description
The present invention is described in detail with reference to the embodiments shown in the drawings, but it should be understood that these embodiments are not intended to limit the present invention, and those skilled in the art should understand that functional, methodological, or structural equivalents or substitutions made by these embodiments are within the scope of the present invention.
Before describing the embodiments of the present invention in detail, the meanings of the main technical terms and the English abbreviations referred to in the embodiments are explained or defined as necessary.
The fault-tolerant system is a system which is constructed by utilizing a fault-tolerant technology and is used for automatically eliminating non-fatal faults. The fault-tolerant system is a high-availability mode, and the redundant resources are used for enabling the computer system to have the capability of tolerating system faults, so that even if the system which provides services to the outside fails, the system can enable the business to continue to operate to improve the stability and continuity of the services provided to the outside. The fault tolerant system is particularly important in industries (such as the industries of bank settlement systems and data centers) and scenes (online payment scenes) with high requirements on service continuity and fault tolerant performance.
Referring to FIG. 1 and FIG. 4, the fault tolerant system 100 includes two virtual machines and a storage system 50 logically located at the back end of the two virtual machines. The virtual machine on the left side of FIG. 1 is defined as the primary virtual machine (Primary VM), and the virtual machine on the right side of FIG. 1 is defined as the standby virtual machine (Secondary VM). The storage system 50 is built from nonvolatile storage media such as physical disks and is mounted to the server cluster 1000 through the iSCSI protocol and the FC protocol, so as to allocate corresponding storage space to the primary virtual machine 10 and the standby virtual machine 20 in the server cluster 1000.
After the data synchronization operation is completed once, the resource data and the states respectively formed by the main virtual machine 10 and the standby virtual machine 20 are completely the same, so that it is ensured that the resource data synchronization is realized between the main virtual machine 10 and the standby virtual machine 20, the resource data synchronization has strong consistency, and when the main virtual machine 10 is in an unavailable state (for example, a failure or downtime scene of the main virtual machine 10), the standby virtual machine 20 responds to a user request, so as to ensure high availability, stability and continuity of services running in the fault tolerant system 100.
As shown in FIG. 1, the primary virtual machine 10 includes a primary disk 10b (primary disk) and a primary memory area 10a (memory regions), and the standby virtual machine 20 includes a standby disk 20b (secondary disk) and a standby memory area 20a (memory regions). The primary disk 10b and the standby disk 20b are both virtual disks: a virtualization technology forms them on the physical storage media of the storage system 50 and mounts them to the primary and standby virtual machines.
Referring to FIG. 2, if the primary virtual machine 10 in the fault tolerant system 100 has not failed, the data synchronization method in the normal state includes the following steps S11 to S14.
Step S11: the primary virtual machine 10 sends the resource data corresponding to a primary write request (Primary write Request) to the primary disk 10b and saves it there. That is, the primary write request causes the corresponding resource data held in the primary memory area 10a of the primary virtual machine 10 to be written to the primary disk 10b.
Step S12: the primary write request triggers a resource data synchronization event between the primary virtual machine 10 and the standby virtual machine 20. The event triggered by the primary write request of step S11 is, for example, a disk data synchronization event, which is one specific instance of a resource data synchronization event. The request is sent by the qemu-kvm that created the primary virtual machine 10 to the qemu-kvm that created the standby virtual machine 20.
Step S13: the resource data corresponding to the primary write request (e.g., disk data of the primary virtual machine 10) is stored in the standby memory area 20a (memory regions) of the standby virtual machine 20.
Step S14: the resource data of the primary virtual machine 10 is successfully written into the standby disk 20b of the standby virtual machine 20. When the primary virtual machine 10 and the standby virtual machine 20 are both normal, resource data such as disk data can be transmitted between them on the basis of memory frames during synchronization, which realizes real-time transmission of resource data between the primary memory area 10a and the standby memory area 20a.
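Steps S11 to S14 can be sketched as follows. This is only a toy model under stated assumptions: memory areas and disks are represented as dictionaries, the synchronization event of step S12 is modelled as a direct function call, and the `VirtualMachine` class and `handle_primary_write` function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualMachine:
    """Toy model of a virtual machine with a memory area and a virtual disk."""
    name: str
    memory: dict = field(default_factory=dict)
    disk: dict = field(default_factory=dict)


def handle_primary_write(primary: VirtualMachine, standby: VirtualMachine,
                         key: str, value: bytes) -> None:
    # S11: the primary write request saves the data from the primary
    # memory area to the primary disk.
    primary.memory[key] = value
    primary.disk[key] = primary.memory[key]
    # S12: the write triggers a synchronization event towards the standby
    # side (modelled here as simply continuing in the same function).
    # S13: the data is stored in the standby memory area.
    standby.memory[key] = value
    # S14: the data is written from standby memory to the standby disk.
    standby.disk[key] = standby.memory[key]


primary = VirtualMachine("primary")
standby = VirtualMachine("standby")
handle_primary_write(primary, standby, "block-7", b"payload")
```

Once the call returns, both virtual machines hold the same memory and disk contents, which is the consistency property the normal-state flow is meant to provide.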
Referring to FIG. 3, if the primary virtual machine 10 in the fault tolerant system 100 fails, that is, in the abnormal state, the standby virtual machine 20 continues to provide external services in place of the primary virtual machine 10, through the following steps S21 to S24.
Step S21: a checkpoint triggers a resource data synchronization request between the primary memory area 10a in the primary virtual machine 10 and the standby memory area 20a in the standby virtual machine 20.
Step S22: the standby virtual machine 20 refuses to respond to the resource data synchronization request sent by the primary virtual machine 10, and sends a write request to the standby memory area 20a.
Step S23: the resource data of the standby disk 20b is read into the standby memory area 20a of the standby virtual machine 20.
Step S24: the resource data of the standby disk 20b overwrites the original resource data in the standby memory area 20a of the standby virtual machine 20. It should be noted that resource data synchronization between the primary disk 10b and the standby disk 20b is triggered continuously, whereas resource data synchronization between the primary memory area 10a in the primary virtual machine 10 and the standby memory area 20a in the standby virtual machine 20 is triggered at fixed time intervals, for example every 20 seconds.
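The takeover path of steps S22 to S24 can be sketched in the same toy style. The dictionaries standing in for the standby memory area and standby disk, and the `fail_over` function name, are assumptions made for illustration only.

```python
def fail_over(standby_memory: dict, standby_disk: dict, sync_request: dict) -> dict:
    """Bring the standby memory area to the state recorded on the standby disk."""
    # S22: the synchronization request from the failed primary is refused,
    # so sync_request is deliberately never applied.
    # S23: read the resource data of the standby disk.
    restored = dict(standby_disk)
    # S24: overwrite the original contents of the standby memory area.
    standby_memory.clear()
    standby_memory.update(restored)
    return standby_memory


memory_area = {"a": 1, "stale": 9}   # possibly inconsistent in-memory state
disk_area = {"a": 2}                 # last state synchronized to disk
fail_over(memory_area, disk_area, sync_request={"a": 3})
```

Because disk synchronization is continuous while memory synchronization is periodic, the disk copy is the more trustworthy source at takeover time, which is why the sketch rebuilds memory from disk rather than from the pending request.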
Alternatively, the fault tolerant system 100 may be formed by any pair of virtual machines serving as the primary virtual machine 10 and the standby virtual machine 20. Both virtual machines are created by virtualization software (e.g., Qemu-kvm). They may be created by a single instance of virtualization software located in one server cluster 1000, in which case two of the virtual machines it creates are defined as the primary virtual machine 10 and the standby virtual machine 20, respectively, to form a fault tolerant system 100; or two independent instances of virtualization software located in one server cluster 1000 may each create a plurality of virtual machines, and a pair of virtual machines created by the two instances is arbitrarily selected to serve as the primary virtual machine 10 and the standby virtual machine 20 that form the fault tolerant system 100.
Preferably, as shown in FIG. 4, two systems (e.g., Source Host) exist in the server cluster 1000; the one on the left is defined as the primary system 1, and the one on the right as the standby system 2. The primary system 1 and the standby system 2 are both physical machines, and each includes applications, a virtual machine, and firmware. After the first resource data synchronization operation, the data and state (i.e., subordinate concepts of the resource data) of the primary system 1 and the standby system 2 are completely identical.
An ARM Cortex-A series processor (e.g., Cortex-A53 or Cortex-A72) provides the SoC with four pins for delivering external interrupt signals. A Generic Interrupt Controller (GIC) manages the interrupt sources of the SoC peripherals and provides software configuration and control of those sources. When an interrupt source is asserted, the GIC configuration for that source determines whether the interrupt signal is forwarded to the CPU. If several interrupt sources are asserted at once, the GIC arbitrates, selects the interrupt with the highest priority, and sends it to the CPU. When the CPU receives an interrupt from the GIC, it reads a GIC register to identify the interrupt source so that the corresponding handling can be performed. When the CPU has finished handling the interrupt, it accesses a GIC register to signal the end of interrupt handling. On receiving this notification, the GIC can deactivate the interrupt, which avoids resending it to the CPU and allows interrupt preemption. In GICv3, the ITS is an optional hardware mechanism that routes LPIs to the appropriate Redistributor (which is used to configure interrupts); in software, the ITS is configured through a sequence of commands, and the ITS table structures in memory translate a device-specific EventID into an INTID, which the PE recognizes as the interrupt number.
GICv3 is a recent version of the generic interrupt controller provided by ARM; it is mainly used to receive hardware interrupt signals and, after certain processing, distribute them to the corresponding CPUs for handling. A characteristic of the GICv3 registers is that writes originating inside the system itself are allowed, whereas for writes originating from an external system only the first write succeeds. Because these GICv3 registers can thus be written only once from outside while the fault tolerant system 100 must perform data synchronization cyclically, running the fault tolerant system 100 on a domestic cloud operating system based on the ARM architecture may cause resource data synchronization between the primary system 1 (or the primary virtual machine 10) and the standby system 2 (or the standby virtual machine 20) to fail, and may even directly prevent reliable deployment in a server cluster 1000 that runs such an operating system. The present embodiment is therefore implemented on the basis of the data synchronization method and data synchronization system described below for the fault tolerant system 100.
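The arbitration described above can be illustrated with a short sketch. It assumes the GIC convention that a numerically lower priority value denotes a higher priority; breaking ties by the lowest INTID is a simplification chosen for this example, and `select_highest_priority` is a hypothetical name.

```python
from typing import Dict, Optional


def select_highest_priority(pending: Dict[int, int]) -> Optional[int]:
    """Pick the pending interrupt to forward to the CPU.

    `pending` maps an INTID to its priority value; a lower value means a
    higher priority. Ties are broken by the lowest INTID (an assumption).
    """
    if not pending:
        return None
    return min(pending, key=lambda intid: (pending[intid], intid))


# Three interrupt sources asserted at the same time.
pending_interrupts = {34: 0xA0, 27: 0x20, 45: 0x20}
chosen = select_highest_priority(pending_interrupts)
```

Here INTID 27 and INTID 45 share the best priority value, so the tie-break selects 27; INTID 34 waits until the higher-priority interrupts have been handled.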
Referring to FIG. 5, the fault tolerant system 100 includes the primary virtual machine 10, the standby virtual machine 20, and the GICv3 interrupt controller 12 and the GICv3 interrupt controller 22 mounted on the primary virtual machine 10 and the standby virtual machine 20, respectively. The primary virtual machine 10 sends an interrupt instruction to the GICv3 interrupt controller 12 mounted on it; the GICv3 interrupt controller 12 receives the interrupt signal, processes it, and returns it to the primary virtual machine 10, which provides services to the outside. Similarly, the standby virtual machine 20 responds to the interrupt instructions of the primary virtual machine 10 and performs the corresponding processing: the standby virtual machine 20 sends an interrupt instruction to the GICv3 interrupt controller 22 mounted on it, and the GICv3 interrupt controller 22 receives the interrupt signal, processes it, and returns it to the standby virtual machine 20, but the standby virtual machine 20 does not provide services to the outside. The primary virtual machine 10 sends a resource data synchronization request to the standby virtual machine 20; the standby virtual machine 20 receives and responds to the request, and the resource data of the primary virtual machine 10 is then stored in the standby disk 20b of the standby virtual machine 20. A characteristic of the GICv3 registers is that they allow the internal system to store data to its own disk (e.g., the standby disk 20b), whereas for an external system (e.g., the primary system 1) the write succeeds only the first time a resource data synchronization request is sent, at which point the resource data of the primary system 1 (or the primary virtual machine 10) in the current state is synchronized to the standby disk 20b of the standby system 2 (or the standby virtual machine 20). Consequently, when the primary virtual machine 10 sends a resource data synchronization request to the standby virtual machine 20 for the second time, the four target registers (described below) in the kernel-layer register set responsible for interrupt management return an error, and the write fails. The target registers are the GITS_CBASER register, the GITS_CREADR register, the GITS_CWRITER register, and the GITS_CTLR register. The fault tolerant system 100, however, must continually synchronize the resource data of the primary virtual machine 10 to the standby virtual machine 20, so this characteristic of the GICv3 registers causes the fault tolerant system 100 to fail. In the GICv3 interrupt controller, the ITS (Interrupt Translation Service) is used to translate LPIs (Locality-specific Peripheral Interrupts), a new interrupt type.
Based on the foregoing, referring to fig. 7 and 8, the data synchronization method based on the fault tolerant system disclosed in the present invention includes the following steps S1 to S2.
First, step S1 is executed: the resource data synchronization status of the primary virtual machine 10 and the standby virtual machine 20 included in a fault tolerant system 100 in the current status is periodically determined.
Illustratively, the primary virtual machine 10 is deployed in the primary system 1, the standby virtual machine 20 is deployed in the standby system 2, and the primary system 1 and the standby system 2 are deployed in a server cluster 1000 constructed on the same platform or on heterogeneous platforms. The server cluster 1000 consists of a plurality of mutually independent computer systems joined by a high-speed communication network into a larger computer service system, in which each node (i.e., a computer in the server cluster 1000, or a virtual machine that responds independently to user requests) runs on a server. The servers can communicate with each other, cooperatively provide applications and system resource data to users, and are managed as a single system. In this embodiment, the primary system 1 and the standby system 2 may also be understood as a primary node and a standby node. In the data synchronization method disclosed in this embodiment, resetting the plurality of target registers in the primary system (or the primary virtual machine) and the standby system (or the standby virtual machine) to 0 in advance makes it possible to ignore the differences between the register values specified by physical CPUs from different manufacturers (for example, the Kunpeng 920 physical CPU manufactured by Huawei or the FT2000 physical CPU manufactured by Phytium) that the fault tolerant system 100 uses to create and run virtual machines, which improves the universality and portability of the fault tolerant system 100. The period for determining the resource data synchronization state of the primary virtual machine 10 and the standby virtual machine 20 included in a fault tolerant system 100 in the current state may be set adaptively according to the fault tolerance performance of the fault tolerant system 100, and is not limited by the present invention.
Referring to FIG. 8, step 41: after the primary virtual machine 10 initiates a resource data synchronization request to the standby virtual machine 20, jump to step 42.
Step 42: determine whether the target registers of the GICv3 in the standby virtual machine 20 are being written for the first time. If so, the GITS_CBASER register, the GITS_CREADR register, and the GITS_CWRITER register are still 0 and the enable attribute of the GITS_CTLR register is still 0, so jump directly to step 44. If not, all four target registers would return error values and the synchronization of the resource data would fail, so jump to step 43.
Step 43: reset the target registers of the GICv3 of the standby virtual machine 20 (i.e., the aforementioned GITS_CBASER register, GITS_CREADR register, and GITS_CWRITER register) to 0, reset the enable attribute of the GITS_CTLR register to 0, and then proceed to step 44.
Step 44: write the resource data into the standby disk 20b of the standby virtual machine 20, and end.
Thereby, the synchronous operation of the resource data between the primary virtual machine 10 and the standby virtual machine 20 is finally achieved.
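By way of a non-limiting illustration, the flow of steps 41 to 44 can be sketched in C as follows. This is a minimal model under stated assumptions, not the implementation of the embodiment: the type standby_vm, the function standby_handle_sync, and the byte buffer standing in for the standby disk 20b are all hypothetical names, and the step-43 reset is only counted rather than performed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the standby side; all names are illustrative. */
typedef struct {
    bool     written_once;  /* false until the first synchronization write */
    unsigned resets;        /* number of step-43 resets performed so far */
    uint8_t *disk;          /* stands in for the standby disk 20b */
    size_t   disk_len;
} standby_vm;

/* Steps 41-44: handle one synchronization request from the primary VM.
 * Returns the number of bytes written to the standby disk buffer. */
size_t standby_handle_sync(standby_vm *vm, const uint8_t *data, size_t len)
{
    assert(vm != NULL && vm->disk != NULL && data != NULL);

    /* Step 42: is this the first write to the target registers? */
    if (vm->written_once) {
        /* Step 43: not the first write, so the target registers must be
         * reset before synchronizing. The reset is only counted here. */
        vm->resets++;
    }
    vm->written_once = true;

    /* Step 44: write the resource data to the standby disk. */
    size_t n = len < vm->disk_len ? len : vm->disk_len;
    memcpy(vm->disk, data, n);
    return n;
}
```

In this sketch the first request goes straight to the disk write, while every later request passes through the reset branch first, which mirrors the decision taken in step 42.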
For a server cluster 1000 built on the same platform, for example, the primary system 1 uses a server built on the FT2000+ CPU and the standby system 2 also uses a server built on the FT2000+ CPU; for a server cluster 1000 built on heterogeneous platforms, for example, the primary system 1 uses a server built on the FT2000+ CPU and the standby system 2 uses a server built on the Kunpeng 920. The resource data in this embodiment includes one or more of CPU data, memory data, disk data, configuration data, or plug-ins. The data synchronization method disclosed in this embodiment ensures the stability and high availability of the fault tolerant system 100 by keeping the resource data formed by the primary and standby virtual machines in the fault tolerant system 100 strongly consistent. The configuration data includes, but is not limited to, network configuration data for the primary/standby virtual machines (e.g., virtual gateway configuration, virtual IP address configuration), the database configuration of the virtual machine, and the entity data or metadata in a virtual disk mounted by the virtual machine.
The timestamp at which the plurality of target registers responsible for interrupt management in the register group included in the standby virtual machine 20 in the current state are reset to 0 is earlier than the timestamp at which resource data synchronization is performed between the primary virtual machine 10 and the standby virtual machine 20. This ensures that the reset operation is performed on the four target registers, individually or in batch, before the second synchronization cycle arrives. Because register operations take very little time and the reset is very fast, the computing overhead of performing a reset before each of multiple resource data synchronization operations is negligible and does not affect synchronization performance.
Then, step S2 is executed: at least before the primary virtual machine 10 and the standby virtual machine 20 perform resource data synchronization, a plurality of target registers responsible for interrupt management in the register group included in the standby virtual machine 20 in the current state are reset to 0, thereby implementing the reset operation on the plurality of target registers.
Referring to FIG. 6, the LPI (Locality-specific Peripheral Interrupt) is a new interrupt type defined in GICv3 and is preferably implemented using the ITS (Interrupt Translation Service). The ITS is optional in GICv3. The ITS is responsible for receiving interrupts from peripheral devices, translating them into LPI INTIDs, and sending them to the corresponding Redistributor; the Redistributor is responsible for managing PPI, SGI, and LPI interrupts and sending them to the CPU interface. The target registers include the GITS_CBASER register, the GITS_CREADR register, the GITS_CWRITER register, and the GITS_CTLR register. The GITS_CBASER register (ITS Command Queue Descriptor) is a 64-bit register that specifies the base address and size of the command queue. The GITS_CREADR register (ITS Read Register) is a 64-bit register that specifies the offset from GITS_CBASER at which the ITS reads the next ITS command, i.e., it points to the next command to be processed by the ITS; this register is cleared to 0 when a value is written to GITS_CBASER. The GITS_CWRITER register (ITS Write Register) is a 64-bit register that specifies the offset from GITS_CBASER at which software writes the next ITS command. The GITS_CTLR register (ITS Control Register) is a 32-bit register that controls the operation of the ITS.
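For illustration only, the relationship between the four target registers can be sketched in C as follows. The register offsets, the 32-byte command size, the offset field in bits [19:5], and the size field in bits [7:0] of GITS_CBASER (number of 4 KiB pages minus one) are assumptions recalled from the GICv3 architecture specification and should be checked against it; the type its_regs and the helper functions are hypothetical names. The sketch also assumes that both offsets lie within the command queue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register offsets within the ITS control frame (assumed, see lead-in). */
#define GITS_CTLR     0x0000u
#define GITS_CBASER   0x0080u
#define GITS_CWRITER  0x0088u
#define GITS_CREADR   0x0090u

#define GITS_CTLR_ENABLED  (1u << 0)    /* enable attribute of GITS_CTLR */
#define ITS_CMD_SIZE       32u          /* bytes per ITS command */
#define ITS_OFFSET_MASK    0xFFFE0ull   /* offset field, bits [19:5] */

/* Hypothetical in-memory copy of the four target registers. */
typedef struct {
    uint64_t cbaser;
    uint64_t creadr;
    uint64_t cwriter;
    uint32_t ctlr;
} its_regs;

/* Command queue size in bytes, decoded from the size field of GITS_CBASER. */
uint64_t its_queue_bytes(const its_regs *r)
{
    assert(r != NULL);
    return ((r->cbaser & 0xFFu) + 1u) * 4096u;
}

/* Number of commands written by software but not yet read by the ITS,
 * taking wrap-around of the circular command queue into account. */
uint64_t its_pending_commands(const its_regs *r)
{
    uint64_t size = its_queue_bytes(r);
    uint64_t rd = r->creadr & ITS_OFFSET_MASK;
    uint64_t wr = r->cwriter & ITS_OFFSET_MASK;
    return ((wr + size - rd) % size) / ITS_CMD_SIZE;
}
```

When GITS_CREADR and GITS_CWRITER are both 0, as after the reset operation, its_pending_commands returns 0, i.e., the command queue is empty.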
The standby virtual machine 20 responds to the user-mode program and writes the GITS_CREADR value from the GITS_CREADR register of the primary virtual machine 10 to the GITS_CREADR register of the standby virtual machine 20, writes the GITS_CWRITER value from the GITS_CWRITER register of the primary virtual machine 10 to the GITS_CWRITER register of the standby virtual machine 20, and writes the control value from the GITS_CTLR register of the primary virtual machine 10 to the GITS_CTLR register of the standby virtual machine 20.
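The register write-back described above can be illustrated with the following C sketch. It assumes, as a simplification that is not stated in this form in the embodiment, that a write to the standby registers is rejected while the enable attribute of the standby GITS_CTLR register is still set; the type its_sync_regs and the function its_apply_primary_state are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the ITS state that is copied during synchronization. */
typedef struct {
    uint64_t creadr;   /* GITS_CREADR value */
    uint64_t cwriter;  /* GITS_CWRITER value */
    bool     enabled;  /* enable attribute of GITS_CTLR */
} its_sync_regs;

/* Copies the primary VM's register values into the standby VM.
 * Returns 0 on success, or -1 if the standby ITS is still enabled
 * (assumed behaviour, modelling the error value of a non-first write). */
int its_apply_primary_state(its_sync_regs *standby, const its_sync_regs *primary)
{
    assert(standby != NULL && primary != NULL);

    if (standby->enabled) {
        return -1;  /* the standby registers were not reset beforehand */
    }
    standby->creadr  = primary->creadr;
    standby->cwriter = primary->cwriter;
    standby->enabled = primary->enabled;  /* control value is written last */
    return 0;
}
```

Under this assumption, a second synchronization fails unless the enable attribute was reset to 0 first, which is the situation the reset operation is intended to avoid.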
When the primary virtual machine 10 and the standby virtual machine 20 execute the resource data synchronization request for the first time, the primary virtual machine 10 sends the resource data synchronization request to the standby virtual machine 20, and the standby virtual machine 20 responds to the request and invokes it through the standby system 2. At this time, the four target registers responsible for interrupt management in the GICv3 register group of the standby virtual machine 20 are reset: specifically, the GITS_CBASER register, the GITS_CREADR register, and the GITS_CWRITER register are reset to 0 in sequence, and the enable attribute of the GITS_CTLR register is reset to 0. This prevents resource data synchronization between the primary and standby virtual machines from failing, in the one or more synchronization operations that follow the first one, because the target registers return error values and the kernel of the standby virtual machine 20 reports errors.
Specifically, resetting to 0 the plurality of target registers responsible for interrupt management in the register group included in the standby virtual machine 20 of the fault tolerant system 100 in the current state is implemented by clearing logic preset in the standby virtual machine 20, or in response to a clearing instruction initiated by an external device. Illustratively, the clearing logic is preconfigured into the standby virtual machine 20 by a user or a robot program, by manual or automatic loading. For example, a user may enter the clearing logic directly into a computer through an embedded visual interface (UI) or through a peripheral device (e.g., a keyboard), or may configure the clearing logic by automatically or manually importing a robot program. The data synchronization method can therefore be implemented more flexibly, and can further be packaged as a microservice and loaded by a user or an administrator to implement the synchronization of resource data between the primary/standby virtual machines in the fault tolerant system 100. The aforementioned clearing logic or clearing instruction is implemented in the form of a command line; an example in which a single command line independently performs the reset operation that resets a target register to 0 is given below. In addition, the clearing logic or clearing instruction may also be implemented in the form of a plug-in, a microservice, or the like.
For example, a user may log in, through a peripheral device (e.g., a computer or keyboard), to the host of the standby system 2 on which the standby virtual machine 20 is deployed, and enter the command line "virsh reg-reset vm-name gicv3-its_cbaser", which is sent to the qemu-kvm of the standby virtual machine 20. The clearing logic contained in the command line (i.e., the aforementioned clearing instruction) is executed by the qemu-kvm of the standby virtual machine 20 on the specified target register, so as to reset that target register (i.e., the gits_cbaser register) of the standby virtual machine (named vm-name) to 0. The value of the gits_cbaser register, as a target register in the standby virtual machine, is thus reset from the state set to 1 in the first resource data synchronization operation back to 0, which facilitates subsequent resource data synchronization operations. Referring to FIG. 6, the target register is located at the CPU interface; the reset operation is performed and completed in the CPU interface of the CPU in the standby system 2, and the enable attribute of the register group is reset to 0. As a result, each time the ARM-architecture CPU performs resource data synchronization, after the standby system 2 (or the standby virtual machine 20) receives the resource data synchronization request initiated by the primary system 1 (or the primary virtual machine 10), the resource data of the primary system 1 (or the primary virtual machine 10) in any state is synchronously written into the standby disk of the standby system 2. The CPU interface is used to handle interrupts. Thus, in the present embodiment, the reset operations that reset the three target registers to 0 and reset the enable attribute of one target register (i.e., the GITS_CTLR register) to 0 can each be performed independently by a single command line.
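As a non-limiting sketch, the per-register reset triggered by a single command line could be dispatched on the register name as shown below. Only the name gicv3-its_cbaser appears in the command line of this embodiment; the other three names, the type its_state, and the function its_reset_by_name are hypothetical and are chosen by analogy for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the ITS state held by the user-mode program. */
typedef struct {
    uint64_t cbaser;
    uint64_t creadr;
    uint64_t cwriter;
    bool     enabled;  /* enable attribute of GITS_CTLR */
} its_state;

/* Resets the single target register selected by name.
 * Returns 0 on success, or -1 if the name is not recognized. */
int its_reset_by_name(its_state *its, const char *name)
{
    assert(its != NULL && name != NULL);

    if (strcmp(name, "gicv3-its_cbaser") == 0) {
        its->cbaser = 0;
    } else if (strcmp(name, "gicv3-its_creadr") == 0) {   /* assumed name */
        its->creadr = 0;
    } else if (strcmp(name, "gicv3-its_cwriter") == 0) {  /* assumed name */
        its->cwriter = 0;
    } else if (strcmp(name, "gicv3-its_ctlr") == 0) {     /* assumed name */
        its->enabled = false;  /* only the enable attribute is cleared */
    } else {
        return -1;
    }
    return 0;
}
```

Each call touches exactly one register and leaves the others unchanged, which corresponds to performing the four reset operations independently, one command line at a time.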
In addition, in the present implementation the reset operations that reset the three target registers in the register group to 0 and reset the enable attribute of one target register (i.e., the GITS_CTLR register) to 0 may also be performed synchronously in batch, with the main code as follows.
[Figures BDA0003438297560000131 and BDA0003438297560000141: code listing, reproduced as images in the original publication]
In the aforementioned code, its->cbaser = 0 resets the GITS_CBASER register to 0, its->creadr = 0 resets the GITS_CREADR register to 0, its->cwriter = 0 resets the GITS_CWRITER register to 0, and its->enabled = 0 resets the enable attribute of the GITS_CTLR register to 0.
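Because the original listing is available only as images, the following C sketch reconstructs the batch reset from the four assignments described above. It is an illustration under assumptions rather than the code of the embodiment: the structure its_model and the function its_reset_all are hypothetical, and only the field names cbaser, creadr, cwriter, and enabled are taken from the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ITS model; only the four field names come from the text. */
typedef struct {
    uint64_t cbaser;   /* GITS_CBASER */
    uint64_t creadr;   /* GITS_CREADR */
    uint64_t cwriter;  /* GITS_CWRITER */
    bool     enabled;  /* enable attribute of GITS_CTLR */
} its_model;

/* Batch reset: clears the three queue registers and the enable attribute
 * in one call, before the next resource data synchronization. */
void its_reset_all(its_model *its)
{
    assert(its != NULL);

    its->cbaser  = 0;
    its->creadr  = 0;
    its->cwriter = 0;
    its->enabled = 0;
}
```

The function is idempotent, so calling it before every synchronization, including the first, leaves the registers in the same all-zero state.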
After the reset operation is finished, the interrupt resolution service of the standby system 2 releases the resource data held in the temporary storage of the standby virtual machine 20, where the temporary storage includes a virtual memory, a database, or part of the storage space in the storage system 50 logically located at the back end of the primary system 1 and the standby system 2. The temporary data in the primary memory area 10a and the standby memory area 20a is thereby released.
It should be noted that the primary virtual machine 10 and the standby virtual machine 20 may also be understood as servers or as nodes. The primary virtual machine 10 and the standby virtual machine 20 may be two virtual machines on the same physical machine whose resource data have the same configuration and state, or two virtual machines on different physical machines whose resource data have the same configuration and state. There is one primary virtual machine 10, and there may be one or more standby virtual machines 20; the present invention does not limit the logical location on which the primary/standby virtual machines depend. Illustratively, the primary virtual machine 10 of the fault tolerant system 100 may be deployed in the primary system 1 and the standby virtual machine 20 of the fault tolerant system 100 in the standby system 2; alternatively, the standby virtual machine 20 may be deployed in the primary system 1 and the primary virtual machine 10 in the standby system 2.
The standby system 2 comprises user space and kernel space. In kernel mode, the CPU can access all data in memory as well as peripheral devices such as hard disks and network cards, and the CPU can switch itself from one program to another. In user mode, only limited access to memory is available and access to peripheral devices is not allowed; a program's hold on the CPU can be revoked, so that CPU resources can be acquired by other programs. The qemu-kvm user-mode program comprises a register group that includes the following four target registers: the GITS_CBASER register, the GITS_CREADR register, the GITS_CWRITER register, and the GITS_CTLR register.
Further, the data synchronization method disclosed in this embodiment may also include step S3, performed after step S2 is completed; step S3 is optional.
Step S3: release the ITS-based mapping relationship between devices and interrupt allocation of the standby virtual machine 20 belonging to the standby system 2.
Illustratively, when the primary/standby virtual machines constituting the fault tolerant system 100 (or fault tolerant server) are deployed on physical servers built on the Kunpeng 920 physical CPU, each virtual machine is allocated 8 virtual CPU cores and 16 GB of virtual memory, and each virtual machine has 50 GB of data mounted, then when the primary virtual machine 10 (or the primary system 1) fails, the resource data representing the primary virtual machine 10 can be synchronized into the standby virtual machine 20 (or the standby system 2) within 1 second over a 10-gigabit network, so that the service provided to users by the primary virtual machine 10 (or the primary system 1) is not affected and the failure is not perceived by users.
Based on the fault-tolerant-system-based data synchronization method disclosed above, this embodiment further discloses a data synchronization system 200 based on a fault tolerant system (hereinafter referred to as "data synchronization system 200").
Referring to FIG. 9, the data synchronization system 200 includes a state acquisition module 201 and a reset module 202. The state acquisition module 201 periodically determines the resource data synchronization state of the primary virtual machine 10 and the standby virtual machine 20 included in a fault tolerant system 100 in the current state, and, at least before the primary virtual machine 10 and the standby virtual machine 20 perform resource data synchronization, the reset module 202 resets to 0 a plurality of target registers responsible for interrupt management in the register group included in the standby virtual machine 20 in the current state. The data synchronization system 200 further includes a clearing module 203, which releases the ITS-based mapping relationship between devices and interrupt allocation of the standby virtual machine 20, so that the primary virtual machine 10 can perform resource data synchronization with the standby virtual machine 20. The target registers include the GITS_CBASER register, the GITS_CREADR register, the GITS_CWRITER register, and the GITS_CTLR register. The clearing module 203 is configured with clearing logic preset in the virtual machine, or responds to a clearing instruction initiated by a peripheral device.
In this embodiment, the reset module 202 resets to 0 the plurality of target registers responsible for interrupt management in the register group included in the standby virtual machine 20 of the fault tolerant system 100 in the current state as follows: the GITS_CBASER register, the GITS_CREADR register, and the GITS_CWRITER register included in the register group of the standby virtual machine 20 in the current state are reset to 0, and the enable attribute of the GITS_CTLR register is reset to 0. For the specific implementation of the reset operations that reset the three target registers to 0 and reset the enable attribute of the GITS_CTLR register to 0, refer to the corresponding embodiment of the data synchronization method based on the fault tolerant system; the details are not repeated here.
The aforementioned data synchronization system 200 based on the fault tolerant system and the data synchronization method based on the fault tolerant system have the same technical solutions, which are described in the foregoing data synchronization method based on the fault tolerant system and are not described herein again.
Finally, based on the foregoing technical solution, the present embodiment further discloses a computer-readable medium 900.
Referring to FIG. 10, the present example discloses one embodiment of a computer readable medium 900. The computer-readable medium 900 may be disposed in whole or in part in a physical form of a computer, server, cluster server, or data center.
In the present embodiment, a computer-readable medium 900 is provided, the computer-readable medium 900 stores computer program instructions 901, and the computer program instructions 901 are read and executed by a processor 902 to perform the steps of the data synchronization method based on the fault tolerant system as disclosed above.
Alternatively, the computer-readable medium 900 may be configured as a server, with the server running on a physical device that forms a private cloud, a hybrid cloud, or a public cloud. The computer-readable medium 900 may also be configured as a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The computer readable medium 900 is used for storing a program, and the processor 902 executes the data synchronization method based on the fault tolerant system after receiving the execution instruction.
Meanwhile, the processor 902 disclosed in this embodiment may be an integrated circuit chip with signal processing capability. The processor 902 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor 902 may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present invention. The general-purpose processor may be a microprocessor or any conventional processor.
The above-listed detailed description is only a specific description of a possible embodiment of the present invention, and they are not intended to limit the scope of the present invention, and equivalent embodiments or modifications made without departing from the technical spirit of the present invention should be included in the scope of the present invention.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution. The specification is written this way for clarity only; those skilled in the art should treat the specification as a whole, and the technical solutions in the embodiments may be combined as appropriate to form other embodiments that those skilled in the art can understand.

Claims (13)

1. A data synchronization method based on a fault tolerant system is characterized by comprising the following steps:
S1, periodically determining resource data synchronization states of a primary virtual machine and a standby virtual machine contained in a fault tolerant system in the current state;
S2, resetting to 0 a plurality of target registers responsible for interrupt management in a register group contained in the standby virtual machine in the current state, at least before the primary virtual machine and the standby virtual machine execute resource data synchronization.
2. The data synchronization method according to claim 1, wherein the primary system and the standby system are deployed in a server cluster constructed by a same platform or heterogeneous platforms, the primary virtual machine is deployed in the primary system, and the standby virtual machine is deployed in the standby system.
3. The method according to claim 1, wherein after resetting to 0 a plurality of target registers responsible for interrupt management in a register group included in a standby virtual machine of the fault tolerant system in the current state, the method further comprises:
releasing the ITS-based mapping relationship between devices and interrupt allocation of the standby virtual machine.
4. The data synchronization method according to claim 3, wherein the resetting to 0 of a number of target registers in charge of interrupt management in a register set included in the standby virtual machine of the fault tolerant system in the current state is implemented by a flush logic preset by the standby virtual machine or in response to a flush instruction initiated by an external device;
the target registers include: a GITS_CBASER register, a GITS_CREADR register, a GITS_CWRITER register, and a GITS_CTLR register.
5. The method according to claim 4, wherein resetting to 0 the plurality of target registers responsible for interrupt management in the register group included in the standby virtual machine of the fault tolerant system in the current state in step S2 specifically comprises:
resetting to 0 the GITS_CBASER register, the GITS_CREADR register, and the GITS_CWRITER register included in the register group of the standby virtual machine in the current state, and resetting the enable attribute of the GITS_CTLR register to 0.
6. The data synchronization method of claim 5, wherein the clearing logic is preconfigured into the standby virtual machine by a user or a robot program, by manual or automatic loading.
7. The data synchronization method according to claim 5, wherein a timestamp at which the plurality of target registers responsible for interrupt management in the register group included in the standby virtual machine in the current state are reset to 0 is earlier than a timestamp at which resource data synchronization is performed between the primary virtual machine and the standby virtual machine.
8. The data synchronization method according to claim 5, wherein after resetting to 0 the plurality of target registers responsible for interrupt management in the register group included in the standby virtual machine in the current state, the method further comprises:
releasing, by the interrupt resolution service of the standby system, the resource data held in the temporary storage of the standby virtual machine, wherein the temporary storage comprises a virtual memory, a database, or part of the storage space in a storage system logically located at the back end of the primary system and the standby system.
9. The data synchronization method according to any one of claims 1 to 8, wherein the resource data includes one or more of CPU data, memory data, disk data, configuration data, or plug-ins.
10. A data synchronization system based on a fault tolerant system, comprising:
the device comprises a state acquisition module and a reset module;
the state acquisition module periodically determines resource data synchronization states of a primary virtual machine and a standby virtual machine included in a fault tolerant system in the current state, and, at least before the primary virtual machine and the standby virtual machine perform resource data synchronization, the reset module resets to 0 a plurality of target registers responsible for interrupt management in a register group included in the standby virtual machine in the current state.
11. The data synchronization system of claim 10, wherein the destination register comprises:
a GITS_CBASER register, a GITS_CREADR register, a GITS_CWRITER register, and a GITS_CTLR register;
the resetting to 0, by the reset module, of the plurality of target registers responsible for interrupt management in the register group included in the standby virtual machine of the fault tolerant system in the current state specifically includes:
resetting to 0 the GITS_CBASER register, the GITS_CREADR register, and the GITS_CWRITER register included in the register group of the standby virtual machine in the current state, and resetting the enable attribute of the GITS_CTLR register to 0.
12. The data synchronization system of claim 10, further comprising: a clearing module configured to release the ITS-based mapping relationship between devices and interrupt allocation of the standby virtual machine.
13. A computer-readable medium, in which computer program instructions are stored, which computer instructions, when read and executed by a processor, perform the steps of the fault tolerant system based data synchronization method according to any of the claims 1 to 9.
CN202111623520.4A 2021-12-28 2021-12-28 Data synchronization method, system and computer readable medium based on fault tolerant system Pending CN114296875A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111623520.4A CN114296875A (en) 2021-12-28 2021-12-28 Data synchronization method, system and computer readable medium based on fault tolerant system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111623520.4A CN114296875A (en) 2021-12-28 2021-12-28 Data synchronization method, system and computer readable medium based on fault tolerant system

Publications (1)

Publication Number Publication Date
CN114296875A true CN114296875A (en) 2022-04-08

Family

ID=80971370

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111623520.4A Pending CN114296875A (en) 2021-12-28 2021-12-28 Data synchronization method, system and computer readable medium based on fault tolerant system

Country Status (1)

Country Link
CN (1) CN114296875A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination