CN111930563A - Fault tolerance method in cloud simulation system - Google Patents

Publication number
CN111930563A
Authority
CN
China
Prior art keywords
fault
node
simulation
resource
backup
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010683652.5A
Other languages
Chinese (zh)
Other versions
CN111930563B (en)
Inventor
陈志佳
孟宪国
冯少冲
邸彦强
朱元昌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Engineering University of PLA
Priority to CN202010683652.5A
Publication of CN111930563A
Application granted
Publication of CN111930563B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1479 Generic software techniques for error detection or fault masking
    • G06F11/1482 Generic software techniques for error detection or fault masking by means of middleware or OS functionality
    • G06F11/1484 Generic software techniques for error detection or fault masking by means of middleware or OS functionality involving virtual machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023 Failover techniques
    • G06F11/203 Failover techniques using migration

Abstract

The invention provides a fault tolerance method in a cloud simulation system, comprising the following steps. S10: detecting that a fault has occurred in the system. S20: determining whether the fault is a simulation software fault; if so, performing fault tolerance in a snapshot fault tolerance mode by rolling back to the previous normal operating point; if not, executing step S30. S30: determining whether the fault is a simulation node fault; if so, performing fault tolerance in a backup fault tolerance mode by selecting a backup node to run as the new simulation node; if not, executing step S40. S40: determining that the fault is a simulation server fault, and performing fault tolerance in a virtual machine migration fault tolerance mode.

Description

Fault tolerance method in cloud simulation system
Technical Field
The invention relates to the technical field of computers, in particular to a fault tolerance method in a cloud simulation system.
Background
As the number of members in a distributed simulation system grows, its running time lengthens, and the simulation scale expands, the reliability of the simulation system gradually decreases and the probability of a fault gradually increases. If a key simulation node fails, or data transmission is blocked or lost because of network delay, the whole simulation system may crash. If the system has no fault tolerance capability, the only remedy is to restart the whole simulation system, which may have serious consequences and prevent the simulation from advancing normally. Improving fault tolerance is therefore a key problem that a distributed simulation system must solve.
Disclosure of Invention
In view of the above technical problems, the present invention provides a fault tolerance method in a cloud simulation system to overcome the above deficiencies in the prior art.
The invention provides a fault tolerance method in a cloud simulation system, comprising the following steps. S10: detecting that a fault has occurred in the system. S20: determining whether the fault is a simulation software fault; if so, performing fault tolerance in a snapshot fault tolerance mode by rolling back to the previous normal operating point; if not, executing step S30. S30: determining whether the fault is a simulation node fault; if so, performing fault tolerance in a backup fault tolerance mode by selecting a backup node to run as the new simulation node; if not, executing step S40. S40: determining that the fault is a simulation server fault, and performing fault tolerance in a virtual machine migration fault tolerance mode.
In some embodiments, step S20 further includes: if the fault cannot be eliminated by the snapshot fault tolerance mode, performing fault tolerance in the backup fault tolerance mode by selecting a backup node to run as the new simulation node.
In some embodiments, step S40 further includes: if the fault still cannot be eliminated by the virtual machine migration fault tolerance mode, performing fault tolerance in the backup fault tolerance mode by selecting a backup node to run as the new simulation node.
In some embodiments, fault tolerance in the snapshot fault tolerance mode includes setting a snapshot fault tolerance period $T_p$, where $T_p$ is proportional to the resource consumption rate and the consumption time.
In some embodiments, the snapshot fault tolerance period is given by
[Formula image BDA0002584990380000021: expression for $T_p$, not reproduced in the text]
where $\Delta T$ is the time consumed by a snapshot, $r$ is the software and hardware resources required by the snapshot, $R$ is the upper limit of the software and hardware resources, $L$ is the simulation task level of the simulation node and takes the value 1, 2, 3, 4 or 5, $K$ is an adjustment parameter, and $T_f$ is the mean time between failures.
In some embodiments, fault tolerance in the backup fault tolerance mode includes: S301: during system operation, setting a plurality of corresponding backup nodes for at least one simulation node; S302: the simulation node sends heartbeat information and its current simulation data to each backup node, and each backup node sends heartbeat information to the simulation node; S303: if more than 1/2 of the backup nodes corresponding to the simulation node do not receive the heartbeat information of the simulation node within a heartbeat cycle, determining that the simulation node has failed; otherwise, determining that the simulation node is working normally; S304: if, after waiting M heartbeat cycles, the simulation node still has not received heartbeat information from a corresponding backup node, determining that the backup node is invalid, deleting it, and assigning at least one new backup node to the simulation node, where M is a positive integer and M ≥ 20; S305: when the simulation node fails, selecting by election one backup node from all backup nodes corresponding to the simulation node as the new simulation node; S306: the new simulation node continues to exchange information with the other simulation nodes, and at least one corresponding backup node is set for the new simulation node.
In some embodiments, the method for selecting the creation location of a backup node in the backup fault tolerance mode includes: S100: calculating the distances between a plurality of potential target servers and the simulation node; S200: sorting the potential target servers in ascending order of their distance from the simulation node; S300: obtaining the available resource amount $R_0$ of each potential target server through the monitoring system; S400: comparing the resource amount $R_r$ required by the simulation node with the available resource amount $R_0$ of each potential target server; S500: if the available resource amount $R_0$ of a potential target server is greater than the resource amount $R_r$ required by the simulation node, the target server can accommodate the simulation node, and the target server that ranks first in the distance ordering among those able to accommodate the simulation node is selected to create the backup node.
In some embodiments, the election compares the simulation data of the most recent N heartbeat cycles and selects, as the new simulation node, the backup node whose simulation data is most similar to that of the other backup nodes, where N is a positive integer and 3 ≤ N ≤ 7.
In some embodiments, the virtual machine migration fault tolerance mode comprises: S401: obtaining the resource demand values of each virtual machine in the current server, including the CPU resource $u_{CPU}$, memory resource $u_{Mem}$, bandwidth resource $u_{Bw}$, GPU resource $u_{GPU}$ and storage resource $u_{St}$; S402: constructing, from the resource demand values of each virtual machine in the server, a vector representing the resource demand of the virtual machine,
$$u_i = \left(u^{i}_{CPU},\ u^{i}_{Mem},\ u^{i}_{Bw},\ u^{i}_{GPU},\ u^{i}_{St}\right)$$
where i is the serial number of the virtual machine, and the resource demand matrix of the n virtual machines is:
$$\begin{pmatrix} u^{1}_{CPU} & u^{1}_{Mem} & u^{1}_{Bw} & u^{1}_{GPU} & u^{1}_{St} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ u^{n}_{CPU} & u^{n}_{Mem} & u^{n}_{Bw} & u^{n}_{GPU} & u^{n}_{St} \end{pmatrix}$$
S403: determining the virtual machine resources to be migrated, and selecting a target server according to the virtual machine resources to be migrated; S404: determining the remaining resources of each candidate target server,
$$r_j = \left(r^{j}_{CPU},\ r^{j}_{Mem},\ r^{j}_{Bw},\ r^{j}_{GPU},\ r^{j}_{St}\right)$$
and comparing the remaining resources $r_j$ of each candidate target server with the maximum resource demand vector $u_{vmmax}$ of the virtual machine to be migrated; if $r_j \times 85\% > u_{vmmax}$, the server is taken as a target server, where j is the serial number of the candidate target server.
In some embodiments, the method further comprises the following steps: S405: if a plurality of servers satisfy the condition in step S404, assigning different weights to the resource demands of the virtual machines according to the task requirements, the weights comprising the CPU weight $w_{CPU}$, memory weight $w_{Mem}$, network bandwidth weight $w_{Bw}$, GPU weight $w_{GPU}$ and storage weight $w_{St}$, which form the weight vector $w = \{w_{CPU}, w_{Mem}, w_{Bw}, w_{GPU}, w_{St}\}$; S406: calculating the Hadamard product of the weight vector and the resource demand matrix to obtain the weighted resource demand matrix, whose row i holds the weighted demands of the i-th virtual machine:
$$\begin{pmatrix} w_{CPU}u^{1}_{CPU} & w_{Mem}u^{1}_{Mem} & w_{Bw}u^{1}_{Bw} & w_{GPU}u^{1}_{GPU} & w_{St}u^{1}_{St} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ w_{CPU}u^{n}_{CPU} & w_{Mem}u^{n}_{Mem} & w_{Bw}u^{n}_{Bw} & w_{GPU}u^{n}_{GPU} & w_{St}u^{n}_{St} \end{pmatrix}$$
S407: each row of the weighted resource demand matrix represents the weighted resource demand of one virtual machine; according to the priority of the resource demand types, the virtual machine whose entry is the maximum of the corresponding column, that is, the one with the maximum weighted resource demand, is selected as the virtual machine to be migrated; S408: obtaining the resource remainder matrix R of the plurality of servers from the remaining resources of each server, expressed as:
$$R = \begin{pmatrix} r^{1}_{CPU} & r^{1}_{Mem} & r^{1}_{Bw} & r^{1}_{GPU} & r^{1}_{St} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ r^{k}_{CPU} & r^{k}_{Mem} & r^{k}_{Bw} & r^{k}_{GPU} & r^{k}_{St} \end{pmatrix}$$
S409: for the matrix R, comparing the elements of each column to obtain the maximum value of each column, and recording the row number $j \in [1, 2, \ldots, k]$ of that element; then, according to the resource type of maximum demand selected from the weighted resource demand matrix, selecting as the target server the server corresponding to the row in which the column of R for that resource type attains its maximum.
Drawings
FIG. 1 is a schematic diagram of the system of the present invention;
FIG. 2 is a flowchart of a fault tolerance method in a cloud simulation system according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a snapshot fault tolerance cycle of the present invention;
FIG. 4 is a schematic diagram of the relationship between a simulation node and a backup node according to the present invention;
FIG. 5 is a flowchart illustrating the detailed steps of step S30 in FIG. 2;
FIG. 6 is a flowchart illustrating the detailed steps of the backup location calculation of FIG. 5;
FIG. 7 is a flowchart illustrating the detailed steps of step S40 in FIG. 2.
Detailed Description
Certain embodiments of the invention now will be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all embodiments of the invention are shown. Indeed, various embodiments of the invention may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
At present, because a virtual machine instance needs preparation time to initialize and cannot take effect immediately, scheduling virtual machine resources solely by monitoring the task load and resource performance of the virtual machines makes it difficult to guarantee the performance of virtual machine nodes.
An embodiment of the invention provides a fault tolerance method in a cloud simulation system. The system structure is shown schematically in FIG. 1. To facilitate fault analysis, the faults occurring in the simulation system are divided into three major types: simulation software faults, simulation node faults, and simulation server faults. A simulation software fault is a fault occurring in a simulation application, a simulation process, or the like; a simulation node fault mainly refers to a fault occurring in a simulation virtual machine; and a simulation server fault mainly refers to a fault occurring in the server hosting the simulation node. After a fault occurs, a fault tolerance strategy is selected according to the fault type so as to reduce overhead while preserving the fault tolerance effect: for a simulation software fault, fault tolerance is performed mainly by snapshot rollback; for a simulation node fault, mainly by backup nodes; for a simulation server fault, by migrating the virtual machine to a non-faulty server or, as needed, by starting the backup mode.
As shown in fig. 2, the fault tolerance method in the cloud simulation system provided by the present disclosure includes the following steps:
S10: detecting that a fault has occurred in the system;
S20: determining whether the fault is a simulation software fault; if so, performing fault tolerance in a snapshot fault tolerance mode by rolling back to the previous normal operating point; if not, executing step S30;
S30: determining whether the fault is a simulation node fault; if so, performing fault tolerance in a backup fault tolerance mode by selecting a backup node to run as the new simulation node; if not, executing step S40;
S40: determining that the fault is a simulation server fault, and performing fault tolerance in a virtual machine migration fault tolerance mode.
In this embodiment, step S20 further includes: if the fault cannot be eliminated by the snapshot fault tolerance mode, fault tolerance is performed in the backup fault tolerance mode, and a backup node is selected to run as the new simulation node. Step S40 further includes: if the fault is neither of the two preceding types, it is determined to be a simulation server fault, and the fault tolerance mode based on virtual machine migration is used preferentially. If the fault still cannot be eliminated, fault tolerance is performed in the backup fault tolerance mode, and a backup node is selected to run as the new simulation node.
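The dispatch in steps S10 to S40, together with the fallback to the backup mode, can be sketched as follows. This is only an illustrative sketch: the `Fault` enum, the `tolerate` function and the three callables in `modes` are hypothetical names that do not appear in the patent, and each callable is assumed to return `True` when its mode has eliminated the fault.

```python
from enum import Enum, auto
from typing import Callable, Dict


class Fault(Enum):
    SOFTWARE = auto()  # fault in a simulation application or process
    NODE = auto()      # fault in a simulation virtual machine
    SERVER = auto()    # fault in the server hosting the simulation node


def tolerate(fault: Fault, modes: Dict[str, Callable[[], bool]]) -> str:
    """Return the name of the fault tolerance mode that cleared the fault.

    `modes` maps "snapshot", "backup" and "migration" to hypothetical
    callables that return True when the fault has been eliminated.
    """
    if fault is Fault.SOFTWARE:    # S20: roll back to a snapshot
        first = "snapshot"
    elif fault is Fault.NODE:      # S30: promote a backup node
        first = "backup"
    else:                          # S40: migrate the virtual machine
        first = "migration"

    if modes[first]():
        return first
    # Fallback described for S20 and S40: switch to the backup mode.
    if first != "backup" and modes["backup"]():
        return "backup"
    raise RuntimeError("fault could not be eliminated")
```

For example, a software fault whose snapshot rollback does not help would be handled by the backup mode under this sketch.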
It will be appreciated by those skilled in the art that, in the present invention, faults are assumed to occur randomly as the system runs and to be detected as soon as they occur; the specific problem of fault monitoring is not considered. After a simulation node fails, communication among the other nodes is not affected, and the node can rejoin the simulation network after it recovers.
In step S20, simulation software faults are monitored mainly by a daemon process. A daemon process is deployed in each simulation node and continuously monitors the running state of the simulation software; after the simulation software fails, the daemon process sends exception information to the monitoring management center, which performs a rollback operation on the virtual machine. The details are as follows.
In the snapshot fault tolerance mode, because the time required to take a snapshot is fixed, the interval between snapshots must be optimized to ensure the fault tolerance effect while reducing resource and time overhead. FIG. 3 is a schematic diagram of the snapshot fault tolerance cycle, in which $C_1, \ldots, C_k$ are snapshots. If the simulation software fails, the system recovers from the most recent snapshot and does not need to roll back to the initial snapshot. Assuming that the time required to take each snapshot is $\Delta T$ and the required software and hardware resources are $r$, the snapshot overhead in one failure-recovery period is $\Delta T \cdot k$, where $k$ is the number of snapshots in one failure-recovery period.
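The recovery rule and the overhead expression above can be illustrated with a short sketch. The function names `rollback_point` and `snapshot_overhead` are hypothetical, and the sketch assumes that snapshot times are kept in an ascending list.

```python
from bisect import bisect_right
from typing import List


def rollback_point(snapshot_times: List[float], failure_time: float) -> float:
    """Return the time of the latest snapshot taken at or before the failure.

    Assumes `snapshot_times` is sorted in ascending order.
    """
    idx = bisect_right(snapshot_times, failure_time)
    if idx == 0:
        raise ValueError("no snapshot precedes the failure")
    return snapshot_times[idx - 1]


def snapshot_overhead(delta_t: float, k: int) -> float:
    """Time spent taking k snapshots of duration delta_t in one period."""
    return delta_t * k
```

With snapshots at times 0, 10, 20 and 30 and a failure at time 27.5, the sketch rolls back to the snapshot at time 20 rather than to the initial one.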
Fault tolerance in the snapshot fault tolerance mode includes setting a snapshot fault tolerance period $T_p$, which is proportional to the resource consumption rate and the consumption time. The snapshot fault tolerance period is given by
[Formula image BDA0002584990380000081: expression for $T_p$, not reproduced in the text]
where $\Delta T$ is the time consumed by a snapshot, $r$ is the software and hardware resources required by the snapshot, $R$ is the upper limit of the software and hardware resources, and $L$ is the simulation task level of the simulation node, taking the value 1, 2, 3, 4 or 5 according to the task level, with a higher level corresponding to a larger number. $K$ is an adjustment parameter, and $T_f$ is the mean time between failures.
The snapshot fault tolerance period $T_p$ is the reciprocal of the snapshot fault tolerance frequency $f$. The frequency $f$ should be inversely proportional to the resource consumption rate and the consumption time. The frequency is also determined by the importance of the simulation task of the simulation node: if the node carries a large simulation workload and its task importance level is high, $f$ is increased; otherwise, $f$ is decreased.
In this embodiment, in step S30, the backup fault tolerance mode uses virtual machines generated by the virtualization technology of the cloud simulation system to provide backups for the simulation system, so that an error in a single simulation node does not crash the distributed simulation system. The backup structure is shown in FIG. 4: the simulation node and its backup nodes are all virtual machine nodes and are physically isolated from one another. A key simulation node is given a plurality of backup nodes, and the number of backup nodes is odd to facilitate election when a backup node is activated. An ordinary node may be given one or more backup nodes.
As shown in FIG. 5, fault tolerance in the backup fault tolerance mode includes the following specific steps:
S301: during system operation, setting a plurality of corresponding backup nodes for at least one simulation node;
S302: the simulation node sends heartbeat information and its current simulation data to each backup node, and each backup node sends heartbeat information to the simulation node;
S303: if more than 1/2 of the backup nodes corresponding to the simulation node do not receive the heartbeat information of the simulation node within a heartbeat cycle, determining that the simulation node has failed; otherwise, determining that the simulation node is working normally;
S304: if, after waiting M heartbeat cycles, the simulation node still has not received heartbeat information from a corresponding backup node, determining that the backup node is invalid, deleting it, and assigning at least one new backup node to the simulation node, where M is a positive integer and M ≥ 20, for example M ≥ 25;
S305: when the simulation node fails, selecting by election one backup node from all backup nodes corresponding to the simulation node as the new simulation node;
S306: the new simulation node continues to exchange information with the other simulation nodes, and at least one corresponding backup node is set for the new simulation node.
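The two heartbeat rules in S303 and S304 can be sketched as below. The function names and the dictionary of per-backup heartbeat flags are assumptions made for illustration; the patent does not specify how heartbeats are recorded.

```python
from typing import Dict


def node_failed(heartbeat_received: Dict[str, bool]) -> bool:
    """S303: the simulation node is judged faulty when more than half of
    its backup nodes did not receive its heartbeat in the current cycle."""
    missed = sum(1 for ok in heartbeat_received.values() if not ok)
    return missed * 2 > len(heartbeat_received)


def backup_expired(missed_cycles: int, m: int = 20) -> bool:
    """S304: a backup node is judged invalid once the simulation node has
    heard nothing from it for M heartbeat cycles (M >= 20)."""
    if m < 20:
        raise ValueError("M must be at least 20")
    return missed_cycles >= m
```

With three backup nodes, two missed heartbeats mark the simulation node as failed while a single missed heartbeat does not, which is one reason an odd number of backups is convenient.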
The election in step S305 compares the simulation data of the most recent N heartbeat cycles and selects, as the new simulation node, the backup node whose simulation data is most similar to that of the other backup nodes, where N is a positive integer and 3 ≤ N ≤ 7. In this embodiment, the simulation data of the most recent 5 heartbeat cycles is preferably compared.
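A minimal sketch of such an election follows. The patent does not define the similarity measure, so this sketch assumes that each backup node holds one numeric sample per heartbeat cycle and that similarity is the negated sum of absolute differences; the function name `elect` and the data layout are likewise hypothetical.

```python
from typing import Dict, List


def elect(recent: Dict[str, List[float]], n: int = 5) -> str:
    """Pick the backup whose last n samples are closest to all the others."""
    if not 3 <= n <= 7:
        raise ValueError("N must be between 3 and 7")

    def distance(a: List[float], b: List[float]) -> float:
        # Assumed measure: sum of absolute differences over the last n cycles.
        return sum(abs(x - y) for x, y in zip(a[-n:], b[-n:]))

    def total_distance(name: str) -> float:
        return sum(distance(recent[name], other)
                   for peer, other in recent.items() if peer != name)

    # The smallest total distance means the highest similarity to the others.
    return min(recent, key=total_distance)
```

A backup node that received stale or corrupted data tends to sit far from the majority and is therefore unlikely to be elected under this measure.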
In this embodiment, optimizing the backup fault tolerance mode requires a sensible choice of where backups are created: a poor placement consumes excessive bandwidth and degrades performance through latency, so the placement should aim to let system tasks execute successfully while minimizing latency. In terms of bandwidth overhead alone, the backup would best be placed where the original simulation node is located. However, because that location may be unable to recover its working state after a fault occurs, another location should be selected to hold the backup.
As shown in FIG. 6, the method for selecting the creation location of a backup node in the backup fault tolerance mode includes:
S100: calculating the distances between a plurality of potential target servers and the simulation node, the distance being expressed as the number of network switching hops between the target server and the server hosting the simulation node;
S200: sorting the potential target servers in ascending order of their distance from the simulation node;
S300: obtaining the available resource amount $R_0$ of each potential target server through the monitoring system;
S400: comparing the resource amount $R_r$ required by the simulation node with the available resource amount $R_0$ of each potential target server;
S500: if the available resource amount $R_0$ of a potential target server is greater than the resource amount $R_r$ required by the simulation node, the target server can accommodate the simulation node, and the target server that ranks first in the distance ordering among those able to accommodate the simulation node is selected to create the backup node. Otherwise, the target location is recalculated by the same method until the conditions are met.
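Steps S100 to S500 amount to a nearest-fit search, which can be sketched as follows. The function name, the use of a single scalar resource amount, and the dictionaries keyed by server name are simplifying assumptions for illustration.

```python
from typing import Dict, List, Optional


def choose_backup_server(hops: Dict[str, int],
                         available: Dict[str, float],
                         required: float) -> Optional[str]:
    """Return the closest server whose available resources exceed the demand."""
    # S100-S200: rank candidates by ascending hop distance.
    ranked: List[str] = sorted(hops, key=lambda name: hops[name])
    for name in ranked:
        # S300-S500: R_0 must be strictly greater than R_r.
        if available[name] > required:
            return name
    return None  # no candidate can host the node; the caller retries later
```

Returning `None` corresponds to the case in which the target location has to be recalculated.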
As shown in FIG. 7, in step S40, fault tolerance in the virtual machine migration fault tolerance mode improves the virtual machine migration process by means of a task-based multi-attribute weighting method, and includes the following specific steps:
S401: obtaining the resource demand values of each virtual machine in the current server, including the CPU resource $u_{CPU}$, memory resource $u_{Mem}$, bandwidth resource $u_{Bw}$, GPU resource $u_{GPU}$ and storage resource $u_{St}$;
S402: constructing, from the resource demand values of each virtual machine in the server, a vector representing the resource demand of the virtual machine,
$$u_i = \left(u^{i}_{CPU},\ u^{i}_{Mem},\ u^{i}_{Bw},\ u^{i}_{GPU},\ u^{i}_{St}\right)$$
where i is the serial number of the virtual machine, and the resource demand matrix of the n virtual machines is:
$$\begin{pmatrix} u^{1}_{CPU} & u^{1}_{Mem} & u^{1}_{Bw} & u^{1}_{GPU} & u^{1}_{St} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ u^{n}_{CPU} & u^{n}_{Mem} & u^{n}_{Bw} & u^{n}_{GPU} & u^{n}_{St} \end{pmatrix}$$
S403: determining the virtual machine resources to be migrated, and selecting a target server according to the virtual machine resources to be migrated;
S404: determining the remaining resources of each candidate target server,
$$r_j = \left(r^{j}_{CPU},\ r^{j}_{Mem},\ r^{j}_{Bw},\ r^{j}_{GPU},\ r^{j}_{St}\right)$$
and comparing the remaining resources $r_j$ of each candidate target server with the maximum resource demand vector $u_{vmmax}$ of the virtual machine to be migrated; if $r_j \times 85\% > u_{vmmax}$, the server is taken as a target server, where j is the serial number of the candidate target server.
S405: if a plurality of servers satisfy the condition in step S404, assigning different weights to the resource demands of the virtual machines according to the task requirements, the weights comprising the CPU weight $w_{CPU}$, memory weight $w_{Mem}$, network bandwidth weight $w_{Bw}$, GPU weight $w_{GPU}$ and storage weight $w_{St}$, which form the weight vector $w = \{w_{CPU}, w_{Mem}, w_{Bw}, w_{GPU}, w_{St}\}$;
S406: calculating the Hadamard product of the weight vector and the resource demand matrix to obtain the weighted resource demand matrix, whose row i holds the weighted demands of the i-th virtual machine:
$$\begin{pmatrix} w_{CPU}u^{1}_{CPU} & w_{Mem}u^{1}_{Mem} & w_{Bw}u^{1}_{Bw} & w_{GPU}u^{1}_{GPU} & w_{St}u^{1}_{St} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ w_{CPU}u^{n}_{CPU} & w_{Mem}u^{n}_{Mem} & w_{Bw}u^{n}_{Bw} & w_{GPU}u^{n}_{GPU} & w_{St}u^{n}_{St} \end{pmatrix}$$
S407: each row of the weighted resource demand matrix represents the weighted resource demand of one virtual machine; according to the priority of the resource demand types, the virtual machine whose entry is the maximum of the corresponding column, that is, the one with the maximum weighted resource demand, is selected as the virtual machine to be migrated;
S408: obtaining the resource remainder matrix R of the plurality of servers from the remaining resources of each server, expressed as:
$$R = \begin{pmatrix} r^{1}_{CPU} & r^{1}_{Mem} & r^{1}_{Bw} & r^{1}_{GPU} & r^{1}_{St} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ r^{k}_{CPU} & r^{k}_{Mem} & r^{k}_{Bw} & r^{k}_{GPU} & r^{k}_{St} \end{pmatrix}$$
S409: for the matrix R, comparing the elements of each column to obtain the maximum value of each column, and recording the row number $j \in [1, 2, \ldots, k]$ of that element; then, according to the resource type of maximum demand selected from the weighted resource demand matrix, selecting as the target server the server corresponding to the row in which the column of R for that resource type attains its maximum.
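Steps S404 to S409 can be read as a capacity check followed by a weighted selection; the sketch below shows one such reading in plain Python. All function names are hypothetical, resources are ordered as CPU, memory, bandwidth, GPU and storage, and the virtual machine is chosen here as the row holding the largest weighted entry, which is an interpretation rather than something the text states unambiguously.

```python
from typing import List, Tuple

RESOURCES = ["CPU", "Mem", "Bw", "GPU", "St"]  # assumed column order


def fits(remaining: List[float], demand: List[float], margin: float = 0.85) -> bool:
    """S404: every remaining resource, scaled by 85 %, must exceed the demand."""
    return all(r * margin > u for r, u in zip(remaining, demand))


def weighted_demand(demand: List[List[float]], w: List[float]) -> List[List[float]]:
    """S406: multiply each VM's demand row element-wise by the weight vector."""
    return [[wi * ui for wi, ui in zip(w, row)] for row in demand]


def pick_vm(weighted: List[List[float]]) -> Tuple[int, int]:
    """S407: return (VM index, resource index) of the largest weighted demand."""
    cells = ((i, c) for i, row in enumerate(weighted) for c in range(len(row)))
    return max(cells, key=lambda ic: weighted[ic[0]][ic[1]])


def pick_server(remaining: List[List[float]], resource: int) -> int:
    """S409: return the server with the largest remainder of that resource."""
    return max(range(len(remaining)), key=lambda j: remaining[j][resource])
```

Indices are zero-based in the sketch, whereas the text numbers virtual machines and servers from 1.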
If the original server still has insufficient resources after the virtual machine has been migrated, further virtual machines are migrated in descending order of resource utilization until the server's resource utilization reaches a satisfactory quality-of-service level.
It should be noted that the shapes and sizes of the respective components in the drawings do not reflect actual sizes and proportions, but merely illustrate the contents of the embodiments of the present invention.
Directional phrases used in the embodiments, such as "upper", "lower", "front", "rear", "left", "right", etc., refer only to the direction of the attached drawings and are not intended to limit the scope of the present invention. The embodiments described above may be mixed and matched with each other or with other embodiments based on design and reliability considerations, i.e., technical features in different embodiments may be freely combined to form further embodiments.
The method steps involved in the embodiments are not limited to the order described, and the order of the steps may be modified as required.
It is to be noted that, in the attached drawings or in the description, the implementation modes not shown or described are all the modes known by the ordinary skilled person in the field of technology, and are not described in detail. Further, the above definitions of the various elements and methods are not limited to the various specific structures, shapes or arrangements of parts mentioned in the examples, which may be easily modified or substituted by those of ordinary skill in the art.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A fault tolerance method in a cloud simulation system, characterized by comprising the following steps:
S10: detecting that a fault has occurred in the system;
S20: determining whether the fault is a simulation software fault; if so, performing fault tolerance in a snapshot fault tolerance mode by rolling back to the previous normal operating point; if not, executing step S30;
S30: determining whether the fault is a simulation node fault; if so, performing fault tolerance in a backup fault tolerance mode by selecting a backup node to run as the new simulation node; if not, executing step S40;
S40: determining that the fault is a simulation server fault, and performing fault tolerance in a virtual machine migration fault tolerance mode.
2. The fault tolerance method of claim 1, wherein step S20 further comprises: if the fault cannot be eliminated by the snapshot fault tolerance mode, performing fault tolerance in the backup fault tolerance mode by selecting a backup node to run as the new simulation node.
3. The fault tolerance method of claim 1, wherein step S40 further comprises: if the fault still cannot be eliminated by the virtual machine migration fault tolerance mode, performing fault tolerance in the backup fault tolerance mode by selecting a backup node to run as the new simulation node.
4. The fault tolerance method according to any one of claims 1-3, characterized in that performing fault tolerance in the snapshot fault-tolerance mode comprises setting a snapshot fault tolerance period T_p, wherein the snapshot fault tolerance period T_p is proportional to the resource consumption rate and the consumption time.
5. The fault tolerance method according to claim 4, characterized in that the snapshot fault tolerance period is
Figure FDA0002584990370000021
wherein ΔT is the time consumed by the snapshot, r is the software and hardware resources required by the snapshot, R is the upper limit of the software and hardware resources, L is the simulation task level of the simulation node and takes the value 1, 2, 3, 4, or 5, K is an adjusting parameter, and T_f is the mean time between failures.
6. The fault tolerance method according to any one of claims 1-3, wherein performing fault tolerance in the backup fault-tolerance mode comprises:
S301: during system operation, setting a plurality of corresponding backup nodes for at least one simulation node;
S302: the simulation node sending heartbeat information and current simulation data to each backup node, and each backup node sending heartbeat information to the simulation node;
S303: if more than 1/2 of all backup nodes corresponding to the simulation node fail to receive the heartbeat information of the simulation node within a given heartbeat cycle, determining that the simulation node has failed; otherwise, determining that the simulation node is operating normally;
S304: if, after waiting M heartbeat cycles, the simulation node still has not received heartbeat information from a corresponding backup node, determining that the corresponding backup node has failed, deleting that backup node, and adding at least one new backup node for the simulation node, wherein M is a positive integer and M ≥ 20;
S305: when the simulation node fails, selecting, by election, one backup node from all backup nodes corresponding to the simulation node as the new simulation node;
S306: the new simulation node continuing to exchange information with the other simulation nodes, and at least one corresponding backup node being set for the new simulation node.
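The two detection rules of steps S303 and S304 can be illustrated with a minimal sketch. The function names and the data layout (one boolean per backup node, and a per-backup counter of consecutive missed cycles) are assumptions made for illustration only.

```python
from typing import Dict, List


def simulation_node_failed(heartbeat_received: List[bool]) -> bool:
    """S303: the simulation node is judged faulty when more than half of
    its backup nodes did not receive its heartbeat in this cycle.

    heartbeat_received[i] is True if backup node i received the heartbeat.
    """
    missed = sum(1 for received in heartbeat_received if not received)
    return missed * 2 > len(heartbeat_received)


def failed_backups(missed_cycles: Dict[str, int], m: int = 20) -> List[str]:
    """S304: a backup node is judged failed once the simulation node has
    waited M heartbeat cycles without hearing from it (M >= 20).

    missed_cycles maps a hypothetical backup node name to the number of
    consecutive cycles without a heartbeat from that node.
    """
    if m < 20:
        raise ValueError("the claim requires M >= 20")
    return sorted(name for name, missed in missed_cycles.items() if missed >= m)
```

The strict majority in `simulation_node_failed` means that an even split (exactly half of the backups missing the heartbeat) does not trigger failover.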
7. The fault tolerance method according to claim 6, wherein selecting the location at which a backup node is created in the backup fault-tolerance mode comprises:
S100: calculating the distances between a plurality of potential target servers and the simulation node;
S200: sorting the potential target servers in ascending order of their distance from the simulation node;
S300: obtaining the available resource amount R_0 of each potential target server through the monitoring system;
S400: comparing the resource amount R_r required by the simulation node with the available resource amount R_0 of each potential target server;
S500: if the available resource amount R_0 of a potential target server is greater than the resource amount R_r required by the simulation node, that target server can accommodate the simulation node; among the target servers able to accommodate the simulation node, selecting the one ranked first in the distance ordering to create the backup node.
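A minimal sketch of steps S100 to S500 follows. It assumes that the distance and the resource amounts are single scalar values and that "ranked first in the distance ordering" means the nearest server; the `Candidate` type and the function name are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    name: str
    distance: float   # distance to the simulation node (metric is an assumption)
    available: float  # available resource amount R_0 reported by monitoring


def choose_backup_server(
    candidates: Sequence[Candidate], required: float
) -> Optional[str]:
    """Return the nearest server whose R_0 strictly exceeds R_r, or None."""
    # S200: ascending order of distance.
    for server in sorted(candidates, key=lambda c: c.distance):
        # S400/S500: the server can accommodate the node only if R_0 > R_r.
        if server.available > required:
            return server.name
    return None
```

Returning `None` when no server qualifies is a design choice of this sketch; the claim does not state what happens in that case.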
8. The fault tolerance method according to claim 6 or 7, characterized in that the election compares the simulation data of the most recent N heartbeat cycles and selects, as the new simulation node, the backup node whose simulation data has the highest degree of similarity to the simulation data of the other backup nodes, wherein N is a positive integer and 3 ≤ N ≤ 7.
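The election of claim 8 might be sketched as below. The claim does not define the similarity measure, so this sketch assumes one scalar sample per heartbeat cycle and treats the smallest summed absolute difference to the other backups as the highest similarity; the function name and the tie-breaking by node name are likewise assumptions.

```python
from typing import Dict, List, Sequence


def elect_new_simulation_node(
    history: Dict[str, Sequence[float]], n: int = 5
) -> str:
    """Pick the backup whose last n samples agree best with the others."""
    if not 3 <= n <= 7:
        raise ValueError("the claim requires 3 <= N <= 7")
    # Keep only the most recent n heartbeat cycles of each backup node.
    recent: Dict[str, List[float]] = {
        name: list(samples)[-n:] for name, samples in history.items()
    }

    def disagreement(name: str) -> float:
        # Summed absolute difference to every other backup (assumed metric).
        mine = recent[name]
        return sum(
            abs(a - b)
            for other, theirs in recent.items()
            if other != name
            for a, b in zip(mine, theirs)
        )

    # Iterating in sorted order makes ties resolve to the smallest name.
    return min(sorted(recent), key=disagreement)
```

A backup that has drifted from the others (for example, because it missed recent simulation data) accumulates a large disagreement and is not elected.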
9. The fault tolerance method according to claim 1, wherein performing fault tolerance in the virtual machine migration fault-tolerance mode comprises:
S401: obtaining the resource demand values of each virtual machine in the current server, including CPU resource u_CPU, memory resource u_Mem, bandwidth resource u_Bw, GPU resource u_GPU, and storage resource u_St;
S402: constructing, from the resource demand values of the virtual machines in the server, a vector representing the resource demand values of each virtual machine,
Figure FDA0002584990370000041
wherein i is the virtual machine serial number, and the resource demand matrix of the n virtual machines is:
Figure FDA0002584990370000042
S403: determining the virtual machine resources to be migrated, and selecting a target server according to the virtual machine resources to be migrated;
S404: determining the resource remaining amount of each candidate target server
Figure FDA0002584990370000043
and comparing the resource remaining amount r_j of each candidate target server with the maximum vector u_vmax of the resource demand values of the virtual machine to be migrated; if r_j × 85% > u_vmax, taking that server as the target server, wherein j is the serial number of the candidate target server.
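The admission test of step S404 can be illustrated as follows. The sketch assumes that each vector lists the five resource types in the order CPU, memory, bandwidth, GPU, storage, and that the inequality r_j × 85% > u_vmax is applied component by component; both points are assumptions, since the vectors themselves appear only as figures.

```python
from typing import List, Sequence

# Assumed component order: CPU, memory, bandwidth, GPU, storage.
Vector = Sequence[float]


def eligible_servers(
    remaining: Sequence[Vector], demand: Vector, headroom: float = 0.85
) -> List[int]:
    """Return the indices j of servers with r_j * 85% > u_vmax.

    remaining[j] is the resource remaining vector r_j of candidate j;
    demand is the demand vector u_vmax of the virtual machine to migrate.
    The comparison is component-wise (an assumption of this sketch).
    """
    return [
        j
        for j, r_j in enumerate(remaining)
        if all(r * headroom > u for r, u in zip(r_j, demand))
    ]
```

The 85% factor keeps a margin on the target server, so a server whose free resources only just match the demand is rejected.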
10. The fault tolerance method according to claim 9, further comprising the steps of:
S405: if a plurality of servers all satisfy the condition in step S404, assigning different weights to the resource requirements of the virtual machine according to the task requirements, including CPU weight w_CPU, memory weight w_Mem, network bandwidth weight w_Bw, GPU weight w_GPU, and storage weight w_St, which form the weight vector w = {w_CPU, w_Mem, w_Bw, w_GPU, w_St};
S406: calculating the Hadamard product of the weight vector and the resource demand matrix to obtain a weighted resource demand matrix:
Figure FDA0002584990370000051
S407: in the matrix
Figure FDA0002584990370000052
each row represents the weighted resource demand of one virtual machine; selecting, according to the priority of the resource demand types, the virtual machine corresponding to the column with the maximum weighted resource demand rate in the matrix as the virtual machine to be migrated;
S408: obtaining a resource remaining matrix R of the plurality of servers from the resource remaining amounts of the plurality of servers, wherein the resource remaining matrix R is expressed as follows:
Figure FDA0002584990370000053
S409: for the matrix R, comparing the elements of each row, obtaining the maximum value of each row of elements, and recording the number j of the row in which that element is located, j ∈ [1, 2, …, k]; and, according to the maximum resource demand type selected from the weighted resource demand matrix, selecting as the target server the server corresponding to the row number at which the column of that resource type in the resource remaining matrix R attains its maximum value.
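Steps S405 to S409 might be sketched as below. Because the matrices appear only as figures, the layout is an assumption: each row is one virtual machine (or one server) and each column one resource type in the order CPU, memory, bandwidth, GPU, storage. The sketch also simplifies "priority of the resource demand types" to taking the single largest weighted entry; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def hadamard(weights: Sequence[float], demand: Matrix) -> List[List[float]]:
    """S406: multiply every row of the demand matrix element-wise by w."""
    return [[w * u for w, u in zip(weights, row)] for row in demand]


def pick_vm_and_server(
    weights: Sequence[float], demand: Matrix, remaining: Matrix
) -> Tuple[int, int, int]:
    """Return (vm index, dominant resource index, target server index)."""
    weighted = hadamard(weights, demand)
    # S407 (simplified): the largest weighted entry identifies both the
    # virtual machine to migrate and its dominant resource type.
    vm, resource = max(
        ((i, k) for i, row in enumerate(weighted) for k in range(len(row))),
        key=lambda pos: weighted[pos[0]][pos[1]],
    )
    # S409: the target is the server with the most remaining capacity
    # of that resource type.
    server = max(range(len(remaining)), key=lambda j: remaining[j][resource])
    return vm, resource, server
```

Matching the dominant weighted demand to the server with the most of that resource is one plausible reading of S409; other tie-breaking or priority schemes would fit the claim language equally well.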
CN202010683652.5A 2020-07-15 2020-07-15 Fault tolerance method in cloud simulation system Active CN111930563B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010683652.5A CN111930563B (en) 2020-07-15 2020-07-15 Fault tolerance method in cloud simulation system

Publications (2)

Publication Number Publication Date
CN111930563A true CN111930563A (en) 2020-11-13
CN111930563B CN111930563B (en) 2022-01-11

Family

ID=73313602

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010683652.5A Active CN111930563B (en) 2020-07-15 2020-07-15 Fault tolerance method in cloud simulation system

Country Status (1)

Country Link
CN (1) CN111930563B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112631981A (en) * 2020-12-23 2021-04-09 中国人民解放军63921部队 Reliable fault-tolerant simulation engine for simulation training

Also Published As

Publication number Publication date
CN111930563B (en) 2022-01-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant