CN111930563A - Fault tolerance method in cloud simulation system

Info

Publication number: CN111930563A
Application number: CN202010683652.5A
Authority: CN
Inventors: 陈志佳; 孟宪国; 冯少冲; 邸彦强; 朱元昌
Original assignee: Army Engineering University of PLA
Current assignee: Army Engineering University of PLA
Priority date: 2020-07-15
Filing date: 2020-07-15
Publication date: 2020-11-13
Anticipated expiration: 2040-07-15
Also published as: CN111930563B

Abstract

The invention provides a fault tolerance method in a cloud simulation system, comprising the following steps: S10: detecting that a fault has occurred in the system; S20: determining whether the fault is a simulation software fault; if so, performing fault tolerance in a snapshot fault tolerance mode and rolling back to the previous normal operating state; if not, executing step S30; S30: determining whether the fault is a simulation node fault; if so, performing fault tolerance in a backup fault tolerance mode and selecting a backup node to operate as the new simulation node; if not, executing step S40; S40: determining that the fault is a simulation server fault, and performing fault tolerance in a virtual machine migration fault tolerance mode.

Description

Fault tolerance method in cloud simulation system

Technical Field

The invention relates to the technical field of computers, in particular to a fault tolerance method in a cloud simulation system.

Background

As the number of members in a distributed simulation system increases, the running time lengthens, and the simulation scale expands, the reliability of the simulation system gradually decreases and the probability of failure gradually increases. If a key simulation node fails, or data transmission is blocked or lost because of network delay, the whole simulation system may crash. If the system has no fault tolerance capability, the only remedy is to restart the whole simulation system, which may have serious consequences and prevent the simulation from advancing normally. Improving fault tolerance is therefore a key problem that a distributed simulation system must solve.

Disclosure of Invention

In view of the above technical problems, the present invention provides a fault tolerance method in a cloud simulation system to overcome the above deficiencies in the prior art.

In some embodiments, step S20 further includes: if the fault cannot be eliminated by fault tolerance in the snapshot fault tolerance mode, performing fault tolerance in the backup fault tolerance mode and selecting a backup node to operate as the new simulation node.

In some embodiments, step S40 further includes: if the fault still cannot be eliminated by fault tolerance in the virtual machine migration fault tolerance mode, performing fault tolerance in the backup fault tolerance mode and selecting a backup node to operate as the new simulation node.

In some embodiments, the fault tolerance in the snapshot fault tolerance mode comprises setting a snapshot fault tolerance period T_p, wherein the snapshot fault tolerance period T_p is proportional to the resource consumption rate and the consumption time.

In some embodiments, the snapshot fault tolerance period T_p is determined from the following quantities: ΔT is the consumption time of the snapshot, r is the software and hardware resources required by the snapshot, R is the upper limit of the software and hardware resources, L is the simulation task level of the simulation node and takes the value 1, 2, 3, 4 or 5, K is an adjustment parameter, and T_f is the mean time between failures.

In some embodiments, the fault tolerance using the backup fault tolerance mode includes: S301: during system operation, setting a plurality of corresponding backup nodes for at least one simulation node; S302: the simulation node sends heartbeat information and current simulation data to each backup node, and each backup node sends heartbeat information to the simulation node; S303: if more than 1/2 of all backup nodes corresponding to the simulation node do not receive the heartbeat information of the simulation node within a heartbeat cycle, determining that the simulation node has failed; otherwise, determining that the simulation node is working normally; S304: if, after waiting M heartbeat cycles, the simulation node still has not received heartbeat information from a corresponding backup node, determining that the backup node is invalid, deleting it, and assigning at least one new backup node to the simulation node, where M is a positive integer and M ≥ 20; S305: when the simulation node fails, selecting one backup node from all backup nodes corresponding to the simulation node as the new simulation node by election; S306: the new simulation node continues to exchange information with the other simulation nodes, and at least one corresponding backup node is set for the new simulation node.

In some embodiments, the method for selecting the creation location of the backup node in the backup fault tolerance mode includes: S100: calculating the distances between a plurality of potential target servers and the simulation node; S200: sorting the potential target servers in ascending order of their distance to the simulation node; S300: obtaining the available resource amount R₀ of each potential target server through the monitoring system; S400: comparing the resource amount R_r required by the simulation node with the available resource amount R₀ of each potential target server; S500: if the available resource amount R₀ of a potential target server is greater than the resource amount R_r required by the simulation node, the target server can accommodate the simulation node, and the target server that ranks first in the distance ordering among those able to accommodate the simulation node is selected to create the backup node.

In some embodiments, the election compares the simulation data of the most recent N heartbeat cycles and selects the backup node whose simulation data is most similar to that of the other backup nodes as the new simulation node, where N is a positive integer and 3 ≤ N ≤ 7.

In some embodiments, the virtual machine migration fault tolerance mode comprises: S401: obtaining the resource demand values of each virtual machine in the current server, including CPU resource u_CPU, memory resource u_Mem, bandwidth resource u_Bw, GPU resource u_GPU and storage resource u_St; S402: constructing, from the resource demand values of each virtual machine in the server, a vector representing the resource demand values of that virtual machine, where i is the virtual machine serial number, the vectors of the n virtual machines together forming the resource demand matrix; S403: determining the virtual machine resources to be migrated, and selecting a target server according to the virtual machine resources to be migrated; S404: determining the resource residual r_j of each candidate target server and comparing it with the maximum vector u_vmmax of the resource demand values of the virtual machine to be migrated; if r_j × 85% > u_vmmax, the server is taken as a target server, where j is the serial number of the candidate target server.

In some embodiments, the method further comprises the steps of: S405: if a plurality of servers satisfy the condition in step S404, assigning different weights to the resource demands of the virtual machines according to the task requirements, the weights comprising the CPU weight w_CPU, memory weight w_Mem, network bandwidth weight w_Bw, GPU weight w_GPU and storage weight w_St, which compose the weight vector w = {w_CPU, w_Mem, w_Bw, w_GPU, w_St}; S406: calculating the Hadamard product of the weight vector and the resource demand matrix to obtain the weighted resource demand matrix; S407: each row of the weighted resource demand matrix represents the weighted resource demand of one virtual machine; according to the priority of the resource demand types, selecting the virtual machine with the maximum weighted resource demand in the corresponding column of the matrix as the virtual machine to be migrated; S408: obtaining the resource residual matrix R of the plurality of servers from the resource residuals of those servers; S409: for the matrix R, comparing the elements of each column to obtain the maximum value of each column, and recording the row number j ∈ [1, 2, …, k] of that element; according to the maximum resource demand type selected from the weighted resource demand matrix, selecting as the target server the server corresponding to the row number of the maximum value in that resource type's column of the resource residual matrix R.

Drawings

FIG. 1 is a schematic diagram of the system of the present invention;

FIG. 2 is a flowchart of a fault tolerance method in a cloud simulation system according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a snapshot fault tolerance cycle of the present invention;

FIG. 4 is a schematic diagram of the relationship between a simulation node and a backup node according to the present invention;

FIG. 5 is a flowchart illustrating the detailed steps of step S30 in FIG. 2;

FIG. 6 is a flowchart illustrating the detailed steps of the backup location calculation of FIG. 5;

FIG. 7 is a flowchart illustrating the detailed steps of step S40 in FIG. 2.

Detailed Description

Certain embodiments of the invention now will be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all embodiments of the invention are shown. Indeed, various embodiments of the invention may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements.

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

At present, initializing a virtual machine instance requires a certain preparation time and does not take effect immediately. Consequently, when virtual machine resources are scheduled solely by monitoring the task load and resource performance of the virtual machines, the performance of the virtual machine nodes is difficult to guarantee.

An embodiment of the invention provides a fault tolerance method in a cloud simulation system. The system structure of the invention is shown schematically in FIG. 1. To facilitate fault analysis, the faults occurring in the simulation system are divided into three major types: simulation software faults, simulation node faults, and simulation server faults. A simulation software fault is a fault occurring in a simulation application, a simulation process, or the like; a simulation node fault is mainly a fault occurring in a simulation virtual machine; and a simulation server fault is mainly a fault occurring in the server on which the simulation node is located. After a fault occurs, a fault tolerance strategy is selected according to the fault type so as to reduce overhead while preserving the fault tolerance effect: for a simulation software fault, fault tolerance is performed mainly by snapshot rollback; for a simulation node fault, mainly by backup nodes; for a simulation server fault, by migrating the virtual machine to a non-faulty server or, as needed, by starting the backup mode.

As shown in fig. 2, the fault tolerance method in the cloud simulation system provided by the present disclosure includes the following steps:

S10: detecting that a fault has occurred in the system;

S20: determining whether the fault is a simulation software fault; if so, performing fault tolerance in a snapshot fault tolerance mode and rolling back to the previous normal operating state; if not, executing step S30;

S30: determining whether the fault is a simulation node fault; if so, performing fault tolerance in a backup fault tolerance mode and selecting a backup node to operate as the new simulation node; if not, executing step S40;

S40: determining that the fault is a simulation server fault, and performing fault tolerance in a virtual machine migration fault tolerance mode.

In this embodiment, step S20 further includes: if the fault cannot be eliminated by fault tolerance in the snapshot fault tolerance mode, performing fault tolerance in the backup fault tolerance mode and selecting a backup node to operate as the new simulation node. Step S40 further includes: if the fault is neither of the two preceding types, it is determined to be a simulation server fault, and the fault tolerance mode based on virtual machine migration is adopted first. If the fault still cannot be eliminated, fault tolerance is performed in the backup fault tolerance mode, and a backup node is selected to operate as the new simulation node.
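The dispatch logic of steps S10 to S40, including the fallback to the backup mode, can be sketched as follows. This is only an illustrative sketch: the `FaultType` enum, the `tolerate` function and the `handlers` mapping are hypothetical names introduced here, and each handler is assumed to return `True` when the fault has been eliminated.

```python
from enum import Enum, auto
from typing import Callable, Dict, List


class FaultType(Enum):
    SOFTWARE = auto()  # fault in a simulation application or process
    NODE = auto()      # fault in a simulation virtual machine
    SERVER = auto()    # fault in the server hosting the simulation node


def tolerate(fault: FaultType, handlers: Dict[str, Callable[[], bool]]) -> List[str]:
    """Return the fault tolerance modes that were tried, in order.

    `handlers` maps a mode name to a callable that returns True when the
    fault was eliminated (a hypothetical interface for this sketch).
    """
    if fault is FaultType.SOFTWARE:      # S20: snapshot first, backup as fallback
        plan = ["snapshot", "backup"]
    elif fault is FaultType.NODE:        # S30: backup mode
        plan = ["backup"]
    else:                                # S40: migration first, backup as fallback
        plan = ["migration", "backup"]

    tried = []
    for mode in plan:
        tried.append(mode)
        if handlers[mode]():             # stop as soon as the fault is eliminated
            break
    return tried
```

With a snapshot handler that fails and a backup handler that succeeds, a software fault is handled by trying the snapshot mode and then the backup mode.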

Those skilled in the art will appreciate that, in the present invention, faults are considered to occur randomly as the system operates and to be detectable as soon as they occur; the specific fault monitoring problem is not considered. After a simulation node fails, communication among the other nodes is not affected, and the node can rejoin the simulation network after recovery.

In step S20 of the invention, faults of the simulation software are monitored mainly by a daemon process. A daemon process is deployed in each simulation node and continuously monitors the running state of the simulation software; after the simulation software fails, the daemon process sends the abnormality information to a monitoring management center, which performs a rollback operation on the virtual machine. The details are as follows:

In the snapshot fault tolerance mode, because the time required to take a snapshot is fixed, the snapshot interval needs to be optimized to preserve the fault tolerance effect while reducing resource and time overhead. FIG. 3 is a schematic diagram of the snapshot fault tolerance period, in which C₁ … C_k are snapshots. If the simulation software fails, the system is recovered from the latest snapshot rather than rolled back to the initial snapshot. Assuming that the time required to take each snapshot is ΔT and the required software and hardware resources are r, the snapshot time overhead in one failure-recovery period is ΔT·k, where k is the number of snapshots in one failure-recovery period.

Fault tolerance in the snapshot fault tolerance mode comprises setting a snapshot fault tolerance period T_p, which is proportional to the resource consumption rate and the consumption time. In determining the snapshot fault tolerance period T_p, ΔT is the consumption time of the snapshot, r is the software and hardware resources required by the snapshot, R is the upper limit of the software and hardware resources, and L is the simulation task level of the simulation node; L takes the value 1, 2, 3, 4 or 5 according to the task level, a higher level corresponding to a larger number. K is an adjustment parameter, and T_f is the mean time between failures.

The snapshot fault tolerance period T_p is the reciprocal of the snapshot fault tolerance frequency f, so f should be inversely proportional to the resource consumption rate and the consumption time. The snapshot fault tolerance frequency is also determined by the importance of the simulation task of the simulation node: if the node carries a large amount of simulation tasks and the task importance level is high, the snapshot fault tolerance frequency f is increased; otherwise, it is decreased.
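The qualitative relationships above can be illustrated with a small sketch. The product form used here, T_p = K · (r / R) · ΔT · T_f / L, is an assumption made purely for illustration: it is chosen only because it is proportional to the resource consumption rate and the consumption time and shrinks as the task level rises, and it should not be read as the formula of the invention. The function names are likewise hypothetical.

```python
def snapshot_period(delta_t: float, r: float, r_max: float,
                    level: int, k: float, t_f: float) -> float:
    """Illustrative snapshot period T_p under an ASSUMED product form.

    delta_t: consumption time of one snapshot (ΔT)
    r:       resources required by one snapshot
    r_max:   upper limit of the resources (R)
    level:   simulation task level L, one of 1..5
    k:       adjustment parameter K
    t_f:     mean time between failures T_f
    """
    if level not in (1, 2, 3, 4, 5):
        raise ValueError("task level L must be 1, 2, 3, 4 or 5")
    if not 0 < r <= r_max:
        raise ValueError("required resources must satisfy 0 < r <= R")
    consumption_rate = r / r_max
    # Assumed form: grows with consumption rate and snapshot time,
    # shrinks (snapshots become more frequent) as the task level rises.
    return k * consumption_rate * delta_t * t_f / level


def snapshot_time_overhead(delta_t: float, snapshots: int) -> float:
    """Time overhead ΔT·k of k snapshots in one failure-recovery period."""
    return delta_t * snapshots
```

Under this assumed form, raising the task level from 1 to 5 shortens the period by a factor of five, which corresponds to a higher snapshot frequency f for more important tasks.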

In this embodiment, in step S30, the backup fault tolerance mode uses virtual machines generated by virtualization technology in the cloud simulation system to provide backups for the simulation system, so as to avoid a crash of the distributed simulation system caused by an error in a single simulation node. The backup system structure is shown in FIG. 4; the simulation node and the backup nodes are all virtual machine nodes and are physically isolated from one another. A key simulation node is provided with a plurality of backup nodes, the number of which is odd to facilitate election when a backup node is activated. An ordinary node may be provided with one or more backup nodes.

As shown in fig. 5, the fault tolerance by using the backup fault tolerance mode includes the following specific steps:

S301: during system operation, setting a plurality of corresponding backup nodes for at least one simulation node;

S302: the simulation node sends heartbeat information and current simulation data to each backup node, and each backup node sends heartbeat information to the simulation node;

S303: if more than 1/2 of all backup nodes corresponding to the simulation node do not receive the heartbeat information of the simulation node within a heartbeat cycle, determining that the simulation node has failed; otherwise, determining that the simulation node is working normally;

S304: if, after waiting M heartbeat cycles, the simulation node still has not received heartbeat information from a corresponding backup node, determining that the backup node is invalid, deleting it, and assigning at least one new backup node to the simulation node, where M is a positive integer and M ≥ 20, for example M ≥ 25;

S305: when the simulation node fails, selecting one backup node from all backup nodes corresponding to the simulation node as the new simulation node by election;

S306: the new simulation node continues to exchange information with the other simulation nodes, and at least one corresponding backup node is set for the new simulation node.
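The two heartbeat rules in steps S303 and S304 can be sketched as follows. The function names and the dictionary-based bookkeeping are hypothetical; the sketch assumes the monitoring layer already knows, per backup node, whether the heartbeat arrived in the current cycle and how many consecutive cycles a backup node has been silent.

```python
from typing import Dict, List


def simulation_node_failed(received: Dict[str, bool]) -> bool:
    """S303: the simulation node is judged faulty when more than half of
    its backup nodes did not receive its heartbeat in this cycle.

    `received` maps a backup node name to True if the heartbeat arrived.
    """
    missed = sum(1 for ok in received.values() if not ok)
    return missed * 2 > len(received)


def invalid_backups(silent_cycles: Dict[str, int], m: int = 20) -> List[str]:
    """S304: backup nodes whose heartbeat has been missing for M cycles.

    `silent_cycles` maps a backup node name to the number of consecutive
    heartbeat cycles without a heartbeat from it; the source requires M >= 20.
    """
    if m < 20:
        raise ValueError("M must be at least 20")
    return sorted(name for name, cycles in silent_cycles.items() if cycles >= m)
```

With three backup nodes, a single missed heartbeat does not trigger failover, whereas two missed heartbeats do; an odd number of backup nodes avoids an exact half split.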

The election in step S305 compares the simulation data of the most recent N heartbeat cycles and selects the backup node whose simulation data is most similar to that of the other backup nodes as the new simulation node, where N is a positive integer and 3 ≤ N ≤ 7. In this embodiment, the simulation data of the most recent 5 heartbeat cycles is preferably compared.
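A minimal sketch of such an election is given below. The source does not specify the similarity measure, so this sketch assumes that each backup node holds one numeric value per heartbeat cycle and that similarity is measured by the total absolute difference to the other backup nodes (smaller means more similar); the function name `elect` and the tie-break by node name are also assumptions.

```python
from typing import Dict, List


def elect(history: Dict[str, List[float]], n: int = 5) -> str:
    """Pick the backup node whose last n samples are closest to the others.

    `history` maps a backup node name to its received simulation data,
    one value per heartbeat cycle, oldest first (assumed representation).
    """
    if not 3 <= n <= 7:
        raise ValueError("N must satisfy 3 <= N <= 7")
    recent = {name: data[-n:] for name, data in history.items()}

    def distance_to_others(name: str) -> float:
        # Sum of absolute differences to every other backup node's data.
        return sum(abs(a - b)
                   for other, data in recent.items() if other != name
                   for a, b in zip(recent[name], data))

    # Iterating in sorted order makes ties resolve to the smallest name.
    return min(sorted(recent), key=distance_to_others)
```

A backup node that diverged in the latest cycle accumulates a larger distance to its peers and is therefore not elected.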

In this embodiment, optimizing the backup fault tolerance mode requires considering the rationality of the backup creation location: an improper location distribution occupies excessive bandwidth resources and degrades performance because of latency, so location selection should aim at successful execution of the system tasks and minimization of latency. In terms of bandwidth overhead alone, placing the backup at the original simulation node would be the cheapest choice. However, because the original simulation node may be unable to recover its operating state after a failure, other nodes should be selected as backup placement nodes.

As shown in fig. 6, the method for selecting the creation location of the backup node in the backup fault-tolerant mode includes:

S100: calculating the distances between a plurality of potential target servers and the simulation node, the distance being expressed as the number of network switching hops between the target server and the server on which the simulation node is located;

S200: sorting the potential target servers in ascending order of their distance to the simulation node;

S300: obtaining the available resource amount R₀ of each potential target server through the monitoring system;

S400: comparing the resource amount R_r required by the simulation node with the available resource amount R₀ of each potential target server;

S500: if the available resource amount R₀ of a potential target server is greater than the resource amount R_r required by the simulation node, the target server can accommodate the simulation node, and the target server that ranks first in the distance ordering among those able to accommodate the simulation node is selected to create the backup node. Otherwise, the target location is recalculated according to the above method until all conditions are met.
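Steps S100 to S500 amount to a nearest-feasible-server search, which can be sketched as follows. The `Server` record and the function name are hypothetical, and the available and required resource amounts are treated as single scalars for simplicity.

```python
from typing import List, NamedTuple, Optional


class Server(NamedTuple):
    name: str
    hops: int          # S100: distance to the simulation node's server
    available: float   # S300: available resource amount R₀


def choose_backup_server(servers: List[Server], required: float) -> Optional[Server]:
    """Return the nearest server whose available resources exceed `required`.

    `required` is the resource amount R_r needed by the simulation node.
    Returns None when no candidate can accommodate the node.
    """
    for server in sorted(servers, key=lambda s: s.hops):   # S200: ascending distance
        if server.available > required:                    # S400/S500: R₀ > R_r
            return server
    return None
```

Because the candidates are scanned in ascending order of distance, the first server that passes the resource check is also the closest one, which keeps latency low.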

As shown in FIG. 7, in step S40, fault tolerance in the virtual machine migration fault tolerance mode improves the virtual machine migration process by means of a task-based multi-attribute weighting method, and includes the following specific steps:

S401: obtaining the resource demand values of each virtual machine in the current server, including CPU resource u_CPU, memory resource u_Mem, bandwidth resource u_Bw, GPU resource u_GPU and storage resource u_St;

S402: constructing, from the resource demand values of each virtual machine in the server, a vector representing the resource demand values of that virtual machine;

S403: determining the virtual machine resources to be migrated, and selecting a target server according to the virtual machine resources to be migrated;

S404: determining the resource residual r_j of each candidate target server and comparing it with the maximum vector u_vmmax of the resource demand values of the virtual machine to be migrated; if r_j × 85% > u_vmmax, the server is taken as a target server, where j is the serial number of the candidate target server.

S405: if a plurality of servers satisfy the condition in step S404, assigning different weights to the resource demands of the virtual machines according to the task requirements, the weights comprising the CPU weight w_CPU, memory weight w_Mem, network bandwidth weight w_Bw, GPU weight w_GPU and storage weight w_St, which compose the weight vector w = {w_CPU, w_Mem, w_Bw, w_GPU, w_St};

S406: calculating the Hadamard product of the weight vector and the resource demand matrix to obtain the weighted resource demand matrix;

S407: each row of the weighted resource demand matrix represents the weighted resource demand of one virtual machine; according to the priority of the resource demand types, selecting the virtual machine with the maximum weighted resource demand in the corresponding column of the matrix as the virtual machine to be migrated;

S408: obtaining the resource residual matrix R of the plurality of servers from the resource residuals of those servers;

S409: for the matrix R, comparing the elements of each column to obtain the maximum value of each column, and recording the row number j ∈ [1, 2, …, k] of that element; according to the maximum resource demand type selected from the weighted resource demand matrix, selecting as the target server the server corresponding to the row number of the maximum value in that resource type's column of the resource residual matrix R.
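Steps S404 to S409 can be sketched with plain Python lists. Several points are assumptions of this sketch rather than statements of the source: matrices are stored with one row per virtual machine or server and one column per resource in the order CPU, Mem, Bw, GPU, St; the vector comparison r_j × 85% > u_vmmax is read component-wise; and the prioritized resource type is passed in by the caller. All function names are hypothetical.

```python
from typing import List, Sequence

RESOURCES = ["CPU", "Mem", "Bw", "GPU", "St"]   # assumed column order


def feasible_servers(residuals: Sequence[Sequence[float]],
                     u_vmmax: Sequence[float],
                     headroom: float = 0.85) -> List[int]:
    """S404: indices j with r_j * 85% > u_vmmax for every resource."""
    return [j for j, r in enumerate(residuals)
            if all(x * headroom > need for x, need in zip(r, u_vmmax))]


def weighted_demand(demand: Sequence[Sequence[float]],
                    w: Sequence[float]) -> List[List[float]]:
    """S406: element-wise (Hadamard) product of w with each demand row."""
    return [[wi * u for wi, u in zip(w, row)] for row in demand]


def vm_to_migrate(weighted: Sequence[Sequence[float]], resource: str) -> int:
    """S407: row index of the largest weighted demand in the given column."""
    col = RESOURCES.index(resource)
    return max(range(len(weighted)), key=lambda i: weighted[i][col])


def target_server(residuals: Sequence[Sequence[float]], resource: str,
                  candidates: Sequence[int]) -> int:
    """S409: candidate server with the largest residual of that resource."""
    col = RESOURCES.index(resource)
    return max(candidates, key=lambda j: residuals[j][col])
```

For example, with CPU as the prioritized resource type, the virtual machine with the largest weighted CPU demand is chosen for migration, and among the servers that pass the 85% check the one with the most residual CPU becomes the target.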

If the original server still has insufficient resources after the migration of the virtual machine is finished, further virtual machines are migrated in descending order of resource utilization until the resource utilization of the server reaches a satisfactory quality-of-service level.
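The repeated migration described above can be sketched as a simple loop. The source does not quantify the "satisfactory quality-of-service level", so the sketch assumes a scalar utilization per virtual machine and a hypothetical utilization threshold of 0.8; both the function name and the threshold are illustrative only.

```python
from typing import Dict, List


def migrate_until_satisfied(vm_usage: Dict[str, float], capacity: float,
                            threshold: float = 0.8) -> List[str]:
    """Return the virtual machines to migrate, highest utilization first,
    until the server's utilization drops to `threshold` or below.

    `vm_usage` maps a virtual machine name to the resources it uses on the
    original server; `threshold` is an assumed service-quality target.
    """
    remaining = dict(vm_usage)
    migrated: List[str] = []
    for name in sorted(vm_usage, key=vm_usage.get, reverse=True):
        if sum(remaining.values()) / capacity <= threshold:
            break                      # the server is no longer overloaded
        migrated.append(name)
        del remaining[name]
    return migrated
```

Migrating the heaviest virtual machines first relieves the overloaded server with the fewest migrations.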

It should be noted that the shapes and sizes of the respective components in the drawings do not reflect actual sizes and proportions, but merely illustrate the contents of the embodiments of the present invention.

Directional phrases used in the embodiments, such as "upper", "lower", "front", "rear", "left", "right", etc., refer only to the direction of the attached drawings and are not intended to limit the scope of the present invention. The embodiments described above may be mixed and matched with each other or with other embodiments based on design and reliability considerations, i.e., technical features in different embodiments may be freely combined to form further embodiments.

The method steps involved in the embodiments are not limited to the order described, and the order of the steps may be modified as required.

It is to be noted that, in the attached drawings or in the description, the implementation modes not shown or described are all the modes known by the ordinary skilled person in the field of technology, and are not described in detail. Further, the above definitions of the various elements and methods are not limited to the various specific structures, shapes or arrangements of parts mentioned in the examples, which may be easily modified or substituted by those of ordinary skill in the art.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A fault tolerance method in a cloud simulation system is characterized by comprising the following steps:

S10: detecting that a fault has occurred in the system;

S30: determining whether the fault is a simulation node fault; if so, performing fault tolerance in a backup fault tolerance mode and selecting a backup node to operate as the new simulation node; if not, executing step S40;

2. The fault tolerance method of claim 1, wherein step S20 further comprises: if the fault cannot be eliminated by fault tolerance in the snapshot fault tolerance mode, performing fault tolerance in the backup fault tolerance mode and selecting a backup node to operate as the new simulation node.

3. The fault tolerance method of claim 1, wherein step S40 further comprises: if the fault still cannot be eliminated by fault tolerance in the virtual machine migration fault tolerance mode, performing fault tolerance in the backup fault tolerance mode and selecting a backup node to operate as the new simulation node.

4. The fault tolerance method according to any one of claims 1-3, wherein the fault tolerance using the snapshot fault tolerance mode comprises setting a snapshot fault tolerance period T_p, the snapshot fault tolerance period T_p being proportional to the resource consumption rate and the consumption time.

5. Fault tolerant method according to claim 4, characterized in that the snapshot fault tolerance period

6. Fault tolerant method according to any of claims 1-3, wherein said fault tolerance using a backup fault tolerant mode comprises:

S304: if, after waiting M heartbeat cycles, the simulation node still has not received heartbeat information from a corresponding backup node, determining that the backup node is invalid, deleting it, and assigning at least one new backup node to the simulation node, where M is a positive integer and M ≥ 20;

7. The fault tolerance method of claim 6, wherein the selection of the backup node creation location in the backup fault tolerance mode comprises:

S100: calculating the distances between a plurality of potential target servers and the simulation node;

S500: if the available resource amount R₀ of a potential target server is greater than the resource amount R_r required by the simulation node, the target server can accommodate the simulation node, and the target server that ranks first in the distance ordering among those able to accommodate the simulation node is selected to create the backup node.

8. The fault tolerance method according to claim 6 or 7, wherein the election compares the simulation data of the most recent N heartbeat cycles and selects the backup node whose simulation data is most similar to that of the other backup nodes as the new simulation node, where N is a positive integer and 3 ≤ N ≤ 7.

9. The fault tolerant method of claim 1 wherein said fault tolerant using virtual machine migration fault tolerant mode comprises:

S401: obtaining the resource demand values of each virtual machine in the current server, including CPU resource u_CPU, memory resource u_Mem, bandwidth resource u_Bw, GPU resource u_GPU and storage resource u_St;

S402: constructing, from the resource demand values of each virtual machine in the server, a vector representing the resource demand values of that virtual machine,

comparing the resource residual r_j of each candidate target server with the maximum vector u_vmmax of the resource demand values of the virtual machine to be migrated; if r_j × 85% > u_vmmax, the server is taken as a target server, where j is the serial number of the candidate target server.

10. The fault tolerance method according to claim 9, further comprising the steps of:

S405: if a plurality of servers all satisfy the condition in step S404, assigning different weights to the resource demands of the virtual machines according to the task requirements, the weights comprising the CPU weight w_CPU, memory weight w_Mem, network bandwidth weight w_Bw, GPU weight w_GPU and storage weight w_St, which compose the weight vector w = {w_CPU, w_Mem, w_Bw, w_GPU, w_St};

s407: matrix array