CN116361060B - Multi-feature-aware stream computing system fault tolerance method and system - Google Patents
- Publication number
- CN116361060B (application CN202310598274.4A)
- Authority
- CN
- China
- Prior art keywords
- task
- fault
- checkpoint
- interval
- cpu
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F11/0709—Error or fault processing not based on redundancy, the processing taking place in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
- G06F11/0751—Error or fault detection not based on redundancy
- G06F11/0766—Error or fault reporting or storing
- G06F11/3006—Monitoring arrangements specially adapted to a distributed computing system, e.g. networked systems, clusters, multiprocessor systems
- G06F11/302—Monitoring arrangements where the monitored component is a software system
- G06F11/3024—Monitoring arrangements where the monitored component is a central processing unit [CPU]
- G06F11/3037—Monitoring arrangements where the monitored component is a memory, e.g. virtual memory, cache
- G06F11/3093—Configuration details of monitoring probes, e.g. installation, enabling, spatial arrangement
- G06F11/3423—Performance assessment by assessing time, where the assessed time is active or idle time
- G06F11/3452—Performance evaluation by statistical analysis
- G06F11/3476—Data logging
- G06F2201/865—Monitoring of software
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to the technical field of stream computing system fault tolerance, and in particular to a multi-feature-aware fault tolerance method and system for stream computing systems, comprising the following steps. S1: perform multi-feature sensing, which includes predicting the failure rate during application operation, monitoring the resource occupancy of tasks on nodes, and sensing the execution time and processed data volume of tasks on nodes. S2: dynamically adjust the checkpoint interval using the Mf-Stream fault-tolerance strategy according to the sensed features, obtaining each adjusted checkpoint interval CI_n; the Mf-Stream strategy correspondingly comprises a fault-aware fault-tolerant policy, a resource-aware fault-tolerant policy, and a slow-task-aware fault-tolerant policy. S3: start the nth checkpoint according to the adjusted checkpoint interval CI_n. The invention can reduce the time spent storing checkpoint data, reduce system recovery delay, lower CPU and memory occupancy, and shorten task execution time.
Description
Technical Field
The invention relates to the technical field of fault tolerance of a stream computing system, in particular to a multi-feature-aware stream computing system fault tolerance method and system.
Background
Stream computing systems can process data streams in real time. However, because faults are common, the fault tolerance of stream computing systems has become a research hotspot. Shorter checkpoint intervals increase checkpoint overhead, whereas longer checkpoint intervals increase failure recovery time. Setting an appropriate checkpoint interval is therefore important for the efficient operation of streaming applications.
Disclosure of Invention
The invention provides a multi-feature-aware fault tolerance method and system for stream computing systems, which adjust fault-tolerance behavior according to multiple sensed features. The technical scheme is as follows:
in one aspect, a multi-feature aware stream computing system fault tolerance method is provided, comprising:
S1: perform multi-feature sensing, which includes predicting the failure rate during application operation, monitoring the resource occupancy of tasks on nodes, and sensing the execution time and processed data volume of tasks on nodes;
S2: dynamically adjust the checkpoint interval using the Mf-Stream fault-tolerance strategy according to the sensed features, obtaining each adjusted checkpoint interval CI_n; the Mf-Stream strategy correspondingly comprises a fault-aware fault-tolerant policy, a resource-aware fault-tolerant policy, and a slow-task-aware fault-tolerant policy;
S3: start the nth checkpoint according to each adjusted checkpoint interval CI_n.
Optionally, predicting the failure rate during application operation in S1 comprises predicting the failure rate fr using linear regression, specifically:
collecting and preprocessing historical data;
selecting suitable features for training the linear regression model, the suitable features including CPU occupancy rate, memory occupancy rate, and task execution time;
taking the preprocessed data set as input, training the model with a linear regression algorithm, and fitting the linear regression model fr = β0 + β1·U_cpu + β2·U_mem + β3·T_ex + ε, which is used to predict the failure rate, where fr denotes the failure rate, U_cpu the CPU occupancy rate, U_mem the memory occupancy rate, T_ex the task execution time, β_i (i = 0, 1, 2, 3) the parameters to be estimated, and ε the error term;
predicting with the trained linear regression model to give a failure-rate prediction for the current stream computing system.
Optionally, in S2 the checkpoint interval is dynamically adjusted using the fault-aware fault-tolerant policy according to the predicted failure rate, specifically:
based on the predicted failure rate fr, continuously increase the checkpoint interval, using Δi_1 to increase the fixed checkpoint interval CI_0; the minimum value of fr is 0.25 (all predicted values below this are set to 0.25), and this rate of increase is maintained for each subsequent checkpoint interval CI_n until a fault occurs;
when a fault occurs, the application restarts from the checkpoint nearest the fault point and uses Δi_n to reduce the interval, where CI_{n-1} denotes the previous checkpoint interval; this rate of decrease is maintained until the interval equals the minimum checkpoint interval.
Optionally, in S2 the checkpoint interval is dynamically adjusted using the resource-aware fault-tolerant policy according to the resource occupancy of tasks on nodes, specifically:
set a threshold C_const for the maximum CPU usage time, 0 < C_const ≤ 1;
monitor CPU occupancy using a heartbeat mechanism; if the CPU usage-time ratio of a task exceeds the threshold C_const, increase the checkpoint interval; define the CPU occupancy rate as U_cpu, and the next checkpoint interval is calculated by equation (1);
U_cpu is equal to the ratio of the CPU time occupied by normal logic processing in the task to the total CPU time during running:
U_cpu = (CPU time used by normal logic processing) / (total CPU time)  (2)
set a threshold M_const for the maximum memory usage, 0 < M_const ≤ 1;
monitor memory occupancy using the heartbeat mechanism; if the memory occupancy rate of a task exceeds the threshold M_const, increase the checkpoint interval; define the memory occupancy rate as U_mem, and the next checkpoint interval is calculated by equation (3);
U_mem is equal to the ratio of the memory used for normal logic processing in the task to the total memory during running:
U_mem = (memory used by normal logic processing) / (total memory)  (4)
when the occupancy of both CPU and memory falls back below the thresholds, the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0.
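A hedged sketch of the resource-aware adjustment described above follows. Equations (1) and (3) appear as images in the original document, so the growth rule used here (scaling the interval by how far occupancy exceeds its threshold) and the default threshold values are illustrative assumptions, not the patent's formulas:

```python
# Hedged sketch of the resource-aware fault-tolerant policy. The growth rule
# and the default thresholds are assumptions for illustration only; the
# patent's equations (1) and (3) are not reproduced in the source text.
def adjust_for_resources(ci_prev, ci_fixed, u_cpu, u_mem,
                         c_const=0.8, m_const=0.8):
    """Return the next checkpoint interval given CPU/memory occupancy.

    ci_prev  -- previous checkpoint interval CI_{n-1}
    ci_fixed -- fixed checkpoint interval CI_0
    u_cpu    -- CPU occupancy rate U_cpu in [0, 1]
    u_mem    -- memory occupancy rate U_mem in [0, 1]
    """
    if u_cpu <= c_const and u_mem <= m_const:
        # Both occupancies are back under their thresholds: reset to CI_0.
        return ci_fixed
    # Assumed rule: grow the interval by the worst threshold excess.
    overload = max(u_cpu - c_const, u_mem - m_const)
    return ci_prev * (1 + overload)
```

The reset branch follows the patent text directly (return to CI_0 once both occupancies drop below threshold); only the growth branch is an assumption.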
Optionally, slow tasks are judged by task execution time and the amount of data a task processes, and the number of slow tasks is counted; only a task whose execution time exceeds a first preset threshold and whose processing volume is below a second preset threshold is judged to be a slow task.
Optionally, in S2 the checkpoint interval is dynamically adjusted using the slow-task-aware fault-tolerant policy according to the execution time and processed data volume of tasks on nodes, specifically:
when the detected number of slow tasks exceeds a threshold M, checkpoints are opened at non-uniform intervals, using Δci_1 to increase the fixed checkpoint interval CI_0, where Δci_1 = CI_{n-1} × [(N_slow − M)/M] and N_slow denotes the number of slow tasks; this rate of increase is maintained for each checkpoint interval CI_n until the number of slow tasks falls below the threshold M;
when the number of slow tasks falls to the threshold M, the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0.
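The slow-task-aware adjustment above, using the stated increment Δci = CI_{n-1} × [(N_slow − M)/M], can be sketched as follows (function and parameter names are illustrative; the patent does not prescribe an implementation):

```python
# Sketch of the slow-task-aware policy using the increment given in the text:
# delta = CI_{n-1} * (N_slow - M) / M.
def slow_task_interval(ci_prev, ci_fixed, n_slow, m_threshold):
    """Return the next checkpoint interval given the slow-task count.

    ci_prev     -- previous checkpoint interval CI_{n-1}
    ci_fixed    -- fixed checkpoint interval CI_0
    n_slow      -- current number of slow tasks N_slow
    m_threshold -- slow-task threshold M
    """
    if n_slow <= m_threshold:
        # Slow-task count back at or below M: reset to the fixed interval.
        return ci_fixed
    delta = ci_prev * (n_slow - m_threshold) / m_threshold
    return ci_prev + delta
```

For example, with CI_{n-1} = 10, M = 2, and N_slow = 4, the increment is 10 × (4 − 2)/2 = 10, giving a next interval of 20.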
In another aspect, there is provided a multi-feature aware stream computing system fault tolerant system comprising: a multi-feature aware fault tolerant MAFT module and a multi-feature aware checkpoint coordinator;
the MAFT module is used to predict the failure rate during application operation, monitor the resource occupancy of tasks on nodes, and sense the execution time and processed data volume of tasks on nodes;
the multi-feature-aware checkpoint coordinator is configured to dynamically adjust the checkpoint interval using the Mf-Stream fault-tolerance strategy according to the sensed features, obtaining each adjusted checkpoint interval CI_n (the Mf-Stream strategy correspondingly comprises the fault-aware, resource-aware, and slow-task-aware fault-tolerant policies), and to start the nth checkpoint according to each adjusted interval CI_n.
Optionally, the MAFT module includes a MAFT agent and a database, the MAFT agent including: the system comprises a fault sensing module, a monitoring module and a slow task detector;
the fault perception module is used to predict the failure rate fr using linear regression;
the monitoring module is used to monitor the real-time CPU time and memory occupied by tasks on the nodes;
the slow task detector is used for sensing the execution time of the task on the node and the processing data volume of the task so as to judge whether the task is a slow task or not;
the database is used to temporarily store the predicted failure rate, the monitored CPU and memory usage, and the execution time and processed data volume of tasks, and deletes the data after it has been processed.
In another aspect, an electronic device is provided that includes a processor and a memory having at least one instruction stored therein that is loaded and executed by the processor to implement the multi-feature aware stream computing system fault tolerance method described above.
In another aspect, a computer readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement the multi-feature aware stream computing system fault tolerance method described above is provided.
Compared with the prior art, the technical scheme has at least the following beneficial effects:
The invention dynamically adjusts the checkpoint interval according to the failure rate, CPU occupancy, memory occupancy, and the execution time and processed data volume of tasks on nodes. Compared with periodic checkpointing, it can reduce the time spent storing checkpoint data, reduce system recovery delay, lower CPU and memory occupancy, and shorten task execution time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a fault tolerance method for a multi-feature aware stream computing system according to an embodiment of the present invention;
FIG. 2 is a diagram of a fault tolerant architecture of a multi-feature aware streaming computing system according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed description of the preferred embodiments
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without creative efforts, based on the described embodiments of the present invention fall within the protection scope of the present invention.
As shown in fig. 1, an embodiment of the present invention provides a multi-feature aware stream computing system fault tolerance method, including:
S1: perform multi-feature sensing, which includes predicting the failure rate during application operation, monitoring the resource occupancy of tasks on nodes, and sensing the execution time and processed data volume of tasks on nodes;
S2: dynamically adjust the checkpoint interval using the Mf-Stream fault-tolerance strategy according to the sensed features, obtaining each adjusted checkpoint interval CI_n; the Mf-Stream strategy correspondingly comprises a fault-aware fault-tolerant policy, a resource-aware fault-tolerant policy, and a slow-task-aware fault-tolerant policy;
S3: start the nth checkpoint according to each adjusted checkpoint interval CI_n.
Referring to fig. 2, which shows the fault-tolerant system architecture of the multi-feature-aware stream computing system provided by an embodiment of the invention, a Multi-Features Aware Fault Tolerance (MAFT) module is added to the system architecture of Flink. The MAFT module is used to predict the failure rate during application operation, monitor the resource occupancy of tasks on nodes, and sense the execution time and processed data volume of tasks on nodes. The module contains a MAFT agent and a database; the MAFT agent comprises a fault perception module, a monitoring module, and a slow task detector. The fault perception module predicts the failure rate fr using linear regression; the monitoring module monitors the real-time CPU time and memory occupied by tasks on the nodes; the slow task detector senses the execution time and processed data volume of tasks on the nodes and judges whether a task is a slow task (only a task whose execution time exceeds a first preset threshold and whose processing volume is below a second preset threshold is judged to be a slow task); the database temporarily stores the predicted failure rate, the monitored CPU and memory usage, and the execution time and processed data volume of tasks, and deletes the data after processing.
In a typical distributed stream computing system, the checkpoint coordinator runs checkpoints periodically, without regard to the distribution of potential failures. The embodiment of the invention modifies the existing checkpoint coordinator into a multi-feature-aware checkpoint coordinator, which dynamically adjusts the checkpoint interval using the Mf-Stream fault-tolerance strategy according to the sensed features, obtaining each adjusted checkpoint interval CI_n (the Mf-Stream strategy correspondingly comprises the fault-aware, resource-aware, and slow-task-aware fault-tolerant policies), and starts the nth checkpoint according to each adjusted interval CI_n.
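A minimal sketch of such a coordinator loop, starting the nth checkpoint after the dynamically adjusted interval CI_n, might look like this (the policy callback stands in for the three Mf-Stream strategies; all names are illustrative, not from the patent):

```python
import time

# Illustrative coordinator loop: wait the current interval, trigger the nth
# checkpoint, then ask the Mf-Stream policy for the next interval CI_{n+1}.
def run_coordinator(trigger_checkpoint, next_interval, ci_fixed,
                    max_checkpoints=10):
    """trigger_checkpoint(n) starts checkpoint n; next_interval(ci) is the
    policy that computes the following interval from the current one."""
    ci = ci_fixed  # CI_0: the fixed initial checkpoint interval
    for n in range(1, max_checkpoints + 1):
        time.sleep(ci)           # wait the current checkpoint interval CI_n
        trigger_checkpoint(n)    # start the nth checkpoint
        ci = next_interval(ci)   # policy decides the next interval
```

In a real system the loop would run until shutdown and the policy callback would consult the sensed features (fr, U_cpu, U_mem, slow-task count) rather than a fixed rule.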
The multi-feature-aware stream computing system fault tolerance method of the embodiment of the invention comprises the following steps:
S1: perform multi-feature sensing, which includes predicting the failure rate during application operation, monitoring the resource occupancy of tasks on nodes, and sensing the execution time and processed data volume of tasks on nodes;
Optionally, predicting the failure rate during application operation in S1 comprises predicting the failure rate fr using linear regression, specifically:
collecting and preprocessing historical data;
selecting suitable features for training the linear regression model, the suitable features including CPU occupancy rate, memory occupancy rate, and task execution time;
taking the preprocessed data set as input, training the model with a linear regression algorithm, and fitting the linear regression model fr = β0 + β1·U_cpu + β2·U_mem + β3·T_ex + ε, which is used to predict the failure rate, where fr denotes the failure rate, U_cpu the CPU occupancy rate, U_mem the memory occupancy rate, T_ex the task execution time, β_i (i = 0, 1, 2, 3) the parameters to be estimated, and ε the error term;
and predicting by using the linear regression model obtained through training to give a fault rate predicted value of the current flow computing system.
Specifically:
1. Collecting data: collect historical failure-rate data, including information about system hardware, software, network, and configuration, and record it in a data set (the failure data used in the embodiment of the invention comes from a publicly available repository, the Failure Trace Archive);
2. Data preprocessing: clean, convert, and normalize the failure-rate history in the data set so that it can be used to train the linear regression model;
3. Feature selection: select suitable features for training the linear regression model. Feature selection considers how the failure rate relates to the system's attributes and configuration; the finally selected features are CPU occupancy rate, memory occupancy rate, and task execution time. If the resource occupancy of tasks on a node is too high, it may cause a performance bottleneck; if task execution time is too long, the system may experience delays or blocking during execution, affecting performance;
4. Model training: taking the preprocessed data set as input, train the model using a linear regression algorithm and fit the linear regression model fr = β0 + β1·U_cpu + β2·U_mem + β3·T_ex + ε, which is used to predict the failure rate, where fr denotes the failure rate, U_cpu the CPU occupancy rate, U_mem the memory occupancy rate, T_ex the task execution time, β_i (i = 0, 1, 2, 3) the regression coefficients, and ε the error term;
5. Model evaluation: verify the accuracy and stability of the trained model using the test data set;
6. Model application: predict with the trained linear regression model to give a failure-rate prediction for the current distributed stream computing system.
In feature selection and model training, classical machine learning algorithms such as support vector machines (SVM), logistic regression, and principal component analysis (PCA) may be used to improve model accuracy and performance.
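As an illustration of steps 4 and 6 above, a minimal ordinary-least-squares fit of the model fr = β0 + β1·U_cpu + β2·U_mem + β3·T_ex can be sketched in plain Python. The data, coefficient values, and function names here are invented for illustration; the patent does not prescribe an implementation:

```python
# Minimal OLS fit of fr = b0 + b1*U_cpu + b2*U_mem + b3*T_ex via the normal
# equations (X^T X) beta = X^T y, solved with Gaussian elimination.
def fit_ols(X, y):
    """X is a list of rows [1, u_cpu, u_mem, t_ex]; y the observed fr values."""
    n, p = len(X), len(X[0])
    A = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
         for i in range(p)]                      # A = X^T X
    b = [sum(X[k][i] * y[k] for k in range(n)) for i in range(p)]  # X^T y
    for col in range(p):                         # elimination w/ pivoting
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):               # back substitution
        beta[r] = (b[r] - sum(A[r][c] * beta[c]
                              for c in range(r + 1, p))) / A[r][r]
    return beta

def predict_fr(beta, u_cpu, u_mem, t_ex):
    """Apply the fitted model; fr is a rate, so clamp to [0, 1]."""
    fr = beta[0] + beta[1] * u_cpu + beta[2] * u_mem + beta[3] * t_ex
    return min(max(fr, 0.0), 1.0)
```

In practice a library (e.g. scikit-learn's linear regression) would replace the hand-rolled solver; the sketch only shows the shape of the training and prediction steps.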
Monitoring the resource occupancy of tasks on a node mainly means monitoring the CPU and memory occupied by each task on the node (Linux provides cgroups, which can allocate independent CPU and memory to tasks, so the CPU and memory occupied by each task can be monitored). Specifically, a monitoring module can be provided that uses a heartbeat mechanism to monitor CPU and memory occupancy.
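For the cgroup-based monitoring mentioned above, a sketch of reading a task's CPU usage from a cgroup v2 `cpu.stat` file might look as follows. The cgroup path, file layout assumptions, and helper names are illustrative; the patent only states that cgroups make per-task monitoring possible:

```python
# Illustrative helper for cgroup-based CPU monitoring. Assumes cgroup v2,
# where <cgroup>/cpu.stat contains a "usage_usec <n>" line with the total
# CPU time consumed by the cgroup, in microseconds.
def parse_cpu_usage_usec(cpu_stat_text):
    """Extract usage_usec from the contents of a cgroup v2 cpu.stat file."""
    for line in cpu_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "usage_usec":
            return int(value)
    raise ValueError("usage_usec not found in cpu.stat contents")

def read_task_cpu_usage(cgroup_path="/sys/fs/cgroup/mytask"):
    """Read the CPU time consumed by the task's cgroup (path is an assumption)."""
    with open(f"{cgroup_path}/cpu.stat") as f:
        return parse_cpu_usage_usec(f.read())
```

A heartbeat-driven monitor would sample this value periodically and divide the delta by wall-clock time to obtain the CPU usage-time ratio compared against C_const.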
The execution time of tasks on a node and the amount of data a task processes are detected by extending a slow task detector in Flink.
S2: dynamically adjust the checkpoint interval using the Mf-Stream fault-tolerance strategy according to the sensed features, obtaining each adjusted checkpoint interval CI_n; the Mf-Stream strategy correspondingly comprises the fault-aware fault-tolerant policy, the resource-aware fault-tolerant policy, and the slow-task-aware fault-tolerant policy;
Optionally, in S2 the checkpoint interval is dynamically adjusted using the fault-aware fault-tolerant policy according to the predicted failure rate, specifically:
based on the predicted failure rate fr, continuously enlarge CI_0, using Δi_1 to increase CI_0; the minimum value of fr is 0.25 (all predicted values below this are set to 0.25), and this rate of increase is maintained for each checkpoint interval CI_n until a fault occurs;
when a fault occurs, the application restarts from the checkpoint nearest the fault point and uses Δi_n to reduce CI_n, where CI_{n-1} denotes the previous checkpoint interval; this rate of decrease is maintained until the interval equals the minimum checkpoint interval.
The fault-aware fault tolerance strategy is described in detail below in conjunction with algorithm 1:
The multi-feature-aware checkpoint coordinator of the embodiment of the invention can adapt checkpointing according to the predicted failure rate. If the failure rate is small, it dynamically increases the checkpoint interval; conversely, it dynamically reduces the checkpoint interval. Specifically, the multi-feature-aware checkpoint coordinator starts the program running and initiates the first checkpoint at the fixed checkpoint interval CI_0. Subsequently, it optimistically assumes that no fault will occur in the near future, so it begins to increase the checkpoint interval continuously based on the predicted failure rate fr. This yields monotonically increasing checkpoint intervals, i.e., CI_1 < CI_2 < … < CI_n. The coordinator keeps increasing the checkpoint interval until a fault occurs. Once a fault occurs, it begins to reduce the checkpoint interval in order to shorten the fault recovery time. Studies of actual fault data and predictions of operational faults show that the probability of a subsequent fault is very high shortly after the last fault; therefore, to mitigate possibly correlated failures, the multi-feature-aware checkpoint coordinator decreases rather than increases the checkpoint interval after a failure.
The value of fr plays an important role in determining the change of the checkpoint interval and the checkpoint frequency. fr ranges from 0 to 1, where 0 represents no faults and 1 represents a high failure rate. When fr = 1, fault-aware fault tolerance behaves like the fixed-checkpoint-interval model, since the failure rate is high and any increase of the checkpoint interval may increase the failure recovery time; in this case the checkpoint interval is gradually reduced. On the other hand, fr = 0 characterizes a low failure rate, in which case the checkpoint interval is increased to improve application efficiency: when the probability of failure is low, widening the checkpoint interval improves utilization and reduces checkpoint cost. However, with fr = 0 the checkpoint interval would keep doubling, which is highly susceptible to higher fault recovery overhead. To limit an exponential increase of the checkpoint interval, the multi-feature-aware fault-tolerant agent sets the minimum value of fr to 0.25 and raises all smaller values to 0.25.
Algorithm 1 describes the detailed algorithm of the fault-aware fault tolerance strategy:
Input: predicted failure probability fr, periodic checkpoint interval CI_0
Output: failure-aware checkpoints
1. if fr < 0.25 then
2.   set fr = 0.25
3. end if
4. while application not finished do
5.   use periodic checkpoint interval CI_0
6.   if no failure then
7.     Non-uniform-Interval() {
8.       calculate Δi_n
9.       next checkpoint interval CI_n = CI_{n-1} + Δi_n
10.      triggering checkpoint time t_n = CI_n + CI_{n-1}; }
11.  end if
12.  if failure occurs then
13.    restart execution from last checkpoint
14.    Non-uniform-Interval() {
15.      calculate Δi_n
16.      next checkpoint interval CI_n = CI_{n-1} − Δi_n
17.      triggering checkpoint time t_n = CI_n + CI_{n-1}
18.    until CI_n = CI_min }
19.  end if
20. end while
Let t_0 denote the periodic checkpoint time. The algorithm runs the application with a fixed checkpoint interval CI_0 and starts periodic checkpoints at time t_0. It then optimistically assumes that no failure will occur in the near future, and therefore begins to increase the checkpoint interval in order to initiate checkpoints less often. It increases CI_0 by Δi_1 and keeps increasing each checkpoint interval CI_n at this rate. In this way, it iteratively widens the checkpoint interval and initiates checkpoints as the interval grows. Once the size of the checkpoint interval is calculated, the algorithm gradually delays the checkpoint start time as the interval increases: the n-th checkpoint is started at time t_n = CI_n + CI_{n-1}. With this approach it dynamically widens the checkpoint interval and reduces the frequency of checkpoint starts. The checkpoint interval grows gradually until a fault occurs. When a fault occurs, the application restarts from the checkpoint nearest the fault point and reduces CI_n by Δi_n. The interval keeps shrinking at this rate in order to tolerate potentially correlated faults; the algorithm iteratively reduces the checkpoint intervals and starts the n-th checkpoint according to each adjusted interval CI_n. In this way it dynamically reduces the checkpoint interval, shortening the fault recovery time. The checkpoint interval decreases gradually until it equals the minimum checkpoint interval (a preset value close to 0).
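The grow/shrink loop of Algorithm 1 can be sketched as follows. The patent's exact expression for Δi_n is not reproduced in this text, so this sketch assumes Δi_n = CI_{n-1}·(1 − fr) for growth and Δi_n = CI_{n-1}·fr for shrinkage purely as illustrative placeholders; only the clamping of fr to 0.25 and the grow-until-failure / shrink-after-failure structure come from the patent.

```python
FR_FLOOR = 0.25   # the patent raises all failure-rate values below 0.25 to 0.25

def next_interval(ci_prev: float, fr: float, failed: bool, ci_min: float) -> float:
    """Compute the next checkpoint interval CI_n from CI_{n-1}.

    failed=True models the branch taken right after a fault, where the
    interval is shrunk to tolerate correlated follow-up faults; otherwise
    the interval grows optimistically.
    """
    fr = max(fr, FR_FLOOR)
    if failed:
        # shrink: Δi_n = CI_{n-1} * fr (assumed form), never below CI_min
        return max(ci_prev - ci_prev * fr, ci_min)
    # grow: Δi_n = CI_{n-1} * (1 - fr) (assumed form); low fr -> faster growth
    return ci_prev + ci_prev * (1 - fr)
```

With CI_0 = 10 and fr = 0.25, repeated calls without failures produce the monotone sequence CI_1 < CI_2 < … described above; a failure flips the loop into the shrinking branch.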
Optionally, in the step S2, a resource-aware fault-tolerant policy is used to dynamically adjust a checkpoint interval according to a resource occupation condition of a task on a node, which specifically includes:
setting a threshold value for maximum usage time of CPUC const ,0<C const ≤1;
monitoring CPU occupation using the heartbeat mechanism; if the CPU usage-time ratio of a certain task exceeds the threshold C_const, the checkpoint interval is increased; a CPU occupancy rate U_cpu is defined, and the next checkpoint interval is calculated by equation (1):
(1)
U_cpu equals the ratio of the CPU time occupied by the task's normal logic processing to the total CPU time during running:
(2);
setting a threshold value for the maximum usage time of the memoryM const ,0<M const ≤1;
monitoring memory occupancy using the heartbeat mechanism; if the memory usage ratio of a certain task exceeds the threshold M_const, the checkpoint interval is increased; a memory occupancy rate U_mem is defined, and the next checkpoint interval can be calculated by equation (3):
(3)
U_mem equals the ratio of the memory used by the task's normal logic processing to the total memory during running:
(4);
until the occupancy of both CPU and memory falls below the thresholds, after which the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0.
The resource-aware fault tolerance policy is described in detail below in conjunction with algorithm 2:
First, consider the impact of CPU resources on the checkpoint interval. A threshold C_const, 0 < C_const ≤ 1, is set for the maximum CPU usage time; C_const defines an upper limit on the CPU-time share used by each task. Exceeding the threshold triggers the system to reset the checkpoint interval. The new checkpoint interval must be chosen with care, because reducing the interval shortens the failure recovery time but increases the checkpoint overhead, whereas increasing the interval reduces the checkpoint overhead but lengthens the failure recovery time. If the CPU usage-time share of the node exceeds the threshold C_const, the embodiment of the invention increases the checkpoint interval to reduce the CPU consumed by checkpoint operations, devoting the whole CPU to task execution so that tasks complete faster. A CPU occupancy rate U_cpu is defined, and the next checkpoint interval can be calculated by equation (1):
(1)
U_cpu equals the ratio of the CPU time occupied by the task's normal logic processing to the total CPU time during running:
(2)
Then, consider the effect of memory on the checkpoint interval. A threshold M_const, 0 < M_const ≤ 1, is set for the maximum memory usage; M_const defines an upper limit on the memory share used by each task. Exceeding the threshold triggers the system to reset the checkpoint interval. If the memory usage ratio exceeds the threshold M_const, the checkpoint interval is increased to reduce the memory consumed by checkpointing, devoting the whole memory to running the task so that it runs faster and completes successfully. A memory occupancy rate U_mem is defined, and the next checkpoint interval can be calculated by equation (3):
(3)
U_mem equals the ratio of the memory used by the task's normal logic processing to the total memory during running:
(4)
algorithm 2 gives a detailed algorithm of the resource-aware fault tolerance policy:
Input: CPU occupancy rate U_cpu, initialized checkpoint interval CI_0, memory occupancy rate U_mem, CPU occupancy threshold C_const, memory occupancy threshold M_const
Output: resource-aware checkpoints
1. while application not finished do
2.   use initialized checkpoint interval CI_0
3.   if U_cpu > C_const or U_mem > M_const then
4.     Non-uniform-Interval() {
5.       next checkpoint interval by equation (1) or (3)
6.       triggering checkpoint time t_n = CI_n + CI_{n-1}; }
7.   end if
8.   if U_cpu ≤ C_const and U_mem ≤ M_const then
9.     recover checkpoint interval to CI_0
10.  end if
11. end while
The algorithm first runs the application with a constant checkpoint interval CI_0 and starts periodic checkpoints at time t_0. When the monitoring module detects that the CPU occupancy exceeds the threshold C_const or the memory occupancy exceeds M_const, checkpoints begin to be opened at non-uniform intervals. A new checkpoint interval is calculated using equation (1) or (3). In this way the interval is iteratively increased, and checkpoints are initiated as the interval grows. Once the size of the checkpoint interval is calculated, the algorithm continually delays the checkpoint start time as the interval increases: the n-th checkpoint is started at time t_n = CI_n + CI_{n-1}. By this method it dynamically widens the checkpoint interval while also reducing the frequency with which checkpoints are opened. The checkpoint interval is increased gradually until both resource occupancies fall below their thresholds, after which the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0.
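The threshold-and-reset structure of Algorithm 2 can be sketched as below. Equations (1) and (3) are not reproduced in this text, so grow_interval is a stand-in that merely scales the previous interval by the observed occupancy; the threshold values are likewise assumptions for illustration.

```python
C_CONST = 0.8   # assumed CPU occupancy threshold, 0 < C_const <= 1
M_CONST = 0.8   # assumed memory occupancy threshold, 0 < M_const <= 1

def adjust_interval(ci_prev: float, ci_0: float, u_cpu: float, u_mem: float) -> float:
    """One iteration of the resource-aware policy.

    Widens the interval while either resource is over its threshold;
    resets to the fixed interval CI_0 once both are back below.
    """
    def grow_interval(ci: float, occupancy: float) -> float:
        # placeholder for the patent's equations (1)/(3): higher occupancy,
        # larger growth, so checkpointing yields more resources to the task
        return ci * (1.0 + occupancy)

    if u_cpu > C_CONST or u_mem > M_CONST:
        return grow_interval(ci_prev, max(u_cpu, u_mem))
    return ci_0  # both resources below threshold: restore CI_0
```

Called once per heartbeat with fresh U_cpu/U_mem samples, this reproduces the "grow while overloaded, reset when recovered" behavior described above.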
Optionally, slow tasks are judged by the task execution duration and the amount of data processed by the task, and the number of slow tasks is counted; only a task whose execution time exceeds a first preset threshold and whose processed data volume is smaller than a second preset threshold is judged to be a slow task.
In production, hot-spot machines cannot be avoided: intensive data backfilling and co-located (mixed-deployment) clusters can leave a particular machine heavily loaded with busy input and output. Data processing tasks running on it may be extremely slow, making it difficult to guarantee the job completion time. Abnormal machine nodes exhibit problems such as hardware anomalies, sporadic I/O congestion, and high CPU load. These problems cause tasks placed on them to run much more slowly than tasks on other nodes, thereby extending the running time of the entire job.
By extending the slow-task detector in Flink, the embodiment of the invention can also detect the execution time of stream computing tasks and the amount of data they process; a task with a long execution time (exceeding the first preset threshold) and little processed data (below the second preset threshold) is identified as a slow task. The embodiment of the invention reduces the checkpoint overhead by reducing the time slow tasks spend executing checkpoint operations and, more importantly, reduces the resource occupancy and the task execution time.
Algorithm 3 gives a detailed algorithm for slow task assessment:
Input: total number of tasks m, task list taskList, historical task execution times execution[task_i][time_pre], task execution duration duration[task_i], processed data volume of tasks procVol[task_i], total number of slow tasks N_slow
Output: number of slow tasks
1. for task_i = 1 → m do
2.   duration[task_i] = computeTaskExecutionDuration(execution[task_i][])
3. end for
4. tasksList1 = sortTasks(m, duration[])
5. for task_i = 1 → m do
6.   procVol[task_i] = computeProcessVolume(execution[task_i][])
7. end for
8. tasksList2 = sortTasks(m, procVol[])
9. slowTasksList = selectSlowTasks(s, tasksList1, tasksList2)
10. N_slow = Count(slowTasksList)
11. return N_slow
Algorithm 3 judges slow tasks according to the task execution time and the amount of data processed, and counts the number of slow tasks; only a task whose execution time exceeds the first preset threshold and whose processed data volume is below the second preset threshold is judged to be a slow task.
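The slow-task test itself reduces to a conjunction of the two criteria just stated. A minimal sketch (the threshold values in the example are illustrative, not values from the patent):

```python
from typing import Sequence

def count_slow_tasks(
    durations: Sequence[float],
    volumes: Sequence[float],
    time_threshold: float,
    volume_threshold: float,
) -> int:
    """Count tasks that are slow per Algorithm 3's criterion.

    A task i is slow only if durations[i] exceeds the first preset
    threshold AND volumes[i] (its processed data volume) is below the
    second preset threshold.
    """
    return sum(
        1
        for d, v in zip(durations, volumes)
        if d > time_threshold and v < volume_threshold
    )

# Example: task 1 ran 20 s but processed only 10 units -> slow;
# task 2 ran 30 s but processed 500 units -> not slow (busy, not stuck).
n_slow = count_slow_tasks([5, 20, 30], [100, 10, 500], 15, 50)
```

The AND of the two conditions is what distinguishes a genuinely stuck task from one that is merely busy processing a large volume of data.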
Optionally, in the step S2, a slow task aware fault tolerance policy is used according to the execution duration of the task on the node and the processing data amount of the task, so as to dynamically adjust the checkpoint interval, which specifically includes:
when the number of slow tasks is detected to exceed a threshold M, checkpoints begin to be opened at non-uniform intervals: the fixed checkpoint interval CI_0 is increased using Δci_1, where Δci_1 = CI_{n-1} × [(N_slow − M)/M] and N_slow denotes the number of slow tasks; each checkpoint interval CI_n continues to be increased at this rate until the number of slow tasks falls below the threshold M;
after the number of slow tasks falls below the threshold M, the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0.
The slow task aware fault tolerance strategy is described in detail below in conjunction with algorithm 4:
Embodiments of the present invention use a slow-task detector to detect the number of slow tasks in the streaming computing system, denoted N_slow. If N_slow exceeds a set threshold M, the checkpoint interval is increased, reducing the time slow tasks spend on checkpointing so that they can concentrate on executing their work. When the slow-task detector detects that N_slow has fallen back to a normal value, the fixed checkpoint interval is restored.
Algorithm 4 gives a detailed algorithm for the slow task aware fault tolerance strategy:
Input: slow-task number N_slow, initialized checkpoint interval CI_0
Output: slow-task-aware checkpoints
1. while application not finished do
2.   use initialized checkpoint interval CI_0
3.   if N_slow > M then
4.     Non-uniform-Interval() {
5.       calculate Δci_n = CI_{n-1} × [(N_slow − M)/M]
6.       next checkpoint interval CI_n = CI_{n-1} + Δci_n
7.       triggering checkpoint time t_n = CI_n + CI_{n-1}; }
8.   end if
9.   if N_slow ≤ M then
10.    recover checkpoint interval to CI_0
11.  end if
12. end while
The algorithm first runs the application with a fixed checkpoint interval CI_0 and starts periodic checkpoints at time t_0. When the slow-task detector detects that the number of slow tasks exceeds the threshold M, checkpoints begin to be opened at non-uniform intervals. CI_0 is increased by Δci_1, where Δci_1 = CI_{n-1} × [(N_slow − M)/M], and each checkpoint interval CI_n keeps increasing at this rate; the algorithm thus iteratively widens the checkpoint interval and initiates checkpoints as the interval grows. Once the size of the checkpoint interval is calculated, the algorithm continually delays the checkpoint start time as the interval increases: the n-th checkpoint is started at time t_n = CI_n + CI_{n-1}. By this method it dynamically widens the checkpoint interval while also reducing the frequency with which checkpoints are opened. The interval is increased gradually until the number of slow tasks falls below the threshold, after which the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0.
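Unlike Algorithms 1 and 2, the update formula here is stated explicitly in the text, so the per-iteration step of Algorithm 4 can be sketched directly (function and parameter names are ours; M and CI_0 are configuration values):

```python
def slow_task_interval(ci_prev: float, ci_0: float, n_slow: int, m: int) -> float:
    """One iteration of the slow-task-aware policy (Algorithm 4).

    While N_slow > M, grows the interval by the patent's formula
    Δci_n = CI_{n-1} * (N_slow - M) / M; otherwise restores CI_0.
    """
    if m <= 0:
        raise ValueError("slow-task threshold M must be positive")
    if n_slow > m:
        delta = ci_prev * (n_slow - m) / m   # Δci_n from the patent's formula
        return ci_prev + delta               # widen interval while tasks lag
    return ci_0  # slow-task count back to normal: restore the fixed interval
```

For example, with CI_{n-1} = 10, M = 4 and N_slow = 6, the interval grows by 10 × (6 − 4)/4 = 5 to 15; once N_slow drops to M or below, the next call returns CI_0 again.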
S3, starting the n-th checkpoint according to each adjusted checkpoint interval CI_n.
As shown in fig. 2, an embodiment of the present invention further provides a multi-feature aware stream computing system fault tolerant system, including: a multi-feature aware fault tolerant MAFT module and a multi-feature aware checkpoint coordinator;
the MAFT module is used for predicting failure rate in the running process of the application program, monitoring resource occupation condition of tasks on the nodes, sensing execution time of the tasks on the nodes and processing data quantity of the tasks;
the multi-feature-aware checkpoint coordinator is configured to dynamically adjust the checkpoint interval according to the perceived multiple features using the Mf-Stream fault-tolerant strategy to obtain each adjusted checkpoint interval CI_n, the Mf-Stream fault-tolerant strategy correspondingly comprising a fault-aware fault-tolerant policy, a resource-aware fault-tolerant policy, and a slow-task-aware fault-tolerant policy; and to start the n-th checkpoint according to each adjusted checkpoint interval CI_n.
Optionally, the MAFT module includes a MAFT agent and a database, the MAFT agent including: the system comprises a fault sensing module, a monitoring module and a slow task detector;
the fault perception module is used for predicting the failure rate fr using linear regression;
The monitoring module is used for monitoring the real-time CPU time and the real-time memory quantity occupied by the tasks on the nodes;
the slow task detector is used for sensing the execution time of the task on the node and the processing data volume of the task so as to judge whether the task is a slow task or not;
the database is used for temporarily storing the predicted failure rate, the monitored CPU and memory amounts, the execution time of tasks, and the processed data volume of tasks, deleting the data after it has been processed.
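The fault perception module's linear-regression predictor follows the model stated in the claims, fr = β0 + β1·U_cpu + β2·U_mem + β3·T_ex (plus an error term). A hedged sketch, with purely illustrative coefficients rather than fitted values from the patent:

```python
from typing import Tuple

def predict_failure_rate(
    u_cpu: float,
    u_mem: float,
    t_ex: float,
    beta: Tuple[float, float, float, float] = (0.05, 0.3, 0.2, 0.001),
) -> float:
    """Predict the failure rate fr from the three perceived features.

    beta = (b0, b1, b2, b3) are the regression coefficients; in the
    patent they are fitted from preprocessed historical data, while the
    defaults here are arbitrary placeholders for illustration.
    """
    b0, b1, b2, b3 = beta
    fr = b0 + b1 * u_cpu + b2 * u_mem + b3 * t_ex
    return min(max(fr, 0.0), 1.0)  # a failure rate is bounded to [0, 1]
```

The resulting fr feeds the fault-aware policy of Algorithm 1, which additionally clamps values below 0.25 up to 0.25 before adjusting the checkpoint interval.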
The functional structure of the fault-tolerant system of the multi-feature-aware stream computing system provided by the embodiment of the invention corresponds to the fault-tolerant method of the multi-feature-aware stream computing system provided by the embodiment of the invention, and is not described herein.
Fig. 3 is a schematic structural diagram of an electronic device 300 according to an embodiment of the present invention, where the electronic device 300 may have a relatively large difference due to different configurations or performances, and may include one or more processors (central processing units, CPU) 301 and one or more memories 302, where at least one instruction is stored in the memories 302, and the at least one instruction is loaded and executed by the processors 301 to implement the steps of the multi-feature-aware stream computing system fault tolerance method described above.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory comprising instructions executable by a processor in a terminal to perform the multi-feature-aware stream computing system fault tolerance method described above. For example, the computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.
Claims (7)
1. A multi-feature aware stream computing system fault tolerance method, comprising:
s1, multi-feature sensing is carried out, wherein the multi-feature sensing comprises the following steps: predicting failure rate in an application program operation flow, monitoring resource occupation condition of tasks on nodes, sensing execution time of the tasks on the nodes and processing data quantity of the tasks;
s2, dynamically adjusting the checkpoint interval according to the perceived multiple features using the Mf-Stream fault-tolerant strategy to obtain each adjusted checkpoint interval CI_n; the Mf-Stream fault-tolerant strategy correspondingly comprises: a fault-aware fault-tolerant policy, a resource-aware fault-tolerant policy, and a slow-task-aware fault-tolerant policy;
s3, starting the n-th checkpoint according to each adjusted checkpoint interval CI_n;
in the step S2, according to the predicted failure rate, a failure sensing fault tolerance strategy is used to dynamically adjust the checkpoint interval, which specifically includes:
based on the predicted failure rate fr, continuously enlarging the checkpoint interval: the fixed checkpoint interval CI_0 is increased using Δi_1, which is computed from fr; the minimum value of fr is 0.25, and all values below it are set to 0.25; each checkpoint interval CI_n continues to be increased at this rate until a fault occurs;
when a fault occurs, the application restarts from the checkpoint nearest the fault point and reduces CI_n by Δi_n, where CI_{n-1} denotes the previous checkpoint interval; the checkpoint interval continues to decrease at this rate until it equals the minimum checkpoint interval;
in the step S2, according to the resource occupation condition of the tasks on the nodes, a resource-aware fault-tolerant strategy is used to dynamically adjust the check point intervals, and the method specifically comprises the following steps:
setting a threshold value for maximum usage time of CPUC const ,0<C const ≤1;
monitoring CPU occupation using the heartbeat mechanism; if the CPU usage-time ratio of a certain task exceeds the threshold C_const, the checkpoint interval is increased; a CPU occupancy rate U_cpu is defined, and the next checkpoint interval is calculated by equation (1):
(1)
U_cpu equals the ratio of the CPU time occupied by the task's normal logic processing to the total CPU time during running:
(2)
setting a threshold value for the maximum usage of the memoryM const ,0<M const ≤1;
monitoring memory occupancy using the heartbeat mechanism; if the memory usage ratio of a certain task exceeds the threshold M_const, the checkpoint interval is increased; a memory occupancy rate U_mem is defined, and the next checkpoint interval can be calculated by equation (3):
(3)
U_mem equals the ratio of the memory used by the task's normal logic processing to the total memory during running:
(4)
until the occupancy of both CPU and memory falls below the thresholds, after which the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0;
In the step S2, according to the execution time of the task on the node and the processing data volume of the task, a slow task perception fault tolerance strategy is used to dynamically adjust the check point interval, which specifically comprises the following steps:
when the number of slow tasks is detected to exceed a threshold M, checkpoints begin to be opened at non-uniform intervals: the fixed checkpoint interval CI_0 is increased using Δci_1, where Δci_1 = CI_{n-1} × [(N_slow − M)/M] and N_slow denotes the number of slow tasks; each checkpoint interval CI_n continues to be increased at this rate until the number of slow tasks falls below the threshold M;
after the number of slow tasks falls below the threshold M, the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0.
2. The method according to claim 1, wherein predicting the failure rate during application running in S1 comprises predicting the failure rate fr using linear regression, and specifically comprises:

collecting and preprocessing historical data;

selecting suitable features for training the linear regression model, the suitable features including: CPU occupancy, memory occupancy, and task execution time;

taking the preprocessed data set as input, performing model training using a linear regression algorithm, and fitting a linear regression model fr = β0 + β1·U_cpu + β2·U_mem + β3·T_ex + ε, the linear regression model being used to predict the failure rate, where fr represents the failure rate, U_cpu the CPU occupancy, U_mem the memory occupancy, T_ex the task execution time, β_i (i = 0, 1, …, 3) the undetermined parameters, and ε the error term;

and predicting with the linear regression model obtained through training to give a failure-rate prediction for the current stream computing system.
3. The method according to claim 1, wherein slow tasks are judged by the task execution duration and the amount of data processed by the task, and the number of slow tasks is counted; only a task whose execution time exceeds a first preset threshold and whose processed data volume is smaller than a second preset threshold is judged to be a slow task.
4. A multi-feature aware stream computing system fault tolerant system, comprising: a multi-feature aware fault tolerant MAFT module and a multi-feature aware checkpoint coordinator;
the MAFT module is used for predicting failure rate in the running process of the application program, monitoring resource occupation condition of tasks on the nodes, sensing execution time of the tasks on the nodes and processing data quantity of the tasks;
the multi-feature-aware checkpoint coordinator is configured to dynamically adjust the checkpoint interval according to the perceived multiple features using the Mf-Stream fault-tolerant strategy to obtain each adjusted checkpoint interval CI_n, the Mf-Stream fault-tolerant strategy correspondingly comprising a fault-aware fault-tolerant policy, a resource-aware fault-tolerant policy, and a slow-task-aware fault-tolerant policy; and to start the n-th checkpoint according to each adjusted checkpoint interval CI_n;
the multi-feature perception checkpoint coordinator dynamically adjusts checkpoint intervals according to a predicted failure rate using a failure-aware fault tolerance strategy, and specifically includes:
based on the predicted failure rate fr, continuously enlarging the checkpoint interval: the fixed checkpoint interval CI_0 is increased using Δi_1, which is computed from fr; the minimum value of fr is 0.25, and all values below it are set to 0.25; each checkpoint interval CI_n continues to be increased at this rate until a fault occurs;
when a fault occurs, the application restarts from the checkpoint nearest the fault point and reduces CI_n by Δi_n, where CI_{n-1} denotes the previous checkpoint interval; the checkpoint interval continues to decrease at this rate until it equals the minimum checkpoint interval;
the multi-feature perception checkpoint coordinator dynamically adjusts checkpoint intervals by using a resource perception fault tolerance policy according to the resource occupation condition of tasks on nodes, and specifically includes:
setting a threshold value for maximum usage time of CPUC const ,0<C const ≤1;
monitoring CPU occupation using the heartbeat mechanism; if the CPU usage-time ratio of a certain task exceeds the threshold C_const, the checkpoint interval is increased; a CPU occupancy rate U_cpu is defined, and the next checkpoint interval is calculated by equation (1):
(1)
U_cpu equals the ratio of the CPU time occupied by the task's normal logic processing to the total CPU time during running:
(2)
setting a threshold value for the maximum usage of the memoryM const ,0<M const ≤1;
monitoring memory occupancy using the heartbeat mechanism; if the memory usage ratio of a certain task exceeds the threshold M_const, the checkpoint interval is increased; a memory occupancy rate U_mem is defined, and the next checkpoint interval can be calculated by equation (3):
(3)
U_mem equals the ratio of the memory used by the task's normal logic processing to the total memory during running:
(4)
until the occupancy of both CPU and memory falls below the thresholds, after which the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0;
The multi-feature perception checkpoint coordinator dynamically adjusts checkpoint intervals by using a slow task perception fault tolerance strategy according to execution time of tasks on nodes and processing data volume of the tasks, and specifically comprises the following steps:
when the number of slow tasks is detected to exceed a threshold M, checkpoints begin to be opened at non-uniform intervals: the fixed checkpoint interval CI_0 is increased using Δci_1, where Δci_1 = CI_{n-1} × [(N_slow − M)/M] and N_slow denotes the number of slow tasks; each checkpoint interval CI_n continues to be increased at this rate until the number of slow tasks falls below the threshold M;
after the number of slow tasks falls below the threshold M, the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0.
5. The system of claim 4, wherein the MAFT module comprises a MAFT agent and a database, the MAFT agent comprising: the system comprises a fault sensing module, a monitoring module and a slow task detector;
the fault perception module is used for predicting the failure rate fr using linear regression;
The monitoring module is used for monitoring the real-time CPU time and the real-time memory quantity occupied by the tasks on the nodes;
the slow task detector is used for sensing the execution time of the task on the node and the processing data volume of the task so as to judge whether the task is a slow task or not;
the database is used for temporarily storing the predicted failure rate, the monitored CPU and memory amounts, the execution time of tasks, and the processed data volume of tasks, deleting the data after it has been processed.
6. An electronic device comprising a processor and a memory having stored therein at least one instruction, wherein the at least one instruction is loaded and executed by the processor to implement the multi-feature aware stream computing system fault tolerance method of any one of claims 1-3.
7. A computer readable storage medium having stored therein at least one instruction, wherein the at least one instruction is loaded and executed by a processor to implement the multi-feature aware stream computing system fault tolerance method of any one of claims 1-3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310598274.4A CN116361060B (en) | 2023-05-25 | 2023-05-25 | Multi-feature-aware stream computing system fault tolerance method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116361060A CN116361060A (en) | 2023-06-30 |
CN116361060B true CN116361060B (en) | 2023-09-15 |
Family
ID=86939416
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310598274.4A Active CN116361060B (en) | 2023-05-25 | 2023-05-25 | Multi-feature-aware stream computing system fault tolerance method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116361060B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104331347A (en) * | 2014-11-25 | 2015-02-04 | 中国人民解放军国防科学技术大学 | Variable error rate-oriented check point interval real-time determining method |
CN109344009A (en) * | 2018-10-11 | 2019-02-15 | 重庆邮电大学 | Mobile cloud system fault-tolerance approach based on classification checkpoint |
CN111124720A (en) * | 2019-12-26 | 2020-05-08 | 江南大学 | Self-adaptive check point interval dynamic setting method |
CN111258824A (en) * | 2020-01-18 | 2020-06-09 | 重庆邮电大学 | Increment check point fault tolerance method based on artificial potential field in cloud computing |
CN111682981A (en) * | 2020-06-02 | 2020-09-18 | 深圳大学 | Check point interval setting method and device based on cloud platform performance |
CN112445635A (en) * | 2019-09-04 | 2021-03-05 | 无锡江南计算技术研究所 | Data-driven adaptive checkpoint optimization method |
CN116069468A (en) * | 2022-12-30 | 2023-05-05 | 三星(中国)半导体有限公司 | Checkpoint adjustment method and device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070220327A1 (en) * | 2006-02-23 | 2007-09-20 | Evergrid, Inc., A Delaware Corporation | Dynamically Controlled Checkpoint Timing |
US11641395B2 (en) * | 2019-07-31 | 2023-05-02 | Stratus Technologies Ireland Ltd. | Fault tolerant systems and methods incorporating a minimum checkpoint interval |
- 2023-05-25: CN CN202310598274.4A patent CN116361060B (en), status Active
Non-Patent Citations (1)
Title |
---|
He Zhongzheng et al. Schedulability of fault-tolerant real-time systems based on checkpoint interval optimization. Journal of Jilin University, 2014, vol. 44, no. 2, pp. 433-439. * |
Also Published As
Publication number | Publication date |
---|---|
CN116361060A (en) | 2023-06-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Chtepen et al. | Adaptive task checkpointing and replication: Toward efficient fault-tolerant grids | |
JP4170988B2 (en) | Risk prediction / avoidance method, system, program, and recording medium for execution environment | |
US7890297B2 (en) | Predictive monitoring method and system | |
US8949642B2 (en) | Method for dynamically distributing one or more services in a network comprising of a plurality of computers by deriving a resource capacity required based on a past chronological progression of a resource demand | |
CN107562512B (en) | Method, device and system for migrating virtual machine | |
Heinze et al. | An adaptive replication scheme for elastic data stream processing systems | |
US11886919B2 (en) | Directing queries to nodes of a cluster of a container orchestration platform distributed across a host system and a hardware accelerator of the host system | |
CN112799817A (en) | Micro-service resource scheduling system and method | |
US20220414503A1 (en) | Slo-aware artificial intelligence inference scheduler for heterogeneous processors in edge platforms | |
CN111880906A (en) | Virtual machine high-availability management method, system and storage medium | |
WO2020248227A1 (en) | Load prediction-based hadoop computing task speculative execution method | |
US7130770B2 (en) | Monitoring method and system with corrective actions having dynamic intensities | |
WO2018024076A1 (en) | Flow velocity control method and device | |
US11966273B2 (en) | Throughput-optimized, quality-of-service aware power capping system | |
CN109522100B (en) | Real-time computing task adjusting method and device | |
Rood et al. | Resource availability prediction for improved grid scheduling | |
Lassettre et al. | Dynamic surge protection: An approach to handling unexpected workload surges with resource actions that have lead times | |
CN116361060B (en) | Multi-feature-aware stream computing system fault tolerance method and system | |
WO2022247219A1 (en) | Information backup method, device, and platform | |
CN111274111B (en) | Prediction and anti-aging method for microservice aging | |
CN112559287A (en) | Method and device for optimizing task flow in data | |
Okamura et al. | Optimization of opportunity-based software rejuvenation policy | |
CN115470006B (en) | Load balancing method based on microkernel | |
CN115858155A (en) | Dynamic capacity expansion and contraction method and device for application resources of computing power network platform | |
Amin et al. | Using automated control charts for the runtime evaluation of qos attributes |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||