CN116361060B - Multi-feature-aware stream computing system fault tolerance method and system


Info

Publication number: CN116361060B
Application number: CN202310598274.4A
Other versions: CN116361060A
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: task, fault, checkpoint, interval, cpu
Inventors: 孙大为, 朱婷
Applicant and current assignee: China University of Geosciences Beijing
Legal status: Active (granted)

Classifications

    • G06F 11/0709 — Error or fault processing not based on redundancy, taking place in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G06F 11/0751 — Error or fault detection not based on redundancy
    • G06F 11/0766 — Error or fault reporting or storing
    • G06F 11/3006 — Monitoring arrangements where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G06F 11/302 — Monitoring arrangements where the computing system component is a software system
    • G06F 11/3024 — Monitoring arrangements where the computing system component is a central processing unit [CPU]
    • G06F 11/3037 — Monitoring arrangements where the computing system component is a memory, e.g. virtual memory, cache
    • G06F 11/3093 — Configuration details of monitoring probes, e.g. installation, enabling, spatial arrangement
    • G06F 11/3423 — Performance assessment by assessing time, where the assessed time is active or idle time
    • G06F 11/3452 — Performance evaluation by statistical analysis
    • G06F 11/3476 — Data logging
    • G06F 2201/865 — Monitoring of software
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the technical field of fault tolerance for stream computing systems, and in particular to a multi-feature-aware stream computing system fault tolerance method and system, comprising the following steps: S1, performing multi-feature sensing, which comprises: predicting the failure rate during application running, monitoring the resource occupancy of tasks on nodes, and sensing the execution duration of tasks on nodes and the data volume processed by the tasks; S2, dynamically adjusting the checkpoint interval according to the sensed features using the Mf-Stream fault-tolerant strategy, to obtain each adjusted checkpoint interval CI_n, where the Mf-Stream fault-tolerant strategy correspondingly comprises a fault-aware fault-tolerant policy, a resource-aware fault-tolerant policy and a slow-task-aware fault-tolerant policy; S3, starting the nth checkpoint according to the adjusted checkpoint interval CI_n. The invention can reduce the time spent storing checkpoint data, the system recovery delay, the CPU and memory occupancy, and the task execution time.

Description

Multi-feature-aware stream computing system fault tolerance method and system
Technical Field
The invention relates to the technical field of fault tolerance of a stream computing system, in particular to a multi-feature-aware stream computing system fault tolerance method and system.
Background
Stream computing systems can process data streams in real time. However, because faults are common in such systems, their fault tolerance has become a research hotspot. A shorter checkpoint interval increases the checkpoint overhead, whereas a longer checkpoint interval increases the failure recovery time. Setting an appropriate checkpoint interval is therefore important for the efficient operation of streaming applications.
Disclosure of Invention
The invention provides a multi-feature-aware stream computing system fault tolerance method and a multi-feature-aware stream computing system fault tolerance system. The technical scheme is as follows:
In one aspect, a multi-feature-aware stream computing system fault tolerance method is provided, comprising:
S1, performing multi-feature sensing, which comprises: predicting the failure rate during application running, monitoring the resource occupancy of tasks on nodes, and sensing the execution duration of tasks on nodes and the data volume processed by the tasks;
S2, dynamically adjusting the checkpoint interval according to the sensed features using the Mf-Stream fault-tolerant strategy, to obtain each adjusted checkpoint interval CI_n; the Mf-Stream fault-tolerant strategy correspondingly comprises a fault-aware fault-tolerant policy, a resource-aware fault-tolerant policy and a slow-task-aware fault-tolerant policy;
S3, starting the nth checkpoint according to each adjusted checkpoint interval CI_n.
Optionally, predicting the failure rate during application running in S1 comprises: predicting the failure rate fr using linear regression, which specifically includes:
collecting and preprocessing historical data;
selecting suitable features for training the linear regression model, the suitable features including: CPU occupancy, memory occupancy and task execution time;
taking the preprocessed data set as input, performing model training with a linear regression algorithm, and fitting a linear regression model fr = β_0 + β_1·U_cpu + β_2·U_mem + β_3·T_ex + ε, which is used to predict the failure rate, where fr denotes the failure rate, U_cpu the CPU occupancy, U_mem the memory occupancy, T_ex the task execution time, β_i (i = 0, 1, ..., 3) the undetermined parameters, and ε the error term;
and predicting with the trained linear regression model to give a failure rate prediction for the current stream computing system.
Optionally, in S2, according to the predicted failure rate, a failure-aware fault-tolerant policy is used to dynamically adjust the checkpoint interval, which specifically includes:
based on the predicted failure rate fr, continuously increasing the checkpoint interval: Δi_1 is used to increase the fixed checkpoint interval CI_0; the minimum value of fr is 0.25, and all values below it are set to 0.25; each checkpoint interval CI_n keeps increasing at this rate until a fault occurs;
when a fault occurs, the application restarts from the checkpoint nearest the fault point and Δi_n is used to reduce CI_n, where CI_{n-1} denotes the previous checkpoint interval; the checkpoint interval keeps decreasing at this rate until it equals the minimum checkpoint interval.
Optionally, in the step S2, a resource-aware fault-tolerant policy is used to dynamically adjust the checkpoint interval according to the resource occupancy of tasks on nodes, which specifically includes:
setting a threshold C_const for the maximum CPU usage time, 0 < C_const ≤ 1;
monitoring CPU occupancy with a heartbeat mechanism; if the CPU-time share of a task exceeds the threshold C_const, increasing the checkpoint interval; a CPU occupancy U_cpu is defined, and the next checkpoint interval is calculated by equation (1):
(1)
U_cpu is equal to the ratio of the CPU time occupied by the task's normal logic processing to the total CPU time during running:
(2)
setting a threshold M_const for the maximum memory usage, 0 < M_const ≤ 1;
monitoring memory occupancy with a heartbeat mechanism; if the memory usage share of a task exceeds the threshold M_const, increasing the checkpoint interval; a memory occupancy U_mem is defined, and the next checkpoint interval can be calculated by equation (3):
(3)
U_mem is equal to the ratio of the memory used by the task's normal logic processing to the total memory during running:
(4)
until the occupancy of both CPU and memory falls back below the thresholds, after which the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0.
Optionally, the slow task is judged by the task execution time length and the processing data amount of the task, the number of the slow tasks is counted, and only the task with the task execution time exceeding the first preset threshold and the task processing amount smaller than the second preset threshold is judged as the slow task.
Optionally, in the step S2, a slow task aware fault tolerance policy is used according to the execution duration of the task on the node and the processing data amount of the task, so as to dynamically adjust the checkpoint interval, which specifically includes:
when the detected number of slow tasks exceeds the threshold M, checkpoints start to be opened at non-uniform intervals: Δci_1 is used to increase the fixed checkpoint interval CI_0, where Δci_1 = CI_{n-1} × [(N_slow − M)/M] and N_slow denotes the number of slow tasks; each checkpoint interval CI_n keeps increasing at this rate until the number of slow tasks falls below the threshold M;
after the number of slow tasks falls below the threshold M, the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0.
In another aspect, there is provided a multi-feature aware stream computing system fault tolerant system comprising: a multi-feature aware fault tolerant MAFT module and a multi-feature aware checkpoint coordinator;
the MAFT module is used for predicting failure rate in the running process of the application program, monitoring resource occupation condition of tasks on the nodes, sensing execution time of the tasks on the nodes and processing data quantity of the tasks;
the multi-feature-aware checkpoint coordinator is configured to dynamically adjust the checkpoint interval according to the sensed features using the Mf-Stream fault-tolerant strategy, to obtain each adjusted checkpoint interval CI_n; the Mf-Stream fault-tolerant strategy correspondingly comprises a fault-aware fault-tolerant policy, a resource-aware fault-tolerant policy and a slow-task-aware fault-tolerant policy; the coordinator starts the nth checkpoint according to each adjusted checkpoint interval CI_n.
Optionally, the MAFT module includes a MAFT agent and a database, the MAFT agent including: the system comprises a fault sensing module, a monitoring module and a slow task detector;
the fault perception module is used for predicting the failure rate fr using linear regression;
The monitoring module is used for monitoring the real-time CPU and the real-time memory quantity occupied by the tasks on the nodes;
the slow task detector is used for sensing the execution time of the task on the node and the processing data volume of the task so as to judge whether the task is a slow task or not;
the database is used for temporarily storing the predicted failure rate, the monitored CPU and memory amounts, the task execution durations and the task data volumes, which are deleted after the data have been processed.
In another aspect, an electronic device is provided that includes a processor and a memory having at least one instruction stored therein that is loaded and executed by the processor to implement the multi-feature aware stream computing system fault tolerance method described above.
In another aspect, a computer readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement the multi-feature aware stream computing system fault tolerance method described above is provided.
Compared with the prior art, the technical scheme has at least the following beneficial effects:
the invention can dynamically adjust the check point interval through the fault rate, the CPU occupancy rate, the memory occupancy rate, the execution time of the task on the node and the processing data volume of the task, and compared with the periodic check point, the invention can reduce the time for storing the check point data, reduce the recovery delay of the system, reduce the CPU occupancy rate and the memory occupancy rate and reduce the task execution time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a fault tolerance method for a multi-feature aware stream computing system according to an embodiment of the present invention;
FIG. 2 is a diagram of a fault tolerant architecture of a multi-feature aware streaming computing system according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed description of the preferred embodiments
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without creative efforts, based on the described embodiments of the present invention fall within the protection scope of the present invention.
As shown in fig. 1, an embodiment of the present invention provides a multi-feature-aware stream computing system fault tolerance method, comprising:
S1, performing multi-feature sensing, which comprises: predicting the failure rate during application running, monitoring the resource occupancy of tasks on nodes, and sensing the execution duration of tasks on nodes and the data volume processed by the tasks;
S2, dynamically adjusting the checkpoint interval according to the sensed features using the Mf-Stream fault-tolerant strategy, to obtain each adjusted checkpoint interval CI_n; the Mf-Stream fault-tolerant strategy correspondingly comprises a fault-aware fault-tolerant policy, a resource-aware fault-tolerant policy and a slow-task-aware fault-tolerant policy;
S3, starting the nth checkpoint according to each adjusted checkpoint interval CI_n.
Referring to fig. 2, fig. 2 shows a fault-tolerant system architecture diagram of a multi-feature-aware stream computing system according to an embodiment of the present invention. A Multi-Features Aware Fault Tolerance (MAFT) module is added to the system architecture of Flink. The MAFT module is used for predicting the failure rate during application running, monitoring the resource occupancy of tasks on nodes, and sensing the execution duration of tasks on nodes and the data volume processed by the tasks. The module comprises a MAFT agent and a database; the MAFT agent comprises a fault perception module, a monitoring module and a slow task detector. The fault perception module predicts the failure rate fr using linear regression; the monitoring module monitors the real-time CPU time and memory occupied by tasks on nodes; the slow task detector senses the execution duration and the processed data volume of tasks on nodes and judges whether a task is a slow task (only a task whose execution time exceeds a first preset threshold and whose processed data volume is smaller than a second preset threshold is judged to be a slow task); the database temporarily stores the predicted failure rate, the monitored CPU and memory amounts, the task execution durations and the task data volumes, and deletes them after the data have been processed.
In a typical distributed stream computing system, the checkpoint coordinator runs checkpoints periodically, without regard to the potential failure distribution. The embodiment of the invention modifies the existing checkpoint coordinator into a multi-feature-aware checkpoint coordinator, which dynamically adjusts the checkpoint interval according to the sensed features using the Mf-Stream fault-tolerant strategy, to obtain each adjusted checkpoint interval CI_n; the Mf-Stream fault-tolerant strategy correspondingly comprises a fault-aware fault-tolerant policy, a resource-aware fault-tolerant policy and a slow-task-aware fault-tolerant policy; the coordinator then starts the nth checkpoint according to each adjusted checkpoint interval CI_n.
The multi-feature-aware stream computing system fault tolerance method of the embodiment of the invention comprises the following steps:
S1, performing multi-feature sensing, which comprises: predicting the failure rate during application running, monitoring the resource occupancy of tasks on nodes, and sensing the execution duration of tasks on nodes and the data volume processed by the tasks;
Optionally, predicting the failure rate during application running in S1 comprises: predicting the failure rate fr using linear regression, which specifically includes:
collecting and preprocessing historical data;
selecting suitable features for training the linear regression model, the suitable features including: CPU occupancy, memory occupancy and task execution time;
taking the preprocessed data set as input, performing model training with a linear regression algorithm, and fitting a linear regression model fr = β_0 + β_1·U_cpu + β_2·U_mem + β_3·T_ex + ε, which is used to predict the failure rate, where fr denotes the failure rate, U_cpu the CPU occupancy, U_mem the memory occupancy, T_ex the task execution time, β_i (i = 0, 1, ..., 3) the undetermined parameters, and ε the error term;
and predicting by using the linear regression model obtained through training to give a fault rate predicted value of the current flow computing system.
Specifically:
1. collecting data: collecting historical failure rate data, including relevant information on system hardware, software, network and configuration, and recording it in a data set (the failure data used in the embodiment of the invention comes from a publicly available repository, a failure trace archive);
2. data preprocessing: according to the historical failure rate data in the data set, cleaning, converting and normalizing the data so that it can be used to train the linear regression model;
3. feature selection: selecting suitable features for training the linear regression model. Feature selection considers the association between the failure rate and the attributes and configuration of the system; the finally selected features are CPU occupancy, memory occupancy and task execution time. If the resource occupancy of tasks on a node is too high, a system performance bottleneck can arise; if the task execution time is too long, delays or blocking can occur during execution and affect system performance;
4. model training: taking the preprocessed data set as input, training a model with a linear regression algorithm, and fitting a linear regression model fr = β_0 + β_1·U_cpu + β_2·U_mem + β_3·T_ex + ε, which is used to predict the failure rate, where fr denotes the failure rate, U_cpu the CPU occupancy, U_mem the memory occupancy, T_ex the task execution time, β_i (i = 0, 1, 2, 3) the regression coefficients, and ε the error term (a fitting sketch in code is given after this list);
5. model evaluation: verifying the accuracy and stability of the model obtained by training by using the test data set;
6. model application: and predicting by using the linear regression model obtained through training to give a fault rate predicted value of the current distributed stream computing system.
In feature selection and model training, classical machine learning algorithms such as support vector machines (SVM), logistic regression and principal component analysis (PCA) may also be used to improve model accuracy and performance.
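The following Java sketch illustrates how such a regression model could be fitted and queried. It is only an illustration of the approach described above: the Apache Commons Math library, the class and variable names, and the sample data layout are assumptions, not part of the patent.
import org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression;
// Minimal sketch of fitting fr = b0 + b1*U_cpu + b2*U_mem + b3*T_ex + e from
// historical samples and predicting the failure rate of the current system.
// Library choice (commons-math3) and all names are illustrative assumptions.
public class FailureRatePredictor {
    private double[] beta;  // [b0, b1, b2, b3]
    /** x[i] = {U_cpu, U_mem, T_ex} for sample i, y[i] = observed failure rate. */
    public void fit(double[][] x, double[] y) {
        OLSMultipleLinearRegression ols = new OLSMultipleLinearRegression();
        ols.newSampleData(y, x);                    // intercept term added by default
        beta = ols.estimateRegressionParameters();  // returns {b0, b1, b2, b3}
    }
    /** Predicted fr for the current CPU occupancy, memory occupancy and task execution time. */
    public double predict(double uCpu, double uMem, double tEx) {
        return beta[0] + beta[1] * uCpu + beta[2] * uMem + beta[3] * tEx;
    }
}
The predicted value would then be clamped to the [0.25, 1] range used by the fault-aware policy before it drives the checkpoint interval.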
Monitoring the resource occupancy of tasks on a node mainly means monitoring the CPU and memory occupied by each task on the node (Linux provides the cgroup mechanism, which can allocate independent CPU and memory to tasks, so the CPU and memory occupied by each task can be monitored). In particular, a monitoring module can be provided, and the CPU and memory occupancy can be monitored with a heartbeat mechanism.
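A minimal sketch of such a monitoring probe is shown below, assuming cgroup v1 accounting files under /sys/fs/cgroup and one cgroup per task; the paths, group layout and class names are assumptions for illustration only.
import java.nio.file.Files;
import java.nio.file.Path;
// Sketch of reading per-task CPU and memory accounting from cgroup v1 files,
// as a heartbeat-driven monitoring module might do. Paths and layout are assumed.
public class CgroupTaskProbe {
    private final Path cpuAcct;   // e.g. /sys/fs/cgroup/cpuacct/<task-group>/cpuacct.usage
    private final Path memUsage;  // e.g. /sys/fs/cgroup/memory/<task-group>/memory.usage_in_bytes
    public CgroupTaskProbe(String taskGroup) {
        this.cpuAcct = Path.of("/sys/fs/cgroup/cpuacct", taskGroup, "cpuacct.usage");
        this.memUsage = Path.of("/sys/fs/cgroup/memory", taskGroup, "memory.usage_in_bytes");
    }
    /** Cumulative CPU time consumed by the task's cgroup, in nanoseconds. */
    public long cpuTimeNanos() throws java.io.IOException {
        return Long.parseLong(Files.readString(cpuAcct).trim());
    }
    /** Current memory usage of the task's cgroup, in bytes. */
    public long memoryBytes() throws java.io.IOException {
        return Long.parseLong(Files.readString(memUsage).trim());
    }
}
On each heartbeat the monitoring module would read these counters, convert them into the occupancy ratios U_cpu and U_mem, and report them to the MAFT agent.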
The execution duration of tasks on nodes and the data volume processed by tasks are sensed by extending a slow task detector in Flink.
S2, dynamically adjusting the checkpoint interval according to the sensed features using the Mf-Stream fault-tolerant strategy, to obtain each adjusted checkpoint interval CI_n; the Mf-Stream fault-tolerant strategy correspondingly comprises a fault-aware fault-tolerant policy, a resource-aware fault-tolerant policy and a slow-task-aware fault-tolerant policy;
optionally, in S2, according to the predicted failure rate, a failure-aware fault-tolerant policy is used to dynamically adjust the checkpoint interval, which specifically includes:
based on the predicted failure rate fr, the fixed checkpoint interval CI_0 is continuously enlarged: Δi_1 is used to increase CI_0; the minimum value of fr is 0.25, and all values below it are set to 0.25; each checkpoint interval CI_n keeps increasing at this rate until a fault occurs;
when a fault occurs, the application restarts from the checkpoint nearest the fault point and Δi_n is used to reduce CI_n, where CI_{n-1} denotes the previous checkpoint interval; the checkpoint interval keeps decreasing at this rate until it equals the minimum checkpoint interval.
The fault-aware fault tolerance strategy is described in detail below in conjunction with Algorithm 1:
The multi-feature-aware checkpoint coordinator of the embodiment of the invention starts fault-aware checkpoints according to the predicted failure rate. If the failure rate is small, it dynamically increases the checkpoint interval; otherwise, it dynamically reduces the checkpoint interval. Specifically, the multi-feature-aware checkpoint coordinator starts running the program and initiates the first checkpoint at the fixed checkpoint interval CI_0. It then optimistically assumes that no fault will occur in the near future, so it begins to increase the checkpoint interval continuously based on the predicted failure rate fr. This results in monotonically increasing checkpoint intervals, i.e. CI_1 < CI_2 < ... < CI_n. The coordinator continues to increase the checkpoint interval until a fault occurs. If a fault occurs, it starts to reduce the checkpoint interval in order to reduce the fault recovery time, because studies of actual fault data and predictions of operational faults show that the probability of a subsequent fault is very high shortly after the last fault. Therefore, to mitigate possibly correlated failures, the multi-feature-aware checkpoint coordinator decreases rather than increases the checkpoint interval after a failure.
The value of fr plays an important role in determining how much the checkpoint interval changes and when the checkpoint frequency changes. fr ranges from 0 to 1, where 0 represents no fault and 1 represents a high failure rate. When fr = 1, fault-aware fault tolerance behaves like the fixed checkpoint interval model, since the failure rate is high and any increase in the checkpoint interval may increase the failure recovery time; in this case the checkpoint interval is gradually reduced. On the other hand, fr = 0 characterizes a low failure rate, and the checkpoint interval is increased to improve the efficiency of the application: when the probability of failure is low, the checkpoint interval can be enlarged to improve utilization and reduce checkpoint cost. However, fr = 0 tends to double the checkpoint interval each time, which easily leads to high fault recovery overhead. To limit exponential growth of the checkpoint interval, the multi-feature-aware fault-tolerant agent sets the minimum value of fr to 0.25, and all values below it are set to 0.25.
Algorithm 1 describes the detailed algorithm of the fault-aware fault tolerance strategy:
Input: Predicted failure probability fr, periodic checkpoint interval CI_0
Output: Failure-aware checkpoints
1.  if fr < 0.25 then
2.      Set fr = 0.25
3.  end if
4.  while Application not finished do
5.      Use periodic checkpoint interval CI_0
6.      if not fail then
7.          Non-uniform-Interval() {
8.              Calculate Δi_n
9.              Next checkpoint interval CI_n = CI_{n-1} + Δi_n
10.             Triggering checkpoint time t_n = CI_n + CI_{n-1}; }
11.     end if
12.     if failure occurs then
13.         Restart execution from last checkpoint
14.         Non-uniform-Interval() {
15.             Calculate Δi_n
16.             Next checkpoint interval CI_n = CI_{n-1} − Δi_n
17.             Triggering checkpoint time t_n = CI_n + CI_{n-1}
18.             until CI_n = CI_min }
19.     end if
20. end while
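A compact Java sketch of the interval-adjustment loop in Algorithm 1 is given below. The formula for Δi_n is not reproduced in the text above, so the sketch assumes Δi_n = CI_{n-1} × fr purely for illustration; class and method names are likewise assumptions.
// Sketch of the fault-aware checkpoint interval policy (Algorithm 1).
// ASSUMPTION: Δi_n = CI_{n-1} * fr; the patent's actual formula is not shown here.
public class FaultAwareIntervalPolicy {
    private final long minIntervalMs;   // CI_min, a preset value close to 0
    private long intervalMs;            // current interval CI_{n-1}, starts at CI_0
    private final double fr;            // predicted failure rate, clamped to >= 0.25
    public FaultAwareIntervalPolicy(long initialIntervalMs, long minIntervalMs, double predictedFr) {
        this.intervalMs = initialIntervalMs;
        this.minIntervalMs = minIntervalMs;
        this.fr = Math.max(predictedFr, 0.25);   // fr below 0.25 is set to 0.25
    }
    /** No failure so far: CI_n = CI_{n-1} + Δi_n (intervals grow monotonically). */
    public long onCheckpointCompleted() {
        long delta = (long) (intervalMs * fr);   // assumed Δi_n
        intervalMs += delta;
        return intervalMs;
    }
    /** After a failure: CI_n = CI_{n-1} - Δi_n, never below CI_min. */
    public long onFailure() {
        long delta = (long) (intervalMs * fr);   // assumed Δi_n
        intervalMs = Math.max(intervalMs - delta, minIntervalMs);
        return intervalMs;
    }
}
The coordinator would call onCheckpointCompleted() while the job runs normally and onFailure() after a restart, scheduling checkpoint n at t_n = CI_n + CI_{n-1} as in the listing.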
t_0 denotes the periodic checkpoint time. The algorithm runs the application with a fixed checkpoint interval CI_0 and starts periodic checkpoints at time t_0. It then optimistically assumes that no failure will occur in the near future, so the checkpoint interval begins to increase before the next checkpoint is initiated. It uses Δi_1 to increase CI_0 and keeps increasing each checkpoint interval CI_n at this rate. In this way it iteratively increases the checkpoint interval and initiates checkpoints as the interval grows. Once the size of a checkpoint interval is computed, the algorithm gradually pushes back the checkpoint start time as the interval increases: the nth checkpoint is started at time t_n = CI_n + CI_{n-1}. With this approach it dynamically increases the checkpoint interval and reduces the frequency of starting checkpoints. The checkpoint interval grows gradually until a fault occurs. When a fault occurs, the application restarts from the checkpoint nearest the fault point and Δi_n is used to reduce CI_n. The interval keeps decreasing at this rate in order to tolerate potentially correlated faults; the algorithm iteratively reduces the checkpoint intervals and starts the nth checkpoint according to each adjusted interval CI_n. In this way it dynamically reduces the checkpoint interval and the fault recovery time. The checkpoint interval decreases gradually until it equals the minimum checkpoint interval (a preset value close to 0).
Optionally, in the step S2, a resource-aware fault-tolerant policy is used to dynamically adjust a checkpoint interval according to a resource occupation condition of a task on a node, which specifically includes:
setting a threshold C_const for the maximum CPU usage time, 0 < C_const ≤ 1;
monitoring CPU occupancy with a heartbeat mechanism; if the CPU-time share of a task exceeds the threshold C_const, increasing the checkpoint interval; a CPU occupancy U_cpu is defined, and the next checkpoint interval is calculated by equation (1):
(1)
U_cpu is equal to the ratio of the CPU time occupied by the task's normal logic processing to the total CPU time during running:
(2)
setting a threshold M_const for the maximum memory usage, 0 < M_const ≤ 1;
monitoring memory occupancy with a heartbeat mechanism; if the memory usage share of a task exceeds the threshold M_const, increasing the checkpoint interval; a memory occupancy U_mem is defined, and the next checkpoint interval can be calculated by equation (3):
(3)
U_mem is equal to the ratio of the memory used by the task's normal logic processing to the total memory during running:
(4)
until the occupancy of both CPU and memory falls back below the thresholds, after which the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0.
The resource-aware fault tolerance policy is described in detail below in conjunction with Algorithm 2:
First, consider the impact of CPU resources on the checkpoint interval. A threshold C_const is set for the maximum CPU usage time, 0 < C_const ≤ 1. C_const defines an upper limit on the share of CPU time used by each task; exceeding the threshold triggers the system to reset the checkpoint interval. Care must be taken when choosing the new checkpoint interval, because reducing the checkpoint interval reduces the failure recovery time but increases the checkpoint overhead, whereas increasing the checkpoint interval reduces the checkpoint overhead but increases the failure recovery time. If the CPU-time share of a node exceeds the threshold C_const, the embodiment of the invention increases the checkpoint interval to reduce the CPU consumed by checkpoint operations, so that all of the CPU is used to execute tasks and the tasks can finish faster. A CPU occupancy U_cpu is defined, and the next checkpoint interval can be calculated by equation (1):
(1)
U_cpu is equal to the ratio of the CPU time occupied by the task's normal logic processing to the total CPU time during running:
(2)
Then, consider the effect of memory on the checkpoint interval. A threshold M_const is set for the maximum memory usage, 0 < M_const ≤ 1. M_const defines an upper limit on the memory used by each task; exceeding the threshold triggers the system to reset the checkpoint interval. If the memory usage share exceeds the threshold M_const, the checkpoint interval is increased to reduce the memory used by checkpointing, so that the whole memory is used to run the task and the task runs faster and successfully. A memory occupancy U_mem is defined, and the next checkpoint interval can be calculated by equation (3):
(3)
U_mem is equal to the ratio of the memory used by the task's normal logic processing to the total memory during running:
(4)
algorithm 2 gives a detailed algorithm of the resource-aware fault tolerance policy:
Input: CPU occupancy rate U_cpu, initialized checkpoint interval CI_0, memory occupancy rate U_mem, CPU occupancy threshold C_const, memory occupancy threshold M_const
Output: Resource-aware checkpoints
1.  while Application not finished do
2.      Use initialized checkpoint interval CI_0
3.      if U_cpu > C_const or U_mem > M_const then
4.          Non-uniform-Interval() {
5.              Next checkpoint interval by Equation (1) or (3)
6.              Triggering checkpoint time t_n = CI_n + CI_{n-1}; }
7.      end if
8.      if U_cpu ≤ C_const and U_mem ≤ M_const then
9.          Recover checkpoint interval to CI_0
10.     end if
11. end while
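The following Java sketch shows the control flow of Algorithm 2. Equations (1) and (3) are not reproduced in the text above, so the growth rule used below (CI_n = CI_{n-1} × (1 + usage)) is only an illustrative assumption, as are the class and method names.
// Sketch of the resource-aware checkpoint interval policy (Algorithm 2).
// ASSUMPTION: equations (1)/(3) are approximated by CI_n = CI_{n-1} * (1 + usage).
public class ResourceAwareIntervalPolicy {
    private final long baseIntervalMs;   // CI_0
    private final double cpuThreshold;   // C_const, 0 < C_const <= 1
    private final double memThreshold;   // M_const, 0 < M_const <= 1
    private long intervalMs;             // current interval CI_{n-1}
    public ResourceAwareIntervalPolicy(long baseIntervalMs, double cpuThreshold, double memThreshold) {
        this.baseIntervalMs = baseIntervalMs;
        this.cpuThreshold = cpuThreshold;
        this.memThreshold = memThreshold;
        this.intervalMs = baseIntervalMs;
    }
    /** Called on every heartbeat with the latest per-task occupancy ratios U_cpu and U_mem. */
    public long onHeartbeat(double uCpu, double uMem) {
        if (uCpu > cpuThreshold || uMem > memThreshold) {
            // A threshold is exceeded: enlarge the interval so checkpointing
            // takes fewer resources away from normal task logic.
            double usage = Math.max(uCpu, uMem);
            intervalMs = (long) (intervalMs * (1.0 + usage));   // assumed form of (1)/(3)
        } else {
            // Both occupancies are back below their thresholds: reset to CI_0.
            intervalMs = baseIntervalMs;
        }
        return intervalMs;
    }
}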
The algorithm first runs the application with the constant checkpoint interval CI_0 and starts periodic checkpoints at time t_0. When the monitoring module detects that the CPU occupancy exceeds the threshold C_const, or the memory occupancy exceeds M_const, checkpoints begin to be opened at non-uniform intervals: a new checkpoint interval is computed with equation (1) or (3). In this way the algorithm iteratively increases the checkpoint interval and initiates checkpoints as the interval grows. Once the size of a checkpoint interval is computed, the algorithm keeps pushing back the checkpoint start time as the interval increases: the nth checkpoint is started at time t_n = CI_n + CI_{n-1}. With this method the checkpoint interval is dynamically increased while the frequency of opening checkpoints is reduced. The checkpoint interval grows gradually until the occupancy of both resources falls below the thresholds, after which the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0.
Optionally, the slow task is judged by the task execution time length and the processing data amount of the task, the number of the slow tasks is counted, and only the task with the task execution time exceeding the first preset threshold and the task processing amount smaller than the second preset threshold is judged as the slow task.
In production, hot-spot machines cannot be avoided: intensive data backfilling and co-located (mixed-deployment) clusters can make the workload of a particular machine high and its input/output busy. The data processing tasks running on it may then be extremely slow, making it difficult to guarantee the job completion time. Abnormal machine nodes include problems such as hardware anomalies, occasional I/O busyness and high CPU load. These problems cause the tasks executed on them to be much slower than tasks executed on other nodes, thereby extending the running time of the entire job.
By extending the slow task detector in Flink, the embodiment of the invention can also detect the execution time of stream computing tasks and the data volume they process; a task whose execution time is long (exceeding the first preset threshold) and whose processed data volume is small (below the second preset threshold) is identified as a slow task. The embodiment reduces the checkpoint overhead by reducing the time slow tasks spend on checkpoint operations and, more importantly, reduces the resource occupancy and the task execution time.
Algorithm 3 gives a detailed algorithm for slow task assessment:
Input: total number of tasks m, task list taskList, historical task execution times execution[task_i][time_pre], duration of task execution duration[task_i], processed data volume of tasks procVol[task_i], total number of slow tasks N_slow
Output: number of slow tasks
1.  for task_i = 1 → m do
2.      duration[task_i] = computeTaskExecutionDuration(execution[task_i][])
3.  end for
4.  tasksList1 = sortTasks(m, duration[])
5.  for task_i = 1 → m do
6.      procVol[task_i] = computeProcessVolume(execution[task_i][])
7.  end for
8.  tasksList2 = sortTasks(m, procVol[])
9.  slowTasksList = selectSlowTasks(s, tasksList1, tasksList2)
10. N_slow = Count(slowTasksList)
11. return N_slow
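A small Java sketch of the selection rule behind Algorithm 3 follows; it only illustrates the slow-task criterion stated in the text (long execution time and small processed data volume), and the thresholds, record type and method names are assumptions.
import java.util.List;
// Sketch of slow task detection: a task is counted as slow only if its
// execution duration exceeds the first preset threshold AND its processed
// data volume is below the second preset threshold.
public class SlowTaskCounter {
    public record TaskStats(String taskId, long durationMs, long processedRecords) {}
    private final long durationThresholdMs;   // first preset threshold
    private final long volumeThreshold;       // second preset threshold
    public SlowTaskCounter(long durationThresholdMs, long volumeThreshold) {
        this.durationThresholdMs = durationThresholdMs;
        this.volumeThreshold = volumeThreshold;
    }
    /** Returns N_slow for the current snapshot of task statistics. */
    public long countSlowTasks(List<TaskStats> tasks) {
        return tasks.stream()
                .filter(t -> t.durationMs() > durationThresholdMs
                          && t.processedRecords() < volumeThreshold)
                .count();
    }
}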
the algorithm 3 judges the slow tasks according to the execution time of the tasks and the processing data quantity of the tasks, counts the quantity of the slow tasks, and judges the tasks as the slow tasks only when the execution time of the tasks exceeds a first preset threshold value and the processing data quantity of the tasks is smaller than a second preset threshold value.
Optionally, in the step S2, a slow task aware fault tolerance policy is used according to the execution duration of the task on the node and the processing data amount of the task, so as to dynamically adjust the checkpoint interval, which specifically includes:
when the detected number of slow tasks exceeds the threshold M, checkpoints start to be opened at non-uniform intervals: Δci_1 is used to increase the fixed checkpoint interval CI_0, where Δci_1 = CI_{n-1} × [(N_slow − M)/M] and N_slow denotes the number of slow tasks; each checkpoint interval CI_n keeps increasing at this rate until the number of slow tasks falls below the threshold M;
after the number of slow tasks falls below the threshold M, the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0.
The slow-task-aware fault tolerance strategy is described in detail below in conjunction with Algorithm 4:
The embodiment of the invention uses the slow task detector to detect the number of slow tasks in the stream computing system, denoted N_slow. If N_slow exceeds the set threshold M, the checkpoint interval is increased, reducing the time slow tasks spend on checkpoints and letting them concentrate on executing their work. When the slow task detector detects that N_slow has fallen back to a normal value, the fixed checkpoint interval is restored.
Algorithm 4 gives a detailed algorithm for the slow task aware fault tolerance strategy:
Input: slow task number N_slow, initialized checkpoint interval CI_0
Output: Slow-task-aware checkpoints
1.  while Application not finished do
2.      Use initialized checkpoint interval CI_0
3.      if N_slow > M then
4.          Non-uniform-Interval() {
5.              Calculate Δci_n
6.              Next checkpoint interval CI_n = CI_{n-1} + Δci_n
7.              Triggering checkpoint time t_n = CI_n + CI_{n-1}; }
8.      end if
9.      if N_slow ≤ M then
10.         Recover checkpoint interval to CI_0
11.     end if
12. end while
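The Java sketch below mirrors Algorithm 4, using the increment given in the text, Δci = CI_{n-1} × (N_slow − M)/M; the class and method names are illustrative only.
// Sketch of the slow-task-aware checkpoint interval policy (Algorithm 4).
// Uses Δci = CI_{n-1} * (N_slow - M) / M as stated in the text; names are assumed.
public class SlowTaskAwareIntervalPolicy {
    private final long baseIntervalMs;     // CI_0
    private final long slowTaskThreshold;  // M
    private long intervalMs;               // current interval CI_{n-1}
    public SlowTaskAwareIntervalPolicy(long baseIntervalMs, long slowTaskThreshold) {
        this.baseIntervalMs = baseIntervalMs;
        this.slowTaskThreshold = slowTaskThreshold;
        this.intervalMs = baseIntervalMs;
    }
    /** Called whenever the slow task detector reports a new N_slow. */
    public long onSlowTaskCount(long nSlow) {
        if (nSlow > slowTaskThreshold) {
            // Δci = CI_{n-1} * (N_slow - M) / M, then CI_n = CI_{n-1} + Δci.
            long delta = intervalMs * (nSlow - slowTaskThreshold) / slowTaskThreshold;
            intervalMs += delta;
        } else {
            // N_slow is back at or below M: restore the fixed interval CI_0.
            intervalMs = baseIntervalMs;
        }
        return intervalMs;
    }
}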
The algorithm first runs the application with the fixed checkpoint interval CI_0 and starts periodic checkpoints at time t_0. When the slow task detector detects that the number of slow tasks exceeds the threshold M, checkpoints begin to be opened at non-uniform intervals: Δci_1 is used to increase CI_0, with Δci_1 = CI_{n-1} × [(N_slow − M)/M], and each checkpoint interval CI_n keeps increasing at this rate. In this way the algorithm iteratively increases the checkpoint interval and initiates checkpoints as the interval grows. Once the size of a checkpoint interval is computed, the algorithm keeps pushing back the checkpoint start time as the interval increases: the nth checkpoint is started at time t_n = CI_n + CI_{n-1}. With this method the checkpoint interval is dynamically increased while the frequency of opening checkpoints is reduced. The checkpoint interval grows gradually until the number of slow tasks falls below the threshold, after which the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0.
S3, starting the nth checkpoint according to each adjusted checkpoint interval CI_n.
As shown in fig. 2, an embodiment of the present invention further provides a multi-feature aware stream computing system fault tolerant system, including: a multi-feature aware fault tolerant MAFT module and a multi-feature aware checkpoint coordinator;
the MAFT module is used for predicting failure rate in the running process of the application program, monitoring resource occupation condition of tasks on the nodes, sensing execution time of the tasks on the nodes and processing data quantity of the tasks;
the multi-feature-aware checkpoint coordinator is configured to dynamically adjust the checkpoint interval according to the sensed features using the Mf-Stream fault-tolerant strategy, to obtain each adjusted checkpoint interval CI_n; the Mf-Stream fault-tolerant strategy correspondingly comprises a fault-aware fault-tolerant policy, a resource-aware fault-tolerant policy and a slow-task-aware fault-tolerant policy; the coordinator starts the nth checkpoint according to each adjusted checkpoint interval CI_n.
Optionally, the MAFT module includes a MAFT agent and a database, the MAFT agent including: the system comprises a fault sensing module, a monitoring module and a slow task detector;
the fault perception module is used for predicting the failure rate fr using linear regression;
The monitoring module is used for monitoring the real-time CPU time and the real-time memory quantity occupied by the tasks on the nodes;
the slow task detector is used for sensing the execution time of the task on the node and the processing data volume of the task so as to judge whether the task is a slow task or not;
the database is used for temporarily storing the predicted failure rate, the monitored CPU and memory amounts, the task execution durations and the task data volumes, which are deleted after the data have been processed.
The functional structure of the fault-tolerant system of the multi-feature-aware stream computing system provided by the embodiment of the invention corresponds to the fault-tolerant method of the multi-feature-aware stream computing system provided by the embodiment of the invention, and is not described herein.
Fig. 3 is a schematic structural diagram of an electronic device 300 according to an embodiment of the present invention, where the electronic device 300 may have a relatively large difference due to different configurations or performances, and may include one or more processors (central processing units, CPU) 301 and one or more memories 302, where at least one instruction is stored in the memories 302, and the at least one instruction is loaded and executed by the processors 301 to implement the steps of the multi-feature-aware stream computing system fault tolerance method described above.
In an exemplary embodiment, a computer readable storage medium, such as a memory comprising instructions executable by a processor in a terminal to perform the multi-feature aware stream computing system fault tolerance method described above is also provided. For example, the computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (7)

1. A multi-feature-aware stream computing system fault tolerance method, comprising:
S1, performing multi-feature sensing, which comprises: predicting the failure rate during application running, monitoring the resource occupancy of tasks on nodes, and sensing the execution duration of tasks on nodes and the data volume processed by the tasks;
S2, dynamically adjusting the checkpoint interval according to the sensed features using the Mf-Stream fault-tolerant strategy, to obtain each adjusted checkpoint interval CI_n; the Mf-Stream fault-tolerant strategy correspondingly comprises: a fault-aware fault-tolerant policy, a resource-aware fault-tolerant policy and a slow-task-aware fault-tolerant policy;
S3, starting the nth checkpoint according to each adjusted checkpoint interval CI_n;
in the step S2, according to the predicted failure rate, a failure sensing fault tolerance strategy is used to dynamically adjust the checkpoint interval, which specifically includes:
based on the predicted failure rate fr, continuously increasing the checkpoint interval: Δi_1 is used to increase the fixed checkpoint interval CI_0; the minimum value of fr is 0.25, and all values below it are set to 0.25; each checkpoint interval CI_n keeps increasing at this rate until a fault occurs;
when a fault occurs, the application restarts from the checkpoint nearest the fault point and Δi_n is used to reduce CI_n, where CI_{n-1} denotes the previous checkpoint interval; the checkpoint interval keeps decreasing at this rate until it equals the minimum checkpoint interval;
in the step S2, according to the resource occupation condition of the tasks on the nodes, a resource-aware fault-tolerant strategy is used to dynamically adjust the check point intervals, and the method specifically comprises the following steps:
setting a threshold C_const for the maximum CPU usage time, 0 < C_const ≤ 1;
monitoring CPU occupancy with a heartbeat mechanism; if the CPU-time share of a task exceeds the threshold C_const, increasing the checkpoint interval; a CPU occupancy U_cpu is defined, and the next checkpoint interval is calculated by equation (1):
(1)
U_cpu is equal to the ratio of the CPU time occupied by the task's normal logic processing to the total CPU time during running:
(2)
setting a threshold M_const for the maximum memory usage, 0 < M_const ≤ 1;
monitoring memory occupancy with a heartbeat mechanism; if the memory usage share of a task exceeds the threshold M_const, increasing the checkpoint interval; a memory occupancy U_mem is defined, and the next checkpoint interval can be calculated by equation (3):
(3)
U_mem is equal to the ratio of the memory used by the task's normal logic processing to the total memory during running:
(4)
until the occupancy of both CPU and memory falls back below the thresholds, after which the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0;
In the step S2, according to the execution time of the task on the node and the processing data volume of the task, a slow task perception fault tolerance strategy is used to dynamically adjust the check point interval, which specifically comprises the following steps:
when the detected number of slow tasks exceeds the threshold M, opening checkpoints at non-uniform intervals: Δci_1 is used to increase the fixed checkpoint interval CI_0, where Δci_1 = CI_{n-1} × [(N_slow − M)/M] and N_slow denotes the number of slow tasks; each checkpoint interval CI_n keeps increasing at this rate until the number of slow tasks falls below the threshold M;
after the number of slow tasks falls below the threshold M, the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0.
2. The method according to claim 1, wherein predicting the failure rate during application running in S1 comprises: predicting the failure rate fr using linear regression, which specifically includes:
collecting and preprocessing historical data;
selecting suitable features for training the linear regression model, the suitable features including: CPU occupancy, memory occupancy and task execution time;
taking the preprocessed data set as input, performing model training with a linear regression algorithm, and fitting a linear regression model fr = β_0 + β_1·U_cpu + β_2·U_mem + β_3·T_ex + ε, which is used to predict the failure rate, where fr denotes the failure rate, U_cpu the CPU occupancy, U_mem the memory occupancy, T_ex the task execution time, β_i (i = 0, 1, ..., 3) the undetermined parameters, and ε the error term;
and predicting by using the linear regression model obtained through training to give a fault rate predicted value of the current flow computing system.
3. The method according to claim 1, wherein the slow task is judged by a task execution time length and a processing data amount of the task, and the number of the slow tasks is counted, and only the task whose task execution time exceeds a first preset threshold and task processing amount is smaller than a second preset threshold is judged as the slow task.
4. A multi-feature aware stream computing system fault tolerant system, comprising: a multi-feature aware fault tolerant MAFT module and a multi-feature aware checkpoint coordinator;
the MAFT module is used for predicting failure rate in the running process of the application program, monitoring resource occupation condition of tasks on the nodes, sensing execution time of the tasks on the nodes and processing data quantity of the tasks;
the multi-feature-aware checkpoint coordinator is configured to dynamically adjust the checkpoint interval according to the sensed features using the Mf-Stream fault-tolerant strategy, to obtain each adjusted checkpoint interval CI_n; the Mf-Stream fault-tolerant strategy correspondingly comprises: a fault-aware fault-tolerant policy, a resource-aware fault-tolerant policy and a slow-task-aware fault-tolerant policy; and to start the nth checkpoint according to each adjusted checkpoint interval CI_n;
the multi-feature-aware checkpoint coordinator dynamically adjusting the checkpoint interval according to the predicted failure rate by using the fault-aware fault tolerance strategy specifically comprises:
based on the predicted failure rate fr, continuously increasing the checkpoint interval, using Δi_1 to increase the fixed checkpoint interval CI_0, wherein the minimum value of fr is 0.25 and all predicted values below this value are set to 0.25; each checkpoint interval CI_n continues to be increased at this rate until a fault occurs;
when a fault occurs, the application program resumes running from the checkpoint nearest to the fault point, and Δi_n is used to reduce CI_n, where CI_{n-1} represents the previous checkpoint interval; the checkpoint interval continues to be reduced at this rate until it is equal to the minimum checkpoint interval;
the multi-feature-aware checkpoint coordinator dynamically adjusting the checkpoint interval according to the resource occupancy of tasks on the nodes by using the resource-aware fault tolerance strategy specifically comprises:
setting a threshold C_const for the maximum CPU usage time, 0 < C_const ≤ 1;
monitoring the CPU occupancy by using a heartbeat mechanism; if the CPU usage time ratio of a certain task exceeds the threshold C_const, increasing the checkpoint interval, wherein a CPU occupancy rate U_cpu is defined and the next checkpoint interval is calculated by equation (1):
(1)
U_cpu is equal to the ratio of the CPU time occupied by normal logic processing in the task to the total CPU time during running:
(2)
setting a threshold M_const for the maximum memory usage, 0 < M_const ≤ 1;
monitoring the memory occupancy by using a heartbeat mechanism; if the memory usage ratio of a certain task exceeds the threshold M_const, increasing the checkpoint interval, wherein a memory occupancy rate U_mem is defined and the next checkpoint interval is calculated by equation (3):
(3)
U_mem is equal to the ratio of the amount of memory used for normal logic processing in the task to the total amount of memory during running:
(4)
when the occupancy rates of both the CPU and the memory fall below their thresholds, the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0;
the multi-feature-aware checkpoint coordinator dynamically adjusting the checkpoint interval according to the execution time of tasks on the nodes and the amount of data processed by the tasks by using the slow-task-aware fault tolerance strategy specifically comprises:
when it is detected that the number of slow tasks exceeds a threshold M, checkpoints are started at non-uniform intervals, using Δci_1 to increase the fixed checkpoint interval CI_0, where Δci_1 = CI_{n-1} × [(N_slow - M)/M] and N_slow represents the number of slow tasks; each checkpoint interval CI_n continues to be increased at this rate until the number of slow tasks falls below the threshold M;
after the number of slow tasks falls below the threshold M, the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0.
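Again for illustration only (not the patent's implementation), the sketch below mimics the fault-aware behaviour of the checkpoint coordinator: intervals grow while no fault is observed and shrink toward the minimum after a fault. The concrete increment and decrement rules are placeholders for the Δi terms, whose exact formulas are defined earlier in the claims and not reproduced here.

CI_0 = 10.0    # fixed checkpoint interval (seconds), illustrative value
CI_MIN = 1.0   # minimum checkpoint interval, illustrative value

class FaultAwareCoordinator:
    def __init__(self, fr_predicted: float):
        # Claim 4: predicted failure rates below 0.25 are raised to 0.25.
        self.fr = max(fr_predicted, 0.25)
        self.ci = CI_0
        self.shrinking = False

    def next_interval(self) -> float:
        # Before a fault, keep enlarging the interval; after a fault, keep
        # reducing it until the minimum interval is reached.
        if self.shrinking:
            self.ci = max(self.ci - self.fr * CI_0, CI_MIN)   # placeholder decrement
        else:
            self.ci = self.ci + self.fr * CI_0                 # placeholder increment
        return self.ci

    def on_fault(self) -> None:
        # The application restarts from the checkpoint nearest the fault point;
        # from here on the coordinator switches to shrinking intervals.
        self.shrinking = True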
5. The system of claim 4, wherein the MAFT module comprises a MAFT agent and a database, the MAFT agent comprising: a fault perception module, a monitoring module, and a slow task detector;
the fault perception module is used for predicting the failure rate fr by using linear regression;
the monitoring module is used for monitoring the real-time CPU time and the real-time memory amount occupied by tasks on the nodes;
the slow task detector is used for sensing the execution time of a task on the node and the amount of data processed by the task, so as to judge whether the task is a slow task;
the database is used for temporarily storing the predicted failure rate, the monitored CPU time and memory amount, the execution time of the task, and the amount of data processed by the task, and deleting the data after it has been processed.
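As a rough structural sketch only, the Python classes below mirror the components named in claim 5; all class and field names are illustrative, since the claim only names the components and their roles.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Measurement:
    cpu_time: float    # real-time CPU time occupied by the task
    memory: float      # real-time memory amount occupied by the task
    exec_time: float   # execution time of the task
    processed: int     # amount of data the task has processed

@dataclass
class MaftAgent:
    # The "database" of claim 5 is modelled here as a temporary buffer.
    buffer: List[Measurement] = field(default_factory=list)

    def record(self, m: Measurement) -> None:
        self.buffer.append(m)

    def drain(self) -> List[Measurement]:
        # Hand the buffered observations to the checkpoint coordinator and
        # delete them afterwards, as described for the database in claim 5.
        out, self.buffer = self.buffer, []
        return out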
6. An electronic device comprising a processor and a memory having stored therein at least one instruction, wherein the at least one instruction is loaded and executed by the processor to implement the multi-feature aware stream computing system fault tolerance method of any one of claims 1-3.
7. A computer readable storage medium having stored therein at least one instruction, wherein the at least one instruction is loaded and executed by a processor to implement the multi-feature aware stream computing system fault tolerance method of any one of claims 1-3.
CN202310598274.4A 2023-05-25 2023-05-25 Multi-feature-aware stream computing system fault tolerance method and system Active CN116361060B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310598274.4A CN116361060B (en) 2023-05-25 2023-05-25 Multi-feature-aware stream computing system fault tolerance method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310598274.4A CN116361060B (en) 2023-05-25 2023-05-25 Multi-feature-aware stream computing system fault tolerance method and system

Publications (2)

Publication Number Publication Date
CN116361060A CN116361060A (en) 2023-06-30
CN116361060B (en) 2023-09-15

Family

ID=86939416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310598274.4A Active CN116361060B (en) 2023-05-25 2023-05-25 Multi-feature-aware stream computing system fault tolerance method and system

Country Status (1)

Country Link
CN (1) CN116361060B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104331347A (en) * 2014-11-25 2015-02-04 中国人民解放军国防科学技术大学 Variable error rate-oriented check point interval real-time determining method
CN109344009A (en) * 2018-10-11 2019-02-15 重庆邮电大学 Mobile cloud system fault-tolerance approach based on classification checkpoint
CN111124720A (en) * 2019-12-26 2020-05-08 江南大学 Self-adaptive check point interval dynamic setting method
CN111258824A (en) * 2020-01-18 2020-06-09 重庆邮电大学 Increment check point fault tolerance method based on artificial potential field in cloud computing
CN111682981A (en) * 2020-06-02 2020-09-18 深圳大学 Check point interval setting method and device based on cloud platform performance
CN112445635A (en) * 2019-09-04 2021-03-05 无锡江南计算技术研究所 Data-driven adaptive checkpoint optimization method
CN116069468A (en) * 2022-12-30 2023-05-05 三星(中国)半导体有限公司 Checkpoint adjustment method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070220327A1 (en) * 2006-02-23 2007-09-20 Evergrid, Inc., A Delaware Corporation Dynamically Controlled Checkpoint Timing
US11641395B2 (en) * 2019-07-31 2023-05-02 Stratus Technologies Ireland Ltd. Fault tolerant systems and methods incorporating a minimum checkpoint interval

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
He Zhongzheng et al. Schedulability of fault-tolerant real-time systems based on checkpoint interval optimization. Journal of Jilin University, 2014, Vol. 44, No. 2, pp. 433-439. *

Also Published As

Publication number Publication date
CN116361060A (en) 2023-06-30

Similar Documents

Publication Publication Date Title
Chtepen et al. Adaptive task checkpointing and replication: Toward efficient fault-tolerant grids
JP4170988B2 (en) Risk prediction / avoidance method, system, program, and recording medium for execution environment
US7890297B2 (en) Predictive monitoring method and system
US8949642B2 (en) Method for dynamically distributing one or more services in a network comprising of a plurality of computers by deriving a resource capacity required based on a past chronological progression of a resource demand
CN107562512B (en) Method, device and system for migrating virtual machine
Heinze et al. An adaptive replication scheme for elastic data stream processing systems
US11886919B2 (en) Directing queries to nodes of a cluster of a container orchestration platform distributed across a host system and a hardware accelerator of the host system
CN112799817A (en) Micro-service resource scheduling system and method
US20220414503A1 (en) Slo-aware artificial intelligence inference scheduler for heterogeneous processors in edge platforms
CN111880906A (en) Virtual machine high-availability management method, system and storage medium
WO2020248227A1 (en) Load prediction-based hadoop computing task speculative execution method
US7130770B2 (en) Monitoring method and system with corrective actions having dynamic intensities
WO2018024076A1 (en) Flow velocity control method and device
US11966273B2 (en) Throughput-optimized, quality-of-service aware power capping system
CN109522100B (en) Real-time computing task adjusting method and device
Rood et al. Resource availability prediction for improved grid scheduling
Lassettre et al. Dynamic surge protection: An approach to handling unexpected workload surges with resource actions that have lead times
CN116361060B (en) Multi-feature-aware stream computing system fault tolerance method and system
WO2022247219A1 (en) Information backup method, device, and platform
CN111274111B (en) Prediction and anti-aging method for microservice aging
CN112559287A (en) Method and device for optimizing task flow in data
Okamura et al. Optimization of opportunity-based software rejuvenation policy
CN115470006B (en) Load balancing method based on microkernel
CN115858155A (en) Dynamic capacity expansion and contraction method and device for application resources of computing power network platform
Amin et al. Using automated control charts for the runtime evaluation of qos attributes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant