CN116361060B - Multi-feature-aware stream computing system fault tolerance method and system - Google Patents
- Publication number
- CN116361060B (application CN202310598274.4A)
- Authority
- CN
- China
- Prior art keywords
- task
- fault
- checkpoint
- interval
- cpu
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F11/0709—Error or fault processing not based on redundancy, the processing taking place in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
- G06F11/0751—Error or fault detection not based on redundancy
- G06F11/0766—Error or fault reporting or storing
- G06F11/3006—Monitoring arrangements specially adapted to a distributed computing system, e.g. networked systems, clusters, multiprocessor systems
- G06F11/302—Monitoring arrangements where the monitored component is a software system
- G06F11/3024—Monitoring arrangements where the monitored component is a central processing unit [CPU]
- G06F11/3037—Monitoring arrangements where the monitored component is a memory, e.g. virtual memory, cache
- G06F11/3093—Configuration details of monitoring probes, e.g. installation, enabling, spatial arrangement
- G06F11/3423—Performance assessment by assessing time, where the assessed time is active or idle time
- G06F11/3452—Performance evaluation by statistical analysis
- G06F11/3476—Data logging
- G06F2201/865—Monitoring of software
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to the technical field of stream computing system fault tolerance, and in particular to a multi-feature-aware fault tolerance method and system for stream computing systems, comprising the following steps. S1: perform multi-feature sensing, which includes predicting the failure rate during application operation, monitoring the resource occupancy of tasks on nodes, and sensing the execution time and processed data volume of tasks on nodes. S2: dynamically adjust the checkpoint interval using the Mf-Stream fault-tolerance strategy according to the sensed features, obtaining each adjusted checkpoint interval CI_n; the Mf-Stream strategy correspondingly comprises a fault-aware fault-tolerant policy, a resource-aware fault-tolerant policy, and a slow-task-aware fault-tolerant policy. S3: start the nth checkpoint according to the adjusted checkpoint interval CI_n. The invention can reduce the time spent storing checkpoint data, reduce system recovery delay, lower CPU and memory occupancy, and shorten task execution time.
Description
Technical Field
The invention relates to the technical field of fault tolerance of a stream computing system, in particular to a multi-feature-aware stream computing system fault tolerance method and system.
Background
Stream computing systems can process data streams in real time. However, because faults are common, the fault tolerance of stream computing systems has become a research hotspot. Shorter checkpoint intervals increase checkpoint overhead, whereas longer checkpoint intervals increase failure recovery time. Setting an appropriate checkpoint interval is therefore important for the efficient operation of streaming applications.
Disclosure of Invention
The invention provides a multi-feature-aware fault tolerance method and system for stream computing systems, which adjust fault-tolerance behavior according to multiple sensed features. The technical scheme is as follows:
in one aspect, a multi-feature aware stream computing system fault tolerance method is provided, comprising:
S1: perform multi-feature sensing, which includes predicting the failure rate during application operation, monitoring the resource occupancy of tasks on nodes, and sensing the execution time and processed data volume of tasks on nodes;
S2: dynamically adjust the checkpoint interval using the Mf-Stream fault-tolerance strategy according to the sensed features, obtaining each adjusted checkpoint interval CI_n; the Mf-Stream strategy correspondingly comprises a fault-aware fault-tolerant policy, a resource-aware fault-tolerant policy, and a slow-task-aware fault-tolerant policy;
S3: start the nth checkpoint according to each adjusted checkpoint interval CI_n.
Optionally, predicting the failure rate during application operation in S1 comprises predicting the failure rate fr using linear regression, specifically:
collecting and preprocessing historical data;
selecting suitable features for training the linear regression model, the suitable features including CPU occupancy rate, memory occupancy rate, and task execution time;
taking the preprocessed data set as input, training the model with a linear regression algorithm, and fitting the linear regression model fr = β0 + β1·U_cpu + β2·U_mem + β3·T_ex + ε, which is used to predict the failure rate, where fr denotes the failure rate, U_cpu the CPU occupancy rate, U_mem the memory occupancy rate, T_ex the task execution time, β_i (i = 0, 1, 2, 3) the parameters to be estimated, and ε the error term;
predicting with the trained linear regression model to give a failure-rate prediction for the current stream computing system.
Optionally, in S2 the checkpoint interval is dynamically adjusted using the fault-aware fault-tolerant policy according to the predicted failure rate, specifically:
based on the predicted failure rate fr, continuously increase the checkpoint interval, using Δi_1 to increase the fixed checkpoint interval CI_0; the minimum value of fr is 0.25 (all predicted values below this are set to 0.25), and this rate of increase is maintained for each subsequent checkpoint interval CI_n until a fault occurs;
when a fault occurs, the application restarts from the checkpoint nearest the fault point and uses Δi_n to reduce the interval, where CI_{n-1} denotes the previous checkpoint interval; this rate of decrease is maintained until the interval equals the minimum checkpoint interval.
Optionally, in S2 the checkpoint interval is dynamically adjusted using the resource-aware fault-tolerant policy according to the resource occupancy of tasks on nodes, specifically:
set a threshold C_const for the maximum CPU usage time, 0 < C_const ≤ 1;
monitor CPU occupancy using a heartbeat mechanism; if the CPU usage-time ratio of a task exceeds the threshold C_const, increase the checkpoint interval; define the CPU occupancy rate as U_cpu, and the next checkpoint interval is calculated by equation (1);
U_cpu is equal to the ratio of the CPU time occupied by normal logic processing in the task to the total CPU time during running:
U_cpu = (CPU time used by normal logic processing) / (total CPU time)  (2)
set a threshold M_const for the maximum memory usage, 0 < M_const ≤ 1;
monitor memory occupancy using the heartbeat mechanism; if the memory occupancy rate of a task exceeds the threshold M_const, increase the checkpoint interval; define the memory occupancy rate as U_mem, and the next checkpoint interval is calculated by equation (3);
U_mem is equal to the ratio of the memory used for normal logic processing in the task to the total memory during running:
U_mem = (memory used by normal logic processing) / (total memory)  (4)
when the occupancy of both CPU and memory falls back below the thresholds, the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0.
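A hedged sketch of the resource-aware adjustment described above follows. Equations (1) and (3) appear as images in the original document, so the growth rule used here (scaling the interval by how far occupancy exceeds its threshold) and the default threshold values are illustrative assumptions, not the patent's formulas:

```python
# Hedged sketch of the resource-aware fault-tolerant policy. The growth rule
# and the default thresholds are assumptions for illustration only; the
# patent's equations (1) and (3) are not reproduced in the source text.
def adjust_for_resources(ci_prev, ci_fixed, u_cpu, u_mem,
                         c_const=0.8, m_const=0.8):
    """Return the next checkpoint interval given CPU/memory occupancy.

    ci_prev  -- previous checkpoint interval CI_{n-1}
    ci_fixed -- fixed checkpoint interval CI_0
    u_cpu    -- CPU occupancy rate U_cpu in [0, 1]
    u_mem    -- memory occupancy rate U_mem in [0, 1]
    """
    if u_cpu <= c_const and u_mem <= m_const:
        # Both occupancies are back under their thresholds: reset to CI_0.
        return ci_fixed
    # Assumed rule: grow the interval by the worst threshold excess.
    overload = max(u_cpu - c_const, u_mem - m_const)
    return ci_prev * (1 + overload)
```

The reset branch follows the patent text directly (return to CI_0 once both occupancies drop below threshold); only the growth branch is an assumption.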
Optionally, slow tasks are judged by task execution time and the amount of data a task processes, and the number of slow tasks is counted; only a task whose execution time exceeds a first preset threshold and whose processing volume is below a second preset threshold is judged to be a slow task.
Optionally, in S2 the checkpoint interval is dynamically adjusted using the slow-task-aware fault-tolerant policy according to the execution time and processed data volume of tasks on nodes, specifically:
when the detected number of slow tasks exceeds a threshold M, checkpoints are opened at non-uniform intervals, using Δci_1 to increase the fixed checkpoint interval CI_0, where Δci_1 = CI_{n-1} × [(N_slow − M)/M] and N_slow denotes the number of slow tasks; this rate of increase is maintained for each checkpoint interval CI_n until the number of slow tasks falls below the threshold M;
when the number of slow tasks falls to the threshold M, the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0.
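The slow-task-aware adjustment above, using the stated increment Δci = CI_{n-1} × [(N_slow − M)/M], can be sketched as follows (function and parameter names are illustrative; the patent does not prescribe an implementation):

```python
# Sketch of the slow-task-aware policy using the increment given in the text:
# delta = CI_{n-1} * (N_slow - M) / M.
def slow_task_interval(ci_prev, ci_fixed, n_slow, m_threshold):
    """Return the next checkpoint interval given the slow-task count.

    ci_prev     -- previous checkpoint interval CI_{n-1}
    ci_fixed    -- fixed checkpoint interval CI_0
    n_slow      -- current number of slow tasks N_slow
    m_threshold -- slow-task threshold M
    """
    if n_slow <= m_threshold:
        # Slow-task count back at or below M: reset to the fixed interval.
        return ci_fixed
    delta = ci_prev * (n_slow - m_threshold) / m_threshold
    return ci_prev + delta
```

For example, with CI_{n-1} = 10, M = 2, and N_slow = 4, the increment is 10 × (4 − 2)/2 = 10, giving a next interval of 20.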
In another aspect, there is provided a multi-feature aware stream computing system fault tolerant system comprising: a multi-feature aware fault tolerant MAFT module and a multi-feature aware checkpoint coordinator;
the MAFT module is used to predict the failure rate during application operation, monitor the resource occupancy of tasks on nodes, and sense the execution time and processed data volume of tasks on nodes;
the multi-feature-aware checkpoint coordinator is configured to dynamically adjust the checkpoint interval using the Mf-Stream fault-tolerance strategy according to the sensed features, obtaining each adjusted checkpoint interval CI_n (the Mf-Stream strategy correspondingly comprises the fault-aware, resource-aware, and slow-task-aware fault-tolerant policies), and to start the nth checkpoint according to each adjusted interval CI_n.
Optionally, the MAFT module includes a MAFT agent and a database, the MAFT agent including: the system comprises a fault sensing module, a monitoring module and a slow task detector;
the fault perception module is used to predict the failure rate fr using linear regression;
the monitoring module is used to monitor the real-time CPU time and memory occupied by tasks on the nodes;
the slow task detector is used for sensing the execution time of the task on the node and the processing data volume of the task so as to judge whether the task is a slow task or not;
the database is used to temporarily store the predicted failure rate, the monitored CPU and memory usage, and the execution time and processed data volume of tasks, and deletes the data after it has been processed.
In another aspect, an electronic device is provided that includes a processor and a memory having at least one instruction stored therein that is loaded and executed by the processor to implement the multi-feature aware stream computing system fault tolerance method described above.
In another aspect, a computer readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement the multi-feature aware stream computing system fault tolerance method described above is provided.
Compared with the prior art, the technical scheme has at least the following beneficial effects:
The invention dynamically adjusts the checkpoint interval according to the failure rate, CPU occupancy, memory occupancy, and the execution time and processed data volume of tasks on nodes. Compared with periodic checkpointing, it can reduce the time spent storing checkpoint data, reduce system recovery delay, lower CPU and memory occupancy, and shorten task execution time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a fault tolerance method for a multi-feature aware stream computing system according to an embodiment of the present invention;
FIG. 2 is a diagram of a fault tolerant architecture of a multi-feature aware streaming computing system according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed description of the preferred embodiments
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without creative efforts, based on the described embodiments of the present invention fall within the protection scope of the present invention.
As shown in fig. 1, an embodiment of the present invention provides a multi-feature aware stream computing system fault tolerance method, including:
S1: perform multi-feature sensing, which includes predicting the failure rate during application operation, monitoring the resource occupancy of tasks on nodes, and sensing the execution time and processed data volume of tasks on nodes;
S2: dynamically adjust the checkpoint interval using the Mf-Stream fault-tolerance strategy according to the sensed features, obtaining each adjusted checkpoint interval CI_n; the Mf-Stream strategy correspondingly comprises a fault-aware fault-tolerant policy, a resource-aware fault-tolerant policy, and a slow-task-aware fault-tolerant policy;
S3: start the nth checkpoint according to each adjusted checkpoint interval CI_n.
Referring to fig. 2, which shows the fault-tolerant system architecture of the multi-feature-aware stream computing system provided by an embodiment of the invention, a Multi-Features Aware Fault Tolerance (MAFT) module is added to the system architecture of Flink. The MAFT module is used to predict the failure rate during application operation, monitor the resource occupancy of tasks on nodes, and sense the execution time and processed data volume of tasks on nodes. The module contains a MAFT agent and a database; the MAFT agent comprises a fault perception module, a monitoring module, and a slow task detector. The fault perception module predicts the failure rate fr using linear regression; the monitoring module monitors the real-time CPU time and memory occupied by tasks on the nodes; the slow task detector senses the execution time and processed data volume of tasks on the nodes and judges whether a task is a slow task (only a task whose execution time exceeds a first preset threshold and whose processing volume is below a second preset threshold is judged to be a slow task); the database temporarily stores the predicted failure rate, the monitored CPU and memory usage, and the execution time and processed data volume of tasks, and deletes the data after processing.
In a typical distributed stream computing system, the checkpoint coordinator runs checkpoints periodically, without regard to the distribution of potential failures. The embodiment of the invention modifies the existing checkpoint coordinator into a multi-feature-aware checkpoint coordinator, which dynamically adjusts the checkpoint interval using the Mf-Stream fault-tolerance strategy according to the sensed features, obtaining each adjusted checkpoint interval CI_n (the Mf-Stream strategy correspondingly comprises the fault-aware, resource-aware, and slow-task-aware fault-tolerant policies), and starts the nth checkpoint according to each adjusted interval CI_n.
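A minimal sketch of such a coordinator loop, starting the nth checkpoint after the dynamically adjusted interval CI_n, might look like this (the policy callback stands in for the three Mf-Stream strategies; all names are illustrative, not from the patent):

```python
import time

# Illustrative coordinator loop: wait the current interval, trigger the nth
# checkpoint, then ask the Mf-Stream policy for the next interval CI_{n+1}.
def run_coordinator(trigger_checkpoint, next_interval, ci_fixed,
                    max_checkpoints=10):
    """trigger_checkpoint(n) starts checkpoint n; next_interval(ci) is the
    policy that computes the following interval from the current one."""
    ci = ci_fixed  # CI_0: the fixed initial checkpoint interval
    for n in range(1, max_checkpoints + 1):
        time.sleep(ci)           # wait the current checkpoint interval CI_n
        trigger_checkpoint(n)    # start the nth checkpoint
        ci = next_interval(ci)   # policy decides the next interval
```

In a real system the loop would run until shutdown and the policy callback would consult the sensed features (fr, U_cpu, U_mem, slow-task count) rather than a fixed rule.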
The multi-feature-aware stream computing system fault tolerance method of the embodiment of the invention comprises the following steps:
S1: perform multi-feature sensing, which includes predicting the failure rate during application operation, monitoring the resource occupancy of tasks on nodes, and sensing the execution time and processed data volume of tasks on nodes;
Optionally, predicting the failure rate during application operation in S1 comprises predicting the failure rate fr using linear regression, specifically:
collecting and preprocessing historical data;
selecting suitable features for training the linear regression model, the suitable features including CPU occupancy rate, memory occupancy rate, and task execution time;
taking the preprocessed data set as input, training the model with a linear regression algorithm, and fitting the linear regression model fr = β0 + β1·U_cpu + β2·U_mem + β3·T_ex + ε, which is used to predict the failure rate, where fr denotes the failure rate, U_cpu the CPU occupancy rate, U_mem the memory occupancy rate, T_ex the task execution time, β_i (i = 0, 1, 2, 3) the parameters to be estimated, and ε the error term;
and predicting by using the linear regression model obtained through training to give a fault rate predicted value of the current flow computing system.
Specifically:
1. Collecting data: collect historical failure-rate data, including information about system hardware, software, network, and configuration, and record it in a data set (the failure data used in the embodiment of the invention comes from a publicly available repository, the Failure Trace Archive);
2. Data preprocessing: clean, convert, and normalize the failure-rate history in the data set so that it can be used to train the linear regression model;
3. Feature selection: select suitable features for training the linear regression model. Feature selection considers how the failure rate relates to the system's attributes and configuration; the finally selected features are CPU occupancy rate, memory occupancy rate, and task execution time. If the resource occupancy of tasks on a node is too high, it may cause a performance bottleneck; if task execution time is too long, the system may experience delays or blocking during execution, affecting performance;
4. Model training: taking the preprocessed data set as input, train the model using a linear regression algorithm and fit the linear regression model fr = β0 + β1·U_cpu + β2·U_mem + β3·T_ex + ε, which is used to predict the failure rate, where fr denotes the failure rate, U_cpu the CPU occupancy rate, U_mem the memory occupancy rate, T_ex the task execution time, β_i (i = 0, 1, 2, 3) the regression coefficients, and ε the error term;
5. Model evaluation: verify the accuracy and stability of the trained model using the test data set;
6. Model application: predict with the trained linear regression model to give a failure-rate prediction for the current distributed stream computing system.
In feature selection and model training, classical machine learning algorithms such as support vector machines (SVM), logistic regression, and principal component analysis (PCA) may be used to improve model accuracy and performance.
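As an illustration of steps 4 and 6 above, a minimal ordinary-least-squares fit of the model fr = β0 + β1·U_cpu + β2·U_mem + β3·T_ex can be sketched in plain Python. The data, coefficient values, and function names here are invented for illustration; the patent does not prescribe an implementation:

```python
# Minimal OLS fit of fr = b0 + b1*U_cpu + b2*U_mem + b3*T_ex via the normal
# equations (X^T X) beta = X^T y, solved with Gaussian elimination.
def fit_ols(X, y):
    """X is a list of rows [1, u_cpu, u_mem, t_ex]; y the observed fr values."""
    n, p = len(X), len(X[0])
    A = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
         for i in range(p)]                      # A = X^T X
    b = [sum(X[k][i] * y[k] for k in range(n)) for i in range(p)]  # X^T y
    for col in range(p):                         # elimination w/ pivoting
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):               # back substitution
        beta[r] = (b[r] - sum(A[r][c] * beta[c]
                              for c in range(r + 1, p))) / A[r][r]
    return beta

def predict_fr(beta, u_cpu, u_mem, t_ex):
    """Apply the fitted model; fr is a rate, so clamp to [0, 1]."""
    fr = beta[0] + beta[1] * u_cpu + beta[2] * u_mem + beta[3] * t_ex
    return min(max(fr, 0.0), 1.0)
```

In practice a library (e.g. scikit-learn's linear regression) would replace the hand-rolled solver; the sketch only shows the shape of the training and prediction steps.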
Monitoring the resource occupancy of tasks on a node mainly means monitoring the CPU and memory occupied by each task on the node (Linux provides cgroups, which can allocate independent CPU and memory to tasks, so the CPU and memory occupied by each task can be monitored). Specifically, a monitoring module can be provided that uses a heartbeat mechanism to monitor CPU and memory occupancy.
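For the cgroup-based monitoring mentioned above, a sketch of reading a task's CPU usage from a cgroup v2 `cpu.stat` file might look as follows. The cgroup path, file layout assumptions, and helper names are illustrative; the patent only states that cgroups make per-task monitoring possible:

```python
# Illustrative helper for cgroup-based CPU monitoring. Assumes cgroup v2,
# where <cgroup>/cpu.stat contains a "usage_usec <n>" line with the total
# CPU time consumed by the cgroup, in microseconds.
def parse_cpu_usage_usec(cpu_stat_text):
    """Extract usage_usec from the contents of a cgroup v2 cpu.stat file."""
    for line in cpu_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "usage_usec":
            return int(value)
    raise ValueError("usage_usec not found in cpu.stat contents")

def read_task_cpu_usage(cgroup_path="/sys/fs/cgroup/mytask"):
    """Read the CPU time consumed by the task's cgroup (path is an assumption)."""
    with open(f"{cgroup_path}/cpu.stat") as f:
        return parse_cpu_usage_usec(f.read())
```

A heartbeat-driven monitor would sample this value periodically and divide the delta by wall-clock time to obtain the CPU usage-time ratio compared against C_const.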
The execution time of tasks on a node and the amount of data a task processes are detected by extending a slow task detector in Flink.
S2: dynamically adjust the checkpoint interval using the Mf-Stream fault-tolerance strategy according to the sensed features, obtaining each adjusted checkpoint interval CI_n; the Mf-Stream strategy correspondingly comprises the fault-aware fault-tolerant policy, the resource-aware fault-tolerant policy, and the slow-task-aware fault-tolerant policy;
Optionally, in S2 the checkpoint interval is dynamically adjusted using the fault-aware fault-tolerant policy according to the predicted failure rate, specifically:
based on the predicted failure rate fr, continuously enlarge CI_0, using Δi_1 to increase CI_0; the minimum value of fr is 0.25 (all predicted values below this are set to 0.25), and this rate of increase is maintained for each checkpoint interval CI_n until a fault occurs;
when a fault occurs, the application restarts from the checkpoint nearest the fault point and uses Δi_n to reduce CI_n, where CI_{n-1} denotes the previous checkpoint interval; this rate of decrease is maintained until the interval equals the minimum checkpoint interval.
The fault-aware fault tolerance strategy is described in detail below in conjunction with algorithm 1:
The multi-feature-aware checkpoint coordinator of the embodiment of the invention can adapt checkpointing according to the predicted failure rate. If the failure rate is small, it dynamically increases the checkpoint interval; conversely, it dynamically reduces the checkpoint interval. Specifically, the multi-feature-aware checkpoint coordinator starts the program running and initiates the first checkpoint at the fixed checkpoint interval CI_0. Subsequently, it optimistically assumes that no fault will occur in the near future, so it begins to increase the checkpoint interval continuously based on the predicted failure rate fr. This yields monotonically increasing checkpoint intervals, i.e., CI_1 < CI_2 < … < CI_n. The coordinator keeps increasing the checkpoint interval until a fault occurs. Once a fault occurs, it begins to reduce the checkpoint interval in order to shorten the fault recovery time. Studies of actual fault data and predictions of operational faults show that the probability of a subsequent fault is very high shortly after the last fault; therefore, to mitigate possibly correlated failures, the multi-feature-aware checkpoint coordinator decreases rather than increases the checkpoint interval after a failure.
The value of fr plays an important role in determining the change of the checkpoint interval and the checkpoint frequency. fr ranges from 0 to 1, where 0 represents no faults and 1 represents a high failure rate. When fr = 1, fault-aware fault tolerance behaves like the fixed-checkpoint-interval model, since the failure rate is high and any increase of the checkpoint interval may increase the failure recovery time; in this case the checkpoint interval is gradually reduced. On the other hand, fr = 0 characterizes a low failure rate, in which case the checkpoint interval is increased to improve application efficiency: when the probability of failure is low, widening the checkpoint interval improves utilization and reduces checkpoint cost. However, with fr = 0 the checkpoint interval would keep doubling, which is highly susceptible to higher fault recovery overhead. To limit an exponential increase of the checkpoint interval, the multi-feature-aware fault-tolerant agent sets the minimum value of fr to 0.25 and raises all smaller values to 0.25.
Algorithm 1 describes the detailed algorithm of the fault-aware fault tolerance strategy:
Input: predicted failure probability fr, periodic checkpoint interval CI_0
Output: failure-aware checkpoints
1. if fr < 0.25 then
2.   set fr = 0.25
3. end if
4. while application not finished do
5.   use periodic checkpoint interval CI_0
6.   if no failure then
7.     Non-uniform-Interval() {
8.       calculate Δi_n
9.       next checkpoint interval CI_n = CI_{n-1} + Δi_n
10.      triggering checkpoint time t_n = CI_n + CI_{n-1}; }
11.  end if
12.  if failure occurs then
13.    restart execution from last checkpoint
14.    Non-uniform-Interval() {
15.      calculate Δi_n
16.      next checkpoint interval CI_n = CI_{n-1} − Δi_n
17.      triggering checkpoint time t_n = CI_n + CI_{n-1}
18.    until CI_n = CI_min }
19.  end if
20. end while
Let t_0 denote the periodic checkpoint time. The algorithm runs the application with a fixed checkpoint interval CI_0 and starts periodic checkpoints at time t_0. It then optimistically assumes that no failure will occur in the near future, and therefore begins to increase the checkpoint interval in order to initiate checkpoints less often. It increases CI_0 by Δi_1 and keeps increasing each checkpoint interval CI_n at this rate. In this way, it iteratively widens the checkpoint interval and initiates checkpoints as the interval grows. Once the size of the checkpoint interval is calculated, the algorithm gradually delays the checkpoint start time as the interval increases: the n-th checkpoint is started at time t_n = CI_n + CI_{n-1}. With this approach it dynamically widens the checkpoint interval and reduces the frequency of checkpoint starts. The checkpoint interval grows gradually until a fault occurs. When a fault occurs, the application restarts from the checkpoint nearest the fault point and reduces CI_n by Δi_n. The interval keeps shrinking at this rate in order to tolerate potentially correlated faults; the algorithm iteratively reduces the checkpoint intervals and starts the n-th checkpoint according to each adjusted interval CI_n. In this way it dynamically reduces the checkpoint interval, shortening the fault recovery time. The checkpoint interval decreases gradually until it equals the minimum checkpoint interval (a preset value close to 0).
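The grow/shrink loop of Algorithm 1 can be sketched as follows. The patent's exact expression for Δi_n is not reproduced in this text, so this sketch assumes Δi_n = CI_{n-1}·(1 − fr) for growth and Δi_n = CI_{n-1}·fr for shrinkage purely as illustrative placeholders; only the clamping of fr to 0.25 and the grow-until-failure / shrink-after-failure structure come from the patent.

```python
FR_FLOOR = 0.25   # the patent raises all failure-rate values below 0.25 to 0.25

def next_interval(ci_prev: float, fr: float, failed: bool, ci_min: float) -> float:
    """Compute the next checkpoint interval CI_n from CI_{n-1}.

    failed=True models the branch taken right after a fault, where the
    interval is shrunk to tolerate correlated follow-up faults; otherwise
    the interval grows optimistically.
    """
    fr = max(fr, FR_FLOOR)
    if failed:
        # shrink: Δi_n = CI_{n-1} * fr (assumed form), never below CI_min
        return max(ci_prev - ci_prev * fr, ci_min)
    # grow: Δi_n = CI_{n-1} * (1 - fr) (assumed form); low fr -> faster growth
    return ci_prev + ci_prev * (1 - fr)
```

With CI_0 = 10 and fr = 0.25, repeated calls without failures produce the monotone sequence CI_1 < CI_2 < … described above; a failure flips the loop into the shrinking branch.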
Optionally, in the step S2, a resource-aware fault-tolerant policy is used to dynamically adjust a checkpoint interval according to a resource occupation condition of a task on a node, which specifically includes:
setting a threshold value for maximum usage time of CPUC const ,0<C const ≤1;
monitoring CPU occupation using the heartbeat mechanism; if the CPU usage-time ratio of a certain task exceeds the threshold C_const, the checkpoint interval is increased; a CPU occupancy rate U_cpu is defined, and the next checkpoint interval is calculated by equation (1):
(1)
U_cpu equals the ratio of the CPU time occupied by the task's normal logic processing to the total CPU time during running:
(2);
setting a threshold value for the maximum usage time of the memoryM const ,0<M const ≤1;
monitoring memory occupancy using the heartbeat mechanism; if the memory usage ratio of a certain task exceeds the threshold M_const, the checkpoint interval is increased; a memory occupancy rate U_mem is defined, and the next checkpoint interval can be calculated by equation (3):
(3)
U_mem equals the ratio of the memory used by the task's normal logic processing to the total memory during running:
(4);
until the occupancy of both CPU and memory falls below the thresholds, after which the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0.
The resource-aware fault tolerance policy is described in detail below in conjunction with algorithm 2:
First, consider the impact of CPU resources on the checkpoint interval. A threshold C_const, 0 < C_const ≤ 1, is set for the maximum CPU usage time; C_const defines an upper limit on the CPU-time share used by each task. Exceeding the threshold triggers the system to reset the checkpoint interval. The new checkpoint interval must be chosen with care, because reducing the interval shortens the failure recovery time but increases the checkpoint overhead, whereas increasing the interval reduces the checkpoint overhead but lengthens the failure recovery time. If the CPU usage-time share of the node exceeds the threshold C_const, the embodiment of the invention increases the checkpoint interval to reduce the CPU consumed by checkpoint operations, devoting the whole CPU to task execution so that tasks complete faster. A CPU occupancy rate U_cpu is defined, and the next checkpoint interval can be calculated by equation (1):
(1)
U_cpu equals the ratio of the CPU time occupied by the task's normal logic processing to the total CPU time during running:
(2)
Then, consider the effect of memory on the checkpoint interval. A threshold M_const, 0 < M_const ≤ 1, is set for the maximum memory usage; M_const defines an upper limit on the memory share used by each task. Exceeding the threshold triggers the system to reset the checkpoint interval. If the memory usage ratio exceeds the threshold M_const, the checkpoint interval is increased to reduce the memory consumed by checkpointing, devoting the whole memory to running the task so that it runs faster and completes successfully. A memory occupancy rate U_mem is defined, and the next checkpoint interval can be calculated by equation (3):
(3)
U_mem equals the ratio of the memory used by the task's normal logic processing to the total memory during running:
(4)
algorithm 2 gives a detailed algorithm of the resource-aware fault tolerance policy:
Input: CPU occupancy rate U_cpu, initialized checkpoint interval CI_0, memory occupancy rate U_mem, CPU occupancy threshold C_const, memory occupancy threshold M_const
Output: resource-aware checkpoints
1. while application not finished do
2.   use initialized checkpoint interval CI_0
3.   if U_cpu > C_const or U_mem > M_const then
4.     Non-uniform-Interval() {
5.       next checkpoint interval by equation (1) or (3)
6.       triggering checkpoint time t_n = CI_n + CI_{n-1}; }
7.   end if
8.   if U_cpu ≤ C_const and U_mem ≤ M_const then
9.     recover checkpoint interval to CI_0
10.  end if
11. end while
The algorithm first runs the application with a constant checkpoint interval CI_0 and starts periodic checkpoints at time t_0. When the monitoring module detects that the CPU occupancy exceeds the threshold C_const or the memory occupancy exceeds M_const, checkpoints begin to be opened at non-uniform intervals. A new checkpoint interval is calculated using equation (1) or (3). In this way the interval is iteratively increased, and checkpoints are initiated as the interval grows. Once the size of the checkpoint interval is calculated, the algorithm continually delays the checkpoint start time as the interval increases: the n-th checkpoint is started at time t_n = CI_n + CI_{n-1}. By this method it dynamically widens the checkpoint interval while also reducing the frequency with which checkpoints are opened. The checkpoint interval is increased gradually until both resource occupancies fall below their thresholds, after which the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0.
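The threshold-and-reset structure of Algorithm 2 can be sketched as below. Equations (1) and (3) are not reproduced in this text, so grow_interval is a stand-in that merely scales the previous interval by the observed occupancy; the threshold values are likewise assumptions for illustration.

```python
C_CONST = 0.8   # assumed CPU occupancy threshold, 0 < C_const <= 1
M_CONST = 0.8   # assumed memory occupancy threshold, 0 < M_const <= 1

def adjust_interval(ci_prev: float, ci_0: float, u_cpu: float, u_mem: float) -> float:
    """One iteration of the resource-aware policy.

    Widens the interval while either resource is over its threshold;
    resets to the fixed interval CI_0 once both are back below.
    """
    def grow_interval(ci: float, occupancy: float) -> float:
        # placeholder for the patent's equations (1)/(3): higher occupancy,
        # larger growth, so checkpointing yields more resources to the task
        return ci * (1.0 + occupancy)

    if u_cpu > C_CONST or u_mem > M_CONST:
        return grow_interval(ci_prev, max(u_cpu, u_mem))
    return ci_0  # both resources below threshold: restore CI_0
```

Called once per heartbeat with fresh U_cpu/U_mem samples, this reproduces the "grow while overloaded, reset when recovered" behavior described above.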
Optionally, slow tasks are judged by the task execution duration and the amount of data processed by the task, and the number of slow tasks is counted; only a task whose execution time exceeds a first preset threshold and whose processed data volume is smaller than a second preset threshold is judged to be a slow task.
In production, hot-spot machines cannot be avoided: intensive data backfilling and co-located (mixed-deployment) clusters can leave a particular machine heavily loaded with busy input and output. Data processing tasks running on it may be extremely slow, making it difficult to guarantee the job completion time. Abnormal machine nodes exhibit problems such as hardware anomalies, sporadic I/O congestion, and high CPU load. These problems cause tasks placed on them to run much more slowly than tasks on other nodes, thereby extending the running time of the entire job.
By extending the slow-task detector in Flink, the embodiment of the invention can also detect the execution time of stream computing tasks and the amount of data they process; a task with a long execution time (exceeding the first preset threshold) and little processed data (below the second preset threshold) is identified as a slow task. The embodiment of the invention reduces the checkpoint overhead by reducing the time slow tasks spend executing checkpoint operations and, more importantly, reduces the resource occupancy and the task execution time.
Algorithm 3 gives a detailed algorithm for slow task assessment:
Input: total number of tasks m, task list taskList, historical task execution times execution[task_i][time_pre], task execution duration duration[task_i], processed data volume of tasks procVol[task_i], total number of slow tasks N_slow
Output: number of slow tasks
1. for task_i = 1 → m do
2.   duration[task_i] = computeTaskExecutionDuration(execution[task_i][])
3. end for
4. tasksList1 = sortTasks(m, duration[])
5. for task_i = 1 → m do
6.   procVol[task_i] = computeProcessVolume(execution[task_i][])
7. end for
8. tasksList2 = sortTasks(m, procVol[])
9. slowTasksList = selectSlowTasks(s, tasksList1, tasksList2)
10. N_slow = Count(slowTasksList)
11. return N_slow
Algorithm 3 judges slow tasks according to the task execution time and the amount of data processed, and counts the number of slow tasks; only a task whose execution time exceeds the first preset threshold and whose processed data volume is below the second preset threshold is judged to be a slow task.
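The slow-task test itself reduces to a conjunction of the two criteria just stated. A minimal sketch (the threshold values in the example are illustrative, not values from the patent):

```python
from typing import Sequence

def count_slow_tasks(
    durations: Sequence[float],
    volumes: Sequence[float],
    time_threshold: float,
    volume_threshold: float,
) -> int:
    """Count tasks that are slow per Algorithm 3's criterion.

    A task i is slow only if durations[i] exceeds the first preset
    threshold AND volumes[i] (its processed data volume) is below the
    second preset threshold.
    """
    return sum(
        1
        for d, v in zip(durations, volumes)
        if d > time_threshold and v < volume_threshold
    )

# Example: task 1 ran 20 s but processed only 10 units -> slow;
# task 2 ran 30 s but processed 500 units -> not slow (busy, not stuck).
n_slow = count_slow_tasks([5, 20, 30], [100, 10, 500], 15, 50)
```

The AND of the two conditions is what distinguishes a genuinely stuck task from one that is merely busy processing a large volume of data.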
Optionally, in the step S2, a slow task aware fault tolerance policy is used according to the execution duration of the task on the node and the processing data amount of the task, so as to dynamically adjust the checkpoint interval, which specifically includes:
when the number of slow tasks is detected to exceed a threshold M, checkpoints begin to be opened at non-uniform intervals: the fixed checkpoint interval CI_0 is increased using Δci_1, where Δci_1 = CI_{n-1} × [(N_slow − M)/M] and N_slow denotes the number of slow tasks; each checkpoint interval CI_n continues to be increased at this rate until the number of slow tasks falls below the threshold M;
after the number of slow tasks falls below the threshold M, the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0.
The slow task aware fault tolerance strategy is described in detail below in conjunction with algorithm 4:
Embodiments of the present invention use a slow-task detector to detect the number of slow tasks in the streaming computing system, denoted N_slow. If N_slow exceeds a set threshold M, the checkpoint interval is increased, reducing the time slow tasks spend on checkpointing so that they can concentrate on executing their work. When the slow-task detector detects that N_slow has fallen back to a normal value, the fixed checkpoint interval is restored.
Algorithm 4 gives a detailed algorithm for the slow task aware fault tolerance strategy:
Input: slow-task number N_slow, initialized checkpoint interval CI_0
Output: slow-task-aware checkpoints
1. while application not finished do
2.   use initialized checkpoint interval CI_0
3.   if N_slow > M then
4.     Non-uniform-Interval() {
5.       calculate Δci_n = CI_{n-1} × [(N_slow − M)/M]
6.       next checkpoint interval CI_n = CI_{n-1} + Δci_n
7.       triggering checkpoint time t_n = CI_n + CI_{n-1}; }
8.   end if
9.   if N_slow ≤ M then
10.    recover checkpoint interval to CI_0
11.  end if
12. end while
The algorithm first runs the application with a fixed checkpoint interval CI_0 and starts periodic checkpoints at time t_0. When the slow-task detector detects that the number of slow tasks exceeds the threshold M, checkpoints begin to be opened at non-uniform intervals. CI_0 is increased by Δci_1, where Δci_1 = CI_{n-1} × [(N_slow − M)/M], and each checkpoint interval CI_n keeps increasing at this rate; the algorithm thus iteratively widens the checkpoint interval and initiates checkpoints as the interval grows. Once the size of the checkpoint interval is calculated, the algorithm continually delays the checkpoint start time as the interval increases: the n-th checkpoint is started at time t_n = CI_n + CI_{n-1}. By this method it dynamically widens the checkpoint interval while also reducing the frequency with which checkpoints are opened. The interval is increased gradually until the number of slow tasks falls below the threshold, after which the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0.
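Unlike Algorithms 1 and 2, the update formula here is stated explicitly in the text, so the per-iteration step of Algorithm 4 can be sketched directly (function and parameter names are ours; M and CI_0 are configuration values):

```python
def slow_task_interval(ci_prev: float, ci_0: float, n_slow: int, m: int) -> float:
    """One iteration of the slow-task-aware policy (Algorithm 4).

    While N_slow > M, grows the interval by the patent's formula
    Δci_n = CI_{n-1} * (N_slow - M) / M; otherwise restores CI_0.
    """
    if m <= 0:
        raise ValueError("slow-task threshold M must be positive")
    if n_slow > m:
        delta = ci_prev * (n_slow - m) / m   # Δci_n from the patent's formula
        return ci_prev + delta               # widen interval while tasks lag
    return ci_0  # slow-task count back to normal: restore the fixed interval
```

For example, with CI_{n-1} = 10, M = 4 and N_slow = 6, the interval grows by 10 × (6 − 4)/4 = 5 to 15; once N_slow drops to M or below, the next call returns CI_0 again.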
S3, starting the n-th checkpoint according to each adjusted checkpoint interval CI_n.
As shown in fig. 2, an embodiment of the present invention further provides a multi-feature aware stream computing system fault tolerant system, including: a multi-feature aware fault tolerant MAFT module and a multi-feature aware checkpoint coordinator;
the MAFT module is used for predicting failure rate in the running process of the application program, monitoring resource occupation condition of tasks on the nodes, sensing execution time of the tasks on the nodes and processing data quantity of the tasks;
the multi-feature-aware checkpoint coordinator is configured to dynamically adjust the checkpoint interval according to the perceived multiple features using the Mf-Stream fault-tolerant strategy to obtain each adjusted checkpoint interval CI_n, the Mf-Stream fault-tolerant strategy correspondingly comprising a fault-aware fault-tolerant policy, a resource-aware fault-tolerant policy, and a slow-task-aware fault-tolerant policy; and to start the n-th checkpoint according to each adjusted checkpoint interval CI_n.
Optionally, the MAFT module includes a MAFT agent and a database, the MAFT agent including: the system comprises a fault sensing module, a monitoring module and a slow task detector;
the fault perception module is used for predicting the failure rate fr using linear regression;
The monitoring module is used for monitoring the real-time CPU time and the real-time memory quantity occupied by the tasks on the nodes;
the slow task detector is used for sensing the execution time of the task on the node and the processing data volume of the task so as to judge whether the task is a slow task or not;
the database is used for temporarily storing the predicted failure rate, the monitored CPU and memory amounts, the execution time of tasks, and the processed data volume of tasks, deleting the data after it has been processed.
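The fault perception module's linear-regression predictor follows the model stated in the claims, fr = β0 + β1·U_cpu + β2·U_mem + β3·T_ex (plus an error term). A hedged sketch, with purely illustrative coefficients rather than fitted values from the patent:

```python
from typing import Tuple

def predict_failure_rate(
    u_cpu: float,
    u_mem: float,
    t_ex: float,
    beta: Tuple[float, float, float, float] = (0.05, 0.3, 0.2, 0.001),
) -> float:
    """Predict the failure rate fr from the three perceived features.

    beta = (b0, b1, b2, b3) are the regression coefficients; in the
    patent they are fitted from preprocessed historical data, while the
    defaults here are arbitrary placeholders for illustration.
    """
    b0, b1, b2, b3 = beta
    fr = b0 + b1 * u_cpu + b2 * u_mem + b3 * t_ex
    return min(max(fr, 0.0), 1.0)  # a failure rate is bounded to [0, 1]
```

The resulting fr feeds the fault-aware policy of Algorithm 1, which additionally clamps values below 0.25 up to 0.25 before adjusting the checkpoint interval.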
The functional structure of the fault-tolerant system of the multi-feature-aware stream computing system provided by the embodiment of the invention corresponds to the fault-tolerant method of the multi-feature-aware stream computing system provided by the embodiment of the invention, and is not described herein.
Fig. 3 is a schematic structural diagram of an electronic device 300 according to an embodiment of the present invention, where the electronic device 300 may have a relatively large difference due to different configurations or performances, and may include one or more processors (central processing units, CPU) 301 and one or more memories 302, where at least one instruction is stored in the memories 302, and the at least one instruction is loaded and executed by the processors 301 to implement the steps of the multi-feature-aware stream computing system fault tolerance method described above.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory comprising instructions executable by a processor in a terminal to perform the multi-feature-aware stream computing system fault tolerance method described above. For example, the computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.
Claims (7)
1. A multi-feature aware stream computing system fault tolerance method, comprising:
s1, multi-feature sensing is carried out, wherein the multi-feature sensing comprises the following steps: predicting failure rate in an application program operation flow, monitoring resource occupation condition of tasks on nodes, sensing execution time of the tasks on the nodes and processing data quantity of the tasks;
s2, dynamically adjusting the checkpoint interval according to the perceived multiple features using the Mf-Stream fault-tolerant strategy to obtain each adjusted checkpoint interval CI_n; the Mf-Stream fault-tolerant strategy correspondingly comprises: a fault-aware fault-tolerant policy, a resource-aware fault-tolerant policy, and a slow-task-aware fault-tolerant policy;
s3, starting the n-th checkpoint according to each adjusted checkpoint interval CI_n;
in the step S2, according to the predicted failure rate, a failure sensing fault tolerance strategy is used to dynamically adjust the checkpoint interval, which specifically includes:
based on the predicted failure rate fr, continuously enlarging the checkpoint interval: the fixed checkpoint interval CI_0 is increased using Δi_1, which is computed from fr; the minimum value of fr is 0.25, and all values below it are set to 0.25; each checkpoint interval CI_n continues to be increased at this rate until a fault occurs;
when a fault occurs, the application restarts from the checkpoint nearest the fault point and reduces CI_n by Δi_n, where CI_{n-1} denotes the previous checkpoint interval; the checkpoint interval continues to decrease at this rate until it equals the minimum checkpoint interval;
in the step S2, according to the resource occupation condition of the tasks on the nodes, a resource-aware fault-tolerant strategy is used to dynamically adjust the check point intervals, and the method specifically comprises the following steps:
setting a threshold value for maximum usage time of CPUC const ,0<C const ≤1;
monitoring CPU occupation using the heartbeat mechanism; if the CPU usage-time ratio of a certain task exceeds the threshold C_const, the checkpoint interval is increased; a CPU occupancy rate U_cpu is defined, and the next checkpoint interval is calculated by equation (1):
(1)
U_cpu equals the ratio of the CPU time occupied by the task's normal logic processing to the total CPU time during running:
(2)
setting a threshold value for the maximum usage of the memoryM const ,0<M const ≤1;
monitoring memory occupancy using the heartbeat mechanism; if the memory usage ratio of a certain task exceeds the threshold M_const, the checkpoint interval is increased; a memory occupancy rate U_mem is defined, and the next checkpoint interval can be calculated by equation (3):
(3)
U_mem equals the ratio of the memory used by the task's normal logic processing to the total memory during running:
(4)
until the occupancy of both CPU and memory falls below the thresholds, after which the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0;
In the step S2, according to the execution time of the task on the node and the processing data volume of the task, a slow task perception fault tolerance strategy is used to dynamically adjust the check point interval, which specifically comprises the following steps:
when the number of slow tasks is detected to exceed a threshold M, checkpoints begin to be opened at non-uniform intervals: the fixed checkpoint interval CI_0 is increased using Δci_1, where Δci_1 = CI_{n-1} × [(N_slow − M)/M] and N_slow denotes the number of slow tasks; each checkpoint interval CI_n continues to be increased at this rate until the number of slow tasks falls below the threshold M;
after the number of slow tasks falls below the threshold M, the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0.
2. The method according to claim 1, wherein predicting the failure rate during application running in S1 comprises predicting the failure rate fr using linear regression, and specifically comprises:

collecting and preprocessing historical data;

selecting suitable features for training the linear regression model, the suitable features including: CPU occupancy, memory occupancy, and task execution time;

taking the preprocessed data set as input, performing model training using a linear regression algorithm, and fitting a linear regression model fr = β0 + β1·U_cpu + β2·U_mem + β3·T_ex + ε, the linear regression model being used to predict the failure rate, where fr represents the failure rate, U_cpu the CPU occupancy, U_mem the memory occupancy, T_ex the task execution time, β_i (i = 0, 1, …, 3) the undetermined parameters, and ε the error term;

and predicting with the linear regression model obtained through training to give a failure-rate prediction for the current stream computing system.
3. The method according to claim 1, wherein slow tasks are judged by the task execution duration and the amount of data processed by the task, and the number of slow tasks is counted; only a task whose execution time exceeds a first preset threshold and whose processed data volume is smaller than a second preset threshold is judged to be a slow task.
4. A multi-feature aware stream computing system fault tolerant system, comprising: a multi-feature aware fault tolerant MAFT module and a multi-feature aware checkpoint coordinator;
the MAFT module is used for predicting failure rate in the running process of the application program, monitoring resource occupation condition of tasks on the nodes, sensing execution time of the tasks on the nodes and processing data quantity of the tasks;
the multi-feature-aware checkpoint coordinator is configured to dynamically adjust the checkpoint interval according to the perceived multiple features using the Mf-Stream fault-tolerant strategy to obtain each adjusted checkpoint interval CI_n, the Mf-Stream fault-tolerant strategy correspondingly comprising a fault-aware fault-tolerant policy, a resource-aware fault-tolerant policy, and a slow-task-aware fault-tolerant policy; and to start the n-th checkpoint according to each adjusted checkpoint interval CI_n;
the multi-feature perception checkpoint coordinator dynamically adjusts checkpoint intervals according to a predicted failure rate using a failure-aware fault tolerance strategy, and specifically includes:
based on the predicted failure rate fr, continuously enlarging the checkpoint interval: the fixed checkpoint interval CI_0 is increased using Δi_1, which is computed from fr; the minimum value of fr is 0.25, and all values below it are set to 0.25; each checkpoint interval CI_n continues to be increased at this rate until a fault occurs;
when a fault occurs, the application restarts from the checkpoint nearest the fault point and reduces CI_n by Δi_n, where CI_{n-1} denotes the previous checkpoint interval; the checkpoint interval continues to decrease at this rate until it equals the minimum checkpoint interval;
the multi-feature perception checkpoint coordinator dynamically adjusts checkpoint intervals by using a resource perception fault tolerance policy according to the resource occupation condition of tasks on nodes, and specifically includes:
setting a threshold value for maximum usage time of CPUC const ,0<C const ≤1;
monitoring CPU occupation using the heartbeat mechanism; if the CPU usage-time ratio of a certain task exceeds the threshold C_const, the checkpoint interval is increased; a CPU occupancy rate U_cpu is defined, and the next checkpoint interval is calculated by equation (1):
(1)
U_cpu equals the ratio of the CPU time occupied by the task's normal logic processing to the total CPU time during running:
(2)
setting a threshold value for the maximum usage of the memoryM const ,0<M const ≤1;
monitoring memory occupancy using the heartbeat mechanism; if the memory usage ratio of a certain task exceeds the threshold M_const, the checkpoint interval is increased; a memory occupancy rate U_mem is defined, and the next checkpoint interval can be calculated by equation (3):
(3)
U_mem equals the ratio of the memory used by the task's normal logic processing to the total memory during running:
(4)
until the occupancy of both CPU and memory falls below the thresholds, after which the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0;
The multi-feature perception checkpoint coordinator dynamically adjusts checkpoint intervals by using a slow task perception fault tolerance strategy according to execution time of tasks on nodes and processing data volume of the tasks, and specifically comprises the following steps:
when the number of slow tasks is detected to exceed a threshold M, checkpoints begin to be opened at non-uniform intervals: the fixed checkpoint interval CI_0 is increased using Δci_1, where Δci_1 = CI_{n-1} × [(N_slow − M)/M] and N_slow denotes the number of slow tasks; each checkpoint interval CI_n continues to be increased at this rate until the number of slow tasks falls below the threshold M;
after the number of slow tasks falls below the threshold M, the checkpoint interval CI_n is reset to the fixed checkpoint interval CI_0.
5. The system of claim 4, wherein the MAFT module comprises a MAFT agent and a database, the MAFT agent comprising: the system comprises a fault sensing module, a monitoring module and a slow task detector;
the fault perception module is used for predicting the failure rate fr using linear regression;
The monitoring module is used for monitoring the real-time CPU time and the real-time memory quantity occupied by the tasks on the nodes;
the slow task detector is used for sensing the execution time of the task on the node and the processing data volume of the task so as to judge whether the task is a slow task or not;
the database is used for temporarily storing the predicted failure rate, the monitored CPU and memory amounts, the execution time of tasks, and the processed data volume of tasks, deleting the data after it has been processed.
6. An electronic device comprising a processor and a memory having stored therein at least one instruction, wherein the at least one instruction is loaded and executed by the processor to implement the multi-feature aware stream computing system fault tolerance method of any one of claims 1-3.
7. A computer readable storage medium having stored therein at least one instruction, wherein the at least one instruction is loaded and executed by a processor to implement the multi-feature aware stream computing system fault tolerance method of any one of claims 1-3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310598274.4A CN116361060B (en) | 2023-05-25 | 2023-05-25 | Multi-feature-aware stream computing system fault tolerance method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116361060A CN116361060A (en) | 2023-06-30 |
CN116361060B true CN116361060B (en) | 2023-09-15 |
Family
ID=86939416
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310598274.4A Active CN116361060B (en) | 2023-05-25 | 2023-05-25 | Multi-feature-aware stream computing system fault tolerance method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116361060B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104331347A (en) * | 2014-11-25 | 2015-02-04 | 中国人民解放军国防科学技术大学 | Variable error rate-oriented check point interval real-time determining method |
CN109344009A (en) * | 2018-10-11 | 2019-02-15 | 重庆邮电大学 | Mobile cloud system fault-tolerance approach based on classification checkpoint |
CN111124720A (en) * | 2019-12-26 | 2020-05-08 | 江南大学 | Self-adaptive check point interval dynamic setting method |
CN111258824A (en) * | 2020-01-18 | 2020-06-09 | 重庆邮电大学 | Increment check point fault tolerance method based on artificial potential field in cloud computing |
CN111682981A (en) * | 2020-06-02 | 2020-09-18 | 深圳大学 | Check point interval setting method and device based on cloud platform performance |
CN112445635A (en) * | 2019-09-04 | 2021-03-05 | 无锡江南计算技术研究所 | Data-driven adaptive checkpoint optimization method |
CN116069468A (en) * | 2022-12-30 | 2023-05-05 | 三星(中国)半导体有限公司 | Checkpoint adjustment method and device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070220327A1 (en) * | 2006-02-23 | 2007-09-20 | Evergrid, Inc., A Delaware Corporation | Dynamically Controlled Checkpoint Timing |
US11641395B2 (en) * | 2019-07-31 | 2023-05-02 | Stratus Technologies Ireland Ltd. | Fault tolerant systems and methods incorporating a minimum checkpoint interval |
- 2023-05-25: CN CN202310598274.4A patent CN116361060B (en), status Active
Non-Patent Citations (1)
Title |
---|
He Zhongzheng et al. Schedulability of fault-tolerant real-time systems based on checkpoint interval optimization. Journal of Jilin University, 2014, vol. 44, no. 2, pp. 433-439. * |
Also Published As
Publication number | Publication date |
---|---|
CN116361060A (en) | 2023-06-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Chtepen et al. | Adaptive task checkpointing and replication: Toward efficient fault-tolerant grids | |
JP4170988B2 (en) | Risk prediction / avoidance method, system, program, and recording medium for execution environment | |
US7890297B2 (en) | Predictive monitoring method and system | |
US8949642B2 (en) | Method for dynamically distributing one or more services in a network comprising of a plurality of computers by deriving a resource capacity required based on a past chronological progression of a resource demand | |
CN107562512B (en) | Method, device and system for migrating virtual machine | |
Heinze et al. | An adaptive replication scheme for elastic data stream processing systems | |
US11886919B2 (en) | Directing queries to nodes of a cluster of a container orchestration platform distributed across a host system and a hardware accelerator of the host system | |
CN112799817A (en) | Micro-service resource scheduling system and method | |
US20220414503A1 (en) | Slo-aware artificial intelligence inference scheduler for heterogeneous processors in edge platforms | |
CN111880906A (en) | Virtual machine high-availability management method, system and storage medium | |
WO2020248227A1 (en) | Load prediction-based hadoop computing task speculative execution method | |
US7130770B2 (en) | Monitoring method and system with corrective actions having dynamic intensities | |
WO2018024076A1 (en) | Flow velocity control method and device | |
US11966273B2 (en) | Throughput-optimized, quality-of-service aware power capping system | |
CN109522100B (en) | Real-time computing task adjusting method and device | |
Rood et al. | Resource availability prediction for improved grid scheduling | |
Lassettre et al. | Dynamic surge protection: An approach to handling unexpected workload surges with resource actions that have lead times | |
CN116361060B (en) | Multi-feature-aware stream computing system fault tolerance method and system | |
WO2022247219A1 (en) | Information backup method, device, and platform | |
CN111274111B (en) | Prediction and anti-aging method for microservice aging | |
CN112559287A (en) | Method and device for optimizing task flow in data | |
Okamura et al. | Optimization of opportunity-based software rejuvenation policy | |
CN115470006B (en) | Load balancing method based on microkernel | |
CN115858155A (en) | Dynamic capacity expansion and contraction method and device for application resources of computing power network platform | |
Amin et al. | Using automated control charts for the runtime evaluation of qos attributes |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||