CN117349775B - Cluster computing-oriented abnormal subtask identification method and device - Google Patents

Cluster computing-oriented abnormal subtask identification method and device

Info

Publication number
CN117349775B
CN117349775B (granted from application CN202311435871.1A)
Authority
CN
China
Prior art keywords
subtask
running
task
subtasks
training
Prior art date
Legal status
Active
Application number
CN202311435871.1A
Other languages
Chinese (zh)
Other versions
CN117349775A (en)
Inventor
周俊
朱海洋
陈为
陈正奎
肖杰
郑励
谈旭炜
储诚灿
潘奇豪
李可涵
钱晓英
黄理乐
Current Assignee
Products Zhongda Digital Technology Co ltd
Zhejiang University ZJU
Original Assignee
Products Zhongda Digital Technology Co ltd
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Products Zhongda Digital Technology Co ltd and Zhejiang University ZJU
Priority: CN202311435871.1A
Publication of CN117349775A
Application granted
Publication of CN117349775B
Legal status: Active

Classifications

    • G — Physics
    • G06 — Computing; calculating or counting
    • G06F — Electric digital data processing
    • G06F18/2433 — Single-class perspective, e.g. one-against-all classification; novelty detection; outlier detection
    • G06F18/2415 — Classification techniques based on parametric or probabilistic models, e.g. likelihood ratio
    • G06F18/27 — Regression, e.g. linear or logistic regression
    • G06N3/08 — Neural networks; learning methods
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The embodiments of this specification provide a cluster-computing-oriented abnormal subtask identification method and device. The method comprises: for a running task, acquiring running data of its multiple subtasks, each running on a corresponding node in the cluster; constructing a first training sample set from the running data, containing the task features and actual completion duration of each completed subtask; training a duration prediction model on the first set and using it to predict the completion duration of each still-running subtask; constructing a second training sample set from the running data, with completed subtasks as positive samples and running subtasks as negative samples; training a probability prediction model on the second set to predict each running subtask's completion probability; updating each running subtask's predicted completion duration to the quotient of the predicted duration and the corresponding completion probability; and identifying the abnormal subtasks from the updated predicted completion durations.

Description

Cluster computing-oriented abnormal subtask identification method and device
Technical Field
One or more embodiments of this specification relate to the field of computer technology, and in particular to a cluster-computing-oriented abnormal subtask identification method and apparatus.
Background
In data-center or cluster computing, a computing task is split into multiple subtasks, each executed in parallel on a different machine, and the results are aggregated once the last subtask completes.
A "slow machine" (straggler) is one of the rare, extremely slow subtasks within an otherwise normal task; it holds back completion of the whole job. Slow machines arise for various reasons: degraded machine hardware, machine resources being occupied by other tasks, bugs in the task code triggered by a machine's particular hardware environment, an overheated machine environment, and so on. The presence of slow machines can reduce overall performance by 30%-50%.
A scheme is therefore needed to identify slow-machine subtasks (also called abnormal subtasks) in a timely and accurate manner.
Disclosure of Invention
One or more embodiments of this specification describe a cluster-computing-oriented abnormal subtask identification method and apparatus that reduce prediction bias, so that slow machines can be predicted in a timely and accurate manner.
According to a first aspect, a cluster computation-oriented abnormal subtask recognition method is provided. The method comprises the following steps:
for a target task running in the cluster, acquiring running data of its multiple subtasks, each running on a corresponding node in the cluster; constructing a first training sample set based on the running data, where the first training sample set comprises the task features and corresponding actual completion duration of each completed subtask among the multiple subtasks; training a duration prediction model with the first training sample set, and determining the predicted completion duration of each running subtask with the trained model; constructing a second training sample set based on the running data, where each completed subtask serves as a positive sample and each running subtask as a negative sample; training a probability prediction model on the second training sample set, and predicting the completion probability of each running subtask with the trained model; updating the predicted completion duration of each running subtask to the quotient of its predicted completion duration and its completion probability; and determining, from the updated predicted completion durations, the abnormal subtasks whose durations are abnormal.
In one embodiment, for a target task running in the cluster, acquiring the running data of its multiple subtasks running on corresponding nodes comprises: acquiring the running data in response to the running time of the target task reaching a preset duration.
In one embodiment, training a probability prediction model based on the second training sample set and predicting the completion probability of each running subtask with the trained model comprises: determining a first subset and a second subset of the second training sample set, where the first subset takes one part of the running subtasks (the first subtasks) as negative samples, the second subset takes the other part (the second subtasks) as negative samples, and both subsets take all completed subtasks as positive samples; training a first probability prediction model with the first subset, and predicting the completion probability of each second subtask with it; and training a second probability prediction model with the second subset, and predicting the completion probability of each first subtask with it.
In one embodiment, determining the abnormal subtasks from the updated predicted completion durations comprises: for each running subtask, classifying it as abnormal when its updated predicted completion duration exceeds a preset duration threshold.
In one embodiment, determining the abnormal subtasks from the updated predicted completion durations comprises: sorting the updated predicted completion durations of the running subtasks in descending order, and classifying the subtasks holding the top k durations as abnormal subtasks.
In one embodiment, after the abnormal subtasks are determined from the updated predicted completion durations, the method further comprises: performing a restart operation or a termination operation on each abnormal subtask.
In one embodiment, the target task is a log analysis task or a machine learning task.
According to a second aspect, an abnormal subtask recognition device for cluster-oriented computing is provided. The device comprises:
A running data acquisition module, configured to acquire, for a target task running in the cluster, running data of its multiple subtasks, each running on a corresponding node; a first training set construction module, configured to construct a first training sample set based on the running data, comprising the task features and corresponding actual completion duration of each completed subtask; a duration model training module, configured to train a duration prediction model with the first training sample set; a duration prediction module, configured to determine the predicted completion duration of each running subtask with the trained duration prediction model; a second training set construction module, configured to construct a second training sample set based on the running data, with each completed subtask as a positive sample and each running subtask as a negative sample; a probability model training module, configured to train a probability prediction model on the second training sample set; a probability prediction module, configured to predict the completion probability of each running subtask with the trained probability prediction model; a duration update module, configured to update the predicted completion duration of each running subtask to the quotient of its predicted completion duration and its completion probability; and an abnormal subtask determination module, configured to determine the abnormal subtasks from the updated predicted completion durations.
In one embodiment, the probability model training module comprises: a subset construction unit, configured to determine a first subset and a second subset of the second training sample set, where the first subset takes one part of the running subtasks (the first subtasks) as negative samples, the second subset takes the other part (the second subtasks) as negative samples, and both subsets take all completed subtasks as positive samples; a first training unit, configured to train a first probability prediction model with the first subset; and a second training unit, configured to train a second probability prediction model with the second subset. The probability prediction module then comprises: a first prediction unit, configured to predict the completion probability of each second subtask with the trained first probability prediction model; and a second prediction unit, configured to predict the completion probability of each first subtask with the trained second probability prediction model.
In one embodiment, the abnormal subtask determination module is specifically configured to: for each running subtask, classify it as abnormal when its updated predicted completion duration exceeds a preset duration threshold.
According to a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
According to a fourth aspect, there is provided a computing device comprising a memory having executable code stored therein and a processor which when executing the executable code implements the method of the first aspect.
With the method and device provided by the embodiments of this specification, for a task still in operation, a prediction model is built online to predict the completion duration of each running subtask in real time, and the prediction bias is then corrected by re-weighting, so that potentially abnormal subtasks can be identified in a timely and accurate manner, effectively reducing or eliminating the performance loss caused by slow-running subtasks.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a block diagram of an abnormal subtask identification scheme for cluster-oriented computing disclosed in an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart of a method for identifying abnormal subtasks for cluster-oriented computing according to an embodiment of the present disclosure;
Fig. 3 is a schematic structural diagram of an abnormal subtask recognition device for cluster computing according to an embodiment of the present disclosure.
Detailed Description
The following describes the scheme provided in the present specification with reference to the drawings.
In light of the foregoing, a solution is needed that can timely and accurately identify slow machine subtasks.
In one approach, the completion times of subtasks are assumed to follow a certain distribution, such as a normal distribution. Accordingly, the completion times of the already-completed subtasks of a task are collected and fitted to a normal distribution, and the subtasks corresponding to points that deviate from the fitted distribution are identified as slow-machine subtasks.
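As an illustration of this distribution-fitting approach (a sketch only; the text does not prescribe an implementation, and the z-score threshold below is an assumption), the completed durations can be fitted to a normal distribution and one-sided outliers flagged:

```python
import math

def flag_outliers_normal(durations, z_threshold=2.5):
    """Fit a normal distribution to completed-subtask durations and flag
    those lying more than z_threshold standard deviations ABOVE the mean
    (one-sided: only slow subtasks are of interest)."""
    n = len(durations)
    mean = sum(durations) / n
    var = sum((d - mean) ** 2 for d in durations) / n
    std = math.sqrt(var) or 1e-9  # guard against zero variance
    return [i for i, d in enumerate(durations)
            if (d - mean) / std > z_threshold]

# Nine ordinary durations and one extreme straggler:
times = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7, 10.0, 60.0]
print(flag_outliers_normal(times))  # [9]
```

Note that a single extreme point inflates the fitted mean and variance, which is one weakness of this baseline on small samples.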
In another approach, historical running data from a large number of tasks is collected to build a training set for a slow-machine prediction model. However, the feature schema of different tasks tends to be unique, so a model trained on one task is difficult to apply directly to another; moreover, because slow-machine subtasks are rare, the training samples are highly imbalanced and the resulting predictions are poor.
Based on these observations, this application proposes an abnormal subtask identification scheme that is applied independently to each task and can identify potential slow-machine subtasks in real time while the task is still running.
FIG. 1 is a block diagram of the cluster-computing-oriented abnormal subtask identification scheme disclosed in an embodiment of this specification. Its key steps are delay prediction, re-weighting, and slow-machine identification. Delay prediction trains a model on the subtasks that have already completed during the task's run (which are, by construction, not slow machines) and uses it to predict the delay (completion time) of the still-running subtasks. Because this model may be biased toward non-slow-machine behaviour, the predicted delays are re-weighted: intuitively, the weighting function measures how much a running subtask's features differ from those of the completed subtasks, leaving the predicted delay of non-slow-machine-like subtasks essentially unchanged while inflating the predicted delay of the others, thereby reducing prediction bias. Slow-machine identification is then performed on the re-weighted delays, so slow machines can be predicted in a timely and accurate manner. The identified potential slow machines can then be handled, promptly eliminating their adverse effects.
The implementation of the above scheme will be described below in conjunction with fig. 2 and further embodiments.
Fig. 2 is a schematic flowchart of the cluster-computing-oriented abnormal subtask identification method disclosed in an embodiment of this specification. The method may be executed by any device, platform, or server with computing and processing capabilities, for example a cluster management platform.
As shown in fig. 2, the method comprises the steps of:
Step S210: for a target task running in the cluster, acquire running data of its multiple subtasks, each running on a corresponding node.
Step S220: construct a first training sample set based on the running data, comprising the task features and corresponding actual completion duration of each completed subtask.
Step S230: train a duration prediction model with the first training sample set, and determine the predicted completion duration of each running subtask with the trained model.
Step S240: construct a second training sample set based on the running data, with each completed subtask as a positive sample and each running subtask as a negative sample.
Step S250: train a probability prediction model on the second training sample set, and predict the completion probability of each running subtask with the trained model.
Step S260: update the predicted completion duration of each running subtask to the quotient of its predicted completion duration and its completion probability.
Step S270: determine the abnormal subtasks from the updated predicted completion durations.
The development of the above steps is described as follows:
First, in step S210, for a target task running in the cluster, the running data of its multiple subtasks, each running on a corresponding node in the cluster, is acquired.
The target task may be, for example, a data analysis task or a machine learning task; the embodiments of this specification do not limit the specific task content. The rules for splitting the target task into subtasks and the manner of scheduling each subtask onto its node may follow the existing art and are not detailed here. The nodes in the cluster may be individual computers, servers, virtual machines, and the like.
The trigger for acquiring the running data may be configured in advance as needed, or the acquisition may be triggered manually by an operator. For example, acquisition may be triggered when a predetermined time has elapsed since the target task started running, or automatically at predetermined intervals.
The running data may include the running status of each subtask, such as completed, running, or terminated. It may also include each subtask's start time, elapsed running time, running environment, and so on, where the running environment covers the software and/or hardware environment of the corresponding node. For example, the software environment may include the operating system type and version; the hardware environment may include the model, manufacturer, count, core count, or memory capacity of the central processing unit (CPU) or graphics processing unit (GPU).
From this, the operation data of each sub-task of the target task being operated can be obtained.
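The running data described above can be represented as one record per subtask. The following sketch uses illustrative field names (they are assumptions, not prescribed by the text):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SubtaskRecord:
    """One running-data record per subtask (field names illustrative)."""
    subtask_id: str
    node: str
    status: str                 # "completed" | "running" | "terminated"
    start_time: float           # epoch seconds
    elapsed: float              # running duration so far, in seconds
    cpu_model: str = ""
    cpu_cores: int = 0
    os_version: str = ""
    actual_duration: Optional[float] = None  # set once completed

rec = SubtaskRecord("t1-s3", "node-17", "running", 1_700_000_000.0, 42.5)
assert rec.actual_duration is None  # still running, no final duration yet
```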
Then, in step S220, a first training sample set is constructed based on the operation data, where the first training sample set includes task characteristics and corresponding actual completion time periods of each completed subtask in the plurality of subtasks.
It should be understood that the subtasks in the completed state are determined from the running status recorded in the running data, and the first training sample set is built from them. Note that "several" herein means one or more.
Specifically, each completed subtask yields one training sample: its subtask features serve as the sample features and its actual completion duration as the sample label. As described above, the features may include the start time, running duration, running environment, and so on.
Based on the first training sample set constructed above, step S230 may be executed, where the duration prediction model is trained using the first training sample set, and the predicted completion duration of each running sub-task in the plurality of sub-tasks is determined using the trained duration prediction model.
It can be appreciated that the duration prediction model is a regression model whose implementation can be chosen as needed, for example a deep neural network (DNN) or extreme gradient boosting (XGBoost).
Specifically, subtask features in each training sample are input into a duration prediction model to obtain a prediction result of the completion duration of the subtask, and then training loss is determined based on the prediction result and a corresponding sample label (namely the actual completion duration), so that model parameters of the duration prediction model are updated by using the training loss. After one or more iterations, a trained duration prediction model can be obtained.
Further, the predicted completion time of each running sub-task can be determined by using a trained time prediction model.
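To make steps S220-S230 concrete, the sketch below builds the first training sample set and fits a regressor. The text names DNN or XGBoost as candidate models; purely for illustration, a single-feature ordinary-least-squares fit stands in here (the feature values and data are made up):

```python
def fit_least_squares(xs, ys):
    """Ordinary least squares for y = a*x + b (single feature),
    standing in for the DNN/XGBoost regressor named in the text."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs) or 1e-9
    a = cov / var
    b = my - a * mx
    return lambda x: a * x + b

# First training sample set: (feature, actual completion duration)
# pairs taken from COMPLETED subtasks only.
completed = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
model = fit_least_squares(*zip(*completed))

# Predicted completion duration for a still-running subtask:
print(model(4.0))  # 40.0
```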
On the other hand, after the operation data is acquired in step S210, steps S240 and S250 may also be performed.
In step S240, a second training sample set is constructed based on the operational data, wherein each completed subtask is taken as a positive sample and each operational subtask is taken as a negative sample.
Specifically, the completed subtasks and the running subtasks are determined based on the running data, for each completed subtask, the subtask characteristics thereof are used as sample characteristics, the corresponding sample label is set to 1, so as to obtain the corresponding positive sample, and similarly, for each running subtask, the subtask characteristics thereof are used as sample characteristics, the corresponding sample label is set to 0, so as to obtain the corresponding negative sample.
In one embodiment, a terminated one of the plurality of subtasks may also be taken as a negative sample.
Based on the constructed second training sample set, step S250 is executed to train a probability prediction model based on the constructed second training sample set, and predict the probability of completion of each running subtask using the trained probability prediction model.
It should be appreciated that the probability prediction model is essentially a binary classification model; it may be implemented, for example, as a logistic regression model.
In one embodiment, a single probabilistic predictive model is trained. Specifically, sample characteristics of each training sample in the second training sample set are input into a probability prediction model to obtain a probability prediction result, and then training loss is determined by using the probability prediction result and a corresponding sample label, so that model parameters of the probability prediction model are updated by using the training loss. Thus, after one or more iterations, a trained probabilistic predictive model can be obtained.
Further, the characteristics of each running sub-task can be input into a probability prediction model, so that the corresponding predicted completion probability is obtained.
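A minimal sketch of steps S240-S250, assuming a single numeric feature per subtask (the feature choice and all data are illustrative): a hand-rolled logistic regression is trained with completed subtasks labelled 1 and running subtasks labelled 0, then queried for completion probabilities.

```python
import math

def train_logistic(feats, labels, lr=0.5, epochs=2000):
    """Single-feature logistic regression trained by stochastic gradient
    descent, standing in for the probability prediction model. Returns a
    function mapping a feature value to a completion probability."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return lambda x: 1.0 / (1.0 + math.exp(-(w * x + b)))

# Second training sample set: completed subtasks -> label 1 (positive),
# still-running subtasks -> label 0 (negative).
features = [0.9, 1.0, 1.2, 4.8, 5.0]
labels   = [1,   1,   1,   0,   0]
predict = train_logistic(features, labels)
assert predict(1.0) > 0.5 > predict(5.0)
```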
In the above embodiment, the samples used in the model's training phase overlap with those in its prediction phase: the same running-subtask samples appear in both, which may affect the accuracy of the predicted completion probabilities. Another embodiment is therefore proposed.
In another embodiment, the negative samples in the second training sample set are divided into two parts assigned to two subsets, and all completed subtasks are added to both subsets as positive samples. A separate probability prediction model is then trained on each subset, and the two models cross-predict: each predicts the completion probabilities of the running subtasks outside its own training subset.
Specifically, a first subset and a second subset of the second training sample set are determined, where the first subset takes one part of the running subtasks (the first subtasks) as negative samples, the second subset takes the other part (the second subtasks) as negative samples, and both subsets take all completed subtasks as positive samples. A first probability prediction model is trained on the first subset and used to predict the completion probability of each second subtask; meanwhile, a second probability prediction model is trained on the second subset and used to predict the completion probability of each first subtask. The completion probability of every running subtask is thereby obtained.
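The two-fold cross-prediction above can be sketched as follows. The trainer is passed in as a parameter; the stub trainer shown is a made-up placeholder just to demonstrate the data flow, not a real model:

```python
def cross_predict(positives, negatives, train_fn):
    """Split the running (negative) subtasks into halves A and B; a model
    trained on positives + A scores B, and a model trained on positives + B
    scores A, so no subtask is scored by a model that saw it in training.
    Returns probabilities aligned with `negatives`."""
    half = len(negatives) // 2
    a, b = negatives[:half], negatives[half:]
    model_a = train_fn(positives + a, [1] * len(positives) + [0] * len(a))
    model_b = train_fn(positives + b, [1] * len(positives) + [0] * len(b))
    return [model_b(x) for x in a] + [model_a(x) for x in b]

def stub_trainer(feats, labels):
    """Placeholder 'model': probability 1.0 if the feature is at or below
    the mean of the positive samples seen in training, else 0.1."""
    pos_mean = sum(f for f, l in zip(feats, labels) if l == 1) / max(labels.count(1), 1)
    return lambda x: 1.0 if x <= pos_mean else 0.1

probs = cross_predict([1.0, 1.1], [0.9, 5.0], stub_trainer)
print(probs)  # [1.0, 0.1]
```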
It should be understood that, in the description of steps S240 and S250, the first and second subsets may also be constructed directly from the running data rather than by first building the full second training sample set; in that case, "first subset" and "second subset" can be read as "first positive-and-negative sample set" and "second positive-and-negative sample set".
From the above, the predicted completion time of each running sub-task can be obtained by performing steps S220 and S230, and the completion probability of each running sub-task can be obtained by performing steps S240 and S250.
On this basis, step S260 is performed: the predicted completion duration of each running subtask is updated to the quotient of its predicted completion duration and its corresponding completion probability. This re-weighting of the predicted completion durations reduces the prediction bias.
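Step S260 itself is a one-liner: dividing by the completion probability leaves confidently-predicted subtasks unchanged and inflates the predicted duration of subtasks whose features look unlike any completed subtask. A sketch (the probability floor is an added safeguard, not from the text):

```python
def reweight(pred_durations, completion_probs, floor=1e-6):
    """Step S260: divide each predicted completion duration by the
    corresponding completion probability. The floor guards against
    division by a zero probability."""
    return [d / max(p, floor) for d, p in zip(pred_durations, completion_probs)]

# A confident prediction stays put; a doubtful one is inflated.
print(reweight([10.0, 10.0], [1.0, 0.25]))  # [10.0, 40.0]
```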
Then, in step S270, according to the updated predicted completion time, an abnormal subtask corresponding to the abnormal time is determined.
In one embodiment, each running subtask is classified as an abnormal subtask if its updated predicted completion duration exceeds a preset duration threshold.
In another embodiment, the updated predicted completion durations of the running subtasks are sorted in descending order, and the subtasks holding the top k durations are classified as abnormal. The value of k may be set by an operator, e.g. k = 2.
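Both identification embodiments of step S270 can be sketched in a few lines (the threshold and k values in the example are arbitrary):

```python
def flag_by_threshold(durations, threshold):
    """First embodiment: any running subtask whose re-weighted predicted
    duration exceeds a preset threshold is an abnormal subtask."""
    return [i for i, d in enumerate(durations) if d > threshold]

def flag_top_k(durations, k=2):
    """Second embodiment: sort re-weighted durations in descending order
    and flag the subtasks holding the top-k values."""
    order = sorted(range(len(durations)), key=lambda i: durations[i], reverse=True)
    return sorted(order[:k])

adjusted = [12.0, 95.0, 11.5, 60.0, 13.0]
print(flag_by_threshold(adjusted, 50.0))  # [1, 3]
print(flag_top_k(adjusted, k=2))          # [1, 3]
```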
According to an embodiment of another aspect, after step S270 the method may further include performing a restart operation on an abnormal subtask, after which it resumes running, or performing a termination operation, after which its running status becomes terminated; a terminated subtask may then be rescheduled onto another node. The handling of abnormal subtasks may be preconfigured by an operator or chosen after real-time observation.
It should be noted that the execution order of the steps is not fixed: they may run in the order illustrated in fig. 2 or in another order, for example S240 before S220, or S240 and S220 in parallel, as long as the data dependencies are respected.
In summary, with the cluster-computing-oriented abnormal subtask identification method disclosed in the embodiments of this specification, for a running target task, a prediction model is constructed online to predict the completion duration of each running subtask in real time, and the prediction deviation is then corrected by re-weighting. Potential abnormal subtasks can thus be identified promptly and accurately, effectively reducing or eliminating the performance loss caused by slow-running subtasks.
Corresponding to the above identification method, the embodiments of this specification also disclose an identification device. Fig. 3 is a schematic structural diagram of a cluster-computing-oriented abnormal subtask identification device according to an embodiment of this specification.
As shown in fig. 3, the apparatus 300 includes the following modules:
The running data acquisition module 302 is configured to acquire, for a running target task, running data of a plurality of subtasks of the target task that run correspondingly on a plurality of nodes in the cluster.
The first training set construction module 304 is configured to construct a first training sample set based on the running data, wherein the first training sample set includes task features and corresponding actual completion durations of each completed subtask among the plurality of subtasks.
The duration model training module 306 is configured to train a duration prediction model using the first training sample set.
The duration prediction module 308 is configured to determine, using the trained duration prediction model, the predicted completion duration of each running subtask among the plurality of subtasks.
The second training set construction module 310 is configured to construct a second training sample set based on the running data, in which each completed subtask is taken as a positive sample and each running subtask is taken as a negative sample.
The probability model training module 312 is configured to train a probability prediction model based on the second training sample set.
The probability prediction module 314 is configured to predict the completion probability of each running subtask using the trained probability prediction model.
The duration update module 316 is configured to update the predicted completion duration of each running subtask to the quotient of that duration and its completion probability.
The abnormal subtask determination module 318 is configured to determine, according to the updated predicted completion durations, the subtasks whose durations are abnormal.
In one embodiment, the running data acquisition module 302 is specifically configured to: acquire the running data in response to the running time of the target task reaching a preset duration.
In one embodiment, the probability model training module 312 includes: a subset construction unit configured to determine a first subset and a second subset of the second training sample set, wherein the first subset takes a part of the running subtasks, referred to as first subtasks, as negative samples, the second subset takes the remaining running subtasks, referred to as second subtasks, as negative samples, and both subsets take all completed subtasks as positive samples; a first training unit configured to train a first probability prediction model using the first subset; and a second training unit configured to train a second probability prediction model using the second subset.
Accordingly, the probability prediction module 314 includes: a first prediction unit configured to predict the completion probability of each second subtask using the trained first probability prediction model; and a second prediction unit configured to predict the completion probability of each first subtask using the trained second probability prediction model.
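This cross-prediction scheme — each model scores only the running subtasks it did not see as negative samples — can be sketched as follows. The callback-based API, the even/odd split, and all names are a hypothetical illustration of the idea, not the patent's implementation:

```python
import random

def cross_predict_completion(completed, running, train, predict):
    """Two-fold cross-prediction of completion probability (units 312/314).

    The running subtasks are split into two halves. Each probability model
    is trained with one half as negatives (plus all completed subtasks as
    positives) and scores only the OTHER half, so no running subtask is
    scored by a model that was trained to treat it as a negative sample.
    `train(positives=..., negatives=...)` and `predict(model, task)` are
    caller-supplied callbacks (assumed interface).
    """
    shuffled = list(running)
    random.Random(0).shuffle(shuffled)          # deterministic split for the demo
    first, second = shuffled[::2], shuffled[1::2]

    model_a = train(positives=completed, negatives=first)   # first subset
    model_b = train(positives=completed, negatives=second)  # second subset

    probs = {}
    probs.update({t: predict(model_a, t) for t in second})  # model A scores second subtasks
    probs.update({t: predict(model_b, t) for t in first})   # model B scores first subtasks
    return probs

# Toy stand-ins: "training" just remembers the negatives, and "prediction"
# returns a low probability only for tasks the model saw as negatives.
completed = ["c1", "c2"]
running = ["r1", "r2", "r3", "r4"]
probs = cross_predict_completion(
    completed, running,
    train=lambda positives, negatives: set(negatives),
    predict=lambda model, t: 0.1 if t in model else 0.9,
)
```

Because every running subtask is scored by the model that did not see it, the toy example assigns all of them the "unseen" probability, demonstrating that the cross-over avoids each model's pessimistic bias toward its own negative samples.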
In one embodiment, the abnormal subtask determination module 318 is specifically configured to: for each running subtask, classify the subtask as an abnormal subtask if its corresponding updated predicted completion duration is greater than a preset duration threshold.
In another embodiment, the abnormal subtask determination module 318 is specifically configured to: sort the updated predicted completion durations corresponding to the running subtasks in descending order, and classify the subtasks corresponding to the top k durations after sorting as abnormal subtasks.
In one embodiment, the apparatus 300 further includes an abnormal subtask processing module 320 configured to execute a restart operation or a termination operation for the abnormal subtask.
In one embodiment, the target task is a log analysis task or a machine learning task.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having a computer program stored thereon which, when executed in a computer, causes the computer to perform the method described in connection with Fig. 2.
According to an embodiment of yet another aspect, there is also provided a computing device including a memory and a processor, wherein executable code is stored in the memory, and the processor, when executing the executable code, implements the method described in connection with Fig. 2.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The foregoing embodiments have been provided for the purpose of illustrating the general principles of the present invention in further detail, and are not to be construed as limiting the scope of the invention, but are merely intended to cover any modifications, equivalents, improvements, etc. based on the teachings of the invention.

Claims (8)

1. A cluster-computing-oriented abnormal subtask identification method, characterized by comprising the following steps:
for a running target task, acquiring running data of a plurality of subtasks of the target task that run correspondingly on a plurality of nodes in a cluster;
constructing a first training sample set based on the running data, wherein the first training sample set comprises task features and corresponding actual completion durations of each completed subtask among the plurality of subtasks;
training a duration prediction model using the first training sample set, and determining, using the trained duration prediction model, a predicted completion duration of each running subtask among the plurality of subtasks;
constructing a second training sample set based on the running data, wherein each completed subtask is taken as a positive sample and each running subtask is taken as a negative sample;
training a probability prediction model based on the second training sample set, and predicting a completion probability of each running subtask using the trained probability prediction model;
updating the predicted completion duration of each running subtask to the quotient of the predicted completion duration and the corresponding completion probability;
determining, according to the updated predicted completion durations, the subtasks whose durations are abnormal as abnormal subtasks;
wherein training the probability prediction model based on the second training sample set and predicting the completion probability of each running subtask using the trained probability prediction model comprises:
determining a first subset and a second subset of the second training sample set, wherein the first subset takes a plurality of first subtasks, being part of the running subtasks, as negative samples, the second subset takes a plurality of second subtasks, being the remaining running subtasks, as negative samples, and both the first subset and the second subset take all completed subtasks as positive samples;
training a first probability prediction model using the first subset, and predicting the completion probability of each second subtask using the trained first probability prediction model; and
training a second probability prediction model using the second subset, and predicting the completion probability of each first subtask using the trained second probability prediction model.
2. The method of claim 1, wherein acquiring, for the running target task, the running data of the plurality of subtasks running correspondingly on the plurality of nodes in the cluster comprises:
acquiring the running data in response to the running time of the target task reaching a preset duration.
3. The method of claim 1, wherein determining, according to the updated predicted completion durations, the subtasks whose durations are abnormal comprises:
for each running subtask, classifying the subtask as an abnormal subtask if its corresponding updated predicted completion duration is greater than a preset duration threshold.
4. The method of claim 1, wherein determining, according to the updated predicted completion durations, the subtasks whose durations are abnormal comprises:
sorting the updated predicted completion durations corresponding to the running subtasks in descending order; and
classifying the subtasks corresponding to the top k durations after sorting as abnormal subtasks.
5. The method of claim 1, wherein after determining the abnormal subtasks according to the updated predicted completion durations, the method further comprises:
executing a restart operation or a termination operation for the abnormal subtasks.
6. The method of any of claims 1-5, wherein the target task is a log analysis task or a machine learning task.
7. A cluster-computing-oriented abnormal subtask identification device, comprising:
a running data acquisition module configured to acquire, for a running target task, running data of a plurality of subtasks of the target task that run correspondingly on a plurality of nodes in a cluster;
a first training set construction module configured to construct a first training sample set based on the running data, wherein the first training sample set comprises task features and corresponding actual completion durations of each completed subtask among the plurality of subtasks;
a duration model training module configured to train a duration prediction model using the first training sample set;
a duration prediction module configured to determine, using the trained duration prediction model, a predicted completion duration of each running subtask among the plurality of subtasks;
a second training set construction module configured to construct a second training sample set based on the running data, wherein each completed subtask is taken as a positive sample and each running subtask is taken as a negative sample;
a probability model training module configured to train a probability prediction model based on the second training sample set;
a probability prediction module configured to predict a completion probability of each running subtask using the trained probability prediction model;
a duration update module configured to update the predicted completion duration of each running subtask to the quotient of the predicted completion duration and the completion probability;
an abnormal subtask determination module configured to determine, according to the updated predicted completion durations, the subtasks whose durations are abnormal as abnormal subtasks;
wherein the probability model training module comprises:
a subset construction unit configured to determine a first subset and a second subset of the second training sample set, wherein the first subset takes a plurality of first subtasks, being part of the running subtasks, as negative samples, the second subset takes a plurality of second subtasks, being the remaining running subtasks, as negative samples, and both the first subset and the second subset take all completed subtasks as positive samples;
a first training unit configured to train a first probability prediction model using the first subset; and
a second training unit configured to train a second probability prediction model using the second subset;
and wherein the probability prediction module comprises:
a first prediction unit configured to predict the completion probability of each second subtask using the trained first probability prediction model; and
a second prediction unit configured to predict the completion probability of each first subtask using the trained second probability prediction model.
8. The device of claim 7, wherein the abnormal subtask determination module is specifically configured to:
for each running subtask, classify the subtask as an abnormal subtask if its corresponding updated predicted completion duration is greater than a preset duration threshold.
CN202311435871.1A 2023-10-30 2023-10-30 Cluster computing-oriented abnormal subtask identification method and device Active CN117349775B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311435871.1A CN117349775B (en) 2023-10-30 2023-10-30 Cluster computing-oriented abnormal subtask identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311435871.1A CN117349775B (en) 2023-10-30 2023-10-30 Cluster computing-oriented abnormal subtask identification method and device

Publications (2)

Publication Number Publication Date
CN117349775A CN117349775A (en) 2024-01-05
CN117349775B true CN117349775B (en) 2024-04-26

Family

ID=89362984

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311435871.1A Active CN117349775B (en) 2023-10-30 2023-10-30 Cluster computing-oriented abnormal subtask identification method and device

Country Status (1)

Country Link
CN (1) CN117349775B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110070117A (en) * 2019-04-08 2019-07-30 腾讯科技(深圳)有限公司 A kind of data processing method and device
CN111274036A (en) * 2020-01-21 2020-06-12 南京大学 Deep learning task scheduling method based on speed prediction
CN111709447A (en) * 2020-05-14 2020-09-25 中国电力科学研究院有限公司 Power grid abnormality detection method and device, computer equipment and storage medium
CN112581191A (en) * 2020-08-14 2021-03-30 支付宝(杭州)信息技术有限公司 Training method and device of behavior prediction model
CN112882718A (en) * 2021-02-26 2021-06-01 百果园技术(新加坡)有限公司 Compiling processing method, device, equipment and storage medium
CN114047965A (en) * 2021-10-12 2022-02-15 润联软件系统(深圳)有限公司 Computation offloading method, satellite server, and computer-readable storage medium
CN114357858A (en) * 2021-12-06 2022-04-15 苏州方正璞华信息技术有限公司 Equipment deterioration analysis method and system based on multi-task learning model
CN114399669A (en) * 2022-03-25 2022-04-26 江苏智云天工科技有限公司 Target detection method and device
US11366660B1 (en) * 2019-06-20 2022-06-21 Amazon Technologies, Inc. Interface latency estimation based on platform subcomponent parameters
CN114821538A (en) * 2022-05-19 2022-07-29 北京地平线机器人技术研发有限公司 Training method and device of multi-task model
CN115329871A (en) * 2022-08-15 2022-11-11 腾讯科技(深圳)有限公司 Model training and model testing method, device, equipment and storage medium
CN116295409A (en) * 2023-02-14 2023-06-23 腾讯科技(深圳)有限公司 Route processing method, route processing device, computer readable medium and electronic equipment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Perspectives on cross-domain visual analysis of cyber-physical-social big data; Wei CHEN; Front Inform Technol Electron Eng; Dec. 31, 2021; full text *
Self-adaptive task allocation and scheduling of meta-tasks in non-dedicated heterogeneous computing; Ming Wu; International Journal of High Performance Computing and Networking; Feb. 6, 2006; full text *
An online cluster abnormal job prediction method; Xie Lixia; Journal of Beijing University of Posts and Telecommunications; Oct. 31, 2019; full text *
Identification of human error risk severity based on fuzzy logic method; Li Pengcheng; Chen Guohua; Dai Licao; Zhang Li; Zhao Ming; Atomic Energy Science and Technology; May 20, 2010 (No. 05); full text *

Also Published As

Publication number Publication date
CN117349775A (en) 2024-01-05

Similar Documents

Publication Publication Date Title
US11989647B2 (en) Self-learning scheduler for application orchestration on shared compute cluster
CN108052394B (en) Resource allocation method based on SQL statement running time and computer equipment
Hsu et al. Scout: An experienced guide to find the best cloud configuration
CN111625331B (en) Task scheduling method, device, platform, server and storage medium
EP3798930A2 (en) Machine learning training resource management
US20200226401A1 (en) Utilizing artificial intelligence to generate and update a root cause analysis classification model
Chen et al. Predicting job completion times using system logs in supercomputing clusters
CN112783616A (en) Concurrent conflict processing method and device and computer storage medium
CN114895773A (en) Energy consumption optimization method, system and device of heterogeneous multi-core processor and storage medium
CN115796041A (en) Neural network model deployment method, system, device and storage medium
WO2023040145A1 (en) Artificial intelligence-based text classification method and apparatus, electronic device, and medium
CN114564281A (en) Container scheduling method, device, equipment and storage medium
EP3798931A1 (en) Machine learning training resource management
CN117349775B (en) Cluster computing-oriented abnormal subtask identification method and device
CN113407343A (en) Service processing method, device and equipment based on resource allocation
Ouyang et al. An approach for modeling and ranking node-level stragglers in cloud datacenters
CN109739649B (en) Resource management method, device, equipment and computer readable storage medium
CN113986495A (en) Task execution method, device, equipment and storage medium
Hongyan et al. Predicting misconfiguration-induced unsuccessful executions of jobs in big data system
CN114466014A (en) Service scheduling method and device, electronic equipment and storage medium
CN104360898B (en) The method and apparatus of operation task
JP7424373B2 (en) Analytical equipment, analytical methods and analytical programs
CN115061794A (en) Method, device, terminal and medium for scheduling task and training neural network model
Le Hai et al. Potential of applying knn with soft walltime to improve scheduling performance
CN111290855A (en) GPU card management method, system and storage medium for multiple GPU servers in distributed environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant