CN117349775B - Cluster computing-oriented abnormal subtask identification method and device - Google Patents

Cluster computing-oriented abnormal subtask identification method and device

Info

Publication number
CN117349775B
CN117349775B (granted from application CN202311435871.1A)
Authority
CN
China
Prior art keywords
subtask
running
task
subtasks
training
Prior art date
Legal status
Active
Application number
CN202311435871.1A
Other languages
Chinese (zh)
Other versions
CN117349775A (en)
Inventor
周俊
朱海洋
陈为
陈正奎
肖杰
郑励
谈旭炜
储诚灿
潘奇豪
李可涵
钱晓英
黄理乐
Current Assignee
Products Zhongda Digital Technology Co ltd
Zhejiang University ZJU
Original Assignee
Products Zhongda Digital Technology Co ltd
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Products Zhongda Digital Technology Co ltd and Zhejiang University ZJU
Priority: CN202311435871.1A
Publication of CN117349775A
Application granted
Publication of CN117349775B
Legal status: Active

Classifications

    • G — Physics
    • G06 — Computing; calculating or counting
    • G06F — Electric digital data processing
    • G06F18/2433 — Single-class perspective, e.g. one-against-all classification; novelty detection; outlier detection
    • G06F18/2415 — Classification techniques based on parametric or probabilistic models, e.g. likelihood ratio
    • G06F18/27 — Regression, e.g. linear or logistic regression
    • G06N3/08 — Neural networks; learning methods
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The embodiments of this specification provide a cluster-computing-oriented abnormal subtask identification method and device. The method comprises: for a running task, acquiring running data of its multiple subtasks, each running on a corresponding node in the cluster; constructing a first training sample set from the running data, containing the task features and actual completion duration of each completed subtask; training a duration prediction model on the first set and using it to predict the completion duration of each still-running subtask; constructing a second training sample set from the running data, with completed subtasks as positive samples and running subtasks as negative samples; training a probability prediction model on the second set to predict each running subtask's completion probability; updating each running subtask's predicted completion duration to the quotient of the predicted duration and the corresponding completion probability; and identifying the abnormal subtasks from the updated predicted completion durations.

Description

Cluster computing-oriented abnormal subtask identification method and device
Technical Field
One or more embodiments of this specification relate to the field of computer technology, and in particular to a cluster-computing-oriented abnormal subtask identification method and apparatus.
Background
In data-center or cluster computing, a computing task is split into multiple subtasks, each executed in parallel on a different machine, and the results are aggregated once the last subtask completes.
A "slow machine" (straggler) is one of the rare, extremely slow subtasks within an otherwise normal task; it holds back completion of the whole job. Slow machines arise for various reasons: degraded machine hardware, machine resources being occupied by other tasks, bugs in the task code triggered by a machine's particular hardware environment, an overheated machine environment, and so on. The presence of slow machines can reduce overall performance by 30%-50%.
A scheme is therefore needed to identify slow-machine subtasks (also called abnormal subtasks) in a timely and accurate manner.
Disclosure of Invention
One or more embodiments of this specification describe a cluster-computing-oriented abnormal subtask identification method and apparatus that reduce prediction bias, so that slow machines can be predicted in a timely and accurate manner.
According to a first aspect, a cluster computation-oriented abnormal subtask recognition method is provided. The method comprises the following steps:
for a target task running in the cluster, acquiring running data of its multiple subtasks, each running on a corresponding node in the cluster; constructing a first training sample set based on the running data, where the first training sample set comprises the task features and corresponding actual completion duration of each completed subtask among the multiple subtasks; training a duration prediction model with the first training sample set, and determining the predicted completion duration of each running subtask with the trained model; constructing a second training sample set based on the running data, where each completed subtask serves as a positive sample and each running subtask as a negative sample; training a probability prediction model on the second training sample set, and predicting the completion probability of each running subtask with the trained model; updating the predicted completion duration of each running subtask to the quotient of its predicted completion duration and its completion probability; and determining, from the updated predicted completion durations, the abnormal subtasks whose durations are abnormal.
In one embodiment, for a target task running in the cluster, acquiring the running data of its multiple subtasks running on corresponding nodes comprises: acquiring the running data in response to the running time of the target task reaching a preset duration.
In one embodiment, training a probability prediction model based on the second training sample set and predicting the completion probability of each running subtask with the trained model comprises: determining a first subset and a second subset of the second training sample set, where the first subset takes one part of the running subtasks (the first subtasks) as negative samples, the second subset takes the other part (the second subtasks) as negative samples, and both subsets take all completed subtasks as positive samples; training a first probability prediction model with the first subset, and predicting the completion probability of each second subtask with it; and training a second probability prediction model with the second subset, and predicting the completion probability of each first subtask with it.
In one embodiment, determining the abnormal subtasks from the updated predicted completion durations comprises: for each running subtask, classifying it as abnormal when its updated predicted completion duration exceeds a preset duration threshold.
In one embodiment, determining the abnormal subtasks from the updated predicted completion durations comprises: sorting the updated predicted completion durations of the running subtasks in descending order, and classifying the subtasks holding the top k durations as abnormal subtasks.
In one embodiment, after the abnormal subtasks are determined from the updated predicted completion durations, the method further comprises: performing a restart operation or a termination operation on each abnormal subtask.
In one embodiment, the target task is a log analysis task or a machine learning task.
According to a second aspect, an abnormal subtask recognition device for cluster-oriented computing is provided. The device comprises:
A running data acquisition module, configured to acquire, for a target task running in the cluster, running data of its multiple subtasks, each running on a corresponding node; a first training set construction module, configured to construct a first training sample set based on the running data, comprising the task features and corresponding actual completion duration of each completed subtask; a duration model training module, configured to train a duration prediction model with the first training sample set; a duration prediction module, configured to determine the predicted completion duration of each running subtask with the trained duration prediction model; a second training set construction module, configured to construct a second training sample set based on the running data, with each completed subtask as a positive sample and each running subtask as a negative sample; a probability model training module, configured to train a probability prediction model on the second training sample set; a probability prediction module, configured to predict the completion probability of each running subtask with the trained probability prediction model; a duration update module, configured to update the predicted completion duration of each running subtask to the quotient of its predicted completion duration and its completion probability; and an abnormal subtask determination module, configured to determine the abnormal subtasks from the updated predicted completion durations.
In one embodiment, the probability model training module comprises: a subset construction unit, configured to determine a first subset and a second subset of the second training sample set, where the first subset takes one part of the running subtasks (the first subtasks) as negative samples, the second subset takes the other part (the second subtasks) as negative samples, and both subsets take all completed subtasks as positive samples; a first training unit, configured to train a first probability prediction model with the first subset; and a second training unit, configured to train a second probability prediction model with the second subset. The probability prediction module then comprises: a first prediction unit, configured to predict the completion probability of each second subtask with the trained first probability prediction model; and a second prediction unit, configured to predict the completion probability of each first subtask with the trained second probability prediction model.
In one embodiment, the abnormal subtask determination module is specifically configured to: for each running subtask, classify it as abnormal when its updated predicted completion duration exceeds a preset duration threshold.
According to a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
According to a fourth aspect, there is provided a computing device comprising a memory having executable code stored therein and a processor which when executing the executable code implements the method of the first aspect.
With the method and device provided by the embodiments of this specification, for a task still in operation, a prediction model is built online to predict the completion duration of each running subtask in real time, and the prediction bias is then corrected by re-weighting, so that potentially abnormal subtasks can be identified in a timely and accurate manner, effectively reducing or eliminating the performance loss caused by slow-running subtasks.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a block diagram of an abnormal subtask identification scheme for cluster-oriented computing disclosed in an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart of a method for identifying abnormal subtasks for cluster-oriented computing according to an embodiment of the present disclosure;
Fig. 3 is a schematic structural diagram of an abnormal subtask recognition device for cluster computing according to an embodiment of the present disclosure.
Detailed Description
The following describes the scheme provided in the present specification with reference to the drawings.
In light of the foregoing, a solution is needed that can timely and accurately identify slow machine subtasks.
In one approach, the completion times of subtasks are assumed to follow a certain distribution, such as a normal distribution. Accordingly, the completion times of the already-completed subtasks of a task are collected and fitted to a normal distribution, and the subtasks corresponding to points that deviate from the fitted distribution are identified as slow-machine subtasks.
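As an illustration of this distribution-fitting approach (a sketch only; the text does not prescribe an implementation, and the z-score threshold below is an assumption), the completed durations can be fitted to a normal distribution and one-sided outliers flagged:

```python
import math

def flag_outliers_normal(durations, z_threshold=2.5):
    """Fit a normal distribution to completed-subtask durations and flag
    those lying more than z_threshold standard deviations ABOVE the mean
    (one-sided: only slow subtasks are of interest)."""
    n = len(durations)
    mean = sum(durations) / n
    var = sum((d - mean) ** 2 for d in durations) / n
    std = math.sqrt(var) or 1e-9  # guard against zero variance
    return [i for i, d in enumerate(durations)
            if (d - mean) / std > z_threshold]

# Nine ordinary durations and one extreme straggler:
times = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7, 10.0, 60.0]
print(flag_outliers_normal(times))  # [9]
```

Note that a single extreme point inflates the fitted mean and variance, which is one weakness of this baseline on small samples.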
In another approach, historical running data from a large number of tasks is collected to build a training set for a slow-machine prediction model. However, the feature schema of different tasks tends to be unique, so a model trained on one task is difficult to apply directly to another; moreover, because slow-machine subtasks are rare, the training samples are highly imbalanced and the resulting predictions are poor.
Based on these observations, this application proposes an abnormal subtask identification scheme that is applied independently to each task and can identify potential slow-machine subtasks in real time while the task is still running.
FIG. 1 is a block diagram of the cluster-computing-oriented abnormal subtask identification scheme disclosed in an embodiment of this specification. Its key steps are delay prediction, re-weighting, and slow-machine identification. Delay prediction trains a model on the subtasks that have already completed during the task's run (which are, by construction, not slow machines) and uses it to predict the delay (completion time) of the still-running subtasks. Because this model may be biased toward non-slow-machine behaviour, the predicted delays are re-weighted: intuitively, the weighting function measures how much a running subtask's features differ from those of the completed subtasks, leaving the predicted delay of non-slow-machine-like subtasks essentially unchanged while inflating the predicted delay of the others, thereby reducing prediction bias. Slow-machine identification is then performed on the re-weighted delays, so slow machines can be predicted in a timely and accurate manner. The identified potential slow machines can then be handled, promptly eliminating their adverse effects.
The implementation of the above scheme will be described below in conjunction with fig. 2 and further embodiments.
Fig. 2 is a schematic flowchart of the cluster-computing-oriented abnormal subtask identification method disclosed in an embodiment of this specification. The method may be executed by any device, platform, or server with computing and processing capabilities, for example a cluster management platform.
As shown in fig. 2, the method comprises the steps of:
Step S210: for a target task running in the cluster, acquire running data of its multiple subtasks, each running on a corresponding node.
Step S220: construct a first training sample set based on the running data, comprising the task features and corresponding actual completion duration of each completed subtask.
Step S230: train a duration prediction model with the first training sample set, and determine the predicted completion duration of each running subtask with the trained model.
Step S240: construct a second training sample set based on the running data, with each completed subtask as a positive sample and each running subtask as a negative sample.
Step S250: train a probability prediction model on the second training sample set, and predict the completion probability of each running subtask with the trained model.
Step S260: update the predicted completion duration of each running subtask to the quotient of its predicted completion duration and its completion probability.
Step S270: determine the abnormal subtasks from the updated predicted completion durations.
The development of the above steps is described as follows:
First, in step S210, for a target task running in the cluster, the running data of its multiple subtasks, each running on a corresponding node in the cluster, is acquired.
The target task may be, for example, a data analysis task or a machine learning task; the embodiments of this specification do not limit the specific task content. The rules for splitting the target task into subtasks and the manner of scheduling each subtask onto its node may follow the existing art and are not detailed here. The nodes in the cluster may be individual computers, servers, virtual machines, and the like.
The trigger for acquiring the running data may be configured in advance as needed, or the acquisition may be triggered manually by an operator. For example, acquisition may be triggered when a predetermined time has elapsed since the target task started running, or automatically at predetermined intervals.
The running data may include the running status of each subtask, such as completed, running, or terminated. It may also include each subtask's start time, elapsed running time, running environment, and so on, where the running environment covers the software and/or hardware environment of the corresponding node. For example, the software environment may include the operating system type and version; the hardware environment may include the model, manufacturer, count, core count, or memory capacity of the central processing unit (CPU) or graphics processing unit (GPU).
From this, the operation data of each sub-task of the target task being operated can be obtained.
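The running data described above can be represented as one record per subtask. The following sketch uses illustrative field names (they are assumptions, not prescribed by the text):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SubtaskRecord:
    """One running-data record per subtask (field names illustrative)."""
    subtask_id: str
    node: str
    status: str                 # "completed" | "running" | "terminated"
    start_time: float           # epoch seconds
    elapsed: float              # running duration so far, in seconds
    cpu_model: str = ""
    cpu_cores: int = 0
    os_version: str = ""
    actual_duration: Optional[float] = None  # set once completed

rec = SubtaskRecord("t1-s3", "node-17", "running", 1_700_000_000.0, 42.5)
assert rec.actual_duration is None  # still running, no final duration yet
```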
Then, in step S220, a first training sample set is constructed based on the operation data, where the first training sample set includes task characteristics and corresponding actual completion time periods of each completed subtask in the plurality of subtasks.
It should be understood that the subtasks in the completed state are determined from the running status recorded in the running data, and the first training sample set is built from them. Note that "several" herein means one or more.
Specifically, each completed subtask yields one training sample: its subtask features serve as the sample features and its actual completion duration as the sample label. As described above, the features may include the start time, running duration, running environment, and so on.
Based on the first training sample set constructed above, step S230 may be executed, where the duration prediction model is trained using the first training sample set, and the predicted completion duration of each running sub-task in the plurality of sub-tasks is determined using the trained duration prediction model.
It can be appreciated that the duration prediction model is a regression model whose implementation can be chosen as needed, for example a deep neural network (DNN) or extreme gradient boosting (XGBoost).
Specifically, subtask features in each training sample are input into a duration prediction model to obtain a prediction result of the completion duration of the subtask, and then training loss is determined based on the prediction result and a corresponding sample label (namely the actual completion duration), so that model parameters of the duration prediction model are updated by using the training loss. After one or more iterations, a trained duration prediction model can be obtained.
Further, the predicted completion time of each running sub-task can be determined by using a trained time prediction model.
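To make steps S220-S230 concrete, the sketch below builds the first training sample set and fits a regressor. The text names DNN or XGBoost as candidate models; purely for illustration, a single-feature ordinary-least-squares fit stands in here (the feature values and data are made up):

```python
def fit_least_squares(xs, ys):
    """Ordinary least squares for y = a*x + b (single feature),
    standing in for the DNN/XGBoost regressor named in the text."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs) or 1e-9
    a = cov / var
    b = my - a * mx
    return lambda x: a * x + b

# First training sample set: (feature, actual completion duration)
# pairs taken from COMPLETED subtasks only.
completed = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
model = fit_least_squares(*zip(*completed))

# Predicted completion duration for a still-running subtask:
print(model(4.0))  # 40.0
```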
On the other hand, after the operation data is acquired in step S210, steps S240 and S250 may also be performed.
In step S240, a second training sample set is constructed based on the operational data, wherein each completed subtask is taken as a positive sample and each operational subtask is taken as a negative sample.
Specifically, the completed subtasks and the running subtasks are determined based on the running data, for each completed subtask, the subtask characteristics thereof are used as sample characteristics, the corresponding sample label is set to 1, so as to obtain the corresponding positive sample, and similarly, for each running subtask, the subtask characteristics thereof are used as sample characteristics, the corresponding sample label is set to 0, so as to obtain the corresponding negative sample.
In one embodiment, a terminated one of the plurality of subtasks may also be taken as a negative sample.
Based on the constructed second training sample set, step S250 is executed to train a probability prediction model based on the constructed second training sample set, and predict the probability of completion of each running subtask using the trained probability prediction model.
It should be appreciated that the probability prediction model is essentially a binary classification model; it may be implemented, for example, as a logistic regression model.
In one embodiment, a single probabilistic predictive model is trained. Specifically, sample characteristics of each training sample in the second training sample set are input into a probability prediction model to obtain a probability prediction result, and then training loss is determined by using the probability prediction result and a corresponding sample label, so that model parameters of the probability prediction model are updated by using the training loss. Thus, after one or more iterations, a trained probabilistic predictive model can be obtained.
Further, the characteristics of each running sub-task can be input into a probability prediction model, so that the corresponding predicted completion probability is obtained.
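A minimal sketch of steps S240-S250, assuming a single numeric feature per subtask (the feature choice and all data are illustrative): a hand-rolled logistic regression is trained with completed subtasks labelled 1 and running subtasks labelled 0, then queried for completion probabilities.

```python
import math

def train_logistic(feats, labels, lr=0.5, epochs=2000):
    """Single-feature logistic regression trained by stochastic gradient
    descent, standing in for the probability prediction model. Returns a
    function mapping a feature value to a completion probability."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return lambda x: 1.0 / (1.0 + math.exp(-(w * x + b)))

# Second training sample set: completed subtasks -> label 1 (positive),
# still-running subtasks -> label 0 (negative).
features = [0.9, 1.0, 1.2, 4.8, 5.0]
labels   = [1,   1,   1,   0,   0]
predict = train_logistic(features, labels)
assert predict(1.0) > 0.5 > predict(5.0)
```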
In the above embodiment, the samples used in the model's training phase overlap with those in its prediction phase: the same running-subtask samples appear in both, which may affect the accuracy of the predicted completion probabilities. Another embodiment is therefore proposed.
In another embodiment, the negative samples in the second training sample set are divided into two parts assigned to two subsets, and all completed subtasks are added to both subsets as positive samples. A separate probability prediction model is then trained on each subset, and the two models cross-predict: each predicts the completion probabilities of the running subtasks outside its own training subset.
Specifically, a first subset and a second subset of the second training sample set are determined, where the first subset takes one part of the running subtasks (the first subtasks) as negative samples, the second subset takes the other part (the second subtasks) as negative samples, and both subsets take all completed subtasks as positive samples. A first probability prediction model is trained on the first subset and used to predict the completion probability of each second subtask; meanwhile, a second probability prediction model is trained on the second subset and used to predict the completion probability of each first subtask. The completion probability of every running subtask is thereby obtained.
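The two-fold cross-prediction above can be sketched as follows. The trainer is passed in as a parameter; the stub trainer shown is a made-up placeholder just to demonstrate the data flow, not a real model:

```python
def cross_predict(positives, negatives, train_fn):
    """Split the running (negative) subtasks into halves A and B; a model
    trained on positives + A scores B, and a model trained on positives + B
    scores A, so no subtask is scored by a model that saw it in training.
    Returns probabilities aligned with `negatives`."""
    half = len(negatives) // 2
    a, b = negatives[:half], negatives[half:]
    model_a = train_fn(positives + a, [1] * len(positives) + [0] * len(a))
    model_b = train_fn(positives + b, [1] * len(positives) + [0] * len(b))
    return [model_b(x) for x in a] + [model_a(x) for x in b]

def stub_trainer(feats, labels):
    """Placeholder 'model': probability 1.0 if the feature is at or below
    the mean of the positive samples seen in training, else 0.1."""
    pos_mean = sum(f for f, l in zip(feats, labels) if l == 1) / max(labels.count(1), 1)
    return lambda x: 1.0 if x <= pos_mean else 0.1

probs = cross_predict([1.0, 1.1], [0.9, 5.0], stub_trainer)
print(probs)  # [1.0, 0.1]
```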
It should be understood that, in the description of steps S240 and S250, the first and second subsets may also be constructed directly from the running data rather than by first building the full second training sample set; in that case, "first subset" and "second subset" can be read as "first positive-and-negative sample set" and "second positive-and-negative sample set".
From the above, the predicted completion time of each running sub-task can be obtained by performing steps S220 and S230, and the completion probability of each running sub-task can be obtained by performing steps S240 and S250.
On this basis, step S260 is performed: the predicted completion duration of each running subtask is updated to the quotient of its predicted completion duration and its corresponding completion probability. This re-weighting of the predicted completion durations reduces the prediction bias.
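Step S260 itself is a one-liner: dividing by the completion probability leaves confidently-predicted subtasks unchanged and inflates the predicted duration of subtasks whose features look unlike any completed subtask. A sketch (the probability floor is an added safeguard, not from the text):

```python
def reweight(pred_durations, completion_probs, floor=1e-6):
    """Step S260: divide each predicted completion duration by the
    corresponding completion probability. The floor guards against
    division by a zero probability."""
    return [d / max(p, floor) for d, p in zip(pred_durations, completion_probs)]

# A confident prediction stays put; a doubtful one is inflated.
print(reweight([10.0, 10.0], [1.0, 0.25]))  # [10.0, 40.0]
```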
Then, in step S270, according to the updated predicted completion time, an abnormal subtask corresponding to the abnormal time is determined.
In one embodiment, each running subtask is classified as an abnormal subtask if its updated predicted completion duration exceeds a preset duration threshold.
In another embodiment, the updated predicted completion durations of the running subtasks are sorted in descending order, and the subtasks holding the top k durations are classified as abnormal. The value of k may be set by an operator, e.g. k = 2.
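Both identification embodiments of step S270 can be sketched in a few lines (the threshold and k values in the example are arbitrary):

```python
def flag_by_threshold(durations, threshold):
    """First embodiment: any running subtask whose re-weighted predicted
    duration exceeds a preset threshold is an abnormal subtask."""
    return [i for i, d in enumerate(durations) if d > threshold]

def flag_top_k(durations, k=2):
    """Second embodiment: sort re-weighted durations in descending order
    and flag the subtasks holding the top-k values."""
    order = sorted(range(len(durations)), key=lambda i: durations[i], reverse=True)
    return sorted(order[:k])

adjusted = [12.0, 95.0, 11.5, 60.0, 13.0]
print(flag_by_threshold(adjusted, 50.0))  # [1, 3]
print(flag_top_k(adjusted, k=2))          # [1, 3]
```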
According to an embodiment of another aspect, after step S270 the method may further include performing a restart operation on an abnormal subtask, after which it resumes running, or performing a termination operation, after which its running status becomes terminated; a terminated subtask may then be rescheduled onto another node. The handling of abnormal subtasks may be preconfigured by an operator or chosen after real-time observation.
It should be noted that the execution order of the steps is not fixed: they may run in the order illustrated in fig. 2 or in another order, for example S240 before S220, or S240 and S220 in parallel, as long as the data dependencies are respected.
In summary, with the cluster-computing-oriented abnormal subtask identification method disclosed in the embodiments of this specification, for a running target task, a prediction model is constructed online to predict the completion duration of each running subtask in real time, and the prediction deviation is then corrected by re-weighting. Potential abnormal subtasks can thus be identified promptly and accurately, effectively reducing or eliminating the performance loss caused by slow-running subtasks.
Corresponding to the above identification method, the embodiments of this specification also disclose an identification device. Fig. 3 is a schematic structural diagram of a cluster-computing-oriented abnormal subtask identification device according to an embodiment of this specification.
As shown in fig. 3, the apparatus 300 includes the following modules:
The running data acquisition module 302 is configured to acquire, for a running target task, running data of a plurality of subtasks of the target task that run correspondingly on a plurality of nodes in the cluster.
The first training set construction module 304 is configured to construct a first training sample set based on the running data, wherein the first training sample set includes task features and corresponding actual completion durations of each completed subtask among the plurality of subtasks.
The duration model training module 306 is configured to train a duration prediction model using the first training sample set.
The duration prediction module 308 is configured to determine, using the trained duration prediction model, the predicted completion duration of each running subtask among the plurality of subtasks.
The second training set construction module 310 is configured to construct a second training sample set based on the running data, in which each completed subtask is taken as a positive sample and each running subtask is taken as a negative sample.
The probability model training module 312 is configured to train a probability prediction model based on the second training sample set.
The probability prediction module 314 is configured to predict the completion probability of each running subtask using the trained probability prediction model.
The duration update module 316 is configured to update the predicted completion duration of each running subtask to the quotient of that duration and its completion probability.
The abnormal subtask determination module 318 is configured to determine, according to the updated predicted completion durations, the subtasks whose durations are abnormal.
In one embodiment, the running data acquisition module 302 is specifically configured to: acquire the running data in response to the running time of the target task reaching a preset duration.
In one embodiment, the probability model training module 312 includes: a subset construction unit configured to determine a first subset and a second subset of the second training sample set, wherein the first subset takes a part of the running subtasks, referred to as first subtasks, as negative samples, the second subset takes the remaining running subtasks, referred to as second subtasks, as negative samples, and both subsets take all completed subtasks as positive samples; a first training unit configured to train a first probability prediction model using the first subset; and a second training unit configured to train a second probability prediction model using the second subset.
Accordingly, the probability prediction module 314 includes: a first prediction unit configured to predict the completion probability of each second subtask using the trained first probability prediction model; and a second prediction unit configured to predict the completion probability of each first subtask using the trained second probability prediction model.
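This cross-prediction scheme — each model scores only the running subtasks it did not see as negative samples — can be sketched as follows. The callback-based API, the even/odd split, and all names are a hypothetical illustration of the idea, not the patent's implementation:

```python
import random

def cross_predict_completion(completed, running, train, predict):
    """Two-fold cross-prediction of completion probability (units 312/314).

    The running subtasks are split into two halves. Each probability model
    is trained with one half as negatives (plus all completed subtasks as
    positives) and scores only the OTHER half, so no running subtask is
    scored by a model that was trained to treat it as a negative sample.
    `train(positives=..., negatives=...)` and `predict(model, task)` are
    caller-supplied callbacks (assumed interface).
    """
    shuffled = list(running)
    random.Random(0).shuffle(shuffled)          # deterministic split for the demo
    first, second = shuffled[::2], shuffled[1::2]

    model_a = train(positives=completed, negatives=first)   # first subset
    model_b = train(positives=completed, negatives=second)  # second subset

    probs = {}
    probs.update({t: predict(model_a, t) for t in second})  # model A scores second subtasks
    probs.update({t: predict(model_b, t) for t in first})   # model B scores first subtasks
    return probs

# Toy stand-ins: "training" just remembers the negatives, and "prediction"
# returns a low probability only for tasks the model saw as negatives.
completed = ["c1", "c2"]
running = ["r1", "r2", "r3", "r4"]
probs = cross_predict_completion(
    completed, running,
    train=lambda positives, negatives: set(negatives),
    predict=lambda model, t: 0.1 if t in model else 0.9,
)
```

Because every running subtask is scored by the model that did not see it, the toy example assigns all of them the "unseen" probability, demonstrating that the cross-over avoids each model's pessimistic bias toward its own negative samples.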
In one embodiment, the abnormal subtask determination module 318 is specifically configured to: for each running subtask, classify the subtask as an abnormal subtask if its corresponding updated predicted completion duration is greater than a preset duration threshold.
In another embodiment, the abnormal subtask determination module 318 is specifically configured to: sort the updated predicted completion durations corresponding to the running subtasks in descending order, and classify the subtasks corresponding to the top k durations after sorting as abnormal subtasks.
In one embodiment, the apparatus 300 further includes an abnormal subtask processing module 320 configured to execute a restart operation or a termination operation for the abnormal subtask.
In one embodiment, the target task is a log analysis task or a machine learning task.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having a computer program stored thereon which, when executed in a computer, causes the computer to perform the method described in connection with Fig. 2.
According to an embodiment of yet another aspect, there is also provided a computing device including a memory and a processor, wherein executable code is stored in the memory, and the processor, when executing the executable code, implements the method described in connection with Fig. 2.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The foregoing embodiments have been provided for the purpose of illustrating the general principles of the present invention in further detail, and are not to be construed as limiting the scope of the invention, but are merely intended to cover any modifications, equivalents, improvements, etc. based on the teachings of the invention.

Claims (8)

1. A cluster-computing-oriented abnormal subtask identification method, characterized by comprising the following steps:
for a running target task, acquiring running data of a plurality of subtasks of the target task that run correspondingly on a plurality of nodes in a cluster;
constructing a first training sample set based on the running data, wherein the first training sample set comprises task features and corresponding actual completion durations of each completed subtask among the plurality of subtasks;
training a duration prediction model using the first training sample set, and determining, using the trained duration prediction model, a predicted completion duration of each running subtask among the plurality of subtasks;
constructing a second training sample set based on the running data, wherein each completed subtask is taken as a positive sample and each running subtask is taken as a negative sample;
training a probability prediction model based on the second training sample set, and predicting a completion probability of each running subtask using the trained probability prediction model;
updating the predicted completion duration of each running subtask to the quotient of the predicted completion duration and the corresponding completion probability;
determining, according to the updated predicted completion durations, the subtasks whose durations are abnormal as abnormal subtasks;
wherein training the probability prediction model based on the second training sample set and predicting the completion probability of each running subtask using the trained probability prediction model comprises:
determining a first subset and a second subset of the second training sample set, wherein the first subset takes a plurality of first subtasks, being part of the running subtasks, as negative samples, the second subset takes a plurality of second subtasks, being the remaining running subtasks, as negative samples, and both the first subset and the second subset take all completed subtasks as positive samples;
training a first probability prediction model using the first subset, and predicting the completion probability of each second subtask using the trained first probability prediction model; and
training a second probability prediction model using the second subset, and predicting the completion probability of each first subtask using the trained second probability prediction model.
2. The method of claim 1, wherein acquiring, for the running target task, the running data of the plurality of subtasks running correspondingly on the plurality of nodes in the cluster comprises:
acquiring the running data in response to the running time of the target task reaching a preset duration.
3. The method of claim 1, wherein determining, according to the updated predicted completion durations, the subtasks whose durations are abnormal comprises:
for each running subtask, classifying the subtask as an abnormal subtask if its corresponding updated predicted completion duration is greater than a preset duration threshold.
4. The method of claim 1, wherein determining, according to the updated predicted completion durations, the subtasks whose durations are abnormal comprises:
sorting the updated predicted completion durations corresponding to the running subtasks in descending order; and
classifying the subtasks corresponding to the top k durations after sorting as abnormal subtasks.
5. The method of claim 1, wherein after determining the abnormal subtasks according to the updated predicted completion durations, the method further comprises:
executing a restart operation or a termination operation for the abnormal subtasks.
6. The method of any of claims 1-5, wherein the target task is a log analysis task or a machine learning task.
7. A cluster-computing-oriented abnormal subtask identification device, comprising:
a running data acquisition module configured to acquire, for a running target task, running data of a plurality of subtasks of the target task that run correspondingly on a plurality of nodes in a cluster;
a first training set construction module configured to construct a first training sample set based on the running data, wherein the first training sample set comprises task features and corresponding actual completion durations of each completed subtask among the plurality of subtasks;
a duration model training module configured to train a duration prediction model using the first training sample set;
a duration prediction module configured to determine, using the trained duration prediction model, a predicted completion duration of each running subtask among the plurality of subtasks;
a second training set construction module configured to construct a second training sample set based on the running data, wherein each completed subtask is taken as a positive sample and each running subtask is taken as a negative sample;
a probability model training module configured to train a probability prediction model based on the second training sample set;
a probability prediction module configured to predict a completion probability of each running subtask using the trained probability prediction model;
a duration update module configured to update the predicted completion duration of each running subtask to the quotient of the predicted completion duration and the completion probability;
an abnormal subtask determination module configured to determine, according to the updated predicted completion durations, the subtasks whose durations are abnormal as abnormal subtasks;
wherein the probability model training module comprises:
a subset construction unit configured to determine a first subset and a second subset of the second training sample set, wherein the first subset takes a plurality of first subtasks, being part of the running subtasks, as negative samples, the second subset takes a plurality of second subtasks, being the remaining running subtasks, as negative samples, and both the first subset and the second subset take all completed subtasks as positive samples;
a first training unit configured to train a first probability prediction model using the first subset; and
a second training unit configured to train a second probability prediction model using the second subset;
and wherein the probability prediction module comprises:
a first prediction unit configured to predict the completion probability of each second subtask using the trained first probability prediction model; and
a second prediction unit configured to predict the completion probability of each first subtask using the trained second probability prediction model.
8. The device of claim 7, wherein the abnormal subtask determination module is specifically configured to:
for each running subtask, classify the subtask as an abnormal subtask if its corresponding updated predicted completion duration is greater than a preset duration threshold.
CN202311435871.1A 2023-10-30 2023-10-30 Cluster computing-oriented abnormal subtask identification method and device Active CN117349775B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311435871.1A CN117349775B (en) 2023-10-30 2023-10-30 Cluster computing-oriented abnormal subtask identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311435871.1A CN117349775B (en) 2023-10-30 2023-10-30 Cluster computing-oriented abnormal subtask identification method and device

Publications (2)

Publication Number Publication Date
CN117349775A CN117349775A (en) 2024-01-05
CN117349775B true CN117349775B (en) 2024-04-26

Family

ID=89362984

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311435871.1A Active CN117349775B (en) 2023-10-30 2023-10-30 Cluster computing-oriented abnormal subtask identification method and device

Country Status (1)

Country Link
CN (1) CN117349775B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110070117A (en) * 2019-04-08 2019-07-30 腾讯科技(深圳)有限公司 A kind of data processing method and device
CN111274036A (en) * 2020-01-21 2020-06-12 南京大学 Deep learning task scheduling method based on speed prediction
CN111709447A (en) * 2020-05-14 2020-09-25 中国电力科学研究院有限公司 Power grid abnormality detection method and device, computer equipment and storage medium
CN112581191A (en) * 2020-08-14 2021-03-30 支付宝(杭州)信息技术有限公司 Training method and device of behavior prediction model
CN112882718A (en) * 2021-02-26 2021-06-01 百果园技术(新加坡)有限公司 Compiling processing method, device, equipment and storage medium
CN114047965A (en) * 2021-10-12 2022-02-15 润联软件系统(深圳)有限公司 Computation offloading method, satellite server, and computer-readable storage medium
CN114357858A (en) * 2021-12-06 2022-04-15 苏州方正璞华信息技术有限公司 Equipment deterioration analysis method and system based on multi-task learning model
CN114399669A (en) * 2022-03-25 2022-04-26 江苏智云天工科技有限公司 Target detection method and device
US11366660B1 (en) * 2019-06-20 2022-06-21 Amazon Technologies, Inc. Interface latency estimation based on platform subcomponent parameters
CN114821538A (en) * 2022-05-19 2022-07-29 北京地平线机器人技术研发有限公司 Training method and device of multi-task model
CN115329871A (en) * 2022-08-15 2022-11-11 腾讯科技(深圳)有限公司 Model training and model testing method, device, equipment and storage medium
CN116295409A (en) * 2023-02-14 2023-06-23 腾讯科技(深圳)有限公司 Route processing method, route processing device, computer readable medium and electronic equipment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Perspectives on cross-domain visual analysis of cyber-physical-social big data; Wei CHEN; Front Inform Technol Electron Eng; Dec. 31, 2021; full text *
Self-adaptive task allocation and scheduling of meta-tasks in non-dedicated heterogeneous computing; Ming Wu; International Journal of High Performance Computing and Networking; Feb. 6, 2006; full text *
An online cluster abnormal job prediction method; Xie Lixia; Journal of Beijing University of Posts and Telecommunications; Oct. 31, 2019; full text *
Identification of human error risk severity based on fuzzy logic method; Li Pengcheng; Chen Guohua; Dai Licao; Zhang Li; Zhao Ming; Atomic Energy Science and Technology; May 20, 2010 (No. 05); full text *

Also Published As

Publication number Publication date
CN117349775A (en) 2024-01-05

Similar Documents

Publication Publication Date Title
US11989647B2 (en) Self-learning scheduler for application orchestration on shared compute cluster
CN108052394B (en) Resource allocation method based on SQL statement running time and computer equipment
Hsu et al. Scout: An experienced guide to find the best cloud configuration
CN111625331B (en) Task scheduling method, device, platform, server and storage medium
EP3798930A2 (en) Machine learning training resource management
US20200226401A1 (en) Utilizing artificial intelligence to generate and update a root cause analysis classification model
Chen et al. Predicting job completion times using system logs in supercomputing clusters
CN112783616A (en) Concurrent conflict processing method and device and computer storage medium
CN114895773A (en) Energy consumption optimization method, system and device of heterogeneous multi-core processor and storage medium
CN115796041A (en) Neural network model deployment method, system, device and storage medium
WO2023040145A1 (en) Artificial intelligence-based text classification method and apparatus, electronic device, and medium
CN114564281A (en) Container scheduling method, device, equipment and storage medium
EP3798931A1 (en) Machine learning training resource management
CN117349775B (en) Cluster computing-oriented abnormal subtask identification method and device
CN113407343A (en) Service processing method, device and equipment based on resource allocation
Ouyang et al. An approach for modeling and ranking node-level stragglers in cloud datacenters
CN109739649B (en) Resource management method, device, equipment and computer readable storage medium
CN113986495A (en) Task execution method, device, equipment and storage medium
Hongyan et al. Predicting misconfiguration-induced unsuccessful executions of jobs in big data system
CN114466014A (en) Service scheduling method and device, electronic equipment and storage medium
CN104360898B (en) The method and apparatus of operation task
JP7424373B2 (en) Analytical equipment, analytical methods and analytical programs
CN115061794A (en) Method, device, terminal and medium for scheduling task and training neural network model
Le Hai et al. Potential of applying knn with soft walltime to improve scheduling performance
CN111290855A (en) GPU card management method, system and storage medium for multiple GPU servers in distributed environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant