US20230325235A1 - Training task queuing cause analysis method and system, device and medium - Google Patents

Training task queuing cause analysis method and system, device and medium

Info

Publication number
US20230325235A1
Authority
US
United States
Prior art keywords
data
cluster center
sample data
cluster
sample
Prior art date
Legal status
Granted
Application number
US18/036,864
Other versions
US11775344B1 (en)
Inventor
Wenxiao Wang
Current Assignee
Suzhou Wave Intelligent Technology Co Ltd
Original Assignee
Suzhou Wave Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Wave Intelligent Technology Co Ltd
Assigned to INSPUR SUZHOU INTELLIGENT TECHNOLOGY CO., LTD. reassignment INSPUR SUZHOU INTELLIGENT TECHNOLOGY CO., LTD. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: WANG, Wenxiao
Application granted
Publication of US11775344B1
Publication of US20230325235A1
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3885 Concurrent instruction execution, e.g. pipeline or look ahead, using a plurality of independent parallel functional units
    • G06F 9/3889 Concurrent instruction execution using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute
    • G06F 9/3891 Concurrent instruction execution using a plurality of independent parallel functional units organised in groups of units sharing resources, e.g. clusters
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/20 Ensemble learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management



Abstract

The present application discloses a method for analyzing a cause of training-task queuing, wherein the method includes the steps of: obtaining a required resource that is required by a training task inputted by a user and a remaining resource of a cluster; in response to the remaining resource not satisfying the required resource, obtaining a plurality of cluster center data that are pre-generated in a cluster model; regarding the required resource and the remaining resource as sample data, and calculating distances between the sample data and each of the cluster center data; and feeding back a cause corresponding to the cluster center datum that has the minimum distance from the sample data. The present application further discloses a system, a computer device and a readable storage medium.

Description

    CROSS-REFERENCE TO RELATED APPLICATION
  • The present application claims priority to Chinese patent application No. 202011402706.2, filed with the Chinese Patent Office on Dec. 4, 2020 and entitled "METHOD AND SYSTEM FOR ANALYZING CAUSE OF TRAINING-TASK QUEUING, DEVICE AND MEDIUM", which is incorporated herein by reference in its entirety.
  • TECHNICAL FIELD
  • The present application relates to the technical field of deep learning and, more particularly, to a method and system for analyzing a cause of training-task queuing, a device and a storage medium.
  • BACKGROUND
  • Currently, as artificial intelligence is gradually and extensively applied across various industries, industry experts tend to solve complicated problems from the perspective of artificial intelligence, various algorithm frameworks are applied in various fields, and at the same time algorithmic models are continuously improved by algorithm engineers. All of those trends require the support of strong calculation power, so calculation power is becoming more and more important. In the original production mode, algorithm engineers tended to use a single server or several shared servers for algorithm training. However, with the continuous expansion of team sizes, the problem of resource allocation and contention becomes increasingly serious, which greatly affects the working efficiency of the algorithm engineers. Therefore, for large teams of algorithm engineers, it is very necessary to establish a calculation-power resource pool for resource management. On the platform applying the algorithms, the engineers only need to apply for resources, and the remaining matters are left to the platform resource dispatcher, which greatly increases production efficiency.
  • Usually, the platform dispatcher performs the dispatching operation only when the calculation-power resource pool can satisfy the resource applied for by the task. If this condition is not satisfied, the platform dispatcher places the task into a waiting queue and performs the dispatching again when sufficient resources become available, and this process is repeated in a loop. However, during user rush hours, the central processing unit (CPU) resource and the graphics processing unit (GPU) resource in the resource pool are exhausted very quickly, and the subsequent tasks cannot be executed and can only wait for dispatching.
  • SUMMARY
  • In view of the above, in order to overcome at least one aspect of the above-described problem, an embodiment of the present application provides a method for analyzing a cause of training-task queuing, wherein the method includes the steps of:
      • obtaining a required resource that is required by a training task inputted by a user and a remaining resource of a cluster;
      • in response to the remaining resource not satisfying the required resource, obtaining a plurality of cluster center data that are pre-generated in a cluster model;
      • regarding the required resource and the remaining resource as sample data, and calculating distances between the sample data and each of the cluster center data; and
      • feeding back a cause corresponding to the cluster center datum that has the minimum distance from the sample data.
  • In some embodiments, the step of regarding the required resource and the remaining resource as the sample data includes:
      • performing quantization processing to the sample data.
  • In some embodiments, the method further includes:
      • saving the sample data; and
      • in response to a quantity of the saved sample data reaching a threshold, updating the cluster model by using the saved sample data.
  • In some embodiments, the step of, by using the saved sample data, updating the cluster model includes:
      • randomly generating the plurality of cluster center data;
      • calculating distances between the saved sample data and each of the current cluster center data, and distances between the original sample data in the cluster model and each of the current cluster center data, so as to divide the saved sample data and the original sample data in the cluster model among the corresponding cluster center data;
      • recalculating each of the cluster center data by using the sample data divided into it; and
      • in response to the cluster center data obtained by the calculation being different from the cluster center data that are currently used for the sample-data division, performing the division of the sample data again by using the cluster center data obtained by the calculation, so as to perform iterative training, until the cluster center data obtained by the calculation and the cluster center data that are currently used for the sample-data division are the same.
  • On the basis of the same inventive concept, according to another aspect of the present application, an embodiment of the present application further provides a system for analyzing a cause of training-task queuing, wherein the system includes:
      • an obtaining module, wherein the obtaining module is configured for obtaining a required resource that is required by a training task inputted by a user and a remaining resource of a cluster;
      • a determining module, wherein the determining module is configured for, in response to the remaining resource not satisfying the required resource, obtaining a plurality of cluster center data that are pre-generated in a cluster model;
      • a calculating module, wherein the calculating module is configured for regarding the required resource and the remaining resource as sample data, and calculating distances between the sample data and each of the cluster center data; and
      • a feeding-back module, wherein the feeding-back module is configured for feeding back a cause corresponding to the cluster center datum that has the minimum distance from the sample data.
  • In some embodiments, the calculating module is further configured for:
      • performing quantization processing to the sample data.
  • In some embodiments, the system further includes an updating module, and the updating module is configured for:
      • saving the sample data; and
      • in response to a quantity of the saved sample data reaching a threshold, updating the cluster model by using the saved sample data.
  • In some embodiments, the updating module is further configured for:
      • randomly generating the plurality of cluster center data;
      • calculating distances between the saved sample data and each of the current cluster center data, and distances between the original sample data in the cluster model and each of the current cluster center data, so as to divide the saved sample data and the original sample data in the cluster model among the corresponding cluster center data;
      • recalculating each of the cluster center data by using the sample data divided into it; and
      • in response to the cluster center data obtained by the calculation being different from the cluster center data that are currently used for the sample-data division, performing the division of the sample data again by using the cluster center data obtained by the calculation, so as to perform iterative training, until the cluster center data obtained by the calculation and the cluster center data that are currently used for the sample-data division are the same.
  • On the basis of the same inventive concept, according to another aspect of the present application, an embodiment of the present application further provides a computer device, wherein the computer device includes:
      • at least one processor; and
      • a memory, the memory storing a computer program that is executable by the processor, wherein the processor, when executing the program, implements the steps of the method for analyzing a cause of training-task queuing according to any one of the above embodiments.
  • On the basis of the same inventive concept, according to another aspect of the present application, an embodiment of the present application further provides a computer-readable storage medium, the computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method for analyzing a cause of training-task queuing according to any one of the above embodiments.
  • The present application has one of the following advantageous technical effects. By using the solutions according to the present application, the platform's capacity to distinguish the causes of queuing tasks is increased, the user is enabled to rapidly learn the cause of the task queuing, the usability of the platform and the user experience are improved, and the research and development of related deep-learning platforms is correspondingly guided.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • In order to more clearly illustrate the technical solutions of the embodiments of the present application or of the prior art, the figures that are required for describing the embodiments or the prior art will be briefly introduced below. Apparently, the figures described below show merely some embodiments of the present application, and a person skilled in the art may obtain other figures from them without creative effort.
  • FIG. 1 is a schematic flow chart of a method for analyzing a cause of training-task queuing according to an embodiment of the present application;
  • FIG. 2 is a schematic diagram showing the structure of a system for analyzing a cause of training-task queuing according to an embodiment of the present application;
  • FIG. 3 is a schematic diagram showing the structure of a computer device according to an embodiment of the present application; and
  • FIG. 4 is a schematic diagram showing the structure of a computer-readable storage medium according to an embodiment of the present application.
  • DETAILED DESCRIPTION
  • In order to make the objects, the technical solutions and the advantages of the present application clearer, the embodiments of the present application will be further described in detail with reference to the embodiments and the drawings.
  • It should be noted that all of the expressions using “first” and “second” in the embodiments of the present application are intended to distinguish two different entities or different parameters that have the same names. It may be seen that “first” and “second” are merely for the convenience of the expression, and should not be construed as a limitation on the embodiments of the present application, which will not be explained in detail in the subsequent embodiments.
  • According to an aspect of the present application, an embodiment of the present application discloses a method for analyzing a cause of training-task queuing. As shown in FIG. 1 , the method may include the following steps:
      • S1: obtaining a required resource that is required by a training task inputted by a user and a remaining resource of a cluster; wherein the cluster herein refers to a computing device group, such as a computer group, a server group and so on.
      • S2: in response to the remaining resource not satisfying the required resource, obtaining a plurality of cluster center data that are pre-generated in a cluster model;
      • S3: regarding the required resource and the remaining resource as sample data, and calculating distances between the sample data and each of the cluster center data; and
      • S4: feeding back a cause corresponding to the cluster center datum that has the minimum distance from the sample data.
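As an illustration only, the four steps above may be sketched as follows, with hypothetical resource vectors, cluster centers and cause labels (none of these values come from the embodiments; a real deployment would use centers learned from the platform's own sample data):

```python
import math

# Hypothetical cluster centers, each pre-associated with a queuing cause.
# A sample vector here is (required_cpus, required_gpus, free_cpus, free_gpus).
CLUSTER_CENTERS = {
    (32, 8, 2, 0): "GPU resources in the pool are exhausted",
    (64, 0, 4, 8): "CPU resources in the pool are exhausted",
    (8, 1, 64, 16): "task requests a node/GPU type that is unavailable",
}

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def analyze_queuing_cause(required, remaining):
    """S1-S4: if the remaining resource cannot satisfy the required one,
    match the (required, remaining) sample against the nearest center."""
    req_cpu, req_gpu = required
    rem_cpu, rem_gpu = remaining
    if req_cpu <= rem_cpu and req_gpu <= rem_gpu:
        return None  # resources suffice; the task is dispatched, no queuing
    sample = (req_cpu, req_gpu, rem_cpu, rem_gpu)      # S3: build sample data
    center = min(CLUSTER_CENTERS, key=lambda c: euclidean(sample, c))
    return CLUSTER_CENTERS[center]                     # S4: feed back the cause

print(analyze_queuing_cause((32, 8), (2, 0)))
# GPU resources in the pool are exhausted
```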
  • By using the solutions according to the present application, the platform's capacity to distinguish the causes of queuing tasks is increased, the user is enabled to rapidly learn the cause of the task queuing, the usability of the platform and the user experience are improved, and the research and development of related deep-learning platforms is correspondingly guided.
  • In some embodiments, in the step S2 of, in response to the remaining resource not satisfying the required resource, obtaining the plurality of cluster center data that are pre-generated in the cluster model, particularly, a real-time monitoring thread may be started to monitor the task queuing. When the calculation-power resource pool cannot satisfy the required resource that is required by the training task that the user applies for this time, the particular cause needs to be determined. In other words, the task information submitted to the platform is quantized, including the quantity of the CPUs used, the quantity of the GPUs used, the GPU types, the specified scheduling nodes and so on. The Euclidean distances between those data and the cluster center data are then calculated to find the center point closest to the data, and the corresponding queuing cause is matched.
  • It should be noted that the queuing cause represented by each of the cluster center data is associated in advance, and each of the cluster center data is obtained by calculation using a plurality of sample data, for example, by calculating the average value or the mean squared error.
  • In some embodiments, the step of regarding the required resource and the remaining resource as the sample data includes:
      • performing quantization processing to the sample data.
  • The user, according to his or her own demand, inputs the relevant configuration parameters of the task, including the quantity of the CPUs used, the quantity of the GPUs used, the resource groups, the GPU types, the specified scheduling nodes and so on. Some of those parameters, for example the GPU types, cannot be expressed directly as numerical values, and therefore quantization processing is performed on them; for instance, GPUs of different types may be distinguished by using labels or serial numbers.
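As a sketch of such quantization processing, the categorical parameters (GPU type, scheduling node) may be mapped to serial numbers so that the whole task configuration becomes a numeric sample vector; the parameter names and label tables below are illustrative assumptions, not values from the embodiments:

```python
# Illustrative label tables: categorical values are assigned serial numbers.
GPU_TYPE_LABELS = {"V100": 0, "A100": 1, "T4": 2}
NODE_LABELS = {"any": 0, "node-01": 1, "node-02": 2}

def quantize_task(config):
    """Turn a task-configuration dict into a flat numeric sample vector."""
    return [
        float(config["cpus"]),
        float(config["gpus"]),
        float(GPU_TYPE_LABELS[config["gpu_type"]]),      # GPU type as a label
        float(NODE_LABELS[config.get("node", "any")]),   # node as a label
    ]

vec = quantize_task({"cpus": 8, "gpus": 2, "gpu_type": "A100", "node": "node-02"})
print(vec)  # [8.0, 2.0, 1.0, 2.0]
```

With every parameter numeric, Euclidean distances to the cluster centers can be computed directly on these vectors.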
  • In some embodiments, the method further includes:
      • saving the sample data; and
      • in responding to that a quantity of the saved sample data reaches a threshold, by using the saved sample data, updating the cluster model.
  • In order to enrich the cluster sample information, after each determination, the sample data of that determination are automatically added into a sample database, and once the quantity of the newly added sample data reaches a threshold, the cluster center data are calculated again according to all of the sample data in the sample database at that point.
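The accumulate-then-recluster behaviour described here may be sketched as follows; the threshold value and the retrain callback are illustrative placeholders:

```python
class SampleDatabase:
    """Collects sample vectors; once `threshold` new samples have been
    added, invokes `retrain` on the full database and resets the counter."""

    def __init__(self, threshold, retrain):
        self.threshold = threshold
        self.retrain = retrain       # callback: recompute the cluster centers
        self.samples = []            # all saved sample data
        self.new_since_update = 0    # newly added samples since last retrain

    def add(self, sample):
        self.samples.append(sample)
        self.new_since_update += 1
        if self.new_since_update >= self.threshold:
            self.retrain(self.samples)   # recluster over *all* samples
            self.new_since_update = 0

updates = []
db = SampleDatabase(threshold=3, retrain=lambda s: updates.append(len(s)))
for v in [[1], [2], [3], [4], [5], [6], [7]]:
    db.add(v)
print(updates)  # retrained after the 3rd and 6th samples: [3, 6]
```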
  • In some embodiments, the step of, by using the saved sample data, updating the cluster model includes:
      • randomly generating the plurality of cluster center data;
      • calculating distances between the saved sample data and each of the current cluster center data and distances between original sample data in the cluster model and each of the current cluster center data, to divide the saved sample data and the original sample data in the cluster model into the corresponding cluster center data;
      • by using the sample data in each of the cluster center data, recalculating the corresponding cluster center data; and
      • in responding to that the cluster center data obtained by the calculation are different from the cluster center data that are currently used for the sample-data division, performing division of the sample data again by using the cluster center data obtained by the calculation, to perform iterative training, till the cluster center data obtained by the calculation and the cluster center data that are currently used for the sample-data division are the same.
  • According to the service demand, the quantity of the queuing causes is set; then cluster center data equal in number to the quantity of the queuing causes are randomly generated, and the distances between each of the sample data and each of the cluster center data are individually calculated. According to those distances, each of the sample data is allocated to the category of the center point closest to it. Subsequently, according to the sample data divided into each of the cluster center data, the cluster center data are calculated again, for example as the average value of those sample data. It is then determined whether the recalculated cluster center data are the same as the cluster center data currently used for the sample-data division. When they are not the same, the iterative training continues, till the cluster center data no longer change. For example, the randomly generated cluster center data are A and B, and the sample data with the serial numbers 1, 2, 3, 4, 5 and 6 are divided into A and B, with 1, 2 and 3 divided into A and 4, 5 and 6 divided into B. Subsequently, the cluster center datum A′ is calculated by using the sample data of the serial numbers 1, 2 and 3, and the cluster center datum B′ is calculated by using the sample data of the serial numbers 4, 5 and 6. It is then determined whether A′ and B′ are the same as A and B. When they are different, all of the sample data are divided again by using A′ and B′, and new cluster center data are calculated. Accordingly, after multiple iterations, the cluster center data no longer change, and the finally obtained cluster center data are mapped to the causes of task queuing.
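The iterative procedure above is essentially the classical k-means loop. A minimal sketch, assuming squared Euclidean distance and initializing the centers from randomly chosen samples (one common reading of "randomly generating" the centers); an empty cluster simply keeps its previous center:

```python
import random

def kmeans(samples, k, seed=0):
    """k-means sketch matching the iteration described above: random initial
    centers, assign each sample to its nearest center, recompute centers as
    means, repeat until the centers stop changing."""
    rnd = random.Random(seed)
    centers = rnd.sample(samples, k)          # random initial cluster centers
    while True:
        # Divide each sample into the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for s in samples:
            d = [sum((a - b) ** 2 for a, b in zip(s, c)) for c in centers]
            clusters[d.index(min(d))].append(s)
        # Recalculate each center as the mean of its assigned samples.
        new_centers = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:            # centers unchanged -> converged
            return centers
        centers = new_centers
```

On well-separated data the loop converges to the same centers regardless of the random initialization.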
  • The solutions of the present application use a clustering algorithm, taking the task resource usage amounts (the CPU and GPU usage amounts) and the current load state of the platform server cluster (the CPU load and the GPU load) as the basic information of the queuing-category factors, search for the cluster centers in a limited number of iterations, and map the cluster centers to the corresponding queuing causes. When the user submits a task and the task is not dispatched because the resources are not matched, the current basic factors are compared with the cluster centers to find the center point closest to them, the cause of the queuing of the task submitted this time is attributed to that type, and the queuing cause is issued by using a monitoring mechanism, so that the user can timely acquire the queuing cause and make the corresponding resource changes.
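The comparison-and-feedback step described above can be sketched as follows. The center values and cause labels are illustrative assumptions, not values from the disclosure:

```python
def diagnose_queue_cause(sample, centers, causes):
    """Report the cause mapped to the cluster center nearest to the queued
    task's feature vector (Euclidean distance). Sketch only."""
    dists = [
        sum((a - b) ** 2 for a, b in zip(sample, c)) ** 0.5
        for c in centers
    ]
    return causes[dists.index(min(dists))]

# Hypothetical centers over (CPU demand, GPU demand) and their mapped causes.
centers = [[8.0, 4.0], [64.0, 0.0]]
causes = ["insufficient GPUs of the requested type", "insufficient CPUs"]
diagnose_queue_cause([60.0, 0.0], centers, causes)
```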
  • On the basis of the same inventive concept, according to another aspect of the present application, an embodiment of the present application further provides a system for analyzing a cause of training-task queuing 400. As shown in FIG. 2 , the system includes:
      • an obtaining module 401, wherein the obtaining module 401 is configured for obtaining a required resource that is required by a training task inputted by a user and a remaining resource of a cluster;
      • a determining module 402, wherein the determining module 402 is configured for, in responding to that the remaining resource does not satisfy the required resource, obtaining a plurality of cluster center data that are pre-generated in a cluster model;
      • a calculating module 403, wherein the calculating module 403 is configured for, regarding the required resource and the remaining resource as sample data, and calculating distances between the sample data and each of the cluster center data; and
      • a feeding-back module 404, wherein the feeding-back module 404 is configured for, feeding back a cause corresponding to the cluster center datum that has a minimum distance with the sample data.
  • In some embodiments, the calculating module is further configured for:
      • performing quantization processing to the sample data.
  • In some embodiments, the system further includes an updating module, and the updating module is configured for:
      • saving the sample data; and
      • in responding to that a quantity of the saved sample data reaches a threshold, by using the saved sample data, updating the cluster model.
  • In some embodiments, the updating module is further configured for:
      • randomly generating the plurality of cluster center data;
      • calculating distances between the saved sample data and each of the current cluster center data and distances between original sample data in the cluster model and each of the current cluster center data, to divide the saved sample data and the original sample data in the cluster model into the corresponding cluster center data;
      • by using the sample data in each of the cluster center data, recalculating the corresponding cluster center data; and
      • in responding to that the cluster center data obtained by the calculation are different from the cluster center data that are currently used for the sample-data division, performing division of the sample data again by using the cluster center data obtained by the calculation, to perform iterative training, till the cluster center data obtained by the calculation and the cluster center data that are currently used for the sample-data division are the same.
  • On the basis of the same inventive concept, according to another aspect of the present application, as shown in FIG. 3 , an embodiment of the present application further provides a computer device 501, wherein the computer device 501 includes:
      • at least one processor 520; and
      • a memory 510, the memory 510 storing a computer program 511 that is executable in the processor, wherein the processor 520, when executing the computer program, implements the steps of the method for analyzing a cause of training-task queuing according to any one of the above embodiments.
  • On the basis of the same inventive concept, according to another aspect of the present application, as shown in FIG. 4 , an embodiment of the present application further provides a computer-readable storage medium 601, the computer-readable storage medium 601 storing a computer program instruction 610, wherein the computer program instruction 610, when executed by a processor, implements the steps of the method for analyzing a cause of training-task queuing according to any one of the above embodiments.
  • Finally, it should be noted that a person skilled in the art may understand that all or some of the processes of the methods according to the above embodiments may be implemented by instructing relevant hardware through a computer program; the program may be stored in a computer-readable storage medium, and the program, when executed, may include the processes of the embodiments of the methods stated above.
  • Furthermore, it should be noted that the computer-readable storage medium (for example, a memory) as used herein may be a volatile memory or a non-volatile memory, or may include both of a volatile memory and a non-volatile memory.
  • A person skilled in the art should also understand that the various illustrative logical blocks, modules, electric circuits and algorithm steps described with reference to the disclosure herein may be embodied as electronic hardware, computer software or a combination thereof. In order to clearly explain the interchangeability between the hardware and the software, the description above has been given generally in terms of the functions of the various illustrative components, blocks, modules, electric circuits and steps. Whether those functions are embodied as software or hardware depends on the application and the design constraints exerted on the entire system. A person skilled in the art may employ different modes to implement the functions for each application, but those implementation decisions should not be considered as departing from the scope disclosed by the embodiments of the present application.
  • The illustrative embodiments disclosed by the present application are described above. However, it should be noted that many variations and modifications may be made without departing from the scope of the embodiments of the present application defined by the claims. The functions, steps and/or acts of the process claims according to the disclosed embodiments described herein are not required to be implemented in any specific sequence. Furthermore, although the elements of the embodiments of the present application may be described or claimed in a singular form, unless explicitly limited as singular, they may also be comprehended as plural.
  • It should be understood that, as used herein, unless the context clearly supports an exception, the singular form “a” is intended to encompass a plural form. It should also be understood that, as used herein, the “and/or” refers to including any and all feasible combinations of one or more relatively listed items.
  • The serial numbers of the embodiments of the present application are merely for the purpose of description, and do not indicate the relative preferences of the embodiments.
  • A person skilled in the art may understand that all or some of the steps for implementing the above embodiments may be completed by hardware, and may also be completed by using a program to instruct relevant hardware. The program may be stored in a computer-readable storage medium. The above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk and so on.
  • A person skilled in the art should understand that the discussion on any of the above embodiments is merely illustrative, and is not intended to imply that the scope (including the claims) of the embodiments of the present application is limited to those examples. Under the concept of the embodiments of the present application, the embodiments or the technical features of different embodiments may be combined, and many other variations of different aspects of the embodiments of the present application as stated above may exist, which are not provided in detail for brevity. Therefore, any omissions, modifications, equivalent substitutions and improvements that are made within the spirit and the principle of the embodiments of the present application should fall within the protection scope of the embodiments of the present application.

Claims (21)

1. A method for analyzing a cause of training-task queuing, comprising:
obtaining a required resource that is required by a training task inputted by a user and a remaining resource of a cluster;
in responding to that the remaining resource does not satisfy the required resource, obtaining a plurality of cluster center data that are pre-generated in a cluster model;
regarding the required resource and the remaining resource as sample data, and calculating distances between the sample data and each of the cluster center data; and
feeding back a cause corresponding to the cluster center datum that has a minimum distance with the sample data.
2. The method according to claim 1, wherein the step of regarding the required resource and the remaining resource as the sample data comprises:
performing quantization processing to the sample data.
3. The method according to claim 1, wherein the method further comprises:
saving the sample data; and
in responding to that a quantity of the saved sample data reaches a threshold, by using the saved sample data, updating the cluster model.
4. The method according to claim 3, wherein the step of, by using the saved sample data, updating the cluster model comprises:
randomly generating the plurality of cluster center data;
calculating distances between the saved sample data and each of the current cluster center data and distances between original sample data in the cluster model and each of the current cluster center data, to divide the saved sample data and the original sample data in the cluster model into the corresponding cluster center data;
by using the sample data in each of the cluster center data, recalculating the corresponding cluster center data; and
in responding to that the cluster center data obtained by the calculation are different from the cluster center data that are currently used for the sample-data division, performing division of the sample data again by using the cluster center data obtained by the calculation to perform iterative training, till the cluster center data obtained by the calculation and the cluster center data that are currently used for the sample-data division are the same.
5-8. (canceled)
9. A computer device, comprising:
at least one processor; and
a memory, the memory storing a computer program that is executable in the processor, wherein the processor, when executing the program, implements the steps of the method according to claim 1.
10. A computer-readable storage medium, the computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to claim 1.
11. The method according to claim 1, wherein in responding to that the remaining resource does not satisfy the required resource, obtaining the plurality of cluster center data that are pre-generated in the cluster model comprises:
starting up a real-time monitoring thread to monitor the training-task queuing.
12. The method according to claim 1, wherein calculating distances between the sample data and each of the cluster center data comprises:
calculating Euclidean distances between the sample data and each of the cluster center data.
13. The method according to claim 1, wherein the sample data comprises configuration parameters related to the training task, and the configuration parameters comprises:
a quantity of used CPUs, a quantity of used GPUs, resource groups, GPU types and specified scheduling nodes.
14. The computer device according to claim 9, wherein the operation of regarding the required resource and the remaining resource as the sample data comprises:
performing quantization processing to the sample data.
15. The computer device according to claim 9, wherein the method further comprises:
saving the sample data; and
in responding to that a quantity of the saved sample data reaches a threshold, by using the saved sample data, updating the cluster model.
16. The computer device according to claim 15, wherein the operation of, by using the saved sample data, updating the cluster model comprises:
randomly generating the plurality of cluster center data;
calculating distances between the saved sample data and each of the current cluster center data and distances between original sample data in the cluster model and each of the current cluster center data, to divide the saved sample data and the original sample data in the cluster model into the corresponding cluster center data;
by using the sample data in each of the cluster center data, recalculating the corresponding cluster center data; and
in responding to that the cluster center data obtained by the calculation are different from the cluster center data that are currently used for the sample-data division, performing division of the sample data again by using the cluster center data obtained by the calculation, to perform iterative training, till the cluster center data obtained by the calculation and the cluster center data that are currently used for the sample-data division are the same.
17. The computer device according to claim 9, wherein in responding to that the remaining resource does not satisfy the required resource, obtaining a plurality of cluster center data that are pre-generated in a cluster model comprises:
starting up a real-time monitoring thread to monitor the training-task queuing.
18. The computer device according to claim 9, wherein calculating distances between the sample data and each of the cluster center data comprises:
calculating Euclidean distances between the sample data and each of the cluster center data.
19. The computer device according to claim 9, wherein the sample data comprises configuration parameters related to the training task, and the configuration parameters comprises:
a quantity of used CPUs, a quantity of used GPUs, resource groups, GPU types and specified scheduling nodes.
20. The computer-readable storage medium according to claim 10, wherein the operation of regarding the required resource and the remaining resource as the sample data comprises:
performing quantization processing to the sample data.
21. The computer-readable storage medium according to claim 10, wherein the method further comprises:
saving the sample data; and
in responding to that a quantity of the saved sample data reaches a threshold, by using the saved sample data, updating the cluster model.
22. The computer-readable storage medium according to claim 21, wherein the operation of, by using the saved sample data, updating the cluster model comprises:
randomly generating the plurality of cluster center data;
calculating distances between the saved sample data and each of the current cluster center data and distances between original sample data in the cluster model and each of the current cluster center data, to divide the saved sample data and the original sample data in the cluster model into the corresponding cluster center data;
by using the sample data in each of the cluster center data, recalculating the corresponding cluster center data; and
in responding to that the cluster center data obtained by the calculation are different from the cluster center data that are currently used for the sample-data division, performing division of the sample data again by using the cluster center data obtained by the calculation, to perform iterative training, till the cluster center data obtained by the calculation and the cluster center data that are currently used for the sample-data division are the same.
23. The computer-readable storage medium according to claim 10, wherein in responding to that the remaining resource does not satisfy the required resource, obtaining a plurality of cluster center data that are pre-generated in a cluster model comprises:
starting up a real-time monitoring thread to monitor the training-task queuing.
24. The computer-readable storage medium according to claim 10, wherein calculating distances between the sample data and each of the cluster center data comprises:
calculating Euclidean distances between the sample data and each of the cluster center data.
US18/036,864 2020-12-04 2021-09-29 Training task queuing cause analysis method and system, device and medium Active US11775344B1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN202011402706.2 2020-12-04
CN202011402706.2A CN112463334B (en) 2020-12-04 2020-12-04 Training task queuing reason analysis method, system, equipment and medium
PCT/CN2021/121870 WO2022116667A1 (en) 2020-12-04 2021-09-29 Training task queuing cause analysis method and system, device and medium

Publications (2)

Publication Number Publication Date
US11775344B1 US11775344B1 (en) 2023-10-03
US20230325235A1 true US20230325235A1 (en) 2023-10-12

Family

ID=74804816

Family Applications (1)

Application Number Title Priority Date Filing Date
US18/036,864 Active US11775344B1 (en) 2020-12-04 2021-09-29 Training task queuing cause analysis method and system, device and medium

Country Status (3)

Country Link
US (1) US11775344B1 (en)
CN (1) CN112463334B (en)
WO (1) WO2022116667A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112463334B (en) * 2020-12-04 2023-08-18 苏州浪潮智能科技有限公司 Training task queuing reason analysis method, system, equipment and medium
CN117576822B (en) * 2023-11-20 2024-04-30 上海徽视科技集团有限公司 Queuing and number calling guiding system based on Internet platform

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5329596A (en) * 1991-09-11 1994-07-12 Hitachi, Ltd. Automatic clustering method
US5361323A (en) * 1990-11-29 1994-11-01 Sharp Kabushiki Kaisha Signal encoding device
US9152860B2 (en) * 2013-05-10 2015-10-06 Tantrum Street LLC Methods and apparatus for capturing, processing, training, and detecting patterns using pattern recognition classifiers
US9720998B2 (en) * 2012-11-19 2017-08-01 The Penn State Research Foundation Massive clustering of discrete distributions
US10241832B2 (en) * 2016-06-20 2019-03-26 Steering Solutions Ip Holding Corporation Runtime determination of real time operating systems task timing behavior
US10990900B2 (en) * 2018-04-09 2021-04-27 Veda Data Solutions, Inc. Scheduling machine learning tasks, and applications thereof
US11234629B2 (en) * 2017-11-27 2022-02-01 Shanghai Lepu CloudMed Co., Ltd. Method and device for self-learning dynamic electrocardiography analysis employing artificial intelligence
US11347546B2 (en) * 2019-03-15 2022-05-31 Shanghai Sensetime Intelligent Technology Co., Ltd Task scheduling method and device, and computer storage medium
US11366806B2 (en) * 2019-08-05 2022-06-21 The SQLNet Company GmbH Automated feature generation for machine learning application
US11507419B2 (en) * 2020-03-25 2022-11-22 EMC IP Holding Company LLC Method,electronic device and computer program product for scheduling computer resources in a task processing environment

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100420315C (en) 2006-08-31 2008-09-17 华为技术有限公司 Method and system for obtaining congestion cause
US10261839B2 (en) 2016-11-02 2019-04-16 International Business Machines Corporation Outlier and root cause determination of excessive resource usage in a virtual machine environment
CN108153587B (en) 2017-12-26 2021-05-04 北京航空航天大学 Slow task reason detection method for big data platform
CN109828833B (en) * 2018-11-02 2020-09-29 上海帆一尚行科技有限公司 Queuing system and method for neural network training task
CN109634748A (en) * 2018-12-12 2019-04-16 深圳前海微众银行股份有限公司 Cluster resource dispatching method, device, equipment and computer readable storage medium
CN110609742B (en) * 2019-09-25 2023-01-06 苏州浪潮智能科技有限公司 Method and device for configuring queues of Kubernetes scheduler
CN111191794B (en) * 2019-12-29 2023-03-14 广东浪潮大数据研究有限公司 Training task processing method, device and equipment and readable storage medium
CN111198767A (en) * 2020-01-07 2020-05-26 平安科技(深圳)有限公司 Big data resource processing method and device, terminal and storage medium
CN112463334B (en) 2020-12-04 2023-08-18 苏州浪潮智能科技有限公司 Training task queuing reason analysis method, system, equipment and medium


Also Published As

Publication number Publication date
CN112463334A (en) 2021-03-09
US11775344B1 (en) 2023-10-03
WO2022116667A1 (en) 2022-06-09
CN112463334B (en) 2023-08-18


Legal Events

Date Code Title Description
AS Assignment

Owner name: INSPUR SUZHOU INTELLIGENT TECHNOLOGY CO., LTD., CHINA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:WANG, WENXIAO;REEL/FRAME:063631/0512

Effective date: 20230314

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STCF Information on status: patent grant

Free format text: PATENTED CASE