WO2023066304A1 - Job running parameter optimization method applied to supercomputing cluster scheduling (应用于超算集群调度的作业运行参数优化方法) - Google Patents

Job running parameter optimization method applied to supercomputing cluster scheduling

Info

Publication number
WO2023066304A1
WO2023066304A1 · PCT/CN2022/126219 · CN2022126219W
Authority
WO
WIPO (PCT)
Prior art keywords
job
parameter
parameter configuration
application
test
Prior art date
Application number
PCT/CN2022/126219
Other languages
English (en)
French (fr)
Inventor
张文帅
李会民
李京
Original Assignee
University of Science and Technology of China (中国科学技术大学)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China (中国科学技术大学)
Publication of WO2023066304A1

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 — Arrangements for program control, e.g. control units
    • G06F 9/06 — Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 — Multiprogramming arrangements
    • G06F 9/50 — Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 — Partitioning or combining of resources
    • G06F 9/5005 — Allocation of resources to service a request
    • G06F 9/5027 — Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/5038 — considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G06F 2209/00 — Indexing scheme relating to G06F9/00
    • G06F 2209/50 — Indexing scheme relating to G06F9/50
    • G06F 2209/5021 — Priority

Definitions

  • The invention relates to the field of supercomputing clusters, and in particular to a method for optimizing job running parameters applied to supercomputing cluster scheduling.
  • Computing software, such as but not limited to VASP, runs on the supercomputing cluster.
  • When submitting a job for such software, the user needs to set running environment parameters, running resource parameters, the application's own input parameters, and other running parameters.
  • For such software, one or more parallel parameters need to be specified as input parameters, such as the total number of CPU cores required or, in the case of a multi-layer parallel structure, the number of parallel computing tasks assigned to each layer. Users can adjust these parallel parameters to significantly increase computing speed without changing the computed results. At present, however, much software cannot itself predict a near-optimal set of parallel parameters from the input file and the system's hardware and software environment.
  • The system simply uses the parallel parameters submitted by the user for the calculation; it neither tests faster calculation parameters on the user's behalf nor, in particular, automatically optimizes and modifies the specific application's own input parameters.
  • The computing software on current supercomputing clusters tends to have a multi-layer parallel structure, and the input parameter space of the corresponding applications is becoming increasingly complex. It is difficult for supercomputing cluster users with only limited computing experience to obtain ideal running speeds, which leaves a large number of jobs in the cluster system running inefficiently.
  • To this end, the present invention proposes a job running parameter optimization method applied to supercomputing cluster scheduling, which optimizes the running environment, running resources, and application input parameters of cluster jobs to improve computing efficiency without affecting the calculation accuracy of application runs.
  • The present invention adopts the following technical scheme:
  • A method for optimizing job running parameters applied to supercomputing cluster scheduling comprises: obtaining an application job submitted by a user and multiple sets of different job parameter configurations corresponding to the application job; running the application job with each set of job parameter configurations; and analyzing the running results according to set parameter configuration judgment conditions to screen the optimal parameter configuration from the multiple sets of job parameter configurations.
  • The parameter configuration judgment conditions include the reduced parallel efficiency, i.e., the parallel efficiency of the supercomputing cluster when the amount of resources used to run the application job is increased by a fixed multiple. The optimal parameter configuration is pushed to the user, or the job parameter configuration of the application job is modified according to the optimal parameter configuration.
  • The parameter configuration judgment conditions also include running cost and job calculation duration. Analyzing the running results according to the set judgment conditions and screening the optimal parameter configuration from the multiple sets of job parameter configurations includes: when the running costs are equal, selecting the job parameter configuration with the minimum job calculation duration as the optimal parameter configuration; when the running costs differ, obtaining the optimal parameter configuration from the multiple sets by combining running cost and reduced parallel efficiency. Here, the running cost is the amount of hardware resources, or the total fixed assets, occupied when the application job runs with a given job parameter configuration.
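As a sketch, this two-branch selection rule might look as follows in Python; `Config` is an illustrative record type, and `better` stands in for the pairwise cost-and-efficiency comparator, which the method leaves configurable.

```python
from dataclasses import dataclass

@dataclass
class Config:
    params: dict      # the job parameter configuration itself
    cost: float       # running cost: hardware amount or fixed-asset total
    duration: float   # job calculation duration measured in the test run

def select_optimal(configs, better):
    """Screen the optimal configuration from test-run results.

    When all running costs are equal, the configuration with the minimum
    job calculation duration wins; otherwise the configurations are
    compared pairwise with `better`, which combines running cost and
    reduced parallel efficiency.
    """
    if len({c.cost for c in configs}) == 1:
        return min(configs, key=lambda c: c.duration)
    best = configs[0]
    for c in configs[1:]:
        best = better(best, c)
    return best
```

Because `better` receives two configurations and returns the preferred one, the pairwise comparison reduces to a single fold over the candidate list.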
  • Combining running cost and reduced parallel efficiency to obtain the optimal parameter configuration from multiple sets of job parameter configurations includes: comparing the job parameter configurations pairwise, where any pair consists of a first job parameter configuration and a second job parameter configuration; calculating the quality judgment index E_r; and determining the first or the second job parameter configuration as the better one according to how E_r compares with a set threshold. Here:
  • R1 is the running cost corresponding to the first job parameter configuration;
  • R2 is the running cost corresponding to the second job parameter configuration, with R1 ≤ R2;
  • T1 is the job calculation duration corresponding to the first job parameter configuration;
  • T2 is the job calculation duration corresponding to the second job parameter configuration;
  • n is a constant, with n > 1.
  • Running the application job with each of the multiple sets of job parameter configurations includes: combining each set of job parameter configurations with the application job and running it on idle hardware resources in the cluster to form a test job. The test job contains part of the program of the corresponding application job, and the hardware resources it occupies are at least partly the same as those required by the corresponding application job. While a test job is running, if any hardware resource it occupies is applied for by any running application job, the test job stops running.
  • The job parameter configuration includes at least one of an original parameter configuration, an estimated parameter configuration, and a supplementary parameter configuration. The original parameter configuration is the initial job parameter configuration set by the user. The estimated parameter configuration is obtained from the parameter estimation model corresponding to the application category: the model's input is the information of the job to be run, its output is the estimated parameter configuration for that job, and the model is an empirical model and/or a model obtained by big-data training. The supplementary parameter configuration is one or more sets of job parameter configurations obtained by substituting at least one of the original and estimated parameter configurations, as input, into a set parameter variation model, which changes one or more parameters of the input job parameter configuration according to set rules to form and output a new job parameter configuration. Alternatively, the parameter estimation model may be set to output multiple sets of job parameter configurations, with the best of them used as the estimated parameter configuration.
  • Combining the multiple sets of job parameter configurations and running the application job on idle cluster hardware resources to form test jobs includes: obtaining all test jobs corresponding to the multiple sets of job parameter configurations as a test job set. When the application job is run, if any test job in the corresponding set has not run or is still running, the test jobs in the set are stopped and deleted, and the estimated parameter configuration corresponding to the application job is used as the optimal parameter configuration.
  • The test job has a maximum running time; when the running time of any test job reaches this maximum, the test job is terminated and deleted from the test job set it belongs to.
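A minimal sketch of this timeout rule, assuming test jobs are tracked as plain records with a start timestamp (the field names are hypothetical):

```python
def prune_timed_out(test_jobs, now, max_running_time):
    """Split a test job set into jobs still within their time budget and
    jobs whose running time has reached the maximum.

    A test job that hits the limit is assumed to carry a poor parameter
    configuration; on a real cluster it would also be killed, here it is
    merely removed from its test job set.
    """
    kept, dropped = [], []
    for job in test_jobs:
        if now - job["started_at"] >= max_running_time:
            dropped.append(job)
        else:
            kept.append(job)
    return kept, dropped
```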
  • The training database of the parameter estimation model is the historical job database of the corresponding application category.
  • The historical job database contains the actual running records of application jobs in the corresponding application category.
  • The historical job database also includes the job parameter configurations and calculation completion times used by application jobs of the corresponding category during test runs.
  • the job operation parameter optimization method applied to supercomputing cluster scheduling further includes: adding the optimal parameter configuration to the historical job database.
  • The method for optimizing job running parameters applied to supercomputing cluster scheduling proposed by the present invention can compare the various job parameter configurations it obtains, so as to find the job parameter configuration that achieves the best computing efficiency, i.e., the optimal parameter configuration.
  • By setting the optimization target parameters, the present invention allows flexible choice of which parameters to optimize, even including the application's own input parameters, which solves the currently unaddressed problem that the parameter space of an application's own input parameters is too complex to optimize.
  • the obtained optimal parameter configuration can be pushed to the cluster user according to the authority set by the cluster user, or the application job submitted by the user can be directly optimized according to the optimal parameter configuration.
  • the invention realizes the automatic optimization of the parameter configuration of the application jobs submitted by the users of the supercomputing cluster, makes up for the defect that most users do not have the ability to optimize the parameter configuration, and is conducive to improving the computing efficiency of the supercomputing cluster as a whole.
  • When the present invention is applied to improve the running speed of application jobs in a supercomputing cluster, selecting parameters according to the optimized parameter configuration can significantly reduce the amount of hardware resources a job occupies without significantly reducing its running speed. This improves the utilization efficiency of the cluster's hardware resources, increases the number of jobs the cluster completes per unit time, improves the cluster's economic benefits, and helps users reduce queuing time and enjoy faster computation.
  • The method of judging parameter configurations by the quality judgment index E_r is clearly defined. It achieves a good balance between speed and hardware resource utilization efficiency, avoiding situations where computing resources are greatly increased for only a small gain in computing performance.
  • The application job and the pre-selected job parameter configurations, i.e., the original, estimated, and supplementary parameter configurations, are combined to generate test jobs. Running the test jobs amounts to trial-running the application job under each job parameter configuration, accurately estimating its execution speed under each configuration and providing actual running data to support judging the merits of the original, estimated, and supplementary parameter configurations, which further improves the reliability of the optimal parameter configuration finally obtained.
  • Test jobs are restricted to execute only on idle hardware resources in the supercomputing cluster, and all test jobs are preemptible: a running test job can be interrupted and stopped, yielding all or part of its hardware resources to higher-priority jobs, i.e., application jobs (formal jobs). Thus job parameter configurations can be tested with test jobs without increasing the queuing time of application jobs.
  • The mapping between the optimal parameter configuration obtained from the test runs and the application job is added to the historical job database of the category the application job belongs to, improving the sample quality of that database. This helps improve the quality of the parameter estimation model, further improving the accuracy and reliability of the estimated parameter configuration; it also increases the benefit of using the estimated parameter configuration as the optimal parameter configuration, safeguarding the computing efficiency and hardware resource utilization of all application jobs in the cluster.
  • Fig. 1 is a flow chart of the method for optimizing job running parameters applied to supercomputing cluster scheduling proposed by the present invention.
  • Fig. 2 is a flow chart of the method for optimizing job running parameters applied to supercomputing cluster scheduling provided in Embodiment 1.
  • The number of hardware resources occupied when running the application job, or the total fixed assets of the occupied hardware resources, is the running cost corresponding to the job parameter configuration.
  • Test job: a test job includes part of the program of the corresponding application job. The running time of the application job when run with the same job parameter configuration as its test job is recorded as the complete running time.
  • The running time of a test job is significantly less than its corresponding complete running time, and the two are positively correlated.
  • the corresponding test job can be simplified to 3 or 5 iterations to reduce the test time.
  • Preemption mechanism: once the hardware resources used by running application job A are applied for by another application job B, application job A stops running to free up hardware resources for application job B. This stopping mechanism for application job A is known as the preemption mechanism.
  • The test job is preemptible, meaning its execution priority is lower than that of every application job (formal job): once any of its hardware resources is applied for by any application job, the test job stops running and releases the resources it occupies.
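The preemption rule can be sketched as follows; test jobs are modeled as hypothetical records holding a state and a set of node identifiers, which is an assumption rather than the patent's own data model.

```python
def preempt(test_jobs, requested_nodes):
    """Stop every running test job that holds any node an application
    job has applied for, and report the nodes freed.

    Test jobs run only on idle resources and always yield to application
    (formal) jobs, so a single resource request is enough to stop them.
    """
    freed = set()
    for job in test_jobs:
        if job["state"] == "running" and job["nodes"] & requested_nodes:
            job["state"] = "stopped"
            freed |= job["nodes"]
    return freed
```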
  • Job parameter configuration includes, for example, one or more of the environment parameters of the cluster system when the job is executed, the hardware resource configuration parameters used by the job, and the input parameters of the application itself.
  • Environment parameters of the cluster system: under Linux, these are the system's environment-variable parameters, including but not limited to stack and cache usage limit parameters and thread-count parameters (such as, but not limited to, OMP_STACKSIZE and ulimit settings).
  • Hardware resource configuration parameters used by the job: such as, but not limited to, the number of processes started, the number of CPU hardware cores occupied by each process, the distribution of processes across nodes and CPU cores, and the configuration parameters of accelerator cards (such as GPUs) attached to and used by each process.
  • The application's own input parameters must be internal running parameters that do not affect the precision requirements of the calculation results; they are generally specified in the application's input file and take different forms in different applications.
  • For a VASP application job, they include, but are not limited to, KPAR, NCORE, NPAR, NSIM, and other parameters used to divide or aggregate the various computing tasks.
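For concreteness, a job parameter configuration covering the three parameter classes above might be represented like this; every value is hypothetical and illustrative, not a recommended setting.

```python
# Hypothetical example only: the values are illustrative, not tuned.
job_parameter_configuration = {
    "environment": {           # cluster system environment parameters
        "OMP_STACKSIZE": "512m",
        "OMP_NUM_THREADS": "1",
        "ulimit_stack": "unlimited",
    },
    "resources": {             # hardware resource configuration parameters
        "processes": 64,
        "cores_per_process": 1,
        "nodes": 2,
        "gpus_per_process": 0,
    },
    "application_input": {     # VASP's own input parameters (INCAR file)
        "KPAR": 4,             # k-point parallelization groups
        "NCORE": 8,            # cores that work on one orbital
        "NSIM": 4,             # bands treated simultaneously
    },
}
```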
  • The job running parameter optimization method applied to supercomputing cluster scheduling proposed in this embodiment, as shown in FIG. 2, includes, for example, the following steps.
  • SA1: Obtain an application job submitted by a user.
  • SA2: Obtain the application category to which the application job belongs, select a parameter estimation model according to the application category and the running parameters to be optimized, and obtain the estimated parameter configuration by combining the application job with the parameter estimation model.
  • the input of the parameter estimation model is the information of the job to be run, and the output is the estimated parameter configuration corresponding to the job to be run.
  • the parameter estimation model can adopt an empirical model, that is, a manual setting.
  • the parameter estimation model can also be obtained through big data training.
  • the training database of the parameter estimation model is the historical job database of the corresponding application category.
  • The historical job database contains the job parameter configurations and calculation completion times used by application jobs of the corresponding application category during actual runs.
  • the original parameter configuration is the user's original job parameter configuration, that is, the initial parameter value of the application job submitted by the user.
  • The supplementary parameter configuration is the job parameter configuration obtained by substituting the original parameter configuration and/or the estimated parameter configuration into the preset parameter variation model. Alternatively, when the parameter estimation model outputs multiple sets of job parameter configurations, the best set is used as the estimated parameter configuration and the remaining output sets are recorded as supplementary parameter configurations.
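One minimal way to realize a parameter variation model is to map each parameter to a list of alternative values and emit one new configuration per changed value; this rule format is an assumption, since the patent only requires that set rules change one or more parameters of the input configuration.

```python
def vary(config, rules):
    """Parameter variation model: for every rule, output a copy of the
    input job parameter configuration with one parameter changed.

    `rules` maps a parameter name to alternative values; alternatives
    equal to the current value are skipped so no duplicate of the input
    configuration is produced.
    """
    out = []
    for name, alternatives in rules.items():
        for value in alternatives:
            if config.get(name) != value:
                new = dict(config)
                new[name] = value
                out.append(new)
    return out
```

For example, substituting an estimated configuration with NCORE = 8 into a rule offering NCORE in {4, 8, 16} yields two supplementary configurations, with NCORE 4 and 16.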
  • The test job uses the preemption mechanism; it includes part of the program of the corresponding application job, and the hardware resources it occupies are wholly or partly the same as those required by the corresponding application job.
  • SA6: Send the optimal parameter configuration to the user according to the authority set by the user, or directly optimize the parameter configuration of the application job according to the optimal parameter configuration and execute it.
  • the optimal parameter configuration is sent to the user in the form of information notification, and the user decides whether to adopt it or not and manually modifies it.
  • If the authority set by the user is to modify the parameter configuration automatically, the input information of the application job can be modified directly by the program so that the application job runs with the optimal parameter configuration.
  • In step SA2, when big-data training is used to obtain the parameter estimation model, the collected historical data are classified by application category, and a parameter estimation model for each application category is trained on the historical job database formed from that category's historical data. This ensures the reliability of configuring an application job's parameters according to the parameter estimation model and the accuracy of the estimated parameter configuration.
  • The data sources of the historical job database include actually run application jobs, whose job parameter configurations and job calculation durations are definite historical data. For an application job whose parameter configuration has been optimized, its optimal parameter configuration and corresponding job calculation duration are stored in the historical job database of the category the job belongs to, which improves the database's sample quality and thus the accuracy of the parameter estimation model.
  • The original parameter configuration, estimated parameter configuration, and each group of supplementary parameter configurations corresponding to the application job can be judged against the set parameter configuration judgment conditions, combined with the running data of each corresponding test job, ensuring that the optimal parameter configuration finally obtained is accurate and reliable.
  • The test jobs obtained are placed in a pending test queue, and test jobs in this queue are restricted to run only on idle hardware resources, i.e., hardware resources that have not been applied for or occupied by any application job.
  • In this way, test jobs run without affecting the running of application jobs; that is, trial-run evaluation of the original, estimated, and each group of supplementary parameter configurations of an application job is achieved. This helps find, before the application job actually runs, a job parameter configuration that gives it better hardware resource utilization efficiency, thereby optimizing the application job's parameter configuration, improving the cluster's overall hardware resource utilization efficiency, and balancing computing speed against efficient hardware use.
  • The method for optimizing job running parameters applied to supercomputing cluster scheduling seeks a better balance between computing speed and hardware utilization efficiency without reducing computing accuracy. Therefore, in this embodiment the job parameter configuration to be optimized is restricted to job parameters that do not affect the calculation accuracy of the application job and only affect its hardware resource utilization efficiency.
  • the optimal parameter configuration corresponding to the VASP application job may include KPAR, NCORE, NPAR, and NSIM and other parameters used to divide or aggregate various computing tasks.
  • This embodiment further restricts the parameter items contained in the original, estimated, and supplementary parameter configurations: the parameter categories are identical, but between different configurations at least one parameter item has a different specific value or attribute. In other words, the original parameter configuration, the estimated parameter configuration, and all supplementary parameter configurations share a unified format.
  • When the application job is run in step SA5, if any of its test jobs has not finished, i.e., any test job corresponding to the application job has not run or is still running, the execution of all test jobs corresponding to the application job is stopped and deleted, and the estimated parameter configuration corresponding to the application job is used as the optimal parameter configuration.
  • All test jobs corresponding to an application job are recorded as that application job's test job set. If the test job set has not finished running before the application job runs, the original, estimated, and each group of supplementary parameter configurations of the application job cannot be effectively compared; at that point, continuing to run the test jobs to obtain an optimal parameter configuration can no longer benefit the running of the application job.
  • Therefore, when the application job runs, if any of its test jobs has not finished, i.e., test jobs of the application job remain in the pending test queue, all of them are deleted from the queue. This avoids redundant occupation of hardware resources, further improves hardware resource utilization efficiency, and frees time and hardware resources for running the test jobs of application jobs that have not yet run.
  • The test job has a maximum running time. When the running time of any test job reaches this maximum, the test job is terminated and deleted from the test job set it belongs to.
  • If a test job's run reaches the maximum running time, this indicates that the job parameter configuration it tests performs poorly. Stopping the test job promptly helps avoid useless work and cut losses in time, and deleting it from its test job set also prevents it from affecting the evaluation of that set.
  • The parameter estimation model is trained on a historical job database that includes the optimal parameter configurations of similar application jobs, which improves the accuracy of the model and the quality of the estimated parameter configuration. Defaulting to the estimated parameter configuration as the optimal parameter configuration is therefore quite reliable.
  • The historical job database further includes the job parameter configurations used by test jobs and the corresponding calculation completion times, increasing the number of samples through test jobs.
  • Since a test job is only part of its corresponding application job, its running time is much shorter than the application job's. When training the parameter estimation model, the job calculation duration of a test job must therefore be processed to restore the duration the application job would need if run with the test job's parameter configuration; the test job's parameter configuration and the restored job calculation duration are then used to train the parameter estimation model.
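The restoration step can be sketched under a simple assumption: runtime scales linearly with the number of iterations after a fixed startup overhead. The real restoration rule is application-specific, so this linear model is illustrative only.

```python
def restore_duration(test_duration, test_iterations, full_iterations,
                     startup_overhead=0.0):
    """Restore the job calculation duration an application job would
    need from the duration of its shortened test job.

    Assumes each iteration costs the same after a fixed startup
    overhead -- an assumption, since the true scaling depends on the
    application.
    """
    per_iteration = (test_duration - startup_overhead) / test_iterations
    return startup_overhead + per_iteration * full_iterations
```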
  • The historical job database is divided into an original sub-library and a test sub-library. The original sub-library stores the job parameter configurations and job calculation durations used by actually run application jobs; the test sub-library stores the job parameter configurations and job calculation durations provided by test jobs.
  • The amount of data stored in the original sub-library can be limited, with regular updates or data overwriting ensuring that the original sub-library keeps only the latest data, thereby avoiding the detrimental effect of outdated samples in the original sub-library on the parameter estimation model.
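A bounded sub-library with overwrite-on-append semantics can be had almost for free from a fixed-length deque; the capacity of 4 here is purely illustrative.

```python
from collections import deque

# Original sub-library bounded to the newest N samples: appending past
# the limit silently discards the oldest record, so outdated samples
# can never accumulate and skew the parameter estimation model.
original_sublibrary = deque(maxlen=4)
for sample_id in range(6):
    original_sublibrary.append({"job": sample_id})
```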
  • The application job can actively check the completion status of its corresponding test job set: when the application job starts running, it calls an auxiliary program to check whether all test jobs in the set have finished. If they have, the auxiliary program screens out the optimal parameter configuration from the test jobs' running data according to the parameter configuration judgment conditions, for the application job to use. If not all test jobs in the set have finished, the application job runs with its corresponding estimated parameter configuration.
  • Conversely, the test job set can check the running status of its corresponding application job: while the test jobs in the set are running, an auxiliary program periodically checks whether the corresponding application job has started. Once it starts, all test jobs in the set are stopped and deleted, and the application job runs with its corresponding estimated parameter configuration. If, on the contrary, all test jobs in the set finish before the application job starts, the auxiliary program screens out the optimal parameter configuration from the test jobs' running data according to the parameter configuration judgment conditions, for the application job to use.
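Both mutual checks converge on the same fallback rule, which can be sketched as one function; the job records and the screening callable are illustrative stand-ins for the auxiliary program described above.

```python
def on_application_start(test_set, estimated_config, screen_optimal):
    """Mutual check at the moment the application job starts running.

    If every test job in the set has finished, screen the optimal
    parameter configuration from their running data; otherwise stop and
    delete the unfinished set and fall back to the estimated parameter
    configuration.
    """
    if all(job["state"] == "finished" for job in test_set):
        return screen_optimal(test_set)
    for job in test_set:
        job["state"] = "deleted"
    return estimated_config
```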
  • This embodiment provides a specific parameter configuration determination condition.
  • the parameter configuration judgment condition is:
  • the optimal parameter configuration is selected from multiple sets of job parameter configurations by combining each configuration's running cost and parallel efficiency.
  • the concrete way of combining running cost and parallel efficiency to select the optimal configuration can be set according to the choice of run data and the goal of the parameter configuration.
  • the optimal parameter configuration is selected from the multiple sets of job parameter configurations corresponding to the application job by pairwise comparison.
  • case-1 and case-2 are two different job parameter configurations corresponding to the same application job, and the running cost of the application job using case-1 is less than the running cost of the application job using case-2.
  • R1 is the operating cost corresponding to case-1
  • R2 is the operating cost corresponding to case-2.
  • T1 is the calculation duration of the job corresponding to case-1
  • T2 is the calculation duration of the job corresponding to case-2.
  • n is a calculation constant, n>1.
  • the better of two job parameter configurations is judged from the quality index E_r as follows: when E_r ≥ m, case-2 is judged the better option; when E_r < m, case-1 is judged the better option, where m is a set threshold with 0 < m < 1.
  • jobs A, B, C, and D are test jobs formed by combining the same application job with different job parameter configurations
  • the job parameter configuration corresponding to test job B is the optimal parameter configuration corresponding to the application job.
  • the job calculation durations T1 and T2 are the compute times consumed in the trial-run test; that is, the jobs built from case-1 and case-2 are test jobs corresponding to the same application job,
  • T1 is the time taken by the case-1 test job to complete its computation,
  • T2 is the time taken by the case-2 test job to complete its computation.
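Assuming the reduced parallel efficiency is the overall parallel efficiency normalized per factor-of-n increase in resources — a reconstruction that matches the worked example in this document (n = 2, m = 80%) — the pairwise selection can be sketched as:

```python
import math

def reduced_parallel_efficiency(r1, t1, r2, t2, n=2.0):
    """E_r for two configurations with running costs R1 < R2.

    k counts how many factor-of-n increases separate R1 from R2;
    E_r is the overall parallel efficiency (R1*T1)/(R2*T2) taken
    to the power 1/k, i.e. efficiency per factor-of-n increase.
    """
    k = math.log(r2 / r1, n)
    return ((r1 * t1) / (r2 * t2)) ** (1.0 / k)

def better_config(case1, case2, n=2.0, m=0.8):
    """Return the better of two (cost, duration) configurations,
    where case1 is the cheaper one."""
    (r1, t1), (r2, t2) = case1, case2
    er = reduced_parallel_efficiency(r1, t1, r2, t2, n)
    return case2 if er >= m else case1

# Worked example from the text: B beats A, but A beats C and D.
best_ab = better_config((1, 1.0), (2, 0.6))   # E_r ≈ 0.83 ≥ 0.8 → (2, 0.6)
best_ac = better_config((1, 1.0), (4, 0.4))   # E_r ≈ 0.79 < 0.8 → (1, 1.0)
```

Doubling the resources must buy at least a fraction m of the ideal speedup per doubling, otherwise the cheaper configuration wins — which is how the index avoids large resource increases for small speed gains.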
  • this embodiment analyzes, tests, and provides optimized VASP running parameters for users on a homogeneous CPU platform.
  • for ease of describing the job running parameter optimization method applied to supercomputing cluster scheduling, it is assumed that the cluster scheduling system is provided with a runtime optimization module that stores and executes this method.
  • This embodiment includes the following steps, for example.
  • Step 1 The user submits a VASP application job to the LSF job management system through the submission system.
  • the submission system obtains the job input information and passes it to the runtime optimization module.
  • the job input information includes, but is not limited to, the input file or its directory, hardware resource application information, application execution commands, etc.
  • Step 2 The runtime optimization module analyzes the received job input information and judges, from the application execution command, the user-assigned job name, the input file and its directory, and similar information, whether the current job is a VASP application job. If so, continue to Step 3; if not, jump to the runtime optimization process written for other applications.
  • Step 3 According to the calculation specification document of the VASP application, select as optimization target parameters the typical input parameters NPAR, KPAR, and NCORE, which do not affect the accuracy of the job's calculation results but do affect its calculation time and hardware resource requirements.
  • the product of these three typical input parameters is the number of processes started by the job application, that is, the CPU core resources corresponding to the job application.
  • Step 4 Preprocessing and analyzing the input data of the VASP application job to obtain the main calculation parameters of the job.
  • the main calculation parameters include calculation parameters that affect the running time of the job.
  • the main calculation parameters specifically include, but are not limited to, the following parameters:
  • the input files and input parameters of the VASP application include INCAR, KPOINTS, POSCAR, and POTCAR.
  • the input parameters include ICHARG, ISTART, NCORE, KPAR, NPAR, etc.; NCORE, KPAR, and NPAR are used to partition parallel tasks.
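As a minimal illustration (not part of the patent), the optimization target parameters can be pulled from a `KEY = value` style INCAR file as follows; real INCAR syntax has more cases than this sketch handles:

```python
def parse_incar(text):
    """Extract 'KEY = value' pairs from INCAR-style text.

    Minimal sketch: strips '#' and '!' comments and splits on the
    first '='. Real INCAR files also allow multiple tags per line,
    continuation, etc., which this does not handle.
    """
    params = {}
    for line in text.splitlines():
        line = line.split("#")[0].split("!")[0]  # drop common comment styles
        if "=" in line:
            key, _, value = line.partition("=")
            params[key.strip().upper()] = value.strip()
    return params

incar = """
SYSTEM = bulk Si
NCORE = 4
KPAR = 2
NPAR = 8
"""
# The three optimization target parameters named in the text:
targets = {k: parse_incar(incar)[k] for k in ("NPAR", "KPAR", "NCORE")}
```

The extracted `targets` dictionary is what a runtime optimization module would treat as the current (original) parameter configuration.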
  • Step 5 Input the main calculation parameters of the VASP job obtained in the preprocessing analysis into the parameter estimation model to obtain multiple combinations of configuration values for the three running parameters NPAR, KPAR, and NCORE; the combination estimated to be best is recorded as the estimated parameter configuration, and the remaining combinations are recorded as supplementary parameter configurations.
  • the parameter estimation model may be an empirical model, a big-data model based on historical VASP job data, or a combination of the two.
  • Step 6 copy all input files of the VASP job to a new directory, and submit a trial run test job.
  • the quantity of each hardware resource applied for by the test job is not lower than that set by the original VASP application job; the test job is marked as a "stealable job" so that the test process does not affect the running of other official jobs.
  • to cut unnecessary test time, in the test job's INCAR file the electronic-step iteration limit parameter NELM is set to a small value such as 3 or 5,
  • the ion-step iteration parameter IBRION is set to 0 (no ion-step iterations), and both LWAVE and LCHARG are set to ".F." (stopping wave-function and charge output).
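Collected together, the trial-run overrides described above would look like the following INCAR fragment — a sketch mirroring the values in the text, not a recommended production setting:

```
NELM   = 3    ! cap on electronic-step iterations (3 or 5 per the text)
IBRION = 0    ! ion-step iteration parameter as set in this embodiment
LWAVE  = .F.  ! stop wave-function output
LCHARG = .F.  ! stop charge output
```

These overrides keep the test job's runtime far below the full job's runtime while leaving the parallelization behavior being measured (NPAR, KPAR, NCORE) untouched.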
  • the original parameter configuration, estimated parameter configuration and supplementary parameter configuration are tested in sequence to obtain the optimal parameter configuration.
  • Step 7 After the trial-run test job completes, it activates the runtime optimization module, which checks whether the cluster user owning the VASP application job has enabled the optimization result reminder function; if so, the optimal configuration parameters produced by the test run are pushed to the user as a short message.
  • the runtime optimization module also checks whether the cluster user owning the VASP application job has been set to "agree to let system optimization modify the input parameters in its VASP jobs"; if the user has authorized this and the VASP application job is detected not to have started running, the values of the three optimization target parameters selected in this embodiment, NPAR, KPAR, and NCORE, are modified in the VASP application job according to the optimal parameter configuration.
  • Step 8 configure the job pre-processing module of the LSF job management system.
  • when an application job submitted by the user begins to execute, the pre-processing module checks whether a corresponding trial-run test job exists for it. If one exists and has not yet finished, the runtime optimization module is activated and the unfinished trial-run test jobs are killed. In that case, the application job can run directly with its original parameter configuration, or its parameters can be modified according to the estimated parameter configuration.
  • the user can submit the VASP application job through the WEB interface or another login interface; the submission system is then the WEB interface's back-end system that submits the job to the LSF job management system.
  • the submission system is a component of the job management system, and users can directly submit jobs through the BSUB command of the LSF job management system.
  • in Step 4, whether running parameter optimization is needed can be judged from the obtained main calculation parameters. For example, when ICHARG is greater than 10, the VASP job's computational load is very small and it need not be optimized. Or, when ISTART equals 1, the job has been set by the user to execute as a restart, and its running parameters should not be changed. Or, when the VASP job contains no heavy computing task such as ion-step optimization, it is not optimized. When no optimization is needed, the runtime optimization module can skip the subsequent optimization steps, return, and wait for a new job.
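The skip-optimization heuristics above can be sketched as a single predicate. The NSW-based check is an assumed proxy for "contains heavy tasks such as ion-step optimization" and is not named in the source:

```python
def needs_optimization(params):
    """Decide whether runtime-parameter optimization is worthwhile.

    Follows the heuristics described above; thresholds are the
    document's examples. NSW > 0 is used here as an assumed proxy
    for the presence of ionic relaxation steps.
    """
    icharg = int(params.get("ICHARG", 0))
    istart = int(params.get("ISTART", 0))
    has_ionic_steps = int(params.get("NSW", 0)) > 0
    if icharg > 10:          # trivially small workload: skip
        return False
    if istart == 1:          # user-configured restart: leave parameters alone
        return False
    if not has_ionic_steps:  # no heavy task such as ion-step optimization
        return False
    return True
```

A runtime optimization module would call this right after the preprocessing analysis and return early, waiting for the next job, whenever it yields `False`.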
  • in Step 4, all input files of the VASP application job can be copied to a new directory, and detailed pre-processing data for the job can be obtained through a very short trial run; that is, data that is required but absent from the main calculation parameters, such as the reduced K-point count NKPTS and the NBANDS parameter value that the VASP application job will use.
  • neither the reduced K-point count NKPTS nor NBANDS can be obtained directly from the input parameters, but NKPTS strongly affects the accuracy of the optimization result for the K-point parallel partition parameter KPAR.
  • the NBANDS parameter value has a certain influence on the accuracy of the optimization results of NPAR.
  • the calculation of pre-processing data is very fast.
  • the calculation time can be reduced by setting the number of electronic steps of the trial-run VASP job to a very small value such as 1, canceling the VASP job's ion-step iteration, and setting a runtime upper limit (e.g., 10 seconds).
  • step 4 the input data can also be desensitized, and only the data required for running optimization is saved.
  • in Step 7, the runtime optimization module also desensitizes the input and output data computed by the trial-run test job and stores it in the VASP job run-history data set, for use in training the parameter estimation model.
  • the runtime optimization module based on the present invention can also easily be migrated to any computing software other than VASP (such as, but not limited to, the quantum chemistry software Gaussian or the weather simulation software WRF); it can be implemented with any job scheduling system other than LSF (such as, but not limited to, Slurm or PBS); and it can be implemented on hardware resource platforms that include accelerator cards such as GPUs.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

A job running parameter optimization method applied to supercomputing cluster scheduling, comprising: obtaining an application job submitted by a user and multiple different sets of job parameter configurations corresponding to the application job; running the application job with each set of job parameter configurations; analyzing the run results according to set parameter configuration judgment conditions and selecting the optimal parameter configuration from the multiple sets, where the judgment conditions include the reduced parallel efficiency, i.e., the parallel efficiency of the supercomputing cluster per fixed-multiple increase in the resources used to run the application job; and pushing the optimal parameter configuration to the user, or modifying the application job's parameter configuration according to it. The invention automatically optimizes the parameter configuration of application jobs submitted by supercomputing cluster users, compensates for most users' lack of parameter-tuning ability, and helps raise the overall computing efficiency of the cluster.

Description

应用于超算集群调度的作业运行参数优化方法 技术领域
本发明涉及超级计算集群领域,尤其涉及一种应用于超算集群调度的作业运行参数优化方法。
背景技术
在超级计算集群上,运行着例如并不限于VASP等计算软件。用户在提交这些软件的作业时,需设定运行环境参数、运行资源参数、应用自身相关的输入参数等运行参数,特别的,需要指定一个至多个并行参数作为输入参数,例如所需总CPU核心数,或者在具有多层并行结构时,每层分配的并行计算的任务数。用户可以调节这些并行参数,达到在不改变计算结果的情况下,使计算速度显著提高。但是目前,很多软件自身无法仅依据输入文件与系统软硬件环境便预先判断一个接近最优的并行参数。
为了得到更好的计算速度,用户需要手动的多次调节测试这些运行参数,找出其中表现最好的参数。这个调节测试,也需要用户通过提交新作业的形式来完成,这增加了至少一次用户排队的次数。用户为了获得较好的计算速度,大约需要两倍的排队时间。这对用户是一个非常大的不方便,降低了用户进行优化测试的积极性,同时用户能掌握的经验数据有限,也难以获得较好的候选运行参数。此外,用户的测试作业同样消耗了用户自身的机时资源,产生了更多的机时费用,调查显示,做这类优化调试的用户并不多。在当前的集群作业调度系统,例如Slurm、PBS Pro、Platform LSF、或TORQUE中,用户提交一个新作业时,系统仅会忠实的使用用户提交的并行参数进行计算,不会为用户测试速度更快的计算参数,尤其无法自动化的优化修改具体应用的自身输入参数。特别是,当前的超算集群中的计算软件越发趋向多层的并行结构,相应的应用的自身输入参数空间也越发复杂,超算集群用户仅凭少量计算经 验很难获得理想的运行速度,这会导致集群系统中的大量的作业处于运行效率不佳的状态。
发明内容
为了解决上述现有技术中超算集群用户很难自行优化参数配置提高超算集群运行效率的缺陷,本发明提出了一种应用于超算集群调度的作业运行参数优化方法,可对集群上应用作业的运行环境、运行资源与应用输入参数的进行优化,实现在不影响应用作业计算精度的情况下提高计算效率。
本发明采用以下技术方案:
一种应用于超算集群调度的作业运行参数优化方法,包括:获取用户提交的应用作业,以及与应用作业对应的多组不同的作业参数配置;采用多组作业参数配置分别运行应用作业;根据设定的参数配置判定条件对运行结果进行分析,从多组作业参数配置中筛选得到最优参数配置,其中,参数配置判定条件包括约化并行效率,约化并行效率为运行应用作业所采用的资源数每提升固定倍数时,超算集群的并行效率;将最优参数配置推送给用户,或者,根据最优参数配置修改应用作业的作业参数配置。
优选地,参数配置判定条件还包括运行成本和作业计算时长,根据设定的参数配置判定条件对运行结果进行分析,从多组作业参数配置中筛选得到最优参数配置包括:在运行成本相同的情况下,筛选得到实现最小的作业计算时长对应的作业参数配置为最优参数配置;在运行成本不同的情况下,结合运行成本和约化并行效率从多组作业参数配置中筛选得到最优参数配置;其中,运行成本为采用作业参数配置运行应用作业时所占用的硬件资源的数量或者固定资产总额。
优选地,在运行成本不同的情况下,结合运行成本和约化并行效率从多组作业参数配置中筛选得到最优参数配置包括:以两两对比的方式将多组作业参数配置配对,其中,任意一对包含第一组作业参数配置和 第二组作业参数配置;计算优劣判定指标,并根据优劣判定指标与设定阈值的大小,确定第一组作业参数配置或第二组作业参数配置为最优参数配置;其中,优劣判定指标为:
E_r = ((R1×T1)/(R2×T2))^(1/k)
其中,
k = log_n(R2/R1)
R1为第一组作业参数配置对应的运行成本,R2为第二组作业参数配置对应的运行成本,且R1<R2,T1为第一组作业参数配置对应的作业计算时长,T2为第二组作业参数配置对应的作业计算时长,n为常数,且n>1。
优选地,采用多组作业参数配置分别运行应用作业包括:结合多组作业参数配置,利用集群中闲置的硬件资源运行应用作业,形成测试作业;其中,测试作业包含对应的应用作业中的部分程序,测试作业占用的硬件资源与对应的应用作业所需的硬件资源至少部分相同;在测试作业的运行过程中,当其占用的任一硬件资源被运行中的任一应用作业申请时,测试作业停止运行。
优选地,作业参数配置包括:原初参数配置、预估参数配置和补充参数配置中的至少一项;其中,原初参数配置为用户设置的初始作业参数配置;预估参数配置为根据与应用作业所属应用类别对应的参数预估模型获得的作业参数配置,其中,参数预估模型的输入为待运行作业的信息,输出为与待运行作业对应的预估参数配置,参数预估模型为经验模型和/或采用大数据训练得到;补充参数配置为将原初参数配置和预估参数配置中的至少一项作为输入代入到设定的参数异变模型中获得的一组或多组作业参数配置,其中,参数异变模型用于根据设定规则对输入的作业参数配置中的一项或者多项参数进行改变以形成新的作业参数配置并输出;或者,将参数预估模型的输出设置为多组作业参数配置,且根据参数预估模型标注输出的多组作业参数配置中最优的一组作为预估参数配置,剩余的作业参数配置记作补充参数配置。
优选地,结合多组作业参数配置,利用集群中闲置的硬件资源运行应用作业,形成测试作业包括:获取与多组作业参数配置对应的所有测试作业,得到测试作业集合;当应用作业被运行时,如果与其对应的测 试作业集合中的任一测试作业没有运行或者正在运行,则停止并删除测试作业集合中的测试作业,并将与应用作业对应的预估参数配置作为最优参数配置。
优选地,测试作业设有最大运行时长;在任一测试作业的运行时长达到最大运行时长的情况下,从运行程序中删除测试作业,并将测试作业从其所在的测试作业集合中删除。
优选地,在参数预估模型采用大数据训练获得的情况下,参数预估模型的训练数据库为与其对应的应用类别的历史作业数据库,历史作业数据库包含对应的应用类别中的应用作业在实际运行时所采用的作业参数配置以及计算完成时长。
优选地,历史作业数据库还包含对应的应用类别中的应用作业在测试运行时所采用的作业参数配置以及计算完成时长。
优选地,应用于超算集群调度的作业运行参数优化方法还包括:将最优参数配置添加到历史作业数据库中。
本发明的优点在于:
(1)本发明提出的一种应用于超算集群调度系统的作业运行参数优化方法,可自行比对获得的多种作业参数配置,从而自行获取可实现最佳计算效率的作业参数配置即最优参数配置。本发明中,可通过对优化目标参数进行设置,从而实现灵活的参数优化对象,甚至可实现对应用自身输入的参数的优化,解决了目前由于应用自身输入的参数的参数空间复杂而无法进行优化的问题。
(2)本发明中可根据集群用户设置的权限,将获得的最优参数配置推送给集群用户,或者直接根据最优参数配置对用户提交的应用作业进行优化。本发明实现了超算集群用户提交的应用作业的参数配置的自动优化,弥补了大多数用户不具备参数配置优化能力的缺陷,有利于整体提高超算集群的计算效率。
(3)本发明应用于提高超算集群中应用作业的运行速度时,可根据优化参数配置的参数选择,在不显著降低应用作业的运行速度的前提下,显著降低作业的硬件资源占用数额,从而提高超算集群的硬件资源 利用效率,提高超算集群整体在单位时间内完成计算的作业数量,提高集群的经济效益,并帮助超算集群用户减少排队时间,改善用户的计算速度体验。
(4)本发明提出的参数配置判定条件中明确了根据优劣判定指标E_r判断参数配置优劣的方法,该优劣判定指标E_r实际为约化并行效率,相比一般使用的并行效率可以更好的在速度性能与硬件资源利用效率上取得平衡,避免出现大幅增加计算资源却只取得很小的计算性能提升的情况,不仅可以实现计算速度的优化,更能够实现运行应用作业所使用资源量的优化。
(5)本发明中,在应用作业运行前,结合应用作业和预选的作业参数配置(即原初参数配置、预估参数配置和补充参数配置)生成测试作业,通过测试作业的运行,相当于对采用不同作业参数配置的应用作业进行了试运行测试,从而精确估算各个参数配置下应用作业的执行速度,为原初参数配置、预估参数配置和补充参数配置的优劣判断提供了实际的运行数据支撑,进一步保证了最终获得的最优参数配置的可靠性。
(6)本发明中,限定测试作业只在超算集群中空闲的硬件资源上执行,且测试作业均为可被抢断式作业,即测试作业在运行时可以被中断停止,并将占用的全部或部分硬件资源让位给高优先级的其他作业即应用作业(正式作业),以便在通过测试作业进行作业参数配置的测试的同时的同时不增加高优先级作业即应用作业的排队时间,实现在不增加高优先级作业即应用作业的排队时间的前提下增加全集群硬件资源利用率。
(7)本发明中,将经过测试作业试运行测试后获得的最优参数配置与应用作业的关联关系添加到该应用作业所属类别的历史作业数据库中,提高了历史作业数据库中的样本质量,有利于提高参数预估模型的质量,从而进一步提高预估参数配置的精度和可靠性。同时也提高了采用预估参数配置作为最优参数配置时的优益,保证了集群中所有应用作业的计算效率和硬件资源利用效率的提高。
附图说明
图1为本发明提出的应用于超算集群调度的作业运行参数优化方法流程图;
图2为实施例1提供的一种应用于超算集群调度的作业运行参数优化方法流程图。
具体实施方式
名词定义:
运行成本:根据作业参数配置运行应用作业时所占用的硬件资源的数量或者是所占用的硬件资源的固定资产总额为该作业参数配置对应的运行成本。
测试作业:测试作业包含对应的应用作业中的部分程序,将测试作业对应的应用作业采用与该测试作业相同的作业参数配置时的运行时间记作完整运行时间。测试作业的运行时间大大小于其对应的完整运行时间,且测试作业的运行时间与对应的完整运行时间正相关。
具体地,假设应用作业包含千次迭代,则其对应的测试作业可简化到3次或者5次迭代以减少测试时间。
可被抢断机制:应用作业A运行过程中用到的硬件资源一旦被另一个应用作业B申请,则应用作业A停止运行以便为应用作业B腾出硬件资源,则针对应用作业A的这种停止机制被称为可被抢断机制。
根据本公开的实施例,测试作业具有可被抢断机制,指的是测试作业的被执行优先级低于所有应用作业即正式作业的被执行优先级,一旦测试作业运行过程中占用的硬件资源被任一应用作业申请,则测试作业停止运行以释放其占用的硬件资源。
作业参数配置:例如包括作业执行时集群系统的环境参数、作业使用的硬件资源配置参数,以及应用自身的输入参数中的一个或者多个。
集群系统的环境参数:在Linux系统下为系统的环境变量参数,其包括但不限于各种堆栈与缓存的使用限制参数以及线程个数参数(例如但 不限于OMP_STACKSIZE、ulimit参数等)等。
作业使用的硬件资源配置参数:例如但不限于启动的进程个数、每个进程占用的CPU硬件核心数、进程的在节点与CPU核心间的分布参数,以及各个进程连接并使用的加速卡(如GPU)配置参数等。
应用自身的输入参数:必须为不影响计算结果精度要求的内部运行参数,其一般在应用的输入文件中被指定,并因应用不同而体现为不同的参数。
特别地,在VASP应用作业中,其包括KPAR、NCORE、NPAR、NSIM等用于划分或聚合多种计算任务的参数,但不限于以上所述参数。
根据本公开的实施例,如图1所示,应用于超算集群调度的作业运行参数优化方法,例如包括:
S1、获取用户提交的应用作业,以及与应用作业对应的多组不同的作业参数配置。
S2、采用多组作业参数配置分别运行应用作业。
S3、根据设定的参数配置判定条件对运行结果进行分析,从多组作业参数配置中筛选得到最优参数配置,其中,参数配置判定条件包括约化并行效率,约化并行效率为运行应用作业所采用的资源数每提升固定倍数时,超算集群的并行效率。
S4、将最优参数配置推送给用户,或者,根据最优参数配置修改应用作业的作业参数配置。
实施例1
本实施方式提出的一种应用于超算集群调度的作业运行参数优化方法,如图2所示,例如包括以下步骤。
SA1、获取用户提交的应用作业。
SA2、获取应用作业所述的应用类别,根据应用类别和待优化运行参数选择参数预估模型,结合应用作业和参数预估模型获取预估参数配置。
参数预估模型的输入为待运行作业的信息,输出为待运行作业对应的预估参数配置。参数预估模型可采用经验模型,即人工设置。参数预 估模型也可采用大数据训练获得,参数预估模型的训练数据库为其对应的应用类别的历史作业数据库,历史作业数据库包含对应的应用类别中的应用作业实际运行时所采用的作业参数配置以及计算完成时长。
SA3、结合原初参数配置和预估参数配置,获取补充参数配置。原初参数配置为用户原初的作业参数配置,即用户提交的应用作业的初始参数值。
补充参数配置为通过预设的参数异变模型代入原初参数配置和/或预估参数配置后获得的作业参数配置,或者将参数预估模型输出的多组作业参数配置中最优的一组作为预估参数配置,将参数预估模型输出的剩余的作业参数配置记作补充参数配置。
SA4、将应用作业分别与原初参数配置、预估参数配置和各组补充参数配置结合,以生成相对应的测试作业,利用集群中闲置的硬件资源在对应的应用作业运行前运行测试作业,并记录测试作业的运行数据。
测试作业采用被抢断机制,测试作业包含对应的应用作业中的部分程序,测试作业占用的硬件资源与对应的应用作业所需的硬件资源全部相同或者部分相同。
SA5、当应用作业被运行时,如果其对应的所有测试作业均已经运行完成,则根据各测试作业的运行数据结合设定的参数配置判定条件从各测试作业对应的原初参数配置、预估参数配置和补充参数配置中选择最优参数配置。
SA6、根据用户设置的权限,将最优参数配置发送给用户,或者直接根据最优参数配置对应用作业的参数配置进行优化并执行。
具体地,本步骤中,当用户设置的权限为信息提醒,则将最优参数配置以信息通知的方式发送给用户,由用户决定采用与否并手动修改。当用户设置的权限为自动化修改参数配置,则可通过程序直接修改应用作业的输入信息,以采用最优参数配置运行该应用作业。
步骤SA2中,采用大数据训练获得参数预估模型时,将收集到得的历史数据根据应用类别分类,实现了针对同一应用类别的历史数据所构成的历史作业数据库训练该应用类别对应的参数预估模型,从而保证了 根据参数预估模型预估应用作业的作业参数配置的可靠性,保证了预估参数配置的精度。
并且,本实施例中,历史作业数据库中的数据来源包括实际运行后的应用作业,实际运行后的应用作业的作业参数配置和作业计算时长都是明确的历史数据,且经过作业参数配置优化后的应用作业,其最优参数配置和对应的作业计算时长必然存入该应用作业所述的历史作业数据库中,提高了历史作业数据库的样本质量,从而提高了参数预估模型的精确度。
具体实施时,可根据设置的参数配置判定条件,结合应用作业对应的各测试作业的运行数据判定应用作业对应的原初参数配置、预估参数配置和各组补充参数配置的优劣,从而保证最终获得的最优参数配置的精确可靠。
具体的,本实施例具体实施时,将获得所有测试作业列入待运行测试队列,待运行测试队列中的测试作业均限定只能通过闲置的硬件资源即没有被任一应用作业申请或占用的硬件资源运行。如此,本实施例中实现了在不影响应用作业运行的情况下,实现对测试作业的运行,即实现了对应用作业对应的原初参数配置、预估参数配置和各组补充参数配置的试运行评估,从而有利于在应用作业实际运行前可使得该应用作业实现更好的硬件资源利用效率的作业参数配置,从而实现应用作业的参数配置的优化,提高集群的整体硬件资源利用效率,实现计算速度和硬件资源利用效率的平衡。
值得注意的是,本发明提供的应用于超算集群调度的作业运行参数优化方法,以不降低计算精度为前提寻找实现计算速度和硬件资源利用效率更优的平衡点,因此,本实施例中所需的作业参数配置设定为不影响应用作业计算精度,只影响应用作业对硬件资源的利用效率的作业参数,例如,VASP应用作业对应的最优参数配置可包括KPAR、NCORE、NPAR、NSIM等用于划分或聚合多种计算任务的参数。
具体实施时,为了方便最优参数配置的获得和实施,本实施例中进一步限定原初参数配置、预估参数配置和补充参数配置所包含的参数项 即参数类别均相同,但不同的参数配置中至少有一个参数项的具体设置数值或者属性不同。即,原初参数配置、预估参数配置和所有的补充参数配置格式统一。
实施例2
在实施例1的基础上,本实施例的进一步实施中,步骤SA5中当应用作业被运行时,如果其对应的任一测试作业没有完成运行,即应用作业对应的任一测试作业没有运行或者正在运行,则停止并删除该应用作业对应的所有测试作业的运行,并将该应用作业对应的预估参数配置作为最优参数配置。
应用作业对应的所有测试作业记作应用作业对应的测试作业集合,应用作业对应的测试作业集合没有在应用作业运行前运行完成,则无法对应用作业对应的原初参数配置、预估参数配置和各组补充参数配置进行有效对比,此时继续测试作业的运行以获取最优参数配置也无法应用到应用作业的运行中。
本实施例中,当应用作业运行时,如果其对应的任一测试作业没有完成运行,即待运行测试队列中还存在该应用作业对应的测试作业,则将该应用作业对应的所有测试作业从待运行测试队列中删除,避免了对硬件资源的冗余占用,进一步提高了硬件资源的利用效率,为待运行的应用作业的测试作业的运行提供了时间和硬件资源。
值得注意的是,本实施例进一步实施时,测试作业设有最大运行时长,任一测试作业的运行时长达到最大运行时长时,则从运行程序删除该测试作业,并将该测试作业从其所在的测试作业集合中删除。当测试作业的运行达到最大运行时长时,说明该测试作业对应的作业参数配置效果较差,此时,及时停止该测试作业的运行,有利于避免做无用功,及时止损。同时将该测试作业从其所在的测试作业集合中删除,也避免了该测试作业影响该测试作业集合的测试。
本实施例中,结合包含有同类应用作业对应的最优参数配置的历史作业数据库训练参数预估模型,提高了参数预估模型的精度和预估参数配置的优异性。如此,本实施例中,在应用作业对应的测试作业集中没有完成测试时,即无法根据测试数据对原初参数配置、预估参数配置和补充参数配置进行客观对比时,默认预估参数配置为最优参数配置,可靠性更高。
实施例3
在实施例1的基础上,本实施例中,历史作业数据库还包括测试作业采用的作业配置参数及对应的计算完成时长,以通过测试作业增加样本数量。
由于测试作业为对应的应用作业的部分程序,其运行时长远远小于应用作业,故而在训练参数预估模型时,需要对测试作业对应的作业计算时长进行处理,以还原该测试作业对应的作业参数配置应用到对应的应用作业中,该应用作业所需的作业计算时长,然后测试作业的参数配置和还原后的作业计算时长用于训练参数预估模型。
本实施例中,例如将历史作业数据库分割为原初子库和测试子库,原初子库用于存储实际运行后的应用作业所采用的作业参数配置和其作业计算时长,测试子库用于存储测试作业提供的作业参数配置和作业计算时长。
如此,通过原初子库和测试子库的设置,可限定原初子库的存储数据量,并实现原初子库的定期更新或者数据覆盖,保证原初子库仅保存最新的数据,从而避免原初子库中的样本过时对参数预估模型的不利影响。
本实施例也适用于实施例2。
值得注意的是,以上任何一个实施例中均没有对应用作业获得最优参数配置的具体情况进行限定。具体实施时,可由应用作业主动查看其对应的测试作业集合的运行完成情况,即当应用作业运行时,应用作业主动调用辅助程序查看其对应的测试作业集合中的测试作业是否全部完成运行,如果全部完成,则由辅助程序根据测试作业的运行数据和参数配置判定条件筛选最优参数配置以供应用作业调用。如果应用作业对应的测试作业集合中的测试作业没有全部完成运行,则应用作业调用对应的预估参数配置进行运行。
或者,可由测试作业集合查看对应的应用作业的运行状态,即测试作业集合的测试作业运行过程中,通过辅助程序定时查看该测试作业集合对应的应用作业是否开始运行,一旦对应的应用作业开始运行,则该测试作业集合中的所有测试作业均停止运行并删除,应用作业调用对应的预估参数配置进行运行。反之,如果测试作业集合的测试作业全部运行完成后,其对应的应用作业还没有开始运行,则辅助程序根据测试作业的运行数据和参数配置判定条件筛选最优参数配置以供应用作业调用。
实施例4
本实施例提供了一种具体的参数配置判定条件。
所述参数配置判定条件为:
当应用作业对应的多个作业参数配置所需的运行成本相同,则定义实现最小的作业计算时长的作业参数配置为最优参数配置。
当应用作业采用两个不同的作业参数配置运行时所需的运行成本不同,结合作业参数配置对应的运行成本和并行效率从多组作业参数配置中筛选最优参数配置。
具体实施时,结合作业参数配置对应的运行成本和并行效率从多组作业参数配置中筛选最优参数配置的具体方式,可根据运行作业数据的选择以及参数配置的目标具体设置。
本实施例的参数配置判定条件中,当应用作业采用两个不同的作业参数配置运行时所需的运行成本不同,则根据优劣判定指标E_r判断两个作业参数配置中的更优选项,并结合优劣判定指标E_r以两两对比的方式从应用作业对应的多组作业参数配置中筛选最优参数配置。
优劣判定指标E_r的计算公式为:
E_r = ((R1×T1)/(R2×T2))^(1/k),其中 k = log_n(R2/R1)。
其中,case-1和case-2为同一应用作业对应的两个不同的作业参数配置,该应用作业采用case-1时的运行成本少于该应用作业采用case-2时的运行成本。R1为case-1对应的运行成本,R2为case-2对应的运行成本。T1为case-1对应的作业计算时长,T2为case-2对应的作业计算时长。n为计算常数,n>1。
根据优劣判定指标E_r判断两个作业参数配置中的更优选项的方式例如为:当E_r≥m,则判断case-2为更优选项,当E_r<m,则判断case-1为更优选项,其中,m为设定阈值,0<m<1。
假设具体实施例中,存在A、B、C、D四个作业,n=2,m=80%;A、B、C、D四个作业对应的运行成本分别记作Ra、Rb、Rc、Rd,A、B、C、D四个作业对应的作业计算时长Ta、Tb、Tc、Td。
假设:
Ra=r,Ta=t;
Rb=2r,Tb=0.6t;
Rc=4r,Tc=0.4t;
Rd=8r,Td=0.25t;
则,
E_r(A,B) = ((Ra×Ta)/(Rb×Tb))^(1/1) = (r×t)/(2r×0.6t) ≈ 0.83;
E_r(A,C) = ((Ra×Ta)/(Rc×Tc))^(1/2) = (1/1.6)^(1/2) ≈ 0.79;
E_r(A,D) = ((Ra×Ta)/(Rd×Td))^(1/3) = (1/2)^(1/3) ≈ 0.79;
结合E r(A,B)可知,作业B的作业参数配置优于作业A的作业参数配置,结合E r(A,C)、E r(A,D)可知,作业A的作业参数配置优于作业C、D的作业参数配置,即作业A、B、C、D中以作业B的作业参数配置最优。
如此,假设作业A、B、C、D为同一个应用作业与不同的作业参数配置结合形成的测试作业,则可知测试作业B对应的作业参数配置为该应用作业对应的最优参数配置。
本实施例具体应用时,可结合实施例1至3,优劣判定指标E_r的计算公式中作业计算时长T1、T2为作业试运行测试中所消耗的计算时间,即case-1构成的测试作业和case-2构成的作业为对应同一应用作业的测试作业,T1为case-1构成的测试作业计算完成所消耗的时间,T2为case-2构成的测试作业计算完成所消耗的时间。
值得注意的是,上述m值只是一个参考值,本实施例中,设置E_r=m时,判断case-2为更优选项。具体实施时,也可设置E_r=m时,判断case-1为更优选项,或者E_r=m时,随机选择case-1或case-2为更优选项。本领域技术人员应该知道,对于E_r=m的三种判断情况为等同技术特征。
实施例5
本实施例中,以超算平台上用户使用量最多的应用软件之一VASP计算软件,以及超算平台上广泛使用的LSF作业调度系统为例,对本发明提供的应用于超算集群调度的作业运行参数优化方法作进一步解释。
本实施例基于CPU同构平台为用户分析、测试、并提供优化的VASP运行参数。本实施例中,为了方便对所述应用于超算集群调度的作业运行参数优化方法进行阐述,假设在集群调度系统中设置有用于存储和执行应用于超算集群调度的作业运行参数优化方法的运行时优化模块。
本实施例例如包括以下步骤。
步骤1,用户通过提交系统向LSF作业管理系统提交一个VASP应用作业,提交系统获取作业输入信息并传递给运行时优化模块,所述作业输入信息包括但不限于输入文件或者输入文件所在目录、硬件资源申请信息、应用执行命令等。
步骤2,运行时优化模块分析收到的作业输入信息,根据应用执行命令、用户命名的作业名称、输入文件和输入文件所在目录等信息,判断当前作业是否为VASP应用作业,当判断为真时,继续步骤3;当判断为假时,跳转至针对其他应用所编写的运行时优化流程。
步骤3,根据VASP应用作业的计算说明书文档,选择不影响作业计算结果精确度要求且影响其计算时长与硬件资源需求的典型输入参数NPAR、KPAR、NCORE作为优化目标参数。此三个典型输入参数的乘积为作业申请启动的进程数,即对应作业申请的CPU核心资源。
步骤4,对该VASP应用作业的输入数据进行预处理分析,获取该 作业的主要计算参数,所述主要计算参数包括影响作业运行时间的计算参数,所述主要计算参数具体包括且可以不限于以下参数:
申请的节点类型与数量,以及申请的CPU核心数量。
运行的VASP程序版本。
VASP应用作业的输入文件所在之目录。
VASP应用应用的输入文件以及输入参数,输入文件包括INCAR、KPOINTS、POSCAR、POTCAR,输入参数包括ICHARG、ISTART、NCORE、KPAR、NPAR等参数,NCORE、KPAR、NPAR用于并行任务的划分。
步骤5,将前一步预处理分析中得到的该VASP作业的主要计算参数输入到参数预估模型中,得到参数预估模型输出三个运行参数NPAR、KPAR、NCORE的配置值的多种组合,并预估一种组合最佳的运行参数配置记作预估参数配置,将剩余的组合记作补充参数配置。所述参数预估模型,可以是一类经验模型,或者是基于历史的VASP作业数据集合的大数据模型,或者是两者的合并统一。
步骤6,复制该VASP作业的所有输入文件到新的目录,提交一个试运行测试作业。该测试作业申请的各个硬件资源的数量,不低于原VASP应用作业设置的硬件资源;该测试作业被设置为“可被抢断式作业”,使得其测试过程不影响其他正式作业的运行。为缩减不必要的测试计算时长,该测试作业的INCAR输入文件中,电子步循环迭代步数上限参数NELM被设置为5步或3步等,离子步迭代参数IBRION设置为0(不进行离子步迭代),并将LWAVE与LCHARG均设置为“.F.”(停止波函数与电荷输出)。当该作业被分配执行时,依次测试原初参数配置、预估参数配置和补充参数配置,以得到最优参数配置。
以上测试过程中,对不同的作业参数配置做两两对比,以获取原初参数配置、预估参数配置和补充参数配置中的最优参数配置,具体方法可参考实施例4。
步骤7,试运行测试作业在完成运行后,将调用激活运行时优化模块,运行时优化模块将检查该VASP应用作业所归属的集群用户是否被 设置有优化结果提醒功能,如需提醒,则将测试作业所产生的最优配置参数以短消息形式推送给用户。运行时优化模块还检查该VASP应用作业所归属的集群用户是否被设置为“同意系统优化修改其VASP作业中的输入参数”,如被用户授权,且检测到该VASP应用作业尚未开始运行,则将该VASP应用作业中本实施例选定三个优化目标参数NPAR、KPAR、NCORE的值根据最优参数配置进行修改。
步骤8,配置LSF作业管理系统的作业前处理模块,当用户提交的应用作业开始执行时,该前处理模块检查当前作业是否存在相应的试运行测试作业,如果存在且还没有执行完成,则激活运行时优化模块,杀掉还未完成的试运行测试作业。此时,该应用作业可直接采用原初参数配置进行运行,也可根据预估参数配置进行参数修改。
本实施例的步骤1中,用户可通过WEB界面或者其他登录界面提交VASP应用作业,提交系统为向LSF作业管理系统提交作业的WEB界面后台系统,
具体地,本步骤1中,提交系统为作业管理系统的组成部分,用户可通过LSF作业管理系统的BSUB命令直接提交作业。
步骤4中,可以根据获得的主要计算参数判断是否需要做运行参数优化。例如,当ICNARG大于10时,该VASP作业计算量极少,可以不对它做优化。或者,当ISTART等于1时,作业被用户设定为二次重启执行,此时的运行参数不宜被变更。或者,当该VASP作业不含有离子步优化等繁重计算任务时,不对它做优化。当不需要做运行参数优化时,运行时优化模块可省略之后的优化步骤,返回并等待新的作业。
步骤4中,可复制该VASP应用作业的所有输入文件到新的目录,通过极短时间的试运行,获取该VASP应用作业详细的前处理数据,即所需的且不存在与主要计算参数中的数据,例如该VASP应用作业将要计算的约化K点个数NKPTS、NBANDS参数值等。约化K点个数NKPTS和NBANDS在输入参数中均无法直接得到,但K点个数NKPTS对K点并行划分参数KPAR的优化结果的准确性有很大影响。NBANDS参数值对NPAR的优化结果的准确性有一定影响。前处理数据的计算非常快 速,为降低资源消耗,可以通过将试运行中的VASP应用作业的电子步数量设置为1等很少的数值,并取消VASP作业的离子步迭代,来减少计算时长,并设置一个运行时间上限(例如10秒)。
步骤4中,还可对输入数据做脱敏处理,仅保存运行优化所需要的数据。
步骤7中,运行时优化模块也将试运行测试作业计算得到的输入输出数据进行脱敏处理,然后存入VASP作业运行历史数据集中,以便用于训练预估参数模型。
以上仅为本发明创造的较佳实施例而已,并不用以限制本发明创造,凡在本发明创造的精神和原则之内所作的任何修改、等同替换和改进等,均应包含在本发明创造的保护范围之内。
对于VASP应用,可在更优化的实施例中,添加其他不影响计算结果精度要求的其他运行参数,如NSIM、OMP_NUM_THREADS,仍在本专利的覆盖范围内,为简化描述,不再作为新的实施例详述。
以上实施例中的部分功能可以通过其他具体实现方式完成,其所产生的新的实现方案仍在本实施方式的覆盖范围内。
本领域的一般技术人员应清楚,基于本发明的运行时优化模块亦可以方便地迁移实施到VASP以外的任何计算软件(例如但不限于量子化学软件Gaussian、气象模拟软件WRF等);也可以结合LSF以外的任何其他作业调度系统(例如但不限于Slurm、PBS等作业调度系统)来实施;也可以在包含GPU等加速卡的硬件资源平台上实施。

Claims (10)

  1. 一种应用于超算集群调度的作业运行参数优化方法,其特征在于,包括:
    获取用户提交的应用作业,以及与所述应用作业对应的多组不同的作业参数配置;
    采用多组所述作业参数配置分别运行所述应用作业;
    根据设定的参数配置判定条件对运行结果进行分析,从多组所述作业参数配置中筛选得到最优参数配置,其中,所述参数配置判定条件包括约化并行效率,所述约化并行效率为运行所述应用作业所采用的资源数每提升固定倍数时,所述超算集群的并行效率;
    将所述最优参数配置推送给用户,或者,根据所述最优参数配置修改所述应用作业的作业参数配置。
  2. 根据权利要求1所述的应用于超算集群调度的作业运行参数优化方法,其特征在于,所述参数配置判定条件还包括运行成本和作业计算时长,所述根据设定的参数配置判定条件对运行结果进行分析,从多组所述作业参数配置中筛选得到最优参数配置包括:
    在所述运行成本相同的情况下,筛选得到实现最小的所述作业计算时长对应的所述作业参数配置为所述最优参数配置;
    在所述运行成本不同的情况下,结合所述运行成本和所述约化并行效率从多组所述作业参数配置中筛选得到所述最优参数配置;
    其中,所述运行成本为采用所述作业参数配置运行所述应用作业时所占用的硬件资源的数量或者固定资产总额。
  3. 根据权利要求2所述的应用于超算集群调度的作业运行参数优化方法,其特征在于,所述在所述运行成本不同的情况下,结合所述运行成本和所述约化并行效率从多组所述作业参数配置中筛选得到所述最优参数配置包括:
    以两两对比的方式将多组所述作业参数配置配对,其中,任意一对包含第一组作业参数配置和第二组作业参数配置;
    计算优劣判定指标,并根据所述优劣判定指标与设定阈值的大小,确定所述第一组作业参数配置或所述第二组作业参数配置为所述最优参数配置;
    其中,所述优劣判定指标为:
    E_r = ((R1×T1)/(R2×T2))^(1/k)
    其中,
    k = log_n(R2/R1)
    R1为所述第一组作业参数配置对应的运行成本,R2为所述第二组作业参数配置对应的运行成本,且R1<R2,T1为所述第一组作业参数配置对应的作业计算时长,T2为所述第二组作业参数配置对应的作业计算时长,n为常数,且n>1。
  4. 根据权利要求1所述的应用于超算集群调度的作业运行参数优化方法,其特征在于,所述采用多组所述作业参数配置分别运行所述应用作业包括:
    结合多组所述作业参数配置,利用集群中闲置的硬件资源运行所述应用作业,形成测试作业;
    其中,所述测试作业包含对应的所述应用作业中的部分程序,所述测试作业占用的硬件资源与对应的所述应用作业所需的硬件资源至少部分相同;
    在所述测试作业的运行过程中,当其占用的任一硬件资源被运行中的任一应用作业申请时,所述测试作业停止运行。
  5. 根据权利要求4所述的应用于超算集群调度的作业运行参数优化方法,其特征在于,所述作业参数配置包括:
    原初参数配置、预估参数配置和补充参数配置中的至少一项;
    其中,所述原初参数配置为用户设置的初始作业参数配置;
    所述预估参数配置为根据与所述应用作业所属应用类别对应的参数预估模型获得的作业参数配置,其中,所述参数预估模型的输入为待运行作业的信息,输出为与所述待运行作业对应的所述预估参数配置,所述参数预估模型为经验模型和/或采用大数据训练得到;
    所述补充参数配置为将所述原初参数配置和所述预估参数配置中的至少一项作为输入代入到设定的参数异变模型中获得的一组或多组 作业参数配置,其中,所述参数异变模型用于根据设定规则对输入的作业参数配置中的一项或者多项参数进行改变以形成新的作业参数配置并输出;或者,
    将所述参数预估模型的输出设置为多组所述作业参数配置,且根据所述参数预估模型标注输出的多组所述作业参数配置中最优的一组作为所述预估参数配置,剩余的所述作业参数配置记作所述补充参数配置。
  6. 根据权利要求5所述的应用于超算集群调度的作业运行参数优化方法,其特征在于,所述结合多组所述作业参数配置,利用集群中闲置的硬件资源运行所述应用作业,形成测试作业包括:
    获取与多组所述作业参数配置对应的所有测试作业,得到测试作业集合;
    当所述应用作业被运行时,如果与其对应的所述测试作业集合中的任一测试作业没有运行或者正在运行,则停止并删除所述测试作业集合中的测试作业,并将与所述应用作业对应的所述预估参数配置作为最优参数配置。
  7. 根据权利要求6所述的应用于超算集群调度的作业运行参数优化方法,其特征在于,所述测试作业设有最大运行时长;
    在任一所述测试作业的运行时长达到所述最大运行时长的情况下,从运行程序中删除所述测试作业,并将所述测试作业从其所在的所述测试作业集合中删除。
  8. 根据权利要求5所述的应用于超算集群调度的作业运行参数优化方法,其特征在于,在所述参数预估模型采用大数据训练获得的情况下,所述参数预估模型的训练数据库为与其对应的应用类别的历史作业数据库,所述历史作业数据库包含对应的应用类别中的应用作业在实际运行时所采用的作业参数配置以及计算完成时长。
  9. 根据权利要求8所述的应用于超算集群调度的作业运行参数优化方法,其特征在于,所述历史作业数据库还包含对应的应用类别中的应用作业在测试运行时所采用的作业参数配置以及计算完成时长。
  10. 根据权利要求8所述的应用于超算集群调度的作业运行参数优 化方法,其特征在于,还包括:
    将所述最优参数配置添加到所述历史作业数据库中。
PCT/CN2022/126219 2021-10-21 2022-10-19 应用于超算集群调度的作业运行参数优化方法 WO2023066304A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111268933.5A CN114048027B (zh) 2021-10-21 2021-10-21 一种应用于超算集群调度的作业运行参数优化方法
CN202111268933.5 2021-10-21

Publications (1)

Publication Number Publication Date
WO2023066304A1 true WO2023066304A1 (zh) 2023-04-27

Family

ID=80207271

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/126219 WO2023066304A1 (zh) 2021-10-21 2022-10-19 应用于超算集群调度的作业运行参数优化方法

Country Status (2)

Country Link
CN (1) CN114048027B (zh)
WO (1) WO2023066304A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116909676A (zh) * 2023-09-12 2023-10-20 中国科学技术大学 一种二分量第一性原理计算系统与服务方法
CN117370135A (zh) * 2023-10-18 2024-01-09 方心科技股份有限公司 基于电力应用弹性测试的超算平台性能评测方法及系统

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114048027B (zh) * 2021-10-21 2022-05-13 中国科学技术大学 一种应用于超算集群调度的作业运行参数优化方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160110657A1 (en) * 2014-10-14 2016-04-21 Skytree, Inc. Configurable Machine Learning Method Selection and Parameter Optimization System and Method
CN106383746A (zh) * 2016-08-30 2017-02-08 北京航空航天大学 大数据处理系统的配置参数确定方法和装置
CN106599585A (zh) * 2016-12-19 2017-04-26 兰州交通大学 基于并行蜂群算法的水文模型参数优化方法及装置
CN106909452A (zh) * 2017-03-06 2017-06-30 中国科学技术大学 并行程序运行时参数优化方法
CN114048027A (zh) * 2021-10-21 2022-02-15 中国科学技术大学 一种应用于超算集群调度的作业运行参数优化方法

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10402227B1 (en) * 2016-08-31 2019-09-03 Amazon Technologies, Inc. Task-level optimization with compute environments
CN106844040B (zh) * 2016-12-20 2020-08-28 北京并行科技股份有限公司 一种作业提交方法、系统及服务器
CN109951558A (zh) * 2019-03-27 2019-06-28 北京并行科技股份有限公司 一种超算资源的云调度方法、云调度中心和系统
CN111651220B (zh) * 2020-06-04 2023-08-18 上海电力大学 一种基于深度强化学习的Spark参数自动优化方法及系统
CN111858003B (zh) * 2020-07-16 2021-05-28 山东大学 一种Hadoop最优参数评估方法及装置
CN112102887B (zh) * 2020-09-02 2023-02-24 北京航空航天大学 多尺度集成可视化的高通量自动计算流程及数据智能系统
CN112418438B (zh) * 2020-11-24 2022-08-26 国电南瑞科技股份有限公司 基于容器的机器学习流程化训练任务执行方法及系统
CN113220745B (zh) * 2021-05-19 2024-02-09 中国科学技术大学 一种基于区块链的交易处理方法、装置及电子设备

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160110657A1 (en) * 2014-10-14 2016-04-21 Skytree, Inc. Configurable Machine Learning Method Selection and Parameter Optimization System and Method
CN106383746A (zh) * 2016-08-30 2017-02-08 北京航空航天大学 大数据处理系统的配置参数确定方法和装置
CN106599585A (zh) * 2016-12-19 2017-04-26 兰州交通大学 基于并行蜂群算法的水文模型参数优化方法及装置
CN106909452A (zh) * 2017-03-06 2017-06-30 中国科学技术大学 并行程序运行时参数优化方法
CN114048027A (zh) * 2021-10-21 2022-02-15 中国科学技术大学 一种应用于超算集群调度的作业运行参数优化方法

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116909676A (zh) * 2023-09-12 2023-10-20 中国科学技术大学 一种二分量第一性原理计算系统与服务方法
CN116909676B (zh) * 2023-09-12 2024-02-23 中国科学技术大学 一种二分量第一性原理计算系统与服务方法
CN117370135A (zh) * 2023-10-18 2024-01-09 方心科技股份有限公司 基于电力应用弹性测试的超算平台性能评测方法及系统
CN117370135B (zh) * 2023-10-18 2024-04-02 方心科技股份有限公司 基于电力应用弹性测试的超算平台性能评测方法及系统

Also Published As

Publication number Publication date
CN114048027A (zh) 2022-02-15
CN114048027B (zh) 2022-05-13

Similar Documents

Publication Publication Date Title
WO2023066304A1 (zh) 应用于超算集群调度的作业运行参数优化方法
US10402225B2 (en) Tuning resources based on queuing network model
WO2022262167A1 (zh) 集群资源调度方法及装置、电子设备和存储介质
Buchmann et al. Time-critical database scheduling: A framework for integrating real-time scheduling and concurrency control
US8996811B2 (en) Scheduler, multi-core processor system, and scheduling method
CN109144710A (zh) 资源调度方法、装置及计算机可读存储介质
CN107273200B (zh) 一种针对异构存储的任务调度方法
CN110262897B (zh) 一种基于负载预测的Hadoop计算任务初始分配方法
CN109992366B (zh) 任务调度方法及调度装置
US11580458B2 (en) Method and system for performance tuning and performance tuning device
CN111258735A (zh) 一种支持用户QoS感知的深度学习任务调度方法
CN110187835A (zh) 用于管理访问请求的方法、装置、设备和存储介质
CN113157411B (zh) 一种基于Celery的可靠可配置任务系统及装置
CN113391913A (zh) 一种基于预测的分布式调度方法和装置
US20240078013A1 (en) Optimized I/O Performance Regulation for Non-Volatile Storage
US8954969B2 (en) File system object node management
Zhang et al. Zeus: Improving resource efficiency via workload colocation for massive kubernetes clusters
CN113127179A (zh) 资源调度方法、装置、电子设备及计算机可读介质
Will et al. Ruya: Memory-aware iterative optimization of cluster configurations for big data processing
US20090320036A1 (en) File System Object Node Management
Yang et al. Deep reinforcement agent for failure-aware job scheduling in high-performance computing
CN108009074B (zh) 一种基于模型和动态分析的多核系统实时性评估方法
US11989181B2 (en) Optimal query scheduling for resource utilization optimization
CN117742876A (zh) 处理器核心的绑定方法、装置、设备及计算机存储介质
Jonk et al. Timing prediction for service-based applications mapped on linux-based multi-core platforms

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22882909

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE