CN113835953A - Statistical method and device of job information, computer equipment and storage medium - Google Patents


Info

Publication number
CN113835953A
Authority
CN
China
Prior art keywords
job
target
information
time
running
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111050624.0A
Other languages
Chinese (zh)
Inventor
肖邦
郝文静
吕灼恒
王家尧
张博
周干
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dawning Information Industry Co Ltd
Original Assignee
Dawning Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dawning Information Industry Co Ltd
Priority to CN202111050624.0A
Publication of CN113835953A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3003Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3003Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/302Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5077Logical partitioning of resources; Management or configuration of virtualized resources

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application relates to a statistical method and apparatus for job information, a computer device, and a storage medium. The method comprises acquiring job information of a target job in a SLURM job scheduling system, extracting from the job information the target job information required for counting the target job, and counting preset parameters of the target job according to the target job information, wherein the target job is a job running in the SLURM job scheduling system, and the preset parameters are used for evaluating the job running condition of the cluster where the target job is located. The method achieves real-time statistics on jobs in a SLURM computing cluster and solves the problem that conventional statistical methods struggle to perform job statistics in real time.

Description

Statistical method and device of job information, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a statistical method and apparatus for job information, a computer device, and a storage medium.
Background
SLURM (Simple Linux Utility for Resource Management) is a highly scalable cluster manager and job scheduling system that can be used for large clusters of computing nodes. Job scheduling information in a cluster can be counted from perspectives such as time, user, and space to generate various statistical reports, and analyzing these reports gives a clearer and more accurate picture of cluster usage and user behavior.
Currently, when jobs are statistically analyzed based on the SLURM system, the information on a job is usually stored in a database after the job has been scheduled, so that it can then be read from the database and analyzed from different perspectives with the corresponding statistical methods to obtain statistical results.
However, it is difficult to perform real-time job statistics with this statistical method.
Disclosure of Invention
In view of the above, it is necessary to provide a method and an apparatus for counting job information, a computer device, and a storage medium, which can realize real-time job counting, in order to solve the above-described technical problems.
In a first aspect, a statistical method for job information includes:
acquiring job information of a target job based on a SLURM job scheduling system; the target job is a job running in the computing cluster;
extracting target job information required for counting the target job from the job information;
counting preset parameters of the target job according to the target job information; the preset parameters are used for evaluating the job running condition of the computing cluster.
The statistical method for job information acquires the job information of a target job in the SLURM job scheduling system, extracts from the job information the target job information required for counting the target job, and counts preset parameters of the target job according to the target job information, wherein the target job is a job running in the SLURM job scheduling system, and the preset parameters are used for evaluating the job running condition of the cluster where the target job is located. The method achieves real-time statistics on jobs in a SLURM computing cluster and solves the problem that conventional statistical methods struggle to perform job statistics in real time.
In one embodiment, after acquiring the job information of the target job based on the SLURM job scheduling system, the method further includes:
storing the job information of the target job into a local memory;
the extracting, from the job information, target job information required for counting the target job includes:
reading the job information from the local memory;
and analyzing the read job information to extract the target job information required by the target job.
This embodiment uses a caching technique to store real-time data. Because the stored data are all original job data, if the preset parameters to be counted change later, there is no need to acquire the job information of the target job again or to change the way it is acquired; only different target job information needs to be parsed from the job information. The embodiment therefore provides a flexible statistical approach.
In one embodiment, the target job information includes the actual running time of the target job and the running core number of the target job, and the preset parameter is the core time.
In one embodiment, the counting preset parameters of the target job according to the target job information includes:
acquiring preset screening time;
determining the statistical running time of the target job according to the screening time and the actual running time of the target job;
and counting the core time of the target job according to the statistical running time of the target job and the running core number of the target job.
In one embodiment, the determining the statistical runtime of the target job according to the screening time and the actual runtime of the target job includes:
and determining the intersection time between the screening time and the actual running time as the statistical running time of the target job.
In one embodiment, the counting the core time of the target job according to the statistical running time of the target job and the running core number of the target job includes:
and taking the product of the statistical running time of the target job and the running core number of the target job as the core time of the target job.
The foregoing embodiments provide a statistical method for the core time of a job. The core time of a job can be used to evaluate the running condition of the processors, or the performance of related hardware, when each service node device in the computing cluster executes jobs. After the job scheduling server counts the jobs in real time, the running performance of the related hardware of the computing cluster can be evaluated from the statistical results, which provides accurate reference data for users of the computing cluster or for maintenance personnel maintaining the system.
In one embodiment, the acquiring job information of the target job in the SLURM job scheduling system includes:
and inquiring to obtain the job information of the target job by using the inquiry command in the SLURM job scheduling system.
This embodiment acquires the job information of the target job directly from each service node device in real time by active query, which provides an effective technical means for real-time job statistics, and the method can be implemented with a single query command.
In a second aspect, an apparatus for statistics of job information, the apparatus comprising:
the acquisition module is used for acquiring the job information of the target job in the SLURM job scheduling system; the target job is a job running in the SLURM job scheduling system;
the extraction module is used for extracting target job information required for counting the target job from the job information;
the statistics module is used for counting preset parameters of the target job according to the target job information; the preset parameters are used for evaluating the job running condition of the cluster where the target job is located.
In a third aspect, a computer device includes a memory and a processor, the memory stores a computer program, and the processor implements the statistical method of job information according to the first aspect when executing the computer program.
In a fourth aspect, a computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the statistical method of job information according to the first aspect.
Drawings
FIG. 1 is a diagram of an application cluster for a statistical approach to job information, according to an embodiment;
FIG. 2 is a flow diagram illustrating a statistical method for job information according to one embodiment;
FIG. 3 is a flow diagram illustrating a statistical method for job information according to one embodiment;
FIG. 4 is a flowchart illustrating an implementation manner of S102 in the embodiment of FIG. 2;
FIG. 5 is a flowchart illustrating an implementation manner of S103 in the embodiment of FIG. 2;
FIG. 6 is a diagram of an application scenario in one embodiment;
FIG. 7 is a flowchart illustrating a statistical method of job information according to one embodiment;
FIG. 8 is a block diagram showing an example of a statistical apparatus for job information;
FIG. 9 is a block diagram showing an example of a statistical apparatus for job information;
FIG. 10 is a block diagram showing an example of a statistical apparatus for job information;
FIG. 11 is a diagram illustrating an internal structure of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The statistical method for the job information provided by the application can be applied to the computing cluster shown in fig. 1. The computing cluster may be a large SLURM computing cluster, and the computing cluster includes: the system comprises a plurality of service node devices 102 and a job scheduling server 104, wherein the service node devices 102 communicate with the job scheduling server 104 through a network, an SLURM job scheduling system (SLURM job scheduling framework) is installed on the job scheduling server 104, and job information running on each service node device 102 can be inquired and obtained through the SLURM job scheduling system on the job scheduling server 104. The job scheduling server 104 is configured to perform statistical analysis on the information of the jobs running on the respective service node devices 102, so that the job scheduling server 104 may evaluate the job running condition of the computing cluster according to the statistical result. The service node device 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the job scheduling server 104 may be implemented by an independent server or a server cluster formed by a plurality of servers.
Those skilled in the art will appreciate that the architecture shown in fig. 1 is a block diagram of only a portion of the architecture associated with the subject application and does not constitute a limitation on the application clusters to which the subject application applies, and that a particular application cluster may include more or less components than those shown, or combine certain components, or have a different arrangement of components.
In one embodiment, as shown in fig. 2, a statistical method for job information is provided, which is described by taking the method as an example applied to the job scheduling server in fig. 1, and includes the following steps:
S101, acquiring the job information of the target job based on the SLURM job scheduling system.
Wherein the target job is a job running in the compute cluster. The SLURM job scheduling system is a highly scalable cluster manager or job scheduling system that can be used for large computing node clusters, widely adopted by supercomputers and computing clusters worldwide, and manages available computing nodes in a shared or unshared manner for users to perform work. The SLURM job scheduling system will allocate resources for the task queue and monitor the job until completion. The job information includes a series of information related to the job, such as the node address where the job runs, the start running time of the job, the end running time of the job, the type of the job, the identification of the job, the type of resources occupied by the job, the number of running cores of the job, and the like.
In this embodiment, when each service node device in the computing cluster shown in fig. 1 performs a job, the job scheduling server may obtain job information of the running job, that is, job information of the target job, from each service node device in real time. The specific acquisition mode may include two modes: the first mode is that the job scheduling server uses the query command in the SLURM job scheduling system to query and obtain the job information of the target job, namely, the job information of the target job is directly obtained from each service node device in real time in an active query mode; in the second mode, each service node device may report job information of a locally running job to the job scheduling server in real time and store the job information in a database of the job scheduling server, and the job scheduling server reads the job information of the target job reported by each service node device from the database based on a management function of an SLURM job scheduling system on the job scheduling server. It should be noted that, since the obtained target jobs are jobs that are running on each service node device in the computing cluster, and the number of the target jobs is generally multiple, when the job scheduling server obtains job information of multiple target jobs, the job information of multiple target jobs may be stored in a form of a job list so as to be conveniently read in a later query.
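The first (active query) mode can be sketched as follows. This is only an illustration under stated assumptions: the source does not name the query command, so the use of `squeue` with `--noheader`, `--states=RUNNING`, and the format string `%i|%S|%C|%N` (job ID, start time, CPU count, node list) is an assumption, and the helper names `parse_squeue_output` and `query_running_jobs` are hypothetical.

```python
import subprocess

# Assumed output layout: job id | start time | CPU count | node list.
SQUEUE_FORMAT = "%i|%S|%C|%N"


def parse_squeue_output(text):
    """Turn pipe-delimited squeue output into a list of raw job records."""
    jobs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        job_id, start, cpus, nodes = line.split("|", 3)
        jobs.append({"job_id": job_id, "start": start, "cpus": cpus, "nodes": nodes})
    return jobs


def query_running_jobs():
    """Actively query SLURM for the jobs that are currently running.

    Requires the squeue command to be available on the scheduling server.
    """
    result = subprocess.run(
        ["squeue", "--noheader", "--states=RUNNING", f"--format={SQUEUE_FORMAT}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_squeue_output(result.stdout)
```

The returned list plays the role of the job list mentioned above: one raw record per running job, kept unfiltered so that later steps can decide which fields they need.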
S102, target job information necessary for counting the target job is extracted from the job information.
The target job information may be determined according to what needs to be counted for the target job. For example, when the core time of the target job needs to be counted, the target job information includes the actual running time of the target job and the running core number of the target job; when an index parameter of the CPU occupation capacity of the target job needs to be counted, the target job information includes the length of time for which the target job occupies the CPU and the type of processor the target job occupies (such as a large core or a small core).
In this embodiment, when the job scheduling server obtains the job information of the target job based on the foregoing steps, the job information of the target job may be further analyzed, and target job information required when the relevant index parameters of the target job are counted in a later period is extracted from the job information. Alternatively, when the job scheduling server acquires a job list including job information of a plurality of target jobs, the job scheduling server may generate a new job list including the extracted target job information of each target job based on analysis of the job list.
S103, counting preset parameters of the target job according to the target job information.
The preset parameters are used for evaluating the job running condition of the computing cluster, can be determined by a user in advance according to evaluation requirements, and can also be preset as configuration parameters for use. Specifically, the preset parameter may be various index parameters that can be used to evaluate the job running condition of the computing cluster or the system performance of the computing cluster, such as the core time of the target job, the CPU occupation capability of the target job, and the like.
In this embodiment, once the job scheduling server has obtained the target job information of the target job, it may determine the preset parameters to be counted and apply the statistical method corresponding to those parameters to the target job information, so as to obtain the preset parameters of the target job. It should be noted that, when the job scheduling server counts multiple target jobs, the preset parameters corresponding to each target job may be the same or different; that is, the job scheduling server may count different preset parameters for different target jobs. For example, given target job 1 and target job 2, the job scheduling server may perform statistical analysis on the core time of target job 1 to obtain the core time as the preset parameter of target job 1, and perform statistical analysis on the CPU occupation capacity of target job 2 to obtain the CPU occupation capacity as the preset parameter of target job 2. Specifically, the job scheduling server may count different target jobs in sequence with the corresponding statistical methods, and optionally may also count different target jobs with different statistical methods. In addition, the job scheduling server may count one preset parameter of a target job, and optionally may also count several preset parameters of one target job. In practical applications, after the job scheduling server has counted the preset parameters of the target jobs on each service node device in the computing cluster, it can further evaluate the job running condition of the computing cluster by analyzing the attributes of the preset parameters.
The statistical method for job information acquires the job information of a target job in the SLURM job scheduling system, extracts from the job information the target job information required for counting the target job, and counts preset parameters of the target job according to the target job information, wherein the target job is a job running in the SLURM job scheduling system, and the preset parameters are used for evaluating the job running condition of the cluster where the target job is located. The method achieves real-time statistics on jobs in a SLURM computing cluster and solves the problem that conventional statistical methods struggle to perform job statistics in real time.
In practical applications, after the job scheduling server executes the above S101, the statistical method for job information according to the embodiment of fig. 2, as shown in fig. 3, further includes:
and S104, storing the operation information of the target operation into a local memory.
When the job scheduling server acquires the job information of the target job in real time, the job information can be stored in the local memory, so that the job scheduling server can efficiently read data from the local memory for statistical analysis.
Correspondingly, when the job scheduling server executes the step of S102, as shown in fig. 4, the following steps are specifically executed:
S1021, reading the job information from the local memory.
S1022, analyzing the read job information to extract the target job information required by the target job.
When the job scheduling server executes the statistical operation, the job information of the target job can be directly read from the local memory. Since the job information of the target job includes all job information related to the target job, and some information is unnecessary information when counting the preset parameters of the target job, the job scheduling server needs to further analyze the read job information, and extract or screen out target job information required by the target job, that is, target job information required when calculating the preset parameters of the target job. Optionally, when the job scheduling server performs a statistical operation on a plurality of target jobs, the job list including job information of the plurality of target jobs may be directly read from the local memory, and then the job list may be analyzed, so as to generate a new job list including the target job information of the plurality of target jobs. It should be noted that the data in the memory is continuously updated at short time intervals, so that the data in the memory always stores the job information of the running job in the computing cluster, and the purpose of acquiring the job information in real time is achieved. The embodiment uses a cache technology to store real-time data, and because the stored data are all original job data, if the preset parameters of later statistics change, the job information of the target job does not need to be acquired again or the manner of acquiring the job information does not need to be changed, and only different target job information needs to be analyzed from the job information, so the embodiment provides a flexible and variable statistical manner.
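A minimal sketch of this cache-then-extract arrangement is given below. The class `JobCache`, the function `extract_target_info`, and the record field names are hypothetical and chosen only for illustration; the source does not specify how the local memory is organized.

```python
import threading


class JobCache:
    """In-memory store for the latest raw job list (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._jobs = []

    def update(self, raw_jobs):
        # Replace the whole snapshot so the cache only holds current jobs.
        with self._lock:
            self._jobs = [dict(job) for job in raw_jobs]

    def read(self):
        # Return copies so callers cannot modify the cached records.
        with self._lock:
            return [dict(job) for job in self._jobs]


def extract_target_info(raw_jobs, fields):
    """Build a new job list that keeps only the fields a statistic needs."""
    return [{name: job[name] for name in fields} for job in raw_jobs]


cache = JobCache()
cache.update([
    {"job_id": "101", "start": "2021-09-01T14:00:00", "cores": 6, "type": "batch"},
    {"job_id": "102", "start": "2021-09-01T15:30:00", "cores": 4, "type": "batch"},
])

# Core-time statistics only need the start time and the core count.
core_time_view = extract_target_info(cache.read(), ("job_id", "start", "cores"))
```

Because the cache keeps the raw records, a different statistic can later request a different set of fields from the same snapshot without querying the scheduler again.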
Optionally, the present application provides an implementation manner of the foregoing S103, where if the target job information includes an actual running time of the target job and a running core number of the target job, as shown in fig. 5, the foregoing S103 "count preset parameters of the target job according to the target job information", including:
S201, acquiring preset screening time.
The screening time may be predetermined by a user according to a statistical requirement, and the screening time includes a screening start time and a screening end time, for example, if the time is counted according to hours, it may be determined that the screening start time is one hour before the current time point, and the screening end time is the current time point.
In this embodiment, before the job scheduling server performs the statistical operation, the job scheduling server may obtain a statistical request sent by a user, and the statistical request includes the screening time, and when the job scheduling server receives the statistical request, the job scheduling server may extract the screening time from the statistical request, so as to perform real-time statistics on the job according to the user requirement; optionally, the job scheduling server may also determine a suitable screening time in advance according to the type of the counted job to perform configuration, and then may perform job statistics according to the configured screening time. It should be noted that when the job scheduling server needs to count the preset parameters of multiple target jobs, different screening times may be configured for different target jobs, and the same screening time may also be configured for different target jobs; when the job scheduling server needs to count a plurality of preset parameters of a target job, different screening times can be configured for different preset parameters, and the same screening time can also be configured for different preset parameters.
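The two ways of obtaining the screening time, from a statistical request or from configuration, might look like the sketch below. The function name `screening_window`, the request keys `screen_start` and `screen_end`, and the one-hour default are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def screening_window(request=None, now=None, default_span=timedelta(hours=1)):
    """Return (screening_start, screening_end).

    Uses the window carried in the statistical request when present,
    otherwise falls back to a configured span ending at the current time.
    """
    now = now or datetime.now()
    if request and "screen_start" in request and "screen_end" in request:
        start, end = request["screen_start"], request["screen_end"]
        if start > end:
            raise ValueError("screening start time is later than screening end time")
        return start, end
    # Hourly statistics: from one hour before the current time to now.
    return now - default_span, now
```

Passing `now` explicitly keeps the function deterministic, which is convenient when several target jobs or several preset parameters should share exactly the same window.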
S202, determining the statistical running time of the target job according to the screening time and the actual running time of the target job.
Wherein the actual run time of the target job comprises an actual start run time of the target job and an actual end run time of the target job. The statistical run time of the target job includes a statistical start run time of the target job and a statistical end time of the target job.
In this embodiment, since the filtering time is predefined by the user or configured by the job scheduling server in advance, the filtering time is generally inconsistent with the actual running time of the target job, and there are generally four application scenarios: as shown in fig. 6, the first is that the screening start time is later than the actual start running time, and the screening end time is later than the actual end running time; the second is that the screening start time is earlier than the actual start running time, and the screening end time is later than the actual end running time; the third is that the screening start time is earlier than the actual start running time, and the screening end time is earlier than the actual end running time; the fourth is that the screening start time is later than the actual start run time and the screening end time is earlier than the actual end run time.
Specifically, in any of the above application scenarios, the job scheduling server may determine the intersection time between the filtering time and the actual running time as the statistical running time of the target job.
For example, in the first application scenario, the job scheduling server determines the screening start time as the statistical start running time of the target job, determines the actual end running time as the statistical end running time of the target job, and determines the duration between the screening start time and the actual end running time as the duration of the statistical running time. In the second application scenario, the job scheduling server determines the actual start running time as the statistical start running time of the target job, determines the actual end running time as the statistical end running time of the target job, and determines the duration between the actual start running time and the actual end running time as the duration of the statistical running time. In the third application scenario, the job scheduling server determines the actual start running time as the statistical start running time of the target job, determines the screening end time as the statistical end running time of the target job, and determines the duration between the actual start running time and the screening end time as the duration of the statistical running time. In the fourth application scenario, the job scheduling server determines the screening start time as the statistical start running time of the target job, determines the screening end time as the statistical end running time of the target job, and determines the duration between the screening start time and the screening end time as the duration of the statistical running time.
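All four scenarios reduce to the same interval intersection: the later of the two start times and the earlier of the two end times. A minimal sketch follows; the function name `statistical_runtime` is hypothetical, and returning `None` when the two intervals do not overlap is an assumption, since the source does not discuss that case.

```python
from datetime import datetime


def statistical_runtime(screen_start, screen_end, actual_start, actual_end):
    """Intersect the screening time with the actual running time.

    Returns (statistical_start, statistical_end), or None if the
    two intervals do not overlap.
    """
    start = max(screen_start, actual_start)  # later of the two start times
    end = min(screen_end, actual_end)        # earlier of the two end times
    if end <= start:
        return None
    return start, end


screen = (datetime(2021, 9, 1, 12), datetime(2021, 9, 1, 18))

# Scenario 1: the job started and ended before the screening bounds.
first = statistical_runtime(*screen, datetime(2021, 9, 1, 10), datetime(2021, 9, 1, 16))
# Scenario 4: the job spans the whole screening window.
fourth = statistical_runtime(*screen, datetime(2021, 9, 1, 10), datetime(2021, 9, 1, 20))
```

For a job that is still running, the actual end running time is not yet known; one option, again an assumption, is to pass the current time as `actual_end`.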
S203, counting the core time of the target job according to the statistical running time of the target job and the running core number of the target job.
After the job scheduling server obtains the statistical running time of the target job and the running core number of the target job, it can apply the corresponding core time statistical method to obtain the core time of the target job. In one application, the job scheduling server takes the product of the statistical running time of the target job and the running core number of the target job as the core time of the target job.
Optionally, the job scheduling server may calculate the core time of the target job by using the following relation (1):
$$T = \frac{N_a \times T_a}{3600} \tag{1}$$
In the above relation, $T$ denotes the core time of the target job, $N_a$ denotes the running core number of the target job, and $T_a$ denotes the statistical running time of the target job. Relation (1) gives the core time when $T_a$ is expressed in seconds; when $T_a$ is expressed in another unit, relation (1) may be adapted accordingly to calculate the core time of the target job.
For example, taking daily statistics, assume that a job uses 6 cores and runs from 14:00 on the 1st to 22:00 on the 3rd. In the statistical data for the 1st, the running time counted for the job is from 14:00 to 24:00, that is, 10 hours, so according to the core-time calculation formula (1) the core time used by the job on the 1st is 60 (6 cores × 10 hours). In the statistical data for the 2nd, the running time counted for the job is from 0:00 to 24:00, that is, 24 hours, so according to the core-time calculation formula the core time used by the job on the 2nd is 144 (6 cores × 24 hours).
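The daily example can be checked with a short calculation. The sketch below splits a job's actual running time at midnight boundaries and multiplies each day's overlap, in hours, by the core count. The helper name `core_hours_per_day` and the concrete month and year (September 2021) are assumptions chosen only to make the example runnable.

```python
from datetime import datetime, timedelta


def core_hours_per_day(cores: int, actual_start: datetime, actual_end: datetime) -> dict:
    """Split a job's running time by calendar day and return core time per day.

    Core time is computed as cores * hours of overlap with each day.
    """
    result = {}
    # Midnight at the beginning of the day on which the job started.
    day_start = datetime(actual_start.year, actual_start.month, actual_start.day)
    while day_start < actual_end:
        day_end = day_start + timedelta(days=1)
        start = max(day_start, actual_start)
        end = min(day_end, actual_end)
        hours = (end - start).total_seconds() / 3600
        result[day_start.date().isoformat()] = cores * hours
        day_start = day_end
    return result


# A 6-core job running from 14:00 on the 1st to 22:00 on the 3rd.
usage = core_hours_per_day(6, datetime(2021, 9, 1, 14), datetime(2021, 9, 3, 22))
```

With these inputs the 1st contributes 6 × 10 = 60, the 2nd contributes 6 × 24 = 144, and the 3rd contributes 6 × 22 = 132.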
The foregoing embodiments provide a method for counting the core time of a job. The core time of a job can be used to evaluate the operating condition of the processors, or the performance of related hardware, when each service node device in the computing cluster executes the job. After the job scheduling server counts the job in real time, the running performance of the related hardware of the computing cluster can therefore be evaluated from the statistical result, which provides accurate reference data for users of the computing cluster or for personnel maintaining the system.
With reference to all the above embodiments, the present application further provides a statistical method for job information, as shown in fig. 7, the method includes:
S301, querying the job information of the target job by using a query command in the SLURM job scheduling system.
S302, storing the job information of the target job in a local memory.
S303, reading the job information from the local memory.
S304, parsing the read job information to extract the target job information required for counting the target job. The target job information includes the actual running time of the target job and the running core number of the target job.
S305, acquiring a preset screening time.
S306, determining the intersection time between the screening time and the actual running time as the statistical running time of the target job.
S307, taking the product of the statistical running time of the target job and the running core number of the target job as the core time of the target job.
S308, evaluating the job running condition of the cluster where the target job is located according to the core time.
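Steps S301 to S308 can be sketched end to end as follows. This is an illustrative sketch under stated assumptions, not the implementation described in the application: the source does not name the query command, so `sacct` with the fields `JobID`, `Start`, `End` and `AllocCPUS` is assumed; the treatment of `Unknown` timestamps, the utilization ratio used for S308, and all function names are likewise assumptions. The demonstration at the bottom runs on sample text, so no SLURM installation is needed.

```python
import subprocess
from datetime import datetime


def query_job_info() -> str:
    """S301: query SLURM for job records (assumes sacct and job accounting are available)."""
    cmd = ["sacct", "--allusers", "--allocations", "--noheader",
           "--parsable2", "--format=JobID,Start,End,AllocCPUS"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


cache = {}  # S302: local in-memory store for the raw job information


def parse_jobs(raw: str, now: datetime) -> list:
    """S303/S304: read the cached text and extract run times and core counts."""
    jobs = []
    for line in raw.splitlines():
        if not line.strip():
            continue
        job_id, start, end, cpus = line.split("|")
        if start == "Unknown":  # assumed marker for a job that has not started
            continue
        # Assumed: a still-running job reports End as "Unknown"; use `now` instead.
        end_time = now if end == "Unknown" else datetime.fromisoformat(end)
        jobs.append((job_id, datetime.fromisoformat(start), end_time, int(cpus)))
    return jobs


def core_time(jobs: list, screen_start: datetime, screen_end: datetime) -> dict:
    """S305-S307: intersect each job with the screening time, then multiply by cores."""
    totals = {}
    for job_id, start, end, cpus in jobs:
        s, e = max(start, screen_start), min(end, screen_end)
        if e > s:
            totals[job_id] = cpus * (e - s).total_seconds() / 3600
    return totals


def utilization(per_job: dict, cluster_cores: int,
                screen_start: datetime, screen_end: datetime) -> float:
    """S308 (one possible evaluation): used core time over available core time."""
    window_hours = (screen_end - screen_start).total_seconds() / 3600
    return sum(per_job.values()) / (cluster_cores * window_hours)


# Offline demonstration with sample text in the assumed sacct output format.
SAMPLE = (
    "101|2021-09-01T14:00:00|2021-09-03T22:00:00|6\n"
    "102|2021-09-02T06:00:00|Unknown|4\n"
    "103|Unknown|Unknown|8\n"
)
cache["jobs"] = SAMPLE
jobs = parse_jobs(cache["jobs"], now=datetime(2021, 9, 2, 12))
window = (datetime(2021, 9, 2), datetime(2021, 9, 3))
per_job = core_time(jobs, *window)
```

In a live deployment the sample text would be replaced by the output of `query_job_info()`; the parsing and counting steps stay the same because they operate only on the cached text.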
The steps in the above embodiment have been described in the foregoing; for details, refer to the foregoing description, which is not repeated here. The statistical method for job information provided by the present application realizes, based on the SLURM job scheduling system, real-time statistics of the information of jobs running in the computing cluster, and solves the problem that traditional statistical methods have difficulty counting jobs in real time. In addition, the method uses caching to store real-time data, which avoids the I/O operations of reading data from a database. Because different databases have different data access rules, when a large amount of data is read or accessed, the method can improve both the data reading speed and the stability of the system. Furthermore, because the stored job information is the complete original job data, if the preset parameters to be counted later change, there is no need to re-acquire the job information of the target job or to change the way it is acquired; it is only necessary to parse different target job information from the stored job information. The statistical method provided by this embodiment is therefore flexible.
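The flexibility described above, where the raw job information is kept in memory and different fields are parsed from it as the statistics change, might look like the sketch below. The class `JobInfoCache` is hypothetical, and the sample record only imitates the `Key=Value` layout printed by `scontrol show job`; values containing spaces are not handled in this simplified parser.

```python
import re


class JobInfoCache:
    """Keep raw job records in memory and extract fields on demand."""

    def __init__(self) -> None:
        self._raw = {}  # job id -> unparsed job information text

    def store(self, job_id: str, raw_text: str) -> None:
        self._raw[job_id] = raw_text

    def extract(self, job_id: str, fields: list) -> dict:
        # Parse every Key=Value pair, then keep only the requested fields.
        pairs = dict(re.findall(r"(\w+)=(\S+)", self._raw[job_id]))
        return {name: pairs.get(name) for name in fields}


job_cache = JobInfoCache()
job_cache.store(
    "101",
    "JobId=101 JobName=demo UserId=alice(1001) "
    "StartTime=2021-09-01T14:00:00 EndTime=2021-09-03T22:00:00 "
    "NumNodes=1 NumCPUs=6",
)

# Fields needed for core time today ...
core_time_fields = job_cache.extract("101", ["StartTime", "EndTime", "NumCPUs"])
# ... and a different selection later, without querying the scheduler again.
owner_fields = job_cache.extract("101", ["UserId", "NumNodes"])
```

Because the cache holds the unparsed text, changing the statistic only changes the list of field names passed to `extract`.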
It should be understood that although the steps in the flow charts of fig. 2-7 are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and described, and may be performed in other orders. Moreover, at least some of the steps in fig. 2-7 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 8, there is provided a statistical apparatus of job information, including:
an obtaining module 11, configured to obtain job information of a target job based on a SLURM job scheduling system; the target job is a job running in the computing cluster.
An extracting module 12, configured to extract, from the job information, target job information required for counting the target job.
A statistic module 13, configured to count preset parameters of the target job according to the target job information; the preset parameters are used for evaluating the job running condition of the computing cluster.
In an embodiment, as shown in fig. 9, the apparatus for counting job information further includes:
a storage module 14, configured to store the job information of the target job in a local memory;
correspondingly, the extracting module 12 is specifically configured to read the job information from the local memory, and analyze the read job information to extract target job information required by the target job.
In one embodiment, the target job information includes the actual running time of the target job and the running core number of the target job, and the preset parameter is the core time.
In one embodiment, as shown in fig. 10, the statistical module 13 includes:
an obtaining unit 131, configured to obtain a preset screening time;
a determining unit 132, configured to determine a statistical running time of the target job according to the screening time and an actual running time of the target job;
the counting unit 133 is configured to count the core time of the target job according to the counted running time of the target job and the running core number of the target job.
In an embodiment, the determining unit 132 is specifically configured to determine an intersection time between the screening time and the actual runtime as the statistical runtime of the target job.
In an embodiment, the counting unit 133 is specifically configured to use the product of the statistical running time of the target job and the running core number of the target job as the core time of the target job.
In an embodiment, the obtaining module 11 is specifically configured to use a query command in the SLURM job scheduling system to query and obtain job information of the target job.
For specific limitations of the statistical device of the job information, reference may be made to the above limitations of the statistical method of the job information, which are not described herein again. All or part of each module in the device for counting the job information can be realized by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, or can be stored in a memory in the computer device in software form, so that the processor can call them and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 11. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store job information. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a statistical method of job information.
Those skilled in the art will appreciate that the architecture shown in fig. 11 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or fewer components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
acquiring the job information of a target job based on a SLURM job scheduling system; the target job is a job running in the computing cluster;
extracting target job information required for counting the target job from the job information;
counting preset parameters of the target job according to the target job information; the preset parameters are used for evaluating the job running condition of the computing cluster.
The implementation principle and technical effect of the computer device provided by the above embodiment are similar to those of the above method embodiment, and are not described herein again.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring the job information of a target job based on a SLURM job scheduling system; the target job is a job running in the computing cluster;
extracting target job information required for counting the target job from the job information;
counting preset parameters of the target job according to the target job information; the preset parameters are used for evaluating the job running condition of the computing cluster.
The implementation principle and technical effect of the computer-readable storage medium provided by the above embodiments are similar to those of the above method embodiments, and are not described herein again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical storage, or the like. Volatile Memory can include Random Access Memory (RAM) or external cache Memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A statistical method for job information, the method comprising:
acquiring the job information of a target job based on a SLURM job scheduling system; the target job is a job running in the computing cluster;
extracting target job information required for counting the target job from the job information;
counting preset parameters of the target job according to the target job information; the preset parameters are used for evaluating the job running condition of the computing cluster.
2. The method of claim 1, wherein after the acquiring of the job information of the target job based on the SLURM job scheduling system, the method further comprises:
storing the job information of the target job in a local memory;
the extracting, from the job information, target job information required for counting the target job includes:
reading the job information from the local memory;
and parsing the read job information to extract the target job information required for counting the target job.
3. The method according to claim 1 or 2, wherein the target job information includes: the actual running time of the target job and the running core number of the target job; and the preset parameter is core time.
4. The method according to claim 1, wherein the counting preset parameters of the target job according to the target job information comprises:
acquiring preset screening time;
determining the statistical running time of the target job according to the screening time and the actual running time of the target job;
and counting the core time of the target job according to the statistical running time of the target job and the running core number of the target job.
5. The method of claim 4, wherein determining the statistical runtime of the target job based on the screening time and the actual runtime of the target job comprises:
and determining the intersection time between the screening time and the actual running time as the statistical running time of the target job.
6. The method according to claim 4, wherein the counting the core time of the target job according to the statistical running time of the target job and the running core number of the target job comprises:
and taking the product of the statistical running time of the target job and the running core number of the target job as the core time of the target job.
7. The method of claim 1, wherein the obtaining job information of the target job in the SLURM job scheduling system comprises:
and inquiring to obtain the job information of the target job by using the inquiry command in the SLURM job scheduling system.
8. An apparatus for counting job information, the apparatus comprising:
the acquisition module is used for acquiring the job information of the target job in the SLURM job scheduling system; the target job is a job running in the SLURM job scheduling system;
the extraction module is used for extracting target job information required for counting the target job from the job information;
the statistic module is used for counting preset parameters of the target job according to the target job information; the preset parameters are used for evaluating the job running condition of the cluster where the target job is located.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111050624.0A CN113835953A (en) 2021-09-08 2021-09-08 Statistical method and device of job information, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113835953A true CN113835953A (en) 2021-12-24

Family

ID=78958814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111050624.0A Pending CN113835953A (en) 2021-09-08 2021-09-08 Statistical method and device of job information, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113835953A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101065436B1 (en) * 2010-12-07 2011-09-19 경상대학교산학협력단 Stochastic scheduling of a real-time parallel task with uncertain computation amount on mulit-core processors
JP2013069222A (en) * 2011-09-26 2013-04-18 Fuji Xerox Co Ltd Work information management device and program
US20140330536A1 (en) * 2013-05-06 2014-11-06 Sas Institute Inc. Techniques to simulate statistical tests
US20150052242A1 (en) * 2013-08-16 2015-02-19 Fujitsu Limited Information processing system, method of controlling information processing system, and computer-readable recording medium storing control program for controller
CN111324445A (en) * 2018-12-14 2020-06-23 中国科学院深圳先进技术研究院 Task scheduling simulation system
CN113296929A (en) * 2020-06-29 2021-08-24 阿里巴巴集团控股有限公司 Resource matching method, device and system based on cloud computing
CN111949389A (en) * 2020-08-11 2020-11-17 曙光信息产业(北京)有限公司 Slurm-based information acquisition method and device, server and computer-readable storage medium
CN112052144A (en) * 2020-09-15 2020-12-08 曙光信息产业(北京)有限公司 Information management method, information management device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
尹常红 等: "浅谈武汉气象高性能计算机系统的运维管理", 《电脑知识与技术》, 15 January 2021 (2021-01-15), pages 204 - 206 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115168059A (en) * 2022-09-07 2022-10-11 平安银行股份有限公司 System kernel monitoring method and device, terminal equipment and storage medium
CN115168059B (en) * 2022-09-07 2022-12-16 平安银行股份有限公司 System kernel monitoring method and device, terminal equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination