CN110865922A - Supercomputing platform resource use monitoring method - Google Patents

Publication number
CN110865922A
Authority
CN
China
Prior art keywords
user
computing resource
current computing
user task
scheduling system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911149060.9A
Other languages
Chinese (zh)
Inventor
周佳佳
戴超群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Jiaochi Artificial Intelligence Research Institute Co Ltd
Original Assignee
Suzhou Jiaochi Artificial Intelligence Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-11-21
Filing date
2019-11-21
Publication date
2020-03-06
Application filed by Suzhou Jiaochi Artificial Intelligence Research Institute Co Ltd
Priority to CN201911149060.9A
Publication of CN110865922A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3017 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is implementing multitasking

Abstract

The invention provides a method for monitoring resource usage on a supercomputing platform, comprising: determining the number of user task processes being executed on the current computing resource; determining whether that number is greater than or equal to 2; if so, determining whether the current computing resource has been allocated by the scheduling system; and, if it has, checking whether the user of each executing user task process matches the user to whom the scheduling system allocated the current computing resource, and judging any user task process whose user does not match to be an erroneously submitted user task process.

Description

Supercomputing platform resource use monitoring method
Technical Field
The invention relates to the field of supercomputing, and in particular to a method for monitoring resource usage on a supercomputing platform.
Background
Supercomputing platforms are widely used across industries. When a user submits a task to a supercomputing platform, the user requests the computing resources the task needs, such as CPUs (central processing units) and GPUs (graphics processing units), and the platform's scheduling system allocates computing resources to the task. Under normal circumstances, an allocated computing resource is occupied and used only by the task process that the user submitted. In practice, however, abnormal situations occur: for example, other users improperly submit task processes to the same computing resource and cause conflicts. Operation and maintenance personnel of the supercomputing platform must check for and resolve these situations regularly. In the prior art this is done mainly by hand, following ad hoc logic, which is very inefficient.
Disclosure of Invention
The invention aims to provide a method for monitoring resource usage on a supercomputing platform that can quickly detect various kinds of improper resource usage and help operation and maintenance personnel take follow-up measures.
In order to achieve the above object, an aspect of the present invention provides a method for monitoring resource usage of a supercomputing platform, including:
determining the number of user task processes being executed on the current computing resource;
determining whether the number of user task processes being executed is greater than or equal to 2; if so, determining whether the current computing resource has been allocated by the scheduling system; and, if it has, checking whether the user of each executing user task process matches the user to whom the scheduling system allocated the current computing resource, and judging any user task process whose user does not match to be an erroneously submitted user task process.
In a preferred embodiment, the method further comprises: if it is determined that the scheduling system has not allocated the current computing resource, determining that all user task processes executing on the current computing resource are erroneously submitted user task processes.
With this embodiment, various kinds of improper resource usage can be detected quickly, which helps operation and maintenance personnel take follow-up measures.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
Fig. 1 is a flowchart of a method for monitoring resource usage of a supercomputing platform according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
There are many reasons for low GPU utilization on a supercomputing platform. Drawing on long-term operational experience, the inventors have grouped these reasons into several typical problems. In practice, investigating these problems is time-consuming and labor-intensive, so the inventors developed a solution that periodically and automatically checks for the conditions that cause low GPU utilization, for example by automatically executing script files. This greatly reduces the operation and maintenance workload needed to improve GPU utilization.
Fig. 1 illustrates a supercomputing platform resource usage monitoring method provided by an embodiment of the present invention, which includes:
Step S101: determine the number of user task processes being executed on the current computing resource.
The current computing resource may be a relatively costly computing resource such as a GPU. The number of user task processes may be determined by a script running on the server on which the current computing resource resides.
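As a rough illustration of such a script, the sketch below counts compute processes per GPU from the CSV output of `nvidia-smi`. The patent does not name a tool, so the use of `nvidia-smi`, the chosen query fields, and the helper names `count_processes_per_gpu` and `query_gpu_processes` are assumptions made for this example only.

```python
import subprocess

# Assumed query: one CSV line per compute process, in the form "gpu_uuid, pid".
QUERY = ["nvidia-smi", "--query-compute-apps=gpu_uuid,pid", "--format=csv,noheader"]


def count_processes_per_gpu(csv_text):
    """Count compute processes per GPU from nvidia-smi CSV output."""
    counts = {}
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        gpu_uuid, _pid = [field.strip() for field in line.split(",")]
        counts[gpu_uuid] = counts.get(gpu_uuid, 0) + 1
    return counts


def query_gpu_processes():
    """Run the query on the local server (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical sample output, used instead of a live query.
sample = "GPU-aaa, 4121\nGPU-aaa, 4377\nGPU-bbb, 5020\n"
print(count_processes_per_gpu(sample))  # prints {'GPU-aaa': 2, 'GPU-bbb': 1}
```

A GPU with no compute processes does not appear in the output at all, so a caller would have to treat a missing GPU as a count of zero.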
It should be noted that, to provide a basis for further troubleshooting, in a preferred embodiment of the invention step S101 may return "1" as the number of executing user task processes even when no user task process is executing on the current computing resource. Although this result can be confused with the case in which exactly one user task process is actually executing, the processing described below shows how it helps uncover more of the problems that waste the current computing resource.
When it is determined that the number of user task processes being executed on the current computing resource is greater than or equal to 2, execution is continued from step S102. When it is determined that the number of user task processes being executed on the current computing resource is equal to 1, execution is continued from step S105.
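The counting convention and the branch that follows step S101 can be sketched as below. This is only an illustration of the text above; the function names `reported_process_count` and `next_step` are hypothetical.

```python
def reported_process_count(actual_count):
    """Preferred-embodiment convention for step S101: never report fewer than 1."""
    return max(actual_count, 1)


def next_step(actual_count):
    """Choose the step that follows S101 based on the reported count."""
    if reported_process_count(actual_count) >= 2:
        return "S102"
    return "S105"


for n in (0, 1, 3):
    print(n, next_step(n))
```

Because zero and one process both lead to step S105, the single-process branch is the one that has to tell them apart later.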
Step S102: when it is determined that the number of user task processes being executed on the current computing resource is greater than or equal to 2, determine whether the current computing resource has been allocated by the scheduling system.
The scheduling system's allocation of the current computing resource can be obtained with the squeue command. In practice, this yields the task process identifier and the user identifier that the scheduling system has allocated to the current computing resource.
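A minimal sketch of this lookup follows, assuming the scheduling system is Slurm (the patent does not name the scheduler) and that listing the running jobs on a node is a good enough proxy for the allocation. The helpers `squeue_command` and `parse_allocations` are hypothetical, and mapping a job to one specific GPU on the node is not covered here.

```python
import subprocess


def squeue_command(node):
    """Build a squeue call listing job id and user for running jobs on one node."""
    # -h: no header, -w: node list, -t: job states, -o: "%i" job id and "%u" user name.
    return ["squeue", "-h", "-w", node, "-t", "RUNNING", "-o", "%i %u"]


def parse_allocations(squeue_text):
    """Parse 'job_id user' lines into a list of (job_id, user) tuples."""
    allocations = []
    for line in squeue_text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            allocations.append((fields[0], fields[1]))
    return allocations


def query_allocations(node):
    """Run squeue for the node (requires a Slurm installation)."""
    result = subprocess.run(squeue_command(node), capture_output=True, text=True, check=True)
    return parse_allocations(result.stdout)


# Hypothetical sample output, used instead of a live query.
print(parse_allocations("1001 alice\n1002 bob\n"))
```

An empty result would correspond to the "no" outcome of step S102: the scheduling system has not allocated the resource.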
Step S103: if the result of step S102 is yes, check whether the user of each executing user task process matches the user to whom the scheduling system allocated the current computing resource, and judge any user task process whose user does not match to be an erroneously submitted user task process.
If the result of step S102 is yes, at least one of the user task processes executing on the current computing resource was erroneously submitted. In the scenario of the invention, when every process is properly submitted (that is, each user task process is submitted to a computing resource that the scheduling system allocated for it), only one user process should be executing, exclusively, on the current computing resource. An erroneously submitted process is one whose user submitted the task to a computing resource that the scheduling system has not allocated to that user.
Once it is determined that the user task processes executing on the current computing resource include an erroneous submission, the offending task process and its user must be identified so that operation and maintenance personnel can notify the user to correct it. Specifically, the executing user task processes may be examined one by one to determine whether the user of each process matches the user to whom the scheduling system allocated the current computing resource. A process whose user matches was properly submitted and needs no handling; a process whose user does not match was improperly submitted and must be reported to operation and maintenance personnel.
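To examine processes one by one, a script first needs the owner of each process. The sketch below shows one possible way, assuming a Unix `ps` that accepts repeated `-o` options and a `-p` PID list; the patent does not describe how owners are looked up, and the helper names `ps_command` and `parse_owners` are hypothetical.

```python
def ps_command(pids):
    """Build a ps call printing 'pid user' for the given PIDs, without headers."""
    # "pid=" and "user=" select the columns and suppress their header text.
    return ["ps", "-o", "pid=", "-o", "user=", "-p", ",".join(str(p) for p in pids)]


def parse_owners(ps_text):
    """Parse 'pid user' lines into a {pid: user} dictionary."""
    owners = {}
    for line in ps_text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            owners[int(fields[0])] = fields[1]
    return owners


# Hypothetical sample output, used instead of a live query.
print(ps_command([4121, 4377]))
print(parse_owners(" 4121 alice\n 4377 bob\n"))
```

The resulting mapping can then be compared against the user reported by the scheduling system.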
Step S104: when the result of step S102 is no, judge that all user task processes executing on the current computing resource were erroneously submitted.
When the result of step S102 is no, the scheduling system has not allocated the current computing resource, so the resource should be idle. Step S101, however, found that two or more user task processes are executing on it, which means that every user task process running on the current computing resource was erroneously submitted. An erroneously submitted process is one whose user submitted the task to a computing resource that the scheduling system has not allocated to that user.
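Steps S102 to S104 can be summarized in a short sketch. The data shapes are assumptions for illustration: `processes` is a list of `(pid, user)` pairs, `allocated_users` is the set of users to whom the scheduling system allocated the resource (empty when it is unallocated), and the function name is hypothetical.

```python
def check_shared_resource(processes, allocated_users):
    """Return the erroneously submitted processes on a resource running two or more."""
    if len(processes) < 2:
        raise ValueError("this branch applies only when two or more processes are running")
    if not allocated_users:
        # S102 "no" -> S104: the resource is unallocated, so every process is flagged.
        return list(processes)
    # S102 "yes" -> S103: flag each process whose user was not allocated the resource.
    return [(pid, user) for pid, user in processes if user not in allocated_users]


running = [(4121, "alice"), (4377, "bob")]
print(check_shared_resource(running, {"alice"}))
print(check_shared_resource(running, set()))
```

The returned list is what would be reported to operation and maintenance personnel.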
Step S105: when it is determined that the number of user task processes executing on the current computing resource is equal to 1, it is determined whether the scheduling system has allocated the current computing resource.
When the result of step S105 is no, execution continues from step S109.
When the execution result of step S105 is yes, there are three cases, corresponding to steps S106, S107, and S108, respectively.
Step S106: when the user of the executing user task process matches the user to whom the scheduling system allocated the current computing resource, determine that the current computing resource is in a normal state: the resource is being properly used by the user task process for which the scheduling system allocated it.
Step S107: when no executing user task process is found, determine that the current computing resource is being wasted: the scheduling system has allocated the current computing resource to a user, but no user task process is executing on it.
The case of step S107 corresponds to the situation noted above, in which step S101 returns "1" as the number of executing user task processes even though no user task process is executing on the current computing resource.
The problem in step S107 is serious: when another user applies for resources, the current computing resource cannot be reallocated because the scheduling system has already allocated it, yet the user to whom it was allocated is not actually using it. This form of waste is called "occupied but unused".
Step S108: when an executing user task process exists and its user does not match the user to whom the scheduling system allocated the current computing resource, determine that the executing user task process was erroneously submitted to the current computing resource. An erroneously submitted process is one whose user submitted the task to a computing resource that the scheduling system has not allocated to that user.
Step S109: when the result of step S105 is no, determine whether an executing user task process can be found.
Step S110: when the result of step S109 is yes, determine that the executing user task process was erroneously submitted. The result of step S105 was no, meaning that the scheduling system has not allocated the current computing resource, so the resource should normally be idle. In fact, a user task process is executing on it, which shows that the process was erroneously submitted. An erroneously submitted process is one whose user submitted the task to a computing resource that the scheduling system has not allocated to that user.
Step S111: when the result of step S109 is no, determine that the current computing resource is idle: the scheduling system has not allocated the computing resource, and no user task process is executing on it.
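The single-process branch (steps S105 to S111) amounts to a four-way classification, sketched below. The `Status` names and the `classify_single` function are hypothetical; `process_user` is the owner of the one executing process or `None` if none was found, and `allocated_user` is the user to whom the scheduling system allocated the resource or `None` if it is unallocated.

```python
from enum import Enum


class Status(Enum):
    NORMAL = "normal"                      # S106
    WASTED = "occupied but unused"         # S107
    MISSUBMITTED = "erroneous submission"  # S108 and S110
    IDLE = "idle"                          # S111


def classify_single(process_user, allocated_user):
    """Classify a resource for which step S101 reported exactly one process."""
    if allocated_user is not None:         # S105: allocated by the scheduling system
        if process_user is None:
            return Status.WASTED           # S107: allocated, nothing running
        if process_user == allocated_user:
            return Status.NORMAL           # S106: the allocated user is running
        return Status.MISSUBMITTED         # S108: a different user is running
    if process_user is not None:           # S109: unallocated, but a process exists
        return Status.MISSUBMITTED         # S110
    return Status.IDLE                     # S111


print(classify_single("alice", "alice").value)
print(classify_single(None, "alice").value)
print(classify_single("bob", None).value)
print(classify_single(None, None).value)
```

Passing `None` for `process_user` is how this sketch represents the case in which step S101 reported "1" although nothing was actually running.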
With the technical solution of this embodiment, various kinds of improper resource usage can be detected quickly, which helps operation and maintenance personnel take follow-up measures.
Note that the above describes only the preferred embodiments of the invention and the technical principles employed. Those skilled in the art will understand that the invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements and substitutions can be made without departing from the scope of the invention. Therefore, although the invention has been described in detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from its spirit; the scope of the invention is determined by the appended claims.

Claims (2)

1. A supercomputing platform resource usage monitoring method is characterized by comprising the following steps:
determining the number of user task processes being executed on the current computing resource;
determining whether the number of user task processes being executed is greater than or equal to 2; if so, determining whether the current computing resource has been allocated by the scheduling system; and, if it has, checking whether the user of each executing user task process matches the user to whom the scheduling system allocated the current computing resource, and judging any user task process whose user does not match to be an erroneously submitted user task process.
2. The method of claim 1, wherein the method further comprises: if it is determined that the scheduling system has not allocated the current computing resource, determining that all user task processes executing on the current computing resource are erroneously submitted user task processes.
CN201911149060.9A 2019-11-21 2019-11-21 Supercomputing platform resource use monitoring method Pending CN110865922A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911149060.9A CN110865922A (en) 2019-11-21 2019-11-21 Supercomputing platform resource use monitoring method

Publications (1)

Publication Number Publication Date
CN110865922A (en) 2020-03-06

Family

ID=69655316

Country Status (1)

Country Link
CN (1) CN110865922A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103235744A (en) * 2013-04-15 2013-08-07 中山大学 Application resource management system for smart TV (television)

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
源建华 等: "《微机汉字FoxBASE+简明教程》", 30 November 1993, 高等教育出版社 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200306