CN110928756A - Supercomputing platform resource use monitoring method - Google Patents

Info

Publication number
CN110928756A
Authority
CN
China
Prior art keywords
computing resource
current computing
user
scheduling system
user task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911150115.8A
Other languages
Chinese (zh)
Inventor
周佳佳
戴超群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Jiaochi Artificial Intelligence Research Institute Co Ltd
Original Assignee
Suzhou Jiaochi Artificial Intelligence Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Jiaochi Artificial Intelligence Research Institute Co Ltd filed Critical Suzhou Jiaochi Artificial Intelligence Research Institute Co Ltd
Priority to CN201911150115.8A priority Critical patent/CN110928756A/en
Publication of CN110928756A publication Critical patent/CN110928756A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3438Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment monitoring of user actions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3003Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3024Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a central processing unit [CPU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3051Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a supercomputing platform resource use monitoring method, which comprises the following steps: determining the number of user task processes being executed on the current computing resource; determining whether that number is equal to 1; if so, determining whether the scheduling system has allocated the current computing resource; and if it has, determining that the current computing resource is wasted when no executing user task process is found.

Description

Supercomputing platform resource use monitoring method
Technical Field
The invention relates to the field of supercomputing, and in particular to a resource use monitoring method for a supercomputing platform.
Background
Supercomputing platforms are widely used across industries. When a user submits a task to a supercomputing platform, the user applies for the computing resources the task requires, such as a CPU (central processing unit) or a GPU (graphics processing unit), and the platform's scheduling system allocates computing resources to the task. Under normal circumstances, an allocated computing resource is occupied and used by the task process that the user submitted. In practice, however, abnormal situations arise; for example, other users improperly submit task processes to the computing resource and cause conflicts. Operation and maintenance personnel of the supercomputing platform must regularly investigate and resolve these situations. In the prior art this is done mainly by hand and without a systematic procedure, so efficiency is very low.
Disclosure of Invention
The invention aims to provide a resource use monitoring method for a supercomputing platform, which can quickly find various unreasonable resource use problems and provide help for operation and maintenance personnel to take follow-up measures.
In order to achieve the above object, an aspect of the present invention provides a method for monitoring resource usage of a supercomputing platform, including:
determining the number of user task processes being executed on the current computing resource;
determining whether the number of user task processes being executed is equal to 1; if so, determining whether the scheduling system has allocated the current computing resource; and if it has, determining that the current computing resource is wasted when no executing user task process is found.
In a preferred embodiment, the method further comprises: if it is determined that the scheduling system has allocated the current computing resource, and an executing user task process exists whose user is inconsistent with the user to whom the scheduling system allocated the current computing resource, determining that the executing user task process was erroneously submitted to the current computing resource.
In a preferred embodiment, the method further comprises: if it is determined that the scheduling system has allocated the current computing resource, and the user corresponding to the executing user task process is consistent with the user to whom the scheduling system allocated the current computing resource, determining that the current computing resource is in a normal state.
Through the embodiment, various unreasonable resource use problems can be found quickly, and help is provided for operation and maintenance personnel to take follow-up measures.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
Fig. 1 is a flowchart of a method for monitoring resource usage of a supercomputing platform according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
There are many reasons why GPU utilization on a supercomputing platform can be low, and the inventor, drawing on long working experience, has reduced them to several typical problems. In practice these problems are time-consuming and laborious to investigate, so the inventor developed a solution that checks for them regularly and automatically, for example by automatically executing script files, which greatly reduces the operation and maintenance workload involved in improving GPU utilization.
Fig. 1 illustrates a supercomputing platform resource usage monitoring method provided by an embodiment of the present invention, which includes:
Step S101: determining the number of user task processes being executed on the current computing resource.
The current computing resources may be relatively costly computing resources such as GPUs. Determining the number of user task processes may be accomplished by a script running on a server on which the current computing resource resides.
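A minimal sketch of such a script, assuming an NVIDIA GPU and that the process list is taken from the output of `nvidia-smi --query-compute-apps=pid,gpu_uuid --format=csv,noheader`. The patent does not name a tool, so the command and the helper name `parse_compute_apps` are illustrative assumptions. The parser is kept separate from the command invocation so that it can be checked against sample text.

```python
from collections import defaultdict


def parse_compute_apps(csv_text):
    """Group process IDs by GPU from lines of the form "pid, gpu_uuid".

    The input format is an assumption: one CSV line per compute process,
    with no header row.
    """
    pids_by_gpu = defaultdict(list)
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue  # a GPU with no compute processes contributes no lines
        pid_text, gpu_uuid = [field.strip() for field in line.split(",", 1)]
        pids_by_gpu[gpu_uuid].append(int(pid_text))
    return dict(pids_by_gpu)


# Hypothetical sample output for two GPUs.
sample = "4242, GPU-aaaa\n4250, GPU-aaaa\n5100, GPU-bbbb\n"
counts = {gpu: len(pids) for gpu, pids in parse_compute_apps(sample).items()}
print(counts)  # {'GPU-aaaa': 2, 'GPU-bbbb': 1}
```

A GPU on which nothing is running does not appear in the result at all, so a caller has to treat a missing key as zero processes.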
It should be noted that, to provide a basis for further troubleshooting, in a preferred embodiment of the present invention step S101 may return "1" as the number of executing user task processes even when no user task process is executing on the current computing resource. Although this setting can be confused with the case in which exactly one user task process is actually executing on the current computing resource, the processing described in detail below shows that it helps uncover more of the problems that waste the current computing resource.
When it is determined that the number of user task processes being executed on the current computing resource is greater than or equal to 2, execution is continued from step S102. When it is determined that the number of user task processes being executed on the current computing resource is equal to 1, execution is continued from step S105.
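The counting convention and the branch described above can be sketched as follows. The function names are hypothetical; the only behavior taken from the text is that a count of zero is reported as 1, and that a count of 2 or more leads to step S102 while a count of 1 leads to step S105.

```python
def reported_process_count(pids):
    """Step S101 in the preferred embodiment: report 1 even if nothing runs."""
    return max(1, len(pids))


def next_step(count):
    """Route to S102 for two or more processes, otherwise to S105."""
    return "S102" if count >= 2 else "S105"


print(next_step(reported_process_count([])))            # S105
print(next_step(reported_process_count([4242])))        # S105
print(next_step(reported_process_count([4242, 4250])))  # S102
```

Because an empty resource and a resource with one process both reach step S105, the later steps have to check whether a process can actually be found.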
Step S102: when it is determined that the number of user task processes being executed on the current computing resource is greater than or equal to 2, determining whether the scheduling system has already allocated the current computing resource.
The scheduling system's allocation of the current computing resource can be obtained by means of the squeue command. In practice, this yields the task process identifier and the user identifier that the scheduling system has allocated to the current computing resource.
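As an illustration, the allocation lookup might be sketched as below, assuming the scheduling system is Slurm and that the text comes from `squeue -h -w <node> -o "%i %u"` (job ID and user name, no header). The patent does not specify the options used, and this sketch works at node granularity rather than per GPU; `parse_allocations` and `is_allocated` are hypothetical helper names.

```python
def parse_allocations(squeue_text):
    """Return (job_id, user) pairs from lines of the form "job_id user".

    The two-column format is an assumption about how squeue was invoked.
    """
    allocations = []
    for line in squeue_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            allocations.append((fields[0], fields[1]))
    return allocations


def is_allocated(allocations):
    """The resource counts as allocated if the scheduler lists any job on it."""
    return len(allocations) > 0


sample = "1001 alice\n"  # hypothetical squeue output for one node
allocs = parse_allocations(sample)
print(allocs, is_allocated(allocs))
```

An empty result corresponds to a negative answer in steps S102 and S105.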
Step S103: if the judgment result of step S102 is yes, comparing whether the user corresponding to each executing user task process is consistent with the user to whom the scheduling system allocated the current computing resource, and judging each inconsistent user task process to be an erroneously submitted user task process.
A positive result in step S102 indicates that at least one of the user task processes executing on the current computing resource was submitted in error. This is because, in the scenario of the present invention, a user task process that is reasonably submitted (i.e., submitted for execution to a computing resource that the scheduling system has allocated for it) should be the only user process executing on that computing resource. An erroneously submitted process is one that a user has submitted to a computing resource that the scheduling system has not allocated to that user.
Once it is determined that the user task processes executing on the current computing resource include an erroneous submission, the erroneously submitted task processes and their users need to be identified so that operation and maintenance personnel can notify those users to correct the problem. Specifically, the executing user task processes may be examined one by one to determine whether the user corresponding to each is consistent with the user to whom the scheduling system allocated the current computing resource. A consistent user task process was reasonably submitted and requires no handling; an inconsistent user task process was unreasonably submitted and needs to be flagged for the operation and maintenance personnel.
Step S104: when the judgment result of step S102 is no, judging that all the user task processes executing on the current computing resource were erroneously submitted.
A negative result in step S102 indicates that the scheduling system has not allocated the current computing resource, so the resource should be idle. Step S101, however, found that the number of user task processes executing on the current computing resource is greater than or equal to 2, which means that all the user task processes running on the current computing resource were erroneously submitted. An erroneously submitted process is one that a user has submitted to a computing resource that the scheduling system has not allocated to that user.
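Steps S102 to S104 can be sketched as a single function. This is an illustration under assumptions, not the patented implementation: the `Proc` record, the function name `check_multiple`, and the use of `None` to mean "not allocated" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proc:
    pid: int
    user: str


def check_multiple(procs, allocated_user):
    """Return the processes judged erroneously submitted (S102 to S104).

    allocated_user is None when the scheduling system has not allocated
    the resource (the "no" branch of step S102).
    """
    if allocated_user is None:
        return list(procs)  # S104: every running process is a mis-submission
    # S103: only processes owned by someone other than the allocated user
    return [p for p in procs if p.user != allocated_user]


procs = [Proc(4242, "alice"), Proc(4250, "bob")]
print(check_multiple(procs, "alice"))  # only bob's process
print(check_multiple(procs, None))     # both processes
```

The returned list is what would be reported to operation and maintenance personnel.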
Step S105: when it is determined that the number of user task processes executing on the current computing resource is equal to 1, it is determined whether the scheduling system has allocated the current computing resource.
When the execution result of step S105 is no, execution is started from step S109.
When the execution result of step S105 is yes, there are three cases, corresponding to steps S106, S107, and S108, respectively.
Step S106: when the user corresponding to the executing user task process is consistent with the user to whom the scheduling system allocated the current computing resource, determining that the current computing resource is in a normal state, that is, the resource is being reasonably used by the user task process for which the scheduling system allocated it.
Step S107: when the executing user task process is not found, determining that the current computing resources are wasted: the scheduling system has allocated the current computing resources to the users, but there are no task processes of any users executing on the current computing resources.
The case of step S107 corresponds to the above-mentioned situation that step S101 returns the number of executing user task processes to "1" even if there are no executing user task processes on the current computing resource.
The problem identified in step S107 is serious: when another user applies for resources, the current computing resource cannot be reallocated because the scheduling system has already allocated it, yet the user to whom it was allocated is not actually using it. This waste is referred to as "occupied but unused".
Step S108: when an executing user task process exists and its user is inconsistent with the user to whom the scheduling system allocated the current computing resource, determining that the executing user task process was erroneously submitted to the current computing resource. An erroneously submitted process is one that a user has submitted to a computing resource that the scheduling system has not allocated to that user.
Step S109: when the execution result of step S105 is no, determining whether an executing user task process can be found.
Step S110: when the judgment result of step S109 is yes, determining that the executing user task process is an erroneous submission. The negative result in step S105 indicates that the scheduling system has not allocated the current computing resource, so the resource should normally be idle; a user task process is nevertheless executing on it, which means that the process was erroneously submitted. An erroneously submitted process is one that a user has submitted to a computing resource that the scheduling system has not allocated to that user.
Step S111: when the judgment result in step S109 is negative, it is determined that the current computing resource is in an idle state, that is, the computing resource is not allocated by the scheduling system, and no user task process is executed on the computing resource.
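The branch for a reported count of 1 (steps S105 to S111) can be summarized in a short sketch. The `Status` enum, the function name `check_single`, and the use of `None` for "no process found" and "not allocated" are hypothetical choices made for illustration.

```python
from enum import Enum


class Status(Enum):
    NORMAL = "normal"                # S106
    WASTED = "occupied but unused"   # S107
    MIS_SUBMITTED = "mis-submitted"  # S108 and S110
    IDLE = "idle"                    # S111


def check_single(running_user, allocated_user):
    """Classify a resource whose reported process count is 1.

    running_user: owner of the one process found, or None if none is found.
    allocated_user: user the scheduler allocated the resource to, or None.
    """
    if allocated_user is not None:  # S105: yes
        if running_user is None:
            return Status.WASTED
        if running_user == allocated_user:
            return Status.NORMAL
        return Status.MIS_SUBMITTED
    # S105: no, so S109 asks whether a running process can be found
    return Status.MIS_SUBMITTED if running_user is not None else Status.IDLE


print(check_single(None, "alice"))   # Status.WASTED
print(check_single("bob", "alice"))  # Status.MIS_SUBMITTED
print(check_single(None, None))      # Status.IDLE
```

The four outcomes cover every combination of "allocated or not" and "process found or not", which is why reporting a count of 1 for an empty resource does not lose information.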
Through the technical scheme of the embodiment, various unreasonable resource use problems can be found quickly, and help is provided for operation and maintenance personnel to take follow-up measures.
Please note that the above description is only for the preferred embodiment of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments illustrated herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (3)

1. A supercomputing platform resource usage monitoring method is characterized by comprising the following steps:
determining the number of user task processes being executed on the current computing resource;
determining whether the number of user task processes being executed is equal to 1; if so, determining whether the scheduling system has allocated the current computing resource; and if it has, determining that the current computing resource is wasted when no executing user task process is found.
2. The method of claim 1, wherein the method further comprises: if it is determined that the scheduling system has allocated the current computing resource, and an executing user task process exists whose user is inconsistent with the user to whom the scheduling system allocated the current computing resource, determining that the executing user task process was erroneously submitted to the current computing resource.
3. The method of claim 1, wherein the method further comprises: if it is determined that the scheduling system has allocated the current computing resource, and the user corresponding to the executing user task process is consistent with the user to whom the scheduling system allocated the current computing resource, determining that the current computing resource is in a normal state.
CN201911150115.8A 2019-11-21 2019-11-21 Supercomputing platform resource use monitoring method Pending CN110928756A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911150115.8A CN110928756A (en) 2019-11-21 2019-11-21 Supercomputing platform resource use monitoring method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911150115.8A CN110928756A (en) 2019-11-21 2019-11-21 Supercomputing platform resource use monitoring method

Publications (1)

Publication Number Publication Date
CN110928756A true CN110928756A (en) 2020-03-27

Family

ID=69850621

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911150115.8A Pending CN110928756A (en) 2019-11-21 2019-11-21 Supercomputing platform resource use monitoring method

Country Status (1)

Country Link
CN (1) CN110928756A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115495321B (en) * 2022-11-18 2023-03-24 天河超级计算淮海分中心 Automatic identification method for use state of super-computation node

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103235744A (en) * 2013-04-15 2013-08-07 中山大学 Application resource management system for smart TV (television)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103235744A (en) * 2013-04-15 2013-08-07 中山大学 Application resource management system for smart TV (television)

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
源建华 等: "《微机汉字FoxBASE+简明教程》", 30 November 1993, 高等教育出版社 *

Similar Documents

Publication Publication Date Title
EP3567829B1 (en) Resource management method and apparatus
US9875145B2 (en) Load based dynamic resource sets
US8185905B2 (en) Resource allocation in computing systems according to permissible flexibilities in the recommended resource requirements
US20120297236A1 (en) High availability system allowing conditionally reserved computing resource use and reclamation upon a failover
US20170017511A1 (en) Method for memory management in virtual machines, and corresponding system and computer program product
KR101733117B1 (en) Task distribution method on multicore system and apparatus thereof
JP2007115246A (en) Method and apparatus for dynamically allocating resource used by software
CN105159769A (en) Distributed job scheduling method suitable for heterogeneous computational capability cluster
CN110673927B (en) Scheduling method and device of virtual machine
CN111258746B (en) Resource allocation method and service equipment
KR20200078328A (en) Systems and methods of monitoring software application processes
CN106020984B (en) Method and device for creating process in electronic equipment
JP4348639B2 (en) Multiprocessor system and workload management method
CN115113987A (en) Method, device, equipment and medium for allocating non-uniform memory access resources
CN113032102A (en) Resource rescheduling method, device, equipment and medium
CN110928756A (en) Supercomputing platform resource use monitoring method
JP6477260B2 (en) Method and resource manager for executing an application
CN108287762B (en) Distributed computing interactive mode use resource optimization method and computer equipment
CN110865922A (en) Supercomputing platform resource use monitoring method
CN110928686A (en) Supercomputing platform resource use monitoring method
CN110879772A (en) Supercomputing platform resource use monitoring method
CN110941491A (en) Supercomputing platform resource use monitoring method
CN111143210A (en) Test task scheduling method and system
KR102090306B1 (en) Method and its apparatus for task load balancing for multicore system
CN111581041A (en) Method and equipment for testing performance of magnetic disk

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200327
