CN106844021B - Computing environment resource management system and management method thereof - Google Patents

Computing environment resource management system and management method thereof

Info

Publication number
CN106844021B
CN106844021B CN201611111871.6A CN201611111871A CN106844021B CN 106844021 B CN106844021 B CN 106844021B CN 201611111871 A CN201611111871 A CN 201611111871A CN 106844021 B CN106844021 B CN 106844021B
Authority
CN
China
Prior art keywords
unit
task
information
tasks
slave
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611111871.6A
Other languages
Chinese (zh)
Other versions
CN106844021A (en)
Inventor
王佳世
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
No32 Research Institute Of China Electronics Technology Group Corp
Original Assignee
No32 Research Institute Of China Electronics Technology Group Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by No32 Research Institute Of China Electronics Technology Group Corp
Priority to CN201611111871.6A
Publication of CN106844021A
Application granted
Publication of CN106844021B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/5083 Techniques for rebalancing the load in a distributed system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F9/5038 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals, considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G06F9/505 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals, considering the load
    • G06F2209/484 Precedence (indexing scheme relating to G06F9/48)
    • G06F2209/5017 Task decomposition (indexing scheme relating to G06F9/50)

Abstract

The invention provides a computing environment resource management system and a management method thereof. The system comprises a first statistical unit and a second statistical unit which are connected with each other. The first statistical unit comprises a first communication unit and other units, and the first communication unit, a first task information statistical unit and a first operating system are all connected with a first task management unit. The second statistical unit comprises a second task management unit and other units; the second task management unit, the second communication unit, the slave system information receiving unit, the second task information statistical unit and the system state statistical unit are all connected with the state information integration unit, and the second task management unit, the second task information statistical unit and the system state statistical unit are all connected with the second operating system. The invention integrates the resources of the master system and the slave systems, treats the whole CPU platform consisting of one master system and three slave systems as a single resource scheduling unit, truly reflects the system state, and improves resource management efficiency.

Description

Computing environment resource management system and management method thereof
Technical Field
The present invention relates to a management system and a management method thereof, and more particularly, to a computing environment resource management system and a management method thereof.
Background
SLURM is a highly available, fault-tolerant and scalable cluster resource manager and task scheduling system suitable for large cluster systems. It has three main functions. First, it dynamically allocates cluster resources to tasks. Second, it provides a complete framework for starting, executing and monitoring tasks. Third, it manages the task queue to arbitrate resource contention. The system mainly consists of one management daemon and a plurality of agent daemons. The management daemon runs on a management node; it receives cluster state monitoring data, schedules and allocates resources, distributes tasks and collects results. The agent daemons run on the computing nodes; each waits for tasks, executes them and returns their status, while also collecting and recording information such as the cluster state and task state and reporting it to the management node. The two kinds of daemon cooperate to implement cluster management.
The Shenwei platform is a domestic (Chinese) CPU platform developed by the Jiangnan Institute of Computing Technology. It has sixteen cores divided into four core groups: one master core group and three slave core groups. Each core group runs its own system: the master core group runs the master system, and the slave core groups run slave systems. A slave system depends on the master system and must go through the master system to acquire system resources and access the underlying hardware.
Therefore, the agent process on the master system can fully monitor the system state of the four core groups and truly reflect their resource condition. The agent process on a slave system, by contrast, cannot truly reflect that system's resource consumption; it can only monitor the running state of tasks on the slave system in order to distribute, monitor and reclaim them.
Consequently, if SLURM is deployed according to its original architecture, the agent process on a slave system can only obtain incorrect information and cannot reflect the actual condition of the computing node. The management node then cannot monitor the correct state of the cluster, resource consumption is misjudged, and the cluster ultimately cannot operate normally.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a computing environment resource management system and a management method thereof, which integrate the resources of a master system and a slave system, take the whole CPU platform consisting of a master system and three slave systems as a resource scheduling unit, truly reflect the system state and improve the resource management efficiency.
According to one aspect of the invention, a computing environment resource management system is provided, characterized in that it comprises a first statistical unit and a second statistical unit which are connected with each other. The first statistical unit comprises a first communication unit, a first task management unit, a first task information statistical unit and a first operating system, wherein the first communication unit, the first task information statistical unit and the first operating system are all connected with the first task management unit. The second statistical unit comprises a second task management unit, a second communication unit, a state information integration unit, a slave system information receiving unit, a second task information statistical unit, a system state statistical unit and a second operating system, wherein the second task management unit, the second communication unit, the slave system information receiving unit, the second task information statistical unit and the system state statistical unit are all connected with the state information integration unit, and the second task management unit, the second task information statistical unit and the system state statistical unit are all connected with the second operating system.
Preferably, the computing environment resource management system distinguishes between the master system and the slave systems and runs a different agent daemon on each.
Preferably, the agent daemon in the master system is modified and extended with additional functions.
The invention also provides a computing environment resource management method, characterized by comprising a task distribution process and a state information reporting process;
the task distribution process is as follows: the management daemon receives a computing task submitted by a system administrator; according to the task priority, occupied resources, running time and other parameters specified by the administrator, together with the resource scheduling strategy, it divides the task appropriately and distributes it to the master system of a computing node in a suitable partition; the state information integration unit acquires the task running state information of the slave systems, the task running state information of the master system, and the system state and resource consumption information of the master system from the slave system information receiving unit, the second task information statistical unit and the system state statistical unit respectively, and integrates them to obtain the overall state information of the master system and the three slave systems; the second task management unit acquires the distributed tasks through the second communication unit, acquires the overall state information through the state information integration unit, decomposes the tasks again according to the scheduling rule, starts part of the tasks on the master system, and sends the remaining tasks to the slave systems through the second communication unit; the second task management unit acquires the distributed tasks from the second communication unit, acquires the running state of the tasks from the second task information statistical unit, and starts the tasks when the resources meet the requirements;
the state information reporting process is as follows: the first task information statistical unit periodically collects the state information of the tasks running in the slave system and reports it to the master system through the first communication unit; the slave system information receiving unit is responsible for receiving the task information reported by the three slave systems; the second task information statistical unit is responsible for monitoring and collecting statistics on the tasks in the master system; the system state statistical unit monitors the running conditions, resource consumption and other information of the one master system and three slave systems; the state information integration unit integrates these three kinds of information to obtain the overall state information of the one master system and three slave systems, and reports it to the management daemon through the second communication unit.
Compared with the prior art, the invention has the following beneficial effects: the invention reduces the number of the nodes needing to be managed in the cluster to one fourth of the original number, thereby not only simplifying the cluster structure, but also reducing the communication traffic required by cluster management. Meanwhile, partial functions of the management daemon process are transferred to the main system agent daemon process, so that the load pressure of the management node is reduced, and the stability of the cluster system is improved.
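The one-fourth figure follows directly from the platform layout: a stock deployment would register one agent per system (four per platform), whereas the integrated scheme registers one scheduling unit per platform. A minimal arithmetic sketch follows; the function name `managed_nodes` and the example cluster size are illustrative assumptions, not taken from the patent.

```python
# Each Shenwei platform hosts one master system and three slave systems.
SYSTEMS_PER_PLATFORM = 4

def managed_nodes(platforms: int, integrated: bool) -> int:
    """Number of nodes the management daemon must track.

    integrated=False: one agent per system, as in a stock deployment.
    integrated=True:  one scheduling unit per platform, as proposed here.
    """
    return platforms if integrated else platforms * SYSTEMS_PER_PLATFORM

# Hypothetical cluster of 256 platforms.
before = managed_nodes(256, integrated=False)
after = managed_nodes(256, integrated=True)
print(before, after, after / before)
```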
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a functional block diagram of a computing environment resource management system of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications can be made by persons skilled in the art without departing from the spirit of the invention. All such variations and modifications fall within the scope of the present invention.
As shown in FIG. 1, the computing environment resource management system of the present invention includes a first statistical unit and a second statistical unit connected to each other. The first statistical unit includes a first communication unit, a first task management unit, a first task information statistical unit and a first operating system, and the first communication unit, the first task information statistical unit and the first operating system are all connected with the first task management unit. The second statistical unit includes a second task management unit, a second communication unit, a state information integration unit, a slave system information receiving unit, a second task information statistical unit, a system state statistical unit and a second operating system, wherein the second task management unit, the second communication unit, the slave system information receiving unit, the second task information statistical unit and the system state statistical unit are all connected with the state information integration unit, and the second task management unit, the second task information statistical unit and the system state statistical unit are all connected with the second operating system.
The technical scheme of the invention is described in detail below with reference to the drawing. Standard SLURM includes one management daemon (or several in a mutual hot-standby relationship, with only one active at any time) and a plurality of agent daemons. The management daemon runs on the management node; it receives cluster state monitoring data, schedules and allocates resources, distributes tasks and collects results. The agent daemons run on the computing nodes; each waits for tasks, executes them and returns their status, while also collecting and recording information such as the cluster state and task state and reporting it to the management node.
However, the sixteen cores of the Shenwei platform are divided into one master core group and three slave core groups, and each core group runs its own system. A slave system depends on the master system and must go through the master system to acquire system resources and access the underlying hardware, so the agent process on a slave system cannot truly reflect that system's resource consumption. Consequently, if SLURM is deployed according to its original architecture, the agent process on a slave system can only obtain incorrect information and cannot reflect the actual condition of the computing node. The management node then cannot monitor the correct state of the cluster, resource consumption is misjudged, and the cluster ultimately cannot operate normally.
In order to solve this problem, the invention provides, based on the SLURM software, a computing environment resource management system for the Shenwei platform that distinguishes between the master system and the slave systems and runs a different agent daemon on each. The agent process in a slave system is derived from the SLURM agent process by trimming functions: system state monitoring and similar functions are removed, and only the task management function is retained. The agent daemon in the master system is derived from the SLURM agent process by modifying and adding functions, including a function for integrating the state information of the master and slave systems, a priority redistribution function, a task management function and the like.
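The split between the two agent variants can be pictured as feature sets derived from a common base. The sketch below is illustrative only: the `Feature` names and the assumption that the stock agent consists of exactly task management plus state monitoring are simplifications introduced here, and nothing in it is SLURM's actual code or API.

```python
from enum import Flag, auto

class Feature(Flag):
    """Hypothetical labels for the agent functions named in the description."""
    TASK_MANAGEMENT = auto()
    STATE_MONITORING = auto()
    STATE_INTEGRATION = auto()
    PRIORITY_REDISTRIBUTION = auto()

# Assumed baseline: a stock agent manages tasks and monitors system state.
STOCK_AGENT = Feature.TASK_MANAGEMENT | Feature.STATE_MONITORING

# Slave agent: state monitoring trimmed away, task management kept.
SLAVE_AGENT = STOCK_AGENT & ~Feature.STATE_MONITORING

# Master agent: baseline plus integration and priority redistribution.
MASTER_AGENT = (STOCK_AGENT
                | Feature.STATE_INTEGRATION
                | Feature.PRIORITY_REDISTRIBUTION)

def agent_features(role: str) -> Feature:
    """Return the feature set for a 'master' or 'slave' system."""
    if role == "master":
        return MASTER_AGENT
    if role == "slave":
        return SLAVE_AGENT
    raise ValueError(f"unknown role: {role!r}")
```

Deriving both variants from one baseline keeps the task management behavior identical on master and slave systems, which matches the description's statement that both agents are based on the same SLURM agent process.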
The method for managing the resources of the computing environment comprises a task distribution process and a state information reporting process.
The task distribution process of the invention is as follows. The management daemon receives a computing task submitted by a system administrator. According to the task priority, occupied resources, running time and other parameters specified by the administrator, together with the resource scheduling strategy, it divides the task appropriately and distributes it to the master system of a computing node in a suitable partition. The state information integration unit acquires the task running state information of the slave systems, the task running state information of the master system, and the system state and resource consumption information of the master system from the slave system information receiving unit, the second task information statistical unit and the system state statistical unit respectively, and integrates them to obtain the overall state information of the master system and the three slave systems. The second task management unit obtains the distributed tasks through the second communication unit, obtains the overall state information through the state information integration unit, decomposes the tasks again according to the scheduling rule, starts part of the tasks on the master system, and sends the remaining tasks to the slave systems through the second communication unit. The second task management unit obtains the distributed tasks through the second communication unit, obtains the task running state through the second task information statistical unit, and starts the tasks when the resources meet the requirements.
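The second decomposition step, in which the master-side task management unit splits the received tasks between the master system and the slave systems, might be sketched as follows. The description does not specify the scheduling rule, so the greedy "most free slots first" rule, the `SystemState` type and the `free_slots` measure are all assumptions made purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class SystemState:
    name: str        # e.g. "master", "slave0" (hypothetical labels)
    free_slots: int  # assumed measure of spare capacity from the integrated state

def decompose(subtasks, states):
    """Greedily assign each subtask to the system with the most free slots.

    Returns (plan, pending): plan maps system name to its assigned subtasks;
    pending holds subtasks that must wait until resources meet the requirements.
    """
    free = {s.name: s.free_slots for s in states}
    plan = {s.name: [] for s in states}
    pending = []
    for task in subtasks:
        target = max(free, key=free.get)  # ties go to the first-listed system
        if free[target] <= 0:
            pending.append(task)          # no capacity anywhere: hold the task
            continue
        plan[target].append(task)
        free[target] -= 1
    return plan, pending

states = [SystemState("master", 2), SystemState("slave0", 1),
          SystemState("slave1", 1), SystemState("slave2", 0)]
plan, pending = decompose(["t1", "t2", "t3", "t4", "t5"], states)
print(plan, pending)
```

In this sketch the entries planned for `master` would be started locally, and the entries planned for each slave would be sent on through the communication unit.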
The state information reporting process of the invention is as follows. The first task information statistical unit periodically collects the state information of the tasks running in the slave system and reports it to the master system through the first communication unit. The slave system information receiving unit is responsible for receiving the task information reported by the three slave systems. The second task information statistical unit is responsible for monitoring and collecting statistics on the tasks in the master system. The system state statistical unit monitors the running conditions, resource consumption and other information of the one master system and three slave systems. The state information integration unit integrates these three kinds of information to obtain the overall state information of the one master system and three slave systems, and reports it to the management daemon through the second communication unit.
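The integration step can be illustrated with a small sketch that merges the three inputs (slave task reports, the master task report, and platform-wide system state) into one record for the management daemon. The field names, the `TaskReport` and `PlatformState` types and the sample values are hypothetical; the description does not define a report format.

```python
from dataclasses import dataclass

@dataclass
class TaskReport:
    system: str    # which system produced the report, e.g. "slave0"
    running: int   # tasks currently running on that system
    finished: int  # tasks completed on that system

@dataclass
class PlatformState:
    cpu_load: float   # assumed platform-wide figures gathered on the master
    mem_used_mb: int

def integrate(slave_reports, master_report, platform_state):
    """Combine per-system task reports and platform state into one record."""
    reports = [master_report, *slave_reports]
    return {
        "running": sum(r.running for r in reports),
        "finished": sum(r.finished for r in reports),
        "running_by_system": {r.system: r.running for r in reports},
        "cpu_load": platform_state.cpu_load,
        "mem_used_mb": platform_state.mem_used_mb,
    }

slaves = [TaskReport("slave0", 2, 5), TaskReport("slave1", 0, 3),
          TaskReport("slave2", 1, 1)]
overall = integrate(slaves, TaskReport("master", 1, 4), PlatformState(0.75, 2048))
print(overall)
```

Only this single integrated record would be reported upward, so the management daemon sees one scheduling unit per platform rather than four separate agents.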
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention.

Claims (4)

1. A computing environment resource management system is characterized by comprising a first statistical unit and a second statistical unit which are connected with each other, wherein the first statistical unit comprises a first communication unit, a first task management unit, a first task information statistical unit and a first operating system, and the first communication unit, the first task information statistical unit and the first operating system are all connected with the first task management unit; the second statistical unit comprises a second task management unit, a second communication unit, a state information integration unit, a slave system information receiving unit, a second task information statistical unit, a system state statistical unit and a second operating system, wherein the second task management unit, the second communication unit, the slave system information receiving unit, the second task information statistical unit and the system state statistical unit are all connected with the state information integration unit, and the second task management unit, the second task information statistical unit and the system state statistical unit are all connected with the second operating system;
the management daemon receives a computing task submitted by a system administrator; dividing the tasks according to task priorities, occupied resources, running time parameters and resource scheduling strategies specified by an administrator, and distributing the tasks to a certain computing node main system in a partition; the state information integration unit respectively acquires task running state information in the slave system, task running state information in the master system and system state and resource consumption information of the master system from the slave system information receiving unit, the second task information counting unit and the system state counting unit, and integrates the information together to obtain integral state information of the master system and the three slave systems; the second task management unit acquires the distributed tasks by the second communication unit, acquires the whole state information by the state information integration unit, decomposes the tasks again according to the scheduling rule, starts part of the tasks at the main system, and sends the other part of the tasks to the slave system through the second communication unit, the second task management unit acquires the distributed tasks by the second communication unit, acquires the running state of the tasks by the second task information statistical unit, and starts the tasks when the resources meet the requirements;
the first task information counting unit periodically counts the state information of the tasks running in the slave system and reports the state information to the master system through the first communication unit; the slave system information receiving unit is responsible for receiving three pieces of task information reported by the slave system; the second task information statistical unit is responsible for monitoring and counting tasks in the main system; the system state statistical unit monitors the running conditions and resource consumption information of one main system and three slave systems; the state information integration unit integrates the running states and resource consumption information of the main system and the three slave systems to obtain the overall state information of the main system and the three slave systems, and reports the overall state information to the management daemon process through the second communication unit.
2. The system according to claim 1, wherein the system distinguishes between master system and slave system, running different agent daemons;
based on the SLURM agent process, the agent process in the slave system is functionally trimmed, removing the system state monitoring function and only reserving the management function of the task; based on the SLURM agent process, the agent daemon process in the main system is modified and extended with functions, including the state information integration function of the main system and the slave system, the priority redistribution function and the task management function.
3. The system according to claim 2, wherein the agent daemon in the main system modifies or adds functions.
4. A computing environment resource management method is characterized by comprising a task distribution process and a state information reporting process;
the task distribution process is as follows: the management daemon receives a computing task submitted by a system administrator; dividing the tasks according to task priorities, occupied resources, running time parameters and resource scheduling strategies specified by an administrator, and distributing the tasks to a certain computing node main system in a partition; the state information integration unit respectively acquires task running state information in the slave system, task running state information in the master system and system state and resource consumption information of the master system from the slave system information receiving unit, the second task information counting unit and the system state counting unit, and integrates the information together to obtain integral state information of the master system and the three slave systems; the second task management unit acquires the distributed tasks through the second communication unit, acquires the whole state information through the state information integration unit, decomposes the tasks again according to the scheduling rule, starts part of the tasks at the main system, and transmits the other part of the tasks to the slave system through the second communication unit; the second task management unit acquires the distributed tasks from the second communication unit, acquires the running state of the tasks from the second task information statistical unit, and starts the tasks when the resources meet the requirements;
the status information reporting process comprises the following steps: the first task information counting unit periodically counts the state information of the tasks running in the slave system and reports the state information to the master system through the first communication unit; the slave system information receiving unit is responsible for receiving three pieces of task information reported by the slave system; the second task information statistical unit is responsible for monitoring and counting tasks in the main system; the system state statistical unit monitors the running conditions and resource consumption information of one main system and three slave systems; the state information integration unit integrates the running states and resource consumption information of the main system and the three slave systems to obtain the overall state information of the main system and the three slave systems, and reports the overall state information to the management daemon process through the second communication unit.
CN201611111871.6A 2016-12-06 2016-12-06 Computing environment resource management system and management method thereof Active CN106844021B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611111871.6A CN106844021B (en) 2016-12-06 2016-12-06 Computing environment resource management system and management method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611111871.6A CN106844021B (en) 2016-12-06 2016-12-06 Computing environment resource management system and management method thereof

Publications (2)

Publication Number Publication Date
CN106844021A CN106844021A (en) 2017-06-13
CN106844021B CN106844021B (en) 2020-08-25

Family

ID=59146333

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611111871.6A Active CN106844021B (en) 2016-12-06 2016-12-06 Computing environment resource management system and management method thereof

Country Status (1)

Country Link
CN (1) CN106844021B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110177020A (en) * 2019-06-18 2019-08-27 北京计算机技术及应用研究所 A kind of High-Performance Computing Cluster management method based on Slurm

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050283788A1 (en) * 2004-06-17 2005-12-22 Platform Computing Corporation Autonomic monitoring in a grid environment
US20060106996A1 (en) * 2004-11-15 2006-05-18 Ahmad Said A Updating data shared among systems
CN103501047A (en) * 2013-10-09 2014-01-08 云南电力调度控制中心 Intelligent fault wave recording main station information management system
CN105938357A (en) * 2015-03-02 2016-09-14 发那科株式会社 Control device capable of centrally managing control by grouping a plurality of systems

Also Published As

Publication number Publication date
CN106844021A (en) 2017-06-13

Similar Documents

Publication Publication Date Title
CN107066319B (en) Multi-dimensional scheduling system for heterogeneous resources
CN107038069B (en) Dynamic label matching DLMS scheduling method under Hadoop platform
CN102346460B (en) Transaction-based service control system and method
CN102402395B (en) Quorum disk-based non-interrupted operation method for high availability system
CN101309167B (en) Disaster allowable system and method based on cluster backup
US8949847B2 (en) Apparatus and method for managing resources in cluster computing environment
US9208029B2 (en) Computer system to switch logical group of virtual computers
CN102387173B (en) MapReduce system and method and device for scheduling tasks thereof
CN108920153B (en) Docker container dynamic scheduling method based on load prediction
CN102271145A (en) Virtual computer cluster and enforcement method thereof
CN101986272A (en) Task scheduling method under cloud computing environment
CN104618693A (en) Cloud computing based online processing task management method and system for monitoring video
CN110221920B (en) Deployment method, device, storage medium and system
CN102164184A (en) Computer entity access and management method for cloud computing network and cloud computing network
JP2011516998A (en) Workload scheduling method, system, and computer program
TW202133055A (en) Method for establishing system resource prediction and resource management model through multi-layer correlations
CN103761146A (en) Method for dynamically setting quantities of slots for MapReduce
CN111930493A (en) NodeManager state management method and device in cluster and computing equipment
CN105740085A (en) Fault tolerance processing method and device
CN103763373A (en) Method for dispatching based on cloud computing and dispatcher
CN104123183A (en) Cluster assignment dispatching method and device
CN104484228A (en) Distributed parallel task processing system based on Intelli-DSC (Intelligence-Data Service Center)
CN112737934B (en) Cluster type internet of things edge gateway device and method
CN106844021B (en) Computing environment resource management system and management method thereof
CN106815318B (en) Clustering method and system for time sequence database

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant