CN106844021B - Computing environment resource management system and management method thereof

Info

Publication number: CN106844021B
Application number: CN201611111871.6A
Authority: CN
Inventors: 王佳世
Original assignee: No32 Research Institute Of China Electronics Technology Group Corp
Current assignee: No32 Research Institute Of China Electronics Technology Group Corp
Priority date: 2016-12-06
Filing date: 2016-12-06
Publication date: 2020-08-25
Anticipated expiration: 2036-12-06
Also published as: CN106844021A

Abstract

The invention provides a computing environment resource management system and a management method thereof. The system comprises a first statistical unit and a second statistical unit which are connected with each other. The first statistical unit comprises a first communication unit and other units, and the first communication unit, a first task information statistical unit and a first operating system are all connected with a first task management unit. The second statistical unit comprises a second task management unit and other units; the second task management unit, the second communication unit, the slave system information receiving unit, the second task information statistical unit and the system state statistical unit are all connected with the state information integration unit, and the second task management unit, the second task information statistical unit and the system state statistical unit are all connected with the second operating system. The invention integrates the resources of the master system and the slave systems, takes the whole CPU platform consisting of one master system and three slave systems as a single resource scheduling unit, truly reflects the system state and improves the resource management efficiency.

Description

Computing environment resource management system and management method thereof

Technical Field

The present invention relates to a management system and a management method thereof, and more particularly, to a computing environment resource management system and a management method thereof.

Background

SLURM is a highly available, fault-tolerant and scalable cluster resource manager and task scheduling system for large cluster systems. It has three main functions: first, it dynamically allocates cluster resources to tasks; second, it provides a complete framework for starting, executing and monitoring tasks; finally, it manages the task queue to arbitrate resource contention. The system mainly comprises a management daemon and a plurality of agent daemons. The management daemon runs on a management node, receives cluster state monitoring data, schedules and allocates resources, distributes tasks and collects results. The agent daemon runs on a computing node, where it waits for and executes tasks and returns their state, while also collecting and recording information such as the cluster state and the task state and reporting it to the management node. The two cooperate to realize the management function of the cluster.

The Shenwei platform is a domestic CPU platform developed by the Jiangnan Institute of Computing Technology. It has sixteen cores divided into four core groups: one master core group and three slave core groups. Each core group runs its own system: the master core group runs the master system, and the slave core groups run slave systems. A slave system depends on the master system and must go through the master system to acquire system resources and access the underlying hardware.

Therefore, the agent process on the master system can fully monitor the system state of all four core groups and truly reflect their resource condition. The agent process on a slave system, by contrast, cannot truly reflect that system's resource consumption; it can only monitor the running state of tasks on the slave system in order to distribute, monitor and recover them.

Therefore, if SLURM is deployed according to its original architecture, the agent process on a slave system can only obtain erroneous information that does not reflect the actual situation of the computing node; the management node cannot monitor the correct state of the cluster, resource consumption is misjudged, and ultimately the cluster cannot operate normally.

Disclosure of Invention

Aiming at the defects in the prior art, the invention aims to provide a computing environment resource management system and a management method thereof, which integrate the resources of a master system and a slave system, take the whole CPU platform consisting of a master system and three slave systems as a resource scheduling unit, truly reflect the system state and improve the resource management efficiency.

According to one aspect of the invention, a computing environment resource management system is provided, which is characterized in that the computing environment resource management system comprises a first statistical unit and a second statistical unit which are connected with each other, the first statistical unit comprises a first communication unit, a first task management unit, a first task information statistical unit and a first operating system, and the first communication unit, the first task information statistical unit and the first operating system are all connected with the first task management unit; the second statistical unit comprises a second task management unit, a second communication unit, a state information integration unit, a slave system information receiving unit, a second task information statistical unit, a system state statistical unit and a second operating system, wherein the second task management unit, the second communication unit, the slave system information receiving unit, the second task information statistical unit and the system state statistical unit are all connected with the state information integration unit, and the second task management unit, the second task information statistical unit and the system state statistical unit are all connected with the second operating system.

Preferably, the computing environment resource management system distinguishes the master system and the slave system and runs different agent daemons.

Preferably, the agent daemon in the master system is modified and extended with additional functions.

The invention also provides a computing environment resource management method, which is characterized by comprising a task distribution process and a state information reporting process;

the task distribution process is as follows: the management daemon receives a computing task submitted by a system administrator; according to the task priority, occupied resources, running time and other parameters specified by the administrator, together with the resource scheduling strategy, the task is divided appropriately and distributed to the master system of a computing node in a suitable partition; the state information integration unit acquires the task running state information of the slave systems, the task running state information of the master system, and the system state and resource consumption information of the master system from the slave system information receiving unit, the second task information statistical unit and the system state statistical unit respectively, and integrates them to obtain the overall state information of the master system and the three slave systems; the second task management unit acquires the distributed task through the second communication unit, acquires the overall state information through the state information integration unit, decomposes the task again according to the scheduling rule, starts part of the task on the master system, and sends the remainder to the slave systems through the second communication unit; the second task management unit acquires the distributed tasks from the second communication unit, acquires the running state of the tasks from the second task information statistical unit, and starts the tasks when the resources meet the requirements;

the state information reporting process is as follows: the first task information statistical unit periodically collects the state information of the tasks running in its slave system and reports it to the master system through the first communication unit; the slave system information receiving unit is responsible for receiving the task information reported by the three slave systems; the second task information statistical unit is responsible for monitoring and counting the tasks in the master system; the system state statistical unit monitors the running conditions, resource consumption and other information of the one master system and three slave systems; the state information integration unit integrates these three kinds of information to obtain the overall state information of the one master system and three slave systems, and reports the overall state information to the management daemon through the second communication unit.

Compared with the prior art, the invention has the following beneficial effects: the invention reduces the number of the nodes needing to be managed in the cluster to one fourth of the original number, thereby not only simplifying the cluster structure, but also reducing the communication traffic required by cluster management. Meanwhile, partial functions of the management daemon process are transferred to the main system agent daemon process, so that the load pressure of the management node is reduced, and the stability of the cluster system is improved.
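The one-fourth figure follows directly from the platform layout described above (one master core group plus three slave core groups per CPU). A minimal arithmetic sketch is given below; the cluster size of 256 CPUs and the function name `managed_nodes` are illustrative assumptions and do not come from the invention.

```python
# One master core group + three slave core groups per CPU (from the text).
CORE_GROUPS_PER_CPU = 4


def managed_nodes(cpu_count: int, per_core_group_agents: bool) -> int:
    """Number of agent endpoints the management daemon has to track."""
    if per_core_group_agents:
        # Original SLURM layout: every core group's system is a separate node.
        return cpu_count * CORE_GROUPS_PER_CPU
    # Proposed layout: the whole CPU is a single resource scheduling unit.
    return cpu_count


cpus = 256  # hypothetical cluster size, for illustration only
original = managed_nodes(cpus, per_core_group_agents=True)
proposed = managed_nodes(cpus, per_core_group_agents=False)
ratio = proposed / original
print(original, proposed, ratio)
```

Under these assumptions the management daemon tracks 256 endpoints instead of 1024, which is the source of the reduced management traffic.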

Drawings

Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:

FIG. 1 is a functional block diagram of a computing environment resource management system of the present invention.

Detailed Description

The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that persons skilled in the art can make variations and modifications without departing from the spirit of the invention, and all of these fall within the scope of the present invention.

As shown in FIG. 1, the computing environment resource management system of the present invention includes a first statistical unit and a second statistical unit connected to each other, where the first statistical unit includes a first communication unit, a first task management unit, a first task information statistical unit, and a first operating system, and the first communication unit, the first task information statistical unit, and the first operating system are all connected with the first task management unit; the second statistical unit comprises a second task management unit, a second communication unit, a state information integration unit, a slave system information receiving unit, a second task information statistical unit, a system state statistical unit and a second operating system, wherein the second task management unit, the second communication unit, the slave system information receiving unit, the second task information statistical unit and the system state statistical unit are all connected with the state information integration unit, and the second task management unit, the second task information statistical unit and the system state statistical unit are all connected with the second operating system.

The technical scheme of the invention is described in detail below with reference to the attached drawing. Standard SLURM includes one management daemon (or several in a mutual hot-standby relationship, with only one active at any time) and a plurality of agent daemons. The management daemon runs on the management node, receives cluster state monitoring data, schedules and allocates resources, distributes tasks and collects results. The agent daemon runs on a computing node, where it waits for and executes tasks and returns their state, while also collecting and recording information such as the cluster state and the task state and reporting it to the management node.

However, the sixteen cores of the Shenwei platform are divided into one master core group and three slave core groups, and each core group runs its own system. A slave system depends on the master system and must go through the master system to acquire system resources and access the underlying hardware, so the agent process on a slave system cannot truly reflect that system's resource consumption. Therefore, if SLURM is deployed according to its original architecture, the agent process on a slave system can only obtain erroneous information that does not reflect the actual situation of the computing node; the management node cannot monitor the correct state of the cluster, resource consumption is misjudged, and ultimately the cluster cannot operate normally.

To solve this problem, the invention provides, based on SLURM software, a computing environment resource management system for the Shenwei platform that distinguishes the master system from the slave systems and runs a different agent daemon on each. The agent process in a slave system is a trimmed version of the SLURM agent process: functions such as system state monitoring are removed and only the task management function is retained. The agent daemon in the master system is a modified and extended version of the SLURM agent process; its functions include state information integration for the master and slave systems, priority redistribution, task management and the like.
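The split between the two agent variants can be pictured as two feature sets derived from one base agent. The sketch below is only an illustration: the names `AgentFunction`, `STOCK_AGENT` and `functions_for` are hypothetical, and the assumption that the unmodified agent consists of exactly task management plus system state monitoring is a simplification of the description above, not the actual SLURM code structure.

```python
from enum import Flag, auto


class AgentFunction(Flag):
    TASK_MANAGEMENT = auto()
    SYSTEM_STATE_MONITORING = auto()
    STATE_INTEGRATION = auto()        # added only on the master system
    PRIORITY_REDISTRIBUTION = auto()  # added only on the master system


# Assumed baseline: the unmodified agent manages tasks and monitors system state.
STOCK_AGENT = AgentFunction.TASK_MANAGEMENT | AgentFunction.SYSTEM_STATE_MONITORING


def functions_for(role: str) -> AgentFunction:
    """Return the feature set of the agent daemon for a given system role."""
    if role == "slave":
        # Trimmed agent: system state monitoring removed, task management kept.
        return STOCK_AGENT & ~AgentFunction.SYSTEM_STATE_MONITORING
    if role == "master":
        # Extended agent: baseline plus integration and priority redistribution.
        return (STOCK_AGENT
                | AgentFunction.STATE_INTEGRATION
                | AgentFunction.PRIORITY_REDISTRIBUTION)
    raise ValueError(f"unknown role: {role!r}")


slave_functions = functions_for("slave")
master_functions = functions_for("master")
```

With this layout the slave agent carries only `TASK_MANAGEMENT`, so any resource figures it might otherwise have produced never reach the management node.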

The method for managing the resources of the computing environment comprises a task distribution process and a state information reporting process.

The task distribution process of the invention is as follows. The management daemon receives a computing task submitted by a system administrator. According to the task priority, occupied resources, running time and other parameters specified by the administrator, together with the resource scheduling strategy, it divides the task appropriately and distributes it to the master system of a computing node in a suitable partition. The state information integration unit acquires the task running state information of the slave systems, the task running state information of the master system, and the system state and resource consumption information of the master system from the slave system information receiving unit, the second task information statistical unit and the system state statistical unit respectively, and integrates them to obtain the overall state information of the master system and the three slave systems. The second task management unit obtains the distributed task through the second communication unit, obtains the overall state information through the state information integration unit, decomposes the task again according to the scheduling rule, starts part of the task on the master system, and sends the remainder to the slave systems through the second communication unit. The second task management unit obtains the distributed tasks through the second communication unit, obtains the task running state through the second task information statistical unit, and starts the tasks when the resources meet the requirements.
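The second-level decomposition performed by the second task management unit can be sketched as follows. The description does not specify the scheduling rule, so this sketch assumes a simple "least-loaded system first" rule; the names `SystemLoad` and `decompose`, the system labels and the sample load figures are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SystemLoad:
    name: str           # e.g. "master", "slave0" (illustrative labels)
    running_tasks: int  # taken from the integrated overall state information


def decompose(subtasks: list[str], loads: list[SystemLoad]) -> dict[str, list[str]]:
    """Assign each subtask to the currently least-loaded system (assumed rule)."""
    counts = {s.name: s.running_tasks for s in loads}
    plan: dict[str, list[str]] = {s.name: [] for s in loads}
    for sub in subtasks:
        # Ties go to the system listed first, because dicts keep insertion order.
        target = min(counts, key=counts.get)
        plan[target].append(sub)
        counts[target] += 1
    return plan


loads = [
    SystemLoad("master", 2),
    SystemLoad("slave0", 0),
    SystemLoad("slave1", 1),
    SystemLoad("slave2", 0),
]
plan = decompose(["t0", "t1", "t2", "t3"], loads)
print(plan)
```

In this sample the busy master system receives no new subtasks, and the entries under the slave labels are what would be forwarded through the second communication unit.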

The state information reporting process of the invention is as follows. The first task information statistical unit periodically collects the state information of the tasks running in its slave system and reports it to the master system through the first communication unit. The slave system information receiving unit is responsible for receiving the task information reported by the three slave systems. The second task information statistical unit is responsible for monitoring and counting the tasks in the master system. The system state statistical unit monitors the running conditions, resource consumption and other information of the one master system and three slave systems. The state information integration unit integrates these three kinds of information to obtain the overall state information of the one master system and three slave systems, and reports the overall state information to the management daemon through the second communication unit.
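The integration step can be sketched as a merge of the three inputs into a single report for the whole CPU. This is a minimal illustration, not the invention's data format: the function name `integrate_state`, the dictionary layout, the task identifiers, the state strings and the resource fields are all assumptions made for the example.

```python
def integrate_state(slave_reports: dict, master_tasks: dict, system_state: dict) -> dict:
    """Merge slave task reports, master task statistics and system state."""
    # Key every task by (system, task id) so identical ids cannot collide.
    tasks = {("master", tid): state for tid, state in master_tasks.items()}
    for slave, report in slave_reports.items():
        for tid, state in report.items():
            tasks[(slave, tid)] = state
    return {
        "tasks": tasks,
        # Resource figures come only from the master side, which can observe
        # all four core groups.
        "resources": dict(system_state),
        "running": sum(1 for state in tasks.values() if state == "RUNNING"),
    }


# Hypothetical inputs from the three units feeding the integration unit.
slave_reports = {
    "slave0": {"j1": "RUNNING"},
    "slave1": {},
    "slave2": {"j2": "PENDING"},
}
master_tasks = {"j0": "RUNNING"}
system_state = {"cpu_util": 0.5, "mem_free_mb": 2048}

report = integrate_state(slave_reports, master_tasks, system_state)
print(report["running"], len(report["tasks"]))
```

The single `report` object is what would be handed to the second communication unit, so the management daemon sees one scheduling unit rather than four separate nodes.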

The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention.

Claims

1. A computing environment resource management system is characterized by comprising a first statistical unit and a second statistical unit which are connected with each other, wherein the first statistical unit comprises a first communication unit, a first task management unit, a first task information statistical unit and a first operating system, and the first communication unit, the first task information statistical unit and the first operating system are all connected with the first task management unit; the second statistical unit comprises a second task management unit, a second communication unit, a state information integration unit, a slave system information receiving unit, a second task information statistical unit, a system state statistical unit and a second operating system, wherein the second task management unit, the second communication unit, the slave system information receiving unit, the second task information statistical unit and the system state statistical unit are all connected with the state information integration unit, and the second task management unit, the second task information statistical unit and the system state statistical unit are all connected with the second operating system;

the management daemon receives a computing task submitted by a system administrator; dividing the tasks according to task priorities, occupied resources, running time parameters and resource scheduling strategies specified by an administrator, and distributing the tasks to a certain computing node main system in a partition; the state information integration unit respectively acquires task running state information in the slave system, task running state information in the master system and system state and resource consumption information of the master system from the slave system information receiving unit, the second task information counting unit and the system state counting unit, and integrates the information together to obtain integral state information of the master system and the three slave systems; the second task management unit acquires the distributed tasks by the second communication unit, acquires the whole state information by the state information integration unit, decomposes the tasks again according to the scheduling rule, starts part of the tasks at the main system, and sends the other part of the tasks to the slave system through the second communication unit, the second task management unit acquires the distributed tasks by the second communication unit, acquires the running state of the tasks by the second task information statistical unit, and starts the tasks when the resources meet the requirements;

the first task information counting unit periodically counts the state information of the tasks running in the slave system and reports the state information to the master system through the first communication unit; the slave system information receiving unit is responsible for receiving three pieces of task information reported by the slave system; the second task information statistical unit is responsible for monitoring and counting tasks in the main system; the system state statistical unit monitors the running conditions and resource consumption information of one main system and three slave systems; the state information integration unit integrates the running states and resource consumption information of the main system and the three slave systems to obtain the overall state information of the main system and the three slave systems, and reports the overall state information to the management daemon process through the second communication unit.

2. The system according to claim 1, wherein the system distinguishes between master system and slave system, running different agent daemons;

the agent process in the slave system is a functionally trimmed version of the SLURM agent process, in which the system state monitoring function is removed and only the task management function is retained; the agent daemon in the main system is a modified and extended version of the SLURM agent process, whose functions include the state information integration function of the main system and the slave system, the priority redistribution function and the task management function.

3. The system according to claim 2, wherein the agent daemon in the main system is modified or extended with additional functions.

4. A computing environment resource management method is characterized by comprising a task distribution process and a state information reporting process;

the task distribution process is as follows: the management daemon receives a computing task submitted by a system administrator; dividing the tasks according to task priorities, occupied resources, running time parameters and resource scheduling strategies specified by an administrator, and distributing the tasks to a certain computing node main system in a partition; the state information integration unit respectively acquires task running state information in the slave system, task running state information in the master system and system state and resource consumption information of the master system from the slave system information receiving unit, the second task information counting unit and the system state counting unit, and integrates the information together to obtain integral state information of the master system and the three slave systems; the second task management unit acquires the distributed tasks through the second communication unit, acquires the whole state information through the state information integration unit, decomposes the tasks again according to the scheduling rule, starts part of the tasks at the main system, and transmits the other part of the tasks to the slave system through the second communication unit; the second task management unit acquires the distributed tasks from the second communication unit, acquires the running state of the tasks from the second task information statistical unit, and starts the tasks when the resources meet the requirements;

the status information reporting process comprises the following steps: the first task information counting unit periodically counts the state information of the tasks running in the slave system and reports the state information to the master system through the first communication unit; the slave system information receiving unit is responsible for receiving three pieces of task information reported by the slave system; the second task information statistical unit is responsible for monitoring and counting tasks in the main system; the system state statistical unit monitors the running conditions and resource consumption information of one main system and three slave systems; the state information integration unit integrates the running states and resource consumption information of the main system and the three slave systems to obtain the overall state information of the main system and the three slave systems, and reports the overall state information to the management daemon process through the second communication unit.