CN106844021A - Computing environment resource management system and management method thereof - Google Patents

Computing environment resource management system and management method thereof

Info

Publication number
CN106844021A
Authority
CN
China
Prior art keywords
unit
task
information
management
resource
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611111871.6A
Other languages
Chinese (zh)
Other versions
CN106844021B (en)
Inventor
王佳世
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
No32 Research Institute Of China Electronics Technology Group Corp
Original Assignee
No32 Research Institute Of China Electronics Technology Group Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by No32 Research Institute Of China Electronics Technology Group Corp
Priority to CN201611111871.6A
Publication of CN106844021A
Application granted
Publication of CN106844021B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/48 Indexing scheme relating to G06F9/48
    • G06F2209/484 Precedence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/50 Indexing scheme relating to G06F9/50
    • G06F2209/5017 Task decomposition

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer And Data Communications (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a computing environment resource management system and a management method thereof. The system comprises a first statistical unit and a second statistical unit that are connected to each other. The first statistical unit comprises a first communication unit, among others; the first communication unit, a first task information statistical unit and a first operating system are all connected to a first task management unit. The second statistical unit comprises a second task management unit, among others; the second task management unit, the second communication unit, the slave system information receiving unit, the second task information statistical unit and the system state statistical unit are all connected to the state information integration unit, and the second task management unit, the second task information statistical unit and the system state statistical unit are all connected to the second operating system. The invention integrates the resources of the master system and the slave systems and treats the whole CPU platform, consisting of one master system and three slave systems, as a single resource scheduling unit, so that the system state is reflected accurately and resource management efficiency is improved.

Description

Computing environment resource management system and its management method
Technical field
The present invention relates to a management system and its management method, and in particular to a computing environment resource management system and its management method.
Background art
SLURM is a highly available, scalable, fault-tolerant cluster resource manager and task scheduling system for large cluster systems. It has three main functions. First, it dynamically allocates cluster resources to tasks. Second, it provides a complete framework for starting, executing and monitoring tasks. Finally, it manages the task queue and arbitrates resource contention. The system mainly comprises one management daemon and multiple agent daemons. The management daemon runs on the management node; it receives cluster state monitoring data, schedules and allocates resources, distributes tasks and collects results. The agent daemons run on the compute nodes; they wait for tasks, execute them and return the task status, while also collecting and recording information such as cluster state and task status and reporting it to the management node. Together, the two kinds of daemon implement the management functions of the cluster.
The Sunway (Shenwei) platform is a domestic CPU platform developed by the Jiangnan Institute of Computing Technology. It has 16 cores, divided into four core groups: one master core group and three slave core groups. Each core group has its own system installed; the master core group runs the master system, and the slave core groups run the slave systems. The slave systems depend on the master system and must go through it to obtain system resources and to access the underlying hardware devices.
Consequently, the agent process in the master system can fully monitor the system state of all four core groups and accurately reflect their resource situation. The agent process in a slave system, by contrast, cannot accurately reflect the resource consumption of its own system; it can only monitor the running state of tasks in that slave system and carry out operations such as task distribution, monitoring and collection.
Therefore, if SLURM is deployed according to its original architecture, the agent process in a slave system can only obtain incorrect information and cannot reflect the true condition of the compute node. The management node cannot monitor the correct state of the cluster, resource consumption is misjudged, and ultimately the cluster cannot run normally.
Summary of the invention
In view of the defects in the prior art, the object of the present invention is to provide a computing environment resource management system and its management method that integrate the resources of the master and slave systems and treat the whole CPU platform, consisting of one master system and three slave systems, as a single resource scheduling unit, so that the system state is reflected accurately and resource management efficiency is improved.
According to one aspect of the present invention, a computing environment resource management system is provided, characterized in that the system comprises a first statistical unit and a second statistical unit that are connected to each other. The first statistical unit comprises a first communication unit, a first task management unit, a first task information statistical unit and a first operating system; the first communication unit, the first task information statistical unit and the first operating system are all connected to the first task management unit. The second statistical unit comprises a second task management unit, a second communication unit, a state information integration unit, a slave system information receiving unit, a second task information statistical unit, a system state statistical unit and a second operating system; the second task management unit, the second communication unit, the slave system information receiving unit, the second task information statistical unit and the system state statistical unit are all connected to the state information integration unit, and the second task management unit, the second task information statistical unit and the system state statistical unit are all connected to the second operating system.
Preferably, the computing environment resource management system distinguishes between the master system and the slave systems and runs a different agent daemon on each.
Preferably, the agent daemon in the master system has modified and added functions.
The present invention also provides a computing environment resource management method, characterized in that it comprises a task distribution flow and a state information reporting flow.
The task distribution flow is as follows. The management daemon receives a computing task submitted by the system administrator. According to the parameters specified by the administrator, such as task priority, resources to be occupied and running time, and according to the resource scheduling policy, it divides the task appropriately and assigns it to the master system of a compute node in a suitable partition. The state information integration unit obtains the task running state information of the slave systems from the slave system information receiving unit, the task running state information of the master system from the second task information statistical unit, and the system state and resource consumption information of the master system from the system state statistical unit, and integrates this information into the complete state information of one master system and three slave systems. The second task management unit obtains the distributed task through the second communication unit and the complete state information from the state information integration unit, and then decomposes the task again according to the scheduling rules: part of the task is executed in the master system, and the other part is issued to the slave systems through the second communication unit. The second task management unit obtains the distributed task through the second communication unit, obtains the task running state from the second task information statistical unit, and starts the task when the resources meet the requirements.
The state information reporting flow is as follows. The first task information statistical unit periodically collects the state information of the tasks running in its slave system and reports it to the master system through the first communication unit. The slave system information receiving unit receives the task information reported by the three slave systems. The second task information statistical unit monitors and collects statistics on the tasks in the master system. The system state statistical unit monitors information such as the running condition and resource consumption of the one master system and three slave systems. The state information integration unit integrates these three kinds of information into the complete state information of one master system and three slave systems and reports it to the management daemon through the second communication unit.
Compared with the prior art, the present invention has the following beneficial effects. It reduces the number of nodes that cluster management has to handle to a quarter of the original number, which both simplifies the cluster topology and reduces the communication traffic required for cluster management. In addition, some functions of the management daemon are transferred to the agent daemon of the master system, which reduces the load on the management node and improves the stability of the cluster system.
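The reduction to a quarter is simple arithmetic and can be checked with a short sketch. It assumes, as the background describes, that a deployment following the original SLURM architecture would treat each of the four core-group systems of a CPU as its own node; the function name `managed_nodes` and the example cluster size are illustrative and do not come from the patent.

```python
def managed_nodes(core_group_systems: int, systems_per_cpu: int = 4) -> int:
    """Nodes the management daemon tracks when each whole CPU
    (one master system plus three slave systems) is one scheduling unit."""
    if core_group_systems % systems_per_cpu != 0:
        raise ValueError("core-group systems must come in whole CPUs")
    return core_group_systems // systems_per_cpu


# Hypothetical cluster with 1024 core-group systems.
per_system_nodes = 1024                          # one node per core-group system
per_cpu_nodes = managed_nodes(per_system_nodes)  # one node per whole CPU
print(per_system_nodes, per_cpu_nodes)
```

Fewer nodes also means fewer state reports reaching the management node per reporting period, which is where the claimed traffic reduction would come from.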
Brief description of the drawings
Other features, objects and advantages of the invention will become more apparent upon reading the detailed description of non-limiting embodiments made with reference to the following drawing:
Fig. 1 is a schematic block diagram of the computing environment resource management system of the present invention.
Detailed description of the embodiments
The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the invention, but do not limit it in any way. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and all of these fall within the scope of protection of the invention.
As shown in Fig. 1, the computing environment resource management system of the present invention comprises a first statistical unit and a second statistical unit that are connected to each other. The first statistical unit comprises a first communication unit, a first task management unit, a first task information statistical unit and a first operating system; the first communication unit, the first task information statistical unit and the first operating system are all connected to the first task management unit. The second statistical unit comprises a second task management unit, a second communication unit, a state information integration unit, a slave system information receiving unit, a second task information statistical unit, a system state statistical unit and a second operating system; the second task management unit, the second communication unit, the slave system information receiving unit, the second task information statistical unit and the system state statistical unit are all connected to the state information integration unit, and the second task management unit, the second task information statistical unit and the system state statistical unit are all connected to the second operating system.
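The connections listed for Fig. 1 can be written down as a small adjacency table, which makes it easy to check which units touch which. This is only a sketch: the identifiers are illustrative renderings of the unit names, and the links are assumed to be undirected, which the text does not state.

```python
# Hub unit -> units connected to it, as listed in the description.
# Identifiers are illustrative; they are not names used by the patent.
SLAVE_LINKS = {
    "first_task_management": [
        "first_communication",
        "first_task_info_statistics",
        "first_operating_system",
    ],
}
MASTER_LINKS = {
    "state_info_integration": [
        "second_task_management",
        "second_communication",
        "slave_info_receiving",
        "second_task_info_statistics",
        "system_state_statistics",
    ],
    "second_operating_system": [
        "second_task_management",
        "second_task_info_statistics",
        "system_state_statistics",
    ],
}


def neighbors(links: dict[str, list[str]], unit: str) -> set[str]:
    """Return every unit directly connected to `unit` (links treated as undirected)."""
    found = set(links.get(unit, []))
    found.update(hub for hub, spokes in links.items() if unit in spokes)
    return found


print(sorted(neighbors(MASTER_LINKS, "second_task_management")))
```

Laid out this way, the table shows that the second communication unit and the slave system information receiving unit reach the second operating system only through the state information integration unit.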
The technical solution of the invention is described in detail below with reference to the drawing. Standard SLURM comprises one management daemon (or several, which act as hot standbys for one another, with only one active at any time) and multiple agent daemons. The management daemon runs on the management node; it receives cluster state monitoring data, schedules and allocates resources, distributes tasks and collects results. The agent daemons run on the compute nodes; they wait for tasks, execute them and return the task status, while also collecting and recording information such as cluster state and task status and reporting it to the management node.
However, the 16 cores of the Sunway (Shenwei) platform are divided into one master core group and three slave core groups, and each core group has its own system installed. The slave systems depend on the master system and must go through it to obtain system resources and to access the underlying hardware devices, so the agent process in a slave system cannot accurately reflect the resource consumption of that system. Therefore, if SLURM is deployed according to its original architecture, the agent process in a slave system can only obtain incorrect information and cannot reflect the true condition of the compute node. The management node cannot monitor the correct state of the cluster, resource consumption is misjudged, and ultimately the cluster cannot run normally.
To solve this problem, the present invention provides, on the basis of the SLURM software, a computing environment resource management system for the Sunway platform that distinguishes between the master system and the slave systems and runs a different agent daemon on each. The agent process in a slave system is a trimmed-down SLURM agent process: functions such as system state monitoring are removed and only task management is retained. The agent daemon in the master system is a SLURM agent process with modified and added functions, including integration of master and slave state information, task redistribution and task priority management.
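The split between the two agent variants can be summarized as set operations on a feature list. This is a sketch under stated assumptions: the feature labels are coarse, illustrative names taken from the wording of the description, not SLURM configuration options or identifiers, and the stock agent is reduced to the two capabilities the text mentions.

```python
# Coarse capabilities of the stock agent process, as characterized in the text.
# These labels are illustrative, not SLURM identifiers.
STOCK_AGENT = frozenset({"task_management", "system_state_monitoring"})


def slave_agent(stock: frozenset = STOCK_AGENT) -> frozenset:
    """Slave-system agent: state monitoring is cut, task management is kept."""
    return stock - {"system_state_monitoring"}


def master_agent(stock: frozenset = STOCK_AGENT) -> frozenset:
    """Master-system agent: stock functions plus the three added ones."""
    return stock | {
        "state_info_integration",
        "task_redistribution",
        "task_priority_management",
    }


print(sorted(slave_agent()))
print(sorted(master_agent()))
```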
The computing environment resource management method of the present invention comprises a task distribution flow and a state information reporting flow.
The task distribution flow of the invention is as follows. The management daemon receives a computing task submitted by the system administrator. According to the parameters specified by the administrator, such as task priority, resources to be occupied and running time, and according to the resource scheduling policy, it divides the task appropriately and assigns it to the master system of a compute node in a suitable partition. The state information integration unit obtains the task running state information of the slave systems from the slave system information receiving unit, the task running state information of the master system from the second task information statistical unit, and the system state and resource consumption information of the master system from the system state statistical unit, and integrates this information into the complete state information of one master system and three slave systems. The second task management unit obtains the distributed task through the second communication unit and the complete state information from the state information integration unit, and then decomposes the task again according to the scheduling rules: part of the task is executed in the master system, and the other part is issued to the slave systems through the second communication unit. The second task management unit obtains the distributed task through the second communication unit, obtains the task running state from the second task information statistical unit, and starts the task when the resources meet the requirements.
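The second decomposition step, in which the master system keeps part of a task and hands the rest to the slave systems, can be sketched as follows. The patent does not specify the scheduling rule, so this example substitutes a simple least-loaded rule as an assumption; `SystemState`, `redistribute` and the subtask names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SystemState:
    """One entry of the complete state information (hypothetical shape)."""
    name: str           # e.g. "master", "slave1"
    running_tasks: int  # tasks currently running on that system


def redistribute(subtasks: list[str], states: list[SystemState]) -> dict[str, list[str]]:
    """Assign each subtask to the currently least-loaded system.

    The least-loaded rule is an assumed stand-in for the unspecified
    scheduling rule; ties are broken by system name.
    """
    load = {s.name: s.running_tasks for s in states}
    plan: dict[str, list[str]] = {name: [] for name in load}
    for task in subtasks:
        target = min(load, key=lambda name: (load[name], name))
        plan[target].append(task)
        load[target] += 1
    return plan


states = [
    SystemState("master", 2),
    SystemState("slave1", 0),
    SystemState("slave2", 1),
    SystemState("slave3", 0),
]
plan = redistribute(["t1", "t2", "t3", "t4"], states)
print(plan)
```

In this run the master system is already the busiest, so all four subtasks go to the slave systems; with a lighter master load, part of the work would stay local, matching the split described in the flow.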
The state information reporting flow of the invention is as follows. The first task information statistical unit periodically collects the state information of the tasks running in its slave system and reports it to the master system through the first communication unit. The slave system information receiving unit receives the task information reported by the three slave systems. The second task information statistical unit monitors and collects statistics on the tasks in the master system. The system state statistical unit monitors information such as the running condition and resource consumption of the one master system and three slave systems. The state information integration unit integrates these three kinds of information into the complete state information of one master system and three slave systems and reports it to the management daemon through the second communication unit.
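The integration step can be pictured as merging three inputs into one record that is then reported upward. The sketch below is an assumption about the shape of that record: the field names, the slave system names, and the tracking of missing slave reports are illustrative additions, not details given in the patent.

```python
EXPECTED_SLAVES = ("slave1", "slave2", "slave3")  # hypothetical system names


def integrate_state(
    slave_task_reports: dict[str, list[str]],
    master_tasks: list[str],
    system_state: dict[str, float],
) -> dict:
    """Merge slave task reports, master task statistics and system state
    into one complete-state record for the whole CPU."""
    missing = [s for s in EXPECTED_SLAVES if s not in slave_task_reports]
    tasks = {"master": master_tasks}
    for slave in EXPECTED_SLAVES:
        # A slave that has not reported yet contributes an empty task list.
        tasks[slave] = slave_task_reports.get(slave, [])
    return {
        "tasks": tasks,
        "system": system_state,
        "missing_slave_reports": missing,
    }


report = integrate_state(
    {"slave1": ["job-7"], "slave3": []},   # from the slave system information receiving unit
    ["job-3"],                             # from the second task information statistical unit
    {"cpu_load": 0.5},                     # from the system state statistical unit
)
print(report)
```

Recording which slave reports are missing is one way the management daemon could tell a late report apart from an idle slave system.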
Specific embodiments of the invention have been described above. It should be understood that the invention is not limited to the particular embodiments described above; those skilled in the art can make various variations or modifications within the scope of the claims without affecting the substance of the invention.

Claims (4)

1. A computing environment resource management system, characterized in that the system comprises a first statistical unit and a second statistical unit that are connected to each other; the first statistical unit comprises a first communication unit, a first task management unit, a first task information statistical unit and a first operating system, and the first communication unit, the first task information statistical unit and the first operating system are all connected to the first task management unit; the second statistical unit comprises a second task management unit, a second communication unit, a state information integration unit, a slave system information receiving unit, a second task information statistical unit, a system state statistical unit and a second operating system; the second task management unit, the second communication unit, the slave system information receiving unit, the second task information statistical unit and the system state statistical unit are all connected to the state information integration unit, and the second task management unit, the second task information statistical unit and the system state statistical unit are all connected to the second operating system.
2. The computing environment resource management system according to claim 1, characterized in that the system distinguishes between the master system and the slave systems and runs a different agent daemon on each.
3. The computing environment resource management system according to claim 2, characterized in that the agent daemon in the master system has modified and added functions.
4. A computing environment resource management method, characterized in that it comprises a task distribution flow and a state information reporting flow;
the task distribution flow is as follows: the management daemon receives a computing task submitted by the system administrator; according to the parameters specified by the administrator, such as task priority, resources to be occupied and running time, and according to the resource scheduling policy, it divides the task appropriately and assigns it to the master system of a compute node in a suitable partition; the state information integration unit obtains the task running state information of the slave systems from the slave system information receiving unit, the task running state information of the master system from the second task information statistical unit, and the system state and resource consumption information of the master system from the system state statistical unit, and integrates this information into the complete state information of one master system and three slave systems; the second task management unit obtains the distributed task through the second communication unit and the complete state information from the state information integration unit, and then decomposes the task again according to the scheduling rules, part of the task being executed in the master system and the other part being issued to the slave systems through the second communication unit; the second task management unit obtains the distributed task through the second communication unit, obtains the task running state from the second task information statistical unit, and starts the task when the resources meet the requirements;
the state information reporting flow is as follows: the first task information statistical unit periodically collects the state information of the tasks running in its slave system and reports it to the master system through the first communication unit; the slave system information receiving unit receives the task information reported by the three slave systems; the second task information statistical unit monitors and collects statistics on the tasks in the master system; the system state statistical unit monitors information such as the running condition and resource consumption of the one master system and three slave systems; the state information integration unit integrates these three kinds of information into the complete state information of one master system and three slave systems and reports it to the management daemon through the second communication unit.
CN201611111871.6A 2016-12-06 2016-12-06 Computing environment resource management system and management method thereof Active CN106844021B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611111871.6A CN106844021B (en) 2016-12-06 2016-12-06 Computing environment resource management system and management method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611111871.6A CN106844021B (en) 2016-12-06 2016-12-06 Computing environment resource management system and management method thereof

Publications (2)

Publication Number Publication Date
CN106844021A true CN106844021A (en) 2017-06-13
CN106844021B CN106844021B (en) 2020-08-25

Family

ID=59146333

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611111871.6A Active CN106844021B (en) 2016-12-06 2016-12-06 Computing environment resource management system and management method thereof

Country Status (1)

Country Link
CN (1) CN106844021B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110177020A (en) * 2019-06-18 2019-08-27 北京计算机技术及应用研究所 A kind of High-Performance Computing Cluster management method based on Slurm

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050283788A1 (en) * 2004-06-17 2005-12-22 Platform Computing Corporation Autonomic monitoring in a grid environment
US20060106996A1 (en) * 2004-11-15 2006-05-18 Ahmad Said A Updating data shared among systems
CN103501047A (en) * 2013-10-09 2014-01-08 云南电力调度控制中心 Intelligent fault wave recording main station information management system
CN105938357A (en) * 2015-03-02 2016-09-14 发那科株式会社 Control device capable of centrally managing control by grouping a plurality of systems

Also Published As

Publication number Publication date
CN106844021B (en) 2020-08-25

Similar Documents

Publication Publication Date Title
CN103873279B (en) Server management method and server management device
CN106708622A (en) Cluster resource processing method and system, and resource processing cluster
CN108092813A (en) Data center's total management system server hardware Governance framework and implementation method
CN111339175B (en) Data processing method, device, electronic equipment and readable storage medium
CN104580338A (en) Service processing method, system and equipment
CN102929773A (en) Information collection method and device
US20140115153A1 (en) Apparatus for monitoring data distribution service (dds) and method thereof
US11212173B2 (en) Model-driven technique for virtual network function rehoming for service chains
CN114116172A (en) Flow data acquisition method, device, equipment and storage medium
CN103763373A (en) Method for dispatching based on cloud computing and dispatcher
CN108563787A (en) A kind of data interaction management system and method for data center's total management system
JP2010128597A (en) Information processor and method of operating the same
CN106844021A (en) Computing environment resource management system and management method thereof
CN116260738B (en) Equipment monitoring method and related equipment
US9009735B2 (en) Method for processing data, computing node, and system
CN103442212A (en) Network security and protection comprehensive early warning type management system platform
Benford Requirements of Activity Management.
EP2770447B1 (en) Data processing method, computational node and system
CN116346823A (en) Big data heterogeneous task scheduling method and system based on message queue
CN114757448B (en) Manufacturing inter-link optimal value chain construction method based on data space model
CN112000657A (en) Data management method, device, server and storage medium
Liu et al. Distributed ale in rfid middleware
CN115168042A (en) Management method and device of monitoring cluster, computer storage medium and electronic equipment
Gulhane Enhancing queuing efficiency using discrete event simulation
JP2009157597A (en) Automatic distribution system for remote maintenance software, and automatic distribution method for remote maintenance software

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant