CN103713974A - High-performance job scheduling management node dual-computer reinforcement method and device - Google Patents

High-performance job scheduling management node dual-computer reinforcement method and device

Publication number
CN103713974A
Authority
CN
China
Prior art keywords
management node
heartbeat
job scheduling
operating system
resource
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410007013.1A
Other languages
Chinese (zh)
Other versions
CN103713974B (en)
Inventor
马四腾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Beijing Electronic Information Industry Co Ltd
Original Assignee
Inspur Beijing Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Beijing Electronic Information Industry Co Ltd
Priority to CN201410007013.1A
Publication of CN103713974A
Application granted
Publication of CN103713974B
Active
Anticipated expiration

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a high-performance job scheduling management node dual-computer reinforcement method to simultaneously monitor heartbeat information and operating system resources of a main management node. When faults are found to happen to the heartbeat information or the operating system resources of the main management node, management node switching is started. Meanwhile, the invention further provides a corresponding device. Dual-computer reinforcement of the job scheduling management node is achieved through the method and the device, the operating system resources can be monitored, and the defects of a traditional method are effectively overcome.

Description

A dual-computer reinforcement method and device for high-performance job scheduling management nodes
Technical field
The present invention relates to the field of computer technology, and in particular to dual-computer reinforcement of a job scheduling management node.
Background art
Advances in networked computing have driven the development and widespread use of cluster systems. Connecting high-performance workstations or personal computers (PCs) into a cluster over a high-speed network, in a defined structure, enables parallel computing and delivers the performance of a mainframe or parallel machine at very low cost. However, as high-performance computing clusters grow in scale, cluster management problems arise as well. A job scheduling system is mainly responsible for receiving the job requests that users submit and for selecting suitable computing resources to run each job, according to specific scheduling rules and the user's requirements for the job. With the help of a job scheduling system, a high-performance computing cluster appears to its users as one large server with many CPUs that multiple users can use at the same time. The job scheduling system manages the jobs that users submit and allocates resources to each job sensibly, so that the computing capacity of the cluster is fully used and results are obtained as quickly as possible. The job scheduling system is therefore critically important.
Traditional reinforcement approaches either deploy the management node on a single machine or use a heartbeat scheme for dual-computer reinforcement. Both have defects. With a single-machine deployment, once the management node fails, the job scheduling system of the whole cluster stops working: the cluster's jobs can no longer be scheduled sensibly and efficiently, running jobs stall, and system efficiency suffers severely. With a heartbeat scheme, the design of the heartbeat software itself means that it cannot monitor the job scheduling system at the resource level. Once a resource fails, the resource cannot be switched over effectively, so the cluster's jobs again cannot be scheduled sensibly and efficiently, and system efficiency suffers severely. Because both approaches have serious shortcomings, how to reinforce the job scheduling system more effectively has become a technical problem in urgent need of a solution.
Summary of the invention
The present invention proposes a dual-computer reinforcement method and device for high-performance job scheduling management nodes. On the one hand, it avoids the single point of failure caused by single-machine deployment; on the other hand, it monitors operating system resources, which effectively remedies the deficiencies of the traditional methods.
A dual-computer reinforcement method for a high-performance job scheduling management node comprises:
Step 1: mount the shared directory of an NFS server on the dual-computer job scheduling management nodes, and start heartbeat monitoring and resource monitoring;
Step 2: heartbeat monitoring and resource monitoring monitor, respectively, the heartbeat information and the operating system resources of the current main management node;
Step 3: determine whether the heartbeat information or the operating system resources of the current main management node have failed, and if so, start a management node switch.
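The three steps can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`shared_dir_mounted`, `needs_switch`, `monitor_once`) and the callback-based structure are assumptions made for illustration, and the actual heartbeat and resource checks are left to the caller.

```python
import os
from typing import Callable


def shared_dir_mounted(mount_point: str) -> bool:
    """Step 1 precondition (hypothetical helper): is the shared directory mounted here?"""
    return os.path.ismount(mount_point)


def needs_switch(heartbeat_ok: bool, resources_ok: bool) -> bool:
    """Step 3: a switch is needed when either the heartbeat or the OS resources have failed."""
    return not (heartbeat_ok and resources_ok)


def monitor_once(
    check_heartbeat: Callable[[], bool],
    check_resources: Callable[[], bool],
    start_switch: Callable[[], None],
) -> bool:
    """One polling round of Step 2 and Step 3. Returns True if a switch was started."""
    heartbeat_ok = check_heartbeat()    # Step 2: heartbeat of the main management node
    resources_ok = check_resources()    # Step 2: OS resources of the main management node
    if needs_switch(heartbeat_ok, resources_ok):
        start_switch()                  # Step 3: start the management node switch
        return True
    return False
```

The point of the sketch is the condition in Step 3: a failure of either monitored item is sufficient to start the switch, whereas a heartbeat-only scheme would react to the first item alone.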
A dual-computer reinforcement device for a high-performance job scheduling management node comprises:
A heartbeat monitoring module, configured to monitor the heartbeat information of the current main management node and to report heartbeat failure messages to the resource monitoring module;
A resource monitoring module, configured to monitor the operating system resources of the current main management node and to start a management node switch when it receives a heartbeat failure message or determines that the operating system resources have failed.
The beneficial effect of the invention is that it achieves dual-computer reinforcement of the job scheduling management node while also monitoring operating system resources, which effectively remedies the deficiencies of the traditional methods.
Brief description of the drawings
Fig. 1 is a block diagram of the operating logic of the dual-computer reinforcement method for a high-performance job scheduling management node proposed by the present invention.
Fig. 2 is a flowchart of the dual-computer reinforcement method for a high-performance job scheduling management node proposed by the present invention.
Fig. 3 is a schematic block diagram of the dual-computer reinforcement device for a high-performance job scheduling management node proposed by the present invention.
Detailed description of the embodiments
Referring to Fig. 1, which shows a block diagram of the operating logic of the method proposed by the present invention, the method runs on management node 1 (the main management node) and management node 2. The heartbeat monitoring module monitors the heartbeat information of the main management node in real time and, when it finds that the heartbeat of the main management node has failed, reports this to the resource monitoring module. The resource monitoring module monitors the operating system resources on the main management node in real time. When it finds that the operating system resources have failed, or when it receives a heartbeat failure report for the main management node from the heartbeat monitoring module, it starts the management node switching process, making management node 2 the main management node.
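The division of labour between the two modules can be sketched as follows. This is an illustrative model only, with assumed class and method names (`Cluster`, `ResourceMonitor`, `HeartbeatMonitor`, `observe`, and so on); it shows the heartbeat monitor reporting failures and the resource monitor being the one component that starts the switch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cluster:
    """Two management nodes; `active` is the current main management node."""
    active: str = "node1"
    standby: str = "node2"
    log: List[str] = field(default_factory=list)

    def switch(self, reason: str) -> None:
        # The standby node becomes the main management node.
        self.active, self.standby = self.standby, self.active
        self.log.append(reason)


class ResourceMonitor:
    """Monitors OS resources and owns the switching decision."""

    def __init__(self, cluster: Cluster) -> None:
        self.cluster = cluster

    def on_heartbeat_failure(self, node: str) -> None:
        # Report received from the heartbeat monitoring module.
        if node == self.cluster.active:
            self.cluster.switch(f"heartbeat failure on {node}")

    def on_resource_check(self, node: str, healthy: bool) -> None:
        # Result of an OS resource check; only the main node's failures matter.
        if node == self.cluster.active and not healthy:
            self.cluster.switch(f"resource failure on {node}")


class HeartbeatMonitor:
    """Watches the heartbeat and reports failures; it never switches by itself."""

    def __init__(self, resource_monitor: ResourceMonitor) -> None:
        self.resource_monitor = resource_monitor

    def observe(self, node: str, alive: bool) -> None:
        if not alive:
            self.resource_monitor.on_heartbeat_failure(node)
```

For example, after `HeartbeatMonitor(ResourceMonitor(cluster)).observe("node1", False)` on a fresh `Cluster()`, `cluster.active` is `"node2"`. Routing both failure sources through one decision point is a design choice of this sketch that mirrors the reporting path described above.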
Referring to Fig. 2, which shows a flowchart of the dual-computer reinforcement method for a high-performance job scheduling management node proposed by the present invention, the method comprises:
Step 1: mount the shared directory of the NFS server on the dual-computer job scheduling management nodes, and start heartbeat monitoring (corosync) and resource monitoring (pacemaker). Heartbeat monitoring and resource monitoring each monitor management node 1 and management node 2, where management node 1 serves as the main management node and management node 2 serves as the standby node; together, management node 1 and management node 2 form the dual-computer job scheduling nodes. The user can configure the heartbeat monitoring and resource monitoring parameters in advance, for example the monitoring timeout (timeout) and monitoring interval (interval) of each resource, and the grouping and start order of resources. STONITH also needs to be configured, so as to ensure the availability of resources to the greatest possible extent.
Step 2: heartbeat monitoring and resource monitoring monitor, respectively, the heartbeat information and the operating system resources of the current main management node.
Step 3: determine whether the heartbeat information or the operating system resources of the current main management node have failed, and if so, start a management node switch.
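The role of the `timeout` and `interval` parameters from Step 1 can be sketched in Python. The sketch assumes the usual reading of these two settings, namely that `interval` is how often a check runs and `timeout` is how long a single check may take before it counts as failed; the patent text does not define them further. `MonitorOp`, `check_failed`, and `schedule` are hypothetical names, not part of corosync or pacemaker.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MonitorOp:
    """Hypothetical container for the two user-configured monitoring parameters."""
    interval: float  # seconds between the start of consecutive checks
    timeout: float   # maximum seconds a single check may take

    def __post_init__(self) -> None:
        if self.interval <= 0 or self.timeout <= 0:
            raise ValueError("interval and timeout must be positive")


def check_failed(ok: bool, elapsed: float, op: MonitorOp) -> bool:
    """A check fails if it reports an error or if it overruns its timeout."""
    return (not ok) or elapsed > op.timeout


def schedule(start: float, rounds: int, op: MonitorOp) -> List[float]:
    """Start times of the next `rounds` checks, spaced by the configured interval."""
    return [start + i * op.interval for i in range(rounds)]
```

With `MonitorOp(interval=10.0, timeout=30.0)`, a check that succeeds after 5 seconds passes, while one that is still running after 31 seconds is treated as a failure and would feed into the switching decision of Step 3.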
Referring to Fig. 3, which shows the dual-computer reinforcement device for a high-performance job scheduling management node proposed by the present invention, the device comprises:
A heartbeat monitoring module, configured to monitor the heartbeat information of the current main management node and to report heartbeat failure messages to the resource monitoring module;
A resource monitoring module, configured to monitor the operating system resources of the current main management node and to start a management node switch when it receives a heartbeat failure message or determines that the operating system resources have failed.
Of course, the present invention may also have various other embodiments. Without departing from the spirit and essence of the present invention, those of ordinary skill in the art may make various corresponding changes and modifications according to the present invention, but all such changes and modifications shall fall within the scope of protection of the claims of the present invention.

Claims (3)

1. A dual-computer reinforcement method for a high-performance job scheduling management node, characterized by comprising:
Step 1: mounting the shared directory of an NFS server on the dual-computer job scheduling management nodes, and starting heartbeat monitoring and resource monitoring;
Step 2: heartbeat monitoring and resource monitoring monitoring, respectively, the heartbeat information and the operating system resources of the current main management node;
Step 3: determining whether the heartbeat information or the operating system resources of the current main management node have failed, and if so, starting a management node switch.
2. The method of claim 1, characterized in that:
The user configures the heartbeat monitoring and resource monitoring parameters in advance, the parameters comprising a monitoring timeout (timeout) and a monitoring interval (interval).
3. A dual-computer reinforcement device for a high-performance job scheduling management node, characterized by comprising:
A heartbeat monitoring module, configured to monitor the heartbeat information of the current main management node and to report heartbeat failure messages to the resource monitoring module;
A resource monitoring module, configured to monitor the operating system resources of the current main management node and to start a management node switch when it receives a heartbeat failure message or determines that the operating system resources have failed.
CN201410007013.1A 2014-01-07 2014-01-07 High-performance job scheduling management node dual-computer reinforcement method and device Active CN103713974B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410007013.1A CN103713974B (en) 2014-01-07 2014-01-07 High-performance job scheduling management node dual-computer reinforcement method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410007013.1A CN103713974B (en) 2014-01-07 2014-01-07 High-performance job scheduling management node dual-computer reinforcement method and device

Publications (2)

Publication Number Publication Date
CN103713974A true CN103713974A (en) 2014-04-09
CN103713974B CN103713974B (en) 2016-02-17

Family

ID=50406975

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410007013.1A Active CN103713974B (en) 2014-01-07 2014-01-07 High-performance job scheduling management node dual-computer reinforcement method and device

Country Status (1)

Country Link
CN (1) CN103713974B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101179432A (en) * 2007-12-13 2008-05-14 浪潮电子信息产业股份有限公司 Method of implementing high availability of system in multi-machine surroundings
US20090193071A1 (en) * 2008-01-30 2009-07-30 At&T Knowledge Ventures, L.P. Facilitating Deployment of New Application Services in a Next Generation Network
CN103227838A (en) * 2013-05-10 2013-07-31 中国工商银行股份有限公司 Multi-load equalization processing device and method
CN103279386A (en) * 2013-06-09 2013-09-04 浪潮电子信息产业股份有限公司 Method for achieving high availability of computer operation scheduling system
CN103297543A (en) * 2013-06-24 2013-09-11 浪潮电子信息产业股份有限公司 Job scheduling method based on computer cluster

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942128A (en) * 2014-04-29 2014-07-23 浪潮电子信息产业股份有限公司 Double-computer reinforcing method for high-performance job scheduling management node
CN104123183B (en) * 2014-07-28 2017-11-14 浪潮(北京)电子信息产业有限公司 Cluster job scheduling method and apparatus
CN104123183A (en) * 2014-07-28 2014-10-29 浪潮(北京)电子信息产业有限公司 Cluster assignment dispatching method and device
CN105141456A (en) * 2015-08-25 2015-12-09 山东超越数控电子有限公司 Method for monitoring high-availability cluster resource
CN105260377A (en) * 2015-09-01 2016-01-20 浪潮(北京)电子信息产业有限公司 Updating method and system based on hierarchical storage
CN105260377B (en) * 2015-09-01 2019-02-12 浪潮(北京)电子信息产业有限公司 A kind of upgrade method and system based on classification storage
CN106708881A (en) * 2015-11-17 2017-05-24 华为技术有限公司 Interaction method and device based on network file system
CN106708881B (en) * 2015-11-17 2020-08-25 华为技术有限公司 Interaction method and device based on network file system
CN105743995A (en) * 2016-04-05 2016-07-06 北京轻元科技有限公司 Transplantable high-available container cluster deploying and managing system and method
CN105743995B (en) * 2016-04-05 2019-10-18 北京轻元科技有限公司 A kind of system and method for the deployment of portable High Availabitity and management container cluster
CN107819619A (en) * 2017-11-02 2018-03-20 郑州云海信息技术有限公司 A kind of continual method of access for realizing NFS
CN109062184A (en) * 2018-08-10 2018-12-21 中国船舶重工集团公司第七〇九研究所 Two-shipper emergency and rescue equipment, failure switching method and rescue system
CN109062184B (en) * 2018-08-10 2021-05-14 中国船舶重工集团公司第七一九研究所 Double-machine emergency rescue equipment, fault switching method and rescue system
CN109542471A (en) * 2018-11-28 2019-03-29 郑州云海信息技术有限公司 A kind of installation method and device of calculate node

Also Published As

Publication number Publication date
CN103713974B (en) 2016-02-17

Similar Documents

Publication Publication Date Title
CN103713974B (en) High-performance job scheduling management node dual-computer reinforcement method and device
TWI603266B (en) Resource adjustment methods and systems for virtual machines
CN102984012B (en) Management method and system for service resources
CN106406905B (en) Configuration method and system for SETUP option of BIOS of server
WO2016058318A1 (en) Elastic virtual machine (vm) resource scaling method, apparatus and system
CN112948063B (en) Cloud platform creation method and device, cloud platform and cloud platform implementation system
CN102394774A (en) Service state monitoring and failure recovery method for controllers of cloud computing operating system
CN105159769A (en) Distributed job scheduling method suitable for heterogeneous computational capability cluster
CN103942128A (en) Double-computer reinforcing method for high-performance job scheduling management node
KR20200078328A (en) Systems and methods of monitoring software application processes
CN105812169A (en) Host and standby machine switching method and device
CN109842526B (en) Disaster recovery method and device
CN104660694A (en) Method and apparatus for calling service
CN102025776A (en) Disaster tolerant control method, device and system
CN101262479B (en) A network file share method, server and network file share system
CN103312541A (en) Management method of high-availability mutual backup cluster
CN103152420B (en) A kind of method avoiding single-point-of-failofe ofe Ovirt virtual management platform
CN107579850B (en) Wired and wireless hybrid networking method based on SDN control for cloud data center
CN105141691A (en) System and method for automatically expanding virtual machine cluster under cloud computing
CN107529180B (en) Base station cloud test environment construction device and method
CN109995554A (en) The control method and cloud dispatch control device of multi-stage data center active-standby switch
CN108154343B (en) Emergency processing method and system for enterprise-level information system
CN107005434A (en) A kind of method, device and the equipment of synchronous virtual network function VNF states
CN105302276A (en) Design method for limiting power consumption of SmartRack whole cabinet
CN109117320A (en) Power distribution automation main station failure disaster tolerance processing system and method based on cloud platform

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant