CN102902598B - A kind of resources measurement preprocess method combined with job scheduling system - Google Patents

Info

Publication number
CN102902598B
Authority
CN
China
Prior art keywords
file
computing node
node resource
content
carry out
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210333671.0A
Other languages
Chinese (zh)
Other versions
CN102902598A (en)
Inventor
张磊
张涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shuguang zhisuan Information Technology Co.,Ltd.
Original Assignee
Dawning Information Industry Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dawning Information Industry Beijing Co Ltd filed Critical Dawning Information Industry Beijing Co Ltd
Priority to CN201210333671.0A priority Critical patent/CN102902598B/en
Publication of CN102902598A publication Critical patent/CN102902598A/en
Application granted granted Critical
Publication of CN102902598B publication Critical patent/CN102902598B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The present invention relates to a resource detection preprocessing method combined with a job scheduling system, comprising the following steps: (1) enabling the preprocessing function of the job scheduler; (2) the job scheduler reading the compute node resource configuration file; (3) performing content detection on the compute node resources; (4) when abnormal content is found in the compute node resources, judging whether the automatic processing procedure needs to be started; (5) judging whether to automatically process the abnormal content; (6) automatically processing the abnormal content; (7) sending the abnormal content to the user as a short message (SMS) or an email through the SMTP or SMGP extended configuration interface; and (8) recording the operation process in a log file. The invention provides an automatic processing scheme for two problems, "reliability of storage resources" and "file availability", together with the related automatic processing and configuration files, so that the scheme is simple, configurable, and extensible. Processing is efficient and saves time and labor.

Description

A resource detection preprocessing method combined with a job scheduling system
Technical Field
The present invention relates to a preprocessing method in the field of high-performance computing clusters (HPCC), and in particular to a resource detection preprocessing method combined with a job scheduling system.
Background
One of the most common problems of a large-scale cluster job scheduling system is the following: a resource (a compute node resource, a storage resource, and so on) becomes abnormal (for a reason other than the node going offline), but the scheduling system fails to catch the abnormality. As a result, a job is scheduled onto the abnormal node resource, or uses some other abnormal resource, and ultimately cannot complete normally. This wastes a large amount of resources and time, and no valid job result is obtained.
Torque 5.0 provides a compute node health check function and works with a scheduler (for example, Maui) to set the state of nodes whose health status is abnormal to Down. The Torque node health check function runs a specified monitoring script and reads its output; if the output begins with "ERROR", the scheduler sets the state of that node to Down. The node check interval can also be configured. The prior art has the following problems:
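The "ERROR" prefix convention described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `format_health_report` and `node_should_go_down` are hypothetical and belong neither to Torque nor to the invention; the only behavior taken from the text is that output beginning with "ERROR" marks the node as Down.

```python
from typing import Sequence


def format_health_report(problems: Sequence[str]) -> str:
    """Build the text a check script would print (hypothetical helper).

    Returns an empty string when nothing is wrong, otherwise a single line
    that begins with "ERROR" followed by the problem descriptions.
    """
    if not problems:
        return ""
    return "ERROR " + "; ".join(problems)


def node_should_go_down(script_output: str) -> bool:
    """Mirror the rule in the text: output beginning with "ERROR" means Down."""
    return script_output.startswith("ERROR")


if __name__ == "__main__":
    report = format_health_report(["/scratch usage above threshold"])
    print(report)
    print(node_should_go_down(report))
```

The check is deliberately a plain prefix test, since the text only states that the output must begin with "ERROR".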
First, the compute node health check function provided by Torque requires users to write the detection script, or a corresponding Linux executable, themselves. This requires the user to have some ability to develop detection scripts or detection applications, which makes the function difficult to use. Second, when an abnormality is detected, the function only has the scheduler set the node state to Down; it provides no corresponding automatic processing of the abnormality.
Summary of the Invention
In view of the deficiencies of the prior art, the present invention provides a resource detection preprocessing method combined with a job scheduling system. Building on the compute node health check function of Torque, an open-source cluster job scheduling and resource management system, the invention provides an automatic processing scheme for two problems, "reliability of storage resources" and "file availability", together with the related automatic processing and configuration files, so that the scheme is simple, configurable, and extensible.
The object of the invention is achieved by the following technical solution:
A resource detection preprocessing method combined with a job scheduling system, the improvement being that the method comprises the following steps:
(1) enabling the preprocessing function of the job scheduler;
(2) the job scheduler reading the compute node resource configuration file;
(3) performing content detection on the compute node resources;
(4) when abnormal content is found in the compute node resources, judging whether the automatic processing procedure needs to be started;
(5) judging whether to automatically process the abnormal content of the compute node resources;
(6) automatically processing the abnormal content of the compute node resources;
(7) sending the abnormal content of the compute node resources to the user as a short message (SMS) or an email through the SMTP or SMGP extended configuration interface;
(8) recording the operation process in a log file.
Wherein, in step (2), the compute node resource configuration file is the health.prop configuration file.
Wherein, the content of the health.prop configuration file comprises:
A. Whether the resource monitoring preprocessing function is enabled; the default is Yes;
B. The file availability object (a file is one of the detected objects, and is checked only for availability), that is, a check of whether the specified file exists; the default is empty;
C. The directory or partition whose capacity is to be checked, and whether it exists; the default is empty;
D. The trigger threshold of the automatic processing procedure: when the used capacity of the specified directory or partition exceeds this threshold, the automatic processing procedure is started; the default is 0.8, that is, the procedure starts when usage of the specified directory or partition exceeds 80%;
E. The minimum size of the file objects processed during automatic processing; the default is 1 MB, that is, only files larger than 1 MB are processed;
F. During automatic processing, the file objects processed must have been created before this date; the default value is 7, that is, only files created more than one week ago are processed;
G. During automatic processing, only files belonging to a certain task group are processed; the default is empty, that is, files of all groups are processed;
H. During automatic processing, only files belonging to a certain user are processed; the default is empty, that is, files of all users are processed.
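Items A to H and their defaults can be modeled as a small configuration object. The sketch below is an assumption-laden illustration: the text does not give the on-disk syntax of health.prop, so a simple `key = value` format is assumed, and every key name (`enabled`, `check_file`, `check_path`, `usage_threshold`, `min_file_size_mb`, `min_age_days`, `group`, `owner`) is hypothetical. The size unit of item E is taken to be megabytes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthConfig:
    """Defaults follow items A to H; field names are hypothetical."""
    enabled: bool = True               # A: preprocessing enabled, default Yes
    check_file: Optional[str] = None   # B: file whose existence is checked
    check_path: Optional[str] = None   # C: directory or partition to measure
    usage_threshold: float = 0.8       # D: trigger threshold (fraction used)
    min_file_size_mb: float = 1.0      # E: minimum file size, in megabytes
    min_age_days: int = 7              # F: minimum file age, in days
    group: Optional[str] = None        # G: None means all groups
    owner: Optional[str] = None        # H: None means all users


def parse_health_prop(text: str) -> HealthConfig:
    """Parse an assumed 'key = value' format; empty values keep the default."""
    cfg = HealthConfig()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if value == "":
            continue
        if key == "enabled":
            cfg.enabled = value.lower() == "yes"
        elif key == "usage_threshold":
            cfg.usage_threshold = float(value)
        elif key == "min_file_size_mb":
            cfg.min_file_size_mb = float(value)
        elif key == "min_age_days":
            cfg.min_age_days = int(value)
        elif key in ("check_file", "check_path", "group", "owner"):
            setattr(cfg, key, value)
    return cfg


if __name__ == "__main__":
    print(parse_health_prop("check_path = /scratch\nusage_threshold = 0.9\n"))
```

Leaving a key out, or leaving its value empty, falls back to the default listed for that item, which matches the "default is empty" wording for items B, C, G, and H.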
Wherein, in step (3), the content of the script file in the compute node resources is detected; the script file is node_check.scp.
Wherein, in step (4), if the automatic processing procedure needs to be started, step (5) is carried out; otherwise the method returns to step (1).
Wherein, in step (5), if the abnormal content of the compute node resources is to be processed, step (6) is carried out; otherwise step (7) is carried out.
Wherein, in step (6), after the abnormal content of the compute node resources has been processed, the processing procedure is recorded in the log file, that is, step (8) is carried out.
Wherein, in step (8), the operation process comprises the processing procedure applied to the abnormal content of the compute node resources and the process of sending it to the user; the log file is health.log.
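The branching among steps (4) to (8) can be traced with a short Python sketch. It is only a reading of the step descriptions, under two assumptions that the text does not state explicitly: that a pass with no abnormality stops after step (3), and that the notification in step (7) is followed by the logging in step (8). The function name `trace_steps` and its boolean parameters are hypothetical.

```python
from typing import List


def trace_steps(anomaly_found: bool, start_auto: bool, do_auto: bool) -> List[int]:
    """Return the step numbers visited in one detection pass."""
    steps = [1, 2, 3]          # enable, read configuration, detect
    if not anomaly_found:
        return steps           # assumption: nothing further to do
    steps.append(4)            # decide whether to start automatic processing
    if not start_auto:
        return steps           # the method returns to step (1)
    steps.append(5)            # decide whether to process the abnormality
    if do_auto:
        steps.append(6)        # automatic processing
    else:
        steps.append(7)        # notify the user by SMS or email
    steps.append(8)            # record the operation in the log file
    return steps


if __name__ == "__main__":
    print(trace_steps(True, True, True))
    print(trace_steps(True, True, False))
```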
Compared with the prior art, the beneficial effects of the present invention are:
Building on the compute node health check function of Torque, an open-source cluster job scheduling and resource management system, the invention provides an automatic processing scheme for two problems, "reliability of storage resources" and "file availability", together with the related automatic processing and configuration files, so that the scheme is simple, configurable, and extensible. Processing is efficient, saves time and labor, and is more reliable.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the part of the pbs_mom config file that configures the health check function provided by Torque;
Fig. 2 is a flow chart of the resource detection preprocessing method combined with a job scheduling system provided by the invention.
Detailed Description
Specific embodiments of the present invention are described in further detail below with reference to the drawings.
High-performance computing cluster (HPCC, High Performance Computing Cluster): a branch of computer science aimed at solving complex scientific or numerical computing problems; it is a loosely coupled set of compute nodes made up of multiple node machines (servers).
Building on the Torque compute node health check function, the invention provides users with a simple, configurable, and extensible node detection preprocessing scheme. From communication with many HPCC users, we learned that, when cluster resources are in use, the resource abnormalities that users worry about are concentrated in two problems: "reliability of storage resources" and "file availability". The node detection preprocessing scheme of the invention targets these two problems and the actual needs of many users, forming a solution that combines a configuration standard with automatic processing.
Fig. 1 shows the part of the pbs_mom config file that configures the health check function provided by Torque. The node_check_script item in this configuration file must be set to the location of the node_check.scp script file provided by this solution. As shown in the resource detection preprocessing configuration scheme of Fig. 1, the solution of the invention consists mainly of a series of script files such as node_check.scp, the health.prop configuration file, and the health.log log file, and also provides extended configuration interfaces such as SMTP and SMGP.
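As a rough illustration of the pbs_mom entry mentioned here, the following Python helper renders the relevant configuration lines. It assumes the MOM parameters are spelled `$node_check_script` and `$node_check_interval`, as in the Torque administrator documentation; the install path passed in the usage example and the helper name `render_mom_config` are hypothetical, and the contents of Fig. 1 itself are not reproduced.

```python
def render_mom_config(script_path: str, interval: int = 1) -> str:
    """Render the health-check lines of a pbs_mom config file (sketch)."""
    lines = [
        f"$node_check_script {script_path}",   # location of node_check.scp
        f"$node_check_interval {interval}",    # how often the check is run
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical install location of the script provided by the solution.
    print(render_mom_config("/opt/health/node_check.scp", 10), end="")
```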
Torque is an open-source cluster job scheduling and resource management system. SMTP (Simple Mail Transfer Protocol) is a set of rules for transferring mail from a source address to a destination address, and it governs how mail is relayed. SMGP (Short Message Gateway Protocol) is the interface protocol used by a short message gateway (SMGW) to exchange short messages with other network elements.
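For the SMTP side of the notification interface, a minimal Python sketch using the standard `email.message` and `smtplib` modules might look as follows. The addresses, node name, subject wording, and function names are all hypothetical; the text does not specify the message format. SMGP is not covered, because the Python standard library has no client for it.

```python
import smtplib
from email.message import EmailMessage


def build_alert_email(sender: str, recipient: str, node: str, problem: str) -> EmailMessage:
    """Compose an alert describing abnormal content found on a node."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"Resource anomaly on {node}"
    msg.set_content(problem)
    return msg


def send_alert(msg: EmailMessage, host: str, port: int = 25) -> None:
    """Hand the message to an SMTP server (requires a reachable server)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    alert = build_alert_email(
        "hpc-admin@example.org", "user@example.org",
        "node01", "/scratch usage exceeds the configured threshold",
    )
    print(alert)
```

Building the message separately from sending it keeps the composition step testable without a mail server.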
Fig. 2 shows the flow of the resource detection preprocessing method combined with a job scheduling system provided by the invention. The method comprises the following steps:
(1) Enable the preprocessing function of the job scheduler: the job scheduler is the Maui job scheduler.
(2) The Maui job scheduler reads the compute node resource health.prop configuration file;
(3) The content of the compute node resource node_check.scp script file is detected: the specified content is detected according to the configuration described in Table 1.
Table 1: Detailed explanation of part of the health.prop configuration
(4) When abnormal content is found in the compute node resources, judge whether the automatic processing procedure needs to be started: if it does, carry out step (5); otherwise return to step (1).
(5) Judge whether to automatically process the abnormal content of the compute node resources: if the abnormal content is to be processed, carry out step (6); otherwise carry out step (7).
(6) Automatically process the abnormal content of the compute node resources: after the abnormal content has been processed, record the processing procedure in the log file, that is, carry out step (8).
(7) Send the abnormal content of the compute node resources to the user as a short message (SMS) or an email through the SMTP or SMGP extended configuration interface;
(8) Record the operation process in the log file: the operation process comprises the processing procedure applied to the abnormal content of the compute node resources and the process of sending it to the user; the log file is health.log.
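The content detection of step (3) covers two kinds of check: whether a specified file exists, and whether the usage of a specified directory or partition exceeds the trigger threshold. A minimal Python sketch of both checks is given below. It is an assumption that they would be implemented this way; the function name `detect_anomalies` and the wording of the messages are hypothetical. Note that `shutil.disk_usage` reports usage of the file system that contains the path, not the size of the directory tree itself.

```python
import os
import shutil
from typing import List, Optional


def detect_anomalies(check_file: Optional[str],
                     check_path: Optional[str],
                     usage_threshold: float = 0.8) -> List[str]:
    """Return a list of problem descriptions; an empty list means healthy."""
    problems = []
    # File availability: only existence is checked.
    if check_file and not os.path.exists(check_file):
        problems.append(f"missing file: {check_file}")
    # Storage reliability: the path must exist and stay under the threshold.
    if check_path:
        if not os.path.exists(check_path):
            problems.append(f"missing path: {check_path}")
        else:
            usage = shutil.disk_usage(check_path)
            if usage.total and usage.used / usage.total > usage_threshold:
                problems.append(f"usage above {usage_threshold:.0%}: {check_path}")
    return problems


if __name__ == "__main__":
    print(detect_anomalies("/no/such/file", os.getcwd()))
```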
The node detection preprocessing scheme of the invention provides a processing scheme mainly for two problems, "reliability of storage resources" and "file availability", together with the related automatic processing and configuration files, so that the scheme is simple, configurable, and extensible.
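The text does not say what the automatic processing does to the files it handles, so the sketch below stops at selecting them. It walks a directory and keeps the files that satisfy the size, age, owner, and group filters described for the configuration file. The function name `select_candidates` and its parameters are hypothetical, and owner and group are matched by numeric ID as a simplifying assumption.

```python
import os
import time
from typing import List, Optional


def select_candidates(root: str,
                      min_size_bytes: int = 1024 * 1024,
                      min_age_days: int = 7,
                      uid: Optional[int] = None,
                      gid: Optional[int] = None,
                      now: Optional[float] = None) -> List[str]:
    """List files under root that are large enough and old enough."""
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400      # modification-time cutoff
    selected = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue                     # file vanished or is unreadable
            if st.st_size <= min_size_bytes or st.st_mtime >= cutoff:
                continue
            if uid is not None and st.st_uid != uid:
                continue
            if gid is not None and st.st_gid != gid:
                continue
            selected.append(path)
    return sorted(selected)


if __name__ == "__main__":
    print(select_candidates(os.getcwd()))
```

Returning the list instead of acting on it leaves the actual handling, whatever it is in a given deployment, as a separate and auditable step.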
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. Although the invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific embodiments of the invention may still be modified or equivalently replaced, and that any modification or equivalent replacement that does not depart from the spirit and scope of the invention shall fall within the scope of the claims of the invention.

Claims (1)

1. A resource detection preprocessing method combined with a job scheduling system, characterized in that the method comprises the following steps:
(1) enabling the preprocessing function of the job scheduler;
(2) the job scheduler reading the compute node resource configuration file;
(3) detecting the content of the compute node resources;
(4) when abnormal content is found in the compute node resources, judging whether the automatic processing procedure needs to be started;
(5) judging whether to automatically process the abnormal content of the compute node resources;
(6) automatically processing the abnormal content of the compute node resources;
(7) sending the abnormal content of the compute node resources to the user as a short message (SMS) or an email through the SMTP or SMGP extended configuration interface;
(8) recording the operation process in a log file;
in step (2), the compute node resource configuration file is the health.prop configuration file;
the content of the health.prop configuration file comprises:
A. whether the resource monitoring preprocessing function is enabled; the default is Yes;
B. the file availability object, that is, a check of whether the specified file exists; the default is empty;
C. the directory or partition whose capacity is to be checked, and whether it exists; the default is empty;
D. the trigger threshold of the automatic processing procedure: when the used capacity of the specified directory or partition exceeds this threshold, the automatic processing procedure is started; the default is 0.8, that is, the procedure starts when usage of the specified directory or partition exceeds 80%;
E. the minimum size of the file objects processed during automatic processing; the default is 1 MB, that is, only files larger than 1 MB are processed;
F. during automatic processing, the file objects processed must have been created before this date; the default value is 7, that is, only files created more than one week ago are processed;
G. during automatic processing, only files belonging to a certain task group are processed; the default is empty, that is, files of all groups are processed;
H. during automatic processing, only files belonging to a certain user are processed; the default is empty, that is, files of all users are processed;
in step (3), the content of the script file in the compute node resources is detected; the script file is node_check.scp;
in step (4), if the automatic processing procedure needs to be started, step (5) is carried out; otherwise the method returns to step (1);
in step (5), if the abnormal content of the compute node resources is to be processed, step (6) is carried out; otherwise step (7) is carried out;
in step (6), after the abnormal content of the compute node resources has been processed, the processing procedure is recorded in the log file, that is, step (8) is carried out;
in step (8), the operation process comprises the processing procedure applied to the abnormal content of the compute node resources and the process of sending it to the user; the log file is health.log.
CN201210333671.0A 2012-09-10 2012-09-10 A kind of resources measurement preprocess method combined with job scheduling system Active CN102902598B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210333671.0A CN102902598B (en) 2012-09-10 2012-09-10 A kind of resources measurement preprocess method combined with job scheduling system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210333671.0A CN102902598B (en) 2012-09-10 2012-09-10 A kind of resources measurement preprocess method combined with job scheduling system

Publications (2)

Publication Number Publication Date
CN102902598A CN102902598A (en) 2013-01-30
CN102902598B true CN102902598B (en) 2015-08-19

Family

ID=47574844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210333671.0A Active CN102902598B (en) 2012-09-10 2012-09-10 A kind of resources measurement preprocess method combined with job scheduling system

Country Status (1)

Country Link
CN (1) CN102902598B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103347059B (en) * 2013-06-20 2016-06-22 北京奇虎科技有限公司 Realize the method for user's configuration parameter transmission, client and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101373447A (en) * 2008-08-20 2009-02-25 上海超级计算中心 System and method for detecting health degree of computer cluster
CN101694630A (en) * 2009-09-30 2010-04-14 曙光信息产业(北京)有限公司 Method, system and equipment for operation dispatching
WO2011005073A2 (en) * 2009-07-09 2011-01-13 Mimos Bhd. Job status monitoring method
CN102117225A (en) * 2009-12-31 2011-07-06 上海可鲁系统软件有限公司 Industrial automatic multi-point cluster system and task management method thereof
CN102148871A (en) * 2011-03-18 2011-08-10 浪潮(北京)电子信息产业有限公司 Storage resource scheduling method and device
CN102147960A (en) * 2011-03-22 2011-08-10 曙光信息产业股份有限公司 System and method for monitoring super-large scale trunking services
CN102231681A (en) * 2011-06-27 2011-11-02 中国建设银行股份有限公司 High availability cluster computer system and fault treatment method thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040205414A1 (en) * 1999-07-26 2004-10-14 Roselli Drew Schaffer Fault-tolerance framework for an extendable computer architecture

Also Published As

Publication number Publication date
CN102902598A (en) 2013-01-30

Similar Documents

Publication Publication Date Title
US8533731B2 (en) Apparatus and method for distrubuting complex events based on correlations therebetween
CN207301773U (en) A kind of numerical control machine tool monitoring system based on Internet of Things
CN106126346A (en) A kind of large-scale distributed data collecting system and method
CN103645947A (en) MIL-STD-1553B bus monitoring and data analysis system
CN106033476A (en) Incremental graphic computing method in distributed computing mode under cloud computing environment
CN111562889B (en) Data processing method, device, system and storage medium
CN107612984B (en) Big data platform based on internet
CN112769897A (en) Synchronization method and device for edge calculation message, electronic equipment and storage medium
CN102752294B (en) Method and system for synchronizing data of multiple terminals on basis of equipment capacity
CN103200199A (en) Out of band (OOB) data collection system
CN107291744A (en) It is determined that and with the method and device of the relationship between application program
CN105592122A (en) Cloud platform monitoring method and cloud platform monitoring system
CN112118174A (en) Software defined data gateway
CN111930565B (en) Process fault self-healing method, device and equipment for components in distributed management system
CN106383771A (en) Host cluster monitoring method and device
CN106027674A (en) Technology architecture of "Internet & smart manufacturing"
CN105607606A (en) Data acquisition device and data acquisition method based on double-mainboard framework
CN106598738A (en) Computer cluster system and parallel computing method thereof
CN103763181A (en) Automatic attribute setting device and method
CN112000735A (en) Data processing method, device and system
CN117651003B (en) ERP information transmission safety monitoring system
CN103678423A (en) Data file input system, device and method
CN102902598B (en) A kind of resources measurement preprocess method combined with job scheduling system
CN104750814B (en) The automatic storage method of polynary heterogeneous data flow based on multisensor
CN109672731A (en) A kind of distributed node information monitoring method, system and application

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20211025

Address after: 100089 zone A-1, floor 2, building 36, yard 8, Dongbeiwang West Road, Haidian District, Beijing

Patentee after: Shuguang zhisuan Information Technology Co.,Ltd.

Address before: 100193 No.36 Zhongguancun Software Park, No.8 Dongbeiwang West Road, Haidian District, Beijing

Patentee before: Dawning Information Industry (Beijing) Co.,Ltd.

TR01 Transfer of patent right