CN102902598B - Resource detection preprocessing method combined with a job scheduling system - Google Patents
- Publication number
- CN102902598B (application CN201210333671.0A; earlier publication CN102902598A)
- Authority
- CN
- China
- Prior art keywords
- file
- computing node
- node resource
- content
- carry out
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Debugging And Monitoring (AREA)
Abstract
The present invention relates to a resource detection preprocessing method combined with a job scheduling system, comprising the following steps: (1) enabling the preprocessing function of the job scheduler; (2) the job scheduler reading the compute node resource configuration file; (3) checking the content of the compute node resources; (4) when abnormal content is found in the compute node resources, judging whether the automatic processing procedure needs to be started; (5) judging whether to process the abnormal content automatically; (6) processing the abnormal content automatically; (7) sending the abnormal content to the user as an SMS message or e-mail through the SMTP or SMGP extended configuration interface; (8) recording the operation process in a log file. The invention gives an automatic processing scheme for two problems, "reliability of storage resources" and "file availability", and provides the related automatic processing and configuration files, so that the scheme is simple, configurable and extensible. Processing is efficient and saves time and effort.
Description
Technical field
The present invention relates to a preprocessing method in the field of high-performance computing clusters (HPCC), and in particular to a resource detection preprocessing method combined with a job scheduling system.
Background art
One of the most common problems of large-scale cluster job scheduling systems is that a resource (compute node resources, storage resources, etc.) becomes abnormal (other than a node going offline) but the scheduling system fails to catch the anomaly, so that jobs are scheduled onto the abnormal node resources, or use other abnormal resources, and ultimately cannot complete normally. This wastes a large amount of resources and time, and no valid job result is obtained.
Torque 5.0 provides a compute node health check function and, working with a scheduler (e.g. Maui), sets the state of unhealthy nodes to Down. The Torque node health check runs a specified monitoring script and reads its output; if the output begins with "ERROR", the scheduler sets the state of that node to Down. The node check interval can also be configured. The prior art has the following problems:
First, the compute node health check function provided by Torque requires users to write the corresponding check script or Linux executable themselves, which demands some ability to develop check scripts or check programs and makes the function hard to use. Second, when an anomaly is detected, the function only has the scheduler set the node state to Down; it provides no corresponding automatic handling of the anomaly.
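The Torque convention described above, where output beginning with "ERROR" marks a node as unhealthy, can be illustrated with a minimal Python sketch. It is not the patent's node_check.scp script; the function names `check_node` and `format_report`, the example paths and the "OK" output for a healthy node are all assumptions made for illustration.

```python
import os
import shutil


def check_node(required_files, watched_dirs, usage_threshold=0.8):
    """Return a list of problem descriptions; an empty list means healthy."""
    problems = []
    for path in required_files:
        # File availability: the file only has to exist.
        if not os.path.exists(path):
            problems.append(f"missing file {path}")
    for directory in watched_dirs:
        if not os.path.isdir(directory):
            problems.append(f"missing directory {directory}")
            continue
        usage = shutil.disk_usage(directory)
        # Guard against pseudo file systems that report a total of zero.
        used_fraction = usage.used / usage.total if usage.total else 0.0
        if used_fraction > usage_threshold:
            problems.append(f"{directory} is {used_fraction:.0%} full")
    return problems


def format_report(problems):
    """Prefix the report with ERROR so a Torque-style health check flags the node."""
    if problems:
        return "ERROR " + "; ".join(problems)
    return "OK"  # assumed output for a healthy node


if __name__ == "__main__":
    # Hypothetical paths; a real deployment would read them from configuration.
    print(format_report(check_node(["/etc/hosts"], ["/tmp"])))
```

A script of this kind only reports; anything beyond marking the node Down has to be added separately, which is the gap the invention addresses.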
Summary of the invention
In view of the deficiencies of the prior art, the invention provides a resource detection preprocessing method combined with a job scheduling system. Building on the compute node health check function of Torque, an open-source cluster job scheduling and resource management system, the invention gives an automatic processing scheme for two problems, "reliability of storage resources" and "file availability", and provides the related automatic processing and configuration files, so that the scheme is simple, configurable and extensible.
The object of the invention is achieved by the following technical solution:
A resource detection preprocessing method combined with a job scheduling system, the improvement being that the method comprises the following steps:
(1) enabling the preprocessing function of the job scheduler;
(2) the job scheduler reading the compute node resource configuration file;
(3) checking the content of the compute node resources;
(4) when abnormal content is found in the compute node resources, judging whether the automatic processing procedure needs to be started;
(5) judging whether to process the abnormal content automatically;
(6) processing the abnormal content automatically;
(7) sending the abnormal content to the user as an SMS message or e-mail through the SMTP or SMGP extended configuration interface;
(8) recording the operation process in a log file.
In step (2), the compute node resource configuration file is the health.prop configuration file.
The content of the health.prop configuration file comprises:
A. whether the resource monitoring preprocessing function is enabled; the default is Yes;
B. the file availability objects (files are one kind of detected object, and are checked for availability only), that is, whether the specified files exist; the default is empty;
C. whether the directories or partitions whose capacity is to be checked exist; the default is empty;
D. the trigger threshold of the automatic processing procedure: when the used capacity of a specified directory or partition exceeds this threshold, the automatic processing procedure is started; the default is 0.8, that is, automatic processing starts when usage of the specified directory or partition exceeds 80%;
E. the minimum size of the files handled during automatic processing; the default is 1 MB, that is, only files larger than 1 MB are processed;
F. the minimum age of the files handled during automatic processing: a file must have been created earlier than this many days ago; the default is 7, that is, only files created more than one week ago are processed;
G. during automatic processing, only files belonging to a given task group are processed; the default is empty, that is, files of all groups are processed;
H. during automatic processing, only files belonging to a given user are processed; the default is empty, that is, files of all users are processed.
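Items A to H and their defaults can be modelled as a small configuration object. The sketch below is only an illustration: the text does not give the actual key names or syntax of health.prop, so the `key = value` format, the key names and the `HealthConfig` and `parse_health_prop` identifiers are all hypothetical, and the size default is taken to be 1 MB.

```python
from dataclasses import dataclass


@dataclass
class HealthConfig:
    enabled: bool = True            # item A, default Yes
    check_files: tuple = ()         # item B, default empty
    check_dirs: tuple = ()          # item C, default empty
    usage_threshold: float = 0.8    # item D, fraction of capacity in use
    min_file_size_mb: float = 1.0   # item E, assumed to mean 1 MB
    min_age_days: int = 7           # item F
    group: str = ""                 # item G, empty means all groups
    owner: str = ""                 # item H, empty means all users


def parse_health_prop(text):
    """Parse hypothetical 'key = value' lines; unknown keys are ignored."""
    cfg = HealthConfig()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "enabled":
            cfg.enabled = value.lower() in ("yes", "true", "1")
        elif key in ("check_files", "check_dirs"):
            items = tuple(v.strip() for v in value.split(",") if v.strip())
            setattr(cfg, key, items)
        elif key in ("usage_threshold", "min_file_size_mb"):
            setattr(cfg, key, float(value))
        elif key == "min_age_days":
            cfg.min_age_days = int(value)
        elif key in ("group", "owner"):
            setattr(cfg, key, value)
    return cfg
```

Keeping the defaults in one place means an empty or missing file still yields the documented behaviour: monitoring enabled, an 80% threshold, and no restriction on group or owner.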
In step (3), the compute node resources are checked using a script file; the script file is node_check.scp.
In step (4), if the automatic processing procedure needs to be started, step (5) is carried out; otherwise the method returns to step (1).
In step (5), if the abnormal content is to be processed automatically, step (6) is carried out; otherwise step (7) is carried out.
In step (6), after the abnormal content has been processed automatically, the processing is recorded in the log file, that is, step (8) is carried out.
In step (8), the operation process comprises the processing performed on the abnormal content and the sending of the notification to the user; the log file is health.log.
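The branching between steps (3) and (8) can be sketched as a single function that takes the individual actions as callables. This is one reading of the flow, not the patent's implementation: the name `run_cycle` and all parameter names are hypothetical, and logging after the notification branch is an assumption based on step (8) covering both processing and sending.

```python
def run_cycle(detect, needs_auto, can_auto, auto_process, notify, log):
    """Run one pass of steps (3) to (8) and return the labels of the steps taken."""
    taken = ["detect"]
    anomalies = detect()                            # step (3)
    if not anomalies or not needs_auto(anomalies):  # step (4)
        return taken                                # back to step (1)
    if can_auto(anomalies):                         # step (5)
        auto_process(anomalies)                     # step (6)
        taken.append("auto_process")
    else:
        notify(anomalies)                           # step (7)
        taken.append("notify")
    log(taken[-1], anomalies)                       # step (8)
    taken.append("log")
    return taken
```

Passing the actions in as functions keeps the control flow testable without a real scheduler, mail relay or log file.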
Compared with the prior art, the beneficial effects of the invention are:
Building on the compute node health check function of Torque, an open-source cluster job scheduling and resource management system, the invention gives an automatic processing scheme for two problems, "reliability of storage resources" and "file availability", and provides the related automatic processing and configuration files, so that the scheme is simple, configurable and extensible. Processing is efficient, saves time and effort, and is more reliable.
Brief description of the drawings
Fig. 1 is a schematic diagram of the part of the pbs_mom config file that configures the health check function provided by Torque;
Fig. 2 is a flow chart of the resource detection preprocessing method combined with a job scheduling system provided by the invention.
Detailed description of the embodiments
The specific embodiments of the invention are described in further detail below with reference to the accompanying drawings.
High-performance computing cluster (HPCC, High Performance Computing Cluster): a branch of computer science whose purpose is to solve complex scientific or numerical computing problems; it is a loosely coupled set of compute nodes made up of multiple node machines (servers).
Building on the Torque compute node health check function, the invention provides users with a simple, configurable and extensible node detection preprocessing scheme. From discussions with many HPCC users, we learned that, when cluster resources are in use, the resource anomalies users worry about mostly concern two problems: "reliability of storage resources" and "file availability". The node detection preprocessing scheme of the invention targets these two problems and the actual needs of many users, and forms a solution that combines a configuration standard with automatic processing.
The part of the pbs_mom config file that configures the health check function provided by Torque is shown in Fig. 1; the node_check_script item in this configuration file must be set to the location of the node_check.scp script file provided by this solution. As shown in the resource detection preprocessing configuration scheme of Fig. 1, the solution consists mainly of a set of script files such as node_check.scp, the health.prop configuration file and the health.log log file, and also provides extended configuration interfaces such as SMTP and SMGP.
Torque is an open-source cluster job scheduling and resource management system. SMTP (Simple Mail Transfer Protocol) is a set of rules for transferring mail from a source address to a destination address, and governs how mail is relayed. SMGP (Short Message Gateway Protocol) is the interface protocol used between a short message gateway (SMGW) and other network elements for short message transfer.
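As an illustration of the SMTP path only, the following sketch builds an alert e-mail with Python's standard library and hands it to an SMTP relay. The function names `build_alert` and `send_alert`, the subject format, the addresses and the relay host and port are assumptions; SMGP has no standard-library client and is not covered here.

```python
import smtplib
from email.message import EmailMessage


def build_alert(node, anomalies, sender, recipient):
    """Compose an e-mail that lists the abnormal content found on a node."""
    msg = EmailMessage()
    msg["Subject"] = f"[health] {node}: {len(anomalies)} resource anomalies"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(anomalies))
    return msg


def send_alert(msg, host="localhost", port=25):
    """Deliver the message through an SMTP relay (host and port are assumptions)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)
```

Separating message construction from delivery lets the content be checked without a mail server; for example, `build_alert("node01", ["/scratch is 91% full"], "hpc@example.org", "user@example.org")` returns a message that `send_alert` could then deliver.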
The flow of the resource detection preprocessing method combined with a job scheduling system provided by the invention is shown in Fig. 2. The method comprises the following steps:
(1) Enable the job scheduler preprocessing function: the job scheduler is the Maui job scheduler.
(2) The Maui job scheduler reads the compute node resource configuration file health.prop.
(3) The compute node resources are checked by the node_check.scp script file: the specified checks are carried out according to the configuration explained in Table 1.
Table 1: Detailed explanation of part of the health.prop configuration
(4) When abnormal content is found in the compute node resources, judge whether the automatic processing procedure needs to be started: if so, carry out step (5); otherwise return to step (1).
(5) Judge whether to process the abnormal content automatically: if so, carry out step (6); otherwise carry out step (7).
(6) Process the abnormal content automatically: after the abnormal content has been processed, the processing is recorded in the log file, that is, step (8) is carried out.
(7) Send the abnormal content to the user as an SMS message or e-mail through the SMTP or SMGP extended configuration interface.
(8) Record the operation process in the log file: the operation process comprises the processing performed on the abnormal content and the sending of the notification to the user; the log file is health.log.
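The file filters used during automatic processing (minimum size, minimum age, owner and group) can be sketched as a selection function. The text does not say what the processing does to the selected files, so this sketch only selects candidates; the name `select_candidates`, the use of modification time as a stand-in for creation date, and the numeric uid and gid parameters are assumptions.

```python
import os
import time


def select_candidates(directory, min_size_bytes=1024 * 1024, min_age_days=7,
                      owner_uid=None, group_gid=None, now=None):
    """Return sorted paths under directory that pass the size, age, owner and group filters."""
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400
    selected = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if info.st_size <= min_size_bytes:
                continue  # only files larger than the minimum size
            if info.st_mtime >= cutoff:
                continue  # only files older than the minimum age
            if owner_uid is not None and info.st_uid != owner_uid:
                continue
            if group_gid is not None and info.st_gid != group_gid:
                continue
            selected.append(path)
    return sorted(selected)
```

With the defaults, only files larger than 1 MB and last modified more than seven days ago are returned, regardless of owner or group.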
The node detection preprocessing scheme of the invention mainly gives a processing scheme for two problems, "reliability of storage resources" and "file availability", and provides the related automatic processing and configuration files, so that the scheme is simple, configurable and extensible.
Finally, it should be noted that the above embodiments only illustrate the technical solution of the invention and do not limit it. Although the invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific embodiments of the invention may still be modified or equivalently replaced, and that any modification or equivalent replacement that does not depart from the spirit and scope of the invention shall fall within the scope of the claims of the invention.
Claims (1)
1. A resource detection preprocessing method combined with a job scheduling system, characterized in that the method comprises the following steps:
(1) enabling the preprocessing function of the job scheduler;
(2) the job scheduler reading the compute node resource configuration file;
(3) checking the content of the compute node resources;
(4) when abnormal content is found in the compute node resources, judging whether the automatic processing procedure needs to be started;
(5) judging whether to process the abnormal content automatically;
(6) processing the abnormal content automatically;
(7) sending the abnormal content to the user as an SMS message or e-mail through the SMTP or SMGP extended configuration interface;
(8) recording the operation process in a log file;
in step (2), the compute node resource configuration file is the health.prop configuration file;
the content of the health.prop configuration file comprises:
A. whether the resource monitoring preprocessing function is enabled; the default is Yes;
B. the file availability objects, that is, whether the specified files exist; the default is empty;
C. whether the directories or partitions whose capacity is to be checked exist; the default is empty;
D. the trigger threshold of the automatic processing procedure: when the used capacity of a specified directory or partition exceeds this threshold, the automatic processing procedure is started; the default is 0.8, that is, automatic processing starts when usage of the specified directory or partition exceeds 80%;
E. the minimum size of the files handled during automatic processing; the default is 1 MB, that is, only files larger than 1 MB are processed;
F. the minimum age of the files handled during automatic processing: a file must have been created earlier than this many days ago; the default is 7, that is, only files created more than one week ago are processed;
G. during automatic processing, only files belonging to a given task group are processed; the default is empty, that is, files of all groups are processed;
H. during automatic processing, only files belonging to a given user are processed; the default is empty, that is, files of all users are processed;
in step (3), the compute node resources are checked using a script file; the script file is node_check.scp;
in step (4), if the automatic processing procedure needs to be started, step (5) is carried out; otherwise the method returns to step (1);
in step (5), if the abnormal content is to be processed automatically, step (6) is carried out; otherwise step (7) is carried out;
in step (6), after the abnormal content has been processed automatically, the processing is recorded in the log file, that is, step (8) is carried out;
in step (8), the operation process comprises the processing performed on the abnormal content and the sending of the notification to the user; the log file is health.log.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210333671.0A CN102902598B (en) | 2012-09-10 | 2012-09-10 | Resource detection preprocessing method combined with a job scheduling system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210333671.0A CN102902598B (en) | 2012-09-10 | 2012-09-10 | Resource detection preprocessing method combined with a job scheduling system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102902598A (en) | 2013-01-30 |
CN102902598B (en) | 2015-08-19 |
Family
ID=47574844
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210333671.0A Active CN102902598B (en) | Resource detection preprocessing method combined with a job scheduling system | 2012-09-10 | 2012-09-10 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102902598B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103347059B (en) * | 2013-06-20 | 2016-06-22 | 北京奇虎科技有限公司 | Realize the method for user's configuration parameter transmission, client and system |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101373447A (en) * | 2008-08-20 | 2009-02-25 | 上海超级计算中心 | System and method for detecting health degree of computer cluster |
CN101694630A (en) * | 2009-09-30 | 2010-04-14 | 曙光信息产业(北京)有限公司 | Method, system and equipment for operation dispatching |
WO2011005073A2 (en) * | 2009-07-09 | 2011-01-13 | Mimos Bhd. | Job status monitoring method |
CN102117225A (en) * | 2009-12-31 | 2011-07-06 | 上海可鲁系统软件有限公司 | Industrial automatic multi-point cluster system and task management method thereof |
CN102148871A (en) * | 2011-03-18 | 2011-08-10 | 浪潮(北京)电子信息产业有限公司 | Storage resource scheduling method and device |
CN102147960A (en) * | 2011-03-22 | 2011-08-10 | 曙光信息产业股份有限公司 | System and method for monitoring super-large scale trunking services |
CN102231681A (en) * | 2011-06-27 | 2011-11-02 | 中国建设银行股份有限公司 | High availability cluster computer system and fault treatment method thereof |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040205414A1 (en) * | 1999-07-26 | 2004-10-14 | Roselli Drew Schaffer | Fault-tolerance framework for an extendable computer architecture |
Also Published As
Publication number | Publication date |
---|---|
CN102902598A (en) | 2013-01-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8533731B2 (en) | Apparatus and method for distrubuting complex events based on correlations therebetween | |
CN207301773U (en) | A kind of numerical control machine tool monitoring system based on Internet of Things | |
CN106126346A (en) | A kind of large-scale distributed data collecting system and method | |
CN103645947A (en) | MIL-STD-1553B bus monitoring and data analysis system | |
CN106033476A (en) | Incremental graphic computing method in distributed computing mode under cloud computing environment | |
CN111562889B (en) | Data processing method, device, system and storage medium | |
CN107612984B (en) | Big data platform based on internet | |
CN112769897A (en) | Synchronization method and device for edge calculation message, electronic equipment and storage medium | |
CN102752294B (en) | Method and system for synchronizing data of multiple terminals on basis of equipment capacity | |
CN103200199A (en) | Out of band (OOB) data collection system | |
CN107291744A (en) | It is determined that and with the method and device of the relationship between application program | |
CN105592122A (en) | Cloud platform monitoring method and cloud platform monitoring system | |
CN112118174A (en) | Software defined data gateway | |
CN111930565B (en) | Process fault self-healing method, device and equipment for components in distributed management system | |
CN106383771A (en) | Host cluster monitoring method and device | |
CN106027674A (en) | Technology architecture of "Internet & smart manufacturing" | |
CN105607606A (en) | Data acquisition device and data acquisition method based on double-mainboard framework | |
CN106598738A (en) | Computer cluster system and parallel computing method thereof | |
CN103763181A (en) | Automatic attribute setting device and method | |
CN112000735A (en) | Data processing method, device and system | |
CN117651003B (en) | ERP information transmission safety monitoring system | |
CN103678423A (en) | Data file input system, device and method | |
CN102902598B (en) | Resource detection preprocessing method combined with a job scheduling system | |
CN104750814B (en) | The automatic storage method of polynary heterogeneous data flow based on multisensor | |
CN109672731A (en) | A kind of distributed node information monitoring method, system and application |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 2021-10-25. Address after: 100089 zone A-1, floor 2, building 36, yard 8, Dongbeiwang West Road, Haidian District, Beijing. Patentee after: Shuguang zhisuan Information Technology Co.,Ltd. Address before: 100193 No.36 Zhongguancun Software Park, No.8 Dongbeiwang West Road, Haidian District, Beijing. Patentee before: Dawning Information Industry (Beijing) Co.,Ltd. |