CN102902598B

CN102902598B - A resource detection preprocessing method combined with a job scheduling system

Info

Publication number: CN102902598B
Application number: CN201210333671.0A
Authority: CN
Inventors: 张磊; 张涛
Original assignee: Dawning Information Industry Beijing Co Ltd
Current assignee: Shuguang zhisuan Information Technology Co.,Ltd.
Priority date: 2012-09-10
Filing date: 2012-09-10
Publication date: 2015-08-19
Anticipated expiration: 2032-09-10
Also published as: CN102902598A

Abstract

The invention relates to a resource detection preprocessing method combined with a job scheduling system, comprising the following steps: (1) enabling the preprocessing function of the job scheduler; (2) the job scheduler reading the compute node resource configuration file; (3) checking the content of the compute node resources; (4) when abnormal compute node resource content is found, determining whether the self-processing procedure needs to be started; (5) determining whether to self-process the abnormal compute node resource content; (6) self-processing the abnormal compute node resource content; (7) sending the abnormal compute node resource content to the user as a short message or e-mail through the SMTP or SMGP extension configuration interface; (8) recording the operation process in a log file. The invention gives a self-processing scheme for the two problems of "reliability of storage resources" and "file availability", and provides the related automatic processing and configuration files, so that the scheme is simple, configurable, and extensible. Processing is efficient and saves time and labor.

Description

A resource detection preprocessing method combined with a job scheduling system

Technical field

The invention relates to a preprocessing method in the field of high-performance computing clusters (HPCC), and in particular to a resource detection preprocessing method combined with a job scheduling system.

Background art

One of the most common problems of large-scale cluster job scheduling systems is that a resource (compute node resources, storage resources, and so on) becomes abnormal (an abnormality other than a node going offline), but the scheduling system fails to catch it. Jobs are then scheduled onto the abnormal node, or use other abnormal resources, and ultimately cannot complete normally. This wastes a large amount of resources and time, and no valid job result is obtained.

Torque 5.0 provides a compute node health check function and works with a scheduler (such as Maui) to set the state of unhealthy nodes to Down. The Torque node health check runs a specified monitoring script and reads its output; if the output begins with "ERROR", the scheduler sets the state of that node to Down. The node check interval can also be set. The prior art has the following problems:
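The "ERROR" prefix rule described above can be illustrated with a minimal Python sketch. The function names are hypothetical and are not part of Torque or Maui; ignoring leading whitespace is an assumption, since the text only says the output begins with "ERROR".

```python
def node_should_go_down(script_output: str) -> bool:
    """Return True when the health-check output begins with "ERROR"."""
    # Assumption: leading whitespace is ignored before testing the prefix.
    return script_output.lstrip().startswith("ERROR")


def next_node_state(current_state: str, script_output: str) -> str:
    """Hypothetical helper: the node state a scheduler would assign after a check."""
    return "Down" if node_should_go_down(script_output) else current_state
```

For example, `next_node_state("free", "ERROR /scratch is full")` yields `"Down"`, while any output that does not begin with "ERROR" leaves the state unchanged.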

First, the compute node health check function provided by Torque requires users to write the detection script or Linux executable themselves. This demands some ability to develop detection scripts or detection programs, so the function is difficult to use. Second, when an abnormality is detected, the function only has the scheduler set the node state to Down; it provides no corresponding automatic handling of the abnormality.

Summary of the invention

In view of the deficiencies of the prior art, the invention provides a resource detection preprocessing method combined with a job scheduling system. Building on the compute node health check function of Torque, an open-source cluster job scheduling and resource management system, the invention gives a self-processing scheme for the two problems of "reliability of storage resources" and "file availability", and provides the related automatic processing and configuration files, so that the scheme is simple, configurable, and extensible.

The object of the invention is achieved by the following technical solution:

A resource detection preprocessing method combined with a job scheduling system, the improvement being that the method comprises the following steps:

(1) enabling the preprocessing function of the job scheduler;

(2) the job scheduler reading the compute node resource configuration file;

(3) checking the content of the compute node resources;

(4) when abnormal compute node resource content is found, determining whether the self-processing procedure needs to be started;

(5) determining whether to self-process the abnormal compute node resource content;

(6) self-processing the abnormal compute node resource content;

(7) sending the abnormal compute node resource content to the user as a short message or e-mail through the SMTP or SMGP extension configuration interface;

(8) recording the operation process in a log file.

In step (2), the compute node resource configuration file is the health.prop configuration file.

The content of the health.prop configuration file comprises:

A. Whether the resource monitoring preprocessing function is enabled; the default is Yes.

B. The file availability objects (files are one kind of detected object, and availability is the only check applied to them), that is, check whether the specified files exist; the default is empty.

C. The directories or partitions whose capacity is to be checked, including whether they exist; the default is empty.

D. The trigger threshold of the automatic processing procedure: when the used capacity of a specified directory or partition exceeds this threshold, the automatic processing procedure is started. The default is 0.8, that is, automatic processing starts when usage of the specified directory or partition exceeds 80%.

E. The minimum size of a file to be handled during automatic processing. The default is 1 MB, that is, only files larger than 1 MB are processed.

F. During automatic processing, a file is handled only if it was produced earlier than this number of days ago. The default is 7, that is, only files produced more than one week ago are processed.

G. During automatic processing, only files belonging to a certain group are handled. The default is empty, that is, files of all groups are processed.

H. During automatic processing, only files belonging to a certain owner are handled. The default is empty, that is, files of all owners are processed.
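Items A to H can be sketched as a small configuration object with the stated defaults. This is only an illustration: the patent does not give the on-disk syntax of health.prop, so the `key = value` layout and every key name below (`enabled`, `check_files`, `check_dirs`, `threshold`, `min_size_mb`, `min_age_days`, `group`, `owner`) are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HealthConfig:
    enabled: bool = True                                   # item A, default Yes
    check_files: List[str] = field(default_factory=list)   # item B, default empty
    check_dirs: List[str] = field(default_factory=list)    # item C, default empty
    threshold: float = 0.8                                 # item D, 80% usage
    min_size_bytes: int = 1024 * 1024                      # item E, assumed 1 MB
    min_age_days: int = 7                                  # item F, one week
    group: Optional[str] = None                            # item G, None = all groups
    owner: Optional[str] = None                            # item H, None = all owners


def _split_list(value: str) -> List[str]:
    """Split a comma-separated value into non-empty, trimmed entries."""
    return [v.strip() for v in value.split(",") if v.strip()]


def parse_health_prop(text: str) -> HealthConfig:
    """Parse hypothetical 'key = value' lines; unknown keys are ignored."""
    cfg = HealthConfig()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "enabled":
            cfg.enabled = value.lower() in ("yes", "true", "1")
        elif key == "check_files":
            cfg.check_files = _split_list(value)
        elif key == "check_dirs":
            cfg.check_dirs = _split_list(value)
        elif key == "threshold":
            cfg.threshold = float(value)
        elif key == "min_size_mb":
            cfg.min_size_bytes = int(float(value) * 1024 * 1024)
        elif key == "min_age_days":
            cfg.min_age_days = int(value)
        elif key == "group":
            cfg.group = value or None
        elif key == "owner":
            cfg.owner = value or None
    return cfg
```

Parsing an empty file returns the defaults listed above, so a site only needs to write the items it wants to change.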

In step (3), the content checks are carried out by a script file on the compute node resource; the script file is node_check.scp.
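The two kinds of check named in the configuration, file existence and capacity against a threshold, might look like the following Python sketch. It is not the node_check.scp script itself, whose contents the patent does not disclose; the function name and message wording are hypothetical.

```python
import os
import shutil
from typing import List


def detect_anomalies(check_files: List[str], check_dirs: List[str],
                     threshold: float = 0.8) -> List[str]:
    """Return one message per problem found; an empty list means healthy."""
    problems = []
    for path in check_files:
        # File availability: the only check applied to files is existence.
        if not os.path.exists(path):
            problems.append(f"missing file: {path}")
    for path in check_dirs:
        if not os.path.exists(path):
            problems.append(f"missing directory or partition: {path}")
            continue
        usage = shutil.disk_usage(path)
        # Guard against pseudo file systems that report a total of zero.
        if usage.total and usage.used / usage.total > threshold:
            problems.append(
                f"usage {usage.used / usage.total:.0%} above threshold on {path}")
    return problems
```

A wrapper that prints a line starting with "ERROR" whenever the returned list is non-empty would fit the Torque convention described in the background section.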

In step (4), if the self-processing procedure needs to be started, proceed to step (5); otherwise return to step (1).

In step (5), if the abnormal compute node resource content is to be processed, proceed to step (6); otherwise proceed to step (7).

In step (6), after the abnormal compute node resource content has been processed, the processing is recorded in the log file, that is, step (8) is carried out.

In step (8), the operation process comprises the processing carried out on the abnormal compute node resource content and the process of sending it to the user; the log file is health.log.
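The branching between steps (1) to (8) can be summarized in a short Python sketch. The callables passed in are placeholders for the real detection, processing, notification, and logging code, and the returned labels are hypothetical names for the path taken. Logging after the notification in step (7) is an inference from the statement that the logged operation process includes sending to the user.

```python
from typing import Callable, List


def run_preprocess(enabled: bool,
                   detect: Callable[[], List[str]],
                   needs_auto_processing: Callable[[List[str]], bool],
                   can_auto_process: Callable[[List[str]], bool],
                   auto_process: Callable[[List[str]], None],
                   notify: Callable[[List[str]], None],
                   log: Callable[[str], None]) -> str:
    """Run one detection cycle and return a label for the path taken."""
    if not enabled:                            # step (1)
        return "disabled"
    problems = detect()                        # steps (2) and (3)
    if not problems:
        return "healthy"
    if not needs_auto_processing(problems):    # step (4)
        return "skipped"                       # next cycle starts again at step (1)
    if can_auto_process(problems):             # step (5)
        auto_process(problems)                 # step (6)
        log(f"processed: {problems}")          # step (8)
        return "processed"
    notify(problems)                           # step (7)
    log(f"notified: {problems}")               # step (8)
    return "notified"
```

Passing the steps in as callables keeps the control flow testable without a real cluster.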

Compared with the prior art, the invention achieves the following beneficial effects:

Building on the compute node health check function of Torque, an open-source cluster job scheduling and resource management system, the invention gives a self-processing scheme for the two problems of "reliability of storage resources" and "file availability", and provides the related automatic processing and configuration files, so that the scheme is simple, configurable, and extensible. Processing is efficient, saves time and labor, and is more reliable.

Brief description of the drawings

Fig. 1 is a schematic diagram of the pbs_mom config section provided by the invention, that is, the configuration file through which Torque provides the health check function;

Fig. 2 is a flowchart of the resource detection preprocessing method combined with a job scheduling system provided by the invention.

Detailed description of the embodiments

Specific embodiments of the invention are described in further detail below with reference to the drawings.

High-performance computing cluster (HPCC, High Performance Computing Cluster): a branch of computer science aimed at solving complex scientific or numerical computing problems; it is a loosely coupled set of compute nodes made up of multiple node machines (servers).

Building on Torque's compute node health check function, the invention provides users with a simple, configurable, and extensible node detection preprocessing scheme. From communication with many HPCC users, we learned that, when cluster resources are in use, the computing resource abnormalities users worry about are concentrated in two problems: "reliability of storage resources" and "file availability". The node detection preprocessing scheme of the invention mainly targets these two problems and, based on the actual needs of many users, forms a solution that combines a configuration standard with automatic processing.

The pbs_mom config section provided by the invention, that is, the configuration file through which Torque provides the health check function, is shown in Fig. 1. The node_check_script item in this configuration file must be set to the location of the node_check.scp script file provided by this solution. As shown in the resource detection preprocessing configuration scheme of Fig. 1, the solution consists mainly of a series of script files such as node_check.scp, the health.prop configuration file, and the health.log log file, and also provides extension configuration interfaces such as SMTP and SMGP.
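As an illustration of the pbs_mom configuration described above, the following Python sketch renders the two relevant lines. `$node_check_script` and `$node_check_interval` are Torque MOM configuration parameters; the function name, the install path in the usage note, and the default interval of 1 are assumptions, and Fig. 1 itself is not reproduced here.

```python
def render_mom_config(script_path: str, interval: int = 1) -> str:
    """Return the health-check lines for a pbs_mom config file."""
    lines = [
        f"$node_check_script {script_path}",   # location of node_check.scp
        f"$node_check_interval {interval}",    # how often the check runs
    ]
    return "\n".join(lines) + "\n"
```

For instance, `render_mom_config("/opt/health/node_check.scp", 5)` (a hypothetical path) produces the two lines that would point the MOM at the solution's script.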

Torque is an open-source cluster job scheduling and resource management system. SMTP (Simple Mail Transfer Protocol) is a set of rules for transferring mail from a source address to a destination address, and it controls how mail is relayed. SMGP (Short Message Gateway Protocol) is the interface protocol used for short message transfer between a short message gateway (SMGW) and other network elements.
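The e-mail side of the notification interface could be sketched with Python's standard `smtplib` and `email.message` modules. The function names, subject wording, and default host and port are assumptions; the SMGP short message path is omitted because the standard library has no SMGP client.

```python
import smtplib
from email.message import EmailMessage
from typing import List


def build_alert(node: str, problems: List[str],
                sender: str, recipient: str) -> EmailMessage:
    """Compose a plain-text alert listing the abnormal content, one item per line."""
    msg = EmailMessage()
    msg["Subject"] = f"Resource anomaly on {node}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(problems))
    return msg


def send_alert(msg: EmailMessage, host: str = "localhost", port: int = 25) -> None:
    """Send the alert through an SMTP relay (host and port are assumptions)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)
```

Separating message construction from sending lets the message be inspected or logged without contacting a mail server.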

The flow of the resource detection preprocessing method combined with a job scheduling system provided by the invention is shown in Fig. 2. The method comprises the following steps:

(1) Enable the preprocessing function of the job scheduler. The job scheduler is the Maui job scheduler.

(2) The Maui job scheduler reads the compute node resource configuration file health.prop.

(3) Carry out the content checks of the compute node resource script file node_check.scp: the specified content is checked according to the configuration described in Table 1, the detailed explanation of part of the health.prop configuration. Table 1 is as follows:

Table 1: Detailed explanation of part of the health.prop configuration

(4) When abnormal compute node resource content is found, determine whether the self-processing procedure needs to be started: if so, proceed to step (5); otherwise return to step (1).

(5) Determine whether to self-process the abnormal compute node resource content: if the abnormal content is to be processed, proceed to step (6); otherwise proceed to step (7).

(6) Self-process the abnormal compute node resource content: after the abnormal content has been processed, the processing is recorded in the log file, that is, step (8) is carried out.

(8) Record the operation process in a log file: the operation process comprises the processing carried out on the abnormal compute node resource content and the process of sending it to the user. The log file is health.log.
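Choosing which files the automatic processing of step (6) acts on can be sketched from the size, age, group, and owner filters of health.prop. The patent does not say what the processing does to a selected file (for example deletion or archiving), so the sketch only selects candidates. Using the modification time as the production date, and numeric uid and gid values for owner and group, are assumptions.

```python
import os
import time
from typing import Iterator, Optional


def select_candidates(root: str,
                      min_size_bytes: int = 1024 * 1024,
                      min_age_days: int = 7,
                      owner_uid: Optional[int] = None,
                      group_gid: Optional[int] = None,
                      now: Optional[float] = None) -> Iterator[str]:
    """Yield files under root that pass the size, age, owner, and group filters."""
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue                      # file vanished or is unreadable
            if st.st_size <= min_size_bytes:  # only files larger than the minimum
                continue
            if st.st_mtime >= cutoff:         # only files older than the cutoff
                continue
            if owner_uid is not None and st.st_uid != owner_uid:
                continue
            if group_gid is not None and st.st_gid != group_gid:
                continue
            yield path
```

Leaving `owner_uid` and `group_gid` as `None` corresponds to the empty defaults, under which files of all owners and all groups are eligible.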

The node detection preprocessing scheme of the invention mainly gives a processing scheme for the two problems of "reliability of storage resources" and "file availability", and provides the related automatic processing and configuration files, so that the scheme is simple, configurable, and extensible.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the invention and not to limit it. Although the invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific embodiments of the invention may still be modified or equivalently replaced, and that any modification or equivalent replacement that does not depart from the spirit and scope of the invention shall fall within the scope of the claims of the invention.

Claims

1. A resource detection preprocessing method combined with a job scheduling system, characterized in that the method comprises the following steps:

(1) enabling the preprocessing function of the job scheduler;

(2) the job scheduler reading the compute node resource configuration file;

(3) checking the content of the compute node resources;

(6) self-processing the abnormal compute node resource content;

(8) recording the operation process in a log file;

in step (2), the compute node resource configuration file is the health.prop configuration file;

the content of the health.prop configuration file comprises:

B. the file availability objects, that is, check whether the specified files exist; the default is empty;

H. during automatic processing, only files belonging to a certain owner are handled; the default is empty, that is, files of all owners are processed;

in step (3), the content checks are carried out by a script file on the compute node resource; the script file is node_check.scp;

in step (4), if the self-processing procedure needs to be started, proceed to step (5); otherwise return to step (1);

in step (5), if the abnormal compute node resource content is to be processed, proceed to step (6); otherwise proceed to step (7);

in step (6), after the abnormal compute node resource content has been processed, the processing is recorded in the log file, that is, step (8) is carried out;

in step (8), the operation process comprises the processing carried out on the abnormal compute node resource content and the process of sending it to the user; the log file is health.log.