CN101340423B - Multi-cluster job scheduling method based on meta-scheduling ring

Info

Publication number: CN101340423B
Application number: CN2008101181738A
Authority: CN
Inventors: 荣晓慧; 邓攀; 陈�峰; 马世龙; 伊胜伟; 孙超赟; 于冰; 梁峰
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2008-08-13
Filing date: 2008-08-13
Publication date: 2011-02-02
Anticipated expiration: 2028-08-13
Also published as: CN101340423A

Abstract

The invention discloses a multi-cluster job scheduling method based on a meta-scheduling ring. The method relies on a meta-scheduling ring established among multiple clusters. A job that a user submits to the job master node is added to multiple clusters by an add-job message that the job master node sends around the ring. A cluster node that is ready to execute the job sends a cancel-job message, and the cluster that actually runs the job is decided by comparing job run weights, which prevents several clusters from running the same job and completes the scheduling of the job among the clusters. Because the job is inserted into the local queues of multiple clusters on the ring, the method increases the job's chances of being scheduled, improves the utilization of cluster resources, avoids problems such as single points of failure and excessive network load, and scales well.

Description

A multi-cluster job scheduling method based on a meta-scheduling ring

Technical field

The invention belongs to the field of data processing and relates to a job scheduling method, specifically to a multi-cluster job scheduling method based on a meta-scheduling ring.

Background technology

Many organizations have purchased clusters for scientific research, but cluster utilization differs widely between organizations: some clusters sit idle while others cannot meet their users' computing demands. If multiple clusters are joined together and offered to users as a single computing service, they can provide enough computing power to satisfy more users' computing needs.

A ring network links its terminals in one continuous ring: the transmission medium runs from one end user to the next until all end users are connected in a loop. This structure removes the dependence of end-user communication on a central system. Each end user is connected to the end user before it and the end user after it, so the links are point-to-point and operate in one direction, and any transmitted message must pass through every endpoint on the ring.

A P2P network is an overlay network on top of the IP network. By topology, P2P networks fall into three types: centralized, fully distributed unstructured, and fully distributed structured. Centralized systems are implemented with a directory server, which easily leads to single points of failure and access "hot spots". Fully distributed unstructured systems are organized as random graphs; their routing algorithms resemble broadcast search, so they generate a large volume of network messages and place a heavy burden on the network. Fully distributed structured systems are organized as regular graphs and maintain routing tables of a certain size; their maintenance mechanisms are complex, and the maintenance cost depends directly on the size of the routing table.

In the centralized job scheduling model, one host in the multi-cluster system (the meta-scheduler) is responsible for scheduling and for collecting the system's load information. It maintains a job allocation table and assigns tasks according to the system load. This model has three drawbacks. First, the meta-scheduler must maintain the resource and job information of all cluster nodes. As the number of managed cluster nodes grows, matching jobs to resources becomes more complex, the amount of information to process increases, the scheduling algorithm becomes more complex, and the meta-scheduler becomes the bottleneck of the whole system. Second, while the meta-scheduler collects information from all cluster nodes, the cluster nodes must wait, which wastes their processing capacity. Third, all information is stored on the meta-scheduler and all scheduling decisions are made by it, so the meta-scheduler is the core of the whole system: when it fails, the whole system collapses and cannot continue working.

Meta-scheduling is scheduling built on top of the clusters' local scheduling mechanisms: a meta-scheduler schedules tasks across multiple clusters, each of which has its own independent scheduling solution. A meta-scheduler can create a grid-level aggregate structure in which every component uses the same scheduling solution. It can also link components that use different schedulers, shielding system users from these differences.

The ring structure that multiple clusters form by connecting through front and back nodes is called a meta-scheduling ring. When only one cluster node exists, a meta-scheduling ring containing that single cluster node must be established first; the front node and the back node of this cluster node are both the node itself.

A meta-scheduling ring with multiple cluster nodes is built on top of the single-node ring. A newly joining cluster node sends a join-ring message to some cluster node already on the ring (the connection node). The connection node sets its own front node to the new cluster node and returns its original front node information to the new node in the response message. On receiving this response, the new cluster node sets its back node to the connection node, extracts the connection node's original front node from the response, and sends a join-ring message to that original front node. On receiving this message, the original front node sets its back node to the new cluster node and sends a response message back. On receiving that response, the new cluster node sets its front node to the original front node. The new cluster node has now joined the meta-scheduling ring: it can receive messages from its front node and send messages to its back node.
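The join procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `ClusterNode` and the functions `join_ring` and `ring_order` are hypothetical names, and the two join messages and their responses are modelled as direct method calls rather than network messages.

```python
class ClusterNode:
    """Hypothetical in-memory stand-in for a cluster node on the ring."""

    def __init__(self, name):
        self.name = name
        # A new node starts as a single-node ring: it is its own
        # front node (PreviousNode) and back node (NextNode).
        self.previous_node = self
        self.next_node = self

    def on_join_as_connection_node(self, newcomer):
        """Connection node: take the newcomer as front node, return the old one."""
        old_previous = self.previous_node
        self.previous_node = newcomer
        return old_previous

    def on_join_as_original_front_node(self, newcomer):
        """Original front node: take the newcomer as back node."""
        self.next_node = newcomer


def join_ring(newcomer, connection_node):
    # First join message and its response (carries the old front node).
    old_previous = connection_node.on_join_as_connection_node(newcomer)
    newcomer.next_node = connection_node
    # Second join message and its response.
    old_previous.on_join_as_original_front_node(newcomer)
    newcomer.previous_node = old_previous


def ring_order(start):
    """Follow back-node links once around the ring, collecting node names."""
    names, node = [], start
    while True:
        names.append(node.name)
        node = node.next_node
        if node is start:
            return names


a, b, c = ClusterNode("C0000001"), ClusterNode("C0000002"), ClusterNode("C0000003")
join_ring(b, a)
join_ring(c, a)
```

Note that the newcomer is always inserted between the connection node and that node's original front node, so the single-node case needs no special handling: the connection node is then its own original front node.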

Summary of the invention

The present invention proposes a multi-cluster job scheduling method based on a meta-scheduling ring. The method relies on a meta-scheduling ring established among multiple clusters: an add-job message adds the job to multiple clusters, and a cancel-job message decides which cluster runs the job, thereby completing the scheduling of the job among the clusters. Because the job is inserted into the local queues of multiple clusters on the ring, the method increases the job's chances of being scheduled, improves the utilization of cluster resources, avoids problems such as single points of failure and excessive network load, and scales well.

A multi-cluster job scheduling method based on a meta-scheduling ring, comprising the following steps:

Step 1: The job master node on the meta-scheduling ring receives a user's job request and generates an add-job message from it.

Step 2: The local resource matcher of the job master node uses the add-job message to judge whether the job master node's resources can satisfy the requirements in the job description. If they can, the local scheduler of the job master node checks whether other jobs are in the node's local queue. If there are none, the job master node starts executing the job and returns the job scheduling result to the user, and the method ends; otherwise the job is added to the local queue of the job master node and the add-job message is forwarded along the meta-scheduling ring to the next cluster node. If the job master node's resources cannot satisfy the requirements in the job description, the add-job message is forwarded along the ring to the next cluster node.

Step 3: After the next cluster node on the ring receives the add-job message, its local resource matcher uses the message to judge whether the node's resources can satisfy the requirements in the job description. If they can, the node's local scheduler checks whether other jobs are in the node's local queue. If there are none, the node can execute the job immediately and goes to step 5; otherwise the node adds the job to its local job queue and forwards the add-job message along the ring to the next cluster node. If the local resource matcher judges that the node's resources cannot satisfy the requirements in the job description, the add-job message is forwarded along the ring to the next cluster node.

Step 4: A cluster node on the ring uses its local scheduler to check whether a job in the local job queue is due to execute. If so, go to step 5; otherwise keep waiting until a job is due.

Step 5: The cluster node creates a cancel-job message and forwards it along the ring, notifying the cluster nodes that hold the job in their local queues to cancel its execution.

Step 6: After the cluster node that is about to execute the job receives its cancel-job message back, it parses the job run flag field in the message. If a cluster node with a higher run weight for this job exists on the ring, this node does not execute the job, and the job is deleted by the cancel-job message sent by the higher-weight node; otherwise this cluster node starts executing the job and returns the job scheduling result to the user.

In steps 1, 2 and 3, the add-job message adds the job to multiple cluster nodes on the meta-scheduling ring. The add-job message comprises: message type, job ID, user ID, job master node name, add-job flag, job-added node list, and job JSDL description. The job-added node list is a variable-length array; every other field is a variable-length string, and fields are separated by spaces.

In step 5, the cancel-job message cancels the execution of the job on the other cluster nodes on the meta-scheduling ring and completes the scheduling of the job. The cancel-job message comprises: message type, job ID, message source node, job run weight, and job run flag. Every field is a variable-length string, and fields are separated by spaces.
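As an illustration of the space-separated layout, the five fields of the cancel-job message could be encoded and decoded as follows. This is a sketch, not the patent's wire format: the class `CancelJobMessage`, its attribute names, the field order and the textual form of the weight are assumptions, and the message type value CancelJob is taken from the embodiment.

```python
from dataclasses import dataclass


@dataclass
class CancelJobMessage:
    job_id: str
    source_node: str
    weight: float
    result: str = "yes"            # job run flag, initially yes
    msg_type: str = "CancelJob"

    def encode(self):
        """Join the five fields with single spaces (assumed field order)."""
        return " ".join(
            [self.msg_type, self.job_id, self.source_node,
             str(self.weight), self.result]
        )

    @classmethod
    def decode(cls, text):
        """Split on spaces; none of the five fields contains a space."""
        msg_type, job_id, source_node, weight, result = text.split(" ")
        return cls(job_id, source_node, float(weight), result, msg_type)


msg = CancelJobMessage(
    job_id="41e54b37-c01a-431d-97ef-fead95e2c271",
    source_node="C0000004",
    weight=200.0,
)
wire = msg.encode()
```

A space separator works here because none of the five fields can contain a space. The add-job message is different: its JSDL field is XML text, so an implementation would have to escape it or delimit it some other way.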

In step 5, the cluster nodes cancel the execution of the job through the following steps:

a) The cluster node parses the cancel-job message sent by its front node. If the ID of this cluster node is identical to the parsed message source node field, the node judges whether it may execute the job: if it may, it runs the job; otherwise the job will be cancelled by the cancel-job message sent by another cluster with a higher weight. Processing of the cancel-job message then ends. If the ID of the current cluster node differs from the parsed message source node field, go to b).

b) The cluster node searches its local queue for the parsed job ID. If the job is not in the local queue, the node forwards the cancel-job message to the next cluster node, and processing continues at a) on that node; otherwise go to c).

c) If this cluster node's weight for running the job is greater than the parsed job run weight, it sets the job run flag field of the cancel-job message to no; otherwise it deletes the job from its local queue. It then forwards the cancel-job message to its back node, and processing continues at a) on that node.
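Steps a) to c) amount to a small per-node decision function. The sketch below assumes the message is a plain dictionary keyed by the field names used in the embodiment (JobID, SourceNode, Weight, Result) and that each node keeps a dictionary from job ID to its own run weight; the function name and the returned action strings are hypothetical.

```python
def handle_cancel_job(node_name, local_weights, msg):
    """Decide what one cluster node does with a cancel-job message.

    local_weights: dict mapping queued job IDs to this node's run weight
                   (mutated when the job is deleted from the local queue).
    msg:           dict with keys JobID, SourceNode, Weight, Result
                   (Result is mutated when this node has a higher weight).
    """
    job_id = msg["JobID"]

    # a) The message is back at its sender: Result decides the outcome.
    if msg["SourceNode"] == node_name:
        return "run" if msg["Result"] == "yes" else "stop"

    # b) The job is not queued here: just pass the message on.
    if job_id not in local_weights:
        return "forward"

    # c) Compare weights: veto the sender, or drop the local copy.
    if local_weights[job_id] > msg["Weight"]:
        msg["Result"] = "no"
    else:
        del local_weights[job_id]
    return "forward"


job = "41e54b37-c01a-431d-97ef-fead95e2c271"
msg = {"JobID": job, "SourceNode": "C0000004", "Weight": 200.0, "Result": "yes"}
queue_low = {job: float("-inf")}    # job is only waiting on this node
queue_high = {job: 300.0}           # this node is also about to run the job
```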

The advantages of the multi-cluster job scheduling method based on a meta-scheduling ring are:

(1) Improved job response ratio: the invention inserts a job into the local queues of multiple clusters, which increases the job's chances of being scheduled, reduces its waiting time, and improves its response ratio.

(2) Improved cluster utilization: the invention uses multiple requests so that jobs are scheduled automatically among clusters, which improves the utilization of each cluster's resources.

(3) Good scalability: the method uses the meta-scheduling ring as its carrier, so the number of connections between cluster nodes grows linearly with the number of cluster nodes on the ring, which makes the whole system easier to expand.

(4) No single point of failure: the cluster nodes that form the ring are relatively independent, and each can complete scheduling through its own local scheduler, so the failure of one cluster node does not prevent the whole system from running.

Description of drawings

Fig. 1 is a structural diagram of the multi-cluster job scheduling method based on a meta-scheduling ring;

Fig. 2 is a flow chart of the steps of the multi-cluster job scheduling method based on a meta-scheduling ring;

Fig. 3 shows the format of the add-job message;

Fig. 4 shows the format of the cancel-job message;

Fig. 5 is a flow chart of cancel-job message handling.

Embodiment

The present invention is described in further detail below with reference to the drawings and an embodiment.

The method runs on a multi-cluster system based on a meta-scheduling ring, as shown in Fig. 1:

In this embodiment, the multi-cluster system consists of 5 cluster nodes. Each cluster node contains 1 to 4 management host nodes and 8 to 128 compute host nodes. The IDs of the 5 cluster nodes are C0000001 to C0000005, and the cluster nodes are interconnected by Gigabit Ethernet to form the meta-scheduling ring. Each cluster node has a cluster local queue of length 256, a cluster local resource matcher, and a cluster local scheduler. When a cluster node receives an add-job request message and its local resource matcher judges that the node can handle the job, the job is added to the node's local queue and scheduled for execution by the node's local scheduler; otherwise the cluster node does not handle the job.

To establish and maintain the meta-scheduling ring, each cluster node keeps the following parameters:

● Front node (PreviousNode): the host name of the cluster node immediately before this one on the ring. For example, cluster C0000001 is the front node of cluster C0000002.

● Back node (NextNode): the host name of the cluster node immediately after this one on the ring. For example, cluster C0000002 is the back node of cluster C0000001.
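For the five-node ring of this embodiment, the two parameters can be tabulated with a short Python sketch. It assumes the ring order follows the cluster IDs, which the source states only for C0000001 and C0000002; the function name `ring_neighbors` is hypothetical.

```python
def ring_neighbors(names):
    """Map each node name to its (PreviousNode, NextNode) pair on the ring."""
    n = len(names)
    return {
        name: (names[(i - 1) % n], names[(i + 1) % n])
        for i, name in enumerate(names)
    }


# C0000001 .. C0000005, assumed to sit on the ring in ID order.
clusters = [f"C{i:07d}" for i in range(1, 6)]
neighbors = ring_neighbors(clusters)
```

The modulo arithmetic closes the ring: under the assumed order, the front node of C0000001 is C0000005 and the back node of C0000005 is C0000001.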

The invention proposes a job scheduling method among multiple clusters based on a meta-scheduling ring. As shown in Fig. 2, the method comprises the following steps:

Step 1: A cluster node on the meta-scheduling ring, namely the job master node, receives a user's job request and generates an add-job message.

Once multiple clusters have formed a meta-scheduling ring, users can add jobs to the system through the ring. When adding a job, the user may choose any cluster node on the ring as the entry point. The first cluster node on the ring to receive the job is called the job master node of that job. The job ID is a randomly generated UUID, and the job description in the job request uses the Job Submission Description Language (JSDL) defined by the GGF (Global Grid Forum).
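A randomly generated job ID of this kind can be produced with Python's standard `uuid` module. The helper name `new_job_id` is hypothetical, and the use of a version 4 UUID is an assumption; the source only says the UUID is random.

```python
import uuid


def new_job_id():
    """Return a random (version 4) UUID in its 36-character text form."""
    return str(uuid.uuid4())


job_id = new_job_id()
```

The text form consists of 32 hexadecimal digits and 4 hyphens, which matches the 36-character JobID used in the embodiment.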

In this embodiment, a user submits a job request with job ID 41e54b37-c01a-431d-97ef-fead95e2c271 through cluster node C0000001. The job master node C0000001 generates the add-job message for this job, sets the JobMasterNode field of the message to its own name, and sets the AddJob field to YES. The message format is shown in Fig. 3:

● MsgType: message type, of type String. When a new job is added, this field is set to AddJob.

● JobID: job ID, of type String. A randomly generated 36-character UUID that uniquely identifies a job.

● UserID: user ID, of type String. The ID of the user who submitted the job.

● JobMasterNode: job master node name, of type String. The name of the cluster node through which the job entered the meta-scheduling ring.

● AddJob: add-job flag, of type String. If AddJob is YES, a cluster node that receives the message processes it, that is, adds the job to the local system of the cluster where the node resides. If AddJob is NO, an earlier cluster node has already started running the job, and the job need not be processed again.

● ToRunNode: job-added node list, of type List. When a cluster node decides to accept the job, it appends its own node name to this field. When the message returns after one lap around the ring, the system therefore knows which cluster nodes on the ring the job was added to.

● JSDL: job JSDL description, of type String. It is matched against cluster resources to judge whether a cluster node can run the job. When a cluster node that accepted the job prepares to run it, the node parses this field from the message and runs the job according to its content.
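The seven fields above map naturally onto a small record type. The sketch below is illustrative only: the class name `AddJobMessage`, the attribute names, the two helper methods, the user ID and the placeholder JSDL string are all hypothetical, while the field values AddJob, YES and NO come from the description.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AddJobMessage:
    job_id: str                   # JobID: 36-character UUID
    user_id: str                  # UserID
    job_master_node: str          # JobMasterNode: entry node of the job
    jsdl: str                     # JSDL: job description text
    msg_type: str = "AddJob"      # MsgType
    add_job: str = "YES"          # AddJob flag
    to_run_node: List[str] = field(default_factory=list)  # ToRunNode

    def record_queued_on(self, node_name):
        """A node that queues the job appends its own name to ToRunNode."""
        self.to_run_node.append(node_name)

    def mark_started(self):
        """A node that starts the job tells later nodes not to accept it."""
        self.add_job = "NO"


msg = AddJobMessage(
    job_id="41e54b37-c01a-431d-97ef-fead95e2c271",
    user_id="user01",                               # hypothetical user ID
    job_master_node="C0000001",
    jsdl="<JobDefinition>...</JobDefinition>",      # placeholder, not real JSDL
)
msg.record_queued_on("C0000001")
msg.record_queued_on("C0000002")
```

Using `default_factory=list` gives every message its own ToRunNode list, so nodes appending to one message never affect another.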

Step 2: The local resource matcher of the job master node uses the add-job message to evaluate the job master node's resources, and the add-job message is handled according to the node's resource situation.

In this embodiment the job is scheduled according to the state of the job master node C0000001. The local resource matcher of C0000001 uses the add-job message to judge whether the resources of C0000001 can satisfy the requirements in the job description. If they can, C0000001 accepts the job and hands it to the local cluster scheduler.

Because the cluster node to which a user submits a job is in most cases the most suitable node to execute it, once the job has been handed to the local scheduler of the job master node C0000001, the scheduler decides whether to run the job immediately or to place it in the local queue to wait. If the local scheduler of C0000001 finds no other jobs in the local queue, it runs the job with ID 41e54b37-c01a-431d-97ef-fead95e2c271 and reports the scheduling result to the user. Because the add-job request is then not forwarded on the meta-scheduling ring, the job's response ratio improves and the communication load on the ring is reduced, and the method ends. If C0000001 cannot execute the job immediately, the ID of the job master node, C0000001, is added to the ToRunNode field of the add-job message, the job is added to the local queue of C0000001, and, as stated in step 2 of the summary, the add-job message is forwarded along the ring to the next cluster node.

If the cluster resources of the job master node C0000001 cannot satisfy the requirements in the JSDL field of the job description, the add-job message is forwarded to the back node C0000002.

Step 3: After a cluster node receives the add-job message forwarded from the job master node, its local resource matcher uses the message to evaluate the node's resources, and the add-job message is handled according to the node's resource situation.

After a cluster node on the meta-scheduling ring receives the add-job message, its local resource matcher uses the message to judge whether the node's resources can satisfy the requirements in the job description. If they can, the node's local scheduler checks whether other jobs are in the node's local queue. If there are none, the node can execute the job immediately and goes to step 5; otherwise the node adds the job to its local job queue to wait and forwards the add-job message along the ring to the next cluster node. If the node's resources cannot satisfy the requirements in the job description, the add-job message is forwarded along the ring to the next cluster node.

In this embodiment, after cluster node C0000002 on the meta-scheduling ring receives the add-job message (AddJob), it parses the message and first uses the JobMasterNode field value, C0000001, to judge whether C0000002 is the job master node of the job. If it is, the add-job message has travelled one lap around the ring and returned to the job master node, so the job need not be added again; the node parses the ToRunNode field and reports to the user which clusters' local queues the job was added to. If it is not, the local resource matcher of C0000002 uses the add-job message to judge whether the resources of C0000002 satisfy the requirements in the job description JSDL. If they do, the cluster local scheduler of C0000002 checks whether other jobs are in the local queue of C0000002. If there are none, C0000002 can execute the job immediately and goes to step 5; otherwise C0000002 adds the job to its local job queue and forwards the add-job message along the ring to its back node, cluster node C0000003.

In this embodiment, when cluster node C0000002 starts executing the job, it sets the AddJob field to NO. When the subsequent cluster node C0000003 receives the add-job message, it checks the AddJob field, finds that it is NO, concludes that the job is already about to run on cluster node C0000001 or C0000002 and need not be accepted again, and performs no operation on the add-job message.

If the job is placed in the local queue of cluster node C0000002 to wait, the ID of this node is appended to the end of the job-added node list, the ToRunNode field.

If the local resource matcher of cluster node C0000002 judges from the add-job message that the resources of C0000002 cannot satisfy the requirements in the job description JSDL, the add-job message is forwarded along the ring to the back node, cluster node C0000003, which then handles it.
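The decisions a non-master cluster node makes for an add-job message can be collected into one function. The sketch below makes several assumptions: the message is a plain dictionary keyed by the field names above, the resource matcher's verdict is passed in as a boolean, the returned action strings are hypothetical, and a message whose AddJob field is NO is still passed along the ring so that it can return to the job master node (the source says only that such a node performs no operation on the message).

```python
def handle_add_job(node_name, resources_match, queue, msg):
    """Return the action one cluster node takes for an add-job message.

    resources_match: verdict of the local resource matcher (assumed input).
    queue:           list of job IDs in the local queue (mutated).
    msg:             dict with keys JobID, JobMasterNode, AddJob, ToRunNode
                     (mutated).
    """
    if msg["JobMasterNode"] == node_name:
        # The message completed one lap: report ToRunNode to the user.
        return "report_to_user"
    if msg["AddJob"] == "NO":
        # The job is already starting elsewhere: do not accept it.
        return "forward"
    if not resources_match:
        return "forward"
    if not queue:
        # Nothing queued locally: prepare to run, i.e. go to step 5.
        msg["AddJob"] = "NO"
        return "prepare_to_run"
    # Busy node: queue the job and record this node in ToRunNode.
    queue.append(msg["JobID"])
    msg["ToRunNode"].append(node_name)
    return "queue_and_forward"


def new_msg():
    return {
        "JobID": "41e54b37-c01a-431d-97ef-fead95e2c271",
        "JobMasterNode": "C0000001",
        "AddJob": "YES",
        "ToRunNode": ["C0000001"],
    }


busy_queue = ["some-other-job"]
queued_msg = new_msg()
action = handle_add_job("C0000002", True, busy_queue, queued_msg)
```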

Step 4: A cluster node on the meta-scheduling ring uses its local scheduler to check whether a job in the local job queue is due to execute. If so, go to step 5; otherwise keep waiting until a job is due.

In this embodiment, cluster node C0000003 on the ring uses its local scheduler to check whether a job in its local job queue is due to execute. If so, go to step 5; otherwise keep waiting until a job is due.

Step 5: The cluster node that is about to execute the job creates a cancel-job message and forwards it along the meta-scheduling ring, notifying the other cluster nodes to cancel the execution of the job.

When a cluster node is about to run a job, it first sends a message to the other nodes cancelling the job. Only when the message has travelled one lap around the ring and returned to this cluster node, and no cluster node on the ring with a higher job run weight is also about to run the job, does this node actually start running the job.

A cancel-job message is sent on the meta-scheduling ring in two main cases:

1) While a job is being added, a cluster node receives the add-job message and its local scheduler finds no other jobs in the node's local queue. Before running the job, the cluster node sends a cancel-job message to cancel the queuing of the job on other nodes.

2) A job on a cluster node changes from the queued state to the running state. Before running the job, the node sends a cancel-job message to its back node, and each back node in turn forwards it along the ring, cancelling the queuing of the job on other nodes.

In this embodiment, the job with ID 41e54b37-c01a-431d-97ef-fead95e2c271 in the local job queue of cluster node C0000004 changes from the queued state to the running state, and the local scheduler of C0000004 gives a weight of 200 for running this job. Before running the job, C0000004 generates the cancel-job message for it, sets the SourceNode field to its own name, sets Weight to 200, and sets the Result field to yes, indicating that C0000004 can run job 41e54b37-c01a-431d-97ef-fead95e2c271. If a cluster node with a higher weight exists at this point, say C0000002, that node sets the Result field of the message to no. The message format is shown in Fig. 4:

● MsgType: message type, of type String. In this embodiment, the field is set to CancelJob when a job is cancelled.

● JobID: job ID, of type String. The ID of the job to be deleted: a randomly generated 36-character UUID that uniquely identifies a job.

● SourceNode: message source node, of type String. The name of the cluster node that sent this message.

● Weight: job run weight, of type String. The weight for running this job, computed by the cluster node that sends the cancel-job message according to a specific calculation method. When several cluster nodes on the ring send messages cancelling this job on other nodes, the job is run by the cluster node with the largest weight. A cluster node on which the job is still waiting in the local queue has a weight of −∞.

● Result: job run flag, of type String. If a cluster node with a higher weight on the ring is also about to run the job, that node sets this field to no; SourceNode then does not run the job, and the job on SourceNode is cancelled by the cancel-job message sent by the higher-weight cluster node. The initial value is yes.

Fig. 5 shows how a cluster node handles a cancel-job message sent by another cluster node:

1) After a cluster node receives the cancel-job message (CancelJob) sent by its front node, it first parses the message. If the ID of this cluster node is identical to the parsed SourceNode field, the message has travelled one lap around the ring and returned to the cluster node that sent it. The cluster node then uses the parsed Result field to judge whether it may execute the job: if it may, it runs the job; otherwise the job on this cluster node will be cancelled by the cancel-job message sent by the cluster with the higher weight. Forwarding of this cancel-job message then ends. If the ID of the current cluster node differs from the parsed SourceNode field, go to 2).

2) The cluster node searches its local queue for the JobID parsed from the cancel-job message. If the job is not in the local queue, the node forwards the cancel-job message to the next cluster node, and processing continues at 1) on that node. If the job is in the local queue, go to 3).

3) The cluster node judges whether its weight for running the job is greater than the Weight value parsed from the cancel-job message. If it is greater, the node sets the Result field of the message to no, indicating that the cluster node that sent the message must not run the job, forwards the message to its back node, and processing continues at 1) on that node. If it is not greater, the node deletes the job from its local queue, forwards the message to its back node, and processing continues at 1) on that node.

Step 6: After the cluster node that is about to execute the job receives its cancel-job message back, it processes the message, determines whether the job should be cancelled on this cluster node or executed by it, and returns the job scheduling result to the user. The method then ends.

After the cluster node receives its cancel-job message back, it parses the message and reads the value of the run flag field Result. If Result is yes, the cluster node may start running the job: no cluster node on the ring has a higher weight for running the job than this one, and during the message's lap around the ring the job was cancelled from the queues of all other cluster nodes. The cluster node therefore starts executing the job and returns the execution result to the user. If Result is no, a cluster node with a higher Weight on the ring is also about to run the job; the flow ends, and the job on this cluster node will be cancelled by the cancel-job message sent by the higher-Weight cluster node.
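To see how the Result field resolves a conflict, one lap of a cancel-job message can be simulated in Python. This is a sketch under assumptions: the ring is a plain list in ring order, each node's local queue is a dictionary from job ID to that node's run weight, and the function name `circulate_cancel` and the second node's weight of 300 are invented for illustration. Only the weight 200 for C0000004, the example of C0000002 as the higher-weight node, and the weight −∞ for waiting jobs come from the description.

```python
WAITING = float("-inf")   # weight of a job that is only waiting in a queue


def circulate_cancel(ring, queues, source, job_id, weight):
    """Pass one cancel-job message once around the ring.

    ring:   list of node names in ring order.
    queues: dict node name -> dict job ID -> run weight (mutated).
    Returns the Result field as seen when the message is back at `source`.
    """
    result = "yes"
    start = ring.index(source)
    for step in range(1, len(ring)):
        node = ring[(start + step) % len(ring)]
        local = queues[node]
        if job_id not in local:
            continue                  # job not queued here: just forward
        if local[job_id] > weight:
            result = "no"             # a higher-weight node vetoes the sender
        else:
            del local[job_id]         # otherwise the local copy is cancelled
    return result


ring = [f"C{i:07d}" for i in range(1, 6)]
job = "41e54b37-c01a-431d-97ef-fead95e2c271"
queues = {name: {} for name in ring}
queues["C0000001"][job] = WAITING
queues["C0000002"][job] = 300.0       # hypothetical competing weight
queues["C0000004"][job] = 200.0

first = circulate_cancel(ring, queues, "C0000004", job, 200.0)
second = circulate_cancel(ring, queues, "C0000002", job, 300.0)
```

In this run the message from C0000004 comes back with Result no, because C0000002 holds a higher weight; the message from C0000002 then removes the job from C0000004 and comes back with Result yes, so exactly one cluster runs the job.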

Claims

1. A multi-cluster job scheduling method based on a meta-scheduling ring, characterized in that it comprises the following steps:

The add-job message comprises: message type, job ID, user ID, job master node name, add-job flag, job-added node list, and job JSDL description; the job-added node list is a variable-length array, every other field is a variable-length string, and fields are separated by spaces;

The cancel-job message cancels the execution of the job on the other cluster nodes on the meta-scheduling ring and completes the scheduling of the job; it comprises: message type, job ID, message source node, job run weight, and job run flag; every field is a variable-length string, and fields are separated by spaces;

2. The multi-cluster job scheduling method based on a meta-scheduling ring according to claim 1, characterized in that the cancellation of the job's execution by the cluster nodes in step 5 comprises the following steps: