Summary of the invention
Technical problem: The purpose of this invention is to provide a cross-domain job control method for grid computing environments. The method enables secure, dynamic discovery of the available resources in a grid and adaptive job control. It reduces the grid's communication traffic, improves network utilization, and allows jobs to be solved in parallel, thereby improving both the utilization of grid resources and the execution efficiency of grid computing. If a resource leaves the grid because of an unavoidable event such as a power failure, the node running the job cannot report back to the grid control mechanism, so the partial results cannot be integrated into a correct final result. To ensure that the user still obtains a correct result when such a sudden failure occurs, every resource node must be monitored, sub-job faults must be handled immediately, and the job must be moved across domains promptly when resources in the current domain are scarce.
Technical solution: The ultimate purpose of a grid is to give users a convenient environment for high-performance computing. To run jobs as close as possible to their data sources, reduce network communication cost, save bandwidth, balance load, strengthen the monitoring of subtask nodes, and speed up task execution, and thereby improve the processing efficiency of the distributed system and the accuracy of its results, we propose a job control scheme that uses a trust mechanism.
Trust is an assessment of how credible an entity's identity and behavior are. It is related to the entity's reliability, honesty, and performance. Trust is a subjective concept that depends on experience. The trust level is usually expressed as a trust value, which changes dynamically with the entity's behavior. A grid connects users with geographically distributed resources such as personal computers, workstations, clusters, and scientific instruments. Grid entities comprise resources and users. According to the organizations to which the entities belong and their geographic locations, we divide the grid into several independent autonomous domains (Autonomous Domain). Each autonomous domain contains a number of grid entities and has its own management and security policies, and the autonomous domains are connected by a network. Dividing the grid into autonomous domains makes it easier to address scalability, site autonomy, and heterogeneity. When different entities in the grid want to transact, they need to know the trust relationship between them. Depending on the autonomous domains in which the entities reside, trust relationships are divided into intra-domain relationships (between entities in the same domain) and inter-domain relationships (between entities in different domains). Here only a simple intra-domain trust model is used to compute trust values between entities; inter-domain trust relationships are not considered.
The job control in the cross-domain job control method of the present invention uses a trust mechanism and realizes cross-domain execution of grid jobs. The concrete steps are as follows:
Step 1: Before submitting a job, the user must first register to become a user of the grid.
Step 2: Before the user joins the grid, the grid application layer initializes the environment in preparation for the user's subsequent activities.
Step 3: If the user's identity is legitimate, the grid determines the user's access control rights to resources, and the grid user submits a job request:
1. The grid user fills in the job to be submitted. When submitting a grid job, the user must provide the task name, the job description, and the start and end times of job execution. During submission, the host submitting the job automatically attaches its local IP address and host name to the job description.
2. The grid user submits the job. The grid virtual organization control mechanism checks the legitimacy of the submitted job and the user's access control privilege level. If the job request is legitimate and has no semantic conflicts, the grid virtual organization job controller accepts the request.
3. The job enters the job waiting queue of the grid virtual organization, and its request status is set to the submitted state, waiting to be scheduled for execution.
Step 4: The job control mechanism at the grid virtual organization center performs sequential scheduling: it periodically takes the job at the head of the job waiting queue. If the queue is not empty, go to step 5; otherwise the job control mechanism waits until a user-submitted job enters the queue.
Step 5: The job control mechanism obtains the job's descriptive information.
Step 6: The grid job control mechanism filters out the available computing resources according to the trust mechanism.
Step 7: The job control mechanism performs resource-matching scheduling for the job and determines the subtask to be assigned to each computing resource.
Step 8: Job decomposition and migration. After the matching resource nodes are obtained, the virtual organization server divides the user's job according to the performance of the matched resources. The job assignment algorithm divides by resource performance weight: a resource with higher overall performance receives a larger share of the workload, and a resource with lower performance receives a smaller share. The virtual organization server then starts mobile agents through the mobile agent platform to deliver each division of the job to its resource node. If the job migration succeeds, the job control mechanism sets the job state to ready and goes to step 9; otherwise it sets the job state to error and goes to step 11.
Step 9: The subtasks are migrated to the computing resources and are scheduled by the local resource's operating system.
Step 10: While the mobile agents migrate the virtual organization's job to the resource nodes for execution, the virtual organization server starts a monitoring thread that watches the job and the resource nodes for returned results.
If a user query shows that the job has finished, the user can view its execution result by entering the job's identification number; otherwise go to step 11.
Step 11: If during this process the result returned for a job on some resource is a failure, the virtual organization server must reassign the work that was assigned to that resource. If the other running resource nodes have not yet finished their own tasks, the virtual organization server must ask the server of another virtual organization to help complete this portion of the job. The server of virtual organization 1 therefore sends the job to the server of virtual organization 2, together with the user's authentication assertion. The server of virtual organization 2 verifies the assertion; if it passes, the server accepts the job portion and allocates resources in the domain of virtual organization 2 to run it. The results are finally returned to the server of virtual organization 1, which integrates them and returns them to the user. This realizes cross-domain dynamic migration scheduling of grid jobs.
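The queueing and state changes in steps 3 to 8 can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `JobQueue`, the state strings, and the `migrate` callback (which stands in for the mobile agent platform) are all assumptions introduced here.

```python
from collections import deque
from itertools import count


class JobQueue:
    """Hypothetical FIFO job waiting queue with per-job state tracking."""

    def __init__(self):
        self._ids = count(1)      # source of unique job identification numbers
        self._waiting = deque()   # jobs waiting to be scheduled, oldest first
        self.state = {}           # job id -> "submitted" | "ready" | "error"

    def submit(self, description):
        """Accept a job, give it a unique id, and mark it as submitted."""
        job_id = next(self._ids)
        self._waiting.append((job_id, description))
        self.state[job_id] = "submitted"
        return job_id

    def dispatch(self, migrate):
        """Take the job at the head of the queue and try to migrate it.

        `migrate` is a stand-in for the mobile agent platform; it returns
        True when migration succeeds. Returns None if the queue is empty.
        """
        if not self._waiting:
            return None
        job_id, description = self._waiting.popleft()
        self.state[job_id] = "ready" if migrate(description) else "error"
        return job_id
```

A job whose state becomes "error" would then follow the reassignment path of step 11.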
The trust model is designed by taking the direct and indirect trust relationships between the grid's resource nodes and the user as the starting point for modeling and implementation. The user and the resource nodes form a tree-shaped relationship, as shown in Fig. 3.
The tree has 4 layers, where h denotes the layer. The user is in the first layer as the root (h=1), and so on down to the leaves at h=4. If the user wants to find all resource nodes in the grid that meet its requirements, every node is treated as a target node and visited in turn, and the usable nodes are then filtered out.
1. First open the user's trust record and look up the resource nodes that have a direct trust relationship with the user. Identifying each node by the last octet of its IP address, there are three such nodes: 120, 170, and 190. Start from node 120; if it is not the target node, use node 120 as the source and perform a depth-first traversal.
2. Only after the depth-first traversal finishes (that is, when node 170 at h=4 has been reached) is a breadth-wise, that is horizontal, traversal performed: return to layer h=2 and look up node 170. If it is not the target node and has no trust record, move to the next node in that layer, 190.
3. Perform depth-first traversal again. The procedure ends when every node has been visited once.
4. Each node in the grid has one or more trust paths. These are combined, and a weighted mean gives the user's final trust value for the node. The nodes that meet the requirements are then selected according to a trust threshold and used to complete the job submitted by the user.
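The traversal and filtering described above can be sketched as follows. The text does not say how trust combines along a path or how the weighted mean is weighted, so this sketch makes two assumptions: trust values multiply along a path, and a path's weight is the inverse of its hop count. The graph, the node labels, and the function names are illustrative only.

```python
from collections import defaultdict


def trust_paths(graph, source, max_depth=3):
    """Depth-first enumeration of trust paths starting at `source`.

    `graph` maps a node to {neighbor: direct trust value}. Returns
    {node: [(path_trust, hops), ...]} for every node reached.
    """
    found = defaultdict(list)

    def dfs(node, trust, hops, visited):
        if hops == max_depth:          # h=4 is three hops below the root
            return
        for nxt, direct in graph.get(node, {}).items():
            if nxt in visited:         # avoid cycles in the trust records
                continue
            path_trust = trust * direct  # assumption: trust multiplies along a path
            found[nxt].append((path_trust, hops + 1))
            dfs(nxt, path_trust, hops + 1, visited | {nxt})

    dfs(source, 1.0, 0, {source})
    return found


def final_trust(paths):
    """Weighted mean over trust paths; assumption: weight = 1 / hops."""
    total_weight = sum(1.0 / hops for _, hops in paths)
    return sum(trust / hops for trust, hops in paths) / total_weight


def select_nodes(graph, user, threshold):
    """Return the nodes whose final trust value meets the threshold."""
    paths = trust_paths(graph, user)
    return sorted(n for n, p in paths.items() if final_trust(p) >= threshold)
```

With a hypothetical record such as `{"user": {"120": 0.9, "170": 0.6, "190": 0.8}, "120": {"170": 0.9}}`, node 170 is reached both directly and through node 120, and the two paths are averaged before the threshold is applied.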
A mobile agent is a program that can migrate autonomously from one host to another in a heterogeneous network and interact with other agents or resources. It is in effect a combination of agent technology and distributed computing technology.
For a job that the grid must process, the available resources are first searched for dynamically. Available resources here means the resource nodes that have been selected according to their trust values, meet the requirements, and are online and idle. The job is then decomposed dynamically on the basis of a combined evaluation of each resource's performance. Relying on the information provided by the grid's control system, mobile agents (Mobile Agent) migrate the pieces to suitable resources for execution. If one or more resource nodes fail during execution and cannot return a correct result, exception handling is carried out. While a job runs, a copy of it, including the program, the input data, and the descriptive information, is kept on other nodes of the grid. To reduce grid job execution time and network communication load, the mobile agent migrates the sub-jobs of a failed node to other nodes in the local domain wherever possible. Only a flag needs to be maintained: when a running node fails in any way and cannot return normal information or results, an exception is raised, the flag is set to another value, and the exception handling stage begins. The failed node is marked offline, the range of work recorded in its information is taken out, other suitable healthy nodes in the domain continue the execution, and the correct results are finally collected. If, however, resources in the local domain are scarce and it is hard to find resources that match the grid job description, the mobile agent sends a copy of the job to other domains, where suitable resources are sought for it. This realizes cross-domain execution during scheduling.
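The local-first, cross-domain-second handling described above can be sketched as follows. The function name `handle_failure`, the status strings, and the choice of the first remote domain are assumptions made for illustration; the source does not specify how a remote domain is chosen.

```python
def handle_failure(failed, assignments, local_nodes, remote_domains):
    """Mark a failed node offline and find a new home for each of its sub-jobs.

    assignments:    {node: [sub-job, ...]}, updated in place
    local_nodes:    {node: "idle" | "busy" | "offline"}, updated in place
    remote_domains: names of other domains that may be asked for help
    Returns {sub-job: ("local", node) | ("remote", domain) | ("unplaced", None)}.
    """
    local_nodes[failed] = "offline"
    pending = assignments.pop(failed, [])
    idle = [n for n, status in local_nodes.items() if status == "idle"]

    placement = {}
    for sub_job in pending:
        if idle:
            # Prefer the local domain to keep network load low.
            node = idle.pop(0)
            local_nodes[node] = "busy"
            assignments.setdefault(node, []).append(sub_job)
            placement[sub_job] = ("local", node)
        elif remote_domains:
            # Local resources are scarce: hand a copy to another domain.
            placement[sub_job] = ("remote", remote_domains[0])
        else:
            placement[sub_job] = ("unplaced", None)
    return placement
```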
A resource has a life cycle comprising registration, sharing, and deregistration. The detailed process is as follows:
1. The resource registers itself with the resource controller.
2. The resource controller writes the resource's registration information into the resource information database. As a result of registration, the resource's information is present in the resource information database, and the registered resource becomes a grid resource.
3. When a user needs resources, the user sends a request to the resource controller.
4. The resource controller obtains the information of matching resources from the resource information database and returns it to the user, who thus obtains the resource information.
5. With the resource information, the server can interact with the resource in various ways.
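The registration and lookup steps above can be sketched as a small in-memory registry. The class `ResourceController`, its method names, and the attribute keys (`cpu`, `mem`) are hypothetical; a dictionary stands in for the resource information database.

```python
class ResourceController:
    """Hypothetical resource controller backed by an in-memory dictionary."""

    def __init__(self):
        self._db = {}  # stands in for the resource information database

    def register(self, name, **attrs):
        """Steps 1-2: record the resource so that it becomes a grid resource."""
        self._db[name] = dict(attrs)

    def deregister(self, name):
        """End of the life cycle: remove the resource from the grid."""
        self._db.pop(name, None)

    def match(self, **minimums):
        """Steps 3-4: return names of resources meeting every minimum."""
        return sorted(
            name
            for name, attrs in self._db.items()
            if all(attrs.get(key, 0) >= value for key, value in minimums.items())
        )
```

For example, after registering `pc1` with `cpu=2` and `cluster1` with `cpu=16`, a request `match(cpu=4)` returns only `cluster1`.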
The grid job control center must decompose the jobs submitted by grid users into tasks. Job decomposition here uses a tree-shaped branching structure, as shown in Fig. 4.
The root node A is the original job submitted through the grid job submission interface, while the jobs actually executed on the grid computing resource nodes are the leaf nodes E, F, G, H, and I. Decomposition of a grid job should take static load in the grid environment into account: every task assignment requires that the computing power of the resource node satisfy the computation demand of the task node. This avoids assigning computation-heavy tasks to resources with weak computing power, or computation-light tasks to resources with strong computing power, and so achieves static load balance.
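As a small illustration of the decomposition tree, the sketch below collects the leaf tasks that would actually be dispatched. The text names only the root A and the leaves E to I, so the intermediate nodes B, C, and D and the shape of the tree are assumptions made for this example.

```python
def leaf_tasks(tree, root):
    """Return the leaf tasks under `root` in left-to-right order.

    `tree` maps a task to the list of its sub-tasks; a task with no
    entry (or an empty list) is a leaf, that is, an executable task.
    """
    children = tree.get(root, [])
    if not children:
        return [root]
    leaves = []
    for child in children:
        leaves.extend(leaf_tasks(tree, child))
    return leaves


# Hypothetical shape: A splits into B, C, D, which split into the leaves.
example_tree = {"A": ["B", "C", "D"], "B": ["E", "F"], "C": ["G"], "D": ["H", "I"]}
```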
After the tasks are decomposed, the next work of the grid is to publish the decomposed grid tasks in the grid and migrate them to hosts with currently available resources to continue execution. Job migration serves the following purposes:
1. Achieving load balance. Load balance is the precondition for users to obtain good quality of service and for resources to be fully shared. During the job running stage, a migration mechanism moves part of the jobs on heavily loaded nodes to lightly loaded nodes, so that the load on each resource in the system is roughly balanced.
2. Handling job faults and resource departure requests. When a resource can no longer run the jobs on it because of a fault or capacity limits, those jobs can be migrated to other resources to continue running. When a resource asks to leave the grid, the grid jobs running on it are migrated to other resources and the resource is allowed to leave, respecting the wishes of the resource owner.
3. Making full use of grid resources and reducing the overall overhead of jobs.
The mobile agent decides job migration freely. Depending on what is migrated, migration is divided into code migration and data migration. To reduce grid job execution time and network communication load, the mobile agent migrates grid jobs within the local area network (LAN) wherever possible. Only when it is hard to find resources in the local area network that match the grid job description does the mobile agent send an agent carrying a copy of the job to the gateway, so that it can look for suitable resources in one or more other local area networks and continue execution there, as shown in Fig. 5.
Regarding load in the grid environment: the composition of computing resources in a grid computing environment is very complex, and may range from tens of thousands of individual PCs to multiple clusters or even the local area networks of several organizations. Differences in computational load, processor architecture, cache efficiency, and similar factors all cause the computational load to be unbalanced across resource nodes, so that some computing resource nodes sit idle while others are overloaded.
We require that both the computing power $c_i$ of a computing resource and the computation demand $\psi_j$ of a parallel task be described quantitatively with reasonable accuracy, so that each task assignment ensures that the computing power of the resource node satisfies the computation demand of the task node. This avoids assigning computation-heavy tasks to resources with weak computing power, or computation-light tasks to resources with strong computing power, and thereby balances the task load. If the computing power parameters of the resources and the computation demands of the parallel tasks reflect the real situation accurately, the resources with strong computing power in the system receive more tasks, which meets the load-balancing requirement of the grid environment.
Regarding network communication load: the grid owes its powerful distributed computing capability to making every possible use of grid resources. This, however, raises another problem that needs attention: network communication load. The communication network is the physical basis of the grid, and every part of grid job processing, such as migrating job entities to remote resource nodes and inter-process communication, depends on it. This inevitably generates a large network communication load, and reducing that load as far as possible is another consideration of our design.
The grid provides the physical basis for parallel computing. As noted earlier, differences in computational load, processor architecture, cache efficiency, and similar factors cause the computational load to be unbalanced across resource nodes in the grid, so that some computing resource nodes sit idle while others are overloaded.
When an atomic task is assigned to a computing resource and begins computing, it occupies all or part of that resource's computing power, and the resource controller deducts the portion occupied by the atomic task from the resource's current computing power. In addition, so that the correct computing power parameter is obtained when other parallel tasks are assigned, when a non-atomic task begins to be scheduled on a non-leaf node of the resource tree, the total demand of that non-atomic task is likewise deducted from the computing power of that resource node. Once the atomic task finishes computing, the resource controller restores the portion it occupied to the resource's current computing power parameter, as shown in Fig. 6.
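The deduct-on-start and restore-on-finish bookkeeping can be sketched as follows. The class name `CapacityTracker` and the rule of rejecting a task whose demand exceeds the remaining capacity are assumptions added for the example; only the atomic-task case is modeled.

```python
class CapacityTracker:
    """Hypothetical bookkeeping of the remaining computing power per resource."""

    def __init__(self, capacity):
        self.capacity = dict(capacity)  # resource -> currently free computing power
        self.held = {}                  # task id -> (resource, demand)

    def start(self, task, resource, demand):
        """Deduct the task's demand when it begins computing on the resource."""
        if demand > self.capacity[resource]:
            raise ValueError(f"{resource} cannot satisfy demand {demand}")
        self.capacity[resource] -= demand
        self.held[task] = (resource, demand)

    def finish(self, task):
        """Restore the portion the task occupied once it has finished."""
        resource, demand = self.held.pop(task)
        self.capacity[resource] += demand
```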
The following variables are defined: $T_i$: a job submitted by the user; $R_i$: a grid resource node; $c_i$: the estimated CPU computing power of $R_i$; $Link_{i,j}$: the bandwidth between $R_i$ and $R_j$; $\psi_{j,i}$: the workload of $T_j$ assigned to $R_i$. Considering the CPU computing power of each grid resource, for any $R_i$, its computing power is [formula missing in source]. The grid task control center decomposes $T_j$; because [formula missing in source], the workload assigned on $R_k$ is [formula missing in source]. As tasks migrate into and out of a computing resource, the value of $c_i$ is continuously adjusted accordingly.
Beneficial effects:
(1) Using the trust mechanism, available resource nodes can be found effectively.
(2) Mobile agents can migrate to client servers at each level or to the central server of the grid environment and communicate with them locally at high speed without further occupying network resources, which greatly reduces grid traffic and improves the utilization of network resources.
(3) Computation tasks can be moved autonomously from one node to another in a geographically distributed heterogeneous grid computing environment, and the agent can interact with other agents or resources, achieving control and adaptivity of jobs and resources.
(4) In grid computing, mobile agents do not need unified scheduling. Agents created by a user can run asynchronously on different computing nodes and send the results to the user once their tasks are finished. The same user or the same computing node can create multiple agents that run on one or more nodes at the same time, providing the ability to solve in parallel.
(5) The drawbacks of an unguaranteed response time and potentially long run times on a resource are overcome. When resources in the local domain are scarce, resources in other domains are called on, so that the job is completed effectively across domains.
(6) Job control provides a good user interface, controls data input and output, ensures data correctness, and is responsible for the entire course of a grid job from submission and creation to the return of the computed result.
(7) Job decomposition and migration are handled effectively, achieving load balance across resources.
Embodiment
One. Architecture
The main grid components that use the trust mechanism:
Mobile agent support environment: the middleware on which mobile agents run. It provides the underlying support for agent mobility, security, and intelligence, and can be integrated with other grid components.
Node: a provider of grid computing resources, referring broadly to all kinds of computer equipment, instruments, and so on.
Grid control system: responsible for unified command and coordination of the resource usage of different grid users, and for judging whether a job must go cross-domain when an exception occurs. It also provides the information services of grid computing, for which mobile-agent-based information query, collection, and publication methods can be adopted.
Job agent: a mobile agent (or sub-agent) generated according to a given job description specification and used to cooperatively complete a complex grid task.
The structure of the grid job control system:
Grid job control is the module responsible for controlling the life cycle of grid jobs. A grid job control architecture that uses mobile agents is proposed here, as shown in Fig. 7.
Client local mobile agent: after a legitimate user enters a job request description in the grid client, the client's local mobile agent generates a grid job from the description and submits it to the job control center of the grid virtual organization.
Job information: stores the grid jobs in all the state queues together with their execution information, such as the execution state and execution data of each job.
Job scheduling: performs sequential scheduling and matching scheduling for grid jobs.
Job decomposition: jobs are decomposed dynamically according to the resource control information in the control center of the grid virtual organization.
Job assignment: matches subtasks to resources.
Server-side agent: communicates with the mobile agents on the hosts according to the job assignment module.
Host agent: a host represents a grid resource. Once it starts its mobile agent and registers in its region, the grid resource is effective in the virtual organization.
Grid work control need be finished following task:
1. the whole life of control operation is responsible for operation and is submitted to beginning up to the overall process of returning result of calculation to the user from the user;
2. search adequate resources for operation, the coupling job requirements.According to the demand of user job, from grid, select adequate resources in the current available resource, and selected resources allocation is used to the user;
3. the I/O of control operation.The I/O of grid work is generally all carried out between remote node, but these characteristics might not embody at the code of operation, input may be to read keyboard, output may be to write screen, the grid work control gear is wanted and can be read data from correct position, can be to correct position write data;
4. be responsible for the migration of operation, operation from the then operation of a resource migration to a new resource, is realized the load balance of resource.Owing to can not accurately predict the actual conditions of job run, laod unbalance in grid, also can occur and need the situation of job migration, the dynamic turnover of resource also needs to carry out the migration of operation.
The operation control gear also will provide the job information query interface, so that the user obtains the job status information that oneself is submitted at any time.
At present, most jobs supported on grids are batch jobs: after the user submits a job, the grid finds suitable nodes to run it and returns the result to the user when the run ends. Such jobs seldom or never need to interact with the user while running.
Job scheduling in a grid computing environment covers six aspects: job decomposition, resource discovery and selection, task assignment, task execution, task monitoring and recovery, and task coordination and integration.
1. Job decomposition: its main function is to decompose the submitted job into subtasks with as high a degree of parallelism as possible.
2. Resource discovery and selection: resource owners publish their resources and access policies to the resource matchmaker (resource matchmaker); the matchmaker stores these publications; resource requesters publish their resource requirements to the matchmaker; the matchmaker selects a suitable set of resources for each requester according to its requirements.
3. Task assignment: a job is decomposed into m tasks T={T1, T2, ..., Tm}, and the system has n available resources R={R1, R2, ..., Rn}. The purpose of task assignment is to allocate these m tasks to the n resources so that the expected value of the performance objective function is minimized.
4. Task execution: reserve resources; submit the tasks to the resources; prepare, which may include setup, staged transfer, requesting resource reservations, or other actions needed to get the application ready to run; then run the tasks under the control of the local scheduling policy.
5. Task monitoring and recovery: task monitoring has two purposes: to facilitate interaction between the user and the job, and to give the job control program timely feedback so that it can make decisions quickly.
6. Task coordination and integration: synchronization between tasks can be achieved through a coordinator. After all tasks finish, their execution results must be integrated into the result of the whole job. In addition, grid job scheduling includes functions such as scheduling performance evaluation and QoS considerations.
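The task assignment aspect (item 3) leaves the objective function unspecified. As one illustration, the sketch below takes the objective to be the overall finish time (makespan) and uses a common greedy heuristic: place the largest remaining task on the resource that would finish it earliest. Both the objective and the heuristic are assumptions swapped in for the example, not the method of the source, and the heuristic does not guarantee a true minimum.

```python
def assign_tasks(task_sizes, speeds):
    """Greedy assignment of tasks to resources, largest task first.

    task_sizes: {task: amount of work}
    speeds:     {resource: work units processed per time unit}
    Returns (plan, makespan), where plan maps each task to a resource.
    """
    finish = {r: 0.0 for r in speeds}  # time at which each resource becomes free
    plan = {}
    for task, size in sorted(task_sizes.items(), key=lambda kv: -kv[1]):
        # Pick the resource on which this task would complete soonest.
        best = min(speeds, key=lambda r: finish[r] + size / speeds[r])
        finish[best] += size / speeds[best]
        plan[task] = best
    return plan, max(finish.values())
```

Because faster resources finish each task sooner, they tend to receive more work, in line with the load-balancing goal stated earlier.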
Two. Method flow
Grid job control execution flow:
In general, grid jobs are executed on remote nodes. A complete grid job control execution cycle is shown in Fig. 8.
1. Before submitting a job, the user must first register to become a user of the grid.
2. Before the user joins the grid, the grid application layer first initializes the environment in preparation for the user's subsequent activities.
3. If the user's identity is legitimate, the grid determines the user's access control rights to resources, and the grid user submits a job request.
1. The grid user fills in the job to be submitted.
When submitting a grid job, the user must provide the task name, the job description, and the start and end times of job execution. During submission, the host submitting the job automatically attaches its local IP address and host name to the job description. This prevents a mistyped IP address or host name from causing errors in subsequent steps such as job publication.
2. The grid user submits the job. The grid virtual organization control mechanism checks the legitimacy of the submitted job and the user's access control privilege level. If the job request is legitimate and has no semantic conflicts, the grid virtual organization job controller accepts the request.
3. The job enters the job waiting queue of the grid virtual organization, and its request status is set to the submitted state, waiting to be scheduled for execution.
After a grid user submits a job, the job enters the job waiting queue at the grid virtual organization center. Every submitted job is given a unique identification number. The grid virtual organization scheduling center performs sequential scheduling and matching scheduling for grid jobs. For sequential scheduling, grid jobs follow the first-in, first-out (FIFO) principle: the grid job control mechanism at the virtual organization center always selects the job at the head of the job waiting queue to process first. The grid virtual organization keeps a record of the grid users' resources, and for matching scheduling it selects suitable grid resources for the current job according to the trust mechanism described earlier.
4. The job control mechanism at the grid virtual organization center performs sequential scheduling: it periodically takes the job at the head of the job waiting queue. If the queue is not empty, go to step 5; otherwise the job control mechanism waits until a user-submitted job enters the queue.
5. The job control mechanism obtains the job's descriptive information, such as the submitter's user information and the job content.
6. The grid job control mechanism filters out the available computing resources according to the trust mechanism (their number does not exceed the maximum number of available resources set in resource control).
7. The job control mechanism performs resource-matching scheduling for the job and determines the subtask to be assigned to each computing resource.
8. Job decomposition and migration.
After the matching resource nodes are obtained, the VO server divides the user's job according to the performance of the matched resources. At present the job assignment algorithm divides by resource performance weight (a combination of CPU performance, bandwidth performance, and memory performance): a resource with higher overall performance receives a larger share of the workload, and a resource with lower performance receives a smaller share. The VO server then starts agents through the mobile agent platform to deliver each division of the job to its resource node, as shown in Fig. 9. If the job migration succeeds, the job control mechanism sets the job state to ready and goes to step 9; otherwise it sets the job state to error and goes to step 11.
9. The subtasks are migrated to the computing resources and are scheduled by the local resource's operating system.
10. While the agents migrate the VO's job to the resource nodes for execution, the VO server starts a monitoring thread that watches the job and the resource nodes for returned results.
The administrator of the grid virtual organization can query the running state of the grid jobs currently in the queue through the job control center. A grid user can use a job's identification number to query the state of the submitted job. The query shows the job's current running state, such as which resources are currently executing it, the content of the sub-job assigned to each resource, and the process state of the job on each resource.
If a user query shows that the job has finished, the user can view its execution result by entering the job's identification number; otherwise go to step 11.
11. If during this process the result returned for a job on some resource is a failure (for example because the resource node went offline or was overloaded and became unresponsive), the VO server must reassign the work that was assigned to that resource. If the other running resource nodes have not yet finished their own tasks, the VO server must ask another VO server (for example VO2) to help complete this portion of the job, as shown in Fig. 10. The VO1 server therefore sends the job to the VO2 server, together with the user's SAML assertion. The VO2 server verifies the assertion; if it passes, the server accepts the job portion and allocates resources in the VO2 domain to run it. The results are finally returned to the VO1 server, which integrates them and returns them to the user. This realizes cross-domain dynamic migration scheduling of grid jobs.
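The hand-off in step 11 can be illustrated with a deliberately simplified sketch. It does not implement SAML: a keyed hash (HMAC) over the user name, using a key that the two VO servers are assumed to share, stands in for the signed assertion so that the accept-or-reject logic can be shown. The function names, the shared key, and the `run` callback are all hypothetical.

```python
import hashlib
import hmac


def issue_assertion(user, key):
    """VO1 side: produce a token vouching for `user` (stand-in for a SAML assertion)."""
    return hmac.new(key, user.encode(), hashlib.sha256).hexdigest()


def accept_job(job, user, assertion, trusted_key, run):
    """VO2 side: verify the assertion, then run the job portion locally.

    Returns the result of `run(job)`, or None if the assertion is rejected.
    """
    expected = hmac.new(trusted_key, user.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, assertion):
        return None  # VO2 refuses the job portion
    return run(job)  # VO2 allocates resources and returns the result to VO1
```

VO1 would then integrate the returned result with the results of its own nodes before answering the user.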