CN1687917A - Large scale data parallel computing main system and method under network environment - Google Patents

Large scale data parallel computing main system and method under network environment

Info

Publication number
CN1687917A
CN1687917A; application CN200510025730.8A
Authority
CN
China
Prior art keywords
cluster
computing node
computing
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN200510025730.8A
Other languages
Chinese (zh)
Other versions
CN100357930C (en)
Inventor
陈庆奎
那丽春
图占乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CNB2005100257308A priority Critical patent/CN100357930C/en
Publication of CN1687917A publication Critical patent/CN1687917A/en
Application granted granted Critical
Publication of CN100357930C publication Critical patent/CN100357930C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Hardware Redundancy (AREA)
  • Multi Processors (AREA)

Abstract

The invention discloses a large-scale data parallel computing system and method under a network environment. The system is composed of multiple computer clusters interconnected by LAN or Intranet; the clusters and their computing nodes may have heterogeneous structures and multiple trust levels. All computing nodes are numbered in descending order of their comprehensive computing capability to form a computing node logic ring; likewise, all clusters are numbered in descending order of their comprehensive computing capability to form a cluster logic ring, and each cluster on the cluster logic ring is itself organized as a computing node logic ring. The method is a data parallel computing algorithm based on a dynamic redundancy mechanism; by dynamically constructing the logic rings of clusters and computing nodes and applying an m-redundancy allocation strategy, it effectively solves the technical problems of dynamic redundancy, dynamic load balancing and near-linear speedup.

Description

Large-scale data parallel computing system and method under a grid environment
Technical field
The present invention relates to computer computing technology, and in particular to a system and algorithm for general-purpose large-scale data parallel computing that exploits idle computational resources in a large-scale computing environment built from ordinary, inexpensive computer clusters.
Background technology
With the rapid development and growing popularity of information technology, the demands for massive-data processing and high-performance computing are becoming ever more urgent and arise in every field of national development. Finding cost-effective technologies for massive information processing and high-performance computing has become a pressing problem for both industry and academia. For this problem, grids and data grids, with their good autonomy, self-similarity, heterogeneity, diversity of management, powerful parallel I/O capability and very high performance-to-price ratio, are among the most feasible solutions. Intranets composed of multiple computer networks are now increasingly common, and cheap personal computing devices can be found everywhere, yet their resource utilization is very low. Related studies point out that in a typical network environment many resources remain unused for long periods: even during the busiest hours of a day, one third of the workstations are not fully used, and 70% to 85% of the network memory (the memory distributed over the nodes of a network) in a cluster is idle. Exploiting the idle computing, storage and communication resources of a grid composed of multiple computer clusters therefore yields a large amount of non-dedicated, cheap, large-scale high-performance processing and computing resources. However, as the numbers of cluster nodes and networks increase, the reliability of the system and the dynamic transfer capability of its resources decline. Research on cluster reliability and scalability has thus become a focus of this field; Microsoft's Wolfpack, Oracle's Failover and NCR's LifeKeeper are typical representatives of reliable cluster computing. Yet as the scale of grid and data grid resources keeps expanding and their services keep diversifying, these traditional reliability techniques cannot meet the management requirements of heterogeneous, multi-trust-level grid resources, so research on new large-scale parallel models and algorithms based on grids is increasingly urgent.
Summary of the invention
In view of the defects of the above prior art, the technical problem to be solved by the present invention is to provide a cheap, large-scale, reliable and stable large-scale data parallel computing system and method under a grid environment that can use existing computational and network resources and offers good fault tolerance, a good speedup characteristic and a high dynamic load-balancing capability.
In order to solve the above technical problem, the large-scale data parallel computing system under a grid environment provided by the present invention comprises:
a grid monitoring system DGSS (DATA GRID SUPERVISE SYSTEM), a grid management system that uses a multi-Agent cooperation mechanism to perform effective dynamic state monitoring of the DG;
and a computing system DGCS (DATA GRID COMPUTING SYSTEM) constituted by a cluster logic ring, wherein:
the cluster logic ring is formed by connecting the numbered computer clusters in numeric order, the logical successor of the highest-numbered cluster being the cluster numbered 1; the numbers are assigned to all clusters in descending order of the sum of the comprehensive computing capabilities of all computing nodes of each cluster; except between cluster 1 and the highest-numbered cluster, the closer two clusters are on the cluster logic ring, the closer their comprehensive computing capabilities are;
each computer cluster on the cluster logic ring is organized as a computing node logic ring, which is formed by connecting the numbered computing nodes in numeric order, the logical successor of the highest-numbered computing node being the node numbered 1; the numbers are assigned to all computing nodes in descending order of their comprehensive computing capabilities, each node's comprehensive computing capability being calculated from the weight vector W; except between node 1 and the highest-numbered node, the closer two nodes are on the computing node logic ring, the closer their comprehensive computing capabilities are;
the cluster logic ring and the computing node logic rings together constitute the dynamic data grid computing system DG; the grid monitoring system is connected to every cluster and every computing node of the dynamic grid computing system through an ordinary LAN or Intranet.
In order to solve the above technical problem, the present invention also provides the m-redundancy allocation strategy of the large-scale data parallel computing system under a grid environment, whose steps are as follows:
on a logic ring (a cluster ring or a computing node ring), let a computing unit (a computing node or a cluster) have logical number k and computing capability CP_k;
allocate a task amount of CP_k * M to this computing unit (* denotes multiplication);
then distribute the task amount CP_k * M evenly over the m computing units with logical numbers k+1, k+2, ..., k+m; in this way the task CP_k * M is dispatched by DG to be executed twice;
in the present invention this redundancy strategy is called the m-redundancy allocation strategy; in fact, the task CP_k * M is redundantly distributed only once.
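As an illustration only, the following minimal Python sketch distributes a unit's task amount CP_k * M to its m logical successors on a ring; the function and variable names are assumptions of this sketch, not part of the patent text.

```python
def m_redundancy_allocation(capabilities, M, m):
    """Return, for each unit k on the ring, its primary task amount CP_k * M
    and the redundant load it places on its m logical successors."""
    n = len(capabilities)                      # number of units on the logic ring
    primary = {k: capabilities[k] * M for k in range(n)}
    redundant = {k: 0.0 for k in range(n)}     # redundant load received by each unit
    for k in range(n):
        share = primary[k] / m                 # CP_k * M spread evenly over m successors
        for step in range(1, m + 1):
            redundant[(k + step) % n] += share # successors k+1 .. k+m (ring wrap-around)
    return primary, redundant

# Usage: 6 units, basic task unit M = 10, 2-redundancy
prim, red = m_redundancy_allocation([5, 4, 4, 3, 2, 1], M=10, m=2)
```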
In order to solve the above technical problem, the large-scale data parallel computing algorithm under a grid environment provided by the present invention uses the following key data structures of DGCS:
let the grid DG consist of c computer clusters, the number of computing nodes in each cluster changing dynamically; DPC is a data parallel calculation task on DG, |DPC| denotes its total task amount and W is its computational resource requirement weight vector; Q_TDT is the task distribution information queue of DPC; M is the basic task unit; DGSS is the grid monitoring system.
The steps of the large-scale data parallel algorithm are as follows:
1) Initialization:
a) decompose DPC according to the basic task unit M;
b) compute Finished = the termination condition of DPC;
c) broadcast the auxiliary data of DPC (such as the coefficient matrix) to all computing nodes of DG;
d) Count = 0; /* initialize the parallel round counter */
2) While (Finished does not hold) do
3) [the Master of DG distributes the DPC tasks] /* the loop executes the tasks */
a) obtain the resource state information of DG from DGSS;
b) construct the cluster logic ring of DG;
c) let all clusters construct their own computing node logic rings;
d) compute a dynamic two-dimensional address for every computing node;
e) obtain the overall computing capability CCP_i (0≤i≤c) of every cluster;
f) for every cluster CC_i (0≤i≤c) do:
{
compute the ratio CCP_i / ΣCCP_j (0≤j≤c);
compute the task amount of DPC distributed to cluster CC_i: T_i = (CCP_i / ΣCCP_j (0≤j≤c)) * |DPC| / M;
};
For i = 1 to c
g) transfer task T_i to cluster CC_i;
h) on the cluster ring, distribute T_i by the 1-redundancy allocation strategy onto CC_{i+1}, the logical successor of CC_i;
i) End for;
4) all clusters CC_i (0≤i≤c) perform steps 5) ~ 11) in parallel:
5) cluster CC_i computes the load of each of its computing nodes for subtask T_i according to their comprehensive computing capabilities CP_j (0≤j≤p, p being the number of computing nodes of CC_i), namely:
a) For j = 1 to p
b) T_ij = (CP_j / ΣCP_k (0≤k≤p)) * |T_i| / M;
c) send subtask T_ij to computing node C_j;
d) on the computing node ring, distribute T_ij by the m-redundancy allocation strategy onto the m computing nodes C_{j+1}, C_{j+2}, ..., C_{j+m} following C_j;
e) End for; /* data distribution on the computing nodes ends */
6) the Master of cluster CC_i constructs the local task distribution information queue Q_TDTi of this subtask and sends Q_TDTi to the Master of DG; the Master of DG constructs the global task distribution information queue Q_TDT;
7) the Master of CC_i starts all the computing nodes under its jurisdiction to finish this calculation task, repeating steps 8) 9) 10);
8) the Master of CC_i monitors the completion of the tasks of Q_TDTi of this cluster;
accepts from its successor cluster the completion status of the redundant computation of Q_TDTi;
forwards the intermediate results this cluster has computed redundantly for its predecessor to its predecessor cluster;
forwards the intermediate results of Q_TDTi of this cluster to the Master of DG;
9) if
((all results of this Q_TDTi have been obtained, by the cluster itself or by its successor)
or
(the finish command of the Master of DG has been received) /* the Master of DG has obtained all intermediate results through a redundant cluster */
)
then finish the computation of this subtask and go to step 11);
10) if the Master of CC_i learns from DGSS that some computing node has failed,
then { mark the node as failed;
compute the failed task amount according to the m-redundancy allocation strategy and put it into the failure task queue;
at the same time send a failure message to the Master of DG;
}
when a computing node has finished its own calculation task, it fetches the corresponding tasks from the failure task queue and continues executing, until the failure queue is empty;
11) accept the next task distribution from the Master of DG; /* this round of grid parallel computation ends */
Count++;
12) the Master of DG gathers the intermediate results of this round according to the global Q_TDT, revises the termination condition of the algorithm, and converts the intermediate results into the new global calculation task DPC;
13) End while;
14) output the calculation result and notify all computing nodes to finish this calculation.
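Purely for illustration, the following Python sketch shows how the Master of DG could split |DPC| over the clusters in proportion to their overall capabilities CCP_i, as in step 3) f); the function and variable names are assumptions of this sketch.

```python
def split_task_over_clusters(total_dpc, M, cluster_capabilities):
    """Compute T_i = (CCP_i / sum_j CCP_j) * |DPC| / M for every cluster,
    i.e. the number of basic task units assigned to each cluster."""
    total_ccp = sum(cluster_capabilities)
    return [ccp_i / total_ccp * total_dpc / M for ccp_i in cluster_capabilities]

# Usage: |DPC| = 5000 basic components, M = 1, three clusters
amounts = split_task_over_clusters(5000, 1, [30.0, 20.0, 10.0])  # -> [2500.0, 1666.66..., 833.33...]
```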
The cluster logic ring of step 3) b) is constructed as follows:
given a class of DPC on DG and its resource requirement weight vector W = (w1, w2, w3), for any computer cluster CC_i ∈ CCS of DG, the comprehensive computing capability of CC_i is the sum of the comprehensive computing capabilities of all its computing nodes, denoted CCP_i (0≤i≤c);
all clusters of DG are numbered in descending order of CCP_i (0≤i≤c);
a cluster logic ring is formed according to this numbering, the logical successor of the highest-numbered cluster being the cluster numbered 1.
The computing node logic ring of step 3) c) is constructed as follows:
given a class of DPC on DG and its resource requirement weight vector W = (w1, w2, w3), for any computer cluster CC_i ∈ CCS of DG with p computing nodes, the comprehensive computing capability of each computing node calculated from the weight vector W is CP_j (0≤j≤p);
within CC_i all computing nodes are numbered in descending order of CP_j (0≤j≤p);
a computing node logic ring is formed according to this numbering, the logical successor of the highest-numbered computing node being the node numbered 1.
The dynamic two-dimensional address computed in step 3) d) is defined as a two-tuple (r, o): from the constructed cluster logic ring and computing node logic rings every computing node on DG obtains a two-tuple (r, o) address, where r is the cluster ring number of the cluster the node belongs to and o is the logical number of the node within the computing node ring of cluster r; since the cluster ring and the computing node rings of DG change dynamically during parallel computation, (r, o) is called the dynamic two-dimensional address of the node.
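As a rough illustration (names assumed, not from the patent), the sketch below builds the cluster ring and the per-cluster node rings by sorting on comprehensive computing capability and then derives the dynamic two-dimensional address (r, o) of each node.

```python
def build_rings(clusters):
    """clusters: {cluster_name: {node_name: CP_j}}.
    Returns the cluster ring (ordered cluster names), the node ring of every
    cluster, and the dynamic two-dimensional address (r, o) of every node."""
    # Cluster ring: descending total capability; ring numbers start at 1.
    cluster_ring = sorted(clusters, key=lambda c: sum(clusters[c].values()), reverse=True)
    node_rings, addresses = {}, {}
    for r, cname in enumerate(cluster_ring, start=1):
        ring = sorted(clusters[cname], key=clusters[cname].get, reverse=True)
        node_rings[cname] = ring
        for o, node in enumerate(ring, start=1):
            addresses[node] = (r, o)           # dynamic two-dimensional address
    return cluster_ring, node_rings, addresses

# Usage with two small clusters
ring, nrings, addr = build_rings({"CC1": {"a": 3.0, "b": 2.0}, "CC2": {"c": 5.0, "d": 1.0}})
```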
The task distribution information queue of step 6): the TDTs of all basic task units form a task distribution information queue Q_TDT; DG has one global Q_TDT and each cluster has a local Q_TDTi.
The failure task queue of step 10): each computer cluster of DG constructs a queue storing the task information of failed local computing nodes; its format is the same as that of the task distribution information queue.
The large-scale data parallel computing system under a grid environment provided by the invention offers a feasible supporting system and implementation method for large-scale data parallel computing over the Internet. The present invention uses the idle resources of existing computer networks and computing nodes for large-scale parallel computation; the computational resources and software systems may be heterogeneous, and the network interconnection may use any technology. The parallel granularity is tuned to the actual conditions of the network by adjusting the size of the basic task block. By dynamically evaluating a capability function for every computational resource, dynamically constructing the cluster ring and the computing node rings, and then balancing the load according to the m-redundancy strategy, the data parallel algorithms supported by this system obtain good speedup, dynamic load balancing and effective fault tolerance.
The effective fault tolerance of the described data parallel algorithm based on the dynamic redundancy mechanism is shown in three steps. First, within a single cluster. Without loss of generality it suffices to prove the average fault tolerance of the algorithm; suppose every computing node of the cluster has the same comprehensive computing capability, so each node is assigned the same primary task amount T, and each of the m logical successors of a node redundantly stores T/m of that task. Let q be the failure probability of each computing node, and consider only the failure distribution among the m+1 neighbouring computing nodes and the resulting amount of data in the failure queue:
when 1 node fails, the failed data is 0;
when 2 nodes fail, the failed data is (T/m)q^2; when 3 nodes fail, the failed data is (T/m)q^3;
when k nodes fail, the failed data is (T/m)q^k;
……
so the average amount of failed data produced over the m failure cases is:
T_ave = ((T/m)(q^2 + 2q^3 + 3q^4 + … + (m-1)q^m)) / m
      = (T/m^2)(q^2 + 2q^3 + 3q^4 + … + (m-1)q^m)
      = T/(m^2(1-q)^2) - (T/m)(q^(m+1)/(1-q))        ………(1)
In formula (1), when q tends to 0, T_ave = T/m^2; when q tends to 0.5, T_ave ≈ 1.33(T/m^2).
If the computing node ring of a cluster can be divided into h segments of length m+1, the average failure queue length of the cluster is h·T_ave; hence the length of the fault-tolerant queue can be effectively controlled by suitably adjusting the length of the computing node logic ring. The redundancy scheme within a single cluster is therefore effective.
Second, since the 1-redundancy strategy is adopted between clusters, it is in fact a mirroring policy, and the mirroring redundancy scheme is effective.
Third, the value m of the m-redundancy strategy can be chosen from the calculation-to-communication ratio: the optimal amount of redundant information is the one at which the computing capability of a node just balances its network communication capability.
In summary, the described data parallel algorithm based on the dynamic redundancy mechanism provides an effective fault tolerance mechanism.
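For readers who want to experiment with formula (1), the following sketch (an illustration added in this description, not part of the algorithm) evaluates the series form of T_ave numerically for given T, m and q.

```python
def average_failed_data(T, m, q):
    """Series form of formula (1): T_ave = (T/m^2) * sum_{k=2..m} (k-1) * q^k."""
    return (T / m**2) * sum((k - 1) * q**k for k in range(2, m + 1))

# Usage: T = 100 basic units, 4-redundancy, node failure probability q = 0.1
print(average_failed_data(100.0, 4, 0.1))
```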
Proof that the described data parallel algorithm is dynamically load balanced:
in step 3) of the algorithm, at the start of every parallel round the cluster logic ring and the computing node logic rings of DG are constructed from the current computing capabilities of every cluster and every computing node.
Given the two logic rings and the description of step 3), the task load of each cluster is divided according to capability; likewise, by step 5) the load distribution within each cluster is also divided according to capability. Hence the load is balanced in every parallel round of the algorithm.
Moreover, because the two logic rings are reconstructed in real time at every parallel round, this load distribution is dynamic.
The described data parallel algorithm is therefore dynamically load balanced.
The large-scale data parallel computing system of the present invention, owing to the above advantages, is well suited to solving large-scale computing problems in grid environments built from ordinary computer clusters, and provides a valuable system implementation technique and method for performing high-performance computing with existing idle computational resources. The present invention provides a large-scale parallel computing system and method, oriented to data parallel computation and based on a dynamic redundancy mechanism, under a data grid environment composed of multiple computer clusters. Theoretical analysis and practice show that this system and method possess good dynamic load balancing, fault tolerance and speedup characteristics and can effectively support large-scale data parallel computing.
Description of drawings
Fig. 1 is a schematic diagram of the dynamic grid DG constituted by the two logic rings of the present invention;
Fig. 2 is a schematic diagram of the m-redundancy allocation strategy of the present invention;
Fig. 3 is a schematic diagram of the state transitions of a computing node under the multi-Agent model operating mechanism.
Embodiment
The embodiments of the invention are described in further detail below with reference to the drawings; the embodiments do not limit the present invention, and every analogous structure, method and similar variation adopting the present invention falls within its scope of protection.
In order to construct a grid environment that supports large-scale data parallel computing, ordinary computing networks and computer resources are used to form a multi-trust-level computing system. To describe the implementation of the DGCS system effectively, this specification makes the following definitions:
Definition 1, computer cluster: a computer cluster (Computer Cluster) is a two-tuple CC(Master, CS), where Master is the master controller of CC and CS = {C_1, C_2, ..., C_p} is the set of all computing nodes of CC.
Definition 2, data grid: a data grid (Data Grid) is a four-tuple DG(Master, CCS, N, R), where Master is the master controller of DG; CCS = {CC_1, CC_2, ..., CC_c} is the set of computer clusters; N = {N_1, N_2, ..., N_n} is the set of connecting networks, a connecting network being a high-speed switched network; and R is the connection rule. Every computing node of a DG has its own processor and external storage.
Definition 3, data parallel computation on DG: the process of data parallel computation on DG(Master, CCS, N, R) is as follows:
(1) scale the calculation task and decompose it into subtasks;
(2) start the computing nodes of all clusters in CCS;
(3) compute Finish = the termination condition of the task;
(4) i = 1;
(5) While (Finish does not hold) do
(6) decompose the global data Data into D_1, D_2, ..., D_p;
(7) send D_k to C_k (1≤k≤p) in parallel;
(8) drive all C_k (1≤k≤p) to solve subtask i simultaneously;
(9) synchronize the solving of subtask i by all C_k (1≤k≤p), and exchange local data to form the new global data New_Data;
(10) Data = New_Data;
(11) i = i + 1;
(12) End while;
(13) synthesize the calculation result;
(14) notify all C_k (1≤k≤p) to finish computing;
(15) finish this calculation.
Definition 4, resource requirement weight vector: every class of data parallel computation DPC (Data Parallel Computing) on DG has different demands on the computing (CPU) performance, storage (RAM) capacity and I/O (disk) speed of a computing node; by analysing each class of DPC, demand weights can be given for these three kinds of resources and expressed as a vector W = (w1, w2, w3), called the resource requirement weight vector of that class of DPC.
Definition 5, calculation-to-communication ratio: given a class of DPC on DG and its resource requirement weight vector W = (w1, w2, w3), for any computer cluster CC_i ∈ CCS of DG with p computing nodes, compute the comprehensive computing capability CP_j (0≤j≤p) of each computing node from the weight vector W; if the network bandwidth of CC_i is B, then the calculation-to-communication ratio of cluster CC_i for this DPC is defined as R = ΣCP_j (0≤j≤p) / (pB), i.e. the ratio of the average comprehensive processing capability of the computing nodes of cluster CC_i to the network bandwidth of the cluster.
Since the communication bandwidth of ordinary networks is currently of the order of 100 M or 1000 M while the comprehensive processing capability of computing nodes keeps improving, the value of R can exceed 1; the unit of computing-node capability can be set according to the demands of the DPC, for example every hundred megahertz of processor frequency as one CPU unit, every megabyte of memory as one storage unit, and the I/O speed per millisecond as the speed unit of the hard disk.
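To make the definition concrete, here is a small illustrative sketch (all numbers and names are hypothetical, and the form of the capability function is an assumption, since the patent leaves it open) that computes node capabilities from W and then the cluster's calculation-to-communication ratio R.

```python
def comprehensive_capability(cpu, ram, io, W):
    """CP_j as a W-weighted combination of CPU, memory and disk I/O capability."""
    w1, w2, w3 = W
    return w1 * cpu + w2 * ram + w3 * io

def calc_comm_ratio(node_caps, bandwidth):
    """R = sum_j CP_j / (p * B) for a cluster with p nodes and bandwidth B."""
    return sum(node_caps) / (len(node_caps) * bandwidth)

# Usage: three nodes, W favouring CPU, bandwidth expressed as 1.0 unit
W = (0.6, 0.3, 0.1)
caps = [comprehensive_capability(28, 2.56, 5.4, W),   # e.g. 2.8 GHz, 256 MB, 5400 rpm
        comprehensive_capability(24, 2.56, 7.2, W),
        comprehensive_capability(20, 2.56, 5.4, W)]
print(calc_comm_ratio(caps, bandwidth=1.0))
```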
Within the grid parallel architecture:
Definition 6, computing node logic ring: given a class of DPC on DG and its resource requirement weight vector W = (w1, w2, w3), for any computer cluster CC_i ∈ CCS of DG with p computing nodes, the comprehensive computing capability of each computing node calculated from the weight vector W is CP_j (0≤j≤p); within CC_i all computing nodes are numbered in descending order of CP_j (0≤j≤p) and connected in numeric order into a logic ring, the logical successor of the highest-numbered node being the node numbered 1; this ring is called the computing node logic ring.
From this definition it follows that, except between node 1 and the highest-numbered node, the closer two computing nodes are on the computing node logic ring, the closer their comprehensive computing capabilities are.
Definition 7, cluster logic ring: given a class of DPC on DG and its resource requirement weight vector W = (w1, w2, w3), for any computer cluster CC_i ∈ CCS of DG, the comprehensive computing capability of CC_i is the sum of the comprehensive computing capabilities of all its computing nodes, denoted CCP_i (0≤i≤c); all clusters of DG are numbered in descending order of CCP_i and connected into a logic ring, the logical successor of the highest-numbered cluster being the cluster numbered 1; this ring is called the cluster logic ring.
Likewise, except between cluster 1 and the highest-numbered cluster, the closer two clusters are on the cluster logic ring, the closer their comprehensive computing capabilities are.
As shown in Fig. 1, the cluster logic ring 1 and the computing node logic rings 2 form the dynamic DG as a double ring.
Definition 8, dynamic two-dimensional address of a computing node: from the two logic rings constructed according to Definitions 6 and 7, every computing node on DG obtains a two-tuple (r, o) address, where r is the cluster ring number of the cluster the node belongs to and o is the logical number of the node within the computing node ring of cluster r; since the cluster ring and the computing node rings of DG change dynamically during parallel computation, (r, o) is called the dynamic two-dimensional address of the node.
Definition 9, basic task unit: given a class of DPC on DG and its resource requirement weight vector W = (w1, w2, w3), the DPC is decomposed on DG, according to its characteristics, into several task blocks of size M; this specification calls M the basic task unit of the DPC.
Definition 10, m-redundancy allocation strategy: on a logic ring (a cluster ring or a computing node ring), let a computing unit (a computing node or a cluster) have logical number k and computing capability CP_k; after allocating the task amount CP_k * M (* denotes multiplication) to this unit, distribute CP_k * M evenly again over the m computing units with logical numbers k+1, k+2, ..., k+m; in this way the task CP_k * M is dispatched by DG to be executed twice; this redundancy strategy is called the m-redundancy allocation strategy, as shown in Fig. 2; in fact, on a computing node ring 3 the task CP_k * M is redundantly distributed only once.
Definition 11, task distribution information table: each basic task unit on DG is given a task distribution information table TDT(Mlink, Slink), where Mlink is the dynamic two-dimensional address list of the computing node to which the task unit is first distributed and Slink is the dynamic two-dimensional address list of the computing nodes to which it is redundantly distributed;
task distribution information queue: the TDTs of all basic task units form a task distribution information queue Q_TDT; DG has one global Q_TDT and each cluster has a local Q_TDTi;
failure task queue: each computer cluster of DG constructs a queue that stores the task information of failed local computing nodes; its format is the same as that of the task distribution information queue.
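A minimal sketch of these data structures is given below for illustration; field and class names are assumptions of this description.

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class TDT:
    """Task distribution information table of one basic task unit."""
    mlink: tuple                                  # (r, o) address of the primary computing node
    slink: list = field(default_factory=list)     # (r, o) addresses of redundant nodes
    done: bool = False

# Global and per-cluster task distribution information queues, plus a failure queue
q_tdt = deque()                       # global Q_TDT held by the Master of DG
q_tdt_local = {"CC1": deque()}        # local Q_TDTi of each cluster
failure_queue = deque()               # tasks of failed nodes, same TDT format

q_tdt_local["CC1"].append(TDT(mlink=(1, 1), slink=[(1, 2), (1, 3)]))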
Grid monitoring system DGSS: to support the effective operation of the data parallel large-scale parallel algorithm on a DG of the above structure, a grid management system that monitors DG dynamically and in real time is required. Using a multi-Agent cooperation mechanism to study grid resource discovery, monitoring and dynamic scheduling guarantees the effective operation of the DG system; a description of this system is given at the end of this specification. In this specification the system is named DGSS (DG Supervise System).
In the large-scale data parallel computing system under a grid environment provided by the embodiment of the invention, let the grid DG consist of c computer clusters, the number of computing nodes in each cluster changing dynamically; DPC is a data parallel calculation task on DG, |DPC| denotes its total task amount and W is its computational resource requirement weight vector; Q_TDT is the task distribution information queue of DPC; M is the basic task unit; DGSS is the grid monitoring system.
The large-scale data parallel algorithm is described as follows:
1) Initialization:
a) decompose DPC according to the basic task unit M;
b) compute Finished = the termination condition of DPC;
c) broadcast the auxiliary data of DPC (such as the coefficient matrix) to all computing nodes of DG;
d) Count = 0; /* initialize the parallel round counter */
2) While (Finished does not hold) do
3) [the Master of DG distributes the DPC tasks] /* the loop executes the tasks */
a) obtain the resource state information of DG from DGSS;
b) construct the cluster logic ring of DG;
c) let all clusters construct their own computing node logic rings;
d) compute a dynamic two-dimensional address for every computing node;
e) obtain the overall computing capability CCP_i (0≤i≤c) of every cluster;
f) for every cluster CC_i (0≤i≤c) do:
{
compute the ratio CCP_i / ΣCCP_j (0≤j≤c);
compute the task amount of DPC distributed to cluster CC_i: T_i = (CCP_i / ΣCCP_j (0≤j≤c)) * |DPC| / M;
};
For i = 1 to c
g) transfer task T_i to cluster CC_i;
h) on the cluster ring, distribute T_i by the 1-redundancy allocation strategy onto CC_{i+1}, the logical successor of CC_i;
i) End for;
4) all clusters CC_i (0≤i≤c) perform steps 5) ~ 11) in parallel:
5) cluster CC_i computes the load of each of its computing nodes for subtask T_i according to their comprehensive computing capabilities CP_j (0≤j≤p, p being the number of computing nodes of CC_i), namely:
a) For j = 1 to p
b) T_ij = (CP_j / ΣCP_k (0≤k≤p)) * |T_i| / M;
c) send subtask T_ij to computing node C_j;
d) on the computing node ring, distribute T_ij by the m-redundancy allocation strategy onto the m computing nodes C_{j+1}, C_{j+2}, ..., C_{j+m} following C_j;
e) End for; /* data distribution on the computing nodes ends */
6) the Master of cluster CC_i constructs the local task distribution information queue Q_TDTi of this subtask and sends Q_TDTi to the Master of DG; the Master of DG constructs the global task distribution information queue Q_TDT;
7) the Master of CC_i starts all the computing nodes under its jurisdiction to finish this calculation task, repeating steps 8) 9) 10);
8) the Master of CC_i monitors the completion of the tasks of Q_TDTi of this cluster;
accepts from its successor cluster the completion status of the redundant computation of Q_TDTi;
forwards the intermediate results this cluster has computed redundantly for its predecessor to its predecessor cluster;
forwards the intermediate results of Q_TDTi of this cluster to the Master of DG;
9) if
((all results of this Q_TDTi have been obtained, by the cluster itself or by its successor)
or
(the finish command of the Master of DG has been received) /* the Master of DG has obtained all intermediate results through a redundant cluster */
)
then finish the computation of this subtask and go to step 11);
10) if the Master of CC_i learns from DGSS that some computing node has failed,
then { mark the node as failed;
compute the failed task amount according to the m-redundancy allocation strategy and put it into the failure task queue;
at the same time send a failure message to the Master of DG;
}
when a computing node has finished its own calculation task, it fetches the corresponding tasks from the failure task queue and continues executing, until the failure queue is empty;
11) accept the next task distribution from the Master of DG; /* this round of grid parallel computation ends */
Count++;
12) the Master of DG gathers the intermediate results of this round according to the global Q_TDT, revises the termination condition of the algorithm, and converts the intermediate results into the new global calculation task DPC;
13) End while;
14) output the calculation result and notify all computing nodes to finish this calculation.
Embodiment of the invention: Web Service technology is an effective development environment for heterogeneous environments. The embodiment of the invention was developed using Microsoft .Net and Sun's SunONE technology as the development environment, and effective analysis and testing were carried out.
To verify the validity of the algorithm, the embodiment of the invention decomposes the general iterative solution of a system of linear equations [13] and tests it with respect to speedup and fault tolerance.
The grid DG consists of 8 clusters, each cluster has 6 computing nodes, and the network is a cascade of 8 100M switches.
The embodiment of the invention divides the computing nodes into 3 classes; the configuration of each class is shown in Table 1, and each cluster contains 2 nodes of every class.
Table 1. Configuration of each class of computing node
Node class | Processor | Memory | Hard disk | Network card
CT1 | 2.8 GHz | 256 MB | 5400 rpm | 100 M
CT2 | 2.4 GHz | 256 MB | 7200 rpm | 100 M
CT3 | 2.0 GHz | 256 MB | 5400 rpm | 100 M
The embodiment of the invention solves a system of equations with a 5000 × 5000 coefficient matrix, taking 1000 iterations as one complete calculation. The DPC is the general iterative method, the basic task unit M is one component of the solution vector, and m = 2 is used in the m-redundancy strategy. The test was run with 1 to 8 clusters; the results are shown in Table 2, from which it can be seen that the speedup of the algorithm is nearly linear.
Table 2. Speedup of the algorithm
Number of clusters | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
Response time (s) | 6543 | 3331 | 2577 | 1755 | 1439 | 1148 | 1023 | 917
To test the fault tolerance of the model, the embodiment of the invention repeated the above test 100 times in each of 4 different time periods, under different load conditions of the computational resources, and measured the complete failure rate of the algorithm (the fraction of runs that could not finish normally). The results show that even during the busy morning and afternoon peaks of DG the complete failure rate of the algorithm remains very low, indicating that the fault tolerance of the algorithm is effective. The test results are shown in Table 3.
Table 3. Complete failure rate in different time periods
Time period | 6:00-8:30 | 9:00-11:30 | 1:00-2:30 | 20:00-22:30
Complete failure rate (%) | 3 | 6 | 7 | 1
Because the communication volume of each parallel stage of the iterative method is not large and the data volume is stable, the algorithm shows a good speedup. Testing of the parallel JOIN algorithm [12], whose per-stage communication volume is larger and unstable, has also been carried out and is described in another paper. In the implementation of large-scale parallel algorithms for grid-based data parallel computing, dynamic load balancing and an effective fault tolerance mechanism are essential; the dynamic redundancy strategy, the dynamic resource application strategy of the grid and the logic-ring-based load balancing strategy proposed here solve these problems effectively. This specification also finds in practice that obtaining stable grid computing resources further requires effective environment management mechanisms (such as multi-Master techniques and machine-room management policies) and hardware fault tolerance strategies.
To explain the large-scale data parallel computing system and method under a grid environment of the present invention effectively, this specification describes the grid monitoring system DGSS (DATA GRID SUPERVISE SYSTEM), which uses a multi-Agent cooperation mechanism, as follows:
One, basic definitions:
Definition 1, computer cluster: a computer cluster (Computer Cluster) is a two-tuple CC(Master, CS), where Master is the master controller of CC and CS = {C_1, C_2, ..., C_p} is the set of all computing nodes of CC.
Definition 2, data grid: a data grid (Data Grid) is a four-tuple DG(Master, CCS, N, R), where Master is the master controller of DG; CCS = {CC_1, CC_2, ..., CC_c} is the set of computer clusters; N = {N_1, N_2, ..., N_n} is the set of connecting networks, a connecting network being a high-speed switched network; and R is the connection rule. Every computing node of a DG has its own processor and external storage.
Definition 3, data parallel computation on DG: the process of data parallel computation on DG(Master, CCS, N, R) is as follows:
(1) scale the calculation task and decompose it into subtasks;
(2) start the computing nodes of all clusters in CCS;
(3) n = the number of subtasks;
(4) i = 1;
(5) decompose the data Data into D_1, D_2, ..., D_p;
(6) send D_k to C_k (1≤k≤p);
(7) While i < n do
(8) drive all C_k (1≤k≤p) to solve subtask i simultaneously;
(9) synchronize the solving of subtask i by all C_k (1≤k≤p);
(10) i = i + 1;
(11) End while;
(12) collect the calculation results of all C_k (1≤k≤p) and synthesize them;
(13) notify all C_k (1≤k≤p) to finish computing;
(14) finish this calculation.
Definition 4, multi-Agent cooperation model: the multi-Agent cooperation model MS on DG can be formally defined as a four-tuple MS = (Agents, Tm, Sm, Space), where Agents is the set of all cooperating Agents, Tm is the communication mechanism between Agents, Sm is the service mechanism between Agents, and Space is the space in which all Agents exist.
Definition 5, Agent: an Agent can be described by a tuple Agent = (A_id, A_type, A_area, A_desc, A_BDI, A_prg), where A_id is the unique identifier of the Agent, A_type is its type, A_area is its scope of activity on the grid, A_desc is its description vector, A_BDI is its rule base of beliefs, desires and intentions, and A_prg is its executable code.
According to the demands of grid-based data parallel computing, this specification classifies the Agents in the grid as follows:
Definition 6, resource management agent A_Rm: the resource management agent A_Rm on DG(Master, CCS, N, R) resides on every computing node of CCS and manages the dynamic changes of its resources; its belief is that its computing node is the most capable; its desire is to exploit the computing, storage and communication capability of its node and to raise the importance of its node in the grid as far as possible; its intention is to cooperate and compete with the other Agents in the grid according to the resource situation and state of its node, so as to reach the optimal working state of the computing node under its jurisdiction.
Definition 7, reliability management agent A_a: the reliability management agent A_a on DG resides on every computing node of CCS and monitors the stability of its resources, mainly detecting the working state of the CPU resources, memory resources, network resources and various services of its computing node and revising the reliability parameters of the node according to these states; its belief is that various faults may occur on its computing node; its desire is to discover the computing, storage and communication faults of its node and, contrary to A_Rm, to lower the importance of its node in the grid where appropriate; its intention is to cooperate and compete with the other Agents in the grid according to the resource state information of its node, so as to reach the optimal working state of the computing node under its jurisdiction.
Definition 8, cluster management agent A_Cc: the cluster management agent A_Cc on DG resides on the Master of each cluster in CCS and performs integrated management of the capabilities of all the computing nodes under its jurisdiction, ranking and managing them according to their computing, storage, communication and service capabilities, and is also responsible for coordinating and cooperating with the A_Cc of the other clusters; its belief is that the cluster under its jurisdiction is the most capable; its desire is to win as many computing, storage, communication and service resources as possible for its cluster and to raise its importance in the grid; its intention is to cooperate and compete with the A_Cc of the other clusters according to the resource state information of its cluster, so as to reach the optimal working state of the cluster under its jurisdiction.
Definition 9, user agent A_User: the user agent A_User on DG resides on the Master of DG and acts on behalf of the service requests of users; according to the computing, storage, communication and service requirements of a request it is responsible for coordinating and cooperating with A_Grid.
Definition 10, grid management agent A_Grid: the grid management agent A_Grid on DG resides on the Master of DG and performs integrated management of the capabilities of all the clusters under its jurisdiction, ranking and managing them according to their computing, storage, communication and service capabilities, and is also responsible for coordinating and cooperating with the A_Cc of the clusters; its belief is that the grid under its jurisdiction is the most capable; its desire is to discover as many computing, storage, communication and service resources of the grid as possible and to improve the throughput and efficiency of the grid; its intention is to cooperate with the service agents according to the resource state information of the grid, so as to reach the optimal working state of the grid under its jurisdiction.
Definition 11, grid service agent A_Service: a grid service agent A_Service resides on a computing node of DG and carries a specific computing function used to complete parallel computation, such as the subtask solving of step (8) in Definition 3; different grid service agents mainly differ in the computational problem they solve in parallel; they are the essential part of the external parallel service of the grid and usually form a set of agents; this specification denotes the set of grid service agents on DG by SAS.
Every grid service agent has its own BDI; in general, the belief of an A_Service is that it can offer the best service, its desire is to provide its own service as much as possible, and its intention is to seek the resources on its grid that are most useful to itself, migrate to the best resource node and provide optimized service.
Two, operating mechanism of the multi-Agent model
Because of the way DG is constituted, as the numbers of computer clusters, networks and computing nodes grow, the reliability of the grid becomes extremely important, so an effective fault tolerance mechanism is indispensable. Checkpoint-based fault tolerance plays a key role in traditional system fault tolerance, but it can produce a domino effect and is not suitable for grid-based large-scale computing. This specification solves the problem with a dynamic redundancy mechanism and a multi-Agent cooperation mechanism; fault tolerance is also the main concern of the cooperation model of this specification.
Redundancy-based fault tolerance means that an important grid service subtask is completed jointly by several grid service agents with the same function distributed on different computing nodes; one of them is the primary executor and the others are backup executors; when the primary executor fails, a backup executor takes over, which avoids the degradation of grid performance caused by a single point of failure.
1) States of a grid service agent
A grid service agent on DG has three states: master, backup and dormant. During service, when a grid service agent is the primary executor, this specification calls it a master-state agent; when it is a backup executor, it is called a backup-state agent; if a service agent on a computing node never takes part in the service, it is called a dormant agent.
2) States of a grid computing node
A computing node C_i on DG has four states: master, slave, backup and failed. In a given period, if all non-dormant service agents on C_i are in the master state, the node is called a master-state node; if the non-dormant service agents on C_i include both master-state and backup-state agents, the node is called a slave-state node; if all service agents on C_i are in the backup state, the node is called a backup-state node; if all service agents on C_i are dormant, the node is called a failed node.
3) Grid node queues
For the four states of the computing nodes of the grid, this specification constructs four node queues on the grid DG: the master-state queue Q_Master, the slave-state queue Q_Slave, the backup-state queue Q_Bak and the failure queue Q_Failure; their lengths are denoted LQ_Master, LQ_Slave, LQ_Bak and LQ_Failure respectively.
According to the BDI of the Agents described in Definitions 6 to 11, through cooperation and competition every Agent tries to make the computing node or cluster it serves play as important a role as possible, i.e. to be in the master-state queue of the grid whenever possible; following this driving mechanism, this specification ranks the four queues by priority as follows:
master-state queue > slave-state queue > backup-state queue > failure queue
According to this principle, this specification gives the state transition mechanism of computing nodes; the state transition model of a grid computing node is shown in Fig. 3.
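Just as an illustration of the four node states and the priority ordering of the queues (names invented for this sketch), a compact Python model might look as follows:

```python
from enum import IntEnum
from collections import deque

class NodeState(IntEnum):
    """Node states ordered by queue priority: MASTER > SLAVE > BACKUP > FAILED."""
    MASTER = 3
    SLAVE = 2
    BACKUP = 1
    FAILED = 0

queues = {s: deque() for s in NodeState}   # Q_Master, Q_Slave, Q_Bak, Q_Failure

def move_node(node, old, new):
    """Transfer a node between state queues when an agent applies for a transition."""
    if old is not None and node in queues[old]:
        queues[old].remove(node)
    queues[new].append(node)

move_node("C1", None, NodeState.MASTER)
move_node("C1", NodeState.MASTER, NodeState.SLAVE)   # e.g. demoted by rule 2 or 3
```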
4) Multi-Agent cooperation rules
a) Rules of the resource management agent A_Rm
The performance of a grid computing node C_i is generally measured by the following four quantities:
1. the currently available CPU computing capability P_cpu of C_i;
2. the currently available memory capability P_mem of C_i;
3. the currently available network communication capability P_net of C_i;
4. the currently available disk I/O capability P_I/O of C_i.
The resource management agent A_Rm dynamically monitors the changes of the four parameters P_cpu, P_mem, P_net and P_I/O of computing node C_i and uses them to form the capability function of the node at the current moment:
P_node = f(P_cpu, P_mem, P_net, P_I/O)    …………(1)
If the capability function value of the node at the previous moment is PL_node, the resource management agent A_Rm uses formula (1) to compute the capability function value PC_node of the current moment.
Rule 1, rule of the resource management agent A_Rm: if PC_node > PL_node, then A_Rm applies to A_Cc for a transition of the computing node to a higher-level state.
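The patent leaves the form of f open; the sketch below (a hypothetical weighted sum, with invented names) shows how A_Rm could evaluate P_node and apply Rule 1.

```python
def p_node(p_cpu, p_mem, p_net, p_io, weights=(0.4, 0.2, 0.2, 0.2)):
    """One possible capability function f(P_cpu, P_mem, P_net, P_I/O): a weighted sum."""
    return sum(w * p for w, p in zip(weights, (p_cpu, p_mem, p_net, p_io)))

def rule_1(pl_node, pc_node):
    """Rule 1: if the capability grew, A_Rm applies to A_Cc for promotion."""
    return "apply_for_promotion" if pc_node > pl_node else "no_action"

previous = p_node(0.5, 0.6, 0.9, 0.7)
current = p_node(0.8, 0.6, 0.9, 0.7)       # CPU load dropped, so capability rose
print(rule_1(previous, current))           # -> "apply_for_promotion"
```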
b) Rules of the reliability management agent A_a
The reliability of a grid computing node C_i is generally measured by the following two quantities:
1. the normal-state flag P_mark of the node: its value is 1 when the node is working normally and 0 when the node has failed;
2. the ratio of grid service agents the node completes successfully: P_succ = (number of services the node completes successfully) / (number of services the node accepts). Let l1, l2, l3 be three decimals with 0 < l1 < l2 < l3 < 1; then
Rule 2, rules of the reliability management agent A_a:
if P_mark of computing node C_i is false, then A_Rm notifies A_Cc that C_i has entered the failed state;
if P_succ ∈ (0, l1), then C_i enters the failed state;
if P_succ ∈ (l1, l2), then C_i enters the backup state;
if P_succ ∈ (l2, l3), then C_i enters the slave state;
if P_succ ∈ (l3, 1), then C_i enters the master state.
c) Rules of the cluster management agent A_Cc
The performance of a cluster CC_i is measured as follows:
P_cci = (number of computing nodes of CC_i that are in the master state) / LQ_Master.
P_cci reflects the share of cluster CC_i among the master-state computing nodes of DG; the larger P_cci is, the more important CC_i is.
For the whole DG, P_cc1 + P_cc2 + … + P_ccm = 1.
The A_Cc of CC_i receives the P_node, P_mark and P_succ values periodically sent by the A_Rm and A_a on all computing nodes of cluster CC_i,
and at the same time uses a heartbeat detection technique to detect the liveness of all nodes.
The capability function of cluster CC_i at the current moment is:
PC_cc = g(ΣP_node, ΣP_mark, ΣP_succ)    …………(2)
where Σ denotes aggregation over all computing nodes of cluster CC_i; PC_cc represents the number of computing nodes of the cluster in the master-state queue Q_Master of DG.
If the capability function value of cluster CC_i at the previous moment is PL_cc, the cluster management agent A_Cc uses formula (2) to compute the capability function value PC_cc of the current moment.
Rule 3, rules of the cluster management agent A_Cc:
if P_mark of a computing node C_i of CC_i is false, then C_i is put into the failed state;
if PC_cc > PL_cc or PC_cc < PL_cc, then A_Cc applies to A_Grid for a change of computing node states.
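For illustration only (the patent leaves g open; the aggregation below is an assumption with invented names), a sketch of how A_Cc could evaluate PC_cc and apply Rule 3:

```python
def pc_cc(node_reports):
    """One possible cluster capability function g: count the nodes that look healthy
    enough to be master-state, from the (P_node, P_mark, P_succ) reports of the nodes."""
    return sum(1 for p_node, p_mark, p_succ in node_reports
               if p_mark == 1 and p_succ > 0.9 and p_node > 0.5)

def rule_3(pl_cc, reports):
    """Rule 3: any change of the cluster capability triggers an application to A_Grid."""
    current = pc_cc(reports)
    return ("apply_for_state_change", current) if current != pl_cc else ("no_action", current)

reports = [(0.8, 1, 0.95), (0.6, 1, 0.97), (0.4, 0, 0.50)]   # one node has failed
print(rule_3(pl_cc=3, reports=reports))                       # -> ("apply_for_state_change", 2)
```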
d) Rules of the grid management agent A_Grid
A_Grid receives the P_node, P_mark, P_succ and PC_cc values periodically sent by the A_Cc of all clusters of DG,
and allocates LQ_Master according to the following rule.
Rule 4, allocation of the master-state queue by the grid management agent A_Grid (see the sketch after this rule):
For i = 1 to c
sort all master-state computing nodes of cluster CC_i by the resource P_node they provide, from large to small, into a temporary queue Q_cci;
allocate a counter C_count[i] = 0 for each cluster;
End for;
P_count = LQ_Master;
For i = 1 to c   /* c is the number of clusters of the grid DG */
get PC_cc of cluster CC_i;
If C_count[i] < PC_cc then
take the computing node with the largest P_node from the temporary queue Q_cci of cluster CC_i, add it to the master-state queue Q_Master, and delete it from Q_cci;
C_count[i]++;   /* the master-state node counter of cluster CC_i increases by 1 */
P_count--;      /* one allocated master-state node slot fewer */
End if;
End for;
For i = 1 to c
add the computing nodes remaining in the temporary queue Q_cci of cluster CC_i to the slave-state queue of the grid;
End for;
A_Grid broadcasts the new master-state, slave-state, backup-state and failure information to all clusters and computing nodes.
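The following Python sketch is one interpretation of Rule 4 under the round-robin reading described after it (all names are invented): master-state slots are allocated to clusters in proportion to PC_cc, best nodes first, and the leftovers go to the slave-state queue.

```python
def allocate_master_queue(cluster_master_candidates, pc_cc, lq_master):
    """cluster_master_candidates: {cluster: [(node, P_node), ...]};
    pc_cc: {cluster: PC_cc}. Fill Q_Master round-robin, each cluster contributing
    at most PC_cc nodes, best P_node first; the rest go to the slave-state queue."""
    temp = {c: sorted(nodes, key=lambda x: x[1], reverse=True)
            for c, nodes in cluster_master_candidates.items()}       # Q_cci
    count = {c: 0 for c in temp}                                     # C_count[i]
    q_master, remaining = [], lq_master                              # P_count
    progress = True
    while remaining > 0 and progress:                                # round-robin passes
        progress = False
        for c in temp:
            if remaining > 0 and count[c] < pc_cc[c] and temp[c]:
                q_master.append(temp[c].pop(0)[0])
                count[c] += 1
                remaining -= 1
                progress = True
    q_slave = [n for c in temp for n, _ in temp[c]]                  # leftovers
    return q_master, q_slave

qm, qs = allocate_master_queue({"CC1": [("a", 0.9), ("b", 0.7)], "CC2": [("c", 0.8)]},
                               pc_cc={"CC1": 2, "CC2": 1}, lq_master=2)
```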
In Fig. 3, items 5, 6 and 7 denote respectively the rule of the resource management agent A_Rm, the rule of the reliability management agent A_a and the rule of the cluster management agent A_Cc among the multi-Agent cooperation rules.
This allocation rule distributes the master-state slots in a round-robin manner according to the capability of each cluster at the current moment, so that the number of master-state nodes of each cluster matches its computing power; in this way the actual computing efficiency of every computing node can be exploited while the load remains evenly distributed over the grid, which helps the scalability of the grid.

Claims (8)

1. A large-scale data parallel computing main system under a grid environment, comprising:
a grid monitoring system DGSS, namely a grid management system that uses a multi-agent cooperation mechanism to implement effective dynamic state monitoring of DG;
characterized in that it further comprises a computing system DGCS constituted by a cluster logic ring, wherein:
the cluster logic ring is formed by logically connecting the numbered computer clusters in numbering order, the logical successor of the cluster with the largest number being the cluster numbered 1; the numbering is assigned to all clusters in descending order of the sum of the comprehensive computing capabilities of all computing nodes in each computer cluster;
each computer cluster on the cluster logic ring is composed of a computing node logic ring; the computing node logic ring is formed by logically connecting the numbered computing nodes in numbering order, the logical successor of the computing node with the largest number being the computing node numbered 1; the numbering is assigned to all computing nodes in descending order of the comprehensive computing capability of each computing node, the comprehensive computing capability of each computing node being calculated according to the weight vector W;
the cluster logic ring and the computing node logic rings together constitute the dynamic data computing system; and the grid monitoring system is connected through an ordinary LAN or Intranet to, and monitors, every computer cluster and computing node of the dynamic grid computing system.
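As a rough illustration of the two-level structure of claim 1, the sketch below models clusters and nodes numbered in descending order of comprehensive computing capability, with the successor of the highest-numbered element wrapping back to element 1. All class and function names are hypothetical, and the capability values are assumed to be already computed from the weight vector W.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Node:
        name: str
        capability: float      # comprehensive computing capability derived from W
        number: int = 0        # position on its cluster's node ring (1-based)

    @dataclass
    class Cluster:
        name: str
        nodes: List[Node] = field(default_factory=list)
        number: int = 0        # position on the cluster ring (1-based)

        @property
        def capability(self) -> float:
            # cluster capability = sum of its nodes' comprehensive capabilities
            return sum(n.capability for n in self.nodes)

    def build_rings(clusters: List[Cluster]) -> List[Cluster]:
        # Number nodes and clusters in descending order of capability.
        for c in clusters:
            for i, n in enumerate(sorted(c.nodes, key=lambda x: x.capability, reverse=True), 1):
                n.number = i
        ring = sorted(clusters, key=lambda c: c.capability, reverse=True)
        for i, c in enumerate(ring, 1):
            c.number = i
        return ring

    def successor(ring_len: int, k: int) -> int:
        # Logical successor on a ring numbered 1..ring_len; the successor of the last element is 1.
        return k % ring_len + 1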
2. An m-redundancy allocation strategy for the large-scale data parallel computing main system under a grid environment, whose steps are as follows:
on a computer cluster ring or a computing node ring, take a computing unit, namely a computer cluster or a computing node, whose logical number is k and whose computing power is CP_k;
allocate a task amount of CP_k * M to this computing unit;
characterized in that:
the task amount CP_k * M is additionally distributed evenly onto the m computing units whose logical numbers are k+1, k+2, ..., k+m; in this way the task CP_k * M is distributed by DG and executed twice simultaneously.
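A minimal Python sketch of the m-redundancy allocation of claim 2, under the assumption that the primary amount CP_k * M of unit k is split evenly over its m logical successors on the ring; the names capabilities, primary and redundant are illustrative.

    def m_redundant_allocation(capabilities, m, M=1):
        # Sketch of the m-redundancy allocation on a logic ring of n computing units.
        # capabilities: {k: CP_k} for units numbered 1..n
        # Returns (primary, redundant):
        #   primary[k]   -- the amount CP_k * M assigned to unit k itself
        #   redundant[k] -- extra amount unit k receives as redundant copies
        n = len(capabilities)
        primary = {k: capabilities[k] * M for k in range(1, n + 1)}
        redundant = {k: 0.0 for k in range(1, n + 1)}
        for k in range(1, n + 1):
            share = primary[k] / m                 # CP_k * M split evenly over m successors
            for step in range(1, m + 1):
                succ = (k + step - 1) % n + 1      # ring wrap-around: the successor of n is 1
                redundant[succ] += share
        return primary, redundant

For m = 1 this degenerates to the 1-redundancy distribution used between clusters in step 3) i) of claim 3, where a cluster's whole task amount is mirrored onto its single logical successor.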
3. A large-scale data parallel computing algorithm under a grid environment, whose key DGCS data structures are constituted as follows:
let grid DG consist of c computer clusters, the number of computing nodes in each cluster changing dynamically; DPC is a data parallel computing task on DG, |DPC| denotes its total task amount, and W is its computing resource requirement weight vector; Q_TDT is the task distribution message queue of DPC; M is the basic task unit; DGSS is the grid monitoring system;
characterized in that the steps of the large-scale data parallel algorithm are as follows:
1) Initialization:
a) decompose DPC according to M;
b) compute Finished, the termination condition of DPC;
c) broadcast the auxiliary data of DPC to all computing nodes of DG;
d) set the parallel computation round counter Count = 0;
2) While (Finished does not hold) do
3) Execute the task:
a) obtain the resource state information of DG from DGSS;
b) construct the cluster logic ring of DG;
c) start all clusters to construct their own computing node logic rings;
d) calculate the dynamic two-dimensional address of each computing node;
e) obtain the overall computing power CCP_i (0≤i≤c) of each cluster;
f) for each cluster CC_i (0≤i≤c) do:
{
calculate the ratio CCP_i / ΣCCP_j (0≤j≤c);
calculate the task amount of DPC assigned to cluster CC_i: T_i = (CCP_i / ΣCCP_j (0≤j≤c)) * |DPC| / M;
};
g) For i = 1 to c
h) transmit task T_i to cluster CC_i;
i) on the cluster ring, according to the 1-redundancy allocation strategy, distribute T_i onto CC_{i+1}, the logical successor cluster of CC_i;
j) End for;
4) All clusters CC_i (0≤i≤c) perform steps 5) to 11) concurrently:
5) Cluster CC_i calculates, according to the comprehensive computing capability CP_j of each of its computing nodes (0≤j≤p, where p is the number of computing nodes in CC_i), the share of subtask T_i borne by each computing node, as follows:
a) For j = 1 to p
b) T_ij = (CP_j / ΣCP_k (0≤k≤p)) * |T_i| / M;
c) transmit subtask T_ij to computing node C_j;
d) on the computing node ring, according to the m-redundancy allocation strategy, distribute T_ij onto the m computing nodes C_{j+1}, C_{j+2}, ..., C_{j+m} following C_j;
e) End for;
6) The Master of cluster CC_i constructs the local task distribution message queue Q_TDTi for this subtask and sends Q_TDTi to the Master of DG; the Master of DG constructs the global task distribution message queue Q_TDT;
7) The Master of CC_i starts all computing nodes under its jurisdiction to complete this computing task, repeatedly performing steps 8), 9) and 10);
8) The Master of CC_i monitors the completion status of this cluster's tasks in Q_TDTi;
accepts from its successor cluster the completion status of the redundant computation of Q_TDTi;
transmits to its predecessor cluster this cluster's intermediate results of the redundant computation done for that predecessor;
and transmits the intermediate results of this cluster's Q_TDTi computation to the Master of DG;
9) If
((all results of this Q_TDTi computation have been obtained, either by this cluster itself or by its successor)
or
(the finish command of the Master of DG has been received)),
then finish this subtask computation and go to step 11);
10) If the Master of CC_i learns from DGSS that some computing node has failed,
then { set this node to the failed state;
according to the m-redundancy allocation strategy, calculate the failed task amount and put it into the failure task queue;
and at the same time send a failure message to the Master of DG;
}
when a computing node finishes its own computing task, it fetches a corresponding task from the failure task queue and continues executing, until the failure queue is empty;
11) Accept the next task distribution from the Master of DG;   /* this round of grid parallel computation finishes */ Count++;
12) The Master of DG gathers the intermediate results of this round according to the global Q_TDT; revises the algorithm termination condition; and converts the intermediate results into a new global computing task DPC;
13) End while;
14) Output the computation result and notify all computing nodes that this computation is finished.
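Steps 3) f) and 5) b) of claim 3 both reduce to a capability-weighted division of a task amount. The short sketch below shows that arithmetic, with a plain capability dictionary and total amount standing in for CCP_i / CP_j and |DPC| (illustrative names, not the patent's identifiers).

    def proportional_split(capabilities, total, M=1):
        # Capability-weighted division of a task amount, as in
        # T_i = (CCP_i / sum_j CCP_j) * |DPC| / M  and the per-node analogue T_ij.
        s = sum(capabilities.values())
        return {k: (c / s) * total / M for k, c in capabilities.items()}

    # Example: three clusters with overall computing power 40, 35 and 25 and |DPC| = 1000
    # basic task units receive 400, 350 and 250 units respectively.
    shares = proportional_split({"CC1": 40, "CC2": 35, "CC3": 25}, total=1000)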
4. The large-scale data parallel computing algorithm according to claim 3, characterized in that the cluster logic ring in step 3) b) is constructed as follows:
given a class of DPC on DG and its resource requirement weight vector W = (w1, w2, w3), for any computer cluster CC_i ∈ CSS of DG, the comprehensive computing capability of CC_i is the sum of the comprehensive computing capabilities of all its computing nodes, denoted CCP_i (0≤i≤c);
in DG, number all clusters in descending order of CCP_i (0≤i≤c);
a cluster logic ring is formed from this numbering, the logical successor of the cluster with the largest number being the cluster numbered 1.
5. The large-scale data parallel computing algorithm according to claim 3, characterized in that the computing node logic ring in step 3) c) is constructed as follows:
given a class of DPC on DG and its resource requirement weight vector W = (w1, w2, w3), for any computer cluster CC_i ∈ CSS of DG whose number of computing nodes is p, the comprehensive computing capability of each computing node is calculated according to the weight vector W as CP_j (0≤j≤p);
in CC_i, number all computing nodes in descending order of CP_j (0≤j≤p);
a computing node logic ring is formed from this numbering, the logical successor of the computing node with the largest number being the node numbered 1.
6. The large-scale data parallel computing algorithm according to claim 3, characterized in that the dynamic two-dimensional address calculated in step 3) d) is defined as a two-tuple address (r, o); each computing node on DG can obtain a two-tuple address (r, o) from the constructed cluster logic ring and computing node logic ring, where r is the cluster-ring number of the cluster in which the computing node resides, and o is the logical number of the computing node in the computing node ring of cluster r.
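A small sketch of the two-tuple (r, o) addressing of claim 6, assuming the cluster ring is given as a list of ring numbers paired with node lists already sorted by descending capability; identifiers are illustrative.

    def two_dimensional_addresses(cluster_ring):
        # Assign each computing node its dynamic two-dimensional address (r, o):
        # r is the ring number of the node's cluster, o the node's number on that
        # cluster's node ring.  cluster_ring is a list of (r, [node_ids]) pairs.
        addresses = {}
        for r, node_ids in cluster_ring:
            for o, node_id in enumerate(node_ids, start=1):
                addresses[node_id] = (r, o)
        return addresses

    # two_dimensional_addresses([(1, ["a", "b"]), (2, ["c"])])
    # -> {"a": (1, 1), "b": (1, 2), "c": (2, 1)}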
7. The large-scale data parallel computing algorithm according to claim 3, characterized in that, for the task distribution message queue in step 6): the TDTs of all elementary-unit tasks constitute a task distribution message queue Q_TDT; DG has one global Q_TDT, and each cluster has one local Q_TDTi.
8. The large-scale data parallel computing algorithm according to claim 3, characterized in that, for the failure task queue in step 10): each computer cluster of DG constructs a queue that stores the task information left behind after a local computing node fails, its form being the same as that of the task distribution message queue.
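A minimal sketch of the failure task queue of claim 8 and of the fail-over behaviour in step 10) of claim 3, under the assumption that queue entries are plain task descriptors that finished nodes pick up; the class and method names are hypothetical.

    from collections import deque

    class FailureTaskQueue:
        # Per-cluster queue storing the task information left behind by failed
        # local nodes; its form mirrors the task distribution message queue.
        def __init__(self):
            self._q = deque()

        def node_failed(self, pending_tasks):
            # On a node failure, its unfinished task units are appended for redistribution.
            self._q.extend(pending_tasks)

        def next_task(self):
            # A node that has finished its own work picks up a task here;
            # returns None once the failure queue is empty.
            return self._q.popleft() if self._q else None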
CNB2005100257308A 2005-05-11 2005-05-11 Large scale data parallel computing main system and method under network environment Expired - Fee Related CN100357930C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2005100257308A CN100357930C (en) 2005-05-11 2005-05-11 Large scale data parallel computing main system and method under network environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2005100257308A CN100357930C (en) 2005-05-11 2005-05-11 Large scale data parallel computing main system and method under network environment

Publications (2)

Publication Number Publication Date
CN1687917A true CN1687917A (en) 2005-10-26
CN100357930C CN100357930C (en) 2007-12-26

Family

ID=35305958

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2005100257308A Expired - Fee Related CN100357930C (en) 2005-05-11 2005-05-11 Large scale data parallel computing main system and method under network environment

Country Status (1)

Country Link
CN (1) CN100357930C (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003280921A (en) * 2002-03-25 2003-10-03 Fujitsu Ltd Parallelism extracting equipment

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101263700B (en) * 2005-10-28 2011-02-02 三菱电机株式会社 A method for assigning addresses to nodes in wireless networks
CN100440891C (en) * 2005-12-26 2008-12-03 北京航空航天大学 Method for balancing gridding load
CN100386986C (en) * 2006-03-10 2008-05-07 清华大学 Hybrid positioning method for data duplicate in data network system
CN101217564B (en) * 2008-01-16 2012-08-22 上海理工大学 A parallel communication system and the corresponding realization method of simple object access protocol
CN102216922A (en) * 2008-10-08 2011-10-12 卡沃有限公司 Cloud computing lifecycle management for n-tier applications
US11418389B2 (en) 2008-10-08 2022-08-16 Jamal Mazhar Application deployment and management in a cloud computing environment
US10938646B2 (en) 2008-10-08 2021-03-02 Jamal Mazhar Multi-tier cloud application deployment and management
US9043751B2 (en) 2008-10-08 2015-05-26 Kaavo, Inc. Methods and devices for managing a cloud computing environment
US10454763B2 (en) 2008-10-08 2019-10-22 Jamal Mazhar Application deployment and management in a cloud computing environment
CN104035819B (en) * 2014-06-27 2017-02-15 清华大学深圳研究生院 Scientific workflow scheduling method and device
CN104035819A (en) * 2014-06-27 2014-09-10 清华大学深圳研究生院 Scientific workflow scheduling method and device
CN104184674B (en) * 2014-08-18 2017-04-05 江南大学 A kind of network analog task load balance method under heterogeneous computing environment
CN104184674A (en) * 2014-08-18 2014-12-03 江南大学 Network simulation task load balancing method in heterogeneous computing environment
CN105574152B (en) * 2015-12-16 2019-03-01 北京邮电大学 A kind of method and system of express statistic frequency
CN105574152A (en) * 2015-12-16 2016-05-11 北京邮电大学 Method and system for rapidly counting frequencies

Also Published As

Publication number Publication date
CN100357930C (en) 2007-12-26

Similar Documents

Publication Publication Date Title
CN1687917A (en) Large scale data parallel computing main system and method under network environment
CN1287282C (en) Method and system for scheduling real-time periodic tasks
CN1776622A (en) Scheduling in a high-performance computing (HPC) system
CN1777107A (en) On-demand instantiation in a high-performance computing (HPC) system
CN1157960C (en) Telecommunication platform system and method
CN1232071C (en) Communication network management
CN1509022A (en) Layer network node and network constituted throuth said nodes, the node and layer network thereof
CN1172239C (en) Method of executing mobile objects and recording medium storing mobile objects
CN1254994C (en) Network topologies
CN1021489C (en) Expert system development support system and expert system
CN1295583C (en) Method and system for realizing real-time operation
CN1669001A (en) Business continuation policy for server consolidation environment
CN1608257A (en) Aggregate system resource analysis including correlation matrix and metric-based analysis
CN101044498A (en) Workflow services architecture
CN1809815A (en) Managing locks and transactions
CN1838600A (en) Sensor network system and data transfer method for sensing data
CN1794729A (en) Data arrangement management method, data arrangement management system, data arrangement management device, and data arrangement management program
CN1734438A (en) Information processing apparatus, information processing method, and program
CN1783086A (en) System and method for query management in a database management system
CN1435043A (en) Method and device for call center operation
CN1760804A (en) Information processor, information processing method, and program
CN1879023A (en) Electric utility storm outage management
CN1684029A (en) Storage system
CN1805349A (en) Sensor network system and data retrieval method and program for sensing data
CN1779660A (en) Methods for duplicating among three units asynchronously

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20071226