CN105095455A

CN105095455A - Data connection optimization method and data operation system

Info

Publication number: CN105095455A
Application number: CN201510446965.8A
Authority: CN
Inventors: 王淑玲; 冯伟斌; 王志军
Original assignee: China United Network Communications Group Co Ltd
Current assignee: China United Network Communications Group Co Ltd
Priority date: 2015-07-27
Filing date: 2015-07-27
Publication date: 2015-11-25
Anticipated expiration: 2035-07-27
Also published as: CN105095455B

Abstract

Embodiments of the invention provide a data connection optimization method and a data operation system. The invention relates to the field of communications and solves the problem that existing methods for improving data connection efficiency have a limited range of application. In the method, the data connection operation is divided into two MapReduce stages. In the first stage, the values of the connection elements are counted and the frequency-value set to which each connection element belongs is determined. In the second stage, Map loads the query vectors and the intersections among the frequency-value sets of the connection elements into the memory of the compute nodes and uses them to determine whether a given record in a database will give rise to a connection operation, so that key-value pairs of elements that do not give rise to a connection operation need not be sent to the Reduce nodes. The method and the system are used for MapReduce-oriented data connection optimization.

Description

Data connection optimization method and data operation system

Technical field

The present invention relates to the field of communications, and in particular to a data connection optimization method and a data operation system.

Background

In data processing, connection (join) operations on data are very common and time-consuming. For example, suppose there are two databases R and S: R contains data entries A and B and is denoted R(A, B), and S contains data entries B and C and is denoted S(B, C). R ⋈ S denotes the connection operation between R and S, with the connection condition R.B = S.B. MapReduce is currently the mainstream programming model in big-data processing. It abstracts a data processing task into map tasks and reduce tasks: the map stage filters the data, and the reduce stage aggregates it.

In the MapReduce programming model, the simplest equi-join is the reduce-side join, in which all elements of R and S must be transferred to the reduce stage, which consumes considerable network resources. Some data, however, need not be transferred: for example, a value of R.B that never appears in S.B does not need to be sent to reduce. To make connections in MapReduce more efficient, the industry has introduced connection optimization methods such as the map-side join and the semijoin.

In a map-side join, the smaller of the two tables R and S is selected, say R, and replicated so that a copy resides in the memory of every map node as a hash table; only the large table S is then scanned. For each record in S, the hash table is searched for a record with the same key, and if one is found, the joined record is output. This method applies only when one of the two tables to be joined is very large and the other is small enough to be held directly in memory. When both tables are large, the memory of the map nodes is exhausted, so the range of application is limited. A semijoin likewise selects a small table, say R. R.B is first extracted and saved to a file T, which is held in memory. In the map stage, T is copied to every map node and each value of S.B is checked: if the value is not in T, the corresponding record in S is filtered out, and the remaining records are processed as in a map-side join. The semijoin therefore has the same limited range of application as the map-side join.
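The map-side join described above can be illustrated with a minimal Python sketch. The function name `map_side_join` and the tuple layouts are assumptions made for illustration only; the text does not prescribe an implementation.

```python
from collections import defaultdict

def map_side_join(small_r, big_s_block):
    """Join one block of the large table S against an in-memory copy of R.

    small_r:      iterable of (a, b) records from R(A, B)
    big_s_block:  iterable of (b, c) records from S(B, C)
    """
    # Build the in-memory hash table on the connection element B of the small table.
    table = defaultdict(list)
    for a, b in small_r:
        table[b].append(a)
    # Scan S once and emit joined records only for keys found in the table.
    for b, c in big_s_block:
        for a in table.get(b, ()):
            yield (a, b, c)

r = [("a1", 1), ("a2", 2), ("a3", 2)]
s = [(2, "c1"), (3, "c2")]
joined = list(map_side_join(r, s))
```

The sketch makes the limitation visible: `table` must hold all of R on every map node, which is only feasible when R is small.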

Summary of the invention

Embodiments of the present invention provide a data connection optimization method and a data operation system that can solve the problem that existing methods for improving data connection efficiency have a limited range of application.

In a first aspect, a data connection optimization method is provided, comprising:

m₁ first Map nodes each count, in the first data blocks of a database allocated to them by a central node, the records corresponding to the same connection element originating from the same database, to obtain a statistical result, and send the statistical result to the central node, so that the central node sends the statistical result to n₁ first Reduce nodes, wherein the statistical results of the same connection element are sent to the same first Reduce node;

the n₁ first Reduce nodes each obtain, according to the statistical result, the total number of records corresponding to each connection element originating from the same database in the respective first Reduce node, determine the first set to which the connection element belongs according to the total number of records and a preset threshold, and send the first set to the central node, the first set characterizing the frequency condition of the values of the connection element;

the central node re-sorts the connection elements in the first set to obtain a corresponding second set, and performs a vector update operation on the connection elements in the second set to obtain the query vector corresponding to the second set;

the central node redistributes the database to m₂ second Map nodes, and sends the first set and the query vector to the m₂ second Map nodes;

the m₂ second Map nodes each determine, according to the records in the allocated second data blocks and the query vector, the first set to which the connection element in each record belongs, and determine whether to perform a connection operation on the connection element according to an intersection operation between the first sets with respect to the connection element.

With reference to the first aspect, in a first possible implementation of the first aspect, the m₁ first Map nodes each counting, in the first data blocks of the database allocated by the central node, the records corresponding to the same connection element originating from the same database, to obtain the statistical result, comprises:

the m₁ first Map nodes each call a map function to extract, from the first data block, every record containing the connection element, and output a first key-value pair, the first key-value pair comprising the connection element, an identifier characterizing the data source of the connection element, and a count, the count being 1;

for the same connection element in the first data block, the counts in the first key-value pairs having the same data-source identifier are accumulated, to obtain the number of records in the first data block that correspond to the same connection element from the same data source.

With reference to the first aspect or the first possible implementation of the first aspect, in a second possible implementation of the first aspect, if the database comprises a first database and a second database, determining the first set to which the connection element belongs according to the total number of records and the preset threshold comprises:

when, in a first Reduce node, the first total number of records of the connection element originating from the first database is less than the threshold and the second total number of records originating from the second database is less than the threshold: if the first total number of records is not zero, determining that the first set to which the connection element belongs is the set R_thin; and if the second total number of records is not zero, determining that the first set to which the connection element belongs is the set S_thin, the sets R_thin and S_thin containing the connection elements that occur sparsely in the first database and the second database;

if the first total number of records is greater than or equal to the threshold and greater than or equal to the second total number of records, determining that the first set to which the connection element belongs is the set R_den, the set R_den containing the connection elements that occur with high frequency in the first database and occur more often there than in the second database; and, if the second total number of records is not zero, determining that the first set to which the connection element belongs is the set S_rden, the set S_rden containing the connection elements that occur in the second database but belong to the set R_den;

if the second total number of records is greater than or equal to the threshold and greater than or equal to the first total number of records, determining that the first set to which the connection element belongs is the set S_den, the set S_den containing the connection elements that occur with high frequency in the second database and occur more often there than in the first database; and, if the first total number of records is not zero, determining that the first set to which the connection element belongs is the set R_sden, the set R_sden containing the connection elements that occur in the first database but belong to the set S_den.
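The three cases above can be sketched in Python as follows. This is an illustrative reading, not the claimed implementation: the function name `classify`, the dictionary of Python sets, and the sample counts are assumptions. When both totals are equal and at or above the threshold, the sketch assigns the key to R_den because that case is tested first; this tie-breaking is a choice of the sketch.

```python
def classify(key, val_r, val_s, theta, sets):
    """Assign one connection element to its first set(s).

    val_r / val_s: total record counts of the key in the first / second database.
    theta:         the preset threshold.
    """
    if val_r < theta and val_s < theta:
        # Sparse on both sides.
        if val_r != 0:
            sets["R_thin"].add(key)
        if val_s != 0:
            sets["S_thin"].add(key)
    elif val_r >= theta and val_r >= val_s:
        # Frequent in the first database.
        sets["R_den"].add(key)
        if val_s != 0:
            sets["S_rden"].add(key)
    else:
        # Remaining case: val_s >= theta and val_s >= val_r.
        sets["S_den"].add(key)
        if val_r != 0:
            sets["R_sden"].add(key)

NAMES = ("R_thin", "S_thin", "R_den", "S_rden", "S_den", "R_sden")
sets = {name: set() for name in NAMES}
for key, val_r, val_s in [("k1", 1, 2), ("k2", 5, 1), ("k3", 0, 4), ("k4", 2, 0)]:
    classify(key, val_r, val_s, theta=3, sets=sets)
```

With a threshold of 3, `k1` lands in both thin sets, `k2` in R_den and S_rden, `k3` only in S_den (it never occurs in the first database), and `k4` only in R_thin.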

With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, the central node re-sorting the connection elements in the first set to obtain the corresponding second set, and performing the vector update operation on the connection elements in the second set to obtain the query vector corresponding to the second set, comprises:

the central node re-sorts the connection elements in the first sets, namely the sets R_den, R_sden, S_den and S_rden, by the magnitude of the connection element values, to obtain the corresponding second sets, and records the position of each connection element in the second set;

for the second sets corresponding to the sets R_den, R_sden, S_den and S_rden, the vector update operation is performed on each connection element in the second set according to a predefined initial vector, to obtain the query vector corresponding to the second set.

With reference to the third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, after the first set to which the connection element in each record belongs is determined according to the records in the allocated second data blocks and the query vector, the method further comprises:

the m₂ second Map nodes each generate a second key-value pair corresponding to the connection element, the second key-value pair comprising the connection element, an identifier characterizing the data source of the connection element, an identifier of the first set to which the connection element belongs, and the non-connection element corresponding to the connection element.

With reference to the fourth possible implementation of the first aspect, in a fifth possible implementation of the first aspect, determining whether to perform the connection operation on the connection element according to the intersection operation between the first sets with respect to the connection element comprises:

after the first set to which the connection element in the record belongs is determined, if it is determined according to the query vector that the connection element does not belong to another first set whose intersection with that first set is non-empty, determining that no connection operation is performed on the connection element.
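The decision above can be sketched as a membership test against the counterpart set. In this sketch plain Python sets stand in for the query vectors, whose construction is not detailed at this point in the text; the `COUNTERPART` table and the function name `needs_join` are likewise assumptions made for illustration.

```python
# For each first set, the first set on the other side that it can intersect with.
COUNTERPART = {
    "R_thin": "S_thin", "S_thin": "R_thin",
    "R_den": "S_rden", "S_rden": "R_den",
    "R_sden": "S_den", "S_den": "R_sden",
}

def needs_join(key, set_name, sets):
    """Return True if the key also occurs in the counterpart first set.

    A False result means the record cannot produce a joined row, so the
    second Map node does not need to send it to any Reduce node.
    """
    return key in sets[COUNTERPART[set_name]]

sets = {
    "R_thin": {"k1", "k4"}, "S_thin": {"k1"},
    "R_den": {"k2"}, "S_rden": {"k2"},
    "S_den": {"k3"}, "R_sden": set(),
}
decisions = {
    ("k1", "R_thin"): needs_join("k1", "R_thin", sets),
    ("k4", "R_thin"): needs_join("k4", "R_thin", sets),
    ("k3", "S_den"): needs_join("k3", "S_den", sets),
}
```

Here `k4` occurs only in the first database and `k3` only in the second, so neither is forwarded; only `k1` proceeds to the connection operation.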

With reference to the fifth possible implementation of the first aspect, in a sixth possible implementation of the first aspect, the method further comprises:

the m₂ second Map nodes each determine, according to the first set to which the second key-value pair belongs, the range of second Reduce nodes corresponding to the second key-value pair, and send the second key-value pair to the n₂ second Reduce nodes within the range;

the n₂ second Reduce nodes each sort the second key-value pairs into different queues according to the first set to which each second key-value pair belongs, and perform the connection operation on the second key-value pairs in the queues.

With reference to the sixth possible implementation of the first aspect, in a seventh possible implementation of the first aspect, the m₂ second Map nodes each determining, according to the first set to which the second key-value pair belongs, the range of second Reduce nodes corresponding to the second key-value pair, and sending the second key-value pair to the n₂ second Reduce nodes within the range, comprises:

if the connection element belongs to the set S_thin or the set R_thin, sending the second key-value pairs corresponding to the same connection element to the same second Reduce node;

if the connection element belongs to the set R_den, the second Map node sends the second key-value pair corresponding to the connection element at random to any one second Reduce node whose number lies in a first range, the first range being the numbers from a starting Reduce number to the sum of the starting Reduce number and the ratio of the total number of records of the connection element in the first database to the average number of records processed by each Reduce node;

if the connection element belongs to the set S_den, the second Map node sends the second key-value pair corresponding to the connection element at random to any one second Reduce node whose number lies in a second range, the second range being the numbers from a starting Reduce number to the sum of the starting Reduce number and the ratio of the total number of records of the connection element in the second database to the average number of records processed by each Reduce node;

if the connection element belongs to the set R_sden, the second Map node broadcasts the second key-value pair corresponding to the connection element to all second Reduce nodes whose numbers lie in a third range, the third range being the numbers from a starting Reduce node number to the sum of the starting Reduce number and the ratio of the total number of records of the connection element in the second database to the average number of records processed by each Reduce node;

if the connection element belongs to the set S_rden, the second Map node broadcasts the second key-value pair corresponding to the connection element to all second Reduce nodes whose numbers lie in a fourth range, the fourth range being the numbers from a starting second Reduce number to the sum of the starting second Reduce number and the ratio of the total number of records of the connection element in the first database to the average number of records processed by each second Reduce node.

With reference to the seventh possible implementation of the first aspect, in an eighth possible implementation of the first aspect, the average number of records processed by each second Reduce node is: the sum of the ratio of the file size of the first database to the size of a single record in the first database and the ratio of the file size of the second database to the size of a single record in the second database, divided by the number of second Reduce nodes.
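The average load and the Reduce-number ranges for dense keys can be sketched as follows. The text does not say how a fractional ratio is rounded or whether the upper end of a range is inclusive, so rounding up and using a half-open range are assumptions of this sketch, as are all function and parameter names. Thin keys, which always go to a single Reduce node per key, are left out.

```python
import math
import random

def avg_records_per_reducer(size_r, rec_size_r, size_s, rec_size_s, n2):
    """(file size / record size of R + file size / record size of S) / n2."""
    return (size_r / rec_size_r + size_s / rec_size_s) / n2

def reducer_range(start, total_records, avg):
    """Reduce numbers from `start` covering total_records / avg nodes.

    Assumption: the ratio is rounded up and at least one node is used.
    """
    width = max(1, math.ceil(total_records / avg))
    return range(start, start + width)

def route(set_name, candidates, rng=random):
    """Pick target second Reduce nodes for a key of a dense first set."""
    if set_name in ("R_den", "S_den"):
        # The frequent side is spread: one random node within the range.
        return [rng.choice(candidates)]
    if set_name in ("R_sden", "S_rden"):
        # The other side is broadcast to every node within the range.
        return list(candidates)
    raise ValueError(f"not a dense set: {set_name}")

avg = avg_records_per_reducer(1000, 10, 2000, 20, n2=4)
candidates = reducer_range(start=0, total_records=120, avg=avg)
spread = route("R_den", candidates)
broadcast = route("S_rden", candidates)
```

Spreading the frequent side at random while broadcasting the matching records of the other side means every node in the range holds all the counterpart records it needs, so each joined pair is still produced exactly once.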

With reference to the seventh or the eighth possible implementation of the first aspect, in a ninth possible implementation of the first aspect, sorting the second key-value pairs into different queues according to the first set to which each second key-value pair belongs, and performing the connection operation on the second key-value pairs in the queues, comprises:

performing the connection operation between the queue formed by the second key-value pairs corresponding to connection elements belonging to the set R_den and the queue formed by the key-value pairs corresponding to connection elements belonging to the set S_rden;

performing the connection operation between the queue formed by the second key-value pairs corresponding to connection elements belonging to the set R_sden and the queue formed by the key-value pairs corresponding to connection elements belonging to the set S_den;

performing the connection operation between the queue formed by the second key-value pairs corresponding to connection elements belonging to the set R_thin and the queue formed by the key-value pairs corresponding to connection elements belonging to the set S_thin.
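The queue pairing above can be sketched for a single second Reduce node as follows. The tuple layout `(key, set_name, payload)` is a simplification of the second key-value pair, with `payload` standing for the non-connection element; it and the function name are assumptions of this sketch.

```python
from collections import defaultdict

# Queue pairs that are joined with each other, as listed in the text.
PAIRINGS = [("R_den", "S_rden"), ("R_sden", "S_den"), ("R_thin", "S_thin")]

def second_stage_reduce(pairs):
    """Sort second key-value pairs into queues by first set, then join them.

    pairs: iterable of (key, set_name, payload) tuples.
    """
    queues = defaultdict(lambda: defaultdict(list))
    for key, set_name, payload in pairs:
        queues[set_name][key].append(payload)
    out = []
    for r_set, s_set in PAIRINGS:
        for key, r_payloads in queues[r_set].items():
            for a in r_payloads:
                for c in queues[s_set].get(key, ()):
                    out.append((a, key, c))
    return out

pairs = [
    ("k2", "R_den", "a1"), ("k2", "R_den", "a2"), ("k2", "S_rden", "c1"),
    ("k1", "R_thin", "a3"), ("k1", "S_thin", "c2"),
    ("k9", "R_thin", "a4"),
]
result = second_stage_reduce(pairs)
```

The record for `k9` has no counterpart in the S_thin queue and therefore produces no output row.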

In a second aspect, a data operation system is provided, comprising m₁ first Map nodes, n₁ first Reduce nodes, a central node and m₂ second Map nodes, wherein:

the m₁ first Map nodes are each configured to count, in the first data blocks of a database allocated by the central node, the records corresponding to the same connection element originating from the same database, to obtain a statistical result, and to send the statistical result to the central node, so that the central node sends the statistical result to the n₁ first Reduce nodes, wherein the statistical results of the same connection element are sent to the same first Reduce node;

the n₁ first Reduce nodes are each configured to obtain, according to the statistical result, the total number of records corresponding to each connection element originating from the same database in the respective first Reduce node, to determine the first set to which the connection element belongs according to the total number of records and a preset threshold, and to send the first set to the central node, the first set characterizing the frequency condition of the values of the connection element;

the central node is configured to re-sort the connection elements in the first set to obtain a corresponding second set, and to perform a vector update operation on the connection elements in the second set to obtain the query vector corresponding to the second set;

the central node is further configured to redistribute the database to the m₂ second Map nodes, and to send the first set and the query vector to the m₂ second Map nodes;

the m₂ second Map nodes are each configured to determine, according to the records in the allocated second data blocks and the query vector, the first set to which the connection element in each record belongs, and to determine whether to perform a connection operation on the connection element according to an intersection operation between the first sets with respect to the connection element.

Embodiments of the present invention provide a data connection optimization method and a data operation system. m₁ first Map nodes each count, in the first data blocks of a database allocated by a central node, the records corresponding to the same connection element originating from the same database, to obtain a statistical result, and send the statistical result to the central node, so that the central node sends the statistical result to n₁ first Reduce nodes, the statistical results of the same connection element being sent to the same first Reduce node. The n₁ first Reduce nodes each obtain, according to the statistical result, the total number of records corresponding to each connection element originating from the same database in the respective first Reduce node, determine the first set to which the connection element belongs according to the total number of records and a preset threshold, and send the first set to the central node, the first set characterizing the frequency condition of the values of the connection element. The central node re-sorts the connection elements in the first set to obtain a corresponding second set, and performs a vector update operation on the connection elements in the second set to obtain the query vector corresponding to the second set. The central node redistributes the database to m₂ second Map nodes and sends the first set and the query vector to the m₂ second Map nodes. The m₂ second Map nodes each determine, according to the records in the allocated second data blocks and the query vector, the first set to which the connection element in each record belongs, and determine whether to perform a connection operation on the connection element according to an intersection operation between the first sets with respect to the connection element. In this way, the first set to which a connection element belongs is determined through the vector update operation, and the intersection operation on the connection element then determines whether the connection element will give rise to a connection operation, that is, whether it needs to be passed to a Reduce node, so that network transmission is reduced. Unlike the prior art, the decision whether to transmit a record is not restricted to the case in which one table is large and the other is small, so the problem that existing methods for improving data connection efficiency have a limited range of application can be solved.

Brief description of the drawings

To describe the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings required in the description of the embodiments or of the prior art are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.

Fig. 1 is a schematic structural diagram of a data operation system according to an embodiment of the present invention;

Fig. 2 is a schematic flowchart of a data connection optimization method according to an embodiment of the present invention;

Fig. 3 is a schematic flowchart of another data connection optimization method according to an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Evidently, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

An embodiment of the present invention provides a data operation system 1 which, as shown in Fig. 1, comprises m₁ first Map nodes, n₁ first Reduce nodes, a central node and m₂ second Map nodes, wherein:

the m₁ first Map nodes are each configured to count, in the first data blocks of a database allocated by the central node, the records corresponding to the same connection element originating from the same database, to obtain a statistical result, and to send the statistical result to the central node, so that the central node sends the statistical result to the n₁ first Reduce nodes, wherein the statistical results of the same connection element are sent to the same first Reduce node;

the n₁ first Reduce nodes are each configured to obtain, according to the statistical result, the total number of records corresponding to each connection element originating from the same database in the respective first Reduce node, to determine the first set to which the connection element belongs according to the total number of records and a preset threshold, and to send the first set to the central node, the first set characterizing the frequency condition of the values of the connection element;

the central node is configured to re-sort the connection elements in the first set to obtain a corresponding second set, and to perform a vector update operation on the connection elements in the second set to obtain the query vector corresponding to the second set;

the central node is further configured to redistribute the database to the m₂ second Map nodes, and to send the first set and the query vector to the m₂ second Map nodes;

the m₂ second Map nodes are each configured to determine, according to the records in the allocated second data blocks and the query vector, the first set to which the connection element in each record belongs, and to determine whether to perform a connection operation on the connection element according to an intersection operation between the first sets with respect to the connection element.

In this way, the vector update operation is used to judge whether a connection element will give rise to a connection operation, that is, whether it needs to be passed to a second Reduce node, so that network transmission is reduced. Unlike the prior art, the decision whether to transmit a record is not restricted to the case in which one table is large and the other is small, so the problem that existing methods for improving data connection efficiency have a limited range of application can be solved.

On the basis of the above data operation system, the method performed by the system is described below. An embodiment of the present invention accordingly provides a data connection optimization method which, as shown in Fig. 2, comprises:

201. m₁ first Map nodes each count, in the first data blocks of a database allocated by the central node, the records corresponding to the same connection element originating from the same database, to obtain a statistical result, and send the statistical result to the central node, so that the central node sends the statistical result to n₁ first Reduce nodes, wherein the statistical results of the same connection element are sent to the same first Reduce node.

202. The n₁ first Reduce nodes each obtain, according to the statistical result, the total number of records corresponding to each connection element originating from the same database in the respective first Reduce node, determine the first set to which the connection element belongs according to the total number of records and a preset threshold, and send the first set to the central node, the first set characterizing the frequency condition of the values of the connection element.

203. The central node re-sorts the connection elements in the first set to obtain a corresponding second set, and performs a vector update operation on the connection elements in the second set to obtain the query vector corresponding to the second set.

204. The central node redistributes the database to m₂ second Map nodes, and sends the first set and the query vector to the m₂ second Map nodes.

205. The m₂ second Map nodes each determine, according to the records in the allocated second data blocks and the query vector, the first set to which the connection element in each record belongs, and determine whether to perform a connection operation on the connection element according to an intersection operation between the first sets with respect to the connection element.

Specifically, the embodiment of the present invention uses a two-stage MapReduce process to determine whether the records of a given connection element will give rise to a connection operation, that is, whether they need to be passed to a Reduce node, so as to reduce network transmission. The first-stage MapReduce mainly counts the values of the connection element, and the second-stage MapReduce completes the connection operation according to the statistical result of the first stage.

Suppose there are two databases R and S: R contains data entries A and B and is denoted R(A, B), and S contains data entries B and C and is denoted S(B, C). R ⋈ S denotes the connection operation between R and S, with the connection condition R.B = S.B. The first-stage MapReduce processes only R.B and S.B, that is, only the connection element, and the data transmitted between nodes relate only to the connection element B and are independent of R.A and S.C.

In a specific implementation, taking the above database R as the first database and database S as the second database as an example, before step 201, that is, before the m₁ first Map nodes obtain the statistical result, the master program on the central node may first divide database R and database S into M₁ data blocks and distribute them to the m₁ first Map nodes, each first Map node processing M₁/m₁ data blocks, i.e. M₁/m₁ first data blocks.

In step 201, the m₁ first Map nodes each count, in the first data blocks of database R and database S allocated by the central node, the records corresponding to the same connection element B originating from the same database, to obtain the statistical result.

Specifically, a first Map node may call a map function to extract, from the first data block, every record containing the connection element B, and output a first key-value pair comprising the connection element B, an identifier characterizing the data source of the connection element, and a count, the count being 1.

For example, a first Map node calls the map function and extracts each record of database R or database S. For a record in database R it outputs (key, 0:1); for a record in database S it outputs (key, 1:1). Here the value of key is the value of the connection element B, and the value part has the form tag:val, where tag identifies the data source (0 means database R, 1 means database S) and the count 1 indicates that the value of B occurs once.

For the same connection element in the first data block, the first Map node then accumulates the counts in the first key-value pairs that have the same data-source identifier, to obtain the number of records in the first data block that correspond to the same connection element from the same data source. This stage may be called the shuffle stage: for identical keys, if value.tag is the same, value.val is accumulated. Finally, when all map tasks have finished, an intermediate file in the format (key, tag:val) is generated, where tag identifies whether the data source of the connection element in the first data block is R or S, and val is the number of records output to the intermediate file that have the same key and tag.
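The first-stage map output and the accumulation of counts per key and tag can be sketched in Python as follows. The function names and the representation of (key, tag:val) as a `(key, tag, val)` tuple are assumptions made for illustration; a real MapReduce job would express the same logic through the framework's mapper and combiner interfaces.

```python
from collections import Counter

TAG_R, TAG_S = 0, 1  # data-source tags used in the text: 0 = R, 1 = S

def first_stage_map(block):
    """Emit (key, tag, 1) for every record of a first data block.

    block: iterable of (tag, key) pairs, where key is the value of B.
    """
    for tag, key in block:
        yield key, tag, 1

def combine(pairs):
    """Sum the counts of pairs that share both key and tag."""
    totals = Counter()
    for key, tag, val in pairs:
        totals[(key, tag)] += val
    # Intermediate format (key, tag:val), here as sorted (key, tag, val) tuples.
    return sorted((key, tag, val) for (key, tag), val in totals.items())

block = [(TAG_R, "x"), (TAG_R, "x"), (TAG_S, "x"), (TAG_R, "y")]
intermediate = combine(first_stage_map(block))
```

Only the value of B and its source tag travel through this stage, which is why the first stage is cheap compared with shipping whole records.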

All first Map nodes then send the generated data to the central node (master program), and the Reduce process starts. Using a predefined hash function, the central node sends the statistical result to the n₁ first Reduce nodes according to the hash of the connection element, so that the statistical results of the same connection element, i.e. the corresponding data, are sent to the same first Reduce node.

In step 202, a first Reduce node then obtains, from the statistical result, the total number of records corresponding to each connection element originating from the same database in that first Reduce node. Specifically, the Reduce task of the first Reduce node examines the data it receives and accumulates val over the records that have the same key and tag. If the accumulated result is expressed as (key, tag:val), it may comprise (key, 0:val_r) and (key, 1:val_s), where key is the value of the connection element B, 0 and 1 denote the data source, val_r is the total number of records in database R whose value of B is key, and val_s is the total number of records in database S whose value of B is key.
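The hash-based distribution and the per-key accumulation on the first Reduce nodes can be sketched as follows. The text only says that a predefined hash function is used; choosing CRC32 via `zlib.crc32` is an assumption of this sketch, picked because it is deterministic across runs, and the function names are hypothetical.

```python
import zlib
from collections import defaultdict

def reducer_for(key, n1):
    """Pick a first Reduce node from a deterministic hash of the key."""
    return zlib.crc32(str(key).encode("utf-8")) % n1

def first_stage_reduce(intermediate, n1):
    """Return, for each first Reduce node, a dict {key: [val_r, val_s]}.

    intermediate: iterable of (key, tag, val) tuples from all first Map nodes.
    """
    nodes = [defaultdict(lambda: [0, 0]) for _ in range(n1)]
    for key, tag, val in intermediate:
        # tag 0 accumulates into val_r, tag 1 into val_s.
        nodes[reducer_for(key, n1)][key][tag] += val
    return [dict(node) for node in nodes]

intermediate = [("x", 0, 2), ("x", 1, 1), ("y", 0, 1), ("x", 0, 3)]
totals = first_stage_reduce(intermediate, n1=3)
merged = {key: vals for node in totals for key, vals in node.items()}
```

Because the node is chosen from the key alone, all partial counts of one key meet on one node, so the totals val_r and val_s computed there are complete.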

At this point the Reduce task of the first stage ends and outputs the final result; that is, in step 202 the first set to which each connection element belongs is determined from the record totals and a preset threshold.

Specifically, the final result may be recorded according to the following rules:

For each connection element key:

When the first record total of the connection element (records from the first database) at a first Reduce node is less than the threshold, and the second record total (records from the second database) is also less than the threshold: if the first record total is not zero, the first set to which the connection element belongs is determined to be set R_thin; if the second record total is not zero, the first set is determined to be set S_thin. Sets R_thin and S_thin contain the connection elements that occur sparsely in the first database and the second database. This can also be expressed as: when val_r < Θ && val_s < Θ, if val_r ≠ 0, record R_thin = R_thin ∪ {key}; if val_s ≠ 0, record S_thin = S_thin ∪ {key}, where Θ denotes the threshold. If val_r and val_s are both nonzero, the key is recorded in both R_thin and S_thin;

Otherwise, if the first record total is greater than or equal to the threshold and greater than or equal to the second record total, the first set to which the connection element belongs is determined to be set R_den, which contains the connection elements that occur with high frequency in the first database and occur there at least as often as in the second database. In addition, if the second record total is not zero, the first set is also determined to be set S_rden, which contains the connection elements that occur in the second database but belong to set R_den. This can also be expressed as: if val_r ≥ Θ && val_r ≥ val_s, record R_den = R_den ∪ {(key, val_r)}; if val_s ≠ 0, record S_rden = S_rden ∪ {(key, val_r)};

If the second record total is greater than or equal to the threshold and greater than or equal to the first record total, the first set to which the connection element belongs is determined to be set S_den, which contains the connection elements that occur with high frequency in the second database and occur there at least as often as in the first database. In addition, if the first record total is not zero, the first set is also determined to be set R_sden, which contains the connection elements that occur in the first database but belong to set S_den. This can also be expressed as: if val_s ≥ Θ && val_s ≥ val_r, record S_den = S_den ∪ {(key, val_s)}; if val_r ≠ 0, record R_sden = R_sden ∪ {(key, val_s)};
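The recording rules above can be sketched as a single classification routine. This is an illustrative sketch under stated assumptions: the function name `classify` and the `(key, tag, val)` input layout are hypothetical, and a tie (val_r = val_s ≥ Θ) is resolved in favour of R_den because that rule is listed first.

```python
from collections import defaultdict

def classify(entries, theta):
    """Build the six first sets from (key, tag, val) statistics."""
    totals = defaultdict(lambda: [0, 0])        # key -> [val_r, val_s]
    for key, tag, val in entries:
        totals[key][tag] += val                 # accumulate per key and tag
    sets = {"R_thin": set(), "S_thin": set(),
            "R_den": {}, "S_rden": {}, "S_den": {}, "R_sden": {}}
    for key, (val_r, val_s) in totals.items():
        if val_r < theta and val_s < theta:     # sparse in both databases
            if val_r != 0:
                sets["R_thin"].add(key)
            if val_s != 0:
                sets["S_thin"].add(key)
        elif val_r >= theta and val_r >= val_s:  # dense in R
            sets["R_den"][key] = val_r
            if val_s != 0:
                sets["S_rden"][key] = val_r
        else:                                    # val_s >= theta and val_s > val_r
            sets["S_den"][key] = val_s
            if val_r != 0:
                sets["R_sden"][key] = val_s
    return sets

entries = [("a", 0, 1), ("a", 1, 2), ("b", 0, 6), ("b", 0, 4),
           ("b", 1, 1), ("c", 1, 7)]
first_sets = classify(entries, theta=5)
```

In this example key "a" is sparse on both sides, key "b" is dense in R and also present in S, and key "c" is dense in S with no record in R, so it contributes nothing to R_sden.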

Thus, after the first-stage MapReduce has finished, six sets in total have been generated; their labels and meanings are shown in Table 1 below:

Table 1

From the above definitions and the way the sets are generated, the intersections of the sets with respect to key are as shown in Table 2 below:

Table 2

Further, each first Reduce node sends the first sets to the central node, and the second-stage MapReduce, which performs the connection operation, is started.

However, before the second-stage MapReduce is executed, the master program of the central node first enters a preparation stage, namely step 203, in which the results of the first-stage Reduce are processed further as follows:

The central node re-sorts the connection elements in the first sets to obtain the corresponding second sets, and performs a vector update operation for the connection elements in the second sets to obtain the query vectors corresponding to the second sets.

Specifically, the central node may re-sort the connection elements in the first sets R_den, R_sden, S_den and S_rden by the value of the connection element to obtain the corresponding second sets, and record the position of each connection element in its second set, denoted seq(key). The generated second sets may be denoted R'_den, R'_sden, S'_den and S'_rden, and their elements may be expressed as {(key, seq(key))}.

Sets R_thin and S_thin are not processed here because, as in the prior art, the key-value pairs corresponding to the connection elements in these sparse-value sets are all sent to the second Reduce nodes for the connection operation. A second Reduce node may be the same node as a first Reduce node, or a different one.

For the second sets corresponding to R_den, R_sden, S_den and S_rden, a vector update operation is performed for each connection element in the second set, starting from a predefined initial vector, to obtain the query vector corresponding to that second set. Specifically, an initial Bloom filter vector is defined for each of the sets R_den, R_sden, S_den and S_rden. The update operation is performed on the vector of R_den for each element of R'_den, on the vector of R_sden for each element of R'_sden, on the vector of S_den for each element of S'_den, and on the vector of S_rden for each element of S'_rden. This finally generates four Bloom filter vectors, namely the query vectors, and all four have the following property: if the query result lookup(key) for an element key in a Bloom filter vector is false, then key is certainly not in the set corresponding to that Bloom filter vector.
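A minimal Bloom filter with the property stated above might look like the sketch below. The class name, the vector size, the number of hash functions, the `insert` method name (standing for the update operation) and the use of SHA-256 to derive bit positions are all assumptions made for the example; the source does not specify them.

```python
import hashlib

class BloomFilter:
    """Bit vector with no false negatives: lookup() is False only for absent keys."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)   # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def insert(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def lookup(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

bf_r_den = BloomFilter()
for key in ["b", "k7", "k9"]:             # hypothetical elements of R'_den
    bf_r_den.insert(key)
```

A `True` result from `lookup` only means the key may be in the set, whereas a `False` result is definitive, which is the property the second-stage Map nodes rely on.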

The data operation system then enters the second-stage MapReduce process, which may proceed as follows:

In step 204, the central node redistributes the databases to m₂ second Map nodes and sends the first sets and the query vectors to the m₂ second Map nodes. The data must be redistributed to m₂ second Map nodes because the number of nodes performing Map operations and the workload in the second stage differ from those of the first stage, so nodes have to be assigned to the databases again.

Specifically, the master program in the central node divides database R and database S into M₂ blocks and distributes them to the m₂ Map nodes, each node processing M₂/m₂ blocks of data. The master program also sends the first sets R_den, R_sden, S_den and S_rden and the query vectors to the m₂ Map nodes; each Map node caches R_den, R_sden, S_den and S_rden in memory in order to perform the map operation on its data blocks.

Unlike the first-stage map operation on the data blocks, the second-stage map operation involves not only R.B and S.B but also R.A and S.C.

In step 205, each of the m₂ second Map nodes determines, from the records in its assigned second data block and the query vectors, the first set to which the connection element in each record belongs, and determines from the intersections of the first sets with respect to the connection element whether the connection element needs to take part in the connection operation.

After the first set to which the connection element in a record belongs has been determined from the records in the assigned second data block and the query vectors, a second key-value pair corresponding to the connection element may be generated. The second key-value pair comprises the connection element, a tag identifying the data source of the connection element, a tag identifying the first set to which the connection element belongs, and the non-connection element corresponding to the connection element. Specifically, the Map node extracts a record from the second data block, determines from the query vectors the first set to which the connection element in the record belongs, and generates a second key-value pair of the form (key, value), where key is the value of connection element B and value is a composite value of the form (tagS, tagF, val): tagS identifies whether the record comes from database R or database S, tagF identifies which of the six sets R_den, R_sden, R_thin, S_den, S_rden and S_thin the key belongs to, and val records the value of the non-connection element.

Specifically, the first set to which a connection element belongs may be determined from the query vectors as follows:

For a record in R, denote the value of R.B as key and R.A as val;

a) If the queries for key in the query vectors of both R_den and R_sden return false, then key ∈ R_thin and the generated second key-value pair is (key, 0:R_thin:val). That is, when the queries determine that the key of a record in R is neither in R_den nor in R_sden, key belongs to set R_thin, and the second key-value pair corresponding to key can be sent to Reduce for the connection operation in the conventional manner;

b) Otherwise, if the queries show that key is in neither set R_thin nor set R_den, then key ∈ R_sden. In this case, if key is not found in set S_den, no processing is performed; that is, the key is determined not to trigger a connection operation. This is because, according to the set operations in Table 2 above, R_sden has a non-empty intersection only with S_den. Therefore, when key ∈ R_sden, if the value of key is not in set S_den, the key is determined not to trigger a connection operation and its second key-value pair need not be transmitted. Otherwise, if key is in set S_den, the key is determined to trigger a connection operation, and its second key-value pair, (key, 0:R_sden:val), must be sent to a second Reduce node;

c) Otherwise, key ∈ R_den; that is, when key is in neither set R_thin nor set R_sden, key ∈ R_den. If key is not found in set S_rden, the key is determined not to trigger a connection operation and need not be passed to a second Reduce node. This is because R_den has a non-empty intersection only with S_rden, so when key ∈ R_den, if the value of key is not in set S_rden, the value is determined not to trigger a connection operation and need not be passed to a second Reduce node. Because key is also a high-frequency value, a large number of data records are spared from transmission, which clearly benefits network optimization. Otherwise, if key is in set S_rden, the value must trigger a connection operation, and the generated second key-value pair is (key, 0:S_rden:val);

Similarly, for a record in database S, denote the value of S.B as key and S.C as val;

a) If the queries for key in the query vectors of both S_den and S_rden return false, then key ∈ S_thin, the generated second key-value pair is (key, 1:S_thin:val), and the key is determined to trigger a connection operation;

b) Otherwise, if the queries show that key is in neither set S_thin nor set S_den, then key ∈ S_rden;

In this case, because S_rden has a non-empty intersection only with set R_den, when key ∈ S_rden, if the value is not in set R_den, the value is determined not to trigger a connection operation and need not be passed to Reduce;

c) Otherwise, when key is in neither set S_thin nor set S_rden, key ∈ S_den;

In this case, because S_den has a non-empty intersection only with set R_sden, when key ∈ S_den, if the value is not in set R_sden, the value is determined not to trigger a connection operation and need not be passed to Reduce;

Otherwise, if the value is in set R_sden, the second key-value pair that can be generated is (key, 1:R_sden:val).
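The filtering rules for a record from R can be sketched as below; the rules for S are symmetric. This is an illustration only. Plain Python sets stand in for the query vectors (anything supporting `in` would do), the function name `filter_r_record` is hypothetical, and the sketch tags each emitted pair with the set that the record's own key falls in, which is an assumption about tagF rather than something taken verbatim from the text.

```python
def filter_r_record(key, val, qv):
    """Return the second key-value pair for an R record, or None if it cannot join.

    qv maps set names to membership structures (stand-ins for query vectors).
    """
    if key not in qv["R_den"] and key not in qv["R_sden"]:
        return (key, (0, "R_thin", val))        # sparse value: always sent
    if key in qv["R_sden"]:
        # R_sden has a non-empty intersection only with S_den.
        return (key, (0, "R_sden", val)) if key in qv["S_den"] else None
    # Remaining case: key is in R_den, which intersects only S_rden.
    return (key, (0, "R_den", val)) if key in qv["S_rden"] else None

qv = {"R_den": {"b"}, "R_sden": {"c"}, "S_den": {"c"}, "S_rden": set()}
sent_thin = filter_r_record("a", "a1", qv)
sent_sden = filter_r_record("c", "a2", qv)
dropped = filter_r_record("b", "a3", qv)
```

In this example the record with key "b" is dropped at the Map side: "b" is dense in R but never appears in S, so none of its many records needs to cross the network.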

In this way, key-value pairs determined not to trigger a connection operation are not sent to Reduce, which reduces the data to be transmitted over the network and improves network transmission efficiency. Compared with the map-side join or semi-join used for data connection operations in the prior art, the present application places no limit on database size and is applicable to any scenario.

Therefore, an embodiment of the present invention provides a data connection optimization method. m₁ first Map nodes each count the records corresponding to the same connection element from the same database in the first data block of the database distributed by the central node, obtain statistics, and send the statistics to the central node, so that the central node sends the statistics to n₁ first Reduce nodes, the statistics of the same connection element being sent to the same first Reduce node. Each first Reduce node obtains from the statistics the record total for each connection element from the same database at that node, determines from the record total and a preset threshold the first set to which the connection element belongs, and sends the first set to the central node; the first set characterizes how frequently the value of the connection element occurs. The central node re-sorts the connection elements in the first sets to obtain the corresponding second sets, and performs a vector update operation for the connection elements in the second sets to obtain the query vectors corresponding to the second sets. The central node redistributes the databases to m₂ second Map nodes and sends the first sets and the query vectors to the m₂ second Map nodes. The m₂ second Map nodes each determine, from the records in the assigned second data block and the query vectors, the first set to which the connection element in each record belongs, and determine from the intersections of the first sets with respect to the connection element whether to perform the connection operation on the connection element. In this way, whether a connection element will trigger a connection operation, and therefore needs to be passed to a second Reduce node, is judged through the vector update operation, which reduces network transmission. Unlike the prior art, deciding whether to transmit records in order to reduce network transmission is not restricted to the case where one table is large and the other is small, so the method can solve the limited-applicability problem of existing connection methods for improving data connection efficiency.

The prior-art map-side join and semi-join for data connection operations are not only applicable to just some scenarios but also suffer from a load-balancing problem. This is because, in the MapReduce mode of operation, tasks in the Reduce stage are assigned according to the value of key, and records with the same key value are passed to the same Reduce node. For connection element B, if certain values of B have especially many data records, the Reduce nodes responsible for those values are assigned heavier tasks, so Reduce tasks are distributed unevenly and the load on the Reduce nodes is unbalanced.

Therefore, to solve the load-imbalance problem, and building on the above embodiment for reducing network transmission, after the second key-value pairs have been generated in the Map stage of the second stage, the second key-value pairs that need to be transmitted may be sent to P second Reduce nodes according to the following rules. The second key-value pairs of the connection elements of each set are sent to n₂ of the second Reduce nodes, and P is the total of the n₂ second Reduce nodes to which the second key-value pairs of the elements of all sets are sent.

Therefore, as shown in Fig. 3, the above method embodiment further comprises:

206. The m₂ second Map nodes each determine, from the first set to which a second key-value pair belongs, the range of second Reduce nodes corresponding to the second key-value pair, and send the second key-value pair to the n₂ second Reduce nodes within that range.

207. The n₂ second Reduce nodes each sort the second key-value pairs into different queues according to the first set to which each second key-value pair belongs, and perform the connection operation on the second key-value pairs in the queues.

In step 206, it is first defined that each second Reduce node processes P_avg records on average, where P_avg is the sum of the ratio of the file size of database R to the size of a single record in R and the ratio of the file size of database S to the size of a single record in S, divided by the number of second Reduce nodes. This can be expressed as: P_avg = [R file size / R single-record size + S file size / S single-record size] / P.

Then, if value:tagF in the second key-value pair is S_thin or R_thin, that is, if the connection element belongs to set S_thin or set R_thin, the second key-value pairs corresponding to the same connection element are sent to the same second Reduce node; in other words, they are sent to a second Reduce node for connection processing in the conventional manner. This is because, when key is a sparse value, distributing it to the second Reduce nodes in the conventional manner does not cause load imbalance;

If the connection element in the second key-value pair belongs to set R_den, the second key-value pair corresponding to the connection element is sent at random to any one second Reduce node whose number lies in a first range. The first range runs from the starting Reduce number to the number given by the sum of the starting Reduce number and the ratio of the record total of the connection element in the first database to the average number of records processed by each second Reduce node, which can be expressed as: [seq(key) mod P, (seq(key) + val_r/P_avg) mod P]. If the connection element in the second key-value pair belongs to set S_den, the second key-value pair corresponding to the connection element is sent at random to any one second Reduce node whose number lies in a second range. The second range runs from the starting Reduce number to the number given by the sum of the starting Reduce number and the ratio of the record total of the connection element in the second database to the average number of records processed by each second Reduce node, which can be expressed as: [seq(key) mod P, (seq(key) + val_s/P_avg) mod P].

When key is a high-frequency value, distributing it to a second Reduce node in the conventional manner would leave the node assigned that value with a far greater load than the other nodes. The present application therefore picks out the high-frequency values and assigns second Reduce nodes to each of them in proportion to the number of times the value occurs, alleviating the load-balancing problem caused by a skewed value distribution.

If the connection element belongs to set R_sden, the second key-value pair corresponding to the connection element is broadcast to all second Reduce nodes whose numbers lie in a third range. The third range runs from the starting second Reduce node number to the number given by the sum of the starting Reduce number and the ratio of the record total of the connection element in the second database to the average number of records processed by each second Reduce node, which can be expressed as: [seq(key) mod P, (seq(key) + val_s/P_avg) mod P]. If the connection element belongs to set S_rden, the second key-value pair corresponding to the connection element is broadcast to all second Reduce nodes whose numbers lie in a fourth range. The fourth range runs from the starting Reduce number to the number given by the sum of the starting Reduce number and the ratio of the record total of the connection element in the first database to the average number of records processed by each second Reduce node, which can be expressed as: [seq(key) mod P, (seq(key) + val_r/P_avg) mod P].

Because the keys in set R_sden are sparse elements in database R and the keys in set S_rden are sparse elements in database S, broadcasting the records in R_sden or S_rden to the second Reduce nodes puts little pressure on the network.
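The routing rules of step 206 can be sketched as follows. This is a minimal illustration under several assumptions that the text leaves open: the ratio count/P_avg is truncated to an integer, the range is capped so that no node is listed twice, the range wraps around modulo P, and sparse keys are routed by a caller-supplied hash value. All function and parameter names are hypothetical.

```python
import random

def average_records(r_file_size, r_record_size, s_file_size, s_record_size, p):
    """P_avg: average number of records handled by each of the P second Reduce nodes."""
    return (r_file_size / r_record_size + s_file_size / s_record_size) / p

def reducer_range(seq, count, p_avg, p):
    """Node numbers from seq mod P up to (seq + count / P_avg) mod P, wrapping around."""
    start = seq % p
    span = min(int(count / p_avg), p - 1)   # assumptions: truncate, never repeat a node
    return [(start + i) % p for i in range(span + 1)]

def route(tag_f, key_hash, seq, count, p_avg, p, rng=random):
    """Return the list of second Reduce node numbers that receive the pair."""
    if tag_f in ("R_thin", "S_thin"):
        return [key_hash % p]               # conventional hash partitioning
    nodes = reducer_range(seq, count, p_avg, p)
    if tag_f in ("R_den", "S_den"):
        return [rng.choice(nodes)]          # one random node inside the range
    return nodes                            # R_sden / S_rden: broadcast to the range

p_avg = average_records(2000, 10, 1000, 5, 8)       # (200 + 200) / 8 = 50.0
broadcast = route("S_rden", 0, seq=6, count=100, p_avg=p_avg, p=8)
single = route("R_den", 0, seq=6, count=100, p_avg=p_avg, p=8)
```

The caller passes val_r as `count` for R_den and S_rden and val_s for S_den and R_sden, so the dense side is spread over a range of nodes while the matching sparse side is copied to every node in that same range.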

In step 207, a second Reduce node reads the second key-value pairs assigned to it, sorts them into different queues according to the first-set tag carried in each second key-value pair, and performs the connection operation on the second key-value pairs in the queues.

Specifically, the connection operation is performed between the queue formed by the second key-value pairs whose connection elements belong to set R_den and the queue formed by the key-value pairs whose connection elements belong to set S_rden; between the queue for set R_sden and the queue for set S_den; and between the queue for set R_thin and the queue for set S_thin. The condition for the connection operation is that key is identical and tagS is different; sets whose intersection is empty need no connection operation.

For example, according to value:tagF in the second key-value pairs, the second key-value pairs are placed in six queues, with pairs having different tagF going into different queues; these six queues are denoted QR_den, QR_sden, QR_thin, QS_den, QS_rden and QS_thin. Then, according to the set intersections, the connection operation is performed between queues QR_den and QS_rden, QR_sden and QS_den, and QR_thin and QS_thin.
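The queue-based connection on a second Reduce node can be sketched as below. This is an illustrative sketch: the function name `reduce_join`, the output layout `(a, key, c)` and the in-memory dictionaries used as queues are assumptions made for the example.

```python
from collections import defaultdict
from itertools import product

# Only these queue pairs can have keys in common.
JOIN_PAIRS = [("R_den", "S_rden"), ("R_sden", "S_den"), ("R_thin", "S_thin")]

def reduce_join(pairs):
    """Join second key-value pairs of the form (key, (tagS, tagF, val)).

    Returns a list of (a, key, c) tuples, one per matching R/S record pair.
    """
    queues = defaultdict(lambda: defaultdict(list))    # tagF -> key -> [val]
    for key, (tag_s, tag_f, val) in pairs:
        queues[tag_f][key].append(val)                 # tagF selects the queue
    result = []
    for r_name, s_name in JOIN_PAIRS:
        for key, a_values in queues[r_name].items():
            c_values = queues[s_name].get(key, [])     # same key, other source
            result.extend((a, key, c) for a, c in product(a_values, c_values))
    return result

pairs = [("b", (0, "R_den", "a1")), ("b", (0, "R_den", "a2")),
         ("b", (1, "S_rden", "c1")),
         ("x", (0, "R_thin", "a3")), ("y", (1, "S_thin", "c2"))]
joined = reduce_join(pairs)
```

Since each queue pair always combines one R-side queue with one S-side queue, the requirement that tagS differ is satisfied by construction, and queues whose sets have an empty intersection are never compared.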

Therefore, the embodiment of the present invention also identifies how frequently each key occurs and assigns second Reduce nodes accordingly, alleviating the load-balancing problem caused by a skewed value distribution.

In the several embodiments provided by the present application, it should be understood that the disclosed terminal and method may be implemented in other ways.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated in one processing unit, each unit may exist physically on its own, or two or more units may be integrated in one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.

An integrated unit implemented in the form of software functional units may be stored in a computer-readable storage medium. The software functional units are stored in a storage medium and include several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform some of the steps of the methods described in the embodiments of the present invention. The storage medium includes a USB flash drive, a portable hard drive, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of their technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A data connection optimization method, characterized by comprising:

2. The data connection optimization method according to claim 1, characterized in that the m₁ first Map nodes each counting the records corresponding to the same connection element from the same database in the first data block of the database distributed by the central node, to obtain statistics, comprises:

3. The data connection optimization method according to claim 1 or 2, characterized in that, if the database comprises a first database and a second database, determining the first set to which the connection element belongs according to the record total and the preset threshold comprises:

4. The data connection optimization method according to claim 3, characterized in that the central node re-sorting the connection elements in the first set to obtain the corresponding second set, and performing a vector update operation for the connection elements in the second set to obtain the query vector corresponding to the second set, comprises:

5. The data connection optimization method according to claim 4, characterized in that, after the first set to which the connection element in the record belongs is determined according to the record in the assigned second data block and the query vector, the method further comprises:

6. The data connection optimization method according to claim 5, characterized in that determining whether to perform the connection operation on the connection element according to the intersection operation between the first sets with respect to the connection element comprises:

7. The data connection optimization method according to claim 6, characterized in that the method further comprises:

8. The data connection optimization method according to claim 7, characterized in that the m₂ second Map nodes each determining, from the first set to which the second key-value pair belongs, the range of second Reduce nodes corresponding to the second key-value pair, and sending the second key-value pair to the n₂ second Reduce nodes within the range, comprises:

If the connection element belongs to the set S_thin or the set R_thin, the second Map node sends the second key-value pairs corresponding to the same connection element to the same second Reduce node;

If the connection element belongs to the set S_rden, the second Map node broadcasts the second key-value pair corresponding to the connection element to all second Reduce nodes whose numbers lie in a fourth range, the fourth range running from the starting second Reduce number to the number given by the sum of the starting second Reduce number and the ratio of the record total of the connection element in the first database to the average number of records processed by each second Reduce node.

9. The data connection optimization method according to claim 8, characterized in that the average number of records processed by each second Reduce node is: the sum of the ratio of the file size of the first database to the size of a single record in the first database and the ratio of the file size of the second database to the size of a single record in the second database, divided by the number of second Reduce nodes.

10. The data connection optimization method according to claim 8 or 9, characterized in that sorting the second key-value pairs into different queues according to the first set to which each second key-value pair belongs, and performing the connection operation on the second key-value pairs in the queues, comprises:

11. A data operation system, characterized by comprising m₁ first Map nodes, n₁ first Reduce nodes, a central node and m₂ second Map nodes, wherein: