CN105095455A - Data connection optimization method and data operation system - Google Patents

Info

Publication number
CN105095455A
Authority
CN
China
Prior art keywords
connection element
record
key
database
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510446965.8A
Other languages
Chinese (zh)
Other versions
CN105095455B (en)
Inventor
王淑玲
冯伟斌
王志军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China United Network Communications Group Co Ltd
Original Assignee
China United Network Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China United Network Communications Group Co Ltd filed Critical China United Network Communications Group Co Ltd
Priority to CN201510446965.8A priority Critical patent/CN105095455B/en
Publication of CN105095455A publication Critical patent/CN105095455A/en
Application granted granted Critical
Publication of CN105095455B publication Critical patent/CN105095455B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24534Query rewriting; Transformation
    • G06F16/24549Run-time optimisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24553Query execution of query operations
    • G06F16/24558Binary matching operations
    • G06F16/2456Join operations

Abstract

Embodiments of the invention provide a data connection optimization method and a data operation system. The invention relates to the field of communications and addresses the limited applicability of existing methods for improving the efficiency of data connection (join) operations. The method divides the join operation into two MapReduce stages. In the first stage, the values of the connection element are counted and the frequency-value set to which each connection element belongs is determined. In the second stage, each Map node loads the query vectors into the memory of the computing node and uses them, together with the intersections between the frequency-value sets of the connection elements, to determine whether a given database record will take part in a join operation; key-value pairs of records that will not take part in a join operation are therefore not sent to the Reduce nodes. The method and the system are used for MapReduce-oriented data connection optimization.

Description

A data connection optimization method and data operation system
Technical field
The present invention relates to the field of communications, and in particular to a data connection optimization method and a data operation system.
Background art
In data processing, join operations on data are very common and time-consuming. For example, given two databases R and S, where R contains data entries A and B, written R(A, B), and S contains data entries B and C, written S(B, C), R ⋈ S denotes the join operation between R and S, with join condition R.B = S.B. MapReduce is currently the mainstream programming model in big data processing. It abstracts a data processing task into map tasks and reduce tasks: filtering of the data is completed in the map stage, and aggregation of the data is completed in the reduce stage.
In the MapReduce programming model, the simplest equi-join is the reduce-side join. In a reduce-side join, all elements of R and S must be transferred to the reduce stage, which consumes considerable network resources. Some of this data, however, need not be transferred: for example, a value of R.B that never occurs in S.B does not need to be sent to reduce. To make join operations in MapReduce more efficient, the industry has introduced join optimization methods such as the map-side join and the semijoin.
In a map-side join, the smaller of the two tables R and S is selected, say R, and replicated so that a copy resides in the memory of every map node; only the large table S is then scanned. For each record in S, the hash table is searched for a record with the same key; if one exists, the records are joined and output. This method is only suitable when one of the two tables to be joined is very large and the other is small enough to be held directly in memory. When both tables are large, the memory of the map nodes is exhausted, so the applicability of the method is limited. The semijoin method also selects a small table, say R. R.B is first extracted and saved to a file T, which is held in memory. In the map stage, T is copied to every map node and each value of S.B is checked: if the value is not in T, the corresponding record of S is filtered out. The remaining records are processed in the same way as in a map-side join, so the semijoin has the same limited applicability as the map-side join.
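The two prior-art methods described above can be illustrated with a minimal single-process sketch. It assumes R(A, B) and S(B, C) are small in-memory lists of tuples; the function names `map_side_join` and `semijoin_filter` are hypothetical and are not taken from any MapReduce framework.

```python
from collections import defaultdict

def map_side_join(small_r, big_s):
    """Join R(A, B) with S(B, C) on B by holding all of R in a hash table."""
    table = defaultdict(list)          # B value -> list of A values from R
    for a, b in small_r:
        table[b].append(a)
    # Scan only the large table S and probe the in-memory hash table.
    return [(a, b, c) for b, c in big_s for a in table.get(b, [])]

def semijoin_filter(small_r, big_s):
    """Drop the records of S whose B value never occurs in R.B (the file T)."""
    t = {b for _, b in small_r}        # T: the distinct values of R.B
    return [(b, c) for b, c in big_s if b in t]

R = [("a1", 1), ("a2", 2)]
S = [(1, "c1"), (3, "c3"), (1, "c4")]
joined = map_side_join(R, S)
filtered = semijoin_filter(R, S)
```

Both functions need the whole of R (or of R.B) in memory on every node, which is the limitation the passage above points out.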
Summary of the invention
Embodiments of the present invention provide a data connection optimization method and a data operation system that can solve the limited applicability of existing methods for improving data connection efficiency.
In a first aspect, a data connection optimization method is provided, comprising:
m1 first Map nodes each count, in the first data blocks of the databases distributed by the central node, the records corresponding to the same connection element originating from the same database, to obtain statistics, and send the statistics to the central node, so that the central node sends the statistics to n1 first Reduce nodes, wherein the statistics of the same connection element are sent to the same first Reduce node;
the n1 first Reduce nodes each obtain, from the statistics, the total number of records corresponding to each connection element originating from the same database at the respective first Reduce node, determine the first set to which the connection element belongs according to the total record count and a preset threshold, and send the first set to the central node, the first set characterizing how frequently the value of the connection element occurs;
the central node re-sorts the connection elements in the first set to obtain a corresponding second set, and performs a vector update operation on the connection elements in the second set to obtain the query vector corresponding to the second set;
the central node redistributes the databases to m2 second Map nodes, and sends the first set and the query vector to the m2 second Map nodes;
the m2 second Map nodes each determine, according to the records in the allocated second data blocks and the query vector, the first set to which the connection element in each record belongs, and determine whether to perform a join operation on the connection element according to an intersection operation on the connection element between the first sets.
With reference to the first aspect, in a first possible implementation of the first aspect, the m1 first Map nodes each counting, in the first data blocks of the databases distributed by the central node, the records corresponding to the same connection element originating from the same database, to obtain statistics, comprises:
the m1 first Map nodes each call the map function to extract from the first data block every record that contains the connection element, and output a first key-value pair, the first key-value pair comprising the connection element, an identifier characterizing the data source of the connection element, and a count, the count being 1;
for the same connection element in the first data block, the counts in the first key-value pairs having the same data-source identifier are accumulated, to obtain the number of records in the first data block that correspond to the same connection element from the same data source.
With reference to the first aspect or the first possible implementation of the first aspect, in a second possible implementation of the first aspect, if the databases comprise a first database and a second database, determining the first set to which the connection element belongs according to the total record count and the preset threshold comprises:
when, at the first Reduce node, the first total record count of the connection element originating from the first database is less than the threshold and the second total record count originating from the second database is less than the threshold: if the first total record count is not zero, determining that the first set to which the connection element belongs is the set R_thin; and if the second total record count is not zero, determining that the first set to which the connection element belongs is the set S_thin, the sets R_thin and S_thin comprising the connection elements that occur sparsely in the first database and the second database;
if the first total record count is greater than or equal to the threshold, and the first total record count is greater than or equal to the second total record count, determining that the first set to which the connection element belongs is the set R_den, the set R_den comprising the connection elements that occur with high frequency in the first database and whose number of occurrences there is greater than their number of occurrences in the second database; and, if the second total record count is not zero, determining that the first set to which the connection element belongs is the set S_rden, the set S_rden comprising the connection elements that occur in the second database but belong to the set R_den;
if the second total record count is greater than or equal to the threshold, and the second total record count is greater than or equal to the first total record count, determining that the first set to which the connection element belongs is the set S_den, the set S_den comprising the connection elements that occur with high frequency in the second database and whose number of occurrences there is greater than their number of occurrences in the first database; and, if the first total record count is not zero, determining that the first set to which the connection element belongs is the set R_sden, the set R_sden comprising the connection elements that occur in the first database but belong to the set S_den.
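The threshold rules above can be read as a small decision function. The following is a sketch only: the function name `classify` and the string labels are hypothetical, and the tie case (both totals equal and at or above the threshold) is resolved in favour of R_den, which is an assumption because the text lets both dense rules apply there.

```python
def classify(val_r, val_s, threshold):
    """Return the first sets that one connection element value belongs to.

    val_r, val_s: total record counts of the value in the first (R) and
    second (S) database; threshold: the preset threshold.
    """
    sets = set()
    if val_r < threshold and val_s < threshold:
        # Sparse on both sides.
        if val_r != 0:
            sets.add("R_thin")
        if val_s != 0:
            sets.add("S_thin")
    elif val_r >= threshold and val_r >= val_s:
        # Dense in R (assumption: ties with S are treated as dense in R).
        sets.add("R_den")
        if val_s != 0:
            sets.add("S_rden")
    else:
        # Remaining case: val_s >= threshold and val_s > val_r.
        sets.add("S_den")
        if val_r != 0:
            sets.add("R_sden")
    return sets
```

For example, with a threshold of 10, a value occurring 50 times in R and 4 times in S falls into R_den on the R side and S_rden on the S side.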
With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, the central node re-sorting the connection elements in the first set to obtain a corresponding second set, and performing a vector update operation on the connection elements in the second set to obtain the query vector corresponding to the second set, comprises:
for the first sets R_den, R_sden, S_den and S_rden, the central node re-sorts the connection elements in each set by the magnitude of their values, to obtain the corresponding second set, and records the position of each connection element in the second set;
for the second sets corresponding to the sets R_den, R_sden, S_den and S_rden, the vector update operation is performed on each connection element in the second set according to a predefined initial vector, to obtain the query vector corresponding to the second set.
With reference to the third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, after the first set to which the connection element in each record belongs is determined according to the records in the allocated second data blocks and the query vector, the method further comprises:
the m2 second Map nodes each generate a second key-value pair corresponding to the connection element, the second key-value pair comprising the connection element, an identifier characterizing the data source of the connection element, an identifier of the first set to which the connection element belongs, and the non-connection element corresponding to the connection element.
With reference to the fourth possible implementation of the first aspect, in a fifth possible implementation of the first aspect, determining whether to perform a join operation on the connection element according to the intersection operation on the connection element between the first sets comprises:
after the first set to which the connection element in the record belongs has been determined, if it is determined according to the query vector that the connection element does not belong to another first set whose intersection with that first set is non-empty, determining that no join operation is to be performed on the connection element.
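The pruning rule above can be sketched as follows. Two points are assumptions: plain Python sets stand in for the query vectors, whose construction is not detailed at this point, and the pairing of sets (R_den with S_rden, R_sden with S_den, R_thin with S_thin) is taken from the queue pairing described later for the Reduce side. The names `PARTNER` and `needs_join` are hypothetical.

```python
# Each first set and the set on the other side it can have a non-empty
# intersection with (assumed pairing, see the lead-in).
PARTNER = {
    "R_den": "S_rden", "S_rden": "R_den",
    "R_sden": "S_den", "S_den": "R_sden",
    "R_thin": "S_thin", "S_thin": "R_thin",
}

def needs_join(key, source, membership):
    """Decide whether a record with this connection element must be sent on.

    source: "R" or "S"; membership: set name -> collection of key values
    (a stand-in for looking the key up in the query vectors).
    """
    for name, keys in membership.items():
        if name.startswith(source) and key in keys:
            # Join only if the key also appears in the partner set.
            return key in membership.get(PARTNER[name], ())
    return False   # key not found in any set of its own source

membership = {"R_den": {7}, "S_rden": {7}, "R_thin": {1, 2}, "S_thin": {2}}
```

Here a record of R with key 1 is dropped on the Map side, because 1 is in R_thin but not in S_thin, so it never travels to a Reduce node.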
With reference to the fifth possible implementation of the first aspect, in a sixth possible implementation of the first aspect, the method further comprises:
the m2 second Map nodes each determine, according to the first set to which the second key-value pair belongs, the range of second Reduce nodes corresponding to the second key-value pair, and send the second key-value pair to the n2 second Reduce nodes within the range;
the n2 second Reduce nodes each sort the second key-value pairs into different queues according to the first sets to which they belong, and perform the join operation on the second key-value pairs in the queues.
With reference to the sixth possible implementation of the first aspect, in a seventh possible implementation of the first aspect, the m2 second Map nodes each determining, according to the first set to which the second key-value pair belongs, the range of second Reduce nodes corresponding to the second key-value pair, and sending the second key-value pair to the n2 second Reduce nodes within the range, comprises:
if the connection element belongs to the set S_thin or the set R_thin, the second key-value pairs corresponding to the same connection element are sent to the same second Reduce node;
if the connection element belongs to the set R_den, the second Map node sends the second key-value pair corresponding to the connection element at random to any second Reduce node whose number lies within a first range, the first range running from a starting Reduce number to the sum of the starting Reduce number and the ratio of the total record count of the connection element in the first database to the average number of records processed by each Reduce node;
if the connection element belongs to the set S_den, the second Map node sends the second key-value pair corresponding to the connection element at random to any second Reduce node whose number lies within a second range, the second range running from a starting Reduce number to the sum of the starting Reduce number and the ratio of the total record count of the connection element in the second database to the average number of records processed by each Reduce node;
if the connection element belongs to the set R_sden, the second Map node broadcasts the second key-value pair corresponding to the connection element to all second Reduce nodes whose numbers lie within a third range, the third range running from a starting Reduce node number to the sum of the starting Reduce number and the ratio of the total record count of the connection element in the second database to the average number of records processed by each Reduce node;
if the connection element belongs to the set S_rden, the second Map node broadcasts the second key-value pair corresponding to the connection element to all second Reduce nodes whose numbers lie within a fourth range, the fourth range running from a starting second Reduce number to the sum of the starting second Reduce number and the ratio of the total record count of the connection element in the first database to the average number of records processed by each second Reduce node.
With reference to the seventh possible implementation of the first aspect, in an eighth possible implementation of the first aspect, the average number of records processed by each second Reduce node is: the sum of the ratio of the file size of the first database to the size of a single record in the first database and the ratio of the file size of the second database to the size of a single record in the second database, divided by the number of second Reduce nodes.
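The routing ranges of the seventh implementation and the average of the eighth can be sketched together. Several details are assumptions because the text does not fix them: the ratio is rounded down, the starting Reduce number is supplied by the caller, ranges are inclusive, and no wrap-around past the last Reduce node is applied. The function names are hypothetical.

```python
import random

def avg_records_per_reducer(size_r, rec_size_r, size_s, rec_size_s, n2):
    """(|R| / record size of R + |S| / record size of S) / number of reducers."""
    return (size_r / rec_size_r + size_s / rec_size_s) / n2

def targets(set_name, start, val_r, val_s, avg, rng=random):
    """Reduce node numbers that receive a key-value pair of a dense set.

    R_den and S_rden use the record total in R; S_den and R_sden use the
    record total in S. Dense records go to one random node in the range,
    their sparse counterparts are broadcast to the whole range.
    """
    if set_name in ("R_den", "S_rden"):
        total = val_r
    elif set_name in ("S_den", "R_sden"):
        total = val_s
    else:
        raise ValueError("thin sets are routed by key, not by range")
    lo, hi = start, start + int(total // avg)   # assumption: floor the ratio
    if set_name in ("R_den", "S_den"):
        return [rng.randint(lo, hi)]            # random single target
    return list(range(lo, hi + 1))              # broadcast to the range

avg = avg_records_per_reducer(1000, 10, 2000, 20, 4)
```

Spreading the dense side at random while replicating the matching sparse side across the same range keeps any single Reduce node from receiving all records of a very frequent value.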
With reference to the seventh or the eighth possible implementation of the first aspect, in a ninth possible implementation of the first aspect, sorting the second key-value pairs into different queues according to the first sets to which they belong, and performing the join operation on the second key-value pairs in the queues, comprises:
performing the join operation between the queue formed by the second key-value pairs corresponding to connection elements belonging to the set R_den and the queue formed by the key-value pairs corresponding to connection elements belonging to the set S_rden;
performing the join operation between the queue formed by the second key-value pairs corresponding to connection elements belonging to the set R_sden and the queue formed by the key-value pairs corresponding to connection elements belonging to the set S_den;
performing the join operation between the queue formed by the second key-value pairs corresponding to connection elements belonging to the set R_thin and the queue formed by the key-value pairs corresponding to connection elements belonging to the set S_thin.
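The queue pairing above can be sketched for a single second Reduce node. The sketch assumes each incoming second key-value pair is reduced to a tuple (key, set name, non-connection element); the data-source identifier is omitted because the set name already implies it. The names `QUEUE_PAIRS` and `reduce_join` are hypothetical.

```python
from collections import defaultdict

# Queues that are joined with each other, as listed above.
QUEUE_PAIRS = [("R_den", "S_rden"), ("R_sden", "S_den"), ("R_thin", "S_thin")]

def reduce_join(pairs):
    """Sort key-value pairs into one queue per set, then join paired queues.

    pairs: iterable of (key, set_name, other), where other is the
    non-connection element (A for records of R, C for records of S).
    Returns joined tuples (A, key, C).
    """
    queues = defaultdict(lambda: defaultdict(list))   # set -> key -> values
    for key, set_name, other in pairs:
        queues[set_name][key].append(other)
    out = []
    for r_set, s_set in QUEUE_PAIRS:
        for key, a_vals in queues[r_set].items():
            for a in a_vals:
                for c in queues[s_set].get(key, []):
                    out.append((a, key, c))
    return out

received = [
    (7, "R_den", "a1"), (7, "R_den", "a2"), (7, "S_rden", "c1"),
    (2, "R_thin", "a3"), (2, "S_thin", "c2"), (5, "R_thin", "a4"),
]
result = reduce_join(received)
```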
In a second aspect, a data operation system is provided, comprising m1 first Map nodes, n1 first Reduce nodes, a central node, and m2 second Map nodes, wherein:
the m1 first Map nodes are each configured to count, in the first data blocks of the databases distributed by the central node, the records corresponding to the same connection element originating from the same database, to obtain statistics, and to send the statistics to the central node, so that the central node sends the statistics to the n1 first Reduce nodes, wherein the statistics of the same connection element are sent to the same first Reduce node;
the n1 first Reduce nodes are each configured to obtain, from the statistics, the total number of records corresponding to each connection element originating from the same database at the respective first Reduce node, to determine the first set to which the connection element belongs according to the total record count and a preset threshold, and to send the first set to the central node, the first set characterizing how frequently the value of the connection element occurs;
the central node is configured to re-sort the connection elements in the first set to obtain a corresponding second set, and to perform a vector update operation on the connection elements in the second set to obtain the query vector corresponding to the second set;
the central node is further configured to redistribute the databases to the m2 second Map nodes, and to send the first set and the query vector to the m2 second Map nodes;
the m2 second Map nodes are each configured to determine, according to the records in the allocated second data blocks and the query vector, the first set to which the connection element in each record belongs, and to determine whether to perform a join operation on the connection element according to an intersection operation on the connection element between the first sets.
Embodiments of the present invention provide a data connection optimization method and a data operation system. m1 first Map nodes each count, in the first data blocks of the databases distributed by the central node, the records corresponding to the same connection element originating from the same database, to obtain statistics, and send the statistics to the central node, so that the central node sends the statistics to n1 first Reduce nodes, the statistics of the same connection element being sent to the same first Reduce node. The n1 first Reduce nodes each obtain, from the statistics, the total number of records corresponding to each connection element originating from the same database at the respective first Reduce node, determine the first set to which the connection element belongs according to the total record count and a preset threshold, and send the first set to the central node, the first set characterizing how frequently the value of the connection element occurs. The central node re-sorts the connection elements in the first set to obtain a corresponding second set, and performs a vector update operation on the connection elements in the second set to obtain the query vector corresponding to the second set. The central node redistributes the databases to m2 second Map nodes and sends the first set and the query vector to them. The m2 second Map nodes each determine, according to the records in the allocated second data blocks and the query vector, the first set to which the connection element in each record belongs, and determine whether to perform a join operation on the connection element according to an intersection operation on the connection element between the first sets. In this way, the first set to which a connection element belongs is determined by means of the vector update operation, and the intersection operation on the connection element then determines whether the connection element will give rise to a join operation, that is, whether it needs to be passed to a Reduce node, which reduces network transmission. Unlike the prior art, the decision whether to transmit a record is not limited to the case where one table is large and the other is small, so the limited applicability of existing methods for improving data connection efficiency can be solved.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of a data operation system provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a data connection optimization method provided by an embodiment of the present invention;
Fig. 3 is a schematic flowchart of another data connection optimization method provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
An embodiment of the present invention provides a data operation system 1 which, as shown in Fig. 1, comprises m1 first Map nodes, n1 first Reduce nodes, a central node, and m2 second Map nodes, wherein:
the m1 first Map nodes are each configured to count, in the first data blocks of the databases distributed by the central node, the records corresponding to the same connection element originating from the same database, to obtain statistics, and to send the statistics to the central node, so that the central node sends the statistics to the n1 first Reduce nodes, wherein the statistics of the same connection element are sent to the same first Reduce node;
the n1 first Reduce nodes are each configured to obtain, from the statistics, the total number of records corresponding to each connection element originating from the same database at the respective first Reduce node, to determine the first set to which the connection element belongs according to the total record count and a preset threshold, and to send the first set to the central node, the first set characterizing how frequently the value of the connection element occurs;
the central node is configured to re-sort the connection elements in the first set to obtain a corresponding second set, and to perform a vector update operation on the connection elements in the second set to obtain the query vector corresponding to the second set;
the central node is further configured to redistribute the databases to the m2 second Map nodes, and to send the first set and the query vector to the m2 second Map nodes;
the m2 second Map nodes are each configured to determine, according to the records in the allocated second data blocks and the query vector, the first set to which the connection element in each record belongs, and to determine whether to perform a join operation on the connection element according to an intersection operation on the connection element between the first sets.
In this way, the vector update operation is used to determine whether a connection element will give rise to a join operation, that is, whether it needs to be passed to a second Reduce node, which reduces network transmission. Unlike the prior art, the decision whether to transmit a record is not limited to the case where one table is large and the other is small, so the limited applicability of existing methods for improving data connection efficiency can be solved.
Based on the above data operation system, the method performed by the data operation system is described below. Accordingly, an embodiment of the present invention provides a data connection optimization method which, as shown in Fig. 2, comprises:
201. m1 first Map nodes each count, in the first data blocks of the databases distributed by the central node, the records corresponding to the same connection element originating from the same database, to obtain statistics, and send the statistics to the central node, so that the central node sends the statistics to n1 first Reduce nodes, wherein the statistics of the same connection element are sent to the same first Reduce node.
202. The n1 first Reduce nodes each obtain, from the statistics, the total number of records corresponding to each connection element originating from the same database at the respective first Reduce node, determine the first set to which the connection element belongs according to the total record count and a preset threshold, and send the first set to the central node, the first set characterizing how frequently the value of the connection element occurs.
203. The central node re-sorts the connection elements in the first set to obtain a corresponding second set, and performs a vector update operation on the connection elements in the second set to obtain the query vector corresponding to the second set.
204. The central node redistributes the databases to m2 second Map nodes, and sends the first set and the query vector to the m2 second Map nodes.
205. The m2 second Map nodes each determine, according to the records in the allocated second data blocks and the query vector, the first set to which the connection element in each record belongs, and determine whether to perform a join operation on the connection element according to an intersection operation on the connection element between the first sets.
Specifically, the embodiment of the present invention uses a two-stage MapReduce process to determine whether the records of a given connection element will give rise to a join operation, that is, whether they need to be passed to a Reduce node, so as to reduce network transmission. The first-stage MapReduce mainly counts the values of the connection element, and the second-stage MapReduce completes the join operation according to the statistics of the first stage.
Suppose there are two databases R and S, where R contains data entries A and B, written R(A, B), and S contains data entries B and C, written S(B, C). R ⋈ S denotes the join operation between R and S, with join condition R.B = S.B. The first-stage MapReduce processes only R.B and S.B, that is, only the connection element; the data transferred between nodes likewise relates only to the connection element B and is independent of R.A and S.C.
In a specific implementation, taking the above database R as the first database and database S as the second database as an example, before step 201, that is, before the m1 first Map nodes obtain the statistics, the main program on the central node may first divide database R and database S into M1 data blocks and distribute them to the m1 first Map nodes, each first Map node processing M1/m1 data blocks, i.e., M1/m1 first data blocks.
In step 201, each of the m_1 first Map nodes counts, within the first data blocks of database R and database S distributed to it by the central node, the records corresponding to the same connection element B originating from the same database, to obtain the statistics.
Specifically, a first Map node may call the map function to extract from the first data block each record that contains the connection element B and output a first key-value pair. The first key-value pair comprises the connection element B, an identifier characterizing the data source of the connection element, and a count, where the count is 1.
For example, a first Map node calls the map function and extracts each record of database R or database S. For a record in database R it outputs (key, 0:1); for a record in database S it outputs (key, 1:1). Here key is the value of the connection element B, and the value part has the form tag:val, where tag identifies the data source (0 means database R, 1 means database S) and the count 1 indicates that this value of B has occurred once.
The first Map node then, for each connection element in the first data block, accumulates the counts of the first key-value pairs whose data-source identifiers are the same, obtaining the number of records in the first data block that correspond to the same connection element from the same data source. This stage may be called the shuffle stage: for identical keys, if value.tag is the same, value.val is accumulated. Finally, when all map tasks have finished, an intermediate file in the format (key, tag:val) is generated, where tag identifies whether the data source of the connection element in the first data block is R or S, and val is the number of records output to the intermediate file that share the same key and tag.
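As a non-limiting illustration, the first-stage map output and local accumulation described above can be sketched as follows. The names `map_count` and `combine`, and the representation of a data block as a list of (tag, key) pairs, are assumptions made for this sketch.

```python
from collections import Counter

TAG_R, TAG_S = 0, 1  # data-source tags from the text: 0 = database R, 1 = database S


def map_count(block):
    """Emit (key, tag, 1) for every record; only the connection element B is read."""
    for tag, key in block:
        yield key, tag, 1


def combine(pairs):
    """Accumulate val over pairs that share both key and tag."""
    totals = Counter()
    for key, tag, val in pairs:
        totals[(key, tag)] += val
    return dict(totals)


# One first data block holding records from both databases.
block = [(TAG_R, "b1"), (TAG_R, "b1"), (TAG_S, "b1"), (TAG_S, "b2")]
stats = combine(map_count(block))
```

Each entry of `stats` corresponds to one line (key, tag:val) of the intermediate file.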
All first Map nodes then send the generated data to the central node (master program), and the Reduce process starts. The central node, using a predefined hash function, sends the statistics to the n_1 first Reduce nodes according to the hash of the connection element, so that the statistics of the same connection element, that is, the corresponding data, are sent to the same first Reduce node.
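As a non-limiting illustration, the hash-based distribution of statistics might look like the sketch below. The choice of CRC32 as the hash function and the names `reducer_for` and `partition` are assumptions; the text only requires that a defined hash function send the statistics of the same connection element to the same first Reduce node.

```python
import zlib


def reducer_for(key, n1):
    """Pick a first Reduce node from a stable hash of the connection element."""
    return zlib.crc32(key.encode("utf-8")) % n1


def partition(stats, n1):
    """Split {(key, tag): val} statistics into one bucket per first Reduce node."""
    buckets = [dict() for _ in range(n1)]
    for (key, tag), val in stats.items():
        buckets[reducer_for(key, n1)][(key, tag)] = val
    return buckets


stats = {("b1", 0): 2, ("b1", 1): 1, ("b2", 1): 1}
buckets = partition(stats, 3)
```

Because the node is chosen from key alone, the R-side and S-side counts of one key always meet in the same bucket.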
In step 202, each first Reduce node obtains from the statistics the total number of records corresponding to each connection element originating from the same database. Specifically, the Reduce task of a first Reduce node examines the received data and accumulates val over the records having the same key and tag. Writing the accumulated result as (key, tag:val), it may comprise (key, 0:val_r) and (key, 1:val_s), where key is the value of the connection element B, 0 and 1 denote the data source, val_r is the total number of records in database R whose value of B is key, and val_s is the total number of records in database S whose value of B is key.
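As a non-limiting illustration, the accumulation performed by a first Reduce node can be sketched as follows. The name `reduce_totals` and the (key, tag, val) tuple layout are assumptions made for this sketch.

```python
from collections import defaultdict


def reduce_totals(received):
    """Accumulate val per (key, tag) and return {key: (val_r, val_s)}."""
    totals = defaultdict(lambda: [0, 0])
    for key, tag, val in received:
        totals[key][tag] += val  # tag 0 indexes val_r, tag 1 indexes val_s
    return {key: tuple(vals) for key, vals in totals.items()}


# Partial counts for the same key arriving from different first Map nodes.
received = [("b1", 0, 2), ("b1", 0, 3), ("b1", 1, 1), ("b2", 1, 4)]
totals = reduce_totals(received)
```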
At this point the Reduce task of the first stage ends and outputs the final result; that is, in step 202 the first set to which each connection element belongs is determined from the record totals and a preset threshold.
Specifically, the rules for recording the final result may be:
For each connection element key:
When, in a first Reduce node, the first record total of the connection element originating from the first database is less than the threshold and the second record total originating from the second database is also less than the threshold: if the first record total is not zero, the first set to which the connection element belongs is determined to be set R_thin; if the second record total is not zero, the first set to which the connection element belongs is determined to be set S_thin. Sets R_thin and S_thin contain the connection elements that occur sparsely in the first database and the second database. This may also be expressed as: when val_r < Θ && val_s < Θ, if val_r ≠ 0, record R_thin = R_thin ∪ {key}; if val_s ≠ 0, record S_thin = S_thin ∪ {key}, where Θ denotes the threshold. If val_r and val_s are both non-zero, the key is recorded in both R_thin and S_thin;
Otherwise, if the first record total is greater than or equal to the threshold and greater than or equal to the second record total, the first set to which the connection element belongs is determined to be set R_den, which contains the connection elements that occur with high frequency in the first database and occur there at least as many times as in the second database; and, if the second record total is not zero, the first set to which the connection element belongs is also determined to be set S_rden, which contains the connection elements that occur in the second database but belong to set R_den. This may also be expressed as: if val_r ≥ Θ && val_r ≥ val_s, record R_den = R_den ∪ {(key, val_r)}; if val_s ≠ 0, record S_rden = S_rden ∪ {(key, val_r)};
If the second record total is greater than or equal to the threshold and greater than or equal to the first record total, the first set to which the connection element belongs is determined to be set S_den, which contains the connection elements that occur with high frequency in the second database and occur there at least as many times as in the first database; and, if the first record total is not zero, the first set to which the connection element belongs is also determined to be set R_sden, which contains the connection elements that occur in the first database but belong to set S_den. This may also be expressed as: if val_s ≥ Θ && val_s ≥ val_r, record S_den = S_den ∪ {(key, val_s)}; if val_r ≠ 0, record R_sden = R_sden ∪ {(key, val_s)};
Thus, after the first-stage MapReduce has finished, six sets in total have been generated; their identifiers and meanings are shown in Table 1 below:
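As a non-limiting illustration, the recording rules above can be sketched as follows. The name `classify` and the dictionary layout are assumptions. The rules leave the tie case val_r = val_s ≥ Θ open, since both dense rules match; the sketch assumes the R_den rule takes precedence.

```python
def classify(totals, theta):
    """Assign each key to the six first sets from (val_r, val_s) and threshold theta."""
    sets = {"R_thin": set(), "S_thin": set(),
            "R_den": {}, "S_rden": {}, "S_den": {}, "R_sden": {}}
    for key, (val_r, val_s) in totals.items():
        if val_r < theta and val_s < theta:
            if val_r != 0:
                sets["R_thin"].add(key)
            if val_s != 0:
                sets["S_thin"].add(key)
        elif val_r >= theta and val_r >= val_s:
            sets["R_den"][key] = val_r
            if val_s != 0:
                sets["S_rden"][key] = val_r  # stores val_r, as in the rule above
        else:
            # Here val_s >= theta and val_s > val_r necessarily hold.
            sets["S_den"][key] = val_s
            if val_r != 0:
                sets["R_sden"][key] = val_s  # stores val_s, as in the rule above
    return sets


totals = {"a": (1, 2), "b": (10, 3), "c": (0, 7), "d": (2, 0), "e": (4, 9)}
sets = classify(totals, theta=5)
```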
Table 1
From the above definitions and the way the sets are generated, the intersections of the sets with respect to key are as shown in Table 2 below:
Table 2
Further, the first Reduce nodes send the first sets to the central node, and the second-stage MapReduce, which performs the join operation, is started.
However, before the second-stage MapReduce is executed, the master program of the central node first enters a preparation stage, namely step 203, in which the results of the first-stage Reduce are processed further as follows:
The central node re-sorts the connection elements in the first sets to obtain the corresponding second sets, and performs a vector insert operation on the connection elements in the second sets to obtain the query vectors corresponding to the second sets.
Specifically, for the first sets R_den, R_sden, S_den and S_rden, the central node may re-sort the connection elements by value to obtain the corresponding second sets and record the position of each connection element in its second set, denoted seq(key). The generated second sets may be denoted R'_den, R'_sden, S'_den and S'_rden, and their elements may be expressed as {(key, seq(key))}.
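As a non-limiting illustration, the re-sorting and position recording might be sketched as follows. The name `with_positions` and the 0-based numbering of seq(key) are assumptions; the text does not state where the numbering starts.

```python
def with_positions(keys):
    """Sort connection elements by value and map each key to its position seq(key)."""
    return {key: seq for seq, key in enumerate(sorted(keys))}


# Second set built from the keys of a first set such as R_den.
second_set = with_positions({"k3", "k1", "k2"})
```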
Sets R_thin and S_thin are not processed here because, for these sparse values, the key-value pairs corresponding to the connection elements can all be sent to the second Reduce nodes for the join operation, as in the prior art. A second Reduce node may be the same node as a first Reduce node or a different one.
For the second sets corresponding to R_den, R_sden, S_den and S_rden, a vector insert operation is performed on each connection element of the second set according to a predefined initial vector, giving the query vector corresponding to that second set. Specifically, suppose an initial BloomFilter vector is defined for each of R_den, R_sden, S_den and S_rden. Each element of R'_den is inserted into the vector for R_den, each element of R'_sden into the vector for R_sden, each element of S'_den into the vector for S_den, and each element of S'_rden into the vector for S_rden. This finally produces four BloomFilter vectors, namely the query vectors, all of which have the following property: if the query result lookup(key) of an element key in a BloomFilter vector is false, then key is definitely not in the set corresponding to that BloomFilter vector.
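As a non-limiting illustration, a Bloom filter with the insert and lookup operations used above can be sketched as follows. The vector length, the number of hash functions and the use of SHA-256 are assumptions; the text does not specify them.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: lookup() may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def insert(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def lookup(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


bf_r_den = BloomFilter()
for key in ("b1", "b2"):
    bf_r_den.insert(key)
```

A lookup that returns false proves the key was never inserted, which is the property the second-stage Map nodes rely on.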
The data operation system then enters the second-stage MapReduce process, which may specifically be as follows:
In step 204, the central node redistributes the databases to m_2 second Map nodes and sends the first sets and the query vectors to the m_2 second Map nodes. The data must be redistributed to the m_2 second Map nodes because the number of nodes performing Map operations and the workload in the second stage differ from those of the first stage, so nodes must be assigned to the databases again.
Specifically, the master program in the central node again divides database R and database S into M_2 blocks and distributes them to the m_2 second Map nodes, each of which processes M_2/m_2 data blocks. The master program also sends the first sets R_den, R_sden, S_den and S_rden together with the query vectors to the m_2 second Map nodes. Each second Map node caches R_den, R_sden, S_den and S_rden in memory in order to perform the map operation on its data blocks.
Unlike the first-stage map operation on the data blocks, the second-stage map operation involves not only R.B and S.B but also R.A and S.C.
In step 205, each of the m_2 second Map nodes determines, from the records in the second data blocks distributed to it and the query vectors, the first set to which the connection element in each record belongs, and determines from the intersections between the first sets with respect to the connection element whether to perform the join operation on the connection element.
After the first set to which the connection element in a record belongs has been determined from the records in the distributed second data blocks and the query vectors, a second key-value pair corresponding to the connection element may be generated. The second key-value pair comprises the connection element, an identifier characterizing the data source of the connection element, an identifier of the first set to which the connection element belongs, and the non-connection element corresponding to the connection element. Specifically, the second Map node extracts a record from the second data block, determines from the query vectors the first set to which the connection element in the record belongs, and generates a second key-value pair of the form (key, value), where key is the value of the connection element B and value is a composite value of the form (tagS, tagF, val): tagS identifies whether the record originates from database R or database S, tagF identifies which of the six sets R_den, R_sden, R_thin, S_den, S_rden and S_thin the key belongs to, and val records the value of the non-connection element.
Specifically, the process of determining from the query vectors the first set to which a connection element belongs may be:
For a record in R, denote the value of R.B as key and the value of R.A as val;
a) If the queries show that key is in neither R_den nor R_sden, then key ∈ R_thin and the generated second key-value pair is (key, 0:R_thin:val). That is, when the key of a record in R is found by query to be neither in R_den nor in R_sden, key belongs to set R_thin, and the second key-value pair corresponding to key can be sent to Reduce for the join operation in the conventional way;
b) Otherwise, if the query shows that key is not in R_den, then key ∈ R_sden; that is, if key is in neither R_thin nor R_den, key is in set R_sden. If the query further shows that key is not in S_den, no processing is performed, that is, key is determined not to give rise to a join operation. This is because, according to the set operations in Table 2 above, R_sden has a non-empty intersection only with S_den. Therefore, when key ∈ R_sden, if the value of key is not in S_den, key is determined not to give rise to a join operation and the second key-value pair corresponding to key need not be transmitted. Otherwise, if key is in S_den, key is determined to give rise to a join operation, and its second key-value pair, (key, 0:R_sden:val), must be sent to a second Reduce node;
c) Otherwise, key ∈ R_den; that is, when key is in neither R_thin nor R_sden, key ∈ R_den. If the query shows that key is not in S_rden, key is determined not to give rise to a join operation and need not be passed to a second Reduce node. This is because R_den has a non-empty intersection only with S_rden, so that when key ∈ R_den and the value of key is not in S_rden, the value is determined not to give rise to a join operation and need not be passed to a second Reduce node. Because such a key is also a high-frequency value, a large number of data records are spared from transmission, which clearly benefits network optimization. Otherwise, if key is in S_rden, the value gives rise to a join operation and the generated second key-value pair is (key, 0:S_rden:val);
Similarly, for a record in database S, denote the value of S.B as key and the value of S.C as val;
a) If the queries show that key is in neither S_den nor S_rden, then key ∈ S_thin, the generated second key-value pair is (key, 1:S_thin:val), and key is determined to give rise to a join operation;
b) Otherwise, if the query shows that key is not in S_den, then key ∈ S_rden;
If the query further shows that key is not in R_den, then, because S_rden has a non-empty intersection only with R_den, when key ∈ S_rden and the value is not in R_den, the value is determined not to give rise to a join operation and need not be passed to Reduce;
c) Otherwise, when neither of the above conditions holds, key ∈ S_den;
If the query shows that key is not in R_sden, then, because S_den has a non-empty intersection only with R_sden, when key ∈ S_den and the value is not in R_sden, the value is determined not to give rise to a join operation and need not be passed to Reduce;
Otherwise, if the value is in R_sden, the second key-value pair that can be generated is (key, 1:R_sden:val).
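As a non-limiting illustration, the decision made by a second Map node for one record might be sketched as follows. Several things here are assumptions: the name `map_record`, the use of plain Python sets in place of the BloomFilter query vectors (so membership is exact), and the choice to let tagF always name the set the key itself belongs to, following the definition of tagF given for the second key-value pair.

```python
TAG_R, TAG_S = 0, 1

# Per data source: (dense set, dense-in-the-other-database set, sparse set).
OWN = {TAG_R: ("R_den", "R_sden", "R_thin"), TAG_S: ("S_den", "S_rden", "S_thin")}
# The only set with which each dense set has a non-empty intersection.
PARTNER = {"R_den": "S_rden", "R_sden": "S_den", "S_den": "R_sden", "S_rden": "R_den"}


def map_record(tag, key, val, filters):
    """Return the second key-value pair for a record, or None if it cannot join."""
    den, other_den, thin = OWN[tag]
    if key not in filters[den] and key not in filters[other_den]:
        return key, (tag, thin, val)  # case a): sparse keys are always forwarded
    owner = other_den if key not in filters[den] else den  # case b) or case c)
    if key not in filters[PARTNER[owner]]:
        return None  # no matching record on the other side: nothing is transmitted
    return key, (tag, owner, val)


filters = {"R_den": {"b"}, "S_rden": {"b"}, "S_den": {"c", "e"}, "R_sden": {"e"}}
```

With these filters, a record of S whose key is "c" is dropped at the Map side because "c" never occurs in R, which is where the saving in network transmission comes from.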
In this way, key-value pairs determined not to give rise to a join operation are not sent to Reduce, which reduces the data that must be transmitted over the network and improves network transmission efficiency. Compared with the map-side join or SemiJoin used for data join operations in the prior art, the present application places no limit on the size of the databases and is applicable to any scenario.
Therefore, the embodiment of the present invention provides a data connection optimization method. The m_1 first Map nodes each count, within the first data blocks of the databases distributed by the central node, the records corresponding to the same connection element originating from the same database, to obtain statistics, and send the statistics to the central node, so that the central node sends the statistics to the n_1 first Reduce nodes, the statistics of the same connection element being sent to the same first Reduce node. The first Reduce nodes each obtain from the statistics the record total of the records corresponding to each connection element originating from the same database in the respective first Reduce node, determine from the record total and a preset threshold the first set to which the connection element belongs, and send the first set to the central node, the first set characterizing how frequently the values of the connection element occur. The central node re-sorts the connection elements in the first sets to obtain the corresponding second sets, and performs a vector insert operation on the connection elements in the second sets to obtain the query vectors corresponding to the second sets. The central node redistributes the databases to m_2 second Map nodes and sends the first sets and the query vectors to the m_2 second Map nodes. The m_2 second Map nodes each determine, from the records in the distributed second data blocks and the query vectors, the first set to which the connection element in each record belongs, and determine from the intersections between the first sets with respect to the connection element whether to perform the join operation on the connection element. In this way, the query vectors obtained by the vector insert operation are used to judge whether a connection element will give rise to a join operation, that is, whether it needs to be passed to a second Reduce node, so as to reduce network transmission. The method is not limited, as the prior art is, to deciding whether to transmit records in order to reduce network transmission only when one table is large and the other is small, and can therefore solve the problem that existing methods for improving the efficiency of data joins have a limited scope of application.
The map-side join or SemiJoin used for data join operations in the prior art not only applies to only some scenarios but also suffers from a load-balancing problem. This is because, in the Map and Reduce mode of operation itself, the Reduce stage assigns tasks by the value of key: records with the same key value are passed to the same Reduce node. Thus, for the connection element B, if certain values of B have especially many data records, the Reduce nodes responsible for those values are assigned heavier tasks, so that Reduce tasks are distributed unevenly and the load on the Reduce nodes is unbalanced.
Therefore, to solve the load imbalance problem, on top of the above embodiment for reducing network transmission, after the second key-value pairs have been generated in the Map stage of the second stage, the second key-value pairs that need to be transmitted may be sent to P second Reduce nodes according to the following rules: the second key-value pairs of connection elements of different sets are sent to the corresponding n_2 second Reduce nodes among them, and P is the total number of second Reduce nodes to which the second key-value pairs of the elements of all the sets are sent.
Therefore, as shown in Figure 3, the above method embodiment further comprises:
206. The m_2 second Map nodes each determine, from the first set to which a second key-value pair belongs, the range of second Reduce nodes corresponding to the second key-value pair, and send the second key-value pair to the n_2 second Reduce nodes within that range.
207. The n_2 second Reduce nodes each sort the second key-value pairs into different queues according to the first set to which each second key-value pair belongs, and perform the join operation on the second key-value pairs in the queues.
In step 206, it is first defined that each second Reduce node processes P_avg records on average, where P_avg is the sum of the ratio of the file size of database R to the size of a single record in database R and the ratio of the file size of database S to the size of a single record in database S, divided by the number of second Reduce nodes. This may be expressed as: P_avg = [R file size / R single-record size + S file size / S single-record size] / P.
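As a non-limiting illustration, the definition of P_avg can be written out directly. The function and parameter names are assumptions, and the sizes are taken to be in the same unit (for example bytes).

```python
def average_records_per_reducer(r_file_size, r_record_size,
                                s_file_size, s_record_size, p):
    """P_avg = (R file size / R record size + S file size / S record size) / P."""
    return (r_file_size / r_record_size + s_file_size / s_record_size) / p


# 1000-byte R with 10-byte records, 600-byte S with 20-byte records, P = 4 nodes.
p_avg = average_records_per_reducer(1000, 10, 600, 20, 4)
```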
Thus, if value:tagF in the second key-value pair is S_thin or R_thin, that is, if the connection element belongs to set S_thin or set R_thin, the second key-value pairs corresponding to the same connection element are sent to the same second Reduce node; in other words, they are sent to a second Reduce node for join processing in the conventional way. This is because, when key is a sparse value, distributing it to the second Reduce nodes in the conventional way does not cause load imbalance;
If the connection element in the second key-value pair belongs to set R_den, the second key-value pair corresponding to the connection element is sent at random to any one second Reduce node whose number lies in a first range. The first range runs from the starting Reduce number to the number given by the sum of the starting Reduce number and the ratio of the record total of the connection element in the first database to the number of records processed on average by each second Reduce node, and may be expressed as [seq(key) mod P, (seq(key) + val_r/P_avg) mod P]. If the connection element in the second key-value pair belongs to set S_den, the second key-value pair corresponding to the connection element is sent at random to any one second Reduce node whose number lies in a second range. The second range runs from the starting Reduce number to the number given by the sum of the starting Reduce number and the ratio of the record total of the connection element in the second database to the number of records processed on average by each second Reduce node, and may be expressed as [seq(key) mod P, (seq(key) + val_s/P_avg) mod P].
Thus, when key is a high-frequency value, distributing it to a second Reduce node in the conventional way would make the load of the second Reduce node that receives the high-frequency value far greater than that of the other nodes. For this reason, the present application picks out the high-frequency values and assigns second Reduce nodes to each of them as needed, according to the number of times the value occurs, which alleviates the load-balancing problem caused by an uneven distribution of values.
If the connection element belongs to set R_sden, the second key-value pair corresponding to the connection element is broadcast to all second Reduce nodes whose numbers lie in a third range. The third range runs from the starting second Reduce node number to the number given by the sum of the starting Reduce number and the ratio of the record total of the connection element in the second database to the number of records processed on average by each second Reduce node, and may be expressed as [seq(key) mod P, (seq(key) + val_s/P_avg) mod P]. If the connection element belongs to set S_rden, the second key-value pair corresponding to the connection element is broadcast to all second Reduce nodes whose numbers lie in a fourth range. The fourth range runs from the starting Reduce number to the number given by the sum of the starting Reduce number and the ratio of the record total of the connection element in the first database to the number of records processed on average by each second Reduce node, and may be expressed as [seq(key) mod P, (seq(key) + val_r/P_avg) mod P].
Thus, since the keys in set R_sden are sparse elements in database R and the keys in set S_rden are sparse elements in database S, choosing to broadcast the records in set R_sden or set S_rden to the second Reduce nodes puts little pressure on the network.
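As a non-limiting illustration, the range computation and the random or broadcast routing described above might be sketched as follows. The names `reducer_range` and `route` are assumptions, as are rounding val/P_avg down to an integer, treating the range as wrapping around modulo P, and capping it at P nodes; the text gives only the two end points.

```python
import random


def reducer_range(seq, val, p_avg, p):
    """Node numbers from seq mod P to (seq + val / P_avg) mod P, wrapping around."""
    span = min(int(val // p_avg), p - 1)  # assumed: floor, and never more than P nodes
    start = seq % p
    return [(start + i) % p for i in range(span + 1)]


def route(set_name, seq, val, p_avg, p, rng=random):
    """Return the second Reduce node numbers that receive one second key-value pair."""
    nodes = reducer_range(seq, val, p_avg, p)
    if set_name in ("R_den", "S_den"):
        return [rng.choice(nodes)]  # dense side: one random node in the range
    if set_name in ("R_sden", "S_rden"):
        return nodes  # sparse side: broadcast to every node in the range
    raise ValueError("R_thin and S_thin keys are routed by the conventional hash")


dense_target = route("R_den", seq=1, val=40, p_avg=32.5, p=4, rng=random.Random(0))
broadcast_targets = route("S_rden", seq=1, val=40, p_avg=32.5, p=4)
```

Because S_rden stores val_r and R_sden stores val_s for each key, both sides of a dense key compute the same range, so every randomly placed dense record meets a broadcast copy of its partner.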
In step 207, a second Reduce node reads the second key-value pairs assigned to it, sorts them into different queues according to the identifier, carried in each second key-value pair, of the first set to which it belongs, and performs the join operation on the second key-value pairs in the queues.
Specifically, the join operation is performed between the queue formed by the second key-value pairs whose connection elements belong to set R_den and the queue formed by the key-value pairs whose connection elements belong to set S_rden; between the queue formed by the second key-value pairs whose connection elements belong to set R_sden and the queue formed by the key-value pairs whose connection elements belong to set S_den; and between the queue formed by the second key-value pairs whose connection elements belong to set R_thin and the queue formed by the key-value pairs whose connection elements belong to set S_thin. The condition for the join operation is, of course, that key is the same and tagS is different; sets whose intersection is empty need not be joined.
For example, the second key-value pairs are placed into six queues according to value:tagF, pairs with different tagF going into different queues; denote the six queues QR_den, QR_sden, QR_thin, QS_den, QS_rden and QS_thin. Then, according to the set intersections, the join operation is performed on queues QR_den and QS_rden, QR_sden and QS_den, and QR_thin and QS_thin.
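As a non-limiting illustration, the queue-based join on a second Reduce node can be sketched as follows. The name `reduce_join`, the output layout (A, B, C) and the assumption that tagF names the set the key itself belongs to are choices made for this sketch.

```python
from collections import defaultdict

# Queue pairs with a non-empty intersection; all other combinations are skipped.
JOIN_PAIRS = [("R_den", "S_rden"), ("R_sden", "S_den"), ("R_thin", "S_thin")]


def reduce_join(pairs):
    """Sort second key-value pairs into queues by tagF, then join matching queues."""
    queues = defaultdict(list)
    for key, (tag_s, tag_f, val) in pairs:
        queues[tag_f].append((key, val))
    joined = []
    for r_queue, s_queue in JOIN_PAIRS:
        s_by_key = defaultdict(list)
        for key, c in queues[s_queue]:
            s_by_key[key].append(c)
        for key, a in queues[r_queue]:
            for c in s_by_key.get(key, []):
                joined.append((a, key, c))  # one output row R.A, B, S.C
    return joined


pairs = [("b", (0, "R_den", "a1")), ("b", (0, "R_den", "a2")),
         ("b", (1, "S_rden", "c1")),
         ("x", (0, "R_thin", "a3")), ("y", (1, "S_thin", "c2"))]
rows = reduce_join(pairs)
```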
Therefore, the embodiment of the present invention also identifies the occurrence frequency of each key and assigns second Reduce nodes to it as needed, alleviating the load-balancing problem caused by an uneven distribution of values.
In the several embodiments provided in the present application, it should be understood that the disclosed terminal and method may be implemented in other ways.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of hardware plus software functional units.
The above integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform some of the steps of the methods described in the embodiments of the present invention. The aforesaid storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM for short), a random access memory (Random Access Memory, RAM for short), a magnetic disk or an optical disc.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some of their technical features, and that such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (11)

1. A data connection optimization method, characterized by comprising:
m_1 first Map nodes each count, within the first data blocks of a database distributed by a central node, the records corresponding to the same connection element originating from the same database, to obtain statistics, and send the statistics to the central node, so that the central node sends the statistics to n_1 first Reduce nodes, wherein the statistics of the same connection element are sent to the same first Reduce node;
the n_1 first Reduce nodes each obtain, from the statistics, the record total of the records corresponding to each connection element originating from the same database in the respective first Reduce node, determine from the record total and a preset threshold the first set to which the connection element belongs, and send the first set to the central node, the first set characterizing how frequently the values of the connection element occur;
the central node re-sorts the connection elements in the first set to obtain a corresponding second set, and performs a vector insert operation on the connection elements in the second set to obtain a query vector corresponding to the second set;
the central node redistributes the database to m_2 second Map nodes, and sends the first set and the query vector to the m_2 second Map nodes;
the m_2 second Map nodes each determine, from the records in the distributed second data blocks and the query vector, the first set to which the connection element in each record belongs, and determine from the intersections between the first sets with respect to the connection element whether to perform a join operation on the connection element.
2. The data connection optimization method according to claim 1, characterized in that the m_1 first Map nodes each counting, within the first data blocks of the database distributed by the central node, the records corresponding to the same connection element originating from the same database, to obtain statistics, comprises:
the m_1 first Map nodes each call a map function to extract from the first data block each record containing the connection element and output a first key-value pair, the first key-value pair comprising the connection element, an identifier characterizing the data source of the connection element, and a count, the count being 1;
for the same connection element in the first data block, the counts in the first key-value pairs having the same identifier characterizing the data source of the connection element are accumulated, to obtain the number of records in the first data block corresponding to the same connection element from the same data source.
3. The data connection optimization method according to claim 1 or 2, characterized in that, if the database comprises a first database and a second database, determining from the record total and the preset threshold the first set to which the connection element belongs comprises:
when, in the first Reduce node, the first record total of the connection element originating from the first database is less than the threshold and the second record total originating from the second database is less than the threshold, if the first record total is not zero, determining that the first set to which the connection element belongs is set R_thin, and if the second record total is not zero, determining that the first set to which the connection element belongs is set S_thin, the set R_thin and the set S_thin containing the connection elements that occur sparsely in the first database and the second database;
if the first record total is greater than or equal to the threshold and greater than or equal to the second record total, determining that the first set to which the connection element belongs is set R_den, the set R_den containing the connection elements that occur with high frequency in the first database and occur there at least as many times as in the second database; and, if the second record total is not zero, determining that the first set to which the connection element belongs is also set S_rden, the set S_rden containing the connection elements that occur in the second database but belong to the set R_den;
if the second record total is greater than or equal to the threshold and greater than or equal to the first record total, determining that the first set to which the connection element belongs is set S_den, the set S_den containing the connection elements that occur with high frequency in the second database and occur there at least as many times as in the first database; and, if the first record total is not zero, determining that the first set to which the connection element belongs is also set R_sden, the set R_sden containing the connection elements that occur in the first database but belong to the set S_den.
4. The data connection optimization method according to claim 3, characterized in that the central node re-sorting the connection elements in the first set to obtain the corresponding second set, and performing the vector insert operation on the connection elements in the second set to obtain the query vector corresponding to the second set, comprises:
for the first sets being the set R_den, the set R_sden, the set S_den and the set S_rden, the central node re-sorts the connection elements by value to obtain the corresponding second sets, and records the position of each connection element in its second set;
for the second sets corresponding to the set R_den, the set R_sden, the set S_den and the set S_rden, the vector insert operation is performed on each connection element in the second set according to a predefined initial vector, to obtain the query vector corresponding to the second set.
5. The data connection optimization method according to claim 4, characterized in that, after the first set to which the connection element in a record belongs is determined according to the records in the allocated second data block and the query vector, the method further comprises:
The m2 second Map nodes each generate a second key-value pair corresponding to the connection element, the second key-value pair comprising the connection element, an identifier characterizing the data source of the connection element, an identifier of the first set to which the connection element belongs, and the non-connection elements corresponding to the connection element.
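One possible shape for the second key-value pair is sketched below. The field names, the "R"/"S" source labels and the helper function are hypothetical. The claim only lists which four pieces of information the pair carries.

```python
from typing import NamedTuple, Tuple

class SecondKeyValue(NamedTuple):
    key: str                  # the connection element
    source: str               # data source label, e.g. "R" or "S" (assumed)
    first_set: str            # identifier of the first set, e.g. "R_den"
    payload: Tuple[str, ...]  # the non-connection elements of the record

def make_second_key_value(record, key_index, source, first_set):
    # Split a record into its connection element and the remaining fields.
    key = record[key_index]
    payload = tuple(v for i, v in enumerate(record) if i != key_index)
    return SecondKeyValue(key, source, first_set, payload)
```

A record such as `("u42", "Beijing", "3G")` with the connection element in column 0 would yield the key `"u42"` and the payload `("Beijing", "3G")`.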
6. The data connection optimization method according to claim 5, characterized in that determining whether to perform the connection operation on the connection element according to the intersection operation for the connection element between the first sets comprises:
After the first set to which the connection element in the record belongs is determined, if it is determined according to the query vector that the connection element does not belong to another first set whose intersection with the first set is non-empty, it is determined that the connection operation is not performed on the connection element.
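The pruning rule of claim 6 can be sketched as a lookup against the counterpart set. The pairing table below is taken from the queue pairs named in claim 10. Using plain Python sets as stand-ins for query vectors, and defaulting to "join" when no vector is available, are assumptions of this sketch.

```python
# Counterpart first sets whose intersection can be non-empty (per claim 10).
COUNTERPART = {
    "R_den": "S_rden",
    "S_rden": "R_den",
    "R_sden": "S_den",
    "S_den": "R_sden",
}

def should_join(element, first_set, query_vectors):
    """Decide whether a record's connection element can still produce a join.

    query_vectors maps a set name to a membership structure; plain sets
    are used here as stand-ins for the query vectors.
    """
    other = COUNTERPART.get(first_set)
    if other is None or other not in query_vectors:
        return True  # nothing to consult, so the join cannot be ruled out
    return element in query_vectors[other]
```

Records that fail this check can be dropped on the Map side, so they are never shuffled to a Reduce node.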
7. The data connection optimization method according to claim 6, characterized in that the method further comprises:
The m2 second Map nodes each determine, according to the first set to which the second key-value pair belongs, the range of second Reduce nodes corresponding to the second key-value pair, and send the second key-value pair to the n2 second Reduce nodes within the range;
The n2 second Reduce nodes each sort the second key-value pairs into different queues according to the first set to which each second key-value pair belongs, and perform the connection operation on the second key-value pairs in the queues.
8. The data connection optimization method according to claim 7, characterized in that the m2 second Map nodes each determining, according to the first set to which the second key-value pair belongs, the range of second Reduce nodes corresponding to the second key-value pair, and sending the second key-value pair to the n2 second Reduce nodes within the range, comprises:
If the connection element belongs to set S_thin or set R_thin, the second Map node sends the second key-value pairs corresponding to identical connection elements to the same second Reduce node;
If the connection element belongs to set R_den, the second Map node sends the second key-value pair corresponding to the connection element at random to any one second Reduce node whose number lies within a first range, the first range being the numbers between the initial Reduce number and the sum of the initial Reduce number and the ratio of the total number of records of the connection element in the first database to the average number of records processed by each Reduce node;
If the connection element belongs to set S_den, the second Map node sends the second key-value pair corresponding to the connection element at random to any one second Reduce node whose number lies within a second range, the second range being the numbers between the initial Reduce number and the sum of the initial Reduce number and the ratio of the total number of records of the connection element in the second database to the average number of records processed by each Reduce node;
If the connection element belongs to set R_sden, the second Map node broadcasts the second key-value pair corresponding to the connection element to all second Reduce nodes whose numbers lie within a third range, the third range being the numbers between the initial Reduce node number and the sum of the initial Reduce number and the ratio of the total number of records of the connection element in the second database to the average number of records processed by each Reduce node;
If the connection element belongs to set S_rden, the second Map node broadcasts the second key-value pair corresponding to the connection element to all second Reduce nodes whose numbers lie within a fourth range, the fourth range being the numbers between the initial second Reduce number and the sum of the initial second Reduce number and the ratio of the total number of records of the connection element in the first database to the average number of records processed by each second Reduce node.
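The five routing rules of claim 8 can be sketched as a single function that returns the target Reduce node numbers. Several details are assumptions of this sketch rather than statements of the claim: the ratio is rounded up, the range is treated as inclusive at both ends, thin elements are partitioned with a CRC32 hash, and the parameter names (including `start` for the initial Reduce number) are hypothetical.

```python
import math
import random
import zlib

def reducer_range(start, key_total, avg_per_reducer):
    # Numbers from start to start + key_total / avg_per_reducer (rounded up).
    span = math.ceil(key_total / avg_per_reducer)
    return list(range(start, start + span + 1))

def route(key, first_set, start, total_in_r, total_in_s,
          avg_per_reducer, num_reducers, rng=random):
    """Return the list of second Reduce node numbers for one key-value pair."""
    if first_set in ("R_thin", "S_thin"):
        # Identical elements always reach the same Reduce node.
        return [zlib.crc32(key.encode()) % num_reducers]
    if first_set == "R_den":    # one random node in the first range
        return [rng.choice(reducer_range(start, total_in_r, avg_per_reducer))]
    if first_set == "S_den":    # one random node in the second range
        return [rng.choice(reducer_range(start, total_in_s, avg_per_reducer))]
    if first_set == "R_sden":   # broadcast over the third range
        return reducer_range(start, total_in_s, avg_per_reducer)
    if first_set == "S_rden":   # broadcast over the fourth range
        return reducer_range(start, total_in_r, avg_per_reducer)
    raise ValueError(f"unknown first set: {first_set}")
```

Scattering the high-frequency side at random while broadcasting its low-frequency counterpart over the same range means every scattered record still meets all of its matching records, while the load of the skewed element is spread over several Reduce nodes.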
9. The data connection optimization method according to claim 8, characterized in that the average number of records processed by each second Reduce node is: the sum of the ratio of the file size of the first database to the size of a single record in the first database and the ratio of the file size of the second database to the size of a single record in the second database, divided by the number of second Reduce nodes.
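The formula in claim 9 amounts to estimating the record count of each database from its file size and then dividing the combined count by the number of second Reduce nodes. A minimal sketch, with hypothetical parameter names:

```python
def average_records_per_reducer(size_r, record_size_r,
                                size_s, record_size_s, num_reducers):
    # Estimated record counts: file size divided by single-record size.
    records_r = size_r / record_size_r
    records_s = size_s / record_size_s
    # Average share of the combined records per second Reduce node.
    return (records_r + records_s) / num_reducers
```

For instance, a 1000-byte first database with 10-byte records and a 4000-byte second database with 20-byte records, spread over 5 Reduce nodes, gives (100 + 200) / 5 = 60 records per node.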
10. The data connection optimization method according to claim 8 or 9, characterized in that sorting the second key-value pairs into different queues according to the first set to which each second key-value pair belongs, and performing the connection operation on the second key-value pairs in the queues, comprises:
Performing the connection operation on the queue composed of the second key-value pairs corresponding to connection elements belonging to set R_den and the queue composed of the key-value pairs corresponding to connection elements belonging to set S_rden;
Performing the connection operation on the queue composed of the second key-value pairs corresponding to connection elements belonging to set R_sden and the queue composed of the key-value pairs corresponding to connection elements belonging to set S_den;
Performing the connection operation on the queue composed of the second key-value pairs corresponding to connection elements belonging to set R_thin and the queue composed of the key-value pairs corresponding to connection elements belonging to set S_thin.
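The Reduce-side step of claim 10 can be sketched as grouping incoming pairs into per-set queues and then joining the three queue pairs. The tuple layout `(key, first_set, payload)` and the use of an equi-join on the connection element are assumptions of this sketch.

```python
from collections import defaultdict

# Queue pairs that are joined with each other, as listed in claim 10.
JOIN_PAIRS = [("R_den", "S_rden"), ("R_sden", "S_den"), ("R_thin", "S_thin")]

def reduce_join(pairs):
    """pairs: iterable of (key, first_set, payload) received by one node."""
    # Sort the incoming pairs into one queue per first set.
    queues = defaultdict(list)
    for key, first_set, payload in pairs:
        queues[first_set].append((key, payload))

    joined = []
    for r_set, s_set in JOIN_PAIRS:
        # Index the second-database queue by connection element.
        s_by_key = defaultdict(list)
        for key, payload in queues[s_set]:
            s_by_key[key].append(payload)
        # Match every first-database record against that index.
        for key, r_payload in queues[r_set]:
            for s_payload in s_by_key.get(key, []):
                joined.append((key, r_payload, s_payload))
    return joined
```

Because each queue is only ever joined with its counterpart, a record in R_den is never compared against records in S_den or S_thin.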
11. A data operation system, characterized by comprising m1 first Map nodes, n1 first Reduce nodes, a central node and m2 second Map nodes, wherein:
The m1 first Map nodes are each configured to count, within the first data block of the database allocated by the central node, the records corresponding to the same connection element that originate from the same database, so as to obtain statistics, and to send the statistics to the central node, so that the central node sends the statistics to the n1 first Reduce nodes, wherein the statistics of the same connection element are sent to the same first Reduce node;
The n1 first Reduce nodes are each configured to obtain, according to the statistics, the total number of records corresponding to each connection element originating from the same database within the respective first Reduce node, to determine the first set to which the connection element belongs according to the record total and a preset threshold, and to send the first set to the central node, the first set characterizing how frequently the value of the connection element occurs;
The central node is configured to reorder the connection elements in the first set to obtain a corresponding second set, and to perform a vector update operation on the connection elements in the second set to obtain the query vector corresponding to the second set;
The central node is further configured to redistribute the database to the m2 second Map nodes, and to send the first set and the query vector to the m2 second Map nodes;
The m2 second Map nodes are each configured to determine, according to the records in the allocated second data block and the query vector, the first set to which the connection element in a record belongs, and to determine whether to perform the connection operation on the connection element according to the intersection operation for the connection element between the first sets.
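The first statistics round carried out by the first Map and first Reduce nodes can be sketched with in-memory counters. Representing a record as a `(source, key)` tuple and partitioning by a CRC32 hash of the connection element are assumptions of this sketch, as are the function names.

```python
from collections import Counter
import zlib

def map_count(block):
    # block: iterable of (source, key) pairs from one first data block.
    # Counts records per connection element and per source database.
    return Counter(block)

def shuffle_to_reducers(partials, num_reducers):
    # Statistics of the same connection element go to the same Reduce node,
    # where the partial counts are summed into record totals.
    buckets = [Counter() for _ in range(num_reducers)]
    for partial in partials:
        for (source, key), count in partial.items():
            target = zlib.crc32(key.encode()) % num_reducers
            buckets[target][(source, key)] += count
    return buckets
```

Each Reduce node then holds, for its share of connection elements, the record totals from both databases that are compared against the preset threshold.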
CN201510446965.8A 2015-07-27 2015-07-27 Data connection optimization method and data operation system Active CN105095455B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510446965.8A CN105095455B (en) 2015-07-27 2015-07-27 A kind of data connection optimization method and data arithmetic system

Publications (2)

Publication Number Publication Date
CN105095455A (en) 2015-11-25
CN105095455B (en) 2018-10-19

Family

ID=54575891

Country Status (1)

Country Link
CN (1) CN105095455B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103593452A (en) * 2013-11-21 2014-02-19 北京科技大学 Data intensive computing cost optimization method based on MapReduce mechanism
CN103838863A (en) * 2014-03-14 2014-06-04 内蒙古科技大学 Big-data clustering algorithm based on cloud computing platform
US8793674B2 (en) * 2011-09-19 2014-07-29 Nec Laboratories America, Inc. Computer-guided holistic optimization of MapReduce applications
US20140215178A1 (en) * 2013-01-31 2014-07-31 International Business Machines Corporation Resource management in mapreduce architecture and architectural system
CN104391748A (en) * 2014-11-21 2015-03-04 浪潮电子信息产业股份有限公司 Mapreduce computation process optimization method
CN104731729A (en) * 2015-03-23 2015-06-24 华为技术有限公司 Table connection optimizing method based on heterogeneous system, CPU and accelerator

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106874272A (en) * 2015-12-10 2017-06-20 华为技术有限公司 Distributed connection method and system
CN106874272B (en) * 2015-12-10 2020-02-14 华为技术有限公司 Distributed connection method and system
CN106168963A (en) * 2016-06-30 2016-11-30 北京金山安全软件有限公司 Real-time streaming data processing method and device and server
CN106168963B (en) * 2016-06-30 2019-06-11 北京金山安全软件有限公司 Real-time streaming data processing method and device and server
CN112597148A (en) * 2020-11-25 2021-04-02 联想(北京)有限公司 Data table connection method and device

Also Published As

Publication number Publication date
CN105095455B (en) 2018-10-19

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant