Summary of the Invention
In view of this, the present invention provides a data table association method and device that improve the efficiency of data table association when the data tables being associated exhibit data skew.
Specifically, the present invention is achieved through the following technical solutions:
In a first aspect, a data table association method is provided. The method is applied to associating a first data table with a second data table, wherein the first data table includes skewed data that can cause data skew and non-skewed data other than the skewed data. The method includes:
extracting the skewed data from the first data table and putting it into a first data sub-table, and putting the non-skewed data into a second data sub-table;
extracting, from the second data table, the data that matches the first data sub-table, and putting it into a third data sub-table;
performing a mapjoin on the first data sub-table and the third data sub-table to obtain a first association table, and associating the second data sub-table with the second data table to obtain a second association table;
combining the first association table and the second association table to obtain an association result table, the association result table being the result of associating the first data table with the second data table.
In a second aspect, a data table association device is provided. The device is applied to associating a first data table with a second data table, wherein the first data table includes skewed data that can cause data skew and non-skewed data other than the skewed data. The device includes:
a table splitting unit, configured to extract the skewed data from the first data table and put it into a first data sub-table, and to put the non-skewed data into a second data sub-table;
a table extraction unit, configured to extract, from the second data table, the data that matches the first data sub-table, and to put it into a third data sub-table;
a table association unit, configured to perform a mapjoin on the first data sub-table and the third data sub-table to obtain a first association table, and to associate the second data sub-table with the second data table to obtain a second association table;
a table combining unit, configured to combine the first association table and the second association table to obtain an association result table, the association result table being the result of associating the first data table with the second data table.
In the data table association method and device of the embodiments of the present invention, the data table containing skewed data is split, the skewed data obtained by the split is mapjoined with a small table, and the remaining data is joined with the other table, so that neither part of the association is affected by the skewed data, which improves the efficiency of data table association.
Detailed Description
A data warehouse mainly provides data for decision analysis, and the data operations involved are mainly data queries. To ensure the accuracy of the data a warehouse provides, data entering the warehouse generally undergoes data cleansing. Data table association is a common operation during data cleansing in a data warehouse. For example, when data is processed on a map/reduce distributed computing platform, the reduce stage can perform a join on two or more data tables according to the association key they share (for a given key, every matching record of one table is combined with every matching record of the other, in effect a Cartesian product per key). For instance, the data warehouse receives a data query request asking for information A and information B corresponding to a certain key, where information A and information B are located in two different data tables. The two data tables can then be associated according to the key to obtain a new data table containing the key and the corresponding information A and information B, which is returned to the requester.
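As a purely illustrative sketch, and not part of the described platform, the key-based association above can be written in Python as follows. The function name join_by_key and the sample records are assumptions introduced here for illustration only.

```python
def join_by_key(table_a, table_b):
    """Join two lists of (key, value) pairs on their key.

    For each key, every value from table_a is paired with every value
    from table_b that has the same key.
    """
    values_b = {}
    for key, value in table_b:
        values_b.setdefault(key, []).append(value)
    return [(key, a, b) for key, a in table_a for b in values_b.get(key, [])]


# Hypothetical tables: one holds information A, the other information B.
info_a = [("k1", "A1"), ("k2", "A2")]
info_b = [("k1", "B1"), ("k2", "B2")]

result = join_by_key(info_a, info_b)
for row in result:
    print(row)
```

Each output row contains the key together with the corresponding information A and information B, which is the new data table returned to the requester.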
For example, a reduce node can obtain the list of values that share the same key from the two data tables (a table may hold a correspondence between key and value; for instance, the key is an ID and the value is the login time of that user) and, for each such key, perform join processing on the data from the two tables. When data skew occurs, some keys have far more records than others (sometimes hundreds or thousands of times as many), so the reduce node handling such a key processes much more data than the other nodes. As a result, most reduce nodes finish, but one or a few reduce nodes run very slowly and take a long time to complete, which extends the processing time of the whole data table association.
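The effect of data skew on reduce nodes can be illustrated with a small Python sketch. It assumes, purely for illustration, that records are assigned to reduce nodes by taking the key modulo the number of nodes; the function name reducer_loads and the record counts are hypothetical, and a real platform may partition keys differently.

```python
from collections import Counter


def reducer_loads(keys, num_reducers):
    """Count how many records each reduce node receives when every record
    is routed by its key (assumed here: key modulo the number of nodes)."""
    loads = Counter({node: 0 for node in range(num_reducers)})
    for key in keys:
        loads[key % num_reducers] += 1
    return dict(loads)


# Hypothetical workload: key 123 occurs 1000 times, the other keys 10 times each.
keys = [123] * 1000 + [234] * 10 + [345] * 10 + [456] * 10
loads = reducer_loads(keys, num_reducers=4)
print(loads)
```

In this sketch the node that receives key 123 handles a hundred times as many records as any other node, so the whole association waits for that single node.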
The data table association method of the embodiments of the present application aims to improve the efficiency of data table association when the data tables being associated exhibit data skew, reducing the influence of data skew on association processing time. Fig. 1 illustrates the flow of the data table association method, which can be performed by a distributed computing platform. The example shown in Fig. 1 is described by taking a join between a first data table and a second data table as an example (in actual implementations the method can also be applied to data table association in other scenarios and is not limited to the example below). The method can also be described with reference to the schematic diagram shown in Fig. 2:
For example, the first data table can be a user login information table. Table 1 below shows part of the first data table, which contains the correspondence between an ID and the login time of that user. The ID can be called the association key, and data tables can be joined according to this association key.
Table 1: First data table
ID | Login time
123 | 2016.3.21
...... | ......
...... | ......
123 | 2016.3.24
234 | 2016.3.26
345 | 2016.3.27
In the first data table, the number of data records with ID "123" has reached millions or tens of millions. In this example, the data with ID "123" is assumed to be "skewed data that can cause data skew", while the remaining data, such as the records with ID "234" and "345", is non-skewed data, that is, data that does not cause data skew. The second data table to be joined with the first data table can be a user name information table, as shown in Table 2 below, which contains the ID and the user name.
Table 2: Second data table
ID | User name
123 | Zhang San
234 | Li Si
345 | Wang Wu
Associating the first data table with the second data table means using the ID as the association key to find, in the first data table and the second data table, the login time and user name corresponding to each ID, and generating an association result table like Table 3, which contains the ID together with the corresponding login time and user name.
Table 3: Association result table
ID | Login time | User name
123 | 2016.3.21 | Zhang San
...... | ...... | ......
...... | ...... | ......
123 | 2016.3.24 | Zhang San
234 | 2016.3.26 | Li Si
345 | 2016.3.27 | Wang Wu
The process of the data table association method of the present application is described below with reference to the above example:
In step 101, the skewed data is extracted from the first data table and put into the first data sub-table, and the non-skewed data is put into the second data sub-table.
In this step, the first data table is split into two tables, referred to as the first data sub-table and the second data sub-table. The first data sub-table contains the skewed data, such as the millions or tens of millions of records with ID "123" in Table 1. The second data sub-table contains the non-skewed data, such as the records in Table 1 whose ID is not "123".
One way of splitting the first data table is given below as an example:
First, at least one association key that causes data skew is extracted from the first data table, and the at least one association key is put into an association key sub-table.
For example, the number of occurrences of each association key in the first data table can be counted, and the association keys sorted in descending order of count. As an example, the number of occurrences of ID "123" in the first data table can be 1,000,000, that of ID "234" can be 100,000, and that of ID "345" can be 8,000. Sorting the IDs by count from most to least gives the order "123, 234, 345".
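The counting and sorting described above can be sketched in Python with the standard library's collections.Counter. The function name rank_keys is hypothetical, and the record counts are scaled down from the figures in the text so that the sketch runs quickly.

```python
from collections import Counter


def rank_keys(records):
    """Count the occurrences of each association key and return
    (key, count) pairs sorted from most to least frequent."""
    counts = Counter(key for key, _value in records)
    return counts.most_common()


# Scaled-down stand-in for the first data table: (ID, login time) records.
records = (
    [(123, "t")] * 1000
    + [(234, "t")] * 100
    + [(345, "t")] * 8
)

ranked = rank_keys(records)
print(ranked)
```

The first element of the ranked list is the key with the most records, which corresponds to the first position in the ordering described above.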
Assume the preset association key upper limit is 1; that is, the association key ranked first in the above ordering, ID "123", is selected as the association key that causes data skew. As another example, if the first data table contains ten different association keys, counting and sorting them gives an ordering from the first position to the tenth position; if the preset association key upper limit is 5, the first five association keys in the ordering are selected as the association keys that can cause data skew. This example selects one association key. Table 4 below is the association key sub-table, which holds the one association key that causes data skew.
Table 4: Association key sub-table
ID | Count
123 | 1000000
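Selecting the association key sub-table from the sorted keys can be sketched as follows. The function name build_key_subtable is an assumption made for illustration; the counts are those given in the example above.

```python
def build_key_subtable(ranked_keys, upper_limit):
    """Keep the (key, count) pairs whose rank is within the preset
    association key upper limit."""
    return ranked_keys[:upper_limit]


# (ID, count) pairs already sorted from most to least frequent.
ranked_keys = [(123, 1000000), (234, 100000), (345, 8000)]

key_subtable = build_key_subtable(ranked_keys, upper_limit=1)
print(key_subtable)
```

With an upper limit of 1 the result holds only ID 123 and its count, matching Table 4.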
In addition, the preset association key upper limit can be determined from experience or from test values. For example, when a processing timeout caused by data skew is encountered during data cleansing, the number of occurrences of the association key that caused the timeout can be checked; if it is 1,000,000, then keys with around 1,000,000 records are likely to cause processing delays. An initial association key upper limit, such as 5, is then set against the sorted order of the association keys and the top five keys are selected. If the count of the key ranked fifth turns out to be 8,000, the upper limit is unsuitable. If the upper limit is changed to 2 and the count of the key ranked second is 1,000,000, the upper limit is reasonable and selects the data records that cause data skew. This is only one example; the association key upper limit can be determined in other ways, as long as the skewed data can be identified.
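One possible way to check a candidate upper limit against a record count observed to cause timeouts is sketched below. This is an assumption-laden illustration: the function name limit_is_suitable, the threshold parameter and the counts for the first, third and fourth keys are all hypothetical; only the values 8,000 and 1,000,000 come from the example in the text.

```python
def limit_is_suitable(ranked_keys, upper_limit, skew_threshold):
    """Return True if every key ranked within upper_limit has at least
    skew_threshold records, i.e. the limit selects only skewed keys."""
    selected = ranked_keys[:upper_limit]
    return bool(selected) and all(
        count >= skew_threshold for _key, count in selected
    )


# Hypothetical (key, count) ranking; the 2nd count is 1,000,000 and the
# 5th count is 8,000, as in the example in the text.
ranked_keys = [
    ("k1", 3000000),
    ("k2", 1000000),
    ("k3", 20000),
    ("k4", 9000),
    ("k5", 8000),
]

print(limit_is_suitable(ranked_keys, upper_limit=5, skew_threshold=1000000))
print(limit_is_suitable(ranked_keys, upper_limit=2, skew_threshold=1000000))
```

Under these assumed counts, an upper limit of 5 is rejected because the fifth key has only 8,000 records, while an upper limit of 2 is accepted.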
Second, according to the association key sub-table, the data in the first data table that matches the association key sub-table is put into the first data sub-table, and the data that does not match the association key sub-table is put into the second data sub-table.
For example, after the association key sub-table is obtained, it can be associated with the first data table, for example by a mapjoin. Mapjoin is one kind of join in which the data of a small table is read directly into memory and associated with the other table, which can greatly improve the efficiency of generating the association result. The association key sub-table in this example, Table 4, is a small table, so mapjoin can be used. When the association key sub-table is associated with the first data table, the data that matches Table 4 is put into the first data sub-table. "Matching" here means finding the data records in the first data table that correspond to the keys in the association key sub-table; in this example, the first data sub-table (Table 5) contains the data records with ID "123". The data that does not match Table 4, that is, the data records whose ID is not "123", is put into the second data sub-table (Table 6).
Table 5: First data sub-table
ID | Login time
123 | 2016.3.21
...... | ......
...... | ......
123 | 2016.3.24
Table 6: Second data sub-table
ID | Login time
234 | 2016.3.26
345 | 2016.3.27
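The split of the first data table into Table 5 and Table 6 can be sketched in Python as follows. Holding the keys of the small association key sub-table in an in-memory set stands in for the mapjoin; the function name split_by_skew_keys is hypothetical and the sample rows are taken from Table 1.

```python
def split_by_skew_keys(first_table, skew_keys):
    """Split (ID, login time) rows into rows whose ID is in the
    association key sub-table and rows whose ID is not."""
    skew_keys = set(skew_keys)  # the small table is held entirely in memory
    first_sub = [row for row in first_table if row[0] in skew_keys]
    second_sub = [row for row in first_table if row[0] not in skew_keys]
    return first_sub, second_sub


first_table = [
    (123, "2016.3.21"),
    (123, "2016.3.24"),
    (234, "2016.3.26"),
    (345, "2016.3.27"),
]

first_sub, second_sub = split_by_skew_keys(first_table, skew_keys=[123])
print(first_sub)
print(second_sub)
```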
The first data table can be split by means of the association key sub-table in several ways. For example, one way is to perform a first mapjoin of the association key sub-table with the first data table to obtain the data that matches the association key sub-table and put it into the first data sub-table, and then perform a second mapjoin of the association key sub-table with the first data table to obtain the data that does not match the association key sub-table and put it into the second data sub-table. Another way is to perform a single mapjoin of the association key sub-table with the first data table and, in that mapjoin, mark each data record as either matching or not matching the association key sub-table; according to the mark, the matching data is put into the first data sub-table and the non-matching data into the second data sub-table. These are only two examples; actual implementations are not limited to them, as long as the split into Table 5 and Table 6 is achieved.
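The second way, a single pass that marks each record and then routes it by its mark, might look like the following Python sketch. The function names tag_rows and split_tagged are hypothetical, and the in-memory set again stands in for the small table of a mapjoin.

```python
def tag_rows(first_table, skew_keys):
    """Single pass over the first data table: attach a flag to each row
    saying whether its key matches the association key sub-table."""
    skew_keys = set(skew_keys)
    return [(row, row[0] in skew_keys) for row in first_table]


def split_tagged(tagged_rows):
    """Route each row to the first or second data sub-table by its flag."""
    first_sub, second_sub = [], []
    for row, matched in tagged_rows:
        (first_sub if matched else second_sub).append(row)
    return first_sub, second_sub


first_table = [
    (123, "2016.3.21"),
    (234, "2016.3.26"),
    (123, "2016.3.24"),
    (345, "2016.3.27"),
]

tagged = tag_rows(first_table, skew_keys=[123])
first_sub, second_sub = split_tagged(tagged)
print(first_sub)
print(second_sub)
```

Compared with two separate passes, this variant reads the first data table only once, at the cost of carrying the extra flag with each record.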
In step 102, the data that matches the first data sub-table is extracted from the second data table and put into the third data sub-table.
For example, a mapjoin can be performed on the association key sub-table shown in Table 4 and the second data table shown in Table 2 to obtain the data records in the second data table that match the association key sub-table, and these records are put into the third data sub-table. In the above example, associating Table 4 with Table 2 gives Table 7: the key in Table 4 is ID "123", so the data record in Table 2 with the same key, ID "123", is put into Table 7:
Table 7: Third data sub-table
ID | User name
123 | Zhang San
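Step 102 can be sketched in Python as a lookup of the second data table against the keys of the association key sub-table. The function name extract_matching is hypothetical; the rows are those of Table 2 and Table 4.

```python
def extract_matching(second_table, key_subtable):
    """Return the rows of the second data table whose ID appears in the
    association key sub-table (a list of (ID, count) pairs)."""
    keys = {key for key, _count in key_subtable}
    return [row for row in second_table if row[0] in keys]


second_table = [(123, "Zhang San"), (234, "Li Si"), (345, "Wang Wu")]
key_subtable = [(123, 1000000)]

third_sub = extract_matching(second_table, key_subtable)
print(third_sub)
```

The result holds a single row for ID 123, matching Table 7, so the third data sub-table stays small enough to be held in memory.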
In step 103, a mapjoin is performed on the first data sub-table and the third data sub-table to obtain the first association table, and the second data sub-table is joined with the second data table to obtain the second association table.
In this step, the third data sub-table is a small table, so a mapjoin can be performed on the third data sub-table and the first data sub-table to obtain the first association table, shown in Table 8:
Table 8: First association table
ID | Login time | User name
123 | 2016.3.21 | Zhang San
...... | ...... | ......
...... | ...... | ......
123 | 2016.3.24 | Zhang San
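The mapjoin of the first data sub-table with the third data sub-table can be sketched in Python as an in-memory lookup join. This is an illustration of the general idea (small table loaded into memory, large table streamed past it), not the implementation of any particular platform; the function name map_join is hypothetical.

```python
def map_join(big_table, small_table):
    """Join (key, value) rows of a large table against a small table
    that is loaded entirely into an in-memory dictionary."""
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)
    joined = []
    for key, value in big_table:  # the large table is only streamed through
        for small_value in lookup.get(key, []):
            joined.append((key, value, small_value))
    return joined


first_sub = [(123, "2016.3.21"), (123, "2016.3.24")]
third_sub = [(123, "Zhang San")]

first_assoc = map_join(first_sub, third_sub)
print(first_assoc)
```

Because every row of the large table is matched locally against the in-memory copy, no single reduce node has to collect all records of the skewed key.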
The second data sub-table is joined with the second data table to obtain the second association table, shown in Table 9 below:
Table 9: Second association table
ID | Login time | User name
234 | 2016.3.26 | Li Si
345 | 2016.3.27 | Wang Wu
In step 104, the first association table and the second association table are combined to obtain the association result table, the association result table being the result of associating the first data table with the second data table.
In this step, the first association table and the second association table obtained in step 103 are combined, and the resulting association result table is as shown in Table 3 above.
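Steps 101 to 104 can be put together in one short Python sketch and compared with a plain join of the two tables. The sketch runs in a single process, so the helper hash_join stands in for both the mapjoin and the ordinary join; the function names and the login time "2016.3.22" are assumptions added for illustration.

```python
from collections import Counter


def hash_join(left, right):
    """Join (key, value) rows of left against rows of right on the key."""
    lookup = {}
    for key, value in right:
        lookup.setdefault(key, []).append(value)
    return [(k, v, rv) for k, v in left for rv in lookup.get(k, [])]


def skew_aware_join(first_table, second_table, upper_limit):
    # Step 101: find the most frequent keys and split the first table.
    counts = Counter(key for key, _ in first_table)
    skew_keys = {key for key, _ in counts.most_common(upper_limit)}
    first_sub = [row for row in first_table if row[0] in skew_keys]
    second_sub = [row for row in first_table if row[0] not in skew_keys]
    # Step 102: extract the matching rows of the second table.
    third_sub = [row for row in second_table if row[0] in skew_keys]
    # Step 103: associate each part separately.
    first_assoc = hash_join(first_sub, third_sub)
    second_assoc = hash_join(second_sub, second_table)
    # Step 104: combine the two association tables.
    return first_assoc + second_assoc


first_table = [
    (123, "2016.3.21"),
    (123, "2016.3.22"),  # hypothetical extra record for the skewed key
    (123, "2016.3.24"),
    (234, "2016.3.26"),
    (345, "2016.3.27"),
]
second_table = [(123, "Zhang San"), (234, "Li Si"), (345, "Wang Wu")]

result = skew_aware_join(first_table, second_table, upper_limit=1)
for row in result:
    print(row)
```

In this sketch the combined result contains the same rows as a direct join of the first and second data tables; only the way the work is divided differs.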
In the data table association method of this example, the data table containing skewed data is split, and the skewed data is mapjoined with a small table that matches it, which significantly improves the efficiency of associating the skewed data. The remaining non-skewed data is no longer affected by skewed data when it is associated with the data table, so that processing also completes quickly. Both parts are processed quickly, which improves the efficiency of data table association.
To implement the above method, the embodiments of the present application further provide a data table association device, which is applied to associating a first data table with a second data table, wherein the first data table includes skewed data that can cause data skew and non-skewed data other than the skewed data. As shown in Fig. 3, the device can include a table splitting unit 31, a table extraction unit 32, a table association unit 33 and a table combining unit 34, wherein:
the table splitting unit 31 is configured to extract the skewed data from the first data table and put it into the first data sub-table, and to put the non-skewed data into the second data sub-table;
the table extraction unit 32 is configured to extract, from the second data table, the data that matches the first data sub-table, and to put it into the third data sub-table;
the table association unit 33 is configured to perform a mapjoin on the first data sub-table and the third data sub-table to obtain the first association table, and to associate the second data sub-table with the second data table to obtain the second association table;
the table combining unit 34 is configured to combine the first association table and the second association table to obtain the association result table, the association result table being the result of associating the first data table with the second data table.
As shown in Fig. 4, the table splitting unit 31 in the device can include a key extraction subunit 311 and a table generation subunit 312.
The key extraction subunit 311 is configured to extract, from the first data table, at least one association key that causes data skew, and to put the at least one association key into the association key sub-table.
The table generation subunit 312 is configured to, according to the association key sub-table, put the data in the first data table that matches the association key sub-table into the first data sub-table, and put the data that does not match the association key sub-table into the second data sub-table.
In another example, the key extraction subunit 311, when extracting association keys, is configured to: count the number of occurrences of each association key in the first data table and sort the association keys in descending order of count; and, according to the preset association key upper limit, obtain the at least one association key whose rank is within the upper limit as the at least one association key that causes data skew.
In another example, the table extraction unit 32, when generating the third data sub-table, is configured to: associate the association key sub-table with the second data table, and put the data of the second data table obtained by this association into the third data sub-table.
For the implementation of the functions and effects of each unit in the above device, refer to the implementation of the corresponding steps in the above method, which is not repeated here. Since the device embodiment essentially corresponds to the method embodiment, refer to the description of the method embodiment for the relevant parts.
The device embodiment described above is merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of the present application. Those of ordinary skill in the art can understand and implement it without creative effort.
The embodiment of the data table association device of the present application can be applied to a processing device, for example a computing device used for data processing in a data warehouse. The device embodiment can be implemented in software, in hardware, or in a combination of software and hardware. Fig. 5 is a hardware structure diagram of a processing device on which the data table association device of the present application resides. Taking a software implementation as an example, the device in the logical sense is formed by the processor 51 of the processing device reading the corresponding computer program instructions from the non-volatile memory 53 into the memory 52 and running them. In addition to the above components and a network interface, the processing device may include other functional components according to its actual function, which are not described further here.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.