CN106874322A

CN106874322A - Data table association method and device

Info

Publication number: CN106874322A
Application number: CN201610480216.1A
Authority: CN
Inventors: 康树鹏
Original assignee: Alibaba Group Holding Ltd
Current assignee: Advanced New Technologies Co Ltd; Advantageous New Technologies Co Ltd
Priority date: 2016-06-27
Filing date: 2016-06-27
Publication date: 2017-06-20

Abstract

The present invention provides a data table association method and device. The method is used to associate (join) a first data table with a second data table, where the first data table contains skewed data, which can cause data skew, and non-skewed data, which is all data other than the skewed data. The method includes: extracting the skewed data from the first data table into a first data sub-table, and placing the non-skewed data into a second data sub-table; extracting from the second data table the data that matches the first data sub-table, and placing it into a third data sub-table; performing a mapjoin on the first data sub-table and the third data sub-table to obtain a first association table, and associating the second data sub-table with the second data table to obtain a second association table; and combining the first association table and the second association table to obtain an association result table, which is the result of associating the first data table with the second data table. The invention improves the efficiency of data table association.

Description

Data table association method and device

Technical field

The present invention relates to data processing technology, and in particular to a data table association method and device.

Background technology

When a data warehouse performs data cleansing, one common cleansing approach is to associate one data table with another; this association between data tables in a data warehouse is called a join operation. The tables participating in a join usually share the same association key (the linking field used when the tables are associated), referred to here as the key. For example, one data table stores the correspondence between the key and information A, and another data table stores the correspondence between the key and information B. When the two tables are joined, information A and information B that correspond to the same key can be combined, according to the association key, into a new data table that contains the key together with the corresponding information A and information B.
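As an illustration of the join described above, the following minimal Python sketch combines two small in-memory tables on a shared key. The table contents and the `join_on_key` helper are hypothetical and chosen only for illustration; the sketch also assumes each key appears at most once per table.

```python
# Hypothetical sample data: each table maps an association key to one piece of information.
table_a = {"k1": "info A1", "k2": "info A2"}
table_b = {"k1": "info B1", "k2": "info B2", "k3": "info B3"}


def join_on_key(left, right):
    """Inner join: keep only the keys present in both tables."""
    return {key: (left[key], right[key]) for key in left if key in right}


# The new table holds each shared key with its information A and information B.
joined = join_on_key(table_a, table_b)
```

Key `k3` is absent from the result because it has no counterpart in `table_a`.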

A situation that often arises during a join is data skew: among the data tables participating in the join, one table contains a large number of data records with the same key value. For example, a user login information table may hold millions or tens of millions of records of the user with user ID "123" logging in at different times (one such record being "user ID 123, login time 2016.3.21"). When a distributed computing platform used in a data warehouse processes a join between such a table and other data tables, the computation usually takes a long time.

Summary of the invention

In view of this, the present invention provides a data table association method and device that improve the efficiency of data table association when the data tables being associated exhibit data skew.

Specifically, the present invention is achieved through the following technical solutions:

In a first aspect, a data table association method is provided. The method is used to associate a first data table with a second data table, where the first data table contains skewed data that can cause data skew and non-skewed data other than the skewed data. The method includes:

extracting the skewed data from the first data table into a first data sub-table, and placing the non-skewed data into a second data sub-table;

extracting from the second data table the data that matches the first data sub-table, and placing it into a third data sub-table;

performing a mapjoin on the first data sub-table and the third data sub-table to obtain a first association table, and associating the second data sub-table with the second data table to obtain a second association table; and

combining the first association table and the second association table to obtain an association result table, the association result table being the result of associating the first data table with the second data table.

In a second aspect, a data table association device is provided. The device is used to associate a first data table with a second data table, where the first data table contains skewed data that can cause data skew and non-skewed data other than the skewed data. The device includes:

a table splitting unit, configured to extract the skewed data from the first data table into a first data sub-table, and to place the non-skewed data into a second data sub-table;

a table extraction unit, configured to extract from the second data table the data that matches the first data sub-table, and to place it into a third data sub-table;

a table association unit, configured to perform a mapjoin on the first data sub-table and the third data sub-table to obtain a first association table, and to associate the second data sub-table with the second data table to obtain a second association table; and

a table combining unit, configured to combine the first association table and the second association table to obtain an association result table, the association result table being the result of associating the first data table with the second data table.

With the data table association method and device of the embodiments of the present invention, the data table that contains skewed data is split; the skewed data obtained from the split is mapjoined with a small table, and the remaining data is joined with the other table. Neither of the two partial associations is affected by the skewed data, which improves the efficiency of data table association.

Brief description of the drawings

Fig. 1 is a flow chart of the data table association method provided by an embodiment of the present invention;

Fig. 2 is a schematic diagram of the principle of the data table association method provided by an embodiment of the present invention;

Fig. 3 is a structural diagram of the data table association device provided by an embodiment of the present invention;

Fig. 4 is a structural diagram of the data table association device provided by an embodiment of the present invention;

Fig. 5 is a hardware structure diagram of the processing device where the data table association device provided by an embodiment of the present invention is located.

Detailed description

A data warehouse mainly supplies data for decision analysis, and the data operations involved are mostly queries. To ensure the accuracy of the data a warehouse provides, data entering the warehouse usually undergoes data cleansing. Data table association is a common method of data cleansing in a data warehouse. For example, when data is processed on a map/reduce distributed computing platform, the reduce stage can perform a join (also referred to as a Cartesian product) on two or more data tables according to the association key they share. Suppose the data warehouse receives a data query request for information A and information B corresponding to a certain key, and information A and information B are located in two different data tables. The two tables can then be associated according to the key to obtain a new data table containing the key and the corresponding information A and information B, which is returned to the requester.

For example, a reduce node can obtain, from the two data tables, the value lists that share the same key (a table may hold key-value correspondences, for instance with the user ID as the key and that user's login time as the value) and join the data of the two tables for that key. When data is skewed, some keys have far more records than others (sometimes hundreds or thousands of times as many), so the reduce node that handles such a key processes far more data than the other nodes. As a result, most reduce nodes finish, while one or a few reduce nodes run very slowly and take a long time to complete, which prolongs the processing time of the whole data table association.
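The load imbalance described above can be illustrated with a small simulation. This is a sketch under assumptions: the record counts, the number of reducers, and the modulo partitioner `reducer_for` are hypothetical stand-ins for a real shuffle and are not part of the patent.

```python
from collections import Counter

# Hypothetical login records: user "123" dominates the table.
records = [("123", f"t{i}") for i in range(1000)]
records += [("234", "t0"), ("345", "t1"), ("456", "t2")]

NUM_REDUCERS = 4


def reducer_for(key, num_reducers):
    """Deterministic stand-in for a shuffle partitioner (assumes numeric string keys)."""
    return int(key) % num_reducers


# Count how many records each reducer receives; all rows of one key go to one reducer.
load = Counter(reducer_for(key, NUM_REDUCERS) for key, _value in records)
busiest_reducer, busiest_load = load.most_common(1)[0]
```

Because every record with the same key lands on the same reducer, one reducer receives 1,000 records while each of the others receives a single record.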

The data table association method of the embodiments of the present application aims to improve the efficiency of data table association when the data tables being associated exhibit data skew, reducing the effect of data skew on the association processing time. Fig. 1 illustrates the flow of the data table association method, which can be performed by a distributed computing platform. The example shown in Fig. 1 is described in terms of joining a first data table with a second data table (in actual implementations the method can also be applied to data table association in other scenarios and is not limited to the example below); the method is also described with reference to the principle diagram shown in Fig. 2:

For example, the first data table can be a user login information table. Table 1 below shows part of the first data table, which holds the correspondence between a user ID and that user's login time. The user ID can be called the association key, and data tables can be joined according to it.

Table 1: First data table

User ID	Login time
123	2016.3.21
······	·······
······	·······
123	2016.3.24
234	2016.3.26
345	2016.3.27

In the first data table, the data records for user ID "123" number in the millions or tens of millions. In this example, the data for user ID "123" is assumed to be the skewed data that can cause data skew, while the remaining data, such as the records for user IDs "234" and "345", is non-skewed data that does not cause data skew. The second data table, which is joined with the first data table, can be a user name information table containing user IDs and user names, as shown in Table 2 below.

Table 2: Second data table

User ID	User name
123	Zhang San
234	Li Si
345	Wang Wu

Associating the first data table with the second data table means using the user ID as the association key to look up, in the first and second data tables, the login time and user name corresponding to each user ID, and generating an association result table like Table 3. The association result table contains the user ID together with the corresponding login time and user name.

Table 3: Association result table

User ID	Login time	User name
123	2016.3.21	Zhang San
······	·······	·····
······	·······	·····
123	2016.3.24	Zhang San
234	2016.3.26	Li Si
345	2016.3.27	Wang Wu

The process of the data table association method of the present application is described below with reference to the above example:

In step 101, the skewed data is extracted from the first data table into a first data sub-table, and the non-skewed data is placed into a second data sub-table.

In this step, the first data table is split into two tables, referred to as the first data sub-table and the second data sub-table. The first data sub-table holds the skewed data, such as the millions or tens of millions of records for user ID "123" in Table 1. The second data sub-table holds the non-skewed data, such as the records in Table 1 for user IDs other than "123".

One way of splitting the first data table is described below as an example:

First, at least one association key that causes data skew is extracted from the first data table and placed into an association key sub-table.

For example, the number of occurrences of each association key in the first data table can be counted, and the association keys sorted in descending order of count. As an example, user ID "123" may occur 1,000,000 times in the first data table, user ID "234" 100,000 times, and user ID "345" 8,000 times. Sorting the user IDs by count in descending order gives the order "123, 234, 345".

Assume the preset association key upper limit value is 1; that is, the top-ranked association key in the above ordering, user ID "123", is selected as the association key that causes data skew. As another example, if the first data table contains ten distinct association keys, counting and sorting them yields an ordering from first place to tenth place; if the preset association key upper limit value is 5, the top five association keys in that ordering are selected as the association keys that can cause data skew. This example selects one association key. Table 4 below is the association key sub-table, which holds the one association key that causes data skew.

Table 4: Association key sub-table

User ID	Count
123	1000000
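The count-and-sort selection that produces the association key sub-table might look like the following sketch. The function name `select_skewed_keys` and the scaled-down row counts are assumptions made for illustration; a distributed platform would compute the counts with a group-by rather than in one process.

```python
from collections import Counter


def select_skewed_keys(rows, key_limit):
    """Count each association key and return the top `key_limit` keys with their counts."""
    counts = Counter(key for key, _value in rows)
    # most_common() returns entries ordered by count, largest first.
    return counts.most_common(key_limit)


# Scaled-down stand-in for the first data table: (user ID, login time) rows.
rows = (
    [("123", f"login-{i}") for i in range(1000)]
    + [("234", f"login-{i}") for i in range(100)]
    + [("345", f"login-{i}") for i in range(8)]
)

# With an upper limit value of 1, only the most frequent key is kept.
key_sub_table = select_skewed_keys(rows, key_limit=1)
```

Raising `key_limit` to 3 would return all three user IDs in the order 123, 234, 345.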

In addition, the preset association key upper limit value can be determined from experience or from testing. For example, when data cleansing runs into a processing timeout caused by data skew, one can check how many times the association key responsible for the timeout is repeated; if it is 1,000,000, then 1,000,000 records are likely to cause a processing delay. An initial upper limit value, such as 5, is then applied to the sorted association keys and the top five keys are selected. If the fifth-ranked association key turns out to have a count of only 8,000, that upper limit value is not suitable. If the upper limit value is changed to 2 and the second-ranked association key has a count of 1,000,000, the upper limit value is reasonable and selects the data records that would cause data skew. This is only one example; the association key upper limit value can also be determined in other ways, as long as the skewed data can be identified.
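One way to express this check in code is sketched below. The helper `limit_is_reasonable` and the per-key counts other than 8,000 and 1,000,000 are hypothetical; the sketch only tests whether the lowest-ranked selected key still reaches the record count that was observed to cause a timeout.

```python
def limit_is_reasonable(sorted_counts, key_limit, skew_threshold):
    """Return True if every key within the limit has at least `skew_threshold` records.

    `sorted_counts` must be in descending order, so checking the lowest-ranked
    selected key is enough.
    """
    if not 1 <= key_limit <= len(sorted_counts):
        return False
    return sorted_counts[key_limit - 1] >= skew_threshold


# Hypothetical per-key record counts, sorted in descending order.
counts = [3_000_000, 1_000_000, 100_000, 20_000, 8_000]
SKEW_THRESHOLD = 1_000_000  # record count observed to cause a timeout

# Limit 5: the fifth key has only 8,000 records, so the limit is too loose.
too_loose = limit_is_reasonable(counts, key_limit=5, skew_threshold=SKEW_THRESHOLD)
# Limit 2: the second key has 1,000,000 records, so the limit is acceptable.
acceptable = limit_is_reasonable(counts, key_limit=2, skew_threshold=SKEW_THRESHOLD)
```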

Second, according to the association key sub-table, the data in the first data table that matches the association key sub-table is placed into the first data sub-table, and the data that does not match it is placed into the second data sub-table.

For example, after the association key sub-table is obtained, it can be associated with the first data table, for instance by a mapjoin. Mapjoin is one form of join in which the data of a small table is read directly into memory and associated with the other table, which can greatly improve the efficiency of producing the association result. The association key sub-table in this example, Table 4, is a small table, so mapjoin can be used. When the association key sub-table is associated with the first data table, the data that matches Table 4 is placed into the first data sub-table. "Matching" here means finding the data records in the first data table that correspond to the keys in the association key sub-table; in this example, the first data sub-table (Table 5) holds the data records for user ID "123". The data that does not match Table 4, namely the data records for user IDs other than "123", is placed into the second data sub-table (Table 6).

Table 5: First data sub-table

User ID	Login time
123	2016.3.21
······	·······
······	·······
123	2016.3.24

Table 6: Second data sub-table

User ID	Login time
234	2016.3.26
345	2016.3.27

The data in the first data table can be split by means of the association key sub-table in several ways. In one way, the association key sub-table and the first data table undergo a first mapjoin to obtain the data that matches the association key sub-table, which is placed into the first data sub-table; the two tables then undergo a second mapjoin to obtain the data that does not match the association key sub-table, which is placed into the second data sub-table. In another way, the association key sub-table and the first data table undergo a single mapjoin that marks each record as either matching or not matching the association key sub-table; according to the marks, the matching data is placed into the first data sub-table and the non-matching data into the second data sub-table. These are only two examples, and actual implementations are not limited to them, as long as the data is split into Table 5 and Table 6.
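The two splitting approaches can be sketched in Python as follows. This is an in-memory illustration only: the association key sub-table is modeled as a set held in memory (standing in for the small side of a mapjoin), and the function names `split_two_pass` and `split_one_pass` are hypothetical.

```python
# Sample rows from the first data table: (user ID, login time).
first_table = [
    ("123", "2016.3.21"),
    ("123", "2016.3.24"),
    ("234", "2016.3.26"),
    ("345", "2016.3.27"),
]
# Association key sub-table, held in memory like the small side of a mapjoin.
key_sub_table = {"123"}


def split_two_pass(rows, keys):
    """First approach: one scan for matching rows, a second scan for the rest."""
    first_sub = [row for row in rows if row[0] in keys]
    second_sub = [row for row in rows if row[0] not in keys]
    return first_sub, second_sub


def split_one_pass(rows, keys):
    """Second approach: a single scan marks each row, then rows are routed by mark."""
    marked = [(row, row[0] in keys) for row in rows]
    first_sub = [row for row, is_skewed in marked if is_skewed]
    second_sub = [row for row, is_skewed in marked if not is_skewed]
    return first_sub, second_sub


first_sub_table, second_sub_table = split_one_pass(first_table, key_sub_table)
```

Both functions produce the same two sub-tables; the one-pass variant reads the large table only once, at the cost of carrying a mark with each row.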

In step 102, the data that matches the first data sub-table is extracted from the second data table and placed into a third data sub-table.

For example, the association key sub-table shown in Table 4 can be mapjoined with the second data table shown in Table 2 to obtain the data records in the second data table that match the association key sub-table, and these records are placed into the third data sub-table. In the above example, associating Table 4 with Table 2 yields Table 7: the key in Table 4 is user ID "123", so the data record in Table 2 with the same key, user ID "123", is placed into Table 7:

Table 7: Third data sub-table

User ID	User name
123	Zhang San

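Step 102 amounts to filtering the second data table by the skewed keys. A minimal sketch, assuming the association key sub-table fits in memory as a Python set (the variable names are illustrative):

```python
# Second data table: (user ID, user name).
second_table = [("123", "Zhang San"), ("234", "Li Si"), ("345", "Wang Wu")]
# Association key sub-table containing the skewed key.
key_sub_table = {"123"}

# Keep only the rows whose key appears in the association key sub-table.
third_sub_table = [(uid, name) for uid, name in second_table if uid in key_sub_table]
```

The third data sub-table has at most one row per skewed key in this example, which is why it is small enough to serve as the in-memory side of the mapjoin in the next step.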
In step 103, a mapjoin is performed on the first data sub-table and the third data sub-table to obtain a first association table, and the second data sub-table is joined with the second data table to obtain a second association table.

In this step, the third data sub-table is a small table, so it can be mapjoined with the first data sub-table. The result of associating the two is the first association table, shown in Table 8:

Table 8: First association table

User ID	Login time	User name
123	2016.3.21	Zhang San
······	·······	·····
······	·······	·····
123	2016.3.24	Zhang San
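The mapjoin of the first and third data sub-tables can be approximated in a single process as a hash join whose small side is held in memory. The `map_join` function below is a hypothetical sketch of that idea, not the platform's actual mapjoin implementation.

```python
def map_join(big_rows, small_rows):
    """Inner join that loads the small table into memory and streams the big one."""
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)
    # Each big-table row is matched locally, so no shuffle by key is needed.
    return [
        (key, big_value, small_value)
        for key, big_value in big_rows
        for small_value in lookup.get(key, [])
    ]


# First data sub-table (skewed rows) and third data sub-table (small table).
first_sub_table = [("123", "2016.3.21"), ("123", "2016.3.24")]
third_sub_table = [("123", "Zhang San")]

first_association_table = map_join(first_sub_table, third_sub_table)
```

Because each row of the large table is matched against the in-memory lookup where it is read, the skewed key no longer concentrates on a single reduce node.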

The second data sub-table is joined with the second data table to obtain the second association table, shown in Table 9 below:

Table 9: Second association table

User ID	Login time	User name
234	2016.3.26	Li Si
345	2016.3.27	Wang Wu

In step 104, the first association table and the second association table are combined to obtain the association result table, which is the result of associating the first data table with the second data table.

In this step, the first association table and the second association table obtained in step 103 can be combined; the resulting association result table is as shown in Table 3 above.
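Steps 101 to 104 can be put together in one short sketch. This is a single-process illustration under stated assumptions: `hash_join` stands in for both the mapjoin and the ordinary join, the join is an inner join, and the function names are hypothetical.

```python
from collections import Counter


def hash_join(left, right):
    """Inner join of (key, value) rows; `right` is loaded into an in-memory lookup."""
    lookup = {}
    for key, value in right:
        lookup.setdefault(key, []).append(value)
    return [(k, lv, rv) for k, lv in left for rv in lookup.get(k, [])]


def skew_aware_join(first_table, second_table, key_limit=1):
    # Step 101: find the most frequent keys and split the first table.
    counts = Counter(k for k, _ in first_table)
    skewed_keys = {k for k, _ in counts.most_common(key_limit)}
    first_sub = [row for row in first_table if row[0] in skewed_keys]
    second_sub = [row for row in first_table if row[0] not in skewed_keys]
    # Step 102: pull the matching rows out of the second table.
    third_sub = [row for row in second_table if row[0] in skewed_keys]
    # Step 103: join each part separately.
    first_assoc = hash_join(first_sub, third_sub)
    second_assoc = hash_join(second_sub, second_table)
    # Step 104: combine the two partial results.
    return first_assoc + second_assoc


first_table = [
    ("123", "2016.3.21"),
    ("123", "2016.3.24"),
    ("234", "2016.3.26"),
    ("345", "2016.3.27"),
]
second_table = [("123", "Zhang San"), ("234", "Li Si"), ("345", "Wang Wu")]

result = skew_aware_join(first_table, second_table)
```

On this sample data the combined result contains the same rows as joining the two original tables directly, which is the property the method relies on.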

In the data table association method of this example, the data table containing skewed data is split. The skewed data is mapjoined with a small table that matches it, which significantly improves the efficiency of associating this skewed portion. The remaining non-skewed data, no longer affected by the skewed data, can also be associated with the data table quickly. Both parts are processed quickly, which improves the efficiency of data table association.

To implement the above method, the embodiments of the present application also provide a data table association device. The device is used to associate a first data table with a second data table, where the first data table contains skewed data that can cause data skew and non-skewed data other than the skewed data. As shown in Fig. 3, the device can include a table splitting unit 31, a table extraction unit 32, a table association unit 33, and a table combining unit 34, where:

the table splitting unit 31 is configured to extract the skewed data from the first data table into a first data sub-table, and to place the non-skewed data into a second data sub-table;

the table extraction unit 32 is configured to extract from the second data table the data that matches the first data sub-table, and to place it into a third data sub-table;

the table association unit 33 is configured to perform a mapjoin on the first data sub-table and the third data sub-table to obtain a first association table, and to associate the second data sub-table with the second data table to obtain a second association table; and

the table combining unit 34 is configured to combine the first association table and the second association table to obtain an association result table, the association result table being the result of associating the first data table with the second data table.

As shown in Fig. 4, the table splitting unit 31 of the device can include a key extraction subunit 311 and a table generation subunit 312.

The key extraction subunit 311 is configured to extract from the first data table at least one association key that causes data skew, and to place the at least one association key into an association key sub-table.

The table generation subunit 312 is configured to place, according to the association key sub-table, the data in the first data table that matches the association key sub-table into the first data sub-table, and to place the data that does not match the association key sub-table into the second data sub-table.

In another example, when extracting association keys, the key extraction subunit 311 counts the number of occurrences of each association key in the first data table and sorts the association keys in descending order of count; then, according to a preset association key upper limit value, it takes at least one association key whose rank falls within the upper limit value as the at least one association key that causes data skew.

In another example, when generating the third data sub-table, the table extraction unit 32 associates the association key sub-table with the second data table and places the data of the second data table obtained by the association into the third data sub-table.

For the implementation of the functions and roles of the units in the above device, refer to the implementation of the corresponding steps in the above method; details are not repeated here. Because the device embodiment essentially corresponds to the method embodiment, the relevant parts can be found in the description of the method embodiment.

The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of the present application. Those of ordinary skill in the art can understand and implement it without creative effort.

The embodiment of the data table association device of the present application can be applied to a processing device, which can be, for example, a computing device used for data processing in a data warehouse. The device embodiment can be implemented in software, in hardware, or in a combination of the two. Fig. 5 is a hardware structure diagram of the processing device where the data table association device of the present application is located. Taking software implementation as an example, the device in the logical sense is formed by the processor 51 of the processing device reading the corresponding computer program instructions from the non-volatile memory 53 into the memory 52 and running them. Besides the components described above and a network interface, the processing device may include other functional components according to its actual function, which are not described further here.

The foregoing is merely a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A data table association method, characterized in that the method is used to associate a first data table with a second data table, wherein the first data table contains skewed data that can cause data skew and non-skewed data other than the skewed data, the method comprising:

extracting the skewed data from the first data table into a first data sub-table, and placing the non-skewed data into a second data sub-table;

extracting from the second data table the data that matches the first data sub-table, and placing it into a third data sub-table;

performing a mapjoin on the first data sub-table and the third data sub-table to obtain a first association table, and joining the second data sub-table with the second data table to obtain a second association table; and

combining the first association table and the second association table to obtain an association result table, the association result table being the result of associating the first data table with the second data table.

2. The method according to claim 1, characterized in that extracting the skewed data from the first data table into the first data sub-table and placing the non-skewed data into the second data sub-table comprises:

extracting from the first data table at least one association key that causes data skew, and placing the at least one association key into an association key sub-table; and

according to the association key sub-table, placing the data in the first data table that matches the association key sub-table into the first data sub-table, and placing the data that does not match the association key sub-table into the second data sub-table.

3. The method according to claim 2, characterized in that extracting from the first data table at least one association key that causes data skew comprises:

counting the number of occurrences of each association key in the first data table, and sorting the association keys in descending order of count; and

according to a preset association key upper limit value, obtaining at least one association key whose rank falls within the association key upper limit value as the at least one association key that causes data skew.

4. The method according to claim 2, characterized in that extracting from the second data table the data that matches the first data sub-table and placing it into the third data sub-table comprises:

associating the association key sub-table with the second data table, and placing the data of the second data table obtained by the association into the third data sub-table.

5. The method according to claim 2, characterized in that placing, according to the association key sub-table, the data in the first data table that matches the association key sub-table into the first data sub-table, and placing the data that does not match the association key sub-table into the second data sub-table, comprises:

performing a first mapjoin on the association key sub-table and the first data table to obtain the data that matches the association key sub-table, and placing it into the first data sub-table; and performing a second mapjoin on the association key sub-table and the first data table to obtain the data that does not match the association key sub-table, and placing it into the second data sub-table;

or, performing a single mapjoin on the association key sub-table and the first data table, marking the data that matches the association key sub-table and the data that does not match the association key sub-table respectively; and, according to the marks, placing the data that matches the association key sub-table into the first data sub-table and the data that does not match the association key sub-table into the second data sub-table.

6. A data table association device, characterized in that the device is used to associate a first data table with a second data table, wherein the first data table contains skewed data that can cause data skew and non-skewed data other than the skewed data, the device comprising:

a table splitting unit, configured to extract the skewed data from the first data table into a first data sub-table, and to place the non-skewed data into a second data sub-table;

a table extraction unit, configured to extract from the second data table the data that matches the first data sub-table, and to place it into a third data sub-table;

a table association unit, configured to perform a mapjoin on the first data sub-table and the third data sub-table to obtain a first association table, and to join the second data sub-table with the second data table to obtain a second association table; and

a table combining unit, configured to combine the first association table and the second association table to obtain an association result table, the association result table being the result of associating the first data table with the second data table.

7. The device according to claim 6, characterized in that the table splitting unit comprises:

a key extraction subunit, configured to extract from the first data table at least one association key that causes data skew, and to place the at least one association key into an association key sub-table; and

a table generation subunit, configured to place, according to the association key sub-table, the data in the first data table that matches the association key sub-table into the first data sub-table, and to place the data that does not match the association key sub-table into the second data sub-table.

8. The device according to claim 7, characterized in that the key extraction subunit, when extracting association keys, is configured to: count the number of occurrences of each association key in the first data table, and sort the association keys in descending order of count; and, according to a preset association key upper limit value, obtain at least one association key whose rank falls within the association key upper limit value as the at least one association key that causes data skew.

9. The device according to claim 7, characterized in that

the table extraction unit, when generating the third data sub-table, is configured to: associate the association key sub-table with the second data table, and place the data of the second data table obtained by the association into the third data sub-table.

10. The device according to claim 7, characterized in that the table generation subunit is configured to: