CN105095413A - Method and apparatus for solving data skew - Google Patents

Info

Publication number
CN105095413A
Authority
CN
China
Prior art keywords
data
contingency table
statistical information
field
association
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510405167.0A
Other languages
Chinese (zh)
Other versions
CN105095413B (en)
Inventor
张军
刘志祖
牟一超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201510405167.0A
Publication of CN105095413A
Application granted
Publication of CN105095413B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/84 Mapping; Conversion
    • G06F16/86 Mapping to a database

Abstract

The invention provides a method and an apparatus for solving data skew. By introducing a skew analysis and statistics tool and a table flow-type skew processing tool, task failures and timeouts caused by data skew are avoided, and association tasks on massive skewed data can be completed quickly, thereby ensuring the high performance and stability of the data warehouse. The method comprises: using a scheduler, by a data skew statistics module, to compute first statistical information and second statistical information, on an association field, of a first association table and a second association table participating in the association, respectively; processing, by a table flow-type skew processing module at the map side, the data of the first association table and the second association table according to the skew field to be associated, so as to generate association pseudo-column fields respectively; and associating, by a skew data association module at the reduce side, the first association table with the second association table according to the association pseudo-column fields.

Description

Method and apparatus for solving data skew
Technical field
The present invention relates to the field of computer technology, and in particular to a method and an apparatus for solving data skew.
Background
Since the explosive growth of the Internet, traditional data warehouse systems have long been unable to bear the load of the ever-growing volume of data needed to support mainstream search engine companies, e-commerce and social networking sites. Hive, built on top of a Hadoop cluster, arrived at the right time and has made distributed data warehouses practical in the big data era. Data association (join) is a basic operation of relational databases. Hive, a data warehouse based on distributed computation (MapReduce), also supports association operations on massive data sets, and data warehouse (DW) associations in Hive are usually performed between two or more massive data sets. A common e-commerce pattern, for example, is to associate billions of user traffic records with hundreds of millions of commodities and millions of orders in order to produce user-related marketing data, commodity search recommendation results and e-mail addresses. Such association operations involve massive data whose distribution is uneven, so hot-spot data exists.
The underlying implementation of massive data association in Hive is provided by MapReduce. In the map stage, the job is divided into multiple distributed computing tasks according to the size of the data, and computation is performed locally as far as possible to exploit the speed of local data processing. The data is then partitioned (data with the same key enters the same processing queue), and finally the merge and join computation is carried out in the reduce stage.
MapReduce can perform data association either in the map stage or in the reduce stage. Map-side association applies when one of the data sets being associated is small enough (generally no more than 25 MB) to be placed in the distributed cache. Reduce-side association applies when all of the data sets being associated are very large and cannot be cached. For massive data, therefore, map-side association is not possible, and the association can only be performed in the reduce stage. At that point the bottleneck is no longer the scale of the data but its uneven distribution. When a MapReduce program performs a data association, most reduce nodes finish quickly, but one or a few reduce nodes run very slowly, so the whole program takes a long time. This happens because some association keys have far more records than others (sometimes hundreds or thousands of times more), so the reduce node that handles such a key processes much more data than the other nodes and is very slow to finish. This is called data skew. The key to massive data association is solving the data skew problem.
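The effect described above can be illustrated with a small sketch. It assumes a simplified partitioner that sends each record to reducer `key % num_reducers`; this modulo rule and the names `partition_counts`, `keys` and `num_reducers` are illustrative assumptions, not Hive's actual implementation.

```python
from collections import Counter


def partition_counts(keys, num_reducers):
    """Count the records each reducer receives when records are routed by key."""
    counts = Counter()
    for key in keys:
        # Assumption: a plain modulo stands in for the real partitioning function.
        counts[key % num_reducers] += 1
    return counts


# One hot key (12345) with 1,000 records plus 100 keys with one record each.
keys = [12345] * 1000 + list(range(100))
counts = partition_counts(keys, num_reducers=4)
for reducer in sorted(counts):
    print(reducer, counts[reducer])
```

In this sketch every record with the hot key lands on the same reducer, however many reducers are configured, which is why adding reducers alone does not remove the skew.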
At present, Hive mostly handles data skew in data associations with the following methods:
1. Skewed table creation method:
When the table is created, the heavily skewed values of one or more columns are listed, and Hive automatically stores the corresponding data separately. When the association is executed, a first query processes the non-skewed data and a second query processes the heavily skewed data. If the small-table data for the heavily skewed values fits in the memory cache, the association is performed at the map side, which is more efficient. Finally, the two partial query results are merged.
2. Configuration parameter method:
A parameter is configured before Hive performs the data processing. In the first MapReduce task, the map output is distributed randomly among the reduces; each reduce performs a partial aggregation and outputs its result. As a result, data with the same grouping key may be distributed to different reduces, which balances the load. The second MapReduce task then distributes the pre-processed results of the first task to the reduces according to the grouping key (which guarantees that data with the same grouping key reaches the same reduce) and completes the final aggregation. This is the simplest approach.
3. Mapjoin:
This applies when one of the association tables is small (25 MB by default) and its data volume suits the distributed cache. During execution, mapjoin reads the whole small table into local memory and, in the map stage, matches the data of the other table directly against the in-memory table data. This avoids reduce-side association and is therefore the most efficient approach.
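The map-side association of method 3 can be sketched as an in-memory hash join. This is a minimal illustration under the assumption that the small table fits in memory; the function name `map_side_join` and the field names are hypothetical and do not reflect how Hive implements mapjoin internally.

```python
from collections import defaultdict


def map_side_join(small_table, large_rows, key):
    """Join rows of a large table against a small table held entirely in memory."""
    lookup = defaultdict(list)
    for row in small_table:
        lookup[row[key]].append(row)
    for row in large_rows:
        # Each mapper probes the in-memory table, so no reduce stage is needed.
        for match in lookup.get(row[key], []):
            yield {**match, **row}


commodities = [{"commodity_id": 12345, "name": "Iphone6"}]
orders = [{"order_id": i, "commodity_id": 12345} for i in (1, 2, 3)]
joined = list(map_side_join(commodities, orders, "commodity_id"))
```

The sketch also shows the constraint: `lookup` must hold the entire small table, so the approach stops working once neither table is small enough to cache.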
In practice, each of these existing methods has shortcomings when handling data skew in massive data associations.
The shortcoming of the skewed table creation method is that the tables participating in the association must be read and processed twice and, because the results are partial, the final result must also be written and read twice. In addition, the user must analyse the skew values manually when creating the table. If the skew values change frequently, the table structure must be modified, but frequent changes to table structure are not allowed in an online production environment, so a great deal of manual effort is wasted.
The configuration parameter method occasionally produces inconsistent data across repeated executions. This bug has not yet been fixed officially; it harms data quality and is unacceptable.
Mapjoin (map-side association) places strong constraints on machine memory. Although we have increased the constraint parameters of mapjoin, most business scenarios still cannot cache even the smallest data set, so this method is not a general solution and cannot solve our problem.
As can be seen from the above, all existing methods for handling data skew in Hive data associations are flawed to some degree. Data skew can cause task failures and serious timeouts, which affect the overall stability of the data warehouse. The associations that suffer from skew are mostly critical data warehouse tasks, for which failure and timeout are not acceptable. The skew fields of the association tables therefore need to be identified intelligently and handled specially before a task is executed. This requires an automated data skew analyser that counts how often each value of the association field occurs in a target table, since data skew usually occurs on the most frequent values. Handling such scenarios by hand is currently labour-intensive and has become a key factor affecting the stable operation of the data warehouse, so solving and optimising the skew problem in reduce-side association of massive data is urgent.
Summary of the invention
In view of this, the present invention provides a method and an apparatus for solving data skew that overcome the shortcomings of the prior art. By introducing a skew analysis and statistics tool and a table flow-type skew processing tool, task failures and timeouts caused by data skew are avoided and association tasks on massive skewed data can be completed quickly, thereby ensuring the high performance and stability of the data warehouse.
To achieve the above object, the key point of the proposed technical solution is as follows: before the data association is performed at the reduce side, the data containing a skewed key is split into multiple parts by a suitable algorithm so that it is evenly distributed. This avoids the bottleneck of performing the association on a single reduce, so that the data is spread as evenly as possible across multiple reduces and the association proceeds smoothly.
To achieve the above object, according to one aspect of the present invention, a method for solving data skew is provided.
The method for solving data skew of the present invention comprises: using a scheduler, by a data skew statistics module, to compute first statistical information and second statistical information, on an association field, of a first association table and a second association table participating in the association, respectively; processing, by a table flow-type skew processing module at the map side, the data of the first association table and the second association table according to the skew field to be associated, so as to generate association pseudo-column fields respectively; and associating, by a skew data association module at the reduce side, the first association table with the second association table according to the association pseudo-column fields.
Optionally, on the association field, the data volume of the first association table is far smaller than that of the second association table.
Optionally, the computing step further comprises: counting a first data occurrence frequency and a second data occurrence frequency of the first association table and the second association table on the association field, respectively; establishing the first statistical information and the second statistical information according to a screening threshold, wherein the first statistical information comprises first skewed data column values and the first data occurrence frequency, and the second statistical information comprises second skewed data column values and the second data occurrence frequency; and caching the first statistical information and the second statistical information in corresponding files.
Optionally, processing the data of the first association table and the second association table at the map side further comprises: processing, by the table flow-type skew processing module at the map side, each row of the first association table according to the first statistical information; for a skew value of the first association table on the association field, calculating, by the table flow-type skew processing module according to the first statistical information, an upper limit on the size of the sample space for the skew value; and, based on the upper limit, copying each row and generating the association pseudo-column field by the table flow-type skew processing module, wherein the number of copies is related to the upper limit.
Optionally, processing the data of the first association table and the second association table at the map side further comprises: processing, by the table flow-type skew processing module at the map side, each row of the second association table according to the second statistical information; and, according to the second statistical information, evenly grouping and numbering, by the table flow-type skew processing module, the skew values of the second association table on the association field to generate the association pseudo-column field, wherein the number of groups is related to the upper limit.
Optionally, associating, by the skew data association module at the reduce side, the first association table with the second association table according to the association pseudo-column fields further comprises: generating a corresponding reduce field, wherein the number of sample values of the reduce field is related to the upper limit.
According to another aspect of the present invention, an apparatus for solving data skew is provided.
The apparatus for solving data skew of the present invention comprises: a data skew statistics module, configured to use a scheduler to compute first statistical information and second statistical information, on an association field, of a first association table and a second association table participating in the association, respectively; a table flow-type skew processing module, configured to process the data of the first association table and the second association table at the map side according to the skew field to be associated, so as to generate association pseudo-column fields respectively; and a skew data association module, configured to associate the first association table with the second association table at the reduce side according to the association pseudo-column fields.
Optionally, on the association field, the data volume of the first association table is far smaller than that of the second association table.
Optionally, the data skew statistics module is further configured to: count a first data occurrence frequency and a second data occurrence frequency of the first association table and the second association table on the association field, respectively; establish the first statistical information and the second statistical information according to a screening threshold, wherein the first statistical information comprises first skewed data column values and the first data occurrence frequency, and the second statistical information comprises second skewed data column values and the second data occurrence frequency; and cache the first statistical information and the second statistical information in corresponding files.
Optionally, the table flow-type skew processing module is further configured to: process each row of the first association table at the map side according to the first statistical information; for a skew value of the first association table on the association field, calculate, according to the first statistical information, an upper limit on the size of the sample space for the skew value; and, based on the upper limit, copy each row and generate the association pseudo-column field, wherein the number of copies is related to the upper limit.
Optionally, the table flow-type skew processing module is further configured to: process each row of the second association table at the map side according to the second statistical information; and, according to the second statistical information, evenly group and number the skew values of the second association table on the association field to generate the association pseudo-column field, wherein the number of groups is related to the upper limit.
Optionally, the skew data association module is further configured to: generate a corresponding reduce field, wherein the number of sample values of the reduce field is related to the upper limit.
According to the technical solution of the present invention, a data skew analysis tool is introduced to analyse the data for skew before the association and to compute statistics on the skew values that need processing, so that the data to be associated can conveniently be pre-processed. Through table flow-type skew processing, each skew value is evenly divided into multiple parts and the corresponding association pseudo-column fields are generated, producing multiple reduce values. Finally, the association tables are associated on the generated association pseudo-column fields, so that the multiple values of the generated reduce field are processed by multiple reduces. This removes the bottleneck in which a single reduce processes a skew value, solves the data skew problem in reduce-side association of massive data, and ensures the high performance and stability of the data warehouse.
Brief description of the drawings
The drawings are provided for a better understanding of the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a schematic diagram of the main steps of a method for solving data skew according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the main modules of an apparatus for solving data skew according to an embodiment of the present invention.
Embodiments
Exemplary embodiments of the present invention are explained below with reference to the drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Those of ordinary skill in the art will therefore recognise that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.
Fig. 1 is a schematic diagram of the main steps of a method for solving data skew according to an embodiment of the present invention. As shown in Fig. 1, the method for solving data skew in this embodiment mainly comprises the following steps S11 to S13.
Step S11: the data skew statistics module uses a scheduler to compute first statistical information and second statistical information, on the association field, of the first association table and the second association table participating in the association, respectively. The statistics on the association field of the first and second association tables can be computed in the following order:
(1) count the first data occurrence frequency and the second data occurrence frequency of the first association table and the second association table on the association field, respectively;
(2) establish the first statistical information and the second statistical information according to a screening threshold, wherein the first statistical information comprises first skewed data column values and the first data occurrence frequency, and the second statistical information comprises second skewed data column values and the second data occurrence frequency;
(3) cache the first statistical information and the second statistical information in corresponding files.
The screening threshold mentioned here can be set empirically, for example by observing how fast each thread processes data in practice and choosing a reasonable limit. For instance, if skew appears in a data association once a value has more than 10,000 records, the threshold can be set to 10,000. The values in an association table whose record counts exceed the screening threshold are then the skew values that need to be processed.
Step S12: the table flow-type skew processing module processes the data of the first association table and the second association table at the map side according to the skew field to be associated, so as to generate association pseudo-column fields respectively. On the association field, the data volume of the first association table is far smaller than that of the second association table.
Processing the first association table at the map side mainly comprises: processing, by the table flow-type skew processing module at the map side, each row of the first association table according to the first statistical information; for a skew value of the first association table on the association field, calculating, according to the first statistical information, an upper limit on the size of the sample space for the skew value; and, based on this upper limit, copying each row and generating the association pseudo-column field, wherein the number of copies is related to the upper limit.
Processing the second association table at the map side mainly comprises: processing, by the table flow-type skew processing module at the map side, each row of the second association table according to the second statistical information; and, according to the second statistical information, evenly grouping and numbering the skew values of the second association table on the association field to generate the association pseudo-column field, wherein the number of groups is related to the upper limit.
As can be seen from the above, the key to step S12 is determining the upper limit on the size of the sample space for the skew values identified in step S11. Rows of the first association table that hold a skew value on the association field are then copied according to this upper limit to generate the association pseudo-column field, and rows of the second association table that hold a skew value are evenly grouped and numbered according to this upper limit to generate the association pseudo-column field. The upper limit is determined from the first and second statistical information. For example, it may be set to the number of threads performing the data association, or to the quotient of the record count of the skew value divided by the screening threshold.
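One possible way to compute the upper limit is sketched below. The text names two options (the thread count, or the record count divided by the screening threshold); rounding the quotient up and capping it at the thread count are assumptions made for this sketch, as are the function and parameter names.

```python
import math


def sample_space_upper_limit(skew_count, threshold, num_threads=None):
    """Return the upper limit n on the sample space of one skew value."""
    # Option from the text: record count of the skew value divided by the threshold.
    # Assumption: round up so that no group exceeds the threshold.
    n = math.ceil(skew_count / threshold)
    if num_threads is not None:
        # Assumption: never use more groups than there are threads available.
        n = min(n, num_threads)
    return max(n, 1)


n = sample_space_upper_limit(10_000_000, 1_000_000)
capped = sample_space_upper_limit(10_000_000, 1_000_000, num_threads=4)
```

With 10 million records and a threshold of 1 million the sketch yields n = 10, and with only 4 threads available it falls back to 4.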
Step S13: the skew data association module associates the first association table with the second association table at the reduce side according to the association pseudo-column fields. After the data of the first and second association tables has been processed in step S12 and the association pseudo-column fields have been generated, the two association tables are associated on the generated association pseudo-column fields and a corresponding reduce field is generated, wherein the number of sample values of the reduce field is related to the upper limit.
With steps S11 to S13, a single value of the reduce field that existed before skew processing is expanded into multiple values by a suitable algorithm. This removes the former bottleneck in which a single reduce processed the skew value, while preserving the correctness of the data. The skew problem of reduce-side data association is thereby solved, ensuring the high performance and stability of the data warehouse.
Steps S11 to S13 are described in detail below with reference to a specific embodiment. Suppose the data warehouse contains two tables to be associated, table 1 and table 2, and skew statistics have been computed for both tables to obtain the corresponding skew statistical information. Table 1 is a commodity table (see Table 1), table 2 is an order table (see Table 2), and the association field C of the two tables is the commodity id. Assume that a value of association field C has 10 million records in table 2 (with the screening threshold set to 1 million) and 1 record in table 1. This produces a 10-million-to-1 association, meaning that a single reduce would have to process 10 million records, and serious data skew occurs.
To overcome the serious consequences of data skew, the data causing the skew must be processed. First, from the 10-million-to-1 association ratio and the screening threshold of 1 million, the upper limit n on the size of the sample space of the skew value is calculated. Assuming enough threads are available, we can choose n = 10 million / 1 million = 10. Next, the data of table 1 on association field C is stream-processed row by row. In this embodiment, each such row is copied n = 10 times and a new association pseudo-column field k is generated whose values range over {d1, d2, ..., dn}, as shown in Table 3. Then the data of table 2 on association field C is stream-processed row by row. In this embodiment, the skewed data in this table is evenly divided into n groups and numbered, generating an association pseudo-column field k that corresponds to that of table 1 and whose values range over {d1, d2, ..., dn}, as shown in Table 4.
Finally, at the reduce side, the skew data association module associates Table 3 and Table 4, obtained from the skew processing, on the generated association pseudo-column field k. This yields n = 10 reduce values. Where previously one reduce processed 10 million records, now 10 reduces process the 10 million records, that is, each reduce processes 1 million records, and the correctness of the data is not affected.
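The worked example can be sketched end to end on a small scale. This is a single-process illustration under stated assumptions: the pseudo-column value is formed as "value-number" in the style of Table 3, rows of the large table are assigned to groups round-robin, non-skew values get a single group, and all function and field names are hypothetical. It uses 20 orders in place of 10 million.

```python
from collections import Counter, defaultdict


def expand_small_table(rows, key, limits):
    """Copy each row n times, one copy per pseudo-column value (Table 3 style)."""
    for row in rows:
        n = limits.get(row[key], 1)  # non-skew values keep a single copy
        for i in range(1, n + 1):
            yield {**row, "pseudo_key": f"{row[key]}-{i}"}


def group_large_table(rows, key, limits):
    """Spread rows with a skew value evenly over n numbered groups."""
    seen = defaultdict(int)
    for row in rows:
        n = limits.get(row[key], 1)
        group = seen[row[key]] % n + 1  # assumption: round-robin assignment
        seen[row[key]] += 1
        yield {**row, "pseudo_key": f"{row[key]}-{group}"}


def join_on_pseudo_key(left, right):
    """Associate the two processed tables on the pseudo-column field."""
    buckets = defaultdict(list)
    for row in left:
        buckets[row["pseudo_key"]].append(row)
    for row in right:
        for match in buckets.get(row["pseudo_key"], []):
            yield {**match, **row}


limits = {12345: 10}  # upper limit n for the skew value
commodities = [{"commodity_id": 12345, "name": "Iphone6"}]
orders = [{"order_id": i, "commodity_id": 12345} for i in range(1, 21)]

left = expand_small_table(commodities, "commodity_id", limits)
right = group_large_table(orders, "commodity_id", limits)
joined = list(join_on_pseudo_key(left, right))
per_key = Counter(row["pseudo_key"] for row in joined)
```

Each order matches exactly one copy of its commodity row, so the result contains the same rows as a plain join on the commodity id, while the work is spread over 10 pseudo-column values instead of one.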
Table 1: Commodity table
Commodity id    Commodity name    ...
12345           Iphone6           ...
...             ...               ...

Table 2: Order table
Order id    Commodity id    ...
1           12345           ...
2           12345           ...
3           12345           ...
...         ...             ...

Table 3: Commodity table after skew processing
Commodity id    Pseudo-column value of commodity id    Commodity name
12345           12345-1                                Iphone6
12345           12345-2                                Iphone6
...             ...                                    ...

Table 4: Order table after skew processing
As can be seen from the above, after the skewed data has been processed by the method of the present invention, the errors and processing timeouts caused by data skew are avoided and the correctness of the data is preserved, thereby ensuring the high performance and stability of the data warehouse.
Fig. 2 is a kind of main modular schematic diagram solving the device of data skew according to the embodiment of the present invention.As shown in Figure 2, the device 20 of the solution data skew in the embodiment of the present invention comprises data skew statistical module 21, surface low formula inclination processing module 22 and tilt data relating module 23.
Data skew statistical module 21, counts the first contingency table participating in association and first statistical information of the second contingency table on associate field and the second statistical information respectively for using scheduler program.
Data skew statistical module 21, can also for adding up the first data occurrence frequency on associate field of the first contingency table and the second contingency table and the second data occurrence frequency respectively; According to screening threshold value, set up the first statistical information and the second statistical information, this first statistical information comprises the first tilt data train value and the first data occurrence frequency, and this second statistical information comprises the second tilt data train value and the second data occurrence frequency; And this first statistical information and the second statistical information are buffered in corresponding document.So just can before carry out data correlation, identify the tilting value of contingency table and carry out statistical study, thus conveniently can carry out advanced processing to associated data.
Surface low formula inclination processing module 22, for according to inclination field to be associated, processes the data of the first contingency table and the second contingency table at map end, to produce the pseudo-row field of association respectively.Wherein, on associate field, the data volume of the first contingency table is far smaller than the data volume of the second contingency table.
Surface low formula inclination processing module 22, can also be used for according to the first statistical information, holds every data line of process first contingency table at map; For the tilting value of the first contingency table on associate field, according to the first statistical information, calculate the number higher limit of the sample space for described tilting value; And based on this higher limit, copy every data line and generate the pseudo-row field of association, wherein, this number of copies is relevant to this higher limit.
Surface low formula inclination processing module 22 may also be configured to: process each data row of the second contingency table at the map end according to the second statistical information; and, according to the second statistical information, divide the tilting values of the second contingency table on the associate field evenly into groups and number the groups to generate the association pseudo-row field, where the number of groups is related to the upper limit calculated for the sample space.
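The even grouping and numbering of the larger table can be sketched as below. Round-robin assignment is one assumed way to divide the rows evenly; a real map task sees only its own input split, so it could keep a per-task counter like this one or draw the group number at random. The names `group_second_table` and `pseudo_key` are hypothetical.

```python
from collections import Counter


def group_second_table(rows, join_field, limits):
    """Map-side step for the second (larger) table: spread the rows of each
    tilting value evenly over N numbered groups (round-robin, an assumption).
    `limits` maps each tilting value to its N; other values stay in group 0."""
    seen = Counter()  # rows seen so far per associate-field value
    for row in rows:
        key = row[join_field]
        group = seen[key] % limits.get(key, 1)
        seen[key] += 1
        yield {**row, "pseudo_key": (key, group)}


# Hypothetical second contingency table: seven rows for "u1", one for "u2".
orders = [{"user_id": "u1", "amount": n} for n in range(7)]
orders.append({"user_id": "u2", "amount": 99})
grouped = list(group_second_table(orders, "user_id", {"u1": 3}))
print(Counter(row["pseudo_key"] for row in grouped))
```

The group numbers must cover the same range 0..N-1 as the copies made of the smaller table, otherwise some rows would find no partner at the reduce end.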
As described above, after surface low formula inclination processing module 22 has processed the tilting values of the two contingency tables, multiple association pseudo-row fields are produced. The data for a single value of the associate field is thereby divided into multiple parts, which makes it possible to generate multiple reduce values.
Tilt data relating module 23 is configured to associate the first contingency table with the second contingency table at the reduce end according to the association pseudo-row field.
Tilt data relating module 23 may also be configured to generate a corresponding reduce field, where the number of sample values of the generated reduce field is related to the upper limit calculated for the sample space. In this way, the multiple values of the generated reduce field can be associated by multiple reduces, which avoids the errors and timeouts caused by data skew and thereby safeguards the performance and stability of the data warehouse.
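A minimal in-memory sketch of the reduce-end association follows. It assumes that both inputs already carry a `pseudo_key` column produced at the map end, that the association is an inner join, and that one reduce call handles one distinct pseudo key; the function name and the sample rows are hypothetical.

```python
from collections import defaultdict


def reduce_side_join(first_rows, second_rows):
    """Group both inputs by pseudo_key (the shuffle), then join inside each
    group, as one reduce call per distinct pseudo_key would (inner join)."""
    groups = defaultdict(lambda: ([], []))
    for row in first_rows:
        groups[row["pseudo_key"]][0].append(row)
    for row in second_rows:
        groups[row["pseudo_key"]][1].append(row)

    joined = []
    for left_rows, right_rows in groups.values():
        for left in left_rows:
            for right in right_rows:
                merged = {**left, **right}
                merged.pop("pseudo_key")  # helper column, not part of the result
                joined.append(merged)
    return joined


# The smaller table was copied twice for "u1"; the larger table was split into two groups.
first = [
    {"user_id": "u1", "name": "Ann", "pseudo_key": ("u1", 0)},
    {"user_id": "u1", "name": "Ann", "pseudo_key": ("u1", 1)},
]
second = [
    {"user_id": "u1", "amount": 1, "pseudo_key": ("u1", 0)},
    {"user_id": "u1", "amount": 2, "pseudo_key": ("u1", 1)},
    {"user_id": "u1", "amount": 3, "pseudo_key": ("u1", 0)},
]
result = reduce_side_join(first, second)
print(sorted(row["amount"] for row in result))
```

Each row of the larger table meets exactly one copy of its partner row, so the result matches an ordinary association on `user_id` while the work for "u1" is spread over two reduce keys.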
According to the technical scheme of the embodiment of the present invention, a data skew analysis tool is introduced to analyze the data for skew before association and to compile statistics on the data skew values that need to be processed, so that the associated data can conveniently be processed in advance. Through surface low formula inclination processing, each tilting value is divided evenly into multiple parts and a corresponding association pseudo-row field is generated, so as to produce multiple reduce values. Finally, the contingency tables are associated according to the generated association pseudo-row field, so that the multiple values of the generated reduce field can be associated by multiple reduces. This eliminates the bottleneck of a single reduce processing a tilting value, thereby solving the data skew problem that arises when a large amount of data is associated at the reduce end and safeguarding the performance and stability of the data warehouse.
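The effect on the reduce-end bottleneck can be illustrated with a small load calculation. The sketch below counts how many rows of the larger table land on the busiest reduce key, with and without dividing the tilting value into groups. The key names, the group count of 8 and the round-robin assignment are assumptions chosen for the example.

```python
from collections import Counter


def max_reduce_load(keys, limits=None):
    """Return the row count of the busiest reduce key when each value listed
    in `limits` is spread round-robin over that many numbered groups."""
    limits = limits or {}
    seen = Counter()  # rows seen so far per associate-field value
    load = Counter()  # rows assigned to each (value, group) reduce key
    for key in keys:
        group = seen[key] % limits.get(key, 1)
        seen[key] += 1
        load[(key, group)] += 1
    return max(load.values())


# Hypothetical associate-field column: one tilting value and three ordinary ones.
keys = ["hot"] * 1000 + ["a", "b", "c"]
print(max_reduce_load(keys))              # 1000: a single reduce handles every "hot" row
print(max_reduce_load(keys, {"hot": 8}))  # 125: the rows are spread over 8 reduce keys
```

The price of the lower peak load is that every matching row of the smaller table has to be copied once per group, which stays cheap as long as that table is small.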
The above embodiments do not limit the scope of protection of the present invention. Those skilled in the art should understand that, depending on design requirements and other factors, various modifications, combinations, sub-combinations and substitutions may occur. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (12)

1. A method for solving data skew, characterized by comprising:
using, by a data skew statistical module, a scheduler program to compute, on an associate field, first statistical information of a first contingency table and second statistical information of a second contingency table, the first and second contingency tables participating in an association;
processing, by a surface low formula inclination processing module and according to an inclination field to be associated, the data of the first contingency table and the second contingency table at the map end, so as to produce an association pseudo-row field for each table; and
associating, by a tilt data relating module, the first contingency table with the second contingency table at the reduce end according to the association pseudo-row field.
2. The method according to claim 1, characterized in that, on the associate field, the data volume of the first contingency table is far smaller than the data volume of the second contingency table.
3. The method according to claim 1, characterized in that the step of computing the statistical information further comprises:
counting a first data occurrence frequency of the first contingency table and a second data occurrence frequency of the second contingency table on the associate field;
establishing the first statistical information and the second statistical information according to a screening threshold, the first statistical information comprising first tilt data column values and the first data occurrence frequency, and the second statistical information comprising second tilt data column values and the second data occurrence frequency; and
caching the first statistical information and the second statistical information in corresponding files.
4. The method according to claim 1, characterized in that processing the data of the first contingency table and the second contingency table at the map end further comprises:
processing, by the surface low formula inclination processing module, each data row of the first contingency table at the map end according to the first statistical information;
for a tilting value of the first contingency table on the associate field, calculating, by the surface low formula inclination processing module and according to the first statistical information, an upper limit on the number of values in the sample space for the tilting value; and
based on the upper limit, copying, by the surface low formula inclination processing module, each data row and generating the association pseudo-row field, wherein the number of copies is related to the upper limit.
5. The method according to claim 1 or 4, characterized in that processing the data of the first contingency table and the second contingency table at the map end further comprises:
processing, by the surface low formula inclination processing module, each data row of the second contingency table at the map end according to the second statistical information; and
according to the second statistical information, dividing, by the surface low formula inclination processing module, the tilting values of the second contingency table on the associate field evenly into groups and numbering the groups to generate the association pseudo-row field, wherein the number of groups is related to the upper limit.
6. The method according to claim 1, characterized in that associating, by the tilt data relating module, the first contingency table with the second contingency table at the reduce end according to the association pseudo-row field further comprises:
generating a corresponding reduce field, wherein the number of sample values of the reduce field is related to the upper limit.
7. A device for solving data skew, characterized by comprising:
a data skew statistical module, configured to use a scheduler program to compute, on an associate field, first statistical information of a first contingency table and second statistical information of a second contingency table, the first and second contingency tables participating in an association;
a surface low formula inclination processing module, configured to process, according to an inclination field to be associated, the data of the first contingency table and the second contingency table at the map end, so as to produce an association pseudo-row field for each table; and
a tilt data relating module, configured to associate the first contingency table with the second contingency table at the reduce end according to the association pseudo-row field.
8. The device according to claim 7, characterized in that, on the associate field, the data volume of the first contingency table is far smaller than the data volume of the second contingency table.
9. The device according to claim 7, characterized in that the data skew statistical module is further configured to:
count a first data occurrence frequency of the first contingency table and a second data occurrence frequency of the second contingency table on the associate field;
establish the first statistical information and the second statistical information according to a screening threshold, the first statistical information comprising first tilt data column values and the first data occurrence frequency, and the second statistical information comprising second tilt data column values and the second data occurrence frequency; and
cache the first statistical information and the second statistical information in corresponding files.
10. The device according to claim 7, characterized in that the surface low formula inclination processing module is further configured to:
process each data row of the first contingency table at the map end according to the first statistical information;
for a tilting value of the first contingency table on the associate field, calculate, according to the first statistical information, an upper limit on the number of values in the sample space for the tilting value; and
based on the upper limit, copy each data row and generate the association pseudo-row field, wherein the number of copies is related to the upper limit.
11. The device according to claim 7 or 10, characterized in that the surface low formula inclination processing module is further configured to:
process each data row of the second contingency table at the map end according to the second statistical information; and
according to the second statistical information, divide the tilting values of the second contingency table on the associate field evenly into groups and number the groups to generate the association pseudo-row field, wherein the number of groups is related to the upper limit.
12. The device according to claim 7, characterized in that the tilt data relating module is further configured to:
generate a corresponding reduce field, wherein the number of sample values of the reduce field is related to the upper limit.
Application CN201510405167.0A, priority date 2015-07-09, filing date 2015-07-09: Method and apparatus for solving data skew (Active), published as CN105095413B (en)

Publications (2)

Publication Number Publication Date
CN105095413A (en) 2015-11-25
CN105095413B (en) 2018-11-23

Family

ID=54575850

Country Status (1)

Country Link
CN (1) CN105095413B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant