CN105095413B - Method and device for solving data skew - Google Patents

Method and device for solving data skew

Info

Publication number
CN105095413B
Authority
CN
China
Prior art keywords
data
join table
statistical information
field
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510405167.0A
Other languages
Chinese (zh)
Other versions
CN105095413A (en)
Inventor
张军
刘志祖
牟超
牟一超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201510405167.0A
Publication of CN105095413A
Application granted
Publication of CN105095413B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/84 Mapping; Conversion
    • G06F16/86 Mapping to a database

Abstract

The present invention provides a method and device for solving data skew. By introducing a skew analysis statistics tool and a table-streaming skew processing tool, it avoids the task failures and timeouts caused by data skew and allows join tasks on massive skewed data to complete quickly, thereby ensuring the high performance and stability of the data warehouse. The method for solving data skew of the present invention includes: counting, by a data skew statistics module using a scheduler program, first statistical information and second statistical information of a first join table and a second join table participating in the join, respectively, on the join field; processing, by a table-streaming skew processing module on the map side, the data of the first join table and the second join table according to the skewed field to be joined, so as to generate a join pseudo-column field for each; and joining, by a skewed-data join module on the reduce side, the first join table and the second join table according to the join pseudo-column field.

Description

Method and device for solving data skew
Technical field
The present invention relates to the field of computer technology, and in particular to a method and device for solving data skew.
Background
Since the explosive growth of the Internet, the traditional data warehouse systems that support mainstream search engine companies, e-commerce sites, and social networking sites have long been overwhelmed by ever-growing volumes of data. Hive, which is built on Hadoop clusters, arrived at the right time and has become a boon for implementing distributed data warehouses in the big data era. Data association (joining) is a basic operation of relational databases and is a way of publishing any kind of resource on the World Wide Web. Hive, a data warehouse based on distributed computing (MapReduce), also supports join operations on massive data sets, and a join in a Hive data warehouse (DW) usually involves two or more massive data sets. For example, a common data pattern in e-commerce is to join billions of user-traffic records with hundreds of millions of commodities and millions of orders in order to generate user and commodity search recommendation results and e-mail marketing data. Such join operations are characterized by massive data volumes and by uneven data distribution with hot-spot data.
The underlying implementation of Hive joins on massive data is provided by MapReduce. In the Map (mapping) stage, the job is split into multiple distributed computing tasks according to the size of the data, and computation is performed locally as far as possible to exploit the speed of local data processing. The data is then partitioned (records with the same key are sent to the same processing task), and finally the join is computed in the Reduce (reduction) stage.
MapReduce can perform a join in either the map stage or the reduce stage. A map-side join applies when one of the joined data sets is small enough (usually no more than 25 MB) to be placed in the distributed cache, whereas a reduce-side join is used when all of the joined data sets are too large to cache. For massive data, therefore, a map-side join is not possible and the join must be done in the reduce stage. At that point the bottleneck is no longer the size of the data but its uneven distribution. When MapReduce runs a join, most reduce nodes finish quickly, but one or a few reduce nodes run very slowly, which makes the whole program take a long time. This happens because some join key has far more records than the other join keys (sometimes hundreds or thousands of times more), so the reduce node that handles that key processes far more data than the other nodes, and one or a few nodes seem never to finish. This is called data skew. The key to joining massive data is therefore to solve the data skew problem.
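The effect described above can be illustrated with a minimal sketch. It assumes a hypothetical partitioner that assigns a record to a reducer by taking an integer join key modulo the number of reducers; the function name and the sample data are illustrative only and are not taken from Hive or Hadoop.

```python
from collections import Counter

def partition_counts(keys, num_reducers):
    """Count how many records each reducer receives when records are
    partitioned by join key (hypothetical partitioner: key modulo reducers)."""
    counts = Counter()
    for key in keys:
        counts[key % num_reducers] += 1
    return [counts[r] for r in range(num_reducers)]

# One hot key (12345) with 1,000 records plus 100 evenly spread keys.
keys = [12345] * 1000 + list(range(100))
loads = partition_counts(keys, 4)
print(loads)  # one reducer receives almost all of the records
```

Every record of the hot key lands on the same reducer, so that reducer's load dominates the run time of the whole job.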
At present, Hive mostly handles data skew in joins with the following methods:
1. Creating a skewed table:
When the table is created, the heavily skewed values of one or more columns are listed, and Hive automatically stores the corresponding data separately. When the join is executed, a first query processes the non-skewed data and a second query processes the heavily skewed data. If the small table's data for the heavily skewed values fits in the memory cache, that part is joined on the map side, which is more efficient. Finally, the two partial query results are merged.
2. Setting configuration parameters:
Parameters are configured before Hive processes the data. In the first MapReduce job, the map output is distributed randomly to the reducers; each reducer performs part of the operation and outputs its result. As a result, records with the same grouping key may be distributed to different reducers, which balances the load. The second MapReduce job then distributes the preprocessed results of the first job to the reducers by grouping key (this step guarantees that records with the same grouping key go to the same reducer) and completes the final aggregation. This is the simplest approach.
3. Mapjoin:
This applies when one of the joined tables is small (25 MB by default) and its data fits in the distributed cache. During execution, mapjoin reads the whole small table into local memory, and in the map stage the rows of the other table are matched directly against the in-memory table. This avoids a reduce-side join and is therefore the most efficient approach.
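The idea behind the third approach can be sketched as follows. This is an illustration of a map-side (in-memory) join in plain Python, not Hive's implementation; the function name, the field names, and the sample rows are assumptions made for the example.

```python
def map_side_join(small_table, big_rows, key):
    """Join a small in-memory table against a stream of rows from a large table.
    The small table is loaded into a dictionary once; each large-table row is
    then matched locally, so no reduce-side join is needed."""
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[key], []).append(row)
    for row in big_rows:
        for match in lookup.get(row[key], []):
            yield {**match, **row}

# Hypothetical sample data.
commodities = [{"commodity_id": 12345, "name": "Iphone6"}]
orders = [{"order_id": 1, "commodity_id": 12345},
          {"order_id": 2, "commodity_id": 99}]
joined = list(map_side_join(commodities, orders, "commodity_id"))
```

The approach only works while the lookup dictionary fits in memory, which is the constraint discussed in the text.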
In practice, each of these existing approaches has shortcomings when handling skew in joins on massive data.
The drawback of creating a skewed table is that the joined tables must be read and processed twice, and because the results are partial, the final result must also be written and read twice. In addition, the user must analyze the skew values manually when creating the table. If the skew values change frequently, the table structure has to be changed, but a production environment does not allow tables to be modified frequently, so a great deal of manual work is wasted.
The configuration-parameter approach occasionally produces inconsistent data across repeated executions, and this defect has not yet been officially fixed. This harms data quality and is unacceptable.
Mapjoin (map-side join) places strong constraints on machine memory. Although we have raised the mapjoin constraint parameter, in most business scenarios even the smallest data set still cannot be cached, so this method is not a general solution and cannot solve our problem.
As described above, the existing methods for handling data skew in Hive joins all have defects to some degree. Data skew, however, causes task failures and serious timeouts and affects the overall stability of the data warehouse. In particular, the data that causes join skew almost always belongs to key data warehouse tasks, for which failure and timeout are not acceptable. The skewed fields of the joined tables therefore need to be identified intelligently and given special handling before a task runs. This calls for an automated data skew analyzer that counts how often each value occurs in the join field of a target table, because data skew usually occurs on the frequently occurring values of such fields. Handling these cases specially currently consumes considerable manpower and has become a key factor affecting the stable operation of the data warehouse. Solving and optimizing skew in reduce-side joins on massive data is therefore urgent.
Summary of the invention
In view of this, the present invention provides a method and device for solving data skew that overcome the shortcomings and deficiencies of the prior art. By introducing a skew analysis statistics tool and a table-streaming skew processing tool, the invention avoids the task failures and timeouts caused by data skew and allows join tasks on massive skewed data to complete quickly, thereby ensuring the high performance and stability of the data warehouse.
To achieve the above object, the key point of the proposed technical solution is as follows: before the join is performed on the reduce side, a suitable algorithm splits the data containing a skewed key into multiple parts so that it is evenly distributed. This avoids the bottleneck of a single reducer performing the join, spreads the data as evenly as possible across multiple reducers, and lets the join proceed smoothly.
To achieve the above object, according to one aspect of the present invention, a method for solving data skew is provided.
The method for solving data skew of the present invention includes: counting, by a data skew statistics module using a scheduler program, first statistical information and second statistical information of a first join table and a second join table participating in the join, respectively, on the join field; processing, by a table-streaming skew processing module on the map side, the data of the first join table and the second join table according to the skewed field to be joined, so as to generate a join pseudo-column field for each; and joining, by a skewed-data join module on the reduce side, the first join table and the second join table according to the join pseudo-column field.
Optionally, on the join field, the data volume of the first join table is far smaller than that of the second join table.
Optionally, the counting step further includes: counting a first data occurrence frequency and a second data occurrence frequency of the first join table and the second join table, respectively, on the join field; establishing the first statistical information and the second statistical information according to a screening threshold, where the first statistical information includes first skewed data column values and the first data occurrence frequency, and the second statistical information includes second skewed data column values and the second data occurrence frequency; and caching the first statistical information and the second statistical information in corresponding files.
Optionally, processing the data of the first join table and the second join table on the map side further includes: processing each row of the first join table on the map side, by the table-streaming skew processing module, according to the first statistical information; for a skew value of the first join table on the join field, calculating, by the table-streaming skew processing module according to the first statistical information, an upper limit on the number of values in the sample space for that skew value; and, based on the upper limit, replicating each row by the table-streaming skew processing module and generating the join pseudo-column field, where the number of copies is related to the upper limit.
Optionally, processing the data of the first join table and the second join table on the map side further includes: processing each row of the second join table on the map side, by the table-streaming skew processing module, according to the second statistical information; and, according to the second statistical information, dividing the skew values of the second join table on the join field evenly into groups and numbering them, by the table-streaming skew processing module, to generate the join pseudo-column field, where the number of groups is related to the upper limit.
Optionally, joining the first join table and the second join table on the reduce side by the skewed-data join module according to the join pseudo-column field further includes: generating a corresponding reduce field, where the number of sample values of the reduce field is related to the upper limit.
According to another aspect of the present invention, a device for solving data skew is provided.
The device for solving data skew of the present invention includes: a data skew statistics module, configured to use a scheduler program to count first statistical information and second statistical information of a first join table and a second join table participating in the join, respectively, on the join field; a table-streaming skew processing module, configured to process, on the map side, the data of the first join table and the second join table according to the skewed field to be joined, so as to generate a join pseudo-column field for each; and a skewed-data join module, configured to join, on the reduce side, the first join table and the second join table according to the join pseudo-column field.
Optionally, on the join field, the data volume of the first join table is far smaller than that of the second join table.
Optionally, the data skew statistics module is further configured to: count a first data occurrence frequency and a second data occurrence frequency of the first join table and the second join table, respectively, on the join field; establish the first statistical information and the second statistical information according to a screening threshold, where the first statistical information includes first skewed data column values and the first data occurrence frequency, and the second statistical information includes second skewed data column values and the second data occurrence frequency; and cache the first statistical information and the second statistical information in corresponding files.
Optionally, the table-streaming skew processing module is further configured to: process each row of the first join table on the map side according to the first statistical information; for a skew value of the first join table on the join field, calculate, according to the first statistical information, an upper limit on the number of values in the sample space for that skew value; and, based on the upper limit, replicate each row and generate the join pseudo-column field, where the number of copies is related to the upper limit.
Optionally, the table-streaming skew processing module is further configured to: process each row of the second join table on the map side according to the second statistical information; and, according to the second statistical information, divide the skew values of the second join table on the join field evenly into groups and number them to generate the join pseudo-column field, where the number of groups is related to the upper limit.
Optionally, the skewed-data join module is further configured to: generate a corresponding reduce field, where the number of sample values of the reduce field is related to the upper limit.
According to the technical solution of the present invention, a data skew analysis tool is introduced to analyze the data for skew before the join and to identify and count the skew values that need to be handled, which makes it convenient to preprocess the data to be joined. Table-streaming skew processing divides the skew values evenly into multiple parts and generates the corresponding join pseudo-column field, thereby producing multiple reduce values. Finally, the join tables are joined according to the generated join pseudo-column field, so that the multiple values of the generated reduce field are joined by multiple reducers. This removes the bottleneck in which only one reducer handles a skew value, solves the data skew problem of reduce-side joins on massive data, and ensures the high performance and stability of the data warehouse.
Brief description of the drawings
The accompanying drawings are provided for a better understanding of the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a schematic diagram of the main steps of a method for solving data skew according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the main modules of a device for solving data skew according to an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Those of ordinary skill in the art will therefore recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.
Fig. 1 is a schematic diagram of the main steps of a method for solving data skew according to an embodiment of the present invention. As shown in Fig. 1, the method for solving data skew in this embodiment mainly includes the following steps S11 to S13.
Step S11: The data skew statistics module uses a scheduler program to count the first statistical information and the second statistical information of the first join table and the second join table participating in the join, respectively, on the join field. The statistics on the join field of the two tables can be collected in the following order:
(1) count the first data occurrence frequency and the second data occurrence frequency of the first join table and the second join table, respectively, on the join field;
(2) establish the first statistical information and the second statistical information according to a screening threshold, where the first statistical information includes the first skewed data column values and the first data occurrence frequency, and the second statistical information includes the second skewed data column values and the second data occurrence frequency;
(3) cache the first statistical information and the second statistical information in corresponding files.
The screening threshold mentioned here can be set empirically, for example by choosing a reasonable limit based on observations such as how fast each thread processes data at run time. Specifically, if skew is observed during a join once a value has more than 10,000 records, the threshold can be set to 10,000. The data in a join table that exceeds the screening threshold is then the skew value that needs to be handled.
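Sub-steps (1) and (2) above can be sketched as follows. This is a minimal in-memory illustration, assuming rows are represented as dictionaries; the function name `skew_statistics` and the sample data are hypothetical, and the caching of the statistics to a file in sub-step (3) is omitted.

```python
from collections import Counter

def skew_statistics(rows, join_field, threshold):
    """Count how often each value occurs in the join field and keep only the
    values whose occurrence frequency exceeds the screening threshold."""
    frequency = Counter(row[join_field] for row in rows)
    return {value: count for value, count in frequency.items() if count > threshold}

# Hypothetical order rows: commodity 12345 occurs 5 times, commodity 7 once.
orders = [{"order_id": i, "commodity_id": 12345} for i in range(1, 6)]
orders.append({"order_id": 6, "commodity_id": 7})
stats = skew_statistics(orders, "commodity_id", threshold=3)
```

Here `stats` maps each skew value to its occurrence frequency, which corresponds to the skewed data column values and data occurrence frequency held in the statistical information.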
Step S12: The table-streaming skew processing module processes the data of the first join table and the second join table on the map side according to the skewed field to be joined, so as to generate a join pseudo-column field for each. On the join field, the data volume of the first join table is far smaller than that of the second join table.
The processing of the first join table on the map side by the table-streaming skew processing module, according to the skewed field to be joined, mainly includes: processing each row of the first join table on the map side according to the first statistical information; for a skew value of the first join table on the join field, calculating, according to the first statistical information, an upper limit on the number of values in the sample space for that skew value; and, based on the upper limit, replicating each row and generating the join pseudo-column field, where the number of copies is related to the upper limit.
The processing of the second join table on the map side by the table-streaming skew processing module, according to the skewed field to be joined, mainly includes: processing each row of the second join table on the map side according to the second statistical information; and, according to the second statistical information, dividing the skew values of the second join table on the join field evenly into groups and numbering them to generate the join pseudo-column field, where the number of groups is related to the upper limit.
As can be seen from the above, the key to step S12 is to determine the upper limit on the number of values in the sample space for the skew value identified in step S11. The rows of the first join table that carry the skew value on the join field are then replicated according to the upper limit to generate the join pseudo-column field, and the rows of the second join table that carry the skew value are divided evenly into groups according to the upper limit and numbered to generate the join pseudo-column field. The upper limit is determined from the first and second statistical information. For example, it may be chosen as the number of threads performing the join, or as the number of records of the skew value divided by the screening threshold.
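The map-side processing described in step S12 might be sketched as below. This is an illustration under stated assumptions rather than the patented implementation: the upper limit is taken as the record count divided by the screening threshold, rounded up (the rounding is an assumption); even grouping is done round-robin (also an assumption); and the function names and the pseudo-column name `pseudo` are hypothetical.

```python
import math

def upper_limit(skew_count, threshold):
    """Upper limit n: record count of the skew value divided by the screening
    threshold, rounded up (the rounding is an assumption of this sketch)."""
    return max(1, math.ceil(skew_count / threshold))

def expand_small_table(rows, join_field, limits):
    """First (small) table: replicate each row n times, one copy per pseudo value."""
    for row in rows:
        n = limits.get(row[join_field], 1)
        for i in range(1, n + 1):
            yield {**row, "pseudo": f"{row[join_field]}-{i}"}

def number_large_table(rows, join_field, limits):
    """Second (large) table: spread rows of each skew value evenly over n
    numbered groups (round-robin is an assumption of this sketch)."""
    seen = {}
    for row in rows:
        value = row[join_field]
        n = limits.get(value, 1)
        index = seen.get(value, 0)
        seen[value] = index + 1
        yield {**row, "pseudo": f"{value}-{index % n + 1}"}

limits = {12345: upper_limit(10, 4)}  # 10 records, threshold 4
commodities = [{"commodity_id": 12345, "name": "Iphone6"}]
orders = [{"order_id": i, "commodity_id": 12345} for i in range(1, 11)]
small = list(expand_small_table(commodities, "commodity_id", limits))
large = list(number_large_table(orders, "commodity_id", limits))
```

Rows whose join value is not listed in `limits` are treated as non-skewed and receive a single pseudo value, so both tables can still be joined on the pseudo column alone.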
Step S13: The skewed-data join module joins the first join table and the second join table on the reduce side according to the join pseudo-column field. After the data of the two tables has been processed in step S12 and a join pseudo-column field has been generated for each, the two tables are joined on the generated join pseudo-column field and a corresponding reduce field is generated, where the number of sample values of the reduce field is related to the upper limit.
With steps S11 to S13, a single value of a reduce field before skew processing is turned into multiple values by a suitable algorithm. This removes the bottleneck in which only one reducer handles the skew value while preserving the correctness of the data, which solves the skew problem of reduce-side joins and ensures the high performance and stability of the data warehouse.
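The correctness claim can be checked on a small example. The sketch below simulates a reduce-side join by grouping rows on a key, with one group standing in for one reducer call; the function name, the pseudo-column name `pseudo`, and the hard-coded sample rows are assumptions of the illustration.

```python
from collections import defaultdict

def reduce_side_join(left_rows, right_rows, key):
    """Group both inputs by `key` (one group per reducer call) and pair the
    left and right rows within each group."""
    groups = defaultdict(lambda: ([], []))
    for row in left_rows:
        groups[row[key]][0].append(row)
    for row in right_rows:
        groups[row[key]][1].append(row)
    joined = []
    for lefts, rights in groups.values():
        joined.extend({**left, **right} for left in lefts for right in rights)
    return joined

# Hypothetical rows that already carry the pseudo column, with n = 2.
commodities = [{"commodity_id": 12345, "name": "Iphone6", "pseudo": f"12345-{i}"}
               for i in (1, 2)]
orders = [{"order_id": i, "commodity_id": 12345, "pseudo": f"12345-{(i - 1) % 2 + 1}"}
          for i in range(1, 5)]

salted = reduce_side_join(commodities, orders, "pseudo")              # two groups
plain = reduce_side_join(commodities[:1], orders, "commodity_id")     # one group
```

Both joins return each order exactly once with its commodity name; the join on the pseudo column simply spreads the work over two groups instead of one.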
Steps S11 to S13 are described in detail below with reference to a specific embodiment. Suppose two tables in the data warehouse, table 1 and table 2, are to be joined, and skew statistics have been collected for both. Table 1 is the commodity table (see Table 1), table 2 is the order table (see Table 2), and the join field C of the two tables is "commodity id". Assume that for one value of join field C, table 2 contains 10,000,000 records (with the screening threshold set to 1,000,000) while table 1 contains 1 record. This produces a 10,000,000-to-1 join; that is, a single reducer would have to process 10,000,000 records, and serious data skew occurs.
To avoid the serious consequences of data skew, the data causing the skew must be processed. First, the upper limit n on the number of values in the sample space for the skewed data value is calculated from the 10,000,000-to-1 join ratio and the screening threshold of 1,000,000. Assuming enough threads are available, we can choose n = 10,000,000 / 1,000,000 = 10. Next, the data of table 1 on join field C is processed row by row as a stream. In this embodiment, the processing of table 1 replicates the row n = 10 times and generates a new join pseudo-column field k whose values are {d1, d2, ..., dn}, as shown in Table 3. Then the data of table 2 on join field C is processed row by row as a stream. In this embodiment, the processing of table 2 divides the skewed data in the table evenly into n groups and numbers them, generating a join pseudo-column field k that corresponds to that of table 1 and whose values range over {d1, d2, ..., dn}, as shown in Table 4.
Finally, the skewed-data join module joins Table 3 and Table 4, obtained after skew processing, on the reduce side according to the generated join pseudo-column field k. This yields n = 10 reducers, so the earlier situation of one reducer processing 10,000,000 records becomes 10 reducers processing 10,000,000 records, that is, each reducer processing 1,000,000 records, without affecting the correctness of the data.
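The arithmetic of this embodiment can be reproduced at a smaller scale. The sketch below uses 1,000 records and a threshold of 100 as stand-ins for the 10,000,000 records and the threshold of 1,000,000, which keeps the same ratio; the round-robin group assignment and the pseudo-value format are assumptions of the illustration.

```python
import math
from collections import Counter

skew_rows = 1_000   # scaled-down stand-in for the 10,000,000 order records
threshold = 100     # scaled-down stand-in for the 1,000,000 screening threshold
n = math.ceil(skew_rows / threshold)  # same ratio as in the embodiment

# Records per reduce key before and after adding the pseudo column.
before = Counter("12345" for _ in range(skew_rows))
after = Counter(f"12345-{i % n + 1}" for i in range(skew_rows))

print(n, max(before.values()), max(after.values()))
```

Before processing, one reduce key carries every record; afterwards the records are spread evenly over n keys, each carrying one n-th of the load.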
Table 1: Commodity table
Commodity id | Commodity name | ……
12345 | Iphone6 | ……
…… | …… | ……
Table 2: Order table
Order id | Commodity id | ……
1 | 12345 | ……
2 | 12345 | ……
3 | 12345 | ……
…… | …… | ……
Table 3: Commodity table after skew processing
Commodity id | Commodity id pseudo-column value | Commodity name
12345 | 12345-1 | Iphone6
12345 | 12345-2 | Iphone6
…… | …… | ……
Table 4: Order table after skew processing
As can be seen from the above, processing skewed data with the method of the present invention avoids the errors and timeouts caused by data skew while guaranteeing the correctness of the data, thereby ensuring the high performance and stability of the data warehouse.
Fig. 2 is a kind of main modular schematic diagram of device for solving data skew according to an embodiment of the present invention.Such as Fig. 2 institute Show, the device 20 of the solution data skew in the embodiment of the present invention includes data skew statistical module 21, surface low formula inclination processing Module 22 and tilt data relating module 23.
Data skew statistical module 21 participates in associated first contingency table and for counting respectively using scheduler program First statistical information and second statistical information of two contingency tables on associate field.
Data skew statistical module 21 can be also used for counting the first contingency table and the second contingency table respectively in associate field On the first data occurrence frequency and the second data occurrence frequency;According to screening threshold value, the first statistical information and the second system are established Information is counted, which includes the first tilt data train value and the first data occurrence frequency, and second statistics is believed Breath includes the second tilt data train value and the second data occurrence frequency;And first statistical information and the second statistical information are delayed There are in corresponding document.The tilting value of contingency table and for statistical analysis can be thus identified before carrying out data correlation, from And it can facilitate and advanced processing is carried out to associated data.
Surface low formula tilts processing module 22, for according to inclination field to be associated, at the end map to the first contingency table and the The data of two contingency tables are handled, and are associated with pseudo- column field to generate respectively.Wherein, on associate field, the first contingency table Data volume is far smaller than the data volume of the second contingency table.
Surface low formula tilts processing module 22, can be also used for according to the first statistical information, handles the first contingency table at the end map Every data line;For tilting value of first contingency table on associate field, according to the first statistical information, calculate for institute State the number upper limit value of the sample space of tilting value;And it is based on the upper limit value, every data line is replicated and generates pass Join pseudo- column field, wherein the number of copies is related to the upper limit value.
The table streaming skew processing module 22 may further be used to: process each row of data of the second association table at the map side according to the second statistical information; and, according to the second statistical information, evenly group and number the skew values of the second association table on the association field to generate the association pseudo-column field, where the number of groups is related to the upper limit value.
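The map-side processing of the two tables can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the upper limit value is passed in as a hypothetical dictionary `limits` (the text only says the number of copies and the number of groups are related to it), the pseudo-column is formed as `value#number`, rows of the large table are grouped round-robin, and non-skewed values simply receive the number 0.

```python
def map_small_table(rows, field, limits):
    """First (small) table: replicate each row once per group number so that
    every numbered group of the large table can find its match."""
    for row in rows:
        copies = limits.get(row[field], 1)  # non-skewed values: one copy
        for number in range(copies):
            yield (f"{row[field]}#{number}", row)


def map_large_table(rows, field, limits):
    """Second (large) table: spread the rows of each skew value evenly over
    numbered groups (round-robin is an assumed grouping rule)."""
    seen = {}
    for row in rows:
        value = row[field]
        groups = limits.get(value, 1)
        position = seen.get(value, 0)
        seen[value] = position + 1
        yield (f"{value}#{position % groups}", row)


# Hypothetical data: "u1" is the skew value and is split into 3 parts.
small = [{"user_id": "u1", "name": "Ann"}, {"user_id": "u2", "name": "Bob"}]
large = [{"user_id": "u1", "order": k} for k in range(6)]
large.append({"user_id": "u2", "order": 6})
limits = {"u1": 3}

small_keys = [key for key, _ in map_small_table(small, "user_id", limits)]
large_keys = [key for key, _ in map_large_table(large, "user_id", limits)]
print(small_keys)  # ['u1#0', 'u1#1', 'u1#2', 'u2#0']
print(large_keys)  # ['u1#0', 'u1#1', 'u1#2', 'u1#0', 'u1#1', 'u1#2', 'u2#0']
```

Replicating the small table is affordable precisely because its data volume on the association field is far smaller than that of the large table.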
As described above, after the table streaming skew processing module 22 has processed the skew values of the two association tables, multiple association pseudo-column fields can be generated. The data for a single value of the association field is thus divided into multiple parts, which facilitates the generation of multiple reduce values.
The skewed data association module 23 is used to associate, at the reduce side, the first association table and the second association table according to the association pseudo-column fields.
The skewed data association module 23 may further be used to generate a corresponding reduce field, where the number of samples of the generated reduce field is related to the upper limit value. In this way, the multiple values of the generated reduce field can be associated by multiple reducers, which avoids the abnormal errors and timeouts caused by data skew and thereby helps ensure the high performance and stability of the data warehouse.
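A reduce-side association on the pseudo-column can be sketched as below. The sketch assumes that both tables have already been turned into `(pseudo_column, row)` pairs at the map side and that each distinct pseudo-column value may be handled by a different reducer. The function name and the sample rows are hypothetical.

```python
from collections import defaultdict


def reduce_side_join(small_pairs, large_pairs):
    """Group both inputs by pseudo-column value and emit one joined row per
    match. Each key of the result stands for work that a separate reducer
    could handle."""
    small_by_key = defaultdict(list)
    for key, row in small_pairs:
        small_by_key[key].append(row)

    joined = defaultdict(list)
    for key, large_row in large_pairs:
        for small_row in small_by_key.get(key, []):
            joined[key].append({**small_row, **large_row})
    return dict(joined)


# Hypothetical map output: skew value "u1" was split into two parts.
small_pairs = [
    ("u1#0", {"user_id": "u1", "name": "Ann"}),
    ("u1#1", {"user_id": "u1", "name": "Ann"}),
]
large_pairs = [
    ("u1#0", {"user_id": "u1", "order": 1}),
    ("u1#1", {"user_id": "u1", "order": 2}),
    ("u1#0", {"user_id": "u1", "order": 3}),
]

result = reduce_side_join(small_pairs, large_pairs)
for key in sorted(result):
    print(key, len(result[key]))
```

Without the pseudo-column, all three rows for `u1` would land on the same reducer; with it, the rows are split across the keys `u1#0` and `u1#1`.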
According to the technical solution of the embodiments of the present invention, a data skew analysis tool is introduced to perform skew analysis on the data before association, analyzing and counting the data skew values that need to be processed, which makes it convenient to preprocess the data to be associated. Through table streaming skew processing, the skew values are evenly divided into multiple parts and the corresponding association pseudo-column fields are generated, thereby producing multiple reduce values. Finally, the association tables are associated according to the generated association pseudo-column fields, so that the multiple values of the generated reduce field can be associated by multiple reducers. This eliminates the bottleneck of a single reducer processing a skew value, solves the data skew problem that arises when large volumes of data are associated at the reduce side, and helps ensure the high performance and stability of the data warehouse.
The specific embodiments described above do not limit the scope of protection of the present invention. Those skilled in the art should understand that, depending on design requirements and other factors, various modifications, combinations, sub-combinations, and substitutions may occur. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (14)

1. A method for solving data skew, characterized by comprising:
collecting, by a data skew statistics module using a scheduler program, first statistical information and second statistical information on an association field for a first association table and a second association table that participate in the association, respectively, so as to obtain the data skew values to be processed;
processing, by a table streaming skew processing module at the map side and according to the skewed field to be associated, the data of the first association table and the second association table, so as to evenly divide the data skew values into multiple parts and to generate association pseudo-column fields respectively, wherein the association pseudo-column fields are generated by numbering the association field; and
associating, by a skewed data association module at the reduce side, the first association table and the second association table according to the association pseudo-column fields.
2. The method according to claim 1, characterized in that, on the association field, the data volume of the first association table is far smaller than the data volume of the second association table.
3. The method according to claim 1, characterized in that the step of collecting statistics further comprises:
counting, respectively, a first data occurrence frequency and a second data occurrence frequency of the first association table and the second association table on the association field;
establishing the first statistical information and the second statistical information according to a screening threshold, wherein the first statistical information includes a first skewed data column value and the first data occurrence frequency, and the second statistical information includes a second skewed data column value and the second data occurrence frequency; and
caching the first statistical information and the second statistical information in corresponding files.
4. The method according to claim 1, characterized in that processing the data of the first association table and the second association table at the map side comprises:
processing, by the table streaming skew processing module at the map side, each row of data of the first association table according to the first statistical information;
for a skew value of the first association table on the association field, calculating, by the table streaming skew processing module according to the first statistical information, an upper limit value on the size of the sample space for the skew value; and
based on the upper limit value, replicating, by the table streaming skew processing module, each row of data and generating the association pseudo-column field, wherein the number of copies is related to the upper limit value.
5. The method according to claim 4, characterized in that processing the data of the first association table and the second association table at the map side comprises:
processing, by the table streaming skew processing module at the map side, each row of data of the second association table according to the second statistical information; and
according to the second statistical information, evenly grouping and numbering, by the table streaming skew processing module, the skew values of the second association table on the association field to generate the association pseudo-column field, wherein the number of groups is related to the upper limit value.
6. The method according to claim 4, characterized in that associating, by the skewed data association module at the reduce side, the first association table and the second association table according to the association pseudo-column fields further comprises:
generating a corresponding reduce field, wherein the number of samples of the reduce field is related to the upper limit value.
7. A device for solving data skew, characterized by comprising:
a data skew statistics module, configured to collect, using a scheduler program, first statistical information and second statistical information on an association field for a first association table and a second association table that participate in the association, respectively, so as to obtain the data skew values to be processed;
a table streaming skew processing module, configured to process, at the map side and according to the skewed field to be associated, the data of the first association table and the second association table, so as to evenly divide the data skew values into multiple parts and to generate association pseudo-column fields respectively, wherein the association pseudo-column fields are generated by numbering the association field; and
a skewed data association module, configured to associate, at the reduce side, the first association table and the second association table according to the association pseudo-column fields.
8. The device according to claim 7, characterized in that, on the association field, the data volume of the first association table is far smaller than the data volume of the second association table.
9. The device according to claim 7, characterized in that the data skew statistics module is further configured to:
count, respectively, a first data occurrence frequency and a second data occurrence frequency of the first association table and the second association table on the association field;
establish the first statistical information and the second statistical information according to a screening threshold, wherein the first statistical information includes a first skewed data column value and the first data occurrence frequency, and the second statistical information includes a second skewed data column value and the second data occurrence frequency; and
cache the first statistical information and the second statistical information in corresponding files.
10. The device according to claim 7, characterized in that the table streaming skew processing module is further configured to:
process each row of data of the first association table at the map side according to the first statistical information;
for a skew value of the first association table on the association field, calculate, according to the first statistical information, an upper limit value on the size of the sample space for the skew value; and
based on the upper limit value, replicate each row of data and generate the association pseudo-column field, wherein the number of copies is related to the upper limit value.
11. The device according to claim 10, characterized in that the table streaming skew processing module is further configured to:
process each row of data of the second association table at the map side according to the second statistical information; and
according to the second statistical information, evenly group and number the skew values of the second association table on the association field to generate the association pseudo-column field, wherein the number of groups is related to the upper limit value.
12. The device according to claim 10, characterized in that the skewed data association module is further configured to:
generate a corresponding reduce field, wherein the number of samples of the reduce field is related to the upper limit value.
13. An electronic device for solving data skew, characterized by comprising:
one or more processors; and
a storage device for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method according to any one of claims 1 to 6.
14. A computer-readable medium storing a computer program, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 6.
CN201510405167.0A 2015-07-09 2015-07-09 A kind of method and device solving data skew Active CN105095413B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510405167.0A CN105095413B (en) 2015-07-09 2015-07-09 A kind of method and device solving data skew

Publications (2)

Publication Number Publication Date
CN105095413A CN105095413A (en) 2015-11-25
CN105095413B true CN105095413B (en) 2018-11-23

Family

ID=54575850

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510405167.0A Active CN105095413B (en) 2015-07-09 2015-07-09 A kind of method and device solving data skew

Country Status (1)

Country Link
CN (1) CN105095413B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106874322A (en) * 2016-06-27 2017-06-20 阿里巴巴集团控股有限公司 Data table association method and device
CN106126343B (en) * 2016-06-27 2020-04-03 西北工业大学 MapReduce data balancing method based on incremental partitioning strategy
CN107577531B (en) * 2016-07-05 2020-12-04 阿里巴巴集团控股有限公司 Load balancing method and device
CN106293938A (en) * 2016-08-05 2017-01-04 飞思达技术(北京)有限公司 Method for solving data skew in big data computation
CN106446030B (en) * 2016-08-31 2019-11-22 天津南大通用数据技术股份有限公司 Support classification inquiry clustered database system and classification querying method with pseudo-column
CN108121745B (en) * 2016-11-30 2021-08-06 中移(苏州)软件技术有限公司 Data loading method and device
CN108536824B (en) * 2018-04-10 2020-11-20 中国农业银行股份有限公司 Data processing method and device
CN108595268B (en) * 2018-04-24 2021-03-09 咪咕文化科技有限公司 Data distribution method and device based on MapReduce and computer-readable storage medium
CN108776692A (en) * 2018-06-06 2018-11-09 北京京东尚科信息技术有限公司 Method and apparatus for handling information
US11003686B2 (en) * 2018-07-26 2021-05-11 Roblox Corporation Addressing data skew using map-reduce
CN109298947A (en) * 2018-10-24 2019-02-01 北京奇虎科技有限公司 Data processing method and device, calculating equipment in distributed system
CN109684401A (en) * 2018-12-30 2019-04-26 北京金山云网络技术有限公司 Data processing method, device and system
CN111611243B (en) * 2020-05-13 2023-06-13 第四范式(北京)技术有限公司 Data processing method and device
CN111966681A (en) * 2020-08-14 2020-11-20 咪咕文化科技有限公司 Data processing method, device, network equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103631657A (en) * 2013-11-19 2014-03-12 浪潮电子信息产业股份有限公司 Task scheduling algorithm based on MapReduce
CN104239529A (en) * 2014-09-19 2014-12-24 浪潮(北京)电子信息产业有限公司 Method and device for preventing Hive data skew
CN104298736A (en) * 2014-09-30 2015-01-21 华为软件技术有限公司 Method and device for aggregating and connecting data as well as database system
CN104679590A (en) * 2013-11-27 2015-06-03 阿里巴巴集团控股有限公司 Map optimization method and device in distributed computing system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9311380B2 (en) * 2013-03-29 2016-04-12 International Business Machines Corporation Processing spatial joins using a mapreduce framework

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant