CN105930479A - Data skew processing method and apparatus - Google Patents

Data skew processing method and apparatus

Info

Publication number
CN105930479A
CN105930479A CN201610279684.2A
Authority
CN
China
Prior art keywords
data
sub
data set
hashtable
files
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610279684.2A
Other languages
Chinese (zh)
Inventor
刘光华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
LeTV Holding Beijing Co Ltd
LeTV Information Technology Beijing Co Ltd
Original Assignee
LeTV Holding Beijing Co Ltd
LeTV Information Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by LeTV Holding Beijing Co Ltd, LeTV Information Technology Beijing Co Ltd filed Critical LeTV Holding Beijing Co Ltd
Priority to CN201610279684.2A priority Critical patent/CN105930479A/en
Publication of CN105930479A publication Critical patent/CN105930479A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/13File access structures, e.g. distributed indices
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention provide a data skew processing method and apparatus in the technical field of computer applications. The method comprises: when the data volumes of a first data set and a second data set subjected to a mapjoin operation are both greater than a preset threshold, splitting the first data set into N sub-data sets, where N is greater than or equal to 2 and the data volume of each sub-data set is smaller than or equal to the preset threshold; and performing the mapjoin operation on the second data set and each of the N sub-data sets. The technical scheme effectively avoids the MapReduce shuffle process and the processing of a large amount of same-key data in a single reduce, and solves most data skew problems to a certain extent; in other words, resources are traded for efficiency, greatly improving the efficiency of Hive tasks that would otherwise suffer data skew.

Description

Data skew processing method and apparatus
Technical Field
The embodiments of the invention relate to the technical field of computer applications, and in particular to a data skew processing method and apparatus.
Background
Faced with ever-growing volumes of data, the traditional data warehouse systems that support mainstream search engine companies, e-commerce sites and social networks have become too heavyweight; Hive, built on a Hadoop cluster, implements a distributed data warehouse for the big-data era and can effectively solve these problems. Data association (join) is a basic operation of relational databases and a common way of combining resources published on the World Wide Web. Hive, a data warehouse based on distributed computation (MapReduce), also supports association operations over massive data sets; typically, a Hive data warehouse (DW) association joins two or more massive data sets. For example, a common e-commerce scenario joins billions of rows of user traffic with billions of commodity records and millions of order records to generate user-related marketing data, commodity search and recommendation results, and e-mail content. Such association operations are characterized by massive data, uneven data distribution and hot-spot data.
The underlying implementation of Hive's massive-data association is provided by MapReduce: the association is divided into multiple distributed computing tasks according to the data size; in the map (mapping) stage, operations are performed locally as far as possible to exploit the speed advantage of data locality; the data are then partitioned, with records sharing the same key routed to the same processing stream; and finally the merge-and-associate operation is performed in the reduce (reduction) stage.
MapReduce can perform data association either in the map stage or in the reduce stage. Map-side association applies when one side of the join has a data set small enough (generally no more than 25 MB) to be placed in the distributed cache; reduce-stage association applies when both sides of the join are large and cannot be cached. For massive data, therefore, map-side association is impossible and association can only be performed in the reduce stage. However, the bottleneck of data association lies not in the overall data size but in uneven data distribution. When MapReduce executes a data association, most reduce nodes finish quickly while one or a few reduce nodes run very slowly, so the whole job takes a very long time. The cause is that some join key has far more records than the others (sometimes hundreds or thousands of times more): the reduce node handling that key processes far more data than the other nodes, and one or more nodes finish very late. This is called data skew. The key to associating massive data is therefore solving the data skew problem.
At present, the mapjoin operation applies when one of the association tables is a small table whose data volume fits in the distributed cache. During execution, mapjoin reads the entire small table into local memory and, in the map stage, matches the other table's data directly against the in-memory data, avoiding reduce-side association and achieving the highest efficiency. This approach, however, places a hard limit on the small table's size: once the small table exceeds a certain threshold (the mapjoin threshold is set according to the specific cluster, currently about 100 MB), the join cannot be converted into a mapjoin. If the two tables also suffer from data skew, the job runs very slowly; some jobs take days to finish, and sometimes the required result is never obtained at all. Even if the mapjoin threshold parameter is enlarged, in most business scenarios the smaller data set still cannot be cached, so this method is not a general solution and cannot solve the problem.
Disclosure of Invention
The embodiment of the invention provides a data skew processing method and apparatus to solve the prior-art problem that the mapjoin operation limits the data volume it can handle.
The embodiment of the invention provides a data skew processing method, comprising the following steps:
when the data volumes of a first data set and a second data set subjected to a mapping-association (mapjoin) operation are both larger than a preset threshold, splitting the first data set into N sub-data sets, wherein N is larger than or equal to 2 and the data volume of each sub-data set is smaller than or equal to the preset threshold;
and performing mapjoin operation on the second data set and the N sub-data sets respectively.
Optionally, the method described above further includes:
acquiring the data volume of two data sets for performing mapjoin operation;
and comparing the data volumes of the two data sets, and according to the comparison result, taking the data set with the smaller data volume as the first data set and the data set with the larger data volume as the second data set.
Optionally, in the foregoing method, the splitting the first data set into N sub-data sets includes:
acquiring the data volume of the first data set;
calculating the splitting number N according to the data volume and the preset threshold value;
and splitting the first data set into N sub-data sets.
Optionally, in the foregoing method, performing mapjoin operation on the second data set and the N sub-data sets respectively includes:
respectively serializing the N sub-data sets to obtain N HashTable files;
segmenting the second data set into M data blocks, wherein M is greater than or equal to 2;
and starting M x N mapping map tasks, and executing associated join operations on the N HashTable files and the M data blocks.
Optionally, in the foregoing method, the starting M × N map tasks and performing join operations on the N HashTable files and the M data blocks specifically includes:
and starting M × N map tasks, reading one subdata set from each map task to the memory, and acquiring one data block and the subdata set in the memory to perform join operation.
Optionally, in the foregoing method, the serializing the N sub-data sets respectively to obtain N HashTable files specifically:
and using a kryo serialization library of java to serialize the N sub-data sets respectively to obtain N HashTable files.
Optionally, the method further includes:
naming the N HashTable files respectively according to a preset naming rule;
and storing the N named HashTable files into a preset HDFS directory of the Hadoop distributed file system.
Optionally, the method further includes:
and writing the result generated by the mapjoin operation into an intermediate result file directory of the HDFS.
An embodiment of the present invention provides a data skew processing apparatus, including:
the device comprises a splitting module, a judging module and a processing module, wherein the splitting module is used for splitting a first data set into N sub-data sets when the data volumes of the first data set and a second data set which are subjected to mapjoin operation are both larger than a preset threshold, wherein N is larger than or equal to 2, and the data volume of the sub-data sets is smaller than or equal to the preset threshold;
and the execution module is used for performing mapjoin operation on the second data set and the N sub-data sets respectively.
Optionally, the above apparatus further comprises:
the acquisition module is used for acquiring the data volume of the two data sets for performing mapjoin operation;
and the comparison module is used for comparing the data volumes of the two data sets and, according to the comparison result, taking the data set with the smaller data volume as the first data set and the data set with the larger data volume as the second data set.
Optionally, in the above apparatus, the splitting module includes:
an acquisition unit configured to acquire a data amount of the first data set;
the calculating unit is used for calculating the splitting number N according to the data volume and the preset threshold value;
and the splitting unit is used for splitting the first data set into N sub-data sets.
Optionally, in the above apparatus, the executing module includes:
the processing unit is used for serializing the N sub-data sets respectively to obtain N HashTable files;
a segmentation unit configured to segment the second data set into M data blocks, where M is greater than or equal to 2;
and the execution unit is used for starting M x N mapping map tasks and executing join operation on the N HashTable files and the M data blocks.
Optionally, in the above apparatus, the execution unit is specifically configured to start M × N map tasks, each map task reads one of the sub data sets to the memory, and obtains one of the data blocks and the sub data set in the memory to perform a join operation.
Optionally, in the above apparatus, the processing unit is specifically configured to use a kryo serialization library of java to serialize the N sub-data sets, respectively, to obtain N HashTable files.
Optionally, in the above apparatus, the execution module further includes:
the naming unit is used for naming the N HashTable files respectively according to a preset naming rule;
and the storage unit is used for storing the named N HashTable files into a preset HDFS directory.
Optionally, the apparatus further includes:
and the writing module is used for writing the result generated by the mapjoin operation into an intermediate result file directory of the HDFS.
The embodiment of the invention splits a table whose data volume exceeds the preset mapjoin threshold into multiple sub-tables that each satisfy the threshold allowed by the mapjoin operation, and then generates and starts multiple map tasks to execute the join operation, so that the two data sets being joined can be associated at the map end. This effectively avoids the MapReduce shuffle process and the processing of a large amount of same-key data in a single reduce, and solves most data skew problems to a certain extent; in other words, the technical solution provided by the embodiment of the present invention trades resources for efficiency, greatly improving the efficiency of Hive tasks that would otherwise suffer data skew.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a data skew processing method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a data skew processing method according to a second embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a data skew processing apparatus according to a third embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
By adopting the technical scheme provided by the embodiment of the invention, the data volume limit on a hive table participating in a mapjoin operation can be raised from the original 25-200 MB level to the 1-100 GB level. The technical scheme can be understood as an enhancement of the existing hive mapjoin scheme, and can solve most data skew problems.
Fig. 1 is a flowchart illustrating a data skew processing method according to an embodiment of the present invention. As shown in fig. 1, the method includes:
Step 101, when the data volumes of the first data set and the second data set subjected to the mapjoin operation are both larger than a preset threshold, splitting the first data set into N sub-data sets.
Wherein N is greater than or equal to 2, and the data volume of the sub data set is less than or equal to the preset threshold. The splitting step can be realized by adopting the following method:
first, the data volume of the first data set is obtained.
And then, calculating the splitting number N according to the data volume and the preset threshold value.
In a specific embodiment, the splitting number N may be calculated by using the following formula:
N = data volume of the first data set / preset threshold (rounded up, so that each sub-data set stays at or below the threshold)
And finally, splitting the first data set into N sub-data sets.
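The splitting step above can be sketched as follows. This is an illustrative Python stand-in, not the patent's Hive/Java implementation; the function and parameter names are hypothetical:

```python
import math

def split_dataset(records, data_volume_mb, threshold_mb):
    # N = data volume / preset threshold, rounded up so that every
    # sub-data set is at or below the mapjoin threshold; N >= 2.
    n = max(2, math.ceil(data_volume_mb / threshold_mb))
    # A round-robin split keeps the sub-data sets near-equal in size.
    return [records[i::n] for i in range(n)]

# A 250 MB first data set with a 100 MB threshold yields N = 3.
subsets = split_dataset(list(range(10)), data_volume_mb=250, threshold_mb=100)
```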
Here, it should be noted that: as known to those skilled in the art, the two data sets for the mapjoin operation are usually a large table (i.e., a table with a large data amount) and a small table (i.e., a table with a small data amount), the small table is read into the local memory, and the data of the direct Canadian table is matched with the table data in the memory in the map phase. Therefore, when the first data volume and the second data volume in the embodiment of the present invention are two data sets with different data volumes, and particularly, the difference between the data volumes is large, the technical solution provided in the embodiment of the present invention may also split one of the tables with a small data volume, so as to reduce the number of the split sub data sets. Namely, the method provided by the embodiment of the present invention further includes:
and step S1, acquiring the data volume of the two data sets subjected to the mapjoin operation.
And step S2, comparing the data volumes of the two data sets, and according to the comparison result, taking the data set with the smaller data volume as the first data set and the data set with the larger data volume as the second data set.
And 102, performing mapjoin operation on the second data set and the N sub-data sets respectively.
In specific implementation, the implementation process of the step is as follows:
and 1021, serializing the N sub-data sets respectively to obtain N HashTable files.
In a specific implementation, the N sub-data sets may be serialized using the kryo serialization library for Java to obtain N HashTable files. Using kryo as the Java serialization library effectively reduces the size of the compressed files.
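A minimal sketch of turning each sub-data set into a serialized hash-table file keyed on the join key. The patent serializes with Java's kryo library; Python's pickle stands in here purely for illustration, and all names (including the file-name pattern) are hypothetical:

```python
import os
import pickle
import tempfile

def serialize_subsets(subsets, out_dir):
    # Build one hash table per sub-data set (join key -> list of values)
    # and serialize each table to its own HashTable file.
    paths = []
    for i, subset in enumerate(subsets, start=1):
        table = {}
        for key, value in subset:
            table.setdefault(key, []).append(value)
        path = os.path.join(out_dir, f"hashtable_file{i}")
        with open(path, "wb") as f:
            pickle.dump(table, f)  # kryo in the patent; pickle stands in
        paths.append(path)
    return paths

out_dir = tempfile.mkdtemp()
paths = serialize_subsets([[(1, "a"), (1, "b")], [(2, "c")]], out_dir)
```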
Step 1022, segment the second data set into M data blocks.
Wherein M is greater than or equal to 2. In practice, the mapjoin computation is divided into two steps: the small table's data is turned into a HashTable and broadcast to all map ends, and the large table's data is split appropriately; in the map stage, the large table's rows probe the small table's HashTable row by row, and whenever the join keys are equal the result is written to HDFS (Hadoop Distributed File System). The splitting of the large table happens at the Hadoop level, since the large table's data is stored on HDFS. Files on HDFS are stored as blocks, and the block size is determined by a configuration parameter (currently 512 MB on the cluster). It follows that, in the foregoing steps provided by the embodiment of the present invention, M = data volume of the second data set / configuration parameter: if the second data set's volume is less than or equal to the configuration parameter, M equals 1; if it is greater, M is greater than or equal to 2.
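Under the description above, M follows from dividing the second data set's volume by the HDFS block size. A small sketch, with units and names chosen for illustration only:

```python
import math

def block_count(second_set_volume_mb, block_size_mb=512):
    # M = data volume of the second data set / configuration parameter
    # (the HDFS block size; 512 MB on the cluster in the text).
    # A data set no larger than one block yields M = 1.
    return max(1, math.ceil(second_set_volume_mb / block_size_mb))
```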
And 1023, starting M x N mapping map tasks, and executing associated join operations on the N HashTable files and the M data blocks.
M × N map tasks are started; each map task reads one sub-data set into memory and obtains one data block to join with the in-memory sub-data set. Suppose the first data set is split into 2 sub-data sets, sub-data set 1 and sub-data set 2, and the second data set is divided into 3 data blocks, data block 1, data block 2 and data block 3; then 6 map tasks need to be started, and these 6 map tasks execute in parallel. For convenience of explanation, we will refer to them as map task 1 through map task 6. Wherein,
the map task 1 reads the subdata set 1 into the memory, and acquires the data block 1 and the subdata set 1 to perform join operation;
the map task 2 reads the subdata set 1 into the memory and obtains the data block 2 and the subdata set 1 to perform join operation;
the map task 3 reads the subdata set 1 into the memory, and acquires the data block 3 and the subdata set 1 to perform join operation;
the map task 4 reads the subdata set 2 into the memory, and obtains the data block 1 and the subdata set 2 to perform join operation;
the map task 5 reads the subdata set 2 into the memory, and obtains the data block 2 and the subdata set 2 to perform join operation;
and the map task 6 reads the subdata set 2 into the memory, and acquires the data block 3 and the subdata set 2 to perform join operation.
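The pairing above amounts to enumerating the Cartesian product of data blocks and sub-data sets; each (block, sub-data set) pair becomes one map task. An illustrative sketch with hypothetical names:

```python
def plan_map_tasks(n_subsets, m_blocks):
    # One map task per (data block, sub-data set) pair: M * N tasks,
    # ordered as in the example (all blocks with sub-data set 1 first).
    return [(block, subset)
            for subset in range(1, n_subsets + 1)
            for block in range(1, m_blocks + 1)]

tasks = plan_map_tasks(n_subsets=2, m_blocks=3)
# 6 tasks: (1, 1), (2, 1), (3, 1), (1, 2), (2, 2), (3, 2)
```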
Here, it should be noted that: and starting M × N map tasks according to the execution plan generated by hive. Specifically, the generation process of the execution plan includes: first, it is necessary to modify the existing source code of hive, and generate an execution plan by parsing the hit (hint) of mapjoin specified by hive. The execution plan is then rewritten, generating a new plan tree. For example, assuming that the number N of the sub data sets is 3, the sub data sets are rewritten into 3 respective mapjoin tasks, and meanwhile, a dependency is added to the downstream plan, and after the upstream task is completed, a trigger instruction for triggering the 3 mapjoin tasks is added. The purpose of adding dependency to the downstream plan is to trigger the downstream operation after the mapjoin operation is completed until the task is completely executed. The downstream plan may be some operations after the mapjoin operation is performed in the prior art, which is not the focus of the embodiment of the present invention, and therefore, will not be described in detail herein.
All sub-data sets are distributed to the nodes of their mapjoin tasks using Hadoop's DistributedCache; for example, if the number N of sub-data sets is 3, the 3 sub-data sets are distributed to the nodes of the 3 mapjoin tasks. Under the MapReduce framework, M × N map tasks are started simultaneously according to the generated execution plan. Each map task reads one sub-data set into memory and, in the map stage, probes the small table's HashTable row by row with the second data set's rows; whenever the join keys are equal, the result is written to HDFS. The M × N map tasks execute in parallel.
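The per-task probe step described above — matching each row of a data block against the in-memory hash table — can be sketched as follows (illustrative only; names are hypothetical):

```python
def map_task_join(block_rows, hashtable):
    # One map task: probe the in-memory hash table with each row of
    # the data block; emit a joined record when the join keys match.
    joined = []
    for key, left_value in block_rows:
        for right_value in hashtable.get(key, []):
            joined.append((key, left_value, right_value))
    return joined
```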
Further, the method provided by the embodiment of the present invention may further include:
and 103, naming the N HashTable files respectively according to a preset naming rule.
The naming rule may be specified manually, for example file1, file2, ...; this is not limited in the embodiments of the present invention.
And 104, storing the named N HashTable files into a preset HDFS directory.
For example, the N named HashTable files may be stored under an HDFS directory dir with a unique identifier.
Further, the method provided by the embodiment of the present invention may further include:
and 105, writing a result generated by the mapjoin operation into an intermediate result file directory of the HDFS.
The embodiment of the invention splits a table whose data volume exceeds the preset mapjoin threshold into multiple sub-tables that each satisfy the threshold allowed by the mapjoin operation, and then generates and starts multiple map tasks to execute the join operation, so that the two data sets being joined can be associated at the map end. This effectively avoids the MapReduce shuffle process and the processing of a large amount of same-key data in a single reduce, and solves most data skew problems to a certain extent; in other words, the technical solution provided by the embodiment of the present invention trades resources for efficiency, greatly improving the efficiency of Hive tasks that would otherwise suffer data skew.
As shown in fig. 2, a flow chart of a data skew processing method according to a second embodiment of the present invention is schematically shown. As shown in fig. 2, the method includes:
step 201, acquiring the data volume of the two data sets subjected to the mapjoin operation.
Step 202, comparing the data volumes of the two data sets, and according to the comparison result, taking the data set with the smaller data volume as the first data set and the data set with the larger data volume as the second data set.
And 203, calculating the splitting number N according to the data volume of the first data set and the preset threshold value.
And 204, splitting the first data set into N sub-data sets.
And step 205, serializing the N sub-data sets to obtain N HashTable files.
In a specific implementation, the N sub-data sets may be serialized using the kryo serialization library for Java to obtain N HashTable files. Using kryo as the Java serialization library effectively reduces the size of the compressed files.
And step 206, storing the N HashTable files in a preset HDFS directory.
And step 207, generating N mapjoin tasks according to the files in the HDFS directory.
And 208, respectively distributing the N HashTable files to corresponding mapjoin tasks.
In a specific embodiment, all sub-data sets may be distributed to the nodes of their respective mapjoin tasks using Hadoop's DistributedCache; for example, assuming the number N of sub-data sets is 3, the 3 sub-data sets are distributed to the nodes of the 3 mapjoin tasks.
Step 209, the second data set is segmented into M data blocks.
Wherein M is greater than or equal to 2. In practice, the mapjoin computation is divided into two steps: the small table's data is turned into a HashTable and broadcast to all map ends, and the large table's data is split appropriately; in the map stage, the large table's rows probe the small table's HashTable row by row, and whenever the join keys are equal the result is written to HDFS (Hadoop Distributed File System). The splitting of the large table happens at the Hadoop level, since the large table's data is stored on HDFS. Files on HDFS are stored as blocks, and the block size is determined by a configuration parameter (currently 512 MB on the cluster). It follows that, in the foregoing steps provided by the embodiment of the present invention, M = data volume of the second data set / configuration parameter: if the second data set's volume is less than or equal to the configuration parameter, M equals 1; if it is greater, M is greater than or equal to 2.
Step 210, starting M × N mapping map tasks, and executing associated join operations on the N HashTable files and the M data blocks.
Among them, what needs to be explained here is: the M × N map tasks are started according to the execution plan generated by Hive. Specifically, generating the execution plan involves: first, modifying Hive's existing source code so that an execution plan is generated by parsing the mapjoin hint specified in the Hive query; the plan is then rewritten into a new plan tree. For example, assuming the number N of sub-data sets is 3, the plan is rewritten into 3 separate mapjoin tasks; at the same time, a dependency is added to the downstream plan, and an instruction is added that triggers the 3 mapjoin tasks once the upstream task completes.
And step 211, writing the result generated by the mapjoin operation into an intermediate result file directory of the HDFS.
Here, it should be noted that: after the step 211 is executed, the subsequent operations can be triggered, which are the same as those in the prior art and will not be described herein again.
The embodiment of the invention splits a table whose data volume exceeds the preset mapjoin threshold into multiple sub-tables that each satisfy the threshold allowed by the mapjoin operation, and then generates and starts multiple map tasks to execute the join operation, so that the two data sets being joined can be associated at the map end. This effectively avoids the MapReduce shuffle process and the processing of a large amount of same-key data in a single reduce, and solves most data skew problems to a certain extent; in other words, the technical solution provided by the embodiment of the present invention trades resources for efficiency, greatly improving the efficiency of Hive tasks that would otherwise suffer data skew.
It should be noted that: while, for purposes of simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present invention is not limited by the illustrated ordering of acts, as some steps may occur in other orders or concurrently with other steps in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Fig. 3 is a schematic structural diagram of a data skew processing apparatus according to a third embodiment of the present invention. As shown in fig. 3, the data skew processing apparatus provided in this embodiment includes: a splitting module 10 and an execution module 20. Wherein,
the splitting module 10 is configured to split the first data set into N sub-data sets when data volumes of the first data set and the second data set subjected to mapjoin operation are both greater than a preset threshold, where N is greater than or equal to 2, and the data volume of the sub-data set is less than or equal to the preset threshold.
And an executing module 20, configured to perform mapjoin operation on the second data set and the N sub-data sets respectively.
Further, the data skew processing apparatus provided in the embodiment of the present invention may further include: an acquisition module and a comparison module, wherein,
the acquisition module is configured to acquire the data volume of each of the two data sets on which the mapjoin operation is to be performed;
and the comparison module is configured to compare the data volumes of the two data sets and, according to the comparison result, take the data set with the smaller data volume as the first data set and the data set with the larger data volume as the second data set.
Further, the splitting module may include: the device comprises an acquisition unit, a calculation unit and a splitting unit. Wherein,
an obtaining unit configured to obtain a data amount of the first data set.
And the calculating unit is used for calculating the splitting number N according to the data volume and the preset threshold value.
And the splitting unit is used for splitting the first data set into N sub-data sets.
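The calculating unit's role can be sketched as follows (a minimal Python illustration; the unit of `data_volume` and `threshold` is an assumption, since the patent leaves it unspecified):

```python
import math

def split_count(data_volume, threshold):
    """Smallest N such that each of the N sub-data sets holds at most
    `threshold` of data, i.e. data_volume / N <= threshold, with N >= 2
    as required by the claims."""
    return max(2, math.ceil(data_volume / threshold))

# A first data set of 2500 units against a mapjoin threshold of 1000
# units must be split into N = 3 sub-data sets.
n = split_count(2_500, 1_000)
```

With N computed this way, every sub-data set produced by the splitting unit is guaranteed to satisfy the preset threshold.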
Further, the execution module may include: the device comprises a processing unit, a segmentation unit and an execution unit. Wherein,
the processing unit is used for serializing the N sub-data sets respectively to obtain N HashTable files;
a segmentation unit configured to segment the second data set into M data blocks, where M is greater than or equal to 2;
and the execution unit is configured to start M × N map tasks and execute the join operation on the N HashTable files and the M data blocks.
Further, the execution unit is specifically configured to start M × N map tasks, where each map task reads one of the sub data sets to the memory, and obtains one of the data blocks and the sub data set in the memory to perform a join operation.
Further, the processing unit is specifically configured to use Java's kryo serialization library to serialize the N sub-data sets respectively, so as to obtain N HashTable files.
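The patent names Java's kryo library for this step; as a language-neutral sketch, the round trip of serializing each sub-data set to a HashTable file and reading it back in a map task can be illustrated with Python's standard `pickle` module (a stand-in for the kryo format, not the actual one):

```python
import os
import pickle
import tempfile

def serialize_sub_dataset(rows, path):
    """Build an in-memory hash table (join key -> list of values) from
    one sub-data set and serialize it to a HashTable file on disk."""
    table = {}
    for key, value in rows:
        table.setdefault(key, []).append(value)
    with open(path, 'wb') as f:
        pickle.dump(table, f)

def load_hashtable(path):
    """Deserialize a HashTable file back into memory for a map task."""
    with open(path, 'rb') as f:
        return pickle.load(f)

path = os.path.join(tempfile.mkdtemp(), 'hashtable_0')
serialize_sub_dataset([(1, 'a'), (1, 'b'), (2, 'c')], path)
table = load_hashtable(path)
```

Serializing the sub-data sets up front means each map task only pays a deserialization cost rather than rebuilding the hash table from raw records.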
Further, the execution module may further include: a naming unit and a storage unit. Wherein,
and the naming unit is used for naming the N HashTable files respectively according to a preset naming rule.
And the storage unit is used for storing the named N HashTable files into a preset HDFS directory.
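A minimal sketch of the naming step (the naming rule and directory below are illustrative assumptions; the patent only requires that the names follow a preset rule so that each map task can locate its HashTable file in the HDFS directory):

```python
def hashtable_file_path(hdfs_dir, table_name, index):
    """One possible preset naming rule: <dir>/<table>_hashtable_<i>.
    Any rule works, as long as the map tasks derive the same names."""
    return f"{hdfs_dir}/{table_name}_hashtable_{index}"

# Names for N = 3 serialized sub-data sets of a table 't1' under a
# preset HDFS directory (illustrative path).
paths = [hashtable_file_path('/tmp/mapjoin', 't1', i) for i in range(3)]
```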
Further, the data skew processing apparatus provided in the embodiment of the present invention may further include a writing module, where the writing module is configured to write the result generated by the mapjoin operation into an intermediate result file directory of the HDFS.
In the embodiment of the invention, a table whose data volume is larger than the preset threshold of the mapjoin operation is split into a plurality of sub-tables that each satisfy the threshold allowed by the mapjoin operation, and a plurality of map tasks are then generated and started to execute the join operation, so that the two data sets of the mapjoin operation can be associated at the map end. The shuffle process of mapreduce, and the processing of a large amount of data with the same key (keyword) in a single reduce task, are thereby effectively avoided, so most data skew problems can be solved to a certain extent. In other words, the technical solution provided by the embodiment of the present invention trades resources for efficiency, thereby greatly improving the efficiency of hive tasks in which data skew would otherwise occur.
The apparatus provided in this embodiment can implement the method provided in the first embodiment, and specific implementation principles can refer to the contents of the corresponding parts, which are not described herein again.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A data skew processing method, comprising:
when the data volumes of a first data set and a second data set subjected to a mapping association (mapjoin) operation are both larger than a preset threshold, splitting the first data set into N sub-data sets, wherein N is larger than or equal to 2, and the data volume of each sub-data set is smaller than or equal to the preset threshold;
and performing mapjoin operation on the second data set and the N sub-data sets respectively.
2. The method of claim 1, wherein the splitting the first data set into N sub-data sets comprises:
acquiring the data volume of the first data set;
calculating the splitting number N according to the data volume and the preset threshold value;
and splitting the first data set into N sub-data sets.
3. The method according to claim 1 or 2, wherein performing a mapjoin operation on the second data set with the N sub-data sets, respectively, comprises:
respectively serializing the N sub-data sets to obtain N HashTable files;
segmenting the second data set into M data blocks, wherein M is greater than or equal to 2;
and starting M × N map tasks, and executing an association (join) operation on the N HashTable files and the M data blocks.
4. The method according to claim 3, wherein the N sub-datasets are serialized respectively to obtain N HashTable files, specifically:
and using a kryo serialization library of java to serialize the N sub-data sets respectively to obtain N HashTable files.
5. The method of claim 3, wherein the performing a mapjoin operation on the second dataset and the N sub-datasets, respectively, further comprises:
naming the N HashTable files respectively according to a preset naming rule;
and storing the N named HashTable files into a preset Hadoop Distributed File System (HDFS) directory.
6. A data skew processing apparatus, comprising:
a splitting module, configured to split a first data set into N sub-data sets when the data volumes of the first data set and a second data set subjected to a mapjoin operation are both larger than a preset threshold, wherein N is larger than or equal to 2, and the data volume of each sub-data set is smaller than or equal to the preset threshold;
and the execution module is used for performing mapjoin operation on the second data set and the N sub-data sets respectively.
7. The apparatus of claim 6, wherein the splitting module comprises:
an acquisition unit configured to acquire a data amount of the first data set;
the calculating unit is used for calculating the splitting number N according to the data volume and the preset threshold value;
and the splitting unit is used for splitting the first data set into N sub-data sets.
8. The apparatus of claim 6 or 7, wherein the execution module comprises:
the processing unit is used for serializing the N sub-data sets respectively to obtain N HashTable files;
a segmentation unit configured to segment the second data set into M data blocks, where M is greater than or equal to 2;
and the execution unit is configured to start M × N map tasks and execute the join operation on the N HashTable files and the M data blocks.
9. The apparatus according to claim 8, wherein the processing unit is specifically configured to serialize the N sub-data sets respectively using a Java kryo serialization library, so as to obtain N HashTable files.
10. The apparatus of claim 8, wherein the execution module further comprises:
the naming unit is used for naming the N HashTable files respectively according to a preset naming rule;
and the storage unit is used for storing the named N HashTable files into a preset HDFS directory.
CN201610279684.2A 2016-04-28 2016-04-28 Data skew processing method and apparatus Pending CN105930479A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610279684.2A CN105930479A (en) 2016-04-28 2016-04-28 Data skew processing method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610279684.2A CN105930479A (en) 2016-04-28 2016-04-28 Data skew processing method and apparatus

Publications (1)

Publication Number Publication Date
CN105930479A true CN105930479A (en) 2016-09-07

Family

ID=56836613

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610279684.2A Pending CN105930479A (en) 2016-04-28 2016-04-28 Data skew processing method and apparatus

Country Status (1)

Country Link
CN (1) CN105930479A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104298736A (en) * 2014-09-30 2015-01-21 华为软件技术有限公司 Method and device for aggregating and connecting data as well as database system
CN104408159A (en) * 2014-12-04 2015-03-11 曙光信息产业(北京)有限公司 Data correlating, loading and querying method and device

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776026A (en) * 2016-12-19 2017-05-31 北京奇虎科技有限公司 A kind of distributed data processing method and device
CN108694205A (en) * 2017-04-11 2018-10-23 北京京东尚科信息技术有限公司 Match the method, apparatus of aiming field
CN108776692A (en) * 2018-06-06 2018-11-09 北京京东尚科信息技术有限公司 Method and apparatus for handling information
CN110232095A (en) * 2019-05-21 2019-09-13 中国平安财产保险股份有限公司 A kind of method of data synchronization, device, storage medium and server
CN110232095B (en) * 2019-05-21 2024-04-02 中国平安财产保险股份有限公司 Data synchronization method, device, storage medium and server
CN112346847A (en) * 2019-08-09 2021-02-09 高新兴科技集团股份有限公司 Data processing method, device, computer storage medium and electronic device
CN110673794A (en) * 2019-09-18 2020-01-10 中兴通讯股份有限公司 Distributed data equalization processing method and device, computing terminal and storage medium
CN110673794B (en) * 2019-09-18 2021-12-17 中兴通讯股份有限公司 Distributed data equalization processing method and device, computing terminal and storage medium
CN111061712A (en) * 2019-11-29 2020-04-24 苏宁金融科技(南京)有限公司 Data connection operation processing method and device
CN111046045A (en) * 2019-12-13 2020-04-21 中国平安财产保险股份有限公司 Method, device, equipment and storage medium for processing data tilt
CN111046045B (en) * 2019-12-13 2023-09-29 中国平安财产保险股份有限公司 Method, device, equipment and storage medium for processing data inclination
CN111339064A (en) * 2020-03-03 2020-06-26 中国平安人寿保险股份有限公司 Data tilt correction method, device and computer readable storage medium
CN112256704A (en) * 2020-10-23 2021-01-22 山东超越数控电子股份有限公司 Quick join method, storage medium and computer
CN114638008A (en) * 2020-12-15 2022-06-17 阿里巴巴集团控股有限公司 Data processing method, device, system and storage medium
CN113238993A (en) * 2021-05-14 2021-08-10 中国人民银行数字货币研究所 Data processing method and device
CN113238993B (en) * 2021-05-14 2023-12-05 中国人民银行数字货币研究所 A data processing method and device
CN113821541A (en) * 2021-09-27 2021-12-21 北京沃东天骏信息技术有限公司 Data tilt processing method, device, storage medium and program product
WO2023045295A1 (en) * 2021-09-27 2023-03-30 北京沃东天骏信息技术有限公司 Data skew processing method, device, storage medium, and program product
CN115185525A (en) * 2022-05-17 2022-10-14 贝壳找房(北京)科技有限公司 Data skew code block positioning method, apparatus, device, medium, and program product
CN116126862A (en) * 2023-02-06 2023-05-16 中国银联股份有限公司 Data table association method, device, equipment and storage medium
CN116596638A (en) * 2023-07-11 2023-08-15 中国标准化研究院 Information recommendation method based on numerical processing model
CN116596638B (en) * 2023-07-11 2023-09-22 中国标准化研究院 An information recommendation method based on numerical processing model
CN119105847A (en) * 2024-08-02 2024-12-10 鹏城实验室 Data aggregation method, related device and medium
CN119105847B (en) * 2024-08-02 2025-09-30 鹏城实验室 Data aggregation method, related device and medium

Similar Documents

Publication Publication Date Title
CN105930479A (en) Data skew processing method and apparatus
Das et al. Big data analytics: A framework for unstructured data analysis
Xin et al. Graphx: Unifying data-parallel and graph-parallel analytics
Mohebi et al. Iterative big data clustering algorithms: a review
Mehmood et al. Distributed real-time ETL architecture for unstructured big data
Kyrola Drunkardmob: billions of random walks on just a pc
CN106611037A (en) Method and device for distributed diagram calculation
CN111767287A (en) Data import method, device, device and computer storage medium
Silva et al. Integrating big data into the computing curricula
Tanase et al. A highly efficient runtime and graph library for large scale graph analytics
Siddiqa et al. SmallClient for big data: an indexing framework towards fast data retrieval
Banaei et al. Hadoop and its role in modern image processing
Choi et al. Improving Database System Performance by Applying NoSQL.
Wang et al. Research and implementation on spatial data storage and operation based on Hadoop platform
CN103891244B (en) A kind of method and device carrying out data storage and search
Al-Hamodi et al. An enhanced frequent pattern growth based on MapReduce for mining association rules
Serbanescu et al. Architecture of distributed data aggregation service
CN108334532A (en) A kind of Eclat parallel methods, system and device based on Spark
Pivert NoSQL data models: trends and challenges
Xie et al. On massive spatial data retrieval based on spark
Sudha et al. A survey paper on map reduce in big data
Abughofa et al. Towards online graph processing with spark streaming
CA3065157C (en) Parallel map and reduce on hash chains
CN108121807B (en) Implementation method of multi-dimensional index structure OBF-Index in Hadoop environment
Shang Efficient task scheduling for large-scale graph data processing in cloud computing: A particle swarm optimization approach

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160907

WD01 Invention patent application deemed withdrawn after publication