CN110309177A

CN110309177A - Data processing method and related apparatus

Info

Publication number: CN110309177A
Application number: CN201810245691.XA
Authority: CN
Inventors: 张韶全; 朱锋
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2018-03-23
Filing date: 2018-03-23
Publication date: 2019-10-08
Anticipated expiration: 2038-03-23
Also published as: CN110309177B

Abstract

An embodiment of the present invention discloses a data processing method, comprising: receiving a data processing instruction; obtaining first partition data and second partition data according to the data processing instruction; sorting the first partition data by a Mapper to obtain first data to be combined, and sorting the second partition data to obtain second data to be combined; and merging the first data to be combined and the second data to be combined by a Reducer to obtain target connection data. An embodiment of the present invention further discloses a data processing apparatus. In the embodiments of the present invention, data sorting is completed in the Mapper and the Reducer only needs to perform data merging, which reduces the data processing latency of each Reducer and improves the execution efficiency of Join.

Description

Data processing method and related apparatus

Technical field

The present invention relates to the field of computer information processing, and in particular to a data processing method and a related apparatus.

Background art

Spark is a memory-based distributed computing framework for big data. Spark Structured Query Language (SQL), an important member of the Spark ecosystem, provides users with structured data processing and SQL query analysis, so that analysts in different business fields can use Spark transparently, through SQL alone, to process massive amounts of data.

Join is an important syntactic feature of SQL, and almost every complex data analysis scenario depends on it, so Join has always been a focus of optimization in the database field. A common Join at present is the sort merge join (Sort Merge Join). In this Join, the mapping node (Mapper) of the shuffle (Shuffle) divides the data of a resilient distributed dataset (Resilient Distributed Datasets, RDD) that has the same key into the same partition and writes it to disk. The reduction node (Reducer) of the Shuffle reads the partitions of different RDDs that hold the same keys, first sorts the data by key, and then merges the sorted data.

In a real environment, the amount of data handled by each Mapper is relatively uniform, so Mappers do not need to wait for one another while processing data. The amount of data handled by each Reducer, however, varies widely: some Reducers may need to process a large amount of data while others process only a little. Reducers with a short processing time then have to wait for Reducers with a long processing time to finish sorting, which increases the data processing latency and reduces the efficiency of Join.

Summary of the invention

Embodiments of the present invention provide a data processing method and a related apparatus in which data sorting is completed in the Mapper and the Reducer only needs to perform data merging, thereby reducing the data processing latency of each Reducer and improving the execution efficiency of Join.

In view of this, a first aspect of the present invention provides a data processing method, comprising:

receiving a data processing instruction;

obtaining first partition data and second partition data according to the data processing instruction;

sorting the first partition data by a mapping node (Mapper) to obtain first data to be combined, and sorting the second partition data to obtain second data to be combined; and

merging the first data to be combined and the second data to be combined by a reduction node (Reducer) to obtain target connection data.

A second aspect of the present invention provides a data processing apparatus, comprising:

a receiving module, configured to receive a data processing instruction;

an obtaining module, configured to obtain first partition data and second partition data according to the data processing instruction received by the receiving module;

a sorting module, configured to sort, by a mapping node (Mapper), the first partition data obtained by the obtaining module to obtain first data to be combined, and to sort the second partition data obtained by the obtaining module to obtain second data to be combined; and

a merging module, configured to merge, by a reduction node (Reducer), the first data to be combined and the second data to be combined that have been sorted by the sorting module, to obtain target connection data.

A third aspect of the present invention provides a data processing apparatus, comprising: a memory, a transceiver, a processor and a bus system;

wherein the memory is configured to store a program;

the processor is configured to execute the program in the memory, including the following steps:

receiving a data processing instruction;

merging the first data to be combined and the second data to be combined by a reduction node (Reducer) to obtain target connection data;

the bus system is configured to connect the memory and the processor so that the memory and the processor can communicate.

A fourth aspect of the present invention provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to perform the methods of the above aspects.

As can be seen from the above technical solutions, the embodiments of the present invention have the following advantages:

The embodiments of the present invention provide a data processing method. A data processing instruction is first received, and first partition data and second partition data are obtained according to it. The first partition data is then sorted by a Mapper to obtain first data to be combined, and the second partition data is sorted to obtain second data to be combined. Finally, the first data to be combined and the second data to be combined are merged by a Reducer to obtain target connection data. This exploits the fact that the vast majority of Spark SQL tasks have many Mappers and few Reducers: data sorting is completed in the Mapper and the Reducer only needs to perform data merging, which reduces the data processing latency of each Reducer and improves the execution efficiency of Join.

Brief description of the drawings

Fig. 1 is a schematic diagram of the big data distributed computing framework Spark;

Fig. 2 is an architecture diagram of the big data distributed computing framework Spark;

Fig. 3 is an architecture diagram of the distributed system in an embodiment of the present invention;

Fig. 4 is a schematic diagram of an embodiment of the data processing method in the embodiments of the present invention;

Fig. 5 is a schematic diagram of moving data from the cache to the hard disk in an embodiment of the present invention;

Fig. 6 is a schematic framework diagram of the data processing method in an application scenario of the present invention;

Fig. 7 is a schematic diagram of an embodiment of the data processing apparatus in the embodiments of the present invention;

Fig. 8 is a schematic structural diagram of the data processing apparatus in an embodiment of the present invention.

Detailed description of the embodiments

The terms "first", "second", "third", "fourth" and the like (if any) in the description, the claims and the above drawings of the present invention are used to distinguish similar objects and not to describe a particular order or sequence. It should be understood that data used in this way is interchangeable where appropriate, so that the embodiments of the present invention described herein can, for example, be implemented in an order other than those illustrated or described herein. In addition, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.

It should be understood that the present invention is mainly applied to Spark, a memory-based distributed computing framework for big data. Spark originated at the AMPLab (Algorithms, Machines, and People Lab) of the University of California, Berkeley. It is now a top-level project of the Apache open source community and has attracted the participation of major companies and developers around the world. Spark has become the de facto standard for big data processing in the industry.

When the amount of data to be processed exceeds what a single machine can handle (for example, the computer has 4 gigabytes of memory but 100 gigabytes or more of data needs to be processed), a Spark cluster can be used for the computation. In other cases the amount of data is not large but the computation is complex and time-consuming; the computing resources of a Spark cluster can then be used to parallelize the computation. Spark is introduced below with reference to Fig. 1 and Fig. 2.

Referring to Fig. 1, Fig. 1 is a schematic diagram of the big data distributed computing framework Spark. As shown in the figure, Spark Core contains the basic functions of Spark; in particular, it defines the application programming interface (Application Programming Interface, API) of the RDD, the operations on RDDs, and the actions on both. The other Spark libraries are all built on top of RDDs and Spark Core.

Spark SQL provides an API for interacting with Spark through the SQL of the Apache data warehouse tool. Each database table is treated as an RDD, and Spark SQL queries are converted into Spark operations. Spark Streaming processes and controls real-time data streams, and allows a program to handle real-time data in the same way as an ordinary RDD. The machine learning library is a library of commonly used machine learning algorithms implemented as Spark operations on RDDs; it contains scalable learning algorithms, such as classification and regression, that require iterating over massive data sets. GraphX is a set of algorithms and tools for manipulating graphs and performing parallel graph operations and computations. GraphX extends the RDD API with operations for manipulating graphs, creating subgraphs, and accessing all vertices on a path.

Referring to Fig. 2, Fig. 2 is an architecture diagram of the big data distributed computing framework Spark. As shown in the figure, the cluster manager (Cluster Manager) is the master node in standalone mode, where it controls the whole cluster and monitors the worker nodes, and is the resource manager in YARN (Yet Another Resource Negotiator) mode. A worker node, also called a slave node, is responsible for controlling a compute node and starting an executor or a driver. An executor is a process that runs on a worker node for a particular application (Application, APP). The driver runs the main() function of the APP.

It should be understood that the data processing method in the embodiments of the present invention is mainly applied to a distributed system. Referring to Fig. 3, Fig. 3 is an architecture diagram of the distributed system in an embodiment of the present invention. In general, the number of Mappers is greater than the number of Reducers. The Mapper is mainly responsible for the data processing stage and takes the form Mapper<K, V>. Each Mapper is an independent task that converts input records into intermediate records, that is, it processes input key-value pairs and outputs a set of intermediate key-value pairs. After being processed by a Mapper, one input record may be output as zero or more intermediate records. For example, if an input record does not meet the business requirement (it lacks a specific value, or it contains a specific value), the Mapper can return directly and output zero records; in this case the Mapper acts as a filter.
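The filtering behaviour of a Mapper can be illustrated with a minimal Python sketch. The function name `mapper` and the business requirement used here (the record must contain the word "error") are assumptions chosen for the example and do not come from the patent.

```python
def mapper(key, value):
    """Turn one input record into zero or more intermediate (key, value) pairs."""
    # Hypothetical business requirement: only records containing "error" pass.
    if "error" not in value:
        return  # requirement not met: emit nothing, so the mapper acts as a filter
    for word in value.split():
        yield (word, 1)


# One record is dropped, the other produces two intermediate records.
dropped = list(mapper(1, "all good"))
kept = list(mapper(2, "disk error"))
```

Because the function is a generator, returning early simply produces an empty sequence, which corresponds to the "zero records" case in the text.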

The Reducer can be configured and cleaned up, and can reduce a set of intermediate values that share the same key to a smaller set of values, for example by summing word counts. In the present invention, data sorting is completed in the Mapper and the Reducer is only responsible for merging data, which improves data sorting efficiency and therefore Join efficiency.

The data processing method of the present invention is described below from the perspective of the data processing apparatus. Referring to Fig. 4, an embodiment of the data processing method in the embodiments of the present invention includes:
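The word-count reduction mentioned above can be sketched in Python as follows. The helper names `group_by_key` and `reducer` are hypothetical and only illustrate how values that share a key are reduced to a smaller set.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect all intermediate values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce the values of one key to a single value (here: their sum)."""
    return (key, sum(values))


pairs = [("spark", 1), ("join", 1), ("spark", 1)]
counts = dict(reducer(key, values) for key, values in group_by_key(pairs).items())
```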

101. Receive a data processing instruction;

In this embodiment, the data processing apparatus may be deployed on a server or a terminal device. The data processing apparatus receives a data processing instruction, which may be initiated by a user such as a network administrator or operation and maintenance personnel, or may be initiated actively by a device; this is not limited here.

102. Obtain first partition data and second partition data according to the data processing instruction;

In this embodiment, the data processing apparatus constructs, according to the data processing instruction, the resilient distributed dataset (Resilient Distributed Datasets, RDD) corresponding to the data source. In the present invention, the description takes an RDD containing two partitions, a first partition and a second partition, as an example.

On a big data analysis platform built on Spark SQL, the PreSortMergeJoin algorithm of this solution is enabled by default and is controlled by the corresponding parameter (Spark.join.shuffle.sort=true). When the parameter is false, the platform switches back to the default Join algorithm of Spark SQL (that is, the existing scheme in which the Reducer both sorts and merges). On the product side, therefore, when the Spark SQL big data analysis platform goes online, the system administrator only needs to set this parameter in the configuration file, and users do not need to do anything.

It can be understood that PreSortMergeJoin is the name of the algorithm provided by this solution. In practical applications the algorithm may be given another name, but its substance remains the same.

103. Sort the first partition data by a mapping node (Mapper) to obtain first data to be combined, and sort the second partition data to obtain second data to be combined;

In this embodiment, the data processing apparatus sorts the first partition data by a Mapper to obtain the sorted first data to be combined. Similarly, the data processing apparatus sorts the second partition data by a Mapper to obtain the sorted second data to be combined. An example of the sorting is described below with reference to Table 1 and Table 2, where Table 1 shows the data before sorting.
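The switch described above can be pictured with a small Python sketch that reads the parameter from a plain dictionary. The function name `select_join_algorithm`, the dictionary-based configuration and the label `SortMergeJoin` for the default algorithm are assumptions made for this example; they are not Spark APIs.

```python
def select_join_algorithm(conf):
    """Return the name of the join algorithm selected by the configuration."""
    # The text says the feature is on by default, so a missing key means "true".
    flag = conf.get("Spark.join.shuffle.sort", "true").strip().lower()
    # "SortMergeJoin" is only a label for the default Reducer-side sort and merge.
    return "PreSortMergeJoin" if flag == "true" else "SortMergeJoin"
```

With an empty configuration the sketch selects PreSortMergeJoin, and setting the parameter to false selects the default algorithm.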

Table 1

User identifier	Age
3	21
5	64
2	33
1	28
7	19
4	56
6	40

Sorting yields the result shown in Table 2.

Table 2

User identifier	Age
1	28
2	33
3	21
4	56
5	64
6	40
7	19

It can be understood that the above sorting is based on the "user identifier". In practical applications, other sort orders may be chosen as required.

104. Merge the first data to be combined and the second data to be combined by a reduction node (Reducer) to obtain target connection data.

In this embodiment, the data processing apparatus merges the first data to be combined and the second data to be combined by the Reducer to obtain the target connection data, that is, the Join result. Assume that the first data to be combined is the result shown in Table 2 and the second data to be combined is as shown in Table 3.
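The sort applied to Table 1 can be reproduced with a short Python sketch. The variable names are illustrative, and the second sort by age is only an assumed example of "another sort order".

```python
# Rows of Table 1 as (user identifier, age) pairs, in their original order.
table_1 = [(3, 21), (5, 64), (2, 33), (1, 28), (7, 19), (4, 56), (6, 40)]

# Sort by user identifier, as the Mapper does in this example.
sorted_by_user = sorted(table_1, key=lambda row: row[0])

# A different sort key can be chosen as required, for example the age column.
sorted_by_age = sorted(table_1, key=lambda row: row[1])
```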

Table 3

User identifier	Occupation
1	Engineer
2	Teacher
3	Teacher
4	Engineer
5	Teacher
6	Engineer
7	Engineer

The target connection data obtained is then as shown in Table 4.

Table 4

User identifier	Occupation	Age
1	Engineer	28
2	Teacher	33
3	Teacher	21
4	Engineer	56
5	Teacher	64
6	Engineer	40
7	Engineer	19
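The merge that produces Table 4 can be sketched in Python as a single pass over two inputs that are already sorted by user identifier. The function name `merge_sorted_tables` is hypothetical, and the sketch assumes that each user identifier appears at most once per table, as in the example.

```python
def merge_sorted_tables(ages, occupations):
    """Merge two tables sorted by user identifier; no sorting is done here."""
    result, i, j = [], 0, 0
    while i < len(ages) and j < len(occupations):
        uid_a, age = ages[i]
        uid_o, occupation = occupations[j]
        if uid_a == uid_o:
            # Matching keys: emit one joined row and advance both sides.
            result.append((uid_a, occupation, age))
            i += 1
            j += 1
        elif uid_a < uid_o:
            i += 1  # no partner for this age row
        else:
            j += 1  # no partner for this occupation row
    return result


sorted_ages = [(1, 28), (2, 33), (3, 21), (4, 56), (5, 64), (6, 40), (7, 19)]
table_3 = [(1, "Engineer"), (2, "Teacher"), (3, "Teacher"), (4, "Engineer"),
           (5, "Teacher"), (6, "Engineer"), (7, "Engineer")]
table_4 = merge_sorted_tables(sorted_ages, table_3)
```

The Reducer side of the example only compares the heads of the two inputs, which is why it needs no sorting step of its own.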

The embodiments of the present invention provide a data processing method. A data processing instruction is first received, and first partition data and second partition data are obtained according to it, where the first partition data and the second partition data belong to different resilient distributed datasets (RDDs). The first partition data is then sorted by a Mapper to obtain first data to be combined, and the second partition data is sorted to obtain second data to be combined. Finally, the first data to be combined and the second data to be combined are merged by a Reducer to obtain target connection data. This exploits the fact that the vast majority of Spark SQL tasks have many Mappers and few Reducers: data sorting is completed in the Mapper and the Reducer only needs to perform data merging, which reduces the data processing latency of each Reducer and improves the execution efficiency of Join.

Optionally, on the basis of the embodiment corresponding to Fig. 4, in a first optional embodiment of the data processing method provided by the embodiments of the present invention, obtaining the first partition data and the second partition data according to the data processing instruction includes:

obtaining the first partition data and the second partition data from a resilient distributed dataset (RDD) according to the data processing instruction.

In this embodiment, the data processing apparatus obtains the first partition data and the second partition data from the same RDD according to the data processing instruction.

Specifically, Join combines the rows of two or more partitions that have the same key. The first partition data and the second partition data here are therefore only illustrative; in practical applications there may be more partition data, which is not limited here.

An RDD is a fault-tolerant, parallel data structure whose data can be explicitly stored on disk and in memory; in essence it is a read-only set of data records. An RDD may contain multiple partitions, and each partition is a fragment of the data set. An RDD is partitioned by a partitioning technique, which determines how the data set in the RDD is divided; for example, each data block may correspond to one partition. A partition is the object that results from dividing the RDD.

The process of repartitioning an RDD is called a "shuffle (Shuffle)". In Spark, a Shuffle means writing data to disk and reading it back. For a Join, the Mapper of the Shuffle divides the data of an RDD that has the same key into the same partition (note: each partition may contain different keys) and writes it to disk, and the Reducer of the Shuffle reads the partitions of different RDDs that hold the same keys and then combines them.

Secondly, in this embodiment of the present invention, the data processing apparatus obtains the partition data from the same RDD, that is, both the first partition data and the second partition data are obtained from this RDD. In this way, rows in one RDD that have the same key can be combined, completing the Join process, which improves the practicability and feasibility of the solution.
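The partitioning rule of the Shuffle can be illustrated with a minimal Python sketch. The modulo-based `partition_for` function, the in-memory buckets that stand in for files on disk, and all names are assumptions made for the example; the patent does not specify the partitioning function.

```python
def partition_for(key, num_partitions):
    """Map a key to a partition index; equal keys always get the same index."""
    return key % num_partitions  # assumes non-negative integer keys


def shuffle_write(records, num_partitions):
    """Mapper side: split (key, row) records into per-partition buckets."""
    buckets = [[] for _ in range(num_partitions)]
    for key, row in records:
        buckets[partition_for(key, num_partitions)].append((key, row))
    return buckets


def shuffle_read(per_rdd_buckets, partition):
    """Reducer side: collect the same partition from the buckets of every RDD."""
    return [buckets[partition] for buckets in per_rdd_buckets]


ages = shuffle_write([(1, 28), (2, 33), (3, 21)], num_partitions=2)
jobs = shuffle_write([(3, "Teacher"), (1, "Engineer"), (2, "Teacher")], num_partitions=2)
partition_1 = shuffle_read([ages, jobs], partition=1)
```

In this run partition 1 holds keys 1 and 3 from both inputs, which matches the note that one partition may contain different keys while rows with the same key always meet in the same partition.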

Optionally, on the basis of the embodiment corresponding to Fig. 4, in a second optional embodiment of the data processing method provided by the embodiments of the present invention, sorting the first partition data by the Mapper to obtain the first data to be combined may include:

sorting the first partition data by a first Mapper to obtain at least one first file, where the at least one first file belongs to the first data to be combined;

merging the at least one first file to obtain the first data to be combined;

sorting the second partition data by a second Mapper to obtain at least one second file, where the at least one second file belongs to the second data to be combined;

merging the at least one second file to obtain the second data to be combined.

In this embodiment, in the stage in which the Mapper processes data, the data needs to be sorted and aggregated. This operation can be completed by the function joinShuffleWrite. The input of the function is the partition data of an RDD. Each record of the data is inserted into a sorter until all of the data has been inserted, and finally all sorted files are merged. For ease of understanding, refer to the following pseudocode:

Secondly, this embodiment of the present invention describes the process of sorting the first partition data by the first Mapper to obtain at least one first file and then merging the at least one first file to obtain the first data to be combined, and of sorting the second partition data by the second Mapper to obtain at least one second file and merging the at least one second file to obtain the second data to be combined. In this way, different Mappers can sort different partition data, so the Reducer does not need to sort data. This avoids the situation in which multiple Reducers wait for one another because the amounts of data they process are uneven.
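The pseudocode referred to above does not appear in this text. As a stand-in, the following Python sketch follows the prose description only: records are inserted one by one, sorted runs are produced, and the runs are merged at the end. The snake_case name `join_shuffle_write`, the `run_size` parameter and the in-memory lists that replace files are assumptions, not the patent's listing.

```python
import heapq


def _key(record):
    """Sort key of a (key, row) record."""
    return record[0]


def join_shuffle_write(partition_records, run_size=2):
    """Sort one partition's records in bounded runs, then merge all runs."""
    runs, current = [], []
    for record in partition_records:  # insert records one by one
        current.append(record)
        if len(current) == run_size:  # a full run is sorted and set aside
            runs.append(sorted(current, key=_key))
            current = []
    if current:  # records left over after the last full run
        runs.append(sorted(current, key=_key))
    # Merge every sorted run into one sorted output.
    return list(heapq.merge(*runs, key=_key))
```

`heapq.merge` is used because it merges inputs that are already sorted without sorting them again, which mirrors the final merge of sorted files.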

Optionally, on the basis of the second embodiment corresponding to Fig. 4, in a third optional embodiment of the data processing method provided by the embodiments of the present invention, sorting the first partition data by the first Mapper to obtain at least one first file may include:

if the amount of first partition data being sorted in memory by the first Mapper reaches a first preset threshold, taking the M sorted records in memory as a first file and storing the first file on the hard disk, where M is an integer greater than 0;

sorting the second partition data by the second Mapper to obtain at least one second file may include:

if the amount of second partition data being sorted in memory by the second Mapper reaches a second preset threshold, taking the N sorted records in memory as a second file and storing the second file on the hard disk, where N is an integer greater than 0.

In this embodiment, when the first Mapper processes the first partition data, it first writes the first partition data into memory. If the amount of first partition data being sorted in memory by the first Mapper reaches the first preset threshold, the first Mapper takes the M sorted records in memory as a first file and then sends the first file to the hard disk. Similarly, the second Mapper writes the second partition data into memory. If the amount of second partition data being sorted in memory by the second Mapper reaches the second preset threshold, the second Mapper takes the N sorted records in memory as a second file and then sends the second file to the hard disk.

It should be noted that the first preset threshold and the second preset threshold are usually the same; for example, both are 30. The values of M and N are generally also the same, and may also be set according to circumstances, which is not limited here.

How a data record is inserted into the sorter is described below with reference to the following pseudocode:

Here, the insertRecord method inserts a data record into the sorter, and memoryBuffer stores the records in memory. When a certain number of records is reached, the sortAndSpill method sorts the data in memoryBuffer, stores it on disk, and empties memoryBuffer.

Assume that there are four records in total, (1, r1), (3, r3), (4, r4) and (2, r2), where 1, 2, 3 and 4 are the keys. If memoryBuffer can hold at most two records, (1, r1) and (3, r3) are sorted by key and saved as one disk file (the order of the data in the file is (1, r1), (3, r3)). Likewise, (4, r4) and (2, r2) are sorted by key and saved as another disk file (the order of the data in the file is (2, r2), (4, r4)).

Again, in this embodiment of the present invention, if the amount of first partition data being sorted in memory by the first Mapper reaches the first preset threshold, the M sorted records in memory are taken as a first file; similarly, if the amount of second partition data being sorted in memory by the second Mapper reaches the second preset threshold, the N sorted records in memory are taken as a second file. In this way, when the amount of data is very large, the data can be written to the hard disk instead of being held entirely in memory, which avoids the problem that the Mapper cannot sort because memory is insufficient and improves the feasibility and practicability of the solution.
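The four-record example can be replayed with the following Python sketch of the insert and spill steps. The class `SpillingSorter`, the use of `pickle` files in a temporary directory, and the helper `read_spill` are assumptions for illustration; only the roles of insertRecord, sortAndSpill and memoryBuffer come from the text.

```python
import os
import pickle
import tempfile


class SpillingSorter:
    """Buffer records in memory and spill them to disk as sorted files."""

    def __init__(self, max_records, spill_dir):
        self.max_records = max_records  # plays the role of the preset threshold
        self.spill_dir = spill_dir
        self.memory_buffer = []
        self.spill_files = []

    def insert_record(self, key, row):
        self.memory_buffer.append((key, row))
        if len(self.memory_buffer) >= self.max_records:
            self.sort_and_spill()

    def sort_and_spill(self):
        # Sort the buffered records by key, write them out, empty the buffer.
        self.memory_buffer.sort(key=lambda record: record[0])
        path = os.path.join(self.spill_dir, f"spill_{len(self.spill_files)}.bin")
        with open(path, "wb") as handle:
            pickle.dump(self.memory_buffer, handle)
        self.spill_files.append(path)
        self.memory_buffer = []


def read_spill(path):
    """Load one spill file back as a list of (key, row) records."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as spill_dir:
    sorter = SpillingSorter(max_records=2, spill_dir=spill_dir)
    for key, row in [(1, "r1"), (3, "r3"), (4, "r4"), (2, "r2")]:
        sorter.insert_record(key, row)
    spills = [read_spill(path) for path in sorter.spill_files]
```

With a buffer of two records the run produces two spill files whose contents match the two disk files described in the example.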

Optionally, on the basis of the second embodiment corresponding to Fig. 4, in a fourth optional embodiment of the data processing method provided by the embodiments of the present invention, merging the at least one first file to obtain the first data to be combined may include:

obtaining the minimum data from each first file;

determining the first data to be combined according to the minimum data in each first file;

merging the at least one second file to obtain the second data to be combined includes:

obtaining the minimum data from each second file;

determining the second data to be combined according to the minimum data in each second file.

In this embodiment, the first partition data is sorted in one memory to obtain the first files, and the second partition data is sorted in another memory to obtain the second files. Referring to Fig. 5, Fig. 5 is a schematic diagram of moving data from the cache to the hard disk in an embodiment of the present invention. As shown in the figure, assume that the first files include file A, file B and file C, and that the current partition is partition 1. The minimum data is then taken from partition 1 of file A, partition 1 of file B and partition 1 of file C, and written to partition 1 on the hard disk. By analogy, the minimum value in partition 1 is written to the hard disk each time, which yields an ordered result.

How the sorted data is merged into one file is described below with reference to the following pseudocode:

Here, the mergeSpills method merges the files spilled by all insertRecord calls. Each time, it takes from the files the minimum value that belongs to the current partition and writes it to the output file; if no remaining data belongs to the current partition, it switches to the next partition, until all data has been taken out.

Assume that there are four records in total, (1, r1), (3, r3), (4, r4) and (2, r2), where 1, 2, 3 and 4 are the keys. If memoryBuffer can hold at most two records, (1, r1) and (3, r3) are sorted by key and saved as one disk file (the order of the data in the file is (1, r1), (3, r3)). Likewise, (4, r4) and (2, r2) are sorted by key and saved as another disk file (the order of the data in the file is (2, r2), (4, r4)). After mergeSpills, a single file is generated in which the order of the data is (1, r1), (2, r2), (3, r3), (4, r4).

Again, in this embodiment of the present invention, the at least one file may be merged to obtain the data to be combined by obtaining the minimum data from each file and then deriving the data to be combined from the minimum data in each file. Because the data in the first files and the second files is already sorted, taking the minimum value among the files each time also guarantees that the result is ordered, which improves the operability of the solution.
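The partition-by-partition merge described for mergeSpills can be sketched in Python as follows. The sketch assumes that every spill file is a list of (partition, key, row) records already sorted by partition and then by key; this record layout, the name `merge_spills` and the in-memory lists that stand in for files are assumptions.

```python
def merge_spills(spills, num_partitions):
    """Merge spill files sorted by (partition, key) into one ordered output."""
    positions = [0] * len(spills)  # read cursor for each spill file
    merged = []
    for partition in range(num_partitions):
        while True:
            # Among the spill heads that belong to the current partition,
            # find the one with the smallest key.
            best = None
            for index, spill in enumerate(spills):
                pos = positions[index]
                if pos < len(spill) and spill[pos][0] == partition:
                    if best is None or spill[pos][1] < spills[best][positions[best]][1]:
                        best = index
            if best is None:
                break  # nothing left for this partition: switch to the next one
            merged.append(spills[best][positions[best]])
            positions[best] += 1
    return merged


# The example from the text, with all four records placed in partition 0.
one_partition = merge_spills(
    [[(0, 1, "r1"), (0, 3, "r3")], [(0, 2, "r2"), (0, 4, "r4")]],
    num_partitions=1,
)
```

A linear scan over the spill heads is enough for a handful of files; with many spill files a heap keyed on (partition, key) would be the usual choice.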

Optionally, on the basis of Fig. 4 and any one of the first to fourth embodiments corresponding to Fig. 4, in a fifth optional embodiment of the data processing method provided by the embodiments of the present invention, merging the first data to be combined and the second data to be combined by the Reducer to obtain the target connection data may include:

receiving, by the Reducer, the first data to be combined obtained after the first Mapper sorts the first partition data;

receiving, by the Reducer, the second data to be combined obtained after the second Mapper sorts the second partition data;

merging, by the Reducer, the first data to be combined and the second data to be combined to obtain the target connection data.

This embodiment describes the process in which the Reducer merges the first data to be combined and the second data to be combined.

Specifically, taking one Reducer as an example, the Reducer receives the first data to be combined sent from the first Mapper, where the first data to be combined is the data obtained after sorting by the first Mapper. Similarly, the Reducer also receives the second data to be combined sent from the second Mapper, where the second data to be combined is the data obtained after sorting by the second Mapper. The Reducer merges the received first data to be combined and second data to be combined to obtain the target connection data after Join.

Further, this embodiment of the present invention describes the process in which the Reducer merges the first data to be combined and the second data to be combined. In this way, the Reducer can merge multiple sorted results directly without sorting the data itself, which removes the sorting cost from the Reducer and thus improves the execution efficiency of Join.

Optionally, on the basis of the fifth embodiment corresponding to Fig. 4, in a sixth optional embodiment of the data processing method provided in the embodiments of the present invention, merging the first data to be combined and the second data to be combined by the Reducer to obtain the target connection data may include:

obtaining, by the Reducer, first minimum data from the first data to be combined and second minimum data from the second data to be combined;

comparing, by the Reducer, the first minimum data with the second minimum data, and merging the first data to be combined and the second data to be combined according to the comparison result to obtain the target connection data.

In this embodiment, when processing the first data to be combined and the second data to be combined, the Reducer first obtains the first minimum data from the first data to be combined and the second minimum data from the second data to be combined. It then compares the first minimum data with the second minimum data and outputs the smaller value each time, in order.

How the first data to be combined and the second data to be combined are merged is illustrated below with pseudocode. Refer to the following pseudocode:

In the Reduce stage, the data from different Mappers is merged and fed into the Join operator in the form of an iterator, so that calling Interator.getNext() returns the next data value. The logic is similar to that of the mergeSpills method in the Mapper data processing stage and is not repeated here.
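The iterator-style merge described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the pseudocode of the invention: the class name `TwoWayMergeIterator` and its method `get_next` are hypothetical, the inputs are assumed to be two ascending sequences of integers, and `None` is assumed never to occur as a data value because it is used to signal exhaustion.

```python
from typing import Iterable, Optional


class TwoWayMergeIterator:
    """Hypothetical iterator that feeds a Join operator one merged value at a time."""

    def __init__(self, first: Iterable[int], second: Iterable[int]) -> None:
        self._first = iter(first)
        self._second = iter(second)
        # Current minimum (head) of each sorted input; None means exhausted.
        self._head_first = next(self._first, None)
        self._head_second = next(self._second, None)

    def get_next(self) -> Optional[int]:
        """Return the next smallest value, or None when both inputs are exhausted."""
        if self._head_first is None and self._head_second is None:
            return None
        take_first = self._head_second is None or (
            self._head_first is not None and self._head_first <= self._head_second
        )
        if take_first:
            value = self._head_first
            self._head_first = next(self._first, None)
        else:
            value = self._head_second
            self._head_second = next(self._second, None)
        return value


merged_iter = TwoWayMergeIterator([1, 3, 5], [2, 3, 8])
output = []
while (item := merged_iter.get_next()) is not None:
    output.append(item)
```

Each call compares only the two current heads, so the Reducer side performs no sorting of its own.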

Further, in this embodiment of the present invention, the Reducer may merge the first data to be combined and the second data to be combined as follows: the Reducer obtains the first minimum data from the first data to be combined and the second minimum data from the second data to be combined, compares the first minimum data with the second minimum data, and merges the first data to be combined and the second data to be combined according to the comparison result. Because the first data to be combined and the second data to be combined are both already sorted, taking the minimum value across them each time keeps the result in order, which improves the operability of the scheme.

For ease of understanding, refer to Fig. 6, which is a schematic framework diagram of the data processing method in an application scenario of the present invention. As shown in the figure, the PreSortMergeJoin algorithm has two key parts: one part sorts the data when it is partitioned in the map stage, and the other part merges the sorted data in the reduce stage.

For one Mapper, the partition data is sorted first. Because the data volume is large, memory may not be able to hold all of the data, so the sorted data has to be split. Suppose memory can hold at most 100,000 records and 1,000,000 records need to be sorted. The Mapper then writes each batch of 100,000 sorted records to the hard disk until all 1,000,000 records have been stored there. The 1,000,000 records stored on the hard disk are also in sorted order, because the minimum value is written to the hard disk each time during storage, which yields an ordered Mapper file.
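The spill-and-merge behaviour of a single Mapper can be sketched as follows. This is a minimal sketch under stated assumptions: records are integers stored one per line in text files, the function names `spill_sorted_runs` and `merge_runs_to_file` are hypothetical, and the memory limit is set to 4 records instead of 100,000 so the example stays small.

```python
import heapq
import os
import tempfile
from typing import Iterable, List


def spill_sorted_runs(records: Iterable[int], memory_limit: int, spill_dir: str) -> List[str]:
    """Sort records in batches of at most memory_limit and write each batch to disk."""
    paths: List[str] = []
    buffer: List[int] = []

    def flush() -> None:
        if not buffer:
            return
        buffer.sort()  # sort only what fits in memory
        fd, path = tempfile.mkstemp(dir=spill_dir, suffix=".run")
        with os.fdopen(fd, "w") as handle:
            handle.writelines(f"{value}\n" for value in buffer)
        paths.append(path)
        buffer.clear()

    for record in records:
        buffer.append(record)
        if len(buffer) >= memory_limit:  # threshold reached: spill to disk
            flush()
    flush()  # spill the final partial batch
    return paths


def merge_runs_to_file(paths: List[str], out_path: str) -> None:
    """Merge sorted run files into one ordered file by always writing the minimum next."""
    files = [open(path) for path in paths]
    try:
        streams = [(int(line) for line in handle) for handle in files]
        with open(out_path, "w") as out:
            for value in heapq.merge(*streams):
                out.write(f"{value}\n")
    finally:
        for handle in files:
            handle.close()


with tempfile.TemporaryDirectory() as tmp:
    run_paths = spill_sorted_runs([9, 1, 7, 3, 8, 2, 6, 4, 5, 0], memory_limit=4, spill_dir=tmp)
    run_count = len(run_paths)
    mapper_file = os.path.join(tmp, "mapper.out")
    merge_runs_to_file(run_paths, mapper_file)
    with open(mapper_file) as handle:
        sorted_values = [int(line) for line in handle]
```

With ten records and a limit of four, the sketch produces three run files, and the merged Mapper file holds all ten records in ascending order.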

The data stored on the hard disk is not necessarily used by only one Reducer, so the data on the hard disk can be distributed to different Reducers. Likewise, the Mapper files obtained by one Reducer do not necessarily come from the same Mapper.

By merging at least one Mapper file, each Reducer obtains the final Join result.
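How a Reducer can produce a Join result from inputs that are already sorted can be sketched as follows. The join type is not specified in this passage, so the sketch assumes an inner equi-join on the first element of each record; the function name `sort_merge_join` and the sample records are hypothetical.

```python
from typing import List, Tuple

Record = Tuple[int, str]  # assumed layout: (join key, payload)


def sort_merge_join(left: List[Record], right: List[Record]) -> List[Tuple[int, str, str]]:
    """Inner equi-join of two inputs that are already sorted by key."""
    result: List[Tuple[int, str, str]] = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1  # the smaller key cannot match anything later on the other side
        elif left_key > right_key:
            j += 1
        else:
            # Find the end of the equal-key group on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == left_key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Emit every pairing within the two groups.
            for _, left_value in left[i:i_end]:
                for _, right_value in right[j:j_end]:
                    result.append((left_key, left_value, right_value))
            i, j = i_end, j_end
    return result


joined = sort_merge_join(
    [(1, "a"), (2, "b"), (2, "c"), (4, "d")],
    [(2, "x"), (3, "y"), (4, "z")],
)
```

Both inputs are scanned once from their smallest key upward, which is only correct because the Mappers have already sorted them.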

The present invention can be applied to a data warehouse, supporting a workload of about 1,600,000 SQL queries. The invention significantly improves Join efficiency: Join speed can be improved by up to 70%, which greatly reduces service running time.

The data processing equipment of the present invention is described in detail below. Referring to Fig. 7, which is a schematic diagram of an embodiment of the data processing equipment in the embodiments of the present invention, the data processing equipment 20 includes:

a receiving module 201, configured to receive data processing instructions;

an obtaining module 202, configured to obtain first partition data and second partition data according to the data processing instructions received by the receiving module 201;

a sorting module 203, configured to sort, by a mapping node Mapper, the first partition data obtained by the obtaining module 202 to obtain first data to be combined, and to sort the second partition data obtained by the obtaining module 202 to obtain second data to be combined;

a merging module 204, configured to merge, by a reduction node Reducer, the first data to be combined and the second data to be combined sorted by the sorting module 203 to obtain target connection data.

In this embodiment, the receiving module 201 receives data processing instructions; the obtaining module 202 obtains first partition data and second partition data according to the data processing instructions received by the receiving module 201; the sorting module 203 sorts, by a mapping node Mapper, the first partition data obtained by the obtaining module 202 to obtain first data to be combined, and sorts the second partition data obtained by the obtaining module 202 to obtain second data to be combined; and the merging module 204 merges, by a reduction node Reducer, the first data to be combined and the second data to be combined sorted by the sorting module 203 to obtain target connection data.

An embodiment of the present invention provides a data processing equipment. The equipment first receives data processing instructions and then obtains first partition data and second partition data according to the data processing instructions, where the first partition data and the second partition data belong to different resilient distributed datasets (RDDs). A Mapper then sorts the first partition data to obtain first data to be combined and sorts the second partition data to obtain second data to be combined. Finally, a Reducer merges the first data to be combined and the second data to be combined to obtain target connection data. This approach exploits the fact that the vast majority of Spark SQL tasks have many Mappers and few Reducers: data sorting is completed in the Mappers, and each Reducer only has to merge data, which reduces the data processing delay of each Reducer and improves the execution efficiency of the Join.

Optionally, on the basis of the embodiment corresponding to Fig. 7, in another embodiment of the data processing equipment 20 provided in the embodiments of the present invention,

the obtaining module 202 is specifically configured to obtain the first partition data and the second partition data from a resilient distributed dataset (RDD) according to the data processing instructions.

The sorting module 203 is specifically configured to: sort the first partition data by a first Mapper to obtain at least one first file, where the at least one first file belongs to the first data to be combined;

merge the at least one first file to obtain the first data to be combined;

sort the second partition data by a second Mapper to obtain at least one second file, where the at least one second file belongs to the second data to be combined;

merge the at least one second file to obtain the second data to be combined.

Secondly, this embodiment of the present invention describes how the first Mapper sorts the first partition data to obtain at least one first file and then merges the at least one first file to obtain the first data to be combined, and how the second Mapper sorts the second partition data to obtain at least one second file and merges the at least one second file to obtain the second data to be combined. In this way, different Mappers can sort different partition data, so the Reducer does not need to sort data, which avoids the time that multiple Reducers would otherwise spend waiting for one another to finish sorting.

The sorting module 203 is specifically configured to: if the amount of data that the first Mapper is sorting from the first partition data in memory reaches a first preset threshold, take the M sorted records in the memory as a first file and store the first file to a hard disk, where M is an integer greater than 0;

if the amount of data that the second Mapper is sorting from the second partition data in memory reaches a second preset threshold, take the N sorted records in the memory as a second file and store the second file to the hard disk, where N is an integer greater than 0.

The sorting module 203 is specifically configured to: obtain minimum data from each first file;

determine the first data to be combined according to the minimum data in each first file;

obtain minimum data from each second file;

determine the second data to be combined according to the minimum data in each second file.

The merging module 204 is specifically configured to: receive, by the Reducer, the first data to be combined obtained after the first Mapper sorts the first partition data;

receive, by the Reducer, the second data to be combined obtained after the second Mapper sorts the second partition data;

merge, by the Reducer, the first data to be combined and the second data to be combined to obtain the target connection data.

The merging module 204 is specifically configured to: obtain, by the Reducer, first minimum data from the first data to be combined and second minimum data from the second data to be combined;

compare, by the Reducer, the first minimum data with the second minimum data, and merge the first data to be combined and the second data to be combined according to the comparison result to obtain the target connection data.

Fig. 8 is a schematic structural diagram of a data processing equipment provided in an embodiment of the present invention. The data processing equipment 300 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 322 (for example, one or more processors), a memory 332, and one or more storage media 330 (for example, one or more mass storage devices) that store application programs 342 or data 344. The memory 332 and the storage medium 330 may provide transient storage or persistent storage. A program stored in the storage medium 330 may include one or more modules (not marked in the figure), and each module may include a series of instruction operations for the data processing equipment. Further, the central processing unit 322 may be configured to communicate with the storage medium 330 and to execute, on the data processing equipment 300, the series of instruction operations in the storage medium 330.

The data processing equipment 300 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341, such as Windows Server™, Mac OS X™, Unix™, Linux™, and FreeBSD™.

The steps performed by the data processing equipment in the foregoing embodiments may be based on the structure of the data processing equipment shown in Fig. 8.

The CPU 322 is configured to perform the following steps:

receiving data processing instructions;

Optionally, the CPU 322 is specifically configured to perform the following steps:

sorting the first partition data by a first Mapper to obtain at least one first file, where the at least one first file belongs to the first data to be combined;

Optionally, the CPU 322 is specifically configured to perform the following steps:

if the amount of data that the first Mapper is sorting from the first partition data in memory reaches a first preset threshold, taking the M sorted records in the memory as a first file and storing the first file to a hard disk, where M is an integer greater than 0;

Optionally, the CPU 322 is specifically configured to perform the following steps:

obtaining minimum data from each first file;

obtaining minimum data from each second file;

Optionally, the CPU 322 is specifically configured to perform the following steps:

receiving, by the Reducer, the first data to be combined obtained after the first Mapper sorts the first partition data;

Optionally, the CPU 322 is specifically configured to perform the following steps:

obtaining, by the Reducer, first minimum data from the first data to be combined and second minimum data from the second data to be combined;

A person skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.

In the several embodiments provided by the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.

The units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods in the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

The above embodiments are merely intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A data processing method, characterized by comprising:

receiving data processing instructions;

sorting the first partition data by a mapping node Mapper to obtain first data to be combined, and sorting the second partition data to obtain second data to be combined;

merging the first data to be combined and the second data to be combined by a reduction node Reducer to obtain target connection data.

2. The method according to claim 1, characterized in that obtaining the first partition data and the second partition data according to the data processing instructions comprises:

obtaining the first partition data and the second partition data from a resilient distributed dataset (RDD) according to the data processing instructions.

3. The method according to claim 1, characterized in that sorting the first partition data by the mapping node Mapper to obtain the first data to be combined comprises:

sorting the first partition data by a first Mapper to obtain at least one first file, wherein the at least one first file belongs to the first data to be combined;

sorting the second partition data by a second Mapper to obtain at least one second file, wherein the at least one second file belongs to the second data to be combined;

4. The method according to claim 3, characterized in that sorting the first partition data by the first Mapper to obtain the at least one first file comprises:

if the amount of data that the first Mapper is sorting from the first partition data in memory reaches a first preset threshold, taking the M sorted records in the memory as a first file and storing the first file to a hard disk, wherein M is an integer greater than 0;

and sorting the second partition data by the second Mapper to obtain the at least one second file comprises:

if the amount of data that the second Mapper is sorting from the second partition data in memory reaches a second preset threshold, taking the N sorted records in the memory as a second file and storing the second file to the hard disk, wherein N is an integer greater than 0.

5. The method according to claim 3, characterized in that merging the at least one first file to obtain the first data to be combined comprises:

obtaining minimum data from each first file;

and merging the at least one second file to obtain the second data to be combined comprises:

obtaining minimum data from each second file;

6. The method according to any one of claims 1 to 5, characterized in that merging the first data to be combined and the second data to be combined by the reduction node Reducer to obtain the target connection data comprises:

receiving, by the Reducer, the first data to be combined obtained after the first Mapper sorts the first partition data;

receiving, by the Reducer, the second data to be combined obtained after the second Mapper sorts the second partition data;

merging, by the Reducer, the first data to be combined and the second data to be combined to obtain the target connection data.

7. The method according to claim 6, characterized in that merging, by the Reducer, the first data to be combined and the second data to be combined to obtain the target connection data comprises:

obtaining, by the Reducer, first minimum data from the first data to be combined and second minimum data from the second data to be combined;

comparing, by the Reducer, the first minimum data with the second minimum data, and merging the first data to be combined and the second data to be combined according to the comparison result to obtain the target connection data.

8. A data processing equipment, characterized by comprising:

a receiving module, configured to receive data processing instructions;

an obtaining module, configured to obtain first partition data and second partition data according to the data processing instructions received by the receiving module;

a sorting module, configured to sort, by a mapping node Mapper, the first partition data obtained by the obtaining module to obtain first data to be combined, and to sort the second partition data obtained by the obtaining module to obtain second data to be combined;

a merging module, configured to merge, by a reduction node Reducer, the first data to be combined and the second data to be combined sorted by the sorting module to obtain target connection data.

9. The data processing equipment according to claim 8, characterized in that

the sorting module is specifically configured to sort the first partition data by a first Mapper to obtain at least one first file, wherein the at least one first file belongs to the first data to be combined;

10. The data processing equipment according to claim 9, characterized in that

the sorting module is specifically configured to: if the amount of data that the first Mapper is sorting from the first partition data in memory reaches a first preset threshold, take the M sorted records in the memory as a first file and store the first file to a hard disk, wherein M is an integer greater than 0;

11. The data processing equipment according to claim 9, characterized in that

the sorting module is specifically configured to: obtain minimum data from each first file;

obtain minimum data from each second file;

12. A data processing equipment, characterized in that the data processing equipment comprises: a memory, a transceiver, a processor, and a bus system;

wherein the memory is configured to store a program;

receiving data processing instructions;

merging the first data to be combined and the second data to be combined by a reduction node Reducer to obtain target connection data;

the bus system is configured to connect the memory and the processor so that the memory and the processor communicate with each other.

13. The data processing equipment according to claim 12, characterized in that

the processor is specifically configured to sort the first partition data by a first Mapper to obtain at least one first file, wherein the at least one first file belongs to the first data to be combined;

14. The data processing equipment according to claim 13, characterized in that

the processor is specifically configured to: if the amount of data that the first Mapper is sorting from the first partition data in memory reaches a first preset threshold, take the M sorted records in the memory as a first file and store the first file to a hard disk, wherein M is an integer greater than 0;

15. A computer readable storage medium comprising instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 7.