CN110309177B - Data processing method and related device - Google Patents

Data processing method and related device

Info

Publication number
CN110309177B
CN110309177B (application CN201810245691.XA)
Authority
CN
China
Prior art keywords
data
partition
file
combined
merged
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810245691.XA
Other languages
Chinese (zh)
Other versions
CN110309177A (en)
Inventor
张韶全
朱锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201810245691.XA priority Critical patent/CN110309177B/en
Publication of CN110309177A publication Critical patent/CN110309177A/en
Application granted granted Critical
Publication of CN110309177B publication Critical patent/CN110309177B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24: Querying
    • G06F 16/245: Query processing
    • G06F 16/2453: Query optimisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24: Querying
    • G06F 16/245: Query processing
    • G06F 16/2455: Query execution
    • G06F 16/24553: Query execution of query operations
    • G06F 16/24558: Binary matching operations
    • G06F 16/2456: Join operations
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a data processing method, which comprises the following steps: receiving a data processing instruction; acquiring first partition data and second partition data according to the data processing instruction; sorting the first partition data through a Mapper to obtain first data to be combined, and sorting the second partition data to obtain second data to be combined; and combining the first data to be combined and the second data to be combined through a Reducer to obtain target connection data. The embodiment of the invention also discloses a data processing device. According to the embodiment of the invention, the data sorting process is completed in the Mapper, and the Reducer only needs to complete the data merging process, so that the data processing time delay of each Reducer is reduced, and the execution efficiency of Join is improved.

Description

Data processing method and related device
Technical Field
The present invention relates to the field of computer information processing, and in particular, to a data processing method and related apparatus.
Background
Spark is a memory-based distributed computing framework for big data, and Spark Structured Query Language (Structured Query Language, SQL), as an important member of the Spark ecosystem, provides users with structured data processing and SQL query analysis, so that analysts in different business fields can transparently use Spark to process massive data through SQL alone.
The join operation (Join) is an important syntactic feature of SQL, and almost no complex data analysis scenario can do without Join, so Join has always been a key target of optimization in the database field. Currently, one common Join is the Sort Merge Join, in which a map node (Mapper) of the Shuffle places the data with the same key value of an elastic distributed data set (Resilient Distributed Datasets, RDD) into the same partition and writes it to disk, and a reduce node (Reducer) of the Shuffle reads the partitions with the same key value from different RDDs, sorts them by key value, and then merges the sorted data.
In a real environment, the amount of data processed by each Mapper is relatively uniform, so the Mappers do not need to wait for one another while processing data. The amount of data processed by the Reducers, however, varies greatly: some Reducers may need to process a large amount of data while others need to process only a small amount, so a Reducer with a short processing time has to wait for a Reducer with a long processing time to finish data sorting, which increases the data processing delay and reduces the efficiency of the Join.
Disclosure of Invention
The embodiment of the invention provides a data processing method and a related device, wherein the data sorting process is completed in the Mapper, and the Reducer only needs to complete the data merging process, so that the data processing delay of each Reducer is reduced, and the execution efficiency of the Join is improved.
In view of this, a first aspect of the present invention provides a method of data processing, comprising:
receiving a data processing instruction;
acquiring first partition data and second partition data according to the data processing instruction;
sorting the first partition data through a mapping node Mapper to obtain first data to be combined, and sorting the second partition data to obtain second data to be combined;
and merging the first data to be merged and the second data to be merged through a reduction node Reducer to obtain target connection data.
A second aspect of the present invention provides a data processing apparatus comprising:
the receiving module is used for receiving the data processing instruction;
the acquisition module is used for acquiring the first partition data and the second partition data according to the data processing instruction received by the receiving module;
the sorting module is used for sorting the first partition data acquired by the acquisition module through a mapping node Mapper to obtain first data to be combined, and sorting the second partition data acquired by the acquisition module to obtain second data to be combined;
and the merging module is used for merging the first data to be merged and the second data to be merged which are sequenced by the sequencing module through a reduction node Reducer to obtain target connection data.
A third aspect of the present invention provides a data processing apparatus comprising: memory, transceiver, processor, and bus system;
wherein the memory is used for storing programs;
the processor is used for executing the program in the memory, and comprises the following steps:
receiving a data processing instruction;
acquiring first partition data and second partition data according to the data processing instruction;
sorting the first partition data through a mapping node Mapper to obtain first data to be combined, and sorting the second partition data to obtain second data to be combined;
combining the first data to be combined and the second data to be combined through a reduction node Reducer to obtain target connection data;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
A fourth aspect of the invention provides a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the methods of the above aspects.
From the above technical solutions, the embodiment of the present invention has the following advantages:
In the embodiment of the invention, a data processing method is provided, which comprises the steps of firstly receiving a data processing instruction, then obtaining first partition data and second partition data according to the data processing instruction, then sorting the first partition data through a Mapper to obtain first data to be combined, sorting the second partition data to obtain second data to be combined, and finally combining the first data to be combined and the second data to be combined through a Reducer to obtain target connection data. In this way, the characteristic that most Spark SQL tasks have more Mappers and fewer Reducers is utilized: the data sorting process is completed in the Mappers, and the Reducers only need to complete the data merging process, so the data processing delay of each Reducer is reduced and the execution efficiency of the Join is improved.
Drawings
FIG. 1 is a schematic diagram of a big data distributed computing framework Spark;
FIG. 2 is a schematic diagram of a large data distributed computing framework Spark;
FIG. 3 is a schematic diagram of a distributed system according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an embodiment of a method for data processing according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of transferring data from a cache to a hard disk according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a frame of a data processing method in an application scenario of the present invention;
FIG. 7 is a schematic diagram of an embodiment of a data processing apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a data processing method and a related device, wherein the data sorting process is completed in the Mapper, and the Reducer only needs to complete the data merging process, so that the data processing delay of each Reducer is reduced, and the execution efficiency of the Join is improved.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be appreciated that the present invention is primarily applicable to the memory-based big data distributed computing framework Spark, which was originally developed at the AMPLab (Algorithms, Machines, and People Lab) of the University of California, Berkeley. As a top-level project of the Apache open source community, it has attracted companies and developers worldwide, and Spark has since become a de facto standard for big data processing in the industry.
Typically, when the amount of data to be processed exceeds the capacity of a single machine (for example, our computer has 4 gigabytes of memory but we need to process more than 100 gigabytes of data), we can choose a Spark cluster to perform the computation. Sometimes the amount of data is small but the computation is complex and time-consuming, and we can likewise use the powerful computing resources of a Spark cluster to compute in parallel. Spark is described below with reference to fig. 1 and fig. 2.
Referring to fig. 1, fig. 1 is a schematic diagram of the big data distributed computing framework Spark. Spark Core contains the basic functions of Spark, in particular the application programming interfaces (Application Programming Interface, API) that define RDDs and the operations and actions on them. The other Spark libraries are built on top of RDDs and Spark Core.
Spark SQL provides APIs for interacting with Spark through the SQL dialect of the Apache data warehouse tool; each database table is treated as an RDD, and Spark SQL queries are converted into Spark operations. The Spark streaming component processes and controls real-time data streams, allowing programs to handle real-time data in the same way as conventional RDDs. The machine learning library is a common library of machine learning algorithms implemented as Spark operations on RDDs; it contains scalable learning algorithms, such as classification and regression, that require iteration over large data sets. GraphX is a collection of algorithms and tools for manipulating graphs and parallelizing graph operations and computations; it extends the RDD API and includes operations for manipulating graphs, creating subgraphs, and accessing all vertices on a path.
Referring to fig. 2, fig. 2 is a schematic diagram of the architecture of the big data distributed computing framework Spark. As shown, the Cluster Manager is the master node in standalone mode; it controls the entire cluster and monitors the worker nodes, and in Yet Another Resource Negotiator (YARN) mode it is the resource manager. The worker node, also called the slave node, is responsible for controlling the compute node and starting the executor or the driver. An executor is a process running on a worker node for a certain application (APP). The driver runs the main() function of the APP.
It should be understood that, in the embodiment of the present invention, the data processing method is mainly applied to a distributed system. Referring to fig. 3, fig. 3 is a schematic diagram of an architecture of the distributed system in the embodiment of the present invention. In general, the number of Mappers is greater than the number of Reducers, and the Mappers are mainly responsible for the data processing stage. A Mapper takes the form Mapper<K, V> and is an independent task that converts input records into intermediate records, that is, it processes input key-value pairs and outputs a group of intermediate key-value pairs. An input record may produce 0 or more intermediate records after being processed by a Mapper. For example, if an input record does not meet the service requirement (it does not contain a required value, or it contains a value that should be filtered out), the Mapper can return directly and output 0 records; in this case the Mapper functions as a filter.
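For illustration only, the following Scala sketch shows a Mapper used in this filtering role; the record type, the field names, and the predicate are assumptions made for the example rather than details from the patent.

    // Hypothetical filtering Mapper: each input record yields zero or one
    // intermediate (key, value) pair; records that fail the predicate are dropped.
    case class InputRecord(key: Int, value: String)

    def filterMapper(records: Iterator[InputRecord],
                     keep: InputRecord => Boolean): Iterator[(Int, String)] =
      records.collect { case r if keep(r) => (r.key, r.value) }

    // Example: keep only records whose value field is non-empty.
    // val intermediate = filterMapper(partitionRecords, _.value.nonEmpty)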
A Reducer allows configuration and cleanup, and reduces a set of intermediate values sharing the same key to a smaller set of values, for example when counting merged words. In the present invention, data sorting is completed in the Mapper and the Reducer is only responsible for merging data, which improves the sorting efficiency and therefore the Join efficiency.
Referring to fig. 4, an embodiment of a method for data processing according to the present invention includes:
101. Receiving a data processing instruction;
In this embodiment, the data processing apparatus may be disposed on a server or a terminal device. The data processing apparatus receives a data processing instruction, where the data processing instruction may be initiated by a user such as a network administrator or operation and maintenance personnel, or may be initiated actively by a device; this is not limited here.
102. Acquiring first partition data and second partition data according to the data processing instruction;
in this embodiment, the data processing apparatus constructs a resilient distributed data set (Resilient Distributed Datasets, RDD) corresponding to the data source according to the data processing instruction. In the present invention, RDD is described as including two partitions, a first partition and a second partition, respectively.
On a big data analysis platform built on Spark SQL, the PreSortMergeJoin algorithm of this scheme is enabled by default and is controlled through a corresponding parameter (spark.join.shuffle.sort=true); when the parameter is false, the default Spark SQL Join algorithm is used instead (that is, the prior scheme in which the Reducer performs both sorting and merging). Therefore, on the product side, when the Spark SQL big data analysis platform goes online, the system administrator only needs to set this parameter in the configuration file, and no operation is required from the user.
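As a sketch of how an administrator might set this switch programmatically, the snippet below uses a standard SparkSession builder; note that spark.join.shuffle.sort is the parameter named in this scheme, not a built-in Spark configuration key, and the application name is illustrative.

    // Sketch only: enabling the PreSortMergeJoin switch described above.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("PreSortMergeJoinDemo")              // illustrative application name
      .config("spark.join.shuffle.sort", "true")    // "false" falls back to the default Spark SQL Join
      .getOrCreate()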
It will be appreciated that, in practical applications, preSortMergeJoin may be defined as another name, but the algorithm is not substantially changed, as the name of the algorithm provided for implementing the scheme.
103. Sorting the first partition data through a mapping node Mapper to obtain first data to be combined, and sorting the second partition data to obtain second data to be combined;
in this embodiment, the data processing apparatus performs a sorting process on the first partition data by using a Mapper, and then obtains first data to be merged after sorting, and similarly, the data processing apparatus performs a sorting process on the second partition data by using a Mapper, and then obtains second data to be merged after sorting. An example of a sort process will be described below in conjunction with tables 1 and 2, where table 1 is an example prior to sort.
TABLE 1
User identification | Age
3 | 21
5 | 64
2 | 33
1 | 28
7 | 19
4 | 56
6 | 40
The results shown in Table 2 were obtained after sorting.
TABLE 2
User identification | Age
1 | 28
2 | 33
3 | 21
4 | 56
5 | 64
6 | 40
7 | 19
It will be appreciated that the above sorting is based on the "user identification", and that in practical applications, other sorting criteria may be selected according to the requirements.
104. And merging the first data to be merged and the second data to be merged through a reduction node Reducer to obtain target connection data.
In this embodiment, the data processing apparatus merges the first data to be merged and the second data to be merged through the Reducer to obtain the target connection data, that is, the Join result. Assuming that the first data to be merged is the result shown in Table 2, the second data to be merged is shown in Table 3.
TABLE 3
User identification | Occupation
1 | Engineer
2 | Teacher
3 | Teacher
4 | Engineer
5 | Teacher
6 | Engineer
7 | Engineer
The target connection data thus obtained are shown in table 4.
TABLE 4
User identification | Occupation | Age
1 | Engineer | 28
2 | Teacher | 33
3 | Teacher | 21
4 | Engineer | 56
5 | Teacher | 64
6 | Engineer | 40
7 | Engineer | 19
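The target connection data in Table 4 corresponds to an equi-join of the two tables on the user identification. The following Spark SQL sketch illustrates such a query; the table and column names (user_age, user_job, user_id, age, occupation) are assumptions for this example, and spark denotes an existing SparkSession.

    // Illustrative equi-join on the user identification, producing the Table 4 result.
    val joined = spark.sql(
      """
        |SELECT a.user_id, b.occupation, a.age
        |FROM user_age a
        |JOIN user_job b ON a.user_id = b.user_id
        |ORDER BY a.user_id
      """.stripMargin)
    joined.show()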
In the embodiment of the invention, a data processing method is provided, which comprises the steps of firstly receiving a data processing instruction, then acquiring first partition data and second partition data according to the data processing instruction, wherein the first partition data and the second partition data belong to different elastic distributed data sets RDD, sorting the first partition data through a Mapper to obtain first data to be combined, sorting the second partition data to obtain second data to be combined, and finally merging the first data to be combined and the second data to be combined through a Reducer to obtain target connection data. In this way, the characteristic that most Spark SQL tasks have more Mappers and fewer Reducers is utilized: the data sorting process is completed in the Mappers, and the Reducers only need to complete the data merging process, so the data processing delay of each Reducer is reduced and the execution efficiency of the Join is improved.
Optionally, on the basis of the embodiment corresponding to fig. 4, in a first optional embodiment corresponding to the data processing method provided by the embodiment of the present invention, acquiring the first partition data and the second partition data according to the data processing instruction includes:
and acquiring the first partition data and the second partition data from the elastic distributed data set RDD according to the data processing instruction.
In this embodiment, the data processing apparatus obtains the first partition data and the second partition data from the same RDD according to the data processing instruction.
Specifically, join is to combine two or more rows of partitions belonging to the same key value, so the first partition data and the second partition data are only shown as one example, and in practical application, there may be more partition data, which is not limited herein.
RDD is a fault-tolerant, parallel data structure that can explicitly store data to disk and memory; it is essentially a read-only collection of data records. An RDD may comprise a plurality of partitions, each partition being a segment of the data set. An RDD can be partitioned using a partitioning scheme that determines how the data set in the RDD is divided, e.g., one partition may correspond to each data block. The partitioner is then the object that performs the partitioning operation on the RDD partitions.
The process of repartitioning an RDD is called shuffling (Shuffle). In Spark, a Shuffle involves writing data to disk and reading it back. For a Join, the Mapper of the Shuffle places the data with the same key value of an RDD into the same partition (note: each partition may contain different key values) and writes the data to disk, and the Reducer of the Shuffle reads the partitions with the same key value from different RDDs and combines them.
In the second embodiment of the present invention, the data processing apparatus needs to obtain the partition data from the same RDD, that is, obtain the first partition data from the RDD, and obtain the second partition data from the RDD. Through the mode, the rows belonging to the same key value in one RDD can be combined, namely the Join process is completed, and therefore the practicability and feasibility of the scheme are improved.
Optionally, in a second alternative embodiment corresponding to the data processing method provided by the embodiment of the present invention on the basis of the embodiment corresponding to fig. 4, the sorting processing is performed on the first partition data by using a Mapper, so as to obtain first data to be merged, which may include:
sorting the first partition data through a first Mapper to obtain at least one first file, wherein the at least one first file belongs to the first data to be merged;
Merging at least one first file to obtain first data to be merged;
sorting the second partition data through a second Mapper to obtain at least one second file, wherein the at least one second file belongs to the second data to be merged;
and merging at least one second file to obtain second data to be merged.
In this embodiment, at the stage in which the Mapper processes data, the data needs to be sorted and aggregated, and this operation can be completed by the JoinShuffleWrite function. The input of the function is the partition data of the RDD; each piece of data is then inserted into the sorter until all the data has been inserted, and finally all the sorted files are merged. For ease of understanding, please refer to the following pseudocode:
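Since only the behaviour of the function is described here, the following Scala sketch shows one possible shape of the map-side write; the RecordSorter interface and the method names mirror those used later in this description (insertRecord, mergeSpills) but are assumptions for illustration.

    import java.io.File

    // Assumed sorter interface, matching the method names used in this description.
    trait RecordSorter[K, V] {
      def insertRecord(key: K, value: V): Unit  // buffers records in memory, spilling sorted runs when full
      def mergeSpills(): File                   // merges all sorted runs into one ordered output file
    }

    // Map-side write: insert every record of the RDD partition, then merge the sorted runs.
    def joinShuffleWrite[K, V](partition: Iterator[(K, V)], sorter: RecordSorter[K, V]): File = {
      partition.foreach { case (k, v) => sorter.insertRecord(k, v) }
      sorter.mergeSpills()
    }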
Secondly, in the embodiment of the invention, a process of sorting the first partition data through the first Mapper to obtain at least one first file and then merging the at least one first file to obtain the first data to be merged is introduced, together with a process of sorting the second partition data through the second Mapper to obtain at least one second file and merging the at least one second file to obtain the second data to be merged. In this way, different Mappers sort different partition data, so the Reducer does not need to sort data, and the situation in which several Reducers wait for one another because the amount of data they process is uneven is avoided.
Optionally, on the basis of the second embodiment corresponding to fig. 4, in a third optional embodiment corresponding to the data processing method provided by the embodiment of the present invention, the sorting processing is performed on the first partition data by using a first Mapper, so as to obtain at least one first file, which may include:
if the amount of the first partition data sorted in the memory by the first Mapper reaches a first preset threshold, taking the M pieces of sorted data in the memory as a first file, and storing the first file into a hard disk, wherein M is an integer greater than 0;
the sorting processing is performed on the second partition data through the second Mapper, so as to obtain at least one second file, which may include:
if the data quantity of the second partition data subjected to the sorting processing in the memory through the second Mapper reaches a second preset threshold, taking N pieces of sorted data in the memory as a second file, and storing the second file into a hard disk, wherein N is an integer larger than 0.
In this embodiment, while processing the first partition data, the first Mapper writes the first partition data into the memory; if the amount of first partition data sorted in the memory by the first Mapper reaches a first preset threshold, the first Mapper takes the M pieces of sorted data in the memory as a first file and sends the first file to the hard disk. Similarly, the second Mapper writes the second partition data into the memory; if the amount of second partition data sorted in the memory by the second Mapper reaches a second preset threshold, the second Mapper takes the N pieces of sorted data in the memory as a second file and sends the second file to the hard disk.
It should be noted that, in general, the first preset threshold and the second preset threshold are identical, for example, the first preset threshold and the second preset threshold are both 30. The values of M and N are usually the same, and may be set according to circumstances, and are not limited thereto.
In the following, how a data record is inserted into the sorter is described in conjunction with pseudocode. Please refer to the following pseudocode:
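A hedged Scala sketch of this insertion logic is given below; the memoryBuffer representation, the record threshold, and the spill-file format (one key,value pair per line) are illustrative assumptions.

    import java.io.{File, PrintWriter}
    import scala.collection.mutable.ArrayBuffer

    // Sketch of insertRecord / sortAndSpill: records accumulate in memoryBuffer and,
    // once the threshold is reached, are sorted by key, written to a spill file on
    // disk, and the buffer is emptied.
    class SpillingSorter[V](threshold: Int) {
      private val memoryBuffer = ArrayBuffer.empty[(Int, V)]
      val spillFiles = ArrayBuffer.empty[File]

      def insertRecord(key: Int, value: V): Unit = {
        memoryBuffer += ((key, value))
        if (memoryBuffer.size >= threshold) sortAndSpill()
      }

      private def sortAndSpill(): Unit = {
        val file = File.createTempFile("spill", ".txt")
        val out = new PrintWriter(file)
        memoryBuffer.sortBy(_._1).foreach { case (k, v) => out.println(s"$k,$v") }  // sorted by key value
        out.close()
        spillFiles += file
        memoryBuffer.clear()                                                        // empty the memory buffer
      }
    }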
the insert record method is to insert a data record into the sequencer, the memerybuffer uses the memory to store record, when a certain number is reached, the sortAndSpill method sequences and stores the memerybuffer data on the disk, and empties the memerybuffer.
Assume that the records comprise four pieces of data in total, namely (1, r1), (3, r3), (4, r4) and (2, r2), where 1, 2, 3 and 4 are key values. If the memory buffer holds at most two pieces of data, (1, r1) and (3, r3) are saved as a disk file after sorting by key value (the data order in the file is (1, r1), (3, r3)), and (4, r4) and (2, r2) are saved as a disk file after sorting by key value (the data order in the file is (2, r2), (4, r4)).
In the embodiment of the present invention, if the amount of the first partition data sorted in the memory by the first Mapper reaches a first preset threshold, the M pieces of sorted data in the memory are used as a first file; similarly, if the amount of the second partition data sorted in the memory by the second Mapper reaches a second preset threshold, the N pieces of sorted data in the memory are used as a second file. By this method, in scenarios with an extremely large data volume, the data written into the memory can be placed on the hard disk for processing, which avoids the problem that a Mapper cannot sort because of insufficient memory space, thereby improving the feasibility and practicability of the scheme.
Optionally, on the basis of the second embodiment corresponding to fig. 4, in a fourth optional embodiment corresponding to the data processing method provided by the embodiment of the present invention, merging at least one first file to obtain first data to be merged may include:
acquiring minimum data from each first file;
determining first data to be combined according to the minimum data in each first file;
at least one second file is combined to obtain second data to be combined, which comprises the following steps:
acquiring minimum data from each second file;
and determining second data to be combined according to the minimum data in each second file.
In this embodiment, the first partition data is ordered in the memory, a first file is obtained after the ordering, and the second partition data is ordered in the other memory, a second file is obtained after the ordering. Referring to fig. 5, fig. 5 is a schematic diagram of transferring data from a cache to a hard disk in an embodiment of the present invention, where, as shown in the drawing, assuming that a first file includes a file a, a file B and a file C, and assuming that a current partition is No. 1, minimum data can be respectively taken out from partition No. 1 in the file a, partition No. 1 in the file B and partition No. 1 in the file C, and written into partition No. 1 in the hard disk. And so on, the minimum value in partition number 1 is written to the hard disk every time, so that an ordered result is obtained.
In the following, how the sorted data is merged into one file is described in combination with pseudocode. Please refer to the following pseudocode:
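A hedged Scala sketch of this merge step follows; representing each spill file as an in-memory sorted sequence, and omitting the per-partition bookkeeping described below, are simplifications for illustration.

    // mergeSpills sketch: repeatedly take the smallest remaining record across all
    // sorted spill files (modelled here as sorted sequences) and append it to the output.
    def mergeSpills[V](spills: Seq[Seq[(Int, V)]]): Seq[(Int, V)] = {
      val cursors = spills.map(_.iterator.buffered)
      val merged = scala.collection.mutable.ArrayBuffer.empty[(Int, V)]
      while (cursors.exists(_.hasNext)) {
        val smallest = cursors.filter(_.hasNext).minBy(_.head._1)  // cursor whose next key is smallest
        merged += smallest.next()
      }
      merged.toSeq
    }

    // With the worked example that follows:
    // mergeSpills(Seq(Seq(1 -> "r1", 3 -> "r3"), Seq(2 -> "r2", 4 -> "r4")))
    //   returns Seq(1 -> "r1", 2 -> "r2", 3 -> "r3", 4 -> "r4")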
The mergeSpills method merges the files spilled during all insertRecord calls. Each time, the minimum value belonging to the current partition is taken from these files and written to the output file; when the data taken no longer belongs to the current partition, processing moves on to the next partition, until all the data has been taken out.
Assume that the records comprise four pieces of data in total, namely (1, r1), (3, r3), (4, r4) and (2, r2), where 1, 2, 3 and 4 are key values. If the memory buffer holds at most two pieces of data, (1, r1) and (3, r3) are saved as a disk file after sorting by key value (the data order in the file is (1, r1), (3, r3)), and (4, r4) and (2, r2) are saved as a disk file after sorting by key value (the data order in the file is (2, r2), (4, r4)). After mergeSpills, a single file is created in which the data order is (1, r1), (2, r2), (3, r3), (4, r4).
In the embodiment of the present invention, the method for obtaining the data to be combined by combining at least one file may be that minimum data is obtained from each file, and then the data to be combined is obtained according to the minimum data in each file. By the method, the data in the first file and the data in the second file are ordered, so that the ordering of the minimum values in the first file and the second file can be guaranteed every time, and the operability of the scheme is improved.
Optionally, on the basis of any one of the first to fourth embodiments corresponding to fig. 4 and fig. 4, in a fifth optional embodiment corresponding to the data processing method provided by the embodiment of the present invention, the merging processing, by Reducer, of the first data to be merged and the second data to be merged to obtain the target connection data may include:
receiving first data to be merged, which is obtained after the first Mapper sorts the first partition data, through a Reducer;
receiving second data to be merged, which is obtained after the second Mapper sorts the second partition data, through a Reducer;
and merging the first data to be merged and the second data to be merged through the Reducer to obtain target connection data.
In this embodiment, a procedure of merging the first data to be merged and the second data to be merged by the Reducer is described.
Specifically, the description takes one Reducer as an example. The Reducer receives the first data to be merged sent from the first Mapper, where the first data to be merged is the data obtained after the first Mapper has sorted. Similarly, the Reducer also receives the second data to be merged from the second Mapper, where the second data to be merged is the data obtained after the second Mapper has sorted. The Reducer merges the received first data to be merged and second data to be merged, thereby obtaining the target connection data of the Join.
Further, in the embodiment of the present invention, a process of merging the first data to be merged and the second data to be merged by the Reducer is described. By the method, one Reducer can directly combine a plurality of ordered results without ordering data, so that the efficiency of ordering operation is improved, and the execution efficiency of Join is improved.
Optionally, on the basis of the fifth embodiment corresponding to fig. 4, in a sixth optional embodiment corresponding to the data processing method provided by the embodiment of the present invention, the obtaining, by Reducer, the target connection data may include:
acquiring first minimum data from the first data to be combined through a Reducer, and acquiring second minimum data from the second data to be combined;
and comparing the first minimum data with the second minimum data through the Reducer, and combining the first data to be combined with the second data to be combined according to the comparison result to obtain target connection data.
In this embodiment, during the process of processing the first data to be combined and the second data to be combined, firstly, the Reducer needs to obtain the first minimum data from the first data to be combined, obtain the second minimum data from the second data to be combined, then compare the first minimum data with the second minimum data, and output the minimum values in sequence.
In the following, how the first data to be merged and the second data to be merged are merged is described in combination with pseudocode. Please refer to the following pseudocode:
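A hedged Scala sketch of this reduce-side merge is given below; the class name MergingIterator, the getNext() signature, and the way the Mapper outputs are represented are assumptions for illustration.

    // Reduce-side sketch: the sorted outputs of different Mappers are merged on the fly;
    // getNext() always returns the record with the globally smallest remaining key value.
    class MergingIterator[V](sortedInputs: Seq[Iterator[(Int, V)]]) {
      private val cursors = sortedInputs.map(_.buffered)

      def hasNext: Boolean = cursors.exists(_.hasNext)

      def getNext(): (Int, V) =
        cursors.filter(_.hasNext).minBy(_.head._1).next()
    }

    // Usage sketch (joinOperator is hypothetical): feed the merged stream into the Join operator.
    // val it = new MergingIterator(Seq(firstToMerge, secondToMerge))
    // while (it.hasNext) joinOperator.consume(it.getNext())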
The Reduce stage combines the data from the different Mappers and feeds it in iterator form into the Join operator, so calling getNext() on the iterator obtains the next data value. The logic is similar to the mergeSpills method of the Mapper's data processing stage and is not described again here.
Further, in the embodiment of the present invention, the manner of merging the first data to be merged and the second data to be merged by the Reducer may be that the Reducer obtains the first minimum data from the first data to be merged and obtains the second minimum data from the second data to be merged; and comparing the first minimum data with the second minimum data through the Reducer, and merging the first data to be merged and the second data to be merged according to the comparison result. By the method, the first data to be combined and the second data to be combined are ordered, and the ordering of the minimum values in the first data to be combined and the second data to be combined can be guaranteed every time, so that the operability of the scheme is improved.
For ease of understanding, referring to fig. 6, fig. 6 is a schematic diagram of a frame of a data processing method in an application scenario of the present invention, where the PreSortMergeJoin algorithm has two key parts, one part is to sort data when the map phase data is partitioned, and the other part is to merge the sorted data in the reduce phase.
First, for one Mapper, the partition data is sorted; since the data volume is relatively large, the memory is likely to be insufficient to hold all the data, so the sorted data needs to be split. Assuming that the memory can hold at most one hundred thousand records and a total of one million records needs to be sorted, the Mapper can write every one hundred thousand sorted records to the hard disk until all one million records have been stored. The one million records stored on the hard disk are also ordered, because the minimum value is written to the hard disk each time during storage, thereby forming an ordered Mapper file.
The data stored in the hard disk is not necessarily used by only one Reducer, so the data in the hard disk can be distributed to different reducers, and similarly, the Mapper file acquired by one Reducer is not necessarily derived from the same Mapper.
The Reducer can obtain the final Join result by merging at least one Mapper file.
The invention can be applied to a data warehouse and supports about 1.6 million SQL query processing tasks. The invention brings an obvious improvement in Join efficiency, can increase the Join speed by up to 70%, and greatly reduces the running time of the service.
Referring to fig. 7, fig. 7 is a schematic diagram of an embodiment of a data processing apparatus according to an embodiment of the present invention, and the data processing apparatus 20 includes:
a receiving module 201, configured to receive a data processing instruction;
an obtaining module 202, configured to obtain first partition data and second partition data according to the data processing instruction received by the receiving module 201;
the sorting module 203 is configured to sort the first partition data acquired by the acquiring module 202 by using a mapping node Mapper to obtain first data to be merged, and sort the second partition data acquired by the acquiring module to obtain second data to be merged;
and the merging module 204 is configured to merge the first data to be merged and the second data to be merged, which are sequenced by the sequencing module 203, by using a reduction node Reducer, so as to obtain target connection data.
In this embodiment, the receiving module 201 receives a data processing instruction, the acquiring module 202 acquires first partition data and second partition data according to the data processing instruction received by the receiving module 201, the sorting module 203 performs sorting processing on the first partition data acquired by the acquiring module 202 through a mapping node Mapper to obtain first to-be-merged data, and performs sorting processing on the second partition data acquired by the acquiring module to obtain second to-be-merged data, and the merging module 204 performs merging processing on the first to-be-merged data and the second to-be-merged data after sorting by the sorting module 203 through a reduction node Reducer to obtain target connection data.
In an embodiment of the present invention, a data processing apparatus is provided, which first receives a data processing instruction, then obtains first partition data and second partition data according to the data processing instruction, where the first partition data and the second partition data belong to different elastic distributed data sets RDD, then performs sorting processing on the first partition data by a Mapper to obtain first data to be combined and performs sorting processing on the second partition data to obtain second data to be combined, and finally performs merging processing on the first data to be combined and the second data to be combined by a Reducer to obtain target connection data. In this way, the characteristic that most Spark SQL tasks have more Mappers and fewer Reducers is utilized: the data sorting process is completed in the Mappers, and the Reducers only need to complete the data merging process, so the data processing delay of each Reducer is reduced and the execution efficiency of the Join is improved.
Alternatively, in another embodiment of the data processing apparatus 20 according to the embodiment of the present invention based on the embodiment corresponding to fig. 7,
the obtaining module 202 is specifically configured to obtain the first partition data and the second partition data from the elastic distributed data set RDD according to the data processing instruction.
In the second embodiment of the present invention, the data processing apparatus needs to obtain the partition data from the same RDD, that is, obtain the first partition data from the RDD, and obtain the second partition data from the RDD. Through the mode, the rows belonging to the same key value in one RDD can be combined, namely the Join process is completed, and therefore the practicability and feasibility of the scheme are improved.
Alternatively, in another embodiment of the data processing apparatus 20 according to the embodiment of the present invention based on the embodiment corresponding to fig. 7,
the sorting module 203 is specifically configured to sort the first partition data by using a first Mapper to obtain at least one first file, where the at least one first file belongs to the first data to be merged;
combining the at least one first file to obtain the first data to be combined;
Sorting the second partition data through a second Mapper to obtain at least one second file, wherein the at least one second file belongs to the second data to be merged;
and merging the at least one second file to obtain the second data to be merged.
Secondly, in the embodiment of the invention, a process of sorting the first partition data through the first Mapper to obtain at least one first file and then merging the at least one first file to obtain the first data to be merged is introduced, together with a process of sorting the second partition data through the second Mapper to obtain at least one second file and merging the at least one second file to obtain the second data to be merged. In this way, different Mappers sort different partition data, so the Reducer does not need to sort data, and the time spent by the plurality of Reducers waiting for one another's sorting is avoided.
Alternatively, in another embodiment of the data processing apparatus 20 according to the embodiment of the present invention based on the embodiment corresponding to fig. 7,
the sorting module 203 is specifically configured to, if the amount of data in the memory, which is processed by sorting the first partition data by the first Mapper, reaches a first preset threshold, take M pieces of sorted data in the memory as a first file, and store the first file in a hard disk, where M is an integer greater than 0;
And if the data quantity of the second partition data subjected to the sorting processing in the memory through the second Mapper reaches a second preset threshold, taking N pieces of sorted data in the memory as a second file, and storing the second file into a hard disk, wherein N is an integer greater than 0.
In the embodiment of the present invention, if the data amount of the first partition data sequenced by the first Mapper in the memory reaches a first preset threshold, the M sequenced data in the memory is used as a first file, and similarly, if the data amount of the second partition data sequenced by the second Mapper in the memory reaches a second preset threshold, the N sequenced data in the memory is used as a second file. By the method, under the scene of extremely large data volume, the data written in the memory can be placed in the hard disk for processing, so that the problem that a Mapper cannot sort due to insufficient memory space is avoided, and the feasibility and the practicability of the scheme are improved.
Alternatively, in another embodiment of the data processing apparatus 20 according to the embodiment of the present invention based on the embodiment corresponding to fig. 7,
The sorting module 203 is specifically configured to obtain minimum data from each first file;
determining the first data to be combined according to the minimum data in each first file;
acquiring minimum data from each second file;
and determining the second data to be combined according to the minimum data in each second file.
In the embodiment of the present invention, the method for obtaining the data to be combined by combining at least one file may be that minimum data is obtained from each file, and then the data to be combined is obtained according to the minimum data in each file. By the method, the data in the first file and the data in the second file are ordered, so that the ordering of the minimum values in the first file and the second file can be guaranteed every time, and the operability of the scheme is improved.
Alternatively, in another embodiment of the data processing apparatus 20 according to the embodiment of the present invention based on the embodiment corresponding to fig. 7,
the merging module 204 is specifically configured to receive, by using the Reducer, the first data to be merged obtained after the first Mapper performs the sorting processing on the first partition data;
Receiving the second data to be merged, which is obtained after the second partition data is sorted by the second Mapper, through the Reducer;
and merging the first data to be merged and the second data to be merged through the Reducer to obtain the target connection data.
Further, in the embodiment of the present invention, a process of merging the first data to be merged and the second data to be merged by the Reducer is described. By the method, one Reducer can directly combine a plurality of ordered results without ordering data, so that the efficiency of ordering operation is improved, and the execution efficiency of Join is improved.
Alternatively, in another embodiment of the data processing apparatus 20 according to the embodiment of the present invention based on the embodiment corresponding to fig. 7,
the merging module 204 is specifically configured to obtain, by using the Reducer, first minimum data from the first data to be merged, and obtain second minimum data from the second data to be merged;
and comparing the first minimum data with the second minimum data through the Reducer, and combining the first data to be combined with the second data to be combined according to a comparison result to obtain the target connection data.
Further, in the embodiment of the present invention, the manner of merging the first data to be merged and the second data to be merged by the Reducer may be that the Reducer obtains the first minimum data from the first data to be merged and obtains the second minimum data from the second data to be merged; and comparing the first minimum data with the second minimum data through the Reducer, and merging the first data to be merged and the second data to be merged according to the comparison result. By the method, the first data to be combined and the second data to be combined are ordered, and the ordering of the minimum values in the first data to be combined and the second data to be combined can be guaranteed every time, so that the operability of the scheme is improved.
Fig. 8 is a schematic diagram of a data processing apparatus according to an embodiment of the present invention, where the data processing apparatus 300 may have a relatively large difference due to different configurations or performances, and may include one or more central processing units (central processing units, CPU) 322 (e.g., one or more processors) and a memory 332, and one or more storage media 330 (e.g., one or more mass storage devices) storing application programs 342 or data 344. Wherein the memory 332 and the storage medium 330 may be transitory or persistent. The program stored on the storage medium 330 may include one or more modules (not shown), each of which may include a series of instruction operations on the data processing apparatus. Still further, the central processor 322 may be configured to communicate with the storage medium 330 to execute a series of instruction operations in the storage medium 330 on the data processing apparatus 300.
The data processing apparatus 300 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The steps performed by the data processing apparatus in the above-described embodiments may be based on the data processing apparatus structure shown in fig. 8.
The CPU 322 is configured to perform the steps of:
receiving a data processing instruction;
acquiring first partition data and second partition data according to the data processing instruction;
sorting the first partition data through a mapping node Mapper to obtain first data to be combined, and sorting the second partition data to obtain second data to be combined;
and merging the first data to be merged and the second data to be merged through a reduction node Reducer to obtain target connection data.
Optionally, the CPU 322 is specifically configured to perform the following steps:
and acquiring the first partition data and the second partition data from the elastic distributed data set RDD according to the data processing instruction.
Optionally, the CPU 322 is specifically configured to perform the following steps:
sorting the first partition data through a first Mapper to obtain at least one first file, wherein the at least one first file belongs to the first data to be merged;
combining the at least one first file to obtain the first data to be combined;
sorting the second partition data through a second Mapper to obtain at least one second file, wherein the at least one second file belongs to the second data to be merged;
and merging the at least one second file to obtain the second data to be merged.
Optionally, the CPU 322 is specifically configured to perform the following steps:
if the data quantity of the first partition data subjected to sorting processing in the memory through the first Mapper reaches a first preset threshold, taking M pieces of sorted data in the memory as a first file, and storing the first file into a hard disk, wherein M is an integer greater than 0;
and if the data quantity of the second partition data subjected to the sorting processing in the memory through the second Mapper reaches a second preset threshold, taking N pieces of sorted data in the memory as a second file, and storing the second file into a hard disk, wherein N is an integer greater than 0.
Optionally, the CPU 322 is specifically configured to perform the following steps:
acquiring minimum data from each first file;
determining the first data to be combined according to the minimum data in each first file;
acquiring minimum data from each second file;
and determining the second data to be combined according to the minimum data in each second file.
Optionally, the CPU 322 is specifically configured to perform the following steps:
receiving the first data to be merged, which is obtained after the first partition data is sorted by the first Mapper, through the Reducer;
receiving the second data to be merged, which is obtained after the second partition data is sorted by the second Mapper, through the Reducer;
and merging the first data to be merged and the second data to be merged through the Reducer to obtain the target connection data.
Optionally, the CPU 322 is specifically configured to perform the following steps:
acquiring first minimum data from the first data to be combined through the Reducer, and acquiring second minimum data from the second data to be combined;
and comparing the first minimum data with the second minimum data through the Reducer, and combining the first data to be combined with the second data to be combined according to a comparison result to obtain the target connection data.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (5)

1. A method of data processing, for use in a data warehouse, the number of map nodes in the data warehouse being greater than the number of reduce nodes, and the volume of data processed by each map node being uniform, the method comprising:
receiving a data processing instruction;
according to the data processing instruction, constructing a resilient distributed dataset (RDD) corresponding to a data source, and acquiring first partition data and second partition data from the same RDD, wherein each database table serves as one RDD, and one RDD comprises a plurality of partitions;
performing, by the first map node based on the user identification, the following processing:
sorting the first partition data in a first memory to obtain a plurality of first files, wherein each first file comprises data of a plurality of partitions sorted according to partition numbers;
writing, according to the order of the partition numbers, the minimum value among the data of the same partition in each first file into the corresponding partition on the hard disk each time, to obtain first data to be merged;
performing, by the second map node based on the user identification, the following processing:
sorting the second partition data in a second memory to obtain a plurality of second files, wherein each second file comprises data of a plurality of partitions sorted according to partition numbers;
writing, according to the order of the partition numbers, the minimum value among the data of the same partition in each second file into the corresponding partition on the hard disk each time, to obtain second data to be merged;
and merging, by the reduce node, the first data to be merged and the second data to be merged, to obtain target connection data without performing data sorting.
2. The method of claim 1, wherein the merging, by the reduce node, the first data to be merged and the second data to be merged to obtain target connection data without performing data sorting comprises:
acquiring first minimum data from the first data to be merged, and acquiring second minimum data from the second data to be merged;
and comparing the first minimum data with the second minimum data, and merging the first data to be merged with the second data to be merged according to the comparison result, to obtain the target connection data.
3. A data processing apparatus for use in a data warehouse, the number of map nodes in the data warehouse being greater than the number of reduce nodes, and the volume of data processed by each map node being uniform, the apparatus comprising:
the receiving module is used for receiving the data processing instruction;
the acquisition module is used for constructing, according to the data processing instruction received by the receiving module, a resilient distributed dataset (RDD) corresponding to a data source, and acquiring first partition data and second partition data from the same RDD, wherein each database table serves as one RDD, and each RDD comprises a plurality of partitions;
the sorting module is used for performing, by the first map node based on the user identification, the following processing: sorting, in a first memory, the first partition data acquired by the acquisition module to obtain a plurality of first files, wherein each first file comprises data of a plurality of partitions sorted according to partition numbers; and writing, according to the order of the partition numbers, the minimum value among the data of the same partition in each first file into the corresponding partition on the hard disk each time, to obtain first data to be merged; and for performing, by the second map node based on the user identification, the following processing: sorting, in a second memory, the second partition data acquired by the acquisition module to obtain a plurality of second files, wherein each second file comprises data of a plurality of partitions sorted according to partition numbers; and writing, according to the order of the partition numbers, the minimum value among the data of the same partition in each second file into the corresponding partition on the hard disk each time, to obtain second data to be merged;
and the merging module is used for merging, by the reduce node, the first data to be merged and the second data to be merged that have been sorted by the sorting module, to obtain target connection data without performing data sorting.
4. A data processing apparatus for use in a data warehouse, the number of map nodes in the data warehouse being greater than the number of reduce nodes, and the volume of data processed by each map node being uniform, the data processing apparatus comprising: a memory, a transceiver, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor is used for executing the program in the memory to perform the following steps:
receiving a data processing instruction;
according to the data processing instruction, constructing a resilient distributed dataset (RDD) corresponding to a data source, and acquiring first partition data and second partition data from the same RDD, wherein each database table serves as one RDD, and one RDD comprises a plurality of partitions;
performing, by the first map node based on the user identification, the following processing:
sorting the first partition data in a first memory to obtain a plurality of first files, wherein each first file comprises data of a plurality of partitions sorted according to partition numbers;
writing, according to the order of the partition numbers, the minimum value among the data of the same partition in each first file into the corresponding partition on the hard disk each time, to obtain first data to be merged;
performing, by the second map node based on the user identification, the following processing:
sorting the second partition data in a second memory to obtain a plurality of second files, wherein each second file comprises data of a plurality of partitions sorted according to partition numbers;
writing, according to the order of the partition numbers, the minimum value among the data of the same partition in each second file into the corresponding partition on the hard disk each time, to obtain second data to be merged;
and merging, by the reduce node, the first data to be merged and the second data to be merged, to obtain target connection data without performing data sorting.
5. A computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of claim 1 or 2.
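The map-side step recited in claim 1, in which each spill file holds data for several partitions ordered by partition number and the minimum value of the same partition across all files is written into that partition on disk each time, amounts to a per-partition k-way merge of sorted spill files. The sketch below is only an illustration under simplifying assumptions, not the patented implementation: spill files are modelled as in-memory vectors, "writing into the corresponding partition on the hard disk" is modelled as appending to a per-partition buffer, and `mergeSpills` and its types are hypothetical.

```scala
object SpillMergeSketch {
  import scala.collection.mutable

  // A spill file is modelled as (partitionId, value) pairs,
  // sorted by partition number and, within a partition, by value.
  type Spill = Vector[(Int, Int)]

  /** For every partition, in partition-number order, repeatedly take the minimum
    * remaining value of that partition across all spill files and append it to the
    * partition's output, producing per-partition sorted "data to be merged". */
  def mergeSpills(spills: Seq[Spill], numPartitions: Int): Map[Int, Vector[Int]] = {
    val output = Array.fill(numPartitions)(mutable.ArrayBuffer.empty[Int])
    for (p <- 0 until numPartitions) {
      // Min-heap of (value, fileIndex, offsetInFile) over the current head of partition p in each file.
      val heap = mutable.PriorityQueue.empty[(Int, Int, Int)](Ordering.by[(Int, Int, Int), Int](_._1).reverse)
      for ((spill, f) <- spills.zipWithIndex) {
        val i = spill.indexWhere(_._1 == p) // first record of partition p in this spill file
        if (i >= 0) heap.enqueue((spill(i)._2, f, i))
      }
      while (heap.nonEmpty) {
        val (value, f, i) = heap.dequeue() // the minimum value of partition p across all files
        output(p) += value                 // "write" it into the corresponding partition
        val next = i + 1
        if (next < spills(f).length && spills(f)(next)._1 == p)
          heap.enqueue((spills(f)(next)._2, f, next)) // refill from the same file
      }
    }
    output.zipWithIndex.map { case (buf, p) => p -> buf.toVector }.toMap
  }
}
```

With two spill files Vector((0,5),(0,9),(1,2)) and Vector((0,7),(1,1)), partition 0 receives 5, 7, 9 and partition 1 receives 1, 2, mirroring how the map node interleaves the minima of the same partition taken from every file.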
CN201810245691.XA 2018-03-23 2018-03-23 Data processing method and related device Active CN110309177B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810245691.XA CN110309177B (en) 2018-03-23 2018-03-23 Data processing method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810245691.XA CN110309177B (en) 2018-03-23 2018-03-23 Data processing method and related device

Publications (2)

Publication Number Publication Date
CN110309177A CN110309177A (en) 2019-10-08
CN110309177B (en) 2023-11-03

Family

ID=68073525

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810245691.XA Active CN110309177B (en) 2018-03-23 2018-03-23 Data processing method and related device

Country Status (1)

Country Link
CN (1) CN110309177B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114237510B (en) * 2021-12-17 2024-07-12 北京达佳互联信息技术有限公司 Data processing method, device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8954967B2 (en) * 2011-05-31 2015-02-10 International Business Machines Corporation Adaptive parallel data processing

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102314336A (en) * 2010-07-05 2012-01-11 深圳市腾讯计算机系统有限公司 Data processing method and system
CN102541858A (en) * 2010-12-07 2012-07-04 腾讯科技(深圳)有限公司 Data equality processing method, device and system based on mapping and protocol
CN103970604A (en) * 2013-01-31 2014-08-06 国际商业机器公司 Method and device for realizing image processing based on MapReduce framework
CN104408667A (en) * 2014-11-20 2015-03-11 深圳供电局有限公司 Method and system for comprehensively evaluating power quality
CN106372114A (en) * 2016-08-23 2017-02-01 电子科技大学 Big data-based online analytical processing system and method
CN107506388A (en) * 2017-07-27 2017-12-22 浙江工业大学 A kind of iterative data balancing optimization method towards Spark parallel computation frames

Also Published As

Publication number Publication date
CN110309177A (en) 2019-10-08

Similar Documents

Publication Publication Date Title
Gounaris et al. A methodology for spark parameter tuning
JP5298117B2 (en) Data merging in distributed computing
Jiang et al. Dimboost: Boosting gradient boosting decision tree to higher dimensions
CN107102999B (en) Correlation analysis method and device
EP2872985B1 (en) Advanced database systems and methods
US20110179100A1 (en) Parallel distributed processing method and computer system
CN107291770B (en) Mass data query method and device in distributed system
Arnaiz-González et al. MR-DIS: democratic instance selection for big data by MapReduce
Mestre et al. Adaptive sorted neighborhood blocking for entity matching with mapreduce
Han et al. Distme: A fast and elastic distributed matrix computation engine using gpus
CN108073641B (en) Method and device for querying data table
CN109218385A (en) The method and apparatus for handling data
CN110609924A (en) Method, device and equipment for calculating total quantity relation based on graph data and storage medium
CN110309177B (en) Data processing method and related device
US8667008B2 (en) Search request control apparatus and search request control method
US20140324861A1 (en) Block Partitioning For Efficient Record Processing In Parallel Computing Environment
CN108268312A (en) Method for scheduling task and scheduler
CN110048886B (en) Efficient cloud configuration selection algorithm for big data analysis task
Ding et al. Commapreduce: An improvement of mapreduce with lightweight communication mechanisms
Deng et al. Towards efficient and scalable data mining using spark
CN115061663B (en) Micro-service dividing method and device based on customer requirements, electronic equipment and medium
CN110909072A (en) Data table establishing method, device and equipment
Nguyen et al. An efficient and scalable approach for mining subgraphs in a single large graph
CN110309367B (en) Information classification method, information processing method and device
CN113342550A (en) Data processing method, system, computing device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant