CN110309177A - Data processing method and related apparatus - Google Patents
- Publication number
- CN110309177A (CN201810245691.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- combined
- processing
- file
- partition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2453—Query optimisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24553—Query execution of query operations
- G06F16/24558—Binary matching operations
- G06F16/2456—Join operations
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An embodiment of the present invention discloses a data processing method, comprising: receiving a data processing instruction; obtaining first partition data and second partition data according to the data processing instruction; sorting the first partition data by a Mapper to obtain first data to be combined, and sorting the second partition data to obtain second data to be combined; and merging the first data to be combined and the second data to be combined by a Reducer to obtain target connection data. An embodiment of the present invention also discloses a data processing apparatus. In the embodiments, data sorting is completed in the Mapper and the Reducer only needs to merge data, which reduces the data processing latency of each Reducer and improves the execution efficiency of the Join.
Description
Technical field
The present invention relates to the field of computer information processing, and in particular to a data processing method and related apparatus.
Background
Spark is an in-memory distributed computing framework for big data. Spark Structured Query Language (SQL), an important member of the Spark ecosystem, provides users with structured data processing and SQL query analysis, so that analysts in different business domains can transparently use Spark to process massive data with SQL alone.
The join (Join) is an important syntactic feature of SQL. Almost every complex data analysis scenario depends on Join, so Join has always been a focus of optimization in the database field. A common Join today is the sort merge join (Sort Merge Join). In this Join, the mapping node (Mapper) of the shuffle (Shuffle) divides the data with the same key in a resilient distributed dataset (Resilient Distributed Datasets, RDD) into the same partition and writes it to disk. The reduction node (Reducer) of the Shuffle reads the partitions with the same key from different RDDs, first sorts them by key, and then merges the sorted data.
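The Reducer-side work in this conventional scheme can be sketched as follows. This is a minimal Python illustration, not Spark code: the function names `merge_join` and `baseline_reducer` and the `(key, value)` tuple layout are assumptions made for the example.

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs that are already sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of right-hand rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left-hand row with every right-hand row of the run.
            while i < len(left) and left[i][0] == left_key:
                for _, right_value in right[j:j_end]:
                    out.append((left_key, left[i][1], right_value))
                i += 1
            j = j_end
    return out


def baseline_reducer(left_partition, right_partition):
    """Conventional scheme: the Reducer sorts both inputs, then merges them."""
    left_sorted = sorted(left_partition, key=lambda kv: kv[0])
    right_sorted = sorted(right_partition, key=lambda kv: kv[0])
    return merge_join(left_sorted, right_sorted)
```

In this sketch both `sorted` calls run inside the Reducer, so a Reducer that receives a large partition spends most of its time sorting before it can merge.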
In a real environment, the amount of data handled by each Mapper is relatively uniform, so Mappers do not need to wait for one another while processing data. The amount of data handled by each Reducer, however, varies widely: some Reducers may need to process a large amount of data while others may need to process only a small amount. The Reducers with short processing times must then wait for the Reducers with long processing times to finish sorting, which increases data processing latency and lowers the efficiency of the Join.
Summary of the invention
Embodiments of the present invention provide a data processing method and related apparatus in which data sorting is completed in the Mapper and the Reducer only needs to merge data, which reduces the data processing latency of each Reducer and improves the execution efficiency of the Join.
In view of this, a first aspect of the present invention provides a data processing method, comprising:
receiving a data processing instruction;
obtaining first partition data and second partition data according to the data processing instruction;
sorting the first partition data by a mapping node (Mapper) to obtain first data to be combined, and sorting the second partition data to obtain second data to be combined;
merging the first data to be combined and the second data to be combined by a reduction node (Reducer) to obtain target connection data.
A second aspect of the present invention provides a data processing apparatus, comprising:
a receiving module, configured to receive a data processing instruction;
an obtaining module, configured to obtain first partition data and second partition data according to the data processing instruction received by the receiving module;
a sorting module, configured to sort, by a mapping node (Mapper), the first partition data obtained by the obtaining module to obtain first data to be combined, and to sort the second partition data obtained by the obtaining module to obtain second data to be combined;
a merging module, configured to merge, by a reduction node (Reducer), the first data to be combined and the second data to be combined that were sorted by the sorting module, to obtain target connection data.
A third aspect of the present invention provides a data processing apparatus, comprising a memory, a transceiver, a processor, and a bus system;
wherein the memory is configured to store a program;
the processor is configured to execute the program in the memory, including the following steps:
receiving a data processing instruction;
obtaining first partition data and second partition data according to the data processing instruction;
sorting the first partition data by a mapping node (Mapper) to obtain first data to be combined, and sorting the second partition data to obtain second data to be combined;
merging the first data to be combined and the second data to be combined by a reduction node (Reducer) to obtain target connection data;
the bus system is configured to connect the memory and the processor so that the memory and the processor can communicate.
A fourth aspect of the present invention provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the methods of the above aspects.
As can be seen from the above technical solutions, the embodiments of the present invention have the following advantages:
An embodiment of the present invention provides a data processing method. A data processing instruction is first received; first partition data and second partition data are then obtained according to the instruction; the Mapper sorts the first partition data to obtain first data to be combined and sorts the second partition data to obtain second data to be combined; finally, the Reducer merges the first data to be combined and the second data to be combined to obtain target connection data. This approach exploits the fact that the vast majority of Spark SQL tasks have many Mappers and few Reducers: data sorting is completed in the Mapper and the Reducer only needs to merge data, which reduces the data processing latency of each Reducer and improves the execution efficiency of the Join.
Brief description of the drawings
Fig. 1 is a schematic diagram of the big data distributed computing framework Spark;
Fig. 2 is an architecture diagram of the big data distributed computing framework Spark;
Fig. 3 is an architecture diagram of a distributed system in an embodiment of the present invention;
Fig. 4 is a schematic diagram of one embodiment of the data processing method in an embodiment of the present invention;
Fig. 5 is a schematic diagram of moving data from cache to hard disk in an embodiment of the present invention;
Fig. 6 is a framework diagram of the data processing method in an application scenario of the present invention;
Fig. 7 is a schematic diagram of one embodiment of the data processing apparatus in an embodiment of the present invention;
Fig. 8 is a structural diagram of the data processing apparatus in an embodiment of the present invention.
Detailed description
Embodiments of the present invention provide a data processing method and related apparatus in which data sorting is completed in the Mapper and the Reducer only needs to merge data, which reduces the data processing latency of each Reducer and improves the execution efficiency of the Join.
The terms "first", "second", "third", "fourth", and the like (if present) in the description, the claims, and the above drawings are used to distinguish similar objects and not to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
It should be understood that the present invention is mainly applied to the in-memory big data distributed computing framework Spark. Spark originated at the Algorithms, Machines, and People Lab (AMPLab) of the University of California, Berkeley. It is now a top-level project of the Apache open-source community and has attracted the participation of major companies and developers worldwide. Spark has become the de facto standard for big data processing in industry.
When the volume of data to be processed exceeds what a single machine can handle (for example, a computer with 4 gigabytes of memory that needs to process 100 gigabytes or more of data), a Spark cluster can be used for the computation. Sometimes the data volume is not large but the computation is complex and time-consuming; in that case the computing resources of a Spark cluster can also be used to parallelize the computation. Spark is introduced below with reference to Fig. 1 and Fig. 2.
Referring to Fig. 1, Fig. 1 is a schematic diagram of the big data distributed computing framework Spark. As shown, Spark Core contains the basic functionality of Spark, in particular the application programming interfaces (Application Programming Interface, API) that define RDDs, the operations on them, and the actions on both. The other Spark libraries are all built on top of RDDs and Spark Core.
Spark SQL provides an API for interacting with Spark through the SQL of the Apache data warehouse tool Hive. Each database table is treated as an RDD, and Spark SQL queries are converted into Spark operations. Spark Streaming processes and controls real-time data streams, and allows programs to handle real-time data in the same way as ordinary RDDs. The machine learning library is a library of commonly used machine learning algorithms implemented as Spark operations on RDDs; it contains scalable learning algorithms, such as classification and regression, that need to iterate over large datasets. GraphX is a collection of algorithms and tools for manipulating graphs and for parallel graph operations and computation. GraphX extends the RDD API with operations for manipulating graphs, creating subgraphs, and accessing all vertices on a path.
Referring to Fig. 2, Fig. 2 is an architecture diagram of the big data distributed computing framework Spark. As shown, the cluster manager (Cluster Manager) is the master node in standalone mode, where it controls the entire cluster and monitors the worker nodes; in Yet Another Resource Negotiator (YARN) mode it is the resource manager. A worker node, also called a slave node, is responsible for controlling the compute node and for starting an executor or a driver. An executor is a process that runs on a worker node for a given application (Application, APP). The driver runs the main() function of the APP.
It should be understood that the data processing method in the embodiments of the present invention is mainly applied to a distributed system. Referring to Fig. 3, Fig. 3 is an architecture diagram of a distributed system in an embodiment of the present invention. In general, the number of Mappers is greater than the number of Reducers. The Mapper is mainly responsible for the data processing stage and takes the form Mapper<K, V>. A Mapper is an independent task that converts input records into intermediate records; that is, it processes input key-value pairs and outputs a set of intermediate key-value pairs. One input record may yield zero or more intermediate records after the Mapper processes it. For example, if an input record does not meet the business requirement (it lacks a specific value or contains a specific value), the Mapper can return directly and output zero records, in which case the Mapper acts as a filter.
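The "zero or more intermediate records" behaviour can be illustrated with a short Python sketch. The business rule used here (drop records whose value is empty, otherwise emit one record per word) is hypothetical and chosen only for the example.

```python
def mapper(key, value):
    """Convert one input record into zero or more intermediate (key, value) records."""
    # Hypothetical business rule: an empty value does not meet the requirement.
    if not value:
        return  # zero output records: the Mapper acts as a filter
    for word in value.split():
        yield (word, 1)
```

Calling `list(mapper(0, ""))` yields no records, while a non-empty value yields one intermediate record per word.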
The Reducer allows setup and cleanup, and can reduce a set of intermediate values that share the same key to a smaller set of values, for example by merging word counts. In the present invention, data sorting is completed in the Mapper and the Reducer is responsible only for merging data, which improves sorting efficiency and therefore Join efficiency.
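The word-count case of reducing values that share a key can be sketched as follows. The function name `reduce_counts` and the `(word, count)` record layout are assumptions for illustration only.

```python
from collections import defaultdict


def reduce_counts(intermediate):
    """Reduce the intermediate values that share a key to a single count per key."""
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)
    # Each group of values collapses to one smaller value: their sum.
    return {key: sum(values) for key, values in grouped.items()}
```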
The data processing method of the present invention is described below from the perspective of the data processing apparatus. Referring to Fig. 4, one embodiment of the data processing method in an embodiment of the present invention includes:
101. Receive a data processing instruction.
In this embodiment, the data processing apparatus may be deployed on a server or a terminal device. The data processing apparatus receives a data processing instruction, which may be initiated by a user such as a network administrator or operations and maintenance staff, or may be initiated actively by a device; this is not limited here.
102. Obtain first partition data and second partition data according to the data processing instruction.
In this embodiment, the data processing apparatus constructs the resilient distributed dataset (Resilient Distributed Datasets, RDD) corresponding to the data source according to the data processing instruction. In the present invention, the RDD is described using the example of two partitions, namely a first partition and a second partition.
On a big data analysis platform built on Spark SQL, the PreSortMergeJoin algorithm of this solution is enabled by default and is controlled by the corresponding parameter (Spark.join.shuffle.sort=true). When the parameter is false, the platform switches back to the default Join algorithm of Spark SQL (that is, the existing scheme, in which the Reducer both sorts and merges). On the product side, therefore, when the Spark SQL big data analysis platform goes online, the system administrator only needs to set this parameter in the configuration file, and users do not need to do anything.
It can be understood that PreSortMergeJoin is the name of the algorithm provided by this solution. In practice, PreSortMergeJoin may be given another name, but the algorithm itself is essentially unchanged.
103. Sort the first partition data by the mapping node (Mapper) to obtain first data to be combined, and sort the second partition data to obtain second data to be combined.
In this embodiment, the data processing apparatus sorts the first partition data by the Mapper to obtain the sorted first data to be combined. Similarly, the data processing apparatus sorts the second partition data by the Mapper to obtain the sorted second data to be combined. An example of sorting is described below with reference to Table 1 and Table 2, where Table 1 shows the data before sorting.
Table 1
User identifier | Age |
3 | 21 |
5 | 64 |
2 | 33 |
1 | 28 |
7 | 19 |
4 | 56 |
6 | 40 |
After sorting, the result shown in Table 2 is obtained.
Table 2
User identifier | Age |
1 | 28 |
2 | 33 |
3 | 21 |
4 | 56 |
5 | 64 |
6 | 40 |
7 | 19 |
It can be understood that the sorting above is based on the "user identifier". In practice, other sort orders can be chosen as required.
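The Mapper-side sort of the rows of Table 1 by user identifier can be reproduced with a few lines of Python. The function name `mapper_sort` and the `(user_identifier, age)` tuple layout are assumptions for illustration.

```python
# Rows of Table 1 as (user_identifier, age) tuples, in their original order.
table_1 = [(3, 21), (5, 64), (2, 33), (1, 28), (7, 19), (4, 56), (6, 40)]


def mapper_sort(partition, key_index=0):
    """Sort the rows of one partition by the chosen column (user identifier by default)."""
    return sorted(partition, key=lambda row: row[key_index])


sorted_by_user = mapper_sort(table_1)
```

Passing `key_index=1` sorts by age instead, which corresponds to choosing a different sort order.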
104. Merge the first data to be combined and the second data to be combined by the reduction node (Reducer) to obtain target connection data.
In this embodiment, the data processing apparatus merges the first data to be combined and the second data to be combined by the Reducer to obtain the target connection data, that is, the Join result. Assume that the first data to be combined is the result shown in Table 2 and that the second data to be combined is as shown in Table 3.
Table 3
User identifier | Occupation |
1 | Engineer |
2 | Teacher |
3 | Teacher |
4 | Engineer |
5 | Teacher |
6 | Engineer |
7 | Engineer |
The target connection data obtained is then as shown in Table 4.
Table 4
User identifier | Occupation | Age |
1 | Engineer | 28 |
2 | Teacher | 33 |
3 | Teacher | 21 |
4 | Engineer | 56 |
5 | Teacher | 64 |
6 | Engineer | 40 |
7 | Engineer | 19 |
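The Reducer-side merge that produces Table 4 can be sketched in Python as a two-pointer walk over two inputs that are already sorted by user identifier. The function name `merge_sorted` is hypothetical, and the sketch assumes each user identifier appears at most once per input, as in the tables.

```python
# Sorted inputs: ages by user identifier, and the rows of Table 3.
ages = [(1, 28), (2, 33), (3, 21), (4, 56), (5, 64), (6, 40), (7, 19)]
occupations = [(1, "Engineer"), (2, "Teacher"), (3, "Teacher"), (4, "Engineer"),
               (5, "Teacher"), (6, "Engineer"), (7, "Engineer")]


def merge_sorted(age_rows, occupation_rows):
    """Merge two key-sorted inputs with unique keys; no sorting happens here."""
    joined, i, j = [], 0, 0
    while i < len(age_rows) and j < len(occupation_rows):
        if age_rows[i][0] < occupation_rows[j][0]:
            i += 1
        elif age_rows[i][0] > occupation_rows[j][0]:
            j += 1
        else:
            # Same user identifier on both sides: emit (id, occupation, age).
            joined.append((age_rows[i][0], occupation_rows[j][1], age_rows[i][1]))
            i += 1
            j += 1
    return joined


table_4 = merge_sorted(ages, occupations)
```

Because both inputs arrive sorted, the merge makes a single pass over each input.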
An embodiment of the present invention provides a data processing method. A data processing instruction is first received; first partition data and second partition data are then obtained according to the instruction, where the first partition data and the second partition data belong to different resilient distributed datasets (RDDs); the Mapper sorts the first partition data to obtain first data to be combined and sorts the second partition data to obtain second data to be combined; finally, the Reducer merges the first data to be combined and the second data to be combined to obtain target connection data. This approach exploits the fact that the vast majority of Spark SQL tasks have many Mappers and few Reducers: data sorting is completed in the Mapper and the Reducer only needs to merge data, which reduces the data processing latency of each Reducer and improves the execution efficiency of the Join.
Optionally, on the basis of the embodiment corresponding to Fig. 4, in a first optional embodiment of the data processing method provided by an embodiment of the present invention, obtaining the first partition data and the second partition data according to the data processing instruction includes:
obtaining the first partition data and the second partition data from a resilient distributed dataset (RDD) according to the data processing instruction.
In this embodiment, the data processing apparatus obtains the first partition data and the second partition data from the same RDD according to the data processing instruction.
Specifically, a Join combines the rows of two or more partitions that have the same key. The first partition data and the second partition data here are therefore only an illustration; in practice there may be more partition data, which is not limited here.
An RDD is a fault-tolerant, parallel data structure that can explicitly store data on disk and in memory; it is essentially a read-only collection of data records. An RDD may contain multiple partitions, each of which is a fragment of the dataset. An RDD is partitioned by a partitioner, which determines how the dataset in the RDD is partitioned; for example, each data block may correspond to one partition. A partition is the object on which RDD partitioning operates.
Repartitioning an RDD is called a shuffle (Shuffle). In Spark, a Shuffle involves writing data to disk and reading it back. For a Join, the Mapper of the Shuffle divides the data with the same key in an RDD into the same partition (note: each partition may contain different keys) and writes it to disk; the Reducer of the Shuffle reads the partitions with the same key from different RDDs and then merges them.
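The partitioning rule can be sketched in Python. This is an illustration only: the modulo partitioner, the in-memory lists standing in for files on disk, and the names `shuffle_write` and `shuffle_read` are all assumptions.

```python
def shuffle_write(records, num_partitions):
    """Mapper side: place every (key, value) record in the partition chosen by its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        # Records with the same key always land in the same partition,
        # but one partition may hold several different keys.
        partitions[key % num_partitions].append((key, value))
    return partitions


def shuffle_read(partitioned_rdds, partition_id):
    """Reducer side: read the same partition of every RDD."""
    return [partitions[partition_id] for partitions in partitioned_rdds]


rdd_a = shuffle_write([(1, "a1"), (2, "a2"), (3, "a3"), (4, "a4")], 2)
rdd_b = shuffle_write([(4, "b4"), (1, "b1"), (3, "b3")], 2)
reducer_1_input = shuffle_read([rdd_a, rdd_b], 1)
```

With this partitioner, keys 1 and 3 of both datasets end up in partition 1, so a single Reducer sees every row it needs to join them.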
Second, in this embodiment of the present invention, the data processing apparatus obtains the partition data from the same RDD; that is, both the first partition data and the second partition data are obtained from this RDD. In this way, the rows in one RDD that have the same key can be combined, completing the Join, which improves the practicability and feasibility of the solution.
Optionally, on the basis of the embodiment corresponding to Fig. 4, in a second optional embodiment of the data processing method provided by an embodiment of the present invention, sorting the first partition data by the Mapper to obtain the first data to be combined may include:
sorting the first partition data by a first Mapper to obtain at least one first file, where the at least one first file belongs to the first data to be combined;
merging the at least one first file to obtain the first data to be combined;
sorting the second partition data by a second Mapper to obtain at least one second file, where the at least one second file belongs to the second data to be combined;
merging the at least one second file to obtain the second data to be combined.
In this embodiment, at the stage where the Mapper processes data, the data must be sorted and aggregated. This operation can be performed by the function joinShuffleWrite. The input of the function is the partition data of an RDD; each record is inserted into the sorter until all records have been inserted, and finally all sorted files are merged. For ease of understanding, refer to the following pseudocode:
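The following is an illustrative Python sketch of that flow rather than the original pseudocode. It assumes `(key, value)` records, uses in-memory lists as stand-ins for sorted files, and the name `join_shuffle_write` and the `buffer_limit` parameter are hypothetical.

```python
import heapq


def join_shuffle_write(partition_records, buffer_limit=2):
    """Insert every record into a sorter, then merge all sorted runs into one."""
    by_key = lambda record: record[0]
    sorted_runs = []  # each run stands in for one sorted file
    buffer = []
    for record in partition_records:
        buffer.append(record)
        if len(buffer) >= buffer_limit:
            # The buffer is full: sort it and set it aside as one run.
            sorted_runs.append(sorted(buffer, key=by_key))
            buffer = []
    if buffer:
        # Flush whatever is left after the last record.
        sorted_runs.append(sorted(buffer, key=by_key))
    # Merge all sorted runs into a single key-ordered output.
    return list(heapq.merge(*sorted_runs, key=by_key))
```

The output of this function is what the Reducer would later merge with the sorted output of the other side of the Join.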
Second, this embodiment of the present invention describes how the first Mapper sorts the first partition data to obtain at least one first file and then merges the at least one first file to obtain the first data to be combined, and how the second Mapper sorts the second partition data to obtain at least one second file and merges the at least one second file to obtain the second data to be combined. In this way, different Mappers can sort different partition data, so the Reducer does not need to sort data. This avoids the situation in which multiple Reducers wait for one another because the amounts of data they process are uneven.
Optionally, on the basis of the second embodiment corresponding to Fig. 4, in a third optional embodiment of the data processing method provided by an embodiment of the present invention, sorting the first partition data by the first Mapper to obtain at least one first file may include:
if the amount of data in memory that the first Mapper has sorted from the first partition data reaches a first preset threshold, taking the M sorted records in memory as a first file and storing the first file to the hard disk, where M is an integer greater than 0;
sorting the second partition data by the second Mapper to obtain at least one second file may include:
if the amount of data in memory that the second Mapper has sorted from the second partition data reaches a second preset threshold, taking the N sorted records in memory as a second file and storing the second file to the hard disk, where N is an integer greater than 0.
In this embodiment, when the first Mapper processes the first partition data, it first writes the first partition data into memory. If the amount of data in memory that the first Mapper has sorted from the first partition data reaches the first preset threshold, the first Mapper takes the M sorted records in memory as a first file and then sends the first file to the hard disk. Similarly, the second Mapper writes the second partition data into memory. If the amount of data in memory that the second Mapper has sorted from the second partition data reaches the second preset threshold, the second Mapper takes the N sorted records in memory as a second file and then sends the second file to the hard disk.
It should be noted that the first preset threshold and the second preset threshold are usually the same; for example, both are 30. The values of M and N are generally also the same, but they may be set as appropriate, which is not limited here.
How a data record is inserted into the sorter is explained below with reference to the following pseudocode:
Here, the insertRecord method inserts a data record into the sorter, and memoryBuffer stores records in memory. When a certain number of records is reached, the sortAndSpill method sorts the data in memoryBuffer, stores it on disk, and empties memoryBuffer.
Assume that there are four records in total, namely (1, r1), (3, r3), (4, r4), and (2, r2), where 1, 2, 3, and 4 are the keys. If memoryBuffer holds at most two records, (1, r1) and (3, r3) are sorted by key and saved as a disk file (the order of data in the file is (1, r1), (3, r3)); likewise, (4, r4) and (2, r2) are sorted by key and saved as a disk file (the order of data in the file is (2, r2), (4, r4)).
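The four-record example can be traced with a minimal Python sketch. It is an illustration under assumptions, not the original pseudocode: the class name `SpillSorter`, the JSON spill format, the temporary directory, and the helper `read_spill` are hypothetical, while `insert_record`, `sort_and_spill`, and `memory_buffer` mirror the insertRecord, sortAndSpill, and memoryBuffer names in the text.

```python
import json
import os
import tempfile


class SpillSorter:
    """Buffers records in memory and spills a sorted file when the buffer is full."""

    def __init__(self, buffer_limit, spill_dir):
        self.buffer_limit = buffer_limit
        self.spill_dir = spill_dir
        self.memory_buffer = []
        self.spill_paths = []

    def insert_record(self, key, value):
        self.memory_buffer.append((key, value))
        if len(self.memory_buffer) >= self.buffer_limit:
            self.sort_and_spill()

    def sort_and_spill(self):
        # Sort the buffered records by key, write them to disk, empty the buffer.
        self.memory_buffer.sort(key=lambda record: record[0])
        path = os.path.join(self.spill_dir, f"spill_{len(self.spill_paths)}.json")
        with open(path, "w") as spill_file:
            json.dump(self.memory_buffer, spill_file)
        self.spill_paths.append(path)
        self.memory_buffer = []


def read_spill(path):
    """Read one spill file back as a list of (key, value) tuples."""
    with open(path) as spill_file:
        return [tuple(record) for record in json.load(spill_file)]


with tempfile.TemporaryDirectory() as spill_dir:
    sorter = SpillSorter(buffer_limit=2, spill_dir=spill_dir)
    for key, value in [(1, "r1"), (3, "r3"), (4, "r4"), (2, "r2")]:
        sorter.insert_record(key, value)
    spills = [read_spill(path) for path in sorter.spill_paths]
```

With a buffer limit of two, the sketch writes two spill files whose contents match the two disk files described in the example.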
Third, in this embodiment of the present invention, if the amount of data in memory that the first Mapper has sorted from the first partition data reaches the first preset threshold, the M sorted records in memory are taken as a first file; similarly, if the amount of data in memory that the second Mapper has sorted from the second partition data reaches the second preset threshold, the N sorted records in memory are taken as a second file. In this way, when the data volume is very large, data that fills memory can be moved to the hard disk instead of being held entirely in memory. This avoids the problem that the Mapper cannot sort because memory is insufficient, which improves the feasibility and practicability of the solution.
Optionally, on the basis of the second embodiment corresponding to Fig. 4 above, in a fourth optional embodiment of the data processing method provided by the embodiments of the present invention, merging the at least one first file to obtain the first data to be combined may include:
obtaining the minimum data from each first file;
determining the first data to be combined according to the minimum data in each first file;
and merging the at least one second file to obtain the second data to be combined includes:
obtaining the minimum data from each second file;
determining the second data to be combined according to the minimum data in each second file.
In this embodiment, the first partition data is sorted in one memory and the first file is obtained after sorting, while the second partition data is sorted in another memory and the second file is obtained after sorting. Referring to Fig. 5, Fig. 5 is a schematic diagram of moving data from the cache to the hard disk in an embodiment of the present invention. As shown in the figure, assume that the first file includes file A, file B and file C, and that the current partition is partition 1. The minimum data can then be taken from partition 1 of file A, partition 1 of file B and partition 1 of file C, and written to partition 1 on the hard disk. By analogy, the minimum value in partition 1 is written to the hard disk each time, so that an ordered result is obtained.
How the sorted data is merged into one file is explained below with reference to pseudocode (the pseudocode listing is not reproduced in this text).
Here, the mergeSpills method merges all the files spilled by insertRecord. Each time, it takes from the files the minimum value belonging to the current partition and writes it to the output file. If no remaining data belongs to the current partition, it moves on to the next partition, until all data has been taken out.
Assume there are four data records in total, namely (1, r1), (3, r3), (4, r4) and (2, r2), where 1, 2, 3 and 4 are the key values. If memoryBuffer can hold at most two records, then (1, r1) and (3, r3) are sorted by key value and saved as a disk file (the order of the data in the file is (1, r1), (3, r3)). Likewise, (4, r4) and (2, r2) are sorted by key value and saved as a disk file (the order of the data in the file is (2, r2), (4, r4)). After mergeSpills, one file is generated, in which the order of the data is (1, r1), (2, r2), (3, r3), (4, r4).
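The partition-by-partition merge described above can be sketched in Python as follows. This is a minimal sketch, not the patent's own listing. It assumes each record is a hypothetical `(partition, key, value)` tuple, that each spill is a list already sorted by partition and then by key, and that lists stand in for disk files. The function name `merge_spills` mirrors mergeSpills, and `num_partitions` is an assumed parameter.

```python
def merge_spills(spills, num_partitions):
    """Merge sorted spills into one list ordered by partition, then key."""
    positions = [0] * len(spills)  # read cursor for each spill
    merged = []
    for partition in range(num_partitions):
        while True:
            best = None  # index of the spill whose head has the smallest key
            for i, spill in enumerate(spills):
                pos = positions[i]
                # Only heads that belong to the current partition compete.
                if pos < len(spill) and spill[pos][0] == partition:
                    if best is None or spill[pos][1] < spills[best][positions[best]][1]:
                        best = i
            if best is None:
                break  # nothing left for this partition: move to the next one
            merged.append(spills[best][positions[best]])
            positions[best] += 1
    return merged


spills = [
    [(0, 1, "r1"), (0, 3, "r3")],
    [(0, 2, "r2"), (0, 4, "r4")],
]
merged = merge_spills(spills, num_partitions=1)
print(merged)
```

Because every spill is already sorted, only the head of each spill needs to be compared at each step, so no further sorting pass is required.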
Again, in this embodiment of the present invention, the at least one file may be merged to obtain the data to be combined by obtaining the minimum data from each file and then obtaining the data to be combined according to the minimum data in each file. In this way, because the data in the first file and in the second file is already sorted, repeatedly taking the minimum value among the first files and among the second files also guarantees that the result is ordered, which improves the operability of the scheme.
Optionally, on the basis of Fig. 4 above and any one of the first to fourth embodiments corresponding to Fig. 4, in a fifth optional embodiment of the data processing method provided by the embodiments of the present invention, merging the first data to be combined and the second data to be combined by the Reducer to obtain the target connection data may include:
receiving, by the Reducer, the first data to be combined obtained after the first Mapper sorts the first partition data;
receiving, by the Reducer, the second data to be combined obtained after the second Mapper sorts the second partition data;
merging, by the Reducer, the first data to be combined and the second data to be combined to obtain the target connection data.
This embodiment describes the process by which the Reducer merges the first data to be combined and the second data to be combined.
Specifically, taking one Reducer as an example, the Reducer receives the first data to be combined sent from the first Mapper, where the first data to be combined is the data obtained after sorting by the first Mapper. Similarly, the Reducer can also receive the second data to be combined sent from the second Mapper, where the second data to be combined is the data obtained after sorting by the second Mapper. The Reducer merges the received first data to be combined and second data to be combined to obtain the target connection data after the Join.
Further, this embodiment of the present invention describes the process by which the Reducer merges the first data to be combined and the second data to be combined. In this way, the Reducer can directly merge multiple sorted results without sorting the data itself, which improves the efficiency of the sort operation and thus the execution efficiency of the Join.
Optionally, on the basis of the fifth embodiment corresponding to Fig. 4 above, in a sixth optional embodiment of the data processing method provided by the embodiments of the present invention, merging the first data to be combined and the second data to be combined by the Reducer to obtain the target connection data may include:
obtaining, by the Reducer, the first minimum data from the first data to be combined and the second minimum data from the second data to be combined;
comparing, by the Reducer, the first minimum data with the second minimum data, and merging the first data to be combined and the second data to be combined according to the comparison result to obtain the target connection data.
In this embodiment, when the Reducer processes the first data to be combined and the second data to be combined, it first obtains the first minimum data from the first data to be combined and the second minimum data from the second data to be combined. It then compares the first minimum data with the second minimum data and outputs the minimum values in order.
How the first data to be combined and the second data to be combined are merged is explained below with reference to pseudocode (the pseudocode listing is not reproduced in this text).
The Reduce stage merges the data from the different Mappers and feeds it into the Join operator in the form of an iterator, so that calling Iterator.getNext() returns the next data value. The logic is similar to that of the mergeSpills method in the Mapper's data processing stage and is not repeated here.
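The iterator-style merge described above can be sketched in Python as follows. This is an illustrative sketch only: the class name `MergingIterator` and the `(key, value)` record layout are assumptions, and `get_next` merely mirrors the getNext() call named in the text. It returns `None` once both inputs are drained, so records themselves must not be `None`.

```python
class MergingIterator:
    """Returns records from two key-sorted inputs in overall key order."""

    def __init__(self, first, second):
        self._inputs = [iter(first), iter(second)]
        # Smallest unread record of each input (None once that input is drained).
        self._heads = [next(it, None) for it in self._inputs]

    def get_next(self):
        """Return the next smallest record, or None when nothing is left."""
        candidates = [i for i, head in enumerate(self._heads) if head is not None]
        if not candidates:
            return None
        # Compare the current minimum of each input and take the smaller key.
        winner = min(candidates, key=lambda i: self._heads[i][0])
        record = self._heads[winner]
        self._heads[winner] = next(self._inputs[winner], None)
        return record


iterator = MergingIterator([(1, "r1"), (3, "r3")], [(2, "r2"), (4, "r4")])
merged = []
record = iterator.get_next()
while record is not None:
    merged.append(record)
    record = iterator.get_next()
print(merged)
```

Each call does a constant amount of work for two inputs, which is why the Reducer can hand records to the Join operator one at a time without a sorting pass of its own.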
Further, in this embodiment of the present invention, the Reducer may merge the first data to be combined and the second data to be combined as follows: the Reducer obtains the first minimum data from the first data to be combined and the second minimum data from the second data to be combined, compares the first minimum data with the second minimum data, and merges the first data to be combined and the second data to be combined according to the comparison result. In this way, because the first data to be combined and the second data to be combined are both already sorted, repeatedly taking the minimum value of the two also guarantees that the result is ordered, which improves the operability of the scheme.
For ease of understanding, refer to Fig. 6, which is a schematic framework diagram of the data processing method in an application scenario of the present invention. As shown in the figure, the PreSortMergeJoin algorithm has two key parts: one is sorting the data while it is partitioned in the map stage, and the other is merging the sorted data in the reduce stage.
A Mapper first sorts the partition data. Because the data volume is relatively large, the memory may not be enough to hold all the data, so the data being sorted needs to be split into batches. Assume that the memory can hold at most 100,000 records and that 1,000,000 records in total need to be sorted. The Mapper then stores each batch of 100,000 sorted records to the hard disk, until all one million records have been stored on the hard disk. The one million records stored on the hard disk are of course also sorted, because the minimum value is put onto the hard disk each time during storage, so that an ordered Mapper file is formed.
The data stored on the hard disk is not necessarily used by only one Reducer, so the data on the hard disk can be distributed to different Reducers. Similarly, the Mapper files obtained by one Reducer do not necessarily come from the same Mapper.
By merging at least one Mapper file, each Reducer can obtain the final Join result.
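Once both inputs reach the Reducer in key order, producing the Join result needs only one forward pass. The source does not spell out the Join operator itself, so the following Python sketch assumes an inner equi-join on the key over two lists of hypothetical `(key, value)` records. The function name `sort_merge_join` is likewise an assumption.

```python
def sort_merge_join(left, right):
    """Inner-join two lists of (key, value) records that are sorted by key."""
    i = j = 0
    joined = []
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1  # no match for the smaller left key: skip it
        elif left_key > right_key:
            j += 1  # no match for the smaller right key: skip it
        else:
            # Find the run of right records that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left record with that key against the whole run.
            while i < len(left) and left[i][0] == left_key:
                for _, right_value in right[j:j_end]:
                    joined.append((left_key, left[i][1], right_value))
                i += 1
            j = j_end
    return joined


left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (3, "y"), (4, "z")]
joined = sort_merge_join(left, right)
print(joined)
```

Neither input is sorted inside the function; it relies entirely on the order established earlier, which is the division of labour between Mapper and Reducer that the text describes.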
The present invention can be applied to services on a data warehouse, supporting the processing of about 1,600,000 SQL queries. The invention significantly improves Join efficiency: Join speed can be increased by up to 70%, which greatly reduces the running time of the services.
The data processing device of the present invention is described in detail below. Referring to Fig. 7, Fig. 7 is a schematic diagram of one embodiment of the data processing device in the embodiments of the present invention. The data processing device 20 includes:
a receiving module 201, configured to receive a data processing instruction;
an obtaining module 202, configured to obtain first partition data and second partition data according to the data processing instruction received by the receiving module 201;
a sorting module 203, configured to sort, by a mapping node Mapper, the first partition data obtained by the obtaining module 202 to obtain first data to be combined, and to sort the second partition data obtained by the obtaining module to obtain second data to be combined;
a merging module 204, configured to merge, by a reduction node Reducer, the first data to be combined and the second data to be combined sorted by the sorting module 203 to obtain target connection data.
In this embodiment, the receiving module 201 receives a data processing instruction, and the obtaining module 202 obtains the first partition data and the second partition data according to the data processing instruction received by the receiving module 201. The sorting module 203 sorts, by the mapping node Mapper, the first partition data obtained by the obtaining module 202 to obtain the first data to be combined, and sorts the second partition data obtained by the obtaining module to obtain the second data to be combined. The merging module 204 merges, by the reduction node Reducer, the first data to be combined and the second data to be combined sorted by the sorting module 203 to obtain the target connection data.
This embodiment of the present invention provides a data processing device. The device first receives a data processing instruction and then obtains the first partition data and the second partition data according to the data processing instruction, where the first partition data and the second partition data belong to different resilient distributed datasets (RDDs). The Mapper then sorts the first partition data to obtain the first data to be combined and sorts the second partition data to obtain the second data to be combined. Finally, the Reducer merges the first data to be combined and the second data to be combined to obtain the target connection data. In this way, the scheme exploits the fact that the overwhelming majority of Spark SQL tasks have many Mappers and few Reducers: the data sorting is completed in the Mappers, and the Reducer only needs to perform the data merging, which reduces the data processing latency of each Reducer and improves the execution efficiency of the Join.
Optionally, on the basis of the embodiment corresponding to Fig. 7 above, in another embodiment of the data processing device 20 provided by the embodiments of the present invention,
the obtaining module 202 is specifically configured to obtain the first partition data and the second partition data from a resilient distributed dataset (RDD) according to the data processing instruction.
Secondly, in this embodiment of the present invention, the data processing device needs to obtain the partition data from the same RDD, that is, it obtains the first partition data from this RDD and also obtains the second partition data from this RDD. In this way, the rows in one RDD that belong to the same key value can be combined, which completes the Join process and improves the practicability and feasibility of the scheme.
Optionally, on the basis of the embodiment corresponding to Fig. 7 above, in another embodiment of the data processing device 20 provided by the embodiments of the present invention,
the sorting module 203 is specifically configured to: sort the first partition data by the first Mapper to obtain at least one first file, where the at least one first file belongs to the first data to be combined;
merge the at least one first file to obtain the first data to be combined;
sort the second partition data by the second Mapper to obtain at least one second file, where the at least one second file belongs to the second data to be combined;
merge the at least one second file to obtain the second data to be combined.
Secondly, this embodiment of the present invention describes the process in which the first Mapper sorts the first partition data to obtain at least one first file and the at least one first file is then merged to obtain the first data to be combined, and in which the second Mapper sorts the second partition data to obtain at least one second file and the at least one second file is merged to obtain the second data to be combined. In this way, different Mappers can sort different partition data, so that the Reducer does not need to sort data, which avoids the time that multiple Reducers would otherwise spend waiting for one another to finish sorting.
Optionally, on the basis of the embodiment corresponding to Fig. 7 above, in another embodiment of the data processing device 20 provided by the embodiments of the present invention,
the sorting module 203 is specifically configured to: if the amount of first partition data that the first Mapper is sorting in memory reaches the first predetermined threshold, take the M sorted records in the memory as a first file and store the first file on the hard disk, where M is an integer greater than 0;
if the amount of second partition data that the second Mapper is sorting in memory reaches the second predetermined threshold, take the N sorted records in the memory as a second file and store the second file on the hard disk, where N is an integer greater than 0.
Again, in this embodiment of the present invention, if the amount of first partition data that the first Mapper is sorting in memory reaches the first predetermined threshold, the M sorted records in memory are taken as a first file. Similarly, if the amount of second partition data that the second Mapper is sorting in memory reaches the second predetermined threshold, the N sorted records in memory are taken as a second file. In this way, when the data volume is very large, the data that has filled the memory can be written to the hard disk, which avoids the problem that the Mapper cannot sort because of insufficient memory and thereby improves the feasibility and practicability of the scheme.
Optionally, on the basis of the embodiment corresponding to Fig. 7 above, in another embodiment of the data processing device 20 provided by the embodiments of the present invention,
the sorting module 203 is specifically configured to: obtain the minimum data from each first file;
determine the first data to be combined according to the minimum data in each first file;
obtain the minimum data from each second file;
determine the second data to be combined according to the minimum data in each second file.
Again, in this embodiment of the present invention, the at least one file may be merged to obtain the data to be combined by obtaining the minimum data from each file and then obtaining the data to be combined according to the minimum data in each file. In this way, because the data in the first file and in the second file is already sorted, repeatedly taking the minimum value among the first files and among the second files also guarantees that the result is ordered, which improves the operability of the scheme.
Optionally, on the basis of the embodiment corresponding to Fig. 7 above, in another embodiment of the data processing device 20 provided by the embodiments of the present invention,
the merging module 204 is specifically configured to: receive, by the Reducer, the first data to be combined obtained after the first Mapper sorts the first partition data;
receive, by the Reducer, the second data to be combined obtained after the second Mapper sorts the second partition data;
merge, by the Reducer, the first data to be combined and the second data to be combined to obtain the target connection data.
Further, this embodiment of the present invention describes the process by which the Reducer merges the first data to be combined and the second data to be combined. In this way, the Reducer can directly merge multiple sorted results without sorting the data itself, which improves the efficiency of the sort operation and thus the execution efficiency of the Join.
Optionally, on the basis of the embodiment corresponding to Fig. 7 above, in another embodiment of the data processing device 20 provided by the embodiments of the present invention,
the merging module 204 is specifically configured to: obtain, by the Reducer, the first minimum data from the first data to be combined and the second minimum data from the second data to be combined;
compare, by the Reducer, the first minimum data with the second minimum data, and merge the first data to be combined and the second data to be combined according to the comparison result to obtain the target connection data.
Further, in this embodiment of the present invention, the Reducer may merge the first data to be combined and the second data to be combined as follows: the Reducer obtains the first minimum data from the first data to be combined and the second minimum data from the second data to be combined, compares the first minimum data with the second minimum data, and merges the first data to be combined and the second data to be combined according to the comparison result. In this way, because the first data to be combined and the second data to be combined are both already sorted, repeatedly taking the minimum value of the two also guarantees that the result is ordered, which improves the operability of the scheme.
Fig. 8 is a schematic structural diagram of a data processing device provided by an embodiment of the present invention. The data processing device 300 may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPUs) 322 (for example, one or more processors), a memory 332, and one or more storage media 330 (for example, one or more mass storage devices) that store application programs 342 or data 344. The memory 332 and the storage medium 330 may provide transient storage or persistent storage. The program stored in the storage medium 330 may include one or more modules (not marked in the figure), and each module may include a series of instruction operations for the data processing device. Further, the central processing unit 322 may be configured to communicate with the storage medium 330 and to execute, on the data processing device 300, the series of instruction operations in the storage medium 330.
The data processing device 300 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341, such as Windows Server™, Mac OS X™, Unix™, Linux™ or FreeBSD™.
The steps performed by the data processing device in the above embodiments may be based on the data processing device structure shown in Fig. 8.
The CPU 322 is configured to perform the following steps:
receiving a data processing instruction;
obtaining first partition data and second partition data according to the data processing instruction;
sorting the first partition data by a mapping node Mapper to obtain first data to be combined, and sorting the second partition data to obtain second data to be combined;
merging the first data to be combined and the second data to be combined by a reduction node Reducer to obtain target connection data.
Optionally, the CPU 322 is specifically configured to perform the following step:
obtaining the first partition data and the second partition data from a resilient distributed dataset (RDD) according to the data processing instruction.
Optionally, the CPU 322 is specifically configured to perform the following steps:
sorting the first partition data by the first Mapper to obtain at least one first file, where the at least one first file belongs to the first data to be combined;
merging the at least one first file to obtain the first data to be combined;
sorting the second partition data by the second Mapper to obtain at least one second file, where the at least one second file belongs to the second data to be combined;
merging the at least one second file to obtain the second data to be combined.
Optionally, the CPU 322 is specifically configured to perform the following steps:
if the amount of first partition data that the first Mapper is sorting in memory reaches the first predetermined threshold, taking the M sorted records in the memory as a first file and storing the first file on the hard disk, where M is an integer greater than 0;
if the amount of second partition data that the second Mapper is sorting in memory reaches the second predetermined threshold, taking the N sorted records in the memory as a second file and storing the second file on the hard disk, where N is an integer greater than 0.
Optionally, the CPU 322 is specifically configured to perform the following steps:
obtaining the minimum data from each first file;
determining the first data to be combined according to the minimum data in each first file;
obtaining the minimum data from each second file;
determining the second data to be combined according to the minimum data in each second file.
Optionally, the CPU 322 is specifically configured to perform the following steps:
receiving, by the Reducer, the first data to be combined obtained after the first Mapper sorts the first partition data;
receiving, by the Reducer, the second data to be combined obtained after the second Mapper sorts the second partition data;
merging, by the Reducer, the first data to be combined and the second data to be combined to obtain the target connection data.
Optionally, the CPU 322 is specifically configured to perform the following steps:
obtaining, by the Reducer, the first minimum data from the first data to be combined and the second minimum data from the second data to be combined;
comparing, by the Reducer, the first minimum data with the second minimum data, and merging the first data to be combined and the second data to be combined according to the comparison result to obtain the target connection data.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided by the present invention, it should be understood that the disclosed systems, devices and methods can be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the scheme of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The above embodiments are merely intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements of some of the technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (15)
1. a kind of method of data processing characterized by comprising
Receive data processing instructions;
The first partition data and the second partition data are obtained according to the data processing instructions;
Processing is ranked up to first partition data by mapping node Mapper, obtains the first data to be combined, and right
Second partition data is ranked up processing, obtains the second data to be combined;
Place is merged to the described first data to be combined and second data to be combined by reduction node R educer
Reason obtains target connection data.
2. the method according to claim 1, wherein described obtain the first subregion according to the data processing instructions
Data and the second partition data, comprising:
First partition data and described are obtained from elasticity distribution formula data set RDD according to the data processing instructions
Two partition datas.
3. the method according to claim 1, wherein it is described by mapping node Mapper to first subregion
Data are ranked up processing, obtain the first data to be combined, comprising:
Processing is ranked up to first partition data by the first Mapper, obtains at least one first file, wherein institute
It states at least one first file and belongs to the described first data to be combined;
Processing is merged at least one described first file, obtains the described first data to be combined;
Processing is ranked up to second partition data by the 2nd Mapper, obtains at least one second file, wherein institute
It states at least one second file and belongs to the described second data to be combined;
Processing is merged at least one described second file, obtains the described second data to be combined.
4. according to the method described in claim 3, it is characterized in that, the first Mapper that passes through is to first partition data
It is ranked up processing, obtains at least one first file, comprising:
If reaching by the data volume that the first Mapper is ranked up processing to first partition data in memory
One pre-determined threshold, then using in the memory M item sequence after data as the first file, and by first file store to
In hard disk, wherein the M is the integer greater than 0;
It is described that processing is ranked up to second partition data by the 2nd Mapper, at least one second file is obtained, is wrapped
It includes:
If reaching by the data volume that the 2nd Mapper is ranked up processing to second partition data in memory
Two pre-determined thresholds, then using in the memory N item sequence after data as the second file, and by second file store to
In hard disk, wherein the N is the integer greater than 0.
5. according to the method described in claim 3, it is characterized in that, described merge place at least one described first file
Reason, obtains the described first data to be combined, comprising:
Minimum data is obtained from each first file;
The described first data to be combined are determined according to the minimum data in each first file;
It is described that processing is merged at least one described second file, obtain the described second data to be combined, comprising:
Minimum data is obtained from each second file;
The described second data to be combined are determined according to the minimum data in each second file.
6. The method according to any one of claims 1 to 5, wherein merging the first data to be combined and the second data to be combined by the reduction node Reducer to obtain the target connection data comprises:
receiving, by the Reducer, the first data to be combined obtained after the first Mapper sorts the first partition data;
receiving, by the Reducer, the second data to be combined obtained after the second Mapper sorts the second partition data;
merging, by the Reducer, the first data to be combined and the second data to be combined to obtain the target connection data.
7. The method according to claim 6, wherein merging, by the Reducer, the first data to be combined and the second data to be combined to obtain the target connection data comprises:
obtaining, by the Reducer, first minimum data from the first data to be combined and second minimum data from the second data to be combined;
comparing, by the Reducer, the first minimum data with the second minimum data, and merging the first data to be combined and the second data to be combined according to the comparison result to obtain the target connection data.
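The compare-the-minima step in claim 7 resembles the merge phase of a sort-merge join. The following sketch assumes that interpretation: records are `(key, value)` pairs already sorted by key, and matching keys are combined as an inner join. The join type, the record layout, and the name `sort_merge_join` are illustrative assumptions, not details stated in the claims.

```python
def sort_merge_join(left, right):
    """Inner-join two lists of (key, value) pairs, each sorted by key.

    At every step the smallest remaining key on each side is compared:
    the side with the smaller key advances, and equal keys emit joined
    rows of the form (key, left_value, right_value).
    """
    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1  # left minimum has no partner; skip it
        elif lkey > rkey:
            j += 1  # right minimum has no partner; skip it
        else:
            # Find the end of the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == lkey:
                for k in range(j, j_end):
                    joined.append((lkey, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return joined
```

Because both inputs arrive sorted, each side is scanned once apart from re-reading runs of duplicate keys, so the Reducer does not need to re-sort or hash the data before joining.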
8. A data processing apparatus, comprising:
a receiving module, configured to receive a data processing instruction;
an obtaining module, configured to obtain first partition data and second partition data according to the data processing instruction received by the receiving module;
a sorting module, configured to sort, by a mapping node Mapper, the first partition data obtained by the obtaining module to obtain first data to be combined, and to sort the second partition data obtained by the obtaining module to obtain second data to be combined;
a merging module, configured to merge, by a reduction node Reducer, the first data to be combined and the second data to be combined sorted by the sorting module, to obtain target connection data.
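The division of labor among the modules of claim 8 can be pictured as a small in-process pipeline. This is only a sketch under loud assumptions: the instruction is modeled as a dictionary with hypothetical keys `first_partition` and `second_partition`, each module is a plain function, and "merging" is shown as an order-preserving interleave of two sorted lists rather than a full join.

```python
import heapq


def obtain_partitions(instruction):
    """Obtaining module: pull both partitions out of the instruction.

    The dictionary keys are hypothetical; the claim does not define
    the format of a data processing instruction.
    """
    return instruction["first_partition"], instruction["second_partition"]


def mapper_sort(partition):
    """Sorting module: a Mapper sorts one partition."""
    return sorted(partition)


def reducer_merge(first_sorted, second_sorted):
    """Merging module: a Reducer merges two already-sorted inputs."""
    return list(heapq.merge(first_sorted, second_sorted))


def process(instruction):
    """Receiving module entry point: run the whole pipeline once."""
    first, second = obtain_partitions(instruction)
    return reducer_merge(mapper_sort(first), mapper_sort(second))
```

Calling `process({"first_partition": [4, 1, 3], "second_partition": [2, 5]})` runs the stages in the order the claim lists them: receive, obtain, sort, merge.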
9. The data processing apparatus according to claim 8, wherein
the sorting module is specifically configured to: sort the first partition data by a first Mapper to obtain at least one first file, wherein the at least one first file belongs to the first data to be combined;
merge the at least one first file to obtain the first data to be combined;
sort the second partition data by a second Mapper to obtain at least one second file, wherein the at least one second file belongs to the second data to be combined;
merge the at least one second file to obtain the second data to be combined.
10. The data processing apparatus according to claim 9, wherein
the sorting module is specifically configured to: if the amount of the first partition data sorted in memory by the first Mapper reaches a first predetermined threshold, take the M sorted data items in the memory as a first file and store the first file to a hard disk, wherein M is an integer greater than 0;
if the amount of the second partition data sorted in memory by the second Mapper reaches a second predetermined threshold, take the N sorted data items in the memory as a second file and store the second file to the hard disk, wherein N is an integer greater than 0.
11. The data processing apparatus according to claim 9, wherein
the sorting module is specifically configured to: obtain minimum data from each first file;
determine the first data to be combined according to the minimum data in each first file;
obtain minimum data from each second file;
determine the second data to be combined according to the minimum data in each second file.
12. A data processing apparatus, comprising: a memory, a transceiver, a processor, and a bus system;
wherein the memory is configured to store a program;
the processor is configured to execute the program in the memory to perform the following steps:
receiving a data processing instruction;
obtaining first partition data and second partition data according to the data processing instruction;
sorting the first partition data by a mapping node Mapper to obtain first data to be combined, and sorting the second partition data to obtain second data to be combined;
merging the first data to be combined and the second data to be combined by a reduction node Reducer to obtain target connection data;
the bus system is configured to connect the memory and the processor, so that the memory and the processor communicate with each other.
13. The data processing apparatus according to claim 12, wherein
the processor is specifically configured to: sort the first partition data by a first Mapper to obtain at least one first file, wherein the at least one first file belongs to the first data to be combined;
merge the at least one first file to obtain the first data to be combined;
sort the second partition data by a second Mapper to obtain at least one second file, wherein the at least one second file belongs to the second data to be combined;
merge the at least one second file to obtain the second data to be combined.
14. The data processing apparatus according to claim 13, wherein
the processor is specifically configured to: if the amount of the first partition data sorted in memory by the first Mapper reaches a first predetermined threshold, take the M sorted data items in the memory as a first file and store the first file to a hard disk, wherein M is an integer greater than 0;
if the amount of the second partition data sorted in memory by the second Mapper reaches a second predetermined threshold, take the N sorted data items in the memory as a second file and store the second file to the hard disk, wherein N is an integer greater than 0.
15. A computer-readable storage medium, comprising instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810245691.XA CN110309177B (en) | 2018-03-23 | 2018-03-23 | Data processing method and related device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810245691.XA CN110309177B (en) | 2018-03-23 | 2018-03-23 | Data processing method and related device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110309177A (en) | 2019-10-08 |
CN110309177B (en) | 2023-11-03 |
Family
ID=68073525
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810245691.XA Active CN110309177B (en) | 2018-03-23 | 2018-03-23 | Data processing method and related device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110309177B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114237510A (en) * | 2021-12-17 | 2022-03-25 | 北京达佳互联信息技术有限公司 | Data processing method and device, electronic equipment and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102314336A (en) * | 2010-07-05 | 2012-01-11 | 深圳市腾讯计算机系统有限公司 | Data processing method and system |
CN102541858A (en) * | 2010-12-07 | 2012-07-04 | 腾讯科技(深圳)有限公司 | Data equality processing method, device and system based on mapping and protocol |
US20120311581A1 (en) * | 2011-05-31 | 2012-12-06 | International Business Machines Corporation | Adaptive parallel data processing |
CN103970604A (en) * | 2013-01-31 | 2014-08-06 | 国际商业机器公司 | Method and device for realizing image processing based on MapReduce framework |
CN104408667A (en) * | 2014-11-20 | 2015-03-11 | 深圳供电局有限公司 | Method and system for comprehensively evaluating power quality |
CN106372114A (en) * | 2016-08-23 | 2017-02-01 | 电子科技大学 | Big data-based online analytical processing system and method |
CN107506388A (en) * | 2017-07-27 | 2017-12-22 | 浙江工业大学 | A kind of iterative data balancing optimization method towards Spark parallel computation frames |
Also Published As
Publication number | Publication date |
---|---|
CN110309177B (en) | 2023-11-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105446979B (en) | Data digging method and node | |
US20140358977A1 (en) | Management of Intermediate Data Spills during the Shuffle Phase of a Map-Reduce Job | |
Gu et al. | A parallel computing platform for training large scale neural networks | |
CN106462578A (en) | Method for querying and updating entries in database | |
CN111176832A (en) | Performance optimization and parameter configuration method based on memory computing framework Spark | |
CN106547882A (en) | A kind of real-time processing method and system of big data of marketing in intelligent grid | |
Yazdani et al. | Evolutionary algorithms for multi-objective dual-resource constrained flexible job-shop scheduling problem | |
CN109241093A (en) | A kind of method of data query, relevant apparatus and Database Systems | |
CN106126601A (en) | A kind of social security distributed preprocess method of big data and system | |
CN106209989A (en) | Spatial data concurrent computational system based on spark platform and method thereof | |
CN108073696B (en) | GIS application method based on distributed memory database | |
CN107943963A (en) | Mass data distributed rule engine operation system based on cloud platform | |
CN104915717A (en) | Data processing method, knowledge base reasoning method and related device | |
CN103631922A (en) | Hadoop cluster-based large-scale Web information extraction method and system | |
CN109828790A (en) | A kind of data processing method and system based on Shen prestige isomery many-core processor | |
CN109885651A (en) | A kind of question pushing method and device | |
CN108287889A (en) | A kind of multi-source heterogeneous date storage method and system based on elastic table model | |
CN107066328A (en) | The construction method of large-scale data processing platform | |
CN107193940A (en) | Big data method for optimization analysis | |
CN106462591A (en) | Partition filtering using smart index in memory | |
CN108073641A (en) | The method and apparatus for inquiring about tables of data | |
CN107871055A (en) | A kind of data analysing method and device | |
CN110309177A (en) | A kind of method and relevant apparatus of data processing | |
CN108182243A (en) | A kind of Distributed evolutionary island model parallel method based on Spark | |
Gu et al. | Characterizing job-task dependency in cloud workloads using graph learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||