CN108415912B - Data processing method and device based on MapReduce model - Google Patents

Data processing method and device based on MapReduce model

Info

Publication number
CN108415912B
CN108415912B (application CN201710072197.3A)
Authority
CN
China
Prior art keywords
key value
data
sorting
local
sequence
Prior art date
Legal status
Active
Application number
CN201710072197.3A
Other languages
Chinese (zh)
Other versions
CN108415912A (en
Inventor
路璐
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201710072197.3A priority Critical patent/CN108415912B/en
Publication of CN108415912A publication Critical patent/CN108415912A/en
Application granted granted Critical
Publication of CN108415912B publication Critical patent/CN108415912B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 - Indexing; Data structures therefor; Storage structures
    • G06F16/2228 - Indexing structures
    • G06F16/2272 - Management thereof
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 - File systems; File servers
    • G06F16/18 - File system types
    • G06F16/182 - Distributed file systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 - Querying
    • G06F16/245 - Query processing
    • G06F16/2455 - Query execution
    • G06F16/24553 - Query execution of query operations
    • G06F16/24554 - Unary operations; Data partitioning operations

Abstract

A data processing method and device based on a MapReduce model are disclosed. In the process of implementing an SQL instruction with the MapReduce model, the method determines whether local reduction is performed on the SQL instruction at the Map end, where local reduction uses the repetitiveness of data to reduce the data volume transferred between the Map end and the Reduce end; if local reduction is performed, the data after local reduction is sorted with a first sorting algorithm; if local reduction is not performed, the data output by the Mapper is sorted with a second sorting algorithm. The method can thus match a suitable sorting algorithm to the actual instruction, improving the efficiency of data sorting and avoiding the drawbacks of a single sorting algorithm.

Description

Data processing method and device based on MapReduce model
Technical Field
The application relates to a distributed file system, in particular to a data processing method and device based on a MapReduce model.
Background
A distributed file system is a file system in which the physical storage resources it manages are not necessarily attached directly to the local node but are connected to the nodes through a computer network. The design of a distributed file system is based on the client/server model.
In a large-scale distributed file system, in order to process data in a distributed computing manner, Structured Query Language (SQL) instructions are usually converted into a MapReduce-style form for processing. MapReduce is a programming model for large-scale parallel computing proposed by Google. The MapReduce model can perform parallel computation over large data sets (greater than 1 TB) by distributing the massive operations on a data set across many nodes, and is therefore widely used by distributed file systems.
When an SQL instruction is implemented with the MapReduce model, each node first sorts its data locally at the Map (mapping) end, and a full sort is then performed at the Reduce (reduction) end based on the local sorting results of the nodes. The sorting performance therefore directly affects the execution efficiency of the distributed system, and improving sorting efficiency in the MapReduce model is a pressing problem for those skilled in the art.
Disclosure of Invention
The main object of the invention is to provide a method that addresses the sorting-efficiency problem in the MapReduce model.
An embodiment of the present application provides a data processing method based on the MapReduce model, comprising: in the process of implementing an SQL instruction with the MapReduce model, determining whether local reduction is performed on the SQL instruction at the Map end, where local reduction uses the repetitiveness of data to reduce the data volume transferred between the Map end and the Reduce end; if local reduction is performed, sorting the data after local reduction with a first sorting algorithm; and if local reduction is not performed, sorting the data output by the Mapper with a second sorting algorithm.
Another embodiment of the present application provides a data processing device based on the MapReduce model, the device comprising: a determining module, configured to determine, in the process of implementing an SQL instruction with the MapReduce model, whether local reduction is performed on the SQL instruction at the Map end, where local reduction uses the repetitiveness of data to reduce the data volume transferred between the Map end and the Reduce end; a first sorting module, configured to sort the data after local reduction with a first sorting algorithm if local reduction is performed; and a second sorting module, configured to sort the data output by the Mapper with a second sorting algorithm if local reduction is not performed.
With this technical solution, different sorting algorithms can be switched automatically by a conditional check; more specifically, the sorting algorithm used at the Map end is chosen according to whether the SQL instruction performs local reduction at the Map end. A suitable sorting algorithm can thus be matched to the actual instruction, which improves the efficiency of data sorting and avoids the drawbacks of relying on a single sorting algorithm.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 shows a schematic diagram of a prior art MapReduce model-based distributed data processing system;
FIG. 2A shows a schematic diagram of a sort operation performed in the MapReduce model;
FIG. 2B shows another schematic diagram of a sort operation in the MapReduce model;
FIG. 3 illustrates a flowchart of a MapReduce model-based data processing method according to an exemplary embodiment of the present invention;
FIG. 4 illustrates a block diagram of a MapReduce model-based data processing apparatus according to an exemplary embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 shows a schematic diagram of a conventional MapReduce model-based distributed data processing system. For clarity of description, the terms involved are first explained.
An instruction is executable by a MapReduce compilation tool. Each instruction contains the data it operates on, or the storage address of that data in the distributed file system.
A job is executable by the MapReduce software framework. The MapReduce compilation tool compiles an instruction into one or more jobs.
A task is executable by the MapReduce software framework. The MapReduce software framework breaks a job up into multiple tasks.
At present, processing of large-scale data is divided into two stages, a Map stage and a Reduce stage; the Reduce stage is entered directly regardless of how many intermediate results the Map stage outputs. To reduce the amount of data fed into the Reduce end, the data output by the Mapper is usually processed by a combiner. For example, the intermediate results <key2, value2> produced by the Mapper are merged according to a predetermined rule, such as merging all intermediate results with the same key value into a single intermediate result, and only the merged results are output to the Reducer, which reduces the amount of intermediate data each Mapper sends to the Reducer.
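To make the combiner concrete, here is a minimal sketch in Java against the standard Hadoop MapReduce API (the class name LocalSumCombiner and the sum-style merge rule are illustrative assumptions, not taken from the patent):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch only: a combiner that merges all <key2, value2> pairs with the same
// key produced by one Mapper into a single <key2, sum(value2)> pair, so that
// less intermediate data is shuffled to the Reduce end.
public class LocalSumCombiner
        extends Reducer<Text, LongWritable, Text, LongWritable> {

    private final LongWritable merged = new LongWritable();

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get();          // fold duplicate values for this key
        }
        merged.set(sum);
        context.write(key, merged);  // one merged intermediate result per key
    }
}
// Registered on the job (illustrative): job.setCombinerClass(LocalSumCombiner.class);
```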
A MapReduce model-based distributed data processing system including a container will be described in detail with reference to FIG. 1. Referring to FIG. 1, the system includes a client, a job server (JobTracker), and task servers (TaskTrackers). The client submits a job through a client program and the JobTracker, and the JobTracker coordinates the job. The Map phase, identified in FIG. 1 as M1, M2 and M3, is performed first, followed by the Reduce phase, identified in FIG. 1 as R1 and R2. The data processing operations executed in the Map phase and the Reduce phase are monitored by the TaskTracker and run in processes independent of the TaskTracker process.
Specifically, the client submits a job, i.e., inputs data, through the client program and the JobTracker. In the Map phase, the input data is divided into five non-overlapping initial partitions, input part 1 to input part 5, which are processed by five Mappers. TaskTracker1 and TaskTracker3 each contain two Mappers and TaskTracker2 contains one Mapper: the Mappers in TaskTracker1 process initial partitions 1 and 4, the Mappers in TaskTracker3 process initial partitions 3 and 5, and the Mapper in TaskTracker2 processes initial partition 2. The data input to a Mapper is in the <key, value> format, referred to as key1 and value1 in the following description.
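For concreteness, a minimal Mapper sketch in the Hadoop MapReduce API is given below (the class name, the tokenizing logic, and the choice of key1/value1 as byte offset and text line are illustrative assumptions, not prescribed by the patent):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: a Mapper turning <key1, value1> (here: byte offset, text line) into
// intermediate <key2, value2> pairs (here: token, 1), the format that is later
// combined, partitioned and shuffled to the Reduce end.
public class TokenCountMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);
    private final Text key2 = new Text();

    @Override
    protected void map(LongWritable key1, Text value1, Context context)
            throws IOException, InterruptedException {
        for (String token : value1.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            key2.set(token);
            context.write(key2, ONE);   // emit <key2, value2>
        }
    }
}
```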
After the Mapper processes key1 and value1, intermediate results (intermediate data) in the <key, value> format, referred to as key2 and value2 in the following description, are generated and stored in random access memory (RAM). Optionally, the TaskTracker may combine the intermediate data stored in RAM. The Reduce partition of each intermediate result is determined by calling a partition function on it, and the intermediate result is written to the buffer of the corresponding Reduce partition, such as Region 1 and Region 2 in FIG. 1.
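The partition step can be sketched as follows, assuming Hadoop's usual hash-based convention (the class name RegionPartitioner is illustrative):

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch: assigns each intermediate <key2, value2> to a Reduce partition
// (Region 1, Region 2, ... in FIG. 1) by hashing the key, so all records with
// the same key land in the same partition and therefore the same Reducer.
public class RegionPartitioner extends Partitioner<Text, LongWritable> {
    @Override
    public int getPartition(Text key, LongWritable value, int numReducePartitions) {
        // Mask the sign bit so the result stays in [0, numReducePartitions).
        return (key.hashCode() & Integer.MAX_VALUE) % numReducePartitions;
    }
}
```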
The example shown in FIG. 1 includes two Reducers; the specified number of Reduce partitions is accordingly 2, with one Reducer per Reduce partition.
After the Map-stage processing flow completes, the Reduce stage is entered. The Reduce stage comprises three steps: shuffle, sort, and reduce, where shuffle denotes the data interaction between the Map end and the Reduce end. The shuffle and sort steps classify the intermediate results output by the Map stage and deliver each class of intermediate results to one Reduce task. For example, intermediate results with the same key value generated by multiple Mappers are distributed across different devices; through shuffling and sorting, all intermediate results with that key value are sent to the device hosting the Reducer responsible for that key. In FIG. 1, <key2, value2> pairs with the same key value coming from different Mappers are merged into <key2, list of value2> as the input of a Reducer, and the Reducer processes <key2, list of value2> to form the final result <key3, value3>.
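The grouping performed by shuffle and sort can be pictured with this self-contained Java sketch using plain collections rather than the Hadoop runtime (the sample keys and values are invented for illustration):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: emulates shuffle + sort on one Reduce partition. Pairs with the
// same key (possibly produced on different devices) are collected into one
// <key2, list of value2> entry, ordered by key, ready for the Reduce step.
public class ShuffleSketch {
    public static void main(String[] args) {
        List<SimpleEntry<String, Integer>> fromMappers = List.of(
                new SimpleEntry<>("b", 1), new SimpleEntry<>("a", 2),
                new SimpleEntry<>("b", 3), new SimpleEntry<>("a", 4));

        Map<String, List<Integer>> grouped = new TreeMap<>();   // ordered by key
        for (SimpleEntry<String, Integer> kv : fromMappers) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());
        }
        // {a=[2, 4], b=[1, 3]} -> each entry is the input of one Reducer call
        grouped.forEach((key2, values2) -> System.out.println(key2 + " -> " + values2));
    }
}
```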
In the above example, the TaskTracker that performs the Map-phase processing and the TaskTracker that performs the Reduce-phase processing may be the same TaskTracker or different TaskTrackers.
The sorting operations used in the MapReduce model will now be described in detail with reference to FIG. 2A and FIG. 2B. It should be noted that an existing MapReduce implementation generally adopts either the sorting method shown in FIG. 2A or the one shown in FIG. 2B; for example, Spark uses the sorting method of FIG. 2A, while Hive uses the index sorting method of FIG. 2B.
FIG. 2A illustrates the standard ordering in the MapReduce model. In S210, the key value of the first datum in the input data is taken as the ordered sequence and the key values of the remaining data are treated as the unordered sequence; selecting the key value of the first datum as the ordered sequence is merely exemplary, and another key value may be chosen instead. Subsequently, in S220, the key value of the ordered sequence is compared with each key value in the unordered sequence. If the key value of the ordered sequence is larger than a key value in the unordered sequence, that key value is placed behind the key value of the ordered sequence and the comparison continues with the next key value in the unordered sequence; if the key value of the ordered sequence is smaller than the key value in the unordered sequence, that key value is placed in front of the key value of the ordered sequence. These operations are performed in turn until the input data is fully sorted.
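A minimal Java sketch of this comparison-and-insert procedure is shown below (illustrative only; the patent does not prescribe an implementation). Keys are taken from the unordered part one at a time and inserted into the ordered sequence, producing the descending order used in the claims:

```java
import java.util.Arrays;

// Sketch: the insertion-style ordering of FIG. 2A. The first key is treated
// as the (one-element) ordered sequence; every remaining key is compared
// against the ordered sequence and placed in front of the first key that is
// smaller than it, producing a descending sequence.
public class StandardSortSketch {
    static void sort(long[] keys) {
        for (int i = 1; i < keys.length; i++) {        // keys[0..i-1] is the ordered sequence
            long unorderedKey = keys[i];
            int j = i - 1;
            // Keys of the ordered sequence that are smaller move back one slot.
            while (j >= 0 && keys[j] < unorderedKey) {
                keys[j + 1] = keys[j];
                j--;
            }
            keys[j + 1] = unorderedKey;                // insert in front of smaller keys
        }
    }

    public static void main(String[] args) {
        long[] keys = {3, 7, 7, 1, 5};
        sort(keys);
        System.out.println(Arrays.toString(keys));     // [7, 7, 5, 3, 1]
    }
}
```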
FIG. 2B illustrates index sorting (IndexSort) in the MapReduce model. A reference key value is first determined, either in advance or at random. The input data is then classified by comparing each key value against the reference key value, splitting the data into a set smaller than the reference key value, a set equal to it, and a set greater than it. Each resulting set is then classified in the same way; for example, for the set greater than the reference key value, a new, larger reference key value can be chosen and the data in that set re-classified against it. In this way all the data is divided into sets arranged by size, which sorts the data.
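The classification of FIG. 2B can be sketched as a recursive three-way partition (again illustrative rather than the patent's implementation); it benefits from highly repetitive keys because every key equal to the reference key value is settled in a single pass:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of FIG. 2B: split the keys into three sets relative to a reference
// key (here simply the first key), recurse on the "greater" and "smaller"
// sets, and concatenate greater + equal + smaller for a descending result.
public class IndexSortSketch {
    static List<Long> sort(List<Long> keys) {
        if (keys.size() <= 1) {
            return new ArrayList<>(keys);
        }
        long reference = keys.get(0);                  // reference key value
        List<Long> greater = new ArrayList<>();
        List<Long> equal = new ArrayList<>();
        List<Long> smaller = new ArrayList<>();
        for (long k : keys) {                          // one pass classifies every key
            if (k > reference) greater.add(k);
            else if (k == reference) equal.add(k);
            else smaller.add(k);
        }
        List<Long> result = sort(greater);             // re-classify each sub-set
        result.addAll(equal);                          // duplicates need no further work
        result.addAll(sort(smaller));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sort(List.of(4L, 9L, 4L, 4L, 2L, 9L)));  // [9, 9, 4, 4, 4, 2]
    }
}
```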
It can be seen that the insertion-style ordering of FIG. 2A suits smaller data sets, whereas index sorting, because it first classifies the input data, achieves better efficiency when the data is highly repetitive. The method of the exemplary embodiments of the present invention, which selects the sorting algorithm according to specific conditions, can enjoy the advantages of multiple algorithms while avoiding the disadvantages of a single sorting algorithm. This is explained in detail below with reference to FIG. 3, which illustrates a flowchart of a data processing method based on the MapReduce model according to an exemplary embodiment of the present invention.
In a large-scale distributed data processing system, data can be obtained from various databases, including MySQL, HBase, and ODPS. MySQL is an open-source relational database; HBase is a distributed storage system for unstructured data; and ODPS (Open Data Processing Service) is a data storage and analysis platform built on a cloud computing platform with fully proprietary intellectual property of the Alibaba Group, suitable for offline processing of massive data (TB/PB scale) with low real-time requirements.
The data extracted from these databases may then be processed with SQL (Structured Query Language); for example, data from the MySQL, HBase, and ODPS databases can be processed using SQL instructions. SQL instructions include a distribution instruction (distribute by), a sort instruction (sort by), a grouping instruction (group by), an inner join instruction (inner join), an outer join instruction (outer join), and the like.
In step S310, in the process of implementing the SQL instruction with the MapReduce model, it is determined whether the SQL instruction performs local reduction at the Map end, where local reduction uses the repetitiveness of data to reduce the data volume transferred between the Map end and the Reduce end. For example, local reduction can be performed at the Map end with a combiner. As described above, the combiner is a common optimization of the MapReduce model: <key2, list of value2> produced by the Mapper function is taken as the input of the combiner function, and the combiner's output is then fed to the Reduce end. It should be noted that the combiner function may be customized by the user as desired.
As shown in FIG. 3, if it is determined that local reduction is performed, the data after local reduction is sorted with the first sorting algorithm. According to an exemplary embodiment, when local reduction is determined, the data output by the Mapper is first subjected to local reduction so that duplicate data is reduced, and the reduced data is then sorted with the first sorting algorithm; local reduction itself was described in detail above and is not repeated here. If it is determined that local reduction is not performed, the data output by the Mapper is sorted with a second sorting algorithm in step S330. It should be noted that the first sorting algorithm differs from the second sorting algorithm. Because local reduction exploits the repetitiveness of the data to reduce the data volume between the Map end and the Reduce end, the first sorting algorithm is one that performs well on data with low repetitiveness, whereas the second sorting algorithm is one that achieves high efficiency by exploiting high repetitiveness in the data.
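The branch taken at the Map end can be summarized by the following hypothetical Java driver; the names applyLocalReduction, firstSort, and secondSort are placeholders invented here, the real local reduction being the combiner and the real sorters being the algorithms of FIG. 2A and FIG. 2B sketched above:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;

// Sketch only: hypothetical Map-end driver tying the two branches together.
// None of these method names come from the patent.
public class MapSideSortSelector {

    static List<Long> process(List<Long> mapperKeys, boolean localReductionConfigured) {
        if (localReductionConfigured) {
            // Local reduction first, then the first sorting algorithm,
            // which suits the now low-repetition data.
            return firstSort(applyLocalReduction(mapperKeys));
        }
        // No local reduction: the second algorithm exploits the high
        // repetition of the raw Mapper output.
        return secondSort(mapperKeys);
    }

    static List<Long> applyLocalReduction(List<Long> keys) {
        return new ArrayList<>(new LinkedHashSet<>(keys));   // placeholder: merge duplicate keys
    }

    static List<Long> firstSort(List<Long> keys) {
        List<Long> out = new ArrayList<>(keys);
        out.sort(Comparator.reverseOrder());                 // descending, as in claim 5
        return out;
    }

    static List<Long> secondSort(List<Long> keys) {
        List<Long> out = new ArrayList<>(keys);
        out.sort(Comparator.reverseOrder());                 // stands in for the index sort of FIG. 2B
        return out;
    }
}
```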
For example, the first sorting algorithm comprises a quick sorting algorithm and the second sorting algorithm comprises an index sorting algorithm. Specifically, in the case where the first sorting algorithm is a quick sorting algorithm, sorting the data after local reduction with the first sorting algorithm includes: selecting any key value from the data after local reduction as the key value of the ordered sequence, with the other key values as the key values of the unordered sequence; comparing the key value of the ordered sequence with each key value of the unordered sequence; if the key value of the ordered sequence is greater than a key value of the unordered sequence, placing that key value behind the key value of the ordered sequence; and if the key value of the ordered sequence is smaller than a key value of the unordered sequence, placing that key value in front of the key value of the ordered sequence.
In the case where the second sorting algorithm is an index sorting algorithm, sorting at the Map end with the second sorting algorithm includes: determining a reference key value; and classifying the key values in the data output by the Mapper into three sets, one smaller than the reference key value, one equal to it, and one greater than it, by comparing each key value with the reference key value. New reference key values may then be set for the smaller-than and greater-than sets respectively, and the key values within each of these sets are compared with the corresponding new reference key value, further dividing the two sets; in this way the data output by the Mapper is sorted by key value.
As described above, the MapReduce-model-based data processing method of the present invention can switch between different sorting algorithms automatically through a conditional check; more specifically, it determines the Map-end sorting algorithm by determining whether the SQL instruction performs local reduction at the Map end, so that an appropriate sorting algorithm is matched to the actual instruction, the efficiency of data sorting is improved, and the drawbacks of a single sorting algorithm are avoided.
FIG. 4 illustrates a block diagram of a MapReduce-model-based data processing apparatus according to an exemplary embodiment of the present invention. As shown in FIG. 4, the data processing apparatus 400 includes a determining module 410, a first sorting module 420, and a second sorting module 430. Those of ordinary skill in the art will understand that FIG. 4 shows only the components related to the present exemplary embodiment, and that the apparatus may include general-purpose components beyond those shown in FIG. 4.
The determining module 410 determines, in the process of implementing the SQL instruction with the MapReduce model, whether the SQL instruction performs local reduction at the Map end, where local reduction reduces the data volume between the Map end and the Reduce end by exploiting the repetitiveness of the data. For example, a combiner can be used to perform local reduction at the Map end. As described above, the combiner is a common optimization of the MapReduce model: <key2, list of value2> produced by the Mapper function is taken as the input of the combiner function, and the combiner's output is then fed to the Reduce end. It should be noted that the combiner function may be customized by the user as desired.
If the determining module 410 determines that local reduction is performed, the first sorting module 420 sorts the data after local reduction with the first sorting algorithm. According to an exemplary embodiment, the apparatus 400 may further include a local reduction processing module (not shown); if the determining module 410 determines that local reduction is performed, the local reduction processing module applies local reduction to the data output by the Mapper and sends the reduced data to the first sorting module. If the determining module 410 determines that local reduction is not performed, the second sorting module 430 sorts the data output by the Mapper with the second sorting algorithm. It should be noted that the first sorting algorithm differs from the second sorting algorithm. Because local reduction exploits the repetitiveness of the data to reduce the data volume between the Map end and the Reduce end, the first sorting algorithm is one that performs well on data with low repetitiveness, whereas the second sorting algorithm is one that achieves high efficiency by exploiting high repetitiveness in the data.
For example, the first sorting algorithm comprises a quick sorting algorithm and the second sorting algorithm comprises an index sorting algorithm. Specifically, in the case where the first sorting algorithm is a quick sorting algorithm, the first sorting module 420 selects any key value from the data after local reduction as the key value of the ordered sequence, with the other key values as the key values of the unordered sequence; compares the key value of the ordered sequence with each key value of the unordered sequence; places a key value of the unordered sequence behind the key value of the ordered sequence if the key value of the ordered sequence is greater than it; and places a key value of the unordered sequence in front of the key value of the ordered sequence if the key value of the ordered sequence is smaller than it.
In the case where the second sorting algorithm is an index sorting algorithm, the second sorting module 430 determines a reference key value and classifies the key values in the data output by the Mapper into three sets, one smaller than the reference key value, one equal to it, and one greater than it, by comparing each key value with the reference key value. New reference key values may then be set for the smaller-than and greater-than sets respectively, and the key values within each of these sets are compared with the corresponding new reference key value, further dividing the two sets; in this way the data output by the Mapper is sorted by key value.
As described above, the MapReduce-model-based data processing device of the present invention can switch between different sorting algorithms automatically through a conditional check; more specifically, it determines the Map-end sorting algorithm by determining whether the SQL instruction performs local reduction at the Map end, so that an appropriate sorting algorithm is matched to the actual instruction, the efficiency of data sorting is improved, and the drawbacks of a single sorting algorithm are avoided.
The systems, devices, modules, or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product having certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as divided into various units by function, each described separately. Of course, when implementing the present application, the functionality of the units may be implemented in one or more pieces of software and/or hardware.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (14)

1. A data processing method based on a MapReduce model, characterized by comprising the following steps:
in the process of realizing the SQL instruction by using a MapReduce model, determining whether the SQL instruction is subjected to local reduction at a Map end, wherein the local reduction reduces the data volume between the Map end and a Reduce end by using the repeatability of data;
if it is determined that local reduction is performed, sorting the data subjected to local reduction processing by using a first sorting algorithm;
and if the local reduction is determined not to be carried out, sorting the data output by the Mapper by using a second sorting algorithm different from the first sorting algorithm.
2. The method of claim 1, wherein the first ranking algorithm comprises a quick ranking algorithm.
3. The method of claim 1, wherein the second sorting algorithm comprises an index sorting algorithm.
4. The method of claim 1, wherein sorting the local reduction processed data using a first sorting algorithm comprises:
performing local reduction processing on the data output by the Mapper;
and sorting the data after local reduction processing according to the first sorting algorithm.
5. The method of claim 2, wherein sorting the local reduction processed data using the first sorting algorithm comprises:
selecting any key value from the data after the local reduction processing as the key value of the ordered sequence, and taking other key values as the key values of the unordered sequence;
comparing the key values of the ordered sequences with the key values of the unordered sequences, respectively;
if the key value of the ordered sequence is greater than the key value of the unordered sequence, ranking the key value of the unordered sequence behind the key value of the ordered sequence;
and if the key value of the ordered sequence is smaller than the key value of the unordered sequence, arranging the key value of the unordered sequence in front of the key value of the ordered sequence.
6. The method of claim 3, wherein sorting the Mapper output data using a second sorting algorithm comprises:
determining a reference key value;
classifying the key values in the data output by the Mapper into three sets, namely a set smaller than the reference key value, a set equal to the reference key value, and a set larger than the reference key value, by comparing each key value with the reference key value;
and setting new reference key values for the set smaller than the reference key value and the set larger than the reference key value respectively, and comparing the key values in each of these sets with the corresponding new reference key value so as to further divide the two sets, thereby sorting the data output by the Mapper from large to small in the form of sets.
7. The method of any of claims 1 to 6, wherein the local reduction is performed using Combiner.
8. A data processing device based on a MapReduce model is characterized by comprising:
a determining module, configured to determine, in the process of implementing an SQL instruction by using the MapReduce model, whether the SQL instruction is subjected to local reduction at a Map end, wherein the local reduction reduces the data volume between the Map end and a Reduce end by using the repeatability of data;
a first sorting module, configured to sort the data subjected to local reduction processing by using a first sorting algorithm if it is determined that local reduction is performed;
and a second sorting module, configured to sort the data output by the Mapper by using a second sorting algorithm different from the first sorting algorithm if it is determined that local reduction is not performed.
9. The apparatus of claim 8, wherein the first ordering algorithm comprises a quick ordering algorithm.
10. The apparatus of claim 8, wherein the second ordering algorithm comprises an index ordering algorithm.
11. The apparatus of claim 8, further comprising: a local reduction processing module, configured to perform local reduction processing on the data output by the Mapper after the determining module determines that local reduction is to be performed, and to send the data after local reduction to the first sorting module.
12. The apparatus of claim 9, wherein the first sorting module selects any key value from the data after the local reduction processing as a key value of the ordered sequence, and takes other key values as key values of the unordered sequence; comparing the key values of the ordered sequences with the key values of the unordered sequences, respectively; if the key value of the ordered sequence is greater than the key value of the unordered sequence, ranking the key value of the unordered sequence behind the key value of the ordered sequence; and if the key value of the ordered sequence is smaller than the key value of the unordered sequence, arranging the key value of the unordered sequence in front of the key value of the ordered sequence.
13. The apparatus of claim 10, wherein the second sorting module determines a reference key value; classifies the key values in the data output by the Mapper into three sets, namely a set smaller than the reference key value, a set equal to the reference key value, and a set larger than the reference key value, by comparing each key value with the reference key value; and sets new reference key values for the set smaller than the reference key value and the set larger than the reference key value respectively, and compares the key values in each of these sets with the corresponding new reference key value so as to further divide the two sets, thereby sorting the data output by the Mapper from large to small in the form of sets.
14. The device of any of claims 8 to 13, wherein the local reduction is performed using Combiner.
CN201710072197.3A 2017-02-09 2017-02-09 Data processing method and device based on MapReduce model Active CN108415912B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710072197.3A CN108415912B (en) 2017-02-09 2017-02-09 Data processing method and device based on MapReduce model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710072197.3A CN108415912B (en) 2017-02-09 2017-02-09 Data processing method and device based on MapReduce model

Publications (2)

Publication Number Publication Date
CN108415912A CN108415912A (en) 2018-08-17
CN108415912B true CN108415912B (en) 2021-11-09

Family

ID=63124821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710072197.3A Active CN108415912B (en) 2017-02-09 2017-02-09 Data processing method and device based on MapReduce model

Country Status (1)

Country Link
CN (1) CN108415912B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008382B (en) * 2018-12-26 2023-06-16 创新先进技术有限公司 Method, system and equipment for determining TopN data
CN110110170B (en) * 2019-04-30 2021-12-07 北京字节跳动网络技术有限公司 Data processing method, device, medium and electronic equipment
CN110222105B (en) * 2019-05-14 2021-06-29 联动优势科技有限公司 Data summarization processing method and device
CN110471935B (en) * 2019-08-15 2022-02-18 上海达梦数据库有限公司 Data operation execution method, device, equipment and storage medium
CN111930731A (en) * 2020-07-28 2020-11-13 苏州亿歌网络科技有限公司 Data dump method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101593180A (en) * 2008-05-30 2009-12-02 国际商业机器公司 Method and apparatus for converting SPARQL queries into SQL queries
CN102467570A (en) * 2010-11-17 2012-05-23 日电(中国)有限公司 Connection query system and method for distributed data warehouse
CN102541858A (en) * 2010-12-07 2012-07-04 腾讯科技(深圳)有限公司 Data equality processing method, device and system based on mapping and protocol

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102314336B (en) * 2010-07-05 2016-04-13 深圳市腾讯计算机系统有限公司 A kind of data processing method and system
US9798831B2 (en) * 2011-04-01 2017-10-24 Google Inc. Processing data in a MapReduce framework
CN102799622B (en) * 2012-06-19 2015-07-15 北京大学 Distributed structured query language (SQL) query method based on MapReduce expansion framework
CN103678609B (en) * 2013-12-16 2017-05-17 中国科学院计算机网络信息中心 Large data inquiring method based on distribution relation-object mapping processing
CN104391748A (en) * 2014-11-21 2015-03-04 浪潮电子信息产业股份有限公司 Mapreduce computation process optimization method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101593180A (en) * 2008-05-30 2009-12-02 国际商业机器公司 Method and apparatus for converting SPARQL queries into SQL queries
CN102467570A (en) * 2010-11-17 2012-05-23 日电(中国)有限公司 Connection query system and method for distributed data warehouse
CN102541858A (en) * 2010-12-07 2012-07-04 腾讯科技(深圳)有限公司 Data equality processing method, device and system based on mapping and protocol

Also Published As

Publication number Publication date
CN108415912A (en) 2018-08-17

Similar Documents

Publication Publication Date Title
CN108415912B (en) Data processing method and device based on MapReduce model
US9934276B2 (en) Systems and methods for fault tolerant, adaptive execution of arbitrary queries at low latency
US20140122484A1 (en) System and Method for Flexible Distributed Massively Parallel Processing (MPP) Database
JP6779231B2 (en) Data processing method and system
CN111813805A (en) Data processing method and device
CN107908714B (en) Data merging and sorting method and device
CN105404690A (en) Database querying method and apparatus
Verma et al. Big Data representation for grade analysis through Hadoop framework
CN108073687B (en) Random walk, random walk method based on cluster, random walk device and equipment
CN107451204B (en) Data query method, device and equipment
CN110019298A (en) Data processing method and device
CN111475511A (en) Data storage method, data access method, data storage device, data access device and data access equipment based on tree structure
US9384238B2 (en) Block partitioning for efficient record processing in parallel computing environment
CN112506887B (en) Vehicle terminal CAN bus data processing method and device
CN110008382B (en) Method, system and equipment for determining TopN data
CN110909072B (en) Data table establishment method, device and equipment
CN107562533B (en) Data loading processing method and device
CN109165325A (en) Method, apparatus, equipment and computer readable storage medium for cutting diagram data
CN107122849B (en) Spark R-based product detection total completion time minimization method
CN110825453A (en) Data processing method and device based on big data platform
WO2015189970A1 (en) Information processing device and data processing method therefor
CN106790620B (en) Distributed big data processing method
CN110990378A (en) Block chain-based data consistency comparison method, device and medium
US11379728B2 (en) Modified genetic recombination operator for cloud optimization
CN107451201B (en) Data access method and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant