CN103440246A - Intermediate result data sequencing method and system for MapReduce - Google Patents

Intermediate result data sequencing method and system for MapReduce

Info

Publication number
CN103440246A
Authority
CN
China
Prior art keywords
intermediate result
data
burst
result data
mapreduce
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013103059318A
Other languages
Chinese (zh)
Inventor
王猛
杨毅
王谦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN2013103059318A
Publication of CN103440246A
Legal status: Pending

Abstract

The invention provides an intermediate result data sorting method and system for MapReduce. The method comprises the following steps: a plurality of intermediate result data generated by mapping tasks are obtained from a mapping task server; the intermediate result data are divided into N groups according to the partitions to which they belong; the intermediate result data in the N groups are sorted by N threads, one group per thread; and the N groups of sorted intermediate result data are written from memory to a local disk. According to the embodiments of the invention, the method effectively reduces the time spent sorting the intermediate data before the intermediate result data of MapReduce are written to disk, and thereby improves sorting efficiency. The invention also provides an intermediate result data sorting system for MapReduce.

Description

Intermediate result data sorting method and system for MapReduce
Technical field
The present invention relates to the field of cloud computing technology, and in particular to a sorting method and system for MapReduce intermediate result data.
Background art
In the current cloud computing field, MapReduce is a simple, popular, and powerful programming model for parallel computation over large datasets (larger than 1 TB). MapReduce is widely applied, for example to distributed grep, distributed sorting, web link-graph reversal, per-machine term vectors, web access log analysis, inverted index construction, document clustering, and machine learning.
MapReduce does not require the compute nodes of a cluster to be mainframes; ordinary machines suffice. The data a MapReduce job operates on is massive data distributed across many machines (nodes). It is stored in a distributed file system, which exposes interfaces for streaming operations on files. The core idea of the MapReduce programming model is divide and conquer: a job written with the MapReduce framework usually consists of two parts, a mapping (Map) stage and a reduction (Reduce) stage. The logic of the two stages is:
Map (mapping) stage: the framework runs multiple map tasks, each of which reads its input data from the distributed file system, applies the map function to every record, and writes the results to the local disk in sorted order.
Reduce (reduction) stage: the framework runs one or more reduce tasks, each of which fetches over the network the data produced by the Map stage, applies the reduce function to it, and writes the results to the distributed file system.
The main part of the Map stage is the user-defined map function. Its input is a (key, value) pair in one domain, and its output is a list of (key, value) pairs in another domain:
Map(k1, v1) → list(k2, v2)
The map function can be applied to all input data in parallel, generating a series of (k2, v2) pairs for each input (k1, v1). The reduce function of the Reduce stage is also user-defined. The MapReduce framework collects all (k2, v2) pairs with the same k2 into one group and distributes the groups to reduce tasks according to some rule. The reduce function is applied to each group and generates zero or more values:
Reduce(k2, list(v2)) → list(v3)
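As a minimal sketch of the Map and Reduce signatures, the following Python example counts words in memory. The names map_fn, reduce_fn, and run_job are illustrative assumptions and do not belong to any real framework; a real job would run the two functions on different machines.

```python
from collections import defaultdict


def map_fn(k1, v1):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for each word in the text."""
    return [(word, 1) for word in v1.split()]


def reduce_fn(k2, values):
    """Reduce(k2, list(v2)) -> list(v3): sum the counts for one word."""
    return [sum(values)]


def run_job(inputs):
    """Hypothetical single-process driver: group by k2, then reduce each group."""
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Keys are visited in sorted order, mirroring the sorted input a reducer sees.
    return {k2: reduce_fn(k2, vs) for k2, vs in sorted(groups.items())}
```

Calling run_job on a list of (name, text) pairs returns one list of counts per distinct word.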
As shown in Fig. 2, the output helper classes of MapTask all inherit from MapOutputCollector. If the Mapper is followed by a reduce task, the system uses the derived class MapOutputBuffer for output. These outputs are called the intermediate results of the map. They may be very large, and they must be sorted before being handed to reduce: the output of each map is sorted in advance, so that reduce can perform a multiway merge over the already-sorted map outputs and thereby obtain globally ordered input. On the map side, intermediate results are held temporarily in an in-memory buffer. The buffer has limited capacity, so when it is full or its usage reaches a threshold, a spill is triggered: the data in the buffer are flushed to disk, producing temporary files spill0, spill1, and so on. The data must be sorted in memory before being written, so each spill file is itself the sorted contents of one spill. When the map finishes and all data have been output, MapOutputBuffer performs a multiway merge over the spill files on disk, producing a single sorted final file, file.out, together with the index file for it.
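The multiway merge of already-sorted spills can be sketched as follows. This is a simplified in-memory illustration, assuming each spill is a Python list of (partition, key, value) tuples sorted by (partition, key); merge_spills is a hypothetical name, and real spill files are read from disk rather than held as lists.

```python
import heapq


def merge_spills(spills):
    """Merge several sorted spills into one sorted sequence.

    heapq.merge reads the inputs lazily and never re-sorts them, which is
    why each spill has to be sorted before it is written.
    """
    return list(heapq.merge(*spills, key=lambda rec: (rec[0], rec[1])))
```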
As shown in Fig. 3, during the map the records emitted by the user program are received by the MapOutputBuffer class. This class holds a byte array named buffer (kvbuffer), which stores the serialized bytes of each record's key and value; keys and values are stored contiguously and compactly. There are also two int arrays holding metadata about each record. The first, named kvindices (the intermediate data index), stores the partition, keystart, and valstart of each record's key-value pair. This triple of three int values describes where the record is stored in the buffer and which partition it belongs to, and the triples are laid out one after another in the int array: <partition, keystart, valstart><partition, keystart, valstart>... The second int array, named kvoffsets (the partition position index), serves as an index into kvindices. It holds a single int per record, which points to the position in kvindices of the first int of that record's triple, namely the partition. Together these form a two-level index: record sequence number (kvoffsets) → record metadata (kvindices) → record data (kvbuffer). During sorting the actual data in kvbuffer need not be moved. The keys of the records being compared are read through the indexes, and the result of each comparison is reflected only in the kvoffsets array. Only kvoffsets is sorted; the physical positions of the data in kvindices and kvbuffer do not change, and the new order of the elements in kvoffsets represents the sort result.
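The two-level index can be sketched in Python as below. This is an illustrative model only, not Hadoop's implementation: the class name MapOutputBufferSketch and its methods are assumptions, Python lists stand in for the int arrays, and the buffer grows without bound instead of being circular.

```python
class MapOutputBufferSketch:
    """Toy model of kvbuffer / kvindices / kvoffsets."""

    def __init__(self):
        self.kvbuffer = bytearray()  # serialized key and value bytes, back to back
        self.kvindices = []          # flat triples: partition, keystart, valstart
        self.kvoffsets = []          # one entry per record: start of its triple

    def collect(self, partition, key, value):
        keystart = len(self.kvbuffer)
        self.kvbuffer += key
        valstart = len(self.kvbuffer)
        self.kvbuffer += value
        self.kvoffsets.append(len(self.kvindices))
        self.kvindices += [partition, keystart, valstart]

    def partition_at(self, offset):
        return self.kvindices[offset]

    def key_at(self, offset):
        keystart = self.kvindices[offset + 1]
        valstart = self.kvindices[offset + 2]
        return bytes(self.kvbuffer[keystart:valstart])

    def sort(self):
        # Only kvoffsets is reordered; kvindices and kvbuffer stay in place.
        self.kvoffsets.sort(key=lambda off: (self.partition_at(off), self.key_at(off)))
```

After sort(), reading kvoffsets from left to right visits the records ordered by partition and then by key, while the bytes in kvbuffer remain in arrival order.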
After sorting, the byte arrays of each key and value are read from the buffer in kvoffsets order and written to the spill file. This process has a problem, however. In most cases sorting is much slower than the rate at which the user program emits records, so the buffer easily fills up before the spill thread has finished its work. The user thread writing data then blocks until the spill completes and the memory it occupied is released, which introduces a delay.
Summary of the invention
The present invention aims to solve at least one of the technical deficiencies described above.
To this end, one object of the present invention is to propose a sorting method for MapReduce intermediate result data that effectively reduces the time spent sorting the intermediate data before the intermediate result data are written from memory to disk, and thereby improves sorting efficiency.
Another object of the present invention is to propose a sorting system for MapReduce intermediate result data.
To achieve the above objects, an embodiment of a first aspect of the present invention discloses a sorting method for MapReduce intermediate result data, comprising the following steps: obtaining, from a mapping task server, a plurality of intermediate result data produced by a mapping task; dividing the plurality of intermediate result data into N groups according to the partitions to which they belong; sorting the intermediate result data in the N groups by N threads, respectively; and writing the N groups of sorted intermediate result data from memory to a local disk.
The sorting method for MapReduce intermediate result data according to the embodiments of the present invention alleviates the delay that arises during map output when records are emitted faster than the spill can sort them, so that the buffer fills up and writers are forced to wait until the spill completes and releases memory. The method applies divide and conquer, splitting the sort of the intermediate data into smaller parts. Grouping by partition also simplifies the comparison logic, and sorting with multiple threads speeds up the sort and shortens the spill time, which in turn alleviates the delay caused by the buffer filling up during a spill. Specifically, at spill time all records to be sorted are grouped by partition, and the records within each partition are sorted by a separate thread. For example, when the sorting method of the embodiments was tested with the wordcount example shipped with Hadoop, running with three threads sped up the map by about 30% on average.
In addition, the sorting method for MapReduce intermediate result data according to the above embodiments of the present invention may have the following additional technical features:
In some examples, the method further comprises: a reduce task server obtaining the N groups of sorted intermediate result data from the local disk and performing reduce task processing on them.
In some examples, dividing the plurality of intermediate result data into N groups according to the partitions to which they belong further comprises: creating a two-dimensional partition index array; merging the partitions to which the plurality of intermediate result data belong and storing them in the first dimension of the two-dimensional partition index array; storing each partition position index in the second dimension of the two-dimensional partition index array; and, according to the partitions stored in the first dimension, sorting the partition position indexes corresponding to each partition stored in the second dimension.
In some examples, the partition position indexes corresponding to each partition stored in the second dimension are sorted by one thread.
An embodiment of a second aspect of the present invention provides a sorting system for MapReduce intermediate result data, comprising a local server and a mapping server. The mapping server is configured to execute a mapping task to generate a plurality of intermediate result data. The local server is configured to obtain the plurality of intermediate result data from the mapping task server, divide them into N groups according to the partitions to which they belong, sort the intermediate result data in the N groups by N threads, respectively, and write the N groups of sorted intermediate result data from memory to a local disk.
The sorting system for MapReduce intermediate result data according to the embodiments of the present invention alleviates the delay that arises during map output when records are emitted faster than the spill can sort them, so that the buffer fills up and writers are forced to wait until the spill completes and releases memory. The system applies divide and conquer, splitting the sort of the intermediate data into smaller parts. Grouping by partition also simplifies the comparison logic, and sorting with multiple threads speeds up the sort and shortens the spill time, which in turn alleviates the delay caused by the buffer filling up during a spill. Specifically, at spill time all records to be sorted are grouped by partition, and the records within each partition are sorted by a separate thread. For example, when the sorting method of the embodiments was tested with the wordcount example shipped with Hadoop, running with three threads sped up the map by about 30% on average.
In addition, the sorting system for MapReduce intermediate result data according to the above embodiments of the present invention may have the following additional technical features:
In some examples, the system further comprises a reduce server, which is configured to obtain the N groups of sorted intermediate result data from the local disk of the local server and to perform reduce task processing on them.
In some examples, the local server dividing the plurality of intermediate result data into N groups according to the partitions to which they belong comprises: creating a two-dimensional partition index array; merging the partitions to which the plurality of intermediate result data belong and storing them in the first dimension of the two-dimensional partition index array; storing each partition position index in the second dimension of the two-dimensional partition index array; and, according to the partitions stored in the first dimension, sorting the partition position indexes corresponding to each partition stored in the second dimension.
In some examples, the local server sorts the partition position indexes corresponding to each partition stored in the second dimension by one independent thread.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of a sorting method for MapReduce intermediate result data according to an embodiment of the invention;
Fig. 2 is a flow diagram of writing the intermediate result data produced by MapReduce from memory to disk in the prior art;
Fig. 3 is a schematic diagram of the index structure used in the prior art to sort the intermediate result data produced by MapReduce before they are written from memory to disk;
Fig. 4 is a schematic diagram of the index structure used by the sorting method according to an embodiment of the invention to sort the intermediate result data produced by MapReduce before they are written from memory to disk; and
Fig. 5 is a schematic diagram of a sorting system for MapReduce intermediate result data according to an embodiment of the invention.
Detailed description
Embodiments of the invention are described in detail below. Examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote the same or similar elements, or elements with the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the invention and cannot be construed as limiting it.
In the description of the invention, it should be understood that orientation or position terms such as "longitudinal", "transverse", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", and "outer" are based on the orientations or positions shown in the drawings. They are used only to simplify the description of the invention and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they therefore cannot be construed as limiting the invention.
In the description of the invention, it should be noted that, unless otherwise specified and limited, the terms "mounted", "connected", and "coupled" should be understood broadly: for example, a connection may be mechanical or electrical, may be internal to two elements, and may be direct or indirect through an intermediary. Those of ordinary skill in the art can understand the specific meaning of these terms according to the circumstances.
The sorting method and system for MapReduce intermediate result data according to the embodiments of the present invention are described below with reference to the drawings.
Fig. 1 is a flowchart of a sorting method for MapReduce intermediate result data according to an embodiment of the invention. As shown in Fig. 1, the sorting method for MapReduce intermediate result data according to an embodiment of the invention comprises the following steps:
Step S101: a plurality of intermediate result data produced by a mapping task are obtained from a mapping task server.
Step S102: the plurality of intermediate result data are divided into N groups according to the partitions to which they belong.
As a concrete example, dividing the plurality of intermediate result data into N groups according to the partitions to which they belong further comprises:
1. Creating a two-dimensional partition index array.
2. Merging the partitions to which the plurality of intermediate result data belong and storing them in the first dimension of the two-dimensional partition index array.
3. Storing each partition position index in the second dimension of the two-dimensional partition index array.
4. According to the partitions stored in the first dimension, sorting the partition position indexes corresponding to each partition stored in the second dimension.
Specifically, as shown in Fig. 3, in the background art for sorting the intermediate result data when spilling them from memory to disk, the sort uses the partition position index (kvoffsets) as the index, compares the actual key data it points to in the in-memory buffer, and then swaps index positions in kvoffsets. As a result, all records of the current spill are ordered first by partition and then, within each partition, by the comparator logic defined for the key. The whole process is carried out by a single thread. Supposing the total number of records is n and quicksort is used, the time complexity is n log n.
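The two-step comparison used by this single-threaded sort can be sketched as follows. The record layout (a (partition, key) tuple) and the names compare_records and sort_single_threaded are assumptions made for illustration, and Python's built-in sort stands in for quicksort.

```python
from functools import cmp_to_key


def compare_records(a, b):
    """Compare two (partition, key) records in two steps."""
    # Step 1: compare the partitions.
    if a[0] != b[0]:
        return -1 if a[0] < b[0] else 1
    # Step 2: only when the partitions are equal, compare the keys.
    if a[1] != b[1]:
        return -1 if a[1] < b[1] else 1
    return 0


def sort_single_threaded(records):
    """Sort every record of a spill in one pass with the two-step comparator."""
    return sorted(records, key=cmp_to_key(compare_records))
```

Every comparison pays for the partition check even when both records already belong to the same partition, which is the cost that grouping by partition beforehand removes.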
Under this comparison rule, each pairwise comparison of elements requires two steps of comparison logic, and all records must be sorted in a single pass. When n is large and the comparison logic is complex, no particular sorting algorithm performs well. If instead the records to be sorted are divided into r parts and each part is sorted separately, the time complexity becomes r × (n/r) log(n/r), that is, n log(n/r): the smaller the data scale of each part, the greater the gain, and the improvement in sorting speed is nonlinear. For the records here, the most suitable choice is to divide them into r parts by partition before sorting, where r is the number of partitions. The sort then only compares keys within the same partition and no longer compares partitions, so the two-step logic is reduced to one step. Moreover, once the records are divided by partition, a separate thread can sort the data of each partition; since partitions do not interfere with one another, the sort can run in parallel across multiple threads.
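As a rough numeric illustration of this argument, assume the n records split evenly into r parts and that sorting m records costs about m log2(m) comparisons. The total is then r × (n/r) log2(n/r) = n log2(n/r). The function names below are hypothetical, and the model ignores constant factors and uneven partitions.

```python
import math


def single_sort_cost(n):
    """Approximate comparison count for sorting n records in one pass."""
    return n * math.log2(n)


def partitioned_sort_cost(n, r):
    """Approximate total comparison count for sorting r parts of n/r records."""
    return r * (n / r) * math.log2(n / r)


n, r = 1 << 20, 16
ratio = partitioned_sort_cost(n, r) / single_sort_cost(n)
print(f"partitioned / single = {ratio:.2f}")
```

With n = 2^20 and r = 16 the ratio is 16/20, so the model predicts about 20% fewer comparisons before any multithreading is applied; each remaining comparison is also cheaper because the partition check is gone.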
More specifically, an auxiliary two-dimensional array is added to the MapOutputBuffer class, for example, as a new index: the two-dimensional partition index array, denoted partitionIndexes[][]. The first dimension is the partition, and the second dimension holds the indexes in the intermediate data index kvindices of the records belonging to that partition. Each time a record is written, the count for that record's partition is incremented by 1. When a spill starts, the second-dimension sizes of partitionIndexes are initialized from the partition counts. The range of kvoffsets to be spilled is then traversed, and the kvindices index of each record is inserted into the array for its partition in partitionIndexes[][]. In this way, new indexes and mapping relations are established in partitionIndexes[][] for all the records in the partition position index kvoffsets that are to be sorted.
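The construction of partitionIndexes can be sketched in Python as follows. This is a simplified illustration under stated assumptions: kvindices is a flat list of <partition, keystart, valstart> triples, kvoffsets lists the triple start positions for the spill range, and the per-partition counts are computed in a first pass here rather than being maintained as each record is written. The name build_partition_indexes is hypothetical.

```python
def build_partition_indexes(kvindices, kvoffsets, num_partitions):
    """Group the kvindices positions of the records to spill by partition."""
    # Count records per partition so each inner array can be sized up front.
    counts = [0] * num_partitions
    for off in kvoffsets:
        counts[kvindices[off]] += 1

    partition_indexes = [[0] * c for c in counts]
    fill = [0] * num_partitions  # next free slot in each inner array

    # Walk the spill range once and drop each record into its partition's array.
    for off in kvoffsets:
        p = kvindices[off]
        partition_indexes[p][fill[p]] = off
        fill[p] += 1
    return partition_indexes
```

The grouping is a single linear pass, so it adds O(n) work before the per-partition sorts.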
Step S103: the intermediate result data in the N groups are sorted by N threads, respectively. For example, as shown in Fig. 4, multiple threads (such as Thread0 to Thread3) are created, each sorting the records within one partition; it suffices to sort the partitionIndexes two-dimensional array.
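A sketch of the per-partition sort using a thread pool is shown below. The function name sort_partitions and the key_of callback (which returns the key bytes for a record index) are assumptions for illustration. In CPython the global interpreter lock limits how much of this sorting actually runs in parallel, so the example shows the structure rather than the speed-up reported for the Java implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def sort_partitions(partition_indexes, key_of, max_workers=4):
    """Sort each partition's index array in place, one task per partition."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each task touches only its own inner list, so no locking is needed.
        futures = [pool.submit(group.sort, key=key_of) for group in partition_indexes]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return partition_indexes
```

Because each inner array already belongs to a single partition, the comparison looks only at the key.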
Step S104: the N groups of sorted intermediate result data are written from memory to a local disk. In one embodiment of the invention, the partition position indexes corresponding to each partition stored in the second dimension are sorted by one thread. For example, once the sort in step S103 has finished, partitionIndexes is used as the index for output to disk: starting from subscript 0 of the first dimension, which represents partition 0, each partition that contains data has its corresponding key-value pairs in the in-memory buffer written to disk in order.
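Writing the spill in partition order can be sketched as follows. The on-disk layout used here (key, tab, value, newline) is invented for the example and is not Hadoop's spill format; the helper names are hypothetical, and the buffer is assumed not to wrap around, so a value ends where the next record's key starts.

```python
import io


def record_bytes(kvindices, kvbuffer, off):
    """Return (key, value) bytes for the record whose triple starts at off."""
    keystart, valstart = kvindices[off + 1], kvindices[off + 2]
    nxt = off + 3
    # The value ends at the next record's keystart, or at the end of the buffer.
    valend = kvindices[nxt + 1] if nxt < len(kvindices) else len(kvbuffer)
    return bytes(kvbuffer[keystart:valstart]), bytes(kvbuffer[valstart:valend])


def write_spill(partition_indexes, kvindices, kvbuffer, out):
    """Write records partition by partition; return each partition's start offset."""
    segment_starts = []
    for group in partition_indexes:  # subscript 0 is partition 0, and so on
        segment_starts.append(out.tell())
        for off in group:
            key, value = record_bytes(kvindices, kvbuffer, off)
            out.write(key + b"\t" + value + b"\n")
    return segment_starts
```

The returned start offsets play the role of a simple per-partition index, so a reader can seek directly to the segment of the partition it needs.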
In one embodiment of the invention, the method further comprises: a reduce task server obtaining the N groups of sorted intermediate result data from the local disk and performing reduce task processing on them. That is, a reduce task (Reduce Task) is executed on the sorted intermediate result data stored on the local disk.
The sorting method for MapReduce intermediate result data according to the embodiments of the present invention alleviates the delay that arises during map output when records are emitted faster than the spill can sort them, so that the buffer fills up and writers are forced to wait until the spill completes and releases memory. The method applies divide and conquer, splitting the sort of the intermediate data into smaller parts. Grouping by partition also simplifies the comparison logic, and sorting with multiple threads speeds up the sort and shortens the spill time, which in turn alleviates the delay caused by the buffer filling up during a spill. Specifically, at spill time all records to be sorted are grouped by partition, and the records within each partition are sorted by a separate thread. For example, when the sorting method of the embodiments was tested with the wordcount example shipped with Hadoop, running with three threads sped up the map by about 30% on average.
Fig. 5 is a schematic diagram of a sorting system for MapReduce intermediate result data according to an embodiment of the invention. As shown in Fig. 5, the sorting system 500 for MapReduce intermediate result data according to an embodiment of the invention comprises a local server 510 and a mapping server 520.
The mapping server 520 is configured to execute a mapping task to generate a plurality of intermediate result data. The local server 510 is configured to obtain the plurality of intermediate result data from the mapping task server 520, divide them into N groups according to the partitions to which they belong, sort the intermediate result data in the N groups by N threads, respectively, and write the N groups of sorted intermediate result data from memory to a local disk.
As a concrete example, the local server 510 dividing the plurality of intermediate result data into N groups according to the partitions to which they belong comprises: creating a two-dimensional partition index array; merging the partitions to which the plurality of intermediate result data belong and storing them in the first dimension of the two-dimensional partition index array; storing each partition position index in the second dimension of the two-dimensional partition index array; and, according to the partitions stored in the first dimension, sorting the partition position indexes corresponding to each partition stored in the second dimension.
Specifically, as shown in Fig. 3, in the background art for sorting the intermediate result data when spilling them from memory to disk, the sort uses the partition position index (kvoffsets) as the index, compares the actual key data it points to in the in-memory buffer, and then swaps index positions in kvoffsets. As a result, all records of the current spill are ordered first by partition and then, within each partition, by the comparator logic defined for the key. The whole process is carried out by a single thread. Supposing the total number of records is n and quicksort is used, the time complexity is n log n.
Under this comparison rule, each pairwise comparison of elements requires two steps of comparison logic, and all records must be sorted in a single pass. When n is large and the comparison logic is complex, no particular sorting algorithm performs well. If instead the records to be sorted are divided into r parts and each part is sorted separately, the time complexity becomes r × (n/r) log(n/r), that is, n log(n/r): the smaller the data scale of each part, the greater the gain, and the improvement in sorting speed is nonlinear. For the records here, the most suitable choice is to divide them into r parts by partition before sorting, where r is the number of partitions. The sort then only compares keys within the same partition and no longer compares partitions, so the two-step logic is reduced to one step. Moreover, once the records are divided by partition, a separate thread can sort the data of each partition; since partitions do not interfere with one another, the sort can run in parallel across multiple threads.
More specifically, an auxiliary two-dimensional array is added to the MapOutputBuffer class, for example, as a new index: the two-dimensional partition index array, denoted partitionIndexes[][]. The first dimension is the partition, and the second dimension holds the indexes in the intermediate data index kvindices of the records belonging to that partition. Each time a record is written, the count for that record's partition is incremented by 1. When a spill starts, the second-dimension sizes of partitionIndexes are initialized from the partition counts. The range of kvoffsets to be spilled is then traversed, and the kvindices index of each record is inserted into the array for its partition in partitionIndexes[][]. In this way, new indexes and mapping relations are established in partitionIndexes[][] for all the records in the partition position index kvoffsets that are to be sorted.
Further, as shown in Fig. 4, multiple threads (such as Thread0 to Thread3) are created, each sorting the records within one partition; it suffices to sort the partitionIndexes two-dimensional array.
In one embodiment of the invention, the local server 510 sorts the partition position indexes corresponding to each partition stored in the second dimension by one thread. For example, partitionIndexes is used as the index for output to disk: starting from subscript 0 of the first dimension, which represents partition 0, each partition that contains data has its corresponding key-value pairs in the in-memory buffer written to disk in order.
As shown in Fig. 5, the sorting system 500 for MapReduce intermediate result data according to the embodiments of the present invention further comprises a reduce server 530. The reduce server 530 is configured to obtain the N groups of sorted intermediate result data from the local disk of the local server 510 and to perform reduce task processing on them. That is, a reduce task (Reduce Task) is executed on the sorted intermediate result data stored on the local disk.
The sorting system for MapReduce intermediate result data according to the embodiments of the present invention alleviates the delay that arises during map output when records are emitted faster than the spill can sort them, so that the buffer fills up and writers are forced to wait until the spill completes and releases memory. The system applies divide and conquer, splitting the sort of the intermediate data into smaller parts. Grouping by partition also simplifies the comparison logic, and sorting with multiple threads speeds up the sort and shortens the spill time, which in turn alleviates the delay caused by the buffer filling up during a spill. Specifically, at spill time all records to be sorted are grouped by partition, and the records within each partition are sorted by a separate thread. For example, when the sorting method of the embodiments was tested with the wordcount example shipped with Hadoop, running with three threads sped up the map by about 30% on average.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, such uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will appreciate that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principles and spirit of the present invention; the scope of the present invention is defined by the claims and their equivalents.

Claims (8)

1. A sorting method for intermediate result data of MapReduce, characterized by comprising the following steps:
obtaining, from a map task server, a plurality of intermediate result data produced by a map task;
dividing the plurality of intermediate result data into N groups according to the partitions to which the plurality of intermediate result data belong;
sorting the intermediate result data in the N groups respectively by N threads; and
writing the N sorted groups of intermediate result data from memory to a local disk.
2. The sorting method for intermediate result data of MapReduce according to claim 1, characterized by further comprising:
a reduce task server obtaining the N sorted groups of intermediate result data from the local disk and performing reduce-task processing on the N groups of intermediate result data.
3. The sorting method for intermediate result data of MapReduce according to claim 1, characterized in that dividing the plurality of intermediate result data into N groups according to the partitions to which the plurality of intermediate result data belong further comprises:
creating a partition index two-dimensional array;
merging the partitions to which the plurality of intermediate result data belong and storing them in the first-dimension storage space of the partition index two-dimensional array;
storing each partition position index in the second-dimension storage space of the partition index two-dimensional array; and
sorting, according to the plurality of partitions stored in the first-dimension storage space, the plurality of partition position indexes stored in the second-dimension storage space for each partition.
4. The sorting method for intermediate result data of MapReduce according to claim 3, characterized in that the plurality of partition position indexes stored in the second-dimension storage space for each partition are sorted by one thread.
5. A sorting system for intermediate result data of MapReduce, characterized by comprising a local server and a map server, wherein
the map server is configured to execute a map task to generate a plurality of intermediate result data; and
the local server is configured to obtain the plurality of intermediate result data from the map task server, divide the plurality of intermediate result data into N groups according to the partitions to which the plurality of intermediate result data belong, sort the intermediate result data in the N groups respectively by N threads, and write the N sorted groups of intermediate result data from memory to a local disk.
6. The sorting system for intermediate result data of MapReduce according to claim 1, characterized by further comprising:
a reduce server configured to obtain the N sorted groups of intermediate result data from the local disk of the local server and to perform reduce-task processing on the N groups of intermediate result data.
7. The sorting system for intermediate result data of MapReduce according to claim 5, characterized in that the local server dividing the plurality of intermediate result data into N groups according to the partitions to which the plurality of intermediate result data belong comprises: creating a partition index two-dimensional array; merging the partitions to which the plurality of intermediate result data belong and storing them in the first-dimension storage space of the partition index two-dimensional array; storing each partition position index in the second-dimension storage space of the partition index two-dimensional array; and sorting, according to the plurality of partitions stored in the first-dimension storage space, the plurality of partition position indexes stored in the second-dimension storage space for each partition.
8. The sorting system for intermediate result data of MapReduce according to claim 7, characterized in that the local server sorts the plurality of partition position indexes stored in the second-dimension storage space for each partition by one independent thread.
CN2013103059318A 2013-07-19 2013-07-19 Intermediate result data sequencing method and system for MapReduce Pending CN103440246A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013103059318A CN103440246A (en) 2013-07-19 2013-07-19 Intermediate result data sequencing method and system for MapReduce

Publications (1)

Publication Number Publication Date
CN103440246A true CN103440246A (en) 2013-12-11

Family

ID=49693937

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013103059318A Pending CN103440246A (en) 2013-07-19 2013-07-19 Intermediate result data sequencing method and system for MapReduce

Country Status (1)

Country Link
CN (1) CN103440246A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8190610B2 (en) * 2006-10-05 2012-05-29 Yahoo! Inc. MapReduce for distributed database processing
CN103116525A (en) * 2013-01-24 2013-05-22 贺海武 Map reduce computing method under internet environment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
乔鸿欣: "基于MapReduce的KNN分类算法的研究与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
彭辅权等: "MapReduce中shuffle优化与重构", 《中国科技论文》 *
艾树宇: "基于Hadoop/MapReduce的K_NN算法", 《科技传播》 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104123386B (en) * 2014-08-06 2017-08-25 浪潮(北京)电子信息产业有限公司 A kind of method and device of sorting in parallel
CN104123386A (en) * 2014-08-06 2014-10-29 浪潮(北京)电子信息产业有限公司 Parallel sequencing method and device
CN106484879B (en) * 2016-10-14 2019-08-06 哈尔滨工程大学 A kind of polymerization of the Map end data based on MapReduce
CN106484879A (en) * 2016-10-14 2017-03-08 哈尔滨工程大学 A kind of polymerization of the Map end data based on MapReduce
CN106528763A (en) * 2016-10-28 2017-03-22 北京海誉动想科技股份有限公司 Method for combining two-channel document blocks and multi-channel document blocks
CN106528763B (en) * 2016-10-28 2019-11-01 北京海誉动想科技股份有限公司 The method of two-way and multichannel file merged block
CN106850849A (en) * 2017-03-15 2017-06-13 联想(北京)有限公司 A kind of data processing method, device and server
CN108874798A (en) * 2017-05-09 2018-11-23 北京京东尚科信息技术有限公司 A kind of big data sort method and system
CN107729375A (en) * 2017-09-13 2018-02-23 微梦创科网络科技(中国)有限公司 A kind of method and device of daily record data sequence
CN109002467B (en) * 2018-06-08 2021-04-27 中国科学院计算技术研究所 Database sorting method and system based on vectorization execution
CN109002467A (en) * 2018-06-08 2018-12-14 中国科学院计算技术研究所 A kind of database sort method and system executed based on vectorization
CN109408490A (en) * 2018-09-29 2019-03-01 武汉斗鱼网络科技有限公司 A kind of regular method, apparatus of array, terminal and readable medium
CN112685747A (en) * 2020-01-17 2021-04-20 华控清交信息科技(北京)有限公司 Data processing method and device and data processing device
CN112685747B (en) * 2020-01-17 2022-02-01 华控清交信息科技(北京)有限公司 Data processing method and device and data processing device
WO2022179023A1 (en) * 2021-02-25 2022-09-01 华为技术有限公司 Sorting device and method
WO2023005366A1 (en) * 2021-07-28 2023-02-02 华为云计算技术有限公司 Computing method and apparatus, device and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20131211