CN105740332A

CN105740332A - Data sorting method and device

Info

Publication number: CN105740332A
Application number: CN201610045738.9A
Authority: CN
Inventors: 魏国建; 王春明; 周涛; 韦永剑; 叶华; 张思进
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2016-01-22
Filing date: 2016-01-22
Publication date: 2016-07-06

Abstract

The invention discloses a data sorting method and device. A specific embodiment of the method comprises the following steps: obtaining data to be sorted and data identifiers of the data to be sorted; carrying out distribution operations, that is, determining a maximum identifier value and a minimum identifier value of the data identifiers, dividing an interval between a right endpoint value and a left endpoint value, which respectively serve as the maximum value and the minimum value, into a plurality of subintervals, and generating a plurality of data sets to be sorted, each of which corresponds to one of the subintervals; and carrying out sorting operations. The data sorting method and device provided by the invention have the advantages that the data to be sorted can be uniformly distributed during the sorting process, and differences among the identifier values of the data identifiers of the data to be sorted in each generated data set to be sorted are small, so that when the data to be sorted in each data set to be sorted is sorted according to the data identifier values, the expenses can be approximately equal, and the sorting efficiency of a system can be further improved.

Description

Data sorting method and device

Technical field

The present application relates to the field of computers, specifically to the field of big data technology, and in particular to a data sorting method and device.

Background

The Map-Reduce model of the distributed computing framework Hadoop is widely used in big data processing. When data is processed with the Map-Reduce model, Map tasks distribute the data to different Reduce tasks, and the data is then sorted by the identifier values of its data identifiers so that the processed data is globally ordered. The distribution approach commonly used at present is to sample the identifier values of the data identifiers, predict the distribution of the identifier values across the whole data set, and then distribute the data to the different Reduce tasks according to that predicted distribution.

However, distributing data to Reduce tasks in this way has the following problems: 1) when only part of the data is sampled, the identifier values of the sampled data may be unrelated to those of the unsampled data, so the overall distribution of identifier values cannot be predicted accurately, the data cannot be distributed evenly to the Reduce tasks, and the sorting efficiency of the system falls; 2) when all of the data is sampled, the overhead increases sharply, which also reduces the sorting efficiency of the system.

Summary of the invention

The present application provides a data sorting method and device to solve the technical problems described in the Background section above.

In a first aspect, the present application provides a data sorting method, comprising: obtaining data to be sorted and the data identifiers of the data to be sorted; performing a distribution operation: determining the maximum and minimum identifier values among the data identifiers; dividing the interval whose right endpoint value and left endpoint value are the maximum and the minimum, respectively, into a plurality of subintervals, wherein each subinterval satisfies the following condition: its left endpoint value is the right endpoint value of the preceding subinterval, and its right endpoint value is the left endpoint value of the following subinterval; determining the subinterval to which the identifier value of the data identifier of each item of data to be sorted belongs; and generating a plurality of data sets to be sorted, each corresponding to one subinterval; and performing a sorting operation: sorting the data to be sorted in each data set to be sorted by the identifier values of the data identifiers.

In a second aspect, the present application provides a data sorting device, comprising: an acquiring unit configured to obtain data to be sorted and the data identifiers of the data to be sorted; a distribution unit configured to perform a distribution operation: determining the maximum and minimum identifier values among the data identifiers; dividing the interval whose right endpoint value and left endpoint value are the maximum and the minimum, respectively, into a plurality of subintervals, wherein each subinterval satisfies the following condition: its left endpoint value is the right endpoint value of the preceding subinterval, and its right endpoint value is the left endpoint value of the following subinterval; determining the subinterval to which the identifier value of the data identifier of each item of data to be sorted belongs; and generating a plurality of data sets to be sorted, each corresponding to one subinterval; and a sorting unit configured to perform a sorting operation: sorting the data to be sorted in each data set to be sorted by the identifier values of the data identifiers.

With the data sorting method and device provided by the present application, the data to be sorted and its data identifiers are obtained; a distribution operation is performed: the maximum and minimum identifier values among the data identifiers are determined, the interval whose right endpoint value and left endpoint value are the maximum and the minimum, respectively, is divided into a plurality of subintervals, and a plurality of data sets to be sorted are generated, each corresponding to one subinterval; and a sorting operation is performed: the data in each data set to be sorted is sorted by the identifier values of the data identifiers. As a result, the data to be sorted is distributed evenly during sorting, and the differences between the identifier values within each generated data set to be sorted are small, so that when the data in each set is sorted by identifier value the overhead is approximately equal across sets, which improves the sorting efficiency of the system.

Brief description of the drawings

Other features, objects, and advantages of the present application will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:

Fig. 1 is a diagram of an exemplary system architecture to which the present application can be applied;

Fig. 2 shows a flowchart of one embodiment of the data sorting method according to the present application;

Fig. 3 shows a schematic diagram of generating a plurality of data sets to be sorted;

Fig. 4 shows an exemplary architecture diagram suitable for the data sorting method of the present application;

Fig. 5 shows a structural diagram of one embodiment of the data sorting device according to the present application;

Fig. 6 is a structural diagram of a computer system suitable for implementing a terminal device or server of the embodiments of the present application.

Detailed description

The present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the invention.

It should be noted that, where no conflict arises, the embodiments of the present application and the features of those embodiments may be combined with one another. The present application is described in detail below with reference to the drawings and in conjunction with the embodiments.

Fig. 1 shows an exemplary system architecture 100 to which embodiments of the data sorting method or data sorting device of the present application can be applied.

As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 serves as the medium providing transmission links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless transmission links or fiber-optic cables.

A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104, for example to receive or send messages. Various communication applications may be installed on the terminal devices 101, 102, 103, such as browser applications and search applications.

The terminal devices 101, 102, 103 may be various electronic devices that have a display screen and support network communication, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, and desktop computers.

The server 105 may obtain data associated with the user's network behavior (for example, cookies) from the browser applications on the terminal devices 101, 102, 103, as the data to be sorted together with its data identifiers, and may then sort the data to be sorted by the identifier values of its data identifiers.

It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers, as required by the implementation.

Referring to Fig. 2, it shows a flow 200 of one embodiment of the data sorting method according to the present application. It should be noted that the data sorting method provided by the embodiments of the present application is generally performed by the server 105 in Fig. 1, and accordingly the data sorting device is generally located in the server 105. The method comprises the following steps:

Step 201: obtain the data to be sorted and the data identifiers of the data to be sorted.

In this embodiment, the data to be sorted may be organized as individual records. For example, the data to be sorted may be cookies that record information associated with a user's network behavior. One cookie may contain the URL of a web page the user browsed and the browsing time. Accordingly, each cookie may correspond to one data identifier, i.e., the data identifier (key) of the data to be sorted, and the identifier value of this data identifier may be of integer (int) type.
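As a rough illustration of the record shape described above, the sketch below models one cookie as a small Python record. The class name `CookieRecord` and its field names are hypothetical; the text only states that a cookie may contain a URL and a browsing time and corresponds to an int-typed key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CookieRecord:
    """One item of data to be sorted (hypothetical layout, for illustration only)."""
    key: int          # identifier value of the data identifier (int type)
    url: str          # URL of the web page the user browsed
    visited_at: str   # browsing time, kept as text here for simplicity


records = [
    CookieRecord(key=42, url="https://example.com/a", visited_at="2016-01-22T10:00:00"),
    CookieRecord(key=7, url="https://example.com/b", visited_at="2016-01-22T10:05:00"),
]

# The later steps only need the identifier values.
keys = [record.key for record in records]
```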

Step 202: perform the distribution operation.

In this embodiment, the distribution operation divides the data to be sorted into a plurality of data sets to be sorted. The distribution operation includes: determining the maximum and minimum identifier values among the data identifiers of the data to be sorted; dividing the interval whose right endpoint value and left endpoint value are the maximum and the minimum, respectively, into a plurality of subintervals, wherein each subinterval satisfies the following condition: its left endpoint value is the right endpoint value of the preceding subinterval, and its right endpoint value is the left endpoint value of the following subinterval; determining the subinterval to which the identifier value of the data identifier of each item of data to be sorted belongs; and generating a plurality of data sets to be sorted, each corresponding to one subinterval.

Some optional implementations of this embodiment further include: performing the distribution operation with Map tasks in the Map-Reduce model of the distributed computing framework Hadoop, and performing the sorting operation with Reduce tasks in the Map-Reduce model.

In this embodiment, the maximum and minimum identifier values among the data identifiers of all the data to be sorted may be found first. Once they have been found, an interval can be determined whose left endpoint is the minimum and whose right endpoint is the maximum. This interval can then be divided into a plurality of consecutive subintervals, that is, the left endpoint value of each subinterval is the right endpoint value of the preceding subinterval, and its right endpoint value is the left endpoint value of the following subinterval. After the consecutive subintervals have been marked off, the subinterval to which the identifier value of each item of data to be sorted belongs can be determined, so that the data belonging to the same subinterval forms one data set to be sorted.

In some optional implementations of this embodiment, dividing the interval whose right endpoint value and left endpoint value are the maximum and the minimum, respectively, into a plurality of subintervals includes calculating the right endpoint value of each subinterval with the following formulas: Nmaxkey = Minkey + Average * N; Average = (Maxkey - Minkey) / Rnumber; where Nmaxkey denotes the right endpoint value of the N-th subinterval, Minkey denotes the minimum identifier value of the data identifiers, Average denotes the average value, Maxkey denotes the maximum identifier value of the data identifiers, and Rnumber denotes the number of Reduce tasks.

Taking the Map-Reduce model as an example, the interval whose right and left endpoint values are the maximum and the minimum can be divided into subintervals as follows. In this embodiment, the number of subintervals can be determined from the number of Reduce tasks. Based on the subintervals, the Map tasks generate the data sets to be sorted and send them to the Reduce tasks, which completes the distribution of the data to be sorted.

Referring to Fig. 3, it shows a schematic diagram of generating a plurality of data sets to be sorted.

First, the maximum (Maxkey) and minimum (Minkey) identifier values among the data identifiers of all the data to be sorted are determined, which defines an interval whose left endpoint value is Minkey and whose right endpoint value is Maxkey.

Then, the value of the right endpoint of each subinterval (also called an interval scale mark) can be calculated, so that the interval with left endpoint value Minkey and right endpoint value Maxkey is divided into a plurality of subintervals. The scale mark of each subinterval can be calculated with the following formulas:

Each_dregion = (Maxkey - Minkey) / Rnumber

Dregion_0 = Minkey

Dregion_1 = Minkey + Each_dregion * 1

Dregion_2 = Minkey + Each_dregion * 2

...

Dregion_(N-2) = Minkey + Each_dregion * (N-2)

Dregion_(N-1) = Minkey + Each_dregion * (N-1)

Dregion_N = Maxkey

Here, Rnumber denotes the number of subintervals, which may equal the number of Reduce tasks. Each_dregion denotes the average value, that is, the width of each subinterval. Dregion_0 denotes the initial scale mark, which may be Minkey. Dregion_(N-1) denotes the scale mark of the (N-1)-th subinterval, and Dregion_N denotes the scale mark of the N-th subinterval, i.e., the maximum identifier value Maxkey among the data identifiers of all the data to be sorted. Since Dregion_N equals Maxkey, N equals Rnumber.
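The scale-mark formulas above can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function names `scale_marks` and `subintervals` are hypothetical, and the last mark is set to Maxkey directly (as in Dregion_N = Maxkey) so that floating-point rounding cannot move it.

```python
from typing import List, Tuple


def scale_marks(min_key: int, max_key: int, r_number: int) -> List[float]:
    """Return Dregion_0 .. Dregion_N for N = r_number subintervals."""
    if r_number <= 0:
        raise ValueError("r_number must be positive")
    if max_key < min_key:
        raise ValueError("max_key must not be smaller than min_key")
    each_dregion = (max_key - min_key) / r_number
    marks = [min_key + each_dregion * n for n in range(r_number)]
    marks.append(float(max_key))  # Dregion_N = Maxkey
    return marks


def subintervals(marks: List[float]) -> List[Tuple[float, float]]:
    """Pair consecutive marks into (left endpoint, right endpoint) subintervals."""
    return list(zip(marks[:-1], marks[1:]))


marks = scale_marks(min_key=0, max_key=100, r_number=4)
pairs = subintervals(marks)
# marks -> [0.0, 25.0, 50.0, 75.0, 100.0]
# pairs -> [(0.0, 25.0), (25.0, 50.0), (50.0, 75.0), (75.0, 100.0)]
```

Each pair's left endpoint is the previous pair's right endpoint, which is the adjacency condition stated for the subintervals.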

In this embodiment, after the interval with left endpoint value Minkey and right endpoint value Maxkey has been divided into a plurality of subintervals, the data to be sorted can be distributed based on the scale marks of the subintervals (for example, by Map tasks). That is, the subinterval to which each item of data to be sorted belongs is determined, so that the data belonging to the same subinterval forms one data set to be sorted (also called a file to be sorted), for example the FileN shown in Fig. 3. Once the files to be sorted have been sent to the Reduce tasks, the distribution of the data to be sorted is complete.
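A minimal sketch of this distribution step follows. The text does not say which subinterval owns a value that falls exactly on a shared endpoint, so the sketch assumes half-open subintervals [left, right) with the last one closed at Maxkey; that convention, the function names, and the `(key, payload)` record shape are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Record = Tuple[int, str]  # (identifier value, payload); the payload type is illustrative


def subinterval_index(key: int, min_key: int, max_key: int, r_number: int) -> int:
    """Return the 0-based index of the subinterval that contains key."""
    if r_number <= 0:
        raise ValueError("r_number must be positive")
    if not min_key <= key <= max_key:
        raise ValueError("key lies outside [min_key, max_key]")
    span = max_key - min_key
    if span == 0:
        return 0  # all keys are equal, so everything lands in the first set
    # Integer arithmetic equivalent to floor((key - min_key) / Each_dregion);
    # the clamp places key == max_key in the last subinterval.
    return min((key - min_key) * r_number // span, r_number - 1)


def distribute(records: Sequence[Record], min_key: int, max_key: int,
               r_number: int) -> List[List[Record]]:
    """Group records into one data set to be sorted per subinterval."""
    sets: List[List[Record]] = [[] for _ in range(r_number)]
    for key, payload in records:
        sets[subinterval_index(key, min_key, max_key, r_number)].append((key, payload))
    return sets


records = [(42, "a"), (7, "b"), (99, "c"), (25, "d"), (100, "e"), (0, "f")]
files = distribute(records, min_key=0, max_key=100, r_number=4)
# files[0] holds keys in [0, 25), files[1] keys in [25, 50), and so on.
```

Using integer arithmetic for the index avoids comparing integer keys against floating-point scale marks, which matters when the interval width is not a whole number.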

In this embodiment, because the identifier values of the data in each set to be sorted all fall within the same subinterval, the differences between the identifier values within each set are small, so that when the data in each set is sorted by identifier value, the overhead is approximately equal across sets.

In some optional implementations of this embodiment, the maximum and minimum identifier values of the data identifiers of the data to be sorted are preset.

In this embodiment, the maximum Maxkey and minimum Minkey of the identifier values of the data to be sorted can be preset, so that during sorting the time complexity of finding Maxkey and Minkey is O(1), which further improves the sorting efficiency of the system.
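The difference between preset and computed bounds can be sketched as follows. The function name `key_bounds` and its `preset` parameter are hypothetical: when preset bounds are supplied they are returned immediately, otherwise a single pass over the keys is needed.

```python
from typing import Iterable, Optional, Tuple


def key_bounds(keys: Iterable[int],
               preset: Optional[Tuple[int, int]] = None) -> Tuple[int, int]:
    """Return (Minkey, Maxkey), using preset bounds when they are available."""
    if preset is not None:
        return preset  # O(1): no scan over the data is required
    iterator = iter(keys)
    try:
        low = high = next(iterator)
    except StopIteration:
        raise ValueError("cannot determine bounds of an empty key collection")
    for key in iterator:  # O(n): one pass tracking both extremes
        if key < low:
            low = key
        elif key > high:
            high = key
    return low, high


scanned = key_bounds([42, 7, 99, 0])
preset = key_bounds([], preset=(0, 100))
```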

Step 203: perform the sorting operation.

In this embodiment, the sorting operation includes: sorting the data to be sorted in each data set to be sorted by the identifier values of the data identifiers. After the data in each data set to be sorted has been sorted, the overall sort of all the data to be sorted can be completed. Take two sets to be sorted, a first set and a second set, as an example. Suppose that, after sorting, the maximum identifier value in the first set is less than the minimum identifier value in the second set. Then, after the two sets have been sorted separately, it can be determined that the first set is placed before the second set. The position of each sorted set is determined in this way, which completes the overall sort of all the data to be sorted.
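The following sketch illustrates why sorting each set separately is enough. It assumes the sets are already listed in subinterval order and uses hypothetical function names; Python's built-in `sorted` stands in for whatever sort each Reduce task would actually run.

```python
from typing import List, Tuple

Record = Tuple[int, str]  # (identifier value, payload)


def sort_each_set(sets: List[List[Record]]) -> List[List[Record]]:
    """Sort every data set to be sorted by identifier value, independently."""
    return [sorted(one_set, key=lambda record: record[0]) for one_set in sets]


def concatenate(sorted_sets: List[List[Record]]) -> List[Record]:
    """Join the sorted sets in subinterval order to obtain the overall order."""
    result: List[Record] = []
    for one_set in sorted_sets:
        result.extend(one_set)
    return result


sets = [[(7, "b"), (0, "f")], [(42, "a"), (25, "d")], [], [(99, "c"), (100, "e")]]
ordered = concatenate(sort_each_set(sets))
ordered_keys = [key for key, _ in ordered]
# ordered_keys -> [0, 7, 25, 42, 99, 100]
```

No merge step is needed because every key in one set is no larger than every key in the next set.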

Referring to Fig. 4, it shows an exemplary architecture diagram suitable for the data sorting method of the present application.

Fig. 4 shows a PartA part and a PartB part. PartA provides PartB with massive unordered data, i.e., the data to be sorted, where the identifier value of each record's data identifier may be of int type. PartA may also obtain in advance the maximum and minimum identifier values of all the data to be sorted, for example by traversing all of the data, and then supply them to PartB. PartB sorts the massive data to be sorted based on the maximum and minimum identifier values. PartB contains multiple Map tasks and Reduce tasks of the Map-Reduce model of the distributed computing framework Hadoop, and PartB performs the global sort.

The Map tasks perform the distribution operation: they determine the subinterval to which each item of data to be sorted belongs, so that the data belonging to the same subinterval forms one data set to be sorted (also called a file to be sorted). Once the files to be sorted have been sent to the Reduce tasks, the distribution of the data to be sorted is complete.

Each Reduce task can sort the data it receives by the identifier values of the data identifiers, that is, sort the data belonging to one subinterval, thereby forming a locally ordered sorted file (also called a locally ordered small file). A Reduce task may also process the received data further, yielding a processed, locally ordered sorted file (also called a Reduce output file). In this embodiment, each locally ordered sorted file corresponds to one subinterval, and the largest identifier value that the data in a locally ordered file can take is already determined, namely the right endpoint value of its subinterval. The order of the locally ordered files relative to one another can therefore also be determined, and a globally ordered sorted file can be formed according to the order of those maximum possible identifier values.
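As a sketch of this last step, the snippet below orders locally ordered files by the right endpoint value of their subintervals and joins them. Representing a file as a `(right_endpoint, sorted_keys)` pair and the name `merge_local_files` are assumptions made for illustration; how real Reduce output files are labelled and collected is not specified in the text.

```python
from typing import List, Tuple

LocalFile = Tuple[float, List[int]]  # (right endpoint of the subinterval, sorted keys)


def merge_local_files(files: List[LocalFile]) -> List[int]:
    """Order locally ordered files by right endpoint and concatenate their keys."""
    merged: List[int] = []
    for _, keys in sorted(files, key=lambda local_file: local_file[0]):
        merged.extend(keys)
    return merged


# Files may finish in any order; the right endpoint fixes their final position.
local_files = [
    (75.0, [50, 60]),
    (25.0, [0, 7]),
    (100.0, [99, 100]),
    (50.0, [25, 42]),
]
global_order = merge_local_files(local_files)
# global_order -> [0, 7, 25, 42, 50, 60, 99, 100]
```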

Some optional implementations of this embodiment further include: performing the distribution operation and the sorting operation in the Hadoop Streaming mode of operation.

In this embodiment, in the Hadoop Streaming mode of operation, the partitioner data distribution interface in Hadoop can be defined, and the distribution and sorting operations can be implemented in a variety of programming languages, so that their code runs in Hadoop.
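Because Hadoop Streaming exchanges records with mapper and reducer programs as text lines on standard input and output, the distribution logic could, for example, be written as a small Python mapper. The sketch below is an assumption-laden illustration, not the patent's code: it assumes one `key<TAB>value` record per line, takes Minkey, Maxkey, and the Reduce task count as hypothetical parameters, and only prefixes each record with its subinterval index. How that index is turned into an actual Reduce partition depends on the partitioner configured for the job, which is not shown here.

```python
from typing import Iterable, Iterator


def map_lines(lines: Iterable[str], min_key: int, max_key: int,
              r_number: int) -> Iterator[str]:
    """Prefix each 'key<TAB>value' line with the index of its subinterval."""
    span = max_key - min_key
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key_text, _, value = line.partition("\t")
        key = int(key_text)
        if not min_key <= key <= max_key:
            raise ValueError("key %d lies outside the preset bounds" % key)
        if span == 0:
            index = 0
        else:
            # Same index rule as floor((key - Minkey) / Each_dregion), clamped
            # so that key == Maxkey falls in the last subinterval.
            index = min((key - min_key) * r_number // span, r_number - 1)
        yield "%d\t%d\t%s" % (index, key, value)


# In a streaming job the lines would come from sys.stdin and be printed to stdout;
# a small in-memory sample is used here instead.
sample = ["42\tpage-a\n", "7\tpage-b\n", "100\tpage-c\n"]
mapped = list(map_lines(sample, min_key=0, max_key=100, r_number=4))
# mapped -> ["1\t42\tpage-a", "0\t7\tpage-b", "3\t100\tpage-c"]
```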

Taking the Map-Reduce model as an example, the difference between the sorting approach of this embodiment and that of the prior art is as follows. In the prior art, when Map tasks distribute the data to be sorted to Reduce tasks according to the predicted distribution of identifier values, the differences between identifier values may be small in some intervals and large in others. Consequently, when the data in an interval with small differences has finished sorting, it must wait for the data in the intervals with larger differences to finish, and the thread performing the sort for that interval is suspended, which lowers the sorting efficiency of the whole system.

In this embodiment, by contrast, the identifier values of all the data that the Map tasks distribute to the same Reduce task fall within the same subinterval, so the differences between the identifier values of the data sent to one Reduce task are small. When each Reduce task sorts the data it receives by identifier value, the overhead is therefore approximately equal across tasks, which improves the sorting efficiency of the whole system.
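One way to inspect how the work is spread across Reduce tasks is to count the keys that land in each subinterval. The sketch below does this for evenly spaced sample keys, which are assumed purely for illustration; the function name `set_sizes` is hypothetical and the index rule is the same half-open convention assumed for the distribution step.

```python
from collections import Counter
from typing import Iterable, List


def set_sizes(keys: Iterable[int], min_key: int, max_key: int,
              r_number: int) -> List[int]:
    """Count how many keys fall into each of the r_number subintervals."""
    span = max_key - min_key
    counts: Counter = Counter()
    for key in keys:
        if span == 0:
            index = 0
        else:
            index = min((key - min_key) * r_number // span, r_number - 1)
        counts[index] += 1
    return [counts[i] for i in range(r_number)]


# Evenly spaced sample keys 0, 5, 10, ..., 100 split across 4 Reduce tasks.
sizes = set_sizes(range(0, 101, 5), min_key=0, max_key=100, r_number=4)
# sizes -> [5, 5, 5, 6]
```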

With further reference to Fig. 5, as an implementation of the methods shown in the figures above, the present application provides an embodiment of a data sorting device. This device embodiment corresponds to the data sorting method embodiment shown in Fig. 2, and the device can be applied in various electronic devices.

As shown in Fig. 5, the data sorting device 500 of this embodiment includes an acquiring unit 501, a distribution unit 502, and a sorting unit 503. The acquiring unit 501 is configured to obtain data to be sorted and the data identifiers of the data to be sorted. The distribution unit 502 is configured to perform a distribution operation: determining the maximum and minimum identifier values among the data identifiers; dividing the interval whose right endpoint value and left endpoint value are the maximum and the minimum, respectively, into a plurality of subintervals, wherein each subinterval satisfies the following condition: its left endpoint value is the right endpoint value of the preceding subinterval, and its right endpoint value is the left endpoint value of the following subinterval; determining the subinterval to which the identifier value of the data identifier of each item of data to be sorted belongs; and generating a plurality of data sets to be sorted, each corresponding to one subinterval. The sorting unit 503 is configured to perform a sorting operation: sorting the data to be sorted in each data set to be sorted by the identifier values of the data identifiers.

In this embodiment, the acquiring unit 501 can obtain the data to be sorted and its data identifiers. The data to be sorted may be organized as individual records. For example, the data to be sorted may be cookies that record information associated with a user's network behavior. One cookie may contain the URL of a web page the user browsed and the browsing time. Accordingly, each cookie may correspond to one data identifier.

In this embodiment, the distribution unit 502 may first find the maximum and minimum identifier values among the data identifiers of all the data to be sorted. Once they have been found, an interval can be determined whose left endpoint is the minimum and whose right endpoint is the maximum. This interval can then be divided into a plurality of consecutive subintervals, that is, the left endpoint value of each subinterval is the right endpoint value of the preceding subinterval, and its right endpoint value is the left endpoint value of the following subinterval. After the consecutive subintervals have been marked off, the subinterval to which the identifier value of each item of data to be sorted belongs can be determined, so that the data belonging to the same subinterval forms one data set to be sorted.

In this embodiment, the sorting unit 503 can sort the data to be sorted in each data set to be sorted by the identifier values of the data identifiers. After the data in each data set to be sorted has been sorted, the overall sort of all the data to be sorted can be completed.

In some optional implementations of this embodiment, the device 500 further includes: a distribution operation execution unit (not shown), configured to perform the distribution operation with Map tasks in the Map-Reduce model of the distributed computing framework Hadoop; and a sorting operation execution unit (not shown), configured to perform the sorting operation with Reduce tasks in the Map-Reduce model.

In some optional implementations of this embodiment, the distribution unit 502 includes a calculation subunit (not shown), configured to calculate the right endpoint value of each subinterval with the following formulas: Nmaxkey = Minkey + Average * N; Average = (Maxkey - Minkey) / Rnumber; where Nmaxkey denotes the right endpoint value of the N-th subinterval, Minkey denotes the minimum identifier value of the data identifiers, Average denotes the average value, Maxkey denotes the maximum identifier value of the data identifiers, and Rnumber denotes the number of Reduce tasks.

In some optional implementations of this embodiment, the device 500 further includes a setting unit (not shown), configured to preset the maximum and minimum identifier values of the data identifiers of the data to be sorted.

In some optional implementations of this embodiment, the device 500 further includes an execution unit (not shown), configured to perform the distribution operation and the sorting operation in the Hadoop Streaming mode of operation.

Those skilled in the art will understand that the data sorting device 500 also includes other well-known structures, such as a processor and memory, which are not shown in Fig. 5 so as not to unnecessarily obscure the embodiments of the present disclosure.

Fig. 6 shows a structural diagram of a computer system suitable for implementing a terminal device or server of the embodiments of the present application.

As shown in Fig. 6, the computer system 600 includes a central processing unit (CPU) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random access memory (RAM) 603. The RAM 603 also stores the various programs and data required for the operation of the system 600. The CPU 601, the ROM 602, and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card or a modem. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from it can be installed into the storage section 608 as needed.

In particular, according to embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611.

The flowcharts and block diagrams in the drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should further be noted that each block in the block diagrams and/or flowcharts, and combinations of such blocks, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

In another aspect, the application also provides a non-volatile computer storage medium. This non-volatile computer storage medium may be the one included in the device described in the above embodiments, or it may exist separately without being assembled into a terminal. The non-volatile computer storage medium stores one or more programs which, when executed by a device, cause the device to: obtain data to be sorted and data identifiers of the data to be sorted; perform a distribution operation: determine the maximum and the minimum of the identifier values of the data identifiers; divide the interval whose right endpoint value and left endpoint value are the maximum and the minimum, respectively, into a plurality of subintervals, wherein each subinterval satisfies the following condition: its left endpoint value is the right endpoint value of the preceding subinterval, and its right endpoint value is the left endpoint value of the following subinterval; determine the subinterval to which the identifier value of the data identifier of each piece of data to be sorted belongs; generate a plurality of data sets to be sorted, each of which corresponds to one subinterval; and perform a sorting operation: sort the data to be sorted in each data set to be sorted according to the magnitude of the identifier values of the data identifiers.
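The distribution and sorting operations described above can be illustrated with a minimal single-process sketch. It is not the implementation of the application: the function names (`subinterval_index`, `distribute`, `range_partition_sort`), the use of numeric identifier values, and the rule that a value lying exactly on a shared endpoint is assigned to the upper subinterval are all assumptions made for illustration only.

```python
from typing import Any, List, Tuple

Record = Tuple[float, Any]  # (identifier value, payload); hypothetical layout


def subinterval_index(key: float, min_key: float, max_key: float,
                      num_sets: int) -> int:
    """Return the 0-based index of the subinterval that contains `key`.

    Assumption: a key on a shared endpoint goes to the upper subinterval,
    except the maximum, which is clamped into the last subinterval.
    """
    if max_key == min_key:
        return 0  # degenerate interval: everything lands in one set
    average = (max_key - min_key) / num_sets  # width of one subinterval
    index = int((key - min_key) / average)
    return min(max(index, 0), num_sets - 1)


def distribute(records: List[Record], num_sets: int) -> List[List[Record]]:
    """Distribution operation: one data set to be sorted per subinterval."""
    keys = [key for key, _ in records]
    min_key, max_key = min(keys), max(keys)
    data_sets: List[List[Record]] = [[] for _ in range(num_sets)]
    for key, payload in records:
        idx = subinterval_index(key, min_key, max_key, num_sets)
        data_sets[idx].append((key, payload))
    return data_sets


def range_partition_sort(records: List[Record],
                         num_sets: int) -> List[Record]:
    """Distribute, sort each set by identifier value, then concatenate."""
    if not records:
        return []
    result: List[Record] = []
    for data_set in distribute(records, num_sets):
        # Sorting operation, applied independently to each data set.
        result.extend(sorted(data_set, key=lambda record: record[0]))
    return result


if __name__ == "__main__":
    sample = [(42, "a"), (7, "b"), (99, "c"), (7, "d"), (63, "e"), (25, "f")]
    print(range_partition_sort(sample, 4))
```

Because the subintervals are ordered and every identifier value in one data set is no larger than any value in the next, concatenating the independently sorted sets yields a globally ordered result in this sketch.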

The above description covers only the preferred embodiments of the application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the application is not limited to technical solutions formed by the particular combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the application.

Claims

1. A data sorting method, characterized in that the method comprises:

obtaining data to be sorted and data identifiers of the data to be sorted;

performing a distribution operation: determining the maximum and the minimum of the identifier values of the data identifiers; dividing the interval whose right endpoint value and left endpoint value are the maximum and the minimum, respectively, into a plurality of subintervals, wherein each subinterval satisfies the following condition: its left endpoint value is the right endpoint value of the preceding subinterval, and its right endpoint value is the left endpoint value of the following subinterval; determining the subinterval to which the identifier value of the data identifier of each piece of data to be sorted belongs; and generating a plurality of data sets to be sorted, each of which corresponds to one subinterval;

performing a sorting operation: sorting the data to be sorted in each data set to be sorted according to the magnitude of the identifier values of the data identifiers.

2. The method according to claim 1, characterized in that the method further comprises:

performing the distribution operation using Map tasks in the Map-Reduce model of the distributed computing framework Hadoop, and performing the sorting operation using Reduce tasks in the Map-Reduce model.

3. The method according to claim 2, characterized in that dividing the interval whose right endpoint value and left endpoint value are the maximum and the minimum, respectively, into a plurality of subintervals comprises:

calculating the right endpoint value of each subinterval using the following formulas:

Nmaxkey = Minkey + Average * N; Average = (Maxkey - Minkey) / Rnumber;

wherein Nmaxkey represents the right endpoint value of the N-th subinterval, Minkey represents the minimum of the identifier values of the data identifiers, Average represents the average value, Maxkey represents the maximum of the identifier values of the data identifiers, and Rnumber represents the number of Reduce tasks.

4. The method according to claim 3, characterized in that the method further comprises:

presetting the maximum and the minimum of the identifier values of the data identifiers of the data to be sorted.

5. The method according to claim 4, characterized in that the method further comprises: performing the distribution operation and the sorting operation in the Hadoop Streaming working mode.

6. A data sorting device, characterized in that the device comprises:

an acquiring unit, configured to obtain data to be sorted and data identifiers of the data to be sorted;

a distribution unit, configured to perform a distribution operation: determining the maximum and the minimum of the identifier values of the data identifiers; dividing the interval whose right endpoint value and left endpoint value are the maximum and the minimum, respectively, into a plurality of subintervals, wherein each subinterval satisfies the following condition: its left endpoint value is the right endpoint value of the preceding subinterval, and its right endpoint value is the left endpoint value of the following subinterval; determining the subinterval to which the identifier value of the data identifier of each piece of data to be sorted belongs; and generating a plurality of data sets to be sorted, each of which corresponds to one subinterval;

a sorting unit, configured to perform a sorting operation: sorting the data to be sorted in each data set to be sorted according to the magnitude of the identifier values of the data identifiers.

7. The device according to claim 6, characterized in that the device further comprises:

a distribution operation execution unit, configured to perform the distribution operation using Map tasks in the Map-Reduce model of the distributed computing framework Hadoop;

a sorting operation execution unit, configured to perform the sorting operation using Reduce tasks in the Map-Reduce model.

8. The device according to claim 7, characterized in that the distribution unit comprises:

a calculation subunit, configured to calculate the right endpoint value of each subinterval using the following formulas:

Nmaxkey = Minkey + Average * N; Average = (Maxkey - Minkey) / Rnumber;

9. The device according to claim 8, characterized in that the device further comprises:

a setting unit, configured to preset the maximum and the minimum of the identifier values of the data identifiers of the data to be sorted.

10. The device according to claim 9, characterized in that the device further comprises:

an execution unit, configured to perform the distribution operation and the sorting operation in the Hadoop Streaming working mode.
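As an illustration of the endpoint formulas recited in claims 3 and 8, the following worked example evaluates them for hypothetical values that do not appear in the application: Minkey = 0, Maxkey = 100, and Rnumber = 4.

```latex
\begin{aligned}
\mathrm{Average} &= \frac{\mathrm{Maxkey}-\mathrm{Minkey}}{\mathrm{Rnumber}}
                  = \frac{100-0}{4} = 25,\\
\mathrm{Nmaxkey}\big|_{N=1} &= \mathrm{Minkey}+\mathrm{Average}\cdot 1 = 0+25\cdot 1 = 25,\\
\mathrm{Nmaxkey}\big|_{N=2} &= 0+25\cdot 2 = 50,\\
\mathrm{Nmaxkey}\big|_{N=3} &= 0+25\cdot 3 = 75,\\
\mathrm{Nmaxkey}\big|_{N=4} &= 0+25\cdot 4 = 100 = \mathrm{Maxkey}.
\end{aligned}
```

Under these assumed values the four subintervals run from 0 to 25, 25 to 50, 50 to 75, and 75 to 100, so each right endpoint is the left endpoint of the following subinterval and the last right endpoint coincides with Maxkey.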