Detailed Description of the Embodiments
To make the objectives, technical solutions, and advantages of the present application clearer, the technical solutions of the present application are described clearly and completely below with reference to specific embodiments and the corresponding accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
To address the low computational efficiency caused by conventional quantile calculation in the prior art, an embodiment of the present application provides a data processing method that can effectively reduce the number of computations performed during quantile calculation and reduce I/O operations on the disk, thereby improving the efficiency of quantile calculation.
It should be noted that the data processing method described in the embodiments of the present application may be based on the architecture shown in Fig. 1a. The architecture in Fig. 1a includes a client, a computing device, and a database. The client may send the computing device a service request that involves quantile calculation. In practice, the client includes but is not limited to a service application or a functional unit in an operating system that has client functionality (in Fig. 1a the client is represented by a terminal device shaped like a mobile phone, which is only an example). The client may be an enterprise client or an individual user's client, which is not specifically limited here. The computing device is used to perform the quantile calculation and includes but is not limited to a server, a mainframe computer, and the like. While performing the quantile calculation, the computing device may perform I/O operations on the database. The database is used to store data, and the computing device performs the quantile calculation based on the data stored in the database.
Of course, in some application scenarios, the database shown in Fig. 1a may also be regarded as a disk that is arranged inside the computing device and used for data storage. This is not specifically limited here.
Based on the architecture shown in Fig. 1a, the data processing procedure provided by an embodiment of the present application is shown in Fig. 1b. The procedure is generally executed by the computing device shown in Fig. 1a. In some application scenarios the client itself has a quantile calculation function, in which case the data processing procedure shown in Fig. 1b may also be executed by the client.
The data processing procedure shown in Fig. 1b specifically includes the following steps:
S101: Obtain a plurality of items of data to be processed.
In practice, the data to be processed may be regarded as data that can be sorted. In one scenario, the data to be processed may be numerical values, such as a user's historical spending amounts, age, height, or weight.
Of course, besides numerical values, in other application scenarios the data may be letters (which can be sorted alphabetically) or character strings that can be sorted according to a corresponding ordering rule, and so on. This is not specifically limited here.
Generally, the data to be processed is stored in a database, so the server can obtain it from the database. As one feasible approach in the embodiments of the present application, the database may be a relational database. Because the data in the relational database is stored according to corresponding "key-value" relationships, the server can use a given key to obtain the plurality of data items corresponding to that key as the data to be processed, thereby forming a set of data to be processed. Of course, this approach should not be construed as limiting the present application.
It should be noted here that the obtained items of data to be processed have not been sorted. For example, the obtained data to be processed includes {4.0, 1.5, 9.0, 3.5, 6.5, 8.5}; as can be seen, these items are unsorted.
It can be understood that, in practice, the computing device may be triggered to obtain the data to be processed by a service request that involves quantile calculation.
S102: Divide each item of data to be processed into a plurality of preset intervals.
In the embodiments of the present application, the plurality of preset intervals may be consecutive half-open intervals, for example: (0, 5], (5, 10].
The lengths of the intervals may be the same or different and may be set according to the needs of the actual application.
Once the intervals have been preset, each item of data to be processed can be divided into them. In the embodiments of the present application, the division may be made according to the ordering magnitude of each item (for example, its numerical value).
Combining the two preceding examples, the data to be processed can be divided into the two consecutive intervals above; that is, 4.0, 1.5, and 3.5 are divided into the interval (0, 5], and 9.0, 6.5, and 8.5 are divided into the interval (5, 10].
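By way of illustration only, the division in step S102 might be sketched in Python as follows. The sketch assumes equal-length intervals that are open on the left and closed on the right, as in the example; the function names `assign_interval` and `partition` and the parameters `width` and `lower` are hypothetical and do not appear in the application.

```python
import math
from collections import defaultdict

def assign_interval(value, width=5.0, lower=0.0):
    """Return the (low, high] interval of the given width that contains value."""
    # ceil() places a value that sits exactly on a boundary (e.g. 5.0)
    # into the interval that the boundary closes, i.e. (0, 5].
    k = math.ceil((value - lower) / width)
    return (lower + (k - 1) * width, lower + k * width)

def partition(values, width=5.0, lower=0.0):
    """Group unsorted values by the interval each one falls into."""
    buckets = defaultdict(list)
    for v in values:
        buckets[assign_interval(v, width, lower)].append(v)
    return dict(buckets)

data = [4.0, 1.5, 9.0, 3.5, 6.5, 8.5]   # unsorted data from step S101
buckets = partition(data)
for interval, members in sorted(buckets.items()):
    print(interval, members)
```

Each item is placed with a single arithmetic operation and no comparison against other items, which is why no sorting pass is needed at this stage.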
S103: Determine an approximation of the data to be processed in each interval.
As mentioned above, when a quantile is calculated using a quantile algorithm of the prior art, it is usually necessary to traverse every data item, compare the items, and then sort them. Obviously, such a calculation leads to excessive I/O operations. Therefore, in the embodiments of the present application, in order to reduce the number of I/O operations, the data to be processed that has been divided into the different intervals is replaced by corresponding approximations. That is, in the embodiments of the present application, an approximation of the data to be processed in each interval is determined.
As one approach in the embodiments of the present application, for any interval, the mean of the data to be processed that falls into the interval may be used as the approximation. Continuing the above example: for the data divided into the interval (0, 5], the approximation is (4.0 + 1.5 + 3.5) / 3 = 3; for the data divided into the interval (5, 10], the approximation is (9.0 + 6.5 + 8.5) / 3 = 8.
As another approach in the embodiments of the present application, for any interval, the mean of the interval's endpoint values may be used as the approximation. Again continuing the above example: for the data divided into the interval (0, 5], the approximation is (0 + 5) / 2 = 2.5; for the data divided into the interval (5, 10], the approximation is (5 + 10) / 2 = 7.5.
Either of the two approaches may be used in an actual calculation. The choice depends on the needs of the actual application and is not specifically limited here.
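The two approaches can be compared side by side in a short sketch. The function names `approx_by_mean` and `approx_by_midpoint` are illustrative assumptions; the bucket contents are those of the running example.

```python
from statistics import mean

# Intervals (low, high] mapped to the data divided into them in step S102.
buckets = {
    (0.0, 5.0): [4.0, 1.5, 3.5],
    (5.0, 10.0): [9.0, 6.5, 8.5],
}

def approx_by_mean(buckets):
    """Approximation = mean of the data that fell into each interval."""
    return {iv: mean(vals) for iv, vals in buckets.items() if vals}

def approx_by_midpoint(buckets):
    """Approximation = mean of the two endpoint values of each interval."""
    return {(low, high): (low + high) / 2 for (low, high) in buckets}

mean_approx = approx_by_mean(buckets)
mid_approx = approx_by_midpoint(buckets)
print(mean_approx)
print(mid_approx)
```

The mean-based variant reflects the actual data in the interval but requires reading the values; the midpoint variant depends only on the interval boundaries and can therefore be determined without touching the data at all.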
S104: According to the approximations, determine the quantile corresponding to the set of data to be processed.
It should be noted here that the approximations determined in the preceding step serve as the approximations of the data to be processed in the respective intervals. In this way, the originally large number of data items is converted into corresponding approximations, and the total number of approximations is smaller, possibly much smaller, than the number of obtained data items.
Continuing the previous example: suppose the approximation of the interval (0, 5] is 3 and the approximation of the interval (5, 10] is 8. Then the data items 4.0, 1.5, and 3.5 divided into the interval (0, 5] are approximated as 3, and the data items 9.0, 6.5, and 8.5 divided into the interval (5, 10] are approximated as 8. The quantile can then be calculated from the two values 3 and 8. Obviously, compared with existing quantile calculation, the data processing procedure in the embodiments of the present application can effectively reduce the number of comparisons and, in turn, the number of I/O operations on the database.
Of course, when the quantile calculation is performed on the basis of the approximations, the calculation itself may still use an existing quantile algorithm, which is not described in detail here.
Through the above steps, when a quantile calculation is performed, the computing device (for example, a server) obtains from the database the plurality of data items on which the quantile is to be calculated, as the data to be processed. The computing device then divides these data items into a plurality of preset intervals and determines a corresponding approximation for the data divided into each interval. An approximation can represent a kind of average of the data falling into an interval. The values of the data items in the different intervals can thus be replaced by the approximations for the quantile calculation. Precisely because an approximation replaces the value of every data item divided into an interval, the amount of data that takes part in the calculation is reduced. This reduces the number of traversals required when calculating the quantile, further reduces the I/O operations on the database, and improves the efficiency of the quantile calculation to a certain extent.
It should be noted here that, when the quantile is calculated using the above method, the obtained data to be processed may be unevenly distributed; that is, most of the data falls into one interval while only a small portion falls into the other intervals.
For example, suppose the data to be processed includes {1.1, 1.2, 1.3, 1.4, 1.5, 6.1} and the preset intervals are still (0, 5] and (5, 10]. In this case, after the division, the data items 1.1, 1.2, 1.3, 1.4, and 1.5 are divided into the interval (0, 5], and the data item 6.1 is divided into the interval (5, 10].
As can be seen, this data distribution is heavily skewed (that is, the data is concentrated excessively in one interval). When this happens, the above quantile calculation will have low accuracy. Therefore, in one approach in the embodiments of the present application, after the data to be processed is obtained, a check is performed first to determine whether the distribution of the data to be processed is sufficiently uniform. That is, before each item of data to be processed is divided into the plurality of preset intervals, the method further includes: sampling the obtained data to be processed to generate a plurality of items of sample data, dividing each item of sample data into the plurality of preset intervals, determining the information entropy corresponding to the divided sample data, and determining that the information entropy does not exceed a set threshold.
As the above shows, the data to be processed is sampled so that the check does not consume too much time; the number of sample data items obtained by sampling is smaller than the number of data items to be processed. It can be understood that the distribution of the sample data reflects the distribution of the data to be processed to a certain extent and that, because there are fewer sample data items than data items to be processed, checking with the sample data reduces the time the check takes.
In the embodiments of the present application, the sampling may be performed as follows: according to a preset sampling ratio, sampling with replacement is performed on the obtained data to be processed. The sampling ratio is any value in (0, 1] and represents the proportion of the total data to be processed that is to be collected as samples. In practice, the sampling ratio is usually small, to ensure that the number of collected sample data items is small.
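A minimal sketch of sampling with replacement at a preset ratio is given below. It relies on `random.Random.choices` from the Python standard library, which draws with replacement; the function name `sample_with_replacement`, the rounding rule for the sample size, and the `seed` parameter are assumptions made for the illustration.

```python
import math
import random

def sample_with_replacement(values, ratio, seed=None):
    """Draw about len(values) * ratio items, allowing the same item to recur."""
    if not 0 < ratio <= 1:
        raise ValueError("ratio must lie in (0, 1]")
    if not values:
        return []
    rng = random.Random(seed)          # seed only makes the sketch repeatable
    k = max(1, math.ceil(len(values) * ratio))
    return rng.choices(values, k=k)    # choices() samples with replacement

population = [round(0.1 * i, 1) for i in range(1, 101)]   # 100 illustrative values
samples = sample_with_replacement(population, ratio=0.25, seed=0)
print(len(samples))
```

Because items are drawn independently, the sample can be produced without tracking which items have already been selected.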
Once the sample data has been collected, it can be used for the check; that is, the sample data is divided into the plurality of preset intervals (the same intervals as described above), and the information entropy corresponding to the sample data divided into the different intervals is determined.
The information entropy may be determined as follows: for each interval, the proportion of the sample data in that interval relative to all the sample data is counted and used as the information probability of the interval; the information entropy corresponding to the sample data is then determined from the information probabilities of the intervals.
Specifically, the information entropy may be calculated by a formula whose symbols are defined as follows:
X is the sample set, and x_i is the sample data divided into the i-th interval;
H(X) is the value of the information entropy of the sample data;
P(x_i) is the probability of the sample data divided into the i-th interval (its value is the proportion of the sample data in the i-th interval within the sample set);
b is a constant, usually taken as 2.
It should be noted here that the information entropy calculated by the above process reflects the distribution of the data to a certain extent. It may be considered that the more concentrated the distribution of the sample data is, the larger the value of its information entropy, and the more dispersed the distribution is, the smaller the value. Therefore, in the embodiments of the present application, after the information entropy corresponding to the sample data is determined, it is further compared with a preset threshold, and the quantile calculation is performed using the above method only when the value of the information entropy is not greater than the threshold.
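The entropy computation can be sketched as follows. The sketch assumes the standard Shannon form, H(X) = -Σ P(x_i) log_b P(x_i), with b = 2; this form, the function names, and the interval width are assumptions of the illustration. Under this standard form, a sample concentrated in a single interval yields a value close to 0, while an evenly spread sample yields a larger value, so an implementation should choose the direction of the threshold comparison to match the sign convention it adopts.

```python
import math
from collections import Counter

def interval_index(value, width=5.0, lower=0.0):
    """Index k of the interval (lower + (k-1)*width, lower + k*width]."""
    return math.ceil((value - lower) / width)

def sample_entropy(samples, width=5.0, lower=0.0, base=2):
    """Entropy of the per-interval proportions of the sample data."""
    if not samples:
        raise ValueError("at least one sample is required")
    counts = Counter(interval_index(v, width, lower) for v in samples)
    total = len(samples)
    # P(x_i): share of the samples that fell into the i-th interval.
    probs = [c / total for c in counts.values()]
    return -sum(p * math.log(p, base) for p in probs)

even = [4.0, 1.5, 9.0, 3.5, 6.5, 8.5]       # 3 items in each interval
skewed = [1.1, 1.2, 1.3, 1.4, 1.5, 6.1]     # 5 items in (0, 5], 1 in (5, 10]

h_even = sample_entropy(even)
h_skewed = sample_entropy(skewed)
print(h_even, h_skewed)
```

Only the per-interval counts enter the computation, so the check costs one pass over the sample data regardless of how many intervals are configured.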
In the embodiments of the present application, the quantile may be calculated from the approximations as follows: for each interval, the cumulative proportion of the data divided into that interval, relative to all the data to be processed, is determined; the quantile position corresponding to the quantile to be calculated is determined; the cumulative proportion that matches this quantile position is determined, together with the interval corresponding to that cumulative proportion; and the approximation corresponding to the determined interval is taken as the approximation of the quantile to be calculated.
The calculation of the quantile in the embodiments of the present application is explained below using a concrete example.
In this example, suppose a group of unordered data to be processed includes {x1, x2, x3, x4, x5}, whose specific values are shown in Fig. 2a. Suppose also that the preset interval length is 5, with the specific values likewise shown in Fig. 2a. In Fig. 2a, s1 to s4 denote the identifiers of 4 consecutive intervals. The data to be processed can thus be divided into the 4 preset intervals.
Taking interval s2 in Fig. 2a as an example, in {(5, 10], 3}, "(5, 10]" denotes the range of interval s2 and "3" denotes the number of data items divided into the interval. As can be seen, the data items x1 to x5 are divided into intervals s2, s3, and s4.
Corresponding approximations can then be determined for the data divided into intervals s2, s3, and s4. It is assumed here that the approximation is calculated as (a + b) / 2, where a and b denote the two endpoint values of an interval. The approximation corresponding to s2 is then 7.5, that corresponding to s3 is 12.5, and that corresponding to s4 is 17.5.
Suppose that, in practice, the user requests the 75th quantile. With the existing calculation, each data item would be traversed one by one, as shown in Fig. 2b (in Fig. 2b, the abscissa represents the quantile position and the ordinate represents the value of the data to be processed), and the calculation would finally determine that the 75th quantile is x4 (that is, 10). This process requires multiple I/O operations. With the approximation approach of the embodiments of the present application, the quantile can instead be calculated from the three approximations 7.5, 12.5, and 17.5.
Specifically, for interval s1, no data to be processed is divided into the interval, so the cumulative proportion of its data in the overall data to be processed is 0.
For interval s2, the cumulative proportion of the data to be processed is 60%; that is, the approximation 7.5 corresponding to interval s2 can serve as the approximate value of the 1st to 60th quantiles.
For interval s3, the cumulative proportion of the data to be processed is 80%; that is, the approximation 12.5 corresponding to interval s3 can serve as the approximate value of the 61st to 80th quantiles.
For interval s4, the cumulative proportion of the data to be processed is 100%; that is, the approximation 17.5 corresponding to interval s4 can serve as the approximate value of the 81st to 100th quantiles.
That is, as shown in Fig. 2c (in Fig. 2c, the abscissa represents the quantile position and the ordinate represents the value of the approximation), if the user needs the 75th quantile, the 75th quantile obtained is 12.5.
Of course, in practice, after the corresponding quantile has been calculated, the result fed back to the user may be a vector whose length depends on the quantile range set in the calculation request. For example, the user may set the requested quantile range to 100, meaning that 100 quantiles need to be calculated, and at the same time request the 75th quantile. The vector fed back to the user then has a length of 100 and is arranged in the order of the 1st to 100th quantiles.
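The lookup in the worked example, and the 100-element result vector, can be sketched as follows. Only the per-interval counts (0, 3, 1, 1) and the interval length of 5 come from the example; the interval boundaries starting at 0 are inferred from them, and the function name `approximate_quantile` and the use of `bisect_left` are assumptions of the illustration. Cumulative counts are compared in integer arithmetic to avoid floating-point issues at boundaries such as the 60th quantile.

```python
from bisect import bisect_left
from itertools import accumulate

# Intervals s1..s4 as (low, high] and the number of data items in each.
intervals = [(0, 5), (5, 10), (10, 15), (15, 20)]
counts = [0, 3, 1, 1]

def approximate_quantile(q, intervals, counts):
    """Midpoint approximation for the q-th quantile, with 1 <= q <= 100."""
    if not 1 <= q <= 100:
        raise ValueError("q must be between 1 and 100")
    total = sum(counts)
    if total == 0:
        raise ValueError("no data to be processed")
    # Cumulative counts scaled by 100: [0, 300, 400, 500] for the example,
    # i.e. cumulative proportions of 0%, 60%, 80% and 100%.
    scaled = [c * 100 for c in accumulate(counts)]
    # First interval whose cumulative proportion reaches q percent.
    idx = bisect_left(scaled, q * total)
    low, high = intervals[idx]
    return (low + high) / 2

p75 = approximate_quantile(75, intervals, counts)
result_vector = [approximate_quantile(q, intervals, counts) for q in range(1, 101)]
print(p75, len(result_vector))
```

The whole vector is derived from four counts, so answering any number of quantile requests requires no further access to the underlying data.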
The above describes the data processing method provided by the embodiments of the present application. Based on the same idea, an embodiment of the present application further provides a data processing apparatus, as shown in Fig. 3. The apparatus includes:
a data acquisition module 301, configured to obtain a plurality of items of data to be processed;
a division module 302, configured to divide each item of data to be processed into a plurality of preset intervals;
an approximation determining module 303, configured to determine an approximation of the data to be processed in each interval; and
a quantile calculation module 304, configured to determine, according to the approximations, the quantile corresponding to the data to be processed.
The apparatus further includes a data distribution detection module 305, configured to sample the obtained data to be processed to generate a plurality of items of sample data, divide each item of sample data into the plurality of preset intervals, determine the information entropy corresponding to the divided sample data, and determine that the information entropy is not greater than a set threshold.
Further, the data distribution detection module 305 performs sampling with replacement on the obtained data to be processed according to a preset sampling ratio.
The data distribution detection module 305 determines the value corresponding to each item of sample data and, according to these values, divides each item of sample data into the plurality of preset intervals.
The data distribution detection module 305, for each interval, counts the proportion of the sample data in that interval relative to all the sample data as the information probability of the interval, and determines the information entropy corresponding to the sample data from the information probabilities of the intervals.
The division module 302 determines the value corresponding to each item of data to be processed and, according to these values, divides each item of data to be processed into the plurality of preset intervals.
The approximation determining module 303, for any interval, determines the endpoint values of the interval, calculates the mean of the endpoint values, and takes it as the approximation corresponding to the data to be processed divided into the interval; or, for any interval, collects the values of the data items in the interval, calculates their mean, and takes it as the approximation corresponding to the data to be processed divided into the interval.
The quantile calculation module 304, for each interval, determines the cumulative proportion of the data divided into that interval relative to all the data to be processed; determines the quantile position corresponding to the quantile to be calculated; determines the cumulative proportion that matches this quantile position and the interval corresponding to that cumulative proportion; and takes the approximation of the determined interval as the approximation of the quantile to be calculated.
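One possible way to map modules 301 to 304 onto software is sketched below. The class `QuantileDevice`, its method names, and the sample values are hypothetical; the sketch assumes equal-length intervals and the endpoint-mean approximation, and it omits the optional data distribution detection module 305.

```python
import math
from bisect import bisect_left
from collections import defaultdict
from itertools import accumulate

class QuantileDevice:
    """Illustrative grouping of modules 301-304 in one class."""

    def __init__(self, width=5.0, lower=0.0):
        self.width = width
        self.lower = lower

    def acquire(self, source):
        """Module 301: obtain the (unsorted) data to be processed."""
        return list(source)

    def divide(self, values):
        """Module 302: map interval index k -> values in (low, high]."""
        buckets = defaultdict(list)
        for v in values:
            buckets[math.ceil((v - self.lower) / self.width)].append(v)
        return dict(sorted(buckets.items()))

    def approximate(self, buckets):
        """Module 303: mean of the two endpoint values of each interval."""
        return {k: self.lower + (k - 0.5) * self.width for k in buckets}

    def quantile(self, q, buckets, approx):
        """Module 304: approximation of the interval matching the q-th quantile."""
        keys = list(buckets)
        total = sum(len(buckets[k]) for k in keys)
        scaled = [c * 100 for c in accumulate(len(buckets[k]) for k in keys)]
        return approx[keys[bisect_left(scaled, q * total)]]

    def run(self, source, q):
        values = self.acquire(source)
        if not values or not 1 <= q <= 100:
            raise ValueError("need data and a quantile position in 1..100")
        buckets = self.divide(values)
        return self.quantile(q, buckets, self.approximate(buckets))

device = QuantileDevice()
# Hypothetical values chosen to give 3, 1 and 1 items in (5,10], (10,15], (15,20].
result = device.run([6.0, 12.0, 7.0, 18.0, 9.0], q=75)
print(result)
```

Keeping each module as a separate method mirrors the apparatus description and allows, for example, the approximation method to be swapped for the mean-based variant without touching the other steps.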
In the 1990s, an improvement to a technology could be clearly distinguished as either a hardware improvement (for example, an improvement to a circuit structure such as a diode, a transistor, or a switch) or a software improvement (an improvement to a method flow). With the development of technology, however, improvements to many of today's method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented by a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs a digital system onto a single PLD without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, this programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. A person skilled in the art should also understand that a hardware circuit implementing a logical method flow can be readily obtained merely by performing a little logic programming of the method flow in one of the above hardware description languages and programming it into an integrated circuit.
A controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include but are not limited to the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. A person skilled in the art also knows that, in addition to implementing the controller purely as computer-readable program code, the method steps can be logic-programmed so that the controller implements the same functions in the form of logic gates, switches, an application-specific integrated circuit, a programmable logic controller, an embedded microcontroller, and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for implementing various functions may also be regarded as structures within the hardware component. Indeed, the means for implementing various functions may be regarded both as software modules implementing a method and as structures within a hardware component.
The systems, apparatuses, modules, or units described in the above embodiments may be implemented by a computer chip or an entity, or by a product having a certain function. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an e-mail device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described in terms of various units divided by function. Of course, when the present application is implemented, the functions of the units may be implemented in one or more pieces of software and/or hardware.
A person skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.
The memory may include computer-readable media in the form of volatile memory, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include but are not limited to phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.
A person skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
The embodiments in this specification are described in a progressive manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, because the system embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, refer to the corresponding description of the method embodiment.
The foregoing describes only embodiments of the present application and is not intended to limit the present application. For a person skilled in the art, the present application may have various modifications and variations. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.