CN107368281A - A kind of data processing method and device - Google Patents


Info

Publication number
CN107368281A
Authority
CN
China
Prior art keywords
data
section
pending data
pending
approximation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710267507.7A
Other languages
Chinese (zh)
Other versions
CN107368281B (en)
Inventor
周扬
杨树波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201710267507.7A
Publication of CN107368281A
Application granted
Publication of CN107368281B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00: Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/06: Arrangements for sorting, selecting, merging, or comparing data on individual record carriers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application disclose a data processing method and device. The method includes: obtaining multiple pending data items; dividing each pending data item into one of multiple preset intervals; determining an approximate value for the pending data in each interval; and determining, according to the approximate values, the quantile corresponding to the pending data. With the embodiments of the present application, the approximate value replaces the value of every pending data item divided into a given interval, which reduces the amount of data that takes part in the calculation. This reduces the number of traversals needed to calculate a quantile, further reduces I/O operations on the database, and can improve the efficiency of quantile calculation to a certain extent.

Description

Data processing method and device
Technical field
The present application relates to the field of computer technology, and in particular to a data processing method and device.
Background
With the development of information technology and the spread of the Internet, service providers (such as websites, banks and telecom operators) have to process huge volumes of data. In practice, a service provider may need to calculate quantiles over some of its data to meet a business need such as data analysis.
A quantile can be regarded as the data value at a cut point that divides an ordered set of data into parts. For example, in an ordered set of data in which each item has a value, suppose a certain item has the value 50 and the items with values below 50 make up 70% of the set. The item with the value 50 can then be regarded as the 70th quantile of the set.
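The definition above can be illustrated with a minimal sketch. The function name `share_below` and the ten sample values are hypothetical and chosen only so that 70% of the values lie below 50.

```python
def share_below(values, x):
    """Return the fraction of values that are strictly smaller than x."""
    return sum(1 for v in values if v < x) / len(values)


# Hypothetical data set: 7 of the 10 values are below 50.
data = [12, 18, 25, 31, 37, 42, 49, 50, 64, 80]
percentile_rank = round(100 * share_below(data, 50))
print(percentile_rank)  # 70, so the value 50 is the 70th quantile of this set
```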
In the prior art, a quantile is usually calculated as follows: the data to be calculated are traversed and compared with one another item by item until the set is sorted, and the required quantile is then determined from the sorted sequence.
In practice, however, the data generated by a service provider are usually kept in a storage device (such as a database or a server's local disk). As the calculation process above shows, determining a quantile requires each pending data item to be compared with, and sorted against, the other pending data items. This usually involves multiple I/O (Input/Output) operations on the storage device. When the volume of data is large, a large number of I/O operations are needed, and these take time and reduce the efficiency of the calculation.
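For comparison, the sort-based approach described above might look like the following sketch. The text does not say which quantile convention the prior art uses, so the nearest-rank rule here is an assumption, as are the function name and the sample values.

```python
def quantile_by_sorting(values, q):
    """Return the q-th quantile (q an integer in 1..100) after a full sort.

    Assumption: nearest-rank convention, i.e. the smallest sorted value whose
    rank covers at least q percent of the data.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= q <= 100:
        raise ValueError("q must be in 1..100")
    ordered = sorted(values)  # every item takes part in the sort
    rank = (q * len(ordered) + 99) // 100  # integer ceiling of q * n / 100
    return ordered[rank - 1]


print(quantile_by_sorting([7, 3, 9, 1, 5], 50))  # 5
print(quantile_by_sorting([7, 3, 9, 1, 5], 75))  # 7
```

The full sort is what makes this baseline expensive when the data must first be read from a storage device.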
Summary of the invention
The embodiments of the present application provide a data processing method to solve the problem that existing quantile calculation is inefficient.
A data processing method provided by an embodiment of the present application includes:
Obtaining multiple pending data items;
Dividing each pending data item into one of multiple preset intervals;
Determining the approximate value of the pending data in each interval;
Determining, according to the approximate values, the quantile corresponding to the pending data.
A data processing device provided by an embodiment of the present application includes:
A data acquisition module, which obtains multiple pending data items;
A division module, which divides each pending data item into one of multiple preset intervals;
An approximate value determining module, which determines the approximate value of the pending data in each interval;
A quantile computing module, which determines, according to the approximate values, the quantile corresponding to the pending data.
At least one of the technical solutions adopted by the embodiments of the present application can achieve the following beneficial effects:
When a quantile is to be calculated, the computing device (such as a server) obtains from the database the multiple data items on which the quantile is to be calculated, as pending data. The computing device then divides these pending data items into multiple preset intervals and determines an approximate value for the pending data divided into each interval. The approximate value characterizes a kind of average of the pending data falling into the interval. The values of the pending data in the different intervals can therefore be replaced by the approximate values when the quantile is calculated. Because the approximate value replaces the value of every pending data item divided into a given interval, the amount of data that takes part in the calculation is reduced. This reduces the number of traversals needed to calculate a quantile, further reduces I/O operations on the database, and can improve the efficiency of quantile calculation to a certain extent.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the present application and form a part of it. The schematic embodiments of the present application and their description are used to explain the application and do not unduly limit it. In the drawings:
Fig. 1a is a schematic diagram of the architecture on which the data processing provided by the embodiments of the present application is based;
Fig. 1b shows the data processing procedure provided by the embodiments of the present application;
Fig. 2a is a schematic diagram of dividing pending data in an application example provided by the embodiments of the present application;
Fig. 2b is a schematic diagram of determining a quantile using an existing quantile calculation method;
Fig. 2c is a schematic diagram of calculating a quantile based on approximate values as provided by the embodiments of the present application;
Fig. 3 is a schematic structural diagram of the data processing device provided by the embodiments of the present application.
Detailed description
To make the purpose, technical solutions and advantages of the present application clearer, the technical solutions are described clearly and completely below with reference to specific embodiments of the application and the corresponding drawings. The described embodiments are only some of the embodiments of the application, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the application.
To address the low computational efficiency of traditional quantile calculation in the prior art, the embodiments of the present application provide a data processing method that can effectively reduce the number of calculations performed during quantile calculation and reduce I/O operations on the disk, thereby improving the efficiency of quantile calculation.
The data processing method described in the embodiments may be based on the architecture shown in Fig. 1a, which includes a client, a computing device and a database. The client can send the computing device a service request that involves quantile calculation. In practice, the client includes but is not limited to a service application, or a functional unit of an operating system that has client functionality (in Fig. 1a the client is drawn as a terminal device shaped like a mobile phone, which is only an example). The client may be an enterprise client or a personal user client; no specific limitation is made here. The computing device performs the quantile calculation and includes but is not limited to a server or a mainframe computer. While performing the quantile calculation, the computing device may perform I/O operations on the database. The database stores data, and the computing device performs the quantile calculation on the data stored in the database.
In some application scenarios, the database shown in Fig. 1a may also be regarded as a disk for data storage located inside the computing device. No specific limitation is made here.
Based on the architecture shown in Fig. 1a, the data processing procedure provided by the embodiments is shown in Fig. 1b. The procedure is usually executed by the computing device shown in Fig. 1a. In some application scenarios the client itself has a quantile calculation function, in which case the procedure shown in Fig. 1b can also be executed by the client.
The data processing procedure shown in Fig. 1b includes the following steps:
S101: Obtain multiple pending data items.
In practice, pending data can be regarded as data that can be sorted. In one scenario the pending data are numerical values, such as a user's historical spending amounts, age, height or weight.
Besides numerical values, in other scenarios the pending data may be letters (which can be sorted alphabetically) or character strings that can be sorted according to a corresponding ordering rule. No specific limitation is made here.
The pending data are usually stored in a database, so the server can obtain them from the database. In one feasible approach in the embodiments, the database is a relational database. Because the data in such a database are stored as "key-value" correspondences, the server can use a given key to obtain the multiple data items corresponding to that key as pending data, which together form the pending data set. This approach should not be taken as a limitation on the application.
Note that the multiple pending data items obtained are not sorted. For example, the pending data obtained might be {4.0, 1.5, 9.0, 3.5, 6.5, 8.5}, which is clearly unsorted.
In practice, the computing device can be triggered to obtain the pending data by a service request that involves quantile calculation.
S102: Divide each pending data item into one of multiple preset intervals.
In the embodiments, the preset intervals may be multiple consecutive left-open, right-closed intervals, such as:
(0,5], (5,10].
The intervals may all have the same length or may have different lengths, as the application requires.
Once the intervals have been preset, each pending data item can be divided into one of them. In the embodiments, the pending data items can be divided according to their order (for example, their numerical value).
With the two examples above, the pending data can be divided into the two consecutive intervals: 4.0, 1.5 and 3.5 are divided into the interval (0,5], and 9.0, 6.5 and 8.5 are divided into the interval (5,10].
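The division step can be sketched as follows, using the sample data and the two intervals from the text. The function name `assign_to_intervals` and the representation of the intervals as a sorted list of edges are assumptions made for illustration.

```python
import bisect


def assign_to_intervals(values, edges):
    """Group values into left-open, right-closed intervals (edges[i], edges[i+1]].

    `edges` is an ascending list of boundaries, e.g. [0, 5, 10] for
    (0, 5] and (5, 10]. Each interval is keyed by its (low, high) pair.
    """
    buckets = {(edges[i], edges[i + 1]): [] for i in range(len(edges) - 1)}
    for v in values:
        if not edges[0] < v <= edges[-1]:
            raise ValueError(f"{v} lies outside the preset intervals")
        # bisect_left places a value equal to a boundary in the lower
        # interval, which matches the right-closed convention.
        i = bisect.bisect_left(edges, v) - 1
        buckets[(edges[i], edges[i + 1])].append(v)
    return buckets


pending = [4.0, 1.5, 9.0, 3.5, 6.5, 8.5]
print(assign_to_intervals(pending, [0, 5, 10]))
# {(0, 5): [4.0, 1.5, 3.5], (5, 10): [9.0, 6.5, 8.5]}
```

A single pass over the unsorted data is enough, and no item is compared with any other item.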
S103: Determine the approximate value of the pending data in each interval.
As noted above, calculating a quantile with a prior-art algorithm usually requires traversing and comparing the data items in order to sort them, which causes excessive I/O operations. To reduce the number of I/O operations, the embodiments replace the pending data divided into each interval with a corresponding approximate value. That is, an approximate value is determined for the pending data in each interval.
In one approach in the embodiments, the mean of the pending data falling into an interval is used as that interval's approximate value. Continuing the example above: for the pending data divided into the interval (0,5], the approximate value is (4.0+1.5+3.5)/3=3; for the pending data divided into the interval (5,10], the approximate value is (9.0+6.5+8.5)/3=8.
In another approach in the embodiments, the mean of an interval's endpoint values is used as its approximate value. Continuing the same example: for the pending data divided into the interval (0,5], the approximate value is (0+5)/2=2.5; for the pending data divided into the interval (5,10], the approximate value is (5+10)/2=7.5.
Either approach can be used in an actual calculation; the choice depends on the needs of the application, and no specific limitation is made here.
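Both ways of choosing the approximate value can be sketched in a few lines. The function names are hypothetical; the numbers are the ones used in the text.

```python
from statistics import mean


def approx_by_mean(members):
    """Approximate an interval by the mean of the data that fell into it."""
    return mean(members)


def approx_by_midpoint(low, high):
    """Approximate an interval (low, high] by the mean of its two endpoints."""
    return (low + high) / 2


print(approx_by_mean([4.0, 1.5, 3.5]))  # 3.0
print(approx_by_mean([9.0, 6.5, 8.5]))  # 8.0
print(approx_by_midpoint(0, 5))         # 2.5
print(approx_by_midpoint(5, 10))        # 7.5
```

The midpoint variant needs no access to the data at all, while the mean variant follows the data more closely but requires a running sum and count per interval.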
S104: Determine, according to the approximate values, the quantile corresponding to the pending data set.
Note that the approximate values determined in the preceding step serve as the approximate values of the pending data divided into the different intervals. The originally large number of pending data items has thus been converted into the corresponding approximate values, and the total number of approximate values is smaller, or even much smaller, than the number of pending data items obtained.
Continuing the example: suppose the approximate value of the interval (0,5] is 3 and that of the interval (5,10] is 8. The pending data 4.0, 1.5 and 3.5 divided into (0,5] are then approximated by 3, and the pending data 9.0, 6.5 and 8.5 divided into (5,10] are approximated by 8. The quantile can then be calculated from the two values 3 and 8. Compared with existing quantile calculation, the data processing procedure in the embodiments can effectively reduce the number of comparisons and hence the I/O operations on the database.
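The reduction in data volume can be seen by collapsing each interval into one (approximate value, count) pair. This is a sketch only: the name `summarise` and the dictionary layout are assumptions, and the approximate values 3 and 8 are the ones from the example.

```python
# Approximate values from the running example: 3 for (0, 5], 8 for (5, 10].
APPROXIMATION = {(0, 5): 3, (5, 10): 8}


def summarise(buckets):
    """Collapse each interval into a single (approximate value, count) pair,
    ordered by interval."""
    return [
        (APPROXIMATION[interval], len(members))
        for interval, members in sorted(buckets.items())
    ]


buckets = {(0, 5): [4.0, 1.5, 3.5], (5, 10): [9.0, 6.5, 8.5]}
summary = summarise(buckets)
print(summary)  # [(3, 3), (8, 3)]
```

Six pending data items become two pairs, and the later quantile lookup only has to walk these pairs.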
The quantile calculation performed on the approximate values can itself still use an existing quantile algorithm, which is not described further here.
With the above steps, when a quantile is to be calculated, the computing device (such as a server) obtains from the database the multiple data items on which the quantile is to be calculated, as pending data. The computing device then divides these pending data items into multiple preset intervals and determines an approximate value for the pending data divided into each interval. The approximate value characterizes a kind of average of the pending data falling into the interval. The values of the pending data in the different intervals can therefore be replaced by the approximate values when the quantile is calculated. Because the approximate value replaces the value of every pending data item divided into a given interval, the amount of data that takes part in the calculation is reduced. This reduces the number of traversals needed to calculate a quantile, further reduces I/O operations on the database, and can improve the efficiency of quantile calculation to a certain extent.
Note that when a quantile is calculated with the above method, the pending data obtained may be unevenly distributed: most of the pending data may fall into one interval, with only a few items falling into the other intervals.
For example, suppose the pending data are {1.1, 1.2, 1.3, 1.4, 1.5, 6.1} and the preset intervals are still (0,5], (5,10]. After division, the pending data 1.1, 1.2, 1.3, 1.4 and 1.5 fall into the interval (0,5], and the pending data 6.1 falls into the interval (5,10].
Such a distribution is heavily skewed (the data are concentrated too densely in one interval), and when this happens the quantile calculated with the above method is not accurate enough. Therefore, in one approach in the embodiments, a check is performed first, after the pending data have been obtained, to determine that their distribution is sufficiently uniform. That is, before each pending data item is divided into the preset intervals, the method further includes: sampling the pending data obtained to generate multiple sample data items, dividing each sample data item into the preset intervals, determining the information entropy corresponding to the divided sample data, and confirming that the information entropy does not exceed a set threshold.
As the above shows, the pending data are sampled so that the check itself does not take too much time; the number of sample data items obtained by sampling is smaller than the number of pending data items. The distribution of the sample data reflects the distribution of the pending data to a certain extent, and because there are fewer sample data items than pending data items, checking with the sample data shortens the check.
In the embodiments, the sampling may be done as follows: according to a preset sampling ratio, sampling with replacement is performed on the pending data obtained. The sampling ratio is any value in (0,1] and expresses the share of the overall pending data that the sample should make up. In practice the sampling ratio is usually small, to keep the number of sample data items low.
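Sampling with replacement at a preset ratio can be sketched with the standard library. The function name, the `seed` parameter and the rounding of the sample size are assumptions; the text only fixes the ratio range (0,1] and the use of replacement.

```python
import random


def sample_with_replacement(values, ratio, seed=None):
    """Draw round(ratio * len(values)) items from values, with replacement.

    Assumption: the sample size is rounded to the nearest integer and is
    never smaller than 1.
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must lie in (0, 1]")
    rng = random.Random(seed)
    k = max(1, round(ratio * len(values)))
    return rng.choices(values, k=k)  # choices() samples with replacement


pending = [float(i) for i in range(1000)]
sample = sample_with_replacement(pending, 0.1, seed=0)
print(len(sample))  # 100
```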
Once the sample data have been collected, the check can be carried out with them: the sample data are divided into the preset intervals (the same intervals as described above), and the information entropy corresponding to the sample data divided into the different intervals is determined.
The information entropy can be determined as follows: for each interval, the share of the sample data in that interval among all the sample data is counted and taken as the information probability of the interval; the information entropy corresponding to the sample data is then determined from the information probabilities of the intervals.
Specifically, the information entropy can be calculated using the following formula:
Where X is the sample set; x_i is the sample data divided into the i-th interval;
H(X) is the value of the information entropy of the sample data;
P(x_i) is the probability of the sample data divided into the i-th interval (its value is the share of the sample data in the i-th interval within the sample set);
b is a constant, usually 2.
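The formula itself is not reproduced in this text. The sketch below assumes the standard Shannon form, H(X) = -sum of P(x_i) * log_b P(x_i), built from the symbols defined above; the leading minus sign, the function name and the interval labels are assumptions. Under this form, a sample that falls entirely into one interval gives 0 and an evenly spread sample gives the maximum, so the direction of any threshold comparison depends on the sign convention of the original formula.

```python
import math
from collections import Counter


def interval_entropy(interval_ids, base=2):
    """Entropy of the interval shares, assuming the standard Shannon form.

    interval_ids holds one interval label per sample data item; the share of
    each label among all samples plays the role of P(x_i).
    """
    total = len(interval_ids)
    if total == 0:
        raise ValueError("at least one sample is required")
    counts = Counter(interval_ids)
    return -sum(
        (c / total) * math.log(c / total, base) for c in counts.values()
    )


# Skewed example from the text: five samples in (0,5], one in (5,10].
skewed = ["(0,5]"] * 5 + ["(5,10]"]
balanced = ["(0,5]"] * 3 + ["(5,10]"] * 3
print(interval_entropy(skewed))
print(interval_entropy(balanced))
```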
Note that the information entropy calculated in this way reflects the distribution of the data to a certain extent. It can be considered that the more concentrated the distribution of the sample data, the larger the value of its information entropy, and the more scattered the distribution, the smaller the value. Therefore, in the embodiments, after the information entropy corresponding to the sample data has been determined, it is compared with a preset threshold, and the above method is used to calculate the quantile only when the value of the information entropy is not greater than the threshold.
In the embodiments, the quantile can be calculated from the approximate values as follows: for each interval, determine the cumulative share, among all the pending data, of the pending data divided into that interval; determine the quantile position corresponding to the quantile to be calculated; determine the cumulative share that this quantile position matches and the interval corresponding to that cumulative share; and take the approximate value of that interval as the approximate value of the quantile to be calculated.
The calculation of a quantile in the embodiments is explained below with a concrete application example.
In this example, suppose a set of unordered pending data is {x1, x2, x3, x4, x5}, with the values shown in Fig. 2a. Suppose also that the preset interval length is 5, with the intervals as shown in Fig. 2a. In Fig. 2a, s1~s4 are the labels of 4 consecutive intervals. The pending data can then be divided into the 4 preset intervals.
Taking the interval s2 in Fig. 2a as an example, in {(5,10], 3} the "(5,10]" is the range of interval s2 and the "3" is the number of pending data items divided into that interval. The pending data x1~x5 are divided into the intervals s2, s3 and s4.
An approximate value can then be determined for the pending data divided into each of s2, s3 and s4. Here the approximate value is assumed to be calculated as (a+b)/2, where a and b are the two endpoint values of the interval. The approximate value of s2 is therefore 7.5, that of s3 is 12.5, and that of s4 is 17.5.
If, in practice, a user requests the 75th quantile, the existing approach traverses the pending data items one by one, as shown in Fig. 2b (where the abscissa is the quantile position and the ordinate is the value of the pending data), and finally determines that the 75th quantile is x4 (that is, 10). This requires multiple I/O operations. With the approximation approach of the embodiments, the quantile can instead be calculated from the three approximate values 7.5, 12.5 and 17.5.
Specifically, for interval s1, no pending data are divided into it, so the cumulative share of its pending data in the overall pending data is 0.
For interval s2, the cumulative share of pending data is 60%, so the approximate value 7.5 of s2 can serve as the approximate value of the 1st to the 60th quantile.
For interval s3, the cumulative share of pending data is 80%, so the approximate value 12.5 of s3 can serve as the approximate value of the 61st to the 80th quantile.
For interval s4, the cumulative share of pending data is 100%, so the approximate value 17.5 of s4 can serve as the approximate value of the 81st to the 100th quantile.
This is shown in Fig. 2c (where the abscissa is the quantile position and the ordinate is the approximate value). If the user needs the 75th quantile, the result is 12.5.
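The lookup in this worked example can be sketched as a walk over cumulative counts. The function name and the list-of-pairs layout are assumptions; the counts 0, 3, 1 and 1 follow from the cumulative shares 0%, 60%, 80% and 100% of five items, and 2.5 for the empty interval s1 is inferred from the interval length of 5.

```python
def approx_quantile(summary, q):
    """Return the approximate value of the first interval whose cumulative
    share reaches q percent.

    summary: (approximate value, count) pairs ordered by interval.
    q: integer quantile position in 1..100.
    """
    if not 1 <= q <= 100:
        raise ValueError("q must be in 1..100")
    total = sum(count for _, count in summary)
    running = 0
    for approx, count in summary:
        running += count
        # Integer comparison of running / total >= q / 100 avoids rounding.
        if running * 100 >= q * total:
            return approx
    raise ValueError("summary contains no data")


# s1 = (0,5], s2 = (5,10], s3 = (10,15], s4 = (15,20]
summary = [(2.5, 0), (7.5, 3), (12.5, 1), (17.5, 1)]
print(approx_quantile(summary, 75))  # 12.5
print(approx_quantile(summary, 60))  # 7.5
print(approx_quantile(summary, 81))  # 17.5
```

Only four pairs are inspected, however many pending data items were originally divided into the intervals.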
In practice, once the quantile has been calculated, the result returned to the user may be a vector whose length depends on the number of quantile divisions set in the calculation request. For example, the user may set the number of quantile divisions to 100, meaning that 100 quantiles are to be calculated, and at the same time request the 75th quantile. The vector returned to the user then has length 100 and is arranged in the order of the 1st to the 100th quantile.
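A result vector of this kind can be built in one pass over the interval counts. This is a sketch under the same assumptions as the worked example (counts 0, 3, 1, 1 and midpoint approximations); the function name and the parameter `n` for the number of quantile divisions are hypothetical.

```python
def quantile_vector(summary, n=100):
    """Return the approximate values of the 1st..n-th quantile as a list.

    summary: (approximate value, count) pairs ordered by interval.
    """
    total = sum(count for _, count in summary)
    if total == 0:
        raise ValueError("summary contains no data")
    vector = []
    idx, running = 0, summary[0][1]
    for q in range(1, n + 1):
        # Advance to the first interval whose cumulative share reaches q / n.
        while running * n < q * total:
            idx += 1
            running += summary[idx][1]
        vector.append(summary[idx][0])
    return vector


summary = [(2.5, 0), (7.5, 3), (12.5, 1), (17.5, 1)]
result = quantile_vector(summary)
print(len(result))  # 100
print(result[74])   # 12.5, the 75th quantile
```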
The above is the data processing method provided by the embodiments of the present application. Based on the same idea, the embodiments also provide a data processing device, shown in Fig. 3. The device includes:
A data acquisition module 301, which obtains multiple pending data items;
A division module 302, which divides each pending data item into one of multiple preset intervals;
An approximate value determining module 303, which determines the approximate value of the pending data in each interval;
A quantile computing module 304, which determines, according to the approximate values, the quantile corresponding to the pending data.
The device also includes a data distribution detection module 305, which samples the pending data obtained to generate multiple sample data items, divides each sample data item into the preset intervals, determines the information entropy corresponding to the divided sample data, and confirms that the information entropy is not greater than a set threshold.
Further, the data distribution detection module 305 performs sampling with replacement on the pending data obtained, according to a preset sampling ratio.
The data distribution detection module 305 determines the value corresponding to each sample data item and divides each sample data item into the preset intervals according to its value.
The data distribution detection module 305, for each interval, counts the share of the sample data in that interval among all the sample data as the information probability of the interval, and determines the information entropy corresponding to the sample data from the information probabilities of the intervals.
The division module 302 determines the value corresponding to each pending data item and divides each pending data item into the preset intervals according to its value.
The approximate value determining module 303, for any interval, determines the endpoint values of the interval, calculates their mean, and takes it as the approximate value of the pending data divided into that interval; or, for any interval, collects the values of the pending data items in the interval, calculates their mean, and takes it as the approximate value of the pending data divided into that interval.
The quantile computing module 304, for each interval, determines the cumulative share, among all the pending data, of the pending data divided into that interval; determines the quantile position corresponding to the quantile to be calculated; determines the cumulative share that this quantile position matches and the interval corresponding to that cumulative share; and takes the approximate value of that interval as the approximate value of the quantile to be calculated.
In the 1990s, an improvement to a technology could be clearly distinguished as either a hardware improvement (for example, an improvement to a circuit structure such as a diode, transistor or switch) or a software improvement (an improvement to a method flow). With the development of technology, however, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. It therefore cannot be said that an improvement to a method flow cannot be implemented with a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD), such as a field programmable gate array (Field Programmable Gate Array, FPGA), is an integrated circuit whose logic function is determined by the user programming the device. Designers program a digital system onto a single PLD themselves, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of making integrated circuit chips by hand, this programming is nowadays mostly done with "logic compiler" software, which is similar to the software compiler used in program development. The source code to be compiled must also be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art should also understand that a hardware circuit implementing a logical method flow can easily be obtained simply by describing the method flow in one of the above hardware description languages and programming it into an integrated circuit.
A controller can be implemented in any suitable manner. For example, a controller can take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by that (micro)processor, logic gates, switches, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include but are not limited to the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20 and Silicon Labs C8051F320. A memory controller can also be implemented as part of the control logic of a memory. Those skilled in the art also know that, besides implementing a controller purely as computer-readable program code, the method steps can be logic-programmed so that the controller achieves the same function in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller can therefore be regarded as a hardware component, and the means it contains for realizing various functions can be regarded as structures within the hardware component. The means for realizing various functions can even be regarded both as software modules implementing a method and as structures within a hardware component.
The systems, devices, modules, or units described in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an e-mail device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above device is described as being divided by function into various units. Of course, when implementing the present application, the functions of the units may be implemented in one or more pieces of software and/or hardware.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. The present invention may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include non-permanent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, commodity, or device that includes the element.
Those skilled in the art will understand that embodiments of the present application may be provided as a method, a system, or a computer program product. The present application may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
The embodiments in this specification are described in a progressive manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, because the system embodiment is substantially similar to the method embodiment, it is described relatively briefly; for the relevant parts, refer to the description of the method embodiment.
The foregoing is merely embodiments of the present application and is not intended to limit the present application. For those skilled in the art, the present application may have various modifications and variations. Any modification, equivalent replacement, improvement, or the like made within the spirit and principle of the present application shall fall within the scope of the claims of the present application.

Claims (16)

1. A data processing method, the method comprising:
obtaining a plurality of pieces of pending data;
dividing each piece of pending data into a plurality of preset sections;
determining an approximation of the pending data in each section;
determining, according to the approximation, a quantile corresponding to the pending data.
2. The method of claim 1, wherein, before each piece of pending data is divided into the plurality of preset sections, the method further comprises:
sampling the obtained pending data to generate a plurality of sample data;
dividing each sample data into the plurality of preset sections;
determining the information entropy corresponding to the divided sample data, the information entropy being not greater than a set threshold.
3. The method of claim 2, wherein sampling the obtained pending data specifically comprises:
performing sampling with replacement on the obtained pending data according to a preset sampling ratio.
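The sampling step of claim 3 can be illustrated with a minimal Python sketch. The function name `sample_with_replacement`, the `seed` parameter, and the rule that the sample size is the data size multiplied by the ratio (rounded, at least 1) are assumptions made for illustration; the claim only states that sampling with replacement is performed according to a preset ratio.

```python
import random


def sample_with_replacement(pending_data, sampling_ratio, seed=None):
    """Draw sample data from the pending data, with replacement.

    Assumption: the sample size is len(pending_data) * sampling_ratio,
    rounded to the nearest integer and never smaller than 1.
    """
    if not pending_data:
        return []
    rng = random.Random(seed)  # a seed makes the draw reproducible
    sample_size = max(1, round(len(pending_data) * sampling_ratio))
    # random.Random.choices samples with replacement, so the same
    # pending data item may appear more than once in the result.
    return rng.choices(pending_data, k=sample_size)
```

Because the draw is with replacement, the sample may contain repeated values even when the pending data contain none.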
4. The method of claim 2, wherein dividing each sample data into the plurality of preset sections specifically comprises:
determining the value corresponding to each sample data;
dividing each sample data into the plurality of preset sections according to the value of each sample data.
5. The method of claim 2, wherein determining the information entropy corresponding to the divided sample data specifically comprises:
for each section, counting the proportion of the sample data in the section among all the sample data, as the information probability of the section;
determining the information entropy corresponding to the sample data according to the determined information probability of each section.
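The entropy check of claims 2 and 5 might look like the following sketch. It assumes the usual Shannon formula H = -Σ p_i log2(p_i) over the per-section information probabilities; the claims do not state the formula or the logarithm base, and the names `sample_entropy`, `entropy_within_threshold`, and `boundaries` (the sorted inner section edges) are hypothetical.

```python
import math
from bisect import bisect_right
from collections import Counter


def sample_entropy(samples, boundaries):
    """Entropy of the sample data over sections cut at `boundaries`.

    `boundaries` is a sorted list of inner edges; a value v falls in
    section bisect_right(boundaries, v).
    """
    total = len(samples)
    counts = Counter(bisect_right(boundaries, v) for v in samples)
    entropy = 0.0
    for count in counts.values():
        p = count / total  # information probability of the section
        entropy -= p * math.log2(p)  # assumed Shannon form, base 2
    return entropy


def entropy_within_threshold(samples, boundaries, threshold):
    """True when the entropy is not greater than the set threshold."""
    return sample_entropy(samples, boundaries) <= threshold
```

Empty sections contribute nothing to the sum, because only sections that actually received sample data appear in the counter.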
6. The method of claim 1, wherein dividing each piece of pending data into the plurality of preset sections specifically comprises:
determining the value corresponding to each piece of pending data;
dividing each piece of pending data into the plurality of preset sections according to the value of each piece of pending data.
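As an illustration of the division step in claim 6, the sketch below assumes equal-width sections covering a known range. The claim only requires that the sections be preset; the equal widths, the parameters `low`, `high`, and `section_count`, and the clamping of out-of-range values into the first or last section are all assumptions.

```python
def divide_into_sections(pending_data, low, high, section_count):
    """Place each value into one of `section_count` equal-width sections.

    Section i covers [low + i * width, low + (i + 1) * width); values
    outside [low, high) are clamped into the first or last section.
    """
    width = (high - low) / section_count
    sections = [[] for _ in range(section_count)]
    for value in pending_data:
        index = int((value - low) // width)
        index = min(max(index, 0), section_count - 1)  # clamp to valid range
        sections[index].append(value)
    return sections
```

Computing the section index arithmetically avoids comparing each value against every boundary, so the whole pass is linear in the number of pending data items.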
7. The method of claim 1, wherein determining the approximation of the pending data in each section specifically comprises:
for any section, determining the endpoint values of the section, calculating the mean of the endpoint values, and taking that mean as the approximation corresponding to the pending data divided into the section; or
for any section, counting the values of the pending data in the section, calculating the mean of those pending data, and taking that mean as the approximation corresponding to the pending data divided into the section.
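The two alternatives of claim 7 can be sketched side by side. The function names and the choice to return `None` for an empty section are assumptions made for illustration.

```python
def endpoint_mean(lower, upper):
    """First alternative: the mean of the section's two endpoint values."""
    return (lower + upper) / 2


def data_mean(values):
    """Second alternative: the mean of the pending data in the section.

    Returns None for an empty section (an assumption; the claim does
    not say how empty sections are handled).
    """
    if not values:
        return None
    return sum(values) / len(values)
```

The endpoint mean needs no pass over the data, while the data mean tracks where the values actually sit inside the section. For a section [10, 20) holding 10, 12, and 17, the first gives 15.0 and the second gives 13.0.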
8. The method of claim 1, wherein determining, according to the approximation, the quantile corresponding to the pending data specifically comprises:
for each section, determining the cumulative proportion, among all the pending data, of the pending data divided into the section;
determining the quantile point corresponding to the quantile to be calculated;
determining the cumulative proportion matched by the quantile point, and determining the section corresponding to that cumulative proportion;
taking the approximation determined for that section as the approximation of the quantile to be calculated.
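The lookup in claim 8 can be sketched as follows. The matching rule used here, taking the first non-empty section whose cumulative proportion reaches the quantile point, is an assumption, since the claim does not define how a quantile point "matches" a cumulative proportion. The function and parameter names are hypothetical.

```python
def approximate_quantile(section_counts, section_approximations, quantile_point):
    """Return the approximation of the section matched by `quantile_point`.

    section_counts[i] is the number of pending data items in section i,
    and section_approximations[i] is that section's approximation.
    """
    if not 0 <= quantile_point <= 1:
        raise ValueError("quantile_point must lie in [0, 1]")
    total = sum(section_counts)
    if total == 0:
        raise ValueError("no pending data")
    running = 0
    for count, approximation in zip(section_counts, section_approximations):
        running += count
        # running / total is the cumulative proportion up to this section
        if count and running / total >= quantile_point:
            return approximation
    raise ValueError("counts and approximations do not line up")
```

With counts of 2, 3, and 5 the cumulative proportions are 0.2, 0.5, and 1.0, so a quantile point of 0.5 (the median) resolves to the second section's approximation without sorting the underlying data.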
9. A data processing device, the device comprising:
a data acquisition module, which obtains a plurality of pieces of pending data;
a division module, which divides each piece of pending data into a plurality of preset sections;
an approximation determining module, which determines an approximation of the pending data in each section;
a quantile computing module, which determines, according to the approximation, a quantile corresponding to the pending data.
10. The device of claim 9, further comprising: a data distribution detection module, which samples the obtained pending data to generate a plurality of sample data, divides each sample data into the plurality of preset sections, and determines the information entropy corresponding to the divided sample data, the information entropy being not greater than a set threshold.
11. The device of claim 10, wherein the data distribution detection module performs sampling with replacement on the obtained pending data according to a preset sampling ratio.
12. The device of claim 10, wherein the data distribution detection module determines the value corresponding to each sample data and divides each sample data into the plurality of preset sections according to the value of each sample data.
13. The device of claim 10, wherein the data distribution detection module, for each section, counts the proportion of the sample data in the section among all the sample data, as the information probability of the section, and determines the information entropy corresponding to the sample data according to the determined information probability of each section.
14. The device of claim 9, wherein the division module determines the value corresponding to each piece of pending data and divides each piece of pending data into the plurality of preset sections according to the value of each piece of pending data.
15. The device of claim 9, wherein the approximation determining module, for any section, determines the endpoint values of the section, calculates the mean of the endpoint values, and takes that mean as the approximation corresponding to the pending data divided into the section; or
for any section, counts the values of the pending data in the section, calculates the mean of those pending data, and takes that mean as the approximation corresponding to the pending data divided into the section.
16. The device of claim 9, wherein the quantile computing module, for each section, determines the cumulative proportion, among all the pending data, of the pending data divided into the section; determines the quantile point corresponding to the quantile to be calculated; determines the cumulative proportion matched by the quantile point and the section corresponding to that cumulative proportion; and takes the approximation determined for that section as the approximation of the quantile to be calculated.
CN201710267507.7A 2017-04-21 2017-04-21 Data processing method and device Active CN107368281B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710267507.7A CN107368281B (en) 2017-04-21 2017-04-21 Data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710267507.7A CN107368281B (en) 2017-04-21 2017-04-21 Data processing method and device

Publications (2)

Publication Number Publication Date
CN107368281A true CN107368281A (en) 2017-11-21
CN107368281B CN107368281B (en) 2020-10-16

Family

ID=60304717

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710267507.7A Active CN107368281B (en) 2017-04-21 2017-04-21 Data processing method and device

Country Status (1)

Country Link
CN (1) CN107368281B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110889462A (en) * 2019-12-09 2020-03-17 秒针信息技术有限公司 Data processing method, device, equipment and storage medium
CN113987049A (en) * 2021-12-27 2022-01-28 北京安华金和科技有限公司 Sensitive data discovery processing method and system
CN114860811A (en) * 2022-05-25 2022-08-05 湖南大学 Median approximate value searching method and device for data set and computer equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104657388A (en) * 2013-11-22 2015-05-27 阿里巴巴集团控股有限公司 Data processing method and device
CN105045806A (en) * 2015-06-04 2015-11-11 中国科学院信息工程研究所 Dynamic splitting and maintenance method of quantile query oriented summary data
CN105474377A (en) * 2013-06-28 2016-04-06 科磊股份有限公司 Selection and use of representative target subsets
US20160246852A1 (en) * 2012-05-29 2016-08-25 Sas Institute Inc. Systems and Methods for Quantile Estimation in a Distributed Data System

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160246852A1 (en) * 2012-05-29 2016-08-25 Sas Institute Inc. Systems and Methods for Quantile Estimation in a Distributed Data System
CN105474377A (en) * 2013-06-28 2016-04-06 科磊股份有限公司 Selection and use of representative target subsets
CN104657388A (en) * 2013-11-22 2015-05-27 阿里巴巴集团控股有限公司 Data processing method and device
CN105045806A (en) * 2015-06-04 2015-11-11 中国科学院信息工程研究所 Dynamic splitting and maintenance method of quantile query oriented summary data

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110889462A (en) * 2019-12-09 2020-03-17 秒针信息技术有限公司 Data processing method, device, equipment and storage medium
CN110889462B (en) * 2019-12-09 2023-05-02 秒针信息技术有限公司 Data processing method, device, equipment and storage medium
CN113987049A (en) * 2021-12-27 2022-01-28 北京安华金和科技有限公司 Sensitive data discovery processing method and system
CN114860811A (en) * 2022-05-25 2022-08-05 湖南大学 Median approximate value searching method and device for data set and computer equipment

Also Published As

Publication number Publication date
CN107368281B (en) 2020-10-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TA01 Transfer of patent application right

Effective date of registration: 20200924

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20200924

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.
