Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
To address the low computational efficiency caused by the traditional quantile calculation approach in the prior art, the data processing method provided in the embodiment of the present application can effectively reduce the number of calculations and the I/O operations on a disk during quantile calculation, thereby improving quantile calculation efficiency.
It should be noted that the data processing method described in the embodiment of the present application may be based on the architecture shown in fig. 1a. The architecture in fig. 1a includes a client, a computing device and a database. The client may initiate a service request involving quantile calculation to the computing device. In practical applications, the client includes, but is not limited to, a service application or a functional unit with a client function in an operating system (in fig. 1a, the client is represented by a terminal device such as a mobile phone, which is just an example). The client may be an internal client of an enterprise or an individual user client, which is not specifically limited herein. The computing device is used to perform the quantile calculation and includes, but is not limited to, a server, a mainframe computer and the like. During the quantile calculation, the computing device performs I/O operations on the database. The database is used to store data, and the computing device performs the quantile calculation based on the data stored in the database.
Of course, in some application scenarios, the database shown in fig. 1a may also be regarded as a disk disposed inside the computing device for storing data, which is not particularly limited herein.
Based on the architecture shown in fig. 1a, the data processing process provided by the embodiment of the present application is shown in fig. 1b, and the execution subject of the process may generally be the computing device shown in fig. 1a. In some application scenarios, the client itself is capable of quantile calculation, in which case the data processing process shown in fig. 1b can also be completed with the client as the execution subject.
The data processing process shown in fig. 1b specifically includes the following steps:
S101: acquiring a plurality of data to be processed.
In practical applications, the data to be processed may be regarded as data that can be sorted. In one scenario, the data to be processed may be numerical values, such as: historical consumption amount data, age, height, weight and the like of the user.
Of course, besides numerical values, in another application scenario the data to be processed may also be letters (which may be sorted alphabetically), character strings that may be sorted according to a corresponding sorting rule, and the like, which is not limited herein.
Usually, the data to be processed is stored in a database, so the server can obtain the data to be processed from the database. As a feasible way in the embodiment of the present application, the database may be a relational database, and since data in the relational database is stored according to a corresponding "key-value" correspondence, the server may obtain, according to a corresponding key, a plurality of data corresponding to the key as data to be processed, thereby forming a data set to be processed. Of course, the above-described embodiments should not be construed as limiting the present application.
It should be noted here that the acquired data to be processed is not sorted. For example, if the acquired data to be processed comprises {4.0, 1.5, 9.0, 3.5, 6.5, 8.5}, it can be seen that the data to be processed is unordered.
It will be appreciated that in actual operation, the computing device may be triggered to obtain the data to be processed by a service request involving the quantile calculation.
S102: dividing each data to be processed into a plurality of preset intervals.
In the embodiment of the present application, the preset plurality of intervals may be a plurality of consecutive left-open, right-closed intervals, such as:
(0, 5], (5, 10].
The lengths of the intervals may be the same or different and can be set according to the requirements of the practical application.
Once the plurality of intervals has been preset, each piece of data to be processed can be divided into one of the intervals. In the embodiment of the present application, the data to be processed may be divided according to the ordering value (e.g., the numerical value) corresponding to each piece of data to be processed.
Continuing the foregoing example, the data to be processed may be divided into the two consecutive intervals: 4.0, 1.5 and 3.5 are divided into the interval (0, 5], and 9.0, 6.5 and 8.5 are divided into the interval (5, 10].
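The division in S102 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `assign_to_intervals` and the representation of the intervals as a sorted list of boundary values (`edges`) are assumptions made for the example, and each interval is treated as left-open and right-closed.

```python
from bisect import bisect_left

def assign_to_intervals(values, edges):
    """Group values into the intervals (edges[i], edges[i + 1]].

    `edges` is a hypothetical encoding: a sorted list of boundary values,
    so [0, 5, 10] stands for the intervals (0, 5] and (5, 10].
    """
    buckets = [[] for _ in range(len(edges) - 1)]
    for v in values:
        if not (edges[0] < v <= edges[-1]):
            raise ValueError(f"{v} lies outside ({edges[0]}, {edges[-1]}]")
        # bisect_left returns the first index i with edges[i] >= v,
        # so v belongs to the interval (edges[i - 1], edges[i]].
        buckets[bisect_left(edges, v) - 1].append(v)
    return buckets

data = [4.0, 1.5, 9.0, 3.5, 6.5, 8.5]
print(assign_to_intervals(data, [0, 5, 10]))
# [[4.0, 1.5, 3.5], [9.0, 6.5, 8.5]]
```

A binary search over the boundaries handles intervals of unequal length; when all intervals have the same length, the index could instead be computed arithmetically.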
S103: an approximation of the data to be processed within each interval is determined.
As mentioned above, when calculating quantiles by using the quantile algorithm in the prior art, it is usually necessary to traverse each data for comparison and then sort the data. Obviously, such a calculation would result in excessive I/O operations. Therefore, in the embodiment of the present application, in order to reduce the number of I/O operations, the data to be processed divided into different intervals is replaced by corresponding approximate values. That is, in the embodiment of the present application, an approximate value of the data to be processed within each interval will be determined.
As one mode in the embodiment of the present application, for any interval, the mean value of the data to be processed falling in the interval may be used as the approximate value. Continuing the above example: the approximate value is (4.0 + 1.5 + 3.5)/3 = 3 for the data to be processed divided into the interval (0, 5], and (9.0 + 6.5 + 8.5)/3 = 8 for the data to be processed divided into the interval (5, 10].
As another mode in the embodiment of the present application, for any interval, the mean value of the endpoint values of the interval may be taken as the approximate value. Continuing the above example again: the approximate value of the data to be processed divided into the interval (0, 5] is (0 + 5)/2 = 2.5, and the approximate value of the data to be processed divided into the interval (5, 10] is (5 + 10)/2 = 7.5.
Both of the above two manners can be applied in the actual calculation process, and will be specifically selected according to the needs of the actual application, which is not specifically limited herein.
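The two modes of determining the approximate value can be sketched as below. The function names are hypothetical, and returning `None` for an empty interval in the first mode is an assumption of this sketch rather than something specified in the text.

```python
def mean_approximation(bucket):
    """Mode 1: mean of the data to be processed that fell into the interval."""
    if not bucket:
        return None  # assumption: an empty interval has no data-based approximation
    return sum(bucket) / len(bucket)

def midpoint_approximation(low, high):
    """Mode 2: mean of the two endpoint values of the interval."""
    return (low + high) / 2

# Intervals (0, 5] and (5, 10] with the data from the running example.
buckets = {(0, 5): [4.0, 1.5, 3.5], (5, 10): [9.0, 6.5, 8.5]}
for (low, high), bucket in buckets.items():
    print((low, high), mean_approximation(bucket), midpoint_approximation(low, high))
# (0, 5) 3.0 2.5
# (5, 10) 8.0 7.5
```

The second mode depends only on the interval boundaries, so it does not need to read the values inside the interval; the first mode requires one pass over them but follows the data more closely.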
S104: determining a quantile corresponding to the set of data to be processed from the approximation.
It should be noted that the approximate values determined through the foregoing steps may be approximate values corresponding to data to be processed divided into different intervals. Thus, the original data to be processed with a larger amount is converted into corresponding approximate values, and the total amount of the approximate values is smaller than or even far smaller than the amount of the acquired data to be processed.
Continuing the above example: assuming that the approximate value of the interval (0, 5] is 3 and the approximate value of the interval (5, 10] is 8, the data to be processed 4.0, 1.5 and 3.5 divided into the interval (0, 5] is approximated as 3, and the data to be processed 9.0, 6.5 and 8.5 divided into the interval (5, 10] is approximated as 8. The quantile can then be calculated from the two values 3 and 8.
Of course, the quantile calculation is performed on the basis of the approximate values, and the specific calculation process can still adopt the existing quantile algorithm, which is not described in detail herein.
Through the above steps, when quantile calculation is to be performed, the computing device (such as a server) can acquire from the database a plurality of data on which the quantile calculation is to be performed, as the data to be processed. The computing device then divides the data to be processed into a plurality of preset intervals and determines a corresponding approximate value for the data divided into each interval. The approximate value can represent the average level of the data to be processed that falls within the interval. In this way, the values of the data to be processed in each interval can be replaced by the approximate value when calculating the quantile. Because a single approximate value stands in for every piece of data in an interval, the amount of data participating in the calculation is reduced. Therefore, the number of traversals during quantile calculation can be reduced, the I/O operations on the database can be further reduced, and the efficiency of the quantile calculation can be improved to a certain extent.
It should be noted here that, when the quantile calculation is performed by using the above method, the obtained multiple pieces of to-be-processed data may be unevenly distributed, that is, most of the to-be-processed data are distributed in a certain interval, and only a few of the to-be-processed data are distributed in other intervals.
For example, assume that the plurality of data to be processed includes {1.1, 1.2, 1.3, 1.4, 1.5, 6.1} and the preset intervals are still (0, 5] and (5, 10]. In this case, after the data to be processed is divided, the data 1.1, 1.2, 1.3, 1.4 and 1.5 are divided into the interval (0, 5], and the data 6.1 is divided into the interval (5, 10].
It can be seen that the data distribution is too skewed (i.e., the data is concentrated too heavily in a certain interval). Once this occurs, the accuracy of the above quantile calculation method becomes too low. Therefore, in one mode of the embodiment of the present application, after the data to be processed is acquired, a verification process is first performed to determine that the distribution of the data to be processed is sufficiently uniform. That is, before dividing each piece of data to be processed into the plurality of preset intervals, the method further includes: sampling the acquired data to be processed to generate a plurality of sample data, dividing the sample data into the plurality of preset intervals, and determining the information entropy corresponding to the divided sample data, wherein the information entropy is not greater than a set threshold.
As can be seen from the above, the data to be processed is sampled so that the verification process does not consume too much time; the number of sample data obtained by sampling is smaller than the number of data to be processed. The data to be processed can thus be verified according to the distribution of the sample data, and because there are fewer sample data than data to be processed, verifying the sample data reduces the time consumed by the verification process.
In the embodiment of the present application, the sampling method may be: performing sampling with replacement on the acquired data to be processed according to a preset sampling ratio. In actual operation, the sampling ratio is usually small, so that the number of acquired sample data is small.
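Sampling with replacement at a preset ratio can be sketched with the Python standard library as follows. The function name, the `seed` parameter and the rule of drawing at least one sample are assumptions made for this illustration.

```python
import random

def sample_with_replacement(data, ratio, seed=None):
    """Draw round(len(data) * ratio) items from data, with replacement."""
    rng = random.Random(seed)                 # seed is optional, for repeatable runs
    k = max(1, round(len(data) * ratio))      # assumption: always draw at least one item
    # random.Random.choices samples with replacement, so an item may repeat.
    return rng.choices(data, k=k)

data = [4.0, 1.5, 9.0, 3.5, 6.5, 8.5] * 100   # 600 pieces of data to be processed
samples = sample_with_replacement(data, ratio=0.05, seed=0)
print(len(samples))  # 30
```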
The acquired data to be processed can then be checked by using the sample data. That is, the sample data is divided into the plurality of preset intervals (the same intervals as those described above), and the information entropy corresponding to the sample data divided into the different intervals is determined.
The process of determining the information entropy may be: for each interval, counting the proportion of the sample data in the interval among all the sample data as the information probability of the interval, and determining the information entropy corresponding to the sample data according to the determined information probability of each interval.
Specifically, the information entropy can be calculated using the following formula:
$$H(X) = -\sum_{i} P(x_i) \log_b P(x_i)$$
wherein $X$ is the sample set; $x_i$ is the sample data divided into the $i$-th interval;
$H(X)$ is the value of the information entropy of the sample data;
$P(x_i)$ is the probability of the sample data being divided into the $i$-th interval (its value is the proportion of the sample data in the $i$-th interval in the sample set);
$b$ is a constant, typically 2.
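The determination of the information probabilities and of the information entropy can be sketched as follows. The sketch assumes the standard Shannon form, H(X) = -Σ P(x_i) log_b P(x_i), with b = 2; the function names and the encoding of the intervals as a sorted list of boundaries (`edges`) are likewise assumptions. Under this assumed form the value is 0 when every sample falls into a single interval and log_b(k) when the samples are spread evenly over k intervals, so any threshold should be chosen with the sign convention of the formula actually used in mind.

```python
import math
from bisect import bisect_left
from collections import Counter

def interval_probabilities(samples, edges):
    """Proportion of samples falling into each interval (edges[i], edges[i + 1]].

    Assumes every sample lies inside (edges[0], edges[-1]].
    """
    counts = Counter(bisect_left(edges, v) - 1 for v in samples)
    n = len(samples)
    return [counts.get(i, 0) / n for i in range(len(edges) - 1)]

def information_entropy(probabilities, base=2):
    """Shannon entropy (assumed form); empty intervals contribute nothing."""
    return sum(-p * math.log(p, base) for p in probabilities if p > 0)

samples = [1.1, 1.2, 1.3, 1.4, 1.5, 6.1]
probs = interval_probabilities(samples, [0, 5, 10])   # [5/6, 1/6]
print(round(information_entropy(probs), 3))  # 0.65
```

The comparison with the preset threshold is deliberately left to the caller.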
It should be noted that the information entropy calculated by the above process can reflect the distribution of data to some extent, and it is considered that the more concentrated the sample data distribution is, the larger the value of the information entropy is, and the more dispersed the sample data distribution is, the smaller the value of the information entropy is. Therefore, in the embodiment of the present application, after the information entropy corresponding to the sample data is determined, the information entropy is further compared with a preset threshold, and the quantile calculation can be performed by using the above method only when the value of the information entropy is not greater than the threshold.
In the embodiment of the present application, the process of calculating the quantile based on the approximate values may be: for each interval, determining the cumulative proportion, among all the data to be processed, of the data divided into that interval and the preceding intervals; determining the quantile point corresponding to the quantile to be calculated; determining the cumulative proportion that matches the quantile point and the interval corresponding to that cumulative proportion; and taking the approximate value of that interval as the approximate value of the quantile to be calculated.
The following describes the quantile calculation process in the embodiment of the present application in detail with a specific application example.
In this example, assume that an unordered set of data to be processed includes {x1, x2, x3, x4, x5}, whose specific values are shown in fig. 2a. Meanwhile, assume that the preset interval length is 5; the specific intervals are also shown in fig. 2a. In fig. 2a, s1 to s4 denote the identifiers of 4 consecutive intervals. The data to be processed may then be divided into the 4 preset intervals.
Taking the interval s2 in fig. 2a as an example, in the label "(5, 10] (3)", "(5, 10]" represents the interval s2 and "3" represents the number of pieces of data to be processed divided into that interval. As can be seen, the data to be processed x1 to x5 are divided into the intervals s2, s3 and s4.
Further, corresponding approximate values may be determined for the intervals s2, s3 and s4, respectively. Here the approximate values are assumed to be calculated as (a + b)/2, where a and b represent the two endpoint values of an interval. Then the approximate value corresponding to s2 is 7.5, that corresponding to s3 is 12.5, and that corresponding to s4 is 17.5.
If, in a practical application, the user requests the 75th quantile, the existing calculation mode would traverse the data to be processed one by one, as shown in fig. 2b (in fig. 2b, the abscissa represents the quantile point and the ordinate represents the value of the data to be processed), and would finally determine that the 75th quantile is x4 (i.e., 10). This process requires multiple I/O operations. The embodiment of the present application instead adopts approximation, so that the quantile can be calculated based on the three approximate values above: 7.5, 12.5 and 17.5.
Specifically, for the interval s1, since no data to be processed is divided into it, the cumulative proportion of its data to be processed among all the data to be processed is 0.
For the interval s2, the cumulative proportion of the data to be processed is 60%; that is, the approximate value 7.5 corresponding to the interval s2 can be taken as the approximate value of the 1st to the 60th quantiles.
For the interval s3, the cumulative proportion of the data to be processed is 80%; that is, the approximate value 12.5 corresponding to the interval s3 can be taken as the approximate value of the 61st to the 80th quantiles.
For the interval s4, the cumulative proportion of the data to be processed is 100%; that is, the approximate value 17.5 corresponding to the interval s4 can be taken as the approximate value of the 81st to the 100th quantiles.
That is, as shown in fig. 2c (in fig. 2c, the abscissa represents the quantile point and the ordinate represents the approximate value), assuming that the user needs to calculate the 75th quantile, the 75th quantile is 12.5.
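The lookup walked through above can be sketched as follows, using the counts per interval (0, 3, 1, 1 for s1 to s4) and the midpoint approximations. The function name is hypothetical, and the integer comparison is a choice of this sketch to avoid floating-point rounding when matching a cumulative proportion against a quantile point.

```python
def approximate_quantile(counts, approximations, percentile):
    """Return the approximation of the first interval whose cumulative
    proportion reaches the requested percentile (1..100)."""
    total = sum(counts)
    running = 0
    for count, approx in zip(counts, approximations):
        running += count
        # running / total >= percentile / 100, compared with integers only.
        if count > 0 and running * 100 >= percentile * total:
            return approx
    raise ValueError("no data to be processed, or percentile out of range")

counts = [0, 3, 1, 1]                    # s1..s4
approximations = [2.5, 7.5, 12.5, 17.5]  # (a + b) / 2 for each interval
print(approximate_quantile(counts, approximations, 75))  # 12.5
```

Only the per-interval counts and approximations are consulted, so the lookup cost depends on the number of intervals rather than on the number of pieces of data to be processed.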
Of course, in practical applications, after the corresponding quantiles are calculated, the result fed back to the user may be a vector whose length is related to the number of quantile points set in the calculation request. For example, the user may set the number of quantile points to 100, meaning that 100 quantiles need to be calculated, and request the 75th quantile. The vector fed back to the user then has length 100, with its elements arranged in order from the 1st to the 100th quantile.
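Such a result vector can be built in a single pass over the intervals, as in the sketch below. The function name and the `points` parameter (the number of quantile points, 100 in the example) are assumptions for illustration.

```python
def quantile_vector(counts, approximations, points=100):
    """Build the vector of approximate quantiles for quantile points 1..points."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no data to be processed")
    vector = []
    running = 0
    k = 1  # next quantile point to fill
    for count, approx in zip(counts, approximations):
        running += count
        # Point k is covered once k / points <= running / total.
        while k <= points and k * total <= running * points:
            vector.append(approx)
            k += 1
    return vector

vector = quantile_vector([0, 3, 1, 1], [2.5, 7.5, 12.5, 17.5])
print(len(vector), vector[74])  # 100 12.5
```

The 75th quantile is the element at index 74 because the vector is ordered from the 1st quantile upward.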
Based on the same idea as the data processing method provided above, the embodiment of the present application further provides a data processing apparatus, as shown in fig. 3. The apparatus comprises:
a data acquisition module 301, configured to acquire a plurality of pieces of data to be processed;
a dividing module 302, configured to divide each to-be-processed data into a plurality of preset intervals;
an approximate value determining module 303, configured to determine an approximate value of the data to be processed in each interval;
and a quantile calculation module 304, configured to determine the quantile corresponding to the data to be processed according to the approximate values.
The device further comprises: the data distribution detection module 305 performs sampling on the acquired data to be processed, generates a plurality of sample data, divides each sample data into a plurality of preset intervals, and determines an information entropy corresponding to the divided sample data, wherein the information entropy is not greater than a set threshold.
Further, the data distribution detection module 305 performs sampling with replacement on the acquired data to be processed according to a preset sampling ratio.
The data distribution detection module 305 determines a value corresponding to each sample data, and divides each sample data into a plurality of preset intervals according to the value of each sample data.
The data distribution detection module 305 counts, for each interval, the proportion of the sample data in the interval in all sample data, and determines the information entropy corresponding to the sample data according to the determined information probability of each interval.
The dividing module 302 determines a numerical value corresponding to each piece of data to be processed, and divides each piece of data to be processed into a plurality of preset intervals according to the numerical value of each piece of data to be processed.
The approximate value determining module 303 is configured to determine, for any interval, the endpoint values of the interval, calculate the mean value of the endpoint values, and determine that mean value as the approximate value corresponding to the data to be processed divided into the interval; or to count, for any interval, the numerical value of each piece of data to be processed in the interval, calculate the mean value of the data to be processed, and determine that mean value as the approximate value corresponding to the data to be processed divided into the interval.
The quantile calculation module 304 determines, for each interval, a cumulative proportion of the to-be-processed data divided in the interval in all the to-be-processed data, determines a quantile point corresponding to the quantile to be calculated, determines a cumulative proportion matched with the quantile point, determines an interval corresponding to the cumulative proportion, and determines an approximate value of the determined interval as an approximate value of the quantile to be calculated.
In the 1990s, an improvement to a technology could be clearly distinguished as either an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, a transistor or a switch) or an improvement in software (an improvement to a method flow). However, as technology has advanced, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Thus, it cannot be said that an improvement to a method flow cannot be realized by a hardware physical module. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer programs a digital system onto a single PLD without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Furthermore, nowadays, instead of manually making integrated circuit chips, such programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development, and the source code to be compiled must be written in a specific programming language called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language), of which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used.
It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can be readily obtained merely by slightly logic-programming the method flow in one of the hardware description languages described above and programming it into an integrated circuit.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, the same functionality can be achieved by logically programming the method steps so that the controller takes the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may thus be considered a hardware component, and the means included therein for performing the various functions may also be considered structures within the hardware component. Indeed, the means for performing the functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner; for the same or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiment is substantially similar to the method embodiment, it is described relatively briefly, and for relevant points reference may be made to the corresponding description of the method embodiment.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.