CN113672661A - Data processing method, device, equipment and computer readable storage medium - Google Patents

Publication number
CN113672661A
Authority
CN
China
Prior art keywords
sample data
sampled
data
data set
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110886699.6A
Other languages
Chinese (zh)
Inventor
杨杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Cloud Network Technology Co Ltd
Original Assignee
Beijing Kingsoft Cloud Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Cloud Network Technology Co Ltd filed Critical Beijing Kingsoft Cloud Network Technology Co Ltd
Priority to CN202110886699.6A priority Critical patent/CN113672661A/en
Publication of CN113672661A publication Critical patent/CN113672661A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses

Abstract

The invention relates to a data processing method, a data processing device, data processing equipment and a computer readable storage medium. The data processing method comprises the following steps: acquiring the capacity of a memory allocated when a histogram is constructed; determining the number of samples according to the length of the data to be sampled in the data set to be sampled and the capacity of the memory; extracting a plurality of data to be sampled from the data set to be sampled with a fixed probability to form a sample data set, wherein the number of sample data in the sample data set is equal to the number of samples; and generating the histogram according to the sample data set. The method can obtain a larger number of samples according to different data types, so that the accuracy of a data processing result can be improved.

Description

Data processing method, device, equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data processing method, apparatus, device, and computer readable storage medium.
Background
A histogram is a kind of column-level statistic. It describes the distribution of column values in a database and is suited to scenarios in which data are unevenly distributed. Using the histogram, the database can accurately estimate the selectivity for different parameter values, which ensures the accuracy of the execution plan.
In the prior art, a uniform sample set is generated by a reservoir sampling algorithm. For example, the number of samples in the sample pool is S, and a data stream containing N data is scanned from the beginning; each datum in the data stream is selected into the sample set with a probability of S/N, which yields a uniform sample set, and a histogram is generated from that set. The sampling ratio S/N depends on the number S of samples in the sample pool: the larger S is, the higher the sampling ratio and the higher the accuracy of the histogram.
However, in the above approach the number of samples in the sample pool is predefined, while the type of the data to be processed is not known in advance, so the accuracy of the data processing result may be affected.
Disclosure of Invention
Embodiments of the present invention provide a data processing method, an apparatus, a device, and a computer-readable storage medium, which can obtain a larger number of samples for different data types, so as to improve accuracy of a data processing result.
In a first aspect, an embodiment of the present invention provides a data processing method, including:
acquiring the capacity of a memory allocated when a histogram is constructed;
determining the number of samples according to the length of the data to be sampled in the data set to be sampled and the capacity of the memory;
extracting a plurality of data to be sampled from the data set to be sampled with a fixed probability to form a sample data set, wherein the number of sample data in the sample data set is equal to the number of samples;
and generating the histogram according to the sample data set.
Optionally, determining the number of samples according to the length of the data to be sampled in the data set to be sampled and the capacity of the memory, includes:
determining the number of samples S according to the following formula:
S=[S1/L]
wherein S1 is the capacity of the memory, and L is the length of the data to be sampled in the data set to be sampled.
Optionally, the extracting a plurality of data to be sampled with a fixed probability from the data set to be sampled to form a sample data set, including:
adding the data to be sampled scanned at the ith time to the sample data set, and updating the number of times of scanning the data to be sampled to i = i + 1, wherein 1 ≤ i ≤ S, and S is the number of samples;
if the number of times i of scanning the data to be sampled is less than or equal to the number of samples S, repeating the step of adding the data to be sampled scanned at the ith time to the sample data set and updating the number of times of scanning the data to be sampled to i = i + 1, until i > S;
adding the data to be sampled scanned at the ith time to the sample data set with a probability of S/i and, when it is added, deleting one sample data from the original sample data set, and updating the number of times of scanning the data to be sampled to i = i + 1, wherein S < i ≤ N, and N is the number of data to be sampled in the data set to be sampled;
if the number of times i of scanning the data to be sampled is less than or equal to N, repeating the step of adding the data to be sampled scanned at the ith time to the sample data set with a probability of S/i, deleting one sample data from the original sample data set, and updating the number of times of scanning the data to be sampled to i = i + 1, until i > N.
Optionally, the generating a histogram according to the sample data set includes:
evenly dividing the numerical range of the sample data set into a plurality of intervals according to the numerical values of all the sample data in the sample data set;
and generating an equal-width histogram according to the intervals and the number of the sample data in each interval.
Optionally, the dividing, according to the numerical values of all the sample data in the sample data set, the numerical range of the sample data set into a plurality of intervals on average includes:
determining the maximum value and the minimum value in the numerical values of all the sample data according to the numerical values of all the sample data in the sample data set;
determining the numerical range of the sample data set according to an interval formed by the minimum value and the maximum value;
and averagely dividing the numerical range of the sample data set into a plurality of intervals.
Optionally, the generating a histogram according to the sample data set includes:
arranging all the sample data in the sample data set according to a sequence from small to large;
dividing all the sample data arranged in the order from small to large into a plurality of sample data subsets on average;
and generating an equal-depth histogram according to the numerical values and the quantity of the sample data in all the sample data subsets.
Optionally, generating an equal-depth histogram according to the numerical values and the number of the sample data in all the sample data subsets, including:
dividing the numerical range of the sample data set into a plurality of intervals according to the numerical values of the sample data in all the sample data subsets, wherein the quantity of the sample data in all the intervals is the same as that of the sample data in the sample data subsets;
and generating the equal-depth histogram according to all the intervals and the number of the sample data in the intervals.
In a second aspect, an embodiment of the present invention provides a data processing apparatus, including:
the acquisition module is used for acquiring the capacity of the memory allocated when the histogram is constructed;
the determining module is used for determining the number of samples according to the length of the data to be sampled in the data set to be sampled and the capacity of the memory; extracting a plurality of data to be sampled from the data set to be sampled with a fixed probability to form a sample data set, wherein the number of sample data in the sample data set is equal to the number of samples;
and the histogram generating module is used for generating a histogram according to the sample data set.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a processor for executing a computer program stored in a memory, the computer program, when executed by the processor, implementing the steps of any of the methods provided by the first aspect.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps of any one of the methods provided in the first aspect.
In the technical solution provided by the embodiments of the present invention, the capacity of the memory allocated when the histogram is constructed is acquired; the number of samples is determined according to the length of the data to be sampled in the data set to be sampled and the capacity of the memory; a plurality of data to be sampled are extracted from the data set to be sampled with a fixed probability to form a sample data set, wherein the number of sample data in the sample data set is equal to the number of samples; and the histogram is generated according to the sample data set. Because the number of samples is determined according to the length of the data to be sampled, and data to be sampled of different types have different lengths, as large a number of samples as possible, and hence as high a sampling ratio as possible, can be obtained for each type of data to be sampled. This improves the accuracy of the sample data set and therefore of the data processing result. In addition, even when the number of data to be sampled in the data set to be sampled is unknown, the data to be sampled can still be extracted with a fixed probability to form the sample data set, which ensures the fairness of the extraction.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present disclosure, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
FIG. 1 is a schematic flow chart of a data processing method according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating another data processing method according to an embodiment of the present invention;
FIG. 3 is a flow chart illustrating a further data processing method according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating another data processing method according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating another data processing method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an equal-width histogram according to an embodiment of the present invention;
FIG. 7 is a flowchart illustrating another data processing method according to an embodiment of the present invention;
FIG. 8 is a flowchart illustrating another data processing method according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of an equal-depth histogram according to an embodiment of the present invention;
FIG. 10 is a block diagram of a data processing apparatus according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
FIG. 1 is a schematic flow chart of a data processing method according to an embodiment of the present invention. As shown in FIG. 1, the method includes:
S101, acquiring the capacity of the memory allocated when the histogram is constructed.
When constructing a corresponding histogram according to all sample data in the sample data set, a certain memory space needs to be allocated to store all sample data in the sample data set, and the histogram is established based on all sample data in the sample data set stored in the allocated memory space. Therefore, the size of the allocated memory space directly affects the number of sample data in the sample data set, and the larger the allocated memory space is, the more sample data in the sample data set is stored.
S103, determining the number of samples according to the length of the data to be sampled in the data set to be sampled and the capacity of the memory.
All data to be sampled in one data set to be sampled are of the same type, that is, of the same length. The data to be sampled in different data sets to be sampled may be of different types, and data to be sampled of different types have different lengths.
Part of the data to be sampled in the data set to be sampled is extracted to form the sample data set; that is, every sample data in the sample data set is a data to be sampled from the data set to be sampled, so the length of the data to be sampled is the size of the memory occupied by each sample data. The number of sample data in the sample data set, that is, the number of samples, can therefore be obtained from the size of the memory occupied by each sample data and the capacity of the memory allocated for storing the sample data set.
In summary, because data to be sampled of different types have different lengths, different numbers of samples are obtained for different types. For each type of data to be sampled, the number of samples is as large as possible and the sampling ratio is as high as possible, so the sample data set is closer to the data set to be sampled, that is, closer to the true distribution. This improves the accuracy of the sample data set and helps ensure the accuracy of the subsequent data processing result.
S105, extracting a plurality of data to be sampled from the data set to be sampled with fixed probability to form a sample data set.
The number of sample data in the set of sample data is equal to the number of samples.
Based on the above embodiment, the number of samples has been determined. For example, if the number of samples is 100 and the number of data to be sampled in the data set to be sampled is N, then 100 data to be sampled are extracted from the N data to be sampled, each with a probability of 100/N, to serve as sample data, and these 100 sample data form the sample data set. Even when the number of data to be sampled in the data set to be sampled is unknown, the data to be sampled can still be extracted with a fixed probability to form the sample data set, which ensures the fairness of the extraction.
S107, generating the histogram according to the sample data set.
The numerical value of the sample data in the sample data set is used as an abscissa, the abscissa is divided into a plurality of intervals, the number of the sample data in each interval is used as an ordinate, a histogram of the sample data set is generated, and the distribution condition of all the sample data in the sample data set can be visually displayed through the histogram. Based on the embodiment, the sample data set with higher accuracy can be obtained, so that the distribution condition of the sample data is closer to the real condition, namely the accuracy of the data processing result is improved.
In the technical solution provided by the embodiments of the present invention, the capacity of the memory allocated when the histogram is constructed is acquired; the number of samples is determined according to the length of the data to be sampled in the data set to be sampled and the capacity of the memory; a plurality of data to be sampled are extracted from the data set to be sampled with a fixed probability to form a sample data set, wherein the number of sample data in the sample data set is equal to the number of samples; and the histogram is generated according to the sample data set. Because the number of samples is determined according to the length of the data to be sampled, and data to be sampled of different types have different lengths, as large a number of samples as possible, and hence as high a sampling ratio as possible, can be obtained for each type of data to be sampled. This improves the accuracy of the sample data set and therefore of the data processing result. In addition, even when the number of data to be sampled in the data set to be sampled is unknown, the data to be sampled can still be extracted with a fixed probability to form the sample data set, which ensures the fairness of the extraction.
FIG. 2 is a schematic flow chart of another data processing method according to an embodiment of the present invention. On the basis of the embodiment shown in FIG. 1, FIG. 2 describes in detail a possible implementation of S103, as follows:
S103', determining the number of samples S according to the following formula:
S=[S1/L]
wherein S1 is the capacity of the memory, and L is the length of the data to be sampled in the data set to be sampled.
The capacity of the memory space allocated when the corresponding histogram is constructed is S1, and the length of the sample data in the sample data set is L. The allocated memory space is used for storing the sample data set, so each sample data stored in it occupies a memory space of L, and the number of samples S can be calculated from S1/L. If S1/L is not an integer, it is rounded down to obtain the final number of samples S. For example, if the length of the sample data in the sample data set is 3 and the capacity of the memory space allocated when constructing the corresponding histogram is 215, then 215/3 ≈ 71.67, and [71.67] denotes rounding 71.67 down to 71, so the number of samples is 71.
In the embodiment of the present invention, the number of samples S is obtained according to the formula S = [S1/L], which ensures that the number of samples does not exceed the maximum number of sample data that can be stored in the allocated memory space, while obtaining as large a number of samples as possible for data to be sampled of different lengths.
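The formula S = [S1/L] can be illustrated with a minimal Python sketch. The function name `sample_count` and its parameter names are hypothetical and do not appear in the original text; the sketch assumes that the capacity and the length are expressed in the same unit and that [ ] denotes rounding down, as in the example above.

```python
def sample_count(memory_capacity: int, item_length: int) -> int:
    """Return S = [S1 / L]: how many fixed-length items fit in the allocated memory.

    memory_capacity plays the role of S1 and item_length the role of L.
    Both names are illustrative assumptions, not taken from the source.
    """
    if item_length <= 0:
        raise ValueError("item_length must be positive")
    if memory_capacity < 0:
        raise ValueError("memory_capacity must be non-negative")
    # Integer floor division rounds S1 / L down, so S * L never exceeds S1.
    return memory_capacity // item_length


# The worked example from the text: capacity 215, length 3.
print(sample_count(215, 3))  # 71
```

A shorter data type yields a larger S for the same capacity, which is the effect the embodiment relies on.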
FIG. 3 is a schematic flowchart of another data processing method according to an embodiment of the present invention. On the basis of the embodiment shown in FIG. 1, FIG. 3 describes in detail a possible implementation of S105, as follows:
S1051, adding the data to be sampled scanned at the ith time to the sample data set, and updating the number of times of scanning the data to be sampled to i = i + 1.
Wherein 1 ≤ i ≤ S, and S is the number of samples.
S1052, determining whether the number of times i of scanning the data to be sampled is greater than S.
If not, S1051 to S1052 are repeated until the number of times i of scanning the data to be sampled is greater than the number of samples S.
The data to be sampled of the 1st scan is extracted into the sample data set with a probability of 1, and for 1 < i ≤ S the data to be sampled of the ith scan is likewise extracted with a probability of 1, so the data to be sampled of the 1st to the Sth scans are all added to the sample data set. For example, if the number of samples S is 100, the data to be sampled of the 1st scan through the 100th scan are all added to the sample data set.
When the data to be sampled is scanned for the (S+1)th time, the newly scanned data starts to replace the data from the first S scans in the sample data set. The probability that the data to be sampled of the (S+1)th scan is extracted is S/(S+1). If it is extracted, the probability that the data to be sampled of the ith scan (1 ≤ i ≤ S) in the sample data set is the one replaced is 1/S, so the probability that the data of the (S+1)th scan replaces the data of the ith scan is S/(S+1) × 1/S = 1/(S+1), and the probability that the data of the ith scan is not replaced is 1 - 1/(S+1) = S/(S+1). These operations are repeated up to the Nth scan: the probability that the data to be sampled of the Nth scan is extracted is S/N; if it is extracted, the probability that the data of the ith scan in the sample data set is the one replaced is 1/S; the probability that the data of the Nth scan replaces the data of the ith scan is therefore 1/N, and the probability that the data of the ith scan is not replaced at this step is 1 - 1/N = (N-1)/N. Consequently, the probability that the data of the ith scan is still in the sample data set after all N scans is S/(S+1) × (S+1)/(S+2) × … × (N-1)/N = S/N.
S1053, adding the data to be sampled scanned at the ith time to the sample data set with a probability of S/i and, when it is added, deleting one sample data from the original sample data set, and updating the number of times of scanning the data to be sampled to i = i + 1.
And i is greater than S, i is less than or equal to N, and N is the number of the data to be sampled in the data set to be sampled.
S1054, determining whether the number i of times of scanning the data to be sampled is larger than N.
If not, repeatedly executing S1053-S1054 until the number i of times of scanning the data to be sampled is larger than the number N of the data to be sampled in the data set to be sampled.
If the number of times i of scanning the data to be sampled is greater than the number of samples S, then for the data to be sampled scanned at the ith time, where i ≤ N, a random number d is selected in [1, i]. If d ≤ S, the data to be sampled scanned at the ith time replaces one sample data in the sample data set, for example the dth sample data, so the probability that the data to be sampled scanned at the ith time is extracted into the sample data set is S/i.
When the data to be sampled is scanned for the (i+1)th time, the probability that the data to be sampled of the (i+1)th scan is extracted into the sample data set is S/(i+1). If it is extracted, the probability that the data of the ith scan in the sample data set is the one replaced is 1/S, so the probability that the data of the (i+1)th scan replaces the data of the ith scan is S/(i+1) × 1/S = 1/(i+1), and the probability that the data of the ith scan is not replaced is 1 - 1/(i+1) = i/(i+1). These operations are repeated up to the Nth scan: the probability that the data of the Nth scan is extracted is S/N, the probability that it replaces the data of the ith scan is 1/N, and the probability that the data of the ith scan is not replaced at this step is 1 - 1/N = (N-1)/N. Therefore, the probability that the data of the ith scan, once extracted, is not replaced in any later scan is i/(i+1) × (i+1)/(i+2) × … × (N-1)/N = i/N, and the probability that it remains in the sample data set after all N scans is S/i × i/N = S/N.
In summary, even when the number of data to be sampled in the data set to be sampled is not known in advance, each data to be sampled is retained in the sample data set with a probability of S/N; that is, the data are extracted into the sample data set with a fixed probability, which ensures the fairness of the extraction.
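The argument above can be written compactly as follows. This is a restatement under the assumption, implied by the 1/S replacement probability in the text, that the sample data to be replaced is chosen uniformly among the S sample data; the index j for a later scan is introduced here only for the derivation.

```latex
% Survival of an item through a later scan j (the new item enters with
% probability S/j and then evicts each stored item with probability 1/S):
\Pr[\text{not replaced at scan } j] = 1 - \frac{S}{j}\cdot\frac{1}{S} = \frac{j-1}{j}

% Item scanned at position i \le S (enters with probability 1):
\Pr[\text{retained after } N \text{ scans}]
  = \prod_{j=S+1}^{N} \frac{j-1}{j} = \frac{S}{N}

% Item scanned at position S < i \le N (enters with probability S/i):
\Pr[\text{retained after } N \text{ scans}]
  = \frac{S}{i}\prod_{j=i+1}^{N} \frac{j-1}{j}
  = \frac{S}{i}\cdot\frac{i}{N} = \frac{S}{N}
```

In both cases the products telescope, so every data to be sampled ends up in the sample data set with the same probability S/N.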
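Steps S1051 to S1054 can be sketched in Python as below. This is a minimal sketch, not the patent's implementation: the function name `reservoir_sample` and its parameters are hypothetical, and replacing the d-th stored sample when d ≤ S is one assumed way of "deleting one sample data in the original sample data set".

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], s: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Draw up to s items from a stream of unknown length, each kept with probability s/N."""
    if s <= 0:
        raise ValueError("s must be positive")
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream, start=1):  # i is the scan count
        if i <= s:
            # S1051-S1052: the first s scanned items are always added.
            reservoir.append(item)
        else:
            # S1053-S1054: pick d uniformly in [1, i]; d <= s happens with probability s/i.
            d = rng.randint(1, i)
            if d <= s:
                # Assumed replacement rule: overwrite the d-th stored sample.
                reservoir[d - 1] = item
    return reservoir


sample = reservoir_sample(range(1000), 100, random.Random(0))
print(len(sample))  # 100
```

The stream is consumed in a single pass and N is never needed up front, which matches the case where the number of data to be sampled is unknown. If the stream holds fewer than s items, all of them are returned.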
FIG. 4 is a schematic flowchart of another data processing method according to an embodiment of the present invention. On the basis of the embodiment shown in FIG. 1, FIG. 4 describes in detail a possible implementation of S107, as follows:
S201, evenly dividing the numerical range of the sample data set into a plurality of intervals according to the numerical values of all the sample data in the sample data set.
The numerical values of all sample data in the sample data set are obtained, the numerical range of the sample data set is determined from them, and the numerical range is evenly divided into a plurality of numerical intervals; that is, the difference between the maximum value and the minimum value is the same for every numerical interval.
S203, generating an equal-width histogram according to the intervals and the number of the sample data in each interval.
The numerical value of the sample data is taken as the abscissa x, and the abscissa axis is divided according to the numerical intervals, so that the divided intervals on the abscissa axis are of equal width. The number of sample data in each equal-width interval is obtained and taken as the ordinate y, and the equal-width histogram is generated.
FIG. 5 is a schematic flowchart of another data processing method according to an embodiment of the present invention. On the basis of the embodiment shown in FIG. 4, FIG. 5 describes in detail a possible implementation of S201, as follows:
S2011, determining the maximum value and the minimum value among the numerical values of all the sample data in the sample data set.
And obtaining the numerical values of all sample data in the sample data set, and determining the maximum value and the minimum value in the numerical values of all sample data. For example, the sample data set is {1.6, 1.9, 1.9, 2.0, 2.4, 2.6, 2.7, 2.7, 2.8, 2.9, 3.4, 3.5}, and the maximum value and the minimum value in the sample data set are 3.5 and 1.6, respectively.
S2012, determining the numerical range of the sample data set according to the interval formed by the minimum value and the maximum value.
Based on the above embodiment, an interval formed by the maximum value and the minimum value is [1.6, 3.5], and [1.6, 3.5] may be determined as a numerical range of the sample data set, or [0, 3.5] may be determined as a numerical range of the sample data set, or [1.6, 4] may be determined as a numerical range of the sample data set, or [0, 4] may be determined as a numerical range of the sample data set.
S2013, evenly dividing the numerical range of the sample data set into a plurality of intervals.
For example, FIG. 6 is a schematic diagram of an equal-width histogram according to an embodiment of the present invention. Based on the above embodiment, if the numerical range of the sample data set is [1.6, 3.5], the range is evenly divided into 2 intervals, [1.6, 2.55] and [2.55, 3.5], and the abscissa is divided into 2 intervals accordingly. The number of sample data in the interval [1.6, 2.55] is 5 and the number of sample data in the interval [2.55, 3.5] is 7, which gives the equal-width histogram shown in FIG. 6.
It should be noted that fig. 6 only exemplarily shows that the numerical range of the sample data set is divided into 2 intervals, in other embodiments, the numerical range [1.6, 3.5] of the sample data set may also be divided into three or more intervals, which is not specifically limited in this embodiment of the present invention.
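Steps S2011 to S2013 and S203 can be sketched as follows. The function name `equal_width_histogram` and the choice of [minimum, maximum] as the numerical range are assumptions made for illustration; the text also allows wider ranges such as [0, 4]. A value equal to the maximum is placed in the last interval.

```python
from typing import List, Sequence, Tuple


def equal_width_histogram(samples: Sequence[float],
                          num_bins: int) -> List[Tuple[float, float, int]]:
    """Return (lower, upper, count) for num_bins intervals of equal width."""
    if not samples or num_bins <= 0:
        raise ValueError("need at least one sample and one bin")
    lo, hi = min(samples), max(samples)      # S2011: minimum and maximum
    width = (hi - lo) / num_bins             # S2012-S2013: split [lo, hi] evenly
    counts = [0] * num_bins
    for x in samples:
        if width == 0:
            idx = 0                          # all samples are identical
        else:
            # Clamp so that x == hi falls into the last interval.
            idx = min(int((x - lo) / width), num_bins - 1)
        counts[idx] += 1
    # S203: pair each interval with the number of samples it contains.
    return [(lo + k * width, lo + (k + 1) * width, counts[k])
            for k in range(num_bins)]


data = [1.6, 1.9, 1.9, 2.0, 2.4, 2.6, 2.7, 2.7, 2.8, 2.9, 3.4, 3.5]
hist = equal_width_histogram(data, 2)
print([count for _, _, count in hist])  # [5, 7]
```

With the sample data set from the text and 2 intervals, the counts are 5 and 7, in agreement with FIG. 6 as described.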
FIG. 7 is a schematic flowchart of another data processing method according to an embodiment of the present invention. On the basis of the embodiment shown in FIG. 1, FIG. 7 describes in detail another possible implementation of S107, as follows:
S301, arranging all the sample data in the sample data set in ascending order.
All sample data in the sample data set are arranged in the order from small to large, for example, the sample data set after the sample data are arranged in the order from small to large may be {1.6, 1.9, 1.9, 2.0, 2.4, 2.6, 2.7, 2.7, 2.8, 2.9, 3.4, 3.5 }.
S303, evenly dividing all the sample data arranged in ascending order into a plurality of sample data subsets.
Illustratively, based on the above embodiment, all sample data in the sample data set {1.6, 1.9, 1.9, 2.0, 2.4, 2.6, 2.7, 2.7, 2.8, 2.9, 3.4, 3.5} are equally divided into 4 sample data subsets, which are {1.6, 1.9, 1.9}, {2.0, 2.4, 2.6}, {2.7, 2.7, 2.8} and {2.9, 3.4, 3.5}, respectively, and the number of sample data in each sample data subset is 3, i.e. the number of sample data in all sample data subsets is the same. In other embodiments, all sample data in the sample data set {1.6, 1.9, 1.9, 2.0, 2.4, 2.6, 2.7, 2.7, 2.8, 2.9, 3.4, 3.5} may be equally divided into 2 or 3 sample data subsets, which is not specifically limited in this application.
S305, generating an equal-depth histogram according to the numerical values and the number of the sample data in all the sample data subsets.
The numerical value of the sample data is taken as the abscissa x, and the abscissa is divided into a plurality of intervals according to the numerical values of all sample data in each sample data subset. Each interval covers the numerical values of all sample data in the corresponding sample data subset and does not cover the numerical value of any sample data in an adjacent sample data subset. The equal-depth histogram is generated by taking the number of sample data in each sample data subset as the ordinate y.
In the embodiment of the invention, all the sample data in the sample data set are arranged in ascending order; all the sample data arranged in ascending order are divided evenly into a plurality of sample data subsets; and an equal-depth histogram is generated according to the numerical values and the number of the sample data in all the sample data subsets. The data distribution reflected by the equal-depth histogram has a small error, so the accuracy of the data processing result can be improved.
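Steps S301 and S303 can be sketched in Python as follows. The function name `split_into_equal_subsets` is hypothetical. The embodiment only discusses sample counts that divide evenly, so giving the leftover samples to the first subsets is an assumption made here for illustration.

```python
def split_into_equal_subsets(samples, num_subsets):
    """Sort the samples in ascending order and cut them into num_subsets runs."""
    ordered = sorted(samples)
    size, remainder = divmod(len(ordered), num_subsets)
    subsets = []
    start = 0
    for k in range(num_subsets):
        # Assumption: if the count is not divisible, the first `remainder`
        # subsets each receive one extra sample.
        end = start + size + (1 if k < remainder else 0)
        subsets.append(ordered[start:end])
        start = end
    return subsets


samples = [2.7, 1.9, 3.5, 2.0, 1.6, 2.9, 2.4, 2.7, 1.9, 3.4, 2.6, 2.8]
subsets = split_into_equal_subsets(samples, 4)
print(subsets)
```

With the 12 values of the example and 4 subsets, each subset holds 3 sample data, matching the subsets listed in the embodiment.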
Fig. 8 is a schematic flowchart of another data processing method according to an embodiment of the present invention, and fig. 8 is a detailed description of a possible implementation manner of executing S305 based on the embodiment shown in fig. 7, as follows:
S3051, dividing the numerical range of the sample data set into a plurality of intervals according to the numerical values of the sample data in all the sample data subsets.
The number of sample data in each interval is the same as the number of sample data in the corresponding sample data subset.
The numerical range of the sample data set is divided into a plurality of intervals, and each interval covers only the numerical values of all sample data in one sample data subset. Based on the above embodiment, the sample data subsets are {1.6, 1.9, 1.9}, {2.0, 2.4, 2.6}, {2.7, 2.7, 2.8} and {2.9, 3.4, 3.5}, and the numerical range of the sample data set is [1.6, 3.5]. The numerical range of the sample data set is therefore divided into the intervals [1.6, 1.9], [2.0, 2.6], [2.7, 2.8] and [2.9, 3.5], and the number of sample data in each interval is 3.
S3052, generating the equal-depth histogram according to all the intervals and the number of the sample data in the intervals.
Exemplarily, fig. 9 is a schematic diagram of an equal-depth histogram according to an embodiment of the present invention, based on the above embodiment, an abscissa is divided into intervals of [1.6, 1.9], [2.0, 2.6], [2.7, 2.8] and [2.9, 3.5], the number of sample data corresponding to each interval is 3, and the equal-depth histogram shown in fig. 9 is generated.
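Steps S3051 and S3052 can be sketched as follows, assuming each sample data subset is non-empty and already sorted in ascending order. The function name `equal_depth_histogram` and the list-of-tuples result are illustrative choices, not part of the embodiment.

```python
def equal_depth_histogram(subsets):
    """Return ((interval_min, interval_max), count) for each sorted subset."""
    histogram = []
    for subset in subsets:
        # The interval spans from the smallest to the largest value of the subset.
        interval = (subset[0], subset[-1])
        histogram.append((interval, len(subset)))
    return histogram


subsets = [[1.6, 1.9, 1.9], [2.0, 2.4, 2.6], [2.7, 2.7, 2.8], [2.9, 3.4, 3.5]]
histogram = equal_depth_histogram(subsets)
for interval, count in histogram:
    print(interval, count)
```

One caveat of this sketch: if equal values fall on both sides of a subset boundary, the endpoints of two adjacent intervals coincide, so a practical implementation would need a rule for such ties.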
Fig. 10 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention, and as shown in fig. 10, a data processing apparatus 100 includes:
an obtaining module 110, configured to obtain a capacity of a memory allocated when the histogram is constructed.
A determining module 120, configured to determine the number of samples according to the length of the data to be sampled in the data set to be sampled and the capacity of the memory; and extract a plurality of data to be sampled from the data set to be sampled with a fixed probability to form a sample data set, wherein the number of sample data in the sample data set is equal to the number of samples.
A histogram generating module 130, configured to generate a histogram according to the sample data set.
Optionally, the determining module 120 is further configured to determine the number of samples S according to the following formula:
S=[S1/L]
wherein S1 is the capacity of the memory, and L is the length of the data to be sampled in the data set to be sampled.
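As a small worked sketch of this formula, the square brackets are assumed here to denote rounding down to an integer, and both quantities are assumed to be expressed in the same unit (for example, bytes). The function name `sample_count` is hypothetical.

```python
def sample_count(memory_capacity, record_length):
    """Number of samples S that fit in the allocated memory, S = [S1 / L].

    Assumption: the brackets mean rounding down, and both arguments use
    the same unit (for example, bytes).
    """
    if record_length <= 0:
        raise ValueError("record_length must be positive")
    return memory_capacity // record_length


print(sample_count(1024, 100))  # 1024 bytes of memory, 100-byte records
```

Under these assumptions, 1024 bytes of allocated memory and 100-byte data to be sampled would give 10 samples.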
Optionally, the determining module 120 is further configured to add the data to be sampled scanned at the ith time to the sample data set, and update the number of times of scanning the data to be sampled to i = i + 1, where 1 ≤ i ≤ S and S is the number of samples; if the number of times i of scanning the data to be sampled is less than or equal to the number of samples S, repeat the step of adding the data to be sampled scanned at the ith time to the sample data set and updating the number of times of scanning the data to be sampled to i = i + 1, until i > S; add the data to be sampled scanned at the ith time to the sample data set with a probability of S/i, delete one sample data from the original sample data set, and update the number of times of scanning the data to be sampled to i = i + 1, where S < i ≤ N and N is the number of data to be sampled in the data set to be sampled; and if the number of times i of scanning the data to be sampled is less than or equal to N, repeat the step of adding the data to be sampled scanned at the ith time to the sample data set with a probability of S/i, deleting one sample data from the original sample data set, and updating the number of times of scanning the data to be sampled to i = i + 1, until i > N.
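The scanning procedure above matches the general shape of reservoir sampling, and a minimal Python sketch is given below. The function name `reservoir_sample` is hypothetical. The text only says that one sample data is deleted from the original sample data set, so choosing the deleted sample uniformly at random is an assumption of this sketch.

```python
import random


def reservoir_sample(stream, sample_size, rng=None):
    """Keep sample_size items from a stream that is scanned exactly once."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= sample_size:
            # The first S scanned items are added unconditionally.
            reservoir.append(item)
        elif rng.random() < sample_size / i:
            # With probability S/i the i-th item is added and one existing
            # sample is deleted (assumption: chosen uniformly at random).
            reservoir[rng.randrange(sample_size)] = item
    return reservoir


sample = reservoir_sample(range(100), 10, random.Random(0))
print(len(sample))
```

Because the data set is scanned only once and at most S items are held at any time, memory use stays bounded by the number of samples regardless of N.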
Optionally, the histogram generating module 130 is further configured to divide the numerical range of the sample data set evenly into a plurality of intervals according to the numerical values of all the sample data in the sample data set; and generate a uniform width histogram according to the intervals and the number of the sample data in each interval.
Optionally, the histogram generating module 130 is further configured to determine, according to the numerical values of all the sample data in the sample data set, a maximum value and a minimum value of the numerical values of all the sample data; determine the numerical range of the sample data set according to an interval formed by the minimum value and the maximum value; and divide the numerical range of the sample data set evenly into a plurality of intervals.
Optionally, the histogram generating module 130 is further configured to arrange all the sample data in the sample data set in ascending order; divide all the sample data arranged in ascending order evenly into a plurality of sample data subsets; and generate an equal-depth histogram according to the numerical values and the number of the sample data in all the sample data subsets.
Optionally, the histogram generating module 130 is further configured to divide a numerical range of the sample data set into a plurality of intervals according to the numerical values of the sample data in all the sample data subsets, where the number of the sample data in all the intervals is the same as the number of the sample data in the sample data subsets; and generating the equal-depth histogram according to all the intervals and the number of the sample data in the intervals.
An electronic device is further provided in an embodiment of the present invention, fig. 11 is a schematic structural diagram of an electronic device provided in an embodiment of the present invention, and fig. 11 shows a block diagram of an exemplary electronic device suitable for implementing an embodiment of the present invention. The electronic device shown in fig. 11 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 11, electronic device 12 is embodied in the form of a general purpose computing device. The components of electronic device 12 may include, but are not limited to: one or more processors 16, a system memory 28, and a bus 18 that connects the various system components (including the system memory 28 and the processors 16).
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Electronic device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by electronic device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The electronic device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 11, and commonly referred to as a "hard drive"). Although not shown in FIG. 11, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. System memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in system memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of embodiments described herein.
The processor 16 executes various functional applications and data processing, such as implementing the steps of any of the above-described method embodiments, by executing at least one of a plurality of programs stored in the system memory 28.
Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of any of the above-mentioned method embodiments.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
Embodiments of the present invention further provide a computer program product, which when run on a computer causes the computer to perform the steps of any of the above-described method embodiments.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A data processing method, comprising:
acquiring the capacity of a memory allocated when a histogram is constructed;
determining the number of samples according to the length of the data to be sampled in the data set to be sampled and the capacity of the memory;
extracting a plurality of data to be sampled from the data set to be sampled with a fixed probability to form a sample data set, wherein the number of sample data in the sample data set is equal to the number of samples;
and generating the histogram according to the sample data set.
2. The method of claim 1, wherein determining the number of samples according to the length of the data to be sampled in the data set to be sampled and the capacity of the memory comprises:
determining the number of samples S according to the following formula:
S=[S1/L]
wherein S1 is the capacity of the memory, and L is the length of the data to be sampled in the data set to be sampled.
3. The method according to claim 1 or 2, wherein said extracting a plurality of said data to be sampled with a fixed probability from said data set to be sampled forms a sample data set, comprising:
adding the data to be sampled scanned at the ith time to the sample data set, and updating the number of times of scanning the data to be sampled to i = i + 1, wherein 1 ≤ i ≤ S, and S is the number of samples;
if the number of times i of scanning the data to be sampled is less than or equal to the number of samples S, repeatedly executing the step of adding the data to be sampled scanned at the ith time to the sample data set and updating the number of times of scanning the data to be sampled to i = i + 1, until i > S;
adding the data to be sampled scanned at the ith time to the sample data set with a probability of S/i, deleting one sample data from an original sample data set, and updating the number of times of scanning the data to be sampled to i = i + 1, wherein S < i ≤ N, and N is the number of the data to be sampled in the data set to be sampled;
if the number of times i of scanning the data to be sampled is less than or equal to N, repeatedly executing the step of adding the data to be sampled scanned at the ith time to the sample data set with a probability of S/i, deleting one sample data from the original sample data set, and updating the number of times of scanning the data to be sampled to i = i + 1, until i > N.
4. The method according to claim 1 or 2, wherein said generating a histogram from said set of sample data comprises:
dividing the numerical range of the sample data set evenly into a plurality of intervals according to the numerical values of all the sample data in the sample data set;
and generating a uniform width histogram according to the intervals and the number of the sample data in each interval.
5. The method according to claim 4, wherein said dividing the numerical range of the sample data set evenly into a plurality of intervals according to the numerical values of all the sample data in the sample data set comprises:
determining the maximum value and the minimum value in the numerical values of all the sample data according to the numerical values of all the sample data in the sample data set;
determining the numerical range of the sample data set according to an interval formed by the minimum value and the maximum value;
and dividing the numerical range of the sample data set evenly into a plurality of intervals.
6. The method according to claim 1 or 2, wherein said generating a histogram from said set of sample data comprises:
arranging all the sample data in the sample data set in ascending order;
dividing all the sample data arranged in ascending order evenly into a plurality of sample data subsets;
and generating an equal-depth histogram according to the numerical values and the quantity of the sample data in all the sample data subsets.
7. The method of claim 6, wherein generating an equal-depth histogram from values and quantities of the sample data in all of the subsets of sample data comprises:
dividing the numerical range of the sample data set into a plurality of intervals according to the numerical values of the sample data in all the sample data subsets, wherein the quantity of the sample data in all the intervals is the same as that of the sample data in the sample data subsets;
and generating the equal-depth histogram according to all the intervals and the number of the sample data in the intervals.
8. A data processing apparatus, comprising:
the acquisition module is used for acquiring the capacity of the memory allocated when the histogram is constructed;
the determining module is used for determining the number of samples according to the length of the data to be sampled in the data set to be sampled and the capacity of the memory; extracting a plurality of data to be sampled from the data set to be sampled with a fixed probability to form a sample data set, wherein the number of sample data in the sample data set is equal to the number of samples;
and the histogram generating module is used for generating a histogram according to the sample data set.
9. An electronic device, comprising: a processor for executing a computer program stored in a memory, the computer program, when executed by the processor, implementing the steps of the method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202110886699.6A 2021-08-03 2021-08-03 Data processing method, device, equipment and computer readable storage medium Pending CN113672661A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110886699.6A CN113672661A (en) 2021-08-03 2021-08-03 Data processing method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110886699.6A CN113672661A (en) 2021-08-03 2021-08-03 Data processing method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN113672661A true CN113672661A (en) 2021-11-19

Family

ID=78541220

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110886699.6A Pending CN113672661A (en) 2021-08-03 2021-08-03 Data processing method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113672661A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6278989B1 (en) * 1998-08-25 2001-08-21 Microsoft Corporation Histogram construction using adaptive random sampling with cross-validation for database systems
CN107330083A (en) * 2017-07-03 2017-11-07 贵州大学 Wide histogram parallel constructing method
CN112905517A (en) * 2021-03-09 2021-06-04 明峰医疗系统股份有限公司 Variable packet length data acquisition method based on FPGA

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
范聪;李建增;张岩;: "基于采样优化的随机抽取一致性算法", 电光与控制, no. 07 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination