CN113672661A - Data processing method, device, equipment and computer readable storage medium - Google Patents
- Publication number
- CN113672661A CN113672661A CN202110886699.6A CN202110886699A CN113672661A CN 113672661 A CN113672661 A CN 113672661A CN 202110886699 A CN202110886699 A CN 202110886699A CN 113672661 A CN113672661 A CN 113672661A
- Authority
- CN
- China
- Prior art keywords
- sample data
- sampled
- data
- data set
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
Abstract
The invention relates to a data processing method, apparatus, device, and computer-readable storage medium. The data processing method comprises the following steps: acquiring the capacity of the memory allocated for constructing a histogram; determining the number of samples according to the length of the data to be sampled in the data set to be sampled and the memory capacity; extracting a plurality of data to be sampled from the data set with a fixed probability to form a sample data set, the number of sample data in the sample data set being equal to the number of samples; and generating the histogram from the sample data set. Because the number of samples is derived from the data length, the method obtains as large a sample count as possible for different data types, improving the accuracy of the data processing result.
Description
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data processing method, apparatus, device, and computer readable storage medium.
Background
A histogram is a form of column-level statistics, mainly used to describe the distribution of column values in a database, and is suited to scenarios where data is unevenly distributed. Based on the histogram, the database can accurately estimate the selectivity for different parameter values, ensuring the accuracy of the query plan.
In the prior art, a uniform sample set is generated by a reservoir sampling algorithm. For example, with a sample pool of size S, a data stream containing N items is scanned from the beginning, and each item is selected into the sample set with probability S/N; a histogram is then generated from this uniform sample set. The sampling ratio S/N depends on the pool size S: the larger S is, the higher the sampling ratio, and the more accurate the histogram.
However, the pool size S in this approach is predefined, while the type of the data to be processed is not known in advance; the accuracy of the data processing result may therefore suffer.
Disclosure of Invention
Embodiments of the present invention provide a data processing method, an apparatus, a device, and a computer-readable storage medium, which can obtain a larger number of samples for different data types, so as to improve accuracy of a data processing result.
In a first aspect, an embodiment of the present invention provides a data processing method, including:
acquiring the capacity of a memory allocated when a histogram is constructed;
determining the number of samples according to the length of the data to be sampled in the data set to be sampled and the capacity of the memory;
extracting a plurality of data to be sampled from the data set to be sampled with a fixed probability to form a sample data set, wherein the number of sample data in the sample data set is equal to the number of samples;
and generating the histogram according to the sample data set.
Optionally, determining the number of samples according to the length of the data to be sampled in the data set to be sampled and the capacity of the memory, includes:
determining the number of samples S according to the following formula:
S = [S1/L]
wherein S1 is the capacity of the memory, L is the length of the data to be sampled in the data set to be sampled, and [·] denotes rounding down to an integer.
Optionally, extracting a plurality of data to be sampled from the data set to be sampled with a fixed probability to form a sample data set includes:
adding the data to be sampled scanned for the i-th time to the sample data set, and updating the scan count i = i + 1, wherein 1 ≤ i ≤ S and S is the number of samples;
if the scan count i is less than or equal to the number of samples S, repeating the step of adding the data scanned for the i-th time to the sample data set and updating i = i + 1, until i > S;
adding the data to be sampled scanned for the i-th time to the sample data set with probability S/i, deleting one sample datum from the original sample data set, and updating the scan count i = i + 1, wherein S < i ≤ N and N is the number of data to be sampled in the data set to be sampled;
if the scan count i is less than or equal to N, repeating the step of adding the data scanned for the i-th time with probability S/i, deleting one sample datum from the original sample data set, and updating i = i + 1, until i > N.
Optionally, the generating a histogram according to the sample data set includes:
evenly dividing the numerical range of the sample data set into a plurality of intervals according to the values of all the sample data in the sample data set;
and generating an equal-width histogram from the intervals and the number of sample data falling in each interval.
Optionally, evenly dividing the numerical range of the sample data set into a plurality of intervals according to the values of all the sample data includes:
determining the maximum and minimum of the values of all the sample data in the sample data set;
determining the numerical range of the sample data set as an interval bounded by the minimum and the maximum;
and evenly dividing that numerical range into a plurality of intervals.
Optionally, the generating a histogram according to the sample data set includes:
arranging all the sample data in the sample data set in ascending order;
evenly dividing the sample data, so arranged, into a plurality of sample data subsets;
and generating an equal-depth histogram according to the values and the number of the sample data in all the sample data subsets.
Optionally, generating an equal-depth histogram according to the values and the number of the sample data in all the sample data subsets includes:
dividing the numerical range of the sample data set into a plurality of intervals according to the values of the sample data in the subsets, wherein the number of sample data in each interval equals the number of sample data in the corresponding subset;
and generating the equal-depth histogram from the intervals and the number of sample data in each interval.
In a second aspect, an embodiment of the present invention provides a data processing apparatus, including:
the acquisition module is used for acquiring the capacity of the memory allocated when the histogram is constructed;
the determining module is used for determining the number of samples according to the length of the data to be sampled in the data set to be sampled and the capacity of the memory; extracting a plurality of data to be sampled from the data set to be sampled with a fixed probability to form a sample data set, wherein the number of sample data in the sample data set is equal to the number of samples;
and the histogram generating module is used for generating a histogram according to the sample data set.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a processor for executing a computer program stored in a memory, the computer program, when executed by the processor, implementing the steps of any of the methods provided by the first aspect.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps of any one of the methods provided in the first aspect.
In the technical solution provided by the embodiments of the invention, the capacity of the memory allocated for constructing the histogram is acquired; the number of samples is determined from the length of the data to be sampled and the memory capacity; data to be sampled are extracted from the data set with a fixed probability to form a sample data set whose size equals the number of samples; and the histogram is generated from the sample data set. Because the number of samples is determined by the data length, and different types of data to be sampled have different lengths, as large a sample count as possible is obtained for each data type, keeping the sampling ratio as high as possible; this improves the accuracy of the sample data set and hence of the data processing result. In addition, even when the number of data to be sampled is unknown, the data can still be extracted with a fixed probability to form the sample data set, ensuring fairness of the extraction.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
To illustrate the embodiments of the present disclosure or the prior art more clearly, the drawings used in their description are briefly introduced below; other drawings can be obtained from them by those skilled in the art without inventive effort.
Fig. 1 is a schematic flow chart of a data processing method according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating another data processing method according to an embodiment of the present invention;
FIG. 3 is a flow chart illustrating a further data processing method according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating another data processing method according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating another data processing method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an equal-width histogram according to an embodiment of the present invention;
FIG. 7 is a flowchart illustrating another data processing method according to an embodiment of the present invention;
FIG. 8 is a flowchart illustrating another data processing method according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of an equal-depth histogram according to an embodiment of the present invention;
FIG. 10 is a block diagram of a data processing apparatus according to an embodiment of the present invention;
fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth to provide a thorough understanding of the present disclosure; however, the disclosure may be practiced in other ways than those described herein. It should be understood that the embodiments disclosed in this specification are only some, not all, of the embodiments of the present disclosure.
Fig. 1 is a schematic flow chart of a data processing method according to an embodiment of the present invention. As shown in fig. 1, the method includes:
S101, acquire the capacity of the memory allocated when the histogram is constructed.
When constructing a histogram from all the sample data in the sample data set, a certain memory space must be allocated to store the sample data, and the histogram is built from the sample data held in that space. The size of the allocated memory therefore directly bounds the number of sample data in the sample data set: the larger the allocated memory space, the more sample data can be stored.
S103, determine the number of samples according to the length of the data to be sampled in the data set to be sampled and the capacity of the memory.
All data to be sampled within one data set are of the same type and therefore of the same length; data sets of different types contain data of different lengths.
Part of the data to be sampled is extracted from the data set to form the sample data set, i.e., every sample datum in the sample data set is one of the data to be sampled, so the length of a data item is exactly the memory occupied by each sample datum. The number of sample data in the sample data set, i.e., the number of samples, can thus be obtained from the memory occupied by a single sample datum and the capacity of the memory allocated for storing the sample data set.
In summary, because different types of data to be sampled have different lengths, different sample counts are obtained for them; for each type, as large a sample count as possible is obtained and the sampling ratio is kept as high as possible, so that the sample data set is closer to the data set to be sampled, i.e., closer to the true distribution. This improves the accuracy of the sample data set and guarantees the accuracy of subsequent data processing results.
S105, extract a plurality of data to be sampled from the data set to be sampled with a fixed probability to form a sample data set.
The number of sample data in the set of sample data is equal to the number of samples.
Based on the above embodiments, the number of samples has been determined. For example, if the number of samples is 100 and the data set to be sampled contains N items, then 100 items are extracted from the N items with probability 100/N as sample data, and these 100 sample data form the sample data set. Notably, even when the number of data to be sampled is unknown in advance, the data can still be extracted with a fixed probability to form the sample data set, ensuring fairness of the extraction.
S107, generate the histogram according to the sample data set.
Taking the values of the sample data in the sample data set as the abscissa, the abscissa is divided into a plurality of intervals, and the number of sample data in each interval is taken as the ordinate to generate the histogram, which visually displays the distribution of all sample data in the sample data set. Based on the above embodiments, a more accurate sample data set is obtained, so the displayed distribution is closer to the real one, i.e., the accuracy of the data processing result is improved.
In the technical solution provided by the embodiments of the invention, the capacity of the memory allocated for constructing the histogram is acquired; the number of samples is determined from the length of the data to be sampled and the memory capacity; data to be sampled are extracted from the data set with a fixed probability to form a sample data set whose size equals the number of samples; and the histogram is generated from the sample data set. Because the number of samples is determined by the data length, and different types of data to be sampled have different lengths, as large a sample count as possible is obtained for each data type, keeping the sampling ratio as high as possible; this improves the accuracy of the sample data set and hence of the data processing result. In addition, even when the number of data to be sampled is unknown, the data can still be extracted with a fixed probability to form the sample data set, ensuring fairness of the extraction.
Fig. 2 is a schematic flow chart of another data processing method according to an embodiment of the present invention, and fig. 2 is a detailed description of a possible implementation manner when S103 is executed on the basis of the embodiment shown in fig. 1, as follows:
S103', determine the number of samples S according to the following formula:
S = [S1/L]
where S1 is the capacity of the memory, L is the length of the data to be sampled in the data set to be sampled, and [·] denotes rounding down to an integer.
The memory space allocated for constructing the histogram has capacity S1, and each sample datum in the sample data set has length L. Since the allocated memory stores the sample data set, each stored sample datum occupies L of that space, so the number of samples S can be calculated as S1/L; if S1/L is not an integer, it is rounded down to give the final number of samples S. For example, if the length of the sample data is 3 and the allocated memory capacity is 215, then 215/3 ≈ 71.7, and [71.7] = 71, so the number of samples is 71.
In the embodiment of the present invention, the number of samples S is obtained by the formula S = [S1/L], ensuring that the number of samples never exceeds the maximum number of sample data the allocated memory space can store, while obtaining as large a sample count as possible for data to be sampled of different lengths.
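The calculation above can be sketched in a few lines (the function name and type hints are illustrative, not from the patent):

```python
def sample_count(memory_capacity: int, item_length: int) -> int:
    """S = [S1 / L]: round S1/L down so that S sample data of
    length L never exceed the allocated memory capacity S1."""
    return memory_capacity // item_length

# Worked example from the description: S1 = 215, L = 3.
print(sample_count(215, 3))  # 71
```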
Fig. 3 is a schematic flowchart of another data processing method according to an embodiment of the present invention, and fig. 3 is a detailed description of a possible implementation manner of performing S105 on the basis of the embodiment shown in fig. 1, as follows:
S1051, add the data to be sampled scanned for the i-th time to the sample data set, and update the scan count i = i + 1.
Wherein 1 ≤ i ≤ S, and S is the number of samples.
S1052, determine whether the scan count i is greater than S.
If not, repeat S1051–S1052 until the scan count i exceeds the number of samples S.
The data scanned the 1st time is extracted into the sample data set with probability 1; likewise, for 1 < i ≤ S, the data scanned the i-th time is extracted with probability 1, so the data of the 1st through S-th scans are all added to the sample data set. For example, if the number of samples S is 100, the data of the 1st through 100th scans are all added to the sample data set.
From the (S+1)-th scan onward, newly scanned data begin to replace data already in the sample data set. The data scanned the (S+1)-th time is extracted with probability S/(S+1); if extracted, it replaces each of the S items in the set with probability 1/S, so a given item (say, the one from the i-th scan) is replaced with probability S/(S+1) × 1/S = 1/(S+1) and survives with probability 1 − 1/(S+1) = S/(S+1). Repeating this up to the N-th scan: the data scanned the N-th time is extracted with probability S/N, replaces a given item with probability 1/N, and that item survives the N-th scan with probability 1 − 1/N = (N−1)/N. Therefore, the overall probability that the data of an early (i ≤ S) scan is never replaced is S/(S+1) × (S+1)/(S+2) × … × (N−1)/N = S/N.
S1053, add the data to be sampled scanned the i-th time to the sample data set with probability S/i, delete one sample datum from the original sample data set, and update the scan count i = i + 1.
Wherein S < i ≤ N, and N is the number of data to be sampled in the data set to be sampled.
S1054, determine whether the scan count i is greater than N.
If not, repeat S1053–S1054 until the scan count i exceeds the number N of data to be sampled in the data set to be sampled.
When the scan count i is greater than the number of samples S (and i ≤ N), a random number d is drawn uniformly from [1, i] for the data scanned the i-th time; if d ≤ S, the d-th sample datum in the sample data set is replaced by the data scanned the i-th time, so the data of the i-th scan is extracted into the sample data set with probability S/i.
When the data is scanned the (i+1)-th time, it is extracted into the sample data set with probability S/(i+1); if extracted, it replaces the item from the i-th scan with probability S/(i+1) × 1/S = 1/(i+1), so that item survives with probability 1 − 1/(i+1) = i/(i+1). Repeating up to the N-th scan, the item from the i-th scan survives all subsequent scans with probability i/(i+1) × (i+1)/(i+2) × … × (N−1)/N = i/N; combined with its extraction probability S/i, the probability that it remains in the final sample data set is S/i × i/N = S/N.
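Restated compactly, the survival argument for an item first extracted at the i-th scan (i > S) is a telescoping product (this rewrites the derivation in the text, adding no new claims):

```latex
P\bigl(\text{item scanned at } i \text{ is in the final sample}\bigr)
  = \underbrace{\frac{S}{i}}_{\text{extracted}}
    \prod_{j=i+1}^{N} \underbrace{\frac{j-1}{j}}_{\text{survives scan } j}
  = \frac{S}{i}\cdot\frac{i}{N}
  = \frac{S}{N}.
```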
In summary, even when the number of data to be sampled in the data set is unknown, each datum is retained in the sample data set with probability S/N, i.e., extracted with a fixed probability, ensuring fairness of the extraction.
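The scanning procedure of S1051–S1054 is the classic reservoir sampling algorithm; a minimal sketch follows (function and variable names are illustrative, not from the patent):

```python
import random

def reservoir_sample(stream, s):
    """Uniformly sample s items from a stream of unknown length N.
    Each item ends up in the sample with probability s/N."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            # Scans 1..s: add with probability 1 (S1051-S1052).
            sample.append(item)
        else:
            # Scan i > s: draw d uniformly from [1, i]; if d <= s,
            # replace slot d, so the item is kept with probability
            # s/i (S1053-S1054).
            d = random.randint(1, i)
            if d <= s:
                sample[d - 1] = item
    return sample
```

With the fixed sample size derived from S = [S1/L], one pass over the data to be sampled yields the sample data set regardless of N.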
Fig. 4 is a schematic flowchart of another data processing method according to an embodiment of the present invention, and fig. 4 is a detailed description of a possible implementation manner when S107 is executed on the basis of the embodiment shown in fig. 1, as follows:
S201, evenly divide the numerical range of the sample data set into a plurality of intervals according to the values of all the sample data in the sample data set.
The values of all sample data in the sample data set are obtained, the numerical range of the set is determined from these values, and that range is evenly divided into several numerical intervals, i.e., the difference between the maximum and minimum of each interval is the same.
S203, generate an equal-width histogram from the intervals and the number of sample data in each interval.
Taking the value of the sample data as the abscissa x, the abscissa axis is divided according to the numerical intervals; since the divided intervals are of equal width, the number of sample data in each equal-width interval is counted and taken as the ordinate y to generate the equal-width histogram.
Fig. 5 is a schematic flowchart of another data processing method according to an embodiment of the present invention, and fig. 5 is a detailed description of a possible implementation manner when S201 is executed on the basis of the embodiment shown in fig. 4, as follows:
S2011, determine the maximum and minimum of the values of all the sample data in the sample data set.
The values of all sample data in the sample data set are obtained, and the maximum and minimum among them are determined. For example, if the sample data set is {1.6, 1.9, 1.9, 2.0, 2.4, 2.6, 2.7, 2.7, 2.8, 2.9, 3.4, 3.5}, the maximum and minimum are 3.5 and 1.6, respectively.
S2012, determine the numerical range of the sample data set from the interval bounded by the minimum and the maximum.
Based on the above example, the interval bounded by the minimum and maximum is [1.6, 3.5]; the numerical range of the sample data set may be taken as [1.6, 3.5], or enlarged to [0, 3.5], [1.6, 4], or [0, 4].
S2013, evenly divide the numerical range of the sample data set into a plurality of intervals.
For example, fig. 6 is a schematic diagram of an equal-width histogram according to an embodiment of the present invention. Based on the above example, if the numerical range of the sample data set is [1.6, 3.5], it is evenly divided into 2 intervals, [1.6, 2.55] and [2.55, 3.5], and the abscissa is divided accordingly; the interval [1.6, 2.55] contains 5 sample data and the interval [2.55, 3.5] contains 7, generating the equal-width histogram shown in fig. 6.
It should be noted that fig. 6 shows division into 2 intervals only by way of example; in other embodiments, the numerical range [1.6, 3.5] may be divided into three or more intervals, which is not specifically limited in the embodiments of the present invention.
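A minimal sketch of S201–S203 on the example above (names are illustrative; the maximum value is clamped into the last interval so that nothing falls outside):

```python
def equal_width_histogram(samples, num_intervals):
    """Evenly split [min, max] into num_intervals equal-width
    intervals and count the sample data falling in each."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / num_intervals
    counts = [0] * num_intervals
    for x in samples:
        # Clamp x == max into the last interval.
        idx = min(int((x - lo) / width), num_intervals - 1)
        counts[idx] += 1
    return counts

data = [1.6, 1.9, 1.9, 2.0, 2.4, 2.6, 2.7, 2.7, 2.8, 2.9, 3.4, 3.5]
print(equal_width_histogram(data, 2))  # [5, 7], as in fig. 6
```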
Fig. 7 is a schematic flowchart of another data processing method according to an embodiment of the present invention, and fig. 7 is a detailed description of another possible implementation manner of executing S107 based on the embodiment shown in fig. 1, as follows:
s301, arranging all the sample data in the sample data set according to a descending order.
All sample data in the sample data set are arranged in the order from small to large, for example, the sample data set after the sample data are arranged in the order from small to large may be {1.6, 1.9, 1.9, 2.0, 2.4, 2.6, 2.7, 2.7, 2.8, 2.9, 3.4, 3.5 }.
And S303, averagely dividing all the sample data arranged in the order from small to large into a plurality of sample data subsets.
Illustratively, based on the above embodiment, all sample data in the sample data set {1.6, 1.9, 1.9, 2.0, 2.4, 2.6, 2.7, 2.7, 2.8, 2.9, 3.4, 3.5} are divided evenly into 4 sample data subsets: {1.6, 1.9, 1.9}, {2.0, 2.4, 2.6}, {2.7, 2.7, 2.8} and {2.9, 3.4, 3.5}. Each subset contains 3 sample data, i.e. all subsets are the same size. In other embodiments, the sample data may instead be divided evenly into 2 or 3 subsets, which is not specifically limited in this application.
S305, generating an equal-depth histogram according to the numerical values and the number of the sample data in all the sample data subsets.
The numerical value of the sample data is taken as the abscissa x, which is divided into a plurality of intervals according to the numerical values of the sample data in each subset; each interval covers the values of all sample data in its corresponding subset and none of the values in adjacent subsets. The number of sample data in each subset is taken as the ordinate y, generating the equal-depth histogram.
In the embodiment of the invention, all the sample data in the sample data set are arranged in ascending order; the sorted sample data are divided evenly into a plurality of sample data subsets; and an equal-depth histogram is generated from the numerical values and number of the sample data in all the subsets. Because the equal-depth histogram has a small data-distribution error, the accuracy of the data processing result can be improved.
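The sort-and-split procedure of S301 and S303 can be sketched as follows. The function name is illustrative, and the sketch assumes the set divides evenly into the requested number of subsets:

```python
def equal_depth_split(samples, num_subsets):
    """Sort ascending, then cut into num_subsets consecutive runs of
    equal size (assumes len(samples) is divisible by num_subsets)."""
    ordered = sorted(samples)
    depth = len(ordered) // num_subsets  # sample data per subset
    return [ordered[k * depth:(k + 1) * depth] for k in range(num_subsets)]

samples = [1.6, 1.9, 1.9, 2.0, 2.4, 2.6, 2.7, 2.7, 2.8, 2.9, 3.4, 3.5]
print(equal_depth_split(samples, 4))
# → [[1.6, 1.9, 1.9], [2.0, 2.4, 2.6], [2.7, 2.7, 2.8], [2.9, 3.4, 3.5]]
```

This reproduces the four subsets of the running example, each of depth 3.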
Fig. 8 is a schematic flowchart of another data processing method according to an embodiment of the present invention. Fig. 8 details a possible implementation of S305 on the basis of the embodiment shown in fig. 7, as follows:
S3051, dividing the numerical range of the sample data set into a plurality of intervals according to the numerical values of the sample data in all the sample data subsets.
The number of sample data in each interval is the same as the number of sample data in the corresponding sample data subset.
The numerical range of the sample data set is divided into a plurality of intervals, each interval covering only the values of the sample data in one subset. Based on the above embodiment, the sample data subsets are {1.6, 1.9, 1.9}, {2.0, 2.4, 2.6}, {2.7, 2.7, 2.8} and {2.9, 3.4, 3.5}, and the numerical range of the sample data set is [1.6, 3.5]; the range is therefore divided into the intervals [1.6, 1.9], [2.0, 2.6], [2.7, 2.8] and [2.9, 3.5], each containing 3 sample data.
S3052, generating the equal-depth histogram according to all the intervals and the number of the sample data in the intervals.
Exemplarily, fig. 9 is a schematic diagram of an equal-depth histogram according to an embodiment of the present invention. Based on the above embodiment, the abscissa is divided into the intervals [1.6, 1.9], [2.0, 2.6], [2.7, 2.8] and [2.9, 3.5], each corresponding to 3 sample data, generating the equal-depth histogram shown in fig. 9.
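Deriving the equal-depth intervals from the subsets, as in S3051, amounts to taking each subset's minimum and maximum as the interval bounds. A sketch with illustrative names:

```python
def equal_depth_intervals(subsets):
    """One (low, high, count) interval per sample data subset,
    covering exactly the values in that subset."""
    return [(min(s), max(s), len(s)) for s in subsets]

subsets = [[1.6, 1.9, 1.9], [2.0, 2.4, 2.6], [2.7, 2.7, 2.8], [2.9, 3.4, 3.5]]
print(equal_depth_intervals(subsets))
# → [(1.6, 1.9, 3), (2.0, 2.6, 3), (2.7, 2.8, 3), (2.9, 3.5, 3)]
```

Because the subsets are consecutive runs of the sorted data, each interval excludes the values of adjacent subsets, matching the intervals of fig. 9.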
Fig. 10 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention, and as shown in fig. 10, a data processing apparatus 100 includes:
an obtaining module 110, configured to obtain a capacity of a memory allocated when the histogram is constructed.
A determining module 120, configured to determine the number of samples according to the length of the data to be sampled in the data set to be sampled and the capacity of the memory, and to extract a plurality of data to be sampled from the data set to be sampled with a fixed probability to form a sample data set, wherein the number of sample data in the sample data set is equal to the number of samples.
A histogram generating module 130, configured to generate a histogram according to the sample data set.
Optionally, the determining module 120 is further configured to determine the number of samples S according to the following formula:
S = [S1 / L]
wherein S1 is the capacity of the memory, and L is the length of the data to be sampled in the data set to be sampled.
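Assuming the bracket notation [S1/L] denotes rounding down to an integer (the patent does not define it), the sample-count formula can be sketched as:

```python
def sample_count(memory_capacity: int, record_length: int) -> int:
    """S = [S1 / L]: how many records of length L fit in the
    memory budget S1, with the bracket taken as floor division."""
    return memory_capacity // record_length

# illustrative figures: a 4096-byte budget and 64-byte records
print(sample_count(4096, 64))  # → 64
```

The sample size thus scales with the memory allocated for histogram construction and shrinks as the records to be sampled grow longer.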
Optionally, the determining module 120 is further configured to: add the data to be sampled scanned at the ith time to the sample data set and update the scan count to i + 1, where 1 ≤ i ≤ S and S is the number of samples; if the scan count i is less than or equal to the number of samples S, repeat the step of adding the ith scanned data to the sample data set and updating the scan count until i > S; then add the data to be sampled scanned at the ith time to the sample data set with a probability of S/i, delete one sample data from the original sample data set, and update the scan count to i + 1, where S < i ≤ N and N is the number of data to be sampled in the data set to be sampled; if the scan count i is less than or equal to N, repeat the step of adding the ith scanned data with a probability of S/i, deleting one sample data, and updating the scan count until i > N.
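The scanning procedure the determining module performs is, in effect, reservoir sampling: the first S records fill the sample set, and each later record i replaces an existing sample with probability S/i. A minimal Python sketch, under the assumption (not stated in the text) that the evicted sample is chosen uniformly at random; all names are illustrative:

```python
import random

def reservoir_sample(stream, S, rng=random):
    """Keep a uniform sample of S items from a stream of records.

    The first S scanned items (i <= S) are added directly; each later
    item (S < i <= N) is added with probability S/i, evicting one
    existing sample, so the sample set stays at size S throughout.
    """
    sample = []
    for i, item in enumerate(stream, start=1):  # i is the scan count
        if i <= S:
            sample.append(item)
        elif rng.random() < S / i:
            sample[rng.randrange(S)] = item  # evict one sample at random
    return sample

random.seed(0)  # make the example deterministic
picked = reservoir_sample(range(1000), 12)
print(len(picked))  # → 12
```

A single pass over the data suffices, and the memory used never exceeds the S slots fixed in advance, which is why the number of samples can be tied to the allocated memory capacity.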
Optionally, the histogram generating module 130 is further configured to divide the numerical range of the sample data set evenly into a plurality of intervals according to the numerical values of all the sample data in the sample data set, and to generate a uniform width histogram according to all the intervals and the number of sample data in each interval.
Optionally, the histogram generating module 130 is further configured to determine, according to the numerical values of all the sample data in the sample data set, the maximum and minimum of those values; determine the numerical range of the sample data set according to the interval formed by the minimum and maximum values; and divide the numerical range of the sample data set evenly into a plurality of intervals.
Optionally, the histogram generating module 130 is further configured to arrange all the sample data in the sample data set in ascending order; divide the sorted sample data evenly into a plurality of sample data subsets; and generate an equal-depth histogram according to the numerical values and number of the sample data in all the subsets.
Optionally, the histogram generating module 130 is further configured to divide the numerical range of the sample data set into a plurality of intervals according to the numerical values of the sample data in all the sample data subsets, where the number of sample data in each interval is the same as the number of sample data in the corresponding subset; and generate the equal-depth histogram according to all the intervals and the number of sample data in each interval.
An embodiment of the present invention further provides an electronic device. Fig. 11 is a schematic structural diagram of an electronic device provided in an embodiment of the present invention, showing a block diagram of an exemplary electronic device suitable for implementing an embodiment of the invention. The electronic device shown in fig. 11 is only an example and imposes no limitation on the functions or scope of use of the embodiments of the present invention.
As shown in fig. 11, electronic device 12 is embodied in the form of a general purpose computing device. The components of electronic device 12 may include, but are not limited to: one or more processors 16, a system memory 28, and a bus 18 that connects the various system components (including the system memory 28 and the processors 16).
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The electronic device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 11, and commonly referred to as a "hard drive"). Although not shown in FIG. 11, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. System memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in system memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of embodiments described herein.
The processor 16 executes various functional applications and data processing, such as implementing the steps of any of the above-described method embodiments, by executing at least one of a plurality of programs stored in the system memory 28.
Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of any of the above-mentioned method embodiments.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
Embodiments of the present invention further provide a computer program product, which when run on a computer causes the computer to perform the steps of any of the above-described method embodiments.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A data processing method, comprising:
acquiring the capacity of a memory allocated when a histogram is constructed;
determining the number of samples according to the length of the data to be sampled in the data set to be sampled and the capacity of the memory;
extracting a plurality of data to be sampled from the data set to be sampled with a fixed probability to form a sample data set, wherein the number of sample data in the sample data set is equal to the number of samples;
and generating the histogram according to the sample data set.
2. The method of claim 1, wherein determining the number of samples according to the length of the data to be sampled in the data set to be sampled and the capacity of the memory comprises:
determining the number of samples S according to the following formula:
S = [S1 / L]
wherein S1 is the capacity of the memory, and L is the length of the data to be sampled in the data set to be sampled.
3. The method according to claim 1 or 2, wherein said extracting a plurality of said data to be sampled with a fixed probability from said data set to be sampled forms a sample data set, comprising:
adding the data to be sampled scanned at the ith time to the sample data set, and updating the scan count to i + 1, wherein 1 ≤ i ≤ S and S is the number of samples;
if the scan count i is less than or equal to the number of samples S, repeatedly executing the step of adding the ith scanned data to the sample data set and updating the scan count, until i > S;
adding the data to be sampled scanned at the ith time to the sample data set with a probability of S/i, deleting one sample data from the original sample data set, and updating the scan count to i + 1, wherein S < i ≤ N and N is the number of data to be sampled in the data set to be sampled;
if the scan count i is less than or equal to N, repeatedly executing the step of adding the ith scanned data with a probability of S/i, deleting one sample data from the original sample data set, and updating the scan count, until i > N.
4. The method according to claim 1 or 2, wherein said generating a histogram from said set of sample data comprises:
dividing the numerical range of the sample data set evenly into a plurality of intervals according to the numerical values of all the sample data in the sample data set;
and generating a uniform width histogram according to all the intervals and the number of sample data in each interval.
5. The method according to claim 4, wherein said dividing the numerical range of the sample data set evenly into a plurality of intervals according to the numerical values of all the sample data in the sample data set comprises:
determining the maximum value and the minimum value in the numerical values of all the sample data according to the numerical values of all the sample data in the sample data set;
determining the numerical range of the sample data set according to an interval formed by the minimum value and the maximum value;
and dividing the numerical range of the sample data set evenly into a plurality of intervals.
6. The method according to claim 1 or 2, wherein said generating a histogram from said set of sample data comprises:
arranging all the sample data in the sample data set in ascending order;
dividing all the sample data, arranged in ascending order, evenly into a plurality of sample data subsets;
and generating an equal-depth histogram according to the numerical values and the quantity of the sample data in all the sample data subsets.
7. The method of claim 6, wherein generating an equal-depth histogram from values and quantities of the sample data in all of the subsets of sample data comprises:
dividing the numerical range of the sample data set into a plurality of intervals according to the numerical values of the sample data in all the sample data subsets, wherein the number of sample data in each interval is the same as the number of sample data in the corresponding sample data subset;
and generating the equal-depth histogram according to all the intervals and the number of the sample data in the intervals.
8. A data processing apparatus, comprising:
the acquisition module is used for acquiring the capacity of the memory allocated when the histogram is constructed;
the determining module is used for determining the number of samples according to the length of the data to be sampled in the data set to be sampled and the capacity of the memory; extracting a plurality of data to be sampled from the data set to be sampled with a fixed probability to form a sample data set, wherein the number of sample data in the sample data set is equal to the number of samples;
and the histogram generating module is used for generating a histogram according to the sample data set.
9. An electronic device, comprising: a processor for executing a computer program stored in a memory, the computer program, when executed by the processor, implementing the steps of the method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110886699.6A CN113672661A (en) | 2021-08-03 | 2021-08-03 | Data processing method, device, equipment and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113672661A true CN113672661A (en) | 2021-11-19 |
Family
ID=78541220
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110886699.6A Pending CN113672661A (en) | 2021-08-03 | 2021-08-03 | Data processing method, device, equipment and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113672661A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6278989B1 (en) * | 1998-08-25 | 2001-08-21 | Microsoft Corporation | Histogram construction using adaptive random sampling with cross-validation for database systems |
CN107330083A (en) * | 2017-07-03 | 2017-11-07 | 贵州大学 | Wide histogram parallel constructing method |
CN112905517A (en) * | 2021-03-09 | 2021-06-04 | 明峰医疗系统股份有限公司 | Variable packet length data acquisition method based on FPGA |
Non-Patent Citations (1)
Title |
---|
FAN Cong; LI Jianzeng; ZHANG Yan: "Random sample consensus algorithm based on sampling optimization", Electronics Optics & Control, no. 07 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||