CN107436954A - Online streaming data approximate processing quality control method and device - Google Patents

Online streaming data approximate processing quality control method and device

Info

Publication number
CN107436954A
Authority
CN
China
Prior art keywords
error
current
data
processing result
approximate processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710701336.4A
Other languages
Chinese (zh)
Other versions
CN107436954B (en)
Inventor
魏晓辉
刘圆圆
王兴旺
徐海啸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin University
Priority to CN201710701336.4A
Publication of CN107436954A
Application granted
Publication of CN107436954B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24568 Data stream processing; Continuous queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses an online streaming data approximate processing quality control method, comprising the following steps: determining a sampling strategy for the stream data of the current data processing window; sampling the stream data according to the sampling strategy to obtain sampled data; performing approximate processing on the sampled data to obtain a current approximate processing result; performing error analysis according to a user requirement obtained in advance and the current approximate processing result to obtain an error value; judging whether the error value is less than or equal to a preset error threshold; if so, outputting the current approximate processing result; if not, performing error correction. The technical solution provided by the embodiments of the invention can improve the quality of the approximate processing result of stream data. The invention also discloses an online streaming data approximate processing quality control apparatus having the corresponding technical effects.

Description

Online streaming data approximate processing quality control method and device
Technical Field
The invention relates to the technical field of stream data processing, in particular to a method and a device for controlling the quality of on-line stream data approximate processing.
Background
As the volume of stream data has grown explosively, approximate processing has become an indispensable key technology in stream data processing.
Approximate processing can increase the speed at which streaming data is processed. For example, a sampling algorithm, one of the approximate processing techniques, uses the data characteristics of a sample set in place of those of the entire stream in order to increase processing speed. However, this gain in speed usually comes at the cost of the quality of the approximate processing result.
In practical applications, a user who submits an approximate processing request often places a requirement on the quality of the approximate processing result. For example, in an online traffic query application, a query request sent by a user may require that the error of the approximate processing result be within ±10%.
In summary, how to effectively improve the quality of the approximate processing result of the streaming data is a problem that needs to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide a method and a device for controlling the quality of on-line streaming data approximate processing, which are used for improving the quality of an approximate processing result obtained when the streaming data is subjected to approximate processing.
In order to solve the technical problems, the invention provides the following technical scheme:
an online streaming data approximate processing quality control method comprises the following steps:
determining a sampling strategy of the stream data for the current data processing window;
sampling the streaming data according to the sampling strategy to obtain sampling data;
carrying out approximate processing on the sampling data to obtain a current approximate processing result;
according to the user requirement obtained in advance and the current approximate processing result, carrying out error analysis to obtain an error value;
judging whether the error value is smaller than or equal to a preset error threshold value;
if yes, outputting the current approximate processing result;
if not, error correction is performed.
Preferably, performing error analysis according to the user requirement obtained in advance and the current approximate processing result to obtain an error value includes:
if the user requirement is a requirement on the maximum error, performing error analysis on the approximate processing result corresponding to the current data processing window to obtain the error value.
Preferably, performing error analysis according to the user requirement obtained in advance and the current approximate processing result to obtain an error value includes:
if the user requirement is a requirement on the average error, obtaining historical approximate processing results corresponding to N data processing windows adjacent to the current data processing window, wherein N is a positive integer;
performing error analysis according to the historical approximate processing results and the current approximate processing result to obtain the error value.
Preferably, the performing error correction includes:
judging whether the error value is obtained by performing first error analysis on the current data processing window;
if yes, the step of sampling the streaming data according to the sampling strategy to obtain sampling data is repeatedly executed.
Preferably, when it is determined that the error value is not an error value obtained by performing the first error analysis on the current data processing window, the method further includes:
adjusting the sampling strategy.
Preferably, before performing error analysis according to the user requirement obtained in advance and the current approximate processing result to obtain an error value, the method further includes:
judging whether the current moment is within a preset quality monitoring time period;
if so, executing the step of performing error analysis according to the user requirement obtained in advance and the current approximate processing result to obtain an error value.
Preferably, when the current moment is not within the quality monitoring time period, the method further includes:
directly outputting the current approximate processing result.
An online streaming data approximate processing quality control apparatus comprising:
the sampling strategy determining module is used for determining the sampling strategy of the stream data aiming at the current data processing window;
the sampling data obtaining module is used for sampling the streaming data according to the sampling strategy to obtain sampling data;
an approximate processing result obtaining module, configured to perform approximate processing on the sample data to obtain a current approximate processing result;
the error analysis module is used for carrying out error analysis according to the user requirement obtained in advance and the current approximate processing result to obtain an error value;
the error judgment module is used for judging whether the error value is smaller than or equal to a preset error threshold value or not;
the output module is used for outputting the current approximate processing result when the error value is smaller than or equal to a preset error threshold value;
and the error correction module is used for correcting errors when the error value is greater than a preset error threshold value.
Preferably, the apparatus further comprises a quality monitoring and judging module, configured to:
before error analysis is performed according to the user requirement obtained in advance and the current approximate processing result to obtain an error value, judge whether the current moment is within a preset quality monitoring time period; if so, trigger the error analysis module.
Preferably, the quality monitoring and judging module is further configured to directly output the current approximate processing result when the current moment is not within the quality monitoring time period.
With the technical solution provided by the embodiments of the invention, a sampling strategy for the stream data of the current data processing window is first determined, and the stream data is sampled according to that strategy to obtain sampled data. The sampled data is then approximately processed to obtain a current approximate processing result, and error analysis is performed according to the user requirement obtained in advance and the current approximate processing result to obtain an error value. Whether the error value is less than or equal to a preset error threshold is judged; if so, the current approximate processing result is output, and if not, error correction is performed. Because error analysis is performed on the current approximate processing result before it is output, with error correction performed when the error value is greater than the error threshold and the result output only when the error value is less than or equal to the error threshold, the quality of the approximate processing result of the stream data is improved.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of an implementation of a method for quality control of on-line approximate processing of streaming data according to an embodiment of the present invention;
FIG. 2 is a flow chart of another embodiment of the method for controlling quality of approximate processing of online stream data according to the present invention;
FIG. 3 is a schematic structural diagram of an apparatus for controlling quality of on-line approximate processing of stream data according to an embodiment of the present invention;
FIG. 4 is a data distribution diagram of stream data according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart illustrating an implementation of a quality control method for approximate processing of online stream data according to an embodiment of the present invention, including the following steps:
S101, determining a sampling strategy for the stream data of the current data processing window.
In the embodiment of the present invention, a plurality of sampling strategies may be preset, and the sampling strategy for the stream data of the current data processing window may be determined from the preset plurality of sampling strategies. Specifically, the corresponding sampling strategy may be determined according to the data distribution condition of the stream data, or one sampling strategy may be randomly selected from a plurality of preset sampling strategies.
The sampling strategy may include the sampling algorithms used in sampling, together with information such as the order in which the sampling algorithms are applied, the sampling frequency, and the sampling window size. For example, the sampling strategy may be to sample the stream data using a random sampling algorithm and a stratified random sampling algorithm respectively, or to sample the stream data using stratified random sampling algorithms in parallel.
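As an illustration only, the items a sampling strategy is said to contain could be held in a small record. The sketch below is in Python; the class name `SamplingStrategy`, its field names, the preset values, and the `skewed` flag used to choose a preset are all assumptions made for this example and do not come from the patent. In particular, "sampling frequency" is modelled here as the fraction of each window that is sampled.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SamplingStrategy:
    """Hypothetical container for the items a sampling strategy may include."""
    algorithms: Tuple[str, ...]   # sampling algorithms, listed in sampling order
    sampling_fraction: float      # assumed stand-in for "sampling frequency"
    window_size: int              # number of stream items per sampling window
    parallel: bool = False        # whether the algorithms run in parallel


# A preset list of strategies (values are illustrative only).
PRESET_STRATEGIES = (
    SamplingStrategy(("random", "stratified"), 0.1, 1000),
    SamplingStrategy(("stratified", "stratified"), 0.1, 1000, parallel=True),
)


def choose_strategy(skewed: bool) -> SamplingStrategy:
    """Pick a preset from an assumed data-distribution flag."""
    return PRESET_STRATEGIES[1] if skewed else PRESET_STRATEGIES[0]
```

Choosing by a single boolean is only a placeholder for "according to the data distribution condition of the stream data"; a random choice among the presets would match the other option the text mentions.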
After the sampling strategy is determined, the operation of step S102 may be continued.
S102, sampling the stream data according to the sampling strategy to obtain sampled data.
The determined sampling strategy may include a sampling algorithm, and according to the sampling algorithm in the sampling strategy, the stream data of the current data processing window may be sampled to obtain sampled data of the stream data of the current data processing window.
For ease of analysis, the sampled data obtained from the stream data may be divided into different sample sets. For example, when the sampling strategy is to sample using a random sampling algorithm and a stratified random sampling algorithm in parallel, a sample set is obtained for each of the two sampling algorithms. The number of sample sets obtained by a particular sampling algorithm may be preset, or may be determined or adjusted according to the actual situation, which is not limited in the embodiment of the present invention.
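A minimal sketch of drawing two sample sets from one window, one by simple random sampling and one by stratified random sampling, might look as follows. The function names, the `key` callable that assigns an item to a stratum, and the proportional allocation rule are assumptions for this example; the patent does not specify how strata are defined or how many items each stratum contributes.

```python
import random
from collections import defaultdict


def random_sample(window, k, rng):
    """Simple random sample of k items without replacement."""
    return rng.sample(window, k)


def stratified_sample(window, k, key, rng):
    """Stratified sample of roughly k items, allocated in proportion to stratum size."""
    strata = defaultdict(list)
    for item in window:
        strata[key(item)].append(item)
    sample = []
    for items in strata.values():
        # Take at least one item per stratum and never more than the stratum holds.
        take = min(len(items), max(1, round(k * len(items) / len(window))))
        sample.extend(rng.sample(items, take))
    return sample


def draw_sample_sets(window, k, key, seed=0):
    """Return two sample sets of the same window, one per sampling algorithm."""
    rng = random.Random(seed)
    return random_sample(window, k, rng), stratified_sample(window, k, key, rng)
```

Because small strata are rounded up to one item, the stratified set can hold slightly more than `k` items when the window contains many tiny strata.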
S103, carrying out approximate processing on the sampling data to obtain a current approximate processing result.
Approximate processing is performed on the sampled data to obtain the current approximate processing result of the stream data for the current data processing window. When sampling the stream data yields a plurality of sample sets, approximate processing is performed on the sampled data in each sample set, so that a current approximate processing result is obtained for each sample set.
In particular, the sampled data may be approximately processed according to the current requirement. For example, when the stream data is temperature data in weather information and the average temperature needs to be computed, an approximate averaging calculation may be performed on the sampled data; when the stream data is the click rate of a plurality of servers of a website and the total click rate of the website needs to be computed, an approximate summation may be performed on the sampled data.
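The two examples above, an approximate average and an approximate sum, can be sketched as below. The function names are hypothetical, and scaling the sample mean by the window's item count to estimate a total is a standard estimator under uniform sampling that is assumed here; the patent does not state which estimator it uses.

```python
def approx_avg(sample):
    """Estimate the window average from a non-empty sample."""
    return sum(sample) / len(sample)


def approx_sum(sample, population_size):
    """Estimate the window total by scaling the sample mean to the window size."""
    return approx_avg(sample) * population_size
```

For instance, a sample `[1, 2, 3]` drawn from a window of 10 items gives an estimated average of 2.0 and an estimated total of 20.0.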
S104, performing error analysis according to the user requirement obtained in advance and the current approximate processing result to obtain an error value.
The user requirement may include the user's quality requirement on the approximate processing result, for example, that the error of the output approximate processing result does not exceed ±5%.
Error analysis is performed according to the quality requirement on the approximate processing result contained in the user requirement obtained in advance and the current approximate processing result, so as to obtain an error value. Specifically, the absolute value of the difference between the current approximate processing results corresponding to the different sample sets obtained by sampling the stream data may be taken as the error value.
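For two sample sets, the error value described here reduces to a single absolute difference. The sketch below shows that, plus a relative variant that is purely an assumption: the text gives the user requirement as a percentage but does not say how the absolute difference is normalised, so dividing by the mean of the two estimates is only one plausible choice. Both function names are hypothetical.

```python
def error_value(estimate_1, estimate_2):
    """Absolute difference between the results of two sample sets."""
    return abs(estimate_1 - estimate_2)


def relative_error_value(estimate_1, estimate_2):
    """Assumed normalisation: absolute difference over the mean of the two estimates."""
    reference = abs(estimate_1 + estimate_2) / 2
    if reference == 0:
        raise ValueError("cannot normalise when both estimates average to zero")
    return error_value(estimate_1, estimate_2) / reference
```

With estimates 20.5 and 19.75 the absolute error value is 0.75; with estimates 21.0 and 19.0 the relative variant gives 2.0 / 20.0.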
In an embodiment of the present invention, if the user requirement is a requirement for a maximum error, an error analysis is performed on an approximate processing result corresponding to the current data processing window to obtain an error value.
A requirement on the maximum error means that the user requires the error of each output current approximate processing result not to exceed the maximum error. In this case, error analysis is performed on the approximate processing result corresponding to the current data processing window to obtain the error value.
For example, when it is determined from the user requirement that the error of the output current approximate processing result must not exceed the maximum error, |window| = 1 may be set; that is, as soon as the current approximate processing result is obtained for the current data processing window, error analysis is performed on it to obtain the error value.
In one embodiment of the invention, if the user requirement is a requirement for an average error, historical approximate processing results corresponding to N data processing windows adjacent to a current data processing window are obtained, wherein N is a positive integer; and performing error analysis according to the historical approximate processing result and the current approximate processing result to obtain an error value.
A requirement on the average error means that the user requires the error of the output approximate processing results not to exceed a given average error. In this case, the historical approximate processing results corresponding to the N data processing windows adjacent to the current data processing window are obtained, and error analysis is performed according to the historical approximate processing results and the current approximate processing result to obtain an error value related to the average error. It should be noted that N is a positive integer; N may be preset, or may be determined and adjusted according to the actual situation, which is not limited in the embodiment of the present invention.
For example, when it is determined from the user requirement that the error of the output approximate processing results must not exceed the average error, |window| = N + 1 may be set; that is, as soon as the current approximate processing result is obtained for the current data processing window, error analysis is performed on the current approximate processing result together with the historical approximate processing results to obtain an error value related to the average error.
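The two modes differ only in how many windows contribute to the error value: one window for a maximum-error requirement, N + 1 windows for an average-error requirement. A hedged sketch follows; the class name `WindowedError`, the mode strings, and the choice to keep per-window error values (rather than the historical results themselves) are assumptions made to keep the example short.

```python
from collections import deque


class WindowedError:
    """Tracks per-window error values over |window| data processing windows."""

    def __init__(self, mode, n_history=0):
        if mode not in ("max", "avg"):
            raise ValueError("mode must be 'max' or 'avg'")
        # |window| = 1 for a maximum-error requirement, N + 1 for an average-error one.
        size = 1 if mode == "max" else n_history + 1
        self.errors = deque(maxlen=size)

    def update(self, current_error):
        """Record the current window's error and return the error value to be judged."""
        self.errors.append(current_error)
        return sum(self.errors) / len(self.errors)
```

In "max" mode the returned value is always the current window's error; in "avg" mode it is the mean over the current window and up to N preceding ones, with fewer windows used until the history has filled.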
S105, judging whether the error value is less than or equal to a preset error threshold.
In the embodiment of the present invention, an error threshold may be preset, and the specific size of the error threshold may be determined and adjusted according to the actual situation, which is not limited in the embodiment of the present invention. After the error value is obtained, it may be compared with the error threshold to determine whether it is less than or equal to the error threshold. Specifically, the comparison may be made by subtraction or by division.
Specifically, a corresponding error threshold may be set in advance for each sample set, or may be set for a sampling window. For example: when there are two sample sets, the first sample error threshold and the second sample error threshold may be set as theoretical error limit values corresponding to the respective sample sets, or one error threshold may be set for the sampling window, and the specific error threshold may be determined and adjusted according to actual conditions, which is not limited in the embodiment of the present invention.
If the error value is less than or equal to the preset error threshold, the operation of step S106 is performed, and if the error value is greater than the preset error threshold, the operation of step S107 is performed.
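The comparison in step S105 can be made by subtraction or by division, as noted above. A small sketch of both forms, with hypothetical function names, assuming a strictly positive threshold for the division form:

```python
def within_threshold_by_subtraction(error, threshold):
    """True when error - threshold is not positive, i.e. error <= threshold."""
    return error - threshold <= 0


def within_threshold_by_division(error, threshold):
    """True when error / threshold is at most 1; requires a positive threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive for the division form")
    return error / threshold <= 1
```

A `True` result corresponds to proceeding to step S106 and a `False` result to step S107.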
S106, outputting the current approximate processing result.
If the error value obtained in step S105 is smaller than or equal to the preset error threshold, it indicates that the current approximation processing result is satisfactory, and the current approximation processing result may be output.
S107, performing error correction.
If the error value obtained in step S105 is greater than the preset error threshold, it indicates that the current approximation processing result is not satisfactory, and error correction may be performed with respect to the error value.
In one embodiment of the present invention, step S107 includes the steps of:
judging whether the error value is the error value obtained by performing the first error analysis on the current data processing window, and if so, repeating the operation of step S102.
That is, it is first judged whether the error value is the error value obtained by performing the first error analysis on the current data processing window. If so, the step of sampling the stream data according to the sampling strategy to obtain sampled data is executed again. In other words, when the error value for the current data processing window exceeds the error threshold for the first time, resampling is performed: steps S102 to S105 are executed again, and then step S106 or step S107 is executed according to the judgment result of step S105.
In another embodiment of the present invention, when the error value is determined not to be the error value obtained by performing the first error analysis on the current data processing window, the sampling strategy may also be adjusted.
When it is determined that the error value is not the error value obtained by performing the first error analysis on the current data processing window, that is, the error value was obtained after error correction had already been performed in step S107, the sampling strategy may be adjusted.
Adjusting the sampling strategy may involve adjusting the sampling algorithm, the sampling window size, the sampling frequency of the sampling algorithm, and the like included in the sampling strategy. For example, the sampling window size of the sampling algorithm in the sampling strategy may be reduced, the sampling frequency of the sampling algorithm in the sampling strategy may be increased, or the sampling algorithm in the sampling strategy may be changed.
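The two branches of error correction (resample on the first failure, adjust the strategy on later failures) can be sketched as a single decision function. Everything named here is hypothetical: the `Strategy` record, the `correct` function, and in particular the choice to "adjust" by doubling an assumed sampling fraction, which stands in for any of the adjustments listed above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Strategy:
    """Hypothetical sampling strategy record used only in this example."""
    algorithm: str
    window_size: int
    sampling_fraction: float


def correct(strategy, is_first_analysis):
    """Return the correction action and the strategy to use afterwards."""
    if is_first_analysis:
        # First failed error analysis for this window: resample with the same strategy.
        return "resample", strategy
    # Later failure: adjust the strategy (here, sample more, capped at the full window).
    adjusted = replace(
        strategy, sampling_fraction=min(1.0, strategy.sampling_fraction * 2)
    )
    return "adjust", adjusted
```

Returning the action alongside the strategy leaves it to the caller to decide whether the adjusted strategy applies to the current window or only to later ones.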
With the method provided by the embodiment of the invention, a sampling strategy for the stream data of the current data processing window is first determined, and the stream data is sampled according to that strategy to obtain sampled data. The sampled data is then approximately processed to obtain a current approximate processing result, and error analysis is performed according to the user requirement obtained in advance and the current approximate processing result to obtain an error value. Whether the error value is less than or equal to a preset error threshold is judged; if so, the current approximate processing result is output, and if not, error correction is performed. Because error analysis is performed on the current approximate processing result before it is output, with error correction performed when the error value is greater than the error threshold and the result output only when the error value is less than or equal to the error threshold, the quality of the approximate processing result of the stream data is improved.
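Steps S102 to S107 for one data processing window can be put together in a short control loop. This is a sketch under several assumptions that the text does not settle: the sampling, approximation, and adjustment steps are passed in as callables with hypothetical names; the loop is bounded by `max_attempts` and returns `None` as the result if the threshold is never met; and the first sample set's estimate is the one output.

```python
def process_window(window, strategy, sample_fn, approx_fn, adjust_fn,
                   threshold, max_attempts=5):
    """Run S102-S107 for one window; return (result or None, strategy in force)."""
    for attempt in range(max_attempts):
        set_1, set_2 = sample_fn(window, strategy)                 # S102: sample
        result_1, result_2 = approx_fn(set_1), approx_fn(set_2)    # S103: approximate
        error = abs(result_1 - result_2)                           # S104: error value
        if error <= threshold:                                     # S105: judge
            return result_1, strategy                              # S106: output
        if attempt > 0:
            # S107, not the first error analysis: adjust the sampling strategy.
            strategy = adjust_fn(strategy)
        # S107, first error analysis: loop again and resample with the same strategy.
    return None, strategy
```

A toy run, with the strategy reduced to a sample size `k` and the two sample sets taken from opposite ends of the window, shows the resample-then-adjust behaviour: `process_window([1, 2, 3, 4, 5, 6, 7, 8], 2, lambda w, k: (w[:k], w[-k:]), lambda s: sum(s) / len(s), lambda k: k + 2, 0.5)` resamples once at `k = 2`, then grows `k` until the two estimates agree.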
For convenience of understanding, fig. 2 is taken as an example to illustrate the technical solution provided by the embodiment of the present invention:
Assume that, according to the actual situation, the approximate processing to be performed on the stream data is an AVG operation, that is, computing the average of the stream data. The sampling strategy used may be to sample the stream data using a stratified random sampling algorithm SRS and a random sampling algorithm RS respectively, or to sample the stream data using the stratified random sampling algorithm SRS in parallel, so as to obtain sampled data consisting of two sample sets, denoted $S_1$ and $S_2$. The averages estimated from the two sample sets are $\bar{y}_1$ and $\bar{y}_2$ respectively, which are the current approximate processing results of the stream data of the current data processing window. The sampling algorithm included in the sampling strategy may be determined based on the data distribution obtained from the hierarchical structure produced by applying a preset stratification strategy to the stream data.
Assuming the true average of the stream data is $\bar{Y}$, the absolute values of the differences between the current approximate processing results $\bar{y}_1$, $\bar{y}_2$ and the true average satisfy:
$$|\bar{y}_1 - \bar{Y}| \le \varepsilon_1, \qquad |\bar{y}_2 - \bar{Y}| \le \varepsilon_2$$
where $\varepsilon_1$ and $\varepsilon_2$ are the theoretical error limit values corresponding to the respective sample sets. The sum of the theoretical error limit values is set as the error threshold $\Delta = \varepsilon_1 + \varepsilon_2$, and whether the error value is less than or equal to the error threshold is judged by:
$$|\bar{y}_1 - \bar{y}_2| \le \Delta$$
It should be noted that, if the number n of sample sets is greater than 2, the error value can be expressed as:
(formula not reproduced in this text)
where $\bar{y}_1, \dots, \bar{y}_n$ are the current approximate processing results corresponding to the respective sample sets, and n is a positive integer.
The subsequent steps are then executed according to the result of comparing the error value $|\bar{y}_1 - \bar{y}_2|$ with the error threshold $\Delta$.
Since stream data is usually processed in windows, for ease of understanding the above inequality $|\bar{y}_1 - \bar{y}_2| \le \Delta$ is converted into:
$$\frac{1}{|window|}\sum_{i=1}^{|window|} |\bar{y}_{1,i} - \bar{y}_{2,i}| \le \Delta$$
where $\bar{y}_{1,i}$ and $\bar{y}_{2,i}$ are the approximate processing results of the two sample sets for the i-th data processing window considered, and |window| is a positive integer determined by the user requirement.
When the user requirement is a requirement on the maximum error, |window| = 1, and the error value corresponding to the current data processing window is compared with the preset error threshold $\Delta$; that is, whether the user requirement is met is judged by:
$$|\bar{y}_1 - \bar{y}_2| \le \Delta$$
When the user requirement is a requirement on the average error, the number of historical data processing windows can be determined, and the historical approximate processing result corresponding to each historical data processing window is obtained. The error value for the average error is calculated from the current approximate processing result and the historical approximate processing results:
$$\frac{1}{N+1}\sum_{i=1}^{N+1} |\bar{y}_{1,i} - \bar{y}_{2,i}|$$
where the calculated value is the average error over the current approximate processing result corresponding to the current data processing window and the historical approximate processing results corresponding to the N historical data processing windows, N is a positive integer, and |window| = N + 1.
Whether to output the current approximate processing result or to perform error correction is determined by comparing the error value with the preset error threshold. If the error value is less than or equal to $\Delta$, the current approximate processing result is output. If the error value is greater than $\Delta$, it is judged whether the error value was obtained by the first error analysis of the current data processing window; if so, the stream data of the current data processing window is resampled; otherwise, the sampling strategy is adjusted, for example by deleting a leaf node of the hierarchical structure of the stream data, for use when sampling the stream data of subsequent data processing windows.
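Why the sum of the theoretical error limit values is a natural choice of threshold can be seen from the triangle inequality. The symbols in this short derivation are notation assumed for the illustration: $\bar{Y}$ is the true average of the stream data, $\bar{y}_1$ and $\bar{y}_2$ are the averages estimated from the two sample sets, $\varepsilon_1$ and $\varepsilon_2$ are the theoretical error limit values of the two sample sets, and $\Delta$ is the error threshold.

```latex
% If each estimate is within its own theoretical error limit of the true average,
%   |\bar{y}_1 - \bar{Y}| \le \varepsilon_1  and  |\bar{y}_2 - \bar{Y}| \le \varepsilon_2,
% then the two estimates cannot differ by more than the sum of the limits:
\begin{aligned}
|\bar{y}_1 - \bar{y}_2|
  &= \left| (\bar{y}_1 - \bar{Y}) - (\bar{y}_2 - \bar{Y}) \right| \\
  &\le |\bar{y}_1 - \bar{Y}| + |\bar{y}_2 - \bar{Y}| \\
  &\le \varepsilon_1 + \varepsilon_2 = \Delta .
\end{aligned}
```

Read in the other direction, an observed difference greater than $\Delta$ means that at least one of the two estimates lies outside its theoretical error limit, which is what triggers error correction. A difference within $\Delta$ is a necessary condition for both estimates being within their limits, not a proof that they are, because the true average is unknown.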
In an embodiment of the present invention, before step S104 (performing error analysis according to the user requirement obtained in advance and the current approximate processing result to obtain an error value) is executed, the method further includes the following step:
judging whether the current moment is within a preset quality monitoring time period; if so, the operation of step S104 is performed.
In this embodiment, a quality monitoring time period may be preset, and the time period may be determined and adjusted according to an actual situation, which is not limited in the embodiment of the present invention.
It is judged whether the current moment is within the preset quality monitoring time period; if so, step S104 is executed to perform error analysis according to the user requirement obtained in advance and the current approximate processing result to obtain an error value. That is, when the current moment is within the preset quality monitoring time period, steps S104 and S105 are executed in turn, and whether to execute step S106 or step S107 is determined according to the result of step S105.
In this embodiment, if the current time is within the preset quality monitoring time period, the error analysis is performed on the current approximate processing result, so that the quality of the current approximate processing result within the quality monitoring time period can be improved.
In another embodiment of the invention, when the current time is not in the quality monitoring time period, the current approximate processing result can also be directly output.
When the current moment is not within the quality monitoring time period, the current approximate processing result is output directly without executing the other steps.
Outputting the current approximate processing result directly in this case reduces the overhead of the approximate processing of stream data.
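The quality monitoring gate can be sketched as a check placed in front of the error analysis. Representing the quality monitoring time period as a daily time-of-day interval is an assumption made for this example, as are the function names and the `check_quality` callable that stands in for steps S104 to S107.

```python
from datetime import time


def in_monitoring_period(now, start, end):
    """True when `now` falls inside [start, end]; the period may wrap past midnight."""
    if start <= end:
        return start <= now <= end
    return now >= start or now <= end


def handle_result(result, now, start, end, check_quality):
    """Output directly outside the monitoring period; otherwise run the quality check."""
    if not in_monitoring_period(now, start, end):
        return result  # no error analysis, which saves overhead
    return check_quality(result)
```

Outside the period, `handle_result` returns the approximate result untouched; inside it, whatever `check_quality` returns (for example the result after error analysis and any correction) is passed through.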
Corresponding to the above method embodiment, the embodiment of the present invention further provides an online streaming data approximate processing quality control device, and the online streaming data approximate processing quality control device described below and the online streaming data approximate processing quality control method described above may be referred to each other.
Referring to fig. 3, the apparatus includes the following modules:
a sampling policy determination module 201, configured to determine a sampling policy of the stream data for the current data processing window;
a sampling data obtaining module 202, configured to sample the stream data according to the sampling policy, so as to obtain sampled data;
an approximate processing result obtaining module 203, configured to perform approximate processing on the sampled data to obtain a current approximate processing result;
the error analysis module 204 is configured to perform error analysis according to a user requirement obtained in advance and a current approximate processing result to obtain an error value;
an error determination module 205, configured to determine whether the error value is less than or equal to a preset error threshold;
the output module 206 is configured to output a current approximate processing result when the error value is less than or equal to a preset error threshold;
and the error correction module 207 is configured to perform error correction when the error value is greater than a preset error threshold.
With the device provided by the embodiment of the invention, a sampling strategy is determined for the stream data of the current data processing window, the stream data is sampled according to the sampling strategy to obtain sampled data, and the sampled data is approximately processed to obtain a current approximate processing result. Error analysis is then performed according to the user requirement obtained in advance and the current approximate processing result to obtain an error value, and it is judged whether the error value is less than or equal to a preset error threshold: if so, the current approximate processing result is output; if not, error correction is performed. Because the current approximate processing result is subjected to error analysis before it is output, is corrected when its error is greater than the error threshold, and is output only when its error is less than or equal to the error threshold, the quality of the approximate processing result of the stream data is improved.
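The cooperation of modules 201 to 207 can be sketched as a single control loop. This is only an illustrative sketch: the function name `control_window`, the callables passed in for sampling, approximation, error analysis and correction, and the `max_attempts` bound (added so that the loop always terminates) are assumptions and are not specified by the embodiment.

```python
from typing import Callable, Sequence, TypeVar

S = TypeVar("S")  # type of the sampling strategy (hypothetical)


def control_window(
    window: Sequence[float],
    strategy: S,
    sample: Callable[[Sequence[float], S], Sequence[float]],
    approximate: Callable[[Sequence[float]], float],
    analyze_error: Callable[[float], float],
    correct: Callable[[S, int], S],
    threshold: float,
    max_attempts: int = 5,
) -> float:
    """Process one data processing window and return the result to output."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    result = 0.0
    for attempt in range(1, max_attempts + 1):
        # Modules 202 and 203: sample the window, then approximate.
        result = approximate(sample(window, strategy))
        # Modules 204 and 205: error analysis and threshold check.
        if analyze_error(result) <= threshold:
            return result  # module 206: output the current result
        # Module 207: error correction may keep or change the strategy.
        strategy = correct(strategy, attempt)
    # Assumption: after max_attempts the last result is output anyway.
    return result
```

Passing the modules as callables keeps the loop independent of any particular sampling algorithm or error metric.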
In an embodiment of the present invention, the error analysis module 204 is specifically configured to:
if the user requirement is a requirement on the maximum error, perform error analysis on the approximate processing result corresponding to the current data processing window to obtain an error value.
In an embodiment of the present invention, the error analysis module 204 is specifically configured to:
if the user requirement is a requirement on the average error, obtain historical approximate processing results corresponding to N data processing windows adjacent to the current data processing window, wherein N is a positive integer;
and perform error analysis according to the historical approximate processing results and the current approximate processing result to obtain an error value.
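The two modes of the error analysis module can be sketched as below. The embodiment does not state how the historical results enter the calculation, so this sketch assumes that each window already carries a per-window error estimate and that the average requirement is checked against the mean of the N historical estimates and the current one. The function name `analyze_error` and the strings `"max"` and `"average"` are hypothetical.

```python
from statistics import mean
from typing import Sequence


def analyze_error(requirement: str, current_error: float,
                  history_errors: Sequence[float] = ()) -> float:
    """Return the error value to compare with the preset error threshold."""
    if requirement == "max":
        # Maximum error requirement: only the current window is considered.
        return current_error
    if requirement == "average":
        # Average error requirement: combine the N adjacent windows
        # with the current window (assumed here to be a plain mean).
        return mean([*history_errors, current_error])
    raise ValueError(f"unknown requirement: {requirement!r}")
```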
In an embodiment of the present invention, the error correction module 207 is specifically configured to:
judging whether the error value is obtained by performing first error analysis on the current data processing window;
if so, the sampling data obtaining module 202 is triggered.
In an embodiment of the present invention, the apparatus further includes a sampling strategy adjusting module, configured to:
adjust the sampling strategy when the error value was not obtained by the first error analysis performed on the current data processing window.
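The correction policy of the error correction module 207 and the sampling strategy adjusting module can be sketched as follows. The `SamplingStrategy` class, the function `correct_error`, and the choice of adjusting the strategy by multiplying the sampling rate by `growth` are assumptions made for illustration; the embodiment only states that the strategy is adjusted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingStrategy:
    rate: float  # fraction of the window that is sampled


def correct_error(strategy: SamplingStrategy, analyses_done: int,
                  growth: float = 2.0) -> SamplingStrategy:
    """Choose the strategy for the next sampling pass on the same window.

    `analyses_done` is the number of error analyses already performed
    on the current data processing window (at least 1).
    """
    if analyses_done == 1:
        # First failed analysis: resample with the unchanged strategy.
        return strategy
    # Later failures: adjust the strategy (assumed: raise the rate).
    return SamplingStrategy(rate=min(1.0, strategy.rate * growth))
```

Resampling once before changing the strategy avoids paying for a higher sampling rate when the first sample was simply unlucky.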
In a specific embodiment of the present invention, the apparatus further includes a quality monitoring and determining module, configured to:
judging whether the current moment is within a preset quality monitoring time period;
if so, the error analysis module 204 is triggered.
In a specific embodiment of the present invention, the quality monitoring and determining module is further configured to:
directly output the current approximate processing result when the current time is not within the quality monitoring time period.
For the convenience of understanding, a series of experiments performed using the technical solutions provided in the embodiments of the present invention are described.
In this series of experiments, error control is performed by an online error detection program, whose error detection strategy follows steps S104 to S107 of the online streaming data approximate processing quality control method provided in the embodiment of the present invention.
An online streaming data processing application is simulated with a pre-acquired data set; fig. 4 shows the data distribution of the stream data in an embodiment of the present invention. The data file is stored compressed with bzip2 and has a size of 12.6 GB. The data set records web page information in XML (eXtensible Markup Language) format. The experiment reads the file as a stream and analyzes the length (in bytes) of the web pages in the data set. An online error detection program is set in each data processing window.
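Reading such a file as a stream might look like the sketch below, which uses only the Python standard library. The element name `page`, the function name `page_lengths`, and the definition of page length as the UTF-8 byte length of the element's text are assumptions; the source does not describe the XML schema of the data set.

```python
import bz2
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, Union


def page_lengths(source: Union[str, BinaryIO],
                 page_tag: str = "page") -> Iterator[int]:
    """Yield the text length in bytes of each <page_tag> element."""
    with bz2.open(source, "rb") as stream:
        # iterparse decodes the XML incrementally from the decompressed
        # stream, so the whole file is never held in memory at once.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if elem.tag == page_tag:
                text = "".join(elem.itertext())
                yield len(text.encode("utf-8"))
                elem.clear()  # drop the children and text of this page
```

The generator can feed the data processing windows directly, one length per data item.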
According to the different types of user requirements, the system tests the error values obtained by sampling and calculation in each data processing window under the average error requirement and the maximum error requirement respectively. Two data processing window sizes are used, 2000 and 4000, where the window size is the number of data items processed each time. Because the system returns query results to the user in real time, both the overall error and the error within each data processing window are recorded while the data is processed online.
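The windowing and the per-window error bookkeeping can be sketched as follows. The helper names `windows` and `relative_error_percent` are hypothetical, and the use of relative error in percent is an assumption suggested by the "(%)" unit in the result tables; the source does not define the error metric explicitly.

```python
from typing import Iterable, Iterator, List


def windows(stream: Iterable[float], size: int) -> Iterator[List[float]]:
    """Split a stream into data processing windows of `size` items."""
    batch: List[float] = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter, window


def relative_error_percent(estimate: float, truth: float) -> float:
    """Relative error of an estimate in percent (truth must be nonzero)."""
    return abs(estimate - truth) * 100.0 / abs(truth)
```

With `size=2000` or `size=4000` this reproduces the two window sizes used in the experiments.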
Experiment one: the random sampling algorithm and the layered random sampling algorithm are each used as the sampling strategy, and the final overall errors obtained are shown in Table 1:
| Sampling rate | 0.01 | 0.05 | 0.1 | 0.2 | 0.3 |
| --- | --- | --- | --- | --- | --- |
| Layered random sampling algorithm 2000 (%) | 5.6385 | 1.8278 | 0.6881 | 0.3030 | 0.1887 |
| Random sampling algorithm 2000 (%) | 6.6861 | 2.5303 | 1.0745 | 0.4727 | 0.3406 |

TABLE 1
Table 1 shows that, under the same conditions, the error of the layered (stratified) random sampling algorithm is smaller than that of the random sampling algorithm at every sampling rate tested, from which it can be concluded that the layered random sampling algorithm is superior to the random sampling algorithm.
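The difference between the two sampling strategies can be sketched as below. This is a generic illustration of simple random sampling and proportionally allocated stratified sampling, not the code used in the experiments; the function names and the `stratum_of` key function are assumptions, and the source does not say how the strata were defined.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence, TypeVar

T = TypeVar("T")


def simple_random_sample(items: Sequence[T], rate: float,
                         rng: random.Random) -> List[T]:
    """Draw round(len(items) * rate) items uniformly without replacement."""
    k = max(1, round(len(items) * rate))
    return rng.sample(items, k)


def stratified_random_sample(items: Sequence[T], rate: float,
                             stratum_of: Callable[[T], Hashable],
                             rng: random.Random) -> List[T]:
    """Sample each stratum separately at the same rate."""
    strata: Dict[Hashable, List[T]] = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    sample: List[T] = []
    for members in strata.values():
        # At least one item per stratum, so small strata are never missed.
        k = max(1, round(len(members) * rate))
        sample.extend(rng.sample(members, k))
    return sample
```

Because every stratum is represented in proportion to its size, the stratified sample cannot by chance miss a whole region of the data distribution, which is consistent with the lower errors reported for it.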
In the following experiments, two sample sets are generated in parallel in the sampling phase using the layered (stratified) random sampling algorithm, so that their results can be compared.
Experiment two: for the average error requirement, |window| is first set to 5, and the average errors obtained by comparing the processing results of multiple data processing windows are shown in Table 2:
| Error threshold (Δ) | 30 | 50 | 80 | 100 | 200 |
| --- | --- | --- | --- | --- | --- |
| Window size 2000 (%) | 8.5272 | 9.6209 | 10.9134 | 11.2466 | 12.2904 |
| Window size 4000 (%) | 7.4944 | 8.9266 | 9.3936 | 9.9036 | 11.2858 |

TABLE 2
When the error threshold Δ is set small, the requirement on the error is strict and the accuracy of the corresponding result is high. As shown in Table 2, when the error threshold Δ is 30 and the data processing window size is 2000, the average error is about 8.5272%; when the error threshold Δ is increased to 200, the error in calculating the average web page length is about 12.2904%.
As the error threshold Δ is increased from 30 to 200, the resulting average error gradually increases, which verifies the effect of the error control strategy on the approximate output. In each check, the results of the last five data processing windows are examined, and sampling is performed again when the difference between the approximate results of the two sample sets is greater than the error threshold Δ. Thus, a smaller Δ indicates a more stringent requirement on the output result, and the final result is more accurate.
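One way to read this check is sketched below: the gap between the estimates from the two parallel sample sets is recorded per window, and resampling is requested when the mean gap over the most recent windows exceeds Δ. This interpretation, the function name `needs_resampling`, and the use of the absolute difference as the gap are assumptions; the source does not give the exact formula.

```python
from collections import deque
from statistics import mean
from typing import Deque


def needs_resampling(recent_gaps: Deque[float], estimate_a: float,
                     estimate_b: float, delta: float) -> bool:
    """Record the gap for the current window and compare with delta.

    `recent_gaps` should be created as deque(maxlen=w), where w is the
    number of windows considered (5 in experiment two).
    """
    recent_gaps.append(abs(estimate_a - estimate_b))
    # The deque drops the oldest gap automatically once it holds w items.
    return mean(recent_gaps) > delta
```

Creating the deque with `maxlen=1` reduces the check to a test on the current window alone.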
Experiment three: assuming that the real-time results returned to the user should satisfy a certain maximum error constraint, the error threshold Δ is set to 30, 50, 80, 100, and 200 respectively, and |window| = 1, that is, the error of the current approximate processing result of the stream data of the current data processing window is compared with the error threshold. The errors obtained under the maximum error requirement are shown in Table 3:
| Error threshold (Δ) | 30 | 50 | 80 | 100 | 200 |
| --- | --- | --- | --- | --- | --- |
| Window size 2000 (%) | 8.7729 | 9.2060 | 10.0202 | 11.5643 | 14.5708 |
| Window size 4000 (%) | 7.7994 | 8.3757 | 9.3527 | 10.5840 | 13.2013 |

TABLE 3
After sampling and calculation are performed on the stream data of the current data processing window, the error control strategy outputs a comparison result, namely an error value. If the error value output for the current data processing window exceeds the preset threshold Δ, the stream data of the current data processing window is resampled to correct the error. Similar to the result measured under the average error requirement, when the error detection criterion is relaxed, that is, when the error threshold Δ is increased from 30 to 200, the error of the statistical result gradually increases and its accuracy gradually decreases.
In Table 3, the data within a row reflect how the error detection method improves the accuracy of the calculation result, and the data within a column reflect the influence of the data processing window size on the approximate processing result. The two sets of data in Table 3 show that the technical solution provided by the invention can improve the quality of the output result and correct the larger errors produced by approximate calculation.
Experiment four: the average error produced by the data processing windows at different sampling rates is tested to further verify the technical solution of the embodiment of the invention. The sampling rates are set to 0.05 and 0.1 respectively, the data processing window size is 2000, and the errors under the different conditions are compared. The specific experimental data are shown in Table 4:
TABLE 4
As can be seen from table 4 above, the greater the sampling rate, the more accurate the result of the calculation. And when the variation trends of the error thresholds are the same, the obtained error values have the same variation trend, and the average error generated by the data processing window is increased along with the increase of the error threshold delta at different sampling rates.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The principle and the implementation of the present invention are explained in the present application by using specific examples, and the above description of the embodiments is only used to help understanding the technical solution and the core idea of the present invention. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

Claims (10)

1. An online streaming data approximate processing quality control method is characterized by comprising the following steps:
determining a sampling strategy of the stream data for the current data processing window;
sampling the streaming data according to the sampling strategy to obtain sampling data;
carrying out approximate processing on the sampling data to obtain a current approximate processing result;
according to the user requirement obtained in advance and the current approximate processing result, carrying out error analysis to obtain an error value;
judging whether the error value is smaller than or equal to a preset error threshold value;
if yes, outputting the current approximate processing result;
if not, error correction is performed.
2. The method as claimed in claim 1, wherein the performing an error analysis according to a pre-obtained user requirement and the current approximation processing result to obtain an error value comprises:
and if the user requirement is a requirement aiming at the maximum error, carrying out error analysis aiming at the approximate processing result corresponding to the current data processing window to obtain an error value.
3. The method as claimed in claim 1, wherein the performing an error analysis according to a pre-obtained user requirement and the current approximation processing result to obtain an error value comprises:
if the user requirement is a requirement for average errors, obtaining historical approximate processing results corresponding to N data processing windows adjacent to the current data processing window, wherein N is a positive integer;
and carrying out error analysis according to the historical approximate processing result and the current approximate processing result to obtain an error value.
4. The online stream data approximate processing quality control method according to claim 1, wherein the performing error correction includes:
judging whether the error value is obtained by performing first error analysis on the current data processing window;
if yes, the step of sampling the streaming data according to the sampling strategy to obtain sampling data is repeatedly executed.
5. The method as claimed in claim 4, further comprising, when determining that the error value is not an error value obtained by performing a first error analysis for the current data processing window:
adjusting the sampling strategy.
6. The method as claimed in any one of claims 1 to 5, wherein before performing error analysis to obtain an error value according to a user requirement obtained in advance and the current approximation processing result, the method further comprises:
judging whether the current moment is within a preset quality monitoring time period;
and if so, executing the step of carrying out error analysis according to the user requirement obtained in advance and the current approximate processing result to obtain an error value.
7. The online streaming data approximate processing quality control method according to claim 6, further comprising, when the current time is not within the quality monitoring period:
and directly outputting the current approximate processing result.
8. An online streaming data approximate processing quality control apparatus, comprising:
the sampling strategy determining module is used for determining the sampling strategy of the stream data aiming at the current data processing window;
the sampling data obtaining module is used for sampling the streaming data according to the sampling strategy to obtain sampling data;
an approximate processing result obtaining module, configured to perform approximate processing on the sample data to obtain a current approximate processing result;
the error analysis module is used for carrying out error analysis according to the user requirement obtained in advance and the current approximate processing result to obtain an error value;
the error judgment module is used for judging whether the error value is smaller than or equal to a preset error threshold value or not;
the output module is used for outputting the current approximate processing result when the error value is smaller than or equal to a preset error threshold value;
and the error correction module is used for correcting errors when the error value is greater than a preset error threshold value.
9. The online streaming data approximation processing quality control device as claimed in claim 8, further comprising a quality monitoring judgment module for:
before performing error analysis according to the user requirement obtained in advance and the current approximate processing result to obtain an error value, judging whether the current moment is within a preset quality monitoring time period;
if so, triggering the error analysis module.
10. The online streaming data approximate processing quality control device according to claim 9, wherein the quality monitoring and determining module is further configured to:
and when the current moment is not in the quality monitoring time period, directly outputting the current approximate processing result.
CN201710701336.4A 2017-08-16 2017-08-16 A kind of online flow data approximate processing method of quality control and device Expired - Fee Related CN107436954B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710701336.4A CN107436954B (en) 2017-08-16 2017-08-16 A kind of online flow data approximate processing method of quality control and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710701336.4A CN107436954B (en) 2017-08-16 2017-08-16 A kind of online flow data approximate processing method of quality control and device

Publications (2)

Publication Number Publication Date
CN107436954A true CN107436954A (en) 2017-12-05
CN107436954B CN107436954B (en) 2018-10-02

Family

ID=60461371

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710701336.4A Expired - Fee Related CN107436954B (en) 2017-08-16 2017-08-16 A kind of online flow data approximate processing method of quality control and device

Country Status (1)

Country Link
CN (1) CN107436954B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110782550A (en) * 2019-09-20 2020-02-11 腾讯科技(深圳)有限公司 Data acquisition method, device and equipment
CN114325553A (en) * 2021-12-22 2022-04-12 杭州明特科技有限公司 Self-heating error correction method and device for electric energy meter, electric energy meter and storage medium
WO2023185050A1 (en) * 2022-03-30 2023-10-05 蚂蚁区块链科技(上海)有限公司 Smart contract-based calculating, updating, and reading methods and apparatuses, and electronic device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1263327A (en) * 1999-02-09 2000-08-16 索尼公司 Data processing method and its device
CN101247526A (en) * 2008-03-18 2008-08-20 天津大学 Sound volume equalization regulation and its application method based on digital television code stream
CN101908065A (en) * 2010-07-27 2010-12-08 浙江大学 On-line attribute abnormal point detecting method for supporting dynamic update
CN102798384A (en) * 2012-07-03 2012-11-28 天津大学 Ocean remote sensing image water color and water temperature monitoring method based on compression sampling
CN103236825A (en) * 2013-03-22 2013-08-07 中国科学院光电技术研究所 Data correction method for high-precision data acquisition system
US20140258253A1 (en) * 2013-03-08 2014-09-11 International Business Machines Corporation Summarizing a stream of multidimensional, axis-aligned rectangles
CN106997303A (en) * 2017-04-10 2017-08-01 中国人民解放军国防科学技术大学 Big data approximate evaluation method based on MapReduce

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
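Zhao Yang (赵阳): "Threshold-based approximate compression processing techniques for time series" (面向时间序列的阈值近似压缩处理技术), China Master's Theses Full-text Database (中国优秀硕士学位论文全文数据库) *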

Also Published As

Publication number Publication date
CN107436954B (en) 2018-10-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20181002

Termination date: 20190816