CN112883064A - Self-adaptive sampling and query method and system

Info

Publication number: CN112883064A
Application number: CN202110231990.XA
Authority: CN
Inventors: 王建民; 沈恩亚; 宋怡然; 沈磊贤
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2021-03-02
Filing date: 2021-03-02
Publication date: 2021-06-01
Anticipated expiration: 2041-03-02
Also published as: CN112883064B

Abstract

The invention provides an adaptive sampling and query method and system. The method comprises: calculating a fluctuation weight for each data point using a fluctuation-based cosine distance weight function, and, according to the fluctuation weights, performing fast adaptive bucketing on the accumulated data using a binary search algorithm so that the maximum weight sum over all buckets in the bucketing result is minimized; extracting the same number of samples from the streaming data in each bucket using multiple sampling operators, so as to achieve an adaptive sampling density and obtain the corresponding sampling results; based on the sampling results, sampling from lower-level samples to obtain higher-level samples, constructing a hierarchical sample structure that keeps results consistent across levels, and constructing a hierarchical query engine according to the hierarchical sample structure; and, when the hierarchical query engine finds a level that matches the sampling granularity, retaining the samples of the current level as the query result. The invention reduces sampling error and ensures consistent, low-latency data query results.

Description

Self-adaptive sampling and query method and system

Technical Field

The invention relates to the technical field of sampling for computer visualization, and in particular to an adaptive sampling and query method and an adaptive sampling and query system.

Background

Real-time visual monitoring of massive, high-frequency streaming data is extremely important for tasks such as data analysis and fault detection. However, because streaming data is large in scale and high in frequency, querying, processing, and rendering millions of records takes a great deal of time, so real-time visual monitoring of streaming data remains an unsolved problem. Real-time visualization of streaming data helps reveal abnormal features of the data at an early stage and prevent problems before they occur, and is therefore of great practical significance. One possible solution is to reduce the size of the visualized data without losing too much visualization accuracy.

The most straightforward implementation is to sample the raw data at each query, but this still requires a significant amount of query and processing time. Database sampling techniques first sample the raw data randomly and then materialize the samples in a database to respond to queries. This avoids duplicate computation across different queries, but random sampling provides no guarantee on the visualization error, and visualization results without error guarantees may lead to erroneous conclusions. Common simple sampling methods, such as uniform sampling and stratified sampling, provide no guarantee on sampling error and have higher latency.

Therefore, there is a need for an adaptive sampling and query method and system to solve the above problems.

Disclosure of Invention

In view of the problems in the prior art, the invention provides an adaptive sampling and query method and an adaptive sampling and query system.

The invention provides an adaptive sampling and query method, which comprises the following steps:

calculating a fluctuation weight for each data point using a fluctuation-based cosine distance weight function, and, according to the fluctuation weights, performing fast adaptive bucketing on the accumulated data using a binary search algorithm so that the maximum weight sum over all buckets in the bucketing result is minimized;

extracting the same number of samples from the streaming data in each bucket using multiple sampling operators, so as to achieve an adaptive sampling density and obtain the corresponding sampling results;

based on the sampling results, sampling from lower-level samples to obtain higher-level samples, constructing a hierarchical sample structure that keeps results consistent across levels, and constructing a hierarchical query engine according to the hierarchical sample structure;

and, when the hierarchical query engine finds a level that matches the sampling granularity, retaining the samples of the current level as the query result.

According to the adaptive sampling and query method provided by the invention, the fluctuation-based cosine distance weight function is used to calculate the fluctuation weight of a data point, wherein the fluctuation weight is the cosine distance between the target data point and the two data points immediately before and after it.

According to the adaptive sampling and query method provided by the invention, performing fast adaptive bucketing on the accumulated data using a binary search algorithm, so that the maximum weight sum over all buckets in the bucketing result is minimized, comprises the following steps:

scanning the weight array once to obtain the sum of the weights of all data points and the maximum weight of any single data point;

and taking the sum of the weights of all data points as the upper bound, and the maximum weight of a single data point as the lower bound, of the bucket weight sum, and, starting from these bounds, obtaining a bucket weight sum that meets the preset condition through a binary search algorithm.

According to the adaptive sampling and query method provided by the invention, sampling from lower-level samples to obtain higher-level samples based on the sampling results, and constructing a hierarchical sample structure that keeps results consistent across levels, comprises the following steps:

dividing the hierarchical sample structure into a lowest sample level and upper sample levels, wherein the lowest sample level directly obtains the raw data as it is updated in real time and performs adaptive bucket sampling on the raw data, and each upper sample level continuously polls the sampling results produced by the level below it and judges whether the currently accumulated data meets the bucketing condition;

when the accumulated data weight is sufficient for adaptive bucketing, performing adaptive sampling and passing the sampling result to the level above, so that the whole hierarchical sample structure is updated from the bottom up.

According to the adaptive sampling and query method provided by the invention, constructing the hierarchical query engine according to the hierarchical sample structure comprises the following steps:

querying the hierarchical sample structure from top to bottom according to the number of samples or the sampling error condition given by a user, and returning a sample set that meets the condition;

and, if the samples of the current level do not meet the condition given by the user, determining that the sampling granularity of the current level is coarser than the target sampling granularity, and querying the samples of the next level down until a sample level that meets the user's query condition is found.

According to the adaptive sampling and query method provided by the invention, retaining the samples of the current level as the query result, after the hierarchical query engine finds a level that matches the sampling granularity, comprises the following steps:

when the hierarchical query engine finds a level that matches the sampling granularity, retaining the samples of the current level as the query result and obtaining the latest data point among the samples of the current level; the hierarchical query engine then takes the time of that latest data point as the start of the time range and queries the samples of the next level down, continuing until the lowest sample level has been queried.

The invention also provides an adaptive sampling and query system, which comprises:

a bucketing module, used for calculating a fluctuation weight for each data point using a fluctuation-based cosine distance weight function, and, according to the fluctuation weights, performing fast adaptive bucketing on the accumulated data using a binary search algorithm so that the maximum weight sum over all buckets in the bucketing result is minimized;

a sampling module, used for extracting the same number of samples from the streaming data in each bucket using multiple sampling operators, so as to achieve an adaptive sampling density and obtain the corresponding sampling results;

an engine construction module, used for sampling from lower-level samples to obtain higher-level samples based on the sampling results, constructing a hierarchical sample structure that keeps results consistent across levels, and constructing a hierarchical query engine according to the hierarchical sample structure;

and a query module, used for retaining the samples of the current level as the query result after the hierarchical query engine finds a level that matches the sampling granularity.

The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of any of the adaptive sampling and querying methods described above when executing the program.

The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the adaptive sampling and query method as described in any of the above.

The method determines an adaptive sampling density using a fluctuation-based data point weight function and a fast bucketing algorithm based on binary search, which reduces sampling error; combined with a hierarchical structure for managing samples, it ensures consistent, low-latency data query results through preprocessing and hierarchical query techniques.

Drawings

In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a schematic flow chart of a method for adaptive sampling and query according to the present invention;

FIG. 2 is a schematic structural diagram of an adaptive sampling and query system provided in the present invention;

FIG. 3 is a schematic structural diagram of an electronic device provided in the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a schematic flow chart of the adaptive sampling and query method provided by the present invention. As shown in Fig. 1, the method includes:

step 101, calculating a fluctuation weight for each data point using a fluctuation-based cosine distance weight function, and, according to the fluctuation weights, performing fast adaptive bucketing on the accumulated data using a binary search algorithm so that the maximum weight sum over all buckets in the bucketing result is minimized;

step 102, extracting the same number of samples from the streaming data in each bucket using multiple sampling operators, so as to achieve an adaptive sampling density and obtain the corresponding sampling results;

step 103, based on the sampling results, sampling from lower-level samples to obtain higher-level samples, constructing a hierarchical sample structure that keeps results consistent across levels, and constructing a hierarchical query engine according to the hierarchical sample structure;

and step 104, when the hierarchical query engine finds a level that matches the sampling granularity, retaining the samples of the current level as the query result.

It should be noted that the data source provided by the present invention is a unified query interface compatible with various kinds of streaming data sources; it supports the mainstream message queue system Kafka as well as the IoTDB, InfluxDB, and TimescaleDB databases.

Specifically, in step 101, weight calculation and adaptive bucketing are performed on the data obtained through the data controller query. In step 102, a sampling operator is applied within each bucket to extract the same number of samples from it. In step 103, the sampling results are stored by level in a middle-layer database. Finally, in step 104, according to the user's query parameters, the middle-layer database is searched level by level from top to bottom for matching samples, and the result is returned. It should be noted that the present invention can also store and query the hierarchical sampling results.
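As a rough illustration of step 102, the sketch below draws the same number of samples from every bucket. The text does not name its sampling operators, so `stride_operator`, `random_operator`, and `sample_buckets` are hypothetical stand-ins chosen only for illustration.

```python
import random


def stride_operator(points, k):
    """Hypothetical operator: pick k points at evenly spaced positions."""
    if len(points) <= k:
        return list(points)
    step = len(points) / k
    return [points[int(i * step)] for i in range(k)]


def random_operator(points, k, seed=0):
    """Hypothetical operator: pick k points at random, keeping their order."""
    if len(points) <= k:
        return list(points)
    rng = random.Random(seed)
    indices = sorted(rng.sample(range(len(points)), k))
    return [points[i] for i in indices]


def sample_buckets(buckets, k, operator=stride_operator):
    """Apply one operator to every bucket so each bucket yields k samples."""
    return [operator(bucket, k) for bucket in buckets]


# A wide bucket (10 points) and a narrow bucket (4 points) both yield 2 samples.
buckets = [list(range(10)), list(range(100, 104))]
samples = sample_buckets(buckets, 2)
```

Because every bucket contributes `k` samples regardless of how many raw points it holds, a bucket that spans fewer points is sampled more densely, which is how equal-count sampling over weight-balanced buckets produces an adaptive sampling density.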

The adaptive sampling and query method provided by the invention determines an adaptive sampling density using a fluctuation-based data point weight function and a fast bucketing algorithm based on binary search, which reduces sampling error; combined with a hierarchical structure for managing samples, it ensures consistent, low-latency data query results through preprocessing and hierarchical query techniques.

On the basis of the above embodiment, the fluctuation-based cosine distance weight function is used to calculate the fluctuation weight of a data point, wherein the fluctuation weight is the cosine distance between the target data point and the two data points immediately before and after it.

In other words, the fluctuation weight of a data point is the cosine distance formed by that point with its preceding and following points.
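The text does not spell out which vectors enter the cosine distance, so the following sketch rests on an assumed interpretation: for each interior point, the incoming segment (previous point to target) and the outgoing segment (target to next point) are compared, and the weight is one minus their cosine similarity. The function name `fluctuation_weights` and the choice to give endpoints a weight of 0 are likewise assumptions.

```python
import math


def fluctuation_weights(times, values):
    """Cosine-distance weight per point (assumed interpretation).

    A point on a straight run gets a weight near 0; a sharp turn gets a
    larger weight. Endpoints have only one neighbour and get weight 0.
    """
    n = len(values)
    weights = [0.0] * n
    for i in range(1, n - 1):
        # Incoming and outgoing segment vectors around point i.
        ax, ay = times[i] - times[i - 1], values[i] - values[i - 1]
        bx, by = times[i + 1] - times[i], values[i + 1] - values[i]
        norm = math.hypot(ax, ay) * math.hypot(bx, by)
        if norm == 0.0:
            continue  # repeated point: direction undefined, keep weight 0
        cos_sim = (ax * bx + ay * by) / norm
        cos_sim = max(-1.0, min(1.0, cos_sim))  # guard against rounding
        weights[i] = 1.0 - cos_sim
    return weights


flat = fluctuation_weights([0, 1, 2], [0.0, 1.0, 2.0])   # straight line
peak = fluctuation_weights([0, 1, 2], [0.0, 1.0, 0.0])   # right-angle turn
```

Under this interpretation the weights fall in the range 0 to 2, so a bucket with a fixed weight sum covers many points where the series is smooth and few points where it turns sharply.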

On the basis of the above embodiment, performing fast adaptive bucketing on the accumulated data using a binary search algorithm, so that the maximum weight sum over all buckets in the bucketing result is minimized, comprises the following.

In the present invention, given a time series, the calculated fluctuation weights, and the required number of buckets, the time series is partitioned into buckets by an approximate x-bucket partitioning algorithm whose goal is to minimize the maximum weight sum over all buckets. Given a preset bucket weight sum, the weight array is divided into several contiguous subarrays such that the sum of each subarray does not exceed the preset bucket weight sum. Because the data is streaming, newly arriving data is accumulated and is split off into a bucket once its weight sum is approximately equal to the preset bucket weight sum. Binary search is simple and fast, and it finds a suitable bucket weight sum efficiently. Specifically, a single scan of the weight array yields the sum of the weights of all data points and the maximum weight of any single data point; these two values serve as the upper bound and the lower bound, respectively, of the preset bucket weight sum. Starting from these bounds, the interval is repeatedly halved to find a suitable bucket weight sum, which is then used to divide the time series data into buckets.
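The binary search described above can be sketched as follows. This is a minimal version that assumes a non-empty list of non-negative weights; the function names, the greedy feasibility check, and the fixed iteration count are illustrative choices rather than details taken from the text.

```python
def buckets_needed(weights, cap):
    """Greedily count the contiguous buckets needed if no sum may exceed cap."""
    count, running = 1, 0.0
    for w in weights:
        if running + w > cap:
            count += 1
            running = w
        else:
            running += w
    return count


def find_bucket_weight_sum(weights, num_buckets, iterations=50):
    """Binary-search the smallest cap that fits into num_buckets buckets."""
    lo, hi = max(weights), sum(weights)  # lower and upper bounds from one scan
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if buckets_needed(weights, mid) <= num_buckets:
            hi = mid  # feasible: try a smaller cap
        else:
            lo = mid  # infeasible: the cap must grow
    return hi


def split_into_buckets(weights, cap):
    """Return lists of point indices, one list per contiguous bucket."""
    buckets, current, running = [], [], 0.0
    for i, w in enumerate(weights):
        if current and running + w > cap:
            buckets.append(current)
            current, running = [], 0.0
        current.append(i)
        running += w
    if current:
        buckets.append(current)
    return buckets


weights = [7.0, 2.0, 5.0, 10.0, 8.0]
cap = find_bucket_weight_sum(weights, num_buckets=2)
buckets = split_into_buckets(weights, cap)
```

Each feasibility check is a single linear pass, so the search costs one pass per halving step. In a streaming setting the resulting `cap` would play the role of the preset bucket weight sum against which newly accumulated data is compared.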

On the basis of the above embodiment, sampling from lower-level samples to obtain higher-level samples based on the sampling results, and constructing a hierarchical sample structure that keeps results consistent across levels, comprises the following.

In the invention, the hierarchical sample structure is divided into a lowest sample level and upper sample levels. The lowest sample level directly obtains the raw data as it is updated in real time and performs adaptive bucket sampling on it. The sampling results are, on the one hand, materialized into the sample hierarchy to respond to queries; on the other hand, they are passed to the level above through a blocking queue between adjacent levels, where they serve as the streaming data source of that upper sample level.

Each upper sample level continuously polls the sampling results produced by the level below it and judges whether the currently accumulated data meets the bucketing condition. When the accumulated data weight is sufficient for adaptive bucketing, adaptive sampling is performed and the sampling result is passed on to the level above, so that the whole hierarchical sample structure is updated from the bottom up.
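A single-threaded sketch of this bottom-up update is given below. The class `SampleLevel`, its fields, and the simple stride-based selection inside a bucket are all hypothetical; carrying each point's weight unchanged to the level above is also an assumption, since the text does not say whether weights are recomputed at each level.

```python
from queue import Queue, Empty


class SampleLevel:
    """One level of the hierarchy, fed by a queue from the level below."""

    def __init__(self, bucket_weight_sum, samples_per_bucket, upper=None):
        self.bucket_weight_sum = bucket_weight_sum
        self.samples_per_bucket = samples_per_bucket
        self.upper = upper          # next level up, or None at the top
        self.inbox = Queue()        # queue between adjacent levels
        self.pending = []           # (point, weight) pairs not yet bucketed
        self.pending_weight = 0.0
        self.samples = []           # materialized samples of this level

    def poll(self):
        """Drain the inbox and close a bucket whenever enough weight is in."""
        while True:
            try:
                point, weight = self.inbox.get_nowait()
            except Empty:
                break
            self.pending.append((point, weight))
            self.pending_weight += weight
            if self.pending_weight >= self.bucket_weight_sum:
                self._close_bucket()

    def _close_bucket(self):
        """Sample the finished bucket, store it, and pass it upward."""
        step = max(1, len(self.pending) // self.samples_per_bucket)
        chosen = self.pending[::step][: self.samples_per_bucket]
        self.samples.extend(point for point, _ in chosen)
        if self.upper is not None:
            for item in chosen:
                self.upper.inbox.put(item)
        self.pending, self.pending_weight = [], 0.0


top = SampleLevel(bucket_weight_sum=4.0, samples_per_bucket=1)
bottom = SampleLevel(bucket_weight_sum=2.0, samples_per_bucket=1, upper=top)
for i in range(8):
    bottom.inbox.put((i, 1.0))  # raw points, each with weight 1.0
bottom.poll()
top.poll()
```

Since every higher level only ever receives points that the level below has already selected, each level's samples are a subset of the level beneath it, which is one way to read the consistency property the structure is meant to keep. A real deployment would presumably run each level's polling loop in its own thread and block on the queue instead of using `get_nowait`.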

On the basis of the above embodiment, constructing the hierarchical query engine according to the hierarchical sample structure comprises the following.

In the invention, the hierarchical sample structure is queried from top to bottom according to the number of samples or the sampling error condition given by a user, and a sample set that meets that condition is returned. Specifically, each sample query request from the user contains three basic parameters: a data source, a time range, and a number of samples (or a sampling error). The hierarchical query engine checks, level by level from top to bottom, the number of samples or the sampling error of each level within the given time range. If the samples of the current level do not meet the user's condition, the sampling granularity of the current level is coarser than the target sampling granularity, and the next level down must be queried, until a sample level that meets the user's query condition is found.
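The top-down search can be sketched as below for the sample-count condition only. The representation of a level as a time-sorted list of `(time, value)` pairs, the function name `query_level`, and the fallback to the lowest level when no level qualifies are assumptions made for the example.

```python
def query_level(levels, start, end, min_samples):
    """Return (depth, samples) for the first level with enough samples.

    levels[0] is the top (coarsest) level and levels[-1] the lowest.
    If no level has min_samples points in [start, end], the lowest
    level's points are returned (an assumed fallback).
    """
    depth, in_range = -1, []
    for depth, samples in enumerate(levels):
        in_range = [(t, v) for t, v in samples if start <= t <= end]
        if len(in_range) >= min_samples:
            break  # granularity is fine enough; stop descending
    return depth, in_range


levels = [
    [(0, 1.0), (8, 2.0)],                          # top level
    [(0, 1.0), (4, 3.0), (8, 2.0), (12, 5.0)],     # middle level
    [(t, float(t)) for t in range(16)],            # lowest level
]
depth, result = query_level(levels, start=0, end=10, min_samples=3)
```

Here the top level holds only two points in the requested range, so the engine descends one level and returns the three points found there without touching the lowest level.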

On the basis of the above embodiment, retaining the samples of the current level as the query result, after the hierarchical query engine finds a level that matches the sampling granularity, comprises the following.

In the invention, when the hierarchical query engine finds a level that matches the sampling granularity, the samples of the current level are retained as the result, and the latest data point among those samples is obtained. The hierarchical query engine then sets the start of the time range to the time of that latest data point and continues querying samples at the lower levels, down to the lowest sample level.
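The continuation into lower levels can be sketched as follows. Levels are again assumed to be time-sorted lists of `(time, value)` pairs with the top level first; the name `query_with_fresh_tail`, the use of a strictly-later comparison to avoid repeating the boundary point, and the treatment of the lowest level as always matching are assumptions of this sketch.

```python
def query_with_fresh_tail(levels, start, end, min_samples):
    """Keep the matched level's samples, then append newer lower-level data."""
    result, matched = [], False
    for depth, samples in enumerate(levels):
        in_range = [(t, v) for t, v in samples if start <= t <= end]
        is_lowest = depth == len(levels) - 1
        if not matched and len(in_range) < min_samples and not is_lowest:
            continue  # still too coarse: descend without keeping anything
        matched = True
        # Only keep points newer than what the levels above already supplied.
        last_time = result[-1][0] if result else None
        fresh = [p for p in in_range if last_time is None or p[0] > last_time]
        result.extend(fresh)
        if result:
            start = result[-1][0]  # next level is queried from this time on
    return result


levels = [
    [(0, 1.0)],                              # top level, lags behind
    [(0, 1.0), (4, 3.0), (8, 2.0)],          # middle level
    [(t, float(t)) for t in range(12)],      # lowest level, most recent data
]
merged = query_with_fresh_tail(levels, start=0, end=11, min_samples=3)
```

Upper levels only receive data after the levels below have closed a bucket, so their newest sample trails the stream; appending the tail from lower levels fills that gap with the most recent points.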

Fig. 2 is a schematic structural diagram of the adaptive sampling and query system provided by the present invention. As shown in Fig. 2, the system includes a bucketing module 201, a sampling module 202, an engine construction module 203, and a query module 204. The bucketing module 201 is configured to calculate a fluctuation weight for each data point using a fluctuation-based cosine distance weight function and, according to the fluctuation weights, to perform fast adaptive bucketing on the accumulated data using a binary search algorithm so that the maximum weight sum over all buckets in the bucketing result is minimized. The sampling module 202 is configured to extract the same number of samples from the streaming data in each bucket using multiple sampling operators, so as to achieve an adaptive sampling density and obtain the corresponding sampling results. The engine construction module 203 is configured to sample from lower-level samples to obtain higher-level samples based on the sampling results, to construct a hierarchical sample structure that keeps results consistent across levels, and to construct a hierarchical query engine according to the hierarchical sample structure. The query module 204 is configured to retain the samples of the current level as the query result after the hierarchical query engine finds a level that matches the sampling granularity.

The adaptive sampling and query system provided by the invention determines an adaptive sampling density using a fluctuation-based data point weight function and a fast bucketing algorithm based on binary search, which reduces sampling error; combined with a hierarchical structure for managing samples, it ensures consistent, low-latency data query results through preprocessing and hierarchical query techniques.

The system provided by the present invention is used for executing the above method embodiments, and for the specific processes and details, reference is made to the above embodiments, which are not described herein again.

Fig. 3 is a schematic structural diagram of an electronic device provided in the present invention. As shown in Fig. 3, the electronic device may include a processor 301, a communication interface 302, a memory 303, and a communication bus 304, wherein the processor 301, the communication interface 302, and the memory 303 communicate with one another through the communication bus 304. The processor 301 may invoke logic instructions in the memory 303 to perform an adaptive sampling and query method comprising: calculating a fluctuation weight for each data point using a fluctuation-based cosine distance weight function, and, according to the fluctuation weights, performing fast adaptive bucketing on the accumulated data using a binary search algorithm so that the maximum weight sum over all buckets in the bucketing result is minimized; extracting the same number of samples from the streaming data in each bucket using multiple sampling operators, so as to achieve an adaptive sampling density and obtain the corresponding sampling results; based on the sampling results, sampling from lower-level samples to obtain higher-level samples, constructing a hierarchical sample structure that keeps results consistent across levels, and constructing a hierarchical query engine according to the hierarchical sample structure; and, when the hierarchical query engine finds a level that matches the sampling granularity, retaining the samples of the current level as the query result.

In addition, the logic instructions in the memory 303 may be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the adaptive sampling and query method provided above, the method comprising: calculating a fluctuation weight for each data point using a fluctuation-based cosine distance weight function, and, according to the fluctuation weights, performing fast adaptive bucketing on the accumulated data using a binary search algorithm so that the maximum weight sum over all buckets in the bucketing result is minimized; extracting the same number of samples from the streaming data in each bucket using multiple sampling operators, so as to achieve an adaptive sampling density and obtain the corresponding sampling results; based on the sampling results, sampling from lower-level samples to obtain higher-level samples, constructing a hierarchical sample structure that keeps results consistent across levels, and constructing a hierarchical query engine according to the hierarchical sample structure; and, when the hierarchical query engine finds a level that matches the sampling granularity, retaining the samples of the current level as the query result.

In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, performing the adaptive sampling and query method provided in the foregoing embodiments, the method comprising: calculating a fluctuation weight for each data point using a fluctuation-based cosine distance weight function, and, according to the fluctuation weights, performing fast adaptive bucketing on the accumulated data using a binary search algorithm so that the maximum weight sum over all buckets in the bucketing result is minimized; extracting the same number of samples from the streaming data in each bucket using multiple sampling operators, so as to achieve an adaptive sampling density and obtain the corresponding sampling results; based on the sampling results, sampling from lower-level samples to obtain higher-level samples, constructing a hierarchical sample structure that keeps results consistent across levels, and constructing a hierarchical query engine according to the hierarchical sample structure; and, when the hierarchical query engine finds a level that matches the sampling granularity, retaining the samples of the current level as the query result.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. An adaptive sampling and query method, comprising:

2. The adaptive sampling and query method of claim 1, wherein the cosine distance weighting function based on the degree of fluctuation is used to calculate the degree of fluctuation weight of the data point, wherein the degree of fluctuation weight is the cosine distance between a target data point and two adjacent data points before and after the target data point.

3. The adaptive sampling and query method of claim 1, wherein the fast adaptive binning of the accumulated data using a binary search algorithm such that the maximum sum of weights of all bins in the binning result is minimal comprises:

4. The adaptive sampling and query method according to claim 1, wherein the sampling from the low-level samples based on the sampling result to obtain the high-level samples, and constructing the hierarchical sample structure maintaining consistency of the hierarchical result comprises:

5. The adaptive sampling and query method of claim 4, wherein the building a hierarchical query engine according to the hierarchical sample structure comprises:

6. The adaptive sampling and query method according to claim 5, wherein retaining the samples of the current level as the query result, after the hierarchical query engine finds a level meeting the sampling granularity, comprises:

7. An adaptive sampling and query system, comprising:

8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the adaptive sampling and query method according to any one of claims 1 to 6 are implemented when the computer program is executed by the processor.

9. A non-transitory computer readable storage medium, having stored thereon a computer program, which, when being executed by a processor, carries out the steps of the adaptive sampling and query method according to any one of claims 1 to 6.