CN111026784A

CN111026784A - Uncertain data stream probability summation threshold query method

Info

Publication number: CN111026784A
Application number: CN201911106844.3A
Authority: CN
Inventors: 陈岭; 陈东辉
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2019-11-13
Filing date: 2019-11-13
Publication date: 2020-04-17
Anticipated expiration: 2039-11-13
Also published as: CN111026784B

Abstract

The invention discloses a probability summation threshold query method for uncertain data streams and belongs to the technical field of uncertain data stream query processing. The method comprises the following steps: 1) initialization, including setting the query parameters and modeling the uncertain data stream with a Gaussian mixture model; 2) obtaining an upper or lower bound of the result with a filtering strategy based on the properties of the Gaussian mixture model and on probability theory, so that a decision can be made quickly; 3) when the filtering strategy cannot decide, computing the exact value with a sliding window model, where incremental computation reduces the calculation cost. The method has broad application prospects in fields such as cluster monitoring, health monitoring and intelligent security.

Description

Uncertain data stream probability summation threshold query method

Technical Field

The invention relates to the technical field of uncertain data stream query processing, in particular to a method for querying a probability summation threshold of an uncertain data stream.

Background

With the development of sensing and network technologies, data streams can be acquired widely. Data in a data stream usually has a probability-based representation because of inherent device errors, interference from ambient noise, recovery of lost information through inference, and so on. Simply computing statistics (e.g., mean and variance) of such uncertain data loses useful information and can even lead to incorrect conclusions. Uncertain data stream management addresses these problems by employing an uncertain data model that supports probabilistic queries. Among these, the probabilistic summation query is an important query type: it takes a large amount of uncertain data (such as probability distribution functions) as input and returns a probability distribution as the result. In many monitoring applications, it is only necessary to know whether the result distribution exceeds a user-defined threshold. An example is given below.

Example 1: Temperature monitoring. Six sensors measure the temperature of an object simultaneously. The temperature readings are subject to error because of errors inherent in the sensors and interference from noise signals. The readings of the six sensors are converted into a probability distribution using a data fusion technique such as density estimation. The probability distributions at different time instants are then aggregated to detect anomalies. To this end, the monitoring application issues the following query:

Query: Over the last 10 minutes, is the probability that the average temperature exceeds 60 degrees greater than 80%?

When the query result is "true", an alarm will be triggered.

The above query explicitly considers the temperature fluctuations as a whole over the last 10 minutes and introduces two thresholds into the probability summation query: a probability threshold (80%) and a score threshold (60 degrees). Such a query is an uncertain data stream probability summation threshold query, which is an extension of the uncertain data stream probability summation query.

Although there has been a great deal of research on probabilistic summation queries over uncertain data streams, most methods focus on obtaining approximate results under an unbounded data stream model by proposing space- and time-efficient algorithms. Other approaches update results incrementally by processing both incoming and outgoing tuples through a sliding window model. In addition, although existing probability threshold query methods provide various filtering strategies (such as distance-based filtering and probability-based filtering), those strategies are designed for specific query types, and the threshold semantics of different query types differ in nature. For example, a probability range threshold query has a range threshold and a probability threshold, whereas a probability summation threshold query has a score threshold and a probability threshold. At present, no probability summation threshold query method for uncertain data streams is available. A naive solution is to perform the probabilistic summation query first and then apply the threshold constraints to the final result. Because query processing and threshold evaluation are separated, this approach is very inefficient: it computes the full result distribution for every sliding window, even though this is not necessary for every window.

Disclosure of Invention

The technical problem to be solved is how to process uncertain data stream probability summation threshold queries efficiently. To this end, the invention provides a probability summation threshold query method for uncertain data streams.

The technical scheme of the invention is as follows:

A method for querying a probability summation threshold of an uncertain data stream, the method comprising the following steps:

(1) dividing the continuous uncertain data into sliding windows and modeling the random variables in each window with a Gaussian mixture model, that is, representing the random variables by Gaussian distributions;

(2) performing up to two filtering judgments based on the first moment and first-order variance, and the second moment and second-order variance, of the sum of the random variables in the sliding window: if the first filtering judgment, made according to the first moment and the first-order variance, yields a query result, outputting the query result and returning to step (1); otherwise, performing the second filtering judgment according to the second moment and the second-order variance; if the second filtering judgment yields a query result, outputting the query result and returning to step (1); otherwise, proceeding to step (3);

(3) converting the random variables in the sliding window into characteristic functions, performing probability summation based on the characteristic functions, judging whether the query result is 'yes' or 'no' by comparing the summed probability of exceeding the score threshold with the probability threshold, and outputting the query result.

When the method is used for processing the query of the probability summation threshold of the uncertain data stream, the properties and the probability theory of the Gaussian mixture model are fully utilized, and the characteristic function, the pruning strategy and the incremental processing based on the sliding window are combined, so that the calculation efficiency is improved. Compared with the prior art, the method has the advantages that:

1) The uncertain data are modeled as a Gaussian mixture model, which makes the method more flexible and efficient.

2) A pruning strategy based on the properties of the Gaussian mixture model and on probability theory is designed, which reduces unnecessary calculation.

3) In the exact calculation stage, the characteristic function is introduced to reduce the complexity of the algorithm, and incremental processing further improves the calculation efficiency.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

Fig. 1 is a flow chart of a method for querying a summation threshold of probabilities of an uncertain data stream according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the detailed description and specific examples, while indicating the scope of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.

Fig. 1 is a flowchart of a method for querying a probability summation threshold of an uncertain data stream according to an embodiment of the present invention. As shown in Fig. 1, the method provided by the embodiment represents uncertain data with continuous rather than discrete random variables; adopts a Gaussian mixture model as the basic model to improve calculation efficiency and provide high flexibility; and integrates a filtering strategy with exact calculation in query processing. A decision is made quickly with the filtering strategy based on the properties of the Gaussian mixture model and on probability theory, and when the filtering strategy cannot decide, the exact value is computed incrementally with a sliding window model. The method comprises an initialization stage, a fast judgment stage based on the filtering strategy, and an exact calculation stage based on the sliding window model. Each stage is explained in detail below.

Initialization phase

The initialization stage divides the stream into sliding windows and models the random variables in each window with a Gaussian mixture model, that is, represents the random variables by Gaussian distributions. It specifically comprises the following steps:

S101, acquiring the new j-th uncertain data item t_j in the uncertain data stream, which forms a sliding window with the latest w data items:

$$\{t_{j-w+1},\, t_{j-w+2},\, \ldots,\, t_j\}$$

where w ∈ R⁺ is the length of the sliding window, and the random variable X_m represents the m-th tuple t_{j-w+m} (1 ≤ m ≤ w) in the sliding window;

S102, setting a score threshold τ (τ ∈ R⁺) and a probability threshold δ (δ ∈ (0,1)). With Y denoting the sum of all random variables in the sliding window, the uncertain data stream probability summation threshold query can be expressed as: is the probability Pr(Y > τ) that the random variable Y is greater than τ itself greater than δ, i.e., does the inequality Pr(Y > τ) > δ hold? If the inequality holds, the query result is 'yes'; otherwise, the query result is 'no'.

S103, modeling the random variable X_m with a univariate Gaussian mixture model, that is, representing the uncertain data by continuous random variables. The model comprises k Gaussian variables and corresponding non-negative probabilities (p₁,p₂,…,p_k).

The probability density function of the random variable X is:

$$f_X(x) = \sum_{i=1}^{k} p_i\, \mathcal{N}(x;\, \mu_i, \sigma_i^2)$$

where

$$\mathcal{N}(x;\, \mu_i, \sigma_i^2) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right)$$

and μ_i and σ_i² are the expectation and variance of the i-th Gaussian variable.

Thus, through S101 to S103, all data in each sliding window are represented by Gaussian mixture models, and the Gaussian mixture model serves as the basic model to improve calculation efficiency and provide high flexibility.
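The initialization stage can be illustrated with a minimal Python sketch. The class name `GaussianMixture`, the helper `new_window` and all numeric values are hypothetical and chosen only for illustration; the sketch assumes each tuple has already been fitted to a univariate Gaussian mixture and that the window length w is a positive integer.

```python
import math
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class GaussianMixture:
    """Univariate Gaussian mixture: component weights, means and variances."""
    weights: tuple
    means: tuple
    variances: tuple

    def pdf(self, x: float) -> float:
        # Weighted sum of the k Gaussian component densities.
        return sum(
            p * math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
            for p, mu, var in zip(self.weights, self.means, self.variances)
        )


def new_window(w: int) -> deque:
    # A deque with maxlen=w discards the oldest tuple when a new one is appended,
    # so it always holds the latest w tuples (S101).
    return deque(maxlen=w)


window = new_window(3)
for centre in (58.0, 60.0, 61.0, 63.0):
    # Illustrative two-component mixtures; the numbers are made up.
    window.append(GaussianMixture((0.5, 0.5), (centre - 1.0, centre + 1.0), (1.0, 1.0)))
```

After the fourth append the window holds only the three most recent mixtures, which corresponds to the window sliding forward by one tuple.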

Fast judging stage based on filtering strategy

The fast judgment stage based on the filtering strategy performs up to two filtering judgments based on the first moment and first-order variance, and the second moment and second-order variance, of the sum of the random variables in the sliding window. If the first filtering judgment, made according to the first moment and the first-order variance, yields a query result, the result is output and the method returns to the initialization stage to acquire new uncertain data. Otherwise, the second filtering judgment is made according to the second moment and the second-order variance. If it yields a query result, the result is output and the method returns to the initialization stage; if not, the method enters the exact calculation stage based on the sliding window model. The stage specifically comprises the following steps:

S201, calculating the first moment, the second moment, the first-order variance and the second-order variance of the sum of all random variables in the sliding window according to the expectations and variances of the random variables;

S201 specifically includes the following steps:

S2011, calculating the expectation E(X_m) and the variance Var(X_m) of the random variable X_m;

Specifically, E(X_m) and Var(X_m) are calculated from the expectations μ_i and variances σ_i² of the Gaussian components of X_m as follows:

$$E(X_m) = \sum_{i=1}^{k} p_i \mu_i$$

$$\mathrm{Var}(X_m) = \sum_{i=1}^{k} p_i \left(\sigma_i^2 + \mu_i^2\right) - \left(E(X_m)\right)^2$$

S2012, calculating the first moment E(Y) and the second moment E(Y²) of the sum Y = X₁ + X₂ + … + X_w of all random variables in the sliding window.

Specifically, E(Y) and E(Y²) are calculated from the expectations E(X_m) and variances Var(X_m), with the random variables treated as mutually independent, as follows:

$$E(Y) = \sum_{m=1}^{w} E(X_m)$$

$$E(Y^2) = \sum_{m=1}^{w} \mathrm{Var}(X_m) + \left(E(Y)\right)^2$$

S2013, calculating the variance Var(Y) of the sum Y of all random variables in the sliding window.

Specifically, Var(Y) is calculated from the first moment E(Y) and the second moment E(Y²) as follows:

Var(Y)＝E(Y²)-(E(Y))²(7)

S2014, calculating the fourth moment E(Y⁴) and the second-order variance Var(Y²) of the sum Y of all random variables in the sliding window.

Specifically, E(Y⁴) and Var(Y²) are calculated from the first moment E(Y), the second moment E(Y²) and the first-order variance Var(Y). The fourth moment uses the normal-distribution identity E(Y⁴) = μ⁴ + 6μ²σ² + 3σ⁴ with μ = E(Y) and σ² = Var(Y):

E(Y⁴)＝(E(Y))⁴+6(E(Y))²Var(Y)+3(Var(Y))²(8)

Var(Y²)＝E(Y⁴)-(E(Y²))²(9)
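Steps S2011 to S2014 can be sketched in Python as follows. The function names and the tuple layout `(weights, means, variances)` are hypothetical. The sketch assumes that the random variables in the window are mutually independent, so that their variances add, and it uses the normal-distribution fourth moment μ⁴ + 6μ²σ² + 3σ⁴ with μ = E(Y) and σ² = Var(Y).

```python
def gmm_mean_var(weights, means, variances):
    """Expectation and variance of one Gaussian mixture (S2011)."""
    mean = sum(p * mu for p, mu in zip(weights, means))
    second = sum(p * (var + mu * mu) for p, mu, var in zip(weights, means, variances))
    return mean, second - mean * mean


def window_moments(window):
    """Moments of the window sum Y; window is a list of (weights, means, variances)."""
    stats = [gmm_mean_var(*x) for x in window]
    e_y = sum(m for m, _ in stats)                 # first moment (S2012)
    var_y = sum(v for _, v in stats)               # variances add under independence
    e_y2 = var_y + e_y ** 2                        # second moment, consistent with (7)
    e_y4 = e_y ** 4 + 6 * e_y ** 2 * var_y + 3 * var_y ** 2   # normal identity (S2014)
    var_y2 = e_y4 - e_y2 ** 2                      # second-order variance, as in (9)
    return e_y, e_y2, var_y, e_y4, var_y2
```

For two single-component tuples N(1, 1) and N(2, 3), the sum has expectation 3 and variance 4, and the function returns a second-order variance of 176, which matches the closed form 4μ²σ² + 2σ⁴ for a normal variable.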

To reduce the amount of calculation, the moments and the two variances can be computed incrementally from the results of the previous sliding window. For the new sliding window, the moments and variances of the sum Y′ = X_{j-w+2} + X_{j-w+3} + … + X_{j+1} can be calculated by the following formulas, where the term E(Y²)-(E(Y))² in (11) is the variance Var(Y) of the previous window's sum:

E(Y′)＝E(Y)-E(X_{j-w+1})+E(X_{j+1})(10)

E(Y′²)＝E(Y²)-(E(Y))²-Var(X_{j-w+1})+Var(X_{j+1})+(E(Y′))²(11)

Var(Y′)＝E(Y′²)-(E(Y′))²(12)

E(Y′⁴)＝(E(Y′))⁴+6(E(Y′))²Var(Y′)+3(Var(Y′))²(13)

Var(Y′²)＝E(Y′⁴)-(E(Y′²))²(14)
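The incremental update can be sketched as below. The function name `slide_moments` is hypothetical, and the sketch assumes independent random variables, so that the variance of the window sum changes by exactly the variance of the leaving and entering variables; the second moment is then rebuilt from the updated variance and expectation.

```python
def slide_moments(e_y, var_y, leaving, entering):
    """Update the statistics of the window sum when the window slides by one tuple.

    leaving and entering are (expectation, variance) pairs of single random variables.
    """
    e_new = e_y - leaving[0] + entering[0]          # updated first moment
    var_new = var_y - leaving[1] + entering[1]      # variances add under independence
    e2_new = var_new + e_new ** 2                   # updated second moment
    e4_new = e_new ** 4 + 6 * e_new ** 2 * var_new + 3 * var_new ** 2
    var2_new = e4_new - e2_new ** 2                 # updated second-order variance
    return e_new, e2_new, var_new, e4_new, var2_new
```

Only the expectation and variance of the previous sum need to be kept between windows, so each slide costs a constant number of arithmetic operations regardless of the window length.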

S202, performing the first filtering to judge the query result according to the first moment E(Y) and the first-order variance Var(Y) of the sum of all random variables in the sliding window and their relationship to the score threshold and the probability threshold;

S202 specifically includes the following steps:

S2021, if τ > E(Y) and δ > 0.5, the query result can be output: the result is 'no', and the method jumps to S101 of the initialization stage;

Because τ > E(Y), the probability Pr(Y > τ) that the random variable Y is greater than τ is less than Pr(Y ≥ E(Y)), and Pr(Y ≥ E(Y)) = 0.5, so the value of Pr(Y > τ) must be less than 0.5. The inequality Pr(Y > τ) > δ therefore cannot hold, and the output query result is 'no'.

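S2022, if τ > E(Y) and δ ≤ 0.5, then when the condition

$$\frac{\mathrm{Var}(Y)}{\mathrm{Var}(Y) + (\tau - E(Y))^2} \le \delta$$

is satisfied, the query result can be output: the result is 'no', and the method jumps to S101 of the initialization stage;

According to the one-sided Chebyshev inequality:

$$\Pr(Y > \tau) \le \frac{\mathrm{Var}(Y)}{\mathrm{Var}(Y) + (\tau - E(Y))^2}$$

so when the above condition is satisfied, Pr(Y > τ) ≤ δ, the inequality Pr(Y > τ) > δ does not hold, and the output query result is 'no'.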

S2023, if τ ≤ E(Y) and δ < 0.5, the query result can be output: the result is 'yes', and the method jumps to S101 of the initialization stage;

Since Pr(Y > τ) ≥ Pr(Y ≥ E(Y)) = 0.5, the value of Pr(Y > τ) must be 0.5 or more, which is greater than δ.

S2024, if τ ≤ E(Y) and δ ≥ 0.5, then when the condition

$$\frac{(E(Y) - \tau)^2}{\mathrm{Var}(Y) + (E(Y) - \tau)^2} > \delta$$

is satisfied, the query result can be output: the result is 'yes', and the method jumps to S101 of the initialization stage;

According to the one-sided Chebyshev inequality:

$$\Pr(Y > \tau) \ge \frac{(E(Y) - \tau)^2}{\mathrm{Var}(Y) + (E(Y) - \tau)^2}$$

so when the above condition is satisfied, Pr(Y > τ) > δ holds.
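The four cases of the first filter can be sketched as one function. The name `first_filter` and the three-valued return convention are hypothetical. The bounds used are the standard one-sided Chebyshev (Cantelli) bounds, which is an assumption about the exact form of the conditions in S2022 and S2024; the 0.5 rules follow the text's statement that Pr(Y ≥ E(Y)) = 0.5.

```python
def first_filter(e_y, var_y, tau, delta):
    """Return True ('yes'), False ('no'), or None when the filter cannot decide."""
    if tau > e_y:
        if delta > 0.5:
            return False                               # S2021: Pr(Y > tau) < 0.5 < delta
        # S2022: one-sided Chebyshev upper bound on Pr(Y > tau).
        upper = var_y / (var_y + (tau - e_y) ** 2)
        return False if upper <= delta else None
    if delta < 0.5:
        return True                                    # S2023: Pr(Y > tau) >= 0.5 > delta
    gap_sq = (e_y - tau) ** 2
    if var_y + gap_sq == 0.0:
        return None                                    # degenerate case: no usable bound
    # S2024: one-sided Chebyshev lower bound on Pr(Y > tau).
    lower = gap_sq / (var_y + gap_sq)
    return True if lower > delta else None
```

A `None` result means that neither bound is tight enough and the window has to be passed on to the second filter.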

S203, when the first filtering cannot output a query result, performing the second filtering to judge the query result according to the second moment E(Y²) and the second-order variance Var(Y²) of the sum of all random variables in the sliding window and their relationship to the score threshold and the probability threshold;

S203 specifically includes the following steps:

S2031, if τ² > E(Y²) and δ > 0.5, the query result can be output: the result is 'no', and the method jumps to S101 of the initialization stage;

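S2032, if τ² > E(Y²) and δ ≤ 0.5, then when the condition

$$\frac{\mathrm{Var}(Y^2)}{\mathrm{Var}(Y^2) + (\tau^2 - E(Y^2))^2} \le \delta$$

is satisfied, the query result can be output: the result is 'no', and the method jumps to S101 of the initialization stage;

Pr(Y > τ) is equivalently converted into Pr(Y² > τ²). According to the one-sided Chebyshev inequality:

$$\Pr(Y^2 > \tau^2) \le \frac{\mathrm{Var}(Y^2)}{\mathrm{Var}(Y^2) + (\tau^2 - E(Y^2))^2}$$

so when the above condition is satisfied, Pr(Y > τ) > δ does not hold.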

S2033, if τ² ≤ E(Y²) and δ < 0.5, the query result can be output: the result is 'yes', and the method jumps to S101 of the initialization stage;

S2034, if τ² ≤ E(Y²) and δ ≥ 0.5, then when the condition

$$\frac{(E(Y^2) - \tau^2)^2}{\mathrm{Var}(Y^2) + (E(Y^2) - \tau^2)^2} > \delta$$

is satisfied, the query result can be output: the result is 'yes', and the method jumps to S101 of the initialization stage;

According to the one-sided Chebyshev inequality:

$$\Pr(Y^2 > \tau^2) \ge \frac{(E(Y^2) - \tau^2)^2}{\mathrm{Var}(Y^2) + (E(Y^2) - \tau^2)^2}$$

so when the above condition is satisfied, Pr(Y > τ) > δ holds.
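The second filter applies the same kind of test to Y² and τ². The sketch below makes this explicit with a generic helper; the names `cantelli_decide` and `second_filter` are hypothetical, the bounds are the standard one-sided Chebyshev bounds (an assumption about the exact conditions), and the conversion of Pr(Y > τ) into Pr(Y² > τ²) is taken from the text and presumes a non-negative sum Y.

```python
def cantelli_decide(mean, var, threshold, delta):
    """Decide Pr(Z > threshold) > delta from mean and variance of Z, if possible."""
    if threshold > mean:
        if delta > 0.5:
            return False
        upper = var / (var + (threshold - mean) ** 2)   # upper bound on Pr(Z > threshold)
        return False if upper <= delta else None
    if delta < 0.5:
        return True
    gap_sq = (mean - threshold) ** 2
    if var + gap_sq == 0.0:
        return None
    lower = gap_sq / (var + gap_sq)                     # lower bound on Pr(Z > threshold)
    return True if lower > delta else None


def second_filter(e_y2, var_y2, tau, delta):
    # Same test, applied to Z = Y^2 with the squared score threshold (S2031 to S2034).
    return cantelli_decide(e_y2, var_y2, tau ** 2, delta)
```

As an illustration with made-up numbers, take E(Y) = 100, Var(Y) = 4, τ = 104 and δ = 0.198. The first-moment bound is 0.2 and cannot decide, whereas with E(Y²) = 10004 and Var(Y²) = 160032 the second bound is about 0.195, so the second filter returns 'no'.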

Accurate calculation phase based on sliding window model

The exact calculation stage based on the sliding window model converts the random variables in the sliding window into characteristic functions, performs probability summation based on the characteristic functions, judges whether the query result is 'yes' or 'no' by comparing the probability that the sum exceeds the score threshold τ (τ ∈ R⁺) with the probability threshold δ, and outputs the query result. The specific process is as follows:

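S301, expressing each random variable X_m by its characteristic function;

The random variable X_m is modeled as a Gaussian mixture model consisting of k Gaussian components with expectations (μ₁,μ₂,…,μ_k), variances (σ₁²,σ₂²,…,σ_k²) and corresponding probabilities (p₁,p₂,…,p_k). The characteristic function of X_m is then expressed as follows:

$$\varphi_{X_m}(t) = \sum_{i=1}^{k} p_i\, \varphi_i(t) \tag{15}$$

where

$$\varphi_i(t) = \exp\left(\mathrm{i}\mu_i t - \tfrac{1}{2}\sigma_i^2 t^2\right)$$

is the characteristic function of the i-th Gaussian component and the upright i in the exponent denotes the imaginary unit.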

S302, expressing the sum Y of the random variables of all uncertain data in the sliding window by a characteristic function;

The sum Y is the sum of the w random variables (X₁,X₂,…,X_w), that is, Y = X₁ + X₂ + … + X_w. The characteristic function of Y is then expressed as follows:

$$\varphi_Y(t) = \prod_{m=1}^{w} \varphi_{X_m}(t) \tag{16}$$

As can be seen from equation (16), for a linear combination of several random variables, computation with characteristic functions is very efficient, whereas using probability density functions requires multiple integrations, which consumes a large amount of computing resources.
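The characteristic functions of S301 and S302 can be sketched with Python's `cmath` module. The function names and the tuple layout `(weights, means, variances)` are hypothetical, and the product form assumes the random variables in the window are mutually independent.

```python
import cmath


def gmm_cf(t, weights, means, variances):
    """Characteristic function of one Gaussian mixture evaluated at t (S301)."""
    return sum(
        p * cmath.exp(1j * mu * t - 0.5 * var * t * t)
        for p, mu, var in zip(weights, means, variances)
    )


def sum_cf(t, window):
    """Characteristic function of the window sum: product of the tuples' CFs (S302)."""
    result = 1.0 + 0.0j
    for x in window:
        result *= gmm_cf(t, *x)
    return result
```

For the two single-component tuples N(1, 1) and N(2, 3), `sum_cf` agrees with the characteristic function of N(3, 4), so the product replaces the convolution integral that the density-based computation would need.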

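S303, for the random variables in the current sliding window, incrementally updating the characteristic function of their sum based on the old sliding window and the old summation result.

For the data in the current sliding window, write φ_old(t) for the characteristic function of the old sliding window. When the new tuple t_j is processed, the new result φ_add(t) can be calculated incrementally as follows:

$$\varphi_{\mathrm{add}}(t) = \varphi_{\mathrm{old}}(t)\, \varphi_{X_j}(t)$$

At the same time, the old tuple t_{j-w} is culled, and the new result φ_new(t) can be calculated incrementally as follows:

$$\varphi_{\mathrm{new}}(t) = \frac{\varphi_{\mathrm{add}}(t)}{\varphi_{X_{j-w}}(t)}$$

where φ_{X_j}(t) and φ_{X_{j-w}}(t) are the characteristic functions of the random variables of the tuples t_j and t_{j-w}.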
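One way to picture the incremental update is to keep the value of the sum's characteristic function on a fixed grid of t values, multiply when a tuple enters and divide when a tuple leaves. This is a sketch under stated assumptions, not the representation used in the text: the class `SlidingCF`, the grid and the tuple layout are hypothetical, the variables are assumed independent, and division is only valid at grid points where the leaving tuple's characteristic function is non-zero.

```python
import cmath


def gmm_cf(t, weights, means, variances):
    """Characteristic function of one Gaussian mixture evaluated at t."""
    return sum(
        p * cmath.exp(1j * mu * t - 0.5 * var * t * t)
        for p, mu, var in zip(weights, means, variances)
    )


class SlidingCF:
    """Characteristic function of the window sum, tracked on a fixed grid of t values."""

    def __init__(self, t_grid):
        self.t_grid = list(t_grid)
        self.values = [1.0 + 0.0j] * len(self.t_grid)   # CF of an empty sum is 1

    def add(self, gmm):
        # A tuple entering the window multiplies the CF of the sum by its own CF.
        self.values = [v * gmm_cf(t, *gmm) for v, t in zip(self.values, self.t_grid)]

    def remove(self, gmm):
        # A tuple leaving the window divides it out again (requires a non-zero CF).
        self.values = [v / gmm_cf(t, *gmm) for v, t in zip(self.values, self.t_grid)]
```

Each slide then costs one multiplication and one division per grid point instead of a full product over all w tuples.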

S304, calculating the probability Pr(Y > τ) that the sum exceeds the score threshold τ from the characteristic function of the probability summation; if Pr(Y > τ) > δ, the output query result is 'yes', otherwise 'no'. The query process for the current sliding window then ends, and the method jumps to S101 of the initialization stage.

The characteristic function of the current sliding window can be expressed as a set of Gaussian components Φ_c. With p_c denoting the probability (weight) of component c, there is:

$$\Pr(Y > \tau) = \sum_{c \in \Phi_c} p_c \left(1 - F_c(\tau)\right)$$

where F_c(τ) is the cumulative distribution function of the Gaussian distribution c evaluated at τ.
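The exact value can be sketched by expanding the sum of independent Gaussian mixtures into its Gaussian components: each choice of one component per tuple gives a Gaussian whose mean and variance are the sums of the chosen means and variances, weighted by the product of the chosen probabilities. The function names below are hypothetical, and this direct expansion enumerates k^w components, so it is meant only as an illustration for small windows.

```python
from itertools import product
from math import erf, prod, sqrt


def normal_cdf(x, mu, var):
    """Cumulative distribution function of N(mu, var) at x."""
    return 0.5 * (1.0 + erf((x - mu) / sqrt(2.0 * var)))


def prob_sum_exceeds(window, tau):
    """Pr(Y > tau) for Y = sum of independent mixtures (weights, means, variances)."""
    per_tuple = [list(zip(*x)) for x in window]      # [(p, mu, var), ...] per tuple
    total = 0.0
    for combo in product(*per_tuple):                # one component chosen per tuple
        p = prod(c[0] for c in combo)
        mu = sum(c[1] for c in combo)
        var = sum(c[2] for c in combo)
        total += p * (1.0 - normal_cdf(tau, mu, var))
    return total


def threshold_query(window, tau, delta):
    # Final decision of the exact stage: 'yes' exactly when Pr(Y > tau) > delta.
    return prob_sum_exceeds(window, tau) > delta
```

For the tuples N(1, 1) and N(2, 3) the sum is N(3, 4), so the probability of exceeding τ = 3 is 0.5 and a query with δ = 0.4 returns 'yes'.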

In the method for querying the probability summation threshold of the uncertain data stream, the uncertain data is modeled into a Gaussian mixture model, so that the method is more flexible and efficient; meanwhile, a pruning strategy based on the properties of a Gaussian mixture model and a probability theory is designed, unnecessary calculation is reduced, in addition, a characteristic function is introduced in an accurate calculation stage, the complexity of an algorithm is reduced, and meanwhile, the calculation efficiency is further improved by utilizing incremental processing.

The above-mentioned embodiments are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only the most preferred embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions, equivalents, etc. made within the scope of the principles of the present invention should be included in the scope of the present invention.

Claims

1. A method for querying a probability summation threshold of an uncertain data stream, the method comprising the steps of:

2. The uncertain data stream probability summation threshold query method according to claim 1, wherein in step (1), a new j-th uncertain data item t_j in the uncertain data stream is acquired, which forms a sliding window with the latest w data items, and a random variable X_m represents the m-th tuple t_{j-w+m} (1 ≤ m ≤ w) in the sliding window;

the random variable X_m is modeled with a univariate Gaussian mixture model, that is, the uncertain data are represented by continuous random variables, the model comprising k Gaussian variables and corresponding non-negative probabilities (p₁,p₂,…,p_k);

the probability density function of the random variable X is:

$$f_X(x) = \sum_{i=1}^{k} p_i\, \mathcal{N}(x;\, \mu_i, \sigma_i^2)$$

where

$$\mathcal{N}(x;\, \mu_i, \sigma_i^2) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right)$$

and μ_i and σ_i² are the expectation and variance of the i-th Gaussian variable.

3. The uncertain data stream probability summation threshold query method according to claim 1, wherein the specific process of step (2) is as follows:

(2-1) calculating the first moment, the second moment, the first-order variance and the second-order variance of the sum of all random variables in the sliding window according to the expectations and variances of the random variables;

(2-2) performing the first filtering to judge the query result according to the first moment and the first-order variance of the sum of all random variables in the sliding window and their relationship to the score threshold and the probability threshold;

(2-3) when the query result cannot be output, performing the second filtering to judge the query result according to the second moment and the second-order variance of the sum of all random variables in the sliding window and their relationship to the score threshold and the probability threshold.

4. The uncertain data stream probability summation threshold query method according to claim 3, wherein step (2-1) specifically comprises the following steps:

(2-1-1) calculating the expectation E(X_m) and the variance Var(X_m) of the random variable X_m from the expectations μ_i and variances σ_i² of its Gaussian components:

$$E(X_m) = \sum_{i=1}^{k} p_i \mu_i$$

$$\mathrm{Var}(X_m) = \sum_{i=1}^{k} p_i \left(\sigma_i^2 + \mu_i^2\right) - \left(E(X_m)\right)^2$$

(2-1-2) calculating the first moment E(Y) and the second moment E(Y²) of the sum Y of all random variables in the sliding window:

$$E(Y) = \sum_{m=1}^{w} E(X_m)$$

$$E(Y^2) = \sum_{m=1}^{w} \mathrm{Var}(X_m) + \left(E(Y)\right)^2$$

(2-1-3) calculating the variance Var(Y) of the sum Y of all random variables in the sliding window:

Var(Y)＝E(Y²)-(E(Y))²

(2-1-4) calculating the fourth moment E(Y⁴) and the second-order variance Var(Y²) of the sum Y of all random variables in the sliding window:

E(Y⁴)＝(E(Y))⁴+6(E(Y))²Var(Y)+3(Var(Y))²

Var(Y²)＝E(Y⁴)-(E(Y²))²

for the new sliding window, the moments and the two variances of the sum Y′ = X_{j-w+2} + X_{j-w+3} + … + X_{j+1} can be calculated by the following formulas:

E(Y′)＝E(Y)-E(X_{j-w+1})+E(X_{j+1})

E(Y′²)＝E(Y²)-(E(Y))²-Var(X_{j-w+1})+Var(X_{j+1})+(E(Y′))²

Var(Y′)＝E(Y′²)-(E(Y′))²

E(Y′⁴)＝(E(Y′))⁴+6(E(Y′))²Var(Y′)+3(Var(Y′))²

Var(Y′²)＝E(Y′⁴)-(E(Y′²))²。

5. The uncertain data stream probability summation threshold query method according to claim 3, wherein step (2-2) specifically comprises the following steps:

(2-2-1) if τ > E(Y) and δ > 0.5, the query result can be output: the result is 'no', and the method jumps to step (1);

(2-2-2) if τ > E(Y) and δ ≤ 0.5, then when the condition

$$\frac{\mathrm{Var}(Y)}{\mathrm{Var}(Y) + (\tau - E(Y))^2} \le \delta$$

is satisfied, the query result can be output: the result is 'no', and the method jumps to step (1);

(2-2-3) if τ ≤ E(Y) and δ < 0.5, the query result can be output: the result is 'yes', and the method jumps to step (1);

(2-2-4) if τ ≤ E(Y) and δ ≥ 0.5, then when the condition

$$\frac{(E(Y) - \tau)^2}{\mathrm{Var}(Y) + (E(Y) - \tau)^2} > \delta$$

is satisfied, the query result can be output: the result is 'yes', and the method jumps to step (1).

6. The uncertain data stream probability summation threshold query method according to claim 3, wherein step (2-3) specifically comprises the following steps:

(2-3-1) if τ² > E(Y²) and δ > 0.5, the query result can be output: the result is 'no', and the method jumps to step (1);

(2-3-2) if τ² > E(Y²) and δ ≤ 0.5, then when the condition

$$\frac{\mathrm{Var}(Y^2)}{\mathrm{Var}(Y^2) + (\tau^2 - E(Y^2))^2} \le \delta$$

is satisfied, the query result can be output: the result is 'no', and the method jumps to step (1);

(2-3-3) if τ² ≤ E(Y²) and δ < 0.5, the query result can be output: the result is 'yes', and the method jumps to step (1);

(2-3-4) if τ² ≤ E(Y²) and δ ≥ 0.5, then when the condition

$$\frac{(E(Y^2) - \tau^2)^2}{\mathrm{Var}(Y^2) + (E(Y^2) - \tau^2)^2} > \delta$$

is satisfied, the query result can be output: the result is 'yes', and the method jumps to step (1).

7. The uncertain data stream probability summation threshold query method according to claim 1, wherein the specific process of step (3) is as follows:

(3-1) expressing each random variable X_m by its characteristic function:

$$\varphi_{X_m}(t) = \sum_{i=1}^{k} p_i\, \varphi_i(t)$$

where

$$\varphi_i(t) = \exp\left(\mathrm{i}\mu_i t - \tfrac{1}{2}\sigma_i^2 t^2\right)$$

and the upright i in the exponent denotes the imaginary unit;

(3-2) expressing the sum Y of the random variables of all uncertain data in the sliding window by a characteristic function, the characteristic function of the sum Y being:

$$\varphi_Y(t) = \prod_{m=1}^{w} \varphi_{X_m}(t)$$

(3-3) for the random variables in the current sliding window, incrementally updating the characteristic function of their sum based on the old sliding window and the old summation result;

for the data in the current sliding window, with φ_old(t) denoting the characteristic function of the old sliding window, when the new tuple t_j is processed, the new result φ_add(t) can be calculated incrementally as follows:

$$\varphi_{\mathrm{add}}(t) = \varphi_{\mathrm{old}}(t)\, \varphi_{X_j}(t)$$

at the same time, the old tuple t_{j-w} is culled, and the new result φ_new(t) can be calculated incrementally as follows:

$$\varphi_{\mathrm{new}}(t) = \frac{\varphi_{\mathrm{add}}(t)}{\varphi_{X_{j-w}}(t)}$$

(3-4) calculating the probability Pr(Y > τ) that the sum exceeds the score threshold τ from the characteristic function of the probability summation; if Pr(Y > τ) > δ, the output query result is 'yes', otherwise 'no'; the query process for the current sliding window then ends, and the method jumps to step (1).