CN103336906B - Sampling Gaussian process regression model for continuous anomaly detection in the acquired data stream of an environmental sensor - Google Patents

Info

Publication number
CN103336906B
Authority
CN
China
Prior art keywords
data
window
prediction
index
data stream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310295975.7A
Other languages
Chinese (zh)
Other versions
CN103336906A (en)
Inventor
刘大同
彭宇
庞景月
罗清华
彭喜元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of technology high tech Development Corporation
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN201310295975.7A
Publication of CN103336906A
Application granted
Publication of CN103336906B
Legal status: Active
Anticipated expiration

Landscapes

  • Complex Calculations (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A sampling GPR method for continuous anomaly detection in the acquired data stream of an environmental sensor, belonging to the technical field of environmental sensor data monitoring. The invention addresses the problem that traditional anomaly detection on environmental sensor data streams is computationally expensive and therefore cannot run in real time. The method is based on a prediction model: the model is built from historical data and yields the mean and confidence interval for the current data element; the current value is compared with the confidence interval and, if it falls outside the interval, it is treated as anomalous. The method needs only a small amount of historical data, executes more efficiently, and does not require labeled training data. It detects anomalies adaptively from the data arriving in real time and thus meets the real-time anomaly detection requirements of environmental sensors. The invention is used to detect continuous anomalous data in the acquired data stream of an environmental sensor.

Description

Sampling Gaussian process regression model for continuous anomaly detection in the acquired data stream of an environmental sensor
Technical field
The present invention relates to a sampling GPR method for continuous anomaly detection in the acquired data stream of an environmental sensor, and belongs to the technical field of environmental sensor data monitoring.
Background art
The widespread use of environmental sensors places higher demands on real-time data analysis. Environmental sensors are distributed in the environment they monitor, and the data they collect are delivered continuously to a database by telemetry in the form of a time series; this continuously generated form of data is the data stream model. Real-time applications of environmental sensor data are now receiving wide attention. However, environmental sensors are generally deployed in harsh environments and their data are transmitted over communication networks, so the data can be affected by the environment during transmission and are easily corrupted, and undetected errors considerably affect real-time analysis of the data values. The NSF (National Science Foundation) has therefore set explicit requirements for data quality assurance and control.
Anomaly detection identifies data patterns that deviate substantially from historical patterns. In environmental sensors, anomalous data arise in two ways: from errors caused by the sensor itself or by data transmission, or from rarely occurring abnormal system behavior. Such anomalous data need to be removed in order to avoid system failures. Anomaly detection for environmental sensors mostly has to run in real time, so the detection method must be fast enough to keep up with real-time data acquisition.
Traditional anomaly detection relies on graphical data tools to identify anomalies manually, but a manual approach cannot meet real-time requirements in data stream applications, because it is hard to sustain 24 hours a day, seven days a week. More recently, researchers have studied statistical and machine learning methods such as minimum volume ellipsoid, convex peeling, nearest-neighbor clustering, neural network classifiers, support vector machine classifiers, and decision trees. These methods are more efficient than manual methods, but their shortcomings make them unsuitable for real-time data stream anomaly detection. Minimum volume ellipsoid and convex peeling require all the data to be stored before detection can be carried out; nearest-neighbor clustering and support vector machines are computationally expensive on large data sets; and neural network classifiers, support vector machine classifiers, and decision trees require supervised learning. Because environmental sensors collect data continuously in real time, methods that operate on the whole data set fail.
Summary of the invention
The object of the present invention is to solve the problem that traditional anomaly detection on environmental sensor data streams is computationally expensive and cannot be carried out in real time, by providing a sampling Gaussian process regression model for continuous anomaly detection in the acquired data stream of an environmental sensor.
The sampling GPR method for continuous anomaly detection in the acquired data stream of an environmental sensor according to the invention comprises the following steps:
Step 1: set the sliding window size of the environmental sensor sensing data to N and the sampling ratio to B:1; sample the first N*B data of the data stream in the sliding window as offline data, take the N*B data obtained as the initial prediction window data, and form the prediction window $D_t$ from the initial prediction window data;
Step 2: take the index of the next data element, adjacent to the current time in the environmental sensor sensing data stream, as the input of the prediction window $D_t$; the prediction window $D_t$ outputs the predicted mean of the next data element in the sensing data stream, together with the variance corresponding to that predicted mean;
Step 3: from the predicted mean of the next data element output by the prediction window $D_t$ and the corresponding variance, determine the 95% confidence interval in which the next data element should fall when it is normal;
Step 4: when the next data element arrives, compare it with the range determined by the confidence interval; if it lies outside that range, regard the data element as anomalous, store the anomalous data element and its index, and return to step 2; otherwise execute step 5;
Step 5: use the UBCS algorithm to determine whether the next data element is added to the prediction window $D_t$; if it is added, store the data element in $D_t$ and delete the data element with the smallest index in $D_t$, which completes the update of $D_t$, then return to step 2 and repeat until the sensing data stream ends, then execute step 6; otherwise return directly to step 2 and repeat until the sensing data stream ends, then execute step 6;
Step 6: output all anomalous data identified in step 4, thereby detecting the continuous anomalous data in the acquired data stream of the environmental sensor.
The prediction window is $D_t = \{x_{i-Q}, x_{i-Q+1}, \ldots, x_i\}$, where i denotes the current time, Q is the size of the prediction window $D_t$ with Q = N*B, and x is the prediction window data at the time given by its subscript;
The index of the next data element $x_{i+1}$ is used as the input of the prediction window $D_t$ to obtain the predicted mean of $x_{i+1}$ and the variance q corresponding to that predicted mean;
The 95% confidence interval of the next data element is then determined from this predicted mean and variance.
In step 5, the specific method by which the UBCS algorithm determines whether the next data element is added to the prediction window $D_t$ is:
Step 5-1: based on the sliding window size N of the environmental sensor sensing data, set the sample size to k; each basic window then has size N/k, rounded down. The data element indices in the first basic window are [1, 2, 3, ..., N/k], those in the second basic window are [N/k+1, N/k+2, ..., 2*N/k], ..., and those in the I-th basic window are [(I-1)*N/k+1, ..., I*N/k];
Step 5-2: randomly select the index of a next data element from the prediction window $D_t$ as the representative index;
Step 5-3: when the data element corresponding to the representative index arrives, add it to the prediction window $D_t$ as a sample of the uniformly sampled data stream; when the difference between the representative index of the current first basic window and the representative index of the basic window corresponding to the current time exceeds the window size N, delete the element corresponding to the representative index of the current first basic window. In addition, when the number of samples of the environmental sensor sensing data stream exceeds the sample size k, randomly delete a sample from the samples of the uniformly sampled data stream.
In step 5-1, the sliding window size N of the environmental sensor sensing data is 6 and the sample size k is 2, so each basic window has size N/k = 3; the element indices in the first basic window are [1, 2, 3], those in the second basic window are [4, 5, 6], and so on.
The representative index of the next data element chosen in step 5-2 is 2.
In step 5-3, when the data element whose representative index is 2 arrives, it is added to the prediction window $D_t$ as a sample of the uniformly sampled data stream.
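The partition into basic windows described in step 5-1 can be sketched as follows. This is an illustrative Python fragment only; the function name `basic_window_indices` and its `count` parameter are assumptions introduced here, not names from the patent.

```python
def basic_window_indices(n: int, k: int, count: int) -> list[list[int]]:
    """Return the 1-based element indices of the first `count` basic windows.

    n is the sliding window size N, k is the sample size, and each basic
    window holds N/k elements (rounded down), as in step 5-1.
    """
    size = n // k
    return [
        list(range((i - 1) * size + 1, i * size + 1))
        for i in range(1, count + 1)
    ]


# Worked example from the text: N = 6, k = 2 gives basic windows of size 3.
print(basic_window_indices(6, 2, 2))  # [[1, 2, 3], [4, 5, 6]]
```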
Advantages of the invention: the method is based on a prediction model. The model is built from historical data and yields the mean and confidence interval for the current data element; the current value is compared with the confidence interval and, if it falls outside the interval, it is treated as anomalous. The method needs only a small amount of historical data, executes more efficiently, and does not require labeled training data. It detects anomalies adaptively from the data arriving in real time and thus meets the real-time anomaly detection requirements of environmental sensors. The method introduces a GPR prediction model and builds a time-series prediction framework that makes effective use of the fact that GPR outputs carry an expression of uncertainty. It also introduces the sampling GPR method, which incorporates the idea of modeling the raw data through a training subset, and the effectiveness of the sampling GPR method for detecting continuous anomalies in data streams is demonstrated on a simulated data set.
Brief description of the drawings
Fig. 1 is a flow chart of the sampling GPR method for continuous anomaly detection in the acquired data stream of an environmental sensor according to the invention;
Fig. 2 is a flow chart of the GPR prediction model;
Fig. 3 is a plot of the original data stream used for simulation verification of the method;
Fig. 4 is a schematic diagram of an example execution of the UBCS algorithm; in the figure, the arrow at A indicates the direction in which new data arrive, the number to the right of each data element is its element index, the numbers in the sample data are indices, D indicates deletion of expired data, and C indicates deletion because the number of samples exceeds the sample size;
Fig. 5 is a flow chart of the UBCS algorithm.
Detailed description of the embodiments
Embodiment 1: this embodiment is described below with reference to Fig. 1. The sampling GPR method for continuous anomaly detection in the acquired data stream of an environmental sensor according to this embodiment comprises the following steps:
Step 1: set the sliding window size of the environmental sensor sensing data to N and the sampling ratio to B:1; sample the first N*B data of the data stream in the sliding window as offline data, take the N*B data obtained as the initial prediction window data, and form the prediction window $D_t$ from the initial prediction window data;
Step 2: take the index of the next data element, adjacent to the current time in the environmental sensor sensing data stream, as the input of the prediction window $D_t$; the prediction window $D_t$ outputs the predicted mean of the next data element in the sensing data stream, together with the variance corresponding to that predicted mean;
Step 3: from the predicted mean of the next data element output by the prediction window $D_t$ and the corresponding variance, determine the 95% confidence interval in which the next data element should fall when it is normal;
Step 4: when the next data element arrives, compare it with the range determined by the confidence interval; if it lies outside that range, regard the data element as anomalous, store the anomalous data element and its index, and return to step 2; otherwise execute step 5;
Step 5: use the UBCS algorithm to determine whether the next data element is added to the prediction window $D_t$; if it is added, store the data element in $D_t$ and delete the data element with the smallest index in $D_t$, which completes the update of $D_t$, then return to step 2 and repeat until the sensing data stream ends, then execute step 6; otherwise return directly to step 2 and repeat until the sensing data stream ends, then execute step 6;
Step 6: output all anomalous data identified in step 4, thereby detecting the continuous anomalous data in the acquired data stream of the environmental sensor.
Embodiment 2: this embodiment further describes embodiment 1. In this embodiment the prediction window is $D_t = \{x_{i-Q}, x_{i-Q+1}, \ldots, x_i\}$, where i denotes the current time, Q is the size of the prediction window $D_t$ with Q = N*B, and x is the prediction window data at the time given by its subscript;
The index of the next data element $x_{i+1}$ is used as the input of the prediction window $D_t$ to obtain the predicted mean of $x_{i+1}$ and the variance q corresponding to that predicted mean;
The 95% confidence interval of the next data element is then determined from this predicted mean and variance.
Embodiment 3: this embodiment is described below with reference to Fig. 5 and further describes embodiment 2. In step 5 of this embodiment, the specific method by which the UBCS algorithm determines whether the next data element is added to the prediction window $D_t$ is:
Step 5-1: based on the sliding window size N of the environmental sensor sensing data, set the sample size to k; each basic window then has size N/k, rounded down. The data element indices in the first basic window are [1, 2, 3, ..., N/k], those in the second basic window are [N/k+1, N/k+2, ..., 2*N/k], ..., and those in the I-th basic window are [(I-1)*N/k+1, ..., I*N/k];
Step 5-2: randomly select the index of a next data element from the prediction window $D_t$ as the representative index;
Step 5-3: when the data element corresponding to the representative index arrives, add it to the prediction window $D_t$ as a sample of the uniformly sampled data stream; when the difference between the representative index of the current first basic window and the representative index of the basic window corresponding to the current time exceeds the window size N, delete the element corresponding to the representative index of the current first basic window. In addition, when the number of samples of the environmental sensor sensing data stream exceeds the sample size k, randomly delete a sample from the samples of the uniformly sampled data stream.
Embodiment 4: this embodiment further describes embodiment 3. In step 5-1 of this embodiment, the sliding window size N of the environmental sensor sensing data is 6 and the sample size k is 2, so each basic window has size N/k = 3; the element indices in the first basic window are [1, 2, 3], those in the second basic window are [4, 5, 6], and so on.
Embodiment 5: this embodiment is described below with reference to Fig. 4 and further describes embodiment 4. The representative index of the next data element chosen in step 5-2 of this embodiment is 2.
Embodiment 6: this embodiment is described below with reference to Figs. 1 to 5 and further describes embodiment 5. In step 5-3 of this embodiment, when the data element whose representative index is 2 arrives, it is added to the prediction window $D_t$ as a sample of the uniformly sampled data stream.
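One possible reading of steps 5-1 to 5-3 is sketched below in Python. It is a simplified illustration under stated assumptions, not the UBCS algorithm as published: it assumes one representative index is drawn uniformly at random inside each basic window, that a sample expires once it lies more than N indices behind the current element, and that one randomly chosen sample is dropped when more than k samples are held. The class name `UBCSSampler` and the method `offer` are hypothetical.

```python
import random


class UBCSSampler:
    """Simplified sampler following one reading of steps 5-1 to 5-3."""

    def __init__(self, n: int, k: int, seed=None):
        if not 0 < k <= n:
            raise ValueError("expected 0 < k <= n")
        self.n = n                  # sliding window size N
        self.k = k                  # sample size k
        self.size = n // k          # basic window size, rounded down
        self.rng = random.Random(seed)
        self.samples = []           # (index, value) pairs, oldest first
        self.window_no = 0          # number of the current basic window
        self.rep = None             # representative index of that window

    def offer(self, index: int, value: float) -> bool:
        """Process the element with 1-based `index`; return True if sampled."""
        window_no = (index - 1) // self.size + 1
        if window_no != self.window_no:
            # A new basic window starts: draw its representative index.
            self.window_no = window_no
            first = (window_no - 1) * self.size + 1
            self.rep = self.rng.randint(first, first + self.size - 1)
        # Expire samples that lie more than N indices behind the current one.
        self.samples = [(i, v) for i, v in self.samples if index - i <= self.n]
        taken = index == self.rep
        if taken:
            self.samples.append((index, value))
            if len(self.samples) > self.k:
                # Too many samples: delete one at random.
                self.samples.pop(self.rng.randrange(len(self.samples)))
        return taken


sampler = UBCSSampler(n=6, k=2, seed=0)
picked = [i for i in range(1, 13) if sampler.offer(i, float(i))]
print(len(picked))  # one representative per basic window, here 4
```

With N = 6 and k = 2 as in embodiment 4, each basic window of three elements contributes exactly one sampled element, so at most k samples are held at any time.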
The GPR prediction method used in the method of the invention:
1. Principle of Gaussian process regression prediction:
Gaussian process regression (GPR) is a probabilistic technique for nonlinear regression problems: the prior distribution is constrained by the available training data in order to estimate the posterior distribution. The prior distribution of the GP defines a function space, and the predicted output values under the GP posterior can be computed within the Bayesian framework. For example, the training data set consists of N' training data pairs $(x_{i1}, y_{i1})$, where $x_{i1}$ is a training input, $y_{i1}$ is the corresponding training target, and i1 is the index of the training data. The input matrix $X \in R^{d \times N'}$ is made up of the training inputs, and the prediction data matrix $X_*$ is made up of $N_*$ test inputs; d is the dimension of the input data and R denotes the real numbers. The predicted output vector is $f_*$, and $m$ and $m_*$ denote the mean vectors corresponding to the training data and the test data set, respectively. The predicted output $f_*$ and the training targets y then follow a joint Gaussian distribution, namely
$$\begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} m \\ m_* \end{bmatrix}, \begin{bmatrix} C(X,X) & K(X,X_*) \\ K(X_*,X) & K(X_*,X_*) \end{bmatrix} \right) \tag{1}$$
In formula (1), $C(X,X) = K(X,X) + \delta_{ij} I$ is the covariance matrix of the training data, obtained by substituting the training data into the concrete form of the covariance function; it is N × N, where $\delta_{ij}$ is the variance of the assumed white noise and I is the N × N identity matrix. $K(X,X_*)$ is the covariance between the test data and the training data, obtained by substituting each pair of test and training data into the covariance function; it is $N \times N_*$. $K(X_*,X)$ is the transpose of $K(X,X_*)$, i.e. $K(X_*,X) = K(X,X_*)^T$. $K(X_*,X_*)$ is the covariance matrix of the test data themselves, an $N_* \times N_*$ matrix obtained by substituting the test data. $m$ is the N-dimensional mean vector obtained by substituting the training input matrix X into the concrete mean function, and $m_*$ is the $N_*$-dimensional test mean vector obtained by substituting the test inputs $X_*$ into the same mean function.
Formula (1) can be used to predict the target output $y_*$ because of the following property of Gaussian processes: if x and t are random vectors that follow a joint Gaussian distribution, namely
$$\begin{bmatrix} x \\ t \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} m_x \\ m_t \end{bmatrix}, \begin{bmatrix} A & E \\ E^T & B \end{bmatrix} \right) \tag{2}$$
then the marginal distribution of x, and the conditional distribution of x when t is known, are as shown in formula (3):
$$x \sim \mathcal{N}(m_x, A), \qquad x \mid t \sim \mathcal{N}\left( m_x + E B^{-1} (t - m_t),\; A - E B^{-1} E^T \right) \tag{3}$$
Here A, E, and B are the blocks of the covariance matrix, and the superscript T denotes the transpose of a matrix or vector.
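For the scalar case, formula (3) reduces to simple arithmetic, which the following Python sketch illustrates. The function name `condition_gaussian` is an assumption introduced here; the arguments follow the symbols of formula (2), with a, e, and b standing for the scalar entries A, E, and B.

```python
def condition_gaussian(m_x, m_t, a, e, b, t):
    """Mean and variance of x | t for jointly Gaussian scalars x and t.

    Implements x | t ~ N(m_x + E B^-1 (t - m_t), A - E B^-1 E^T)
    for the case in which A, E and B are plain numbers.
    """
    mean = m_x + e / b * (t - m_t)
    variance = a - e / b * e
    return mean, variance


# Observing t = 2 pulls the mean of x toward t and shrinks its variance.
print(condition_gaussian(0.0, 0.0, 1.0, 0.5, 1.0, 2.0))  # (1.0, 0.75)
```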
From this property of Gaussian processes, combined with formula (1), it follows directly that $f_*$ has the posterior conditional distribution
$$f_* \mid X, y, X_* \sim \mathcal{N}\left( \bar{f}_*, \operatorname{cov}(f_*) \right) \tag{4a}$$
$$\bar{f}_* = E[f_* \mid X, y, X_*] = m_* + K(X_*,X)\, C(X,X)^{-1} (y - m) \tag{4b}$$
$$\operatorname{cov}(f_*) = K(X_*,X_*) - K(X_*,X)\, C(X,X)^{-1} K(X,X_*) \tag{4c}$$
From formulas (4b) and (4c), the output $y_*$ predicted by GPR follows a Gaussian distribution with mean $\bar{f}(x_*)$ and variance $\sigma_f^2(x_*)$, namely
$$\bar{f}(x_*) = m(x_*) + k_*^T C^{-1} (y - m(x)) \tag{5}$$
$$\sigma_f^2(x_*) = k(X_*,X_*) - k_*^T C^{-1} k_* \tag{6}$$
In these formulas, $C^{-1} = C(X,X)^{-1}$, $k_*$ is the vector of covariances between the test input $x_*$ and the training inputs, and y is the vector of observed training values. The confidence interval of the GP model's predicted output is determined from the predicted mean and variance. For example, the 95% confidence interval is $\left[ \bar{f}(x_*) - 2\sqrt{\sigma_f^2(x_*)},\ \bar{f}(x_*) + 2\sqrt{\sigma_f^2(x_*)} \right]$, and a 99% confidence interval is obtained in the same way with a larger multiplier. This shows that, when applied to a prediction problem, the GPR model not only predicts the mean of the test output but also provides the confidence, or uncertainty, of the prediction. In practical applications this makes it possible to account better for noise from the environment, the measured values, and the model, and to give more reliable predictions.
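The interval test that follows from the predicted mean and variance can be sketched in Python as below. The sketch assumes the 95% interval is the predicted mean plus or minus two standard deviations, i.e. two times the square root of the predictive variance; the function names `confidence_interval` and `is_anomalous` and the parameter `z` are hypothetical.

```python
import math


def confidence_interval(mean: float, variance: float, z: float = 2.0):
    """Return (lower, upper) = mean -/+ z * sqrt(variance)."""
    half_width = z * math.sqrt(variance)
    return mean - half_width, mean + half_width


def is_anomalous(value: float, mean: float, variance: float, z: float = 2.0) -> bool:
    """A value is flagged when it falls outside the confidence interval."""
    lower, upper = confidence_interval(mean, variance, z)
    return not (lower <= value <= upper)


print(confidence_interval(10.0, 4.0))  # (6.0, 14.0)
print(is_anomalous(15.0, 10.0, 4.0))   # True
```

A larger `z` widens the interval and so flags fewer values, which is the trade-off behind choosing a 95% or a 99% interval.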
For example, in a univariate linear regression prediction problem, the output is obtained by substituting a given new prediction input into an explicit expression. For GPR, formula (1) is the functional expression used for prediction. Unlike ordinary regression problems, f(x) cannot be written out in parametric or nonparametric form; all that is known is that f(x) is a Gaussian process in which the variables $f(x_1), \ldots, f(x_{N'})$ follow a joint Gaussian distribution. The resulting prediction model is therefore $y \sim GP(m(x), k(x, x_*) + \sigma^2 \delta_{ij})$. Substituting each training point gives the matrix $C(X,X) = K(X,X) + \sigma^2 I$, so the prediction model in matrix form is
$$y \sim \mathcal{N}\left( M(X),\; K(X,X) + \sigma^2 I \right) \tag{7}$$
This can be understood as the relationship between y and x, analogous to y = ax + b in univariate linear regression. Both $m(x)$ and $k(x, x_*) + \sigma^2 \delta_{ij}$ contain unknown parameters, called hyperparameters. For example, with $m(x) = a + bx$ and
$$k(x, x_*) + \sigma^2 \delta_{ij} = \upsilon_0 \exp\left\{ -\frac{1}{2} \sum_{l=1}^{d} \omega_l (x_i - x_j)^2 \right\} + \sigma^2 \delta_{ij},$$
the hyperparameters are $\Theta = [a, b, \upsilon_0, \omega_l, \sigma_n]$. These hyperparameters play the same role as a and b in univariate linear regression and must be determined from the training data.
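For one-dimensional inputs such as time indices (d = 1), the example mean function and covariance function above can be written out as in the following Python sketch. The function names and the parameter names `v0`, `w`, and `noise_var` are assumptions standing for $\upsilon_0$, $\omega_1$, and $\sigma^2$.

```python
import math


def mean_fn(x: float, a: float, b: float) -> float:
    """Linear mean function m(x) = a + b*x."""
    return a + b * x


def kernel(xi: float, xj: float, v0: float, w: float) -> float:
    """Covariance v0 * exp(-0.5 * w * (xi - xj)^2) for scalar inputs."""
    return v0 * math.exp(-0.5 * w * (xi - xj) ** 2)


def covariance_matrix(xs, v0: float, w: float, noise_var: float):
    """Build C(X, X) = K(X, X) + noise_var * I as a list of rows."""
    n = len(xs)
    return [
        [kernel(xs[i], xs[j], v0, w) + (noise_var if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]


# Noise enters only on the diagonal; off-diagonal entries decay with distance.
for row in covariance_matrix([1.0, 2.0, 3.0], v0=1.0, w=1.0, noise_var=0.1):
    print([round(c, 4) for c in row])
```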
2. Prediction steps of GPR
The preceding discussion introduced the prediction principle of the GP model and GPR. This part focuses on the concrete steps for applying the GPR model to a prediction problem, following the prediction steps of conventional prediction models. The block diagram of the GPR model for training and prediction is shown in Fig. 2.
Concrete Gaussian process prediction steps:
Step 1: factor analysis before prediction. That is, assess the correlations between variables and determine the training inputs and training outputs for the prediction.
Step 2: collect training data pairs {x, y} for the independent and dependent variables determined in step 1, and build the regression prediction model. For example, for the training data set {x, y} with y = t(x), x = 1, 2, ..., 100, x is the time index in the time series and y is the target function value of the training data at each time index. The Gaussian process model built is $y \sim GP(m(x), k(x, x_*) + \sigma^2 \delta_{ij})$. Suppose $m(x) = a + bx$ and
$$k(x, x') + \sigma_n^2 \delta_{ij} = k(x, x_*) + \sigma^2 \delta_{ij} = \upsilon_0 \exp\left\{ -\frac{1}{2} \sum_{l=1}^{d} \omega_l (x_i - x_j)^2 \right\} + \sigma^2 \delta_{ij}.$$
The forms of the mean function and the covariance function can be chosen freely, as long as the covariance matrix is guaranteed to be positive semidefinite. The mean function and covariance function now contain unknown parameters, namely the hyperparameters $\Theta = [a, b, \upsilon_0, \omega_l, \sigma_n]$.
Step 3: optimize the parameter values $\Theta = [a, b, \upsilon_0, \omega_l, \sigma_n]$. The Bayesian framework is used here; it is based on evidence maximization, i.e. the hyperparameters are determined by maximizing the log-likelihood function shown below:
$$\theta_{opt} = \arg\max_{\theta} \left\{ \log p(y \mid X, \theta) \right\} = \arg\max_{\theta} \left\{ -\frac{1}{2} \log\det\left( K + \sigma_n^2 I \right) - \frac{1}{2} (y - m)^T \left[ K + \sigma_n^2 I \right]^{-1} (y - m) - \frac{N}{2} \log 2\pi \right\},$$
where det denotes the determinant. The hyperparameters are first initialized to random values. Training data are generally normalized, so the hyperparameters can be initialized to random values drawn from a normal distribution with mean 0 and variance 1. To obtain the optimal hyperparameter vector $\theta_{opt}$, the gradient of the log-likelihood function with respect to $\theta$ is used, namely
$$\frac{\partial}{\partial \theta_k} \log p(y \mid X, \theta) = \frac{1}{2} (y - m)^T C^{-1} \frac{\partial C}{\partial \theta_k} C^{-1} (y - m) - \frac{1}{2} \operatorname{tr}\left( C^{-1} \frac{\partial C}{\partial \theta_k} \right),$$
$$\frac{\partial}{\partial \theta_m} \log p(y \mid X, \theta) = (y - m)^T C^{-1} \frac{\partial m}{\partial \theta_m}$$
where tr denotes the trace of a matrix, $C = K + \sigma_n^2 I$, $\theta_m$ denotes a hyperparameter of the mean function, and $\theta_k$ denotes a hyperparameter of the covariance function (including the noise variance). The conjugate gradient method is used to search for the parameter values at which these gradients are closest to 0; these are the optimal hyperparameter values. The prediction model determined in this way is the optimal prediction model.
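The log-likelihood being maximized can be evaluated as in the following Python sketch, which uses a Cholesky factorization so that no explicit matrix inverse is needed. The sketch assumes scalar inputs and the example mean and covariance functions given earlier. For brevity it replaces the conjugate gradient search described above with a plain search over a list of candidate values for one hyperparameter; that substitution, and all function and parameter names, are assumptions of this illustration.

```python
import math


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][p] * low[j][p] for p in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def log_marginal_likelihood(xs, ys, a, b, v0, w, noise_var):
    """log p(y | X, theta) for mean a + b*x and a squared-exponential kernel."""
    n = len(xs)
    c = [[v0 * math.exp(-0.5 * w * (xi - xj) ** 2) + (noise_var if i == j else 0.0)
          for j, xj in enumerate(xs)] for i, xi in enumerate(xs)]
    low = cholesky(c)
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    # Forward substitution solves L z = resid, so resid^T C^-1 resid = z^T z.
    z = []
    for i in range(n):
        z.append((resid[i] - sum(low[i][p] * z[p] for p in range(i))) / low[i][i])
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    return (-0.5 * log_det - 0.5 * sum(v * v for v in z)
            - 0.5 * n * math.log(2.0 * math.pi))


def select_weight(xs, ys, candidates, v0=1.0, noise_var=0.1):
    """Pick the kernel weight w with the highest log-likelihood (grid search)."""
    return max(candidates,
               key=lambda w: log_marginal_likelihood(xs, ys, 0.0, 0.0, v0, w, noise_var))


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.3, 0.2, 0.4, 0.3]
print(select_weight(xs, ys, [0.01, 0.1, 1.0, 10.0]))
```

A gradient-based optimizer would use the two gradient expressions above instead of enumerating candidates; the objective function is the same.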
Step 4: use the regression model built to obtain the predicted output. In traditional regression prediction methods, the output value is obtained simply by substituting the prediction input $x_*$ into the model. GPR prediction can be understood in the same way: as described above, the observed outputs of the test data and of the training data follow a joint Gaussian distribution, as shown in formula (1).
Once the prediction inputs and the training data are substituted, the right-hand side is fully known. By the property stated above, the mean and variance of the predicted output $y_*$ are given by formulas (5) and (6). This yields the GPR model's predicted output together with its mean and an expression of its uncertainty.
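Formulas (5) and (6) can be evaluated for a single test input as in the Python sketch below. It assumes scalar inputs, the linear mean $m(x) = a + bx$, and the squared-exponential covariance used as the example above, with fixed rather than optimized hyperparameters. The function `gp_predict` and its helper names are hypothetical.

```python
import math


def _kernel(xi, xj, v0, w):
    return v0 * math.exp(-0.5 * w * (xi - xj) ** 2)


def _cholesky(a):
    """Lower-triangular L with L L^T = a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][p] * low[j][p] for p in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def _solve_lower(low, rhs):
    """Forward substitution: solve L z = rhs."""
    z = []
    for i in range(len(rhs)):
        z.append((rhs[i] - sum(low[i][p] * z[p] for p in range(i))) / low[i][i])
    return z


def gp_predict(xs, ys, x_star, a=0.0, b=0.0, v0=1.0, w=1.0, noise_var=0.1):
    """Return the predictive mean (formula 5) and variance (formula 6) at x_star."""
    c = [[_kernel(xi, xj, v0, w) + (noise_var if i == j else 0.0)
          for j, xj in enumerate(xs)] for i, xi in enumerate(xs)]
    low = _cholesky(c)
    k_star = [_kernel(x, x_star, v0, w) for x in xs]
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    # With C = L L^T: k*^T C^-1 r = (L^-1 k*)^T (L^-1 r).
    z = _solve_lower(low, resid)
    v = _solve_lower(low, k_star)
    mean = a + b * x_star + sum(vi * zi for vi, zi in zip(v, z))
    variance = _kernel(x_star, x_star, v0, w) - sum(vi * vi for vi in v)
    return mean, variance


mean, variance = gp_predict([1.0, 2.0, 3.0], [0.2, 0.4, 0.3], x_star=4.0)
print(mean, variance)
```

Far from the training inputs the covariance vector $k_*$ vanishes, so the prediction falls back to the prior mean and the prior variance $\upsilon_0$.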
3. Sampling GPR method:
Point anomalies caused by isolated points in environmental sensor data are, in practice, often due to noise introduced during sensor acquisition or to misoperation. Once detected, they can be corrected using expert knowledge, which mainly serves data preprocessing. What users care about more is continuous anomalies in the data stream.
When continuous anomalies occur in an environmental data stream, the anomalous data come to occupy most of the historical prediction data window, and later predicted values approach the anomalous data, so that subsequent continuous anomalous data are regarded as normal. Therefore, drawing on the idea of training the GP model on a data subset to obtain the optimal prediction model, the UBCS algorithm is introduced and combined with the GPR prediction model for continuous anomaly detection in the data stream. Applied to a data stream, UBCS obtains a sample of a set size using little memory, and its sampling is uniform, so that the samples are evenly distributed over the valid window. Given these advantages, it is introduced into the GPR-based data stream anomaly detection algorithm. The approach is mainly suited to time series data streams whose normal pattern does not fluctuate much in amplitude, where it detects continuous anomalies well. The steps of the anomaly detection framework of the sampling GPR method are as follows:
(1) Collect the training data of the sliding window. In the existing anomaly detection framework, if the historical sliding window size is set to 30, only the first 30 data of the arriving data stream are needed as the prediction window data. In the prediction framework combined with the sampling algorithm, if the sampling ratio is set to 3:1, obtaining a window of 30 data requires the first 90 data of the data stream as offline data, which are sampled to give the 30 data of the initial prediction window.
(2) Train the model on the sliding window $D_t$. After the optimal model is obtained, use the one-step prediction model with the current data stream index as input to obtain the predicted value.
(3) With probability p, compute the upper and lower bounds of the range within which the data stream value at time t+1 fluctuates under normal conditions, i.e. obtain the confidence interval of the GPR model.
(4) When the data element $x_{t+1}$ for time t+1 arrives, compare it with the normal range determined in step (3). If it lies outside the prediction interval for normal data, regard it as anomalous; otherwise it is a normal event.
(5) Use the UBCS algorithm to determine whether the current true value $x_{t+1}$ is added to the prediction window. If it meets the decision condition of the UBCS algorithm, store it; otherwise discard it.
(6) Check whether the amount of data in the current window exceeds the set value; if so, delete the earliest data in the window.
(7) Repeat steps (2) to (6). The anomalies in the data stream can thus be assessed in real time.
The overall framework of GPR anomaly detection combined with the sampling algorithm is shown in Fig. 1.
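The loop formed by steps (1) to (7) can be sketched in Python as follows. To stay short, the sketch makes several simplifying assumptions that are not taken from the patent: the predictor is passed in as a function, and a placeholder that returns the mean and variance of the window stands in for the trained GPR model; the offline data are thinned by taking every `ratio`-th element; and the UBCS decision is replaced by keeping each normal element with probability 1/`ratio`. All names are hypothetical.

```python
import math
import random
from collections import deque


def window_mean_predictor(window, next_index):
    """Placeholder for the GPR model: mean and variance of the window values."""
    values = [v for _, v in window]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


def detect_anomalies(stream, window_size, ratio, predict, z=2.0, seed=0):
    """Return the (index, value) pairs flagged as anomalous (1-based indices)."""
    rng = random.Random(seed)
    n_offline = window_size * ratio
    offline = list(enumerate(stream[:n_offline], start=1))
    # Step (1): sample the offline data down to the initial prediction window.
    window = deque(offline[::ratio], maxlen=window_size)
    anomalies = []
    for index, value in enumerate(stream[n_offline:], start=n_offline + 1):
        mean, variance = predict(window, index)       # step (2)
        half_width = z * math.sqrt(max(variance, 0.0))  # step (3)
        if abs(value - mean) > half_width:            # step (4)
            anomalies.append((index, value))
            continue                                  # anomalies never enter the window
        if rng.random() < 1.0 / ratio:                # step (5), stand-in for UBCS
            window.append((index, value))             # step (6): deque drops the oldest
    return anomalies


stream = [0.0, 1.0] * 15 + [10.0, 10.0, 10.0] + [0.0, 1.0] * 2
print(detect_anomalies(stream, window_size=10, ratio=3, predict=window_mean_predictor))
# [(31, 10.0), (32, 10.0), (33, 10.0)]
```

Because flagged elements are never added to the window, a run of consecutive anomalies cannot drag the prediction toward itself, which is the point of combining sampling with the prediction window.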
4. Simulation verification and evaluation
Performance evaluation metrics:
The possible outcomes of data anomaly detection are shown in Table 1.
Table 1. Possible outcomes of anomaly detection
To verify the usability of the anomaly detection algorithm, FNR and FPR are used as evaluation metrics, defined as follows:
(1) FPR (False Positive Ratio)
Normal data are mistakenly detected as anomalous and rejected; this is called the false alarm rate, FPR = FN/(TP+FN);
(2) FNR (False Negative Ratio)
Anomalous data are detected as normal and accepted; this is called the miss rate, FNR = FP/(FP+TN);
For data stream anomaly detection, the execution efficiency of the algorithm is also an important performance metric, so the running time of the algorithm is used as a further evaluation metric. It is defined as:
t = the time the algorithm takes to perform anomaly detection on the same amount of data.
Data-stream continuous anomaly detection experiment based on the sampling GPR method:
The anomaly detection framework based on historical data detects continuous anomalies in a data stream poorly, so the sampling GPR method proposed above is verified experimentally here.
Continuous anomalies mainly refer to the situation in which the amplitude of the data fluctuates continuously as the data stream arrives over time. In practice, the definition of abnormal data has its own specific physical meaning, so real data are not well suited to measuring the detection rate of the algorithm for single-point anomalies; a simulated data set is therefore used to evaluate the performance of the algorithm. The generation principle is to simulate normal data with values that follow a specific distribution, and to simulate abnormal behavior with consecutive values that do not belong to this distribution and whose amplitude differs considerably from the normal data.
A manually generated, uniformly distributed data set simulates the normal data stream, and runs of consecutive values with a large amplitude deviation simulate the continuous anomalies that may occur in a data stream. Abnormal data account for 6% of the data and are spread fairly evenly over the whole test data set. To reflect the evolving nature of data streams, the evolution of the normal and abnormal patterns is simulated by changing the thresholds of the uniform distribution and the amplitude of the abnormal values. The final form of the simulated data stream is given by the following expression:
The raw data are shown in Figure 3.
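A stream of this general kind could be generated along the following lines. This is only a sketch: the function name `make_stream`, the stream length, the burst length, the uniform range and the anomaly offset are all hypothetical values chosen for illustration, and the pattern evolution described above is not modeled. Only the uniform normal data, the runs of consecutive anomalies and the 6% anomaly share are taken from the text.

```python
import random


def make_stream(n=1000, anomaly_frac=0.06, burst_len=5,
                low=0.0, high=1.0, offset=5.0, seed=0):
    """Return (data, labels); labels[t] is True where data[t] is anomalous.

    All default parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_bursts = round(n * anomaly_frac) // burst_len  # number of anomalous runs
    spacing = n // n_bursts if n_bursts else n + 1   # one run per block of this size
    data, labels = [], []
    for t in range(n):
        # The last burst_len positions of each block form one anomalous run,
        # which spreads the runs evenly over the stream.
        in_burst = t // spacing < n_bursts and t % spacing >= spacing - burst_len
        value = rng.uniform(low, high)               # uniformly distributed normal value
        if in_burst:
            value += offset                          # shifted well outside the normal range
        data.append(value)
        labels.append(in_burst)
    return data, labels
```

The returned labels can serve as ground truth when computing FPR and FNR for a detector run on the generated data.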
The history training window is again set to 30 and the sampling ratio to 1/3, i.e. 1 of every 3 data points is randomly selected as sampled data. This reduces the extent to which abnormal data occupy the historical prediction window when continuous anomalies occur.
The experimental evaluation metrics are FNR, FPR and the running time of the algorithm. The anomaly detection results of the GPR model combined with the sampling algorithm show that the consecutively occurring anomalies lie clearly outside the confidence interval predicted by the sampling GPR method, while most normal data fall inside it, which gives better detection of continuous anomalies. Moreover, when the normal pattern of the data changes, i.e. when the amplitude of the data changes, the prediction-model-based anomaly detection method adapts the normal-data pattern as the data change, achieving adaptive anomaly detection on the data stream.
The quantitative comparison is shown in Table 2.
Table 2. Quantitative comparison of anomaly detection under the two frameworks
As the table above shows, for continuous anomalies in the data stream, adopting the sampling GPR prediction model greatly improves the anomaly detection rate without considerably affecting the false detection rate. However, running the UBCS algorithm also reduces the time efficiency of the algorithm. In practical applications with large volumes of data arriving over time, the sampling GPR method is better suited to continuous anomalies, provided the flow rate of the data stream is compatible with the execution time of the algorithm.
Supplementary description of the UBCS (uniform basic-windows chain sampling) algorithm
Among the algorithms currently popular for data-stream sampling, the RS algorithm can only respond to the arrival of new data in the stream and cannot handle the deletion of stale data, so it is unsuitable for the sliding-window data stream model. The CS algorithm has indeterminate memory usage in the worst case and, when used as a multiple-sample algorithm, must maintain several sampling chains at once, which wastes resources. The later SBWRS algorithm takes the temporal characteristics of stream data into account, but it must store the whole window and is therefore only suitable for small sliding windows. The optimal sampling algorithm is likewise a multiple-sample algorithm that must maintain several sampling processes at once. A data-stream sampling algorithm aims to meet the required sample size with relatively little memory. Since uniform sampling is the more general and more widely studied case, every data element should have the same probability of becoming a sample. Because the sample must reflect the information of the whole valid window as fully as possible, the samples should also be spread fairly evenly over the whole window. To meet these requirements, a uniform chain sampling algorithm based on the data-element sliding-window data stream model is proposed: UBCS (Uniform Basic-windows Chain Sampling). The algorithm introduces the idea of basic windows, incorporates the advantages of the CS algorithm, and achieves the following goals:
(1) The sampling algorithm satisfies the uniform sampling requirement: every data element becomes a sample with the same probability.
(2) The worst-case memory usage of the sampling algorithm is bounded: O(k).
(3) To better capture the overall information of the data in the window, the samples are evenly distributed over the current valid window.
Algorithm description:
The uniform sampling method comprises the following steps:
Step 1: Set the window size of the sliding-window data stream to N and the sample size to k; each basic window then has size N/k, rounded down. The element indices in the first basic window are [1, 2, 3, ..., N/k], those in the second basic window are [N/k+1, N/k+2, ..., 2*N/k], ..., and those in the I-th basic window are [(I-1)*N/k+1, ..., I*N/k];
Step 2: Using the UBCS algorithm, randomly select one element index from the first basic window as its representative index;
Step 3: When the element corresponding to the representative index of the first basic window arrives, store it as the first sample of the uniformly sampled data stream;
Step 4: Obtain the representative index of each subsequent basic window in turn, and store the corresponding element when it arrives. When the difference between the representative index of the current first basic window and the representative index of the basic window corresponding to the current time exceeds the window size N, delete the element corresponding to the representative index of the current first basic window. Also, when the number of samples of the uniformly sampled data stream exceeds the sample size k, randomly delete a sample from the samples of the uniformly sampled data stream. Repeat this step until the data stream of the sensor network ends.
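Steps 1 to 4 can be sketched as a small Python class. This is an illustrative reading of the steps, not a definitive implementation: the class name `UBCSSampler` and its methods are hypothetical, expiry is checked just before a new representative is stored, and "randomly delete a sample" is taken to mean removing one stored sample chosen uniformly at random.

```python
import random


class UBCSSampler:
    """Sketch of uniform basic-windows chain sampling (names are hypothetical)."""

    def __init__(self, window_size, sample_size, rng=None):
        if not 1 <= sample_size <= window_size:
            raise ValueError("need 1 <= sample_size <= window_size")
        self.N = window_size
        self.k = sample_size
        self.w = window_size // sample_size  # basic-window size N/k, rounded down
        self.rng = rng or random.Random()
        self.samples = []                    # (index, value) pairs, oldest first
        self.rep = None                      # representative index of the current basic window

    def basic_window(self, i):
        """1-based element indices of the i-th basic window (Step 1)."""
        return list(range((i - 1) * self.w + 1, i * self.w + 1))

    def offer(self, index, value):
        """Feed the element with 1-based stream index `index`.

        Returns True if the element is the representative of its basic window.
        """
        if (index - 1) % self.w == 0:
            # A new basic window starts: pick its representative index (Step 2).
            self.rep = self.rng.randint(index, index + self.w - 1)
        if index != self.rep:
            return False
        # Expire stored samples whose index lags by more than N (Step 4).
        while self.samples and index - self.samples[0][0] > self.N:
            self.samples.pop(0)
        self.samples.append((index, value))  # store the representative (Steps 3 and 4)
        if len(self.samples) > self.k:
            # More than k samples: remove one at random (Step 4).
            self.samples.pop(self.rng.randrange(len(self.samples)))
        return True
```

Because only one element per basic window is ever stored and the sample list is trimmed whenever it exceeds k, the memory held by `samples` stays within k + 1 entries at any moment, in line with the O(k) goal stated above.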
If only the data of the most recent day are of interest and the sensor network acquires one sample every 0.5 s, then N = 24*3600/0.5 = 172800, and the UBCS algorithm divides the sliding-window data stream into multiple basic windows. If the sample data correspond to a sensor-network data stream, the data are high-dimensional.
As an example, in Step 1 the window size N of the sliding-window data stream of the sensor network is 6 and the sample size k is 2, so each basic window has size N/k = 3; the element indices in the first basic window are [1, 2, 3], those in the second basic window are [4, 5, 6], and so on.
The representative index chosen from the first basic window in Step 2 is 2.
In Step 3, when the element with representative index 2 in the first basic window arrives, it is stored as the first sample of the uniformly sampled data stream.
The representative index chosen in the second basic window is 5; when the element with index 5 arrives, it is stored. The representative index chosen in the third basic window is 7; when the element with index 7 arrives, it is stored. The uniformly sampled data stream now holds 3 samples, more than the sample size 2, so a sample is randomly deleted from it. The representative index chosen in the fourth basic window is 11. At this point the difference between representative index 11 and the representative index 2 of the first basic window is 9, which exceeds the window size 6, so the element with index 2 has expired and is deleted. The procedure repeats until the data stream of the sensor network ends.
A small execution example of the UBCS algorithm is shown in Figure 4.
The execution flow of the UBCS algorithm is shown in Figure 5.

Claims (4)

1. A sampling Gaussian process regression model for continuous anomaly detection in the image data stream of an environmental sensor, characterized in that it comprises the following steps:
Step 1: Set the sliding window size of the environmental sensor's sensing data to N and the sampling ratio to B:1; sample the first N*B data of the data stream within the sliding window as offline data, take the resulting N*B data as the initial prediction window data, and form the prediction window D_t from the initial prediction window data;
Step 2: Take the index of the next data element, adjacent to the current time in the environmental sensor's sensing data stream, as the input to the prediction window D_t; the prediction window D_t outputs the predicted mean of that next data element, and the variance corresponding to this predicted mean is obtained;
Step 3: From the predicted mean of the next data element output by the prediction window D_t and the corresponding variance, determine the 95% confidence interval into which the next data element should fall when it is normal;
Step 4: When the next data element arrives, compare it with the range given by the confidence interval; if it lies outside that range, regard the data element as abnormal data, store the abnormal data and its index, and return to Step 2; otherwise go to Step 5;
Step 5: Use the uniform chain sampling algorithm to decide whether the next data element is added to the prediction window D_t. If it is added, store the data element in the prediction window D_t and delete the data element with the smallest index in D_t, which completes the update of D_t; then return to Step 2 and repeat until the sensing data stream ends, then go to Step 6. Otherwise return directly to Step 2 and repeat until the sensing data stream ends, then go to Step 6;
Step 6: Output all abnormal data identified in Step 4, thereby detecting the continuous abnormal data in the image data stream of the environmental sensor;
The Gaussian process regression model is a probabilistic model for nonlinear regression problems; it uses the available training data to constrain the prior distribution and thereby estimate the posterior distribution;
The prediction window D_t = {x_{i-Q}, x_{i-Q+1}, ..., x_i}, where i denotes the current time, Q is the size of the prediction window D_t with Q = N*B, and x is the prediction window datum at the time given by its subscript;
The index of the next data element x_{i+1} is used as the input to the prediction window D_t, giving the predicted mean of the data element x_{i+1} and the variance q corresponding to this predicted mean;
The 95% confidence interval of the next data element is determined to be
The specific method in Step 5 of using the uniform chain sampling algorithm to decide whether the next data element is added to the prediction window D_t is:
Step 5.1: Based on the sliding window size N of the environmental sensor's sensing data, set the sample size to k; each basic window then has size N/k, rounded down. The data element indices in the first basic window are [1, 2, 3, ..., N/k], those in the second basic window are [N/k+1, N/k+2, ..., 2*N/k], ..., and those in the I-th basic window are [(I-1)*N/k+1, ..., I*N/k];
Step 5.2: Randomly select the index of the next data element from the prediction window D_t as the representative index;
Step 5.3: When the data element corresponding to the representative index arrives, add this data element to the prediction window D_t as a sample of the uniformly sampled data stream. When the difference between the representative index of the current first basic window and the representative index of the basic window corresponding to the current time exceeds the window size N, delete the element corresponding to the representative index of the current first basic window. Also, when the number of samples of the environmental sensor's sensing data stream exceeds the sample size k, randomly delete a sample from the samples of the uniformly sampled data stream.
2. The sampling Gaussian process regression model for continuous anomaly detection in the image data stream of an environmental sensor according to claim 1, characterized in that
in Step 5.1 the sliding window size N of the environmental sensor's sensing data is 6 and the sample size k is 2, so each basic window has size N/k = 3; the element indices in the first basic window are [1, 2, 3], those in the second basic window are [4, 5, 6], and so on.
3. The sampling Gaussian process regression model for continuous anomaly detection in the image data stream of an environmental sensor according to claim 2, characterized in that
the representative index of the next data element selected in Step 5.2 is 2.
4. The sampling Gaussian process regression model for continuous anomaly detection in the image data stream of an environmental sensor according to claim 3, characterized in that
in Step 5.3, when the data element whose representative index is 2 arrives, this data element is added to the prediction window D_t as a sample of the uniformly sampled data stream.
CN201310295975.7A 2013-07-15 2013-07-15 The sampling Gaussian process regression model that in the image data stream of environmental sensor, continuous abnormal detects Active CN103336906B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310295975.7A CN103336906B (en) 2013-07-15 2013-07-15 The sampling Gaussian process regression model that in the image data stream of environmental sensor, continuous abnormal detects

Publications (2)

Publication Number Publication Date
CN103336906A CN103336906A (en) 2013-10-02
CN103336906B true CN103336906B (en) 2016-03-16

Family

ID=49245069

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310295975.7A Active CN103336906B (en) 2013-07-15 2013-07-15 The sampling Gaussian process regression model that in the image data stream of environmental sensor, continuous abnormal detects

Country Status (1)

Country Link
CN (1) CN103336906B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200326

Address after: 150001 No. 118 West straight street, Nangang District, Heilongjiang, Harbin

Patentee after: Harbin University of technology high tech Development Corporation

Address before: 150001 Harbin, Nangang, West District, large straight street, No. 92

Patentee before: HARBIN INSTITUTE OF TECHNOLOGY
