CN110569890A

CN110569890A - A method for detecting abnormal patterns in hydrological data based on similarity measure

Info

Publication number: CN110569890A
Application number: CN201910784182.9A
Authority: CN
Inventors: 万定生; 张祥
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2019-08-23
Filing date: 2019-08-23
Publication date: 2019-12-13

Abstract

The invention discloses a method for detecting abnormal patterns in hydrological data based on similarity measurement. Using KPRA-PLR, a key-point-based piecewise linear representation algorithm, the method cuts the hydrological data at its key points and fits a straight line to each subsequence with the PLR algorithm, so that each subsequence is represented by the slope a_i of its line and its time interval Δt. Each subsequence obtained by the segmentation is called a meta-pattern, and sequence patterns are obtained by combining adjacent meta-patterns. The similarity between sequence patterns is measured with the weighted distance and the SDTW algorithm, and an anomaly score is then calculated for each sequence pattern as the reciprocal of the average distance between that pattern and the other patterns. The anomaly score is taken as the k-nearest-neighbor distance of the sequence pattern S_x, and the local outlier factor (LOF) is calculated according to the k-nearest-neighbor local detection principle. Abnormal patterns detected with this similarity-based method are more accurate, and the method provides a new technique for abnormal pattern detection in hydrological data from the perspective of data analysis.

Description

A method for detecting abnormal patterns in hydrological data based on similarity measure

Technical Field

The invention belongs to the field of hydrological detection technology, and in particular relates to a method for detecting abnormal patterns in hydrological data based on similarity measurement.

Background

A hydrological time series is a sequence of observed values of physical quantities (water level, flow, rainfall, etc.) acquired by an observation system in chronological order. It is a common complex data type that objectively records the information the observation system collects over time. In China, with the development of water conservancy informatization, hydrological time series are transmitted to the water conservancy information sharing platform, processed by staff, and stored in the national hydrological database. Measurement errors of the acquisition equipment, manual operation errors, and changes in the evolution of the hydrological regime itself all lead to abnormal data in hydrological time series.

The main goal of anomaly detection is to mine unusual data hidden in large data sets, that is, potentially meaningful information that differs markedly from the rest of the data. Methods and techniques for detecting abnormal data in hydrological time series are still at an exploratory stage, and the industry has no perfect anomaly detection scheme. Current methods for detecting anomalous points in hydrological time series include the box-plot method and Benford's law.

Summary of the Invention

Purpose of the invention: To address the low accuracy and slow detection of abnormal conditions in the prior art described above, the present invention provides a method for detecting abnormal patterns in hydrological data based on similarity measurement, which provides a theoretical basis for detecting abnormal patterns present in hydrological data.

Technical solution: A method for detecting abnormal patterns in hydrological data based on similarity measurement. The method is based on KPRA-PLR, a key-point-based piecewise linear representation algorithm; a method for measuring the similarity of time series patterns; and a k-nearest-neighbor local anomaly detection algorithm. It comprises the following steps:

(1) Key-point-based piecewise linear representation (KPRA-PLR algorithm): first, the original hydrological time series is segmented according to the definition of key points to obtain the subsequences; then a straight line is fitted to each subsequence according to the PLR algorithm to obtain its slope, and the intersections of adjacent lines are solved to obtain the start time t_il and end time t_ir of each subsequence. Each subsequence is represented by (a_i, t_il, t_ir) and is called a meta-pattern;

(2) Time series pattern similarity measurement: adjacent meta-patterns are combined into sequence patterns. To measure the similarity of two patterns, a similarity distance function between the two patterns is established, and the magnitude of the distance expresses their similarity; the average similarity of each pattern represents the anomaly score of that pattern;

(3) k-nearest-neighbor local anomaly detection algorithm: the anomaly score is used as the k-nearest-neighbor distance of the pattern, from which the k-nearest-neighbor local reachability distance, the local reachability density, and the local outlier factor are obtained; the pattern with the largest local outlier factor is the abnormal pattern.

Further, step (1) includes the following process:

(11) Key points are defined as follows: let S = {s₁, s₂, ..., s_m} be all local extreme points of the time series X. When an extreme point s_k satisfies |s_k − s_{k−1}| < ε and |t_k − t_{k−1}| < δ, where ε and δ are given constants and t_k denotes the observation time of the hydrological quantity, s_k is called an unimportant local extreme point; the other extreme points in S are key points;

(12) Then, following the piecewise linear representation algorithm, a straight line is fitted by least squares between each pair of adjacent key points to obtain its slope, and the intersections of adjacent lines are solved to obtain the start time t_il and end time t_ir of each subsequence. The pattern of each subsequence, written (a_i, t_il, t_ir), is called a meta-pattern;

(13) The constants ε and δ suited to the current data are determined by adjusting the parameters and checking how well the fitted lines match the shape of the original curve.
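Steps (11) and (12) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: sample indices stand in for the observation times t_k, both endpoints of the series are always kept, each extremum is compared with the preceding extremum in S, and the segment boundaries are taken directly at the key points instead of at the intersections of adjacent fitted lines. The function names are hypothetical.

```python
import numpy as np

def local_extrema(x):
    """Indices of strict local maxima/minima, plus both endpoints (assumption)."""
    idx = [0]
    for i in range(1, len(x) - 1):
        # A sign change of the difference marks a strict local extremum.
        if (x[i] - x[i - 1]) * (x[i + 1] - x[i]) < 0:
            idx.append(i)
    idx.append(len(x) - 1)
    return idx

def key_points(x, eps, delta):
    """Drop extrema that are close to the previous extremum in both value and time."""
    ext = local_extrema(x)
    keep = [ext[0]]
    for prev, cur in zip(ext, ext[1:]):
        unimportant = abs(x[cur] - x[prev]) < eps and (cur - prev) < delta
        # The final endpoint is always kept so the last segment is closed.
        if not unimportant or cur == ext[-1]:
            keep.append(cur)
    return keep

def meta_patterns(x, keys):
    """Least-squares slope for each segment between adjacent key points."""
    x = np.asarray(x, dtype=float)
    out = []
    for left, right in zip(keys, keys[1:]):
        t = np.arange(left, right + 1)
        slope = np.polyfit(t, x[left:right + 1], 1)[0]
        out.append((float(slope), left, right))  # (a_i, t_il, t_ir)
    return out

flow = [0, 10, 9, 10, 0]
keys = key_points(flow, eps=2, delta=2)
patterns = meta_patterns(flow, keys)
```

In this toy series the small dip at index 2 and the rebound at index 3 are both within ε and δ of their predecessors, so only the indices 0, 1 and 4 survive as key points. Tuning ε and δ as in step (13) changes how much of such fine structure is discarded.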

In the time series pattern similarity measurement of step (2), similarity distance functions are established separately for meta-patterns and for sequence patterns, as follows:

(21) The similarity between two meta-patterns is measured with a weighted distance. Let the pattern sequence of the time series X obtained by the PLR algorithm be S_x = {M₁, M₂, M₃, ..., M_n}, and take any two meta-patterns M_i = (a_i, Δt_i) and M_j = (a_j, Δt_j) (i, j = 1, 2, 3, ..., n) from S_x. Their weighted distance is

D_wei(M_i, M_j) = β·|a_i − a_j| + (1 − β)·|Δt_i − Δt_j|

where a_i is the slope of the fitted line, Δt_j is the time interval of the j-th meta-pattern, M_i denotes a meta-pattern, D_wei(M_i, M_j) is the weighted distance between the meta-patterns M_i and M_j, and 0 ≤ β ≤ 1;
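The weighted distance of step (21), in the form stated in claim 3, β·|a_i − a_j| + (1 − β)·|Δt_i − Δt_j|, is small enough to write out directly. The sketch below assumes a meta-pattern is stored as a (slope, duration) pair; the function name and the default β = 0.5 are illustrative choices, not values from the source.

```python
def weighted_distance(m_i, m_j, beta=0.5):
    """D_wei between two meta-patterns given as (slope, duration) pairs."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    a_i, dt_i = m_i
    a_j, dt_j = m_j
    # beta weights the slope difference, (1 - beta) the duration difference.
    return beta * abs(a_i - a_j) + (1.0 - beta) * abs(dt_i - dt_j)

# Two meta-patterns in the (a_i, Δt_i) form used by the embodiment.
d = weighted_distance((33.0, 3), (-22.5, 4), beta=0.5)
```

With β = 0.5 the slope gap of 55.5 and the duration gap of 1 contribute 27.75 and 0.5 respectively. Because slopes and durations are on different scales, the choice of β strongly affects which component dominates.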

(22) Adjacent meta-patterns are combined in pairs to obtain sequence patterns. The similarity of sequence patterns builds on the similarity of meta-patterns and is measured with the dynamic time warping (DTW) distance;

For time series X = (x₁, x₂, x₃, ..., x_n) and Y = (y₁, y₂, ..., y_m), the pattern sequences obtained by dimensionality reduction and linear representation with the KPRA-PLR algorithm are, respectively:

S_x = {M₁, M₂, M₃, ..., M_p},

S_y = {N₁, N₂, N₃, ..., N_t},

where (i = 1, 2, 3, ..., p; j = 1, 2, 3, ..., t), M_i = (a_i, Δt_i) is a meta-pattern of S_x, and

N_j = (a_j, Δt_j) is a meta-pattern of S_y. The distance matrix between the two pattern sequences is built as follows:

DM = (a_ij)_{p×t}

where a_ij = D_wei(M_i, N_j) is the weighted distance between the i-th meta-pattern M_i of S_x and the j-th meta-pattern N_j of S_y. The value of a_ij reflects the similarity of M_i and N_j: the more similar they are, the closer a_ij is to 0; the more they differ, the larger a_ij is;

(23) As with the dynamic time warping distance between time series, the best warping path W = w₁, w₂, ..., w_k is searched for in the distance matrix DM = (a_ij)_{p×t}, and the warping path with the smallest cumulative distance is computed. This minimum is called the dynamic time warping distance of the pattern sequences S_x and S_y, abbreviated D_dtw(S_x, S_y), giving the following formula:

D_dtw(S_x, S_y) = min_W (w₁ + w₂ + ... + w_k)

where each w is the distance between two meta-patterns on the path.
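Steps (22) and (23) can be sketched as a standard dynamic-programming DTW run over the distance matrix DM. The sketch assumes the usual step pattern (match, insertion, deletion) and no warping-window constraint, neither of which the text specifies; `pattern_dtw` and `weighted_distance` are hypothetical names.

```python
import numpy as np

def weighted_distance(m, n, beta=0.5):
    """D_wei between two (slope, duration) meta-patterns."""
    return beta * abs(m[0] - n[0]) + (1.0 - beta) * abs(m[1] - n[1])

def pattern_dtw(s_x, s_y, beta=0.5):
    """Minimum cumulative weighted distance along a warping path through DM."""
    p, t = len(s_x), len(s_y)
    # DM[i, j] = D_wei(M_i, N_j)
    dm = np.array([[weighted_distance(m, n, beta) for n in s_y] for m in s_x])
    acc = np.full((p + 1, t + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, p + 1):
        for j in range(1, t + 1):
            # Extend the cheapest of the three admissible predecessor cells.
            acc[i, j] = dm[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return float(acc[p, t])

s_x = [(1.0, 2)]
s_y = [(1.0, 2), (3.0, 2)]
distance = pattern_dtw(s_x, s_y, beta=0.5)
```

Here the single pattern in `s_x` is matched against both patterns in `s_y`; the first match costs 0 and the second costs 0.5·|1 − 3| = 1, so the cumulative distance is 1.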

In the k-nearest-neighbor local anomaly detection algorithm of step (3), the anomaly score equals the k-nearest-neighbor distance of the pattern. The calculation proceeds as follows:

Let k be a positive integer. The k-nearest-neighbor reachability distance between instance x and instance t is:

r_dist(x, t) = max{k_distance(t), d(x, t)}

The local reachability density of instance x is defined as the reciprocal of the average reachability distance of the k nearest neighbors of x; the local reachability density formula is as follows:

In the data point set D, the local outlier factor LOF(x) of instance x can be defined as:

Here the anomaly score of each sequence pattern is calculated according to the DTW algorithm, r_dist(x, t) is the anomaly score of each sequence pattern, instance x denotes the anomaly score of one sequence pattern, the local reachability density is calculated according to the formula, and lof denotes the outlier factor.
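The quantities of step (3) can be sketched with the standard LOF definitions: the local reachability density is the reciprocal of the mean reachability distance to the k nearest neighbors, and LOF(x) is the mean density of those neighbors divided by the density of x. The sketch assumes a precomputed symmetric distance matrix, takes exactly k neighbors per point (ties at the k-distance are not expanded), and does not guard against duplicate points with zero distance. The function name is hypothetical.

```python
import numpy as np

def local_outlier_factor(dist, k):
    """LOF for every row of a precomputed n x n distance matrix."""
    d = np.asarray(dist, dtype=float).copy()
    n = d.shape[0]
    np.fill_diagonal(d, np.inf)               # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]         # indices of the k nearest neighbors
    rows = np.arange(n)
    k_dist = d[rows, nn[:, -1]]               # k_distance of each point
    # r_dist(x, t) = max{k_distance(t), d(x, t)} for each neighbor t of x
    reach = np.maximum(k_dist[nn], d[rows[:, None], nn])
    lrd = 1.0 / reach.mean(axis=1)            # local reachability density
    return lrd[nn].mean(axis=1) / lrd         # LOF(x)

# Five points on a line; the last one sits far from the rest.
pts = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
dist = np.abs(pts[:, None] - pts[None, :])
lof = local_outlier_factor(dist, k=2)
```

In this example the four clustered points have an LOF of about 1, while the isolated point has a much larger value, which is the behavior the method relies on when it ranks sequence patterns by their local outlier factor.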

Further, to address the singularity problem of the DTW method, an improved method based on the spatial and temporal dimensions is used: through adaptive weighting, and without introducing additional weight coefficients, the spatial dimension (value) and the temporal dimension (gradient) are combined in a weighted sum, from which the minimum accumulated path distance is calculated.

Further, the singularity problem of the DTW method is handled by using the SDTW algorithm as the time series similarity measure. The process is as follows:

Define the feature F_s(x(i)) as:

where max(|Δx|) is the maximum gradient over all time points of the series, and the gradient at a point is the difference between adjacent points, e.g. x(i) − x(i−1). Dividing by max(|Δx|) constrains the gradient at each point to [−1, 1], so that the temporal dimension is combined with the spatial dimension in the form of a ratio. The distance d(i, j) between elements of the matrix is then:

d(i, j) = (F_s(x(i)) − F_s(y(j)))².
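The two parts of SDTW that the text does state, the gradient normalized by max(|Δx|) and the squared feature difference d(i, j), can be sketched as below. The definition of F_s itself is not available in this text, so the sketch takes the feature as a caller-supplied function; `placeholder_feature` is a hypothetical stand-in chosen only to make the example run and should not be read as the source's F_s. Assigning a gradient of 0 to the first point is also an assumption.

```python
import numpy as np

def normalized_gradient(x):
    """Backward difference x(i) - x(i-1), scaled by max|Δx| into [-1, 1]."""
    x = np.asarray(x, dtype=float)
    grad = np.diff(x, prepend=x[0])   # first point gets gradient 0 (assumption)
    peak = np.max(np.abs(grad))
    return grad / peak if peak > 0 else grad

def cost_matrix(x, y, feature):
    """d(i, j) = (F_s(x(i)) - F_s(y(j)))**2 for a caller-supplied feature."""
    fx = feature(np.asarray(x, dtype=float), normalized_gradient(x))
    fy = feature(np.asarray(y, dtype=float), normalized_gradient(y))
    return (fx[:, None] - fy[None, :]) ** 2

def placeholder_feature(values, grad_ratio):
    """Hypothetical F_s, NOT the source's definition: value scaled by (1 + ratio)."""
    return values * (1.0 + grad_ratio)

g = normalized_gradient([1, 3, 2])
d = cost_matrix([1, 3, 2], [1, 3, 2], placeholder_feature)
```

The resulting matrix can be fed to the same cumulative-path search used for ordinary DTW; only the per-cell cost changes.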

Beneficial effects: The invention provides a complete scheme for detecting abnormal patterns in hydrological data: the KPRA-PLR algorithm segments the original sequence, the weighted distance measures the similarity of meta-patterns, an improved DTW algorithm measures the similarity of sequence patterns, and the k-nearest-neighbor local anomaly detection algorithm implements similarity-based abnormal pattern detection. Compared with the commonly used symbolic SAX method, the invention needs neither fixed segmentation nor a fixed number of pattern classes. The abnormal patterns detected with the similarity measure are more accurate, and the method provides a reference for abnormal pattern detection in hydrological data from the perspective of data analysis.

Description of the Drawings

Fig. 1 is a schematic diagram of the overall structure of the method of the present invention;

Fig. 2 is a schematic diagram of line fitting by the KPRA-PLR algorithm in the present invention;

Fig. 3 is a schematic diagram of the flow variation of pattern 185 to 195 in the embodiment;

Fig. 4 is a schematic diagram of the flow variation of pattern 282 to 287 in the embodiment;

Fig. 5 is a schematic diagram of the flow variation of pattern 290 to 302 in the embodiment;

Fig. 6 is a schematic diagram of the flow variation of pattern 363 to 370 in the embodiment.

Detailed Description

To describe the technical solution disclosed by the invention in detail, it is further explained below with reference to the accompanying drawings and specific embodiments.

As shown in Fig. 1, the method for detecting abnormal patterns in hydrological data based on similarity measurement comprises three modules: the key-point-based piecewise linear representation (KPRA-PLR) algorithm, the time series pattern similarity measurement method, and the k-nearest-neighbor principle. The hydrological data in this embodiment are the hourly flow data (in m³/s) of the Longmen hydrological station during the flood season; the measured flood-season hourly flow data of this station from 2000/6/1 6:00:00 to 2015/9/30 17:39:00 are used for the experiment.

First, for the KPRA-PLR algorithm, the same thresholds ε and δ are applied, according to the formula in the key-point definition, to reduce the dimensionality of the two hydrological data sets, yielding the key points of the historical hydrological time series (flood-season data, 2000 to 2015) and of the real-time hydrological time series (flood-season data, 2016 to 2017). In total, 189 key points are obtained from the real-time data and 2844 from the historical data. Key points are written in the format [t, i], where t is the index into the time series and i is the observed value at that time. Some of the key points obtained from the real-time series are shown in Table 1.

Table 1. Key points

[1,159]     [4,258]     [8,168]     [11,244]    [12,232]    [14,220]
[29,188]    [32,184]    [35,216]    [38,183]    [43,256]    [46,214]
[50,205]    [52,276]    [61,173]    [64,406]    [65,344]    [67,537]
[73,330]    [75,421]    [89,142]    [96,164]    [105,141]   [111,160]
[120,387]   [121,511]   [122,475]   [128,615]   [131,960]   [142,273]
[148,768]   [149,747]   [150,780]   [151,705]   [155,556]   [165,147]
[174,396]   [177,651]   [180,476]   [181,554]   [184,331]   [190,234]
[194,1160]  [195,1300]  [196,1150]  [197,125]   [198,122]   [89,142]
……          ……          ……          ……          ……          ……
[605,296]   [606,266]   [610,185]   [612,181]   [615,451]   [618,360]
[621,580]   [624,308]   [627,525]   [631,388]   [635,808]   [639,490]
[640,514]   [643,332]   [645,377]   [649,232]   [651,226]   [653,215]
[654,266]   [657,184]   [658,201]   [664,480]   [666,301]   [668,426]
[670,301]   [672,499]   [675,306]   [677,322]   [678,310]   [680,322]

The original time series is split at the key points into subsequences, and each subsequence is represented as a pattern with the PLR algorithm, giving the meta-patterns. By definition, adjacent meta-patterns are combined into sequence patterns, and different ways of combining them yield different sequence patterns. This embodiment combines adjacent meta-patterns in pairs, using the notation M_i = (a_i, Δt_i), where a_i is the slope of the line and describes the trend of the curve, and Δt_i is the length of the line and describes the duration of the curve.

In total, the real-time hydrological series yields 188 meta-patterns, which form 94 sequence patterns when adjacent meta-patterns are paired; the historical hydrological series yields 2844 meta-patterns, which form 1422 sequence patterns. Some of the patterns obtained from the real-time data are shown in Table 2.

Table 2. Partial sequence patterns

[33.0,3]     [-22.5,4]    [25.33,3]    [-12.0,1]    [-6.0,2]     [-1.2,5]
[-2.6,10]    [-1.32,3]    [10.67,3]    [-11.0,3]    [14.6,5]     [-14.0,3]
[15.0,2]     [-19.5,2]    [35.5,2]     [-11.43,9]   [77.67,3]    [-62.0,1]
[96.5,2]     [-43.25,4]   [-17.0,2]    [45.5,2]     [-19.9,14]   [3.14,7]
[-2.54,9]    [3.17,6]     [42.25,8]    [-111.0,1]   [124.0,1]    [-36.0,1]
[23.33,6]    [115.0,3]    [-62.44,11]  [82.5,6]     [-21.0,1]    [33.0,1]
[-75.0,1]    [-37.25,4]   [91.4,10]    [-119.3,9]   [85.0,3]     [-58.3,3]
[78.0,1]     [-74.32,3]   [334.83,6]   [-295.0,4]   [140.0,1]    [-150,1]
[100.0,1]    [-30.0,1]    [13.33,3]    [75.0,2]     [-85.5,6]    [32.0,1]
……           ……           ……           ……           ……           ……
[-15.0,4]    [20.13,8]    [-18.0,6]    [18.5,4]     [-4.0,4]     [-23.3,3]
[15.0,4]     [24.67,3]    [-19.0,6]    [13.33,3]    [-13.0,1]    [27.3,3]
[-30.0,1]    [-20.25,4]   [-2.0,2]     [90.0,3]     [-30.32,3]   [73.3,3]
[-90.6,3]    [72.33,3]    [-34.25,4]   [105.0,4]    [-79.5,4]    [24.0,1]
[-60.66,3]   [22.5,2]     [-36.25,4]   [-3.0,2]     [-5.5,2]     [51.0,1]
[99.0,2]     [8.0,2]      [-12.0,1]    [6.0,2]
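The pairing of adjacent meta-patterns can be sketched as follows. The sketch assumes non-overlapping pairs (M₁, M₂), (M₃, M₄), ..., which is consistent with 188 meta-patterns giving 94 sequence patterns; a trailing unpaired meta-pattern is silently dropped, a case the source does not discuss. The function name is hypothetical.

```python
def pair_adjacent(meta_patterns):
    """Group meta-patterns two at a time: (M1, M2), (M3, M4), ..."""
    it = iter(meta_patterns)
    # zip pulls two consecutive items from the same iterator per pair;
    # an odd trailing element is discarded.
    return list(zip(it, it))

# The first four entries of Table 2 as (slope, duration) meta-patterns.
metas = [(33.0, 3), (-22.5, 4), (25.33, 3), (-12.0, 1)]
sequence_patterns = pair_adjacent(metas)
```

Each resulting sequence pattern is a short pattern sequence of length two, so it can be compared with another one using the DTW distance over meta-patterns.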

In the time series pattern similarity measurement, the experiment in this embodiment mainly mines abnormal patterns in the real-time hydrological data. The real-time data are therefore segmented with the KPRA-PLR algorithm and represented as meta-patterns, and adjacent meta-patterns are paired into 94 sequence patterns; the historical data are processed in the same way, giving 1422 sequence patterns. The distance between each real-time sequence pattern and each of the 1422 historical sequence patterns is then calculated; part of the similarity results is shown in Table 3.

Table 3. Similarity measurement results

No.   1        2        3        ……   1420      1421     1422
1     9.4      11.91    11.5     ……   69.83     3.83     52.15
2     11.91    12.68    16.41    ……   75.74     12.08    71.76
3     11.5     16.41    6.35     ……   80.33     9.67     25.35
4     25.03    28.12    17.13    ……   93.86     23.2     24.88
5     8.83     8.33     8.08     ……   77.66     7.0      6.43
6     50.665   56.575   60.165   ……   20.165    53.495   54.815
7     18.67    24.58    29.17    ……   53.16     19.5     22.82
8     27.74    30.83    21.24    ……   96.57     24.91    27.59
9     26.685   23.985   18.395   ……   95.725    25.065   21.745
10    33.895   31.985   26.395   ……   102.725   32.065   28.745
11    33.58    35.67    24.08    ……   102.41    31.75    32.43
12    12.83    23.74    11.33    ……   73.0      13.66    12.98
……    ……       ……       ……       ……   ……        ……       ……
85    66.58    69.67    59.08    ……   135.41    62.75    66.43
86    55.41    58.5     48.91    ……   124.24    52.58    55.26
87    30.955   34.045   23.455   ……   99.785    28.125   30.805
88    42.08    45.17    36.58    ……   110.91    38.25    41.93
89    36.49    39.58    29.99    ……   105.32    32.66    37.34
90    58.67    64.58    65.17    ……   41.34     60.5     68.82
91    51.17    57.08    61.67    ……   19.66     53.0     56.32
92    69.83    75.74    80.33    ……   54.9      72.66    74.98
93    3.83     12.08    9.67     ……   72.66     56.4     5.68
94    5.15     10.76    11.35    ……   73.98     10.68    9.4

In the k-nearest-neighbor module, the SDTW algorithm is used to calculate the similarity between the sequence patterns obtained from the real-time data and those obtained from the historical data; the similarity results are shown in Tables 3 and 4. The anomaly score of each real-time sequence pattern is then calculated as the reciprocal of its average distance to the 1422 historical sequence patterns. Once the anomaly scores are available, a similarity-based point anomaly detection algorithm can be used to detect abnormal patterns. Two approaches are available: the k-nearest-neighbor principle and clustering-based methods.
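The anomaly score described above can be sketched as a single row-wise operation, assuming the distances are held in a matrix with one row per real-time sequence pattern and one column per historical sequence pattern (94 × 1422 in the embodiment). The function name and the small example matrix are illustrative only.

```python
import numpy as np

def anomaly_scores(dist):
    """Reciprocal of each row's mean distance to the historical patterns."""
    dist = np.asarray(dist, dtype=float)
    return 1.0 / dist.mean(axis=1)

# Three real-time patterns compared against two historical patterns.
example = [[1.0, 3.0],
           [2.0, 2.0],
           [10.0, 30.0]]
scores = anomaly_scores(example)
```

Because the score is a reciprocal, a pattern that lies far from the historical patterns on average receives a small score, and one that lies close receives a large score.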

This embodiment uses the k-nearest-neighbor method to calculate the local outlier factors, and the sequence patterns with the Top-k largest local outlier factors are taken as abnormal patterns. The key parameter of the k-nearest-neighbor principle is the number of neighbors K. By repeatedly adjusting K and calculating the corresponding local outlier factors, the Top-k largest local outlier factors are selected, as shown in Table 4.

Table 4. Top-k local outlier factors

Analysis of the experimental results: sorting the local outlier factors from largest to smallest, pattern 185 to 195 ranks first and pattern 282 to 287 second. Figs. 3 and 4 show that the hourly flow at the station surges within 2 hours; the slope a_i of the sequence pattern is very large, reflecting a very rapid rise, and the flow then falls. Such sequence patterns do occur in the 2000 to 2015 flow data, but only rarely, and even fewer match this pattern in both trend and duration. In the two sequence patterns of Figs. 3 and 4 the flow rises too quickly within just 2 hours; in addition, the rise-and-fall duration of pattern 185 to 195 is rather long, while that of pattern 282 to 287 is too short. Patterns 185 to 195 and 282 to 287 are therefore judged to be abnormal patterns.

Continuing down the sorted list, pattern 290 to 302 ranks third and pattern 363 to 370 fourth. Figs. 5 and 6 show that the flow rises slowly for a short time and then falls for a long time; the slope a_i of the sequence pattern is negative, reflecting the degree of decline. In Fig. 5 the sequence pattern as a whole lasts too long: by the key-point definition, 1520 is not a key point, so a fall-then-rise movement appears within the pattern; the flow surges in the first hour and then drops quickly from 2640 to 755. Very few patterns in the 2000 to 2015 flow data resemble pattern 290 to 302 in trend and duration. In Fig. 6 the flow rises slowly for only 1 hour and then declines for a long time; such behavior is rare, and the overly long decline makes it look "strange". Patterns 290 to 302 and 363 to 370 are therefore judged to be abnormal patterns.

Claims

1. A method for detecting abnormal patterns in hydrological data based on similarity measurement, characterized in that the method is based on the key-point-based piecewise linear representation KPRA-PLR algorithm, a method for measuring the similarity of time series patterns, and a k-nearest-neighbor local anomaly detection algorithm, and comprises the following steps:

(1) Key-point-based piecewise linear representation (KPRA-PLR algorithm): first, the original hydrological time series is segmented according to the definition of key points to obtain the subsequences; then a straight line is fitted to each subsequence according to the PLR algorithm to obtain its slope, and the intersections of adjacent lines are solved to obtain the start time t_il and end time t_ir of each subsequence; each subsequence is represented by (a_i, t_il, t_ir) and is called a meta-pattern;

(2) Time series pattern similarity measurement: adjacent meta-patterns are combined into sequence patterns; to measure the similarity of two patterns, a similarity distance function between the two patterns is established, and the magnitude of the distance expresses their similarity; the average similarity of each pattern represents the anomaly score of that pattern;

(3) k-nearest-neighbor local anomaly detection algorithm: the anomaly score is used as the k-nearest-neighbor distance of the pattern, from which the k-nearest-neighbor local reachability distance, the local reachability density, and the local outlier factor are obtained; the pattern with the largest local outlier factor is the abnormal pattern.

2. the hydrological data abnormal pattern detection method based on similarity measure according to claim 1, is characterized in that: step (1) comprises following process:

(11) Define the key point as S={s ₁ , s ₂ ,...,s _m } as all local extreme points of the time series X, when the extreme point _sk satisfies |s _k -s _k-1 |< When ε and |t _k -t _k-1 |<δ, where ε and δ are given constants, and t _k represents the observation time of each hydrological physical quantity, then s _k is called an unimportant local extreme point. Other extreme points are key points;

(12) According to the linear segmental representation algorithm, use the least squares method to fit a straight line through two adjacent key points to obtain the slope corresponding to the straight line, solve the intersection between the two adjacent straight lines, and obtain the starting time and sum of each subsequence. Termination time t _il and t _ir ; the pattern of each subsequence is expressed as (a _i , t _il , t _ir ) is called a meta-pattern;

(13) By adjusting the parameters, the values of the constants $\varepsilon$ and $\delta$ suited to the current data are determined according to how well the fitted straight lines conform to the shape of the original curve.
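As a non-limiting illustration, steps (11) and (12) can be sketched in Python as follows. The function names (`local_extrema`, `key_points`, `ls_slope`, `meta_patterns`) are hypothetical, and the sketch makes several simplifying assumptions not stated in the claims: both endpoints of the series are always kept, flat plateaus are not treated as extrema, and the time interval of each meta-pattern is taken directly from the key-point times rather than from the intersections of adjacent fitted lines.

```python
def local_extrema(values):
    """Return indices of interior local extrema, plus both endpoints.

    Assumes at least two samples; flat plateaus are ignored (assumption).
    """
    idx = [0]
    for i in range(1, len(values) - 1):
        left = values[i] - values[i - 1]
        right = values[i + 1] - values[i]
        if left * right < 0:  # the slope changes sign at i
            idx.append(i)
    idx.append(len(values) - 1)
    return idx


def key_points(times, values, eps, delta):
    """Drop extrema that are close to the previous extremum in value AND time."""
    ext = local_extrema(values)
    keys = [ext[0]]
    for prev, cur in zip(ext, ext[1:]):
        small_value = abs(values[cur] - values[prev]) < eps
        small_time = abs(times[cur] - times[prev]) < delta
        if not (small_value and small_time):
            keys.append(cur)
    if keys[-1] != ext[-1]:  # assumption: always keep the final sample
        keys.append(ext[-1])
    return keys


def ls_slope(times, values):
    """Slope of the ordinary least-squares line through (times, values)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    den = sum((t - mean_t) ** 2 for t in times)
    return num / den


def meta_patterns(times, values, eps, delta):
    """Return one (slope, time interval) pair per segment between key points."""
    keys = key_points(times, values, eps, delta)
    patterns = []
    for lo, hi in zip(keys, keys[1:]):
        slope = ls_slope(times[lo:hi + 1], values[lo:hi + 1])
        # Simplification: interval taken from key-point times, not from
        # the intersections of adjacent fitted lines.
        patterns.append((slope, times[hi] - times[lo]))
    return patterns
```

Step (13) then amounts to calling `meta_patterns` with different `eps` and `delta` values and comparing the resulting segments with the original curve.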

3. The method for detecting abnormal patterns of hydrological data based on similarity measure according to claim 1, characterized in that: in the time series pattern similarity measurement method of step (2), similarity distance measure functions are established for meta-patterns and for sequence patterns respectively, as follows:

(21) The degree of similarity between two meta-patterns is measured by the weighted distance method. Let the pattern sequence of the time series X obtained by the PLR algorithm be $S_x=\{M_1, M_2, M_3, \ldots, M_n\}$. Two meta-patterns $M_i=(a_i, \Delta t_i)$ and $M_j=(a_j, \Delta t_j)$ $(i, j = 1, 2, 3, \ldots, n)$ are arbitrarily selected from $S_x$, and their weighted distance is

Denoted as: $D_{wei}(M_i, M_j) = \beta \cdot |a_i - a_j| + (1-\beta) \cdot |\Delta t_i - \Delta t_j|$

In the formula, $a_i$ represents the slope of the fitted line of meta-pattern $M_i$, $\Delta t_j$ represents the time interval spanned by meta-pattern $M_j$, $D_{wei}(M_i, M_j)$ is the weighted distance between the meta-patterns $M_i$ and $M_j$, and $0 \le \beta \le 1$;
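For illustration only, the weighted distance of step (21) can be written as a short Python function. The name `weighted_distance` and the tuple layout `(slope, interval)` are assumptions made for this sketch.

```python
def weighted_distance(m_i, m_j, beta=0.5):
    """Weighted distance between two meta-patterns given as (slope, interval).

    beta weights the slope difference; (1 - beta) weights the interval
    difference. The default of 0.5 is an arbitrary illustrative choice.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    a_i, dt_i = m_i
    a_j, dt_j = m_j
    return beta * abs(a_i - a_j) + (1.0 - beta) * abs(dt_i - dt_j)
```

With `beta` close to 1 the measure is dominated by the difference in trend (slope); with `beta` close to 0 it is dominated by the difference in duration.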

(22) Adjacent meta-patterns are combined in pairs to obtain sequence patterns; the similarity measure of sequence patterns builds on the similarity of meta-patterns and uses the dynamic time warping distance method;

Given the time series $X=(x_1, x_2, x_3, \ldots, x_n)$ and the time series $Y=(y_1, y_2, \ldots, y_m)$, the pattern sequences obtained by dimensionality reduction and linear representation with the KPRA-PLR algorithm are:

$S_x=\{M_1, M_2, M_3, \ldots, M_p\}$,

$S_y=\{N_1, N_2, N_3, \ldots, N_t\}$,

where $M_i=(a_i, \Delta t_i)$ $(i=1, 2, 3, \ldots, p)$ is a meta-pattern of $S_x$ and $N_j=(a_j, \Delta t_j)$ $(j=1, 2, 3, \ldots, t)$ is a meta-pattern of $S_y$; the distance matrix between the two pattern sequences is established as follows:

$DM=(a_{ij})_{p \times t}$

In the formula, $a_{ij}=D_{wei}(M_i, N_j)$ represents the weighted distance between the i-th meta-pattern $M_i$ of the pattern sequence $S_x$ and the j-th meta-pattern $N_j$ of $S_y$. The value of $a_{ij}$ reflects the degree of similarity between the meta-patterns $M_i$ and $N_j$: the more similar they are, the closer $a_{ij}$ is to 0; the larger $a_{ij}$ is, the less similar they are;

(23) Analogous to the dynamic time warping distance between time series, the best warping path $W=w_1, w_2, \ldots, w_k$ is searched in the distance matrix $DM=(a_{ij})_{p \times t}$, that is, the warping path with the minimum cumulative distance; this minimum value is called the dynamic time warping distance of the pattern sequences $S_x$ and $S_y$, abbreviated as $D_{dtw}(S_x, S_y)$, and the following formula is obtained:

$$D_{dtw}(S_x, S_y)=\min_{W}\sum_{l=1}^{k} w_l$$

In the formula, each $w_l$ is the distance between the two meta-patterns matched at the l-th step of the warping path.
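By way of a hedged example, steps (22) and (23) can be sketched with the classic dynamic-programming recurrence for dynamic time warping. The names `weighted_distance` and `pattern_dtw` are hypothetical, the step pattern (match, insertion, deletion) is the standard one and is assumed rather than taken from the claims, and no warping-window constraint is applied.

```python
def weighted_distance(m_i, n_j, beta=0.5):
    """Weighted distance between two (slope, interval) meta-patterns."""
    return beta * abs(m_i[0] - n_j[0]) + (1.0 - beta) * abs(m_i[1] - n_j[1])


def pattern_dtw(s_x, s_y, beta=0.5):
    """Minimum cumulative weighted distance over all warping paths.

    s_x and s_y are non-empty lists of (slope, interval) meta-patterns.
    """
    p, t = len(s_x), len(s_y)
    if p == 0 or t == 0:
        raise ValueError("pattern sequences must be non-empty")
    inf = float("inf")
    # acc[i][j] holds the cheapest cumulative cost of aligning the first i
    # meta-patterns of s_x with the first j meta-patterns of s_y.
    acc = [[inf] * (t + 1) for _ in range(p + 1)]
    acc[0][0] = 0.0
    for i in range(1, p + 1):
        for j in range(1, t + 1):
            cost = weighted_distance(s_x[i - 1], s_y[j - 1], beta)  # a_ij
            acc[i][j] = cost + min(acc[i - 1][j],      # advance in s_x only
                                   acc[i][j - 1],      # advance in s_y only
                                   acc[i - 1][j - 1])  # advance in both
    return acc[p][t]
```

Because the path may repeat an element of either sequence, two pattern sequences of different lengths that follow the same rise-and-fall shape can still obtain a distance of zero.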

4. The method for detecting abnormal patterns of hydrological data based on similarity measure according to claim 1, characterized in that: in the k-nearest neighbor local anomaly detection algorithm of step (3), the anomaly score is taken as the k-nearest neighbor distance of the pattern, and the specific process is as follows:

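Let k be a positive integer; the k-nearest neighbor reachable distance between instance x and instance t is expressed as:

$r_{dist}(x,t)=\max\{k_{distance}(t),\ d(x,t)\}$

where $k_{distance}(t)$ is the distance from instance t to its k-th nearest neighbor and $d(x,t)$ is the distance between x and t.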

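The local reachability density of an instance x is defined as the inverse of the average reachable distance from x to its k nearest neighbors, and the local reachability density formula is as follows:

$$lrd(x)=\left(\frac{1}{|N_k(x)|}\sum_{t\in N_k(x)} r_{dist}(x,t)\right)^{-1}$$

where $N_k(x)$ denotes the set of k nearest neighbors of x.

In the data point set D, the local outlier factor LOF(x) of the instance x can be defined as:

$$LOF(x)=\frac{1}{|N_k(x)|}\sum_{t\in N_k(x)}\frac{lrd(t)}{lrd(x)}$$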

Among them, the anomaly score of each sequence pattern is calculated according to the DTW algorithm; each instance x represents the anomaly score of one sequence pattern; $r_{dist}(x,t)$ is computed from these anomaly scores; the local reachability density is calculated according to the formula; and LOF denotes the local outlier factor.
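The following minimal Python sketch illustrates the local outlier factor computation described above, assuming the standard definitions of reachable distance, local reachability density, and LOF. The function name `lof_scores` is hypothetical; the input is assumed to be a precomputed symmetric matrix of pairwise distances between instances, ties among neighbors are broken by index, and duplicate instances (zero average reachable distance) are not handled.

```python
def lof_scores(dist, k):
    """Local outlier factor of every instance from a pairwise distance matrix.

    dist[x][t] is the distance between instances x and t; 1 <= k < len(dist).
    """
    n = len(dist)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of instances")

    neighbors, k_distance = [], []
    for x in range(n):
        order = sorted((t for t in range(n) if t != x),
                       key=lambda t: dist[x][t])
        neighbors.append(order[:k])               # k nearest neighbors of x
        k_distance.append(dist[x][order[k - 1]])  # distance to the k-th one

    def reach_dist(x, t):
        # Reachable distance of x with respect to t.
        return max(k_distance[t], dist[x][t])

    # Local reachability density: inverse of the mean reachable distance.
    lrd = []
    for x in range(n):
        mean_reach = sum(reach_dist(x, t) for t in neighbors[x]) / k
        lrd.append(1.0 / mean_reach)  # fails if all neighbors coincide with x

    # LOF: mean ratio of the neighbors' density to the instance's own density.
    return [sum(lrd[t] for t in neighbors[x]) / (k * lrd[x])
            for x in range(n)]
```

For example, with anomaly scores `[1.0, 1.1, 0.9, 1.05, 5.0]`, absolute differences as distances, and `k = 2`, the last instance receives by far the largest factor and would be reported as the abnormal pattern, while the others stay near 1.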

5. The method for detecting abnormal patterns of hydrological data based on similarity measure according to claim 3, characterized in that: to address the singular-point problem of the DTW method, an improved method based on the space dimension and the time dimension is used, which weights the two dimensions in a self-adaptive manner without introducing additional weight coefficients, and then calculates the minimum distance of the cumulative path.

6. The method for detecting abnormal patterns of hydrological data based on similarity measure according to claim 5, characterized in that: the singular-point problem of the DTW method is handled by using the SDTW algorithm as the time series similarity measure, and the specific process is as follows:

Define the feature $F_s(x(i))$ as:

where $\max(|\Delta x|)$ represents the maximum absolute gradient over all time points of the time series, and the gradient of a point is the difference between adjacent points, such as $x(i)-x(i-1)$; the role of $\max(|\Delta x|)$ is to constrain the gradient of each point in the time series to $[-1, 1]$, so that the time dimension and the space dimension are combined in the form of a ratio; the distance $d(i, j)$ between elements in the matrix is then expressed as follows:

$d(i,j)=\left(F_s(x(i))-F_s(y(j))\right)^2$.
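A possible reading of this claim is sketched below in Python. It assumes, based only on the description of $\max(|\Delta x|)$ above, that the feature of a point is its gradient divided by the maximum absolute gradient of the series, i.e. $F_s(x(i)) = (x(i)-x(i-1))/\max(|\Delta x|)$; this form is an assumption of the sketch. The names `shape_feature` and `sdtw_distance` are hypothetical, the first point of each series has no gradient and is skipped, and the cumulative distance uses the standard DTW recurrence.

```python
def shape_feature(x):
    """Gradients of x scaled into [-1, 1] by the largest absolute gradient.

    Assumed form of the feature; requires at least two samples.
    """
    grads = [b - a for a, b in zip(x, x[1:])]
    scale = max(abs(g) for g in grads)
    if scale == 0:  # constant series: every gradient is zero
        return [0.0] * len(grads)
    return [g / scale for g in grads]


def sdtw_distance(x, y):
    """Minimum cumulative d(i, j) = (F_s(x(i)) - F_s(y(j)))**2 along a path."""
    fx, fy = shape_feature(x), shape_feature(y)
    inf = float("inf")
    acc = [[inf] * (len(fy) + 1) for _ in range(len(fx) + 1)]
    acc[0][0] = 0.0
    for i in range(1, len(fx) + 1):
        for j in range(1, len(fy) + 1):
            d = (fx[i - 1] - fy[j - 1]) ** 2
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[len(fx)][len(fy)]
```

Under this assumed feature, two series with the same shape but different amplitude or vertical offset, such as `[0, 1, 2, 3]` and `[0, 2, 4, 6]`, obtain a distance of zero, whereas a rising and a falling series do not.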