CN111079089B

CN111079089B - Base station data anomaly detection method based on interval division

Info

Publication number: CN111079089B
Application number: CN201911329988.5A
Authority: CN
Inventors: 刘海波; 廖闻剑; 卢山; 张俊杰; 张坤
Original assignee: Nanjing Fiberhome Telecommunication Technologies Co ltd
Current assignee: Nanjing Fiberhome Telecommunication Technologies Co ltd
Priority date: 2019-12-20
Filing date: 2019-12-20
Publication date: 2023-08-11
Anticipated expiration: 2039-12-20
Also published as: CN111079089A

Abstract

The invention discloses a base station data anomaly detection method based on interval division, comprising the following steps: preprocessing an original trajectory data set, and dividing the processed data set into dynamic intervals and static intervals, where a dynamic interval is the index range spanned by any plurality of consecutive neighboring isolated points, and a static interval is the range given by the start and end indices of each remaining data segment in the original data set; extracting outliers from the dynamic intervals using a multidimensional Gaussian model and a sliding window distance model; extracting outliers from the static intervals using a center-of-gravity distance scoring method; and representing each dynamic outlier and static outlier as a five-tuple, the set of five-tuples representing the outlier set. The disclosed method is suitable for processing online data, runs quickly with high accuracy, can effectively evaluate new anomaly patterns, and has a low misjudgment rate.

Description

Base station data anomaly detection method based on interval division

Technical Field

The invention discloses a base station data anomaly detection method based on interval division. It relates to data mining within artificial intelligence and computing, and in particular to the technical field of spatio-temporal trajectory data anomaly detection.

Background

With the rapid development of positioning technology and pervasive computing, people's daily behavior data are collected in many ways, producing trajectory big data. Trajectory big data takes the form of a large-scale, high-speed spatio-temporal data stream generated by positioning devices. Effectively analyzing and processing this stream can reveal anomalies hidden in the trajectory data, serving applications such as city planning and security management and control.

Existing trajectory data anomaly detection techniques include classification-based detection, detection based on similarity to historical data, distance-based detection, clustering-based detection, and the like. These methods suffer from the following disadvantages:

1. anomalies in trajectory stream data are unknown and time-varying, so classification-based methods are not suitable for processing online data;

2. distance-based methods involve neighbor queries and distance calculations over a large amount of trajectory data, and therefore have high time overhead and low accuracy;

3. methods based on historical data depend on a large amount of historical data and cannot effectively evaluate a new anomaly pattern;

4. clustering-based methods place high demands on the selection of features and clusters, and generally have a high misjudgment rate.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: in view of the defects of the prior art, a base station data anomaly detection method based on interval division is provided. According to the characteristics of the data collected by base stations, the original data set is divided into a plurality of subsets, different models are then applied to the different types of subsets, and finally an outlier candidate set is obtained.

The invention adopts the following technical scheme for solving the technical problems:

a base station data anomaly detection method based on interval division, the method comprising the steps of:

step (1), preprocessing an original trajectory data set, and dividing the processed data set into dynamic intervals and static intervals; a dynamic interval is the index range spanned by any plurality of consecutive neighboring isolated points, and a static interval is the range given by the start and end indices of each remaining data segment in the original data set;

step (2), model solving: extracting outliers from the dynamic intervals using a multidimensional Gaussian model and a sliding window distance model, and extracting outliers from the static intervals using a center-of-gravity distance scoring method;

step (3), representing the dynamic outliers and the static outliers as five-tuples, the set of five-tuples representing the outlier set.

As a further preferred embodiment of the present invention, the preprocessing rule in step (1) is: data records that do not contain the preset fields are removed, and the cleaned data is de-duplicated and sorted by time.
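The preprocessing rule can be illustrated with a minimal Python sketch. The field names (`account`, `lon`, `lat`, `cptime`) and the dictionary record layout are assumptions chosen for illustration; the text only requires that records lacking the preset fields be removed and that the remaining data be de-duplicated and sorted by time.

```python
from typing import Dict, List

# Assumed field names; the patent only speaks of "preset fields".
REQUIRED_FIELDS = ("account", "lon", "lat", "cptime")

Record = Dict[str, object]


def preprocess(records: List[Record]) -> List[Record]:
    """Drop incomplete records, de-duplicate, and sort by time."""
    seen = set()
    cleaned = []
    for rec in records:
        # Cleaning: discard records that lack any required field.
        if any(rec.get(field) is None for field in REQUIRED_FIELDS):
            continue
        key = tuple(rec[field] for field in REQUIRED_FIELDS)
        # De-duplication: keep only the first copy of identical records.
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    # Time ordering: ascending by the (assumed) timestamp field.
    cleaned.sort(key=lambda r: r["cptime"])
    return cleaned
```

Treating two records as duplicates only when all required fields match is one possible reading; a deployment might de-duplicate on a narrower key.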

As a further preferred embodiment of the present invention, in step (1) the original trajectory data set is divided into intervals using a dynamic interval search algorithm, comprising the following steps:

101. isolated point selection: data that appears only once within a specified time range is taken as an isolated point. Let $l_t = (lon_t, lat_t)$ denote the spatial position at time $t$, consisting of the longitude $lon_t$ and the latitude $lat_t$ at that time, and consider the time segment centered on time $t_i$;

if $l_t$ occurs only once within the corresponding time segment, then $l_t$ is an isolated point;

102. dynamic interval search: the range formed by the start and end indices of any plurality of consecutive neighboring isolated points is called a dynamic interval;

the neighbor relation of two isolated points $l_x$, $l_y$ is defined on their indices, where $\mathrm{index}(l_t)$ denotes the index of isolated point $l_t$ in the original data set: $l_x$ and $l_y$ are neighbors if and only if the difference between $\mathrm{index}(l_x)$ and $\mathrm{index}(l_y)$ is within the neighbor relation threshold;

for a set of isolated points $L = \{l_1, l_2, l_3, \ldots, l_i\}$, if the points in any subset of $L$ satisfy the neighbor relation, then $L$ is referred to as an $i$-neighbor isolated point set;

the index range from the first to the last element of a neighbor isolated point set is a dynamic interval, denoted $I = [\mathrm{index}(l_1), \mathrm{index}(l_i)]$;

103. static interval generation: all dynamic intervals are removed from the index range of the preprocessing result set, and each remaining interval is called a static interval;

let the index interval of the original data set be $S = [0, n]$, and assume dynamic intervals $I_1 = [i, i+k]$ and $I_2 = [j, j+u]$, where $k, u > 0$, $i > 0$, $j > i+k$, and $j+u < n$; then the intervals $J_1 = [0, i-1]$, $J_2 = [i+k+1, j-1]$, and $J_3 = [j+u+1, n]$ are referred to as static intervals.
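Isolated point selection in step 101 can be sketched as follows. This is an illustrative reading rather than the patented formula: the `(lon, lat, time)` tuple layout, the `half_window` parameter (half the width of the time segment, in the same unit as the timestamps), and exact equality of coordinates as the test for "the same position" are all assumptions.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]  # assumed layout: (lon, lat, time)


def isolated_indices(points: List[Point], half_window: float) -> List[int]:
    """Return indices of positions that occur exactly once within
    +/- half_window of their own timestamp."""
    result = []
    for i, (lon, lat, t) in enumerate(points):
        # Count records at the same position inside the time segment
        # centered on t (the record itself is always counted).
        count = sum(
            1
            for (lon2, lat2, t2) in points
            if abs(t2 - t) <= half_window and (lon2, lat2) == (lon, lat)
        )
        if count == 1:
            result.append(i)
    return result
```

The quadratic scan keeps the sketch short; because the preprocessed data is sorted by time, a two-pointer window over the sorted records would avoid rescanning the whole list for each point.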

As a further preferred embodiment of the present invention, in step (2) the model solving of the dynamic interval comprises the following steps:

201. the longitude, latitude, extraction time, and position switching rate of each data sample are extracted and substituted into the Gaussian model to calculate the probability density of each data item over the whole data set; the probability values are sorted in ascending order, and the data corresponding to the first $\lambda$ probability values are added to the outlier candidate set $E_1$. The multidimensional Gaussian model is calculated as:

$$p(x) = \frac{1}{(2\pi)^{N/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^{\mathsf{T}}\,\Sigma^{-1}(x-\mu)\right)$$

where $x$ is the $N$-dimensional feature vector of a data item, $\mu$ is the $N$-dimensional mean vector, $\Sigma$ is the $N \times N$ covariance matrix, and $|\Sigma|$ is the determinant of $\Sigma$;

202. establishing a sliding window distance model: any $2k+1$ consecutive data items $W = w_{i-k}, \ldots, w_{i-1}, w_i, w_{i+1}, \ldots, w_{i+k}$ are selected from the preprocessing result set as a window, where $w_i$ is the center of window $W$, $w_{up} = w_{i-k}, \ldots, w_{i-1}$ is the upper half window of length $k$, and $w_{down} = w_{i+1}, \ldots, w_{i+k}$ is the lower half window of length $k$. Let $R(w_i, w_{up})$ denote the association between the center point $w_i$ and the upper half window $w_{up}$;

$R(w_i, w_{up})$ is computed from $\mathrm{distance}(w_i, w_{i-1})$, the Euclidean distance between the window center $w_i$ and the preceding item $w_{i-1}$, and from the maximum distance between any two positions in the upper half window $w_{up}$;

the window center $w_i$ and the upper half window $w_{up}$ are associated if and only if $R(w_i, w_{up}) = 1$;

similarly, let $R(w_i, w_{down})$ denote the association between the center point $w_i$ and the lower half window $w_{down}$;

$R(w_i, w_{down})$ is computed from $\mathrm{distance}(w_i, w_{i+1})$, the Euclidean distance between the window center $w_i$ and the following item $w_{i+1}$, and from the maximum distance between any two positions in the lower half window $w_{down}$;

the window center $w_i$ and the lower half window $w_{down}$ are associated if and only if $R(w_i, w_{down}) = 1$;

the search for outliers over the preprocessing result set is thereby converted into translating the window $W$ with a fixed step size Step and finding the window centers that satisfy both $R(w_i, w_{up}) = 0$ and $R(w_i, w_{down}) = 0$; each such window center point is added to the outlier candidate set $E_2$.
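The sliding window search can be sketched in Python as below. The formula for $R$ is not reproduced in this text, so the sketch assumes one plausible form: $R = 1$ when the distance from the center to its adjacent item is at most $\delta$ times the maximum pairwise distance inside the half window, and $R = 0$ otherwise. The association threshold `delta`, the function names, and the use of plain coordinate pairs are illustrative assumptions.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Pos = Tuple[float, float]  # assumed layout: (lon, lat)


def max_pairwise(points: Sequence[Pos]) -> float:
    """Maximum Euclidean distance between any two positions."""
    return max((math.dist(a, b) for a, b in combinations(points, 2)),
               default=0.0)


def related(center: Pos, adjacent: Pos, half: Sequence[Pos],
            delta: float) -> int:
    """Assumed form of R: 1 if the step from the center to its adjacent
    item stays within delta times the spread of the half window."""
    return 1 if math.dist(center, adjacent) <= delta * max_pairwise(half) else 0


def window_outliers(points: List[Pos], k: int, delta: float,
                    step: int = 1) -> List[int]:
    """Slide a window of size 2k+1 and collect the centers that are
    associated with neither half window (candidate set E2)."""
    candidates = []
    for i in range(k, len(points) - k, step):
        up = points[i - k:i]            # w_{i-k} .. w_{i-1}
        down = points[i + 1:i + k + 1]  # w_{i+1} .. w_{i+k}
        r_up = related(points[i], points[i - 1], up, delta)
        r_down = related(points[i], points[i + 1], down, delta)
        if r_up == 0 and r_down == 0:
            candidates.append(i)
    return candidates
```

Points closer than `k` positions to either end of the data have no full window and are skipped here; how such boundary points are handled is left open by the text.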

As a further preferred embodiment of the present invention, in step (2) solving for outliers in the static interval using the center-of-gravity distance scoring method comprises the following steps:

203. center of gravity point selection: let $M$ denote the set of all data in the static interval $J$; then $L' = \{l \mid l \in M, \mathrm{freq}_M(l) > \gamma\}$ denotes the position data in $M$ whose occurrence frequency is greater than the threshold $\gamma$, where $\mathrm{freq}_M(l)$ is the frequency with which position $l$ occurs in $M$. The center of gravity point $O$ of the interval is calculated as a weighted average;

the weighted average is taken over the longitudes and latitudes of the positions $l_i \in L'$, each position carrying its own weight, and $n$ is the number of elements in $L'$;

204. distance score calculation: let $\mathrm{distance}(l_x, l_y)$ denote the distance between any two positions; the maximum distance between any element of the set $L'$ and the center of gravity is referred to as the distance radius, expressed as $r = \max_{l \in L'} \mathrm{distance}(l, O)$;

the score of any data item $m$ in the set $M$ is then expressed as:

$$\mathrm{score}_m = \begin{cases} 1, & \mathrm{distance}(m, O) > r \\ 0, & \text{otherwise} \end{cases}$$

and the static interval outlier candidate set is $E_3 = \{m \mid m \in M, \mathrm{score}_m = 1\}$.
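Steps 203 and 204 can be sketched as follows. Several details are assumptions, because the corresponding formulas are not reproduced in this text: each frequent position is weighted by its share of occurrences, distance is plain Euclidean distance on `(lon, lat)` pairs, and a point scores 1 when it lies strictly outside the circle of radius `r` around the center of gravity.

```python
import math
from collections import Counter
from typing import List, Tuple

Pos = Tuple[float, float]  # assumed layout: (lon, lat)


def barycenter_outliers(positions: List[Pos], gamma: int) -> List[Pos]:
    """Return the distinct positions of a static interval that lie
    outside the circle around the frequency-weighted center."""
    freq = Counter(positions)
    # L': positions whose occurrence count exceeds the threshold gamma.
    frequent = {p: c for p, c in freq.items() if c > gamma}
    if not frequent:
        return []
    total = sum(frequent.values())
    # Assumed weights: occurrence count divided by the total count.
    center = (
        sum(p[0] * c for p, c in frequent.items()) / total,
        sum(p[1] * c for p, c in frequent.items()) / total,
    )
    # Distance radius: farthest frequent position from the center.
    radius = max(math.dist(p, center) for p in frequent)
    # score = 1 for positions strictly outside the circle.
    return [p for p in freq if math.dist(p, center) > radius]
```

Because base station coordinates are geographic, a geodesic distance such as the haversine formula may be a better fit than Euclidean distance on raw degrees; the embodiment reports its radius in meters.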

As a further preferable aspect of the present invention, the step (3) specifically includes the steps of:

301. the outlier candidate sets $E_1$ and $E_2$ obtained by solving the dynamic interval are intersected, and the elements common to both are extracted as outliers;

302. the elements of the outlier candidate set $E_3$ obtained by solving the static interval are outliers;

303. a five-tuple Error = [Account, lon, lat, cptime, ErrFlag] is defined to represent each extracted outlier, where ErrFlag indicates the outlier type: ErrFlag = 0 denotes a dynamic outlier and ErrFlag = 1 denotes a static outlier.
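The combination rule in steps 301 to 303 maps directly onto set operations. In the sketch below, the `Error` named tuple mirrors the five-tuple from the text, while the representation of candidates as `(account, lon, lat, cptime)` tuples and the function name are assumptions made for illustration.

```python
from typing import List, NamedTuple, Set, Tuple


class Error(NamedTuple):
    account: str
    lon: float
    lat: float
    cptime: int
    err_flag: int  # 0 = dynamic outlier, 1 = static outlier


Rec = Tuple[str, float, float, int]  # assumed: (account, lon, lat, cptime)


def build_outlier_set(e1: Set[Rec], e2: Set[Rec], e3: Set[Rec]) -> List[Error]:
    """Intersect E1 and E2 for dynamic outliers, take E3 as static ones."""
    dynamic = [Error(*rec, 0) for rec in sorted(e1 & e2)]
    static = [Error(*rec, 1) for rec in sorted(e3)]
    return dynamic + static
```

Requiring agreement between the Gaussian model and the sliding window model means a dynamic point is reported only when both independent criteria flag it.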

Compared with the prior art, the technical scheme of the invention has the following technical effects: the disclosed method is suitable for processing online data, runs quickly with high accuracy, can effectively evaluate new anomaly patterns, and has a low misjudgment rate.

Drawings

Fig. 1 is an overall flow chart of the present invention.

Fig. 2 is a schematic diagram of interval division in the present invention.

Fig. 3 is a schematic diagram of the sliding window distance model in the present invention.

Fig. 4 is a schematic diagram of the center-of-gravity distance scoring method in the present invention.

Fig. 5 is a schematic diagram of the original trajectory of the experimental data in an embodiment of the present invention.

Fig. 6 is a schematic diagram of static interval 1 of the experimental data in an embodiment of the present invention.

Fig. 7 is a schematic diagram of the dynamic interval of the experimental data in an embodiment of the present invention.

Fig. 8 is a schematic diagram of static interval 2 of the experimental data in an embodiment of the present invention.

Detailed Description

Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the present invention and are not to be construed as limiting the present invention.

The technical scheme of the invention is further described in detail below with reference to the accompanying drawings:

The invention discloses a base station data anomaly detection method based on interval division, whose overall flow chart is shown in Fig. 1, comprising the following steps:

Step (1), preprocessing an original trajectory data set, and dividing the processed data set into dynamic intervals and static intervals, where a dynamic interval is the index range spanned by any plurality of consecutive neighboring isolated points, and a static interval is the range given by the start and end indices of each remaining data segment in the original trajectory data set; the interval division is illustrated in Fig. 2.

The preprocessing rule in step (1) is: data records that do not contain fields such as longitude, latitude, and time are removed, and the cleaned data is de-duplicated and sorted by time.

Further, dividing the original trajectory data set into intervals using the dynamic interval search algorithm comprises the following steps:

1. Isolated point selection. Data that appears only once within the specified time range is taken as an isolated point.

Let $l_t = (lon_t, lat_t)$ denote the spatial position at time $t$, consisting of the longitude $lon_t$ and the latitude $lat_t$ at that time, and consider the time segment centered on time $t_i$. If $l_t$ occurs only once within the corresponding time segment, then $l_t$ is an isolated point.

2. Dynamic interval search. The range formed by the start and end indices of any plurality of consecutive neighboring isolated points is called a dynamic interval.

The neighbor relation of two isolated points $l_x$, $l_y$ is defined on their indices, where $\mathrm{index}(l_t)$ denotes the index of isolated point $l_t$ in the original data set: $l_x$ and $l_y$ are neighbors if and only if the difference between $\mathrm{index}(l_x)$ and $\mathrm{index}(l_y)$ is within the neighbor relation threshold. For a set of isolated points $L = \{l_1, l_2, l_3, \ldots, l_i\}$, if the points in any subset of $L$ satisfy the neighbor relation, then $L$ is referred to as an $i$-neighbor isolated point set. The index range from the first to the last element of a neighbor isolated point set is a dynamic interval, denoted $I = [\mathrm{index}(l_1), \mathrm{index}(l_i)]$.

3. Static interval generation. All dynamic intervals are removed from the index range of the preprocessing result set, and each remaining interval is called a static interval. Let the index interval of the original data set be $S = [0, n]$, and assume dynamic intervals $I_1 = [i, i+k]$ and $I_2 = [j, j+u]$, where $k, u > 0$, $i > 0$, $j > i+k$, and $j+u < n$. Then the intervals $J_1 = [0, i-1]$, $J_2 = [i+k+1, j-1]$, and $J_3 = [j+u+1, n]$ are referred to as static intervals.

Step (2), model solving: outliers are extracted from the dynamic intervals using a multidimensional Gaussian model and a sliding window distance model, and from the static intervals using a center-of-gravity distance scoring method. The sliding window distance model is illustrated in Fig. 3 and the center-of-gravity distance scoring method in Fig. 4.

Further, the model solving of the dynamic interval in step (2) comprises the following steps:

1. The longitude, latitude, extraction time, and position switching rate of each data sample are extracted and substituted into the Gaussian model to calculate the probability density of each data item over the whole data set; the probability values are sorted in ascending order, and the data corresponding to the first $\lambda$ probability values are added to the outlier candidate set $E_1$. The multidimensional Gaussian model is calculated as:

$$p(x) = \frac{1}{(2\pi)^{N/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^{\mathsf{T}}\,\Sigma^{-1}(x-\mu)\right)$$

where $x$ is the $N$-dimensional feature vector of a data item, $\mu$ is the $N$-dimensional mean vector, $\Sigma$ is the $N \times N$ covariance matrix, and $|\Sigma|$ is the determinant of $\Sigma$.

2. Sliding window distance model. Any $2k+1$ consecutive data items $W = w_{i-k}, \ldots, w_{i-1}, w_i, w_{i+1}, \ldots, w_{i+k}$ are selected from the preprocessing result set as a window, where $w_i$ is the center of window $W$, $w_{up} = w_{i-k}, \ldots, w_{i-1}$ is the upper half window of length $k$, and $w_{down} = w_{i+1}, \ldots, w_{i+k}$ is the lower half window of length $k$. Let $R(w_i, w_{up})$ denote the association between the center point $w_i$ and the upper half window $w_{up}$.

$R(w_i, w_{up})$ is computed from $\mathrm{distance}(w_i, w_{i-1})$, the Euclidean distance between the window center $w_i$ and the preceding item $w_{i-1}$, and from the maximum distance between any two positions in the upper half window $w_{up}$. The window center $w_i$ and the upper half window $w_{up}$ are associated if and only if $R(w_i, w_{up}) = 1$. Similarly, let $R(w_i, w_{down})$ denote the association between the center point $w_i$ and the lower half window $w_{down}$.

$R(w_i, w_{down})$ is computed from $\mathrm{distance}(w_i, w_{i+1})$, the Euclidean distance between the window center $w_i$ and the following item $w_{i+1}$, and from the maximum distance between any two positions in the lower half window $w_{down}$. The window center $w_i$ and the lower half window $w_{down}$ are associated if and only if $R(w_i, w_{down}) = 1$.

The search for outliers over the preprocessing result set can therefore be converted into translating the window $W$ with a fixed step size Step and finding the window centers that satisfy both $R(w_i, w_{up}) = 0$ and $R(w_i, w_{down}) = 0$. Each such window center point is added to the outlier candidate set $E_2$.
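Step 1 of the dynamic interval solution, ranking data by multidimensional Gaussian density and keeping the $\lambda$ least likely items, can be sketched in plain Python. The sketch assumes that the mean vector and covariance matrix are estimated from the same data (dividing by the sample count) and that the covariance matrix is non-singular; all function names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def mean_and_cov(data: List[Vector]) -> Tuple[Vector, Matrix]:
    """Estimate the mean vector and (population) covariance matrix."""
    n, d = len(data), len(data[0])
    mu = [sum(row[j] for row in data) / n for j in range(d)]
    cov = [[sum((row[a] - mu[a]) * (row[b] - mu[b]) for row in data) / n
            for b in range(d)] for a in range(d)]
    return mu, cov


def solve_and_det(matrix: Matrix, rhs: Vector) -> Tuple[Vector, float]:
    """Solve matrix @ x = rhs by Gaussian elimination with partial
    pivoting; also return det(matrix)."""
    d = len(rhs)
    a = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    det = 1.0
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= a[col][col]
        for r in range(col + 1, d):
            factor = a[r][col] / a[col][col]
            for c in range(col, d + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * d
    for r in range(d - 1, -1, -1):  # back substitution
        x[r] = (a[r][d] - sum(a[r][c] * x[c] for c in range(r + 1, d))) / a[r][r]
    return x, det


def gaussian_density(x: Vector, mu: Vector, cov: Matrix) -> float:
    """Multivariate normal density p(x) for mean mu and covariance cov."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    sol, det = solve_and_det(cov, diff)  # sol = cov^{-1} (x - mu)
    quad = sum(di * si for di, si in zip(diff, sol))
    return math.exp(-0.5 * quad) / math.sqrt((2 * math.pi) ** len(x) * det)


def lowest_density_indices(data: List[Vector], lam: int) -> List[int]:
    """Indices of the lam items with the smallest density (candidate set E1)."""
    mu, cov = mean_and_cov(data)
    dens = [gaussian_density(x, mu, cov) for x in data]
    return sorted(range(len(data)), key=lambda i: dens[i])[:lam]
```

Solving the linear system instead of inverting the covariance matrix keeps the sketch short; features with very different scales (degrees versus seconds) may need rescaling for numerical stability.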

Further, solving for outliers in the static interval using the center-of-gravity distance scoring method comprises the following steps:

3. Center of gravity point selection. Let $M$ denote the set of all data within the static interval $J$. Then $L' = \{l \mid l \in M, \mathrm{freq}_M(l) > \gamma\}$ denotes the position data in $M$ whose occurrence frequency is greater than the threshold $\gamma$, where $\mathrm{freq}_M(l)$ is the frequency with which position $l$ occurs in $M$. The center of gravity point $O$ of the interval is calculated as a weighted average.

The weighted average is taken over the longitudes and latitudes of the positions $l_i \in L'$, each position carrying its own weight, and $n$ is the number of elements in $L'$.

4. Distance score calculation. Let $\mathrm{distance}(l_x, l_y)$ denote the distance between any two positions. The maximum distance between any element of the set $L'$ and the center of gravity is called the distance radius, expressed as $r = \max_{l \in L'} \mathrm{distance}(l, O)$. The score of any data item $m$ in the set $M$ is then $\mathrm{score}_m = 1$ if $\mathrm{distance}(m, O) > r$ and $\mathrm{score}_m = 0$ otherwise.

Step (3), representing the dynamic outliers and the static outliers as five-tuples, the set of five-tuples representing the outlier set. This specifically comprises the following steps:

1. the outlier candidate sets $E_1$ and $E_2$ obtained by solving the dynamic interval are intersected, and the elements common to both are extracted as outliers;

2. the elements of the outlier candidate set $E_3$ obtained by solving the static interval are outliers;

3. a five-tuple Error = [Account, lon, lat, cptime, ErrFlag] is defined to represent each extracted outlier, where ErrFlag indicates the outlier type: ErrFlag = 0 denotes a dynamic outlier and ErrFlag = 1 denotes a static outlier.

The following detailed description of the embodiments of the invention refers to the accompanying drawings and tables.

Part of the experimental data, listed in Table 1, is taken as an example:

Table 1: Part of the experimental data

The base station data anomaly detection method based on interval division according to the invention comprises the following steps:

1. Preprocessing

Data that does not contain fields such as account number, longitude, latitude, and extraction time is removed, and the data that conforms to the rules is sorted by time.

2. Section division

1) Isolated point selection. In the present invention, $l_t$ is an isolated point if it occurs only once within the time segment $T_t$, where $T_t$ covers the 2 hours before and after. The data numbered 20, 34-39, and 72 in Table 1 are selected as isolated points because each appears only once within the 2 hours before and after it.

2) Dynamic interval search. In the present invention, the range formed by the start and end indices of any plurality of consecutive neighboring isolated points is referred to as a dynamic interval, with the neighbor relation threshold $\mu = 10$. In Table 1, No. 20 and No. 34 are 14 apart and do not satisfy the neighbor condition; No. 39 and No. 72 are 33 apart and do not satisfy it either. Numbers 34-39 satisfy the neighbor condition and constitute a 6-neighbor isolated point set. Thus [34,39] is a dynamic interval.

3) Static interval generation. Within the index range, all dynamic intervals are removed, and each remaining interval is referred to as a static interval. The static intervals are therefore [1,33] and [40,73].
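The interval division in this example can be reproduced with a short Python sketch. It assumes that two isolated points are neighbors when their indices differ by at most $\mu$, that a dynamic interval needs at least two chained isolated points, and that records are numbered 1 to 73 as in Table 1; the function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def dynamic_intervals(isolated: List[int], mu: int) -> List[Interval]:
    """Group isolated-point indices into runs whose consecutive gaps are
    at most mu; runs with a single point form no dynamic interval."""
    intervals = []
    run: List[int] = []
    for idx in sorted(isolated):
        if run and idx - run[-1] > mu:
            if len(run) > 1:
                intervals.append((run[0], run[-1]))
            run = []
        run.append(idx)
    if len(run) > 1:
        intervals.append((run[0], run[-1]))
    return intervals


def static_intervals(first: int, last: int,
                     dynamic: List[Interval]) -> List[Interval]:
    """Complement of the dynamic intervals within [first, last]."""
    result = []
    start = first
    for lo, hi in sorted(dynamic):
        if lo > start:
            result.append((start, lo - 1))
        start = hi + 1
    if start <= last:
        result.append((start, last))
    return result


# Isolated points from the example: No. 20, No. 34-39, and No. 72.
isolated = [20] + list(range(34, 40)) + [72]
dyn = dynamic_intervals(isolated, mu=10)
print(dyn)                           # [(34, 39)]
print(static_intervals(1, 73, dyn))  # [(1, 33), (40, 73)]
```

Isolated points No. 20 and No. 72 have no neighboring isolated point, so they remain inside static intervals and are later handled by the center-of-gravity distance scoring.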

The method thus divides the original data set into dynamic intervals and static intervals that are solved separately, as shown in Fig. 5.

3. Model solving

1) Gaussian model + sliding window distance model

Whether a point in the dynamic interval is anomalous depends on its context, so when solving the dynamic interval in the experiment, the interval is extended by 5 position records on each side to assist the calculation.

First, longitude, latitude, time, and position switching rate are chosen as the 4 dimensions of the multidimensional Gaussian model, and the probability density is calculated over [29,44]. The probability values are sorted in ascending order, and the data 36, 37, 38 corresponding to the first $\lambda = 3$ probability values are added to the outlier candidate set $E_1$.

Next, using a sliding window of size $2k+1$ ($k = 5$), the window center is moved from 34 to 39, and the degree of association of each point in the interval [34,39] with its context is determined, with the association threshold $\delta = 2$.

Take No. 37 as an example, with the window center at 37. First, the maximum pairwise Euclidean distance of the upper half window 32-36 is calculated as 0.37251550801928224, and that of the lower half window 40-44 as 0.23724898399045094; point 37 is at a distance of 1.3443963712052325 from point 36 above it and 2.0689669522578904 from point 38 below it. As shown in Fig. 7, No. 37 is insufficiently associated with both the upper half window and the lower half window. Finally, 37 is added to the outlier candidate set $E_2$.
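The decision for No. 37 can be checked numerically. The formula for $R$ is not reproduced in this text, so the check below assumes that $R = 1$ when the distance to the adjacent point is at most $\delta$ times the maximum pairwise distance of the half window, and $R = 0$ otherwise. Under that assumption, with $\delta = 2$ and the values quoted above:

```latex
\begin{aligned}
\delta \cdot \max\nolimits_{up} &= 2 \times 0.3725 \approx 0.7450 < 1.3444 = \mathrm{distance}(w_{37}, w_{36})
  &&\Rightarrow\; R(w_{37}, w_{up}) = 0, \\
\delta \cdot \max\nolimits_{down} &= 2 \times 0.2372 \approx 0.4745 < 2.0690 = \mathrm{distance}(w_{37}, w_{38})
  &&\Rightarrow\; R(w_{37}, w_{down}) = 0.
\end{aligned}
```

Both associations fail, which is consistent with No. 37 being added to $E_2$.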

2) Center-of-gravity distance scoring

The invention proposes solving for outliers in a static interval using a center-of-gravity distance scoring algorithm, as follows:

Center of gravity point selection. First, the frequency of each position point in the static interval is calculated. If the frequency is greater than the threshold 2, the position is taken as an influencing factor of the center of gravity. Static interval 1, shown in Fig. 6, is taken as an example. The frequency of each point in the interval is as follows:

The center of gravity is therefore influenced by positions No. 1 to No. 5, and the weighted average method gives the center of gravity O (120.01242, 30.28419). The distance between each of No. 1 to No. 5 and the center of gravity O is calculated, and the maximum value is taken as the radius r. The position farthest from O is position No. 2, at a distance of about 1509 meters. A circle is drawn with O as its center and r as its radius; points outside the circle are outliers.

Similarly, as shown in Fig. 8, the data numbered 72 in static interval 2 is abnormal and is added to the static interval outlier candidate set $E_3$.

4. Representing the outlier set with five-tuples

The representation of outliers comprises the following steps:

1) The outlier candidate sets $E_1$ and $E_2$ obtained by solving the dynamic interval are intersected, and the elements common to both are extracted as outliers;

2) The elements of the outlier candidate set $E_3$ obtained by solving the static interval are outliers;

3) A five-tuple Error = [Account, lon, lat, cptime, ErrFlag] is defined to represent each extracted outlier, where ErrFlag indicates the outlier type: ErrFlag = 0 denotes a dynamic outlier and ErrFlag = 1 denotes a static outlier.

Thus, the final set of outliers is expressed as:

[136****9106,120.24317,30.27825,1520317737,0]

[136****9106,120.42331,30.21835,1520333487,1]。

The embodiments of the present invention have been described in detail above with reference to the drawings, but the present invention is not limited to these embodiments; various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention. Modifications, variations in detail, and equivalent embodiments all fall within the spirit and scope of the present invention.

Claims

1. A base station data anomaly detection method based on interval division, the method comprising the steps of:

comprising the following steps:

if $l_t$ occurs only once within the corresponding time segment, then $l_t$ is an isolated point;

the neighbor relation of two isolated points $l_x$, $l_y$ is defined on their indices, where $\mathrm{index}(l_t)$ denotes the index of isolated point $l_t$ in the original data set: $l_x$ and $l_y$ are neighbors if and only if the difference between $\mathrm{index}(l_x)$ and $\mathrm{index}(l_y)$ is within the neighbor relation threshold; for a set of isolated points $L = \{l_1, l_2, l_3, \ldots, l_i\}$, if the points in any subset of $L$ satisfy the neighbor relation, then $L$ is referred to as an $i$-neighbor isolated point set;

let the index interval of the original data set be $S = [0, n]$, and assume dynamic intervals $I_1 = [i, i+k]$ and $I_2 = [j, j+u]$, where $k, u > 0$, $i > 0$, $j > i+k$, and $j+u < n$; then the intervals $J_1 = [0, i-1]$, $J_2 = [i+k+1, j-1]$, and $J_3 = [j+u+1, n]$ are called static intervals;

the model solving of the dynamic interval comprises the following steps:

201. the longitude, latitude, extraction time, and position switching rate of each data sample are extracted and substituted into the Gaussian model to calculate the probability density of each data item over the whole data set; the probability values are sorted in ascending order, and the data corresponding to the first $\lambda$ probability values are added to the outlier candidate set $E_1$; the calculation formula of the multidimensional Gaussian model is as follows:

202. establishing a sliding window distance model: any $2k+1$ consecutive data items $W = w_{i-k}, \ldots, w_{i-1}, w_i, w_{i+1}, \ldots, w_{i+k}$ are selected from the preprocessing result set as a window, where $w_i$ is the center of window $W$, $w_{up} = w_{i-k}, \ldots, w_{i-1}$ is the upper half window of length $k$, and $w_{down} = w_{i+1}, \ldots, w_{i+k}$ is the lower half window of length $k$; let $R(w_i, w_{up})$ denote the association between the center point $w_i$ and the upper half window $w_{up}$, expressed as:

where $\mathrm{distance}(w_i, w_{i-1})$ is the Euclidean distance between the window center $w_i$ and the preceding item $w_{i-1}$, the maximum distance is taken between any two positions in the upper half window $w_{up}$, and $\delta$ is the association threshold;

the search for outliers over the preprocessing result set is converted into translating the window $W$ with a fixed step size Step and finding the window centers that satisfy both $R(w_i, w_{up}) = 0$ and $R(w_i, w_{down}) = 0$; each such window center point is added to the outlier candidate set $E_2$;

solving for outliers in the static interval using the center-of-gravity distance scoring method comprises the following steps:

the score $\mathrm{score}_m$ of any data item $m$ in the set $M$ is expressed as:

then the static interval outlier candidate set is $E_3 = \{m \mid m \in M, \mathrm{score}_m = 1\}$;

Step (3) using five-tuple to represent dynamic abnormal points and static abnormal points to form a five-tuple set to represent abnormal point sets;

the step (3) specifically comprises the following steps:

2. The base station data anomaly detection method based on interval division as claimed in claim 1, wherein the preprocessing rule in step (1) is: data records that do not contain the preset fields are removed, and the cleaned data is de-duplicated and sorted by time.