CN112559571A - Approximate outlier calculation method and system for numerical type stream data - Google Patents


Info

Publication number
CN112559571A
Authority
CN
China
Prior art keywords
stream data
data
outlier
newly added
sliding window
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011518175.3A
Other languages
Chinese (zh)
Inventor
田增垚
任吉媛
宋阳
王哲
罗真
刘阜阳
沈毅
李丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeast Branch Of State Grid Corp Of China
Original Assignee
Northeast Branch Of State Grid Corp Of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeast Branch Of State Grid Corp Of China
Priority to CN202011518175.3A
Publication of CN112559571A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24568Data stream processing; Continuous queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2246Trees, e.g. B+trees
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2264Multidimensional index structures

Abstract

The invention provides an approximate outlier calculation method and system for numerical type stream data, wherein the method comprises the following steps: acquiring newly added stream data; inserting the stream data object into a corresponding leaf node in a preset index structure according to the known coordinate information of each dimension of the newly added stream data; searching an object set with the distance to the newly added stream data smaller than a preset radius; and comparing the number of the objects in the object set with a preset number threshold, and if the number of the objects in the object set is smaller than the preset number threshold, outputting the newly added stream data to an outlier set. Compared with the prior art, the method and the system can effectively reduce the space cost of the candidate outliers.

Description

Approximate outlier calculation method and system for numerical type stream data
Technical Field
The invention relates to the technical field of data processing, in particular to an approximate outlier calculation method and system for numerical type stream data.
Background
With the continuous development of computer technology, streaming data is becoming one of the mainstream data types. Compared with traditional data types, stream data is characterized by large scale and high transmission speed. According to statistics, the Tencent data platform processes more than 30 trillion real-time data items per day, and during Alibaba's Double Eleven shopping festival the transaction peak reached 325,000 per second and the payment peak reached 256,000 per second. Outlier detection is an important class of queries in a data stream environment. It finds abnormal data in massive data and has important applications in fields such as network security.
In the known art, researchers typically maintain a set of candidate outliers, manage the stream data by means of multidimensional indexing techniques, identify data that cannot become an outlier based on the order in which data entered the window, and move such data out of the candidate set. However, stream data is usually large in scale and frequently updated, so indexing the stream data and maintaining the candidate set incur high computation and space costs. In the prior art, Yang et al., in "Neighbor-based pattern detection for windows over streaming data", propose a predictive outlier detection algorithm based on the time-sequence relation between stream data. The algorithm determines when and how likely a candidate will become an outlier from the number of objects surrounding the candidate and the temporal relation between these objects and the candidate. In most cases it needs to maintain only a small number of candidate outliers to support outlier detection. M. Kontaki et al., in "Continuous monitoring of distance-based outliers over data streams", propose MCOD, an algorithm based on micro-clusters. It partitions the data into a set of micro-clusters and manages them with an M-Tree index. Its advantage is that the micro-clusters can be used to quickly filter out non-outliers. However, MCOD is very sensitive to the distribution and dimensionality of the data, and the benefit of micro-clusters declines when the dimensionality is high. In addition, Cao et al., in "Scalable distance-based outlier detection over high-volume data streams", propose LEAP, an algorithm designed for high-speed streams. LEAP exploits the characteristics of high-speed streams to rapidly filter data arriving in the same time period, and uses an R-Tree to manage the stream data.
However, LEAP is very sensitive to the size and flow rate of the data stream. When the flow rate is small, the computational efficiency of LEAP decreases rapidly. In summary, the above-mentioned known technologies all require high computational cost and space cost to detect outliers in a streaming data environment, and cannot support outlier detection in a high-speed streaming environment.
Disclosure of Invention
In order to solve at least one technical problem in the prior art, the invention provides an approximate outlier calculation method and system for numerical flow data by analyzing various flow data including representative financial data, network flow data, medical detection data and the like.
According to one embodiment of one aspect of the present application, the present application provides an approximate outlier calculation method for numeric type stream data, comprising the following steps: acquiring newly added stream data; inserting the stream data object into a corresponding leaf node in a preset index structure according to the known coordinate information of each dimension of the newly added stream data; searching an object set with the distance to the newly added stream data smaller than a preset radius; and comparing the number of the objects in the object set with a preset number threshold, and if the number of the objects in the object set is smaller than the preset number threshold, outputting the newly added stream data to an outlier set.
According to an embodiment of another aspect of the present application, there is also provided an approximate outlier computing system for numeric streaming data, the system being capable of performing the steps included in any of the methods of the present application.
Compared with the prior art, the method can identify, according to the time-sequence relation in which an object and its neighbors arrive in the window, objects that are non-outliers with a probability guarantee, so that such objects need not be maintained in the candidate set, which reduces the computation cost and the space maintenance cost.
Drawings
FIG. 1 is a schematic flow chart of a method for calculating approximate outliers of numeric flow data according to a preferred embodiment of the present invention;
FIG. 2 is an index structure diagram of the method for detecting approximate outliers of streaming data according to a preferred embodiment of the present invention;
FIG. 3 is a schematic structural diagram of the data management module according to a preferred embodiment of the present invention;
FIG. 4 is a schematic block diagram of a numerical streaming data oriented approximate outlier computing system in accordance with a preferred embodiment of the present invention;
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The main idea of the present invention is to adaptively adjust the maintenance policy for target data and related data (e.g. the neighbors of the target data) by calculating the probability that the target data object becomes an outlier. Compared with the prior art, the method can effectively reduce the space cost of handling candidate outliers. Specifically, the present application calculates the probability that a target data object and its related data will become outliers in a future period of time, based on the distribution of the times at which the neighbors of the target data object flow in. On this basis, a candidate outlier maintenance method with probability guarantee is provided. More specifically, the method: uses a sliding window model to capture the timeliness of large-scale stream data; constructs an index structure over the data in the sliding window based on a B-Tree and/or the Z-address technique, which maintains the coordinate relations among the stream data; performs B-Tree-based range queries on the index to find the neighbor object set of each object; calculates the probability of each object becoming an outlier by using an outlier prediction algorithm based on the central limit theorem; and finally constructs the outlier candidate set adaptively according to the probability calculation result, so that computation cost and space cost can be significantly reduced. The B-Tree and Z-address techniques themselves are prior art.
An outlier may be defined as follows: take the target data object as the center and draw a circle with a preset radius r, where r is a threshold. If the number of data objects inside the circle is smaller than a preset threshold k, the target data object is judged to be an outlier; otherwise it is a non-outlier. In other words, a target data object with many neighbors is not an outlier. In addition, whether an object is a non-outlier with a probability error guarantee is determined from arrival times. For example, the target data object has k neighbors with the latest arrival times, and the arrival time of each of these neighbors differs from that of the target data object; whether the target data object is a non-outlier with a probability error guarantee is determined from these differences and a preset calculation rule. The specific determination process is described in detail below.
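The distance-based definition above can be illustrated with a brute-force check. This is only a minimal sketch of the definition, not the indexed method of the application; the function name `is_outlier` and the use of Euclidean distance are assumptions made for illustration.

```python
import math

def is_outlier(target, others, r, k):
    """Return True if fewer than k of `others` lie within distance r of `target`.

    Assumption: distance is Euclidean; `target` itself is not in `others`.
    """
    neighbors = sum(1 for p in others if math.dist(target, p) < r)
    return neighbors < k
```

With r = 1.0 and k = 2, a point with a single nearby object is reported as an outlier, while a point with two nearby objects is not.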
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings and specific embodiments.
In order to facilitate understanding of the concept of the present application, a data processing procedure of the approximate outlier calculation method for numerical type stream data according to the present application is described in detail below with reference to the accompanying drawings. The data processing process of the present application can be applied to terminal devices such as servers, mobile phones, personal computers, portable devices, and the like, and establishes a connection with a third party device through a communication network such as the internet, a local area network, a wide area network, and the like to acquire corresponding target data.
The stream data of the present application may refer to the set of all data within a preset time, for example the set of all data entering a terminal or captured by the terminal within thirty minutes from a certain time node. More specifically, unit stream data may be divided by time or by data size. When divided by time, a unit is, for example, the data flowing into or captured by the terminal in each unit of time. When divided by data size, a unit is, for example, the amount of data that a data window can accommodate: if a data window can hold ten thousand pieces of data of a certain type, then each time a new piece of data enters the window, an old piece of data flows out of it. In addition, the stream data of the present application mainly refers to numeric-type stream data, including but not limited to integer or floating-point stream data, such as stream data from environmental, financial, medical, and network security monitoring, for example financial data, medical detection data, and sensor data generated in real time. Taking financial stream data as an example, two-dimensional data composed of stock transaction amount and stock transaction time may be analyzed and processed as the numeric-type stream data of the present application. The method and system of the present application are particularly applicable to stream data in power networks.
Next, refer to FIG. 1 to FIG. 3. FIG. 1 shows a flow diagram of the approximate outlier calculation method for numeric stream data according to one embodiment of the present application. FIG. 2 is an index structure diagram of the stream data approximate outlier detection method; it shows a two-level index structure in which the first-level index is a B-Tree that maintains the position information of the cubes in which stream data is located, and the second-level index is a set of cubes that maintain the coordinate information of the objects and, by means of an inverted list, the order in which the objects arrived in the window. FIG. 3 is a schematic structural diagram of the data management module according to the present invention, which comprises three modules. The first module is a sliding window module, which maintains the time information of objects flowing into and out of the window and marks the time range separating valid from expired stream data. The second module is the index module for stream data. The third module is the candidate object maintenance module, which maintains the objects whose probability of becoming an outlier is greater than a certain threshold, together with, for each such object, the m latest-arriving of its k neighbors. Here m is calculated as follows: given a stream data object o, if u of its neighbors in the window arrived later than o, then m = k - u, where k and u are positive integers.
According to fig. 1, in step S101, newly added stream data is acquired.
Specifically, a server or client may establish a connection with a device such as a third-party server through a communication network such as the internet, a local area network, or a wide area network, and acquire the newly added target stream data from that device. Newly added stream data here means data newly generated or newly acquired after a preset time point; accordingly, stream data acquired before the preset time point is old stream data. The stream data may be acquired using prior-art techniques, which are not limited herein.
In order to better distinguish the newly added stream data from the failed stream data, the application adopts a sliding window or data window with a preset length as a distinguishing medium, the sliding window with the preset length can accommodate a preset amount of data, such as ten thousand pieces of data of a certain type, and after a new stream of data enters the sliding window, an old stream of data or the failed stream of data simultaneously flows out of the sliding window.
Suppose o is an object newly flowing into a sliding window of preset length N, [0, 1] is the value range of the data, d is the dimension of the data, and o.t is the time at which the object o flows into the sliding window. More specifically, the set of stream data may be written as
$$W = \{(t_1, p_1^1, p_2^1, \ldots, p_d^1), \ldots, (t_m, p_1^m, p_2^m, \ldots, p_d^m)\}$$
The set contains the m data objects in the sliding window W, where $t_i, p_1^i, p_2^i, \ldots, p_d^i$ represent, respectively, the time at which object $o_i$ flows into the sliding window and its data in dimensions 1 to d. The sliding window here may refer to a data window of preset size, for example a data window capable of accommodating ten thousand pieces of stream data. Whenever a number of pieces of stream data enter the sliding window, an equal number flow out; the outgoing pieces are those that entered the sliding window earliest, and they are regarded as expired data.
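A count-based sliding window of this kind can be sketched as follows. The class name `SlidingWindow` and its `push` method are hypothetical names chosen for illustration; the sketch assumes objects are pushed in arrival order and that exactly one object expires per arrival once the window is full.

```python
from collections import deque

class SlidingWindow:
    """Count-based window holding at most `capacity` (time, point) pairs."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()  # oldest object at the left

    def push(self, t, point):
        """Add a new object; return the expired object, or None if nothing expired."""
        self.items.append((t, point))
        if len(self.items) > self.capacity:
            return self.items.popleft()  # the earliest arrival flows out
        return None
```

The returned expired object is what a caller would then delete from the index structure.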
In step S102, the stream data object is inserted into a corresponding leaf node in a preset index structure according to the known coordinate information of each dimension of the newly added stream data.
Specifically, the preset index structure of the present application preferably adopts a two-layer index structure based on a prior-art B-Tree. The first-layer index is a set of cubes or squares, and stream data is stored in the corresponding cube or square according to its coordinate information (also called its position). For example, the square is a 4 × 4 grid in which the cell at the lower-left corner has ID 0, the following cells from left to right have IDs 1, 2, and 3, the next cells from right to left have IDs 4, 5, 6, and 7, and so on; the two-layer B-Tree index structure stores stream data according to the size of the grid cells.
Specifically, the stream data in the present application is mostly multidimensional. A two-dimensional square or multidimensional cube is constructed from the value range of the data and is divided evenly into a set of small cells, for example small squares whose diagonal length is r. Assume the value range of the target object is 0 to 100 in each dimension. If the radius r is 1, the diagonal of each small cube has length 1, so in two dimensions the side length of each small cube is √2/2; in three dimensions the side length is √3/3, and so on. Each small cube is given a number, and the numbers are indexed by the B-Tree, which is prior art. The index of the present application is built according to the specific dimensionality of the stream data. The identification number (ID) of each small cube is stored in a leaf node of the B-Tree.
In addition, to better distinguish stream data arriving at different times, the stream data in each individual cube is arranged in descending or ascending order of the time at which it flowed into the sliding window of preset length. The second-level index is a B-Tree: an integer number is assigned to every non-empty cube, and the set of numbers is maintained with the B-Tree structure. The ordering can be understood as follows. Several pieces of data may be stored in a single cube; they cannot be distinguished by cube coordinates and can only be stored and distinguished by time. For example, if the data range is 0 to 100, the first cell covers 0 to 25. Suppose there are two pieces of data, the first with coordinates <10, 10> and the second with coordinates <10, 15>; both fall in the first cell. If the first arrives at time 100 and the second at time 200, the data in that cell is stored in this time order.
In addition, step S102 may further include the following processing of newly arrived or expired stream data: when newly added stream data flows into the sliding window of preset length, it is inserted into the proper position in the preset index structure according to a preset insertion rule; when stream data expires and flows out of the sliding window, it is deleted from the index structure; and, to keep the B-Tree balanced, the topology of the B-Tree is readjusted according to the deletion result. For example, if the data in a cell is removed, the cell becomes empty and should be deleted to save space. Once the cell is deleted, the corresponding leaf node may no longer hold any data, in which case the leaf node is deleted as well. Deleting leaf nodes may unbalance the tree, so the index structure is adjusted to restore balance.
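The side lengths quoted above follow from the fact that a d-dimensional cube with side s has diagonal s·√d, so a diagonal of r gives s = r/√d. A minimal sketch of that arithmetic (the function name `cell_side` is an assumption for illustration):

```python
import math

def cell_side(r, d):
    """Side length of a d-dimensional cube whose main diagonal has length r."""
    return r / math.sqrt(d)
```

For r = 1 this gives √2/2 in two dimensions and √3/3 in three dimensions, matching the values in the text.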
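The insert and expire handling can be sketched with a plain dictionary standing in for the B-Tree over non-empty cells. This substitution is an assumption made to keep the example short: a dictionary needs no rebalancing, so the B-Tree adjustment step is not shown. The class name `CellIndex` and its methods are hypothetical.

```python
from collections import deque

class CellIndex:
    """Dict-based stand-in for the index over non-empty cells (assumption)."""

    def __init__(self):
        # cell ID -> deque of (arrival_time, point), oldest first
        self.cells = {}

    def insert(self, cell_id, t, point):
        """Append a newly arrived object; arrival order keeps each cell time-sorted."""
        self.cells.setdefault(cell_id, deque()).append((t, point))

    def expire(self, cell_id):
        """Remove the oldest object of a cell and drop the cell once it is empty."""
        cell = self.cells[cell_id]
        expired = cell.popleft()
        if not cell:
            del self.cells[cell_id]  # an empty cell is deleted to save space
        return expired
```

Using the two objects from the example above (arrival times 100 and 200 in the first cell), expiring twice removes them in time order and then removes the cell itself.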
Taking the stream data object o newly flowing into the sliding window in step S101 as an example, and the length of the cube in the corresponding index structure is L, based on the multiple practices of the inventor, it is preferable to determine the number or ID identification number of the cube containing the stream data object o by using the following formula (1):
$$ID = \sum_{i} \left( \frac{o[i]}{L} \right) \ll \left( i \cdot \log \frac{1}{L} \right) \qquad (1)$$
wherein o[i] represents the coordinate of the stream data object o in the i-th dimension, and i is a positive integer. After the ID of the cube is calculated, the index structure is accessed and the leaf node e containing the ID is located. If e already contains a cube whose number equals the ID, the object o is inserted into that cube; otherwise, a cube containing o is created in e and o is inserted into it.
Here, the object o in the index structure may be represented by a two-tuple <pos, t>, where pos represents the position information of o and t represents the time at which o arrived in the window.
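One possible reading of formula (1) is sketched below. Several details are assumptions not stated in the text: the quotient o[i]/L is floored to an integer cell index, the logarithm is base 2 and rounded up to a whole number of bits, dimensions are indexed from 0, and coordinates lie in [0, 1] as in the value range given for step S101. The function name `cube_id` is hypothetical.

```python
import math

def cube_id(o, L):
    """Pack per-dimension cell indices into one integer by bit shifting.

    Assumptions: 0 < L <= 1, every coordinate of o lies in [0, 1],
    and dimension i occupies bits [i * bits, (i + 1) * bits).
    """
    bits = math.ceil(math.log2(1 / L))   # bits needed per dimension
    max_index = (1 << bits) - 1
    result = 0
    for i, coord in enumerate(o):
        # Floor to a cell index; clamp so that coord == 1.0 stays in range.
        index = min(int(coord / L), max_index)
        result += index << (i * bits)
    return result
```

With L = 0.25 each dimension has four cells and uses two bits, so the point (0.3, 0.6) falls in cell (1, 2) and receives ID 1 + (2 << 2) = 9.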
In step S103, an object set whose distance from the newly added stream data is smaller than a preset radius is searched.
Specifically, for the newly added stream data o, the established index structure is accessed to find the stream data objects whose distance from o is less than the threshold R. During the access, the cubes that intersect C(o, R) are found, where C(o, R) denotes the circle or sphere centered at the coordinates of o with radius R. For each cube found to intersect C(o, R), the stream data objects in the cube are accessed, and those whose distance from o is smaller than R are selected and treated as neighbors of o.
More specifically, according to the ID1 of the cube c(o) in which the object o is located and a preset calculation rule, the IDs {ID2, ID3, …, IDm} (m = 3^d) of the 3^d - 1 neighbor cubes of c(o) are calculated. Then, for each neighbor cube of c(o), a search operation is executed on the B-Tree; taking ID2 as an example, the data objects in the cube with ID2 are found, and those whose distance to the newly added stream data is smaller than the preset radius are inserted into the object set. Here, taking the first neighbor cube of c(o), located at the lower left of c(o), as an example, its ID is calculated using a prior-art method such as Z-address.
Preferably, in order to reduce the number of searches on the B-Tree, the neighbor cubes are first sorted by ID. When a leaf node e is visited, suppose e contains the cubes with sequence numbers in [e.s, e.e], where e.s and e.e denote the smallest and largest sequence numbers in e, and let N(c) be the set of neighbor cubes of c(o). Then, when e is accessed, all cubes that belong to both N(c) and e can be accessed in the same visit.
In step S104, the number of objects in the object set is compared with a preset number threshold. If the number of objects in the object set is smaller than the preset number threshold, the newly added stream data is output to the outlier set; otherwise, the process proceeds to step S105.
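The cell-filtered range query described above can be sketched as follows. The sketch makes several assumptions: cells are stored in a dictionary keyed by a tuple of per-dimension cell indices (instead of a B-Tree over packed IDs), the cells scanned are those overlapping the bounding box of C(o, R), which is a superset of the cells intersecting C(o, R), and o itself is not yet stored in `cells`. The function name `range_query` is hypothetical.

```python
import itertools
import math

def range_query(cells, o, R, L):
    """Return stored points whose Euclidean distance to o is smaller than R.

    cells: dict mapping a tuple of cell indices to a list of points.
    L: side length of a cell.
    """
    lo = [math.floor((c - R) / L) for c in o]
    hi = [math.floor((c + R) / L) for c in o]
    neighbors = []
    # Visit every cell overlapping the bounding box of the query ball.
    for key in itertools.product(*(range(a, b + 1) for a, b in zip(lo, hi))):
        for p in cells.get(key, ()):
            if math.dist(p, o) < R:  # exact distance check inside candidate cells
                neighbors.append(p)
    return neighbors
```

Only objects in nearby cells are compared against o, which is the filtering effect the index is meant to provide.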
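Enumerating the adjacent cubes can be sketched by working on per-dimension cell indices rather than packed IDs; this representation is an assumption for illustration, as is the function name `neighbor_cells`. The sketch does not clip indices at the boundary of the grid.

```python
import itertools

def neighbor_cells(cell):
    """Return the 3**d - 1 cells adjacent to `cell` (a tuple of cell indices)."""
    d = len(cell)
    return [
        tuple(c + s for c, s in zip(cell, offset))
        for offset in itertools.product((-1, 0, 1), repeat=d)
        if any(offset)  # skip the all-zero offset, which is the cell itself
    ]
```

In two dimensions this yields 8 neighbors, the first of which is the lower-left cell; in three dimensions it yields 26.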
In step S105, according to the time sequence generated by each object in the object set, the arrival sequence number of each object is determined and a plurality of objects with the smallest arrival sequence number relative to the newly added stream data are determined, wherein the number of the plurality of objects is preferably equal to a preset number threshold.
Specifically, the arrival sequence number can be understood as follows: for example, the data in the sliding window is numbered from 0 to 10000 in order of arrival, so that the first piece of data to arrive has sequence number 0 and the last has sequence number 10000.
Here, the plurality of objects having the smallest arrival sequence numbers with respect to the newly added stream data, for example, the newly added stream data has three neighbors whose arrival sequence numbers are 900, 1000, and 1500, respectively, and the preset number threshold is 2, and the 2 objects having the smallest arrival sequence numbers with respect to the newly added stream data are the neighbors whose arrival sequence numbers are 900 and 1000, respectively.
More specifically, the data objects in the outlier candidate set are scanned for k data objects closest to the target data object arrival sequence number. Here, the scanning method may be implemented by a median search based method. More specifically, 2k objects in the range query result set R are first put into an array, then a median is found according to the 2k data, and objects with arrival times earlier than the median are deleted from R. Thereafter, the above process is repeated until k objects are found that flow into the sliding window with the latest time.
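The buffer-and-prune scan described above can be sketched as follows. For brevity the sketch sorts the 2k-element buffer instead of running a linear-time median selection; that substitution is an assumption and changes only the constant cost of each pruning round, not the result. The function name `latest_k` is hypothetical.

```python
def latest_k(arrival_times, k):
    """Return the k largest arrival times, newest first, using a 2k-element buffer."""
    buf = []
    for t in arrival_times:
        buf.append(t)
        if len(buf) >= 2 * k:
            # Stand-in for median selection: keep the newer half, drop the older half.
            buf.sort(reverse=True)
            del buf[k:]
    buf.sort(reverse=True)
    return buf[:k]
```

At no point does the buffer hold more than 2k entries, regardless of how many neighbors the range query returns.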
In step S106, the probability that the newly added stream data becomes an outlier is determined according to the time when the newly added stream data flows into the sliding window with the preset length and the time when the object in the object set, which has the closest time sequence relation to the newly added stream data, flows into the sliding window.
Specifically, according to a plurality of practical studies of the computation of the outlier probability by the inventors, it is preferable to compute the probability G that the newly added stream data becomes an outlier using the following formula (2):
[Formula (2): equation image BDA0002848658670000111 not available]
n/(ok.t-o.t), where N denotes the preset length of the sliding window, ok.t denotes the time at which the object in the object set NN(o) that is closest in time-sequence relation to the newly added stream data flowed into the sliding window, o.t denotes the time at which the newly added stream data o flowed into the sliding window of preset length, and k denotes the preset number threshold.
Further, if the distances of the data from the object o follow a normal distribution, the present application preferably predicts the probability that o becomes an outlier using the central limit theorem. The core idea is as follows: consider the N/ξ data objects that most recently flowed into the sliding window, where N is the preset window length and ξ is a given threshold. If at least k of them are within distance R of o, then after another N(1 - 1/ξ) objects flow into the sliding window, the probability that at least k of those objects are within distance R of o is greater than the user-defined threshold ρ.
In step S107, if the probability obtained in step S106 that the newly added stream data becomes an outlier is smaller than a given threshold ξ, the newly added stream data is considered unlikely to become an outlier and is determined to be a non-outlier with probability guarantee; otherwise, the process proceeds to step S108.
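As a generic illustration of a central-limit-theorem estimate of this kind, and not a reconstruction of formula (2), the chance that at least k of n future arrivals are neighbors of o can be approximated with a normal distribution. The sketch assumes that each future arrival is independently a neighbor of o with the same probability p (for example, estimated from the most recent arrivals); these assumptions and the function name `prob_at_least_k` are not taken from the source.

```python
import math

def prob_at_least_k(n, p, k):
    """Normal approximation to P(X >= k) for X ~ Binomial(n, p).

    Uses a continuity correction; falls back to the exact answer
    when the variance is zero (p == 0 or p == 1).
    """
    mean = n * p
    var = n * p * (1 - p)
    if var == 0:
        return 1.0 if mean >= k else 0.0
    z = (k - 0.5 - mean) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2))  # equals 1 - Phi(z)
```

An object would then be treated as a probabilistically guaranteed non-outlier when this estimate exceeds the user-defined threshold ρ.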
Specifically, for the above-described data object o, if the probability that o becomes an outlier is smaller than ρ, o can be regarded as a non-outlier with probability guarantee. In this case, it is not inserted into the outlier candidate set. Otherwise, the data object o is inserted into the outlier candidate set.
In step S108, the newly added stream data is inserted into the outlier candidate set, and a plurality of neighbor data with the nearest arrival time are reserved for the newly added stream data as auxiliary information of the newly added stream data. Wherein the number of the plurality of neighbor data is preferably equal to the preset number threshold.
Further, according to a preferred embodiment of the present application, and still taking the above data object o as an example, in order to detect the current state of o, a neighbor queue (denoted o.q) is allocated to o, and o.q maintains k neighbors of all neighbors of o that flow into the sliding window at the latest.
When another object o' flows into the sliding window and o' is a neighbor of o, the element of o.q that arrived in the window earliest is removed from o.q. When o.q becomes empty, o cannot become an outlier during its lifetime, and o is moved out of the set of candidate outliers.
If o.q is not empty after the removal, formula (2) is used to determine whether o is a non-outlier with probability guarantee. If so, o is likewise removed from the set of candidate outliers.
When an object in o.q flows out of the window, if no other new incoming object becomes a neighbor of o after o flows into the sliding window, that means that the number of neighbors of o is less than the threshold k. At this point, o is input to the set of outliers.
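The queue handling described above can be sketched roughly as follows. This is only an illustrative Python sketch, not the applicant's implementation: the names `Candidate`, `on_neighbor_arrival` and `on_expiry` are hypothetical, queue elements are represented by their arrival times (assumed unique), and the probabilistic test of equation (2) is left out.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Candidate outlier o with its neighbor queue o.q (arrival times, oldest first)."""
    t: int                                   # arrival time o.t
    q: deque = field(default_factory=deque)  # hypothetical representation of o.q


def on_neighbor_arrival(cand: Candidate) -> bool:
    """A new neighbor o' entered the window: drop the earliest element of o.q.

    Returns True when o.q is empty afterwards, meaning o can be removed
    from the candidate outlier set.
    """
    if cand.q:
        cand.q.popleft()
    return not cand.q


def on_expiry(cand: Candidate, expired_t: int) -> bool:
    """An object with arrival time expired_t flowed out of the window.

    Returns True when that object was still the head of o.q, meaning o has
    lost a neighbor it depended on and is reported as an outlier.
    """
    if cand.q and cand.q[0] == expired_t:
        cand.q.popleft()
        return True
    return False


# Example with k = 3: o arrives at time 10 with earlier neighbors at 4, 7 and 9.
o = Candidate(t=10, q=deque([4, 7, 9]))
on_neighbor_arrival(o)         # one later neighbor arrives, 4 is dropped
became_outlier = on_expiry(o, 7)
```

In this sketch each later neighbor replaces one earlier neighbor, so the queue length plus the number of later neighbors stays at k; losing any queued element therefore leaves o with fewer than k neighbors.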
Based on the approximate outlier calculation method for numerical stream data provided by the invention, non-secure outliers with probability guarantee can be identified from the temporal relationship between the arrival of the newly added stream data and the arrival of its neighbors in the window, so that such objects need not be maintained in the outlier candidate set, and the computation cost and space maintenance cost are significantly reduced.
Further, to maintain candidate outliers and secure objects at a lower cost, the method of the present application maintains secure objects and non-secure objects with probability guarantee (non-secure outliers) using the following steps:
Step S201, the maintenance process for the secure object includes: after a newly added data object flows into the sliding window, if the distance between the data object and a candidate outlier is smaller than a preset distance threshold R, the first element in the neighbor queue corresponding to the candidate outlier is popped from the queue; when the queue is empty, the candidate outlier is removed from the candidate outlier set.
In particular, a secure object here refers to the following: for the target object o, if k of its neighbors arrived later than o, then o is certainly secure. However, an object that is currently not an outlier is not necessarily secure afterwards. If k pieces of stream data that are neighbors of the target object o arrived earlier than o, then o is currently a non-outlier, but as soon as at least one of these k pieces of data flows out of the sliding window, o becomes an outlier. By contrast, if the k neighbors entered the sliding window later than o, then o is certain to flow out of the sliding window before they do; o is therefore a secure object until it flows out of the window and can never become an outlier. To save outlier maintenance cost, the present application therefore maintains only objects that are not currently outliers but may become outliers. Accordingly, the present application maintains only the first k neighbors that entered the sliding window earlier than the target object and does not maintain all of its neighbors: even if more than k neighbors entered the sliding window earlier than the target object, the neighbors beyond these k have no effect on the outlier determination of the target object.
Therefore, once the target object flows into the sliding window, a queue is established for it. When other data enter the sliding window, it is determined whether they are neighbors of the target object; each time a neighbor appears, the first element in the corresponding neighbor queue is popped. When the queue is empty, the target object is a secure object and is removed from the candidate outlier set.
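The distinction drawn above between outliers, secure objects and objects that must remain candidates can be illustrated with a small Python sketch. The function name `classify` and the three labels are hypothetical, and the sketch assumes, following claim 1, that an object with fewer than k neighbors is an outlier.

```python
def classify(o_t, neighbor_times, k):
    """Classify object o from the arrival times of its current neighbors.

    o_t            -- arrival time of o
    neighbor_times -- arrival times of all neighbors of o now in the window
    k              -- preset number threshold
    """
    if len(neighbor_times) < k:
        return "outlier"
    # Neighbors that arrived after o will also leave the window after o.
    later = sum(1 for t in neighbor_times if t > o_t)
    if later >= k:
        return "secure"      # o can never become an outlier
    return "candidate"       # non-outlier now, but depends on earlier neighbors


print(classify(10, [3, 5], 3))        # fewer than k neighbors
print(classify(10, [3, 5, 12], 3))    # relies on two earlier neighbors
print(classify(10, [11, 12, 13], 3))  # k later neighbors
```

Only objects in the "candidate" state need a neighbor queue, which is the saving the paragraph above describes.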
Step S202, the maintenance process for the non-secure object with probability guarantee includes: after new data flows into the sliding window, if the distance between the data and a candidate outlier is smaller than the threshold R, the first element in the neighbor queue corresponding to the candidate outlier is popped from the queue. At this time, let the arrival time of the first object in the queue be o_f.t; if 1/(o_f.t − o.t) is less than f(γ), then o is considered a non-secure object with probability guarantee and is likewise removed from the candidate outlier set, where f(γ) is a preset threshold.
Specifically, the innovation of this step is as follows. Maintaining the queues is costly, because a queue must be maintained for every data object after it enters the sliding window. The present application improves this maintenance process by predicting whether the target object is likely to become an outlier and deciding on that basis whether to maintain its queue. If the object is unlikely to become an outlier, its queue is not maintained.
It should be noted that the steps S201 and S202 may be executed synchronously or alternatively, and may be executed flexibly according to actual requirements.
Based on the general inventive concept, the present application also provides a schematic block diagram of a system for approximate outlier computation oriented to numeric streaming data. The computing system of the present application may perform the steps included in any of the methods described above.
Referring to fig. 2, the system for calculating approximate outliers for numerical stream data includes: a stream data management module 101, a stream data indexing module 102, a neighbor query module 103, and an approximate outlier detection module 104.
The stream data management module 101 mainly uses a sliding window model to manage the N pieces of data that most recently flowed into the window. The stream data indexing module 102 mainly maintains stream data by using indexes, and includes a stream data initialization module, a newly added data management module, and an expired data management module. The neighbor query module 103 is mainly used for querying objects whose distance to a given object is less than a certain threshold R. The approximate outlier detection module 104 is mainly used for maintenance of candidate outliers, identification of secure non-outliers, and management of outliers.
Further, the stream data management module 101 is provided with a start timestamp identifier and an end timestamp identifier for identifying the valid data timing range in the stream data, and the stream data indexing module 102 is connected to the stream data management module 101. The valid data may be determined manually or by a computer device according to a preset rule. For example, if the computer device has acquired 1,000,000 pieces of data and the target user only needs the most recent 10,000, the valid data timing range is the data from position 1,000,000 − 10,000 = 990,000 to position 1,000,000.
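A count-based sliding window with start and end identifiers might look like the following Python sketch. The class name `SlidingWindow` and its members are hypothetical, and arrival sequence numbers stand in for timestamps; the patent does not specify these details.

```python
from collections import deque


class SlidingWindow:
    """Keeps the N most recent items together with their sequence numbers."""

    def __init__(self, n):
        self.n = n
        self.items = deque()  # (sequence number, item), oldest first
        self.end = 0          # end identifier: sequence number of the newest item

    @property
    def start(self):
        """Start identifier: sequence number of the oldest valid item."""
        return self.items[0][0] if self.items else None

    def push(self, item):
        """Add a new item; return the expired (sequence number, item) or None."""
        self.end += 1
        self.items.append((self.end, item))
        if len(self.items) > self.n:
            return self.items.popleft()
        return None


w = SlidingWindow(3)
for x in "abcd":
    expired = w.push(x)
# After four pushes into a window of length 3, the first item has expired.
```

The value returned by `push` is what an expired data management step would hand to the index for deletion.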
In the system, the stream data indexing module 102 is mainly used for maintaining the coordinate information of the data.
As described above, the present system uses a B-Tree based two-level index. The first-level index is a set of cubes. Each piece of data is stored in the cube corresponding to its location, and the data in each cube are arranged in descending order of the time at which they flowed into the sliding window. For the second-level index, an integer number is assigned to each non-empty cube, and these numbers are maintained in a B-Tree.
Another function of the stream data indexing module 102 is to process newly arriving and expired data. When data flows into the sliding window, the stream data indexing module 102 finds the appropriate position in the index and inserts the data there. When data flows out of the window, the stream data indexing module 102 deletes the data from the index and adjusts the topology of the B-Tree according to the deletion result.
In the system, the stream data indexing module 102 is connected with the neighbor query module 103. Given newly added data o, the neighbor query module 103 accesses the stream data indexing module 102 to find the objects whose distance from o is less than the threshold R. Specifically, during the access, the cubes that intersect C(o, R) are found, where C(o, R) represents a circle with the coordinates of o as its center and R as its radius. For these cubes, the neighbor query module 103 accesses the objects in each cube and selects those whose distance from o is less than R, treating them as neighbors of o.
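The cube-based lookup can be illustrated with the Python sketch below. It is a simplification under stated assumptions: a plain dictionary keyed by per-dimension cube coordinates replaces the integer cube numbers and the B-Tree of the second-level index, the query scans every cube overlapping the bounding box of C(o, R), and the names `GridIndex`, `key`, `insert`, `remove` and `neighbors` are hypothetical.

```python
import itertools
import math
from collections import defaultdict


class GridIndex:
    """First-level index only: cubes of side length `side` holding (t, point)."""

    def __init__(self, side):
        self.side = side
        self.cubes = defaultdict(list)  # cube key -> [(t, point)], newest first

    def key(self, p):
        """Cube coordinates of point p (a tuple of floats)."""
        return tuple(math.floor(c / self.side) for c in p)

    def insert(self, t, p):
        # Newest data goes first, keeping each cube in descending arrival order.
        self.cubes[self.key(p)].insert(0, (t, p))

    def remove(self, t, p):
        k = self.key(p)
        self.cubes[k].remove((t, p))
        if not self.cubes[k]:
            del self.cubes[k]           # keep only non-empty cubes

    def neighbors(self, p, r):
        """Objects whose Euclidean distance to p is less than r."""
        lo = self.key(tuple(c - r for c in p))
        hi = self.key(tuple(c + r for c in p))
        found = []
        for k in itertools.product(*(range(a, b + 1) for a, b in zip(lo, hi))):
            for t, q in self.cubes.get(k, ()):
                if math.dist(p, q) < r:
                    found.append((t, q))
        return found


idx = GridIndex(1.0)
idx.insert(1, (0.5, 0.5))
idx.insert(2, (1.2, 0.6))
idx.insert(3, (5.0, 5.0))
near = idx.neighbors((0.9, 0.5), 0.5)  # query before inserting the new point
```

Comparing `len(near)` with the preset number threshold k then gives the basic outlier test; an ordered structure over integer cube numbers, as the text describes, would replace the dictionary in a fuller implementation.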
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. An approximate outlier calculation method oriented to numerical type stream data, characterized by comprising:
acquiring newly added stream data;
inserting the stream data object into a corresponding leaf node in a preset index structure according to the known coordinate information of each dimension of the newly added stream data;
searching an object set with the distance to the newly added stream data smaller than a preset radius;
and comparing the number of the objects in the object set with a preset number threshold, and if the number of the objects in the object set is smaller than the preset number threshold, outputting the newly added stream data to an outlier set.
2. The approximate outlier calculation method of claim 1, wherein a sliding window of a predetermined length is used to distinguish between the newly added stream data entering the sliding window and the expired stream data.
3. The approximate outlier calculation method of claim 1, wherein a two-level index structure of B-Tree is used to store coordinate information of the stream data.
4. The approximate outlier calculation method of claim 3, wherein the numbering of the stream data in the cube in the B-Tree index structure is calculated using the following formula:
ID = Σ_i ((o[i] / L) << (i · log(1/L)))
where ID denotes the cube number, L denotes the cube length, o denotes stream data, and o [ i ] denotes the coordinates of the stream data in the ith dimension, where i is a positive integer.
5. The approximate outlier calculation method of claim 2, further comprising:
if the number of the objects in the object set is larger than a preset number threshold, determining the arrival sequence number of each object and determining a plurality of objects with the minimum arrival sequence number relative to newly added stream data according to the time sequence generated by each object in the object set, wherein the number of the plurality of objects is preferably equal to the preset number threshold;
judging the probability of the newly added stream data becoming an outlier according to the time of the newly added stream data flowing into a sliding window with a preset length and the time of an object in the object set, which has the closest time sequence relation with the newly added stream data, flowing into the sliding window;
and if the probability that the obtained newly added stream data becomes an outlier is smaller than a given threshold value, determining the newly added stream data as a non-safe outlier with probability guarantee.
6. The approximate outlier calculation method of claim 5, further comprising:
if the obtained probability that the newly added stream data becomes an outlier is larger than or equal to a given threshold value, inserting the newly added stream data into an outlier candidate set, and reserving a plurality of neighbor data with the nearest arrival time for the newly added stream data to serve as auxiliary information of the newly added stream data; wherein the number of the plurality of neighbor data is equal to a preset number threshold.
7. The approximate outlier calculating method according to claim 5, wherein a probability G of said newly added stream data becoming an outlier is calculated using the following formula:
Figure FDA0002848658660000021
N/(o_k.t − o.t), where N denotes the preset length of the sliding window, o_k.t denotes the time at which the object in the object set NN(o) that is closest in time to the newly added stream data flowed into the sliding window, o.t denotes the time at which the newly added stream data o flowed into the sliding window of the preset length, and k denotes the preset number threshold.
8. The approximate outlier calculation method of claim 2, further comprising:
allocating a neighbor queue, denoted o.q, to the newly added stream data o, where the neighbor queue maintains, among all the neighbors of the newly added stream data o, the k neighbors that flowed into the sliding window most recently, and k is an integer;
when another object o' flows into the sliding window, if o' is a neighbor of o, removing from o.q the element that arrived in the window earliest; removing o from the candidate outlier set when o.q becomes empty;
when an object in o.q flows out of the window, if no newly arriving object has become a neighbor of o since o flowed into the sliding window, which means that the number of neighbors of o is less than the threshold k, outputting o to the outlier set.
9. The approximate outlier calculation method of claim 2, further comprising at least one of the steps of:
after a newly added data object flows into the sliding window, if the distance between the data object and a candidate outlier is smaller than a preset distance threshold R, popping the first element from the neighbor queue corresponding to the candidate outlier; when the queue is empty, removing the candidate outlier from the candidate outlier set;
when newly added data flows into the sliding window, if the distance between the data and a candidate outlier o is smaller than the threshold R, popping the first element from the neighbor queue corresponding to the candidate outlier; letting the arrival time of the first object in the queue be o_f.t and o.t be the time at which o flowed into the sliding window of the preset length, if 1/(o_f.t − o.t) is less than f(γ), then o is considered a non-secure object with probability guarantee and is removed from the candidate outlier set; where f(γ) is a preset threshold.
10. An approximate outlier computing system oriented to numerical stream data, said approximate outlier computing system being capable of performing the steps of the method of any of claims 1-9.
CN202011518175.3A 2020-12-21 2020-12-21 Approximate outlier calculation method and system for numerical type stream data Pending CN112559571A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011518175.3A CN112559571A (en) 2020-12-21 2020-12-21 Approximate outlier calculation method and system for numerical type stream data

Publications (1)

Publication Number Publication Date
CN112559571A true CN112559571A (en) 2021-03-26

Family

ID=75030654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011518175.3A Pending CN112559571A (en) 2020-12-21 2020-12-21 Approximate outlier calculation method and system for numerical type stream data

Country Status (1)

Country Link
CN (1) CN112559571A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975519A (en) * 2016-04-28 2016-09-28 深圳大学 Multi-supporting point index-based outlier detection method and system
US20170199902A1 (en) * 2016-01-07 2017-07-13 Amazon Technologies, Inc. Outlier detection for streaming data
CN110471946A (en) * 2019-07-08 2019-11-19 广东工业大学 A kind of LOF outlier detection method and system based on grid beta pruning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
唐成龙;邢长征;: "基于数据分区和网格的离群点挖掘算法", 计算机应用, no. 08, 1 August 2012 (2012-08-01) *
张晓龙;曾伟;: "基于衰减窗口与剪枝维度树的实时数据流聚类", 计算机应用研究, no. 04, 15 April 2009 (2009-04-15) *
朱斌;钟毓灵;王习特;白梅;: "不确定数据流上的离群点检测处理", 湖南大学学报(自然科学版), no. 02, 25 February 2020 (2020-02-25) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination