CN107124329B

CN107124329B - Outlier data discovery method and system based on low-water-level sliding time window

Info

Publication number: CN107124329B
Application number: CN201710284487.4A
Authority: CN
Inventors: 马坤; 周劲; 于自强; 纪科
Original assignee: University of Jinan
Current assignee: University of Jinan
Priority date: 2017-04-25
Filing date: 2017-04-25
Publication date: 2020-05-05
Anticipated expiration: 2037-04-25
Also published as: CN107124329A

Abstract

The invention discloses an outlier data discovery method and system based on a low-water-level sliding time window. The method comprises the following steps: (1) data distribution: receiving an external data stream and distributing it to each data processing node; (2) data processing: each data processing node processes the received external data stream; a low water level sliding time window is defined with the timestamp as the horizontal coordinate axis, and the window moves continuously from left to right along this axis as time passes; at any point in time, unprocessed data lies above the horizontal axis of the window and processed data lies below it; whether the data currently being processed is outlier data is then determined from the position of its timestamp within the range of the low water level sliding time window; (3) data aggregation: summarizing and outputting the data processing results. Discardable data, outlier data and normal data to be processed are distinguished, data processing reliability is improved, and fault recovery is accelerated.

Description

Outlier data discovery method and system based on low-water-level sliding time window

Technical Field

The invention relates to an outlier data discovery method, in particular to an outlier data discovery method and system based on a low-water-level sliding time window.

Background

Stream processing is the real-time computation of a constantly changing data stream. To meet users' demands for instant processing of massive data and to overcome the real-time bottleneck of the batch processing mode represented by traditional MapReduce, emerging stream processing methods have important application value in risk management, marketing management, advertisement delivery, social recommendation and similar areas.

Because of network delay, concurrency inside the system and other causes, data of the same type cannot be guaranteed to arrive at a data processing node strictly in timestamp order, so outlier data is produced whose arrival order at the data processing node is inconsistent with its timestamp order. Large amounts of outlier data are processed slowly, interfere with the judgment of data processing faults, and increase the probability that a stream processing fault is misjudged.

In the prior art, fault tolerance is mainly achieved through methods such as logs, hot copy and upstream backup, and outlier data is not discussed. The log and hot copy fault-tolerance methods use incremental copying over a synchronous protocol, so a large amount of outlier data severely overloads the copy process; the upstream backup fault-tolerance method treats outlier data as a failure and initiates an erroneous failover.

In the prior art, D-Stream uses a parallel recovery method to find outlier data and speculative execution for fault recovery, and relies on a batch data analysis stack. The prior art also provides a method for handling out-of-order arrival, which orders out-of-order data using explicit mechanisms such as punctuations and heartbeats. Building on this principle, the prior-art MillWheel system proposes the concept of a low water level, which represents the lower bound of the data to be processed: data that arrives at a data processing with a timestamp less than the low water level is discarded directly. This gives a way to judge discarded data but not outlier data, because a low water level represented by a single point in time cannot strictly distinguish outlier data. In the prior art, Trident avoids generating outlier data by requiring the data to be processed to be strictly ordered; this method depends on a transaction framework and incurs a large amount of extra overhead.

Disclosure of Invention

The invention aims to solve the problems and provides an outlier data discovery method and system based on a low-water-level sliding time window, which can effectively distinguish discardable data, outlier data and normal data to be processed, improve the data processing reliability and accelerate the fault recovery.

In order to achieve the purpose, the invention adopts the following technical scheme:

the outlier data discovery method based on the low water level sliding time window comprises the following steps:

step (1): data distribution: receiving an external data stream, and then distributing the external data stream to each data processing node;

step (2): data processing: the data processing node processes the received external data stream;

defining a low water level sliding time window whose timestamps start from a low water level initial value and whose width is w; the timestamp range of the low water level sliding time window is [low water level initial value, low water level initial value + w];

taking the time stamp as a horizontal coordinate axis, continuously moving a low-water-level sliding time window from left to right on the horizontal coordinate axis of the time stamp along with the time lapse, wherein unprocessed data is arranged above the horizontal coordinate axis of the low-water-level sliding time window at any time point, and processed data is arranged below the horizontal coordinate axis; then, according to the position of the current data processing time stamp in the low water level sliding time window range, finding whether the current data processing is outlier data;

step (3): data aggregation: summarizing and outputting the data processing results.

In step (2), data streams with different keywords can be processed concurrently on different data processing nodes.

The low water level initial value is obtained as follows: the timestamp of the earliest unprocessed data packet in the current data processing is identified as the low water level of the current data processing; the timestamp of the earliest unprocessed data packet in the upstream data processing of the current data processing is identified as the low water level of the upstream data processing; the two low water levels are then compared and the smaller one is taken as the low water level of the current data processing. The upstream data processing is then traced back along the stream processing network topology, so that the earliest unprocessed data of the whole stream processing network is found by recursion, and its timestamp is taken as the initial value of the low water level.

The step of finding whether the current data processing is outlier according to the position of the current data processing time stamp in the low water level sliding time window range comprises the following steps:

if the current data processing timestamp is in the range of [ low water level, low water level + sliding time window width/2 ], the unprocessed data packet is outlier data;

if the current data processing timestamp is in the range of [ low water level + sliding time window width/2, low water level + sliding time window width ], the unprocessed data packet is normal data to be processed;

if the current data processing timestamp is less than the low water level, the unprocessed data is discardable data.
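The three cases above can be sketched as a small classification function. This is an illustrative sketch, not the patented implementation: the names `Category` and `classify` are hypothetical, the boundary at low water level + w/2 is assigned to normal data (following the half-open interval used in the detailed description), and the `BEYOND_WINDOW` label for timestamps past the right edge of the window is an assumption, since the text above does not define that case.

```python
from enum import Enum


class Category(Enum):
    DISCARDABLE = "discardable"      # timestamp < low water level
    OUTLIER = "outlier"              # first half of the window
    NORMAL = "normal"                # second half of the window
    BEYOND_WINDOW = "beyond_window"  # assumption: case not defined in the text


def classify(timestamp: float, low_water_level: float, w: float) -> Category:
    """Classify an unprocessed packet by where its timestamp falls
    relative to the window [low_water_level, low_water_level + w]."""
    if w <= 0:
        raise ValueError("window width w must be positive")
    if timestamp < low_water_level:
        return Category.DISCARDABLE
    if timestamp < low_water_level + w / 2:
        return Category.OUTLIER
    if timestamp <= low_water_level + w:
        return Category.NORMAL
    return Category.BEYOND_WINDOW
```

For example, with a hypothetical low water level of 100 and w = 20, a timestamp of 95 is discardable, 105 is outlier data, and 115 is normal data to be processed.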

The width w of the low water level sliding time window is set according to the maximum data arrival delay that the data processing can tolerate.

An outlier data discovery system based on a low water level sliding time window, comprising:

the data distribution module: receiving an external data stream, and then distributing the external data stream to each data processing node;

a data processing module: the data processing node processes the received external data stream;

a data aggregation module: and summarizing and outputting the data processing results.

Data streams from different keywords in the data processing module can be processed concurrently on different data processing nodes.

Due to the control of the low water level sliding time window, discarded data is minimal.

Interpretation of terms:

Stream processing is the real-time computation of a constantly changing data stream.

Each data packet of a data stream handled during stream processing is a triple consisting of a keyword, a key value and a timestamp. A stream computation can run on multiple stream processing nodes by distributing data packets according to their keywords.
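As a rough illustration of the packet triple and keyword-based distribution, the sketch below models a packet and assigns packets to nodes by keyword. The names `Packet`, `node_for` and `distribute` are hypothetical, and the use of a CRC32 hash modulo the node count is an assumption; the text does not specify how keywords are mapped to nodes.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Packet(NamedTuple):
    keyword: str      # routing key
    value: float      # key value
    timestamp: float  # event timestamp


def node_for(keyword: str, num_nodes: int) -> int:
    """Map a keyword to a node index with a stable hash (assumed scheme)."""
    return zlib.crc32(keyword.encode("utf-8")) % num_nodes


def distribute(packets: Iterable[Packet], num_nodes: int) -> Dict[int, List[Packet]]:
    """Group packets by the node index chosen for their keyword."""
    assignment: Dict[int, List[Packet]] = defaultdict(list)
    for packet in packets:
        assignment[node_for(packet.keyword, num_nodes)].append(packet)
    return dict(assignment)
```

Because the node index depends only on the keyword, all packets with the same keyword reach the same node, while packets with different keywords can be processed concurrently on different nodes.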

The invention has the beneficial effects that:

1. effectively distinguishing data delay arrival and data processing faults in a low-water level sliding time window, and reducing the times of data processing fault recovery misjudgment in the stream processing process;

2. discardable data, outlier data and normal data to be processed in data processing are effectively distinguished;

3. outlier data in data processing is discovered, which improves data processing reliability and accelerates fault recovery;

4. The low water level of the step (2) is defined based on the data flow between data processing, and the current data processing is guaranteed not to generate data packets with earlier time stamps. The data processing only processes the data packets with the time stamps larger than the low water level, and the data packets with the time stamps smaller than the low water level are directly discarded.

Drawings

FIG. 1 is a stream processing network topology;

FIG. 2 is a low water level sliding time window;

FIG. 3 is an embodiment of the low water level sliding time window;

FIG. 4 shows snapshots of the low water level sliding time window of a data processing at three points in time.

Detailed Description

The invention is further described with reference to the following figures and examples.

As shown in FIG. 1, the stream processing network topology includes data distribution 101, data processing 102, 103, 104 and 105, and data aggregation 106. The data distribution 101 receives external data streams and forwards them to the subsequent data processing. The data processing 102, 103, 104 and 105 are the computation units of stream processing. The data aggregation 106 summarizes and outputs the data processing results. Data streams with different keywords may be executed concurrently on different data processing nodes.

The low water level is defined based on the data flow between data processes, identifying the timestamp of the oldest unprocessed data packet in the current data process, ensuring that the current data process no longer produces data packets with an earlier timestamp. The data processing only processes the data packets with the time stamps larger than the low water level, and the data packets with the time stamps smaller than the low water level are directly discarded.

Given data processing A and B, where B is the upstream data processing of A, the low water level of data processing A is defined recursively as lowwatermark(A) = min(oldestwork(A), lowwatermark(B)), where the oldestwork function returns the timestamp of the unprocessed data with the smallest timestamp in data processing A. By tracing back the upstream data processing along the stream processing network topology, the earliest unprocessed data of the whole stream processing can be found, and its timestamp is taken as the initial value of the low water level. The width w of the low water level sliding time window is set according to the maximum data arrival delay that the data processing can tolerate.
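The recursive definition above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `low_watermark` and the dictionaries `upstream` and `oldest_unprocessed` are hypothetical, the recursion is generalized to several upstream data processings (the text defines it for a single upstream B), a data processing with no unprocessed data is treated as contributing infinity, and the topology is assumed to be acyclic.

```python
import math
from typing import Dict, List, Optional


def low_watermark(node: str,
                  upstream: Dict[str, List[str]],
                  oldest_unprocessed: Dict[str, Optional[float]],
                  cache: Optional[Dict[str, float]] = None) -> float:
    """Minimum of the node's oldest unprocessed timestamp and the
    low water levels of all of its upstream data processings."""
    if cache is None:
        cache = {}
    if node in cache:
        return cache[node]
    own = oldest_unprocessed.get(node)
    # Assumption: a node with nothing pending does not constrain the result.
    result = math.inf if own is None else own
    for parent in upstream.get(node, []):
        result = min(result, low_watermark(parent, upstream, oldest_unprocessed, cache))
    cache[node] = result
    return result


# Hypothetical wiring of the numbered elements; the real edges of FIG. 1 are not given here.
upstream = {"102": ["101"], "103": ["101"], "104": ["102"],
            "105": ["103"], "106": ["104", "105"]}
oldest = {"101": 50, "102": 40, "103": None, "104": 45, "105": 60, "106": 70}
initial_low_water_level = low_watermark("106", upstream, oldest)
```

Evaluating the recursion at the aggregation node reaches every upstream data processing, so under these assumptions the result is the timestamp of the earliest unprocessed data in the whole network, which serves as the initial value of the low water level.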

As shown in FIG. 2, the timestamps of the low water level sliding time window 200 start at the low water level 201, and the window has width w. The timestamp range of the low water level sliding time window is [ low water level, low water level + sliding time window width ].

unprocessed data packets with current data processing timestamps within the range of [ low water level, low water level + sliding time window width/2 ] are outlier data;

the unprocessed data packet of which the current data processing timestamp is in the range of [ low water level + sliding time window width/2, low water level + sliding time window width ] is the normal data to be processed;

unprocessed data with a current data processing timestamp less than the low water level is discardable data, and due to control of the low water level sliding time window, discarded data are few.

As shown in FIG. 3, in the embodiment of the low water level sliding time window, A is discardable data; B, D and F are processed data; C is outlier data; and E and G are normal data to be processed.

FIG. 4 shows snapshots of the low water level sliding time window of a data processing at three points in time. The rectangle represents the low water level sliding time window, the left vertical edge of the rectangle is the low water level, and the window shifts continuously to the right as system time passes. The data in each snapshot is distributed along a horizontal time axis. At any point in time, unprocessed data lies above the horizontal axis of the low water level sliding time window and processed data lies below it. The timestamp range of the low water level sliding time window is [lowwatermark, lowwatermark + w]. Within the window, unprocessed data with a timestamp in [lowwatermark + w/2, lowwatermark + w] is non-outlier data, and unprocessed data with a timestamp in [lowwatermark, lowwatermark + w/2) is outlier data (e.g. data I in FIG. 4). Unprocessed data that is not within the low water level sliding time window is discarded directly (e.g. data H in FIG. 4); because of the control exerted by the low water level sliding time window, such discarded data is rare.
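The effect of the window shifting to the right can be illustrated with a short sketch that evaluates one unprocessed packet against three successive low water levels. All numbers are made up for illustration and do not come from FIG. 4; the function names are hypothetical, and the "beyond window" label for timestamps past the right edge is an assumption.

```python
def classify(timestamp, low_water_level, w):
    """Label a timestamp relative to [low_water_level, low_water_level + w]."""
    if timestamp < low_water_level:
        return "discardable"
    if timestamp < low_water_level + w / 2:
        return "outlier"
    if timestamp <= low_water_level + w:
        return "normal"
    return "beyond window"  # assumption: case not defined in the text


def snapshots(timestamp, low_water_levels, w):
    """Classify the same packet at each successive low water level."""
    return [(lwm, classify(timestamp, lwm, w)) for lwm in low_water_levels]


# One packet with timestamp 115, window width 20, three snapshots.
for lwm, label in snapshots(115, [100, 110, 120], 20):
    print(lwm, label)
# prints:
# 100 normal
# 110 outlier
# 120 discardable
```

The same packet, if it remains unprocessed, moves from normal to outlier to discardable as the low water level advances, which is the progression the three snapshots are meant to convey.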

Although the embodiments of the present invention have been described with reference to the accompanying drawings, they are not intended to limit the scope of the present invention. Those skilled in the art should understand that various modifications and variations that can be made without inventive effort on the basis of the technical solution of the present invention remain within its scope.

Claims

1. The outlier data discovery method based on the low water level sliding time window is characterized by comprising the following steps:

the low level sliding time window size w is set according to the maximum data delay reaching time which can be tolerated by data processing;

if the current data processing timestamp is less than the low water level, the unprocessed data is discardable data;

2. The method of claim 1 for outlier data discovery based on a low water level sliding time window,

3. The method of claim 1 for outlier data discovery based on a low water level sliding time window,

4. An outlier data discovery system based on a low water level sliding time window, comprising:

5. The system of claim 4 wherein the outlier data discovery system based on a low water level sliding time window,

6. The system of claim 4 wherein the outlier data discovery system based on a low water level sliding time window,