CN107124329B - Outlier data discovery method and system based on low-water-level sliding time window - Google Patents
- Publication number
- CN107124329B
- Authority
- CN
- China
- Prior art keywords
- water level
- low water
- data
- data processing
- time window
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
- H04L43/0852—Delays
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/16—Threshold monitoring
Abstract
The invention discloses an outlier data discovery method and system based on a low-water-level sliding time window. The method comprises the following steps. Data distribution: receiving an external data stream and distributing it to each data processing node. Data processing: each data processing node processes the received external data stream; a low water level sliding time window is defined with the timestamp as the horizontal axis, and the window moves continuously from left to right along the timestamp axis as time passes; at any point in time, unprocessed data lies above the horizontal axis of the window and processed data lies below it; whether the data currently being processed is outlier data is then determined from the position of its timestamp within the low water level sliding time window. Data aggregation: summarizing and outputting the data processing results. The method distinguishes discardable data, outlier data and normal data to be processed, improves data processing reliability, and accelerates fault recovery.
Description
Technical Field
The invention relates to an outlier data discovery method, in particular to an outlier data discovery method and system based on a low-water-level sliding time window.
Background
Stream processing is the real-time computation of a changing data stream. It addresses users' demand for immediate processing of massive data and the real-time bottleneck of the batch processing mode represented by traditional MapReduce, and it has important application value in risk management, marketing management, advertisement delivery, social recommendation and the like.
Due to network delay, concurrency inside the system and other causes, data of the same type cannot be guaranteed to arrive at a data processing node strictly in timestamp order, which produces outlier data, that is, data whose timestamp order is inconsistent with its order of arrival at the data processing node. A large amount of outlier data is processed slowly, interferes with the judgment of data processing faults, and increases the probability that a stream processing fault is misjudged.
In the prior art, fault tolerance is mainly achieved through logging, hot replication, upstream backup and similar methods, and outlier data is not discussed. The logging and hot-replication fault-tolerance methods use incremental replication over a synchronous protocol, so a large amount of outlier data can severely overwhelm the replication process; the upstream backup fault-tolerance method treats outlier data as a failure and initiates an erroneous failover.
In the prior art, D-Stream uses a parallel recovery method to find outlier data and speculative execution for fault recovery, and it relies on a batch data analysis stack. Another prior-art method handles out-of-order arrival by ordering the out-of-order data with explicit mechanisms such as punctuation and heartbeats. Building on this principle, the prior-art MillWheel system introduces the concept of a low water level, which represents the lower bound of the data to be processed: data whose timestamp is less than the low water level is discarded directly when it arrives at a data processing. This provides a way to identify lost data but not outlier data, because a low water level expressed as a single point in time cannot strictly distinguish outlier data. In the prior art, Trident avoids generating outlier data by requiring the data to be processed to be strictly ordered, but this approach depends on a transaction framework and incurs a large amount of extra overhead.
Disclosure of Invention
The invention aims to solve the problems and provides an outlier data discovery method and system based on a low-water-level sliding time window, which can effectively distinguish discardable data, outlier data and normal data to be processed, improve the data processing reliability and accelerate the fault recovery.
In order to achieve the purpose, the invention adopts the following technical scheme:
the outlier data discovery method based on the low water level sliding time window comprises the following steps:
step (1): data distribution: receiving an external data stream, and then distributing the external data stream to each data processing node;
step (2): data processing: the data processing node processes the received external data stream;
defining a low water level sliding time window, wherein the time stamp of the low water level sliding time window starts from a low water level initial value, and the width of the low water level sliding time window is w; the timestamp range of the low water level sliding time window is [ the initial value of the low water level, the initial value of the low water level + the width w of the low water level sliding time window ];
taking the time stamp as a horizontal coordinate axis, continuously moving a low-water-level sliding time window from left to right on the horizontal coordinate axis of the time stamp along with the time lapse, wherein unprocessed data is arranged above the horizontal coordinate axis of the low-water-level sliding time window at any time point, and processed data is arranged below the horizontal coordinate axis; then, according to the position of the current data processing time stamp in the low water level sliding time window range, finding whether the current data processing is outlier data;
step (3): data aggregation: summarizing and outputting the data processing results.
In step (2), data streams with different keywords can be processed concurrently on different data processing nodes.
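The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `distribute`, `process` and `aggregate`, the byte-sum routing rule, and the per-keyword sum used as the "processing" are all hypothetical and are not taken from the patent text.

```python
from collections import defaultdict


def distribute(stream, num_nodes):
    """Step (1): route each (keyword, value, timestamp) packet to a node."""
    nodes = defaultdict(list)
    for key, value, ts in stream:
        # Assumption: a deterministic per-keyword routing rule (byte sum modulo node count).
        node_id = sum(key.encode("utf-8")) % num_nodes
        nodes[node_id].append((key, value, ts))
    return nodes


def process(packets):
    """Step (2): placeholder computation, here a per-keyword sum of values."""
    totals = defaultdict(int)
    for key, value, _ts in packets:
        totals[key] += value
    return dict(totals)


def aggregate(partials):
    """Step (3): merge the per-node results into one output."""
    merged = {}
    for partial in partials:
        # Safe because a given keyword is always routed to the same node.
        merged.update(partial)
    return merged


stream = [("a", 1, 10), ("b", 2, 11), ("a", 3, 12)]
nodes = distribute(stream, num_nodes=2)
result = aggregate(process(p) for p in nodes.values())
print(result)
```

Because every packet with the same keyword lands on the same node, the per-node results never overlap, which is what allows the nodes to run concurrently.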
The low water level initial value is obtained as follows: the timestamp of the earliest unprocessed data packet in the current data processing is identified as the low water level of the current data processing; the timestamp of the earliest unprocessed data packet in the upstream data processing of the current data processing is identified as the low water level of the upstream data processing; the two low water levels are compared and the smaller one is taken as the low water level of the current data processing; the upstream data processing is then traced back recursively along the stream processing network topology until the earliest unprocessed data of the whole stream processing network is found, and its timestamp is taken as the initial value of the low water level.
Determining whether the data currently being processed is outlier data, according to the position of its timestamp within the low water level sliding time window, comprises the following:
if the current data processing timestamp is in the range of [ low water level, low water level + sliding time window width/2 ], the unprocessed data packet is outlier data;
if the current data processing timestamp is in the range of [ low water level + sliding time window width/2, low water level + sliding time window width ], the unprocessed data packet is normal data to be processed;
if the current data processing timestamp is less than the low water level, the unprocessed data is discardable data.
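The three rules above can be written as a single classification function. The sketch below is illustrative only: the names `Category` and `classify` are hypothetical, the midpoint is assigned to normal data (following the half-open interval used in the description of FIG. 4), and the `BEYOND_WINDOW` case for timestamps past the right edge is an assumption, since the rules do not cover it.

```python
from enum import Enum


class Category(Enum):
    DISCARDABLE = "discardable"
    OUTLIER = "outlier"
    NORMAL = "normal"
    BEYOND_WINDOW = "beyond_window"  # assumption: not defined by the rules above


def classify(timestamp, low_water_level, w):
    """Classify an unprocessed packet by its timestamp position in the window."""
    if timestamp < low_water_level:
        return Category.DISCARDABLE
    if timestamp < low_water_level + w / 2:
        # First half of the window: [low water level, low water level + w/2)
        return Category.OUTLIER
    if timestamp <= low_water_level + w:
        # Second half of the window: [low water level + w/2, low water level + w]
        return Category.NORMAL
    return Category.BEYOND_WINDOW


for ts in (95, 100, 109, 110, 120):
    print(ts, classify(ts, low_water_level=100, w=20).value)
```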
The width w of the low water level sliding time window is set according to the maximum data arrival delay that the data processing can tolerate.
An outlier data discovery system based on a low water level sliding time window, comprising:
the data distribution module: receiving an external data stream, and then distributing the external data stream to each data processing node;
a data processing module: the data processing node processes the received external data stream;
defining a low water level sliding time window, wherein the time stamp of the low water level sliding time window starts from a low water level initial value, and the width of the low water level sliding time window is w; the timestamp range of the low water level sliding time window is [ the initial value of the low water level, the initial value of the low water level + the width w of the low water level sliding time window ];
taking the time stamp as a horizontal coordinate axis, continuously moving a low-water-level sliding time window from left to right on the horizontal coordinate axis of the time stamp along with the time lapse, wherein unprocessed data is arranged above the horizontal coordinate axis of the low-water-level sliding time window at any time point, and processed data is arranged below the horizontal coordinate axis; then, according to the position of the current data processing time stamp in the low water level sliding time window range, finding whether the current data processing is outlier data;
a data aggregation module: and summarizing and outputting the data processing results.
Data streams from different keywords in the data processing module can be processed concurrently on different data processing nodes.
The low water level initial value is obtained as follows: the timestamp of the earliest unprocessed data packet in the current data processing is identified as the low water level of the current data processing; the timestamp of the earliest unprocessed data packet in the upstream data processing of the current data processing is identified as the low water level of the upstream data processing; the two low water levels are compared and the smaller one is taken as the low water level of the current data processing; the upstream data processing is then traced back recursively along the stream processing network topology until the earliest unprocessed data of the whole stream processing network is found, and its timestamp is taken as the initial value of the low water level.
Determining whether the data currently being processed is outlier data, according to the position of its timestamp within the low water level sliding time window, comprises the following:
if the current data processing timestamp is in the range of [ low water level, low water level + sliding time window width/2 ], the unprocessed data packet is outlier data;
if the current data processing timestamp is in the range of [ low water level + sliding time window width/2, low water level + sliding time window width ], the unprocessed data packet is normal data to be processed;
if the current data processing timestamp is less than the low water level, the unprocessed data is discardable data.
The width w of the low water level sliding time window is set according to the maximum data arrival delay that the data processing can tolerate.
Due to the control of the low water level sliding time window, discarded data is minimal.
Interpretation of terms:
Stream processing is the real-time computation of a constantly changing data stream.
Each data packet of a data stream handled during stream processing is a triple consisting of a keyword, a key value and a timestamp. A stream computation can run on multiple stream processing nodes by distributing data packets across the nodes according to their keywords.
The invention has the beneficial effects that:
1. data that arrives late is effectively distinguished from data processing faults within the low water level sliding time window, which reduces the number of misjudged fault recoveries during stream processing;
2. discardable data, outlier data and normal data to be processed are effectively distinguished in data processing;
3. outlier data in data processing is found, which improves data processing reliability and accelerates fault recovery;
4. the low water level of step (2) is defined based on the data flow between data processing units, which guarantees that the current data processing no longer produces data packets with earlier timestamps. A data processing only processes data packets whose timestamps are greater than the low water level; data packets whose timestamps are less than the low water level are discarded directly.
Drawings
FIG. 1 is a stream processing network topology;
FIG. 2 is a low water level sliding time window;
FIG. 3 is an embodiment of the low water level sliding time window;
FIG. 4 shows snapshots of the low water level sliding time window of a data processing at three points in time.
Detailed Description
The invention is further described with reference to the following figures and examples.
As shown in fig. 1, the stream processing network topology includes data distribution 101, data processing units 102, 103, 104, 105, and data aggregation 106. The data distribution 101 receives external data streams and forwards them to the subsequent data processing units. The data processing units 102, 103, 104, 105 are the computation units of stream processing. The data aggregation 106 summarizes and outputs the data processing results. Data streams with different keywords may be processed concurrently on different data processing nodes.
The low water level is defined based on the data flow between data processing units. It identifies the timestamp of the oldest unprocessed data packet in the current data processing, which ensures that the current data processing no longer produces data packets with an earlier timestamp. A data processing only processes data packets whose timestamps are greater than the low water level; data packets whose timestamps are less than the low water level are discarded directly.
Given data processing A and data processing B, where B is the upstream data processing of A, the low water level of A is defined recursively as lowwatermark(A) = min(oldestwork(A), lowwatermark(B)), where the oldestwork function returns the timestamp of the unprocessed data with the smallest timestamp in data processing A. Tracing the upstream data processing back along the stream processing network topology finds the earliest unprocessed data of the whole stream processing network, and its timestamp is taken as the initial value of the low water level. The width w of the low water level sliding time window is set according to the maximum data arrival delay that the data processing can tolerate.
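The recursion can be illustrated with a short Python sketch. It rests on several assumptions that are not in the text: the topology is acyclic, a node may have more than one upstream node (the text only describes a single upstream B), and the node names and the `upstream` and `pending` dictionaries are hypothetical.

```python
def low_watermark(node, upstream, pending):
    """Return the low water level of `node`, or None if nothing is pending.

    Takes the minimum of the node's oldest pending timestamp and the low
    water level of every upstream node. Assumes the topology has no cycles.
    """
    candidates = []
    if pending.get(node):
        # Oldest unprocessed timestamp held by this node.
        candidates.append(min(pending[node]))
    for parent in upstream.get(node, []):
        parent_level = low_watermark(parent, upstream, pending)
        if parent_level is not None:
            candidates.append(parent_level)
    return min(candidates) if candidates else None


# Hypothetical three-stage chain: parse -> count -> sink.
upstream = {"count": ["parse"], "sink": ["count"]}
pending = {"parse": [105, 110], "count": [120], "sink": []}

initial_low_water_level = low_watermark("sink", upstream, pending)
print(initial_low_water_level)  # 105
```

Evaluating the recursion at the most downstream node walks the whole chain, so the result is the timestamp of the earliest unprocessed data anywhere upstream.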
As shown in fig. 2, the low water level sliding time window 200 has a timestamp starting at the low water level 201 and a width w. The timestamp range of the low water level sliding time window is [ low water level, low water level + sliding time window width ].
The outlier data discovery method based on the low water level sliding time window comprises the following steps:
unprocessed data packets with current data processing timestamps within the range of [ low water level, low water level + sliding time window width/2 ] are outlier data;
the unprocessed data packet of which the current data processing timestamp is in the range of [ low water level + sliding time window width/2, low water level + sliding time window width ] is the normal data to be processed;
unprocessed data with a current data processing timestamp less than the low water level is discardable data, and due to control of the low water level sliding time window, discarded data are few.
As shown in FIG. 3, in the low water level sliding time window embodiment, A is discardable data; B, D and F are processed data; C is outlier data; and E and G are normal data to be processed.
As shown in fig. 4, which gives snapshots of the low water level sliding time window of a data processing at three points in time, the rectangle represents the low water level sliding time window and its left vertical edge is the low water level; the window shifts continuously to the right as system time passes. The data in each snapshot is distributed along a horizontal time axis. At any point in time, unprocessed data is above the horizontal axis of the low water level sliding time window and processed data is below it. The timestamp range of the low water level sliding time window is [lowwatermark, lowwatermark + w]. Within the window, unprocessed data with a timestamp in [lowwatermark + w/2, lowwatermark + w] is non-outlier data, and unprocessed data with a timestamp in [lowwatermark, lowwatermark + w/2) is outlier data (e.g. data I in fig. 4). Unprocessed data that is not within the low water level sliding time window is discarded directly (e.g. data H in fig. 4); because of the control exerted by the low water level sliding time window, little data is discarded.
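To make the effect of the moving window concrete, the following sketch follows one unprocessed packet through three snapshots. The numbers (w = 20, a packet timestamp of 118, low water levels of 100, 110 and 120) and the function name `window_position` are invented for illustration and do not come from FIG. 4.

```python
def window_position(ts, low_water_level, w):
    """Classify one unprocessed timestamp against the current window."""
    if ts < low_water_level:
        return "discarded"
    if ts < low_water_level + w / 2:
        return "outlier"
    # Timestamps past the right edge of the window are not modelled in this sketch.
    return "normal"


w = 20
packet_ts = 118
snapshots = [100, 110, 120]  # low water level at three points in time

history = [window_position(packet_ts, level, w) for level in snapshots]
print(history)  # ['normal', 'outlier', 'discarded']
```

The packet itself never changes; only the window moves. A packet that stays unprocessed first counts as normal, then as an outlier once the midpoint of the window passes it, and is finally discarded when the low water level overtakes its timestamp.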
Although the embodiments of the present invention have been described with reference to the accompanying drawings, it is not intended to limit the scope of the present invention, and it should be understood by those skilled in the art that various modifications and variations can be made without inventive efforts by those skilled in the art based on the technical solution of the present invention.
Claims (6)
1. The outlier data discovery method based on the low water level sliding time window is characterized by comprising the following steps:
step (1): data distribution: receiving an external data stream, and then distributing the external data stream to each data processing node;
step (2): data processing: the data processing node processes the received external data stream;
defining a low water level sliding time window, wherein the time stamp of the low water level sliding time window starts from a low water level initial value, and the width of the low water level sliding time window is w; the timestamp range of the low water level sliding time window is [ the initial value of the low water level, the initial value of the low water level + the width w of the low water level sliding time window ];
the width w of the low water level sliding time window is set according to the maximum data arrival delay that the data processing can tolerate;
taking the time stamp as a horizontal coordinate axis, continuously moving a low-water-level sliding time window from left to right on the horizontal coordinate axis of the time stamp along with the time lapse, wherein unprocessed data is arranged above the horizontal coordinate axis of the low-water-level sliding time window at any time point, and processed data is arranged below the horizontal coordinate axis; then, according to the position of the current data processing time stamp in the low water level sliding time window range, finding whether the current data processing is outlier data;
the step of finding whether the current data processing is outlier according to the position of the current data processing time stamp in the low water level sliding time window range comprises the following steps:
if the current data processing timestamp is in the range of [ low water level, low water level + sliding time window width/2 ], the unprocessed data packet is outlier data;
if the current data processing timestamp is in the range of [ low water level + sliding time window width/2, low water level + sliding time window width ], the unprocessed data packet is normal data to be processed;
if the current data processing timestamp is less than the low water level, the unprocessed data is discardable data;
step (3): data aggregation: summarizing and outputting the data processing results.
2. The method of claim 1 for outlier data discovery based on a low water level sliding time window,
in step (2), data streams with different keywords can be processed concurrently on different data processing nodes.
3. The method of claim 1 for outlier data discovery based on a low water level sliding time window,
the low water level initial value acquisition step is as follows: identifying a timestamp of an earliest unprocessed data packet in the current data processing as a low water level of the current data processing; identifying a timestamp of an earliest unprocessed data packet in upstream data processing of the current data processing as a low water level of the upstream data processing; then comparing the low water level of the current data processing with the low water level of the upstream data processing of the current data processing, taking the smaller low water level as the low water level of the current data processing, then tracing the upstream data processing along the stream processing network topology, finally finding the earliest unprocessed data of the whole stream processing network through recursion, and taking the timestamp of the earliest unprocessed data of the whole stream processing network as the initial value of the low water level.
4. An outlier data discovery system based on a low water level sliding time window, comprising:
the data distribution module: receiving an external data stream, and then distributing the external data stream to each data processing node;
a data processing module: the data processing node processes the received external data stream;
defining a low water level sliding time window, wherein the time stamp of the low water level sliding time window starts from a low water level initial value, and the width of the low water level sliding time window is w; the timestamp range of the low water level sliding time window is [ the initial value of the low water level, the initial value of the low water level + the width w of the low water level sliding time window ];
the width w of the low water level sliding time window is set according to the maximum data arrival delay that the data processing can tolerate;
taking the time stamp as a horizontal coordinate axis, continuously moving a low-water-level sliding time window from left to right on the horizontal coordinate axis of the time stamp along with the time lapse, wherein unprocessed data is arranged above the horizontal coordinate axis of the low-water-level sliding time window at any time point, and processed data is arranged below the horizontal coordinate axis; then, according to the position of the current data processing time stamp in the low water level sliding time window range, finding whether the current data processing is outlier data;
the step of finding whether the current data processing is outlier according to the position of the current data processing time stamp in the low water level sliding time window range comprises the following steps:
if the current data processing timestamp is in the range of [ low water level, low water level + sliding time window width/2 ], the unprocessed data packet is outlier data;
if the current data processing timestamp is in the range of [ low water level + sliding time window width/2, low water level + sliding time window width ], the unprocessed data packet is normal data to be processed;
if the current data processing timestamp is less than the low water level, the unprocessed data is discardable data;
a data aggregation module: and summarizing and outputting the data processing results.
5. The system of claim 4 wherein the outlier data discovery system based on a low water level sliding time window,
data streams from different keywords in the data processing module can be processed concurrently on different data processing nodes.
6. The system of claim 4 wherein the outlier data discovery system based on a low water level sliding time window,
the low water level initial value acquisition step is as follows: identifying a timestamp of an earliest unprocessed data packet in the current data processing as a low water level of the current data processing; identifying a timestamp of an earliest unprocessed data packet in upstream data processing of the current data processing as a low water level of the upstream data processing; then comparing the low water level of the current data processing with the low water level of the upstream data processing of the current data processing, taking the smaller low water level as the low water level of the current data processing, then tracing the upstream data processing along the stream processing network topology, finally finding the earliest unprocessed data of the whole stream processing network through recursion, and taking the timestamp of the earliest unprocessed data of the whole stream processing network as the initial value of the low water level.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710284487.4A CN107124329B (en) | 2017-04-25 | 2017-04-25 | Outlier data discovery method and system based on low-water-level sliding time window |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710284487.4A CN107124329B (en) | 2017-04-25 | 2017-04-25 | Outlier data discovery method and system based on low-water-level sliding time window |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107124329A CN107124329A (en) | 2017-09-01 |
CN107124329B CN107124329B (en) | 2020-05-05 |
Family
ID=59724825
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710284487.4A Active CN107124329B (en) | 2017-04-25 | 2017-04-25 | Outlier data discovery method and system based on low-water-level sliding time window |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107124329B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110069331A (en) * | 2019-04-24 | 2019-07-30 | 北京百度网讯科技有限公司 | A kind of data processing method, device and electronic equipment |
CN110460495B (en) * | 2019-08-01 | 2024-02-23 | 北京百度网讯科技有限公司 | Water level propelling method and device, computing node and storage medium |
US11809430B2 (en) | 2019-10-23 | 2023-11-07 | Nec Corporation | Efficient stream processing with data aggregations in a sliding window over out-of-order data streams |
CN111478949B (en) * | 2020-03-25 | 2022-05-24 | 中国建设银行股份有限公司 | Data processing method and device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101908065A (en) * | 2010-07-27 | 2010-12-08 | 浙江大学 | On-line attribute abnormal point detecting method for supporting dynamic update |
CN103400152A (en) * | 2013-08-20 | 2013-11-20 | 哈尔滨工业大学 | High sliding window data stream anomaly detection method based on layered clustering |
CN105764162A (en) * | 2016-05-10 | 2016-07-13 | 江苏大学 | Wireless sensor network abnormal event detecting method based on multi-attribute correlation |
Non-Patent Citations (3)
Title |
---|
"I-IncLOF: Improved Incremental Local Outlier Detection for Data Streams";Seyed Hesamodin Karimian等;《The 16th CSI International Symposium on Artificial Intelligence and Signal Processing (AISP 2012)》;20120927;全文 * |
"MillWheel: Fault-Tolerant Stream Processing at Internet Scale";Tyler Akidau等;《VLDB Endowment》;20130430;第6卷(第11期);第2章-第7章 * |
"基于滑动窗口模型的数据流离群点检测研究";赵学良;《中国优秀硕士学位论文全文数据库 信息科技辑》;20130315;第4.3-4.4节 * |
Also Published As
Publication number | Publication date |
---|---|
CN107124329A (en) | 2017-09-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107124329B (en) | Outlier data discovery method and system based on low-water-level sliding time window | |
US7865760B2 (en) | Use of T4 timestamps to calculate clock offset and skew | |
US9239864B2 (en) | Distributing and processing streams over one or more networks | |
CN106301953B (en) | Distributed fault-tolerant clock synchronous method and system suitable for time trigger Ethernet | |
US10491498B2 (en) | Method and device for fingerprint based status detection in a distributed processing system | |
CN106873945A (en) | Data processing architecture and data processing method based on batch processing and Stream Processing | |
EP3396909A1 (en) | Data processing method and device | |
CN109787867B (en) | Block generation method and device, computer equipment and storage medium | |
Volz et al. | Supporting strong reliability for distributed complex event processing systems | |
CN106027414A (en) | HDFS-oriented parallel network message reading method | |
CN107181805A (en) | It is a kind of that the method that global orderly is recurred is realized under micro services framework | |
US9948540B2 (en) | Method and system for detecting proxy internet access | |
WO2019091068A1 (en) | Double 2-vote-2 redundancy structure data processing method | |
CN103995901B (en) | A kind of method for determining back end failure | |
Hwang et al. | Fast and reliable stream processing over wide area networks | |
Fu et al. | Clustering-preserving network flow sketching | |
Perumalla et al. | Virtual time synchronization over unreliable network transport | |
Yingchareonthawornchai et al. | Analysis of bounds on hybrid vector clocks | |
US20130173691A1 (en) | System for high reliability and high performance application message delivery | |
Défago et al. | Total order broadcast and multicast algorithms: Taxonomy and survey | |
CN102841840B (en) | The message logging restoration methods that Effect-based operation reorders and message number is checked | |
WO2020172881A1 (en) | Block generation method and apparatus, computer device and storage medium | |
CN107491359A (en) | A kind of distributed magnanimity real-time stream disaster recovery system and method | |
CN105205168A (en) | Exposure system based on Redis database and operation method thereof | |
CN113381902A (en) | Method, apparatus and computer storage medium for detecting cross-regional network link |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||