CN112313638A - Efficient time-based association of data streams - Google Patents


Info

Publication number
CN112313638A
Authority
CN
China
Prior art keywords: data, hash, hash table, data segment, determining
Legal status
Pending
Application number
CN201980041481.1A
Other languages
Chinese (zh)
Inventor
乔希思·雷亚洛斯科德利
马尼卡瓦萨根·贾亚拉曼
阿泰特·库马尔·K·谢蒂
Current Assignee
Cisco Technology Inc
Original Assignee
Cisco Technology Inc
Application filed by Cisco Technology Inc
Publication of CN112313638A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24568Data stream processing; Continuous queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2255Hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24553Query execution of query operations
    • G06F16/24554Unary operations; Data partitioning operations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

Techniques for efficient data association are provided. A first data segment is received, and a first hash table of a plurality of hash tables is selected based on a timestamp associated with the first data segment. A first hash bucket in the first hash table is identified based on the first data segment, and it is determined that the first hash bucket includes a second data segment. Upon determining that the first hash bucket satisfies the predefined criteria, the second data segment is removed from the first hash bucket and the first data segment and the second data segment are associated.

Description

Efficient time-based association of data streams
Cross Reference to Related Applications
This application claims the benefit of co-pending U.S. provisional patent application No. 62/694,403, filed on July 5, 2018. The foregoing related patent application is incorporated herein by reference in its entirety.
Technical Field
Embodiments presented in this disclosure generally relate to time-based association of data. More specifically, embodiments disclosed herein relate to techniques for associating different data sources in an efficient and scalable manner.
Background
Data streams may come from a variety of sources, and are increasingly used today for a variety of processes. For example, telemetry data is typically streamed to provide constant (or near constant) monitoring. Such telemetry data may include data for monitoring any number of data sources, e.g., network telemetry, vehicle or equipment telemetry, weather telemetry, etc. A data stream may include a large amount of data, and the nature of the stream typically means that the data is received relatively continuously, with limited (or no) interruptions occurring. Accordingly, there is a need for efficient techniques to process such data to ensure that monitoring of the data stream continues without delay or interruption. Such delays or interruptions can cause serious problems depending on the particular data stream. Further, in many embodiments, the data streams are received from a variety of different sources, and the data must be correlated or aligned to facilitate analysis of the data. Unfortunately, existing solutions require a large amount of resources and do not operate fast enough to enable real-time monitoring of large data streams.
Drawings
So that the manner in which the above recited features of the present disclosure can be understood in detail, a more particular description of the disclosure, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this disclosure and are therefore not to be considered limiting of its scope, for the disclosure may admit to other equally effective embodiments.
Fig. 1 is a block diagram illustrating a system for associating data streams according to one embodiment disclosed herein.
FIG. 2 illustrates a data correlation engine according to one embodiment disclosed herein.
FIG. 3 illustrates a multi-level association system according to one embodiment disclosed herein.
Fig. 4A-4B illustrate a cluster of multi-level association systems according to one embodiment disclosed herein.
FIG. 5 is a block diagram of a cluster of correlation engines in accordance with one embodiment disclosed herein.
FIG. 6 is a flow chart illustrating a method of associating data according to one embodiment disclosed herein.
FIG. 7 is a flow chart illustrating a method for determining data associations according to one embodiment disclosed herein.
FIG. 8 is a flow diagram illustrating a method of using a circular buffer to associate data in accordance with one embodiment disclosed herein.
Fig. 9 is a flow diagram illustrating a method of associating data segments according to one embodiment disclosed herein.
FIG. 10 is a flow diagram illustrating a method of associating data segments according to one embodiment disclosed herein.
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures. It is contemplated that elements disclosed in one embodiment may be beneficially utilized on other embodiments without specific recitation.
Detailed Description
Aspects of the invention are set out in the independent claims, with preferred features set out in the dependent claims. Features of one aspect may be applied to each aspect individually or in combination with the other aspects.
According to one embodiment presented in the present disclosure, a method is provided. The method includes receiving a first data record of a plurality of data records in a data stream. The method further includes selecting a first element in a ring buffer based on a timestamp of the first data record, wherein the ring buffer includes a plurality of elements, each element corresponding to a respective time window. Further, the method includes identifying a first hash table associated with the first element in the ring buffer. The method further includes generating a first hash value based on the first data record, and determining that a second data record is associated with the first hash value in the first hash table. Further, the method includes removing the second data record from the first hash table, linking the first data record and the second data record, and transmitting the linked first data record and second data record to a downstream operator.
According to a second embodiment presented in the present disclosure, a computer program product is provided. The computer program product includes a computer-readable storage medium having computer-readable program code embodied therein, the computer-readable program code executable by one or more computer processors to perform operations. The operations include receiving a first data segment and selecting a first hash table of a plurality of hash tables based on a timestamp associated with the first data segment. Further, the operations include identifying a first hash bucket in the first hash table based on the first data segment, and determining that the first hash bucket includes a second data segment. Further, upon determining that the first hash bucket satisfies the predefined criteria, the operations include removing the second data segment from the first hash bucket and associating the first data segment and the second data segment.
According to a third embodiment presented in the present disclosure, a system is provided. The system includes one or more computer processors and a memory containing a program that, when executed by the one or more computer processors, performs operations. The operations include receiving a first data segment and selecting a first hash table of a plurality of hash tables based on a timestamp associated with the first data segment. Further, the operations include identifying a first hash bucket in the first hash table based on the first data segment, and determining that the first hash bucket includes a second data segment. Further, upon determining that the first hash bucket satisfies the predefined criteria, the operations include removing the second data segment from the first hash bucket and associating the first data segment and the second data segment.
Example embodiments
New-generation network devices can export various telemetry data through both software (i.e., software telemetry) and hardware (i.e., hardware telemetry) sensors. In an embodiment, telemetry data from multiple sensors and switches (or other devices) must be correlated and aligned in real time to ensure meaningful insight. However, when hardware telemetry data (e.g., traffic telemetry, streaming statistics, etc.) is involved, the data rate may be very high (greater than one million streams per second), which requires efficient ingestion and processing techniques. In various portions of this disclosure, network telemetry is used as an illustrative example. However, as will be appreciated by those of ordinary skill in the art, embodiments of the present disclosure may be readily applied to any data streaming environment (e.g., in various embedded devices), or to any implementation that requires associating data sources.
Existing time-based stream correlation engines do not provide sufficient efficiency, meaning that they cannot maintain real-time correlation of the data, especially as data rates increase. This results in delays that can seriously affect processing and monitoring efforts. Furthermore, existing processing engines are computationally expensive, which limits the horizontal scaling of microservice architectures. Embodiments of the present disclosure enable efficient correlation of traffic telemetry (FT) data, including data collected from hardware, Application Specific Integrated Circuits (ASICs), forwarding engines in network nodes, and the like. Furthermore, embodiments of the present disclosure enable efficient correlation of streaming buffer statistics (SSX) data, which may be exported by hardware, ASICs, forwarding engines, and the like.
Embodiments of the present disclosure provide efficient algorithms and techniques for real-time correlation of high-rate data stream data (e.g., network telemetry data). Embodiments disclosed herein are lightweight in terms of resource requirements, which enables horizontal scaling and multi-level deployment models. Furthermore, embodiments of the present disclosure provide O(1) efficiency, making implementations much less sensitive to data rate variations. Advantages of the present disclosure include better algorithmic efficiency in terms of space and time complexity, a very low memory footprint compared to existing systems, horizontal and vertical scalability for large use cases, and applicability to real-time correlation of high-rate, time-sensitive data from sensors or arrays of data sources.
In one embodiment of the present disclosure, these advantages are achieved by using one or more ring buffers (also referred to as time wheels, circular buffers, etc.) to correlate data stream data. In an embodiment, each piece of data is placed into an appropriate element on the ring buffer based on a timestamp associated with the piece of data (e.g., a timestamp of the corresponding event, or a timestamp at which the data was recorded or sensed). Further, in one embodiment, the data is stored and associated based on association logic established using the corresponding metadata, as discussed in more detail below. Although the embodiments herein are discussed with respect to a circular buffer, in various embodiments, any buffer structure (including queues, linear buffers, etc.) may be utilized. Further, although a circular buffer is discussed, in embodiments, the buffer may occupy linear memory, and each end of the buffer may be marked or annotated such that the read and write pointers return to the beginning of the buffer upon reaching either end. Furthermore, in embodiments, each element or block of data in the circular buffer need not necessarily be stored in a contiguous block of memory, and may be stored with unrelated data between or among the portions of memory associated with the circular buffer.
In an embodiment of the present disclosure, a circular buffer is created, wherein each slot in the circular buffer is associated with a respective hash table, and wherein each slot corresponds to a time window. Data is received and segmented. For each data segment, a slot in the ring buffer is selected based on a timestamp associated with the data segment. The hash table corresponding to the identified slot is searched to identify a match to the data segment. If a match is found, the data is removed from the table and the two segments are associated. If no matching entry is found, the segment is stored in the hash table.
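By way of illustration only, the following Python sketch shows one possible shape of this flow: a ring of per-window hash tables, timestamp-based slot selection, and pop-and-link on a match. The class and field names (RingCorrelator, ts_ms, key, etc.) are assumptions made for the example and are not taken from the disclosure; rotation and multi-source counting are described separately below.

```python
from collections import defaultdict


class RingCorrelator:
    """A ring of per-window hash tables, keyed by a user-supplied association key."""

    def __init__(self, num_slots=10, window_ms=100, key_fn=lambda rec: rec["key"]):
        self.num_slots = num_slots    # number of elements (slots) in the ring buffer
        self.window_ms = window_ms    # width of the time window covered by each slot
        self.key_fn = key_fn          # association logic, e.g., 5-tuple extraction
        # One hash table per slot; each bucket holds the records buffered for a key.
        self.tables = [defaultdict(list) for _ in range(num_slots)]

    def _slot_for(self, ts_ms):
        # Map the record's timestamp onto the slot whose time window contains it.
        return (ts_ms // self.window_ms) % self.num_slots

    def correlate(self, record):
        """Return an associated pair, or None if the record was buffered to await a match."""
        bucket = self.tables[self._slot_for(record["ts_ms"])][self.key_fn(record)]
        if bucket:
            match = bucket.pop()      # match found: remove the stored segment...
            return (match, record)    # ...and link it with the new one for downstream use
        bucket.append(record)         # no match yet: store the segment in the hash table
        return None
```

For example, two calls to correlate with records that share the same key and fall in the same 100 ms window would buffer the first record and return the linked pair on the second call.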
Fig. 1 is a block diagram illustrating a system 100 for correlating data streams according to one embodiment disclosed herein. As shown, the system 100 includes a stream association device 105, a plurality of data sources 103A-N, and a correlated data sink 155. In the illustrated embodiment, the data sources 103A-N are network devices (e.g., routers, switches, etc.), and the data streams to be associated include network telemetry. Of course, in various embodiments, the stream association device 105 may be used to process and associate any type of data in any type of data stream from any type of sensor or device. Further, although a single correlated data sink 155 is shown, in embodiments, any number of sinks may be utilized. Similarly, although the correlated data sink 155 is shown as a database, in embodiments, the correlated data sink 155 may include one or more programs or applications, devices, or the like, depending on the particular implementation.
In the illustrated embodiment, the stream association device 105 generally receives data (e.g., data in one or more data streams) and associates the data based on a timestamp of the data, metadata associated with the data, and/or the content of the data. In embodiments, the particular function used to associate the data may vary based on the type of data being associated and the association desired. In one embodiment, a user or administrator may define the correlation function based on the particular implementation and desired results. Although a single device is shown, in embodiments, the stream association device 105 may operate as a distributed system across one or more devices. Similarly, in embodiments, the operations of the stream association device 105 may be performed in a virtual environment (e.g., in one or more logical partitions or virtual machines). Further, in embodiments, multiple stream association devices 105 may operate in a cluster or grouping to distribute incoming workloads or provide different associations, as discussed in more detail below.
In the illustrated embodiment, the stream association device 105 includes a processor 110, a memory 115, a storage 120, and a network interface 125. In the illustrated embodiment, the processor 110 retrieves and executes programming instructions stored in the memory 115 and stores and retrieves application data residing in the storage 120. The processor 110 represents a single CPU, multiple CPUs, a single CPU having multiple processing cores, and the like. The memory 115 is generally included to represent random access memory. The storage 120 may be a disk drive or a flash-based storage device, and it may comprise fixed and/or removable storage devices, such as fixed disk drives, removable memory cards, optical storage, network-attached storage (NAS), or a storage area network (SAN). The stream association device 105 may be communicatively coupled with other devices (including the data sources 103A-N, the correlated data sink 155, etc.) through the network interface 125.
As shown, the memory 115 includes a correlation engine 130. Although a single correlation engine 130 is shown, in embodiments, multiple correlation engines 130 may operate sequentially or in parallel to perform multiple levels of correlation operations. Further, in some embodiments, the correlation engine 130 may operate as a cluster or group of engines to distribute processing of incoming data. As shown, the correlation engine 130 includes a pre-processing component 135, a buffer component 140, and a post-processing component 150. Further, in the illustrated embodiment, the buffer component 140 includes a plurality of hash tables 145A-N. Although the pre-processing component 135, the buffer component 140, and the post-processing component 150 are shown as separate components, in embodiments their operations may alternatively be combined or separated among one or more other components.
In one embodiment, the pre-processing component 135 receives a data stream (e.g., from another correlation engine 130, another stream association device 105, or one or more of the data sources 103A-N) and performs pre-processing to facilitate associating the data. In one embodiment, the pre-processing component 135 delineates the received data stream into discrete records, packets, portions, segments, or logical structures. For example, in such embodiments, if the data stream includes streaming statistics (SSX) data, the pre-processing component 135 identifies the SSX packets defined in the data. Further, in some embodiments, the pre-processing component 135 filters the incoming data stream (or the identified or delineated portions, packets, or records in the data stream) based on predefined criteria. For example, in such embodiments, an administrator may wish to exclude data from a specified device or interface, data collected at certain times, data relating to certain flows, and so on.
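By way of illustration, a minimal Python sketch of this pre-processing step is shown below; the line-delimited JSON framing and the "interface" field are assumptions made for the example and are not details from the disclosure.

```python
import json


def preprocess(raw_lines, excluded_interfaces=frozenset()):
    """Delineate a raw stream into discrete records and drop excluded interfaces."""
    for line in raw_lines:
        record = json.loads(line)                     # assumed line-delimited JSON framing
        if record.get("interface") in excluded_interfaces:
            continue                                  # apply the predefined filter criteria
        yield record
```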
Once the data has been pre-processed, the buffer component 140 receives the pre-processed data (e.g., the demarcated portions of a data stream) and buffers the data for association. In an embodiment, this buffering provides the time-buffering needed to absorb sensor delays and/or transmission delays from the one or more data sources 103A-N. In one embodiment, the buffer component 140 generates a hash value for each received data portion and stores the data portion in one or more hash tables 145A-N based on the hash value. In one embodiment, the buffer component 140 generates a key for each data portion based on the content of the portion or metadata associated with the portion. For example, in one embodiment, the data stream comprises streaming data associated with a network flow. In such embodiments, the buffer component 140 may generate a hash key for each data portion (e.g., for each flow or packet identified in the traffic telemetry) based on the associated 5-tuple (e.g., source IP address, destination IP address, source port, destination port, and protocol). For example, in embodiments involving correlating traffic telemetry, each portion or record of traffic telemetry may include data about a particular flow or network event, e.g., the entry or exit of a particular packet. Depending on the particular association desired, the hash key for each telemetry record may correspond to one or more of the source and destination IP addresses, the source and destination ports, and/or the transport protocol of the packet.
Similarly, in embodiments involving SSX telemetry, the buffer component 140 may generate a hash key for each portion of SSX data based on metadata associated with the portion, such as the switch, interface, and/or queue to which the portion of data relates or which it describes. Further, in embodiments, different types of data (e.g., traffic telemetry and SSX data) may be correlated by selecting correlation criteria that are meaningful for both data streams, as discussed in more detail below. In embodiments, the particular function (e.g., the particular data or metadata used) used to generate the hash key may depend on the type of data being associated and the type of association desired. In embodiments, these keys are defined based on user or administrator configuration.
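The sketch below illustrates how such configurable keys might be formulated for the two example data types; the field names and the use of MD5 are illustrative assumptions rather than details specified by the disclosure, and any stable hash function could be substituted.

```python
import hashlib


def flow_key(record):
    # Traffic-telemetry association: key on the packet's 5-tuple.
    return "|".join(str(record[f]) for f in
                    ("src_ip", "dst_ip", "src_port", "dst_port", "protocol"))


def ssx_key(record):
    # SSX association: key on the switch, interface, and queue the statistics describe.
    return "|".join(str(record[f]) for f in ("switch", "interface", "queue"))


def bucket_index(key, num_buckets):
    # Any stable hash works here; MD5 is used purely for illustration.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets
```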
In one embodiment, when a collision is detected in the hash tables 145A-N, the buffer component 140 determines that the received record matches a record stored in the hash tables 145A-N, and thus that the records should be associated and correlated with each other. For example, assume that the first record stored in the hash tables 145A-N corresponds to the departure of a packet from a first switch or router, and the newly received record corresponds to that packet entering the next switch or router in the network. In an embodiment, the buffer component 140 may pop the record stored in the hash tables 145A-N and associate or link it with the newly received record. In an embodiment, this associated data is then transmitted downstream (e.g., to the correlated data sink 155, to another correlation engine 130, and/or to another stream association device 105).
In one embodiment, to provide time-based associations as well as content-based associations, the buffer component 140 can assign each hash table 145A-N to a particular time window. For example, in one embodiment, the buffer component 140 utilizes a ring buffer having N elements, where each element includes a particular hash table 145A-N (or a pointer to a particular hash table 145A-N). In one embodiment, the number of elements in the circular buffer may be defined by an administrator. In an embodiment, each element in the circular buffer corresponds to a particular time window (e.g., 100 milliseconds). When the rotation time of the circular buffer has elapsed, the oldest data may be deleted or overwritten to make room for newer data. For example, assume that the circular buffer includes ten elements, each of which spans one hundred milliseconds. In such an embodiment, the rotation time of the buffer is one second. That is, the end pointer of the circular buffer may point to a time of approximately one second in the past, and data older than one second may be discarded, as described in more detail below.
In an embodiment, upon receiving a record or portion of data, the buffer component 140 may determine a timestamp of the record. In embodiments, this may correspond to the time of an event corresponding to the data record (e.g., the arrival or departure of a packet, the addition of data to a queue, the removal of data from a queue, the receipt of data to be processed, the completion of processing, the triggering of a sensor, the time at which a sensor reading or data value was recorded, etc.). Based on the timestamp, the buffer component 140 can identify the corresponding element in the ring buffer to determine which hash table 145A-N to search. In this way, the data is associated based on the content of the data within a predefined time window.
In an embodiment, once the association is identified, the post-processing component 150 receives the associated data and completes any further operations. For example, in one embodiment, the post-processing component 150 filters the associated records based on predefined filter criteria. In another embodiment, the post-processing component 150 compacts or compresses the records, or otherwise prepares them for transmission, storage, and/or processing by a downstream entity. The associated data is then transmitted to the correlated data sink 155.
In embodiments, there are several configuration options that may be used to tune a particular implementation of the correlation engine 130. For example, in an embodiment, an administrator may define the amount of time for which data should be buffered. Further, in an embodiment, an administrator may define the width of each element in the time wheel (e.g., the amount of time included within the window). In an embodiment, a larger time bucket size corresponds to a coarser association granularity. Further, in an embodiment, an administrator may define the memory available to the buffer, as well as the rate and amount of data for which the association is performed. Further, in embodiments, a user or administrator may define the association type by specifying the particular data to be used in associating the data (e.g., the specific data or metadata in each record to be used to align the data).
FIG. 2 illustrates a data correlation engine 130 according to one embodiment disclosed herein. In the illustrated embodiment, the correlation engine 130 includes a ring buffer 215 and several blocks 205, 210, and 225 for processing data. As shown, data is consumed from one or more data pipelines and, at block 205, is preprocessed to demarcate it into separate data records. Further, at block 210, any additional filtering and input processing is performed. The data records then proceed to the ring buffer 215. As shown, the ring buffer 215 includes individual data elements 220.1, 220.2, ..., 220.M. Further, as indicated by the ellipses, the ring buffer 215 may include any number of data elements 220, each corresponding to a respective time window. In an embodiment, the number of data elements 220 is defined by a user or administrator. Further, in an embodiment, the width of each data element 220 (e.g., the amount of time included therein) is also defined by a user or administrator. Further, while a single hash table 145 is shown for clarity, in an embodiment, each data element 220 is associated with a respective hash table 145, as described above. In one embodiment, each respective data element 220 includes a pointer to a respective hash table 145.
As described above, in one embodiment, each data element 220 covers a predefined time window. In the illustrated embodiment, when a data record is received, the correlation engine 130 identifies a timestamp of the record and selects the data element 220 that covers the time window including that timestamp. For example, a first data record may correspond to data element 220.1, while a subsequent data record corresponds to data element 220.2. Note that data records are assigned to data elements 220 not based on when they were received, but rather based on their timestamps. Thus, after placing a record in data element 220.2, the next data record may belong to data element 220.1, data element 220.M-1, or any other element. In an embodiment, once the appropriate data element 220 is identified, the correlation engine 130 utilizes the record's hash value (which may be generated based on predefined criteria, as described above) to search the hash table 145 corresponding to the identified data element 220. If no match is found, in an embodiment, the correlation engine 130 inserts the data record into the hash table 145 at the location specified by the hash value and proceeds to the next data record.
As shown, the hash table 145 is partitioned into a number of entries or buckets. In an embodiment, the particular entry or bucket to which a data record belongs is defined based on the hash value of the data record. For example, in the illustrated embodiment, each entry corresponds to a hash value, from "hash1" and "hash2" through "hashN." In an embodiment, the number of entries or buckets in each hash table 145 is defined by a user or administrator. In an embodiment, identifying one or more matching data records for the newly received data record comprises hashing the data record (or some identified portion thereof) to determine the appropriate hash bucket and determining whether one or more data records are already present in that entry. If so, those records are considered matches for the currently received record.
In an embodiment, if a matching entry is identified in the hash table 145, the correlation engine 130 removes the matching data record from the hash table 145, correlates or links the two data records (e.g., the record stored in the hash table 145 and the newly received and in-process record), and transfers them to block 225 for post-processing. In some embodiments, the associated data records are included in a combined data structure that references or points to all data records belonging to a particular association (e.g., a particular type of association and a particular time window). As shown, the correlation engine 130 may also perform any output processing, filtering, and compression of the desired correlated data records in block 225. In one embodiment, the correlation engine 130 also analyzes the timestamp to determine whether to rotate the buffer as each new data record is received. For example, if the timestamp is earlier than the time covered by the earliest data element 220 (e.g., the timestamp is less than the current time minus the total wheel rotation time), the record is immediately discarded. Similarly, if the timestamp is newer than the window included in the most recent data element 220, the correlation engine 130 rotates the wheel by the width of one data element 220, thereby erasing the data older than the wheel rotation time (found in the last data element 220) and clearing space for the new hash table 145 for the most recent time window.
In an embodiment, rotation of the circular buffer 215 is accomplished by moving a start pointer and an end pointer (also referred to as head and tail pointers). In some embodiments, the head pointer points to the newest data element 220, and the tail pointer points to the oldest data element 220. For example, upon determining that the circular buffer 215 should be rotated, the correlation engine 130 may move both pointers to their adjacent data elements 220. If the circular buffer 215 is implemented as a linear sequence of data elements 220, the correlation engine 130 may check whether the updated pointer location places it past the end of the buffer. If so, the pointer is placed at the other end of the buffer. In this manner, the pointers move down the buffer, one data element 220 at a time, until looping back to the beginning.
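A minimal sketch of this pointer movement over a linear array of slots follows; the names are illustrative, and the optional flush callback reflects the variant, described below, in which leftover records are emitted rather than discarded.

```python
def advance(index, num_slots):
    # Move a pointer to its adjacent slot; if that places it past the end of the
    # linear buffer, wrap it back to the beginning (equivalent to modulo arithmetic).
    return index + 1 if index + 1 < num_slots else 0


class RingPointers:
    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.head = num_slots - 1     # points at the newest time window
        self.tail = 0                 # points at the oldest time window

    def rotate(self, tables, flush=None):
        """Recycle the oldest slot so that it holds the newest time window."""
        if flush is not None:
            flush(tables[self.tail])  # optionally emit any leftover records first
        tables[self.tail] = {}        # fresh hash table for the newest window
        self.head = self.tail         # the recycled slot becomes the newest window
        self.tail = advance(self.tail, self.num_slots)
```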
In some embodiments, the correlation engine 130 is used to correlate two pieces of data. For example, in one embodiment, data from two sources may be correlated. In such an embodiment, when a match is found, the data is immediately removed from the ring buffer 215 and forwarded downstream. However, in some embodiments, the correlation engine 130 is configured to correlate three or more pieces of data. For example, if data from three or more sources (e.g., three or more sensors, network devices, etc.) is to be correlated, the correlation engine 130 does not immediately pop data out of the circular buffer 215 upon identifying a matching entry.
In one embodiment, if a matching item is identified, the correlation engine 130 (e.g., the buffer component 140) determines the number of data sources to be correlated and identifies the number of data items associated with the current hash value (e.g., how many have already been identified and correlated). If the number of data items (e.g., the number of matching items) in the particular bucket of the selected hash table 145 satisfies a predefined value, all of these data portions are removed from the ring buffer 215 and associated with one another. However, if the number of data items is below the defined value, the new data portion is inserted into the identified location and the correlation engine 130 continues on, awaiting the remaining data.
For example, in a spine-leaf network topology, traffic telemetry data may be collected from three nodes (e.g., an ingress leaf node, an egress leaf node, and a spine node). In such embodiments, the correlation engine 130 may refrain from removing data from the ring buffer 215 until data has been received from all three nodes. In an embodiment, each time data is inserted into the hash table 145, a counter for the particular bucket (e.g., the particular hash value) is incremented. When the target count of traffic records has been received (e.g., one from each data source), the data in the identified bucket is popped and sent as a set of associated data records to the next processing stage. Further, in embodiments, similar logic applies when associating traffic data with SSX data from multiple switches, as well as when associating multiple types of data for the same switch (e.g., ingress queue occupancy, egress queue occupancy, ingress queue drops, egress queue drops, etc.). Although traffic telemetry and SSX data are used herein as an example association, embodiments of the present disclosure may be utilized to associate any type of data from any number of data sources.
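The following sketch captures this multi-source counting; using the bucket length as the match counter and a target count of three (ingress leaf, spine, egress leaf) are illustrative choices for the example, not requirements of the disclosure.

```python
from collections import defaultdict


def correlate_multi(table, key, record, target_count=3):
    """Buffer records under `key` until `target_count` have arrived, then emit the set."""
    bucket = table[key]
    bucket.append(record)
    if len(bucket) >= target_count:       # the bucket length acts as the match counter
        associated = list(bucket)         # pop the full set of matching records
        bucket.clear()
        return associated                 # hand the associated set to the next stage
    return None                           # keep waiting for the remaining data sources
```

With table = defaultdict(list), the first two calls for a given key return None, and the third returns all three records as a single association.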
In some embodiments, any data still remaining in the oldest data element 220 is lost when the circular buffer 215 rotates. In other embodiments, before overwriting the data element 220, the correlation engine 130 extracts any remaining data, links it appropriately, and forwards it downstream. For example, assume that a particular implementation requires the association of data records from three different data sources. If a particular hash bucket has only two such records (e.g., the correlation engine 130 has not received the third record), the correlation engine 130 may still link, correlate, or otherwise connect the two records and transmit them for further processing. In some embodiments, these incomplete associations are transmitted to different data sinks, or are modified or processed differently for output.
FIG. 3 illustrates a multi-level association system 300 according to one embodiment disclosed herein. In the illustrated embodiment, independent correlation engines 130 work together as a fully functional single-node correlation system 300. As shown, each correlation engine 130 serves different kinds of input sensors (e.g., different types of data streams). In the illustrated embodiment, the system 300 includes a pipeline stage 350, an association stage 351, and a data receiver stage 352. The pipeline stage 350 includes three pipelines, one for each data type. As shown, the pipeline stage 350 includes a streaming statistics pipeline 305 for ingesting SSX data from network devices. Similarly, the pipeline stage 350 includes a traffic telemetry pipeline 310 for ingesting traffic data from network devices. Further, the pipeline stage 350 includes a software telemetry pipeline 315 for ingesting software telemetry (e.g., information about network topology, neighboring devices, etc.). The data pipelines included in the system 300 are described for illustrative purposes only and are not intended to be limiting. In embodiments, the configuration of the system 300 may be used to implement any multi-level association of data.
In the illustrated embodiment, the traffic telemetry pipeline 310 feeds a correlation engine 325 designated as the "FT stage." In an embodiment, the correlation engine 325 receives and correlates traffic telemetry data, such as ingress and egress events from various devices, as described above. In an embodiment, the correlation engine 325 correlates the traffic data based on the 5-tuple associated with the packet in question, as described above. Thus, in the illustrated embodiment, traffic telemetry data is received from multiple points (e.g., multiple network devices) in a flow path (e.g., in a network). In an embodiment, for each flow, a ring buffer 215 is used to associate the data. For example, in a leaf-spine topology, data is received from an ingress leaf node switch, an egress leaf node switch, and one or more spine node switches. In one embodiment, the correlation engine 325 correlates FT data from the multiple switches that support FT.
Further, in the illustrated embodiment, the first stage additionally includes a path computation 330 stage that operates once the flow data records have been associated. In an embodiment, the path computation 330 receives associated traffic telemetry data from a plurality of switches or devices and augments it with link-level data (e.g., Link Layer Discovery Protocol (LLDP) data and/or Cisco Discovery Protocol (CDP) data) from the switches (including any switches in the path that do not support FT or hardware telemetry) to form an end-to-end flow path for each data packet or data flow from the source to the destination. In one embodiment, the path corresponds to a packet flow within the data center. In the illustrated embodiment, this link-level data is received through the software telemetry pipeline 315.
Further, in the illustrated embodiment, a second correlation engine 320 labeled the "SSX-FT stage" receives SSX data from the streaming statistics pipeline 305 and correlates it with the correlated traffic telemetry data from the correlation engine 325. For example, in an embodiment, the correlation engine 320 associates the SSX data and the traffic data based on metadata (e.g., the switches, interfaces, and/or queues associated with the SSX data and the traffic data). That is, in embodiments, the hash key used by the correlation engine 320 to identify a matching entry in the corresponding circular buffer is based on metadata such as the particular device, interface, queue, port, or other component associated with the data. For example, in one embodiment, after an entry event has been associated (via the correlation engine 325) with its corresponding exit event, the correlation engine 320 identifies the SSX data (e.g., from the device where the entry/exit occurred) associated with each of these entry and exit events and associates the data. In this manner, the correlation engine 320 generates correlations for the traffic telemetry and SSX data (e.g., streaming buffer statistics), which provide accurate status for the forwarding queues and buffers in each hop of the flow. In one embodiment, the second level of correlation provided by the correlation engine 320 may be distributed, and requests may be fed to the target correlation engine 320 based on routing logic including hash-based and/or stateful load balancing techniques.
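One way to realize such hash-based routing is sketched below: a stable hash of the switch identifier selects the second-stage engine instance, so that all data for a given switch lands on the same instance. This is an illustrative scheme only; the disclosure equally allows stateful load balancing, which is not shown.

```python
import zlib


def target_engine(switch_id, num_engines):
    # Route an SSX correlation request to a second-stage engine instance using a
    # stable hash of the switch identifier (illustrative; stateful schemes also work).
    return zlib.crc32(switch_id.encode("utf-8")) % num_engines
```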
In the illustrated embodiment, the system 300 includes a third stage with a correlation engine 335 labeled the "SSX-FT final stage," which enables multiple correlation engine stages to be combined to form a larger system that distributes load and handles larger scale, depending on the number of switches/routers or correlated flows. For example, more than one instance of the correlation engine 320 may be involved in the correlation of a single flow. In one embodiment, each correlation engine 320 is designated to correlate traffic telemetry to SSX data for a defined set of switches or devices. In such an embodiment, the correlation engine 325 distributes the SSX correlation requests across multiple correlation engines 320. Each of these correlation engine 320 stages correlates traffic telemetry data based on the requests directed to its respective switches. Thus, in the illustrated embodiment, the final correlation engine 335 receives multiple sets of FT-to-SSX correlation data (each set corresponding to one or more devices) from each correlation engine 320. The correlation engine 335 may then correlate or sort the various correlations to form the final correlation data. In an embodiment, it is desirable to map the FT-SSX correlation data from the second stage (e.g., the correlation engines 320) back to the path computation FT data from the first stage (e.g., the correlation engine 325 and the path computation 330). Since the second stage (provided by the correlation engines 320) may be distributed, in embodiments this third stage (the correlation engine 335) is used to aggregate the data from the second stage and correlate it to the corresponding path computation FT data.
In an embodiment, this final association data is then transmitted to the association data receiver 340. Thus, as shown, embodiments of the present disclosure may be utilized to perform multi-level association. Further, in embodiments, the data streams may be associated using different factors at each level, depending on the desired association and the nature of the underlying data. Nonetheless, the correlation techniques disclosed herein maintain O(1) efficiency, resulting in efficient correlation. In this manner, as will be appreciated by those of ordinary skill in the art, embodiments of the present disclosure enable various association functions and operations to be implemented as appropriate to the needs of a particular implementation.
Fig. 4A-4B illustrate a cluster 400 of multi-level association systems 300 according to one embodiment disclosed herein. The illustrated embodiment includes two multi-level association systems 300A and 300B, as described above with reference to FIG. 3. In the illustrated embodiment, the separate systems 300A-B are configured to interoperate to cluster data across the systems. For example, as indicated by arrow 445, when stream association data is transmitted from the association engine 425B to the association engine 420B, the data is also transmitted to the association engine 420A to be associated with the SSX data received by the system 300A. Similarly, as indicated by arrow 450, when a stream association is transmitted from the association engine 425A to the association engine 420A, the data is also transmitted to the association engine 420B to be associated with the SSX data received by the system 300B. In this way, each system shares associations to ensure that the results are accurate throughout the cluster.
FIG. 5 is a block diagram of a cluster 500 of correlation engines, according to one embodiment disclosed herein. In the illustrated embodiment, any number of pipelines 505.1-N in the pipeline stage 530 may feed the association stage 531. Further, as shown, the association stage 531 may include any number of different association stages 515.1-M. In the embodiment shown, data is provided from the pipeline stage 530 to the association stage 531 via a bus 510. In one embodiment, data from a particular pipeline 505 is always provided to a particular identified association stage 515. In other embodiments, the association stage 515 is dynamically selected, for example, by a hashing algorithm or a load balancing algorithm.
In the illustrated embodiment, each correlation stage 515 includes one or more correlation engines 130, each including one or more circular buffers 215. In the embodiment shown, once each stage has completed its association, the data will be forwarded to the data sink 525 via the second bus 520. In some embodiments, data may additionally or alternatively be provided to one or more other association stages 515, as defined by the needs of a particular implementation.
Fig. 6 is a flow diagram illustrating a method 600 of associating data according to one embodiment disclosed herein. The method 600 begins at block 605, where the correlation engine 130 receives streaming data. As described above, this data may be a telemetry stream, a sensor data stream, a logging stream, or any other data. The method 600 proceeds to block 610, where the preprocessing component 135 selects a data segment from the data stream. In one embodiment, the segmentation is based on the logical structure of the data stream. In some embodiments, the segmentation is performed by a pre-processing stage, and the segments are received sequentially by the correlation engine 130.
At block 615, the buffer component 140 identifies a timestamp of the selected segment. The method 600 then continues to block 620, where the buffer component 140 selects the appropriate hash table in which to process the record, based on the timestamp. At block 625, the buffer component 140 searches the identified hash table based on the data included in the data segment. For example, as described above, in one embodiment, the buffer component 140 uses metadata about the data segment, or particular content of the record, as a key to search the table. In an embodiment, the formulation of the key is configured by the user based on the desired association.
The method 600 then proceeds to block 630, where the buffer component 140 determines whether there is a matching entry in the table. That is, as described above, the buffer component 140 searches the selected hash table based on the generated key to identify a collision with an existing record. If such a collision exists, the buffer component 140 determines that the current segment should be associated with the stored segment. If there is no match, the method 600 proceeds to block 635, where the buffer component 140 inserts the segment into the table, and the method 600 returns to block 605 to receive additional streaming data. That is, in an embodiment, the method 600 proceeds to process the next identified segment in the data stream.
If the buffer component 140 identifies an association in the table, the method 600 continues to block 640, where the buffer component 140 increments a match counter associated with the generated key/bucket. Of course, in some embodiments, the buffer component 140 may be configured to associate data between only two sources or types of data. Thus, in some embodiments, a counter is not utilized to determine the number of data segments that have been identified, and once a collision is found, the data is immediately removed from the table. Further, in some embodiments, the buffer component 140 determines the number of data records currently stored in the identified hash bucket by other means (rather than incrementing a counter).
At block 645, the buffer component 140 determines whether the match counter satisfies a predefined criterion (e.g., a threshold number of matching entries). In one embodiment, a user or administrator defines the criterion based on the number of data records to be associated. As described above, in various embodiments, different data sources, different types of data, and the like may be associated. The criterion may be defined such that data is removed from the circular buffer 215 only when a predefined number of associated or matching entries have been identified. If the criterion is not satisfied, the method 600 proceeds to block 635, where the data is inserted into the identified table at the identified location. The method 600 then returns to block 605 to continue receiving data.
However, if the buffer component 140 determines that the criterion is satisfied, the method 600 proceeds to block 650, where the buffer component 140 removes the identified matching entry (or entries) from the hash table and associates all of the data with the current segment. The association may be accomplished by linking the records, including an indication of the association, storing or transmitting the records in a single data structure, or any other suitable technique for identifying the association. The method 600 then proceeds to block 655, where the correlation engine 130 transmits the correlation to the downstream operator. As noted above, in embodiments, this may include one or more additional correlation engines, databases, applications, and the like.
FIG. 7 is a flow diagram illustrating a method 700 for determining data associations according to one embodiment disclosed herein. In one embodiment, method 700 provides additional details for block 620 in FIG. 6. The method 700 begins after the timestamp of the segment has been identified in block 615. At block 705, the buffer component 140 determines the desired association type. As described above, this may be defined by an administrator to ensure that the correlation engine 130 serves its intended purpose. For example, in an embodiment, the desired association may pertain to a 5-tuple of the record, a switch, port, interface, and/or queue associated with the record, and so forth. Further, in embodiments, the association may pertain to the sensor or device transmitting the data, the type of data included in the data stream, and so forth. The examples given herein are not intended to be limiting and other data associations will be apparent to those skilled in the art.
Once the desired association has been determined, the method 700 continues to block 710, where the buffer component 140 generates a hash key for the segment based on the association. For example, in embodiments utilizing the 5-tuple of a data packet, generating the hash key may include concatenating each value in the 5-tuple into a single value. In some embodiments, the 5-tuple itself may act as the key. Generally, in an embodiment, generating the hash key comprises identifying the data used for matching or correlating records, extracting that data from the data segment (or from metadata associated with the segment), and formatting it for use in a hash algorithm, as defined in a configuration associated with the correlation engine. At block 715, the buffer component 140 generates a corresponding hash value for the key. Any suitable hashing algorithm may be used to generate the hash value. The method 700 then proceeds to block 720, where the buffer component 140 identifies the appropriate hash table based on the timestamp of the segment.
For example, as described above, in embodiments, the buffer component 140 utilizes a ring buffer 215 of data elements 220, each data element 220 including a pointer to a different hash table 145. Further, in an embodiment, each data element 220 includes data for a defined time window. In such embodiments, identifying the appropriate hash table includes determining which data element 220 corresponds to the timestamp of the record and selecting the associated hash table 145. The method 700 then ends, and the selected table is searched using the hash value in block 625.
Fig. 8 is a flow diagram illustrating a method 800 of using a circular buffer to associate data in accordance with one embodiment disclosed herein. In one embodiment, the method 800 is performed in conjunction with the method 600. That is, in an embodiment, the method 800 is also performed for each record or segment processed by the method 600. In the illustrated embodiment, the method 800 begins at block 805, where the buffer component 140 receives a data record. At block 810, the buffer component 140 identifies a timestamp of the record. The method 800 then continues to block 815, where the buffer component 140 determines whether the timestamp is earlier than the data included in the ring buffer 215.
In an embodiment, whether the timestamp falls before the circular buffer 215 is determined based on the position of the buffer marker or pointer and the total wheel rotation time. For example, assume that the newest data element 220.M ends at t=50 and that the ring buffer 215 includes ten entries, each spanning one second. In such an embodiment, the circular buffer 215 spans ten seconds of data. Thus, the total wheel rotation time is ten seconds, and the buffer only includes data newer than t=40. Accordingly, if the timestamp of the current record is less than t=40 (e.g., t=35), the record falls before the buffer.
If the timestamp falls before the span of the ring buffer 215, the method 800 continues to block 820, where the record is discarded. The method 800 then returns to block 805. If the timestamp is not before the end of the buffer, the method 800 continues to block 825, where the buffer component 140 determines whether the timestamp falls after the buffer. That is, the buffer component 140 determines whether the timestamp of the current record or segment is newer than the latest window in the ring buffer 215. Continuing with the example above, if the timestamp is greater than t=50, the record is newer than the ring buffer 215 allows.
If the timestamp occurs after the window covered by the most recent data element 220, the method 800 continues to block 830, where the buffer component 140 rotates the buffer by one data element 220. That is, the buffer component 140 moves the marker forward in time by one data element 220, clears the hash table 145 residing in that element, and creates a new hash table 145. The method 800 then continues to block 835. Further, if at block 825 the buffer component 140 determines that the timestamp is not after the buffer, the method continues to block 835. That is, continuing the example above, if the timestamp falls between t=40 and t=50, the buffer component 140 determines that the record falls somewhere within the ring buffer 215.
At block 835, the buffer component 140 associates the selected segment or record. For example, in one embodiment, the buffer component 140 performs the method 600 to insert the record in the hash table 145 or identify a matching entry in the hash table 145, as described above. In this way, the ring buffer 215 is kept up-to-date, and older data that has not yet been associated is discarded. In some embodiments, the buffer component 140 does not discard the data in the oldest data element 220, but rather pops any records in each hash bucket, links them together, and transmits them downstream as part of an association. In one embodiment, if only a single data record is included in a particular hash entry, that record is discarded.
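The boundary checks of the method 800 can be summarized in a short sketch; the parameter names and returned action labels are illustrative only.

```python
def place_record(ts, head_end, num_slots, slot_width):
    """Decide how to handle a record with timestamp `ts` relative to the ring buffer.

    head_end   -- end of the newest time window (e.g., t = 50)
    num_slots  -- number of ring elements (e.g., 10)
    slot_width -- time span of each element (e.g., 1 second)
    """
    rotation_time = num_slots * slot_width
    if ts < head_end - rotation_time:
        return "discard"      # older than the oldest window (e.g., t = 35 < 40)
    if ts > head_end:
        return "rotate"       # newer than the newest window: advance the wheel first
    return "associate"        # falls within an existing window: search its hash table
```

With the example values above, timestamps of 35, 45, and 55 yield "discard", "associate", and "rotate", respectively.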
Fig. 9 is a flow diagram illustrating a method 900 of associating data segments according to one embodiment disclosed herein. The method 900 begins at block 905, where the correlation engine 130 receives a first data record of a plurality of data records in a data stream. At block 910, the correlation engine 130 selects a first element in a ring buffer based on a timestamp of the first data record, wherein the ring buffer includes a plurality of elements, each element corresponding to a respective time window. The method 900 then continues to block 915, where the correlation engine 130 identifies a first hash table associated with the first element in the ring buffer. At block 920, the correlation engine 130 generates a first hash value based on the first data record. The method 900 then proceeds to block 925, at which the correlation engine 130 determines that a second data record is associated with the first hash value in the first hash table. At block 930, the correlation engine 130 removes the second data record from the first hash table. Further, at block 935, the correlation engine 130 links the first data record and the second data record. Finally, the method 900 proceeds to block 940, where the correlation engine 130 transmits the linked first and second data records to a downstream operator.
Fig. 10 is a flow diagram illustrating a method 1000 of associating data segments according to one embodiment disclosed herein. The method 1000 begins at block 1005, where the correlation engine 130 receives a first data segment. The method 1000 then proceeds to block 1010, where the correlation engine 130 selects a first hash table of a plurality of hash tables based on a timestamp associated with the first data segment. At block 1015, the correlation engine 130 identifies a first hash bucket in the first hash table based on the first data segment. Further, at block 1020, the correlation engine 130 determines that the first hash bucket includes a second data segment. The method 1000 then continues to block 1025, where, upon determining that the first hash bucket satisfies predefined criteria, the correlation engine 130 removes the second data segment from the first hash bucket and associates the first data segment with the second data segment.
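The predefined criteria of block 1025 can be modeled, as in the dependent claims, as the bucket reaching a configurable number of stored segments; the sketch below is one such model, with threshold and emit as assumed names. With threshold set to two it reproduces the pairwise linking of method 900.

def insert_or_associate(bucket: list, segment, threshold: int, emit) -> None:
    # Hold the segment in its hash bucket until the bucket satisfies the
    # criteria (here, a threshold count of stored segments), then remove the
    # stored segments, associate them, and hand the linked group to emit.
    bucket.append(segment)
    if len(bucket) >= threshold:
        linked = tuple(bucket)
        bucket.clear()
        emit(linked)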
In the foregoing, reference has been made to the embodiments presented in this disclosure. However, the scope of the present disclosure is not limited to the specifically described embodiments. Rather, any combination of the described features and elements, whether related to different embodiments or not, is contemplated for implementing and practicing the contemplated embodiments. Moreover, although embodiments disclosed herein may achieve advantages over other possible solutions or over the prior art, whether or not a particular advantage is achieved by a given embodiment is not limiting of the scope of the disclosure. Thus, the foregoing aspects, features, embodiments and advantages are merely illustrative and are not considered elements or limitations of the appended claims except where explicitly recited in a claim(s).
As will be appreciated by one skilled in the art, the embodiments disclosed in the present disclosure may be embodied as a system, method or computer program product. Accordingly, aspects may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit," "module," or "system." Furthermore, aspects may take the form of a computer program product embodied in one or more computer-readable media having computer-readable program code embodied thereon.
Any combination of computer-readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium is any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein (e.g., in baseband or as part of a carrier wave). Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
Aspects of the present disclosure are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments presented in the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
Embodiments of the present invention may be provided to end users through a cloud computing infrastructure. Cloud computing generally refers to the provision of scalable computing resources as a service over a network. More formally, cloud computing may be defined as a computing capability that provides an abstraction between the computing resource and its underlying technical architecture (e.g., servers, storage, networks), enabling convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction. Thus, cloud computing allows a user to access virtual computing resources (e.g., storage, data, applications, and even complete virtualized computing systems) in the "cloud," without regard for the underlying physical systems (or the locations of those systems) used to provide the computing resources.
Typically, cloud computing resources are provided to users in a pay-per-use manner, with users being charged only for the computing resources actually used (e.g., the amount of storage space consumed by a user or the number of virtualized systems instantiated by the user). A user can access any resource residing in the cloud from anywhere over the internet at any time. In the context of the present invention, a user may access applications (e.g., the correlation engine 130) or related data available in the cloud. For example, the correlation engine 130 may execute on a computing system in the cloud and process data in the cloud. In this case, the correlation engine 130 may correlate the streaming data in real-time in the cloud and store the identified correlations at a storage location in the cloud. Doing so allows a user to access this information from any computing system attached to a network (e.g., the internet) connected to the cloud.
In summary, techniques for efficient data association are provided. A first data segment is received, and a first hash table of a plurality of hash tables is selected based on a timestamp associated with the first data segment. Further, a first hash bucket in the first hash table is identified based on the first data segment. It is determined that the first hash bucket includes a second data segment. Upon determining that the first hash bucket satisfies predefined criteria, the second data segment is removed from the first hash bucket and the first data segment and the second data segment are associated.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In view of the foregoing, the scope of the present disclosure is determined by the appended claims.

Claims (23)

1. A method, comprising:
receiving a first data record of a plurality of data records in a data stream;
selecting a first element in a ring buffer based on a timestamp of the first data record, wherein the ring buffer comprises a plurality of elements, each element corresponding to a respective time window;
identifying a first hash table associated with the first element in the ring buffer;
generating a first hash value based on the first data record;
determining that a second data record is associated with the first hash value in the first hash table;
removing the second data record from the first hash table;
linking the first data record and the second data record; and
transmitting the linked first and second data records to a downstream operator.
2. The method of claim 1, further comprising:
receiving a third data record; and
upon determining that the timestamp associated with the third data record is newer than a predefined threshold:
selecting a second element in the ring buffer, wherein the second element corresponds to an oldest time window of the respective time windows;
identifying a second hash table associated with the second element; and
discarding the second hash table.
3. The method of claim 1 or 2, wherein removing the second data record from the first hash table, linking the first and second data records, and transmitting the linked first and second data records to the downstream operator are performed upon determining that a predefined criterion is met.
4. The method of claim 3, wherein determining that the predefined criterion is met comprises: determining that a number of data records associated with the first hash value in the first hash table exceeds a predefined threshold.
5. The method of any of claims 1-4, further comprising:
receiving a third data record;
generating a second hash value based on the third data record;
determining, based on the second hash value, that a number of data records in the first hash table associated with the second hash value does not exceed a predefined threshold; and
inserting the third data record into the first hash table.
6. The method of any of claims 1-5, further comprising:
receiving a third data record; and
discarding the third data record upon determining that a timestamp associated with the third data record is older than a predefined threshold.
7. A computer program product, comprising:
a computer-readable storage medium having computer-readable program code embodied therein, the computer-readable program code executable by one or more computer processors to perform operations comprising:
receiving a first data segment;
selecting a first hash table of a plurality of hash tables based on a timestamp associated with the first data segment;
identifying a first hash bucket in the first hash table based on the first data segment;
determining that the first hash bucket includes a second data segment; and
upon determining that the first hash bucket satisfies predefined criteria:
removing the second data segment from the first hash bucket; and
associating the first data segment with the second data segment.
8. The computer program product of claim 7, the operations further comprising: transmitting the first data segment and the second data segment to one or more data receivers.
9. The computer program product of claim 7 or 8, wherein receiving a first data segment comprises:
receiving a data stream, wherein the data stream comprises a plurality of logical data units; and
dividing the data stream into a plurality of data segments based on the plurality of logical data units.
10. The computer program product of any of claims 7 to 9, wherein selecting the first hash table of the plurality of hash tables comprises: identifying a first element in a ring buffer containing a plurality of elements, wherein each element in the plurality of elements is associated with a respective hash table in the plurality of hash tables.
11. The computer program product of claim 10, wherein each respective element of the plurality of elements is associated with a respective time window.
12. The computer program product of claim 11, the operations further comprising:
receiving a third data segment; and
upon determining that the timestamp associated with the third data segment is newer than a predefined threshold:
selecting a second hash table of the plurality of hash tables, wherein the second hash table is associated with a second data element, wherein the second data element is associated with an oldest time window of the respective time windows; and
discarding the second hash table.
13. The computer program product of any of claims 7 to 12, wherein identifying the first hash bucket in the first hash table comprises:
generating a hash key based on the first data segment according to a predefined configuration;
generating a hash value based on the hash key; and
identifying the first hash bucket based on the generated hash value.
14. The computer program product of any of claims 7 to 13, wherein determining that the first hash bucket satisfies a predefined criterion comprises: determining that a threshold number of data segments are stored in the first hash bucket.
15. The computer program product of any of claims 7 to 14, the operations further comprising:
receiving a third data segment;
identifying a second hash bucket in the first hash table based on the third data segment; and
inserting the third data segment into the second hash bucket upon determining that the second hash bucket does not satisfy the predefined criteria.
16. The computer program product of any of claims 7 to 15, the operations further comprising:
receiving a third data segment; and
discarding the third data segment upon determining that a timestamp associated with the third data segment is older than a predefined threshold.
17. A system, comprising:
one or more computer processors; and
memory containing a program that, when executed by the one or more computer processors, performs operations comprising:
receiving a first data segment;
selecting a first hash table of a plurality of hash tables based on a timestamp associated with the first data segment;
identifying a first hash bucket in the first hash table based on the first data segment;
determining that the first hash bucket includes a second data segment; and
upon determining that the first hash bucket satisfies predefined criteria:
removing the second data segment from the first hash bucket; and
associating the first data segment with the second data segment.
18. The system of claim 17, wherein selecting the first hash table of the plurality of hash tables comprises: identifying a first element in a ring buffer containing a plurality of elements, wherein each element in the plurality of elements is associated with a respective hash table in the plurality of hash tables, and wherein each respective element in the plurality of elements is associated with a respective time window.
19. The system of claim 18, the operations further comprising:
receiving a third data segment; and
upon determining that the timestamp associated with the third data segment is newer than a predefined threshold:
selecting a second hash table of the plurality of hash tables, wherein the second hash table is associated with a second data element, wherein the second data element is associated with an oldest time window of the respective time windows; and
discarding the second hash table.
20. The system of any of claims 17 to 19, the operations further comprising:
receiving a third data segment;
identifying a second hash bucket in the first hash table based on the third data segment; and
inserting the third data segment into the second hash bucket upon determining that the second hash bucket does not satisfy the predefined criteria.
21. An apparatus, comprising:
receiving means for receiving a first data record of a plurality of data records in a data stream;
selecting means for selecting a first element in a ring buffer based on a timestamp of the first data record, wherein the ring buffer comprises a plurality of elements, each element corresponding to a respective time window;
identifying means for identifying a first hash table associated with the first element in the ring buffer;
generating means for generating a first hash value based on the first data record;
determining means for determining that a second data record is associated with the first hash value in the first hash table;
removing means for removing the second data record from the first hash table;
linking means for linking said first data record and said second data record; and
transmitting means for transmitting the linked first and second data records to a downstream operator.
22. The apparatus of claim 21, further comprising: apparatus for implementing the method according to any one of claims 2 to 6.
23. A computer program, computer program product or computer readable medium comprising instructions which, when executed by a computer, cause the computer to perform the steps of the method according to any one of claims 1 to 6.
CN201980041481.1A 2018-07-05 2019-07-02 Efficient time-based association of data streams Pending CN112313638A (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US201862694403P 2018-07-05 2018-07-05
US62/694,403 2018-07-05
US16/158,057 US11068488B2 (en) 2018-07-05 2018-10-11 Efficient time based correlation of data streams
US16/158,057 2018-10-11
PCT/US2019/040431 WO2020010165A1 (en) 2018-07-05 2019-07-02 Efficient time based correlation of data streams

Publications (1)

Publication Number Publication Date
CN112313638A true CN112313638A (en) 2021-02-02

Family

ID=67515107

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980041481.1A Pending CN112313638A (en) 2018-07-05 2019-07-02 Efficient time-based association of data streams

Country Status (4)

Country Link
US (1) US11068488B2 (en)
EP (1) EP3818452A1 (en)
CN (1) CN112313638A (en)
WO (1) WO2020010165A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111782414B (en) * 2020-05-12 2024-04-19 北京皮尔布莱尼软件有限公司 Delay message processing method and system
US20210409322A1 (en) 2020-06-24 2021-12-30 Juniper Networks, Inc. Point-to-multipoint layer -2 network extension over layer-3 network

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5864554A (en) 1993-10-20 1999-01-26 Lsi Logic Corporation Multi-port network adapter
US7729385B2 (en) 2004-11-01 2010-06-01 Nokia Corporation Techniques for utilization of spare bandwidth
US7359837B2 (en) 2006-04-27 2008-04-15 Medtronic, Inc. Peak data retention of signal data in an implantable medical device
US20080107396A1 (en) 2006-11-08 2008-05-08 Tsung-Ning Chung Systems and methods for playing back data from a circular buffer by utilizing embedded timestamp information
US20110131198A1 (en) 2009-11-30 2011-06-02 Theodore Johnson Method and apparatus for providing a filter join on data streams
US20120066759A1 (en) 2010-09-10 2012-03-15 Cisco Technology, Inc. System and method for providing endpoint management for security threats in a network environment
US9838315B2 (en) 2015-07-29 2017-12-05 Cisco Technology, Inc. Stretched subnet routing
US10740309B2 (en) * 2015-12-18 2020-08-11 Cisco Technology, Inc. Fast circular database
US10178161B2 (en) 2016-05-11 2019-01-08 Microsoft Technology Licensing, Llc Digital signal processing over data streams

Also Published As

Publication number Publication date
EP3818452A1 (en) 2021-05-12
US11068488B2 (en) 2021-07-20
US20200012737A1 (en) 2020-01-09
WO2020010165A1 (en) 2020-01-09

Similar Documents

Publication Publication Date Title
US9485155B2 (en) Traffic analysis of data flows
US6980552B1 (en) Pipelined packet switching and queuing architecture
US9800482B2 (en) Signature-based latency extraction systems and related methods for network packet communications
US11836385B2 (en) High speed data packet flow processing
CN108400909B (en) Traffic statistical method, device, terminal equipment and storage medium
US8369340B2 (en) Tracking fragmented data flows
US10129181B2 (en) Controlling the reactive caching of wildcard rules for packet processing, such as flow processing in software-defined networks
US10135711B2 (en) Technologies for sideband performance tracing of network traffic
CN108701187A (en) Mixed hardware software distribution threat analysis
EP2880848B1 (en) Aggregating data in a mediation system
US8923159B2 (en) Processing network traffic
CN113162818A (en) Method and system for realizing distributed flow acquisition and analysis
CN112313638A (en) Efficient time-based association of data streams
KR101688635B1 (en) Apparatus for storing traffic based on flow and method
CN104778193A (en) Data deduplication method and device
CN110300085B (en) Evidence obtaining method, device and system for network attack, statistical cluster and computing cluster
CN114244781B (en) Message de-duplication processing method and device based on DPDK
CN111064587B (en) Node of distributed data system and broadcast transmission data management method
US9736080B2 (en) Determination method, device and storage medium
EP2328315A1 (en) Processing network traffic
US20240053930A1 (en) High Speed Data Packet Flow Processing with Offload
CN115604138A (en) Data acquisition method, firewall and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination