US20110153554A1 - Method for summarizing data in unaggregated data streams - Google Patents
- Publication number: US20110153554A1
- Application number: US12/653,831
- Authority: US (United States)
- Prior art keywords: reservoir, keys, key, weights, weight
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/02—Capturing of monitoring data
- H04L43/028—Capturing of monitoring data by filtering
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/04—Processing captured monitoring data, e.g. for logfile generation
Description
- The present invention generally relates to methods for summarizing data, and more particularly to methods for producing summaries of unaggregated data appearing, e.g., in massive data streams, for use in subsequent analysis of the data.
- It is often useful to provide a summary of a high volume stream of unaggregated weighted items that arrive faster and in larger quantities than can be saved, so that only a sample can be stored efficiently. Preferably, we would like to provide a generic sample of a certain limited size that we can later use to estimate the total weight of arbitrary subsets of the data.
- Many data sets occur as unaggregated data sets, where multiple data points are associated with each key. The weight of a key is the sum of the weights of the data points associated with that key. The aggregated view of the data, over which aggregates of interest are defined, consists of the set of keys and the weight associated with each key.
- In greater detail, this invention is concerned with the problem of summarizing a population of data points, each of which takes the form (k, x), where k is a key and x≥0 is called a weight. Generally, in an unaggregated data set, a given key occurs in multiple data points of the population. An aggregate view of the population is provided by the set of key weights: the weight of a given key is simply the sum of the weights of the data points with that key within the population. This aggregate view would support queries that require selection of sub-populations with arbitrary key predicates.
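The aggregate view described above can be illustrated with a minimal Python sketch. The function names `aggregate` and `subpopulation_weight`, and the packet-like sample data, are hypothetical and chosen only for illustration; the sketch computes the exact aggregate, which is the quantity the summaries discussed later are meant to approximate when exact aggregation is infeasible.

```python
from collections import defaultdict


def aggregate(points):
    """Return the aggregate view: key -> sum of weights of its data points."""
    weights = defaultdict(float)
    for key, x in points:
        if x < 0:
            raise ValueError("weights must be non-negative")
        weights[key] += x
    return dict(weights)


def subpopulation_weight(weights, predicate):
    """Total weight of the keys selected by an arbitrary key predicate."""
    return sum(w for key, w in weights.items() if predicate(key))


# Hypothetical data points: key = (source address, destination port),
# weight = packet size in bytes.
points = [
    (("10.0.0.1", 80), 1500),
    (("10.0.0.2", 443), 400),
    (("10.0.0.1", 80), 600),
    (("10.0.0.3", 80), 100),
]
view = aggregate(points)
port_80_bytes = subpopulation_weight(view, lambda key: key[1] == 80)
```

Here the predicate is supplied only at query time, which mirrors the requirement that the keys of interest need not be known when the data are collected.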
- However, in many application scenarios, it is not feasible to compute the aggregate key weights directly; we describe some of these application scenarios below. In these applications time and processing constraints prohibit direct queries and make it necessary to first compute a summary of the aggregate view over the data points, and then to process the query on the summary. A crucial requirement of such summaries is that they must also support selection of subpopulations with arbitrary key predicates. Since the keys of interest are not assumed to be known at the time of summarization, the summarization process must retain per-key statistical estimates of the aggregate weights.
- Turning now to applications of interest, communications networking provides a fertile area for developing summarization methods. In the Internet Protocol (IP) suite, routers forward packets between high speed interfaces based on the contents of their packet headers. The header contains the source and destination addresses of the packet, and usually also the source and destination port numbers, which are used by end hosts to direct packets to the correct application executing within them. These and other fields in the packet header together constitute a key that identifies the IP flow that the packet belongs to. In our context, we can think of the set of keys of packets arriving at the router in some time interval, each paired with the byte size of the corresponding packet, as a population of unaggregated data points.
- Routers commonly compile summary statistics on the traffic passing through them, and export them to a collector which stores the summaries and supports query functions. Export of the unaggregated data is infeasible due to the expense of the bandwidth, storage and computation resources that would be required to support queries. On the other hand, direct aggregation of byte sizes over all distinct flow keys at a measuring router is generally infeasible at present due to the amount of (fast) memory that would be required to maintain and update at line rate the summaries for the large number of distinct keys present in the data. Thus some other form of summarization is required.
- Common queries for network administrators would include: (i) calculating the traffic matrix, i.e., the weight between source-destination address pairs; (ii) the application mix, as indicated by weight in various port numbers; and (iii) popular websites, as indicated by destination address using certain ports. Although some queries are routine, in exploratory and troubleshooting tasks the keys of interest are not known in advance.
- Other network devices that serve content or mediate network protocols generate logs comprising records of each transaction. Examples include web servers and caches; content distribution servers and caches; electronic libraries for software, video, music, books, papers; DNS and other protocol servers. Each record may be considered as a data point, keyed, e.g., by requester or item requested, with weight being unity or the size or price of the item requested if appropriate. Offline libraries can produce similar records. Queries include finding the most popular items or the heaviest users, requiring aggregation over keys with common user and/or item. Another example is sensor networks, which comprise a distributed set of devices each of which generates monitoring events in certain categories.
- All of these application examples, to a greater or lesser extent, share the feature that the approximate aggregation is subjected to physical resource constraints on the information that can be carried through time or between locations. For example, there are multiple distinct devices that produce data points, and from which information flows to a single ultimate collector and bandwidth is limited. If data points arrive as a data stream, then storage is limited. In the network traffic statistics application, measurements may be aggregated in mediation devices (e.g. one per geographic router center) which in turn export to a central collector. Sensor networks may deploy a large number of sensor nodes with limited capabilities that can collaborate locally to aggregate their measurements before relaying messages more widely. Physical layout aside, when summarizing data that resides on external memory or when exploiting parallel processing to speed up the computation, the computation is subjected to similar data flow constraints imposed by the underlying model.
- There has been a considerable amount of work in past years devoted to finding efficient data summarization schemes.
- Summarizing Aggregated Data. In aggregated data sets, each data point has a unique key. There are many summarization methods for such data sets in the literature that produce summaries that support unbiased estimates for subpopulation weight. Reservoir sampling from a single stream is the base of the stream database of Johnson et al. [T. Johnson, S. Muthukrishnan, and I. Rozenbaum, SAMPLING ALGORITHMS IN A STREAM OPERATOR, In Proc. ACM SIGMOD, pages 1-12, 2005]. Classic algorithms for offline, data stream, and distributed settings include: weighted sampling with replacement (probability proportional to size) (the k-mins framework) [E. Cohen, SIZE-ESTIMATION FRAMEWORK WITH APPLICATIONS TO TRANSITIVE CLOSURE AND REACHABILITY, J. Comput. System Sci., 55:441-453, 1997; E. Cohen and H. Kaplan, SPATIALLY-DECAYING AGGREGATION OVER A NETWORK: MODEL AND ALGORITHMS, J. Comput. System Sci., 73:265-288, 2007]; the stronger bottom-k framework [E. Cohen and H. Kaplan, BOTTOM-K SKETCHES: BETTER AND MORE EFFICIENT ESTIMATION OF AGGREGATES, In Proceedings of the ACM SIGMETRICS '07 Conference, 2007, poster; E. Cohen and H. Kaplan, SUMMARIZING DATA USING BOTTOM-K SKETCHES, In Proceedings of the ACM PODC '07 Conference, 2007; E. Cohen and H. Kaplan, TIGHTER ESTIMATION USING BOTTOM-K SKETCHES, In Proceedings of the 34th VLDB Conference, 2008] that includes priority sampling [N. Duffield, M. Thorup, and C. Lund, PRIORITY SAMPLING FOR ESTIMATING ARBITRARY SUBSET SUMS, J. Assoc. Comput. Mach., 54(6), 2007] and the classic weighted sampling with replacement; and the recently-proposed VAROPT [E. Cohen, N. Duffield, H. Kaplan, C. Lund, and M. Thorup, VARIANCE OPTIMAL SAMPLING BASED ESTIMATION OF SUBSET SUMS, In Proc. 20th ACM-SIAM Symposium on Discrete Algorithms, ACM-SIAM, 2009] that achieves variance optimality.
- These summarizations, however, cannot be computed over the unaggregated data unless the data is first aggregated, which is prohibited by application constraints. Firstly, the best estimators for summaries derived from aggregated data utilize the exact weight of each key that is included in the summary. Secondly, the distribution itself of keys that are included in the summary cannot be computed under the IFT constraints. (The only exception is weighted sampling (with or without replacement), but even though we can efficiently determine the keys to include in the summary over the unaggregated data, we need a “second pass” (or another communication round) to obtain the total weight of each included key in order to compute the estimators.)
- These methods can be applied to produce data-point-level summaries, by effectively treating each data point as having a unique key. These summaries, however, have large multiplicities of the same key and they are considerably less accurate than key-level summaries. This prompted the development of methods that compute key-level summaries over the unaggregated data.
- Summarizing Unaggregated Data. Summarization of unaggregated data sets has been extensively studied [N. Alon, Y. Matias, and M. Szegedy, THE SPACE COMPLEXITY OF APPROXIMATING THE FREQUENCY MOMENTS, J. Comput. System Sci. 58:137-147, 1999; P. Indyk and D. P. Woodruff, OPTIMAL APPROXIMATIONS OF THE FREQUENCY MOMENTS OF DATA STREAMS, In Proc 37th Annual ACM Symposium on Theory of Computing, pages 202-208, ACM, 2005; M. Charikar, K. Chen, and M. Farach-Colton, FINDING FREQUENT ITEMS IN DATA STREAMS, In Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP), pages 693-703, 2002; R. Fagin, A. Lotem, and M. Naor, OPTIMAL AGGREGATION ALGORITHMS FOR MIDDLEWARE, In Proceedings of the 24th ACM Symposium on Principles of Database Systems, ACM-SIGMOD, 2001; P. Cao and Z. Wang, EFFICIENT TOP-K QUERY CALCULATION IN DISTRIBUTED NETWORKS, In Proc 23rd Annual ACM Symposium on Principles of Distributed Computing, ACM-SIGMOD, 2004; G. Manku and R. Motwani, APPROXIMATE FREQUENCY COUNTS OVER DATA STREAMS, In International Conference on Very Large Databases (VLDB), pages 346-357, 2002; G. Cormode and S. Muthukrishnan, WHAT'S HOT AND WHAT'S NOT: TRACKING MOST FREQUENT ITEMS DYNAMICALLY, In Proceedings of ACM Principles of Database Systems, 2003] for applications that include data streams, distributed data, and in-network aggregation (sensor networks) [A. Manjhi, S. Nath, and P. B. Gibbons, TRIBUTARIES AND DELTAS: EFFICIENT AND ROBUST AGGREGATION IN SENSOR NETWORK STREAMS, In SIGMOD 2005, ACM, 2005]. We are specifically interested in summaries that support estimating the weight of selected subpopulations, specified using arbitrary selection predicates, and we compare our methods against alternative methods that do that. (We do not consider methods restricted to estimating an aggregate over the full data or geared for different aggregates such as top-k, heavy hitters, or frequency moments of the full data set.)
- Concise samples [P. Gibbons and Y. Matias, NEW SAMPLING-BASED SUMMARY STATISTICS FOR IMPROVING APPROXIMATE QUERY ANSWERS, In SIGMOD, ACM, 1998] refer to independent sampling of data points (this assumes that data points have uniform weights). The key idea is to combine in the sample all data points with the same key, and therefore obtain a larger effective sample using the same storage. This is also the flow counting mechanism deployed by Cisco's sampled NetFlow (NF) in routers [Cisco NetFlow, described in materials found at www.cisco.com/en/US/docs/ios/12_2sb/feature/guide/sbrsnf.html]. When sampling is performed at a fixed rate, we obtain a variable-size summary. In many applications, a fixed-size summary is desirable, which is obtained by adaptively decreasing the sampling rate. We refer to this adaptive version as ANF.
- Counting samples [P. Gibbons and Y. Matias, NEW SAMPLING-BASED SUMMARY STATISTICS FOR IMPROVING APPROXIMATE QUERY ANSWERS, In SIGMOD, ACM, 1998] (also developed as sample-and-hold (SH) [C. Estan and G. Varghese, NEW DIRECTIONS IN TRAFFIC MEASUREMENT AND ACCOUNTING, In Proceedings of the ACM SIGCOMM '02 Conference, ACM, 2002]) is a summarization algorithm applicable to an unaggregated stream of data points with uniform weights. The algorithm samples all data points at a fixed rate, but once a key is sampled, all subsequent data points with the same key are counted. Similarly, there is an adaptive version of the algorithm that produces fixed-size summaries (ASH).
- Subpopulation-weight estimators for ASH and ANF have been proposed and evaluated [E. Cohen, N. Duffield, H. Kaplan, C. Lund and M. Thorup, SKETCHING UNAGGREGATED DATA STREAMS FOR SUBPOPULATION-SIZE QUERIES, In Proc of the 2007 ACM Symp. on Principles of Database Systems (PODS 2007), ACM, 2007; E. Cohen, N. Duffield, H. Kaplan, C. Lund and M. Thorup, ALGORITHMS AND ESTIMATORS FOR ACCURATE SUMMARIZATION OF INTERNET TRAFFIC, In Proceedings of the 7th ACM SIGCOMM conference on Internet measurements (IMC), 2007]. ASH dominates ANF on any sub-population and distribution. ANF (and NF), however, are applicable on general IFTs, whereas ASH (and SH) are limited to streams. In addition, ASH does not support multiple-objectives unbiased estimation for other additive (over data points) weight functions [E. Cohen, N. Duffield, H. Kaplan, C. Lund, and M. Thorup, ALGORITHMS AND ESTIMATORS FOR ACCURATE SUMMARIZATION OF INTERNET TRAFFIC, MANUSCRIPT, 2007], whereas ANF and our summarization algebra support multiple objectives. ASH is applicable to uniform weights, and its extension to general weights does not utilize and benefit from a higher level of aggregation. For example, in terms of the produced summary and estimate quality, it treats the sequence (i1, 1), (i2, 3), (i1, 2) as (i1, 1), (i2, 1), (i2, 1), (i2, 1), (i1, 1), (i1, 1).
- Step-counting SH (SSH) is another summarization scheme for unaggregated data streams that improves over ASH by exploiting the memory hierarchy structure at high speed IP routers. As a pure data stream algorithm, however, SSH utilizes larger storage to produce the same size summary as ASH.
- Propagation of Summaries on Trees. Multistage aggregation for threshold sampling [N. G. Duffield, C. Lund, and M. Thorup, LEARN MORE, SAMPLE LESS: CONTROL OF VOLUME AND VARIANCE IN NETWORK MEASUREMENTS, IEEE Transactions on Information Theory, 51(5):1756-1775, 2005] is represented on a tree [E. Cohen, N. Duffield, C. Lund, and M. Thorup, CONFIDENT ESTIMATION FOR MULTISTAGE MEASUREMENT SAMPLING AND AGGREGATION, In ACM SIGMETRICS, 2008, Jun. 2-6, 2008, Annapolis, Md., USA] for the purpose of developing exponential bounds on summary error. Applications include Sampled NetFlow, Counting Samples, and Sample and Hold. Some earlier work [N. Duffield and C. Lund, PREDICTING RESOURCE USAGE AND ESTIMATION ACCURACY IN AN IP FLOW MEASUREMENT COLLECTION INFRASTRUCTURE, In ACM SIGCOMM Internet Measurement Workshop, 2003, Miami Beach, Fla., Oct. 27-29, 2003] had analyzed variance for Sampled NetFlow, exploiting relationships similar to Lemma 8 set forth below for multistage sampling.
- From the foregoing discussion, it will be apparent that a summarization method for unaggregated data sets desirably will work on massive data streams in the face of processing and storage constraints that prohibit full processing; will produce a summarization with low variance for accurate analysis of data; will be one that is efficient in its application (will not require inordinate amounts of time to produce); will provide unbiased summaries for arbitrary analysis of the data; and will limit the worst case variance for every single (arbitrary) subset.
- The prior art summarization methods described above have been unable to satisfy all of these desiderata.
- Accordingly, there is a need to provide a summarization method for unaggregated data that produces results better than those attainable by prior art methods.
- We formalize the above physical and logical constraints on the information flow using Information Flow Trees (IFTs). Data points are generated at leaves of the tree and information flows bottom-up from children to parent nodes. Each node in the tree obtains information (only) from its children and is subjected to a constraint on the information it can propagate to its parent node and to its internal processing constraints (that can also be captured by an IFT). For our summarization problem, at each node, IFT constraints prohibit computation on the full aggregated data presented by its children nodes. Rather, the node combines their summaries into one summary, which is hence a summary of all the data produced by leaf nodes descended from it. The physical and logical constraints translate to an IFT or family of applicable IFTs. Subjected to these constraints, we are interested in obtaining a summary that allows us to answer approximate queries most accurately.
- Our summaries are based on adjusted weights, which means that both data sets and summaries have a consistent representation as a weighted set: a set of keys with a weight associated with each key. We develop a Summarization Algebra for manipulating adjusted-weight summaries. In our framework, summarization and merging of summaries of unaggregated data sets are composable operators that allow us to perform summarization subject to arbitrary IFT constraints and at the same time preserve the good properties of the summarization.
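Because data sets and summaries share the weighted-set representation, merging can be sketched as key-wise addition of weights. The Python function below is a minimal illustration under that assumption; the name `merge` is hypothetical and is not taken from the source. If each input is an unbiased adjusted-weight summary of its own part of the data, the key-wise sum is unbiased for the combined data by linearity of expectation.

```python
def merge(*weighted_sets):
    """Merge weighted sets (dicts key -> weight) by adding weights per key."""
    merged = {}
    for weighted in weighted_sets:
        for key, w in weighted.items():
            merged[key] = merged.get(key, 0.0) + w
    return merged


# Two hypothetical summaries produced by different children of an IFT node.
left = {"a": 2.0, "b": 1.0}
right = {"a": 3.0, "c": 4.0}
combined = merge(left, right)
```

A merged set may hold more keys than a node is allowed to propagate, in which case a size-reducing summarization step would follow the merge.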
- One of the discoveries we have made is that IFT constraints, and the data stream model constraints in particular, prohibit variance-optimal summarization of unaggregated data. This contrasts with what is possible for aggregated data, for which there exists an optimal summarization scheme (VAROPT) that is applicable for data streams and general IFT constraints [M. T. Chao, A GENERAL PURPOSE UNEQUAL PROBABILITY SAMPLING PLAN, Biometrika, 69(3):653-656, 1982; Y. Tillé, AN ELIMINATION PROCEDURE FOR UNEQUAL PROBABILITY SAMPLING WITHOUT REPLACEMENT, Biometrika, 83(1):238-241, 1996; Y. Tillé, SAMPLING ALGORITHMS, Springer-Verlag, New York, 2006; E. Cohen, N. Duffield, H. Kaplan, C. Lund, and M. Thorup, VARIANCE OPTIMAL SAMPLING BASED ESTIMATION OF SUBSET SUMS, In Proc. 20th ACM-SIAM Symposium on Discrete Algorithms, ACM-SIAM, 2009]. On the other hand, we have discovered that by leveraging VAROPT in our method, we can obtain local optimality of intermediate summarization steps and approach variance optimality as the data set becomes “more” aggregated.
- In particular, in accordance with the summarization method of the present invention, unaggregated data are summarized by utilizing at summarization points an adjusted weight summarization method that inputs a weighted set of size k+1 and outputs a weighted set of size k (removes a single key). As we discuss below, by including the local application of the VAROPT algorithm, we obtain the desirable properties we seek.
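The step that maps a weighted set of size k+1 to one of size k can be sketched as follows. This is a minimal Python illustration based on the threshold form of the VAROPT step as it is commonly described in the cited literature, not a transcription of the claimed method; the function name `drop_one` and its return convention are assumptions. The threshold tau is chosen so that the inclusion probabilities min(1, w/tau) sum to k; exactly one key with weight at most tau is dropped, and each surviving key below the threshold receives adjusted weight tau.

```python
import random


def drop_one(weighted, rng=random):
    """Reduce a weighted set of k+1 keys to k keys; return (summary, tau)."""
    items = sorted(weighted.items(), key=lambda kv: kv[1])
    if len(items) < 2 or items[0][1] <= 0:
        raise ValueError("need at least two keys with positive weights")

    # Find t, the number of "small" keys, and tau = (sum of t smallest)/(t-1).
    small_sum = items[0][1]
    for t in range(2, len(items) + 1):
        small_sum += items[t - 1][1]
        tau = small_sum / (t - 1)
        if t == len(items) or items[t][1] >= tau:
            break

    # Small key i is dropped with probability 1 - w_i/tau; these sum to 1.
    r = rng.random()
    dropped = 0  # fallback in case of floating-point rounding
    acc = 0.0
    for idx in range(t):
        acc += 1.0 - items[idx][1] / tau
        if r < acc:
            dropped = idx
            break

    summary = {}
    for idx, (key, w) in enumerate(items):
        if idx != dropped:
            # Surviving small keys get weight tau; large keys keep theirs.
            summary[key] = tau if idx < t else w
    return summary, tau


example = {"a": 9.0, "b": 3.0, "c": 3.0, "d": 8.0}
summary, tau = drop_one(example, random.Random(0))
```

For the example weights, the two keys of weight 3 are the small keys, tau is 6, and one of them survives with adjusted weight 6, so the expected adjusted weight of each is 6 × 1/2 = 3 and the total weight of the summary equals the total weight of the input.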
- In further aspects of the invention, the summarization is presented using merging and sampling operations applied to a dataset of weighted keys. The algorithm maintains adjusted weights of keys that are unbiased estimates of their actual weight. The summarization is applied using the same adjusted weights.
- In a particular aspect of the invention, a method for producing a summary A of data points in an unaggregated data stream wherein the data points are in the form of weighted keys (a, w) where a is a key and w is a weight, and the summary is a sample of k keys a with adjusted weights wa, comprises providing a first set or reservoir L with keys having adjusted weights which are additions of weights of individual data points of included keys; providing a second set or reservoir T with keys having adjusted weights which are each equal to a threshold value τ whose value is adjusted based upon tests of new data points arriving in the data stream; and combining the keys and adjusted weights of the first reservoir L with the keys and adjusted weights of the second reservoir T to form the summary representing the data stream. A third reservoir X may advantageously be used for temporarily holding keys moved from reservoir L and for temporarily holding keys to be moved to reservoir T in response to tests applied to new data points arriving in the stream. The method proceeds by first merging new data points in the stream into the reservoir L until the reservoir contains k different keys, and thereafter applying a series of tests to new arriving data points to determine how their keys and weights compare to the keys and adjusted weights already included in the summary.
- For example, a first test may determine if the key of the new data point is already included in reservoir L, and if so, increase the adjusted weight of the included key by the weight of the new data point; a second test may determine if the key of the new data point is already included in reservoir T, and if so, move the key from reservoir T to reservoir L and increase the adjusted weight of the included key by the weight of the new data point. If the key of the new data point is not already included in reservoir T or reservoir L, a third test may determine if the weight of the new data point is greater than the threshold value τ, and if so, add the key and weight of the new data point to reservoir L, and if not, add the key of the new data point to temporary reservoir X. Another test may be utilized to determine if the key with the minimum adjusted weight included in reservoir L is to be moved to reservoir X, and a further test based on a randomly generated number may be used to determine keys to be removed from reservoirs T or X. In this fashion, each new data point (the (k+1)-th data point) is used to produce a sample of k keys that faithfully represents the data stream for use in subsequent analysis.
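The first three tests can be sketched in Python as a routing function for a newly arrived data point. This is only an illustration under stated assumptions: the name `route_data_point` is hypothetical, reservoir L is modeled as a dict of adjusted weights, reservoir T as a set of keys that all carry adjusted weight τ, and reservoir X as a dict that stores the new key together with its weight. The later tests, which move the minimum-weight key of L to X and use a random number to remove a key from T or X, are deliberately left out because their details are not given in this passage.

```python
def route_data_point(L, T, X, tau, key, weight):
    """Apply the first three tests to a new data point (key, weight)."""
    # Test 1: key already in L, so accumulate its weight there.
    if key in L:
        L[key] += weight
        return "updated in L"
    # Test 2: key in T (adjusted weight tau), so move it to L and add weight.
    if key in T:
        T.remove(key)
        L[key] = tau + weight
        return "moved from T to L"
    # Test 3: new key; heavy points go to L, light ones to temporary X.
    if weight > tau:
        L[key] = weight
        return "added to L"
    X[key] = weight
    return "added to X"


# Hypothetical state: two keys in L, one key in T, threshold tau = 6.
L = {"a": 9.0, "d": 8.0}
T = {"b"}
X = {}
tau = 6.0
actions = [
    route_data_point(L, T, X, tau, "a", 1.0),
    route_data_point(L, T, X, tau, "b", 2.0),
    route_data_point(L, T, X, tau, "e", 7.0),
    route_data_point(L, T, X, tau, "f", 1.0),
]
```

After these four calls the reservoirs together hold more than k keys, which is the situation the remaining tests are described as resolving.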
- The foregoing method may be used to summarize separate data streams, and their summaries may in turn be summarized using the same method.
- Our method supports multiple weight functions. These occur naturally in some contexts (e.g. number and total bytes of a set of packets). They may also be used for derived quantities, such as estimates of summary variance, which can be propagated up the IFT.
- We compared our method to state-of-the-art methods, both those applicable to general IFTs and those applicable only to data streams. We found that our methods produce more accurate summaries for a given size, typically with a reduction in variance.
- Our method performed very close to the (unattainable) variance optimality, making it a practically optimal summarization scheme for unaggregated data.
- Lastly, our method is efficient, using only O(log k) amortized time per step; in practice it is much faster, taking constant time per step on non-pathological sequences.
- The summarization method for unaggregated data of the present invention provides a summarization that is a composable operator, and as such, is applicable in a scalable way to distributed data and data streams. The summaries support unbiased estimates of the weight of subpopulations of keys specified using arbitrary selection predicates and have the strong theoretical property that the variance approaches the minimum possible if the data set is “more aggregated.”
- The main benefit of the present method is that it provides much more effective summaries for a given allocated size than all previous summarization methods for an important class of applications. These applications include IP packet streams, where each IP flow occurs as multiple interleaving packets, distributed data streams produced by events registered by sensor networks, and Web page or multimedia requests to content distribution servers.
- These and other objects, advantages and features of the invention are set forth in the attached description.
- The foregoing summary of the invention, as well as the following detailed description of the preferred embodiments, is better understood when read in conjunction with the accompanying drawings, which are included by way of example and not by way of limitation with regard to the claimed invention:
- FIG. 1 shows an Information Flow Tree corresponding to data stream model constraints.
- FIG. 2 shows an Information Flow Tree of an aggregation of multiple distributed data streams.
- FIG. 3 shows an Information Flow Tree for an aggregation of data from multiple servers sending data to a central server.
- FIG. 4 is a flowchart showing a summarization method performed according to the invention.
- FIGS. 5-8 are flowcharts showing in greater detail certain aspects of the method according to the invention.
- Information-flow trees (IFTs) are graphic tools that may be used to represent both (a) the operations performed and (b) the constraints these operations are subjected to when summarizing an unaggregated data set. An IFT is a rooted tree with a data point at each leaf node (the input). Edges are directed from children to parent nodes and have associated capacities that capture storage or communication constraints that are imposed by the computational setup. Information is passed bottom-up from children nodes to parent nodes, subjected to capacity constraints on edges and processing constraints at each internal node. The processing constraints at each internal node can also be modeled using an IFT (this makes it a recursive structure). The information contained at each node depends only on the data points at descendant leaves.
- The constraints imposed by the data stream model are captured by the IFT1 shown in FIG. 1, which is in the form of a time-advancing path with a single leaf (data point) hanging off each node n. All edges e have the same capacity, which corresponds to the storage limitation when processing the stream. The summarization of data in an exemplary data stream is represented by IFT1 as follows: As shown in FIG. 1, a stream S of data points (k,w) arrives over time with keys a, b, c, d and weights w given by Arabic numbers. The set of data point examples shown in FIG. 1 is given by (a,2), (c,1), etc. The weighted set of this stream is w(a)=2+3+4=9; w(b)=2+1=3; w(c)=1+2=3; w(d)=2+6=8. The nodes n in IFT1 output a summary of their prefix. For example, the third node n3 outputs a summary with w(a)=6; w(b)=2 and w(c)=1. The root node nr provides a summary of the weighted set.
- An information flow tree IFT2 for summarization of multiple distributed data streams S1, S2, etc., over some communication network is illustrated in FIG. 2. Edge capacities at each “stream” module S1, S2, etc., capture storage constraints and other edge capacities capture network bandwidth constraints. As shown in FIG. 2, the root node NR of IFT2 outputs a summary of all of the data streams S1, S2, etc., to a processor, e.g., in a server, to perform analysis of the data.
- FIG. 3 illustrates an information flow tree IFT3 for summarization performed by multiple servers R1, R2, etc., each summarizing a part of the data and sending a summary of its part to a single central server Rc, which produces a summary of the full data set. The IFTs of FIGS. 1-3 may also be used to capture constraints of using multiple parallel processors: The data set is partitioned among separate processors, each processor produces a summary of its own chunk of the data, and these summaries are combined to produce the final output summary.
- The summarization methods that take place in IFT1, IFT2 and IFT3 according to the present invention are arranged to summarize unaggregated data subject to the constraints noted above using adjusted-weight summarization, and to use merging and addition steps that advantageously preserve desirable data qualities to provide a resulting data summary in a form that allows us to answer approximate queries with respect to the data most accurately.
- Theoretical Background
- To understand the data qualities that the present invention seeks to obtain, and how the summarization methods of the present invention achieve these qualities, some definition of terminology and some background explanation is necessary with respect to adjusted-weight summaries and their variances.
- A weight assignment w: U → ℝ≥0 is a function that maps all keys in some universe U to non-negative real numbers. There is a bijection between weight assignments and corresponding weighted sets, and we use these terms interchangeably.
- The weighted set that corresponds to a weight assignment w is the pair (I,w), where I≡I(w)⊂U is the set of keys with strictly positive weights. (Thus, w is defined for all possible keys (the universe U) but requires explicit representation only for I.)
- A data point (i, x) corresponds to a weight assignment w such that w(i)=x and w(j)=0 for j≠i.
- In the following description, we include various definitions, theorems, and lemmas, but for simplicity have omitted proofs.
-
DEFINITION 1. An adjusted-weight summary (AW-summary) of a weight assignment w is a random weight assignment A such that for any key i ∈ U, E[A(i)]=w(i).
- AW-summaries support estimating the weight of arbitrary subpopulations: For any subpopulation J⊂U,
- $A(J) \equiv \sum_{i \in J} A(i)$
- is an unbiased estimate of w(J). Note that the estimate is obtained by applying the selection predicate only to keys that are included in the summary A and adding up the adjusted weights of keys that satisfy the predicate.
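By way of illustration only, the subpopulation estimate described above can be sketched in Python as follows. The function name `estimate_subpopulation`, the dictionary representation of a summary, and the sample adjusted weights are assumptions made for this sketch and are not taken from the figures.

```python
def estimate_subpopulation(summary, predicate):
    """Sum the adjusted weights of the summarized keys that satisfy the predicate.

    `summary` maps each key kept in the AW-summary to its adjusted weight;
    keys that are absent from the summary implicitly have adjusted weight 0.
    """
    return sum(weight for key, weight in summary.items() if predicate(key))


# Hypothetical summary holding three keys with their adjusted weights.
summary = {"a": 9.0, "d": 8.0, "b": 6.0}

# Estimate the weight of the subpopulation J = {a, b, c}; key c is not in the
# summary, so it contributes nothing to the estimate.
estimate = estimate_subpopulation(summary, lambda key: key in {"a", "b", "c"})
```

The predicate is evaluated only on keys present in the summary, which is why the estimate can be computed for subpopulations that were not known when the summary was produced.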
- Different AW-summaries of the same weighted set are compared based on their size and estimation quality. The size of a summary is the number of keys with positive adjusted weights. The average size of an AW-summary is E[|{i|A(i)>0}|]. An AW-summary has a fixed size k if it assigns positive adjusted weight to exactly k keys.
- Variance is the standard metric for the quality of an estimator for a single quantity, such as the weight of a particular subpopulation. In particular, the variance of A(i) (the adjusted weight assigned to a key i under AW-summary A) is
- $\mathrm{VAR}_A[i] \equiv \mathrm{VAR}[A(i)] = E[(A(i)-w(i))^2] = E[A(i)^2] - w(i)^2$
- and the covariance of A(i) and A(j) is
- $\mathrm{COV}_A[i,j] \equiv \mathrm{COV}[A(i),A(j)] = E[A(i)A(j)] - w(i)w(j).$
- The variance for a particular subpopulation J is equal to
- $\mathrm{VAR}\left[\sum_{i \in J} A(i)\right] = \sum_{i,j \in J} \mathrm{COV}_A[i,j] = \sum_{i \in J} \mathrm{VAR}_A[i] + \sum_{i,j \in J,\ i \neq j} \mathrm{COV}_A[i,j].$
- Since AW-summaries are used for arbitrary subpopulations that are not specified a priori, the notion of a good metric is more subtle. There is generally no single AW-summary that dominates all others of the same size on all subpopulations (it is very easy to construct AW-summaries that have zero variance on any one subpopulation but are very bad otherwise).
- The average variance over subpopulations of certain weight or size was considered by M. Szegedy and M. Thorup (ON THE VARIANCE OF SUBSET SUM ESTIMATION, In Proc. 15th ESA, LNCS 4698, pages 75-86, 2007), who showed that, for any subpopulation size, it is simply a linear combination of two quantities: the sum of per-key variances $\Sigma V[A] \equiv \sum_{i \in I} \mathrm{VAR}_A[i]$ and the variance of the sum $V\Sigma[A] \equiv \mathrm{VAR}\left[\sum_{i \in I} A(i)\right]$.
- An AW-summary preserves total weight if $\sum_{i \in I} A(i) = w(I)$. (Therefore VΣ[A]=0 and is minimized.) The average variance, among AW-summaries that preserve total weight, is minimized when ΣV[A] is minimized. For two total-preserving AW-summaries A1 and A2 of the same weighted set, the ratio of the average variance over any subpopulation size is ΣV[A1]/ΣV[A2].
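As a small worked illustration of the two quantities ΣV and VΣ, the following Python sketch computes them exactly for an AW-summary given as an explicit list of outcomes. The function name `variance_metrics`, the outcome representation, and the two-outcome example distribution are assumptions chosen for this sketch; the per-key second-moment formula used is valid only because the example summary is unbiased.

```python
def variance_metrics(outcomes, weights):
    """Return (sum of per-key variances, variance of the sum).

    `outcomes` is a list of (probability, summary) pairs, where each summary
    maps the keys it keeps to their adjusted weights. `weights` maps every key
    to its true weight w(i). The summary is assumed unbiased, E[A(i)] = w(i),
    so that VAR_A[i] = E[A(i)^2] - w(i)^2.
    """
    sigma_v = 0.0
    for key, w in weights.items():
        second_moment = sum(p * s.get(key, 0.0) ** 2 for p, s in outcomes)
        sigma_v += second_moment - w ** 2
    mean_total = sum(p * sum(s.values()) for p, s in outcomes)
    second_total = sum(p * sum(s.values()) ** 2 for p, s in outcomes)
    v_sigma = second_total - mean_total ** 2
    return sigma_v, v_sigma


# True weights of four keys, and a hypothetical size-3 summary that keeps
# a and d unchanged and keeps exactly one of b, c with adjusted weight 6.
weights = {"a": 9.0, "b": 3.0, "c": 3.0, "d": 8.0}
outcomes = [
    (0.5, {"a": 9.0, "d": 8.0, "b": 6.0}),
    (0.5, {"a": 9.0, "d": 8.0, "c": 6.0}),
]
sigma_v, v_sigma = variance_metrics(outcomes, weights)
```

In this example every outcome has total adjusted weight 23, so the summary preserves total weight and VΣ is zero, while keys b and c each contribute a variance of 9 to ΣV.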
- In practice, average variance is an insufficient measure, as we need to be able to bound the variance on arbitrary subpopulations (avoid pathological cases) and obtain confidence intervals. Therefore, this metric is complemented by limiting the covariance structure so that the variance over subpopulations is more “balanced.” An AW-summary A has non-positive covariances if for every two keys i≠j, COVA[i, j]≦0 (equivalently, E[A(i)A(j)]≦w(i)w(j)). We similarly say that A has zero covariances if for every two keys i≠j, COVA[i, j]=0. A case for the combined properties of total preserving and non-positive covariances was made in E. Cohen and H. Kaplan, TIGHTER ESTIMATION USING BOTTOM-K SKETCHES (In Proceedings of the 34th VLDB Conference, 2008).
- Combining the above desirable properties, we say that an AW-summary is optimal if ΣV is minimized, VΣ=0 (it is total preserving), it has non-positive covariances, and it has a fixed size. This combination of desirable properties dates back to A. B. Sunter, List sequential sampling with equal or unequal probabilities without replacement (Applied Statistics, 26:261-268, 1977), but was first realized by E. Cohen, N. Duffield, H. Kaplan, C. Lund, and M. Thorup, VARIANCE OPTIMAL SAMPLING BASED ESTIMATION OF SUBSET SUMS, In Proc. 20th ACM-SIAM Symposium on Discrete Algorithms, ACM-SIAM, 2009.
- All the AW-summarizations we propose and evaluate preserve total weight and have fixed size and non-positive covariances, and thus same-size summaries can be conveniently compared using the one-dimensional metric ΣV.
- An AW-summary has Horvitz-Thompson (HT) adjusted weights [see D. G. Horvitz and D. J. Thompson, A GENERALIZATION OF SAMPLING WITHOUT REPLACEMENT FROM A FINITE UNIVERSE, Journal of the American Statistical Association, 47(260):663-685, 1952] if a key i with weight w(i) that is included in the summary with probability p(i) is assigned an adjusted weight A(i)=w(i)/p(i). It is well known that the HT adjusted weights minimize the variance for each key, and therefore also minimize ΣV for a given set of inclusion probabilities p(i) (i ∈ I). It is also well known that the inclusion probabilities that minimize ΣV for a given average summary size are those where p(i) is proportional to w(i) (IPPS sampling, as described in S. Sampath, Sampling Theory and Methods, CRC Press, 2000, and R. Singh and N. S. Mangat, Elements of survey sampling, Springer-Verlag, New York, 1996). A recent result by the present inventors has established the existence of an optimal AW-summarization on aggregated data and provided an efficient algorithm for it (VAROPT), as described in the references cited previously. In the present invention, where we deal with unaggregated data, we show that no optimal summarization exists, but provide a near-optimal summarization algorithm for use on unaggregated data.
- The sum w=w1⊕w2 of two weight assignments w1 and w2 is a weight assignment defined by key-wise addition, w(i)=w1(i)+w2(i) (i ∈ U).
- For the sum (merge) of the corresponding weighted sets we use the notation
- $(I_1, w_1) \oplus (I_2, w_2) = (I_1 \cup I_2,\ w_1 \oplus w_2).$
- The definition naturally applies to the sum A1⊕A2 of random weight assignments A1 and A2 (and in particular also to AW-summaries), and extends to the sum of multiple weight assignments $w_1 \oplus w_2 \oplus \cdots \oplus w_h = \bigoplus_{j=1}^{h} w_j$. Observe that the sum operation is commutative.
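The key-wise sum (merge) of weighted sets defined above may be sketched as follows. The function name `merge` and the dictionary representation (explicit entries only for keys with positive weight) are assumptions of this sketch, and the two example sets are hypothetical.

```python
from collections import Counter


def merge(*weighted_sets):
    """Key-wise sum of weighted sets, each given as a dict key -> weight.

    Only keys with strictly positive total weight are represented explicitly,
    mirroring the pair (I, w) in which I is the set of positive-weight keys.
    """
    total = Counter()
    for weighted_set in weighted_sets:
        total.update(weighted_set)  # Counter.update adds the values per key
    return {key: weight for key, weight in total.items() if weight > 0}


# Two hypothetical weighted sets, for example summaries received from two children.
first = {"a": 2, "c": 1}
second = {"a": 3, "b": 2}
merged = merge(first, second)
```

Because addition of weights is commutative, `merge(first, second)` and `merge(second, first)` describe the same weighted set.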
- Some important properties, including being an AW-summary, are additive. Let wj (1≦j≦h) be weight assignments with respective AW-summaries Aj. Let $w = \bigoplus_{j=1}^{h} w_j$. Then two lemmas follow:
- LEMMA 1. The random weight assignment $A = \bigoplus_{j=1}^{h} A_j$ is an AW-summary of w.
- LEMMA 2. If the AW-summaries Aj are independent, the covariances are additive. That is, for any two keys i, i′ ∈ U,
- $\mathrm{COV}_A[i,i'] = \sum_{j=1}^{h} \mathrm{COV}_{A_j}[i,i'].$
- (Proofs of Lemmas 1 and 2 omitted.)
- The sum of AW-summaries preserves the non-positive covariances and zero covariances properties:
- COROLLARY 3. If the AW-summaries Aj are independent, then if Aj (j=1, . . . , h) have the non-positive covariances or the zero covariances properties then so does the AW-summary $\bigoplus_{j=1}^{h} A_j$. This follows because if all summands are zero or non-positive, so is their sum.
- COROLLARY 4. If the AW-summaries Aj are independent, then for each key i ∈ U,
- $\mathrm{VAR}_A[i] = \sum_{j=1}^{h} \mathrm{VAR}_{A_j}[i].$
- COROLLARY 5. If the AW-summaries Aj are independent, then
- $\Sigma V[A] = \sum_{j=1}^{h} \Sigma V[A_j].$
-
- We can now establish transitivity of AW-summary properties under the composition operation.
-
-
- (i) E[((w))(i)]=w(i) (i ε U)
- (ii) ∘ A is an AW-summary of w.
-
- Suppose A, B are AW-summaries of w with the property that E[B(i)|A]=A(i) for all i ε U. Then COVB[i, j|A] will denote the conditional covariance of B(i), B(j), i.e., conditioned on A. Set VARB[i|A]=COVB[i, i|A]. The following is a Law of Total (co)Variance for the present model.
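- LEMMA 8. For each pair of keys i, j ∈ U, $\mathrm{COV}_B[i,j] = E[\mathrm{COV}_B[i,j \mid A]] + \mathrm{COV}_A[i,j]$.
- COROLLARY 9. For each key i ∈ U, $\mathrm{VAR}_B[i] = E[\mathrm{VAR}_B[i \mid A]] + \mathrm{VAR}_A[i]$.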
-
-
- As described above with reference to
FIG. 1, in an IFT the output of each node is a weighted set that corresponds to an AW-summary of the data points below it. The input of each node is the unaggregated data set that corresponds to the union of the outputs of its children. For a fixed input, the output is an AW-summary of the sum of the weighted sets obtained from its children.
- The internal summarization at each node can be performed by first adding (merging) the weighted sets collected from its children and then applying an AW-summarization to the merged set that reduces it as needed to satisfy the capacity constraint on the edge directed to its parent. There may be internal IFT constraints at the node, however, that do not allow for efficiently merging the input sets: We may want to speed up the summarization process by partitioning the input among multiple processors, or the data may be stored in external memory, and at the extreme, if internal memory only suffices to store the output summary size, it may be preferable to process the concatenated inputs as an unaggregated stream.
- The additivity and transitivity properties of AW-summaries guarantee that if each basic summarization step at and below a node utilizes total-preserving and non-positive covariances AW-summarization, then the output of the node is also a total preserving and non-positive covariances AW-summary.
- Note that for this property to hold, the IFT structure does not have to be fixed. The IFT nodes represent operations on the data. The next operation (in the structure above the node) can depend on the output and the operation itself can depend on the input data points. For a certain data set, we can consider a family of such recursive IFTs (which allow, for example, for different arrival orders of data points or for variable size streams).
- In the summarization of an unaggregated data stream, a fixed-size summary S of size k is propagated from child to parent. Each parent node adds the new single data point (i′, w′) to the summary to obtain S′=S⊕{({i′}, w′)}. If S′ contains k+1 distinct items (that is, the key i′ does not appear in S), we apply an AW-summarization that reduces the summary from size k+1 back to size k.
- The basic building block of data stream summarization is an AW-summarization that inputs a weighted set of size k+1 and outputs a weighted set of size k (removes a single key).
- Interestingly, any AW-summary that produces a size k AW-summary from a size k+1 weighted set using HT adjusted weights has the non-positive covariances property:
- LEMMA 11. Consider an AW-summarization that for an input weighted set of size k+1 produces summaries of fixed size k (for inputs that are already of size k, it returns the input set) and uses the HT adjusted weights. This AW-summarization has non-positive covariances.
- Interestingly, there is a unique such AW-summarization that is also total-preserving and minimizes ΣV, which means it is locally optimal for this primitive. The scheme is L-VAROPTk (local application of VAROPT). We refer to an application of our summarization algebra on an unaggregated stream in conjunction with the L-VAROPTk primitive as SA-STREAM-VOPTk.
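As a minimal sketch of the size k+1 to size k reduction just described, the following Python function computes a threshold t, drops exactly one of the keys whose weight is below t, and assigns the surviving keys the adjusted weight max{w, t}. The function name `l_varopt_drop_one`, the sort-based search for the threshold, and the example weights are assumptions made for this sketch; it is intended only to illustrate the HT adjusted weights and the total-preserving property, not to reproduce the efficient implementation.

```python
import random


def l_varopt_drop_one(weights, rng=random):
    """Reduce a weighted set of k+1 keys to k keys; returns (summary, t).

    A key with weight w < t is dropped with probability 1 - w/t, so it is kept
    with probability w/t and its HT adjusted weight is w / (w/t) = t.
    Keys with weight >= t are always kept with their weight unchanged.
    """
    if len(weights) < 2:
        raise ValueError("need at least two keys to drop one")
    items = sorted(weights.items(), key=lambda kv: kv[1])
    small = list(items[:2])              # the two lightest keys are always "small"
    smallsum = small[0][1] + small[1][1]
    t = smallsum / (len(small) - 1)
    for key, w in items[2:]:
        if w >= t:                       # remaining keys are all at least t
            break
        small.append((key, w))
        smallsum += w
        t = smallsum / (len(small) - 1)
    # The drop probabilities 1 - w/t over the small keys sum to exactly 1.
    r = rng.random()
    dropped = small[-1][0]               # fallback guards against float rounding
    for key, w in small:
        r -= 1.0 - w / t
        if r < 0:
            dropped = key
            break
    summary = {key: max(w, t) for key, w in weights.items() if key != dropped}
    return summary, t


# Hypothetical input of k+1 = 4 keys; here t = 6 and one of b, c is dropped.
reduced, threshold = l_varopt_drop_one(
    {"a": 9.0, "b": 3.0, "c": 3.0, "d": 8.0}, random.Random(1))
```

In this example the total adjusted weight of the output is 23 regardless of which key is dropped, which illustrates the total-preserving property.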
- When the IFT constraints allow, instead of adding one data point at a time we can consider a sequence of batch additions (merges) of sets of data points followed by summarizations. The motivation for batch additions before summarizing is that we extend the local optimality (minimal ΣV) from being per data point to being per batch. Formally, for a weighted set (J,A) (representing the current summary) and data points (i1, w1), . . . , (ir, wr)
-
- The left hand side, by optimality of VAROPTk, is the minimum ΣV for size-k AW-summaries of the weighted set $(J, A) \oplus \bigoplus_{j=1}^{r} \{(i_j, w_j)\}$. The right hand side is another AW-summary of this weighted set. Concretely, consider a node that obtains multiple size-k summaries from its children, can internally store a size-k′ summary in memory (k′≧k), and outputs a size-k summary. If the number of distinct keys is at most k′, we should merge the input summaries before summarizing them to size k. If k′=k, we apply SA-STREAM-VOPTk on the concatenation of the inputs. Otherwise, we add data points until we have k′ distinct keys (this is effectively a partial merge), apply SA-STREAM-VOPTk′ to the remaining data points, and apply VAROPTk to the result.
- We refer to the generic application of our summarization algebra (arbitrary addition and summarization steps) with L-VAROPT as the summarization primitive as SA+VOPT. If the data happen to be aggregated and all intermediate summarizations allow summary size that is at least the output size, then SA+VOPT is an instance of VAROPT. Therefore, by leveraging VAROPT as a building block, ΣV gracefully converges to the optimal when the data is more aggregated and attains it if the data set happens to be aggregated.
- As typical for “online” problems, we can show that there is no IFT-constrained summarization algorithm of unaggregated data sets that minimizes ΣV. This is in contrast to aggregated data sets (where VAROPT minimizes ΣV).
- THEOREM 12. There is no AW-summarization algorithm for unaggregated streams that produces a fixed-size summary that minimizes ΣV. (Proof omitted.)
- Given Theorem 12, it is not very surprising that we could construct an example where SA-STREAM-VOPT has a slightly larger ΣV than ASH: consider a sequence of 7 packets p1, . . . , p7 where packets p1, p2 belong to flow f1, packets p3, p4 to f2 and packets p5, p6, p7 to f3. ΣV of VAROPT on this sequence is 8.4 and ΣV of ASH is 8.05. The optimal aggregated VAROPT has ΣV of 7.5 on this distribution. On the other hand, we constructed a family of unaggregated streams where ASH has larger ΣV by a logarithmic (in k) factor.
- We conclude the theoretical discussion with a conjecture. We define the competitive ratio of an AW-summarization as the worst-case ratio (over all applicable unaggregated input data sets) between ΣV and the minimum possible ΣV on the corresponding aggregated data for a summary of the same size. The competitive ratio of ASH is at least log k, whereas the worst example we could find for SA-STREAM-VOPTk (on a contrived family of sequences) was about 1.6. We conjecture that SA-STREAM-VOPTk advantageously has a small constant competitive ratio.
- When considering SA+VOPT on a data set and corresponding family of IFTs, we define k′ to be the smallest size of an intermediate summary on which L-VAROPT is applied (that is, the smallest i such that L-VAROPTi is used). We conjecture that the ratio of ΣV to ΣV [VAROPTk′] is bounded by a constant. In practice SA+VOPT is very close to optimal and outperforms all other algorithms.
-
FIG. 4 is a flowchart showing a method 100 according to the invention. It is possible to implement SA+VOPT with repetitive additions (merges) of data; however, this naive implementation of SA+VOPT is inefficient: If the output weighted set of each application of L-VAROPT is transferred as a list, then each L-VAROPTk application performed after addition of data points requires O(k) processing time. Similarly, without tuned data structures, the processing time of a merge (adding sets) depends on the sum of the sizes of the sets.
- Accordingly, in order to provide improved processing, the present invention is implemented in method 100, which maintains the summary in a tuned data structure that reduces worst-case per-data point processing to amortized O(log k). The implementation is fast and further benefits from the fact that the theoretical amortized O(log k) bound applies to worst-case distributions and arrangements of the data points. Performance on “real” sequences is closer to O(1) time per data point and holds for randomly permuted data points.
- In method 100, the input is an unaggregated stream of data points (a, w) where a is a key and w is a positive weight. The output is a summary A which is a sample of up to k keys. Each included key a has an adjusted weight ŵa. If a key is not in A its adjusted weight is 0. Method 100 proceeds using the summarization algebra described above, and thus the summary A has the advantageous properties that accompany its use.
- In
method 100, a threshold τ, initially set to 0, is calculated. The keys in A are partitioned into two sets or reservoirs L and T, each initially empty and populated with keys a in the data stream with adjusted weights ŵa as will be described below. In accordance with method 100, each a ∈ L has a weight wa≧τ. The set L is stored in a priority queue which always identifies the key with the smallest weight, min a∈L wa. As will be described below, when a new data point arrives, a determination is made whether to move the key with the smallest weight from set L. Each a ∈ T has a weight wa≦τ. The set T is stored as a prefix of an array of size k+1. For every a ∈ A, the adjusted weight is ŵa=max{τ,wa}. Thus ŵa=wa for a ∈ L while ŵa=τ for a ∈ T.
- Referring to FIG. 4, in step 102 the sets L and T and threshold τ are initialized. The threshold τ is set to 0, and sets L and T are initially empty sets, i.e., L←Ø, T←Ø, τ←0.
- In step 104, the set or reservoir L is populated with arriving data points until it contains k different keys. FIG. 5 shows the operation of step 104 in greater detail. As shown in FIG. 5, in step 200 a test is performed to determine if the number of keys in L is <k. If so, in step 202 a determination is made whether the key of the new data point is already in L (a ∈ L?). If so, in step 204 the weight w of the new data point is merged with the existing adjusted weight wa of the existing key to update the adjusted weight (wa←wa+w). If not, in step 206 the new data point is added to L and its adjusted weight (previously 0) is increased by w. Upon completion of either step, the method returns to FIG. 4.
- Returning to
FIG. 4, once set L is populated, in step 106 a determination is made whether the stream has ended (which may be determined by a further test, for example one based on elapsed time if the summary A is to be provided on an hourly or daily basis).
- If step 106 determines that the data stream has not ended, in step 108 new data point arrivals (the k+1 data points) are tested to determine whether their keys and weights are to be included in sets L or T, and whether other keys and weights are to be moved or removed in order to provide a sample with just k keys. The tests of step 108 are shown in greater detail in FIG. 6, to be described below. After the tests of step 108 are performed, in step 110 the threshold τ is updated and the method returns to step 106 to determine if the data stream has ended. If the stream has ended, in step 112 the existing contents of sets L and T are merged to form the summary A, with adjusted weights given by ŵa=wa for a ∈ L and ŵa=τ for a ∈ T. The output of the method 100 is summary A with these adjusted weights.
- Referring now to FIG. 6, the tests performed by step 108 are shown. In step 300 a determination is made whether the new data point a is already in summary A, and if so, in step 302 a determination is made whether the new data point a is in L. If it is already in L, then in step 304 the adjusted weight of a is updated by merging the weight w of the new data point with the existing adjusted weight of that key in L. If step 302 determines that the new data point a is not in L (which means it is in T because step 300 has determined that it is in A), then in step 306 the key a is moved from set T to set L, and it is given an adjusted weight of wa←τ+w in set L. It will be observed that because the key of the new data point is already in A, the steps 302-306 maintain the number of keys in A at k.
- If step 300 determines that the new data point is not in A, then the method proceeds to step 308 to apply a test to determine if keys with low adjusted weights in L are to be moved to T, and then to step 310 to apply a test to determine which key to remove to maintain the number of keys in A at k. The method then returns to the threshold updating step 110 in FIG. 4.
- Step 308 is shown in greater detail in FIG. 7, and step 310 in FIG. 8.
- Referring to
FIG. 7, step 308 receives a new data point that is not in A. In step 400, initial values are created for adjusted weight wa←w, for a new temporary reservoir or set X that is initially empty, and for a new variable smallsum←τ*|T|. Smallsum thus initially represents the total adjusted weight of keys in T. In step 402 a determination is made whether the weight of the new data point exceeds the threshold, i.e., w>τ, and if so in step 404 the new data point a is added to L with adjusted weight wa. If not, in step 406 the new data point a is added to the temporary reservoir X, and step 408 updates the variable smallsum←smallsum+w.
- Either branch then proceeds to step 410 to determine if a new minimum adjusted weight member of L should be moved to X.
- At the conclusion of step 410, one or more low adjusted weight keys in L will have been moved to the temporary reservoir X, and the method then proceeds to step 310 to determine which key to remove from T or X to maintain the number of keys in summary A at k.
- Step 310 is shown in
FIG. 8. The input to step 310 includes the updated set X and updated variable smallsum generated in step 308. In step 500 shown in FIG. 8, a variable t is set as t←smallsum/(|T|+|X|−1). In step 502, a random number r is generated, r ∈ U(0,1). Then in step 504 it is determined whether r<|T|(1−τ/t), and if so, in step 506 the random number r is used to find d in T such that d←⌊r/(1−τ/t)⌋. In step 508, T[d] is removed from T, and the total number of keys in summary A remains at k.
- If step 504 determines that it is not true that r<|T|(1−τ/t), then in step 510 r is updated as r←r−|T|(1−τ/t), and d is set as d←0. Then in step 512, while r>0, an X[d] is found such that r←r−(1−wX[d]/t), then d is updated as d←d+1, and in step 514 X[d] is removed from X, and the number of keys in summary A remains at k.
- The removal of a key from T in step 508 or the removal of a key from X in step 514 results from a random selection process of the keys in T or X, which by reason of their selection for placement in these sets have adjusted weights below the threshold τ, and thus their removal does not influence the more significant weights of keys included in L. The selection process is consistent with the HT conditions and preserves the quality of the sample A.
- After keys are removed from T in step 508 or from X in step 514, the method proceeds to step 516, where T is updated as T←T∪X. At this point step 310 is completed, and the method proceeds to step 110 of FIG. 4. Step 110 is also shown in FIG. 8, because the updating which takes place in step 110 uses items generated in step 310; namely, step 110 updates τ←t, where t is as given in step 500. After τ is updated, the method returns to step 106 in FIG. 4 to repeat the processing of new data points until the data stream ends.
- The method described above for summarizing unaggregated data in a stream has been evaluated in comparison to other previously-known methods. The results show that the method of the invention provides lower variance and tighter estimates than prior methods, and that it performs very close to the unattainable optimum available with aggregated data.
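The steps of method 100 described with reference to FIGS. 4-8 can be sketched in Python as shown below. This is an illustrative sketch under stated assumptions rather than a definitive implementation: the class name `StreamSummary`, the lazily cleaned binary heap used as the priority queue for L, and the swap-removal used for the array T are choices made for the sketch. In particular, the exact test applied in step 410 is not spelled out in the text above; the sketch assumes that the minimum-weight key of L is moved to X while smallsum ≥ (|T|+|X|−1) times that minimum weight. The arrival order of the example stream is hypothetical.

```python
import heapq
import itertools
import random


class StreamSummary:
    """Sketch of a size-k threshold summary over an unaggregated stream."""

    def __init__(self, k, rng=None):
        self.k = k
        self.tau = 0.0                 # threshold
        self.L = {}                    # key -> weight, every weight >= tau
        self.heap = []                 # (weight, seq, key); stale entries skipped lazily
        self.T = []                    # keys whose adjusted weight equals tau
        self.pos = {}                  # key -> index of that key in T
        self.rng = rng or random.Random()
        self._seq = itertools.count()  # tie-breaker so keys are never compared

    def _put_L(self, key, weight):
        self.L[key] = weight
        heapq.heappush(self.heap, (weight, next(self._seq), key))

    def _min_L(self):
        # Drop heap entries whose key left L or whose weight was since updated.
        while self.heap and self.L.get(self.heap[0][2]) != self.heap[0][0]:
            heapq.heappop(self.heap)
        return self.heap[0] if self.heap else None

    def _remove_T(self, d):
        # Swap-remove T[d] so that T stays a contiguous array prefix.
        key, last = self.T[d], self.T[-1]
        self.T[d] = last
        self.pos[last] = d
        self.T.pop()
        del self.pos[key]

    def add(self, key, w):
        if key in self.L:                        # key already in L: merge weight
            self._put_L(key, self.L[key] + w)
            return
        if key in self.pos:                      # key in T: move to L with tau + w
            self._remove_T(self.pos[key])
            self._put_L(key, self.tau + w)
            return
        if len(self.L) + len(self.T) < self.k:   # initial fill of L
            self._put_L(key, w)
            return
        X, smallsum = [], self.tau * len(self.T)
        if w > self.tau:
            self._put_L(key, w)
        else:
            X.append((key, w))
            smallsum += w
        while True:                              # assumed step-410 test
            top = self._min_L()
            if top is None or smallsum < (len(self.T) + len(X) - 1) * top[0]:
                break
            heapq.heappop(self.heap)
            del self.L[top[2]]
            X.append((top[2], top[0]))
            smallsum += top[0]
        t = smallsum / (len(self.T) + len(X) - 1)
        q = 1.0 - self.tau / t                   # drop probability of each key in T
        r = self.rng.random()
        if self.T and (not X or r < len(self.T) * q):
            self._remove_T(min(int(r / q), len(self.T) - 1))
        else:
            r -= len(self.T) * q
            d = 0
            for d, (_, wx) in enumerate(X):
                r -= 1.0 - wx / t
                if r < 0:
                    break
            X.pop(d)
        for kx, _ in X:                          # T <- T ∪ X
            self.pos[kx] = len(self.T)
            self.T.append(kx)
        self.tau = t                             # threshold update

    def summary(self):
        out = dict(self.L)
        out.update({key: self.tau for key in self.T})
        return out


# Hypothetical arrival order; only the per-key totals (9, 3, 3, 8) follow FIG. 1.
stream = [("a", 2), ("c", 1), ("b", 2), ("a", 4), ("d", 2),
          ("b", 1), ("c", 2), ("a", 3), ("d", 6)]
sketch = StreamSummary(k=3, rng=random.Random(7))
for key, w in stream:
    sketch.add(key, w)
result = sketch.summary()
```

Whatever random choices are made, the resulting summary holds k=3 keys and its adjusted weights add up to the total stream weight of 23, reflecting the fixed-size and total-preserving properties discussed above.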
- Thus, the invention describes a feature enabling unaggregated data to be summarized in situations where processing resources (storage, memory, time) are constrained. While the present invention has been described with reference to preferred and exemplary embodiments, it will be understood by those of ordinary skill in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In addition, many modifications may be made to adapt a particular situation to the teachings of the invention without departing from the scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiments disclosed, but that the invention include all embodiments falling within the scope of the appended claims.
Claims (14)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/653,831 US8195710B2 (en) | 2009-12-18 | 2009-12-18 | Method for summarizing data in unaggregated data streams |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/653,831 US8195710B2 (en) | 2009-12-18 | 2009-12-18 | Method for summarizing data in unaggregated data streams |
Publications (2)
Publication Number | Publication Date |
---|---|
US20110153554A1 true US20110153554A1 (en) | 2011-06-23 |
US8195710B2 US8195710B2 (en) | 2012-06-05 |
Family
ID=44152493
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/653,831 Active 2030-08-03 US8195710B2 (en) | 2009-12-18 | 2009-12-18 | Method for summarizing data in unaggregated data streams |
Country Status (1)
Country | Link |
---|---|
US (1) | US8195710B2 (en) |
Also Published As
Publication number | Publication date |
---|---|
US8195710B2 (en) | 2012-06-05 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: AT&T INTELLECTUAL PROPERTY I, L.P., NEVADA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:COHEN, EDITH;DUFFIELD, NICHOLAS;LUND, CARSTEN;AND OTHERS;SIGNING DATES FROM 20091216 TO 20091217;REEL/FRAME:023738/0145 |
|
AS | Assignment |
Owner name: AT&T INTELLECTUAL PROPERTY I, L.P., NEVADA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:KAPLAN, HAIM;REEL/FRAME:024082/0191 Effective date: 20100226 |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
FPAY | Fee payment |
Year of fee payment: 4 |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 8 |
|
FEPP | Fee payment procedure |
Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |